
Automatica

Volume 43, Issue 3, March 2007, Pages 473-481

Brief paper
Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control

https://doi.org/10.1016/j.automatica.2006.09.019

Abstract

In this paper, the optimal strategies for discrete-time linear system quadratic zero-sum games related to the H-infinity optimal control problem are solved in forward time without knowing the system dynamical matrices. The idea is to solve for an action-dependent value function Q(x,u,w) of the zero-sum game instead of solving for the state-dependent value function V(x), which satisfies a corresponding game algebraic Riccati equation (GARE). Since the state and action spaces are continuous, two action networks and one critic network are used and are adaptively tuned in forward time using adaptive critic methods. The result is a Q-learning approximate dynamic programming (ADP) model-free approach that solves the zero-sum game forward in time. It is shown that the critic converges to the game value function and the action networks converge to the Nash equilibrium of the game; proofs of convergence of the algorithm are given. It is proven that the algorithm is, in effect, a model-free iterative method for solving the GARE of the linear quadratic discrete-time zero-sum game. The effectiveness of the method is demonstrated by performing an H-infinity control autopilot design for an F-16 aircraft.
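For orientation, the GARE mentioned above can be written in its standard discrete-time zero-sum form (as in Başar & Bernhard, 1995), using the system matrices A, B, E, state weight R and attenuation level γ defined in Section 2; this is a reference statement of the known equation, not a new result of the paper:

$$
P = A^{\mathrm T} P A + R - \begin{bmatrix} A^{\mathrm T} P B & A^{\mathrm T} P E \end{bmatrix}
\begin{bmatrix} I + B^{\mathrm T} P B & B^{\mathrm T} P E \\ E^{\mathrm T} P B & E^{\mathrm T} P E - \gamma^{2} I \end{bmatrix}^{-1}
\begin{bmatrix} B^{\mathrm T} P A \\ E^{\mathrm T} P A \end{bmatrix},
$$

where $V^{*}(x) = x^{\mathrm T} P x$ is the game value and the saddle-point feedback gains of both players are read off from the same inverted block matrix.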

Introduction

This paper is concerned with the application of approximate dynamic programming (ADP) techniques to the discrete-time linear quadratic zero-sum game appearing in the H∞ optimal control problem (Başar & Bernhard, 1995), where the disturbance has finite energy. Approximate dynamic programming is an approach to solving dynamic programming problems; it was proposed by Werbos (1991), Barto, Sutton, and Anderson (1983), Howard (1960), Watkins (1989), Bertsekas and Tsitsiklis (1996), and others to solve optimal control problems forward in time. In ADP, one combines adaptive critics, a reinforcement learning technique, with dynamic programming.

Several approximate dynamic programming schemes appear in the literature. Howard (1960) proposed iterations in the policy space in the framework of stochastic decision theory. Bradtke, Ydstie, and Barto (1994) implemented a Q-learning policy iteration method for the discrete-time linear quadratic optimal control problem, whereas the present work is concerned with zero-sum games; in addition, exploration noise is handled differently here in order to obtain convergence to the solution of the associated game algebraic Riccati equation (GARE). Hagen and Krose (1998) discussed the relation between the Q-learning policy iteration method and model-based adaptive control with system identification. Werbos (1992) classified approximate dynamic programming approaches into four main schemes: heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), action dependent heuristic dynamic programming (ADHDP), also known as Q-learning (Watkins, 1989), and action dependent dual heuristic dynamic programming (ADDHP). Prokhorov and Wunsch (1997) developed further approximate dynamic programming schemes known as globalized DHP (GDHP) and ADGDHP. Landelius (1997) applied the HDP, DHP, ADHDP and ADDHP techniques to the discrete-time linear quadratic optimal control problem. The current status of work on approximate dynamic programming is given in Si, Barto, Powell, and Wunsch (2004). See also Bertsekas and Tsitsiklis (1996), He and Jagannathan (2005), Si and Wang (2001) and Cao (2002).

In this paper, Q-learning is used because, as will be seen, it allows model-free tuning of the action and critic networks; that is, the method does not require knowledge of the plant model. In Landelius (1997), no initial stabilizing control policy is required for the optimal control problem; however, the need for exploration noise is not studied.

This problem has been solved off-line using the dynamic programming principle (Başar & Bernhard, 1995; Başar & Olsder, 1999; Lewis, 1995). An off-line neural network policy iteration solution was given by Abu-Khalaf, Lewis, and Huang (2004) for the continuous-time case.

The importance of this paper stems from the fact that we propose game-theoretic adaptive critics that yield controllers which learn to co-exist with an L2-gain disturbance signal (Başar & Bernhard, 1995; Başar & Olsder, 1999). In control system design, this is a two-player zero-sum game problem that corresponds to the well-known H∞ control problem.

An H∞ autopilot design example for an F-16 aircraft is given to show the practical effectiveness of the ADP techniques.

Section snippets

Q-function setup for discrete-time linear quadratic zero-sum games

In this section, we formulate Bellman's optimality principle for the zero-sum game using the concept of Q-functions (Watkins, 1989; Werbos, 1990) instead of the standard value functions used elsewhere. Consider the discrete-time linear system

$$x_{k+1} = A x_k + B u_k + E w_k, \qquad y_k = x_k,$$

where $x \in \mathbb{R}^{n}$, $y \in \mathbb{R}^{p}$, $u_k \in \mathbb{R}^{m_1}$ is the control input and $w_k \in \mathbb{R}^{m_2}$ is the disturbance input. Also consider the infinite-horizon value function

$$V^{*}(x_k) = \min_{u_i} \max_{w_i} \sum_{i=k}^{\infty} \left[ x_i^{\mathrm T} R x_i + u_i^{\mathrm T} u_i - \gamma^{2} w_i^{\mathrm T} w_i \right]$$

for a prescribed fixed value of $\gamma$. In the …
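As a brief sketch of the construction that follows (assuming the standard quadratic Q-function for linear quadratic games, which is what the notation Q(x,u,w) in the abstract suggests), the action-dependent value function augments the value above with one freely chosen first step:

$$
Q^{*}(x_k,u_k,w_k) = x_k^{\mathrm T} R x_k + u_k^{\mathrm T} u_k - \gamma^{2} w_k^{\mathrm T} w_k + V^{*}(x_{k+1})
= \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^{\mathrm T} H \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix},
$$

so that $V^{*}(x_k) = \min_{u}\max_{w} Q^{*}(x_k,u,w)$, and the saddle-point policies are linear state feedbacks obtained from the stationarity conditions $\partial Q^{*}/\partial u = 0$ and $\partial Q^{*}/\partial w = 0$. The constant symmetric kernel $H$ is the quantity estimated by the critic in the next section.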

Model-free online tuning based on the Q-learning algorithm

In this section, we use the Q-function of Section 2 to develop a Q-learning algorithm that solves for the DT zero-sum game H matrix (the quadratic kernel of the Q-function) without requiring the system dynamical matrices. In the Q-learning approach, a parametric structure is used to approximate the Q-function of the current control policy; the certainty-equivalence principle is then used to improve the policies of the action networks.
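To make the data flow concrete, here is a minimal numerical sketch of this kind of model-free Q-learning for the linear quadratic zero-sum game, written in Python/NumPy. It is not the paper's adaptive-critic implementation: it uses a batch least-squares critic and a value-iteration-style target, the plant is reached only through a black-box step(x, u, w) call, and all function and parameter names (quad_basis, q_learning_zero_sum, n_iter, noise, and so on) are illustrative choices rather than the paper's notation.

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis phi(z): upper-triangular entries of z z^T, with
    off-diagonal terms doubled so that theta . phi(z) = z^T H z when
    theta stacks the upper-triangular entries of the symmetric kernel H."""
    n = z.size
    zz = np.outer(z, z)
    return np.array([zz[i, j] if i == j else 2.0 * zz[i, j]
                     for i in range(n) for j in range(i, n)])

def unpack_H(theta, n):
    """Rebuild the symmetric kernel H from its upper-triangular parameters."""
    H = np.zeros((n, n))
    H[np.triu_indices(n)] = theta
    return H + np.triu(H, 1).T

def policies_from_H(H, n, m1, m2):
    """Saddle-point feedback gains u = -K x, w = -L x from the blocks of H,
    obtained from the joint stationarity conditions dQ/du = 0, dQ/dw = 0."""
    Hux = H[n:n+m1, :n];  Huu = H[n:n+m1, n:n+m1];  Huw = H[n:n+m1, n+m1:]
    Hwx = H[n+m1:, :n];   Hwu = H[n+m1:, n:n+m1];   Hww = H[n+m1:, n+m1:]
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Hwu),
                        Hux - Huw @ np.linalg.solve(Hww, Hwx))
    L = np.linalg.solve(Hww - Hwu @ np.linalg.solve(Huu, Huw),
                        Hwx - Hwu @ np.linalg.solve(Huu, Hux))
    return K, L

def q_learning_zero_sum(step, n, m1, m2, R, gamma,
                        n_iter=40, n_samples=400, noise=0.5, seed=0):
    """Model-free Q-learning sketch for the DT LQ zero-sum game.
    `step(x, u, w)` returns x_{k+1}; the learner never sees A, B or E."""
    rng = np.random.default_rng(seed)
    nz = n + m1 + m2
    H = np.zeros((nz, nz))
    K, L = np.zeros((m1, n)), np.zeros((m2, n))
    for _ in range(n_iter):
        Phi, targets = [], []
        for _ in range(n_samples):
            x = rng.standard_normal(n)
            u = -K @ x + noise * rng.standard_normal(m1)   # exploration noise
            w = -L @ x + noise * rng.standard_normal(m2)
            x1 = step(x, u, w)
            z  = np.concatenate([x, u, w])
            z1 = np.concatenate([x1, -K @ x1, -L @ x1])    # successor actions follow current policies
            r  = x @ R @ x + u @ u - gamma**2 * (w @ w)    # stage cost of the game
            Phi.append(quad_basis(z))
            targets.append(r + z1 @ H @ z1)                # value-iteration-style target
        theta, *_ = np.linalg.lstsq(np.asarray(Phi), np.asarray(targets), rcond=None)
        H = unpack_H(theta, nz)                            # critic update
        K, L = policies_from_H(H, n, m1, m2)               # policy improvement for both players
    return H, K, L
```

Exploration noise is added to both players' actions so that the least-squares regression for the kernel H is well posed, which mirrors the role of excitation discussed in the paper.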

Online ADP H∞ autopilot controller design for an F-16 aircraft

H∞ controllers have proven highly effective for feedback control systems requiring robustness and disturbance rejection, in particular for F-16 aircraft autopilot design. The H∞ controller presented here is tuned online in a model-free fashion using the Q-learning method developed in this paper.

The F-16 short-period dynamics has three states, $x = [\alpha \;\; q \;\; \delta_e]^{\mathrm T}$, where $\alpha$ is the angle of attack, $q$ is the pitch rate and $\delta_e$ is the elevator deflection angle. The …
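As a usage illustration only, the learning routine sketched in the previous section could be driven through a black-box step function as below. The matrices, the weight R and the attenuation level γ are arbitrary placeholder values for a stable three-state plant; they are not the discretized F-16 short-period model used in the paper.

```python
import numpy as np
# Reuses q_learning_zero_sum from the sketch in the previous section.

# Placeholder discrete-time model with three states [alpha, q, delta_e],
# one control channel and one disturbance channel (illustrative numbers only).
A = np.array([[0.90, 0.08, -0.02],
              [0.04, 0.92, -0.10],
              [0.00, 0.00,  0.80]])
B = np.array([[0.0], [0.0], [0.2]])
E = np.array([[0.05], [0.02], [0.0]])
R, gamma = np.eye(3), 5.0

def step(x, u, w):
    # The learner interacts with the plant only through this call.
    return A @ x + B @ u + E @ w

H, K, L = q_learning_zero_sum(step, n=3, m1=1, m2=1, R=R, gamma=gamma)
print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```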

Conclusion

In this paper we introduced an online ADP technique based on Q-learning to solve the discrete-time zero-sum game problem with continuous state and action spaces. The derivation of the policies and the proof of convergence of the Q-learning algorithm are provided. In the Q-learning algorithm the system model is not needed to tune either the action networks or the critic network. The results of this paper can be summarized as a model-free approach to solving the linear quadratic discrete-time zero-sum game forward in time.


References (25)

  • Abu-Khalaf, M., Lewis, F. L., & Huang, J. (2004). Hamilton-Jacobi-Isaacs formulation for constrained input nonlinear...
  • Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics.
  • Başar, T., & Bernhard, P. (1995). H∞ optimal control and related minimax design problems.
  • Başar, T., & Olsder, G. J. (1999). Dynamic noncooperative game theory....
  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming.
  • Bradtke, S. J., Ydstie, B. E., & Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration....
  • Brewer, J. W. (1978). Kronecker products and matrix calculus in system theory. IEEE Transactions on Circuits and Systems.
  • Cao, X.-R. (2002). Learning and optimization—from a systems theoretic perspective. Proceedings of IEEE conference on...
  • Hagen, S., & Krose, B. (1998). Linear quadratic regulation using reinforcement learning. Belgian-Dutch conference on...
  • He, P., & Jagannathan, S. (2005). Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Transactions on Systems, Man, and Cybernetics, Part B.
  • Howard, R. (1960). Dynamic programming and Markov processes.
  • Jacobson, D. H. (1977). On values and strategies for infinite-time linear quadratic games. IEEE Transactions on Automatic Control.

Asma Al-Tamimi was born in Amman, Jordan in 1976. She did her high school studies at Ajnadeen high school in Zarqa. She received her Bachelor's degree in Electromechanical Engineering from Al-Balqa University in Amman, Jordan in 1999. She then joined The University of Texas at Arlington, from which she received the Master of Science in Electrical Engineering in 2003. She is currently working toward her PhD degree at The University of Texas at Arlington and is a research assistant at the Automation and Robotics Research Institute.

    Frank L. Lewis was born in Würzburg, Germany, subsequently studying in Chile and Gordonstoun School in Scotland. He obtained the Bachelor's Degree in Physics/Electrical Engineering and the Master's of Electrical Engineering Degree at Rice University in 1971. He spent six years in the U.S. Navy, serving as Navigator aboard the frigate USS Trippe (FF-1075), and Executive Officer and Acting Commanding Officer aboard USS Salinan (ATF-161). In 1977 he received the Master's of Science in Aeronautical Engineering from the University of West Florida. In 1981 he obtained the Ph.D. degree at The Georgia Institute of Technology in Atlanta, where he was employed as a professor from 1981 to 1990 and is currently an Adjunct Professor. He is a Professor of Electrical Engineering at The University of Texas at Arlington, where he was awarded the Moncrief-O’Donnell Endowed Chair in 1990 at the Automation and Robotics Research Institute. He is a Fellow of the IEEE, a member of the New York Academy of Sciences, and a registered Professional Engineer in the State of Texas. He is a Charter Member (2004) of the UTA Academy of Distinguished Scholars. He has served as Visiting Professor at Democritus University in Greece, Hong Kong University of Science and Technology, Chinese University of Hong Kong, National University of Singapore. He is an elected Guest Consulting Professor at both Shanghai Jiao Tong University and South China University of Technology. Dr. Lewis’ current interests include intelligent control, neural and fuzzy systems, microelectromechanical systems (MEMS), wireless sensor networks, nonlinear systems, robotics, condition-based maintenance, and manufacturing process control. He is the author/co-author of 3 U.S. patents, 157 journal papers, 23 chapters and encyclopedia articles, 239 refereed conference papers, nine books, including Optimal Control, Optimal Estimation, Applied Optimal Control and Estimation, Aircraft Control and Simulation, Control of Robot Manipulators, Neural Network Control, High-Level Feedback Control with Neural Networks and the IEEE reprint volume Robot Control. He was selected to the Editorial Boards of International Journal of Control, Neural Computing and Applications, and Int. J. Intelligent Control Systems. He served as an Editor for the flagship journal Automatica. He is the recipient of an NSF Research Initiation Grant and has been continuously funded by NSF since 1982. Since 1991 he has received $4.8 million in funding from NSF and other government agencies, including significant DoD SBIR and industry funding. His SBIR program was instrumental in ARRI's receipt of the SBA Tibbets Award in 1996. He has received a Fulbright Research Award, the American Society of Engineering Education F.E. Terman Award, three Sigma Xi Research Awards, the UTA Halliburton Engineering Research Award, the UTA University-Wide Distinguished Research Award, the ARRI Patent Award, various Best Paper Awards, the IEEE Control Systems Society Best Chapter Award (as Founding Chairman), and the National Sigma Xi Award for Outstanding Chapter (as President). He was selected as Engineer of the year in 1994 by the Ft. Worth IEEE Section. He was appointed to the NAE Committee on Space Station in 1995 and to the IEEE Control Systems Society Board of Governors in 1996. In 1998 he was selected as an IEEE Control Systems Society Distinguished Lecturer. He is a Founding Member of the Board of Governors of the Mediterranean Control Association.

Murad Abu-Khalaf was born in Jerusalem, Palestine in 1977. He obtained his B.S. in Electronics and Electrical Engineering from Boğaziçi University in Istanbul, Turkey in 1998, and the M.S. and Ph.D. in Electrical Engineering from The University of Texas at Arlington in 2000 and 2005, respectively. His research interests are in the areas of nonlinear control, optimal control, neural network control, and adaptive intelligent systems. He is the author/co-author of one book, two book chapters, 8 journal papers and 15 refereed conference proceedings. He is a member of the IEEE and of the Eta Kappa Nu honor society, and is listed in Who's Who in America.

This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Derong Liu under the direction of Editor M. Krstic. This research was supported by National Science Foundation grant ECS-0501451 and Army Research Office grant W91NF-05-1-0314.
