Brief paper: Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control☆
Introduction
This paper is concerned with the application of approximate dynamic programming (ADP) techniques to the discrete-time linear quadratic zero-sum game that appears in the H-infinity optimal control problem (Başar & Bernhard, 1995), where the disturbance has finite energy. Approximate dynamic programming is an approach to solving dynamic programming problems; it was proposed by Werbos (1991), Barto, Sutton, and Anderson (1983), Howard (1960), Watkins (1989), Bertsekas and Tsitsiklis (1996), and others to solve optimal control problems forward in time. In ADP, one combines adaptive critics, a reinforcement learning technique, with dynamic programming.
Several approximate dynamic programming schemes appear in the literature. Howard (1960) proposed iterations in the policy space in the framework of stochastic decision theory. Bradtke, Ydstie, and Barto (1994) implemented a Q-learning policy iteration method for the discrete-time linear quadratic optimal control problem, while ours is concerned with zero-sum games. In addition, the way we handle exploration noise is different, in order to obtain convergence results for the associated game algebraic Riccati equation (GARE). Hagen and Krose (1998) discussed the relation between the Q-learning policy iteration method and model-based adaptive control with system identification. Werbos (1992) classified approximate dynamic programming approaches into four main schemes: heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), action dependent heuristic dynamic programming (ADHDP), also known as Q-learning (Watkins, 1989), and action dependent dual heuristic dynamic programming (ADDHP). Prokhorov and Wunsch (1997) developed new approximate dynamic programming schemes known as globalized-DHP (GDHP) and ADGDHP. Landelius (1997) applied HDP, DHP, ADHDP and ADDHP techniques to the discrete-time linear quadratic optimal control problem. The current status of work on approximate dynamic programming is given in Si, Barto, Powell, and Wunsch (2004). See also Bertsekas and Tsitsiklis (1996), He and Jagannathan (2005), Si and Wang (2001), and Cao (2002).
In this paper, Q-learning is used since, as will be seen, it allows model-free tuning of the action and critic networks; that is, the method does not require knowledge of the plant model. In Landelius (1997), no initial stabilizing control policy is required for the optimal control problem; however, the role of exploration noise is not studied there.
This problem has been solved off-line using the dynamic programming principle (Başar & Bernhard, 1995; Başar & Olsder, 1999; Lewis, 1995). An off-line neural-net policy iteration solution was given by Abu-Khalaf, Lewis, and Huang (2004) for the continuous-time case.
The importance of this paper stems from the fact that we propose game-theoretic adaptive critics that create controllers that learn to co-exist with an L2-gain bounded disturbance signal (Başar & Bernhard, 1995; Başar & Olsder, 1999). In control system design, this is a two-player zero-sum game problem that corresponds to the well-known H-infinity control problem.
An H-infinity control F-16 aircraft autopilot design example is given to show the practical effectiveness of the ADP techniques.
Section snippets
Q-function setup for discrete-time linear quadratic zero-sum games
In this section, we formulate Bellman's optimality principle for the zero-sum game using the concept of Q-functions (Watkins, 1989; Werbos, 1990) instead of the standard value functions used elsewhere. Consider the discrete-time linear system x_{k+1} = A x_k + B u_k + E w_k, where x_k is the state, u_k is the control input and w_k is the disturbance input. Also consider the infinite-horizon value function V(x_k) = Σ_{i=k}^∞ (x_i^T Q x_i + u_i^T R u_i − γ² w_i^T w_i) for a prescribed fixed value of the attenuation level γ. …
Model-free online tuning based on the Q-learning algorithm
In this section, we use the Q-function of Section 2 to develop a Q-learning algorithm that solves for the DT zero-sum game matrix H without requiring the system dynamics matrices. In the Q-learning approach, a parametric structure is used to approximate the Q-function of the current control policy. Then the certainty-equivalence principle is used to improve the policy of the action network.
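This idea can be sketched for a scalar plant as follows. This is a simplified, value-iteration-style illustration, not the paper's exact algorithm: the assumed dynamics (a, b, e) are used only to generate data, never inside the learning update, which sees only samples (x, u, w, x_next); the Q-function Q(x,u,w) = z^T H z is fitted by least squares over a quadratic basis, and both players' policies are then improved from the blocks of H.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar example (not the paper's system). The plant below is
# used ONLY to generate data -- the model-free update never reads a, b, e.
a, b, e = 0.8, 1.0, 1.0          # plant: x_next = a*x + b*u + e*w
q, r, gamma = 1.0, 1.0, 5.0      # state/control weights, attenuation level

def phi(x, u, w):
    """Quadratic basis: independent entries of z z' for z = [x, u, w]."""
    z = np.array([x, u, w])
    outer = np.outer(z, z)
    i, j = np.triu_indices(3)
    return np.where(i == j, 1.0, 2.0) * outer[i, j]   # off-diagonals count twice

H = np.zeros((3, 3))             # Q-function estimate: Q(x,u,w) = z' H z
L, K = 0.0, 0.0                  # current policies: u = -L*x, w = K*x

for sweep in range(12):
    Phi, targets = [], []
    for _ in range(60):
        x = rng.normal()
        u = -L * x + 0.1 * rng.normal()       # exploration noise on both players
        w = K * x + 0.1 * rng.normal()
        xn = a * x + b * u + e * w            # data generation only
        zn = np.array([xn, -L * xn, K * xn])  # successor under current policies
        cost = q * x**2 + r * u**2 - gamma**2 * w**2
        Phi.append(phi(x, u, w))
        targets.append(cost + zn @ H @ zn)    # one-step temporal-difference target
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)
    Hu = np.zeros((3, 3))
    Hu[np.triu_indices(3)] = theta
    H = Hu + Hu.T - np.diag(np.diag(Hu))      # rebuild symmetric H
    # Certainty-equivalence improvement: stationarity of z'Hz in (u, w)
    sol = np.linalg.solve(H[1:, 1:], -H[1:, 0])
    L, K = -sol[0], sol[1]

print(L, K)
```

Because the true Q-function is exactly quadratic, the least-squares fit is exact given enough independent samples; the exploration noise on both players is what keeps the regression well conditioned.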
Online ADP H-infinity autopilot controller design for an F-16 aircraft
H-infinity controllers have proven highly effective in the design of feedback control systems with robustness and disturbance rejection capabilities for F-16 aircraft autopilot design. The presented controller design is a model-free online tuning design based on the Q-learning method presented in this paper.
The F-16 short period dynamics has three states given as x = [α q δe]^T, where α is the angle of attack, q is the pitch rate and δe is the elevator deflection angle. …
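As a rough illustration of this plant structure only, the snippet below simulates a hypothetical stable discretized short-period model with state [α, q, δe] driven by a decaying gust; the numerical matrices are placeholders chosen to mimic a sampled-data short-period mode, not the paper's data.

```python
import numpy as np

# Illustrative discretized short-period matrices (hypothetical numbers,
# not the paper's model). State x = [alpha, q, delta_e].
A = np.array([[0.9065, 0.0816, -0.0005],
              [0.0741, 0.9012, -0.0007],
              [0.0,    0.0,     0.1327]])
B = np.array([[-0.0015], [-0.0096], [0.8673]])  # elevator command channel
E = np.array([[0.0095],  [0.0004],  [0.0]])     # wind-gust disturbance channel

x = np.array([[0.1], [0.0], [0.0]])   # 0.1 rad initial angle-of-attack offset
for k in range(50):
    u = np.zeros((1, 1))                        # open loop (no autopilot yet)
    w = np.array([[0.01 * np.exp(-0.1 * k)]])   # decaying gust disturbance
    x = A @ x + B @ u + E @ w
print(x.ravel())
```

With these placeholder matrices the open-loop short-period mode is stable, so the state decays once the gust dies out; the ADP design in the paper closes the loop through the elevator channel instead of leaving u at zero.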
Conclusion
In this paper we introduced an online ADP technique based on Q-learning to solve the discrete-time zero-sum game problem with continuous state and action spaces. The derivation of the policies and the convergence of the Q-learning algorithm are provided. In the Q-learning algorithm, the system model is not needed to tune either the action networks or the critic network. The results of this paper can be summarized as a model-free approach to solving the linear quadratic discrete-time zero-sum game forward in time.
References (25)
- Abu-Khalaf, M., Lewis, F.L., & Huang, J. (2004). Hamilton-Jacobi-Isaacs formulation for constrained input nonlinear...
- Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics.
- Başar, T., & Bernhard, P. (1995). H-infinity optimal control and related minimax design problems.
- Başar, T., & Olsder, G.J. (1999). Dynamic noncooperative game theory....
- Bertsekas, D.P., & Tsitsiklis, J.N. (1996). Neuro-dynamic programming.
- Bradtke, S.J., Ydstie, B.E., & Barto, A.G. (1994). Adaptive linear quadratic control using policy iteration....
- Brewer, J.W. (1978). Kronecker products and matrix calculus in system theory. IEEE Transactions on Circuits and Systems.
- Cao, Xi.-R. (2002). Learning and optimization—from a systems theoretic perspective. Proceedings of IEEE conference on...
- Hagen, S., & Krose, B. (1998). Linear quadratic regulation using reinforcement learning. Belgian-Dutch conference on...
- He, P., & Jagannathan, S. (2005). Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Transactions on Systems, Man and Cybernetics—Part B.
- Howard, R.A. (1960). Dynamic programming and Markov processes.
- On values and strategies for infinite-time linear quadratic games. IEEE Transactions on Automatic Control.
Cited by (465)
- Zero-sum game-based optimal control for discrete-time Markov jump systems: A parallel off-policy Q-learning method. Applied Mathematics and Computation, 2024.
- Data driven secure control for cyber–physical systems under hybrid attacks: A Stackelberg game approach. Journal of the Franklin Institute, 2024.
- Optimal fuzzy output feedback tracking control for unmanned surface vehicles systems. Ocean Engineering, 2024.
- Value iteration for LQR control of unknown stochastic-parameter linear systems. Systems and Control Letters, 2024.
Asma Al-Tamimi was born in Amman, Jordan in 1976. She did her high school studies at Ajnadeen high school in Zarqa. She received her Bachelor's Degree in Electromechanical Engineering from Al-Balqa University in Amman, Jordan in 1999. She then joined The University of Texas at Arlington from which she received the Master's of Science in Electrical Engineering in 2003. Currently she is working on her PhD degree at The University of Texas at Arlington and working as a research assistant at the Automation and Robotics Research Institute.
Frank L. Lewis was born in Würzburg, Germany, subsequently studying in Chile and Gordonstoun School in Scotland. He obtained the Bachelor's Degree in Physics/Electrical Engineering and the Master's of Electrical Engineering Degree at Rice University in 1971. He spent six years in the U.S. Navy, serving as Navigator aboard the frigate USS Trippe (FF-1075), and Executive Officer and Acting Commanding Officer aboard USS Salinan (ATF-161). In 1977 he received the Master's of Science in Aeronautical Engineering from the University of West Florida. In 1981 he obtained the Ph.D. degree at The Georgia Institute of Technology in Atlanta, where he was employed as a professor from 1981 to 1990 and is currently an Adjunct Professor. He is a Professor of Electrical Engineering at The University of Texas at Arlington, where he was awarded the Moncrief-O’Donnell Endowed Chair in 1990 at the Automation and Robotics Research Institute. He is a Fellow of the IEEE, a member of the New York Academy of Sciences, and a registered Professional Engineer in the State of Texas. He is a Charter Member (2004) of the UTA Academy of Distinguished Scholars. He has served as Visiting Professor at Democritus University in Greece, Hong Kong University of Science and Technology, Chinese University of Hong Kong, National University of Singapore. He is an elected Guest Consulting Professor at both Shanghai Jiao Tong University and South China University of Technology. Dr. Lewis’ current interests include intelligent control, neural and fuzzy systems, microelectromechanical systems (MEMS), wireless sensor networks, nonlinear systems, robotics, condition-based maintenance, and manufacturing process control. He is the author/co-author of 3 U.S. 
patents, 157 journal papers, 23 chapters and encyclopedia articles, 239 refereed conference papers, nine books, including Optimal Control, Optimal Estimation, Applied Optimal Control and Estimation, Aircraft Control and Simulation, Control of Robot Manipulators, Neural Network Control, High-Level Feedback Control with Neural Networks and the IEEE reprint volume Robot Control. He was selected to the Editorial Boards of International Journal of Control, Neural Computing and Applications, and Int. J. Intelligent Control Systems. He served as an Editor for the flagship journal Automatica. He is the recipient of an NSF Research Initiation Grant and has been continuously funded by NSF since 1982. Since 1991 he has received $4.8 million in funding from NSF and other government agencies, including significant DoD SBIR and industry funding. His SBIR program was instrumental in ARRI's receipt of the SBA Tibbets Award in 1996. He has received a Fulbright Research Award, the American Society of Engineering Education F.E. Terman Award, three Sigma Xi Research Awards, the UTA Halliburton Engineering Research Award, the UTA University-Wide Distinguished Research Award, the ARRI Patent Award, various Best Paper Awards, the IEEE Control Systems Society Best Chapter Award (as Founding Chairman), and the National Sigma Xi Award for Outstanding Chapter (as President). He was selected as Engineer of the year in 1994 by the Ft. Worth IEEE Section. He was appointed to the NAE Committee on Space Station in 1995 and to the IEEE Control Systems Society Board of Governors in 1996. In 1998 he was selected as an IEEE Control Systems Society Distinguished Lecturer. He is a Founding Member of the Board of Governors of the Mediterranean Control Association.
Murad Abu-Khalaf was born in Jerusalem, Palestine in 1977. He obtained his B.S. in Electronics and Electrical Engineering from Boğaziçi University in Istanbul, Turkey in 1998, and the M.S. and Ph.D. in Electrical Engineering from The University of Texas at Arlington in 2000 and 2005, respectively. His research interests are in the areas of nonlinear control, optimal control, neural network control, and adaptive intelligent systems. He is the author/co-author of one book, two book chapters, 8 journal papers and 15 refereed conference proceedings. He is a member of the IEEE and of the Eta Kappa Nu honor society, and is listed in Who's Who in America.
- ☆
This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Derong Liu under the direction of Editor M. Krstic. This research was supported by National Science Foundation grant ECS-0501451 and Army Research Office grant W91NF-05-1-0314.