Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton–Jacobi equations☆
Introduction
Game theory (Tijs, 2003) has been very successful in modeling strategic behavior, where the outcome for each player depends on the actions of all the players. Each player chooses a control to minimize his own performance objective independently of the others, and no player has knowledge of the other players' strategies. Many applications of optimization theory require the solution of coupled Hamilton–Jacobi equations (Başar and Olsder, 1999; Freiling et al., 2002). In N-player games, the Nash equilibrium depends on N Hamilton–Jacobi equations coupled through their quadratic terms (Freiling et al., 2002; Gajic and Li, 1988). Each dynamic game consists of three parts: (i) the players; (ii) the actions available to each player; (iii) a cost for each player that depends on the actions of all the players.
Multi-player non-zero-sum games rely on solving the coupled Hamilton–Jacobi (HJ) equations, which in the linear quadratic case reduce to the coupled algebraic Riccati equations (Abou-Kandil et al., 2003; Freiling et al., 2002; Gajic and Li, 1988). Solution methods are generally offline and generate fixed control policies that are then implemented in online controllers in real time. In the nonlinear case the coupled HJ equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases (e.g. a scalar system, bilinear in input and state; see Sontag's discussion of viscosity solutions).
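In the linear quadratic case, the coupled algebraic Riccati equations can be solved offline by iterating on Lyapunov equations, one per player, under the current pair of feedback gains. The following sketch (with invented system and cost matrices, and assuming stabilizing initial gains; here K = 0 is admissible because A is chosen stable) illustrates this Lyapunov-iteration approach for a hypothetical 2-player LQ game:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical 2-player LQ non-zero-sum game (all matrices invented for
# illustration): x' = A x + B1 u1 + B2 u2, with quadratic costs
# J_i = integral of x'Q_i x + u1'R_i1 u1 + u2'R_i2 u2.
A  = np.array([[-2.0, 1.0], [0.0, -1.0]])   # open-loop stable, so K = 0 is admissible
B1 = np.array([[1.0], [0.0]])
B2 = np.array([[0.0], [1.0]])
Q1, Q2 = np.eye(2), 2.0 * np.eye(2)
R11, R22 = np.array([[1.0]]), np.array([[1.0]])
R12, R21 = 0.1 * np.eye(1), 0.1 * np.eye(1)

K1 = np.zeros((1, 2))
K2 = np.zeros((1, 2))
for _ in range(200):
    Ac = A - B1 @ K1 - B2 @ K2              # current closed loop
    # Policy evaluation: each P_i solves a Lyapunov equation under the
    # current pair of gains: Ac'P_i + P_i Ac + Q_i + K1'R_i1 K1 + K2'R_i2 K2 = 0.
    P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
    P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2))
    # Policy improvement: K_i = R_ii^{-1} B_i' P_i.
    K1 = np.linalg.solve(R11, B1.T @ P1)
    K2 = np.linalg.solve(R22, B2.T @ P2)

# At a Nash fixed point both coupled Riccati equations hold simultaneously.
Ac = A - B1 @ K1 - B2 @ K2
res1 = Ac.T @ P1 + P1 @ Ac + Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2
res2 = Ac.T @ P2 + P2 @ Ac + Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2
```

Note this offline iteration requires the full model (A, B1, B2) in advance; the point of the paper is to reach the same fixed point online from trajectory data.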
For the most part, interest in the control systems community has been in the (non-cooperative) zero-sum games, which provide the solution of the H-infinity robust control problem (Başar and Olsder, 1999, Limebeer et al., 1994). However, dynamic team games may have some cooperative objectives and some selfish objectives among the players. This cooperative/non-cooperative balance is captured in the NZS games, as detailed herein.
In this paper we are interested in feedback policies with full state information, and we provide methods for online gaming, that is, for online solution of N-player infinite horizon NZS games through learning the Nash equilibrium in real time. The dynamics are nonlinear in continuous time and are assumed known. A novel adaptive control technique is given, based on reinforcement learning, whereby each player's control policy is tuned online using data generated in real time along the system trajectories. Each player also tunes a 'critic' approximator structure whose function is to identify the value of that player's current control policy. Based on these value estimates, the players' policies are continuously updated. This is a sort of indirect adaptive control algorithm; yet, due to the simple dependence of the control policies on the learned value, it is effected online as direct ('optimal') adaptive control.
Reinforcement learning (RL) is a sub-area of machine learning concerned with how to methodically modify the actions of an agent (player) based on observed responses from its environment (Lewis & Vrabie, 2009; Powell, 2007; Sutton & Barto, 1998). In game theory, reinforcement learning is considered a boundedly rational interpretation of how equilibrium may arise. RL is a means of learning optimal behaviors by observing the response from the environment to non-optimal control policies.
RL methods offer many advantages that have motivated control systems researchers to develop RL algorithms which result in optimal feedback controllers for dynamic systems described by difference or ordinary differential equations. These involve a computational intelligence technique known as Policy Iteration (PI) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Werbos, 1974; Werbos, 1992), which refers to a class of two-step iteration algorithms: policy evaluation and policy improvement. PI has primarily been developed for discrete-time systems, and online implementation for control systems has been developed through approximation of the value function, building on the work of Bertsekas and Tsitsiklis (1996) and Werbos, 1974, Werbos, 1992. PI provides an effective means of learning solutions to HJ equations online. In control theoretic terms, the PI algorithm amounts to learning the solution to a nonlinear Lyapunov equation, and then updating the policy by minimizing a Hamiltonian function. Online reinforcement learning techniques have been developed for continuous-time systems in Vamvoudakis, Vrabie, and Lewis (2009), Vrabie, Pastravanu, Lewis, and Abu-Khalaf (2009) and Vrabie, Vamvoudakis, and Lewis (2009).
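As a concrete illustration of the two steps, the following sketch runs exact policy iteration on a toy finite MDP (all problem data invented for illustration): policy evaluation reduces to a linear solve, and policy improvement is a greedy update with respect to the current value.

```python
import numpy as np

# Toy 4-state chain MDP: states 0..3, state 3 is an absorbing goal.
# Actions: 0 = step left, 1 = step right; each non-goal step costs 1.
n_states, gamma = 4, 0.9

def step(s, a):
    if s == 3:
        return 3, 0.0                       # absorbing goal, no further cost
    s_next = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s_next, -1.0

def evaluate(policy):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    # This is the finite-MDP analogue of solving a Lyapunov equation.
    P = np.zeros((n_states, n_states))
    R = np.zeros(n_states)
    for s in range(n_states):
        s_next, r = step(s, policy[s])
        P[s, s_next] = 1.0
        R[s] = r
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)

policy = np.zeros(n_states, dtype=int)      # start with "always left"
while True:
    V = evaluate(policy)
    # Policy improvement: act greedily with respect to the current value.
    new_policy = np.array([
        max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
        for s in range(n_states)
    ])
    if np.array_equal(new_policy, policy):  # fixed point = optimal policy
        break
    policy = new_policy
```

The loop terminates at the optimal policy (move right toward the goal from every non-goal state), with the discounted costs-to-go as the converged values.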
In recent work (Vamvoudakis & Lewis, 2010) we developed an online approximate solution method based on PI for the (1-player) infinite horizon optimal control problem (solution of the Hamilton–Jacobi–Bellman equation).
This paper proposes an algorithm for nonlinear continuous-time systems with known dynamics to solve the N-player non-zero-sum (NZS) game problem, where each player wants to optimize his own performance index (Başar & Olsder, 1999). The number of parametric approximator structures used is 2N: each player maintains a critic approximator neural network (NN) to learn his optimal value and a control actor NN to learn his optimal control policy. Parameter update laws are given to tune the N critic and N actor neural networks simultaneously online so that they converge to the solution of the coupled HJ equations, while also guaranteeing closed-loop stability. Rigorous proofs of performance and convergence are given. For the sake of clarity, we restrict the actual proof to two-player differential games; the proof technique extends directly, with further careful bookkeeping, to more players.
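The paper's actor–critic tuning laws are developed later; as a heavily simplified, single-player scalar sketch of the underlying idea (not the paper's algorithm, and with all numbers invented), a one-parameter critic can be tuned by gradient descent on the squared Bellman (Hamiltonian) residual while the actor is induced from the current critic estimate:

```python
import numpy as np

# Scalar plant x' = a x + b u with cost integral of q x^2 + r u^2.
# The value is V(x) = p x^2, so a one-parameter "critic" p_hat is tuned
# online from the Bellman residual
#   e = q x^2 + r u^2 + dV/dx * (a x + b u),   dV/dx = 2 p_hat x.
a, b, q, r = -0.5, 1.0, 1.0, 1.0
p_star = (-1.0 + np.sqrt(5.0)) / 2.0   # positive root of 2ap - b^2 p^2 / r + q = 0

p_hat, alpha, dt = 0.0, 1.0, 1e-3
for episode in range(30):
    x = 1.0                            # periodic reset keeps the state exciting
    for _ in range(int(1.0 / dt)):
        u = -(b * p_hat / r) * x       # actor induced by the current critic
        xdot = a * x + b * u
        e = q * x**2 + r * u**2 + 2.0 * p_hat * x * xdot   # Bellman residual
        p_hat -= alpha * e * (2.0 * x * xdot) * dt         # gradient step (u treated as measured data)
        x += xdot * dt                 # Euler step of the plant
```

Here excitation is maintained by resetting the state each episode; the paper instead uses a persistence of excitation condition and probing noise, and tunes full neural-network weight vectors rather than one scalar.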
The paper is organized as follows. It is necessary to develop policy iteration (PI) techniques for solving multi-player games, for these PI algorithms give the controller structure needed for the online adaptive learning techniques presented in this paper. Therefore, Section 2 presents the formulation of multi-player NZS differential games for nonlinear systems (Başar & Olsder, 1999) and gives a policy iteration algorithm that solves the required coupled nonlinear HJ equations by successive solutions of nonlinear Lyapunov-like equations. Based on this PI algorithm, Section 3 develops suitable approximator structures (based on neural networks) for the value functions and the control inputs of the two players. Section 4 then develops an adaptive control method for online learning of the solution to the two-player NZS game problem; the method generalizes to multi-player games and particularizes to the special case of zero-sum (ZS) games. A rigorous mathematical analysis, using a Lyapunov technique, is carried out. It is found that the actor and critic neural networks require novel nonstandard tuning algorithms to guarantee stability and convergence to the Nash equilibrium, and a persistence of excitation condition (Ioannou and Fidan, 2006; Tao, 2003) is needed to guarantee proper convergence to the optimal value functions. Section 5 presents simulation examples that show the effectiveness of the synchronous online game algorithm in learning the optimal values for both linear and nonlinear systems.
Section snippets
N-player differential game for nonlinear systems
This section presents the formulation of N-player games for nonlinear systems and gives a Policy Iteration solution algorithm. The objective is to lay a foundation for the control structure needed in Section 4 for online solution of the game problems in real time.
Value function approximation for solution of nonlinear Lyapunov equations
This paper uses nonlinear approximator structures for Value Function Approximation (VFA) (Bertsekas and Tsitsiklis, 1996; Werbos, 1974; Werbos, 1992) to solve (13). We show how to solve the 2-player non-zero-sum game presented in Section 2; the approach extends easily to more than two players.
Consider the nonlinear time-invariant dynamical system, affine in the inputs, given by ẋ = f(x) + g(x)u₁(t) + k(x)u₂(t), where the state x(t) ∈ ℝⁿ, the first control input u₁(t) ∈ ℝ^(m₁), and the second control input u₂(t) ∈ ℝ^(m₂).
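As a minimal sketch of simulating such a two-input affine system under fixed state-feedback policies (the drift, input maps, and policies below are all invented for illustration and are not the paper's example):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Two-input affine system x' = f(x) + g(x) u1 + k(x) u2 with constant
# (state-independent) input maps, purely for illustration.
def f(x):
    return np.array([-x[0] + x[1], -0.5 * (x[0] + x[1])])

g = np.array([0.0, 1.0])   # input channel of player 1
k = np.array([1.0, 0.0])   # input channel of player 2

def closed_loop(t, x):
    u1 = -x[1]             # hypothetical state-feedback policy of player 1
    u2 = -x[0]             # hypothetical state-feedback policy of player 2
    return f(x) + g * u1 + k * u2

sol = solve_ivp(closed_loop, (0.0, 10.0), [1.0, -1.0], rtol=1e-8, atol=1e-10)
x_final = sol.y[:, -1]     # state at t = 10: the closed loop is stable here
```

With these particular policies the closed loop is linear and Hurwitz, so the state converges to the origin; the paper's algorithms instead learn the policies online.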
Online solution of 2-player games
In this section we develop an optimal adaptive control algorithm that solves the 2-player game problem online using data measured along the system trajectories. A special case is the zero-sum 2-player game. The technique given here generalizes directly to the N-player game. A Lyapunov technique is used to derive novel parameter tuning algorithms for the values and control policies that guarantee closed-loop stability as well as convergence to the approximate game solution of (10).
Simulation results
Here we present simulations of linear and nonlinear systems to show that the game can be solved online by learning in real time, using the method of this paper. Persistence of excitation (PE) is needed to guarantee convergence to the Nash solution. In these simulations, exponentially decreasing probing noise is added to the control inputs to ensure PE until convergence is obtained.
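A typical choice of probing signal is a sum of sinusoids with an exponentially decaying envelope. The sketch below (frequencies, decay rate, and plant all invented for illustration) adds such a signal on top of a stabilizing feedback and shows that the perturbation dies out, leaving the closed loop asymptotically unchanged:

```python
import numpy as np

# Exponentially decaying probing signal: rich in frequencies early on (to
# maintain persistence of excitation while learning), vanishing later.
def probing_noise(t):
    return np.exp(-0.08 * t) * (np.sin(2.3 * t) + 0.7 * np.cos(5.1 * t)
                                + 0.4 * np.sin(11.7 * t))

# Scalar plant x' = a x + u with stabilizing feedback u = -x + n(t).
a, dt = -1.0, 1e-3
x = 1.0
for step in range(int(60.0 / dt)):
    t = step * dt
    u = -x + probing_noise(t)   # noise rides on top of the control input
    x += (a * x + u) * dt       # Euler step of the perturbed closed loop
```

Because the noise envelope decays, the state is excited during the learning window but still converges to the origin once the noise has faded.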
References (30)
- Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica (2005).
- Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks (1990).
- Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica (2010).
- Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (2009).
- Matrix Riccati equations in control and systems theory (2003).
- Policy iterations on the Hamilton–Jacobi–Isaacs equation for state feedback control with input saturation. IEEE Transactions on Automatic Control (2006).
- Sobolev spaces (2003).
- Dynamic noncooperative game theory (1999).
- Neuro-dynamic programming (1996).
- The method of weighted residuals and variational principles (1990).
- On global existence of solutions to coupled matrix Riccati equations in closed loop Nash games. IEEE Transactions on Automatic Control.
- Adaptive neural control of uncertain MIMO nonlinear systems. IEEE Transactions on Neural Networks.
- Solving coupled algebraic Riccati equations from closed-loop Nash strategy, by lack of trust approach. International Journal of Tomography & Statistics.
Kyriakos G. Vamvoudakis was born in Athens Greece. He received the Diploma in Electronic and Computer Engineering from the Technical University of Crete, Greece in 2006 with highest honors and the M.Sc. degree in Electrical Engineering from The University of Texas at Arlington in 2008. He is currently working toward the Ph.D. degree and working as a research assistant at the Automation and Robotics Research Institute, The University of Texas at Arlington. He is coauthor of 2 book chapters, and 25 technical publications. His current research interests include approximate dynamic programming, game theory, neural network feedback control, optimal control, adaptive control and systems biology. He is a member of Tau Beta Pi, Eta Kappa Nu and Golden Key honor societies and is listed in Who’s Who in the world and Who’s Who in Science and Engineering. He received the Best Paper Award for Autonomous/Unmanned Vehicles at the 27th Army Science Conference in 2010. He also received the Best Student Award, UTA Automation & Robotics Research Institute in 2010. He has co-organized special sessions for several international conferences. Mr. Vamvoudakis is a registered Electrical/Computer engineer (PE) and member of Technical Chamber of Greece.
Frank L. Lewis, Fellow IEEE, Fellow IFAC, Fellow UK Institute of Measurement & Control, PE Texas, UK Chartered Engineer, is Distinguished Scholar Professor and Moncrief-O’Donnell Chair at University of Texas at Arlington’s Automation & Robotics Research Institute. He obtained the Bachelor’s Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, distributed control systems, and sensor networks. He is author of 6 US patents, 216 journal papers, 330 conference papers, 14 books, 44 chapters, and 11 journal special issues. He received the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award 2009, UK Inst Measurement & Control Honeywell Field Engineering Medal 2009. Received Outstanding Service Award from Dallas IEEE Section, selected as Engineer of the year by Ft. Worth IEEE Section. Listed in Ft. Worth Business Press Top 200 Leaders in Manufacturing. Received the 2010 IEEE Region 5 Outstanding Engineering Educator Award and the 2010 UTA Graduate Dean’s Excellence in Doctoral Mentoring Award. He served on the NAE Committee on Space Station in 1995. He is an elected Guest Consulting Professor at South China University of Technology and Shanghai Jiao Tong University. Founding Member of the Board of Governors of the Mediterranean Control Association. Helped win the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of ARRI’s SBIR Program).
☆ This work was supported by the National Science Foundation ECS-0801330, the Army Research Office W91NF-05-1-0314, and the Air Force Office of Scientific Research FA9550-09-1-0278. This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Raul Ordóñez under the direction of Editor Miroslav Krstic.