Automatica
Volume 47, Issue 8, August 2011, Pages 1556-1569

Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton–Jacobi equations

https://doi.org/10.1016/j.automatica.2011.03.005

Abstract

In this paper we present an online adaptive control algorithm, based on policy iteration reinforcement learning techniques, to solve the continuous-time (CT) infinite-horizon multi-player non-zero-sum (NZS) game for linear and nonlinear systems. NZS games allow the players' strategies to have both a cooperative team component and an individual selfish component. The adaptive algorithm learns online the solution of the coupled Riccati equations and coupled Hamilton–Jacobi equations for linear and nonlinear systems, respectively. This adaptive control method finds, in real time, approximations of the optimal values and the NZS Nash equilibrium, while also guaranteeing closed-loop stability. The optimal-adaptive algorithm is implemented as a separate actor/critic parametric network approximator structure for every player, and involves simultaneous continuous-time adaptation of the actor/critic networks. A persistence of excitation condition is shown to guarantee convergence of every critic to the actual optimal value function for that player. A detailed mathematical analysis is given for 2-player NZS games. Novel tuning algorithms are given for the actor/critic networks. Convergence to the Nash equilibrium is proven and stability of the system is guaranteed. This provides optimal adaptive control solutions for both non-zero-sum games and their special case, the zero-sum games. Simulation examples show the effectiveness of the new algorithm.

Introduction

Game theory (Tijs, 2003) has been very successful in modeling strategic behavior, where the outcome for each player depends on the actions of himself and all the other players. Each player chooses a control, independently of the others, to minimize his own performance objective, and no player has knowledge of the other players' strategies. Many applications of optimization theory require the solution of coupled Hamilton–Jacobi equations (Başar and Olsder, 1999, Freiling et al., 2002). In games with N players, the Nash equilibrium is determined by Hamilton–Jacobi equations coupled through their quadratic terms (Freiling et al., 2002, Gajic and Li, 1988). Each dynamic game consists of three parts: (i) the players; (ii) the actions available to each player; (iii) the costs for every player, which depend on the players' actions.

Multi-player non-zero-sum games rely on solving the coupled Hamilton–Jacobi (HJ) equations, which in the linear quadratic case reduce to the coupled algebraic Riccati equations (Abou-Kandil et al., 2003, Freiling et al., 2002, Gajic and Li, 1988). Solution methods are generally offline and generate fixed control policies that are then implemented in online controllers in real time. In the nonlinear case the coupled HJ equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases (e.g. a scalar system, bilinear in the input and state); see the discussion of viscosity solutions in Sontag.
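For reference, in the linear quadratic case these coupled Riccati equations take the following standard form (a sketch in illustrative notation, following the references above). For dynamics $\dot{x}=Ax+\sum_{j=1}^{N}B_{j}u_{j}$ and cost functionals $J_{i}=\int_{0}^{\infty}\bigl(x^{T}Q_{i}x+\sum_{j=1}^{N}u_{j}^{T}R_{ij}u_{j}\bigr)\,dt$, the feedback Nash policies are $u_{i}=-R_{ii}^{-1}B_{i}^{T}P_{i}x$, where the matrices $P_{i}$ satisfy, for $i=1,\dots,N$,
\[
0=A_{c}^{T}P_{i}+P_{i}A_{c}+Q_{i}+\sum_{j=1}^{N}P_{j}B_{j}R_{jj}^{-1}R_{ij}R_{jj}^{-1}B_{j}^{T}P_{j},
\qquad
A_{c}=A-\sum_{j=1}^{N}B_{j}R_{jj}^{-1}B_{j}^{T}P_{j}.
\]
The equations are coupled through the closed-loop matrix $A_{c}$ and the quadratic terms in the $P_{j}$ of the other players.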

For the most part, interest in the control systems community has been in the (non-cooperative) zero-sum games, which provide the solution of the H-infinity robust control problem (Başar and Olsder, 1999, Limebeer et al., 1994). However, dynamic team games may have some cooperative objectives and some selfish objectives among the players. This cooperative/non-cooperative balance is captured in the NZS games, as detailed herein.

In this paper we are interested in feedback policies with full state information, and provide methods for online gaming, that is, for the solution of N-player infinite-horizon NZS games online, by learning the Nash equilibrium in real time. The dynamics are nonlinear, in continuous time, and are assumed known. A novel adaptive control technique is given, based on reinforcement learning, whereby each player's control policy is tuned online using data generated in real time along the system trajectories. Each player also tunes a 'critic' approximator structure whose function is to identify the value of that player's current control policy. Based on these value estimates, the players' policies are continuously updated. This is a sort of indirect adaptive control algorithm; yet, because the control policies depend in a simple closed form on the learned values, it is effected online as direct ('optimal') adaptive control.

Reinforcement learning (RL) is a sub-area of machine learning concerned with how to methodically modify the actions of an agent (player) based on the observed responses from its environment (Lewis & Vrabie, 2009, Powell, 2007, Sutton & Barto, 1998). In game theory, reinforcement learning is considered a boundedly rational interpretation of how equilibrium may arise. RL is a means of learning optimal behaviors by observing the response of the environment to non-optimal control policies.

RL methods offer many advantages that have motivated control systems researchers to develop RL algorithms which result in optimal feedback controllers for dynamic systems described by difference or ordinary differential equations. These involve a computational intelligence technique known as Policy Iteration (PI) (Bertsekas and Tsitsiklis, 1996, Sutton and Barto, 1998, Werbos, 1974, Werbos, 1992), which refers to a class of two-step iterative algorithms: policy evaluation and policy improvement. PI has primarily been developed for discrete-time systems, and online implementations for control systems have been developed through approximation of the value function, based on the work of Bertsekas and Tsitsiklis (1996) and Werbos (1974, 1992). PI provides an effective means of learning solutions to HJ equations online. In control theoretic terms, the PI algorithm amounts to learning the solution of a nonlinear Lyapunov equation, and then updating the policy by minimizing a Hamiltonian function. Online reinforcement learning techniques have been developed for continuous-time systems in Vamvoudakis, Vrabie, and Lewis (2009), Vrabie, Pastravanu, Lewis, and Abu-Khalaf (2009) and Vrabie, Vamvoudakis, and Lewis (2009).
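To fix ideas, the two PI steps for a single player with dynamics $\dot{x}=f(x)+g(x)u$ and cost $\int_{0}^{\infty}\bigl(Q(x)+u^{T}Ru\bigr)\,dt$ can be sketched as follows (illustrative notation; the multi-player version in Section 2 couples these steps across players). Policy evaluation solves the nonlinear Lyapunov equation for the value of the current admissible policy $u^{(k)}$,
\[
0=\nabla V^{(k)T}(x)\bigl(f(x)+g(x)u^{(k)}(x)\bigr)+Q(x)+u^{(k)T}(x)Ru^{(k)}(x),\qquad V^{(k)}(0)=0,
\]
and policy improvement minimizes the Hamiltonian with respect to the control,
\[
u^{(k+1)}(x)=\arg\min_{u}H\bigl(x,u,\nabla V^{(k)}\bigr)=-\tfrac{1}{2}R^{-1}g^{T}(x)\nabla V^{(k)}(x).
\]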

In recent work (Vamvoudakis & Lewis, 2010) we developed an online approximate solution method based on PI for the (1-player) infinite horizon optimal control problem (solution of Hamilton–Jacobi–Bellman equation).

This paper proposes an algorithm for nonlinear continuous-time systems with known dynamics to solve the N-player non-zero-sum (NZS) game problem, in which each player wants to optimize his own performance index (Başar & Olsder, 1999). The number of parametric approximator structures used is 2N. Each player maintains a critic approximator neural network (NN) to learn his optimal value and an actor NN to learn his optimal control policy. Parameter update laws are given to tune the N critic and N actor neural networks simultaneously online, so that they converge to the solution of the coupled HJ equations while also guaranteeing closed-loop stability. Rigorous proofs of performance and convergence are given. For the sake of clarity, we restrict ourselves to two-player differential games in the actual proof. The proof technique can be extended directly, with further careful bookkeeping, to multiple players.
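As a sketch of this structure (in illustrative notation), player $i$ keeps a critic and an actor of the linear-in-the-parameters form
\[
\hat{V}_{i}(x)=\hat{W}_{ci}^{T}\varphi_{i}(x),\qquad
\hat{u}_{i}(x)=-\tfrac{1}{2}R_{ii}^{-1}g_{i}^{T}(x)\nabla\varphi_{i}^{T}(x)\hat{W}_{ai},
\]
where $\varphi_{i}(x)$ is a basis (activation function) vector, the critic weights $\hat{W}_{ci}$ are tuned to reduce a Bellman-type residual along the trajectory, and the actor weights $\hat{W}_{ai}$ are tuned so that the implied policy remains stabilizing while being driven toward the critic's estimate.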

The paper is organized as follows. It is necessary to develop policy iteration (PI) techniques for solving multi-player games, since these PI algorithms give the controller structure needed for the online adaptive learning techniques presented in this paper. Therefore, Section 2 presents the formulation of multi-player NZS differential games for nonlinear systems (Başar & Olsder, 1999) and gives a policy iteration algorithm that solves the required coupled nonlinear HJ equations by successive solutions of nonlinear Lyapunov-like equations. Based on this structure, Section 4 develops an adaptive control method for online learning of the solution to the two-player NZS game problem. The method generalizes to multi-player games, and particularizes to the special case of zero-sum (ZS) games. Based on the PI algorithm in Section 3, suitable approximator structures (based on neural networks) are developed for the value functions and the control inputs of the two players. A rigorous mathematical analysis is carried out. It is found that the actor and critic neural networks require novel nonstandard tuning algorithms to guarantee stability and convergence to the Nash equilibrium. A persistence of excitation condition (Ioannou and Fidan, 2006, Tao, 2003) is needed to guarantee proper convergence to the optimal value functions. A Lyapunov analysis technique is used. Section 5 presents simulation examples that show the effectiveness of the synchronous online game algorithm in learning the optimal values for both linear and nonlinear systems.


N-player differential game for nonlinear systems

This section presents the formulation of N-player games for nonlinear systems. A Policy Iteration solution algorithm is given. The objective is to lay a foundation for the control structure needed in Section 4 for online solution of the game problems in real time.
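In representative notation (following Başar & Olsder, 1999), the N players share the dynamics and each minimizes his own infinite-horizon cost,
\[
\dot{x}=f(x)+\sum_{j=1}^{N}g_{j}(x)u_{j},\qquad
J_{i}=\int_{0}^{\infty}\Bigl(Q_{i}(x)+\sum_{j=1}^{N}u_{j}^{T}R_{ij}u_{j}\Bigr)\,dt,\quad i=1,\dots,N,
\]
with $Q_{i}(x)\ge 0$ and $R_{ii}>0$. A set of feedback policies $\{u_{i}^{*}\}$ is a Nash equilibrium if no player can lower his own cost $J_{i}$ by unilaterally deviating from $u_{i}^{*}$.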

Value function approximation for solution of nonlinear Lyapunov equations

This paper uses nonlinear approximator structures for Value Function Approximation (VFA) (Bertsekas and Tsitsiklis, 1996, Werbos, 1974, Werbos, 1992) to solve (13). We show how to solve the 2-player non-zero-sum game presented in Section 2; the approach can easily be extended to more than two players.

Consider the nonlinear, time-invariant, input-affine dynamical system $\dot{x}=f(x)+g(x)u(x)+k(x)d(x)$, where the state $x(t)\in\mathbb{R}^{n}$, the first control input $u(x)\in\mathbb{R}^{m}$, and the second control input $d(x)\in\mathbb{R}^{q}$.
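With this notation, the two players' value functions and the coupled HJ equations they must satisfy take the familiar form (a sketch following the references above; the precise assumptions are stated in the paper):
\[
V_{1}=\int_{0}^{\infty}\bigl(Q_{1}(x)+u^{T}R_{11}u+d^{T}R_{12}d\bigr)\,dt,\qquad
V_{2}=\int_{0}^{\infty}\bigl(Q_{2}(x)+u^{T}R_{21}u+d^{T}R_{22}d\bigr)\,dt,
\]
with Nash policies $u^{*}=-\tfrac{1}{2}R_{11}^{-1}g^{T}(x)\nabla V_{1}$ and $d^{*}=-\tfrac{1}{2}R_{22}^{-1}k^{T}(x)\nabla V_{2}$, and
\[
0=\nabla V_{i}^{T}\bigl(f+gu^{*}+kd^{*}\bigr)+Q_{i}(x)+u^{*T}R_{i1}u^{*}+d^{*T}R_{i2}d^{*},\qquad V_{i}(0)=0,\quad i=1,2.
\]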

Online solution of 2-player games

In this section we develop an optimal adaptive control algorithm that solves the 2-player game problem online using data measured along the system trajectories. A special case is the zero-sum 2-player game. The technique given here generalizes directly to the N-player game. A Lyapunov technique is used to derive novel parameter tuning algorithms for the values and control policies that guarantee closed-loop stability as well as convergence to the approximate game solution of (10).
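The tuning laws themselves are derived later from the Lyapunov analysis. As a rough illustration of the overall mechanism only, the following minimal sketch runs a hypothetical scalar 2-player example in which the policies are computed from the current critic estimates and each critic is adapted by normalized gradient descent on its Bellman error along the measured trajectory. The scalar dynamics, the basis, the gains, and the merging of actor and critic weights are all simplifying assumptions made for illustration; this is not the algorithm of this paper.

```python
import numpy as np

# Illustrative sketch only (not the algorithm of this paper): a scalar 2-player
# NZS game x_dot = a*x + b*u + k*d with quadratic costs. The policies are
# computed from the current critic estimates (actor weights are merged with the
# critic weights for brevity), and each critic is tuned by normalized gradient
# descent on its Bellman (Hamiltonian) error along the measured trajectory.

a, b, k = -1.0, 1.0, 0.5                      # assumed scalar dynamics
Q1, Q2 = 1.0, 1.0                             # state penalties of players 1 and 2
R11, R12, R21, R22 = 1.0, 1.0, 1.0, 1.0       # input penalties

dphi = lambda x: np.array([2.0 * x])          # gradient of quadratic basis phi(x) = [x^2]

Wc1, Wc2 = np.array([0.5]), np.array([0.5])   # critic weights of the two players
alpha, dt, T = 5.0, 1e-3, 10.0                # learning rate, integration step, horizon
x = 1.0

for step in range(int(T / dt)):
    t = step * dt
    noise = 0.2 * np.exp(-0.3 * t) * np.sin(7.0 * t)       # decaying probing noise (PE)
    u = -0.5 / R11 * b * float(dphi(x) @ Wc1) + noise      # player-1 policy from critic 1
    d = -0.5 / R22 * k * float(dphi(x) @ Wc2) + noise      # player-2 policy from critic 2
    xdot = a * x + b * u + k * d
    sigma = dphi(x) * xdot                                 # regressor along the trajectory
    e1 = float(sigma @ Wc1) + Q1 * x**2 + R11 * u**2 + R12 * d**2   # Bellman error, player 1
    e2 = float(sigma @ Wc2) + Q2 * x**2 + R21 * u**2 + R22 * d**2   # Bellman error, player 2
    m = float(sigma @ sigma) + 1.0                         # normalization term
    Wc1 = Wc1 - dt * alpha * sigma * e1 / m**2             # normalized gradient descent
    Wc2 = Wc2 - dt * alpha * sigma * e2 / m**2
    x = x + dt * xdot                                      # Euler step of the plant

print("learned critic weights:", Wc1, Wc2)
```

In the paper the actors are kept as separate networks with their own tuning laws, which is what allows closed-loop stability to be guaranteed during learning.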


Simulation results

Here we present simulations of nonlinear and linear systems to show that the game can be solved online, by learning in real time, using the method of this paper. Persistence of excitation (PE) is needed to guarantee convergence to the Nash solution. In these simulations, exponentially decreasing probing noise is added to the control inputs to ensure PE until convergence is obtained.
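For instance, a probing signal of the following kind (a hypothetical example of the sums-of-sinusoids signals commonly used for this purpose; the exact signals used are given with each simulation example) can be superimposed on each player's control input:

```python
import numpy as np

# A hypothetical exponentially decaying probing signal (a sum of sinusoids)
# that can be superimposed on each control input to provide PE; its envelope
# decays to zero as the critic/actor weights converge.
def probing_noise(t, amplitude=1.0, decay=0.1):
    return amplitude * np.exp(-decay * t) * (
        np.sin(1.0 * t) + np.sin(3.0 * t) ** 2 * np.cos(0.5 * t) + np.sin(7.0 * t)
    )
```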


References (30)

  • G. Freiling et al., "On global existence of solutions to coupled matrix Riccati equations in closed loop Nash games," IEEE Transactions on Automatic Control (2002).
  • Gajic, Z., & Li, T.-Y. (1988). Simulation results for two new algorithms for solving coupled algebraic Riccati...
  • S.S. Ge et al., "Adaptive neural control of uncertain MIMO nonlinear systems," IEEE Transactions on Neural Networks (2004).
  • P. Ioannou et al.
  • M. Jungers et al., "Solving coupled algebraic Riccati equations from closed-loop Nash strategy, by lack of trust approach," International Journal of Tomography & Statistics (2007).

Kyriakos G. Vamvoudakis was born in Athens, Greece. He received the Diploma in Electronic and Computer Engineering from the Technical University of Crete, Greece, in 2006 with highest honors, and the M.Sc. degree in Electrical Engineering from The University of Texas at Arlington in 2008. He is currently working toward the Ph.D. degree and working as a research assistant at the Automation and Robotics Research Institute, The University of Texas at Arlington. He is coauthor of 2 book chapters and 25 technical publications. His current research interests include approximate dynamic programming, game theory, neural network feedback control, optimal control, adaptive control and systems biology. He is a member of the Tau Beta Pi, Eta Kappa Nu and Golden Key honor societies and is listed in Who's Who in the World and Who's Who in Science and Engineering. He received the Best Paper Award for Autonomous/Unmanned Vehicles at the 27th Army Science Conference in 2010. He also received the Best Student Award, UTA Automation & Robotics Research Institute, in 2010. He has co-organized special sessions for several international conferences. Mr. Vamvoudakis is a registered Electrical/Computer Engineer (PE) and a member of the Technical Chamber of Greece.

Frank L. Lewis, Fellow IEEE, Fellow IFAC, Fellow UK Institute of Measurement & Control, PE Texas, UK Chartered Engineer, is Distinguished Scholar Professor and Moncrief-O'Donnell Chair at the University of Texas at Arlington's Automation & Robotics Research Institute. He obtained the Bachelor's Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, distributed control systems, and sensor networks. He is author of 6 US patents, 216 journal papers, 330 conference papers, 14 books, 44 chapters, and 11 journal special issues. He received the Fulbright Research Award, the NSF Research Initiation Grant, the ASEE Terman Award, the Int. Neural Network Soc. Gabor Award 2009, and the UK Inst. Measurement & Control Honeywell Field Engineering Medal 2009. He received the Outstanding Service Award from the Dallas IEEE Section and was selected as Engineer of the Year by the Ft. Worth IEEE Section. He is listed in the Ft. Worth Business Press Top 200 Leaders in Manufacturing. He received the 2010 IEEE Region 5 Outstanding Engineering Educator Award and the 2010 UTA Graduate Dean's Excellence in Doctoral Mentoring Award. He served on the NAE Committee on Space Station in 1995. He is an elected Guest Consulting Professor at South China University of Technology and Shanghai Jiao Tong University, and a Founding Member of the Board of Governors of the Mediterranean Control Association. He helped win the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of the DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as President of the UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of ARRI's SBIR Program).

This work was supported by the National Science Foundation ECS-0801330, the Army Research Office W91NF-05-1-0314 and the Air Force Office of Scientific Research FA9550-09-1-0278. This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Raul Ordóñez under the direction of Editor Miroslav Krstic.
