Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton–Jacobi equations☆
Introduction
Game theory (Tijs, 2003) has been very successful in modeling strategic behavior, where the outcome for each player depends on the actions of all the players. Each player chooses a control to minimize his own performance objective independently of the others, and no player has knowledge of the other players' strategies. Many applications of optimization theory require the solution of coupled Hamilton–Jacobi equations (Başar and Olsder, 1999; Freiling et al., 2002). In N-player games, the Nash equilibrium depends on N Hamilton–Jacobi equations coupled through their quadratic terms (Freiling et al., 2002; Gajic and Li, 1988). Each dynamic game consists of three parts: (i) the players; (ii) the actions available to each player; (iii) a cost for each player that depends on the actions of all the players.
Multi-player non-zero-sum games rely on solving the coupled Hamilton–Jacobi (HJ) equations, which in the linear quadratic case reduce to the coupled algebraic Riccati equations (Abou-Kandil et al., 2003; Freiling et al., 2002; Gajic and Li, 1988). Solution methods are generally offline and generate fixed control policies that are then implemented in online controllers in real time. In the nonlinear case the coupled HJ equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases (e.g. a scalar system, bilinear in input and state; see Sontag's discussion of viscosity solutions).
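In the linear quadratic case, the coupled algebraic Riccati equations can be solved offline by iterating on Lyapunov equations, one per player, under the current pair of feedback gains. The following sketch (with invented system and cost matrices, and assuming stabilizing initial gains; here K = 0 is admissible because A is chosen stable) illustrates this Lyapunov-iteration approach for a hypothetical 2-player LQ game:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical 2-player LQ non-zero-sum game (all matrices invented for
# illustration): x' = A x + B1 u1 + B2 u2, with quadratic costs
# J_i = integral of x'Q_i x + u1'R_i1 u1 + u2'R_i2 u2.
A  = np.array([[-2.0, 1.0], [0.0, -1.0]])   # open-loop stable, so K = 0 is admissible
B1 = np.array([[1.0], [0.0]])
B2 = np.array([[0.0], [1.0]])
Q1, Q2 = np.eye(2), 2.0 * np.eye(2)
R11, R22 = np.array([[1.0]]), np.array([[1.0]])
R12, R21 = 0.1 * np.eye(1), 0.1 * np.eye(1)

K1 = np.zeros((1, 2))
K2 = np.zeros((1, 2))
for _ in range(200):
    Ac = A - B1 @ K1 - B2 @ K2              # current closed loop
    # Policy evaluation: each P_i solves a Lyapunov equation under the
    # current pair of gains: Ac'P_i + P_i Ac + Q_i + K1'R_i1 K1 + K2'R_i2 K2 = 0.
    P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
    P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2))
    # Policy improvement: K_i = R_ii^{-1} B_i' P_i.
    K1 = np.linalg.solve(R11, B1.T @ P1)
    K2 = np.linalg.solve(R22, B2.T @ P2)

# At a Nash fixed point both coupled Riccati equations hold simultaneously.
Ac = A - B1 @ K1 - B2 @ K2
res1 = Ac.T @ P1 + P1 @ Ac + Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2
res2 = Ac.T @ P2 + P2 @ Ac + Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2
```

Note this offline iteration requires the full model (A, B1, B2) in advance; the point of the paper is to reach the same fixed point online from trajectory data.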
For the most part, interest in the control systems community has been in the (non-cooperative) zero-sum games, which provide the solution of the H-infinity robust control problem (Başar and Olsder, 1999, Limebeer et al., 1994). However, dynamic team games may have some cooperative objectives and some selfish objectives among the players. This cooperative/non-cooperative balance is captured in the NZS games, as detailed herein.
In this paper we are interested in feedback policies with full state information, and we provide methods for online gaming, that is, for online solution of N-player infinite horizon NZS games through learning the Nash equilibrium in real time. The dynamics are nonlinear in continuous time and are assumed known. A novel adaptive control technique is given, based on reinforcement learning, whereby each player's control policy is tuned online using data generated in real time along the system trajectories. Each player also tunes a 'critic' approximator structure whose function is to identify the value of that player's current control policy. Based on these value estimates, the players' policies are continuously updated. This is a sort of indirect adaptive control algorithm; yet, due to the simple dependence of the control policies on the learned value, it is effected online as direct ('optimal') adaptive control.
Reinforcement learning (RL) is a sub-area of machine learning concerned with how to methodically modify the actions of an agent (player) based on observed responses from its environment (Lewis & Vrabie, 2009; Powell, 2007; Sutton & Barto, 1998). In game theory, reinforcement learning is considered a boundedly rational interpretation of how equilibrium may arise. RL is a means of learning optimal behaviors by observing the response from the environment to non-optimal control policies.
RL methods offer many advantages that have motivated control systems researchers to develop RL algorithms which result in optimal feedback controllers for dynamic systems described by difference or ordinary differential equations. These involve a computational intelligence technique known as Policy Iteration (PI) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Werbos, 1974; Werbos, 1992), which refers to a class of two-step iteration algorithms: policy evaluation and policy improvement. PI has primarily been developed for discrete-time systems, and online implementation for control systems has been developed through approximation of the value function, building on the work of Bertsekas and Tsitsiklis (1996) and Werbos, 1974, Werbos, 1992. PI provides an effective means of learning solutions to HJ equations online. In control theoretic terms, the PI algorithm amounts to learning the solution to a nonlinear Lyapunov equation, and then updating the policy by minimizing a Hamiltonian function. Online reinforcement learning techniques have been developed for continuous-time systems in Vamvoudakis, Vrabie, and Lewis (2009), Vrabie, Pastravanu, Lewis, and Abu-Khalaf (2009) and Vrabie, Vamvoudakis, and Lewis (2009).
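As a concrete illustration of the two steps, the following sketch runs exact policy iteration on a toy finite MDP (all problem data invented for illustration): policy evaluation reduces to a linear solve, and policy improvement is a greedy update with respect to the current value.

```python
import numpy as np

# Toy 4-state chain MDP: states 0..3, state 3 is an absorbing goal.
# Actions: 0 = step left, 1 = step right; each non-goal step costs 1.
n_states, gamma = 4, 0.9

def step(s, a):
    if s == 3:
        return 3, 0.0                       # absorbing goal, no further cost
    s_next = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s_next, -1.0

def evaluate(policy):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    # This is the finite-MDP analogue of solving a Lyapunov equation.
    P = np.zeros((n_states, n_states))
    R = np.zeros(n_states)
    for s in range(n_states):
        s_next, r = step(s, policy[s])
        P[s, s_next] = 1.0
        R[s] = r
    return np.linalg.solve(np.eye(n_states) - gamma * P, R)

policy = np.zeros(n_states, dtype=int)      # start with "always left"
while True:
    V = evaluate(policy)
    # Policy improvement: act greedily with respect to the current value.
    new_policy = np.array([
        max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
        for s in range(n_states)
    ])
    if np.array_equal(new_policy, policy):  # fixed point = optimal policy
        break
    policy = new_policy
```

The loop terminates at the optimal policy (move right toward the goal from every non-goal state), with the discounted costs-to-go as the converged values.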
In recent work (Vamvoudakis & Lewis, 2010) we developed an online approximate solution method based on PI for the (1-player) infinite horizon optimal control problem (solution of the Hamilton–Jacobi–Bellman equation).
This paper proposes an algorithm for nonlinear continuous-time systems with known dynamics to solve the N-player non-zero-sum (NZS) game problem, where each player wants to optimize his own performance index (Başar & Olsder, 1999). The number of parametric approximator structures used is 2N: each player maintains a critic approximator neural network (NN) to learn his optimal value and a control actor NN to learn his optimal control policy. Parameter update laws are given to tune the N critic and N actor neural networks simultaneously online so that they converge to the solution of the coupled HJ equations, while also guaranteeing closed-loop stability. Rigorous proofs of performance and convergence are given. For the sake of clarity, we restrict the actual proof to two-player differential games; the proof technique extends directly, with further careful bookkeeping, to more players.
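The paper's actor–critic tuning laws are developed later; as a heavily simplified, single-player scalar sketch of the underlying idea (not the paper's algorithm, and with all numbers invented), a one-parameter critic can be tuned by gradient descent on the squared Bellman (Hamiltonian) residual while the actor is induced from the current critic estimate:

```python
import numpy as np

# Scalar plant x' = a x + b u with cost integral of q x^2 + r u^2.
# The value is V(x) = p x^2, so a one-parameter "critic" p_hat is tuned
# online from the Bellman residual
#   e = q x^2 + r u^2 + dV/dx * (a x + b u),   dV/dx = 2 p_hat x.
a, b, q, r = -0.5, 1.0, 1.0, 1.0
p_star = (-1.0 + np.sqrt(5.0)) / 2.0   # positive root of 2ap - b^2 p^2 / r + q = 0

p_hat, alpha, dt = 0.0, 1.0, 1e-3
for episode in range(30):
    x = 1.0                            # periodic reset keeps the state exciting
    for _ in range(int(1.0 / dt)):
        u = -(b * p_hat / r) * x       # actor induced by the current critic
        xdot = a * x + b * u
        e = q * x**2 + r * u**2 + 2.0 * p_hat * x * xdot   # Bellman residual
        p_hat -= alpha * e * (2.0 * x * xdot) * dt         # gradient step (u treated as measured data)
        x += xdot * dt                 # Euler step of the plant
```

Here excitation is maintained by resetting the state each episode; the paper instead uses a persistence of excitation condition and probing noise, and tunes full neural-network weight vectors rather than one scalar.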
The paper is organized as follows. It is necessary to develop policy iteration (PI) techniques for solving multi-player games, for these PI algorithms give the controller structure needed for the online adaptive learning techniques presented in this paper. Therefore, Section 2 presents the formulation of multi-player NZS differential games for nonlinear systems (Başar & Olsder, 1999) and gives a policy iteration algorithm that solves the required coupled nonlinear HJ equations by successive solutions of nonlinear Lyapunov-like equations. Based on this PI algorithm, Section 3 develops suitable approximator structures (based on neural networks) for the value functions and the control inputs of the two players. Section 4 then develops an adaptive control method for online learning of the solution to the two-player NZS game problem; the method generalizes to multi-player games and particularizes to the special case of zero-sum (ZS) games. A rigorous mathematical analysis, using a Lyapunov technique, is carried out. It is found that the actor and critic neural networks require novel nonstandard tuning algorithms to guarantee stability and convergence to the Nash equilibrium, and a persistence of excitation condition (Ioannou and Fidan, 2006; Tao, 2003) is needed to guarantee proper convergence to the optimal value functions. Section 5 presents simulation examples that show the effectiveness of the synchronous online game algorithm in learning the optimal values for both linear and nonlinear systems.
Section snippets
N-player differential game for nonlinear systems
This section presents the formulation of N-player games for nonlinear systems and gives a Policy Iteration solution algorithm. The objective is to lay a foundation for the control structure needed in Section 4 for online solution of the game problems in real time.
Value function approximation for solution of nonlinear Lyapunov equations
This paper uses nonlinear approximator structures for Value Function Approximation (VFA) (Bertsekas and Tsitsiklis, 1996; Werbos, 1974; Werbos, 1992) to solve (13). We show how to solve the 2-player non-zero-sum game presented in Section 2; the approach extends easily to more than two players.
Consider the nonlinear time-invariant dynamical system, affine in the inputs, given by ẋ = f(x) + g(x)u₁(t) + k(x)u₂(t), where the state x(t) ∈ ℝⁿ, the first control input u₁(t) ∈ ℝ^(m₁), and the second control input u₂(t) ∈ ℝ^(m₂).
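As a minimal sketch of simulating such a two-input affine system under fixed state-feedback policies (the drift, input maps, and policies below are all invented for illustration and are not the paper's example):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Two-input affine system x' = f(x) + g(x) u1 + k(x) u2 with constant
# (state-independent) input maps, purely for illustration.
def f(x):
    return np.array([-x[0] + x[1], -0.5 * (x[0] + x[1])])

g = np.array([0.0, 1.0])   # input channel of player 1
k = np.array([1.0, 0.0])   # input channel of player 2

def closed_loop(t, x):
    u1 = -x[1]             # hypothetical state-feedback policy of player 1
    u2 = -x[0]             # hypothetical state-feedback policy of player 2
    return f(x) + g * u1 + k * u2

sol = solve_ivp(closed_loop, (0.0, 10.0), [1.0, -1.0], rtol=1e-8, atol=1e-10)
x_final = sol.y[:, -1]     # state at t = 10: the closed loop is stable here
```

With these particular policies the closed loop is linear and Hurwitz, so the state converges to the origin; the paper's algorithms instead learn the policies online.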
Online solution of 2-player games
In this section we develop an optimal adaptive control algorithm that solves the 2-player game problem online using data measured along the system trajectories. A special case is the zero-sum 2-player game. The technique given here generalizes directly to the N-player game. A Lyapunov technique is used to derive novel parameter tuning algorithms for the values and control policies that guarantee closed-loop stability as well as convergence to the approximate game solution of (10).
Simulation results
Here we present simulations of linear and nonlinear systems to show that the game can be solved online by learning in real time, using the method of this paper. Persistence of excitation (PE) is needed to guarantee convergence to the Nash solution. In these simulations, exponentially decreasing probing noise is added to the control inputs to ensure PE until convergence is obtained.
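A typical choice of probing signal is a sum of sinusoids with an exponentially decaying envelope. The sketch below (frequencies, decay rate, and plant all invented for illustration) adds such a signal on top of a stabilizing feedback and shows that the perturbation dies out, leaving the closed loop asymptotically unchanged:

```python
import numpy as np

# Exponentially decaying probing signal: rich in frequencies early on (to
# maintain persistence of excitation while learning), vanishing later.
def probing_noise(t):
    return np.exp(-0.08 * t) * (np.sin(2.3 * t) + 0.7 * np.cos(5.1 * t)
                                + 0.4 * np.sin(11.7 * t))

# Scalar plant x' = a x + u with stabilizing feedback u = -x + n(t).
a, dt = -1.0, 1e-3
x = 1.0
for step in range(int(60.0 / dt)):
    t = step * dt
    u = -x + probing_noise(t)   # noise rides on top of the control input
    x += (a * x + u) * dt       # Euler step of the perturbed closed loop
```

Because the noise envelope decays, the state is excited during the learning window but still converges to the origin once the noise has faded.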
References (30)
- Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica (2005).
- Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks (1990).
- Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica (2010).
- Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (2009).
- Matrix Riccati equations in control and systems theory (2003).
- Policy iterations on the Hamilton–Jacobi–Isaacs equation for state feedback control with input saturation. IEEE Transactions on Automatic Control (2006).
- Sobolev spaces (2003).
- Dynamic noncooperative game theory (1999).
- Neuro-dynamic programming (1996).
- The method of weighted residuals and variational principles (1990).
- On global existence of solutions to coupled matrix Riccati equations in closed loop Nash games. IEEE Transactions on Automatic Control.
- Adaptive neural control of uncertain MIMO nonlinear systems. IEEE Transactions on Neural Networks.
- Solving coupled algebraic Riccati equations from closed-loop Nash strategy, by lack of trust approach. International Journal of Tomography & Statistics.
Kyriakos G. Vamvoudakis was born in Athens Greece. He received the Diploma in Electronic and Computer Engineering from the Technical University of Crete, Greece in 2006 with highest honors and the M.Sc. degree in Electrical Engineering from The University of Texas at Arlington in 2008. He is currently working toward the Ph.D. degree and working as a research assistant at the Automation and Robotics Research Institute, The University of Texas at Arlington. He is coauthor of 2 book chapters, and 25 technical publications. His current research interests include approximate dynamic programming, game theory, neural network feedback control, optimal control, adaptive control and systems biology. He is a member of Tau Beta Pi, Eta Kappa Nu and Golden Key honor societies and is listed in Who’s Who in the world and Who’s Who in Science and Engineering. He received the Best Paper Award for Autonomous/Unmanned Vehicles at the 27th Army Science Conference in 2010. He also received the Best Student Award, UTA Automation & Robotics Research Institute in 2010. He has co-organized special sessions for several international conferences. Mr. Vamvoudakis is a registered Electrical/Computer engineer (PE) and member of Technical Chamber of Greece.
Frank L. Lewis, Fellow IEEE, Fellow IFAC, Fellow UK Institute of Measurement & Control, PE Texas, UK Chartered Engineer, is Distinguished Scholar Professor and Moncrief-O’Donnell Chair at University of Texas at Arlington’s Automation & Robotics Research Institute. He obtained the Bachelor’s Degree in Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, distributed control systems, and sensor networks. He is author of 6 US patents, 216 journal papers, 330 conference papers, 14 books, 44 chapters, and 11 journal special issues. He received the Fulbright Research Award, NSF Research Initiation Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award 2009, UK Inst Measurement & Control Honeywell Field Engineering Medal 2009. Received Outstanding Service Award from Dallas IEEE Section, selected as Engineer of the year by Ft. Worth IEEE Section. Listed in Ft. Worth Business Press Top 200 Leaders in Manufacturing. Received the 2010 IEEE Region 5 Outstanding Engineering Educator Award and the 2010 UTA Graduate Dean’s Excellence in Doctoral Mentoring Award. He served on the NAE Committee on Space Station in 1995. He is an elected Guest Consulting Professor at South China University of Technology and Shanghai Jiao Tong University. Founding Member of the Board of Governors of the Mediterranean Control Association. Helped win the IEEE Control Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of ARRI’s SBIR Program).
☆ This work was supported by the National Science Foundation ECS-0801330, the Army Research Office W91NF-05-1-0314, and the Air Force Office of Scientific Research FA9550-09-1-0278. This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Raul Ordóñez under the direction of Editor Miroslav Krstic.