Q-Learning in Regularized Mean-field Games

Anahtarci, Berkay; Kariksiz, Can Deha; Saldi, Naci

doi:10.1007/s13235-022-00450-2

Q-Learning in Regularized Mean-field Games

Published: 23 May 2022

Volume 13, pages 89–117, (2023)
Cite this article

Dynamic Games and Applications Aims and scope Submit manuscript

677 Accesses
5 Citations
1 Altmetric
Explore all metrics

Abstract

In this paper, we introduce a regularized mean-field game and study learning of this game under an infinite-horizon discounted reward function. Regularization is introduced by adding a strongly concave regularization function to the one-stage reward function in the classical mean-field game model. We establish a value iteration based learning algorithm to this regularized mean-field game using fitted Q-learning. The regularization term in general makes reinforcement learning algorithm more robust to the system components. Moreover, it enables us to establish error analysis of the learning algorithm without imposing restrictive convexity assumptions on the system components, which are needed in the absence of a regularization term.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Unified reinforcement Q-learning for mean field game and control problems

Article 15 January 2022

Andrea Angiuli, Jean-Pierre Fouque & Mathieu Laurière

Model-free method for LQ mean-field social control problems with one-dimensional state space

Article 15 April 2024

Zhenhui Xu & Tielong Shen

A myopic adjustment process for mean field games with finite state and action space

Article Open access 01 August 2023

Berenice Anne Neumann

Notes

In classical mean-field game literature, the exogenous behaviour of the other agents is in general modelled by a state-measure flow $\{\mu _t\}$, $\mu _t \in \mathcal {P}(\mathsf {X})$ for all t, which means that total population behaviour is non-stationary. In this paper, we only consider the stationary case; that is, $\mu _t = \mu $ for all t. Establishing a learning algorithm for the non-stationary case is more challenging.

References

Adlakha S, Johari R, Weintraub G (2015) Equilibria of dynamic games with many players: existence, approximation, and market structure. J Econ Theory 156:269–316
Article MathSciNet MATH Google Scholar
Anahtarci B, Kariksiz C, Saldi N (2019) Fitted Q-learning in mean-field games. arXiv:1912.13309
Anahtarci B, Kariksiz C, Saldi N (2020) Value iteration algorithm for mean field games. Syst Control Lett 143
Antos A, Munos R, Szepesvári C (2007) Fitted Q-iteration in continuous action-space MDPs. In: Proceedings of the 20th international conference on neural information processing systems, pp 9–16
Antos A, Munos R, Szepesvári C (2007) Fitted Q-iteration in continuous action-space MDPs. Tech. rep. inria-00185311v1
Bensoussan A, Frehse J, Yam P (2013) Mean field games and mean field type control theory. Springer, New York
Book MATH Google Scholar
Biswas A (2015) Mean field games with ergodic cost for discrete time Markov processes. arXiv:1510.08968
Cardaliaguet P (2011) Notes on mean-field games. Technical report, p 120
Carmona R, Delarue F (2013) Probabilistic analysis of mean-field games. SIAM J Control Optim 51(4):2705–2734
Article MathSciNet MATH Google Scholar
Carmona R, Lauriere M, Tan Z (2019) Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv:1910.04295
Elie R, Perolat J, Lauriere M, Geist M, Pietquin O (2019) Approximate fictitious play for mean-field games. arXiv:1907.02633
Elliot R, Li X, Ni Y (2013) Discrete time mean-field stochastic linear-quadratic optimal control problems. Automatica 49:3222–3233
Article MathSciNet MATH Google Scholar
Fu Z, Yang Z, Chen Y, Wang Z (2019) Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. arXiv:1910.07498
Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. arXiv:1901.11275
Georgii H (2011) Gibbs Measures and Phase Transitions. De Gruyter studies in mathematics. De Gruyter
Gomes D, Mohr J, Souza R (2010) Discrete time, finite state space mean field games. J Math Pures Appl 93:308–328
Article MathSciNet MATH Google Scholar
Gomes D, Saúde J (2014) Mean field games models: a brief survey. Dyn Games Appl 4(2):110–154
Article MathSciNet MATH Google Scholar
Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. arXiv:1901.09585
Huang M (2010) Large-population LQG games involving major player: the nash certainty equivalence principle. SIAM J Control Optim 48(5):3318–3353
Article MathSciNet MATH Google Scholar
Huang M, Caines P, Malhamé R (2007) Large-population cost coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized $\epsilon $-Nash equilibria. IEEE Trans Autom Control 52(9):1560–1571
Article MathSciNet MATH Google Scholar
Huang M, Malhamé R, Caines P (2006) Large population stochastic dynamic games: closed loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun Inform Syst 6:221–252
Article MathSciNet MATH Google Scholar
Kara AD, Yüksel S (2019) Robustness to incorrect priors in partially observed stochastic control. SIAM J Control Optim 57(3):1929–1964
Article MathSciNet MATH Google Scholar
Kara AD, Yüksel S (2020) Robustness to incorrect system models in stochastic control. SIAM J Control Optim 58(2):1144–1182
Article MathSciNet MATH Google Scholar
Kontorovich L, Ramanan K (2008) Concentration inequalities for dependent random variables via the martingale method. Ann Probab 36(6):2126–2158
Article MathSciNet MATH Google Scholar
Lasry J, Lions P (2007) Mean field games. Japan J Math 2:229–260
Article MathSciNet MATH Google Scholar
Mehta P, Meyn S (2009) Q-learning and Pontryagin’s minimum principle. In: Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp 3598–3605
Moon J, Başar T (2015) Discrete-time decentralized control using the risk-sensitive performance criterion in the large population regime: a mean field approach. In: ACC 2015. Chicago
Moon J, Başar T (2016) Discrete-time mean field Stackelberg games with a large number of followers. In: CDC 2016. Las Vegas
Moon J, Başar T (2016) Robust mean field games for coupled Markov jump linear systems. Int J Control 89(7):1367–1381
Article MathSciNet MATH Google Scholar
Neu G, Jonsson A, Gomez V (2017) A unified view of entropy-regularized Markov decision processes. arXiv:1705.07798
Nourian M, Nair G (2013) Linear-quadratic-Gaussian mean field games under high rate quantization. In: CDC 2013. Florence
Saldi N (2019) Discrete-time average-cost mean-field games on Polish spaces. arXiv:1908.08793 (accepted to Turkish Journal of Mathematics)
Saldi N, Başar T, Raginsky M (2018) Markov-Nash equilibria in mean-field games with discounted cost. SIAM J Control Optim 56(6):4256–4287
Article MathSciNet MATH Google Scholar
Saldi N, Başar T, Raginsky M (2019) Approximate Markov-Nash equilibria for discrete-time risk-sensitive mean-field games. to appear in Mathematics of Operations Research
Saldi N, Başar T, Raginsky M (2019) Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Math Oper Res 44(3):1006–1033
Article MathSciNet MATH Google Scholar
Shalev-Shwartz S (2007) Online learning: theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem
Tembine H, Zhu Q, Başar T (2014) Risk-sensitive mean field games. IEEE Trans Autom Control 59(4):835–850
Article MathSciNet MATH Google Scholar
Vidyasagar M (2010) Learning and generalization: with applications to neural networks, 2nd edn. Springer, New York
Google Scholar
Wiecek P (2020) Discrete-time ergodic mean-field games with average reward on compact spaces. Dyn Games Appl 10:222–256
Wiecek P, Altman E (2015) Stationary anonymous sequential games with undiscounted rewards. J Optim Theory Appl 166(2):686–710
Article MathSciNet MATH Google Scholar
Yang J, Ye X, Trivedi R, Hu X, Zha H (2018) Learning deep mean field games for modelling large population behaviour. arXiv:1711.03156
Yin H, Mehta P, Meyn S, Shanbhag U (2014) Learning in mean-field games. IEEE Trans Autom Control 59:629–644
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work was partly supported by the BAGEP Award of the Science Academy.

Funding

Funding was provided by Bilim Akademisi (Grant No. BAGEP 2021).

Author information

Authors and Affiliations

Özyeğin University, Çekmeköy, Istanbul, Turkey
Berkay Anahtarci & Can Deha Kariksiz
Bilkent University, Çankaya, Ankara, Turkey
Naci Saldi

Authors

Berkay Anahtarci
View author publications
You can also search for this author in PubMed Google Scholar
Can Deha Kariksiz
View author publications
You can also search for this author in PubMed Google Scholar
Naci Saldi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Naci Saldi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.

Appendix

1.1 Duality of Strong Convexity and Smoothness

Suppose that $\mathsf {E}= \mathbb {R}^d$ for some $d \ge 1$ with an inner product $\langle \cdot ,\cdot \rangle $. We denote $\mathbb {R}^{*} = \mathbb {R}\, \bigcup \, \{\infty \}$. Let $f:\mathsf {E}\rightarrow \mathbb {R}^{*}$ be a differentiable convex function with the domain $S :=\{x \in \mathsf {E}: f(x) \in \mathbb {R}\}$, which is necessarily convex subset of $\mathsf {E}$. The Fenchel conjugate of f is a convex function $f^*:\mathsf {E}\rightarrow \mathbb {R}^*$ that is defined as

$$\begin{aligned} f^*(y) \,{:=}\, \sup _{x \in S} \, \langle x,y \rangle - f(x). \end{aligned}$$

Now, we will state duality result between strong convexity and smoothness. To this end, we suppose that f is $\rho $-strongly convex with respect to a norm $\Vert \cdot \Vert $ on $\mathsf {E}$ (not necessarily Euclidean norm); that is, for all $x,y \in S$, we have

$$\begin{aligned} f(y) \ge f(x) + \langle \nabla f(x),y-x \rangle + \frac{1}{2} \rho \Vert y-x\Vert ^2. \end{aligned}$$

To state the result, we need to define the dual norm of $\Vert \cdot \Vert $. The dual norm $\Vert \cdot \Vert _*$ of $\Vert \cdot \Vert $ on $\mathsf {E}$ is defined as

$$\begin{aligned} \Vert z\Vert _* \,{:=} \, \sup \{\langle z,x \rangle : \Vert x\Vert \le 1\}. \end{aligned}$$

For example, $\Vert \cdot \Vert _{\infty }$ is the dual norm of $\Vert \cdot \Vert _{1}$.

Proposition 3

([36, Lemma 15]) Let $f:\mathsf {E}\rightarrow \mathbb {R}^*$ be a differentiable $\rho $-strongly convex function with respect to the norm $\Vert \cdot \Vert $ and let S denote its domain. Then,

1.
$f^*$ is differentiable on $\mathsf {E}$.
2.
$\nabla f^*(y) = \mathop {\mathrm{arg\, max}}_{x \in S} \langle x,y \rangle - f(x)$.
3.
$f^*$ is $\frac{1}{\rho }$-smooth with respect to the norm $\Vert \cdot \Vert _*$ ; that is,
$$\begin{aligned} \Vert \nabla f^*(y_1) - \nabla f^*(y_2)\Vert \le \frac{1}{\rho } \Vert y_1-y_2\Vert _* \,\, \text {for all} \,\, y_1,y_2 \in \mathsf {E}. \end{aligned}$$

In the paper, we make use of the properties 2 and 3 of Proposition 3 to establish the Lipschitz continuity of the optimal policies, which enables us to prove the main results of our paper.

1.2 Proof of Proposition 1

Fix any $x,\hat{x}, u,\hat{u}, \mu , \hat{\mu }$. Let us recall the following fact about $l_1$ norm on the set probability distributions on finite sets [15, p. 141]. Suppose that there exists a real valued function F on a finite set $\mathsf {E}$. Let $\lambda (F) :=\sup _{e \in \mathsf {E}} F(e) - \inf _{e \in \mathsf {E}} F(e)$. Then, for any pair of probability distributions $\mu ,\nu $ on $\mathsf {E}$, we have

$$\begin{aligned} \left| \sum _{e} F(e) \, \mu (e) - \sum _{e} F(e) \, \nu (e) \right| \le \frac{\lambda (F)}{2} \, \Vert \mu -\nu \Vert _{1}. \end{aligned}$$

(6)

Using this fact, we now have

$$\begin{aligned} |R(x,u,\mu ) - R(\hat{x},\hat{u},\hat{\mu })|&= \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} r(\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\le \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, \hat{u}(a) \right| \\&\quad + \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} r(\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\le L_1 \, \left( 1_{\{x \ne \hat{x}\}} + \Vert u-\hat{u}\Vert _1 +\Vert \mu -\hat{\mu }\Vert _1 \right) , \end{aligned}$$

where the last inequality follows from the following fact in view of (6):

$$\begin{aligned} \sup _{a} r(x,a,\mu ) - \inf _{a} r(x,a,\mu )&\, {:=}\, r(x,a_{\max },\mu ) - r(x,a_{\min },\mu ) \\&\le 2 L_1 \, 1_{\{a_{\max } \ne a_{\min }\}} = 2L_1. \end{aligned}$$

Similarly, we have

$$\begin{aligned}&\Vert P(\cdot |x,u,\mu ) - P(\cdot |\hat{x},\hat{u},\hat{\mu })\Vert _1 \\&\quad = \sum _{y \in \mathsf {X}} \left| P(y|x,u,\mu ) - P(y|\hat{x},\hat{u},\hat{\mu }) \right| \\&\quad = \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \le \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) \right| \\&\qquad + \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \overset{(I)}{\le } K_1 \Vert u-\hat{u}\Vert _1 + \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \le K_1 \, \left( 1_{\{x \ne \hat{x}\}} + \Vert u-\hat{u}\Vert _1 +\Vert \mu -\hat{\mu }\Vert _1 \right) . \end{aligned}$$

To show that (I) follows from Assumption 1-(b), let us define the transition probability $M:\mathsf {A}\rightarrow \mathcal {P}(\mathsf {X})$ as

$$\begin{aligned} M(\cdot |a) :=p(\cdot |x,a,\mu ). \end{aligned}$$

Let $\xi \in \mathcal {P}(\mathsf {A}\times \mathsf {A})$ be the optimal coupling of u and $\hat{u}$ that achieves total variation distance $\Vert u-\hat{u}\Vert _{TV}$. Similarly, for any $a,\hat{a} \in \mathsf {A}$, let $K(\cdot |a,\hat{a}) \in \mathcal {P}(\mathsf {X}\times \mathsf {X})$ be the optimal coupling of $M(\cdot |a)$ and $M(\cdot |\hat{a})$ that achieves total variation distance $\Vert M(\cdot |a)-M(\cdot |\hat{a})\Vert _{TV}$. Note that

$$\begin{aligned} \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) \right| = 2 \Vert u M-\hat{u} M\Vert _{TV}, \end{aligned}$$

where

$$\begin{aligned} u M(\cdot ) :=\sum _{a \in \mathsf {A}} M(\cdot |a) \, u(a) \end{aligned}$$

and

$$\begin{aligned} \hat{u} M(\cdot ) :=\sum _{a \in \mathsf {A}} M(\cdot |a) \, \hat{u}(a). \end{aligned}$$

Let us define $\nu (\cdot ) :=\sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} K(\cdot |a,\hat{a}) \, \xi (a,\hat{a})$, and so, $\nu $ is a coupling of uM and $\hat{u} M$. Therefore, we have

$$\begin{aligned} 2 \, \Vert u M-\hat{u} M\Vert _{TV}&\le 2 \sum _{(x,y) \in \mathsf {X}\times \mathsf {X}} 1_{\{x \ne y\}} \, \nu (x,y) \\&= 2 \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} \sum _{(x,y) \in \mathsf {X}\times \mathsf {X}} 1_{\{x \ne y\}} \, K(x,y|a,\hat{a}) \, \xi (a,\hat{a})\\&= \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} \Vert M(\cdot |a)-M(\cdot |\hat{a})\Vert _{1} \, \xi (a,\hat{a}) \\&\le 2 \, K_1 \, \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} 1_{\{a \ne \hat{a}\}} \, \xi (a,\hat{a}) \\&= K_1 \, \Vert u-\hat{u}\Vert _{1}. \end{aligned}$$

Hence, (I) follows. This completes the proof.

1.3 Proof of Lemma 1

Fix any $\mu $. If a function $f: \mathsf {X}\rightarrow \mathbb {R}$ is K-Lipschitz continuous for some K, then $g = \frac{f}{K}$ is 1-Lipschitz continuous. Hence, for all $u \in \mathsf {U}$ and $z,y \in \mathsf {X}$ we have

$$\begin{aligned}&\biggl | \sum _{x} f(x) P(x|z,u,\mu ) - \sum _{x} f(x) P(x|y,u,\mu ) \biggr | \\&\quad = K \biggl | \sum _{x} g(x) P(x|z,u,\mu ) - \sum _{x} g(x) P(x|y,u,\mu ) \biggr | \\&\quad \le \frac{K}{2} \, \Vert P(\,\cdot \,|z,u,\mu ) - P(\,\cdot \,|y,u,\mu )\Vert _1 \, \text {(by (6))}\\&\quad \le \frac{KK_1}{2} \, 1_{\{z \ne y\}}, \, \text {(by Proposition 1)} \end{aligned}$$

since $\sup _x g(x) - \inf _x g(x) \le 1$. Hence, the contraction operator $T_{\mu }$ maps K-Lipschitz functions to $L_1+\beta K K_1/2$-Lipschitz functions, since, for all $z,y \in \mathsf {X}$

$$\begin{aligned} | T_{\mu }f(z) - T_{\mu }f(y) |&\le \sup _{u} \biggl \{ |R(z,u,\mu ) - R(y,u,\mu )| \\&\quad + \beta \biggl | \sum _{x} f(x) P(x|z,u,\mu ) - \sum _{x} f(x) P(x|y,u,\mu ) \biggr | \biggr \}\\&\le L_1 1_{\{z\ne y\}} + \beta \frac{K K_1}{2} 1_{\{z \ne y\}} = \biggl (L_1 + \beta \frac{K K_1}{2}\biggr ) 1_{\{z \ne y\}}. \end{aligned}$$

Now we apply $T_{\mu }$ recursively to obtain the sequence $\{T_{\mu }^n f\}$ by letting $T_{\mu }^n f = T_{\mu } (T_{\mu }^{n-1} f )$, which converges to the value function $Q^{\mathop {\mathrm{reg}},*}_{\mu ,\max }$ by the Banach fixed point theorem. Clearly, by mathematical induction, we have for all $n\ge 1$, $T_{\mu }^n f$ is $K_n$-Lipschitz continuous, where $K_n = L_1 \sum _{i=0}^{n-1} (\beta K_1/2)^i + K (\beta K_1/2)^n$. If we choose $K < L_1$, then $K_n \le K_{n+1}$ for all n and therefore, $K_n \uparrow \frac{ L_1}{1-\beta K_1/2}$. Hence, $T_{\mu }^n f$ is $ \frac{L_1}{1-\beta K_1/2}$-Lipschitz continuous for all n, and therefore, $Q^{\mathop {\mathrm{reg}},*}_{\mu ,\max }$ is also $ \frac{L_1}{1-\beta K_1/2}$-Lipschitz continuous.

1.4 Proof of Lemma 2

Under Assumption 1, it is straightforward to prove that $H_1$ maps $\mathcal {P}(\mathsf {X})$ into ${\mathcal {C}}$. Indeed, the only non-trivial fact is the $\left( K_{\mathop {\mathrm{Lip}}}+L_{\mathop {\mathrm{reg}}}\right) $-Lipschitz continuity of $H_1(\mu ) =:Q_{\mu }^{\mathop {\mathrm{reg}},*}$. This can be proved as follows: For any (x, u) and $(\hat{x},\hat{u})$, we have

$$\begin{aligned} |Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u)-Q_{\mu }^{\mathop {\mathrm{reg}},*}(\hat{x},\hat{u})|&= |R(x,u,\mu ) - \varOmega (u) + \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) \\&\quad - R(\hat{x},\hat{u},\mu ) + \varOmega (\hat{u}) - \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|\hat{x},\hat{u},\mu )| \\&\le L_1 (1_{\{x\ne \hat{x}\}}+\Vert u-\hat{u}\Vert _1) + L_{\mathop {\mathrm{reg}}} \Vert u-\hat{u}\Vert _1 \\&\quad + \beta \frac{K_1 K_{\mathop {\mathrm{Lip}}}}{2} \, (1_{\{x\ne \hat{x}\}}+\Vert u-\hat{u}\Vert _1), \end{aligned}$$

where the last inequality follows from (6) and Lemma 1. Hence, $Q_{\mu }^{\mathop {\mathrm{reg}},*}$ is $\left( K_{\mathop {\mathrm{Lip}}}+L_{\mathop {\mathrm{reg}}}\right) $-Lipschitz continuous.

Now, for any $\mu ,{\hat{\mu }}\in \mathcal {P}(\mathsf {X})$, we have

$$\begin{aligned} \Vert H_1(\mu ) - H_1({\hat{\mu }})\Vert _{\infty }&= \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*}-Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty } \\&= \sup _{x,u} \bigg | R(x,u,\mu ) + \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) \\&\quad - R(x,u,{\hat{\mu }}) - \beta \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) \bigg | \\&\le L_1 \, \Vert \mu -{\hat{\mu }}\Vert _1 \\&\quad + \beta \left| \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) - \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }})\right| \\&\quad + \beta \left| \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) \right| \\&\le L_1 \, \Vert \mu -{\hat{\mu }}\Vert _1 + \frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2} \, \Vert \mu -{\hat{\mu }}\Vert _1 + \beta \, \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*}-Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty }, \end{aligned}$$

where the last inequality follows from (6) and Lemma 1. This completes the proof.

1.5 Proof of Lemma 3

For any $\mu \in \mathcal {P}(\mathsf {X})$, we have

$$\begin{aligned} Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u)&= L_{\mu } Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u) \\&= R(x,u,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, P(y|x,u,\mu ) - \varOmega (u) \\&= \langle q_{x}^{\mu },u \rangle - \varOmega (u), \end{aligned}$$

where $ q_{x}^{\mu }(\cdot ) :=r(x,\cdot ,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|x,\cdot ,\mu ). $ By $\rho $-strong convexity of $\varOmega $, $Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,\cdot )$ has a unique maximizer $f_{\mu }(x) \in \mathsf {U}$ for any $x \in \mathsf {X}$, which is the optimal policy for $\mu $. By Property 2 of Proposition 3, we have

$$\begin{aligned} f_{\mu }(x) = \nabla \varOmega ^*(q_x^{\mu }), \end{aligned}$$

where $\varOmega ^*$ is the Fenchel conjugate of $\varOmega $, and $\varOmega ^*(q_x^{\mu }) = Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(x)$.

Moreover, for any $\mu ,{\hat{\mu }}\in \mathcal {P}(\mathsf {X})$ and $x,\hat{x}\in \mathsf {X}$, by property 3 of Proposition 3 and by noting the fact that $\Vert \cdot \Vert _{\infty }$ is the dual norm of $\Vert \cdot \Vert _1$ on $\mathsf {U}$, we obtain the following bound:

$$\begin{aligned} \Vert f_{\mu }(x)-f_{{\hat{\mu }}}(\hat{x})\Vert _1 \le \frac{1}{\rho } \, \Vert q_{x}^{\mu }-q_{\hat{x}}^{{\hat{\mu }}}\Vert _{\infty }. \end{aligned}$$

Note that we have

$$\begin{aligned} \Vert q_{x}^{\mu }-q_{\hat{x}}^{{\hat{\mu }}}\Vert _{\infty }&=\sup _{a \in \mathsf {A}} \bigg |r(x,a,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|x,a,\mu ) \\&\quad - r(\hat{x},a,{\hat{\mu }}) - \beta \sum _{y \in \mathsf {X}} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|\hat{x},a,{\hat{\mu }}) \bigg | \\&\le L_1 (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) \\&\quad + \beta \sup _{a \in \mathsf {A}} \bigg | \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) \bigg | \\&\quad + \beta \sup _{a \in \mathsf {A}} \bigg | \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|\hat{x},a,{\hat{\mu }}) \bigg |\\&\le L_1 (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) + \beta \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*} - Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty } \\&\quad + \beta \frac{K_1 K_{\mathop {\mathrm{Lip}}}}{2} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) \\&\le K_{\mathop {\mathrm{Lip}}} \, (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) + \beta K_{H_1} \, \Vert \mu -{\hat{\mu }}\Vert _1 \\&\le K_{H_1} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1). \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \Vert f_{\mu }(x)-f_{{\hat{\mu }}}(\hat{x})\Vert _1 \le \frac{1}{\rho } \, K_{H_1} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1). \end{aligned}$$

1.6 Proof of Theorem 1

Let $\mu _{\varepsilon } \in \varLambda ^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon })$. Then, we have

$$\begin{aligned} \Vert \mu _{\varepsilon }-\mu _*\Vert _1&= \sum _{y} \, \bigg | \sum _{x} \, P(y|x,\pi _{\varepsilon },\mu _{\varepsilon }) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _*(x) \biggr | \\&\le \sum _{y} \, \bigg | \sum _{x} \, P(y|x,\pi _{\varepsilon },\mu _{\varepsilon }) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _{\varepsilon }(x) \biggr | \\&\quad + \sum _{y} \, \bigg | \sum _{x} \,P(y|x,\pi _*(x),\mu _*) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _*(x) \biggr | \\&\overset{(I)}{\le } \sum _{x} \left\| P(\cdot |x,\pi _{\varepsilon }(x),\mu _{\varepsilon })-P(\cdot |x,\pi _*(x),\mu _*) \right\| _1 \mu _{\varepsilon }(x) \\&\quad + \frac{K_1}{2} \left( 1 + \frac{K_{H_1}}{\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \\&\le K_1 \left( \sup _x \Vert \pi _{\varepsilon }(x)-\pi _*(x)\Vert _1 + \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \right) \\&\quad + \frac{K_1}{2} \left( 1 + \frac{K_{H_1}}{\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \\&\le K_1 \, \varepsilon + \left( \frac{3 \, K_1}{2} + \frac{K_1 \, K_{H_1}}{2\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1. \end{aligned}$$

Note that Lemma 3 and Proposition 1 lead to

$$\begin{aligned} \Vert P(\cdot |x,\pi _*(x),\mu _*)-P(\cdot |y,\pi _*(y),\mu _*)\Vert _1 \le K_1 \left( 1 + \frac{K_{H_1}}{\rho } \right) \, 1_{\{x \ne y\}}. \end{aligned}$$

Hence, (I) follows from [24, Lemma A2]. Therefore, we have:

$$\begin{aligned} \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \le \frac{K_1 \, \varepsilon }{1-C_1}, \end{aligned}$$

where $C_1 :=\left( \frac{3 \, K_1}{2} + \frac{K_1 \, K_{H_1}}{2\rho } \right) $. Note that by Assumption 2, $C_1 < 1$. Now, fix any policy $\pi \in \varPi $. Then, we have

$$\begin{aligned}&\Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty } \\&\quad =\sup _{x} \bigg | R^{\mathop {\mathrm{reg}}}(x,\pi (x),\mu _*) + \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _*) \\&\qquad -R^{\mathop {\mathrm{reg}}}(x,\pi (x),\mu _{\varepsilon }) - \beta \, \sum _{y} J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg | \\&\quad \le L_1 \, \Vert \mu _*-\mu _{\varepsilon }\Vert _1 \\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg |\\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon }) - \sum _{y} J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg | \\&\quad \overset{(II)}{\le } \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \Vert \mu _*-\mu _{\varepsilon }\Vert _1 + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty }\\&\quad \le \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \frac{K_1 \varepsilon }{1-C_1}+ \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty }. \end{aligned}$$

Here, (II) follows from (6) and the fact that $J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )$ is $K_{\mathop {\mathrm{Lip}}}$-Lipschitz continuous, which can be proved as in Lemma 1. Therefore, we obtain

$$\begin{aligned} \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty } \le \frac{C_2 \, \varepsilon }{1-\beta }, \end{aligned}$$

(7)

where $C_2 :=\left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \frac{K_1}{1-C_1}$.

Note that we also have

$$\begin{aligned}&\quad \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }\\&\quad =\sup _{x} \bigg | R^{\mathop {\mathrm{reg}}}(x,\pi _*(x),\mu _*) + \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _*(x),\mu _*) \\&\qquad -R^{\mathop {\mathrm{reg}}}(x,\pi _{\varepsilon }(x),\mu _*) - \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg | \\&\quad \le (L_1 + L_{\mathop {\mathrm{reg}}}) \, \sup _{x} \Vert \pi _*(x)-\pi _{\varepsilon }(x)\Vert _1 \\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _*(x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg |\\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg | \\&\quad \overset{(III)}{\le } \left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \sup _x \Vert \pi _*(x)-\pi _{\varepsilon }(x)\Vert _1\\&\qquad + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }\\&\quad \le \left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \, \varepsilon + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }. \end{aligned}$$

Here, (III) follows from (6) and the fact that $J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )$ is $K_{\mathop {\mathrm{Lip}}}$-Lipschitz continuous, which can be proved as in Lemma 1. Therefore, we obtain

$$\begin{aligned} \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty } \le \frac{C_3\varepsilon }{1-\beta }, \end{aligned}$$

(8)

where $C_3 :=\left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) $.

Note that we must prove that

$$\begin{aligned} J_i^{(N)}({\varvec{\pi }}^{(N)})&\ge \sup _{\pi ^i \in \varPi _i} J_i^{(N)}({\varvec{\pi }}^{(N)}_{-i},\pi ^i) - \tau \, \varepsilon - \delta \end{aligned}$$

(9)

for each $i=1,\ldots ,N$, when N is sufficiently large. As the transition probabilities and the one-stage reward functions are the same for all agents, it is sufficient to prove (9) for Agent 1 only. Given $\delta > 0$, for each $N\ge 1$, let ${\tilde{\pi }}^{(N)} \in \varPi _1$ be such that

$$\begin{aligned} J_1^{(N)} ({\tilde{\pi }}^{(N)},\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) > \sup _{\pi ' \in \varPi _1} J_1^{(N)} (\pi ',\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \frac{\delta }{3}. \end{aligned}$$

Then, by [33, Theorem 4.10], we have

Therefore, there exists $N(\delta )$ such that

$$\begin{aligned}&\sup _{\pi ' \in \varPi _1} J_1^{(N)} (\pi ',\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \delta - \tau \,\varepsilon \\&\quad \le J_1^{(N)} ({\tilde{\pi }}^{(N)},\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \frac{2\delta }{3} - \tau \, \varepsilon \\&\quad \le J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon }) - \frac{\delta }{3} \\&\quad \le J_1^{(N)} (\pi _{\varepsilon },\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }). \end{aligned}$$

for all $N\ge N(\delta )$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Anahtarci, B., Kariksiz, C.D. & Saldi, N. Q-Learning in Regularized Mean-field Games. Dyn Games Appl 13, 89–117 (2023). https://doi.org/10.1007/s13235-022-00450-2

Download citation

Accepted: 24 April 2022
Published: 23 May 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s13235-022-00450-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Q-Learning in Regularized Mean-field Games

Abstract

Access this article

Similar content being viewed by others

Unified reinforcement Q-learning for mean field game and control problems

Model-free method for LQ mean-field social control problems with one-dimensional state space

A myopic adjustment process for mean field games with finite state and action space

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix

1.1 Duality of Strong Convexity and Smoothness

Proposition 3

1.2 Proof of Proposition 1

1.3 Proof of Lemma 1

1.4 Proof of Lemma 2

1.5 Proof of Lemma 3

1.6 Proof of Theorem 1

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Q-Learning in Regularized Mean-field Games

Abstract

Access this article

Similar content being viewed by others

Unified reinforcement Q-learning for mean field game and control problems

Model-free method for LQ mean-field social control problems with one-dimensional state space

A myopic adjustment process for mean field games with finite state and action space

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix

Appendix

1.1 Duality of Strong Convexity and Smoothness

Proposition 3

1.2 Proof of Proposition 1

1.3 Proof of Lemma 1

1.4 Proof of Lemma 2

1.5 Proof of Lemma 3

1.6 Proof of Theorem 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation