
Q-Learning in Regularized Mean-field Games

Dynamic Games and Applications

Abstract

In this paper, we introduce a regularized mean-field game and study learning of this game under an infinite-horizon discounted reward criterion. Regularization is introduced by adding a strongly concave regularization function to the one-stage reward function in the classical mean-field game model. We establish a value-iteration-based learning algorithm for this regularized mean-field game using fitted Q-learning. In general, the regularization term makes the reinforcement learning algorithm more robust to the system components. Moreover, it enables us to establish an error analysis of the learning algorithm without imposing restrictive convexity assumptions on the system components, which would be needed in the absence of a regularization term.
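To make the scheme concrete, here is a minimal model-based sketch of the regularized mean-field iteration in the tabular setting, assuming an entropy regularizer; it replaces the paper's fitted Q-learning (which works from samples and a function class) with exact Bellman updates, and the function names (`r_fn`, `p_fn`) are illustrative, not from the paper.

```python
import numpy as np

def regularized_mfe(r_fn, p_fn, n_states, n_actions,
                    beta=0.9, outer_iters=50, inner_iters=200):
    """Sketch only: r_fn(mu) -> (nX, nA) rewards, p_fn(mu) -> (nX, nA, nX)
    transition kernel, beta = discount factor."""
    mu = np.full(n_states, 1.0 / n_states)           # initial state distribution
    for _ in range(outer_iters):
        r, p = r_fn(mu), p_fn(mu)                    # freeze the mean-field term
        Q = np.zeros((n_states, n_actions))
        for _ in range(inner_iters):                 # regularized value iteration
            V = np.log(np.exp(Q).sum(axis=1))        # soft maximum: log-sum-exp
            Q = r + beta * (p @ V)                   # soft Bellman update
        pi = np.exp(Q - Q.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)          # softmax = maximizing policy
        mu = mu @ np.einsum('xa,xay->xy', pi, p)     # push mu forward under pi
    return mu, pi
```

The outer loop mirrors the mean-field consistency step (update the population distribution under the current optimal policy), while the inner loop computes the regularized Q-function for the frozen distribution.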


Notes

  1. In the classical mean-field game literature, the exogenous behaviour of the other agents is in general modelled by a state-measure flow \(\{\mu _t\}\), \(\mu _t \in \mathcal {P}(\mathsf {X})\) for all t, which means that the total population behaviour is non-stationary. In this paper, we only consider the stationary case; that is, \(\mu _t = \mu \) for all t. Establishing a learning algorithm for the non-stationary case is more challenging.

References

  1. Adlakha S, Johari R, Weintraub G (2015) Equilibria of dynamic games with many players: existence, approximation, and market structure. J Econ Theory 156:269–316


  2. Anahtarci B, Kariksiz C, Saldi N (2019) Fitted Q-learning in mean-field games. arXiv:1912.13309

  3. Anahtarci B, Kariksiz C, Saldi N (2020) Value iteration algorithm for mean field games. Syst Control Lett 143

  4. Antos A, Munos R, Szepesvári C (2007) Fitted Q-iteration in continuous action-space MDPs. In: Proceedings of the 20th international conference on neural information processing systems, pp 9–16

  5. Antos A, Munos R, Szepesvári C (2007) Fitted Q-iteration in continuous action-space MDPs. Tech. rep. inria-00185311v1

  6. Bensoussan A, Frehse J, Yam P (2013) Mean field games and mean field type control theory. Springer, New York


  7. Biswas A (2015) Mean field games with ergodic cost for discrete time Markov processes. arXiv:1510.08968

  8. Cardaliaguet P (2011) Notes on mean-field games. Technical report, p 120

  9. Carmona R, Delarue F (2013) Probabilistic analysis of mean-field games. SIAM J Control Optim 51(4):2705–2734


  10. Carmona R, Lauriere M, Tan Z (2019) Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv:1910.04295

  11. Elie R, Perolat J, Lauriere M, Geist M, Pietquin O (2019) Approximate fictitious play for mean-field games. arXiv:1907.02633

  12. Elliot R, Li X, Ni Y (2013) Discrete time mean-field stochastic linear-quadratic optimal control problems. Automatica 49:3222–3233


  13. Fu Z, Yang Z, Chen Y, Wang Z (2019) Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. arXiv:1910.07498

  14. Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. arXiv:1901.11275

  15. Georgii H (2011) Gibbs measures and phase transitions. De Gruyter Studies in Mathematics. De Gruyter

  16. Gomes D, Mohr J, Souza R (2010) Discrete time, finite state space mean field games. J Math Pures Appl 93:308–328


  17. Gomes D, Saúde J (2014) Mean field games models: a brief survey. Dyn Games Appl 4(2):110–154


  18. Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. arXiv:1901.09585

  19. Huang M (2010) Large-population LQG games involving a major player: the Nash certainty equivalence principle. SIAM J Control Optim 48(5):3318–3353


  20. Huang M, Caines P, Malhamé R (2007) Large-population cost coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized \(\epsilon \)-Nash equilibria. IEEE Trans Autom Control 52(9):1560–1571


  21. Huang M, Malhamé R, Caines P (2006) Large population stochastic dynamic games: closed loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun Inform Syst 6:221–252


  22. Kara AD, Yüksel S (2019) Robustness to incorrect priors in partially observed stochastic control. SIAM J Control Optim 57(3):1929–1964


  23. Kara AD, Yüksel S (2020) Robustness to incorrect system models in stochastic control. SIAM J Control Optim 58(2):1144–1182


  24. Kontorovich L, Ramanan K (2008) Concentration inequalities for dependent random variables via the martingale method. Ann Probab 36(6):2126–2158


  25. Lasry J, Lions P (2007) Mean field games. Japan J Math 2:229–260


  26. Mehta P, Meyn S (2009) Q-learning and Pontryagin’s minimum principle. In: Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp 3598–3605

  27. Moon J, Başar T (2015) Discrete-time decentralized control using the risk-sensitive performance criterion in the large population regime: a mean field approach. In: ACC 2015. Chicago

  28. Moon J, Başar T (2016) Discrete-time mean field Stackelberg games with a large number of followers. In: CDC 2016. Las Vegas

  29. Moon J, Başar T (2016) Robust mean field games for coupled Markov jump linear systems. Int J Control 89(7):1367–1381


  30. Neu G, Jonsson A, Gomez V (2017) A unified view of entropy-regularized Markov decision processes. arXiv:1705.07798

  31. Nourian M, Nair G (2013) Linear-quadratic-Gaussian mean field games under high rate quantization. In: CDC 2013. Florence

  32. Saldi N (2019) Discrete-time average-cost mean-field games on Polish spaces. arXiv:1908.08793 (accepted to Turkish Journal of Mathematics)

  33. Saldi N, Başar T, Raginsky M (2018) Markov-Nash equilibria in mean-field games with discounted cost. SIAM J Control Optim 56(6):4256–4287


  34. Saldi N, Başar T, Raginsky M (2019) Approximate Markov-Nash equilibria for discrete-time risk-sensitive mean-field games. Math Oper Res (to appear)

  35. Saldi N, Başar T, Raginsky M (2019) Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Math Oper Res 44(3):1006–1033


  36. Shalev-Shwartz S (2007) Online learning: theory, algorithms, and applications. Ph.D. thesis, The Hebrew University of Jerusalem

  37. Tembine H, Zhu Q, Başar T (2014) Risk-sensitive mean field games. IEEE Trans Autom Control 59(4):835–850


  38. Vidyasagar M (2010) Learning and generalization: with applications to neural networks, 2nd edn. Springer, New York


  39. Wiecek P (2020) Discrete-time ergodic mean-field games with average reward on compact spaces. Dyn Games Appl 10:222–256

  40. Wiecek P, Altman E (2015) Stationary anonymous sequential games with undiscounted rewards. J Optim Theory Appl 166(2):686–710


  41. Yang J, Ye X, Trivedi R, Hu X, Zha H (2018) Learning deep mean field games for modelling large population behaviour. arXiv:1711.03156

  42. Yin H, Mehta P, Meyn S, Shanbhag U (2014) Learning in mean-field games. IEEE Trans Autom Control 59:629–644



Acknowledgements

This work was partly supported by the BAGEP Award of the Science Academy.

Funding

Funding was provided by Bilim Akademisi (Grant No. BAGEP 2021).

Author information


Corresponding author

Correspondence to Naci Saldi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.

Appendix

1.1 Duality of Strong Convexity and Smoothness

Suppose that \(\mathsf {E}= \mathbb {R}^d\) for some \(d \ge 1\), equipped with an inner product \(\langle \cdot ,\cdot \rangle \). We denote \(\mathbb {R}^{*} = \mathbb {R}\cup \{\infty \}\). Let \(f:\mathsf {E}\rightarrow \mathbb {R}^{*}\) be a differentiable convex function with domain \(S :=\{x \in \mathsf {E}: f(x) \in \mathbb {R}\}\), which is necessarily a convex subset of \(\mathsf {E}\). The Fenchel conjugate of f is the convex function \(f^*:\mathsf {E}\rightarrow \mathbb {R}^*\) defined as

$$\begin{aligned} f^*(y) \,{:=}\, \sup _{x \in S} \, \langle x,y \rangle - f(x). \end{aligned}$$

We now state the duality result between strong convexity and smoothness. To this end, we suppose that f is \(\rho \)-strongly convex with respect to a norm \(\Vert \cdot \Vert \) on \(\mathsf {E}\) (not necessarily the Euclidean norm); that is, for all \(x,y \in S\), we have

$$\begin{aligned} f(y) \ge f(x) + \langle \nabla f(x),y-x \rangle + \frac{1}{2} \rho \Vert y-x\Vert ^2. \end{aligned}$$

To state the result, we need to define the dual norm of \(\Vert \cdot \Vert \). The dual norm \(\Vert \cdot \Vert _*\) of \(\Vert \cdot \Vert \) on \(\mathsf {E}\) is defined as

$$\begin{aligned} \Vert z\Vert _* \,{:=} \, \sup \{\langle z,x \rangle : \Vert x\Vert \le 1\}. \end{aligned}$$

For example, \(\Vert \cdot \Vert _{\infty }\) is the dual norm of \(\Vert \cdot \Vert _{1}\).

Proposition 3

([36, Lemma 15]) Let \(f:\mathsf {E}\rightarrow \mathbb {R}^*\) be a differentiable \(\rho \)-strongly convex function with respect to the norm \(\Vert \cdot \Vert \) and let S denote its domain. Then,

  1. \(f^*\) is differentiable on \(\mathsf {E}\).

  2. \(\nabla f^*(y) = \mathop {\mathrm{arg\, max}}_{x \in S} \langle x,y \rangle - f(x)\).

  3. \(f^*\) is \(\frac{1}{\rho }\)-smooth with respect to the norm \(\Vert \cdot \Vert _*\); that is,

     $$\begin{aligned} \Vert \nabla f^*(y_1) - \nabla f^*(y_2)\Vert \le \frac{1}{\rho } \Vert y_1-y_2\Vert _* \,\, \text {for all} \,\, y_1,y_2 \in \mathsf {E}. \end{aligned}$$

In the paper, we make use of properties 2 and 3 of Proposition 3 to establish the Lipschitz continuity of the optimal policies, which enables us to prove the main results of our paper.
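As a quick numerical sanity check of property 3 (our illustration, not part of the paper): the negative entropy \(\varOmega (u) = \sum _a u(a) \log u(a)\) is 1-strongly convex with respect to \(\Vert \cdot \Vert _1\) on the probability simplex by Pinsker's inequality, its conjugate is the log-sum-exp function, and the gradient of the conjugate is the softmax map, which must therefore be 1-Lipschitz from \(\Vert \cdot \Vert _{\infty }\) to \(\Vert \cdot \Vert _1\).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

# Property 3 with rho = 1: the softmax map (gradient of the conjugate of
# negative entropy) is 1-Lipschitz from ||.||_inf to ||.||_1.
for _ in range(10_000):
    q1, q2 = rng.normal(size=5), rng.normal(size=5)
    lhs = np.abs(softmax(q1) - softmax(q2)).sum()   # ||grad f*(q1) - grad f*(q2)||_1
    rhs = np.abs(q1 - q2).max()                     # (1/rho) ||q1 - q2||_inf
    assert lhs <= rhs + 1e-12
```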

1.2 Proof of Proposition 1

Fix any \(x,\hat{x}, u,\hat{u}, \mu , \hat{\mu }\). Let us recall the following fact about the \(l_1\) norm on the set of probability distributions on a finite set [15, p. 141]. Let F be a real-valued function on a finite set \(\mathsf {E}\), and let \(\lambda (F) :=\sup _{e \in \mathsf {E}} F(e) - \inf _{e \in \mathsf {E}} F(e)\). Then, for any pair of probability distributions \(\mu ,\nu \) on \(\mathsf {E}\), we have

$$\begin{aligned} \left| \sum _{e} F(e) \, \mu (e) - \sum _{e} F(e) \, \nu (e) \right| \le \frac{\lambda (F)}{2} \, \Vert \mu -\nu \Vert _{1}. \end{aligned}$$
(6)

Using this fact, we now have

$$\begin{aligned} |R(x,u,\mu ) - R(\hat{x},\hat{u},\hat{\mu })|&= \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} r(\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\le \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, \hat{u}(a) \right| \\&\quad + \left| \sum _{a \in \mathsf {A}} r(x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} r(\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\le L_1 \, \left( 1_{\{x \ne \hat{x}\}} + \Vert u-\hat{u}\Vert _1 +\Vert \mu -\hat{\mu }\Vert _1 \right) , \end{aligned}$$

where the last inequality follows from the following fact in view of (6):

$$\begin{aligned} \sup _{a} r(x,a,\mu ) - \inf _{a} r(x,a,\mu )&\, {:=}\, r(x,a_{\max },\mu ) - r(x,a_{\min },\mu ) \\&\le 2 L_1 \, 1_{\{a_{\max } \ne a_{\min }\}} = 2L_1. \end{aligned}$$

Similarly, we have

$$\begin{aligned}&\Vert P(\cdot |x,u,\mu ) - P(\cdot |\hat{x},\hat{u},\hat{\mu })\Vert _1 \\&\quad = \sum _{y \in \mathsf {X}} \left| P(y|x,u,\mu ) - P(y|\hat{x},\hat{u},\hat{\mu }) \right| \\&\quad = \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \le \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) \right| \\&\qquad + \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \overset{(I)}{\le } K_1 \Vert u-\hat{u}\Vert _1 + \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) - \sum _{a \in \mathsf {A}} p(y|\hat{x},a,\hat{\mu }) \, \hat{u}(a) \right| \\&\quad \le K_1 \, \left( 1_{\{x \ne \hat{x}\}} + \Vert u-\hat{u}\Vert _1 +\Vert \mu -\hat{\mu }\Vert _1 \right) . \end{aligned}$$

To show that (I) follows from Assumption 1-(b), let us define the transition probability \(M:\mathsf {A}\rightarrow \mathcal {P}(\mathsf {X})\) as

$$\begin{aligned} M(\cdot |a) :=p(\cdot |x,a,\mu ). \end{aligned}$$

Let \(\xi \in \mathcal {P}(\mathsf {A}\times \mathsf {A})\) be an optimal coupling of u and \(\hat{u}\), i.e., one achieving the total variation distance \(\Vert u-\hat{u}\Vert _{TV}\). Similarly, for any \(a,\hat{a} \in \mathsf {A}\), let \(K(\cdot |a,\hat{a}) \in \mathcal {P}(\mathsf {X}\times \mathsf {X})\) be an optimal coupling of \(M(\cdot |a)\) and \(M(\cdot |\hat{a})\), achieving \(\Vert M(\cdot |a)-M(\cdot |\hat{a})\Vert _{TV}\). Note that

$$\begin{aligned} \sum _{y \in \mathsf {X}} \left| \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, u(a) - \sum _{a \in \mathsf {A}} p(y|x,a,\mu ) \, \hat{u}(a) \right| = 2 \Vert u M-\hat{u} M\Vert _{TV}, \end{aligned}$$

where

$$\begin{aligned} u M(\cdot ) :=\sum _{a \in \mathsf {A}} M(\cdot |a) \, u(a) \end{aligned}$$

and

$$\begin{aligned} \hat{u} M(\cdot ) :=\sum _{a \in \mathsf {A}} M(\cdot |a) \, \hat{u}(a). \end{aligned}$$

Let us define \(\nu (\cdot ) :=\sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} K(\cdot |a,\hat{a}) \, \xi (a,\hat{a})\), and so, \(\nu \) is a coupling of uM and \(\hat{u} M\). Therefore, we have

$$\begin{aligned} 2 \, \Vert u M-\hat{u} M\Vert _{TV}&\le 2 \sum _{(x,y) \in \mathsf {X}\times \mathsf {X}} 1_{\{x \ne y\}} \, \nu (x,y) \\&= 2 \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} \sum _{(x,y) \in \mathsf {X}\times \mathsf {X}} 1_{\{x \ne y\}} \, K(x,y|a,\hat{a}) \, \xi (a,\hat{a})\\&= \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} \Vert M(\cdot |a)-M(\cdot |\hat{a})\Vert _{1} \, \xi (a,\hat{a}) \\&\le 2 \, K_1 \, \sum _{(a,\hat{a}) \in \mathsf {A}\times \mathsf {A}} 1_{\{a \ne \hat{a}\}} \, \xi (a,\hat{a}) \\&= K_1 \, \Vert u-\hat{u}\Vert _{1}. \end{aligned}$$

Hence, (I) follows. This completes the proof.
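Both inequalities used in this proof are easy to test numerically. The following check (ours, with randomly generated distributions) verifies (6) and the coupling bound \(\Vert u M-\hat{u} M\Vert _{1} \le K_1 \Vert u-\hat{u}\Vert _{1}\), taking \(2K_1\) to be the largest \(l_1\) distance between rows of M.

```python
import numpy as np

rng = np.random.default_rng(1)

for _ in range(10_000):
    # Inequality (6): |E_mu[F] - E_nu[F]| <= (lambda(F)/2) * ||mu - nu||_1.
    F = rng.normal(size=6)
    mu, nu = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
    assert abs(F @ mu - F @ nu) <= (F.max() - F.min()) / 2 * np.abs(mu - nu).sum() + 1e-12

    # Coupling bound: ||uM - vM||_1 <= K1 * ||u - v||_1 with
    # 2*K1 = max_{a,a'} ||M(.|a) - M(.|a')||_1.
    M = rng.dirichlet(np.ones(4), size=6)            # M[a, y]: kernel A -> P(X)
    u, v = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
    K1 = max(np.abs(M[a] - M[b]).sum() for a in range(6) for b in range(6)) / 2
    assert np.abs(u @ M - v @ M).sum() <= K1 * np.abs(u - v).sum() + 1e-12
```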

1.3 Proof of Lemma 1

Fix any \(\mu \). If a function \(f: \mathsf {X}\rightarrow \mathbb {R}\) is K-Lipschitz continuous for some K, then \(g = \frac{f}{K}\) is 1-Lipschitz continuous. Hence, for all \(u \in \mathsf {U}\) and \(z,y \in \mathsf {X}\) we have

$$\begin{aligned}&\biggl | \sum _{x} f(x) P(x|z,u,\mu ) - \sum _{x} f(x) P(x|y,u,\mu ) \biggr | \\&\quad = K \biggl | \sum _{x} g(x) P(x|z,u,\mu ) - \sum _{x} g(x) P(x|y,u,\mu ) \biggr | \\&\quad \le \frac{K}{2} \, \Vert P(\,\cdot \,|z,u,\mu ) - P(\,\cdot \,|y,u,\mu )\Vert _1 \, \text {(by (6))}\\&\quad \le \frac{KK_1}{2} \, 1_{\{z \ne y\}}, \, \text {(by Proposition 1)} \end{aligned}$$

since \(\sup _x g(x) - \inf _x g(x) \le 1\). Hence, the contraction operator \(T_{\mu }\) maps K-Lipschitz functions to \((L_1+\beta K K_1/2)\)-Lipschitz functions because, for all \(z,y \in \mathsf {X}\),

$$\begin{aligned} | T_{\mu }f(z) - T_{\mu }f(y) |&\le \sup _{u} \biggl \{ |R(z,u,\mu ) - R(y,u,\mu )| \\&\quad + \beta \biggl | \sum _{x} f(x) P(x|z,u,\mu ) - \sum _{x} f(x) P(x|y,u,\mu ) \biggr | \biggr \}\\&\le L_1 1_{\{z\ne y\}} + \beta \frac{K K_1}{2} 1_{\{z \ne y\}} = \biggl (L_1 + \beta \frac{K K_1}{2}\biggr ) 1_{\{z \ne y\}}. \end{aligned}$$

Now we apply \(T_{\mu }\) recursively to obtain the sequence \(\{T_{\mu }^n f\}\), where \(T_{\mu }^n f = T_{\mu } (T_{\mu }^{n-1} f )\); this sequence converges to the value function \(Q^{\mathop {\mathrm{reg}},*}_{\mu ,\max }\) by the Banach fixed point theorem. By induction, \(T_{\mu }^n f\) is \(K_n\)-Lipschitz continuous for all \(n\ge 1\), where \(K_n = L_1 \sum _{i=0}^{n-1} (\beta K_1/2)^i + K (\beta K_1/2)^n\). If we choose \(K < L_1\), then \(K_n \le K_{n+1}\) for all n, and therefore \(K_n \uparrow \frac{ L_1}{1-\beta K_1/2}\). Hence, \(T_{\mu }^n f\) is \( \frac{L_1}{1-\beta K_1/2}\)-Lipschitz continuous for all n, and consequently \(Q^{\mathop {\mathrm{reg}},*}_{\mu ,\max }\) is also \( \frac{L_1}{1-\beta K_1/2}\)-Lipschitz continuous.
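The limiting Lipschitz constant can also be seen by iterating the recursion \(K_{n+1} = L_1 + (\beta K_1/2) K_n\) directly; a short check with illustrative constants of our choosing (so that \(\beta K_1/2 < 1\)):

```python
# Iterate K_{n+1} = L1 + (beta*K1/2)*K_n from the proof of Lemma 1.
# Starting from K_0 < L1, the sequence increases to L1 / (1 - beta*K1/2).
L1, K1, beta = 1.0, 1.5, 0.9
K = 0.5                                    # K_0 < L1
for _ in range(100):
    K = L1 + beta * K1 / 2 * K
print(K, L1 / (1 - beta * K1 / 2))         # both approx. 3.0769
```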

1.4 Proof of Lemma 2

Under Assumption 1, it is straightforward to prove that \(H_1\) maps \(\mathcal {P}(\mathsf {X})\) into \({\mathcal {C}}\). Indeed, the only non-trivial part is the \(\left( K_{\mathop {\mathrm{Lip}}}+L_{\mathop {\mathrm{reg}}}\right) \)-Lipschitz continuity of \(H_1(\mu ) =:Q_{\mu }^{\mathop {\mathrm{reg}},*}\). This can be proved as follows: for any \((x,u)\) and \((\hat{x},\hat{u})\), we have

$$\begin{aligned} |Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u)-Q_{\mu }^{\mathop {\mathrm{reg}},*}(\hat{x},\hat{u})|&= |R(x,u,\mu ) - \varOmega (u) + \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) \\&\quad - R(\hat{x},\hat{u},\mu ) + \varOmega (\hat{u}) - \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|\hat{x},\hat{u},\mu )| \\&\le L_1 (1_{\{x\ne \hat{x}\}}+\Vert u-\hat{u}\Vert _1) + L_{\mathop {\mathrm{reg}}} \Vert u-\hat{u}\Vert _1 \\&\quad + \beta \frac{K_1 K_{\mathop {\mathrm{Lip}}}}{2} \, (1_{\{x\ne \hat{x}\}}+\Vert u-\hat{u}\Vert _1), \end{aligned}$$

where the last inequality follows from (6) and Lemma 1. Hence, \(Q_{\mu }^{\mathop {\mathrm{reg}},*}\) is \(\left( K_{\mathop {\mathrm{Lip}}}+L_{\mathop {\mathrm{reg}}}\right) \)-Lipschitz continuous.

Now, for any \(\mu ,{\hat{\mu }}\in \mathcal {P}(\mathsf {X})\), we have

$$\begin{aligned} \Vert H_1(\mu ) - H_1({\hat{\mu }})\Vert _{\infty }&= \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*}-Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty } \\&= \sup _{x,u} \bigg | R(x,u,\mu ) + \beta \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) \\&\quad - R(x,u,{\hat{\mu }}) - \beta \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) \bigg | \\&\le L_1 \, \Vert \mu -{\hat{\mu }}\Vert _1 \\&\quad + \beta \left| \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,\mu ) - \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }})\right| \\&\quad + \beta \left| \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) P(y|x,u,{\hat{\mu }}) \right| \\&\le L_1 \, \Vert \mu -{\hat{\mu }}\Vert _1 + \frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2} \, \Vert \mu -{\hat{\mu }}\Vert _1 + \beta \, \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*}-Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty }, \end{aligned}$$

where the last inequality follows from (6) and Lemma 1. Since the above display yields \(\Vert H_1(\mu ) - H_1({\hat{\mu }})\Vert _{\infty } \le \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \Vert \mu -{\hat{\mu }}\Vert _1 + \beta \Vert H_1(\mu ) - H_1({\hat{\mu }})\Vert _{\infty }\), rearranging gives \(\Vert H_1(\mu ) - H_1({\hat{\mu }})\Vert _{\infty } \le \frac{1}{1-\beta }\left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \Vert \mu -{\hat{\mu }}\Vert _1\). This completes the proof.

1.5 Proof of Lemma 3

For any \(\mu \in \mathcal {P}(\mathsf {X})\), we have

$$\begin{aligned} Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u)&= L_{\mu } Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,u) \\&= R(x,u,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, P(y|x,u,\mu ) - \varOmega (u) \\&= \langle q_{x}^{\mu },u \rangle - \varOmega (u), \end{aligned}$$

where \( q_{x}^{\mu }(\cdot ) :=r(x,\cdot ,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|x,\cdot ,\mu ). \) By \(\rho \)-strong convexity of \(\varOmega \), \(Q_{\mu }^{\mathop {\mathrm{reg}},*}(x,\cdot )\) has a unique maximizer \(f_{\mu }(x) \in \mathsf {U}\) for any \(x \in \mathsf {X}\), which is the optimal policy for \(\mu \). By Property 2 of Proposition 3, we have

$$\begin{aligned} f_{\mu }(x) = \nabla \varOmega ^*(q_x^{\mu }), \end{aligned}$$

where \(\varOmega ^*\) is the Fenchel conjugate of \(\varOmega \), and \(\varOmega ^*(q_x^{\mu }) = Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(x)\).

Moreover, for any \(\mu ,{\hat{\mu }}\in \mathcal {P}(\mathsf {X})\) and \(x,\hat{x}\in \mathsf {X}\), by property 3 of Proposition 3 and by noting the fact that \(\Vert \cdot \Vert _{\infty }\) is the dual norm of \(\Vert \cdot \Vert _1\) on \(\mathsf {U}\), we obtain the following bound:

$$\begin{aligned} \Vert f_{\mu }(x)-f_{{\hat{\mu }}}(\hat{x})\Vert _1 \le \frac{1}{\rho } \, \Vert q_{x}^{\mu }-q_{\hat{x}}^{{\hat{\mu }}}\Vert _{\infty }. \end{aligned}$$

Note that we have

$$\begin{aligned} \Vert q_{x}^{\mu }-q_{\hat{x}}^{{\hat{\mu }}}\Vert _{\infty }&=\sup _{a \in \mathsf {A}} \bigg |r(x,a,\mu ) + \beta \sum _{y \in \mathsf {X}} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|x,a,\mu ) \\&\quad - r(\hat{x},a,{\hat{\mu }}) - \beta \sum _{y \in \mathsf {X}} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) \, p(y|\hat{x},a,{\hat{\mu }}) \bigg | \\&\le L_1 (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) \\&\quad + \beta \sup _{a \in \mathsf {A}} \bigg | \sum _{y} Q_{\mu ,\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) \bigg | \\&\quad + \beta \sup _{a \in \mathsf {A}} \bigg | \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|x,a,\mu ) - \sum _{y} Q_{{\hat{\mu }},\max }^{\mathop {\mathrm{reg}},*}(y) p(y|\hat{x},a,{\hat{\mu }}) \bigg |\\&\le L_1 (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) + \beta \Vert Q_{\mu }^{\mathop {\mathrm{reg}},*} - Q_{{\hat{\mu }}}^{\mathop {\mathrm{reg}},*}\Vert _{\infty } \\&\quad + \beta \frac{K_1 K_{\mathop {\mathrm{Lip}}}}{2} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) \\&\le K_{\mathop {\mathrm{Lip}}} \, (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1) + \beta K_{H_1} \, \Vert \mu -{\hat{\mu }}\Vert _1 \\&\le K_{H_1} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1). \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \Vert f_{\mu }(x)-f_{{\hat{\mu }}}(\hat{x})\Vert _1 \le \frac{1}{\rho } \, K_{H_1} (1_{\{x \ne \hat{x}\}} + \Vert \mu -{\hat{\mu }}\Vert _1). \end{aligned}$$
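As a concrete instance of the maximizer characterization used above (our illustration; the paper allows a general strongly convex \(\varOmega \)): for the negative-entropy regularizer \(\varOmega (u) = \sum _a u(a)\log u(a)\), the conjugate is \(\varOmega ^*(q) = \log \sum _a \exp q(a)\), so \(f_{\mu }(x) = \nabla \varOmega ^*(q_x^{\mu })\) is the softmax distribution over \(q_x^{\mu }\).

```python
import numpy as np

q = np.array([1.0, 2.0, 0.5])                  # a sample vector q_x^mu
f = np.exp(q - q.max()); f /= f.sum()          # f_mu(x) = softmax(q) = grad Omega*(q)
val = np.log(np.exp(q).sum())                  # Omega*(q) = Q_{mu,max}^{reg,*}(x)
# The softmax policy attains the regularized maximum: <q,f> - Omega(f) = Omega*(q).
assert np.isclose(f @ q - (f * np.log(f)).sum(), val)
```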

1.6 Proof of Theorem 1

Let \(\mu _{\varepsilon } \in \varLambda ^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon })\). Then, we have

$$\begin{aligned} \Vert \mu _{\varepsilon }-\mu _*\Vert _1&= \sum _{y} \, \bigg | \sum _{x} \, P(y|x,\pi _{\varepsilon }(x),\mu _{\varepsilon }) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _*(x) \bigg | \\&\le \sum _{y} \, \bigg | \sum _{x} \, P(y|x,\pi _{\varepsilon }(x),\mu _{\varepsilon }) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _{\varepsilon }(x) \bigg | \\&\quad + \sum _{y} \, \bigg | \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _{\varepsilon }(x) - \sum _{x} \, P(y|x,\pi _*(x),\mu _*) \, \mu _*(x) \bigg | \\&\overset{(I)}{\le } \sum _{x} \left\| P(\cdot |x,\pi _{\varepsilon }(x),\mu _{\varepsilon })-P(\cdot |x,\pi _*(x),\mu _*) \right\| _1 \mu _{\varepsilon }(x) \\&\quad + \frac{K_1}{2} \left( 1 + \frac{K_{H_1}}{\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \\&\le K_1 \left( \sup _x \Vert \pi _{\varepsilon }(x)-\pi _*(x)\Vert _1 + \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \right) \\&\quad + \frac{K_1}{2} \left( 1 + \frac{K_{H_1}}{\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \\&\le K_1 \, \varepsilon + \left( \frac{3 \, K_1}{2} + \frac{K_1 \, K_{H_1}}{2\rho } \right) \, \Vert \mu _{\varepsilon }-\mu _*\Vert _1. \end{aligned}$$

Note that Lemma 3 and Proposition 1 lead to

$$\begin{aligned} \Vert P(\cdot |x,\pi _*(x),\mu _*)-P(\cdot |y,\pi _*(y),\mu _*)\Vert _1 \le K_1 \left( 1 + \frac{K_{H_1}}{\rho } \right) \, 1_{\{x \ne y\}}. \end{aligned}$$

Hence, (I) follows from [24, Lemma A2]. Therefore, we have:

$$\begin{aligned} \Vert \mu _{\varepsilon }-\mu _*\Vert _1 \le \frac{K_1 \, \varepsilon }{1-C_1}, \end{aligned}$$

where \(C_1 :=\left( \frac{3 \, K_1}{2} + \frac{K_1 \, K_{H_1}}{2\rho } \right) \). Note that by Assumption 2, \(C_1 < 1\). Now, fix any policy \(\pi \in \varPi \). Then, we have

$$\begin{aligned}&\Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty } \\&\quad =\sup _{x} \bigg | R^{\mathop {\mathrm{reg}}}(x,\pi (x),\mu _*) + \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _*) \\&\qquad -R^{\mathop {\mathrm{reg}}}(x,\pi (x),\mu _{\varepsilon }) - \beta \, \sum _{y} J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg | \\&\quad \le L_1 \, \Vert \mu _*-\mu _{\varepsilon }\Vert _1 \\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg |\\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon }) - \sum _{y} J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,y) \, p(y|x,\pi (x),\mu _{\varepsilon })\bigg | \\&\quad \overset{(II)}{\le } \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \Vert \mu _*-\mu _{\varepsilon }\Vert _1 + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty }\\&\quad \le \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \frac{K_1 \varepsilon }{1-C_1}+ \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty }. \end{aligned}$$

Here, (II) follows from (6) and the fact that \(J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\) is \(K_{\mathop {\mathrm{Lip}}}\)-Lipschitz continuous, which can be proved as in Lemma 1. Therefore, we obtain

$$\begin{aligned} \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )-J_{\mu _{\varepsilon }}^{\mathop {\mathrm{reg}}}(\pi ,\cdot )\Vert _{\infty } \le \frac{C_2 \, \varepsilon }{1-\beta }, \end{aligned}$$
(7)

where \(C_2 :=\left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \frac{K_1}{1-C_1}\).

Note that we also have

$$\begin{aligned}&\quad \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }\\&\quad =\sup _{x} \bigg | R^{\mathop {\mathrm{reg}}}(x,\pi _*(x),\mu _*) + \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _*(x),\mu _*) \\&\qquad -R^{\mathop {\mathrm{reg}}}(x,\pi _{\varepsilon }(x),\mu _*) - \beta \, \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg | \\&\quad \le (L_1 + L_{\mathop {\mathrm{reg}}}) \, \sup _{x} \Vert \pi _*(x)-\pi _{\varepsilon }(x)\Vert _1 \\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _*(x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg |\\&\qquad + \beta \sup _{x} \bigg |\sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*) - \sum _{y} J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },y) \, p(y|x,\pi _{\varepsilon }(x),\mu _*)\bigg | \\&\quad \overset{(III)}{\le } \left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \sup _x \Vert \pi _*(x)-\pi _{\varepsilon }(x)\Vert _1\\&\qquad + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }\\&\quad \le \left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \, \varepsilon + \beta \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty }. \end{aligned}$$

Here, (III) follows from (6) and the fact that \(J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )\) is \(K_{\mathop {\mathrm{Lip}}}\)-Lipschitz continuous, which can be proved as in Lemma 1. Therefore, we obtain

$$\begin{aligned} \Vert J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _*,\cdot )-J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon },\cdot )\Vert _{\infty } \le \frac{C_3\varepsilon }{1-\beta }, \end{aligned}$$
(8)

where \(C_3 :=\left( L_1+L_{\mathop {\mathrm{reg}}}+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) \).
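For orientation, the constants appearing so far can be evaluated for given model parameters (the numbers below are ours, not the paper's), using \(K_{\mathop {\mathrm{Lip}}} = \frac{L_1}{1-\beta K_1/2}\) from Lemma 1 and \(K_{H_1} = \left( L_1+\frac{\beta K_1 K_{\mathop {\mathrm{Lip}}}}{2}\right) /(1-\beta )\) from the proof of Lemma 2:

```python
# Illustrative parameter values (ours); Assumption 2 requires C1 < 1.
L1, K1, L_reg, beta, rho = 1.0, 0.4, 0.5, 0.6, 2.0
K_Lip = L1 / (1 - beta * K1 / 2)                     # Lemma 1
K_H1 = (L1 + beta * K1 * K_Lip / 2) / (1 - beta)     # Lemma 2
C1 = 3 * K1 / 2 + K1 * K_H1 / (2 * rho)
C2 = (L1 + beta * K1 * K_Lip / 2) * K1 / (1 - C1)
C3 = L1 + L_reg + beta * K1 * K_Lip / 2
print(f"C1={C1:.3f} (<1: {C1 < 1}), C2={C2:.3f}, C3={C3:.3f}")
```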

To complete the proof, it remains to show that

$$\begin{aligned} J_i^{(N)}({\varvec{\pi }}^{(N)})&\ge \sup _{\pi ^i \in \varPi _i} J_i^{(N)}({\varvec{\pi }}^{(N)}_{-i},\pi ^i) - \tau \, \varepsilon - \delta \end{aligned}$$
(9)

for each \(i=1,\ldots ,N\), when N is sufficiently large. As the transition probabilities and the one-stage reward functions are the same for all agents, it is sufficient to prove (9) for Agent 1 only. Given \(\delta > 0\), for each \(N\ge 1\), let \({\tilde{\pi }}^{(N)} \in \varPi _1\) be such that

$$\begin{aligned} J_1^{(N)} ({\tilde{\pi }}^{(N)},\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) > \sup _{\pi ' \in \varPi _1} J_1^{(N)} (\pi ',\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \frac{\delta }{3}. \end{aligned}$$

Then, by [33, Theorem 4.10], the N-agent rewards of Agent 1 under these policies converge to the corresponding mean-field rewards as \(N \rightarrow \infty \). Therefore, there exists \(N(\delta )\) such that

$$\begin{aligned}&\sup _{\pi ' \in \varPi _1} J_1^{(N)} (\pi ',\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \delta - \tau \,\varepsilon \\&\quad \le J_1^{(N)} ({\tilde{\pi }}^{(N)},\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) - \frac{2\delta }{3} - \tau \, \varepsilon \\&\quad \le J_{\mu _*}^{\mathop {\mathrm{reg}}}(\pi _{\varepsilon }) - \frac{\delta }{3} \\&\quad \le J_1^{(N)} (\pi _{\varepsilon },\pi _{\varepsilon },\ldots ,\pi _{\varepsilon }) \end{aligned}$$

for all \(N\ge N(\delta )\), which establishes (9) and completes the proof.


Cite this article

Anahtarci, B., Kariksiz, C.D. & Saldi, N. Q-Learning in Regularized Mean-field Games. Dyn Games Appl 13, 89–117 (2023). https://doi.org/10.1007/s13235-022-00450-2
