1 Introduction

A decade of active research on mean field games (MFGs) has been driven by a primarily intuitive connection with large-population stochastic differential games of a certain symmetric type. The idea, which began with the pioneering work of Lasry and Lions [31] and Huang et al. [21], is that a large-population game of this type should behave similarly to its MFG counterpart, which may be thought of as an infinite-player version of the game. Rigorous analysis of this connection, however, remains restricted in scope. Following [21], the vast majority of the literature works backward from the mean field limit, in the sense that a solution of the MFG is used to construct approximate Nash equilibria for the corresponding n-player games for large n. Fewer papers [2, 14, 15, 31] have approached from the other direction: given for each n a Nash equilibrium for the n-player game, in what sense (if any) do these equilibria converge as n tends to infinity? The goal of this paper is to address both of these problems in a general framework.

More precisely, we study an n-player stochastic differential game, in which the private state processes \(X^1,\ldots ,X^n\) of the agents (or players) are given by the following dynamics:

$$\begin{aligned} dX^{i}_t&= b\left( t,X^{i}_t,{\widehat{\mu }}^n_t,\alpha ^i_t\right) dt + \sigma \left( t,X^{i}_t,{\widehat{\mu }}^n_t\right) dW^i_t + \sigma _0\left( t,X^{i}_t,{\widehat{\mu }}^n_t\right) dB_t, \\ {\widehat{\mu }}^n_t&= \frac{1}{n}\sum _{k=1}^n\delta _{X^{k}_t}. \end{aligned}$$

Here \(B,W^1,\ldots ,W^n\) are independent Wiener processes, \(\alpha ^i\) is the control of agent i, and \({\widehat{\mu }}^n\) is the empirical distribution of the state processes. We call \(W^1,\ldots ,W^n\) the independent or idiosyncratic noises, since agent i feels only \(W^i\) directly, and we call B the common noise, since each agent feels B equally. The reward to agent i of the strategy profile \((\alpha ^1,\ldots ,\alpha ^n)\) is

$$\begin{aligned} J_i(\alpha ^1,\ldots ,\alpha ^n) = {\mathbb {E}}\left[ \int _0^Tf\left( t,X^{i}_t,{\widehat{\mu }}^n_t,\alpha ^i_t\right) dt + g\left( X^{i}_T,{\widehat{\mu }}^n_T\right) \right] . \end{aligned}$$

Agent i seeks to maximize this reward, and so we say that \((\alpha ^1,\ldots ,\alpha ^n)\) form an \(\epsilon \)-Nash equilibrium (or an approximate Nash equilibrium) if

$$\begin{aligned} J_i(\alpha ^1,\ldots ,\alpha ^n) + \epsilon \ge J_i(\alpha ^1,\ldots ,\alpha ^{i-1},\beta ,\alpha ^{i+1},\ldots ,\alpha ^n) \end{aligned}$$

for each admissible alternative strategy \(\beta \). Intuitively, if the number of agents n is very large, a single representative agent has little influence on the empirical measure flow \(({\widehat{\mu }}^n_t)_{t \in [0,T]}\), and so this agent expects to lose little in the way of optimality by ignoring her own effect on the empirical measure. Crucially, the system is symmetric in the sense that the same functions \((b,\sigma ,\sigma _0)\) and \((f,g)\) determine the dynamics and objectives of each agent, and thus we may hope to learn something of the entire system from the behavior of a single representative agent.
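
For readers who prefer something executable, the following purely illustrative sketch (not part of the paper's analysis) simulates the n-player state system by Euler–Maruyama. The dimensions \(d = m = m_0 = 1\), the coefficients \(b(t,x,\mu ,a) = a\) and \(\sigma = \sigma _0 = 1\), and the mean-reverting feedback strategy are all assumptions made for the example.

```python
import numpy as np

def simulate_n_player(n=500, T=1.0, steps=200, seed=0):
    """Euler-Maruyama simulation of the symmetric n-player state system
    with idiosyncratic noises W^i and a common noise B (d = m = m0 = 1).
    Illustrative coefficients: b(t,x,mu,a) = a, sigma = sigma0 = 1, and
    every agent plays the feedback a = mean(mu) - x (an assumption)."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    X = np.zeros(n)                    # initial states (lambda = delta_0)
    for _ in range(steps):
        a = X.mean() - X               # illustrative feedback controls
        dW = rng.standard_normal(n) * np.sqrt(dt)  # idiosyncratic noises
        dB = rng.standard_normal() * np.sqrt(dt)   # common noise, shared
        X = X + a * dt + dW + dB
    return X                           # samples from hat{mu}^n_T

X_T = simulate_n_player()
print("empirical mean/std of hat{mu}^n_T:", X_T.mean(), X_T.std())
```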

The mean field game is specified precisely in Sect. 2, and it follows this intuition by treating n as infinite. Loosely speaking, a strong MFG solution is a measure-valued process \((\mu _t)_{t \in [0,T]}\), adapted to the filtration \(({\mathcal {F}}^B_t)_{t \in [0,T]}\) defined by \({\mathcal {F}}^B_t = \sigma (B_s : s\le t)\), satisfying \(\mu _t = \text {Law}(X^{\alpha ^*}_t \ | \ {\mathcal {F}}^B_t)\) for each t, where \(X^{\alpha ^*}\) is an optimally controlled state process coming from the following stochastic optimal control problem:

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha ^* &\in \arg \max _\alpha {\mathbb {E}}\left[ \int _0^Tf(t,X^\alpha _t,\mu _t,\alpha _t)dt + g(X^\alpha _T,\mu _T)\right] , \quad \text {s.t.} \\ dX^\alpha _t &= b(t,X^\alpha _t,\mu _t,\alpha _t)dt + \sigma (t,X^\alpha _t,\mu _t)dW_t + \sigma _0(t,X^\alpha _t,\mu _t)dB_t. \end{array}\right. } \end{aligned}$$

In other words, with the process \((\mu _t)_{t \in [0,T]}\) treated as fixed, the representative agent solves an optimal control problem. The requirement \(\mu _t = \text {Law}(X^{\alpha ^*}_t \ | \ {\mathcal {F}}^B_t)\), often known as a consistency condition, assures us that this decoupled optimal control problem is truly representative of the entire population, and we may think of the measure flow \((\mu _t)_{t \in [0,T]}\) as an equilibrium.
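
The consistency condition casts the MFG as a fixed-point problem for the measure flow. The sketch below (our illustration; all model data are assumptions) iterates best response and consistency for a toy problem without common noise: \(b = a\), \(\sigma = 1\), \(A = [-1,1]\), \(f \equiv 0\), \(g(x,\nu ) = -(x - {\bar{\nu }})^2\), \(\lambda = \delta _0\). We take for granted that the bang-bang feedback \(a = \text {sgn}(m - x)\) is a best response to a deterministic flow with terminal mean m; the point is the structure of the consistency iteration, not the control problem itself.

```python
import numpy as np

def terminal_mean_under_best_response(m, T=1.0, steps=100, n_mc=20000, seed=0):
    """Monte Carlo estimate of E[X_T] when the representative agent uses the
    feedback a = sgn(m - x), taken here as the best response for the toy data
    b = a, sigma = 1, A = [-1,1], f = 0, g(x,nu) = -(x - mean(nu))^2."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    X = np.zeros(n_mc)
    for _ in range(steps):
        X = X + np.sign(m - X) * dt + rng.standard_normal(n_mc) * np.sqrt(dt)
    return X.mean()

# Picard iteration on the consistency condition m = E[X_T^{alpha*}]:
m = 1.0
for k in range(10):
    m = terminal_mean_under_best_response(m)
    print("iteration", k, "terminal mean", m)
```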

The analysis of this paper focuses on mean field games with common noise, but both of the volatility coefficients \(\sigma \) and \(\sigma _0\) are allowed to be degenerate. Hence, our results cover the usual mean field games without common noise (where \(\sigma _0 \equiv 0\)) as well as deterministic mean field games (where \(\sigma \equiv \sigma _0 \equiv 0\)). The literature on mean field games with common noise is quite scarce so far, but some general analysis is provided in the recent papers [1, 5, 9, 11], and some specific models were studied in [12, 19]. This paper can be seen as a sequel to [11], from which we borrow many definitions and a handful of lemmas. It is emphasized in [11] that strong solutions are quite difficult to obtain when common noise is present, and this leads to a notion of weak MFG solution. Weak solutions, defined carefully in Sect. 2.2, differ most significantly from strong solutions in that the measure flow \((\mu _t)_{t \in [0,T]}\) need not be \(({\mathcal {F}}^B_t)_{t \in [0,T]}\)-adapted, and the consistency condition is weakened to something like \(\mu _t = \text {Law}(X^{\alpha ^*}_t \ | \ {\mathcal {F}}^{B,\mu }_t)\), where \({\mathcal {F}}^{B,\mu }_t = \sigma (B_s,\mu _s : s \le t)\). Additionally, weak MFG solutions allow for relaxed (i.e. measure-valued) controls which need not be adapted to the filtration generated by the inputs \((X_0,B,W,\mu )\) of the control problem.

Although this weaker notion of MFG solution was introduced in [11] to develop an existence and uniqueness theory for MFGs with common noise, the main results of this paper show that this notion is the right one from the point of view of the finite-player game, in the sense that weak MFG solutions characterize the limits of approximate Nash equilibria. The main results are stated in full generality in Sects. 2.4 and 2.5, but let us state them loosely for now in a simplified form: First, we show that if for each n we are given an \(\epsilon _n\)-Nash equilibrium \((\alpha ^{n,1},\ldots ,\alpha ^{n,n})\) for the n-player game, where \(\epsilon _n \rightarrow 0\), then the family \((\text {Law}(B,{\widehat{\mu }}^n))_{n=1}^\infty \) is tight, and every weak limit agrees with the law of \((B,\mu )\) coming from some weak MFG solution. Second, we show conversely that every weak MFG solution can be obtained as a limit in this way.

Specializing our results to the case without common noise uncovers something unexpected. In the literature thus far, a MFG solution is defined in terms of a deterministic equilibrium \((\mu _t)_{t \in [0,T]}\), corresponding to our notion of strong MFG solution. Even when there is no common noise, a weak MFG solution still involves a stochastic equilibrium, and because of our main theorems we must therefore expect the limits of the finite-player empirical measures to remain stochastic. Moreover, we demonstrate by a simple example that a stochastic equilibrium is not necessarily just a randomization among the family of deterministic equilibria. Hence, the solution concept considered thus far in the literature on mean field games (without common noise) does not fully capture the limiting dynamics of finite-player approximate Nash equilibria. This is unlike the case of McKean–Vlasov limits (see [16, 32, 34]), which can be seen as mean field games with no control. We prove some admittedly difficult-to-apply results which nevertheless shed some light on this phenomenon: The fundamental obstruction is the adaptedness required of controls, which renders the class of admissible controls quite sensitive to whether or not \((\mu _t)_{t \in [0,T]}\) is stochastic.

Our first theorem, regarding the convergence of arbitrary approximate equilibria (open-loop, full-information, and possibly asymmetric), is arguably the more novel of our two main theorems. It appears to be the first result of its kind for mean field games with common noise, with the exception of the linear-quadratic model of [12], for which explicit computations are available. However, even in the setting without common noise, we substantially generalize the few existing results.

Several papers, such as the recent [9] dealing with common noise, contain purely heuristic derivations of the MFG as the limit of n-player games. The intuition guiding such derivations is as follows (and let us assume there is no common noise for the sake of simplicity): If n is large, a single agent in a large population should lose little in the way of optimality if she ignores the small feedbacks arising through the empirical measure flow \(({\widehat{\mu }}^n_t)_{t \in [0,T]}\). If each of the n identical agents does this, then we expect to see symmetric strategies which are nearly independent and ideally of the form \({\hat{\alpha }}(t,X^i_t)\), for some feedback control \({\hat{\alpha }}\) common to all of the agents. From the theory of McKean–Vlasov limits, we then expect that \(({\widehat{\mu }}^n_t)_{t \in [0,T]}\) converges to a deterministic limit. This intuition, however, is largely unsubstantiated and, we will argue, inaccurate in general.

Lasry and Lions [30, 31] first attacked this problem rigorously using PDE methods, working with an infinite time horizon and strong simplifying assumptions on the data, and their results were later generalized by Feleqi [14]. Bardi and Priuli [2, 3] justified the MFG limit for certain linear-quadratic problems, and Gomes et al. [17] studied models with finite state space. Substantial progress was made in a very recent paper of Fischer [15], which deserves special mention also because both the level of generality and the method of proof are quite similar to ours; we will return to this shortly.

With the exception of [15], the aforementioned results share the important limitation that the agents have only partial information: the control of agent i may depend only on her own state process \(X^{n,i}\) or Wiener process \(W^i\). Our results allow for arbitrary full-information strategies, settling a conjecture of Lasry and Lions (stated in a remark following [31, Theorem 2.3], for the case of infinite time horizon and closed-loop controls). Combined in [14, 30, 31] with the assumption that the state process coefficients \((b,\sigma )\) do not depend on the empirical measure, the assumption of partial information leads to the immensely useful simplification that the state processes of the n-player games are independent. Once it is shown that they are also asymptotically identically distributed, the aforementioned heuristic argument can be made precise.

Fischer [15], on the other hand, allows for full-information controls but characterizes only the deterministic limits of \(({\widehat{\mu }}^n_t)_{t \in [0,T]}\) as MFG equilibria. Assuming that the limit is deterministic implicitly restricts the class of n-player equilibria in question. By characterizing even the stochastic limits of \(({\widehat{\mu }}^n_t)_{t \in [0,T]}\), which we show are in fact quite typical, we impose no such restriction on the equilibrium strategies of the n-player games. This is not to say, however, that our results completely subsume those of [15], which work with a more general notion of local approximate equilibria and which notably include conditions under which the assumption of a deterministic limit can be verified.

Our second main theorem, which asserts that every weak MFG solution is attainable as a limit of finite-player approximate Nash equilibria, is something of an abstraction of the kind of limiting result most commonly discussed in the MFG literature. In a tradition beginning with [21] and continued by the majority of the probabilistic papers on the subject [4, 6, 8, 13, 27], an optimal control from an MFG solution is used to construct approximate equilibria for the finite-player games. Although our result applies in more general settings, our conclusions are duly weaker, in the sense that the approximate equilibria we construct do not necessarily consist of particularly tangible (i.e. distributed or even symmetric) strategies. We emphasize that the goal of this work is not to construct nice approximate equilibria but rather to characterize all possible limits of approximate equilibria.

It is worth emphasizing that this paper makes no claims whatsoever regarding the existence or uniqueness of equilibria for either the n-player game or the MFG. Rather, we show that if a sequence of n-player approximate equilibria exists, then its limits are described by weak MFG solutions. Conversely, if a weak MFG solution exists, then it is achieved as the limit of some sequence of n-player approximate equilibria. Hence, existence of a weak MFG solution is equivalent to existence of a sequence of n-player approximate equilibria. Note, however, that the main Assumption A of this paper actually guarantees the existence of a weak MFG solution, because of the recent results of [11]. Far more results are available for MFGs without common noise; refer to the surveys [7, 18] and the recent book [4] for a wealth of well-posedness results and for further discussion of MFG theory in general.

The paper is organized as follows. Section 2 defines the MFG and the corresponding n-player games, before stating the main limit Theorem 2.6 and its converse, Theorem 2.11, along with several useful corollaries. Section 3 specializes the results to the more familiar setting without common noise and explains the gap between weak and strong solutions. Section 4 provides some background on weak solutions of MFGs with common noise, borrowed from [11], before we turn to the proofs of the main results in Sects. 5, 6, and 7. Section 5 is devoted to the proof of Theorem 2.6, while Sect. 6 contains the proof of the converse Theorem 2.11. Finally, Sect. 7 explains how to carefully specialize these two theorems to the setting without common noise.

2 The mean field limit with common noise

After establishing some notation, this section first defines the mean field game, quickly but concisely. We work with the same definitions and nearly the same assumptions as [11], to which the reader is referred for a more thorough discussion. Then, the n-player game is formulated precisely, allowing for somewhat more general information structures than one usually finds in the literature on stochastic differential games. This generality is not just for its own sake; it will play a crucial role in the proofs later.

2.1 Notation and standing assumptions

For a topological space E, let \({\mathcal {B}}(E)\) denote the Borel \(\sigma \)-field, and let \({\mathcal {P}}(E)\) denote the set of Borel probability measures on E. For \(p \ge 1\) and a separable metric space \((E,d)\), let \({\mathcal {P}}^p(E)\) denote the set of \(\mu \in {\mathcal {P}}(E)\) satisfying \(\int _Ed^p(x,x_0)\mu (dx) < \infty \) for some (and thus for any) \(x_0 \in E\). Let \(\ell _{E,p}\) denote the p-Wasserstein distance on \({\mathcal {P}}^p(E)\), given by

$$\begin{aligned} \ell _{E,p}(\mu ,\nu ) := \inf \left\{ \left( \int _{E \times E}\gamma (dx,dy)d^p(x,y)\right) ^{1/p} : \gamma \in {\mathcal {P}}(E \times E) \text { has marginals } \mu ,\nu \right\} . \end{aligned}$$
(2.1)

Unless otherwise stated, the space \({\mathcal {P}}^p(E)\) is equipped with the metric \(\ell _{E,p}\), and all continuity and measurability statements involving \({\mathcal {P}}^p(E)\) are with respect to \(\ell _{E,p}\) and the corresponding Borel \(\sigma \)-field. The analysis of the paper will make routine use of several topological properties of the spaces \({\mathcal {P}}^p(E)\) and \({\mathcal {P}}^p({\mathcal {P}}^p(E))\), especially when E is a product space. All of the results we need, well known or not, are summarized in Appendices A and B of [29].
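
As a concrete aid (ours, not the paper's): in one dimension, for two empirical measures of the same sample size, the infimum in (2.1) is attained by the monotone coupling, i.e. by matching order statistics, so \(\ell _{{\mathbb {R}},p}\) can be computed by sorting.

```python
import numpy as np

def wasserstein_p_empirical_1d(x, y, p=2):
    """p-Wasserstein distance between the empirical measures of two samples
    of equal size in R: the optimal coupling pairs order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
mu_sample = rng.standard_normal(1000)          # empirical approx of N(0,1)
nu_sample = rng.standard_normal(1000) + 2.0    # empirical approx of N(2,1)
print(wasserstein_p_empirical_1d(mu_sample, nu_sample, p=2))  # close to 2
```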

We are given a time horizon \(T > 0\), three exponents \((p',p,p_\sigma )\) with \(p \ge 1\), a control space A, an initial state distribution \(\lambda \in {\mathcal {P}}({\mathbb {R}}^d)\), and the following functions:

$$\begin{aligned} (b,f)&: [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \times A \rightarrow {\mathbb {R}}^d \times {\mathbb {R}}, \\ (\sigma ,\sigma _0)&: [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \rightarrow {\mathbb {R}}^{d \times m} \times {\mathbb {R}}^{d \times m_0}, \\ g&: {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \rightarrow {\mathbb {R}}. \end{aligned}$$

Assume throughout the paper that the following Assumption A holds. This is exactly Assumption A of [11], except that here we require that \(p' \ge 2\) and that \((b,\sigma ,\sigma _0)\) are Lipschitz not only in the state argument but also in the measure argument.

Assumption A

  (A.1)

    A is a closed subset of a Euclidean space. (More generally, as in [20], a closed \(\sigma \)-compact subset of a Banach space would suffice.)

  (A.2)

    The exponents satisfy \(p' > p \ge 1 \vee p_\sigma \) and \(p' \ge 2 \ge p_\sigma \ge 0\), and also \(\lambda \in {\mathcal {P}}^{p'}({\mathbb {R}}^d)\).

  (A.3)

    The functions b, \(\sigma \), \(\sigma _0\), f, and g of \((t,x,\mu ,a)\) are jointly measurable and are continuous in \((x,\mu ,a)\) for each t.

  (A.4)

    There exists \(c_1 > 0\) such that, for all \((t,x,y,\mu ,\nu ,a) \in [0,T] \times {\mathbb {R}}^d \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \times {\mathcal {P}}^p({\mathbb {R}}^d) \times A\),

    $$\begin{aligned}&|b(t,x,\mu ,a) - b(t,y,\nu ,a)| + |(\sigma ,\sigma _0)(t,x,\mu ) - (\sigma ,\sigma _0)(t,y,\nu )| \\&\quad \le c_1\left( |x-y| + \ell _{{\mathbb {R}}^d,p}(\mu ,\nu )\right) , \end{aligned}$$

    and

    $$\begin{aligned} |b(t,0,\delta _0,a)|&\le c_1(1 + |a|), \\ |(\sigma \sigma ^\top + \sigma _0\sigma _0^\top )(t,x,\mu )|&\le c_1\left[ 1 + |x|^{p_\sigma } + \left( \int _{{\mathbb {R}}^d}|z|^p\mu (dz)\right) ^{p_\sigma /p}\right] . \end{aligned}$$
  (A.5)

    There exist \(c_2, c_3 > 0\) such that, for each \((t,x,\mu ,a) \in [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \times A\),

    $$\begin{aligned}&|g(x,\mu )| \le c_2\left( 1 + |x|^p + \int _{{\mathbb {R}}^d}|z|^p\mu (dz)\right) ,\\&\quad -c_2\left( 1 + |x|^p + \int _{{\mathbb {R}}^d}|z|^p\mu (dz) + |a|^{p'}\right) \le f(t,x,\mu ,a) \\&\quad \le c_2\left( 1 + |x|^p + \int _{{\mathbb {R}}^d}|z|^p\mu (dz)\right) - c_3|a|^{p'}. \end{aligned}$$

While these assumptions are fairly general, they do not cover all linear-quadratic models. Because of the requirement \(p' > p\), the running objective f may grow quadratically in a only if its growth in \((x,\mu )\) is strictly subquadratic. This requirement is important for compactness purposes, both for the results of this paper and for the existence results of [11, 29]. In fact, [11, 29] provide examples of MFGs with \(p'=p\) which do not admit solutions even though they satisfy the rest of Assumption A. Existence results for this somewhat delicate boundary case have been obtained in [6, 8, 10, 12] by assuming some additional inequalities between coefficients. It seems reasonable to expect that our main results adapt to such settings, but we do not pursue this here.

2.2 Relaxed controls and mean field games

Define \({\mathcal {V}}\) to be the set of measures q on \([0,T] \times A\) with first marginal equal to Lebesgue measure, i.e. \(q([s,t] \times A) = t-s\) for \(0 \le s \le t \le T\), satisfying also

$$\begin{aligned} \int _{[0,T] \times A}|a|^pq(dt,da) < \infty . \end{aligned}$$

Since these measures have mass T, we may endow \({\mathcal {V}}\) with a suitable scaling of the p-Wasserstein metric. Each \(q \in {\mathcal {V}}\) may be identified with a measurable function \([0,T] \ni t \mapsto q_t \in {\mathcal {P}}^p(A)\), determined uniquely (up to a.e. equality) by \(dtq_t(da) = q(dt,da)\). It is known that \({\mathcal {V}}\) is a Polish space, and in fact if A is compact then so is \({\mathcal {V}}\); see [29, Appendix A] for more details. The elements of \({\mathcal {V}}\) are called relaxed controls, and \(q \in {\mathcal {V}}\) is called a strict control if it satisfies \(q(dt,da) = dt\delta _{\alpha _t}(da)\) for some measurable function \([0,T] \ni t \mapsto \alpha _t \in A\). Finally, if we are given a measurable process \((\Lambda _t)_{t \in [0,T]}\) with values in \({\mathcal {P}}(A)\) defined on some measurable space and with \(\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty \), we write \(\Lambda = dt\Lambda _t(da)\) for the corresponding random element of \({\mathcal {V}}\).
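
Concretely, after discretizing time and the action space (both grids are assumptions of this illustration), an element \(q \in {\mathcal {V}}\) can be stored as a matrix whose row t is the kernel \(q_t \in {\mathcal {P}}(A)\); a strict control then corresponds to one-hot rows.

```python
import numpy as np

steps, a_grid = 100, np.linspace(-1.0, 1.0, 21)   # time grid, A = [-1,1]

# A relaxed control: row t is the probability vector q_t on the action grid,
# so that q(dt, da) = dt * q_t(da) once rows are weighted by dt = T/steps.
q_relaxed = np.full((steps, a_grid.size), 1.0 / a_grid.size)  # uniform mixing

# A strict control alpha_t = sin(2*pi*t/T), encoded as one-hot rows
# (q_t = delta_{alpha_t}, up to projection onto the action grid).
t = np.linspace(0.0, 1.0, steps, endpoint=False)
idx = np.abs(a_grid[None, :] - np.sin(2 * np.pi * t)[:, None]).argmin(axis=1)
q_strict = np.zeros((steps, a_grid.size))
q_strict[np.arange(steps), idx] = 1.0

assert np.allclose(q_relaxed.sum(axis=1), 1.0)    # each q_t is a probability
assert np.allclose(q_strict.sum(axis=1), 1.0)
```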

Let us define some additional canonical spaces. For a positive integer k let \({\mathcal {C}}^k = C([0,T];{\mathbb {R}}^k)\) denote the set of continuous functions from [0, T] to \({\mathbb {R}}^k\), and define the truncated supremum norms \(\Vert \cdot \Vert _t\) on \({\mathcal {C}}^k\) by

$$\begin{aligned} \Vert x\Vert _t := \sup _{s \in [0,t]}|x_s|, \ t \in [0,T]. \end{aligned}$$
(2.2)

Unless otherwise stated, \({\mathcal {C}}^k\) is endowed with the norm \(\Vert \cdot \Vert _T\) and its Borel \(\sigma \)-field. For \(\mu \in {\mathcal {P}}({\mathcal {C}}^k)\), let \(\mu _t \in {\mathcal {P}}({\mathbb {R}}^k)\) denote the image of \(\mu \) under the map \(x \mapsto x_t\). Let

$$\begin{aligned} {\mathcal {X}}&:= {\mathcal {C}}^m \times {\mathcal {V}}\times {\mathcal {C}}^d. \end{aligned}$$
(2.3)

This space will house the idiosyncratic noise, the relaxed control, and the state process. Let \(({\mathcal {F}}^{\mathcal {X}}_t)_{t \in [0,T]}\) denote the canonical filtration on \({\mathcal {X}}\), where \({\mathcal {F}}^{\mathcal {X}}_t\) is the \(\sigma \)-field generated by the maps

$$\begin{aligned} {\mathcal {X}}\ni (w,q,x)&\mapsto \left( w_s,x_s,q([0,s] \times C)\right) \in {\mathbb {R}}^m \times {\mathbb {R}}^d \times {\mathbb {R}}, \quad \text {for}\quad s \le t, \ C \in {\mathcal {B}}(A). \end{aligned}$$

For \(\mu \in {\mathcal {P}}({\mathcal {X}})\), let \(\mu ^x := \mu ({\mathcal {C}}^m \times {\mathcal {V}}\times \cdot )\) denote the \({\mathcal {C}}^d\)-marginal. Finally, for ease of notation let us define the objective functional \(\Gamma : {\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {V}}\times {\mathcal {C}}^d \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} \Gamma (\mu ,q,x) := \int _0^T\int _Af(t,x_t,\mu _t,a)q_t(da)dt + g(x_T,\mu _T). \end{aligned}$$
(2.4)
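
A discretized version of \(\Gamma \) is immediate from (2.4): average f against \(q_t\) over the action grid, integrate in time, and add the terminal reward. The toy coefficients in the sketch below are assumptions made for illustration, and we reuse the grid representation of relaxed controls introduced above.

```python
import numpy as np

def gamma(f, g, x_path, mu_marginals, q_weights, a_grid, T=1.0):
    """Discretized objective Gamma(mu, q, x) from (2.4): q_weights[t] plays
    the role of q_t(da) on a_grid, and mu_marginals[t] stands in for mu_t."""
    steps = q_weights.shape[0]
    dt = T / steps
    running = 0.0
    for k in range(steps):
        fa = f(k * dt, x_path[k], mu_marginals[k], a_grid)  # f on action grid
        running += dt * np.dot(q_weights[k], fa)            # integral vs q_t
    return running + g(x_path[-1], mu_marginals[-1])

# Toy data (illustrative): f = -a^2, g(x, m) = -(x - m)^2 with m a scalar
# summary of mu_t, a constant path x = 0, and a uniform relaxed control.
steps, a_grid = 100, np.linspace(-1.0, 1.0, 21)
q = np.full((steps, a_grid.size), 1.0 / a_grid.size)
x_path, mu_marg = np.zeros(steps + 1), np.zeros(steps + 1)
f = lambda t, x, m, a: -a ** 2
g = lambda x, m: -(x - m) ** 2
print(gamma(f, g, x_path, mu_marg, q, a_grid))   # roughly -1/3
```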

The following definition of weak mean field game (MFG) solution is borrowed from [11].

Definition 2.1

A weak MFG solution with weak control (with initial state distribution \(\lambda \)), or simply a weak MFG solution, is a tuple \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\), where \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) is a complete filtered probability space supporting \((B,W,\mu ,\Lambda ,X)\) satisfying

  (1)

    \((B_t)_{t \in [0,T]}\) and \((W_t)_{t \in [0,T]}\) are independent \(({\mathcal {F}}_t)_{t \in [0,T]}\)-Wiener processes of respective dimension \(m_0\) and m, the process \((X_t)_{t \in [0,T]}\) is \(({\mathcal {F}}_t)_{t \in [0,T]}\)-adapted with values in \({\mathbb {R}}^d\), and \(P \circ X_0^{-1} = \lambda \). Moreover, \(\mu \) is a random element of \({\mathcal {P}}^p({\mathcal {X}})\) such that \(\mu (C)\) is \({\mathcal {F}}_t\)-measurable for each \(C \in {\mathcal {F}}^{\mathcal {X}}_t\) and \(t \in [0,T]\).

  (2)

    \(X_0\), W, and \((B,\mu )\) are independent.

  (3)

    \((\Lambda _t)_{t \in [0,T]}\) is \(({\mathcal {F}}_t)_{t \in [0,T]}\)-progressively measurable with values in \({\mathcal {P}}(A)\) and

    $$\begin{aligned} {\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty . \end{aligned}$$

    Moreover, \(\sigma (\Lambda _s : s \le t)\) is conditionally independent of \({\mathcal {F}}^{X_0,B,W,\mu }_T\) given \({\mathcal {F}}^{X_0,B,W,\mu }_t\), for each \(t \in [0,T]\), where

    $$\begin{aligned} {\mathcal {F}}^{X_0,B,W,\mu }_t&= \sigma \left( X_0,B_s,W_s,\mu (C) : s \le t, \ C \in {\mathcal {F}}^{\mathcal {X}}_t\right) . \end{aligned}$$
  (4)

    The state equation holds:

    $$\begin{aligned} dX_t = \int _Ab(t,X_t,\mu ^x_t,a)\Lambda _t(da)dt + \sigma (t,X_t,\mu ^x_t)dW_t + \sigma _0(t,X_t,\mu ^x_t)dB_t. \end{aligned}$$
    (2.5)
  (5)

    If \(({\widetilde{\Omega }}',({\mathcal {F}}'_t)_{t \in [0,T]},P')\) is another filtered probability space supporting \((B',W',\mu ',\Lambda ',X')\) satisfying (1-4) and \(P \circ (B,\mu )^{-1} = P' \circ (B',\mu ')^{-1}\), then

    $$\begin{aligned} {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ,X)\right] \ge {\mathbb {E}}^{P'}\left[ \Gamma (\mu '^x,\Lambda ',X')\right] . \end{aligned}$$
  (6)

    \(\mu \) is a version of the conditional law of \((W,\Lambda ,X)\) given \((B,\mu )\).

If also there exists an A-valued process \((\alpha _t)_{t \in [0,T]}\) such that \(P(\Lambda _t = \delta _{\alpha _t} \ a.e. \ t)=1\), then we say the weak MFG solution has strict control. If this \((\alpha _t)_{t \in [0,T]}\) is progressively measurable with respect to the completion of \(({\mathcal {F}}^{X_0,B,W,\mu }_t)_{t \in [0,T]}\), we say the weak MFG solution has strong control. If \(\mu \) is a.s. B-measurable, then we have a strong MFG solution (with either weak control, strict control, or strong control).
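
Condition (6) ties \(\mu \) to the common noise. Numerically, one generic way to approximate such a conditional McKean–Vlasov dynamic (a standard particle heuristic, not a construction from this paper) is to freeze a single path of B, drive many particles by independent copies of W, and use their empirical measure as a proxy for the conditional law given B; the coefficients below are illustrative assumptions.

```python
import numpy as np

def conditional_law_given_B(n_particles=2000, T=1.0, steps=200, seed=0):
    """Approximate the conditional law of X_T given B for the dynamics
    dX = (mean - X) dt + dW + dB (illustrative coefficients): one shared
    B path, independent W^i, and the particles' empirical measure as a
    stand-in for the conditional law given B."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    X = np.zeros(n_particles)
    for _ in range(steps):
        dB = rng.standard_normal() * np.sqrt(dt)             # one B for all
        dW = rng.standard_normal(n_particles) * np.sqrt(dt)  # i.i.d. W^i
        X = X + (X.mean() - X) * dt + dW + dB
    return X   # samples from (approximately) Law(X_T | B)

samples = conditional_law_given_B()
# Rerunning with another seed draws a different B path, and the conditional
# mean changes with it: the measure is genuinely random.
print("conditional mean given this B path:", samples.mean())
```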

Given a weak MFG solution \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\), we may view \((X_0,B,W,\mu ,\Lambda ,X)\) as a random element of the canonical space

$$\begin{aligned} \Omega := {\mathbb {R}}^d \times {\mathcal {C}}^{m_0} \times {\mathcal {C}}^m \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {V}}\times {\mathcal {C}}^d. \end{aligned}$$
(2.6)

A weak MFG solution thus induces a probability measure on \(\Omega \), which itself we would like to call a MFG solution, as it is really the object of interest more than the particular probability space. (The initial state \(X_0\) is singled out mostly for notational convenience later on and to be consistent with the paper [11].) The following definition will be reformulated in Sect. 4 in a more intrinsic manner.

Definition 2.2

If \(P \in {\mathcal {P}}(\Omega )\) satisfies \(P = P' \circ (X_0,B,W,\mu ,\Lambda ,X)^{-1}\) for some weak MFG solution \((\Omega ',({\mathcal {F}}'_t)_{t \in [0,T]},P',B,W,\mu ,\Lambda ,X)\), then we refer to P itself as a weak MFG solution. Naturally, we may also refer to P as a weak MFG solution with strict control or strong control, or as a strong MFG solution, under the analogous additional assumptions.

2.3 Finite-player games

This section describes a general form of the finite-player games, allowing controls to be relaxed and adapted to general filtrations.

An n-player environment is defined to be any tuple \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\), where \((\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n)\) is a complete filtered probability space supporting an \({\mathcal {F}}^n_0\)-measurable \(({\mathbb {R}}^d)^n\)-valued random variable \(\xi = (\xi ^1,\ldots ,\xi ^n)\) with law \(\lambda ^{\times n}\), an \(m_0\)-dimensional \(({\mathcal {F}}^n_t)_{t \in [0,T]}\)-Wiener process B, and an nm-dimensional \(({\mathcal {F}}^n_t)_{t \in [0,T]}\)-Wiener process \(W = (W^1,\ldots ,W^n)\), independent of B. For simplicity, we consider i.i.d. initial states \(\xi ^1,\ldots ,\xi ^n\) with common law \(\lambda \), although it is presumably possible to generalize this. Perhaps all of the notation here should be parametrized by \({\mathcal {E}}_n\) or an additional index for n, but, since we will typically focus on a fixed sequence of environments \(({\mathcal {E}}_n)_{n=1}^\infty \), we avoid complicating the notation. Indeed, the subscript n on the measure \({\mathbb {P}}_n\) will be enough to remind us on which environment we are working at any moment.

Until further notice, we work with a fixed n-player environment \({\mathcal {E}}_n\). An admissible control is any \(({\mathcal {F}}^n_t)_{t \in [0,T]}\)-progressively measurable \({\mathcal {P}}(A)\)-valued process \((\Lambda _t)_{t \in [0,T]}\) satisfying

$$\begin{aligned} {\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty . \end{aligned}$$

An admissible strategy is a vector of n admissible controls. The set of admissible controls is denoted \({\mathcal {A}}_n({\mathcal {E}}_n)\), and accordingly the set of admissible strategies is the Cartesian product \({\mathcal {A}}_n^n({\mathcal {E}}_n)\). A strict control is any control \(\Lambda \in {\mathcal {A}}_n({\mathcal {E}}_n)\) such that \({\mathbb {P}}_n(\Lambda _t = \delta _{\alpha _t}, \ a.e. \ t) = 1\) for some \(({\mathcal {F}}^n_t)_{t \in [0,T]}\)-progressively measurable A-valued process \((\alpha _t)_{t \in [0,T]}\), and a strict strategy is any vector of n strict controls. Given an admissible strategy \(\Lambda =(\Lambda ^1,\ldots ,\Lambda ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), define the state processes \(X[\Lambda ] := (X^1[\Lambda ],\ldots ,X^n[\Lambda ])\) by

$$\begin{aligned} dX^i_t[\Lambda ]&= \int _Ab(t,X^i_t[\Lambda ],{\widehat{\mu }}^x_t[\Lambda ],a)\Lambda ^i_t(da)dt + \sigma (t,X^i_t[\Lambda ],{\widehat{\mu }}^x_t[\Lambda ])dW^i_t \\&\quad + \sigma _0(t,X^i_t[\Lambda ],{\widehat{\mu }}^x_t[\Lambda ])dB_t, \quad \quad X^i_0 = \xi ^i, \\ {\widehat{\mu }}^x[\Lambda ]&:= \frac{1}{n}\sum _{k=1}^n\delta _{X^k[\Lambda ]}. \end{aligned}$$

Note that Assumption A ensures that a unique strong solution of this SDE system exists. Indeed, the Lipschitz assumption of (A.4) and the obvious inequality

$$\begin{aligned} \ell ^p_{{\mathbb {R}}^d,p}\left( \frac{1}{n}\sum _{i=1}^n\delta _{x_i},\frac{1}{n} \sum _{i=1}^n\delta _{y_i}\right) \le \frac{1}{n}\sum _{i=1}^n|x_i-y_i|^p \end{aligned}$$

together imply, for example, that the function

$$\begin{aligned} ({\mathbb {R}}^d)^n \ni (x_1,\ldots ,x_n) \mapsto b\left( t,x_1,\frac{1}{n}\sum _{i=1}^n\delta _{x_i},a\right) \in {\mathbb {R}}^d \end{aligned}$$

is Lipschitz, uniformly in \((t,a)\). A standard estimate using assumption (A.4), which is worked out in Lemma 5.1, shows that \({\mathbb {E}}^{{\mathbb {P}}_n}[\Vert X^i[\Lambda ]\Vert _T^p] < \infty \) for each \(\Lambda \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), \(n \ge i \ge 1\).
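
The inequality displayed above amounts to evaluating (2.1) at the "identity" coupling pairing \(x_i\) with \(y_i\), which is one admissible coupling of the two empirical measures. A quick numerical sanity check in one dimension, where the exact distance between equal-size empirical measures is given by sorted matching (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 2
x, y = rng.standard_normal(n), rng.standard_normal(n) * 0.5 + 1.0

# Exact W_p^p between the empirical measures (sorted/monotone coupling)...
wpp_exact = np.mean(np.abs(np.sort(x) - np.sort(y)) ** p)
# ...is dominated by the cost of the identity coupling (x_i, y_i):
identity_cost = np.mean(np.abs(x - y) ** p)
assert wpp_exact <= identity_cost + 1e-12
print(wpp_exact, "<=", identity_cost)
```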

The value for player i corresponding to a strategy \(\Lambda = (\Lambda ^1,\ldots ,\Lambda ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) is defined by

$$\begin{aligned} J_i(\Lambda ) := {\mathbb {E}}^{{\mathbb {P}}_n}\left[ \Gamma ({\widehat{\mu }}^x[\Lambda ], \Lambda ^i,X^i[\Lambda ])\right] . \end{aligned}$$
(2.7)

Note that \(J_i(\Lambda )\) is well defined and satisfies \(J_i(\Lambda ) < \infty \), because of the upper bounds of assumption (A.5), but it is possible that \(J_i(\Lambda ) = -\infty \), since we do not require that an admissible control possess a finite moment of order \(p'\). Given a strategy \(\Lambda = (\Lambda ^1,\ldots ,\Lambda ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) and a control \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\), define a new strategy \((\Lambda ^{-i},\beta ) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) by

$$\begin{aligned} (\Lambda ^{-i},\beta ) = (\Lambda ^1,\ldots ,\Lambda ^{i-1},\beta ,\Lambda ^{i+1},\ldots ,\Lambda ^n). \end{aligned}$$

Given \(\epsilon = (\epsilon _1,\ldots ,\epsilon _n) \in [0,\infty )^n\), a relaxed \(\epsilon \)-Nash equilibrium in \({\mathcal {E}}_n\) is any strategy \(\Lambda \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) satisfying

$$\begin{aligned} J_i(\Lambda ) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\Lambda ^{-i},\beta )) - \epsilon _i, \quad i=1,\ldots ,n. \end{aligned}$$

Naturally, if \(\epsilon _i=0\) for each \(i=1,\ldots ,n\), we use the simpler term Nash equilibrium, as opposed to 0-Nash equilibrium. A strict \(\epsilon \)-Nash equilibrium in \({\mathcal {E}}_n\) is any strict strategy \(\Lambda \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) satisfying

$$\begin{aligned} J_i(\Lambda ) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n) \text { strict}}J_i((\Lambda ^{-i},\beta )) - \epsilon _i, \quad i=1,\ldots ,n. \end{aligned}$$

Note that the optimality is required only among strict controls.

Note that the role of the filtration \(({\mathcal {F}}^n_t)_{t \in [0,T]}\) in the environment \({\mathcal {E}}_n\) is mainly to specify the class of admissible controls. We are particularly interested in the sub-filtration generated by the Wiener processes and initial states; define \(({\mathcal {F}}^{s,n}_t)_{t \in [0,T]}\) to be the \({\mathbb {P}}_n\)-completion of

$$\begin{aligned} \left( \sigma (\xi ,B_s,W_s : s \le t)\right) _{t \in [0,T]}. \end{aligned}$$
(2.8)

Of course, \({\mathcal {F}}^{s,n}_t \subset {\mathcal {F}}^n_t\) for each t. Let us say that \(\Lambda \in {\mathcal {A}}_n({\mathcal {E}}_n)\) is a strong control if \({\mathbb {P}}_n(\Lambda _t = \delta _{\alpha _t} \ a.e. \ t)=1\) for some \(({\mathcal {F}}^{s,n}_t)_{t \in [0,T]}\)-progressively measurable A-valued process \((\alpha _t)_{t \in [0,T]}\). Naturally, a strong strategy is a vector of strong controls. A strong \(\epsilon \)-Nash equilibrium in \({\mathcal {E}}_n\) is any strong strategy \(\Lambda \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) such that

$$\begin{aligned} J_i(\Lambda ) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n) \text { strong}}J_i((\Lambda ^{-i},\beta )) - \epsilon _i, \quad i=1,\ldots ,n. \end{aligned}$$

Remark 2.3

Equivalently, a strong \(\epsilon \)-Nash equilibrium in \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\) is a strict \(\epsilon \)-Nash equilibrium in \({\widetilde{{\mathcal {E}}}}_n := (\Omega _n,({\mathcal {F}}^{s,n}_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\).

The most common type of Nash equilibrium considered in the literature is, in our terminology, a strong Nash equilibrium. The next proposition assures us that our equilibrium concept using relaxed controls (and general filtrations) truly generalizes this more standard situation, thus permitting a unified analysis of all of the equilibria described thus far. The proof is deferred to Appendix A.1.

Proposition 2.4

On any n-player environment \({\mathcal {E}}_n\), every strong \(\epsilon \)-Nash equilibrium is also a strict \(\epsilon \)-Nash equilibrium, and every strict \(\epsilon \)-Nash equilibrium is also a relaxed \(\epsilon \)-Nash equilibrium.

Remark 2.5

Another common type of strategy in dynamic game theory is called closed-loop. Whereas our strategies (also called open-loop) are specified by processes, a closed-loop (strict) strategy is specified by feedback functions \(\phi _i : [0,T] \,\times \,({\mathbb {R}}^d)^n \rightarrow A\), for \(i=1,\ldots ,n\), to be evaluated along the path of the state process. In the model of Carmona et al. [12], both the open-loop and closed-loop equilibria are computed explicitly for the n-player games, and they are shown to converge to the same MFG limit. A natural question, which this paper does not attempt to answer, is whether or not closed-loop equilibria converge to the same MFG limit that we obtain in Theorem 2.6.

2.4 The main limit theorem

We are ready now to state the first main Theorem 2.6 and its corollaries. The proof is deferred to Sect. 5. Given an admissible strategy \(\Lambda = (\Lambda ^1,\ldots ,\Lambda ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) defined on some n-player environment \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\), define (on \(\Omega _n\)) the random element \({\widehat{\mu }}[\Lambda ]\) of \({\mathcal {P}}^p({\mathcal {X}})\) (recalling the definition of \({\mathcal {X}}\) from (2.3)) by

$$\begin{aligned} {\widehat{\mu }}[\Lambda ] := \frac{1}{n}\sum _{i=1}^n\delta _{(W^i,\Lambda ^i,X^i[\Lambda ])}. \end{aligned}$$

As usual, we identify a \({\mathcal {P}}(A)\)-valued process \((\Lambda ^i_t)_{t \in [0,T]}\) with the random element \(\Lambda ^i = dt\Lambda ^i_t(da)\) of \({\mathcal {V}}\). Recall the definition of the canonical space \(\Omega \) from (2.6).

Theorem 2.6

Suppose Assumption A holds. For each n, let \(\epsilon ^n = (\epsilon ^n_1,\ldots ,\epsilon ^n_n) \in [0,\infty )^n\), and let \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\) be any n-player environment. Assume

$$\begin{aligned} \lim _{n \rightarrow \infty } \frac{1}{n}\sum _{i=1}^n\epsilon ^n_i = 0. \end{aligned}$$
(2.9)

Suppose for each n that \(\Lambda ^n = (\Lambda ^{n,1},\ldots ,\Lambda ^{n,n}) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) is a relaxed \(\epsilon ^n\)-Nash equilibrium, and let

$$\begin{aligned} P_n := \frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ \left( \xi ^i,B,W^i,{\widehat{\mu }}[\Lambda ^n], \Lambda ^{n,i}, X^i[\Lambda ^n]\right) ^{-1}. \end{aligned}$$
(2.10)

Then \((P_n)_{n=1}^\infty \) is relatively compact in \({\mathcal {P}}^p(\Omega )\), and each limit point is a weak MFG solution.

Remark 2.7

Averaging over \(i=1,\ldots ,n\) in (2.10) circumvents the problem that the strategies \((\Lambda ^{n,1},\ldots ,\Lambda ^{n,n})\) need not be exchangeable, and we note that the limiting behavior of \({\mathbb {P}}_n \circ (B,{\widehat{\mu }}[\Lambda ^n])^{-1}\) can always be recovered from that of \(P_n\). To interpret the definition of \(P_n\), note that we may write

$$\begin{aligned} P_n = {\mathbb {P}}_n \circ \left( \xi ^{U_n},B,W^{U_n},{\widehat{\mu }}[\Lambda ^n], \Lambda ^{n,U_n}, X^{U_n} [\Lambda ^n]\right) ^{-1}, \end{aligned}$$

where \(U_n\) is a random variable independent of \({\mathcal {F}}^n_T\), uniformly distributed among \(\{1,\ldots ,n\}\), constructed by extending the probability space \(\Omega _n\). In words, \(P_n\) is the joint law of the processes relevant to a randomly selected representative agent. Of course, Theorem 2.6 specializes when there is exchangeability, in the following sense. For any set E, any element \(e = (e^1,\ldots ,e^n) \in E^n\), and any permutation \(\pi \) of \(\{1,\ldots ,n\}\), let \(e_\pi := (e^{\pi (1)},\ldots ,e^{\pi (n)})\). If

$$\begin{aligned} {\mathbb {P}}_n \circ \left( \xi _\pi ,B,W_\pi ,\Lambda ^n_\pi \right) ^{-1} \end{aligned}$$

is independent of the choice of permutation \(\pi \), then so is

$$\begin{aligned} {\mathbb {P}}_n \circ \left( \xi _\pi ,B,W_\pi ,{\widehat{\mu }}[\Lambda ^n_\pi ],\Lambda ^n_\pi , X[\Lambda ^n_\pi ]_\pi \right) ^{-1}. \end{aligned}$$

It then follows that

$$\begin{aligned} P_n = {\mathbb {P}}_n \circ \left( \xi ^k,B,W^k,{\widehat{\mu }}[\Lambda ^n],\Lambda ^{n,k}, X^k[\Lambda ^n]\right) ^{-1}, \quad \text {for}\quad n \ge k. \end{aligned}$$

Theorem 2.6 is stated in quite a bit of generality, devoid even of standard convexity assumptions on the objective functions f and g. It includes quite degenerate cases, such as the case of no objectives, where \(f \equiv g \equiv 0\) and A is compact. In this case, any strategy profile whatsoever in the n-player game is a Nash equilibrium, and any weak control can arise in the limit. Exploiting results of [11], the following corollaries demonstrate how, under various additional convexity assumptions, we may refine the conclusion of Theorem 2.6 by ruling out certain types of limits, such as those involving relaxed controls.

Corollary 2.8

Suppose the assumptions of Theorem 2.6 hold, and assume also that for each \((t,x,\mu ) \in [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d)\) the following subset of \({\mathbb {R}}^d \times {\mathbb {R}}\) is convex:

$$\begin{aligned} \left\{ (b(t,x,\mu ,a),z) : a \in A, \ z \le f(t,x,\mu ,a) \right\} . \end{aligned}$$

Then

$$\begin{aligned} \left\{ \frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ \left( B,W^i,{\widehat{\mu }}^x[\Lambda ^n],X^i[\Lambda ^n]\right) ^{-1} : n \ge 1 \right\} \end{aligned}$$

is relatively compact in \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times {\mathcal {C}}^m \times {\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d)\), and every limit is of the form \(P \circ (B,W,\mu ^x,X)^{-1}\), for some weak MFG solution with strict control \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\).

Proof

This follows from Theorem 2.6 and the argument of [11, Theorem 4.1]. Indeed, the latter shows that for every weak MFG solution with weak control \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\), there exists a weak MFG solution with strict control \(({\widetilde{\Omega }}',({\mathcal {F}}'_t)_{t \in [0,T]},P',B',W',\mu ',\Lambda ',X')\) such that \(P \circ (B,W,\mu ^x,X)^{-1} = P' \circ (B',W',\mu '^x,X')^{-1}\). \(\square \)

Corollary 2.9

Suppose the assumptions of Theorem 2.6 hold, and define \(P_n\) as in (2.10). Assume also that for each fixed \((t,\mu ) \in [0,T] \times {\mathcal {P}}^p({\mathbb {R}}^d)\), \((b,\sigma ,\sigma _0)(t,x,\mu ,a)\) is affine in (xa), \(g(x,\mu )\) is concave in x, and \(f(t,x,\mu ,a)\) is strictly concave in (xa). Then \((P_n)_{n=1}^\infty \) is relatively compact in \({\mathcal {P}}^p(\Omega )\), and every limit point is a weak MFG solution with strong control.

Proof

By [11, Proposition 4.4], the present assumptions guarantee that every weak MFG solution is a weak MFG solution with strong control. The claim then follows from Theorem 2.6. \(\square \)

Finally, we provide an example of a particularly satisfying situation, in which there is a unique MFG solution. Say that uniqueness in law holds for the MFG if any two weak MFG solutions induce the same law on \(\Omega \). The following corollary is an immediate consequence of Theorem 2.6 and the uniqueness result of [11, Theorem 6.2], which makes use of the monotonicity assumption of Lasry and Lions [31].

Corollary 2.10

Suppose the assumptions of Corollary 2.9 hold, and define \(P_n\) as in (2.10). Assume also that

  (1)

    b, \(\sigma \), and \(\sigma _0\) have no mean field term, i.e. no \(\mu \) dependence,

  (2)

    f is of the form \(f(t,x,\mu ,a) = f_1(t,x,a) + f_2(t,x,\mu )\),

  (3)

    For each \(\mu ,\nu \in {\mathcal {P}}^p({\mathcal {C}}^d)\) we have

    $$\begin{aligned} \int _{{\mathcal {C}}^d}(\mu -\nu )(dx)\left[ g(x_T,\mu _T) - g(x_T,\nu _T) + \int _0^T\left( f_2(t,x,\mu ) - f_2(t,x,\nu )\right) dt\right] \le 0. \end{aligned}$$

Then there exists a unique in law weak MFG solution, and it is a strong MFG solution with strong control. In particular, \(P_n\) converges in \({\mathcal {P}}^p(\Omega )\) to this unique MFG solution.

2.5 The converse limit theorem

This section states and discusses a converse to Theorem 2.6. For this, we need an additional technical assumption, which we note holds automatically under Assumption A in the case that the control space A is compact.

Assumption B

The function f of \((t,x,\mu ,a)\) is continuous in \((x,\mu )\), uniformly in a, for each \(t \in [0,T]\). That is,

$$\begin{aligned} \lim _{(x',\mu ') \rightarrow (x,\mu )}\sup _{a \in A}\left| f(t,x',\mu ',a) - f(t,x,\mu ,a)\right| = 0, \ \forall t \in [0,T]. \end{aligned}$$

Moreover, there exists \(c_4 > 0\) such that, for all \((t,x,x',\mu ,\mu ',a)\),

$$\begin{aligned} \left| f(t,x',\mu ',a) - f(t,x,\mu ,a)\right| \le c_4\left( 1 + |x'|^p + |x|^p + \int _{{\mathbb {R}}^d}|z|^p(\mu ' + \mu )(dz)\right) . \end{aligned}$$

Theorem 2.11

Suppose Assumptions A and B hold. Let \(P \in {\mathcal {P}}(\Omega )\) be a weak MFG solution, and for each n let \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\) be any n-player environment. Then there exist, for each n, \(\epsilon _n \ge 0\) and a strong \((\epsilon _n,\ldots ,\epsilon _n)\)-Nash equilibrium \(\Lambda ^n = (\Lambda ^{n,1},\ldots ,\Lambda ^{n,n})\) on \({\mathcal {E}}_n\), such that \(\lim _{n\rightarrow \infty }\epsilon _n = 0\) and

$$\begin{aligned} P = \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ \left( \xi ^i,B,W^i,{\widehat{\mu }}[\Lambda ^n],\Lambda ^{n,i},X^i[\Lambda ^n]\right) ^{-1}, \quad \text {in}\quad {\mathcal {P}}^p(\Omega ). \end{aligned}$$
(2.11)

Combining Theorems 2.6 and 2.11 shows that the set of weak MFG solutions is exactly the set of limits of strong approximate Nash equilibria. More precisely, the set of weak MFG solutions is exactly the set of limits

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{1}{n_k}\sum _{i=1}^{n_k}{\mathbb {P}}_{n_k} \circ \left( \xi ^i,B,W^i,{\widehat{\mu }}[\Lambda ^{n_k}],\Lambda ^{n_k,i}, X^i[\Lambda ^{n_k}]\right) ^{-1}, \end{aligned}$$

where \(\Lambda ^n \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) are strong \(\epsilon ^n\)-Nash equilibria and \(\epsilon ^n=(\epsilon ^n_1,\ldots ,\epsilon ^n_n) \in [0,\infty )^n\) satisfies (2.9). The same statement is true when the word “strong” is replaced by “strict” or “relaxed”, because of Proposition 2.4. Similarly, combining Theorem 2.11 with Corollaries 2.8 and 2.9 yields characterizations of the mean field limit without recourse to relaxed controls.

Remark 2.12

In light of Remark 2.3, the statement of Theorem 2.11 is insensitive to the choice of environments \({\mathcal {E}}_n\). Without loss of generality, they may all be assumed to satisfy \({\mathcal {F}}^n_t = {\mathcal {F}}^{s,n}_t\) (where the latter filtration was defined in (2.8)) for each t; that is, the filtration may be taken to be the one generated by the process \((\xi ,B_t,W_t)_{t \in [0,T]}\).

Remark 2.13

It follows from the proofs of Theorems 2.6 and 2.11 that the values converge as well, in the sense that \(\frac{1}{n}\sum _{i=1}^nJ_i(\Lambda ^n)\) converges (along a subsequence in the case of Theorem 2.6) to the optimal value associated with the MFG solution.

Remark 2.14

Theorem 2.11 is admittedly abstract, and not as strong in its conclusion as the typical results of this nature in the literature. Namely, in the setting without common noise, it is usually argued as in [21] that a MFG solution may be used to construct not just any sequence of approximate equilibria, but rather one consisting of symmetric distributed strategies, in which the control of agent i is of the form \({\hat{\alpha }}(t,X^i_t)\) for some function \({\hat{\alpha }}\) which depends neither on the agent i nor the number of agents n. When the measure flow is stochastic under P (i.e. when the solution is weak or involves common noise), we may naturally look for strategies of the form \({\hat{\alpha }}(t,X^i_t,{\widehat{\mu }}^x_t)\). The techniques of this paper seem too abstract to yield a result of this nature, although a careful reading of the proof of Theorem 2.11 shows that two refinements are possible: First, we may take the n-player equilibrium \(\Lambda ^n\) to be exchangeable in the sense of Remark 2.7 if we only require it to be a relaxed equilibrium, not a strong one. Second, if the given MFG solution is strong, then the strong n-player equilibrium \(\Lambda ^n\) can be taken to be exchangeable in the sense of Remark 2.7. These details stray somewhat from the main objective of the paper, so we refrain from elaborating any further. On a related note, at the level of generality of Theorem 2.11 we do not expect to obtain a rate of convergence of \(\epsilon _n\), as in [8, 27].
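
For contrast with the abstract construction behind Theorem 2.11, the classical recipe just described is easy to write down in code: each player applies a common feedback \({\hat{\alpha }}\) to her own state and the empirical measure. The sketch below is purely illustrative; the feedback used is a placeholder assumption, not an optimizer derived from a solved MFG, and we take \(b = a\), \(\sigma = 1\), \(\sigma _0 = 0\).

```python
import numpy as np

def play_distributed_strategy(alpha_hat, n=1000, T=1.0, steps=200, seed=0):
    """All n players use the same distributed feedback
    a^i_t = alpha_hat(t, X^i_t, hat{mu}^n_t), with hat{mu}^n_t summarized
    by its mean; dynamics b = a, sigma = 1, sigma0 = 0 (illustrative)."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    X = np.zeros(n)
    for k in range(steps):
        a = alpha_hat(k * dt, X, X.mean())
        X = X + a * dt + rng.standard_normal(n) * np.sqrt(dt)
    return X

# Placeholder feedback (an assumption, standing in for an MFG optimizer):
alpha_hat = lambda t, x, m: np.clip(m - x, -1.0, 1.0)
X_T = play_distributed_strategy(alpha_hat)
print(X_T.mean(), X_T.std())
```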

3 The case of no common noise

The goal of this section is to specialize the main results to MFGs without common noise. Indeed we assume that \(\sigma _0 \equiv 0\) throughout this section. Assumption A permits degenerate volatility, but when \(\sigma _0 \equiv 0\) our general definition of weak MFG solution still involves the common noise B, which in a sense should no longer play any role. To be absolutely clear, we will rewrite the definitions and the two main theorems so that they do not involve a common noise; most notably, the notion of strong controls for the finite-player games is refined to very strong controls.

The proofs of the main results of Sect. 3.1, Proposition 3.3 and Theorem 3.4, are deferred to Sect. 7, where we will see how to deduce almost all of the results without common noise from those with common noise. Crucially, even without common noise, a weak MFG solution still involves a random measure \(\mu \), and the consistency condition becomes \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ \mu )\). We illustrate by example just how different weak solutions can be from the strong solutions typically considered in the MFG literature, in which \(\mu \) is deterministic. Finally we close the section by discussing some situations in which weak solutions are concentrated on the family of strong solutions.

3.1 Definitions and results

First, let us state a simplified definition of MFG solution for the case \(\sigma _0 \equiv 0\), which is really just Definition 2.1 rewritten without B. Again, the following definition is relative to the initial state distribution \(\lambda \).

Definition 3.1

A weak MFG solution without common noise is a tuple \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\), where \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) is a complete filtered probability space supporting \((W,\mu ,\Lambda ,X)\) satisfying

  (1)

    \((W_t)_{t \in [0,T]}\) is an \(({\mathcal {F}}_t)_{t \in [0,T]}\)-Wiener process of dimension m, the process \((X_t)_{t \in [0,T]}\) is \(({\mathcal {F}}_t)_{t \in [0,T]}\)-adapted with values in \({\mathbb {R}}^d\), and \(P \circ X_0^{-1} = \lambda \). Moreover, \(\mu \) is a random element of \({\mathcal {P}}^p({\mathcal {X}})\) such that \(\mu (C)\) is \({\mathcal {F}}_t\)-measurable for each \(C \in {\mathcal {F}}^{\mathcal {X}}_t\) and \(t \in [0,T]\).

  (2)

    \(X_0\), W, and \(\mu \) are independent.

  (3)

    \((\Lambda _t)_{t \in [0,T]}\) is \(({\mathcal {F}}_t)_{t \in [0,T]}\)-progressively measurable with values in \({\mathcal {P}}(A)\) and

    $$\begin{aligned} {\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty . \end{aligned}$$

    Moreover, \(\sigma (\Lambda _s : s \le t)\) is conditionally independent of \({\mathcal {F}}^{X_0,W,\mu }_T\) given \({\mathcal {F}}^{X_0,W,\mu }_t\), for each \(t \in [0,T]\), where

    $$\begin{aligned} {\mathcal {F}}^{X_0,W,\mu }_t&= \sigma \left( X_0,W_s,\mu (C) : s \le t, \ C \in {\mathcal {F}}^{\mathcal {X}}_t\right) . \end{aligned}$$
  (4)

    The state equation holds:

    $$\begin{aligned} dX_t = \int _Ab(t,X_t,\mu ^x_t,a)\Lambda _t(da)dt + \sigma (t,X_t,\mu ^x_t)dW_t. \end{aligned}$$
    (3.1)
  (5)

    If \(({\widetilde{\Omega }}',({\mathcal {F}}'_t)_{t \in [0,T]},P')\) is another filtered probability space supporting \((W',\mu ',\Lambda ',X')\) satisfying (1-4) and \(P \circ \mu ^{-1} = P' \circ (\mu ')^{-1}\), then

    $$\begin{aligned} {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ,X)\right] \ge {\mathbb {E}}^{P'}\left[ \Gamma (\mu '^x,\Lambda ',X')\right] . \end{aligned}$$
  (6)

    \(\mu \) is a version of the conditional law of \((W,\Lambda ,X)\) given \(\mu \).

As in Definition 2.2, we may refer to the law \(P \circ (W,\mu ,\Lambda ,X)^{-1}\) itself as a weak MFG solution. Again, if also there exists an A-valued process \((\alpha _t)_{t \in [0,T]}\) such that \(P(\Lambda _t = \delta _{\alpha _t} \ a.e. \ t)=1\), then we say the MFG solution has strict control. If this \((\alpha _t)_{t \in [0,T]}\) is progressively measurable with respect to the completion of \(({\mathcal {F}}^{X_0,W,\mu }_t)_{t \in [0,T]}\), we say the MFG solution has strong control. If \(\mu \) is almost surely constant, then we have a strong MFG solution without common noise. In this case, we may abuse the terminology somewhat by saying that a measure \({\widetilde{\mu }} \in {\mathcal {P}}^p({\mathcal {X}})\) is itself a strong MFG solution (without common noise), if there exists a weak MFG solution \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) without common noise such that \(P(\mu = {\widetilde{\mu }}) = 1\).

Remark 3.2

Our notion of strong MFG solution (without common noise) with weak control corresponds to the solution concept adopted by the recent papers [15, 29]. On the other hand, a strong MFG solution (without common noise) with strong control corresponds to the usual definition of MFG solution in the literature [6, 8, 21]. This is not immediate, since our strong control is still required to be optimal among the class of weak controls, whereas the latter papers require optimality only relative to other strong controls. But in most cases, such as under Assumption A, optimality among strong controls implies optimality among weak controls, and thus our definition reduces to the standard one. This is the same phenomenon driving Propositions 2.4 and 3.3, and it is well known in control theory. Lemma 6.5 will elaborate on this point, and see also [24] or the more recent [26] for further discussion.

We continue to work with the definition of the n-player games of the previous section. Suppose we are given an n-player environment \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\), as was defined in Sect. 2.3. Let \(({\mathcal {F}}^{vs,n}_t)_{t \in [0,T]}\) denote the \({\mathbb {P}}_n\)-completion of \((\sigma (\xi ,W_s : s \le t))_{t \in [0,T]}\), that is, the filtration generated by the initial state and the idiosyncratic noises (but not the common noise). Let us say that a control \(\Lambda \in {\mathcal {A}}_n({\mathcal {E}}_n)\) is a very strong control if \({\mathbb {P}}_n(\Lambda _t = \delta _{\alpha _t} \ a.e. \ t) = 1\), for some \(({\mathcal {F}}^{vs,n}_t)_{t \in [0,T]}\)-progressively measurable A-valued process \((\alpha _t)_{t \in [0,T]}\). A very strong strategy is a vector of very strong controls. For \(\epsilon =(\epsilon _1,\ldots ,\epsilon _n) \in [0,\infty )^n\), a very strong \(\epsilon \)-Nash equilibrium in \({\mathcal {E}}_n\) is any very strong strategy \(\Lambda \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) such that

$$\begin{aligned} J_i(\Lambda ) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n) \text { very strong}}J_i((\Lambda ^{-i},\beta )) - \epsilon _i, \quad i=1,\ldots ,n. \end{aligned}$$

The very strong equilibrium is arguably the most natural notion of equilibrium in the case of no common noise, and it is certainly one of the most common in the literature. The proof of the following proposition is deferred to Appendix A.2.

Proposition 3.3

When \(\sigma _0\equiv 0\), every very strong \(\epsilon \)-Nash equilibrium is also a relaxed \(\epsilon \)-Nash equilibrium.

The following Theorem 3.4 rewrites Theorems 2.6 and 2.11 in the setting without common noise. Although this is mostly derived from Theorems 2.6 and 2.11, the proof is spelled out in Sect. 7, as it is not entirely straightforward.

Theorem 3.4

Suppose \(\sigma _0 \equiv 0\). Theorem 2.6 remains true if the term “weak MFG solution” is replaced by “weak MFG solution without common noise,” and if \(P_n\) is defined instead by

$$\begin{aligned} P_n := \frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ \left( \xi ^i,W^i,{\widehat{\mu }}[\Lambda ^n],\Lambda ^{n,i}, X^i[\Lambda ^n]\right) ^{-1}. \end{aligned}$$
(3.2)

Theorem 2.11 remains true if “weak MFG solution” is replaced by “weak MFG solution without common noise,” if \(P_n\) is defined by (3.2), and if “strong” is replaced by “very strong.”

Since strong MFG solutions are more familiar in the literature on mean field games and presumably more accessible computationally, it would be nice to have a description of weak solutions in terms of strong solutions. We will see that this is not possible in general, and the investigation of this issue highlights the fundamental difference between stochastic and deterministic equilibria (i.e. weak and strong MFG solutions). First, a discussion of a special case will help to clarify the ideas.

3.2 A digression on McKean–Vlasov equations

When there is no control (when A is a singleton), the mean field game reduces to a McKean–Vlasov equation. In this case, an interesting simplification occurs: every weak solution is simply a randomization over strong solutions. More precisely, suppose we have a system of weakly interacting diffusions, given by

$$\begin{aligned} dX^i_t&= {\tilde{b}}\left( t,X^i_t,\mu ^n_t\right) dt + {\tilde{\sigma }}\left( t,X^i_t,\mu ^n_t\right) dW^i_t, \\ \mu ^n&:= \frac{1}{n}\sum _{k=1}^n\delta _{X^k}. \end{aligned}$$

A common argument in the theory of McKean–Vlasov limits [16, 32, 34] is to show, under suitable assumptions on \(({\tilde{b}},{\tilde{\sigma }})\), that \((\mu ^n)_{n=1}^\infty \) is tight, and that every weak limit point (an element of \({\mathcal {P}}({\mathcal {P}}({\mathcal {C}}^d))\)) is concentrated on the set of solutions \(\mu \in {\mathcal {P}}({\mathcal {C}}^d)\) of the following strong McKean–Vlasov equation:

$$\begin{aligned} {\left\{ \begin{array}{ll} dX_t = {\tilde{b}}(t,X_t,\mu _t)dt + {\tilde{\sigma }}(t,X_t, \mu _t) dW_t,\\ \mu = \text {Law}(X). \end{array}\right. } \end{aligned}$$

Consider also searching for a \({\mathcal {P}}({\mathcal {C}}^d)\)-valued random variable \(\mu \) satisfying the weak McKean–Vlasov equation:

$$\begin{aligned} {\left\{ \begin{array}{ll} dX_t = {\tilde{b}}(t,X_t,\mu _t)dt + {\tilde{\sigma }}(t,X_t, \mu _t)dW_t,\\ \mu = \text {Law}(X \ | \ \mu ), \quad \text {with}\quad X_0, \mu , W \quad \text {independent}. \end{array}\right. } \end{aligned}$$

It is not difficult to check that a \({\mathcal {P}}({\mathcal {C}}^d)\)-valued random variable satisfies the weak McKean–Vlasov equation if and only if it almost surely satisfies the strong McKean–Vlasov equation. That is, every weak solution is supported on the set of strong solutions. In particular, we find that the set of strong McKean–Vlasov solutions is rich enough to characterize all of the possible limiting behaviors of the finite-particle systems.
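
For concreteness, the following minimal simulation sketch (our own illustration, with the hypothetical coefficients \({\tilde{b}}(t,x,m) = {\bar{m}} - x\) and \({\tilde{\sigma }} \equiv 1\), where \({\bar{m}}\) denotes the mean of m) runs the n-particle system by Euler–Maruyama; for these coefficients the McKean–Vlasov limit has constant mean, and the empirical mean of \(\mu ^n_T\) should match it up to \(O(n^{-1/2})\) fluctuations.

```python
import numpy as np

# A minimal sketch (our own illustration, not from the paper): Euler-Maruyama
# simulation of the weakly interacting particle system with the hypothetical
# coefficients b(t, x, m) = mean(m) - x (attraction toward the population mean)
# and sigma = 1. In the McKean-Vlasov limit, d/dt E[X_t] = 0, so E[X_t] = E[X_0].
rng = np.random.default_rng(0)
n, T, nt = 10_000, 1.0, 200
dt = T / nt
X = rng.normal(loc=1.0, scale=0.5, size=n)      # X_0 i.i.d. ~ N(1, 0.25)

for _ in range(nt):
    drift = X.mean() - X                        # b evaluated at the empirical measure
    X = X + drift * dt + np.sqrt(dt) * rng.standard_normal(n)

print(X.mean())   # close to E[X_0] = 1, up to O(n^{-1/2}) fluctuations
```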

In general, no such simplification is available for mean field games. Given a weak MFG solution (without common noise), we can still say that the random measure \(\mu \) is concentrated on the set of solutions of the corresponding McKean–Vlasov equation, but the optimality condition breaks down. This is essentially because the adaptedness requirement makes the class of admissible controls quite dependent on how random \(\mu \) is. To highlight this point, Sect. 3.3 below describes a model possessing weak MFG solutions which are not randomizations of strong MFG solutions. Section 3.4 discusses some partial results on when this simplification can occur in the MFG setting.

3.3 An illuminating example

This section describes a deceptively simple example which illustrates the difference between weak and strong solutions. Consider the time horizon \(T = 2\), the initial state distribution \(\lambda = \delta _0\), and the following data (still with \(\sigma _0 \equiv 0\)):

$$\begin{aligned} b(t,x,\nu ,a)&= a, \quad \sigma \text { constant}, \quad A = [-1,1]\\ g(x,\nu )&= x{\bar{\nu }}, \quad f \equiv 0, \end{aligned}$$

where for \(\nu \in {\mathcal {P}}^1({\mathbb {R}})\) we define \({\bar{\nu }} := \int x\nu (dx)\). Similarly, for \(\mu \in {\mathcal {P}}^1({\mathcal {X}})\) write \({\bar{\mu }}^x_t := \int _{{\mathbb {R}}}x\mu ^x_t(dx)\). Assumption A is verified by choosing \(p=2\), \(p_\sigma = 0\), and any \(p' > 2\). Let us first study the optimization problems arising in the MFG problem. Let \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,2]},P,W,\mu ,\Lambda ,X)\) satisfy (1-5) of Definition 3.1. For \(({\mathcal {F}}_t)_{t\in [0,2]}\)-progressively measurable \({\mathcal {P}}([-1,1])\)-valued processes \(\beta = (\beta _t)_{t\in [0,2]}\), define

$$\begin{aligned} {\widetilde{J}}(\beta ) := {\mathbb {E}}\left[ X^\beta _2{\bar{\mu }}^x_2\right] , \end{aligned}$$

where

$$\begin{aligned} X^\beta _t = \int _0^t\int _{[-1,1]}a\beta _s(da)ds + \sigma W_t, \ t \in [0,2]. \end{aligned}$$

Independence of W and \(\mu \) implies

$$\begin{aligned} {\widetilde{J}}(\beta ) = {\mathbb {E}}\left[ \int _0^2\int _{[-1,1]}a{\bar{\mu }}^x_2\beta _t(da)dt\right] = {\mathbb {E}}\left[ \int _0^2\int _{[-1,1]}a{\mathbb {E}}^P[{\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\beta _t]\beta _t(da)dt\right] , \end{aligned}$$

where \({\mathcal {F}}^\beta _t := \sigma (\beta _s : s \le t)\). If it is also required that \({\mathcal {F}}^\beta _t\) is conditionally independent of \({\mathcal {F}}^{X_0,W,\mu }_2\) given \({\mathcal {F}}^{X_0,W,\mu }_t\), then

$$\begin{aligned} {\mathbb {E}}^P\left[ {\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\beta _t\right] = {\mathbb {E}}^P\left[ {\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^{X_0,W,\mu }_t\right] = {\mathbb {E}}^P\left[ {\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\mu _t\right] , \end{aligned}$$

where the last equality follows from independence of \((X_0,W)\) and \(\mu \), and \({\mathcal {F}}^\mu _t := \sigma (\mu (C) : C \in {\mathcal {F}}^{\mathcal {X}}_t)\). Hence

$$\begin{aligned} {\widetilde{J}}(\beta ) = {\mathbb {E}}\left[ \int _0^2\int _{[-1,1]}a{\mathbb {E}}^P\left[ {\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^{\mu }_t\right] \beta _t(da)dt\right] . \end{aligned}$$
(3.3)

Condition (5) of Definition 3.1 implies that \(\Lambda \) maximizes \({\widetilde{J}}\) over all such processes \(\beta \), which implies that \(\Lambda _t(\omega )\) must equal \(\delta _{\alpha ^*_t(\omega )}\) on the \((t,\omega )\)-set \(\{\alpha ^* \ne 0\}\), where

$$\begin{aligned} \alpha ^*_t&:= \text {sign}\left( {\mathbb {E}}\left[ \left. {\bar{\mu }}^x_2 \right| {\mathcal {F}}^\mu _t\right] \right) , \end{aligned}$$

and we use the convention \(\text {sign}(0) := 0\).

Remark 3.5

This already highlights the key point: When \(\mu \) is deterministic, an optimal control is the constant \(\text {sign}({\bar{\mu }}^x_2)\), but when \(\mu \) is random, this control is inadmissible since it is not adapted.

Proposition 3.6

Every strong MFG solution (without common noise) satisfies \({\bar{\mu }}^x_2 \in \{-2,0,2\}\) and \({\bar{\mu }}^x_t = t\,\mathrm {sign}({\bar{\mu }}^x_2)\).

Proof

Let \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,2]},P,W,\mu ,\Lambda ,X)\) satisfy Definition 3.1, with \(\mu \) deterministic. In this case, \(\alpha ^*_t = \text {sign}({\bar{\mu }}^x_2)\) for all t. If \({\bar{\mu }}^x_2 = 0\) there is nothing to prove, so suppose \({\bar{\mu }}^x_2 \ne 0\). Then \(\Lambda _t = \delta _{\alpha ^*_t}\) must hold \(dt \otimes dP\)-a.e., and thus

$$\begin{aligned} X_t = t\,\text {sign}({\bar{\mu }}^x_2) + \sigma W_t, \ t \in [0,2]. \end{aligned}$$

The consistency condition (6) of Definition 3.1 implies \({\bar{\mu }}^x_t = {\mathbb {E}}[X_t] = t\,\text {sign}({\bar{\mu }}^x_2)\). In particular, \({\bar{\mu }}^x_2 = 2\,\text {sign}({\bar{\mu }}^x_2)\), which implies \({\bar{\mu }}^x_2 = \pm 2\) since we assumed \({\bar{\mu }}^x_2 \ne 0\). \(\square \)

Proposition 3.7

There exists a weak MFG solution (without common noise) satisfying \(P({\bar{\mu }}^x_2 = 1) = P({\bar{\mu }}^x_2 = -1) = 1/2\).

Proof

Construct on some probability space \(({\widetilde{\Omega }},{\mathcal {F}},P)\) a random variable \(\gamma \) with \(P(\gamma = 1) = P(\gamma = -1) = 1/2\) and an independent Wiener process W. Let \(\alpha ^*_t = \gamma 1_{(1,2]}(t)\) for each t (note that this interval is open on the left), and define \(({\mathcal {F}}_t)_{t \in [0,2]}\) to be the complete filtration generated by \((W_t,\alpha ^*_t)_{t \in [0,2]}\). Let

$$\begin{aligned} X_t := \int _0^t\alpha ^*_sds + \sigma W_t = (t-1)\gamma 1_{(1,2]}(t) + \sigma W_t, \ t \in [0,2]. \end{aligned}$$

Finally, let \(\Lambda = dt\delta _{\alpha ^*_t}(da)\), and define \(\mu := P((W,\Lambda ,X) \in \cdot \ | \ \gamma )\). Clearly \(\mu \) is \(\gamma \)-measurable. On the other hand, independence of \(\gamma \) and W implies

$$\begin{aligned} {\bar{\mu }}^x_2 = {\mathbb {E}}[X_2 \ | \ \gamma ] = \gamma . \end{aligned}$$

Thus \(\gamma \) is also \(\mu \)-measurable, and we conclude that \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ \mu )\). It is straightforward to check that

$$\begin{aligned} {\mathcal {F}}^\mu _t = {\left\{ \begin{array}{ll} \{\emptyset , {\widetilde{\Omega }}\} &{}\text {if } t \le 1 \\ \sigma (\gamma ) &{}\text {if } 1 < t \le 2 \end{array}\right. }. \end{aligned}$$

Thus

$$\begin{aligned} {\mathbb {E}}\left[ {\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\mu _t\right] = {\left\{ \begin{array}{ll} {\mathbb {E}}[\gamma ] = 0 &{}\text {if } t \le 1 \\ {\mathbb {E}}[\gamma \ | \ \gamma ] = \gamma &{}\text {if } 1 < t \le 2 \end{array}\right. }. \end{aligned}$$

Since \({\bar{\mu }}^x_2 = \gamma = \text {sign}(\gamma )\), we conclude that \(\alpha ^*_t = \text {sign}({\mathbb {E}}[{\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\mu _t])\). It is then readily checked using the previous arguments that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,2]},P,W,\mu ,\Lambda ,X)\) is a weak MFG solution. \(\square \)

To be absolutely clear, the above two propositions imply the following: If \(S := \{\nu \in {\mathcal {P}}({\mathcal {X}}) : {\bar{\nu }}^x_2 \in \{-2,0,2\}\}\), then the measure \(\mu \) of every strong MFG solution lies in S, but there exists a weak MFG solution with \(P(\mu \in S) = 0\).
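
The dichotomy is easy to see numerically. The following quick Monte Carlo sketch (our own illustration) of the construction in the proof of Proposition 3.7 draws \(\gamma = \pm 1\) with probability 1/2, sets \(X_2 = \gamma + \sigma W_2\), and checks that the conditional mean \({\bar{\mu }}^x_2 = {\mathbb {E}}[X_2 \ | \ \gamma ]\) equals \(\gamma \in \{-1,1\}\), squarely outside the strong-solution set \(\{-2,0,2\}\).

```python
import numpy as np

# A quick Monte Carlo sketch (our own illustration) of the weak MFG solution in
# Proposition 3.7: gamma = +/-1 with probability 1/2, alpha*_t = gamma 1_{(1,2]}(t),
# so X_2 = gamma + sigma * W_2. We estimate bar{mu}^x_2 = E[X_2 | gamma].
rng = np.random.default_rng(0)
N, sigma = 200_000, 1.0
gamma = rng.choice([-1.0, 1.0], size=N)
W2 = np.sqrt(2.0) * rng.standard_normal(N)   # W_2 ~ N(0, 2)
X2 = gamma + sigma * W2                      # X_2 on the event {gamma = +/-1}

for g in (-1.0, 1.0):
    print(g, X2[gamma == g].mean())   # approximately g: bar{mu}^x_2 = gamma, not in {-2,0,2}
```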

Remark 3.8

The example of Proposition 3.7 can be modified to illustrate another strange phenomenon. The proof of Proposition 3.7 has \(\alpha ^*_t = \gamma \) for \(t \in (1,2]\) and \(\alpha ^*_t=0\) for \(t \le 1\). Instead, we could set \(\alpha ^*_t = \eta _t\) for \(t \le 1\), for any mean-zero \([-1,1]\)-valued process \((\eta _t)_{t \in [0,1]}\) independent of \(\gamma \) and W. The rest of the proof proceeds unchanged, yielding another weak MFG solution with the same conditional mean state \({\bar{\mu }}^x\), but with a different conditional law \(\mu ^x\). (In fact, we could even choose \(\alpha ^*\) to be any mean-zero relaxed control on the time interval [0, 1].) Intuitively, for \(t \le 1\) we have \({\mathbb {E}}[{\bar{\mu }}^x_2 \ | \ {\mathcal {F}}^\mu _t] = 0\), and the choice of control on the time interval [0, 1] does not matter in light of (3.3); the agent then has some freedom to randomize her choice of control among the family of non-unique optimal choices. This type of randomization can typically occur when optimal controls are non-unique, and although it is unnatural in some sense, Theorem 2.6 indicates that this behavior can indeed arise in the limit from the finite-player games.

3.4 Supports of weak solutions

In this section, we attempt to partially explain what permits the existence of weak solutions which are not randomizations of strong solutions. As was mentioned in Remark 3.5, the culprit is the adaptedness required of controls. Indeed, in the example of Sect. 3.3, very different optimal controls arise depending on whether or not the measure \(\mu \) is random. If \(\mu \) is deterministic, then so is the optimal control, and we may write this optimal control as a functional of \(\mu \) by

$$\begin{aligned} {\hat{\alpha }}^D(t,\mu ) = \text {sign}({\bar{\mu }}^x_T), \ t \in [0,T]. \end{aligned}$$

The problem is as follows: for each fixed deterministic \(\mu \), the optimal control \(({\hat{\alpha }}^D(t,\mu ))_{t \in [0,T]}\) is deterministic and thus trivially adapted, but when \(\mu \) is allowed to be random this control is no longer adapted and thus no longer admissible. If, for a different MFG problem, it happens that \({\hat{\alpha }}^D\) is in fact progressively measurable with respect to \(({\mathcal {F}}^\mu _t)_{t \in [0,T]}\), then this control is still admissible when \(\mu \) is randomized; moreover, it should be optimal when \(\mu \) is randomized, since it was optimal for each realization of \(\mu \). The following results make this idea precise, but first some terminology will be useful. As usual, we work under Assumption A at all times, and the initial state distribution \(\lambda \in {\mathcal {P}}^{p'}({\mathbb {R}}^d)\) is fixed.

Definition 3.9

We say that a function \({\hat{\alpha }} : [0,T] \times {\mathcal {C}}^m \times {\mathcal {C}}^d \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow A\) is a universally admissible control if:

  1. (1)

    \({\hat{\alpha }}\) is progressively measurable with respect to the (universal completion of the) natural filtration \(({\mathcal {F}}^{W,X,\mu }_t)_{t \in [0,T]}\) on \({\mathcal {C}}^m \times {\mathcal {C}}^d \times {\mathcal {P}}^p({\mathcal {X}})\). Here \({\mathcal {F}}^{W,X,\mu }_t := \sigma (W_s,X_s,\mu (C) : s \le t, \ C \in {\mathcal {F}}^{\mathcal {X}}_t)\) for each t, where \((W,X,\mu )\) denotes the identity map on \({\mathcal {C}}^m \times {\mathcal {C}}^d \times {\mathcal {P}}^p({\mathcal {X}})\).

  2. (2)

    For each fixed \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\), the SDE

    $$\begin{aligned} dX_t = b(t,X_t,\nu ^x_t,{\hat{\alpha }}(t,W,X,\nu ))dt + \sigma (t,X_t,\nu ^x_t)dW_t, \ X_0 \sim \lambda , \end{aligned}$$
    (3.4)

    is unique in joint law; that is, if for \(i=1,2\) the pair \((W^i,X^i) = (W^i_t,X^i_t)_{t \in [0,T]}\) solves the SDE (3.4) on some filtered probability space, with \(W^i\) a Wiener process in either case, then \((W^1,X^1)\) and \((W^2,X^2)\) have the same law.

  3. (3)

    Suppose we are given a filtered probability space \(({\widetilde{\Omega }},({\widetilde{{\mathcal {F}}}}_t)_{t \in [0,T]},{\widetilde{P}})\) supporting an \(({\widetilde{{\mathcal {F}}}}_t)_{t \in [0,T]}\)-Wiener process \({\widetilde{W}}\), an \({\widetilde{{\mathcal {F}}}}_0\)-measurable \({\mathbb {R}}^d\)-valued random variable \({\widetilde{\xi }}\) with law \(\lambda \), and a \({\mathcal {P}}^p({\mathcal {X}})\)-valued random variable \({\tilde{\mu }}\) independent of \(({\widetilde{\xi }},{\widetilde{W}})\) such that \({\tilde{\mu }}(C)\) is \({\widetilde{{\mathcal {F}}}}_t\)-measurable for each \(C \in {\mathcal {F}}^{\mathcal {X}}_t\) and \(t \in [0,T]\). Then there exists a strong solution \({\widetilde{X}}\) of the SDE

    $$\begin{aligned} d{\widetilde{X}}_t = b(t,{\widetilde{X}}_t,{\tilde{\mu }}^x_t,{\hat{\alpha }} (t,{\widetilde{W}},{\widetilde{X}}, {\tilde{\mu }}))dt + \sigma (t,{\widetilde{X}}_t, {\tilde{\mu }}^x_t)d{\widetilde{W}}_t, \ {\widetilde{X}}_0 = {\widetilde{\xi }}, \end{aligned}$$

    and it satisfies \({\mathbb {E}}\int _0^T|{\hat{\alpha }}(t,{\widetilde{W}},{\widetilde{X}},{\tilde{\mu }})|^pdt < \infty \).

If \({\hat{\alpha }}\) is a universally admissible control, we say it is locally optimal if for each fixed \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\) there exists a complete filtered probability space \((\Omega ^{(\nu )},({\mathcal {F}}^{(\nu )}_t)_{t \in [0,T]},P^\nu )\) supporting a Wiener process \(W^\nu \) and a continuous adapted process \(X^\nu \) such that \((W^\nu ,X^\nu )\) satisfies the SDE (3.4) and:

  1. (4)

    If \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) supports an m-dimensional Wiener process W, a progressive \({\mathcal {P}}(A)\)-valued process \(\Lambda \) satisfying \({\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty \), and a continuous adapted \({\mathbb {R}}^d\)-valued process X satisfying

    $$\begin{aligned} dX_t = \int _Ab\left( t,X_t,\nu ^x_t,a\right) \Lambda _t(da)dt + \sigma \left( t,X_t,\nu ^x_t\right) dW_t, \ P \circ X_0^{-1} = \lambda , \end{aligned}$$

    then

    $$\begin{aligned} {\mathbb {E}}^{P^\nu }\left[ \Gamma \left( \nu ^x,dt\delta _{{\hat{\alpha }} (t,W^\nu ,X^\nu ,\nu )}(da),X^\nu \right) \right] \ge {\mathbb {E}}^P\left[ \Gamma (\nu ^x,\Lambda ,X)\right] . \end{aligned}$$

We need an additional Assumption C, which simply requires uniqueness of the optimal controls. Some simple conditions are given in [11, Proposition 4.4] under which Assumption C holds: in particular, it suffices to assume that b and \(\sigma \) are affine in \((x,a)\), that f is strictly concave in \((x,a)\), and that g is concave in x.

Assumption C

If \(({\widetilde{\Omega }}^i,({\mathcal {F}}^i_t)_{t \in [0,T]},P^i,W^i,\mu ^i,\Lambda ^i,X^i)\) for \(i=1,2\) both satisfy (1-5) of Definition 3.1 as well as \(P^1 \circ (\mu ^1)^{-1} = P^2 \circ (\mu ^2)^{-1}\), then \(P^1 \circ (W^1,\mu ^1,\Lambda ^1,X^1)^{-1} = P^2 \circ (W^2,\mu ^2,\Lambda ^2,X^2)^{-1}\).

Theorem 3.10

Suppose that Assumption C holds and that there exists a universally admissible and locally optimal control \({\hat{\alpha }} : [0,T] \times {\mathcal {C}}^m \times {\mathcal {C}}^d \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow A\). Then, for every weak MFG solution \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) (without common noise), \(P \circ \mu ^{-1}\) is concentrated on the set of strong MFG solutions (without common noise). Conversely, if \(\rho \in {\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {X}}))\) is concentrated on the set of strong MFG solutions (without common noise), then there exists a weak MFG solution (without common noise) with \(P \circ \mu ^{-1} = \rho \).

Proof

Let \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) be a weak MFG solution (without common noise).

Step 1: We will first show that necessarily \(\Lambda _t = \delta _{{\hat{\alpha }}(t,W,X,\mu )}\) holds \(dt \otimes dP\)-a.e. On \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) we may use (3) of Definition 3.9 to find a strong solution \(X'\) of the SDE

$$\begin{aligned} dX'_t = b\left( t,X'_t,\mu ^x_t,{\hat{\alpha }}(t,W,X',\mu )\right) dt + \sigma \left( t,X'_t,\mu ^x_t\right) dW_t, \ X'_0 = X_0, \end{aligned}$$

with \({\mathbb {E}}^P\int _0^T|{\hat{\alpha }}(t,W,X',\mu )|^pdt < \infty \). In particular, \(X'\) is adapted to the (completion of the) filtration \({\mathcal {F}}^{X_0,W,\mu }_t := \sigma (X_0,W_s,\mu (C) : s\le t, \ C \in {\mathcal {F}}^{\mathcal {X}}_t)\). Let \(\Lambda ' := dt\delta _{{\hat{\alpha }}(t,W,X',\mu )}(da)\). Then it is clear that \(({\widetilde{\Omega }},({\mathcal {F}}^{X_0,W,\mu }_t)_{t \in [0,T]},P,W,\mu ,\Lambda ',X')\) satisfies conditions (1-4) of Definition 3.1. Optimality of P implies

$$\begin{aligned} {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ,X)\right] \ge {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ',X')\right] . \end{aligned}$$

On the other hand, for \(P \circ \mu ^{-1}\)-a.e. \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\), the following hold under \(P(\cdot \ | \ \mu =\nu )\):

  • W is an \(({\mathcal {F}}_t)_{t \in [0,T]}\)-Wiener process.

  • \((W,\Lambda ,X)\) satisfies

    $$\begin{aligned} dX_t = \int _Ab\left( t,X_t,\nu ^x_t,a\right) \Lambda _t(da)dt + \sigma \left( t,X_t,\nu ^x_t\right) dW_t. \end{aligned}$$
  • \((W,X')\) solves the SDE (3.4).

From the local optimality of \({\hat{\alpha }}\) we conclude (keeping in mind the uniqueness condition (2) of Definition 3.9) that

$$\begin{aligned} {\mathbb {E}}^P\left[ \left. \Gamma (\mu ^x,\Lambda ,X)\right| \mu \right] \le {\mathbb {E}}^P\left[ \left. \Gamma (\mu ^x,\Lambda ',X')\right| \mu \right] . \end{aligned}$$

Thus

$$\begin{aligned} {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ,X)\right] = {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ',X')\right] . \end{aligned}$$

By Assumption C, there is only one optimal control, and so \(\Lambda = \Lambda ' = dt\delta _{{\hat{\alpha }}(t,W,X',\mu )}(da)\), P-a.s. From uniqueness of the SDE solutions we conclude that \(X=X'\) a.s. as well, completing the first step. (Note we do not use the assumptions of Definition 3.9 for this last conclusion, but only the Lipschitz assumption (A.4).)

Step 2: Next, we show that \(P \circ \mu ^{-1}\) is concentrated on the set of strong MFG solutions. Using (2) and (3) of Definition 3.9, we know that for \(P \circ \mu ^{-1}\)-a.e. \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\) there exists on some filtered probability space \((\Omega ^{(\nu )},({\mathcal {F}}^{(\nu )}_t)_{t \in [0,T]},P^\nu )\) a weak solution \(X^\nu \) of the SDE

$$\begin{aligned} dX^\nu _t = b\left( t,X^\nu _t,\nu ^x_t,{\hat{\alpha }}(t,W^\nu ,X^\nu ,\nu )\right) dt + \sigma \left( t,X^\nu _t,\nu ^x_t\right) dW^\nu _t, \ P^\nu \circ (X^\nu _0)^{-1} = \lambda , \end{aligned}$$

where \(W^\nu \) is an \(({\mathcal {F}}^{(\nu )}_t)_{t \in [0,T]}\)-Wiener process. From Step 1, on \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) we have

$$\begin{aligned} dX_t = b\left( t,X_t,\mu ^x_t,{\hat{\alpha }}(t,W,X,\mu )\right) dt + \sigma \left( t,X_t,\mu ^x_t\right) dW_t, \ P \circ X_0^{-1} = \lambda . \end{aligned}$$

It follows from the P-independence of \(\mu \), \(X_0\), and W along with the uniqueness in law of condition (2) of Definition 3.9 that

$$\begin{aligned} P((W,\Lambda ,X) \in \cdot \ | \ \mu = \nu ) = P^\nu \circ \left( W^\nu ,dt\delta _{{\hat{\alpha }}(t,W^\nu ,X^\nu ,\nu )}(da), X^\nu \right) ^{-1}, \end{aligned}$$
(3.5)

for \(P \circ \mu ^{-1}\)-a.e. \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\). Since \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ \mu )\), it follows that

$$\begin{aligned} \nu = P^\nu \circ \left( W^\nu ,dt\delta _{{\hat{\alpha }} (t,W^\nu ,X^\nu ,\nu )} (da),X^\nu \right) ^{-1},\quad \text {for}\quad P \circ \mu ^{-1}\text {-a.e. } \nu \in {\mathcal {P}}^p({\mathcal {X}}). \end{aligned}$$
(3.6)

We conclude that \(P \circ \mu ^{-1}\)-a.e. \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\) is a strong MFG solution, or more precisely that

$$\begin{aligned} \left( \Omega ^{(\nu )},({\mathcal {F}}^{(\nu )}_t)_{t \in [0,T]},P^\nu ,W^\nu ,\nu ,dt\delta _{{\hat{\alpha }} (t,W^\nu ,X^\nu ,\nu )}(da),X^\nu \right) \end{aligned}$$

is a strong MFG solution. Indeed, we just verified condition (6) of Definition 3.1, and conditions (1-4) are obvious. The optimality condition (5) of Definition 3.1 is a simple consequence of the local optimality of \({\hat{\alpha }}\).

Step 3: We turn now to the converse. Let \(({\widetilde{\Omega }},{\mathcal {F}},P)\) be any probability space supporting a random variable \((\xi ,W,\mu )\) with values in \({\mathbb {R}}^d \times {\mathcal {C}}^m \times {\mathcal {P}}^p({\mathcal {X}})\) with law \(\lambda \times {\mathcal {W}}^m \times \rho \), where \({\mathcal {W}}^m\) is Wiener measure on \({\mathcal {C}}^m\). Let \(({\mathcal {F}}_t)_{t \in [0,T]}\) denote the P-completion of \((\sigma (\xi ,W_s,\mu (C) : s \le t, \ C \in {\mathcal {F}}^{\mathcal {X}}_t))_{t \in [0,T]}\). Solve strongly on \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) the SDE

$$\begin{aligned} dX_t = b\left( t,X_t,\mu ^x_t,{\hat{\alpha }}(t,W,X,\mu )\right) dt + \sigma \left( t,X_t,\mu ^x_t\right) dW_t, \ X_0 = \xi . \end{aligned}$$

Note that hypothesis (3) makes this possible. Define \(\Lambda := dt\delta _{{\hat{\alpha }}(t,W,X,\mu )}(da)\). Clearly \(P \circ \mu ^{-1} = \rho \) by construction, and we claim that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) is a weak MFG solution. Using hypothesis (1), it is clear that conditions (1-4) of Definition 3.1 hold, and thus we must only check the optimality condition (5) and the fixed point condition (6).

First, let \(({\widetilde{\Omega }}',({\mathcal {F}}'_t)_{t \in [0,T]},P',W',\mu ',\Lambda ',X')\) be an alternative tuple satisfying (1-4) of Definition 3.1 and \(P' \circ (\mu ')^{-1} = P \circ \mu ^{-1} = \rho \). The uniqueness in law condition (2) of Definition 3.9 implies that \(P((W,X) \in \cdot \ | \ \mu = \nu )\) is exactly the law of the solution of the SDE (3.4), for \(P \circ \mu ^{-1}\)-a.e. \(\nu \). Applying local optimality of \({\hat{\alpha }}\) for each \(\nu \), we conclude that

$$\begin{aligned} {\mathbb {E}}^P\left[ \left. \Gamma (\nu ^x,\Lambda ,X)\right| \mu = \nu \right] \ge {\mathbb {E}}^{P'}\left[ \left. \Gamma (\nu ^x,\Lambda ',X')\right| \mu ' = \nu \right] , \quad \text {for } \rho \text {-a.e. } \nu . \end{aligned}$$

Integrate with respect to \(\rho \) on both sides to get \({\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ,X)] \ge {\mathbb {E}}^{P'}[\Gamma ((\mu ')^x,\Lambda ',X')]\), which verifies condition (5) of Definition 3.1. Finally, we check (6) by applying Step 1 to deterministic \(\mu \) and again using uniqueness of the SDE (3.4) to find that both (3.5) and (3.6) hold for \(\rho \)-a.e. \(\nu \). \(\square \)

3.5 Applications of Theorem 3.10

It is admittedly quite difficult to check that there exists a universally admissible, locally optimal control, and we will leave this problem open in all but the simplest cases. Note, however, that conditions (2) and (3) of Definition 3.9 hold automatically when \({\hat{\alpha }}(t,w,x,\nu ) = {\hat{\alpha }}'(t,w,x_0,\nu )\), for some \({\hat{\alpha }}' : [0,T] \times {\mathcal {C}}^m \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow A\).

A simple class of examples. Suppose \(A \subset {\mathbb {R}}^k\) is convex, \(g \equiv 0\), and \(f=f(t,\mu ,a)\) is twice differentiable in a with uniformly negative Hessian in a. That is, \(D_a^2f(t,\mu ,a) \le -\delta \) for all \((t,\mu ,a)\), for some \(\delta > 0\). Suppose as usual that Assumption A holds. Define

$$\begin{aligned}&{\hat{\alpha }}(t,w,x,\nu ) := \arg \max _{a \in A}f(t,\nu ^x_t,a), \quad \text {for}\\&\quad (t,w,x,\nu ) \in [0,T] \times {\mathcal {C}}^m \times {\mathcal {C}}^d \times {\mathcal {P}}^p({\mathcal {X}}). \end{aligned}$$

It is straightforward to check that Assumption C holds and that \({\hat{\alpha }}\) is a universally admissible and locally optimal control. Of course, this example is simple in that the state process does not influence the optimization.
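
To make this recipe concrete, here is a minimal sketch with hypothetical data (our own choice, not from the paper): \(A = [-1,1]\) and \(f(t,\mu ,a) = a{\bar{\mu }} - a^2\), which is strictly concave in a with \(D_a^2 f = -2\), so the argmax is available in closed form as a projection.

```python
import numpy as np

# A minimal sketch (hypothetical data, not from the paper): A = [-1, 1] and
# f(t, mu, a) = a * mean(mu) - a^2, strictly concave in a with D_a^2 f = -2.
# The control alpha_hat(t, w, x, nu) = argmax_a f(t, nu^x_t, a) depends only on
# the time-t marginal nu^x_t, hence is trivially adapted to (F^mu_t).
def alpha_hat(t, nu_xt_mean):
    # Unconstrained maximizer of a*m - a^2 is m/2; project onto A = [-1, 1].
    return np.clip(nu_xt_mean / 2.0, -1.0, 1.0)

print(alpha_hat(0.5, 3.0))   # 1.0 (the projection onto A is active)
print(alpha_hat(0.5, 0.8))   # 0.4 (interior maximizer)
```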

A possible general strategy. The following approach may be more widely applicable. First, for a fixed \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\), we may define the value function \(V[\nu ](t,x)\) of the corresponding optimal control problem in the usual way, and it should solve a Hamilton–Jacobi–Bellman (HJB) PDE of the form

$$\begin{aligned} \left\{ \begin{array}{l} -\partial _tV[\nu ](t,x) - H(t,x,\nu ^x_t,D_xV[\nu ](t,x),D^2_xV[\nu ](t,x)) = 0, \quad \text {on}\quad [0,T) \times {\mathbb {R}}^d, \\ V[\nu ](T,x) = g(x,\nu ^x_T) \end{array} \right. , \end{aligned}$$

where the Hamiltonian \(H : [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \times {\mathbb {R}}^d \times {\mathbb {R}}^{d \times d} \rightarrow {\mathbb {R}}\) is defined by

$$\begin{aligned} H(t,x,\mu ,y,z) := \sup _{a \in A}\left[ y^\top b(t,x,\mu ,a) + f(t,x,\mu ,a)\right] + \frac{1}{2}\mathrm {Tr}\left[ z\sigma \sigma ^\top (t,x,\mu )\right] . \end{aligned}$$
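
For instance, with the data of the example of Sect. 3.3 (\(b = a\), \(f \equiv 0\), \(A = [-1,1]\), \(d = 1\), \(\sigma \) a constant scalar), the Hamiltonian reduces to

$$\begin{aligned} H(t,x,\mu ,y,z) = \sup _{a \in [-1,1]}ya + \frac{1}{2}\sigma ^2z = |y| + \frac{1}{2}\sigma ^2z, \end{aligned}$$

with the supremum attained at \(s(t,x,\mu ,y,z) = \text {sign}(y)\). One checks directly that \(V[\nu ](t,x) = x{\bar{\nu }}^x_T + (T-t)|{\bar{\nu }}^x_T|\) solves the corresponding HJB equation with terminal condition \(g(x,\nu ^x_T) = x{\bar{\nu }}^x_T\), so that \(D_xV[\nu ](t,x) = {\bar{\nu }}^x_T\) and the resulting feedback control is \(\text {sign}({\bar{\nu }}^x_T)\), recovering the control \({\hat{\alpha }}^D\) of Sect. 3.4.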

Suppose that we can show (as is well known to be possible in very general situations) that for each \(\nu \) the value function \(V[\nu ]\) is the unique (viscosity) solution of this HJB equation. Then, an optimal control can be obtained by finding a function \(s=s(t,x,\mu ,y,z)\) which attains the supremum in \(H(t,x,\mu ,y,z)\) for each \((t,x,\mu ,y,z)\), and setting

$$\begin{aligned} {\hat{\alpha }}(t,x,\nu ) := s\left( t,x,\nu ^x_t,D_xV[\nu ](t,x),D_x^2V[\nu ](t,x)\right) , \end{aligned}$$

for each \((t,x,\nu ) \in [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathcal {X}})\). For each fixed, deterministic \(\nu \in {\mathcal {P}}^p({\mathcal {X}})\), this gives us the solution to the corresponding optimal control problem. The crux of this approach is to view \(\nu \) as variable and show that \({\hat{\alpha }}\) is universally admissible (it is clearly then locally optimal). For this it suffices to show that the value function \(V[\nu ](t,x)\) is adapted with respect to \(\nu \). More precisely, assuming \(V[\nu ](t,\cdot )\) is continuous for each \((t,\nu )\), we must show that \((t,\nu ) \mapsto V[\nu ](t,x)\) is progressively measurable for each x, using the canonical filtration on \({\mathcal {P}}^p({\mathcal {X}})\) given by \((\sigma (\mu (C) : C \in {\mathcal {F}}^{\mathcal {X}}_t))_{t \in [0,T]}\). Even better would be to show that V is Markovian, in the sense that \(V[\nu ](t,x) = {\widetilde{V}}(t,x,\nu ^x_t)\) for a measurable function \({\widetilde{V}}\). In general, V can easily fail to be adapted in this sense, as in the example of Sect. 3.3 above. But this naturally raises the question of what assumptions can be imposed on the model inputs to produce this adaptedness. In short, we must study the dependence of a family of HJB equations on a path-valued parameter.
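
As a proof of concept for the fixed-\(\nu \) step, the following minimal numerical sketch (our own illustration, with hypothetical one-dimensional data: \(b = a\), \(A = [-1,1]\), \(f = -a^2/2\), \(\sigma \) constant, and a fixed deterministic \(\nu \) entering only through the terminal reward \(g(x,\nu ) = x{\bar{\nu }}^x_T\)) solves the HJB equation backward on a grid and reads off the feedback control; here the maximizer in the Hamiltonian is \(s(y) = \max (-1,\min (y,1))\). Checking measurability of \(\nu \mapsto V[\nu ]\) in the sense described above is a separate analytical matter which the numerics do not address.

```python
import numpy as np

# A minimal sketch (our own illustration, hypothetical data): solve the HJB
# equation backward in time for one fixed deterministic nu, in dimension d = 1:
#   b(t,x,m,a) = a,  A = [-1,1],  f(t,x,m,a) = -a^2/2,  sigma constant,
# terminal reward g(x, nu) = x * mbar_T with mbar_T the mean of nu^x_T. Then
#   H(t,x,m,y,z) = max_{a in [-1,1]} (a*y - a^2/2) + (sigma^2/2) z,
# and the maximizer is s(y) = clip(y, -1, 1).
T, sigma, mbar_T = 1.0, 1.0, 1.0
nt, nx = 2000, 201
dt = T / nt
x = np.linspace(-3.0, 3.0, nx)
dx = x[1] - x[0]
V = x * mbar_T                                   # terminal condition V(T, x)

for _ in range(nt):                              # explicit scheme, stepping T -> 0
    Vx = np.gradient(V, dx)
    Vxx = np.zeros_like(V)
    Vxx[1:-1] = (V[2:] - 2.0 * V[1:-1] + V[:-2]) / dx**2
    a = np.clip(Vx, -1.0, 1.0)                   # pointwise maximizer s(Vx)
    V = V + dt * (a * Vx - 0.5 * a**2 + 0.5 * sigma**2 * Vxx)   # -dV/dt = H
    V[0], V[-1] = 2*V[1] - V[2], 2*V[-2] - V[-3]                # linear edges

alpha0 = np.clip(np.gradient(V, dx), -1.0, 1.0)  # feedback control at t = 0
print(V[nx // 2], alpha0[nx // 2])               # theoretical values: 0.5 and 1.0
```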

4 Mean field games on a canonical space

In this section, we begin to work toward the proofs of the main results announced in Sects. 2.4 and 2.5. We briefly elaborate on the notion of a mean field game solution on the canonical space, in order to state simpler conditions by which we may check that a measure P is a weak MFG solution, in the sense of Definition 2.2. The definitions and notations of this section are again mostly borrowed from [11], to which the reader is referred for more details.

First, we mention some notational conventions. We will routinely use the same letter \(\phi \) to denote the natural extension of a function \(\phi : E \rightarrow F\) to any product space \(E \times E'\), given by \(\phi (x,y) := \phi (x)\) for \((x,y) \in E \times E'\). Similarly, we will use the same symbol \(({\mathcal {F}}_t)_{t \in [0,T]}\) to denote the natural extension of a filtration \(({\mathcal {F}}_t)_{t \in [0,T]}\) on a space E to any product space \(E \times E'\), given by \(({\mathcal {F}}_t \otimes \{\emptyset , E'\})_{t \in [0,T]}\).

We will make use of the following canonical spaces, two of which have been defined already but are recalled for convenience:

$$\begin{aligned} {\mathcal {X}}&:= {\mathcal {C}}^m \times {\mathcal {V}}\times {\mathcal {C}}^d, \quad \Omega _0 := {\mathbb {R}}^d \times {\mathcal {C}}^{m_0} \times {\mathcal {C}}^m, \quad \Omega := \Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {V}}\times {\mathcal {C}}^d. \end{aligned}$$

From now on, let \(\xi \), B, W, \(\mu \), \(\Lambda \), and X denote the identity maps on \({\mathbb {R}}^d\), \({\mathcal {C}}^{m_0}\), \({\mathcal {C}}^m\), \({\mathcal {P}}^p({\mathcal {X}})\), \({\mathcal {V}}\), and \({\mathcal {C}}^d\), respectively. Note, for example, that our convention permits W to denote both the identity map on \({\mathcal {C}}^m\) and the projection from \(\Omega \) to \({\mathcal {C}}^m\). The canonical filtration \(({\mathcal {F}}^\Lambda _t)_{t \in [0,T]}\) is defined on \({\mathcal {V}}\) by

$$\begin{aligned} {\mathcal {F}}^\Lambda _t := \sigma \left( \Lambda ([0,s] \times C) : s \le t, \ C \in {\mathcal {B}}(A)\right) . \end{aligned}$$

It is known (e.g. [29, Lemma 3.8]) that there exists an \(({\mathcal {F}}^\Lambda _t)_{t \in [0,T]}\)-predictable process \(\overline{\Lambda } : [0,T] \times {\mathcal {V}}\rightarrow {\mathcal {P}}(A)\) such that \(dt[\overline{\Lambda }(t,q)](da) = q\) for each q, or equivalently \(\overline{\Lambda }(t,q) = q_t\) for a.e. t, for each q. Then, we may think of \((\overline{\Lambda }(t,\cdot ))_{t \in [0,T]}\) as the canonical \({\mathcal {P}}(A)\)-valued process defined on \({\mathcal {V}}\), and it is clear that \({\mathcal {F}}^\Lambda _t = \sigma (\overline{\Lambda }(s,\cdot ) : s \le t)\). With this in mind, we may somewhat abusively write \(\Lambda _t\) in place of \(\overline{\Lambda }(t,\cdot )\), and with this notation \({\mathcal {F}}^\Lambda _t = \sigma (\Lambda _s : s \le t)\).
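
Concretely, after discretizing time and the action space, the disintegration \(q = dt\Lambda _t(da)\) has a simple matrix picture, sketched below (our own illustration): a relaxed control becomes a matrix of weights whose rows sum to the time-step length, \(\Lambda _t\) is the normalized row at time t, and a strict control \(dt\delta _{\alpha _t}(da)\) corresponds to one-hot rows.

```python
import numpy as np

# A minimal sketch (our own illustration): a relaxed control q on [0,T] x A,
# discretized as an (nt x na) matrix of weights whose rows each sum to dt, so
# that the first marginal of q is Lebesgue measure dt. The disintegration
# q = dt Lambda_t(da) recovers Lambda_t as the normalized row at time t.
T, nt, na = 1.0, 4, 3
dt = T / nt
A = np.array([-1.0, 0.0, 1.0])           # discretized action space

q = np.full((nt, na), dt / na)           # uniform mixing over A at all times
Lam = q / q.sum(axis=1, keepdims=True)   # Lambda_t(da): each row a probability

alpha = np.array([0, 2, 2, 1])           # a strict control t -> alpha_t (indices into A)
q_strict = np.zeros((nt, na))
q_strict[np.arange(nt), alpha] = dt      # q = dt * delta_{alpha_t}(da): one-hot rows

print(Lam[0])          # [1/3 1/3 1/3]
print(q_strict / dt)   # rows are the Dirac masses delta_{alpha_t}
```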

The canonical processes B, W, and X generate obvious natural filtrations, denoted \(({\mathcal {F}}^B_t)_{t \in [0,T]}\), \(({\mathcal {F}}^W_t)_{t \in [0,T]}\), and \(({\mathcal {F}}^X_t)_{t \in [0,T]}\), respectively. We will frequently work with filtrations generated by several canonical processes, such as \({\mathcal {F}}^{\xi ,B,W}_t := \sigma (\xi ,B_s,W_s : s \le t)\) defined on \(\Omega _0\), and \({\mathcal {F}}^{\xi ,B,W,\Lambda }_t = {\mathcal {F}}^{\xi ,B,W}_t \otimes {\mathcal {F}}^{\Lambda }_t\) defined on \(\Omega _0 \times {\mathcal {V}}\). Our convention on canonical extensions of filtrations to product spaces permits the use of \(({\mathcal {F}}^{\xi ,B,W}_t)_{t \in [0,T]}\) to refer also to the filtration on \(\Omega _0 \times {\mathcal {V}}\) generated by \((\xi ,B,W)\), and it should be clear from context on which space the filtration is defined. Hence, the filtration \(({\mathcal {F}}^{\mathcal {X}}_t)_{t \in [0,T]}\) defined just before Definition 2.1 could alternatively be denoted \({\mathcal {F}}^{\mathcal {X}}_t = {\mathcal {F}}^{W,\Lambda ,X}_t\), but we stick with the former notation for consistency. Define the canonical filtration \(({\mathcal {F}}^\mu _t)_{t \in [0,T]}\) on \({\mathcal {P}}^p({\mathcal {X}})\) by

$$\begin{aligned} {\mathcal {F}}^\mu _t := \sigma \left( \mu (C) : C \in {\mathcal {F}}^{\mathcal {X}}_t\right) . \end{aligned}$$

There is somewhat of a conflict in notation between our use of \((\xi ,B,W)\) here as the identity map on \({\mathbb {R}}^d \times {\mathcal {C}}^{m_0} \times {\mathcal {C}}^m\) and our previous use (beginning in Sect. 2.3) of the same letters for random variables with values in \(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n\), defined on an n-player environment \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\). However, we will almost exclusively discuss the random variables \((\xi ,B,W)\) through the lens of various probability measures, and thus it should be clear from context (i.e. from the nearest notated probability measure) which random variables \((\xi ,B,W)\) we are working with at any given moment. For example, given \(P \in {\mathcal {P}}(\Omega )\), the notation \(P \circ (\xi ,B,W)^{-1}\) refers to a measure on \({\mathbb {R}}^d \times {\mathcal {C}}^{m_0} \times {\mathcal {C}}^m\). On the other hand, \({\mathbb {P}}_n\) is reserved for the measure on \(\Omega _n\) in a typical n-player environment, and so \({\mathbb {P}}_n \circ (\xi ,B,W)^{-1}\) refers to a measure on \(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n\).

Recall that the initial state distribution \(\lambda \in {\mathcal {P}}^{p'}({\mathbb {R}}^d)\) is fixed throughout. Let \({\mathcal {M}}_\lambda \) denote the set of \(\rho \in {\mathcal {P}}^p(\Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}))\) satisfying

  1. (1)

    \(\rho \circ \xi ^{-1} = \lambda \),

  2. (2)

    B and W are independent Wiener processes on \((\Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}),({\mathcal {F}}^{\xi ,B,W,\mu }_t)_{t \in [0,T]},\rho )\).

(Note that the set \({\mathcal {M}}_\lambda \) was denoted \({\mathcal {P}}^p_c[(\Omega _0,{\mathcal {W}}_\lambda ) \leadsto {\mathcal {P}}^p({\mathcal {X}})]\) in [11]; we prefer this shorter notation mainly because we will make no use of it after this section.) For \(\rho \in {\mathcal {M}}_\lambda \), the class \({\mathcal {A}}(\rho )\) of admissible controls is the set of probability measures Q on \(\Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {V}}\) satisfying:

  1. (1)

    \({\mathcal {F}}^\Lambda _t\) and \({\mathcal {F}}^{\xi ,B,W,\mu }_T\) are conditionally independent under Q given \({\mathcal {F}}^{\xi ,B,W,\mu }_t\), for each \(t \in [0,T]\),

  2. (2)

    \(Q \circ (\xi ,B,W,\mu )^{-1} = \rho \),

  3. (3)

    \({\mathbb {E}}^Q\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty \).

We say \(Q \in {\mathcal {A}}(\rho )\) is a strict control if there exists an A-valued process \((\alpha _t)_{t \in [0,T]}\), progressively measurable with respect to the Q-completion of \(({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda }_t)_{t \in [0,T]}\), such that

$$\begin{aligned} Q\left( \Lambda = dt\delta _{\alpha _t}(da)\right) = Q(\Lambda _t = \delta _{\alpha _t} \ a.e. \ t) = 1. \end{aligned}$$

We say \(Q \in {\mathcal {A}}(\rho )\) is a strong control if the above holds but with \((\alpha _t)_{t \in [0,T]}\) progressively measurable with respect to the Q-completion of \(({\mathcal {F}}^{\xi ,B,W,\mu }_t)_{t \in [0,T]}\).

If \(\rho \in {\mathcal {M}}_\lambda \) and \(Q \in {\mathcal {A}}(\rho )\), note that B and W are Wiener processes on \((\Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {V}},({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda }_t)_{t \in [0,T]},Q)\). For each \(\rho \in {\mathcal {M}}_\lambda \) and \(Q \in {\mathcal {A}}(\rho )\), on the completion of the filtered probability space \((\Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {V}}, ({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda }_t)_{t \in [0,T]},Q)\), there exists a unique strong solution Y of the SDE

$$\begin{aligned} Y_t&= \xi + \int _0^t\int _Ab(s,Y_s,\mu ^x_s,a)\Lambda _s(da)ds \nonumber \\&\quad + \int _0^t\sigma (s,Y_s,\mu ^x_s)dW_s + \int _0^t\sigma _0(s,Y_s,\mu ^x_s)dB_s. \end{aligned}$$
(4.1)

Viewing Y as a random element of \({\mathcal {C}}^d\), let \({\mathcal {R}}(Q) := Q \circ (\xi ,B,W,\mu ,\Lambda ,Y)^{-1} \in {\mathcal {P}}(\Omega )\) denote the joint law of the solution and the inputs. Define

$$\begin{aligned} {\mathcal {R}}{\mathcal {A}}(\rho ) := {\mathcal {R}}({\mathcal {A}}(\rho )) = \left\{ {\mathcal {R}}(Q) : Q \in {\mathcal {A}}(\rho )\right\} , \end{aligned}$$

which we think of as the set of admissible joint laws for the optimal control problem associated to \(\rho \). Alternatively, \({\mathcal {R}}(Q)\) may be defined as the unique element P of \({\mathcal {P}}(\Omega )\) such that \(P \circ (\xi ,B,W,\mu ,\Lambda )^{-1} = Q\) and such that the canonical processes \((\xi ,B,W,\mu ,\Lambda ,X)\) verify the state SDE on \(\Omega \):

$$\begin{aligned} X_t&= \xi + \int _0^t\int _Ab\left( s,X_s,\mu ^x_s,a\right) \Lambda _s(da)ds \nonumber \\&\quad + \int _0^t\sigma \left( s,X_s,\mu ^x_s\right) dW_s + \int _0^t\sigma _0\left( s,X_s,\mu ^x_s\right) dB_s. \end{aligned}$$
(4.2)

It follows from standard estimates (e.g. [11, Lemma 2.4]) that \({\mathcal {R}}(Q) \in {\mathcal {P}}^p(\Omega )\).

Recalling the definition of the objective functional \(\Gamma \) from (2.4), we define the reward associated to an element \(P \in {\mathcal {P}}^p(\Omega )\) by

$$\begin{aligned} J(P) := {\mathbb {E}}^P\left[ \Gamma (\mu ^x,\Lambda ,X)\right] . \end{aligned}$$
(4.3)

Define the set of optimal controls corresponding to \(\rho \) by

$$\begin{aligned} {\mathcal {A}}^{*}(\rho )&:= \arg \max _{Q \in {\mathcal {A}}(\rho )}J({\mathcal {R}}(Q)), \end{aligned}$$
(4.4)

and note that

$$\begin{aligned} {\mathcal {R}}{\mathcal {A}}^{*}(\rho )&:= {\mathcal {R}}({\mathcal {A}}^{*}(\rho )) = \arg \max _{P \in {\mathcal {R}}{\mathcal {A}}(\rho )}J(P). \end{aligned}$$

Let us now adapt the definition of MFG solution to the canonical space \(\Omega \):

Definition 4.1

(MFG pre-solution) We say \(P \in {\mathcal {P}}(\Omega )\) is an MFG pre-solution if it satisfies the following:

  1. (1)

    \(\xi \), W, and \((B,\mu )\) are independent under P.

  2. (2)

    \(P \in {\mathcal {R}}{\mathcal {A}}(\rho )\) where \(\rho := P \circ (\xi ,B,W,\mu )^{-1}\) is in \({\mathcal {M}}_\lambda \).

  3. (3)

    \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ B,\mu )\) a.s. That is, \(\mu \) is a version of the conditional law of \((W,\Lambda ,X)\) given \((B,\mu )\).

The following two lemmas give us a characterization of MFG solutions which is convenient for taking limits. The first is more or less obvious, stated as a lemma merely for emphasis, while the second has more content and is discussed thoroughly in [11].

Lemma 4.2

(Lemma 3.9 of [11]) Let \(P \in {\mathcal {P}}^p(\Omega )\), and define \(\rho := P \circ (\xi ,B,W,\mu )^{-1}\). If P is an MFG pre-solution and \(P \in {\mathcal {R}}{\mathcal {A}}^{*}(\rho )\), then P is a weak MFG solution in the sense of Definition 2.2.

Lemma 4.3

(Lemma 3.7 of [11]) Let \(P \in {\mathcal {P}}^p(\Omega )\), and define \(\rho := P \circ (\xi ,B,W,\mu )^{-1}\). Suppose the following hold under P:

  1. (1)

    B and W are independent \(({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda ,X}_t)_{t \in [0,T]}\)-Wiener processes, and \(P \circ \xi ^{-1} = \lambda \).

  2. (2)

    \(\xi \), W, and \((B,\mu )\) are independent.

  3. (3)

    \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ B,\mu ), \ a.s.\)

  4. (4)

    The canonical processes \((\xi ,B,W,\mu ,\Lambda ,X)\) verify the state equation (4.2) on \(\Omega \).

Then P is an MFG pre-solution.

We close the section with three useful results from [11], topological in nature. They will not be used until the final step of the proof of Theorem 2.6, in Sect. 5.4.

Lemma 4.4

(Lemma 3.12 of [11]) Suppose \(K \subset \bigcup _{\rho \in {\mathcal {M}}_\lambda }{\mathcal {A}}(\rho )\) satisfies

$$\begin{aligned} \sup _{P \in K}{\mathbb {E}}^P\left[ \int _{{\mathcal {C}}^d}\Vert x\Vert _T^{p'}\mu ^x(dx) + \int _0^T\int |a|^{p'}\Lambda _t(da)dt\right] < \infty . \end{aligned}$$

Then the map \({\mathcal {R}}: K \rightarrow {\mathcal {P}}^p(\Omega )\) is continuous.

Lemma 4.5

The map \(J : {\mathcal {P}}^p(\Omega ) \rightarrow {\mathbb {R}}\) is upper semicontinuous, and for each \(\rho \in {\mathcal {M}}_\lambda \) the sets \({\mathcal {A}}^{*}(\rho )\) and \({\mathcal {R}}{\mathcal {A}}^{*}(\rho )\) are nonempty and compact. Moreover, the restriction of J to a set \(K \subset {\mathcal {P}}^p(\Omega )\) is continuous whenever K satisfies the uniform integrability condition

$$\begin{aligned} \lim _{r \rightarrow \infty }\sup _{P \in K}{\mathbb {E}}^P\left[ \int _0^T\int _{\{|a| > r\}}|a|^{p'}\Lambda _t(da) dt\right] = 0. \end{aligned}$$
(4.5)

Proof

This is all covered by Lemma 3.13 of [11], except for the final claim. To prove it, let \(P_n \rightarrow P_\infty \) in \({\mathcal {P}}^p(\Omega )\) with \(P_n \in K\) for each n. The continuity and growth assumptions on g imply that \({\mathbb {E}}^{P_n}[g(X_T,\mu ^x_T)] \rightarrow {\mathbb {E}}^{P_\infty }[g(X_T,\mu ^x_T)]\), so the f term causes the only problems. The convergence \(P_n \rightarrow P_\infty \) implies (e.g. by [35, Theorem 7.12])

$$\begin{aligned} \lim _{r \rightarrow \infty }\sup _{n}{\mathbb {E}}^{P_n} \left[ \Vert X\Vert _T^p1_{\{\Vert X\Vert _T^p > r\}} + \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_T\mu ^x(dz) 1_{\left\{ \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_T\mu ^x(dz) > r\right\} }\right] = 0. \end{aligned}$$
(4.6)

For \(1 \le n \le \infty \), define probability measures \(Q_n\) on \({\widetilde{\Omega }} := [0,T] \times {\mathbb {R}}^d \times {\mathcal {P}}^p({\mathbb {R}}^d) \times A\) by

$$\begin{aligned} Q_n(C) := \frac{1}{T}{\mathbb {E}}^{P_n}\left[ \int _0^T\int _A1_{\{(t,X_t,\mu ^x_t,a) \in C\}}\Lambda _t(da)dt\right] , \ C \in {\mathcal {B}}({\widetilde{\Omega }}). \end{aligned}$$

Certainly \(Q_n \rightarrow Q_\infty \) weakly in \({\mathcal {P}}({\widetilde{\Omega }})\). Since the [0, T]-marginal is the same for each \(Q_n\), it is known (e.g. [22] or [29, Lemma A.3]) that this implies \(\int \phi \,dQ_n\rightarrow \int \phi \,dQ_\infty \) for each bounded measurable \(\phi : {\widetilde{\Omega }} \rightarrow {\mathbb {R}}\) with \(\phi (t,\cdot )\) continuous for each t. Thus \(Q_n \circ f^{-1} \rightarrow Q_\infty \circ f^{-1}\) weakly in \({\mathcal {P}}({\mathbb {R}})\), by continuity of \(f(t,\cdot )\) for each t. But it follows from (4.5), (4.6), and the growth assumption of (A.5) that

$$\begin{aligned} \lim _{r \rightarrow \infty }\sup _n\int _{\{|f| > r\}}|f|\,dQ_n = 0, \end{aligned}$$

and thus \(\int f\,dQ_n \rightarrow \int f\,dQ_\infty \). \(\square \)

The following definition highlights a useful subclass of admissible controls, which Lemma 4.7 shows is dense in the class of admissible controls in a suitable sense.

Definition 4.6

A function \(\phi : \Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow {\mathcal {V}}\) is said to be adapted if \(\phi ^{-1}(C) \in {\mathcal {F}}^{\xi ,B,W,\mu }_t\) for each \(C \in {\mathcal {F}}^\Lambda _t\) and \(t \in [0,T]\). We say \(\phi \) is compact if there exists a compact set \(K \subset [0,T] \times A\) such that \(\phi (\omega ,\nu )(K^c) = 0\) for each \((\omega ,\nu ) \in \Omega _0 \times {\mathcal {P}}^p({\mathcal {X}})\). For \(\rho \in {\mathcal {M}}_\lambda \), let \({\mathcal {A}}_a(\rho )\) denote the set of measures of the form

$$\begin{aligned} \rho \circ \left( \xi ,B,W,\mu ,\phi (\xi ,B,W,\mu )\right) ^{-1} \end{aligned}$$

where \(\phi \) is adapted, compact, and continuous.

Lemma 4.7

For each \(\rho \in {\mathcal {M}}_\lambda \), \({\mathcal {A}}_a(\rho )\) is a dense subset of \({\mathcal {A}}(\rho )\). Moreover, for each \(P \in {\mathcal {R}}{\mathcal {A}}(\rho )\) with \({\mathbb {E}}^P\int _0^T\int _A|a|^{p'}\Lambda _t(da)dt < \infty \), there exist \(P_n \in {\mathcal {R}}{\mathcal {A}}_a(\rho )\) such that \(K := \{P_n : n \ge 1\}\) satisfies (4.5) and \(P_n \rightarrow P\) in \({\mathcal {P}}^p(\Omega )\); in particular, \(J(P_n) \rightarrow J(P)\).

Proof

Lemma 3.11 of [11] covers the first claim in the case that A is bounded, while the general case is treated in the second step of the proof of Lemma 3.17 in [11]. Except for the claim that K satisfies the uniform integrability condition (4.5), the second statement is precisely Lemma 3.17 of [11], the proof of which elucidates this uniform integrability. \(\square \)

5 Proof of Theorem 2.6

With the mean field game concisely summarized on the canonical space, we now turn to the proof of Theorem 2.6. Throughout the section, we work with the notation and assumptions of Theorem 2.6. Following Lemma 4.2, the strategy is to prove the claimed relative compactness, then to show that any limit is an MFG pre-solution using Lemma 4.3, and finally that any limit corresponds to an optimal control. First, we establish some useful estimates for the n-player systems.

5.1 Estimates

The first estimate below, Lemma 5.1, is fairly standard, but it is important that it is independent of the number of agents n. The second estimate, Lemma 5.2, will be used to establish some uniform integrability of the equilibrium controls, and it is precisely where we need the coercivity of the running cost f. Note in the following proofs that the initial states \(X^i_0[\Lambda ] = X^i_0 = \xi ^i\) and the initial empirical measure \({\widehat{\mu }}^x_0[\Lambda ] = {\widehat{\mu }}^x_0 = \frac{1}{n}\sum _{i=1}^n\delta _{\xi ^i}\) do not depend on the choice of control. Recall the definition of the truncated supremum norm (2.2).

Lemma 5.1

There exists a constant \(c_5 \ge 1\), depending only on p, \(p'\), T, and the constant \(c_1\) of assumption (A.4) such that, for each \(\gamma \in [p,p']\), \(\beta = (\beta ^1,\ldots ,\beta ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), and \(1 \le k \le n\),

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}[\Vert X^k[\beta ]\Vert _T^\gamma ] \\&\quad \le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^\gamma + \int _0^T\int _A|a|^\gamma \beta ^k_t(da)dt + \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^\gamma \beta ^i_t(da)dt \right] , \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}^{{\mathbb {P}}_n}\int _{{\mathcal {C}}^d}\Vert z\Vert _T^\gamma {\widehat{\mu }}^x[\beta ](dz)&= \frac{1}{n}\sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}[\Vert X^i[\beta ]\Vert _T^\gamma ] \\&\le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^\gamma + \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^\gamma \beta ^i_t(da)dt\right] . \end{aligned}$$

Proof

We omit \([\beta ]\) from the notation throughout the proof, as well as the superscript \({\mathbb {P}}_n\) which should appear above the expectations. Abbreviate \(\Sigma := \sigma \sigma ^\top + \sigma _0\sigma _0^\top \). Apply the Burkholder–Davis–Gundy inequality and assumption (A.4) to find a universal constant \(C > 0\) (which will change from line to line) such that, for all \(\gamma \in [p,p']\),

$$\begin{aligned}&{\mathbb {E}}[\Vert X^k\Vert ^\gamma _t] \\&\quad \le \,C{\mathbb {E}}\left[ |\xi ^k|^\gamma + \left( \int _0^t\int _A|b(s,X^k_s,{\widehat{\mu }}^x_s,a)|\beta ^k_s(da) ds\right) ^\gamma + \left( \int _0^t\left| \Sigma (s,X^k_s, {\widehat{\mu }}^x_s) \right| ds\right) ^{\gamma /2}\right] \\&\quad \le \,C\,{\mathbb {E}}\left\{ 1 + |\xi ^k|^\gamma + \int _0^t\left[ \Vert X^k\Vert ^\gamma _s + \left( \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_s {\widehat{\mu }}^x(dz)\right) ^{\gamma /p} + \int _A|a|^\gamma \beta ^k_s(da)\right] ds\right\} \\&\quad \quad + C\,{\mathbb {E}}\left\{ \left[ \int _0^t\left( \Vert X^k\Vert ^{p_\sigma }_s + \left( \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_s{\widehat{\mu }}^x(dz) \right) ^{p_\sigma /p}\right) ds \right] ^{\gamma /2}\right\} \\&\quad \le \,C\,{\mathbb {E}}\left\{ 1 + |\xi ^k|^\gamma + \int _0^t\left[ \Vert X^k\Vert ^\gamma _s + \int _{{\mathcal {C}}^d}\Vert z\Vert ^\gamma _s {\widehat{\mu }}^x(dz) + \int _A|a|^\gamma \beta ^k_s(da)\right] ds \right\} . \end{aligned}$$

The last line follows from the bound \((\int \Vert z\Vert _s^p\nu (dz))^{\gamma /p} \le \int \Vert z\Vert _s^\gamma \nu (dz)\) for \(\nu \in {\mathcal {P}}({\mathcal {C}}^d)\), which holds by Jensen's inequality because \(\gamma \ge p\). To deal with the \(\gamma /2\) outside of the time integral, we used the following argument. If \(\gamma \ge 2\), we simply use Jensen’s inequality to pass \(\gamma /2\) inside of the time integral, and then use the inequality \(|x|^{p_\sigma \gamma /2} \le 1 + |x|^\gamma \), which holds because \(p_\sigma \le 2\). The other case is \(1 \vee p_\sigma \le p \le \gamma < 2\), and we then use the inequalities \(|x|^{\gamma /2} \le 1 + |x|\) and \(|x|^{p_\sigma } \le 1 + |x|^\gamma \). By Gronwall’s inequality,

$$\begin{aligned} {\mathbb {E}}[\Vert X^k\Vert ^\gamma _t] \le C{\mathbb {E}}\left\{ 1 + |\xi ^k|^\gamma + \int _0^t\left[ \int _{{\mathcal {C}}^d}\Vert z\Vert ^\gamma _s{\widehat{\mu }}^x(dz) + \int _A|a|^\gamma \beta ^k_s(da)\right] ds\right\} \end{aligned}$$
(5.1)

Note that \({\mathbb {E}}^{{\mathbb {P}}_n}[|\xi ^k|^\gamma ] = {\mathbb {E}}^{{\mathbb {P}}_n}[|\xi ^1|^\gamma ]\) for each k, and average over \(k=1,\ldots ,n\) to get

$$\begin{aligned}&{\mathbb {E}}\int _{{\mathcal {C}}^d}\Vert z\Vert ^\gamma _t{\widehat{\mu }}^x(dz)= \frac{1}{n}\sum _{i=1}^n{\mathbb {E}}[\Vert X^i\Vert ^\gamma _t] \\&\quad \le C{\mathbb {E}}\left\{ 1 + |\xi ^1|^\gamma + \int _0^t\left[ \int _{{\mathcal {C}}^d}\Vert z\Vert ^\gamma _s{\widehat{\mu }}^x(dz) + \frac{1}{n}\sum _{i=1}^n\int _A|a|^\gamma \beta ^i_s(da)\right] ds\right\} . \end{aligned}$$

Apply Gronwall’s inequality once again to prove the second claimed inequality. The first claim follows from the second and from (5.1). \(\square \)

Lemma 5.2

There exist constants \(c_6,c_7 > 0\), depending only on p, \(p'\), T, and the constants \(c_1,c_2,c_3\) of Assumption A, such that for each \(\beta = (\beta ^1,\ldots ,\beta ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), the following hold:

  1. (1)

    For each \(1 \le k \le n\),

    $$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\beta ^k_t(da)dt \\&\quad \le c_7{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^p + \frac{1}{n}\sum _{i \ne k}^n\int _0^T\int _A|a|^p\beta ^i_t(da)dt\right] - c_7J_k(\beta ). \end{aligned}$$
  2. (2)

    If for some \(n \ge k \ge 1\), \(\epsilon > 0\), and \({\widetilde{\beta }}^k \in {\mathcal {A}}_n({\mathcal {E}}_n)\) we have

    $$\begin{aligned} J_k((\beta ^{-k},{\widetilde{\beta }}^k)) \ge \sup _{{\widetilde{\beta }} \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_k((\beta ^{-k},{\widetilde{\beta }})) - \epsilon , \end{aligned}$$

    then

    $$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A(|a|^{p'} - c_6|a|^p){\widetilde{\beta }}^k_t(da)dt \\&\quad \le c_7{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + \epsilon + |\xi ^1|^p + \frac{1}{n}\sum _{i \ne k}^n\int _0^T\int _A|a|^p\beta ^i_t(da)dt\right] . \end{aligned}$$
  3. (3)

    If \(\beta \) is an \(\epsilon \)-Nash equilibrium for some \(\epsilon =(\epsilon _1,\ldots ,\epsilon _n) \in [0,\infty )^n\), then

    $$\begin{aligned} \frac{1}{n}\sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\beta ^i_t(da)dt \le c_7\left( 1 + {\mathbb {E}}^{{\mathbb {P}}_n}|\xi ^1|^p + \frac{1}{n}\sum _{i=1}^n\epsilon _i\right) . \end{aligned}$$

Proof

Recall that \({\mathbb {E}}^{{\mathbb {P}}_n}[|\xi ^1|^p] < \infty \) and that every \({\widetilde{\beta }} \in {\mathcal {A}}_n({\mathcal {E}}_n)\) is required to satisfy

$$\begin{aligned} {\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^p{\widetilde{\beta }}_t(da)dt < \infty . \end{aligned}$$

Moreover, if \({\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^{p'}{\widetilde{\beta }}_t(da)dt = \infty \) then the upper bound of assumption (A.5) implies that \(J_k((\beta ^{-k},{\widetilde{\beta }})) = -\infty \), for each \(\beta \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) and \(1 \le k \le n\).

Proof of (1): First, use the upper bounds of f and g from assumption (A.5) to get

$$\begin{aligned} J_k(\beta )\le & {} c_2(T+1){\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + \left\| X^k[\beta ]\right\| _T^p + \int _{{\mathcal {C}}^d}\left\| z\right\| ^p_T{\widehat{\mu }}^x[\beta ](dz)\right] \\&- c_3{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^{p'}\beta ^k_t(da)dt \\\le & {} 3c_5c_2(T+1){\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^p + \int _0^T\int _A|a|^p\beta ^k_t(da)dt \right. \\&\left. + \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^p\beta ^i_t(da)dt\right] - c_3{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^{p'}\beta ^k_t(da)dt, \end{aligned}$$

where the last inequality follows from Lemma 5.1 (and \(c_5 \ge 1\)). This proves the first claim, with \(c_6 := 6c_5c_2(T+1)/c_3\) and \(c_7 := c_6 \vee (1/c_3)\).

Proof of (2): Fix \(a_0 \in A\) arbitrarily. Abuse notation somewhat by writing \(a_0\) in place of the constant strict control \((\delta _{a_0})_{t \in [0,T]} \in {\mathcal {A}}_n({\mathcal {E}}_n)\). Lemma 5.1 implies

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \left\| X^k[(\beta ^{-k},a_0)]\right\| _T^p\right] \\&\quad \le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^p + T\left( 1 + \frac{1}{n}\right) |a_0|^p + \frac{1}{n}\sum _{i \ne k}^n\int _0^T\int _A|a|^p\beta ^i_t(da) dt\right] \end{aligned}$$

and

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\int _{{\mathcal {C}}^d}\left\| z\right\| _T^p {\widehat{\mu }}^x[(\beta ^{-k},a_0)](dz) \\&\quad \le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^p + \frac{T}{n}|a_0|^p + \frac{1}{n}\sum _{i \ne k}^n\int _0^T\int _A|a|^p\beta ^i_t(da) dt\right] . \end{aligned}$$

Use the hypothesis along with the lower bounds on f and g from assumption (A.5) to get

$$\begin{aligned} J_k((\beta ^{-k},{\widetilde{\beta }}^k))\ge & {} J_k((\beta ^{-k},a_0))- \epsilon \\\ge & {} -c_2(T+1){\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + \left\| X^k[(\beta ^{-k},a_0)]\right\| _T^p \right. \\&\left. + \int _{{\mathcal {C}}^d}\left\| z\right\| _T^p{\widehat{\mu }}^x[(\beta ^{-k},a_0)](dz) + |a_0|^{p'}\right] - \epsilon \\\ge & {} -C{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^p + \frac{1}{n}\sum _{i \ne k}^n\int _0^T\int _A|a|^p\beta ^i_t(da) dt\right] - \epsilon , \end{aligned}$$

where \(C > 0\) depends only on \(c_2\), \(c_5\), T, and \(|a_0|^{p'}\). Combining this with the first result, with \(\beta \) replaced by \((\beta ^{-k},{\widetilde{\beta }}^k)\), proves (2), after replacing \(c_7\) by \(c_7(1+C)\).

Proof of (3): If \(\beta \) is an \(\epsilon \)-Nash equilibrium, then applying (2) with \({\widetilde{\beta }}^k=\beta ^k\) gives

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\beta ^k_t(da)dt \\&\quad \le c_7{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + \epsilon _k + |\xi ^1|^p + \frac{1}{n}\sum _{i = 1}^n\int _0^T\int _A|a|^p\beta ^i_t(da)dt\right] . \end{aligned}$$

The proof is completed by averaging over \(k=1,\ldots ,n\), rearranging terms, and replacing \(c_6\) by \(c_6 + c_7\). \(\square \)

5.2 Relative compactness and MFG pre-solution

This section proves that \((P_n)_{n=1}^\infty \), defined in (2.10), is relatively compact and that each limit point is an MFG pre-solution. First, we state a tailor-made tightness result for Itô processes. It is essentially an application of Aldous’ criterion, but the proof is deferred to the appendix.

Proposition 5.3

Fix \(c > 0\) and a positive integer k. For each \(\kappa \ge 0\), let \({\mathcal {Q}}_\kappa \subset {\mathcal {P}}({\mathcal {V}}\times {\mathcal {C}}^d)\) denote the set of laws \(P \circ (\Lambda ,X)^{-1}\) of \({\mathcal {V}}\times {\mathcal {C}}^d\)-valued random variables \((\Lambda ,X)\) defined on some filtered probability space \((\Theta ,({\mathcal {G}}_t)_{t \in [0,T]},P)\) satisfying

$$\begin{aligned} dX_t = \int _AB(t,a)\Lambda _t(da)dt + \Sigma (t)dW_t, \end{aligned}$$

where the following hold:

  1. (1)

    W is a \(({\mathcal {G}}_t)_{t \in [0,T]}\)-Wiener process of dimension k.

  2. (2)

    \(\Sigma : [0,T] \times \Theta \rightarrow {\mathbb {R}}^{d \times k}\) is progressively measurable, and \(B : [0,T] \times \Theta \times A \rightarrow {\mathbb {R}}^d\) is jointly measurable with respect to the progressive \(\sigma \)-field on \([0,T] \times \Theta \) and the Borel \(\sigma \)-field on A.

  3. (3)

    \(X_0\) is \({\mathcal {G}}_0\)-measurable.

  4. (4)

    There exists a nonnegative \({\mathcal {G}}_T\)-measurable random variable Z such that

    1. (a)

      For each \((t,\omega ,a) \in [0,T] \times \Theta \times A\),

      $$\begin{aligned} |B(t,a)|&\le c\left( 1 + |X_t| + Z + |a|\right) , \quad |\Sigma \Sigma ^\top (t)| \le c\left( 1 + |X_t|^{p_\sigma } + Z^{p_\sigma }\right) \end{aligned}$$
    2. (b)

      Lastly,

      $$\begin{aligned} {\mathbb {E}}^P\left[ |X_0|^{p'} + Z^{p'} + \int _0^T\int _A|a|^{p'}\Lambda _t(da)dt\right] \le \kappa . \end{aligned}$$

(That is, we vary over \(\Sigma \), B, Z, and the probability space.) Then, for any triangular array \(\{\kappa _{n,i} : 1 \le i \le n\} \subset [0,\infty )\) with \(\sup _n\frac{1}{n}\sum _{i=1}^n\kappa _{n,i} < \infty \), the set

$$\begin{aligned} {\mathcal {Q}}:= \left\{ \frac{1}{n}\sum _{i=1}^nP_i : n \ge 1, \ P_i \in {\mathcal {Q}}_{\kappa _{n,i}} \quad \text {for } i=1,\ldots ,n\right\} \end{aligned}$$

is relatively compact in \({\mathcal {P}}^p({\mathcal {V}}\times {\mathcal {C}}^d)\).

Lemma 5.4

\((P_n)_{n=1}^\infty \) is relatively compact in \({\mathcal {P}}^p(\Omega )\), and

$$\begin{aligned} \sup _n{\mathbb {E}}^{P_n}\left[ \Vert X\Vert ^{p'}_T + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T\mu (dz) + \int _0^T\int _A|a|^{p'}\Lambda _t(da)dt\right] < \infty . \end{aligned}$$
(5.2)

Proof

We first establish (5.2). Since \(\Lambda ^n\) is an \(\epsilon ^n\)-Nash equilibrium, part (3) of Lemma 5.2 implies

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\Lambda ^{n,k}_t(da)dt \le c_7\left( 1 + {\mathbb {E}}^{{\mathbb {P}}_n}[|\xi ^1|^p] + \frac{1}{n}\sum _{k=1}^n\epsilon ^n_k \right) . \end{aligned}$$

The right-hand side above is bounded in n, because of hypothesis (2.9) and because \({\mathbb {P}}_n \circ (\xi ^1)^{-1} = \lambda \in {\mathcal {P}}^p({\mathbb {R}}^d)\) for each n. Since \(p' > p\), it follows that

$$\begin{aligned} \sup _n\frac{1}{n}\sum _{k=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T \int _A|a|^{p'} \Lambda ^{n,k}_t(da)dt < \infty . \end{aligned}$$
(5.3)
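The passage from the previous display to (5.3) rests on an elementary Young-type bound; as a sketch (with the constant \(C_0\) introduced here purely for illustration), since \(p' > p\) we have

$$\begin{aligned} |a|^{p'} - c_6|a|^p \ge \tfrac{1}{2}|a|^{p'} - C_0, \quad \text {where } C_0 := \sup _{r \ge 0}\left( c_6r^p - \tfrac{1}{2}r^{p'}\right) < \infty , \end{aligned}$$

so the left-hand side of the previous display dominates \(\frac{1}{2n}\sum _{k=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\int _0^T\int _A|a|^{p'}\Lambda ^{n,k}_t(da)dt - C_0T\), while the right-hand side is bounded in n.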

Lemma 5.1 implies

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T{\widehat{\mu }}^x[\Lambda ^n](dz) \\&\quad \le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 1 + |\xi ^1|^{p'} + \frac{1}{n}\sum _{k=1}^n\int _0^T\int _A|a|^{p'} \Lambda ^{n,k}_t(da) dt\right] =: \kappa _n. \end{aligned}$$

Thus

$$\begin{aligned}&{\mathbb {E}}^{P_n} \left[ \Vert X\Vert ^{p'}_T + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T\mu (dz) + \int _0^T\int _A|a|^{p'}\Lambda _t(da)dt\right] \\&\quad = \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \Vert X^k[\Lambda ^n] \Vert ^{p'}_T + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T{\widehat{\mu }}^x[\Lambda ^n](dz) + \int _0^T\int _A|a|^{p'}\Lambda ^{n,k}_t(da)dt\right] \\&\quad \le c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ 2 + 2|\xi ^1|^{p'} + \frac{3}{n}\sum _{k=1}^n\int _0^T\int _A|a|^{p'} \Lambda ^{n,k}_t(da)dt\right] \\&\quad \le 3\kappa _n. \end{aligned}$$

Recall in the last line that \(c_5 \ge 1\). From (5.3) we conclude that \(\sup _n\kappa _n < \infty \), and (5.2) follows.

To prove that \((P_n)_{n=1}^\infty \) is relatively compact, it suffices to show that each family of marginals is relatively compact (e.g. by [29, Lemma A.2]). Since \((P_n \circ (\xi ,B,W)^{-1})_{n=1}^\infty \) is a singleton, it is trivially relatively compact. We may apply Proposition 5.3 to show that

$$\begin{aligned} P_n \circ (\Lambda ,X)^{-1} = \frac{1}{n}\sum _{i=1}^n {\mathbb {P}}_n \circ (\Lambda ^{n,i},X^{n,i}[\Lambda ^n])^{-1} \end{aligned}$$

forms a relatively compact sequence. Indeed, in the notation of Proposition 5.3, we use \(Z = (\int _{{\mathcal {C}}^d}\Vert z\Vert _T^p{\widehat{\mu }}^x[\Lambda ^n](dz))^{1/p}\) and \(c = c_1\) of assumption (A.4) to check that \({\mathbb {P}}_n \circ (\Lambda ^{n,i},X^{n,i}[\Lambda ^n])^{-1}\) is in \({\mathcal {Q}}_{\kappa _{n,i}}\) for each \(1 \le i \le n\), where

$$\begin{aligned} \kappa _{n,i} = \kappa _n + {\mathbb {E}}^{{\mathbb {P}}_n}\left[ |\xi ^i|^{p'} + \int _0^T\int _A|a|^{p'}\Lambda ^{n,i}_t(da)dt\right] . \end{aligned}$$

Since \(c_5 \ge 1\), we have \(\frac{1}{n}\sum _{i=1}^n\kappa _{n,i} \le 2\kappa _n\), and so \(\sup _n\frac{1}{n}\sum _{i=1}^n\kappa _{n,i} < \infty \). Thus, Proposition 5.3 establishes the relative compactness of \((P_n \circ (\Lambda ,X)^{-1})_{n=1}^\infty \). Next, note that \(P_n \circ (W,\Lambda ,X)^{-1}\) is the mean measure of \(P_n \circ \mu ^{-1}\) for each n, since for each bounded measurable \(\phi : {\mathcal {X}}\rightarrow {\mathbb {R}}\) we have

$$\begin{aligned} {\mathbb {E}}^{P_n}\left[ \phi (W,\Lambda ,X)\right]= & {} \frac{1}{n}\sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \phi (W^i, \Lambda ^{n,i}, X^i[\Lambda ^n])\right] \\= & {} {\mathbb {E}}^{{\mathbb {P}}_n}\int _{\mathcal {X}}\phi \,d{\widehat{\mu }}[\Lambda ^n] = {\mathbb {E}}^{P_n}\int _{\mathcal {X}}\phi \,d\mu . \end{aligned}$$

Since also

$$\begin{aligned} \sup _n{\mathbb {E}}^{P_n}\left[ \Vert W\Vert ^{p'}_T + \int _0^T\int _A|a|^{p'}\Lambda _t(da)dt + \Vert X\Vert ^{p'}_T\right] < \infty , \end{aligned}$$

the relative compactness of \((P_n \circ \mu ^{-1})_{n=1}^\infty \) in \({\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {X}}))\) follows from the relative compactness of \((P_n \circ (W,\Lambda ,X)^{-1})_{n=1}^\infty \) in \({\mathcal {P}}^p({\mathcal {X}})\). Indeed, when \(p=0\) and \({\mathcal {P}}^0\) is given the topology of weak convergence, this is a well known result of Sznitman, stated in (2.5) of the proof of [34, Proposition 2.2]. See [29, Corollary B.2] for the generalization to \({\mathcal {P}}^p\). This completes the proof. \(\square \)

Lemma 5.5

Any limit point P of \((P_n)_{n=1}^\infty \) in \({\mathcal {P}}^p(\Omega )\) is an MFG pre-solution.

Proof

We abuse notation somewhat by assuming that \(P_n \rightarrow P\), with the understanding that this holds along a subsequence. We check that P satisfies the four conditions of Lemma 4.3.

  1. (1)

    Of course,

    $$\begin{aligned} P_n \circ (\xi ,B,W)^{-1}&= \frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ (\xi ^i,B,W^i)^{-1} = \lambda \times {\mathcal {W}}^{m_0} \times {\mathcal {W}}^m, \end{aligned}$$

where \({\mathcal {W}}^k\) denotes Wiener measure on \({\mathcal {C}}^k\). Thus \(P \circ (\xi ,B,W)^{-1} = \lambda \times {\mathcal {W}}^{m_0} \times {\mathcal {W}}^m\) as well. On \(\Omega _n\), we know \(\sigma (W^i_s - W^i_t,B_s-B_t : i=1,\ldots ,n, \ s \in [t,T])\) is \({\mathbb {P}}_n\)-independent of \({\mathcal {F}}^n_t\) for each \(t \in [0,T]\). It follows that, on \(\Omega \), \(\sigma (W_s - W_t,B_s-B_t : s \in [t,T])\) is \(P_n\)-independent of \({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda ,X}_t\) for each n; this independence is preserved under the weak limit, and so it holds under P as well. Hence B and W are Wiener processes on \((\Omega ,({\mathcal {F}}^{\xi ,B,W,\mu ,\Lambda ,X}_t)_{t \in [0,T]},P)\).

  2. (2)

    Fix bounded continuous functions \(\phi : {\mathbb {R}}^d \times {\mathcal {C}}^m \rightarrow {\mathbb {R}}\) and \(\psi : {\mathcal {C}}^{m_0} \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow {\mathbb {R}}\). Since \((\xi ^1,W^1),\ldots ,(\xi ^n,W^n)\) are i.i.d. under \({\mathbb {P}}_n\) with common law \(P \circ (\xi ,W)^{-1}\) for each n, the law of large numbers implies

    $$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \left| \frac{1}{n}\sum _{i=1}^n\phi (\xi ^i,W^i) - {\mathbb {E}}^P[\phi (\xi ,W)]\right| \psi (B,{\widehat{\mu }}[\Lambda ^n])\right] = 0. \end{aligned}$$

    This implies

    $$\begin{aligned} {\mathbb {E}}^P\left[ \phi (\xi ,W)\psi (B,\mu )\right]&= \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \phi (\xi ^i,W^i) \psi (B, {\widehat{\mu }}[\Lambda ^n]) \right] \\&= {\mathbb {E}}^P[\phi (\xi ,W)]\lim _{n\rightarrow \infty } \frac{1}{n} \sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \psi (B,{\widehat{\mu }}[\Lambda ^n]) \right] \\&= {\mathbb {E}}^P[\phi (\xi ,W)]{\mathbb {E}}^P\left[ \psi (B,\mu )\right] . \end{aligned}$$

    This shows \((B,\mu )\) is independent of \((\xi ,W)\) under P. Since \(\xi ^i\) and \(W^i\) are independent under \({\mathbb {P}}_n\), it follows that \(\xi \) and W are independent under \(P_n\), for each n. Thus \(\xi \) and W are independent under P, and we conclude that \(\xi \), W, and \((B,\mu )\) are independent under P.

  3. (3)

    Let \(\phi : {\mathcal {X}}\rightarrow {\mathbb {R}}\) and \(\psi : {\mathcal {C}}^{m_0} \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow {\mathbb {R}}\) be bounded and continuous. Then

    $$\begin{aligned}&{\mathbb {E}}^P\left[ \psi (B,\mu )\phi (W,\Lambda ,X)\right] \\&\quad = \lim _{n\rightarrow \infty }{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \psi (B,{\widehat{\mu }} [\Lambda ^n])\frac{1}{n} \sum _{i=1}^n\phi (W^i, \Lambda ^{n,i},X^i[\Lambda ^n]) \right] \\&\quad = \lim _{n\rightarrow \infty }{\mathbb {E}}^{{\mathbb {P}}_n} \left[ \psi (B, {\widehat{\mu }} [\Lambda ^n])\int _{\mathcal {X}}\phi \,d{\widehat{\mu }}[\Lambda ^n] \right] \\&\quad = {\mathbb {E}}^P\left[ \psi (B,\mu )\int _{\mathcal {X}}\phi \,d\mu \right] . \end{aligned}$$
  4. (4)

    Since \((\xi ^i,B,W^i, {\widehat{\mu }} [\Lambda ^n], \Lambda ^{n,i}, X^i[\Lambda ^n])\) verify the state SDE under \({\mathbb {P}}_n\), the canonical processes \((\xi ,B,W,\mu ,\Lambda ,X)\) verify the state equation (4.1) under each \(P_n\), for each n. It follows from the results of Kurtz and Protter [28] (see Theorem 4.8 therein, and the preceding paragraph) that the state equation holds under the limit measure P as well.\(\square \)

5.3 Modified finite-player games

The last step of the proof, executed in Sect. 5.4 below, is to show that any limit P of \(P_n\) is optimal. This step is more involved, and we devote this subsection to studying a useful technical device which we call the k-modified n-player game, in which agent k is removed from the empirical measures. Intuitively, if the n-player game is modified so that the empirical measure (present in the state process dynamics and objective functions) no longer includes agent k, then the optimization problem of agent k decouples from that of the other agents; agent k may then treat the empirical measure of the other \(n-1\) agents as fixed and thus faces exactly the type of control problem encountered in the MFG. Let us make this idea precise.

For \(\beta = (\beta ^1,\ldots ,\beta ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), define \(Y^{-k}[\beta ] = (Y^{-k,1}[\beta ],\ldots ,Y^{-k,n}[\beta ])\) to be the unique strong solution on \((\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n)\) of the SDE

$$\begin{aligned} Y^{-k,i}_t[\beta ]= & {} \xi ^i + \int _0^t\int _Ab(s,Y^{-k,i}_s[\beta ], {\widehat{\mu }}^{-k,x}_s[\beta ],a)\beta ^i_s(da)ds \\&+ \int _0^t\sigma (s,Y^{-k,i}_s[\beta ], {\widehat{\mu }}^{-k,x}_s [\beta ])dW^i_s \\&+ \int _0^t\sigma _0(s,Y^{-k,i}_s[\beta ], {\widehat{\mu }}^{-k,x}_s[\beta ])dB_s, \\ {\widehat{\mu }}^{-k,x}[\beta ]:= & {} \frac{1}{n-1}\sum _{i \ne k}^n\delta _{Y^{-k,i}[\beta ]}. \end{aligned}$$

Define also

$$\begin{aligned} {\widehat{\mu }}^{-k}[\beta ] = \frac{1}{n-1}\sum _{i \ne k}^n\delta _{(W^i,\beta ^i,Y^{-k,i}[\beta ])}. \end{aligned}$$

Intuitively, \(Y^{-k,i}\) is agent i’s state process in an analog of the n-player game, in which agent k has been removed from the empirical measure. Naturally, for fixed k, the k-modified state processes \(Y^{-k}[\beta ]\) should not be far from the true state processes \(X[\beta ]\) if n is large, and we will quantify this precisely. We will need to be somewhat explicit about the choice of metric on \({\mathcal {V}}\), so we define \(d_{\mathcal {V}}\) by

$$\begin{aligned} d^p_{\mathcal {V}}(q,q'):= & {} T\ell ^p_{[0,T] \times A,p}(q/T,q'/T) \\= & {} \inf _\gamma \int _{[0,T]^2 \times A^2}(|t-t'|^p + |a-a'|^p)\gamma (dt,dt',da,da'), \end{aligned}$$

where the infimum is over measures \(\gamma \) on \([0,T]^2 \times A^2\) with marginals q and \(q'\).
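For orientation, if q and \(q'\) come from strict controls, say \(q = dt\delta _{\alpha _t}(da)\) and \(q' = dt\delta _{\alpha '_t}(da)\) for A-valued processes \(\alpha \) and \(\alpha '\) (an illustrative special case, not used in the sequel), then the diagonal-in-time coupling gives

$$\begin{aligned} d^p_{\mathcal {V}}(q,q') \le \int _0^T|\alpha _t - \alpha '_t|^pdt. \end{aligned}$$

By choosing \(\gamma = dt\delta _{t}(dt')q_t(da)q'_t(da')\), which forces \(t = t'\) and so kills the \(|t-t'|^p\) term, and using the elementary bound \(|a-a'|^p \le 2^{p-1}(|a|^p + |a'|^p)\), we note that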

$$\begin{aligned} d^p_{\mathcal {V}}(q,q') \le 2^{p-1}\int _0^T\int _A|a|^pq_t(da)dt + 2^{p-1}\int _0^T\int _A|a|^pq'_t(da)dt. \end{aligned}$$
(5.4)

Define the \(p'\)-Wasserstein distance \(\ell _{{\mathcal {X}},p'}\) on \({\mathcal {P}}^{p'}({\mathcal {X}})\) with respect to the metric

$$\begin{aligned} d_{\mathcal {X}}((w,q,x),(w',q',x'))&:= \Vert w-w'\Vert _T + d_{\mathcal {V}}(q,q') + \Vert x-x'\Vert _T. \end{aligned}$$
(5.5)

Lemma 5.6

There exists a constant \(c_8 > 0\) such that, for each \(n \ge k \ge 1\) and \(\beta = (\beta ^1,\ldots ,\beta ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), we have

$$\begin{aligned}&{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \ell _{{\mathcal {X}},p'}^{p'}({\widehat{\mu }}^{-k}[\beta ], {\widehat{\mu }}[\beta ]) + \left\| X^k[\beta ] - Y^{-k,k}[\beta ] \right\| ^{p'}_T \right] \le c_8(1+M[\beta ])/n, \\&\quad \text {where}~M[\beta ] := {\mathbb {E}}^{{\mathbb {P}}_n}\left[ |\xi ^1|^{p'} + \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^{p'}\beta ^i_t(da)dt\right] . \end{aligned}$$

Proof

Throughout the proof, n is fixed, expected values are all with respect to \({\mathbb {P}}_n\), and the notation \([\beta ]\) is omitted. Define the truncated \(p'\)-Wasserstein distance \(\ell _t\) on \({\mathcal {P}}^{p'}({\mathcal {C}}^d)\) by

$$\begin{aligned} \ell ^{p'}_t(\mu ,\nu )&:= \inf \left\{ \int _{{\mathcal {C}}^d \times {\mathcal {C}}^d}\Vert x-y\Vert _t^{p'} \, \gamma (dx,dy)\right. \nonumber \\&\left. \quad : \gamma \in {\mathcal {P}}({\mathcal {C}}^d \times {\mathcal {C}}^d) \quad \text {has marginals } \mu , \nu \right\} \end{aligned}$$
(5.6)

Apply the Burkholder–Davis–Gundy inequality and Jensen’s inequality (using the assumption \(p' \ge 2\)) to find a constant \(C > 0\) (which will change from line to line but depends only on d, p, \(p'\), T, \(c_1\), and \(c_5\)) such that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert X^i \!-\! Y^{-k,i}\Vert ^{p'}_t\right] \le&\,C{\mathbb {E}}\int _0^t\int _A|b(s,X^i_s,{\widehat{\mu }}^x_s,a) \!-\! b(s,Y^{-k,i}_s,{\widehat{\mu }}^{-k,x}_s,a)|^{p'}\beta ^i_s(da)ds \\&+ C{\mathbb {E}}\int _0^t\left| \sigma (s,X^i_s,{\widehat{\mu }}^x_s) - \sigma (s,Y^{-k,i}_s,{\widehat{\mu }}^{-k,x}_s)\right| ^{p'}ds \\&+ C{\mathbb {E}}\int _0^t\left| \sigma _0(s,X^i_s,{\widehat{\mu }}^x_s) - \sigma _0(s,Y^{-k,i}_s,{\widehat{\mu }}^{-k,x}_s)\right| ^{p'}ds \\ \le&\,C{\mathbb {E}}\int _0^t\left( \Vert X^i - Y^{-k,i}\Vert ^{p'}_s + \ell ^{p'}_s({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})\right) ds. \end{aligned}$$

The last line follows from the Lipschitz assumption (A.4), along with the observation that

$$\begin{aligned} \ell _{{\mathbb {R}}^d,p}(\nu ^1_s,\nu ^2_s) \le \ell _{{\mathbb {R}}^d,p'}(\nu ^1_s,\nu ^2_s) \le \ell _s(\nu ^1,\nu ^2), \end{aligned}$$

for each \(\nu ^1,\nu ^2 \in {\mathcal {P}}^{p'}({\mathcal {C}}^d)\). By Gronwall’s inequality (updating the constant C),

$$\begin{aligned} {\mathbb {E}}\left[ \Vert X^i - Y^{-k,i}\Vert ^{p'}_t\right] \le C{\mathbb {E}}\int _0^t\ell ^{p'}_s({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})ds. \end{aligned}$$
(5.7)

Now we define a standard coupling of the empirical measures \({\widehat{\mu }}^x\) and \({\widehat{\mu }}^{-k,x}\): first, draw a number j from \(\{1,\ldots ,n\}\) uniformly at random, and consider \(X^j\) to be a sample from \({\widehat{\mu }}^x\). If \(j \ne k\), choose \(Y^{-k,j}\) to be a sample from \({\widehat{\mu }}^{-k,x}\), but if \(j = k\), draw another number \(j'\) from \(\{1,\ldots ,n\} \backslash \{k\}\) uniformly at random, and choose \(Y^{-k,j'}\) to be a sample from \({\widehat{\mu }}^{-k,x}\). This yields

$$\begin{aligned} \ell ^{p'}_t({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})&\le \frac{1}{n}\sum _{i \ne k}^n\Vert X^i - Y^{-k,i}\Vert _t^{p'} + \frac{1}{n(n-1)}\sum _{i\ne k}^n\Vert X^k - Y^{-k,i}\Vert _t^{p'} \end{aligned}$$
(5.8)
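For concreteness, the coupling just described corresponds to the measure (written out here only to make (5.8) transparent)

$$\begin{aligned} \gamma := \frac{1}{n}\sum _{i \ne k}^n\delta _{(X^i,Y^{-k,i})} + \frac{1}{n(n-1)}\sum _{i \ne k}^n\delta _{(X^k,Y^{-k,i})}, \end{aligned}$$

which indeed has first marginal \({\widehat{\mu }}^x\) and second marginal \({\widehat{\mu }}^{-k,x}\); evaluating \(\int \Vert x-y\Vert _t^{p'}\gamma (dx,dy)\) yields exactly the right-hand side of (5.8).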

We know from Lemma 5.1 that

$$\begin{aligned} \frac{1}{n-1}\sum _{i \ne k}^n{\mathbb {E}}\left[ \Vert X^i\Vert _T^{p'}\right] \le c_5(1+M). \end{aligned}$$

It should be clear that an analog of Lemma 5.1 holds for \(Y^{-k,i}\) as well, with the same constant. In particular,

$$\begin{aligned} \frac{1}{n-1}\sum _{i \ne k}^n{\mathbb {E}}\left[ \Vert Y^{-k,i}\Vert _T^{p'}\right] \le c_5(1+M). \end{aligned}$$

Combine the above four inequalities, averaging (5.7) over \(i \ne k\), to get

$$\begin{aligned} {\mathbb {E}}\left[ \ell ^{p'}_t({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})\right]&\le C{\mathbb {E}}\int _0^t\ell ^{p'}_s({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})ds + 2^{p'}c_5(1+M)/n. \end{aligned}$$

Gronwall’s inequality yields a new constant such that

$$\begin{aligned} {\mathbb {E}}\left[ \ell ^{p'}_T({\widehat{\mu }}^x,{\widehat{\mu }}^{-k,x})\right] \le C(1+M)/n. \end{aligned}$$

Return to (5.7) to find

$$\begin{aligned} {\mathbb {E}}\left[ \Vert X^i - Y^{-k,i}\Vert ^{p'}_T\right] \le C(1+M)/n, \quad \text {for } i=1,\ldots ,n. \end{aligned}$$
(5.9)

The same coupling argument leading to (5.8) also yields

$$\begin{aligned} \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\mu }},{\widehat{\mu }}^{-k})&\le \frac{1}{n}\sum _{i \ne k}^n\Vert X^i - Y^{-k,i}\Vert ^{p'}_T \nonumber \\&\quad + \frac{1}{n(n-1)}\sum _{i\ne k}^n d^{p'}_{\mathcal {X}}((W^i,\beta ^{i},Y^{-k,i}),(W^k,\beta ^{k},X^k)) \end{aligned}$$
(5.10)

Using (5.4), we find yet another constant such that

$$\begin{aligned}&{\mathbb {E}}\left[ d^{p'}_{\mathcal {X}}((W^i,\beta ^{i},Y^{-k,i}), (W^k,\beta ^{k},X^k))\right] \\&\quad \le 3^{p'-1}{\mathbb {E}}\left[ \Vert W^i - W^k\Vert _T^{p'} + d_{{\mathcal {V}}}^{p'}(\beta ^{i},\beta ^{k}) + \Vert Y^{-k,i} -X^k \Vert _T^{p'}\right] \\&\quad \le C{\mathbb {E}}\left[ \int _0^T\int _A|a|^{p'}\beta ^i_t(da)dt + \int _0^T\int _A|a|^{p'}\beta ^k_t(da)dt + \Vert W^1\Vert _T^{p'} \right. \\&\qquad \left. + \Vert Y^{-k,i}\Vert _T^{p'} + \Vert X^k \Vert _T^{p'}\right] \\&\quad \le C\left( 2nM + 2nc_5(1+M) + {\mathbb {E}}[\Vert W^1\Vert _T^{p'}]\right) . \end{aligned}$$

Thus

$$\begin{aligned} \frac{1}{n-1}\sum _{i \ne k}^n{\mathbb {E}}\left[ d^{p'}_{\mathcal {X}}((W^i,\beta ^{i}, Y^{-k,i}), (W^k,\beta ^{k},X^k))\right]&\le C(1+M). \end{aligned}$$

Applying this bound and (5.9) to (5.10) completes the proof. \(\square \)

5.4 Optimality in the limit

Before we complete the proof, recall the definitions of \({\mathcal {R}}\), \({\mathcal {A}}\), and \({\mathcal {A}}^*\) from Sect. 4. The final step is to show that \(P \in {\mathcal {R}}{\mathcal {A}}^*(P \circ (\xi ,B,W,\mu )^{-1})\), for any limit P of \((P_n)_{n=1}^\infty \). The idea of the proof is to use the density of adapted controls (see Lemma 4.7) to construct nearly optimal controls for the MFG with nice continuity properties. From these controls we build admissible controls for the n-player game, and it must finally be argued that the inequality obtained from the \(\epsilon ^n\)-Nash assumption on \(\Lambda ^n\) may be passed to the limit.

Proof of Theorem 2.6

Let P be a limit point of \((P_n)_{n=1}^\infty \), which we know exists by Lemma 5.4, and again abuse notation by assuming that \(P_n \rightarrow P\). Let \(\rho := P \circ (\xi ,B,W,\mu )^{-1}\). We know from Lemma 5.5 that P is an MFG pre-solution, so it remains only to check that P is optimal. Thanks to the density result of Lemma 4.7, it suffices to check that \(J(P) \ge J({\mathcal {R}}({\widetilde{Q}}))\) for each \({\widetilde{Q}} \in {\mathcal {A}}_a(\rho )\) (see Definition 4.6). Fix arbitrarily some \({\widetilde{Q}} \in {\mathcal {A}}_a(\rho )\), and recall that \({\widetilde{Q}} \in {\mathcal {A}}_a(\rho )\) means that there exists a compact, continuous, adapted function \({\tilde{\phi }} : \Omega _0 \times {\mathcal {P}}^p({\mathcal {X}}) \rightarrow {\mathcal {V}}\) such that

$$\begin{aligned} {\widetilde{Q}} := \rho \circ (\xi ,B,W,\mu ,{\tilde{\phi }}(\xi ,B,W,\mu ))^{-1}. \end{aligned}$$

For \(1 \le k \le n\), let

$$\begin{aligned} \rho _{n,k} := {\mathbb {P}}_n \circ (\xi ^k,B,W^k,{\widehat{\mu }}^{-k}[\Lambda ^n])^{-1}, \end{aligned}$$

and

$$\begin{aligned} Q_{n,k}&:= \rho _{n,k} \circ (\xi ,B,W,\mu ,{\tilde{\phi }}(\xi ,B,W,\mu ))^{-1} \\&= {\mathbb {P}}_n \circ \left( \xi ^k,B,W^k,{\widehat{\mu }}^{-k} [\Lambda ^n], {\tilde{\phi }}(\xi ^k,B,W^k,{\widehat{\mu }}^{-k}[\Lambda ^n])\right) ^{-1}. \end{aligned}$$

It follows from Lemma 5.4 and the definition of \(P_n\) that

$$\begin{aligned} \sup _n\frac{1}{n}\sum _{i=1}^n{\mathbb {E}}^{{\mathbb {P}}_n} \int _0^T\int _A|a|^{p'} \Lambda ^{n,i}_t(da)dt < \infty . \end{aligned}$$

In the notation of Lemma 5.6, this implies \(\sup _nM[\Lambda ^n] < \infty \), and thus

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^n\rho _{n,k}&= \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^n{\mathbb {P}}_n \circ (\xi ^k,B,W^k,{\widehat{\mu }}[\Lambda ^n])^{-1} = \rho . \end{aligned}$$

Since

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^n\rho _{n,k} \circ (\xi ,B,W)^{-1} = P \circ (\xi ,B,W)^{-1} \end{aligned}$$

does not depend on n, the continuity of \({\tilde{\phi }}\) implies

$$\begin{aligned} {\widetilde{Q}} = \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^nQ_{n,k}. \end{aligned}$$
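To spell out this last step (a sketch; the map \(\Phi \) is introduced here only for illustration), write \(\Phi (\xi ,b,w,\nu ) := (\xi ,b,w,\nu ,{\tilde{\phi }}(\xi ,b,w,\nu ))\), so that \(Q_{n,k} = \rho _{n,k} \circ \Phi ^{-1}\) and \({\widetilde{Q}} = \rho \circ \Phi ^{-1}\). By linearity of pushforwards,

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^nQ_{n,k} = \left( \frac{1}{n}\sum _{k=1}^n\rho _{n,k}\right) \circ \Phi ^{-1} \rightarrow \rho \circ \Phi ^{-1} = {\widetilde{Q}}, \end{aligned}$$

where the convergence in \({\mathcal {P}}^p\) uses the continuity of \(\Phi \) together with the fact that \({\tilde{\phi }}\) takes values in a compact subset of \({\mathcal {V}}\).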

It is fairly straightforward to check that \({\mathcal {R}}\) is a linear map, and it is even more straightforward to check that J is linear. Moreover, since \({\tilde{\phi }}\) is a compact function, the continuity of \({\mathcal {R}}\) and J of Lemmas 4.4 and 4.5 imply

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^{n}J({\mathcal {R}}(Q_{n,k}))&= \lim _{n\rightarrow \infty }J\left( {\mathcal {R}}\left( \frac{1}{n}\sum _{k=1}^{n}Q_{n,k}\right) \right) = J({\mathcal {R}}({\widetilde{Q}})). \end{aligned}$$
(5.11)

(Note that the \(p'\)-moment bound of Lemma 5.4 permits the application of Lemma 4.4.)

Now, for \(k \le n\), define \(\beta ^{n,k} \in {\mathcal {A}}_n({\mathcal {E}}_n)\) by

$$\begin{aligned} \beta ^{n,k} := {\tilde{\phi }}\left( \xi ^k,B,W^k, {\widehat{\mu }}^{-k} [\Lambda ^n]\right) . \end{aligned}$$

For \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\), abbreviate \((\Lambda ^{n,-k},\beta ) := ((\Lambda ^n)^{-k},\beta )\). Since agent k is removed from the empirical measure, we have \({\widehat{\mu }}^{-k}[\Lambda ^n] = {\widehat{\mu }}^{-k}[(\Lambda ^{n,-k},\beta )]\) for any \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\). The key point is that for each \(k \le n\),

$$\begin{aligned} {\mathbb {P}}_n \circ \left( \xi ^k,B,W^k,{\widehat{\mu }}^{-k}[(\Lambda ^{n,-k},\beta ^{n,k})], \beta ^{n,k},Y^{-k,k} [(\Lambda ^{n,-k},\beta ^{n,k})]\right) ^{-1} = {\mathcal {R}}(Q_{n,k}). \end{aligned}$$
(5.12)

To prove (5.12), let \(P'\) denote the measure on the left-hand side. Since \({\widehat{\mu }}^{-k}[\Lambda ^n] = {\widehat{\mu }}^{-k}[(\Lambda ^{n,-k},\beta ^{n,k})]\), we have

$$\begin{aligned} P' \circ (\xi ,B,W,\mu ,\Lambda )^{-1}&= Q_{n,k}. \end{aligned}$$

Since the processes

$$\begin{aligned} \left( \xi ^k,B,W^k,{\widehat{\mu }}^{-k}[(\Lambda ^{n,-k},\beta ^{n,k})], \beta ^{n,k},Y^{-k,k} [(\Lambda ^{n,-k}, \beta ^{n,k})]\right) \end{aligned}$$

verify the state SDE (4.1) on \((\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n)\), the canonical processes \((\xi ,B,W,\mu ,\Lambda ,X)\) verify the state SDE (4.1) under \(P'\). Hence, \(P' = {\mathcal {R}}(Q_{n,k})\). With (5.12) in hand, the definition of J shows that (5.11) translates to

$$\begin{aligned}&\lim _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^{n} {\mathbb {E}}^{{\mathbb {P}}_n} \left[ \Gamma \left( {\widehat{\mu }}^{-k,x}[(\Lambda ^{n,-k},\beta ^{n,k})],\beta ^{n,k},Y^{-k,k} [(\Lambda ^{n,-k},\beta ^{n,k})]\right) \right] \nonumber \\&\quad = J({\mathcal {R}}({\widetilde{Q}})). \end{aligned}$$
(5.13)

One more technical ingredient is needed before we can complete the proof. Namely, we would like to substitute \(X^k[(\Lambda ^{n,-k},\beta ^{n,k})]\) for \(Y^{-k,k}[(\Lambda ^{n,-k},\beta ^{n,k})]\) in (5.13), by proving

$$\begin{aligned} 0&= \lim _{n\rightarrow \infty }\frac{1}{n} \sum _{k=1}^{n} {\mathbb {E}}^{{\mathbb {P}}_n} \left[ \Gamma \left( {\widehat{\mu }}^{-k,x}[(\Lambda ^{n,-k},\beta ^{n,k})], \beta ^{n,k},Y^{-k,k} [(\Lambda ^{n,-k}, \beta ^{n,k})]\right) \right. \nonumber \\&\quad \left. - \Gamma \left( {\widehat{\mu }}^x[(\Lambda ^{n,-k}, \beta ^{n,k})], \beta ^{n,k},X^k[(\Lambda ^{n,-k},\beta ^{n,k})]\right) \right] \end{aligned}$$
(5.14)

Indeed, it follows from Lemma 5.1 (and an obvious analog for the modified state processes Y) that

$$\begin{aligned} Z_{n,k}&:= {\mathbb {E}}^{{\mathbb {P}}_n}\left[ \Vert X^k[(\Lambda ^{n,-k}, \beta ^{n,k})]\Vert ^{p'}_T + \Vert Y^{-k,k}[(\Lambda ^{n,-k},\beta ^{n,k})]\Vert ^{p'}_T \right. \\&\quad \left. + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T{\widehat{\mu }}^x[(\Lambda ^{n, -k}, \beta ^{n,k})](dz) + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T {\widehat{\mu }}^{-k,x} [(\Lambda ^{n,-k},\beta ^{n,k})](dz)\right] \\&\le 4c_5{\mathbb {E}}^{{\mathbb {P}}_n}\left[ |\xi ^1|^{p'} + \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^{p'}\Lambda ^{n,i}_t(da)dt + \int _0^T\int _A|a|^{p'}\beta ^{n,k}_t(da)dt\right] . \end{aligned}$$

Lemma 5.4 says that

$$\begin{aligned} \sup _n{\mathbb {E}}^{{\mathbb {P}}_n}\left[ \frac{1}{n}\sum _{i=1}^n\int _0^T\int _A|a|^{p'} \Lambda ^{n,i}_t(da)dt\right] < \infty . \end{aligned}$$

Compactness of \({\tilde{\phi }}\) implies that there exists a compact set \(K \subset A\) such that \(\beta ^{n,k}_t(K^c) = 0\) for a.e. \(t \in [0,T]\) and all \(n \ge k \ge 1\). Thus

$$\begin{aligned} \sup _n\frac{1}{n}\sum _{k=1}^nZ_{n,k} < \infty , \end{aligned}$$

and we have the uniform integrability needed to deduce (5.14), from Lemma 5.6 and from the continuity and growth assumptions (A.5) on f and g.

A simple manipulation of the definitions yields \(J(P_n) = \frac{1}{n}\sum _{k=1}^nJ_k(\Lambda ^n)\). Then, since \(P_n \rightarrow P\), the upper semicontinuity of J of Lemma 4.5 implies

$$\begin{aligned} J(P) \ge \limsup _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^nJ_k(\Lambda ^n). \end{aligned}$$

Finally, use the fact that \(\Lambda ^n\) is a relaxed \(\epsilon ^n\)-Nash equilibrium to get

$$\begin{aligned} J(P)&\ge \liminf _{n\rightarrow \infty }\frac{1}{n} \sum _{k=1}^n\left[ J_k((\Lambda ^{n,-k},\beta ^{n,k})) - \epsilon ^n_k\right] \\&= \liminf _{n\rightarrow \infty }\frac{1}{n}\sum _{k=1}^n {\mathbb {E}}^{{\mathbb {P}}_n}\left[ \Gamma \left( {\widehat{\mu }}^x[(\Lambda ^{n,-k}, \beta ^{n,k})], \beta ^{n,k},X^k[(\Lambda ^{n,-k},\beta ^{n,k})]\right) \right] \\&= \liminf _{n\rightarrow \infty } \frac{1}{n}\sum _{k=1}^{n}{\mathbb {E}}^{{\mathbb {P}}_n} \left[ \Gamma \left( {\widehat{\mu }}^{-k,x} [(\Lambda ^{n,-k},\beta ^{n,k})],\beta ^{n,k},Y^{-k,k} [(\Lambda ^{n,-k},\beta ^{n,k})]\right) \right] \\&= J({\mathcal {R}}({\widetilde{Q}})) \end{aligned}$$

The second line follows from the definition of \(J_k\), and the \(\epsilon ^n_k\) drops out because of the hypothesis (2.9). The third line comes from (5.14), and the last is from (5.13). This completes the proof. \(\square \)

6 Proof of Theorem 2.11

This section is devoted to the proof of Theorem 2.11, which we split into two pieces.

Theorem 6.1

Suppose Assumptions A and B hold. Let \(P \in {\mathcal {P}}(\Omega )\) be a weak MFG solution. Then there exist, for each n,

  1. (1)

    \(\epsilon _n \ge 0\),

  2. (2)

    an n-player environment \({\mathcal {E}}_n = (\Omega _n,({\mathcal {F}}^n_t)_{t \in [0,T]},{\mathbb {P}}_n,\xi ,B,W)\), and

  3. (3)

    a relaxed \((\epsilon _n,\ldots ,\epsilon _n)\)-Nash equilibrium \(\Lambda ^n = (\Lambda ^{n,1},\ldots ,\Lambda ^{n,n})\) on \({\mathcal {E}}_n\),

such that \(\lim _{n\rightarrow \infty }\epsilon _n = 0\) and \(P_n \rightarrow P\) in \({\mathcal {P}}^p(\Omega )\), where

$$\begin{aligned} P_n := \frac{1}{n}\sum _{i=1}^n{\mathbb {P}}_n \circ \left( \xi ^i,B,W^i,{\widehat{\mu }}[\Lambda ^n], \Lambda ^{n,i}, X^i[\Lambda ^n]\right) ^{-1}. \end{aligned}$$

Theorem 6.1 is nearly the same as Theorem 2.11, except that the equilibria \(\Lambda ^n\) are now relaxed instead of strong, and the environments \({\mathcal {E}}_n\) are now part of the conclusion of the theorem instead of the input. We will prove Theorem 6.1 by constructing a convenient sequence of environments \({\mathcal {E}}_n\), which all live on the same larger probability space supporting an i.i.d. sequence of state processes corresponding to the given MFG solution. This kind of argument is known as trajectorial propagation of chaos in the literature on McKean–Vlasov limits, and the Lipschitz assumption in the measure argument is useful here. The precise choice of environments also facilitates the proof of the following proposition. Recall the definition of a strong \(\epsilon \)-Nash equilibrium from Remark 2.3 and the discussion preceding it.

Proposition 6.2

Let \({\mathcal {E}}_n\) be the environments defined in the proof of Theorem 6.1 (in Sect. 6.1). Let \(\Lambda ^0 = (\Lambda ^{0,1},\ldots ,\Lambda ^{0,n}) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\). Then there exist strong strategy profiles \(\Lambda ^k = (\Lambda ^{k,1},\ldots ,\Lambda ^{k,n}) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), \(k \ge 1\), such that:

  1. (1)

    In \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\),

    $$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}_n \circ \left( B,W,\Lambda ^k,X[\Lambda ^k]\right) ^{-1} ={\mathbb {P}}_n \circ \left( B,W,\Lambda ^0,X[\Lambda ^0]\right) ^{-1}, \end{aligned}$$
  2. (2)

    \(\lim _{k\rightarrow \infty }J_i(\Lambda ^k) = J_i(\Lambda ^0)\), for \(i=1,\ldots ,n\),

  3. (3)
    $$\begin{aligned} \limsup _{k\rightarrow \infty }\sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\Lambda ^{k,-i},\beta )) \le \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\Lambda ^{0,-i},\beta )), \quad \text {for}\quad i=1,\ldots ,n. \end{aligned}$$

In particular, if \(\Lambda ^0\) is a relaxed \(\epsilon ^0=(\epsilon ^0_1,\ldots ,\epsilon ^0_n)\)-Nash equilibrium, then \(\Lambda ^k\) is a strong \((\epsilon ^0 + \epsilon ^k)\)-Nash equilibrium, where

$$\begin{aligned} \epsilon ^k_i := \left[ \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\Lambda ^{k,-i},\beta )) - J_i(\Lambda ^k) - \epsilon ^0_i\right] ^+ \rightarrow 0 \quad \text {as}\quad k \rightarrow \infty . \end{aligned}$$
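Indeed, unwrapping the definition of \(\epsilon ^k_i\) gives, for every \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\),

$$\begin{aligned} J_i((\Lambda ^{k,-i},\beta )) \le \sup _{\beta ' \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\Lambda ^{k,-i},\beta ')) \le J_i(\Lambda ^k) + \epsilon ^0_i + \epsilon ^k_i, \end{aligned}$$

which is precisely the strong \((\epsilon ^0 + \epsilon ^k)\)-Nash property, while \(\epsilon ^k_i \rightarrow 0\) follows by combining (2) and (3) with the assumption that \(\Lambda ^0\) is a relaxed \(\epsilon ^0\)-Nash equilibrium.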

Proof of Theorem 2.11

Recall that strong strategies are insensitive to the choice of n-player environment (see Remark 2.12), and so it suffices to prove the theorem on any given sequence of environments, such as those provided by Theorem 6.1. By Theorem 6.1 we may find \(\epsilon _n \rightarrow 0\) and a relaxed \((\epsilon _n,\ldots ,\epsilon _n)\)-Nash equilibrium \(\Lambda ^n\) for the n-player game, with the desired convergence properties. Then, by Proposition 6.2, we find for each n and each k a strong \(\epsilon ^{n,k}=(\epsilon _n + \epsilon ^{n,k}_1,\ldots ,\epsilon _n + \epsilon ^{n,k}_n)\)-Nash equilibrium \(\Lambda ^{n,k} \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) with the convergence properties listed in Proposition 6.2. For each n, choose \(k_n\) large enough that \(\epsilon ^{n,k_n}_i \le 2^{-n}\) for each \(i=1,\ldots ,n\) and that the sequences in (1)-(3) of Proposition 6.2 are each within \(2^{-n}\) of their respective limits. The diagonal sequence \(\Lambda ^{n,k_n}\) is then a strong \((\epsilon _n + 2^{-n},\ldots ,\epsilon _n + 2^{-n})\)-Nash equilibrium for each n, and by property (1) it inherits the convergence \(P_n \rightarrow P\) from \(\Lambda ^n\), proving the theorem.

\(\square \)

6.1 Construction of environments

Fix a weak MFG solution P. Define \(P_{B,\mu } := P \circ (B,\mu )^{-1}\). We will work on the space

$$\begin{aligned} \overline{\Omega } := [0,1] \times {\mathcal {C}}^{m_0} \times {\mathcal {P}}^p({\mathcal {X}}) \times {\mathcal {X}}^\infty . \end{aligned}$$

Let \((U,B,\mu ,(W^i,\Lambda ^i,Y^i)_{i=1}^\infty )\) denote the identity map (i.e. coordinate processes) on \(\overline{\Omega }\). For \(n \in {\mathbb {N}}\cup \{\infty \}\), consider the complete filtration \((\overline{{\mathcal {F}}}^n_t)_{t \in [0,T]}\) generated by U, B, \(\mu \), and \((W^i,\Lambda ^i,Y^i)_{i=1}^n\), that is, the completion of

$$\begin{aligned} \sigma \left\{ \left( U,B_s,\mu (C_1),\left( W^i_s,\Lambda ^i([0,s] \times C_2),Y^i_s\right) _{i=1}^n\right) : s \le t, C_1 \in {\mathcal {F}}^{\mathcal {X}}_t, C_2 \in {\mathcal {B}}(A)\right\} . \end{aligned}$$

Define the probability measure \({\mathbb {P}}\) on \((\overline{\Omega },\overline{{\mathcal {F}}}^\infty _T)\) by

$$\begin{aligned} {\mathbb {P}}:= du\,P_{B,\mu }(db,d\nu )\prod _{i=1}^\infty \nu (dw^i,dq^i,dy^i). \end{aligned}$$

By construction,

$$\begin{aligned} {\mathbb {P}}\circ \left( Y^i_0,B,W^i,\mu ,\Lambda ^i,Y^i\right) ^{-1} = P, \quad \text {for each } i, \end{aligned}$$

and \((W^i,\Lambda ^i,Y^i)_{i=1}^\infty \) are conditionally i.i.d. with common law \(\mu \) given \((B,\mu )\); a sketch of both verifications is given below. Moreover, U and \((B,\mu ,(W^i,\Lambda ^i,Y^i)_{i=1}^\infty )\) are independent under \({\mathbb {P}}\).
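Indeed, for bounded measurable h we have, by the product form of \({\mathbb {P}}\) (a sketch of the verification),

$$\begin{aligned} {\mathbb {E}}^{{\mathbb {P}}}\left[ h(B,\mu ,W^i,\Lambda ^i,Y^i)\right] = \int P_{B,\mu }(db,d\nu )\int \nu (dw,dq,dy)h(b,\nu ,w,q,y) = {\mathbb {E}}^P\left[ h(B,\mu ,W,\Lambda ,X)\right] , \end{aligned}$$

where the last equality uses that \(\mu \) is a version of the conditional law of \((W,\Lambda ,X)\) given \((B,\mu )\) under P (compare condition (3) in the proof of Lemma 5.5); the conditional i.i.d. property follows similarly from the product structure. We will work with the n-player environments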

$$\begin{aligned} {\mathcal {E}}_n := \left( \overline{\Omega },(\overline{{\mathcal {F}}}^n_t)_{t \in [0,T]},{\mathbb {P}},(Y^1_0,\ldots ,Y^n_0),B,(W^1,\ldots ,W^n)\right) , \end{aligned}$$

and we will show that the canonical process \((\Lambda ^1,\ldots ,\Lambda ^n)\) is a relaxed \((\epsilon _n,\ldots ,\epsilon _n)\)-Nash equilibrium for some \(\epsilon _n \rightarrow 0\). Including the seemingly superfluous random variable U makes the class of admissible controls as rich as possible, in the sense that by using U a control can mimic the effect of external randomizations or weak controls. This will become clearer in the proof of Proposition 6.2, more specifically Lemma 6.7, where we will see, for instance, that the set

$$\begin{aligned} \left\{ {\mathbb {P}}\circ ((Y^1_0,\ldots ,Y^n_0),B,(W^1,\ldots ,W^n),\beta )^{-1} : \beta \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\right\} \end{aligned}$$

is closed in \({\mathcal {P}}^p(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n)\). Until the proof of Lemma 6.7, however, U will remain behind the scenes.

Define \(X[\beta ]\) and \({\widehat{\mu }}[\beta ]\) for \(\beta \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) as usual, as in Sect. 2.3. For each \((\overline{{\mathcal {F}}}^\infty _t)_{t \in [0,T]}\)-progressive \({\mathcal {P}}(A)\)-valued process \(\beta \) on \(\overline{\Omega }\) and each \(i \ge 1\), define \(Y^i[\beta ]\) to be the unique solution of the SDE

$$\begin{aligned} dY^i_t[\beta ]= & {} \int _Ab\left( t,Y^i_t[\beta ],\mu ^x_t,a\right) \beta _t(da)dt + \sigma \left( t,Y^i_t[\beta ],\mu ^x_t\right) dW^i_t \\&+ \sigma _0\left( t,Y^i_t[\beta ],\mu ^x_t\right) dB_t, \quad Y^i_0[\beta ] = Y^i_0. \end{aligned}$$

Note that if \(\beta = (\beta ^1,\ldots ,\beta ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) then \(X^i[\beta ]\) differs from \(Y^i[\beta ^i]\) only in the measure flow which appears in the dynamics; \(X^i[\beta ]\) depends on the empirical measure flow of \((X^1[\beta ],\ldots ,X^n[\beta ])\), whereas \(Y^i[\beta ^i]\) depends on the random measure \(\mu \) coming from the MFG solution. Define the canonical n-player strategy profile by

$$\begin{aligned} \overline{\Lambda }^n = (\overline{\Lambda }^{n,1},\ldots ,\overline{\Lambda }^{n,n}) := (\Lambda ^1,\ldots ,\Lambda ^n) \in {\mathcal {A}}_n^n({\mathcal {E}}_n). \end{aligned}$$

This abbreviation serves in part to indicate which n we are working with at any given moment, so that we can suppress the index n from the rest of the notation. Note that \(Y^i[\overline{\Lambda }^{n,i}] = Y^i[\Lambda ^i] = Y^i\).

6.2 Trajectorial propagation of chaos

Intuition from the theory of propagation of chaos suggests that the state processes \((Y^1,\ldots ,Y^n)\) and \((X^1,\ldots ,X^n)\) should be close in some sense, and the purpose of this section is to make this quantitative. For \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\), abbreviate

$$\begin{aligned} (\overline{\Lambda }^{n,-i},\beta ) := ((\overline{\Lambda }^n)^{-i},\beta ) \in {\mathcal {A}}_n^n({\mathcal {E}}_n). \end{aligned}$$

Recall the definition of the metric \(d_{\mathcal {X}}\) on \({\mathcal {X}}\) from (5.5), and again define the \(p'\)-Wasserstein metric \(\ell _{{\mathcal {X}},p'}\) on \({\mathcal {P}}^p({\mathcal {X}})\) relative to the metric \(d_{\mathcal {X}}\).

Lemma 6.3

Fix i and a \((\overline{{\mathcal {F}}}^\infty _t)_{t \in [0,T]}\)-progressive \({\mathcal {P}}(A)\)-valued process \(\beta \), and define

$$\begin{aligned} {\widehat{\nu }}^{n,i}[\beta ] := \frac{1}{n}\left( \sum _{k \ne i}^n\delta _{(W^k,\Lambda ^k,Y^k)} + \delta _{(W^i,\beta ,Y^i[\beta ])}\right) . \end{aligned}$$

There exists a sequence \(\delta _n > 0\) converging to zero such that

$$\begin{aligned} {\mathbb {E}}^{\mathbb {P}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^{n,i}[\beta ],\mu )\right] \le \delta _n\left( 1 + {\mathbb {E}}^{\mathbb {P}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) . \end{aligned}$$

Proof

Expectations are all with respect to \({\mathbb {P}}\) throughout the proof. Define

$$\begin{aligned} {\widehat{\nu }}^n := \frac{1}{n}\sum _{k=1}^n\delta _{(W^k,\Lambda ^k,Y^k)}. \end{aligned}$$

Using the obvious coupling, we find

$$\begin{aligned} \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^{n,i}[\beta ],{\widehat{\nu }}^n) \le \frac{1}{n}d^{p'}_{\mathcal {X}}\left( (W^i,\Lambda ^i,Y^i), (W^i,\beta ,Y^i[\beta ])\right) . \end{aligned}$$

Using (5.4), we find a constant \(C > 0\), depending only on p, \(p'\), and T, such that

$$\begin{aligned}&{\mathbb {E}}\left[ d^{p'}_{\mathcal {X}}\left( (W^i,\Lambda ^i,Y^i),(W^i, \beta ,Y^i[\beta ])\right) \right] \\&\quad \le C{\mathbb {E}}\left[ \int _0^T\int _A|a|^{p'}\beta _t(da)dt + \int _0^T\int _A|a|^{p'}\Lambda ^i_t(da)dt + \Vert Y^i\Vert ^{p'}_T + \Vert Y^i[\beta ]\Vert ^{p'}_T\right] \end{aligned}$$

Analogously to Lemma 5.1, it holds that

$$\begin{aligned} {\mathbb {E}}[\Vert Y^i[\beta ]\Vert ^{p'}_T]&\le c_5{\mathbb {E}}\left[ 1 + |Y^i_0|^{p'} + \int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T\mu ^x(dz) + \int _0^T\int _A|a|^{p'}\beta _t(da)dt\right] . \end{aligned}$$
(6.1)

Note that \({\mathbb {E}}\int _{{\mathcal {C}}^d}\Vert z\Vert ^{p'}_T\mu ^x(dz) < \infty \) and that \({\mathbb {E}}[|Y^i_0|^{p'}] = {\mathbb {E}}[|Y^1_0|^{p'}] < \infty \). Applying (6.1) also with \(\beta = \Lambda ^i\), we find a new constant, still called C and still independent of n, such that

$$\begin{aligned} {\mathbb {E}}\left[ d^{p'}_{\mathcal {X}}\left( (W^i,\Lambda ^i,Y^i),(W^i, \beta ,Y^i[\beta ])\right) \right] \le C\left( 1 + {\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) . \end{aligned}$$

Finally, recall that \((W^k,\Lambda ^k,Y^k)_{k=1}^\infty \) are conditionally i.i.d. given \((B,\mu )\) with common conditional law \(\mu \). Since they are also \(p'\)-integrable, it follows from the law of large numbers that

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'} ({\widehat{\nu }}^n,\mu )\right] = 0. \end{aligned}$$
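In more detail (a sketch of the conditioning argument): conditionally on \((B,\mu )\), \({\widehat{\nu }}^n\) is the empirical measure of n i.i.d. samples from \(\mu \), each with finite \(p'\)-moment, so the law of large numbers in Wasserstein distance gives

$$\begin{aligned} {\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^n,\mu ) \mid B,\mu \right] \rightarrow 0 \ \text {a.s.}, \qquad {\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^n,\mu ) \mid B,\mu \right] \le 2^{p'}\int _{\mathcal {X}}d^{p'}_{\mathcal {X}}(x_0,\cdot )d\mu \end{aligned}$$

for any fixed \(x_0 \in {\mathcal {X}}\); the right-hand side is \({\mathbb {P}}\)-integrable, so the unconditional convergence follows by dominated convergence.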

Complete the proof by using the triangle inequality to get

$$\begin{aligned} {\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^{n,i}[\beta ],\mu )\right]\le & {} \frac{C2^{p'-1}}{n}\left( 1 + {\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) \\&+ 2^{p'-1}{\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^n,\mu )\right] . \end{aligned}$$

\(\square \)

Lemma 6.4

There is a sequence \(\delta _n > 0\) converging to zero such that for each \(1 \le i \le n\) and each \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\),

$$\begin{aligned}&{\mathbb {E}}^{\mathbb {P}}\left[ \ell _{{\mathcal {X}},p'}^{p'}({\widehat{\mu }} [(\overline{\Lambda }^{n,-i},\beta )],\mu ) + \left\| X^i[(\overline{\Lambda }^{n,-i},\beta )] - Y^i[\beta ]\right\| _T^{p'}\right] \\&\quad \le \delta _n\left( 1 + {\mathbb {E}}^{\mathbb {P}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) . \end{aligned}$$

Proof

The proof is similar to that of Lemma 5.6, and we work again with the truncated \(p'\)-Wasserstein distances \(\ell _t\) on \({\mathcal {C}}^d\) defined in (5.6). Throughout this proof, n and i are fixed, and expectations are all with respect to \({\mathbb {P}}\). Abbreviate \(\overline{X}^k = X^k[(\overline{\Lambda }^{n,-i},\beta )]\) and \({\widehat{\mu }} = {\widehat{\mu }}[(\overline{\Lambda }^{n,-i},\beta )]\) throughout. Define \(\overline{Y}^i := Y^i[\beta ]\) and \(\overline{Y}^k := Y^k\) for \(k \ne i\). As in the proof of Lemma 5.6, we use the Burkholder–Davis–Gundy inequality followed by Gronwall’s inequality to find a constant \(C_1 > 0\), depending only on \(c_1\), \(p'\), and T, such that

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \overline{X}^k - \overline{Y}^k\Vert ^{p'}_t\right] \le C_1{\mathbb {E}}\int _0^t\ell ^{p'}_s({\widehat{\mu }}^x,\mu ^x)ds, \quad \text {for } 1 \le k \le n. \end{aligned}$$
(6.2)

Define \({\widehat{\nu }}^{n,i} = {\widehat{\nu }}^{n,i}[\beta ]\) as in Lemma 6.3, and write \({\widehat{\nu }}^{n,i,x} := ({\widehat{\nu }}^{n,i})^x\) for the empirical distribution of \((\overline{Y}^1,\ldots ,\overline{Y}^n)\). Use (6.2) and the triangle inequality to get

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}\left[ \Vert \overline{X}^k - \overline{Y}^k\Vert ^{p'}_t\right]&\le 2^{p'-1}C_1{\mathbb {E}}\int _0^t\left( \ell ^{p'}_s({\widehat{\mu }}^x, {\widehat{\nu }}^{n,i,x}) + \ell ^{p'}_s({\widehat{\nu }}^{n,i,x},\mu ^x) \right) ds \\&\le 2^{p'-1}C_1{\mathbb {E}}\int _0^t\left( \frac{1}{n}\sum _{k=1}^n \Vert \overline{X}^k\! -\! \overline{Y}^k\Vert ^{p'}_s \!+\! \ell ^{p'}_s({\widehat{\nu }}^{n,i,x},\mu ^x) \right) ds \end{aligned}$$

By Gronwall’s inequality and Lemma 6.3, with \(C_2 := 2^{p'-1}C_1e^{2^{p'-1}C_1T}\) we have

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}\left[ \Vert \overline{X}^k - \overline{Y}^k\Vert ^{p'}_t\right]&\le C_2{\mathbb {E}}\int _0^t\ell ^{p'}_s({\widehat{\nu }}^{n,i,x},\mu ^x)ds \le C_2 T{\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^{n,i},\mu )\right] \nonumber \\&\le C_2 T\delta _n\left( 1 + {\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) . \end{aligned}$$
(6.3)

The obvious coupling yields the inequality

$$\begin{aligned} \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\mu }},{\widehat{\nu }}^{n,i}) \le \frac{1}{n}\sum _{k=1}^n\Vert \overline{X}^k - \overline{Y}^k\Vert ^{p'}_T, \end{aligned}$$

and then the triangle inequality implies

$$\begin{aligned} {\mathbb {E}}\left[ \ell _{{\mathcal {X}},p'}^{p'}({\widehat{\mu }},\mu )\right] \le 2^{p'-1}\frac{1}{n}\sum _{k=1}^n{\mathbb {E}}\left[ \Vert \overline{X}^k - \overline{Y}^k\Vert ^{p'}_T\right] + 2^{p'-1}{\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}({\widehat{\nu }}^{n,i},\mu )\right] . \end{aligned}$$

Conclude from Lemma 6.3 and (6.3). \(\square \)

6.3 Proof of Theorem 6.1

With Lemma 6.4 in hand, we begin the proof of Theorem 6.1. The convergence \(P_n \rightarrow P\) follows immediately from Lemma 6.4, and it remains only to check that \(\overline{\Lambda }^n\) is a relaxed \((\epsilon _n,\ldots ,\epsilon _n)\)-Nash equilibrium for some \(\epsilon _n \rightarrow 0\). Define

$$\begin{aligned} \epsilon _n&:= \max _{i=1}^n\left[ \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_i((\overline{\Lambda }^{n,-i},\beta )) - J_i(\overline{\Lambda }^n)\right] \\&= \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_1((\overline{\Lambda }^{n,-1},\beta )) - J_1(\overline{\Lambda }^n), \end{aligned}$$

where the second equality follows from exchangeability, or more precisely from the fact that (using the notation of Remark 2.7) the measure

$$\begin{aligned} {\mathbb {P}}\circ \left( \xi _\pi ,B,W_\pi ,{\widehat{\mu }} [\overline{\Lambda }^n_\pi ], \overline{\Lambda }^n_\pi ,X[\overline{\Lambda }^n_\pi ]_\pi \right) ^{-1} \end{aligned}$$

does not depend on the choice of permutation \(\pi \). Recall that \(P \in {\mathcal {P}}(\Omega )\) was the given MFG solution, and define \(\rho := P \circ (\xi ,B,W,\mu )^{-1}\) so that \(P \in {\mathcal {R}}{\mathcal {A}}^{*}(\rho )\). For each n, find \(\beta ^n \in {\mathcal {A}}_n({\mathcal {E}}_n)\) such that

$$\begin{aligned} J_1((\overline{\Lambda }^{n,-1},\beta ^n)) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_1((\overline{\Lambda }^{n,-1},\beta )) - 1/n. \end{aligned}$$
(6.4)

To complete the proof, it suffices to prove the following:

$$\begin{aligned}&\displaystyle \lim _{n\rightarrow \infty }J_1(\overline{\Lambda }^n) = {\mathbb {E}}^{\mathbb {P}}\left[ \Gamma (\mu ^x,\Lambda ^1,Y^1)\right] , \end{aligned}$$
(6.5)
$$\begin{aligned}&\displaystyle \lim _{n\rightarrow \infty }\left| {\mathbb {E}}^{\mathbb {P}}\left[ \Gamma ({\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta ^n)],\beta ^n, X^1[(\overline{\Lambda }^{n,-1},\beta ^n)]) - \Gamma (\mu ^x,\beta ^n,Y^1[\beta ^n])\right] \right| = 0. \end{aligned}$$
(6.6)

Indeed, note that \({\mathbb {P}}\circ (\xi ^1,B,W^1,\mu ,\Lambda ^1,Y^1)^{-1} = P\) holds by construction. Since

$$\begin{aligned} P'_n := {\mathbb {P}}\circ \left( \xi ^1,B,W^1,\mu ,\beta ^n,Y^1[\beta ^n]\right) ^{-1} \end{aligned}$$

is in \({\mathcal {R}}{\mathcal {A}}(\rho )\) for each n, and since P is in \({\mathcal {R}}{\mathcal {A}}^{*}(\rho )\), we have

$$\begin{aligned} {\mathbb {E}}^{\mathbb {P}}\left[ \Gamma (\mu ^x,\Lambda ^1,Y^1)\right] = J(P) \ge J(P'_n) = {\mathbb {E}}^{\mathbb {P}}\left[ \Gamma (\mu ^x,\beta ^n,Y^1[\beta ^n])\right] , \quad \text {for all } n. \end{aligned}$$

Thus, from (6.5) and (6.6) it follows that

$$\begin{aligned} \lim _{n\rightarrow \infty }J_1(\overline{\Lambda }^n)&\ge \limsup _{n\rightarrow \infty }{\mathbb {E}}^{\mathbb {P}}\left[ \Gamma (\mu ^x,\beta ^n,Y^1[\beta ^n])\right] \\&= \limsup _{n\rightarrow \infty }J_1((\overline{\Lambda }^{n,-1}, \beta ^n)) \\&= \limsup _{n\rightarrow \infty }\sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_1 ((\overline{\Lambda }^{n,-1},\beta )), \end{aligned}$$

where of course in the last step we have used (6.4). Since \(\epsilon _n \ge 0\), this shows \(\epsilon _n \rightarrow 0\).

Proof of (6.5): First, apply Lemma 6.4 with \(\beta = \Lambda ^1\) (so that \((\overline{\Lambda }^{n,-1},\beta ) = \overline{\Lambda }^n\)) to get

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {P}}\circ \left( Y^1_0,B,W^1,{\widehat{\mu }}[\overline{\Lambda }^n],\Lambda ^1, X^1[\overline{\Lambda }^n]\right) ^{-1}&= {\mathbb {P}}\circ \left( Y^1_0,B,W^1,\mu ,\Lambda ^1,Y^1\right) ^{-1}, \end{aligned}$$

where the limit is taken in \({\mathcal {P}}^p(\Omega )\). Moreover, since \({\mathbb {E}}^{\mathbb {P}}\int _0^T\int _A|a|^{p'}\Lambda ^1_t(da)dt < \infty \), we use the continuity of J of Lemma 4.5 (since the additional uniform integrability condition holds trivially) to conclude that

$$\begin{aligned} \lim _{n\rightarrow \infty }J_1(\overline{\Lambda }^n)&= \lim _{n\rightarrow \infty }{\mathbb {E}}^{\mathbb {P}}\left[ \Gamma ({\widehat{\mu }}^x[\overline{\Lambda }^n], \Lambda ^1,X^1[\overline{\Lambda }^n])\right] = {\mathbb {E}}^{\mathbb {P}}\left[ \Gamma (\mu ^x,\Lambda ^1,Y^1)\right] . \end{aligned}$$

Proof of (6.6): This part of the proof is fairly involved and is thus divided into several steps. The first two steps establish a relative compactness property for the laws of the empirical measure and state process pairs, which is crucial for the third and fourth steps below. Step (3) focuses on the g term, and Step (4) uses the additional Assumption B to deal with the f term.

Proof of (6.6), Step (1): We show first that

$$\begin{aligned} \sup _n{\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta ^n_t(da)dt < \infty . \end{aligned}$$
(6.7)

By (6.4) and Lemma 5.2(2), we have

$$\begin{aligned}&{\mathbb {E}}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\beta ^n_t(da)dt \\&\quad \le c_7{\mathbb {E}}\left[ 1 + \frac{1}{n} + |\xi ^1|^p + \frac{1}{n}\sum _{i=2}^n\int _0^T\int _A|a|^p\Lambda ^i_t(da)dt\right] \\&\quad = c_7{\mathbb {E}}\left[ 1 + \frac{1}{n} + |\xi ^1|^p + \frac{n-1}{n}\int _0^T\int _A|a|^p\Lambda ^1_t(da)dt\right] , \end{aligned}$$

where the second line follows from symmetry. Since \({\mathbb {E}}[|\xi ^1|^p] < \infty \) and \({\mathbb {E}}\int _0^T\int _A|a|^p\Lambda ^1_t(da)dt < \infty \), we have proven (6.7).

Proof of (6.6), Step (2): Define \({\mathcal {A}}_R\) for \(R > 0\) to be the set of \((\overline{{\mathcal {F}}}^\infty _t)_{t \in [0,T]}\)-progressive \({\mathcal {P}}(A)\)-valued processes \(\beta \) such that

$$\begin{aligned} {\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt \le R. \end{aligned}$$

According to (6.7), there exists \(R > 0\) such that \(\beta ^n \in {\mathcal {A}}_R\) for all n. Define also

$$\begin{aligned} S_R := \left\{ {\mathbb {P}}\circ \left( {\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta )], X^1[(\overline{\Lambda }^{n,-1},\beta )]\right) ^{-1} : n \ge 1, \beta \in {\mathcal {A}}_R \right\} . \end{aligned}$$

We show next that \(S_R\) is relatively compact in \({\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d)\). Note first that it follows from Lemma 5.1 that

$$\begin{aligned} \sup \left\{ {\mathbb {E}}^{\mathbb {P}}\int _{{\mathcal {C}}^d}\Vert z\Vert _T^{p'}{\widehat{\mu }}^x [(\overline{\Lambda }^{n,-1},\beta )](dz) : n \ge 1, \ \beta \in {\mathcal {A}}_R\right\} < \infty . \end{aligned}$$
(6.8)

By symmetry, we have

$$\begin{aligned}&\left\{ {\mathbb {P}}\circ (X^1[(\overline{\Lambda }^{n,-1},\beta )])^{-1} : n \ge 1, \ \beta \in {\mathcal {A}}_R \right\} \\&\quad = \left\{ \frac{1}{n}\sum _{k=1}^n{\mathbb {P}}\circ (X^k[(\overline{\Lambda }^{n,-k},\beta )])^{-1} : n \ge 1, \ \beta \in {\mathcal {A}}_R \right\} , \end{aligned}$$

and by Proposition 5.3 this set is relatively compact in \({\mathcal {P}}^p({\mathcal {C}}^d)\). For \(\beta \in {\mathcal {A}}_R\), the mean measure of \({\mathbb {P}}\circ ({\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta )])^{-1}\) is exactly

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^n{\mathbb {P}}\circ (X^k[(\overline{\Lambda }^{n,-1},\beta )])^{-1}, \end{aligned}$$

and it follows again from Proposition 5.3 that the family

$$\begin{aligned} \left\{ \frac{1}{n}\sum _{k=1}^n{\mathbb {P}}\circ (X^k[(\overline{\Lambda }^{n,-1},\beta )])^{-1} : n \ge 1, \beta \in {\mathcal {A}}_R\right\} \end{aligned}$$

is relatively compact in \({\mathcal {P}}^p({\mathcal {C}}^d)\). From this and (6.8) we conclude that the family \(\{{\mathbb {P}}\circ ({\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta )])^{-1} : n \ge 1, \ \beta \in {\mathcal {A}}_R\}\) is relatively compact in \({\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {C}}^d))\). Hence, \(S_R\) is relatively compact. (See Corollary B.2 and Lemma A.2 of [29] regarding these last two conclusions.)

Proof of (6.6), Step (3): Since \(\beta ^n \in {\mathcal {A}}_R\) for each n, to prove (6.6) it suffices to show that

$$\begin{aligned} \sup _{\beta \in {\mathcal {A}}_R}I^\beta _n \rightarrow 0, \end{aligned}$$
(6.9)

where

$$\begin{aligned} I^\beta _n&:= {\mathbb {E}}\left[ \Gamma ({\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta )],\beta , X^1[(\overline{\Lambda }^{n,-1},\beta )]) - \Gamma (\mu ^x,\beta ,Y^1[\beta ])\right] \\&= {\mathbb {E}}\left[ \int _0^T\int _A\left( f(t,X^1_t[(\overline{\Lambda }^{n,-1}, \beta )],{\widehat{\mu }}^x_t[(\overline{\Lambda }^{n,-1},\beta )],a) \right. \right. \\&\quad \left. \left. - f(t,Y^1_t[\beta ],\mu ^x_t,a)\right) \beta _t(da)dt\right] \\&\quad + {\mathbb {E}}\left[ g(X^1_T[(\overline{\Lambda }^{n,-1},\beta )], {\widehat{\mu }}^x_T[(\overline{\Lambda }^{n,-1},\beta )]) - g(Y^1_T[\beta ],\mu ^x_T)\right] . \end{aligned}$$

We start with the g term. Define

$$\begin{aligned} Q_n^\beta&:= {\mathbb {P}}\circ ({\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1}, \beta )],X^1[(\overline{\Lambda }^{n,-1},\beta )])^{-1},\\ Q^\beta&:= {\mathbb {P}}\circ (\mu ^x,Y^1[\beta ])^{-1}. \end{aligned}$$

Using the metric on \({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d\) given by

$$\begin{aligned} ((\mu ,x),(\mu ',x')) \mapsto \left[ \ell ^p_{{\mathcal {C}}^d,p}(\mu ,\mu ') + \Vert x-x'\Vert _T^p\right] ^{1/p}, \end{aligned}$$

we define the p-Wasserstein metric \(\ell _{{\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d,p}\) on \({\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d)\). By Lemma 6.4, we have

$$\begin{aligned}&\ell _{{\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d,p}^{p'}(Q_n^\beta ,Q^\beta ) \\&\quad \le {\mathbb {E}}\left[ \ell ^p_{{\mathcal {C}}^d,p} \left( {\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1},\beta )],\mu ^x\right) + \Vert X^1[(\overline{\Lambda }^{n,-1},\beta )] - Y^1[\beta ]\Vert _T^p\right] ^{p'/p} \\&\quad \le 2^{p'/p-1}{\mathbb {E}}\left[ \ell ^{p'}_{{\mathcal {X}},p'}\left( {\widehat{\mu }} [(\overline{\Lambda }^{n,-1},\beta )],\mu \right) + \Vert X^1[(\overline{\Lambda }^{n,-1},\beta )] - Y^1[\beta ]\Vert _T^{p'}\right] \\&\quad \le 2^{p'/p-1}\delta _n(1 + R), \end{aligned}$$

and thus \(Q_n^\beta \rightarrow Q^\beta \) in \({\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d)\), uniformly in \(\beta \in {\mathcal {A}}_R\). The function

$$\begin{aligned} {\mathcal {P}}^p({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d) \ni Q \mapsto \int Q(d\nu ,dx)g(x_T,\nu _T) \end{aligned}$$

is continuous, and so its restriction to the closure of \(S_R\) is uniformly continuous. Thus, since \(\{Q_n^\beta : n \ge 1, \ \beta \in {\mathcal {A}}_R\} \subset S_R\),

$$\begin{aligned} \lim _{n\rightarrow \infty }\sup _{\beta \in {\mathcal {A}}_R}\left| {\mathbb {E}}\left[ g(X^1_T[(\overline{\Lambda }^{n,-1},\beta )], {\widehat{\mu }}^x_T[(\overline{\Lambda }^{n,-1},\beta )]) - g(Y^1_T[\beta ],\mu ^x_T)\right] \right| = 0. \end{aligned}$$

Proof of (6.6), Step (4): To deal with the f term in \(I^\beta _n\) it will be useful to define \(G : ({\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d)^2 \rightarrow {\mathbb {R}}\) by

$$\begin{aligned} G\left( (\mu ^1,x^1),(\mu ^2,x^2)\right) := \int _0^T\sup _{a \in A}\left| f(t,x^1_t,\mu ^1_t,a) - f(t,x^2_t,\mu ^2_t,a)\right| dt. \end{aligned}$$

With the g term taken care of in Step (3) above, the proof of (6.9) and thus the theorem will be complete if we show that

$$\begin{aligned} 0&= \lim _{n\rightarrow \infty }\sup _{\beta \in {\mathcal {A}}_R}{\mathbb {E}}\left[ Z^n_\beta \right] , \quad \text {where} \nonumber \\ Z^n_\beta&:= G\left( ({\widehat{\mu }}^x [(\overline{\Lambda }^{n,-1}, \beta )],X^1[(\overline{\Lambda }^{n,-1},\beta )]), (\mu ^x,Y^1[\beta ])\right) . \end{aligned}$$
(6.10)

Fix \(\eta > 0\), and note that by relative compactness of \(S_R\) we may find (e.g. by [35, Theorem 7.12]) a compact set \(K \subset {\mathcal {P}}^p({\mathcal {C}}^d) \times {\mathcal {C}}^d\) such that, if the event \(K_\beta \) is defined by

$$\begin{aligned} K_\beta := \left\{ \left( {\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1}, \beta )],X^1[(\overline{\Lambda }^{n,-1},\beta )]\right) \in K\right\} , \end{aligned}$$

then

$$\begin{aligned} {\mathbb {E}}\left[ \left( 1 + \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_T{\widehat{\mu }}^x[(\overline{\Lambda }^{n,-1}, \beta )](dz) + \Vert X^1[(\overline{\Lambda }^{n,-1}, \beta )]\Vert _T^p\right) 1_{K^c_\beta }\right] \le \eta , \end{aligned}$$

for all \(n \ge 1\) and \(\beta \in {\mathcal {A}}_R\). Sending \(n \rightarrow \infty \), it follows from Lemma 6.4 that also

$$\begin{aligned} {\mathbb {E}}\left[ \left( 1 + \int _{{\mathcal {C}}^d}\Vert z\Vert ^p_T\mu ^x(dz) + \Vert Y^1[\beta ]\Vert _T^p\right) 1_{K^c_\beta }\right] \le \eta . \end{aligned}$$

Hence, the growth condition of Assumption B implies

$$\begin{aligned} {\mathbb {E}}\left[ 1_{K^c_\beta }Z^n_\beta \right] \le c_4\eta , \end{aligned}$$
(6.11)

for all \(n \ge 1\) and \(\beta \in {\mathcal {A}}_R\). Assumption B implies that G is continuous, and thus uniformly continuous on \(K \times K\). We will check next that \({\mathbb {E}}[1_{K_\beta }Z^n_\beta ]\) converges to zero, uniformly in \(\beta \in {\mathcal {A}}_R\). Indeed, by uniform continuity there exists \(\eta _0 > 0\) such that if \((\mu ^1,x^1),(\mu ^2,x^2) \in K\) and \(G((\mu ^1,x^1),(\mu ^2,x^2)) > \eta \) then \(\Vert x^1-x^2\Vert _T + \ell _{{\mathcal {C}}^d,p}(\mu ^1,\mu ^2) > \eta _0\). Thus, since G is bounded on \(K \times K\), say by \(C > 0\), we use Markov’s inequality and Lemma 6.4 to conclude that

$$\begin{aligned}&{\mathbb {E}}\left[ 1_{K_\beta }Z^n_\beta \right] \le \eta + C{\mathbb {P}}\left\{ \left\| X^1[(\overline{\Lambda }^{n,-1},\beta )] - Y^1[\beta ]\right\| _T \right. \\&\quad \quad \left. + \,\ell _{{\mathcal {C}}^d,p} \left( {\widehat{\mu }}^x [(\overline{\Lambda }^{n,-1},\beta )],\mu ^x\right) > \eta _0\right\} \\&\quad \le \eta + 2^{p'-1}C \eta _0^{-p'} {\mathbb {E}}\left[ \left\| X^1 [(\overline{\Lambda }^{n,-1}, \beta )] - Y^1[\beta ]\right\| _T^{p'} + \ell _{{\mathcal {C}}^d,p}^{p'}\left( {\widehat{\mu }}^x [(\overline{\Lambda }^{n,-1},\beta )],\mu ^x\right) \right] \\&\quad \le \eta + 2^{p'-1}C\eta _0^{-p'}\delta _n\left( 1 + {\mathbb {E}}\int _0^T\int _A|a|^{p'}\beta _t(da)dt\right) \\&\quad \le \eta + 2^{p'-1}C\eta _0^{-p'}\delta _n(1+R), \end{aligned}$$

whenever \(\beta \in {\mathcal {A}}_R\), where \(\delta _n \rightarrow 0\) is from Lemma 6.4. Combining this with (6.11), we get

$$\begin{aligned} \limsup _{n\rightarrow \infty }\sup _{\beta \in {\mathcal {A}}_R}{\mathbb {E}}\left[ Z^n_\beta \right] \le (1+c_4)\eta . \end{aligned}$$

This holds for each \(\eta > 0\), completing the proof of (6.10) and thus of the theorem. \(\square \)

6.4 Proof of Proposition 6.2

Throughout the section, the number of agents n is fixed, and we work on the n-player environment \({\mathcal {E}}_n\) specified in Sect. 6.1. The proof of Proposition 6.2 is split into two main steps. In the first step, we approximate the relaxed strategy \(\Lambda ^0\) by bounded strong strategies, and we check the convergences (1) and (2) claimed in Proposition 6.2. The second step verifies the somewhat more subtle inequality (3) of Proposition 6.2. First, we need a lemma somewhat complementary to Lemma 4.7; this is a form of the well-known Chattering lemma [25, Theorem 2.2(b)].

Lemma 6.5

Suppose \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) is a filtered probability space supporting an \(({\mathcal {F}}_t)_{t \in [0,T]}\)-Wiener process \({\widetilde{W}}\) (of any dimension), an \({\mathcal {F}}_0\)-measurable random variable \({\widetilde{\xi }}\) taking values in some Euclidean space, and a progressively measurable \({\mathcal {P}}(A)\)-valued process \(({\widetilde{\Lambda }}_t)_{t \in [0,T]}\) satisfying \({\mathbb {E}}^P\int _0^T\int _A|a|^{p'}{\widetilde{\Lambda }}_t(da)dt < \infty \). Then, setting \({\mathcal {G}}_t := \sigma ({\widetilde{\xi }},{\widetilde{W}}_s : s \le t)\), there exists a sequence \((\alpha ^k)_{k=1}^\infty \) of \(({\mathcal {G}}_t)_{t \in [0,T]}\)-progressively measurable A-valued processes such that

$$\begin{aligned} \lim _{k\rightarrow \infty } P \circ \left( {\widetilde{\xi }}, {\widetilde{W}},dt\delta _{\alpha ^k_t}(da) \right) ^{-1} = P \circ \left( {\widetilde{\xi }},{\widetilde{W}},dt{\widetilde{\Lambda }}_t (da)\right) ^{-1}, \end{aligned}$$

and

$$\begin{aligned} \lim _{r\rightarrow \infty }\sup _k{\mathbb {E}}^P\left[ \int _0^T|\alpha ^k_t|^{p'}1_{\{|\alpha ^k_t| > r\}}dt\right] = 0. \end{aligned}$$

Proof

We may reduce to the case of compact A as follows: For sufficiently large n (so that the intersection of A and the centered ball of radius n is nonempty), let \(\iota _n : A \rightarrow A\) be any measurable map satisfying \(\iota _n(a) = a\) for \(|a| \le n\) and \(|\iota _n(a)| \le n\) for all \(a \in A\). Then \(\iota _n\) converges pointwise to the identity map, and the \({\mathcal {P}}(A)\)-valued \(({\mathcal {F}}_t)_{t \in [0,T]}\)-adapted processes \({\widetilde{\Lambda }}^n_t := {\widetilde{\Lambda }}_t \circ \iota _n^{-1}\) converge pointwise to \({\widetilde{\Lambda }}\). It suffices now to prove the claim under the additional assumption that A is compact.
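For instance, fixing any \(a_0 \in A\) with \(|a_0| \le n\) (such a point exists by the choice of n), one admissible choice of truncation map is

$$\begin{aligned} \iota _n(a) := {\left\{ \begin{array}{ll} a &{}\quad \text {if } |a| \le n, \\ a_0 &{}\quad \text {if } |a| > n, \end{array}\right. } \end{aligned}$$

which is measurable, agrees with the identity on the centered ball of radius n, and converges pointwise to the identity as \(n \rightarrow \infty \).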

The Chattering lemma [25, Theorem 2.2(b)] states that the claim is true if we require only that \(\alpha ^k\) be adapted to \(({\mathcal {F}}_t)_{t \in [0,T]}\) rather than to \(({\mathcal {G}}_t)_{t \in [0,T]}\). The rest of the proof proceeds exactly as in the proof of Lemma 3.11 in [11]; the only difference is that we must apply the first part of [11, Proposition C.1] rather than the second, since we do not assume A is convex. (Note that our \(\alpha ^k\) need not depend continuously on \(({\widetilde{\xi }},{\widetilde{W}})\).)

\(\square \)

Before we prove Proposition 6.2, we need the following lemma, which is a simple variant of a standard result:

Lemma 6.6

Suppose \({\widetilde{\Lambda }}^k = ({\widetilde{\Lambda }}^{k,1},\ldots ,{\widetilde{\Lambda }}^{k,n}) \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\) is such that

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}\circ (\xi ,B,W,{\widetilde{\Lambda }}^k)^{-1} = {\mathbb {P}}\circ (\xi ,B,W,\Lambda ^0)^{-1}, \end{aligned}$$

with the limit taken in \({\mathcal {P}}^p(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n)\). Then

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}\circ \left( B,W,{\widetilde{\Lambda }}^k,X[{\widetilde{\Lambda }}^k]\right) ^{-1} = {\mathbb {P}}\circ \left( B,W,\Lambda ^0,X[\Lambda ^0]\right) ^{-1}, \end{aligned}$$

in \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\).

Proof

This is analogous to the proof of Lemma 4.4, given in [11], which is itself an instance of a standard method for proving weak convergence of SDE solutions, so we only sketch the argument. It can be shown as in Proposition 5.3 that \(\{{\mathbb {P}}\circ (X[{\widetilde{\Lambda }}^k])^{-1} : k \ge 1\}\) is relatively compact in \({\mathcal {P}}^p(({\mathcal {C}}^d)^n)\), and thus \(\{{\mathbb {P}}\circ (B,W,{\widetilde{\Lambda }}^k,X[{\widetilde{\Lambda }}^k])^{-1} : k \ge 1\}\) is relatively compact in \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\). Using the results of Kurtz and Protter [28], it is straightforward to check that under any limit point the canonical processes satisfy a certain SDE, and the claimed convergence then follows from uniqueness in law of the SDE solution. \(\square \)

We are now ready to prove Proposition 6.2.

Step 1: Define \(\overline{{\mathcal {V}}}\) analogously to \({\mathcal {V}}\), but with A replaced by \(A^n\). That is, \(\overline{{\mathcal {V}}}\) is the set of measures q on \([0,T] \times A^n\) with first marginal equal to Lebesgue measure and with

$$\begin{aligned} \int _{[0,T] \times A^n}\sum _{i=1}^n|a_i|^pq(dt,da_1,\ldots ,da_n) < \infty . \end{aligned}$$

Endow \(\overline{{\mathcal {V}}}\) with the p-Wasserstein metric. Define

$$\begin{aligned} \overline{\Lambda }^0_t(da_1,\ldots ,da_n) := \prod _{i=1}^n\Lambda ^{0,i}_t(da_i), \end{aligned}$$

and identify this \({\mathcal {P}}(A^n)\)-valued process with the random element \(\overline{\Lambda }^0 := dt\overline{\Lambda }^0_t(da)\) of \(\overline{{\mathcal {V}}}\). By Lemma 6.5, with A replaced by \(A^n\), there exists a sequence of bounded \(A^n\)-valued processes \(\alpha ^k = (\alpha ^{k,1},\ldots ,\alpha ^{k,n})\) such that, if we define

$$\begin{aligned} \overline{\Lambda }^k := dt\delta _{\alpha ^{k}_t}(da_1,\ldots ,da_n) = dt\prod _{i=1}^n\delta _{\alpha ^{k,i}_t}(da_i), \end{aligned}$$

then we have

$$\begin{aligned} \lim _{r\rightarrow \infty }\sup _k{\mathbb {E}}^{\mathbb {P}}\left[ \int _0^T| \alpha ^k_t|^{p'}1_{\{|\alpha ^k_t| > r\}}dt \right] = 0 \end{aligned}$$
(6.12)

and

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}\circ \left( \xi ,B,W,\overline{\Lambda }^{k}\right) ^{-1} = {\mathbb {P}}\circ \left( \xi ,B,W,\overline{\Lambda }^0\right) ^{-1}, \end{aligned}$$

in \({\mathcal {P}}^{p}(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times \overline{{\mathcal {V}}})\). Defining \(\pi _i : [0,T] \times A^n \rightarrow [0,T] \times A\) by \(\pi _i(t,a_1,\ldots ,a_n) := (t,a_i)\), we note that the map \(\overline{{\mathcal {V}}} \ni q \mapsto q \circ \pi _i^{-1} \in {\mathcal {V}}\) is continuous (a short justification is given at the end of this step). Define \(\Lambda ^{k,i}_t := \delta _{\alpha ^{k,i}_t}\) and \(\Lambda ^{k} = (\Lambda ^{k,1},\ldots ,\Lambda ^{k,n})\), and conclude that

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}\circ \left( \xi ,B,W,\Lambda ^{k}\right) ^{-1} = {\mathbb {P}}\circ \left( \xi ,B,W,\Lambda ^0\right) ^{-1}, \end{aligned}$$

in \({\mathcal {P}}^{p}(({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n)\). By Lemma 6.6,

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}\circ \left( B,W,\Lambda ^{k},X[\Lambda ^{k}]\right) ^{-1} = {\mathbb {P}}\circ \left( B,W,\Lambda ^0,X[\Lambda ^0]\right) ^{-1}, \end{aligned}$$

in \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\). It then follows from the uniform integrability (6.12) and the continuity of J established in Lemma 4.5 that

$$\begin{aligned} \lim _{k\rightarrow \infty }J_i(\Lambda ^{k}) = J_i(\Lambda ^0), \quad i=1,\ldots ,n. \end{aligned}$$

This verifies (1) and (2) of Proposition 6.2.
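We remark on the continuity, used above, of the projection \(\overline{{\mathcal {V}}} \ni q \mapsto q \circ \pi _i^{-1} \in {\mathcal {V}}\). Writing \(\ell _{\overline{{\mathcal {V}}},p}\) and \(\ell _{{\mathcal {V}},p}\) for the respective p-Wasserstein metrics (notation by analogy with \(\ell _{{\mathcal {C}}^d,p}\)), note that \(\pi _i\) is 1-Lipschitz for the natural product metrics, so the image under \(\pi _i \times \pi _i\) of any coupling of \(q,q' \in \overline{{\mathcal {V}}}\) is a coupling of \(q \circ \pi _i^{-1}\) and \(q' \circ \pi _i^{-1}\); optimizing over couplings gives

$$\begin{aligned} \ell _{{\mathcal {V}},p}\left( q \circ \pi _i^{-1},q' \circ \pi _i^{-1}\right) \le \ell _{\overline{{\mathcal {V}}},p}(q,q'). \end{aligned}$$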

Step 2: It remains to justify the inequality (3) of Proposition 6.2. We prove this only for \(i=1\), since the cases \(i=2,\ldots ,n\) are identical. For each k, find \(\beta ^k \in {\mathcal {A}}_n({\mathcal {E}}_n)\) such that

$$\begin{aligned} J_1((\Lambda ^{k,-1},\beta ^k)) \ge \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_1((\Lambda ^{k,-1},\beta )) - \frac{1}{k}. \end{aligned}$$
(6.13)

First, use Lemma 5.2(2) to get

$$\begin{aligned}&{\mathbb {E}}\int _0^T\int _A(|a|^{p'} - c_6|a|^p)\beta ^k_t(da)dt \\&\quad \le c_7{\mathbb {E}}\left[ 1 + \frac{1}{k} + |\xi ^1|^p + \frac{1}{n}\sum _{i=2}^n\int _0^T\int _A|a|^p\Lambda ^{k,i}_t(da)dt\right] . \end{aligned}$$

Since \({\mathbb {E}}[|\xi ^1|^p] < \infty \), and since

$$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {E}}\int _0^T\int _A|a|^p\Lambda ^{k,i}_t(da)dt = {\mathbb {E}}\int _0^T\int _A|a|^p\Lambda ^{0,i}_t(da)dt < \infty , \end{aligned}$$

holds by construction for \(i=2,\ldots ,n\), it follows that

$$\begin{aligned} R := \sup _k {\mathbb {E}}^{{\mathbb {P}}}\int _0^T\int _A|a|^{p'}\beta ^k_t(da)dt < \infty . \end{aligned}$$
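To deduce this from the preceding two displays, one may argue as follows (a brief sketch). Set \(x_k := {\mathbb {E}}^{{\mathbb {P}}}\int _0^T\int _A|a|^{p'}\beta ^k_t(da)dt\), which is finite by the estimate from Lemma 5.2(2) together with the finiteness of \({\mathbb {E}}\int _0^T\int _A|a|^p\beta ^k_t(da)dt\) required of admissible strategies, and let \(C < \infty \) bound the right-hand side of that estimate uniformly in k, which is possible by the convergence just noted. Two applications of Jensen's inequality (recalling \(p' > p\)) give

$$\begin{aligned} {\mathbb {E}}\int _0^T\int _A|a|^p\beta ^k_t(da)dt \le T^{1-p/p'}x_k^{p/p'}, \end{aligned}$$

so that \(x_k \le c_6T^{1-p/p'}x_k^{p/p'} + C\). Since \(p/p' < 1\), this self-improving inequality forces \(\sup _k x_k < \infty \).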

It follows as in Proposition 5.3 (or more precisely [29, Proposition B.4]) that the set

$$\begin{aligned} \left\{ {\mathbb {P}}\circ \left( (\Lambda ^{k,-1},\beta ^k),X[(\Lambda ^{k,-1},\beta ^k)]\right) ^{-1} : k \ge 1\right\} \end{aligned}$$

is relatively compact in \({\mathcal {P}}^p({\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\). Hence, the set

$$\begin{aligned} \left\{ P_k := {\mathbb {P}}\circ \left( B,W,(\Lambda ^{k,-1},\beta ^k), X[(\Lambda ^{k,-1},\beta ^k)]\right) ^{-1} : k \ge 1\right\} \end{aligned}$$
(6.14)

is relatively compact in \({\mathcal {P}}^p({\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\) (e.g. by [29, Lemma A.2]). By Lemma 6.7 below, every limit point P of \((P_k)_{k=1}^\infty \) is of the form

$$\begin{aligned} P = {\mathbb {P}}\circ \left( B,W,(\Lambda ^{0,-1},\beta ), X[(\Lambda ^{0,-1},\beta )]\right) ^{-1}, \quad \text {for some } \beta \in {\mathcal {A}}_n({\mathcal {E}}_n). \end{aligned}$$
(6.15)

Together with the continuity of the reward functionals (Lemma 4.5) and the uniform moment bound R, this implies

$$\begin{aligned} \limsup _{k\rightarrow \infty }J_1((\Lambda ^{k,-1},\beta ^k)) \le \sup _{\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)}J_1((\Lambda ^{0,-1},\beta )). \end{aligned}$$

Combined with (6.13), this completes the proof of Proposition 6.2.

Lemma 6.7

Every limit point P of \((P_k)_{k=1}^\infty \) (defined in (6.14)) is of the form (6.15).

Proof

Let us abbreviate

$$\begin{aligned} \Omega ^{(n)} := {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n. \end{aligned}$$

Let \((B,W=(W^1,\ldots ,W^n), \Lambda =(\Lambda ^1,\ldots , \Lambda ^n),X=(X^1,\ldots ,X^n))\) denote the identity map on \(\Omega ^{(n)}\), and let \(({\mathcal {F}}^{(n)}_t)_{t \in [0,T]}\) denote the natural filtration,

$$\begin{aligned} {\mathcal {F}}^{(n)}_t = \sigma \left( (B_s,W_s,\Lambda ([0,s] \times C),X_s) : s \le t, \ C \in {\mathcal {B}}(A)\right) . \end{aligned}$$

Fix a limit point P of \((P_k)_{k=1}^\infty \). It is easily verified that P satisfies

$$\begin{aligned} P \circ \left( X_0,B,W,(\Lambda ^2,\ldots ,\Lambda ^n)\right) ^{-1}&= {\mathbb {P}}\circ \left( X_0,B,W,(\Lambda ^{0,2},\ldots ,\Lambda ^{0,n})\right) ^{-1}. \end{aligned}$$
(6.16)

Moreover, for each k, we know that B and W are independent \(({\mathcal {F}}^{(n)}_t)_{t \in [0,T]}\)-Wiener processes under \(P_k\), and thus the same is true under P. Note that \((B,W,(\Lambda ^{k,-1},\beta ^k),X[(\Lambda ^{k,-1},\beta ^k)])\) satisfies the state SDE under \({\mathbb {P}}\); equivalently, under \(P_k\) the canonical processes satisfy the SDE

$$\begin{aligned} {\left\{ \begin{array}{ll} dX^i_t = \int _Ab(t,X^i_t,{\widehat{\mu }}^x_t,a)\Lambda ^i_t(da)dt + \sigma (t,X^i_t,{\widehat{\mu }}^x_t)dW^i_t\\ \qquad + \sigma _0(t,X^i_t,{\widehat{\mu }}^x_t)dB_t, \ i=1,\ldots ,n \\ {\widehat{\mu }}^x_t= \frac{1}{n}\sum _{k=1}^n\delta _{X^k_t}. \end{array}\right. } \end{aligned}$$
(6.17)

The results of Kurtz and Protter [28] imply that this passes to the limit: the canonical processes on \(\Omega ^{(n)}\) satisfy the same SDE under P.
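In outline, and suppressing the moment estimates needed to justify each step: under each \(P_k\) the canonical processes satisfy, in integral form,

$$\begin{aligned} X^i_t - X^i_0 - \int _0^t\int _Ab(s,X^i_s,{\widehat{\mu }}^x_s,a)\Lambda ^i_s(da)ds = \int _0^t\sigma (s,X^i_s,{\widehat{\mu }}^x_s)dW^i_s + \int _0^t\sigma _0(s,X^i_s,{\widehat{\mu }}^x_s)dB_s. \end{aligned}$$

The left-hand side is, thanks to the continuity assumptions on b, a continuous functional of the canonical variables, while [28] guarantees that the stochastic integrals on the right-hand side behave continuously along the weakly convergent sequence \((P_k)\), the integrators B and \(W^i\) being Wiener processes under every \(P_k\).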

It remains only to show that there exists \(\beta \in {\mathcal {A}}_n({\mathcal {E}}_n)\) such that

$$\begin{aligned} {\mathbb {P}}\circ (X_0,B,W,(\Lambda ^{0,-1},\beta ))^{-1} = P \circ (X_0,B,W,\Lambda )^{-1}. \end{aligned}$$
(6.18)

Indeed, from uniqueness in law of the solution of the SDE (6.17) it will then follow that

$$\begin{aligned} {\mathbb {P}}\circ (B,W,(\Lambda ^{0,-1},\beta ),X[(\Lambda ^{0,-1},\beta )])^{-1} = P. \end{aligned}$$
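To elaborate briefly: under \({\mathbb {P}}\) the tuple \((B,W,(\Lambda ^{0,-1},\beta ),X[(\Lambda ^{0,-1},\beta )])\) solves (6.17), while under P the canonical processes do as well. Once (6.18) is known, uniqueness in law for (6.17) implies that the joint law of the inputs and the solution is determined by the law of the inputs, whence

$$\begin{aligned} {\mathbb {P}}\circ \left( X_0,B,W,(\Lambda ^{0,-1},\beta ),X[(\Lambda ^{0,-1},\beta )]\right) ^{-1} = P \circ (X_0,B,W,\Lambda ,X)^{-1}, \end{aligned}$$

and the previous display follows by discarding the \(X_0\) coordinate.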

The independent uniform random variable U built into \({\mathcal {E}}_n\) now finally comes into play. Using a well-known result from measure theory (e.g. [23, Theorem 6.10]), we may find a measurable function

$$\begin{aligned} \overline{\beta } = (\overline{\beta }^1,\ldots ,\overline{\beta }^n) : [0,1] \times ({\mathbb {R}}^d)^n \times {\mathcal {C}}^{m_0} \times ({\mathcal {C}}^m)^n \rightarrow {\mathcal {V}}^n \end{aligned}$$

such that

$$\begin{aligned} {\mathbb {P}}\circ \left( X_0,B,W,\overline{\beta }(U,X_0,B,W)\right) ^{-1} = P \circ (X_0,B,W,\Lambda )^{-1}. \end{aligned}$$
(6.19)
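The result in question is often called the transfer theorem. In the form needed here, it states that if \(\zeta \) and \(\Theta \) are random elements of Polish spaces, and if another probability space carries a random element \({\widetilde{\zeta }}\) with the same law as \(\zeta \) together with an independent uniform random variable U, then there exists a measurable function h such that \(({\widetilde{\zeta }},h(U,{\widetilde{\zeta }}))\) has the same law as \((\zeta ,\Theta )\). We apply it with \(\zeta = (X_0,B,W)\) and \(\Theta = \Lambda \) under P, which is legitimate because

$$\begin{aligned} {\mathbb {P}}\circ (X_0,B,W)^{-1} = P \circ (X_0,B,W)^{-1}, \end{aligned}$$

as one sees by discarding the last coordinate in (6.16).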

Since B and W are independent \(({\mathcal {F}}^{(n)}_t)_{t \in [0,T]}\)-Wiener processes under P, it follows that

$$\begin{aligned} (\overline{\beta }(U,X_0,B,W)_s)_{s \in [0,t]} \quad \text {and} \quad \sigma (B_s-B_t,W_s-W_t : s \in [t,T]) \end{aligned}$$

are independent under \({\mathbb {P}}\), for each \(t \in [0,T]\). Thus, \((\overline{\beta }(U,X_0,B,W)_t)_{t \in [0,T]}\) is progressively measurable with respect to the \({\mathbb {P}}\)-completion of the filtration \((\sigma (U,X_0,B_s,W_s : s \le t))_{t \in [0,T]}\). In particular, \((\overline{\beta }(U,X_0,B,W)_t)_{t \in [0,T]} \in {\mathcal {A}}_n^n({\mathcal {E}}_n)\), and \(\beta := (\overline{\beta }^1(U,X_0,B,W)_t)_{t \in [0,T]}\) is in \({\mathcal {A}}_n({\mathcal {E}}_n)\). Now note that (6.16) and (6.19) together imply

$$\begin{aligned}&{\mathbb {P}}\circ \left( X_0,B,W,\left( \overline{\beta }^2(U,X_0,B,W),\ldots , \overline{\beta }^n(U,X_0,B,W)\right) \right) ^{-1} \\&\quad = P \circ \left( X_0,B,W,(\Lambda ^2,\ldots ,\Lambda ^n)\right) ^{-1}. \end{aligned}$$

On the other hand, (6.19) implies that the conditional law under P of \(\Lambda ^1\) given \((X_0,B,W,\Lambda ^2,\ldots ,\Lambda ^n)\) is the same as the conditional law under \({\mathbb {P}}\) of \(\overline{\beta }^1(U,X_0,B,W)\) given

$$\begin{aligned} \left( X_0,B,W,\overline{\beta }^2(U,X_0,B,W), \ldots ,\overline{\beta }^n(U,X_0,B,W)\right) . \end{aligned}$$

Since a joint law is determined by the law of \((X_0,B,W,\Lambda ^2,\ldots ,\Lambda ^n)\) together with the conditional law of \(\Lambda ^1\) given these variables, the last two identities combine to prove (6.18). \(\square \)

7 Proof of Theorem 3.4

This section explains the proof of Theorem 3.4, which specializes the main results to the setting without common noise, essentially by means of the following simple observation. Note that although we assume \(\sigma _0\equiv 0\) throughout the section, a weak MFG solution retains the meaning of Definition 2.1, which remains distinct from the notion of a weak MFG solution without common noise given in Definition 3.1.

Lemma 7.1

If \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\) is a weak MFG solution, then \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\) \(\Lambda ,X)\) is a weak MFG solution without common noise. Conversely, if \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) is a weak MFG solution without common noise, then we may construct (by enlarging the probability space, if necessary) an \(m_0\)-dimensional Wiener process B independent of \((W,\mu ,\Lambda ,X)\) such that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\) is a weak MFG solution.

Proof

The only difficulty comes from the conditional independence required in condition (3) of both Definitions 2.1 and 3.1, and it is convenient here to reformulate the definitions slightly. Lemma 4.3 tells us that Definition 2.1 of a weak MFG solution is equivalent to an alternative definition, in which the conditional independence is omitted from condition (3) and is added to condition (5). To be precise, define the following conditions:

  (3.a)

    \((\Lambda _t)_{t \in [0,T]}\) is \(({\mathcal {F}}_t)_{t \in [0,T]}\)-progressively measurable with values in \({\mathcal {P}}(A)\) and

    $$\begin{aligned} {\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda _t(da)dt < \infty . \end{aligned}$$
  (5.a)

    Suppose \(({\widetilde{\Omega }}',({\mathcal {F}}'_t)_{t \in [0,T]},P')\) is another filtered probability space supporting \((B',W',\mu ',\Lambda ',X')\) satisfying (3.a), (1,2,4) of Definition 2.1, and \(P \circ (B,\mu )^{-1} = P' \circ (B',\mu ')^{-1}\), with \(\sigma (\Lambda '_s : s \le t)\) conditionally independent of \({\mathcal {F}}^{X'_0,B',W',\mu '}_T\) given \({\mathcal {F}}^{X'_0,B',W',\mu '}_t\), for each \(t \in [0,T]\). Then

    $$\begin{aligned} {\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ,X)] \ge {\mathbb {E}}^{P'}[\Gamma (\mu '^x,\Lambda ',X')]. \end{aligned}$$

Then, by Lemma 4.3, \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\) is a weak MFG solution if and only if it satisfies Definition 2.1 with conditions (3) and (5) replaced by (3.a) and (5.a). In fact, the same is true if (5.a) is replaced by

  (5′.a)

    If \((\Lambda '_t)_{t \in [0,T]}\) is \(({\mathcal {F}}^{X_0,B,W,\mu }_t)_{t \in [0,T]}\)-progressively measurable with values in \({\mathcal {P}}(A)\) and

    $$\begin{aligned} {\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda '_t(da)dt < \infty , \end{aligned}$$

    and if \(X'\) is the unique strong solution of

    $$\begin{aligned} dX'_t = \int _Ab(t,X'_t,\mu ^x_t,a)\Lambda '_t(da)dt + \sigma (t,X'_t,\mu ^x_t)dW_t, \ X'_0 = X_0, \end{aligned}$$
    (7.1)

    then \({\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ,X)] \ge {\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ',X')]\).

Indeed, this follows from the density of strong controls provided by Lemma 6.5. Analogously, for the setting without common noise, consider the following condition:

  (5′.b)

    If \((\Lambda '_t)_{t \in [0,T]}\) is \(({\mathcal {F}}^{X_0,W,\mu }_t)_{t \in [0,T]}\)-progressively measurable with values in \({\mathcal {P}}(A)\) and

    $$\begin{aligned} {\mathbb {E}}^P\int _0^T\int _A|a|^p\Lambda '_t(da)dt < \infty , \end{aligned}$$

    and if \(X'\) is the unique strong solution of (7.1), then \({\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ,X)] \ge {\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ',X')]\).

It is proven exactly as in Lemma 4.3 that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) is a weak MFG solution without common noise if and only if it satisfies Definition 3.1 with conditions (3) and (5) replaced by (3.a) and (5′.b). We are now ready to prove the lemma:

Suppose \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\) is a weak MFG solution. It is straightforward to check that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) satisfies condition (3.a) as well as (1,2,4) of Definition 3.1. Condition (5) of Definition 2.1 clearly implies condition (5′.b). Finally, \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ B,\mu )\) implies \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ \mu )\), which verifies the final condition (6) of Definition 3.1. Hence \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) is a weak MFG solution without common noise.

Conversely, let \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,W,\mu ,\Lambda ,X)\) be a weak MFG solution without common noise, and assume without loss of generality that \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P)\) supports an \(({\mathcal {F}}_t)_{t \in [0,T]}\)-Wiener process B of dimension \(m_0\) which is independent of \((W,\mu ,\Lambda ,X)\). Again, condition (3.a) as well as (1), (2), and (4) of Definition 2.1 clearly hold. The consistency condition \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ \mu )\) and the independence of B and \((W,\mu ,\Lambda ,X)\) imply \(\mu = P((W,\Lambda ,X) \in \cdot \ | \ B,\mu )\). Finally, to check (5′.a), note first that the independence of B and \((X_0,W,\mu )\) readily implies that \({\mathcal {F}}^{X_0,B,W,\mu }_t\) and \({\mathcal {F}}^{X_0,W,\mu }_T\) are conditionally independent given \({\mathcal {F}}^{X_0,W,\mu }_t\). Thus, if \((\Lambda '_t)_{t \in [0,T]}\) is \(({\mathcal {F}}^{X_0,B,W,\mu }_t)_{t \in [0,T]}\)-progressively measurable, then \(\sigma (\Lambda '_s : s \le t)\) is conditionally independent of \({\mathcal {F}}^{X_0,W,\mu }_T\) given \({\mathcal {F}}^{X_0,W,\mu }_t\), and condition (5) of Definition 3.1 implies that \({\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ,X)] \ge {\mathbb {E}}^P[\Gamma (\mu ^x,\Lambda ',X')]\), where \(X'\) is defined as in (7.1). This verifies (5′.a), and so \(({\widetilde{\Omega }},({\mathcal {F}}_t)_{t \in [0,T]},P,B,W,\mu ,\Lambda ,X)\) is a weak MFG solution. \(\square \)
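To expand on the conditional independence used in the last step (a routine verification): for bounded measurable \(\varphi \) and bounded \({\mathcal {F}}^{X_0,W,\mu }_t\)-measurable \(\eta \), the independence of B and \((X_0,W,\mu )\) gives

$$\begin{aligned} {\mathbb {E}}^P\left[ \varphi (B_{\cdot \wedge t})\eta \ \big | \ {\mathcal {F}}^{X_0,W,\mu }_T\right] = \eta \,{\mathbb {E}}^P\left[ \varphi (B_{\cdot \wedge t})\right] = {\mathbb {E}}^P\left[ \varphi (B_{\cdot \wedge t})\eta \ \big | \ {\mathcal {F}}^{X_0,W,\mu }_t\right] . \end{aligned}$$

A monotone class argument extends the resulting identity \({\mathbb {E}}^P[Z \,|\, {\mathcal {F}}^{X_0,W,\mu }_T] = {\mathbb {E}}^P[Z \,|\, {\mathcal {F}}^{X_0,W,\mu }_t]\) from such products Z to all bounded \({\mathcal {F}}^{X_0,B,W,\mu }_t\)-measurable Z, which is precisely the asserted conditional independence.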

Proof of Theorem 3.4

At this point, the proof is mostly straightforward. The first claim, regarding the adaptation of Theorem 2.6, follows immediately from Theorem 2.6 and the observation of Lemma 7.1. The second claim, about adapting Theorem 2.11, is not so immediate but requires nothing new. First, notice that Theorem 6.1 remains true if we replace “weak MFG solution” by “weak MFG solution without common noise,” and if we define \(P_n\) instead by (3.2); this is a consequence of Theorem 6.1 and Lemma 7.1. Then, we need only check that Proposition 6.2 remains true if we replace “strong” by “very strong,” and if we replace the conclusion (1) by

  (1′)

    In \({\mathcal {P}}^p(({\mathcal {C}}^m)^n \times {\mathcal {V}}^n \times ({\mathcal {C}}^d)^n)\),

    $$\begin{aligned} \lim _{k\rightarrow \infty }{\mathbb {P}}_n \circ \left( W,\Lambda ^k,X[\Lambda ^k]\right) ^{-1} = {\mathbb {P}}_n \circ \left( W,\Lambda ^0,X[\Lambda ^0]\right) ^{-1}. \end{aligned}$$

It is straightforward to check that the proof of Proposition 6.2 given in Sect. 6.4 translates mutatis mutandis to this new setting. \(\square \)