Reinforcement Learning for Mean Field Game

Stochastic games provide a framework for interactions among multiple agents and enable a myriad of applications. In these games, agents decide on actions simultaneously, the state of each agent moves to the next state, and each agent receives a reward. However, finding an equilibrium (if one exists) in this game is often difficult when the number of agents becomes large. This paper focuses on finding a mean-field equilibrium (MFE) in an action-coupled stochastic game setting in an episodic framework. It is assumed that the impact of the other agents can be captured by the empirical distribution of their actions. All agents know the action distribution and employ lower-myopic best response dynamics to choose the optimal oblivious strategy. This paper proposes a posterior sampling based approach for reinforcement learning in the mean-field game, where each agent samples a transition probability from its previous transitions. We show that the policy and action distributions converge to the optimal oblivious strategy and the limiting distribution, respectively, which together constitute an MFE.


I. INTRODUCTION
We are living in a world where multiple agents interact repeatedly over a common environment; for example, multiple robots interact to achieve a specific goal. Multi-agent reinforcement learning (MARL) refers to the problem of learning and planning in a sequential decision making system when the underlying system dynamics are unknown and may need to be learnt by trying different options and observing their effects. Learning in MARL is fundamentally different from the traditional single-agent reinforcement learning (RL) problem, since agents interact not only with the environment but also with each other. Thus, an agent trying to learn the underlying system dynamics has to account for the actions taken by the other agents. Changes in the policy (or action) of one agent affect the others, and vice versa.
One natural learning approach is to extend single-agent RL algorithms to MARL by assuming that the other agents' actions are independent. However, studies show that a smart agent which learns the joint actions of the others performs better than one which does not [7]. Hence, it is unrealistic to assume that an agent will not try to learn the joint actions of the others. When the agents are strategic, i.e., they only want to take actions which maximize their own utility (or value), Nash equilibrium is often employed as the equilibrium concept. The existing equilibrium-solving approaches can handle only a handful of agents, and only for certain restricted games (those admitting an adversarial equilibrium or a coordination equilibrium) [5]. The computational complexity of finding a Nash equilibrium at every stage game [4] prevents those approaches from being applied in games where the number of agents is large.
All authors contributed equally to the work. Contact author: V. Aggarwal, vaneet@purdue.edu

In this paper, we consider a MARL setting where a large number of agents co-exist. Similar to [10], we revert to the mean-field approach, where we assume that the Q-function of an agent is affected by the mean action of the others. The mean-field game drastically reduces the complexity, since an agent now only needs to consider the empirical distribution of the actions played by the other agents. Further, we consider an oblivious strategy [1], where each agent takes an action based only on its own state; an agent does not have to track the policy evolution of the other agents.
Unlike [10], we consider a generalized version of the game where the state can be different for different agents.
Further, we do not require the game to admit an adversarial equilibrium or a coordination equilibrium. We also do not need to track the actions and the realized Q-values of other agents, as was the case in [10].
Mean-field games arise in several domains. For example, a mean-field game is observed in a security game where a large number of agents make individual decisions about their own security, yet the resulting security depends on the decisions made by the other agents. In a network of computers, even if an agent invests heavily in building firewalls, its computer can still be breached if the other agents' computers are not secure. In this security game, each agent invests a certain amount to attain a security level; however, the appropriate investment level depends on the investments made by the other agents. If the number of agents is large, the game can be modeled as a mean-field game, since it is the average investment per agent that impacts the decision of an agent.
Another example of a mean-field game is demand response pricing in the smart grid. The utility company sets a price based on the average demand per household; hence, if at a certain time the average demand is high, an agent may want to reduce its consumption in order to decrease the price.
Unlike the standard literature on mean-field equilibria of stochastic games [1], [9], [2], we consider that the transition probabilities are unknown to the agents. Instead, each agent learns the underlying transition probability matrix using a readily implementable posterior sampling approach [3], [6]. All agents employ best response dynamics to choose the best response strategy which maximizes the discounted payoff for the remaining episode length. We show the asymptotic convergence of the policy and the action distribution to the optimal oblivious strategy and the limiting action distribution, respectively, and the converged point constitutes a mean-field equilibrium (MFE). We use the compactness of the state and action spaces for the convergence. We estimate the value function using update steps similar to those of the Expected Sarsa algorithm [8] and show that the iterates converge to the optimal value function of the true distribution.
The key contribution of the paper is a novel algorithm that is used by each agent in a multi-agent setting, which is shown to converge to a mean-field equilibrium. The proposed algorithm does not assume the knowledge of transition probabilities, and learns them using a posterior sampling approach.

A. Stochastic Game
An n-player stochastic game is formalized by the tuple M = {S, A, r^M, p^M, τ, ρ, γ}, where the parameters are defined as follows. The state of the i-th agent at time t is given by s_{i,t} ∈ S, where S is the state space. A(s) is defined as the set of feasible actions an agent can take in state s, and the action space is A = ∪_{s∈S} A(s).
If the state of agent i at time t is given by s_{i,t} = s and the action taken by each agent j is a_{j,t} = a_j, then the next state is s′ ∈ S with probability p^M(s′|s, a_1, . . . , a_n), which is assumed to be the same for all agents. r^M_i(s, a, a_{−i}) is the probability distribution over the reward realized by agent i when action a is selected in state s, where a_{−i} denotes the actions taken by all other agents. The reward distribution is assumed to be the same for each agent, with support [0, 1].
The constant γ ∈ [0, 1) is the discount factor. ρ is the initial state distribution.
We consider an episodic framework where the length of the time horizon is τ. The state space S, the action space A, and τ are deterministic and need not be learned by the agents. The game is played in episodes k = 0, 1, 2, . . ., each of length τ, and within each episode the game proceeds in discrete steps j = 1, 2, . . . , τ. The episodes begin at times t_k = (k − 1)τ + 1, k = 1, 2, 3, . . .. At each time t, the state of agent i is s_{i,t}; the agent selects an action a_{i,t}, observes a scalar reward r_{i,t}, and its state transitions to s_{i,t+1}. Let H_{i,t} = (s_{i,1}, a_{i,1}, r_{i,1}, . . . , s_{i,t−1}, a_{i,t−1}, r_{i,t−1}) denote the history of agent i till time t.

B. Problem Formulation
In a game with a large number of players, we might expect that the fluctuations of the agents' actions "average out". Since the effect of the other agents on a single agent's payoff is only via the actions of the population, it is intuitive that, as the number of agents increases, a single agent has a negligible effect on the game. This intuition is formalized as the mean field equilibrium.
Let α_{−i,t}(a_m) be the fraction of the agents (excluding agent i) that take action a_m at time t. Mathematically,

α_{−i,t}(a_m) = (1/(n − 1)) Σ_{j ≠ i} 1(a_{j,t} = a_m),

where 1(a_{j,t} = a_m) is the indicator that the action taken by agent j at time t is a_m.
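As a concrete illustration, the empirical action distribution can be computed directly from one round of actions. The following minimal Python sketch is our own; the function name and the list encoding of the joint action profile are illustrative, not from the paper.

```python
from collections import Counter

def empirical_action_distribution(actions, i):
    """Compute alpha_{-i,t}: the fraction of agents other than i
    taking each action a_m at a given time step."""
    others = [a for j, a in enumerate(actions) if j != i]
    counts = Counter(others)
    n = len(others)  # this is n - 1 in the paper's notation
    return {a: c / n for a, c in counts.items()}

# From agent 0's viewpoint with 5 agents: 2 of the 4 others play action 1.
alpha = empirical_action_distribution([0, 1, 1, 0, 2], i=0)
```

The returned dictionary sums to one over the actions actually played by the other agents.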
In a mean field equilibrium, each agent acts based on the average of the population's actions. Thus an agent does not need to concern itself with the individual actions of the other agents; it only needs to keep track of the average of the population's actions. The actions of the other players are therefore summarized by α_{−i,t}, the proportions of the actions taken by the other players. Each agent conjectures that its next state is randomly distributed according to the transition probability measure p^M(· | s_{i,t}, a_{i,t}, α_{−i,t}). Note that we assume that the transition probabilities are unknown and are learned using the posterior sampling algorithm described in Section III. Since each player is only concerned with its current state and the average action of the population, we consider a class of strategies known as oblivious strategies.

Definition 1. An oblivious strategy is a policy that maps each state s ∈ S and the average action distribution of the other agents to an action a ∈ A; the set of such policies available to an agent is Π = {π : S × Δ(A) → A}, where Δ(A) denotes the set of probability distributions over A.

Consider an oblivious policy π followed by agent i. We define the value function of agent i under the oblivious policy π at the ℓ-th time step as

V^M_{π,ℓ}(s_i | α_{−i}) = E[ Σ_{j=ℓ}^{τ} γ^{j−ℓ} r^M(s_{i,j}, a_{i,j}, α_{−i}) | s_{i,ℓ} = s_i ],

where r^M(s_i, a_i, α_{−i}) is the expected reward realized by agent i in state s_i when action a_i = π(s_i, α_{−i}) is selected. Since we mainly consider a single agent's decision, the index i will be dropped where the agent is unambiguous. The index will also be dropped from α_{−i} when denoting the limiting distribution, i.e., the long-run average population action distribution. Each agent determines its policy based on the limiting distribution α; however, the decisions π of the agents in turn impact the limiting distribution α. An equilibrium is established if and only if the decisions π do not change the value of α. Thus the agents need to be aware of the long-run action distribution α to make their decisions.
Since the action space and state space are finite, the set of strategies available to the players is also finite. Each player adopts lower-myopic best response dynamics to choose its policy. As time proceeds, the strategies and the action distribution converge to the asymptotic equilibrium [1]. Therefore, we drop the time index from the action distribution to denote the limiting distribution.
The value function defined above satisfies the Bellman property for finite-horizon MDPs:

V^M_{π,j}(s | α) = r^M(s, π(s, α), α) + γ Σ_{s′∈S} p^M(s′ | s, π(s, α), α) V^M_{π,j+1}(s′ | α),

with terminal condition V^M_{π,τ+1}(· | α) = 0. The proof of the above expression is given in Appendix A.
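The finite-horizon Bellman property above can be evaluated by backward induction over the τ steps of an episode. The sketch below is our own illustration and assumes tabular rewards `r[s][a]` and transitions `p[s][a][s2]` already evaluated at a fixed limiting distribution α.

```python
def evaluate_policy(S, r, p, pi, tau, gamma):
    """Finite-horizon policy evaluation: returns V[j][s] for j = 0..tau,
    with zero terminal value V[tau][s] = 0."""
    V = [{s: 0.0 for s in S} for _ in range(tau + 1)]
    for j in range(tau - 1, -1, -1):  # backward in time
        for s in S:
            a = pi[s]  # oblivious: action depends only on the agent's own state
            V[j][s] = r[s][a] + gamma * sum(p[s][a][s2] * V[j + 1][s2] for s2 in S)
    return V

# Two absorbing states; state 0 pays reward 1, state 1 pays nothing.
S = [0, 1]
r = {0: {0: 1.0}, 1: {0: 0.0}}
p = {0: {0: {0: 1.0, 1: 0.0}}, 1: {0: {0: 0.0, 1: 1.0}}}
V = evaluate_policy(S, r, p, pi={0: 0, 1: 0}, tau=2, gamma=0.5)
```

With horizon τ = 2 and γ = 0.5, state 0 collects 1 + 0.5 · 1 = 1.5 while state 1 collects nothing.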
We also define the Q-function as

Q^M_{π,j}(s, a | α) = r^M(s, a, α) + γ Σ_{s′∈S} p^M(s′ | s, a, α) V^M_{π,j+1}(s′ | α).

We now define the optimal oblivious strategy.

Definition 2. The set P(α) maps a distribution α to the set of optimal oblivious strategies, chosen from the Q-function. In other words, for a given α, a policy π̃ ∈ P(α) if and only if

π̃(s, j) ∈ arg max_a Q^M_{π̃,j}(s, a | α) for all s ∈ S and j = 1, . . . , τ.

Here, the policy π̃(s, j) is used at the j-th time step so that the Q-value Q^M_{π̃,j}(s, a | α) is maximized for all states s ∈ S. Note that, for a fixed α, π̃(s, j) does not depend on the states of the other agents; hence it is an oblivious strategy where each agent takes its decision based on its own state only. The set P(α) could in principle be empty; in the subsequent sections we show that P(α) is non-empty under the assumptions used in this paper. Note that, with a slight abuse of notation, we also denote by π(s, j) the strategy which has been learnt till episode j.
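A member of P(α) can be computed by backward induction over Q, breaking ties among maximizers with the lower-myopic rule used later in the paper (take the smallest maximizing action). The tabular layout below (`r[s][a]`, `p[s][a][s2]` at a fixed α) is our own illustrative encoding.

```python
def lower_myopic_strategy(S, A, r, p, tau, gamma):
    """Backward induction for the Q-function, selecting at each step the
    smallest action among those maximizing Q (lower-myopic tie-break).
    Returns pi[j][s], the action for state s at step j."""
    V = {s: 0.0 for s in S}
    pi = [None] * tau
    for j in range(tau - 1, -1, -1):
        Q = {s: {a: r[s][a] + gamma * sum(p[s][a][s2] * V[s2] for s2 in S)
                 for a in A} for s in S}
        pi[j] = {s: min(a for a in A if Q[s][a] == max(Q[s].values())) for s in S}
        V = {s: Q[s][pi[j][s]] for s in S}
    return pi

# One state, two actions with identical value: the tie is broken toward action 0.
pi = lower_myopic_strategy([0], [0, 1],
                           r={0: {0: 1.0, 1: 1.0}},
                           p={0: {0: {0: 1.0}, 1: {0: 1.0}}},
                           tau=2, gamma=0.9)
```

Any fixed tie-breaking rule would do; the lower-myopic rule simply makes the selection deterministic across iterations.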
Suppose all agents know the limiting action distribution α and play according to the optimal oblivious strategy π̃ ∈ P(α), with initial population state distribution f. If the resulting long-run state distribution is equal to the initial state distribution f, then f is said to be invariant of the dynamics induced by α and π̃. We denote the set of all such state distributions by the map D(π̃, α).
We assume that an agent observes the action distribution; however, it does not observe the states of the other agents. Thus, an agent does not know the probability transition matrix and will try to estimate it from past observations, as described in the next section.

III. PROPOSED ALGORITHM
In this section, we propose an algorithm which will be shown to converge to the mean field equilibrium (MFE) in the following section. For each agent i, the algorithm begins with a prior distribution g over stochastic games with state space S, action space A, and time horizon τ. The game is played in episodes k = 0, 1, 2, . . ., each of length τ, and in each episode the game proceeds in discrete steps j = 1, 2, . . . , τ. The episodes begin at times t_k = (k − 1)τ + 1, k = 1, 2, 3, . . .. At each time t, the state of the agent is s_t; it selects an action a_t, observes a scalar reward r_t, and transitions to state s_{t+1}. Let H_t = (s_1, a_1, α_1, r_1, . . . , s_{t−1}, a_{t−1}, α_{t−1}, r_{t−1}) denote the history of the agent till time t.
At the beginning of each episode, an MDP M_{t_k} is sampled from the posterior distribution conditioned on the history H_{t_k}. The sampled action distribution is denoted by α_{t_k}; we assume that, after some number of samples, α_{t_k} has converged (until α_{t_k} converges, the proposed algorithm will not converge). The value function updated in the last iteration is used to calculate the optimal oblivious policy π_k ∈ P(α_{t_k}), which the agent follows for the entire episode. Recall that, for a given α, a policy π ∈ P(α) if and only if π(s, j) = arg max_a Q^{M_{t_k}}_{π_k,j}(s, a | α) for all s ∈ S and j = 1, . . . , τ. To choose the policy π_k from the set P(α_{t_k}), we use lower-myopic learning dynamics: at each episode we choose the strategy which is the infimum of the set P(α_{t_k}), denoted p(α_{t_k}).
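The posterior sampling step can be sketched with a conjugate Dirichlet model over each row of the transition matrix. The paper leaves the prior g abstract; the uniform Dirichlet(1, …, 1) prior and the count-array encoding below are our own assumptions for concreteness.

```python
import numpy as np

def sample_transition_model(counts, rng):
    """Draw one posterior sample of p(s'|s,a). counts[s][a] holds the observed
    transition counts to each next state; with a Dirichlet(1,...,1) prior, the
    posterior over each row is Dirichlet(counts + 1)."""
    counts = np.asarray(counts, dtype=float)
    n_s, n_a, _ = counts.shape
    P = np.empty_like(counts)
    for s in range(n_s):
        for a in range(n_a):
            P[s, a] = rng.dirichlet(counts[s, a] + 1.0)  # posterior pseudo-counts
    return P

rng = np.random.default_rng(0)
P = sample_transition_model([[[3, 1], [0, 2]], [[1, 1], [4, 0]]], rng)
```

Each episode would redraw P from the updated counts, so rows visited often concentrate around their empirical frequencies while rarely visited rows remain diffuse.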
We note that M_{t_k} is used in the algorithm instead of M*, where M* is the true distribution, since M* is not known. To obtain an estimate, each agent samples a transition probability matrix from the posterior distribution. Each agent then follows the strategy π_k according to the Q-values over the episode. Based on the action decision of each agent, we update the value function and the Q-function from the obtained rewards, which depend on the value of α_{t_k}. This update is akin to learning the value function in the Expected Sarsa algorithm [8]. The detailed algorithm steps can be seen in Algorithm 1. We note that, after the algorithm converges, the value of α converges, and thus all the transition probabilities and value functions depend on the limiting distribution. We use M_k as shorthand for M_{t_k}. Here q^j_k(s_j, a_j) is used as an estimate of Q^{M_k}_{π_k,j}(s_j, a_j | α_{t_k}) and v^j_k(s_j) as an estimate of V^{M_k}_{π_k,j}(s_j | α_{t_k}), both updated over the episodes in a way similar to the Expected Sarsa algorithm. The iteration is as follows:

v^j_k(s_{t+1}) = Σ_{a∈A} π_k(a | s_{t+1}, j) q^j_k(s_{t+1}, a),
q^j_{k+1}(s_t, a_t) = (1 − β_k) q^j_k(s_t, a_t) + β_k ( r_j(s_t, a_t, α_{t_k}) + γ v^j_k(s_{t+1}) ),

where t = (k − 1)τ + j, q^j_k(s_t, a_t) is the estimate of Q^{M_k}_{π_k,j}(s_t, a_t | α_{t_k}), and v^j_k is the estimate of V^{M_k}_{π_k,j}(s_t | α_{t_k}) in the k-th episode.
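One step of this Expected-Sarsa-style update pair can be sketched as follows; the dictionary encoding of q and the explicit action probabilities `pi_next` are our own illustrative choices.

```python
def expected_sarsa_step(q, s, a, reward, s_next, pi_next, beta, gamma, A):
    """Update q toward the bootstrapped target built from v at the next state,
    where v is the policy-expected q value (Expected-Sarsa-style backup)."""
    v_next = sum(pi_next[a2] * q[(s_next, a2)] for a2 in A)
    q[(s, a)] = (1 - beta) * q[(s, a)] + beta * (reward + gamma * v_next)
    return q, v_next

q = {(0, 0): 0.0, (1, 0): 2.0, (1, 1): 4.0}
q, v = expected_sarsa_step(q, s=0, a=0, reward=1.0, s_next=1,
                           pi_next={0: 0.5, 1: 0.5}, beta=0.5, gamma=1.0, A=[0, 1])
```

Here v at the next state is 0.5 · 2 + 0.5 · 4 = 3, and q(0, 0) moves halfway from 0 toward the target 1 + 3 = 4.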

Here α_t denotes the limiting action distribution of the other agents at time t, and L(·) denotes the expected change in the limiting action distribution when an agent selects action a. We make the following assumptions throughout the paper.
Assumption 1. We assume that, for all s ∈ S, all a ∈ A, all action distributions α, α′, and some p ∈ [1, ∞), the following hold:

|r^M(s, a, α) − r^M(s, a, α′)| ≤ K ‖α − α′‖_p,
‖L(α, a) − L(α′, a)‖_p ≤ K_1 ‖α − α′‖_p,
‖L(α, a)‖_p ≤ K_2,

where K, K_1, and K_2 are positive constants.
The above assumption amounts to Lipschitz continuity (and boundedness) of r^M(·) and L(·). We further assume that the step sizes β_t used to update the Q-function in the algorithm satisfy the standard stochastic-approximation conditions

β_t ∈ (0, 1], Σ_t β_t = ∞, Σ_t β_t^2 < ∞.
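A standard schedule satisfying the usual stochastic-approximation requirements (the step sizes sum to infinity while their squares sum to a finite value) is β_t = 1/t; this particular choice is our example, not prescribed by the paper. A quick numerical check illustrates the two behaviors:

```python
def partial_sums(T):
    """Partial sums of beta_t = 1/t and beta_t^2 = 1/t^2 up to time T."""
    s1 = sum(1.0 / t for t in range(1, T + 1))       # diverges like log T
    s2 = sum(1.0 / t ** 2 for t in range(1, T + 1))  # converges to pi^2/6
    return s1, s2

s1_a, s2_a = partial_sums(10 ** 3)
s1_b, s2_b = partial_sums(10 ** 6)
```

Going from T = 10³ to T = 10⁶, the first sum keeps growing by roughly 3 ln 10 ≈ 6.9, while the second barely moves and stays below π²/6 ≈ 1.6449.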

IV. CONVERGENCE RESULT
In this section, we show that if the oblivious strategy is chosen according to the proposed algorithm, then the oblivious strategy π and the limiting population action distribution α constitute a Mean Field Equilibrium (MFE).
More formally, we have the following.

Theorem 1. The optimal oblivious strategy obtained from Algorithm 1 and the limiting action distribution constitute a mean-field equilibrium, and the value function obtained from the algorithm converges to the optimal value function of the true distribution.
The rest of the section proves this result. We first note that the lower-myopic best response strategy leads to convergence of the action distribution, following the results in [1] for finite action and state spaces. The key intuition for the lower-myopic rule is to avoid conflicts when the strategy maximizing the value function is non-unique, which might otherwise lead to different strategies being chosen at different iterations. In fact, any consistent way of resolving the multiple optima, including an upper-myopic rule, gives the same result. Having shown that α converges, we now proceed to show that the converged point of the algorithm results in an MFE.
We first show the conditions needed for a policy π, a population state distribution f, and an action distribution α to constitute an MFE (Section IV-A). Then, we show that these conditions are met for any optimal oblivious strategy (Section IV-B). Thus, the key remaining property is that the proposed algorithm leads to an optimal oblivious strategy. To show this, we first show that the optimal oblivious strategy set is non-empty (Section IV-C). Then, we show that the optimal value function of the sampled distribution converges to that of the true distribution (Section IV-D). However, we do not estimate the value function directly from the samples; instead, we use Expected Sarsa updates for the estimate of the value function.
Thus, we finally show that the iterates of the value function obtained from the algorithm converge to the optimal value function of the sampled distribution (Section IV-E). Combined with the result in Section IV-D, this shows that the value function updates eventually converge to the value function under the true underlying transition distribution M*, proving that the proposed algorithm converges to an optimal oblivious strategy which constitutes a mean field equilibrium, and thus proving the theorem.

A. Conditions for a Strategy to be a MFE
In this section, we describe the conditions for an oblivious strategy π to be an MFE. In Section II-B, we defined the two maps P(α) and D(µ, α). For a given action-coupled stochastic game, the map P(α) gives, for a population action distribution α, the set of optimal oblivious strategies. Further, the map D(µ, α) gives, for a population action distribution α and an oblivious strategy µ, the set of invariant population state distributions f.
We also define the map D̄(µ, f), which gives the population action distribution α induced by the oblivious strategy µ and the population state distribution f. The following lemma gives the conditions under which the stochastic game constitutes a mean field equilibrium. These conditions have been provided in [1], and the reader is referred to [1] for further details and the proof of this result.
Lemma 1. An action-coupled stochastic game with strategy π, population state distribution f, and population action distribution α constitutes a mean field equilibrium if π ∈ P(α), f ∈ D(π, α), and α = D̄(π, f).

B. Conditions of Lemma 1 are met for any Optimal Oblivious Strategy
In this section, we show that the conditions of Lemma 1 are met for any optimal oblivious strategy. In the mean field equilibrium, each agent plays according to a strategy π ∈ P(α). If the long-run average population action distribution is α and each agent follows an oblivious strategy, then the evolution of the states must be such that the oblivious strategy on those states leads to an average action distribution of α. Let the long-run average state distribution be f, i.e.,

f(s) = lim_{t→∞} (1/n) Σ_{i=1}^{n} 1(s_{i,t} = s),

where s_{i,t} is the state of agent i. Then the above statement implies that α must satisfy

α(a) = Σ_{s ∈ π^{−1}(a)} f(s),

where π^{−1}(a) represents the set of states x for which a ∈ π(x). This is equivalent to saying that if all the agents follow the optimal oblivious strategy π ∈ P(α), then the long-run average population state distribution f and the long-run average population action distribution α satisfy f ∈ D(π, α) and α = D̄(π, f).
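The two consistency conditions, f invariant under the induced dynamics and α = D̄(π, f), can be checked numerically for a small example. The tabular encoding below (f and π as dictionaries, transitions `p[s][a][s2]` at a fixed α) is our own illustration.

```python
def induced_action_distribution(f, pi):
    """D-bar(pi, f): alpha(a) is the total mass f(s) of states s with pi(s) = a."""
    alpha = {}
    for s, fs in f.items():
        alpha[pi[s]] = alpha.get(pi[s], 0.0) + fs
    return alpha

def is_invariant(f, pi, p, tol=1e-9):
    """Check f in D(pi, alpha): one step of the dynamics induced by pi
    (with alpha baked into the table p) leaves f unchanged."""
    f_next = {s2: sum(fs * p[s][pi[s]][s2] for s, fs in f.items()) for s2 in f}
    return all(abs(f_next[s] - f[s]) < tol for s in f)

# Symmetric two-state example: the uniform f is invariant, and pi maps
# state 0 to action 0 and state 1 to action 1, inducing a uniform alpha.
f = {0: 0.5, 1: 0.5}
pi = {0: 0, 1: 1}
p = {0: {0: {0: 0.5, 1: 0.5}}, 1: {1: {0: 0.5, 1: 0.5}}}
alpha = induced_action_distribution(f, pi)
ok = is_invariant(f, pi, p)
```

In a fixed point of the learning dynamics, both checks would pass simultaneously for the limiting (f, α, π).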

C. Optimal Oblivious Strategy Set is Non-Empty
In this subsection, we show that there exists an optimal oblivious strategy. More formally, we have the following lemma.

Lemma 2. For any population action distribution α, the set P(α) is non-empty.

Proof. Note that r is bounded and Lipschitz continuous. In addition, for each state s, the next state is drawn from a countable set; hence the Q-function is well defined at every step, and the maximum over the finite action set is attained, so an optimal oblivious strategy exists.
D. The Optimal Value Function of the Sampled Distribution Converges to that of the True Distribution

At the start of every episode, each agent samples a probability distribution from the posterior. The following result bounds the difference between the optimal value function under the true distribution M* with the (unknown) optimal policy π*, and the optimal value function achieved under the sampled distribution M_k with policy π_k (here M_{t_k} is abbreviated to M_k).

Lemma 4. If the sampled distribution is chosen from Algorithm 1, the optimal value function of the sampled distribution M_k, V^{M_k}_{π_k,j}(s|α), converges to the optimal value function of the true distribution M*, V^{M*}_{π*,j}(s|α), i.e., for all states s ∈ S,

V^{M_k}_{π_k,j}(s|α) → V^{M*}_{π*,j}(s|α) as k → ∞.

Proof. We first use an equivalence between the true and sampled distributions which follows from the property of posterior sampling shown in [6]: for any σ(H_{t_k})-measurable function g,

E[g(M*) | H_{t_k}] = E[g(M_{t_k}) | H_{t_k}].

Applied to the difference of the optimal value functions of the two distributions, this shows that, for all states s ∈ S,

E[ V^{M*}_{π*,j}(s|α) − V^{M_k}_{π_k,j}(s|α) | H_{t_k} ] = 0.

Note that the length of every episode is τ and the support of the reward is [0, 1]; therefore, for all states s ∈ S, V^{M*}_{π*,j}(s|α) − V^{M_k}_{π_k,j}(s|α) ∈ [−τ, τ]. This is the bounded-increments condition of the Azuma-Hoeffding Lemma (Lemma 3).
Since the partial sums of V^{M*}_{π*,j}(s|α) − V^{M_k}_{π_k,j}(s|α) form a zero-mean martingale with respect to the filtration {H_{t_k} : k = 1, . . . , m} and have bounded increments, the assumptions of the Azuma-Hoeffding Lemma are satisfied. Applying the Azuma-Hoeffding Lemma to this martingale gives, for all θ > 1/2 and all s ∈ S,

(1/m^θ) Σ_{k=1}^{m} ( V^{M*}_{π*,j}(s|α) − V^{M_k}_{π_k,j}(s|α) ) → 0 almost surely as m → ∞.

For a total running time T of the algorithm, we have m = T/τ. Substituting θ = 1, the above expression says that τ times the average difference over the episodes converges to zero as T → ∞, which gives the convergence of the optimal value functions of the two distributions.
Thus, we have, for all s ∈ S,

lim_{k→∞} V^{M_k}_{π_k,j}(s|α) = V^{M*}_{π*,j}(s|α).

E. Value Function Update Steps in Algorithm 1 Converge to the Actual Value Function
We have already bounded the difference between the optimal value functions achieved under the sampled and the true distributions. In this subsection, we show that if the policy π is chosen from Algorithm 1, the iterate q^j converges to the optimal Q-function Q^{M_k}_{π_k,j}; consequently, the value function iterate converges to the optimal value function.
Lemma 5. The Q-value computed by the algorithm, q^j, converges to the optimal Q-value of the sampled distribution, Q^{M_k}_{π_k,j}, i.e., for all s ∈ S and a ∈ A,

q^j_k(s, a) → Q^{M_k}_{π_k,j}(s, a | α) as k → ∞.

Proof. This result follows along the same lines as the convergence of the Expected Sarsa algorithm [8], and is thus omitted.
The updates used in the algorithm employ learning rates which satisfy the conditions required for the convergence of Expected Sarsa.
Having shown that the Q-function converges, it is easy to see that the value function converges as well, which completes the proof of Theorem 1.

Proof of Theorem 1. From Lemma 5, we note that the iterates of the value function v^j_k(s) converge to the optimal value function of the sampled distribution, V^{M_k}_{π_k,j}(s|α). Also, by Lemma 4, the optimal value function of the sampled distribution V^{M_k}_{π_k,j}(s|α) converges to the optimal value function achieved by the true distribution, V^{M*}_{π*,j}(s|α). Combining these, we obtain the result in the statement of the theorem.

V. CONCLUSION
We consider an action-coupled stochastic game consisting of a large number of agents where the transition probabilities are unknown to the agents. We resort to the concept of mean-field equilibrium, where each agent's reward and transition probability are impacted by the other agents only through the mean distribution of their actions.
As the number of agents grows large, the mean-field equilibrium closely approximates the Nash equilibrium.
We propose a posterior sampling based approach where each agent draws a sample from an updated posterior distribution and selects an optimal oblivious strategy accordingly. We show that the proposed algorithm converges to the mean field equilibrium without knowing the transition probabilities a priori. This paper shows asymptotic convergence to the mean-field equilibrium; finding the convergence rate is an interesting future direction.