2. Background
2.1. Multi-Player Stochastic Game
An $n$-player stochastic game is formalized by the system dynamics tuple $\mathcal{M} = (n, \mathcal{S}, \mathcal{A}, P, r, \gamma, \rho)$. The agents are indexed by the set $[n] = \{1, 2, \ldots, n\}$. The state of the $i$th agent at time $t$ is given by $s_{i,t} \in \mathcal{S}$, where $\mathcal{S}$ is the state space set. $\mathcal{A}(s)$ is defined as the set of feasible actions any agent can take in state $s$, and $\mathcal{A} = \cup_{s \in \mathcal{S}} \mathcal{A}(s)$ is the action space set. We also assume that both $\mathcal{S}$ and $\mathcal{A}$ are finite sets. Since finite sets are also compact, this assumption allows us to use results from the previous work of [10]. Since we have $n$ agents in our system, the combined state space of the system becomes $\mathcal{S}^n$, and the combined action space of the system becomes $\mathcal{A}^n$. Let $\mathbf{s}_t$ be a vector of length $n$ whose $i$th element denotes the state of the $i$th agent at time $t$. Similarly, let $\mathbf{a}_t$ be a vector of length $n$ whose $i$th element denotes the action taken by the $i$th agent at time $t$.
If the agents play the joint action $\mathbf{a}_t$, the next state of the system follows the probability distribution $P(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$. Along with the state update, the $i$th agent also receives a reward $r_i$. We further assume that the reward function is independent of the agent index $i$, and each player is trying to optimize the same reward. Hence, we can drop the subscript $i$ in the reward function $r$. However, since each agent can be in a different state or play a different action, their individual rewards will differ, and we still use the subscript $i$ to differentiate the instantaneous reward of the $i$th agent. The constant $\gamma \in (0, 1)$ is the discount factor, and $\rho$ is the initial state distribution such that $\mathbf{s}_0 \sim \rho$.
We consider an episodic framework where the length of the time horizon, or the length of an episode, is $H$. The state space set $\mathcal{S}$, the action space set $\mathcal{A}$, and the horizon $H$ are known and need not be learned by the agent. We consider that the game is played in episodes $k = 1, 2, \ldots$. In each episode, the game is played in discrete steps $j = 1, \ldots, H$. The episodes begin at times $t_k = (k-1)H + 1$. At each time $t$, the state of agent $i$ is given by $s_{i,t}$, the agent selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and the state transitions to $s_{i,t+1}$. Let $\mathcal{H}_{i,t}$ denote the history available to agent $i$ up to time $t$.
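To make the episodic protocol concrete, the following minimal Python sketch plays one episode from a single agent's perspective; the environment object `env`, its `reset`/`step` interface, and the `policy` callable are illustrative assumptions rather than objects defined in the paper.

```python
from collections import namedtuple

# Hypothetical single-agent view of the episodic interaction described above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

def run_episode(env, policy, horizon):
    """Play one episode of length `horizon` and return the per-step history."""
    history = []
    state = env.reset()                      # s_{i,1} drawn from the initial distribution
    for j in range(1, horizon + 1):
        action = policy(j, state)            # oblivious: depends only on step index and own state
        reward, next_state = env.step(action)
        history.append(Transition(state, action, reward, next_state))
        state = next_state
    return history
```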
2.2. Mean-Field Game
In a game with a large number of players, we might expect that the distribution of agents over the action space carries more meaning than the actions themselves. It is intuitive that a single agent has a negligible effect on the game as the number of agents increases. The effect of other agents on a single agent’s payoff is only via the action distribution of the population. This intuition is formalized in the mean-field game. We now formally define the mean-field game and equilibrium concepts.
First, we define some notation. Let $\alpha_{-i,t}(a)$ be the fraction of the agents (excluding agent $i$) that take action $a$ at time $t$. Mathematically, we have
$$\alpha_{-i,t}(a) = \frac{1}{n-1} \sum_{j \neq i} \mathbb{1}\{a_{j,t} = a\},$$
where $\mathbb{1}\{a_{j,t} = a\}$ is the indicator function that agent $j$ takes action $a$ at time $t$. Since we assume an episodic framework, $\alpha_{-i,t}$ can be different at each time index in an episode and also across episodes. The episodic nature of the problem will be used later (in (9)) to define convergence to a value that depends only on the time index in the episode. Further, since each agent selects exactly one action from $\mathcal{A}$, we have
$$\sum_{a \in \mathcal{A}} \alpha_{-i,t}(a) = 1.$$
Similar to the distribution of agents over actions, we define $f_{-i,t}(s)$ as the distribution of agents (excluding agent $i$) over the state space $\mathcal{S}$.
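As a small illustration of these definitions, the sketch below computes the empirical action distribution $\alpha_{-i,t}$ and state distribution $f_{-i,t}$ from the joint state and action vectors at a given time; encoding states and actions as integer indices is an assumption made only for this example.

```python
import numpy as np

def empirical_distributions(states, actions, i, num_states, num_actions):
    """Empirical state/action distributions of all agents except agent i.

    `states` and `actions` are length-n integer arrays holding s_{j,t} and a_{j,t}.
    """
    mask = np.arange(len(states)) != i
    # Fraction of the other agents taking each action (alpha_{-i,t}).
    alpha = np.bincount(actions[mask], minlength=num_actions) / (len(actions) - 1)
    # Fraction of the other agents in each state (f_{-i,t}).
    f = np.bincount(states[mask], minlength=num_states) / (len(states) - 1)
    return alpha, f
```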
In a mean-field game, every agent $i$ assumes that its next state $s_{i,t+1}$ is randomly distributed according to the transition probability distribution $P(\cdot \mid s_{i,t}, a_{i,t}, \alpha_{-i,t})$, conditioned on the agent's current state $s_{i,t}$, the action taken $a_{i,t}$, and the other agents' distribution over actions $\alpha_{-i,t}$. Also, the reward is a function of the agent's current state, its action, and the next state.
Thus, the agent does not need to concern itself with the actions of the other agents, as the population action distribution becomes part of the environment. These updated environment dynamics can now be used in decision-making. Note that the population action distribution could also be explicitly taken into account when deciding an action. The reward may also depend on the state distribution $f$ of the other agents; the analysis would be similar.
Example 1. We now provide an example drawn from a real-life application that matches our setting. Consider the scenario of malware spreading. The state $s_{i,t} = 0$ means that agent $i$ is vaccinated and cannot be infected at time $t$. Otherwise, $s_{i,t}$ can vary between 0 and 1 with $n$ quantization levels, representing the level of infection. The action space of an agent is $\{0, 1\}$. If $a_{i,t} = 0$, the agent does not take any action. If $a_{i,t} = 1$, agent $i$ takes an action in order to protect itself. To simplify the model, we consider a state evolution in which the next state $s_{i,t+1}$ depends on the current state $s_{i,t}$, the action $a_{i,t}$, the fraction of unprotected agents $\alpha_{-i,t}(0)$, and a noise term $w_t$ taking values in $[0, 1]$. Note that if all the other agents have taken action 1, i.e., they protect themselves, then an agent has less chance of being infected. Thus, the next state is smaller when the fraction of agents taking action 0 (i.e., $\alpha_{-i,t}(0)$) is smaller.
The reward function for agent $i$ increases with the fraction of agents who have protected themselves, and it includes a cost $\lambda$ that depends on the action taken. This example illustrates the necessity of computing the mean-field equilibrium.
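Since the explicit state-evolution and reward expressions are not reproduced here, the following Python sketch gives one possible instantiation that is consistent with the qualitative description above (infection grows with the fraction of unprotected agents and a noise term, protecting clears it, and the reward favors a protected population while charging an action cost $\lambda$); the concrete functional forms are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_state(s, a, alpha0, w):
    """Illustrative state update: the infection level grows with the fraction of
    unprotected agents (alpha0) and the noise w in [0, 1], while protecting
    (a = 1) removes the infection. The exact form is an assumption."""
    return 0.0 if a == 1 else min(1.0, s + alpha0 * w)

def reward(s_next, a, alpha0, lam=0.1):
    """Illustrative reward: larger when more agents protect themselves (small alpha0)
    and the agent ends up less infected, minus an action-dependent cost lam * a."""
    return (1.0 - alpha0) - s_next - lam * a

n = 1000
states = rng.uniform(size=n)          # infection levels in [0, 1]
actions = rng.integers(0, 2, size=n)  # 0: do nothing, 1: protect

new_states = np.empty(n)
for i in range(n):
    # Fraction of the *other* agents taking action 0.
    alpha0 = (np.sum(actions == 0) - (actions[i] == 0)) / (n - 1)
    w = rng.uniform()                 # noise in [0, 1]
    new_states[i] = next_state(states[i], actions[i], alpha0, w)
```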
Note that, in general, the action of an agent should depend on the action distribution of the other agents. However, Proposition 1 of [10] says that, under equilibrium, an oblivious strategy performs as well as a strategy that considers other agents' actions. Thus, the strategy does not need to explicitly consider the value of $\alpha_{-i,t}$.
Definition 1. An agent is said to follow an oblivious deterministic strategy $\pi$ when it selects an action $a_{i,t} = \pi_j(s_{i,t})$ considering only the time index $j$ in an episode and its current state $s_{i,t}$. For the rest of the paper, we will focus on oblivious deterministic strategies for all agents.
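A convenient way to picture an oblivious deterministic strategy is as a lookup table indexed by the step within an episode and the agent's own state; the tabular encoding below is an illustrative choice, not notation from the paper.

```python
import numpy as np

# An oblivious deterministic strategy stored as an H x |S| integer table:
# policy_table[j - 1, s] is the action taken at the j-th step of an episode
# when the agent's own state is s.
H, num_states, num_actions = 10, 5, 3
policy_table = np.zeros((H, num_states), dtype=int)

def oblivious_policy(j, s):
    """Action at episode step j (1-indexed) in state s; ignores other agents."""
    return int(policy_table[j - 1, s])
```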
2.3. Value Function, Q Function and Policy
We now define the value function of an agent $i$ for an oblivious policy $\pi$ at the $l$th time step in an episode as
$$V_l^{\pi}(s) = \mathbb{E}\left[ \sum_{j=l}^{H} r\big(s_{i,j}, a_{i,j}, s_{i,j+1}\big) \,\middle|\, s_{i,l} = s,\; a_{i,j} = \pi_j(s_{i,j}) \right]. \qquad (8)$$
The expectation in Equation (8) is taken over the actions taken from time step $l$ and the states visited after time step $l$ in an episode. We will consider the rest of the definitions from some $i$th agent's perspective, so the subscripts $i$ and $-i$ will be dropped for brevity.
We note that the action space and state space are finite, and hence the set of strategies available to the players is also finite. Each player adopts lower-myopic best response dynamics to choose its policy. A lower-myopic policy selects the action with the lowest index among the actions that maximize the value function. As time proceeds, the strategies and the action distribution converge to the asymptotic equilibrium [10].
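The lower-myopic tie-breaking rule is easy to implement: among the maximizers of the value (or Q-) function, pick the action with the smallest index. A minimal sketch:

```python
import numpy as np

def lower_myopic_action(q_values):
    """Among the actions maximizing the Q-value, return the one with the lowest index.

    np.argmax already breaks ties in favour of the first (lowest-index) maximizer,
    which is exactly the lower-myopic rule described above.
    """
    return int(np.argmax(q_values))

# Example: actions 1 and 3 both achieve the maximum value 2.0; the rule picks 1.
assert lower_myopic_action(np.array([0.5, 2.0, 1.0, 2.0])) == 1
```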
Let $\alpha_j$ be the limiting population action distribution for the $j$th time index as the episode index $k$ grows, and let $\alpha_j^k$ denote the population action distribution at the $j$th time index of episode $k$. We note that, due to the episodic framework, the limiting action distribution depends on the index within an episode. Then, from the definition of a limit, for every $\epsilon > 0$ there exists a $k_0$ such that for all $k \geq k_0$ we have
$$\left\| \alpha_j^k - \alpha_j \right\|_1 \leq \epsilon, \qquad (9)$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm.
The value function defined in Equation (8) satisfies the Bellman property for finite-horizon MDPs, given by
$$V_l^{\pi}(s) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, \pi_l(s),\, \alpha_l)}\Big[ r\big(s, \pi_l(s), s'\big) + V_{l+1}^{\pi}(s') \Big],$$
where $V_{H+1}^{\pi}(s) = 0$ for all $s \in \mathcal{S}$, and $l \in \{1, \ldots, H\}$.
Similarly, we also define the Q-function as
$$Q_l^{\pi}(s, a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, a,\, \alpha_l)}\Big[ r(s, a, s') + V_{l+1}^{\pi}(s') \Big].$$
We further consider that the agents are strategic and hence care only about their individual rewards. The goal of each agent is to find an optimal oblivious policy $\pi^*$ such that
$$V_l^{\pi^*}(s) \;\geq\; V_l^{\pi}(s) \qquad \forall\, \pi,\; s \in \mathcal{S},\; l \in \{1, \ldots, H\}.$$
Let $Q_l^{*}(s, a) = \max_{\pi} Q_l^{\pi}(s, a)$ denote the optimal Q-function; then we can define the optimal oblivious strategy.
Definition 2. The set $\Pi(\alpha)$ is the set of optimal oblivious strategies chosen from the Q-function generated by $\alpha$. In other words, for a given $\alpha$, a policy $\pi \in \Pi(\alpha)$ if and only if
$$\pi_l(s) \in \arg\max_{a \in \mathcal{A}} Q_l^{*}(s, a) \qquad \forall\, s \in \mathcal{S},\; l \in \{1, \ldots, H\}.$$
Here, the policy $\pi_l$ is used at the $l$th time index in an episode so that the Q-value $Q_l^{*}\big(s, \pi_l(s)\big)$ is maximized for all states $s \in \mathcal{S}$.
Note that a policy $\pi \in \Pi(\alpha)$ does not depend on the distribution $f$ explicitly. Hence, it is an oblivious strategy where each agent takes its decision based only on its own observed state. Since the reward function is bounded and the state and action spaces are finite, the set $\Pi(\alpha)$ is always non-empty. However, finding the optimal action is challenging for an oblivious strategy profile. We denote the initial population state distribution by $f_0$. We note that as $\alpha$ evolves, the population state distribution $f$ also evolves. After convergence, for a time index $j$ in any episode, the population state distribution converges to the limiting population state distribution $f_j$, or
$$\lim_{k \to \infty} f_j^k = f_j.$$
2.4. Stationary Mean-Field Equilibrium
Throughout this paper, we seek to compute a mean-field equilibrium and the corresponding action strategy. Thus, the action taken by an agent depends only on its own state, independent of the episode. Further, such an action profile should converge to stationary action and state distributions. We now formally define a mean-field equilibrium.
Definition 3 ([24]). We say that a Mean-Field Equilibrium (MFE) is achieved by an oblivious strategy $\pi$ if the strategy of the players, the population action distribution $\alpha$, and the population state distribution $f$ are such that $\pi \in \Pi(\alpha)$, $\alpha$ is the action distribution induced by $\pi$ and $f$, and $f$ is invariant under $\pi$ and $\alpha$ (we write $f \in \mathcal{D}(\pi, \alpha)$ for the set of such invariant state distributions). Specifically, if we fix $f$ and each agent takes an action which belongs to $\Pi(\alpha)$, then the induced action distribution should return $\alpha$, as it is invariant under the transition probability. Further, if we fix $\alpha$ and each agent takes action $\pi(s)$, the state distribution should give back $f$.
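For small finite models, the two invariance conditions in Definition 3 can be checked numerically. The sketch below assumes a stationary tabular policy `pi`, a state distribution `f`, an action distribution `alpha`, and a transition kernel `P[s, a, s']` already evaluated at the fixed $\alpha$; it checks only the invariance conditions, not the optimality condition $\pi \in \Pi(\alpha)$.

```python
import numpy as np

def induced_action_distribution(pi, f, num_actions):
    """alpha(a) = sum of f(s) over the states s with pi(s) = a."""
    alpha = np.zeros(num_actions)
    for s, prob in enumerate(f):
        alpha[pi[s]] += prob
    return alpha

def next_state_distribution(pi, f, P):
    """One-step update of the population state distribution under pi.

    P[s, a, s2] is the kernel already evaluated at the fixed alpha (an assumption)."""
    return np.array([sum(f[s] * P[s, pi[s], s2] for s in range(len(f)))
                     for s2 in range(P.shape[2])])

def is_mfe_consistent(pi, f, alpha, P, tol=1e-8):
    """Check that pi and f induce alpha and that f is invariant under the dynamics."""
    alpha_ok = np.allclose(induced_action_distribution(pi, f, len(alpha)), alpha, atol=tol)
    f_ok = np.allclose(next_state_distribution(pi, f, P), f, atol=tol)
    return alpha_ok and f_ok
```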
Since the players are learning an oblivious strategy, no agent observes the states or actions of the other agents. Also, an agent does not know the probability transition matrix or the reward function, and will try to estimate them from past observations, as described in the next section.
3. Proposed Algorithm
In this section, we propose an algorithm which will be shown to converge to the mean-field equilibrium (MFE) in the following section. For each agent $i$, the algorithm begins with a prior distribution $g$ over the stochastic game with state space set $\mathcal{S}$, action space set $\mathcal{A}$, and time horizon $H$. The prior distribution $g$ for modeling the state transition probability distribution is typically taken to be a Dirichlet distribution [14,15].
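As an illustration of the Dirichlet prior/posterior machinery, the sketch below samples a transition kernel from a Dirichlet posterior maintained through transition counts; the array layout and the symmetric prior parameter are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition_kernel(counts, prior=1.0):
    """Sample P(. | s, a) for every (s, a) pair from a Dirichlet posterior.

    `counts[s, a, s2]` is the number of observed transitions s -> s2 under action a;
    `prior` is a symmetric Dirichlet prior parameter. This is a sketch of the standard
    Dirichlet posterior update used in posterior sampling [14,15], with array names
    chosen for illustration.
    """
    num_states, num_actions, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(num_states):
        for a in range(num_actions):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P
```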
The game is played in episodes $k = 1, 2, \ldots, K$. The length of each episode is given by $H$. In each episode, the game is played in discrete steps $j = 1, \ldots, H$. The episodes begin at times $t_k = (k-1)H + 1$. At each time $t$, the state of the agent is given by $s_{i,t}$, it selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and then transitions to the state $s_{i,t+1}$. Let $\mathcal{H}_{i,t}$ denote the history of the agent up to time $t$.
The proposed algorithm is described in Algorithm 1. At the beginning of each episode, an MDP $M_k$ is sampled from the posterior distribution conditioned on the history $\mathcal{H}_{t_k}$ in Line 4. We note that the sampling of the MDP only relates to sampling the transition probability $P$ and the reward distribution, since the rest of the parameters are known. We note that the algorithm does not perform explicit exploration like an $\epsilon$-greedy algorithm. Instead, the algorithm samples a new MDP $M_k$ for episode $k$ in Line 4. The algorithm can then generate a new trajectory from the policy [14,15] solved for the sampled MDP $M_k$. We assume that, after some samples, $\alpha$ has converged. The proposed algorithm converges as the induced transition probability and reward function converge once $\alpha$ converges.
Algorithm 1 Proposed Algorithm for Mean-Field Game with Best Response Learning Dynamics.
1: Input: Prior distribution $g$, time horizon $H$
2: Initialize the history $\mathcal{H}_1 = \emptyset$.
3: for episodes $k = 1, 2, \ldots$ do
4:  Sample $M_k \sim g(\cdot \mid \mathcal{H}_{t_k})$.
5:  Obtain the optimal $Q$ for $M_k$ from Algorithm 2
6:  for time steps $j = 1, \ldots, H$ do
7:   Play $a_{i,t} \in \arg\max_{a} Q_j(s_{i,t}, a)$ (lower-myopic tie-breaking).
8:   Observe the reward $r_{i,t}$, the action of the agent $a_{i,t}$, and the next state $s_{i,t+1}$.
9:   Append the action taken $a_{i,t}$, the reward obtained $r_{i,t}$, and the state update $s_{i,t+1}$ to the history $\mathcal{H}_{t+1}$.
10:  end for
11: end for
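A compact Python rendering of the episode loop in Algorithm 1, from a single agent's perspective, is given below. The helpers `sample_transition_kernel` and `backward_induction` are the hypothetical routines sketched around this section, and tracking the mean reward empirically is an illustrative simplification of sampling the reward distribution.

```python
import numpy as np

def run_agent(env, H, num_episodes, num_states, num_actions):
    """Posterior-sampling loop mirroring Algorithm 1 for a single agent (a sketch)."""
    counts = np.zeros((num_states, num_actions, num_states))  # transition counts
    visits = np.zeros((num_states, num_actions))               # (s, a) visit counts
    r_hat = np.zeros((num_states, num_actions))                # empirical mean rewards
    for k in range(num_episodes):
        P_k = sample_transition_kernel(counts)                 # Line 4: sample M_k from posterior
        Q = backward_induction(P_k, r_hat, H)                  # Line 5: Algorithm 2
        s = env.reset()
        for j in range(1, H + 1):
            a = int(np.argmax(Q[j - 1, s]))                    # Line 7: lower-myopic action
            r, s_next = env.step(a)                            # Line 8: observe reward, next state
            counts[s, a, s_next] += 1                          # Line 9: update the history
            visits[s, a] += 1
            r_hat[s, a] += (r - r_hat[s, a]) / visits[s, a]
            s = s_next
    return Q
```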
We use the Backward Induction algorithm [25], described in Algorithm 2, to obtain the Q-value function for the currently sampled MDP (Line 5, Algorithm 1). Backward induction in Algorithm 2 starts from the end of the episode and calculates the potential maximum rewards for each state and action (Line 5). The algorithm then goes back in the episode (Line 8) to calculate the maximum possible cumulative reward for each state and action in Line 11. After all the time indices in an episode are covered, the algorithm returns the calculated optimal Q-values. We obtain the policy $\pi^k$ from the calculated Q-values, and the policy is not altered within an episode. Recall that, for a given $\alpha$, a policy $\pi \in \Pi(\alpha)$ if and only if $\pi_l(s) \in \arg\max_{a} Q_l^{*}(s, a)$ for all $s \in \mathcal{S}$ and $l \in \{1, \ldots, H\}$. Let $\alpha^k$ be the population action distribution in episode $k$; then the algorithm aims to choose a policy $\pi^k \in \Pi(\alpha^k)$. In order to choose the policy $\pi^k$ from the set $\Pi(\alpha^k)$, we use lower-myopic learning dynamics, where at each episode we choose the strategy with the smallest action index in the set of maximizers $\arg\max_{a} Q_l^{*}(s, a)$.
Algorithm 2 Backwards Induction Algorithm.
1: Input: $M_k = (\mathcal{S}, \mathcal{A}, P_k, \bar{r}_k, H)$ ▹ Sampled MDP from Algorithm 1
2: Initialize $Q_j(s, a) = 0$ for all $j \in \{1, \ldots, H\}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
3: for state $s \in \mathcal{S}$ do
4:  for action $a \in \mathcal{A}$ do
5:   Update the Q-value function for the last time index: $Q_H(s, a) = \bar{r}_k(s, a)$
6:  end for
7: end for
8: for time steps $j = H-1, \ldots, 1$ do
9:  for state $s \in \mathcal{S}$ do
10:   for action $a \in \mathcal{A}$ do
11:    $Q_j(s, a) = \bar{r}_k(s, a) + \sum_{s' \in \mathcal{S}} P_k(s' \mid s, a) \max_{a'} Q_{j+1}(s', a')$
12:   end for
13:  end for
14: end for
15: Return: $Q$
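The backward induction step of Algorithm 2 can be written compactly for a tabular sampled MDP as follows; the array shapes match the sketches above and are assumptions of the example.

```python
import numpy as np

def backward_induction(P, r, H):
    """Backward induction over a sampled finite-horizon MDP, as in Algorithm 2.

    P[s, a, s2] is the sampled transition kernel and r[s, a] the (estimated) mean
    reward. Returns Q with shape (H, |S|, |A|), where Q[j-1] is the table for step j.
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((H, num_states, num_actions))
    Q[H - 1] = r                                   # Lines 3-7: last step, no future value
    for j in range(H - 2, -1, -1):                 # Line 8: go backwards through the episode
        V_next = Q[j + 1].max(axis=1)              # optimal value one step ahead
        Q[j] = r + P @ V_next                      # Line 11: reward plus expected future value
    return Q
```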
We note that an estimate of $\alpha$ is used in the algorithm instead of $\alpha$ itself, since $\alpha$, the true distribution, is not known. In order to obtain an estimate, each agent samples a transition probability matrix according to the posterior distribution. Each agent follows the strategy given by the Q-values over the episode. Based on the action decisions of the agents, we update the value function and the Q-function using the obtained rewards, which depend on the value of $\alpha$. The detailed algorithm steps can be seen in Algorithm 1. We note that, as the algorithm converges, the value of $\alpha$ converges, and thus all the transition probabilities and value functions depend on the limiting distribution.
4. Convergence Result
In this section, we will show that if the oblivious strategy is chosen according to the proposed algorithm, then the oblivious strategy and the limiting population action distribution constitute a Mean-Field Equilibrium (MFE). More formally, we obtain the following result.
Theorem 1. The optimal oblivious strategy obtained from Algorithm 1 and the limiting action distribution constitute a mean-field equilibrium, and the value function obtained from the algorithm converges to the optimal value function of the true distribution.
The rest of the section proves this result. We first note that the lower-myopic best response strategy leads to convergence of the action strategy, following the results in [10] for finite action and state spaces. We note that there might be multiple actions that maximize the state-action value function, which may lead to choosing different actions at different iterations for the same state. To avoid oscillations between the best actions and hence keep the policy stable, we choose a lower-myopic strategy. This lower-myopic strategy avoids conflicts when the agents have a non-unique strategy that maximizes the value function. Further, any consistent way of resolving the multiple optima, including an upper-myopic rule, would give the same result. Having shown that the population action distribution $\alpha^k$ converges, we now proceed to show that the converged point of the algorithm results in an MFE.
We first show the conditions needed for a policy $\pi$, a population state distribution $f$, and a population action distribution $\alpha$ to constitute an MFE (Section 4.1). Then, we show that the conditions for the policy to be an MFE given in Section 4.1 are met for any optimal oblivious strategy (Section 4.2). Thus, the key property required to show the desired result is that the proposed algorithm leads to an optimal oblivious strategy. In order to show that, we show that the value function of the sampled distribution converges to that of the true distribution (Section 4.3). The result in Section 4.3 shows that the value function iterates eventually converge to the value function with knowledge of the true underlying transition probability distribution, thus proving that the proposed algorithm converges to an optimal oblivious strategy which constitutes a mean-field equilibrium, thereby proving the theorem.
4.1. Conditions for a Strategy to Be an MFE
In this section, we describe the conditions for an oblivious strategy $\pi$ to be an MFE. Recall that, in Section 2, we defined two maps $\Pi$ and $\mathcal{D}$. For a given action-coupled stochastic game, the map $\Pi$ for a given population action distribution $\alpha$ gives the set of optimal oblivious strategies. Further, the map $\mathcal{D}$ for a given population action distribution $\alpha$ and oblivious strategy $\pi$ gives the set of invariant population state distributions $f$.
We define the map $\Phi$, which gives the population action distribution $\alpha$ induced by the oblivious strategy $\pi$ and the population state distribution $f$. The following lemma gives the conditions under which the stochastic game constitutes a mean-field equilibrium. These conditions have been provided in [11], and the reader is referred to [11] for further details and a proof of this result.
Lemma 1 (Definition 7 [11]). An action-coupled stochastic game with strategy $\pi$, population state distribution $f$, and population action distribution $\alpha$ constitutes a mean-field equilibrium if $\pi \in \Pi(\alpha)$, $f \in \mathcal{D}(\pi, \alpha)$, and $\alpha = \Phi(\pi, f)$.
4.2. Conditions of Lemma 1 Are Met for Any Optimal Oblivious Strategy
In this section, we show that the conditions of Lemma 1 are met for any optimal oblivious strategy. In a mean-field equilibrium, each agent plays according to a strategy $\pi \in \Pi(\alpha)$. If the average population action distribution is $\alpha$ and each agent follows an oblivious strategy, then the evolution of the states must be such that the oblivious strategy applied to those states leads to an average action distribution of $\alpha$. Since we assume a large number of agents, including agent $i$'s own state does not change the population state distribution. So, we let the average population state distribution at time index $j$ be
$$f_j(s) = \frac{1}{n} \sum_{m=1}^{n} \mathbb{1}\{s_{m,j} = s\},$$
where $s_{m,j}$ is the state of agent $m$ at time index $j$. Similarly, we also include agent $i$'s action in the population action distribution $\alpha_j$. Then, in expectation (equivalently, for large $n$), we have
$$\alpha_j(a) = \sum_{s \in \mathcal{S}} \mathbb{1}\{\pi_j(s) = a\}\, \frac{1}{n} \sum_{m=1}^{n} p_{m,j}(s),
\qquad
p_{m,j+1}(s') = \sum_{s \in \mathcal{S}} p_{m,j}(s)\, P\big(s' \mid s, \pi_j(s), \alpha_j\big),$$
where $p_{m,j}(s)$ is the probability of agent $m$ being in state $s$ at time index $j$. Here, the second equation follows from the transition probability matrix. Recursively replacing $p_{m,j}$ using this recursion for all $j$ gives the required result of $\alpha = \Phi(\pi, f)$.
The above statement also implies that $\alpha$ must satisfy
$$\alpha_j(a) = \sum_{s \in S_{a,j}} f_j(s),$$
where $S_{a,j}$ represents the set of states $s$ for which $\pi_j(s) = a$. This is equivalent to saying that if all the agents follow the optimal oblivious strategy $\pi$, then the population state distribution $f$ and the population action distribution $\alpha$ satisfy $f \in \mathcal{D}(\pi, \alpha)$ and $\alpha = \Phi(\pi, f)$.
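Under the notation assumed in this section, the two conditions can be seen in a single worked chain by substituting the state recursion into the expression for the induced action distribution:
$$\alpha_j(a) \;=\; \sum_{s' \in \mathcal{S}} \mathbb{1}\{\pi_j(s') = a\}\, f_j(s')
\;=\; \sum_{s' \in \mathcal{S}} \mathbb{1}\{\pi_j(s') = a\} \sum_{s \in \mathcal{S}} f_{j-1}(s)\, P\big(s' \mid s, \pi_{j-1}(s), \alpha_{j-1}\big).$$
Once the population quantities have converged to their limits, the right-hand side depends only on $\pi$ and $f$, which is exactly the condition $\alpha = \Phi(\pi, f)$, while the inner recursion is the invariance condition $f \in \mathcal{D}(\pi, \alpha)$.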
4.3. Sampling Does Not Lead to a Gap for Expected Value Function
In the last subsection, we proved that there exists an optimal oblivious strategy. We now show that the policy $\pi^k$ generated by Algorithm 2 for the system dynamics $M_k$ sampled in episode $k$ by Algorithm 1 converges to the optimal oblivious policy $\pi^*$. To show the convergence of $\pi^k$ to $\pi^*$, we show that the optimal value function of the sampled system dynamics converges to the optimal value function of the true system dynamics.
We will first describe the lemmas that are used to prove the required result. We start by stating the Azuma-Hoeffding Lemma for obtaining confidence intervals.
Lemma 2 (Azuma-Hoeffding Lemma [15]). If $\{X_\tau\}$ is a zero-mean martingale with almost surely bounded increments, $|X_{\tau+1} - X_\tau| \leq C$, then for any $\delta > 0$, with probability at least $1 - \delta$,
$$|X_\tau| \leq C \sqrt{2 \tau \log(2/\delta)}.$$
We also utilize the following result of [15] on any $\mathcal{H}_{t_k}$-measurable function $g$.
Lemma 3. If $f$ is the distribution of the true system dynamics $M_*$ conditioned on the history $\mathcal{H}_{t_k}$, then, for any $\mathcal{H}_{t_k}$-measurable function $g$,
$$\mathbb{E}\big[g(M_k) \mid \mathcal{H}_{t_k}\big] = \mathbb{E}\big[g(M_*) \mid \mathcal{H}_{t_k}\big].$$
At the start of every episode, each agent samples system dynamics $M_k$ from the posterior distribution given $\mathcal{H}_{t_k}$. The following result bounds the difference between the optimal value function of the true system dynamics under the optimal policy $\pi^*$, which is unknown, and the optimal value function achieved by the sampled system dynamics under the policy $\pi^k$.
Lemma 4. Let $V_l^{k}$ be the optimal value function of an oblivious policy for the sampled system dynamics $M_k$ chosen from Algorithm 1. Then $V_l^{k}$ converges to the optimal value function, $V_l^{*}$, of the true system dynamics, i.e., for all states $s$, as $T \to \infty$, with probability at least $1 - \delta$,
$$\frac{1}{K} \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \longrightarrow 0.$$
Proof. We note that, since the optimal value function is $\mathcal{H}_{t_k}$-measurable, we can use Lemma 3 to bound the difference between the optimal value functions of the sampled distribution at episode $k$ and the true distribution, showing that, for all states $s$,
$$\mathbb{E}\Big[ V_l^{*}(s) - V_l^{k}(s) \,\Big|\, \mathcal{H}_{t_k} \Big] = 0.$$
Note that the length of each episode is $H$ and the support of the reward is $[0, 1]$. Therefore, for all states $s$, we have $\big| V_l^{*}(s) - V_l^{k}(s) \big| \leq H$. Note that this condition is similar to the bounded-increments condition in the Azuma-Hoeffding Lemma (Lemma 2).
Since $\sum_{k' \leq k} \big( V_l^{*}(s) - V_l^{k'}(s) \big)$ is a zero-mean martingale with respect to the filtration $\{\mathcal{H}_{t_k}\}$ and satisfies the assumptions of the Azuma-Hoeffding Lemma, we obtain the result in the statement of the lemma. More precisely, for all states $s$, the increments $V_l^{*}(s) - V_l^{k}(s)$ have zero conditional mean and are bounded by $H$, so the partial sums form a zero-mean martingale with the bounded-increments property. Applying the Azuma-Hoeffding Lemma to this martingale, we have, with probability at least $1 - \delta$,
$$\Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq H \sqrt{2 K \log(2/\delta)}.$$
For a total time $T$ of the algorithm, we have $K = T/H$ episodes. Thus, we obtain
$$\Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq \sqrt{2 T H \log(2/\delta)}.$$
Thus, for all states $s$, as $T \to \infty$, we have
$$\frac{1}{K} \Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq H \sqrt{\frac{2 H \log(2/\delta)}{T}} \longrightarrow 0.$$
Substituting $K = T/H$, the above expression bounds the average difference per episode, which converges to zero as the total time $T \to \infty$; this gives us the convergence of the optimal value functions of the two distributions. Thus, we have
$$\lim_{T \to \infty} \frac{1}{K} \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) = 0 \qquad \forall\, s \in \mathcal{S}.$$
□
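To give a sense of the rate implied by the bound above, take the illustrative values $H = 10$, $\delta = 0.05$, and $T = 10^{6}$ (these numbers are examples, not values used in the paper). The per-episode average gap is then at most
$$H \sqrt{\frac{2 H \log(2/\delta)}{T}} \;=\; 10 \sqrt{\frac{2 \cdot 10 \cdot \log 40}{10^{6}}} \;\approx\; 0.086,$$
and it shrinks proportionally to $1/\sqrt{T}$ as more time steps are collected.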