2. Background
2.1. Multi-Player Stochastic Game
An $n$-player stochastic game is formalized by the system dynamics tuple $\mathcal{M} = (n, \mathcal{S}, \mathcal{A}, P, r, \gamma, \rho)$. The agents are indexed by the set $[n] = \{1, 2, \ldots, n\}$. The state of the $i$th agent at time $t$ is given by $s_{i,t} \in \mathcal{S}$, where $\mathcal{S}$ is the state space set. $\mathcal{A}(s)$ is defined as the set of feasible actions any agent can take in state $s$, and $\mathcal{A} = \cup_{s \in \mathcal{S}} \mathcal{A}(s)$ is the action space set. We also assume that both $\mathcal{S}$ and $\mathcal{A}$ are finite sets. Since finite sets are also compact, this assumption allows us to use results from the previous work of [10]. Since we have $n$ agents in our system, the combined state space of the system becomes $\mathcal{S}^n$, and the combined action space of the system becomes $\mathcal{A}^n$. Let $\mathbf{s}_t$ be a vector of length $n$ whose $i$th element denotes the state of the $i$th agent at time $t$. Similarly, let $\mathbf{a}_t$ be a vector of length $n$ whose $i$th element denotes the action taken by the $i$th agent at time $t$.
If the agents play the joint action $\mathbf{a}_t$, the next state of the system follows the probability distribution $P(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$. Along with the state update, the $i$th agent also receives a reward $r_i$. We further assume that the reward function is independent of the agent index $i$, and each player is trying to optimize the same reward. Hence, we can drop the subscript $i$ in the reward function $r$. However, since each agent can be in a different state or play a different action, their individual rewards will differ, and we still use the subscript $i$ to differentiate the instantaneous reward of the $i$th agent. The constant $\gamma \in (0, 1)$ is the discount factor, and $\rho$ is the initial state distribution such that $\mathbf{s}_0 \sim \rho$.
We consider an episodic framework where the length of the time horizon, or the length of an episode, is $H$. The state space set $\mathcal{S}$, the action space set $\mathcal{A}$, and the horizon $H$ are known and need not be learned by the agent. We consider that the game is played in episodes $k = 1, 2, \ldots$. In each episode, the game is played in discrete steps $j = 1, \ldots, H$. The episodes begin at times $t_k = (k-1)H + 1$. At each time $t$, the state of agent $i$ is given by $s_{i,t}$, the agent selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and the state transitions to $s_{i,t+1}$. Let $\mathcal{H}_{i,t}$ denote the history available to agent $i$ up to time $t$.
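To make the episodic protocol concrete, the following minimal Python sketch plays one episode from a single agent's perspective; the environment object `env`, its `reset`/`step` interface, and the `policy` callable are illustrative assumptions rather than objects defined in the paper.

```python
from collections import namedtuple

# Hypothetical single-agent view of the episodic interaction described above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

def run_episode(env, policy, horizon):
    """Play one episode of length `horizon` and return the per-step history."""
    history = []
    state = env.reset()                      # s_{i,1} drawn from the initial distribution
    for j in range(1, horizon + 1):
        action = policy(j, state)            # oblivious: depends only on step index and own state
        reward, next_state = env.step(action)
        history.append(Transition(state, action, reward, next_state))
        state = next_state
    return history
```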
2.2. Mean-Field Game
In a game with a large number of players, we might expect that the distribution of agents over the action space carries more meaning than the actions themselves. It is intuitive that a single agent has a negligible effect on the game as the number of agents increases. The effect of other agents on a single agent’s payoff is only via the action distribution of the population. This intuition is formalized in the mean-field game. We now formally define the mean-field game and equilibrium concepts.
First, we define some notation. Let $\alpha_{-i,t}(a)$ be the fraction of the agents (excluding agent $i$) that take action $a$ at time $t$. Mathematically, we have
$$\alpha_{-i,t}(a) = \frac{1}{n-1} \sum_{j \neq i} \mathbb{1}\{a_{j,t} = a\},$$
where $\mathbb{1}\{a_{j,t} = a\}$ is the indicator function that agent $j$ takes action $a$ at time $t$. Since we assume an episodic framework, $\alpha_{-i,t}$ can be different at each time index in an episode and also across episodes. The episodic nature of the problem will be used later (in (9)) to define convergence to a value that depends only on the time index in the episode. Further, since each agent selects exactly one action from $\mathcal{A}$, we have
$$\sum_{a \in \mathcal{A}} \alpha_{-i,t}(a) = 1.$$
Similar to the distribution of agents over actions, we define $f_{-i,t}(s)$ as the distribution of agents (excluding agent $i$) over the state space $\mathcal{S}$.
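As a small illustration of these definitions, the sketch below computes the empirical action distribution $\alpha_{-i,t}$ and state distribution $f_{-i,t}$ from the joint state and action vectors at a given time; encoding states and actions as integer indices is an assumption made only for this example.

```python
import numpy as np

def empirical_distributions(states, actions, i, num_states, num_actions):
    """Empirical state/action distributions of all agents except agent i.

    `states` and `actions` are length-n integer arrays holding s_{j,t} and a_{j,t}.
    """
    mask = np.arange(len(states)) != i
    # Fraction of the other agents taking each action (alpha_{-i,t}).
    alpha = np.bincount(actions[mask], minlength=num_actions) / (len(actions) - 1)
    # Fraction of the other agents in each state (f_{-i,t}).
    f = np.bincount(states[mask], minlength=num_states) / (len(states) - 1)
    return alpha, f
```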
In a mean-field game, every agent $i$ assumes that its next state $s_{i,t+1}$ is randomly distributed according to the transition probability distribution $P(\cdot \mid s_{i,t}, a_{i,t}, \alpha_{-i,t})$, conditioned on the agent's current state $s_{i,t}$, the action taken $a_{i,t}$, and the other agents' distribution over actions $\alpha_{-i,t}$. Also, the reward is a function of the agent's current state, its action, and the next state.
Thus, the agent does not need to concern itself with the actions of the other agents, as the population action distribution becomes part of the environment. These updated environment dynamics can now be used in decision-making. Note that the population action distribution could also be explicitly taken into account when deciding an action. The reward may also depend on the state distribution $f$ of the other agents; the analysis would be similar.
Example 1. We now provide an example drawn from a real-life application that matches our setting. Consider the scenario of malware spreading. The state $s_{i,t} = 0$ means that agent $i$ is vaccinated and cannot be infected at time $t$. Otherwise, $s_{i,t}$ can vary between 0 and 1 with $n$ quantization levels, representing the level of infection. The action space of an agent is $\{0, 1\}$. If $a_{i,t} = 0$, the agent does not take any action. If $a_{i,t} = 1$, agent $i$ takes an action in order to protect itself. To simplify the model, we consider a state evolution in which the next state $s_{i,t+1}$ depends on the current state $s_{i,t}$, the action $a_{i,t}$, the fraction of unprotected agents $\alpha_{-i,t}(0)$, and a noise term $w_t$ taking values in $[0, 1]$. Note that if all the other agents have taken action 1, i.e., they protect themselves, then an agent has less chance of being infected. Thus, the next state is smaller when the fraction of agents taking action 0 (i.e., $\alpha_{-i,t}(0)$) is smaller.
The reward function for agent $i$ increases with the fraction of agents who have protected themselves, and it includes a cost $\lambda$ that depends on the action taken. This example illustrates the necessity of computing the mean-field equilibrium.
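Since the explicit state-evolution and reward expressions are not reproduced here, the following Python sketch gives one possible instantiation that is consistent with the qualitative description above (infection grows with the fraction of unprotected agents and a noise term, protecting clears it, and the reward favors a protected population while charging an action cost $\lambda$); the concrete functional forms are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_state(s, a, alpha0, w):
    """Illustrative state update: the infection level grows with the fraction of
    unprotected agents (alpha0) and the noise w in [0, 1], while protecting
    (a = 1) removes the infection. The exact form is an assumption."""
    return 0.0 if a == 1 else min(1.0, s + alpha0 * w)

def reward(s_next, a, alpha0, lam=0.1):
    """Illustrative reward: larger when more agents protect themselves (small alpha0)
    and the agent ends up less infected, minus an action-dependent cost lam * a."""
    return (1.0 - alpha0) - s_next - lam * a

n = 1000
states = rng.uniform(size=n)          # infection levels in [0, 1]
actions = rng.integers(0, 2, size=n)  # 0: do nothing, 1: protect

new_states = np.empty(n)
for i in range(n):
    # Fraction of the *other* agents taking action 0.
    alpha0 = (np.sum(actions == 0) - (actions[i] == 0)) / (n - 1)
    w = rng.uniform()                 # noise in [0, 1]
    new_states[i] = next_state(states[i], actions[i], alpha0, w)
```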
Note that, in general, the action of an agent should depend on the action distribution of the other agents. However, Proposition 1 of [10] says that, under equilibrium, an oblivious strategy performs as well as a strategy that considers other agents' actions. Thus, the strategy does not need to explicitly consider the value of $\alpha_{-i,t}$.
Definition 1. An agent is said to follow an oblivious deterministic strategy $\pi$ when it selects an action $a_{i,t} = \pi_j(s_{i,t})$ considering only the time index $j$ in an episode and its current state $s_{i,t}$. For the rest of the paper, we will focus on oblivious deterministic strategies for all agents.
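A convenient way to picture an oblivious deterministic strategy is as a lookup table indexed by the step within an episode and the agent's own state; the tabular encoding below is an illustrative choice, not notation from the paper.

```python
import numpy as np

# An oblivious deterministic strategy stored as an H x |S| integer table:
# policy_table[j - 1, s] is the action taken at the j-th step of an episode
# when the agent's own state is s.
H, num_states, num_actions = 10, 5, 3
policy_table = np.zeros((H, num_states), dtype=int)

def oblivious_policy(j, s):
    """Action at episode step j (1-indexed) in state s; ignores other agents."""
    return int(policy_table[j - 1, s])
```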
2.3. Value Function, Q Function and Policy
We now define the value function of an agent $i$ for an oblivious policy $\pi$ at the $l$th time step in an episode as
$$V_l^{\pi}(s) = \mathbb{E}\left[ \sum_{j=l}^{H} r\big(s_{i,j}, a_{i,j}, s_{i,j+1}\big) \,\middle|\, s_{i,l} = s,\; a_{i,j} = \pi_j(s_{i,j}) \right]. \qquad (8)$$
The expectation in Equation (8) is taken over the actions taken from time step $l$ and the states visited after time step $l$ in an episode. We will consider the rest of the definitions from some $i$th agent's perspective, so the subscripts $i$ and $-i$ will be dropped for brevity.
We note that the action space and state space are finite, and hence the set of strategies available to the players is also finite. Each player adopts lower-myopic best response dynamics to choose its policy. A lower-myopic policy selects the action with the lowest index among the actions that maximize the value function. As time proceeds, the strategies and the action distribution converge to the asymptotic equilibrium [10].
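The lower-myopic tie-breaking rule is easy to implement: among the maximizers of the value (or Q-) function, pick the action with the smallest index. A minimal sketch:

```python
import numpy as np

def lower_myopic_action(q_values):
    """Among the actions maximizing the Q-value, return the one with the lowest index.

    np.argmax already breaks ties in favour of the first (lowest-index) maximizer,
    which is exactly the lower-myopic rule described above.
    """
    return int(np.argmax(q_values))

# Example: actions 1 and 3 both achieve the maximum value 2.0; the rule picks 1.
assert lower_myopic_action(np.array([0.5, 2.0, 1.0, 2.0])) == 1
```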
Let $\alpha_j$ be the limiting population action distribution for the $j$th time index as the episode index $k$ grows, and let $\alpha_j^k$ denote the population action distribution at the $j$th time index of episode $k$. We note that, due to the episodic framework, the limiting action distribution depends on the index within an episode. Then, from the definition of a limit, for every $\epsilon > 0$ there exists a $k_0$ such that for all $k \geq k_0$ we have
$$\left\| \alpha_j^k - \alpha_j \right\|_1 \leq \epsilon, \qquad (9)$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm.
The value function defined in Equation (8) satisfies the Bellman property for finite-horizon MDPs, given by
$$V_l^{\pi}(s) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, \pi_l(s),\, \alpha_l)}\Big[ r\big(s, \pi_l(s), s'\big) + V_{l+1}^{\pi}(s') \Big],$$
where $V_{H+1}^{\pi}(s) = 0$ for all $s \in \mathcal{S}$, and $l \in \{1, \ldots, H\}$.
Similarly, we also define the Q-function as
$$Q_l^{\pi}(s, a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, a,\, \alpha_l)}\Big[ r(s, a, s') + V_{l+1}^{\pi}(s') \Big].$$
We further consider that the agents are strategic and hence care only about their individual rewards. The goal of each agent is to find an optimal oblivious policy $\pi^*$ such that
$$V_l^{\pi^*}(s) \;\geq\; V_l^{\pi}(s) \qquad \forall\, \pi,\; s \in \mathcal{S},\; l \in \{1, \ldots, H\}.$$
Let $Q_l^{*}(s, a) = \max_{\pi} Q_l^{\pi}(s, a)$ denote the optimal Q-function; then we can define the optimal oblivious strategy.
Definition 2. The set $\Pi(\alpha)$ is the set of optimal oblivious strategies chosen from the Q-function generated by $\alpha$. In other words, for a given $\alpha$, a policy $\pi \in \Pi(\alpha)$ if and only if
$$\pi_l(s) \in \arg\max_{a \in \mathcal{A}} Q_l^{*}(s, a) \qquad \forall\, s \in \mathcal{S},\; l \in \{1, \ldots, H\}.$$
Here, the policy $\pi_l$ is used at the $l$th time index in an episode so that the Q-value $Q_l^{*}\big(s, \pi_l(s)\big)$ is maximized for all states $s \in \mathcal{S}$.
Note that a policy $\pi \in \Pi(\alpha)$ does not depend on the distribution $f$ explicitly. Hence, it is an oblivious strategy where each agent takes its decision based only on its own observed state. Since the reward function is bounded and the state and action spaces are finite, the set $\Pi(\alpha)$ is always non-empty. However, finding the optimal action is challenging for an oblivious strategy profile. We denote the initial population state distribution by $f_0$. We note that as $\alpha$ evolves, the population state distribution $f$ also evolves. After convergence, for a time index $j$ in any episode, the population state distribution converges to the limiting population state distribution $f_j$, or
$$\lim_{k \to \infty} f_j^k = f_j.$$
2.4. Stationary Mean-Field Equilibrium
Throughout this paper, we seek to compute a mean-field equilibrium and the corresponding action strategy. Thus, the action taken by an agent depends only on its own state, independent of the episode. Further, such an action profile should converge to stationary action and state distributions. We now formally define a mean-field equilibrium.
Definition 3 ([24]). We say that a Mean-Field Equilibrium (MFE) is achieved by an oblivious strategy $\pi$ if the strategy of the players, the population action distribution $\alpha$, and the population state distribution $f$ are such that $\pi \in \Pi(\alpha)$, $\alpha$ is the action distribution induced by $\pi$ and $f$, and $f$ is invariant under $\pi$ and $\alpha$ (we write $f \in \mathcal{D}(\pi, \alpha)$ for the set of such invariant state distributions). Specifically, if we fix $f$ and each agent takes an action which belongs to $\Pi(\alpha)$, then the induced action distribution should return $\alpha$, as it is invariant under the transition probability. Further, if we fix $\alpha$ and each agent takes action $\pi(s)$, the state distribution should give back $f$.
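For small finite models, the two invariance conditions in Definition 3 can be checked numerically. The sketch below assumes a stationary tabular policy `pi`, a state distribution `f`, an action distribution `alpha`, and a transition kernel `P[s, a, s']` already evaluated at the fixed $\alpha$; it checks only the invariance conditions, not the optimality condition $\pi \in \Pi(\alpha)$.

```python
import numpy as np

def induced_action_distribution(pi, f, num_actions):
    """alpha(a) = sum of f(s) over the states s with pi(s) = a."""
    alpha = np.zeros(num_actions)
    for s, prob in enumerate(f):
        alpha[pi[s]] += prob
    return alpha

def next_state_distribution(pi, f, P):
    """One-step update of the population state distribution under pi.

    P[s, a, s2] is the kernel already evaluated at the fixed alpha (an assumption)."""
    return np.array([sum(f[s] * P[s, pi[s], s2] for s in range(len(f)))
                     for s2 in range(P.shape[2])])

def is_mfe_consistent(pi, f, alpha, P, tol=1e-8):
    """Check that pi and f induce alpha and that f is invariant under the dynamics."""
    alpha_ok = np.allclose(induced_action_distribution(pi, f, len(alpha)), alpha, atol=tol)
    f_ok = np.allclose(next_state_distribution(pi, f, P), f, atol=tol)
    return alpha_ok and f_ok
```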
Since the players are learning an oblivious strategy, no agent observes the states or actions of the other agents. Also, an agent does not know the probability transition matrix or the reward function, and will try to estimate them from past observations, as described in the next section.
3. Proposed Algorithm
In this section, we propose an algorithm which will be shown to converge to the mean-field equilibrium (MFE) in the following section. For each agent $i$, the algorithm begins with a prior distribution $g$ over the stochastic game with state space set $\mathcal{S}$, action space set $\mathcal{A}$, and time horizon $H$. The prior distribution $g$ for modeling the state transition probability distribution is typically taken to be a Dirichlet distribution [14,15].
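As an illustration of the Dirichlet prior/posterior machinery, the sketch below samples a transition kernel from a Dirichlet posterior maintained through transition counts; the array layout and the symmetric prior parameter are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition_kernel(counts, prior=1.0):
    """Sample P(. | s, a) for every (s, a) pair from a Dirichlet posterior.

    `counts[s, a, s2]` is the number of observed transitions s -> s2 under action a;
    `prior` is a symmetric Dirichlet prior parameter. This is a sketch of the standard
    Dirichlet posterior update used in posterior sampling [14,15], with array names
    chosen for illustration.
    """
    num_states, num_actions, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(num_states):
        for a in range(num_actions):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P
```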
The game is played in episodes $k = 1, 2, \ldots, K$. The length of each episode is given by $H$. In each episode, the game is played in discrete steps $j = 1, \ldots, H$. The episodes begin at times $t_k = (k-1)H + 1$. At each time $t$, the state of the agent is given by $s_{i,t}$, it selects an action $a_{i,t}$, observes a scalar reward $r_{i,t}$, and then transitions to the state $s_{i,t+1}$. Let $\mathcal{H}_{i,t}$ denote the history of the agent up to time $t$.
The proposed algorithm is described in Algorithm 1. At the beginning of each episode, an MDP $M_k$ is sampled from the posterior distribution conditioned on the history $\mathcal{H}_{t_k}$ in Line 4. We note that the sampling of the MDP only relates to sampling the transition probability $P$ and the reward distribution, since the rest of the parameters are known. We note that the algorithm does not perform explicit exploration like an $\epsilon$-greedy algorithm. Instead, the algorithm samples a new MDP $M_k$ for episode $k$ in Line 4. The algorithm can then generate a new trajectory from the policy [14,15] solved for the sampled MDP $M_k$. We assume that, after some samples, $\alpha$ has converged. The proposed algorithm converges as the induced transition probability and reward function converge once $\alpha$ converges.
Algorithm 1 Proposed Algorithm for Mean-Field Game with Best Response Learning Dynamics.
1: Input: Prior distribution $g$, time horizon $H$
2: Initialize the history $\mathcal{H}_1 = \emptyset$.
3: for episodes $k = 1, 2, \ldots$ do
4:  Sample $M_k \sim g(\cdot \mid \mathcal{H}_{t_k})$.
5:  Obtain the optimal $Q$ for $M_k$ from Algorithm 2
6:  for time steps $j = 1, \ldots, H$ do
7:   Play $a_{i,t} \in \arg\max_{a} Q_j(s_{i,t}, a)$ (lower-myopic tie-breaking).
8:   Observe the reward $r_{i,t}$, the action of the agent $a_{i,t}$, and the next state $s_{i,t+1}$.
9:   Append the action taken $a_{i,t}$, the reward obtained $r_{i,t}$, and the state update $s_{i,t+1}$ to the history $\mathcal{H}_{t+1}$.
10:  end for
11: end for
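A compact Python rendering of the episode loop in Algorithm 1, from a single agent's perspective, is given below. The helpers `sample_transition_kernel` and `backward_induction` are the hypothetical routines sketched around this section, and tracking the mean reward empirically is an illustrative simplification of sampling the reward distribution.

```python
import numpy as np

def run_agent(env, H, num_episodes, num_states, num_actions):
    """Posterior-sampling loop mirroring Algorithm 1 for a single agent (a sketch)."""
    counts = np.zeros((num_states, num_actions, num_states))  # transition counts
    visits = np.zeros((num_states, num_actions))               # (s, a) visit counts
    r_hat = np.zeros((num_states, num_actions))                # empirical mean rewards
    for k in range(num_episodes):
        P_k = sample_transition_kernel(counts)                 # Line 4: sample M_k from posterior
        Q = backward_induction(P_k, r_hat, H)                  # Line 5: Algorithm 2
        s = env.reset()
        for j in range(1, H + 1):
            a = int(np.argmax(Q[j - 1, s]))                    # Line 7: lower-myopic action
            r, s_next = env.step(a)                            # Line 8: observe reward, next state
            counts[s, a, s_next] += 1                          # Line 9: update the history
            visits[s, a] += 1
            r_hat[s, a] += (r - r_hat[s, a]) / visits[s, a]
            s = s_next
    return Q
```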
We use the Backward Induction algorithm [25], described in Algorithm 2, to obtain the Q-value function for the currently sampled MDP (Line 5, Algorithm 1). Backward induction in Algorithm 2 starts from the end of the episode and calculates the potential maximum rewards for each state and action (Line 5). The algorithm then goes back in the episode (Line 8) to calculate the maximum possible cumulative reward for each state and action in Line 11. After all the time indices in an episode are covered, the algorithm returns the calculated optimal Q-values. We obtain the policy $\pi^k$ from the calculated Q-values, and the policy is not altered within an episode. Recall that, for a given $\alpha$, a policy $\pi \in \Pi(\alpha)$ if and only if $\pi_l(s) \in \arg\max_{a} Q_l^{*}(s, a)$ for all $s \in \mathcal{S}$ and $l \in \{1, \ldots, H\}$. Let $\alpha^k$ be the population action distribution in episode $k$; then the algorithm aims to choose a policy $\pi^k \in \Pi(\alpha^k)$. In order to choose the policy $\pi^k$ from the set $\Pi(\alpha^k)$, we use lower-myopic learning dynamics, where at each episode we choose the strategy with the smallest action index in the set of maximizers $\arg\max_{a} Q_l^{*}(s, a)$.
Algorithm 2 Backwards Induction Algorithm.
1: Input: $M_k = (\mathcal{S}, \mathcal{A}, P_k, \bar{r}_k, H)$ ▹ Sampled MDP from Algorithm 1
2: Initialize $Q_j(s, a) = 0$ for all $j \in \{1, \ldots, H\}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
3: for state $s \in \mathcal{S}$ do
4:  for action $a \in \mathcal{A}$ do
5:   Update the Q-value function for the last time index: $Q_H(s, a) = \bar{r}_k(s, a)$
6:  end for
7: end for
8: for time steps $j = H-1, \ldots, 1$ do
9:  for state $s \in \mathcal{S}$ do
10:   for action $a \in \mathcal{A}$ do
11:    $Q_j(s, a) = \bar{r}_k(s, a) + \sum_{s' \in \mathcal{S}} P_k(s' \mid s, a) \max_{a'} Q_{j+1}(s', a')$
12:   end for
13:  end for
14: end for
15: Return: $Q$
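The backward induction step of Algorithm 2 can be written compactly for a tabular sampled MDP as follows; the array shapes match the sketches above and are assumptions of the example.

```python
import numpy as np

def backward_induction(P, r, H):
    """Backward induction over a sampled finite-horizon MDP, as in Algorithm 2.

    P[s, a, s2] is the sampled transition kernel and r[s, a] the (estimated) mean
    reward. Returns Q with shape (H, |S|, |A|), where Q[j-1] is the table for step j.
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((H, num_states, num_actions))
    Q[H - 1] = r                                   # Lines 3-7: last step, no future value
    for j in range(H - 2, -1, -1):                 # Line 8: go backwards through the episode
        V_next = Q[j + 1].max(axis=1)              # optimal value one step ahead
        Q[j] = r + P @ V_next                      # Line 11: reward plus expected future value
    return Q
```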
We note that an estimate of $\alpha$ is used in the algorithm instead of $\alpha$ itself, since $\alpha$, the true distribution, is not known. In order to obtain an estimate, each agent samples a transition probability matrix according to the posterior distribution. Each agent follows the strategy given by the Q-values over the episode. Based on the action decisions of the agents, we update the value function and the Q-function using the obtained rewards, which depend on the value of $\alpha$. The detailed algorithm steps can be seen in Algorithm 1. We note that, as the algorithm converges, the value of $\alpha$ converges, and thus all the transition probabilities and value functions depend on the limiting distribution.
4. Convergence Result
In this section, we will show that if the oblivious strategy is chosen according to the proposed algorithm, then the oblivious strategy and the limiting population action distribution constitute a Mean-Field Equilibrium (MFE). More formally, we obtain the following result.
Theorem 1. The optimal oblivious strategy obtained from Algorithm 1 and the limiting action distribution constitute a mean-field equilibrium, and the value function obtained from the algorithm converges to the optimal value function of the true distribution.
The rest of the section proves this result. We first note that the lower-myopic best response strategy leads to convergence of the action strategy, following the results in [10] for finite action and state spaces. We note that there might be multiple actions that maximize the state-action value function, which may lead to choosing different actions at different iterations for the same state. To avoid oscillations between the best actions and hence keep the policy stable, we choose a lower-myopic strategy. This lower-myopic strategy avoids conflicts when the agents have a non-unique strategy that maximizes the value function. Further, any consistent way of resolving the multiple optima, including an upper-myopic rule, would give the same result. Having shown that the population action distribution $\alpha^k$ converges, we now proceed to show that the converged point of the algorithm results in an MFE.
We first show the conditions needed for a policy $\pi$, a population state distribution $f$, and a population action distribution $\alpha$ to constitute an MFE (Section 4.1). Then, we show that the conditions for the policy to be an MFE given in Section 4.1 are met for any optimal oblivious strategy (Section 4.2). Thus, the key property required to show the desired result is that the proposed algorithm leads to an optimal oblivious strategy. In order to show that, we show that the value function of the sampled distribution converges to that of the true distribution (Section 4.3). The result in Section 4.3 shows that the value function iterates eventually converge to the value function with knowledge of the true underlying transition probability distribution, thus proving that the proposed algorithm converges to an optimal oblivious strategy which constitutes a mean-field equilibrium, thereby proving the theorem.
4.1. Conditions for a Strategy to Be an MFE
In this section, we describe the conditions for an oblivious strategy $\pi$ to be an MFE. Recall that, in Section 2, we defined two maps $\Pi$ and $\mathcal{D}$. For a given action-coupled stochastic game, the map $\Pi$ for a given population action distribution $\alpha$ gives the set of optimal oblivious strategies. Further, the map $\mathcal{D}$ for a given population action distribution $\alpha$ and oblivious strategy $\pi$ gives the set of invariant population state distributions $f$.
We define the map $\Phi$, which gives the population action distribution $\alpha$ induced by the oblivious strategy $\pi$ and the population state distribution $f$. The following lemma gives the conditions under which the stochastic game constitutes a mean-field equilibrium. These conditions have been provided in [11], and the reader is referred to [11] for further details and a proof of this result.
Lemma 1 (Definition 7 [11]). An action-coupled stochastic game with strategy $\pi$, population state distribution $f$, and population action distribution $\alpha$ constitutes a mean-field equilibrium if $\pi \in \Pi(\alpha)$, $f \in \mathcal{D}(\pi, \alpha)$, and $\alpha = \Phi(\pi, f)$.
4.2. Conditions of Lemma 1 Are Met for Any Optimal Oblivious Strategy
In this section, we show that the conditions of Lemma 1 are met for any optimal oblivious strategy. In a mean-field equilibrium, each agent plays according to a strategy $\pi \in \Pi(\alpha)$. If the average population action distribution is $\alpha$ and each agent follows an oblivious strategy, then the evolution of the states must be such that the oblivious strategy applied to those states leads to an average action distribution of $\alpha$. Since we assume a large number of agents, including agent $i$'s own state does not change the population state distribution. So, we let the average population state distribution at time index $j$ be
$$f_j(s) = \frac{1}{n} \sum_{m=1}^{n} \mathbb{1}\{s_{m,j} = s\},$$
where $s_{m,j}$ is the state of agent $m$ at time index $j$. Similarly, we also include agent $i$'s action in the population action distribution $\alpha_j$. Then, in expectation (equivalently, for large $n$), we have
$$\alpha_j(a) = \sum_{s \in \mathcal{S}} \mathbb{1}\{\pi_j(s) = a\}\, \frac{1}{n} \sum_{m=1}^{n} p_{m,j}(s),
\qquad
p_{m,j+1}(s') = \sum_{s \in \mathcal{S}} p_{m,j}(s)\, P\big(s' \mid s, \pi_j(s), \alpha_j\big),$$
where $p_{m,j}(s)$ is the probability of agent $m$ being in state $s$ at time index $j$. Here, the second equation follows from the transition probability matrix. Recursively replacing $p_{m,j}$ using this recursion for all $j$ gives the required result of $\alpha = \Phi(\pi, f)$.
The above statement also implies that $\alpha$ must satisfy
$$\alpha_j(a) = \sum_{s \in S_{a,j}} f_j(s),$$
where $S_{a,j}$ represents the set of states $s$ for which $\pi_j(s) = a$. This is equivalent to saying that if all the agents follow the optimal oblivious strategy $\pi$, then the population state distribution $f$ and the population action distribution $\alpha$ satisfy $f \in \mathcal{D}(\pi, \alpha)$ and $\alpha = \Phi(\pi, f)$.
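Under the notation assumed in this section, the two conditions can be seen in a single worked chain by substituting the state recursion into the expression for the induced action distribution:
$$\alpha_j(a) \;=\; \sum_{s' \in \mathcal{S}} \mathbb{1}\{\pi_j(s') = a\}\, f_j(s')
\;=\; \sum_{s' \in \mathcal{S}} \mathbb{1}\{\pi_j(s') = a\} \sum_{s \in \mathcal{S}} f_{j-1}(s)\, P\big(s' \mid s, \pi_{j-1}(s), \alpha_{j-1}\big).$$
Once the population quantities have converged to their limits, the right-hand side depends only on $\pi$ and $f$, which is exactly the condition $\alpha = \Phi(\pi, f)$, while the inner recursion is the invariance condition $f \in \mathcal{D}(\pi, \alpha)$.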
4.3. Sampling Does Not Lead to a Gap for Expected Value Function
In the last subsection, we proved that there exists an optimal oblivious strategy. We now show that the policy $\pi^k$ generated by Algorithm 2 for the system dynamics $M_k$ sampled in episode $k$ by Algorithm 1 converges to the optimal oblivious policy $\pi^*$. To show the convergence of $\pi^k$ to $\pi^*$, we show that the optimal value function of the sampled system dynamics converges to the optimal value function of the true system dynamics.
We will first describe the lemmas that are used to prove the required result. We start by stating the Azuma-Hoeffding Lemma for obtaining confidence intervals.
Lemma 2 (Azuma-Hoeffding Lemma [15]). If $\{X_\tau\}$ is a zero-mean martingale with almost surely bounded increments, $|X_{\tau+1} - X_\tau| \leq C$, then for any $\delta > 0$, with probability at least $1 - \delta$,
$$|X_\tau| \leq C \sqrt{2 \tau \log(2/\delta)}.$$
We also utilize the following result of [15] on any $\mathcal{H}_{t_k}$-measurable function $g$.
Lemma 3. If $f$ is the distribution of the true system dynamics $M_*$ conditioned on the history $\mathcal{H}_{t_k}$, then, for any $\mathcal{H}_{t_k}$-measurable function $g$,
$$\mathbb{E}\big[g(M_k) \mid \mathcal{H}_{t_k}\big] = \mathbb{E}\big[g(M_*) \mid \mathcal{H}_{t_k}\big].$$
At the start of every episode, each agent samples system dynamics $M_k$ from the posterior distribution given $\mathcal{H}_{t_k}$. The following result bounds the difference between the optimal value function of the true system dynamics under the optimal policy $\pi^*$, which is unknown, and the optimal value function achieved by the sampled system dynamics under the policy $\pi^k$.
Lemma 4. Let $V_l^{k}$ be the optimal value function of an oblivious policy for the sampled system dynamics $M_k$ chosen from Algorithm 1. Then $V_l^{k}$ converges to the optimal value function, $V_l^{*}$, of the true system dynamics, i.e., for all states $s$, as $T \to \infty$, with probability at least $1 - \delta$,
$$\frac{1}{K} \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \longrightarrow 0.$$
Proof. We note that, since the optimal value function is $\mathcal{H}_{t_k}$-measurable, we can use Lemma 3 to bound the difference between the optimal value functions of the sampled distribution at episode $k$ and the true distribution, showing that, for all states $s$,
$$\mathbb{E}\Big[ V_l^{*}(s) - V_l^{k}(s) \,\Big|\, \mathcal{H}_{t_k} \Big] = 0.$$
Note that the length of each episode is $H$ and the support of the reward is $[0, 1]$. Therefore, for all states $s$, we have $\big| V_l^{*}(s) - V_l^{k}(s) \big| \leq H$. Note that this condition is similar to the bounded-increments condition in the Azuma-Hoeffding Lemma (Lemma 2).
Since $\sum_{k' \leq k} \big( V_l^{*}(s) - V_l^{k'}(s) \big)$ is a zero-mean martingale with respect to the filtration $\{\mathcal{H}_{t_k}\}$ and satisfies the assumptions of the Azuma-Hoeffding Lemma, we obtain the result in the statement of the lemma. More precisely, for all states $s$, the increments $V_l^{*}(s) - V_l^{k}(s)$ have zero conditional mean and are bounded by $H$, so the partial sums form a zero-mean martingale with the bounded-increments property. Applying the Azuma-Hoeffding Lemma to this martingale, we have, with probability at least $1 - \delta$,
$$\Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq H \sqrt{2 K \log(2/\delta)}.$$
For a total time $T$ of the algorithm, we have $K = T/H$ episodes. Thus, we obtain
$$\Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq \sqrt{2 T H \log(2/\delta)}.$$
Thus, for all states $s$, as $T \to \infty$, we have
$$\frac{1}{K} \Big| \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) \Big| \leq H \sqrt{\frac{2 H \log(2/\delta)}{T}} \longrightarrow 0.$$
Substituting $K = T/H$, the above expression bounds the average difference per episode, which converges to zero as the total time $T \to \infty$; this gives us the convergence of the optimal value functions of the two distributions. Thus, we have
$$\lim_{T \to \infty} \frac{1}{K} \sum_{k=1}^{K} \Big( V_l^{*}(s) - V_l^{k}(s) \Big) = 0 \qquad \forall\, s \in \mathcal{S}.$$
□
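To give a sense of the rate implied by the bound above, take the illustrative values $H = 10$, $\delta = 0.05$, and $T = 10^{6}$ (these numbers are examples, not values used in the paper). The per-episode average gap is then at most
$$H \sqrt{\frac{2 H \log(2/\delta)}{T}} \;=\; 10 \sqrt{\frac{2 \cdot 10 \cdot \log 40}{10^{6}}} \;\approx\; 0.086,$$
and it shrinks proportionally to $1/\sqrt{T}$ as more time steps are collected.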