Multi-Agent Reinforcement Learning: A Review of Challenges and Applications

Abstract: In this review, we present an analysis of the most used multi-agent reinforcement learning algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the most critical issues that must be taken into account in their extension to multi-agent scenarios. The analyzed algorithms were grouped according to their features. We present a detailed taxonomy of the main multi-agent approaches proposed in the literature, focusing on their related mathematical models. For each algorithm, we describe the possible application fields, while pointing out its pros and cons. The described multi-agent algorithms are compared in terms of the most important characteristics for multi-agent reinforcement learning applications, namely, nonstationarity, scalability, and observability. We also describe the most common benchmark environments used to evaluate the performance of the considered methods.


Introduction
In the field of machine learning (ML), reinforcement learning (RL) has attracted the attention of the scientific community owing to its ability to solve a wide range of tasks by using a simple architecture and without the need for prior knowledge of the dynamics of the problem to solve. RL has found uses in many applications, from finance [1] and robotics [2][3][4], to natural language processing [5] and telecommunications [6]. The core of an RL system is the agent, which operates in an environment that models the task that it has to fulfill. In all of the above applications, the RL agents interact with the environment via a trial-and-error approach, in which they receive rewards (reinforcement) for their actions. This mechanism, similar to human learning, guides the agent to improve its future decisions in order to maximize the upcoming rewards. Despite the success of this approach, a large number of real-world problems cannot be fully solved by a single active agent that interacts with the environment; the solution to that problem is the multi-agent system (MAS), in which several agents learn concurrently how to solve a task by interacting with the same environment [7]. In Figure 1, we show the representation of the RL structure for a single agent and for an MAS.
MASs can be used in several fields, for example, traffic control, network packet routing, energy distribution, systems of robots, economic modeling, and the analysis of social dilemmas. For these reasons, in the last few years, researchers have attempted to extend the existing single-agent RL algorithms to multi-agent approaches. Empirical evaluations, however, have shown that a direct implementation of single-agent RL to several agents cannot converge to optimal solutions, because the environment is no longer stationary from each agent's perspective. In fact, an action performed by a certain agent can yield different rewards depending on the actions taken by the other agents. This challenge is called the non-stationarity of the environment and is the main problem to address in order to develop an efficient multi-agent RL (MARL) algorithm. Even when convergence is obtained, typically, this kind of algorithm can maintain acceptable performance in terms of the quality of the policies derived and speed of convergence only if a restricted number of agents is involved. The scalability to a high number of agents is an essential feature that must be taken into account when developing algorithms that can be applied to real-world problems [8].
In this survey, we present an introduction to multi-agent reinforcement learning. We focus on the models used to describe the environment and on how to adapt the most relevant single-agent reinforcement learning techniques to multi-agent settings.
Below, we present an assortment of MARL algorithms that address the above-mentioned challenges of non-stationarity and scalability. We then address partially observable environments. For MAS, partial observability is far more common than in the single-agent setting; thus, it is crucial for the development of algorithms that can be applied to real-world problems. We then give an overview of the most common benchmarking environments used to evaluate the performance of RL algorithms. Finally, we describe typical applications of MARL. This work is intended to be an introduction to multi-agent reinforcement learning, presenting the main challenges of the field and the main solutions adopted in the literature. While the research into MARL is still at an early stage and is often not supported by theoretical proof, it has shown promising progress in terms of its application. It could be considered a novel approach to achieve systems that are capable of helping humans perform complex tasks, such as working in hazardous environments, and exhibit general artificial intelligence.

Background
In the field of machine learning, reinforcement learning stands apart from the classic paradigm based on learning from examples. It adopts a trial and error procedure using a reward provided by an interpreter that observes the interaction of the agent with the environment. It takes inspiration from research into animal learning [9] combined with theoretical concepts of optimal control theory.
The interaction between the agent and the environment can be described by three fundamental elements: state, actions, and reward. The state represents a particular configuration of the environment, the actions are the options that the agent has for interacting with and modifying the environment, and the reward is a signal used to define the task of the agent and is what motivates the agent to pick one action over the others.
The learning process is iterative. The agent senses the environment while collecting its current state. The agent performs an action, thereby reaching the next state, and it receives a reward based on the combination of the state and selected action. This process is then repeated. The agents adopt a policy to determine the best action to select, which is a mapping from all the possible states of the environment to the action that can be selected.
The reward, however, is not sufficient to determine the optimal policy because an instantaneous reward does not give information about the future rewards that a specific action can lead to, that is, the long-term profit. For this reason, it is useful to introduce a new kind of reward: the return value $R_t$. If we write as $r_t$ the reward received by the agent at the time-step $t$, the return value over a finite-length time horizon $T$ is defined as
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T = \sum_{i=0}^{T-t-1} \gamma^i r_{t+i+1}. \tag{1}$$
Sometimes, the return value is also considered for non-finite time horizons:
$$R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}, \tag{2}$$
where $\gamma$ is a discount factor such that $0 \le \gamma < 1$. To evaluate the quality of a particular state or state-action pair, it is possible to define two value functions. In particular, under a policy $\pi$, the value function of the state is calculated as
$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_t \mid s_t = s], \tag{3}$$
and the value function of the state-action pair is calculated as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s_t = s, a_t = a]. \tag{4}$$
A value function can be expressed as a relation between two consecutive states $s_t$ and $s_{t+1}$, defining the so-called Bellman Equations (5) and (6):
$$V^{\pi}(s) = \mathbb{E}_{\pi}[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s] \tag{5}$$
and
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]. \tag{6}$$
A way to approximate the solutions to these equations is by using dynamic programming. However, this approach requires complete information about the dynamics of the problem. Model-free RL algorithms can be considered as an efficient alternative.
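As a minimal sketch of the return value discussed above, the following Python snippet computes $R_t$ for every time-step of a finite episode using the backward recursion $R_t = r_{t+1} + \gamma R_{t+1}$. The function name `discounted_returns` and the convention that `rewards[t]` stores $r_{t+1}$ are illustrative assumptions, not part of the reviewed works.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return [R_0, ..., R_{T-1}], where rewards[t] stores r_{t+1} (assumed layout)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the return that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

The backward pass costs one multiplication per step, whereas evaluating the sum separately for each $t$ would be quadratic in the episode length.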

Multi-Agent Framework
Before introducing the algorithms used in RL, we present the most used frameworks for modeling the environment of such applications. We begin with the single-agent formulation and then extend the concept to multiple agents.

Markov Decision Process
In the single-agent RL, the environment is usually modeled as a Markov decision process (MDP).
Formally, a Markov decision process is defined by a tuple $(S, A, P, R, \gamma)$, where
• $S$ is the state space;
• $A$ is the action space;
• $P : S \times A \to \delta(S)$ is the transition probability from state $s \in S$ to $s' \in S$ given the action $a \in A$;
• $R : S \times A \times S \to \mathbb{R}$ is the reward function, whose value is the reward received by the agent for a transition from the state-action pair $(s, a)$ to the state $s'$;
• $\gamma \in [0, 1]$ is the discount factor, a parameter that balances the weight of instantaneous and future rewards.
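To make the tuple concrete, the sketch below applies value iteration, one standard dynamic-programming method, to a small MDP whose transition probabilities and rewards are fully known. The nested-list layout `P[s][a][s2]` and `R[s][a][s2]`, the function name, and the two-state example are assumptions made for illustration; the sketch also assumes $\gamma < 1$ so that the iteration converges.

```python
def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """P[s][a][s2] is a transition probability, R[s][a][s2] the matching reward."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        # Bellman optimality backup for every state-action pair.
        Q = [[sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                  for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(q_s) for q_s in Q]
        delta = max(abs(v1 - v0) for v1, v0 in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the last computed Q-values.
    policy = [q_s.index(max(q_s)) for q_s in Q]
    return V, policy


# Hypothetical MDP: in state 0, action 1 moves to the absorbing state 1
# and pays a reward of 1; every other transition pays 0.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
V, policy = value_iteration(P, R, gamma=0.9)
```

Note that the procedure reads `P` and `R` directly, which is exactly the complete knowledge of the dynamics that model-free methods do not require.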

Markov Game
When more than one agent is involved, an MDP is no longer suitable for describing the environment, given that the actions of the other agents are strongly tied to the state dynamics. A generalization of the MDP is given by Markov games (MGs), also called stochastic games. A Markov game is defined by the tuple $(N, S, \{A_i\}_{i=1}^{N}, P, \{R_i\}_{i=1}^{N}, \gamma)$, where
• $N$ is the number of agents;
• $S$ is the state space observed by all agents;
• $A_i$ is the action space of the $i$-th agent and $A := A_1 \times A_2 \times \dots \times A_N$ is called the joint action space;
• $P : S \times A \to \delta(S)$ is the transition probability to each state $s' \in S$ given a starting state $s \in S$ and a joint action $a \in A$;
• $R_i : S \times A \times S \to \mathbb{R}$ is the reward function of the $i$-th agent, representing the instantaneous reward received when transitioning from $(s, a)$ to $s'$;
• $\gamma \in [0, 1]$ is the discount factor.
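The joint action space can be illustrated with a few lines of Python. In this sketch, which uses hypothetical action names and a hypothetical reward table, the reward of one agent for the same local action changes with the action of the other agent, and the number of joint actions grows exponentially with the number of agents.

```python
from itertools import product


def joint_action_space(action_spaces):
    """Cartesian product A_1 x A_2 x ... x A_N of the per-agent action sets."""
    return list(product(*action_spaces))


# Two hypothetical agents that can each "stay" or "move".
joint = joint_action_space([("stay", "move"), ("stay", "move")])

# Hypothetical reward of agent 1: moving pays off only if agent 2 stays.
reward_agent_1 = {
    ("move", "stay"): 1.0,
    ("move", "move"): -1.0,
    ("stay", "stay"): 0.0,
    ("stay", "move"): 0.0,
}

# With 3 actions per agent and 4 agents, there are 3**4 joint actions.
n_joint = len(joint_action_space([range(3)] * 4))
```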

A Partially-Observable Markov Decision Process
A partially-observable Markov decision process (POMDP) is a generalization of an MDP that accounts for uncertainty regarding the state of a Markov process and allows for the acquisition of state information [10]. We can define a POMDP using two stochastic processes: the first is a non-observable core process $X_t$ that represents the evolution of the state and is assumed to be a finite-state Markov chain; the second is an observation process $Y_t$ that represents the observations received by an agent. A probabilistic relationship $Q = [q_{xy}]$ links $X_t$ and $Y_t$: it gives the probability of observing a particular value $Y_t = y$ when the agent is in the state $X_t = x$.
A POMDP is well suited to modeling a large variety of RL problems, even though it is intrinsically more complex than an MDP.

Dec-POMDP
A decentralized partially-observable Markov decision process (Dec-POMDP) is defined by the tuple $\langle I, S, A, T, R, \Omega, O, \gamma \rangle$, where
• $I$ is the set of $n$ agents;
• $S$ is the state space;
• $A = \times_{i \in I} A^{(i)}$ is the joint action space;
• $\Omega = \times_{i \in I} \Omega^{(i)}$ is the joint observation space.
The evolution of the state is governed by the transition probability $T(s, a, s') : S \times A \to \delta(S)$, which indicates the probability that, given the joint action $a$ and the current state $s$, the next state will be $s'$. At every time step, each agent $i$ receives an observation $o^{(i)} \in \Omega^{(i)}$ given by the joint observation probability $O(o, s', a) = P(o \mid s', a)$. For each agent $i$, we can define its local observation history at iteration $t$ as $\vec{o}_t^{(i)} = (o_1^{(i)}, \dots, o_t^{(i)}) \in \vec{O}_t^{(i)}$. At every iteration, the team receives a joint reward $r_t = R(s_t, a_t)$ conditioned by the joint action and the current state. This is used to maximize the expected return over the horizon $T$, $V = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $\gamma$ is the discount factor [11]. As in the Markov games framework, the objective is to maximize the expected return by selecting an optimal joint policy. However, in this case, the policy performs a mapping from local observations to actions; thus, we can write the local policy of agent $i$ as $\pi^{(i)} : \vec{O}_t^{(i)} \to A^{(i)}$.

Single-Agent RL Algorithms
The field of RL is extremely broad, and over the last few years, a multitude of algorithms have been presented. To categorize them, the first division can be made between model-free and model-based algorithms. Model-based algorithms require users to have access to a model of the environment, including the transition probability and the associated reward, or to learn the model directly during training. They then use the model to plan the best action to take. An example of this approach is the DYNA algorithm [12]. This approach leads to an optimal solution; however, in most applications, a model of the environment cannot be obtained or would be too complex to implement. Moreover, the quality of the solutions is sensitive to error in the model estimation. Another downside is a loss of computational efficiency in cases in which the model is highly complex but the optimal policy is very simple. An example could be a complex robotic system that has to learn only how to move forward. In recent research, the efficiency of the learned model has been addressed, obtaining state-of-the-art performances with approaches such as MuZero and AlphaZero [13,14].
Model-free algorithms do not require any knowledge about the model, and they operate by directly optimizing the policy or other value parameters. This type of approach is goal-oriented and can perform in a variety of environments and adapt to their changes. In this work, we focus exclusively on model-free algorithms. Model-free algorithms can be further divided into two families: value-based approaches and policy-based approaches.
The objective of value-based methods is to find good estimates of the state and/or state-action pair value functions $V(s)$ and $Q(s, a)$. The policy is then derived using a fixed rule that maps the value functions to actions; an example is the $\epsilon$-greedy policy, which selects the action associated with the highest Q-value with probability $1 - \epsilon$ and a random action with probability $\epsilon$. An example of this approach is the famous Q-learning algorithm [15]. Policy-based methods do not require the value function to be estimated but use a parameterized policy, for example a neural network, that represents a probability distribution of actions over states, $\pi_\theta = \Pr[a \mid s]$. The policy is directly optimized by defining an objective function and using gradient ascent to reach an optimal point. If we consider, for example, the reward as the objective function, we can derive the expression
$$\nabla_\theta \mathbb{E}_{\pi_\theta}[R(s, a)] = \mathbb{E}_{\pi_\theta}[R(s, a) \nabla_\theta \ln \pi_\theta(a \mid s)],$$
which is known as the policy gradient theorem. While the expected value of the reward cannot be differentiated directly, the policy can, making gradient ascent possible. A last category of RL algorithms, which originates from the combination of policy-based and value-based methods, is the actor-critic approach. As shown in Figure 2, in an actor-critic algorithm, two different learning components are defined:
• The critic, which has the task of estimating the value function, typically using TD methods;
• The actor, which represents the parameterized policy and updates its action distribution in the direction "suggested" by the critic using a policy gradient.
The critic outputs a signal that changes the action selection preferences in order to choose actions related to higher value functions more frequently. This kind of algorithm presents a reduced variance in the gradient estimates due to the use of the value function information to guide the policy evolution. In addition, minimal computation is required to select an action, as there is no need to compare different state-action values, and it is possible to learn an explicitly stochastic policy [16].
All of these methods, as summarized in Figure 3, have been used with success in the solution of RL problems, and their effectiveness is dependent on the type of problem.
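As an illustration of the fixed rule used by value-based methods, the $\epsilon$-greedy selection described above can be sketched as follows. The function name and the optional `rng` argument are assumptions introduced for this example.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: Optional[random.Random] = None) -> int:
    """Greedy action with probability 1 - epsilon, uniform random action otherwise."""
    rng = rng or random
    if rng.random() < epsilon:
        # Explore: every action (including the greedy one) is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one in case of ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the rule is purely greedy, while `epsilon=1.0` yields a uniformly random policy; intermediate values trade off exploitation and exploration.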

Q-Learning
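Q-learning [15] is a form of model-free, value-based reinforcement learning. It takes inspiration from dynamic programming and temporal-difference methods such as TD(0) [17]. The Q-learning agent faces the task of determining an optimal policy that maximizes the total discounted reward; this policy $\pi^*$ is such that
$$V^*(s) \equiv V^{\pi^*}(s) = \max_a \Big\{ R_s(a) + \gamma \sum_{s'} P_{ss'}[a] \, V^{\pi^*}(s') \Big\},$$
where $R_s(a)$ is the average reward received by an agent in the state $s$ if it selects the action $a$, and $P_{ss'}[a]$ is the transition probability from the state $s$ to $s'$, given the action $a$. The objective is to estimate the state-action pair value function (Q-values); this is done by creating a matrix $Q_n(s, a)$ of dimensions $|S| \times |A|$ containing the estimates of all the value functions at the time-step $n$. At each iteration $n$ of the learning process, the Q-values are updated using the equation
$$Q_n(s, a) = \begin{cases} (1 - \alpha_n) Q_{n-1}(s, a) + \alpha_n \left[ r_n + \gamma V_{n-1}(s'_n) \right] & \text{if } s = s_n \text{ and } a = a_n \\ Q_{n-1}(s, a) & \text{otherwise} \end{cases} \tag{10}$$
where $\alpha_n$ is the learning rate, $r_n$ is the instant reward received, $s'_n$ is the state reached after taking $a_n$ in $s_n$, and
$$V_{n-1}(s') \equiv \max_{a'} Q_{n-1}(s', a').$$
It was shown in [18] that, under the assumption of bounded rewards $|r_n| \le R$ and learning rates $0 \le \alpha_n < 1$ such that
$$\sum_{i=1}^{\infty} \alpha_{n^i(s, a)} = \infty, \qquad \sum_{i=1}^{\infty} \left[ \alpha_{n^i(s, a)} \right]^2 < \infty \qquad \forall s, a,$$
where $n^i(s, a)$ is the index of the $i$-th time that the action $a$ is tried in the state $s$, the estimates $Q_n(s, a)$ will converge to the optimal Q-value $Q^*(s, a)$ with probability 1.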
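A minimal tabular sketch of the Q-learning update is given below. The dictionary-based table, the `done` flag for terminal states, the constant learning rate, and the four-state corridor task are all assumptions made for illustration; a constant learning rate does not satisfy the decaying-rate conditions of the convergence result, but it is sufficient here because the toy environment is deterministic.

```python
import random
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """Apply the Q-learning rule to the visited pair (s, a) only."""
    # Value of the next state: max_b Q(s', b); a terminal state has value 0.
    v_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
    return Q[(s, a)]


def train_corridor(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Hypothetical toy task: states 0..3, reward 1 on reaching state 3."""
    rng = random.Random(seed)
    actions = (0, 1)            # 0 = move left, 1 = move right
    Q = defaultdict(float)      # all Q-values start at 0
    for _ in range(episodes):
        s = 0
        while s != 3:
            a = rng.choice(actions)   # uniform exploration (off-policy)
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == 3
            q_learning_update(Q, s, a, 1.0 if done else 0.0, s_next,
                              actions, alpha, gamma, done)
            s = s_next
    return Q
```

Because the behavior policy is uniformly random while the target uses the maximum over next actions, the example also shows the off-policy nature of Q-learning: the learned values approach those of the greedy policy (1, 0.9, and 0.81 for moving right in states 2, 1, and 0).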

REINFORCE
REINFORCE is a class of episodic reinforcement learning algorithms that provide the simplest implementation of the policy gradient method; for this reason, it is also known as the vanilla policy gradient [19]. The policy is directly optimized without the need to estimate the value functions. The policy $\pi$ is parameterized with a set of weights $\theta$ such that $\pi(s; \theta) \equiv \pi_\theta(s)$, and it is the probability distribution of actions over the state. Regardless of the architecture of the parameterized policy, all REINFORCE algorithms use the same weight update procedure:
$$\theta \leftarrow \theta + \alpha (R_t - b) \nabla_\theta \ln \pi_\theta,$$
where $\alpha$ is a non-negative learning parameter, $R_t$ is the discounted return value, and $b$ is the baseline, which is used to reduce the variance of the gradient estimation. In general, $b$ is a function of the state; in the simplest case, it is a constant. The steps to implement a REINFORCE algorithm are the following:
1. Initialize the policy parameters $\theta$ at random;
2. Generate an episode (a sequence of states, actions, and rewards) by following the current policy $\pi_\theta$;
3. For each time-step $t$ of the episode, estimate the return $R_t$ and update $\theta$ with the rule above;
4. Repeat from step 2.
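The REINFORCE update can be sketched for a softmax policy over a small discrete action set, for which the gradient of $\ln \pi_\theta$ has a closed form. The one-step task below (action 1 yields a return of 1, action 0 a return of 0), the constant baseline of 0.5, and the zero initialization (chosen instead of random initialization for reproducibility) are assumptions made for illustration.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, action, ret, baseline, alpha):
    """theta <- theta + alpha * (R_t - b) * grad_theta ln pi_theta(action)."""
    probs = softmax(theta)
    for k in range(len(theta)):
        # Softmax policy: d ln pi(action) / d theta_k = 1[k == action] - pi(k).
        grad_log = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += alpha * (ret - baseline) * grad_log
    return theta


def train_bandit(steps=500, alpha=0.1, baseline=0.5, seed=0):
    """Hypothetical one-step episodes: the return equals the chosen action."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # zero initialization, assumed for reproducibility
    for _ in range(steps):
        # Sample an action from the current policy (a one-step episode).
        action = rng.choices([0, 1], weights=softmax(theta))[0]
        ret = float(action)
        reinforce_update(theta, action, ret, baseline, alpha)
    return theta
```

With this baseline, choosing the better action gives a positive factor $(R_t - b)$ and choosing the worse one a negative factor, so every update shifts probability mass towards action 1.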

A3C
Recalling the policy update rule of REINFORCE, $\theta \leftarrow \theta + \alpha (R_t - b) \nabla_\theta \ln \pi_\theta$, a frequently used baseline $b$ is a learned estimate of the state value function, $b(s) = V^{\pi}(s; \theta_v)$. If this kind of baseline is selected, we can express the term $R_t - b$ as an advantage function $A(a_t, s_t) = Q(s_t, a_t) - V(s_t)$ because $R_t$ is an estimate of the state-action value function. This approach can be viewed as an actor-critic method in which the actor is the policy network $\pi_\theta$ and the critic is represented by the baseline. In [20], A3C (asynchronous advantage actor critic) and its synchronous variant A2C are presented; the algorithm maintains a policy function $\pi(a_t \mid s_t; \theta)$ parameterized by $\theta$ and an estimate of the value function $V(s_t; \theta_v)$ parameterized by $\theta_v$. In the formulation, the weights are considered separately; in the implementation, the same deep convolutional neural network is used to approximate the policy and the state value, with the only difference being in the separated output layers (a soft-max layer for the policy and a linear layer for the state value). The algorithm continues for a number of steps $t_{max}$ or until it reaches a terminal state; the return value is then calculated, and the updating of the parameters is performed following the gradient. The pseudocode of the A3C algorithm can be found in Algorithm 1.
The algorithm was trained and tested in a variety of domains, such as the Atari 2600 platform, improving upon the results of state-of-the-art RL algorithms such as the Deep Q-network [21].
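Algorithm 1 A3C pseudocode for each actor-learner thread [20].
// Assume global shared parameter vectors θ and θ_v and global shared counter T = 0
// Assume thread-specific parameter vectors θ' and θ'_v
Initialize thread step counter t ← 1
repeat
    Reset gradients dθ ← 0 and dθ_v ← 0
    Synchronize thread-specific parameters θ' = θ and θ'_v = θ_v
    t_start = t
    Get state s_t
    repeat
        Perform a_t according to policy π(a_t | s_t; θ')
        Receive reward r_t and new state s_{t+1}
        t ← t + 1
        T ← T + 1
    until terminal s_t or t − t_start == t_max
    R = 0 for terminal s_t, or R = V(s_t; θ'_v) for non-terminal s_t
    for i ∈ {t − 1, ..., t_start} do
        R ← r_i + γR
        Accumulate gradients wrt θ': dθ ← dθ + ∇_θ' log π(a_i | s_i; θ')(R − V(s_i; θ'_v))
        Accumulate gradients wrt θ'_v: dθ_v ← dθ_v + ∂(R − V(s_i; θ'_v))² / ∂θ'_v
    end for
    Perform asynchronous update of θ using dθ and of θ_v using dθ_v
until T > T_max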
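The return and advantage computation that each A3C worker performs after a rollout can be sketched in isolation. The snippet assumes the backward recursion of the original A3C algorithm [20], in which the return is bootstrapped from the critic when the rollout stops before a terminal state; the function name and argument layout are illustrative assumptions, and the neural networks and the asynchronous update are deliberately left out.

```python
def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma):
    """rewards[i] = r_i and values[i] = V(s_i) for the steps of one rollout.

    bootstrap_value is 0 if the rollout ended in a terminal state,
    otherwise the critic's estimate V(s_t) of the last state reached.
    """
    R = bootstrap_value
    returns, advantages = [], []
    # Walk the rollout backwards: R <- r_i + gamma * R.
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)   # advantage estimate R - V(s_i)
    returns.reverse()
    advantages.reverse()
    return returns, advantages
```

In a full implementation, the advantages would weight the policy gradient of the actor, and the squared differences between returns and values would form the loss of the critic.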

The Limits of Multi-Agent Reinforcement Learning
MARL algorithms can be coarsely divided into three groups depending on the kind of reward given by the environment: fully cooperative, fully competitive, and mixed cooperative-competitive. In the cooperative setting, all the agents collaborate to maximize a common long-term return. An example of this setting is a smart energy grid, in which multiple buildings (agents) with different energy production capabilities have to share energy in order to minimize the outside-grid energy demands. Another is an autonomous driving setting in which the vehicles have to collaborate in order to avoid collisions while trying to maximize traffic flow and possibly fuel efficiency. In competitive settings, the return of all the agents sums to zero. A variety of board and card games, including chess, Go, and poker, belong to this setting, and are of great interest in the reinforcement learning community as benchmarks for algorithms. Mixed settings combine the aforementioned characteristics and present a general-sum reward; a typical example of this is constituted by team games in which the agents have to cooperate with their own team-mates while competing with the opposing teams. The transition from single-agent to multi-agent settings introduces new challenges that require a different design approach for the algorithms.

Nonstationarity
The environment in a multi-agent setting can be modified by the actions of all agents; thus, from the single-agent perspective, the environment becomes non-stationary. The effectiveness of most reinforcement learning algorithms is tied to the Markov property, which does not hold in non-stationary environments [16]. Policies learned in a non-stationary environment are bound to become outdated. Despite the loss of theoretical support, algorithms designed for the single-agent setting have been applied in multi-agent settings as independent learners (IL), occasionally achieving desirable results [22]. A naive approach to tackle the non-stationarity is the use of joint action learners (JAL). A JAL applies a single-agent RL algorithm but computes its value functions with the joint action instead of the local action. This approach eliminates the problem of non-stationarity entirely; however, it is computationally inefficient, as the dimension of the action space becomes $|A|^N$, where $N$ is the number of agents, making it difficult to scale this type of approach to more than a few agents. In addition, to ensure that every agent knows the actions of others, some type of centralized controller or communication network is required [23]. In [24], a variation of Q-learning for swarm systems is presented, called Q-RTS. The key idea is to use a centralized aggregation center to combine all the Q-value tables of the agents to form a global swarm matrix $Q_{sw}(s, a)$ containing the highest and lowest Q-values representing the most interesting iterations by the agents. The swarm matrix is then linearly combined into the local Q-value matrix of each agent $Q_i(s, a)$ using an independence factor $\beta \in [0, 1)$, controlling the tradeoff of local and global knowledge. The Q-learning update then proceeds independently on $Q_i$ for each agent.
An FPGA implementation was also proposed [25].
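The aggregation idea behind Q-RTS can be sketched as follows. The text above does not spell out the exact rules, so two assumptions are made here and should be checked against [24]: the swarm matrix keeps, for each state-action pair, the Q-value with the largest magnitude among all agents, and the blend is $Q_i \leftarrow \beta Q_i + (1 - \beta) Q_{sw}$, so that a larger $\beta$ means a more independent agent. The function names and the nested-list tables are illustrative.

```python
def swarm_matrix(q_tables):
    """Entry-wise aggregation (assumed rule): keep the Q-value of largest magnitude."""
    n_states, n_actions = len(q_tables[0]), len(q_tables[0][0])
    return [[max((q[s][a] for q in q_tables), key=abs)
             for a in range(n_actions)]
            for s in range(n_states)]


def blend(q_local, q_swarm, beta):
    """Assumed linear combination: Q_i <- beta * Q_i + (1 - beta) * Q_sw."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    return [[beta * ql + (1.0 - beta) * qs for ql, qs in zip(row_l, row_s)]
            for row_l, row_s in zip(q_local, q_swarm)]
```

After the blend, each agent would continue with its own Q-learning update on the resulting local table.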

Varying Learning Speed
The effect of environmental non-stationarity is to make the reward information of state-action pairs related to past iterations obsolete, and this effect grows as time progresses. Several approaches tackle this challenge by adopting varying learning rates with the aim of guiding the training to the most efficient joint policy. In the context of cooperative games, hysteretic Q-learning [26] is an algorithm that improves the performance of standard independent learner approaches. The reward is shared between agents and conditioned by the joint action; thus, an agent can be punished even if it selects the optimal option due to the bad actions of other teammates, who might be exploring. The algorithm applies a different learning rate if the update would cause a decrease of the Q-value, using the update Equation (15) under the condition that $\beta < \alpha$:
$$\delta = r + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a_i), \qquad Q_i(s, a_i) \leftarrow \begin{cases} Q_i(s, a_i) + \alpha \delta & \text{if } \delta \ge 0 \\ Q_i(s, a_i) + \beta \delta & \text{otherwise} \end{cases} \tag{15}$$
where $\alpha$ and $\beta$ are the learning rates applied to increases and decreases of the Q-value, respectively.
This heuristic has the positive effect of implementing "optimistic" agents that are able to ignore the bad rewards caused by the actions of others, improving the performance compared to traditional IL approaches in cooperative MGs. In [27], the authors argue that in order to obtain robustness against incorrect future reward estimation, in the early iterations of the reinforcement learning algorithms, the agents need to show some sort of leniency towards others. In particular, the future reward for a performed action can be assessed as the maximum reward received over a number of different actions chosen by the other agents. The proposed algorithm is called lenient multi-agent reinforcement learning (LMRL) and implements varying leniency, lowering the amount that each agent exhibits in the later stages of learning. Agents decide the amount of leniency to apply by associating a temperature with each state-action pair. The level of leniency grows with the temperature, which decreases each time the corresponding state-action pair is selected.
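The asymmetric learning rates of hysteretic Q-learning can be sketched as a single scalar update. The function name and the scalar interface (the caller passes the current Q-value and the maximum Q-value of the next state) are assumptions made for this example.

```python
def hysteretic_update(q, r, q_next_max, alpha, beta, gamma):
    """Return the new Q-value: alpha for increases, beta (< alpha) for decreases."""
    if not beta < alpha:
        raise ValueError("hysteretic Q-learning assumes beta < alpha")
    delta = r + gamma * q_next_max - q   # temporal-difference error
    rate = alpha if delta >= 0 else beta
    return q + rate * delta
```

For example, with $\alpha = 0.5$ and $\beta = 0.125$, a TD error of $+1$ raises the Q-value by 0.5, whereas a TD error of $-1$ lowers it by only 0.125, so penalties caused by exploring teammates are largely discounted.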

Scalability
As the number of agents increases, there is a growth in the joint action space. For this reason, centralized approaches, in which an observer selects the actions after receiving the action-state information of every agent, require large amounts of computational resources and memory to work with more than a couple of agents. A possible solution to the curse of dimensionality in MARL is to use independent learners, but as we have seen, this approach is unable to obtain consistent results in a non-stationary environment. A third model of agent connection is the decentralized setting with networked agents. In this setting, every agent is able to interact with the environment and to exchange information with a few other agents (typically, those in its vicinity), creating a time-varying communication network between all the agents. Algorithms developed for this setting are scalable to a massive number of agents and more real-world-oriented applications, as the absence of a central controller and uncertainty in communication links are typical requirements in a large number of applications. In [28], a distributed version of Q-learning, namely, QD-learning, is proposed under the assumption that each agent is only aware of its local action and reward and the inter-agent communication network is weakly connected. The optimal policy is achieved by agents sending their Q-values to their neighbors. The update of the Q-value is then computed locally for each agent using the following equation:
$$Q^n_{i,u}(t+1) = Q^n_{i,u}(t) - \beta_{i,u}(t) \sum_{l \in \mathcal{N}_n(t)} \left( Q^n_{i,u}(t) - Q^l_{i,u}(t) \right) + \alpha_{i,u}(t) \left( r_n(x_t, u_t) + \gamma \max_{v} Q^n_{x_{t+1},v}(t) - Q^n_{i,u}(t) \right), \tag{16}$$
where $Q^n_{i,u}(t)$ is the estimate held by agent $n$ for the state $i$ and the action $u$, and $\mathcal{N}_n(t)$ is the set of neighbors of agent $n$ at time $t$. It can be seen from Equation (16) that the update is defined by two processes, consensus and innovation, where the former is the sum of the differences between the Q-value of the agent and the Q-values of its neighbors, controlled by the weight sequence $\beta_{i,u}(t)$, and the latter is the traditional Q-learning update rule weighted by the sequence $\alpha_{i,u}(t)$. In [29], the same setting of decentralized reinforcement learning with networked agents is addressed using an actor-critic algorithm.
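The policy is parametrized by a set $\theta \in \Theta$, and the gradient of the globally averaged return $J(\theta)$ with respect to the parameters $\theta^i$ of agent $i$ is given by
$$\nabla_{\theta^i} J(\theta) = \mathbb{E}_{s \sim d_\theta, a \sim \pi_\theta} \left[ \nabla_{\theta^i} \ln \pi^i_{\theta^i}(s, a^i) \, A^i_\theta(s, a) \right].$$
The local advantage function is defined as
$$A^i_\theta(s, a) = Q_\theta(s, a) - \tilde{V}^i_\theta(s, a^{-i}), \qquad \tilde{V}^i_\theta(s, a^{-i}) = \sum_{a^i \in A^i} \pi^i_{\theta^i}(s, a^i) \, Q_\theta(s, a^i, a^{-i}),$$
and $a^{-i}$ is the joint action chosen by every agent except for agent $i$. The advantage function is not available considering only local information, so $A_\theta$ is estimated with the state-value TD-error, which is an unbiased estimator of the advantage function.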
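Returning to the QD-learning update discussed above, the split into a consensus term and an innovation term can be sketched for a single state-action pair. The sketch assumes a reward-maximizing convention (the formulation in [28] may instead be stated in terms of costs and a minimum), and the function name and scalar interface are illustrative assumptions.

```python
def qd_learning_update(q_own, q_neighbors, r, q_next_max, alpha, beta, gamma):
    """One consensus + innovation step for a single (state, action) entry.

    q_own       : the agent's current estimate for the visited pair
    q_neighbors : estimates of the same pair received from neighboring agents
    q_next_max  : max over actions of the agent's estimates in the next state
    """
    # Consensus: disagreement between the agent and each of its neighbors.
    consensus = sum(q_own - q_l for q_l in q_neighbors)
    # Innovation: the usual Q-learning temporal-difference term.
    innovation = r + gamma * q_next_max - q_own
    return q_own - beta * consensus + alpha * innovation
```

The consensus term pulls the local estimate towards the neighbors' estimates, while the innovation term injects the locally observed reward; with no neighbors, the update reduces to standard Q-learning.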

Deep Reinforcement Learning
Function approximation is a common technique for reducing the number of parameters in high-dimension state-action spaces. The use of a deep neural network to approximate the state-action value function, called deep reinforcement learning (DRL), was first presented in [21] in the single-agent setting and achieved promising results in the training of an agent capable of playing a large set of Atari 2600 games using the raw pixel data from the screen as the input for training. The success of the proposed approach is based on two features. The first is the introduction of an experience replay mechanism in which every experience tuple $e_t$, composed of the state transition, action selected, and reward received, is stored in a dataset and then randomly batched to train the action-value approximation network. This method breaks the correlation between consecutive samples, which is inevitable for sequentially generated data points. The experience replay also has the additional effect of increasing the sample efficiency by reusing experience tuples. The second feature is the use of two networks to approximate Q: the Q-network and the target network. The parameters of the latter are updated every $C$ steps with the Q-network parameters and used to compute the target of the training loss function, defined as
$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right], \tag{19}$$
in which $\gamma$ is the discount factor, and $\theta_i^-$ and $\theta_i$ are the parameters of the target network and of the Q-network at iteration $i$, respectively. Policy gradient methods have been extended to make use of deep neural networks, keeping the advantage of allowing for policies in the continuous action space. In [30], a deep Q-network was combined with an actor-critic approach. The parameterized actor $\mu(s \mid \theta^\mu)$, which represented the policy, and the critic $Q(s, a \mid \theta)$, which estimated the value using the state-action pair, were both represented by deep neural networks. The networks were trained using a deterministic policy gradient algorithm with a batch normalization technique [31].
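The two ingredients described above, experience replay and a separate target network, can be sketched without any deep learning library. In this illustration, `ReplayBuffer`, `td_targets`, and the callable `target_q` (standing in for the target network and returning one Q-value per action) are hypothetical names; the `done` flag for terminal transitions is likewise an assumption of the sketch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of experience tuples sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma):
    """y = r + gamma * max_a' Q(s', a'; theta^-), or y = r at a terminal state."""
    return [r if done else r + gamma * max(target_q(s_next))
            for (_, _, r, s_next, done) in batch]
```

The Q-network would then be trained to reduce the squared difference between these targets and its own predictions, while `target_q` is refreshed with the Q-network parameters only every $C$ steps.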
The performance obtained has motivated the research community to adopt deep networks in the multi-agent environment. In [32], the lenient reinforcement learning algorithm was adapted to DRL, and the authors thus proposed the lenient deep Q-network (LDQN) algorithm. In [33], the authors proposed two techniques to stabilize the effect of the experience replay in the multi-agent setting: low-dimensional fingerprints, made by a Boolean vector, which are added to the experience tuple to disambiguate training samples; and importance sampling, which consists of recording the other agents' policies $\pi_{-a}(u_{-a} \mid s) = \prod_{i \in -a} \pi_i(u_i \mid s)$ in the experience replay, forming an augmented transition tuple $\langle s, u_a, r, \pi(u_{-a} \mid s), s' \rangle^{(t_c)}$, where $t_c$ is the time of collection. The DQN parameters $\theta$ are trained by minimizing the importance-weighted loss function, analogous to Equation (19),
$$L(\theta) = \sum_{i=1}^{b} \frac{\pi^{t_r}_{-a}(u_{-a} \mid s)}{\pi^{t_i}_{-a}(u_{-a} \mid s)} \left[ \left( y_i^{DQN} - Q(s, u; \theta) \right)^2 \right], \tag{20}$$
where $b$ is the size of the batch used in the learning, $y_i^{DQN}$ is the output of the target network, $t_i$ is the time of collection of the $i$-th sample, and $t_r$ is the time of replay.

Partial Observability
Most of the algorithms presented are based on the assumption of the full observability of the state by all agents. In most real-world applications, this condition is rarely present, as agents observe different instances of the state; for example, an agent may have vision of only its surroundings, making the observations correlated with its geographical position, or the agents might be provided with different sets of sensors. In the setting of DRL, the algorithms developed for the full state can only achieve the desired performance if the observations are reflective of the hidden system state. In [34], an algorithm called the deep recurrent Q-network (DRQN) was proposed, in which the first fully connected layer of the DQN was replaced with a recurrent long short-term memory (LSTM) layer. This architecture was tested on flickering versions of Atari games in which the state (the pixels of the game screen) was sometimes presented to the agent and sometimes obscured. In this benchmark, DRQN achieved better performance than the traditional DQN. This idea was transferred to the multi-agent setting in [35], which combined hysteretic learning [26], to deal with the non-stationarity of the environment, with the ability of deep recurrent Q-networks to represent the hidden state space, proposing an algorithm called the decentralized hysteretic deep recurrent Q-network (Dec-HDRQN). In that work, a variation of the experience replay mechanism called concurrent experience replay trajectories (CERTs) was used. In CERT, each experience tuple $(o_t^{(i)}, a_t^{(i)}, r_t, o_{t+1}^{(i)})$, containing the current observation, action, and reward, and the subsequent observation, is indexed by the agent number, the time-step of acquisition, and the episode. The samples given to the Q-network for training are then taken from this structure in a synchronized way.
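The indexing scheme of CERT can be sketched as a nested list addressed by episode, time-step, and agent. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and details of [35] such as the handling of traces that run past the end of an episode are omitted.

```python
import random


class CERTBuffer:
    """trajectories[episode][t][agent] = (obs, action, reward, next_obs)."""

    def __init__(self, seed=None):
        self.trajectories = []
        self.rng = random.Random(seed)

    def start_episode(self):
        self.trajectories.append([])

    def push(self, per_agent_transitions):
        # One entry per agent, all collected at the same time-step.
        self.trajectories[-1].append(list(per_agent_transitions))

    def sample(self, trace_len):
        """Pick one episode and one time window, shared by every agent."""
        candidates = [ep for ep in self.trajectories if len(ep) >= trace_len]
        episode = self.rng.choice(candidates)
        start = self.rng.randrange(len(episode) - trace_len + 1)
        window = episode[start:start + trace_len]
        n_agents = len(window[0])
        # Return one trace per agent, aligned on the same time-steps.
        return [[step[i] for step in window] for i in range(n_agents)]
```

Because all agents are trained on the same episode and the same time window, each agent's recurrent network sees experience generated while its teammates followed the same joint behavior, which is the synchronization referred to above.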

Centralized Learning of Decentralized Policies
Scalability and partial observability are connected problems, especially when considering the applications of algorithms to real-world problems. In a setting with an extremely high number of agents, as in the case of autonomous cars driving in a city, it is impossible for every car to have full information about the state, as cars cannot exchange information with each other without a prohibitive communication overhead. Full information about the state may therefore be unavailable when executing a policy, but it can be useful during the learning phase to correctly estimate the value functions or the policy gradients. To exploit this dichotomy, the paradigm of centralized learning of decentralized policies was introduced. The idea behind this is that, during learning, it is often possible to have access to a simulator of the environment or a central controller, or to have additional information about the state. We can imagine a setting in which the agents are, for example, a swarm of drones trained in a closed building, such as a hangar, with limited visibility, and the extra information about the state of the environment is given by a fixed camera looking down on all of the drones. In [36], a comparison between various training schemes (centralized learning, independent learning, and centralized training of decentralized policies using parameter sharing) is presented. A parameter sharing variant of the single-agent algorithm trust region policy optimization (TRPO) [37], namely, PS-TRPO, is presented. The decentralized paradigm obtained the best performance of all the training schemes in the evaluation phase. In addition, the scalability of the PS-TRPO algorithm was addressed using curriculum learning [38], in which the agents had to solve a number of sub-tasks that increased in difficulty in order to coordinate with an increasing number of agents.
To exploit the utility of a centralized learning phase, a novel actor-critic architecture called the counterfactual multi-agent policy gradient (COMA) was presented in [39]. During training, a single centralized critic estimates the Q-function using the joint action and the full information on the state (or, in the absence of a well-defined state, the concatenation of all the local observations made by the agents), while several decentralized actors each deploy a policy that maps local observations to an action. Additionally, COMA tackles the challenge of multi-agent credit assignment in a cooperative setting where a unique reward is shared by all the agents. Training is more efficient if each agent is able to determine how its action reflects on the success of the task. This is done through the use of a counterfactual baseline: each agent receives a shaped reward, which is the difference between the reward obtained at time-step t and the reward that would be obtained if the action of agent $a$ were changed to a default action $c^a$:

$$D^a = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$$

It can be shown that any action by agent $a$ that improves $D^a$ also improves the true global reward. The main downsides of this approach are the necessity for a simulator of the environment to calculate $r(s, (\mathbf{u}^{-a}, c^a))$ and the choice of the default action $c^a$, which is not always trivial. Both of these problems were addressed by the architecture of the critic, which computes for each agent an advantage function as follows:

$$A^a(s, u^a) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid o^a)\, Q(s, (\mathbf{u}^{-a}, u'^a)) \tag{21}$$

In particular, the actions $\mathbf{u}^{-a}$ are given as an input to a network that determines the state-action value for each action of agent $a$ in a single forward pass. This algorithm (full pseudocode can be found in Algorithm 2) was tested in StarCraft, a combat-based video-game environment.
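As a numerical illustration of the counterfactual advantage, the following minimal sketch evaluates it for a single agent. The function name `coma_advantage` and the toy numbers are hypothetical; the list `q_for_agent` stands in for the critic's single forward pass, which returns one Q-value per action of the agent while the other agents' actions are held fixed.

```python
def coma_advantage(q_values, policy, chosen):
    """Q of the chosen action minus the policy-weighted average of all Q-values.

    q_values[k] is the critic output for action k of this agent, with the
    other agents' actions held fixed; policy[k] is the probability of action k.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[chosen] - baseline


# Toy numbers (assumptions for illustration only).
q_for_agent = [1.0, 3.0, 2.0]
pi_agent = [0.2, 0.5, 0.3]
adv = coma_advantage(q_for_agent, pi_agent, chosen=1)

# The baseline is the expected Q under the agent's own policy, so the
# advantage averages to zero under that policy.
expected_adv = sum(p * coma_advantage(q_for_agent, pi_agent, k)
                   for k, p in enumerate(pi_agent))
```

No default action and no extra simulator call are needed: the baseline only marginalizes over the agent's own actions.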
Several homogeneous and heterogeneous unit combinations, with each unit represented by an agent, were considered. It was shown that, in that setting, COMA reached performance competitive with fully centralized methods, with better training speed.

Algorithm 2 COMA pseudocode [39].

    Initialise $\theta^c_1$, $\hat{\theta}^c_1$, $\theta^\pi$
    for each training episode $e$ do
        Empty buffer
        for $e_c = 1$ to BatchSize/$n$ do
            $s_1$ = initial state, $t = 0$, $h^a_0 = 0$ for each agent $a$
            while $s_t \neq$ terminal and $t < T$ do
                $t = t + 1$
                for each agent $a$ do
                    $h^a_t$ = Actor$(o^a_t, h^a_{t-1}, u^a_{t-1}, a, u; \theta_i)$
                    Sample $u^a_t$ from $\pi(h^a_t, \epsilon(e))$
                end for
                Get reward $r_t$ and next state $s_{t+1}$
            end while
            Add episode to buffer
        end for
        Collate episodes in buffer into single batch
        for $t = 1$ to $T$ do  // from now processing all agents in parallel via single batch
            Batch unroll RNN using states, actions and rewards
            Calculate TD($\lambda$) targets $y^a_t$ using $\hat{\theta}^c_t$
        end for
        for $t = T$ down to 1 do
            $\Delta Q^a_t = y^a_t - Q(s^a_t, u)$
            $\nabla\theta^c = \nabla\theta^c + \frac{\partial}{\partial\theta^c}(\Delta Q^a_t)^2$  // calculate critic gradient
            $\theta^c_{t+1} = \theta^c_t - \alpha\nabla\theta^c$  // update critic weights
            Every $C$ steps reset $\hat{\theta}^c_t = \theta^c_t$
        end for
        for $t = T$ down to 1 do
            $A^a(s^a_t, u) = Q(s^a_t, u) - \sum_u Q(s^a_t, u, u^{-a})\,\pi(u \mid h^a_t)$  // calculate COMA
            $\nabla\theta^\pi = \nabla\theta^\pi + \frac{\partial}{\partial\theta^\pi}\log\pi(u \mid h^a_t)\,A^a(s^a_t, u)$  // accumulate actor gradients
        end for
        $\theta^\pi_{t+1} = \theta^\pi_t + \alpha\nabla\theta^\pi$  // update actor weights
    end for

While the paradigm of the centralized learning of decentralized policies is easily implementable for actor-critic and policy gradient methods, it is not as straightforward when considering value-based methods. A possible approach was presented in [40] that consists of decomposing the team value function into agent-wise value functions. The assumption on which this approach is based is that the joint action-value function can be factorized according to each agent's Q-function based only on each agent's local observation:

$$Q(s, \mathbf{u}) \approx \sum_{a} Q_a(o^a, u^a) \tag{22}$$

Under this factorization, each agent taking a greedy action to maximize its own return is equivalent to a central controller maximizing the joint action-value function. The value decomposition network (VDN) used DQN or DRQN, with the possibility of communication between agents at a low level (sharing observations) or high level (sharing network weights). It was tested in a two-dimensional grid environment, obtaining better performance than the centralized and independent learner methods. The limit of this approach is that not every multi-agent problem can be approximated and solved as a summation of Q-functions. In [41], the VDN method was extended with the QMIX algorithm. The authors argued that a full factorization of the value function is not required to extract effective policies.
It is sufficient that applying an argmax to the joint action-value function produces the same result as applying an argmax to each of the individual action-value functions. This is possible if a monotonicity constraint is enforced between the total Q-value $Q_{tot}$ and the local agent Q-values $Q_a$:

$$\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a \tag{23}$$

Each agent value function $Q_a(o^a_t, u^a_{t-1})$ is represented by a DRQN that takes, as its input, the local observation and the last action at each time step. Then, the individual value functions are combined using a mixing network, which is a feed-forward neural network. The weights of the mixing network are constrained to be non-negative to enforce the condition presented in Equation (23). These weights are determined by a separate hyper-network which takes the augmented state as its input. The networks are trained to minimize a loss function analogous to that of DQN, with transitions sampled from the replay buffer. In the evaluation carried out in the StarCraft II Learning Environment, QMIX obtained better results than VDN at the cost of the added architectural complexity: VDN combines the local Q-functions using a simple summation, whereas QMIX uses a neural network.
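The effect of the non-negative mixing weights can be checked on a toy example. The sketch below is not the QMIX implementation: the two-layer mixer, the ELU activation, and the fixed numbers `W1`, `B1`, `W2`, `B2` (standing in for the hyper-network outputs at one state) are assumptions chosen for illustration. It verifies that the per-agent greedy actions coincide with the argmax of the mixed joint value.

```python
import math
from itertools import product


def elu(x):
    """Monotonically increasing activation."""
    return x if x > 0 else math.exp(x) - 1.0


def mix(agent_qs, w1, b1, w2, b2):
    """Two-layer mixer; abs() on the weights keeps Q_tot non-decreasing in each Q_a."""
    hidden = [elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
              for row, b in zip(w1, b1)]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Hypothetical fixed numbers standing in for hyper-network outputs at one state.
W1 = [[0.5, -1.2], [-0.3, 0.8]]
B1 = [0.1, -0.2]
W2 = [-0.7, 0.4]
B2 = 0.05

# Local Q-values of two agents (three and two actions respectively).
local_qs = [[0.1, 0.9, 0.3], [0.5, 0.2]]


def q_tot(joint_action):
    chosen = [q[u] for q, u in zip(local_qs, joint_action)]
    return mix(chosen, W1, B1, W2, B2)


# Decentralized choice: each agent maximizes its own Q_a.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)
# Centralized choice: exhaustive argmax of Q_tot over all joint actions.
central = max(product(*(range(len(q)) for q in local_qs)), key=q_tot)
```

Setting the mixer to a plain sum of its inputs recovers the additive combination used by VDN, which satisfies the same monotonicity condition.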

Communications between Agents
In the case of the partial visibility of the environment, the use of communication between agents is often a necessity, as collaborating agents can then better infer the underlying state of the environment. Communication protocols are often hand-designed and optimized for the execution of particular tasks. In contrast with this approach, in [42], a simple neural communication model called CommNet was proposed, which learned a task-specific communication protocol that aided the performance of the agents. Considering a setting with $J$ fully cooperative agents, a model was proposed with $a = \Phi(o)$, where $a$ is the concatenation of the discrete actions of each agent $j$ and $o$ is the concatenation of all the agents' observations. $\Phi$ is built from modules $f^i$, each consisting of a multi-layer neural network, with $i \in \{0, \ldots, K\}$, where $K$ is the number of communication steps of the network. Each agent $j$ sends two input vectors to the module, the agent's hidden state $h^i_j$ and the communication $c^i_j$, and $f^i$ outputs the next hidden state $h^{i+1}_j$. The evolution of the hidden state and the communication message is regulated by the following equations:

$$h^{i+1}_j = f^i(h^i_j, c^i_j) \tag{25}$$

$$c^{i+1}_j = \frac{1}{J-1} \sum_{j' \neq j} h^{i+1}_{j'} \tag{26}$$

The first hidden state is obtained through an encoder function $h^0_j = r(s_j)$, which takes the state observation of agent $j$ and outputs its hidden state. The encoder $r$ is a single-layer neural network. After the last round of communication, a decoder function, made of a single-layer neural network followed by a softmax, is used to convert the hidden state $h^K_j$ to a probability distribution over the space of actions, $q(h^K_j)$. The action is chosen by sampling from this distribution. A variant of the broadcast transmission represented by Equation (26) is a local connectivity scheme:

$$c^{i+1}_j = \frac{1}{|N(j)|} \sum_{j' \in N(j)} h^{i+1}_{j'} \tag{27}$$

where $N(j)$ is the neighborhood of agent $j$. This model was tested on various difficult tasks, including a traffic simulation with four connected crossroads, with the objective being to avoid collisions while trying to maximize the vehicle flow.
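One round of broadcast communication can be sketched in a few lines. The sketch assumes, as in the CommNet formulation of [42], that the message received by an agent is the mean of the other agents' hidden states; the function name `comm_round` and the placeholder module `f_add` (an element-wise sum standing in for the learned network $f^i$) are hypothetical.

```python
def comm_round(hidden, comm, f):
    """One communication step for J >= 2 agents.

    hidden[j] and comm[j] are the hidden state and incoming message of agent j.
    Returns the next hidden states and the next messages, where each message
    is the mean of the *other* agents' new hidden states.
    """
    new_hidden = [f(h, c) for h, c in zip(hidden, comm)]
    n, dim = len(new_hidden), len(new_hidden[0])
    totals = [sum(h[k] for h in new_hidden) for k in range(dim)]
    new_comm = [[(totals[k] - h[k]) / (n - 1) for k in range(dim)]
                for h in new_hidden]
    return new_hidden, new_comm


def f_add(h, c):
    """Placeholder for the learned module: element-wise sum (illustration only)."""
    return [a + b for a, b in zip(h, c)]


# Three agents with two-dimensional hidden states and no initial message.
h0 = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
c0 = [[0.0, 0.0] for _ in h0]
h1, c1 = comm_round(h0, c0, f_add)
```

Restricting the average to a neighborhood of each agent instead of all other agents gives the local connectivity variant.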
Promising results were obtained even with very limited, if not absent, observability of the state. Another similar work addressing the communication problem in a multi-agent partially observable setting was presented in [43] using a deep distributed recurrent Q-network (DDRQN) architecture. This approach takes inspiration from the single-agent DRQN algorithm [34] and generalizes it to a multi-agent setting by making three fundamental modifications: (i) last-action inputs, giving each agent access to its previous action as an input for the next time-step; (ii) inter-agent weight sharing, in which a single network is learned, resulting in a quicker learning process, while still allowing diverse behavior between agents, as the agents receive different observations and thus evolve different hidden states; and (iii) disabling experience replay, as the nonstationarity of the environment renders old experiences obsolete or misleading. As a proof of concept, this architecture was used to solve two multi-agent reinforcement learning problems based on two well-known riddles, the hat riddle and the switch riddle. In these problems, the optimal solution can only be achieved if the agents reach a consensus on a communication protocol to use to solve the task. In Table 1, a summary of all the presented algorithms can be found, with a brief explanation of their principal features and their scientific spread. The latter was evaluated according to the number of search results from the Google Scholar indexing service. In Figure 4, a summary of all the algorithms with regard to the challenges they address is presented.

Benchmark Environments for Multi-Agent Systems
Reinforcement learning, especially when compared to more traditional data-driven machine learning approaches, presents an implicit difficulty in terms of evaluating the performance of its algorithms. For small problems, it is possible to compare the obtained policies to optimal ones computed using, for example, game theory. The performance is evaluated as the number of steps and/or episodes required to converge to the optimal policy. When the state and action spaces grow in size (for example, due to the introduction of multiple agents), the optimal solution gradually becomes intractable, and this type of approach cannot be used. Since reinforcement learning uses an online, experience-based approach to learn policies with data generated by a simulator or by in-field implementation, it is a natural fit to use the same infrastructure during the evaluation phase to determine the effective performance of the algorithm. There are two main types of environments used to evaluate performance: continuous and episodic simulations. In a continuous simulation, the agents act in an environment that is capable of generating tasks over a theoretically infinite time horizon, and the performance can be represented as the cumulative reward over a fixed time-step window. This could apply to a traffic simulation in which the performance is determined by the total number of collisions in a fixed time slot. On the other hand, episodic simulations are characterized by a number of final states that represent the completion of a task and a maximal time-step at which the system reaches a terminal state, indicating that the task has failed. In this case, a possible benchmark could be to run a number of episodes and present the percentage of completed tasks over the number of simulated episodes.
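The two evaluation styles described above reduce to simple statistics over logged data. The following sketch uses hypothetical helper names (`windowed_return`, `success_rate`) and assumes non-overlapping windows for the continuous case.

```python
def windowed_return(rewards, window):
    """Continuous simulation: sum of rewards over consecutive fixed-size windows."""
    return [sum(rewards[i:i + window])
            for i in range(0, len(rewards) - window + 1, window)]


def success_rate(outcomes):
    """Episodic simulation: fraction of episodes that ended in a goal state."""
    return sum(outcomes) / len(outcomes)


# Toy logs (assumptions for illustration only).
window_scores = windowed_return([1, 0, -1, 2, 2, 0], window=3)
rate = success_rate([True, False, True, True])
```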
MazeBase [44] is a highly configurable, open source, 2D learning environment that is suitable for cooperative and competitive tasks. The environment is represented by a rectangular grid in which a set of items is placed; these items could be obstacles, goal grids, movable objects, doors, and switches, granting a flexible approach for task definitions.
MazeBase was developed for single-agent environments but is easily adapted to multi-agent scenarios, as was done in [42] to create a traffic junction, as shown in Figure 5. The environment can be sensed by the agents as a set of features or directly using the pixel representation of the environment as an input. Another useful framework to evaluate the decision making and coordination capabilities of a group of agents is the strategic video-game StarCraft; in particular, the StarCraft II Learning Environment (SC2LE) [45] is a combat-based learning environment that allows the development of algorithms with heterogeneous agents with partial observability. The difficulty of the problem makes it a good benchmark for cooperative multi-agent algorithms. If the end-goal of the algorithm is to develop robotic movement control, the physics-based environment of MuJoCo [46] is ideal. This environment offers some robotics-related features, such as the avoidance of joint violation and the capability to operate in continuous action spaces.

Applications
MARL algorithms have been used in a variety of applications in different fields thanks to their ability to model competitive and cooperative scenarios. In [47], a team of UAVs equipped with downward-facing cameras was considered; the UAVs were assigned the task of optimal sensing coverage. The drone fleet had to cover an area of interest under its field of view (FOV) while minimizing the overlap between camera captures. This task is required in several applications, such as wildlife monitoring or search and rescue, where the use of UAVs is growing due to their mobility and ease of deployment. The UAVs are considered as independent agents, and the convergence to a single joint action policy is obtained by means of social conventions. The agents act in a predefined order so that each agent can observe the preceding actions of its teammates and make a decision accordingly. This action selection order allows easy collision avoidance, as the agents are not able to choose an action that would cause a collision with their predecessors. To emphasize collaboration between agents, a single global reward is used during learning, and to reduce the dimensionality of the state space, several function approximation techniques are implemented. In [48], the authors proposed a novel actor-critic multi-agent reinforcement learning algorithm to address the problem of multi-UAV target assignment and path planning (MUTAPP). A team of UAVs operates in an environment containing target locations to reach and locations to avoid and has to decide the most efficient way to maneuver while avoiding collisions. The critic and the actor are designed according to the multi-agent deep deterministic policy gradient (MADDPG) algorithm [49]. The critic of each agent shares its actions and observations during training, while the actor network works exclusively based on local information.
The reward signal is factorized into three components, with each one being responsible for a desired behavior: minimizing travel distance, avoiding collisions with other agents, or avoiding collisions with target areas. The evaluation was performed using the OpenAI platform and showed that the proposed algorithm can solve the problem of MUTAPP. In terms of convergence speed over the number of agents, the method showed some evident limitations in scaling.
Another UAV application used a UAV team to implement a network for line-of-sight (LoS) communications, using MARL to optimize resource management, including the transmit power and subchannel selection [50]. The multi-agent setting is addressed via independent learner agents using Q-learning to maximize the future rewards, defined as the difference between the data throughput and the cost of power consumption. Agents select an action tuple $\theta_m(t) = (a_m(t), c_m(t), p_m(t))$, where $a_m(t)$ is the selected user, $c_m(t)$ is the selected subchannel and $p_m(t)$ is the power level. A simulation showed that, in this context, if a high exploration level is selected ($\epsilon = 0.5$), the algorithm can reach a tradeoff between system performance and the amount of information exchange required by other methods in this setting, such as the Gale-Shapley algorithm, which is based on matching theory for user selection [51]. In [52], the authors study the problem of the limited spectrum in UAV networks considering a relay-based cooperative spectrum leasing scenario. The UAV fleet needs to transmit its data to a fusion center and does that by forwarding data packets for a ground user in exchange for spectrum access. The objective of the algorithm is to partition the UAVs into two groups: the relaying group, which handles the data transfer for the ground user, and the transmitting group, which forwards packets to the UAV fusion center. The learning algorithm is developed in a distributed way using independent Q-learning agents that do not need to communicate; each agent only has access to the task partition, which serves as the state (we can consider this to be a fully observable setting). The algorithm was tested in scenarios with two UAVs and six UAVs and in both cases managed to achieve the optimal configuration.
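The ingredients of an independent Q-learning agent of the kind described above can be sketched as follows. This is a generic tabular sketch, not the implementation of [50] or [52]: the linear power cost, the parameter names, and the learning-rate and discount values are assumptions for illustration.

```python
import random


def reward(throughput, power, cost_per_unit):
    """Throughput minus the cost of the consumed power (linear cost assumed)."""
    return throughput - cost_per_unit * power


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update applied to the agent's own table."""
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])


rng = random.Random(0)
q_table = [[0.0, 0.0], [1.0, 2.0]]          # two states, two actions
r = reward(throughput=5.0, power=2.0, cost_per_unit=0.5)
best = epsilon_greedy([0.2, 0.7, 0.1], epsilon=0.0, rng=rng)
q_update(q_table, s=0, a=1, r=1.0, s_next=1)
```

Each agent keeps and updates its own table and treats the other agents as part of the environment, which is what makes the learners independent.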
In [53], a multi-agent approach based on a CommNet architecture [42] was proposed to coordinate the operation of the wireless recharging towers of a group of UAVs. The aim of the algorithm was to schedule which UAV to serve, determine how much energy should be delivered, and share the energy between charging towers. Each tower had access to a local photo-voltaic power generator, and the energy was shared to minimize the purchase of electricity from the energy market, resulting in lower operating costs. The proposed algorithm was tested in an urban simulation consisting of 30 UAVs and four charging towers. It showed performance improvements over the baseline policies of random sharing and scheduling.
Applications of MARL are not limited to the control of UAV fleets. In [54], the problem of a joint top-down active search of multiple objects was addressed. Each detector was an agent, and deep reinforcement learning was used to learn the optimal policy for object localization. Coordination between agents was obtained by the use of a differential message for each agent, $m(a, s; \theta_m)$, which was a function of the action selected and the state of the environment, with parameters $\theta_m$. The agent-wise Q-function was then defined to take the messages received from the other agents as additional inputs. Each agent could control the effect of messages from other agents on its decision making process through a learned gating mechanism. The algorithm was tested with a two-agent implementation on a series of datasets for joint image detection of man-bike, ball-racket, or person-handbag; the joint model achieved good performance using fewer iterations than the single-agent counterpart. Agents managed to "help" each other by sending clues about the correlations between object locations in the messages. In [55], a deep multi-agent reinforcement learning algorithm was developed to regulate the energy exchange between a community of buildings with heterogeneous energy production and storage capabilities. The objective was to reach nearly zero energy community (nZEC) status, which is defined as "A micro-grid that has distributed generation, storage, delivery, consumption and zero net annual energy balance." The buildings were modeled as DRL agents, and the authors proposed the presence of a community monitoring service (CMS) to aggregate data from all the agents and enable cooperation. Each agent used its local energy generation, consumption, and storage states, together with the energy balance of the entire community, as its state.
This information was used to select an optimal action, maximizing the global reward signal given by the CMS, which was the negative of the community energy status, $r = -\sum_{i} \left( c(h_i) - g(h_i) \right)$, where $c(h_i)$ is the energy consumed by the i-th house and $g(h_i)$ is the energy generated by the i-th house. This approach was tested in a simulation of a summer and a winter setting with up to 10 agents and compared with some behavioral baselines, including never-share, always-share and a random selection of actions. The baselines were outperformed, especially in the summer setting. A limitation of this approach is that the learning was conducted using an episodic approach, and thus there is no guarantee that using an online learning approach would lead to the same convergence to an optimum. In [56], a novel multi-agent reinforcement learning algorithm called Equilibrium selection MARL (ES-MARL) was presented to control the energy scheduling of a residential microgrid. The microgrid consisted of renewable energy generators (RG) (wind turbines and photo-voltaic), households that demanded energy from the grid and a number of electric vehicles (EV) that could offer or request energy when connected to a recharging station with Vehicle to Grid (V2G) connectivity. The algorithm uses several types of agents: an EV Aggregator agent that macromanages the energy exchanges of all the parked EVs by demanding a certain charging power from the grid or by offering power to the grid while selecting the price; and a User Aggregator agent, which receives the energy demand of the residential households and decides how much load to curtail (e.g., reducing air conditioning) and how much load to shift to another time-step (e.g., postponing the use of a washing machine). Two RG agents, one for each type of energy generation, were used to decide the price for selling their energy production.
Cooperation between the heterogeneous agents is achieved through the action of an Equilibrium Selecting Agent, whose objective is to negotiate separately with all the agents to select the optimal equilibrium based on the average reward. Several scenarios of the microgrid were simulated, and the proposed approach showed a higher average reward compared to single-agent reinforcement learning approaches. When compared with another MARL algorithm, Nash-Q [57], ES-MARL showed a faster convergence rate.
A similar application, but for an industrial production control setting, was presented in [58]. An independent MARL algorithm based on Proximal Policy Optimization (PPO) was proposed to control the energy exchanges in a factory composed of local power generators (renewable and fuel based), a battery system and a certain number of resources that could consume power to produce a variety of products. Each of these elements was represented in the algorithm as a separate agent, alongside a market agent that regulated the energy purchase from the energy market. The coordination between agents was encouraged by the use of a global reward combined with agent-specific local rewards. The reward function could be decomposed into energy and production costs that needed to be minimized. The proposed algorithm was compared to a reactive control strategy (RCS) and a predictive-reactive control strategy (PCS). MARL outperformed the RCS but did not match the performance of the PCS, indicating that it was able to reach only a local optimum. Comparing the time required to make a decision (1 s for MARL and 2.5 h for PCS), the MARL approach showed the capability to operate online and change its policy in real-time according to stochastic changes in the environment, such as changes of electricity cost or the failure of a production job.
The autonomous driving setting is a natural framework to develop MARL algorithms; however, controlling vehicles with passengers on-board requires strict safety guarantees that are not compatible with the learned nature of MARL. In [59], a safe reinforcement learning algorithm was presented. The policy function was decomposed into a learned policy for desires and a trajectory planning component. The desires policy was related to granting a comfortable driving experience (for example, the absence of sharp curves or sudden acceleration); it produced a cost function over driving trajectories and was learned using policy gradient methods. The trajectory planning, on the other hand, was not learned and was tied to hard constraints. It took the cost function as its input and aimed to find a trajectory that minimized the cost while enforcing driving safety. For the trajectory planning, an approach based on the option framework was used [60]. The resulting algorithm was tested in the notably difficult setting of a double-merge intersection while avoiding collisions.
A multi-agent reinforcement learning framework was used to address the fleet management problem for large-scale online ride-sharing platforms such as Uber and Lyft [61]. The authors proposed two algorithms, namely, contextual DQN (cDQN) and contextual actor-critic (cA2C), which allocate each car (represented by an agent) to a particular zone of a city, divided into hexagonal grids. The computational efficiency was increased by the use of contexts: a geographical context was employed that reduced the action space of an agent in a grid by filtering the actions that would lead to an infeasible grid, as well as a collaborative context that avoided situations in which the agents moved in conflicting directions (for example, swapping grids at a given time). The efficiency of these algorithms was tested in a simulator environment calibrated using the historical data provided by the Didi Chuxing ride-sharing app. They were compared to independent DQN and to the historical data, achieving good performance in terms of gross merchandise volume (GMV) and order response rate.
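The geographical context amounts to masking actions that would leave the map. The sketch below is only an illustration of that idea, not the method of [61]: the axial coordinate system for the hexagonal grid, the inclusion of a "stay" action, and the names `HEX_DIRS` and `feasible_actions` are all assumptions.

```python
# Six neighbour offsets of a hexagonal cell in axial coordinates (q, r).
HEX_DIRS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def feasible_actions(cell, valid_cells):
    """Destinations an agent may choose: stay in place, or move to a
    neighbouring cell that belongs to the map (assumed action set)."""
    q, r = cell
    moves = [(q + dq, r + dr) for dq, dr in HEX_DIRS]
    return [cell] + [m for m in moves if m in valid_cells]


# Toy map with three cells (assumption for illustration only).
city = {(0, 0), (1, 0), (0, 1)}
options = feasible_actions((0, 0), city)
```

Restricting the argmax of the Q-values (or the support of the policy) to `options` keeps the agent from selecting a destination outside the city map.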
In [62], an independent deep Q-network (IDQN) architecture was used to address a heterogeneous multi-junction urban traffic control scenario. The state was an image-like representation of the simulator environment, and the actions for each agent (which represents an intersection) were the possible configurations of a traffic light. The reward to be maximized was connected to the cumulative waiting time of all the vehicles in the road network. The algorithm was tested in the open-source traffic simulator SUMO [63], showing promising results, particularly in a low-traffic setting.
In [64], a framework for developing an RL algorithm for the smart charge-discharge of a lithium battery pack was proposed in order to achieve a longer lifespan for electric vehicles, cellular phones and embedded systems. A multi-agent actor-critic algorithm in which every battery of the pack was represented by an agent was compared to a weighted-k round-robin heuristic [65] and managed to achieve a better overall lifespan, as well as maintaining the battery temperature under a safety threshold. Another capability of the MARL approach was to adapt to several lithium battery models while still using the same structure.
MARL was used in the social sciences to study common pool resource (CPR) appropriation [66] and sequential social dilemmas (SSD) [67]. The behavior of a group of people was analyzed, representing each person as a self-interested agent that aimed to maximize its own rewards. Independent Q-learning was used in [68] to address the mobile network management problem. The control parameters of a group of base stations (BSs), such as transmission power (TXP) and tilt, were optimized using the quality of service of the mobile terminals in the range of the base station as a local reinforcement signal. The proposed algorithm used the current TXP value and the connection quality of the terminals as the state space. The environment state was computed independently for each BS and abstracted to reduce its dimensionality. The state space, with the exclusion of the control parameters, was used to compute the reward, which was a linear combination of all the quality-of-service vectors. The effectiveness of the proposed algorithm was tested in a simulator for a mobile network developed by Nokia, composed of 12 base stations and 2000 mobile users moving randomly around the simulation area. Simulations showed an increase in the reward received by the cells; however, there were fluctuations in performance. All of the presented applications are summarized in Table 2, divided by field of application.

Conclusions
Multi-agent reinforcement learning is a new, promising branch of machine learning theory. The technological trend is moving towards distributed systems composed of multitudes of computational units, which is also due to the development of the IoT and the edge computing sector. MARL could be the answer to the realization of intelligent systems that are capable of learning how to cooperate to maximize their efficiency. The development of such algorithms can eliminate the need for multi-agent systems to interface with a centralized controller, such as a cloud server, minimizing the time required to select an action and, from a reliability perspective, removing the single point of failure. In this work, the main challenges in the development of MARL algorithms were presented, addressing the nonstationarity of the environment, the scaling, and the need to move to partially observable settings as key components of a fast-converging, efficient algorithm. The research community has proposed an assortment of solutions to these challenges in recent years. It was shown that MARL algorithms have been used to address a large variety of applications, such as traffic light control, autonomous driving, and smart energy grids; however, the vast majority of approaches have adopted an independent learning paradigm. It would be interesting to observe the performance of more MARL algorithm typologies in real-world applications.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: