Multi-Agent Distributed Deep Deterministic Policy Gradient for Partially Observable Tracking

In many existing multi-agent reinforcement learning tasks, each agent observes all the other agents from its own perspective. In addition, the training process is centralized, that is, the critic of each agent can access the policies of all the agents. This scheme has certain limitations, since in practical applications every single agent can only obtain the information of its neighbor agents due to the limited communication range. Therefore, in this paper, a multi-agent distributed deep deterministic policy gradient (MAD3PG) approach is presented with decentralized actors and distributed critics to realize multi-agent distributed tracking. The distinguishing feature of the proposed framework is that we adopt multi-agent distributed training with decentralized execution, where each critic takes only the agent's and its neighbor agents' policies into account. Experiments were conducted in distributed tracking tasks based on multi-agent particle environments, where N (N = 3, N = 5) agents track a target agent with partial observations. The results showed that the proposed method achieves a higher reward with a shorter training time compared to other methods, including MADDPG, DDPG, PPO, and DQN. The proposed method thus leads to more efficient and effective multi-agent tracking.


Introduction
Multi-agent systems (MASs) comprise intelligent entities (robots, vehicles, machines, etc.) that interact within a common environment [1,2]. The tracking of a target in MASs often plays a pivotal part in a variety of applications, including convoys, surveillance, and navigation [3][4][5][6].
In the scientific literature on MASs, the theory and applications related to reinforcement learning (RL) have attracted considerable interest [7][8][9]. RL is an area of machine learning that provides a formalism for behavior, concerning how agents should take effective actions in an environment so as to maximize the long-term reward [10]. Classical RL methods are limited to tasks with a small state space and action space that are generally discrete. However, more complex tasks, which are closer to practical applications, often have a large state space and a continuous action space, and classical RL methods have difficulty dealing with such high-dimensional input. With the improvement of computing and storage capacity, deep reinforcement learning (DRL), a combination of deep learning (DL) and RL, has achieved human-level or higher performance on challenging games [11][12][13][14]. Multi-agent designs are more robust than single monolithic designs, since agent failures can be compensated for, which makes the system more scalable and able to handle greater complexity in the sense of [15]. Multi-agent deep reinforcement learning (MADRL) applies the ideas and algorithms of DRL to the learning of multi-agent systems [16][17][18][19][20], which is an important method for developing swarm intelligence [21,22] and has received increasing attention in recent years [23][24][25][26][27][28][29][30].
The basic assumption of RL is to increase the accumulated reward as much as possible; thus, the design of the reward function is certainly a key step. Several recent studies investigating the difference reward [31], the coordinated learning without exploratory action noise (CLEAN) reward [32], and the counterfactual reward [33] have been carried out on MADRL. Independent Q-learning (IQL) [34] simply runs a Q-learning algorithm for each agent, which works well in some applications. Value decomposition networks (VDNs) [35] learn to divide the joint value function into pieces that are agentwise and additive. Monotonic value function factorization (QMIX) [36] can combine the value of each agent based on a local observation in a complex and nonlinear way to estimate the joint action value. Unlike VDNs and QMIX, the factorizable pieces in learning to factorize with transformation (QTRAN) [37] are not constrained by additivity or monotonicity, giving a more general factorization of joint action-value functions. In a different line, the multi-agent deep deterministic policy gradient (MADDPG) [38] extends the deep deterministic policy gradient (DDPG) method to multi-agent environments, where the critic takes the information about the policies of all the agents into account, while the actor has access only to the information of the agent itself and the associated information of the other agents from the perspective of the agent.
The methods [35][36][37][38] proposed in recent years took the decentralized partially observable Markov decision process (Dec-POMDP) [39,40] as a de facto standard to model cooperative multi-agent tasks, but there is much less research on partially observable stochastic games (POSGs) [41] in multi-agent systems. The agents in a POSG do not use a joint reward signal, which makes it suitable for describing decision-making problems in scenarios that consist of multiple agents with partial observations. While MADRL is a growing field, relatively little research has been carried out on distributed tracking problems in the field. The distributed mode is well suited to multi-agent tracking, as it requires less computation and lower costs. However, in the literature mentioned above, each agent observes all the other agents, and the training process of multiple agents is centralized [35][36][37][38]. The generalizability of these results is subject to certain limitations. In practical applications, due to the limitations of communication techniques, the communication range is constrained, and it is impractical for each agent to observe all the other agents' information. Besides, MASs with centralized training require a large amount of computation and high training costs. To alleviate these problems, research on distributed problems, where each agent can only obtain the observations of its neighbor agents in distributed tracking cases, is of significance. There remains a paucity of studies focusing specifically on the issue of multi-agent reinforcement learning for distributed partially observable tracking, which is the main content of our research.
In brief, the aforementioned methods ignored several problems as follows. In practical tracking processes, each agent could hardly obtain the observation of all the other agents. Besides, the right reward function should be given to each agent to satisfy the goal of the system. In addition, the centralized training mode is not suitable for distributed tracking of multi-agent learning.
Motivated by the above discussions, this paper explores scenarios in a distributed topology, in which a number of agents with partial observations of neighbor agents cooperate to track the target. To realize the multi-agent distributed tracking, we provide a novel method of multi-agent reinforcement learning, as shown in Figure 1. This paper investigates multi-agent distributed tracking problems within the framework of distributed training and decentralized execution. The major contributions are addressed as follows:
• A new multi-agent tracking environment based on the multi-agent particle environment (MPE) is built. Different from receiving the observations of all agents, in our scenarios, each agent only obtains information from its neighbor agents, which is more practical under the condition of limited communication;
• In consideration of the distributed partially observable condition, this study establishes a framework of a private reward mechanism for the multi-agent tracking system, providing a kind of incentive method in distributed scenarios. This is the foundation of further research;
• A novel methodology named the multi-agent distributed deep deterministic policy gradient (MAD3PG) for the tracking system is proposed. In this method, a distributed critic and a decentralized actor are designed for the distributed scenario. The importance and originality of the MAD3PG is that it adopts distributed training with decentralized execution rather than the centralized or decentralized training that is widely used in current approaches. Compared with recent relevant algorithms, the MAD3PG has a shorter time cost and better training performance.
The remaining part is organized as follows. Section 2 introduces the necessary background regarding graph theory, similarities and differences between the Dec-POMDP and POSG, and the current methods related to our research. Section 3 addresses the main contributions, including the essential settings of the distributed tracking model and the structure of the MAD3PG. The results of empirically evaluating the MAD3PG and compared methods are discussed in Section 4. Section 5 concludes this work and briefly describes the direction for future research.

Graph Theory
In the multi-agent tracking scenario, agents in different positions aim to become consistent with the target agent through learning, which can often be represented as a set of different agents with an inherent high-level structure [42]. Every single agent can be abstracted as a node. Then, the communication topology of the target tracking system can be described as a graph, denoted by G = (N, E, A), where N = {1, 2, . . . , N}, E ⊆ N × N, and A = [α_ij] ∈ R^{N×N} represent a node set, an edge set, and a weighted adjacency matrix, respectively. In a directed graph, (i, j) and (j, i), which belong to E, are regarded as different edges. The nodes that can be accessed by node i are regarded as its neighbors. The set of neighbors of node i is described by N_i = {j | j ∈ N, (i, j) ∈ E}. If j ∈ N_i, α_ij > 0; otherwise, α_ij = 0. The connections between the tracking agents and the target are expressed by a weighted matrix B = diag{β_1, β_2, . . . , β_N}. If the i-th tracking agent can obtain the target's information, β_i > 0; otherwise, β_i = 0.
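As a concrete sketch, the neighbor sets N_i and target visibility induced by A and B can be computed as follows; the three-agent chain topology and unit weights here are illustrative assumptions, not values from the paper.

```python
# Sketch: deriving neighbor sets from the weighted adjacency matrix A
# and target visibility from the weights B (illustrative values).

def neighbors(A, i):
    """Return the neighbor set N_i = {j : alpha_ij > 0, j != i}."""
    return {j for j, w in enumerate(A[i]) if w > 0 and j != i}

def sees_target(B, i):
    """Agent i observes the target iff beta_i > 0."""
    return B[i] > 0

# A 3-agent example topology: agent 0 <-> agent 1, agent 1 <-> agent 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
B = [1, 0, 0]  # only agent 0 observes the target

print(neighbors(A, 1))    # {0, 2}
print(sees_target(B, 2))  # False
```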

Multi-Agent Decision Process
An agent with full observability directly observes the environment state, formalized as a Markov decision process (MDP). An agent with partial observability indirectly observes the environment, formalized as a partially observable Markov decision process (POMDP) [43]. In MASs, the process in which a group of agents works together for a common reward is called a decentralized POMDP (Dec-POMDP) [39,40]. In general, the Dec-POMDP can be expressed by a tuple < I, S, {A_i}, P, {Ω_i}, O, R, h >, where the meaning of each symbol can be obtained by referring to Table 1. This model contains only one reward function, meaning that the agents operate together toward the same objective. The Dec-POMDP is useful when all agents have a common goal. However, the partially observable stochastic game (POSG) [41] allows each agent to have a different objective encoded by a private reward function; it is represented by a tuple < I, S, {A_i}, P, {Ω_i}, O, {R_i}, h >, where R_i refers to the individual reward function of the i-th agent. The difference between the POSG and the Dec-POMDP can be easily distinguished from Table 1. Except for the reward function, all the components in the POSG are the same as those in the Dec-POMDP.
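The structural difference between the two models can be sketched in code: the Dec-POMDP carries one shared reward function, while the POSG carries one private reward per agent. All names and reward shapes below are illustrative, not from the paper.

```python
# Sketch: the only structural difference between the Dec-POMDP and POSG tuples
# is the reward component -- one shared R versus per-agent rewards {R_i}.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    agents: Sequence[int]
    reward: Callable             # one joint reward R(s, a) shared by all agents

@dataclass
class POSG:
    agents: Sequence[int]
    rewards: Sequence[Callable]  # R_i(s, a): one private reward per agent

# Toy private rewards: each agent i scores state s by -|s - i|.
posg = POSG(agents=[0, 1, 2],
            rewards=[lambda s, a, i=i: -abs(s - i) for i in range(3)])
print([r(1, None) for r in posg.rewards])  # the same state is valued differently
```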

Policy Gradient
The agent's goal is to learn a policy, which is a mapping from states to optimal actions. The parametric representation of the policy is π_θ(s, a) = P[a | s, θ]. In order to learn the policy parameter θ, policy gradient methods seek to maximize performance according to the gradient of some scalar performance measure J(θ) of the policy parameters. The update process can be regarded as gradient ascent in J: θ_{t+1} = θ_t + α∇J(θ_t). ∇_θ J(θ) is the policy gradient [10], as shown in Equation (1):

∇_θ J(θ) = E_π [∇_θ log π_θ(s, a) Q̂^π(s, a)],   (1)

where Q̂^π(s, a) is an estimate of the return if the action a is taken in the state s. As one of the variants of the policy gradient, proximal policy optimization (PPO) [13] adopts importance sampling to realize on-policy learning, and it is the default RL algorithm at OpenAI. Either an adaptive KL penalty coefficient or a clipped surrogate objective is available in PPO. In [13], with the clipping setting, PPO was simpler to implement and had better performance. The multi-agent scenario in this paper required only a few lines of change to a vanilla PPO implementation for the comparison in the experiments.
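A minimal sketch of the gradient-ascent update θ_{t+1} = θ_t + α∇J(θ_t), assuming a softmax policy on a toy two-armed bandit (not a setting from the paper): the score-function form of the gradient pushes probability toward the rewarded action.

```python
# REINFORCE-style sketch: theta <- theta + alpha * r * grad log pi(a),
# with grad log pi(a) wrt theta_k = 1{k == a} - pi(k) for a softmax policy.
# The bandit rewards are illustrative.
import math, random

random.seed(0)
theta = [0.0, 0.0]    # one logit per action
rewards = [0.0, 1.0]  # action 1 is the better arm
alpha = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    r = rewards[a]
    for k in range(2):  # ascend along r * grad log pi(a)
        theta[k] += alpha * r * ((1.0 if k == a else 0.0) - probs[k])

print(softmax(theta)[1])  # probability of the better action approaches 1
```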

Value Function Methods
Q(s, a) denotes an array estimate of the action-value function q_π(s, a) or q_*(s, a). Q^π(s, a) denotes an estimate of the value when action a is taken in state s under policy π, which is used in Q-learning. One representation of the Q-function is Q^π(s, a) = E_{s'}[r(s, a) + γ E_{a'∼π}[Q^π(s', a')]]. Q^*(s, a) refers to the estimated action-value under the optimal policy.
The parameters θ of the Q-network are updated by minimizing the loss L(θ) = E[(y − Q(s, a | θ))^2], where y = r + γ max_{a'} Q̂(s', a' | θ'), and Q̂ stands for the target action-value function with delayed parameters θ' [44]. In a multi-agent environment, each agent i owns an independent optimal function Q_i that is updated by learning. Value function methods are key to most reinforcement learning algorithms and are essential for the efficient search in the space of policies. There is a more general view of the family of Q-learning algorithms, which have three main processes: data collection, target update, and Q-function regression [45,46].

Deep Deterministic Policy Gradient
When the action a is a continuous vector, Q-learning has no general solution to the maximization over actions; this problem can be addressed by a neural network μ(s), named the actor, that outputs the action [47,48]. Meanwhile, the Q-function, as the critic, tells the actor what kind of behavior will obtain more value. They are updated in the deep deterministic policy gradient (DDPG) as follows.
The loss function of the critic is expressed by:

L(φ) = E_{(s,a,r,s')∼D} [(Q_φ(s, a) − y)^2],

where:

y = r + γ Q_{φ'}(s', μ_{θ'}(s')).

By the sampled policy gradient, the actor can be updated as:

∇_θ J ≈ E_{s∼D} [∇_θ μ_θ(s) ∇_a Q_φ(s, a) |_{a=μ_θ(s)}],

where θ refers to the parameters of the policy networks, which are updated using the chain rule. In addition, there exist a target Q-function Q' and a target actor μ'; φ' and θ' represent the parameters of Q' and μ', respectively.
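A sketch of the DDPG critic target and loss, using toy linear actor and critic functions in place of neural networks (all parameter and transition values are illustrative):

```python
# DDPG sketch: the critic target y = r + gamma * Q'(s', mu'(s')) uses the
# TARGET actor to pick the next action and the TARGET critic to value it.
GAMMA = 0.95

def critic(phi, s, a):  # toy linear critic Q_phi(s, a)
    return phi[0] * s + phi[1] * a

def actor(theta, s):    # toy linear deterministic policy mu_theta(s)
    return theta * s

phi_target, theta_target = [0.5, 1.0], 0.2  # target network parameters

def ddpg_target(r, s_next):
    a_next = actor(theta_target, s_next)    # action from the target actor
    return r + GAMMA * critic(phi_target, s_next, a_next)

def critic_loss(phi, batch):
    # mean squared TD error over a batch of (s, a, r, s') transitions
    return sum((critic(phi, s, a) - ddpg_target(r, s2)) ** 2
               for s, a, r, s2 in batch) / len(batch)

batch = [(1.0, 0.1, 0.0, 2.0), (2.0, 0.3, 1.0, 0.0)]
print(round(critic_loss([0.4, 0.9], batch), 4))
```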

Multi-Agent Deep Deterministic Policy Gradient
The multi-agent deep deterministic policy gradient (MADDPG) [38] is a common deep reinforcement learning algorithm for environments where multiple agents interact with each other. The MADDPG is based on a framework of centralized training and decentralized execution (CTDE). That is, agents can share experience during training, which is implemented by a shared experience replay buffer D containing the information of states, next states, actions, and rewards, while each agent uses only local observations at execution time. Consider a scenario with N agents with policies μ_θ_i; the gradient is calculated by:

∇_{θ_i} J(μ_i) = E_{x,a∼D} [∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i^μ(x, a_1, a_2, . . . , a_N) |_{a_i=μ_i(o_i)}],

where Q_i^μ(x, a_1, a_2, . . . , a_N) can be updated in centralized training by the following loss function:

L(θ_i) = E_{(x,a,r,x')∼D} [(Q_i^μ(x, a_1, a_2, . . . , a_N) − y)^2], with y = r_i + γ Q_i^{μ'}(x', a'_1, . . . , a'_N) |_{a'_j=μ'_θ_j(o_j)},

where μ'_θ_j represents the target policies of the agents and (x, a, r, x') denotes the sample obtained from D.
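The CTDE split can be made concrete by contrasting the inputs of the two networks; the sketch below only assembles input vectors, with illustrative shapes for N = 3 agents.

```python
# Sketch contrasting MADDPG's centralized critic input (all agents' observations
# and actions) with its decentralized actor input (the agent's own observation).

def critic_input(observations, actions):
    """Centralized training: every critic sees x = (o_1..o_N, a_1..a_N)."""
    x = []
    for o in observations:
        x.extend(o)
    for a in actions:
        x.extend(a)
    return x

def actor_input(observations, i):
    """Decentralized execution: agent i's actor sees only o_i."""
    return observations[i]

obs  = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # o_i for N = 3 agents
acts = [[0.1], [0.2], [0.3]]
print(len(critic_input(obs, acts)), len(actor_input(obs, 0)))  # 9 2
```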

Distributed Observation
Consider a partially observable distributed scenario for multi-agent reinforcement learning, where the i-th (i ∈ N) agent receives its own local observation o_i^t rather than the state s^t at time t. For simplicity, the time index t is dropped in the following notation. In this work, instead of containing the states of all the agents, o_i consists of the positions and velocities of the i-th agent, as well as those of its neighbors, according to the topology of the multi-agent tracking system. Besides, each agent i ∈ N keeps its own reward function that is pertinent to the task. The observation of the i-th tracking agent is:

o_i = [X_i, {α_ij X_j}_{j∈N_i}, β_i X_T],

where X_i is a vector that contains the total information of the i-th agent; X_j and X_T refer to vectors that contain the partial information of the j-th agent and the target, respectively; and α_ij and β_i are the connection weights between the i-th and the j-th tracking agents and between the i-th tracking agent and the target agent, respectively.
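A sketch of assembling o_i from the agent's own state, its weighted neighbor states, and the weighted target state, as described above; a zero weight simply means the corresponding information is unavailable. All state values are illustrative.

```python
# Sketch: build agent i's local observation from its own state X_i, the
# neighbor states X_j weighted by alpha_ij, and the target X_T weighted by beta_i.

def observation(i, X, X_T, A, B):
    o = list(X[i])                           # full own information X_i
    for j, w in enumerate(A[i]):
        if w > 0 and j != i:                 # only neighbors contribute
            o.extend(w * x for x in X[j])
    if B[i] > 0:                             # target visible to agent i?
        o.extend(B[i] * x for x in X_T)
    return o

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]     # per-agent position/velocity stubs
X_T = [5.0, 5.0]
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
B = [1, 0, 0]

print(observation(0, X, X_T, A, B))  # own state + neighbor 1 + target
print(observation(2, X, X_T, A, B))  # own state + neighbor 1, no target
```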

Reward Function
In the scenario of multi-agent reinforcement learning, each agent aims at maximizing its expected accumulated reward. The global goal of the multi-agent tracking system is to minimize the distances between each tracking agent and the target as quickly as possible, which are important indexes in the design of the reward function. In this paper, because distributed tracking requires each agent to observe the information of its neighbor agents from its own perspective, using a reward function related to the distributed topology can effectively reduce the computation cost. Thus, the reward of the i-th tracking agent, as shown in (9), is designed with respect to the task demands, as well as the tracking topology:

r_i = −k ∑_{j∈N_i} α_ij ||P_i − P_j|| − m β_i ||P_i − P_T||,   (9)

where P_i, P_j, and P_T refer to the position information of the i-th agent, the j-th agent, and the target agent, respectively. k and m denote task-related reward evaluation factors, which balance the weight between the i-th and the j-th tracking agents and between the i-th tracking agent and the target agent.
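A sketch of a reward with the dependence described above: weighted distances from the i-th agent to its neighbors and, when visible, to the target. The exact functional form of Equation (9) is an assumption here; k = m = 0.1 matches the paper's experimental setting, while the positions and topology are illustrative.

```python
# Sketch of a topology-aware tracking reward: penalize the i-th agent by its
# weighted distances to neighbors and (if visible) to the target.
import math

def reward(i, P, P_T, A, B, k=0.1, m=0.1):
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    r = -k * sum(A[i][j] * dist(P[i], P[j]) for j in range(len(P)) if j != i)
    r -= m * B[i] * dist(P[i], P_T)
    return r

P = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # agent positions
P_T = (0.0, 5.0)                           # target position
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # neighbor weights
B = [1, 0, 0]                              # only agent 0 sees the target

print(round(reward(0, P, P_T, A, B), 2))   # -1.0: 5 m to neighbor 1, 5 m to target
```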

Multi-Agent Distributed Deep Deterministic Policy Gradient
In order to deal with distributed tracking problems in multi-agent RL, a multi-agent distributed deep deterministic policy gradient (MAD3PG) method is proposed, as illustrated in Figure 2.
Our method improves the critic in the training process. Each blue dotted arrow in Figure 2 represents a judgment of whether the j-th (or i-th) agent is a neighbor of the i-th (or j-th) agent. If the judgment is true, the information can be transmitted; otherwise, it cannot. In the MADDPG, the information is transmitted without any such judgment. Thus, in the centralized training of the MADDPG, the information of all the agents is fed into each critic, while in the distributed training of the MAD3PG, only the information of the agent and its neighbors is fed into the corresponding critic.
More concretely, consider a system with N tracking agents and a target, where each agent owns a decentralized actor and a distributed critic, the parameters of which are represented by θ and φ, respectively. As outlined in Algorithm 1, the MAD3PG combines two complementary components, decentralized policies and distributed critics, the details of which are addressed as follows.

Algorithm 1: MAD3PG
 …
 4:  Reset the environment, and obtain the initial state x; each agent i obtains the initial observation o_i
 5:  for t = 1 to T, where T refers to the maximum number of steps in an episode, do
 6:      for agent i = 1 to N do
 7:          Select action a_i = μ_θ_i(o_i) + N_t based on the actor, as well as exploration
 8:          Execute action a_i, then obtain the reward r_i and the new observation o'_i
 …
12:      for agent i = 1 to N do
13:          Randomly sample a mini-batch B of (o, a, r, o') from D
14:          Set the ground truth of the critic networks
15:          Update the distributed critic by minimizing the loss
16:          Update the decentralized actor
17:      end for
18:      if episode mod n = 0, where n denotes the target update interval, then
19:          Update the parameters of the target actor networks and target critic networks for each agent i
20:      end if
21:  end for
22: end for

As illustrated in Figure 2, the input of the i-th agent's policy networks is the observation o_i, which contains its own and its neighbor agents' information, while the output is the action a_i. The gradient for the policy of the i-th agent μ_θ_i with respect to the parameters θ_i is expressed as follows:

∇_{θ_i} J(μ_θ_i) = E_{õ,ã∼D} [∇_{θ_i} μ_θ_i(o_i) ∇_{a_i} Q_i^{μ⊕i}(o_i, {o_j}_{j∈N_i}, a_i, {a_j}_{j∈N_i}) |_{a_i=μ_θ_i(o_i)}],

where D refers to the experience replay buffer, containing the data (õ, ã, r̃, õ') from interaction with the environment; μ = {μ_θ_1, . . . , μ_θ_N} is the set of policies; and μ_θ_j (j ∈ N_i) denotes the policies of the i-th agent's neighbors with parameters θ_j. μ_θ_i uses rewards to directly strengthen or weaken the probability of the chosen behavior. That is, the probability of a positive behavior being selected next time will be enhanced, while the probability of a negative behavior being selected next time will be reduced.
It can be seen from Figure 2 that the input of the i-th agent's critic networks consists of o_i, a_i, and o_j, a_j only if the j-th agent is its neighbor (j ∈ N_i), while the output is the state-action value Q_i. The distributed state-action value function Q_i^{μ⊕i} can be updated by minimizing the loss:

L(φ_i) = E [(Q_i^{μ⊕i}(o_i, {o_j}_{j∈N_i}, a_i, {a_j}_{j∈N_i}) − y_i)^2], with y_i = r_i + γ Q_i^{μ'⊕i}(o'_i, {o'_j}_{j∈N_i}, a'_i, {a'_j}_{j∈N_i}) |_{a'_k=μ_θ'_k(o'_k)},

where μ_θ'_i and μ_θ'_j (j ∈ N_i) are the target policies of the i-th agent and its neighbor(s), which take θ'_i and θ'_j as delayed parameters, respectively. The agents learn by actually interacting with the environment. When collecting data, the agents use the latest value function, with a little exploration, and fetch transitions that are put into a buffer. A first-in, first-out (FIFO) queue can be adopted to evict the oldest entry and let new data in when the buffer fills up. Then, the target parameters are updated from the current parameters at set intervals.
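The key difference from a centralized critic can be sketched as input assembly: agent i's distributed critic consumes only its own and its neighbors' observation-action pairs. The four-agent chain topology and shapes below are illustrative.

```python
# Sketch of the MAD3PG distributed critic input: unlike MADDPG's centralized
# critic, agent i's critic receives only (o_i, a_i) and (o_j, a_j) for j in N_i.

def neighbors(A, i):
    return {j for j, w in enumerate(A[i]) if w > 0 and j != i}

def distributed_critic_input(i, obs, acts, A):
    x = list(obs[i]) + list(acts[i])
    for j in sorted(neighbors(A, i)):  # only neighbor information is fed in
        x += list(obs[j]) + list(acts[j])
    return x

obs  = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
acts = [[0.1], [0.2], [0.3], [0.4]]
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]  # a 4-agent chain topology

# Agent 0 has one neighbor, so its critic sees 2 * (2 + 1) = 6 inputs instead
# of the 4 * (2 + 1) = 12 a centralized critic would take.
print(len(distributed_critic_input(0, obs, acts, A)))  # 6
print(len(distributed_critic_input(1, obs, acts, A)))  # 9
```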
The computational complexity of the proposed algorithm is O(2N(∑_{l_a=1}^{D_a} ζ_{l_a−1} ζ_{l_a} + ∑_{l_c=1}^{D_c} ζ_{l_c−1} ζ_{l_c})), where D_a and D_c stand for the depths of the neural networks in the actor and the critic, respectively. ζ_{l_a−1} and ζ_{l_a} refer to the input and output dimensions of the l_a-th layer in the actor, and ζ_{l_c−1} and ζ_{l_c} represent the input and output dimensions of the l_c-th layer in the critic.
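The expression counts one multiply-accumulate per weight of the actor and critic networks, summed over layers and scaled by the N agents (the factor 2 is taken from the paper's formula as stated). The layer sizes below are illustrative, assuming two 64-unit hidden layers as in the experimental setup.

```python
# Sketch: evaluate the complexity expression
# 2 * N * (sum_{l_a} zeta_{l_a-1} * zeta_{l_a} + sum_{l_c} zeta_{l_c-1} * zeta_{l_c})
# for illustrative layer dimensions.

def mac_count(layer_dims):
    """sum over layers of input_dim * output_dim for one fully connected net."""
    return sum(a * b for a, b in zip(layer_dims, layer_dims[1:]))

N = 3
actor_dims  = [8, 64, 64, 2]   # observation -> hidden -> hidden -> action
critic_dims = [10, 64, 64, 1]  # (obs + actions) -> hidden -> hidden -> Q-value

total = 2 * N * (mac_count(actor_dims) + mac_count(critic_dims))
print(total)  # 57216
```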
Ignoring the redundant information of non-neighbor agents in the environment, the distributed training requires a lower computation cost and a shorter elapsed time. With the framework of distributed training with decentralized execution, the proposed method may work well in scenarios such as collaborative target tracking, a convoy of vehicles, defensive navigation, and the surveillance of a specific area, where each agent is only able to obtain the observations of the agents within its communication range.

Task Description
The proposed algorithm was evaluated through a multi-agent target tracking task. The experiments were performed based on the multi-agent particle environment [38], where the agents can take actions in a two-dimensional space following some basic simulated physics.
In this paper, a scenario is presented where a maneuvering target is tracked by N tracking agents in a two-dimensional space. The agents are unaware of the states of all the other agents in the task. The MAD3PG and the compared methods were trained with the settings of N = 3 and N = 5, where each tracking agent has an observation of its neighbor agents consisting of relative positions and velocities. At each time step, the reward of the i-th agent follows Equation (9). In this tracking scenario, the reward evaluation factors were set as k = 0.1 and m = 0.1.

Implementation Specifications
This work focused on the distributed tracking problem in multi-agent systems, in which each agent receives a private observation. When the discount factor (γ) is closer to zero, the agent pays more attention to the short-term return; when γ is closer to one, the longer-term return becomes more important to the agent. The soft target update rate (τ) governs the update of the target actors and target critics. For the sake of comparison fairness, we adopted the same hyperparameter settings as those in the MADDPG [38], which are specified in Table 2. The actor and critic have the same neural network structure, a two-layer rectified linear unit (ReLU) multilayer perceptron (MLP) with 64 units per layer, trained with the Adam [49] optimizer. In our tracking tasks, each episode took 100 steps in training and testing. To train the neural networks, we needed a dataset of transitions; therefore, there was a data collection process that went out and fetched data. The agents learn in the scenario mentioned above, interacting with each other. They take actions and obtain transitions that are put into the experience replay buffer D. When the buffer fills up, a FIFO queue is used to evict the oldest data. This eviction process ensures that the memory requirements do not grow out of control.
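The buffer-and-eviction scheme just described can be sketched with a bounded deque; the capacity here is illustrative, not the paper's setting.

```python
# Sketch of a FIFO experience replay buffer: a bounded deque evicts the oldest
# transition once the buffer fills up, keeping memory use constant.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # deque drops the oldest entry itself

    def add(self, transition):              # transition = (o, a, r, o_next)
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.data), batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, 0, 0.0, t + 1))

print([tr[0] for tr in buf.data])  # [2, 3, 4]: the oldest two were evicted
```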
Then, the training process is actually a process of Q-function regression. This process fetches a mini-batch of data uniformly from the buffer, fetches the parameters of the target neural networks to calculate the target values, and updates the current Q-networks. Then, the chain rule is adopted to update the parameters of the current policy networks.
There is another, much slower process that updates the parameters of the target neural networks: it grabs the parameters of the current policies and critics and copies them into the parameters of their corresponding target neural networks with the Polyak averaging method at every single step.
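The soft target update can be sketched as Polyak averaging, θ' ← τθ + (1 − τ)θ'; the value τ = 0.01 below is illustrative rather than the setting from Table 2.

```python
# Sketch of the Polyak (soft) target update applied every step:
# the target parameters slowly track the current ones.

def polyak_update(target, current, tau=0.01):
    return [tau * c + (1 - tau) * t for t, c in zip(target, current)]

target, current = [0.0, 0.0], [1.0, 2.0]
for _ in range(500):
    target = polyak_update(target, current)

print([round(x, 2) for x in target])  # target parameters approach the current ones
```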

Experimental Evaluation Results of a System with Three Tracking Agents
This section presents the experimental evaluation results of the proposed MAD3PG algorithm compared with the MADDPG [38], DDPG [48], PPO [13], and DQN [44] mentioned in Section 2 in the scenarios based on the Dec-POMDP and POSG.
No single agent can observe all the other agents; in particular, not every tracking agent can observe the target in the tracking process. Each tracking agent only has an observation of its neighbor agents. At the initial time of tracking, the neighbor set of each agent is determined from the connections between agents. The value of the observation evolves through time, but the connections between agents do not. The weighted graph matrices present the connection status of the multi-agent system with three tracking agents. The numerical comparison of the experimental results of the five methods regarding the time cost and the average tracking rewards in 1000 episodes is given in Table 3. The MAD3PG in the POSG group took the shortest time (5177 s) and obtained the highest reward (−0.013). In addition, Figure 3 displays the experimental results intuitively. What is striking is that the performance of the MAD3PG (five-pointed star) in the POSG group was better than that of the other compared methods. The bar chart (Figure 4) compares the training time cost in the two scenarios with the MAD3PG, MADDPG, DDPG, PPO, and DQN, respectively. It is noticeable that the training time of each method applied to the Dec-POMDP was longer than that applied to the POSG. Overall, less training time was taken by the MAD3PG than by the other compared methods in both scenarios. With the proposed MAD3PG, the shortest time cost was in the POSG training process. Figure 5 presents the mean rewards in an episode, averaged over all tracking agents, obtained by the proposed MAD3PG and the compared methods. At the beginning of the training process, the rewards obtained by the five methods increased alternately and fluctuated greatly. At around the 15,000th episode, the reward of the MAD3PG converged to a relatively stable state with a reward close to zero, and it maintained this level for the rest of the training process.
However, the rewards of the MADDPG, DDPG, PPO, and DQN still fluctuated in the final stages, performing worse than the MAD3PG. Besides, the MAD3PG gained a higher convergence speed than the other compared methods. Overall, it is clear that the MAD3PG showed outstanding performance over the training period: it converged to a higher mean reward with a higher convergence rate and better stability than the compared methods. We tested the models of the tracking system at different episodes saved in the training process with the MAD3PG algorithm. The tracking trajectories at episodes n (n ∈ {1k, 5k, 10k, 15k, 20k, 25k}) are shown in Figure 6. In these figures, red lines represent the trajectories of the target. At the 1k-th episode, the agents do not know how to follow the target. From the 1k-th episode to the 20k-th episode, the agents gradually converge toward the target. Finally, the agents achieve the learning goal. In general, with the increase of the training episodes, the tracking agents gradually learn to track the target.

Experimental Evaluation Results of System with Five Tracking Agents
The basic performance of the proposed algorithm was tested in the scenario with three tracking agents. In order to evaluate the flexibility of the proposed algorithm, the models of the MAD3PG and the compared methods were also trained in a task where a target agent is tracked by five tracking agents with partial observations of different agents. We tested the models with four different weighted graphs describing this scenario. The mean episode reward (E(r)) and the variance of the reward (D(r)) obtained by the MAD3PG and the compared methods are listed in Table 4. Our method (MAD3PG) gained the highest mean episode reward with the smallest variance among the compared methods.

Conclusions
In this paper, a novel distributed training method with decentralized execution was proposed for multi-agent settings with partial observations of neighbor agents. The distributed algorithmic framework of the multi-agent distributed deep deterministic policy gradient (MAD3PG) reduces the interference of the nonstationary environment in multi-agent systems on the critic, and it outperformed the MADDPG, DDPG, PPO, and DQN in the tested tracking environment. With a shorter training time and better performance, the experimental results demonstrated that the presented algorithm is efficient and effective. In future work, it will be worth extending it to multi-agent systems with switching topologies, since the information flow between neighbor agents may be broken or fail with an uncertain probability in practical target tracking applications.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

MASs	Multi-agent systems