Entropy · Article · Open Access · 18 April 2022
Scalable and Transferable Reinforcement Learning for Multi-Agent Mixed Cooperative–Competitive Environments Based on Hierarchical Graph Attention

1 School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Swarms and Network Intelligence

Abstract

Most previous studies on multi-agent systems aim to coordinate agents to achieve a common goal, but their lack of scalability and transferability prevents them from being applied to large-scale multi-agent tasks. To address these limitations, we propose a deep reinforcement learning (DRL) based multi-agent coordination control method for mixed cooperative–competitive environments. To improve scalability and transferability when the method is applied to large-scale multi-agent systems, we construct inter-agent communication and use hierarchical graph attention networks (HGAT) to process the local observations of agents and the messages received from neighbors. We also adopt gated recurrent units (GRU) to address the partial observability issue by recording historical information. The simulation results on a cooperative task and a competitive task not only show the superiority of our method, but also demonstrate its scalability and transferability across tasks of various scales.

1. Introduction

The last few years have witnessed the rapid development of multi-agent systems. Due to their ability to solve complex computing or coordination problems [1], they have been widely used in different fields, such as computer networks [2,3], robotics [4,5], etc. In a multi-agent system, agents learn their policies and execute tasks collaboratively, in either cooperative or competitive environments, by making autonomous decisions. However, in large-scale multi-agent systems, partial observability, scalability, and transferability are three important issues to be addressed when developing efficient and effective multi-agent coordination methods. Firstly, it is impossible for agents to learn their policies from the global state of the environment, as it contains massive information about a large number of agents. Therefore, they need to communicate with other agents in some way to reach consensus on decision making. Secondly, previous methods either learn one policy to control all agents [6] or train the agents' policies individually [7], which is difficult to extend to large numbers of agents. Thirdly, in existing deep-learning-based methods, the policies are trained and tested with the same number of agents, making them untransferable to different scales.
In this paper, we propose a scalable and transferable multi-agent coordination control method, based on deep reinforcement learning (DRL) and hierarchical graph attention networks (HGAT) [8], for mixed cooperative–competitive environments. By regarding the whole system as a graph, HGAT helps agents extract the relationships among different groups of entities in their observations and learn to selectively pay attention to them, which brings high scalability when the method is applied to large-scale multi-agent systems. We introduce inter-agent communication to share agents' local observations with their neighbors and process the received messages through HGAT; therefore, agents can reach consensus by learning from their local observations and the information aggregated from their neighbors. Moreover, we introduce the gated recurrent unit (GRU) [9] into our method to record the historical information of agents and utilize it when determining actions, which improves the policies under partial observability. We also apply parameter sharing to make our method transferable. Compared with previous works, our method achieves better performance in mixed cooperative–competitive environments while acquiring high scalability and transferability.
The rest of this paper is organized as follows. In Section 2, we review the related works. We describe background knowledge of multi-agent reinforcement learning and hierarchical graph attention networks in Section 3. In Section 4, we describe a cooperative scenario, UAV recon, and a competitive scenario, predator-prey. We present the mechanism of our method in Section 5. The simulation results are shown in Section 6. We discuss the advantages of our method in Section 7 and draw the conclusion in Section 8. The list of abbreviations is given in Abbreviations.

3. Background

3.1. Multi-Agent Reinforcement Learning (MARL)

The process of MARL is regarded as a decentralized partially observable Markov decision process (Dec-POMDP) [33]. In MARL, each agent $i$ observes the environment state $s$ and obtains a local observation $o_i$. Then, it selects an action according to its policy $\pi_i$. The environment executes the joint action $a = (a_1, \ldots, a_N)$ and transitions $s$ to the next state $s'$. After execution, each agent acquires a reward $r_i = R_i(s, a)$ and a next observation $o_i'$ from the environment. Each agent aims to optimize its policy to maximize its total expected return $R_i = \sum_{t=0}^{T} \gamma^t r_i(t)$, where $T$ is the final timeslot and $\gamma \in [0, 1]$ is the discount factor.
Q-learning [19] and policy gradient [34] are two popular RL methods. The idea of Q-learning is to estimate a state-action value function $Q(s, a) = \mathbb{E}[R]$ and select the optimal action to maximize $Q(\cdot)$. Deep Q-network (DQN) [35], a Q-learning-based algorithm, uses a DNN as a function approximator and trains it by minimizing the loss:

$$L(\theta) = \mathbb{E}_{s, a, r, s'}\left[ \left( y - Q(s, a \mid \theta) \right)^2 \right]$$

where $\theta$ denotes the parameters of the DNN. The target value $y$ is defined as $y = r + \gamma \max_{a'} Q'(s', a')$ [35], where $Q'$ is the target network, whose parameters are periodically updated from $\theta$. DQN also applies a replay buffer to stabilize learning.
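The following is a minimal PyTorch sketch of this DQN update with a target network and mean-squared loss; the layer sizes, tensor shapes, and variable names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the DQN update above; network sizes and names are illustrative.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))       # Q(s, ·; θ)
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))  # Q'(s, ·; θ')
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.95

def dqn_update(s, a, r, s_next, done):
    """One gradient step on L(θ) = E[(y - Q(s, a; θ))^2] for a sampled minibatch."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; θ)
    with torch.no_grad():                                         # target uses θ', no gradient
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```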
Policy gradient directly optimizes the policy $\pi$ to maximize $J(\theta_\pi) = \mathbb{E}[R]$ and updates the parameters based on the gradient [34]:

$$\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim p^\pi,\, a \sim \pi}\left[ \nabla_{\theta_\pi} \log \pi(a \mid s, \theta_\pi)\, Q(s, a) \right]$$

where $p^\pi$ is the state distribution. $Q(s, a)$ can be estimated by samples [36] or by a function approximator, such as DQN, which leads to the actor–critic algorithm [37].
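Below is a compact sketch of this policy-gradient update with $Q(s, a)$ supplied by a critic, in the spirit of the actor–critic algorithm; the network shapes and names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Sketch of the policy-gradient update ∇J = E[∇ log π(a|s) Q(s, a)]; layer sizes are illustrative.
actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))   # logits of π(a|s; θ_π)
critic = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))  # approximator of Q(s, ·)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def policy_gradient_step(s, a):
    """Ascend ∇_θπ log π(a|s) Q(s, a), with Q estimated by the critic."""
    log_probs = torch.log_softmax(actor(s), dim=1)
    log_pi_a = log_probs.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)      # treated as a constant weight
    loss = -(log_pi_a * q_sa).mean()                               # minimize the negative objective
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```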

3.2. Hierarchical Graph Attention Network (HGAT)

HGAT is an effective method for processing hierarchically structured data represented as a graph, and it has been introduced into MARL to extract the relationships among agents. By stacking multiple GATs hierarchically, HGAT first aggregates the embedding vectors of the neighboring agents in each group $l$ into a group-level vector $e_i^l$ and subsequently aggregates the $e_i^l$ from all groups into $e_i$. The aggregated embedding vector $e_i$ represents the hierarchical relationships among the different groups of neighbors.

4. System Model and Problem Statement

In this section, we describe the settings of a multi-agent cooperative scenario, UAV recon, and a competitive scenario, predator-prey.

4.1. UAV Recon

As shown in Figure 1a, we deploy $N$ UAVs into a hot-spot area to scout $n$ points of interest (PoIs) for $T$ timeslots, where the PoIs are randomly distributed. As we consider the UAVs to fly at the same altitude, the mission area is two-dimensional. Each UAV has a circular recon area whose radius is regarded as its recon range. If the Euclidean distance between a UAV and a PoI is less than the recon range, we consider the PoI to be covered.
Figure 1. Illustrations of (a) UAV Recon and (b) Predator-Prey.
In the beginning, each UAV is deployed at a random position. At each timeslot $t$, each UAV $i$ determines its acceleration $acc_i \in \{(acc, 0), (-acc, 0), (0, acc), (0, -acc), (0, 0)\}$ as its action, so the action space of $i$ is discrete. The energy consumption of $i$ is defined as:

$$E_i = E_h + \frac{v_i}{v_{max}} E_m$$

where $v_i$ is the velocity of $i$ and $v_{max}$ is the maximum velocity of the UAVs. $E_h$ and $E_m$ are the energy consumption for hovering and movement, respectively.
In our scenario, the UAVs aim to cover more PoIs, more fairly, with less energy consumption. To evaluate the quality of the task, we consider three metrics: coverage $C$, fairness $F$, and energy consumption $E$. The score of $C$ denotes the proportion of covered PoIs, which is defined as:

$$C = \frac{n_C(t)}{n}$$

where $n_C(t)$ is the number of covered PoIs at timeslot $t$.
The score of fairness denotes how fairly all PoIs are covered. Here, we use Jain's fairness index [38] to define the score of $F$ as:

$$F = \frac{\left( \sum_{j=1}^{n} c_j \right)^2}{n \sum_{j=1}^{n} c_j^2}$$

where $c_j$ is the coverage time of PoI $j$.
Finally, UAVs need to control energy consumption in tasks. We define the score of E as:
$$E = \frac{1}{N} \sum_{i=1}^{N} E_i$$
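The three metrics can be computed directly from the definitions above. The following sketch is illustrative only; the function and argument names are hypothetical.

```python
import numpy as np

# Sketch of the three recon metrics defined above; names are illustrative.
def coverage(n_covered_t, n_pois):
    """C = n_C(t) / n: fraction of PoIs covered at timeslot t."""
    return n_covered_t / n_pois

def fairness(coverage_time):
    """Jain's fairness index F over the per-PoI coverage times c_j."""
    c = np.asarray(coverage_time, dtype=float)
    denom = len(c) * np.sum(c ** 2)
    return (np.sum(c) ** 2) / denom if denom > 0 else 0.0

def energy(v, v_max, e_hover, e_move):
    """Mean per-UAV energy, with E_i = E_h + (v_i / v_max) * E_m."""
    v = np.asarray(v, dtype=float)
    return np.mean(e_hover + (v / v_max) * e_move)
```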
When executing recon missions, each UAV needs to observe the local states of other UAVs and PoIs to determine its action. The local state of UAV $i$ is defined as $s_i = (P_i, v_i)$, where $P_i$ and $v_i$ are the position and the velocity of $i$, respectively. Each PoI $j$'s local state is $s_j = (P_j)$. If a PoI is in UAV $i$'s observation range, we consider the PoI to be observed by $i$. If another UAV $j$ is in $i$'s communication range, we consider that $i$ can communicate with $j$. To train the UAVs' policies, we define a heuristic reward $r_i$ as:

$$r_i = \eta_1 \times r_{indv} + \eta_2 \times r_{shared} - E_i - p_i$$

where $p_i$ is a penalty factor: when UAV $i$ flies across the border, it is penalized by $p_i$. $r_{indv} = -1$ if no PoI is covered by $i$ individually; otherwise $r_{indv} = n_{indv}$, where $n_{indv}$ is the number of PoIs that are covered only by $i$. $r_{shared} = 0$ if $i$ does not share PoIs with others; otherwise $r_{shared} = n_{shared} / N_{share}$, where $n_{shared}$ denotes the number of PoIs that are covered by $i$ together with $N_{share}$ neighboring UAVs. $\eta_1$ and $\eta_2$ are the importance factors of $r_{indv}$ and $r_{shared}$, respectively. We empirically set $\eta_1 > \eta_2$ to encourage UAVs to cover more PoIs and avoid overlapping with others.
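A minimal sketch of this heuristic reward follows, assuming the reconstructed signs above (energy and the border penalty are subtracted, and $r_{indv} = -1$ when no PoI is covered individually); the default values of $\eta_1$, $\eta_2$, and the penalty are placeholders, not the paper's settings.

```python
# Sketch of the heuristic recon reward above; eta1 > eta2 and the default values are placeholders.
def recon_reward(n_indv, n_shared, n_sharing_uavs, energy_i,
                 out_of_border, eta1=1.0, eta2=0.1, penalty=1.0):
    r_indv = n_indv if n_indv > 0 else -1.0                    # -1 if the UAV covers no PoI alone
    r_shared = n_shared / n_sharing_uavs if n_shared > 0 else 0.0
    p_i = penalty if out_of_border else 0.0                    # border-crossing penalty
    return eta1 * r_indv + eta2 * r_shared - energy_i - p_i
```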

4.2. Predator-Prey

As shown in Figure 1b, we deploy $N_{predator}$ predators to eliminate $N_{prey}$ prey.
Both are controlled by a DRL-based method. If the distance between a predator and a prey is less than the predators' attack range, we consider the prey to be eliminated. The goal of the predators is to eliminate all prey, while the goal of the prey is to escape from the predators. The predators move more slowly than the prey, so they need to cooperate with each other when chasing.
The action space of the predators and the prey is the same as that of the UAVs in the UAV recon scenario. The local state of each predator or prey is defined as $s_i = (P_i, v_i)$, where $P_i$ and $v_i$ are its position and velocity, respectively. We consider that each predator and prey can observe the local states of adversaries inside its observation range, while it can communicate with companions inside its communication range. Eliminated prey can neither be observed nor communicate with others. To evaluate the performance of predators and prey, we define the score as:

$$S = \frac{T - T_{eliminate}}{T}$$

where $T$ is the total number of timeslots in an episode and $T_{eliminate}$ is the timeslot at which all prey are eliminated.
When predator i eliminates prey j, i will obtain a positive reward, while j will obtain a negative reward. When all prey are eliminated, the predators will get an additional reward.
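The score and reward structure above can be sketched as follows; the numeric values follow the settings later reported in Section 6.3, the constant and function names are hypothetical, and treating an episode in which prey survive as $T_{eliminate} = T$ (score 0) is an assumption.

```python
# Sketch of the predator-prey score and rewards; values follow Section 6.3, names are illustrative.
def episode_score(t_eliminate, total_timeslots):
    """S = (T - T_eliminate) / T; assumes T_eliminate = T when some prey survive."""
    if t_eliminate is None:
        t_eliminate = total_timeslots
    return (total_timeslots - t_eliminate) / total_timeslots

ATTACK_REWARD_PREDATOR = 10.0    # reward to predator i when it eliminates prey j
ATTACK_REWARD_PREY = -10.0       # reward to the eliminated prey j

def additional_reward(score):
    """Bonus given to the predators once all prey are eliminated: r_additional = 10 * S."""
    return 10.0 * score
```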

5. HGAT-Based Multi-Agent Coordination Control Method

To achieve the goals of the two scenarios described in Section 4, we present a multi-agent coordination control method based on HGAT for mixed cooperative–competitive environments. In our method, the global state of the environment is regarded as a graph containing the local states of agents and the relationships among them. Each agent summarizes the information from the environment by HGAT and subsequently computes its Q-values and action in a value-based or actor–critic framework.

5.1. HGAT-Based Observation Aggregation and Inter-Agent Communication

In the multi-agent system, the environment involves multiple kinds of entities, including agents, PoIs, etc. As they are heterogeneous, agents need to treat their local states and model their relationships separately. Thus, we categorize all entities into different groups in the first step of execution in cooperative or competitive scenarios. As shown in Figure 2, the $M$ entities (containing $N$ agents) are clustered into $K$ groups, and the environment's state is represented as graphs. The agents construct an observation graph $G_O$ and a communication graph $G_C$ based on their observation ranges $O_1, \ldots, O_N$ and communication ranges $C_1, \ldots, C_N$, respectively. The edges of $G_O$ indicate that the corresponding entities can be observed by agents, while the edges of $G_C$ indicate that two agents can communicate with each other. The adjacency matrices of $G_O$ and $G_C$ are $A_O$ and $A_C$, respectively. Agent $i$'s observation is defined as $o_i = \{ s_j \mid j \in O_i \}$, and its received messages are $m_i = \{ m_{j,i} \mid j \in C_i \}$, where $s_j$ is agent $j$'s local state and $m_{j,i}$ is the message that $j$ sends to $i$.
Figure 2. The clustering of agents and their topology.
At each timeslot, the agents use the network shown in Figure 3 to determine their actions according to $s$, $A_O$, and $A_C$ received from the environment, where $s = (s_1, \ldots, s_M)$. The parameters of the network are shared among the agents in the same group. The network contains three components: a set of encoders, two stacked HGAT layers, and a recurrent unit, which consists of a gated recurrent unit (GRU) layer and a fully connected layer. GRU is a variant of the recurrent neural network (RNN). To summarize the information in each agent $i$'s observation $o_i$, the first HGAT layer processes $o_i$ into a high-dimensional aggregated embedding vector $e_i$, as shown in Figure 4. Firstly, the encoder, which consists of a fully connected layer, transforms the local states from each group $l$ into embedding vectors $e_j = f_e^l(s_j)$, where $f_e^l$ denotes the encoder for group $l$ and $e_j$ is the embedding vector of entity $j$ in group $l$. Then, the layer aggregates the $e_j$ as $e_i^l = \sum_j \alpha_{ij} W_v^l e_j$ [32], where $W_v^l$ is a matrix that transforms $e_j$ into a "value". The attention weight $\alpha_{ij}$ represents the importance of the embedding vector $e_j$ of $j$ to $i$ and is calculated by softmax as $\alpha_{ij} \propto \exp\left( e_j^T {W_k^l}^T W_q^l e_i \right)$ [32] if $a_{i,j}^O$ in $A_O$ is 1; otherwise $\alpha_{ij} = 0$. $W_k^l$ and $W_q^l$ transform an embedding vector into a "key" and a "query", respectively. $A_O$ is used for selection so that only the local states from $O_i$ are summarized. To improve performance, we use multiple attention heads. Finally, the $e_i^l$ from all groups are aggregated into $e_i$ by a fully connected layer $f_G$, as:

$$e_i = f_G\left( \Big\Vert_{l=1}^{K} e_i^l \right)$$

where $\Vert$ represents the concatenation operation. Unlike HAMA, we do not apply another GAT for this aggregation, which gives our approach less computing overhead.
Figure 3. The overall structure of the network. $s^l$ and $e^l$ represent the local states and embedding vectors of agents in group $l$. $A_i$ denotes the $i$-th row of $A$.
Figure 4. The architecture of an HGAT layer.
After calculating $e_i$, agent $i$ sends it as a message $m_{i,j}$ to each neighboring agent $j$ in $C_i$. Inter-agent communication helps agents share their observations with neighbors, which leads to better coordination. To summarize each agent $i$'s received messages $m_i$, the second HGAT layer processes $m_i$ and aggregates it into another embedding vector $e_i'$ in the same way as shown in Figure 4, except that the adjacency matrix used here is $A_C$ instead of $A_O$. Our method supports both intra-group and inter-group communication and can easily be extended to multi-hop communication by stacking additional HGAT layers.
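The following is a minimal PyTorch sketch of one HGAT step as described above: masked attention within each group, followed by concatenation across groups through $f_G$. It uses a single attention head for brevity (the paper uses multiple heads), and all class, variable, and dimension choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupAttention(nn.Module):
    """Single-head GAT-style aggregation over one group l, masked by an adjacency row A_i."""
    def __init__(self, dim):
        super().__init__()
        self.Wq = nn.Linear(dim, dim, bias=False)   # query for the receiving agent i
        self.Wk = nn.Linear(dim, dim, bias=False)   # keys for the entities j in the group
        self.Wv = nn.Linear(dim, dim, bias=False)   # values for the entities j in the group

    def forward(self, e_i, e_group, adj_row):
        # e_i: (dim,), e_group: (n_l, dim), adj_row: (n_l,) with 1 for visible entities.
        # Dot-product scores; the 1/sqrt(dim) scaling is an implementation choice for stability.
        scores = (self.Wk(e_group) @ self.Wq(e_i)) / e_group.shape[-1] ** 0.5
        scores = scores.masked_fill(adj_row == 0, float('-inf'))   # α_ij = 0 outside O_i / C_i
        alpha = torch.softmax(scores, dim=0)
        alpha = torch.nan_to_num(alpha)                            # fully masked group -> zeros
        return alpha @ self.Wv(e_group)                            # e_i^l = Σ_j α_ij W_v^l e_j

class HGATLayer(nn.Module):
    """Aggregate per-group embeddings, then fuse the groups with the fully connected layer f_G."""
    def __init__(self, dim, n_groups):
        super().__init__()
        self.group_attn = nn.ModuleList([GroupAttention(dim) for _ in range(n_groups)])
        self.f_g = nn.Linear(dim * n_groups, dim)

    def forward(self, e_i, groups, adj_rows):
        per_group = [attn(e_i, e_l, a_l)
                     for attn, e_l, a_l in zip(self.group_attn, groups, adj_rows)]
        return self.f_g(torch.cat(per_group, dim=-1))              # e_i = f_G(‖_l e_i^l)
```

The same layer is reused for the communication step by passing the embeddings received from neighbors and the corresponding row of $A_C$ instead of $A_O$.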

5.2. Implementation in a Value-Based Framework

This implementation is based on DQN. Each agent $i$ maintains a hidden state $h_i$ for the recurrent unit and calculates its Q-values with a Q-network, as shown in Figure 3. Similar to DQN, our method also employs a target network with the same structure.
We introduce a skip-connection strategy by concatenating $e_i$ and $e_i'$ as the input of the recurrent unit when computing the Q-values, so agents can use the information from both their own observations and others'. The Q-values are calculated as:

$$Q^l(o_i, m_i, a_i, h_i) = f_R^l(e_i, e_i', h_i)$$

where $Q^l$ represents the Q-network of the group $l$ to which $i$ belongs, $f_R^l$ denotes the recurrent unit in $Q^l$, and $a_i$ is the action determined by $i$ according to the Q-values. We apply an $\epsilon$-greedy policy [35] to balance exploitation and exploration:

$$a_i = \begin{cases} \arg\max_{a \in A_i} Q^l(o_i, m_i, a, h_i), & \text{with probability } 1 - \epsilon \\ \mathrm{random}(A_i), & \text{with probability } \epsilon \end{cases}$$
where A i is the action space of i.
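A minimal sketch of the recurrent Q-head and the $\epsilon$-greedy rule described above is shown below, assuming the aggregated embeddings $e_i$ and $e_i'$ have already been produced by the two HGAT layers; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentQHead(nn.Module):
    """Recurrent unit f_R^l: a GRU over the concatenated [e_i, e_i'] followed by a linear Q layer."""
    def __init__(self, emb_dim, hidden_dim, n_actions):
        super().__init__()
        self.gru = nn.GRUCell(2 * emb_dim, hidden_dim)
        self.q_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, e_obs, e_msg, h):
        h_next = self.gru(torch.cat([e_obs, e_msg], dim=-1), h)  # skip-connection input [e_i, e_i']
        return self.q_out(h_next), h_next                        # per-action Q-values, next hidden state

def epsilon_greedy(q_values, epsilon=0.3):
    """Pick argmax_a Q with probability 1 - ε, otherwise a uniformly random action."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(q_values.shape[-1], (1,)).item()
    return int(torch.argmax(q_values, dim=-1).item())
```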
After executing the joint actions $a = (a_1, \ldots, a_N)$, the environment transitions the current state to the next one and sends the next local states $s'$, the next adjacency matrices $A_O'$ and $A_C'$, and the reward $r_i$ to each agent $i$. The experience $(s, A_O, A_C, a, r, s', A_O', A_C', h, h')$ is stored in a shared replay buffer $B$, where $r = (r_1, \ldots, r_N)$, $h = (h_1, \ldots, h_N)$, and $h' = (h_1', \ldots, h_N')$. $h_i'$ is the next hidden state output by the Q-network when agent $i$ calculates its Q-values, and $h_i$ is initialized to zero at the beginning of an episode.
To train the Q-network of each group, we sample $H$ experiences from $B$ as a minibatch and minimize the loss:

$$L(\theta_l^Q) = \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbb{E}\left[ \left( y_i - Q^l(o_i, m_i, a_i, h_i \mid \theta_l^Q) \right)^2 \right]$$

where $N_l$ is the number of agents in group $l$ and $\theta_l^Q$ denotes the parameters of $Q^l$. $y_i$ is the target value calculated by the target network $Q'^l$ as:

$$y_i = r_i + \gamma \max_{a \in A_i} Q'^l(o_i', m_i', a, h_i' \mid \theta_l^{Q'})$$

where $o_i'$ and $m_i'$ are $i$'s next observation and next received messages, respectively. $\theta_l^{Q'}$ denotes the parameters of $Q'^l$, which are periodically updated from $\theta_l^Q$.
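The per-group loss and target above can be sketched as follows, assuming `q_net` and `target_net` wrap the full encoder–HGAT–GRU pipeline and return (per-action Q-values, next hidden state), and that the sampled minibatch object carries the stored observations, messages, actions, rewards, and hidden states; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def group_q_loss(q_net, target_net, batch, gamma=0.95):
    """Mean squared error between Q_l(o_i, m_i, a_i, h_i) and the target y_i for one group.

    `batch` is assumed to hold, per agent of group l: obs/next_obs, msg/next_msg,
    h/next_h (stored hidden states), actions a, and rewards r.
    """
    q_all, _ = q_net(batch.obs, batch.msg, batch.h)                     # Q_l(o_i, m_i, ·, h_i)
    q_taken = q_all.gather(1, batch.a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next, _ = target_net(batch.next_obs, batch.next_msg, batch.next_h)
        y = batch.r + gamma * q_next.max(dim=1).values                  # target from Q'_l
    return F.mse_loss(q_taken, y)
```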

5.3. Implementation in an Actor–Critic Framework

Our method can also be implemented in an actor–critic framework. In this implementation, each agent $i$ has an actor network and a critic network, maintaining hidden states $h_i^\pi$ and $h_i^Q$, respectively. After obtaining $s$, $A_O$, and $A_C$, agent $i$ in group $l$ computes the probabilities of actions as:

$$\pi^l(o_i, m_i, h_i^\pi) = f_R^{\pi_l}(e_i^\pi, e_i'^\pi, h_i^\pi)$$

where $\pi^l$ represents the actor network of group $l$ and $f_R^{\pi_l}$ represents the recurrent unit in $\pi^l$. We employ an $\epsilon$-categorical policy here: agent $i$ determines an action based on $\pi^l(o_i, m_i, h_i^\pi)$ with probability $1 - \epsilon$ and makes a random choice with probability $\epsilon$. The critic network $Q^l$ subsequently calculates the Q-values, as in the value-based framework. The hidden states $h_i^\pi$ and $h_i^Q$ and the next hidden states $h_i'^\pi$ and $h_i'^Q$ are stored in the replay buffer, where $h_i'^\pi$ and $h_i'^Q$ are the outputs of $\pi^l$ and $Q^l$, respectively.
The critic network of each group is trained by minimizing the loss $L(\theta_l^Q)$, which is computed as in (13). As the actor–critic framework selects actions according to $\pi^l(o_i, m_i, h_i^\pi)$ instead of the maximum Q-value, we use the expectation of the next state's Q-values to calculate the target value $y_i$ as:

$$y_i = r_i + \gamma \sum_{a \in A_i} \pi'^l(a \mid o_i', m_i', h_i'^\pi, \theta_l^{\pi'}) \, Q'^l(o_i', m_i', a, h_i'^Q \mid \theta_l^{Q'})$$

where $\theta_l^{\pi'}$ and $\theta_l^{Q'}$ are the parameters of the target networks $\pi'^l$ and $Q'^l$, respectively.
The actor network of each group is trained according to the gradient:
$$\nabla_{\theta_l^\pi} J(\theta_l^\pi) = \frac{1}{N_l} \sum_{i=1}^{N_l} \mathbb{E}\left[ \nabla_{\theta_l^\pi} \log \pi^l(a_i \mid o_i, m_i, h_i^\pi, \theta_l^\pi) \left( Q^l(o_i, m_i, a_i, h_i^Q \mid \theta_l^Q) - b_i \right) \right]$$
where the baseline b i is designed to reduce variance and stabilize training [39], which is defined as:
$$b_i = \sum_{a \in A_i} \pi^l(a \mid o_i, m_i, h_i^\pi, \theta_l^\pi) \, Q^l(o_i, m_i, a, h_i^Q \mid \theta_l^Q)$$

After training, $\theta_l^{\pi'}$ and $\theta_l^{Q'}$ are updated as $\theta_l^{\pi'} \leftarrow \tau \theta_l^\pi + (1 - \tau) \theta_l^{\pi'}$ and $\theta_l^{Q'} \leftarrow \tau \theta_l^Q + (1 - \tau) \theta_l^{Q'}$, respectively [40].
Our method can be extended to a continuous action space by estimating the expectation in $b_i$ with Monte Carlo samples or a learnable state value function $V(o_i, m_i)$ [23].
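Below is a hedged sketch of the discrete actor–critic updates above: the expectation-based target, the baseline-corrected actor gradient, and the soft target update. It assumes the same network interfaces as in the value-based sketch (each network returns logits or Q-values plus the next hidden state); all names are illustrative.

```python
import torch

def critic_target(target_actor, target_critic, batch, gamma=0.95):
    """y_i = r_i + γ Σ_a π'_l(a|·) Q'_l(·, a): expectation over the target policy."""
    with torch.no_grad():
        logits, _ = target_actor(batch.next_obs, batch.next_msg, batch.next_h_pi)
        pi_next = torch.softmax(logits, dim=-1)
        q_next, _ = target_critic(batch.next_obs, batch.next_msg, batch.next_h_q)
        return batch.r + gamma * (pi_next * q_next).sum(dim=-1)

def actor_loss(actor, critic, batch):
    """-E[log π_l(a_i|·) (Q_l(·, a_i) - b_i)], with the expectation baseline b_i."""
    logits, _ = actor(batch.obs, batch.msg, batch.h_pi)
    log_pi = torch.log_softmax(logits, dim=-1)
    with torch.no_grad():
        q_all, _ = critic(batch.obs, batch.msg, batch.h_q)
        baseline = (log_pi.exp() * q_all).sum(dim=-1)                  # b_i = Σ_a π(a|·) Q(·, a)
        advantage = q_all.gather(1, batch.a.unsqueeze(1)).squeeze(1) - baseline
    log_pi_a = log_pi.gather(1, batch.a.unsqueeze(1)).squeeze(1)
    return -(log_pi_a * advantage).mean()

def soft_update(target, source, tau=0.01):
    """θ' ← τ θ + (1 - τ) θ' applied parameter-wise."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```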

6. Simulation

6.1. Setup

To evaluate the performance of our method, we conduct a series of simulations on an Ubuntu 18.04 server with two NVIDIA RTX 3080 GPUs. We implement a value-based (VB) version and an actor–critic (AC) version of our method in PyTorch. Each fully connected layer and GRU layer contains 256 units. The activation function in the encoders and HGAT layers is ReLU [41]. The number of attention heads is 4. Empirically, we set the learning rate of the optimizer to 0.001 and the discount factor $\gamma$ to 0.95. The replay buffer size is 50 K and the minibatch size is 128. $\epsilon$ is set to 0.3. For the value-based version, the target networks are updated every five training steps; for the actor–critic version, we set $\tau$ to 0.01. The networks are trained every 100 timeslots, and their parameters are updated four times per training step.
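For reference, the hyperparameters listed above can be gathered into a single configuration; the key names below are hypothetical, and only the values follow the text.

```python
# Illustrative collection of the hyperparameters listed above; key names are hypothetical.
CONFIG = {
    "hidden_units": 256,          # per fully connected / GRU layer
    "activation": "relu",         # encoders and HGAT layers
    "attention_heads": 4,
    "learning_rate": 1e-3,
    "gamma": 0.95,                # discount factor
    "replay_buffer_size": 50_000,
    "minibatch_size": 128,
    "epsilon": 0.3,
    "target_update_interval": 5,  # training steps (value-based version)
    "tau": 0.01,                  # soft update rate (actor-critic version)
    "train_every_timeslots": 100,
    "updates_per_training_step": 4,
}
```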
We compare our method with four MARL baselines, including DGN, DQN, HAMA, and MADDPG. For non-HGAT-based approaches, each agent concatenates all local states in its observation into a vector, while padding 0 for unobserved entities. The parameters of networks are shared among agents in all baselines except MADDPG. We use the Gumbel-Softmax reparameterization trick [42] in HAMA and MADDPG to make them trainable in discrete action spaces. DGN is based on our proposed algorithm [31], which applies a GAT layer for inter-agent communication. We train our method and each baseline for 100 K episodes and test them for 10 K episodes.

6.2. UAV Recon

As summarized in Table 1, we deploy several UAVs in a 200 × 200 area in which 120 PoIs are distributed. The penalty factor $p$ in (7) is set to 1. We evaluate the performance of our method in the test stage under different numbers of UAVs and compare it with the baselines.
Table 1. Experiment parameters of UAV recon.
Figure 5 shows the performance of each method in terms of coverage, fairness, and energy consumption under different numbers of UAVs. Note that both versions of our method are trained with 20 UAVs and then transferred to UAV swarms of different scales. From Figure 5a,b, we observe that our method outperforms all baselines in terms of coverage and fairness. Compared with DGN and DQN, our method employs HGAT to extract features from observations, which is more effective than processing raw observation vectors directly. Therefore, our method helps the UAVs search for PoIs and better optimize their flight trajectories. Although HAMA also applies HGAT, its UAVs cannot cooperate as effectively as with our method, owing to the lack of communication. In our method, the UAVs communicate with each other and process the received messages through another HGAT layer. Furthermore, the recurrent unit helps the UAVs learn from the hidden states, which yields a better performance. In MADDPG, each UAV trains an individual network and concatenates the observations and actions of all agents into a high-dimensional vector as the input of its critic. As the networks in MADDPG grow rapidly with the number of agents, they are hard to train effectively and efficiently in large-scale multi-agent systems. As a consequence, MADDPG consumes more training time but obtains the lowest score.
Figure 5. Simulation results of all methods on coverage, fairness, and energy consumption under different numbers of UAVs.
Figure 5c indicates that our method consumes less energy than DGN and DQN. As the flight trajectories produced by our method are better, the UAVs can cover more PoIs fairly while consuming less energy. The energy consumption of HAMA is comparable to that of our method in small-scale environments but increases when the number of UAVs reaches 40. MADDPG fails to improve coverage and fairness, so it tends to save energy to maximize its reward.
To test the capability of transfer learning, we compare the transferred policies with policies trained under the same settings as those used in testing. As shown in Figure 6, the performance does not deteriorate when the policy is transferred to execution with 10, 30, or 40 UAVs, which indicates that our method is highly transferable across various numbers of UAVs.
Figure 6. Simulation results of transfer learning on coverage, fairness, and energy consumption under different numbers of UAVs.

6.3. Predator-Prey

As summarized in Table 2, we deploy five predators in a 100 × 100 area to eliminate five prey. We set the attack rewards of predators and prey to 10 and −10, respectively. The additional reward is set as $r_{additional} = 10 \times S$. We train the policy with the value-based version of our method and test it by competing against policies trained by the other methods.
Table 2. Experiment parameters of predator-prey.
Table 3 indicates that our method shows its superiority over all baselines in both the predator and prey roles. By introducing GRU and inter-agent communication, the predators obtain more information from hidden states and neighbors to decide which prey to capture. It is thus more flexible for the predators to determine whether to chase prey individually or cooperatively. Similarly, GRU and inter-agent communication also bring more information to the prey, so they can choose from various strategies to survive. For example, prey can escape from predators using their faster speed or sacrifice one of their number to distract the predators.
Table 3. The mean and standard deviation of scores in predator-prey.

7. Discussion

The experimental results indicate that the performance of our method is superior to those of the others in both cooperative and competitive scenarios. We posit that three components, HGAT, GRU, and inter-agent communication, are the key factors behind the success of our method. To validate this hypothesis, we conduct an ablation study in Appendix A to clarify the necessity of each component.
From Table A1, we observe a significant deterioration in performance when removing HGAT or GRU, while disabling inter-agent communication also induces a decrease in coverage and fairness. We attribute the necessity of each component to the following reasons. Firstly, HGAT plays an important role in summarizing observations. Not only does it process the local states from all groups, but it also quantifies their importance with attention weights. In addition, HGAT models the hierarchical relationships among agents as a graph, which helps them optimize their policies in dynamic environments. Secondly, GRU makes a significant contribution to overcoming the limitation of partial observability. When determining actions, GRU helps agents remember the historical information recorded in the hidden states, such as the positions of previously observed PoIs in UAV recon. Retrieving from the hidden states what they cannot currently observe helps agents improve their performance. Finally, inter-agent communication expands the agents' horizons. By sending high-dimensional embedding vectors, they share their observations with others. With the help of HGAT, agents can cooperate better by using the extensive information in those vectors during decision making.
Compared with non-HGAT-based approaches, our method has another advantage regarding the replay buffer. As those approaches concatenate all local states into a vector and pad zeros for unobserved entities, the space complexity of the observations in their replay buffers is $O(N \times M)$, where $N$ and $M$ denote the numbers of agents and entities, respectively. In contrast, our method only stores the local states, whose space complexity is $O(M)$. Although it also has to store the adjacency matrix $A_O$ to represent the relationships among agents, this is more economical than storing full observations, as an adjacency matrix represents the relationship between two agents with a single bit.

8. Conclusions

In this paper, we propose a scalable and transferable DRL-based multi-agent coordination control method for cooperative and competitive tasks. This method introduces HGAT, GRU, and inter-agent communication into DRL to improve performance in mixed cooperative–competitive environments. Through intensive simulations, our method shows its superiority over DGN, DQN, HAMA, and MADDPG in both UAV recon and predator-prey.
In the future, we will improve our method by introducing an adaptive policy based on the action entropy of the agent to provide more intelligent exploration. We will evaluate the performance of the entropy-based policy and compare it with the $\epsilon$-greedy and $\epsilon$-categorical policies. Specifically, we will test the capability of automatic entropy adjustment under different entropy targets in large-scale multi-agent systems. Furthermore, we will try to extend our method to continuous policies and evaluate its performance in cooperative and competitive scenarios with continuous action spaces.

Author Contributions

Conceptualization, Y.C. and Z.Y.; methodology, Y.C.; software, Y.C.; validation, Y.C. and Z.Y.; formal analysis, Y.C.; investigation, Y.C. and Z.Y.; resources, Y.C.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, G.S.; visualization, Y.C.; supervision, G.S. and X.J.; project administration, G.S.; funding acquisition, X.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National key R & D program of China sub project “Emergent behavior recognition, training and interpretation techniques” grant number 2018AAA0102302.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

AC    Actor–Critic
ACO    Ant Colony Optimization
COMA    COunterfactual Multi-Agent
Dec-POMDP    Decentralized Partially Observable Markov Decision Process
DGN    Deep Graph Network
DNN    Deep Neural Network
DQN    Deep Q-Network
DRGN    Deep Recurrent Graph Network
DRL    Deep Reinforcement Learning
GA    Genetic Algorithm
GAT    Graph Attention neTwork
GCN    Graph Convolutional Network
GRU    Gated Recurrent Unit
HAMA    Hierarchical Graph Attention-Based Multi-Agent Actor–Critic
HGAT    Hierarchical Graph Attention neTwork
MAAC    Multi-Actor-Attention-Critic
MADDPG    Multi-Agent Deep Deterministic Policy Gradient
MARL    Multi-Agent Reinforcement Learning
PoI    Point-of-Interest
QoS    Quality of Service
RL    Reinforcement Learning
RSH    Randomized Search Heuristic
SARSA    State-Action-Reward-State-Action
TRPO    Trust Region Policy Optimization
UAV    Unmanned Aerial Vehicle
UAV-MBS    Unmanned Aerial Vehicle Mobile Base Station
VB    Value-Based

Appendix A. Ablation Study

In the ablation study, we test our method and three variants in UAV recon with 20 UAVs and summarize the performance in Table A1. The variants are described as follows:
  • Without H: removing the first HGAT layer;
  • Without G: removing GRU in the recurrent unit;
  • Without C: disabling inter-agent communication.
Table A1. The mean and standard deviation of three metrics in the ablation study (N = 20).
Model           Coverage         Fairness         Energy Consumption
Our method      0.466 ± 0.046    0.784 ± 0.061    0.665 ± 0.015
Without H       0.103 ± 0.022    0.696 ± 0.087    0.710 ± 0.011
Without G       0.349 ± 0.065    0.726 ± 0.116    0.730 ± 0.017
Without C       0.412 ± 0.058    0.749 ± 0.081    0.668 ± 0.023

References

  1. Dorri, A.; Kanhere, S.S.; Jurdak, R. Multi-agent systems: A survey. IEEE Access 2018, 6, 28573–28593. [Google Scholar] [CrossRef]
  2. Gutierrez-Garcia, J.O.; Sim, K.M. Agent-based cloud bag-of-tasks execution. J. Syst. Softw. 2015, 104, 17–31. [Google Scholar] [CrossRef]
  3. Claes, R.; Holvoet, T.; Weyns, D. A decentralized approach for anticipatory vehicle routing using delegate multiagent systems. IEEE Trans. Intell. Transp. Syst. 2011, 12, 364–373. [Google Scholar] [CrossRef] [Green Version]
  4. Ota, J. Multi-agent robot systems as distributed autonomous systems. Adv. Eng. Inform. 2006, 20, 59–70. [Google Scholar] [CrossRef]
  5. Iñigo-Blasco, P.; Diaz-del Rio, F.; Romero-Ternero, M.C.; Cagigas-Muñiz, D.; Vicente-Diaz, S. Robotics software frameworks for multi-agent robotic systems development. Robot. Auton. Syst. 2012, 60, 803–821. [Google Scholar] [CrossRef]
  6. Liu, C.H.; Chen, Z.; Tang, J.; Xu, J.; Piao, C. Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach. IEEE J. Sel. Areas Commun. 2018, 36, 2059–2070. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Mou, Z.; Gao, F.; Jiang, J.; Ding, R.; Han, Z. UAV-enabled secure communications by multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 11599–11611. [Google Scholar] [CrossRef]
  8. Ryu, H.; Shin, H.; Park, J. Multi-agent actor-critic with hierarchical graph attention network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7236–7243. [Google Scholar] [CrossRef]
  9. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  10. Ramirez-Atencia, C.; Bello-Orgaz, G.; R-Moreno, M.D.; Camacho, D. Solving complex multi-UAV mission planning problems using multi-objective genetic algorithms. Soft Comput. 2017, 21, 4883–4900. [Google Scholar] [CrossRef]
  11. Zhen, Z.; Xing, D.; Gao, C. Cooperative search-attack mission planning for multi-UAV based on intelligent self-organized algorithm. Aerosp. Sci. Technol. 2018, 76, 402–411. [Google Scholar] [CrossRef]
  12. Altshuler, Y.; Wagner, I.; Yanovski, V.; Bruckstein, A. Multi-agent Cooperative Cleaning of Expanding Domains. Int. J. Robot. Res. 2010, 30, 1037–1071. [Google Scholar] [CrossRef]
  13. Altshuler, Y.; Pentland, A.; Bruckstein, A.M. Swarms and Network Intelligence in Search; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  14. Altshuler, Y.; Bruckstein, A.M. Static and expanding grid coverage with ant robots: Complexity results. Theor. Comput. Sci. 2011, 412, 4661–4674. [Google Scholar] [CrossRef] [Green Version]
  15. Cho, S.W.; Park, H.J.; Lee, H.; Shim, D.H.; Kim, S.Y. Coverage path planning for multiple unmanned aerial vehicles in maritime search and rescue operations. Comput. Ind. Eng. 2021, 161, 107612. [Google Scholar] [CrossRef]
  16. Apostolopoulos, P.A.; Torres, M.; Tsiropoulou, E.E. Satisfaction-aware data offloading in surveillance systems. In Proceedings of the 14th Workshop on Challenged Networks, Los Cabos, Mexico, 25 October 2019; pp. 21–26. [Google Scholar] [CrossRef]
  17. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Citeseer: University Park, PA, USA, 1994; Volume 37. [Google Scholar]
  18. Shamsoshoara, A.; Khaledi, M.; Afghah, F.; Razi, A.; Ashdown, J. Distributed cooperative spectrum sharing in uav networks using multi-agent reinforcement learning. In Proceedings of the 2019 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 11–14 January 2019; pp. 1–6. [Google Scholar] [CrossRef] [Green Version]
  19. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  20. Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2817–2826. [Google Scholar]
  21. Finn, C.; Yu, T.; Fu, J.; Abbeel, P.; Levine, S. Generalizing skills with semi-supervised reinforcement learning. arXiv 2016, arXiv:1612.00429. [Google Scholar]
  22. Lowe, R.; WU, Y.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 6379–6390. [Google Scholar]
  23. Foerster, J.N.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  24. Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 1993; pp. 330–337. [Google Scholar]
  25. Iqbal, S.; Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Long Beach, CA, USA, 2019; Volume 97, pp. 2961–2970. [Google Scholar]
  26. Jiang, J.; Dun, C.; Huang, T.; Lu, Z. Graph convolutional reinforcement learning. arXiv 2018, arXiv:1810.09202. [Google Scholar]
  27. Liu, C.H.; Ma, X.; Gao, X.; Tang, J. Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning. IEEE Trans. Mob. Comput. 2019, 19, 1274–1285. [Google Scholar] [CrossRef]
  28. Khan, A.; Tolstaya, E.; Ribeiro, A.; Kumar, V. Graph policy gradients for large scale robot control. In Proceedings of the Conference on Robot Learning, Cambridge, MA, USA, 16–18 November 2020; pp. 823–834. [Google Scholar]
  29. Walker, O.; Vanegas, F.; Gonzalez, F.; Koenig, S. A deep reinforcement learning framework for UAV navigation in indoor environments. In Proceedings of the 2019 IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2019; pp. 1–14. [Google Scholar]
  30. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
  31. Ye, Z.; Wang, K.; Chen, Y.; Jiang, X.; Song, G. Multi-UAV Navigation for Partially Observable Communication Coverage by Graph Reinforcement Learning. IEEE Trans. Mob. Comput. 2022. [Google Scholar] [CrossRef]
  32. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  33. Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840. [Google Scholar] [CrossRef]
  34. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063. [Google Scholar]
  35. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef] [PubMed]
  36. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef] [Green Version]
  37. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 1008–1014. [Google Scholar]
  38. Jain, R.K.; Chiu, D.M.W.; Hawe, W.R. A Quantitative Measure of Fairness and Discrimination; Eastern Research Laboratory, Digital Equipment Corporation: Hudson, MA, USA, 1984. [Google Scholar]
  39. Weaver, L.; Tao, N. The optimal reward baseline for gradient-based reinforcement learning. arXiv 2013, arXiv:1301.2315. [Google Scholar]
  40. Heess, N.; Hunt, J.J.; Lillicrap, T.P.; Silver, D. Memory-based control with recurrent neural networks. arXiv 2015, arXiv:1512.04455. [Google Scholar]
  41. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  42. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
