Building a Connected Communication Network for UAV Clusters Using DE-MADDPG

Abstract: Clusters of unmanned aerial vehicles (UAVs) are often used to perform complex tasks. In such clusters, the reliability of the communication network connecting the UAVs is an essential factor in their collective efficiency. Due to the complex wireless environment, however, communication malfunctions within the cluster are likely during the flight of the UAVs. In such cases, it is important to control the cluster and rebuild the connected network. The asymmetry of the cluster topology also increases the complexity of the control mechanisms. Traditional control methods based on cluster consistency often rely on the motion information of the neighboring UAVs. The motion information, however, may become unavailable because of the interrupted communications. UAV control algorithms based on deep reinforcement learning have achieved outstanding results in many fields. Here, we propose a cluster control method based on the Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG) to rebuild a communication network for UAV clusters. The DE-MADDPG improves the framework of the traditional multi-agent deep deterministic policy gradient (MADDPG) algorithm by decomposing the reward function. We further introduce a reward reshaping function to facilitate the convergence of the algorithm in sparse reward environments. To address the instability of the state space in the reinforcement learning framework, we also propose the notion of the virtual leader-follower model. Extensive simulations show that the success rate of the DE-MADDPG is higher than that of the MADDPG algorithm, confirming the effectiveness of the proposed method.


Introduction
The increasing application of unmanned aerial vehicles (UAVs) has resulted in complex scenarios. Apart from military operations such as target destruction and cooperative investigation [1], civilian applications can also benefit from UAV technology, such as environmental monitoring, precision agriculture [2,3], disaster relief [4], and traffic monitoring [5]. As it becomes increasingly difficult for a single UAV to meet the needs of complex tasks [6], the recent development of technologies such as cluster intelligence has enabled applications with small UAV cluster systems. Using multiple UAVs in a system is challenging due to communication problems between them, such as damage to some UAVs and the interruption of communication links, which may affect the data transmission within the cluster. Therefore, rebuilding a connected communication network for a cluster with communication problems is essential for the cooperation of UAVs [7].
Recently, deep reinforcement learning (DRL) methods have been proposed to solve problems of extremely high complexity that are often not solvable using traditional control methods. The main contributions of this paper are as follows:
1. We construct a partially observable Markov decision process (POMDP) framework for a UAV cluster to build a connected network. To the best of our knowledge, this is the first use of multi-agent reinforcement learning to build a connected network of UAV clusters. In response to this problem, we make improvements to the MADDPG framework and propose the DE-MADDPG algorithm.
2. We apply the theory of mobile consistency to set the reward reshaping function, which guides the agent to learn an effective policy faster and solves the convergence difficulties caused by sparse rewards.
3. To the best of our knowledge, our work is the first to control a cluster based on RL in open space. We introduce the idea of the virtual leader-follower model to solve the problem of an unstable state space and achieve remarkable results.
The rest of the paper is organized as follows. In Section 2, we present work related to our study. Section 3 presents the model of our work. In Section 4, we present the MADDPG and the DE-MADDPG, which is followed by the simulation results and discussions in Section 5. Finally, the conclusion of this paper and the future work are presented in Section 6.

Related Work
There is a large amount of research on how to build a connected communication network for UAV clusters. Ramanatha et al. [13] proposed a method based on heuristic algorithms and controlled the UAV transmission power to enable a dynamic network topology and achieve network connectivity. Nevertheless, due to the limited transmit power, it is almost impossible for UAVs to communicate with each other if the distance between them becomes larger than their corresponding communication range. Kashyap et al. [14] enhanced the connectivity of the cluster network by adding relay nodes. Adding additional relay nodes, however, further increases the number of UAVs.
Models such as those presented by Reynolds et al. [15], Vicsek et al. [16], and Conzin et al. [17] are based on the theory of cluster mobility and control the movement of UAVs to construct a connected network. In a distributed situation, however, UAVs at the margin of the cluster cannot obtain neighbor information and are thus unable to connect with others. Based on the estimation of algebraic connectivity, Yang et al. [18] used the gradient method to derive the motion control values of the agents, so that the multi-agent system can maintain a connected communication network during the movement process. Hao et al. [19] designed a distributed agent cluster control protocol based on the control strategy of the potential function gradient and also realized the maintenance of the communication network of the multi-agent system. Dai et al. [20] discussed the optimal trajectory planning of multiple spacecraft from a disconnected state to build a connected network, which provides a useful reference for how to control the movement of multiple UAVs to construct a connected network.
With the development of reinforcement learning (RL), some scholars have applied RL to cluster control and achieved remarkable results. To control the UAV to search for an unknown area, Zanol et al. [21] used the Q-learning algorithm to train a single agent that controls a drone searching for an unknown area. A similar approach was taken by Klaine et al. [22], and they controlled the UAVs to build an emergency network for a disaster area. In [23], to maximize the coverage of the UAV cluster, Liu et al. proposed a DRL-based energy-efficient control method for coverage and connectivity (DRL-EC3). Wang et al. [9] proposed a trajectory control algorithm of UAVs based on MADDPG, guiding the movement of ground users. In [24], to detect targets and avoid obstacles, Han et al. proposed the simultaneous target assignment and path planning (STAPP) approach based on MADDPG.
Similar to these studies, we propose a distributed UAV control method based on DE-MADDPG to rebuild a connected communication network for a cluster with communication problems.

Model
With a GPS receiver and a set of radio transceivers, each UAV can obtain its real-time position and communicate with other UAVs if their physical distance is small enough [25]. When the cluster communication network is intact, each UAV can communicate with all other UAVs through the communication links. To improve the stability of the communication network of the UAV cluster, it is necessary to ensure that each UAV is connected to at least one other UAV; otherwise, some individual UAVs may move apart from the cluster. Malfunctions, such as damage to some UAVs and the interruption of communication links, may change the topology of the cluster and affect the communication of the other UAVs. The cluster must rebuild the connected network after such malfunctions.

Coordinate System Establishment
We consider the launching point as the origin of the north-east-down coordinate system (denoted as O XYZ ), as shown in Figure 1. We index the UAV in the cluster as i, i ∈ {1, 2, · · · , n}. The position vector of i is p i . The origin of cluster coordinate system (denoted as o xyz ) coincides with the initial barycenter o of the cluster. The position vector p o and the velocity vector v o of the barycenter of the cluster in the O XYZ system are given by Equation (1). The x axis coincides with the cluster velocity v o , and the z axis lies in the vertical plane and is perpendicular to the x axis downward; the y axis direction is determined by the right-hand rule, as shown in Figure 1.
The coordinate conversion matrix M O→o from the O XYZ system to the o xyz system is given by Equations (2)-(4), where β is the flight path angle of the barycenter of the cluster and ξ is the heading angle.
In the O XYZ system, the position vector of UAV i is p i ; in the o xyz system, the position vector R i of UAV i is given by Equation (5). We model the motion of UAVs in the o xyz system.
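The barycenter and frame-conversion steps above can be sketched as follows. Since Equations (1)-(5) are not reproduced in the text, the mean-based barycenter and the axis order of the rotation matrix (heading rotation ξ about the Z axis, then flight-path rotation β about the Y axis) are illustrative assumptions:

```python
import numpy as np

def barycenter(positions, velocities):
    # Equation (1) (assumed form): barycenter position and velocity
    # as the mean over the cluster in the O XYZ system
    return positions.mean(axis=0), velocities.mean(axis=0)

def conversion_matrix(beta, xi):
    # Plausible M O->o: rotate by heading angle xi about Z, then by
    # flight path angle beta about Y. Equations (2)-(4) are not
    # reproduced in the text; this is an illustrative reconstruction.
    Rz = np.array([[np.cos(xi), np.sin(xi), 0.0],
                   [-np.sin(xi), np.cos(xi), 0.0],
                   [0.0, 0.0, 1.0]])
    Ry = np.array([[np.cos(beta), 0.0, -np.sin(beta)],
                   [0.0, 1.0, 0.0],
                   [np.sin(beta), 0.0, np.cos(beta)]])
    return Ry @ Rz

def relative_position(p_i, p_o, M):
    # Equation (5) (assumed form): position of UAV i in the cluster frame
    return M @ (p_i - p_o)
```

With β = ξ = 0 the conversion matrix reduces to the identity, and R i is simply the offset of UAV i from the barycenter.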

Virtual Leader-Follower Model
Using the traditional method to construct the state space, the RL algorithm is not likely to converge because of the instability of the state space. To address this issue, we introduce a virtual navigator.
When the cluster communication network is intact, UAVs can exchange position information with each other and calculate the real-time position p o of the barycenter. We assume that at time t 0 , the cluster communication fails. However, the position vector p o (t 0 ) and the velocity vector v o (t 0 ) of the barycenter in the O XYZ system are obtained by each UAV. The virtual navigator takes the position vector p o (t 0 ) as the initial position and moves in a uniform straight line at the velocity v o (t 0 ). The movement of the virtual navigator is not affected by other UAVs.
Each UAV can easily calculate the real-time position of the virtual navigator, although the position information of other UAVs cannot be obtained. To better describe the cluster movement, we take the virtual navigator as the origin of the o xyz system. The velocity v o (t 0 ) direction of the virtual navigator is in the x axis direction, and the definitions of y and z axes are the same.
In the o xyz system, the velocity vector of UAV i relative to the virtual navigator is given by Equation (6), where v i is the velocity vector of UAV i in the O XYZ system. The virtual navigator model reduces the state space of the agent by taking the position of the UAV relative to the virtual navigator as the state of RL and makes it possible to control the UAV cluster in open space based on RL. Although the virtual navigator is introduced to assist the agents in training, the algorithm remains decentralized.
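A minimal sketch of the virtual navigator described above, assuming the uniform straight-line motion from p o (t 0 ) at velocity v o (t 0 ); the function names are illustrative (the rotation into the o xyz axes is omitted for brevity):

```python
import numpy as np

def navigator_position(p_o_t0, v_o_t0, t, t0):
    # The virtual navigator starts at p_o(t0) when communication fails
    # and moves in a uniform straight line at v_o(t0), unaffected by
    # the UAVs.
    return p_o_t0 + (t - t0) * v_o_t0

def relative_state(p_i, p_o_t0, v_o_t0, t, t0):
    # RL state of UAV i: its position relative to the virtual navigator,
    # which each UAV can compute without hearing from the others.
    return p_i - navigator_position(p_o_t0, v_o_t0, t, t0)
```

Because every UAV stores p o (t 0 ) and v o (t 0 ) at the moment of failure, each one can evaluate this state locally with no further communication.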

Network Connectivity Model
The transmission of wireless signals in space results in damping of the received power. Based on the Friis transmission equation, the transmitting power P i t of the transmitting node f i and the receiving power P j r of the receiving node f j satisfy Equation (7): P j r = P i t G t G r (λ/(4πd ij )) 2 , where G t is the antenna gain of the transmitting node, G r is the antenna gain of the receiving node, d ij is the distance between the transmitting node f i and the receiving node f j , λ is the wavelength of the radio frequency (RF) signal, and λ/(4πd ij ) is called the free-space loss factor. Usually, G t and G r are constant values.
The signal-to-noise ratio of the receiving node f j in decibels (dB) is given by Equation (8): SNR j = 10 log 10 (P j r /P n ), where P n is the power of noise and interference. The signal-to-noise ratio threshold is SNR th . If the signal-to-noise ratio satisfies Equation (9), SNR j ≥ SNR th , the received signal is detectable.
Based on Equations (7)-(9), the maximum transmission distance d max between the transmitter f i and receiver f j can be determined by Equation (10), which indicates the communication range of two UAV nodes based on their distance.
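Combining Equations (7)-(9): detectability requires P j r ≥ P n · 10^(SNR th /10), and substituting the Friis equation gives d max = (λ/(4π)) · sqrt(P i t G t G r /(P n · 10^(SNR th /10))). A small sketch of this calculation, with illustrative parameter values:

```python
import math

def received_power(p_t, g_t, g_r, wavelength, d):
    # Friis equation (7): free-space received power at distance d
    return p_t * g_t * g_r * (wavelength / (4 * math.pi * d)) ** 2

def max_range(p_t, g_t, g_r, wavelength, p_n, snr_th_db):
    # Equation (10): the largest distance at which
    # 10*log10(P_r / P_n) >= SNR_th still holds
    p_r_min = p_n * 10 ** (snr_th_db / 10)
    return (wavelength / (4 * math.pi)) * math.sqrt(p_t * g_t * g_r / p_r_min)
```

At d = d max the received SNR equals the threshold exactly; beyond it the link is undetectable, which is what defines the communication range between two UAV nodes.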
Assuming that the UAV cluster network is homogeneous, each UAV can be represented as a node, and the two-way communication links between UAVs are represented by the edges. Therefore, the UAV nodes f = { f 1 , f 2 , · · · , f n } and their corresponding communication links e = {e 1 , e 2 , · · · , e m } form a three-dimensional network topology, expressed as a graph G(f, e), where m is the number of edges in the graph. In Figure 2, we show a cluster with a communication malfunction, in which the UAVs on the left cannot communicate with the UAVs on the right. The connectivity of this network can be modeled using the adjacency matrix of G(f, e), which is defined as A = (a ij ) n×n ,
where a ij = 1 means that f i , f j are connected. If all the entries of the matrix P(A) = I + A + A 2 + · · · + A n−1 are non-zero, the UAV cluster network is fully connected [18,26].
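The connectivity test via P(A) = I + A + A 2 + · · · + A n−1 can be implemented directly; a minimal sketch:

```python
import numpy as np

def is_connected(adjacency):
    # The cluster network is fully connected iff every entry of
    # P(A) = I + A + A^2 + ... + A^(n-1) is non-zero, since the
    # (i, j) entry of A^k counts walks of length k from i to j.
    n = adjacency.shape[0]
    p, power = np.eye(n), np.eye(n)
    for _ in range(n - 1):
        power = power @ adjacency
        p = p + power
    return bool(np.all(p > 0))
```

For example, a three-node chain (links 1-2 and 2-3) is connected, while dropping the 2-3 link leaves a zero entry in P(A) and the test fails.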

Methods
The multi-agent reinforcement learning (MARL) algorithm [27] is a solution framework for the collaborative control of UAV clusters. It enables multiple agents to complete complex tasks through collaborative decision-making in a high-dimensional, dynamic environment. Based on the core idea of centralized training and distributed execution, MARL is autonomous, distributed, and coordinated, and approaches a near-optimal joint policy. It builds on the actor-critic framework [28] and effectively addresses the non-stationarity of the multi-agent environment and the resulting failure of experience replay.

Markov Game for Multi-UAV Cooperation
We simulate the training process of reinforcement learning based on a partially observable Markov game [29]. The Markov process of n agents is represented by a high-dimensional tuple < S, A, R, P, γ >, where S = [s 1 , s 2 , · · · , s n ] denotes the state space of the Markov decision process, A = [a 1 , a 2 , · · · , a n ] is the joint action set of all agents, R = [r 1 , r 2 , · · · , r n ], where r i is the reward of the agent i, P : S × A × S → [0, 1] is the state transition function, and γ is the discount factor of the cumulative discounted reward [30].
In a multi-agent system, all agents execute a joint policy to generate a new state. The reward of each agent depends on the joint policy executed by all of them. According to the relationship between the agents, MARL can be divided into complete cooperation, complete competition, and a competition/cooperation hybrid [31]. Our objective is to train the UAV clusters to learn the policy of constructing connected networks with complete cooperation. In this setting, the objective of each agent is to maximize the common reward.

Multi-Agent DDPG Approach
The MADDPG algorithm is an extension of the DDPG algorithm to the multi-agent environment, where each agent has independent actor and critic networks [12]. The MADDPG algorithm improves on the traditional actor-critic (A-C) algorithm by allowing the critic network of each agent to use the policies of the other agents during learning.
As shown in Figure 3, the MADDPG adopts the framework of centralized training with decentralized execution. Each agent is associated with an actor network and a centralized critic network. In the training, a single agent observes its state, outputs actions based on the actor network, and then obtains the corresponding rewards and new states from the environment. The critic network obtains the action and state information of all agents and outputs the state-action value for the state and action from a global perspective. Then, the actor network is updated by the state-action value. In the execution, without the critic network, the actor network outputs the actions based on the local observation.
The joint policy is denoted by π = [π 1 , π 2 , · · · , π n ], which is defined by the policy parameters θ = [θ 1 , θ 2 , · · · , θ n ]. The cumulative expected reward of the agent i is given by Equation (12): J(θ i ) = E s∼p π , a∼π [ ∑ t γ t r i,t ], where p π is the state distribution, t is the time step, θ i parameterizes the action distribution implicit in the policy π i , and E is the expectation operator.
For the deterministic policy gradient algorithm, the policy gradient is given by [32]

∇J(θ i ) = E s,a∼D [ ∇ θ i π i (a i |o i ) ∇ a i Q i π (s, a 1 , · · · , a n ) | a i =π i (o i ) ]   (13)

where D is the experience replay buffer consisting of a series of tuples (s, s′, a 1 , . . . , a n , r 1 , . . . , r n ) which record the experience of all agents, o i is the observation of the agent i, Q i π (s, a 1 , · · · , a n ) is the centralized state-action value function of the agent i, s = [o 1 , · · · , o n ] consists of the observations of all agents, and s′ = [o′ 1 , · · · , o′ n ] consists of the new observations of all agents after executing the actions. Note that different agents have different state-action value functions, which are the basis of cooperation and competition amongst multiple agents. The centralized critic network Q i of the agent i is updated by minimizing the loss

L(θ i ) = E s,a,r,s′ [ (Q i π (s, a 1 , · · · , a n ) − y) 2 ],  where y = r i + γQ i π′ (s′, a′ 1 , . . . , a′ n )   (14)

where π′ = [π′ 1 , · · · , π′ n ] is the target policy vector with the lagging update parameters θ′ i . In the MADDPG algorithm, the critic network is updated with global information, whereas the actor network only uses local observation information. For any π i ≠ π′ i , we have P(s′|s, a 1 , . . . , a n ) = P(s′|s, a 1 , . . . , a n , π 1 , . . . , π n ). Since the actions of all agents are known, the environment remains stationary even as the policies change in real time [12].

DE-MADDPG Approach
The DE-MADDPG algorithm is an extended version of the MADDPG algorithm, which improves the network architecture of the MADDPG algorithm [33]. The MADDPG algorithm implements centralized training through a global centralized critic network. As shown in Figure 4, the DE-MADDPG algorithm has a global centralized critic network shared amongst all agents, and each agent additionally has a local critic network [34]. The purpose of this improvement is to maximize local and global rewards through reward decoupling. The role of the global critic network is to maximize the global reward, whereas the role of the local critic network is to maximize the local reward. The DE-MADDPG algorithm is therefore a centralized training and distributed execution architecture. During the training phase, the state and action information of all agents is needed, whereas during execution each agent only outputs actions based on its own state.
The main idea of the DE-MADDPG algorithm is to combine the MADDPG with a single-agent DDPG, where the policy gradient is

∇J(θ i ) = E s,a∼D [ ∇ θ i π i (a i |o i ) ∇ a i Q ψ π (s, a 1 , . . . , a n ) + ∇ θ i π i (a i |o i ) ∇ a i Q ϕ i (o i , a i ) ]   (15)

where D is the experience replay buffer, Q ϕ i is the local critic network of the agent i with the parameter ϕ i , and Q ψ is the global critic network with the parameter ψ [33]. The two critic networks of the DE-MADDPG algorithm correspond to two loss functions, respectively. The global critic network Q ψ is updated as

L(ψ) = E s,a,r,s′ [ (Q ψ π (s, a 1 , . . . , a n ) − y g ) 2 ],  where y g = r g + γQ ψ′ π′ (s′, a′ 1 , . . . , a′ n )   (16)

where π′ = (π′ 1 , π′ 2 , . . . , π′ n ) are the target policies with the parameters θ′ = (θ′ 1 , θ′ 2 , · · · , θ′ n ), and Q ψ′ is the target global critic network with the parameter ψ′.
The local critic network Q ϕ i of an agent i is updated as

L(ϕ i ) = E o,a,r,o′ [ (Q ϕ i (o i , a i ) − y i ) 2 ],  where y i = r i + γQ ϕ′ i (o′ i , π′ i (o′ i ))   (17)

where Q ϕ′ i is the target local critic network with the parameter ϕ′ i . The pseudo-code of the DE-MADDPG is shown in the following Algorithm 1.

Algorithm 1: Training Process of the DE-MADDPG.
Initialize: main global critic network Q ψ with the parameter ψ
Initialize: target global critic network Q ψ′ with the parameter ψ′
Initialize: local critic network Q ϕ i with the parameter ϕ i of each agent i = 1, 2, · · · , n
Initialize: target local critic network Q ϕ′ i with the parameter ϕ′ i of each agent i
Initialize: actor network π i with the parameter θ i of each agent i
Initialize: target actor network π′ i with the parameter θ′ i of each agent i
Initialize: replay buffer D
1: for episode = 1 to Max-episodes do
2:   for step t = 1 to Max-step do
3:     Get initial state s = [o 1 , · · · , o n ]
4:     Each agent i chooses action a i = π i (o i )
5:     Execute actions a = [a 1 , · · · , a n ]
6:     Receive new state s′ = [o′ 1 , · · · , o′ n ], local rewards r l = [r 1 , · · · , r n ], and global reward r g
7:     Store (s, a, r l , r g , s′) into D
8-16:  /* Train global critic, local critics, and actors on minibatches sampled from D (Equations (15)-(17)) */
17:   end for
18:   Update target critic networks and target actor network for each agent i
19: end for
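The TD targets and critic losses used in the training loop above can be sketched with plain numpy. Here the target-network outputs are stand-in scalar values rather than real networks, and the discount factor value is illustrative:

```python
import numpy as np

GAMMA = 0.95  # illustrative discount factor

def global_td_target(r_g, q_global_target_next):
    # y_g = r_g + gamma * Q'_psi(s', a'_1, ..., a'_n), Equation (16)
    return r_g + GAMMA * q_global_target_next

def local_td_target(r_i, q_local_target_next):
    # y_i = r_i + gamma * Q'_phi_i(o'_i, a'_i), Equation (17)
    return r_i + GAMMA * q_local_target_next

def critic_loss(q_value, td_target):
    # Mean squared TD error minimized when updating each critic
    return float(np.mean((q_value - td_target) ** 2))
```

The decomposition is visible in the two separate targets: the shared global critic regresses toward y g built from the global reward, while each local critic regresses toward y i built from its own reward.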

Design of DE-MADDPG
We set up a reward function based on the goal of building a connected network and trained the UAV cluster based on the DE-MADDPG algorithm. In the following, we present the application of the DE-MADDPG algorithm to this task in terms of the four involved elements: agent, state, action, and reward.
Agent: The n UAVs are regarded as n agents. In reality, each UAV can only obtain its own state. However, in the training phase, the state and action information of all UAVs is used to train the agents.
State: The state of each agent is the location coordinates of the UAV in the cluster coordinate system, represented by a vector (x, y, z).
Action: The action of each agent is the velocity of the UAV in the north-east-down coordinate system, determined by its speed, the flight path angle β i of the UAV i, and the heading angle ξ i .
Reward: Building a connected network of UAV clusters based on reinforcement learning is a sparse reward problem. A feasible solution is to define a reward shaping function that guides the agent to the ultimate goal. In the case of the connectivity problem, the ultimate goal is to achieve a connected network within a limited time. Therefore, we introduce the aggregation tendency rule to construct the reward reshaping function. During the movement of the cluster, each UAV that moves toward the center facilitates communication and interconnection within the cluster. Based on this, we consider a Gaussian shaping reward function. At time t, we denote the distance between the UAV i and the virtual navigator by d i , and the difference between this distance at the present time instant and at the previous one by ∆d i = d i (t) − d i (t − 1). The shaping reward function r i of the agent i is then defined as a Gaussian function of ∆d i (Equation (18)), where σ 1 and σ 2 are the reward coefficients that adjust the reward size.
In the DE-MADDPG algorithm, it is necessary to design the local reward and the global reward. The global reward is given by Equation (19): r g = R 2 if the network is fully connected, and r g = 0 otherwise, where R 2 is the terminal reward of the entire cluster. The local reward of an agent i is given by Equation (20), where r c is the punishment applied to UAV i if it moves out of the state space S.
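A sketch of this reward design. The paper's exact shaping expression (Equation (18)) is not reproduced above, so the Gaussian form below, as well as the values of σ 1 , σ 2 , R 2 , and r c , are illustrative assumptions rather than the authors' definitions:

```python
import math

SIGMA1, SIGMA2 = 1.0, 10.0   # illustrative reward coefficients
R2, R_C = 100.0, 10.0        # illustrative terminal reward / boundary penalty

def shaping_reward(delta_d):
    # One plausible Gaussian shaping form (ASSUMED, not the paper's
    # Equation (18)): moving toward the virtual navigator (delta_d < 0)
    # is rewarded, moving away is penalized, saturating for large steps.
    return SIGMA1 * math.copysign(1.0, -delta_d) * (1 - math.exp(-delta_d ** 2 / SIGMA2))

def global_reward(connected):
    # Equation (19): terminal reward R2 only once the network is connected
    return R2 if connected else 0.0

def local_reward(delta_d, in_state_space):
    # Equation (20) (assumed form): shaping reward, minus the penalty r_c
    # when the UAV leaves the state space S
    r = shaping_reward(delta_d)
    return r if in_state_space else r - R_C
```

The key structural point survives any choice of shaping form: the dense local term guides each agent toward the navigator, while the sparse global term is paid only at full connectivity.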

Experiment and Results
In this section, we first present the simulation setting and the hyperparameters. Then, the configuration of the algorithm is presented, followed by the results, including the reward function and success rate.

Environment Settings
The detailed simulation parameters are given in Table 1.
1. With the virtual navigator as the center, we considered an initialization space of 100 m × 100 m × 100 m. The UAV clusters were randomly distributed in this space, and the space moved with the movement of the virtual navigator.
2. The UAVs were homogeneous, with the same system parameters.
3. We considered two environments for training and testing. In the training environment, when a UAV was about to move beyond the boundary, it rebounded with a certain penalty but without energy consumption. In the test environment, if a UAV hit the boundary, the episode failed.

Training Configuration
The simulations were developed in Python, using PyCharm Community 2020.2 and the Anaconda3 platform. The deep neural networks were based on the PaddlePaddle 1.8.5 framework.
The UAV obtained its state by observing the environment and produced the control values according to the set control policy. The UAV then used the feedback of the environment to adjust the control policy and form a closed-loop training process. The training hyperparameters are presented in Table 2. The network structure of the DE-MADDPG is shown in Figure 5. The local critic network of each agent is a three-layer fully connected neural network, including an input layer, a hidden layer, and an output layer. The input corresponds to the observation and action information of this agent. The output is a scalar that evaluates the action in the state of this agent. The hidden layer includes 64 neurons with Rectified Linear Unit (ReLU) activation function. The output layer is linear without an activation function. The global centralized critic network is a four-layer fully connected neural network, including two hidden layers. The input is the information of observations and actions of all agents. The output is a scalar that evaluates the current actions in the current global state.
The actor network is a four-layer fully connected neural network, the same as the global centralized critic network. The input corresponds to the observation information of this agent. The output is the action of the agent in the current state.
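The critic structure described above (fully connected layers, 64 ReLU neurons per hidden layer, linear output) can be sketched with plain numpy. The 3-dimensional observation and action sizes are illustrative, and the random initialization stands in for PaddlePaddle's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # Fully connected network given layer sizes, e.g. [6, 64, 1]:
    # randomly initialized weight matrices and zero biases.
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    # ReLU on hidden layers, linear output layer (as described above)
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

# Local critic: (observation + action) -> scalar value,
# one 64-neuron hidden layer; 3-dim obs/action are assumed sizes.
local_critic = mlp([3 + 3, 64, 1])
```

The global critic and actor follow the same pattern with two hidden layers and their own input/output widths.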

Simulation Results
In Figure 6, we present the total reward curve of the DE-MADDPG for the training of 20,000 episodes. The green line is the reward for each episode, and the red line is the average reward over every 50 episodes. During the first 1000 episodes, the agents gradually learn the policy of moving toward the center and obtain higher intermediate rewards. The agents then gradually learn the policy of building a connected network, and the reward reaches a stable value after 17,000 episodes. Note that there are very few episodes that did not achieve full connectivity within 200 steps.
The training results after different numbers of episodes are shown intuitively in Figure 7. In the subgraphs (a-d), lines with different colors indicate the tracks of the different UAVs that built the connected network, and the dot on each line indicates the position of the UAV in the cluster coordinate system after the network was connected. After 500 episodes, the tracks were very messy and redundant, and the UAVs were more likely to collide. The agents learned the policy of moving toward the center after 1000 episodes. However, the turbulent tracks show that the UAVs changed their direction of movement frequently. After 3000 episodes, the tracks were clearer, but the direction of movement of some UAVs still changed too many times. After 10,000 episodes, the UAVs rarely changed their direction of movement, and the changes of direction were relatively smooth.
We ran 100 test episodes after every 500 training episodes, and the average number of steps of successful testing episodes after different numbers of episodes is shown in Figure 8. This illustrates that the cluster built a connected network faster and more directly with an increasing number of training episodes.
In Figure 9, we present the average reward over every 50 episodes and the standard deviation across 10 training sessions of the MADDPG and DE-MADDPG. In each training session, the same random seed was used to ensure the same initial state in each training round. Throughout the training phase, the reward of the DE-MADDPG was higher than that of the MADDPG. The reason can be easily found in Figure 10, where we see that the MADDPG was not able to learn the strategy of keeping the UAVs within the state space. Figure 10 also illustrates the rate of exceeding the state space for the MADDPG and DE-MADDPG for different numbers of training episodes. Comparing the training results of the MADDPG and DE-MADDPG, the policy learned by the DE-MADDPG was more efficient.
Figure 11 presents the success rate of the MADDPG and DE-MADDPG for different numbers of training episodes. We conducted 100 tests after every 100 episodes of training. During the tests, the agents did not update their policies. Comparing the test results of the MADDPG and DE-MADDPG, the success rate of the DE-MADDPG was higher, indicating that the policy of the DE-MADDPG was better than that of the MADDPG.
Figure 12 illustrates the success rate of clusters with different numbers of UAVs after training. It can be seen that with an increasing number of UAVs, the training becomes more difficult to converge. Nevertheless, after training for enough episodes, all of the clusters eventually learn excellent policies for constructing a connected network.

Conclusions
We developed the DE-MADDPG algorithm for UAV clusters to rebuild a connected network, where a virtual navigator was proposed to address the instability of the state space. The simulation results confirmed the effectiveness of the proposed algorithm for controlling UAV clusters and constructing a connected network, with a success rate much higher than that of the MADDPG algorithm. The results presented in this paper fill a gap in the application of reinforcement learning algorithms to providing cluster network connectivity.
The communication network in our study is a single-connected network, which is easily paralyzed by external interference. A bi-connected network, defined as having two disjoint communication paths between any pair of UAV nodes, is more robust. In the future, we will study how to build a bi-connected network for UAV clusters based on RL.