Resource Optimization for Multi-Unmanned Aerial Vehicle Formation Communication Based on an Improved Deep Q-Network

With the widespread application of unmanned aerial vehicle (UAV) formation technology, it is very important to maintain good communication quality with the limited power and spectrum resources available. To maximize the transmission rate and increase the successful data transfer probability simultaneously, the convolutional block attention module (CBAM) and value decomposition network (VDN) algorithm were introduced on the basis of a deep Q-network (DQN) for a UAV formation communication system. To make full use of the frequency resources, this manuscript considers both the UAV-to-base-station (U2B) and the UAV-to-UAV (U2U) links, where the U2B links can be reused by the U2U communication links. In the DQN, the U2U links, treated as agents, interact with the system and intelligently learn how to choose the best power and spectrum. The CBAM improves the training results along both the channel and spatial dimensions. Moreover, the VDN algorithm was introduced to solve the problem of partial observation by a single UAV using distributed execution, with the team Q-function decomposed into agent-wise Q-functions. The experimental results showed an obvious improvement in both the data transfer rate and the successful data transfer probability.


Introduction
As a new technology, unmanned aerial vehicle (UAV) technology has been widely used in civil, public and military fields [1]. The effectiveness and potential of UAVs for coastal-zone applications were elaborated in a review [2]. The execution latency has been decreased and the computation performance improved for mobile edge computing networks by integrating UAVs into the research [3]. By poring over a large amount of literature, Ref. [4] provided a novel insight into cyber-physical systems in UAV networks. To maximize the quality of experience of real-time video streaming, the authors of [5] designed the power control, the movement and the video resolution of the UAV-to-base-station and UAV-user links. Using the Stackelberg dynamic game, Ref. [6] came up with a resource pricing and trading scheme to realize edge computing resource allocation between UAVs and edge computing stations. Although a single UAV can finish some difficult tasks, the possibility of mission failure is relatively high as tasks become more complex and diversified. Multi-UAV applications have therefore attracted widespread attention and achieved remarkable accomplishments in situations where one UAV may not suffice. To make full use of the advantages of the UAV, multi-UAV formations have been designed and widely used in many fields. UAV formation technology has become an important research topic worldwide because it is highly three-dimensional, informationalized and electronized, and it is mainly applied in data fusion, physical verification platform technologies and formation technologies. The main contributions of this manuscript are as follows:
1. Resource optimization based on the three-dimensional distribution of multiple UAVs, which differs from the random [28] or flat [29] UAV distributions used in previous studies. The multi-UAVs were located at the edges or the eight vertices of a cube and kept this cube formation during flight;
2. A DQN was used to achieve joint power and channel resource optimization for the UAV formation. Moreover, a CNN was used after the state data were preprocessed in the training process;
3. On the basis of the DQN, the convolutional block attention module (CBAM) and value decomposition network (VDN) modules were introduced to further improve system performance. The CBAM works at the CNN layer to capture correlations of the feature map along both the channel and spatial dimensions, and the VDN works in the reward mechanism. The experimental results proved the effectiveness of introducing the CBAM and VDN.

System Model
In this section, the scenario of the multi-UAV formation network and the data transmission model are introduced. In the channel model, both small-scale and large-scale fading are included, and the optimization objective function is given last.

Scenario Setting
It was assumed that the multi-UAVs kept their flying formation during the execution of a task. The whole UAV network consisted of multiple UAVs and one base station (BS). The multi-UAVs were located at the eight vertices or edges of one cube. There were two data transmission modes, and the spectrum availability was limited. For the UAV network, M orthogonal uplinks (m = 1, 2, . . . , M) were pre-assigned to M UAVs to perform the UAV-to-base-station (U2B) data transfer, and N pairs of UAV-to-UAV (U2U) links (n = 1, 2, . . . , N) performed the U2U data transfer. The U2B links could be reused by the U2U links to make full use of the frequency. In one time slot, as agreed, one U2B link only used one uplink and multiple U2U links could reuse the same uplink; however, one U2U link could only occupy one uplink. The value ρ_{n,m} is the entry of an N × M binary indicator matrix, with ρ_{n,m} = 1 if the nth U2U link reused the mth uplink resource and ρ_{n,m} = 0 otherwise. (x_k, y_k, z_k), (x_j, y_j, z_j) and (0, 0, H) were used to represent the locations of the kth UAV, the jth UAV and the BS, respectively. Then, the distance between the kth and the jth UAV is calculated as follows:

d_{k,j} = \sqrt{(x_k - x_j)^2 + (y_k - y_j)^2 + (z_k - z_j)^2}.

Similarly, the distance between the kth UAV and the BS is calculated as

d_{k,BS} = \sqrt{x_k^2 + y_k^2 + (z_k - H)^2}.
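The two distance formulas above can be sketched in a few lines. The cube side length and BS antenna height below are illustrative assumptions, not values from the paper's parameter table:

```python
import numpy as np

# Hypothetical geometry constants (assumptions, not from the paper).
SIDE = 100.0   # cube edge length in metres
H_BS = 25.0    # base-station antenna height H in metres

def uav_distance(p_k, p_j):
    """Euclidean distance between the kth UAV at p_k and the jth UAV at p_j."""
    return float(np.linalg.norm(np.asarray(p_k, dtype=float) - np.asarray(p_j, dtype=float)))

def uav_bs_distance(p_k, h_bs=H_BS):
    """Distance between the kth UAV at (x, y, z) and the BS at (0, 0, H)."""
    x, y, z = p_k
    return float(np.sqrt(x ** 2 + y ** 2 + (z - h_bs) ** 2))

# The eight vertices of the cube formation, lifted SIDE metres above ground.
vertices = [(x, y, z)
            for x in (0.0, SIDE)
            for y in (0.0, SIDE)
            for z in (SIDE, 2 * SIDE)]
```

With these helpers, the full pairwise-distance matrix of the formation is a double loop over `vertices`.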

Modeling the Line of Sight Probability
In this part, the channel model used in the multi-UAV formation network is introduced. The channel model consists of the channel model of the U2U and U2B data transfer. The U2U and U2B links are different due to the different line-of-sight (LoS) characteristics and the elevation angle in the actual environment.
For the U2U channel model, the free-space channel model of the U2U link was used in the multi-UAV formation network. The values P_{ntr} and P_{nre} denote the transmitter and receiver power of the nth U2U link reusing the mth uplink. The relationship between them can be expressed as

P_{nre} = h P_{ntr} d_{ntr,nre}^{-\omega},

where h is the constant channel gain factor related to the amplifier and antenna, d_{ntr,nre} is the distance between the transmitter and receiver UAVs of the nth U2U link and ω is the channel path loss constant. The receiver interference of the nth U2U link consists of the signals that come from the mth U2B link and from the other U2U links that reuse the same mth uplink. The value P^{U2B}_m is labeled as the transmitter power of the mth U2B link; then, the receiver interference of the nth U2U link, I_{nre}, is expressed as follows:

I_{nre} = h P^{U2B}_m d_{m,nre}^{-\omega} + \sum_{n' \neq n} \rho_{n',m} h P^{U2U}_{n'} d_{n',nre}^{-\omega},

where P^{U2U}_{n'} is labeled as the transmitter power of the n'th U2U link and ρ_{n',m} indicates whether the n'th U2U link reuses the mth uplink. The data transfer rate of the U2U link can be calculated as follows:

R^{U2U}_n = \log_2\left(1 + \frac{P_{nre}}{I_{nre} + \sigma^2}\right),

where σ² is the variance of the additive white Gaussian noise (AWGN) with zero mean.
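The U2U receive power, interference and rate described above can be computed as a short sketch. The channel constants `h`, `omega`, `sigma2` and the example distances are illustrative assumptions, not the paper's simulation values:

```python
import numpy as np

def u2u_rate(p_tx_n, d_n, p_u2b_m, d_b2n, p_tx_others, d_others, rho_others,
             h=1e-3, omega=2.0, sigma2=1e-13):
    """SINR-based rate of the nth U2U link reusing the mth uplink.

    p_tx_n / d_n      : transmitter power and Tx-Rx distance of this link
    p_u2b_m / d_b2n   : power of the mth U2B link and its distance to this receiver
    *_others          : powers, distances and reuse indicators of co-channel U2U links
    """
    p_rx = h * p_tx_n * d_n ** (-omega)                   # desired received signal
    i_u2b = h * p_u2b_m * d_b2n ** (-omega)               # interference from the mth U2B link
    i_u2u = sum(r * h * p * d ** (-omega)                 # co-channel U2U interference
                for r, p, d in zip(rho_others, p_tx_others, d_others))
    sinr = p_rx / (i_u2b + i_u2u + sigma2)
    return float(np.log2(1.0 + sinr))
```

As expected from the interference model, switching on a co-channel U2B transmitter lowers the achievable U2U rate.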
For the U2B channel model, the air-to-ground (ATG) propagation model proposed in the existing literature [30,31] is adopted for the U2B data transmission. To predict the ATG path loss, [32] designed a statistical propagation model. In time slot t, the path loss of the LoS and non-line-of-sight (NLoS) transmissions of the kth U2B link is indicated as

PL_{LoS,k}(t) = L_{FS,k}(t) + \eta_{LoS},
PL_{NLoS,k}(t) = L_{FS,k}(t) + \eta_{NLoS},

where d_{k,BS} is the distance between the BS and the transmitter UAV of the kth U2B link, and η_{LoS} and η_{NLoS} are attenuation factors caused by the LoS and NLoS transmissions, respectively. The value L_{FS,k}(t) is the free-space path loss, expressed as follows:

L_{FS,k}(t) = 20 \log_{10}(d_{k,BS}) + 20 \log_{10}(f) + 20 \log_{10}\left(\frac{4\pi}{c}\right),

where f is the carrier frequency and c is the speed of light. As recommended by the International Telecommunication Union (ITU) and mentioned in reference [33], the probability of LoS data transfer is expressed as

P_{LoS,k}(t) = \frac{1}{1 + \alpha \exp(-\beta(\theta_k(t) - \alpha))},

where α and β are constants related to the environment and θ_k(t) is the elevation angle of the kth UAV. This LoS probability model is suited to the low-altitude aerial platforms of the UAV network. The probability is closely related to three statistical parameters of the environment: the average number of buildings per unit area; the proportion of construction land area in the total land area; and the building height distribution, based on the Rayleigh probability density function. The average path loss in decibels (dB) is expressed as follows.
PL_{avg,k}(t) = P_{LoS,k}(t) \cdot PL_{LoS,k}(t) + P_{NLoS,k}(t) \cdot PL_{NLoS,k}(t),

where P_{NLoS,k}(t) = 1 − P_{LoS,k}(t). The average power received by the BS from the mth U2B link is denoted as

P^{BS}_m(t) = P^{U2B}_m \, 10^{-PL_{avg,m}(t)/10},

where P^{U2B}_m is the transmitter power of the mth U2B link. Similarly, P^{U2U}_n is labeled as the transmitter power of the nth U2U link that reuses the mth uplink. The interference received by the BS from the U2U links is calculated as

I_{BS,m}(t) = \sum_{n=1}^{N} \rho_{n,m} P^{U2U}_n \, 10^{-PL_{avg,n}(t)/10}.

Therefore, the signal-to-interference-plus-noise ratio (SINR) of the mth U2B link is calculated as

\gamma_m(t) = \frac{P^{BS}_m(t)}{I_{BS,m}(t) + \sigma^2},

where σ² is the variance of the AWGN with zero mean. The data transfer rate of the mth U2B link is expressed as

R^{U2B}_m = \log_2(1 + \gamma_m(t)).
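The LoS probability and average U2B path loss described above can be sketched as follows. The environment constants `alpha`, `beta` and the attenuation factors below are the widely used illustrative urban-environment values, assumed here rather than taken from the paper's Table 1:

```python
import numpy as np

def p_los(theta_deg, alpha=9.61, beta=0.16):
    """ITU-style LoS probability as a function of the elevation angle (degrees)."""
    return 1.0 / (1.0 + alpha * np.exp(-beta * (theta_deg - alpha)))

def avg_path_loss_db(d_m, theta_deg, f_hz=2e9, eta_los=1.0, eta_nlos=20.0):
    """Average U2B path loss in dB: free-space loss plus the LoS/NLoS excess
    losses weighted by their probabilities."""
    c = 3e8  # speed of light, m/s
    l_fs = (20 * np.log10(d_m) + 20 * np.log10(f_hz)
            + 20 * np.log10(4 * np.pi / c))        # free-space path loss (dB)
    p = p_los(theta_deg)
    return p * (l_fs + eta_los) + (1 - p) * (l_fs + eta_nlos)
```

A UAV seen at a high elevation angle is almost surely in LoS, so its average path loss is lower than that of a UAV at a grazing angle.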

The Problem Optimization Objective
This section details the optimization objective of the multi-UAV formation network. The goal of this manuscript was to improve the successful data transfer probability of the U2U links and realize the maximization of the system data transfer rate. The optimization objective is modeled as

\max_{\rho_{n,m},\, P_{ntr}} \left( \sum_{m=1}^{M} R^{U2B}_m + \sum_{n=1}^{N} R^{U2U}_n \right)
\quad \text{s.t.} \quad 0 < P_{ntr} \le P_{max}, \quad P_m = P_{max}, \quad \rho_{n,m} \in \{0, 1\}, \quad \sum_{m=1}^{M} \rho_{n,m} \le 1,

where P_max stands for the maximum transmitter power used by the UAVs, and P_{ntr} and P_m stand for the transmitter power of the nth U2U and mth U2B links, respectively. In view of the fact that the distances from the UAVs to the BS are longer than those between the UAVs, the maximum power is adopted by the U2B link transmitters to guarantee the quality of the data transfer. For the sake of maximizing the data transfer rate and generating less interference to the system, each U2U link ought to choose the appropriate power and spectrum. Three power levels were set for the U2U links. Then, P_{ntr} and ρ_{n,m} are the arguments to be optimized. The optimization function consists of the rates of the U2U links and U2B links and the transmission time limit. The optimization function at each time slot t, which was used as the reward function in the proposed DQN mechanism, is expressed as follows:

r_t = a \sum_{m=1}^{M} R^{U2B}_m + b \sum_{n=1}^{N} R^{U2U}_n - (1 - a - b)(T - E_t),

where a, b and (1 − a − b) are the weights of each part. If the data in a U2U link are transmitted successfully within the transmission time T, it is classed as a successful data transfer. The value E_t is the time remaining to finish the data transfer in the U2U links, and it is initialized to T, so the third term is initialized to zero, which means the transmitters have plenty of time to transmit data and obtain a bigger reward r_t. As the remaining time E_t gets shorter, the reward r_t decreases due to the increase in the data transfer failure probability.
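The per-slot reward can be sketched as a small helper. The linear time-pressure penalty follows the description in the text (zero at the start, growing as the remaining time shrinks); the weights `a` and `b` are illustrative values, not the paper's:

```python
def reward(r_u2b_sum, r_u2u_sum, t_total, e_t, a=0.4, b=0.4):
    """Per-slot reward: weighted U2B and U2U sum rates minus a penalty
    that is zero when e_t == t_total and grows as the remaining
    transmission time e_t shrinks."""
    return a * r_u2b_sum + b * r_u2u_sum - (1 - a - b) * (t_total - e_t)
```

At the first slot the penalty term vanishes, and the reward shrinks as the deadline approaches, which is exactly the pressure that pushes the agents to finish their U2U transfers early.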

The Proposed Method
This section details the joint spectrum and power allocation method based on a DQN in the multi-UAV formation network. The process of the DQN method is introduced first, and then the CBAM algorithm and the VDN module are added on the basis of the DQN.

Introduction of the DQN
The multi-UAV movement problem could be modeled as a constrained Markov decision process (CMDP) over a given moving area, which is a classical formulation for solving RL tasks with constraints. The maximum cumulative long-term discounted reward can be obtained after the agents are well trained to execute the best action strategy. In the multi-UAV formation network, the whole network corresponds to the environment in the RL method, and each U2U link is regarded as an agent. An action is labelled a_t, and it includes two parameters of the action space A. Each U2U transmitter makes an action choice according to the current state s_t, and the environment feeds back a reward r_t and the next state s_{t+1}. Therefore, the state, action, reward and next state [s_t, a_t, r_t, s_{t+1}] form the main transition in RL. The action space A is the set of M uplinks and three power levels. The state s_t contains the channel information of the U2B links H_UB and the U2U links H_uu, the interference from the U2U links I_uu, the uplink reuse indicator of the U2U links B, the time left to transmit data E_t and the data left to transmit D_t. The set of all states is called the state space S.
For a given state s_t, in each time slot the U2U link transmitter needs to select an action according to the strategy π. In regular Q-learning, the decision-making strategy is carried out by a Q-function. Once a state-action pair is determined, the Q-value reward is calculated after selecting one action. To find the optimization strategy that maximizes the cumulative reward, once the Q-value is calculated, the action selection strategy can be carried out according to the following expression:

a_t = \arg\max_{a \in A} Q(s_t, a).

So, the optimization strategy can be denoted as π : s_t ∈ S → a_t ∈ A. For a specific action and state, the maximum Q-value, i.e., the expected accumulative discounted reward, is expressed as

Q(s_t, a_t) = E\left[\sum_{k=0}^{\infty} \lambda^k r_{t+k} \,\middle|\, s_t, a_t\right],

where λ is the discount coefficient that is needed to strike a balance between current and future decisions. This expected accumulative discounted reward is the total objective for the optimization. According to the greedy action-selection rule above, each UAV can choose the most appropriate power and spectrum based on its remaining data transmission time and the system interference.
In the multi-UAV formation problem, the state and action spaces will be very large, which makes it difficult to use regular Q-learning, as some Q-values are seldom updated. As a combination of RL and a deep network, the DQN has been widely used. A DQN takes a state as the input and outputs all the action values Q(s, a; θ); it can then select the action with the maximum value as the next action. The values θ are the parameters of the neural network (NN), which is called the Q-network. As is well known, in the study of [22], Mnih et al. introduced target networks and experience replay to achieve a better performance for the DQN algorithm. The output of the target network in the DQN is written as

y_t = r_t + \lambda \max_{a'} Q(s_{t+1}, a'; \bar{\theta}),

where \bar{θ} are the coefficients of the target network and are updated from the Q-network after a certain number of steps. The coefficients of the Q-network are updated by gradient descent on the loss function, shown below:

L(\theta) = E_{(s_t, a_t, r_t, s_{t+1}) \sim D}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right],

where D is a memory buffer that is used to store all the transitions [s_t, a_t, r_t, s_{t+1}]. Following the experience replay method, in order to reduce data correlation, fixed-size batches of transitions are randomly selected from D in every training period, and the parameters of the Q-network are updated while training on these data.
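The experience replay buffer and the target-network output described above can be sketched in a few lines of plain Python (the buffer capacity and batch size in the usage are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions (s_t, a_t, r_t, s_{t+1}) and
    sample uncorrelated mini-batches for training."""
    def __init__(self, capacity=10 ** 6):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # list() copy keeps random.sample happy on any Python version
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

def td_target(r, q_next_max, lam=0.99, done=False):
    """Target-network output y_t = r_t + λ max_a Q(s_{t+1}, a; θ̄),
    or just r_t at the final step of an episode."""
    return r if done else r + lam * q_next_max
```

The Q-network then regresses Q(s_t, a_t; θ) toward `td_target(...)` on each sampled mini-batch, which is exactly the loss above.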

The DQN with the CBAM
Research [22] has proven that the introduction of experience replay and a target network greatly improves the performance of the DQN. In this system, the state data are converted into a square matrix to train the CNN. On the basis of the DQN, the CBAM is added, which consists of two important submodules: channel and spatial. The channel submodule adopts both average-pooling and max-pooling outputs, processed by a shared network. The outputs are then processed by the spatial submodule. In the spatial submodule, after passing through the pooling layer, the output data of the channel submodule are put into a convolution layer. The following details the operation of the submodules.
Let F^c_avg and F^c_max represent the average-pooling and max-pooling operations, respectively. Two different descriptors are generated after the pooling operations, acting on the UAV states. The channel attention data M_c ∈ R^{C×1×1} can be obtained after the pooling results are both processed by the following shared network. The shared network consists of a multilayer perceptron (MLP) with one hidden layer, and the size of the hidden activation is R^{C/r×1×1}, where r is the reduction ratio. The output features are summed element-wise after the shared network. So, the channel attention operation can be expressed as

M_c(F) = \sigma\left(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\right),

where σ represents the sigmoid function, and W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the weights of the shared network, shared for both pooling results. The spatial attention map is generated by using the spatial relationships of the features, focusing on 'where' the useful information is. The spatial attention is complementary to the channel attention. For the spatial attention, the average-pooling and max-pooling operations are applied first, and their outputs are concatenated into an effective feature descriptor. Then, the descriptor is processed by a convolution layer to obtain the spatial attention data M_s(F) ∈ R^{1×H×W}. So, the spatial attention operation can be expressed as

M_s(F) = \xi\left(f\left([F^s_{avg}; F^s_{max}]\right)\right),

where ξ represents the sigmoid function and f denotes a convolution operation.
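The two CBAM submodules can be sketched with NumPy on a (C, H, W) feature map. Note one simplification: where CBAM applies a 7×7 convolution in the spatial submodule, the sketch below uses a per-pixel weighted sum of the two pooled maps as a stand-in, so it illustrates the data flow rather than reproducing the exact layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """Channel submodule: the shared MLP (W1·W0) processes both the
    average-pooled and max-pooled descriptors; the results are summed
    element-wise and squashed. F: (C, H, W), W0: (C/r, C), W1: (C, C/r)."""
    avg = F.mean(axis=(1, 2))                        # (C,) average-pool descriptor
    mx = F.max(axis=(1, 2))                          # (C,) max-pool descriptor
    mc = sigmoid(W1 @ (W0 @ avg) + W1 @ (W0 @ mx))   # (C,) channel attention M_c
    return mc[:, None, None] * F                     # re-weight each channel

def spatial_attention(F, w_avg=1.0, w_max=1.0):
    """Spatial submodule: pool along the channel axis, combine the two maps
    (stand-in for the 7x7 convolution) and squash with a sigmoid."""
    avg = F.mean(axis=0)                             # (H, W)
    mx = F.max(axis=0)                               # (H, W)
    ms = sigmoid(w_avg * avg + w_max * mx)           # (H, W) spatial attention M_s
    return ms[None, :, :] * F
```

Applying the channel submodule first and then the spatial submodule, as in the text, preserves the feature-map shape while re-weighting it along both aspects.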

The DQN with the VDN
Section 3.2 detailed the optimization of the network along the channel and spatial aspects, and this section looks at improving the system performance in terms of the reward mechanism. Due to partial observability, the spurious rewards problem exists in both fully centralized and fully decentralized mechanisms. To avoid the spurious reward data produced by naive independent agents, a VDN is introduced while training the individual agents. The operation is to obtain agent-wise value functions by decomposing the team value function. Based on team rewards, the system can learn an optimal linear decomposition. The total Q-gradient is propagated backward through the deep neural networks (DNNs) representing each agent-wise value function.
For the multi-UAV formation network, distributed across n UAVs, the observations and actions are labeled as n-dimensional tuples of observations in O and actions in A.
The additive decomposition can be expressed by the following equation:

Q\left((h_1, h_2, \cdots, h_n), (a_1, a_2, \cdots, a_n)\right) \approx \sum_{i=1}^{n} \tilde{Q}_i(h_i, a_i),

where \tilde{Q}_i depends only on each agent's own observations and is learned by back-propagating gradients from the joint summation reward, rather than the specific reward of agent i, and (h_1, h_2, \cdots, h_n) is the tuple of agent histories. For the improved summation reward, the \tilde{Q}_i are action-value functions that together account for all the particular rewards of the UAVs, without any other constraints. The overall architecture is shown in Figure 1.
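The additive decomposition has a useful consequence that a short sketch makes concrete: because the team Q-value is a sum of per-agent terms, maximizing it decomposes into independent per-agent argmaxes, which is what enables decentralised execution (the per-agent Q-vectors below are illustrative values):

```python
import numpy as np

def vdn_joint_q(per_agent_q, joint_action):
    """Team Q-value as the sum of agent-wise Q-values:
    Q_tot((h_1..h_n), (a_1..a_n)) ≈ sum_i Q_i(h_i, a_i).
    per_agent_q[i] is agent i's Q-vector over its own actions."""
    return float(sum(q[a] for q, a in zip(per_agent_q, joint_action)))

def greedy_joint_action(per_agent_q):
    """The additive structure lets each agent pick its own argmax
    locally; the resulting joint action also maximizes Q_tot."""
    return [int(np.argmax(q)) for q in per_agent_q]
```

During training, only the summed `vdn_joint_q` is regressed toward the team reward; the gradient flows back into each agent's own network through its summand.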

Training and Validation
Algorithm 1 shows the flow of the entire training architecture. The buffer capacity D is 10^6, the number of training steps M is 40,000 and every 2000 steps constitutes one episode. The value γ is 0.99 and K is 50. The value λ of the Q-function is 1. The programming language is Python, and the code was written and executed in PyCharm Community Edition 2019.2.2 x64 on a Xeon Silver 4110 CPU @ 2.10 GHz. The network layers are detailed in the simulation section. The communication rate and time delay were designed into the Q-function in order to maximize the Q-function and obtain a better training effect, and the CBAM and VDN were added, respectively. The base algorithm is a pure DQN without modules 1 and 2 (the CBAM and VDN); either module can be embedded into the network quickly. The CBAM affects the training results along both the channel and spatial aspects. With the VDN, the action selection for the UAVs is based on the global observations of all the UAVs. In the validation architecture, at every time step each transmitter of a U2U link adopts the action that maximizes the evaluation Q-value, based on the well-trained target network. The results of one episode were recorded and are shown in the simulation section.

Algorithm 1 The improved DQN for multi-UAV formation resource optimization with the CBAM and VDN (training stage)
Initialize the replay memory buffer to capacity D.
Initialize the Q-network of the UAVs with random parameters θ and a target network of the UAVs with parameters θ̄ = θ.
Activate the environmental simulator, generating the locations of all the UAVs.
Initialize the action space A, including the power and frequency each UAV can choose.
for step = 1, M do
    Update the positions of the UAVs, the channel information of the U2B links H_UB and the U2U links H_uu, the interference from the U2U links I_uu, the uplink reuse indicator of the U2U links B, the time left to transmit data E_t and the data left to transmit.
    for episode = 1, T do
        The U2U link selects a random action with probability ε; otherwise, select a_t = argmax_{a∈A} Q(s_t, a; θ) with probability 1 − ε.
        Implement a_t and move on to the next state s_{t+1} with a reward r_t.
        Store the transition (s_t, a_t, r_t, s_{t+1}) in the memory buffer.
        Sample a mini-batch of transitions (s_t, a_t, r_t, s_{t+1}) randomly from the memory buffer.
        Preprocess the collected state data, input the processed data into the first convolution layer of the Q-network and output the data M_cnn.
        1. Input the data M_cnn into the CBAM module to obtain the data M_CBAM. The Q-values are obtained after the subsequent convolution and ReLU operations are carried out on the output of the last layer.
        2. Perform the VDN operation to obtain the summation Q-value.
        Set y_t = r_t if the episode ends at step t + 1; otherwise y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ̄).
        Perform a gradient descent step on the loss (y^{tot}_t − Q(τ_t, u_t, s_t; θ))^2 to improve the coefficients θ.
        Every K steps, let θ̄ = θ.
    end for
end for
Save the coefficients of the target network.

Simulation
The selection of the simulation parameters is based on existing works [18] and 3GPP specifications [34]. Multiple UAVs were arranged in a cube formation in a three-dimensional 2 km × 2 km × 2 km area. The main parameters are shown in Table 1. In the training stage, the convolution kernel size of the input and output CNN layers was 3 with a padding of 1. The global average-pooling and max-pooling operators in the CBAM module were 6 × 6 and the channel reduction ratio r was 16. To verify the generalization of the proposed method, scenarios with different numbers of UAVs were tested. To prove the validity of the proposed method, first an experiment with only a traditional DQN was executed; then, the CBAM and VDN were embedded separately; and lastly, both the CBAM and VDN were embedded simultaneously. The experimental results were recorded, analyzed and compared to the random and multichannel access methods proposed in [17] and [27], respectively. For the random method, the uplinks and power were randomly allocated to the UAVs. For the multichannel access method, the power used by every UAV was fixed, and the accumulated discounted reward was maximized to realize the optimization of the channel allocation.
Figure 2 shows how the average U2B link data transfer rate varied with the number of UAVs. More interference was generated as the number of UAVs increased, which caused the decrease in the average data transfer rate. The average data transfer rate obtained using the random and multichannel access methods was always lower than that obtained by the other methods, especially the DQN with the CBAM or with both the VDN and CBAM. From this, it can be inferred that the improved DQN with the CBAM, or with both the VDN and CBAM, will maintain a high communication rate as more UAVs are added to the multi-UAV formation. Figure 3 shows how the successful data transfer probability of the U2U links varied with the number of UAVs.
More interference was generated as the number of UAVs increased, which caused the decrease in the successful data transfer probability. The successful data transfer probability obtained using the random method was always the lowest. When there were only 8 UAVs, the successful data transfer probability using the multichannel access method was not low, but it went down quite fast as the number of UAVs increased. From this figure, it can be inferred that the improved DQN with the VDN, or with both the VDN and CBAM, will maintain a high successful data transfer probability as more UAVs are added to the multi-UAV formation.
In order to figure out why the proposed improved DQN mechanisms showed superior performance, the power selections of the different methods with different numbers of UAVs over time were recorded. As shown in Figures 4-6, the probability of power selection with 8, 12 and 16 UAVs was drawn for the three situations, respectively. By observing these three figures, it can be inferred that because the UAVs can wisely select different powers, less interference is generated in the system. In addition, when the number of UAVs in the formation varies, the UAVs automatically adjust the number of lowest- or highest-power users.
Finally, Figures 7 and 8 show how the reward and loss varied over the training steps, respectively, to show the convergence of the proposed DQN and the improved DQN. The axis label "training step" represents the step number times 100 or 1000, with a maximum of 40,000 training steps. Over 40,000 training steps, the reward and loss gradually converged, despite some fluctuations. From Figure 7, the reward was bigger in the improved DQN than in the DQN method. In Figure 8, the lowest points of the loss are marked. It is obvious that the DQN with the VDN or CBAM always reached the lowest point earlier than the DQN alone, and the method with both the VDN and CBAM was the earliest. The two figures further illustrate the effectiveness and convergence of the improved method.

Conclusions
• This manuscript focuses on the realization of joint power and spectrum resource optimization using a DQN mechanism for a multi-UAV formation communication network, in which the UAVs are located in a 3D formation while on a mission. The objective was to maximize the transmission rate and increase the successful data transfer probability simultaneously.
• The main idea was that the U2U links were treated as agents, and the CBAM and VDN were further introduced on the basis of the DQN. Compared to the random and multichannel access methods, the DQN method slightly improved the system performance, and the introduction of the CBAM and VDN further improved the data transfer rate of the U2B links and the successful data transfer probability of the U2U links.
• The reason for the performance improvement is that the UAVs could intelligently choose the appropriate power and frequency based on the remaining time and the number of UAVs. The authors drew the conclusion that the superiority of the proposed method became more and more obvious as the number of UAVs in the formation increased.
• This model can be used in agricultural or military applications, such as disaster relief, environmental detection, remote situational awareness, deception and jamming. Even if a few UAVs break down, the overall capability of the formation will not be affected. However, the speed is low and the mobility and self-defense ability are poor. If the number of UAVs is small, they are easy to detect and intercept, and the different flight altitudes will affect the communication quality between the UAVs and the ground base station. In this work, the formation of the UAVs remained the same during the execution of tasks, and future work could focus on situations with changing formations.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: