Multi-UAV Redeployment Optimization Based on Multi-Agent Deep Reinforcement Learning Oriented to Swarm Performance Restoration

Distributed artificial intelligence is increasingly being applied to multiple unmanned aerial vehicles (multi-UAVs). This poses challenges to the distributed reconfiguration (DR) required for the optimal redeployment of multi-UAVs in the event of vehicle destruction. This paper presents a multi-agent deep reinforcement learning-based DR strategy (DRS) that optimizes the multi-UAV group redeployment in terms of swarm performance. To generate a two-layer DRS between multiple groups and a single group, a multi-agent deep reinforcement learning framework is developed in which a QMIX network determines the swarm redeployment, and each deep Q-network determines the single-group redeployment. The proposed method is simulated using Python and a case study demonstrates its effectiveness as a high-quality DRS for large-scale scenarios.


Introduction
Recently, mission planning associated with unmanned aerial vehicles (UAVs) has received considerable attention [1,2], and distributed artificial intelligence (AI) technologies have been extensively applied in multiple-UAV (multi-UAV) mission planning, enabling efficient decision-making and yielding high-quality solutions [3,4]. For missions in geographically decentralized environments, the focus is on deploying UAVs to their destinations and repositioning them to adapt to changing circumstances [5]. To minimize the costs of positioning UAVs, Masroor et al. [6] proposed a branch-and-bound algorithm that determines the optimal UAV deployment solution in emergency situations. Savkin et al. [7] employed a range-based reactive algorithm for autonomous UAV deployment. Nevertheless, many existing distributed algorithms lack the security necessary to achieve the global objective.
For the UAVs in a swarm, the placement of individual UAVs is important, but the completion of the swarm mission is the ultimate goal. Wang et al. [8] proposed a K-means clustering-based UAV deployment scheme that significantly improves the spectrum efficiency and energy efficiency of cellular uplinks at limited cost, while Yu et al. [9] introduced an evolutionary game-based adaptive dynamic reconfiguration mechanism that provides decision support for the cooperative mode design of unmanned swarm operations. These algorithms take static multi-swarm problems into account. However, some of the UAVs may suffer destruction or break down during a mission [10]. To deal with situations in which the swarm suffers unexpected destruction, adaptive swarm reconfiguration strategies are required [11].
Learning-based methods are gaining increasing attention for their flexibility and efficiency [12,13]. Deep reinforcement learning (DRL) has shown promising results in resolving the task assignment problems associated with multi-UAV swarms [14]. Samir et al. [15] combined DRL with joint optimization to achieve improved learning efficiency, although changes to the dynamic environment can hinder the implementation of this strategy. Zhang et al. [16] investigated a double deep Q-network (DQN) framework for long-period UAV swarm collaborative tasks and designed a guided reward function to solve the convergence problem caused by the sparse returns of long-period tasks. Huda et al. [17] investigated a surveillance application scenario using a hierarchical UAV swarm. In this case, they used a DQN to minimize the weighted sum cost. As a result, their DRL method exhibited better convergence and effectiveness than traditional methods. Zhang et al. [18] designed a DRL-based algorithm to find the optimal attack sequence for a large-scale UAV swarm so that the purpose of destroying the target communication system can be achieved. Mou et al. [19] built a geometric method to project the 3D terrain surface onto many weighted 2D patches and proposed a swarm DQN reinforcement learning algorithm to select patches for leader UAVs, which could cover the object area with little redundancy. Liu et al. [20] focused on a latency minimization problem for both communication and computation in a maritime UAV swarm mobile edge computing network, then proposed a DQN and a deep deterministic policy gradient algorithm to optimize the trajectory of multi-UAVs and the configuration of virtual machines. However, multi-agent DRL (MADRL) captures real-world situations more easily than DRL [21,22]. Hence, MADRL is considered an important topic of research. Xia et al.
[22] proposed an end-to-end cooperative multi-agent reinforcement learning scheme that enables the UAV swarm to make decisions on the basis of the past and current states of the target. Lv et al. [23] proposed a MADRL-based UAV swarm communication scheme to optimize the relay selection and power allocation, then designed a DRL-based scheme to improve the anti-jamming performance. Xiang et al. [24] established an intelligent UAV swarm model based on a multi-agent deep deterministic policy gradient algorithm, significantly improving the success rate of the UAV swarm in confrontations.
In summary, developments in distributed AI mean that swarm intelligence is now of vital strategic importance. Against this background, it is vital to develop multi-agent algorithms. However, few reconfiguration studies have investigated this distributed multi-agent scenario. Therefore, this paper proposes a MADRL-based distributed reconfiguration strategy (DRS) for the problem of UAV swarm reconfiguration after large-scale destruction. The main contributions of this paper are as follows: (1) UAV swarm reconfiguration is formulated so as to generate a swarm DRS considering detection missions and destruction. The finite number of UAVs forms the constraint, and the coverage area forms the objective.
(2) MADRL-based swarm reconfiguration employs multi-agent deep learning and the QMIX network. Each agent, representing a group, uses reinforcement learning to select the optimal distributed reconfiguration (DR) actions. The QMIX network is used to synthesize the actions of each agent and output the final strategy.
(3) Once the network is well trained, the algorithm can effectively utilize various UAV swarm information to support DR decision-making. This enables efficient and steady multi-group swarm DR to achieve the mission objective.
The remainder of this paper is organized as follows. Section 2 presents the swarm mission framework. Section 3 elucidates the DRS, before Section 4 introduces a UAV swarm reconfiguration case study of detection missions. Finally, Section 5 presents the concluding remarks.

Mission
A detection mission containing M irregular detection areas is considered.As shown in Figure 1a, the detection areas (colored yellow) are divided into hexagons, which are inscribed hexagons of the mission areas (colored green).

Problem Formulation
The swarm detection mission area set can be expressed as follows:

MA = {MA 1 , MA 2 , . . . , MA m , . . . , MA M }

where each group mission area MA m , m ∈ {1, 2, . . . , M}, is covered by a certain number of hexagons, as follows:

MA m = {ma m1 , ma m2 , . . . , ma mn , . . . , ma mN m }

where N m is the total number of hexagons in group mission area MA m , and each hexagon represents a single UAV mission area ma mn , m ∈ {1, 2, . . . , M}, n ∈ {1, 2, . . . , N m }.

A UAV swarm, the size of which is determined by the detection area, is dispatched for the detection mission. Each area requires a group to execute the detection mission, and the number of UAVs in the group depends on the number of hexagons in the mission area. Furthermore, each group is formed of one leader UAV and several follower UAVs. To execute a detection mission, as shown in Figure 1b, the radius R of the UAV detection area is determined by the detection equipment installed on the UAVs.

The UAV swarm can then be expressed as follows:

G = {G 1 , G 2 , . . . , G m , . . . , G M }

where each group G m , m ∈ {1, 2, . . . , M}, performs detection in the group mission area MA m , as follows:

G m = {U m1 , U m2 , . . . , U mn , . . . , U mN m }

where U mn is the n-th UAV in group m and performs detection in UAV mission area ma mn , m ∈ {1, 2, . . . , M}, n ∈ {1, 2, . . . , N m }. The first UAV in each group is the leader of that group.
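As a concrete illustration, the nested swarm/group/UAV structure above can be sketched in Python with simple containers; the class and field names are illustrative and not taken from the paper's implementation:

```python
from dataclasses import dataclass, field

# Hypothetical containers mirroring the notation above:
# G_m = {U_m1, ..., U_mN_m} and MA_m = {ma_m1, ..., ma_mN_m}.

@dataclass
class UAV:
    group: int            # m: group index
    index: int            # n: position within the group
    alive: bool = True    # normal working vs. complete failure
    is_leader: bool = False

@dataclass
class Group:
    mission_hexagons: list                    # ma_m1 ... ma_mN_m
    uavs: list = field(default_factory=list)  # U_m1 ... U_mN_m

def build_swarm(hexagons_per_area):
    """One group per mission area, one UAV per hexagon; the first UAV leads."""
    swarm = []
    for m, hexes in enumerate(hexagons_per_area):
        group = Group(mission_hexagons=list(hexes))
        for n in range(len(group.mission_hexagons)):
            group.uavs.append(UAV(group=m, index=n, is_leader=(n == 0)))
        swarm.append(group)
    return swarm

# 7 groups of 6 hexagons each, matching the scale of the case study in Section 4.
swarm = build_swarm([range(6)] * 7)
```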

Destruction
The UAV swarm may be subject to local and random destruction, and some UAVs may be destroyed. The effects of this destruction are used as inputs. Each UAV has two states: normal working and complete failure. When a UAV suffers destruction, it enters the failure state. When a leader UAV is destroyed, a follower UAV in the same group assumes the role of leader of that particular group.
The scope of local destruction is represented by a circle with center coordinates (i d , j d ) and radius r d , as illustrated in Figure 2a. The values of (i d , j d ) and r d are randomly generated.
Random destruction is characterized by a destruction scale, denoted as S rand , which is also generated randomly. When random destruction occurs, S rand random UAVs transition from the normal state to the faulty state, as depicted in Figure 2b.
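A minimal simulation of both destruction modes might look as follows; the map bounds, maximum radius, and maximum destruction scale are assumed parameters, not values from the paper:

```python
import math
import random

def apply_destruction(positions, area=(100.0, 100.0), r_max=12.0, s_max=3,
                      seed=None):
    """Return the indices of destroyed UAVs under one local destruction
    circle plus a random destruction of scale S_rand.

    `positions` maps a UAV index to its (x, y) location; `area`, `r_max`,
    and `s_max` are illustrative bounds.
    """
    rng = random.Random(seed)
    # Local destruction: circle with random center (i_d, j_d) and radius r_d.
    i_d, j_d = rng.uniform(0, area[0]), rng.uniform(0, area[1])
    r_d = rng.uniform(0, r_max)
    destroyed = {k for k, (x, y) in positions.items()
                 if math.hypot(x - i_d, y - j_d) <= r_d}
    # Random destruction: S_rand surviving UAVs enter the faulty state.
    survivors = [k for k in positions if k not in destroyed]
    s_rand = rng.randint(0, min(s_max, len(survivors)))
    destroyed.update(rng.sample(survivors, s_rand))
    return destroyed
```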


Reconfiguration
UAV swarm reconfiguration is an autonomous behavior that adapts to changes in the environment to enable the execution of the task. When the swarm is affected by dynamic changes during task execution, the system can use a DRS to achieve global mission performance recovery and reconfiguration, thus ensuring mission continuity.
When destruction occurs, the state of the UAV swarm is input into the reconfiguration algorithm. The resulting strategy is communicated back to each UAV group. In-group reconfiguration and inter-group reconfiguration are applied to certain UAVs, as shown in Figure 3a. After the reconfiguration is completed, all mission areas should be covered by the detection range of the UAVs, as shown in Figure 3b.



Objective, Constraints, and Variables
Over a finite time τ thr , swarm reconfiguration aims to maximize the total coverage area (TCA) ε tot , which is the mission area detected by the UAVs. This can be expressed as follows:

max ε tot (τ) = ∑ m ∑ n ε mn , m ∈ {1, 2, . . . , M}, n ∈ {1, 2, . . . , N m }

where ε tot (τ) is the TCA at the current time τ, and ε mn is the detected area of mission area ma mn ; if ma mn is not covered, ε mn = 0. The problem should be solved at the swarm level. Considering the number of remaining UAVs, the number of UAVs to be repositioned should be less than the number of normal-working UAVs. Furthermore, the minimum area detected by the UAVs in each mission area must be set. Therefore, the reconfiguration problem is subject to the following constraints:

ε m ≥ ε m min , N m move ≤ N m normal , d ≥ d min , ∀m ∈ {1, 2, . . . , M}

where ε m is the coverage area of group G m , ε m min is the specified minimum coverage area for group G m , N m move is the number of UAVs in G m that can be repositioned, N m normal is the number of normal-working UAVs in G m , d is the distance between two normal-working UAVs, and d min is the minimum allowable distance, i.e., the safety distance between UAVs. This problem considers the UAVs within the communication range. If a UAV exceeds the communication distance, it enters the faulty state due to communication failure.
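The constraint families above can be checked with a small feasibility predicate; the function and argument names simply mirror the symbols in the text:

```python
def reconfiguration_feasible(coverage, coverage_min, n_move, n_normal,
                             pairwise_distances, d_min):
    """Check the three constraint families of the reconfiguration problem;
    argument names mirror the symbols in the text."""
    # Per-group minimum coverage: epsilon_m >= epsilon_m_min.
    if any(e < e_min for e, e_min in zip(coverage, coverage_min)):
        return False
    # Repositioning budget: N_m_move <= N_m_normal.
    if any(mv > nm for mv, nm in zip(n_move, n_normal)):
        return False
    # Safety separation: d >= d_min for every pair of normal-working UAVs.
    if any(d < d_min for d in pairwise_distances):
        return False
    return True
```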
The initial deployment status depends on whether there is a normal-working UAV at a certain hexagon for each group mission area MA m . Then, the UAV swarm deployment status can be represented by an I × J matrix S. The matrix element s ij = 1 if there is a normal-working UAV U mn in hexagon H ij and s ij = 0 if not. Therefore, the deployment status information of the UAV swarm can be expressed as follows:

S = [s ij ] I×J , s ij ∈ {0, 1}
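A sketch of the status-matrix construction, assuming hexagons are indexed by (i, j) grid cells:

```python
import numpy as np

def deployment_status(occupied_cells, shape):
    """Build the I x J status matrix S, with s_ij = 1 iff a normal-working
    UAV occupies hexagon H_ij. `occupied_cells` lists the (i, j) cells of
    normal-working UAVs."""
    S = np.zeros(shape, dtype=np.int8)
    for i, j in occupied_cells:
        S[i, j] = 1
    return S

S = deployment_status([(0, 0), (2, 3)], shape=(4, 5))
```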


MADRL-Based DR Method
An MADRL framework was developed to solve the DR problem described in the previous section, as shown in Figure 4. The framework consists of three parts: a reconfiguration decision-making process, agent decision-making, and a neural network. The three parts of the framework are described in this section, and the reconfiguration decision-making process is illustrated in Figure 4A.

Reconfiguration Decision Process
The group agents choose the DRS for the UAV groups. The UAV swarm's status matrix S t is used by the group agents as the main input. This process can be expressed as follows: where mov m t represents the movement feature selected by agent m at time step t. A swarm agent uses a QMIX network to combine the outputs of all group agents and choose the most efficient one. This can be expressed as follows: where {mov 1 t , . . . , mov M t } represents the movement set of all group agents at time step t, M t−1 represents the last swarm movement feature set, which consists of the swarm's history of movement features {mov t−1 , mov t−2 , mov t−3 }, and the output mov t |[S t , M t ] represents the final chosen movement feature for the swarm.
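The two-layer decision step can be outlined as follows; `agents` and `qmix` stand in for trained networks and are hypothetical callables:

```python
def swarm_decision(agents, qmix, S_t, history):
    """One step of the two-layer scheme: each group agent proposes a movement
    feature mov_t^m from the status matrix S_t, and the swarm-level mixer
    scores the proposals against the last three swarm movement features
    {mov_{t-1}, mov_{t-2}, mov_{t-3}} to pick the swarm movement mov_t."""
    proposals = [agent(S_t) for agent in agents]   # one mov_t^m per group
    mov_t = qmix(proposals, S_t, history[-3:])     # M_{t-1} history window
    history.append(mov_t)
    return mov_t
```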
The DR process consists of mission and destruction features, DR action generation, and renewal features. These three components are described in the following subsections.

Mission and Destruction Features
The destruction is randomly initialized at time t d , and the status matrix S is then generated. The coverage area at this time is ε tot (t d ). To reconfigure the swarm and reach the maximum coverage rate, M agents, representing M different UAV groups, execute a sequence of DR actions. The DR action set is described as follows: where act t|mn is the DR action of UAV mn at time step t. This DR action is defined as act t|mn = (cen H ij , cen H i′j′ ), which means that UAV mn in hexagon H ij moves to the target hexagon H i′j′ . The parameter cen H ij represents the center location of hexagon H ij , and the action act t|mn is generated according to the movement feature mov t . The DR action set of group m can be described as follows: After the DR action has finished, agent m uses a search algorithm to select the next DR action act t|mn for UAV mn in group G m , or chooses to finish the reconfiguration process. This process is repeated in each time step t. The neural network of agent m (see Section 3.2) can be described as follows: where Q m (S t , mov m t ) is the value of the movement feature mov m t at time step t. Each time step corresponds to a realistic period of time, the length of which is proportional to the distance the UAV moves in this time step.
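Computing hexagon centers for DR actions might be done as below; the axial pointy-top layout is an assumption, since the text only states that an action moves a UAV between hexagon centers:

```python
import math

def hex_center(i, j, R):
    """Center of hexagon H_ij in an axial pointy-top layout with
    circumradius R; odd rows are offset by half a cell."""
    x = math.sqrt(3) * R * (j + 0.5 * (i % 2))
    y = 1.5 * R * i
    return (x, y)

def dr_action(src, dst, R):
    """act_{t|mn} = (cen(H_ij), cen(H_i'j')): move a UAV from the center of
    hexagon src = (i, j) to the center of hexagon dst = (i', j')."""
    return (hex_center(*src, R), hex_center(*dst, R))
```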

Reconfiguration Action Generation
For the DR action act t|mn , once complete, the moving UAV is considered to perform the detection mission at the new location, and the status matrix S t can be updated. The TCA ε tot (t) can be calculated according to (5) after the movement. The objective of agent m is to achieve the maximum coverage area as efficiently as possible. Thus, the reward should include both the coverage area and the reconfiguration time. All agents use the same reward function, and the reward at time step t is defined as follows: where R t is the reward at time step t, τ t+ζ is the reconfiguration time of time step (t + ζ), τ t+ζ−1 is the reconfiguration time of time step (t + ζ − 1), ε 0 is the initial TCA, δ is the discount factor, and τ T is the time to finish reconfiguration (TTFR).
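A hedged sketch of such a reward, trading coverage gain against elapsed time; the exact functional form and the weights alpha/beta are assumptions, not the paper's equation:

```python
def step_reward(eps_now, eps_prev, eps_0, dt, alpha=1.0, beta=0.1):
    """Illustrative reward: normalized coverage gain minus a time penalty.
    alpha and beta are assumed weights; the paper's exact expression in
    terms of tau_{t+zeta} and delta is not reproduced here."""
    coverage_gain = (eps_now - eps_prev) / eps_0   # normalized by initial TCA
    return alpha * coverage_gain - beta * dt       # penalize elapsed time
```

With this shape, an agent that gains no coverage while spending time receives a negative reward, which pushes the learned policy toward short reconfiguration routes.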
For agent m, an optimization algorithm is used to select the best movement feature of UAV group m.For each time step t, the DQN of agent m outputs a movement value quantity Q m (S t , mov m t ), then this agent outputs a movement feature mov m t .A QMIX network is used to select the most effective action from all possible actions.
The mixing network has two parts: a parameter generating network and an inference network.The former receives the global state S t and generates the neuron weights and deviations.The latter receives the control quantity Q m (S t , mov m t ) from each agent and generates the global utility value Q tot with the help of the neuron weights and deviations.
The movement utility value Q tot is used to formulate the final decision for the whole swarm (see Section 3.3), as expressed in (15).

Renewal Features
Once the swarm has finished act t|mn , the state matrix and feature set [S t , A t ] are used as the new input to the algorithm. The algorithm continues to run and outputs new movement actions or takes the decision to end the reconfiguration process.

Deep Q-Learning for Reconfiguration
The agents use the deep Q-learning algorithm to evaluate the movement action, in which the action-value function is represented by a deep neural network parameterized by ϑ. The movement feature mov m t has a movement value function Q m (S t , mov m t ) = E[R t ], where R t = ∑ ∞ i=0 δ i r t+i is the discounted return and δ is the discount factor.
The transition tuple of each movement action of group agent m is stored as [S, mov m , R, S′], where S is the state before the movement, mov m is the selected movement feature, R is the reward for this movement, and S′ is the state after the movement has finished. ϑ is learned by sampling batches of b transitions and minimizing the squared temporal-difference error: where γ DQN = R + δ max mov′ Q m (S′, mov m ′; ϑ − ), ϑ − represents the parameters of the target network, which are periodically copied from ϑ and held constant for several iterations, b is the batch size of transitions sampled from the replay buffer, and Q m (S, mov m ; ϑ) is the utility value of mov m .
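The target and loss can be sketched in NumPy as follows; the terminal-state masking is a standard detail assumed here rather than stated in the text:

```python
import numpy as np

def td_targets(rewards, q_next, delta, done):
    """gamma_DQN = R + delta * max_mov' Q(S', mov'; theta_minus), with the
    bootstrap term dropped on terminal transitions (an assumed, standard
    detail). q_next has shape (batch, n_movements) and comes from the
    frozen target network."""
    return rewards + delta * q_next.max(axis=1) * (1.0 - done)

def td_loss(q_taken, targets):
    """Mean squared temporal-difference error over a batch of b transitions."""
    return float(np.mean((targets - q_taken) ** 2))
```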

QMIX for Multi-Agent Strategy
The QMIX network is applied to generate the swarm-level DR action. The network represents Q tot as a monotone function for mixing the individual value functions Q m (S t , mov m t ) of each agent. This can be expressed as follows: where Q m (S, mov m )| m=1,...,M is the movement value set, and Q tot (S, mov) is the joint movement value of the swarm. The monotonicity of (15) can be enforced by the partial derivative relation ∂Q tot /∂Q m ≥ 0, ∀m ∈ {1, 2, . . . , M}. To ensure this relationship, QMIX consists of agent networks, a mixing network, and a set of hypernetworks, as shown in Figure 4C.
For each agent m, there is one agent network representing the individual value function Q m (S, mov m ). The agent networks are represented as deep recurrent Q-networks (DRQNs). At each time step, the DRQNs receive the status S t and last movement mov t as input and output a value function Q m (S, mov m ) to the mixing network.
The mixing network is a feedforward neural network that monotonically mixes all Q m (S, mov m ) with nonnegative weights. The weights of the mixing network are generated by separate hypernetworks, each of which generates the weights of one layer using the status S t . The biases of the mixing network are produced in the same manner but are not necessarily nonnegative. The final bias is produced by a two-layer hypernetwork.
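A self-contained sketch of the monotonic mixing step, with fixed random projections standing in for the trained hypernetworks (an assumption made to keep the example runnable):

```python
import numpy as np

def hypernet(state, out_shape, seed):
    """Stand-in hypernetwork: maps the global state to one layer's
    parameters. In real QMIX this is a trained MLP; a deterministic random
    projection keeps the sketch self-contained."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((state.size, int(np.prod(out_shape))))
    return (state @ w).reshape(out_shape)

def qmix_mix(agent_qs, state, hidden=8):
    """Monotonically mix per-agent values Q_m into Q_tot: the hypernetwork
    weights pass through abs() so that dQ_tot/dQ_m >= 0 for every agent,
    while the biases stay unconstrained."""
    M = agent_qs.size
    W1 = np.abs(hypernet(state, (M, hidden), seed=1))   # nonnegative layer 1
    b1 = hypernet(state, (hidden,), seed=2)
    W2 = np.abs(hypernet(state, (hidden, 1), seed=3))   # nonnegative layer 2
    b2 = hypernet(state, (1,), seed=4)
    h = np.maximum(agent_qs @ W1 + b1, 0.0)             # monotone nonlinearity
    return float(h @ W2 + b2)
```

Because both weight layers are nonnegative and the nonlinearity is monotone, raising any single agent's value can never lower Q tot, which is the property the partial derivative condition above encodes.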
The whole QMIX network is trained end-to-end to minimize the following loss: where γ DQN = R + δ max mov′ Q tot (S′, mov′; ϑ − ), and Q tot (S, mov; ϑ) is the global utility value of mov.

Case Study
A case study of UAV swarm reconfiguration was simulated using Python. The numerical simulation is described from the perspective of optimal UAV swarm reconfiguration. The effectiveness of the proposed DR decision-making method is validated using the reconfiguration results under different scenarios. In this section, a fixed-wing UAV swarm is considered, although the proposed method is also applicable to other types of UAV swarms.

Mission
A detection mission containing seven irregular detection areas is randomly generated, as shown in Figure 5. The yellow areas represent the detection areas, and the map is divided into hexagons. The hexagons of the mission areas (colored green) need to cover the detection areas. A UAV swarm with 7 × 6 UAVs is simulated to execute this detection mission; the initial location information of each UAV is presented in Table 1.
From the UAV swarm deployment in Figure 5, the initial detection mission state is shown in Figure 6a, where each light-gray circle represents the detection area of one UAV. In this case, all UAVs in the swarm are assumed to have the same detection radius of 3√3 km, and the initial TCA is ε tot = 770.59 km². Furthermore, the safety distance is assumed to be 0.2 km. The detection radius and safety distance can also be assigned based on the actual regions.



Destruction
The destruction states were randomly generated. Two kinds of destruction, namely local and random destruction, were considered simultaneously. For local destruction, the destruction center is a randomly sampled point on the mission area, and the destruction area is a randomly generated irregular polygon. For random destruction, the number of destroyed UAVs is assumed to follow the Poisson distribution with λ = 1.
For the mission and swarm deployment case in Figure 6a, the generated destruction states are illustrated in Figure 6b, and consist of two local destruction areas and a random destruction with three UAVs. The destruction centers of these two local destruction areas are (18, 17.32) and (5, 83.13), and the radii are 11 and 4, respectively. The destroyed UAVs in Figure 6b can be described as {U 1,1 , U 1,5 , U 3,2 , U 4,3 , U 5,4 , U 6,1 , U 6,2 , U 6,3 , U 6,4 , U 6,5 , U 6,6 }, including both local destruction and random destruction. After this destruction process, the current total coverage area is ε tot = 615.02 km². All of the destruction information is presented in Table 2. For the reconfiguration process, the initial time step and the final time step are shown in Figure 6c,d, respectively. The UAV colored yellow represents the initial location of this reconfiguration process, while the UAV colored blue represents the final location. The red arrow represents the reconfiguration route from the initial location to the final location, which can be generated by the movement feature set M according to (9). Each reconfiguration action is generated by an agent of the proposed multi-agent framework according to (9). For the case in Figure 6c,d, the DR action set Φ is listed in Table 3. After this reconfiguration process, the UAV swarm has finished its redeployment. The current detection state is shown in Figure 6e. All UAVs in the swarm are assumed to have the same speed of 50 km/h. The speed can also be assigned based on the actual regions. The TCA is considered as a metric of UAV swarm performance. During this reconfiguration process, the UAV swarm performance exhibits a fluctuating upward trend, as shown in Figure 6f. The black dashed line in Figure 6f represents the TCA threshold ε thr , which is assumed to be 714 km². This TCA threshold can also be assigned on the basis of actual conditions. After finishing the reconfiguration process, the final TCA is ε tot (τ T ) = 732.31 km².

Discussion
In addressing the UAV swarm reconfiguration, the main objective is to generate an optimal feasible strategy. Extended analyses are now presented covering the method performance and the influence of various factors.

Different Algorithms
This section evaluates the performance of the proposed QMIX method against the DQN method and a cooperative game (CG) method [25], which have also been used to generate this UAV swarm DRS. We used a single machine with one Intel i9 7980XE CPU and four RTX2080 TI-A11G GPUs to train the QMIX network and the DQN network. During the training process, each episode generated a DRS for the randomly generated mission and destruction, as described in Section 4.1. This section presents the results of the following assessment process: for each training procedure, the training is paused every 100 episodes and the method runs 10 independent episodes with greedy action selection. Figure 7 plots the mean reward across these 10 runs for each method with independent mission and destruction details. As the 10 independent episodes are fixed, the mean reward of the CG method is a constant value; thus, its reward curve is a straight line. The shading around each reward curve represents the standard deviation across the 10 runs. Over the training process, 100,000 episodes were executed for each method. The reward curves of the two learning-based methods fluctuate upward. In the first 17,000 episodes, the DQN method exhibits faster growth than the QMIX method. However, QMIX achieves a higher upper bound of the reward curve after 20,000 episodes. QMIX is noticeably stronger in terms of the final DR decision-making performance. The superior representational capacity of QMIX combined with the state information provides a clear benefit over the DQN method.



Different Destruction Cases
For a given mission and swarm scale, the destruction process was randomly generated. The redeployment results obtained by executing the QMIX reconfiguration strategy are shown in Figure 8. The three subgraphs show the initial deployment status, the destruction status, and the redeployment results of the proposed QMIX algorithm.
The geographical distributions of all mission areas and the swarm of 5 × 6 UAVs are the same in the three subgraphs, while the destruction states differ. After the reconfiguration process, the redeployment results in the three subgraphs demonstrate that the proposed QMIX method performs stably on this reconfiguration decision-making problem under different destruction patterns. This is because, during the training process, UAV destruction is generated randomly. In addressing the UAV swarm redeployment, the main objective was to obtain an optimal feasible DRS. Extended analyses were conducted to compare the different methods: in addition to the proposed QMIX method, the DQN method and the CG method were used to solve the three destruction cases in Figure 8. As shown in Figure 9, the QMIX method gives optimal solutions with better TCA ε_tot(τ_T) and less TTFR τ_T than the other methods. The proposed method achieves the better solution because the other two methods may converge to local optima, such as situations in which multiple UAVs must spend more time moving during the reconfiguration process. The efficiencies of the methods are analyzed in Table 4. According to these results, the solution speeds of QMIX and DQN are close, while the solution speed of the CG method is significantly slower than the other two.
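The comparison criterion used above, higher TCA first and lower TTFR as the tie-breaker, can be sketched as a simple selection rule. The method names are from the paper, but the metric values below are purely illustrative, not the measured results of Figure 9.

```python
# Illustrative candidate strategies: each records its total coverage ability
# (TCA, higher is better) and its reconfiguration time (TTFR, lower is better).
# The numeric values are made up for the sketch.
candidates = {
    "QMIX": {"tca": 0.95, "ttfr": 12.0},
    "DQN":  {"tca": 0.91, "ttfr": 15.5},
    "CG":   {"tca": 0.88, "ttfr": 18.0},
}

def better(a, b):
    """Prefer the strategy with higher TCA; break ties by lower TTFR."""
    if a["tca"] != b["tca"]:
        return a["tca"] > b["tca"]
    return a["ttfr"] < b["ttfr"]

# Rank by (TCA descending, TTFR ascending) and pick the best strategy.
best = max(candidates, key=lambda k: (candidates[k]["tca"], -candidates[k]["ttfr"]))
```

With these illustrative numbers the rule selects QMIX, mirroring the qualitative ordering reported in Figure 9 and Table 4.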

Different Swarm Scales
Under different missions and swarm scales, the redeployment results obtained by executing the QMIX reconfiguration strategy are shown in Figure 10. The three subgraphs show the different deployment missions, the destruction status, and the redeployment results of the proposed method. The geographical distributions of all mission areas were randomly generated in the three subgraphs, and the initial swarm scales were 5 × 6, 7 × 6, and 9 × 6. The destruction states were then randomly generated. After the reconfiguration process, the redeployment results in the three subgraphs demonstrate that the proposed QMIX method performs stably under the different missions and swarm scales. During the training process, the missions and swarm scales were generated randomly; thus, the superior representational capacity of QMIX, combined with the mission-state and swarm-state information, provides a clear benefit in terms of reconfiguration decision-making performance.
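The representational property of QMIX that the discussion relies on is monotonic value mixing: per-group Q-values are combined into a joint value Q_tot with state-conditioned, non-negative weights, so raising any group's Q-value never lowers Q_tot. The following is a minimal sketch of that property only, not the paper's network; the one-layer mixer and the state embedding are illustrative stand-ins for the hypernetwork-generated mixing layers of QMIX.

```python
import numpy as np

def mix(q_values, state_embedding):
    """QMIX-style monotonic mixing sketch: combine per-group Q-values into a
    joint Q_tot using non-negative, state-dependent weights. Taking the
    absolute value of the hypernetwork output is what QMIX does to enforce
    monotonicity; the embedding here is a random stand-in for a real one."""
    w = np.abs(state_embedding)   # non-negative mixing weights, shape (n_groups,)
    b = state_embedding.sum()     # state-dependent bias (may be any sign)
    return float(w @ q_values + b)

rng = np.random.default_rng(0)
state = rng.normal(size=3)        # illustrative global-state embedding

q_low  = np.array([1.0, 2.0, 3.0])
q_high = np.array([1.5, 2.0, 3.0])  # raise one group's Q-value

# Monotonicity: improving any single group's Q-value cannot decrease Q_tot,
# which is what lets each group act greedily while staying consistent with
# the joint value.
assert mix(q_high, state) >= mix(q_low, state)
```

This monotonicity constraint is what allows the decentralized per-group argmax to coincide with the argmax of the joint value, while the state-conditioned weights let the mixer exploit the mission-state and swarm-state information mentioned above.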
Again, keeping the same cases as in Figure 10 and using the QMIX method, the DQN method, and the CG method, we analyzed the differences in algorithm performance. Under different missions and swarm scales, the QMIX method again gives optimal solutions with better TCA ε_tot(τ_T) and less TTFR τ_T than the other methods, as shown in Figure 11. The efficiencies of the methods are analyzed in Table 5. According to these results, the solution speeds of QMIX and DQN are close for each case, while the solution speed of the CG method is significantly slower than the other two. Furthermore, the solution speeds of QMIX and DQN are stable and do not degrade exponentially as the swarm scale increases, whereas the solution speed of the CG method clearly decreases with increasing swarm scale. Thus, these results show that the proposed QMIX method exhibits stable DR decision-making performance for swarms of different scales.

Conclusions
Distributed AI is gradually being applied to multi-UAV systems. This paper has focused on DR decision-making for UAV swarm deployment optimization using a proposed MADRL framework. A two-layered decision-making framework based on MADRL enables UAV swarm redeployment that maximizes swarm performance. Simulations in Python have demonstrated that the proposed QMIX method can generate a globally optimal DRS for UAV swarm redeployment. Furthermore, the results of the case study show that the QMIX method achieves better swarm performance with less reconfiguration time than the other methods and exhibits a stable and efficient solution speed. The DR decision-making problem considered in this paper is one of redeployment decision-making; initial deployment planning was not addressed. Future research should emphasize the integration of UAV swarm initial deployment into the decision-making framework.

Figure 11. Reconfiguration under different swarm sizes.

Table 5. Running time (in seconds) of different methods under different swarm sizes.
Different Swarm Cases    QMIX    DQN     CG
Case (a) in Figure 10    20.542  19.942  44.845
Case (b) in Figure 10    27.031  26.531  70.275
Case (c) in Figure 10    32.816  32.116  120.389

Table 1. Location of each UAV.

Figure 5. Mission and UAV swarm deployment.


Table 4. Running time (in seconds) of different methods under different destruction cases.
