Article

Multi-UAV Redeployment Optimization Based on Multi-Agent Deep Reinforcement Learning Oriented to Swarm Performance Restoration

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
2 Defense Innovation Institute, Academy of Military Science, Beijing 100071, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2023, 23(23), 9484; https://doi.org/10.3390/s23239484
Submission received: 24 October 2023 / Revised: 18 November 2023 / Accepted: 27 November 2023 / Published: 28 November 2023
(This article belongs to the Special Issue Artificial-Intelligence-Enhanced Fault Diagnosis and PHM)

Abstract

Distributed artificial intelligence is increasingly being applied to multiple unmanned aerial vehicles (multi-UAVs). This poses challenges for the distributed reconfiguration (DR) required to optimally redeploy multi-UAVs in the event of vehicle destruction. This paper presents a multi-agent deep reinforcement learning-based DR strategy (DRS) that optimizes multi-UAV group redeployment in terms of swarm performance. To generate a two-layer DRS between multiple groups and a single group, a multi-agent deep reinforcement learning framework is developed in which a QMIX network determines the swarm redeployment and each deep Q-network determines the single-group redeployment. The proposed method is simulated using Python, and a case study demonstrates its effectiveness in producing a high-quality DRS for large-scale scenarios.

1. Introduction

Recently, mission planning associated with unmanned aerial vehicles (UAVs) has received considerable attention [1,2], and distributed artificial intelligence (AI) technologies have been extensively applied in multiple-UAV (multi-UAV) mission planning, enabling efficient decision-making and yielding high-quality solutions [3,4]. For missions in geographically decentralized environments, the focus is on deploying UAVs to their destinations and repositioning them to adapt to changing circumstances [5]. To minimize the costs of positioning UAVs, Masroor et al. [6] proposed a branch-and-bound algorithm that determines the optimal UAV deployment solution in emergency situations. Savkin et al. [7] employed a range-based reactive algorithm for autonomous UAV deployment. Nevertheless, many existing distributed algorithms lack the security necessary to achieve the global objective.
For the UAVs in a swarm, the placement of individual vehicles is important, but completion of the swarm mission is the ultimate goal. Wang et al. [8] proposed a K-means clustering-based UAV deployment scheme that significantly improves the spectrum efficiency and energy efficiency of cellular uplinks at limited cost, while Yu et al. [9] introduced an evolutionary game-based adaptive dynamic reconfiguration mechanism that provides decision support for the cooperative mode design of unmanned swarm operations. These algorithms consider static multi-swarm problems. However, some of the UAVs may suffer destruction or break down during a mission [10]. To deal with situations in which the swarm suffers unexpected destruction, adaptive swarm reconfiguration strategies are required [11].
Learning-based methods are gaining increasing attention for their flexibility and efficiency [12,13]. Deep reinforcement learning (DRL) has shown promising results in resolving the task assignment problems associated with multi-UAV swarms [14]. Samir et al. [15] combined DRL with joint optimization to achieve improved learning efficiency, although changes to the dynamic environment can hinder the implementation of this strategy. Zhang et al. [16] investigated a double deep Q-network (DQN) framework for long-period UAV swarm collaborative tasks and designed a guided reward function to solve the convergence problem caused by the sparse returns of such tasks. Huda et al. [17] investigated a surveillance application scenario using a hierarchical UAV swarm, employing a DQN to minimize the weighted sum cost; their DRL method exhibited better convergence and effectiveness than traditional methods. Zhang et al. [18] designed a DRL-based algorithm to find the optimal attack sequence against a large-scale UAV swarm, thereby achieving the goal of disrupting the target communication system. Mou et al. [19] developed a geometric approach that projects the 3D terrain surface onto many weighted 2D patches and proposed a swarm DQN reinforcement learning algorithm that selects patches for leader UAVs, covering the target area with little redundancy. Liu et al. [20] focused on a latency minimization problem for both communication and computation in a maritime UAV swarm mobile edge computing network, and proposed a DQN and a deep deterministic policy gradient algorithm to optimize the trajectories of multi-UAVs and the configuration of virtual machines. However, multi-agent DRL (MADRL) captures real-world multi-agent situations more naturally than single-agent DRL [21,22], making MADRL an important topic of research. Xia et al. [22] proposed an end-to-end cooperative multi-agent reinforcement learning scheme that enables a UAV swarm to make decisions on the basis of the past and current states of the target. Lv et al. [23] proposed a MADRL-based UAV swarm communication scheme to optimize relay selection and power allocation, and designed a DRL-based scheme to improve anti-jamming performance. Xiang et al. [24] established an intelligent UAV swarm model based on a multi-agent deep deterministic policy gradient algorithm, significantly improving the success rate of the UAV swarm in confrontations.
In summary, developments in distributed AI mean that swarm intelligence is now of vital strategic importance, making the development of multi-agent algorithms essential. However, few reconfiguration studies have investigated this distributed multi-agent scenario. Therefore, this paper proposes a MADRL-based distributed reconfiguration strategy (DRS) for the problem of UAV swarm reconfiguration after large-scale destruction. The main contributions of this paper are as follows:
(1) UAV swarm reconfiguration is formulated to generate a swarm DRS that accounts for detection missions and destruction, with the finite number of UAVs as the constraint and the coverage area as the objective.
(2) MADRL-based swarm reconfiguration employs multi-agent deep learning and the QMIX network. Each agent, representing a group, uses reinforcement learning to select the optimal distributed reconfiguration (DR) actions, and the QMIX network synthesizes the actions of all agents and outputs the final strategy.
(3) Once the network is well trained, the algorithm can effectively exploit various types of UAV swarm information to support DR decision-making, enabling efficient and steady multi-group swarm DR to achieve the mission objective.
The remainder of this paper is organized as follows. Section 2 presents the swarm mission framework. Section 3 elucidates the DRS, before Section 4 introduces a UAV swarm reconfiguration case study of detection missions. Finally, Section 5 presents the concluding remarks.

2. Problem Formulation

2.1. Mission, Destruction, and Reconfiguration

2.1.1. Mission

A detection mission containing M irregular detection areas is considered. As shown in Figure 1a, the detection areas (colored yellow) are divided into hexagons, which are inscribed hexagons of the mission areas (colored green).
The swarm detection mission area set can then be expressed as follows:
$$D = \{MA_1, MA_2, \ldots, MA_m, \ldots, MA_M\}$$
where each group mission area $MA_m$, $m \in \{1, 2, \ldots, M\}$, is covered by a certain number of hexagons, as follows:
$$MA_m = \{ma_{m1}, ma_{m2}, \ldots, ma_{mn}, \ldots, ma_{mN_m}\}$$
where $N_m$ is the total number of hexagons in group mission area $MA_m$, and each hexagon represents a single UAV mission area $ma_{mn}$, $m \in \{1, 2, \ldots, M\}$, $n \in \{1, 2, \ldots, N_m\}$.
A UAV swarm, the size of which is determined by the detection area, is dispatched for a detection mission. Each area requires a group to execute the detection mission, and the number of UAVs in the group depends on the number of hexagons in the mission area. Furthermore, each group is formed of one leader UAV and several follower UAVs. To execute a detection mission, as shown in Figure 1b, the radius R of the UAV detection area is determined by the detection equipment installed on the UAVs.
The UAV swarm can then be expressed as follows:
$$Swarm = \{G_1, G_2, \ldots, G_m, \ldots, G_M\}$$
where each group $G_m$, $m \in \{1, 2, \ldots, M\}$, performs detection in the group mission area $MA_m$, as follows:
$$G_m = \{U_{m1}, U_{m2}, \ldots, U_{mn}, \ldots, U_{mN_m}\}$$
where $U_{mn}$ is the n-th UAV in group $G_m$ and performs detection in UAV mission area $ma_{mn}$, $m \in \{1, 2, \ldots, M\}$, $n \in \{1, 2, \ldots, N_m\}$. The first UAV in each group is the leader of that group.
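To make the hierarchy above concrete, a minimal Python sketch is given below; the class and field names (UAV, Group, alive, location) are illustrative assumptions rather than part of the original formulation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UAV:
    group_id: int                   # m: index of the group G_m
    member_id: int                  # n: index within the group
    location: Tuple[float, float]   # center of the assigned hexagon ma_mn
    alive: bool = True              # normal working (True) or complete failure (False)

@dataclass
class Group:
    group_id: int
    uavs: List[UAV] = field(default_factory=list)

    @property
    def leader(self) -> UAV:
        # The first normal-working UAV acts as the group leader; if the original
        # leader is destroyed, a follower takes over (see Section 2.1.2).
        return next(u for u in self.uavs if u.alive)

# Swarm = {G_1, ..., G_M}
swarm: List[Group] = []
```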

2.1.2. Destruction

The UAV swarm may be subject to local and random destruction, and some UAVs may be destroyed. The effects of this destruction are used as inputs. Each UAV has two states: normal working and complete failure. When a UAV suffers destruction, it enters the failure state. When a leader UAV is destroyed, a follower UAV in the same group assumes the role of the leader of that particular group.
The scope of local destruction is represented by a circle with center coordinates $(i_d, j_d)$ and radius $r_d$, as illustrated in Figure 2a. The values of $i_d$, $j_d$, and $r_d$ are randomly generated.
Random destruction is characterized by a destruction scale, denoted $S_{rand}$, which is also generated randomly. When random destruction occurs, $S_{rand}$ randomly selected UAVs transition from the normal state to the faulty state, as depicted in Figure 2b.
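A hedged sketch of how these two destruction modes might be simulated is shown below; the function names and the use of the standard library are assumptions (Section 4.1.2 additionally draws the random-destruction scale from a Poisson distribution).

```python
import math
import random

def apply_local_destruction(uavs, i_d, j_d, r_d):
    """Fail every normal-working UAV inside the circle with center (i_d, j_d) and radius r_d."""
    for u in uavs:
        if u.alive and math.dist(u.location, (i_d, j_d)) <= r_d:
            u.alive = False

def apply_random_destruction(uavs, s_rand):
    """Fail s_rand randomly chosen normal-working UAVs (the random destruction scale)."""
    survivors = [u for u in uavs if u.alive]
    for u in random.sample(survivors, min(s_rand, len(survivors))):
        u.alive = False
```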

2.1.3. Reconfiguration

UAV swarm reconfiguration is an autonomous behavior that adapts to changes in the environment to enable the execution of the task. When the swarm is affected by dynamic changes during task execution, the system can use a DRS to achieve global mission performance recovery and reconfiguration, thus ensuring mission continuity.
When destruction occurs, the state of the UAV swarm is input into the reconfiguration algorithm, and the resulting strategy is communicated back to each UAV group. In-group and inter-group reconfiguration actions are then applied to certain UAVs, as shown in Figure 3a. After reconfiguration is complete, all mission areas should be covered by the detection range of the UAVs, as shown in Figure 3b.

2.2. Objective, Constraints, and Variables

Over a finite time $\tau_{thr}$, swarm reconfiguration aims to maximize the total coverage area (TCA) $\varepsilon_{tot}$, which is the mission area detected by the UAVs. This can be expressed as follows:
$$\varepsilon_{tot}(\tau) = \sum_{m=1}^{M} \sum_{n=1}^{N_m} \varepsilon_{mn}$$
where $\varepsilon_{tot}(\tau)$ is the TCA at the current time $\tau$ and $\varepsilon_{mn}$ is the detected area of mission area $ma_{mn}$; if $ma_{mn}$ is not covered, $\varepsilon_{mn} = 0$.
The problem should be solved at the swarm level. Considering the number of remaining UAVs, the number of UAVs to be repositioned must not exceed the number of normal-working UAVs. Furthermore, a minimum area detected by the UAVs in each mission area must be specified. The reconfiguration problem can therefore be expressed as follows:
$$\begin{aligned} \text{Max} \quad & \varepsilon_{tot} \\ \text{s.t.} \quad & \varepsilon_m \geq \varepsilon_{min}^{m} \\ & N_{move}^{m} \leq N_{normal}^{m} \\ & d \geq d_{min} \\ & m \in \{1, 2, \ldots, M\}, \; n \in \{1, 2, \ldots, N_m\} \end{aligned}$$
where $\varepsilon_m$ is the coverage area of group $G_m$, $\varepsilon_{min}^{m}$ is the specified minimum coverage area for group $G_m$, $N_{move}^{m}$ is the number of UAVs in $G_m$ that can be repositioned, $N_{normal}^{m}$ is the number of normal-working UAVs in $G_m$, $d$ is the distance between two normal-working UAVs, and $d_{min}$ is the minimum allowable (safety) distance between UAVs. This problem considers only UAVs within the communication range; if a UAV exceeds the communication distance, it enters the faulty state due to communication failure.
The initial deployment status depends on whether there is a normal-working UAV in a given hexagon of each group mission area $MA_m$. The UAV swarm deployment status can then be represented by an $I \times J$ matrix $S$, whose element $s_{ij} = 1$ if there is a normal-working UAV $U_{mn}$ in hexagon $H_{ij}$ and $s_{ij} = 0$ otherwise. The deployment status information of the UAV swarm can therefore be expressed as follows:
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1J} \\ s_{21} & s_{22} & \cdots & s_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ s_{(I-1)1} & s_{(I-1)2} & \cdots & s_{(I-1)J} \\ s_{I1} & s_{I2} & \cdots & s_{IJ} \end{bmatrix}$$
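A minimal sketch of how the status matrix of (7) and the coverage objective of (5) and (6) could be evaluated in a Python simulation follows; the grid-indexing helper hex_index and the data layout are assumptions.

```python
import numpy as np

def status_matrix(uavs, I, J, hex_index):
    """Build the I x J deployment status matrix S of (7): s_ij = 1 iff a
    normal-working UAV occupies hexagon H_ij. hex_index maps a UAV location
    to its (i, j) grid cell."""
    S = np.zeros((I, J), dtype=int)
    for u in uavs:
        if u.alive:
            S[hex_index(u.location)] = 1
    return S

def total_coverage_area(detected_areas):
    """TCA of (5): sum of the detected areas eps_mn over all mission hexagons.
    detected_areas[m][n] holds eps_mn and is 0 if hexagon ma_mn is uncovered."""
    return sum(sum(group) for group in detected_areas)

def group_constraints_ok(eps_m, eps_min_m, n_move_m, n_normal_m):
    """Per-group constraints of (6): minimum coverage and repositioning budget."""
    return eps_m >= eps_min_m and n_move_m <= n_normal_m
```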

3. MADRL-Based DR Method

An MADRL framework was developed to solve the DR problem described in the previous section, as shown in Figure 4. The framework consists of three parts: a reconfiguration decision-making process, agent decision-making, and a neural network. The three parts of the framework are described in this section, and the reconfiguration decision-making process is illustrated in Figure 4A.

3.1. Reconfiguration Decision Process

The group agents choose the DRS for the UAV groups, using the swarm's status matrix $S_t$ as the main input. This process can be expressed as follows:
$$f_{agent}^{m}(S_t, M_{t-1}^{m}) = mov_t^{m} \,\big|\, [S_t, M_{t-1}^{m}]$$
where $S_t$ is the current state matrix at time step $t$, and $M_{t-1}^{m}$ is the movement feature set of agent $m$, which consists of this agent's history of movement features $\{mov_{t-1}^{m}, mov_{t-2}^{m}, mov_{t-3}^{m}\}$. The movement history is necessary because the agents cannot fully observe the environment from the current state alone, since DR decision-making is a sequential decision process. The movement feature of agent $m$ is described as $mov_t^{m} = [loc_t^{init}, loc_t^{final}]$, where $loc_t^{init}$ and $loc_t^{final}$ are location matrices describing the hexagons in the figures of Section 2. Each element of a location matrix corresponds to a hexagon; if the element is 1, that hexagon is the chosen location. Both $loc_t^{init}$ and $loc_t^{final}$ have exactly one element equal to 1, with all other elements equal to 0. The matrix $loc_t^{init}$ represents the initial location of the movement feature at time step $t$, and $loc_t^{final}$ represents its final location. Furthermore, if $t < 1$, both $loc_t^{init}$ and $loc_t^{final}$ are zero matrices. The output $mov_t^{m}$ is the movement feature selected by agent $m$ at time step $t$. A swarm agent then uses a QMIX network to combine the outputs of all group agents and choose the most efficient one. This can be expressed as follows:
$$f_{agent}^{qmix}\left(mov_t^{m}\big|_{m=1}^{m=M};\; S_t, M_{t-1}\right) = mov_t \,\big|\, [S_t, M_{t-1}]$$
where $mov_t^{m}\big|_{m=1}^{m=M} = \{mov_t^{1}, mov_t^{2}, \ldots, mov_t^{M}\}$ is the movement set of all group agents at time step $t$, $M_{t-1}$ is the swarm movement feature set, which consists of the swarm's history of movement features $\{mov_{t-1}, mov_{t-2}, mov_{t-3}\}$, and the output $mov_t \,\big|\, [S_t, M_{t-1}]$ is the final movement feature chosen for the swarm.
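The two-layer decision step described by (8) and (9) can be summarized structurally as follows. This is a sketch only: the agent interfaces (select_movement, select_swarm_movement) are assumed placeholders, not the authors' code.

```python
def decision_step(S_t, histories, group_agents, qmix_agent):
    """One reconfiguration decision step of the two-layer framework.

    S_t          : current deployment status matrix
    histories    : per-agent movement feature history M_{t-1}^m
    group_agents : one DQN-based agent per UAV group (Equation (8))
    qmix_agent   : swarm-level QMIX network (Equation (9))
    """
    # Lower layer: each group agent proposes a movement feature mov_t^m.
    proposals = [agent.select_movement(S_t, histories[m])
                 for m, agent in enumerate(group_agents)]

    # Upper layer: the QMIX network combines the group proposals and
    # selects the movement feature executed by the swarm at step t.
    mov_t = qmix_agent.select_swarm_movement(proposals, S_t, histories)
    return mov_t
```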
The DR process consists of mission and destruction features, DR action generation, and renewal features. These three components are described in the following subsections.

3.1.1. Mission and Destruction Features

The destruction is randomly initialized at time $t_d$, and the status matrix $S$ is then generated. The coverage area at this time is $\varepsilon_A(t_d) = \varepsilon_d$.
To reconfigure the swarm and reach the maximum coverage rate, M agents, representing M different UAV groups, execute a sequence of DR actions. The DR action set is described as follows:
$$\Phi = \{act_{t|mn}\}\big|_{t=1}^{T}, \quad m \in \{1, 2, \ldots, M\}, \; n \in \{1, 2, \ldots, N_m\}$$
where $act_{t|mn}$ is the DR action of $UAV_{mn}$ at time step $t$. This DR action is defined as $act_{t|mn} = [cen(H_{ij}), cen(H_{i'j'})]$, meaning that the $UAV_{mn}$ in hexagon $H_{ij}$ moves to the target hexagon $H_{i'j'}$. The parameter $cen(H_{ij})$ denotes the center location of hexagon $H_{ij}$, and the action $act_{t|mn}$ is generated according to the movement feature $mov_t$. The DR action set of group $m$ can be described as follows:
$$\Phi_m = \{act_{t|mn}\}\big|_{n=1}^{N_m}, \quad m \in \{1, 2, \ldots, M\}, \; t \in \{1, 2, \ldots, T\}$$
After the DR action has finished, agent $m$ uses a search algorithm to select the next DR action $act_{t|mn}$ for $UAV_{mn}$ in group $G_m$, or chooses to finish the reconfiguration process. This process is repeated at each time step $t$. The neural network of agent $m$ (see Section 3.2) can be described as follows:
$$Q_m(S_t, mov_t^{m}) = f_{DQN}^{m}(S_t, M_t^{m})$$
where $Q_m(S_t, mov_t^{m})$ is the value of the movement feature $mov_t^{m}$ at time step $t$.
Each time step corresponds to a realistic period of time, the length of which is proportional to the distance the UAV moves in this time step.

3.1.2. Reconfiguration Action Generation

Once a DR action $act_{t|mn}$ is complete, the moving UAV is considered to perform the detection mission at its new location, and the status matrix $S_t$ is updated. The term $\varepsilon_A(t)$ can then be calculated according to (5). The objective of agent $m$ is to achieve the maximum coverage area as efficiently as possible, so the reward should account for both the coverage area and the reconfiguration time. All agents use the same reward function, and the reward at time step $t$ is defined as follows:
$$R_t = \sum_{\zeta=0}^{T-t} \delta^{\zeta} \int_{\tau_{t+\zeta-1}}^{\tau_{t+\zeta}} \left(1 - \frac{\varepsilon_{tot}(\tau)}{\varepsilon_0}\right) d\tau$$
where $R_t$ is the reward at time step $t$, $\tau_{t+\zeta}$ is the reconfiguration time of time step $(t+\zeta)$, $\tau_{t+\zeta-1}$ is the reconfiguration time of time step $(t+\zeta-1)$, $\varepsilon_0$ is the initial TCA, $\delta$ is the discount factor, and $\tau_T$ is the time to finish reconfiguration (TTFR).
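As a hedged numerical illustration of this reward, assuming the reconstructed form of (13) above and a simple trapezoidal discretization of the integral (variable names are illustrative):

```python
import numpy as np

def reward(t, step_times, coverage, eps_0, delta=0.99):
    """Approximate the reward of (13) at time step t (t >= 1 assumed).

    step_times : list of reconfiguration times [tau_0, ..., tau_T]
    coverage   : callable returning eps_tot(tau) at continuous time tau
    eps_0      : initial total coverage area
    delta      : discount factor
    """
    T = len(step_times) - 1
    r = 0.0
    for zeta in range(T - t + 1):
        a, b = step_times[t + zeta - 1], step_times[t + zeta]
        # Trapezoidal approximation of the integral of (1 - eps_tot(tau)/eps_0).
        taus = np.linspace(a, b, 20)
        integrand = 1.0 - np.array([coverage(x) for x in taus]) / eps_0
        dt = np.diff(taus)
        r += (delta ** zeta) * float(np.sum(dt * (integrand[:-1] + integrand[1:]) / 2.0))
    return r
```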
For agent $m$, an optimization algorithm is used to select the best movement feature of UAV group $m$. At each time step $t$, the DQN of agent $m$ outputs a movement value $Q_m(S_t, mov_t^{m})$, and the agent then outputs a movement feature $mov_t^{m}$. A QMIX network is used to select the most effective action from all possible actions.
The mixing network has two parts: a parameter-generating network and an inference network. The former receives the global state $S_t$ and generates the neuron weights and biases. The latter receives the value $Q_m(S_t, mov_t^{m})$ from each agent and generates the global utility value $Q_{tot}$ using these weights and biases.
The movement utility value $Q_{tot}$ is used to formulate the final decision for the whole swarm (see Section 3.3), as expressed in (15).

3.1.3. Renewal Features

Once the swarm has finished $act_{t|mn}$, the state matrix and movement feature set $[S_t, M_t]$ are used as the new input to the algorithm, which continues to run and either outputs new movement actions or decides to end the reconfiguration process.

3.2. Deep Q-Learning for Reconfiguration

The agents use the deep Q-learning algorithm to evaluate the movement action, with the action-value function represented by a deep neural network parameterized by $\vartheta$. The movement feature $mov_t^{m}$ has a movement value function $Q_m(S_t, mov_t^{m}) = \mathbb{E}_{S_{t+1:\infty},\, mov_{t+1:\infty}^{m}}[R_t \,|\, S_t, mov_t^{m}]$, where $R_t = \sum_{i=0}^{\infty} \delta^{i} r_{t+i}$ is the discounted return and $\delta$ is the discount factor.
The transition tuple of each movement action of group agent $m$ is stored as $[S, mov^{m}, R, S']$, where $S$ is the state before the movement, $mov^{m}$ is the selected movement feature, $R$ is the reward for this movement, and $S'$ is the state after the movement has finished. $\vartheta$ is learned by sampling batches of $b$ transitions and minimizing the squared temporal-difference error:
$$\mathcal{L}(\vartheta) = \sum_{i=1}^{b} \left[\left(\gamma_i^{DQN} - Q_m(S, mov^{m}; \vartheta)\right)^{2}\right]$$
where $\gamma^{DQN} = R + \delta \max_{mov'^{m}} Q_m(S', mov'^{m}; \vartheta^{-})$, $\vartheta^{-}$ denotes the parameters of the target network, which are periodically copied from $\vartheta$ and held constant for several iterations, $b$ is the batch size of transitions sampled from the replay buffer, and $Q_m(S', mov'^{m}; \vartheta^{-})$ is the utility value of $mov'^{m}$.
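A minimal PyTorch sketch of this temporal-difference update is shown below, assuming movement features are encoded as discrete action indices; the replay-buffer and network details are simplified placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, delta=0.99):
    """One gradient step on the (mean) squared TD error of (14).

    batch : tuple of tensors (S, mov, R, S_next), where mov holds the discrete
            indices of the selected movement features.
    """
    S, mov, R, S_next = batch

    # Q_m(S, mov; theta): value of the movement actually taken.
    q_taken = q_net(S).gather(1, mov.unsqueeze(1)).squeeze(1)

    # gamma^DQN = R + delta * max_mov' Q_m(S', mov'; theta^-), using a frozen target net.
    with torch.no_grad():
        target = R + delta * target_net(S_next).max(dim=1).values

    loss = nn.functional.mse_loss(q_taken, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```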

3.3. QMIX for Multi-Agent Strategy

The QMIX network is applied to generate the swarm-level DR action. The network represents $Q_{tot}$ as a monotonic mixing of the individual value functions $Q_m(S_t, mov_t^{m})$ of each agent. This can be expressed as follows:
$$Q_{tot}(S, mov) = f_{qmix}\left(Q_m(S, mov^{m})\big|_{m=1}^{m=M}\right)$$
where $Q_m(S, mov^{m})\big|_{m=1}^{m=M}$ is the set of movement values and $Q_{tot}(S, mov)$ is the joint movement value of the swarm. The monotonicity of (15) is enforced through the partial-derivative constraint $\frac{\partial Q_{tot}}{\partial Q_m} \geq 0, \; \forall m \in [1, M]$. To ensure this relationship, QMIX consists of agent networks, a mixing network, and a set of hypernetworks, as shown in Figure 4C.
For each agent $m$, there is one agent network representing the individual value function $Q_m(S, mov^{m})$. The agent networks are implemented as deep recurrent Q-networks (DRQNs). At each time step, each DRQN receives the status $S_t$ and the last movement $mov_{t-1}$ as input and outputs a value function $Q_m(S, mov^{m})$ to the mixing network.
The mixing network is a feedforward neural network that monotonically mixes all Q m S , m o v m with nonnegative weights. The weights of the mixing network are generated by separate hypernetworks, each of which generates the weight of one layer using the status S t . The biases of the mixing network are produced in the same manner but are not necessarily nonnegative. The final bias is produced by a two-layer hypernetwork.
The whole QMIX network is trained end-to-end to minimize the following loss:
$$\mathcal{L}(\vartheta) = \sum_{i=1}^{b} \left[\left(\gamma_i^{DQN} - Q_{tot}(S, mov; \vartheta)\right)^{2}\right]$$
where $\gamma^{DQN} = R + \delta \max_{mov'} Q_{tot}(S', mov'; \vartheta^{-})$ and $Q_{tot}(S, mov; \vartheta)$ is the global utility value of $mov$.
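The following PyTorch sketch illustrates the monotonic mixing described above: hypernetworks map the global state to mixing weights whose absolute values keep ∂Q_tot/∂Q_m ≥ 0. Layer sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing network: combines per-agent values Q_m into Q_tot."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks produce the mixing weights/biases from the global state S_t.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        # The final bias comes from a two-layer hypernetwork, as described above.
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) individual values Q_m(S, mov^m)
        # state:    (batch, state_dim) flattened global state S_t
        b, n = agent_qs.shape
        w1 = torch.abs(self.hyper_w1(state)).view(b, n, -1)   # nonnegative weights
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, n), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)   # nonnegative weights
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                    # (batch, 1, 1)
        return q_tot.view(b)
```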

4. Case Study

A case study of UAV swarm reconfiguration was simulated using Python. The numerical simulation is described from the perspective of optimal UAV swarm reconfiguration. The effectiveness of the proposed DR decision-making method is validated using the reconfiguration results under different scenarios. In this section, a fixed-wing UAV swarm is considered, although the proposed method is also applicable to other types of UAV swarms.

4.1. UAV Swarm Reconfiguration

4.1.1. Mission

A detection mission containing seven irregular detection areas is randomly generated, as shown in Figure 5. The yellow areas represent the detection areas, and the map is divided into hexagons. The hexagons of the mission areas (colored green) need to cover the detection areas. A UAV swarm with 7 × 6 UAVs is simulated to execute this detection mission; the initial location information of each UAV is presented in Table 1.
From the UAV swarm deployment in Figure 5, the initial detection mission state is shown in Figure 6a, where each light-gray circle represents the detection area of one UAV. In this case, all UAVs in the swarm are assumed to have the same detection radius of $3\sqrt{3}$ km, and the initial TCA is $\varepsilon_{tot} = 770.59$ km². Furthermore, the safety distance is assumed to be 0.2 km. The detection radius and safety distance can also be assigned based on the actual regions.

4.1.2. Destruction

The destruction states were randomly generated. Two kinds of destruction, namely local and random destruction, were considered simultaneously. For local destruction, the destruction center is a randomly sampled point on the mission area, and the destruction area is a randomly generated irregular polygon. For random destruction, the number of destroyed UAVs is assumed to follow the Poisson distribution with λ = 1 .
For the mission and swarm deployment case in Figure 6a, the generated destruction states are illustrated in Figure 6b and consist of two local destruction areas and a random destruction of three UAVs. The destruction centers of the two local destruction areas are (18, 17.32) and (5, 83.13), with radii of 11 and 4, respectively. The destroyed UAVs in Figure 6b are {U1,1, U1,5, U3,2, U4,3, U5,4, U6,1, U6,2, U6,3, U6,4, U6,5, U6,6}, covering both local and random destruction. After this destruction process, the current total coverage area is $\varepsilon_{tot} = 615.02$ km². All of the destruction information is presented in Table 2.

4.1.3. Reconfiguration

For the reconfiguration process, the initial and final time steps are shown in Figure 6c,d, respectively. UAVs colored yellow represent the initial locations in this reconfiguration process, while UAVs colored blue represent the final locations. The red arrows represent the reconfiguration routes from the initial to the final locations, which are generated from the movement feature set M according to (9). Each reconfiguration action is generated by an agent of the proposed multi-agent framework according to (9). For the case in Figure 6c,d, the DR action set Φ is listed in Table 3.
After this reconfiguration process, the UAV swarm has finished its redeployment, and the current detection state is shown in Figure 6e. All UAVs in the swarm are assumed to have the same speed of 50 km/h; the speed can also be assigned based on actual conditions. The TCA is used as the metric of UAV swarm performance. During this reconfiguration process, the UAV swarm performance exhibits a fluctuating upward trend, as shown in Figure 6f. The black dashed line in Figure 6f represents the TCA threshold $\varepsilon_{thr}$, which is assumed to be 714 km²; this threshold can also be assigned on the basis of actual conditions. After the reconfiguration process, the final TCA is $\varepsilon_{tot}(\tau_T) = 732.31$ km².

4.2. Discussion

In addressing the UAV swarm reconfiguration, the main objective is to generate an optimal feasible strategy. Extended analyses are now presented covering the method performance and the influence of various factors.

4.2.1. Different Algorithms

This section evaluates the performance of the proposed QMIX method against the DQN method and a cooperative game (CG) method [25], which were also used to generate the UAV swarm DRS. We used a single machine with one Intel i9-7980XE CPU and four RTX 2080 Ti (11 GB) GPUs to train the QMIX and DQN networks. During training, each episode generated a DRS for a randomly generated mission and destruction, as described in Section 4.1. The assessment process was as follows: training was paused every 100 episodes, and each method was run for 10 independent episodes with greedy action selection. Figure 7 plots the mean reward across these 10 runs for each method with independent mission and destruction details. As the 10 independent episodes are fixed, the mean reward of the CG method is a constant value, so its reward curve is a straight line. The shading around each reward curve represents the standard deviation across the 10 runs. Over the training process, 100,000 episodes were executed for each method. The reward curves of the two learning-based methods fluctuate upward. In the first 17,000 episodes, the DQN method exhibits faster growth than the QMIX method; however, QMIX achieves a higher upper bound on the reward curve after 20,000 episodes and is noticeably stronger in terms of final DR decision-making performance. The superior representational capacity of QMIX, combined with the state information, provides a clear benefit over the DQN method.
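A hedged sketch of this training-and-evaluation protocol is given below; the episode generator and agent interfaces (train_episode, run_episode) are assumed placeholders, since the paper does not publish its training script.

```python
import numpy as np

def evaluate(method, make_episode, n_eval=10):
    """Run n_eval fixed evaluation episodes with greedy action selection and
    return the mean and standard deviation of the episode rewards."""
    rewards = [method.run_episode(make_episode(seed=k), greedy=True)
               for k in range(n_eval)]
    return float(np.mean(rewards)), float(np.std(rewards))

def train_with_periodic_eval(method, make_episode, n_episodes=100_000, eval_every=100):
    """Training loop: pause every eval_every episodes and record greedy performance."""
    history = []
    for ep in range(1, n_episodes + 1):
        method.train_episode(make_episode())      # random mission and destruction
        if ep % eval_every == 0:
            history.append((ep, *evaluate(method, make_episode)))
    return history
```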

4.2.2. Different Destruction Cases

For a given mission and swarm scale, the destruction process was randomly generated. The redeployment results were obtained by executing the QMIX reconfiguration strategy, as shown in Figure 8. The three subgraphs demonstrate the initial deployment status, the destruction status, and the redeployment results of the proposed QMIX algorithm. The geographical distributions of all mission areas and the swarm with 5 × 6 UAVs are the same in the three subgraphs, while the destruction states are different. After the reconfiguration process, the redeployment results in the three subgraphs demonstrate that the proposed QMIX method exhibits stable performance for this reconfiguration decision-making problem with different destruction patterns. This is because, during the training process, UAV destruction is generated randomly.
In addressing the UAV swarm redeployment, the main objective was to obtain an optimal feasible DRS. Extended analyses of the optimization strategy were conducted to determine the influence of different methods. The QMIX method was proposed for this optimization, while the DQN method and the CG method were also used to solve the three destruction cases in Figure 8. The QMIX method gives optimal solutions with a better TCA $\varepsilon_{tot}(\tau_T)$ and a shorter TTFR $\tau_T$ than the other methods, as shown in Figure 9. The proposed method achieves the better solution because the other two methods may lead to local optima, such as situations in which multiple UAVs spend more time moving during the reconfiguration process. The efficiencies of the methods are analyzed in Table 4. According to these results, the solution speeds of QMIX and DQN are close, while the CG method is significantly slower than the other two.

4.2.3. Different Swarm Scales

Under different missions and swarm scales, the redeployment results obtained by executing the QMIX reconfiguration strategy are shown in Figure 10. The three subgraphs demonstrate the different deployment missions, the destruction status, and the redeployment results of the proposed method. The geographical distributions of all mission areas were randomly generated in the three subgraphs, and the initial swarm scales were 5 × 6, 7 × 6, and 9 × 6. Then, the destruction states were randomly generated. After the reconfiguration process, the redeployment results in the three subgraphs demonstrate that the proposed QMIX method exhibits stable performance under the different missions and swarm scales. During the training process, the missions and swarm scales were generated randomly. Thus, the superior representational capacity of QMIX combined with the mission state and swarm state information provides a clear benefit in terms of reconfiguration decision-making performance.
Again, keeping the same cases as in Figure 10 and using the QMIX, DQN, and CG methods, we also analyzed the differences in algorithm performance. Under different missions and swarm scales, the QMIX method again gives optimal solutions with a better TCA $\varepsilon_{tot}(\tau_T)$ and a shorter TTFR $\tau_T$ than the other methods, as shown in Figure 11. The efficiencies of the methods are analyzed in Table 5. According to these results, the solution speeds of QMIX and DQN are close for each case, while the CG method is significantly slower. Furthermore, the solution speeds of QMIX and DQN are stable and do not degrade sharply as the swarm scale increases, whereas the solution speed of the CG method clearly decreases with increasing swarm scale. These results show that the proposed QMIX method exhibits stable DR decision-making performance for swarms of different scales.

5. Conclusions

Distributed AI is gradually being applied to multi-UAVs. This paper has focused on DR decision-making for UAV swarm deployment optimization using a proposed MADRL framework. A two-layered decision-making framework based on MADRL enables UAV swarm redeployment, which maximizes swarm performance. Simulations using Python have demonstrated that the proposed QMIX method can generate a globally optimal DRS for UAV swarm redeployment. Furthermore, the results of the case study show that the QMIX method achieves a better swarm performance with less reconfiguration time than the other methods and exhibits stable and efficient solution speed. The DR decision-making problem considered in this paper is one of redeployment decision-making; the initial deployment planning was not addressed. Future research should emphasize the integration of UAV swarm initial deployments into decision-making frameworks.

Author Contributions

Conceptualization, Y.R., Q.F. and Q.W.; methodology, Z.G. and Q.W.; software, Q.W.; validation, J.Z., Z.G. and Q.W.; formal analysis, Q.F. and Q.W.; investigation, Q.F. and Q.W.; resources, J.Z. and Q.W.; data curation, J.Z. and Q.W.; writing—original draft preparation, Z.G. and Q.W.; writing—review and editing, Y.R., Q.F. and Q.W.; supervision, Q.F.; project administration, Q.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China: 72001213.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, Z.C.; Yen, G.G.; Wu, J.; Ren, H.; An, H.; Yang, J. Mission planning for energy-efficient passive UAV radar imaging system based on substage division collaborative search. IEEE Trans. Cybern. 2023, 53, 275–288. [Google Scholar] [CrossRef] [PubMed]
  2. Jinqiang, H.; Husheng, W.; Renjun, Z.; Rafik, M.; Xuanwu, Z. Self-organized search-attack mission planning for UAV swarm based on wolf pack hunting behavior. J. Syst. Eng. Electron. 2021, 32, 1463–1476. [Google Scholar] [CrossRef]
  3. Cheng, N.; Wu, S.; Wang, X.; Yin, Z.; Li, C.; Chen, W.; Chen, F. AI for UAV-assisted IoT applications: A comprehensive review. IEEE Internet Things J. 2023, 10, 14438–14461. [Google Scholar] [CrossRef]
  4. Khan, M.A.; Kumar, N.; Mohsan, S.A.H.; Khan, W.U.; Nasralla, M.M.; Alsharif, M.H.; Żywiołek, J.; Ullah, I. Swarm of UAVs for network management in 6G: A technical review. IEEE Trans. Netw. Serv. Manag. 2023, 20, 741–761. [Google Scholar] [CrossRef]
  5. Li, X.W.; Yao, H.P.; Wang, J.J.; Xu, X.; Jiang, C.; Hanzo, L. A near-optimal UAV-aided radio coverage strategy for dense urban areas. IEEE Trans. Veh. Technol. 2019, 68, 9098–9109. [Google Scholar] [CrossRef]
  6. Masroor, R.; Naeem, M.; Ejaz, W. Efficient deployment of UAVs for disaster management: A multi-criterion optimization approach. Comput. Commun. 2021, 177, 185–194. [Google Scholar] [CrossRef]
  7. Savkin, A.V.; Huang, H.L. Range-based reactive deployment of autonomous drones for optimal coverage in disaster areas. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 4606–4610. [Google Scholar] [CrossRef]
  8. Wang, J.; Liu, M.; Sun, J.L.; Gui, G.; Gacanin, H.; Sari, H.; Adachi, F. Multiple unmanned-aerial-vehicles deployment and user pairing for nonorthogonal multiple access schemes. IEEE Internet Things J. 2021, 8, 1883–1895. [Google Scholar] [CrossRef]
  9. Yu, M.G.; Niu, Y.J.; Liu, X.D.; Zhang, D.G.; Peng, Z.; He, M.; Luo, L. Adaptive dynamic reconfiguration mechanism of unmanned swarm topology based on an evolutionary game. J. Syst. Eng. Electron. 2023, 34, 598–614. [Google Scholar] [CrossRef]
  10. Wang, Y.Z.; Yue, Y.F.; Shan, M.; He, L.; Wang, D. Formation reconstruction and trajectory replanning for multi-UAV patrol. IEEE/ASME Trans. Mechatron. 2021, 26, 719–729. [Google Scholar] [CrossRef]
  11. Bouhamed, O.; Ghazzai, H.; Besbes, H.; Massoud, Y. A generic spatiotemporal scheduling for autonomous UAVs: A reinforcement learning-based approach. IEEE Open J. Veh. Technol. 2020, 1, 93–106. [Google Scholar] [CrossRef]
  12. Zhang, H.; Li, J.; Qi, Z.; Aronsson, A.; Bosch, J.; Olsson, H.H. Deep Reinforcement Learning for Multiple Agents in a Decentralized Architecture: A Case Study in the Telecommunication Domain. In Proceedings of the IEEE 20th International Conference on Software Architecture Companion (ICSA-C), L’Aquila, Italy, 13–17 March 2023; Volume 2023, pp. 183–186. [Google Scholar] [CrossRef]
  13. Ren, L.; Wang, C.; Yang, Y.; Cao, Z. A Learning-Based Control Approach for Blind Quadrupedal Locomotion with Guided-DRL and Hierarchical-DRL. In Proceedings of the IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 6–9 December 2021; Volume 2021, pp. 881–886. [Google Scholar] [CrossRef]
  14. Xu, J.; Guo, Q.; Xiao, L.; Li, Z.; Zhang, G. Autonomous Decision-Making Method for Combat Mission of UAV Based on Deep Reinforcement Learning. In Proceedings of the IEEE Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 20–22 December 2019; Volume 2019, pp. 538–544. [Google Scholar] [CrossRef]
  15. Samir, M.; Assi, C.; Sharafeddine, S.; Ebrahimi, D.; Ghrayeb, A. Age of Information Aware Trajectory Planning of UAVs in Intelligent Transportation Systems: A Deep Learning Approach. IEEE Trans. Veh. Technol. 2020, 69, 12382–12395. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Li, Y.; Wu, Z.; Xu, J. Deep reinforcement learning for UAV swarm rendezvous behavior. J. Syst. Eng. Electron. 2023, 34, 360–373. [Google Scholar] [CrossRef]
  17. Huda, S.M.A.; Moh, S. Deep reinforcement learning-based computation offloading in uav swarm-enabled edge computing for surveillance applications. IEEE Access 2023, 11, 68269–68285. [Google Scholar] [CrossRef]
  18. Zhang, N.; Liu, C.; Ba, J. Decomposing FANET to Counter Massive UAV Swarm Based on Reinforcement Learning. IEEE Commun. Lett. 2023, 27, 1784–1788. [Google Scholar] [CrossRef]
  19. Mou, Z.; Zhang, Y.; Gao, F.; Wang, H.; Zhang, T.; Han, Z. Deep Reinforcement Learning Based Three-Dimensional Area Coverage With UAV Swarm. IEEE J. Sel. Areas Commun. 2021, 39, 3160–3176. [Google Scholar] [CrossRef]
  20. Liu, Y.; Yan, J.; Zhao, X. Deep Reinforcement Learning Based Latency Minimization for Mobile Edge Computing with Virtualization in Maritime UAV Communication Network. IEEE Trans. Veh. Technol. 2022, 71, 4225–4236. [Google Scholar] [CrossRef]
  21. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of drones: Multi-uav pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7900–7909. [Google Scholar] [CrossRef]
  22. Xia, Z.; Du, J.; Wang, J.; Jiang, C.; Ren, Y.; Li, G.; Han, Z. Multi-agent reinforcement learning aided intelligent UAV swarm for target tracking. IEEE Trans. Veh. Technol. 2022, 71, 931–945. [Google Scholar] [CrossRef]
  23. Lv, Z.; Xiao, L.; Du, Y.; Niu, G.; Xing, C.; Xu, W. Multi-Agent Reinforcement Learning based UAV Swarm Communications against Jamming. IEEE Trans. Wirel. Commun. 2023. [Google Scholar] [CrossRef]
  24. Xiang, L.; Xie, T. Research on UAV Swarm Confrontation Task Based on MADDPG Algorithm. In Proceedings of the 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), Harbin, China, 25–27 December 2020; Volume 2020, pp. 1513–1518. [Google Scholar] [CrossRef]
  25. Feng, Q.; Bi, W.; Chen, Y.; Ren, Y.; Yang, D. Cooperative Game Approach based on Agent Learning for Fleet Maintenance Oriented to Mission Reliability. Comput. Ind. Eng. 2017, 112, 221–230. [Google Scholar] [CrossRef]
Figure 1. Mission and UAV swarm. (a) Detection and mission areas. (b) UAV swarm detection.
Figure 2. Different destruction types. (a) Local destruction. (b) Random destruction.
Figure 3. Reconfiguration progress. (a) Reconfiguration. (b) Detection.
Figure 4. Multi-agent deep reinforcement learning framework.
Figure 5. Mission and UAV swarm deployment.
Figure 6. Destruction and reconfiguration of UAVs. (a) Initial state. (b) Destruction. (c) Reconfiguration step 1. (d) Reconfiguration step T. (e) Redeployment. (f) Performance.
Figure 7. Reward of different algorithms.
Figure 8. Reconfiguration under different destruction cases (a–c).
Figure 9. Reconfiguration under different destruction cases.
Figure 10. Reconfiguration under different swarm scales (a–c).
Figure 11. Reconfiguration under different swarm sizes.
Table 1. Location of each UAV.
G1: UAV1,1 (3, 74.48); UAV1,2 (3, 77.94); UAV1,3 (3, 81.41); UAV1,4 (6, 76.21); UAV1,5 (6, 79.67); UAV1,6 (9, 77.94)
G2: UAV2,1 (12, 76.21); UAV2,2 (12, 79.67); UAV2,3 (12, 83.14); UAV2,4 (15, 77.94); UAV2,5 (15, 81.41); UAV2,6 (18, 79.67)
G3: UAV3,1 (33, 53.69); UAV3,2 (33, 57.16); UAV3,3 (33, 60.62); UAV3,4 (36, 55.43); UAV3,5 (36, 58.89); UAV3,6 (39, 57.16)
……
G6: UAV6,1 (15, 12.12); UAV6,2 (15, 15.59); UAV6,3 (15, 19.05); UAV6,4 (18, 13.86); UAV6,5 (18, 17.32); UAV6,6 (21, 15.69)
G7: UAV7,1 (33, 5.20); UAV7,2 (33, 12.12); UAV7,3 (33, 8.66); UAV7,4 (36, 6.93); UAV7,5 (36, 10.39); UAV7,6 (39, 8.66)
Table 2. Destruction information.
Local destruction 1: destruction center (18, 17.32); destruction radius 11; destroyed UAVs UAV6,1, UAV6,2, UAV6,3, UAV6,4, UAV6,5, UAV6,6
Local destruction 2: destruction center (5, 83.13); destruction radius 4; destroyed UAVs UAV1,1, UAV1,5
Random destruction: destroyed UAVs UAV3,2, UAV4,3, UAV5,4
Table 3. Reconfiguration action set (swarm scale: 7 × 6 UAVs).
UAV1,6 (agent 1): initial location (9, 77.94), final location (6, 79.67)
UAV2,5 (agent 2): initial location (15, 81.41), final location (15, 15.59)
UAV3,3 (agent 3): initial location (33, 60.62), final location (18, 17.32)
UAV3,6 (agent 3): initial location (39, 57.16), final location (33, 57.16)
……
UAV4,1 (agent 4): initial location (82, 36.37), final location (21, 15.59)
UAV7,4 (agent 7): initial location (36, 6.93), final location (18, 13.86)
Table 4. Running time (in seconds) of different methods under different destruction cases.
Case (a) in Figure 8: QMIX 20.652; DQN 19.215; CG 43.642
Case (b) in Figure 8: QMIX 20.857; DQN 19.618; CG 44.258
Case (c) in Figure 8: QMIX 20.116; DQN 19.128; CG 42.289
Table 5. Running time (in seconds) of different methods under different swarm sizes.
Case (a) in Figure 10: QMIX 20.542; DQN 19.942; CG 44.845
Case (b) in Figure 10: QMIX 27.031; DQN 26.531; CG 70.275
Case (c) in Figure 10: QMIX 32.816; DQN 32.116; CG 120.389
