Article

Multi-Unmanned Aerial Vehicle Confrontation in Intelligent Air Combat: A Multi-Agent Deep Reinforcement Learning Approach †

1 School of Electronic and Information Engineering, Soochow University, Suzhou 215006, China
2 Dongguan Power Supply Bureau of Guangdong Power Grid Corporation, Dongguan 523120, China
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the 2023 International Conference on Ubiquitous Communication (Ucom).
Drones 2024, 8(8), 382; https://doi.org/10.3390/drones8080382
Submission received: 9 July 2024 / Revised: 3 August 2024 / Accepted: 4 August 2024 / Published: 7 August 2024
(This article belongs to the Special Issue Distributed Control, Optimization, and Game of UAV Swarm Systems)

Abstract: Multiple unmanned aerial vehicle (multi-UAV) confrontation is becoming an increasingly important combat mode in intelligent air combat. The confrontation relies heavily on the intelligent collaboration and real-time decision-making of the UAVs. Thus, a decomposed and prioritized experience replay (PER)-based multi-agent deep deterministic policy gradient (DP-MADDPG) algorithm is proposed in this paper for the moving and attacking decisions of UAVs. Specifically, the confrontation is formulated as a partially observable Markov game. To solve the problem, the DP-MADDPG algorithm is proposed by integrating the decomposed and PER mechanisms into the traditional MADDPG. To overcome the technical challenges of convergence to a local optimum and a single dominant policy, the decomposed mechanism modifies the MADDPG framework with local and global dual critic networks. Furthermore, to improve the convergence rate of the MADDPG training process, the PER mechanism is utilized to optimize the sampling efficiency from the experience replay buffer. Simulations have been conducted on the Multi-agent Combat Arena (MaCA) platform, with the traditional MADDPG and independent learning DDPG (ILDDPG) algorithms as benchmarks. Simulation results indicate that the proposed DP-MADDPG improves the convergence rate and the convergent reward value. In confrontations against the vanilla distance-prioritized rule-empowered and the intelligent ILDDPG-empowered blue parties, the DP-MADDPG-empowered red party improves the win rate to 96% and 80.5%, respectively.

1. Introduction

In the era of intelligent air combat, multiple unmanned aerial vehicle (multi-UAV) confrontation is becoming an increasingly important combat mode in the military battlefield, which takes advantage of the efficient collaboration of UAVs to complete the confrontation mission [1,2]. In the confrontation scenario, the multiple UAVs with the capabilities of communications, sensing, and computing can collaborate with each other and complete the mission through real-time situation awareness, optimal resource allocation, and intelligent decision-making [3,4]. Therefore, to complete the multi-UAV confrontation mission, the development of intelligent decision-making algorithms is a necessity.
For the intelligent decision-making of UAVs on the battlefield, an expert system-based air combat decision-making algorithm is proposed [5], which builds a tactical rule database based on the situation of battlefields and designs maneuver decisions according to expert knowledge. However, the tactical rule database relies on expert knowledge, which leads to low adaptability and high complexity. Falomir et al. have proposed an artificial potential field-based moving model for the UAV swarm, which enables the collaboration of multiple UAVs, although the artificial potential field may lead to a local optimum [6].
Research efforts have also been devoted to game theory-based solutions. Ha et al. have proposed a matrix game method to facilitate the UAVs with the attacking decision-making in air combat [7]. It establishes a score matrix based on the target distance, flight speed, and direction of UAVs and evaluates the decisions of each UAV in real time. However, the matrix game method can hardly be applied to the sequential decision problem. In [8], the authors proposed an auction-based attack–defense decision-making algorithm for target–attack–defense tripartite UAV swarms, wherein the swarm confrontation is modeled as a differential game. As noted by the authors, the scalability of the tripartite UAV swarms with different scales and the distributed decision-making based on multi-source perception could be further improved.
In terms of swarm intelligence optimization, the gray wolf optimizer (GWO) [9] has been applied to the attack–defense scenario for optimizing the success rate of a single blue-party UAV breaking through the interception of the red-party cooperative UAV swarms [10]. In [11], the pigeon-inspired optimization (PIO) [12] was improved to resolve the issue of autonomous maneuver decisions in air-to-air confrontation. However, the candidate maneuvers of UAVs are limited by the maneuver library, which cannot meet the requirements of the continuous and flexible actions of UAVs.
With the advantages of self-learning and scalability, deep reinforcement learning (DRL) algorithms have been developed for intelligent decision-making. In [13], the deep deterministic policy gradient (DDPG) [14] was improved to aid the UAV with the decision-making of the head-on attack, pursuing, fleeing, and energy-storing. However, this work focuses on the single-agent DDPG-enabled decision-making of a single UAV. Considering multiple UAVs in intelligent air combat, multi-agent DRL (MADRL) algorithms were investigated [15]. Xuan et al. [16] used the multi-agent deep deterministic policy gradient (MADDPG) algorithm [17] to enable the decision-making of UAVs, which improves the win rate of the multi-UAV confrontation. Li et al. [18] proposed a MADDPG-based confrontation algorithm, wherein a gated recurrent unit is introduced to the actor network to facilitate the decision-making of UAVs with historical decision information. Jiang et al. [19] introduced a self-attention mechanism into the MADDPG algorithm to make it adaptable to the dynamic changes of friendly forces and enemies during the policy learning process. However, the MADRL-based confrontation algorithms still need to be improved from the perspectives of convergence and stability [20].
To resolve these issues, a decomposed and prioritized experience replay (PER)-based MADDPG (DP-MADDPG) algorithm is proposed in our work for the intelligent decision-making of UAVs in the multi-UAV confrontation scenario. Specifically, the multi-UAV confrontation system model is set up first, wherein the multi-UAV confrontation mission, UAV motion, and attack models are depicted. Subsequently, the problem of intelligent decision-making in the multi-UAV confrontation is formulated as a partially observable Markov game, wherein the real-time decisions that UAVs need to make are modeled as the actions, including the real-time acceleration, angular velocity, attacking mode, and target enemy. To resolve the problem, the DP-MADDPG algorithm is proposed, which integrates the decomposed and PER mechanisms into the traditional MADDPG. As an extension of our earlier work [21], the contributions of this paper are highlighted as follows:
  • The decomposed mechanism is developed to resolve the issue of data coupling among multiple UAVs, where the fundamental structure of the traditional MADDPG is modified by integrating both local and global critic networks. With dual critic networks, UAVs can maximize both local and global rewards, which avoids convergence to a local optimum and the dominant decision-making policy incurred by the case of a single global critic network.
  • The PER mechanism is applied to the traditional MADDPG to improve the sampling efficiency from the experience replay buffer. By using the PER, the uniform distribution of the experience samples in the experience replay buffer is broken, and a larger sampling weight is assigned to the samples with a higher learning efficiency, so that the convergence rate can consequently be improved.
  • Simulations were conducted based on the Multi-agent Combat Arena (MaCA) platform. The results indicate that the proposed DP-MADDPG improves the convergence rate and the convergent reward value compared to the traditional MADDPG and independent learning DDPG (ILDDPG). Furthermore, in the multi-UAV confrontation against the vanilla distance-prioritized rule-empowered and the intelligent ILDDPG-empowered blue parties, the win rate of the DP-MADDPG-empowered red party was improved to 96% and 80.5%, respectively.
The rest of this paper is organized as follows. The multi-UAV confrontation system model is introduced in Section 2. The multi-UAV confrontation mission is formulated as the partially observable Markov game in Section 3. In Section 4, the DP-MADDPG algorithm for multi-UAV confrontation is proposed. The performance evaluation is conducted in Section 5. Finally, this work is summarized in Section 6.

2. Multi-UAV Confrontation System Model

The fundamental models of the multi-UAV confrontation are introduced in this section, including the multi-UAV confrontation mission, motion, and attack models.

2.1. Multi-UAV Confrontation Mission Model

The multi-UAV confrontation is the mutual attack of UAV swarms on the battlefield, where the UAVs are intelligent agents with the capability of decision-making. It is assumed that there are two parties in the scenario, namely the red and blue parties, as depicted in Figure 1. Specifically, each party consists of reconnaissance UAVs and combat UAVs.
  • Reconnaissance UAVs are responsible for wide-area omnidirectional detection with radars working in the L and S bands that support multi-frequency switching.
  • Combat UAVs have the capabilities of reconnaissance, interference, and striking. They are equipped with directional detection radars working in the X band and with interference devices working in the L, S, and X bands for blocking and aimed electronic interference. They also carry a number of missiles for target striking. When a UAV runs out of missiles during the confrontation mission, it can adopt a suicide collision strategy against an enemy UAV within its attack range. Furthermore, combat UAVs have the capability of passive detection and can be applied to collaborative localization and radio source recognition.
Moreover, the UAVs are equipped with measurement devices such as the global positioning system (GPS) and inertial measurement unit (IMU) to obtain their real-time position and speed. For each UAV, the situational information of ally and enemy UAVs, including relative position, relative heading, and relative speed, is detected by the carried passive radars.
According to the confrontation rule, multiple rounds of confrontation may occur within the allotted time. If all of one party's UAVs are eliminated during the process, or if it has fewer surviving UAVs at the end, the other party is declared the winner. If both parties have the same number of surviving UAVs, the one with the larger number of remaining missiles wins.
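For illustration, a minimal Python sketch of this round-outcome rule is given below; the function name and the explicit draw outcome (consistent with Table 4) are assumptions made for clarity.

```python
def round_winner(red_alive: int, blue_alive: int,
                 red_missiles: int, blue_missiles: int) -> str:
    """Decide one confrontation round following the rule above.

    The party with more surviving UAVs wins; on a tie, the party with more
    remaining missiles wins; otherwise the round is a draw (cf. Table 4).
    """
    if red_alive != blue_alive:
        return "red" if red_alive > blue_alive else "blue"
    if red_missiles != blue_missiles:
        return "red" if red_missiles > blue_missiles else "blue"
    return "draw"
```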

2.2. UAV Motion Model

In this work, the multi-UAV confrontation is assumed to occur in a horizontal plane; that is, the UAVs fly at identical altitudes. Thus, the six-degree-of-freedom (6-DOF) UAV model is simplified to a 2D-plane model [20].
Based on the models in [20,22], the motion of UAV i is modeled by
$$
\begin{aligned}
x_i(t+1) &= x_i(t) + v_i(t)\cos\!\big(\varphi_i(t)\big)\,\Delta t,\\
y_i(t+1) &= y_i(t) + v_i(t)\sin\!\big(\varphi_i(t)\big)\,\Delta t,\\
v_i(t+1) &= v_i(t) + a_i(t)\,\Delta t,\\
\varphi_i(t+1) &= \varphi_i(t) + \omega_i(t)\,\Delta t,
\end{aligned}
\tag{1}
$$
where $(x_i(t), y_i(t))$ refers to the 2D coordinates in the horizontal plane at time t, $\varphi_i(t)$ is the heading angle, $\Delta t$ is the time interval, $v_i(t)$ and $\omega_i(t)$ are the flying speed and angular velocity, respectively, and $a_i(t)$ is the acceleration.
The motion is constrained by
$$
0 \le v_i \le v_i^{max}, \quad 0 \le \varphi_i \le 2\pi, \quad -a_i^{max} \le a_i \le a_i^{max}, \quad -\omega_i^{max} \le \omega_i \le \omega_i^{max},
\tag{2}
$$
where $v_i^{max}$, $a_i^{max}$, and $\omega_i^{max}$ are the upper limits of the speed, acceleration, and angular velocity, respectively.
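A minimal Python sketch of the discrete-time motion update in Eqs. (1) and (2) is given below; the default limits mirror the Section 5.1 settings, and the clipping behavior at the bounds is an assumption.

```python
import math

def step_motion(x, y, v, phi, a, omega, dt=1.0,
                v_max=50.0, a_max=5.0, omega_max=math.pi / 12):
    """One update of the 2D UAV motion model, Eqs. (1)-(2)."""
    # Constrain the commanded acceleration and angular velocity, Eq. (2).
    a = max(-a_max, min(a_max, a))
    omega = max(-omega_max, min(omega_max, omega))

    # Kinematic update over one time interval dt, Eq. (1).
    x_next = x + v * math.cos(phi) * dt
    y_next = y + v * math.sin(phi) * dt
    v_next = max(0.0, min(v_max, v + a * dt))
    phi_next = (phi + omega * dt) % (2.0 * math.pi)
    return x_next, y_next, v_next, phi_next
```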

2.3. UAV Attack Model

The UAV attack model is described by an attack–target pair, as shown in Figure 2. It is assumed that an attacker UAV has the capability of omnidirectional radar detection, and the detection radius is D. The attacking zone is a sector with the radius R and angle β .
The attack–target pair is denoted by $[\boldsymbol{\delta}_{ij}(t), d_{ij}(t), \theta_{ij}(t)]^{T}$. Specifically, $\boldsymbol{\delta}_{ij}$ is the relative distance vector between attacker UAV i and target UAV j at time t [22],
$$\boldsymbol{\delta}_{ij}(t) = \big(\delta_{ij}^{x}(t), \delta_{ij}^{y}(t)\big) = \big(x_j(t) - x_i(t),\; y_j(t) - y_i(t)\big). \tag{3}$$
$d_{ij}$ is the Euclidean distance,
$$d_{ij}(t) = \sqrt{\big(x_j(t) - x_i(t)\big)^2 + \big(y_j(t) - y_i(t)\big)^2}. \tag{4}$$
$\theta_{ij}$ is the attack angle,
$$\theta_{ij}(t) = \arctan\frac{\delta_{ij}^{y}(t)}{\delta_{ij}^{x}(t)}. \tag{5}$$
It is supposed that the attacker can strike the target when the distance and angle between attacker UAV i and target UAV j meet the following conditions [22]:
$$d_{ij}(t) < R, \quad \theta_{ij}(t) < \frac{\beta}{2}. \tag{6}$$
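The strike condition of Eq. (6) can be checked as in the sketch below. Note that the sketch measures the attack angle relative to the attacker's heading before comparing it with β/2, which the sector interpretation of Figure 2 suggests; this, together with the default (R, β) values taken from Section 5.1, is an assumption.

```python
import math

def can_attack(attacker_xy, attacker_heading, target_xy,
               R=400.0, beta=math.radians(45.0)):
    """Return True if the target lies inside the attacker's sector, Eq. (6)."""
    dx = target_xy[0] - attacker_xy[0]             # delta_ij^x, Eq. (3)
    dy = target_xy[1] - attacker_xy[1]             # delta_ij^y, Eq. (3)
    d = math.hypot(dx, dy)                         # Euclidean distance, Eq. (4)
    theta = math.atan2(dy, dx) - attacker_heading  # attack angle, cf. Eq. (5)
    # Wrap the angle into (-pi, pi] before the half-angle comparison.
    theta = (theta + math.pi) % (2.0 * math.pi) - math.pi
    return d < R and abs(theta) < beta / 2.0
```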

3. Partially Observable Markov Game Modeling

Multi-UAV confrontation is modeled as a partially observable Markov game, wherein each UAV is regarded as an intelligent agent. The Markov game of N UAVs is defined as a tuple $(N, S, A_1, \dots, A_N, O_1, \dots, O_N, T, R_1, \dots, R_N, R_g)$. S is the global state. $A_i$ and $O_i$ represent the local action and observation of UAV i, respectively. T is the state transition function, $T: S \times A_1 \times \dots \times A_N \to S$. $R_i$ is the local cumulative reward of UAV i, and $R_g$ is the global cumulative reward.

3.1. Observation Space

The local observation space $O_i$ of UAV i, as listed in Table 1, includes its own information and the information of friendly forces $\{ally_i\}_{i=1}^{m}$ and enemies $\{enemy_j\}_{j=1}^{n}$, where m and n are the numbers of ally and enemy UAVs in its detection range, respectively. A UAV's own information includes its coordinates, speed, heading angle, type, detection radius, sector attack area, and remaining ammunition, $\{(x_i, y_i), v_i, \varphi_i, type\_id, D, (R, \beta), M\}$. $type\_id$ is the type of UAV, where 0 and 1 refer to a combat UAV and a reconnaissance UAV, respectively. The observation information of ally and enemy UAVs includes the relative position, relative speed, relative heading, and type of the UAVs. The global state S consists of the local observation information of all UAVs.
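A minimal sketch of how the local observation $o_i$ of Table 1 could be flattened into a feature vector is given below; the dictionary keys and the ordering are illustrative assumptions, and the exact encoding used by MaCA may differ.

```python
import numpy as np

def build_observation(own: dict, allies: list, enemies: list) -> np.ndarray:
    """Flatten one UAV's local observation o_i following Table 1."""
    # Own information: position, speed, heading, type, detection radius,
    # attacking zone (R, beta), and remaining ammunition.
    obs = [own["x"], own["y"], own["v"], own["phi"], own["type_id"],
           own["D"], own["R"], own["beta"], own["M"]]
    # Relative information of detected ally and enemy UAVs.
    for u in allies + enemies:
        obs += [u["rel_x"], u["rel_y"], u["rel_heading"],
                u["rel_speed"], u["type_id"]]
    return np.asarray(obs, dtype=np.float32)
```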

3.2. Action Space

For each UAV, its moving and attacking decisions are formulated as the action space listed in Table 2, which is a vector of acceleration, angular velocity, attacking mode, and target enemy ID, namely $A_i = \langle a, \omega, attack\_mode, attack\_enemy\_id \rangle$. $attack\_mode$ is a binary variable, where 1 and 0 refer to attacking or not. Although a and ω are intrinsically continuous actions, in practical applications they are discretized to simplify the action decision. For example, the angular velocity is defined over the angle range [0°, 359°], where horizontal rightwards is 0° and the angle increases clockwise. The change in angle is discretized with an interval of 10°; for action index n, n = 1, 2, …, the heading rotates by 10n° clockwise.
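A sketch of decoding a raw discrete action vector into the action space of Table 2 is given below; the acceleration levels, the 10° step, and the per-step duration dt are illustrative assumptions, since the paper only states that a and ω are discretized.

```python
import math
from dataclasses import dataclass

@dataclass
class UAVAction:
    """Decoded action of one UAV, mirroring Table 2."""
    acceleration: float      # m/s^2
    angular_velocity: float  # rad/s, clockwise positive
    attack_mode: int         # 1: attack, 0: do not attack
    attack_enemy_id: int     # index of the targeted enemy UAV

def decode_action(raw, accel_levels=(-5.0, 0.0, 5.0),
                  angle_step_deg=10.0, dt=1.0) -> UAVAction:
    """Map (accel_index, angle_index, attack_mode, enemy_id) to Table 2."""
    accel_idx, angle_idx, attack_mode, enemy_id = raw
    # Index n commands a 10*n degree clockwise rotation over one interval dt.
    omega = math.radians(angle_step_deg * angle_idx) / dt
    return UAVAction(accel_levels[accel_idx], omega,
                     int(attack_mode), int(enemy_id))
```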

3.3. Reward Function

Considering the collaborative requirements and mission objectives of the multi-UAV confrontation, the local reward of each UAV and the global reward are designed according to the confrontation result of each round. The local reward of each UAV is designed as provided in Table 3, where UAVs with different roles, namely combat and reconnaissance, are considered individually [23]. The local reward is mainly determined by the impact of the UAV's actions on the environment, while avoiding the problem of reward sparsity.
The global reward is designed as listed in Table 4. When designing the global reward, the comprehensive impact of one round of the multi-UAV confrontation on the environment is mainly considered, so as to prevent the intelligent agents from falling into a local optimum during the exploration process. In particular, the round deduction in Table 4 refers to a null round in which nothing happens.
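For illustration, the global reward of Table 4 can be implemented as a simple lookup; the event keys are assumed names.

```python
# Global reward r_g per round outcome, following Table 4.
GLOBAL_REWARD = {
    "complete_victory": 200,
    "victory": 100,
    "complete_defeat": -200,
    "defeat": -100,
    "draw": 0,
    "round_deduction": -50,  # a null round in which nothing happens
}

def global_reward(event: str) -> int:
    """Return the shared global reward for one confrontation round."""
    return GLOBAL_REWARD[event]
```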

4. Proposed DP-MADDPG Algorithm for Multi-UAV Confrontation

In this section, the DP-MADDPG algorithm is proposed for the multi-UAV confrontation. First of all, the process of the MADRL-based multi-UAV confrontation is provided. Subsequently, the fundamental MADDPG algorithm is analyzed. Finally, the DP-MADDPG algorithm is proposed, where the decomposed and PER mechanisms are explained in detail.

4.1. MADRL-Based Multi-UAV Confrontation

Based on modeling the multi-UAV confrontation as a partially observable Markov game, the distributed intelligent decision-making of multiple UAVs is resolved by a multi-agent DRL (MADRL) algorithm. The process of the MADRL-based multi-UAV confrontation is shown in Figure 3, which consists of two main phases, namely training and running.

4.1.1. Training

The MADRL algorithm is executed to explore the confrontation environment and train the local actor networks as the decision-making policies. At the end of the training phase, each UAV obtains a convergent actor network.

4.1.2. Running

Each UAV i determines its action $a_i$, including the acceleration, angular velocity, attack mode, and target enemy ID, at each time step according to its practical observation $o_i$ and local actor network $\mu_i(a_i|o_i)$.
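A minimal PyTorch sketch of this distributed execution step is shown below; the observation encoding and any exploration noise used during training are omitted, and the tensor interface of the actor network is an assumption.

```python
import torch

@torch.no_grad()
def select_action(actor: torch.nn.Module, obs) -> torch.Tensor:
    """Map the local observation o_i to an action a_i = mu_i(o_i)."""
    obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)  # batch of 1
    return actor(obs_t).squeeze(0)
```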

4.2. Fundamental MADDPG Algorithm

In this work, the policy-gradient MADRL algorithm MADDPG is adopted as the benchmark for the following reasons. (1) MADDPG adopts the centralized training and distributed execution framework. In centralized training, the full state of the environment can be obtained, which helps in dealing with the non-stationarity of the multi-UAV scenario. (2) By exploiting MADDPG, each UAV has an individual actor–critic structure, which is more flexible. Namely, each UAV has its actor network as the policy for decision-making based on the local observation and its critic network as the value function evaluating the decisions for policy optimization. (3) MADDPG is based on a deterministic policy, which is more efficient than a stochastic policy and can deal with continuous action spaces.
In the confrontation scenario, it is assumed that there are N UAVs with local actor networks $\mu = \{\mu_1, \dots, \mu_N\}$ parameterized by $\theta = \{\theta_1, \dots, \theta_N\}$. The policy gradient for training the actor network is
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x, a \sim D}\Big[\nabla_{\theta_i} \mu_i(a_i|o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\Big], \tag{7}$$
where $o_i$ is the local observation of UAV i, and $x = (o_1, \dots, o_N)$ is the set of local observations. $a_i$ is the local action determined by its actor network, $a_i = \mu_i(o_i)$. D is the experience replay buffer storing tuples $(x, a, r, x')$, where r is the set of local rewards, and $x'$ is the set of local observations at the next timestep. $Q_i^{\mu}$ is the critic network of UAV i with parameters $\psi_i$. Based on the critic network, the loss function is defined by
$$L(\psi_i) = \mathbb{E}_{x, a, r, x'}\Big[\big(Q_i^{\mu}(x, a_1, \dots, a_N) - y_i\big)^2\Big], \tag{8}$$
and
$$y_i = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \dots, a_N')\big|_{a_i' = \mu_i'(o_i')}, \tag{9}$$
where $r_i$ is the local reward of UAV i, and $\gamma$ is the discount factor. $Q_i^{\mu'}$ is the target critic network with parameters $\psi_i'$, and $\mu_i'$ is the target actor network with parameters $\theta_i'$. The target networks are introduced by the double-network structure for stability, as shown in Figure 4. The parameters of the target networks are updated through soft updates by
$$\theta_i' = \tau \theta_i + (1 - \tau)\, \theta_i', \tag{10}$$
and
$$\psi_i' = \tau \psi_i + (1 - \tau)\, \psi_i', \tag{11}$$
where $\tau$ is the update rate and $\tau \ll 1$.
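The soft update of Eqs. (10) and (11) can be written in PyTorch as below, a minimal sketch with τ defaulting to the 0.01 value in Table 5.

```python
import torch

def soft_update(target_net: torch.nn.Module, main_net: torch.nn.Module,
                tau: float = 0.01) -> None:
    """theta' <- tau * theta + (1 - tau) * theta', Eqs. (10)-(11)."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), main_net.parameters()):
            tgt.copy_(tau * src + (1.0 - tau) * tgt)
```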

4.3. Proposed DP-MADDPG Algorithm

Considering the collaboration required in the multi-UAV confrontation, UAVs need not only real-time exchange of battlefield situation information but also optimal cooperative decision-making for their actions. Thus, for local decision-making, the MADDPG is further improved with the decomposed and PER mechanisms.

4.3.1. Decomposed Mechanism

To resolve the issue of data coupling among multiple UAVs, the decomposed mechanism is developed to modify the fundamental structure of the MADDPG, as depicted in Figure 5. In the centralized training and distributed execution framework, each UAV holds a local actor neural network as the local policy for decision-making.
For training the local policies, a local critic neural network $Q_{\psi_i}^{\mu}$ parameterized by $\psi_i$ is built for each UAV i, $i \in \{1, \dots, N\}$, and a global critic neural network $Q_{\Psi}^{g}$ is built for the environment. For UAV i, the selected action $a_i$ is passed not only to the global critic network $Q_{\Psi}^{g}(s, a_1, \dots, a_N)$ but also to its local critic network $Q_{\psi_i}^{\mu}(o_i, a_i)$. By integrating the dual critic networks into the MADDPG, UAVs can maximize both local and global rewards. On the one hand, this drives the action selections of the actor networks in the direction of maximizing both local and global rewards and overcomes the issue of the local optimum. On the other hand, it avoids the generation of a dominant decision-making policy incurred by the case of a single global critic network, which makes the method more adaptive to different environments.
Based on the modeling, the policy gradient of UAV i in DP-MADDPG is expressed as
$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{o_i, a_i \sim D}\Big[\nabla_{\theta_i} \mu_i(a_i|o_i)\, \nabla_{a_i} Q_{\psi_i}^{\mu}(o_i, a_i)\Big] + \mathbb{E}_{s, a \sim D}\Big[\nabla_{\theta_i} \mu_i(a_i|o_i)\, \nabla_{a_i} Q_{\Psi}^{g}(s, a_1, \dots, a_N)\Big], \tag{12}$$
where $s = (o_1, \dots, o_N)$ is the global state that consists of the local observation information of all UAVs, $a_i = \mu_i(o_i)$ is the local action of UAV i determined by the local observation $o_i$ according to the policy $\mu_i$, and D is the experience replay buffer.
For the training of the global critic network $Q_{\Psi}^{g}$, the loss function is defined by
$$L(\Psi) = \mathbb{E}_{(s, a, r_g, s')}\Big[\big(Q_{\Psi}^{g}(s, a_1, \dots, a_N) - y_g\big)^2\Big], \tag{13}$$
and
$$y_g = r_g + \gamma\, Q_{\Psi'}^{g}(s', a_1', \dots, a_N')\big|_{a_i' = \mu_i'(o_i')}, \tag{14}$$
where $\mu' = \{\mu_1', \dots, \mu_N'\}$ are the target networks of the local actor networks parameterized by $\theta' = \{\theta_1', \dots, \theta_N'\}$, $r_g$ is the global reward, and $\gamma$ is the discount factor. $Q_{\Psi'}^{g}$ is the target network of the global critic network. A target network has the same structure as the main network and is periodically updated with the parameters of the main network to improve the training stability.
For the training of the local critic network $Q_{\psi_i}^{\mu}$, the loss function is defined by
$$L(\psi_i) = \mathbb{E}_{(o, a, r_l, o')}\Big[\big(Q_{\psi_i}^{\mu}(o_i, a_i) - y_l^{i}\big)^2\Big], \tag{15}$$
and
$$y_l^{i} = r_l^{i} + \gamma\, Q_{\psi_i'}^{\mu'}(o_i', a_i')\big|_{a_i' = \mu_i'(o_i')}, \tag{16}$$
where $r_l^{i}$ is the local reward of UAV i, and $Q_{\psi_i'}^{\mu'}$ is the target network of the local critic network.
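A PyTorch sketch of the actor update implied by Eq. (12) is given below. Maximizing both critic values is implemented as gradient descent on their negated sum; the concatenated-tensor interfaces of the critics are assumptions and may differ from the networks in Figure 6.

```python
import torch

def dp_maddpg_actor_loss(actor_i, local_critic_i, global_critic,
                         obs_i, obs_all, actions_all, agent_idx):
    """Actor loss of UAV i under the decomposed mechanism, cf. Eq. (12).

    obs_i: (batch, obs_dim) local observations of UAV i.
    obs_all / actions_all: lists of per-UAV observation / action batches.
    """
    # Re-evaluate UAV i's action with its current actor so gradients reach theta_i.
    a_i = actor_i(obs_i)
    actions = list(actions_all)
    actions[agent_idx] = a_i

    # Local term: Q_psi_i(o_i, a_i).
    q_local = local_critic_i(torch.cat([obs_i, a_i], dim=-1))
    # Global term: Q_Psi^g(s, a_1, ..., a_N) with s = (o_1, ..., o_N).
    q_global = global_critic(torch.cat(list(obs_all) + actions, dim=-1))

    # Ascend on both critics == descend on the negated sum.
    return -(q_local.mean() + q_global.mean())
```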

4.3.2. Prioritized Experience Replay (PER) Mechanism

By using the PER mechanism, the sampling efficiency from the experience replay buffer can be dramatically improved. The principle of PER is to break the uniform distribution of the experience samples in the experience replay buffer D and assign a larger sampling weight to the samples with a higher learning efficiency. Specifically, the mechanism adopts the temporal difference (TD) error $\delta$ to evaluate the importance of experience samples and uses rank-based random prioritization together with TD-error-based importance sampling to draw the experience samples.
Based on the random priority, the sampling probability of an experience sample i is calculated as
$$P(i) = \frac{p_i^{\alpha}}{\sum_{k \in D} p_k^{\alpha}}, \tag{17}$$
where $\alpha$ controls how much the priority is amplified and is set to 0.6. $p_i = 1/rank(i)$, where $rank(i)$ is the rank of the TD-error $|\delta_i|$ of sample i when the samples in the experience replay buffer are sorted in descending order of $|\delta_i|$. The TD-error $\delta_i$ is calculated as
$$\delta_i = y_g^{(i)} - Q_{\Psi}^{g(i)}(s, a_1, \dots, a_N). \tag{18}$$
The importance sampling weight is determined by
$$\omega_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}, \tag{19}$$
where N is the size of the experience replay buffer, and $\beta$ is an adjusting factor set to 0.4. The PER mechanism is introduced into MADDPG to draw the samples from the experience replay buffer and form the mini-batch for the neural network training. For the loss functions (13) and (15), the loss of each sample is weighted by the importance sampling weight $\omega_i$.
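A NumPy sketch of the rank-based sampling of Eqs. (17)–(19) is given below; the max-normalization of the weights is a common practical choice and an assumption here, and real implementations usually maintain a priority data structure instead of re-sorting the whole buffer.

```python
import numpy as np

def per_sample(td_errors: np.ndarray, batch_size: int,
               alpha: float = 0.6, beta: float = 0.4):
    """Sample indices and importance weights by rank-based PER, Eqs. (17)-(19)."""
    n = len(td_errors)
    # rank(i): 1-based position of |delta_i| sorted in descending order.
    ranks = np.empty(n, dtype=np.int64)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, n + 1)

    priorities = (1.0 / ranks) ** alpha        # p_i^alpha with p_i = 1/rank(i)
    probs = priorities / priorities.sum()      # P(i), Eq. (17)

    idx = np.random.choice(n, size=batch_size, p=probs, replace=False)
    weights = (1.0 / (n * probs[idx])) ** beta  # omega_i, Eq. (19)
    weights /= weights.max()                    # normalize so weights <= 1 (assumption)
    return idx, weights
```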
The pseudocode of the DP-MADDPG algorithm is listed in Algorithm 1.
Algorithm 1 DP-MADDPG Algorithm
 1: Input: MaxEpisode, MaxStep, N
 2: Initialize the main and target global critic networks $Q_{\Psi}^{g}$, $Q_{\Psi'}^{g}$
 3: Initialize the main and target local actor networks $\mu_{\theta_i}$, $\mu_{\theta_i'}$
 4: Initialize the main and target local critic networks $Q_{\psi_i}^{\mu}$, $Q_{\psi_i'}^{\mu'}$
 5: for episode = 1, …, MaxEpisode do
 6:   Initialize the environment
 7:   for t = 1, …, MaxStep do
 8:     Each UAV i executes action $a_i = \mu_{\theta_i}(o_i)$, i = 1, …, N
 9:     Receive the global reward $r_g$ and local rewards $r_l$, and update the global state $s \to s'$
10:     Store $(s, a, r_g, r_l, s')$ into the experience replay buffer D
11:     /* Global critic network training */
12:     Sample a mini-batch of S transitions $(s^j, a^j, r_g^j, s'^j)$ from D by PER
13:     Calculate $y_g$ by (14)
14:     Update the global critic by minimizing $\frac{1}{S}\sum_{j=1}^{S} \omega_j \big(y_g^j - Q_{\Psi}^{g}(s^j, a_1^j, \dots, a_N^j)\big)^2$
15:     Update the target global critic by $\Psi' \leftarrow \tau \Psi + (1 - \tau)\Psi'$
16:     /* Local actor and critic network training */
17:     for UAV i = 1, …, N do
18:       Sample a mini-batch of K transitions $(s^j, a^j, r_l^j, s'^j)$ from D by PER
19:       Calculate $y_l^{i}$ by (16)
20:       Update the local critic by minimizing $\frac{1}{K}\sum_{j=1}^{K} \omega_j \big(y_l^{i,j} - Q_{\psi_i}^{\mu}(o_i^j, a_i^j)\big)^2$
21:       Update the actor network by $\theta_i \leftarrow \theta_i + \nabla J(\theta_i)$ according to (12)
22:       Update the target actor and critic networks by $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i'$ and $\psi_i' \leftarrow \tau \psi_i + (1 - \tau)\psi_i'$
23:     end for
24:   end for
25: end for
26: Output: model parameters of the actor networks $\theta_i$

5. Performance Evaluation

The experimental environment is set up with Windows 10, an Intel(R) Core(TM) i7-10875H CPU, 16 GB RAM, and an NVIDIA GeForce RTX 2060 GPU. The software framework is based on Python 3.6 and PyTorch 1.1. Based on the experimental settings in Section 5.1, the performance is analyzed from the perspectives of the convergence rate and confrontation results.

5.1. Experimental Settings

The simulation platform of air combat is set up based on the MaCA released by the Key Laboratory of Cognitive and Intelligent Technology of China Electronics Technology Group Corporation [23]. The battlefield of the air combat scenario is 10 km × 10 km × 1 km. The blue and red parties both have 10 UAVs, including eight combat UAVs and two reconnaissance UAVs. Combat UAVs and reconnaissance UAVs have different radar detection ranges: the detection range of a combat UAV is 700 m × 120°, and that of a reconnaissance UAV is 2000 m × 360°. The number of missiles carried by a combat UAV is 4. The sector attack area of the attack model is (400 m, 45°). The initial speed of the UAVs is 10 m/s. The upper limits of the speed $v_i^{max}$, acceleration $a_i^{max}$, and angular velocity $\omega_i^{max}$ are 50 m/s, 5 m/s², and π/12 rad/s, respectively. The safe distance between UAVs is 20 m. It is worth noting that MaCA is a discrete-event simulation platform set up mainly for multi-UAV confrontation and DRL-based strategy analysis, wherein weather conditions such as wind and rain are not considered in the current version. The network structures of the local actor network, local critic network, and global critic network are shown in Figure 6. The hyperparameter settings for the model training of DP-MADDPG are given in Table 5.

5.2. Strategy Replay

To visualize the strategies learned by the UAVs, it is assumed that the red party in the multi-UAV confrontation scenario exploits the DP-MADDPG algorithm for intelligent decision-making, and the blue party adopts the fundamental distance-prioritized rule. The distance-prioritized rule refers to UAVs attacking the enemies with the nearest distances and fleeing in the opposite direction.
The actions of the DP-MADDPG-empowered red-party UAVs are replayed in Figure 7, which depicts the process of the UAVs in the red party gradually learning the confrontation strategies, including tracking, maneuver, deception, and encirclement.
(a) Tracking: From Figure 7a, it can be seen that the UAV in the red party has learned a tracking strategy. When an individual red UAV discovers an enemy UAV, instead of taking the initiative to attack, it adopts a tracking strategy and continuously tracks the enemy before launching an attack at the right time.
(b) Maneuver: From Figure 7b, it can be seen that a red UAV is being pursued by the blue party and may be in danger at the next moment. The red UAV promptly adopts the maneuvering strategy to avoid the enemy's attack and occupy an advantageous position.
(c) Deception: From Figure 7c, it can be seen that a red UAV intentionally appears in the forward direction of the blue UAVs, enticing the enemies to pursue it. Its allies actively occupy the advantageous attacking positions on the left side to launch precise strikes in a way that minimizes losses.
(d) Encirclement: From Figure 7d, it can be seen that the red UAVs encircle the enemy and relieve the reconnaissance UAV from being pursued by the blue.
Thus, Figure 7 unveils that the proposed DP-MADDPG algorithm enables the UAVs to learn effective confrontation strategies. Furthermore, Figure 7 also demonstrates the cooperative behavior of multiple UAVs in the red party.

5.3. Convergence Analysis

To evaluate the performance of the proposed DP-MADDPG algorithm, the traditional MADDPG algorithm and the independent learning DDPG (ILDDPG) are adopted as benchmarks. ILDDPG refers to the case in which each UAV exploits the single-agent DDPG algorithm for decision-making. It is assumed that the red party in the multi-UAV confrontation scenario exploits the DRL algorithms for intelligent decision-making, including DP-MADDPG, MADDPG, and ILDDPG, while the blue party adopts the distance-prioritized rule in Case 1 and the ILDDPG in Case 2.

5.3.1. Case 1: Confrontation against the Distance-Prioritized Rule

Figure 8a shows the average reward per training round, where the red party adopts the DP-MADDPG, MADDPG, and ILDDPG, respectively, and the blue party adopts the distance-prioritized rule. The total number of training rounds is 25,000. Figure 8a indicates that the proposed DP-MADDPG converges faster and finally converges to a higher value. More details about the training process are provided in Figure 8b. In the experiment, the mean value over the last 100 training rounds is regarded as the convergent reward. The rewards of DP-MADDPG, MADDPG, and ILDDPG converge to 695.5, 592.9, and 344.8, respectively, which indicates that DP-MADDPG improves the convergent reward by 17.3% and 101.7% compared to MADDPG and ILDDPG.

5.3.2. Case 2: Confrontation against ILDDPG

To further analyze the performance of the proposed DP-MADDPG algorithm, it is assumed that the red party still exploits the DRL algorithms, while the blue party adopts the ILDDPG instead of the distance-prioritized rule. The average reward per training round is shown in Figure 9a. The total number of training rounds is 25,000. Details of the training process are provided in Figure 9b. The average rewards of DP-MADDPG, MADDPG, and ILDDPG converge to 795.2, 711.4, and 362.8, respectively, which indicates that DP-MADDPG improves the convergent reward by 11.8% and 119.2% compared to MADDPG and ILDDPG.
In both Case 1 and Case 2, DP-MADDPG and MADDPG converge faster to higher rewards than the ILDDPG, which indicates that collaboration is necessary for the multi-UAV confrontation. Moreover, the proposed DP-MADDPG outperforms the fundamental MADDPG. The reason is that the DP-MADDPG algorithm adopts the decomposed and PER mechanisms, which can learn the optimal strategy from the experience buffer more accurately and efficiently. From Table 3 and Table 4, the rewards are determined by the confrontation results. Thus, a higher convergent reward implies an optimal confrontation strategy leading to a higher win rate in the multi-UAV confrontation. The confrontation analysis is thus conducted in the next subsection.

5.4. Confrontation Analysis

To analyze the confrontation results, the win rate of the DRL-empowered red party is evaluated, with the blue party adopting either the distance-prioritized rule or the ILDDPG. The multi-UAV confrontation is tested 1000 times, and the win rates are averaged over the 1000 tests.
The comparison of the win rates in the two confrontation scenarios is shown in Figure 10a and Figure 10b, respectively. Figure 10a illustrates that the win rate of the proposed DP-MADDPG is 96%, which is higher than the 91.2% and 78.6% achieved by MADDPG and ILDDPG in the confrontation against the distance-prioritized rule. Meanwhile, in the confrontation against the ILDDPG in Figure 10b, the win rates of the proposed DP-MADDPG, MADDPG, and ILDDPG are 80.5%, 70.1%, and 41.4%, respectively.
The results in Figure 10a,b unveil that, against the distance-prioritized rule, the DRL-empowered party can achieve a higher win rate, which indicates that the DRL algorithms can optimize the decision-making. Furthermore, DP-MADDPG and MADDPG obtain higher win rates than ILDDPG in both scenarios. The reason is that ILDDPG is a single-agent DRL algorithm, wherein each UAV only has its local observation and optimizes its own policy accordingly. Meanwhile, with the multi-agent DRL algorithms, the UAVs can share the global state and optimize the cooperative policy to improve the win rate. In terms of the multi-agent DRL algorithms, the proposed DP-MADDPG outperforms the traditional MADDPG. This is because the proposed DP-MADDPG introduces the local critic network, and each UAV adopts the decomposed mechanism to maximize the global and local rewards at the same time, which avoids the generation of unstable policies and ensures the completion of the confrontation mission of the UAVs in the red party.

5.5. Discussions and Future Work

Multi-UAV confrontation, as a main form of intelligent air combat, plays an increasingly important role on battlefields. For example, the US Office of Naval Research launched the LOCUST (LOw-Cost UAV Swarming Technology) program in 2015, planning to launch multiple UAVs to complete aerial attack and defense missions. Europe has also been studying the FCAS (Future Combat Air System) as a key instrument for ensuring European autonomy and sovereignty in defense and security. Multi-UAV confrontation relies on the cooperation of the UAV swarm, which means that UAVs with intelligence and computing capabilities cooperate to fulfill situational awareness, resource allocation, and adversarial task completion. Due to resource constraints and the drastic changes on battlefields, intelligent decision-making is critical for multi-UAV cooperative confrontation [22]. Traditional multi-UAV cooperation and confrontation strategies face the technical challenges of weak coordination, high computational complexity, and low accuracy, and thus cannot meet the real-time requirements of air combat [20]. Therefore, it is necessary to combine artificial intelligence (AI) with military technologies such as multi-UAV confrontation and to design intelligent decision-making algorithms suitable for air combat.
AI technology can conduct battlefield deduction, enabling UAVs to autonomously make air combat decisions and complete adversarial missions at minimal cost. In future intelligent air combat, the critical missions in the multi-UAV confrontation include breaking through enemy defenses, carrying out precise target strikes, and intercepting incoming UAVs. Multi-UAV confrontation is a tactical-level problem, and air combat decision-making is crucial for multi-UAV confrontation missions. MADRL is a branch of AI developed for multi-agent systems and intelligent decision-making. In this work, a MADRL algorithm, DP-MADDPG, is proposed to resolve the issues in the multi-UAV confrontation, including multi-UAV autonomous decision-making, distributed UAV swarm cooperation, and multi-UAV situation assessment. Compared to the benchmark MADDPG algorithm, the convergence rate and reward of the DP-MADDPG are improved, leading to an increased win rate in the confrontation.
Nonetheless, there are still some limitations to this work. In the confrontation scenario, the effects of aerial obstacles, interference, and weather events are not considered. In the future, research efforts will be devoted to overcoming these limitations and extending the work to more realistic confrontation scenarios. Moreover, an a priori tactical rule database can be added to the framework to reduce ineffective experience exploration and improve the decision-making efficiency of multiple UAVs.

6. Conclusions

To overcome the technical challenges of intelligent decision-making in the multi-UAV confrontation, a DP-MADDPG algorithm was proposed in this paper. Specifically, the decision-making problem was formulated as a partially observable Markov game. Subsequently, the DP-MADDPG algorithm was proposed as the solution by integrating the decomposed and PER mechanisms into the traditional MADDPG. The decomposed mechanism improved the MADDPG framework with local and global dual critic networks to avoid the convergence to a local optimum and a single dominant policy. The PER mechanism optimized the sampling efficiency from the experience replay buffer. Simulation results indicate that the integration of the decomposed and PER mechanisms improved the convergence rate and the reward value of the proposed DP-MADDPG compared to MADDPG and ILDDPG. Furthermore, in the confrontation against the distance-prioritized rule-empowered blue party, the win rate of the DP-MADDPG-empowered red party improved to 96% compared to the 91.2% and 78.6% rates of MADDPG and ILDDPG. In the confrontation against the ILDDPG-empowered blue party, the win rate of the DP-MADDPG-empowered red party improved to 80.5% compared to the 70.1% and 41.4% rates of MADDPG and ILDDPG.

Author Contributions

Conceptualization, J.Y. and X.Y.; methodology, X.Y. and T.Y.; software, X.Y.; validation, J.Y.; formal analysis, T.Y.; resources, X.Y.; data curation, X.Y.; writing—original draft preparation, J.Y. and X.Y.; writing—review and editing, T.Y.; visualization, X.Y.; supervision, T.Y.; project administration, T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62101373.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Xinwei Yang was employed by the company Dongguan Power Supply Bureau of Guangdong Power Grid Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, F.; Dong, X.; Yu, J.; Hua, Y.; Li, Q.; Ren, Z. Distributed Nash Equilibrium Seeking of N-Coalition Noncooperative Games with Application to UAV Swarms. IEEE Trans. Netw. Sci. Eng. 2022, 9, 2392–2405.
  2. Zhou, Y.; Rao, B.; Wang, W. UAV Swarm Intelligence: Recent Advances and Future Trends. IEEE Access 2020, 8, 183856–183878.
  3. Gupta, L.; Jain, R.; Vaszkun, G. Survey of Important Issues in UAV Communication Networks. IEEE Commun. Surv. Tutor. 2016, 18, 1123–1152.
  4. Zhou, K.; Wei, R.; Zhang, Q.; Xu, Z. Learning System for Air Combat Decision Inspired by Cognitive Mechanisms of the Brain. IEEE Access 2020, 8, 8129–8144.
  5. Wang, R.; Gao, Z. Research on Decision in Air Combat Simulation Using Maneuver Library. Flight Dyn. 2009, 27, 72–77.
  6. Falomir, E.; Chaumette, S.; Guerrini, G. A Mobility Model Based on Improved Artificial Potential Fields for Swarms of UAVs. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 8499–8504.
  7. Ha, J.S.; Chae, H.J.; Choi, H.L. A Stochastic Game-Theoretic Approach for Analysis of Multiple Cooperative Air Combat. In Proceedings of the American Control Conference (ACC), Chicago, IL, USA, 1–3 July 2015; pp. 3728–3733.
  8. Li, Y.; Li, J.; Liu, C.; Li, J.; Xin, Z.; Chen, Z. An Auction-Based Attack-Defense Decision-Making Method for UAV Air Combat. In Proceedings of the IEEE International Conference on Unmanned Systems (ICUS), Guangzhou, China, 28–30 October 2022; pp. 902–909.
  9. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey Wolf Optimizer. Adv. Eng. Softw. 2014, 69, 46–61.
  10. Zuo, J.; Liu, Z.; Chen, J.; Li, Z.; Li, C. A Multi-Agent Cluster Cooperative Confrontation Method Based on Swarm Intelligence Optimization. In Proceedings of the IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Nanchang, China, 26–28 March 2021; pp. 668–672.
  11. Duan, H.; Lei, Y.; Xia, J.; Deng, Y.; Shi, Y. Autonomous Maneuver Decision for Unmanned Aerial Vehicle via Improved Pigeon-Inspired Optimization. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 3156–3170.
  12. Duan, H.; Qiao, P. Pigeon-Inspired Optimization: A New Swarm Intelligence Optimizer for Air Robot Path Planning. Int. J. Intell. Comput. Cybern. 2014, 7, 24–37.
  13. Li, Y.; Han, W.; Wang, Y. Deep Reinforcement Learning with Application to Air Confrontation Intelligent Decision-Making of Manned/Unmanned Aerial Vehicle Cooperative System. IEEE Access 2020, 8, 67887–67898.
  14. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971.
  15. Nguyen, T.T.; Nguyen, N.D.; Nahavandi, S. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Trans. Cybern. 2020, 50, 3826–3839.
  16. Xuan, S.; Ke, L. UAV Swarm Attack-Defense Confrontation Based on Multi-Agent Reinforcement Learning. In Advanced Guidance, Navigation and Control: Proceedings of the International Conference on Guidance, Navigation and Control (ICGNC), Tianjin, China, 23–25 October 2020; Springer: Berlin/Heidelberg, Germany, 2022; pp. 5599–5608.
  17. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6382–6393.
  18. Li, S.; Jia, Y.; Yang, F.; Qin, Q.; Gao, H.; Zhou, Y. Collaborative Decision-Making Method for Multi-UAV Based on Multiagent Reinforcement Learning. IEEE Access 2022, 10, 91385–91396.
  19. Jiang, T.; Zhuang, D.; Xie, H. Anti-Drone Policy Learning Based on Self-Attention Multi-Agent Deterministic Policy Gradient. In Proceedings of the International Conference on Autonomous Unmanned Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 2277–2289.
  20. Zhang, T.; Chai, L.; Wang, S.; Jin, J.; Liu, X.; Song, A.; Lan, Y. Improving Autonomous Behavior Strategy Learning in an Unmanned Swarm System Through Knowledge Enhancement. IEEE Trans. Reliab. 2022, 71, 763–774.
  21. Yang, X.; Xue, X.; Yang, J.; Hu, J.; Yu, T. Decomposed and Prioritized Experience Replay-Based MADDPG Algorithm for Multi-UAV Confrontation. In Proceedings of the International Conference on Ubiquitous Communication (Ucom), Xi'an, China, 7–9 July 2023; pp. 292–297.
  22. Wang, B.; Li, S.; Gao, X.; Xie, T. Weighted Mean Field Reinforcement Learning for Large-Scale UAV Swarm Confrontation. Appl. Intell. 2023, 53, 5274–5289.
  23. Key Laboratory of Cognitive and Intelligent Technology of China Electronics Technology Group Corporation. MaCA Specifications; China Electronics Technology Group Corporation: Beijing, China, 2019; pp. 1–20.
Figure 1. Diagram of multi-UAV confrontation environment.
Figure 2. Attack–target pair in the multi-UAV confrontation.
Figure 3. MADRL-enabled intelligent decision-making in multi-UAV confrontation.
Figure 4. Update process of double actor–critic networks.
Figure 5. Framework modification with decomposed local critic network and global critic network: (a) framework of fundamental MADDPG, (b) framework of proposed DP-MADDPG.
Figure 6. Structures of local actor, local critic, and global critic networks.
Figure 7. Confrontation strategies learned by DP-MADDPG-empowered red-party UAVs, including (a) tracking; (b) maneuver; (c) deception; (d) encirclement.
Figure 8. Confrontation against distance-prioritized rule: Comparison of the average reward of training among DP-MADDPG, MADDPG, and ILDDPG algorithms.
Figure 9. Confrontation against ILDDPG: Comparison of the average reward of training among DP-MADDPG, MADDPG, and ILDDPG algorithms.
Figure 10. Comparison of multi-UAV confrontation results among DP-MADDPG, MADDPG, and ILDDPG algorithms.
Table 1. Observation space of UAV.
Information | Description
Own | position (x_i, y_i) (unit: (m, m))
Own | type_id (0: combat UAV, 1: reconnaissance UAV)
Own | speed v_i (unit: m/s)
Own | ammunition M
Own | heading φ_i (unit: °)
Own | attacking zone (R, β) (unit: (m, °))
Own | detection radius D (unit: m)
Ally | ally_1: relative position, relative heading, relative speed, type_id
Ally | ally_2: relative position, relative heading, relative speed, type_id
Ally | …
Enemy | enemy_1: relative position, relative heading, relative speed, type_id
Enemy | enemy_2: relative position, relative heading, relative speed, type_id
Enemy | …

Table 2. Action space of UAV.
Action | Description
a | acceleration (continuous) (unit: m/s²)
ω | angular velocity (continuous) (unit: rad/s)
attack_mode | attacking (1) or not (0) (discrete)
attack_enemy_id | target enemy ID (discrete)

Table 3. Local reward of UAV.
Type | Event | Reward
Combat | Legal attack action | 5
Combat | Illegal attack action | −5
Combat | Detection of combat UAV | 5
Combat | Detection of reconnaissance UAV | 10
Combat | Successful in combating combat UAV | 10
Combat | Unsuccessful in combating combat UAV | −5
Combat | Successful in combating reconnaissance UAV | 20
Combat | Unsuccessful in combating reconnaissance UAV | −10
Combat | Be destroyed | −10
Reconnaissance | Detection of combat UAV | 5
Reconnaissance | Detection of reconnaissance UAV | 10
Reconnaissance | Be destroyed | −20
General | Survive each round | 2

Table 4. Global reward.
Event | Reward
Complete victory | 200
Victory | 100
Complete defeat | −200
Defeat | −100
Draw | 0
Round deduction | −50

Table 5. Hyperparameter settings.
Hyperparameter | Value
MaxEpisode | 25,000
MaxStep | 2000
Capacity of replay buffer | 10^5
Batch size | 64
Discount factor | 0.95
Soft update rate | 0.01
Actor learning rate | 0.001
Critic learning rate | 0.001
