1. Introduction
In recent years, with the rapid development of UAV technology, multi-UAV systems have been increasingly widely used in both military and civilian fields. In the military domain in particular, the coordinated operation of multiple drones can significantly improve combat efficiency and effectiveness. However, the target allocation problem in multi-aircraft cooperative operations is an NP-hard combinatorial optimization problem, because the decision space grows exponentially with the task size. Solving such problems is crucial to improving the combat efficiency of drone groups, but it also faces substantial technical challenges.
Currently, target allocation algorithms for multi-UAV cooperative operations fall into two basic types: centralized and distributed. Centralized target allocation algorithms make global optimization decisions through a central node to ensure the accuracy and efficiency of decision-making. In this mode, the central node receives situation information from the other UAVs during the iterative process and runs the target allocation algorithm to determine the optimal target combination scheme. Such algorithms usually use global or local optimization problem models and can be further subdivided into two categories. The first category comprises traditional optimization methods for small-scale problems, such as exhaustive search, integer linear programming [1], and graph-theoretic analysis [2], which focus on exact solutions and guarantee that the absolute optimal solution is found. The second category includes heuristic and meta-heuristic methods such as genetic algorithms and their derivatives [3,4], particle swarm optimization [5], and artificial immune algorithms [6]. These algorithms are known for their flexibility and scalability; they can effectively handle larger-scale problems and obtain approximately global or local optimal solutions within an acceptable time, making them suitable for complex and dynamically changing decision-making environments. Although centralized target allocation algorithms have been widely used, their dependence on the central node leads to extended execution time and reduced response speed. At the same time, the data processing load on the central node can easily cause information bottlenecks and single points of failure, degrading system robustness.
Distributed algorithms improve the robustness and scalability of the system through decentralized decision-making, but ensuring the consistency and quality of decisions remains the critical problem to be solved. In response to these challenges, researchers have proposed various solutions. For example, the Consensus-Based Bundle Algorithm (CBBA) [7,8,9] by Choi et al. effectively solves the multi-agent target allocation problem by combining auction and consensus mechanisms, but it falls short when multiple agents must be assigned to a single target. Since then, various improved algorithms based on the CBBA framework have been proposed. Li et al. [10] proposed a multi-target consensus-based bundle algorithm (MTCBBA) specifically for target allocation in multi-agent collaborative beyond-visual-range air combat. However, the algorithm adapts poorly to dynamic environments and relies on a collaboration mechanism, which limits its application in large-scale, complex air combat environments. Zhao et al. [11] developed a multi-UAV dynamic target allocation algorithm based on communication network node clustering (CU-CBBA). The node grouping and clustering strategy effectively alleviated the limitations of CBBA with respect to the communication network structure, but its high computational complexity resulted in significant allocation delays. These studies have significantly advanced distributed target allocation algorithms, but issues such as adaptability to dynamic environments and target importance evaluation remain the main challenges in this field.
With the continuous advancement of technology, more and more scholars have begun to adopt reinforcement learning methods to solve the target allocation problem in multi-UAV collaboration. Through adaptive learning mechanisms, drones can select the optimal attack target in a complex environment. For example, Li et al. [12] proposed a target allocation model based on the Actor–Critic structure to solve the multi-agent collaborative target allocation problem. In addition, Ma et al. [13] constructed a collaborative target allocation model based on multi-agent reinforcement learning and achieved optimal target allocation by combining local strategy scoring with centralized strategy reasoning. These reinforcement learning algorithms show significant speed advantages when dealing with large-scale and complex continuous state spaces. However, current reinforcement learning algorithms fall short in target importance evaluation, which may limit their effectiveness in improving the air combat victory rate and drone survivability.
At present, the attention mechanism is widely used in many fields, such as target detection, semantic communication, autonomous navigation, and multi-UAV collaborative task allocation, and has greatly improved performance in these tasks. Therefore, this paper considers both the problems faced by traditional air combat target allocation algorithms and the unique advantages of the attention mechanism for target importance assessment. A target allocation algorithm based on threat assessment and an attention mechanism is proposed; a high-precision air combat simulation environment is constructed; an 8vs8 multi-UAV collaborative confrontation is carried out; and the proposed method is compared with current mainstream target allocation algorithms to verify its effectiveness. The main contributions of this paper are summarized as follows:
- (1) A proximal policy optimization algorithm integrating threat assessment and an attention mechanism is proposed. The algorithm incorporates prior knowledge of air combat target allocation, introduces threat assessment and an attention mechanism under the reinforcement learning framework, and addresses the difficulty traditional algorithms have in assessing target importance.
- (2) On this basis, a dynamic reward function based on drone hit rate and missile benefit ratio is constructed to improve the algorithm's convergence speed and enhance its robustness and scalability across different battlefield environments and mission types.
- (3) A highly realistic air combat simulation scene is constructed using Unity3D, covering air combat elements such as aerodynamics, thermal imaging, radar systems, and missiles. A multi-aircraft air combat simulation is carried out in this environment, and the performance is compared with a rule-based finite state machine and traditional target allocation algorithms to verify the effectiveness of the proposed method.
2. Related Work
2.1. Proximal Policy Optimization
In reinforcement learning, the Proximal Policy Optimization (PPO) algorithm [14,15] belongs to the family of policy gradient algorithms. Its principle is to parameterize the policy and represent it through a parameterized linear function or neural network. An essential core of the PPO algorithm is importance sampling, which evaluates the difference between the new and old policies. Constraining an importance sampling ratio that is too large or too small limits the new policy and prevents the new and old policies from deviating too far from each other. The ratio between the new and old policies is shown in Equation (1), where π_θ(a_t | s_t) represents the probability that the current policy selects action a_t in a given state s_t, and π_θold(a_t | s_t) represents the old policy, that is, the probability that the policy at the time of the last policy update selects the action in the given state. Here, a_t represents the action taken by the agent at time t, and s_t represents the state of the agent at time t.
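In the standard PPO notation used above, the importance sampling ratio described by Equation (1) is commonly written as:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```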
Another core is gradient clipping. The objective function expression of the PPO algorithm is shown in Equation (2):
Among them, θ is the policy parameter, and Â_t is the advantage value of the action in the given state, estimated using generalized advantage estimation. The clip function acts as a clipping mechanism, mainly used for gradient clipping, to ensure the stability of the action probability distribution. When the policy is updated, the ratio is limited to between 1 − ε and 1 + ε, which effectively prevents large deviations during the policy update process and keeps policy changes within a limited range.
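The clipped surrogate objective described here, in its standard form consistent with Equation (2), is:

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]
```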
The PPO algorithm uses an Actor–Critic architecture. To enhance the agent's exploration ability, a policy entropy term is usually added to the loss function of the Actor network and multiplied by a preset coefficient, the entropy coefficient, which is typically set to 0.01. The introduction of policy entropy encourages the agent to explore a broader range of the action space, thereby avoiding premature convergence to suboptimal policies. The entropy of the policy is defined as shown in Equation (3).
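For a discrete action space, the policy entropy referred to in Equation (3) takes the standard form:

```latex
H\!\left(\pi_\theta(\cdot \mid s_t)\right) = -\sum_{a} \pi_\theta(a \mid s_t)\,\log \pi_\theta(a \mid s_t)
```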
The Critic network uses the TD error to update its network parameters, as defined in Equation (4), where r_t represents the immediate reward obtained by the agent from the environment at time t and measures the environment's feedback after executing action a_t in the current state s_t. The discount factor γ, usually between 0 and 1, controls how much weight the agent gives to future rewards: if γ is closer to 1, the agent pays more attention to long-term rewards; if γ is closer to 0, the agent pays more attention to immediate rewards. V(s_t) and V(s_{t+1}) represent the value estimates of the states at time t and time t + 1, respectively.
2.2. Attention Mechanism
The self-attention mechanism [16] uses the inherent information of the features to perform attention interactions. By introducing the self-attention mechanism, a neural network alleviates information overload and improves its accuracy and robustness. The calculation of the self-attention mechanism is divided into two steps: first, the attention weights between the vectors in the input sequence are calculated; then, these weights are used to compute a weighted average over the sequence. In this way, the self-attention mechanism can identify the connections between the parts of the sequence and improve the ability to identify critical information. Compared with the traditional attention mechanism, self-attention depends only on the input features and does not require external parameters or structures, making the calculation more efficient.
Figure 1 shows the principle of the self-attention mechanism, where x represents the input sequence data; the detailed calculation formula is shown in Equation (5).
Calculating self-attention involves three matrices: the query matrix Q, the key matrix K, and the value matrix V. They are obtained by multiplying the input with the corresponding weight matrices. First, the product of Q and the transpose of K is calculated; this result is divided by the square root of the dimension of Q and K, normalized by a softmax, and then multiplied by the value matrix V to obtain the self-attention value.
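The scaled dot-product self-attention described here is commonly written as follows, where d_k denotes the dimension of the query and key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```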
The multi-head attention mechanism enables the model to focus on information from multiple subspaces at different positions and merge this information to increase the weight of important information. Its calculation involves the outputs of multiple attention heads, and key features are strengthened by concatenating this information. The specific calculation is expressed in Equations (6) and (7), where W_i denotes the matrix of the linear transformation applied within each attention head.
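In the standard formulation consistent with Equations (6) and (7), each head applies its own linear projections and the head outputs are concatenated and projected once more:

```latex
\mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O}
```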
In current research, intelligent drones have significantly improved their efficiency in target detection, semantic communication, autonomous navigation, and multi-drone collaborative task allocation by integrating attention mechanisms. In 2023, Zhang et al. [17] developed a global–local feature guidance module that uses attention mechanisms to focus on specific target areas, thereby significantly improving the accuracy of drone image target detection. In addition, the graph attention exchange network proposed by Yun et al. [18] in 2021 effectively solved the problem of ultra-reliable low-latency air-to-ground communication for mobile ground users. The research of Liu et al. [19] enhanced the application of drones in autonomous navigation through the attention mechanism. At the same time, Wu et al. [20] introduced an attention mechanism into the ISOM algorithm, taking into account the flight distance and task execution time of drones, and further improved the efficiency of multi-drone collaborative task execution.
In designing the multi-UAV collaborative target allocation module, this paper applies the attention mechanism to assign weights to targets, thereby participating in the intelligent decision-making process of drones. This mechanism enables drones to focus on enemy aircraft with more significant potential threats, guiding the intelligent agent to implement the air combat strategy of “first enemy detection, first enemy strike”.
3. Methods
3.1. Algorithm Structure Design
Air combat decision-making is an essential challenge in the field of drone air combat games, and target assignment is a critical link in the decision-making process. At present, decision-making theory based on the OODA (observation, orientation, decision, action) cycle [21] is widely used in this field. UAVs make tactical decisions by assimilating battlefield environment information and combining it with their own status for strategic maneuvers.
Due to the complexity and continuity of air combat, this paper introduces a multi-UAV collaborative decision-making solution based on the Long Short-Term Memory–Proximal Policy Optimization algorithm (LSTM-PPO): LSTM is used to extract features from the packaged front-end state, and the PPO algorithm is used to select actions. The TAPPO algorithm is used to optimize target allocation in the drone decision-making process, thereby improving the survival rate in air combat. The algorithm framework mainly includes the TAPPO-Actor maneuver decision network, the LSTM-Critic network, the experience buffer, and the state transfer module. The detailed framework is shown in Figure 2.
The maneuver decision network selects appropriate maneuver strategies according to the current situation and the target. It interacts with the environment to generate the state and reward at the next moment. The state transfer module packages the current state, maneuver strategy, reward, and state at the next moment into a sequence of 16 steps and sends it to the experience pool. When the experience pool threshold is reached, the experience is used to participate in the agent training.
The TAPPO algorithm introduces threat assessment and a multi-head attention mechanism, based on the PPO algorithm, to solve the defects of the traditional target allocation algorithm in target importance assessment.
When the agent selects a target, the network must focus on the locally critical targets. This paper uses the drone state and the threat parameter vectors of the enemy drones to implement the attention mechanism through four attention heads and an additive model. Define x_1, …, x_N as the N input features, where each x_i represents the feature vector of a potential target. Given the query q and the inputs x, the i-th attention distribution α_i is calculated as the target weight value; α_i is defined in Equation (8). Here, s(x_i, q) is an attention scoring function used to calculate the matching degree between a given query q (the threat feature currently being focused on) and each target x_i. The scoring model is an additive model, and s(x_i, q) is defined in Equation (9).
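In standard additive-attention notation consistent with this description (the exact symbols in Equations (8) and (9) may differ from those used here), the target weight and scoring function can be written as:

```latex
\alpha_i = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(x_j, q)\big)}, \qquad
s(x_i, q) = v^{\top} \tanh\big(W x_i + U q\big)
```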
Among the weight matrices, W and U are trainable neural network parameters, and v is the global state information feature vector; this global information participates in the attention network to evaluate the attention value of each target aircraft. By applying softmax normalization to each element of the attention values, the corresponding target weight values are generated. The pseudo-code of the TAPPO algorithm is given in Figure 3 and Figure 4.
3.2. Network Structure Design
The LSTM-PPO algorithm [22] is used for maneuver decision-making to achieve multi-UAV collaborative confrontation. The algorithm uses the Actor–Critic distributed network framework and an LSTM network for state encoding and feature extraction. At each step, the agent fully considers the current state information and goals, outputs the reward-maximizing maneuver, and achieves the optimal solution for multi-UAV collaborative confrontation.
As shown in Figure 5, when designing the network structure, we designed different observation spaces for the Actor and Critic networks, with a partially observable state space for the Actor, to encourage the agent to fully exploit the exploration advantages of the PPO algorithm and thereby improve the scalability and robustness of the algorithm. However, the air battlefield changes rapidly, and the agent may fall into state lock, repeatedly executing the same maneuver in a new state. To avoid this problem, we propose a deep 2048-cell LSTM network to process real-time data. This design makes full use of historical state information, helps break the state-lock phenomenon, and improves the adaptability and decision-making performance of the agent in complex environments.
The Critic network adopts the design of a global observation space, which can guide the agent to find the optimal strategy and accelerate convergence. The global observation space enables the Critic network to comprehensively consider the information of the entire environment, including the state and goals of all agents, thereby providing a more comprehensive and accurate reward signal.
Global optimization in air combat is the process by which an intelligent agent searches for the optimal strategy in the observed state space. This process defines a probability distribution function that maps states to actions. To optimize this distribution so as to maximize the air combat victory rate, this paper introduces the TAPPO algorithm for threat assessment and target allocation, aiming to improve the survival rate and victory rate in air combat.
We improve the PPO algorithm by introducing a threat value and a multi-head attention mechanism on top of PPO-based intelligent decision-making to achieve target allocation. The Actor network structure of the TAPPO algorithm is shown in Figure 6.
The network structure of the TAPPO algorithm mainly consists of two parts: the Actor and the Critic. In this structure, the Actor network receives an n×m-dimensional state space as input, where n represents the number of enemy aircraft and m represents the dimension of the threat value. We designed a 128-dimensional neural network as the input layer and used an LSTM to extract the threat parameters of the enemy drones at the current moment. The threat parameters in this paper include seven dimensions: altitude, position, speed, pitch angle, roll angle, yaw angle, and the current state of the aircraft.
This paper adopts a four-head attention mechanism to increase the model's integration and generalization capabilities. The high-dimensional and continuous characteristics of the air combat environment make historical state information crucial. To address the limitation that multiplicative attention cannot handle long-distance dependencies well, this paper introduces an additive model in each attention head, uses Equation (8) to calculate the attention score of each enemy aircraft, feeds these scores into the network's probability selection, and adjusts the action probability distribution of the PPO algorithm. Finally, the Actor network outputs a p-dimensional action space probability distribution, where p represents the dimension of the friendly aircraft.
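As an illustration of this structure, the following is a minimal sketch of an Actor forward pass with LSTM feature extraction, four additive attention heads, and a softmax over candidate targets (matching the action space in Section 3.4.2). The layer sizes, the pooled query vector, the head averaging, and all names are assumptions introduced here for clarity, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveHead(nn.Module):
    """One additive attention head: scores each target against a query vector."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.W = nn.Linear(dim, hidden, bias=False)   # transforms target features
        self.U = nn.Linear(dim, hidden, bias=False)   # transforms the query
        self.v = nn.Linear(hidden, 1, bias=False)     # projects to a scalar score

    def forward(self, targets: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # targets: (n, dim), query: (dim,) -> per-target additive scores, shape (n,)
        return self.v(torch.tanh(self.W(targets) + self.U(query))).squeeze(-1)

class TAPPOActorSketch(nn.Module):
    """Sketch: LSTM feature extraction over enemy threat parameters, four additive
    attention heads, and a softmax distribution over candidate targets."""
    def __init__(self, threat_dim: int = 7, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(threat_dim, hidden_dim)          # 128-dim input layer
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.heads = nn.ModuleList(AdditiveHead(hidden_dim) for _ in range(num_heads))

    def forward(self, enemy_states: torch.Tensor) -> torch.Tensor:
        # enemy_states: (n_enemies, threat_dim), e.g., 8 enemies x 7 threat dimensions
        h = torch.tanh(self.embed(enemy_states)).unsqueeze(0)
        h, _ = self.lstm(h)                     # contextual feature extraction
        h = h.squeeze(0)                        # (n_enemies, hidden_dim)
        query = h.mean(dim=0)                   # assumed query: pooled global feature
        scores = torch.stack([head(h, query) for head in self.heads]).mean(dim=0)
        return F.softmax(scores, dim=-1)        # target selection probabilities

probs = TAPPOActorSketch()(torch.randn(8, 7))   # 8 enemy aircraft, 7 threat dimensions
```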
Similar to the Actor network, the Critic network is an essential part of the TAPPO algorithm. However, unlike the Actor, which outputs an action probability distribution, the Critic network outputs a single value that estimates the value of the current state. This predicted value helps the agent judge the quality of the current target selection, optimizing the decision-making process and improving the selection of subsequent actions. The Critic network enhances the system's ability to adapt to environmental changes, allowing the agent to learn and adjust strategies more effectively.
3.3. Decision-Making Process Design
This study used the Unity simulation platform to build a realistic air combat scenario simulation environment. Our method takes the drone's current state information as input, extracts features through the long short-term memory network, and then processes these features through an attention network to optimize the importance evaluation of the information. Based on these attention-weighted features, the target allocation network generates a target probability matrix to guide the drone in selecting targets.
The maneuver decision network determines the appropriate maneuver strategy based on the current state information and target selection results to maximize the target reward. The drone’s state information, the reward obtained, the attention weight, the target probability matrix, and the selected maneuver actions are stored in three independent experience buffers for updating the parameters of each network.
The system receives the drone's state information at each moment and computes target allocation and maneuver decisions through the reinforcement learning algorithm, which act on the simulation environment in real time. If the number of enemy aircraft drops to zero, our side is considered to have won. This process continues until the training goal is reached or the simulation ends, as shown in Figure 7.
3.4. Air Combat Model Design
This paper models multi-UAV cooperative target allocation and intelligent decision-making as a partially observable Markov decision process (POMDP), described by the five-tuple (S, A, R, P, γ), where s_t ∈ S represents the air combat state information at the current time t, a_t ∈ A represents the action taken by the agent at time t, r_t ∈ R represents the reward feedback given by the system, P is the state transition function, and the discount factor γ determines the importance of future rewards. The agent performs an action according to the current state s_t and interacts with the environment to produce the state s_{t+1} at the next moment. The system feeds back the reward r_t to the agent, and the agent seeks to maximize the expected return G_t, which is defined in Equation (10).
In a POMDP, the task of the agent is to learn a mapping from states to actions, which is defined as a policy. The core goal is to find an optimal policy that maximizes the expected cumulative discounted return, as defined in Equation (11).
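In standard notation, the discounted return and the policy objective described by Equations (10) and (11) are commonly written as:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad
\pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right]
```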
The state space, action space, and reward function of the multi-UAV collaborative target allocation problem under reinforcement learning are as follows. These are used to evaluate the agent’s single-step decision-making performance under the reinforcement learning framework.
3.4.1. State Space Design
The state spaces of the friendly and enemy drones each consist of sets of related drone and missile parameters. The drone state includes position P, height H, and speed V. Missile-related information involves the missile type, launch state, and remaining number of missiles. In the enemy state space, a threat value is additionally introduced; this parameter quantifies the current threat level of the enemy target. The friendly and enemy state spaces are defined in Equations (12) and (13).
The attitude of the drone is controlled by the pitch angle, roll angle, and yaw angle. In addition, the McGrew [23] method is used to describe the azimuth information of the drone. Specifically, the aspect angle (AA) is defined as the angle between the velocity direction of our aircraft and the line connecting the two drones; the antenna train angle (ATA) represents the angle between the velocity direction of the enemy aircraft and the line connecting the two drones; and the heading crossing angle (HCA) is the angle between the velocity directions of the two drones. R represents the distance between the enemy and friendly drones. The representation of the drone azimuth information is shown in Figure 8. In summary, the overall state space is defined in Equation (14).
3.4.2. Action Space Design
Action Space Design of the TAPPO Algorithm
In the reinforcement learning framework, the agent adjusts its decision-making strategy by performing actions and receiving corresponding rewards, so as to select the optimal action at future decision points and maximize the cumulative reward. Therefore, the action space is crucial for achieving an efficient learning process and is defined in Equation (15).
Among them, each element represents the probability that the agent selects the i-th enemy drone as the attack target, and n represents the number of enemy drones on the current battlefield.
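A minimal sketch of such an action space, writing the selection probabilities as p_1, …, p_n (notation introduced here for illustration rather than taken from Equation (15)):

```latex
A_t = \left(p_1, p_2, \ldots, p_n\right), \qquad p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1
```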
3.4.3. Reward Function Design
TAPPO Algorithm Reward Function Design
In multi-UAV coordinated air combat, the agent aims to shoot down the assigned enemy aircraft through appropriate maneuvers to maximize the reward. The target assignment process mainly focuses on the target threat level and the current drone status. The reward function is defined in Equation (16).
Among them, the overall reward is the return obtained by taking an action in the current state; it consists of a threat term, which quantifies the threat level of the target aircraft at the current moment, and a state term, which reflects the current combat status of our drone. The definitions of these two terms are shown in Equation (17).
The agent prioritizes targets with higher threat values during the target assignment process. The threat component of the reward function quantifies the threat level of the enemy drone and consists of three parts: a speed reward, a height reward, and a distance reward. The speed reward reflects the relative speed difference between the enemy and our aircraft: the faster the enemy aircraft, the higher its threat level, and the speed reward increases accordingly to prompt the agent to prioritize faster targets. The height reward considers the altitude difference between the enemy and our drones. When the two sides are at significantly different altitudes, the side at the lower altitude is usually at a tactical disadvantage; the height reward therefore increases or decreases with the altitude difference, reflecting the combat advantages of different tactical positions. The distance reward is designed based on the distance between enemy and friendly drones. As the distance decreases, the hit rate of both sides increases and the threat value grows, so the distance reward increases.
As shown in Equation (18), this paper proposes a reward mechanism based on the current state of the agent. Agents with higher hit rates and larger missile gains receive higher rewards. Missile_Launches quantifies the number of launched missiles, and Target_Hits represents the number of successful target hits. At the same time, Remaining_Missiles and Remaining_Targets quantify the number of remaining missiles and targets, respectively. The hit rate and missile benefit ratio are introduced as critical factors in the reward function to encourage the agent to select targets based on its current situation, thereby improving the model's adaptability and robustness in complex air combat scenarios.
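As an illustration of this state-based reward term, the sketch below computes a hit rate and a missile benefit ratio from the quantities named above and combines them with equal weights; the combination, clipping, and weights are assumptions for illustration, not the paper's exact Equation (18).

```python
def state_reward(missile_launches: int, target_hits: int,
                 remaining_missiles: int, remaining_targets: int,
                 w_hit: float = 0.5, w_benefit: float = 0.5) -> float:
    """Illustrative state reward: a higher hit rate and a favorable ratio of remaining
    missiles to remaining targets both increase the reward (weights are assumed)."""
    hit_rate = target_hits / missile_launches if missile_launches > 0 else 0.0
    # Missile benefit ratio: remaining strike capacity relative to remaining targets.
    benefit_ratio = remaining_missiles / remaining_targets if remaining_targets > 0 else 1.0
    return w_hit * hit_rate + w_benefit * min(benefit_ratio, 1.0)

# Example: 4 launches, 3 hits, 2 missiles left, 5 targets left
print(state_reward(4, 3, 2, 5))
```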
3.5. Training System Design
The training system consists of six parts, as shown in Figure 9. Unity is used to implement the air combat simulation to sample a large amount of training data. We use multiple GPUs to run multiple simulation environments in parallel, and the agent continuously interacts with the environments to improve the convergence speed. The agent receives the state observations transmitted by the air combat environment and sends the observation data to the TAPPO training module and the agent adversarial training module.
The TAPPO algorithm uses the current observation information to extract the threat value parameters, calculates and updates the weight matrix of the attention network, obtains the target attention weight vector that participates in the calculation of the actor network, and adjusts the target selection probability distribution. The agent samples the target probability matrix to obtain the target. The adversarial training module receives the target allocation results, makes maneuver decisions, and interacts with the environment based on the current state, generating the state and reward of the next moment.
The model training data set mainly consists of drone state information, action information, reward information, and the selection probability matrix. The training module packages the generated data into sequences containing the current state, action, reward, and next state. At the same time, the TAPPO module also generates a target probability distribution matrix as one of the sequence parameters. These data are packaged into sequences of 16 steps and transmitted to the experience pool asynchronously. When the experience pool reaches its threshold, the system extracts 120 experience samples from it to train the intelligent agent. These 120 experiences comprise 80% old data and 20% new experience, which breaks the correlation between data and improves data utilization.
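A minimal sketch of the sampling scheme described above (a batch of 120 samples drawn as roughly 80% older experience and 20% new experience); the buffer structure and function names are assumptions for illustration.

```python
import random
from collections import deque

def sample_batch(old_buffer: deque, new_buffer: deque,
                 batch_size: int = 120, old_fraction: float = 0.8):
    """Mix old and new experience to break temporal correlation between samples."""
    n_old = min(int(batch_size * old_fraction), len(old_buffer))   # ~96 old samples
    n_new = min(batch_size - n_old, len(new_buffer))               # ~24 new samples
    return random.sample(list(old_buffer), n_old) + random.sample(list(new_buffer), n_new)

# Example with dummy 16-step sequences already packed upstream
old_pool = deque(({"seq_id": i, "age": "old"} for i in range(500)), maxlen=10000)
new_pool = deque(({"seq_id": i, "age": "new"} for i in range(50)), maxlen=1000)
batch = sample_batch(old_pool, new_pool)
print(len(batch))  # 120 when both pools hold enough samples
```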
4. Experiment
4.1. Experimental Environment
This paper designs an 8vs8 air combat simulation confrontation in an air combat simulation environment based on digital twin technology, aiming to evaluate the effectiveness and superiority of the proposed algorithm. In order to comprehensively evaluate the performance of the algorithm, a comparative experiment was conducted to compare the performance of the existing mainstream target allocation algorithm and the TAPPO algorithm in terms of win rate, loss rate, and draw rate under the same conditions. By designing an ablation experiment, it is verified that the improvement proposed in this paper is meaningful. Finally, the results of target allocation under disadvantageous conditions are quantified and statistically analyzed to verify the scalability and robustness of the algorithm.
4.2. Experimental Configuration
This study was experimentally verified on an air combat simulation platform independently developed by the research team. The platform integrates digital twin technology and supports complex battlefield decision-making scenario simulations through advanced target allocation functions, including dynamic target priority setting and multi-target tracking, to enable more accurate target allocation tests (as shown in the figure). The experimental simulation uses an F22 fighter equipped with six AIM-120 medium-range missiles guided by a radar guidance system; the aircraft supports a complete missile guidance chain.
In terms of hardware, the experimental environment uses a host with a Ryzen 9 5950X CPU and an RTX 3070 Ti GPU as the front end; multiple independent battlefields are managed and scheduled through multi-threading and multi-processing to achieve parallel air combat scene simulation. The TCP/IP protocol is used to ensure efficient data exchange between the front-end and back-end models. The back end is equipped with eight RTX 3090 GPUs responsible for extracting data from the experience pool in real time. It uses the proximal policy update to compute gradients and mean square errors and continuously updates the Actor and Critic networks of the TAPPO algorithm to ensure the optimization and adaptability of the algorithm in dynamic target allocation scenarios.
4.3. 8vs8 Multi-UAV Collaborative Confrontation Results
This paper designs an 8vs8 multi-UAV collaborative confrontation. The red side is a reinforcement learning agent based on the TAPPO target allocation algorithm, and the blue side is a finite state machine based on an expert system. Nearly 2000 iterations are recorded, with eight allocations in each iteration, and each reward is normalized, as shown in Figure 10. The reward function changes from negative to positive, showing the agent's gradual learning process.
Due to the significant jitter in the reinforcement learning training process, the curve is challenging to observe. To effectively visualize the general trend of the curve, this paper performs an exponential moving average on the original data. The light curve represents the original data, and the dark curve represents the data after sliding average processing.
Observing the loss function of the value network in Figure 11, we can see that the agent has reached a converged state. To evaluate the real-time performance of the agent, we counted the real-time winning rate against enemy aircraft over the last 2000 iterations. According to Figure 12, the real-time winning rate of the agent reached 86%. Because the positions of the red and blue aircraft are randomly initialized, some enemy aircraft are initially far away and cannot be detected immediately by radar, so achieving a 100% winning rate is not feasible.
4.4. Comparative Experiment
In order to verify the efficiency and effectiveness of the TAPPO algorithm in multi-aircraft coordinated air combat target allocation, a comparative experiment was designed to compare the TAPPO algorithm with MTCBBA, the genetic algorithm (GA) [25], and the Hungarian algorithm [26]. We compared their winning rate, floating-point operations (FLOPs), real-time performance, and scalability. All algorithms were evaluated on the above four indicators under the same experimental conditions, and each algorithm was run on the air combat simulation platform for 2000 iterations to ensure the reliability of the data.
As shown in Figure 13, the Hungarian algorithm is mainly suitable for static or relatively simple assignment scenarios; in a dynamically changing air combat environment, its lack of adaptability leads to low decision-making efficiency and a low winning rate. GA has excellent global search capabilities and can find the optimal target allocation solution; however, in continuous, complex, high-dimensional dynamic air combat scenarios that require fast responses, the genetic algorithm requires more iterations to converge to the optimal solution, which increases decision-making time and reduces decision-making efficiency. Although both the TAPPO algorithm and the MTCBBA algorithm converge in about 1200 rounds, the MTCBBA algorithm places too much emphasis on cost–benefit optimization rather than on maximizing the threat response, and this strategy cannot effectively ensure the maximum survival rate of our aircraft in air combat. In contrast, the TAPPO algorithm leverages the multi-head attention mechanism to make the agent prioritize the enemy aircraft that pose the greatest threat to our aircraft, thereby significantly improving the survival probability of friendly aircraft and improving air combat efficiency.
FLOPs are usually used to measure the computational intensity of algorithms, models, or systems, especially in scenarios involving many floating-point operations. As shown in Table 3, we counted the FLOPs of a single run of each algorithm and found that the FLOPs of MTCBBA, the genetic algorithm, and the Hungarian algorithm were relatively low, indicating that these algorithms are suitable for smaller-scale target allocation problems. In contrast, the FLOPs of the proposed TAPPO algorithm far exceed those of other mainstream target allocation algorithms, reflecting that the TAPPO algorithm performs more complex and detailed calculations. When dealing with high-dimensional continuous air combat environments, the TAPPO algorithm can provide more accurate target allocation decisions and shows stronger robustness. Therefore, the TAPPO algorithm is particularly suitable for multi-aircraft cooperative target allocation problems with complex, large-scale state spaces.
As shown in Table 4, we conducted a statistical analysis of the times of nearly 200 target allocations. Genetic algorithms perform a global search and try to find the optimal solution through continuous iterations, and each iteration requires the evaluation and selection of multiple candidate solutions; in complex air combat scenarios, this process is time-consuming, and the algorithm's high FLOPs lead to long calculation and allocation times. Although the MTCBBA algorithm has low FLOPs, its emphasis on cost-effectiveness optimization also takes considerable time. The Hungarian algorithm can quickly complete task allocation owing to its greedy strategy.
Although the TAPPO algorithm has a high FLOPs value, its target allocation time is lower because the deep neural network computations can be executed quickly. This shows that although our algorithm is designed for large-scale and complex state spaces, it is on par with mainstream reinforcement learning algorithms in terms of real-time performance.
In order to verify the scalability of each algorithm, we conducted experiments in the 8vs8 air combat simulation environment. After each algorithm converged, we added two enemy aircraft to the environment and recorded the winning rate over nearly 1000 battles. The winning rate results are shown in Figure 14.
As can be seen from Figure 14, the scalability of the TAPPO algorithm is significantly better than that of the other three mainstream algorithms. Because the Hungarian algorithm is not good at processing continuous, complex state spaces, its winning rate dropped significantly after enemy aircraft were added. The performance of the GA and MTCBBA algorithms also declined when facing the new enemy aircraft, but not to the same extent as the Hungarian algorithm. In contrast, the TAPPO algorithm can still maintain a high winning rate when the number of enemy aircraft increases, which shows that the algorithm has high scalability.
A comparison across the four indicators shows that the TAPPO algorithm has advantages in processing large-scale, continuous, dynamic spaces and achieves a winning rate far exceeding that of the other mainstream algorithms in air combat, verifying the effectiveness and superiority of the algorithm.
4.5. Ablation Experiment
Given the problems of traditional target allocation algorithms, this paper makes two improvements to the PPO algorithm. An ablation experiment was conducted to evaluate the impact of each mechanism on the algorithm's performance. In each experiment, one improvement was removed, and the winning rate in the 8vs8 multi-UAV collaborative confrontation was compared, as shown in Table 5.
The comparison results of the ablation experiments are shown in Figure 15. As can be seen from the figure, removing different improvements affects performance to significantly different degrees. Regarding the air combat win rate, the win rates of Ours-noR and Ours-noAT are both higher than that of the PPO method. Regarding convergence speed, the models trained with the Ours-noR and Ours-noAT methods also converge faster than the PPO method. It is particularly noteworthy that Ours-noR achieves the highest win rate, indicating that the key to evaluating target importance lies in the use of the attention mechanism and threat values. The Ours-noAT variant has the fastest convergence speed, while Ours-noR converges more slowly, which shows that the proposed reward function accelerates the algorithm's convergence.
Based on the target allocation experiments, the improved PPO algorithm is generally superior to the original PPO algorithm in terms of convergence speed and winning rate. The experimental results verify the effectiveness of the algorithm.
4.6. Disadvantage Combat Experiment
To verify the scalability and robustness of the algorithm at different scales, this paper designed a 6vs8 multi-UAV collaborative confrontation target allocation experiment, as shown in Figure 16. Under the same experimental conditions, the winning rate dropped by nearly ten percentage points compared with the 8vs8 experiment, which is mainly due to the numerical disadvantage. After the algorithm converged, the results of the last ten target allocations were examined, and all of the resulting solutions were optimal.
4.7. Discussion and Future Work
By combining the attention mechanism, the TAPPO algorithm effectively realizes importance assessment and target selection in a multi-target environment. The results show that the algorithm can significantly optimize the allocation of combat resources by assigning appropriate attention weights to each enemy aircraft and combining them with PPO-based intelligent decision-making, thereby improving the aircraft's survivability and winning rate. Compared with traditional target allocation methods, the TAPPO algorithm has significant advantages in allocation efficiency and in handling problem complexity.
In the rapidly evolving field of air combat technology, this study not only enriches the research on automated target allocation systems in theory, but also opens up new perspectives and strategies for practical military applications. Especially in modern electronic warfare environments, the algorithm may significantly improve the response speed and accuracy of military decision-making and effectively respond to the challenges of quickly processing a large amount of target information and making real-time decisions.
Despite the excellent performance of the TAPPO algorithm, we also note that it has some limitations. First, the algorithm's execution is highly dependent on high-quality real-time data input, and any delays or errors in front-end and back-end data transmission may seriously affect the accuracy of decision-making. Second, although the computational efficiency has been optimized, reducing the algorithm's FLOPs and its target allocation time remains an urgent task. In future research, we will seek to solve these problems and further improve the generalization and computational capabilities of the algorithm.