Article

Adaptive Missile Avoidance Algorithm for UAV Based on Multi-Head Attention Mechanism and Dual Population Confrontation Game

Cheng Zhang, Junhao Song, Chengyang Tao, Zitao Su, Zhiqiang Xu, Weijia Feng, Zhaoxiang Zhang and Yuelei Xu
1 Unmanned System Intelligent Perception and Collaboration Technology Laboratory, Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710129, China
2 National Key Laboratory of Unmanned Aerial Vehicle Technology, Northwestern Polytechnical University, Xi’an 710129, China
3 Integrated Research and Development Platform of Unmanned Aerial Vehicle Technology, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(5), 382; https://doi.org/10.3390/drones9050382
Submission received: 3 April 2025 / Revised: 12 May 2025 / Accepted: 15 May 2025 / Published: 21 May 2025
(This article belongs to the Collection Drones for Security and Defense Applications)

Abstract

In recent years, UAVs have faced increasingly severe and diversified missile threats. To address the challenge that reinforcement learning-based missile evasion algorithms struggle to adapt to various unknown missile types, we introduce a risk-sensitive PPO algorithm and propose a training framework incorporating multi-head attention mechanisms and dual-population adversarial training. The multi-head attention mechanism enables the policy network to extract latent features such as missile guidance laws from state sequences, while the dual-population adversarial approach ensures policy diversity and robustness. Compared to conventional self-play methods and GRU-based evasion strategies, our method demonstrates superior training efficiency and generates evasion policies with better adaptability to different missile types.

1. Introduction

Missile evasion strategies are a crucial research area in modern military operations, aimed at enhancing the survivability of UAVs and other equipment when facing missile threats [1]. The research background of missile evasion strategies stems mainly from the complexity and variability of missile systems. During missile attacks, UAVs must assess threats in real time and respond swiftly. Ref. [2] proposed a threat-plane-based approach to improve the speed of threat alerts, enabling aircraft to initiate evasive maneuvers earlier and thereby significantly increasing their survival probability. In the context of missile evasion tasks, numerous strategies have been developed. For instance, Ref. [3] proposed a hybrid evasion strategy based on a variable-structure guidance law; Ref. [4] explored a real-time control-based pursuit-evasion game method; Ref. [5] designed a three-dimensional evasion strategy that does not rely on missile guidance information; and Ref. [6] investigated equilibrium strategies in three-dimensional pursuit-evasion games. These strategies leverage mathematical models and computational algorithms to continuously optimize evasion decisions, aiming to better adapt to the dynamic nature of actual defense scenarios. Despite this significant progress, traditional evasion methods still face notable limitations in practical applications [7]. First, they often rely on fixed threat assessment and response mechanisms and struggle to adapt to dynamic battlefield conditions such as highly maneuverable missiles or coordinated multi-target attacks. Second, they depend on precise threat parameters, which are often unavailable in real combat. Additionally, while game-theoretic approaches offer theoretical insights, their high computational complexity hinders real-time application. To overcome these challenges, reinforcement learning (RL) has emerged as a promising alternative. RL enables dynamic strategy optimization through environmental interaction, improving adaptability and robustness in complex scenarios [7,8,9].
The combination of reinforcement learning and game theory has demonstrated tremendous potential across various domains, providing a theoretical foundation for strategic interactions in multi-agent systems and offering new perspectives and methods for solving complex decision-making problems [10,11,12]. In the field of transportation, Ref. [6] proposed a framework called SafeDriveRL, which integrates non-cooperative game theory and reinforcement learning to explore and mitigate high-level uncertainties caused by human behavior. This framework simulates the reward mechanisms of human drivers to evaluate the adaptability of autonomous vehicles in complex traffic environments. In the field of unmanned aerial vehicles (UAVs), Ref. [13] presented an improved radar tracking method based on cooperative game theory and reinforcement learning, offering innovative solutions for UAVs performing collaborative tasks in complex environments. In the domain of cyber–physical security, Ref. [14] designed a cyber–physical security framework that integrates reinforcement learning and game theory, simulating attack-defense scenarios and demonstrating its practical value in social control systems. Reinforcement learning, with its powerful learning capabilities and flexible decision-making mechanisms, has exhibited outstanding performance across various fields, particularly in addressing problems characterized by high uncertainty and dynamic complexity [15]. This suggests that reinforcement learning holds significant potential in tackling similar challenges in missile evasion tasks. By combining game theory, reinforcement learning is expected to provide smarter, real-time, and more efficient evasion strategies, opening up new directions for the research and application of missile evasion. Such an approach not only addresses the limitations of traditional strategies but also has the potential to significantly enhance the survivability of UAVs in modern battlefield environments.
The missile evasion problem presents unique challenges due to the diversity of threat characteristics (including guidance systems, speeds, and maneuverability) and the highly dynamic nature of the engagement environment [16]. These factors, combined with potential incomplete or unreliable threat information, create significant difficulties for traditional reinforcement learning approaches. Recent advances show promise but face limitations: Yan et al. [17] developed a TD3-based intelligent maneuvering strategy for hypersonic vehicles that effectively handles two interceptors in continuous action spaces, though its computational demands may hinder real-time implementation. Similarly, Jia et al. [18] proposed a model-free RL approach for pursuit-evasion games with unknown dynamics, but its generalization in complex scenarios requires further validation. These developments highlight both the progress in RL-based evasion strategies and the critical need for improved generalization capabilities to address the full spectrum of missile threats and environmental dynamics in practical applications.
Recent studies have made significant progress in enhancing the generalization capabilities of reinforcement learning (RL). Ho et al. [19] proposed a human-inspired meta-reinforcement learning (HMRL) framework that enables cross-task generalization through Bayesian knowledge analysis. Yang et al. [20] developed Ensemble Proximal Policy Optimization (EPPO), demonstrating improved robustness in policy integration. McClellan et al. [21] significantly enhanced sample efficiency using Equivariant Graph Neural Networks (EGNNs), while Prajapati et al. [22] improved temporal generalization through State Chrono Representation (SCR). Hu et al. [23] further advanced generalization by optimizing state representations. However, these methods face substantial challenges when applied to missile evasion scenarios, which are characterized by highly dynamic threats, rapidly changing environments, and incomplete information. The core assumptions of existing approaches often fail to meet the stringent requirements for real-time performance, adaptability, and partial observability inherent in missile evasion tasks. Consequently, there is an urgent need to develop specialized RL generalization enhancement methods tailored to the unique demands of missile evasion in real-world operational environments.
In order to solve the problems of poor generalization ability of the evasion strategy and information asymmetry in the missile evasion problem, this paper introduces a risk-sensitive PPO algorithm based on the multi-head attention mechanism and proposes a training framework for dual-population game confrontation. The multi-head attention mechanism is used to construct the policy network and the value network, enabling it to extract the motion characteristics of the missile from historical information. At the same time, the method of dual-population game solves the problem of overfitting the opponent in the self-play method, ensuring the diversity and generalization ability of the evasion strategy, and enabling it to effectively deal with different types of incoming missiles.
This paper is organized as follows: Section 2 introduces the mathematical model of the missile evasion problem, including the dynamics and kinematics models of the UAV, missile, and decoy flare. Section 3 introduces the method proposed in this paper, including the dual-population game framework based on the RPPO algorithm, the Markov decision model of the missile evasion problem, and the network structure design based on the multi-head attention mechanism. Section 4 verifies the effectiveness and superiority of the method proposed in this paper through simulation experiments. Conclusive remarks and a summary of findings are presented in Section 5.

2. Preliminary

2.1. UAV Dynamic Model

This paper mainly discusses the intelligent control of fixed-wing UAVs when dealing with missile threats. Therefore, a fixed-wing aircraft model is adopted to describe the kinematics and dynamics of the UAV. The external forces acting on a UAV influence its flight attitude and velocity, thereby altering its trajectory. These forces include gravity (W), thrust (T), and aerodynamic forces (A). Among them, gravity (W) acts through the UAV’s center of mass, while thrust (T) and aerodynamic forces (A) generate moments around the center of mass. These moments include the pitch moment (M), yaw moment (N), and roll moment (L), as illustrated in Figure 1.
When analyzing a UAV’s flight attitude and trajectory, it can be assumed that the deflection of control surfaces eliminates the aforementioned moments, ensuring that the total moment acting on the UAV remains balanced. This results in a spatial force system intersecting at the UAV’s center of mass. The forces acting on the UAV, excluding gravity, are collectively referred to as the controllable force ( N ) and can be expressed as
N = A + T
The resultant force N is decomposed into a component along the O_a x_a axis and a component within the O_a y_a z_a plane as follows:
N_n = A_n + T_n, \quad N_\tau = A_\tau + T_\tau
N_n is referred to as the controllable normal force, which is used to change the direction of the flight velocity. N_τ is referred to as the controllable tangential force, which is used to change the magnitude of the flight velocity.
For a longitudinally symmetric UAV, control is primarily achieved through the manipulation of the elevator δ_e, ailerons δ_a, rudder δ_R, and throttle δ_T, as illustrated in Figure 2. Deceleration can be achieved by deploying speed brakes to increase the drag D or by using the throttle and other thrust-reversing mechanisms to adjust the tangential force component T_τ.
The elevator δ_e, ailerons δ_a, and rudder δ_R are used to alter the UAV’s flight attitude. The elevator δ_e changes the UAV’s pitch attitude, which in turn alters the angle of attack α, thereby affecting the magnitude and direction of the lift. The ailerons δ_a change the UAV’s roll attitude, which adjusts the roll angle ϕ, thereby altering the direction of the lift; the horizontal component of the lift force causes the UAV to change direction in the horizontal plane. The rudder δ_R changes the UAV’s yaw attitude, which causes the sideslip angle β to vary, generating a side force C that also leads to a change in the UAV’s motion in the horizontal plane. By adjusting the elevator δ_e, ailerons δ_a, rudder δ_R, and throttle δ_T, the UAV can be maneuvered in any direction in space.
Assume that the moving coordinate system (Oxyz) rotates with an angular velocity ω relative to the horizontal coordinate system (O_l x_l y_l z_l), and the absolute velocity of the center of mass is V. By projecting the velocity V and the angular velocity ω onto the moving coordinate system (Oxyz), the following equations can be derived:
\mathbf{V} = V_x \mathbf{i} + V_y \mathbf{j} + V_z \mathbf{k}, \qquad \boldsymbol{\omega} = \omega_x \mathbf{i} + \omega_y \mathbf{j} + \omega_z \mathbf{k}
In Equation (3), i, j, and k represent the unit vectors of the moving coordinate system (Oxyz). Due to the presence of the angular velocity ω, the direction of the velocity V continuously changes. The absolute acceleration of the center of mass is expressed as
\frac{d\mathbf{V}}{dt} = \frac{dV_x}{dt}\mathbf{i} + \frac{dV_y}{dt}\mathbf{j} + \frac{dV_z}{dt}\mathbf{k} + V_x\frac{d\mathbf{i}}{dt} + V_y\frac{d\mathbf{j}}{dt} + V_z\frac{d\mathbf{k}}{dt} = \frac{\delta \mathbf{V}}{\delta t} + \boldsymbol{\omega} \times \mathbf{V}
In Equation (4), δV/δt represents the acceleration when the angular velocity ω = 0, and ω × V is the acceleration caused by the change in the direction of the velocity V in the moving coordinate system due to the presence of the angular velocity ω.
The mechanical equation of motion for the UAV’s center of mass can be expressed as
\mathbf{F} = m\frac{d\mathbf{V}}{dt} = m\left(\frac{\delta \mathbf{V}}{\delta t} + \boldsymbol{\omega} \times \mathbf{V}\right) = F_x \mathbf{i} + F_y \mathbf{j} + F_z \mathbf{k}
In Equation (5), m represents the mass of the UAV, and  F is the vector sum of all external forces acting on the UAV’s center of mass.
The scalar form of the UAV’s motion in the moving coordinate system is expressed as
F_x = m\left(\frac{dV_x}{dt} + V_z \omega_y - V_y \omega_z\right), \quad F_y = m\left(\frac{dV_y}{dt} + V_x \omega_z - V_z \omega_x\right), \quad F_z = m\left(\frac{dV_z}{dt} + V_y \omega_x - V_x \omega_y\right)
The UAV’s dynamic equation is applicable in any moving coordinate system. Taking the flight path coordinate system as the moving coordinate system, and based on the definition of the flight path coordinate system (O_k x_k y_k z_k), we have V_x = V and V_y = V_z = 0. According to the relative orientation between the flight path coordinate system and the ground coordinate system, the ground coordinate system is rotated first with angular velocity ψ̇_a about its vertical axis and then with angular velocity θ̇_a about the resulting lateral axis to obtain the flight path coordinate system. Therefore, the angular velocity of the flight path coordinate system is ω = ψ̇_a + θ̇_a, which can be expressed as
\begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} = L_{kl}\begin{bmatrix} 0 \\ 0 \\ \dot{\psi}_a \end{bmatrix} + \begin{bmatrix} 0 \\ \dot{\theta}_a \\ 0 \end{bmatrix} = \begin{bmatrix} \dot{\psi}_a \cos\theta_a \\ \dot{\theta}_a \\ -\dot{\psi}_a \sin\theta_a \end{bmatrix}
By substituting the thrust T, the aerodynamic force A, and the gravity W into Equation (6), and using (χ, γ, μ) to replace (ψ_a, θ_a, ϕ_a), the center-of-mass dynamics equation in the flight path coordinate system (O_k x_k y_k z_k) is obtained as Equation (7). Here, the lift F_L, drag D, and side force C represent the components of the aerodynamic force A in the velocity coordinate system (O_a x_a y_a z_a).
m\frac{dV}{dt} = T\cos(\alpha+\varphi)\cos\beta - D - mg\sin\gamma
mV\cos\gamma\,\frac{d\chi}{dt} = T\left[\sin(\alpha+\varphi)\sin\mu - \cos(\alpha+\varphi)\sin\beta\cos\mu\right] + C\cos\mu + F_L\sin\mu
mV\frac{d\gamma}{dt} = T\left[\sin(\alpha+\varphi)\cos\mu + \cos(\alpha+\varphi)\sin\beta\sin\mu\right] + C\sin\mu + F_L\cos\mu - mg\cos\gamma
The UAV’s dynamic equation describes the laws of its motion and variation, enabling effective control of the UAV. Based on the relevant literature, the following assumptions are made: (1) the Earth’s rotation and revolution are neglected, treating it as a stationary plane; (2) the UAV is assumed to be a rigid body with constant gravity; (3) the effects of wind are neglected, so that the O_k x_k, O_a x_a, and O_b x_b axes coincide; (4) the UAV is assumed to be in zero-sideslip flight; (5) the thrust T is assumed to be aligned with the O_a x_a axis. By adjusting the thrust T, the roll angle ϕ, and the angle of attack α, the UAV can be controlled, and its dynamic equation simplifies to
m\frac{dV}{dt} = T\cos\alpha - D - mg\sin\gamma
mV\cos\gamma\,\frac{d\chi}{dt} = F_L\sin\phi
mV\frac{d\gamma}{dt} = F_L\cos\phi - mg\cos\gamma
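For illustration, the following Python sketch integrates the simplified point-mass model above with an explicit Euler step; the mass, drag, and control values are illustrative placeholders rather than parameters taken from this paper.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def uav_point_mass_rhs(state, controls, mass, drag):
    """Right-hand side of the simplified point-mass model:
    state    = [V, chi, gamma]     (speed, track angle, climb angle)
    controls = [T, alpha, F_L, phi] (thrust, angle of attack, lift, roll angle)."""
    V, chi, gamma = state
    T, alpha, F_L, phi = controls
    dV = (T * np.cos(alpha) - drag - mass * G * np.sin(gamma)) / mass
    dchi = F_L * np.sin(phi) / (mass * V * np.cos(gamma))
    dgamma = (F_L * np.cos(phi) - mass * G * np.cos(gamma)) / (mass * V)
    return np.array([dV, dchi, dgamma])

def euler_step(state, controls, mass, drag, dt=0.02):
    """One explicit Euler step; a production simulator would typically use RK4."""
    return state + dt * uav_point_mass_rhs(state, controls, mass, drag)

# Example: near-level flight at 350 m/s with placeholder mass/drag values.
state = np.array([350.0, 0.0, 0.0])
state = euler_step(state, controls=(40e3, 0.05, 9.81 * 8000.0, 0.0), mass=8000.0, drag=30e3)
```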

2.2. Missile Motion and Dynamics Model

In the inertial coordinate system, the missile’s kinematic equation is
\dot{x} = v\cos\theta\cos\varphi, \quad \dot{y} = v\cos\theta\sin\varphi, \quad \dot{z} = v\sin\theta
In the equation, (x, y, z) represents the missile’s position in the inertial coordinate system, and (v, θ, φ) represents the missile’s velocity, flight path pitch angle, and flight path yaw angle, all as functions of the flight time t. The missile’s flight is divided into an active phase and a passive phase; in the inertial frame, the forces acting on the missile during the active phase primarily include the thrust T(t), the gravity G = m(t)g, and the aerodynamic drag D(t) (in a non-inertial frame, the Coriolis force C would also need to be considered). Therefore, in the trajectory coordinate system, the missile’s particle dynamics equation is expressed as
\dot{v} = g(n_x - \sin\theta), \quad \dot{\varphi} = \frac{g\,n_y}{v\cos\theta}, \quad \dot{\theta} = \frac{g}{v}(n_z - \cos\theta)
where n_x(t) = (T(t) − D(t))/(m(t)g) is the overload in the velocity direction, and n_y, n_z represent the missile’s lateral control overloads in the yaw and pitch directions, respectively, which are calculated using proportional navigation. m(t) is the missile’s current mass, and g is the gravitational acceleration constant, taken as g = 9.81 m/s².
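As a concrete illustration of how the lateral overloads can be generated, the sketch below computes line-of-sight rates from the relative geometry and forms a simple true proportional-navigation command. The paper only states that n_y and n_z are obtained from proportional guidance, so the exact PN formulation used here (gain N = 3, commanded acceleration = N · V_c · LOS rate) is an assumption.

```python
import numpy as np

G = 9.81

def los_rates(rel_pos, rel_vel):
    """LOS azimuth (beta) and elevation (eps) rates from missile-to-target
    relative position and velocity in the inertial frame."""
    x, y, z = rel_pos
    vx, vy, vz = rel_vel
    r_xy2 = x * x + y * y
    r2 = r_xy2 + z * z
    beta_dot = (x * vy - y * vx) / r_xy2
    eps_dot = (vz * r_xy2 - z * (x * vx + y * vy)) / (r2 * np.sqrt(r_xy2))
    return beta_dot, eps_dot

def pn_overloads(rel_pos, rel_vel, N=3.0):
    """Illustrative true-PN lateral overloads in g units: a_cmd = N * Vc * LOS_rate."""
    beta_dot, eps_dot = los_rates(rel_pos, rel_vel)
    closing_speed = -np.dot(rel_pos, rel_vel) / np.linalg.norm(rel_pos)
    return N * closing_speed * beta_dot / G, N * closing_speed * eps_dot / G

def missile_rhs(v, theta, n_x, n_y, n_z):
    """Particle dynamics of the missile in the trajectory frame (equations above)."""
    v_dot = G * (n_x - np.sin(theta))
    phi_dot = G * n_y / (v * np.cos(theta))
    theta_dot = G * (n_z - np.cos(theta)) / v
    return v_dot, phi_dot, theta_dot

# Example: target roughly 10 km ahead of the missile, range closing.
n_y, n_z = pn_overloads(rel_pos=np.array([10e3, 500.0, 200.0]),
                        rel_vel=np.array([-450.0, 10.0, -5.0]))
```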

2.3. Decoy Motion Model

To effectively describe the forces acting on the decoy flare and its motion states during flight, a three-degree-of-freedom particle model is used to describe the decoy’s motion. The specific model is as follows:
\ddot{x}_d = -\frac{D_d\,\dot{x}_d}{m_d V_d}, \quad \ddot{y}_d = -\frac{D_d\,\dot{y}_d}{m_d V_d}, \quad \ddot{z}_d = -\frac{D_d\,\dot{z}_d}{m_d V_d} + g
In the equation, m_d represents the mass of the decoy, V_d represents the true velocity of the decoy relative to the incoming flow, and D_d represents the magnitude of the decoy’s aerodynamic drag, which acts opposite to the velocity direction. Since the decoy cannot be controlled after being ejected, there are no control variables.
V_d = \sqrt{\dot{x}_d^2 + \dot{y}_d^2 + \dot{z}_d^2}
D_d = \frac{1}{2}\rho V_d^2 C_d S_d
where ρ represents the air density, C_d represents the aerodynamic drag coefficient, and S_d represents the windward (frontal) area of the decoy.
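A minimal sketch of propagating the decoy point-mass model is given below; the flare mass, drag coefficient, and reference area are illustrative assumptions, and the gravity term follows the sign convention of the equations above.

```python
import numpy as np

G = 9.81

def decoy_acceleration(vel, m_d, rho, C_d, S_d):
    """Drag opposite to the velocity vector plus gravity on the third axis,
    following the three-degree-of-freedom decoy model above."""
    V_d = np.linalg.norm(vel)
    D_d = 0.5 * rho * V_d ** 2 * C_d * S_d
    acc = -(D_d / (m_d * V_d)) * vel
    acc[2] += G  # gravity term, with the sign convention of the model above
    return acc

# Illustrative values (not from the paper): flare ejected with the UAV's velocity.
pos = np.zeros(3)
vel = np.array([300.0, 0.0, -20.0])
dt = 0.02
for _ in range(100):
    acc = decoy_acceleration(vel, m_d=0.2, rho=0.52, C_d=1.0, S_d=0.01)
    vel = vel + dt * acc
    pos = pos + dt * vel
```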

3. Method

3.1. Risk-Sensitive PPO

In missile evasion missions, the synergistic effect of maneuver evasion and decoy interference aims to deceive and ultimately break missile tracking. During this process, the intelligent agent must balance the timing of decoy deployment with the selection of appropriate maneuvers based on the current engagement situation.
When the missile is at a long distance from the UAV, the seeker’s ability to distinguish between decoys and the UAV is relatively weak. However, the UAV’s maneuvers have a limited impact on the line-of-sight angle, making it more difficult to escape the missile’s lock-on zone. Conversely, when the missile is at close range, the UAV’s maneuvers significantly affect the line-of-sight angle, but the decoy’s interference capability diminishes.
Additionally, the effectiveness of decoy interference exhibits a certain degree of randomness, and different missile pursuit strategies elicit varying responses to the UAV’s maneuvers. Therefore, a coordinated evasion strategy must possess sufficient diversity to adapt to various engagement scenarios. This necessitates a robust decision-making framework capable of dynamically adjusting evasion tactics based on real-time threat assessment.
We employ the Risk-Sensitive Proximal Policy Optimization (RPPO) algorithm to train our evasion strategy. This approach introduces a risk-level hyperparameter that guides the policy toward the Bellman-optimal strategy, enabling the algorithm to learn the underlying reward signal distribution of the task and ultimately produce strategies with different behavioral preferences based on varying risk levels. The incorporation of risk sensitivity modifies the Bellman operator for the value function as follows:
\mathcal{B}_\tau V(s) = V(s) + 2\alpha\,\mathbb{E}_{a\sim\pi(\cdot|s)}\mathbb{E}_{s'\sim p(\cdot|s,a)}\big[\tau[\delta]_+ + (1-\tau)[\delta]_-\big],
where δ is the TD error; with discount factor γ, δ(s, a, s′) = r(s, a) + γV(s′) − V(s). The operators are defined as [·]_+ = max(·, 0) and [·]_− = min(·, 0). It can be proven that for any τ ∈ (0, 1), \mathcal{B}_\tau is a contraction mapping under the infinity norm. This ensures that the value function with the risk-level parameter can converge during the policy evaluation stage. On this basis, the advantage function can be defined as
A_\tau^{\pi}(s,a) = 2\alpha\,\mathbb{E}_{s'}\big[\tau[\delta]_+ + (1-\tau)[\delta]_-\big].
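In code, this risk-sensitive weighting amounts to scaling positive and negative TD errors asymmetrically; the following PyTorch sketch shows the one-step version (the variable names and the example rollout are illustrative).

```python
import torch

def risk_weighted_td(delta, tau, alpha=1.0):
    """2*alpha*(tau*[delta]_+ + (1-tau)*[delta]_-): tau > 0.5 overweights positive
    surprises (risk-seeking), tau < 0.5 overweights negative ones (risk-averse)."""
    return 2.0 * alpha * (tau * delta.clamp(min=0.0) + (1.0 - tau) * delta.clamp(max=0.0))

# Example rollout of three transitions with a bootstrap value V(s_3).
rewards = torch.tensor([0.0, -1.0, 100.0])
values = torch.tensor([1.0, 2.0, 3.0, 0.0])        # V(s_0), ..., V(s_3)
gamma = 0.99
delta = rewards + gamma * values[1:] - values[:-1]
one_step_advantage = risk_weighted_td(delta, tau=0.4)  # risk-averse setting
```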
In order to effectively reduce the variance in the advantage estimation, it is necessary to introduce the generalized advantage estimation into the RPPO. To this end, the Bellman operator of the value function with the risk level needs to be rewritten in a multi-step form:
\mathcal{B}_{\tau,\lambda}V(s) := (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}(\mathcal{B}_\tau)^{n}V(s).
It can be proven that the multi-step form of the Bellman operator of the value function also has the contraction mapping property. However, it is difficult to estimate from sample data, so an operator in sample form also needs to be introduced. Denoting the estimate of the value function V as \hat{V}, we have the following:
\hat{\mathcal{B}}_{\tau}^{\hat{V}}V(s) := V(s) + 2\alpha\big[\tau[\hat{\delta}]_+ + (1-\tau)[\hat{\delta}]_-\big],
where \hat{\delta} is the sample estimate of the TD error. When \hat{V} = V is chosen, we obtain \mathcal{B}_\tau V(s) = \mathbb{E}_{a\sim\pi(\cdot|s),\,s'\sim p(\cdot|s,a)}\big[\hat{\mathcal{B}}_\tau^{V}V(s)\big]. Therefore, the multi-step Bellman operator of the value function can be expressed as
\hat{V}_{\tau,\lambda}^{\pi}(s) = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\big(\hat{\mathcal{B}}_\tau^{\hat{V}}\big)^{n}V(s), \qquad \hat{V}^{n}(s) = \hat{\mathcal{B}}_\tau^{\hat{V}^{\,n-1}}V(s).
On this basis, the advantage estimation can be written as
\hat{A}_{\tau,\lambda}^{\pi}(s,a) = \hat{V}_{\tau,\lambda}^{\pi}(s) - V(s),
The multi-step value function Bellman operator in the sample form and the corresponding advantage function estimation constitute the complete RPPO algorithm. Compared with the original PPO algorithm, it only requires some modifications to the calculations of the advantage function and the value function, as shown in Algorithm 1.
Algorithm 1 Risk-Sensitive PPO (RPPO)
Input: initial actor network parameter θ and critic network parameter ϕ , replay buffer D, update epochs K
    for i = 0, 1, 2, 3, … do
       Collect trajectories by running policy π_θ until the replay buffer is filled.
       Compute λ-variant returns \hat{V}_{\tau,\lambda}^{\pi} according to Equation (19).
       Compute advantages \hat{A}_{\tau,\lambda}^{\pi} according to Equation (20).
       for k = 0, 1, 2, …, K do
            Update θ by maximizing the clipped surrogate objective L_\pi^{CLIP}(\theta) = \frac{1}{|D|}\sum_{t=0}^{|D|-1}\min\big(\omega_t(\theta)\hat{A}_\tau(s_t,a_t),\ \mathrm{clip}(\omega_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_\tau(s_t,a_t)\big) via the Adam algorithm.
            Update ϕ by minimizing the mean-squared error L_V(\phi) = \frac{1}{2|D|}\sum_{t=0}^{|D|-1}\big(V_\phi(s_t) - \hat{V}_{\tau,t}\big)^2 via the Adam algorithm.
       end for
    end for
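One way to realize Algorithm 1 in practice is a GAE-style backward recursion over the risk-weighted TD errors, followed by the standard clipped-surrogate and value-regression updates. The paper states the returns in operator form, so the recursion below is a sketch under that assumption rather than a literal transcription of Equations (19) and (20).

```python
import torch

def rppo_advantages(rewards, values, dones, tau, gamma=0.99, lam=0.95, alpha=1.0):
    """Backward pass producing lambda-weighted, risk-sensitive advantages and critic
    targets; `values` carries one extra bootstrap entry V(s_T)."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        weighted = 2.0 * alpha * (tau * delta.clamp(min=0.0) + (1.0 - tau) * delta.clamp(max=0.0))
        gae = weighted + gamma * lam * not_done * gae
        adv[t] = gae
    return adv, adv + values[:-1]          # advantages and value targets

def rppo_losses(new_logp, old_logp, adv, value_pred, value_target, eps=0.2):
    """Clipped surrogate for the actor and mean-squared error for the critic."""
    ratio = torch.exp(new_logp - old_logp)
    actor_loss = -torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv).mean()
    critic_loss = 0.5 * (value_pred - value_target).pow(2).mean()
    return actor_loss, critic_loss

# Toy usage on a three-step rollout.
adv, target = rppo_advantages(torch.tensor([0.0, -1.0, 100.0]),
                              torch.tensor([1.0, 2.0, 3.0, 0.0]),
                              torch.tensor([0.0, 0.0, 1.0]), tau=0.4)
```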

3.2. Cyclic Group Game

The research in [24,25] shows that agents trained under the prioritized fictitious self-play framework can achieve strategies close to a Nash equilibrium in zero-sum games. However, the missile evasion problem is an asymmetric game: the state transitions and control inputs of the missile and the UAV are not the same, so the self-play framework, in which a single agent is reused against its historical versions for cyclic training to improve strategy performance, cannot be applied directly. In order to train diverse evasion strategies that can cope with different types of incoming missiles, this paper proposes a dual-population adversarial game training framework based on the RPPO algorithm.
In order to prevent the pursuit strategy from overfitting to a particular escape strategy, or the escape strategy from overfitting to a particular pursuit strategy, the dual-population game consists of a UAV evading-agent population and a missile pursuing-agent population. Before training, each population randomly initializes several agents, all using the same network structure but with different initialization parameters and risk coefficients. In order to avoid meaningless exploration in the early stage of training, several pursuit agents controlled by the proportional navigation guidance law and the parallel approach guidance law are added to the missile pursuing-agent population.
In each round of training, in order to evaluate the performance of each agent in the two populations, a pursuit agent and an escape agent are repeatedly selected from the two populations by random sampling to conduct missile-evasion confrontations. Each confrontation is conducted over several rounds, and the confrontation process and outcome are recorded starting from a random initial situation in each round. The data generated in the confrontations are stored in a shared experience replay buffer. In order to enable the RPPO algorithm to use the data generated by different agents, the recorded data include the action probability density π(a|s) in addition to (s, a, r, s′). A diagram of the group game is shown in Figure 3.
In order to avoid cyclic dominance (that is, escape strategy A beats pursuit strategy C, pursuit strategy C beats escape strategy B, escape strategy B beats pursuit strategy D, and pursuit strategy D beats escape strategy A), we designed a reward and punishment correction scheme based on ELO ranking. After each cycle of confrontations, every agent in the two populations is ranked by the ELO mechanism, and the reward signals from the previous games are corrected according to each agent's rank. When an agent performs poorly against an opponent with a lower rank, the penalty terms in its reward function are increased according to the rank difference; if an agent wins against an opponent with a higher rank than itself, all reward terms in that game are amplified based on the rank difference.
r_{i,j}^{t} \leftarrow \begin{cases} r_{i,j}^{t}\big(1 + \tanh(\mathrm{Rank}_i - \mathrm{Rank}_j)\big), & r_{i,j}^{t} > 0 \\ r_{i,j}^{t}\big(1 - \tanh(\mathrm{Rank}_i - \mathrm{Rank}_j)\big), & r_{i,j}^{t} < 0 \end{cases}
As shown in Equation (21), when an agent loses to an opponent ranked lower than itself in the missile evasion confrontation, all reward terms in the reward values of both parties in that match are reduced and the penalty terms are amplified, and vice versa.
After the correction is complete, all matchup data are stored in the shared experience replay buffer of each population, and the trainable agents use the data in the replay buffer to update their own strategies. In order to make full use of the better-performing agents, before each update the parameters of the lower-ranked agents in the two populations are overwritten with those of the better-ranked agents. At the same time, in order to ensure strategy diversity and better exploration, noise between −0.1 and 0.1 is added to the risk coefficient of the agent whose parameters are overwritten.
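A sketch of the rank-based reward correction of Equation (21) and of the parameter-overwrite step is given below. The ordinal-rank convention (smaller value meaning a better rank) and the dict-based agent records are assumptions made for illustration; a real implementation would copy network state_dicts rather than plain parameter dicts.

```python
import math
import random

def corrected_reward(r, rank_i, rank_j):
    """Equation (21): amplify rewards earned against stronger opponents and penalties
    suffered against weaker ones (a smaller ordinal rank is assumed to be better)."""
    gap = math.tanh(rank_i - rank_j)
    if r > 0:
        return r * (1.0 + gap)
    if r < 0:
        return r * (1.0 - gap)
    return 0.0

def overwrite_weakest(population, elo, noise=0.1):
    """Copy the top-rated agent's parameters onto the bottom-rated one and jitter its
    risk coefficient by +/-0.1 to preserve strategy diversity."""
    ordered = sorted(population, key=lambda name: elo[name])
    worst, best = ordered[0], ordered[-1]
    population[worst]["params"] = dict(population[best]["params"])
    population[worst]["tau"] = min(1.0, max(0.0,
        population[best]["tau"] + random.uniform(-noise, noise)))

# Toy usage: an agent ranked 2nd loses (r = -1) to an agent ranked 5th, so its penalty grows.
print(corrected_reward(-1.0, rank_i=2, rank_j=5))
pop = {"evader_a": {"params": {"w": 0.1}, "tau": 0.5},
       "evader_b": {"params": {"w": 0.7}, "tau": 0.6}}
overwrite_weakest(pop, elo={"evader_a": 1500, "evader_b": 1620})
```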

3.3. Markov Decision Model

3.3.1. State Space

When evading incoming missiles, the drone needs to make decisions based on information such as its own position, speed, and attitude, as well as the relative situation with the missile. To ensure Markov property, it is also necessary to obtain information such as the missile’s speed, attitude, and control guidance law. However, since the missile is a non-cooperative target, this information cannot be directly acquired. In order to ensure the effectiveness of the evasion strategy, we combine several past known information sequences into a state signal, and extract hidden features from it by introducing the multi-head attention mechanism and the GRU module in the policy network and the value network. The state space is represented as
S_{\mathrm{UAV}} = \big\{[x_u, y_u, z_u, \chi_u, \gamma_u, \phi_u, V_u, r_d, \beta_d, \varepsilon_d]_i\big\}_{i=t-n}^{t},
The state representation includes the UAV’s own status (denoted by subscript u) such as its coordinates, flight attitude, and velocity, as well as the relative motion states (denoted by subscript d) including the relative distance and line-of-sight angle between the UAV and the missile.
The missile only needs to rely on the line-of-sight distance, the line-of-sight deflection angle, the line-of-sight inclination angle and their first-order differentials to effectively pursue the target. Therefore, the state space of the missile agent is
S_{\mathrm{Missile}} = [r_d, \beta_d, \varepsilon_d, \dot{r}_d, \dot{\beta}_d, \dot{\varepsilon}_d]
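The sketch below shows one way to assemble the UAV agent's serialized state input; the sequence length of 8 steps is an illustrative assumption rather than the value used in the paper.

```python
import numpy as np
from collections import deque

SEQ_LEN = 8          # history length fed to the policy (illustrative assumption)
UAV_STATE_DIM = 10   # [x_u, y_u, z_u, chi_u, gamma_u, phi_u, V_u, r_d, beta_d, eps_d]

class StateSequence:
    """Rolling window of past observations from which the attention/GRU layers
    can infer hidden features such as the missile's guidance law."""
    def __init__(self):
        self.buf = deque([np.zeros(UAV_STATE_DIM, dtype=np.float32)] * SEQ_LEN, maxlen=SEQ_LEN)

    def push(self, obs):
        self.buf.append(np.asarray(obs, dtype=np.float32))

    def as_array(self):
        return np.stack(self.buf)  # shape (SEQ_LEN, UAV_STATE_DIM)

seq = StateSequence()
seq.push([0, 0, 8000, 0, 0, 0, 350, 11000, 0.3, -0.1])
policy_input = seq.as_array()      # fed to the actor network as one sequence
```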

3.3.2. Action Space

According to the discussion in Section 2, the state of the UAV is controlled by the axial overload, the longitudinal overload and the roll angular velocity. In addition, whether to launch the decoy is determined by the launch probability f, which is a value between 0 and 1. After the policy network gives the value of f, whether to launch the decoy is decided by the roulette wheel selection method. Therefore, the action space of the drone is
A_{\mathrm{UAV}} = [T, F_L, \dot{\phi}, f]
Similarly, the missile is controlled by the lateral overload and the longitudinal overload, and its action space is
A_{\mathrm{Missile}} = [n_y, n_z]
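The decoy channel of the UAV action can be realized as a simple stochastic gate, as sketched below; the action values shown are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def decide_decoy_launch(f):
    """Roulette-wheel decision: the policy outputs a launch probability f in [0, 1]
    and a decoy is released when a uniform draw falls below it."""
    return rng.uniform() < float(np.clip(f, 0.0, 1.0))

# The continuous channels (thrust, lift, roll rate) are applied directly;
# only the fourth component passes through the stochastic gate.
action = np.array([0.8, 0.6, -0.2, 0.35])   # illustrative, normalized action
launch_decoy = decide_decoy_launch(action[3])
```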

3.3.3. Reward Function and Termination Function

The reward function is a crucial component of reinforcement learning, while the termination function is an essential part of the simulation environment. The reward function provides an immediate feedback value (reward or penalty) to the agent when taking an action a in a given state s , guiding the agent to make the correct decisions. The termination function determines when the agent should end the current episode, marking the conclusion of a round.

3.3.4. Reward Function

In the actual process of evasion, we aim for the UAV to achieve the highest possible survival rate when defending against incoming missiles while minimizing the use of decoys. Therefore, in this study, penalties are assigned for UAV crashes, being hit by missiles, and deploying decoys, whereas successful missile defense is rewarded. The specific design of the reward function is as follows:
R_{\mathrm{UAV}}(s,a) = \begin{cases} -100, & r_d \le r_{\mathrm{exp}},\ r_d \in s \\ -100, & z_u \le 0,\ z_u \in s \\ -1, & u \sim U(0,1) \le f,\ f \in a \\ -n\exp(\dot{r}_d/1200), & \dot{r}_d \in s \\ +100, & r_d\ \text{monotonically increasing for 5 s} \\ 0, & \text{otherwise} \end{cases}
The pursuit-evasion game between the missile and the drone can be regarded as a zero-sum game. Therefore, the reward function of the missile is
R_{\mathrm{Missile}} = -R_{\mathrm{UAV}}
In the equation, v_{\mathrm{approach}} represents the missile's closing velocity. When two missiles appear simultaneously, the closing velocity for each missile must be calculated separately and then summed.
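Since the exact shaping constants are not fully legible from the typeset equation, the following sketch only illustrates the structure of the reward described above (terminal reward and penalties, a per-decoy cost, and range-rate shaping); the thresholds and the shaping form are assumptions, and the missile reward is simply the zero-sum counterpart.

```python
import math

# Illustrative values; the lethal radius and terminal magnitudes follow the text above.
LETHAL_RADIUS = 50.0
HIT_PENALTY = -100.0
CRASH_PENALTY = -100.0
ESCAPE_BONUS = 100.0
DECOY_COST = -1.0

def uav_reward(r_d, r_d_rate, z_u, decoy_launched, escaped):
    """Structure of the evasion reward: terminal events dominate, each decoy costs a
    little, and a shaping term (assumed form) favours an increasing UAV-missile range."""
    if r_d <= LETHAL_RADIUS:
        return HIT_PENALTY
    if z_u <= 0.0:
        return CRASH_PENALTY
    if escaped:                              # range increased monotonically for 5 s
        return ESCAPE_BONUS
    reward = DECOY_COST if decoy_launched else 0.0
    reward -= math.exp(-r_d_rate / 1200.0)   # small when the range opens quickly (assumed)
    return reward

def missile_reward(*args, **kwargs):
    """Zero-sum counterpart used for the pursuing agent."""
    return -uav_reward(*args, **kwargs)
```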

3.4. Policy and Value Network Structure

In the missile evasion mission, when a UAV makes an evasion decision, it not only needs to consider the relative situation between the UAV itself and the missile, but also needs to take into account information such as the guidance mode of the missile and the influence of decoy flares on the missile. The relative situation between the missile and the UAV can be directly obtained through observation, while characteristics such as the guidance mode of the missile and the interaction between it and the decoy flares are hidden in the missile’s response to the UAV’s maneuvering actions over a certain period of time. Even though this paper introduces the RPPO algorithm to deal with the potential randomness in the missile evasion problem, in the actual application of the algorithm, it is still necessary to process the state sequence over a certain period of time to ensure that the intelligent agent can learn a sufficiently good evasion strategy.
As shown in Figure 4 and Figure 5, when designing the Actor network and Critic network of the RPPO algorithm, we designed serialized state input signals and introduced the GRU module and the multi-head attention mechanism. The state sequence is encoded through the GRU module, and the multi-head attention mechanism is used to selectively focus on the historical information from the state sequence to better extract the hidden features [26]. This enables the intelligent agent to effectively perceive the changes in the motion laws of the missile before and after being interfered by the decoy flares and recognize the guidance laws of the missile from the motion situation sequence of the missile. As a result, the evasion strategy has a better adaptability to complex situations and has good robustness against diverse incoming missiles.
The RPPO algorithm is a stochastic policy algorithm. The Actor network first encodes the state sequence through the MLP layer, and then uses a network based on the multi-head attention mechanism to extract the temporal features in the state sequence. In addition, a feature network composed of a fully connected layer and a GRU layer is used to extract the spatial features contained in the states at different moments. After these features are concatenated, the fully connected network is used to calculate the distribution parameters of each action, and the final action instruction is obtained after sampling through these parameters.
Different from the Actor network that outputs the probability distribution of actions, the Critic network outputs an estimate of the value function, and this estimated value reflects the quality of the current state. As shown in Algorithm 1, the RPPO algorithm guides the optimization of the policy network through the value function. The addition of the GRU layer and the multi-head attention mechanism enables it to better evaluate the situation in the missile evasion problem, allowing the intelligent agent to learn and adjust the strategy more effectively. At the same time, it endows the network with stronger representation ability, which can effectively enhance the adaptability of the intelligent agent to complex environments and diverse threats.
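A condensed PyTorch sketch of the actor described above is shown below; the embedding size, number of heads, and the way the attention and GRU branches are fused are illustrative assumptions, and the critic follows the same structure with a single scalar output head.

```python
import torch
import torch.nn as nn

class AttentionGRUActor(nn.Module):
    """MLP encoder per time step, a multi-head self-attention branch and a GRU branch
    over the state sequence, concatenated and mapped to Gaussian action parameters."""
    def __init__(self, state_dim=10, act_dim=4, embed=64, heads=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                     nn.Linear(embed, embed), nn.ReLU())
        self.attn = nn.MultiheadAttention(embed, heads, batch_first=True)
        self.gru = nn.GRU(embed, embed, batch_first=True)
        self.mu_head = nn.Linear(2 * embed, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, seq):                        # seq: (batch, T, state_dim)
        h = self.encoder(seq)                      # per-step embeddings
        attn_out, _ = self.attn(h, h, h)           # temporal features via self-attention
        _, gru_last = self.gru(h)                  # recurrent summary of the sequence
        feat = torch.cat([attn_out[:, -1], gru_last.squeeze(0)], dim=-1)
        return torch.distributions.Normal(self.mu_head(feat), self.log_std.exp())

dist = AttentionGRUActor()(torch.randn(2, 8, 10))
action = dist.sample()                             # [T, F_L, roll rate, decoy probability] before scaling
log_prob = dist.log_prob(action).sum(-1)
```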

4. Experiments

4.1. Experiment Configuration

We trained the evasion strategy in a simulation environment constructed according to the mathematical models in Section 2, using the dual-population adversarial game and the RPPO algorithm based on the multi-head attention mechanism. The hyperparameters used in the algorithm are shown in Table 1.
In the simulation scenario, the initial position of the UAV is set at 30 degrees north latitude, 120 degrees east longitude, and an altitude of 8000 m. The initial heading is due east, and the initial speed is 350 m/s. The initial speed of the missile is 800 m/s, and it is generated at a distance of 10–12 km from the UAV. The initial line-of-sight deflection angle between the missile and the UAV is randomly generated within the range of [−π, π], and the initial line-of-sight inclination angle is randomly generated within the range of [−π/4, π/4], both following a uniform distribution. The detailed initial situation is shown in Table 2 and Figure 6.
The UAV carries 30 decoy flares. The lethal radius of the missile is 50 m. At the initial moment, the velocity vector of the missile is parallel to the line of sight and points towards the UAV. When the flight altitude of the UAV is less than 0 m or the distance between the UAV and the missile is less than the lethal radius of the missile, it is considered that the UAV has been destroyed and the simulation ends; when the distance between the UAV and the missile increases monotonically for 5 consecutive seconds, it is considered that the UAV has successfully evaded the missile and the simulation ends.
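The termination logic can be implemented as a small monitor, as sketched below; the 0.02 s simulation step is an illustrative assumption.

```python
class EpisodeMonitor:
    """Destroyed below 0 m altitude or inside the 50 m lethal radius; escaped once the
    UAV-missile range has increased monotonically for 5 consecutive seconds."""
    def __init__(self, dt=0.02, lethal_radius=50.0, escape_window=5.0):
        self.dt = dt
        self.lethal_radius = lethal_radius
        self.required_steps = int(round(escape_window / dt))
        self.increasing_steps = 0
        self.prev_range = None

    def step(self, uav_altitude, uav_missile_range):
        if uav_altitude <= 0.0 or uav_missile_range <= self.lethal_radius:
            return "destroyed"
        if self.prev_range is not None and uav_missile_range > self.prev_range:
            self.increasing_steps += 1
        else:
            self.increasing_steps = 0
        self.prev_range = uav_missile_range
        if self.increasing_steps >= self.required_steps:
            return "escaped"
        return "running"

monitor = EpisodeMonitor()
status = monitor.step(uav_altitude=8000.0, uav_missile_range=11000.0)
```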
We conducted the algorithm training on 8 NVIDIA GeForce RTX 4090 graphics cards. We carried out simulated adversarial battles in 3000 parallel simulation environments to improve the efficiency of data collection. The detailed software and hardware configuration is shown in Table 3.
The superiority of the method in this paper is demonstrated by comparing the training efficiency of the method in this paper with that of other methods, as well as indicators such as the survival rate when dealing with different types of missiles. The detailed information of the comparison schemes is as follows.
Scheme 1: Only use the basic PPO algorithm. The network structure is the same as that of the method in this paper. Training is carried out in a simulation environment with only a single type of missile. The missile is controlled by the proportional navigation guidance law, and the proportional navigation constant is 3.
Scheme 2: Use the MADDPG algorithm to train the UAV and the missile through a method similar to self-play. The network structure used in the PPO algorithm is the same as that of the method in this paper.

4.2. Training Results

As can be seen from Figure 7, the survival rate of Scheme 1 increased rapidly in the initial stage of training. After about 200,000 steps of interaction, the survival rate had risen to 60%. In the subsequent 500,000 steps, the survival rate increased slowly and finally stabilized at about 68%. The survival rate of Scheme 2 slowly increased to 42% during the first 200,000 steps of training. Subsequently, the rate of increase in the survival rate accelerated, and it rapidly increased to 70% during the next 150,000 steps of training, exceeding that of Scheme 1. In the subsequent training, the rate of increase in the survival rate significantly slowed down and finally reached 81%. The survival rate of the method in this paper increased to about 45% during the first 150,000 steps of training, and the rate of increase was slower than that of Scheme 1. However, it rapidly increased to 76% during the subsequent 50,000 steps of training. After that, the rate of increase slowed down, and finally, the survival rate reached 94%.
The training results show that although the method in this paper has limited effect on improving the survival rate in the initial stage of training, after collecting enough data and conducting sufficient preliminary training, the method in this paper shows significant advantages in terms of the improvement rate. Moreover, due to the diversity of the missile agents in the dual-population adversarial game, the evasion strategy trained by the evasion method in this paper can adapt to different types of missiles, and the survival rate against different missiles in the test reaches a high level. In contrast, the evasion strategy trained based on the MAPPO algorithm in the comparison Scheme 2 has a relatively single opponent. In the later stage of training, the evasion strategy has a certain degree of overfitting to the opponent, which affects the diversity of the evasion measures, resulting in a lack of ability to deal with different types of missiles. The evasion strategy trained by the PPO algorithm and a specific type of missile in Scheme 1 can only ensure effective evasion of this type of missile. When the type of missile changes, the optimality of the evasion strategy cannot be guaranteed, so it is unable to effectively deal with different types of missiles.

4.3. Survival Rate Statistics

Furthermore, we statistically analyzed the performance of the evasion strategies trained by the three schemes when dealing with several different types of missiles. For each type of missile, the three schemes each carried out 5000 evasion experiments. In each experiment, the initial situation was randomly generated in the same way as during the training, and the results of the survival rate regarding the azimuth angle of the incoming missile are shown in Figure 8, Figure 9 and Figure 10.
Figure 8 shows the survival rates of the three evasion strategies when dealing with missiles controlled by the proportional navigation guidance law with different parameters. For each group of experiments, the survival rates of the various strategies are relatively low when the azimuth angle of the incoming missile is near 0 degrees and 360 degrees. This is in line with the analysis in Ref. Keeping the flight direction perpendicular to the line of sight forces the missile to make large-scale maneuvers, consuming the missile's energy to the greatest extent; with the release of appropriate decoy flares at that moment, the pursuit of the missile can then be effectively shaken off. The evasion strategy of Scheme 1 performs well in Figure 8a because the missile used in this experiment is the same as the missile used during the training of Scheme 1, and the evasion strategy trained by Scheme 1 is completely focused on evading this missile. By comparing the performances in Figure 8b,c, it can be seen that the performance of the method in this paper is relatively stable under different missiles, while the performance of the evasion strategy of Scheme 1 drops significantly when facing missiles different from those in the training environment.
Figure 9 and Figure 10 show the performances of the three strategies when dealing with missiles using the parallel approaching method and the pure tracking method with different parameters. In these two groups of experiments, the trends of the survival rates shown by the method in this paper and Scheme 2 are consistent with those in Figure 8, indicating that both can effectively deal with various types of missiles. However, due to the relatively single opponent during the training process of the evasion strategy of Scheme 2, a certain degree of overfitting has occurred, which affects its approximation of the optimal evasion strategy. The survival rates of the evasion strategy trained by the method in this paper in each group of experiments are significantly higher than those of Scheme 2, showing better adaptability and diversity.

4.4. Simulation Sample

Figure 11 and Figure 12 show the specific performance of the evasion strategy trained by the method in this paper in the missile evasion experiment. As shown in Figure 11, the missile approaches from the rear of the UAV. The UAV makes a diving maneuver to convert gravitational potential energy into kinetic energy, attempting to increase its own speed, and releases decoy flares to deceive the missile while the missile is still at a relatively long distance. However, the effect is limited. The UAV continues to perform maneuvering actions and starts to release decoy flares intensively at the 8th second. At this time, the missile is relatively close to the UAV; the decoy flares successfully cause the missile to deviate from the target, and the UAV successfully evades the incoming missile.
In Figure 12, the missile approaches from the rear side. The UAV immediately turns towards the direction of the incoming missile and dives to accelerate. After the missile gets closer, the UAV starts to release decoy flares and continues to turn left and dive, maneuvering in the direction perpendicular to the line connecting the missile and the decoy flares to maximize the influence of the decoy flares on the missile. Finally, after being deceived by the decoy flares, the missile loses the target, and the UAV successfully completes the evasion.

5. Conclusions

This study develops an adversarial gaming framework consisting of an escaping-strategy population for the controlled UAV and an intercepting-strategy population for the missiles, employing a risk-sensitive PPO algorithm to train coordinated maneuver and decoy evasion strategies. To address the issue of cyclic counteraction in population-based adversarial training, we implement a rank-based (ELO) reward correction mechanism. The incorporation of multi-head attention mechanisms significantly enhances the policy network's capability in state-sequence modeling and strategy representation. Our method employs a policy network for maneuver-based evasion decision-making, which incurs low computational overhead during the inference phase. The decision-relevant information can be effectively acquired through onboard active radar and other sensors available on the UAV, indicating strong practical applicability.
Comparative experiments demonstrate that our method achieves faster convergence to effective evasion strategies than both self-play approaches and the original PPO algorithm. In comprehensive testing scenarios comprising various missile types, the proposed method attains a 94% successful evasion rate. Simulation results reveal that our trained strategies exhibit advanced tactical proficiency, effectively combining decoy deployment with evasive maneuvers to counter incoming missiles.
Furthermore, evasion tests against missiles with different guidance laws and parameters confirm that our approach produces strategies with superior generalization capability, consistently maintaining high performance across diverse missile types and configurations. The experimental validation suggests that our method significantly outperforms conventional approaches in both training efficiency and operational effectiveness.

Author Contributions

Conceptualization, C.Z. and Z.S.; Methodology, J.S., C.T. and W.F.; Software, C.Z.; Validation, C.Z., J.S., Z.X. and Z.Z.; Formal analysis, C.T.; Investigation, C.Z., Z.X. and W.F.; Resources, C.T.; Data curation, J.S.; Writing—original draft, C.Z. and J.S.; Writing—review & editing, Y.X.; Visualization, Z.S.; Supervision, Y.X.; Project administration, Y.X.; Funding acquisition, Z.Z. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Young Scientists Fund of the National Natural Science Foundation of China, grant number 2302506.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

DURC Statement

The current research is limited to the pursuit-and-escape game, which benefits the intelligent control of drones and does not pose a threat to public health or national security. The authors acknowledge the dual-use potential of research involving missile evasion and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, the authors strictly adhere to relevant national and international laws concerning DURC. The authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shima, T. Optimal cooperative pursuit and evasion strategies against a homing missile. J. Guid. Control. Dyn. 2011, 34, 414–425. [Google Scholar] [CrossRef]
  2. Tian, Z.; Danino, M.; Bar-Shalom, Y.; Milgrom, B. Missile Threat Detection and Evasion Maneuvers With Countermeasures for a Low-Altitude Aircraft. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 7352–7362. [Google Scholar] [CrossRef]
  3. Turetsky, V.; Shima, T. Hybrid evasion strategy against a missile with guidance law of variable structure. In Proceedings of the 2016 American Control Conference (ACC), Boston, MA, USA, 6–8 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3132–3137. [Google Scholar]
  4. Li, A.; Hu, X.; Ran, J.; Wei, P.; Yuan, X. Receding Horizon Control Based Real-time Strategy for Missile Pursuit-evasion Game. In Proceedings of the 2024 IEEE 25th China Conference on System Simulation Technology and its Application (CCSSTA), Tianjin, China, 21–23 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 408–415. [Google Scholar]
  5. Du, Q.; Hu, Y.; Jing, W.; Gao, C. Three-dimensional target evasion strategy without missile guidance information. Aerosp. Sci. Technol. 2025, 157, 109857. [Google Scholar] [CrossRef]
  6. Chen, N.; Li, L.; Mao, W. Equilibrium Strategy of the Pursuit-Evasion Game in Three-Dimensional Space. IEEE/CAA J. Autom. Sin. 2024, 11, 446–458. [Google Scholar] [CrossRef]
  7. Yan, M.; Yang, R.; Zhang, Y.; Yue, L.; Hu, D. A hierarchical reinforcement learning method for missile evasion and guidance. Sci. Rep. 2022, 12, 18888. [Google Scholar] [CrossRef] [PubMed]
  8. Gong, X.; Chen, W.; Chen, Z. Intelligent game strategies in target-missile-defender engagement using curriculum-based deep reinforcement learning. Aerospace 2023, 10, 133. [Google Scholar] [CrossRef]
  9. Wang, Y.; Zhao, K.; Guirao, J.L.; Pan, K.; Chen, H. Online intelligent maneuvering penetration methods of missile with respect to unknown intercepting strategies based on reinforcement learning. Electron. Res. Arch. 2022, 30, 4366–4381. [Google Scholar] [CrossRef]
  10. Jain, G.; Kumar, A.; Bhat, S.A. Recent developments of game theory and reinforcement learning approaches: A systematic review. IEEE Access 2024, 12, 9999–10011. [Google Scholar] [CrossRef]
  11. Sharma, R.; Gopal, M. Synergizing reinforcement learning and game theory—A new direction for control. Appl. Soft Comput. 2010, 10, 675–688. [Google Scholar] [CrossRef]
  12. Nowé, A.; Vrancx, P.; De Hauwere, Y.M. Game theory and multi-agent reinforcement learning. In Reinforcement Learning: State-of-the-Art; Springer: Berlin/Heidelberg, Germany, 2012; pp. 441–470. [Google Scholar]
  13. Dolinger, G.; Stringer, A.; Sharp, T.; Karch, J.; Metcalf, J.G.; Bowersox, A. Collaborative Game Theory and Reinforcement Learning Improvements for Radar Tracking. In Proceedings of the 2023 IEEE International Radar Conference (RADAR), Sydney, Australia, 6–10 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  14. Cao, Y.; Tao, C. Reinforcement learning and game theory based cyber–physical security framework for the humans interacting over societal control systems. Front. Energy Res. 2024, 12, 1413576. [Google Scholar] [CrossRef]
  15. Zhao, X.; Hu, S.; Cho, J.H.; Chen, F. Uncertainty-based decision making using deep reinforcement learning. In Proceedings of the 2019 22th International Conference on Information Fusion (FUSION), Ottawa, ON, Canada, 2–5 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar]
  16. Fonod, R.; Shima, T. Multiple model adaptive evasion against a homing missile. J. Guid. Control. Dyn. 2016, 39, 1578–1592. [Google Scholar] [CrossRef]
  17. Yan, T.; Jiang, Z.; Li, T.; Gao, M.; Liu, C. Intelligent maneuver strategy for hypersonic vehicles in three-player pursuit-evasion games via deep reinforcement learning. Front. Neurosci. 2024, 18, 1362303. [Google Scholar] [CrossRef] [PubMed]
  18. Jia, Y.; Dong, Y. Optimal Capture Strategy Design Based on Reinforcement Learning in the Pursuit-Evasion Game with Unknown Dynamics. In Proceedings of the 2024 American Control Conference (ACC), Toronto, ON, Canada, 10–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2685–2690. [Google Scholar]
  19. Ho, J.; Wang, C.M.; King, C.T.; You, Y.H.; Feng, C.W.; Chen, Y.M.; Kuo, B.Y. Learning Adaptation and Generalization from Human-Inspired Meta-Reinforcement Learning Using Bayesian Knowledge and Analysis. In Proceedings of the 2023 IEEE Sixth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Laguna Hills, CA, USA, 25–27 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 9–16. [Google Scholar]
  20. Yang, Z.; Ren, K.; Luo, X.; Liu, M.; Liu, W.; Bian, J.; Zhang, W.; Li, D. Towards applicable reinforcement learning: Improving the generalization and sample efficiency with policy ensemble. arXiv 2022, arXiv:2205.09284. [Google Scholar]
  21. McClellan, J.; Haghani, N.; Winder, J.; Huang, F.; Tokekar, P. Boosting sample efficiency and generalization in multi-agent reinforcement learning via equivariance. arXiv 2024, arXiv:2410.02581. [Google Scholar]
  22. Prajapati, R.; El-Wakeel, A.S. Cloud-based Federated Learning Framework for MRI Segmentation. arXiv 2024, arXiv:2403.00254. [Google Scholar]
  23. Hu, Q.; Li, R.; Deng, Q.; Zhao, Y.; Li, R. Enhancing Network by Reinforcement Learning and Neural Confined Local Search. In Proceedings of the IJCAI, Macao, China, 19–25 August 2023; pp. 2122–2132. [Google Scholar]
  24. Heinrich, J.; Lanctot, M.; Silver, D. Fictitious self-play in extensive-form games. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Birmingham, UK, 2015; pp. 805–813. [Google Scholar]
  25. Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, Z.; Xu, Y.; Song, J.; Zhou, Q.; Rasol, J.; Ma, L. Planet craters detection based on unsupervised domain adaptation. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 7140–7152. [Google Scholar] [CrossRef]
Figure 1. Torque on the UAV.
Figure 2. Control Principles of Longitudinally Symmetrical UAV.
Figure 3. Diagram of the group game.
Figure 4. Structure of critic network.
Figure 5. Structure of actor network.
Figure 6. Initial situation diagram.
Figure 7. Training result of three schemes.
Figure 8. Survival rate distribution of the circumvented missile governed by proportional guidance law.
Figure 9. Survival rate distribution of the circumvented missile governed by parallel approaching method.
Figure 10. Survival rate distribution of the circumvented missile governed by pure tracking method.
Figure 11. Situation display of UAV avoidance process 1.
Figure 12. Situation display of UAV avoidance process 2.
Table 1. Hyperparameters in training.
Parameter | Value
Actor network learning rate | 1 × 10⁻⁴
Critic network learning rate | 3 × 10⁻⁵
Replay buffer size | 5 × 10⁷
Population size | 30
Maximum number of steps | 30
Parameter update steps | 100
Table 2. Initial state setting.
Parameter | Value
Missile’s initial velocity | 800 m/s
UAV’s initial velocity | 350 m/s
Initial line-of-sight deflection angle | [−π, π]
Initial line-of-sight inclination angle | [−π/4, π/4]
Max number of decoys | 30
Table 3. Software and hardware configuration.
Software Item | Version | Hardware Item | Model
Operating system | Windows 11 10.0.22631 | CPU | Intel(R) Core(TM) i9-10900K @ 3.70 GHz
Python | 3.12.3 v.1938 64 bit | GPU | NVIDIA GeForce RTX 4090 × 8
PyTorch | 2.4.1 | Internal memory | Kingston FURY DDR4 3600 16 GB × 2
CUDA Toolkit | 11.8 | Hard disk | ZhiTai PCIe 4.0 1 TB SSD
Tacview | 1.9.4 | |