Article

Autonomous Maneuvering Decision-Making Method for Unmanned Aerial Vehicle Based on Soft Actor-Critic Algorithm

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2026, 10(1), 35; https://doi.org/10.3390/drones10010035
Submission received: 22 October 2025 / Revised: 25 December 2025 / Accepted: 26 December 2025 / Published: 6 January 2026

Highlights

What are the main findings?
  • Proposed a continuous control strategy based on the Soft Actor-Critic (SAC) algorithm for UAV autonomous maneuvering in 1v1 tactical encounter scenarios.
  • Introduced a multi-dimensional situation-coupled reward function with a Health Point (HP) system to quantitatively evaluate situational advantages and cumulative tactical performance.
What are the implications of the main findings?
  • Enhanced decision-making precision and robustness under both ideal and noisy observation conditions.
  • Demonstrated the effectiveness of SAC in handling high-dimensional state spaces and generating precise maneuvering commands.

Abstract

Focusing on continuous action space methods for autonomous maneuvering decision making in 1v1 unmanned aerial vehicle scenarios, this paper first establishes a UAV kinematic model and a decision-making framework under the Markov Decision Process. Second, a continuous control strategy based on the Soft Actor-Critic (SAC) reinforcement learning algorithm is developed to generate precise maneuvering commands. Then, a multi-dimensional situation-coupled reward function is designed, introducing a Health Point (HP) metric to assess situational advantages and simulate cumulative effects quantitatively. Finally, extensive simulations in a custom Gym environment validate the effectiveness of the proposed method and its robustness under both ideal and noisy observation conditions.

1. Introduction

Advances in artificial intelligence are facilitating the intelligent evolution of unmanned systems. Unmanned aerial vehicles (UAVs) are extensively employed in both offensive/defensive maneuvering and collaborative missions [1]. With significant enhancements in UAV maneuverability, close-range decision making between UAVs is characterized by high dynamics and high overload, imposing rigorous requirements on the real-time performance and flexibility of UAV autonomous maneuvering decision-making systems [2]. Currently, three primary approaches address UAV close-range maneuver decision making: game theory, optimization search, and data-driven methods [3]. Game theory methods encompass differential games and matrix games. While both can model continuous, dynamic autonomous aerial maneuvering decision processes and derive analytical solutions, they struggle with high-dimensional state spaces [4]. Optimization search methods formulate autonomous aerial maneuvering objectives as multi-objective optimization problems, employing techniques like heuristic search and expert systems [5]. However, their performance is constrained by search space limitations and rule design complexity. Data-driven methods, including Deep Learning (DL) and Deep Reinforcement Learning (DRL), generate maneuver policies autonomously by learning from UAV maneuvering data. However, DL approaches are limited by the quality of the training data used in behavioral cloning. In contrast, DRL typically requires minimal external data, acquiring experience autonomously through agent–environment interaction. This capability shows potential for identifying advantageous maneuvers, making DRL a prominent research focus globally.
However, current deep reinforcement learning (DRL) research on UAV close-range maneuvering decision making primarily employs discrete action space methods, such as Deep Q-Network (DQN) [6], Actor-Critic (AC) [7], or Proximal Policy Optimization (PPO) [8]. The fixed action space inherently restricts the precision of generated maneuver commands. Studies by Yang et al. [9] and Gao et al. [10] leverage continuous action space algorithms (Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), respectively) to enhance command accuracy. Nevertheless, both studies adopt an oversimplified victory condition: the friendly side wins as soon as it enters the enemy's attack zone and meets the weapon launch conditions. This assumption neglects battlefield uncertainties, specifically fluctuations in weapon hit probability that make damage outcomes random, thereby compromising the robustness validation of the proposed strategies.
To address these limitations, this paper employs a multidimensional situation-coupled reward function and the Soft Actor-Critic (SAC) algorithm to generate maneuver decision commands in a continuous action space. To mitigate the impact of outcome randomness, the UAV's Health Points are introduced as an indicator of situational advantage. Furthermore, simulation experiments validate the robustness of the SAC algorithm's decisions under both fully observable inputs and conditions with observational errors.
This study investigates reinforcement learning-based autonomous maneuver decision-making methods applicable to close-range UAV pursuit and evasion scenarios. The main contributions are as follows:
  • This paper completed the modeling of UAV kinematics and the engagement scenario, as well as the construction of a discrete maneuver command library.
  • This paper implemented SAC deep reinforcement learning within a continuous action space using a multidimensional coupled reward function to generate maneuver decision commands.
  • This paper established a Gym simulation environment to validate the robustness of SAC/TD3 reinforcement learning algorithms under both error-free and erroneous input state observations.

2. Problem Formulation

Figure 1 shows the decision-making framework for 1v1 UAV close-range maneuvering, comprising the environment, UAV kinematic models, and maneuvering decision models for both sides. The environment feeds real-time states and situational data to the decision models, which generate maneuver commands executed kinematically to update UAV states and environmental dynamics, closing the perception–decision–action loop.

2.1. UAV Maneuvering Model

Focusing on UAV maneuver decision making, this paper prioritizes positional/trajectory dynamics over attitude change mechanisms. Consequently, a point-particle model is adopted for kinematic and dynamic representations, as follows:
$$
\begin{aligned}
\dot{x}_g &= v\cos\theta\cos\psi \\
\dot{y}_g &= v\cos\theta\sin\psi \\
\dot{z}_g &= v\sin\theta \\
\dot{v} &= g\,(n_x - \sin\theta) \\
\dot{\theta} &= \frac{g}{v}\,(n_z\cos\gamma - \cos\theta) \\
\dot{\psi} &= \frac{g\,n_z\sin\gamma}{v\cos\theta}
\end{aligned}
$$
where θ denotes the pitch angle, defined as the angle between the body longitudinal axis (currently aligned with the velocity vector) and the horizontal plane x g O y g ; ψ represents the yaw angle, which is the angle between the projection of the body longitudinal axis onto the horizontal plane x g O y g and the axis O x g ; and γ is the roll angle (see Figure 2).
The variables n x and n z correspond to the tangential overload (aligned with the velocity vector) and the normal overload (perpendicular to the velocity vector within the longitudinal symmetry plane of the aircraft), respectively. The velocity components in the ground frame O x g y g z g are denoted by ( x ˙ , y ˙ , z ˙ ), v ˙ is the path acceleration, and θ ˙ and ψ ˙ are the angular rates of the pitch and yaw angles, respectively.
The gravitational acceleration is represented by g, and the dynamic model presented in Equation (1) enforces the following constraints on the state and control variables, as shown in the following Equation (2). While this simplified 3-DOF point-mass model exhibits certain limitations—such as not fully incorporating real-world elements like actuator dynamics (e.g., servo response delays), control latency, or environmental disturbances—it still holds significant value in the initial exploration of autonomous maneuvering decisions. This is particularly true in task scenarios requiring the isolation of underlying dynamic complexities to focus on high-level logic and algorithmic interpretability.
$$
v_{\min} \le v \le v_{\max}, \qquad -\frac{\pi}{2} \le \theta \le \frac{\pi}{2}, \qquad -\pi \le \psi \le \pi, \qquad n_{x\min} \le n_x \le n_{x\max}, \qquad n_{z\min} \le n_z \le n_{z\max}, \qquad -\pi \le \gamma \le \pi
$$
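For concreteness, a minimal Python sketch of Equations (1) and (2) using forward-Euler integration is given below; the state ordering, the time step, and the velocity and pitch limits are illustrative assumptions rather than the exact values used in this work.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def step_3dof(state, n_x, n_z, gamma, dt=0.1, v_lim=(50.0, 400.0)):
    """Forward-Euler step of the 3-DOF point-mass model in Equation (1).

    state = [x_g, y_g, z_g, v, theta, psi]; angles in radians.
    The velocity and pitch limits below are illustrative placeholders for
    the constraints in Equation (2), not the paper's configured values.
    """
    x, y, z, v, theta, psi = state

    x_dot = v * np.cos(theta) * np.cos(psi)
    y_dot = v * np.cos(theta) * np.sin(psi)
    z_dot = v * np.sin(theta)
    v_dot = G * (n_x - np.sin(theta))
    theta_dot = (G / v) * (n_z * np.cos(gamma) - np.cos(theta))
    psi_dot = G * n_z * np.sin(gamma) / (v * np.cos(theta))

    return np.array([
        x + x_dot * dt,
        y + y_dot * dt,
        z + z_dot * dt,
        np.clip(v + v_dot * dt, *v_lim),                     # v in [v_min, v_max]
        np.clip(theta + theta_dot * dt, -np.pi / 2, np.pi / 2),
        (psi + psi_dot * dt + np.pi) % (2 * np.pi) - np.pi,  # wrap psi to [-pi, pi]
    ])

# Example: straight-and-level command (n_x = sin(theta), n_z = cos(theta)/cos(gamma))
s0 = np.array([0.0, 0.0, 3000.0, 200.0, 0.0, 0.0])
s1 = step_3dof(s0, n_x=0.0, n_z=1.0, gamma=0.0)
```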

2.2. Autonomous Aerial Maneuvering Environment

The autonomous aerial maneuvering environment computes Health Points (HPs), relative states, and geometric situation angles from bilateral UAV states in (1), providing reference inputs for the maneuver decision-making part. Through situation assessment, it quantifies bilateral advantages.
Figure 3 depicts key situational information in autonomous maneuvering, where red and blue denote friendly and enemy UAVs, respectively. From the friendly UAV’s perspective, the line-of-sight ( LOS ) is defined as the vector pointing from the friendly to the enemy position; φ [ 0 , π ] represents the Antenna Train Angle (ATA), measuring the angular deviation between the friendly velocity vector v r and the LOS direction; q [ 0 , π ] denotes the Aspect Angle (AA), quantifying the enemy’s escape potential as the angle between the enemy velocity vector v b and the reverse LOS direction. The parameters φ max and q max define the angular constraints for the Weapon Engagement Zone (WEZ) and No Escape Zone (NEZ), while D min and D max specify the radial boundaries of these engagement zones.
$$
\varphi_r = \arccos\frac{\boldsymbol{v}_r\cdot\boldsymbol{LOS}}{\lVert\boldsymbol{v}_r\rVert\,\lVert\boldsymbol{LOS}\rVert}, \qquad
q_r = \arccos\frac{\boldsymbol{v}_b\cdot\boldsymbol{LOS}}{\lVert\boldsymbol{v}_b\rVert\,\lVert\boldsymbol{LOS}\rVert}
$$
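A small Python helper implementing Equation (3) is sketched below; the vector layout and the numerical safeguards (epsilon and clipping) are implementation conveniences assumed for this example.

```python
import numpy as np

def situation_angles(p_r, v_r, p_b, v_b):
    """Antenna Train Angle (ATA) and Aspect Angle (AA) as in Equation (3).

    p_r, v_r: friendly (red) position and velocity vectors, shape (3,)
    p_b, v_b: enemy (blue) position and velocity vectors, shape (3,)
    Returns (phi_r, q_r) in radians, both in [0, pi].
    """
    los = p_b - p_r                                   # line of sight, red -> blue
    unit = lambda x: x / (np.linalg.norm(x) + 1e-9)   # epsilon avoids division by zero

    phi_r = np.arccos(np.clip(np.dot(unit(v_r), unit(los)), -1.0, 1.0))  # ATA
    q_r = np.arccos(np.clip(np.dot(unit(v_b), unit(los)), -1.0, 1.0))    # AA
    return phi_r, q_r

# Tail-chase example: red directly behind blue, both flying along +x
phi, q = situation_angles(np.array([0., 0., 3000.]), np.array([200., 0., 0.]),
                          np.array([1000., 0., 3000.]), np.array([200., 0., 0.]))
# phi ~ 0 (nose on the LOS), q ~ 0 (blue flying away along the LOS)
```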
To comprehensively quantify the cumulative impact of adverse situational factors during dynamic interactions between the two parties, a Health Point (HP) metric is introduced, denoted as H r for the friendly side and H b for the enemy side. Lower HP values indicate greater accumulated performance loss due to sustained exposure to threat conditions. The simulation terminates when either side's HP drops to zero or below, when a UAV exceeds the airspace limits, or when the maximum number of simulation steps K is reached.
The HP update mechanism is governed by real-time relative geometric relationships, specifically including φ , q, and inter-UAV distance d. The following Equation (4) defines the HP update rule:
$$
H_k(H_{k-1},\varphi,q,d) =
\begin{cases}
H_{k-1} - \Delta H_{\max}\,\dfrac{f_{\mathrm{att}}(\varphi) + f_{\mathrm{esp}}(q) + f_{\mathrm{dist}}(d)}{3}, & \varphi\in[0,\varphi_{\max}]\ \wedge\ d\in[d_{\min},d_{\max}] \\
H_{k-1}, & \text{otherwise}
\end{cases}
$$
where H k represents current health at step k, H k 1 denotes previous health, and Δ H max is the maximum possible health deduction.
To accurately model the contribution degree of a single interaction to HP attenuation, three normalized attenuation factors determine the HP reduction magnitude. The ATA angle factor f att follows a sigmoid response curve: f att represents the threat level based on the Antenna Train Angle (ATA). Smaller φ values indicate better alignment between the velocity vector and the line of sight of our own UAV, thereby enhancing attack accuracy. The sigmoid function models the nonlinear increase in threat as φ approaches zero, mimicking real sensor sensitivity where minor angular deviations significantly affect weapon efficacy. The parameter k φ controls the steepness of the curve, tuned according to empirical data from UAV engagements.
$$
f_{\mathrm{att}}(\varphi) = 1 - \frac{1}{1 + e^{-k_\varphi\left(\varphi - 0.5\,\varphi_{\max}\right)}}
$$
The AA angle factor f esp implements piecewise-linear attenuation: f esp quantifies the threat level from our side to the enemy’s escape potential via the Aspect Angle (AA). Smaller q values (e.g., below 20 ) indicate that the enemy is not fleeing directly, making it more vulnerable to attack; thus f esp attains higher values (closer to 1). The piecewise-linear design reflects tactical realism: at lower q , our advantage is maximized; as q increases, uncertainty in enemy maneuvers increases; at higher q , the threat diminishes as the enemy escapes.
$$
f_{\mathrm{esp}}(q) =
\begin{cases}
1, & q \le 20^\circ \\
1 - k_{q1}\,(q - 20^\circ)/70^\circ, & q \in (20^\circ, 90^\circ] \\
0.2 - k_{q2}\,(q - 90^\circ)/60^\circ, & q \in (90^\circ, 150^\circ] \\
0.05, & q > 150^\circ
\end{cases}
$$
The distance factor f dist employs exponential decay centered at optimal engagement distance: f dist captures the effect of distance on weapon efficacy. Maximum damage occurs near d opt , with rapid decay outside this range due to aerodynamic and sensor limitations. The exponential decay aligns with physical weapon characteristics, such as reduced hit probability at extreme distances. The parameters k 1 , k 2 , p , and r are calibrated to fit the engagement envelope of the UAV’s cannon.
$$
f_{\mathrm{dist}}(d) =
\begin{cases}
e^{-k_1\left(\frac{d_{\mathrm{opt}} - d}{d_{\mathrm{opt}} - d_{\min}}\right)^{p}}, & d \in [d_{\min}, d_{\mathrm{opt}}] \\
e^{-k_2\left(\frac{d - d_{\mathrm{opt}}}{d_{\max} - d_{\mathrm{opt}}}\right)^{r}}, & d \in (d_{\mathrm{opt}}, d_{\max}]
\end{cases}
$$
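A Python sketch of the HP attenuation mechanism in Equations (4)–(7) is given below; every numeric constant (k_φ, k_q1, k_q2, k_1, k_2, p, r, ΔH_max, and the angular and distance bounds) is an illustrative placeholder rather than a calibrated value from the paper.

```python
import numpy as np

# Illustrative parameters only; the paper's calibrated values are not reproduced here.
PHI_MAX = np.radians(30.0)
D_MIN, D_OPT, D_MAX = 200.0, 800.0, 1200.0
DELTA_H_MAX = 5.0
K_PHI, K_Q1, K_Q2 = 10.0, 0.8, 0.15
K1, K2, P, R = 3.0, 3.0, 2.0, 2.0

def f_att(phi):
    """ATA factor, Equation (5): close to 1 when phi is small."""
    return 1.0 - 1.0 / (1.0 + np.exp(-K_PHI * (phi - 0.5 * PHI_MAX)))

def f_esp(q_deg):
    """AA factor, Equation (6): piecewise-linear in the aspect angle (degrees)."""
    if q_deg <= 20.0:
        return 1.0
    if q_deg <= 90.0:
        return 1.0 - K_Q1 * (q_deg - 20.0) / 70.0
    if q_deg <= 150.0:
        return 0.2 - K_Q2 * (q_deg - 90.0) / 60.0
    return 0.05

def f_dist(d):
    """Distance factor, Equation (7): peaks at d_opt, decays toward both bounds."""
    if d <= D_OPT:
        return np.exp(-K1 * ((D_OPT - d) / (D_OPT - D_MIN)) ** P)
    return np.exp(-K2 * ((d - D_OPT) / (D_MAX - D_OPT)) ** R)

def update_hp(hp_prev, phi, q_deg, d):
    """HP update rule of Equation (4): deduct HP only inside the engagement envelope."""
    if 0.0 <= phi <= PHI_MAX and D_MIN <= d <= D_MAX:
        return hp_prev - DELTA_H_MAX * (f_att(phi) + f_esp(q_deg) + f_dist(d)) / 3.0
    return hp_prev
```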
The opponent's health is updated analogously using the same equations but computed from the enemy's perspective (i.e., using φ b ). The three normalized attenuation factors in Equation (4) are shown in Figure 4. Simulation termination and outcome determination are governed by conditional logic based on real-time Health Points (HPs), step progression, and operational constraints. The simulation terminates under one of the following scenarios: (1) Timeout: the maximum step count K is reached without a decisive outcome; (2) Constraint violation: airspace boundaries are breached, such as altitude limits ( h < h min or h > h max ) or distance thresholds ( R > D t h or R < R min ), leading to a penalty for the operational breach; (3) HP depletion: the Health Points of either side drop to zero or below, resolving into a win (when H r > 0 and H b ≤ 0 ), a loss (when H r ≤ 0 and H b > 0 ), or a draw (when both H r ≤ 0 and H b ≤ 0 ). Otherwise, the simulation continues.
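The outcome resolution can be summarized in a short Python sketch; the function name, threshold names, and default values below are placeholders, not the paper's configured limits.

```python
def resolve_episode(hp_r, hp_b, step, k_max, h, r,
                    h_min=500.0, h_max=12000.0, r_min=50.0, d_th=20000.0):
    """Termination logic described in Section 2.2 (threshold values are illustrative).

    Returns None while the episode should continue, otherwise one of
    'timeout', 'boundary', 'win', 'lose', 'draw'.
    """
    if step >= k_max:
        return "timeout"
    if h < h_min or h > h_max or r > d_th or r < r_min:
        return "boundary"
    if hp_r <= 0.0 or hp_b <= 0.0:
        if hp_r > 0.0:
            return "win"
        if hp_b > 0.0:
            return "lose"
        return "draw"
    return None
```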

2.3. Opponent Maneuver Policy

The blue team applies Minimax search [11] with discrete actions. We expanded NASA’s seven basic maneuvers [12] into 27 action elements (see Table 1), assigning acceleration, constant-speed, and deceleration overload commands to each of the nine principal directions in three-dimensional space, where c > 1 denotes a constant parameter used for compound maneuvers.
The Minimax policy is an algorithmic prescription for perfect-information zero-sum games, guaranteeing the minimization of the maximum possible loss under the premise of an optimally counteracting opponent, and thus the maximization of the minimum payoff. Figure 5 illustrates the computational process of the Minimax decision-making policy executed by the adversarial blue UAV, through which it selects the current action that maximizes the projected situational advantage over a k-step time horizon.
In this single-round game-theoretic framework, both agents sequentially select elementary maneuvers from the maneuver action primitive library (Table 1) to be executed over k consecutive time steps (arrows in Figure 5). The middle column nodes represent the payoffs obtained after the blue agent executes one action choice, while the rightmost column nodes denote the payoffs acquired after the red agent (opponent) completes its decision making. When the red agent’s maneuvering action results in the most unfavorable position for the blue side at time k in the future, the algorithm selects the child node with the smallest payoff in the rightmost column for each parent node in the middle column. The blue agent’s action selection should then correspond to the node with the largest payoff in the middle column, thereby maximizing its minimum positional advantage at time k in the future.
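To illustrate the procedure in Figure 5, the following Python sketch performs the Minimax selection for the blue UAV over the discrete maneuver library; the `rollout` and `payoff` callables stand in for the paper's k-step state propagation and situation-advantage evaluation and are assumptions of this example.

```python
def minimax_action(blue_state, red_state, blue_actions, red_actions,
                   rollout, payoff, k_steps):
    """One-ply Minimax selection sketch for the blue UAV (Figure 5).

    `rollout(blue_state, red_state, a_b, a_r, k_steps)` is assumed to propagate
    both UAVs for k steps under fixed maneuvers and return the resulting pair of
    states; `payoff` scores the outcome from blue's perspective.  Both are
    placeholders for the paper's situation model.
    """
    best_action, best_value = None, float("-inf")
    for a_b in blue_actions:                      # blue maximizes ...
        worst = float("inf")
        for a_r in red_actions:                   # ... the worst case over red replies
            s_b, s_r = rollout(blue_state, red_state, a_b, a_r, k_steps)
            worst = min(worst, payoff(s_b, s_r))
        if worst > best_value:
            best_action, best_value = a_b, worst
    return best_action
```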

3. Methodologies

This section details the proposed reinforcement learning maneuver decision method, encompassing state/action space design, reward function formulation, and the SAC decision-making approach.

3.1. State Space and Action Space

This paper adopts a Markov Decision Process (MDP) framework under the assumption of fully observable states. The state input s is defined as a 27-dimensional continuous variable (see Equation (8)), comprising the internal friendly states s own from Equation (1), the friendly Health Point H r , and the friendly action from the previous step a t 1 , alongside sensor-derived observations such as the bilateral relative states s rel and the relevant geometric angles s geo .
$$
\begin{aligned}
s &= \left[s_{\mathrm{own}},\, s_{\mathrm{rel}},\, s_{\mathrm{geo}},\, a_{t-1}\right] \\
s_{\mathrm{own}} &= \left[x_r,\, y_r,\, z_r,\, v_r,\, \theta_r,\, \psi_r,\, H_r\right] \\
s_{\mathrm{rel}} &= \left[\Delta x,\, \Delta y,\, \Delta z,\, R_h,\, R,\, v_c,\, \Delta\dot{x},\, \Delta\dot{y},\, \Delta\dot{z},\, \Delta v\right] \\
s_{\mathrm{geo}} &= \left[\varphi_r,\, q_r,\, \theta_L,\, \psi_L,\, \theta_{VL},\, \psi_{VL},\, \mathrm{HCA}\right] \\
a_{t-1} &= \left[n_x,\, n_z,\, \gamma\right]
\end{aligned}
$$
Here, R h denotes the magnitude of the projection of the LOS onto the horizontal plane, while R represents the magnitude of the LOS (i.e., the three-dimensional distance between the two parties). The variable v c refers to the closure velocity between both sides, as defined in Equation (18). The terms Δx, Δy, and Δz correspond to the components of the difference between the opponent's position vector and our position vector along the O x , O y , and O z axes, respectively. Similarly, Δẋ, Δẏ, and Δż denote the components of the difference between the opponent's velocity vector and our velocity vector along the same axes.
The angle θ L indicates the angular deviation between the LOS and the horizontal plane x g O y g , while ψ L is the azimuth angle of the projection of the LOS onto the x g O y g plane relative to the O x g axis. Additionally, θ V L describes the angle between the projection of the friendly velocity vector and the projection of the LOS onto the y g O z g plane, while ψ V L denotes the angle between the friendly velocity vector and the projection of the LOS onto the x g O y g plane. Finally, HCA (Heading Crossing Angle) refers to the angular separation between the velocity vectors of the opponent and our own side. All these state components are normalized to s in ∈ [ 1 , 1 ] before being fed into the fully connected network (as shown in Equation (9)), where δ own , δ rel , and δ geo are constant normalization vectors.
$$
\begin{aligned}
\delta &= \left[\delta_{\mathrm{own}},\, \delta_{\mathrm{rel}},\, \delta_{\mathrm{geo}},\, \delta_a\right] \\
\delta_{\mathrm{own}} &= \left[\delta_x,\, \delta_y,\, \delta_z,\, \delta_v,\, \pi,\, 2\pi,\, \delta_{\mathrm{HP}}\right] \\
\delta_{\mathrm{rel}} &= \left[\delta_{\Delta x},\, \delta_{\Delta y},\, \delta_{\Delta z},\, \delta_{R_h},\, \delta_{R},\, v_{c\max},\, \delta_{\Delta\dot{x}},\, \delta_{\Delta\dot{y}},\, \delta_{\Delta\dot{z}},\, \delta_{\Delta v}\right] \\
\delta_{\mathrm{geo}} &= \left[2\pi,\, 2\pi,\, \pi,\, 2\pi,\, 2\pi,\, \pi,\, 2\pi\right] \\
\delta_a &= \left[n_{x\max},\, n_{z\max},\, \gamma_{\max}\right] \\
s_{\mathrm{in}} &= \tanh\!\left(s_i/\delta_i\right), \quad i \in \{1, \ldots, 27\}
\end{aligned}
$$
We define a continuous three-dimensional bounded action space for the friendly agent, parameterized as ( n x , n z , γ ) and subject to the constraints in Equation (2). This enables direct, precise control over the aircraft's states, while the opponent's actions are limited to a discrete maneuver library of finite commands (see Table 1).
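As an illustration, a custom Gym environment could declare these spaces roughly as follows; the symmetric overload bounds and the normalization helper are assumptions of this sketch rather than the exact settings of the paper.

```python
import numpy as np
from gym import spaces  # OpenAI Gym spaces API, as cited in Section 4

NX_MAX, NZ_MAX, GAMMA_MAX = 2.0, 8.0, np.pi   # illustrative bounds for Equation (2)

# Continuous 3-D action a = [n_x, n_z, gamma]
action_space = spaces.Box(
    low=np.array([-NX_MAX, -NZ_MAX, -GAMMA_MAX], dtype=np.float32),
    high=np.array([NX_MAX, NZ_MAX, GAMMA_MAX], dtype=np.float32),
    dtype=np.float32,
)

# 27-D observation, squashed to (-1, 1) by Equation (9)
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(27,), dtype=np.float32)

def normalize_state(s, delta):
    """Equation (9): element-wise squashing with per-dimension constants delta_i."""
    return np.tanh(np.asarray(s) / np.asarray(delta)).astype(np.float32)
```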

3.2. Reward Function Design

Effective reward functions critically impact the convergence speed and quality of reinforcement learning. Traditional designs typically use isolated angle, speed, altitude, and distance metrics with minimal consideration of coupling, potentially compromising situational assessment accuracy and maneuver precision. This paper designs an expert-guided multi-dimensional coupled dense reward [13], including the relative position reward R relpos , the closure speed reward R closure , the altitude reward R altitude , the firing rewards R ownfire and R enemyfire based on bilateral health changes, and event rewards (attack advantage reward R c z , termination reward R end ).
$$
R_{\mathrm{relpos}}\left(\bar{\varphi}_r, \bar{q}_r\right) = \left(2\bar{\varphi}_r - 1\right)\cdot S\!\left(\bar{q}_r, \alpha_1, \tfrac{1}{2}\right) - 2\bar{\varphi}_r + \tfrac{1}{2}
$$
$$
R_{\mathrm{closure}}\left(\bar{q}_r, v_c, d\right) =
\begin{cases}
\dfrac{v_c}{v_{c\max}}\left[1 - S\!\left(\bar{q}_r, \alpha_1, \tfrac{1}{2}\right)\right]\cdot S\!\left(d, \alpha_2, d_0\right), & d \ge d_0 \\
-1, & d < d_0
\end{cases}
$$
$$
R_{\mathrm{altitude}}\left(h_r\right) =
\begin{cases}
S\!\left(h_r, \alpha_3, h_{\min}\right) - 1, & h_r < h_0 \\
-S\!\left(h_r, \alpha_4, h_{\max}\right), & h_r \ge h_0
\end{cases}
$$
$$
R_{\mathrm{ownfire}}\left(\bar{\varphi}_r, d\right) = \Gamma_R(d)\cdot\left[1 - S\!\left(\bar{\varphi}_r, \alpha_5, \bar{\varphi}_{\max}\right)\right]
$$
$$
R_{\mathrm{enemyfire}}\left(\bar{q}_r, d\right) = -\Gamma_B(d)\cdot S\!\left(\bar{q}_r, \alpha_8, \bar{q}_{\max}\right)
$$
$$
R_{\mathrm{end}} =
\begin{cases}
+200 + 200\,\dfrac{K - k}{K}, & \text{result} = \text{win} \\
-400, & \text{result} = \text{lose or draw} \\
-600, & \text{result} = \text{timeout or beyond the boundaries} \\
0, & \text{otherwise}
\end{cases}
$$
$$
R_{\mathrm{cz}} =
\begin{cases}
+25, & \varphi_r \le \varphi_{\max},\ q_r \le q_{\max},\ D_{\min} \le d \le D_{\max} \\
0, & \text{otherwise}
\end{cases}
$$
Formulas (10)–(14) are dense reward functions, where S denotes the sigmoid function in Equation (17). As illustrated in Figure 6, α determines the rate of change of the sigmoid function at x = x 0 : a larger α produces a more pronounced variation of the sigmoid curve at x = x 0 , rendering it more sensitive to perturbations in x, whereas a smaller α yields a less sensitive response, meaning that the function depends only weakly on changes in x. The constants α 1 – α 10 are not arbitrarily set but are determined based on expert prior knowledge from [13]. Moreover, these parameters are not unique; they can be adjusted within a reasonable range as long as the sigmoid function effectively captures the intended trends in the relevant dimensions, such as the sensitivity to angular or distance variations.
The remaining quantities are φ ¯ r and q ¯ r (the normalized φ r and q r for the red side), the optimal attack distance d 0 and flight altitude h 0 , the inter-UAV approach speed v c , the reward coefficients Γ R ( d ) and Γ B ( d ) applied during own fire and enemy attacks, and the coefficients β R and β B quantifying the offensive aggressiveness of the red and blue sides, respectively.
$$
S\!\left(x, \alpha, x_0\right) = \frac{1}{1 + e^{-\alpha\left(x - x_0\right)}}
$$
$$
v_c = \left(\boldsymbol{v}_r - \boldsymbol{v}_b\right)\cdot\frac{\boldsymbol{o}_b - \boldsymbol{o}_r}{\lVert\boldsymbol{o}_b - \boldsymbol{o}_r\rVert}
$$
$$
\Gamma_R(d) =
\begin{cases}
\beta_R\cdot S\!\left(d, \alpha_6, D_{\min}\right), & d < d_0 \\
\beta_R\cdot\left[1 - S\!\left(d, \alpha_7, D_{\max}\right)\right], & d \ge d_0
\end{cases}
$$
$$
\Gamma_B(d) =
\begin{cases}
\beta_B\cdot S\!\left(d, \alpha_9, D_{\min}\right), & d < d_0 \\
\beta_B\cdot\left[1 - S\!\left(d, \alpha_{10}, D_{\max}\right)\right], & d \ge d_0
\end{cases}
$$
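A compact Python sketch of the dense reward terms, as reconstructed in Equations (10)–(14) and (17)–(20), is given below; all numeric constants (α_1–α_10, β_R, β_B, and the distance and altitude bounds) are placeholders rather than the expert-tuned values used in this work.

```python
import numpy as np

def S(x, alpha, x0):
    """Parameterized sigmoid of Equation (17)."""
    return 1.0 / (1.0 + np.exp(-alpha * (x - x0)))

# Illustrative constants; alpha_1..alpha_10 and beta_R, beta_B are placeholders.
A1, A2, A3, A4, A5 = 10.0, 0.0125, 0.01, 0.01, 12.0
A6, A7, A8, A9, A10 = 0.01, 0.01, 12.0, 0.008, 0.008
BETA_R = BETA_B = 1.0
D_MIN, D0, D_MAX = 200.0, 800.0, 1200.0
H_MIN, H0, H_MAX = 500.0, 5000.0, 12000.0
VC_MAX = 400.0
PHI_BAR_MAX = Q_BAR_MAX = 30.0 / 180.0   # normalized angular thresholds

def r_relpos(phi_bar, q_bar):                       # Equation (10)
    return (2 * phi_bar - 1) * S(q_bar, A1, 0.5) - 2 * phi_bar + 0.5

def r_closure(q_bar, v_c, d):                       # Equation (11)
    if d < D0:
        return -1.0
    return (v_c / VC_MAX) * (1 - S(q_bar, A1, 0.5)) * S(d, A2, D0)

def r_altitude(h):                                  # Equation (12)
    return S(h, A3, H_MIN) - 1.0 if h < H0 else -S(h, A4, H_MAX)

def gamma_R(d):                                     # Equation (19)
    return BETA_R * S(d, A6, D_MIN) if d < D0 else BETA_R * (1 - S(d, A7, D_MAX))

def gamma_B(d):                                     # Equation (20)
    return BETA_B * S(d, A9, D_MIN) if d < D0 else BETA_B * (1 - S(d, A10, D_MAX))

def r_ownfire(phi_bar, d):                          # Equation (13)
    return gamma_R(d) * (1 - S(phi_bar, A5, PHI_BAR_MAX))

def r_enemyfire(q_bar, d):                          # Equation (14)
    return -gamma_B(d) * S(q_bar, A8, Q_BAR_MAX)
```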
The relative position reward function R relpos φ ¯ r , q ¯ r incentivizes our aircraft to maintain a tail-chase position behind the enemy, maximizing tactical advantage. The variables φ ¯ r and q ¯ r are normalized versions of φ r and q r , respectively, defined as φ ¯ r = φ r / π and q ¯ r = q r / π , scaling them to the range [ 0 , 1 ] . As illustrated in Figure 7, the reward function exhibits a rapid decay at q ¯ r = 0.5 (corresponding to the Aspect Angle q = π / 2 ), which serves as a critical threshold: for q ¯ r < 0.5 , our aircraft is behind the enemy with a high advantage, while at q ¯ r = 0.5 , it is at the beam position; and for q ¯ r > 0.5 , the scenario transitions to a head-on engagement where enemy threat escalates and our positional superiority diminishes, often resulting in penalized rewards.
Furthermore, the function incorporates directional alignment constraints via φ ¯ r ; when φ ¯ r > 0.5 (indicating a substantial misalignment between our velocity vector and the line-of-sight, implying the aircraft is turning away from the target), the reward value decreases to penalize deviations from precise orientation and sustained pursuit, thereby indirectly incentivizing the agent to maintain optimal tracking behavior and ensuring robust multi-faceted situation assessment. This integrated approach enhances decision-making realism by simultaneously accounting for both positional and orientational factors.
The closure speed reward function R closure q ¯ r , v c , d incentivizes our agent to approach the enemy UAV from the rear to maximize tactical advantage, while penalizing cases where the enemy aircraft approaches from behind our own. Its magnitude scales with distance to adaptively balance aggression and safety: larger rewards at greater distances encourage a proactive approach, while diminishing rewards at closer ranges mitigate overshooting risks. When d < d 0 , the reward is held at a constant negative value to prevent collision or overshooting from excessive proximity; here, d 0 denotes the optimal engagement distance: although closer proximity may enhance attack effectiveness, it significantly heightens crash and overshoot risks due to limited maneuvering space.
The design rationale for this function is grounded in operational constraints of UAV combat dynamics. Specifically, the sigmoid function parameters, such as α = 0.0125 for S d , α 2 , d 0 , are derived from the UAV’s optimal gun engagement zone, typically within D min , D max (e.g., 500 1200 m, as illustrated in Figure 8). This parameter selection ensures a gradual decline in reward values at the optimal engagement distance d 0 , promoting smooth transitions in UAV distance and velocity. The gradual decay prevents abrupt behavioral changes, enhancing training stability and real-world applicability by avoiding erratic maneuvers. Thus, the reward function effectively balances encouragement and penalty mechanisms, aligning with physical constraints and ensuring robust policy convergence. This approach not only reinforces safe engagement practices but also leverages expert knowledge to optimize performance in dynamic aerial environments (see Figure 9).
The altitude reward function R altitude h r is designed to constrain our UAV to maneuver within a specified altitude range h r h min , h max , mitigating risks at excessively high altitudes (where thin air may cause engine aerodynamic failure and degraded control efficiency) and excessively low altitudes (where increased drag and crash hazards prevail). Here, h 0 denotes the optimal flight altitude. Through the synergistic design of parameters α 3 and α 4 , the reward function ensures a smooth transition in reward values as altitude deviates from h 0 , mitigating abrupt changes and avoiding excessive height fluctuations caused by velocity spikes. This design stabilizes the UAV’s altitude within the target range, balancing safety with operational flexibility (see Figure 10).
The firing reward function R ownfire φ ¯ r , d and R enemyfire q ¯ r , d are designed to guide our agent to gain significant offensive advantages while avoiding enemy fire damage. It encourages our side to achieve the smallest possible target Aspect Angle within a certain attack range, thereby maximizing the fire damage inflicted on the enemy. Here, d 0 denotes the optimal weapon engagement distance, with D min and D max defining the minimum and maximum bounds of this optimal range.
Γ R d and Γ B d model the distance-dependent variation of our and the enemy’s fire damage reward coefficients, respectively. The constants β R and β B quantify the offensive intent intensity of our and the enemy’s UAVs, where larger absolute values signify stronger aggressiveness. Notably, we assume symmetric intent strengths ( β R = β B ) to ensure balanced evaluation. As illustrated in Figure 11b, conservative design assigns the enemy a slightly larger engagement range (via parameters α 8 , α 9 ) versus our UAV (via α 6 , α 7 ), motivating our agent to devise superior maneuver strategies.
Event-based rewards comprise two fundamental components: the attack advantage reward R cz (refer to Equation (16)) and the termination reward R end (refer to Equation (15)). Specifically, R cz is designed to incentivize the agent to engage the enemy from within its no-escape zone, thereby maximizing offensive efficiency by promoting optimal positioning and tactical dominance.
Conversely, R end functions as an environmental bonus awarded at the conclusion of each episode, contingent upon the final outcome and simulation context. This reward mechanism aims to motivate our side to accomplish missions rapidly and reliably within constrained time frames. For victorious scenarios, R end allocates a base reward of +200, augmented by an additional bonus of up to +200 that is proportional to the remaining simulation time (where K denotes the maximum steps per episode and k represents the termination step). This time-sensitive incentive encourages early decisive successes by rewarding expedient victories. In cases of loss or draw, a fixed penalty of −400 is enforced to discourage adverse outcomes, whereas timeout or boundary violations incur a stricter penalty of −600 to preclude deadlock situations and rule infractions.
This hierarchically structured reward design ensures that the reward signals acquired by the agent are intrinsically aligned with the dynamic complexities of real-world aerial combat. Consequently, the derived decision-making policy maintains high operational efficiency and reliability under stringent physical constraints, including platform dynamics and airspace limitations, thereby enhancing practical combat effectiveness.
To ensure the robustness and tactical effectiveness of the reward function, we adopted an incremental constructive design strategy driven by domain expert insights [14] and simulation failure analysis. Initially, a baseline reward set { R relpos ,   R closure ,   R altitude ,   R ownfire ,   R cz ,   R end } was established. Preliminary experiments indicated that although this configuration achieved basic convergence, it exhibited poor tactical efficiency; specifically, the agent frequently entered a mutual tail-chasing Nash equilibrium [15]. To break these deadlocks, the weights of R relpos and R closure were increased to 2–3 times their initial values to emphasize rapid approach and positional advantage acquisition.
However, despite improved positioning, the agent exhibited a critical vulnerability: it frequently sustained high damage from enemy fire due to a lack of defensive awareness. To address this, the R enemyfire term was integrated, which significantly enhanced survivability. Subsequently, observations revealed a 'delayed aiming' phenomenon where the agent failed to lock onto the target quickly enough, allowing the enemy to fire first in head-on engagements. Consequently, to accelerate convergence and improve aiming precision, the alignment penalty term $R_{\bar{\varphi}_r} = -\bar{\varphi}_r$ was introduced. Concurrently, other weights were fine-tuned: ω ownfire was increased to enhance firing-driven convergence, and ω relpos was slightly reduced to avoid over-reliance on situational angular advantages, which could cause stagnation. This step-wise integration balanced tactical priorities with training stability, ultimately yielding the final robust configuration.
In the overall reward function R total , the final weights ω 1 to ω 6 satisfy $\sum_{i=1}^{6}\omega_i = 1$, which normalizes the dense rewards and prevents any single component from dominating. Additionally, a scaling factor K > 1 is applied to amplify the total reward magnitude, mitigating saturation and stabilizing gradient updates during training.
$$
R_{\mathrm{total}} = K\left(\omega_1 R_{\bar{\varphi}_r} + \omega_2 R_{\mathrm{relpos}} + \omega_3 R_{\mathrm{closure}} + \omega_4 R_{\mathrm{altitude}} + \omega_5 R_{\mathrm{ownfire}} + \omega_6 R_{\mathrm{enemyfire}}\right) + R_{\mathrm{cz}} + R_{\mathrm{end}}
$$
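Reusing the dense-reward helpers from the previous sketch, the weighted combination could be assembled as follows; the weight values and the scaling factor are illustrative assumptions, not the tuned configuration reported here.

```python
# Placeholder weights (summing to 1) and scaling factor; the paper's tuned values differ.
WEIGHTS = dict(align=0.10, relpos=0.25, closure=0.20,
               altitude=0.10, ownfire=0.25, enemyfire=0.10)
SCALE = 5.0  # the amplification factor K > 1 described in the text

def r_total(phi_bar, q_bar, v_c, d, h, r_cz, r_end):
    """Weighted dense rewards plus event rewards, following the R_total expression."""
    dense = (WEIGHTS["align"] * (-phi_bar)            # alignment penalty term
             + WEIGHTS["relpos"] * r_relpos(phi_bar, q_bar)
             + WEIGHTS["closure"] * r_closure(q_bar, v_c, d)
             + WEIGHTS["altitude"] * r_altitude(h)
             + WEIGHTS["ownfire"] * r_ownfire(phi_bar, d)
             + WEIGHTS["enemyfire"] * r_enemyfire(q_bar, d))
    return SCALE * dense + r_cz + r_end
```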

3.3. Policy Network Training Method

Unlike traditional policy optimization, the core innovation of SAC is the incorporation of a maximum-entropy term into the objective: the algorithm maximizes the expected cumulative reward while maintaining the entropy of the action distribution of policy π at each state, i.e.,
$$
J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t + \alpha\,\mathcal{H}\!\left(\pi\left(\cdot\mid s_t\right)\right)\right)\right]
$$
Among them, the temperature coefficient α is adjusted dynamically by minimizing the loss in Equation (23) with respect to a target entropy ($\bar{\mathcal{H}} = -\dim\mathcal{A}$), balancing the exploration–exploitation tradeoff of the SAC strategy, significantly enhancing training stability in uncertain environments, and mitigating the risk of local optima.
$$
J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\left[-\alpha\log\pi_t\!\left(a_t\mid s_t\right) - \alpha\bar{\mathcal{H}}\right]
$$
SAC exhibits significant advantages over other policy gradient (PG) and Actor-Critic (AC) algorithms. The maximum entropy objective facilitates more robust exploration, avoiding the premature convergence often seen in traditional PG methods. Compared with deterministic policies like those in DDPG, SAC’s stochastic policy with entropy regularization maintains a better exploration–exploitation balance, adapting to various environments without extensive hyperparameter tuning. Furthermore, the use of dual Q-networks reduces the overestimation bias common in AC methods, leading to more stable and reliable learning. These characteristics make SAC highly effective for continuous control tasks, such as UAV maneuvering, where precision and robustness are paramount.
SAC adopts the Actor-Critic architecture (Figure 12), comprising two soft Q networks ( θ 1 , θ 2 ), two target Q networks ( θ ¯ 1 , θ ¯ 2 ), and one actor network ( ϕ ). The critic optimizes the Q networks by minimizing the Bellman residual MSE loss (Equation (24)) and employs a dual Q mechanism to take the minimum Q value (Equation (25)), thereby mitigating Q-value overestimation. To handle the non-differentiability of random actions, the actor uses a reparameterization trick to convert action sampling into a deterministic linear transformation with random noise (Equation (26)), rendering the policy loss differentiable (Equation (27)).
$$
J_Q\!\left(\theta_i\right) = \mathbb{E}_{\left(s_t, a_t\right)\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta_i}\!\left(s_t, a_t\right) - \left(r\!\left(s_t, a_t\right) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\left[V_{\bar{\theta}}\!\left(s_{t+1}\right)\right]\right)\right)^2\right]
$$
$$
Q_{\bar{\theta}}\!\left(s_{t+1}, a_{t+1}\right) = \min\!\left(Q_{\bar{\theta}_1}\!\left(s_{t+1}, a_{t+1}\right),\ Q_{\bar{\theta}_2}\!\left(s_{t+1}, a_{t+1}\right)\right)
$$
$$
a_t = f_\phi\!\left(\varepsilon_t; s_t\right) = \mu + \sigma\cdot\varepsilon_t
$$
$$
J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\,\varepsilon_t\sim\mathcal{N}}\left[\alpha\log\pi_\phi\!\left(f_\phi\!\left(\varepsilon_t; s_t\right)\mid s_t\right) - Q_\theta\!\left(s_t, f_\phi\!\left(\varepsilon_t; s_t\right)\right)\right]
$$
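The update rules in Equations (22)–(27) correspond to the standard SAC losses; a minimal PyTorch sketch is shown below, with the Q networks and the squashed-Gaussian actor abstracted as callables (the experiments rely on the stable-baselines3 implementation, so this sketch is illustrative only).

```python
import torch
import torch.nn.functional as F

def sac_losses(q1, q2, q1_targ, q2_targ, actor, batch, gamma, log_alpha, target_entropy):
    """Sketch of the SAC losses in Equations (22)-(27).

    q1, q2, q1_targ, q2_targ: callables mapping (s, a) -> Q-value tensor (B, 1)
    actor: callable mapping s -> (a, log_prob) via the reparameterization trick
    batch: dict of tensors s, a, r, s_next, done
    """
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))
    alpha = log_alpha.exp()

    # Critic target: clipped double Q minus the entropy term (soft value of Eq. (24)-(25))
    with torch.no_grad():
        a_next, logp_next = actor(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    critic_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)

    # Actor loss, Eq. (27): maximize Q while keeping the policy entropy high
    a_new, logp_new = actor(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature loss, Eq. (23): drive the policy entropy toward the target entropy
    alpha_loss = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```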

4. Simulation and Results

The simulation environment was built on OpenAI Gym [16], a widely adopted benchmark for reinforcement learning research. For the training of agents, we utilized the stable-baselines3 library [17] (based on the PyTorch framework (version 2.6.0)). To ensure a comprehensive comparison and evaluation of performance, we implemented and trained agents using the Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) algorithms, respectively. The neural network architecture for both algorithms consisted of four fully connected layers with ReLU activation functions.
Each training episode lasted 200 s with a 10 Hz decision frequency, and the total training was conducted over 50,000 episodes. For policy evaluation, pre-trained policy network weights were loaded to ensure that fixed strategies were used, and assessments were performed under randomized initial conditions (including both advantageous and parity scenarios). The evaluation was divided into two distinct phases: First, we assessed the clean performance of the trained policies under nominal conditions without any perturbations. Then, in the second phase, observational noise was injected into the state inputs to test the robustness of the policies.
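A minimal stable-baselines3 training script consistent with this setup might look as follows; the environment id `UavDogfight-v0`, the hidden-layer sizes, and the timestep budget are assumptions of this sketch, while the algorithm choice, the four-layer ReLU MLP, and the parallel-environment setup follow the description above and in Section 4.2.

```python
import torch.nn as nn
from stable_baselines3 import SAC, TD3
from stable_baselines3.common.env_util import make_vec_env

# "UavDogfight-v0" is a hypothetical id for the custom Gym environment.
policy_kwargs = dict(net_arch=[256, 256, 256, 256], activation_fn=nn.ReLU)

vec_env = make_vec_env("UavDogfight-v0", n_envs=8)   # 8 parallel simulation workers
model = SAC("MlpPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10_000_000)              # illustrative budget
model.save("sac_uav_policy")

# The TD3 baseline is trained the same way:
# model = TD3("MlpPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
```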

4.1. Simulation Parameters and Initialization Conditions Settings

In the scenario, the blue UAV represents the opponent and is initialized with a random heading at a randomized altitude within a specified range. A local East–North–Up (ENU) coordinate frame is established at the blue UAV's initial position. Representing our own agent, the red UAV is then initialized relative to the blue UAV within a defined rear cone sector. Its initial position is constrained by a specific LOS geometric bearing ψ L r , LOS elevation angle θ L r , and distance range R, with its heading ψ r facing toward the blue UAV within a certain angular tolerance. The initial state information and simulation parameters of the red and blue UAVs are listed in Table 2 and Table 3, respectively.

4.2. Results Analysis

The training is performed on a desktop computer equipped with a 12th Gen Intel(R) Core(TM) i5-12600KF CPU and an NVIDIA GeForce RTX 4070 SUPER GPU. We utilize eight encapsulated environments for parallel simulation and experience collection, thereby accelerating policy optimization and reducing training time. Two representative DRL algorithms are selected for comparison: SAC and TD3, both state-of-the-art single-agent DRL algorithms that support continuous control.
We conducted five independent training sessions for both the SAC and TD3 algorithms, with each session comprising 50,000 episodes. As illustrated in Figure 13, the left panel presents the training performance of SAC, where the five thin curves denote the average reward per 100 episodes (smoothed via Savitzky–Golay filtering) from each of the five runs, the thick curve represents the smoothed mean reward across these five runs, and the shaded area corresponds to the 95% confidence interval (i.e., ±1.96 standard deviations from the mean curve); similarly, the right panel presents the training performance of TD3. It can be observed that both SAC and TD3 algorithms converge within a limited number of training episodes, indicating that the reward function design is universally applicable to both methods in this scenario. However, the convergence speed of TD3 is significantly slower than that of SAC, with SAC converging at around 15,000 episodes and TD3 converging after nearly 25,000 episodes.
We evaluated the performance of the policies from the SAC and TD3 methods over 300 episodes each. During the first 100 episodes, the input states were fully observable and noise-free. In the subsequent 200 episodes, we introduced zero-mean multivariate independent Gaussian noise with a diagonal covariance matrix into the input states. Specifically, for 100 episodes, we applied low-level noise (with smaller standard deviations). For the remaining 100 episodes, we applied high-level noise (with larger standard deviations), using distinct values for each dimension. To ensure comparability across experiments, the same random seed was used throughout all evaluations to maintain consistent initial conditions, and for each policy, identical noise was applied at the corresponding episode and simulation step. The evaluation results are summarized in Table 4.
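The noise-injection protocol can be sketched as follows; the per-dimension standard deviations shown are placeholders, since the exact values used in the experiments are not listed here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for comparable evaluations

# Per-dimension standard deviations are placeholders; the experiments use
# distinct values for each of the 27 state dimensions at two noise levels.
sigma_low = np.full(27, 0.01, dtype=np.float32)
sigma_high = np.full(27, 0.05, dtype=np.float32)

def noisy_observation(obs, sigma):
    """Zero-mean independent Gaussian noise with diagonal covariance."""
    return obs + rng.normal(0.0, sigma).astype(np.float32)

# Evaluation loop sketch (model and env as in the training snippet above):
# obs = env.reset()
# action, _ = model.predict(noisy_observation(obs, sigma_high), deterministic=True)
```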
Statistical analysis of the results in Table 4 demonstrates the significant superiority of the SAC policy over the TD3 baseline, validated by one-sided hypothesis tests ( α = 0.05 ). The SAC agent achieved statistically significantly higher win rates across all conditions: no noise (p = 0.0226), low noise (p = 0.0385), and high noise (p = 0.0203). Multiple performance metrics support this conclusion—SAC maintained substantially higher mean returns (e.g., 266.55 vs. 138.57 under no noise) while simultaneously achieving lower loss rates ( 1 % ± 2.0 % vs. 8 % ± 5.3 % ) and reduced return variance (380.54 vs. 530.22), indicating enhanced stability and robustness in noisy environments. These findings statistically validate the theoretical advantages of SAC’s maximum entropy framework in handling high-dimensional state spaces and sensor uncertainties, providing empirical evidence for its efficacy in a continuous action space for UAV autonomous maneuvering decisions.
Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Figure 21, Figure 22, Figure 23, Figure 24, Figure 25, Figure 26, Figure 27, Figure 28, Figure 29, Figure 30, Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 show the simulation trajectories of the SAC and TD3 algorithms under different initial situational conditions. Among them, Figure 14, Figure 15 and Figure 16 correspond to the initial condition where our side is in a tail-chase scenario with an altitude advantage; Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21 represent the initial condition where our side is in a tail-chase scenario but at an altitude disadvantage; Figure 22, Figure 23 and Figure 24 depict the initial condition where our side is in a tail-chase scenario with altitude parity; Figure 25, Figure 26, Figure 27, Figure 28, Figure 29 and Figure 30 illustrate the head-on initial situation with our side having an altitude advantage; and Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 portray the head-on initial situation with our side at an altitude disadvantage. In these figures, the pentagram denotes the trajectory starting point (corresponding to T 1 ), and the solid circle indicates the endpoint.
In each scenario depicted in Figure 14 and Figure 15, both sides start from identical initial conditions. The SAC algorithm successfully hits the target regardless of the presence of observational noise, whereas the TD3 algorithm fails to fully defeat the target when subjected to higher noise levels. As illustrated in Figure 16, during the interval from T2 to T3 (i.e., simulation time 60 s to 120 s), the red side gains a significant relative positional and range advantage, thereby securing a brief firing opportunity. However, from T3 to T4, due to a sharp left-turn maneuver executed by the blue side, the noise likely introduces substantial errors into the red side’s observational inputs. This causes the TD3 policy to fail in executing appropriate maneuvers correctly, ultimately resulting in deviation from the intended trajectory and a crash beyond the minimum altitude boundary. This outcome underscores TD3’s inherent sensitivity to observational noise due to its deterministic policy design, rather than reflecting neural network overfitting; in contrast, SAC’s maximum entropy framework maintains stability by promoting exploratory actions under similar uncertain conditions.
Based on Figure 17, Figure 18 and Figure 19, it can be observed that, under the initial condition of tail-chase altitude disadvantage, the SAC method consistently maintains both positional and angular advantages in tail pursuit, thereby securing firing opportunities. In contrast, as shown in Figure 20 and Figure 21, the performance of the TD3 policy in this scenario is less favorable. From Figure 21, it can be seen that, during the period from T1 to T6 (the first 500 s of simulation), the red side generally holds a superior posture. However, starting at T6, when the blue side executes a circular maneuver toward the upper right, the TD3 policy fails to continue tracking the opponent’s movement. Instead, it adopts a relatively conservative climbing maneuver, leading to an immediate decline in its situational advantage. Although by time T8, its relative positional advantage improves, the policy still opts for damage avoidance, ultimately allowing the blue side to overtake and gain dominance.
Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21 compare the trajectory trends of the SAC and TD3 algorithms under an initial tail-chase advantage for our agent. When our agent has an initial altitude advantage, the SAC policy in Figure 14 exhibits lower average standard deviations of maneuver commands (with standard deviations of 0.947 for n x , 1.196 for n z , and 1.093 for γ ) compared with the TD3 policy in Figure 15 (with standard deviations of 1.469 for n x , 1.172 for n z , and 1.516 for γ ). Furthermore, in terms of Energy Consumption (measured by control effort n x 2 + n z 2 in a trajectory) and Control Smoothness (measured by the mean absolute difference of command signals), SAC achieves significantly lower values of 3486.13 and 0.2002, respectively, whereas TD3 records higher values of 8043.52 and 0.2881. Based on the quantitative analysis of these three metrics, it is evident that the SAC algorithm generates trajectories that are not only statistically more stable but also physically smoother and more energy efficient.
Similarly, when our agent is at an initial altitude disadvantage (Figure 17), the SAC policy maintains low metric values across all dimensions: standard deviations of 0.879, 1.055, and 1.397 for n x , n z , and γ , with an Energy Consumption of 9272.8 and a Control Smoothness of 0.4289. In contrast, the TD3 policy (Figure 20) shows degraded performance, characterized by elevated standard deviations (1.282, 4.036, 2.356 for n x , n z , and γ ), a substantial energy cost of 18935.0633, and a poor smoothness of 1.5675.
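For reference, the three command-level metrics reported above can be computed from a logged command sequence as in the following sketch; the exact logging format used in the experiments is not specified, so the array layout is an assumption.

```python
import numpy as np

def command_metrics(commands):
    """Per-trajectory maneuver-command statistics used in the comparison.

    commands: array of shape (T, 3) with columns [n_x, n_z, gamma].
    Returns per-channel standard deviations, the control-effort proxy
    sum(n_x^2 + n_z^2), and the mean absolute first difference of the
    command signals as the smoothness measure.
    """
    commands = np.asarray(commands, dtype=float)
    stds = commands.std(axis=0)                                  # per-channel std dev
    energy = np.sum(commands[:, 0] ** 2 + commands[:, 1] ** 2)   # control effort
    smoothness = np.mean(np.abs(np.diff(commands, axis=0)))      # mean |delta command|
    return stds, energy, smoothness
```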
It can be observed that, in complex environments with introduced observational noise, SAC-generated trajectories exhibit significantly smoother characteristics and a markedly higher mission success probability than TD3. SAC demonstrates consistently lower control signal variances, superior Control Smoothness, and lower Energy Consumption: its lower variances highlight robust generalization against noise-induced disturbances, and its enhanced smoothness and efficiency validate physically smoother, more energy-conservative trajectory generation. Conversely, TD3 exhibits higher variances, inferior smoothness, and reduced energy efficiency. The consistently lower values across these metrics confirm SAC’s robust generalization against noise-induced disturbances, which directly mitigates overfitting-like oscillatory behaviors observed in TD3.
This superiority stems from SAC’s maximum-entropy optimization framework, which incorporates a policy-entropy regularization term. This encourages stochastic exploration, preventing the policy from prematurely converging to a local optimum and thereby enhancing its robustness against disturbances such as sensor noise and environmental uncertainties. In contrast, TD3, as a deterministic policy algorithm, is advantageous in sample efficiency but suffers from its relatively fixed policy mode. This makes it prone to overfitting when confronted with observational noise, leading to unnecessary oscillations or deviations in its trajectory that ultimately compromise its effectiveness in the engagement. Therefore, the phenomena depicted in Figure 16 and Figure 21 should be interpreted as evidence of TD3’s algorithmic limitations in handling uncertainties, not as signs of overtraining; SAC’s consistent performance stems from its designed robustness, which effectively mitigates overfitting risks through stochastic exploration.
Figure 22, Figure 23 and Figure 24 depict the tail-chase trajectories when both sides start at similar altitudes. The results indicate that both strategies perform well, successfully defeating the opponent with relatively smooth trajectories. Figure 25, Figure 26, Figure 27, Figure 28, Figure 29 and Figure 30 illustrate the scenario where both parties start head-on, with our side having an altitude advantage. As shown by the corresponding dense reward function, the SAC strategy fails to achieve a favorable geometric angle between T 3 and T 4 , allowing the blue side to fire at us first. Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 represent the case of a head-on start with our side at an altitude disadvantage.
Here, the comparative analysis between Figure 33 and Figure 36 indicates that the TD3 strategy generates smoother trajectories than the SAC, with quantitative metrics further supporting this observation: lower standard deviations in most maneuver commands (specifically, 1.269 for n x , 0.672 for n z , and 1.182 for γ of the TD3 policy, compared with 1.047 for n x , 1.350 for n z , and 1.205 for γ of the SAC policy), lower Energy Consumption (3057.75 for TD3, 3544.70 for SAC), and superior Control Smoothness (0.0713 for TD3, 0.7472 for SAC). Additionally, the TD3 policy initiates attacks earlier and demonstrates a certain level of robustness.

5. Conclusions

This study addresses the challenge of autonomous maneuvering decision making for unmanned aerial vehicles (UAVs) by proposing a continuous action space decision-making framework based on the Soft Actor-Critic (SAC) algorithm. Through the construction of a simulation environment that incorporates three-dimensional kinematic constraints and a multi-dimensional situational reward mechanism, the framework achieves the generation of high-precision control commands, offering a valuable reference for research on autonomous behavior decision making of intelligent unmanned systems in dynamic environments.
The primary innovations of this work are manifested in two key aspects: first, the design of a comprehensive multi-dimensional reward function that integrates factors such as relative position, closure speed, altitude relationship, and interactive effectiveness, coupled with the introduction of a Health Point (HP) system for quantitative assessment of situational advantages, which enhances the strategy’s capability to perceive and respond to complex dynamic conditions; second, the application of the SAC algorithm with maximum entropy optimization, which facilitates the generation of smooth and reliable control commands while maintaining an equilibrium between exploration and exploitation, thereby improving the adaptability and stability of the decision-making system.
Despite these contributions, the current research exhibits certain limitations: the three-degrees-of-freedom point-mass model employed, while ensuring basic motion feasibility through overload constraints (e.g., the ranges of n x and n z ), does not fully account for higher-order dynamic characteristics such as aerodynamic coupling effects and actuator response delays. Consequently, the practical feasibility of the generated trajectories under extreme maneuvers warrants further validation. Future efforts will prioritize two directions: the development of a six-degrees-of-freedom high-fidelity simulation platform that incorporates aerodynamic parameters and system dynamics to enhance model reliability and trajectory authenticity and the extension to multi-agent collaborative decision-making scenarios, focusing on cooperative maneuvering strategies for multiple UAVs in complex environments to strengthen the algorithm’s practicality and scalability.

Author Contributions

H.Y.: conceptualization, methodology, validation, formal analysis, investigation, resources, writing—review and editing, supervision, project administration, funding acquisition; S.Q.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization; S.C.: validation, formal analysis, investigation, resources, writing—review and editing, supervision, funding acquisition; C.W.: validation, formal analysis, investigation, resources, writing—review and editing, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province, China, under Grant No. 2025JJ20055 and the National Natural Science Foundation of China under Grant No. 52475290 and 62403482.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

DURC Statement

Current research is limited to the field of unmanned aerial vehicles (UAVs), which is beneficial for advancing the intelligence and robustness of unmanned systems and does not pose a threat to public health or national security. Authors acknowledge the dual-use potential of the research involving autonomous UAV maneuvering algorithms and confirm that all necessary precautions have been taken to prevent potential misuse, including but not limited to conducting research in a controlled simulation environment without hardware implementation. As an ethical responsibility, authors strictly adhere to relevant national and international laws about DURC. Authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cao, S.; Wang, X.; Zhang, R.; Peng, Y.; Yu, H. Aerobatic Maneuvering flight control of fixed-wing UAVs: An SE (3) approach using dual quaternion. IEEE Trans. Ind. Electron. 2024, 71, 14362–14372. [Google Scholar] [CrossRef]
  2. Zhang, J.d.; Yu, Y.f.; Zheng, L.h.; Yang, Q.m.; Shi, G.q.; Wu, Y. Situational continuity-based air combat autonomous maneuvering decision-making. Def. Technol. 2023, 29, 66–79. [Google Scholar] [CrossRef]
  3. Li, Y.; Dong, W.; Zhang, P.; Zhai, H.; Li, G. Hierarchical reinforcement learning with automatic curriculum generation for unmanned combat aerial vehicle tactical decision-making in autonomous air combat. Drones 2025, 9, 384. [Google Scholar] [CrossRef]
  4. Zhu, Y.; Zheng, Y.; Wei, W.; Fang, Z. Enhancing Automated Maneuvering Decisions in UCAV Air Combat Games Using Homotopy-Based Reinforcement Learning. Drones 2024, 8, 756. [Google Scholar] [CrossRef]
  5. Yang, J.; Yang, X.; Yu, T. Multi-unmanned aerial vehicle confrontation in intelligent air combat: A multi-agent deep reinforcement learning approach. Drones 2024, 8, 382. [Google Scholar] [CrossRef]
  6. Zhang, T.; Wang, Y.; Sun, M.; Chen, Z. Air combat maneuver decision based on deep reinforcement learning with auxiliary reward. Neural Comput. Appl. 2024, 36, 13341–13356. [Google Scholar] [CrossRef]
  7. Fan, Z.; Xu, Y.; Kang, Y.; Luo, D. Air combat maneuver decision method based on A3C deep reinforcement learning. Machines 2022, 10, 1033. [Google Scholar] [CrossRef]
  8. Zheng, Z.; Duan, H. UAV maneuver decision-making via deep reinforcement learning for short-range air combat. Intell. Robot. 2023, 3, 76–94. [Google Scholar] [CrossRef]
  9. Yang, Q.; Zhu, Y.; Zhang, J.; Qiao, S.; Liu, J. UAV air combat autonomous maneuver decision based on DDPG algorithm. In Proceedings of the 2019 IEEE 15th International Conference on Control and Automation (ICCA), Edinburgh, UK, 16–19 July 2019; pp. 37–42. [Google Scholar]
  10. Gao, X.; Zhang, Y.; Wang, B.; Leng, Z.; Hou, Z. The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm. Drones 2024, 8, 501. [Google Scholar] [CrossRef]
  11. Buşoniu, L.; Rejeb, J.B.; Lal, I.; Morărescu, I.C.; Daafouz, J. Optimistic minimax search for noncooperative switched control with or without dwell time. Automatica 2020, 112, 108632. [Google Scholar] [CrossRef]
  12. Austin, F.; Carbone, G.; Falco, M.; Hinz, H.; Lewis, M. Automated maneuvering decisions for air-to-air combat. In Proceedings of the Guidance, Navigation and Control Conference, Monterey, CA, USA, 17–19 August 1987; p. 2393. [Google Scholar]
  13. Pope, A.P.; Ide, J.S.; Mićović, D.; Diaz, H.; Twedt, J.C.; Alcedo, K.; Walker, T.T.; Rosenbluth, D.; Ritholtz, L.; Javorsek, D. Hierarchical reinforcement learning for air combat at DARPA’s AlphaDogfight trials. IEEE Trans. Artif. Intell. 2022, 4, 1371–1385. [Google Scholar] [CrossRef]
  14. Seong, H.; Shim, D.H. TempFuser: Learning Agile, Tactical, and Acrobatic Flight Maneuvers Using a Long Short-Term Temporal Fusion Transformer. IEEE Robot. Autom. Lett. 2024, 9, 10803–10810. [Google Scholar] [CrossRef]
  15. Kong, W.; Zhou, D.; Du, Y.; Zhou, Y.; Zhao, Y. Reinforcement learning for multiaircraft autonomous air combat in multisensor UCAV platform. IEEE Sensors J. 2022, 23, 20596–20606. [Google Scholar] [CrossRef]
  16. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
  17. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
Figure 1. UAV 1v1 dynamic confrontation decision-making framework. The red UAV represents the friendly aircraft, and the blue UAV represents the enemy aircraft.
Figure 1. UAV 1v1 dynamic confrontation decision-making framework. The red UAV represents the friendly aircraft, and the blue UAV represents the enemy aircraft.
Drones 10 00035 g001
Figure 2. Schematic of the 3-DOF UAV dynamics model in ground coordinate systems and related angles.
Figure 2. Schematic of the 3-DOF UAV dynamics model in ground coordinate systems and related angles.
Drones 10 00035 g002
Figure 3. The 1v1 UAV confrontation situation.
Figure 3. The 1v1 UAV confrontation situation.
Drones 10 00035 g003
Figure 4. Normalized factors for HP update (corresponding to Equations (5)–(7)), illustrating the effect of angle and distance on HP decay.
Figure 4. Normalized factors for HP update (corresponding to Equations (5)–(7)), illustrating the effect of angle and distance on HP decay.
Drones 10 00035 g004
Figure 5. The Minimax decision-making policy. The arrows indicate the sequence of maneuver choices over discrete time steps. The nodes represent the decision states, where the colors (e.g., blue and red) distinguish the decision turns of the friendly and enemy UAVs, respectively.
Figure 5. The Minimax decision-making policy. The arrows indicate the sequence of maneuver choices over discrete time steps. The nodes represent the decision states, where the colors (e.g., blue and red) distinguish the decision turns of the friendly and enemy UAVs, respectively.
Drones 10 00035 g005
Figure 6. Sigmoid function under different α values.
Figure 6. Sigmoid function under different α values.
Drones 10 00035 g006
Figure 7. Relative position reward function R relpos (corresponding to Equation (10)), showing the impact of angle variation on reward values.
Figure 7. Relative position reward function R relpos (corresponding to Equation (10)), showing the impact of angle variation on reward values.
Drones 10 00035 g007
Figure 8. Distance attenuation factor S d , α 2 , d 0 in the closure speed reward function R closure q ¯ r , v c , d (corresponding to Equation (11)), illustrating the exponential decay of the reward component based on the inter-UAV distance.
Figure 8. Distance attenuation factor S d , α 2 , d 0 in the closure speed reward function R closure q ¯ r , v c , d (corresponding to Equation (11)), illustrating the exponential decay of the reward component based on the inter-UAV distance.
Drones 10 00035 g008
Figure 9. The closure speed reward function R_closure (corresponding to Equation (11)), showing the combined effect of the Aspect Angle, approach speed, and distance on the reward value.
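Figures 8 and 9 indicate that the closure-speed reward couples an aspect-angle term, the normalized closure speed, and a distance attenuation factor. The sketch below is purely illustrative of that coupling; the function names, the cosine weighting, and the sigmoid-style attenuation are assumptions rather than the paper's Equation (11), although α_2 = 25 and d_0 = 0.9 km are taken from Table 3.

```python
import math

def distance_attenuation(d, alpha2=25.0, d0=0.9e3):
    """Illustrative attenuation S(d, alpha2, d0): near 1 inside d0, decaying beyond it."""
    return 1.0 / (1.0 + math.exp(alpha2 * (d - d0) / d0))

def r_closure(aspect_angle, v_c, d, v_c_max=2 * 340.0):
    """Illustrative closure-speed reward: favour approaching (positive v_c)
    while roughly nose-on (small aspect angle) and at short range.
    v_c_max assumes ~340 m/s per Mach for the 2 Ma limit in Table 3."""
    return math.cos(aspect_angle) * (v_c / v_c_max) * distance_attenuation(d)

print(round(r_closure(aspect_angle=0.1, v_c=100.0, d=1.0e3), 4))
```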
Figure 10. The altitude reward function R_altitude (corresponding to Equation (12)), demonstrating the reward modulation based on the UAV’s flight altitude relative to the optimal height h_0.
Figure 11. The firing reward function R_ownfire and the coefficients Γ_R and Γ_B (corresponding to Equations (13), (19), and (20)), quantifying the offensive advantage and fire damage. In (b), the red line represents the friendly coefficient Γ_R, while the blue line represents the enemy coefficient Γ_B.
Figure 12. Schematic diagram of the Soft Actor-Critic (SAC) algorithm architecture and update process.
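Figure 12 depicts the actor, the twin critics, their target networks, and the entropy term. For reference, a standard statement of the SAC update (following the original algorithm; the paper's exact notation and entropy-coefficient handling may differ) is:

```latex
% Soft Bellman target with clipped double critics and an entropy bonus:
y = r + \gamma (1 - d)\left[ \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right],
\quad a' \sim \pi_\phi(\cdot \mid s')

% Critic loss for each Q-network and the actor (policy) loss:
J_Q(\theta_i) = \mathbb{E}\left[ \big( Q_{\theta_i}(s, a) - y \big)^2 \right], \qquad
J_\pi(\phi) = \mathbb{E}\left[ \alpha \log \pi_\phi(a \mid s) - \min_{i=1,2} Q_{\theta_i}(s, a) \right]
```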
Figure 13. Average returns for training episodes.
Figure 14. Initial tail-chase SAC trajectory with altitude advantage.
Figure 15. Initial tail-chase TD3 trajectory with altitude advantage.
Figure 16. TD3 initial tail-chase with altitude advantage and high noise observations.
Figure 17. Initial tail-chase SAC trajectory with altitude disadvantage.
Figure 18. SAC initial tail chase with altitude disadvantage and no noise observations.
Figure 19. SAC initial tail chase with altitude disadvantage and high noise observations.
Figure 20. Initial tail-chase TD3 trajectory with altitude disadvantage.
Figure 21. TD3 initial tail chase with altitude disadvantage and no noise observations.
Figure 22. Initial tail-chase SAC trajectory with balanced altitude.
Figure 23. Initial tail-chase TD3 trajectory with balanced altitude.
Figure 24. Comparison of maneuvering decision command sequences generated by SAC and TD3 agents under noise-free and high-noise observation conditions during an altitude-balanced tail-chase scenario.
Figure 25. Initial head-on SAC trajectory with altitude advantage.
Figure 26. SAC initial head-on with altitude advantage and no noise observations.
Figure 27. SAC initial head-on with altitude advantage and high noise observations.
Figure 28. Initial head-on TD3 trajectory with altitude advantage.
Figure 29. TD3 initial head-on with altitude advantage and no noise observations.
Figure 30. TD3 initial head-on with altitude advantage and high noise observations.
Figure 31. Initial head-on SAC trajectory with altitude disadvantage.
Figure 32. SAC initial head-on with altitude disadvantage and no noise observations.
Figure 33. SAC initial head-on with altitude disadvantage and high noise observations.
Figure 34. Initial head-on TD3 trajectory with altitude disadvantage.
Figure 35. TD3 initial head-on with altitude disadvantage and no noise observations.
Figure 36. TD3 initial head-on with altitude disadvantage and high noise observations.
Table 1. Basic maneuver primitive instruction set.

Action ID | Maneuver Name | Maneuver Primitive Command (n_x, n_z, γ)
1 | Constant Speed Forward | (0, 1, 0)
2 | Accelerated Forward | (n_x^max, 1, 0)
3 | Decelerated Forward | (n_x^min, 1, 0)
4 | Constant Speed Right Turn | (0, n_z^max, cos⁻¹(1/n_z^max))
5 | Accelerated Right Turn | (n_x^max, n_z^max, cos⁻¹(1/n_z^max))
6 | Decelerated Right Turn | (n_x^min, n_z^max, cos⁻¹(1/n_z^max))
7 | Constant Speed Left Turn | (0, n_z^max, −cos⁻¹(1/n_z^max))
8 | Accelerated Left Turn | (n_x^max, n_z^max, −cos⁻¹(1/n_z^max))
9 | Decelerated Left Turn | (n_x^min, n_z^max, −cos⁻¹(1/n_z^max))
10 | Constant Speed Climb | (0, n_z^max, 0)
11 | Accelerated Climb | (n_x^max, n_z^max, 0)
12 | Decelerated Climb | (n_x^min, n_z^max, 0)
13 | Constant Speed Dive | (0, n_z^min, 0)
14 | Accelerated Dive | (n_x^max, n_z^min, 0)
15 | Decelerated Dive | (n_x^min, n_z^min, 0)
16 | Constant Speed Right Climb | (0, n_z^max, cos⁻¹(c/n_z^max))
17 | Accelerated Right Climb | (n_x^max, n_z^max, cos⁻¹(c/n_z^max))
18 | Decelerated Right Climb | (n_x^min, n_z^max, cos⁻¹(c/n_z^max))
19 | Constant Speed Left Climb | (0, n_z^max, −cos⁻¹(c/n_z^max))
20 | Accelerated Left Climb | (n_x^max, n_z^max, −cos⁻¹(c/n_z^max))
21 | Decelerated Left Climb | (n_x^min, n_z^max, −cos⁻¹(c/n_z^max))
22 | Constant Speed Right Dive | (0, n_z^min, cos⁻¹(c/n_z^max))
23 | Accelerated Right Dive | (n_x^max, n_z^min, cos⁻¹(c/n_z^max))
24 | Decelerated Right Dive | (n_x^min, n_z^min, cos⁻¹(c/n_z^max))
25 | Constant Speed Left Dive | (0, n_z^min, −cos⁻¹(c/n_z^max))
26 | Accelerated Left Dive | (n_x^max, n_z^min, −cos⁻¹(c/n_z^max))
27 | Decelerated Left Dive | (n_x^min, n_z^min, −cos⁻¹(c/n_z^max))
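The primitive set in Table 1 maps naturally onto a lookup from a discrete action ID to an (n_x, n_z, γ) command, as a discrete-action opponent (e.g., the Minimax policy of Figure 5) would use. The sketch below covers only the first nine entries and assumes the load-factor limits of Table 3; the helper name `maneuver_command` is illustrative, not the paper's code.

```python
import math

# Illustrative load-factor limits taken from Table 3.
NX_MIN, NX_MAX = -1.0, 2.0
NZ_MAX = 6.0

def maneuver_command(action_id):
    """Map a discrete action ID from Table 1 to an (n_x, n_z, gamma) command.
    Only the forward/turn primitives are shown; climbs and dives follow the same pattern."""
    bank = math.acos(1.0 / NZ_MAX)            # roll angle for a level turn at n_z = n_z^max
    table = {
        1: (0.0,    1.0,    0.0),             # constant speed forward
        2: (NX_MAX, 1.0,    0.0),             # accelerated forward
        3: (NX_MIN, 1.0,    0.0),             # decelerated forward
        4: (0.0,    NZ_MAX,  bank),           # constant speed right turn
        5: (NX_MAX, NZ_MAX,  bank),           # accelerated right turn
        6: (NX_MIN, NZ_MAX,  bank),           # decelerated right turn
        7: (0.0,    NZ_MAX, -bank),           # constant speed left turn
        8: (NX_MAX, NZ_MAX, -bank),           # accelerated left turn
        9: (NX_MIN, NZ_MAX, -bank),           # decelerated left turn
    }
    return table[action_id]

print(maneuver_command(5))
```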
Table 2. Initial state information.

State | x (km) | y (km) | z (km) | v (m/s) | θ (rad) | ψ (rad) | HP | θ_L (rad) | ψ_L (rad)
Blue | 0 | 0 | [4, 5] | [180, 200] | 0 | [−π, π] | 100 | - | -
Red | - | - | - | [180, 200] | 0 | [π/3, 2π/3] | 100 | [−π/6, π/6] | [π/3, 2π/3]
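The ranges in Table 2 translate directly into the environment's reset routine. A minimal sketch for the blue UAV only; the signs of the angular intervals are reconstructed from context, and the red UAV's position, which the table leaves unspecified, is omitted.

```python
import numpy as np

rng = np.random.default_rng()

def sample_blue_initial_state():
    """Sample the blue UAV's initial state from the ranges in Table 2.
    Altitudes in km are converted to metres; angular bounds are as reconstructed."""
    return {
        "x": 0.0,
        "y": 0.0,
        "z": rng.uniform(4.0, 5.0) * 1e3,     # altitude in metres
        "v": rng.uniform(180.0, 200.0),       # speed in m/s
        "theta": 0.0,                         # flight-path angle
        "psi": rng.uniform(-np.pi, np.pi),    # heading angle
        "hp": 100.0,
    }

print(sample_blue_initial_state())
```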
Table 3. Simulation parameter settings.

Parameter | Value | Parameter | Value | Parameter | Value
v_min | 0.25 Ma | v_max | 1 Ma | δ_{x,y,z} | 1 × 10^4
θ_min | −π/2 | θ_max | π/2 | δ_v | 1 × 10^3
n_x^min | −1 | n_x^max | 2 | δ_HP | 200
n_z^min | −3 | n_z^max | 6 | d_0 | 0.9 km
D_min | 0.5 km | D_max | 1.2 km | h_0 | 5.5 km
φ_max | 15 deg | q_max | 30 deg | α_{1,2} | 25
h_min | 3 km | h_max | 8 km | α_{3,4} | 0.01
v_c^max | 2 Ma | z_min | 1 km | α_5 | 500
z_max | 10 km | lr | 2.5 × 10^−4 | α_{6,7} | 1/10
batch | 256 | buffer | 1 × 10^6 | α_8 | 1 × 10^3
γ | 0.99 | α_{9,10} | 1/15
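The learning rate, batch size, replay-buffer size, and discount factor in Table 3 correspond to standard constructor arguments of the Stable-Baselines3 SAC implementation [17]. A minimal sketch, assuming the custom Gym environment [16] is registered under the hypothetical ID `UavCombat1v1-v0` and that the total training budget is set separately:

```python
import gym
from stable_baselines3 import SAC

# Hypothetical ID for the custom 1v1 engagement environment.
env = gym.make("UavCombat1v1-v0")

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=2.5e-4,    # lr in Table 3
    buffer_size=1_000_000,   # replay buffer size
    batch_size=256,
    gamma=0.99,              # discount factor
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # training budget is not specified in Table 3
```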
Table 4. Evaluation comparison between SAC and TD3 with statistical significance.

Evaluation Comparison | Win Rate (95% CI) | Loss Rate (95% CI) | Tie Rate (95% CI) | Mean Returns | Variance of Returns | One-Sided Win Rate p-Value (SAC > TD3)
SAC with no noise input | 87% ± 6.6% | 1% ± 2.0% | 12% ± 6.4% | 266.5532 | 380.5362 | p = 0.0226
TD3 with no noise input | 76% ± 8.4% | 8% ± 5.3% | 16% ± 7.2% | 138.5664 | 530.2239 | -
SAC with low noise input | 85% ± 7.0% | 1% ± 2.0% | 14% ± 6.8% | 250.4452 | 420.4957 | p = 0.0385
TD3 with low noise input | 75% ± 8.5% | 8% ± 5.3% | 17% ± 7.4% | 133.1989 | 545.2348 | -
SAC with high noise input | 84% ± 7.2% | 0% | 16% ± 7.2% | 263.4485 | 450.7286 | p = 0.0203
TD3 with high noise input | 72% ± 8.8% | 6% ± 4.7% | 22% ± 8.1% | 72.4408 | 591.0407 | -
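The confidence intervals and p-values in Table 4 are consistent with a normal-approximation (Wald) interval and a pooled one-sided two-proportion z-test over roughly 100 evaluation episodes per condition (e.g., 1.96·√(0.87·0.13/100) ≈ 6.6%). The snippet below reproduces the noise-free row under that assumption; the episode count N = 100 is inferred, not stated in the table.

```python
import math

N = 100  # evaluation episodes per condition (inferred from the CI widths)

def wald_ci_halfwidth(p, n=N, z=1.96):
    """95% normal-approximation half-width for a win/loss/tie proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

def one_sided_two_proportion_p(p1, p2, n1=N, n2=N):
    """One-sided pooled z-test p-value for H1: p1 > p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability

print(round(wald_ci_halfwidth(0.87), 3))                 # ~0.066, matching 87% +/- 6.6%
print(round(one_sided_two_proportion_p(0.87, 0.76), 4))  # ~0.0226, matching Table 4
```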
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
