Article

Autonomous Maneuvering Decision-Making Method for Unmanned Aerial Vehicle Based on Soft Actor-Critic Algorithm

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Drones 2026, 10(1), 35; https://doi.org/10.3390/drones10010035
Submission received: 22 October 2025 / Revised: 25 December 2025 / Accepted: 26 December 2025 / Published: 6 January 2026

Highlights

What are the main findings?
  • Proposed a continuous control strategy based on the Soft Actor-Critic (SAC) algorithm for UAV autonomous maneuvering in 1v1 tactical encounter scenarios.
  • Introduced a multi-dimensional situation-coupled reward function with a Health Point (HP) system to quantitatively evaluate situational advantages and cumulative tactical performance.
What are the implications of the main findings?
  • Enhanced decision-making precision and robustness under both ideal and noisy observation conditions.
  • Demonstrated the effectiveness of SAC in handling high-dimensional state spaces and generating precise maneuvering commands.

Abstract

Focusing on continuous action space methods for autonomous maneuvering decision making in 1v1 unmanned aerial vehicle scenarios, this paper first establishes a UAV kinematic model and a decision-making framework under the Markov Decision Process. Second, a continuous control strategy based on the Soft Actor-Critic (SAC) reinforcement learning algorithm is developed to generate precise maneuvering commands. Then, a multi-dimensional situation-coupled reward function is designed, introducing a Health Point (HP) metric to assess situational advantages and simulate cumulative effects quantitatively. Finally, extensive simulations in a custom Gym environment validate the effectiveness of the proposed method and its robustness under both ideal and noisy observation conditions.

1. Introduction

Advances in artificial intelligence are facilitating the intelligent evolution of unmanned systems. Unmanned aerial vehicles (UAVs) are extensively employed in both offensive/defensive maneuvering and collaborative missions [1]. With significant enhancements in UAV maneuverability, close-range decision making between UAVs is characterized by high dynamics and high overload, imposing rigorous requirements on the real-time performance and flexibility of UAV autonomous maneuvering decision-making systems [2]. Currently, three primary approaches address UAV close-range maneuver decision making: game theory, optimization search, and data-driven methods [3]. Game theory methods encompass differential games and matrix games. While both can model continuous, dynamic autonomous aerial maneuvering decision processes and derive analytical solutions, they struggle with high-dimensional state spaces [4]. Optimization search methods formulate autonomous aerial maneuvering objectives as multi-objective optimization problems, employing techniques like heuristic search and expert systems [5]. However, their performance is constrained by search space limitations and rule design complexity. Data-driven methods, including Deep Learning (DL) and Deep Reinforcement Learning (DRL), generate maneuver policies autonomously by learning from UAV maneuvering data. However, DL approaches are limited by the quality of the training data used in behavioral cloning. In contrast, DRL typically requires minimal external data, acquiring experience autonomously through agent–environment interaction. This capability shows potential for identifying advantageous maneuvers, making DRL a prominent research focus globally.
However, current deep reinforcement learning (DRL) research on UAV close-range maneuvering decision making primarily employs discrete action space methods, such as Deep Q-Network (DQN) [6], Actor-Critic (AC) [7], or Proximal Policy Optimization (PPO) [8]. The fixed action space inherently restricts the precision of generated maneuver commands. Studies by Yang et al. [9] and Gao et al. [10] leverage continuous action space algorithms (Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), respectively) to enhance command accuracy. Nevertheless, both studies adopt an oversimplified victory condition: the friendly side wins as soon as it enters the enemy's attack zone and meets the weapon launch conditions. This assumption neglects battlefield uncertainties, specifically fluctuations in weapon hit probability that make damage outcomes random, thereby compromising the robustness validation of the proposed strategies.
To address these limitations, this paper employs a multidimensional situation-coupled reward function and the Soft Actor-Critic (SAC) algorithm to generate maneuver decision commands in a continuous action space. To mitigate the impact of outcome randomness, the UAV's Health Points are introduced as an indicator of situational advantage. Furthermore, simulation experiments validate the robustness of the SAC algorithm's decisions under both fully observable inputs and conditions with observational errors.
This study investigates reinforcement learning-based autonomous maneuver decision-making methods applicable to close-range UAV pursuit and evasion scenarios. The main contributions are as follows:
  • This paper completed the modeling of UAV kinematics and the engagement scenario, as well as the construction of a discrete maneuver command library.
  • This paper implemented SAC deep reinforcement learning within a continuous action space using a multidimensional coupled reward function to generate maneuver decision commands.
  • This paper established a Gym simulation environment to validate the robustness of SAC/TD3 reinforcement learning algorithms under both error-free and erroneous input state observations.

2. Problem Formulation

Figure 1 shows the decision-making framework for 1v1 UAV close-range maneuvering, comprising the environment, UAV kinematic models, and maneuvering decision models for both sides. The environment feeds real-time states and situational data to the decision models, which generate maneuver commands executed kinematically to update UAV states and environmental dynamics, closing the perception–decision–action loop.

2.1. UAV Maneuvering Model

Focusing on UAV maneuver decision making, this paper prioritizes positional/trajectory dynamics over attitude change mechanisms. Consequently, a point-particle model is adopted for kinematic and dynamic representations, as follows:
$$
\begin{aligned}
\dot{x}_g &= v\cos\theta\cos\psi \\
\dot{y}_g &= v\cos\theta\sin\psi \\
\dot{z}_g &= v\sin\theta \\
\dot{v} &= g\,(n_x - \sin\theta) \\
\dot{\theta} &= \frac{g}{v}\,(n_z\cos\gamma - \cos\theta) \\
\dot{\psi} &= \frac{g\,n_z\sin\gamma}{v\cos\theta}
\end{aligned}
$$
where θ denotes the pitch angle, defined as the angle between the body longitudinal axis (currently aligned with the velocity vector) and the horizontal plane x g O y g ; ψ represents the yaw angle, which is the angle between the projection of the body longitudinal axis onto the horizontal plane x g O y g and the axis O x g ; and γ is the roll angle (see Figure 2).
The variables n x and n z correspond to the tangential overload (aligned with the velocity vector) and the normal overload (perpendicular to the velocity vector within the longitudinal symmetry plane of the aircraft), respectively. The velocity components in the ground frame O x g y g z g are denoted by ( x ˙ , y ˙ , z ˙ ), v ˙ is the path acceleration, and θ ˙ and ψ ˙ are the angular rates of the pitch and yaw angles, respectively.
The gravitational acceleration is represented by g, and the dynamic model presented in Equation (1) enforces the following constraints on the state and control variables, as shown in the following Equation (2). While this simplified 3-DOF point-mass model exhibits certain limitations—such as not fully incorporating real-world elements like actuator dynamics (e.g., servo response delays), control latency, or environmental disturbances—it still holds significant value in the initial exploration of autonomous maneuvering decisions. This is particularly true in task scenarios requiring the isolation of underlying dynamic complexities to focus on high-level logic and algorithmic interpretability.
$$
v_{\min} \le v \le v_{\max}, \qquad -\frac{\pi}{2} \le \theta \le \frac{\pi}{2}, \qquad -\pi \le \psi \le \pi, \qquad n_{x\min} \le n_x \le n_{x\max}, \qquad n_{z\min} \le n_z \le n_{z\max}, \qquad -\pi \le \gamma \le \pi
$$
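For concreteness, a minimal Python sketch of Equations (1) and (2) using forward-Euler integration is given below; the state ordering, the time step, and the velocity and pitch limits are illustrative assumptions rather than the exact values used in this work.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def step_3dof(state, n_x, n_z, gamma, dt=0.1, v_lim=(50.0, 400.0)):
    """Forward-Euler step of the 3-DOF point-mass model in Equation (1).

    state = [x_g, y_g, z_g, v, theta, psi]; angles in radians.
    The velocity and pitch limits below are illustrative placeholders for
    the constraints in Equation (2), not the paper's configured values.
    """
    x, y, z, v, theta, psi = state

    x_dot = v * np.cos(theta) * np.cos(psi)
    y_dot = v * np.cos(theta) * np.sin(psi)
    z_dot = v * np.sin(theta)
    v_dot = G * (n_x - np.sin(theta))
    theta_dot = (G / v) * (n_z * np.cos(gamma) - np.cos(theta))
    psi_dot = G * n_z * np.sin(gamma) / (v * np.cos(theta))

    return np.array([
        x + x_dot * dt,
        y + y_dot * dt,
        z + z_dot * dt,
        np.clip(v + v_dot * dt, *v_lim),                     # v in [v_min, v_max]
        np.clip(theta + theta_dot * dt, -np.pi / 2, np.pi / 2),
        (psi + psi_dot * dt + np.pi) % (2 * np.pi) - np.pi,  # wrap psi to [-pi, pi]
    ])

# Example: straight-and-level command (n_x = sin(theta), n_z = cos(theta)/cos(gamma))
s0 = np.array([0.0, 0.0, 3000.0, 200.0, 0.0, 0.0])
s1 = step_3dof(s0, n_x=0.0, n_z=1.0, gamma=0.0)
```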

2.2. Autonomous Aerial Maneuvering Environment

The autonomous aerial maneuvering environment computes Health Points (HPs), relative states, and geometric situation angles from bilateral UAV states in (1), providing reference inputs for the maneuver decision-making part. Through situation assessment, it quantifies bilateral advantages.
Figure 3 depicts key situational information in autonomous maneuvering, where red and blue denote friendly and enemy UAVs, respectively. From the friendly UAV’s perspective, the line-of-sight ( LOS ) is defined as the vector pointing from the friendly to the enemy position; φ [ 0 , π ] represents the Antenna Train Angle (ATA), measuring the angular deviation between the friendly velocity vector v r and the LOS direction; q [ 0 , π ] denotes the Aspect Angle (AA), quantifying the enemy’s escape potential as the angle between the enemy velocity vector v b and the reverse LOS direction. The parameters φ max and q max define the angular constraints for the Weapon Engagement Zone (WEZ) and No Escape Zone (NEZ), while D min and D max specify the radial boundaries of these engagement zones.
$$
\varphi_r = \arccos\frac{\boldsymbol{v}_r\cdot\boldsymbol{LOS}}{\lVert\boldsymbol{v}_r\rVert\,\lVert\boldsymbol{LOS}\rVert}, \qquad
q_r = \arccos\frac{\boldsymbol{v}_b\cdot\boldsymbol{LOS}}{\lVert\boldsymbol{v}_b\rVert\,\lVert\boldsymbol{LOS}\rVert}
$$
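A small Python helper implementing Equation (3) is sketched below; the vector layout and the numerical safeguards (epsilon and clipping) are implementation conveniences assumed for this example.

```python
import numpy as np

def situation_angles(p_r, v_r, p_b, v_b):
    """Antenna Train Angle (ATA) and Aspect Angle (AA) as in Equation (3).

    p_r, v_r: friendly (red) position and velocity vectors, shape (3,)
    p_b, v_b: enemy (blue) position and velocity vectors, shape (3,)
    Returns (phi_r, q_r) in radians, both in [0, pi].
    """
    los = p_b - p_r                                   # line of sight, red -> blue
    unit = lambda x: x / (np.linalg.norm(x) + 1e-9)   # epsilon avoids division by zero

    phi_r = np.arccos(np.clip(np.dot(unit(v_r), unit(los)), -1.0, 1.0))  # ATA
    q_r = np.arccos(np.clip(np.dot(unit(v_b), unit(los)), -1.0, 1.0))    # AA
    return phi_r, q_r

# Tail-chase example: red directly behind blue, both flying along +x
phi, q = situation_angles(np.array([0., 0., 3000.]), np.array([200., 0., 0.]),
                          np.array([1000., 0., 3000.]), np.array([200., 0., 0.]))
# phi ~ 0 (nose on the LOS), q ~ 0 (blue flying away along the LOS)
```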
To comprehensively quantify the cumulative impact of adverse situational factors during dynamic interactions between the two parties, a Health Point (HP) metric is introduced, denoted as H r for the friendly side and H b for the enemy side. Lower HP values indicate greater accumulated performance loss due to sustained exposure to threat conditions. The simulation terminates when either side's HP drops to zero or below, when a UAV exceeds the airspace limits, or when the maximum number of simulation steps K is reached.
The HP update mechanism is governed by real-time relative geometric relationships, specifically including φ , q, and inter-UAV distance d. The following Equation (4) defines the HP update rule:
$$
H_k(H_{k-1},\varphi,q,d) =
\begin{cases}
H_{k-1} - \Delta H_{\max}\,\dfrac{f_{\mathrm{att}}(\varphi) + f_{\mathrm{esp}}(q) + f_{\mathrm{dist}}(d)}{3}, & \varphi\in[0,\varphi_{\max}]\ \wedge\ d\in[d_{\min},d_{\max}] \\
H_{k-1}, & \text{otherwise}
\end{cases}
$$
where H k represents current health at step k, H k 1 denotes previous health, and Δ H max is the maximum possible health deduction.
To accurately model the contribution degree of a single interaction to HP attenuation, three normalized attenuation factors determine the HP reduction magnitude. The ATA angle factor f att follows a sigmoid response curve: f att represents the threat level based on the Antenna Train Angle (ATA). Smaller φ values indicate better alignment between the velocity vector and the line of sight of our own UAV, thereby enhancing attack accuracy. The sigmoid function models the nonlinear increase in threat as φ approaches zero, mimicking real sensor sensitivity where minor angular deviations significantly affect weapon efficacy. The parameter k φ controls the steepness of the curve, tuned according to empirical data from UAV engagements.
$$
f_{\mathrm{att}}(\varphi) = 1 - \frac{1}{1 + e^{-k_\varphi\left(\varphi - 0.5\,\varphi_{\max}\right)}}
$$
The AA angle factor f esp implements piecewise-linear attenuation: f esp quantifies the threat level from our side to the enemy’s escape potential via the Aspect Angle (AA). Smaller q values (e.g., below 20 ) indicate that the enemy is not fleeing directly, making it more vulnerable to attack; thus f esp attains higher values (closer to 1). The piecewise-linear design reflects tactical realism: at lower q , our advantage is maximized; as q increases, uncertainty in enemy maneuvers increases; at higher q , the threat diminishes as the enemy escapes.
$$
f_{\mathrm{esp}}(q) =
\begin{cases}
1, & q \le 20^\circ \\
1 - k_{q1}\,(q - 20^\circ)/70^\circ, & q \in (20^\circ, 90^\circ] \\
0.2 - k_{q2}\,(q - 90^\circ)/60^\circ, & q \in (90^\circ, 150^\circ] \\
0.05, & q > 150^\circ
\end{cases}
$$
The distance factor f dist employs exponential decay centered at optimal engagement distance: f dist captures the effect of distance on weapon efficacy. Maximum damage occurs near d opt , with rapid decay outside this range due to aerodynamic and sensor limitations. The exponential decay aligns with physical weapon characteristics, such as reduced hit probability at extreme distances. The parameters k 1 , k 2 , p , and r are calibrated to fit the engagement envelope of the UAV’s cannon.
$$
f_{\mathrm{dist}}(d) =
\begin{cases}
e^{-k_1\left(\frac{d_{\mathrm{opt}} - d}{d_{\mathrm{opt}} - d_{\min}}\right)^{p}}, & d \in [d_{\min}, d_{\mathrm{opt}}] \\
e^{-k_2\left(\frac{d - d_{\mathrm{opt}}}{d_{\max} - d_{\mathrm{opt}}}\right)^{r}}, & d \in (d_{\mathrm{opt}}, d_{\max}]
\end{cases}
$$
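A Python sketch of the HP attenuation mechanism in Equations (4)–(7) is given below; every numeric constant (k_φ, k_q1, k_q2, k_1, k_2, p, r, ΔH_max, and the angular and distance bounds) is an illustrative placeholder rather than a calibrated value from the paper.

```python
import numpy as np

# Illustrative parameters only; the paper's calibrated values are not reproduced here.
PHI_MAX = np.radians(30.0)
D_MIN, D_OPT, D_MAX = 200.0, 800.0, 1200.0
DELTA_H_MAX = 5.0
K_PHI, K_Q1, K_Q2 = 10.0, 0.8, 0.15
K1, K2, P, R = 3.0, 3.0, 2.0, 2.0

def f_att(phi):
    """ATA factor, Equation (5): close to 1 when phi is small."""
    return 1.0 - 1.0 / (1.0 + np.exp(-K_PHI * (phi - 0.5 * PHI_MAX)))

def f_esp(q_deg):
    """AA factor, Equation (6): piecewise-linear in the aspect angle (degrees)."""
    if q_deg <= 20.0:
        return 1.0
    if q_deg <= 90.0:
        return 1.0 - K_Q1 * (q_deg - 20.0) / 70.0
    if q_deg <= 150.0:
        return 0.2 - K_Q2 * (q_deg - 90.0) / 60.0
    return 0.05

def f_dist(d):
    """Distance factor, Equation (7): peaks at d_opt, decays toward both bounds."""
    if d <= D_OPT:
        return np.exp(-K1 * ((D_OPT - d) / (D_OPT - D_MIN)) ** P)
    return np.exp(-K2 * ((d - D_OPT) / (D_MAX - D_OPT)) ** R)

def update_hp(hp_prev, phi, q_deg, d):
    """HP update rule of Equation (4): deduct HP only inside the engagement envelope."""
    if 0.0 <= phi <= PHI_MAX and D_MIN <= d <= D_MAX:
        return hp_prev - DELTA_H_MAX * (f_att(phi) + f_esp(q_deg) + f_dist(d)) / 3.0
    return hp_prev
```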
The opponent's health is updated analogously using the same equations but computed from the enemy's perspective (i.e., using φ b ). The three normalized attenuation factors in Equation (4) are shown in Figure 4. Simulation termination and outcome determination are governed by conditional logic based on real-time Health Points (HPs), step progression, and operational constraints. The simulation terminates under one of the following scenarios: (1) Timeout: the maximum step count K is reached without a decisive outcome; (2) Constraint violation: airspace boundaries are breached, such as altitude limits ( h < h min or h > h max ) or distance thresholds ( R > D t h or R < R min ), leading to a penalty for the operational breach; (3) HP depletion: the Health Points of either side drop to zero or below, resolving into a win (when H r > 0 and H b ≤ 0 ), a loss (when H r ≤ 0 and H b > 0 ), or a draw (when both H r ≤ 0 and H b ≤ 0 ). Otherwise, the simulation continues.
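The outcome resolution can be summarized in a short Python sketch; the function name, threshold names, and default values below are placeholders, not the paper's configured limits.

```python
def resolve_episode(hp_r, hp_b, step, k_max, h, r,
                    h_min=500.0, h_max=12000.0, r_min=50.0, d_th=20000.0):
    """Termination logic described in Section 2.2 (threshold values are illustrative).

    Returns None while the episode should continue, otherwise one of
    'timeout', 'boundary', 'win', 'lose', 'draw'.
    """
    if step >= k_max:
        return "timeout"
    if h < h_min or h > h_max or r > d_th or r < r_min:
        return "boundary"
    if hp_r <= 0.0 or hp_b <= 0.0:
        if hp_r > 0.0:
            return "win"
        if hp_b > 0.0:
            return "lose"
        return "draw"
    return None
```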

2.3. Opponent Maneuver Policy

The blue team applies Minimax search [11] with discrete actions. We expanded NASA’s seven basic maneuvers [12] into 27 action elements (see Table 1), assigning acceleration, constant-speed, and deceleration overload commands to each of the nine principal directions in three-dimensional space, where c > 1 denotes a constant parameter used for compound maneuvers.
The Minimax policy is an algorithmic prescription for perfect-information zero-sum games, guaranteeing the minimization of the maximum possible loss under the premise of an optimally counteracting opponent, and thus the maximization of the minimum payoff. Figure 5 illustrates the computational process of the Minimax decision-making policy executed by the adversarial blue UAV, through which it selects the current action that maximizes the projected situational advantage over a k-step time horizon.
In this single-round game-theoretic framework, both agents sequentially select elementary maneuvers from the maneuver action primitive library (Table 1) to be executed over k consecutive time steps (arrows in Figure 5). The middle column nodes represent the payoffs obtained after the blue agent executes one action choice, while the rightmost column nodes denote the payoffs acquired after the red agent (opponent) completes its decision making. When the red agent’s maneuvering action results in the most unfavorable position for the blue side at time k in the future, the algorithm selects the child node with the smallest payoff in the rightmost column for each parent node in the middle column. The blue agent’s action selection should then correspond to the node with the largest payoff in the middle column, thereby maximizing its minimum positional advantage at time k in the future.
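To illustrate the procedure in Figure 5, the following Python sketch performs the Minimax selection for the blue UAV over the discrete maneuver library; the `rollout` and `payoff` callables stand in for the paper's k-step state propagation and situation-advantage evaluation and are assumptions of this example.

```python
def minimax_action(blue_state, red_state, blue_actions, red_actions,
                   rollout, payoff, k_steps):
    """One-ply Minimax selection sketch for the blue UAV (Figure 5).

    `rollout(blue_state, red_state, a_b, a_r, k_steps)` is assumed to propagate
    both UAVs for k steps under fixed maneuvers and return the resulting pair of
    states; `payoff` scores the outcome from blue's perspective.  Both are
    placeholders for the paper's situation model.
    """
    best_action, best_value = None, float("-inf")
    for a_b in blue_actions:                      # blue maximizes ...
        worst = float("inf")
        for a_r in red_actions:                   # ... the worst case over red replies
            s_b, s_r = rollout(blue_state, red_state, a_b, a_r, k_steps)
            worst = min(worst, payoff(s_b, s_r))
        if worst > best_value:
            best_action, best_value = a_b, worst
    return best_action
```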

3. Methodologies

This section details the proposed reinforcement learning maneuver decision method, encompassing state/action space design, reward function formulation, and the SAC decision-making approach.

3.1. State Space and Action Space

This paper adopts a Markov Decision Process (MDP) framework under the assumption of fully observable states. The state input s is defined as a 27-dimensional continuous variable (see Equation (8)), comprising the internal friendly states s own from Equation (1), the friendly Health Point H r , and the friendly action from the previous step a t 1 , alongside sensor-derived observations such as the bilateral relative states s rel and the relevant geometric angles s geo .
$$
\begin{aligned}
s &= \left[s_{\mathrm{own}},\, s_{\mathrm{rel}},\, s_{\mathrm{geo}},\, a_{t-1}\right] \\
s_{\mathrm{own}} &= \left[x_r,\, y_r,\, z_r,\, v_r,\, \theta_r,\, \psi_r,\, H_r\right] \\
s_{\mathrm{rel}} &= \left[\Delta x,\, \Delta y,\, \Delta z,\, R_h,\, R,\, v_c,\, \Delta\dot{x},\, \Delta\dot{y},\, \Delta\dot{z},\, \Delta v\right] \\
s_{\mathrm{geo}} &= \left[\varphi_r,\, q_r,\, \theta_L,\, \psi_L,\, \theta_{VL},\, \psi_{VL},\, \mathrm{HCA}\right] \\
a_{t-1} &= \left[n_x,\, n_z,\, \gamma\right]
\end{aligned}
$$
Here, R h denotes the magnitude of the projection of the LOS onto the horizontal plane, while R represents the magnitude of the LOS (i.e., the three-dimensional distance between the two parties). The variable v c refers to the closure velocity between both sides, as defined in Equation (18). The terms Δx, Δy, and Δz correspond to the components of the difference between the opponent's position vector and our position vector along the O x , O y , and O z axes, respectively. Similarly, Δẋ, Δẏ, and Δż denote the components of the difference between the opponent's velocity vector and our velocity vector along the same axes.
The angle θ L indicates the angular deviation between the LOS and the horizontal plane x g O y g , while ψ L is the azimuth angle of the projection of the LOS onto the x g O y g plane relative to the O x g axis. Additionally, θ V L describes the angle between the projection of the friendly velocity vector and the projection of the LOS onto the y g O z g plane, while ψ V L denotes the angle between the friendly velocity vector and the projection of the LOS onto the x g O y g plane. Finally, HCA (Heading Crossing Angle) refers to the angular separation between the velocity vectors of the opponent and our own side. All these state components are normalized to s in ∈ [ 1 , 1 ] before being fed into the fully connected network (as shown in Equation (9)), where δ own , δ rel , and δ geo are constant normalization vectors.
$$
\begin{aligned}
\delta &= \left[\delta_{\mathrm{own}},\, \delta_{\mathrm{rel}},\, \delta_{\mathrm{geo}},\, \delta_a\right] \\
\delta_{\mathrm{own}} &= \left[\delta_x,\, \delta_y,\, \delta_z,\, \delta_v,\, \pi,\, 2\pi,\, \delta_{\mathrm{HP}}\right] \\
\delta_{\mathrm{rel}} &= \left[\delta_{\Delta x},\, \delta_{\Delta y},\, \delta_{\Delta z},\, \delta_{R_h},\, \delta_{R},\, v_{c\max},\, \delta_{\Delta\dot{x}},\, \delta_{\Delta\dot{y}},\, \delta_{\Delta\dot{z}},\, \delta_{\Delta v}\right] \\
\delta_{\mathrm{geo}} &= \left[2\pi,\, 2\pi,\, \pi,\, 2\pi,\, 2\pi,\, \pi,\, 2\pi\right] \\
\delta_a &= \left[n_{x\max},\, n_{z\max},\, \gamma_{\max}\right] \\
s_{\mathrm{in}} &= \tanh\!\left(s_i/\delta_i\right), \quad i \in \{1, \ldots, 27\}
\end{aligned}
$$
We define a continuous three-dimensional bounded action space for the friendly agent, parameterized as ( n x , n z , γ ) and subject to the constraints in Equation (2). This enables direct, precise control over the aircraft's states, while the opponent's actions are limited to a discrete maneuver library of finite commands (see Table 1).
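As an illustration, a custom Gym environment could declare these spaces roughly as follows; the symmetric overload bounds and the normalization helper are assumptions of this sketch rather than the exact settings of the paper.

```python
import numpy as np
from gym import spaces  # OpenAI Gym spaces API, as cited in Section 4

NX_MAX, NZ_MAX, GAMMA_MAX = 2.0, 8.0, np.pi   # illustrative bounds for Equation (2)

# Continuous 3-D action a = [n_x, n_z, gamma]
action_space = spaces.Box(
    low=np.array([-NX_MAX, -NZ_MAX, -GAMMA_MAX], dtype=np.float32),
    high=np.array([NX_MAX, NZ_MAX, GAMMA_MAX], dtype=np.float32),
    dtype=np.float32,
)

# 27-D observation, squashed to (-1, 1) by Equation (9)
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(27,), dtype=np.float32)

def normalize_state(s, delta):
    """Equation (9): element-wise squashing with per-dimension constants delta_i."""
    return np.tanh(np.asarray(s) / np.asarray(delta)).astype(np.float32)
```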

3.2. Reward Function Design

Effective reward functions critically impact the convergence speed and quality of reinforcement learning. Traditional designs typically use isolated angle, speed, altitude, and distance metrics with minimal consideration of coupling, potentially compromising situational assessment accuracy and maneuver precision. This paper designs an expert-guided multi-dimensional coupled dense reward [13], including the relative position reward R relpos , the closure speed reward R closure , the altitude reward R altitude , the firing rewards R ownfire and R enemyfire based on bilateral health changes, and event rewards (attack advantage reward R c z , termination reward R end ).
$$
R_{\mathrm{relpos}}\left(\bar{\varphi}_r, \bar{q}_r\right) = \left(2\bar{\varphi}_r - 1\right)\cdot S\!\left(\bar{q}_r, \alpha_1, \tfrac{1}{2}\right) - 2\bar{\varphi}_r + \tfrac{1}{2}
$$
$$
R_{\mathrm{closure}}\left(\bar{q}_r, v_c, d\right) =
\begin{cases}
\dfrac{v_c}{v_{c\max}}\left[1 - S\!\left(\bar{q}_r, \alpha_1, \tfrac{1}{2}\right)\right]\cdot S\!\left(d, \alpha_2, d_0\right), & d \ge d_0 \\
-1, & d < d_0
\end{cases}
$$
$$
R_{\mathrm{altitude}}\left(h_r\right) =
\begin{cases}
S\!\left(h_r, \alpha_3, h_{\min}\right) - 1, & h_r < h_0 \\
-S\!\left(h_r, \alpha_4, h_{\max}\right), & h_r \ge h_0
\end{cases}
$$
$$
R_{\mathrm{ownfire}}\left(\bar{\varphi}_r, d\right) = \Gamma_R(d)\cdot\left[1 - S\!\left(\bar{\varphi}_r, \alpha_5, \bar{\varphi}_{\max}\right)\right]
$$
$$
R_{\mathrm{enemyfire}}\left(\bar{q}_r, d\right) = -\Gamma_B(d)\cdot S\!\left(\bar{q}_r, \alpha_8, \bar{q}_{\max}\right)
$$
$$
R_{\mathrm{end}} =
\begin{cases}
+200 + 200\,\dfrac{K - k}{K}, & \text{result} = \text{win} \\
-400, & \text{result} = \text{lose or draw} \\
-600, & \text{result} = \text{timeout or beyond the boundaries} \\
0, & \text{otherwise}
\end{cases}
$$
$$
R_{\mathrm{cz}} =
\begin{cases}
+25, & \varphi_r \le \varphi_{\max},\ q_r \le q_{\max},\ D_{\min} \le d \le D_{\max} \\
0, & \text{otherwise}
\end{cases}
$$
Formulas (10)–(14) are dense reward functions, where S denotes the sigmoid function in Equation (17). As illustrated in Figure 6, α determines the rate of change of the sigmoid function at x = x 0 : a larger α produces a more pronounced variation of the sigmoid curve at x = x 0 , rendering it more sensitive to perturbations in x, whereas a smaller α yields a less sensitive response, meaning that the function depends only weakly on changes in x. The constants α 1 – α 10 are not arbitrarily set but are determined based on expert prior knowledge from [13]. Moreover, these parameters are not unique; they can be adjusted within a reasonable range as long as the sigmoid function effectively captures the intended trends in the relevant dimensions, such as the sensitivity to angular or distance variations.
The remaining quantities are φ ¯ r and q ¯ r (the normalized φ r and q r for the red side), the optimal attack distance d 0 and flight altitude h 0 , the inter-UAV approach speed v c , the reward coefficients Γ R ( d ) and Γ B ( d ) applied during own fire and enemy attacks, and the coefficients β R and β B quantifying the offensive aggressiveness of the red and blue sides, respectively.
$$
S\!\left(x, \alpha, x_0\right) = \frac{1}{1 + e^{-\alpha\left(x - x_0\right)}}
$$
$$
v_c = \left(\boldsymbol{v}_r - \boldsymbol{v}_b\right)\cdot\frac{\boldsymbol{o}_b - \boldsymbol{o}_r}{\lVert\boldsymbol{o}_b - \boldsymbol{o}_r\rVert}
$$
$$
\Gamma_R(d) =
\begin{cases}
\beta_R\cdot S\!\left(d, \alpha_6, D_{\min}\right), & d < d_0 \\
\beta_R\cdot\left[1 - S\!\left(d, \alpha_7, D_{\max}\right)\right], & d \ge d_0
\end{cases}
$$
$$
\Gamma_B(d) =
\begin{cases}
\beta_B\cdot S\!\left(d, \alpha_9, D_{\min}\right), & d < d_0 \\
\beta_B\cdot\left[1 - S\!\left(d, \alpha_{10}, D_{\max}\right)\right], & d \ge d_0
\end{cases}
$$
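A compact Python sketch of the dense reward terms, as reconstructed in Equations (10)–(14) and (17)–(20), is given below; all numeric constants (α_1–α_10, β_R, β_B, and the distance and altitude bounds) are placeholders rather than the expert-tuned values used in this work.

```python
import numpy as np

def S(x, alpha, x0):
    """Parameterized sigmoid of Equation (17)."""
    return 1.0 / (1.0 + np.exp(-alpha * (x - x0)))

# Illustrative constants; alpha_1..alpha_10 and beta_R, beta_B are placeholders.
A1, A2, A3, A4, A5 = 10.0, 0.0125, 0.01, 0.01, 12.0
A6, A7, A8, A9, A10 = 0.01, 0.01, 12.0, 0.008, 0.008
BETA_R = BETA_B = 1.0
D_MIN, D0, D_MAX = 200.0, 800.0, 1200.0
H_MIN, H0, H_MAX = 500.0, 5000.0, 12000.0
VC_MAX = 400.0
PHI_BAR_MAX = Q_BAR_MAX = 30.0 / 180.0   # normalized angular thresholds

def r_relpos(phi_bar, q_bar):                       # Equation (10)
    return (2 * phi_bar - 1) * S(q_bar, A1, 0.5) - 2 * phi_bar + 0.5

def r_closure(q_bar, v_c, d):                       # Equation (11)
    if d < D0:
        return -1.0
    return (v_c / VC_MAX) * (1 - S(q_bar, A1, 0.5)) * S(d, A2, D0)

def r_altitude(h):                                  # Equation (12)
    return S(h, A3, H_MIN) - 1.0 if h < H0 else -S(h, A4, H_MAX)

def gamma_R(d):                                     # Equation (19)
    return BETA_R * S(d, A6, D_MIN) if d < D0 else BETA_R * (1 - S(d, A7, D_MAX))

def gamma_B(d):                                     # Equation (20)
    return BETA_B * S(d, A9, D_MIN) if d < D0 else BETA_B * (1 - S(d, A10, D_MAX))

def r_ownfire(phi_bar, d):                          # Equation (13)
    return gamma_R(d) * (1 - S(phi_bar, A5, PHI_BAR_MAX))

def r_enemyfire(q_bar, d):                          # Equation (14)
    return -gamma_B(d) * S(q_bar, A8, Q_BAR_MAX)
```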
The relative position reward function R relpos φ ¯ r , q ¯ r incentivizes our aircraft to maintain a tail-chase position behind the enemy, maximizing tactical advantage. The variables φ ¯ r and q ¯ r are normalized versions of φ r and q r , respectively, defined as φ ¯ r = φ r / π and q ¯ r = q r / π , scaling them to the range [ 0 , 1 ] . As illustrated in Figure 7, the reward function exhibits a rapid decay at q ¯ r = 0.5 (corresponding to the Aspect Angle q = π / 2 ), which serves as a critical threshold: for q ¯ r < 0.5 , our aircraft is behind the enemy with a high advantage, while at q ¯ r = 0.5 , it is at the beam position; and for q ¯ r > 0.5 , the scenario transitions to a head-on engagement where enemy threat escalates and our positional superiority diminishes, often resulting in penalized rewards.
Furthermore, the function incorporates directional alignment constraints via φ ¯ r ; when φ ¯ r > 0.5 (indicating a substantial misalignment between our velocity vector and the line-of-sight, implying the aircraft is turning away from the target), the reward value decreases to penalize deviations from precise orientation and sustained pursuit, thereby indirectly incentivizing the agent to maintain optimal tracking behavior and ensuring robust multi-faceted situation assessment. This integrated approach enhances decision-making realism by simultaneously accounting for both positional and orientational factors.
The closure speed reward function R closure q ¯ r , v c , d incentivizes our agent to approach the enemy UAV from the rear to maximize tactical advantage, while penalizing cases where the enemy aircraft approaches from behind our own. Its magnitude scales with distance to adaptively balance aggression and safety: larger rewards at greater distances encourage a proactive approach, while diminishing rewards at closer ranges mitigate overshooting risks. When d < d 0 , the reward is held at a constant negative value to prevent collision or overshooting from excessive proximity; here, d 0 denotes the optimal engagement distance: although closer proximity may enhance attack effectiveness, it significantly heightens crash and overshoot risks due to limited maneuvering space.
The design rationale for this function is grounded in operational constraints of UAV combat dynamics. Specifically, the sigmoid function parameters, such as α = 0.0125 for S d , α 2 , d 0 , are derived from the UAV’s optimal gun engagement zone, typically within D min , D max (e.g., 500 1200 m, as illustrated in Figure 8). This parameter selection ensures a gradual decline in reward values at the optimal engagement distance d 0 , promoting smooth transitions in UAV distance and velocity. The gradual decay prevents abrupt behavioral changes, enhancing training stability and real-world applicability by avoiding erratic maneuvers. Thus, the reward function effectively balances encouragement and penalty mechanisms, aligning with physical constraints and ensuring robust policy convergence. This approach not only reinforces safe engagement practices but also leverages expert knowledge to optimize performance in dynamic aerial environments (see Figure 9).
The altitude reward function R altitude h r is designed to constrain our UAV to maneuver within a specified altitude range h r h min , h max , mitigating risks at excessively high altitudes (where thin air may cause engine aerodynamic failure and degraded control efficiency) and excessively low altitudes (where increased drag and crash hazards prevail). Here, h 0 denotes the optimal flight altitude. Through the synergistic design of parameters α 3 and α 4 , the reward function ensures a smooth transition in reward values as altitude deviates from h 0 , mitigating abrupt changes and avoiding excessive height fluctuations caused by velocity spikes. This design stabilizes the UAV’s altitude within the target range, balancing safety with operational flexibility (see Figure 10).
The firing reward function R ownfire φ ¯ r , d and R enemyfire q ¯ r , d are designed to guide our agent to gain significant offensive advantages while avoiding enemy fire damage. It encourages our side to achieve the smallest possible target Aspect Angle within a certain attack range, thereby maximizing the fire damage inflicted on the enemy. Here, d 0 denotes the optimal weapon engagement distance, with D min and D max defining the minimum and maximum bounds of this optimal range.
Γ R d and Γ B d model the distance-dependent variation of our and the enemy’s fire damage reward coefficients, respectively. The constants β R and β B quantify the offensive intent intensity of our and the enemy’s UAVs, where larger absolute values signify stronger aggressiveness. Notably, we assume symmetric intent strengths ( β R = β B ) to ensure balanced evaluation. As illustrated in Figure 11b, conservative design assigns the enemy a slightly larger engagement range (via parameters α 8 , α 9 ) versus our UAV (via α 6 , α 7 ), motivating our agent to devise superior maneuver strategies.
Event-based rewards comprise two fundamental components: the attack advantage reward R cz (refer to Equation (16)) and the termination reward R end (refer to Equation (15)). Specifically, R cz is designed to incentivize the agent to engage the enemy from within its no-escape zone, thereby maximizing offensive efficiency by promoting optimal positioning and tactical dominance.
Conversely, R end functions as an environmental bonus awarded at the conclusion of each episode, contingent upon the final outcome and simulation context. This reward mechanism aims to motivate our side to accomplish missions rapidly and reliably within constrained time frames. For victorious scenarios, R end allocates a base reward of +200, augmented by an additional bonus of up to +200 that is proportional to the remaining simulation time (where K denotes the maximum steps per episode and k represents the termination step). This time-sensitive incentive encourages early decisive successes by rewarding expedient victories. In cases of loss or draw, a fixed penalty of −400 is enforced to discourage adverse outcomes, whereas timeout or boundary violations incur a stricter penalty of −600 to preclude deadlock situations and rule infractions.
This hierarchically structured reward design ensures that the reward signals acquired by the agent are intrinsically aligned with the dynamic complexities of real-world aerial combat. Consequently, the derived decision-making policy maintains high operational efficiency and reliability under stringent physical constraints, including platform dynamics and airspace limitations, thereby enhancing practical combat effectiveness.
To ensure the robustness and tactical effectiveness of the reward function, we adopted an incremental constructive design strategy driven by domain expert insights [14] and simulation failure analysis. Initially, a baseline reward set { R relpos ,   R closure ,   R altitude ,   R ownfire ,   R cz ,   R end } was established. Preliminary experiments indicated that although this configuration achieved basic convergence, it exhibited poor tactical efficiency; specifically, the agent frequently entered a mutual tail-chasing Nash equilibrium [15]. To break these deadlocks, the weights of R relpos and R closure were increased to 2–3 times their initial values to emphasize rapid approach and positional advantage acquisition.
However, despite improved positioning, the agent exhibited a critical vulnerability: it frequently sustained high damage from enemy fire due to a lack of defensive awareness. To address this, the R enemyfire term was integrated, which significantly enhanced survivability. Subsequently, observations revealed a 'delayed aiming' phenomenon where the agent failed to lock onto the target quickly enough, allowing the enemy to fire first in head-on engagements. Consequently, to accelerate convergence and improve aiming precision, the alignment penalty term $R_{\bar{\varphi}_r} = -\bar{\varphi}_r$ was introduced. Concurrently, other weights were fine-tuned: ω ownfire was increased to enhance firing-driven convergence, and ω relpos was slightly reduced to avoid over-reliance on situational angular advantages, which could cause stagnation. This step-wise integration balanced tactical priorities with training stability, ultimately yielding the final robust configuration.
In the overall reward function R total , the final weights ω 1 to ω 6 satisfy $\sum_{i=1}^{6}\omega_i = 1$, which normalizes the dense rewards and prevents any single component from dominating. Additionally, a scaling factor K > 1 is applied to amplify the total reward magnitude, mitigating saturation and stabilizing gradient updates during training.
$$
R_{\mathrm{total}} = K\left(\omega_1 R_{\bar{\varphi}_r} + \omega_2 R_{\mathrm{relpos}} + \omega_3 R_{\mathrm{closure}} + \omega_4 R_{\mathrm{altitude}} + \omega_5 R_{\mathrm{ownfire}} + \omega_6 R_{\mathrm{enemyfire}}\right) + R_{\mathrm{cz}} + R_{\mathrm{end}}
$$
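Reusing the dense-reward helpers from the previous sketch, the weighted combination could be assembled as follows; the weight values and the scaling factor are illustrative assumptions, not the tuned configuration reported here.

```python
# Placeholder weights (summing to 1) and scaling factor; the paper's tuned values differ.
WEIGHTS = dict(align=0.10, relpos=0.25, closure=0.20,
               altitude=0.10, ownfire=0.25, enemyfire=0.10)
SCALE = 5.0  # the amplification factor K > 1 described in the text

def r_total(phi_bar, q_bar, v_c, d, h, r_cz, r_end):
    """Weighted dense rewards plus event rewards, following the R_total expression."""
    dense = (WEIGHTS["align"] * (-phi_bar)            # alignment penalty term
             + WEIGHTS["relpos"] * r_relpos(phi_bar, q_bar)
             + WEIGHTS["closure"] * r_closure(q_bar, v_c, d)
             + WEIGHTS["altitude"] * r_altitude(h)
             + WEIGHTS["ownfire"] * r_ownfire(phi_bar, d)
             + WEIGHTS["enemyfire"] * r_enemyfire(q_bar, d))
    return SCALE * dense + r_cz + r_end
```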

3.3. Policy Network Training Method

Unlike traditional policy optimization, the core innovation of SAC is the incorporation of a maximum-entropy term into the objective: the algorithm maximizes the expected cumulative reward while maintaining the entropy of the action distribution of policy π at each state, i.e.,
$$
J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t + \alpha\,\mathcal{H}\!\left(\pi\left(\cdot\mid s_t\right)\right)\right)\right]
$$
Among them, the temperature coefficient α is adjusted dynamically by minimizing the loss in Equation (23) with respect to a target entropy ($\bar{\mathcal{H}} = -\dim\mathcal{A}$), balancing the exploration–exploitation tradeoff of the SAC strategy, significantly enhancing training stability in uncertain environments, and mitigating the risk of local optima.
$$
J(\alpha) = \mathbb{E}_{a_t\sim\pi_t}\left[-\alpha\log\pi_t\!\left(a_t\mid s_t\right) - \alpha\bar{\mathcal{H}}\right]
$$
SAC exhibits significant advantages over other policy gradient (PG) and Actor-Critic (AC) algorithms. The maximum entropy objective facilitates more robust exploration, avoiding the premature convergence often seen in traditional PG methods. Compared with deterministic policies like those in DDPG, SAC’s stochastic policy with entropy regularization maintains a better exploration–exploitation balance, adapting to various environments without extensive hyperparameter tuning. Furthermore, the use of dual Q-networks reduces the overestimation bias common in AC methods, leading to more stable and reliable learning. These characteristics make SAC highly effective for continuous control tasks, such as UAV maneuvering, where precision and robustness are paramount.
SAC adopts the Actor-Critic architecture (Figure 12), comprising two soft Q networks ( θ 1 , θ 2 ), two target Q networks ( θ ¯ 1 , θ ¯ 2 ), and one actor network ( ϕ ). The critic optimizes the Q networks by minimizing the Bellman residual MSE loss (Equation (24)) and employs a dual Q mechanism to take the minimum Q value (Equation (25)), thereby mitigating Q-value overestimation. To handle the non-differentiability of random actions, the actor uses a reparameterization trick to convert action sampling into a deterministic linear transformation with random noise (Equation (26)), rendering the policy loss differentiable (Equation (27)).
$$
J_Q\!\left(\theta_i\right) = \mathbb{E}_{\left(s_t, a_t\right)\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta_i}\!\left(s_t, a_t\right) - \left(r\!\left(s_t, a_t\right) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\left[V_{\bar{\theta}}\!\left(s_{t+1}\right)\right]\right)\right)^2\right]
$$
$$
Q_{\bar{\theta}}\!\left(s_{t+1}, a_{t+1}\right) = \min\!\left(Q_{\bar{\theta}_1}\!\left(s_{t+1}, a_{t+1}\right),\ Q_{\bar{\theta}_2}\!\left(s_{t+1}, a_{t+1}\right)\right)
$$
$$
a_t = f_\phi\!\left(\varepsilon_t; s_t\right) = \mu + \sigma\cdot\varepsilon_t
$$
$$
J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\,\varepsilon_t\sim\mathcal{N}}\left[\alpha\log\pi_\phi\!\left(f_\phi\!\left(\varepsilon_t; s_t\right)\mid s_t\right) - Q_\theta\!\left(s_t, f_\phi\!\left(\varepsilon_t; s_t\right)\right)\right]
$$
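The update rules in Equations (22)–(27) correspond to the standard SAC losses; a minimal PyTorch sketch is shown below, with the Q networks and the squashed-Gaussian actor abstracted as callables (the experiments rely on the stable-baselines3 implementation, so this sketch is illustrative only).

```python
import torch
import torch.nn.functional as F

def sac_losses(q1, q2, q1_targ, q2_targ, actor, batch, gamma, log_alpha, target_entropy):
    """Sketch of the SAC losses in Equations (22)-(27).

    q1, q2, q1_targ, q2_targ: callables mapping (s, a) -> Q-value tensor (B, 1)
    actor: callable mapping s -> (a, log_prob) via the reparameterization trick
    batch: dict of tensors s, a, r, s_next, done
    """
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))
    alpha = log_alpha.exp()

    # Critic target: clipped double Q minus the entropy term (soft value of Eq. (24)-(25))
    with torch.no_grad():
        a_next, logp_next = actor(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    critic_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)

    # Actor loss, Eq. (27): maximize Q while keeping the policy entropy high
    a_new, logp_new = actor(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature loss, Eq. (23): drive the policy entropy toward the target entropy
    alpha_loss = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```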

4. Simulation and Results

The simulation environment was built on OpenAI Gym [16], a widely adopted benchmark for reinforcement learning research. For the training of agents, we utilized the stable-baselines3 library [17] (based on the PyTorch framework (version 2.6.0)). To ensure a comprehensive comparison and evaluation of performance, we implemented and trained agents using the Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) algorithms, respectively. The neural network architecture for both algorithms consisted of four fully connected layers with ReLU activation functions.
Each training episode lasted 200 s with a 10 Hz decision frequency, and the total training was conducted over 50,000 episodes. For policy evaluation, pre-trained policy network weights were loaded to ensure that fixed strategies were used, and assessments were performed under randomized initial conditions (including both advantageous and parity scenarios). The evaluation was divided into two distinct phases: First, we assessed the clean performance of the trained policies under nominal conditions without any perturbations. Then, in the second phase, observational noise was injected into the state inputs to test the robustness of the policies.
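A minimal stable-baselines3 training script consistent with this setup might look as follows; the environment id `UavDogfight-v0`, the hidden-layer sizes, and the timestep budget are assumptions of this sketch, while the algorithm choice, the four-layer ReLU MLP, and the parallel-environment setup follow the description above and in Section 4.2.

```python
import torch.nn as nn
from stable_baselines3 import SAC, TD3
from stable_baselines3.common.env_util import make_vec_env

# "UavDogfight-v0" is a hypothetical id for the custom Gym environment.
policy_kwargs = dict(net_arch=[256, 256, 256, 256], activation_fn=nn.ReLU)

vec_env = make_vec_env("UavDogfight-v0", n_envs=8)   # 8 parallel simulation workers
model = SAC("MlpPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10_000_000)              # illustrative budget
model.save("sac_uav_policy")

# The TD3 baseline is trained the same way:
# model = TD3("MlpPolicy", vec_env, policy_kwargs=policy_kwargs, verbose=1)
```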

4.1. Simulation Parameters and Initialization Conditions Settings

In the scenario, the blue UAV represents the opponent and is initialized with a random heading at a randomized altitude within a specified range. A local East–North–Up (ENU) coordinate frame is established at the blue UAV's initial position. Representing our own agent, the red UAV is then initialized relative to the blue UAV within a defined rear cone sector. Its initial position is constrained by a specific LOS geometric bearing ψ L r , LOS elevation angle θ L r , and distance range R, with its heading ψ r facing toward the blue UAV within a certain angular tolerance. The initial state information and simulation parameters of the red and blue UAVs are listed in Table 2 and Table 3, respectively.

4.2. Results Analysis

The training is performed on a desktop computer equipped with a 12th Gen Intel(R) Core(TM) i5-12600KF CPU and an NVIDIA GeForce RTX 4070 SUPER GPU. We utilize eight encapsulated environments for parallel simulation and experience collection, thereby accelerating policy optimization and reducing training time. Two representative DRL algorithms are selected for comparison: SAC and TD3, both state-of-the-art single-agent DRL algorithms that support continuous control.
We conducted five independent training sessions for both the SAC and TD3 algorithms, with each session comprising 50,000 episodes. As illustrated in Figure 13, the left panel presents the training performance of SAC, where the five thin curves denote the average reward per 100 episodes (smoothed via Savitzky–Golay filtering) from each of the five runs, the thick curve represents the smoothed mean reward across these five runs, and the shaded area corresponds to the 95% confidence interval (i.e., ±1.96 standard deviations from the mean curve); similarly, the right panel presents the training performance of TD3. It can be observed that both SAC and TD3 algorithms converge within a limited number of training episodes, indicating that the reward function design is universally applicable to both methods in this scenario. However, the convergence speed of TD3 is significantly slower than that of SAC, with SAC converging at around 15,000 episodes and TD3 converging after nearly 25,000 episodes.
We evaluated the performance of the policies from the SAC and TD3 methods over 300 episodes each. During the first 100 episodes, the input states were fully observable and noise-free. In the subsequent 200 episodes, we introduced zero-mean multivariate independent Gaussian noise with a diagonal covariance matrix into the input states. Specifically, for 100 episodes, we applied low-level noise (with smaller standard deviations). For the remaining 100 episodes, we applied high-level noise (with larger standard deviations), using distinct values for each dimension. To ensure comparability across experiments, the same random seed was used throughout all evaluations to maintain consistent initial conditions, and for each policy, identical noise was applied at the corresponding episode and simulation step. The evaluation results are summarized in Table 4.
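The noise-injection protocol can be sketched as follows; the per-dimension standard deviations shown are placeholders, since the exact values used in the experiments are not listed here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for comparable evaluations

# Per-dimension standard deviations are placeholders; the experiments use
# distinct values for each of the 27 state dimensions at two noise levels.
sigma_low = np.full(27, 0.01, dtype=np.float32)
sigma_high = np.full(27, 0.05, dtype=np.float32)

def noisy_observation(obs, sigma):
    """Zero-mean independent Gaussian noise with diagonal covariance."""
    return obs + rng.normal(0.0, sigma).astype(np.float32)

# Evaluation loop sketch (model and env as in the training snippet above):
# obs = env.reset()
# action, _ = model.predict(noisy_observation(obs, sigma_high), deterministic=True)
```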
Statistical analysis of the results in Table 4 demonstrates the significant superiority of the SAC policy over the TD3 baseline, validated by one-sided hypothesis tests ( α = 0.05 ). The SAC agent achieved statistically significantly higher win rates across all conditions: no noise (p = 0.0226), low noise (p = 0.0385), and high noise (p = 0.0203). Multiple performance metrics support this conclusion—SAC maintained substantially higher mean returns (e.g., 266.55 vs. 138.57 under no noise) while simultaneously achieving lower loss rates ( 1 % ± 2.0 % vs. 8 % ± 5.3 % ) and reduced return variance (380.54 vs. 530.22), indicating enhanced stability and robustness in noisy environments. These findings statistically validate the theoretical advantages of SAC’s maximum entropy framework in handling high-dimensional state spaces and sensor uncertainties, providing empirical evidence for its efficacy in a continuous action space for UAV autonomous maneuvering decisions.
Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, Figure 21, Figure 22, Figure 23, Figure 24, Figure 25, Figure 26, Figure 27, Figure 28, Figure 29, Figure 30, Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 show the simulation trajectories of the SAC and TD3 algorithms under different initial situational conditions. Among them, Figure 14, Figure 15 and Figure 16 correspond to the initial condition where our side is in a tail-chase scenario with an altitude advantage; Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21 represent the initial condition where our side is in a tail-chase scenario but at an altitude disadvantage; Figure 22, Figure 23 and Figure 24 depict the initial condition where our side is in a tail-chase scenario with altitude parity; Figure 25, Figure 26, Figure 27, Figure 28, Figure 29 and Figure 30 illustrate the head-on initial situation with our side having an altitude advantage; and Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 portray the head-on initial situation with our side at an altitude disadvantage. In these figures, the pentagram denotes the trajectory starting point (corresponding to T 1 ), and the solid circle indicates the endpoint.
In each scenario depicted in Figure 14 and Figure 15, both sides start from identical initial conditions. The SAC algorithm successfully hits the target regardless of the presence of observational noise, whereas the TD3 algorithm fails to fully defeat the target when subjected to higher noise levels. As illustrated in Figure 16, during the interval from T2 to T3 (i.e., simulation time 60 s to 120 s), the red side gains a significant relative positional and range advantage, thereby securing a brief firing opportunity. However, from T3 to T4, due to a sharp left-turn maneuver executed by the blue side, the noise likely introduces substantial errors into the red side’s observational inputs. This causes the TD3 policy to fail in executing appropriate maneuvers correctly, ultimately resulting in deviation from the intended trajectory and a crash beyond the minimum altitude boundary. This outcome underscores TD3’s inherent sensitivity to observational noise due to its deterministic policy design, rather than reflecting neural network overfitting; in contrast, SAC’s maximum entropy framework maintains stability by promoting exploratory actions under similar uncertain conditions.
Based on Figure 17, Figure 18 and Figure 19, it can be observed that, under the initial condition of tail-chase altitude disadvantage, the SAC method consistently maintains both positional and angular advantages in tail pursuit, thereby securing firing opportunities. In contrast, as shown in Figure 20 and Figure 21, the performance of the TD3 policy in this scenario is less favorable. From Figure 21, it can be seen that, during the period from T1 to T6 (the first 500 s of simulation), the red side generally holds a superior posture. However, starting at T6, when the blue side executes a circular maneuver toward the upper right, the TD3 policy fails to continue tracking the opponent’s movement. Instead, it adopts a relatively conservative climbing maneuver, leading to an immediate decline in its situational advantage. Although by time T8, its relative positional advantage improves, the policy still opts for damage avoidance, ultimately allowing the blue side to overtake and gain dominance.
Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20 and Figure 21 compare the trajectory trends of the SAC and TD3 algorithms under an initial tail-chase advantage for our agent. When our agent has an initial altitude advantage, the SAC policy in Figure 14 exhibits lower average standard deviations of maneuver commands (with standard deviations of 0.947 for n x , 1.196 for n z , and 1.093 for γ ) compared with the TD3 policy in Figure 15 (with standard deviations of 1.469 for n x , 1.172 for n z , and 1.516 for γ ). Furthermore, in terms of Energy Consumption (measured by control effort n x 2 + n z 2 in a trajectory) and Control Smoothness (measured by the mean absolute difference of command signals), SAC achieves significantly lower values of 3486.13 and 0.2002, respectively, whereas TD3 records higher values of 8043.52 and 0.2881. Based on the quantitative analysis of these three metrics, it is evident that the SAC algorithm generates trajectories that are not only statistically more stable but also physically smoother and more energy efficient.
Similarly, when our agent is at an initial altitude disadvantage (Figure 17), the SAC policy maintains low metric values across all dimensions: standard deviations of 0.879, 1.055, and 1.397 for n x , n z , and γ , with an Energy Consumption of 9272.8 and a Control Smoothness of 0.4289. In contrast, the TD3 policy (Figure 20) shows degraded performance, characterized by elevated standard deviations (1.282, 4.036, 2.356 for n x , n z , and γ ), a substantial energy cost of 18935.0633, and a poor smoothness of 1.5675.
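For reference, the three command-level metrics reported above can be computed from a logged command sequence as in the following sketch; the exact logging format used in the experiments is not specified, so the array layout is an assumption.

```python
import numpy as np

def command_metrics(commands):
    """Per-trajectory maneuver-command statistics used in the comparison.

    commands: array of shape (T, 3) with columns [n_x, n_z, gamma].
    Returns per-channel standard deviations, the control-effort proxy
    sum(n_x^2 + n_z^2), and the mean absolute first difference of the
    command signals as the smoothness measure.
    """
    commands = np.asarray(commands, dtype=float)
    stds = commands.std(axis=0)                                  # per-channel std dev
    energy = np.sum(commands[:, 0] ** 2 + commands[:, 1] ** 2)   # control effort
    smoothness = np.mean(np.abs(np.diff(commands, axis=0)))      # mean |delta command|
    return stds, energy, smoothness
```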
It can be observed that, in complex environments with introduced observational noise, SAC-generated trajectories exhibit significantly smoother characteristics and a markedly higher mission success probability than TD3. SAC demonstrates consistently lower control signal variances, superior Control Smoothness, and lower Energy Consumption: its lower variances highlight robust generalization against noise-induced disturbances, and its enhanced smoothness and efficiency validate physically smoother, more energy-conservative trajectory generation. Conversely, TD3 exhibits higher variances, inferior smoothness, and reduced energy efficiency. The consistently lower values across these metrics confirm SAC’s robust generalization against noise-induced disturbances, which directly mitigates overfitting-like oscillatory behaviors observed in TD3.
This superiority stems from SAC’s maximum-entropy optimization framework, which incorporates a policy-entropy regularization term. This encourages stochastic exploration, preventing the policy from prematurely converging to a local optimum and thereby enhancing its robustness against disturbances such as sensor noise and environmental uncertainties. In contrast, TD3, as a deterministic policy algorithm, is advantageous in sample efficiency but suffers from its relatively fixed policy mode. This makes it prone to overfitting when confronted with observational noise, leading to unnecessary oscillations or deviations in its trajectory that ultimately compromise its effectiveness in the engagement. Therefore, the phenomena depicted in Figure 16 and Figure 21 should be interpreted as evidence of TD3’s algorithmic limitations in handling uncertainties, not as signs of overtraining; SAC’s consistent performance stems from its designed robustness, which effectively mitigates overfitting risks through stochastic exploration.
Figure 22, Figure 23 and Figure 24 depict the tail-chase trajectories when both sides start at similar altitudes. The results indicate that both strategies perform well, successfully defeating the opponent with relatively smooth trajectories. Figure 25, Figure 26, Figure 27, Figure 28, Figure 29 and Figure 30 illustrate the scenario where both parties start head-on, with our side having an altitude advantage. As shown by the corresponding dense reward function, the SAC strategy fails to achieve a favorable geometric angle between T 3 and T 4 , allowing the blue side to fire at us first. Figure 31, Figure 32, Figure 33, Figure 34, Figure 35 and Figure 36 represent the case of a head-on start with our side at an altitude disadvantage.
Here, the comparative analysis between Figure 33 and Figure 36 indicates that the TD3 strategy generates smoother trajectories than the SAC, with quantitative metrics further supporting this observation: lower standard deviations in most maneuver commands (specifically, 1.269 for n x , 0.672 for n z , and 1.182 for γ of the TD3 policy, compared with 1.047 for n x , 1.350 for n z , and 1.205 for γ of the SAC policy), lower Energy Consumption (3057.75 for TD3, 3544.70 for SAC), and superior Control Smoothness (0.0713 for TD3, 0.7472 for SAC). Additionally, the TD3 policy initiates attacks earlier and demonstrates a certain level of robustness.

5. Conclusions

This study addresses the challenge of autonomous maneuvering decision making for unmanned aerial vehicles (UAVs) by proposing a continuous action space decision-making framework based on the Soft Actor-Critic (SAC) algorithm. Through the construction of a simulation environment that incorporates three-dimensional kinematic constraints and a multi-dimensional situational reward mechanism, the framework achieves the generation of high-precision control commands, offering a valuable reference for research on autonomous behavior decision making of intelligent unmanned systems in dynamic environments.
The primary innovations of this work are manifested in two key aspects: first, the design of a comprehensive multi-dimensional reward function that integrates factors such as relative position, closure speed, altitude relationship, and interactive effectiveness, coupled with the introduction of a Health Point (HP) system for quantitative assessment of situational advantages, which enhances the strategy’s capability to perceive and respond to complex dynamic conditions; second, the application of the SAC algorithm with maximum entropy optimization, which facilitates the generation of smooth and reliable control commands while maintaining an equilibrium between exploration and exploitation, thereby improving the adaptability and stability of the decision-making system.
Despite these contributions, the current research exhibits certain limitations: the three-degrees-of-freedom point-mass model employed, while ensuring basic motion feasibility through overload constraints (e.g., the ranges of n x and n z ), does not fully account for higher-order dynamic characteristics such as aerodynamic coupling effects and actuator response delays. Consequently, the practical feasibility of the generated trajectories under extreme maneuvers warrants further validation. Future efforts will prioritize two directions: the development of a six-degrees-of-freedom high-fidelity simulation platform that incorporates aerodynamic parameters and system dynamics to enhance model reliability and trajectory authenticity and the extension to multi-agent collaborative decision-making scenarios, focusing on cooperative maneuvering strategies for multiple UAVs in complex environments to strengthen the algorithm’s practicality and scalability.

Author Contributions

H.Y.: conceptualization, methodology, validation, formal analysis, investigation, resources, writing—review and editing, supervision, project administration, funding acquisition; S.Q.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization; S.C.: validation, formal analysis, investigation, resources, writing—review and editing, supervision, funding acquisition; C.W.: validation, formal analysis, investigation, resources, writing—review and editing, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province, China, under Grant No. 2025JJ20055 and the National Natural Science Foundation of China under Grant No. 52475290 and 62403482.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

DURC Statement

Current research is limited to the field of unmanned aerial vehicles (UAVs), which is beneficial for advancing the intelligence and robustness of unmanned systems and does not pose a threat to public health or national security. Authors acknowledge the dual-use potential of the research involving autonomous UAV maneuvering algorithms and confirm that all necessary precautions have been taken to prevent potential misuse, including but not limited to conducting research in a controlled simulation environment without hardware implementation. As an ethical responsibility, authors strictly adhere to relevant national and international laws about DURC. Authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cao, S.; Wang, X.; Zhang, R.; Peng, Y.; Yu, H. Aerobatic Maneuvering flight control of fixed-wing UAVs: An SE (3) approach using dual quaternion. IEEE Trans. Ind. Electron. 2024, 71, 14362–14372. [Google Scholar] [CrossRef]
  2. Zhang, J.d.; Yu, Y.f.; Zheng, L.h.; Yang, Q.m.; Shi, G.q.; Wu, Y. Situational continuity-based air combat autonomous maneuvering decision-making. Def. Technol. 2023, 29, 66–79. [Google Scholar] [CrossRef]
  3. Li, Y.; Dong, W.; Zhang, P.; Zhai, H.; Li, G. Hierarchical reinforcement learning with automatic curriculum generation for unmanned combat aerial vehicle tactical decision-making in autonomous air combat. Drones 2025, 9, 384. [Google Scholar] [CrossRef]
  4. Zhu, Y.; Zheng, Y.; Wei, W.; Fang, Z. Enhancing Automated Maneuvering Decisions in UCAV Air Combat Games Using Homotopy-Based Reinforcement Learning. Drones 2024, 8, 756. [Google Scholar] [CrossRef]
  5. Yang, J.; Yang, X.; Yu, T. Multi-unmanned aerial vehicle confrontation in intelligent air combat: A multi-agent deep reinforcement learning approach. Drones 2024, 8, 382. [Google Scholar] [CrossRef]
  6. Zhang, T.; Wang, Y.; Sun, M.; Chen, Z. Air combat maneuver decision based on deep reinforcement learning with auxiliary reward. Neural Comput. Appl. 2024, 36, 13341–13356. [Google Scholar] [CrossRef]
  7. Fan, Z.; Xu, Y.; Kang, Y.; Luo, D. Air combat maneuver decision method based on A3C deep reinforcement learning. Machines 2022, 10, 1033. [Google Scholar] [CrossRef]
  8. Zheng, Z.; Duan, H. UAV maneuver decision-making via deep reinforcement learning for short-range air combat. Intell. Robot. 2023, 3, 76–94. [Google Scholar] [CrossRef]
  9. Yang, Q.; Zhu, Y.; Zhang, J.; Qiao, S.; Liu, J. UAV air combat autonomous maneuver decision based on DDPG algorithm. In Proceedings of the 2019 IEEE 15th International Conference on Control and Automation (ICCA), Edinburgh, UK, 16–19 July 2019; pp. 37–42. [Google Scholar]
  10. Gao, X.; Zhang, Y.; Wang, B.; Leng, Z.; Hou, Z. The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm. Drones 2024, 8, 501. [Google Scholar] [CrossRef]
  11. Buşoniu, L.; Rejeb, J.B.; Lal, I.; Morărescu, I.C.; Daafouz, J. Optimistic minimax search for noncooperative switched control with or without dwell time. Automatica 2020, 112, 108632. [Google Scholar] [CrossRef]
  12. Austin, F.; Carbone, G.; Falco, M.; Hinz, H.; Lewis, M. Automated maneuvering decisions for air-to-air combat. In Proceedings of the Guidance, Navigation and Control Conference, Monterey, CA, USA, 17–19 August 1987; p. 2393. [Google Scholar]
  13. Pope, A.P.; Ide, J.S.; Mićović, D.; Diaz, H.; Twedt, J.C.; Alcedo, K.; Walker, T.T.; Rosenbluth, D.; Ritholtz, L.; Javorsek, D. Hierarchical reinforcement learning for air combat at DARPA’s AlphaDogfight trials. IEEE Trans. Artif. Intell. 2022, 4, 1371–1385. [Google Scholar] [CrossRef]
  14. Seong, H.; Shim, D.H. TempFuser: Learning Agile, Tactical, and Acrobatic Flight Maneuvers Using a Long Short-Term Temporal Fusion Transformer. IEEE Robot. Autom. Lett. 2024, 9, 10803–10810. [Google Scholar] [CrossRef]
  15. Kong, W.; Zhou, D.; Du, Y.; Zhou, Y.; Zhao, Y. Reinforcement learning for multiaircraft autonomous air combat in multisensor UCAV platform. IEEE Sensors J. 2022, 23, 20596–20606. [Google Scholar] [CrossRef]
  16. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
  17. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
Figure 1. UAV 1v1 dynamic confrontation decision-making framework. The red UAV represents the friendly aircraft, and the blue UAV represents the enemy aircraft.
Figure 1. UAV 1v1 dynamic confrontation decision-making framework. The red UAV represents the friendly aircraft, and the blue UAV represents the enemy aircraft.
Drones 10 00035 g001
Figure 2. Schematic of the 3-DOF UAV dynamics model in ground coordinate systems and related angles.
Figure 2. Schematic of the 3-DOF UAV dynamics model in ground coordinate systems and related angles.
Drones 10 00035 g002
Figure 3. The 1v1 UAV confrontation situation.
Figure 3. The 1v1 UAV confrontation situation.
Drones 10 00035 g003
Figure 4. Normalized factors for HP update (corresponding to Equations (5)–(7)), illustrating the effect of angle and distance on HP decay.
Figure 4. Normalized factors for HP update (corresponding to Equations (5)–(7)), illustrating the effect of angle and distance on HP decay.
Drones 10 00035 g004
Figure 5. The Minimax decision-making policy. The arrows indicate the sequence of maneuver choices over discrete time steps. The nodes represent the decision states, where the colors (e.g., blue and red) distinguish the decision turns of the friendly and enemy UAVs, respectively.
Figure 5. The Minimax decision-making policy. The arrows indicate the sequence of maneuver choices over discrete time steps. The nodes represent the decision states, where the colors (e.g., blue and red) distinguish the decision turns of the friendly and enemy UAVs, respectively.
Drones 10 00035 g005
Figure 6. Sigmoid function under different α values.
Figure 6. Sigmoid function under different α values.
Drones 10 00035 g006
Figure 7. Relative position reward function R relpos (corresponding to Equation (10)), showing the impact of angle variation on reward values.
Figure 7. Relative position reward function R relpos (corresponding to Equation (10)), showing the impact of angle variation on reward values.
Drones 10 00035 g007
Figure 8. Distance attenuation factor S d , α 2 , d 0 in the closure speed reward function R closure q ¯ r , v c , d (corresponding to Equation (11)), illustrating the exponential decay of the reward component based on the inter-UAV distance.
Figure 8. Distance attenuation factor S d , α 2 , d 0 in the closure speed reward function R closure q ¯ r , v c , d (corresponding to Equation (11)), illustrating the exponential decay of the reward component based on the inter-UAV distance.
Drones 10 00035 g008
Figure 9. The closure speed reward function R_closure (corresponding to Equation (11)), showing the combined effect of the Aspect Angle, approach speed, and distance on the reward value.
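Figures 8 and 9 indicate that the closure-speed reward couples an aspect-angle term, the normalized closure speed, and a distance attenuation factor. The sketch below is purely illustrative of that coupling; the function names, the cosine weighting, and the sigmoid-style attenuation are assumptions rather than the paper's Equation (11), although α_2 = 25 and d_0 = 0.9 km are taken from Table 3.

```python
import math

def distance_attenuation(d, alpha2=25.0, d0=0.9e3):
    """Illustrative attenuation S(d, alpha2, d0): near 1 inside d0, decaying beyond it."""
    return 1.0 / (1.0 + math.exp(alpha2 * (d - d0) / d0))

def r_closure(aspect_angle, v_c, d, v_c_max=2 * 340.0):
    """Illustrative closure-speed reward: favour approaching (positive v_c)
    while roughly nose-on (small aspect angle) and at short range.
    v_c_max assumes ~340 m/s per Mach for the 2 Ma limit in Table 3."""
    return math.cos(aspect_angle) * (v_c / v_c_max) * distance_attenuation(d)

print(round(r_closure(aspect_angle=0.1, v_c=100.0, d=1.0e3), 4))
```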
Figure 10. The altitude reward function R_altitude (corresponding to Equation (12)), demonstrating the reward modulation based on the UAV’s flight altitude relative to the optimal height h_0.
Figure 11. The firing reward function R_ownfire and the coefficients Γ_R and Γ_B (corresponding to Equations (13), (19), and (20)), quantifying the offensive advantage and fire damage. In (b), the red line represents the friendly coefficient Γ_R, while the blue line represents the enemy coefficient Γ_B.
Figure 12. Schematic diagram of the Soft Actor-Critic (SAC) algorithm architecture and update process.
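Figure 12 depicts the actor, the twin critics, their target networks, and the entropy term. For reference, a standard statement of the SAC update (following the original algorithm; the paper's exact notation and entropy-coefficient handling may differ) is:

```latex
% Soft Bellman target with clipped double critics and an entropy bonus:
y = r + \gamma (1 - d)\left[ \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right],
\quad a' \sim \pi_\phi(\cdot \mid s')

% Critic loss for each Q-network and the actor (policy) loss:
J_Q(\theta_i) = \mathbb{E}\left[ \big( Q_{\theta_i}(s, a) - y \big)^2 \right], \qquad
J_\pi(\phi) = \mathbb{E}\left[ \alpha \log \pi_\phi(a \mid s) - \min_{i=1,2} Q_{\theta_i}(s, a) \right]
```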
Figure 13. Average returns for training episodes.
Figure 14. Initial tail-chase SAC trajectory with altitude advantage.
Figure 15. Initial tail-chase TD3 trajectory with altitude advantage.
Figure 16. TD3 initial tail-chase with altitude advantage and high noise observations.
Figure 17. Initial tail-chase SAC trajectory with altitude disadvantage.
Figure 18. SAC initial tail chase with altitude disadvantage and no noise observations.
Figure 19. SAC initial tail chase with altitude disadvantage and high noise observations.
Figure 20. Initial tail-chase TD3 trajectory with altitude disadvantage.
Figure 21. TD3 initial tail chase with altitude disadvantage and no noise observations.
Figure 22. Initial tail-chase SAC trajectory with balanced altitude.
Figure 23. Initial tail-chase TD3 trajectory with balanced altitude.
Figure 24. Comparison of maneuvering decision command sequences generated by SAC and TD3 agents under noise-free and high-noise observation conditions during an altitude-balanced tail-chase scenario.
Figure 25. Initial head-on SAC trajectory with altitude advantage.
Figure 26. SAC initial head-on with altitude advantage and no noise observations.
Figure 27. SAC initial head-on with altitude advantage and high noise observations.
Figure 28. Initial head-on TD3 trajectory with altitude advantage.
Figure 29. TD3 initial head-on with altitude advantage and no noise observations.
Figure 30. TD3 initial head-on with altitude advantage and high noise observations.
Figure 31. Initial head-on SAC trajectory with altitude disadvantage.
Figure 32. SAC initial head-on with altitude disadvantage and no noise observations.
Figure 33. SAC initial head-on with altitude disadvantage and high noise observations.
Figure 34. Initial head-on TD3 trajectory with altitude disadvantage.
Figure 35. TD3 initial head-on with altitude disadvantage and no noise observations.
Figure 36. TD3 initial head-on with altitude disadvantage and high noise observations.
Table 1. Basic maneuver primitive instruction set.

Action ID | Maneuver Name | Maneuver Primitive Command (n_x, n_z, γ)
1 | Constant Speed Forward | (0, 1, 0)
2 | Accelerated Forward | (n_x^max, 1, 0)
3 | Decelerated Forward | (n_x^min, 1, 0)
4 | Constant Speed Right Turn | (0, n_z^max, cos⁻¹(1/n_z^max))
5 | Accelerated Right Turn | (n_x^max, n_z^max, cos⁻¹(1/n_z^max))
6 | Decelerated Right Turn | (n_x^min, n_z^max, cos⁻¹(1/n_z^max))
7 | Constant Speed Left Turn | (0, n_z^max, −cos⁻¹(1/n_z^max))
8 | Accelerated Left Turn | (n_x^max, n_z^max, −cos⁻¹(1/n_z^max))
9 | Decelerated Left Turn | (n_x^min, n_z^max, −cos⁻¹(1/n_z^max))
10 | Constant Speed Climb | (0, n_z^max, 0)
11 | Accelerated Climb | (n_x^max, n_z^max, 0)
12 | Decelerated Climb | (n_x^min, n_z^max, 0)
13 | Constant Speed Dive | (0, n_z^min, 0)
14 | Accelerated Dive | (n_x^max, n_z^min, 0)
15 | Decelerated Dive | (n_x^min, n_z^min, 0)
16 | Constant Speed Right Climb | (0, n_z^max, cos⁻¹(c/n_z^max))
17 | Accelerated Right Climb | (n_x^max, n_z^max, cos⁻¹(c/n_z^max))
18 | Decelerated Right Climb | (n_x^min, n_z^max, cos⁻¹(c/n_z^max))
19 | Constant Speed Left Climb | (0, n_z^max, −cos⁻¹(c/n_z^max))
20 | Accelerated Left Climb | (n_x^max, n_z^max, −cos⁻¹(c/n_z^max))
21 | Decelerated Left Climb | (n_x^min, n_z^max, −cos⁻¹(c/n_z^max))
22 | Constant Speed Right Dive | (0, n_z^min, cos⁻¹(c/n_z^max))
23 | Accelerated Right Dive | (n_x^max, n_z^min, cos⁻¹(c/n_z^max))
24 | Decelerated Right Dive | (n_x^min, n_z^min, cos⁻¹(c/n_z^max))
25 | Constant Speed Left Dive | (0, n_z^min, −cos⁻¹(c/n_z^max))
26 | Accelerated Left Dive | (n_x^max, n_z^min, −cos⁻¹(c/n_z^max))
27 | Decelerated Left Dive | (n_x^min, n_z^min, −cos⁻¹(c/n_z^max))
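The primitive set in Table 1 maps naturally onto a lookup from a discrete action ID to an (n_x, n_z, γ) command, as a discrete-action opponent (e.g., the Minimax policy of Figure 5) would use. The sketch below covers only the first nine entries and assumes the load-factor limits of Table 3; the helper name `maneuver_command` is illustrative, not the paper's code.

```python
import math

# Illustrative load-factor limits taken from Table 3.
NX_MIN, NX_MAX = -1.0, 2.0
NZ_MAX = 6.0

def maneuver_command(action_id):
    """Map a discrete action ID from Table 1 to an (n_x, n_z, gamma) command.
    Only the forward/turn primitives are shown; climbs and dives follow the same pattern."""
    bank = math.acos(1.0 / NZ_MAX)            # roll angle for a level turn at n_z = n_z^max
    table = {
        1: (0.0,    1.0,    0.0),             # constant speed forward
        2: (NX_MAX, 1.0,    0.0),             # accelerated forward
        3: (NX_MIN, 1.0,    0.0),             # decelerated forward
        4: (0.0,    NZ_MAX,  bank),           # constant speed right turn
        5: (NX_MAX, NZ_MAX,  bank),           # accelerated right turn
        6: (NX_MIN, NZ_MAX,  bank),           # decelerated right turn
        7: (0.0,    NZ_MAX, -bank),           # constant speed left turn
        8: (NX_MAX, NZ_MAX, -bank),           # accelerated left turn
        9: (NX_MIN, NZ_MAX, -bank),           # decelerated left turn
    }
    return table[action_id]

print(maneuver_command(5))
```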
Table 2. Initial state information.

State | x (km) | y (km) | z (km) | v (m/s) | θ (rad) | ψ (rad) | HP | θ_L (rad) | ψ_L (rad)
Blue | 0 | 0 | [4, 5] | [180, 200] | 0 | [−π, π] | 100 | - | -
Red | - | - | - | [180, 200] | 0 | [π/3, 2π/3] | 100 | [−π/6, π/6] | [π/3, 2π/3]
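The ranges in Table 2 translate directly into the environment's reset routine. A minimal sketch for the blue UAV only; the signs of the angular intervals are reconstructed from context, and the red UAV's position, which the table leaves unspecified, is omitted.

```python
import numpy as np

rng = np.random.default_rng()

def sample_blue_initial_state():
    """Sample the blue UAV's initial state from the ranges in Table 2.
    Altitudes in km are converted to metres; angular bounds are as reconstructed."""
    return {
        "x": 0.0,
        "y": 0.0,
        "z": rng.uniform(4.0, 5.0) * 1e3,     # altitude in metres
        "v": rng.uniform(180.0, 200.0),       # speed in m/s
        "theta": 0.0,                         # flight-path angle
        "psi": rng.uniform(-np.pi, np.pi),    # heading angle
        "hp": 100.0,
    }

print(sample_blue_initial_state())
```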
Table 3. Simulation parameter settings.

Parameter | Value | Parameter | Value | Parameter | Value
v_min | 0.25 Ma | v_max | 1 Ma | δ_{x,y,z} | 1 × 10^4
θ_min | −π/2 | θ_max | π/2 | δ_v | 1 × 10^3
n_x^min | −1 | n_x^max | 2 | δ_HP | 200
n_z^min | −3 | n_z^max | 6 | d_0 | 0.9 km
D_min | 0.5 km | D_max | 1.2 km | h_0 | 5.5 km
φ_max | 15 deg | q_max | 30 deg | α_{1,2} | 25
h_min | 3 km | h_max | 8 km | α_{3,4} | 0.01
v_c^max | 2 Ma | z_min | 1 km | α_5 | 500
z_max | 10 km | lr | 2.5 × 10^−4 | α_{6,7} | 1/10
batch | 256 | buffer | 1 × 10^6 | α_8 | 1 × 10^3
γ | 0.99 | α_{9,10} | 1/15
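The learning rate, batch size, replay-buffer size, and discount factor in Table 3 correspond to standard constructor arguments of the Stable-Baselines3 SAC implementation [17]. A minimal sketch, assuming the custom Gym environment [16] is registered under the hypothetical ID `UavCombat1v1-v0` and that the total training budget is set separately:

```python
import gym
from stable_baselines3 import SAC

# Hypothetical ID for the custom 1v1 engagement environment.
env = gym.make("UavCombat1v1-v0")

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=2.5e-4,    # lr in Table 3
    buffer_size=1_000_000,   # replay buffer size
    batch_size=256,
    gamma=0.99,              # discount factor
    verbose=1,
)
model.learn(total_timesteps=1_000_000)  # training budget is not specified in Table 3
```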
Table 4. Evaluation comparison between SAC and TD3 with statistical significance.

Evaluation Comparison | Win Rate (95% CI) | Loss Rate (95% CI) | Tie Rate (95% CI) | Mean Returns | Variance of Returns | One-Sided Win Rate p-Value (SAC > TD3)
SAC with no noise input | 87% ± 6.6% | 1% ± 2.0% | 12% ± 6.4% | 266.5532 | 380.5362 | p = 0.0226
TD3 with no noise input | 76% ± 8.4% | 8% ± 5.3% | 16% ± 7.2% | 138.5664 | 530.2239 | -
SAC with low noise input | 85% ± 7.0% | 1% ± 2.0% | 14% ± 6.8% | 250.4452 | 420.4957 | p = 0.0385
TD3 with low noise input | 75% ± 8.5% | 8% ± 5.3% | 17% ± 7.4% | 133.1989 | 545.2348 | -
SAC with high noise input | 84% ± 7.2% | 0% | 16% ± 7.2% | 263.4485 | 450.7286 | p = 0.0203
TD3 with high noise input | 72% ± 8.8% | 6% ± 4.7% | 22% ± 8.1% | 72.4408 | 591.0407 | -
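The confidence intervals and p-values in Table 4 are consistent with a normal-approximation (Wald) interval and a pooled one-sided two-proportion z-test over roughly 100 evaluation episodes per condition (e.g., 1.96·√(0.87·0.13/100) ≈ 6.6%). The snippet below reproduces the noise-free row under that assumption; the episode count N = 100 is inferred, not stated in the table.

```python
import math

N = 100  # evaluation episodes per condition (inferred from the CI widths)

def wald_ci_halfwidth(p, n=N, z=1.96):
    """95% normal-approximation half-width for a win/loss/tie proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

def one_sided_two_proportion_p(p1, p2, n1=N, n2=N):
    """One-sided pooled z-test p-value for H1: p1 > p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability

print(round(wald_ci_halfwidth(0.87), 3))                 # ~0.066, matching 87% +/- 6.6%
print(round(one_sided_two_proportion_p(0.87, 0.76), 4))  # ~0.0226, matching Table 4
```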
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
