
Hierarchical Reinforcement Learning with Automatic Curriculum Generation for Unmanned Combat Aerial Vehicle Tactical Decision-Making in Autonomous Air Combat

1 Aviation Engineering School, Air Force Engineering University, Xi’an 710038, China
2 Scientific Research and Academic Division, Air Force Engineering University, Xi’an 710038, China
3 The School of Mechano-Electronic Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(5), 384; https://doi.org/10.3390/drones9050384
Submission received: 31 March 2025 / Revised: 27 April 2025 / Accepted: 28 April 2025 / Published: 21 May 2025
(This article belongs to the Collection Drones for Security and Defense Applications)

Abstract

This study proposes an unmanned combat aerial vehicle (UCAV)-oriented hierarchical reinforcement learning framework to address the temporal abstraction challenge in autonomous within-visual-range air combat (WVRAC). The incorporation of maximum-entropy objectives within the maximum-entropy option learning (MEOL) framework facilitates the optimization of both autonomous low-level tactical discovery and high-level option selection. At the low level, three tactical policies (angle, snapshot, and energy tactics) are designed with reward functions informed by expert knowledge, while the high-level policy dynamically terminates the current tactic and selects a new one through sparse reward learning, thus overcoming the limitations of fixed-duration tactical execution. Furthermore, a novel automatic curriculum generation mechanism based on Wasserstein Generative Adversarial Networks (WGANs) is introduced to enhance training efficiency and adaptability to diverse initial combat conditions. Extensive experiments in UCAV air combat simulations show that MEOL not only achieves significantly higher win rates than competing policies when training against rule-based opponents, but also achieves superior results in tests against the tactical intra-option policies as well as other option learning policies. The framework facilitates dynamic termination and switching of tactics, thereby addressing the limitations of fixed-duration hierarchical methods. Ablation studies confirm the effectiveness of WGAN-based curricula in accelerating policy convergence. Additionally, visual analysis of the UCAVs’ flight logs validates the learned hierarchical decision-making process, showcasing the interplay between tactical selection and manoeuvring execution. This research provides a novel methodology combining hierarchical reinforcement learning with tactical domain knowledge for the autonomous decision-making of UCAVs in complex air combat scenarios.

1. Introduction

The rapid evolution of UCAV platforms has intensified the complexity of autonomous air combat decision-making, where real-time tactical generation and agile manoeuvring control are critical for UCAVs to dominate WVRAC engagements. Autonomous and intelligent decision-making is expected to assist, or even substitute for, human fighter pilots, and is therefore of great significance for prospective air combat.
Accordingly, a plethora of studies have concentrated on autonomous and intelligent decision-making for UCAVs in air combat scenarios. Rule-based expert systems [1] first merged prior knowledge into scripts for situation orientation and manoeuvring procedures. The method in [2] then converted air combat decisions into optimization problems based on key geometric metrics of the air combat scenario, which laid the groundwork for optimization methods including greedy algorithms [3], genetic algorithms [4], and approximate dynamic programming (ADP) [5,6,7,8]. However, these methods are limited by the complexity of modelling air combat dynamics. More recently, model-free reinforcement learning (RL) has proven capable of solving model-agnostic decision and control problems through trial and error. Building on the same optimization objective, significant progress has been made in applying RL [9,10] to high-fidelity air combat environments, where the proposed methods learn to manoeuvre and control a fighter aircraft with a six-degree-of-freedom (6-DoF) nonlinear dynamic model.
Despite the advances made in reinforcement learning for UCAV air combat, decision-making over tactical selection and manoeuvring control, which is intrinsically hierarchical, still remains underexplored. Inherently, a human fighter pilot continuously repeats the following procedure: (1) the pilot selects a tactic by observing the situation at a given moment and then (2) executes the selected tactic by performing basic fighter manoeuvring (BFM) over a period of time [11]. Tactical decision-making and manoeuvre execution thus operate at different temporal granularities within a hierarchical decision structure. This is known as temporal abstraction [12,13,14,15,16], and hierarchical reinforcement learning (HRL) offers a promising solution to address it.
Furthermore, semi-Markov decision processes (SMDPs) [17,18], in which the decision condition and timescale are generalized to varying cases, theoretically formalize temporal abstraction for HRL. Options, defined as temporally extended actions, allow an MDP augmented with a set of options to be treated as an SMDP, over which both the options and the primitive actions selected within them can be optimized [15]. Following the RL paradigm, option learning jointly optimizes option discovery (the low-level policy for actions or skills) and option utilization (the high-level policy over options).
Likewise, attempts at applying option learning to UCAV air combat problems have been thriving. Such research revolves around three questions in air combat decision-making: (1) when to terminate the present tactic; (2) which discrete, new tactic to select; and (3) how to execute the selected tactic with continuous manoeuvring control over a period of time. Pope et al. [19] applied the soft actor–critic (SAC) [20] to train both levels, where the high-level policy switches subpolicies at a fixed timestep interval and the three subpolicies correspond to conservative, aggressive, and control-zone behaviours. Likewise, Kong et al. [21] designed low-level policies for offensive, aggressive, and defensive subtasks, with the termination condition of options handcrafted from key geometric situations. Selmonaj et al. [22] trained fight and escape subpolicies with proximal policy optimization (PPO) in a heterogeneous policy structure. For the generation of specific tactics, the above research consistently separates option discovery from the hierarchical optimization of the option framework by relying on standard RL baselines. Moreover, the termination condition in these methods is predefined, which excludes the question of when to terminate the current option from the hierarchical optimization. With regard to the random and unpredictable initial engagement situation, which consists of range and geometric angles, predefined curriculum learning [10,19] is utilized to ensure adaptability and robustness under the assumption that combat difficulty increases linearly with the initial condition. Nevertheless, this assumption naturally restricts skill discovery. Hence, the questions concerning hierarchical optimization for UCAV air combat require further exploration.
In this study, we extend our proposed method, the maximum-entropy option learning framework (MEOL) [23], to address UCAVs’ temporal abstraction in tactical decision-making and manoeuvring control for within-visual-range air combat (WVRAC) scenarios. The approach jointly governs the discovery and utilization of options under entropy regularization of the hierarchical policies, and its convergence has been established [23]. Within this framework, option discovery characterizes the low-level policy as the three typical tactics [11]: the angle, snapshot, and energy tactics. Specifically, the angle tactic is committed to achieving a geometric advantage over the adversary fighter, characterized by manoeuvring into its rear quarter to “bite the bandit”. The snapshot tactic prefers to snatch every opportunity as long as the bandit is targeted in the weapon engagement zone (WEZ). The energy tactic aims to strike a balance between geometric and energy advantages. Accordingly, we design the reward functions for the angle, snapshot, and energy tactics based on expert knowledge [11] to guide the optimization of option discovery. To adapt to varied initial conditions, we propose a novel automatic curriculum generation method to enhance the learning efficiency of option discovery, in which WGANs [24] are introduced to learn and generate intermediate-difficulty curricula (initial conditions). An ablation study of the automatic curriculum generation process demonstrates that it accelerates convergence of model training to a stable success rate. After the three low-level policies are trained, option utilization is learned with sparse rewards to both terminate the current tactic and select a new one at the high level. Two comparative experiments demonstrate the efficacy of MEOL in utilizing the three low-level tactics to achieve superior win rates. The contributions of this study are as follows:
  • The training of UCAV agents is conducted within a hierarchical structure under the MEOL framework. Option discovery enables manoeuvring control of fighters with nonlinear flight dynamic models in a high-fidelity air combat simulation. Option utilization not only selects the most appropriate option but also extends the fixed tactic duration of [19,21,25] to a variable term.
  • An end-to-end automatic curriculum generation method is proposed that enhances learning efficiency in option discovery and adaptability to random initial situations. A WGAN is trained to propose feasible initial conditions for option discovery.
  • The reward functions for the three types of tactical intra-options (angle, snapshot, and energy tactics) are shaped based on expert knowledge, and a tactical option-selection reward function is further designed. The final win rate against the rule-based adversary fighter is approximately 0.8, and the comparison tests show the resulting policy to be significantly better than the three tactical intra-option policies.
  • The UCAVs’ flight logs are analysed to provide a visual account of combat against a variety of adversaries, explaining the skills that have been acquired and the manner in which they are employed.
The remainder of this paper is organized as follows: Section 2 introduces the fundamentals of UCAV air combat and the preliminaries of reinforcement learning. Section 3 presents the automatic curriculum generation algorithm based on WGANs and the MEOL algorithm. Section 4 describes the application of MEOL to WVRAC. The experiments are detailed in Section 5. Finally, Section 6 concludes the paper.

2. Preliminaries and Problem Formulation

In this section, we provide the theoretical basis of SMDPs with options together with the basic approximate optimization algorithm, and delineate the basic geometry of air combat situations as well as the weaponry simulation.

2.1. Semi-Markov Decision Process with Options

According to Theorem 1 in [15], any MDP with a fixed set of Markovian options constitutes an SMDP. The option framework is defined as a tuple $(\mathcal{S}, \Omega, \mathcal{A}, p, r)$, where $s \in \mathcal{S}$, $\omega \in \Omega$, $a \in \mathcal{A}$, the reward function is $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and the transition function is $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathbb{R}^{+} \to [0, 1]$. The temporal abstraction mechanism consists of an option-selection policy $\pi_{\mu}: \mathcal{S} \times \Omega \to P(\Omega)$, an intra-option policy $\pi_{U}: \mathcal{S} \times \Omega \to P(\mathcal{A})$, and a termination function $\beta: \mathcal{S} \times \Omega \to [0, 1]$, where $P(\Omega)$ and $P(\mathcal{A})$ denote the sets of probability distributions over the option set $\Omega$ and the action set $\mathcal{A}$, respectively. The optimal values are characterized by the Bellman equation
$$U^{*}(s', \omega) = (1 - \beta(s', \omega)) \, Q_{\Omega}^{*}(s', \omega) + \beta(s', \omega) \max_{\omega' \in \Omega} Q_{\Omega}^{*}(s', \omega'),$$
where
$$Q_{\Omega}^{*}(s, \omega) = \mathbb{E}_{a \sim \pi(\cdot \mid s, \omega)} \left[ Q_{U}^{*}(s, \omega, a) \right],$$
$$Q_{U}^{*}(s, \omega, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} \left[ U^{*}(s', \omega) \right],$$
and $U: \mathcal{S} \times \Omega \to \mathbb{R}$ is the option-value function upon arrival, $Q_{\Omega}: \mathcal{S} \times \Omega \to \mathbb{R}$ is the option-value function, and $Q_{U}: \mathcal{S} \times \Omega \times \mathcal{A} \to \mathbb{R}$ is the value function for executing an action in the context of a state–option pair, hereafter called the state–option–action value function. The termination function $\beta$ allows the executed option to be interrupted, which yields a variable number of timesteps for operating a subpolicy, and it lets the choice of termination moment enter the overall optimization. In summary, this study integrates the option termination function $\beta$ and the option-selection policy $\pi_{\mu}$ and redefines the option strategy $\pi_{\Omega}: \Omega \times \mathcal{S} \times \Omega \to [0, 1]$ as
$$\pi_{\Omega}(\omega_{t} \mid s_{t}, \omega_{t-1}) = (1 - \beta(s_{t}, \omega_{t-1})) \, \mathbb{1}_{\omega_{t} = \omega_{t-1}} + \beta(s_{t}, \omega_{t-1}) \, \pi_{\mu}(\omega_{t} \mid s_{t}),$$
where $\mathbb{1}_{\omega_{t} = \omega_{t-1}}$ is an indicator function that equals 1 when $\omega_{t} = \omega_{t-1}$.
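To make the composition of the redefined option strategy concrete, the following minimal Python sketch samples an option from it; the termination function `beta_fn` and the option-selection policy `pi_mu` are placeholders for learned networks and are not part of the original implementation.

```python
import numpy as np

def sample_option(s_t, prev_option, beta_fn, pi_mu, rng=None):
    """Sample omega_t from the redefined option strategy pi_Omega above.

    beta_fn(s, omega) -> termination probability in [0, 1]   (placeholder)
    pi_mu(s)          -> probability vector over the option set Omega (placeholder)
    """
    if rng is None:
        rng = np.random.default_rng()
    # With probability 1 - beta the previous option continues unchanged.
    if rng.random() >= beta_fn(s_t, prev_option):
        return prev_option
    # Otherwise a new option is drawn from the option-selection policy pi_mu.
    probs = pi_mu(s_t)
    return int(rng.choice(len(probs), p=probs))
```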

2.2. Intra-Option Q-Learning

In intra-option Q-learning, the optimal option-value function upon arrival can be solved iteratively via the Bellman Equation (1). It is approximated by applying an off-policy one-step temporal difference, where the option-value function $Q_{\Omega}^{\phi_{\Omega}}$ is modelled with parameters $\phi_{\Omega}$. $Q_{\Omega}^{\phi_{\Omega}}$ is optimized by minimizing the squared residual error over all states and options. As a result, $\phi_{\Omega}$ is updated as
$$\phi_{\Omega} \leftarrow \phi_{\Omega} + \xi_{\phi_{\Omega}} \, \hat{\nabla}_{\phi_{\Omega}} Q_{\Omega}^{\phi_{\Omega}}(s_{t}, \omega_{t}) \left( Q_{\Omega}^{\bar{\phi}_{\Omega}}(s_{t}, \omega_{t}) - Q_{\Omega}^{\phi_{\Omega}}(s_{t}, \omega_{t}) \right),$$
where
$$Q_{\Omega}^{\bar{\phi}_{\Omega}}(s_{t}, \omega_{t}) = r(s_{t}, a_{t}) + \gamma \, U^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t}),$$
and $U^{\bar{\phi}_{\Omega}}$ is approximated as
$$U^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t}) = (1 - \beta(s_{t+1}, \omega_{t})) \, Q_{\Omega}^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t}) + \beta(s_{t+1}, \omega_{t}) \max_{\omega' \in \Omega} Q_{\Omega}^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega').$$
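As a brief illustration, the Python sketch below computes the one-step intra-option TD target described above; `q_target` and `beta_fn` stand in for the target option-value network and the termination function, and `num_options` is the size of the option set, all introduced here purely for illustration.

```python
def intra_option_td_target(r_t, s_next, omega_t, num_options, q_target, beta_fn, gamma=0.99):
    """One-step intra-option TD target for the option-value function above.

    q_target(s, omega) -> scalar estimate under the target parameters (placeholder)
    beta_fn(s, omega)  -> termination probability (placeholder)
    num_options        -> size of the option set Omega
    """
    beta = beta_fn(s_next, omega_t)
    # U upon arrival: either continue the current option or switch to the best one.
    continue_value = q_target(s_next, omega_t)
    switch_value = max(q_target(s_next, w) for w in range(num_options))
    u_next = (1.0 - beta) * continue_value + beta * switch_value
    return r_t + gamma * u_next
```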

2.3. Within-Visual-Range Air Combat

WVRAC, otherwise referred to as dogfighting, denotes close-range tactical engagements between aircraft within visual contact. The tactical objective of WVRAC is to establish an effective gun-firing envelope. In contrast to beyond-visual-range air combat, which relies more on radar and weapon performance, WVRAC imposes more stringent requirements on the manoeuvring capabilities of UCAVs. This study explores the generation mechanism of autonomous combat strategies for UCAV agents in gun-based WVRAC.
In the context of geometric representation, a three-dimensional spatial geometric relationship model for two UCAVs is established, as illustrated in Figure 1. The ground coordinate system is defined as the reference frame, wherein the adversary fighter and self-fighter are designated by subscripts $a$ and $s$, respectively. The position vector in the local coordinate system is denoted $\mathbf{p}$ and the velocity vector $\mathbf{v}$. The vector $\mathbf{e}$ is the normalized X-axis vector of the body coordinate system. The line of sight $\overrightarrow{LOS}$ is the vector from the centroid of the self-fighter to the centroid of the adversary fighter, and its magnitude is the distance $Range$ between the two aircraft. The rate of change of $Range$ is defined as the proximity rate $Proximity$. The Aspect Angle ($AA$) is the angle between the heading vector of the adversary fighter and the $\overrightarrow{LOS}$, while the Antenna Train Angle ($ATA$) is the angle between the heading vector of the self-fighter and the $\overrightarrow{LOS}$. The Head Cross Angle ($HCA$) is the angle between the heading vectors of the self-fighter and the adversary fighter. Note that $AA$, $ATA$, and $HCA$ all lie in $[0, \pi]$. These parameters are calculated as follows:
$$\begin{aligned}
\overrightarrow{LOS} &= \mathbf{p}_{a} - \mathbf{p}_{s}, \\
Range &= \lVert \overrightarrow{LOS} \rVert, \\
Proximity &= \frac{(\mathbf{v}_{a} - \mathbf{v}_{s}) \cdot \overrightarrow{LOS}}{\lVert \overrightarrow{LOS} \rVert}, \\
ATA &= \arccos \frac{\mathbf{e}_{s} \cdot \overrightarrow{LOS}}{\lVert \overrightarrow{LOS} \rVert}, \\
AA &= \arccos \frac{\mathbf{e}_{a} \cdot \overrightarrow{LOS}}{\lVert \overrightarrow{LOS} \rVert}, \\
HCA &= \arccos (\mathbf{e}_{a} \cdot \mathbf{e}_{s}).
\end{aligned}$$
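The geometry above maps directly onto a few lines of vector arithmetic. The NumPy sketch below implements these formulas; the function name and argument layout are our own, and the clipping merely guards the arccos calls against floating-point round-off.

```python
import numpy as np

def combat_geometry(p_s, v_s, e_s, p_a, v_a, e_a):
    """Compute Range, Proximity, AA, ATA, and HCA from raw kinematic states.

    p_*: position vectors, v_*: velocity vectors, e_*: unit body X-axis vectors,
    with subscripts s/a for the self- and adversary fighter (numpy arrays of shape (3,)).
    """
    los = p_a - p_s
    range_m = np.linalg.norm(los)
    proximity = np.dot(v_a - v_s, los) / range_m          # range rate
    # Clipping guards arccos against floating-point round-off.
    ata = np.arccos(np.clip(np.dot(e_s, los) / range_m, -1.0, 1.0))
    aa = np.arccos(np.clip(np.dot(e_a, los) / range_m, -1.0, 1.0))
    hca = np.arccos(np.clip(np.dot(e_a, e_s), -1.0, 1.0))
    return range_m, proximity, aa, ata, hca
```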

2.4. Weapon Engagement Zone

The weapon engagement zone (WEZ) is predicated on the principles of WVRAC and the ballistic characteristics of the aerial gun. When the target distance exceeds the maximum effective striking distance of the gun, the bullet’s kinetic energy is attenuated to the extent that it can no longer produce a lethal effect on the adversary fighter. Conversely, when the distance is less than the minimum safe firing distance, the weapon system triggers a firing-inhibit logic to avoid secondary damage to the UCAV caused by splash-back of bullet fragments. The maximum effective striking distance of the gun was defined as 1000 m, while the minimum safe firing distance was set at 150 m. The effective range of the aerial gun was simulated through the construction of a three-dimensional conical weapon attack area (illustrated in Figure 2). The targetable angle range is constrained to the three-dimensional angular domain within ±15° of the nose axis. When an adversary fighter enters the WEZ, the system initiates the virtual fire projection process.
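A minimal membership test for the conical gun WEZ described above might look as follows; the thresholds mirror the 150 m/1000 m range limits and the ±15° cone stated in the text, while the function name and interface are illustrative.

```python
import numpy as np

def in_wez(range_m, ata_rad, r_min=150.0, r_max=1000.0, cone_half_angle_deg=15.0):
    """Return True when the adversary lies inside the conical gun WEZ described above."""
    within_range = r_min <= range_m <= r_max
    within_cone = np.degrees(ata_rad) <= cone_half_angle_deg
    return bool(within_range and within_cone)
```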

3. Methods

According to [23], MEOL applies a maximum-entropy objective to both levels of the option framework and achieves efficient automatic skill discovery and selection in complex continuous tasks. In this section, we decouple MEOL into option discovery and option utilization to generate particular tactics and selection strategies. With reward shaping, the former learns tactical manoeuvring (the angle, snapshot, and energy tactics); a WGAN-based method is employed to automatically generate curricula that assist MEOL in option discovery. The latter learns to terminate the current tactic and select a new one.

3.1. Automatic Curriculum Generation with WGANs for Option Discovery

We build upon the concepts outlined in [26,27] by implementing Wasserstein Generative Adversarial Networks (WGANs) [24] to identify a sequence of learnable tasks for policy training. To address the critical challenge of mode collapse inherent in traditional GAN frameworks [28], we leverage the theoretical advantages of Wasserstein distance optimization with gradient penalty regularization [29]. This approach ensures stable training dynamics while maintaining high sample diversity through Lipschitz-constrained adversarial learning.

3.1.1. Labelling Positive Initial Condition

The initial engagement situation can be continuously parameterized as an initial condition $x \in I \subseteq \mathbb{R}^{d}$, and the continuously parameterized tasks over the universe of initial situations are denoted $T(x)$. We define the intermediate-difficulty set $I_{feasible, i}$ as
$$I_{feasible, i} = \left\{ x : l_{win, min} \le l_{win, \pi_{U, i}}, \; l_{win, \pi_{U, i}} \le l_{win, max} \right\} \subseteq I,$$
where $l_{win, min}$ and $l_{win, max}$ are the boundary values defining the win-rate interval for moderately difficult tasks, and $l_{win, \pi_{U, i}}$ is the average win rate of policy $\pi_{U, i}$ under repeated engagements on this curriculum.
This procedure follows [27]: we screen the curriculum samples in the air combat task by presetting a win-rate interval and train the generative model to produce curricula of moderate difficulty, thus reducing the training difficulty of reinforcement learning.
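A minimal sketch of this screening step is given below; `win_rate_fn` is a placeholder for the roll-out evaluator, and the bounds 0.3/0.7 are illustrative defaults rather than the values used in the paper.

```python
def label_feasible_conditions(conditions, win_rate_fn, l_min=0.3, l_max=0.7):
    """Keep initial conditions whose empirical win rate is of intermediate difficulty.

    conditions : iterable of initial-condition vectors x, e.g. (Range, AA, ATA)
    win_rate_fn: placeholder evaluator returning the average win rate of the current
                 intra-option policy over repeated roll-outs from x
    l_min/l_max: the bounds l_win,min and l_win,max (0.3/0.7 are illustrative only)
    """
    return [x for x in conditions if l_min <= win_rate_fn(x) <= l_max]
```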

3.1.2. Wasserstein Generative Adversarial Networks

According to [24], the adversarial training framework consists of a generator $G_{\psi}$ and a critic $D_{\phi}$ that optimize the Wasserstein distance through a minimax game. The objective function is defined as
$$\min_{\psi} \max_{\phi} \; \mathbb{E}_{x \sim P_{r}} \left[ D_{\phi}(x) \right] - \mathbb{E}_{z \sim p(z)} \left[ D_{\phi}(G_{\psi}(z)) \right],$$
where $z$ is sampled from the latent distribution $p(z)$ and $P_{r}$ represents the real data distribution. The critic is constrained to be 1-Lipschitz through the gradient penalty [29]:
$$L_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}} \left[ \left( \lVert \nabla_{\hat{x}} D_{\phi}(\hat{x}) \rVert_{2} - 1 \right)^{2} \right].$$
For curriculum generation, we design the generator to produce environment parameters $x_{0} = G_{\psi}(z)$ with latent codes $z \sim \mathcal{N}(0, I)$. The quality of the generated environments is assessed through policy optimization:
$$L_{env} = \mathbb{E}_{z \sim p(z)} \left[ R(\pi_{\theta^{*}}, G_{\psi}(z)) \right],$$
where $R$ represents the winning rate of the optimized policy $\pi_{\theta^{*}}$ from Algorithm 1. The complete optimization objective combines adversarial training with curriculum learning:
$$L_{total} = \mathbb{E} \left[ D_{\phi}(x_{real}) \right] - \mathbb{E} \left[ D_{\phi}(G_{\psi}(z)) \right] + \gamma L_{env} + L_{GP}.$$
The implementation of the algorithm for automatic curriculum generation for option discovery with WGANs is shown in Algorithm 2.
Algorithm 1 Option discovery within MEOL framework for subpolicy $\omega_{i}$
1: Initialize the parameterized air combat environment with initial condition $p$
2: Initialize parameter vectors $\theta_{U, \omega_{i}}$, $\phi_{U, \omega_{i}}$, $\alpha_{U, \omega_{i}}$
3: Assign target network $\bar{\phi}_{U, \omega_{i}} \leftarrow \phi_{U, \omega_{i}}$
4: Initialize learning rates $\xi_{\phi_{U, \omega_{i}}}$, $\xi_{\theta_{U, \omega_{i}}}$, $\xi_{\alpha_{U, \omega_{i}}}$ and coefficient $\sigma_{U, \omega_{i}}$
5: for each batch do
6:     Select an action $a_{t} \sim \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t} \mid s_{t}, \omega_{i})$
7:     Observe reward $r_{\omega_{i}, t+1}$ and $s_{t+1}$
8:     Store transition $(s_{t}, a_{t}, r_{\omega_{i}, t+1}, s_{t+1})$ in $\mathcal{D}$
9:     Sample a training batch of transitions from $\mathcal{D}$
10:    Update $\phi_{U, \omega_{i}}$ according to (14)
11:    Update $\theta_{U, \omega_{i}}$ according to (17)
12:    Update target network $\bar{\phi}_{U, \omega_{i}} \leftarrow \sigma_{U, \omega_{i}} \phi_{U, \omega_{i}} + (1 - \sigma_{U, \omega_{i}}) \bar{\phi}_{U, \omega_{i}}$
13:    Update $\alpha_{U, \omega_{i}}$ according to (19)
14: end for
Algorithm 2 Wasserstein Curriculum Generation for Option Discovery
1: Initialize generator $G_{\psi}$ and critic $D_{\phi}$
2: Initialize latent dimension $d_{z}$ and gradient penalty coefficient $\lambda$
3: Initialize learning rates $\xi_{\psi}$, $\xi_{\phi}$
4: for each training epoch do
5:     for $k$ critic steps do
6:         Sample real environments $\{ x^{(i)} \}_{i=1}^{m} \sim P_{r}$
7:         Sample latent codes $\{ z^{(i)} \}_{i=1}^{m} \sim p(z)$
8:         Generate synthetic environments $\tilde{x}^{(i)} = G_{\psi}(z^{(i)})$
9:         Compute gradient penalty $L_{GP}$
10:        Update critic: $\phi \leftarrow \phi - \xi_{\phi} \nabla_{\phi} L_{total}$
11:    end for
12:    Sample new latent codes $\{ z^{(j)} \}_{j=1}^{n} \sim p(z)$
13:    for each generated environment $x_{0}^{(j)} = G_{\psi}(z^{(j)})$ do
14:        Initialize the parameterized environment with $x_{0}^{(j)}$
15:        Execute Algorithm 1 for intra-option policy optimization
16:        Evaluate winning rate $R(\pi_{\theta^{*}}, x_{0}^{(j)})$
17:    end for
18:    Compute environment quality reward $L_{env}$
19:    Update generator: $\psi \leftarrow \psi - \xi_{\psi} \nabla_{\psi} L_{total}$
20: end for
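For reference, the following PyTorch sketch shows a Wasserstein critic loss with gradient penalty of the kind used in the inner loop of Algorithm 2 (steps 5–11); `critic` is a placeholder network over initial-condition vectors, and this is a generic WGAN-GP formulation rather than the authors’ exact code.

```python
import torch

def critic_loss_with_gp(critic, real_x, fake_x, gp_lambda=10.0):
    """Wasserstein critic loss with gradient penalty (generic WGAN-GP form).

    critic : torch.nn.Module mapping condition vectors to scalar scores (stand-in for D_phi)
    real_x : batch of labelled initial conditions, shape (m, d)
    fake_x : batch of generated conditions G_psi(z), shape (m, d)
    """
    # Interpolate between real and generated samples for the gradient penalty.
    eps = torch.rand(real_x.size(0), 1, device=real_x.device)
    x_hat = (eps * real_x + (1.0 - eps) * fake_x).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = gp_lambda * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
    # The critic maximizes E[D(x_real)] - E[D(G(z))]; minimizing the negation plus
    # the penalty is the equivalent descent form used with loss.backward().
    return critic(fake_x).mean() - critic(real_x).mean() + gp
```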

3.2. Option Discovery Within MEOL Framework

When training the low-level policy, each option $\omega_{i}$ is fixed as the identifier of a particular tactical intra-option policy. For each option (tactic) $\omega_{i}$, the parameters $\phi_{U, \omega_{i}}$ of the soft option-action value function $Q_{soft, U}^{\phi_{U, \omega_{i}}}$ are updated as
$$\phi_{U, \omega_{i}} \leftarrow \phi_{U, \omega_{i}} - \xi_{\phi_{U, \omega_{i}}} \, \nabla_{\phi_{U, \omega_{i}}} Q_{soft, U}^{\phi_{U, \omega_{i}}}(s_{t}, \omega_{i}, a_{t}) \left( Q_{soft, U}^{\bar{\phi}_{U, \omega_{i}}}(s_{t}, \omega_{i}, a_{t}) - Q_{soft, U}^{\phi_{U, \omega_{i}}}(s_{t}, \omega_{i}, a_{t}) \right),$$
with the target parameters $\bar{\phi}_{U, \omega_{i}}$ giving
$$Q_{soft, U}^{\bar{\phi}_{U, \omega_{i}}}(s_{t}, \omega_{i}, a_{t}) = r_{\omega_{i}}(s_{t}, a_{t}) + \gamma \, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_{t}, a_{t})} \left[ U_{soft}^{\bar{\phi}_{U, \omega_{i}}}(s_{t+1}, \omega_{i}) \right],$$
where
$$U_{soft}^{\bar{\phi}_{U, \omega_{i}}}(s_{t+1}, \omega_{i}) = \mathbb{E}_{a_{t+1} \sim \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t+1} \mid s_{t+1}, \omega_{i})} \left[ Q_{soft, U}^{\bar{\phi}_{U, \omega_{i}}}(s_{t+1}, \omega_{i}, a_{t+1}) - \alpha_{U, \omega_{i}} \log \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t+1} \mid s_{t+1}, \omega_{i}) \right].$$
The parameters $\theta_{U, \omega_{i}}$ of the intra-option policy $\pi_{U}^{\theta_{U, \omega_{i}}}$ are updated as
$$\theta_{U, \omega_{i}} \leftarrow \theta_{U, \omega_{i}} - \xi_{\theta_{U, \omega_{i}}} \, \hat{\nabla}_{\theta_{U, \omega_{i}}} J_{\pi_{U}}(\theta_{U, \omega_{i}}),$$
where
$$\hat{\nabla}_{\theta_{U, \omega_{i}}} J_{\pi_{U}}(\theta_{U, \omega_{i}}) = \nabla_{\theta_{U, \omega_{i}}} \alpha_{U, \omega_{i}} \log \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t} \mid s_{t}, \omega_{i}) + \left( \nabla_{a_{t}} \alpha_{U, \omega_{i}} \log \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t} \mid s_{t}, \omega_{i}) - \nabla_{a_{t}} Q_{soft, U}^{\phi_{U, \omega_{i}}}(s_{t}, \omega_{i}, a_{t}) \right) \nabla_{\theta_{U, \omega_{i}}} f^{\theta_{U, \omega_{i}}}(\epsilon_{t}; s_{t}, \omega_{i}).$$
The coefficient $\alpha_{U, \omega_{i}}$ regulating the entropy of the intra-option policy is adjusted as
$$\alpha_{U, \omega_{i}} \leftarrow \alpha_{U, \omega_{i}} - \xi_{\alpha_{U, \omega_{i}}} \, \hat{\nabla}_{\alpha_{U, \omega_{i}}} J_{\bar{H}}(\alpha_{U, \omega_{i}}),$$
in which
$$J_{H_{U}}(\alpha_{U, \omega_{i}}) = \mathbb{E}_{a_{t} \sim \pi_{U}^{\theta_{U, \omega_{i}}}} \left[ -\log \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t} \mid s_{t}, \omega_{i}) \right] - \bar{H}_{U, \omega_{i}}.$$
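The temperature adjustment mirrors the standard SAC-style update. The PyTorch sketch below is a minimal illustration under that assumption; `log_alpha` is an optimizable scalar tensor, `log_probs` holds sampled log-probabilities of the intra-option policy, and `target_entropy` plays the role of $\bar{H}_{U, \omega_{i}}$.

```python
import torch

def update_temperature(log_alpha, log_probs, target_entropy, optimizer):
    """SAC-style adjustment of the entropy coefficient for one intra-option policy.

    log_alpha     : scalar torch tensor with requires_grad=True (alpha = exp(log_alpha))
    log_probs     : log pi_U(a_t | s_t, omega_i) for a sampled batch (tensor)
    target_entropy: the desired entropy H_bar for this option
    optimizer     : a torch optimizer over [log_alpha]
    """
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    optimizer.zero_grad()
    alpha_loss.backward()
    optimizer.step()
    return log_alpha.exp().item()
```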

3.3. Option Utilization Within MEOL Framework

At the higher level, the parameters $\phi_{\Omega}$ of the soft option-value function $Q_{soft, \Omega}^{\phi_{\Omega}}$ are updated as
$$\phi_{\Omega} \leftarrow \phi_{\Omega} - \xi_{\phi_{\Omega}} \, \nabla_{\phi_{\Omega}} Q_{soft, \Omega}^{\phi_{\Omega}}(s_{t}, \omega_{t}) \left( \hat{Q}_{soft, \Omega}^{\bar{\phi}_{\Omega}}(s_{t}, \omega_{t}) - Q_{soft, \Omega}^{\phi_{\Omega}}(s_{t}, \omega_{t}) \right),$$
where the target parameters $\bar{\phi}_{\Omega}$ give
$$\hat{Q}_{soft, \Omega}^{\bar{\phi}_{\Omega}}(s_{t}, \omega_{t}) = r(s_{t}, a_{t}) + \hat{U}_{soft}^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t}),$$
in which
$$\hat{U}_{soft}^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t}) = \mathbb{E}_{\omega_{t+1} \sim \pi_{\Omega}^{\theta_{\Omega}}(\omega_{t+1} \mid s_{t+1}, \omega_{t})} \left[ Q_{soft, \Omega}^{\bar{\phi}_{\Omega}}(s_{t+1}, \omega_{t+1}) - \alpha_{\Omega} \log \pi_{\Omega}^{\theta_{\Omega}}(\omega_{t+1} \mid s_{t+1}, \omega_{t}) \right].$$
The parameters $\theta_{\Omega}$ of the option-selection policy $\pi_{\Omega}^{\theta_{\Omega}}$ are updated as
$$\theta_{\Omega} \leftarrow \theta_{\Omega} - \xi_{\theta_{\Omega}} \, \hat{\nabla}_{\theta_{\Omega}} J_{\pi_{\Omega}}(\theta_{\Omega}),$$
and the entropy coefficient $\alpha_{\Omega}$ is adjusted with
$$J_{H_{\Omega}}(\alpha_{\Omega}) = \mathbb{E}_{\omega_{t} \sim \pi_{\Omega}^{\theta_{\Omega}}} \left[ -\log \pi_{\Omega}^{\theta_{\Omega}}(\omega_{t} \mid s_{t}, \omega_{t-1}) \right] - \bar{H}_{\Omega}.$$
The process of option utilization within the MEOL framework is summarized in Algorithm 3.
Algorithm 3 Option utilization within MEOL framework
1: Initialize parameter vectors $\theta_{\Omega}$, $\phi_{\Omega}$, $\alpha_{\Omega}$
2: Assign target network $\bar{\phi}_{\Omega} \leftarrow \phi_{\Omega}$
3: Initialize learning rates $\xi_{\phi_{\Omega}}$, $\xi_{\theta_{\Omega}}$, $\xi_{\alpha_{\Omega}}$ and coefficient $\sigma_{\Omega}$
4: Assign each $\pi_{U, \omega_{i}}$ its trained parameters $\theta_{U, \omega_{i}}$
5: for each epoch do
6:     for each batch do
7:         Select an option $\omega_{t} \sim \pi_{\Omega}^{\theta_{\Omega}}(\omega_{t} \mid s_{t}, \omega_{t-1})$
8:         Select an action $a_{t} \sim \pi_{U}^{\theta_{U, \omega_{i}}}(a_{t} \mid s_{t}, \omega_{t})$
9:         Observe reward $r_{t+1}$ and $s_{t+1}$
10:        Store transition $(\omega_{t-1}, s_{t}, \omega_{t}, a_{t}, r_{t+1}, s_{t+1})$ in $\mathcal{D}$
11:        Sample a training batch of transitions from $\mathcal{D}$
12:        Update $\phi_{\Omega}$ according to (21)
13:        Update $\theta_{\Omega}$ according to (24)
14:        Update target network $\bar{\phi}_{\Omega} \leftarrow \sigma_{\Omega} \phi_{\Omega} + (1 - \sigma_{\Omega}) \bar{\phi}_{\Omega}$
15:        Update $\alpha_{\Omega}$ according to (25)
16:    end for
17: end for
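To illustrate how the trained pieces interact at execution time, the sketch below rolls out one episode with the hierarchy; `pi_omega` implements the terminate-or-continue option selection, `intra_option_policies` maps each tactic to its manoeuvring policy, and the gym-style `env` interface is an assumption introduced here for illustration, not the paper’s actual API.

```python
def hierarchical_rollout(env, pi_omega, intra_option_policies, max_steps=3000):
    """Roll out one episode with a trained option hierarchy.

    pi_omega(s, prev_option)     -> option index (terminate-or-continue plus selection)
    intra_option_policies[omega] -> callable mapping state to a manoeuvring command
    env                          -> assumed gym-style reset()/step() interface
    All of these are placeholders for the trained components.
    """
    s, prev_option = env.reset(), 0
    for _ in range(max_steps):
        option = pi_omega(s, prev_option)           # high-level tactic decision
        action = intra_option_policies[option](s)   # low-level stick/throttle command
        s, _, done, info = env.step(action)
        prev_option = option
        if done:
            return info
    return None
```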

4. Hierarchical Training Architecture

In this section, we first present the parameterized air combat scenarios used for automatic curriculum reinforcement learning. Second, the reward shaping for the intra-option policies governing tactical manoeuvring and control is designed. Finally, the sparse reward of the combat outcome is retained for the option-selection policy, with the objective of enhancing the success rate of UCAV air combat. The complete structure is delineated in Figure 3.

4.1. Parameterized Environment of Air Combat Scenario

According to [11], the nature of the initial engagement situation dictates the subsequent manoeuvres, a principle that remains applicable in the context of the UCAVs’ WVRAC. For instance, if the initial engagement situation is ‘offensive’, meaning that the self-fighter holds a position at the rear quarter of the adversary, the self-fighter is expected to execute a pursuit manoeuvre in order to engage the target. Conversely, in a head-on initial engagement, in which each fighter occupies the other’s front quarter, either a one-circle or a two-circle manoeuvre is considered. Note that a wide range of initial engagement scenarios is interwoven with the curriculum, as elaborated in [10]. In this paper, the initial condition is parameterized as
$$(Range, \; AA, \; ATA),$$
where the value domain is listed in Table 1.

4.2. Construction of Tactical Intra-Option Policy

To facilitate the training of tactical intra-option policy, the learning objectives are aligned with the core principles of WVRAC [11]. Each intra-option policy is specialized for one of the three canonical tactics: angle, energy, and snapshot.
Faced with sparse feedback and the long decision horizon of air combat, reward shaping [30,31] has proven efficient for training complex tasks by incorporating domain knowledge. Consequently, the reward function for training the tactical intra-option policies is composed of the sparse reward of the air combat result $R_{goal}$, the shaping reward $R_{shaping}$, and a regularization term constraining behaviour, $R_{regularization}$, as given below:
$$R = R_{goal} + R_{shaping} + R_{regularization}.$$
Mathematically, the reward shaping is designed based on the potential situation assessment, which measures the advantages in air combat geometry, including range distance, orientation, and energy advantages.
The potential function of the range distance, $\Phi_{range}$, takes an exponential form:
$$\Phi_{range} = \exp\left( -\frac{Range - d_{r}}{c_{r}} \right).$$
The potential functions of orientation, $\Phi_{AA}$ and $\Phi_{ATA}$, are exponential forms of the angles $AA$ and $ATA$:
$$\Phi_{AA} = \exp\left( -\frac{AA}{c_{AA}} \right),$$
$$\Phi_{ATA} = \exp\left( -\frac{ATA}{c_{ATA}} \right).$$
The potential function of energy, $\Phi_{energy}$, is the ratio of the self-fighter’s energy to the adversary fighter’s energy:
$$\Phi_{energy} = \frac{E_{s}}{E_{a}},$$
where $E_{s}$ and $E_{a}$ are the energies of the self- and adversary fighters, respectively, each defined as
$$E = altitude + \frac{V_{IAS}^{2}}{2g}.$$
The regularization terms play a vital role in improving flight performance and penalizing dangerous flight manoeuvring, which is an implicit way to avoid suboptimality.
The term $P_{deck}$ penalizes the fighter agent for descending below the minimum altitude, which prevents the agent from approaching too close to the ground:
$$P_{deck} = \begin{cases} -1.0, & altitude \le altitude_{deck} \\ 0.0, & \text{otherwise}. \end{cases}$$
The term $P_{V_{IAS}}$ penalizes the fighter agent for a low indicated airspeed $V_{IAS}$, which would make it difficult to generate enough lift:
$$P_{V_{IAS}} = \begin{cases} -\left( (V_{IAS} - V_{corner}) / c_{V_{IAS}} \right)^{2}, & V_{IAS} \le V_{corner} \\ 0.0, & \text{otherwise}. \end{cases}$$
The term $P_{\alpha_{AoA}}$ penalizes the fighter agent for exceeding the stall angle of attack:
$$P_{\alpha_{AoA}} = \begin{cases} -1, & \alpha_{AoA} \ge \alpha_{AoA, stall} \\ 0, & \text{otherwise}. \end{cases}$$
To maximize manoeuvrability alongside the flight envelope, $P_{deck}$, $P_{V_{IAS}}$, and $P_{\alpha_{AoA}}$ only take effect when the corresponding parameters are out of bounds.
The term $P_{\beta}$ penalizes the fighter agent for sideslip, which causes energy inefficiency and deviation from the expected flight status:
$$P_{\beta} = \begin{cases} -(\beta / c_{\beta})^{2}, & \beta \le \beta_{max} \\ -1.0, & \text{otherwise}. \end{cases}$$
The ideal status is zero sideslip, so this penalty is dense.
The action cost $C_{action}$ smooths the control output [32], which stabilizes the response process:
$$C_{action} = -0.1 \, (action)^{2}.$$
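Gathering the penalties above into code gives a compact picture of the regularization term. In the sketch below, the threshold and scaling values are illustrative placeholders, not the constants used in the paper.

```python
import numpy as np

def regularization_terms(altitude, v_ias, alpha_aoa, beta, action,
                         altitude_deck=1000.0, v_corner=130.0, c_vias=30.0,
                         alpha_stall=0.35, beta_max=0.1, c_beta=0.1):
    """Evaluate P_deck, P_VIAS, P_alpha, P_beta, and C_action as described above.

    All threshold and scaling values are illustrative placeholders; `action` is the
    control command vector.
    """
    p_deck = -1.0 if altitude <= altitude_deck else 0.0
    p_vias = -((v_ias - v_corner) / c_vias) ** 2 if v_ias <= v_corner else 0.0
    p_alpha = -1.0 if alpha_aoa >= alpha_stall else 0.0
    p_beta = -(beta / c_beta) ** 2 if abs(beta) <= beta_max else -1.0
    c_action = -0.1 * float(np.sum(np.square(action)))   # smooths the control output
    return p_deck + p_vias + p_alpha + p_beta + c_action
```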

4.2.1. Reward Function for Angle Tactic

The fundamental objective of the angle tactic is to swiftly modify the UCAV’s flight path to secure a nose-pointing advantage, a strategy that is particularly valuable in WVRAC. In such situations, it is advantageous for the self-fighter to position itself at the adversary fighter’s 6 o’clock, thereby establishing a trailing position. In contrast to the other two tactics, the angle tactic does not treat the $AA$ potential as a separate term; rather, it multiplies the potential functions of $AA$, $ATA$, and range, with the objective of achieving angular superiority over the adversary aircraft at the optimal distance. Both $c_{0}$ and $c_{1}$ are hyperparameters that can be readily tuned. The angle tactic’s reward function is defined as follows:
$$R_{angle} = R_{goal} + \underbrace{c_{0} \, \Phi_{ATA} \Phi_{AA} \Phi_{range} + c_{1} \Phi_{energy} + R_{proximity}}_{\text{shaping}} + \underbrace{P_{deck} + P_{V} + P_{\alpha} + P_{\beta} + C_{action}}_{\text{regularization}}.$$
The remaining terms are shared across the three tactical reward functions. Cooperating with $\Phi_{range}$, the term $R_{proximity}$ is introduced to accelerate the closure of the range distance:
$$R_{proximity} = -Proximity / 1000.$$
When the range distance decreases, $Proximity$ is negative and $R_{proximity}$ is therefore positive; otherwise, $Proximity$ is positive and $R_{proximity}$ is negative. This reward dominates the closing behaviour when the range is large and the influence of $\Phi_{range}$ is feeble.

4.2.2. Reward Function for Snapshot Tactic

The snapshot tactic is indicative of a more aggressive strategy, with an emphasis on the capture of the shooting window. This is intuitively reflected in the reward function, which places a higher value on the AA. This approach suggests a certain degree of disregard for potential attacks by adversary fighters. The design of the reward function for the snapshot tactic is as follows:
$$R_{snapshot} = R_{goal} + \underbrace{c_{0} \, \Phi_{ATA} \Phi_{range} + \Phi_{AA} + c_{1} \Phi_{energy} + R_{proximity}}_{\text{shaping}} + \underbrace{P_{deck} + P_{V} + P_{\alpha} + P_{\beta} + C_{action}}_{\text{regularization}}.$$

4.2.3. Reward Function for Energy Tactic

Energy tactics are centred on the accumulation and utilization of energy, a critical aspect of the WVRAC. Large-angle manoeuvres necessitate substantial energy reserves to support them, and the exchange of energy for angular and positional advantages constitutes a form of attack strategy in WVRAC. The design of the reward function for the energy tactic is as follows:
$$R_{energy} = R_{goal} + \underbrace{c_{0} \, \Phi_{ATA} \Phi_{range} \Phi_{energy} + c_{1} \Phi_{AA} + R_{proximity}}_{\text{shaping}} + \underbrace{P_{deck} + P_{V} + P_{\alpha} + P_{\beta} + C_{action}}_{\text{regularization}}.$$
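The three reward functions differ only in how the shared potentials are recombined. The sketch below makes that recombination explicit; `c0` and `c1` are the tunable hyperparameters mentioned above, with placeholder default values rather than those used in the paper.

```python
def shaped_reward(tactic, phi_aa, phi_ata, phi_range, phi_energy,
                  r_proximity, r_goal, regularization, c0=1.0, c1=0.1):
    """Combine the shared potentials into the tactic-specific reward defined above."""
    if tactic == "angle":
        shaping = c0 * phi_ata * phi_aa * phi_range + c1 * phi_energy
    elif tactic == "snapshot":
        shaping = c0 * phi_ata * phi_range + phi_aa + c1 * phi_energy
    elif tactic == "energy":
        shaping = c0 * phi_ata * phi_range * phi_energy + c1 * phi_aa
    else:
        raise ValueError(f"unknown tactic: {tactic}")
    return r_goal + shaping + r_proximity + regularization
```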

4.3. Construction of Tactical Option-Selection Policy

The option-selection policy serves as a high-level decision-maker that dynamically chooses among the three intra-option tactics (angle, snapshot, and energy) based on the real-time air combat situation. Inspired by the maximum-entropy formulation of MEOL, the policy balances exploitation of the learned tactics and exploration of option specialization through entropy regularization. The exploration incentive therefore enters through the entropy term of the objective, while the reward itself retains only the sparse combat outcome:
$$R_{tactic} = R_{goal}.$$

4.4. Observation Space and Action Space

In real WVRAC, the self-fighter’s onboard sensors limit the acquisition of information about the adversary fighter. In modern air warfare, however, support from ground radar, early-warning aircraft, and other systems provides more accurate access to the adversary fighter’s position, speed, trajectory, and other information. Consequently, the observation space for each UCAV agent encompasses both raw data from the environment and processed information representing the air combat geometry. The raw data, including the indicated airspeed ($V_{IAS}$), angle of attack ($\alpha_{AoA}$), sideslip angle ($\beta$), and altitude, reflect flight safety and the quality of manoeuvring. The processed data, such as range, proximity rate, $AA$, and $ATA$, depict the positional relationship with the adversary, aligning with the pilot’s perspective. These states are normalized before being fed into the network:
$$(Range, \; Proximity, \; AA, \; ATA, \; V_{IAS}, \; altitude, \; \alpha_{AoA}, \; \beta).$$
To comprehensively analyse and optimize the tactics and manoeuvring performance of UCAVs, this study employs a continuous action space, with control commands for the ailerons, rudder, elevator, and throttle as outputs:
$$(\delta_{a}, \; \delta_{r}, \; \delta_{e}, \; \delta_{t}).$$
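As an illustration, a possible assembly-and-normalization step for this observation vector is sketched below; the scaling constants are placeholders rather than the normalization actually used in the paper.

```python
import numpy as np

def build_observation(range_m, proximity, aa, ata, v_ias, altitude, alpha_aoa, beta):
    """Assemble and normalize the 8-dimensional observation described above.

    The scaling constants are illustrative placeholders only.
    """
    return np.array([
        range_m / 10000.0,     # range, roughly scaled towards [0, 1]
        proximity / 400.0,     # closure rate
        aa / np.pi,            # aspect angle
        ata / np.pi,           # antenna train angle
        v_ias / 400.0,         # indicated airspeed
        altitude / 10000.0,    # altitude
        alpha_aoa / np.pi,     # angle of attack
        beta / np.pi,          # sideslip angle
    ], dtype=np.float32)
```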

5. Experiment

The objective of this study is to investigate the role of MEOL in facilitating the development of tactical and manoeuvring skills in UCAV agents. To this end, a series of experiments have been conducted to address the following research questions: (1) Does MEOL enable the discovery of manoeuvring skills that are necessary for tactics at a low level? (2) Does the option framework effectively integrate tactics through high-level training? (3) How do trained UCAV agents cooperate with tactics and manoeuvre hierarchically?
Accordingly, the initial phase of the training process involves training the angle, snapshot, and energy tactics, along with the corresponding reward designs, to establish the tactical intra-option policies. Subsequently, the option-selection policy for tactic selection is trained. Furthermore, a pair of comparative experiments is employed to illustrate the MEOL framework’s superior performance, and an ablation study demonstrates the effectiveness of the WGAN-based automatic curriculum generation technique. As the adversary, we designed a rule-based expert fighter following the doctrines and instructions in [11]. Finally, we visualize the tactical manoeuvring as well as the tactical option selections during the air combat process. The overall hyperparameters are listed in Table 2.

5.1. Air Combat Simulation Platform for HRL Training

In the implementation stage, a high-fidelity flight dynamics model (FDM) of the F-16 Fighting Falcon, derived from JSBSim 1.1.6 [33], was utilized as the UCAV agent. The WEZ, observation assessment, and game-judgment systems were then constructed to complete the air combat simulation. Data comprising the FDM state and the control commands were exchanged with the environment at a frequency of 50 Hz, and an API (application programming interface) was provided for training. The hierarchical reinforcement learning was implemented in PyTorch 1.8.0+cu111 and supported by Garage [34]. Additionally, Tacview 1.9.4 was utilized for the analysis of the UCAVs’ flight logs.

5.2. Decoupled Training for Tactical Intra-Option Policies

In this subsection, the three types of intra-option policies—angle, snapshot, and energy—that were trained individually based on the previous construction are presented. The characteristics of each of the three types of intra-option policies were analysed to demonstrate the generative power of the underlying tactics in the method.

5.2.1. Angle Tactic

The angle tactic was trained in accordance with the defined reward function. The success rate during the training phase was calculated based on the most recent 100 assessment results; the same calculation method was applied to the snapshot and energy tactics. The success rate curve of the angle tactic training process is shown in Figure 4. After the success rate reached 0.6 for the first time, a precipitous decline was observed, which may have been due to an increase in exploration caused by overfitting of the policy. However, this ‘regression’ helped the policy escape a local optimum; the subsequent recovery of the win rate curve confirms the successful exploration of a better strategy path, and the final win rate stabilized at about 0.6.
The ensuing engagement between the self-fighter and the adversary fighter is illustrated in Figure 5. The initial situation was characterized by a neutral engagement. Between time markers 0 and 1, both UCAVs sought smaller turning radii through energy-to-kinetic conversion to enter each other’s loops. Subsequent to this, during the WVRAC single-loop dogfight from time markers 1 to 2, both parties minimized their turning radii to gain angular advantages. Between time markers 2 and 3, the self-fighter attained positional and angular superiority through two consecutive barrel rolls. Subsequently, from time markers 3 to 4, while the adversary fighter performed an S-turn, the self-fighter moved inside its turning radius and successfully shot it down. Figure 6 provides a visual representation of the UCAVs’ angular positions, the distance between them, the proximity rate, and the energy changes during the engagement.

5.2.2. Snapshot Tactic

As demonstrated in Figure 7, the win rate of the snapshot tactic exhibited stability within the 0.4 to 0.5 range following the training phase. The underlying reason for this relative deficiency in comparison to the other two tactics is that the fundamental design concept of the snapshot tactic is to prioritize the maintenance of the AA angle advantage. This approach, however, comes at the expense of responsiveness to potential threats from an adversary fighter.
A detailed examination of a typical engagement of the snapshot tactic (Figure 8) reveals that at time marker 0, the adversary fighter is located at the 11 o’clock position and the AA angle remains stable at approximately 100°. From time markers 0 to 1, the self-fighter executed a rapid turn manoeuvre, successfully positioning itself behind the adversary’s 3/9 line. In response, the adversary fighter, perceiving the threat, adopted a strategy of potential and kinetic energy conversion and endeavoured to extricate itself from the unfavourable situation by descending and turning sharply. At time markers 1 and 2, both sides coincidentally employed descending turn manoeuvres. As illustrated in Figure 9f, the self-fighter encountered an energy deficit during this phase, which hindered its ability to lock on to the adversary fighter and launch an effective attack. Prior to time marker 3, a strategic realignment was executed, involving the adoption of a more assertive descending turn tactic. Utilizing precise manoeuvring capabilities, the self-fighter successfully infiltrated the inner ring of the enemy’s turn, thereby seizing a valuable window of opportunity to initiate offensive action.
A comparison of the snapshot tactic with the other two tactics at the macro level (Figure 9) reveals distinctive features attributable to the difference in the reward function. As illustrated in Figure 9a, the snapshot tactic demonstrates a particular strength in AA angle advantage retention, which is its core strength. However, as demonstrated in Figure 9f, the snapshot tactic’s relative paucity of emphasis on energy resulted in the self-fighter being at a long-term disadvantage in terms of energy posture, which also limited its win rate to a certain extent.

5.2.3. Energy Tactic

As demonstrated in Figure 10, the success rate of the energy tactic in the training process eventually stabilized between 0.5 and 0.6. A refined analysis of a single confrontation (Figure 11) demonstrates that at the initial moment of the confrontation (time marker 0), the self-fighter was in the 1 o’clock direction of the adversary fighter, a position that placed the self-fighter at a disadvantage in the initial posture. However, from time markers 0 to 1, the self-fighter successfully shifted both sides into an anisotropic position by rapidly dropping altitude, thereby simultaneously accruing a significant energy advantage. This in turn ensured strong energy support for subsequent manoeuvres. From time markers 1 to 2, the self-fighter adroitly entered the rear of the adversary’s 3/9 line through a Low Yo-Yo manoeuvre. This pivotal manoeuvre ensured the self-fighter’s advantageous positioning within the tactical arrangement, thereby establishing the conditions for the subsequent attack. Subsequently, from time markers 2 to 3, the self-fighter executed forward tracking manoeuvres, successfully cutting into the adversary fighter’s turning radius and neutralizing the enemy. The angle, range, closure, and energy ratios during the engagement are shown in Figure 12.

5.3. Comparable Evaluation for Tactical Option-Selection Policies

In order to validate the MEOL framework’s ability to integrate tactics within options through coupled training and to reasonably use tactics during confrontation with adversary fighters, this study carried out a systematic validation from the dimensions of tactical coupled training effectiveness and tactical hierarchical application capability in the confrontation environment. In terms of comparison method selection, based on the theoretical framework of ref. [35], this study selected two advanced option learning algorithms: the double actor–critic architecture with Proximal Policy Optimization (DAC+PPO) and the double actor–critic architecture with advantage actor–critic (DAC+A2C), as the baseline methods. This selection accounts for the algorithms’ tactical optimization characteristics in dynamic confrontation environments and their comparability with the MEOL framework. Specifically, the decision was motivated by their proven efficiency in hierarchical decision-making tasks and their structural compatibility with the MEOL framework’s dual-layer optimization mechanism. Meanwhile, two comparative experiments were designed in this study: the first evaluates the MEOL framework’s training success rate against two baseline methods to demonstrate its superior ability in deploying tactical intra-options against rule-based expert opponents, while the second conducts post-training confrontation tests involving both baseline algorithms and three tactical intra-options to validate the framework’s operational mechanism for integrating tactical hierarchies and executing graded manoeuvres.

5.3.1. Coupling Training for Option-Selection Policy

In order to maintain the consistency of the exploration strategies during training, the two baseline methods contain the same three tactical options, and entropy regularization is used for both their option and manoeuvre hierarchies. Figure 13 illustrates the success rates of the MEOL, DAC+A2C, and DAC+PPO option-selection policies while training against the expert rule-based UCAV opponent; the success rate during training is again calculated from the last 100 evaluation results, and the total number of environment steps is set to $1.5 \times 10^{7}$. It is evident that the final win rate of the MEOL framework stabilizes at around 0.8, which is significantly higher than the 0.6 of DAC+PPO and the 0.5 of DAC+A2C. Furthermore, it improves by approximately 0.2, 0.35, and 0.25 over the three tactical intra-option policies, respectively. This experimental result provides substantial evidence that the MEOL option framework can effectively utilize the intra-option tactics to achieve a substantial increase in the success rate.

5.3.2. Confrontation Assessment

Following the conclusion of the training phase, the MEOL was subjected to 1000 confrontation experiments with DAC+PPO, DAC+A2C, and three types of tactical intra-option policies under random initial conditions. The results of these experiments are shown in Figure 14, where the vertical coordinate ‘success rate’ refers to the ratio of the number of victories of one side to the total number of 1000 confrontation tests. The results of the adversarial tests show that MEOL’s capability is significantly better than the two types of baseline algorithms as well as the three types of tactical intra-options.
Subsequently, the manoeuvre trajectory images of the MEOL against the three tactical intra-options and the two baseline methods were selected at random air combat environment parameters during the confrontation evaluation to illustrate how the MEOL utilized the tactical options during the confrontation.
The trajectory images of the MEOL against the three tactical intra-options are shown in Figure 15, where the self-fighter denotes the MEOL-controlled UCAV. Concurrently, beneath each confrontation track is a visualization of the corresponding MEOL choice of tactics’ intra-option in the time dimension. The utilization of the dark blue line indicates the application of the option, whereas the light yellow line signifies its non-application.
The trajectory of MEOL’s confrontation with the angle tactic is illustrated in Figure 15a, while its corresponding option choices in the time dimension are demonstrated in Figure 15d. At the initial time marker 0, the two sides present a typical head-on situation, and until time marker 1 the self-fighter’s option choices are focused on the angle and snapshot tactics. In this head-on scenario, the direct utilization of the snapshot tactic would yield a higher reward, but it would also render the self-fighter vulnerable to attack. Consequently, the self-fighter transitions its position through a diving turn manoeuvre, thereby establishing a rear-quarter attack position. The adversary fighter’s strategy is predicated on angle-tactic training, the aim of which is to gain a situational advantage in the vertical plane between time markers 1 and 2 through an oblique loop manoeuvre. At this point, the self-fighter is confronted with an adversary seeking to enter its rear hemisphere, and gaining an angular advantage is clearly more important. Thus, the angle tactical option dominates the self-fighter’s choices in the second half of the confrontation. Trajectory analysis demonstrates that the horizontal manoeuvres employed by the self-fighter caused the adversary fighter to move forward (overshoot), thereby successfully re-establishing the rear-aspect engagement geometry and shooting down the adversary fighter at time marker 3.
In a similar manner, the confrontation trajectory and option-selection process of MEOL against the energy tactic are illustrated in Figure 15b and Figure 15e, respectively. The initial posture is a parallel divergence posture, and the adversary fighter trained on the energy tactic expands its energy advantage by exchanging potential energy for kinetic energy between time markers 0 and 1, so as to reserve energy for the subsequent high-overload turning manoeuvre and re-pointing at the self-fighter. Analysis of the option-selection process shows that, between time markers 0 and 1, the self-fighter primarily selects the energy and angle options, which are visualized in the trajectory diagram by the horizontal turn of our aircraft to re-point at the adversary aircraft. In response to the adversary aircraft’s High Yo-Yo manoeuvres towards the rear hemisphere of the self-fighter at time markers 1 to 2, the self-fighter primarily employs the angle and snapshot options to augment the AA angle advantage. This enables the successful execution of the lead turn and the acquisition of a shooting opportunity against the adversary fighter at time marker 2.
The trajectory of the confrontation between MEOL and the snapshot tactic is illustrated in Figure 15c. It is evident that the adversary fighter trained on the snapshot tactic exhibits a heightened degree of aggression. Between time markers 0 and 1, the adversary fighter, following an initial head-on attack on the self-fighter, executes a high-overload turn in the vertical plane to re-establish a head-on position at time marker 1. In the face of an adversary seeking an AA angle advantage, the self-fighter prefers to conserve energy and adjust the angle in the early stage of the confrontation, as illustrated in Figure 15f. At time marker 1, when the head-on position is established and the self-fighter is at an altitude disadvantage, the self-fighter successfully performs a barrel roll manoeuvre by utilizing the angle and snapshot options, thereby entering the rear hemisphere area of the adversary fighter. At time marker 2, when the AA angle is advantageous, the self-fighter successfully shoots down the adversary fighter from the side by employing the snapshot option.

5.4. Ablation Study

In order to validate the improvement in MEOL training effectiveness brought by the automatic curriculum generation mechanism, an ablation experiment was designed. The experimental group adopts the WGAN-based automatic curriculum generation framework, while the control group adopts a baseline training method based on random sampling of environmental parameters; the comparison is conducted for the three tactical intra-option policies. As illustrated in Figure 16, the experimental group employing the automatic curriculum generation mechanism exhibits a significant performance advantage in the training of complex air combat scenarios. The training period required for its policies to converge to a stable win rate is drastically shortened compared to the control group, and the success rate of the tactics is higher for the same number of training steps. These findings confirm that the adversarial-training-based curriculum generation mechanism can effectively enhance the tactical execution ability of the agents in non-stationary combat environments.

6. Conclusions

This study provides empirical evidence that the MEOL framework is an effective solution to the hierarchical decision-making challenges that arise in UCAVs’ WVRAC scenarios. The integration of air combat tactical domain knowledge with hierarchical reinforcement learning enables the proposed method to successfully decouple low-level manoeuvring control from high-level tactical selection, while ensuring dynamic coordination between the two levels. Ablation experiments demonstrate that the WGAN-based automatic curriculum generation mechanism is crucial in overcoming exploration challenges for efficient strategy training under different initial engagement conditions, and that the WGAN technique improves the win rate by roughly 0.2–0.4 in the training of the three tactical intra-option policies. Meanwhile, the confrontation experiments show that the agent based on the MEOL framework is significantly better than agents based on a single tactical policy, as well as the DAC+PPO and DAC+A2C option learning frameworks, in terms of within-visual-range air combat capability. In the 1000 confrontation experiments conducted separately, the MEOL framework achieved win rates of 0.583, 0.617, 0.752, 0.605, and 0.68 against DAC+PPO, DAC+A2C, and the angle, snapshot, and energy tactics, respectively, reflecting its ability to exhibit adaptive tactical-switching behaviours in line with expert air combat theory. The framework’s limitations primarily stem from its current reliance on predefined tactical categories and simulated adversarial strategies. Subsequent research will concentrate on extending the methodology to multi-UCAV cooperative scenarios and real-time adaptation against evolving adversary tactics. In conclusion, this study provides a concrete pathway for intelligent decision-making on tactics and manoeuvres in UCAVs’ WVRAC.

Author Contributions

Conceptualization, Y.L. and W.D.; methodology, Y.L. and P.Z.; software, P.Z.; validation, Y.L., W.D. and G.L.; formal analysis, Y.L. and H.Z.; investigation, P.Z.; resources, W.D. and G.L.; data curation, Y.L. and P.Z.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and W.D.; visualization, Y.L. and P.Z.; supervision, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Natural Science Foundation of Shaanxi under Grant 2025JC-YBQN-842.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

DURC Statement

The current research is limited to the academic study of intelligent air combat under simulation conditions, which contributes to the study of the application of hierarchical reinforcement learning in the field of tactical decision-making for unmanned combat aerial vehicles. This research does not pose a threat to public health or national security. The authors acknowledge the dual-use potential of the research involving intelligent air combat and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, the authors strictly adhere to relevant national and international laws about DURC. The authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Burgin, G.H.; Sidor, L. Rule-Based Air Combat Simulation; Technical Report; NASA: Washington, DC, USA, 1988.
  2. Austin, F.; Carbone, G.; Hinz, H.; Lewis, M.S.; Falco, M. Game theory for automated manoeuvring during air-to-air combat. J. Guid. Control Dyn. 1990, 13, 1143–1149. [Google Scholar] [CrossRef]
  3. Park, H.; Lee, B.; Tahk, M.; Yoo, D. Differential game based air combat manoeuvre generation using scoring function matrix. Int. J. Aeronaut. Space Sci. 2016, 17, 204–213. [Google Scholar] [CrossRef]
  4. Ernest, N.; Carroll, D.; Schumacher, C.; Clark, M.; Lee, G. Genetic fuzzy based artificial intelligence for unmanned combat aerial vehicle control in simulated air combat missions. J. Def. Manag. 2016, 1, 1000144. [Google Scholar] [CrossRef]
  5. McGrew, J.S.; How, J.P.; Williams, B.; Roy, N. Air-combat strategy using approximate dynamic programming. J. Guid. Control Dyn. 2010, 33, 1641–1654. [Google Scholar] [CrossRef]
  6. Fang, J.; Zhang, L.; Fang, W.; Xu, T. Approximate dynamic programming for CGF air combat maneuvering decision. In Proceedings of the 2016 2nd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 14–16 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1386–1390. [Google Scholar] [CrossRef]
  7. Wang, M.; Wang, L.; Yue, T.; Liu, H. Influence of unmanned combat aerial vehicle agility on short-range aerial combat effectiveness. Aerosp. Sci. Technol. 2020, 96, 105534. [Google Scholar] [CrossRef]
  8. Crumpacker, J.B.; Robbins, M.J.; Jenkins, P.R. An approximate dynamic programming approach for solving an air combat manoeuvring problem. Expert Syst. Appl. 2022, 203, 117448. [Google Scholar] [CrossRef]
  9. Zhu, J.; Kuang, M.; Zhou, W.; Shi, H.; Zhu, J.; Han, X. Mastering air combat game with deep reinforcement learning. Def. Technol. 2023, 34, 295–312. [Google Scholar] [CrossRef]
  10. Bae, J.H.; Jung, H.; Kim, S.; Kim, S.; Kim, Y. Deep reinforcement learning-based air-to-air combat manoeuvre generation in a realistic environment. IEEE Access 2023, 11, 26427–26440. [Google Scholar] [CrossRef]
  11. Shaw, R.L. Fighter Combat: Tactics and Manoeuvring; Naval Institute Press: Annapolis, MD, USA, 1985. [Google Scholar]
  12. Dayan, P.; Hinton, G.E. Feudal reinforcement learning. In Proceedings of the 5th International Conference on Neural Information Processing Systems (NIPS’92), Denver, CO, USA, 30 November–3 December 1992; pp. 271–278. [Google Scholar]
  13. Dietterich, T.G. The MAXQ method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML’98), Madison, WI, USA, 24–27 July 1998; pp. 118–126. [Google Scholar]
  14. Parr, R.; Russell, S. Reinforcement learning with hierarchies of machines. In Proceedings of the 10th International Conference on Neural Information Processing Systems (NIPS’97), Denver, CO, USA, 1–6 December 1997; pp. 1043–1049. [Google Scholar]
  15. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef]
  16. Konidaris, G.; Barto, A. Skill discovery in continuous reinforcement learning domains using skill chaining. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS’09), Vancouver, BC, Canada, 7–10 December 2009; pp. 1015–1023. [Google Scholar]
  17. Bradtke, S.J.; Duff, M.O. Reinforcement learning methods for continuous-time Markov decision problems. In Proceedings of the 7th International Conference on Neural Information Processing Systems (NIPS’94), Denver, CO, USA, 1994. Available online: https://api.semanticscholar.org/CorpusID:17149277 (accessed on 7 February 2025).
  18. Parr, R. Hierarchical Control and Learning for Markov Decision Processes. Ph.D. Dissertation, University of California, Berkeley, CA, USA, 1998; 214p. Available online: https://api.semanticscholar.org/CorpusID:53939299 (accessed on 25 February 2025).
  19. Pope, A.P.; Ide, J.S.; Mićović, D.; Diaz, H.; Rosenbluth, D.; Ritholtz, L.; Twedt, J.C.; Walker, T.T.; Alcedo, K.; Javorsek, D. Hierarchical reinforcement learning for air-to-air combat. In Proceedings of the 2021 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 15–18 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 275–284. [Google Scholar] [CrossRef]
  20. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Volume 80, pp. 1861–1870. [Google Scholar] [CrossRef]
  21. Kong, W.; Zhou, D.; Du, Y.; Zhou, Y.; Zhao, Y. Hierarchical multi-agent reinforcement learning for multi-aircraft close-range air combat. IET Control Theory Appl. 2023, 17, 1840–1862. [Google Scholar] [CrossRef]
  22. Selmonaj, A.; Szehr, O.; Del Rio, G.; Antonucci, A.; Schneider, A.; Rüegsegger, M. Hierarchical multi-agent reinforcement learning for air combat manoeuvring. In Proceedings of the 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 15–17 December 2023; pp. 1031–1038. [Google Scholar] [CrossRef]
  23. Zhang, P.; Dong, W.; Cai, M.; Jia, S.; Wang, Z.-P. MEOL: A maximum-entropy framework for options learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 4834–4848. [Google Scholar] [CrossRef] [PubMed]
  24. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; PMLR: London, UK, 2017; pp. 214–223. [Google Scholar] [CrossRef]
  25. Pope, A.P.; Ide, J.S.; Mićović, D.; Diaz, H.; Twedt, J.C.; Alcedo, K.; Walker, T.T.; Rosenbluth, D.; Ritholtz, L.; Javorsek, D. Hierarchical Reinforcement Learning for Air Combat at DARPA’s AlphaDogfight Trials. IEEE Trans. Artif. Intell. 2023, 4, 1371–1385. [Google Scholar] [CrossRef]
  26. Florensa, C.; Held, D.; Wulfmeier, M.; Zhang, M.; Abbeel, P. Reverse curriculum generation for reinforcement learning. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL), Mountain View, CA, USA, 13–15 November 2017; Levine, S., Vanhoucke, V., Goldberg, K., Eds.; PMLR: London, UK, 2017; pp. 482–495. [Google Scholar]
  27. Florensa, C.; Held, D.; Geng, X.; Abbeel, P. Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; PMLR: London, UK, 2018; pp. 1515–1528. [Google Scholar]
  28. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. 2014. Available online: https://arxiv.org/abs/1406.2661 (accessed on 1 March 2025).
  29. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved Training of Wasserstein GANs. 2017. Available online: https://arxiv.org/abs/1704.00028 (accessed on 5 March 2025).
  30. Ng, A.; Harada, D.; Russell, S.J. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999. [Google Scholar] [CrossRef]
  31. Dewey, D. Reinforcement learning and the reward engineering principle. In Proceedings of the AAAI Spring Symposium Series, Stanford, CA, USA, 24–26 March 2014; AAAI Press: Menlo Park, CA, USA, 2014; pp. 1–6. [Google Scholar] [CrossRef]
  32. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; PMLR: London, UK, 2016; pp. 1329–1338. [Google Scholar] [CrossRef]
  33. Berndt, J.S. JSBSim: An Open Source Flight Dynamics Model in C++. In Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit, Providence, RI, USA, 16–19 August 2004; AIAA: Reston, VA, USA, 2004; pp. 1–12. [Google Scholar] [CrossRef]
  34. Garage Contributors. Garage: A Toolkit for Reproducible Reinforcement Learning Research. Available online: https://github.com/rlworkgroup/garage (accessed on 12 October 2024).
  35. Zhang, S.; Whiteson, S. DAC: The Double Actor-Critic Architecture for Learning Options. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; p. 10112. [Google Scholar] [CrossRef]
Figure 1. Air combat geometry.
Figure 2. Weapon engagement zone.
Figure 3. Architecture of hierarchical training within the MEOL framework.
Figure 4. Success rate curve during angle tactic training.
Figure 5. A visualization of the angle tactic: (a) Manoeuvring trajectory in a geographic coordinate system. (b) Manoeuvring trajectory in Tacview.
Figure 6. The flight log of the angle tactic: (a) AA. (b) ATA. (c) HCA. (d) Range. (e) Proximity. (f) Energy advantage.
Figure 7. Success rate curve during snapshot tactic training.
Figure 8. A visualization of the snapshot tactic: (a) Manoeuvring trajectory in a geographic coordinate system. (b) Manoeuvring trajectory in Tacview.
Figure 9. The flight log of the snapshot tactic: (a) AA. (b) ATA. (c) HCA. (d) Range. (e) Proximity. (f) Energy advantage.
Figure 10. Success rate curve during energy tactic training.
Figure 11. A visualization of the energy tactic: (a) Manoeuvring trajectory in a geographic coordinate system. (b) Manoeuvring trajectory in Tacview.
Figure 12. The flight log of the energy tactic: (a) AA. (b) ATA. (c) HCA. (d) Range. (e) Proximity. (f) Energy advantage.
Figure 13. Comparison curve of the success rate of the training process.
Figure 14. Winning percentage against comparative policies.
Figure 15. Confrontation trajectory with tactical intra-option and option-selection process: (a) Trajectory against angle tactic. (b) Trajectory against energy tactic. (c) Trajectory against snapshot tactic. (d) Option-selection process against angle tactic. (e) Option-selection process against energy tactic. (f) Option-selection process against snapshot tactic.
Figure 16. Training curves of success rate in the ablation study: (a) Angle tactic. (b) Energy tactic. (c) Snapshot tactic.
Table 1. Initial conditions of the parameterized air combat environment.

Initial Parameter        Value Domain
Range                    [1900 m, 5100 m]
AA                       [−180°, 180°]
ATA                      [−180°, 180°]
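For concreteness, the sketch below (not taken from the paper) shows two ways initial engagement conditions over the Table 1 domains might be produced: a uniform baseline sampler, and a helper that maps a curriculum generator’s output in [−1, 1]³ (e.g., the tanh-activated output of the WGAN generator) onto the same physical ranges. The function names and normalization convention are assumptions.

```python
# Illustrative only: sampling initial engagement conditions over the Table 1 domains.
import random

DOMAINS = {                      # (low, high) pairs taken from Table 1
    "range_m": (1900.0, 5100.0),
    "aa_deg": (-180.0, 180.0),
    "ata_deg": (-180.0, 180.0),
}

def sample_uniform_condition():
    """Baseline: draw each initial parameter uniformly from its value domain."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in DOMAINS.items()}

def scale_generator_output(z_out):
    """Map a generator output in [-1, 1]^3 onto the Table 1 value domains."""
    keys = ("range_m", "aa_deg", "ata_deg")
    return {
        k: DOMAINS[k][0] + 0.5 * (v + 1.0) * (DOMAINS[k][1] - DOMAINS[k][0])
        for k, v in zip(keys, z_out)
    }

if __name__ == "__main__":
    print(sample_uniform_condition())
    print(scale_generator_output([0.0, -1.0, 1.0]))  # mid-range distance, AA = -180°, ATA = 180°
```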
Table 2. Hyperparameter settings used for training.

Hyperparameter                                            Value
Number of hidden layers                                   2
Dimensions of hidden layers                               256
Nonlinearity of hidden layers                             ReLU
Sample batch size                                         5000
Mini-batch size                                           256
Reward scale                                              1
Discount factor (γ)                                       0.99
Target update interval                                    1
Gradient steps                                            100
Total environment steps                                   1.5 × 10^7
Optimizer                                                 Adam
Reward function parameter (c_0)                           10
WGAN latent dimension (d_z)                               128
WGAN gradient penalty coefficient (λ)                     10
Target action entropy (H̄_{U,ω_i})                         −4
Initial action entropy (α_{U,ω_i})                         0.1
Target option entropy (H̄_Ω)                               1
Initial option entropy (α_ϕ)                               0.1
Learning rate of WGAN generator (ξ_ψ)                     1 × 10^−4
Learning rate of WGAN critic (ξ_θ)                        5 × 10^−5
Target Q-network soft update factor (τ_{U,ω_i})           5 × 10^−3
Learning rate of option selection (ξ_{θ_Ω})               3 × 10^−5
Learning rate of intra-option selection (ξ_{θ_{U,ω_i}})   3 × 10^−4
Learning rate of Q function (ξ_{ϕ_Ω})                     3 × 10^−4
Learning rate of Sub-Q function (σ_{U,ω_i})               3 × 10^−4
Learning rate of option alpha (ξ_{α_Ω})                   3 × 10^−5
Learning rate of action entropy (ξ_{α_{U,ω_i}})           3 × 10^−4
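For readers reimplementing the setup, the settings in Table 2 might be grouped into a single configuration object along the following lines. The dictionary structure and key names are assumptions chosen for readability; only the numerical values come from the table.

```python
# Illustrative grouping of the Table 2 settings as a training configuration.
# Structure and key names are assumptions; numerical values follow the table.
CONFIG = {
    "network": {"hidden_layers": 2, "hidden_dim": 256, "activation": "relu"},
    "training": {
        "sample_batch_size": 5000,
        "mini_batch_size": 256,
        "reward_scale": 1.0,
        "discount_gamma": 0.99,
        "target_update_interval": 1,
        "gradient_steps": 100,
        "total_env_steps": int(1.5e7),
        "optimizer": "adam",
        "reward_c0": 10,
        "target_q_soft_update_tau": 5e-3,
    },
    "entropy": {
        "target_action_entropy": -4.0,
        "initial_action_alpha": 0.1,
        "target_option_entropy": 1.0,
        "initial_option_alpha": 0.1,
    },
    "wgan": {"latent_dim": 128, "gradient_penalty": 10.0},
    "learning_rates": {
        "wgan_generator": 1e-4,
        "wgan_critic": 5e-5,
        "option_policy": 3e-5,
        "intra_option_policy": 3e-4,
        "option_q": 3e-4,
        "sub_q": 3e-4,
        "option_alpha": 3e-5,
        "action_alpha": 3e-4,
    },
}
```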
