Article

A Hierarchical Reinforcement Learning Method for Intelligent Decision-Making in Joint Operations of Sea–Air Unmanned Systems

Aviation Engineering School, Air Force Engineering University, Xi’an 710038, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(9), 596; https://doi.org/10.3390/drones9090596
Submission received: 23 June 2025 / Revised: 4 August 2025 / Accepted: 5 August 2025 / Published: 25 August 2025
(This article belongs to the Collection Drones for Security and Defense Applications)

Abstract

To address the challenges of intelligent decision-making in complex and high-dimensional state–action spaces during joint operations simulations of sea–air unmanned systems, an end-to-end intelligent decision-making scheme is proposed. Initially, a highly versatile hierarchical intelligent decision-making method is designed for sea–air joint operations simulation scenarios. Subsequently, an approach combining intrinsic and extrinsic rewards is adopted to structurally mitigate the adverse effects of sparse rewards. Following this, a prominence detection method and a repetition penalty filtering method are devised, leading to the development of a hierarchical reinforcement learning algorithm based on a two-tier screening approach for potential subgoals. Finally, the feasibility of the proposed method is validated through ablation experiments and visualized simulation studies. The simulation results demonstrate that the presented method provides a useful reference for research on intelligent decision-making in unmanned operations and can support innovative studies of related response strategies.

1. Introduction

It is widely recognized that with the rapid advancement of weapon technology, operational processes are becoming increasingly complex. The operational scenario is characterized by a growing number of variables and an ever more intricate environment, and operational strategies are evolving towards multi-service joint operations and unmanned operations. Relying solely on human judgment for real-time operational decisions is no longer sufficient to adapt to rapidly changing conditions. Moreover, deriving optimal operational strategies from mathematical models of the environment and tasks requires substantial computational power. Reinforcement learning, by contrast, does not require precise mathematical models of the environment and tasks; it needs only continuous trial-and-error interaction with the environment, using rewards to refine strategies, thus enabling unmanned decision-making in the operational scenario. Current unmanned operations systems focus predominantly on Unmanned Aerial Vehicles (UAVs) and unmanned naval vessels, and the joint operational decision-making of sea–air unmanned systems comprising these two types of equipment is a focal point of current and future research. This paper employs reinforcement learning techniques to investigate intelligent decision-making in joint operations of sea–air unmanned systems. The findings could assist commanders in innovating and validating new strategies and could also be applied to verify the effectiveness of sea–air unmanned weapons systems in practical applications.
To address the issue of maneuver decision-making in beyond-visual-range air operations, researchers have proposed a layered decision-making algorithm based on Deep Reinforcement Learning (DRL). This approach stratifies the action space in air operations and designs a reward function that considers multiple factors to tackle the sparse reward problem. Experiments have shown that this method can enable autonomous decision-making in beyond-visual-range air operations [1]. Additionally, numerous scholars have explored the application of DRL methods in close-range air operations maneuver decision-making [2,3,4]. DRL has also made substantial progress in the study of multi-aircraft air operations confrontations [5,6]. To enhance the survivability of fixed-wing UAVs in the operational scenario, researchers have developed a layered target-guided learning method. This method leverages the layered characteristics of targets to improve the availability of transitions and mitigate the negative impact of sparse rewards on the algorithm. Results indicate that this algorithm outperforms traditional RL algorithms in terms of convergence speed [7]. Significant achievements have also been made in the application of DRL methods to UAV operations [8,9,10]. Some scholars have employed DRL to investigate UAV maneuver decision-making in air operations games [11,12], while others have used the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) algorithm to study Red–Blue multi-agent gaming issues [13]. Currently, various operational scenarios have begun to utilize DRL to enhance operational capabilities [14]. However, most of these applications are explorations within individual equipment or single domains, lacking research on cross-domain joint operations. Moreover, these applications almost invariably encounter the classic challenge of sparse rewards.
Through a comprehensive review of the literature and experimental research, we have identified three primary challenges in achieving the objectives of this study: (1) The complexity of the sea–air unmanned weapons systems in a joint operations environment and the vastness of the state space lead to the inefficiency of traditional single-layer reinforcement learning methods in exploration, making it difficult to acquire effective strategies. Currently, there is a lack of an end-to-end hierarchical decision-making framework for the complex scenarios of joint operations of sea–air unmanned systems. (2) Existing hierarchical DRL algorithms exhibit deficiencies in subgoal design. Some require manual setting of subgoals, while others are only applicable to specific environments. There is a need for a subgoal design method with strong generalization capabilities. (3) The lengthy time sequences in sea–air joint operations simulations and the singularity of reward functions result in sparse rewards, which poses a significant challenge to the learning process.
In recent years, the rapidly evolving hierarchical reinforcement learning (HRL) architecture has demonstrated remarkable effectiveness in addressing the complexities and dimensionality explosion associated with intricate tasks [15]. This approach decomposes complex problems into multiple subproblems, solving them sequentially. It stratifies policies into different hierarchical levels, where senior-level strategies guide junior-level ones by setting subgoals. Junior-level strategies learn from environmental observations and subgoals, executing decision-making actions. In turn, senior-level strategies receive rewards from the environment, enabling them to design better subgoals for the junior-level strategies. The hierarchical structure enhances sample efficiency because senior-level controllers provide intrinsic rewards to junior-level controllers based on their performance, ensuring that junior-level controllers can still receive rewards even when external rewards are absent, thus facilitating training. Based on the aforementioned challenges and current technological foundations, this paper makes three key contributions: (1) To address Challenge 1, we propose an end-to-end decision-making framework tailored for complex scenarios such as joint operations of sea–air unmanned systems. (2) To address Challenge 2, a more flexible subgoal selection method is introduced, based on a two-level screening approach for potential subgoals. This method does not require manual subgoal setting and yields subgoals with strong guiding effects, widely suitable for intelligent decision-making in complex environments. (3) To address Challenge 3, we leverage the advantages of HRL in addressing sparse reward issues, significantly increasing the frequency of rewards obtained during agent-environment interactions by combining intrinsic and extrinsic rewards.

2. Related Work

HRL is a prominent branch of DRL methodologies. It involves structuring the decision-making entity into multiple layers, with each layer responsible for different types of tasks. To achieve the ultimate goal, tasks are also decomposed layer by layer. There are two primary approaches to task decomposition: one based on options and the other based on subgoals. This paper adopts the subgoal-based task decomposition approach, which involves breaking down the ultimate objective into a series of simple goals, ultimately leading to the realization of the final target. The selection of subgoals is crucial in the subgoal-based task decomposition method. Many existing algorithms require the manual setting of subgoals, which necessitates a priori knowledge of the environment. Different subgoals are suitable for different task environments, and a change in the environment requires a reset of the subgoals. This approach can significantly diminish the generalization capabilities of DRL methods. Therefore, subgoal-based methods must evolve towards automatic subgoal setting. The following sections delve into the development of hierarchical reinforcement learning and the progression of subgoal-based hierarchical reinforcement learning.
Kulkarni et al. [16] introduced the hierarchical deep reinforcement learning (HDQN) algorithm, which is based on the Deep Q-Network (DQN) algorithm. They were the first to demonstrate the superiority of a hierarchical structure over a single-layer structure in an electronic gaming environment, particularly in dealing with long-time series decision-making and sparse rewards. However, a notable drawback of HDQN is that its subgoals must be set manually, which is a significant limitation since designing subgoals for each environment using domain knowledge can be extremely time-consuming and can greatly reduce the generalization capabilities of DRL methods. Subsequently, Rafati et al. [17] addressed this deficiency by employing outlier detection and the K-means algorithm to design an unsupervised subgoal discovery method, which led to the development of the Unified HRL (UHRL) algorithm. This algorithm can be used as a general HRL algorithm. However, UHRL restricts subgoals to be selected from existing experiences, limiting the flexibility of subgoal selection and potentially discarding excellent potential subgoals. Additionally, its senior-level strategy uses the DQN algorithm, and the senior-level output actions are environmental states, which means it can only be applied to environments with discrete state spaces. Nachum et al. [18] proposed the HRL with an off-policy correction (HIRO) algorithm to address the instability issues that arise in off-policy training, allowing the senior-level strategy to utilize the experiences generated by the junior-level strategy. This algorithm was validated in an ant maze experiment, but it remains challenging to learn subgoals that are practical and meaningful. Li et al. [19] took a reverse approach, reasoning that if setting a subgoal to approach is feasible, then theoretically, setting a subgoal to avoid should also be feasible. 
They defined the states reached by the agent every n steps as anchors and used intrinsic rewards to encourage the agent to move away from these anchors, while extrinsic rewards guided the agent’s forward direction. Their experiments yielded promising results. Zhang et al. [20,21] further restricted subgoals to be within the neighborhood of the current state’s adjacent K steps, which played a guiding role for the agent. Levy et al. [22] extended the Hindsight Experience Replay (HER) method to a hierarchical framework and proposed a heuristic algorithm called Hierarchical Actor-Critic (HAC). They designed three methods: hindsight action transitions, hindsight goal transitions, and subgoal testing transitions, which completely eliminated the instability issues caused by off-policy training and significantly improved the algorithm’s exploration capabilities. HAC currently stands out as the best-performing HRL algorithm in experiments. Building upon HAC, researchers have incorporated automatic curriculum learning to form ACGHRL [23] (Automatic Curriculum Generation by HRL) and introduced curiosity to create CHAC [24] (Curious Hierarchical Actor-Critic). However, both of these modifications have shown limited improvements in training speed and effectiveness. HRL algorithms continue to evolve in the direction of enhancing generalization capabilities, and a substantial number of scholars have dedicated themselves to the research and application of subgoal-based hierarchical methods.
Pateria et al. [25] introduced the LIDOSS (hierarchical reinforcement learning with integrated discovery of salient subgoals) algorithm, an end-to-end HRL method that narrows the search space for subgoals by focusing on those that are both proximal to the goal state and have a high probability of occurrence. Experimental results demonstrate that LIDOSS outperforms HAC in most tasks. Wang et al. [26] proposed two requirements for subgoal generation: subgoals should be achievable by the junior-level policy with relatively little effort, and their achievement should facilitate the attainment of the ultimate goal. These requirements enable subgoals to provide more efficient guidance to the junior-level strategy. Nicholaus et al. [27] presented a potential-based subgoal generation method that automatically generates valuable subgoals using past observations from trajectories and prioritizes them using a designed potential function for the higher-level policy to choose. Xin et al. [28] introduced an Active Subgoal Generation Strategy, incorporating the HER mechanism into the higher-level policy and proposing two measures, novelty and coverage, to effectively address the sparse reward problem in complex tasks. Guertler and colleagues [29] proposed hierarchical reinforcement learning with timed subgoals (HiTS), which not only specifies subgoal states but also the time required to reach them. Subsequent scholars have introduced various subgoal selection and design methods tailored to different application scenarios [30,31,32]. Subgoal design methods fall into two categories: those designed based on task requirements, which offer strong controllability, and those that select useful information from existing data as subgoals, which facilitate the learning of effective strategies by junior-level networks due to their origin from existing information. Each approach has its own merits and limitations. 
We believe that a combination of both methods can leverage their respective advantages. Therefore, we have designed a method where subgoals are derived from both existing and non-existing information, with detailed design insights provided in Section 3.3.

3. Hierarchical Intelligent Decision-Making Scheme for Joint Operations of Sea–Air Unmanned Systems

In this section, we propose an HRL intelligent decision-making scheme to address the challenges of decision-making in joint operations of sea–air unmanned systems, and we design a subgoal-based HRL intelligent decision-making framework. As illustrated in Figure 1, the framework primarily consists of the environment, senior- and junior-level experience pools, and the corresponding Actor and Critic components for each level. Empirical evidence suggests that employing an off-policy algorithm for the senior-level strategy and an on-policy algorithm for the junior-level strategy yields superior performance in complex environments. The off-policy algorithm can effectively utilize a diverse range of strategic experiences, while the on-policy algorithm ensures precise execution of specific actions, thereby guaranteeing the stability and accuracy of junior-level strategy learning. Given the sample efficiency, simplicity, and flexibility that the Deep Deterministic Policy Gradient (DDPG) algorithm has demonstrated in both continuous and discrete tasks, surpassing off-policy algorithms such as DQN, TD3 (Twin Delayed Deep Deterministic Policy Gradient), and SAC (Soft Actor–Critic) in achieving stable and efficient exploration in complex tasks, the senior-level strategy employs the DDPG algorithm enhanced with the prioritized experience replay (PER) technique to design subgoals. The PPO (Proximal Policy Optimization) algorithm, on the other hand, allows multiple gradient updates on a batch of experience data, giving it a sample efficiency exceeding that of traditional on-policy algorithms. Additionally, its stable policy update mechanism and robustness to hyperparameters make PPO resilient in accomplishing subgoals, making it suitable for the junior-level strategy to realize them.
The upper-level strategy takes environmental observations as input and, through learning, outputs subgoals, which are the desired environmental observations that the junior-level strategy is instructed to achieve through its actions. The junior-level strategy takes both environmental observations and subgoals as input, producing actions that can be executed in the environment. This design allows for the implementation of an end-to-end HRL intelligent decision-making method for joint operations by simply defining the basic environmental actions.
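The two-level control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `SeniorActor`, `JuniorActor`, and the environment interface are hypothetical placeholders, with simple linear policies standing in for the real networks.

```python
import numpy as np

class SeniorActor:
    """Maps an environmental observation h_t to a subgoal g_t (a desired observation)."""
    def __init__(self, obs_dim, goal_dim, noise_scale=0.1):
        self.W = np.random.randn(goal_dim, obs_dim) * 0.01  # placeholder linear policy
        self.noise_scale = noise_scale

    def select_subgoal(self, h):
        g = self.W @ h                                       # deterministic policy mu(h)
        return g + self.noise_scale * np.random.randn(*g.shape)  # exploration noise

class JuniorActor:
    """Maps (observation, subgoal) to a primitive action executable in the environment."""
    def __init__(self, obs_dim, goal_dim, act_dim):
        self.W = np.random.randn(act_dim, obs_dim + goal_dim) * 0.01

    def act(self, s, g):
        return np.tanh(self.W @ np.concatenate([s, g]))      # bounded continuous action

def senior_step(env, senior, junior, h, TL):
    """One senior-level step: the junior agent pursues the subgoal for up to TL env steps."""
    g = senior.select_subgoal(h)
    R = 0.0                       # cumulative extrinsic reward returned to the senior level
    s = h
    for _ in range(TL):
        a = junior.act(s, g)
        s, r, done = env.step(a)
        R += r
        if done:
            break
    return g, R, s
```

In the full scheme, the senior policy would be the DDPG Actor and the junior policy the PPO Actor; the sketch only shows the flow of observations, subgoals, and actions between the two levels.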

3.1. Subgoal-Based HRL Intelligent Decision-Making Framework for Joint Operations of Sea–Air Unmanned Systems

To explore an end-to-end approach for complex decision-making problems, we propose a hierarchical intelligent decision-making framework for joint operations of sea–air unmanned systems. This hierarchical scheme consists of two parts: The senior-level strategy selects the most promising subgoals from past experiences. When mapped to the practical operational domain, this entails the senior-level agent selecting scenario segments most conducive to the success of the joint operation as subgoals. The junior-level strategy interacts with the environment to learn strategies that approach these subgoals. When mapped to the practical operational domain, this involves the junior-level agent continuously countering adversaries within the environment, thereby refining its action strategy to progressively steer the current confrontational scenario closer to the most advantageous segment for achieving victory.
In the first part, the senior-level evaluation strategy network selects a subgoal g_t based on the environmental state h_t and, after BL interactions between the junior-level strategy and the environment, acquires the resulting state as well as the sum of all environmental rewards R_t, storing the experience ⟨h_t, g_t, R_t, h_{t+1}⟩ until the quantity reaches BS. Here, g_t represents the subgoal, which serves as a bridge connecting the senior-level and junior-level components, because subgoals function both as actions selected by the senior-level strategy and as components of the junior-level state representation. The task is accomplished through an iterative process in which the senior-level strategy continuously selects subgoals and the junior-level strategy continuously implements them. The senior-level Critic network learns from m groups of experiences sampled from the senior-level experience pool according to the potential value of subgoals, and updates the parameters of the senior-level Actor and Critic networks. It then updates the target networks of both the Actor and the Critic through soft updates, finally recalculating and updating the potential values of all experiences before starting the next iteration.
In the second part, the junior-level Actor network receives the subgoal output by the senior-level strategy, merges it with the environmental observation, and outputs an action a_t. Upon receiving the action, the environment transitions to the next state s_{t+1} and, based on whether the current state achieves the subgoal, outputs an intrinsic reward r_t and an environmental reward R_t. Once the quantity of experiences reaches BL, the learning process begins. The junior-level Actor calculates the probability ratio of the new and old strategies r_t(ϕ), while the junior-level Critic computes the advantage function Â_t to further calculate the TD-error of each experience. The junior-level Critic network is updated by minimizing the mean squared error, and the junior-level Actor updates its network parameters by maximizing the objective function, which is calculated from the probability ratio r_t(ϕ) and the advantage function Â_t. Afterward, the next iteration commences. The specific implementation details are illustrated in Figure 1.
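The junior-level update described above follows the standard PPO clipped objective; the function below is a simplified NumPy illustration of it (the clip range ε = 0.2 and the batch shapes are assumptions, not values taken from the paper).

```python
import numpy as np

def ppo_losses(logp_new, logp_old, advantages, values, returns, eps=0.2):
    """Clipped PPO surrogate for the junior-level Actor and MSE loss for the Critic.

    logp_new / logp_old: log-probabilities of the taken actions under the new/old policy
    advantages:          advantage estimates A_hat_t from the junior-level Critic
    values / returns:    V(s_t) predictions and empirical rewards-to-go
    """
    ratio = np.exp(logp_new - logp_old)                      # probability ratio r_t(phi)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    actor_loss = -np.mean(np.minimum(unclipped, clipped))    # negated clipped surrogate
    critic_loss = np.mean((values - returns) ** 2)           # mean-squared-error regression
    return actor_loss, critic_loss
```

Minimizing `actor_loss` (the negated surrogate) is equivalent to maximizing the clipped objective described in the text.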

3.2. A Solution for Sparse Rewards Through the Integration of Intrinsic and Extrinsic Incentives

We employ a hybrid approach that combines intrinsic and extrinsic rewards to address the challenge of sparse rewards. The rewards obtained from the agent’s interactions with the environment serve as extrinsic rewards, guiding the learning process of the senior-level agent, while intrinsic rewards are provided to facilitate the learning of the junior-level agent. For the senior-level agent, although traditional single-layer reinforcement learning may incorporate intermediate rewards, these are still sparse because environmental rewards depend on multi-step decisions. In a hierarchical structure, however, the senior-level agent makes a decision only after the junior-level agent has interacted with the environment multiple times. Consequently, the reward received by the senior-level agent is the cumulative environmental reward over BL steps, effectively increasing the frequency of extrinsic rewards by a factor of BL compared to single-layer structures. For the junior-level agent, every interaction with the environment yields a reward signal. After each action, the junior-level agent evaluates whether the resulting state aligns with the subgoal. If the state matches the subgoal, the agent receives a reward of +1; if not, it receives a reward of −1. This ensures that the junior-level agent receives a reward stimulus at every step. By decomposing the task into senior-level and junior-level components, our hierarchical structure reduces the difficulty of each layer’s task compared to a single-layer structure. Despite this, the frequency of rewards in each layer of the hierarchical structure is at least BL times higher than in the single-layer structure. Our approach thus effectively addresses the issue of sparse rewards from the perspective of reward frequency.
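The ±1 intrinsic reward rule can be written as a small function. The Euclidean distance test and the tolerance `tol` are our assumptions, since an exact match between a continuous observation and a subgoal is rarely attainable in practice.

```python
import numpy as np

def intrinsic_reward(state, subgoal, tol=0.1):
    """+1 if the reached state matches the subgoal (within tolerance), else -1."""
    dist = np.linalg.norm(np.asarray(state, dtype=float) - np.asarray(subgoal, dtype=float))
    return 1.0 if dist <= tol else -1.0
```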

3.3. Hierarchical Reinforcement Learning Algorithm with Two-Level Screening of Latent Subgoals

The rationale for guiding training with latent subgoals is twofold. Firstly, the vast action space renders the learning process exceptionally costly; the senior-level strategy, which already has a low learning frequency, would become exceedingly slow in exploration if the action space is too large. Secondly, aimless exploration by the senior-level strategy can lead the agent into a state of confusion. However, what constitutes an appropriate subgoal? In our method, a subgoal is defined as a set of environmental states, with suitable subgoals being transitional states between the current and the goal states. These transitional states are continually updated, progressively closer to the goal state, until they coincide with it.
We propose to select subgoals based on their high contribution value and infrequent selection. To effectively acquire the most promising subgoals, we employ a two-level filtering approach, screening subgoals layer by layer. The two-level filtering process involves using the criteria from the first screening to elevate the priority of contributive states, forming a primary subgoal pool of high-contributive states. The second screening then selects subgoals from this pool that have been infrequently sampled, creating a secondary subgoal pool. This layer-by-layer enhancement of subgoal pool priority ensures a rapid and orderly subgoal screening process. Additionally, to prevent the limitations of subgoals from leading the training towards local optima, we design noise functions for the two-level screening mechanism, allowing the majority of subgoal selections to originate from the most promising subgoals, with a small portion derived from unfiltered environmental states. Finally, we integrate this screening method into the prioritized experience replay mechanism of the scheme depicted in Figure 1, proposing a hierarchical reinforcement learning algorithm based on the two-level screening method for latent subgoals. The specific implementation details of the algorithm are provided in Algorithm 1.
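The two-level filtering with exploration noise can be sketched as follows, under illustrative assumptions: the pool sizes `k1` and `k2` and the noise rate `eps` are hypothetical parameters, and the paper's SumTree-based prioritized replay is replaced here by plain array sorting for clarity.

```python
import numpy as np

def two_level_screen(states, contributions, counts, k1=8, k2=4, eps=0.1, rng=None):
    """Two-level screening of latent subgoals (an illustrative sketch).

    Level 1: keep the k1 states with the highest contribution value (primary pool).
    Level 2: among those, keep the k2 least-often-selected states (secondary pool).
    With probability eps, return an unfiltered state instead, to avoid local optima.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < eps:                                 # exploration noise
        return states[rng.integers(len(states))]
    primary = np.argsort(contributions)[-k1:]              # prominent-contribution filter
    secondary = primary[np.argsort(counts[primary])[:k2]]  # repeat-penalty filter
    return states[rng.choice(secondary)]
```

The noise branch mirrors the design choice in the text: most subgoals come from the most promising candidates, while a small fraction is drawn from unfiltered environmental states.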
Here, we provide a detailed explanation of the key steps and parameters within Algorithm 1.
1. Subgoal selection: The selection of subgoals in hierarchical reinforcement learning is determined by the senior-level policy based on the global task objective. These subgoals typically consist of abstract concepts or intermediate tasks, which are then translated into specific actions for execution by the low-level policy.
2. Priority of each sample: The priority of each sample is determined by calculating the potential value of the current transition. A higher potential value corresponds to a higher priority. Consequently, the senior-level policy is more likely to select the transition of a higher potential value as a learning objective; conversely, a lower potential value results in a lower selection probability.
3. TD-error (Temporal-Difference Error): This is a fundamental metric in reinforcement learning used to quantify the discrepancy between the predicted value of the current state and the actual obtained return. It primarily serves to guide policy updates, thereby optimizing the learning process.
4. Advantage Estimates: Advantage estimation involves comparing the returns generated by the current policy against those of the average policy. This comparison is crucial for optimizing policy parameters and constitutes a core component of policy gradient optimization.
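The quantities in points 3 and 4 can be computed as follows; the one-step TD-error and a GAE-style advantage are standard formulations shown here for concreteness, not extracted from the authors' code.

```python
import numpy as np

def td_errors(rewards, values, next_values, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t): predicted value vs. observed return."""
    return rewards + gamma * next_values - values

def gae_advantages(deltas, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: discounted sum of future TD-errors."""
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```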
Algorithm 1. Hierarchical Reinforcement Learning Algorithm with Two-Level Screening of Latent Subgoals (HTS).
Randomly initialize the senior-level Critic network Q(h, g | θ^Q) and Actor network μ(h | θ^μ) with weights θ^Q and θ^μ.
Initialize the senior-level target networks Q′ and μ′ with θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ.
Initialize the senior-level replay buffer RB (a SumTree), setting the priority of all m leaf nodes to 1, i.e., P_j = 1 for all j. Set the maximum capacity of RB to N.
Initialize the junior-level value function parameters ϕ^Q and policy function parameters ϕ^μ.
For episode i = 1, …, M do (M is the number of episodes during which the agent interacts with the environment):
# The environment interaction commences.
  Initialize a stochastic process κ for action exploration.
  Receive an initial observation state h_1.
  For t = 1, …, T do (T denotes the number of high-level interaction steps):
   # The senior-level agent initiates its interaction with the environment.
    Select a subgoal g_t = μ(h_t | θ^μ) + κ_t based on the current policy and exploration noise.
    For step = 1, …, TL do (TL is the maximum number of low-level interaction steps per high-level interaction):
    # The junior-level agent initiates its interaction with the environment and begins its training.
      The junior-level agent interacts with the environment for up to TL steps using the policy a_t ∼ π_ϕ(a_t | s_t, g_t), or until the environment reaches the terminal state.
      Compute the intrinsic rewards and rewards-to-go r̂_t.
      Compute advantage estimates Â_t, Â_{t+1}, …, Â_{t+TL} based on the current value function V_{ϕ^Q}.
      Update the policy by taking K steps of minibatch SGD (Stochastic Gradient Descent):
       ϕ^μ_{k+1} = argmax_ϕ E_{τ∼π_B} [ Σ_{t=0}^{T} min( r_t(ϕ) Â_t, clip(r_t(ϕ), 1 − ε, 1 + ε) Â_t ) ]
      Update the value function by regression on the mean-squared error:
       ϕ^Q_{k+1} = argmin_ϕ E_{τ∼π_B} [ Σ_{t=0}^{T} ( V_ϕ(s_t) − r̂_t )² ]
   End For
   # The interaction at the junior level concludes, transitioning to the next interaction at the senior level.
   Compute the sum of external rewards R_t obtained by the junior-level agent over its TL interactions with the environment.
   Store the transition ⟨h_t, g_t, R_t, h_{t+1}⟩ in the senior-level replay buffer RB.
   If the size of the senior-level RB reaches the batch size BS:
   # When the number of transitions at the senior level meets the training criteria, the training of the senior-level policy is initiated.
      Sample a minibatch of m transitions ⟨h_i, g_i, R_i, h_{i+1}⟩ from RB, each sample drawn with probability P(i):
       P(i) = P_i^α / Σ_j P_j^α
      Compute the importance-sampling weight of each sample:
       w_i = (N · P(i))^{−β} / max_j w_j
      Calculate the target Q-value for each sample:
       y_i = R_i + γ Q′( h_{i+1}, argmax_{g_{i+1}} Q(h_{i+1}, g_{i+1}) )
      Update the senior-level Critic by minimizing the loss:
       L = (1/m) Σ_i w_i ( y_i − Q(h_i, g_i | θ^Q) )²
      Update the senior-level Actor policy using the sampled policy gradient:
       ∇_{θ^μ} J ≈ (1/m) Σ_i ∇_g Q(h, g | θ^Q)|_{h=h_i, g=μ(h_i)} ∇_{θ^μ} μ(h | θ^μ)|_{h_i}
      Update the senior-level target networks:
       θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},  θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
      Recompute and update the priority of each sample:
       P_i = CV_i + η · RP(n_i)
      where
       CV_i = λ ( R_i + γ Q′( h_{i+1}, argmax_{g_{i+1}} Q(h_{i+1}, g_{i+1}) ) − Q(h_i, g_i) ) + (1 − λ) R_i,  RP(n_i) = ln((n_i + 1)/n_i)
 End If
    # The current training phase for the senior-level policy concludes.
End For
   # Upon completion of the current episode’s training, the process transitions to the subsequent episode.
End For
   # The data and policy are saved following the conclusion of this training phase.

3.3.1. Prominent Contribution Detection

We aim to filter out subgoals from the subgoal pool that make a significant contribution to the learning process of the junior-level strategy. This prominence manifests in two aspects: firstly, the selected subgoal should represent a high-value state indicative of the current training achievements; secondly, it should have a positive guiding effect on subsequent training. Therefore, in this paper, we propose using both the Temporal-Difference error (TD-error) and the reward associated with the experience as indicators of contribution size. High reward values inherently signify high-value states, while the TD-error serves as a bridge connecting immediate rewards to long-term values. These two metrics not only serve as criteria for assessing high-contribution states but also orient the learning process for future endeavors. Together, they filter out the subgoals that are crucial for strategy learning and place them in the primary subgoal pool for potential selection by the senior-level strategy. When mapped to the practical operational domain, this process translates to identifying scenario segments characterized by both high immediate reward and high potential for future rewards, which are then designated as learning objectives.
The contribution value (CV) is calculated as follows:
$CV = \lambda \cdot TD + (1 - \lambda) \cdot R$
where TD represents the TD-error of the current transition and R denotes its reward. During training, the relative importance of the TD-error and the reward can be controlled by adjusting the weighting coefficient λ.
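As a minimal sketch of this weighting (the variable names and the value of λ are ours, purely illustrative):

```python
def contribution_value(td_error, reward, lam=0.6):
    """CV = lam * TD + (1 - lam) * R: lam trades off the long-term
    value signal (TD-error) against the immediate reward."""
    return lam * td_error + (1.0 - lam) * reward

# A transition with a large TD-error can score high even with a modest reward.
cv_long_term = contribution_value(td_error=2.0, reward=0.5, lam=0.9)
cv_immediate = contribution_value(td_error=2.0, reward=0.5, lam=0.1)
```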

3.3.2. Repeat Penalty

In the experience replay of off-policy algorithms, repeatedly sampling and learning from unimportant experiences can lead to overfitting and potentially trap the algorithm in a local optimum. Similarly, in HRL algorithms, if less-important subgoals are repeatedly utilized, the agent’s focus may become fixated on certain specific subgoals, neglecting others that might be more significant. This leads to the classic exploration-exploitation dilemma in DRL.
To prevent the repeated selection of subgoals and enhance the diversity of subgoal selection, most hierarchical algorithms incorporate penalty terms into the training reward, for example assigning a negative reward if the current subgoal is not reached and a positive reward if it is. This prevents the repeated utilization of undesirable subgoals but does not address beneficial subgoals being used many times over. In this paper, we design a penalty function RP(n). We add to each subgoal in the primary subgoal pool an additional attribute, $n_i$, that records the number of times the subgoal has been selected; it is incremented ($n_i \leftarrow n_i + 1$) every time the senior-level strategy selects that subgoal. The more frequently a subgoal is chosen, the harsher the penalty, reducing its probability of being selected in subsequent choices; RP(n) must therefore be a monotonically decreasing function. Mapped to the practical operational domain, this process selects the scenario segments that have undergone the fewest learning iterations as learning objectives, thereby mitigating redundant learning. The function is designed as follows:
$RP(n_i) = \frac{\ln n_i}{n_i - 1}$
where $n_i$ represents the number of times a subgoal has been selected.
The proof of the monotonic decreasing nature of the function is presented below:
1. We determine the monotonicity of the function by differentiation. Differentiating $RP(n_i)$ with respect to $n_i$ yields the following expression:
$RP'(n_i) = \frac{\frac{n_i - 1}{n_i} - \ln n_i}{(n_i - 1)^2} = \frac{1 - \frac{1}{n_i} - \ln n_i}{(n_i - 1)^2}$
2. In the equation above, since the denominator $(n_i - 1)^2$ is always positive, it suffices to analyze the sign of the numerator. Let $g(n_i) = 1 - \frac{1}{n_i} - \ln n_i$; differentiating $g(n_i)$ with respect to $n_i$ yields $g'(n_i) = \frac{1}{n_i^2} - \frac{1}{n_i} = \frac{1 - n_i}{n_i^2}$.
3. When $n_i > 1$, we have $1 - n_i < 0$ and $n_i^2 > 0$, so $g'(n_i) < 0$; consequently, $g(n_i)$ is monotonically decreasing on the interval $(1, +\infty)$.
4. Furthermore, given that $g(1) = 1 - \frac{1}{1} - \ln(1) = 0$ and that $g(n_i)$ is monotonically decreasing, it follows that $g(n_i) < g(1) = 0$ for all $n_i > 1$.
5. Since the numerator of $RP'(n_i)$ is always negative on $n_i > 1$ and the denominator is always positive, $RP'(n_i) < 0$ holds throughout the interval. Consequently, $RP(n_i)$ is monotonically decreasing on $(1, +\infty)$.
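The monotonic decrease proved above can also be checked numerically; a quick sketch (the range of selection counts is arbitrary):

```python
import math

def repeat_penalty(n):
    """RP(n) = ln(n) / (n - 1); defined for n > 1 and strictly decreasing."""
    return math.log(n) / (n - 1)

# Each additional selection of a subgoal yields a strictly smaller penalty weight.
penalties = [repeat_penalty(n) for n in range(2, 11)]
assert all(a > b for a, b in zip(penalties, penalties[1:]))
```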

3.3.3. Potential Calculation Function

By employing a two-level screening method for the subgoal pool, we can assess the potential of each latent subgoal, effectively ensuring the specificity and diversity of subgoal selection. The potential calculation function for subgoals is as follows:
$SC(e_i) = CV + \eta \cdot RP(n_i)$
where η is the penalty coefficient.
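Combining both screening criteria, the potential score and a greedy subgoal selection might look as follows. This is a sketch with hypothetical bookkeeping: the record fields, the handling of $n_i = 1$ (using the limit RP(n) → 1 as n → 1), and the values of η and CV are illustrative, not taken from the paper:

```python
import math

def potential(cv, n, eta=0.5):
    """SC(e_i) = CV + eta * RP(n_i): prominence plus repetition penalty."""
    rp = math.log(n) / (n - 1) if n > 1 else 1.0  # limit of RP(n) as n -> 1
    return cv + eta * rp

def select_subgoal(pool, eta=0.5):
    """Pick the highest-potential subgoal and update its selection count."""
    best = max(pool, key=lambda e: potential(e["cv"], e["n"], eta))
    best["n"] += 1  # n_i <- n_i + 1
    return best

# A slightly lower-CV but rarely used subgoal can win on overall potential.
pool = [{"id": 0, "cv": 0.8, "n": 5}, {"id": 1, "cv": 0.7, "n": 1}]
chosen = select_subgoal(pool)
```

The penalty term keeps the senior-level strategy from fixating on one high-CV subgoal, which is exactly the specificity/diversity trade-off described above.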

4. Experiment

In this section, we conduct a series of computer experiments to verify the application effectiveness of the method proposed in this paper for intelligent decision-making in joint operations of sea–air unmanned systems. We aim to answer the following questions: 1. Does the HTS algorithm outperform single-layer algorithms? 2. Does the two-level screening method for latent subgoals positively impact the strategy exploration efficiency of the intelligent decision-making method proposed in this paper? 3. Can the trained agent genuinely learn joint operations strategies? The experimental hardware includes a tower server equipped with an Intel® Xeon® Gold 6254 CPU with 72 cores and an Nvidia 3090Ti graphics card. The simulation environment utilizes a joint operations simulation platform. The scenario background is as follows: The Blue Force, within the blue boxes in Figure 2, acting as the offensive party, employs a rule-based tactical agent, whereas the Red Force, within the red box in Figure 2, in a defensive role, utilizes an intelligent decision-making agent developed through the research presented in this paper. The Blue Force has long been deliberately encroaching upon the two islands of the Red Force, necessitating the Red Force to deploy combined sea and air forces to implement strategic defense. The Blue Force’s objectives are to leverage integrated ground, surface, and aerial firepower to seize two islands from the Red Force and defend the two command posts thus acquired. The Red Force’s objective is to employ a comprehensive use of sea and air defense and support forces to defend their two islands and neutralize the two command posts captured by the Blue Force. The Blue Force deploys seven types of units, totaling 30 action elements, including manned bombers, unmanned early warning aircraft, fighters, unmanned destroyers, ground radars, surface-to-air missile battalions, and airports. 
The Red Force deploys eight types of units, totaling 45 action elements, including manned bombers, unmanned early warning aircraft, unmanned reconnaissance aircraft, unmanned electronic warfare aircraft, fighters, unmanned destroyers, ground radars, and airports. This complex composition of forces is designed to accomplish tasks such as reconnaissance, early warning, jamming, escort, air defense, and ground defense. Upon commencement of the experiment, the environment is automatically initialized with a random initial scenario. The senior-level agent of HTS selects a high-potential subgoal, which represents a high-value scenario conducive to achieving task success, based on the current environmental state. The junior-level agent then designs and executes specific actions in accordance with this high-value scenario and the current environmental state. The junior-level agent takes TL steps for every single step taken by the senior-level agent. The junior-level reward is binary, assigned as +1 or −1 depending on whether the subsequent environmental state reaches the designated high-value scenario, while the senior-level reward is the cumulative sum of all environmental rewards obtained over the TL steps. Once the agents have accumulated the required number of interaction steps, they independently commence learning, refining their respective policies based on their state–action pairs and rewards. Although the experimental validation in this paper employs only one sea–air joint operational scenario, our method is theoretically unrestricted in applicability, as it incorporates no scenario-specific modifications. Consequently, if the desired outcomes are achieved in this experimental scenario, similar results can be anticipated in other scenarios.
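The two-timescale interaction described above can be sketched as follows. `ToyEnv`, the lambda policies, and the value of `TL` are stand-ins for the paper's simulation platform and trained networks; only the loop structure and the reward split mirror the text:

```python
class ToyEnv:
    """Minimal stand-in environment: scalar state, episode ends after 20 steps."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        return self.t * 0.1, 1.0, self.t >= 20  # state, env reward, done
    def reached(self, state, subgoal):
        return state >= subgoal

def hierarchical_episode(env, senior_policy, junior_policy, TL=10):
    """Senior level picks a subgoal every TL junior steps; junior reward is
    binary (+1/-1 on reaching the subgoal), senior reward is the sum of
    environment rewards over the TL steps."""
    state, done = env.reset(), False
    while not done:
        subgoal = senior_policy(state)      # high-value target scenario
        senior_reward = 0.0
        for _ in range(TL):
            action = junior_policy(state, subgoal)
            state, env_reward, done = env.step(action)
            senior_reward += env_reward     # accumulated for the senior level
            junior_reward = 1.0 if env.reached(state, subgoal) else -1.0
            # (senior_reward and junior_reward would feed the two learners)
            if done:
                break
    return state

final_state = hierarchical_episode(ToyEnv(), lambda s: s + 0.5, lambda s, g: 0.0)
```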

4.1. Ablation Experiment Study

The HAC algorithm is one of the most representative state-of-the-art HRL algorithms currently available; this paper therefore adopts it as a benchmark for comparison. This part of the experiment employs four methods: HAC, the Hierarchical Actor–Critic algorithm; PPO, a single-layer Proximal Policy Optimization algorithm; HRL, the end-to-end intelligent decision-making scheme proposed in this paper without the two-level screening method for latent subgoals; and HTS, the proposed end-to-end intelligent decision-making scheme with the two-level screening method for latent subgoals. During training, the evaluation process automatically records the variation in training parameters. After a fixed number of training steps, we test the trained policies in the evaluation environment and record the assessment results, including the average reward and win rate. The policy evaluation results are statistically derived from five training runs of each method: the solid lines in the graphs represent the average reward and win rate across the five runs, with the shaded regions indicating the standard deviation.

4.1.1. Average Reward

The average reward comparison results from the ablation experiment are illustrated in Figure 3. We analyze the experimental results across three stages: early, mid, and late training. In the early stage, around 48,000 steps, HAC, HRL, and PPO all experience a sudden drop in reward, plummeting to negative values, with PPO exhibiting the most pronounced decline. In contrast, HTS shows no decrease at this point, because it can robustly select high-value subgoals in the early stage to guide the update of junior-level policies. This clearly demonstrates that the two-level screening of latent subgoals significantly enhances the algorithm's initial exploration capability in complex environments. At around 64,000 steps, all four methods show a reward peak followed by a rapid decline; HAC, HRL, and PPO undergo two sudden drops, whereas HTS experiences only one. This indicates that during early environment exploration, HAC, HRL, and PPO are more prone to falling into significant pitfalls that depress the reward, whereas HTS is more robust. In the mid-stage, between 82,000 and 200,000 steps, the rewards of all four methods gradually increase. In the late stage, beyond 200,000 steps, reward growth stabilizes for all four methods, suggesting that the difficulty of environmental exploration decreases significantly in the later stages as the agent's comprehension improves. Calculations indicate that the final reward of HTS improved by 4.75% relative to HAC, 18.33% relative to HRL, and 34.64% relative to PPO; furthermore, HRL achieved a 13.29% reward improvement over the single-layer PPO method.
Overall, HTS exhibits the most rapid growth in the early stage, with minimal differences among the four methods in the mid and late stages. The final reward comparison yields the order HTS > HAC > HRL > PPO, and the same ordering holds when considering the stability of reward growth. These experimental findings demonstrate that HTS surpasses HAC, HRL, and PPO in environment exploration and learning capability, providing preliminary answers to the first two research questions.

4.1.2. Average Win Rates

While the magnitude of rewards reflects the agent's ability to learn a reward-maximizing strategy, it does not necessarily reflect the agent's ability to secure a victory. It is therefore necessary to analyze the average win rate comparison results to determine the agent's winning ability and thereby evaluate the performance of the algorithm.
The win rate comparison results from the ablation experiment are presented in Figure 4. We analyze the experimental results in two stages: early and late training. In the early stage, before 144,000 steps, HAC, HTS, and HRL grow rapidly to their peak win rates, with HTS exhibiting a significantly faster growth rate than HAC and HRL, whereas the win rate of PPO only fluctuates slightly without a pronounced increase. This indicates that the hierarchical structure significantly enhances the agent's exploration capability in the joint operation environment, and that the two-level subgoal screening method provides a further noticeable improvement, primarily by advancing the onset of win rate growth. In the later stage, after 144,000 steps, the win rates of HTS, HAC, and HRL stabilize, while PPO begins to increase and eventually stabilizes. This demonstrates that HTS outperforms HAC, HRL, and PPO in intelligent decision-making for joint operations. Calculations indicate that the final win rate of HTS improved by 13.48% relative to HAC, 16.92% relative to HRL, and 28.97% relative to PPO; furthermore, HRL achieved a 10.22% win rate improvement over the single-layer PPO method.
Overall, HTS, HAC, and HRL exhibit significantly higher learning efficiency than PPO. The final win rate comparison yields the same order as the reward comparison: HTS > HAC > HRL > PPO. From the analysis of both rewards and win rates, two conclusions can be drawn. Firstly, the hierarchical structure outperforms the single-layer structure, a trend most pronounced in the early and late stages of the win rate analysis. Secondly, the two-level subgoal screening method notably enhances the performance of the hierarchical decision-making approach for joint operations. A comparison of the peak values and final results of rewards and win rates across the four methods indicates that the agent's early-stage exploration efficiency largely determines the ultimate result. The two-level subgoal screening method effectively guides the agent's training by selecting the most promising subgoals, which substantially enhances early-stage exploration efficiency and, consequently, significantly improves overall decision-making capability.

4.2. Behavioral Research

This section discusses the joint response strategies that emerged during training. The analysis serves an auxiliary function: validating whether the method can train agents to understand and apply joint response strategies.
Cooperative Reconnaissance and Defense Strategies: The air force early warning aircraft and two naval destroyers advance together for cooperative reconnaissance, detecting enemy aerial and maritime targets and sharing the reconnaissance results with all friendly units, which provides operational information for every action unit. The scenario of cooperative reconnaissance is shown in Figure 5, Figure 6 and Figure 7. In the mission to defend our islands, the coordinates of the enemy command posts, the enemy's status information, and the coordinates of some enemy action units used by our manned bombers are all provided by our sea–air cooperative reconnaissance units.
Additionally, as depicted in Figure 5, Figure 6 and Figure 7, the deployment of the Red Force is divided into three waves: first, an early warning aircraft (white box) and destroyers (red box) advance for reconnaissance; then fighters are dispatched for area patrol (yellow box); and finally, manned bombers and unmanned escort fighter formations follow (green box). Their tasks can be inferred as reconnaissance of enemy aerial and maritime activities, defense of the airspace, and neutralization of the two command posts captured by the Blue Force, respectively. This sequence demonstrates that the agent has learned the basic order of deployment through trial and error. Moreover, the white dot in the white box of Figure 7 marks the scene where the Red Force's early warning aircraft is shot down after advancing too far. This indicates that while the agent has learned to utilize the early warning aircraft for reconnaissance, it has not yet grasped the significance of such aircraft or how to protect critical units. It is noteworthy that the neat echelon deployment was observed exclusively in HTS and HRL, not in PPO. This highlights the strong task decomposition capability of hierarchical structures, which break a complex task into multiple layers of subtasks and address them in stages. For instance, while the PPO model can, at most, learn to coordinate the operations of early warning aircraft, fighters, and manned bombers to accomplish a task, HTS and HRL are capable of decomposing the entire operational mission into subtasks such as reconnaissance, airspace defense, and command post neutralization.
These subtasks are then, respectively, executed by the formations of early warning aircraft, fighters, and manned bombers, with the flexibility to adjust the deployment sequence and timing of these formations according to the demands of the three subtasks.
Formation-Cooperative Defense Strategy: In Figure 8 and Figure 9, which depict the mission to neutralize the command posts captured by the Blue Force, the Red Force's formation, composed of fighters, manned bombers, and electronic jamming aircraft, collaborates to execute the task with a clear division of labor and objectives. Figure 8 presents the global operational map, while Figure 9 shows a magnified view of the formation within the red box in Figure 8. The manned bombers, Fleet No. 2362, are responsible for bombing the enemy command post, while the fighters and electronic jamming aircraft are tasked with protecting the manned bombers during the mission. The electronic jamming aircraft, No. 4976, jams enemy detection units and missiles launched by enemy fighters and destroyers to prevent them from hitting our targets, while the fighters, No. 4926 and No. 4152, launch missiles to eliminate enemy action units that pose a threat in the nearby airspace. In Figure 10, the destroyer, No. 1524, collaborates with the Red Force aircraft formation within the red box to defend the airspace. This demonstrates that the agent has learned formation-cooperative defense strategies. From Figure 9, it is evident that the bomber formation includes manned bombers, fighters, and electronic jamming aircraft. This specific formation design appeared with increasing frequency during the later stages of model training, suggesting that it is a subgoal identified by the high-level strategy. Although this formation state does not yield a high environmental reward, it exhibits a significant TD-error value; consequently, during subgoal selection, this high-value formation state is not overlooked despite its low environmental reward.
Furthermore, Figure 10 depicts a scenario in which destroyer and fighter formations collaborate to defend the airspace, leading to the downing of enemy aircraft and the attainment of a high reward, thereby constituting one of the subgoals. These two instances provide compelling evidence for the validity of the two metrics employed in the subgoal selection process outlined in this paper.
Manned–Unmanned Aerial Vehicle Cooperative Strategies: In Figure 11, it can be observed that the manned bombers within the red box are neutralizing the northern command post while a reconnaissance UAV, No. 1081, heads toward the forefront to survey the action results. In Figure 12a, the UAV's reconnaissance reveals that the northern command post has been destroyed; consequently, the Red Force promptly dispatches a formation of manned aircraft to neutralize the southern command post in Figure 12b. Since the manned aircraft formations cannot conduct close-range reconnaissance, they return to base after exhausting their missiles, unable to confirm mission accomplishment on site. Furthermore, with the later destruction of the early warning aircraft, remotely confirming mission accomplishment becomes impossible; a reconnaissance UAV is therefore deployed to confirm it on site. This demonstrates that the intelligent agent has learned to coordinate operations between manned and unmanned aircraft formations. Additionally, the sequences depicted in Figure 11 and Figure 12 substantiate the efficacy of the combined internal and external reward design proposed in this paper. The subgoal at this stage is to conduct close-range reconnaissance and observation of mission accomplishment. The internal reward mechanism issues positive rewards only when the junior-level strategy achieves the current subgoal, while the external environment reward imposes a significantly higher penalty for the loss of a manned aircraft than for the loss of an unmanned one. Since only fighters and unmanned aircraft are capable of executing such tasks, the agent, guided by the incentives and penalties of the internal and external rewards, ultimately learns to deploy unmanned aircraft for close-range reconnaissance missions.

5. Conclusions

Intelligent decision-making for joint operations simulation of sea–air unmanned systems poses significant challenges due to the complexity of its environmental logic and the vastness of its state–action space. This paper proposes a layered intelligent decision-making scheme for joint operations of sea–air unmanned systems, realizing end-to-end layered intelligent decision-making. Building on this scheme, a layered reinforcement learning algorithm is introduced that utilizes a two-level screening method for potential subgoals. This method represents an innovative subgoal-based HRL algorithm, with the subgoals produced by the potential subgoal screening method exhibiting superior guidance for junior-level strategies. The ablation study results demonstrate that the proposed HTS method effectively enhances the agent's exploration efficiency in complex environments, primarily evidenced by reward improvements of 4.75%, 18.33%, and 34.64% over HAC, HRL, and PPO, respectively, along with corresponding win rate increases of 13.48%, 16.92%, and 28.97%. Furthermore, the hierarchical structure proposed in this paper effectively elevates the agent's decision-making capability, notably reflected in the HRL method's 13.29% reward improvement and 10.22% win rate increase over the single-layer PPO method. Simulation experiments indicate that agents trained by this method can learn valuable joint operations coordination strategies. The paper's exploratory research on layered intelligent decision-making for joint operations simulation of sea–air unmanned systems demonstrates the tremendous application potential of HRL technology in complex scenarios such as joint operations of sea–air unmanned systems.
In subsequent extension research, exploring multi-agent decision-making is a promising research direction. Each action unit can be treated as an independent agent for learning, which enriches the action space. Investigating the competition and cooperation among agents can also give rise to intriguing strategies. However, this approach imposes higher demands on computational power. Preliminary estimations suggest that the hardware configuration should include a CPU with 32 or more cores, a GPU equivalent to or exceeding an Nvidia RTX 3090, and DDR5 RAM ranging from 256 GB to 1 TB.

Author Contributions

Conceptualization, C.L. and W.D.; methodology, C.L.; software, C.L. and M.C.; validation, C.L. and W.D.; formal analysis, C.L. and Y.L.; investigation, M.C. and Y.L.; resources, W.D. and L.H.; data curation, L.H. and M.C.; writing—original draft preparation, C.L.; writing—review and editing, L.H.; visualization, M.C. and Y.L.; supervision, W.D. and L.H.; project administration, L.H.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Natural Science Foundation of Shaanxi under Grant 2025JC-YBQN-842 and Natural Science Foundation of Shaanxi under Grant 2025JC-YBQN-840.

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available because the experimental platform used in this paper is internal software, and the organization that owns it requires that the experimental data be kept confidential. They are, however, available from the corresponding author upon reasonable request.

DURC Statement

Current research is limited to the academic study of intelligent decision-making on a simulation platform for joint operations, which contributes to intelligent decision-making in sea–air unmanned systems and does not pose a threat to public health or national security. This paper presents an intelligent decision-making methodology for joint operations of sea–air unmanned systems from a theoretical research perspective. The objective of this methodology is to provide decision support for commanders, facilitating a human-in-the-loop decision-making process; consequently, the final authority to adopt the computed decision scheme rests with the human commander. Only bombers are equipped with lethal capabilities within this study, and bomber operations are executed by human operators, thereby fully mitigating potential ethical and legal concerns associated with the use of lethal force. Furthermore, the simulation experiments conducted in this paper focus on defensive aspects, specifically investigating defensive intelligent decision-making, which precludes any potential for offensive military applications. In summary, this research does not contravene the principles that govern the use of unmanned systems, including the principles of sovereignty, humanity, and proportionality, as well as the distinction between civilians and combatants, as stipulated in international law. The authors acknowledge the dual-use potential of research involving intelligent decision-making in joint operations and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, the authors strictly adhere to relevant national and international laws concerning DURC, and advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

  32. Xu, C.; Zhang, C.; Shi, Y.; Wang, R.; Duan, S.; Wan, Y.; Zhang, X. Subgoal-based Hierarchical Reinforcement Learning for Multi-Agent Collaboration. arXiv 2024, arXiv:2408.11416. [Google Scholar]
Figure 1. Implementation details of the hierarchical intelligent decision-making scheme for joint operations of sea–air unmanned systems.
Figure 2. Diagram of scenario background.
Figure 3. Average reward comparison in ablation experiments.
Figure 4. Average win rate comparison in ablation experiments.
Figure 5. At 7 min and 32 s, the air force early warning aircraft advances for reconnaissance.
Figure 6. At 10 min and 33 s, the navy’s Destroyer 1 advances for reconnaissance.
Figure 7. At 13 min and 20 s, the navy’s Destroyer 2 advances for reconnaissance.
Figure 8. Panoramic view of formation-cooperative defense of airspace.
Figure 9. Partial view of formation-cooperative defense of airspace.
Figure 10. The destroyer detects and repels enemy aircraft.
Figure 11. The UAV conducts a close-range reconnaissance mission.
Figure 12. (a) The UAV’s close-range reconnaissance mission is concluded. (b) A formation of manned aircraft is dispatched to neutralize the southern command post.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, C.; Dong, W.; He, L.; Cai, M.; Li, Y. A Hierarchical Reinforcement Learning Method for Intelligent Decision-Making in Joint Operations of Sea–Air Unmanned Systems. Drones 2025, 9, 596. https://doi.org/10.3390/drones9090596
