An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience

Li, Ke; Zhang, Kun; Wei, Ziqi; Piao, Haiyin; Yuan, Binlin; Wang, Boxuan; Cheng, Jiangbo

doi:10.3390/drones10020142

Open AccessArticle

An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience

by

Ke Li

¹

,

Kun Zhang

^1,*,

Ziqi Wei

²

,

Haiyin Piao

³,

Binlin Yuan

⁴,

Boxuan Wang

¹ and

Jiangbo Cheng

¹

School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, China

²

Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

³

Shenyang Aircraft Design and Research Institute, Shenyang 110035, China

⁴

Sichuan Tengden Sci-Tech Innovation Co., Ltd., Chengdu 610037, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(2), 142; https://doi.org/10.3390/drones10020142

Submission received: 15 January 2026 / Revised: 6 February 2026 / Accepted: 9 February 2026 / Published: 17 February 2026

(This article belongs to the Special Issue Advances in AI Large Models for Unmanned Aerial Vehicles)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Considering the partially observable nature of environments and the ensuing limited information for LRGDEs, we constructed Bi-LSTM-modified Policy Networks (BLPNs) to realize UAV-maneuvering decision-making, and this was modeled using POMDPs.
To maximize the latent utility of insufficient transitions with sparse reward, we developed the Adaptive Multi-Feature Evaluation Experience Replay (AMFER) method, integrating expert experience and domain knowledge, to reshape the sampling process from historical data.

What are the implications of the main findings?

The REDCRL, a novel deep reinforcement learning algorithm integrating BLPN and AMFER, is proposed to solve UAV-maneuvering decision-making problems experienced when using LRGDEs and overcomes the partial observability property of LRGDEs.
The REDCRL significantly accelerates the convergence of policy and enhances the performance of converged policy while training policy in complex environments with limited experience and sparse reward.

Abstract

Autonomous flight is a critical capability for unmanned aerial vehicles (UAVs), enabling applications in wildlife and plant protection, infrastructure inspection, search and rescue, and other complex missions. Although some learning-based methods have achieved considerable progress, traditional algorithms still struggle with real-world challenges, due to the partially observable nature of environments and limited experience regarding the properties of dynamic unknown environments where threats and targets are movable and unpredictable. To address these difficulties, it is necessary to achieve autonomous guidance for UAVs performing long-range missions in dynamic environments (LRGDEs), and to develop a novel end-to-end algorithm that can overcome partial observability under limited state transitions. In this paper, we propose an RNN-enhanced Diverse Curriculum-driven Learning Algorithm (REDCRL) based on deep reinforcement learning. We modify the structure of traditional actor–critic networks and introduce Bi-LSTM into policy networks (referred to as Bi-LSTM-modified Policy Networks (BLPNs)) to alleviate observation incompleteness. Furthermore, to fully exploit the potential value of data and mitigate the problem of insufficient samples, we develop an Adaptive Multi-Feature Evaluation Experience Replay (AMFER) method to reshape the process of experience replay buffer construction and sampling. In addition, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is adopted to optimize UAV-maneuver decision policies. Compared with traditional algorithms, the proposed algorithm can accelerate policy convergence and improve the performance of the trained policy.

Keywords:

unmanned aerial vehicles; decision-making; autonomous flight; deep reinforcement learning; partially observable; curriculum learning

1. Introduction

With the rapid development of UAV technology, UAVs have been widely adopted to assist humans in performing mundane, repetitive, and hazardous tasks, owing to their low cost, zero crew risk, high flexibility, and ease of upgrading [1]. For example, UAVs have been widely applied in aerial surveillance, aerial mapping, agricultural monitoring, environmental monitoring, logistics, delivery, search and rescue, communication relay, and other missions across diverse engineering fields [2,3,4]. Among these applications, a critical engineering challenge for UAVs, termed long-range UAV Guidance in Dynamic Environments (LRGDEs), demands urgent resolution as it plays a pivotal role in enabling efficient task execution. During LRGDE missions, a UAV is required to depart from a starting point, navigate to a target destination, and avoid obstacles or threats using airborne sensor data. These obstacles and threats include both mobile and stationary objects. Furthermore, the problem becomes significantly more complex when the target point is rigidly attached to a moving vehicle. Figure 1 illustrates the process of a UAV performing an LRGDE mission. To address this problem, researchers have achieved notable progress toward the challenge described above.

The traditional solution for LRGDEs is to employ path-planning algorithms to obtain the optimal path from the start position to the target position [5] and to design a controller using trajectory-tracking algorithms to guide the UAV and follow the predefined path. To date, various path-planning algorithms have been proposed to generate optimal paths, such as visibility graphs [6]; randomly sampling search algorithms including rapidly exploring random tree [7]; probabilistic roadmaps [8]; heuristic algorithms including a-star [9], sparse a-star [10], and d-star [11]; biologically inspired optimization algorithms including genetic algorithms [12] and sand cat swarm optimization [13]; and so on. Subsequently, various trajectory-tracking algorithms have been developed to design controllers that guide UAVs to track the optimal path [14]. Although such solutions can effectively address the aforementioned problem, they suffer from inherent drawbacks and practical limitations. First, obtaining detailed environmental obstacle information (e.g., mountains, no-fly zones, and other threats) is highly challenging, which inherently limits the applicability of path-planning algorithms. Second, the above scheme lacks sufficient flexibility to adjust paths in real time when unexpected threats arise. Third, traditional path-planning algorithms struggle to generate feasible trajectories in the presence of moving targets. Fourth, conventional path-planning algorithms typically require excessive computation time to derive optimal solutions, making them unsuitable for real-time applications. Therefore, to achieve high-quality LRGDEs, it is essential to endow UAVs with fully autonomous flight capability, in which the UAV’s control variables are computed in real time by the algorithm, and the decision-making foundation consists of the UAV’s flight state, observed environmental information from airborne sensors, and predefined mission objectives. To realize autonomous flight for UAVs, a critical engineering problem must be solved—namely, UAV-maneuvering decision-making in LRGDEs—where the UAV’s control variables must be determined periodically based on all available information and the scheduled mission objective. Due to the Markovian nature of UAV-maneuvering decision-making in LRGDEs, some researchers have modeled this problem using Markov Decision Processes (MDPs) and proposed various algorithms based on reinforcement learning (RL) [15,16,17,18]. When applied to the UAV-maneuvering decision-making problem in LRGDEs, RL algorithms can reduce reliance on prior environmental information due to their model-free nature. Meanwhile, since deep reinforcement learning (DRL) algorithms inherently possess end-to-end characteristics, policies can be generated rapidly based on real-time observations of the surrounding environment, enabling the algorithms to address challenges posed by unknown or unexpected threats and satisfy strict real-time decision-making requirements.

When applying RL to solve the UAV-maneuvering decision-making problem in LRGDEs, a key challenge remains that limits RL performance in LRGDE tasks. During policy training for LRGDEs, the decision-making basis at each step mainly relies on observations from airborne sensors, such as radar, electro-optical sensors, and other active or passive sensors. For passive sensors such as optical sensors, only high-precision relative angular information of the target can be obtained, which leads to severe information dimensionality reduction. Even if the relative distance of the target can be inferred through algorithms, data at such a precision level is hardly sufficient to support decision-making. For active sensors such as radar, the relative distance of the target can be acquired, but high-precision velocity information cannot be obtained simultaneously due to the inherent Doppler ambiguity of radar. Thus, owing to incomplete state information from airborne sensors in LRGDEs, to apply RL to LRGDEs, the UAV-maneuvering decision-making problem in LRGDEs should be modeled based on Partially Observable Markov Decision Processes (POMDPs) rather than MDPs. One of the most well-known algorithms is the Monte Carlo Tree Search (MCTS) algorithm, which serves as the core technique of AlphaZero developed by Google DeepMind [19]. Based on MCTS, several researchers have proposed various MCTS variants for POMDPs, such as POMCP [20], DESPOT [21], and other improved MCTS-based algorithms. Although these modified MCTS-based algorithms show effectiveness in solving POMDPs, their excessive computational resource consumption and high time intensity significantly hinder practical engineering applications, since the belief state distribution is estimated using massive samples via particle filters, and the optimal solution is computed through planning methods. In addition, another category of modified algorithms based on DRL leverages the advantages of Recurrent Neural Networks (RNNs) in processing time-series data and estimates the missing dimensions in observation information. Since the problem to be solved is not highly complex, observations such as images, signals, and other forms of information are stacked along the time dimension, and temporal sequences are used to support policy decision-making; this approach is adopted in the input processing of DQN [22], DDPG [23], TD3 [24], PPO [25], SAC [26], and other DRL algorithms. As the number of missing dimensions increases, problems must be modeled as POMDPs, and such simple tricks can no longer solve the problem effectively. Therefore, several researchers have proposed various modified DRL algorithms based on RNNs, such as G-IPOMDP-PPO [27], CBC-TP Net [28], Fast-RDPG [29], and FRSVG(0) [30], among others. All the aforementioned methods introduce RNNs to improve traditional DRL algorithms for solving POMDPs, where RNNs are used to eliminate the impact caused by insufficient state information dimensions. The modified DRL algorithms can overcome the drawbacks of MCTS-based algorithms and satisfy real-time decision-making requirements. Therefore, to better address the UAV-maneuvering decision-making problem in LRGDEs, it is crucial to model LRGDEs based on POMDPs and improve policy networks using RNNs.

Aside from the partial observability in LRGDEs, two additional challenges arise: limited experience and sparse reward. Owing to the complex dynamics of LRGDEs, each simulation episode is time-consuming. Compared with standard MDPs, fewer state transitions are generated, resulting in limited experience that impedes policy convergence. Meanwhile, obtaining effective transitions with clear rewards requires extensive interactions between the agent and the environment, leading to slow convergence and unsatisfactory performance. This well-known challenge in policy training is referred to as sparse reward. To address these issues, previous studies have achieved promising results using methods such as uniform experience replay (UER) [31] and prioritized experience replay (PER) [32]. When applying UER and PER, before policy training using historical data, it is necessary to determine the sampling probability of each transition and sample a batch of transitions accordingly. This procedure is known as transition selection, in which transitions are evaluated from multiple perspectives. Based on UER and PER, various improved experience replay (ER) methods have been proposed, such as HER [33], DCRL [34], ERO [35], CHER [36], ACER [37], among others. These methods differ from traditional ER in their transition evaluation strategies. Although such modified algorithms enhance the sampling efficiency and overall performance of agents during policy training, relying on only a single criterion to evaluate sample priority, such as TD-error, mission goal, cumulative reward, or sample diversity, is insufficient. To improve the sampling efficiency of RL, it is essential to evaluate transitions using multiple features. In addition to transition selection, adjusting the agent’s training process is critical for mitigating limited experience and sparse reward in MDPs. Operating at a higher level than transition selection, reshaping the transitions generated during policy training is beneficial; this is referred to as transition modification. Instead of directly modifying transitions, the timing of experience collection can be adjusted by designing intermediate task milestones, i.e., learning task objectives from simple to difficult. Accordingly, several researchers have structured the policy training process using task curricula and proposed corresponding algorithms, such as PCCL [38], NavACL [39], CURROT [40], CDRL [41], SCG [42], and other curriculum-based DRL algorithms [43]. These studies demonstrate that deliberate scheduling of the training process can significantly improve the sampling efficiency of RL and accelerate policy convergence. For both ER and the training process of RL algorithms, curriculum learning (CL), which has been widely discussed above, provides valuable guidance for our work. Inspired by human education and animal training, researchers in deep learning (DL) initially proposed CL to accelerate neural network training [44]. Subsequently, CL has been adopted to enhance traditional RL algorithms with the goal of accelerating policy convergence [45]. Although many algorithms have been developed to improve transition modification, manually adjusting task goals remains inadequate. Focusing on LRGDEs, this study aims to overcome the limitations of traditional methods caused by limited experience and sparse reward: (1) reliance on a single transition evaluation criterion, (2) inflexible and manually designed training processes. By integrating advances in both transition selection and transition modification, we seek to improve the ER mechanism while reshaping the agent’s training process based on curriculum learning.

To effectively execute LRGDE missions, this work addresses the aforementioned challenges and focuses on UAV-maneuvering decision-making for LRGDEs. The main contributions of this paper are summarized as follows:

(a): To tackle the partial observability challenge in LRGDEs, we redesign the structure of traditional actor–critic networks and construct a Bi-LSTM-based Policy Network (BLPN). During decision-making, historical observation sequences are vectorized and fed into the BLPN to mitigate the curse of dimensionality in LRGDEs.
(b): To fully exploit the latent value of limited transitions under sparse rewards, we propose an Adaptive Multi-Feature Evaluation Experience Replay (AMFER) method. This method integrates an adaptive dynamic termination (ADT) mechanism and a multi-feature transition evaluation (MFTE) model to reshape the policy training process, fully unlocking the potential value of data via a “from easy to difficult” learning paradigm.
(c): We further propose the REDCRL algorithm, which integrates the proposed BLPN and AMFER. The effectiveness of REDCRL is verified through extensive simulation experiments and comparisons with conventional DRL algorithms. Experimental results and analyses demonstrate that REDCRL significantly accelerates policy convergence and improves the performance of the trained policies.

2. Related Work

In this section, we review the SOTA on DRL for UAVs, RL for POMDPs, and CL for RL.

2.1. DRL for UAVs

In 2015, DeepMind proposed the Deep Q-Network (DQN) algorithm in Nature, which achieved performance comparable to top human players in Atari games [22]. Owing to its outstanding performance and end-to-end characteristics, it has attracted researchers from various disciplines to explore its engineering applications. Meanwhile, the Deep Deterministic Policy Gradient (DDPG) algorithm was introduced to address the dimensional explosion caused by continuous action and state spaces [23]. Furthermore, to mitigate the overestimation bias inherent in DQN and DDPG training, the TD3 algorithm was developed. It integrates clipped double-Q learning, target policy smoothing, and delayed policy update to alleviate overestimation bias arising from the Bellman equation and function approximation [24].

Based on the aforementioned algorithmic breakthroughs, various DRL methods have been applied across diverse research fields, particularly to UAV-maneuvering decision-making problems. Zhang et al. proposed a UAV autonomous maneuver decision-making algorithm for route guidance based on double Q-learning with prioritized experience replay. They formulated the UAV-maneuver decision-making problem in the route guidance task and established a corresponding decision-making model based on MDPs [16]. Li et al. presented a DDPG-based UAV-maneuver decision-making algorithm with PER for autonomous airdrop missions. They defined two key problems, namely the turn-round problem and the guidance problem, in autonomous airdrop scenarios and constructed the corresponding decision-making model [15]. Zijian et al. introduced a novel double-screening sampling strategy and developed the Relevant Experience Learning DDPG (REL-DDPG) algorithm inspired by human learning, which further enhances the influence of the learning process on action selection in the current state [44]. Kaifang et al. investigated the motion control problem of UAVs navigating autonomously through uncertain environments. They proposed a DRL-based motion control method that integrates two difference-amplifying strategies with traditional DRL algorithms, employing an improved Lyapunov guidance vector field approach to track waypoints generated by the DRL framework [18]. Hu et al. developed an advanced DRL method for safe autonomous motion control in complex unknown environments. They introduced asynchronous curriculum experience replay to overcome the limitations of PER, which uses multithreading to update priorities asynchronously and assigns reasonable priorities to improve experience diversity [37].

Based on the above review, DRL can be employed to address UAV-maneuvering decision-making in LRGDEs and overcome the drawbacks of traditional methods such as path planning and trajectory tracking. However, conventional DRL algorithms still struggle to effectively handle the partial state observability challenge in LRGDEs. In this paper, we enhance traditional DRL algorithms to better utilize partially observable state information in LRGDEs.

2.2. RL for POMDPs

In 2018, Google DeepMind proposed AlphaZero, an intelligent agent capable of playing Chess, Go, and Shogi [19]. AlphaZero integrates deep neural networks (DNNs) with MCTS, enabling it to learn from scratch without relying on human expertise beyond basic rules and ultimately achieving superhuman performance. Silver et al. proposed Partially Observable Monte Carlo Planning (POMCP) [20], which features two key components: Monte Carlo sampling and a black-box simulator. POMCP is the first general-purpose planner that achieves high performance in large, unfactored POMDPs. Somani et al. introduced the Regularized Determinized Sparse Partially Observable Tree (R-DESPOT) [21], which employs randomized scenario sampling and regularization to balance policy value and complexity within a compact tree structure. This method alleviates the curse of dimensionality and enables efficient real-time decision-making in high-dimensional POMDPs. Zheng et al. proposed PPO for POMDP with Guidelines under Dense Reward (G-IPOMDP-PPO) [27], which incorporates image-state observations and a dense reward function. This method mitigates decision-making uncertainty and inefficiency in complex uncertain environments and solves multi-ship collision-avoidance problems. Zhang et al. developed vectorized actor–critic networks based on the Coronally Bidirectionally Coordinated with Target Prediction Network (CBC-TP Net) [28]. This approach integrates a vectorized extension of MADDPG for policy training, ensuring robust performance in both normal and anti-damage scenarios for multi-UAV pursuit–evasion games in obstacle-rich environments. Wang et al. proposed Fast-RDPG within an actor–critic framework. They constructed actor and critic networks using RNNs and adopted online DRL to map sensory observations to UAV control commands. This method outperforms state-of-the-art (SOTA) approaches in large-scale complex environments where navigation is modeled as POMDPs [29]. Xue et al. introduced a DRL algorithm named FRSVG(0) which embeds RNNs to address partial observability. This method achieves safe UAV navigation in large-scale, complex unknown environments and yields improved policy performance over the RSVG(0) algorithm [30].

As summarized above, both MCTS-based algorithms and RNN-augmented DRL methods can solve POMDPs. However, MCTS-based algorithms suffer from excessive computational resource consumption and high time complexity, which severely limit their practical engineering applications, since belief state distributions are estimated via particle filters using massive samples, and optimal solutions are derived through planning procedures. Compared with MCTS-based methods, RNN-augmented DRL algorithms exploit the strengths of RNNs in modeling time-series data and estimating missing dimensions in observation information. These approaches mitigate the adverse effects of incomplete state information, avoid the drawbacks of MCTS-based methods, and satisfy real-time decision-making requirements. Therefore, we formulate the mathematical model for the UAV-maneuvering decision-making problem in LRGDEs and propose BLPN to address partial state observability in LRGDEs.

2.3. CL for RL

During policy training with DRL algorithms, balancing exploration and exploitation is critical for effective policy learning [31]. Exploration enables algorithms to discover new samples and acquire knowledge about the target problem. In contrast, exploitation allows algorithms to refine policies using historical transitions stored in the experience memory. If the proportion of exploration exceeds that of exploitation during training, policy convergence will slow down. Conversely, insufficient exploration may lead to the loss of valuable samples and cause the policy to become trapped in a local optimum. Therefore, balancing exploration and exploitation remains essential for DRL-based policy training. When AlphaGo gained worldwide attention, the significance of experience replay was recognized, and the PER method was proposed [32]. In PER, each transition in the experience memory is assigned a TD-error. Transition priorities are determined according to their TD-errors, and the sampling probability of each transition is based on its priority. As defined by PER, the core of experience replay methods lies in the calculation of sample priorities and the design of transition sampling strategies.

In human education, a structured curriculum is typically designed to guide learners through a sequential knowledge acquisition process. It starts with the introduction of foundational concepts, such as basic arithmetic or language skills. As learners advance, more complex concepts and skills are gradually introduced, building on previously acquired knowledge. This hierarchical approach enables a more systematic and efficient learning experience. Similarly, animal training often follows a step-by-step procedure. Trainers begin with simple behaviors or tasks that the animal can easily understand and perform. For example, teaching a dog to sit or a bird to perch. As the animal masters these basic tasks, more complex behaviors are introduced, gradually shaping its ability to perform sophisticated actions. Accordingly, some deep learning researchers proposed an algorithm to accelerate neural network training, termed CL [45]. This gradual learning paradigm inspired by CL demonstrates that a “from easy to difficult” strategy is essential for policy training, helping algorithms to fully absorb knowledge embedded in samples. Some RL researchers have also attempted to employ CL to restructure the policy training process, aiming to improve policy convergence speed [46].

In general, existing approaches for improving RL with CL mainly follow these principles: (1) designing curriculum sequences based on task objectives; (2) developing dynamic goals tailored to task characteristics; (3) creating a time-varying initial state space; (4) implementing task-specific transition sampling strategies. Luo et al. proposed a precision-based continuous curriculum learning (PCCL) method, which employs a decay function to dynamically adjust precision requirements during training, thereby enhancing training efficiency and performance, especially in sparse reward scenarios [38]. Klink et al. introduced a curriculum reinforcement learning (CRL) method named CURROT, which formulates curriculum generation as a constrained optimal transport problem and interpolates between task distributions using the Wasserstein distance, thus improving performance across various tasks [40]. Ma et al. presented a curriculum-based deep reinforcement learning (CDRL) approach for quantum control. This method constructs curricula with tasks defined by fidelity thresholds, enabling agents to learn in an easy-to-difficult order and transfer knowledge across tasks [41]. Xiao et al. proposed an end-to-end deep reinforcement learning framework augmented with curriculum learning and a novel Sim2Real transfer method. This framework enables quadrotors to navigate through narrow gaps by dividing training into two phases with gradually shrinking gap sizes, alleviating reward sparsity and the simulation-to-reality gap [43]. Hu et al. developed an asynchronous curriculum experience replay (ACER) framework that uses multithreaded asynchronous priority updates, a temporary experience pool, and a first-in-useless-out (FIUO) mechanism to enhance UAV autonomous motion control in unknown dynamic environments, achieving faster convergence and better final performance compared with TD3 [37].

As summarized above, these approaches can be categorized into transition selection methods and transition modification methods. Although transition selection methods achieve diversified experience replay by computing the sampling probability of each transition, and transition modification methods reshape the RL policy training process by generating subtask goals from a “from easy to difficult” perspective, relying on only a single criterion to evaluate sample priority and merely manually adjusting task goals remains inadequate. Therefore, this work aims to improve ER mechanism of RL algorithms while restructuring the agent’s training process based on CL, by comprehensively integrating advances in both transition selection and transition modification.

3. Problem Formulation

In this section, we present the definition of LRGDEs and derive the UAV-maneuvering decision-making problem from it. Thereby, we establish the UAV-maneuvering decision-making model for LRGDEs based on POMDPs.

3.1. Description of LRGDE

As mentioned above, we aim to address the UAV-maneuvering decision-making problem in LRGDE missions, where the UAV is required to fly from a starting point to a specific area and detect a moving target within that area, laying the foundation for subsequent targeting. In addition, the UAV must also be guided to avoid obstacles and threats within the area. Furthermore, we formulate the mathematical model for LRGDEs in detail.

As shown in Figure 2, the UAV departs from a random starting point, with the objective of flying around the moving target while avoiding real-time detected threats. In this paper, the flight state of UAV could be defined by position

{\vec{X}}_{UAV}

, azimuth

ψ_{UAV}

and velocity

{\vec{V}}_{UAV}

. Each surrounding

i

-th threat is characterized by position

{\vec{X}}_{thr}^{i}

, velocity

{\vec{V}}_{thr}^{i}

and influence radius

R_{thr}^{i}

. Threats typically include mountains, no-fly zones, mobile warning radars, patrol squads, and other movable or stationary entities. The target to be detected is defined by position

{\vec{X}}_{tgt}

, velocity

{\vec{V}}_{tgt}

and detection radius

R_{\det}

. The relative distance vector between the UAV and the target is

{\vec{D}}_{LOS}

, and the relative azimuth is

ψ_{LOS}

.

Moreover, the UAV can observe threats in its surroundings via airborne sensors, such as radar, LiDAR, and the Infrared Search and Track System (IRST), among others. Figure 3 illustrates the perception process of the UAV observing threats in its surroundings. In practice, only the threats in the forward direction of the UAV will affect the execution of specific tasks. Therefore, we consider incorporating the threats within the front hemisphere of the UAV’s forward direction into the state space, and the set of threats observable by the UAV is defined as

W_{thr}

. Furthermore, the set

W_{thr}

is divided into

W_{thr}^{l}

,

W_{thr}^{f}

and

W_{thr}^{r}

, based on the range of

δ_{ψ_{thr}}

. The

δ_{ψ_{thr}}

values of

W_{thr}^{l}

,

W_{thr}^{f}

and

W_{thr}^{r}

belong to the ranges of

(30^{\circ}, 90^{\circ}]

,

(- 30^{\circ}, 30^{\circ}]

and

[- 90^{\circ}, - 30^{\circ}]

. Subsequently, we can obtain the nearest threats on the left front side, directly in front, and on the right front side, which are used to support the formulation of rational decisions.

In addition, we define the failed termination condition as

‖{\vec{X}}_{UAV} - {\vec{X}}_{thr}^{i}‖ \leq R_{thrc}^{i}, i = 1, 2, \dots, N_{thr}

(1)

where

N_{thr}

indicates the number of threats and

R_{thrc}^{i}

indicates the cutoff radius of the threat. Equation (1) shows that if UAV flies into the cutoff area of any threat, the mission will be failed. In addition, we define the successful condition of LRGDEs as

‖{\vec{X}}_{UAV} - {\vec{X}}_{tgt}‖ \leq R_{\det}

(2)

which means the UAV finishes mission successfully when the target enters the UAV’s detection range

R_{\det}

.

3.2. UAV-Maneuvering Decision-Making Model for LRGDE Based on POMDPs

In this section, we model the UAV-maneuvering decision-making problem for LRGDEs using POMDPs, and define the state space, observation space, action space, and reward function as follows.

3.2.1. State Space, Observation Space and Action Space

(a): The definition of state space

Considering the definition of LRGDEs, the state space is defined as

S = \{v_{UAV}, h_{UAV}, n_{z}, d_{LOS}, δ_{ψ_{LOS}}, v_{tgt}, ψ_{tgt}, d_{thr}^{f}, δ_{ψ_{thr}}^{f}, v_{thr}^{f}, ψ_{thr}^{f}, d_{thr}^{l}, δ_{ψ_{thr}}^{l}, v_{thr}^{l}, ψ_{thr}^{l}, d_{thr}^{r}, δ_{ψ_{thr}}^{r}, v_{thr}^{r}, ψ_{thr}^{r}\}

(3)

where

v_{UAV}

,

h_{UAV}

and

n_{z}

are the UAV’s speed, height and steering overload.

d_{LOS}

and

δ_{ψ_{LOS}}

represent the distance and azimuth of required placement area with respect to the UAV.

v_{tgt}

and

ψ_{tgt}

represent the velocity and azimuth of the target.

d_{thr}^{f}

and

δ_{ψ_{thr}}^{f}

indicate the distance and azimuth of threat in front with respect to the UAV.

d_{thr}^{l}

and

δ_{ψ_{thr}}^{l}

are the distance and azimuth of threat on the left with respect to the UAV.

d_{thr}^{r}

and

δ_{ψ_{thr}}^{r}

are the distance and azimuth of threat on the right with respect to the UAV.

(b): The definition of observation space

Because of the existence of airborne sensors in the UAV, the observation space is defined as Equation (4) based on the definitions of state space and perception process in LRGDEs.

O = \{v_{UAV}, h_{UAV}, n_{z}, d_{LOS}, δ_{ψ_{LOS}}, d_{thr}^{f}, δ_{ψ_{thr}}^{f}, d_{thr}^{l}, δ_{ψ_{thr}}^{l}, d_{thr}^{r}, δ_{ψ_{thr}}^{r}\}

(4)

In contrast to state space, the velocity and azimuth of the target and all of the threats in the observation space is not present. The curse of dimensionality of observation space poses significant challenges to decision-making.

(c): The definition of action space

Based on the UAV kinematic model involved in the simulation environment for LRGDEs, we could establish the action space as

A (s) = \{n_{z}\}

(5)

where

n_{z}

is the steering overload of the UAV.

3.2.2. Knowledge-Enhanced Reward Model

Based on the definitions of various termination conditions for LRGDEs, the basic reward function for LRGDEs is defined as

R_{b} (s, a) = \{\begin{matrix} 10.0 & Successful Termination \\ 1.0 & Takeover Termination \\ - 1.0 & Failed Termination \\ 0.0 & Otherwise \end{matrix}

(6)

To address the issue of sparse rewards and accelerate policy convergence, we design a shaped reward function based on the reward shaping (RS) method by incorporating expert experience and domain knowledge. In this paper, the shaped reward function consists of a potential-based reward and an event-based reward, as defined in Equation (7).

\begin{matrix} F_{K} (s, a, s^{'}) = F_{P} (s, a, s^{'}) + F_{E} (s, a, s^{'}) \\ = \sum_{i} P_{i} (s, a, s^{'}) + \sum_{j} e_{j} (s, a, s^{'}) \end{matrix}

(7)

where

F_{P} (s, a, s^{'})

represents the potential-based reward,

F_{E} (s, a, s^{'})

indicates the event-based reward,

P_{i} (s, a, s^{'})

is the differential potential function of each factor, and

e_{j} (s)

indicates the event-based function of j-th event, which is defined as Equation (8).

e_{j} (s) = r_{j} \cdot e^{t}

(8)

where

t

is the elapsed period since the event occurred at present,

r_{j}

indicates the immediate reward of the j-th event. For example, we define an event-based function

e_{ψ_{LOS}} (s) = r_{ψ_{LOS}} \cdot e^{t}

, whose event is that the relative azimuth of target point with respect to the UAV is less than the maximum azimuth range of airborne sensors. When the event occurs firstly, agent could receive

r_{ψ_{LOS}}

. Then, the agent will obtain

r_{ψ_{LOS}} \cdot e

in the next decision-making period.

As shown in Equation (9), the differential potential function consists of the distance factor function, the azimuth factor function and the threats factor function.

F_{P} (s, a, s^{'}) = γ [Φ_{d_{LOS}} (s^{'}) + Φ_{ψ_{LOS}} (s^{'}) + \sum_{i = 0}^{2} Φ_{thr}^{i} (s^{'})] - [Φ_{d_{LOS}} (s) + Φ_{ψ_{LOS}} (s) + \sum_{i = 0}^{2} Φ_{thr}^{i} (s)]

(9)

where

Φ_{d_{LOS}} (s)

indicates the potential function about distance between UAV and target, and

Φ_{ψ_{LOS}} (s)

indicates the potential function about relative azimuth of target with respect to the UAV, and

Φ_{thr}^{i} (s), i = 0, 1, 2

represents the potential functions about the influence of the front threat, left threat and right threat, respectively.

Φ_{d_{LOS}} (s)

is defined as

Φ_{d_{LOS}} (s) = \frac{1}{2} \cdot {[f_{norm} (d_{LOS})]}^{2}

(10)

where

f_{norm} (x)

indicates the normalization function of parameter

x

. It is constructed as

f_{norm} (x) = \frac{x^{\max} - x}{x^{\max} - x^{\min}}

(11)

where

x

belongs to the interval

[x^{\min}, x^{\max}]

. Moreover,

Φ_{ψ_{LOS}} (s)

is defined as

Φ_{ψ_{LOS}} (s) = \frac{1}{2} {[f_{norm} (δ_{ψ_{LOS}})]}^{2}

(12)

And

Φ_{thr}^{i} (s), i = 0, 1, 2

could be defined as

Φ_{thr}^{i} (s) = \frac{1}{2} {[f_{norm} (d_{thr}^{i})]}^{2}

(13)

where

d_{thr}^{i}, i = 0, 1, 2

represents

d_{thr}^{f}

,

d_{thr}^{l}

and

d_{thr}^{r}

, respectively.

On the other hand, we define event-based reward functions based on important events, whose hyperparameters are defined in Table 1. The appearance condition indicates the conditions for determining whether it is effective, and

r_{j}

denotes the immediate reward. For LRGDEs, we define six events to construct the event-based reward function.

4. RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm

In this section, we present the structure of the REDCRL. First, we propose BLPN to address the UAV-maneuvering decision-making problem for LRGDEs. Second, we propose AMFER, which integrates an ADT mechanism and an MFTE model, to fully exploit the latent value of limited transitions under sparse reward conditions.

4.1. Structure of Algorithm

It is well known that DRL has been applied to solve numerous problems across diverse research fields, including Atari games, chess, robotics, and other decision-making and control problems. In this paper, we adopt the actor–critic architecture as the foundation of REDCRL, which features model-free and end-to-end characteristics and is well suited for UAV-maneuvering decision-making in LRGDEs. In particular, we employ Bi-LSTM to construct both the actor network and the critic network within the actor–critic architecture, leveraging Bi-LSTM’s strengths in processing temporal sequential data to mitigate the partial state observability problem in LRGDEs. On the other hand, to address the challenges of limited experience and sparse reward encountered when applying traditional DRL algorithms to LRGDEs, we introduce CL to reshape the policy training process. This includes a curriculum-based dynamic task objective and a comprehensive transition-evaluation experience replay method.

Figure 4 illustrates the overall framework of REDCRL, which consists of the BLPN, AMFER (incorporating an ADT mechanism and an MFTE model), and other necessary modules. While employing REDCRL to solve LRGDEs, the BLPN determines the current action

a_{t}

of the UAV, leveraging observations

o_{t}

and rewards

r_{t}

derived from the LRGDE environment. During each episode, the ADT mechanism in AMFER decides whether to terminate the episode using a dynamic task objective, which acts as the criterion for judging LRGDE termination conditions. Furthermore, each transition

(o_{t}, a_{t}, r_{t}, o_{t + 1})

generated via interactions between the BLPN policy and the LRGDE environment is stored into the experience memory. The MFTE model in AMFER is then adopted to evaluate the priority of each stored transition. Accordingly, a batch of transitions is sampled from the experience memory based on the sampling probabilities assigned according to their priorities. Finally, these sampled transitions are used to update and optimize the BLPN. In the subsequent sections, we elaborate on the design of the BLPN and AMFER (including the ADT mechanism and MFTE model).

4.2. Bi-LSTM-Modified Policy Networks

To address the sequential decision-making problem posed by partial observability in LRGDEs, a policy network that can effectively process historical observation–action sequences is required. Although various neural architectures are available for time-series data modeling, we select the Bi-LSTM for its proven effectiveness in capturing mid-range temporal dependencies and its computational efficiency for real-time embedded system applications. Compared with recent alternatives such as the Transformer architecture, which excels in modeling long-range contextual dependencies, Bi-LSTM imposes a much lower computational cost. Figure 5 illustrates the structure of the BLPN, which consists of an actor network and a critic network. Both the actor network and the critic network follow a three-module structure: an input block, a middle block, and an output block. Since the input information includes the current action

a_{t}

and the historical trajectory

h_{t}^{N_{o}}

(a time-series vector consisting of recent states and actions), the input block is built upon Bi-LSTM, a variant of RNNs that processes sequential data in both forward and backward directions. The middle block and the output block are constructed using fully connected networks (FCNs), consistent with the policy networks used in traditional DRL algorithms.

Specifically, the input to the actor network

π_{po} (h_{t}^{N_{o}} | ϕ_{po})

consists of observations from the LRGDE environment, which are categorized into normal observation and sequential observation. At each decision step, the agent receives an observation

o_{t}

that includes the normal observation

o_{t}^{1}

for the UAV’s flight state and the relative observation

o_{t}^{2}

for the target and threats. In this work, we consider that

o_{t}^{1}

represents fully observable information about the UAV’s flight state and does not need to be included in sequential observations, which helps to reduce network size and conserve computational resources. In contrast,

o_{t}^{2}

represents partially observable information about the target and threats relative to the UAV, which can be obtained via various airborne sensors.

Subsequently,

o_{t}^{2}

and

a_{t}

over a period of time are used to construct the sequential observation, as illustrated in Figure 5. During interactions between the agent and the LRGDE environment,

h_{t}^{N_{o}}

is input into

π_{po} (h_{t}^{N_{o}} | ϕ_{po})

, which outputs the corresponding action

a_{t}

. In contrast to the actor network, the current action

a_{t}

is incorporated into the input of the critic network,

Q_{po} (h_{t}, a_{t} | θ_{po})

. The critic network then outputs the Q-value

Q_{po} (h_{t}, a_{t})

to guide the optimization of the actor network

π_{po} (h_{t}^{N_{o}} | ϕ_{po})

.

The critic network

Q_{po} (h_{t}, a_{t} | θ_{po})

could be optimized based on TD3 [24], and the optimization target is defined as

y = r_{t} + γ \min_{i = 1, 2} Q_{po}^{i'} (h_{t + 1}, π_{ϕ} (h_{t + 1}))

(14)

where

Q_{po}^{i'} (h, a | θ_{po}^{i^{'}}), i = 1, 2

are the target critic networks, and the transition

(h_{t}, a_{t}, r_{t}, h_{t + 1})

is used to calculate the optimization target

y

. Thereby, we could optimize the hyperparameters of critic networks by the loss function, defined as

L (θ_{i}) = \sum_{i} {[y_{i} - Q_{po}^{i} (h_{t}, a_{t} | θ_{po}^{i})]}^{2}

(15)

where

Q_{po}^{i} (h, a | θ_{po}^{i}), i = 1, 2

are the critic networks and

y_{i}

is the target of the corresponding critic networks.

The policy gradient of the actor network

π_{po} (h_{t}^{N_{o}} | ϕ_{po})

is calculated according to the RDPG theorem [47].

\nabla_{ϕ_{po}} J = \frac{1}{N} \sum_{i} \nabla_{π_{po}} Q_{θ} (h_{i}, π_{po} (h_{i} | ϕ_{po}) | θ) \nabla_{ϕ_{po}} π_{po} (h_{i} | ϕ_{po})

(16)

4.3. Adaptive Multi-Feature Evaluation Experience Replay

Based on the structure of REDCRL, it is critical to restructure the policy training process for effective policy learning. Reasonable restructuring of the policy training process helps to address the challenges of limited experience and sparse reward when applying DRL to agent training in LRGDEs. As illustrated in Figure 6, we design the structure of AMFER, which comprises an ADT mechanism and an MFTE model.

During the policy training process, to reduce the computational burden of algorithm training, we design an ADT mechanism following the “from easy to difficult” principle, which is analogous to the sequential learning processes in human education and animal training. The ADT mechanism generates a dynamic task objective based on the state of the LRGDE environment to determine whether the mission in the current episode is successful or unsuccessful. When determining the current task objective, we select the most appropriate goal for the current policy from a predesigned curriculum comprising a sequence of tasks arranged in ascending order of difficulty.

Furthermore, during interactions between the UAV-maneuvering decision-making policy and the LRGDE environment, transitions are stored in the experience memory, where a batch of transitions is sampled and utilized to optimize the policy. Once a transition is stored in the experience memory, the priority of each transition is evaluated by the MFTE model, which is developed based on the traditional ER method and integrates a comprehensive evaluation function. Prior to policy training, these transitions are sampled based on the sampling probabilities associated with the priorities of the transitions.

4.3.1. Adaptive Dynamic Termination Mechanism

For LRGDEs, the successful termination of the mission depends on the distance between the UAV and the target, as well as the UAV’s detection range. If the current distance between the UAV and the target satisfies Equation (2), the simulation episode is terminated successfully. Accordingly, the UAV’s detection range

R_{\det}

defines the mission difficulty for the current episode. In the ADT mechanism, we define a dynamic task objective that can automatically adjust itself based on the performance of the current policy.

First, we define a curriculum, which consists of a sequence of subtasks and is formulated as

C R M = [g_{1}, g_{2}, \dots, g_{n}]

(17)

where

g_{i}

is a subtask.

G

represents the goal of

g_{i}

.

In this paper, considering the definition of LRGDEs, the subtask goal

G

is related to Equation (2) and the key parameter of the dynamic task goal, i.e., the control variable of

G

, is the UAV’s detection range

R_{\det}

. Therefore, we can construct a curriculum about LRGDEs in terms of

R_{\det}

, which is defined as

R_{\det} \in [R_{\det}^{\min}, R_{\det}^{\max}]

. Among the curriculum

C R M

, every subtask

g_{i}

is defined by

\{R_{\det}^{i}\}

and all of the subtasks are sorted by the difficulty of the subtask goal in ascending order. While sorting the subtasks, the difficulty evaluation function

f_{C L}^{T G} (g_{i}) \in [0, 1]

is defined as

f_{C L}^{T G} (g_{i}) = \frac{R_{\det}^{\max} - R_{\det}}{R_{\det}^{\max} - R_{\det}^{\min}}

(18)

According to the function defined above, we can obtain a curriculum arranged in ascending order of difficulty. In addition, during the policy training process, we design a module to update the current task objective, which in turn determines the termination condition of the current episode.

In this work, we consider that the difficulty perceived by the current policy is associated with the current task objective of LRGDEs. Therefore, after constructing the curriculum, we adopt the Dynamic Success Rate (DSR) metric to evaluate the performance of the current policy and determine whether to switch the current subtask objective. The DSR can be computed as

D S R = \frac{N_{D S R}^{s}}{N_{D S R}}

(19)

where

N_{D S R}

denotes the total number of simulation episodes used for DSR calculation, and

N_{D S R}^{s}

represents the number of successfully completed episodes. Moreover, during policy training, the simulation results used for DSR calculation are obtained from the most recent

N_{D S R}

experiments. Accordingly, we can determine whether to switch the current subtask goal by

D S R \geq C S R

, and the Converged Successful Rate (CSR) acts as the threshold for judging whether the current policy has converged.

Figure 7 illustrates the structure of the ADT mechanism designed in this section. During the operation of the ADT mechanism, we first construct a curriculum according to the definition of the target LRGDE. During policy training, the DSR metric is evaluated to quantify the policy performance based on feedback from the LRGDE environment. The DSR is then used to determine whether to update the current termination condition. After the termination condition is switched to a more difficult subtask objective, the old transitions collected under the previous subtask objective are retained in the experience memory. Although the policy is learning subtask

g_{t}

, these old transitions from subtask

g_{t - 1}

still supply the policy with valuable prior knowledge.

The pseudocode of the ADT mechanism is shown in Algorithm 1.

Algorithm 1. The ADT Mechanism in AMFER
1:	Initialize a curriculum $C R M$ including a set of subtasks and select $g_{1}$ as the current subtask.
2:	while not terminate the training experiment do
3:	Obtain the state $s_{t}$ from LRGDE environment;
4:	Calculate the current $D S R$ according to the current subtask $g_{t}$ ;
5:	if $D S R \geq C S R$ then
6:	Switch current goal to next subtask $g_{t + 1}$ ;
7:	end if
8:	end while

4.3.2. Multi-Feature Transition Evaluation Model

In addition to the ADT mechanism, reshaping the sampling process from the experience memory is critical for effective policy training. Traditional ER methods, such as UER and PER, can facilitate online learning for DRL algorithms within the training environment. However, they face limitations when applied to UAV-maneuvering decision-making problems in LRGDEs, since relying solely on TD-error to evaluate transition priorities is inadequate. In addition to UER and PER, various CL-based algorithms with improved transition selection strategies have been proposed in recent years, including HER, DCRL, ERO, CHER, CER, LSER, and ACER. However, evaluating sample priority from only a single factor (e.g., TD-error, mission objective, cumulative reward, or sample diversity) remains insufficient. Accordingly, we design the MFTE model to comprehensively evaluate the priorities of transitions.

Figure 8 illustrates the structure of the MFTE model, whose core component is a comprehensive transition evaluation function composed of three evaluation factor terms: learning value, diversity, and intrinsic value. After each sampling and policy training iteration, the priorities of the sampled transitions are re-evaluated according to the current policy, and the updated priorities are used to compute the sampling probabilities for the next iteration. The comprehensive transition evaluation function is defined as

f_{C L}^{T E} (i) = f_{C L}^{L V} (i) + f_{C L}^{D V} (i) + f_{C L}^{I V} (i)

(20)

where

f_{C L}^{L V} (i)

denotes the learning value factor,

f_{C L}^{D V} (i)

denotes the diversity factor, and

f_{C L}^{I V} (i)

denotes the inherent value factor.

Within the MFTE model, the learning value of a transition characterizes the degree to which the transition contributes to policy optimization, and the learning value factor

f_{C L}^{L V} (i)

is computed based on the TD-error

δ_{i}

of the transition. In DL, transitions with large TD-error magnitudes require a smaller learning step to adapt to the curvature of the objective function. The larger the TD-error

δ_{i}

, the greater the impact of the transition on the current policy, and thus the more the transition should be utilized for learning. In CL,

f_{C L}^{L V} (i)

must satisfy specific constraints [34,37]. In this work, we design the learning value factor function, which is formulated as

f_{C L}^{L V} (i) = \{\begin{matrix} \exp (k_{1} \cdot (|δ| - λ)) & |δ| \leq λ \\ \exp (k_{2} \cdot (λ - |δ|)) & |δ| > λ \end{matrix}

(21)

where

k_{1}

and

k_{2}

are used to adjust the slope of function.

|δ|

denotes the loss of the

i

-th transition for the current network.

λ

represents the curriculum factor that indicates the learning stages.

Apart from the learning value, maintaining data diversity is also critical for effective policy training. In ER, excessive reuse of redundant transitions can lead to severe overtraining, and the policy is highly prone to becoming trapped in a local optimum. Therefore, maintaining the diversity of transitions sampled from the experience memory is one of the key issues to prevent the policy from becoming trapped in a local optimum. In this work, to achieve the exploration–exploitation tradeoff and maintain sufficient exploration in the state and action spaces, the diversity factor

f_{C L}^{D V} (i)

is adopted to enhance the diversity of sampled transitions, and

f_{C L}^{D V} (i)

is formulated as

f_{C L}^{D V} (i) = \frac{N_{D V}^{\max} - N_{D V}^{i}}{N_{D V}^{\max}}

(22)

where

N_{D V}^{\max}

denotes the maximum number of times all transitions in the experience memory have been sampled and

N_{D V}^{i}

represents the number of times transition

i

is sampled.

Finally, to facilitate the policy’s faster convergence to a near-optimal solution, we incorporate the reward of each transition into the comprehensive evaluation value. In DRL algorithms, rewards can guide the policy to learn effective behaviors. In other words, rewards serve as the specified learning guidance for the mission. Therefore, from a non-generalizability perspective, prioritizing transitions with higher rewards for learning can accelerate the policy’s convergence. Accordingly, we specifically integrate the intrinsic value factor

f_{C L}^{I V} (i)

into the comprehensive transition evaluation function, and

f_{C L}^{I V} (i)

is formulated as

f_{C L}^{I V} (i) = \frac{r_{i}}{r_{\max}}

(23)

where

r_{i}

is the reward of the i-th transition and

r_{\max}

denotes the maximum reward across all transitions in the experience memory.

Based on the comprehensive transition evaluation function

f_{C L}^{T E} (i)

, the priority

p_{i} = f_{C L}^{T E} (i)

of each transition can be obtained. Accordingly, we can compute the sampling probability

P (i)

of each transition, which is defined as

P (i) = \frac{p_{i}^{α}}{\sum_{k} p_{k}^{α}}

(24)

where

α

is a hyperparameter used to control the influence of transition priority on the sampling probability. Moreover, since the transition distribution from the environment is altered, Importance Sampling (IS) weights [48] are adopted to correct the distribution bias caused by AMFER. The cumulative gradient of critic network could be calculated by

Δ_{i} = \sum_{j} ω_{j} \cdot δ_{j} \cdot \nabla_{θ_{po}^{i}} Q_{po}^{i} (h, a | θ_{po}^{i}), i = 1, 2

(25)

where

δ_{j}

is TD-error of the j-th transition and

Δ_{i}

is the cumulative gradient of the critic network

Q_{po}^{i} (h, a | θ_{po}^{i})

.

The pseudocode of the MFTE mechanism is shown in Algorithm 2.

Algorithm 2. The MFTE Model in AMFER
1:	for each transition $i$ in $B$ do
2:	Calculate the TD-Error of the transition $i$ according to current $π_{po} (h \| ϕ_{po})$ and $Q_{po}^{j} (h, a \| θ_{po}^{j}), j = 1, 2$ ;
3:	Calculate the learning value factor $f_{C L}^{L V} (i)$ according to Equation (21);
4:	Count the $N_{D V}^{\max}$ and $N_{D V}^{i}$ , and calculate the diversity factor $f_{C L}^{D V} (i)$ according to Equation (22);
5:	Search for the $r_{\max}$ among $B$ and calculate the inherent value factor $f_{C L}^{I V} (i)$ according to Equation (23);
6:	Calculate the comprehensive transition evaluation value $f_{C L}^{T E} (i)$ according to Equation (20);
7:	Update the priority $p_{i}$ of $i$ -th transition based on $f_{C L}^{T E} (i)$ and calculate the probability $P (i)$ of $i$ -th transition.
8:	end for

4.4. Policy Training Process of REDCRL

As illustrated in Figure 4, during REDCRL training in LRGDEs, the BLPN generates the action

a_{t}

based on the observation

o_{t}

and the reward

r_{t}

obtained from the LRGDE environment. Subsequently, transitions generated during training are stored in the experience memory. After each training epoch, the MFTE model integrated into AMFER is adopted to evaluate the priorities of sampled transitions. Meanwhile, the ADT mechanism integrated into AMFER is employed to determine whether to terminate the current simulation episode and update the task objective of the LRGDE environment.

Based on the implementations of the BLPN and AMFER modules, the integrated REDCRL algorithm is presented in Algorithm 3.

Algorithm 3. The REDCRL algorithm for UAV-maneuvering decision-making in LRGDEs
1:	Initialize policy networks $π_{po} (h \| ϕ_{po})$ and $Q_{po}^{i} (h, a \| θ_{po}^{i}), i = 1, 2$ and their target networks $π_{po}^{'} (h \| ϕ_{po}^{'})$ , $Q_{po}^{i'} (h, a \| θ_{po}^{i'}), i = 1, 2$ .
2:	for $m = 1$ to $M$ do
3:	Reset environment and obtain the initial observation $o_{0}$ ;
4:	Construct the history trajectory $h_{0}^{N_{o}}$ and output $a_{0} = π_{po} (h_{0}^{N_{o}} \| ϕ_{po})$ ;
5:	for $t = 1$ to $T$ do
6:	Observe current observation $o_{t}$ and calculate current action $a_{t}$ ;
7:	Observe next observation $o_{t + 1}$ and receive reward $r_{t}$ from environment, and store transition $(o_{t}, a_{t}, r_{t}, o_{t + 1})$ .
8:	if $t \mod k = 0$ then
9:	Reset the gradient $Δ = 0$ of critic networks with IS;
10:	Sample a batch of transitions according to the sampling probabilities of transitions;
11:	Accumulate the parameters gradient of $Q_{po}^{i} (h, a \| θ_{po}^{i}), i = 1, 2$ according to Equation (25).
12:	Update the parameters of $Q_{po}^{i} (h, a \| θ_{po}^{i})$ according to $Δ_{i}$ with learning rate $η_{c}$ ;
13:	Update the parameters of actor network $π_{po} (h \| ϕ_{po})$ according to Equation (16);
14:	Update the priorities of these transitions used for training according to MFTE defined in Algorithm 2
15:	end if
16:	If state $s_{t + 1}$ meets Equation (2) then
17:	Start the next episode.
18:	Update the task goal according to ADT defined in Algorithm 1
19:	else if state $s_{t + 1}$ satisfies Equation (1) then
20:	Start the next episode.
21:	end if
22:	end for
23:	end for

5. Simulation Experimental Results and Analysis

In this section, we design a series of experiments to verify the effectiveness of REDCRL and demonstrate its performance advantages compared with several traditional DRL algorithms. Furthermore, we evaluate the performance of the trained policy in the LRGDE environment and conduct ablation experiments to validate the individual contributions of each module to the overall performance of REDCRL.

5.1. Experiments Settings

Before conducting the experiments, we design the experimental scenario, including the initial environmental state and other environmental parameters. Meanwhile, we select several representative traditional algorithms and compare them with REDCRL to validate its performance. To quantify the algorithmic performance, we present several quantitative evaluation metrics. Finally, we specify the hyperparameter settings used in the experiments.

5.1.1. Experiments’ Scenario

In LRGDEs, the mission zone is restricted to 100 km × 100 km airspace and the height of the UAV is bounded to 500~10,000 m. Figure 9 illustrates the layout of the mission zone, which consists of the UAV launch area, threat areas, and the target area. Before each simulation run, the initial position of the UAV is randomly generated within the UAV launch area. Meanwhile, the initial positions of threats are randomly deployed within the threat areas. The target is reset to a predefined initial position.

The initial positions of the UAV and threats are randomly sampled from a uniform distribution. Specifically, these threats are distributed around the straight line connecting the UAV and the target. At the start of each simulation episode, the UAV policy makes a decision every 0.1 s.

5.1.2. Contrast Settings

To evaluate the performance of REDCRL, several SOTA algorithms are also employed to train policies in the LRGDE environment for comparative analysis. DDPG [23] and TD3 [24], two traditional DRL algorithms based on the actor–critic structure, serve as the baseline methods for REDCRL. RDPG (a variant of DDPG) [47] and RTD3 (a variant of TD3) [49] are variants of traditional DRL algorithms tailored for POMDPs. DCRL (a classic variant of DRL with improved ER) [34] is another classic variant of traditional DRL, in which ER is enhanced based on CL. Accordingly, we aim to validate the effectiveness of REDCRL by comparing it with DDPG, TD3, RDPG, RTD3, and DCRL across two key metrics: policy convergence speed and the performance of the trained policies.

In addition to the comparison with the aforementioned DRL variants, we also conduct an ablation experiment to validate the individual contributions of each module to REDCRL’s overall performance. In the ablation experiment, we run REDCRL, TD3 + BLPN, TD3 + AMFER, and TD3 independently in the LRGDE environment. This is because TD3 serves as the baseline trainer for our proposed REDCRL, while BLPN and AMFER are the two core modules of REDCRL.

5.1.3. Evaluation Metrics

To quantify the performance of REDCRL and the other baseline comparison algorithms, we define several metrics to evaluate the convergence speed and overall performance of these algorithms.

Successful Rate (SR): The ratio of successfully completed mission to the total number of experiments.
Dynamic Successful Rate (DSR): The ratio of successfully completed missions in the 50 most recent experiments.
Peak Successful Rate (PSR): The maximum successful rate during the whole course of policy training.
Valley Successful Rate (VSR): The minimum successful rate during the whole course of policy training.
Learning Time (LT): The number of episodes where DSR first reaches CSR.
Converged Policy Stability (CPS): The standard deviation of DSR after DSR reaches CSR.
Converged Policy Performance (CPP): The average DSR after DSR reaches CSR.
Average Inference Time (AIT): The average decision time of the policy during the whole course of policy training.

In addition to the metrics defined above, we define a threshold, namely CSR, to determine whether the policy has converged. In accordance with expert experience and common conventions in the field, the CSR is set to 80% in this work.

5.1.4. Parameters Assignment of REDCRL

Based on the definitions of the LRGDE environment and REDCRL, the parameter settings for REDCRL are provided in the GitHub repository at https://github.com/KeLi0000/REDCRML (accessed on 12 January 2026).

In addition, we train REDCRL with different historical trajectory lengths to analyze the impact of the hyperparameter

N_{o}

and identify an optimal value that fully leverages the advantages of BLPN for solving POMDPs. As presented in Figure 10, we collect the training data generated by REDCRL with different values of

N_{o}

. On the PSR, VSR, CPP, and CPS metrics, REDCRL achieves superior performance over all other cases when

N_{o} = 15

. Therefore, for all subsequent experiments, we adopt REDCRL with

N_{o} = 15

for policy training.

5.2. Training Experiments of the Algorithms

According to the experimental settings in Section 5.1, we conduct training experiments for REDCRL and the other comparison algorithms and collect the data generated during policy training. We then evaluate and analyze these training data.

Table 2 summarizes the evaluation results obtained using DDPG, RDPG, TD3, RTD3, DCRL, and REDCRL. REDCRL outperforms all other algorithms in terms of LT, with an average improvement of 75.7% over the other methods. Based on the ablation experiment results, this confirms that REDCRL can accelerate policy convergence by introducing the AMFER module. For instance, REDCRL achieves an improvement of approximately 83.9% over DDPG, the worst-performing method in terms of LT, and an improvement of 58.5% over DCRL, which is the best-performing algorithm among all baseline methods. Furthermore, the policy trained by REDCRL also exhibits superior performance over the other classic DRL algorithms with respect to CPP and CPS. Specifically, REDCRL outperforms the other algorithms by an average of 19.1% in CPP performance and by approximately 46.5% in CPS stability. In addition, REDCRL surpasses the best-performing baseline algorithm by 6.2% and the worst-performing one by 48.3% in terms of CPP. For the CPS metric, REDCRL outperforms the best baseline algorithm by 20.1% and the worst by 66.7%. Beyond the metric-wise analysis above, REDCRL also achieves favorable peak performance in terms of PSR.

In detail, we plot the curves of DSR versus training episodes for different algorithms, as illustrated in Figure 11. The results in the following graphs intuitively demonstrate that the proposed REDCRL can accelerate policy convergence and improve the performance of the trained policy. In particular, the policy convergence speed is significantly enhanced in terms of LT owing to the introduction of BLPN and AMFER.

5.3. Testing Experiments of the Trained Policy

To verify the effectiveness of the trained policy in the LRGDE task, we conducted extensive experiments in scenarios similar to the training environment. Policies trained using DDPG, RDPG, TD3, RTD3, DCRL, and REDCRL were subjected to 100 tests under the scenarios specified in Section 5.1.1. We collected and analyzed the results from these extensive tests, with the SR performance summarized in Table 3. Based on the evaluation results in Table 3, we found that the policy trained by REDCRL outperformed those of all other algorithms. In particular, REDCRL achieved an average SR performance approximately 16.0% higher than all other algorithms, 26.76% higher than the worst-performing algorithm, and 8.43% higher than the best-performing baseline algorithm in terms of SR.

Furthermore, we selected a set of test results and visualized the simulation process, including the UAV flight trajectory and the LRGDE execution flow, using the Tacview flight analysis software (version number 1.9.5), as illustrated in Figure 12. Figure 12(a1–f1) plots the UAV flight trajectories: The red solid line denotes the UAV flight trajectory, the red solid dot marks the UAV’s initial position, the red “×” denotes the UAV’s terminal position, the blue solid line denotes the target trajectory, the blue solid dot marks the target’s initial position, the blue “×” denotes the target’s terminal position, and the red dashed line represents the trajectory of the payload released by the UAV. In addition, the green hemispherical surface denotes the threat influence range, whose radius corresponds to the threat influence radius

R_{thr}^{i}

.

As clearly demonstrated in Figure 12, the REDCRL algorithm exhibits superior performance across multiple evaluation metrics relative to conventional DRL methods. The trajectory visualization in subfigures (f1) and (f2) reveals significantly improved smoothness and continuity: the generated path satisfies optimal curvature constraints while enabling efficient navigation in complex threat environments. This improved trajectory quality directly translates to higher flight stability and more efficient mission execution.

A comparative analysis with the other baseline methods (subfigures (a1)–(e1)) further validates REDCRL’s strong capability in addressing the challenges of LRGDE missions. The algorithm achieves outstanding performance in balancing obstacle avoidance and mission objectives, realizing more accurate target capture with minimal path deviation while sustaining an optimal trade-off between safety and efficiency during the entire navigation process.

5.4. Additional Experiments for REDCRL

In addition to the comparative experiments between REDCRL and other conventional algorithms, we performed extra experiments to evaluate the generalization ability of REDCRL by testing the trained policy in complex LRGDE scenarios with varying numbers of threats. Furthermore, to validate the effects of the BLPN and AMFER modules on REDCRL performance, we designed corresponding ablation experiments and analyzed the results.

5.4.1. Testing REDCRL in LRGDE with a Different Number of Threats

To evaluate the generalization ability of REDCRL in complex scenarios, we conducted extensive experiments to test the policy trained by REDCRL. In these experiments, the number of threats in the environment was set to 5, 6, 7, 8, 9, and 10, respectively.

Table 4 presents the SR of the trained policy when tested in LRGDE scenarios with different numbers of threats. We observe that the SR gradually decreases as the number of threats increases, but remains above 60% even when 10 threats are present. These results confirm that the policy trained by REDCRL exhibits favorable generalization ability. By further training the REDCRL policy in LRGDE scenarios with more threats, an even more robust policy capable of reliably handling LRGDE tasks with denser threats can be obtained.

5.4.2. Ablation Experiments on REDCRL

To evaluate the impacts of BLPN and AMFER on REDCRL performance, we conducted ablation experiments. In these experiments, we removed BLPN and AMFER from REDCRL separately to examine their individual effects on both policy convergence speed and converged performance.

As presented in Table 5, we considered that the policy was trained separately by REDCRL under four different settings, REDCRL without BLPN and AMFER, REDCRL without AMFER, REDCRL without BLPN, and the complete REDCRL. We evaluated the training process of each configuration by PSR, LT, CPP and CPS. A comparison between Case 1 and Case 2 shows that BLPN can significantly enhance the performance and stability of the algorithm in terms of CPP and CPS, while accelerating policy convergence in terms of LT, owing to its strengths in solving POMDPs. A comparison between Case 1 and Case 3 verifies that AMFER can substantially speed up policy convergence in terms of LT and boost the performance of the trained policy, thanks to its ability to address the issues of limited experience and sparse rewards. From the comparison between Case 1 and Case 4, we observe that both convergence speed and final performance are notably improved by introducing BLPN and AMFER together.

Therefore, BLPN and AMFER both contribute to improving the convergence speed and final converged performance of REDCRL, where BLPN mainly enhances policy performance, while AMFER is primarily dedicated to accelerating policy convergence. Both modules exert positive effects on REDCRL performance, and the combined improvement achieved by using both modules is significantly greater than that of either module alone.

6. Conclusions

In the present work, we described the LRGDE mission and refined the UAV-maneuvering decision-making model for LRGDEs based on POMDPs. Based on the definitions of LRGDEs, we proposed REDCRL based on DRL for optimizing policy to perform LRGDE missions. Specifically, we design BLPN to exploit the partially observable state information in LRGDEs, which enables superior policy performance over traditional DRL algorithms. Furthermore, we develop AMFER to reshape the experience sampling strategy from historical data, aiming to maximize the latent utility of limited transitions under sparse reward conditions. AMFER consists of the ADT mechanism and the MFTE model. Simulation results demonstrate that REDCRL effectively accelerates policy convergence and enhances the performance of the converged policy. In terms of LT, REDCRL achieves a 75.7% convergence speed improvement over classical DRL algorithms, while in terms of CPP, it outperforms traditional DRL methods by 19.1%. REDCRL also maintains strong generalization ability in LRGDE scenarios with varying numbers of threats. Ablation experiments further validate that the integration of BLPN and AMFER yields far greater performance gains than either module individually.

In future work, we will investigate targets with intelligent maneuvering policies, which transforms the UAV-maneuver decision-making problem in LRGDEs into a complex decision-making task with game-theoretic characteristics. Additionally, we will construct a physical experimental testbed in the laboratory and deploy the policy learned by REDCRL with physical UAVs for real-world flight validation.

Author Contributions

Conceptualization, K.L. and K.Z.; methodology, K.L., Z.W. and H.P.; software, K.L., B.W. and J.C.; validation, K.L. and B.Y.; formal analysis, K.L.; investigation, K.L.; resources, K.Z. and H.P.; data curation, K.Z.; writing—original draft preparation, K.L. and Z.W.; writing—review and editing, Z.W. and B.Y.; visualization, B.W.; supervision, K.Z.; project administration, K.Z.; funding acquisition, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Project (Grant No. JCKY2024205B032) and the Fundamental Research Funds for the Central Universities (Grant No. G2025KY06217).

Data Availability Statement

The code used in this study is publicly available on GitHub in the repository https://github.com/KeLi0000/REDCRML (accessed on 12 January 2026).

Conflicts of Interest

Author Binlin Yuan was employed by the company Sichuan Tengden Sci-Tech Innovation Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAVs	Unmanned Aerial Vehicles
LRGDE	Long-Range UAV Guidance in Dynamic Environments
REDCRL	RNN-Enhanced Diverse Curriculum-Driven Learning
BLPN	Bi-LSTM-Modified Policy Networks
AMFER	Adaptive Multi-Feature Evaluation Experience Replay
TD3	Twin Delayed Deep Deterministic Policy Gradient
MDPs	Markov Decision Processes
RL	Reinforcement Learning
DRL	Deep Reinforcement Learning
POMDPs	Partially Observable Markov Decision Processes
MCTS	Monte Carlo Tree Search
RNN	Recurrent Neural Networks
UER	Uniform Experience Replay
PER	Prioritized Experience Replay
ER	Experience Replay
CL	Curriculum Learning
DL	Deep Learning
DQN	Deep Q Network
DDPG	Deep Deterministic Policy Gradient
LOS	Line of Sight
ADT	Adaptive Dynamic Termination
MFTE	Multi-Feature Transition Evaluation
SR	Successful Rate
DSR	Dynamic Successful Rate
CSR	Converged Successful Rate
PSR	Peak Successful Rate
VSR	Valley Successful Rate
LT	Learning Time
CPS	Converged Policy Stability
CPP	Converged Policy Performance
AIT	Average Inference Time

References

Menouar, H.; Guvenc, I.; Akkaya, K.; Uluagac, A.; Kadri, A.; Tuncer, A. UAV-enabled intelligent transportation systems for the smart city: Applications and challenges. IEEE Commun. Mag. 2017, 55, 22–28. [Google Scholar] [CrossRef]
Lyu, X.; Li, X.; Dang, D.; Dou, H.; Wang, K.; Lou, A. Unmanned Aerial Vehicle (UAV) Remote Sensing in Grassland Ecosystem Monitoring: A Systematic Review. Remote Sens. 2022, 14, 1096. [Google Scholar] [CrossRef]
Ukaegbu, U.; Tartibu, L.; Okwu, M. Unmanned Aerial Vehicles for the Future: Classification, Challenges, and Opportunities. In Proceedings of the International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, Durban, South Africa, 5–6 August 2021. [Google Scholar]
Lee, S.; Song, Y.; Kil, S.H. Feasibility analyses of real-time detection of wildlife using UAV-derived thermal and RGB images. Remote Sens. 2021, 13, 2169. [Google Scholar] [CrossRef]
Yang, L.; Qi, J.; Xiao, J. A Literature Review of UAV 3D Path Planning. In Proceedings of the 11th World Congress on Intelligent Control and Automation, Shenyang, China, 29 June–4 July 2014. [Google Scholar]
Kaluđer, H.; Brezak, M.; Petrović, I. A Visibility Graph-Based Method for Path Planning in Dynamic Environments. In Proceedings of the 34th International MIPRO ICT and Electronics Convention, Opatija, Croatia, 23–27 May 2011. [Google Scholar]
Sun, Q.; Li, M.; Wang, T.; Zhao, C. UAV Path Planning Based on Improved Rapidly-Exploring Random Tree. In Proceedings of the Chinese Control and Decision Conference, Shenyang, China, 9–11 June 2018. [Google Scholar]
Yan, F.; Liu, Y.S.; Xiao, J.Z. Path planning in complex 3D environments using a probabilistic roadmap method. Int. J. Autom. Comput. 2013, 10, 525–533. [Google Scholar] [CrossRef]
Tseng, F.H.; Liang, T.T.; Lee, C.H.; Der Chou, L.; Chao, H.C. A Star Search Algorithm for Civil UAV Path Planning With 3G Communication. In Proceedings of the 10th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kitakyushu, Japan, 27–29 August 2014. [Google Scholar]
Meng, B. UAV Path Planning Based on Bidirectional Sparse A* Search Algorithm. In Proceedings of the International Conference on Intelligent Computation Technology and Automation, Changsha, China, 11–12 May 2010. [Google Scholar]
Ferguson, D.; Stentz, A. Using interpolation to improve path planning: The Field D* algorithm. J. Field Robot. 2006, 23, 79–101. [Google Scholar] [CrossRef]
Silva Arantes, J.; Silva Arantes, M.; Motta Toledo, C.F.; Júnior, O.T.; Williams, B.C. Heuristic and genetic algorithm approaches for UAV path planning under critical situation. Int. J. Artif. Intell. Tools 2017, 26, 1760008. [Google Scholar] [CrossRef]
Deng, W.; Feng, J.Y.; Zhao, H.M. Autonomous Path Planning via Sand Cat Swarm Optimization with Multi-Strategy Mechanism for Unmanned Aerial Vehicles in Dynamic Environment. IEEE Internet Things J. 2025, 12, 26003–26013. [Google Scholar] [CrossRef]
Lee, H.; Kim, H.J. Trajectory tracking control of multirotor from modelling to experiments: A survey. Int. J. Control Autom. 2017, 15, 281–292. [Google Scholar] [CrossRef]
Li, K.; Zhang, K.; Zhang, Z.; Liu, Z.; Hua, S.; He, J. A UAV Maneuver Decision-Making Algorithm for Autonomous Airdrop Based on Deep Reinforcement Learning. Sensors 2021, 21, 2233. [Google Scholar] [CrossRef]
Zhang, K.; Li, K.; He, J.; Shi, H.; Wang, Y.; Niu, C. A UAV Autonomous Maneuver Decision-Making Algorithm for Route Guidance. In Proceedings of the International Conference on Unmanned Aircraft Systems, Athens, Greece, 1–4 September 2020. [Google Scholar]
Li, K.; Zhang, K.; Liu, H.; Li, Y.; Wang, Q. An UAV Maneuvering Decision-Making Algorithm Based on Deep Transfer Reinforcement Learning. In Proceedings of the Congress of the International Council of the Aeronautical Science, Florence, Italy, 9–13 September 2024. [Google Scholar]
Kaifang, W.; Bo, L.; Xiaoguang, G.; Zijian, H.; Zhipeng, Y. A learning-based flexible autonomous motion control method for UAV in dynamic unknown environments. J. Syst. Eng. Electron. 2021, 32, 1490–1508. [Google Scholar] [CrossRef]
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef]
Silver, D.; Veness, J. Monte-Carlo planning in large POMDPs. In Proceedings of the 23rd Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010. [Google Scholar]
Somani, A.; Ye, N.; Hsu, D.; Lee, W.S. DESPOT: Online POMDP Planning with Regularization. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Fujimoto, S.; Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Zheng, K.; Zhang, X.; Wang, C.; Zhang, M.; Cui, H. A partially observable multi-ship collision avoidance decision-making model based on deep reinforcement learning. Ocean Coast. Manag. 2023, 242, 106689. [Google Scholar] [CrossRef]
Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of Drones: Multi-UAV Pursuit-Evasion Game with Online Motion Planning by Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 7900–7909. [Google Scholar] [CrossRef] [PubMed]
Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous Navigation of UAVs in Large-Scale Complex Environments: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
Xue, Y.; Chen, W. A UAV Navigation Approach Based on Deep Reinforcement Learning in Large Cluttered 3D Environments. IEEE Trans. Veh. Technol. 2023, 72, 3001–3014. [Google Scholar] [CrossRef]
Lin, L.J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter, A.; Zaremba, W. Hindsight Experience Replay. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Ren, Z.; Dong, D.; Li, H.; Chen, C. Self-Paced Prioritized Curriculum Learning with Coverage Penalty in Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2216–2226. [Google Scholar] [CrossRef]
Zha, D.; Lai, K.H.; Zhou, K.; Hu, X. Experience Replay Optimization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019. [Google Scholar]
Fang, M.; Zhou, T.; Du, Y.; Han, L.; Zhang, Z. Curriculum-guided Hindsight Experience Replay. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Hu, Z.; Gao, X.; Wan, K.; Wang, Q.; Zhai, Y. Asynchronous Curriculum Experience Replay: A Deep Reinforcement Learning Approach for UAV Autonomous Motion Control in Unknown Dynamic Environments. IEEE Trans. Veh. Technol. 2023, 72, 13985–14001. [Google Scholar] [CrossRef]
Luo, S.; Kasaei, H.; Schomaker, L. Accelerating Reinforcement Learning for Reaching Using Continuous Curriculum Learning. In Proceedings of the International Joint Conference on Neural Networks, Glasgow, UK, 19–24 July 2020. [Google Scholar]
Morad, S.D.; Mecca, R.; Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Embodied Visual Navigation with Automatic Curriculum Learning in Real Environments. IEEE Robot. Autom. Lett. 2021, 6, 683–690. [Google Scholar] [CrossRef]
Klink, P.; Yang, H.; D’Eramo, C.; Pajarinen, J.; Peters, J. Curriculum Reinforcement Learning via Constrained Optimal Transport. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
Ma, H.; Dong, D.; Ding, S.X.; Chen, C. Curriculum-Based Deep Reinforcement Learning for Quantum Control. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8852–8865. [Google Scholar] [CrossRef] [PubMed]
Koprulu, C.; Simão, T.D.; Jansen, N.; Topcu, U. Safety-Prioritizing Curricula for Constrained Reinforcement Learning. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
Xiao, C.; Lu, P.; He, Q. Flying Through a Narrow Gap Using End-to-End Deep Reinforcement Learning Augmented with Curriculum Learning and Sim2Real. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2701–2708. [Google Scholar] [CrossRef]
Zijian, H.; Xiaoguang, G.; Kaifang, W.; Yiwei, Z.; Qianglong, W. Relevant experience learning: A Deep Reinforcement Learning method for UAV Autonomous Motion Planning in complex unknown environments. Chin. J. Aeronaut. 2021, 34, 187–204. [Google Scholar] [CrossRef]
Wang, X.; Chen, Y.; Zhu, W. A Survey on Curriculum Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4555–4576. [Google Scholar] [CrossRef]
Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. J. Mach. Learn. Res. 2020, 21, 1–50. [Google Scholar]
Heess, N.; Hunt, J.; Lillicrap, T.; Silver, D. Memory-based control with recurrent neural networks. arXiv 2015, arXiv:1512.04455. [Google Scholar] [CrossRef]
Mahmood, A.; Hasselt, H.; Sutton, R. Weighted importance sampling for off-policy learning with linear function approximation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
Yang, Z.H.; Nguyen, H. Recurrent Off-policy Baselines for Memory-based Continuous Control. arXiv 2021, arXiv:2110.12628. [Google Scholar]

Figure 1. Schematic of the LRGDE, depicting the UAV’s start position, mission zone, moving target trajectory, and distributed threats (e.g., mountains or no-fly zones). The red line indicates a sample UAV path, highlighting avoidance behavior and the objective area approach.

Figure 2. The vector diagram of relationship among UAV, threats and moving targets involved in LRGDEs.

Figure 3. The perception diagram of UAV observing threats existing in the surroundings.

Figure 4. Structure of REDCRL. REDCRL is composed of BLPN policy and AMFER.

Figure 5. Structure of Bi-LSTM-modified Policy Networks (BLPNs), including actor network and critic network.

Figure 6. Structure of Adaptive Multi-Feature Evaluation Experience Replay (AMFER) including adaptive dynamic termination mechanism and multi-feature transition evaluation model.

Figure 7. Structure of ADT mechanism. The ADT automatically adjusts the termination condition of the current episode according to the pre-generated curriculum, and this process implicitly realizes knowledge transfer among the subtasks of curriculum.

Figure 8. Structure of MFTE model. During the sampling process from the experience memory, the comprehensive transition evaluation function involved in MFTE is used to calculate the priorities of transitions in the experience memory and generate the sampling probabilities of transitions.

Figure 9. Mission area definition of LRGDE environment. Before starting each simulation episode, the UAV’s initial position and the threats’ initial position are generated randomly within their respective regions. And the target is reset to the starting position.

Figure 10. The evaluation results of REDCRL with different

N_{o}

. Red numbers in the figure denote the best performance achieved for each evaluation metric.

Figure 10. The evaluation results of REDCRL with different

N_{o}

. Red numbers in the figure denote the best performance achieved for each evaluation metric.

Figure 11. The curves of DSR versus training episodes when training with DDPG, RDPG, TD3, RTD3, DCRL and REDCRL. From these results, REDCRL outperforms the other algorithms in terms of the convergence speed and demonstrates superior performance among these algorithms.

Figure 12. The test results of 6 trained policy based on DDPG, RDPG, TD3, RTD3, DCRL and REDCRL.

Table 1. The definitions and hyperparameters of event-based reward functions.

NO.	Event Name	Appearance Condition	Immediate Reward
1	Lock-in of Line-of-Sight (LOS) azimuth	The LOS azimuth falls within the sensor’s maximum detection range.	+1.0
2	Positive LOS approaching rate	The UAV’s approach rate toward the target area becomes positive.	+1.0
3	Disappearance of a frontal threat	A threat previously within the UAV’s frontal field of view is no longer detected.	+1.0
4	Reduction in the number of observed threats	The total count of currently observed threats decreases.	+1.0
5	Unlocking of LOS azimuth	The LOS azimuth moves outside the sensor’s maximum detection range.	−1.0
6	Appearance of a new threat	A new threat enters the UAV’s field of view.	−1.0

Table 2. The evaluation results on the training data using DDPG, RDPG, TD3, RTD3, DCRL and REDCRL. The REDCRL demonstrates superior performance among these algorithms.

Alg. Metrics	DDPG	TD3	RDPG	RTD3	DCRL	REDCRL
LT	515	481	295	220	200	83
CPP	58%	78%	73%	81%	71%	86%
CPS	8.266 × 10⁻²	5.186 × 10⁻²	6.646 × 10⁻²	6.165 × 10⁻²	12.449 × 10⁻²	4.143 × 10⁻²
PSR	80%	88%	86%	96%	98%	98%
ADT	2.003 ms	2.172 ms	20.868 ms	21.314 ms	2.535 ms	20.219 ms

Table 3. The evaluation results of trained policy with DDPG, RDPG, TD3, RTD3, DCRL and REDCRL in terms of SR.

Algorithms	DDPG	TD3	RDPG	RTD3	DCRL	REDCRL
SR	71%	80%	74%	83%	80%	90%

Table 4. The SR of the trained policy based on REDCRL in LRGDEs with different numbers of threats,

N_{t h r}

.

Table 4. The SR of the trained policy based on REDCRL in LRGDEs with different numbers of threats,

N_{t h r}

.

$N_{t h r}$	5	6	7	8	9	10
SR	80%	79%	74%	68%	69%	62%

Table 5. The results of ablation experiments for REDCRL in terms of PSR, LT, CPP, and CPS.

NO.	BLPN Yes?	AMFER Yes?	LT	CPP	CPS	PSR
1	×	×	437	65%	12.926 × 10⁻²	88%
2	√	×	220	81%	6.165 × 10⁻²	96%
3	×	√	200	71%	12.449 × 10⁻²	98%
4	√	√	141	86%	5.234 × 10⁻²	98%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, K.; Zhang, K.; Wei, Z.; Piao, H.; Yuan, B.; Wang, B.; Cheng, J. An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience. Drones 2026, 10, 142. https://doi.org/10.3390/drones10020142

AMA Style

Li K, Zhang K, Wei Z, Piao H, Yuan B, Wang B, Cheng J. An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience. Drones. 2026; 10(2):142. https://doi.org/10.3390/drones10020142

Chicago/Turabian Style

Li, Ke, Kun Zhang, Ziqi Wei, Haiyin Piao, Binlin Yuan, Boxuan Wang, and Jiangbo Cheng. 2026. "An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience" Drones 10, no. 2: 142. https://doi.org/10.3390/drones10020142

APA Style

Li, K., Zhang, K., Wei, Z., Piao, H., Yuan, B., Wang, B., & Cheng, J. (2026). An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience. Drones, 10(2), 142. https://doi.org/10.3390/drones10020142

Article Menu

An RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm Based on Deep Reinforcement Learning for POMDPs with Limited Experience

Highlights

Abstract

1. Introduction

2. Related Work

2.1. DRL for UAVs

2.2. RL for POMDPs

2.3. CL for RL

3. Problem Formulation

3.1. Description of LRGDE

3.2. UAV-Maneuvering Decision-Making Model for LRGDE Based on POMDPs

3.2.1. State Space, Observation Space and Action Space

3.2.2. Knowledge-Enhanced Reward Model

4. RNN-Enhanced Diverse Curriculum-Driven Learning Algorithm

4.1. Structure of Algorithm

4.2. Bi-LSTM-Modified Policy Networks

4.3. Adaptive Multi-Feature Evaluation Experience Replay

4.3.1. Adaptive Dynamic Termination Mechanism

4.3.2. Multi-Feature Transition Evaluation Model

4.4. Policy Training Process of REDCRL

5. Simulation Experimental Results and Analysis

5.1. Experiments Settings

5.1.1. Experiments’ Scenario

5.1.2. Contrast Settings

5.1.3. Evaluation Metrics

5.1.4. Parameters Assignment of REDCRL

5.2. Training Experiments of the Algorithms

5.3. Testing Experiments of the Trained Policy

5.4. Additional Experiments for REDCRL

5.4.1. Testing REDCRL in LRGDE with a Different Number of Threats

5.4.2. Ablation Experiments on REDCRL

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI