1. Introduction
With the wide application of UAVs in complex real-world environments such as local wars, post-disaster rescue operations, and mountainous transportation [1,2], efficient route planning technology has become a core component in ensuring the reliable execution of UAV missions [3]. Traditional global planning methods based on static environments can generate reference paths under known constraints [4]. However, in real dynamic environments, such methods gradually become limited [5]. In actual mission scenarios, UAVs not only need to deal with sudden threats, dynamic obstacles, and variable weather conditions, but must also consider multiple factors, such as the task sequence, energy consumption constraints, and real-time safety [6]. Therefore, relying solely on global planning in static environments is insufficient to meet mission demands in complex environments. In this context, for specific mission scenarios, UAVs can be viewed as autonomous agents. Their real-time obstacle avoidance and online route re-planning capabilities must be studied under dynamic constraints. This is key to enhancing the autonomy and robustness of UAVs in unknown dynamic environments [7].
Engineers and scholars have developed multiple dynamic route planning methods for unmanned systems, which can be broadly divided into the following categories.
Mathematical programming methods, such as mixed integer linear programming (MILP) and receding horizon control (RHC), encompass both linear and nonlinear programming approaches. These methods are based on the Bellman optimality principle and generate decision sequences. They are simple to implement and can generate globally optimal solutions. However, as the complexity of the system dynamics or environmental constraints increases, the solution difficulty and computational load of these methods increase significantly. Thus, they are more suitable for small-scale, simple scenarios [8,9].
Roadmap methods, such as the visibility graph (VG) and Voronoi diagram, have simple principles. They are capable of comprehensively considering multiple factors, such as path cost and threat distance [10]. However, in three-dimensional (3D) environments, the analysis becomes complex. Researchers commonly adopt the cutting plane method for simplification, which addresses the 3D spatial route planning problem by conducting route searches in multiple two-dimensional planes.
Potential field methods, such as the artificial potential field (APF) and stream function (SF), have low computational complexity and high real-time performance. They can generate smooth routes online. However, the problem of avoiding local minima needs to be carefully considered if these methods are implemented [11].
Stochastic planning methods, such as the Rapidly Exploring Random Tree (RRT), are probabilistically complete. However, the optimality of the resulting routes can be influenced by randomness in node sampling [12,13].
Heuristic search methods, such as A* and its various improved algorithms, utilize heuristic functions to guide the search direction, enabling optimal paths to be obtained quickly [14,15]. However, they rely heavily on precise heuristic functions, and complex dynamic environments require frequent and extensive online searches, resulting in significant computational demands.
Biomimetic intelligent methods, such as the genetic algorithm (GA) and particle swarm optimization (PSO), perform exceptionally well in global optimization [16]. However, considering their optimality and convergence rate, these methods are more suitable for static route planning problems in simple environments with low uncertainty [17,18].
Machine learning methods, represented by reinforcement learning (RL) and deep learning (DL), have experienced rapid development in recent years. By modeling the environment as a Markov Decision Process (MDP), deep reinforcement learning (DRL) agents optimize routes through interactive trial and error with the environment. The agent executes efficiently at run time while continuously learning and improving its optimization [19]. These advantages make DRL a leading trend in dynamic obstacle avoidance route planning [20]. Compared to classical local obstacle avoidance methods (e.g., A*, APF, DWA), DRL-based approaches exhibit superior adaptability and efficiency in dynamic, partially observable environments. In particular, value function approximation methods, such as the deep Q-network (DQN) and its improved variants, enable UAVs to make autonomous decisions in unknown environments through end-to-end learning [21,22]. However, the standard DQN faces major challenges when dealing with large-scale state spaces, particularly the overestimation of Q-values, unstable training processes, and slow convergence rates [23,24]. Although improved algorithms, such as Double DQN and Dueling DQN, have partially alleviated these issues [25,26], they largely operate within static environments based on a "flat" single-model architecture for route planning. When facing complex dynamic environments involving long-duration flights and multi-threat scenarios, the network structures become complex and large-scale, leading to increased training difficulty or even failure to converge. Additionally, it is difficult to simultaneously balance global optimality and real-time local obstacle avoidance across diverse threat scenarios [27,28]. Notably, while many methods perform well in hypothetical grid-based simulation environments, they lack sufficient validation in real, complex, unstructured mountainous environments [29,30].
In recent years, hierarchical approaches integrated with deep reinforcement learning (DRL) have emerged as a predominant research direction. These methods primarily enhance performance by fusing global and local modules in various ways. Coordinating the optimality of global path planning with the safety of local obstacle avoidance is critical in such hierarchical planning architectures. Existing research has sought to improve performance through various pathways for integrating DRL: for instance, designing modular hierarchies (e.g., "DRL-based high-level decision-making with rule-based low-level control") [31], constructing serialized pipelines (e.g., "DRL-based global waypoint planning with model-based local tracking") [32], or implementing algorithmic fusion (e.g., traditional optimization algorithms for global planning combined with DRL for local switching) [33]. However, when confronted with complex, dynamic, and unknown environments (such as intricate mountainous terrain), these methods still encounter challenges, including insufficient real-time synergy between global and local modules, complex structures and difficulties in training local DRL agents, and limited flexibility in response to sudden and variable threats.
Thus, existing route planning methods struggle to achieve full-mission path planning and real-time obstacle avoidance for threats during flights in complex dynamic environments. Furthermore, most intelligent route planning methods are designed under static environment assumptions based on a single-model agent, resulting in low applicability to obstacle avoidance in various local dynamic scenarios.
In summary, UAVs must be capable of performing rapid evasion whenever a threat is triggered in three-dimensional space. Although multiple prior works have addressed UAV path planning and obstacle avoidance, existing classical and intelligent methods have not adequately investigated the issue of evading random threats that can occur at any time and at any position along the global flight route in three-dimensional space [34,35]. The limitations and challenges of current mainstream approaches in such scenarios are summarized as follows: (1) offline global planning cannot achieve real-time dynamic obstacle avoidance; (2) single-model agents for local route re-planning involve complex structures; and (3) most agents are designed for specific fixed obstacle scenarios, which results in a lack of adaptability to diverse real-world situations. These factors leave existing route planning methods unable to meet the requirements of UAVs across the entire flight mission cycle.
To overcome these issues, we propose a hierarchical route planning framework and MMDQN agent-based intelligent obstacle avoidance for UAVs. The main contributions of this paper are summarized as follows: (1) A hierarchical route planning framework is designed to address the coordination issues between global planning and local optimization in route flight missions. This scheme retains the flexibility inherent in hierarchical architectures: while ensuring adaptability through global planning methods that represent the environment, it directly addresses core practical issues such as coordination and deployment efficiency via a tightly coupled DRL-based local module. (2) Based on the hierarchical framework, an MMDQN agent is designed for different threat scenarios, along with a dynamic threat adaptation mechanism, to reduce the complexity of neural network structure design and training compared with conventional single-model agents. (3) To train the MMDQN agent, we propose an MCTIL strategy based on threat-triggered events at any time during the entire flight path, improving the applicability and reliability of the route planning agent.
The rest of this paper is organized as follows: Section 2 presents a description of the route planning problem. Section 3 details the hierarchical route planning framework, as well as global planning and local dynamic optimization. The detailed algorithms are presented in Section 4. Section 5 presents the simulation results of the proposed methods. The conclusions and future work are outlined in Section 6.
3. Global Planning and Local Dynamic Optimization-Based Hierarchical Route Planning Framework
To address the issue of coordinating global route planning with real-time local obstacle avoidance in complex mountainous environments, this section introduces a hierarchical route planning framework. It seamlessly integrates offline global planning with dynamic local online optimization. This approach ensures comprehensive management of task sequence logic for long-duration UAV flights and various dynamic threat avoidance tasks, ultimately guaranteeing reliable and safe flight operations.
In the global planning phase, based on elevation terrain data, any feasible offline optimal planning approach can be adopted to ensure the reliable execution of long-duration flight missions. For instance, approaches include the 3D spatial plane segmentation technique, critical path points optimization using the Voronoi algorithm, and the A* heuristic search method. Additionally, effective solutions may integrate multiple methods to achieve superior results.
For the local dynamic optimization task, local route re-planning is implemented based on a dynamic threat triggering mechanism. Specifically, the DRL intelligent algorithm is employed to achieve online rapid resolution of local routes to guarantee the safety of real-time flights for UAVs.
The hierarchical route planning framework is shown in Figure 1.
Remark 2. This section focuses on the overall framework of the proposed algorithm, analyzing the operational logic of the path search algorithms at both the global and local levels, without imposing restrictions on the selection of specific global and local path planning algorithms.
The specific functions of each module are presented in Table 1, where the global route planning, re-planning task generation, and local route re-planning modules are the core algorithms of the hierarchical framework. The relationships between modules are shown in Figure 2. The operational logic for all modules within the framework is depicted in Figure 3.
The operational process for framework implementation is outlined as follows:
1. The Module Invocation Management Module initiates the entire mission framework.
2. The starting and target points of the flight mission are set.
3. A global route is planned using any feasible optimal planning approach.
4. The UAV flies from the starting point along the global route.
5. During the flight, threat areas are dynamically generated, affecting the original UAV flight route.
6. When the UAV autonomously detects a threat area, it performs intelligent local route re-planning to avoid obstacles.
7. The original global flight route is updated with the locally re-planned route.
8. The UAV continues flying along the updated route.
9. Steps 5 to 8 are repeated (the red frame in Figure 3) until the UAV reaches the global target point.
4. Algorithm Design
In this section, we provide a detailed description of the core model and its related algorithm designs. Firstly, based on the global route planning provided in Ref. [36], we introduce a local re-planning task generation module that is invoked when dynamic threats are detected during the flight. The instruction parameters generated by this module are used by the local route re-planning module to identify routes that avoid obstacles. Secondly, we design an MMDQN agent for local route re-planning, which simplifies the network structure of the agent while adapting to various scenarios and avoiding dynamic threats. Next, we develop an MCTIL strategy to train this agent; after training, different scenarios can be designed to test the online application of the agent. Finally, we provide an overall description of the method structure and algorithm flow.
4.1. Threat-Triggering Local Re-Planning Task Generation Based on Global Routes with Multi-Flight Planes
Based on the global planning method with multi-layer spatial planes proposed in Ref. [36], when flying along a pre-planned global route, a UAV can detect all dynamic threats within its detection range in real time. The UAV needs to autonomously avoid dynamic threats that affect flight safety. The threat-triggering local re-planning task generation module is the link between global flight and dynamic obstacle avoidance; it is mainly responsible for generating task instructions during local route re-planning and for the standardized preprocessing of obstacle avoidance scenarios.
To address local dynamic threat avoidance within the detection range, interpolation of the pre-planned global flight path is required. Specifically, if the distance between two adjacent points in the route point sequence is greater than a prescribed spacing threshold, the minimum number of route points needed is inserted between them so that the distance between any two adjacent points in the resulting global route sequence does not exceed that threshold; otherwise, the pre-planned global route remains unchanged. The interpolation is implemented as follows:
Calculate the distance between two adjacent points in the pre-planned global flight route segment by segment:

$$d_i = \lVert \mathbf{P}_{i+1} - \mathbf{P}_i \rVert, \quad i = 1, 2, \ldots, N-1,$$

where $\mathbf{P}_i$ and $\mathbf{P}_{i+1}$ are the original pre-planned route points and $N$ is the number of original pre-planned route points. The number of sub-segments between them is

$$n_i = \left\lceil d_i / d_{\max} \right\rceil,$$

where $\lceil \cdot \rceil$ is the ceiling function and $d_{\max}$ is the spacing threshold. Then, the coordinates of the route points inserted between the original pre-planned route points $\mathbf{P}_i$ and $\mathbf{P}_{i+1}$ are calculated as follows:

$$\mathbf{P}_{i,j} = \mathbf{P}_i + \frac{j}{n_i}\left(\mathbf{P}_{i+1} - \mathbf{P}_i\right), \quad j = 1, 2, \ldots, n_i - 1.$$

Combined with the altitude settings for route point interpolation, a new sequence of pre-planned global flight route points is ultimately formed, with the total number of route points being $N' = N + \sum_{i=1}^{N-1}(n_i - 1)$. In scenarios where the local starting point and target point reside on separate flight layers, the higher back-up plane is assigned to the insertion points for improved obstacle avoidance. The altitude setting rules for inserted route points are given in Figure 4.
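The interpolation step can be sketched in a few lines. This is a minimal illustration; the function name `interpolate_route` and the spacing threshold `d_max` are hypothetical names, not the paper's implementation, and altitude-layer assignment is omitted.

```python
import math

def interpolate_route(points, d_max):
    """Insert the minimum number of evenly spaced points between adjacent
    waypoints so that no segment of the resulting route exceeds d_max.
    'points' is a list of coordinate tuples of the pre-planned route."""
    new_route = [points[0]]
    for p, q in zip(points, points[1:]):
        dist = math.dist(p, q)                  # segment length
        n = max(1, math.ceil(dist / d_max))     # minimum number of sub-segments
        for j in range(1, n + 1):               # linear interpolation along p->q
            t = j / n
            new_route.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return new_route
```

For a 10-unit segment with `d_max = 3`, four sub-segments of length 2.5 are produced, so three new points are inserted between the two original waypoints.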
Then, based on the new global route point sequence after interpolation and the location of dynamic threats, local re-planning task instruction parameters are generated. Assuming that the UAV is currently flying towards a given route point, the local re-planning task instruction parameters are designed as follows:
1. Local starting point
This is the route point that the UAV is currently flying towards.
2. Local target point and map range
If the dynamic threat area covers the route point the UAV is flying towards, the next route point beyond the threat is taken as the local target point. If the dynamic threat area lies on the line connecting two adjacent route points, the far endpoint of that segment is taken as the local target point. The range of the local re-planning map is designed according to the specific location of the dynamic threat, as shown in Table 2.
Remark 3. Setting the local starting point and target point at a certain spatial distance between the UAV and the threat helps mitigate the impact of uncertainties in threat detection and perception in real-world scenarios, even though this paper focuses on idealized detection scenarios. Threats affecting flights are typically classified into waypoint-impacting and path-impacting scenarios. Since this paper assumes that threats remain stationary after appearing, hybrid scenarios involving both threat types, which occur only when threats are in motion, are excluded from the analysis.
Thus, based on the starting point, target point, and local map range for local route re-planning, a local re-planning 2D map scenario can be generated with dynamic threat areas. Two types of scenarios are shown in Figure 5.
4.2. The Design of the MMDQN Agent for Local Route Re-Planning
Based on the parameters of the local re-planning task instructions, an intelligent DRL method can effectively generate online local obstacle avoidance flight routes. To efficiently accommodate different threat impact scenarios and reduce the complexity of neural network design and training for a single-model agent, we construct an MMDQN agent with a model adaptation mechanism, designed to address the two categories of dynamic threats described in Section 4.1. The general structure and implementation scenario diagram of the local re-planning intelligent agent are shown in Figure 6.
The MMDQN agent consists of two DQN models and a model adaptation mechanism.
The model adaptation mechanism operates by evaluating the overlap between threat zones and the new global route point sequence, thereby invoking the corresponding DQN model for application.
The two DQN models correspond to scenarios where the threat area covers route points and the connections between two adjacent route points, respectively. For each DQN model, the network includes two components: a policy network and a target network. The policy network is used to select actions and predict Q-values, while the target network generates target Q-values to provide a stable reference for updating the policy network.
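The model adaptation rule can be illustrated with a short sketch. The function names and the circular-threat geometry below are illustrative assumptions; the paper does not prescribe this exact interface.

```python
import math

def covers_waypoint(threat_center, threat_radius, waypoints):
    """True if any route point lies inside the (assumed circular) threat area."""
    return any(math.dist(threat_center, w) <= threat_radius for w in waypoints)

def select_model(threat_center, threat_radius, waypoints, model_wp, model_path):
    """Hypothetical adaptation rule: invoke the DQN trained for
    waypoint-covering threats if a route point lies inside the threat zone;
    otherwise invoke the one trained for path-crossing threats."""
    if covers_waypoint(threat_center, threat_radius, waypoints):
        return model_wp
    return model_path
```

A threat centered on a waypoint thus routes the decision to the first model, while a threat that only intersects the segment between waypoints routes it to the second.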
The structure of the DQN consists of an input layer, a fully connected layer, a ReLU layer, another fully connected layer, another ReLU layer, and an output layer. The input layer is primarily responsible for the standardized input of state information, converting environmental state data into a tensor format that the network can process. The ReLU layers are activation layers, which can perform nonlinear transformations on the outputs of the fully connected layers, enabling the network to estimate complex Q-value functions. The fully connected layers flatten the high-dimensional features, facilitating global feature fusion and providing compact feature vectors for the output layer. The primary function of the output layer is to produce a Q value corresponding to each available action in the current state.
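The layer stack described above (fully connected, ReLU, fully connected, ReLU, linear output of one Q-value per action) can be sketched as a plain NumPy forward pass. Layer sizes and parameter names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def q_forward(state, params):
    """Forward pass of the two-hidden-layer Q-network described in the text:
    FC -> ReLU -> FC -> ReLU -> linear output, one Q-value per action."""
    h1 = np.maximum(0.0, state @ params["W1"] + params["b1"])   # FC + ReLU
    h2 = np.maximum(0.0, h1 @ params["W2"] + params["b2"])      # FC + ReLU
    return h2 @ params["W3"] + params["b3"]                     # Q-values

def init_params(state_dim, h1, h2, n_actions, rng):
    # Small random initialization; the sizes are assumptions for illustration.
    return {
        "W1": rng.normal(0, 0.1, (state_dim, h1)), "b1": np.zeros(h1),
        "W2": rng.normal(0, 0.1, (h1, h2)),        "b2": np.zeros(h2),
        "W3": rng.normal(0, 0.1, (h2, n_actions)), "b3": np.zeros(n_actions),
    }
```

The greedy action in a given state is then simply the index of the largest entry of `q_forward(state, params)`.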
Experience replay is an important part of the DQN. During the training process, the interaction between the agent and the environment produces transition tuples $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the state at time $t$, $a_t$ is the action at time $t$, $r_t$ is the feedback reward at time $t$, and $s_{t+1}$ is the next state. Experience replay breaks the correlations between consecutive experiences, making the training data closer to independent and identically distributed samples.
The Q-value is updated as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

in order to approach the optimal Q-value function $Q^*(s, a)$, where $\alpha$ is the learning rate and $\gamma$ is the discount factor. Equation (5) is the iterative update rule of the DQN, which adjusts the Q-values by computing the temporal difference (TD) error $\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$, enabling convergence to the optimal action value.
Remark 4. The DQN adopted in this article incorporates the experience replay mechanism, which can effectively mitigate variance in the process of gradient estimation. During the stable learning of the model, the target network is able to reduce the interference of noise factors on the performance of the algorithm network by suppressing the over-fitting phenomenon. The adoption of experience replay and the target network can effectively enhance the agent’s robustness against noise.
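A minimal experience replay buffer of the kind described above might look as follows; the class name and capacity handling are illustrative, not the paper's code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay sketch: stores (s, a, r, s_next) transitions
    and returns uniformly sampled mini-batches to decorrelate training data."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Bounding the buffer with `maxlen` keeps memory constant over long training runs while still mixing old and recent experience in each sampled batch.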
4.3. An MCTIL Strategy for the MMDQN Agent
Firstly, a local obstacle avoidance scenario database is constructed for the training of MMDQN agents. It is built from threat data triggered at any time along the entire flight route, based on the Monte Carlo method. Specifically, it traverses all possible locations of threat areas within the maximum detection range of the UAV at every location along the global flight route. For instance, while the UAV is flying towards a given route point, dynamic threat areas are traversed at all candidate locations between that route point and the farthest route point within the detection range. Ultimately, multiple triggers using the Monte Carlo method are executed to obtain the local obstacle avoidance scenario data.
Based on the design of the local re-planning task instruction parameters in Section 4.1, the location of the UAV at any moment can be simplified to the locations of the new global route points. The connections between two route points within the detection range of the UAV can be processed by Bresenham's line algorithm, which represents a line as discrete coordinate points and proceeds as follows:

In a grid-based planar space, define the equation of the straight line $y = mx + b$ from one route point to the next route point within the detection range, where $m$ is the slope and $b$ is the intercept. At each vertical grid line of the plane, define $d_1$ and $d_2$ as the deviations between the $y$ value of the intersection point of the line and the $y$ values of the two candidate coordinate points $(x_i + 1, y_i)$ and $(x_i + 1, y_i + 1)$:

$$d_1 = m(x_i + 1) + b - y_i, \qquad d_2 = (y_i + 1) - \left[ m(x_i + 1) + b \right],$$

where $(x_i, y_i)$ is the current coordinate point. Then, the following can be obtained:

$$d_1 - d_2 = 2m(x_i + 1) - 2y_i + 2b - 1.$$
If $d_1 - d_2 > 0$ is met, then the next grid coordinate is determined as $(x_i + 1, y_i + 1)$, with $y_i$ updated to $y_i + 1$. Conversely, under the condition $d_1 - d_2 \le 0$, the next coordinate becomes $(x_i + 1, y_i)$, with $y_i$ unchanged. This process continues iteratively, updating each point step by step until the final position of the segment is reached.
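The stepwise update above corresponds to the classic integer Bresenham formulation. The sketch below is the generalized all-octant variant, which replaces the explicit $d_1$/$d_2$ comparison with an equivalent integer error accumulator.

```python
def bresenham(x0, y0, x1, y1):
    """Integer-grid line rasterization (classic Bresenham, all octants):
    returns the discrete coordinate points approximating the segment."""
    points = []
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x0 < x1 else -1          # step direction in x
    sy = 1 if y0 < y1 else -1          # step direction in y
    err = dx - dy                      # decision variable
    while True:
        points.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 > -dy:                   # advance in x
            err -= dy
            x0 += sx
        if e2 < dx:                    # advance in y
            err += dx
            y0 += sy
    return points
```

For the segment from (0, 0) to (3, 1), this yields the grid cells (0, 0), (1, 0), (2, 1), (3, 1), matching the ideal line y = x/3 rounded to the nearest row.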
Next, based on the constructed local obstacle avoidance scenario database, the various scenarios are traversed multiple times using the Monte Carlo method. MCTIL sampling uses scenarios from the local obstacle avoidance scenario database as sample data and performs multiple iterative learning and training sessions through uniformly distributed sampling. Typically, the number of scenario samples, i.e., the number of scenarios in the scenario database, is positively correlated with the complexity of the global route, such as the journey distance and the number of waypoints; the subsequent simulations in this paper use 293 typical scenarios. Scenario traversal is combined with the model adaptation mechanism to invoke the model corresponding to each threat category for iterative learning and training in each scenario.
When training one model of the MMDQN agent in a specific scenario, the model continuously interacts with the environment to collect data and uses an ε-greedy strategy to select actions. The experience generated by each interaction is stored, and the network model parameters are updated by sampling mini-batches from the replay buffer. The training goal of the DQN is to minimize the mean square error between the predicted Q-value and the target value; that is, the loss is minimized based on the gradient descent method. The target network parameters $\theta^{-}$ are periodically copied from the policy network parameters $\theta$, which effectively ensures the relative stability of the target value. During a parameter update, the DQN first performs a forward calculation to obtain the predicted Q-value $Q(s, a; \theta)$ and the target value

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}).$$

Thus, the loss function is as follows:

$$L(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^2\right].$$

Then, back-propagation and gradient descent are performed on the DQN to minimize the loss function, which involves computing the gradient $\nabla_{\theta} L(\theta)$ of the loss function with respect to the parameters $\theta$. The parameters are updated using the computed gradients. For a comprehensive understanding, Algorithm 1 outlines the specific steps of the training procedure.
Algorithm 1. Pseudo-code for MMDQN agent training.
Loop1: Start Monte Carlo loop, For each Monte Carlo run:
    1. Interpolate the pre-planned global route points;
    Loop2: Start UAV position traversal loop, For each route point:
        Loop3: Start threat area location traversal loop:
            1. Determine the starting point for local re-planning;
            2. Determine the impact of the threat area location on the route (Case 1 or Case 2);
            3. Determine the target point and map range for local re-planning;
            4. Normalize the starting point on the local map;
            5. Load the DQN model corresponding to the current case;
            Loop4: Start training episode loop for the current DQN model, For each episode:
                1. Reset the environment and obtain the initial state;
                Loop5: Start execution step loop, For each step:
                    1. Select an action using the ε-greedy policy;
                    2. Execute the action; receive the reward and the next state;
                    3. Store the sample in the experience replay buffer;
                    4. Update the state;
                    5. When the experience replay buffer reaches the accumulation threshold, update the model:
                        a. Randomly sample a mini-batch from the buffer;
                        b. Compute the target values;
                        c. Calculate the loss;
                        d. Perform stochastic gradient descent;
                    6. Copy the policy network parameters to the target network every fixed number of steps;
                Loop5: End execution step loop, End For;
            Loop4: End training episode loop for the current DQN model, End For;
        Loop3: Until all threat area positions are processed;
    Loop2: End UAV position traversal loop, End For;
Loop1: End Monte Carlo loop, End For.
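The inner update of Algorithm 1 (ε-greedy action selection, TD target computation, and the mean-squared loss) can be sketched as follows. These helper functions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """With probability eps explore a random action, else exploit argmax Q."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_targets(batch, q_target_fn, gamma):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each sampled
    transition (s, a, r, s_next), as in step 5b of Algorithm 1."""
    return np.array([r + gamma * np.max(q_target_fn(s_next))
                     for (_, _, r, s_next) in batch])

def mse_loss(q_pred, y):
    """Mean squared TD error minimized by gradient descent (step 5c)."""
    return float(np.mean((q_pred - y) ** 2))
```

With `eps` annealed towards zero over training, the agent shifts gradually from exploration to exploitation while the loss above drives the policy network towards the target values.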
In summary, based on Monte Carlo methods, data of threat triggers at arbitrary times along the entire route is utilized to form a local obstacle-avoidance scenario database. Subsequently, through multiple iterations of scene traversal and multi-model adaptation to dynamic threat scenarios, the MMDQN agent is trained across these scenarios to update model parameters. This process enables the agent to learn the optimal planning route for each local obstacle-avoidance scenario, thereby completing the agent training.
4.4. Overall Method Analysis
This section analyzes the architectural composition and inherent structural characteristics of the local MMDQN agent.
4.4.1. Overall Structure and Algorithm Flow of Local Threat Avoidance
Based on the hierarchical route planning framework, global route planning can be accomplished. Then, according to the pre-planned global route and utilizing the training and application of the MMDQN agent, local obstacle-avoidance routes triggered by dynamic threats can be rapidly planned at any moment during the UAV flight. This enables effective coordination between the long-duration flight mission and dynamic threat avoidance, ultimately ensuring the completion of a safe UAV flight mission. The process of global route planning is detailed in Ref. [36]. Building on this foundation, the overall implementation of the algorithm in this paper is as follows.
The hierarchical route planning framework includes both global route planning and local route re-planning capabilities. Based on the pre-planned global flight route, the MMDQN agent route re-planning algorithm comprises two modes: agent training and agent application.
In the training mode, data obtained from Monte Carlo simulations of obstacle avoidance triggered by threats at any moment along the entire pre-planned route are utilized for the iterative learning of the agent. Specifically, during the UAV's traversal of its pre-planned global flight route, all possible locations of threat areas within the current maximum detection range are assessed. Through the re-planning task generation module, local re-planning task instruction parameters are obtained and used to train the MMDQN agent. By repeatedly triggering threat events, scene data are collected across the entire route via Monte Carlo simulations. This ensures sufficient data volume and convergence performance for the agent models.
In the application mode, when a threat area appears within the UAV's maximum detection range at any moment along the pre-planned global flight route, the corresponding local re-planning task instruction parameters are generated by the re-planning task generation module. The MMDQN agent then loads the appropriately trained model and performs forward computation to rapidly generate a local obstacle-avoidance route within the current flight plane for the given threat scenario.
Finally, the pre-planned global route is locally updated based on the re-planning results, completing the entire flight mission. The overall method structure and algorithm flow, including the framework logic and the training and application processes of the MMDQN agent, are illustrated in Figure 7 and Figure 8, respectively.
4.4.2. Complexity Analysis of Local Threat Avoidance
According to the overall method structure and algorithm flow, the proposed method includes two key modules: re-planning task generation and local route re-planning. The theoretical complexity analysis of these modules is conducted as follows.
The computational and space complexity of re-planning task generation are both low, being determined by the number of global interpolated route points and the number of threat scenario categories.
For local route re-planning, a training mode and an application mode are used. In the training mode, the space complexity of the experience replay buffer is $O(M \cdot S)$, where $M$ is the size of the experience replay buffer and $S$ is the size of the state. The computational complexity per batch update includes forward computation and back-propagation. The computational complexity of the forward computation is $O(S N_1 + N_1 N_2 + N_2 N_a)$, where $N_1$, $N_2$, and $N_a$ are the numbers of nodes in the first hidden layer, the second hidden layer, and the action space, respectively. The back-propagation complexity is of the same order as the forward computation.
In the application mode, the computational complexity per action is that of a single network forward pass, while over the agent's overall application phase it scales with the number of local route re-planning tasks multiplied by the number of actions selected per locally re-planned route.
In the application mode of the MMDQN, the computation for each action only utilizes one of the agent's model networks. By contrast, the network architecture of a conventional single-model DQN agent designed to address the two categories of threats discussed in this paper is more complex than any individual model network in the proposed MMDQN; specifically, its number of hidden layer nodes exceeds that of the MMDQN agent's sub-models. Evidently, the proposed MMDQN achieves lower computational complexity and superior performance compared to a conventional DQN. MMDQN also holds a substantial computational complexity advantage over classical algorithms such as A*, as it performs a constant-time forward pass independent of environmental scale, while traditional planners must perform a state-dependent search, whose cost grows with map size, for each decision.
The method demonstrates favorable computational and space complexity. In particular, during the agent application mode, dynamic threat obstacle avoidance is performed simply by generating the re-planning task and executing the agent's forward computation, which ensures high efficiency in real-time applications.
5. Numerical Simulations
In this section, an implementation case based on a real mountainous environment is studied to illustrate the effectiveness of the proposed method. To concisely analyze the advantages of the hierarchical route planning framework and the local route re-planning agent, we build our design and validation directly on the global route planning method published in Ref. [
36]. A systematic quantitative evaluation based on numerical performance metrics is presented, covering the effectiveness and robustness of the proposed method, the training convergence of the agent, the task success rate, and task completion.
Part 1. Verification of the hierarchical framework
To illustrate the availability of the hierarchical route planning framework for coordination between global route planning and dynamic obstacle avoidance, we construct the framework structure described in
Section 3. The elevation terrain data were acquired through actual detection. The global starting point and target point are (3, 175) and (351, 50), respectively. Based on the global route planning method in Ref. [
36], the UAV global flight plan can be obtained, characterized by a best flight plane height of 571, a back-up plane height of 747, and a flight route point sequence of {(3, 175, 571), (56.4286, 166.8571, 571), (266, 87, 571), (314, 74, 571), (335, 57, 747), (351, 50, 571)}.
The global route of the flight plane and a 3D map are shown in
Figure 9.
Here, the interaction interface is based on the hierarchical route planning framework. As demonstrated by simulations, it effectively completes global route planning and can support local re-planning triggered by threats at any moment. This ensures the effectiveness of the coordination between global route planning and dynamic obstacle avoidance.
Part 2. Simulation of the training process
To illustrate the convergence of the MMDQN agent with the MCTIL strategy, the training process and simulation results of the MMDQN agent are presented.
For simplicity, the MMDQN agent structure for local route re-planning is based on two scenarios: dynamic threat areas covering flight route points, or covering the connections between them. These are handled by two DQN models with identical neural network architectures.
The network structure of each DQN model is built as follows. The input layer has four nodes. The first fully connected layer has 500 nodes, with randomly assigned initial connection weights and initialized bias values. The activation function for the first ReLU layer is $f(x_1) = \max(0, x_1)$, where $x_1$ represents the outputs of the first fully connected layer. The second fully connected layer has 400 nodes, likewise with randomly assigned initial connection weights and initialized bias values. The activation function for the second ReLU layer is $f(x_2) = \max(0, x_2)$, where $x_2$ represents the outputs of the second fully connected layer. The output layer has four nodes.
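The 4-500-400-4 architecture described above can be sketched as a plain NumPy forward pass. The Gaussian weight scale (0.1) and zero biases are illustrative assumptions; the paper states only that the weights and biases are randomly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes as described: 4 inputs, 500- and 400-node hidden layers, 4 outputs.
# The weight scale (0.1) and zero biases are assumptions for illustration.
W1 = rng.normal(scale=0.1, size=(4, 500));   b1 = np.zeros(500)
W2 = rng.normal(scale=0.1, size=(500, 400)); b2 = np.zeros(400)
W3 = rng.normal(scale=0.1, size=(400, 4));   b3 = np.zeros(4)

def q_values(state):
    x1 = np.maximum(0.0, state @ W1 + b1)  # first ReLU layer: f(x) = max(0, x)
    x2 = np.maximum(0.0, x1 @ W2 + b2)     # second ReLU layer
    return x2 @ W3 + b3                    # one Q-value per discrete action

# State layout (agent x, agent y, target x, target y) follows Remark 5.
q = q_values(np.array([3.0, 175.0, 351.0, 50.0]))
print(q.shape)  # one Q-value for each of the four actions
```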
Remark 5. The input of the agent is the state, and the output is the action. Both terrain and threats are defined as obstacles and serve as strict positioning constraints. In view of this, an agent that reaches the target point on the grid map completes the path search task. The positions of the agent and the target point are each denoted by X and Y coordinates in the flight plane; as such, a state with four inputs is sufficient. Since the state is composed of discrete position coordinates on the rasterized map, no additional standardization processing is required. The agent’s actions include moving forward, backward, left, and right in the grid-based map.
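The state and action conventions of Remark 5 can be summarized in a minimal sketch: the state is (agent x, agent y, target x, target y) on the rasterized flight plane, and each of the four discrete actions moves the agent by one grid cell. The particular action-index ordering below is an assumption for illustration.

```python
# Minimal sketch of the Remark 5 state/action convention.
# Action indices and their axis directions are illustrative assumptions.
ACTIONS = {0: (0, 1),    # forward
           1: (0, -1),   # backward
           2: (-1, 0),   # left
           3: (1, 0)}    # right

def step(state, action):
    """Apply one grid move; the target coordinates stay fixed in the state."""
    ax, ay, tx, ty = state
    dx, dy = ACTIONS[action]
    return (ax + dx, ay + dy, tx, ty)

s = (3, 175, 351, 50)          # global start and target from the case study
print(step(s, 3))              # → (4, 175, 351, 50)
```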
The parameters for training the MMDQN agent include the maximum detection distance of the UAV, the number of Monte Carlo iterations, the number of training episodes for each specific local obstacle avoidance case, the maximum number of steps per episode, the experience replay buffer size, the random batch size, the target network update interval, the learning rate, the discount factor, the ε-greedy exploration strategy, and its decay rate.
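Among these parameters, the ε-greedy strategy with decay governs the exploration-exploitation balance during training. The sketch below shows the standard mechanism; the initial epsilon (1.0), multiplicative decay rate (0.995), and floor (0.05) are illustrative assumptions, not the paper's values.

```python
import random

# Hedged sketch of epsilon-greedy action selection with multiplicative decay.
# The numeric values (1.0, 0.995, 0.05) are illustrative assumptions.

def select_action(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))               # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

epsilon, decay, floor = 1.0, 0.995, 0.05
for episode in range(3):
    a = select_action([0.1, 0.4, 0.2, 0.3], epsilon)
    # ... environment step and learning update would go here ...
    epsilon = max(floor, epsilon * decay)                    # anneal exploration

print(round(epsilon, 6))
```

As epsilon decays toward its floor, the agent shifts from random exploration of the grid map toward exploiting the learned Q-values.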
The iterative training process follows the MCTIL strategy described in
Section 4, where the specific local obstacle avoidance case of each episode is designed as follows:
Environmental constraints: The terrain, boundaries, and dynamic threats within the grid-based flight plane are all considered obstacles.
Initial condition: The local starting point is defined.
Termination conditions: The task is complete if the agent reaches the local target point or the maximum number of steps allowed is exceeded.
Reward mechanism: The scores are calculated based on environmental reward values, the specific details of which are outlined in
Table 3.
As shown in
Table 3, the reward mechanism established in this article is defined as follows. The agent’s score increases by 2 points when it arrives at the local target point; otherwise, 0 points are awarded. During the path search process, 0.01 points are deducted per step if the agent fails to reach the local target point or to change location; this penalty forces the agent to keep moving rather than stay in place. If the distance to the local target point in the X or Y direction within the flight plane decreases, the agent’s score increases by 0.05 per step; if that distance increases, 0.05 points per step are deducted. This reward-and-punishment scheme allows the agent to approach the target point in a gradual and stable manner. The final reward score of the agent is obtained by summing the above bonus and penalty items.
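The per-step reward described above can be sketched directly. One point of ambiguity is the 0.01 deduction; the reading implemented here, that it applies when the agent fails to change location, is an interpretation of the text, and the axis-wise shaping terms follow the description verbatim.

```python
# Sketch of the Table 3 per-step reward as described in the text:
# +2 on arrival at the local target; -0.01 when the agent does not move
# (one reading of the stated penalty); +/-0.05 per axis on which the
# distance to the target decreases/increases.

def step_reward(prev_pos, pos, target):
    if pos == target:
        return 2.0                       # arrival bonus
    r = 0.0
    if pos == prev_pos:
        r -= 0.01                        # penalize staying in place
    for i in (0, 1):                     # X and Y axes of the flight plane
        d_prev = abs(prev_pos[i] - target[i])
        d_now = abs(pos[i] - target[i])
        if d_now < d_prev:
            r += 0.05                    # moved closer on this axis
        elif d_now > d_prev:
            r -= 0.05                    # moved away on this axis
    return r

print(step_reward((5, 5), (6, 5), (10, 5)))   # → 0.05 (closer in X)
print(step_reward((9, 5), (10, 5), (10, 5)))  # → 2.0 (arrival)
```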
The reward score for each step of the agent is
$r_t = 2\,\delta_{\text{arrive}} - 0.01\,\delta_{\text{stay}} + 0.05\,(\delta_{x^-} + \delta_{y^-}) - 0.05\,(\delta_{x^+} + \delta_{y^+})$,
where $\delta_{\text{arrive}}$, $\delta_{\text{stay}}$, $\delta_{x^-}/\delta_{y^-}$, and $\delta_{x^+}/\delta_{y^+}$ are indicator variables for arrival at the local target point, failure to change location, and a decrease/increase in the X/Y distance to the target, respectively.
Figure 10 illustrates the training curve of the agent in a specific local obstacle avoidance case presented in
Figure 11.
Figure 11 shows the results of local route re-planning. The overall map covers an area of 8.9 km × 19.6 km with a resolution of 100 m.
According to the training curve, in this specific case, after 124 training episodes, the average reward converged to 2.21 and remained stable. Meanwhile, the re-planned route in this case effectively completed obstacle avoidance. Through iterative training and traversing the diverse threat scenario library until convergence, the MMDQN agent was capable of handling multiple threat scenarios. These observations highlight the stability and strong convergence properties of this method as well as its ability to generate effective re-planning solutions.
Part 3. Performance of the MMDQN agent
The simulation results of the MMDQN agent’s application are provided based on a hierarchical route planning framework. This illustrates the agent’s operational reliability during task sequence logic management for global route planning based on multiple instances of local route re-planning at various times. It also demonstrates the applicability of the MMDQN agent with the MCTIL strategy in diverse obstacle avoidance scenarios.
During the global flight route, the UAV autonomously detects areas of threat and completes local route re-planning to avoid dynamic obstacles. According to measurements, during five instances of local obstacle avoidance, the maximum, minimum, and average times required for local path re-planning were 0.8058 s, 0.1610 s, and 0.4181 s, respectively, enabling rapid evasion of dynamic threats. For comparison, under identical conditions, the single-model DQN agent for local route re-planning had a maximum time of 1.5271 s, a minimum time of 0.4057 s, and an average time of 0.9834 s.
Remark 6. Considering that the focus of this study is not on improving the speed of local path re-planning, this paper only provides a brief illustration of the real-time ability for local obstacle avoidance compared to the single-model DQN agent, without further comparative analysis with other methods.
The final obstacle avoidance route within the flight plane is illustrated in
Figure 12.
The flight trajectory in 3D space is shown in
Figure 13.
To evaluate the advantages of the proposed method, we conducted a comparative analysis of three approaches, focusing on their ability to avoid local dynamic threats: a single-model agent, the MMDQN with the Monte Carlo stochastic iterative learning (MCSIL) strategy, and the MMDQN with the MCTIL strategy. The single-model agent is designed based on the conventional DQN algorithm. The MMDQN with the MCSIL strategy employs the same MMDQN architecture and conducts training using multiple randomly generated dynamic threat scenarios. Along the pre-planned global flight route, multiple local obstacle avoidance cases were generated for various threat scenarios, and identical experimental conditions were applied to each of the three methods in each scenario. By conducting simulation tests focused on dynamic threat avoidance throughout the entire flight route, statistics for evading threats at any time and location could be summarized, as shown in
Table 4.
Remark 7. The local route planning method proposed in this article mainly focuses on structural and iterative-learning improvements to the DQN rather than on the actual types of threats; comparisons related to these improvements are provided accordingly. Classic non-DRL intelligent methods are not within the primary scope of this paper. In addition to the DQN, a variety of improved algorithms exist in the field of DRL, including Double DQN and Dueling DQN. Despite differences in their design mechanisms, these algorithms can all be classified as single-model architectures. The MMDQN proposed in this paper is a multi-model reinforcement learning algorithm improved on the basis of the traditional DQN, and its performance is significantly superior to that of comparable single-model algorithms. For this reason, the authors selected only DQN-based single-model agents as the control group for the comparative experiments. As shown in Table 4, the MMDQN with the MCTIL strategy achieves better comprehensive performance than the MMDQN adopting the MCSIL strategy, as well as all types of single-model agents.
Remark 8. In these tests, successful threat avoidance is divided into two categories: reliable avoidance, where the agent strictly moves to the local target point to complete threat avoidance, and feasible avoidance, where the agent does not strictly move to the local target point but can still achieve threat avoidance. The success rate was calculated by dividing the number of successful threat avoidances by the total number of tests.
According to the statistical results, the success rate of dynamic threat avoidance for UAVs is 82.59%. Under identical neural network architectures, this success rate significantly surpasses that of the single-model agent, which reached only 44.70%. This represents an improvement of 37.89 percentage points, validating the effective performance of the proposed MMDQN agent for local route re-planning under the test conditions.
With the identical MMDQN agent network structure, our proposed MCTIL strategy demonstrates superior performance compared to the MCSIL strategy: while MCSIL achieves a threat avoidance success rate of 74.74%, our strategy improves on this by 7.85 percentage points. This significant enhancement validates the reliability and effectiveness of the MCTIL strategy.
It is worth noting that, throughout the aforementioned experiments, the proposed framework and algorithms operated stably without any system errors or crashes caused by robustness-related issues. Moreover, for the comparative experiments discussed in
Section 3 of this paper, the three local route re-planning methods were successfully executed for 879 continuous trials under different threat emergence scenarios, further validating the robustness of the approach.