Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs

Dong, Boyu; Zhang, Yuzhen; Yuan, Peiyuan; Lu, Shuntong; Huang, Tao; Zhang, Gong

doi:10.3390/drones10010057

Open AccessArticle

Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs

by

Boyu Dong

^1,2,*

,

Yuzhen Zhang

²,

Peiyuan Yuan

²,

Shuntong Lu

²

,

Tao Huang

² and

Gong Zhang

¹

School of Electronics and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

²

AVIC Aviation Electronics Co., Ltd., Beijing 100028, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(1), 57; https://doi.org/10.3390/drones10010057

Submission received: 10 December 2025 / Revised: 11 January 2026 / Accepted: 12 January 2026 / Published: 13 January 2026

(This article belongs to the Special Issue Advances in AI Large Models for Unmanned Aerial Vehicles)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A hierarchical route planning framework is constructed, including global route planning and deep reinforcement learning-based local route re-planning.
This study proposes a novel multi-model deep Q-network (MMDQN) architecture for local obstacle avoidance, which is trained using a designed Monte Carlo traversal iterative learning (MCTIL) strategy.

What are the implications of the main findings?

The hierarchical route planning framework can overcome the coordination issue of global route planning and local real-time obstacle avoidance in complex real-world environments.
MMDQN architecture offers a dual advantage over conventional single-model agents: it fundamentally simplifies the network structure and training. The integration of the MCTIL strategy further enhances the agent’s adaptability and ensures reliable real-time obstacle avoidance throughout the flight mission.

Abstract

Efficient route planning technology is the core support for ensuring the successful execution of unmanned aerial vehicle (UAV) flight missions. In this paper, the coordination issue of global route planning and local real-time obstacle avoidance in complex mountainous environments is studied. To deal with this issue, a hierarchical route planning framework is designed, including global route planning and AI-based local route re-planning using deep reinforcement learning, exhibiting both flexible versatility and practical coordination and deployment efficiency. Throughout the entire flight, the local route re-planning task triggered by dynamic threats can be executed in real time. Meanwhile, a multi-model DQN (MMDQN) agent with a Monte Carlo traversal iterative learning (MCTIL) strategy is designed for local route re-planning. Compared to existing methods, this agent can be directly used to generate local obstacle avoidance routes in various scenarios at any time during the flight, which simplifies the complicated structure and training process of conventional deep reinforcement learning (DRL) agents in dynamic, complex environments. Using the framework structure and MMDQN agent for local route re-planning ensures the safety and efficiency of the mission, as well as local obstacle avoidance during global flights. These performances are verified through simulations based on actual terrain data.

Keywords:

hierarchical route planning framework; AI-based local route re-planning; multi-model DQN agent; deep reinforcement learning; Monte Carlo traversal iterative learning

1. Introduction

With the wide application of UAVs in complex real-world environments such as local wars, post-disaster rescue operations, and mountainous transportation [1,2], efficient route planning technology has become a core component in ensuring the reliable execution of UAV missions [3]. Traditional global planning methods based on static environments can generate reference paths under known constraints [4]. However, in real dynamic environments, such methods gradually become limited [5]. In actual mission scenarios, UAVs not only need to deal with sudden threats, dynamic obstacles, and variable weather conditions, but must also consider multiple factors, such as the task sequence, energy consumption constraints, and real-time safety [6]. Therefore, relying solely on global planning in static environments is insufficient to meet mission demands in complex environments. In this context, for specific mission scenarios, UAVs can be viewed as autonomous agents. Their real-time obstacle avoidance and online route re-planning capabilities must be studied under dynamic constraints. This is key to enhancing the autonomy and robustness of UAVs in unknown dynamic environments [7].

Engineers and scholars have developed multiple dynamic route planning methods for unmanned systems, which can be broadly divided into the following categories.

Mathematical programming methods, such as mixed integer linear programming (MILP) and receding horizon control (RHC), encompass both linear and nonlinear programming approaches. These methods are based on the Bellman optimality principle and generate decision sequences. They are simple to implement and can generate globally optimal solutions. However, as the complexity of system dynamics or environmental constraints increases, the difficulty of solutions and computational load of this method increase significantly. Thus, these methods are more suitable for small-scale and simple scenarios [8,9].

Roadmap methods, such as the visibility graph (VG) and Voronoi diagram, have simple principles. They are capable of comprehensively considering multiple factors, such as path cost and threat distance [10]. However, in three-dimensional (3D) environments, the analysis becomes complex. Researchers commonly adopt the cutting plane method for simplification, which addresses the 3D spatial route planning problem by conducting route searches in multiple two-dimensional planes.

Potential field methods, such as artificial potential field (APF) and stream function (SF), have low computational complexity and high real-time performance. They can generate smooth routes online. However, the problem of avoiding local minima needs to be carefully considered if these methods are implemented [11].

Stochastic planning methods, such as the Rapid Exploring Random Tree (RRT), are probabilistically complete. However, the optimality of these routes can be influenced by randomness in node sampling [12,13].

Heuristic search methods, such as A* and its various improved algorithms, utilize heuristic functions to guide the search direction, enabling optimal paths to be quickly obtained [14,15]. However, they rely heavily on precise heuristic functions. Complex dynamic environments require frequent and extensive online searches, resulting in significant computational demands.

Biomimetic intelligent methods, such as the genetic algorithm (GA) and particle swarm optimization (PSO), perform exceptionally well in global optimization [16]. However, considering their optimality and convergence rate, these methods are more suitable for static route planning problems with low uncertainty and simple environments [17,18].

Machine learning methods, represented by reinforcement learning (RL) and deep learning (DL), have experienced rapid development in recent years. By modeling the environment as a Markov Decision Process (MDP), the DRL agents optimize routes through interactive trial and error with the environment. The agent has a computationally efficient execution process, which continuously learns and improves optimization [19]. These advantages make it a leading trend in dynamic obstacle avoidance route planning [20]. Compared to classical local obstacle avoidance methods (e.g., A*, APF, DWA), DRL-based approaches exhibit superior adaptability and efficiency in dynamic, partially observable environments. In particular, value function approximation methods, such as deep Q-network (DQN) and its improved variants, enable UAVs to make autonomous decisions in unknown environments through end-to-end learning [21,22]. However, standard DQN faces major challenges when dealing with large-scale state spaces, particularly regarding the overestimation of Q-values, unstable training processes, and slow convergence rates [23,24]. Although improved algorithms, such as Double DQN and Dueling-DQN, have partially alleviated these issues [25,26], they largely operate within static environments based on the “flat” single-model architecture for route planning. When facing complex dynamic environments involving long-duration flights and multi-threat scenarios, the network structures become complex and large-scale, leading to increased training difficulty or even failure to converge. Additionally, it is difficult to simultaneously balance global optimality and real-time local obstacle avoidance across diverse threat scenarios [27,28]. Notably, while many methods perform well in hypothetical grid-based environment simulations, they lack sufficient validation of their effectiveness in real, complex, unstructured mountainous environments [29,30].

In recent years, hierarchical approaches integrated with deep reinforcement learning (DRL) have emerged as a predominant research direction. These methods primarily enhance performance by fusing global and local modules in various ways. Coordinating the optimality of global path planning with the safety of local obstacle avoidance is critical in such hierarchical planning architectures. Existing research has sought to improve performance through various pathways for integrating DRL: for instance, designing modular hierarchies (e.g., “DRL-based high-level decision-making with rule-based low-level control”) [31], constructing serialized pipelines (e.g., “DRL-based global waypoint planning with model-based local tracking”) [32], or implementing algorithmic fusion (e.g., traditional optimization algorithms for global planning combined with DRL for local switching) [33]. However, when confronted with complex, dynamic, and unknown environments (such as intricate mountainous terrain), these methods still encounter challenges, including insufficient real-time synergy between global and local modules, complex structures and difficulties training local DRL agents, and limited flexibility in response to sudden and variable threats.

Thus, existing route planning methods struggle to achieve full-mission path planning and real-time obstacle avoidance for threats during flights in complex dynamic environments. Furthermore, most intelligent route planning methods are designed under static environment assumptions based on a single-model agent, resulting in low applicability to obstacle avoidance in various local dynamic scenarios.

In summary, UAVs must be capable of performing rapid evasion under any threat-triggering time scenarios in three-dimensional space. Although multiple prior works have been conducted on UAV path planning and obstacle avoidance, existing classical and intelligent methods have not adequately investigated and addressed the issue of evading random threats that can occur at any time and at any position along the global flight route in three-dimensional space [34,35]. The limitations and challenges of the current main approaches for such scenarios, as proposed in this paper, are summarized as follows: (1) Offline global planning cannot achieve real-time dynamic obstacle avoidance; (2) single-model agents for local route re-planning involve complex structures; and (3) most agents are designed for specific fixed obstacle scenarios, which results in a lack of adaptability to diverse real-world situations. These factors result in existing route planning methods being unable to meet the requirements of UAVs for the entire flight mission cycle.

To overcome these issues, we proposed a hierarchical route planning framework and MMDQN agent-based intelligent obstacle avoidance for UAVs. The main contributions of this paper are summarized as follows: (1) A hierarchical route planning framework is designed to address the coordination issues between global planning and local optimization in route flight missions. This scheme retains the flexibility inherent in hierarchical architectures. Specifically, while ensuring adaptability through global planning methods that represent the environment, it directly addresses core practical issues such as coordination and deployment efficiency via a tightly coupled DRL-based local module. (2) Based on the hierarchical framework, an MMDQN agent is designed for different threat scenarios, along with a dynamic threat adaptation mechanism, to reduce the complexity of neural network structure design and training for conventional single-model agents. (3) To train the MMDQN agent, we propose an MCTIL strategy based on threat-triggered events at any time during the entire flight path, improving the applicability and reliability of the route planning agent.

The rest of this paper is organized as follows: Section 2 presents a description of the route planning issue. Section 3 details the hierarchical route planning framework, as well as global planning and local dynamic optimization. The detailed algorithms are presented in Section 4. Section 5 presents the simulation results of the proposed methods. The conclusions and future works are outlined in Section 6.

2. Description of the Route Planning System

The flight mission scenarios of UAVs in real-world environments involve both complex terrain avoidance and dynamic threat evasion at any time and place. Complex terrains act as global static obstacles, while dynamic threats represent local dynamic obstacles that appear at specific times. Therefore, flight route planning can be considered the identification of a suitable route in environments with both static and dynamic obstacles.

In this study, we used map grid-based processing, a common map processing method. The route for the entire flight mission cycle can be planned using this method to combine the global flight plan with dynamic, rapid obstacle avoidance capabilities.

Remark 1.

Given that the path search in this article was carried out on the basis of a discretization map, both states and actions are accordingly treated as discrete variables for analysis. In addition, the optimization of the smoothness and continuity of the planned route is not the primary focus of this study.

Thus, the assumptions regarding environmental knowledge and obstacle dynamics are presented as follows: the UAV is flying in the grid-based map; the terrains are viewed as static obstacles; the dynamic threats are viewed as local dynamic obstacles that emerge at any time and place; dynamic threats remain stationary after emerging, which is predictable and typical; the detection is perfect, instantaneous and noise-free.

This study is based on the 3D global route planning method with multi-flight planes provided by Ref. [36], which uses an offline search approach to avoid static obstacles during the global flight mission. Considering the real-time and efficient performance of deep reinforcement learning intelligent route planning, this method can be used to design a quick local route re-planning algorithm within flight parameters. This enables online searches for routes that avoid obstacles when dynamic threats emerge. Furthermore, through the hierarchical route planning framework, algorithms searching offline for routes can be combined with those searching for dynamic obstacle avoidance routes. Ultimately, the route planning requirements for the entire flight mission cycle are transformed into global offline and local online route search task sequence logic management issues. Additionally, a minimum-cost local route search agent is designed based on scenarios that can trigger threats at any time during the entire flight in the grid map.

3. Global Planning and Local Dynamic Optimization-Based Hierarchical Route Planning Framework

To address the issue of coordinating global route planning with real-time local obstacle avoidance in complex mountainous environments, this section introduces a hierarchical route planning framework. It seamlessly integrates offline global planning with dynamic local online optimization. This approach ensures comprehensive management of task sequence logic for long-duration UAV flights and various dynamic threat avoidance tasks, ultimately guaranteeing reliable and safe flight operations.

In the global planning phase, based on elevation terrain data, any feasible offline optimal planning approach can be adopted to ensure the reliable execution of long-duration flight missions. For instance, approaches include the 3D spatial plane segmentation technique, critical path points optimization using the Voronoi algorithm, and the A* heuristic search method. Additionally, effective solutions may integrate multiple methods to achieve superior results.

For the local dynamic optimization task, local route re-planning is implemented based on a dynamic threat triggering mechanism. Specifically, the DRL intelligent algorithm is employed to achieve online rapid resolution of local routes to guarantee the safety of real-time flights for UAVs.

The hierarchical route planning framework is shown in Figure 1.

Remark 2.

This section focuses on the overall framework of the proposed algorithm, analyzing the operational logic of the path search algorithm at both the global and local levels. This was performed without improving restrictions on the selection of specific global and local path planning algorithms.

The specific functions of each module are presented in Table 1, where the global route planning, re-planning task generation, and local route re-planning modules are the core algorithms of the hierarchical framework.

The relationships between modules are shown in Figure 2.

The operational logic for all modules within the framework is depicted in Figure 3.

The operational process for framework implementation is outlined as follows:

The Module Invocation Management Module is responsible for initiating the entire mission framework.
The starting and target points of the flight mission are set.
A global route is planned using any feasible optimal planning approach.
The UAV flies from the starting point along the global route.
During the flight, threat areas are dynamically generated, affecting the original UAV flight route.
When the UAV autonomously detects the threat area, it performs intelligent local route re-planning to avoid obstacles.
The original global flight route is updated with the local re-planned route.
The UAV continues flying along the updated route.
Steps 5 to 8 are repeated in the red frame of Figure 3 until the UAV reaches the global target point.

4. Algorithm Design

In this section, we provide a detailed description of the core model and its related algorithm designs. Firstly, based on the global route planning provided in Ref. [36], we introduce a local re-planning task generation module implemented when dynamic threats are detected during the flight. The instruction parameters generated by this module can be used to identify routes that avoid obstacles through a local route re-planning module. Secondly, we design an MMDQN agent for local route re-planning, which simplifies the network structure of the agent while adapting to various scenarios and avoiding dynamic threats. Next, we develop an MCTIL strategy to train this agent. After training, different scenarios can be designed to test the online application of the agent. Finally, we provide an overall description of the method structure and algorithm flow.

4.1. Threat-Triggering Local Replanning Task Generation Based on Global Routes with Multi-Flight Planes

Based on the global planning method with multi-layer spatial planes proposed in Ref. [36], when flying along a pre-planned global route, a UAV can detect all dynamic threats within distance

L

in real time. The UAV needs to autonomously avoid dynamic threats that affect flight safety. The threat-triggering local re-planning task generation module is the link between global flight and dynamic obstacle avoidance and is mainly responsible for the generation of task instructions during local route re-planning and the standardized preprocessing of obstacle avoidance scenarios.

To address local dynamic threat avoidance within the detection range, interpolation of the pre-planned global flight path is required. Specifically, if the distance between two adjacent points in the route point sequence is greater than

L

, the minimum number of route points between them can be inserted. This means that the distance between any two adjacent points in the resulting global route sequence is no greater than

L

; otherwise, the pre-planned global route remains unchanged. The process of implementing this interpolation is as follows:

Calculate the distance between two adjacent points in the pre-planned global flight route segment by segment:

{D i s t}_{i + 1} = \sqrt{{(x_{i + 1} - x_{i})}^{2} + {(y_{i + 1} - y_{i})}^{2}}

(1)

where

(x_{i}, y_{i})

and

(x_{i + 1}, y_{i + 1})

are the original pre-planned route points;

i = 1, \dots, N_{g} - 1

; and

N_{g}

is the number of the original pre-planned route points.

{N u m}_{i + 1} = c e i l (2 \times \frac{{D i s t}_{i + 1}}{L}) - 1

(2)

where

c e i l (*)

is the ceiling function. Then, the coordinates of the route points inserted between the original pre-planned route points

(x_{i}, y_{i})

and

(x_{i + 1}, y_{i + 1})

are calculated as follows:

x_{i j} = x_{i} + j \times (x_{i + 1} - x_{i}) / ({N u m}_{i + 1} + 1)

(3)

y_{i j} = y_{i} + j \times (y_{i + 1} - y_{i}) / ({N u m}_{i + 1} + 1)

(4)

where

j = 1, \dots, {N u m}_{i + 1}

. Combined with the altitude settings for route point interpolation, a new sequence of pre-planned global flight route points is ultimately formed, with the total number of route points being

N

. In scenarios where the local starting point and target point reside on separate flight layers, the higher back-up plane is assigned to the insertion points for improved obstacle avoidance. The altitude setting rules for inserted route points are given in Figure 4.

Then, based on the new global route point sequence after interpolation and the location of dynamic threats, local re-planning task instruction parameters are generated. Assuming that a UAV is flying towards the

(n - 1)

-th route point

(n

≤

N)

currently, the design of the local re-planning task instruction parameters is as follows:

Starting point for local replanning

This is the

(n - 1)

-th route point that the UAV is currently flying to.

2.: Local target point and map range

If the dynamic threat area covers the

n

-th route point, then the

(n + 1)

-th route point will be the local target point. If the dynamic threat area is located on the line connecting the

(n + 1)

-th route point to the

n

-th route point, then the

n

-th route point will be the local target point. The range of the local re-planning map is designed according to the specific location of the dynamic threat, as shown in Table 2.

Remark 3.

The method for setting the local starting point and target point with a certain spatial distance between the UAV and the threat helps mitigate the impact of uncertainties in threat detection and perception in real-world scenarios. This was considered despite this paper’s focus on studying idealized detection scenarios. Threats affecting flights are typically classified into waypoint-impacting and path-impacting scenarios. Since this paper assumes that threats remain stationary after appearing, and hybrid scenarios involving both threat types occur when threats are in motion, such cases are excluded from the analysis.

Thus, based on the starting point, target point, and local map range for local route re-planning, a local re-planning 2D map scenario can be generated with dynamic threat areas. Two types of scenarios are shown in Figure 5.

4.2. The Design of the MMDQN Agent for Local Route Re-Planning

Based on the parameters of the local re-planning task instructions, an intelligent DRL method can effectively generate online local obstacle avoidance flight routes. To efficiently accommodate different threat impact scenarios and reduce the complexity of neural network design and training for a single-model agent, we constructed an MMDQN agent with model adaptation mechanisms. This was designed to address two categories of dynamic threats mentioned in Section 4.1. The general structure and implementation scenario diagram of the local re-planning intelligent agent is shown in Figure 6.

The MMDQN agent consists of two DQN models and a model adaptation mechanism.

The model adaptation mechanism operates by evaluating the overlap between threat zones and the new global route point sequence, thereby invoking the corresponding DQN model for application.

The two DQN models correspond to scenarios where the threat area covers route points and the connections between two adjacent route points, respectively. For each DQN model, the network includes two components: a policy network and a target network. The policy network is used to select actions and predict Q-values, while the target network generates target Q-values to provide a stable reference for updating the policy network.

The structure of the DQN consists of an input layer, a fully connected layer, a ReLU layer, another fully connected layer, another ReLU layer, and an output layer. The input layer is primarily responsible for the standardized input of state information, converting environmental state data into a tensor format that the network can process. The ReLU layers are activation layers, which can perform nonlinear transformations on the outputs of the fully connected layers, enabling the network to estimate complex Q-value functions. The fully connected layers flatten the high-dimensional features, facilitating global feature fusion and providing compact feature vectors for the output layer. The primary function of the output layer is to produce a Q value corresponding to each available action in the current state.

Experience replay is an important part of DQN. During the training process, the interaction between the agent and the environment determines

(s_{t}, a_{t}, r_{t}, s_{t + 1})

, where

s_{t}

represents the state at time

t

;

a_{t}

is action at time

t

;

r_{t}

is feedback reward at time

t;

and

s_{t + 1}

represents the next state at time

t + 1

. Experience replay can prevent correlations between experiences, making the training data less identical.

The Q value is updated as

Q (s_{t}, a_{t}) = Q (s_{t}, a_{t}) + α [r_{t + 1} + γ m a x Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})]

(5)

in order to approach the optimal Q value function

Q^{*} (s, a)

, where

α

is the learning rate, and

γ

is the discount factor. Equation (5) is the iterative update rule of the DQN, which adjusts the Q-values

Q (s_{t}, a_{t})

by computing the temporal difference (TD) error

r_{t + 1} + γ m a x Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})

, and enabling convergence to the optimal action value

Q^{*} (s, a)

.

Remark 4.

The DQN adopted in this article incorporates the experience replay mechanism, which can effectively mitigate variance in the process of gradient estimation. During the stable learning of the model, the target network is able to reduce the interference of noise factors on the performance of the algorithm network by suppressing the over-fitting phenomenon. The adoption of experience replay and the target network can effectively enhance the agent’s robustness against noise.

4.3. An MCTIL Strategy for the MMDQN Agent

Firstly, a local obstacle avoidance scenario database is constructed for the training of MMDQN agents. It is built by threat data triggered at any time along the entire flight route based on the Monte Carlo method. Specifically, it traverses all possible locations of threat areas within the maximum detection range

L

of the UAV at every location along the global flight route. For instance, if the UAV is flying towards the

(n - 1)

-th route point, dynamic threat areas need to be traversed at all locations from the

(n - 1)

-th route point to the

n

-th route point. Ultimately, multiple triggers using the Monte Carlo method are executed to obtain local obstacle avoidance scenario data.

Based on the design of the local re-planning task instruction parameters in Section 4.1, the location of the UAV can be simplified at any moment to the locations of the new global route points. The connections between two route points within the detection range of the UAV can be processed by Bresenham’s line algorithm. Bresenham’s line algorithm represents a line as discrete coordinate points, and its process is as follows:

In a grid-based planar space, define the equation of the straight line

y = m x + b

from the

(n - 1)

-th route point

(x_{n - 1}, y_{n - 1})

to the

n

-th route point

(x_{n}, y_{n})

, where

m

is the slope and

b

is the intercept. When the straight line intersects the Y-axis within the grid plane, define

d_{u}

and

d_{l}

as the deviations between the intersection point and the Y values of coordinate points

(x_{k + 1}, y_{k + 1})

and

(x_{k + 1}, y_{k})

.

\{\begin{array}{l} d_{u} = y_{k + 1} - m x_{k + 1} - b \\ d_{l} = m x_{k + 1} + b - y_{k} \end{array}

(6)

where

(x_{k}, y_{k})

is the current coordinate point. Then, the following can be obtained:

p_{k} = ∆ x (d_{l} - d_{u}) = 2 ∆ y \times x_{k} - 2 ∆ x \times y_{k} + 2 ∆ y + ∆ x \times (2 b - 1)

(7)

If

p_{k} \geq 0

is met, then the next grid coordinate is determined as

(x_{k + 1}, y_{k + 1}),

with

p_{k + 1}

updated to

p_{k} + 2 ∆ y - 2 ∆ x

. Conversely, under condition

p_{k} < 0

, the next coordinate becomes

(x_{k + 1}, y_{k})

while adjusting

p_{k + 1}

to

p_{k} + 2 ∆ y

. This process continues iteratively, updating each point step by step until the final position of this segment is reached.

Next, based on the local obstacle avoidance scenario database constructed, various scenarios are traversed multiple times using the Monte Carlo method. MCTIL sampling involves using scenarios from the local obstacle avoidance scenario database as sample data and performing multiple iterative learning and training sessions through uniform distribution sampling. Typically, the number of scenario samples, i.e., the number of scenarios in the scenario database, is closely positively correlated with the complexity of the global route, such as the distance of the journey and the number of waypoints. In the subsequent simulations in this paper, there are 293 typical scenarios. This is combined with a model adaptation mechanism to invoke different models corresponding to each threat category for iterative learning and training in each scenario.

When training one model of the MMDQN agent in a specific scenario, it continuously interacts with the environment to collect data and uses an ε-greedy strategy to select actions.

a_{t} = \{\begin{matrix} a r g \max_{a} Q (s_{t}, a) p r o b a b i l i t y 1 - ε \\ o t h e r s p r o b a b i l i t y ε \end{matrix}

(8)

The experience generated by each interaction is stored, and the network model parameters are updated by batch sampling with

b a t c h

data. The training goal of DQN is to minimize the predicted Q value and the mean square error of the target value. That is, the loss error

L_{θ}

is determined by minimizing the loss function based on the gradient descent method. The target network parameter

θ^{-}

is regularly assigned from the policy network parameter

θ

, which effectively ensures the relative stability of the expected value

y_{t}

. During the specific implementation process of where model parameters are updated, the DQN first performs a forward calculation to obtain the

Q

value

Q (s_{t}, a_{t}; θ)

and the expected target

y_{t}

. Thus, the loss function is as follows:

L_{θ} = E [{(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}] = \frac{1}{b a t c h} \sum_{i = 1}^{b a t c h} {(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}

(9)

Then, back-propagation and gradient descent are performed on the DQN to minimize the loss function, which involves computing the gradient of the loss function

L_{θ}

with respect to parameter

θ

.

\nabla_{θ} L_{θ} = \frac{1}{b a t c h} \sum_{i = 1}^{b a t c h} (y_{t} - Q (s_{t}, a_{t}; θ)) \nabla_{θ} Q (s_{t}, a_{t}; θ)

(10)

Parameters are updated using the computed gradients

θ \leftarrow θ + α \nabla_{θ} L_{θ}

. For a comprehensive understanding, Algorithm 1 outlines the specific steps of the training procedure.

Algorithm 1. Pseudo-code for MMDQN agent training.

Pseudo-Code

Loop1: Start Monte Carlo loop, For

M_{i} = 1, \dots, M

:
1. Pre-planned global route points interpolation;
Loop2: Start UAV position traversal loop, For

N_{i} = 1, \dots, N

:
Loop3: Start threat area location traversal loop:
1. Determine the starting point for local re-planning;
2. Determine the impact of threat area location on route (Case 1 or Case 2);
1. Determine the target point and map range for local re-planning;
2. Normalize the starting point on the local map;
3. Load DQN agent model corresponding to current case;
Loop4: Start training episode loop for current DQN model, For

E_{i} = 1, \dots, E s p

:
1. Reset environment and obtain initial state;
Loop5: Start execution step loop, For

t = 1, \dots, S t e p

:
1. Select action

a_{t}

using ε-greedy policy;
2. Execute action, receive reward

r_{t}

and next state

s_{t + 1}

;
3. Store sample

(s_{t}, a_{t}, r_{t}, s_{t + 1})

in experience replay buffer

D

;
4. Update state

s_{t} \leftarrow s_{t + 1}

;
5. Experience replay buffer reaches accumulation threshold, update model:
1. Randomly sample

b a t c h

of samples from

D

;
2. Compute

Q

expectation

y_{t} = r_{t} + γ {m a x}_{a^{'}} Q (s_{t + 1}, a^{'}; θ^{-})

;
3. Calculate loss

L_{θ} = E [{(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}]

4. Calculate stochastic gradient descent

\nabla_{θ} L_{θ}

;
6. Copy parameters to target network every

C

steps:

θ \leftarrow θ^{-}

;
Loop5: End execution step loop, End For;
Loop4: End training episode loop for current DQN model, End For;
3. End processing threat area impact on route;
Loop3: Until all threat area positions are processed;
Loop2: End UAV position traversal loop, End For;Loop1: End Monte Carlo loop, End For.
Loop1: End Monte Carlo loop, End For.

In summary, based on Monte Carlo methods, data of threat triggers at arbitrary times along the entire route is utilized to form a local obstacle-avoidance scenario database. Subsequently, through multiple iterations of scene traversal and multi-model adaptation to dynamic threat scenarios, the MMDQN agent is trained across these scenarios to update model parameters. This process enables the agent to learn the optimal planning route for each local obstacle-avoidance scenario, thereby completing the agent training.

4.4. Overall Method Analysis

This section focuses on analyzing the architectural composition and inherent structure characteristics of local MMDQN.

4.4.1. Overall Structure and Algorithm Flow of Local Threat Avoidance

Based on the hierarchical route planning framework, global route planning can be accomplished. Then, according to the pre-planned global route and utilizing the training and application of the MMDQN agent, local obstacle-avoidance routes triggered by dynamic threats can be rapidly planned at any moment during the drone flight. This enables effective coordination between the long-duration flight mission and dynamic threat avoidance, ultimately ensuring the completion of a safe UAV flight mission. The process of global route planning is detailed in Ref. [36]. Building on this foundation, the overall implementation of the algorithm in this paper is as follows.

The hierarchical route planning framework includes both global route planning and local route re-planning capabilities. Based on the pre-planned global flight route, the MMDQN agent route re-planning algorithm comprises two modes: agent training and agent application.

In the training mode, data obtained from Monte Carlo simulations on obstacle avoidance triggered by threats along the entire pre-planned route at any moment is utilized for the iterative learning of the agent. Specifically, during the UAV’s traversal of its pre-planned global flight route, all possible locations of threat areas within the current maximum detection range are assessed. Through a re-planning task generation model, local re-planning task instruction parameters are obtained and used to train the MMDQN agent. By repeatedly undergoing events where threats are triggered, scene data are collected across the entire route via Monte Carlo simulations. The method ensures sufficient data volume and convergence performance for the agent models.

In the application mode, when a threat area appears within the UAV’s maximum detection range at any moment along the pre-planned global flight route, corresponding local re-planning task instruction parameters are generated using the re-planning task generation module. The MMDQN agent dynamically generates obstacle avoidance routes applicable to different threat scenarios by loading the appropriately trained models and performing forward computations to rapidly generate the local obstacle-avoidance routes within the current flight plane.

Finally, the pre-planned global route is locally updated based on the re-planning results, completing the entire flight mission. The overall method structure and algorithm flow, including the framework logic and the training and application processes of the MMDQN agent, are illustrated in Figure 7 and Figure 8, respectively.

4.4.2. Complexity Analysis of Local Threat Avoidance

According to the overall method structure and algorithm flow, the proposed method includes two key modules: task generation re-planning and local route re-planning. The theoretical complexity analysis of these modules is conducted as follows.

The computational complexity of re-planning task generation is

O ((N \times K))

, and the space complexity is

O (N \times K)

, where

N

is the number of global interpolation route points, and

K

is the number of threat scenario categories.

For the local route re-planning, a training mode and an application mode are used. In the training mode, the space complexity of the experience replay buffer is

O (D \times I)

, where

D

is the size of the experience replay buffer and

I

is the size of the state. The computational complexity per batch update includes forward computation and back-propagation. The computational complexity of forward computation is

O (b a t c h \times (I H_{1} + H_{1} H_{2} + H_{2} P))

, where

H_{1}

,

H_{2},

and

P

are the number of nodes in the first hidden layer, the second hidden layer, and the action space size. Back-propagation complexity is the same as forward computation.

In the application mode, the computational complexity per action is

O (I H_{1} + H_{1} H_{2} + H_{2} P)

, while during the agent’s overall application phase, it is

O (E \times T \times (I H_{1} + H_{1} H_{2} + H_{2} P))

. Here,

E

is the number of local route re-planning tasks, and

T

is the number of actions selected per local re-planned route.

In the application mode of the MMDQN, the computation for each action only utilizes one of the agent’s model networks. By contrast, the network architecture of a conventional single-model DQN agent designed to address the two categories of threats discussed in this paper is more complex than any individual model network in the proposed MMDQN. Specifically, the number of hidden layer nodes in the single-model DQN agent exceeds those (e.g.,

H_{1}

and

H_{2}

) of the MMDQN agent’s sub-models. Evidently, the proposed MMDQN achieves lower computational complexity and superior performance compared to conventional DQN. Also, MMDQN holds a substantial computational complexity advantage over classical algorithms such as A*, as it performs a constant-time forward pass independent of environmental scale, while traditional planners require state-dependent, often exponential-time searches for each decision.

The method demonstrates advanced computational and space complexity. In particular, during the agent application mode, dynamic threat obstacle avoidance is performed by simply re-planning task generation and executing the agent’s forward computation, which ensures high efficiency in real-time applications.

5. Numerical Simulations

In this section, an implementation case is studied based on the real mountainous environments to illustrate the effectiveness of the proposed method. To analyze the advantages of the hierarchical route planning framework and local route re-planning agent concisely, we directly build our design and validation based on the global route planning method published in Ref. [36]. A systematic quantitative evaluation based on numerical performance metrics is presented, including the effectiveness and robustness of the proposed method, the training convergence of the agent, the task success rate, and task completion.

Part 1. Verification of the hierarchical framework

To illustrate the availability of the hierarchical route planning framework for coordination between global route planning and dynamic obstacle avoidance, we construct the framework structure described in Section 3. The acquisition of elevation terrain data is obtained through actual detection. The global starting point and target point are (3, 175) and (351, 50). Based on the global route planning method in Ref. [36], the UAV global flight plan can be obtained, characterized by a best flight plane height of 571, a back-up plane height of 747, and a flight route point sequence of {(3, 175, 571), (56.4286, 166.8571, 571), (266, 87, 571), (314, 74, 571), (335, 57, 747), (351, 50, 571)}.

The global route of the flight plane and a 3D map are shown in Figure 9.

Here, the interaction interface is based on the hierarchical route planning framework. As demonstrated by simulations, it effectively completes global route planning and can support local re-planning triggered by threats at any moment. This ensures the effectiveness of the coordination between global route planning and dynamic obstacle avoidance.

Part 2. Simulation of the training process

To illustrate the convergence of the MMDQN agent with the MCTIL strategy, the training process and simulation results of the MMDQN agent are presented.

For simplicity, the MMDQN agent structure for local route re-planning is based on two scenarios—dynamic threat areas covering flight route points or the connection between them. These are constructed as two DQN models with identical neural network architectures.

The network structure of each DQN model is built as follows: The input layer has four nodes. The first fully connected layer has 500 nodes with initial connection weights randomly assigned as

W_{1} = \sqrt{\frac{2}{4}} \times (r a n d (500,4) - 0.5)

. The bias values are

B_{1} = 1 \times 10^{1} - 3 o n e s (500,1)

. The activation function for the first ReLU layer is

f (x_{1}) = m a x (0, x_{1})

, where

x_{1}

represents the outputs of the first fully connected layer. The second fully connected layer has 400 nodes, the initial connection weights of which are randomly assigned as

W_{2} = \sqrt{\frac{2}{4}} \times (r a n d (400,4) - 0.5)

. The bias values are

B_{2} = 1 \times 10^{1} - 3 o n e s (400,1)

. The activation function for the second ReLU layer is

f (x_{2}) = m a x (0, x_{2})

, where

x_{2}

represents the outputs of the second fully connected layer. The output layer has four nodes.

Remark 5.

The input of the agent is the state, and the output is the action. Both terrain and threat are defined as obstacles and serve as strict positioning constraints. In view of this, the agent that reaches the target point in the grid map completes the path search task. The positions of the agent and the target points are denoted as X and Y in the flight plane; as such, the state with four inputs is sufficient. Since the state is composed of discrete position coordinates on the rasterized map, no additional standardization processing is required. The agent’s actions include moving forward, backward, left, and right in a grid-based map.

The parameters for training the MMDQN agent are given as follows. The maximum detection distance of the UAV is

L = 20

. The number of Monte Carlo iterations is

M = 5

; the number of training episodes for each specific local obstacle avoidance case is

E s p = 500

; the number of maximum steps per episode is

S t e p = 625

; the experience replay buffer size is

D = 1 \times 10^{5}

; the random batch size is

b a t c h = 128

; the target network update interval is

C = 10

; the learning rate is

α = 5 \times 10^{- 4}

; the discount factor is

γ = 0.9

; the ε-greedy exploration strategy is

ε = 0.2

; and the decay rate is

∆ ε = 1 \times 10^{- 5}

.

The iterative training process follows the MCTIL strategy described in Section 4, where the specific local obstacle avoidance case of each episode is designed as follows:

Environmental constraints: The terrain, boundaries, and dynamic threats within the grid-based flight plane are all considered obstacles.

Initial condition: The local starting point is defined.

Termination conditions: The task is complete if the agent reaches the local target point or the maximum number of steps allowed is exceeded.

Reward mechanism: The scores are calculated based on environmental reward values, the specific details of which are outlined in Table 3.

As shown in Table 3, the reward mechanism established in this article is defined as follows: the agent’s score increases by 2 points when it arrives at the local target point; otherwise, 0 points are obtained. During the path search process, the agent’s score is subtracted by 0.01 per step if it fails to reach the local target point or change location. This kind of reward and punishment forces the agent to move and not to stay in the same place. If the distance in the X or Y direction within the flight plane decreases relative to the local target point, the agent’s score increases by 0.05 per step. But if the distance in the X or Y direction within the flight plane increases relative to the local target point, a score of 0.05 per step is subtracted from the score. This method of reward and punishment allows the agent to approach the target point in a gradual and stable manner. The final reward score of the agent is obtained by summing up the above-mentioned bonus and penalty items.

The reward score for each step of the agent is

R = A + B 1 + B 2 + B 3 + B 4

(11)

Figure 10 illustrates the training curve of the agent in a specific local obstacle avoidance case presented in Figure 11. Figure 11 shows the results of local route re-planning. The overall map covers an area of 8.9 km × 19.6 km with a resolution of 100 m.

According to the training curve, in this specific case, after 124 training episodes, the average reward converged to 2.21 and remained stable. Meanwhile, the re-planned route in this case effectively completed obstacle avoidance. Through iterative training and traversing the diverse threat scenario library until convergence, the MMDQN agent was capable of handling multiple threat scenarios. These observations highlight the stability and strong convergence properties of this method as well as its ability to generate effective re-planning solutions.

Part 3 Performance of the MMDQN agent

The simulation results of the MMDQN agent’s application are provided based on a hierarchical route planning framework. This illustrates the agent’s operational reliability during task sequence logic management for global route planning based on multiple instances of local route re-planning at various times. It also demonstrates the applicability of the MMDQN agent with the MCTIL strategy in diverse obstacle avoidance scenarios.

During the global flight route, the UAV autonomously detects areas of threat and completes local route re-planning to avoid dynamic obstacles. According to measurements, during five instances of local obstacle avoidance, the maximum, minimum, and average time required for local path re-planning were 0.8058 s, 0.1610 s, and 0.4181 s, thereby enabling rapid evasion of dynamic threats. For comparison, under identical conditions, the single-model DQN agent based on local route re-planning had a maximum time of 1.5271 s, a minimum time of 0.4057 s, and an average time of 0.9834 s.

Remark 6.

Considering that the focus of this study is not on improving the speed of local path re-planning, this paper only provides a brief illustration of the real-time ability for local obstacle avoidance compared to the single-model DQN agent, without further comparative analysis with other methods.

The final obstacle avoidance route within the flight plane is illustrated in Figure 12.

The flight trajectory in 3D space is shown in Figure 13.

To evaluate the advantages of the three proposed methods, we conducted a comparative analysis focusing on their ability to avoid local dynamic threats: a single-model agent, MMDQN with Monte Carlo stochastic iterative learning (MCSIL) strategy, and MMDQN based on the MCTIL strategy. The single-model agent is designed based on the conventional DQN algorithm. The MMDQN based on the MCSIL strategy employs the same MMDQN architecture and conducts training using multiple randomly generated dynamic threat scenarios. During the pre-planned global flight route, multiple local obstacle avoidance scenarios were generated for various threat scenarios, and identical experimental conditions were applied to each of the three methods for each scenario. By conducting simulation tests focused on dynamic threat avoidance throughout the entire flight route, statistics for evading threats at any time and location could be summarized, as shown in Table 4.

Remark 7.

The local route planning method proposed in this article mainly focuses on strategies to improve structural and iterative learning based on the DQN rather than considering actual types of threats. Comparisons related to these improvements are also provided. Classic non-DRL intelligent methods are not within the primary scope of this paper. In addition to the DQN, a variety of improved algorithms exist in the field of DRL, including Double DQN and Dueling-DQN. Despite the differences in their design mechanisms, these algorithms can both be classified as single-model architectures. The MMDQN proposed in the paper is a multi-model reinforcement learning algorithm that was improved based on the traditional DQN, and its performance is significantly superior to that of similar single-model algorithms. For this reason, the authors selected only DQN-based single-model agents as the control group for comparative experiments. As shown in Table 4, the MMDQN model with the MCTIL strategy achieves better comprehensive performance than the MMDAN adopting the MCSIL strategy, as well as all types of single-model agents.

Remark 8.

In these tests, successful threat avoidance is divided into two categories: reliable avoidance, where the agent strictly moves to the local target point to complete threat avoidance, and feasible avoidance, where the agent does not strictly move to the local target point but can still achieve threat avoidance. The success rate was calculated by dividing the number of successful threat avoidances by the total number of tests.

According to the statistical results, the success rate of dynamic threat avoidance for UAVs is 82.59%. When comparing identical neural network architectures, this success rate significantly surpasses that of the single-model agent, which only reached 44.70%. This represents an improvement of 37.89%, validating the effective performance of the proposed MMDQN agent for local route re-planning under test conditions.

Within the identical MMDQN agent network structure, our proposed MCTIL strategy demonstrates superior performance compared to the MCSIL strategy. While MCSIL achieves a threat avoidance success rate of 74.74%, our strategy improved this by 7.85%. Significant enhancement validates the reliability and effectiveness of the MCTIL strategy.

It is worth noting that, throughout the aforementioned experiments, the proposed framework and algorithms operated stably without any system errors or crashes caused by robustness-related issues. Moreover, for the comparative experiments discussed in Section 3 of this paper, the three local route re-planning methods were successfully executed for 879 continuous trials under different threat emergence scenarios, further validating the robustness of the approach.

6. Discussion

This paper addresses the issue of coordinating global route planning and local real-time obstacle avoidance in complex mountainous environments. A global planning and local dynamic optimization-based hierarchical route planning framework is designed, considering long-duration flight plans and the effective avoidance of obstacles. On the basis of global route planning, the MMDQN agent obtained through the MCTIL strategy can efficiently achieve local dynamic obstacle avoidance. The simulation results illustrate the efficiency of the proposed method. The main advantages of the proposed method are that (1) the hierarchical route planning framework can simultaneously execute the mission of the entire flight and rapidly evade dynamic threats, exhibiting both flexible versatility and practical coordination and deployment efficiency. (2) The structure design of the MMDQN agent effectively reduces the complexity of the neural network structure and the difficulty of training a single-model agent. (3) Finally, the MCTIL strategy is based on threat-triggering scenarios that could occur at any time along the entire flight. This enhances the adaptability and reliability of the local route re-planning agent.

While the proposed method demonstrates promising results, it is important to acknowledge its limitations. The results are based on a limited dataset of the actual terrain and mountainous environment, which may not fully capture real-world complexities. Meanwhile, the proposed algorithm assumes ideal conditions, which may not represent highly dynamic scenarios in all cases.

To overcome the current constraints, future work will focus on the obstacle avoidance flight of UAVs based on the proposed method in this paper, including further deployment in ground stations and hardware-in-the-loop to validate the algorithm’s effectiveness. Additionally, we aim to enrich map sources and increase training samples to verify the algorithm’s robustness. To further enhance robustness, we propose relaxing assumptions and improving the algorithm’s design to effectively address a wider range of scenarios in dynamic environments. In addition, a systematic analysis of the improvement mechanisms and performance expansion of well-established classical local planners, such as the Vector Field Histogram (VFH) and Dynamic Window Approach (DWA), and derivative intelligent algorithms, such as Double DQN and Dueling-DQN, may be explored in subsequent research work. Intelligent route planning for multi-UAV collaboration and autonomous obstacle avoidance in threat zones with time-varying locations, such as regions with adverse weather, will also be explored.

Author Contributions

Conceptualization, B.D. and Y.Z.; methodology, B.D. and Y.Z.; resources, P.Y.; software, Y.Z. and S.L.; supervision, G.Z. and T.H.; validation, Y.Z. and T.H.; writing—original draft, B.D. and Y.Z.; writing—review and editing, G.Z. and P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to proprietary constraints related to ongoing product development under institutional intellectual property agreements.

Conflicts of Interest

Authors Yuzhen Zhang, Peiyuan Yuan, Shuntong Lu, and Tao Huang are employed by AVIC Aviation Electronics Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

Yang, Y.; Xiong, X.; Yan, Y. UAV Formation Trajectory Planning Algorithms: A Review. Drones 2023, 7, 62. [Google Scholar] [CrossRef]
Bai, H.; Fan, T.; Niu, Y.; Cui, Z. Multi-UAV Cooperative Trajectory Planning Based on Many-Objective Evolutionary Algorithm. Complex Syst. Model. Simul. 2022, 2, 130–141. [Google Scholar] [CrossRef]
LI, J.; Long, T.; Sun, J.; Luo, Y.; Zhou, Z. UAV Dynamic Path Planning Method via Heuristic Incremental Search. Sci. Sin. Informationis 2025, 55, 931. [Google Scholar] [CrossRef]
Bai, X.; Jiang, H.; Cui, J.; Lu, K.; Chen, P.; Zhang, M. UAV Path Planning Based on Improved A* and DWA Algorithms. Int. J. Aerosp. Eng. 2021, 2021, 4511252. [Google Scholar] [CrossRef]
Xu, L.; Xi, M.; Gao, R.; Ye, Z.; He, Z. Dynamic Path Planning of UAV with Least Inflection Point Based on Adaptive Neighborhood A Algorithm and Multi-Strategy Fusion. Sci. Rep. 2025, 15, 8563. [Google Scholar] [CrossRef]
Yan, C.; Xiang, X.; Wang, C. Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments. J. Intell. Robot. Syst. 2020, 98, 297–309. [Google Scholar] [CrossRef]
Sheng, Y.; Liu, H.; Li, J.; Han, Q. UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments. Drones 2024, 8, 516. [Google Scholar] [CrossRef]
Ait Saadi, A.; Soukane, A.; Meraihi, Y.; Benmessaoud Gabis, A.; Mirjalili, S.; Ramdane-Cherif, A. UAV Path Planning Using Optimization Approaches: A Survey. Arch. Comput. Methods Eng. 2022, 29, 4233–4284. [Google Scholar] [CrossRef]
Doukhi, O.; Lee, D.J. Sim-to-Real Learning-Based Nonlinear MPC for UAV Navigation and Collision Avoidance in Unknown Cluttered Environments. IEEE Access 2025, 13, 46249–46262. [Google Scholar] [CrossRef]
Shi, Y.; Zhang, L.; Dong, S. Path Planning of Anti Ship Missile Based on Voronoi Diagram and Binary Tree Algorithm. Def. Sci. J. 2019, 69, 369–377. [Google Scholar] [CrossRef]
Sheng, Z.; Song, T.; Song, J.; Liu, Y.; Ren, P. Bidirectional Rapidly Exploring Random Tree Path Planning Algorithm Based on Adaptive Strategies and Artificial Potential Fields. Eng. Appl. Artif. Intell. 2025, 148, 110393. [Google Scholar] [CrossRef]
Fusic, S.J.; Sitharthan, R. Improved RRT* Algorithm-Based Path Planning for Unmanned Aerial Vehicle in a 3D Metropolitan Environment. Unmanned Syst. 2024, 12, 859–875. [Google Scholar] [CrossRef]
Hu, Q.; Zhang, P.; Yu, P.; Xue, J.; Wang, X. Obstacle Avoidance Path Planning for the Blast Hole Filling Manipulator through Improved Inverse Kinematics Solving and Rapidly-Exploring Random Trees Star Method (RRT*). J. Braz. Soc. Mech. Sci. Eng. 2025, 47, 657. [Google Scholar] [CrossRef]
Chen, F.; Xu, X. Three-Dimensional Path Planning of UAV Based on Improved A* Algorithm. In Proceedings of the 2025 7th International Conference on Information Science, Electrical and Automation Engineering (ISEAE), Harbin, China, 18 April 2025; pp. 1337–1341. [Google Scholar]
Zhang, C.; Meng, X. Spare A* Search Approach for UAV Route Planning. In Proceedings of the 2017 IEEE International Conference on Unmanned Systems (ICUS), Beijing, China, 27 October 2017; pp. 413–417. [Google Scholar]
Zhu, S. UAVs Flight Path Optimization Based on Intelligent Bionic Optimization Algorithm. In Proceedings of the 2025 IEEE 5th International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 23–25 May 2025; pp. 711–718. [Google Scholar]
Liu, S. An Improved Particle Swarm Algorithm for UAV Path Planning. In Proceedings of the 2023 IEEE International Conference on Image Processing and Computer Applications (ICIPCA), Changchun, China, 11 August 2023; pp. 949–953. [Google Scholar]
Li, X.; Ma, R.; Zhang, S.; Hou, Y.; Pei, Y. Improved Design of Ant Colony Algorithm and Its Application in Path Planning. ACTA Aeronaut. Astronaut. Sin. 2020, 41, 724381. [Google Scholar] [CrossRef]
Wang, K.; Hui, M.; Hou, J.; Song, X. Deep Reinforcement Learning-Based UAV Path Planning Algorithm. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Computer Information Technology (AICIT), Yichang, China, 20–22 September 2024; pp. 1–4. [Google Scholar]
Guo, J.; Zhou, Y.; Burlion, L.; Savkin, A.V.; Huang, C. Autonomous UAV Last-Mile Delivery in Urban Environments: A Survey on Deep Learning and Reinforcement Learning Solutions. Control Eng. Pract. 2025, 165, 106491. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Yang, Y.; Liu, X. Deep Reinforcement Learning Based Trajectory Optimization for UAV-Enabled IoT with SWIPT. Ad Hoc Netw. 2024, 159, 103488. [Google Scholar] [CrossRef]
AlMahamid, F.; Grolinger, K. Agile DQN: Adaptive Deep Recurrent Attention Reinforcement Learning for Autonomous UAV Obstacle Avoidance. Sci. Rep. 2025, 15, 18043. [Google Scholar] [CrossRef]
Farid, G.; Bilal, M.; Zhang, L.; Alharbi, A.; Ahmed, I.; Azhar, M. An Improved Deep Q-Learning Approach for Navigation of an Autonomous UAV Agent in 3D Obstacle-Cluttered Environment. Drones 2025, 9, 518. [Google Scholar] [CrossRef]
Liu, C.; Zhong, Y.; Wu, R.; Ren, S.; Du, S.; Guo, B. Deep Reinforcement Learning Based 3D-Trajectory Design and Task Offloading in UAV-Enabled MEC System. IEEE Trans. Veh. Technol. 2025, 74, 3185–3195. [Google Scholar] [CrossRef]
Çetin, E.; Barrado, C.; Salamí, E.; Pastor, E. Analyzing Deep Reinforcement Learning Model Decisions with Shapley Additive Explanations for Counter Drone Operations. Appl. Intell. Dordr. Neth. 2024, 54, 12095–12111. [Google Scholar] [CrossRef]
Lei, H.; Yan, Y.; Liu, J.; Han, Q.; Li, Z. Hierarchical Multi-UAV Path Planning for Urban Low Altitude Environments. IEEE Access 2024, 12, 162109–162121. [Google Scholar] [CrossRef]
Pan, D.; Wang, Q.; Kong, F.; Yu, H.; Gao, S.; Xu, Z. D3QHF: A Hybrid Double-Deck Heuristic Reinforcement Learning Approach for UAV Path Planning. In Proceedings of the IEEE International Conference on Unmanned Systems (Online), Guangzhou, China, 28–30 October 2022; pp. 1221–1226. [Google Scholar]
Loquercio, A.; Kaufmann, E.; Ranftl, R.; Dosovitskiy, A.; Koltun, V.; Scaramuzza, D. Deep Drone Racing: From Simulation to Reality With Domain Randomization. IEEE Trans. Robot. 2020, 36, 1–14. [Google Scholar] [CrossRef]
Loquercio, A.; Kaufmann, E.; Ranftl, R.; Müller, M.; Koltun, V.; Scaramuzza, D. Learning High-Speed Flight in the Wild. Sci. Robot. 2021, 6, eabg5810. [Google Scholar] [CrossRef] [PubMed]
Zhang, Q.; Song, Y.; Jiao, P.; Hu, Y. A Hybrid and Hierarchical Approach for Spatial Exploration in Dynamic Environments. Electronics 2022, 11, 574. [Google Scholar] [CrossRef]
Zhao, L.; Li, G.; Zhang, H. Global and Local Awareness: Combine Reinforcement Learning and Model-Based Control for Collision Avoidance. IEEE Open J. Intell. Transp. Syst. 2024, 5, 422–432. [Google Scholar] [CrossRef]
Zhao, H.; Guo, Y.; Li, X.; Liu, Y.; Jin, J. Hierarchical Control Framework for Path Planning of Mobile Robots in Dynamic Environments Through Global Guidance and Reinforcement Learning. IEEE Internet Things J. 2025, 12, 309–333. [Google Scholar] [CrossRef]
Debnath, D.; Vanegas, F.; Sandino, J.; Hawary, A.F.; Gonzalez, F. A Review of UAV Path-Planning Algorithms and Obstacle Avoidance Methods for Remote Sensing Applications. Remote Sens. 2024, 16, 4019. [Google Scholar] [CrossRef]
Merei, A.; Mcheick, H.; Ghaddar, A.; Rebaine, D. A Survey on Obstacle Detection and Avoidance Methods for UAVs. Drones 2025, 9, 203. [Google Scholar] [CrossRef]
Dong, B.; Zhang, G.; Yang, Y.; Yuan, P.; Lu, S. A Voronoi–A* Fusion Algorithm with Adaptive Layering for Efficient UAV Path Planning in Complex Terrain. Drones 2025, 9, 542. [Google Scholar] [CrossRef]

Figure 1. The hierarchical route planning framework.

Figure 2. The relationships between modules.

Figure 3. The operational logic for all modules within the framework.

Figure 4. The altitude setting rules for inserted route points.

Figure 5. Two types of scenarios.

Figure 6. The general structure and implementation scenario diagram of the MMDQN agent.

Figure 7. The overall method structure.

Figure 8. The overall flow of the algorithm.

Figure 9. The software for the global route for UAVs.

Figure 10. The training curve of the agent for a specific case.

Figure 11. The obstacle avoidance route for a specific case.

Figure 12. The obstacle avoidance route for this specific case.

Figure 13. The final flight trajectory in 3D space.

Table 1. The specific functions of each module.

Modules	Functions	Input	Output
Global Route Planning	Based on elevation terrain data and performance constraints, the UAV conducts global route planning with multi-flight planes.	Global starting-target point; Digital map data.	Global route point sequence.
Threat Triggering	It provides time-varying threat obstacles in dynamic complex environments.	Global route point sequence; UAV location; Digital map data.	Threat area location.
Re-Planning Task Generation	When the UAV detects a threat area, it generates a local re-planning task with local starting-target points and map range, based on the global flight route.	Threat area location; Global route point sequence; Digital map data.	Local starting target points; Map range.
Local Route Re-planning	Driven by the local re-planning task instructions, the UAV performs local obstacle avoidance route planning via the DRL algorithm.	Local starting-target points; Threat area location; Digital map data.	Local route point sequence.
Module Invocation Management	It manages the task sequence logic of the global route planning, re-planning task generation, and local route re-planning.	Global route parameters; Dynamic threat trigger parameters.	Invoke module instruction.
Digital map	Providing digital map data required for route planning.	Range for global DEM data.	Digital map data.

Table 2. Local target point and map range.

Location of Dynamic Threat Area	Local Target Point	Side Length of the Square Map for Local Re-Planning
Covering the $n$ -th route point	The $(n + 1)$ -th route point	$R_{1} = c e i l (2.5 * L)$
Located on the line connecting the $(n - 1)$ -th route point to the $n$ -th route point	The $n$ -th route point	$R_{2} = c e i l (1.5 * L)$

Table 3. The reward mechanism.

Items	Cases	Scores
Result rewards	A: Arrive at the local target point	Add 2 if performed; otherwise, add 0.
Process rewards (cumulative score per step)	B1: Failure to reach the local target point.	Subtract 0.01 per step.
	B2: No location change.	Subtract 0.01 per step.
	B3: The distance in the X or Y direction within the flight plane decreases relative to the local target point.	Add 0.05 per step
	B4: The distance in the X or Y direction within the flight plane increases relative to the local target point.	Subtract 0.05 per step.

Table 4. The statistical results of dynamic threat avoidance.

Agent Types	Number of Reliable Results	Number of Feasible Results	Number of Failed Results	Success Rate
MMDQN (MCTIL strategy)	171	71	51	82.59%
MMDQN (MCSIL strategy)	153	66	74	74.74%
Single-model agent	30	101	162	44.70%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dong, B.; Zhang, Y.; Yuan, P.; Lu, S.; Huang, T.; Zhang, G. Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs. Drones 2026, 10, 57. https://doi.org/10.3390/drones10010057

AMA Style

Dong B, Zhang Y, Yuan P, Lu S, Huang T, Zhang G. Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs. Drones. 2026; 10(1):57. https://doi.org/10.3390/drones10010057

Chicago/Turabian Style

Dong, Boyu, Yuzhen Zhang, Peiyuan Yuan, Shuntong Lu, Tao Huang, and Gong Zhang. 2026. "Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs" Drones 10, no. 1: 57. https://doi.org/10.3390/drones10010057

APA Style

Dong, B., Zhang, Y., Yuan, P., Lu, S., Huang, T., & Zhang, G. (2026). Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs. Drones, 10(1), 57. https://doi.org/10.3390/drones10010057

Article Menu

Hierarchical Route Planning Framework and MMDQN Agent-Based Intelligent Obstacle Avoidance for UAVs

Highlights

Abstract

1. Introduction

2. Description of the Route Planning System

3. Global Planning and Local Dynamic Optimization-Based Hierarchical Route Planning Framework

4. Algorithm Design

4.1. Threat-Triggering Local Replanning Task Generation Based on Global Routes with Multi-Flight Planes

4.2. The Design of the MMDQN Agent for Local Route Re-Planning

4.3. An MCTIL Strategy for the MMDQN Agent

4.4. Overall Method Analysis

4.4.1. Overall Structure and Algorithm Flow of Local Threat Avoidance

4.4.2. Complexity Analysis of Local Threat Avoidance

5. Numerical Simulations

6. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI