Stealth UAV Path Planning Based on DDQN Against Multi-Radar Detection

Bao, Lei; Guo, Zhengtao; Gao, Xianzhong; Li, Chaolong

doi:10.3390/aerospace12090774

Test Center, National University of Defense Technology, Xi’an 710106, China

* Authors to whom correspondence should be addressed.

† These authors contributed to the work equally.

Aerospace 2025, 12(9), 774; https://doi.org/10.3390/aerospace12090774

Submission received: 20 April 2025 / Revised: 6 August 2025 / Accepted: 15 August 2025 / Published: 28 August 2025

(This article belongs to the Section Aeronautics)

Abstract

Considering the dynamic radar cross-section (RCS) characteristics of stealth UAVs, we propose a stealth UAV path planning algorithm based on the Double Deep Q-Network (DDQN). By introducing a reinforcement learning model that interacts with the environment, the stealth UAV adjusts its path planning strategy according to the rewards obtained from the environment and plans the optimal path in real time. Specifically, by considering how the RCS at different angles affects the detection probability of the air defense radar, the stealth UAV iteratively optimizes its path planning scheme to improve the reliability of the penetration path. A goal-directed composite reward function is proposed to guide training, which improves the convergence speed of the stealth UAV path planning algorithm. The simulation results show that the stealth UAV can reach the target position with the optimal path while avoiding the threat zone.

Keywords:

stealth UAV; Double Deep Q-Network; dynamic RCS; path planning

1. Introduction

The problem of UAV penetration into a radar air defense area can be modeled as flight path planning under path constraints, fire threat constraints, and flight time constraints. The UAV aims to plan a reasonable flight path that completes the task efficiently while ensuring its own safety [1]. Depending on whether the fire threat area changes over time, path planning for UAV penetration can be divided into static and dynamic path planning [2]. In static path planning, the space environment faced by the UAV is unchanged, so the flight path is planned and the waypoints and routes are loaded before the UAV takes off. In dynamic path planning, the space environment changes dynamically, and the UAV plans its path during flight by perceiving the environment and adjusting its state [3,4]. The problem considered in this paper belongs to dynamic path planning.

In recent years, a growing number of algorithms have been proposed to solve the path planning problem. Commonly used approaches include graph search algorithms [5], linear programming algorithms from operations research [6,7], intelligent optimization algorithms [8,9], and reinforcement learning algorithms [10,11]. Graph search is one of the classical families of graph theory algorithms and can effectively solve the shortest path problem. However, as the environment becomes more complicated and the number of nodes increases, its solving efficiency decreases [12]. Linear programming is a mathematical method that finds the extreme value of an objective function under linear constraints. It is widely used in military and engineering fields and is simple and efficient to compute. However, in complex space environments, the constraints are not necessarily linear [13]. In UAV path planning, the genetic algorithm [14], particle swarm algorithm [15,16], ant colony algorithm [17], and hybrid algorithms [18,19] are widely used, but they also suffer from unclear logical relationships.

Reinforcement learning is an algorithmic framework in which an agent interacts with the environment, obtains feedback, and optimizes its decisions accordingly. The core of reinforcement learning is the Markov decision process. In recent years, researchers have mainly used the Q-learning algorithm and the Deep Q-Network (DQN) algorithm for path planning [20,21], but the Q-learning algorithm is prone to overfitting, which leads to locally optimal solutions, and the DQN algorithm tends to overestimate the Q value [22,23,24]. To address these two problems, this paper introduces the DDQN algorithm for the path planning of the stealth UAV. Compared with Dueling DQN, DDQN reduces the Q value estimation bias and has better initial performance. Compared with PPO, it learns off-policy, is more suitable for a discrete action space, and has low dependence on the environment model. Compared with the DDPG algorithm, the DDQN algorithm is more suitable for discrete action output and has higher sample efficiency, while the DDPG algorithm is suitable for continuous action output. Compared with the Monte Carlo tree search (MCTS) algorithm, the DDQN algorithm has better online optimization ability and less dependence on the environment model, while the MCTS algorithm is based on model search, depends heavily on the model, and has weak online optimization ability [25].

Traditional studies assume that the RCS of the UAV is constant. In practice, the RCS of the stealth UAV differs with the viewing angle, and the angle between the stealth UAV and the radar changes constantly during flight, so the detection probability of the radar and the range of the anti-aircraft kill zone change dynamically. By exploiting these characteristics of the anti-aircraft kill zone, the stealth UAV can penetrate the multi-radar anti-aircraft zone along an optimally planned path and reach the mission destination. Compared with existing conventional UAV path planning algorithms based on reinforcement learning [26], the main innovation of this paper lies in the proposed stealth UAV path planning algorithm. The RCS characteristics of the stealth UAV designed by our team are significantly different from those of conventional UAVs.

At present, most research on stealth UAV path planning is based on the traditional A* algorithm. The RCS value of a stealth UAV varies with the angle of the fuselage, and multiple radars can detect the aircraft from various angles, posing a serious threat to stealth UAVs. To address this issue, reference [27] proposed an A* algorithm for the trajectory planning of stealth UAVs. Considering the RCS characteristics of stealth UAVs, reference [28] proposed an improved A* algorithm, a sparse A* algorithm, and a dynamic A* algorithm, which evaluate the feasibility of the trajectory through the radar detection probability, and completed trajectory planning in three different scenarios. In view of the shortcomings of traditional algorithms in stealth penetration, and fully considering the requirements of rapidity and safety in flight path planning, reference [29] designed an improved A* algorithm. Considering survivability and penetration capability under the deployment of monostatic and bistatic radars, reference [30] proposed an improved A* algorithm based on a multi-step search method to achieve path planning of stealth UAVs in complex scenarios. To balance the average RCS and peak RCS, reference [31] introduced a detection probability model and a penetration efficiency model and used a gradient-free optimization algorithm based on the genetic algorithm to maximize efficiency. However, the A* algorithm has high computational complexity and is unsuitable for dynamic environments. To address this problem, this paper discretizes the continuous turning angle, simplifies the algorithm structure, and improves the convergence speed of the reinforcement learning algorithm. Finally, the feasibility of the proposed algorithm is verified by comparison with existing algorithms.

The rest of this paper is organized as follows. The DDQN algorithm is introduced in Section 2, and the mapping relationship between UAV action and reinforcement learning elements is given. In Section 3, the environment model of this paper is presented, and the stealth UAV penetration algorithm based on DDQN is proposed. In Section 4, the effectiveness of the proposed algorithm is verified by simulation. Finally, Section 5 summarizes this paper.

2. DDQN Algorithm

This paper uses an improved DDQN algorithm under the reinforcement learning framework. After obtaining environmental information, the stealth UAV can avoid the air defense zone composed of multiple radars through real-time planning and maneuvering. The stealth UAV is the agent, which cruises through the airspace at a constant speed, and the airspace environment is an anti-aircraft kill zone composed of multiple radars. The state $s$ of the agent is its position in a two-dimensional space, the action $a$ is the turning direction of the stealth UAV at the path point, and the reward function $R$ is designed as a function of the distance from the stealth UAV to the target point and of its safety state.

In the reinforcement learning framework, the state information perceived by the stealth UAV is its position in the airspace. Based on this state, the stealth UAV makes decisions with the reinforcement learning model, and the action is the steering angle of the UAV at the path point. After completing the action, the stealth UAV cruises at constant speed to the next path point and updates its state information. In this framework, the stealth UAV must estimate the probability of being detected by each radar according to its state before taking an action. The radar detection probability enters the reward function, and the action is chosen to maximize the reward. The real-time decision-making framework of the stealth UAV is shown in Figure 1.

The state perceived by the stealth UAV is its position in a two-dimensional space. Because the number of states is very large, a tabular representation of the Q function is not conducive to training. Therefore, this paper uses the fitting ability of neural networks to approximate the Q function. The input of the Q-network is the position $s = (x, y)$ of the stealth UAV, and the output is the angle $a$ of the stealth UAV. In DDQN, the training and target Q-networks have the same structure, with three hidden layers. Each hidden layer is a fully connected layer of 128 nodes with the ReLU activation function. The Q-network architecture is shown in Figure 2.
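The network structure described above can be sketched as a plain NumPy forward pass. This is only an illustrative sketch: it assumes the output layer has one unit per discrete steering angle (a common DQN convention that the text does not spell out), and the He-style initialization and the names `init_q_network` and `q_values` are hypothetical choices, not taken from the paper.

```python
import numpy as np

N_ACTIONS = 5      # five discrete steering angles
HIDDEN = 128       # nodes per hidden layer (from the text)

def init_q_network(rng, n_in=2, n_hidden=HIDDEN, n_out=N_ACTIONS):
    """Create weights for a 2 -> 128 -> 128 -> 128 -> 5 fully connected network."""
    sizes = [n_in, n_hidden, n_hidden, n_hidden, n_out]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He-style initialization (an assumption; the paper does not state one).
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((w, b))
    return params

def q_values(params, state):
    """Forward pass: ReLU on hidden layers, linear output with one Q value per action."""
    h = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b

rng = np.random.default_rng(0)
theta = init_q_network(rng)
# The target network starts as an exact copy of the training network.
theta_target = [(w.copy(), b.copy()) for w, b in theta]
q = q_values(theta, (0.0, 0.0))
```

Keeping the target parameters as a separate copy, rather than a reference to the same arrays, is what allows the two networks to diverge between synchronizations.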

In the DQN algorithm, the Q-network loss function is

θ^{*} = \arg \min_{θ} \frac{1}{2} {[Q_{θ} (s, a) - (R + γ \max_{a^{'}} Q_{θ} (s^{'}, a^{'}))]}^{2}

(1)

To address the constantly shifting target and the overestimation of the Q value when the network is updated, the DDQN algorithm uses two networks: one network selects the action with the largest value, and the other network evaluates that action. In this case, the loss function is

θ^{*} = \arg \min_{θ} \frac{1}{2} {[Q_{θ} (s, a) - (R + γ Q_{θ^{-}} (s^{'}, \arg \max_{a^{'}} Q_{θ} (s^{'}, a^{'})))]}^{2}

(2)
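The difference between the targets inside Equations (1) and (2) can be illustrated with a small numerical sketch. The Q values, reward, and discount factor below are made up purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def dqn_target(reward, gamma, q_next):
    """Target in Equation (1): the same network both selects and evaluates a'."""
    return reward + gamma * np.max(q_next)

def ddqn_target(reward, gamma, q_next_online, q_next_target):
    """Target in Equation (2): the training network selects a', the target network evaluates it."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]

# Made-up Q values for one next state s' (illustrative only).
q_online = np.array([1.0, 3.0, 2.0, 0.5, 0.0])
q_target = np.array([1.2, 1.5, 2.5, 0.4, 0.1])

y_dqn = dqn_target(-1.0, 0.9, q_online)
y_ddqn = ddqn_target(-1.0, 0.9, q_online, q_target)
```

In this toy case the training network's largest estimate (3.0) is not confirmed by the target network (1.5), so the DDQN target is lower than the DQN target, which is the mechanism that limits overestimation.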

The state is the position of the stealth UAV. To better express the relative relationship between the states of the stealth UAV, the initial state of the stealth UAV is set as the origin. The action is the discrete value of the steering angle of the stealth UAV at the path point. To benefit network training and reduce the amount of data, the angle is discretized into five fixed values:

a \in Ω = \{0^{\circ}, 45^{\circ}, 90^{\circ}, -45^{\circ}, -90^{\circ}\}
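A minimal sketch of how a discrete action could move the UAV from one path point to the next is given here. It assumes that the steering angle is applied relative to the current heading and that consecutive path points are a fixed distance `step_km` apart; neither detail is stated in the text, and `apply_action` is a hypothetical helper.

```python
import math

ACTIONS_DEG = (0, 45, 90, -45, -90)   # the action set Ω from the text

def apply_action(x, y, heading_deg, action_index, step_km=1.0):
    """Turn by the chosen steering angle, then fly one step at constant speed.

    Assumes the angle is relative to the current heading and that step_km is
    the distance between path points; both are illustrative assumptions.
    """
    new_heading = (heading_deg + ACTIONS_DEG[action_index]) % 360
    rad = math.radians(new_heading)
    return x + step_km * math.cos(rad), y + step_km * math.sin(rad), new_heading

# Starting at the origin with heading 45 degrees, turn by +45 degrees and fly one step.
x, y, heading = apply_action(0.0, 0.0, 45.0, 1)
```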

The reward function must be designed to overcome the sparse reward problem in the path selection of the stealth UAV. If only reaching the target position yields a positive reward, the stealth UAV receives negative rewards for a long time, which makes searching for the target position difficult. Therefore, this paper sets the reward function as the sum of three sub-reward functions: the arrival reward $R_{a}$, the detection penalty $R_{b}$, and the step penalty $R_{c}$.

R_{a} = \begin{cases} r_{1}, & s = s_{\mathrm{end}} \\ 0, & s \neq s_{\mathrm{end}} \end{cases}

(3)

R_{b} = \begin{cases} 0, & P_{d} < 0.3 \\ r_{2, 0}, & 0.3 \le P_{d} < 0.5 \\ r_{2, 1}, & 0.5 \le P_{d} < 0.6 \\ r_{2, 2}, & 0.6 \le P_{d} < 0.7 \\ r_{2, 3}, & 0.7 \le P_{d} < 0.8 \\ r_{2, 4}, & 0.8 \le P_{d} < 0.9 \\ r_{2, 5}, & 0.9 \le P_{d} \le 1 \end{cases}

(4)

R_{c} = r_{3}

(5)

where Equation (3) represents the positive reward obtained by the stealth UAV for reaching the target position. The positive reward is $r_{1} = 1$, and it is 0 if the stealth UAV does not reach the target position. Equation (4) represents the relationship between the negative reward obtained by the stealth UAV and the detection probability. When the detection probability is below the lowest threshold in Equation (4), the reward is 0. When $P_{d} > 0.8$, the stealth UAV is considered detected by radar and exposed to anti-aircraft fire, so $r_{2, 4}$ is a very large penalty. When $P_{d} > 0.9$, the path ends: the stealth UAV has been destroyed by anti-aircraft fire, and $r_{2, 5}$ is likewise a very large penalty. For the remaining detection probability intervals, the penalty takes the minimum value of the corresponding level. Equation (5) represents the negative reward obtained each time the stealth UAV flies one step, usually set to $r_{3} = - 1$. The total reward is set to

R = R_{a} + R_{b} + R_{c}

(6)
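The composite reward of Equations (3) to (6) can be sketched as a band lookup. Only $r_{1} = 1$ and $r_{3} = -1$ come from the text; the penalty values in `BAND_PENALTIES` are hypothetical placeholders, and treating each lower band edge as inclusive is an assumption.

```python
import bisect

R_ARRIVE = 1.0     # r_1 from the text
R_STEP = -1.0      # r_3 from the text
# Edges of the detection-probability bands in Equation (4).
BAND_EDGES = (0.3, 0.5, 0.6, 0.7, 0.8, 0.9)
# Hypothetical penalties (0, r_{2,0}, ..., r_{2,5}); not the paper's values.
BAND_PENALTIES = (0.0, -1.0, -2.0, -4.0, -8.0, -50.0, -100.0)

def detection_penalty(p_d):
    """R_b: look up the penalty for the band that contains p_d."""
    return BAND_PENALTIES[bisect.bisect_right(BAND_EDGES, p_d)]

def total_reward(at_target, p_d):
    """R = R_a + R_b + R_c as in Equation (6)."""
    r_a = R_ARRIVE if at_target else 0.0
    return r_a + detection_penalty(p_d) + R_STEP
```

Because the step penalty is added on every move, a longer path always accumulates a lower total, which is what steers the agent toward short routes even when no radar is nearby.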

The relationship between the detection penalty and detection probability is shown in Figure 3. When the radar detection probability is greater than 0.8, the UAV is heavily penalized, so that the stealth UAV avoids choosing this path node as much as possible. In this paper, the penalty values are set based on references [26,32] and multiple tests.

The environment will give feedback according to each action of the stealth UAV. If the action is valuable, such as approaching the target and avoiding the threat area, the environment will reward the stealth UAV. If the action is not valuable, such as moving away from the target and entering the threat area, the environment will penalize the stealth UAV.

A round ends when the stealth UAV reaches the target location or is shot down by anti-aircraft fire. At the end of each round, the cumulative discounted reward of the round is obtained, and the goal of the stealth UAV is to maximize this cumulative reward per round. The cumulative discounted reward (the return) can be expressed as

G_{t} = R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots = \sum_{k = 0}^{\infty} γ^{k} R_{t + k}

(7)

where $γ$ is the discount factor.
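As a small worked example of Equation (7), the return of a finite round can be accumulated backwards from the last step. The reward sequence and the value $γ = 0.9$ are illustrative assumptions, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """G_t for t = 0, accumulated backwards so each reward is discounted once per step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three step penalties, then arrival (arrival reward 1 plus step penalty -1 gives 0).
episode_rewards = [-1.0, -1.0, -1.0, 0.0]
g0 = discounted_return(episode_rewards, gamma=0.9)
```

Here the return is $-1 - 0.9 - 0.81 = -2.71$; a longer route to the same target would add further negative terms and lower the return.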

To avoid local optima in path planning, the role of each reward term should be made explicit. First, when anti-aircraft fire shoots down the stealth UAV, the penalty is a very large value, the cumulative discounted reward becomes small, and the stealth UAV will not choose this track in subsequent rounds. Second, each time the stealth UAV selects an action at a path point, it receives a step penalty. If the stealth UAV does not fly to the target position along the shortest path, the cumulative discounted reward becomes smaller, and the stealth UAV will not choose this track in subsequent rounds. Combining these two constraints addresses the local optimum problem in path planning.

3. Environmental Model and Algorithm

3.1. Environmental Model

The environment is set to a closed square airspace of $100\ \mathrm{km} \times 100\ \mathrm{km}$. The flight starting point of the stealth UAV is $(0, 0)$, the target endpoint is $(100\ \mathrm{km}, 100\ \mathrm{km})$, and the stealth UAV cannot fly out of the set airspace. To enable the stealth UAV to find the shortest path to the target location quickly, the pre-training environment contains no air defense fire network, so the stealth UAV can pass freely to the target point. After pre-training, the environmental model is changed to include four air defense radars in the airspace, located at $(30\ \mathrm{km}, 30\ \mathrm{km})$, $(30\ \mathrm{km}, 70\ \mathrm{km})$, $(70\ \mathrm{km}, 40\ \mathrm{km})$, and $(70\ \mathrm{km}, 80\ \mathrm{km})$. The detection probability of radar can be expressed as [30]

P_{d} \approx 0.5 \times \mathrm{erfc} (\sqrt{- \ln P_{fa}} - \sqrt{\mathrm{SNR} + 0.5})

(8)

where $P_{fa}$ represents the probability of false alarm, $\mathrm{SNR}$ represents the signal-to-noise ratio received by the radar, and $\mathrm{erfc} (\cdot)$ is the complementary error function, defined as

\mathrm{erfc} (z) = 1 - \frac{2}{\sqrt{π}} \int_{0}^{z} e^{- v^{2}} d v

(9)

According to Equation (8), the detection probability depends on the set false alarm probability and on the signal-to-noise ratio of the stealth UAV echo signal received by the radar. $\mathrm{SNR}$ can be expressed as

\mathrm{SNR} = \frac{P_{t} G^{2} λ^{2} σ}{{(4 π)}^{3} R^{4} k T_{e} B F_{n} L}

(10)

where $P_{t}$ is the peak transmit power, $G$ is the antenna gain, $λ$ is the signal wavelength, $σ$ is the radar cross-section, $R$ is the detection distance, $k$ is the Boltzmann constant, $T_{e}$ is the effective noise temperature, $B$ is the radar operating bandwidth, $F_{n}$ is the noise factor, and $L$ is the loss factor. Let $σ = 0.1$, let the other parameters be consistent with Section 4, and let the stealth UAV fly level. The resulting spatial detection probability of the radar is shown in Figure 4.
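Equations (8) and (10) can be evaluated directly with the standard library. The sketch below uses hypothetical radar parameters chosen only to produce plausible magnitudes; they are not the values from the paper's parameter table, and all quantities are assumed to be linear ratios in SI units.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K

def snr(p_t, gain, wavelength, rcs, dist, t_e, bandwidth, noise_factor, loss):
    """Equation (10): SNR as a linear ratio (all inputs in SI units)."""
    num = p_t * gain**2 * wavelength**2 * rcs
    den = (4 * math.pi) ** 3 * dist**4 * K_BOLTZMANN * t_e * bandwidth * noise_factor * loss
    return num / den

def detection_probability(snr_linear, p_fa):
    """Equation (8): approximate detection probability."""
    return 0.5 * math.erfc(math.sqrt(-math.log(p_fa)) - math.sqrt(snr_linear + 0.5))

# Hypothetical radar parameters, NOT the values used in the paper.
radar = dict(p_t=1.0e6, gain=10**3.5, wavelength=0.1, t_e=290.0,
             bandwidth=1.0e6, noise_factor=10**0.3, loss=10**0.6)
p_near = detection_probability(snr(rcs=0.1, dist=20e3, **radar), p_fa=1e-6)
p_far = detection_probability(snr(rcs=0.1, dist=80e3, **radar), p_fa=1e-6)
```

Because the SNR falls with the fourth power of distance, the detection probability drops sharply between the near and far cases, which is what produces the ring-shaped contours around each radar when the RCS is held constant.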

Figure 4 shows that when the RCS is constant, the contours of the radar detection probability are concentric circles centered on the radar, which reduces the model's complexity. However, this differs from the actual situation. During the flight of the UAV, its angle relative to the radar changes, so its RCS changes dynamically, and the detection probability of the radar for the UAV changes accordingly. Only by incorporating the dynamic characteristics of the RCS can the path planning be made more realistic. When multiple radars are deployed independently, the stealth UAV will be shot down if the probability of being detected by any radar is greater than a given threshold. This scenario increases the difficulty of stealth UAV penetration.
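The two ideas in this paragraph, an aspect-dependent RCS and the any-radar shoot-down rule, can be sketched as follows. The RCS pattern in `rcs_from_aspect` is a purely hypothetical shape (small nose-on and tail-on, large broadside) and does not represent the measured RCS of the authors' UAV; the UAV state and the 0.9 default threshold are likewise illustrative.

```python
import math

def aspect_angle_deg(uav_xy, heading_deg, radar_xy):
    """Angle between the UAV's heading and the line of sight to the radar, in [0, 180]."""
    los = math.degrees(math.atan2(radar_xy[1] - uav_xy[1], radar_xy[0] - uav_xy[0]))
    diff = (los - heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

def rcs_from_aspect(aspect_deg, nose_rcs=0.01, side_rcs=1.0):
    """Hypothetical RCS pattern: smallest nose-on and tail-on, largest broadside."""
    return nose_rcs + (side_rcs - nose_rcs) * math.sin(math.radians(aspect_deg)) ** 2

def is_shot_down(p_d_per_radar, threshold=0.9):
    """Independent radars: the UAV is lost if any single radar exceeds the threshold."""
    return max(p_d_per_radar) > threshold

RADARS_KM = [(30, 30), (30, 70), (70, 40), (70, 80)]  # radar positions from the text
uav, heading = (50.0, 30.0), 90.0                     # hypothetical UAV state
aspects = [aspect_angle_deg(uav, heading, r) for r in RADARS_KM]
rcs_values = [rcs_from_aspect(a) for a in aspects]
```

Each radar sees a different aspect angle at the same instant, so each one would use a different RCS value in its own detection probability, and the contours around the radars are no longer circular.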

3.2. Algorithm Design

The DDQN algorithm directly estimates the action value function based on the temporal-difference (TD) algorithm. The TD algorithm uses the current reward plus the value estimate of the next state as the target for the value of the current state:

Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [R_{t} + γ \max_{a} Q (s_{t + 1}, a) - Q (s_{t}, a_{t})]

(11)

where $α$ represents the step size, and $R_{t} + γ \max_{a} Q (s_{t + 1}, a) - Q (s_{t}, a_{t})$ represents the temporal-difference error. To balance exploration and exploitation, the $ε$-greedy strategy is adopted in each action selection. With probability $ε$, an action is selected uniformly at random from the action space, and with probability $1 - ε$, the action with the maximum action value is selected. The $ε$-greedy strategy can be expressed as

π (a | s) = \begin{cases} \frac{ε}{|Ω|} + 1 - ε, & a = \arg \max_{a^{'}} Q (s, a^{'}) \\ \frac{ε}{|Ω|}, & \text{otherwise} \end{cases}

(12)
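Equation (12) can be checked with a short sketch that computes the action probabilities and draws an action. The Q values are made up for illustration, and the function names are hypothetical.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities from Equation (12)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def select_action(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.7, 0.3, -0.2, 0.0]  # made-up Q values for the five steering angles
probs = epsilon_greedy_probs(q, epsilon=0.01)
```

Note that the random branch can also return the greedy action, which is why the greedy action's total probability is $ε / |Ω| + 1 - ε$ rather than $1 - ε$.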

The stealth UAV uses the $ε$-greedy strategy to interact with the environment and stores the five-tuple $(s, a, R, s^{'}, d)$ obtained from each interaction in the experience replay pool, where $d$ indicates the flight status of the stealth UAV: $d = \mathrm{True}$ indicates that the stealth UAV can continue to fly, and $d = \mathrm{False}$ indicates the end of the flight. A path can end in two ways: either the stealth UAV reaches the target point and obtains a reward, or the stealth UAV is shot down by anti-aircraft fire and obtains a penalty. The Q-network is trained on a randomly sampled batch of five-tuples. Because the target $R_{t} + γ \max_{a} Q (s_{t + 1}, a)$ contains the output of the network being trained, the target shifts every time the network parameters are updated. To stabilize training, we introduce the target network $Q_{θ^{-}}$, which is updated only after every $N_{\mathrm{train}}$ updates of the training network. At the same time, to reduce the error accumulation caused by the maximum operation in the target, the training network $Q_{θ}$ selects the action with the maximum value, and the target network $Q_{θ^{-}}$ calculates the value of that action. In this case, the loss function of the network is given in Equation (2).
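The experience pool and the periodic target network update described above can be sketched with the standard library. The class name `ReplayBuffer`, the helper `should_sync_target`, and the fixed capacity are illustrative assumptions; the text does not say how the pool is bounded.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity pool of (s, a, R, s', d) transitions; oldest entries are dropped first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, d):
        self.data.append((s, a, r, s_next, d))

    def sample(self, batch_size, rng=random):
        """Uniform sampling without replacement from the stored transitions."""
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

def should_sync_target(update_count, n_train):
    """True on every n_train-th update of the training network."""
    return update_count > 0 and update_count % n_train == 0

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # d follows the text's convention: True while the UAV can continue flying.
    buffer.push((step, step), 0, -1.0, (step + 1, step + 1), True)
batch = buffer.sample(2)
```

Sampling uniformly from the pool, rather than training on consecutive steps, reduces the correlation between the transitions in a batch.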

In summary, the path planning algorithm of a stealth UAV based on DDQN is shown in Algorithm 1.

Algorithm 1. Path planning algorithm of a stealth UAV based on DDQN

Initialize the training network $Q_{θ}$ and the experience replay pool R, set $Q_{θ^{-}} = Q_{θ}$, and enter the hyperparameters
For round $e = 1 \to E$ do
    Get the initial state $s$
    While $d = \mathrm{True}$ do
        According to the current network $Q_{θ} (s, a)$, choose action $a_{t}$ with the $ε$-greedy strategy
        Perform action $a_{t}$, obtain reward $R_{t}$, move to state $s_{t + 1}$, and update $d$
        Store $(s_{t}, a_{t}, R_{t}, s_{t + 1}, d)$ in the experience replay pool
        If the amount of data in R is greater than $N_{\min}$ do
            Randomly sample $N$ groups of data ${(s_{i}, a_{i}, R_{i}, s_{i + 1}, d)}_{i = 1, \dots, N}$ from R
            Calculate the TD target $y_{i} = R_{i} + γ Q_{θ^{-}} (s_{i + 1}, \arg \max_{a^{'}} Q_{θ} (s_{i + 1}, a^{'}))$
            Minimize the loss function $L = \frac{1}{N} \sum_{i} {(y_{i} - Q_{θ} (s_{i}, a_{i}))}^{2}$ and update the training network $Q_{θ}$
            When the number of training network updates reaches $N_{\mathrm{train}}$, set $θ^{-} \leftarrow θ$
        end if
    end while
end for

4. Simulation Results

The stealth UAV realizes the state-to-action mapping with the proposed path planning algorithm. The hyperparameter settings follow the literature [25] and were fine-tuned after multiple simulation tests; the value of each hyperparameter is shown in Table 1. The learning rate $lr$ balances convergence speed and algorithm stability. The discount factor $γ$ represents the weight of future rewards: a discount factor close to 0 places more attention on near-term rewards, and a discount factor close to 1 places more attention on long-term rewards. The greedy probability $ε$ balances exploration and exploitation.

In the pre-training environment, there is no air defense fire unit. Through pre-training, the stealth UAV can find the shortest flight path to the target location, which provides prior experience for the subsequent rapid penetration of the air defense fire units and reduces the later training time. When there is no anti-aircraft fire unit, the flight path of the stealth UAV is shown in Figure 5.

Figure 5 shows that when there is no anti-aircraft fire in the airspace, the stealth UAV flies directly from the starting point to the target position along the shortest path, which verifies the effectiveness of the proposed algorithm in the absence of anti-aircraft fire. Further, the convergence rate of the algorithm can be obtained by observing the relationship between the return of the stealth UAV and the number of rounds, which is shown in Figure 6.

Figure 6 shows that when there is no threat of anti-aircraft fire, the return obtained by the stealth UAV converges quickly. The return fluctuates slightly after convergence because the stealth UAV must balance exploration and exploitation and therefore selects actions at random with probability $ε$. This residual exploration helps to find the optimal decision again if the environment changes.

Then, the environment model is changed to include four air defense radars in the airspace. Before the simulation experiment, the parameters of the combat environment between the stealth UAV and the radars are first set. The parameters are shown in Table 2.

With the RCS of the stealth UAV taken into account, the detection probability of the radars over the airspace is shown in Figure 7.

From the observation of Figure 7, it can be seen that the probability of the stealth UAV being detected by radar is not only related to the distance but is also closely related to its RCS characteristics because the RCS of stealth UAVs from different angles is different. Based on the model obtained in pre-training, the flight path planning of the stealth UAV is trained again.

To verify the stealth UAV path planning algorithm based on a dynamic RCS and DDQN, the path obtained with the A* algorithm [31] and the path obtained with DDQN under a fixed RCS are also presented in the simulation. The simulation results are shown in Figure 8.

As can be seen from Figure 8, all three algorithms can avoid the anti-aircraft kill zone and reach the target position. The path length based on the DDQN algorithm with a fixed RCS is $178.91\ \mathrm{km}$, the path length based on the A* algorithm is $167.20\ \mathrm{km}$, and the path length based on the DDQN algorithm with a dynamic RCS is $173.05\ \mathrm{km}$. Although the path based on the DDQN algorithm with a fixed RCS reaches the destination, it is long and not optimal because it fails to make full use of the RCS characteristics. Although the path based on the A* algorithm reaches the destination with a shorter path, it has its own problems: it takes the diagonal as the reference, lacks global perception ability, passes through the threatened area several times, and contains many turns. Compared with these two cases, the path based on the DDQN algorithm with a dynamic RCS overcomes both sets of shortcomings and balances path length against fire threat. By interacting with the environment, the stealth UAV can avoid the anti-aircraft kill zone and fly to the target location along a short path. Finally, the relationship between the return of the stealth UAV and the number of rounds is shown in Figure 9.

Figure 9 shows that when four air defense radars are deployed, the return obtained by the stealth UAV still converges quickly. The fluctuation of the return is related to the value of $ε$. When $ε = 1$, the stealth UAV selects actions at random at each step to explore the entire airspace; when $ε = 0$, the stealth UAV uses the current information to select the action with the greatest estimated value; when $ε \in (0, 1)$, the stealth UAV balances exploration and exploitation. In this paper, $ε = 0.01$ is set after many experiments, and the experimental results verify the effectiveness of the proposed algorithm. In summary, this section analyzes the influence of the hyperparameter settings on the algorithm's performance and highlights the advantages of the proposed algorithm through a comparison of path length and convergence speed.

5. Conclusions

UAV flight path planning technology has developed rapidly, but penetration flight path planning based on the dynamic RCS of the stealth UAV has not been studied. This paper presents an effective route planning method for the stealth UAV. The method considers the detection probability of multiple radars and the constraints imposed by the RCS characteristics of the stealth UAV, discretizes the path space, and guides the stealth UAV to the target position efficiently through the proposed reward function. The simulation results show that, compared with the A* algorithm and the DDQN algorithm with a fixed RCS, the proposed DDQN-based path planning algorithm for stealth UAV penetration in multi-radar air defense zones performs well in completeness, effectiveness, and adaptability. This paper has achieved good experimental results in 2D path planning, but it does not address 3D path planning or verification with measured data for a stealth UAV. Future work will study 3D path planning according to the 3D RCS characteristics of the stealth UAV and, on this basis, carry out verification with actual measurements to bring the experimental results closer to the real battlefield environment.

Author Contributions

Conceptualization, L.B. and Z.G.; methodology, L.B. and Z.G.; software, L.B. and Z.G.; validation, X.G.; formal analysis, X.G.; resources, X.G.; writing—original draft preparation, L.B. and Z.G.; writing—review and editing, L.B., Z.G. and C.L.; supervision, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Technology Innovation Talent Project of the National University of Defense Technology (202501-YJRC-XX-022).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

DURC Statement

This research undertakes an exploration of the application of UAVs within military contexts. It is essential to recognize that while military technological advancements enhance national security capabilities, they concurrently introduce significant ethical risks, especially with respect to dual-use technologies. These technologies, given their inherent flexibility, possess the potential for exploitation in ways that are inconsistent with moral and legal norms. We hereby make a solemn declaration. Throughout the entirety of this research, every action has been guided by profound respect for human safety, dignity, and international laws. The research has been conducted in strict compliance with established ethical frameworks governing military applications. Our overarching objective is to ensure that any technological outcomes derived from this research are exclusively employed for purposes that uphold peace, safeguard national sovereignty, and deter unprovoked aggression. The simulation code can be obtained by contacting the author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

Pan, Z.H.; Zhang, C.X.; Xia, Y.Q.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1129–1133.
Anderse, M.; Gonzalez, N.; Solerml, L. Informed scenario based RRT for aircraft trajectory planning under ensemble forecasting of thunder storms. Transp. Res. Part C Emerg. Technol. 2021, 129, 103232.
Aggarwal, S.; Kumar, N. Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges. Comput. Commun. 2019, 149, 270–299.
Belge, E.; Altan, A.; Hacıoğlu, R. Metaheuristic Optimization-Based Path Planning and Tracking of Quadcopter for Payload Hold-Release Mission. Electronics 2022, 11, 1208.
Maurović, I.; Seder, M.; Lenac, K.; Petrović, I. Path planning for active SLAM based on the D* algorithm with negative edge weights. IEEE Trans. Syst. Man Cybern. Syst. 2018, 48, 1321–1331.
Yang, J.; Xu, X.; Yin, D.; Ma, Z.; Shen, L.C. A space mapping based 0–1 linear model for onboard conflict resolution of heterogeneous unmanned aerial vehicles. IEEE Trans. Veh. Technol. 2019, 68, 7455–7465.
Liu, C.; Xie, F.; Ji, T. Fixed-Wing UAV Formation Path Planning Based on Formation Control: Theory and Application. Aerospace 2024, 11, 1.
Zhou, H.; Xiong, H.-L.; Liu, Y.; Tan, N.-D.; Chen, L. Trajectory planning algorithm of UAV based on system positioning accuracy constraints. Electronics 2020, 9, 250.
Xiang, H.B.; Liu, X.B.; Song, X.S.; Zhou, W. UAV Path Planning Based on Enhanced PSO-GA. In Artificial Intelligence: Proceedings of the Third CAAI International Conference, CICAI 2023, Fuzhou, China, 22–23 July 2023, Part II; Springer: Berlin/Heidelberg, Germany, 2024; Volume 14474, pp. 271–282.
Maw, A.A.; Tyan, M.; Nguyen, T.A.; Lee, J.-W. iADA*-RL: Anytime graph-based path planning with deep reinforcement learning for an autonomous UAV. Appl. Sci. 2021, 11, 3948.
Zhao, X.; Yang, R.; Zhong, L.; Hou, Z. Multi-UAV Path Planning and Following Based on Multi-Agent Reinforcement Learning. Drones 2024, 8, 18.
Ferguson, D.; Stentz, A. Field D*: An Interpolation-Based Path Planner and Replanner; Springer: Berlin/Heidelberg, Germany, 2007.
Ye, Q.; Hu, X.; Ma, H. A two-stage method for cooperative target allocation in UAV teams. J. Hefei Univ. Technol. (Nat. Sci.) 2015, 38, 1431–1436. (In Chinese)
Lin, C.E.; Syu, Y.M. GA/DP hybrid solution for UAV multi-target path planning. J. Aeronaut. Astronaut. Aviat. 2016, 48, 203–220.
Phung, M.D.; Ha, Q.P. Safety-enhanced UAV path planning with spherical vector-based particle swarm optimization. Appl. Soft Comput. 2021, 107, 107376.
Zhang, J.; Zhu, X.; Li, J. Intelligent Path Planning with an Improved Sparrow Search Algorithm for Workshop UAV Inspection. Sensors 2024, 24, 1104.
Konatowski, S.; Pawlowski, P. Application of the ACO algorithm for UAV path planning. Prz. Elektrotechniczny 2019, 1, 117–121.
Shin, J.J.; Bang, H. UAV path planning under dynamic threats using an improved PSO algorithm. Int. J. Aerosp. Eng. 2020, 2020, 8820284.
Ye, C.; Wang, W.T.; Zhang, S.P.; Shao, P. Optimizing 3D UAV Path Planning: A Multi-strategy Enhanced Beluga Whale Optimizer. Lect. Notes Artif. Intell. 2024, 14448, 42–54.
Zhou, B.; Guo, Y.; Li, N.; Zhong, X. Path planning of UAV using guided enhancement Q-learning algorithm. Acta Aeronaut. Astronaut. Sin. 2021, 42, 498–505. (In Chinese)
He, J.; Ding, Y.; Yang, Y.; Huang, X. Unmanned aerial vehicle path planning based on PF-DQN in unknown environment. Ordnance Ind. Autom. 2020, 39, 116–123. (In Chinese)
Melo, F.S. Convergence of Q-Learning: A Simple Proof; Institute for Systems and Robotics: Lisbon, Portugal, 2001; pp. 1–4.
Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report; University of Cambridge: Cambridge, UK, 1994.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
Li, J.; Zhang, T.; Liu, K. Memory-Enhanced Twin Delayed Deep Deterministic Policy Gradient (ME-TD3)-Based Unmanned Combat Aerial Vehicle Trajectory Planning for Avoiding Radar Detection Threats in Dynamic and Unknown Environments. Remote Sens. 2023, 15, 5494.
Zhao, Z.; Niu, Y.; Ma, Z.; Ji, X. A fast stealth trajectory planning algorithm for stealth UAV to fly in multi-radar network. In Proceedings of the 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR), Angkor Wat, Cambodia, 6–10 June 2016; pp. 549–554.
Zhang, Z.; Wu, J.; Dai, J.; He, C. A Novel Real-Time Penetration Path Planning Algorithm for Stealth UAV in 3D Complex Dynamic Environment. IEEE Access 2020, 8, 122757–122771.
Zhang, Z.; Wu, J.; Dai, J.Y.; Ying, J.; He, C. Fast penetration path planning for stealth UAV based on improved A-Star algorithm. Acta Aeronaut. Astronaut. Sin. 2020, 41, 254–264.
Zhang, Z.; Wu, J.; Dai, J.; He, C. Rapid Penetration Path Planning Method for Stealth UAV in Complex Environment with BB Threats. Int. J. Aerosp. Eng. 2020, 2020, 8896357.
Yuan, C.; Ma, D.; Jia, Y.; Zhang, L. Stealth Unmanned Aerial Vehicle Penetration Efficiency Optimization Based on Radar Detection Probability Model. Aerospace 2024, 11, 561.
Mahafza, B.R.; Elsherbeni, A.Z. MATLAB Simulations for Radar Systems Design; Chapman & Hall/CRC: Boca Raton, FL, USA; Taylor & Francis Group: Abingdon, UK, 2004; pp. 56–57.

Figure 1. The real-time decision frame diagram of a stealth UAV.

Figure 2. The Q-network architecture.

Figure 3. The relationship between the detection penalty and detection probability.

Figure 4. Radar airspace detection probability when $σ = 0.1$.

Figure 5. Stealth UAV path without anti-aircraft fire.

Figure 6. Optimal convergence rate of stealth UAV without anti-aircraft fire.

Figure 7. Radar airspace detection probability.

Figure 8. Stealth UAV path when four air defense radars are deployed.

Figure 9. Convergence rate of stealth UAV when four air defense radars are deployed.

Table 1. Algorithm hyperparameter settings.

Parameter Name	Parameter Value	Parameter Name	Parameter Value
Learning rate	$lr = 2 \times 10^{-3}$	Total rounds	$E = 800$
Discount factor	$γ = 0.98$	Greed probability	$ε = 0.01$
Target network update interval	$N_{train} = 10$	Maximum storage in R	$N_{\max} = 10{,}000$
Minimum number of samples in R	$N_{\min} = 1000$	Sampling batch value	$N = 64$
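
As a rough illustration of how the settings in Table 1 might be wired into a training loop, the sketch below collects them in a single configuration object and shows the two gating checks they imply: training starts only once the buffer R holds the minimum number of samples, and the target network is refreshed at a fixed interval. The class and function names (`DDQNConfig`, `ReplayBuffer`, `ready_to_train`, `should_sync_target`) are hypothetical, and counting the update interval in training updates rather than episodes is an assumption, not something the table specifies.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DDQNConfig:
    """Values copied from Table 1; the field names are illustrative only."""
    learning_rate: float = 2e-3        # lr
    gamma: float = 0.98                # discount factor
    epsilon: float = 0.01              # greedy (exploration) probability
    total_episodes: int = 800          # E
    target_update_interval: int = 10   # N_train
    buffer_capacity: int = 10_000      # N_max
    min_buffer_size: int = 1_000       # N_min
    batch_size: int = 64               # N


class ReplayBuffer:
    """Fixed-capacity buffer R; the oldest transitions are dropped first."""

    def __init__(self, capacity: int) -> None:
        self._data = deque(maxlen=capacity)

    def add(self, transition) -> None:
        self._data.append(transition)

    def __len__(self) -> int:
        return len(self._data)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self._data), batch_size)


def ready_to_train(buffer: ReplayBuffer, cfg: DDQNConfig) -> bool:
    """Training begins only after R holds at least N_min samples."""
    return len(buffer) >= cfg.min_buffer_size


def should_sync_target(update_count: int, cfg: DDQNConfig) -> bool:
    """Assumption: the target network is copied every N_train updates."""
    return update_count > 0 and update_count % cfg.target_update_interval == 0
```

Keeping the values in one frozen object makes it harder for the learning code and the logging code to drift apart when a hyperparameter is changed for an ablation run.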

Table 2. Radar counter environment parameters.

Parameter	Parameter Value	Parameter	Parameter Value
$G$	$20\ \text{dB}$	$P_{t}$	$30\ \text{MW}$
$f_{0}$	$9\ \text{GHz}$	$c$	$3 \times 10^{8}\ \text{m/s}$
$λ$	$c / f_{0}$	$k$	$1.38 \times 10^{-23}\ \text{J/K}$
$T_{e}$	$290\ \text{K}$	$B$	$100\ \text{MHz}$
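
The quantities in Table 2 imply a few derived values that are easy to check numerically. The sketch below converts the antenna gain from dB to a linear ratio, computes the wavelength $λ = c / f_{0}$ (about 0.033 m), and the thermal noise power $k T_{e} B$ (about $4.0 \times 10^{-13}$ W). The function `snr_single_pulse` uses the common textbook monostatic radar equation with the noise figure and losses set to 1; this form is an assumption for illustration and may differ from the exact expression used in the paper.

```python
import math

# Values taken from Table 2.
G_DB = 20.0      # antenna gain, dB
P_T = 30e6       # peak transmit power, W
F0 = 9e9         # carrier frequency, Hz
C = 3e8          # speed of light, m/s
K_B = 1.38e-23   # Boltzmann constant, J/K
T_E = 290.0      # effective noise temperature, K
B = 100e6        # receiver bandwidth, Hz


def db_to_linear(value_db: float) -> float:
    """Convert a power ratio in dB to a linear ratio."""
    return 10.0 ** (value_db / 10.0)


wavelength = C / F0               # lambda = c / f0, in metres
gain_linear = db_to_linear(G_DB)  # 20 dB corresponds to a ratio of 100
noise_power = K_B * T_E * B       # thermal noise power k * Te * B, in watts


def snr_single_pulse(rcs_m2: float, range_m: float) -> float:
    """Single-pulse SNR from the textbook radar equation.

    Assumption: noise figure and system losses are both 1, so this is an
    optimistic upper bound rather than the paper's exact model.
    """
    numerator = P_T * gain_linear ** 2 * wavelength ** 2 * rcs_m2
    denominator = (4.0 * math.pi) ** 3 * noise_power * range_m ** 4
    return numerator / denominator
```

Because the SNR falls with the fourth power of range, doubling the distance to a radar lowers it by a factor of 16, which is why both the RCS presented to the radar and the stand-off distance matter for the detection probability.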

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
