Research on Path Planning Methods and Characteristics of Urban Unmanned Aerial Vehicles Under Noise Constraints

Chen, Yaqing; Jin, Yunfei; He, Xin; Zhang, Yumei

doi:10.3390/drones10030227

School of Air Traffic Management, Civil Aviation Flight University of China, Guanghan 618307, China

Author to whom correspondence should be addressed.

Drones 2026, 10(3), 227; https://doi.org/10.3390/drones10030227

Submission received: 12 February 2026 / Revised: 20 March 2026 / Accepted: 20 March 2026 / Published: 23 March 2026

(This article belongs to the Section Innovative Urban Mobility)

Highlights

What are the main findings?

Under noise threshold constraints, UAVs tend to increase their altitude above ground level (AGL) to maintain noise compliance, and this altitude adjustment exhibits a clear stepwise upward pattern. As the UAV source noise level increases or the noise threshold becomes more stringent, the required AGL correspondingly rises and may conflict with the prescribed maximum AGL altitude, thereby creating a trade-off between noise compliance and altitude limitations.
Adjusting the noise threshold can significantly influence the feasibility of UAV trajectory planning under noise constraints. For a given urban scenario with specified local noise limits, the maximum allowable UAV source noise level can be further identified. This offers a quantitative basis for evaluating the noise-compliant operational feasibility of UAVs with different noise characteristics in the target area.

What are the implications of the main findings?

A tiered regulatory framework can be established by jointly considering UAV noise characteristics and requirements adapted to the local needs, thereby enabling differentiated AGL flight altitude, depending on UAV noise source strength.
Under specific built-environment scenarios and local noise limits, UAV entry limits and operational restriction policies can be established, thereby enabling the matching between UAV noise classes and local noise limits prior to mission dispatch and ensuring noise compliance.

Abstract

This study proposes TNAP-DDQN, a deep reinforcement learning method for urban low-altitude UAV path planning under residential noise threshold constraints. With time cost and safety risk as the optimization objectives, operational constraints such as collision risk and maximum AGL altitude are incorporated to achieve coordinated optimization of noise compliance, operational safety, and efficiency. To mitigate action space contraction and training instability induced by multiple constraints, a Noise-Degradation-Mask-based Action Bias Network (NDM-ABN) is introduced at the action selection layer. A three-tier degradation scheme prevents empty candidate sets, while bias-based decision making is applied to approximately tied actions to stabilize the policy. Moreover, multi-step prioritized experience replay (PER) improves sample efficiency and long-horizon return modeling, and potential-based reward shaping (PBRS) transforms sparse constraint signals into auxiliary rewards. Simulation results indicate that: (1) NDM-ABN is the key module for stabilizing the noise-exposure process by suppressing high-noise actions; (2) the required AGL is related to the UAV source noise level and local noise limits, implying the need for differentiated AGL altitude classes; and (3) the maximum admissible UAV source noise level increases as the threshold is relaxed. The proposed method provides quantitative guidance for noise-entry and AGL altitude regulation, while future work will incorporate additional metrics (e.g., A-weighted equivalent sound level) to better capture noise fluctuations and short-term peaks.

Keywords:

urban low-altitude UAV; 3D path planning; noise constraints; deep reinforcement learning

1. Introduction

With the rapid development of unmanned aerial vehicle (UAV) technologies, UAVs have been increasingly deployed in civil and commercial applications, particularly in parcel delivery, agricultural monitoring, and environmental protection. However, as UAV operations in urban environments continue to expand, noise has gradually become a major public concern. Therefore, how to effectively mitigate noise-related impacts within UAV path planning has become an urgent problem.

To address various operational challenges in UAV missions, researchers have developed a wide spectrum of path planning algorithms over the past decade. Jin et al. [1] proposed a quadrotor cooperative source-seeking algorithm for source localization in environments affected by disturbances and communication constraints. The algorithm combines formation control and gradient-free optimization, guiding a group of quadrotors towards the optimal point of a scalar field without relying on gradient estimation. Experimental and simulation results validated the effectiveness of this method, demonstrating that the quadrotors successfully reach the target source position while maintaining formation, ensuring robustness and stability.

Nevertheless, as systematically documented in recent comprehensive surveys, these methodological advances primarily focus on conventional optimization objectives, leaving acoustic noise constraints largely unaddressed. Meng et al. [2] traced the evolution from classical graph-based algorithms to modern AI techniques, emphasizing the shift toward energy-aware planning and multi-UAV cooperative strategies. Complementing this, Sheltami et al. [3] further mapped these algorithmic paradigms—approximately 30% classical, 29% meta-heuristic, 18% AI-based, and 23% hybrid—onto diverse application domains.

Nevertheless, most existing studies on UAV path planning primarily optimize time cost or safety risk and generally overlook noise-related impacts. Jin et al. [4] explored self-triggered distributed control for fixed-wing UAV formations under velocity and overload constraints. The study proposed a UAV kinematic model incorporating wind disturbances and designed an inner–outer loop control structure. By utilizing a self-triggered sampling mechanism, communication overhead was reduced, and the control method ensured that UAVs remained within desired speed and overload ranges while maintaining formation. Stability of the system was proven using the small-gain theorem, and experimental results demonstrated the method’s effectiveness in addressing various constraint conditions. Regarding conventional path-planning methods, Zammit et al. [5] compared the performance of the A* and RRT* algorithms in three-dimensional environments, highlighting the advantages of A* in generating shorter paths and achieving high computational efficiency. Brown et al. [6] proposed a trajectory generation method that combines quintic polynomials and multi-objective particle swarm optimization for high-altitude long-endurance UAV maritime radar surveillance tasks. This method optimizes fuel consumption, detection probability, and revisit time, achieving globally optimal trajectory planning.

Subsequently, Cheriet et al. [7] conducted an in-depth analysis of A*, RRT*, and PSO for path-planning efficiency in urban 3D scenarios; through experimental evaluations under different flight conditions, their work provides references for selecting appropriate path-planning algorithms. However, although these methods perform well in terms of path efficiency, they do not explicitly incorporate noise constraints into path planning. To enhance operational safety, Primatesta et al. [8] proposed a risk-aware path-planning method that incorporates risk factors into route selection to reduce potential hazards, and Zhang et al. [9] further developed a ground-risk-map-based planning approach by jointly optimizing ground risk and flight cost, emphasizing the need to consider multiple risk factors in urban environments. Feng et al. [10] introduced a risk-assessment framework for low-altitude airspace path planning by integrating multi-dimensional risk quantification, aiming to improve both flight safety and efficiency. Pang et al. [11] proposed a risk-based UAV traffic network model that classifies and quantifies risks in urban airspace, thereby enabling a comprehensive evaluation framework for path planning. Pol Mestres et al. [12] introduced a motion planning algorithm that integrates Control Lyapunov Functions (CLFs) and Control Barrier Functions (CBFs) with Rapidly Exploring Random Trees (RRT). The algorithm generates safe and dynamically feasible paths, ensuring collision-free navigation that can be executed with CLF-CBF controllers. The method avoids complex boundary value problem computations and was validated through simulation and hardware experiments, which showed faster execution times than traditional methods. In addition, Song et al. [13] incorporated an artificial potential field into an improved A* algorithm to dynamically adjust the trajectory, improve path smoothness, and support safe UAV flight in complex urban environments. Niknejad et al. [14] proposed Da SP-RRT, which constructs data-driven invariant sets to guarantee safety while optimizing trajectory costs, addressing the safety–performance trade-off in unknown dynamic environments.

Additionally, to address frequent path conflicts and excessive computational overhead in large-scale multi-UAV cooperative operations within urban low-altitude airspace, Lu et al. [15] proposed a conflict-free path-planning method that integrates three-dimensional jump point search (3D-JPS) with an incremental update mechanism. However, these studies primarily focus on homogeneous UAV swarms in urban environments. Zhang et al. [16] proposed a distributed time-varying optimization algorithm that guides multiple UAVs in dynamic target tracking while ensuring collision avoidance through a fixed-time consistency protocol. This approach addresses the challenges of global optimization and fast convergence, offering a new perspective for path planning in multi-UAV missions. Wang et al. [17] proposed a PSE-D model-based cooperative path planning framework for UAV-USV systems in anti-submarine missions, decomposing the problem into waypoint determination and trajectory generation phases to address the heterogeneous planning requirements between aerial and surface platforms.

With the rise of deep reinforcement learning (DRL), DRL-based path-planning approaches have also attracted substantial attention. Zhou [18] proposed a DRL-based path-planning algorithm that significantly improves efficiency and safety by optimizing the reward design. Xie et al. [19] employed DRL to address the complexity of dynamic environments, enhancing the real-time performance and flexibility of path planning. Han et al. [20] proposed a PSO-DQN method that strengthens the learning capability of the Q-network, thereby improving planning efficiency. Thanh Thi Nguyen et al. [21] reviewed the application of DRL to path planning in autonomous systems and highlighted both challenges and recent advances, indicating that DRL methods have stronger potential than traditional approaches. Xu Zhenyang et al. [22] proposed the TCP-DQN algorithm, which substantially improves planning efficiency and safety performance for low-altitude vehicles in complex environments via goal-oriented curriculum learning and a prioritized replay strategy. In application domains such as logistics delivery and environmental monitoring, Deng et al. [23] proposed a path-planning method that combines an improved beetle antennae search (BAS) with simulated annealing (SA); by dynamically adjusting the search step size and antenna sensing length, the method improves search accuracy and real-time capability while ensuring flight safety and obstacle avoidance. Hu et al. [24] proposed a hybrid path-planning method that integrates an improved ant colony optimization algorithm (IACO) with rapidly exploring random trees (RRT), focusing on logistics delivery in dense urban-building environments to improve search efficiency and path accuracy. Cao et al. [25] further developed an IACO–RRT hybrid method to enhance planning accuracy and strengthen applicability in complex urban environments.

Although the above studies have made progress in terms of safety, efficiency, and obstacle avoidance, their optimization models generally do not sufficiently account for in-flight noise constraints. As UAV operations in urban environments continue to expand, the noise-related impacts of UAV operations on urban environments and residents have increasingly attracted academic attention. Christian et al. [26] reported, through a comparative study, that UAV noise is more disturbing than road traffic noise under comparable sound pressure levels. Bulusu et al. [27] investigated the noise-related impacts associated with large-scale UAV operations in low-altitude urban airspace via simulation, indicating that UAV noise may induce pronounced perceptual impacts on urban residents. Whelchel et al. [28] examined the noise generated by multirotor UAVs under different operating modes and found that flight mode and receiver location significantly affect noise perception. Torija et al. [29] further investigated the influence of UAV hovering on urban soundscapes and discussed the interaction between UAV noise and other environmental noise sources in cities, such as road traffic noise. Torija et al. [30] suggested that, although the sound pressure level of UAV noise is lower than that of conventional aircraft noise, its distinctive timbre and frequency characteristics may still impose potential community impacts. In addition, Schaffer et al. [31] provided a systematic review and noted that the psychoacoustic effects of UAV noise are more complex than those of traditional transportation noise. Hui et al. [32] showed that UAV noise during hovering and forward flight is strongly associated with cognitive disturbance and human emotional responses, particularly in relatively quiet environments. Collectively, these studies demonstrate that UAV noise can induce non-negligible noise-related impacts in urban settings.

Against this backdrop, a growing body of work has begun to incorporate noise-related impacts into path planning. Bian et al. [33] evaluated the environmental impacts of UAV noise in urban airspace via virtual flight simulations, highlighting the critical role of flight trajectories in noise propagation. To effectively reduce noise pollution, Tan et al. [34] proposed a low-noise UAV trajectory planning method that optimizes flight routes based on a noise assessment platform to mitigate environmental impacts during operations. In the same year, Tan et al. [35] further investigated UAV facility siting to maximize logistics efficiency while reducing noise impacts. Tan et al. [36] also proposed, through virtual flight simulations, a low-noise path-planning method aimed at reducing UAV noise pollution via rational route design. In addition, Zhang et al. [37] proposed a multi-objective optimization approach that jointly considers flight cost, noise pollution, and safety risk to optimize low-altitude urban logistics networks. Finally, Chen et al. [38] presented a real-time trajectory planning method that integrates noise control and safety risk.

Despite the growing body of research on low-noise path-planning methods, two important limitations remain. First, local noise limits have not been explicitly incorporated into path-planning constraints, making it difficult for the resulting trajectories to satisfy practical regulatory requirements in noise-sensitive areas. Second, most existing studies do not jointly address the trade-off between safety risk and time cost. To address these limitations, this study investigates UAV path planning in a simplified urban environment constructed from realistic urban building geometries under explicit local noise-limit constraints, with the objective of minimizing safety risk and flight time.

The main contributions of this study are threefold. First, local noise limits are explicitly incorporated into the path-planning framework, thereby improving the practical relevance of the planned trajectories under urban regulatory requirements. Second, this study performs path planning with safety risk and flight time as optimization objectives, ensuring local noise compliance while effectively coordinating noise compliance with other objectives. Third, the proposed TNAP-DDQN algorithm integrates the Noise Degradation Mask-based Action Bias Network (NDM-ABN), a three-level degradation mechanism, and a bias-based decision strategy to enhance action feasibility and policy stability under multiple constraints, while a hybrid experience replay mechanism is introduced to improve sample efficiency and long-term reward learning.

The remainder of this paper is organized as follows. Section 2 introduces the problem formulation and related constraints. Section 3 presents the proposed method. Section 4 reports the experimental setup and results analysis. Section 5 concludes the paper and discusses future research directions.

2. Materials and Methods

Three-dimensional UAV path planning in low-altitude urban airspace is subject to multiple operational constraints and multi-objective trade-offs in practical deployment. Rules such as the maximum AGL altitude, safety separation, and local noise limits jointly shrink the feasible flight domain and the available action set, thereby increasing the difficulty of 3D path search and decision learning. To clarify, all quantities related to altitude in this study are defined as the height above ground level (AGL).

Accordingly, this study formulates the problem as a Markov decision process (MDP) within a reinforcement learning framework. Under this formulation, the state space, action space, and reward function constitute the core elements of the training interaction environment. Specifically, the state space explicitly encodes key information related to the UAV, the environment, and relevant operational constraints; the action space defines a discrete set of three-dimensional maneuvers; and the reward function provides step-wise feedback and maps operational constraints into learnable optimization signals. In UAV path-planning models, noise can be incorporated in different ways. For example, it may be imposed as a strict feasibility constraint, introduced as an explicit cost term in the overall objective function, or incorporated as a penalty term to guide learning. These different treatments correspond to different modeling priorities and may lead to different optimization behaviors. In this study, local noise limits are first treated as explicit constraints to ensure that the generated trajectories satisfy clear noise-compliance requirements in residential environments. At the same time, a noise penalty is further introduced into the reward design to provide additional guidance for learning within the feasible region. This treatment is adopted because the objective of this study is not to directly trade off noise compliance against other objectives, but rather to carry out path planning under mandatory noise constraints while still improving efficiency and safety. Furthermore, building on the multi-objective cost modeling, this study introduces potential-based reward shaping to transform time cost and safety risk into more continuous process rewards, thereby strengthening the guidance during learning. The following sections will further describe the above modeling components and design details.

2.1. Problem Modeling and Optimization Objectives

In this study, the optimization problem is clearly defined as generating a feasible and optimal three-dimensional flight path for UAVs used in logistics delivery in low-altitude urban environments under local noise limits, with the objective of minimizing time cost and safety risk. To achieve this, several constraints are considered, including noise thresholds, collision risk, maximum AGL (Above Ground Level), and safety distance. After completing the problem modeling, the core objective of this study is to utilize the proposed TNAP-DDQN algorithm to solve this specific constrained optimization problem, ensuring the optimal path is found in complex urban environments while balancing flight efficiency and safety. As shown in Figure 1, the optimization framework includes state-action space modeling, definition of the objective function and constraints, and the application of a deep reinforcement learning (DRL) algorithm to solve the problem.

2.2. State–Action Space Modeling

This study employs a three-dimensional grid-based representation to model static obstacles in low-altitude urban environments, as illustrated in Figure 2. Specifically, a cubic airspace of size $a \times b \times c$ is first defined, and a Cartesian coordinate system is established with the origin $O$. Considering both building heights and computational resource constraints, the grid resolution $Δ$ is determined, and the airspace is uniformly discretized into $n$ grid cells. Each grid cell is classified as either a free cell or an obstacle cell. To this end, an indicator variable $φ_{i}$ is introduced for the $i$-th grid cell located at coordinates $(x_{i}, y_{i}, z_{i})$: if the cell contains a static obstacle, then $φ_{i} = 0$, indicating an obstacle cell; otherwise, $φ_{i} = 1$, indicating a free cell.
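As a minimal sketch of this discretization (not the authors' implementation), the snippet below builds the indicator array from axis-aligned building boxes. The function name, the box format `(x_min, x_max, y_min, y_max, height)`, and the example dimensions are assumptions introduced only for illustration.

```python
import math

def build_occupancy_grid(size_m, resolution_m, buildings):
    """Return phi[ix][iy][iz]: 0 for obstacle cells, 1 for free cells.

    size_m: (a, b, c) airspace extent in metres.
    buildings: hypothetical list of (x_min, x_max, y_min, y_max, height) boxes in metres.
    """
    nx, ny, nz = (math.ceil(s / resolution_m) for s in size_m)
    # Start with every cell free (phi = 1).
    phi = [[[1] * nz for _ in range(ny)] for _ in range(nx)]
    for x0, x1, y0, y1, h in buildings:
        ix_lo, ix_hi = int(x0 // resolution_m), min(nx, math.ceil(x1 / resolution_m))
        iy_lo, iy_hi = int(y0 // resolution_m), min(ny, math.ceil(y1 / resolution_m))
        iz_hi = min(nz, math.ceil(h / resolution_m))
        # Any cell touched by the building footprint and height becomes an obstacle.
        for ix in range(ix_lo, ix_hi):
            for iy in range(iy_lo, iy_hi):
                for iz in range(iz_hi):
                    phi[ix][iy][iz] = 0
    return phi

# Illustrative values only: a 100 m x 100 m x 120 m airspace at 10 m resolution
# with one 35 m tall building.
phi = build_occupancy_grid((100.0, 100.0, 120.0), 10.0,
                           [(20.0, 40.0, 30.0, 50.0, 35.0)])
obstacle_count = sum(1 for plane in phi for row in plane for v in row if v == 0)
```

Marking a partially covered cell as an obstacle is a conservative choice made here; the source does not state how partially occupied cells are handled.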

The UAV action space $A$ is defined as the set of all feasible movement commands to adjacent grid cells, comprising 26 discrete actions in total (as shown in Figure 3). In the figure, the current grid cell (purple) and all reachable neighboring cells (gray) are used to provide an intuitive illustration of the action space.
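The 26-neighbor action set can be enumerated as every offset in {-1, 0, 1}^3 except the zero offset. The sketch below illustrates this under assumed names (`ACTIONS`, `feasible_actions`); filtering by grid bounds and free cells is one plausible reading of "feasible" and is an assumption here, not a detail confirmed by the text.

```python
from itertools import product

# All 3D neighbor offsets except "stay in place": 3**3 - 1 = 26 actions.
ACTIONS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

def feasible_actions(cell, phi):
    """Return the offsets from `cell` that stay inside the grid and land on a free cell.

    phi is a nested list with phi[ix][iy][iz] == 1 for free cells (assumed layout).
    """
    nx, ny, nz = len(phi), len(phi[0]), len(phi[0][0])
    x, y, z = cell
    moves = []
    for dx, dy, dz in ACTIONS:
        tx, ty, tz = x + dx, y + dy, z + dz
        inside = 0 <= tx < nx and 0 <= ty < ny and 0 <= tz < nz
        if inside and phi[tx][ty][tz] == 1:
            moves.append((dx, dy, dz))
    return moves

# Illustrative 3 x 3 x 3 grid with every cell free.
free_grid = [[[1] * 3 for _ in range(3)] for _ in range(3)]
center_moves = feasible_actions((1, 1, 1), free_grid)
corner_moves = feasible_actions((0, 0, 0), free_grid)
```

In the fully free example grid, the center cell keeps all 26 moves, while a corner cell keeps only the 7 moves that remain inside the grid.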

2.3. Multi-Objective Modeling and Potential-Based Reward Shaping

In this study, trajectory planning is formulated with flight time and safety risk as the core optimization objectives. Flight time directly determines mission execution efficiency, whereas safety risk reflects the potential threat posed by the UAV to the ground. These two objectives are often mutually constraining: risk avoidance typically increases flight time, while pursuing the shortest path may lead to elevated risk. By jointly optimizing both objectives, this study aims to identify an effective balance between efficiency and safety. The total flight time is calculated as follows:

C_{time} = \min \sum_{i = 1}^{n} \frac{\sqrt{(x_{i + 1} - x_{i})^{2} + (y_{i + 1} - y_{i})^{2} + (z_{i + 1} - z_{i})^{2}}}{v_{i}^{i + 1}}

(1)

v_{i}^{i + 1} = \begin{cases} v_{h}, & z_{i + 1} = z_{i} \\ v_{u}, & z_{i + 1} > z_{i} \\ v_{d}, & z_{i + 1} < z_{i} \end{cases} \quad i \in [1, n]

(2)

The variable $v_{i}^{i + 1}$ represents the UAV’s speed when transitioning from the $i$-th grid cell to the $(i + 1)$-th grid cell. The speed parameter is assumed to remain constant within each individual grid-cell transition and only takes different values according to the motion state. Specifically, $v_{h}$, $v_{u}$, and $v_{d}$ represent the UAV’s speed during level flight, climb, and descent, respectively.
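To make the time-cost computation concrete, the following sketch evaluates the sum in Equation (1) for a given waypoint sequence, selecting the segment speed by the sign of the altitude change as in Equation (2). Waypoints are assumed to be metric coordinates, and the speed values in the example are illustrative rather than taken from the paper.

```python
import math

def segment_speed(z_from, z_to, v_h, v_u, v_d):
    """Pick the speed for one grid transition based on the altitude change (Eq. 2)."""
    if z_to > z_from:
        return v_u  # climb
    if z_to < z_from:
        return v_d  # descent
    return v_h      # level flight

def flight_time(path, v_h, v_u, v_d):
    """Sum segment length / segment speed over consecutive waypoints (Eq. 1)."""
    total = 0.0
    for p, q in zip(path, path[1:]):
        total += math.dist(p, q) / segment_speed(p[2], q[2], v_h, v_u, v_d)
    return total

# Illustrative path: climb 10 m, fly level 10 m, descend 10 m.
path = [(0.0, 0.0, 0.0), (0.0, 0.0, 10.0), (10.0, 0.0, 10.0), (10.0, 0.0, 0.0)]
t = flight_time(path, v_h=10.0, v_u=5.0, v_d=2.5)  # 10/5 + 10/10 + 10/2.5
```

The `min` in Equation (1) is taken over candidate paths by the planner; this function only scores a single path.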

The safety-risk cost adopted in this study refers to the cost associated with pedestrian and ground-vehicle casualties caused by a UAV crash, and its unit is expressed as the number of fatalities per flight hour. This cost is denoted by $C_{safe} (x_{i}, y_{i}, z_{i})$ and consists of two components: the pedestrian fatality risk $c_{safe\_p} (x_{i}, y_{i}, z_{i})$ and the ground-vehicle occupant fatality risk $c_{safe\_v} (x_{i}, y_{i}, z_{i})$ [38,39]:

C_{safe} (x_{i}, y_{i}, z_{i}) = c_{safe\_p} (x_{i}, y_{i}, z_{i}) + c_{safe\_v} (x_{i}, y_{i}, z_{i})

(3)

Here, $C_{safe} (x_{i}, y_{i}, z_{i})$ denotes the safety-risk cost associated with the $i$-th grid cell at the 3D location $(x_{i}, y_{i}, z_{i})$. Specifically, $c_{safe\_p} (x_{i}, y_{i}, z_{i})$ and $c_{safe\_v} (x_{i}, y_{i}, z_{i})$ represent the safety-risk costs for ground pedestrians and vehicle occupants, respectively. The safety risk faced by ground pedestrians arises from a continuous process, i.e., UAV loss of control, collision with a pedestrian, and the subsequent injury. The corresponding risk cost is calculated as follows [39]:

c_{safe\_p} (x_{i}, y_{i}, z_{i}) = P_{UAV} \times N_{hit}^{p} (x_{i}, y_{i}, z_{i}) \times R_{p} (x_{i}, y_{i}, z_{i})

(4)

N_{hit}^{p} (x_{i}, y_{i}, z_{i}) = S_{hit} \times σ_{p} (x_{i}, y_{i})

(5)

R_{p} (x_{i}, y_{i}, z_{i}) = \frac{1}{1 + \sqrt{\frac{α}{β}} \left(\frac{β}{E_{imp}}\right)^{\frac{1}{4 S_{c} (x_{i}, y_{i}, z_{i})}}}

(6)

v = \sqrt{\frac{2 m g}{R_{1} S_{hit} ρ} \left(1 - e^{- \frac{h R_{1} S_{hit} ρ}{m}}\right)}

(7)

Equations (4)–(7) quantify the ground-pedestrian safety-risk cost associated with a UAV crash originating from the $i$-th grid cell. Following the risk-modeling framework in [39], the pedestrian risk cost is expressed as the product of the UAV system failure rate $P_{UAV}$, the number of affected pedestrians $N_{hit}^{p} (x_{i}, y_{i}, z_{i})$, and the corresponding fatality rate $R_{p} (x_{i}, y_{i}, z_{i})$, where the latter depends on both impact kinetic energy and the ground-shielding effect. The impact kinetic energy of the falling UAV is given by $E_{imp} = \frac{1}{2} m v^{2}$. In this formulation, the number of affected pedestrians is determined by the ground impact area $S_{hit}$ and the local population density $σ_{p}$, while the fatality model further incorporates the shielding coefficient $S_{c}$, the lethality energy-threshold parameters $α$ and $β$, the UAV mass $m$, the ground-impact velocity $v$, and the UAV AGL $h$. The model formulation and the associated parameter settings adopted in this study are both based on [39]. Specifically, $S_{c}$ is set to 0.5, $α$ and $β$ are set to $10^{6}$ J and 100 J, respectively, the aerodynamic drag coefficient $R_{1}$ is set to 0.3, and the air density $ρ$ is set to 1.225 kg/m$^{3}$. Similarly, the risk-cost model for vehicle occupants is also formulated following the framework in [39], as detailed below:

c_{safe\_v} (x_{i}, y_{i}, z_{i}) = P_{UAV} \times N_{hit}^{v} (x_{i}, y_{i}, z_{i}) \times R_{v} (x_{i}, y_{i}, z_{i})

(8)

N_{hit}^{v} (x_{i}, y_{i}, z_{i}) = S_{hit} \times σ_{v} (x_{i}, y_{i})

(9)

Here, Equation (8) computes the safety-risk cost for in-vehicle occupants resulting from a UAV crash originating from the $i$-th grid cell and impacting vehicles. This risk cost is still formulated as the product of the UAV system failure rate $P_{UAV}$, the number of affected vehicles $N_{hit}^{v} (x_{i}, y_{i}, z_{i})$, and the occupant fatality rate $R_{v} (x_{i}, y_{i}, z_{i})$. The number of affected vehicles is calculated by Equation (9), which depends on the impact coverage area $S_{hit}$ and the regional vehicle density $σ_{v} (x_{i}, y_{i})$. The occupant fatality rate $R_{v} (x_{i}, y_{i}, z_{i})$ is derived from the ratio of “annual average traffic fatalities” to “annual average traffic accidents”.
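The risk model in Equations (3)–(9) can be sketched numerically as below. The defaults for $S_{c}$, $α$, $β$, $R_{1}$, and $ρ$ follow the values stated above; the gravitational acceleration, UAV mass, impact area, failure rate, densities, and vehicle fatality rate in the example are placeholder assumptions, not values from the paper.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed; not stated in the text)

def impact_velocity(h, m, s_hit, r1=0.3, rho=1.225):
    """Ground-impact velocity of a UAV falling from AGL h with drag (Eq. 7)."""
    k = r1 * s_hit * rho
    return math.sqrt(2.0 * m * G / k * (1.0 - math.exp(-h * k / m)))

def fatality_rate(e_imp, s_c=0.5, alpha=1e6, beta=100.0):
    """Pedestrian fatality rate as a function of impact kinetic energy (Eq. 6)."""
    return 1.0 / (1.0 + math.sqrt(alpha / beta) * (beta / e_imp) ** (1.0 / (4.0 * s_c)))

def pedestrian_risk(h, m, s_hit, p_uav, sigma_p):
    """Failure rate x affected pedestrians x fatality rate (Eqs. 4-5)."""
    v = impact_velocity(h, m, s_hit)
    e_imp = 0.5 * m * v ** 2
    return p_uav * (s_hit * sigma_p) * fatality_rate(e_imp)

def vehicle_risk(s_hit, p_uav, sigma_v, r_v):
    """Failure rate x affected vehicles x occupant fatality rate (Eqs. 8-9)."""
    return p_uav * (s_hit * sigma_v) * r_v

def total_risk(h, m, s_hit, p_uav, sigma_p, sigma_v, r_v):
    """Sum of pedestrian and vehicle-occupant risk (Eq. 3)."""
    return pedestrian_risk(h, m, s_hit, p_uav, sigma_p) + vehicle_risk(s_hit, p_uav, sigma_v, r_v)

# Placeholder inputs: 5 kg UAV, 0.5 m^2 impact area, 1e-6 failures per flight hour.
risk_50 = pedestrian_risk(50.0, 5.0, 0.5, 1e-6, 0.01)
risk_120 = pedestrian_risk(120.0, 5.0, 0.5, 1e-6, 0.01)
```

With these placeholder inputs the pedestrian risk grows with AGL, because a higher fall yields a larger impact energy; the growth flattens as the velocity approaches its drag-limited terminal value.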

To alleviate the sparse-reward issue and provide continuous optimization guidance, this study adopts a potential-based reward shaping approach. Specifically, potential functions are constructed for the two core optimization objectives, i.e., flight time and safety risk, to quantify the objective-related cost at any intermediate state $S_{t}$. The flight-time cost and safety-risk cost are first normalized and then summed, without an additional weighting coefficient, to form the total potential function. This unweighted fusion is adopted because the main focus of this work is the design of a noise-constrained path planning method, rather than the calibration of value preferences between efficiency and safety. In practice, different weighting settings may reflect different operational priorities, and fixing one here would introduce an additional source of subjective bias. This mechanism transforms sparse and delayed terminal returns into dense signals that are directly coupled with step-wise decisions, thereby improving learning efficiency and policy stability. The total potential function is given below, followed by the designs of the two cost-type potential terms:

Φ (S_{t}) = \tilde{C}_{time} (S_{t}) + \tilde{C}_{safe} (S_{t})

(10)

Here, $\tilde{C}_{time} (S_{t})$ and $\tilde{C}_{safe} (S_{t})$ denote the normalized values of the corresponding metrics. The time-cost term is defined based on an estimate of the remaining flight time to the target. This estimate is obtained by computing the three-dimensional Euclidean distance $D_{curr\_goal}$ between the current state and the target and dividing it by the UAV average flight speed $v_{avg}$. The formulation is given as follows:

\hat{T} (S_{t}) = \frac{D_{curr\_goal}}{v_{avg}}

(11)

The normalized time-cost term $\tilde{C}_{time} (S_{t})$ is given by the following expression:

\tilde{C}_{time} (S_{t}) = \frac{\hat{T} (S_{t})}{1.2 \times T_{\max}}

(12)

Here, the normalization factor $T_{\max}$ is computed using the maximum possible flight distance in the 3D map (estimated by the diagonal length) and the UAV’s minimum climb speed, and a moderate margin is further introduced to cover detours in practical flight. This design ensures that the time-cost term remains on a stable numerical scale.

The safety-risk cost term is jointly determined by the UAV AGL and a shielding correction. Its value is computed using the safety-risk model in Equation (3) and scaled by a shielding factor $θ$ that reflects whether a building exists directly beneath the UAV: if no building is present beneath the UAV, then $θ = 1$; otherwise, $θ = 0.25$ [40].

C_{safe} (S_{t}) = θ \times C_{safe} (x_{i}, y_{i}, z_{i})

(13)

The normalized safety-risk cost is given by the following equation:

\tilde{C}_{safe} (S_{t}) = \frac{C_{safe} (S_{t})}{C_{safe} (S_{t}, R_{\max})}

(14)

Here, $C_{safe} (S_{t}, R_{\max})$ is the reference value computed by Equation (4) at the maximum allowable AGL altitude of 120 m, which is used to normalize the risk cost onto a stable numerical scale. Based on this, the potential-based shaped reward function adopted in this study is formulated as follows:

F (S_{t}, S_{t + 1}) = Φ (S_{t}) - Φ (S_{t + 1})

(15)
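Equations (10)–(15) can be sketched as follows. The function and argument names are assumptions, the risk values are passed in as precomputed numbers, and the example inputs are illustrative only. Because $Φ$ is a cost-type potential, a step that lowers the remaining cost yields a positive shaping reward.

```python
import math

def potential(state, goal, v_avg, t_max, c_safe, c_safe_ref, over_building):
    """Cost-type potential: normalized time cost plus normalized safety cost (Eq. 10)."""
    t_hat = math.dist(state, goal) / v_avg     # Eq. (11): remaining-time estimate
    c_time_norm = t_hat / (1.2 * t_max)        # Eq. (12): normalization with margin
    theta = 0.25 if over_building else 1.0     # shielding factor used in Eq. (13)
    c_safe_norm = theta * c_safe / c_safe_ref  # Eqs. (13)-(14)
    return c_time_norm + c_safe_norm

def shaping_reward(phi_t, phi_next):
    """Shaped reward F(S_t, S_{t+1}) = Phi(S_t) - Phi(S_{t+1}) (Eq. 15)."""
    return phi_t - phi_next

# Illustrative values: the goal is 100 m away and the UAV moves 10 m closer each step.
goal = (100.0, 0.0, 50.0)
states = [(0.0, 0.0, 50.0), (10.0, 0.0, 50.0), (20.0, 0.0, 50.0)]
phis = [potential(s, goal, v_avg=10.0, t_max=50.0,
                  c_safe=1e-7, c_safe_ref=2e-7, over_building=False) for s in states]
rewards = [shaping_reward(a, b) for a, b in zip(phis, phis[1:])]
```

The shaped rewards telescope along a trajectory: their sum equals the potential of the first state minus that of the last, so the shaping term rewards overall progress without depending on the intermediate detours.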

2.4. Reward Function Design

To meet practical requirements for urban UAV path planning and to ensure that the UAV can accomplish missions efficiently and safely in complex environments, the reward function is composed of five terms: (i) a baseline reward, which defines task progression and termination conditions; (ii) a collision penalty, which penalizes collision events; (iii) an altitude penalty, which constrains the AGL flight altitude; (iv) a goal-progress reward, which improves sample efficiency and accelerates convergence; and (v) a noise penalty, which accounts for noise-exposure control while optimizing path efficiency and safety.

(i): Baseline reward

The baseline reward mechanism is constructed as follows: a high reward is given when the agent reaches the goal state, indicating task completion; otherwise, a negative reward is assigned to encourage the UAV to reach the goal as quickly as possible. The specific formulation is given by the following equation:

R_{b a s e} = \{\begin{matrix} R_{g o a l} i f s t a t e = g o a l s t a t e \\ B R o t h e r w i s e \end{matrix}

(16)

Here,

R_{goal} = 100

represents the goal-reaching reward coefficient, and

BR = -0.1

denotes the baseline reward.

(ii): Collision penalty

The collision penalty function is used to impose a negative reward for behaviors that bring the UAV close to obstacles. Its formulation is as follows:

R_{obstacle} = \begin{cases} -5, & \text{if } D_{obstacle} \leq TH_{1} \\ -3, & \text{if } TH_{1} < D_{obstacle} \leq TH_{2} \\ -1, & \text{if } TH_{2} < D_{obstacle} \leq TH_{3} \\ 0, & \text{if } D_{obstacle} > TH_{3} \end{cases}

(17)

In this equation,

D_{obstacle}

denotes the Euclidean distance from the current state to the nearest obstacle, and

TH_{1}

,

TH_{2}

, and

TH_{3}

are distance thresholds used for risk grading, with values set to 1, 2, and 3 grid cells, respectively.

(iii): AGL penalty

The AGL penalty function constrains the UAV to fly within the desired AGL range. Its specific formulation is as follows:

R_{height} = \begin{cases} -2, & \text{if } z < IMH \\ 0, & \text{if } IMH \leq z \leq MHC \\ -4, & \text{if } z > MHC \end{cases}

(18)

In this equation,

z

represents the current AGL,

IMH = 50 m

is the minimum desired AGL [37], and

MHC = 120 m

is the maximum allowable AGL.
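The three rule-based terms in Equations (16)–(18) can be sketched as plain functions. This is a minimal illustration using the constants quoted in the text; the function and argument names are hypothetical and are not taken from the authors' implementation.

```python
# Constants quoted in the text; all names below are illustrative assumptions.
R_GOAL = 100.0                    # goal-reaching reward, Eq. (16)
BR = -0.1                         # per-step baseline reward, Eq. (16)
TH1, TH2, TH3 = 1.0, 2.0, 3.0     # obstacle-distance thresholds in grid cells, Eq. (17)
IMH, MHC = 50.0, 120.0            # minimum desired / maximum allowable AGL in metres, Eq. (18)


def base_reward(at_goal: bool) -> float:
    """Eq. (16): large reward at the goal, small step cost otherwise."""
    return R_GOAL if at_goal else BR


def obstacle_penalty(d_obstacle: float) -> float:
    """Eq. (17): graded penalty by distance (grid cells) to the nearest obstacle."""
    if d_obstacle <= TH1:
        return -5.0
    if d_obstacle <= TH2:
        return -3.0
    if d_obstacle <= TH3:
        return -1.0
    return 0.0


def height_penalty(z: float) -> float:
    """Eq. (18): penalize flying below IMH or above MHC."""
    if z < IMH:
        return -2.0
    if z > MHC:
        return -4.0
    return 0.0
```

Note that the penalty for exceeding the maximum AGL (−4) is twice the penalty for flying below the desired minimum (−2), so the upper limit is enforced more strongly than the lower one.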

(iv): Goal-progress reward

The goal-progress reward is calculated from the change in Euclidean distance between the UAV and the target. Specifically, the reward is positive when the UAV moves closer to the target between the previous step and the current step, and it grows with the distance gained. The specific formulation is given by the following equation:

R_{goal} = 4 \times ((D_{prev\_goal} - D_{curr\_goal}) / D_{step\_len})

(19)

In this equation,

D_{prev\_goal}

and

D_{curr\_goal}

represent the Euclidean distances from the previous state and the current state to the target, respectively, and

D_{step\_len}

denotes the actual displacement length between two adjacent steps.

D_{step\_len}

is used to normalize the distance change, thereby preventing bias in the reward magnitude caused by different action step sizes.
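A short sketch of Equation (19) shows the effect of the step-length normalization: a straight move and a diagonal move directly toward the target receive the same reward. The function name and the handling of a zero-length step are assumptions made for illustration.

```python
import math


def goal_progress_reward(prev_pos, curr_pos, goal, scale=4.0):
    """Eq. (19): distance gained toward the goal, normalized by the step length."""
    d_prev = math.dist(prev_pos, goal)
    d_curr = math.dist(curr_pos, goal)
    step_len = math.dist(prev_pos, curr_pos)
    if step_len == 0.0:
        return 0.0  # assumption: a zero-length step earns no progress reward
    return scale * (d_prev - d_curr) / step_len


# A unit axial step and a longer diagonal step, both heading straight at the goal.
axial = goal_progress_reward((0, 0, 0), (1, 0, 0), (10, 0, 0))
diagonal = goal_progress_reward((0, 0, 0), (1, 1, 0), (10, 10, 0))
```

Under this normalization the reward lies in the interval [−4, 4], reaching 4 for a step pointing directly at the target and −4 for a step pointing directly away from it.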

(v): Noise penalty

Considering that noise is a significant constraint for low-altitude urban flight, this study constructs the noise penalty function based on the noise propagation attenuation model [38,41]. Its formulation is as follows:

L_{p} (r) = L_{s} (r_{0}) - 10 \log_{10} (r^{2} / r_{0}^{2})

(20)

P_{noise} = -0.5 \times (L_{p} (r) - L_{T})

(21)

R_{noise} = \max (P_{noise}, -10)

(22)

In this formulation, Equation (20) is derived from the noise propagation attenuation model according to Ref. [41] (Equation (21) chapter 1). In this model,

L_{p} (r)

represents the sound pressure level at the receiver,

L_{s}

denotes the reference sound pressure level of the UAV noise source, and

r

is the propagation distance between the receiver and the noise source.

r_{0}

is the reference distance at which the source sound pressure level is specified, set to 1 m in this study. To better illustrate the relationship between the UAV, the noise source, and the receiver, a diagram (Figure 4) is provided.

According to the measurement data reported by relevant literature [42], the actual tested UAV was the Prioria Hex, with a takeoff mass of 7.3 kg, and the measured sound pressure level at the receiver was

L_{p} (r) = 65 dB

when the propagation distance from the noise source to the receiver was

r = 15 m

. Based on the attenuation relationship in Equation (20), this measured value is equivalently converted to the reference sound pressure level of the UAV noise source at

r_{0} = 1 m

, which is approximately

L_{s} (r_{0}) = 89 dB

, and used as the noise source for the calculations.

In principle, the acoustic energy of a sound source should be characterized by its sound power. However, since there is a one-to-one correspondence between sound pressure level and sound power level (

L_{w} = L_{p} (r) + 20 \lg (r) + 11

), this study will use the sound pressure level at 1 m,

L_{s} (r_{0} = 1 m)

, to discuss the different sound source energy levels. Additionally, it is assumed that the noise source is uniformly distributed and omnidirectional.
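The conversion of the measured value (65 dB at 15 m) to the 1 m reference level, and the equivalence with sound power level, can be checked numerically. The sketch below only applies Equation (20) and the relation L_w = L_p(r) + 20 lg(r) + 11 quoted above; the function names are illustrative.

```python
import math


def spl_at(r, l_ref, r0=1.0):
    """Eq. (20): receiver SPL at distance r given the level l_ref at distance r0."""
    return l_ref - 10.0 * math.log10(r**2 / r0**2)


def reference_level(l_measured, r, r0=1.0):
    """Eq. (20) inverted: level at r0 that produces l_measured at distance r."""
    return l_measured + 10.0 * math.log10(r**2 / r0**2)


def sound_power_level(l_p, r):
    """L_w = L_p(r) + 20 lg(r) + 11, as quoted in the text."""
    return l_p + 20.0 * math.log10(r) + 11.0


l_s = reference_level(65.0, 15.0)          # about 88.5 dB, rounded to 89 dB in the text
l_w_from_15m = sound_power_level(65.0, 15.0)
l_w_from_1m = sound_power_level(l_s, 1.0)  # same sound power level, about 99.5 dB
```

Both routes give the same sound power level, which is why the 1 m sound pressure level can stand in for source strength under the omnidirectional point-source assumption.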

L_{T}

denotes the local noise limit. According to the daytime limit for Class 1 areas specified in Table 1, “Equivalent Sound Level Limits for Environmental Noise in Urban Areas” [43],

L_{T}

was set to 55 dB. If no obstacles are present,

r = 120 m

is used; if obstacles exist,

r

is the Euclidean distance from the sound source to the nearest obstacle. Equations (21) and (22) are used for the noise penalty, where a penalty is imposed if the noise exceeds the local noise limit, whereas no penalty is applied when it remains below the local noise limit. To prevent the penalty from becoming too large and causing training instability, which would negatively impact path planning, the penalty term is capped at −10. Finally, the environmental baseline reward is summed with the multi-objective potential-based shaping terms described earlier to form the total reward function used for training. The specific formulation is as follows:

R_{total} = R_{base} + R_{obstacle} + R_{height} + R_{goal} + R_{noise} + F (S_{t}, S_{t + 1})

(23)
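A minimal sketch of the noise penalty in Equations (20)–(22) and the sum in Equation (23) follows. One point is an explicit assumption: Equation (22) as written bounds the penalty only from below, so the sketch also clamps the value at zero to match the statement that no penalty is applied below the local limit. The names and default values are illustrative.

```python
import math

L_T = 55.0             # local noise limit in dB
L_S = 89.0             # source SPL at the 1 m reference distance in dB
PENALTY_FLOOR = -10.0  # cap on the noise penalty, Eq. (22)


def receiver_spl(r, l_s=L_S, r0=1.0):
    """Eq. (20): SPL at propagation distance r."""
    return l_s - 10.0 * math.log10(r**2 / r0**2)


def noise_penalty(r, l_s=L_S, l_t=L_T):
    """Eqs. (21)-(22), with an assumed upper clamp at zero."""
    p = -0.5 * (receiver_spl(r, l_s) - l_t)  # Eq. (21)
    p = min(p, 0.0)                          # assumption: no bonus below the limit
    return max(p, PENALTY_FLOOR)             # Eq. (22)


def total_reward(r_base, r_obstacle, r_height, r_goal, r_noise, shaping):
    """Eq. (23): sum of the five reward terms and the shaping term F."""
    return r_base + r_obstacle + r_height + r_goal + r_noise + shaping
```

With the 89 dB source, a receiver 10 m away sees 69 dB and the penalty is −7, while at 120 m the level is about 47.4 dB and the penalty vanishes.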

3. Algorithm

3.1. Improved DDQN Algorithm Based on Noise Constraints

To address the challenges of high-dimensional state spaces, sparse rewards, and decision oscillations caused by approximately tied action values in three-dimensional urban low-altitude UAV path planning, this study proposes the TNAP-DDQN algorithm (Figure 5), based on the Double DQN framework [44]. TNAP stands for Threshold-Noise-Aware Path Planning. The method adopts a staged training strategy to improve convergence stability: the action-bias mechanism is not activated during the warm-up phase, while it is gradually introduced in the stable phase, alongside adjusted update settings and a low-frequency soft target-network update.

To explicitly incorporate noise constraints into the decision process and reduce oscillatory action switching, a Noise-Degradation-Mask-Based Action Bias Network (NDM-ABN) is introduced at the action-selection layer. This mechanism rapidly evaluates the noise compliance of candidate actions and masks actions that violate local noise limits. Compared with existing constrained reinforcement learning methods, the innovation of NDM-ABN lies in its hierarchical degradation mechanism, which represents a novel approach to action masking. The three-tier degradation strategy gradually relaxes the noise constraint’s strictness, avoiding action space collapse under overly strict noise constraints while ensuring training continuity.

Moreover, NDM-ABN introduces bias decisions for actions with the same value, enhancing action separability and stabilizing policy decisions. This innovation effectively addresses the issue of selecting actions with identical values, a challenge commonly encountered by standard exploration strategies like epsilon-greedy, and improves learning stability and exploration efficiency through refined bias adjustments.

Finally, to improve sample efficiency and the accuracy of long-term reward estimation, the algorithm introduces a Hybrid Experience Replay Mechanism, which samples at a preset ratio of 70:30. In total, 70% of the samples are drawn from the standard buffer, which stores single-step transition experiences, ensuring learning stability. The remaining 30% are drawn from the prioritized multi-step experience buffer, which stores experience trajectories with accumulated multi-step rewards. These samples are prioritized according to the absolute value of the temporal difference error (TD-error), allowing the algorithm to efficiently utilize high-information samples and accelerate the propagation of long-term rewards. This approach effectively balances local update stability and the effective dissemination of long-term rewards.
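The 70:30 mixing described above can be sketched as follows. This is an illustration only: the buffer representation, the function name, and the use of sampling with replacement for the prioritized part are assumptions, and both buffers are assumed to be non-empty.

```python
import random


def sample_hybrid(standard_buffer, multistep_buffer, priorities,
                  batch_size=32, standard_ratio=0.7, rng=None):
    """Draw a mixed batch: uniform single-step samples plus prioritized multi-step samples."""
    rng = rng or random.Random()
    n_std = min(round(batch_size * standard_ratio), len(standard_buffer))
    n_multi = batch_size - n_std
    # Uniform sampling without replacement from the single-step buffer.
    uniform_part = rng.sample(standard_buffer, n_std)
    # Priority-weighted sampling (with replacement) from the multi-step buffer.
    prioritized_part = rng.choices(multistep_buffer, weights=priorities, k=n_multi)
    return uniform_part, prioritized_part


single_step = list(range(100))
multi_step = ["traj_a", "traj_b", "traj_c"]
uniform_part, prioritized_part = sample_hybrid(
    single_step, multi_step, [1.0, 2.0, 3.0], batch_size=10, rng=random.Random(0)
)
```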

3.2. Noise-Aware Action Filtering and Bias-Based Decision Mechanism

In three-dimensional grid-based path planning, where actions must simultaneously satisfy both geometric feasibility and noise constraints, this study performs a two-stage filtering process on the 26 discrete actions at each decision step. First, actions that lead to out-of-bounds or collision scenarios are eliminated. Then, using a precomputed building distance transformation table, the sound pressure levels corresponding to the next states of each action are quickly estimated, and actions that exceed the local noise limit are masked. This process confines the search space within the feasible set under dual constraints. The method adopts a staged training strategy to improve convergence stability. During the warm-up phase, the action-bias mechanism is not activated so as to avoid amplifying value-estimation errors before the value network becomes sufficiently stable. In the stable phase, the action-bias mechanism is gradually introduced, together with adjusted update settings and a low-frequency soft target-network update, to enhance training controllability and convergence stability.

To explicitly incorporate noise constraints into the decision process, a noise-aware action-filtering mechanism is first introduced at the action-selection layer. Candidate actions that violate local noise limits are directly masked out. To avoid action-space collapse under overly strict noise thresholds, a hierarchical degradation mechanism is further employed. When the number of feasible actions falls below a predefined lower bound, a more relaxed feasibility-protection threshold is activated; if the feasible set remains insufficient, the filtering process is further degraded to retain only geometrically feasible actions. Through this design, decision continuity can be preserved under complex urban constraints and highly restrictive noise conditions.

After action filtering, approximate ties may still occur among the Q-values of the remaining candidate actions, which may lead to random switching and trajectory oscillation during policy execution. To alleviate this issue, the proposed Noise-Degradation-Mask-Based Action Bias Network (NDM-ABN) is introduced as a tie-breaking module. Specifically, the bias term is applied only within the approximately tied candidate set, whereas the original greedy selection is retained when no such tie exists. Therefore, this module serves as a local refinement of action selection rather than a global reordering of all candidate actions. Since it operates only at the decision layer and does not alter the underlying Bellman update, the fundamental value-learning structure of Double DQN remains unchanged.

The value of masked actions is defined as follows:

{\tilde{Q}}_{main} (s_{t}, a_{t}) = \begin{cases} Q_{main} (s_{t}, a_{t}), & m_{t} (a) = 1 \\ - \infty, & m_{t} (a) = 0 \end{cases}

(24)

Here,

{\tilde{Q}}_{m a i n} (s_{t}, a_{t})

represents the value estimation of the output action

a_{t}

by the main network at state

s_{t}

.

m_{t} (a)

is used to differentiate between valid and invalid actions. However, under the combined influence of multiple constraints, the candidate actions retained after mask-based filtering often exhibit very similar value estimates, resulting in an approximate tie phenomenon. In such cases, if the standard greedy selection strategy is still applied directly, even small numerical fluctuations may cause frequent switching between actions at adjacent decision steps, thereby leading to local path jitter or even oscillations in the overall trajectory. To address this issue, the NDM-ABN is introduced into the decision-making process for approximate-tie actions. First, the candidate action set corresponding to approximate ties is defined as follows:

C_{t} = \{a ∣ m_{t} (a) = 1, |{\tilde{Q}}_{main} (s_{t}, a) - {\tilde{Q}}_{\max}| < ε\}

(25)

where

{\tilde{Q}}_{\max}

denotes the maximum value among the currently feasible actions, and

ε

represents the threshold used to determine approximate ties. In this study,

ε

is set to 0.0001. Accordingly, when only one action in the candidate set is significantly superior, the algorithm still follows the original greedy selection strategy directly. Only when multiple actions fall into an approximate-tie state does the bias network participate in the decision scoring process, thereby avoiding unnecessary interference with non-tie samples.
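Equations (24) and (25) can be sketched with plain lists, assuming the Q-values and the binary mask are already available for the current state. The function names are illustrative.

```python
import math


def mask_q_values(q_values, mask):
    """Eq. (24): keep Q for valid actions, set invalid actions to -infinity."""
    return [q if m else -math.inf for q, m in zip(q_values, mask)]


def approximate_tie_set(masked_q, eps=1e-4):
    """Eq. (25): valid actions whose value is within eps of the best valid value."""
    q_max = max(masked_q)
    if q_max == -math.inf:
        return []  # every action is masked
    return [a for a, q in enumerate(masked_q) if q_max - q < eps]


masked = mask_q_values([1.0, 1.00005, 0.5, 2.0], [1, 1, 1, 0])
ties = approximate_tie_set(masked)
```

In this example action 3 has the highest raw value but is masked, and actions 0 and 1 differ by less than ε, so both enter the tie set.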

To explicitly incorporate noise constraints into action-level decision making, the predicted next state of each candidate action is further evaluated in terms of noise, which is formulated as follows:

L (a_{t}) = L ({s^{'}}_{t}) = L_{p} (r; L_{w}, env), r = d ({s^{'}}_{t}, d ({s^{'}}_{t}))

(26)

Here,

d ({s^{'}}_{t})

represents the building receiver point closest to the UAV’s next state,

d ({s^{'}}_{t}, d ({s^{'}}_{t}))

denotes the propagation distance between the UAV and this receiver point,

env

represents the urban environment, and

L (a_{t})

indicates the sound pressure level at the receiver point. To avoid all actions being masked due to strict noise thresholds during takeoff or landing, the algorithm introduces a buffer zone mechanism: if the next state is within the buffer radius of the starting or ending point, the threshold constraint is temporarily relaxed; otherwise, strict threshold assessment is applied. The specific computation is as follows:

τ ({s^{'}}_{t}) = \begin{cases} L_{n}, & d_{start} ({s^{'}}_{t}) \leq R_{b} \text{ or } d_{goal} ({s^{'}}_{t}) \leq R_{b} \\ L_{T}, & \text{otherwise} \end{cases}

(27)

Here,

L_{T}

is the noise threshold,

L_{n} = 65 dB

is the buffer threshold, and

R_{b} = 5

is the buffer radius defined based on unit grid cells. The action filtering mechanism first checks whether the next state is within the buffer zone. It then determines the validity of the action based on the corresponding threshold.
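The buffer-zone rule in Equation (27) can be sketched as below. The use of Euclidean distance in grid cells and the non-strict comparison in the compliance check are assumptions; the text specifies only the thresholds and the radius.

```python
import math

L_T = 55.0  # strict noise threshold in dB
L_N = 65.0  # relaxed threshold inside the buffer zone in dB
R_B = 5.0   # buffer radius in grid cells


def noise_threshold(next_cell, start_cell, goal_cell, l_t=L_T, l_n=L_N, r_b=R_B):
    """Eq. (27): relaxed threshold near the start or goal, strict threshold elsewhere."""
    near_start = math.dist(next_cell, start_cell) <= r_b  # assumption: Euclidean metric
    near_goal = math.dist(next_cell, goal_cell) <= r_b
    return l_n if (near_start or near_goal) else l_t


def is_noise_compliant(spl, next_cell, start_cell, goal_cell):
    """An action is kept if the predicted SPL does not exceed the applicable threshold."""
    return spl <= noise_threshold(next_cell, start_cell, goal_cell)


start, goal = (0, 0, 0), (50, 50, 5)
```

A predicted level of 60 dB is therefore accepted within five cells of the start or goal and rejected elsewhere.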

To prevent situations where no valid actions are available due to overly strict noise constraints in obstacle-dense areas, this study introduces a minimum valid action count constraint

N_{\min} = 1

and a feasibility guarantee threshold

F P T = 60 dB

. If

| C_{t}^{(h)} | \geq N_{\min}

is met, subsequent decisions are made directly within the feasible region

C_{t}^{(h)}

; otherwise, the feasibility guarantee threshold is activated to ensure policy continuity. The final output is the set of valid actions selected based on the feasibility guarantee threshold. Accordingly, the final decision candidate set is defined as follows:

{C^{'}}_{t} = \begin{cases} C_{t}^{(h)}, & | C_{t}^{(h)} | \geq N_{\min} \\ C_{t}^{(s)}, & | C_{t}^{(h)} | < N_{\min} \text{ and } | C_{t}^{(s)} | \geq N_{\min} \\ C_{t}^{(0)}, & | C_{t}^{(s)} | < N_{\min} \end{cases}

(28)

where

C_{t}^{(h)}

,

C_{t}^{(s)}

, and

C_{t}^{(0)}

denote the action sets satisfying the strict noise threshold, the feasibility-guarantee threshold, and the geometric feasibility condition only, respectively, and

N_{\min}

represents the minimum number of valid actions. Through this hierarchical degradation mechanism, the algorithm can effectively prevent the action set from becoming completely empty in complex urban environments.
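The three-tier fallback can be sketched as follows, assuming the geometrically feasible actions and a predicted SPL for each are already known. The dictionary `spl_of`, the function name, and the returned tier label are hypothetical conveniences for illustration.

```python
def select_candidate_set(geometric_actions, spl_of, strict_threshold,
                         fpt=60.0, n_min=1):
    """Return the candidate action set and the tier that produced it."""
    geometric = list(geometric_actions)                              # C^(0)
    strict = [a for a in geometric if spl_of[a] <= strict_threshold]  # C^(h)
    if len(strict) >= n_min:
        return strict, "strict"
    relaxed = [a for a in geometric if spl_of[a] <= fpt]             # C^(s)
    if len(relaxed) >= n_min:
        return relaxed, "relaxed"
    # Last resort: keep every geometrically feasible action.
    return geometric, "geometric"


tier_strict = select_candidate_set([0, 1, 2], {0: 50.0, 1: 62.0, 2: 70.0}, 55.0)
tier_relaxed = select_candidate_set([0, 1, 2], {0: 58.0, 1: 62.0, 2: 70.0}, 55.0)
tier_geometric = select_candidate_set([0, 1, 2], {0: 61.0, 1: 62.0, 2: 70.0}, 55.0)
```

Because each tier is a superset of the previous one, the returned set is non-empty whenever at least one geometrically feasible action exists.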

For the approximate-tie actions within the final candidate set

C_{t}^{'}

, this study further introduces the action-bias term generated by the NDM-ABN to locally adjust the main-network value, and the decision score is constructed as

Q^{dec} (s_{t}, a_{t}) = {\tilde{Q}}_{main} (s_{t}, a_{t}) + α \cdot {\hat{a}}_{t}

(29)

Here,

α = 0.02

is the scaling factor used to adjust the impact of the bias on the

{\tilde{Q}}_{m a i n} (s_{t}, a_{t})

value.

{\hat{a}}_{t}

is the bias value generated by the action attention network.
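The tie-breaking rule in Equation (29) can be sketched as below. The bias values are passed in as a plain list standing in for the network output, which is an assumption for illustration; the sketch also assumes at least one action is unmasked, as guaranteed by the degradation mechanism.

```python
def choose_action(masked_q, bias, alpha=0.02, eps=1e-4):
    """Greedy choice, with the bias term applied only among approximately tied actions."""
    q_max = max(masked_q)
    ties = [a for a, q in enumerate(masked_q) if q_max - q < eps]
    if len(ties) <= 1:
        return masked_q.index(q_max)  # no tie: plain greedy selection
    # Eq. (29): decision score = masked Q-value + alpha * bias, within the tie set only.
    return max(ties, key=lambda a: masked_q[a] + alpha * bias[a])


tied_choice = choose_action([1.0, 1.00005, 0.5], [0.9, 0.1, 0.5])
clear_choice = choose_action([1.0, 2.0, 0.5], [9.0, 0.0, 9.0])
```

In the first call actions 0 and 1 are approximately tied, so the larger bias on action 0 decides the outcome; in the second call action 1 is clearly best and the bias is ignored.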

In addition to the decision stage, the bias network also needs to learn how to generate effective bias terms during training. To this end, the temporal-difference target is adopted in the training stage to construct the supervisory signal:

Q_{target} (s_{t}, a_{t}) = R_{t} + γ \cdot Q_{target} ({s^{'}}_{t}, {a^{'}}_{t})

(30)

where

R_{t}

denotes the immediate reward,

γ

is the discount factor, and

Q_{target} ({s^{'}}_{t}, {a^{'}}_{t})

represents the value estimate of the next state–action pair given by the target network. Based on this target value, the parameters of the bias network are updated through error backpropagation, enabling the output bias term to gradually acquire discriminative capability in terms of long-term return.

3.3. Prioritized Multi-Step Experience Replay Mechanism

To enhance the sample efficiency and stability of Double DQN in three-dimensional path planning, this study introduces the Prioritized n-Step Replay Mechanism (see Figure 6). This mechanism accelerates the propagation of long-term feedback through (n)-step accumulated rewards and improves the utilization of high TD-error samples by incorporating Prioritized Experience Replay (PER). Additionally, Importance Sampling (IS) weights are used to correct the bias introduced by non-uniform sampling, further improving training performance. It should be noted that, although PER can significantly improve sample efficiency by replaying informative transitions more frequently, it also changes the sampling distribution from uniform replay to a priority-dependent distribution. As a result, the sampled mini-batch no longer strictly follows the original experience distribution, which may introduce estimation bias into value updates and, in principle, affect the stability and convergence behavior of policy learning if no correction is applied.

This study adopts the following (n)-step accumulated reward definition to accelerate the propagation of long-term reward signals:

R_{t}^{(n)} = \sum_{k = 0}^{n - 1} γ^{k} \operatorname{clip} (r_{t + k}), \quad \operatorname{clip} (r_{t + k}) \in [- R, R]

(31)

Here,

γ = 0.95

is the discount factor adopted in this study, and

c l i p (r_{t + k})

denotes reward clipping with the threshold R set to 50, which suppresses extreme rewards and stabilizes training. In the

n

-step TD target, the action is selected by the main network and its value is evaluated by the target network. The target is defined as follows:

y_{t}^{(n)} = \begin{cases} R_{t}^{(n)}, & \text{if } s_{t + n} \text{ is terminal} \\ R_{t}^{(n)} + γ^{n} \cdot Q_{target} (s_{t + n}, \arg \max_{a} Q_{main} (s_{t + n}, a)), & \text{otherwise} \end{cases}

(32)

Here,

Q_{target}

is the target network,

Q_{main}

is the main network, and

s_{t + n}

is the state after

n

steps.

\arg \max_{a} Q_{main} (s_{t + n}, a)

is used to select the greedy action at the state reached after n steps, which is then evaluated by the target network to mitigate Q-value overestimation. The TD error used for prioritized replay is defined as

δ_{n} = y_{t}^{(n)} - Q_{main} (s_{t}, a_{t})

(33)
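Equations (31)–(33) can be sketched with plain lists of rewards and Q-values. The function names are illustrative, and the Q-value lists stand in for network outputs at the state reached after n steps.

```python
def clip(r, bound=50.0):
    """Clip a single reward to [-bound, bound]."""
    return max(-bound, min(bound, r))


def n_step_return(rewards, gamma=0.95, bound=50.0):
    """Eq. (31): discounted sum of clipped rewards over n = len(rewards) steps."""
    return sum((gamma ** k) * clip(r, bound) for k, r in enumerate(rewards))


def n_step_target(rewards, q_main_next, q_target_next, terminal,
                  gamma=0.95, bound=50.0):
    """Eq. (32): Double DQN target; main network selects, target network evaluates."""
    g = n_step_return(rewards, gamma, bound)
    if terminal:
        return g
    n = len(rewards)
    a_star = max(range(len(q_main_next)), key=q_main_next.__getitem__)
    return g + (gamma ** n) * q_target_next[a_star]


def td_error(target, q_main_sa):
    """Eq. (33): n-step TD error used as the replay priority signal."""
    return target - q_main_sa


# Illustrative numbers with gamma = 0.5 so the arithmetic is easy to follow.
g = n_step_return([1.0, 100.0], gamma=0.5)  # 1 + 0.5 * 50 = 26
y = n_step_target([1.0, 100.0], [0.1, 0.9], [5.0, 2.0], terminal=False, gamma=0.5)
```

The second reward is clipped from 100 to 50 before discounting, and the bootstrap term uses the target-network value of the action preferred by the main network (index 1), not the target network's own maximum.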

To improve the utilization of high-value samples, Prioritized Experience Replay (PER) assigns sampling probabilities based on the Temporal Difference (TD) error, giving samples with larger errors higher sampling priority. The priority

p_{i}

in PER is defined as follows:

p_{i} = {(| δ_{n} | + ϵ)}^{α}

(34)

Here, the constant

ϵ = 0.01

is used to prevent zero priority, and the exponent

α = 0.6

is used to adjust the level of prioritization. A larger

α

leads to a stronger deviation from uniform sampling, which increases replay efficiency for informative samples but may also aggravate sampling bias. Therefore, the prioritization mechanism improves learning efficiency at the cost of introducing a distribution mismatch between replayed samples and the original behavior distribution. Based on the above priority, the sampling probability

P (i)

of a sample is given by

P (i) = \frac{p_{i}}{\sum_{j = 1}^{N} p_{j}}

(35)

Here,

N

represents the total number of samples in the experience pool. To correct the estimation bias introduced by non-uniform sampling in Prioritized Experience Replay (PER), this study employs Importance Sampling (IS) for weighted correction. Specifically, for a sample

i

, its Importance Sampling weight

w_{i}

is defined as follows:

\begin{matrix} β = \min (1.0, β + β_{increment}) \\ w_{i} = {(\frac{1}{N \cdot P (i)})}^{β} \end{matrix}

(36)

Here,

β = 0.4

is the initial value of the bias correction exponent, and

β_{increment} = 0.01

is the bias correction increment. The importance sampling weight is used to reduce the estimation bias introduced by non-uniform sampling. Specifically, when

β

gradually approaches 1 during training, the update becomes closer to an unbiased estimate, thereby alleviating the potential adverse effect of PER-induced sampling bias on convergence in the later training stage. In this way, the proposed framework seeks to balance the sample-efficiency advantage of PER and the training-stability requirement of Double DQN. To prevent excessively large weights from causing gradient instability, the weights are normalized. The specific formula is as follows:

{\tilde{w}}_{i} = \frac{w_{i}}{\max_{j} w_{j}}

(37)

Here,

w_{i}

is the computed IS weight,

{\tilde{w}}_{i}

is the normalized IS weight, and

\max_{j} w_{j}

is the maximum weight among all samples. The normalization step is further used to avoid excessively large correction terms, which could otherwise amplify gradient fluctuations and reduce training stability. Therefore, the IS weighting and normalization together serve as the main bias-compensation mechanism in the PER framework adopted in this study. Additionally, to optimize the balance between exploration and exploitation, this study employs an adaptive (n)-step mechanism: the step size is dynamically adjusted based on the real-time distance between the UAV and the target. When the distance is large, the step size is increased to accelerate exploration, and as the UAV approaches the target, the step size is decreased to facilitate fine control. The specific adjustment formula is as follows:

n_{step} = \max (n_{step\_min}, n_{step\_min} + norm\_distance \times (n_{step\_max} - n_{step\_min}))

(38)

Here,

n_{step\_min} = 2

and

n_{step\_max} = 3

represent the minimum and maximum step sizes, respectively, and

norm\_distance

is the normalized distance from the current state to the target. This formula allows the step size to be adaptively adjusted based on the task progress (distance to the target).

4. Experiments and Evaluation

4.1. Simulation Environment Setup

Key training parameters are summarized in Table 1. The experimental environment was established based on a typical urban area in Chengdu to represent the operational context and noise-related impacts of low-altitude urban flight. Geographic information was obtained from Baidu Maps, and a three-dimensional static environment model was constructed using a grid-based method, with a grid resolution of 10 m [45] and an airspace altitude range of 0–150 m. In this study, the ground surface within the selected Chengdu study area is assumed to be flat, i.e., the ground elevation relative to mean sea level is treated as identical throughout the entire area. The distribution of urban buildings and the constructed static environment model are presented in Figure 7. The key training parameters and UAV system parameters are provided in Appendix B (Table A1 and Table A2), where the UAV configuration was determined based on relevant literature [42,46] and a specific DJI model.

In this study, obstacles are not modeled as convex hulls. Compared with convex-hull-based modeling, the grid representation is more convenient for obstacle description and collision checking, whereas convex hull methods require additional geometric preprocessing and increase computational complexity in the three-dimensional environment. Therefore, a grid-based obstacle modeling method is adopted in this study to balance modeling practicality and training efficiency.

Based on the constructed safety risk model, the safety risk in a low-altitude environment is calculated. The road network density and vehicle ownership data for Chengdu’s administrative roads are obtained from relevant reports such as the “2025 China Urban Road Network Density and Operational Status Monitoring Report.” From this, the vehicle density per unit grid in the road network is estimated to be 0.622 vehicles/m². The population of Chengdu is obtained from sources such as the Chengdu Statistics Bureau website, where the total population is 15.98 million and the total area is 14,335 square kilometers, resulting in a population density of 0.027875 people/m² per unit grid area. Additionally, based on data from the National Bureau of Statistics, the annual average number of traffic fatalities and the annual number of traffic accidents are used to calculate the average number of fatalities per accident, which is approximately 0.228 fatalities per accident.

4.2. Algorithm Comparison Experiment

To systematically assess the impact of different improvement modules on UAV path planning performance, this study employs a stepwise module-comparison design. Starting with the basic Double DQN model, while keeping the environment, network structure, and hyperparameters consistent, the Noise Penalty, NDM, ABN, MSL, and PER modules are sequentially introduced. By constructing progressively enhanced comparison models, the influence of each module on policy learning can be clearly identified. Table 1 shows the performance changes in the path after each module is added, and Figure 8 presents the time-series curve of the sound pressure level at the nearest receiver points along the path for each model.

As shown in Table 1, the results reflect a progressive module-addition process, in which each module is introduced on top of the previous configuration. The baseline DDQN shows the worst noise performance, with the highest average SPL of 60.85 dB and the highest maximum SPL at the nearest building façade of 69 dB. Here, the maximum façade SPL is evaluated only during the steady-flight phase, excluding takeoff and landing. After adding the noise penalty, the average SPL decreases only slightly to 60.17 dB, while the maximum façade SPL remains 69 dB, indicating that the reward-level noise penalty alone has limited effect. After introducing NDM, the average SPL, safety risk, and maximum façade SPL all decrease noticeably, showing that this module can effectively suppress high-noise actions.

With the further addition of ABN and MSL, the average SPL remains at a relatively low level of about 52 dB, while flight time and safety are further improved. Among them, MSL achieves the lowest average SPL of 52.11 dB. After incorporating PER, the flight time is further reduced to 195.37 s, which is the best among all configurations, while both the average SPL and maximum façade SPL remain low. Overall, the final enhanced model achieves a good balance among flight efficiency, noise control, and safety.

Figure 8 further reflects the differences in noise compliance among the ablation models. The baseline DDQN remains above the 55 dB threshold over most of the trajectory and exhibits several pronounced peaks, indicating poor noise control. After adding the noise penalty, some high-noise segments are reduced, but repeated threshold exceedances are still observed in the middle section of the trajectory, suggesting that reward-level penalization alone is insufficient to achieve stable noise control.

After introducing NDM, the SPL curve decreases significantly and becomes smoother, with most segments remaining below the threshold, indicating that this module can effectively improve noise-compliance stability. With the further addition of ABN and MSL, the curve becomes more stable overall, showing that these modules further enhance the continuity and consistency of low-noise decisions. After incorporating PER, although slight local fluctuations appear in the final stage, the overall performance is still clearly better than that of the baseline and the noise-penalty-only model.

It should also be noted that, because different modules generate different final trajectories, the SPL data are no longer updated once the path reaches the target point in the later stage. As a result, some areas in the figure show missing SPL data.

Figure 9 compares the path-planning behaviors of different ablation models in an urban environment. As noise-related modules are gradually introduced, clear changes can be observed in both the horizontal trajectory and altitude distribution. Figure 9a shows that, after noise constraints are introduced, the models tend to avoid dense building areas in the horizontal plane. Figure 9b shows that the models generally begin climbing earlier and maintain higher flight altitudes to reduce noise exposure near buildings.

For the baseline DDQN, the path is relatively direct, with only slight deviations near building boundaries, and the flight altitude is mainly maintained at 50–60 m, indicating that it mainly relies on local obstacle avoidance. After introducing the Noise Penalty, the cruising altitude increases to about 60–70 m, suggesting that the model begins to reduce noise exposure by increasing altitude.

With the further addition of NDM, ABN, and MSL, the preference for low-noise regions becomes more evident, and the cruising altitude increases overall. After incorporating PER, the horizontal path becomes smoother and the altitude remains at about 90–100 m, showing better continuity and coordination.

Overall, with the progressive addition of noise-related modules, the path-planning strategy gradually evolves from low-altitude direct obstacle avoidance to a noise-constrained planning mode that combines horizontal detours and altitude compensation, thereby achieving a better balance among noise control, trajectory continuity, and operational efficiency.

Figure 10 shows the evolution of the total reward per episode during training. Overall, all models exhibit a rapid reward increase in the early stage and then gradually become stable, indicating that all models can learn feasible policies. However, their convergence behaviors differ clearly. The baseline DDQN and the Noise Penalty model show relatively large fluctuations in the early and middle stages, indicating weaker training stability.

After introducing NDM, the reward curve becomes noticeably smoother and the model enters the stable stage earlier, showing that this module can effectively reduce training oscillations. With the further addition of ABN, the reward evolution becomes more stable, indicating that it helps improve action-selection consistency and training stability. After incorporating MSL, the reward evolution becomes more continuous, suggesting that multi-step return modeling enhances long-term reward estimation. Finally, after adding PER, the model shows the most stable behavior in the late training stage, indicating that prioritized experience replay further improves training stability and sample utilization efficiency.

Based on the analysis of each module’s contribution in the ablation experiment, this study further conducts an algorithm comparison experiment. Figure 11 presents the path comparison results of TNAP-DDQN, B-APF-DQN [47], S-JPS [48], and Dueling-DQN [49], while Table 2 summarizes the performance of each algorithm.

TNAP-DDQN has a flight time of 195.37 s, which is 13.64%, 13.39%, and 19.63% longer than that of B-APF-DQN, S-JPS, and Dueling-DQN, respectively, indicating that it adopts a more conservative trajectory strategy under explicit noise constraints and therefore incurs additional time cost.

However, this time cost leads to better noise-control performance. The average SPL of TNAP-DDQN is 52.66 dB, which is 6.38 dB, 6.19 dB, and 4.19 dB lower than that of B-APF-DQN, S-JPS, and Dueling-DQN, respectively. Its maximum SPL at the nearest building façade is also the lowest, at 54.38 dB. Here, the maximum façade SPL is evaluated only during the steady-flight phase, excluding takeoff and landing. These results indicate that TNAP-DDQN can more effectively reduce noise exposure during flight.

In terms of safety, the safety-risk value of TNAP-DDQN is 2.3564 × 10⁻⁴, which is 9.57%, 18.89%, and 5.56% lower than that of B-APF-DQN, S-JPS, and Dueling-DQN, respectively. Overall, TNAP-DDQN shows better noise-control performance and lower safety risk than the compared algorithms. Although the differences in computation time are small, clear differences still exist in flight time, noise, and safety metrics.

Figure 12 shows the time-series results of the sound pressure level (SPL) at the nearest receiver points along the path for each algorithm, highlighting the differences in noise exposure across the planned trajectories. Overall, the SPL curve of TNAP-DDQN is the lowest and most stable. For most of the route, its SPL remains clearly below the 55 dB threshold, indicating that the proposed method can more effectively incorporate noise constraints into path planning and maintain a stable low-noise state along the trajectory.

In contrast, the SPL curves of B-APF-DQN, S-JPS, and Dueling-DQN are generally above the threshold for most of the cruise phase, with only a few intervals close to the threshold. These algorithms are less effective in maintaining stable noise compliance and show more pronounced noise peaks in certain intervals.

All algorithms exhibit an instantaneous SPL increase during the takeoff and climb phases, caused by the transient effect of the changing distance between the UAV and the receiver, rather than a reflection of noise control during the cruise phase. The advantage of TNAP-DDQN is that it maintains lower and more stable SPLs during the cruise phase, avoiding prolonged periods above the threshold seen in other algorithms.

It should be noted that due to the different paths generated by each algorithm, some algorithms do not have SPL data in the later stages of the path. This reflects performance differences across algorithms.

4.3. Comparison and Analysis of Paths with Different Noise Constraints

To further investigate the impact of noise constraints on urban low-altitude UAV path planning and to analyze the required AGL under different UAV source noise levels, comparative experiments were conducted by adjusting the local noise limits and the source noise intensity while keeping the environmental parameters unchanged. Based on the preceding analysis, the UAV noise source was modeled as a point source in the subsequent experiments, and a reference sound pressure level of approximately 89 dB at a reference distance of 1 m was adopted as the baseline. On this basis, to compare the influence of different source noise levels on path-planning results, the UAV source noise level was discretized at 3 dB intervals, ranging from 65 dB to 95 dB. These discrete values were used as the parameter $L_{s} (r_{0})$ in Equation (20) to compute the received sound level $L_{p} (r)$ at an arbitrary distance $r$.
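To illustrate how the discretized source levels enter the propagation model, the following minimal sketch assumes that Equation (20) takes the standard free-field spherical-spreading form $L_{p} (r) = L_{s} (r_{0}) - 20 \log_{10} (r / r_{0})$. The function names are hypothetical, and building shielding, reflections, and atmospheric absorption are not modeled.

```python
import math

R0 = 1.0  # reference distance in metres (from the text)

def received_spl(source_level_db: float, r: float, r0: float = R0) -> float:
    """Received level L_p(r) under the assumed spherical-spreading law."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return source_level_db - 20.0 * math.log10(r / r0)

def min_compliant_distance(source_level_db: float, limit_db: float, r0: float = R0) -> float:
    """Smallest source-receiver distance at which L_p(r) equals limit_db."""
    return r0 * 10.0 ** ((source_level_db - limit_db) / 20.0)

# Source levels discretized at 3 dB intervals from 65 dB to 95 dB, as in the text.
SOURCE_LEVELS_DB = list(range(65, 96, 3))

if __name__ == "__main__":
    for ls in SOURCE_LEVELS_DB:
        d55 = min_compliant_distance(ls, 55.0)
        d50 = min_compliant_distance(ls, 50.0)
        print(f"L_s = {ls} dB: r >= {d55:6.1f} m (55 dB), r >= {d50:6.1f} m (50 dB)")
```

Under this assumed law, each additional 6 dB of source level roughly doubles the distance needed to meet a given limit, which is in line with the stepwise growth of the required AGL discussed in this section.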

The local noise limits were defined according to the standards for acoustic environment functional zones [43], including the Class 0 ($L_{T} = 50 dB$) and Class 1 ($L_{T} = 55 dB$) limits, while the Class 2 limit ($FPT = 60 dB$) was used as the feasibility protection threshold. In this study, these limits correspond to the daytime limits specified in the acoustic-environment functional-zone standard [43].

(i): Noise threshold $L_{T} = 55 dB$

Appendix B (Table A3) summarizes the path-optimization results under different UAV sound pressure levels for a fixed noise threshold of $L_{T} = 55 dB$. The table reports the corresponding flight time, safety-risk value, and average sound pressure level at the nearest receiver points for each UAV sound pressure level, thereby providing a quantitative basis for analyzing how the UAV sound pressure level affects trajectory performance under the same threshold constraint.

As shown in Table A3, as the UAV source noise level gradually decreases from 95 dB to 65 dB, the average sound pressure level correspondingly declines from 55.36 dB to 38.09 dB, while the flight time is also reduced overall from 217.15 s to 162.61 s. This indicates that, under higher UAV source noise levels, the path-planning strategy tends to adopt more conservative flight maneuvers in order to satisfy the local noise limits, thereby resulting in a higher time cost. In contrast, when the UAV source noise level is lower, the feasible space under the local noise limits is relatively expanded, and the corresponding constraint cost is reduced accordingly.

It should also be noted that the Safety Risk Value does not exhibit a strictly monotonic trend. This is mainly because, under different UAV source noise levels, the optimization process may produce different flight paths, and the proportions of building-shielded areas and unshielded ground areas traversed by these paths are not necessarily the same. Since ground risk is influenced not only by the altitude above ground level (AGL) but also closely related to the shielding conditions of the underlying area, the Safety Risk Value shows some fluctuations rather than a simple monotonic relationship with the UAV source noise level. Nevertheless, from the overall results, these fluctuations are relatively small, indicating that the safety-risk levels remain generally within a similar range under different source noise conditions.

Figure 13 illustrates the UAV trajectory characteristics under different UAV source noise levels from both the top-view and front-view perspectives. In the top view (Figure 13a), under the same local-noise-limit constraint, the overall horizontal direction of the trajectories remains consistent, while noticeable differences can still be observed in the detailed spatial layouts, especially in the middle and later segments of the routes. This indicates that, under noise constraints, different UAV source noise levels affect the spatial organization of the trajectories, leading to different path configurations while preserving a broadly similar start-to-goal direction.

In the front view (Figure 13b), the AGL profiles exhibit a clear layered platform structure, and the cruising AGL generally increases with the UAV source noise level. Specifically, lower UAV source noise levels (65–77 dB) mainly correspond to relatively low AGL platforms of about 50–60 m. A UAV source noise level of 83 dB corresponds to a cruise platform of about 60–70 m, while 86 dB further increases to about 80–90 m. At 89 dB, the cruise AGL rises to about 90–100 m, whereas 92 dB and 95 dB require significantly higher platforms of approximately 120 m and 140 m, respectively. These results indicate that, as the source becomes louder, the feasible low-altitude corridor is progressively compressed, and the planning strategy therefore becomes more reliant on earlier climbing, maintaining a higher AGL, and sustaining longer high-altitude cruise segments in order to enlarge the propagation distance margin and preserve noise compliance.

Figure 14 shows that, as the UAV source noise level increases, the SPL curves at the nearest receiver points shift upward and progressively approach the 55 dB threshold, indicating a gradual reduction in the available noise-feasibility margin. Under low UAV source noise levels (65–77 dB), the curves remain stably below the threshold over most of the trajectory, whereas under moderate UAV source noise levels (80–86 dB), several key segments become critically close to the threshold. For 89–92 dB, the curves remain near the threshold in multiple segments, suggesting a markedly narrowed noise-feasibility margin; by contrast, 95 dB is the only case that is clearly clustered around the threshold and more prone to local exceedances. In all cases, transient SPL rises are observed during the initial climb and final descent, mainly due to rapid changes in the UAV–receiver distance during these phases. Combined with Figure 13b, these results indicate that the path-planning strategy increasingly relies on a higher AGL to maintain noise compliance.

It is worth noting that some areas in the graph are missing SPL data. Because different algorithms produce results based on different SPL calculation methods, no further noise changes occur in the subsequent steps, which leaves gaps in the SPL curves. This phenomenon reflects the impact of different SPL calculation methods on the algorithm’s results.

(ii): Noise threshold $L_{T} = 50 dB$

In contrast to the more relaxed noise threshold $L_{T} = 55 dB$, the stricter threshold is considered here, and the detailed path-optimization results under different UAV sound pressure levels are provided in Appendix B (Table A4). The table reports the corresponding flight time, safety-risk value, and average sound pressure level at the nearest receiver points for each UAV sound pressure level, thereby providing a quantitative basis for evaluating how a tighter noise threshold influences trajectory performance.

Similar to the results under the 55 dB threshold, Table A4 shows that, under the stricter threshold of $L_{T} = 50 dB$, both the average sound pressure level and the flight time generally decrease as the source-noise intensity is reduced. This indicates that tighter noise constraints impose a higher time cost under strong source-noise conditions, whereas lower UAV source noise levels provide a relatively larger noise-feasible margin. The safety-risk value does not vary monotonically, but instead reflects the combined effect of AGL adjustment and the building-shielding conditions beneath the UAV trajectory.

Figure 15 further compares the trajectories under different UAV source noise levels. In the top view (Figure 15a), all cases generally maintain a similar start-to-goal direction, whereas the detailed path layouts vary with the UAV source noise level, indicating that stronger noise constraints require a greater sacrifice of geometric directness. In the front view (Figure 15b), the AGL profiles show a clear stepwise increase with increasing UAV sound pressure level: 65–74 dB mainly corresponds to 50–60 m, 77 dB to 60–70 m, 80 dB to 70–80 m, 83 dB to 80–90 m, 86 dB to 90–100 m, 89 dB to 110–120 m, 92 dB to 130–140 m, and 95 dB to about 150 m. These results indicate that, under the local noise limits, the planning strategy increasingly relies on earlier climbing and sustained high-AGL cruise segments to enlarge the propagation-distance margin and maintain noise compliance.

Figure 16 shows the time-series characteristics of the sound pressure level (SPL) at the nearest receiver points under stricter local noise limits, revealing the impact of different UAV source noise levels on path planning. Overall, none of the curves exceed the feasibility protection threshold (FPT); however, as the noise threshold tightens, the SPL curves gradually approach the threshold line, indicating that the noise feasibility margin is decreasing.

A more detailed analysis shows that, at lower UAV source noise levels (65–74 dB), the SPL curves remain below the 50 dB threshold for most of the flight segments, with minimal fluctuations, indicating that there is still a sufficient noise feasibility margin. When the UAV source noise level increases to 77–86 dB, the curves approach the threshold more frequently, suggesting that the trajectory has entered a critically feasible state. As the UAV source noise level further increases to 89–95 dB, threshold contact and local exceedances become more frequent, reflecting a significant reduction in the noise feasibility margin. This indicates that, at high UAV source noise levels, the path-planning strategy can no longer maintain stable noise compliance throughout the full trajectory.

Some areas in the graph are missing SPL data because different algorithms use different SPL calculation methods, so that noise changes stop in the subsequent steps.

(iii): Sensitivity Analysis of Noise Thresholds and Noise Source Levels

Based on the path optimization results under different UAV sound pressure levels and noise threshold conditions, Figure 17, Figure 18 and Figure 19 summarize the variations in required AGL, flight time, and noise exposure.

Figure 17 shows that the required AGL increases stepwise with UAV sound pressure level. The values shown here represent the highest cruise AGL adopted during the stable-flight phase rather than a constant altitude throughout the entire trajectory. Under relatively low UAV sound pressure levels (65–74 dB), the AGL requirements under the 50 dB and 55 dB thresholds are similar, remaining around 50–60 m. As the UAV sound pressure level increases, the stricter 50 dB threshold requires earlier and larger altitude increases than the 55 dB threshold, indicating a stronger reliance on altitude compensation.

Figure 18 shows that, under both threshold settings, the total flight time generally increases with UAV sound pressure level, and this increase is more pronounced under the stricter 50 dB threshold. Under the 50 dB threshold, the flight time rises from about 162.31 s at 65 dB to 223.68 s at 95 dB, whereas under the 55 dB threshold it increases from about 162.61 s to 217.15 s over the same range. The difference between the two cases is relatively small at lower UAV sound pressure levels but becomes increasingly evident as the UAV sound pressure level increases. This indicates that a stricter threshold leads to stronger AGL compensation and, consequently, a higher time cost.
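The difference in time cost between the two thresholds can be made concrete with a short calculation. The sketch below uses only the endpoint flight times quoted above; the dictionary layout and the helper name are illustrative assumptions.

```python
# Flight times (s) at source levels of 65 dB and 95 dB, as quoted in the text,
# keyed first by noise threshold L_T (dB) and then by source level (dB at 1 m).
FLIGHT_TIME_S = {
    50: {65: 162.31, 95: 223.68},
    55: {65: 162.61, 95: 217.15},
}

def relative_increase(low: float, high: float) -> float:
    """Fractional growth from `low` to `high`."""
    return (high - low) / low

increase = {thr: relative_increase(t[65], t[95]) for thr, t in FLIGHT_TIME_S.items()}

if __name__ == "__main__":
    for thr, frac in increase.items():
        print(f"L_T = {thr} dB: flight time grows by {frac:.1%} from 65 dB to 95 dB")
```

On these figures, the relative growth in flight time is a few percentage points larger under the 50 dB threshold than under the 55 dB threshold.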

Figure 19 shows that, under both threshold settings, the average SPL at the nearest receiver point increases monotonically with UAV sound pressure level. Under the 50 dB threshold, the average SPL remains below the limit from 65 dB to 86 dB but exceeds the threshold from 89 dB onward. Under the 55 dB threshold, the average SPL remains below the limit up to 92 dB and only slightly exceeds it at 95 dB. These results indicate that the stricter 50 dB threshold causes the trajectories to approach the noise boundary earlier, whereas the 55 dB threshold retains a relatively larger noise-feasibility margin over a wider range of UAV sound pressure levels. It should also be noted that the average SPL reflects only the overall noise level and does not guarantee full-process compliance with the noise constraint during the flight, excluding the takeoff and landing phases. Even when the average value remains below the 50 dB or 55 dB threshold, local or short-term exceedances may still occur in certain critical flight segments. Therefore, further analysis in combination with the maximum SPL at the nearest building façade is still required.
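As a simple screening illustration, the average SPL values reported in Tables A3 and A4 can be scanned for the loudest source level whose average stays within the threshold. This is only a sketch: it checks the average SPL alone, whereas the acceptability criterion in this study also involves the maximum allowable AGL and local exceedances. The function name is hypothetical.

```python
# Average SPL (dB) at the nearest receiver points, keyed by source level (dB at 1 m).
AVG_SPL_55 = {95: 55.36, 92: 53.34, 89: 52.66, 86: 51.68, 83: 51.10, 80: 48.49,
              77: 47.34, 74: 46.45, 71: 43.55, 68: 40.52, 65: 38.09}  # Table A3
AVG_SPL_50 = {95: 54.30, 92: 53.17, 89: 50.91, 86: 49.06, 83: 47.89, 80: 45.87,
              77: 45.13, 74: 42.76, 71: 42.02, 68: 41.36, 65: 38.13}  # Table A4

def max_source_level_within(avg_spl: dict, threshold_db: float):
    """Loudest source level whose average SPL does not exceed the threshold."""
    compliant = [level for level, spl in avg_spl.items() if spl <= threshold_db]
    return max(compliant) if compliant else None

if __name__ == "__main__":
    print("L_T = 55 dB:", max_source_level_within(AVG_SPL_55, 55.0), "dB")
    print("L_T = 50 dB:", max_source_level_within(AVG_SPL_50, 50.0), "dB")
```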

5. Conclusions

This study proposes the TNAP-DDQN method for urban low-altitude UAV path planning under residential noise threshold constraints. The method incorporates multiple operational constraints, including flight time, collision risk, and maximum AGL altitude, to achieve coordinated optimization of noise compliance, safety, and efficiency. To address action-space compression, tied action values, and training instability under multiple constraints, a Noise Degradation Mask-based Action Bias Network (NDM-ABN) is introduced at the action-selection stage. In addition, multi-step prioritized experience replay (PER) and potential-based reward shaping are used to improve sample efficiency and long-term reward learning.

The main conclusions are as follows: First, the ablation and comparison results show that NDM-ABN is the key module in TNAP-DDQN. Compared to using only a noise penalty in the reward function, NDM-ABN directly suppresses high-noise actions during action selection, thereby improving the stability of noise control. Second, the required AGL (Above Ground Level) is significantly affected by the UAV sound pressure level and local noise limits; higher UAV sound pressure levels or stricter noise limits both correspond to higher AGL requirements. This study further provides the corresponding AGL requirements under different UAV sound pressure levels and noise threshold conditions, which can serve as a direct reference for differentiated AGL altitude management of UAVs with different noise characteristics. Third, the results show that stricter noise thresholds significantly reduce the acceptable range of UAV source levels. Under the combined constraints of local noise thresholds and the maximum allowable AGL altitude of 120 m, the maximum acceptable UAV sound pressure level in the scenario considered in this study is 86 dB for the 50 dB noise threshold and 92 dB for the 55 dB noise threshold.

This study provides quantitative guidance for UAV noise-entry management by clarifying the relationship among UAV source noise level, local noise limits, and required AGL. However, the equivalent sound pressure level used in this study cannot fully describe short-term fluctuations and peak variations in noise signals.

Moreover, the simplified point-source assumption, geometric attenuation model, and regular-cuboid obstacle representation improve computational efficiency. However, these simplifications may not fully capture the complexity of acoustic propagation and obstacle shapes in real urban environments, leading to potential deviations between the results and actual conditions. Future work will therefore consider more comprehensive acoustic indicators, higher-precision aerodynamic noise simulation, and more flexible obstacle representation methods such as convex hulls or polyhedral geometric modeling, and field or laboratory measurements for model calibration and validation.

Author Contributions

Funding acquisition, X.H. and Y.Z.; project administration, Y.C., X.H. and Y.Z.; writing—original draft, Y.J.; writing—review and editing, Y.C., X.H. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Safety Special Project of the Civil Aviation Administration of China (CAAC) (kG2025007), the Fundamental Research Funds for the Central Universities (25CAFUC04057) and Sichuan Provincial Engineering Research Center of Smart Operation and Maintenance of Civil Aviation Airports (JCZX 2024ZZ17).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AGL	Above Ground Level
UAVs	Unmanned Aerial Vehicles
DDQN	Double Deep Q-Network
TNAP-DDQN	Threshold-Noise-Aware Path Planning based on Double Deep Q-Network
NDM-ABN	Noise Degradation Mask-based Action Bias Network
PER	Prioritized Experience Replay
SPL	Sound Pressure Level
FPT	Feasibility Protection Threshold
B-APF-DQN	B-spline-optimized Artificial Potential Field Deep Q-Network
S-JPS	improved Jump Point Search
Dueling-DQN	Dueling Deep Q-Network
IMC	the maximum allowable AGL
IMH	the minimum desired AGL

Appendix A

The ABN in this study is constructed using a multilayer fully connected network, and its forward propagation process can be expressed as follows:

a_{1} = ReLU (W_{1}^{⊤} \cdot x + b_{1})

(A1)

a_{2} = W_{2} \cdot a_{1} + b_{2}

(A2)

a_{3} = W_{3} \cdot a_{2} + b_{3}

(A3)

α_{a} = W_{4} \cdot a_{3} + b_{4}

(A4)

Here, $W_{1} \in R^{128 \times 11}$, $W_{2} \in R^{64 \times 128}$, $W_{3} \in R^{32 \times 64}$, and $W_{4} \in R^{26 \times 32}$ denote the weight matrices of each layer, while $b_{1} \in R^{128 \times 1}$, $b_{2} \in R^{64 \times 1}$, $b_{3} \in R^{32 \times 1}$, and $b_{4} \in R^{26 \times 1}$ represent the corresponding bias vectors. The vectors $a_{1}$, $a_{2}$, and $a_{3}$ denote the outputs of the hidden layers, and $α_{a}$ denotes the raw action-bias value generated by the ABN. Through nonlinear mapping, the network captures the latent relationship between the state and action selection, thereby providing additional discriminative information for approximately tied actions.
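The forward pass in Equations (A1)–(A4) can be sketched in NumPy as follows. The weights are random placeholders, since trained values are not reported, and $W_{1}$ is applied directly as a 128 × 11 matrix so that the layer dimensions chain correctly; both choices are assumptions made for illustration, as are the function names.

```python
import numpy as np

STATE_DIM, N_ACTIONS = 11, 26  # input and output sizes implied by W_1 and W_4

def init_params(rng: np.random.Generator) -> dict:
    """Random placeholder weights and zero biases (assumption, not trained values)."""
    sizes = [(128, STATE_DIM), (64, 128), (32, 64), (N_ACTIONS, 32)]
    params = {}
    for k, (out_dim, in_dim) in enumerate(sizes, start=1):
        params[f"W{k}"] = rng.normal(0.0, 0.1, size=(out_dim, in_dim))
        params[f"b{k}"] = np.zeros(out_dim)
    return params

def abn_forward(x: np.ndarray, p: dict) -> np.ndarray:
    """Map a state vector to one raw bias value per candidate action."""
    a1 = np.maximum(0.0, p["W1"] @ x + p["b1"])  # (A1): ReLU hidden layer
    a2 = p["W2"] @ a1 + p["b2"]                  # (A2): linear layer
    a3 = p["W3"] @ a2 + p["b3"]                  # (A3): linear layer
    return p["W4"] @ a3 + p["b4"]                # (A4): raw action-bias values

params = init_params(np.random.default_rng(0))
alpha = abn_forward(np.ones(STATE_DIM), params)
print(alpha.shape)
```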

RMS (α_{a}) = \sqrt{\frac{1}{A} \sum_{a = 1}^{A} α_{a}^{2}}

(A5)

Here, $A$ denotes the number of current candidate actions, and $α_{a}$ represents the raw bias value corresponding to the $a$-th candidate action. The RMS reflects the overall magnitude level of the current bias output and can be used for subsequent scaling. On this basis, the normalized bias term is obtained as follows:

{\hat{a}}_{i} = \tanh (\frac{α_{a}}{RMS (α_{a}) \cdot T})

(A6)
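A minimal NumPy sketch of the RMS scaling and tanh squashing in Equations (A5) and (A6) is given below. The small constant `eps` guarding against division by zero is an added assumption and does not appear in the equations; the function name and sample values are illustrative.

```python
import numpy as np

def normalize_bias(alpha: np.ndarray, temperature: float, eps: float = 1e-8) -> np.ndarray:
    """Scale raw biases by their RMS (A5) and squash them with tanh (A6)."""
    rms = np.sqrt(np.mean(alpha ** 2))
    # eps is a numerical safeguard (assumption) for an all-zero bias vector.
    return np.tanh(alpha / ((rms + eps) * temperature))

raw = np.array([0.2, -0.1, 0.4, 0.0])  # made-up raw bias values
print(normalize_bias(raw, temperature=1.0))
print(normalize_bias(raw, temperature=0.1))  # lower T pushes values toward -1 and 1
```

Because the argument of tanh is divided by $T$, a smaller temperature sharpens the contrast between candidate actions, while the output always stays within (−1, 1).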

To maintain stable exploration during the early stage of training while enhancing the discriminative capability of the bias in the later stage, an annealing strategy is adopted in this study to dynamically update the temperature parameter, which is expressed as follows:

T = T_{0} \cdot \frac{1}{1 + λ \cdot t}

(A7)

Here, $T_{0} = 1$ is the initial temperature, $λ = 0.1$ is the annealing rate, and $t$ is the current training step.
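The annealing schedule in Equation (A7) can be evaluated directly, as in the sketch below; the function name is hypothetical, and the parameter values are those stated above.

```python
T0 = 1.0      # initial temperature (from the text)
LAMBDA = 0.1  # annealing rate (from the text)

def temperature(step: int, t0: float = T0, lam: float = LAMBDA) -> float:
    """Inverse-time annealing, T = T0 / (1 + lambda * t), following (A7)."""
    if step < 0:
        raise ValueError("training step must be non-negative")
    return t0 / (1.0 + lam * step)

if __name__ == "__main__":
    for t in (0, 10, 90, 990):
        print(f"t = {t:4d}: T = {temperature(t):.3f}")
```

With these settings the temperature halves after 10 steps and falls to one tenth of its initial value after 90 steps, so the bias term becomes sharply discriminative early in training if $t$ counts individual training steps.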

The action set obtained under the strict noise threshold condition is given as follows:

C_{t}^{(h)} = \{a_{t} \in {C^{'}}_{t} |L ({s^{'}}_{t}) \leq τ ({s^{'}}_{t})\}

(A8)

When the number of actions satisfying the strict noise threshold is insufficient, a feasibility protection threshold denoted as $FPT$ is introduced, and the corresponding action set is defined as follows:

C_{t}^{(s)} = \{a_{t} \in C_{t} \mid L ({s^{'}}_{t}) \leq FPT\}, if |C_{t}^{(h)}| < N_{\min}

(A9)

If a sufficient number of actions still cannot be obtained under the feasibility-preserving threshold, the action set is further degraded to the set containing only geometrically feasible actions as follows:

C_{t}^{(0)} = \{a \in C_{t} ∣ s^{'} (a) \in Ω_{free}\}, if |C_{t}^{(s)}| < N_{\min}

(A10)
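The three-level degradation in Equations (A8)–(A10) can be sketched as a cascade of filters. In this illustration the state-dependent threshold $τ$ is simplified to a single number, the candidate list stands in for $C_{t}$, and the `Candidate` record and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    action: int           # index of the action
    predicted_spl: float  # predicted SPL L(s') at the successor state, in dB
    in_free_space: bool   # True if the successor state is collision-free

def select_action_set(cands: List[Candidate], tau: float, fpt: float,
                      n_min: int) -> List[Candidate]:
    """Return the strictest action set that still holds at least n_min actions."""
    strict = [c for c in cands if c.predicted_spl <= tau]  # (A8): strict threshold
    if len(strict) >= n_min:
        return strict
    soft = [c for c in cands if c.predicted_spl <= fpt]    # (A9): relaxed to FPT
    if len(soft) >= n_min:
        return soft
    return [c for c in cands if c.in_free_space]           # (A10): geometry only

cands = [Candidate(0, 54.0, True), Candidate(1, 57.0, True),
         Candidate(2, 62.0, True), Candidate(3, 70.0, False)]  # made-up values
print([c.action for c in select_action_set(cands, tau=55.0, fpt=60.0, n_min=2)])
```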

To reduce the interference of underlying value differences among candidate actions with the bias-based decision process, a de-meaning operation is introduced within the approximately tied candidate set. It can be expressed as follows:

{\bar{Q}}_{main} (s_{t}, a_{t}) = {\tilde{Q}}_{main} (s_{t}, a_{t}) - \frac{1}{| {C^{'}}_{t} |} \sum_{a^{'} \in {C^{'}}_{t}} {\tilde{Q}}_{main} (s_{t}, a^{'})

(A11)

To enable the output of the network to possess effective discriminative capability in terms of long-term return, its parameters must be further updated during the training stage. In this study, the temporal-difference target is adopted to construct the training signal for the bias network:

y_{t} = R_{t} + γ \cdot Q_{target} ({s^{'}}_{t}, {a^{'}}_{t})

(A12)

On this basis, the decision score involving the ABN can be further written as follows:

Q^{dec} (s_{t}, a_{t}) = {\bar{Q}}_{main} (s_{t}, a_{t}) + α \cdot {\hat{a}}_{t}

(A13)
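The effect of the de-meaning step and the bias term on tie-breaking can be illustrated with a small example. The sketch reads (A11) as subtracting the mean Q-value over the approximately tied candidate set; the scaling coefficient of 0.1 and all numerical values are made-up assumptions.

```python
import numpy as np

def decision_scores(q_main: np.ndarray, bias_hat: np.ndarray,
                    alpha_scale: float) -> np.ndarray:
    """De-mean Q over the candidate set (A11), then add the scaled bias (A13)."""
    q_bar = q_main - q_main.mean()
    return q_bar + alpha_scale * bias_hat

q = np.array([1.00, 1.01, 0.99])  # approximately tied Q-values (made-up)
b = np.array([-0.5, 0.1, 0.8])    # normalized biases in (-1, 1) (made-up)
scores = decision_scores(q, b, alpha_scale=0.1)

print("greedy on Q alone:      ", int(np.argmax(q)))
print("greedy on decision score:", int(np.argmax(scores)))
```

In this example the Q-values differ by at most 0.02, so the bias term decides the ranking, which is the intended behavior for approximately tied actions.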

To ensure that the bias term generated by the ABN can provide effective decision-discriminative information under the criterion of long-term return, a loss function is constructed in this study based on the error between the decision score and the target value:

L_{ABN} (θ) = \frac{1}{| {C^{'}}_{t} |} \sum_{a_{t} \in {C^{'}}_{t}} {(Q^{dec} (s_{t}, a_{t}) - y_{t})}^{2}

(A14)
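A minimal sketch of the training signal in Equations (A12) and (A14) follows. It computes a one-step temporal-difference target and the mean squared error over the candidate set; the discount factor of 0.95 is taken from Table A1, while the remaining numbers and the function names are illustrative assumptions.

```python
import numpy as np

def td_target(reward: float, gamma: float, q_target_next: float) -> float:
    """One-step temporal-difference target y_t, following (A12)."""
    return reward + gamma * q_target_next

def abn_loss(q_dec: np.ndarray, y: float) -> float:
    """Mean squared error between decision scores and y_t over the candidate set (A14)."""
    return float(np.mean((q_dec - y) ** 2))

y = td_target(reward=1.0, gamma=0.95, q_target_next=2.0)  # made-up reward and Q-value
loss = abn_loss(np.array([2.9, 3.9, 1.9]), y)             # made-up decision scores
print(f"y_t = {y:.2f}, loss = {loss:.4f}")
```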

Appendix B

Table A1. Relevant Parameter Settings.

Parameter	Value
Learning Rate	0.001
Discount Factor	0.95
Exploration Rate	0.9
Exploration Decay Factor	0.995
Minimum Exploration Rate	0.01
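
The exploration parameters in Table A1 imply a multiplicative ε-greedy decay. The sketch below assumes that the decay factor is applied once per update (whether this means per episode or per step is not stated here); the function names are hypothetical.

```python
EPS_START, EPS_DECAY, EPS_MIN = 0.9, 0.995, 0.01  # values from Table A1

def epsilon_at(n_updates: int) -> float:
    """Exploration rate after n multiplicative decay updates, floored at EPS_MIN."""
    return max(EPS_MIN, EPS_START * EPS_DECAY ** n_updates)

def updates_to_floor() -> int:
    """Number of decay updates until the exploration rate reaches the floor."""
    n = 0
    while EPS_START * EPS_DECAY ** n > EPS_MIN:
        n += 1
    return n

if __name__ == "__main__":
    print("epsilon after 100 updates:", round(epsilon_at(100), 4))
    print("updates needed to reach the minimum:", updates_to_floor())
```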

Table A2. UAV Parameter Settings.

Parameter	Value
Maximum Flight Altitude (/m)	120
UAV Empty Weight (/kg)	7.3
$S_{hit} / m^{2}$	0.0188
$v_{h} / (m \cdot s^{- 1})$	18
$v_{d} / (m \cdot s^{- 1})$	3
$v_{u} / (m \cdot s^{- 1})$	5
$S_{c}$	0.5
$P_{UAV}$	3.42 × 10⁻⁴

Table A3. Comparison of Results Under Different Noise Sources ($L_{T} = 55 dB$).

Source Sound Pressure Level at 1 m ($L_{s} (r_{0} = 1)$/dB)	Flight Time (/s)	Safety Risk Value	Average Sound Pressure Level (/dB)
95	217.15	2.3453 × 10⁻⁴	55.36
92	199.70	2.2510 × 10⁻⁴	53.34
89	195.37	2.3564 × 10⁻⁴	52.66
86	182.22	2.0348 × 10⁻⁴	51.68
83	174.38	2.1900 × 10⁻⁴	51.10
80	169.05	2.1900 × 10⁻⁴	48.49
77	165.37	2.0126 × 10⁻⁴	47.34
74	164.02	2.6502 × 10⁻⁴	46.45
71	163.61	2.6668 × 10⁻⁴	43.55
68	162.31	2.6003 × 10⁻⁴	40.52
65	162.61	2.9663 × 10⁻⁴	38.09

Table A4. Comparison of Results Under Different Noise Sources ($L_{T} = 50 dB$).

Source Sound Pressure Level at 1 m ($L_{s} (r_{0} = 1)$/dB)	Flight Time (/s)	Safety Risk Value	Average Sound Pressure Level (/dB)
95	223.68	2.1401 × 10⁻⁴	54.30
92	212.73	2.4285 × 10⁻⁴	53.17
89	203.74	2.4673 × 10⁻⁴	50.91
86	190.14	2.4118 × 10⁻⁴	49.06
83	187.70	2.0514 × 10⁻⁴	47.89
80	183.20	2.1734 × 10⁻⁴	45.87
77	177.86	2.1734 × 10⁻⁴	45.13
74	171.85	2.0958 × 10⁻⁴	42.76
71	164.44	2.0736 × 10⁻⁴	42.02
68	162.61	2.6003 × 10⁻⁴	41.36
65	162.31	2.9663 × 10⁻⁴	38.13

References

Jin, Z.; Li, H.; Qin, Z.; Wang, Z. Gradient-free cooperative source-seeking of quadrotor under disturbances and communication constraints. IEEE Trans. Ind. Electron. 2025, 72, 1969–1979.
Meng, W.; Zhang, X.; Zhou, L.; Guo, H.; Hu, X. Advances in UAV Path Planning: A Comprehensive Review of Methods, Challenges, and Future Directions. Drones 2025, 9, 376.
Sheltami, T.R.; Ahmed, G.; Ghaleb, M.; Mahmoud, A.S.H. UAV path planning and trajectory optimization: A comprehensive survey. Arab. J. Sci. Eng. 2025, 51, 105–145.
Jin, Z.; Bai, L.; Wang, Z.; Zhang, P. Self-triggered distributed formation control of fixed-wing unmanned aerial vehicles subject to velocity and overload constraints. IEEE Trans. Autom. Sci. Eng. 2024, 21, 4082–4093.
Zammit, C.; van Kampen, E.-J. Comparison between A* and RRT algorithms for 3D UAV path planning. Unmanned Syst. 2022, 10, 129–146.
Brown, A.; Anderson, D. Trajectory optimization for high-altitude long-endurance UAV maritime radar surveillance. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 2406–2421.
Cheriet, H.; Badra, K.K.; Chouraqui, S. Comparative analysis of UAV path planning algorithms for efficient navigation in urban 3D environments. In Proceedings of the 2024 International Conference of the African Federation of Operational Research Societies (AFROS), Marrakech, Morocco, 8–10 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8.
Primatesta, S.; Spanò Cuomo, L.; Guglieri, G.; Rizzo, A. An innovative algorithm to estimate risk optimum path for unmanned aerial vehicles in urban environments. Transp. Res. Procedia 2018, 35, 44–53.
Zhang, X.; Liu, Y.; Gao, Z.; Ren, J.; Zhou, S.; Yang, B. A ground-risk-map-based path-planning algorithm for UAVs in an urban environment with beetle swarm optimization. Appl. Sci. 2023, 13, 11305.
Feng, Q.; Zhang, H.; Tang, W.; Wang, F.; Feng, D.; Zhong, G. Digital low-altitude airspace unmanned aerial vehicle path planning and operational capacity assessment in urban risk environments. Drones 2025, 9, 320.
Pang, B.; Tan, Q.; Ra, T.; Low, K.H. A risk-based UAS traffic network model for adaptive urban airspace management. In Proceedings of the AIAA Aviation 2020 Forum, Virtual Event, 15–19 June 2020; AIAA: Reston, VA, USA, 2020; p. 2900.
Mestres, P.; Nieto-Granda, C.; Cortés, J. Safe and dynamically feasible motion planning using control Lyapunov and barrier functions. IEEE Trans. Rob. 2025, 41, 6440–6459.
Song, J.; Zhou, R. Improved UAV path planning in urban environment based on A-Star method. In Proceedings of the 2025 2nd International Conference on Mechanics, Electronics Engineering and Automation (ICMEEA 2025); Atlantis Press: Paris, France, 2025; pp. 1157–1167.
Niknejad, N.; Esmzad, R.; Han, T.; Sankar, G.S.; Modares, H. DaSP-RRT: Data-driven safe performance-aware motion planning. IEEE Robot. Autom. Lett. 2025, 10, 9408–9415.
Lu, Y.; Yan, D.; Wan, Z.; Feng, C. Conflict-free 3D path planning for multi-UAV based on jump point search and incremental update. Drones 2025, 9, 688.
Zhang, B.; Hou, Y.; Yin, H.; Lv, M.; Yang, A.; Wu, L. Cooperative dynamic target tracking: Distributed time-varying optimization for multi-UAV system. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 12245–12257.
Wang, N.; Liang, X.; Li, Z.; Hou, Y.; Yang, A. PSE-D model-based cooperative path planning for UAV and USV systems in antisubmarine search missions. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 6224–6240.
Zhou, T. Research on path planning based on deep reinforcement learning. Adv. Appl. Math. 2025, 14, 572–578. (In Chinese)
Xie, R.; Meng, Z.; Wang, L.; Li, H.; Wang, K.; Wu, Z. Unmanned aerial vehicle path planning algorithm based on deep reinforcement learning in large-scale and dynamic environments. IEEE Access 2021, 9, 24884–24900.
Han, L.; Zhang, H.; An, N. A continuous space path planning method for unmanned aerial vehicle based on particle swarm optimization-enhanced deep Q-network. Drones 2025, 9, 122.
Nguyen, T.; Nahavandi, S.; Razzak, I.; Nguyen, T.; Pham, N.; Hung, N. The emergence of deep reinforcement learning for path planning. arXiv 2025, arXiv:2507.15469.
Xu, Z.; Chen, M.; Han, Z.; Shao, S. Dynamic path planning of low-altitude aircraft based on TCP-DQN algorithm in complex environment. Robot 2025, 47, 383–393. (In Chinese)
Deng, M.; Yang, Q.; Peng, Y. A real-time path planning method for urban low-altitude logistics UAVs. Sensors 2023, 23, 7472.
Hu, X.; Wu, Y.; Pang, B. Path planning for drone delivery in dense building environments. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 5542–5547.
Cao, H.; Li, S.; Li, X.; Liu, Y. A UAV path-planning approach for urban environmental event monitoring. Comput. Mater. Continua 2025, 83, 5575–5593.
Christian, A.; Cabell, R. Initial investigation into the psychoacoustic properties of small unmanned aerial system noise. In Proceedings of the 23rd AIAA/CEAS Aeroacoustics Conference, Denver, CO, USA, 5–9 June 2017; AIAA: Reston, VA, USA, 2017; p. 4051.
Bulusu, V.; Polishchuk, V.; Sedov, L. Noise estimation for future large-scale small UAS operations. In Proceedings of the INTER-NOISE and NOISE-CON Congress and Conference, Hong Kong, China, 27–30 August 2017; Institute of Noise Control Engineering: Reston, VA, USA, 2017; pp. 1–10. Available online: https://ince.publisher.ingentaconnect.com/contentone/ince/incecp/2017/00000254/00000002/art00106 (accessed on 1 February 2026).
Whelchel, J. Flyover noise of multi-rotor sUAS. In Proceedings of the INTER-NOISE and NOISE-CON Congress and Conference, Madrid, Spain, 16–19 June 2019; Institute of Noise Control Engineering: Reston, VA, USA, 2019; pp. 1–10.
Torija, A.J.; Li, Z.; Self, R.H. Effects of a hovering unmanned aerial vehicle on urban soundscapes perception. Transp. Res. Part D Transp. Environ. 2020, 78, 102195.
Torija, A.J.; Clark, C. A psychoacoustic approach to building knowledge about human response to noise of unmanned aerial vehicles. Int. J. Environ. Res. Public Health 2021, 18, 682.
Schäffer, B.; Pieren, R.; Heutschi, K.; Wunderli, J.M.; Becker, S. Drone noise emission characteristics and noise effects on humans—A systematic review. Int. J. Environ. Res. Public Health 2021, 18, 5940.
Hui, C.T.J.; Kingan, M.J.; Hioka, Y.; Schmid, G.; Dodd, G.; Dirks, K.N.; Edlin, S.; Mascarenhas, S.; Shim, Y.-M. Quantification of the psychoacoustic effect of noise from small unmanned aerial vehicles. Int. J. Environ. Res. Public Health 2021, 18, 8893.
Bian, H.; Tan, Q.; Zhong, S.; Zhang, X. Assessment of UAM and drone noise impact on the environment based on virtual flights. Aerosp. Sci. Technol. 2021, 118, 106996.
Tan, Q.; Zhang, X.; Bian, H. Enhancing sustainable urban air transportation: Low-noise UAS flight planning using noise assessment simulator. Aerosp. Sci. Technol. 2024, 147, 109071.
Tan, Q.; Bian, H.; Zhang, X. Exploring noise reduction strategies: Optimizing drone station placement for last-mile delivery. Transp. Res. Part D Transp. Environ. 2024, 133, 104306.
Tan, Q.; Zhong, S.; Qu, R.; Li, Y.; Zhou, P.; Lo, H.K.; Zhang, X. Low-noise flight path planning of drones based on a virtual flight noise simulator: A vehicle routing problem. IEEE Intell. Transp. Syst. Mag. 2024, 16, 56–71.
Zhang, C.; Guo, T.; Li, Y. Dual-population coevolutionary optimization for multi-layer urban air logistics network. Acta Aeronaut. Astronaut. Sin. 2025, 46, 531477. (In Chinese)
Chen, D.; Tang, C.; Xie, Y.; Ma, Y.; Xu, T. Real time dual layer path planning of unmanned aerial vehicles for urban low altitude logistics distribution. Acta Aeronaut. Astronaut. Sin. 2025, 46, 331621. (In Chinese)
Pang, B.; Hu, X.; Dai, W.; Low, K.H. UAV path optimization with an integrated cost assessment model considering third-party risks in metropolitan environments. Reliab. Eng. Syst. Saf. 2022, 222, 108399.
Chen, Y.J.; Yu, S.S.; Zhang, X.J. Ground risk quantitative assessment for UAV operations in urban low-altitude scenarios. J. Beijing Univ. Aeronaut. Astronaut. 2025, 51, 806–815. (In Chinese)
Hansen, C.H. Fundamentals of acoustics. Am. J. Phys. 1951, 19, 254–255.
Cabell, R.; Grosveld, F.; McSwain, R. Measured noise from small unmanned aerial vehicles. In Proceedings of the Noise-Con 2016, Providence, RI, USA, 13–15 June 2016; pp. 345–354.
DB3205/T 1181-2025[S]; Specification of Noise Control for Low-Altitude Vertical Take-Off and Landing (VTOL). Suzhou Municipal Market Supervision Administration: Suzhou, China, 2025. (In Chinese)
van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; AAAI Press: Palo Alto, CA, USA, 2016; Volume 30, p. 10295. [Google Scholar] [CrossRef]
Qu, X.; Ye, B.; Wang, H.; Xiao, Y. Research on global path planning method for urban logistics UAV based on risk maps. Aeronaut. Comput. Tech. 2024, 54, 87–91. (In Chinese) [Google Scholar]
Han, P.; Yang, X.; Zhao, Y.; Guan, X.; Wang, S. Quantitative ground risk assessment for urban logistical unmanned aerial vehicle (UAV) based on Bayesian network. Sustainability 2022, 14, 5733. [Google Scholar] [CrossRef]
Kong, F.; Wang, Q.; Gao, S.; Yu, H. B-APFDQN: A UAV path planning algorithm based on deep Q-network and artificial potential field. IEEE Access 2023, 11, 44051–44064. [Google Scholar] [CrossRef]
Tan, J.; Yan, B.; Chen, Y.; Yan, H.; Cheng, J. Improved UAV path planning study for JPS. J. Chongqing Univ. Technol. (Nat. Sci.) 2024, 38, 328–337. Available online: http://clgzk.qks.cqut.edu.cn/CN/Y2024/V38/I1/328 (accessed on 1 February 2026). (In Chinese)
Wang, X.; Gursoy, M.C.; Erpek, T.; Sagduyu, Y.E. Learning-based UAV path planning for data collection with integrated collision avoidance. IEEE Internet Things J. 2022, 9, 16663–16676. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the UAV Path Planning Optimization Framework.

Figure 2. State Space Diagram.

Figure 3. Illustration of the Action Space.

Figure 4. Noise Propagation Relationship between UAV, Noise Source, and Receiver.

Figure 5. Overall Flowchart of TNAP-DDQN.

Figure 6. Prioritized n-Step Replay Mechanism Flowchart.

Figure 7. Urban Environment Modeling. (a) Urban Building View; (b) 3D Illustration of Urban Building Distribution.

Figure 8. Time-Series of Sound Pressure Level at Nearest Receiver Points Along the Trajectory (Ablation Experiment).

Figure 9. UAV Path Comparison in Ablation Experiment. (a) Path Comparison (Top View); (b) Path Comparison (Front View).

Figure 10. Comparison of Cumulative Reward During Training.

Figure 11. Comparison of Path Planning Results for Different Algorithms.

Figure 12. Time-Series of Sound Pressure Level at Nearest Receiver Points Along the Trajectory (Different Algorithms).

Figure 13. Comparison of UAV Flight Paths Under Different Noise Sources ($L_{T} = 55$ dB): (a) Top View; (b) Front View.

Figure 14. Time-Series of Sound Pressure Level at Nearest Receiver Points Under Different Noise Source Intensities ($L_{T} = 55$ dB).

Figure 15. Comparison of UAV Flight Paths Under Different Noise Sources ($L_{T} = 50$ dB): (a) Top View; (b) Front View.

Figure 16. Time-Series of Sound Pressure Level at Nearest Receiver Points Under Different Noise Source Intensities ($L_{T} = 50$ dB).

Figure 17. Comparison of the Highest AGL Cruise Platform Reached by the Planned Trajectories under Different Noise Thresholds.

Figure 18. Comparison of Flight Time Under Different Thresholds.

Figure 19. Comparison of Average Noise Levels Under Different Thresholds.

Table 1. Performance Changes with Progressive Module Addition.

Module Added	Flight Time (s)	Safety Risk Value	Average Sound Pressure Level (dB)	Maximum SPL at the Nearest Building Façade (dB)
DDQN (Baseline)	251.40	2.4118 × 10⁻⁴	60.85	69
+ Noise Penalty	222.08	3.0051 × 10⁻⁴	60.17	69
+ NDM	217.79	2.0237 × 10⁻⁴	56.45	51.39
+ ABN	235.00	2.1900 × 10⁻⁴	52.62	59.45
+ MSL	228.86	2.2067 × 10⁻⁴	52.11	59
+ PER	195.37	2.3564 × 10⁻⁴	52.66	54.38

Table 2. Algorithm Comparison Results.

Algorithm	Flight Time (s)	Safety Risk Value	Average Sound Pressure Level (dB)	Maximum SPL at the Nearest Building Façade (dB)
TNAP-DDQN	195.37	2.3564 × 10⁻⁴	52.66	54.38
B-APF-DQN	171.91	2.6059 × 10⁻⁴	59.04	69
S-JPS	172.31	2.9053 × 10⁻⁴	58.85	62.97
Dueling-DQN	163.31	2.4950 × 10⁻⁴	56.85	69
