1. Introduction
UAVs have become an integral part of urban management and service optimization [1,2,3,4,5], and their feasibility is increasingly being validated [6,7]. Compared with an individual UAV, multi-UAV systems can complete tasks more efficiently through collaborative operations [8,9]. UAV-based data gathering is more efficient and flexible than traditional ground-based methods, enabling rapid responses to real-time monitoring needs and allowing data collection from optimal viewpoints [10,11]. To fully exploit the potential of UAVs in smart city services, it is necessary to study the UAV path planning problem. In urban environments, the main challenges in finding optimal solutions include dense high-rise buildings, non-stationary multi-UAV teams, and dynamic mission objectives. This research aims to develop an efficient path planning algorithm that maximizes the overall data gathering rate of camera-equipped multi-UAV teams and is applicable to various urban services such as traffic data gathering and infrastructure data gathering.
Numerous studies on UAV path planning for urban services cover tasks such as communication relay [12], disaster management [13], target monitoring [14], collaborative search [15], pursuit [16], and convoy tracking [17]. Most of these studies aim to find optimal paths that maximize task efficiency. In [18], a DQN-based target search strategy was proposed that considers the uncertainty characteristics of task targets in an obstacle-free environment. Under energy and search time constraints, an optimal trajectory was designed to minimize the uncertainty of the search area. In [19], a trajectory design scheme incorporating RL was proposed for multi-UAV systems serving as base stations that provide communication coverage for dispersed users. Optimal paths and power allocation strategies for the multi-UAV system were designed to maximize the overall system utility for all served users. In [20], an intention-based communication method was proposed for randomly appearing traffic monitoring task targets. This method balances the workload among UAVs and maximizes the coverage of blind spots left by ground monitoring equipment. In [21], an RL-based clustering management method and an online path planning method were proposed, considering rapidly changing environments and a set of fixed waypoints. This approach enables UAVs equipped with visual sensors to fully cover existing waypoints in polynomial time in dynamic environments.
Although these efforts have improved task efficiency through path optimization, they have not adequately addressed task variability and UAV team composition. A general path planning framework should adapt to changes in the location and number of task points, as well as variations in UAV energy and quantity, to better meet urban service needs. To address this issue, this paper proposes an RL system with an environment resetting strategy. The algorithm periodically resets the environment and randomly selects variable parameters, such as task points and UAVs, at the beginning of each training cycle so that it adapts to different task scenarios. Specifically, the method is applicable to various types of information collection tasks based on onboard cameras. It can also be extended to tasks such as parcel delivery and emergency medical deliveries by setting appropriate parameters for the communication range, the amount of information collected, and other factors, as detailed in the simulation section. The characteristics of urban environments must also be considered. Finding better path planning solutions in areas of dense high-rise buildings is a significant challenge. In urban settings, UAVs cannot fly over high-rise buildings, which restricts their movements and prevents a thorough exploration of the environment before the end of training. At the system level, this manifests as lower UAV accessibility to certain areas and a reduced overall information collection rate. The issue is particularly acute in the aforementioned periodic systems, where each training episode has a very limited number of steps. Because impassable buildings limit flight flexibility, the denser and more widespread the high-rise buildings, the greater the restrictive impact. This issue is referred to as the NEP-Problem in this study, and it is given a rigorous mathematical description in Section 2. Additionally, the reward settings in RL often include collision penalties or time penalties, which create sparse reward challenges for specific task points and further reduce the agent’s exploration rate in areas dense with high-rise buildings.
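The environment resetting strategy described above can be illustrated with a minimal Python sketch. It only shows the idea of drawing a fresh configuration at the start of every episode; the class name `EpisodeConfig`, the function `reset_environment`, and all numeric ranges are hypothetical placeholders, not the settings used in this study.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    n_uavs: int
    uav_energy: list      # remaining energy per UAV (arbitrary units)
    start_cells: list     # start cell of each UAV
    task_cells: list      # grid positions of the task points
    task_data: list       # data volume waiting at each task point


def reset_environment(rng, free_cells, start_pool,
                      uav_range=(1, 3), energy_range=(50, 100),
                      task_range=(3, 10), data_range=(5.0, 20.0)):
    """Draw a new episode configuration; every range here is hypothetical."""
    n_uavs = rng.randint(*uav_range)      # inclusive bounds
    n_tasks = rng.randint(*task_range)
    return EpisodeConfig(
        n_uavs=n_uavs,
        uav_energy=[rng.randint(*energy_range) for _ in range(n_uavs)],
        start_cells=[rng.choice(start_pool) for _ in range(n_uavs)],
        task_cells=rng.sample(free_cells, n_tasks),   # distinct positions
        task_data=[rng.uniform(*data_range) for _ in range(n_tasks)],
    )


rng = random.Random(0)
free_cells = [(r, c) for r in range(8) for c in range(8)]
start_pool = [(0, 0), (0, 7), (7, 0)]
cfg = reset_environment(rng, free_cells, start_pool)
print(cfg.n_uavs, len(cfg.task_cells))
```

Calling `reset_environment` once per episode exposes the learner to many team sizes, energy budgets, and task layouts, which is the property the resetting strategy relies on.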
The NEP-Problem can be mitigated by selecting appropriate exploration strategies that try suboptimal actions in order to discover choices that may yield higher long-term rewards [22]. In the field of RL exploration, researchers have proposed various strategies, including imitation-based methods [23,24], goal-based methods [25], and intrinsic motivation-based methods [26,27,28]. In [29], an RL method based on a curiosity-driven exploration mechanism was proposed for the time-constrained urban search and rescue path planning problem. This approach measures curiosity by how long a state has gone unvisited, encouraging exploration of that area to reduce the time required for path planning. In [30], an optimized Q-Learning method was proposed to address the issue of agents getting trapped in local optima in unknown environments. This method drives the agent to explore the environment thoroughly through a new action selection strategy and a novel exploration reward. In [31], an RL method guided by the model predictive path integral for exploration and efficient data gathering was proposed. It adopts a task-oriented approach to policy improvement to reduce ineffective exploration. In [32], a self-imitation learning model was proposed for the collaborative monitoring scenario of UAV swarms. By emphasizing valuable historical experiences, this method corrects mistakes and finds the optimal solution more quickly.
These exploration strategies broadly follow two directions: increasing the probability of selecting exploratory actions and enhancing the efficiency of experience utilization. Under dense obstacles and tight time constraints, simply increasing the exploration rate cannot fundamentally solve the problem of a low exploration rate in the target area. For instance, even with a uniform action selection strategy, which is the strategy most likely to explore the environment fully, there is no guarantee that the target area is explored before training ends. Since the exploration rate of the target area is extremely low or even zero, the agent cannot obtain effective experiences, and increasing the efficiency of experience utilization cannot improve the exploration rate. Inspired by “Go-Explore” [33] and human learning processes, this paper proposes a Dual-Phase Trajectory-Constrained (DP-TC) method based on expert knowledge. This method enables agents to deterministically reach and explore specific areas under time constraints. Subsequently, based on DP-TC, a Hierarchical–Exponential Robust-Optimization Exploration Strategy (HEROES) is proposed. This strategy employs a hierarchical exponential decay approach to ensure that agents thoroughly explore the overall environment.
Additionally, the state distribution is crucial for policy evaluation and improvement. However, the commonly used stationary distribution only indicates the long-term characteristics of a system. Certain states may carry high probability in the stationary distribution, yet in systems with a limited number of steps these states might never be visited. If visiting these states could lead to better solutions, then a method based on the stationary distribution will produce suboptimal outcomes. To assess the adequacy of exploration and optimize exploration strategies, a Temporal Adaptive Distribution (TA-Distribution) method is proposed. This method indicates both the short-term and long-term operational characteristics of the RL system.
In summary, this paper proposes a DDQN architecture equipped with an environment resetting strategy and HEROES, aiming to maximize the data gathering rates in urban environments by optimizing multi-UAV paths. The proposed method adapts to changes in environmental parameters, enhances UAV accessibility in dense high-rise building areas, and can be extended to any region. The main contributions of this paper are as follows:
- (1) A definition of target areas with low accessibility in path planning: This study addresses the issue of low accessibility in certain regions, conceptualizing it as the NEP-Problem. The primary cause is that obstacles restrict the agent’s path choices, hindering the development of effective solutions. To our knowledge, this is the first study to rigorously define the obstacle-induced low accessibility problem within the context of RL-based path planning.
- (2) A novel method for policy evaluation and improvement across a variety of temporal scales: By incorporating the long-term and short-term behavioral patterns of agents, the proposed TA-Distribution method enhances the indicative capabilities of the stationary distribution.
- (3) A novel RL-based path planning algorithm: This study introduces a DDQN architecture, enhanced with an environment reset strategy and the HEROES technique, designed to enable UAVs to navigate to any designated target areas while adhering to constraints associated with parameter variability, latency, and computational demands.
This paper is structured as follows: Section 2 provides a detailed description of the NEP-Problem. Section 3 presents the system model, delineates the principles of TA-Distribution and HEROES, and introduces the framework for multi-agent RL; to enhance comprehensibility, it also discusses several simulation experiments. Section 4 introduces the simulation experiments and provides an in-depth discussion and analysis of their results. Section 5 offers the conclusions drawn from the research.
  2. Problem Descriptions
In RL environments, there exists a complex spatial configuration where actions approaching the target point may be obstructed by obstacles, requiring agents to circumvent these obstacles before reaching the target point. In such cases, agents must execute a series of specific actions to reach the target area. The more elongated and narrower the path formed by obstacles, the fewer routes the agent can choose, and the lower the reachability of the agent to that area.
The NEP-Problem refers to the low reachability of specific areas under the aforementioned spatial configuration in RL environments. Compared to open areas, this low reachability prevents the agent from fully exploring and understanding the environment. A typical result, reflected in the UAV path planning efficiency after algorithm training, is a lower information collection rate. In Appendix A, two examples based on real-world scenarios are presented, substantiating the existence of the NEP-Problem in practical environments and illustrating its impact. A typical spatial configuration is shown in Figure 1. The meanings of each element are detailed in Table 1. This section provides a rigorous mathematical description of the NEP-Problem.
The NEP-Problem was identified during the study of UAV path planning based on RL algorithms in urban environments with dynamic conditions. Although RL algorithms possess strong generalization capabilities, we designed an environment reset strategy during the research process, wherein different parameters were selected in each episode. This was carried out to enable the RL algorithm to learn a broader range of parameters and to tackle the challenges posed by dynamic UAV teams and dynamic mission objectives. However, it was observed that within this framework, the agent was unable to sufficiently explore the environment within the limited number of steps, thereby failing to discover potentially optimal solutions. This situation repeatedly occurred in obstacle configurations with similar characteristics. We have described this specific spatial configuration and defined it as follows.
In defined state space 
 where all states are reachable and no obstacles exist, there are 
 validated paths from the start point 
 to the goal point 
. The 
-th path 
 between these points can be represented by a sequence of states as follows:
      where 
 and the element 
 denotes the 
-th state encountered on path 
, situated between 
 and 
. The elements 
 and 
 must satisfy that they can be reached through a single state transition. All possible paths 
 form the following set:
The difficulty in reaching the goal point can be described by the transition probabilities from the start point to the goal point. Within specific path 
, the agent needs to continuously take a series of optimal actions to successfully move from the initial state 
 to the target state 
, passing through multiple states 
. In an obstacle-free environment, the state transition probability from 
 to 
 can be expressed as
      
The single-step state transition probability  is the probability of reaching state  when action  is taken at the -th state  along the -th path. Here,  represents a selectable path, and in scenarios without obstacles ,  denotes the number of elements contained within the set .
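Because the remainder of this section relies on the path set and the path-wise transition probability, it may help to see them written out once. The block below is only one notation that is consistent with the prose; every symbol in it ($N$, $M_n$, $s_{\mathrm{start}}$, $s_{\mathrm{goal}}$, $s_{n,m}$, $a_{n,m}$, $\mathcal{P}$) is an assumption introduced here for illustration and may differ from the notation of the original equations.

```latex
% Assumed notation, for illustration only.

% n-th path as an ordered sequence of states
P_n = \bigl(s_{\mathrm{start}},\, s_{n,1},\, \dots,\, s_{n,M_n},\, s_{\mathrm{goal}}\bigr),
\qquad n = 1, \dots, N

% set of all valid paths
\mathcal{P} = \{P_1, P_2, \dots, P_N\}, \qquad N = |\mathcal{P}|

% total start-to-goal transition probability
\Pr\bigl(s_{\mathrm{start}} \to s_{\mathrm{goal}}\bigr)
  = \sum_{n=1}^{N} \prod_{m=0}^{M_n}
    p\bigl(s_{n,m+1} \mid s_{n,m},\, a_{n,m}\bigr),
\qquad s_{n,0} = s_{\mathrm{start}},\quad s_{n,M_n+1} = s_{\mathrm{goal}}

% Illustrative numbers: if every step succeeds with probability 1/4,
% a path with 4 transitions contributes (1/4)^4 = 1/256,
% and two such paths together contribute 2/256 = 1/128.
```

In this form each path contributes a product of single-step probabilities, so the total shrinks when paths get longer (more factors below 1) and when obstacles remove terms from the sum.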
The computational complexity of Equation (3) increases linearly with the number of available paths  and the number of state transitions . However, considering that an increase in environmental dimensions can lead to an exponential growth in the number of available paths , directly calculating the state transition probabilities for all possible paths may become impractical. To address this computational challenge, two strategies are employed in this study to limit the selection of invalid paths: proximity to the target state and the certainty of reaching the endpoint. Proximity to the target state:  is closer to the goal point than  for . This approach prunes the path tree by reducing paths that deviate from the target, thereby decreasing the computational load. The certainty of reaching the endpoint: when reaching a state  adjacent to , the next transition will definitely move to the target state .
In the context of “Narrow-Elongated”, the concept of “Elongated” is relatively straightforward to understand, referring to the total state transition probability from the start point to the goal point of each selectable path decreasing as the number of transitions, , increases. According to Equation (3), given a constant set of selectable paths , where  and is not perpetually equal to 1, the total state transition probability  will rapidly decrease as  increases.
“Narrow” implies that the number of selectable paths  from the start point to the goal point decreases as the number and location of obstacles increase, which causes the total state transition probability  to decrease with the reduction in . Obstacles can potentially restrict path choices in any environment. Adding regular obstacles  that do not affect the length of the shortest path ensures that  holds; if the obstacles are located in states passed by paths in the set , it certainly leads to . Furthermore, when adding a gathering of narrow obstacles  that do not affect the shortest path length, and when  is the case, it invariably results in ; if the narrow obstacles are located in the states traversed by paths in the set  and impede more path choices, it definitely leads to . Here, the set  and  contains information about the number and locations of regular obstacles and narrow obstacles,  represents the set of paths from the start point to the goal point under a regular-obstacles environment,  denotes the set of paths under a narrow-obstacles environment forming a narrow corridor, and  indicates that the number of obstacle states added in both scenarios is equal.
Figure 2 intuitively displays the impact of the quantity and location of obstacles on the state transition probabilities, helping to understand the meaning of “Narrow” in the context of “Narrow-Elongated”. The starting position is indicated by blue text labeled “start”, and the goal position is indicated by red bold text labeled “goal”. The text within each cell represents the calculated probability distribution, rounded to two decimal places. The simulation uses an episode length of 10 steps and runs for 10,000 episodes to average out randomness. The theoretical calculation results are consistent with the computer simulation outcomes.
 In a 3 × 3 grid environment in which the start and end points themselves are never obstacles, the seven remaining cells give 2^7 = 128 possible obstacle configurations. Using Depth-First Search (DFS) to determine passability from the start to the end point, 51 of these configurations are found to be passable. Further, ensuring that all grids are passable, we tested all 51 obstacle configurations in our study, with more configurations presented in the supplementary experiments in Appendix B. Since the number of available paths 
 is the essential factor affecting the state transition probability, the obstacle configurations depicted in 
Figure 2a–e are characterized by the information of available paths, reflecting all passable situations from a local perspective in real-world scenarios. Based on the mathematical description of the NEP-Problem and the TA-Distribution design in Section 3, we expressed the state transition probability as a state probability distribution, applicable to scenarios of any size.
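The passability count quoted above can be checked with a short enumeration. The sketch below assumes a 3 × 3 grid with 4-connected moves and the start and goal in opposite corners; these are assumptions made for illustration (the function names are hypothetical as well), chosen because they reproduce the 51 passable configurations reported in the text.

```python
from itertools import product


def is_passable(rows, cols, blocked, start, goal):
    """Iterative DFS over 4-connected free cells."""
    stack, seen = [start], {start}
    while stack:
        r, c = stack.pop()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in blocked and nxt not in seen):
                seen.add(nxt)
                stack.append(nxt)
    return False


def count_passable(rows, cols, start, goal):
    """Enumerate every obstacle layout on the cells other than start/goal."""
    cells = [(r, c) for r in range(rows) for c in range(cols)
             if (r, c) not in (start, goal)]
    total = passable = 0
    for mask in product((False, True), repeat=len(cells)):
        blocked = {cell for cell, is_blocked in zip(cells, mask) if is_blocked}
        total += 1
        passable += is_passable(rows, cols, blocked, start, goal)
    return total, passable


total, passable = count_passable(3, 3, (0, 0), (2, 2))
print(total, passable)  # 128 51
```

The enumeration grows as 2^(number of free cells), so it is only practical for small local patches such as this one.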
According to 
Figure 2a–c, it is evident that an increasing number of obstacles restricts the selection of paths. 
Figure 2c,d demonstrate that obstacles in specific locations can further limit path choices, even if there are not many obstacles. 
Figure 2e,f indicate that an increase in the number of obstacles does not further limit the number of path choices. The degree of narrowness of a pathway is fundamentally determined by the number of path choices, which in turn is determined by both the number and location of obstacles. It is not accurate to simply assume that denser obstacles result in lower state transition probabilities. Therefore, with the number of transitions 
 constant (i.e., the path length remains unchanged), fewer selectable paths 
 result in a lower probability of state transitions from the start point to the endpoint 
.
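The effect of the number of path choices on reachability can be reproduced with a small exact calculation. The sketch below is not the procedure used to generate Figure 2; it assumes a uniform random policy over four moves, that a move into a wall or obstacle leaves the agent in place, and that the goal is absorbing. The function name and the two example layouts are hypothetical.

```python
def reach_probability(rows, cols, blocked, start, goal, steps):
    """Probability of having reached `goal` within `steps` moves.

    Assumed dynamics: uniform random choice among four moves; a move
    into the boundary or an obstacle leaves the agent where it is.
    """
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    dist = {start: 1.0}     # probability mass of agents still searching
    reached = 0.0
    for _ in range(steps):
        nxt = {}
        for (r, c), p in dist.items():
            for dr, dc in moves:
                cell = (r + dr, c + dc)
                inside = 0 <= cell[0] < rows and 0 <= cell[1] < cols
                if not inside or cell in blocked:
                    cell = (r, c)          # bounce back
                if cell == goal:
                    reached += p / 4       # absorbed at the goal
                else:
                    nxt[cell] = nxt.get(cell, 0.0) + p / 4
        dist = nxt
    return reached


# Same start, goal, and shortest-path length; only the path choices differ.
open_grid = set()
corridor = {(1, 0), (1, 1), (2, 0), (2, 1)}   # leaves a single route
p_open = reach_probability(3, 3, open_grid, (0, 0), (2, 2), steps=10)
p_corridor = reach_probability(3, 3, corridor, (0, 0), (2, 2), steps=10)
print(round(p_open, 3), round(p_corridor, 3))
```

Under these assumptions the single-corridor layout yields the lower probability even though the shortest path has the same length in both layouts, which matches the reading of “Narrow” given above.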
In general environments without grid-like boundaries limiting state transitions, the probability of reaching the goal from the start point is even lower. 
Section 3.2.1 illustrates a scenario representing a narrow corridor situation. It shows that even with a strategy biased towards moving towards the goal point, and even with larger grid boundary restrictions on state transitions, the probability distribution at the endpoint remains very low, despite the step size being much greater than the optimal path transition count 
.
In summary, the greater the distance from the start point to the endpoint, the more “Elongated” the corridor; the fewer the path choices from the start point to the endpoint, the “Narrower” the corridor. When both these conditions are met, the scenario is referred to as the NEP-Problem. A distinctive feature of this problem is that there must exist some states between which the state transition probabilities are extremely low.
The low reachability caused by the NEP-Problem is the fundamental challenge for UAV information collection path planning. Specifically, low reachability stems from flexibility challenges and sparse reward challenges. The flexibility challenge is due to the obstacle characteristics of urban environments, where high-rise buildings or no-fly zones restrict the three-dimensional spatial flexibility of UAVs. The sparse reward challenge arises from the step and collision penalties within episodes in RL algorithms, which lead to insufficient exploration of certain areas during early training and make it difficult to generate optimal solutions. At the same time, urban service tasks require maximizing data collection rates under limited resources, which calls for identifying the optimal solution. RL algorithms need to explore all potential solutions to determine the best option. Since mission points may be distributed across any area of the city, addressing the NEP-Problem and improving the reachability of target areas is particularly important.
To address the issue of low accessibility, a feasible approach is to leverage human expert knowledge to define an action pattern that deterministically guides the UAV to the target area. However, this approach faces several key challenges: first, human expert knowledge may not be optimal, and the resulting cost could be high; second, reliance on expert knowledge may lead to model overfitting; and finally, the RL algorithm must adapt to the complex mathematical structure of the NEP-Problem. In the following sections, we will provide a detailed description of the system model for path planning and the corresponding solutions.
  4. Simulations and Discussion
Due to environmental factors, potential UAV malfunctions, and strict legal regulations, verifying the effectiveness of algorithms through physical experiments can be challenging. Simulation environments offer a controlled alternative.
  4.1. Simulation Map Construction
The majority of multi-agent research studies conduct simulation tests on custom-built simulation maps without taking into account real-world features. The ultimate objective of this research is to deploy algorithms on UAVs to carry out data gathering missions in real physical environments, for which we employ a grid-based method to model real urban environments. Moreover, the simulation map generated using real maps can function as a standardized testing environment, enabling various research teams to conduct algorithm testing and result verification in an identical map setting.
As grid resolution increases, the state space expands rapidly, resulting in an exponential decline in the computational efficiency of RL algorithms when dealing with large-scale state spaces. To keep training time within a reasonable range while maintaining accuracy, grid resolution must be appropriately managed to avoid setting excessively high resolutions. The specific impact of different grid resolutions on training time is provided in 
Appendix C.
This study extracts real-world map data from Google Maps and OpenStreetMap, including information on buildings and road networks, and supplements building height data using 3D map rendering and satellite imagery. Using nearest neighbor interpolation, we consolidate information on flyable and unflyable structures alongside road data. Because a grid-based map can hold only limited detail per cell and road network information is essential, priority is given to incorporating road information. The grid environment is calibrated to preserve the real-world map features as much as possible while simplifying the map environment. The grid-based simulation map focuses on preserving the passability information that is critical for decision making, rather than retaining the exact shapes of buildings and roads. This type of simulation map is suitable for global path planning at the planning and decision-making stages, rather than local path planning at the control stage. This aligns with our research problem, which is focused on generating appropriate task strategies for UAVs rather than optimizing local UAV trajectories.
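The map construction step can be pictured with the following minimal sketch. It is one possible reading of the description, not the pipeline used in this study: a fine raster of building heights is sampled at each coarse cell centre (nearest neighbor), and any coarse cell that contains a road pixel is kept passable. The function name, the toy rasters, and the 20 m overfly threshold used as the default are illustrative assumptions.

```python
def build_grid(height_m, road, factor, max_overfly_m=20.0):
    """Downsample a fine raster into a coarse passability grid.

    height_m: 2D list of building heights in metres (fine raster)
    road:     2D list of booleans marking road pixels (fine raster)
    factor:   number of fine pixels per coarse cell along each axis
    Returns a 2D list with 0 = passable and 1 = impassable.
    """
    rows, cols = len(height_m) // factor, len(height_m[0]) // factor
    grid = []
    for R in range(rows):
        row = []
        for C in range(cols):
            r0, c0 = R * factor, C * factor
            # Roads take priority: one road pixel keeps the cell passable.
            has_road = any(road[r][c]
                           for r in range(r0, r0 + factor)
                           for c in range(c0, c0 + factor))
            # Nearest-neighbor sample at the coarse cell centre.
            centre_h = height_m[r0 + factor // 2][c0 + factor // 2]
            row.append(0 if has_road or centre_h <= max_overfly_m else 1)
        grid.append(row)
    return grid


height_m = [
    [30, 30, 5, 5],
    [30, 30, 5, 5],
    [0, 0, 40, 40],
    [0, 40, 40, 40],
]
road = [
    [False, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
grid = build_grid(height_m, road, factor=2)
print(grid)  # [[1, 0], [0, 1]]
```

The resulting grid keeps only passability, which is the information the global planner needs, and discards the exact building footprints.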
  4.2. Simulation Setup
This study conducts simulations on two distinct maps. The first is the “Midtown Manhattan” map, characterized by a high density of high-rise buildings, which exacerbates the NEP-Problem. The second is the “Downtown Los Angeles” map, based on the northwestern region of downtown Los Angeles, where high-rise buildings are less prevalent but the map size is larger.
The grid map uses a cell size of 
, and UAVs fly at a constant height of 
 over urban streets, unable to fly over urban buildings taller than 20 m. Data gathering is only possible when UAVs enter within 45 m of the center of a task point, and their line of sight may be obstructed by buildings. Each task slot 
 includes 
 video slots. Before each episode starts, the number of UAVs to be deployed, their remaining energy and initial positions, the number and positions of task points, and the data volume to be gathered are randomly determined. The specific parameter settings are provided in Section 4.3 and Section 4.4.
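The data gathering condition stated above (within 45 m of the task point centre and with an unobstructed line of sight) can be sketched as follows. The line-of-sight test here uses Bresenham's line on the grid, which is an assumption; the text does not say how occlusion is computed. The function names and the 15 m cell size in the usage lines are hypothetical.

```python
import math


def cells_on_segment(a, b):
    """Grid cells visited by Bresenham's line from cell a to cell b."""
    (r0, c0), (r1, c1) = a, b
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    sr = 1 if r1 >= r0 else -1
    sc = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    cells = [(r, c)]
    while (r, c) != (r1, c1):
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += sr
        if e2 < dr:
            err += dr
            c += sc
        cells.append((r, c))
    return cells


def can_gather(uav_cell, task_cell, buildings, cell_size_m, max_range_m=45.0):
    """True if the task point is in range and no building cell lies between."""
    dist_m = math.hypot(uav_cell[0] - task_cell[0],
                        uav_cell[1] - task_cell[1]) * cell_size_m
    if dist_m > max_range_m:
        return False
    between = cells_on_segment(uav_cell, task_cell)[1:-1]
    return all(cell not in buildings for cell in between)


# Hypothetical 15 m cells: three cells away is exactly 45 m.
print(can_gather((0, 0), (0, 3), set(), cell_size_m=15.0))
print(can_gather((0, 0), (0, 3), {(0, 1)}, cell_size_m=15.0))
```

Shrinking `max_range_m` forces the UAV closer to the task point, which is the kind of parameter change the paper mentions for adapting the method to other task types.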
The simulations are conducted on a computer equipped with an NVIDIA GeForce RTX 3060 Ti GPU and an Intel Core i5-12500 CPU, requiring 80 h to complete 2,000,000 steps.
  4.3. “Midtown Manhattan” Scenario
The scenario in 
Figure 10 is established based on Midtown Manhattan, representing the region bounded by four streets: Park Avenue, 7th Ave, W 55th St, and W 47th St. This region includes a large number of regular urban buildings, regular grid-like roads, and two relatively wider roads. It is noteworthy that the impassable high-rise buildings on the map are very dense, posing a severe NEP-Problem for the UAVs.
In this set of simulations, we established a grid with  cells, resulting in a total size of . The number of UAVs set at , the initial energy for each UAV set at , and the number of task points set at  were randomly distributed. The data gathering duration at each task point was set at . Nine starting positions were designated. To further account for the similar energy consumption across different flight states and uncertainties such as wind speed and direction, the UAV’s initial energy selection model  is designed to address these issues. For example, providing a narrow energy selection range under fixed energy conditions can better accommodate such uncertainties. Additionally, to ensure safety during the actual deployment and compliance with legal regulations, UAVs should retain a certain amount of residual energy to guarantee a safe return. The initial energy selection range can be determined based on the required residual energy. The east side of the “Midtown Manhattan” map had a denser distribution of task points, while the west side, characterized by densely packed buildings, had fewer task points.
With changes in the number of UAVs, 
, while the other parameters remained constant, the simulation results for the “Midtown Manhattan” scenario are illustrated in 
Figure 10.
In 
Figure 10a, the agent opted to collect information in the eastern part of the map where task points were densely located, without overly focusing on the eastern area, demonstrating that the robust optimization post-exploration strategy does not excessively concentrate on specific areas. 
Figure 10b shows the trajectory of the first agent remaining largely unchanged, while the newly added second agent collected data from task points in the dense western building area, successfully navigating through the narrow extended path formed by numerous high-rise buildings and returning without focusing on the remote blue task points. In 
Figure 10c, the introduction of a third agent allowed for data gathering from a remote device in the outlying area of the map, with the three agents effectively dividing the map into three sectors to maximize data gathering efficiency. The simulation results demonstrate the effectiveness of the robustly optimized HEROES based on the DP-TC method, while the DDQN path planning algorithm proved adaptable to varying scene parameters.
This study allows for parameter adjustments to suit the decision-making phase of different tasks, such as traffic information collection, surveillance, last-mile parcel delivery, and data collection services during the maintenance of digital twin models in smart cities. Specifically, if the simulation settings are the same as those in this study, the method can accomplish traffic information collection tasks. By increasing the amount of information at mission points and adjusting their weight in the reward function, UAVs can be made to remain at mission points for longer periods, thus completing surveillance tasks. By reducing the communication range of mission points and appropriately adjusting the reward function, UAVs can be guided to approach mission points more closely, simulating the process of approaching and descending to complete parcel delivery tasks. By adjusting the location of mission points (e.g., from roads to buildings or equipment), the algorithm will plan appropriate paths for UAVs to complete data collection tasks for digital twin models.
  4.4. “Downtown Los Angeles” Scenario
The scenario depicted in 
Figure 11 is based on the northwestern region of downtown Los Angeles. Despite numerous flyable buildings along the map periphery, UAVs are required to navigate through dense high-rise building areas to reach these regions. This scenario features a larger map size and more widely distributed task points.
For this scenario, we established a grid with  cells, resulting in a total size of . The number of UAVs set at , the energy for each UAV set at , and the number of task points set at  were randomly distributed. The data gathering duration at each task point was set at . Nine starting positions were designated.
With the number of UAVs, 
, varying while the other parameters remained constant, the simulation results for the “Downtown Los Angeles” scenario are shown in 
Figure 11.
In 
Figure 11a, the agent navigated through areas with dense high-rise buildings, collecting data from numerous task points and disregarding closer but fewer task points on the east side. 
Figure 11b, with an increased number of agents, shows the first agent altering its path to maximize the team data gathering efficiency. Due to flight duration constraints, the two agents did not opt to collect data from the southeastern gray task points. In 
Figure 11c, three agents similarly divided the map into three sections to perform data gathering, maximizing coverage while avoiding overlap in the data gathering areas.
  4.5. System Performance Analysis and Discussion
This section compares the training processes and outcomes of path planning algorithms that integrate three exploration strategies: ε-Greedy, Go-Explore, and HEROES. It also analyzes the necessity of the proposed exploration strategies and validates their effectiveness.
In multi-UAV data gathering path planning, the performance of algorithms can be assessed through three metrics: the data gathering rate, the safe landing rate, and the cumulative reward. In this study, maximizing the data gathering rate is paramount, but achieving safe landings in urban environments is also crucial. The data gathering rate is the ratio of the total data gathered at the end of a task to the total amount of data available at the beginning, indicating the agent’s ability to cover task points and gather information. The safe landing rate records whether all agents have landed successfully and on time by the end of an episode, indicating the agents’ capability to return and land safely. Cumulative rewards provide a comprehensive assessment of the system’s learning performance.
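The first two metrics follow directly from their definitions and can be computed as in the sketch below. The function names and the input layout (per-task data volumes, per-episode lists of landing flags) are assumptions made for illustration.

```python
def data_gathering_rate(initial_data, remaining_data):
    """Share of the initially available data that was gathered."""
    total = sum(initial_data)
    if total <= 0:
        return 0.0
    return (total - sum(remaining_data)) / total


def safe_landing_rate(episodes):
    """Share of episodes in which every UAV landed in time.

    episodes: list of per-episode lists of booleans, one flag per UAV.
    """
    return sum(all(flags) for flags in episodes) / len(episodes)


rate = data_gathering_rate([10, 20, 10], [0, 5, 5])
landing = safe_landing_rate([[True, True], [True, False], [True], [True, True, True]])
print(rate, landing)  # 0.75 0.75
```

Note that an episode counts as a safe landing only if all of its agents land, so a single late UAV lowers the rate for the whole team.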
Figure 12 compares the DDQN training processes using three different exploration strategies in both the Manhattan and Los Angeles scenarios. Each training period is defined as 200,000 steps. For each training period, the solid lines depict the average metrics and the shaded regions represent the 95% quantiles of the metrics.
 It is evident that there is no significant difference between the training processes of Go-Explore and ε-Greedy exploration in the two scenarios. One potential reason is that Go-Explore only returns to the edge of previously explored areas, without directly reaching target areas within unexplored zones. A second possible reason is that, upon reaching the edge of explored areas, the strategy employed is random exploration, which inherently does not address the problem described in Equation (23).
Furthermore, adequately trained RL path planning strategies need to be evaluated. Using the Monte Carlo analysis method, the system learning situation is assessed by re-selecting random parameters and running 
 episodes to eliminate randomness. The overall average performance metrics for data gathering rates and successful landing rates in the “Midtown Manhattan” and “Downtown Los Angeles” scenarios are presented in 
Table 2 and 
Table 3.
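A Monte Carlo evaluation of this kind can be sketched as below. The callables `sample_scenario` and `run_episode` are hypothetical placeholders for the scenario generator and the trained policy rollout, and the toy stand-ins exist only so that the sketch runs; their numbers carry no meaning for the reported results.

```python
import random
from statistics import fmean


def monte_carlo_evaluate(run_episode, sample_scenario, n_episodes, seed=0):
    """Average data gathering rate and safe landing rate over random scenarios.

    run_episode(scenario) -> (data_gathering_rate, all_agents_landed)
    sample_scenario(rng)  -> scenario parameters drawn at random
    Both callables are assumptions standing in for the real simulator and policy.
    """
    rng = random.Random(seed)  # fixed seed keeps the evaluation reproducible
    rates, landings = [], []
    for _ in range(n_episodes):
        scenario = sample_scenario(rng)
        rate, landed = run_episode(scenario)
        rates.append(rate)
        landings.append(1.0 if landed else 0.0)
    return fmean(rates), fmean(landings)


# Toy stand-ins (purely illustrative).
def toy_sample_scenario(rng):
    return {"n_uavs": rng.randint(1, 3), "energy": rng.randint(50, 150)}


def toy_run_episode(scenario):
    rate = min(1.0, scenario["n_uavs"] * scenario["energy"] / 300.0)
    return rate, scenario["energy"] >= 60


if __name__ == "__main__":
    print(monte_carlo_evaluate(toy_run_episode, toy_sample_scenario, 1000))
```

Averaging over many independently sampled scenarios is what removes the dependence on any single random draw of task points, start positions, or energy budgets.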
The data gathering rate fell short of 100% in certain scenarios characterized by a large number of task points and a high data volume, coupled with a small number of UAVs and a reduced energy budget. In such configurations, the UAVs may be unable to gather all of the available data.
Figure 10 and 
Figure 11 demonstrate the effective collaboration between agents and the system’s stability across different numbers of agents, reflecting excellent parameter generalization capabilities. However, as the number of agents increases and the state space expands, system complexity will grow exponentially. To address large-scale and complex coordination scenarios, further algorithm optimization, the introduction of more efficient training strategies, and the design of a more robust system architecture are still required.
 Through Monte Carlo simulations, the path planning performance of DDQN algorithms incorporating five distinct exploration strategies is evaluated in the “NEP” scenario. The overall average performance metrics for data gathering rates and successful landing rates are presented in 
Table 4. “Phase-A” represents the method that explores based solely on the A* algorithm without relying on any return policy.
Due to the low pass-through capability of the ε-Greedy exploration strategy in narrow corridors, the algorithm ultimately focuses only on data gathering tasks in open areas, as narrow pathways remain underexplored. The improvement of Go-Explore over ε-Greedy is minimal because it still introduces random action perturbations and lacks a clear path to the task points. The “select and reach” method based on Phase-A lacks a return strategy, preventing the system from learning effective and complete solutions. The exploration strategy based on the DP-TC method overemphasizes the characteristics of task points near narrow corridors, causing the algorithm to overlook task points in easily accessible areas. The HEROES strategy, which is based on the robust optimization of the DP-TC method, uses a dual exponential decay mechanism to fully explore the environment and balance exploration and exploitation, proving to be the comparatively better solution.
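For reference, the baseline ε-Greedy rule discussed above can be sketched as follows. The exponential decay schedule and its constants are assumptions chosen for illustration; this is the generic baseline, not the dual exponential decay mechanism of HEROES, whose details are not restated in this section.

```python
import math
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Exponentially decayed exploration rate (schedule and constants are assumed)."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay_steps)


def epsilon_greedy_action(q_values, step, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

One intuition for the low pass-through capability: under uniformly random choice over |A| actions, a specific sequence of k consecutive moves is executed with probability |A|^(-k), so long narrow corridors are rarely traversed by random perturbation alone.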
All algorithms demonstrate a high rate of safe landings, primarily because the simulation imposes strict penalties for failing to land at the designated take-off and landing points. This observation suggests that the NEP problem does not compromise the algorithms’ ability to generate complete path outputs. Instead, it primarily impedes the discovery of optimal solutions, thereby affecting the overall quality of path planning, as evidenced by the total information collection rate. For instance, both the ε-Greedy algorithm and the “select and reach” method based on Phase-A are capable of producing complete take-off–collection–return paths, yet their information collection rates are comparatively lower.
Table 5 and 
Table 6 provide the training parameters, where each iteration represents the decision-making process of the agent at each step. 
Table 7 provides the parameters for the time required by the trained path planning model to make decisions, where each iteration represents the process of making a complete path planning decision for a single scenario. As shown in 
Table 5 and 
Table 6, the proposed HEROES requires more training resources compared to the traditional 
ε-Greedy strategy. However, there is little difference between the HEROES and Go-Explore in terms of computation time and resource consumption.
 As shown in 
Table 7, the HEROES can generate a reasonable path planning decision for a random scenario in an average of 3 s, which is entirely acceptable during the mission planning phase. After training, the proposed HEROES shows no significant difference in the path planning decision computation time for random scenarios when compared to the 
ε-Greedy and Go-Explore strategies within the DDQN framework.
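Decision times of this kind could be measured with a simple wall-clock harness such as the one below. This is only a sketch: `plan_fn` is a hypothetical callable that produces one complete path planning decision for a scenario, and the actual measurement procedure behind Table 7 is not described in the text.

```python
import time
from statistics import fmean


def mean_decision_time(plan_fn, scenarios):
    """Average wall-clock seconds needed for one complete planning decision."""
    durations = []
    for scenario in scenarios:
        start = time.perf_counter()
        plan_fn(scenario)  # hypothetical: full path planning for one scenario
        durations.append(time.perf_counter() - start)
    return fmean(durations)
```

Running the same harness with each exploration strategy's trained model on an identical list of random scenarios would give directly comparable averages.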
  4.6. Challenges in Practical Deployment and Directions for Future Research
The deployment of these algorithms in real urban environments presents several critical challenges. First, the uncertainty in flight costs, arising from complex weather conditions such as variable wind direction and speed, results in unpredictable energy consumption. A feasible solution involves the development of robust strategies that integrate both historical and forecasted weather data, coupled with energy redundancy measures. Second, airspace conflicts constitute an inherent risk in real-world operations, necessitating the careful planning of flight zones or the implementation of real-time air traffic management systems. Third, while this study primarily focuses on decision-making processes, the integration of local path planning algorithms is crucial for practical deployment. These algorithms should account for environment-specific behavioral models and task patterns, such as the increased energy consumption during parcel descent or the need for circling to avoid line-of-sight occlusion in surveillance tasks. Finally, legal, regulatory, and safety considerations are imperative, particularly in urban environments with stringent UAV operation regulations. Although this study has minimized the risk of unintended landings, the deployment phase requires the application of appropriate auto-landing algorithms to mitigate potential safety hazards.
This study primarily focuses on global decision making and considers common obstacle scenarios, providing appropriate solutions. For other obstacles, particularly those that are difficult to detect, such as kite strings or power lines, adjustments can be made using corresponding local algorithms. If local obstacle avoidance algorithms are introduced, these algorithms will perform path planning when encountering dynamic obstacles and generate relevant features. The global decision-making algorithm should capture these features during training to maximize overall task efficiency.
In future research, to address dynamic obstacles, uncertain flight costs due to variable weather, and real-time communication constraints, local path planning or trajectory optimization algorithms should be developed, with their features integrated into the global decision-making algorithm to achieve a comprehensive multi-UAV urban service path planning framework. By introducing more accurate flight cost maps and incorporating them into the DDQN framework, the algorithm’s adaptability to different weather conditions and the reliability of path planning can be enhanced. Additionally, considering practical resources (which can support real-time communication for a limited number of UAVs) and legal regulations (which require the real-time monitoring of UAV operational status), designing multi-agent path planning models based on real-time communication will be more suitable for engineering deployment.