Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions
Highlights
- Proposed a novel DQN-A* hybrid algorithm with a multi-stage obstacle-identification strategy for unmanned aerial vehicle (UAV) coverage path planning (CPP).
- Established a theoretical framework that derives a recommended range for UAV positional identifiers based on Z-score feature normalization.
- The dual-driven reward mechanism and A* integration enhance dynamic obstacle avoidance, target detection probability, and overall search efficiency in unknown environments.
- The derived range accelerates model convergence in grid-based CPP tasks driven by deep reinforcement learning.
Abstract
1. Introduction
- Optimization of spatial state representation. This paper derives a principled range for the UAV positional identifier under Z-score-normalized state encoding and validates its effect on convergence through simulation experiments.
- Hybrid CPP architecture with a multi-stage obstacle-identification strategy. We construct the DQN-A* algorithm by combining the global decision-making capability of DQN under environmental dynamics with the efficient local optimal-path search of the A* algorithm. The hybrid framework employs a multi-stage obstacle-identification strategy to avoid previously unknown obstacles while accomplishing CPP.
- Probability–time dual-driven reward mechanism. A novel reward system integrates a probability-weighted reward (based on spatial target likelihood distributions) and a step-dependent reward (penalizing prolonged search durations). This dual mechanism prioritizes exploration of high-probability zones while minimizing path redundancy, thereby enhancing search efficiency.
2. Related Work
2.1. Offline CPP
2.2. Online CPP
3. Problem Formulation
3.1. Detection Field of the UAV
3.2. Search Area Discretization
3.3. Search Scenario Description
4. Model Construction
4.1. Action Space
4.2. State Space
- Maximizing UAV Salience (Lower Bound of x): The UAV position must be a statistical outlier so that it remains a distinguishable feature in the input to the neural network. We require the Z-score difference between the UAV cell and the background (value 1) to exceed a significance threshold chosen for robust outlier detection, and this requirement yields the lower bound of x. Note that for large N the lower bound approaches unity and therefore lies outside the salient-outlier regime in which the underlying approximation holds; it should thus be read as a conservative constraint preventing x from degenerating to the background value rather than as a tight analytical bound. The upper bound, by contrast, is obtained well within the validity regime of the approximation.
- Preserving Background Contrast (Upper Bound of x): The network must distinguish between uncovered (0) and covered (1) areas. If x is too large, the state variance increases, compressing the normalized difference between 0 and 1 below an empirically determined gradient sensitivity threshold. This requirement yields the upper bound of x.
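As a worked illustration of these two competing requirements, consider a hypothetical state with N cells, of which k are covered (value 1), one holds the UAV (value x), and the rest are uncovered (value 0). The symbols k, \Delta_{\mathrm{UAV}}, and \Delta_{\mathrm{bg}} are introduced here for illustration only and are not taken from the paper. The first two lines are exact for this assumed encoding with population statistics; the last line is an approximation valid when x^2 \gg k.

```latex
\begin{aligned}
\mu &= \frac{k + x}{N}, \qquad
\sigma^2 = \frac{k + x^2}{N} - \mu^2,\\
\Delta_{\mathrm{UAV}} &= \frac{x - 1}{\sigma}, \qquad
\Delta_{\mathrm{bg}} = \frac{1 - 0}{\sigma} = \frac{1}{\sigma},\\
x^2 \gg k:\quad \sigma &\approx \frac{x\sqrt{N-1}}{N}
\;\Rightarrow\;
\Delta_{\mathrm{UAV}} \to \frac{N}{\sqrt{N-1}}, \qquad
\Delta_{\mathrm{bg}} \approx \frac{N}{x\sqrt{N-1}}.
\end{aligned}
```

Under these assumptions the UAV gap saturates near \sqrt{N} as x grows, whereas the background gap decays like 1/x. This is consistent with bounding x from above to preserve the contrast between covered and uncovered cells.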
4.3. Reward Design
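A minimal sketch of how a probability–time dual-driven reward could be composed is given below. The functional form, the names `RewardConfig` and `dual_driven_reward`, and the value of `step_penalty` are assumptions made for illustration. Only the scaling factor (500) and the step count factor (K = 10) are taken from the parameter table, and the paper's actual formula may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    scale: float = 500.0       # reward scaling factor from the parameter table
    step_factor: int = 10      # step count factor K from the parameter table
    step_penalty: float = 1.0  # hypothetical value; the paper's penalty factor is not recoverable here


def dual_driven_reward(cell_probability: float, newly_covered: bool,
                       step: int, cfg: RewardConfig = RewardConfig()) -> float:
    """Assumed form: probability-weighted gain minus a step-dependent penalty."""
    # Gain only when the visited cell was not covered before.
    gain = cfg.scale * cell_probability if newly_covered else 0.0
    # Penalty grows with elapsed steps, discouraging prolonged searches.
    penalty = cfg.step_penalty * (step / cfg.step_factor)
    return gain - penalty
```

With this form, revisiting a covered cell always yields a negative reward after the first step, and the same high-probability cell is worth less the later it is reached.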
4.4. Model Parameter Update
4.5. Integrated Framework of DQN and A* Algorithm
| Algorithm 1 DQN-A* CPP Algorithm |
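For readers unfamiliar with the local planner, the following is a minimal sketch of textbook A* on a 4-connected grid. It is not the paper's Algorithm 1. The obstacle encoding (1 = blocked), the unit step cost, and the function name `astar` are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on a grid where 1 marks an obstacle; None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(c: Cell) -> int:
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    best_g: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > best_g.get(current, float("inf")):
            continue  # stale heap entry, a cheaper route was already expanded
        r, c = current
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    came_from[(nr, nc)] = current
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None
```

In a hybrid scheme of the kind described here, such a routine would be called only when the learned policy's next move is blocked, returning a detour to the next free target cell.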
4.6. Hamiltonian Path Pre-Training
4.6.1. Path Construction
| Algorithm 2 Weight-Guided Hamiltonian Path Construction |
4.6.2. Replay Buffer Pre-Filling
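The idea of seeding the replay buffer with demonstrations from a Hamiltonian path can be sketched as follows. This sketch swaps in a plain serpentine (boustrophedon) sweep, which is one simple Hamiltonian path on a rectangular grid, and is not the weight-guided construction of Algorithm 2. The action encoding and the reduced transition tuple (no state grid, reward, or done flag) are simplifications assumed for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Cell = Tuple[int, int]
# (cell, action, next_cell); actions are 0=up, 1=down, 2=left, 3=right (hypothetical encoding)
Transition = Tuple[Cell, int, Cell]
ACTIONS = {(-1, 0): 0, (1, 0): 1, (0, -1): 2, (0, 1): 3}


def serpentine_path(rows: int, cols: int) -> List[Cell]:
    """Row-by-row boustrophedon sweep: visits every cell exactly once with unit moves."""
    path: List[Cell] = []
    for r in range(rows):
        # Even rows run left to right, odd rows right to left.
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in cs)
    return path


def prefill(buffer: Deque[Transition], path: List[Cell]) -> None:
    """Append one demonstration transition per consecutive pair of path cells."""
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        buffer.append(((r0, c0), ACTIONS[(r1 - r0, c1 - c0)], (r1, c1)))


# Buffer capacity follows the parameter table (100,000).
replay_buffer: Deque[Transition] = deque(maxlen=100_000)
prefill(replay_buffer, serpentine_path(5, 5))
```

A full implementation would store the complete state encoding and the reward for each transition so that the demonstrations can be sampled alongside the agent's own experience.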
5. Computer Simulations
5.1. Parameter Settings
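The parameter table lists the values 1.0 and 0.05 next to a count of 15,000 linear-decay updates. Reading these as the initial and final exploration rates of an ε-greedy policy is an assumption, since the row labels were lost. Under that reading, the schedule could be sketched as below, where the function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(update: int, start: float = 1.0, end: float = 0.05,
                   decay_updates: int = 15_000) -> float:
    """Linearly interpolate from start to end over decay_updates, then hold at end."""
    # Clamp the progress fraction to [0, 1] so the rate never overshoots.
    fraction = min(max(update, 0) / decay_updates, 1.0)
    return start + fraction * (end - start)
```

With 17,000 total gradient updates, this schedule would leave the last 2,000 updates at the final exploration rate.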
5.2. Analysis of UAV Position Identifier Impacts on Training Effectiveness
- Insufficient Salience Regime (x below the lower bound): As shown in Figure 7a, when x takes values of 1 or 2, the model exhibits sluggish convergence and pronounced variance. Theoretically, this occurs because x falls below the lower bound, leaving the normalized UAV cell indistinguishable from the covered background regions (labeled “1”).
- Optimal Balance Regime (x within the theoretical bounds): As depicted in Figure 7b, marker values within the theoretical bounds demonstrate rapid, stable, and nearly identical convergence trajectories. In this regime, the identifier is large enough to highlight the UAV’s position without excessively inflating the global variance.
- Background Contrast Degradation Regime (x above the upper bound): Figure 7c reveals a gradual decline in convergence speed as x increases beyond the upper bound. An excessively large x notably inflates the state variance, compressing the normalized difference between uncovered (“0”) and covered (“1”) areas below the network’s gradient sensitivity threshold, thereby degrading the agent’s spatial awareness.
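The trend across the three regimes can be reproduced numerically with a small sketch. It assumes a flattened 25-cell state with 10 covered cells and a single UAV cell; these numbers and the helper name `normalized_gaps` are hypothetical and not taken from the experiments.

```python
from typing import Tuple

import numpy as np


def normalized_gaps(n_cells: int, n_covered: int, x: float) -> Tuple[float, float]:
    """Return (UAV-vs-covered gap, covered-vs-uncovered gap) after Z-score normalization."""
    state = np.zeros(n_cells)   # uncovered cells are 0
    state[:n_covered] = 1.0     # covered cells are 1
    state[n_covered] = x        # single UAV cell (hypothetical placement)
    sigma = state.std()         # population standard deviation
    # The mean cancels when two Z-scores are subtracted.
    return float((x - 1.0) / sigma), float(1.0 / sigma)


for x in (1, 2, 5, 10, 50):
    uav_gap, bg_gap = normalized_gaps(n_cells=25, n_covered=10, x=x)
    print(f"x={x:>3}  UAV gap={uav_gap:5.2f}  background gap={bg_gap:5.2f}")
```

In this toy setting the UAV gap is zero at x = 1 and grows toward a plateau, while the background gap shrinks steadily as x increases. This mirrors the qualitative behavior described for Figure 7.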
Learning-Rate Sensitivity Analysis
5.3. Ablation Study
5.3.1. Effect of Hamiltonian Path Pre-Training
5.3.2. Effect of the Dual-Driven Reward Mechanism
5.3.3. Effect of the A* Local Planner
5.3.4. Single-Channel Versus Multi-Channel State Representation
5.3.5. Ablation on Action Space Dimensions (4-Direction vs. 8-Direction)
5.4. Comparison with Baselines
- Overestimation as an Exploration Bonus: In standard RL tasks, the inherent overestimation bias of the vanilla DQN is typically considered a theoretical flaw. However, in a search task driven by unvisited regions, this bias effectively acts as an optimistic initialization. It provides a strong intrinsic exploration bonus, driving the UAV to rapidly explore unvisited high-probability regions. Conversely, the more accurate and conservative value estimates of DDQN suppress this initial exploratory momentum.
- Action Sensitivity in Grid Coverage: Dueling DQN is uniquely designed to decouple state-value and advantage estimations, excelling in environments with numerous “action-irrelevant” states. Nevertheless, in the dense grid-based CPP task, every discrete action strictly dictates the coverage progress, leaving virtually no action-irrelevant states. Consequently, the dual-stream architecture of Dueling DQN introduces optimization overhead during the early training phase rather than accelerating it.
- Synergy with the Decoupled A* Architecture: Because the decoupled A* algorithm serves as a robust, real-time safety net guaranteeing collision-free obstacle avoidance, the global DRL planner is shielded from local deadlock failures. Consequently, using the highly exploratory vanilla DQN as the global planner, backed by the strict A* algorithm, achieves a favorable empirical balance between learning efficiency and operational robustness.
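The difference between the two bootstrapped targets discussed above can be made concrete with a short sketch. The Q-value arrays are random placeholders, and terminal-state masking is omitted for brevity. The function names are hypothetical, and γ = 0.95 follows the parameter table.

```python
import numpy as np


def dqn_target(reward, gamma, q_target_next):
    """Vanilla DQN: bootstrap from the target network's own maximum."""
    return reward + gamma * q_target_next.max(axis=1)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best = q_online_next.argmax(axis=1)
    return reward + gamma * q_target_next[np.arange(len(best)), best]


rng = np.random.default_rng(0)
q_online_next = rng.normal(size=(4, 4))  # hypothetical Q-values: 4 states x 4 actions
q_target_next = rng.normal(size=(4, 4))
reward = np.zeros(4)

vanilla = dqn_target(reward, 0.95, q_target_next)
double = double_dqn_target(reward, 0.95, q_online_next, q_target_next)
```

For the same rewards and target-network values, the Double DQN target can never exceed the vanilla target, because the value of the action chosen by the online network is at most the target network's own maximum. That gap is the optimism the discussion above interprets as an exploration bonus.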
5.5. Generalization Analysis
5.6. Scalability Analysis Across Grid Sizes
5.7. Generalization to Multimodal Probability Distributions
5.8. High-Fidelity Physics-Engine Validation
6. Conclusions
- This study systematically investigates, through both theoretical derivation and empirical validation, how the UAV positional label in grid-based maps affects model training effectiveness. The resulting recommended range for the positional identifier provides guidance for designing DRL-based path planning algorithms in robot coverage tasks across grid-mapped environments.
- The proposed DQN-A* couples a DQN global planner with an A* local controller through a multi-stage “0-x-1” obstacle-identification rule, enabling reliable avoidance of previously unknown obstacles while preserving complete coverage—a capability absent from the standalone DQN baseline.
- A novel Probability–Time Dual-Driven Reward Mechanism is designed within the DQN-A* algorithm. This mechanism prioritizes exploration of high-probability regions, thus achieving higher target detection probability than benchmark algorithms.
Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| UAV | Unmanned Aerial Vehicle |
| CPP | Coverage Path Planning |
| DRL | Deep Reinforcement Learning |
| DQN | Deep Q-Network |
| MDP | Markov Decision Process |
| RL | Reinforcement Learning |
| RRT* | Rapidly-exploring Random Tree star |
| TSP | Traveling Salesman Problem |
| PPO | Proximal Policy Optimization |
| DDPG | Deep Deterministic Policy Gradient |
| DDQN | Double DQN |
| TD3 | Twin Delayed DDPG |
| SAC | Soft Actor-Critic |
| PID | Proportional–Integral–Derivative |
| MPC | Model Predictive Control |
| FOV | Field of View |

| Parameters | Value |
|---|---|
| Total gradient updates | 17,000 |
| Parallel environments | 1024 |
| Discount factor | 0.95 |
| Learning rate | |
| Replay buffer size | 100,000 |
| Batch size | 512 |
| | 1.0 |
| | 0.05 |
| Linear-decay updates | 15,000 |
| Target network sync period (updates) | 500 |
| Reward scaling factor | 500 |
| Penalty factor | |
| Step count factor (K) | 10 |
| Max episode step budget | 200 |
| Out-of-bounds penalty | |
| Obstacle-collision penalty | |
| Completion bonus base | 50 |
| Timeout penalty coefficient | 20 |
| Flight energy coefficient | |
| Turn energy coefficient | |

| Configuration | Coverage Ratio (%) | Mean Turns |
|---|---|---|
| Width sweep (fixed depth) | | |
| Width 64 | | |
| Width 128 | | |
| Width 256 (default) | | |
| Width 512 | | |
| Depth sweep (fixed width) | | |
| Depth 1 (default) | | |
| Depth 2 | | |
| Depth 3 | | |

| Learning Rate/Schedule | Coverage Ratio (%) | Mean Episode Length (Steps) |
|---|---|---|
| (default) | | |
| Cosine annealing | | |
| Step decay | | |

| Algorithms | Hp | Ddrm | A* |
|---|---|---|---|
| DQN-A* | √ | √ | √ |
| w/o Hp | × | √ | √ |
| w/o Ddrm | √ | × | √ |
| w/o A* | √ | √ | × |

| Variant | Architecture | Parameters | Relative Size |
|---|---|---|---|
| Single-channel | MLP (256-unit hidden) | 7684 | |
| Multi-channel–MLP | MLP (256-unit hidden) | | |
| Multi-channel–CNN | Conv(16)→Conv(32)→FC(64) | | |

| Grid Size | Action Space | Coverage Ratio (%) | Mean Turns |
|---|---|---|---|
| | 4-Direction (Ours) | 97.7 | 11.7 |
| | 8-Direction | 98.5 | 18.8 |
| | 4-Direction (Ours) | 95.6 | 53.9 |
| | 8-Direction | 92.8 | 65.4 |

| Grid Size | N | Coverage Ratio (%) |
|---|---|---|
| | 25 | |
| | 64 | |
| | 100 | |

| Distribution | Coverage Ratio (%) | Mean Steps |
|---|---|---|
| Unimodal | | |
| Bimodal | | |
| Trimodal | | |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yuan, H.; Yan, S.; Liu, Z.; Wang, S.; Wang, Q.; Chen, G. Optimizing Spatial State Representation in Reinforcement Learning for Coverage Path Planning in UAV Search Missions. Drones 2026, 10, 442. https://doi.org/10.3390/drones10060442

