Reinforcement Learning-Guided Particle Swarm Optimization for Multi-Objective Unmanned Aerial Vehicle Path Planning
Abstract
1. Introduction
- (1) Global path guidance: Q-learning can learn global navigation strategies by exploring the environment systematically, providing valuable guidance information about promising regions and paths to avoid [24]. This learned knowledge can significantly improve the initialization quality and search direction of population-based optimization algorithms.
- (2) Experience-based decision making: Unlike traditional random search methods, Q-learning builds a value function that captures the long-term reward expectations for different state–action pairs [25]; a minimal sketch of this tabular update follows this list. This knowledge can be transferred to guide the search process of metaheuristic algorithms, focusing exploration on high-potential regions.
- (3) Adaptive learning capability: Q-learning can continuously update its policy based on environmental feedback, making it particularly suitable for dynamic environments where conditions may change during mission execution [26].
- (4) Discrete-continuous space bridging: Q-learning naturally handles discrete state spaces, which can be effectively mapped to continuous optimization domains used by metaheuristic algorithms [27]. This bridging capability enables the integration of learned discrete policies with continuous path optimization.
- (5) Multi-scale optimization support: Q-learning can operate at different abstraction levels, learning both coarse-grained strategic policies and fine-grained tactical decisions [28]. This multi-scale capability complements the hierarchical nature of UAV path planning problems.
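To make the value-function idea in point (2) concrete, the following is a minimal sketch of a tabular Q-learning update; the state encoding and reward signal are illustrative assumptions, while the hyperparameter values and the 10 × 10 grid mirror the Q-learning settings reported in the experimental setup.

```python
import numpy as np

# Minimal tabular Q-learning sketch on a grid world (illustrative only).
# States are grid cells flattened to an index; actions index a move set.
n_states, n_actions = 10 * 10, 8           # 10 x 10 grid, eight moves
alpha, gamma, epsilon = 0.15, 0.95, 0.2    # learning rate, discount, exploration

Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state: int) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))
    return int(np.argmax(Q[state]))

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """One-step Q-learning update of the state-action value table."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

After training, greedily following `argmax(Q[state])` from the start cell yields the guide path that the later stages consume.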
- (1) A Novel Three-Stage Hierarchical Planning Framework.
- (2) A Discrete-to-Continuous Knowledge Transfer Mechanism.
- (3) A Knowledge-Guided Multi-Objective Optimization Model.
2. Materials and Methods
2.1. Global Path Planning Based on Q-Learning
2.1.1. Environmental Modeling and State Space Definition
2.1.2. Action Space Design and Symmetry Considerations
2.1.3. Q-Learning Algorithm
2.1.4. Reward Function Design
2.1.5. Optimal Path Extraction
2.2. Path Conversion and Model Adaptation
2.2.1. Two-Dimensional Path Coordinate Transformation
2.2.2. Three-Dimensional Guidance Path Generation
Algorithm 1: 3D Guidance Path Generation
/* Input */
  2D grid map environment_map, 2D path from Q-learning path_2d, scaling factor scale_factor, and fixed flight altitude fixed_flight_altitude.
/* Initialization */
  Initialize the height map H as an empty model.
  Initialize the threat list threats as an empty list.
  Set 3D map dimensions MAPSIZE_X, MAPSIZE_Y based on environment_map and scale_factor.
  Set map boundaries xmin, xmax, ymin, ymax, zmin, zmax.
/* 3D Terrain Model Construction */
  For each cell (i, j) in environment_map do
    Set base_height = random value within a predefined range.
    If environment_map[i, j] == 1 then  // obstacle area
      Set peak_center based on (i, j) and scale_factor.
      Generate a Gaussian surface centered at peak_center with base_height.
      Update the corresponding area in height map H.
      Define a spherical threat centered at peak_center and add it to threats.
    Else  // free area
      Generate a relatively flat surface with base_height and random noise.
      Update the corresponding area in height map H.
    End If
  End For
/* 3D Guidance Path Generation */
  Initialize guidance_path_3d as an empty path.
  For each node (r, c) in path_2d do
    Set x = r × scale_factor, y = c × scale_factor, z = fixed_flight_altitude.
    Append the 3D coordinate (x, y, z) to guidance_path_3d.
  End For
/* Output */
  The initial 3D guidance path guidance_path_3d, the 3D terrain model H, the threat list threats, and all map parameters for the next stage.
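A minimal Python sketch of the guidance-path loop in Algorithm 1 is given below; the terrain and threat construction steps are omitted, and the function name, variable names, and example values are ours rather than the authors' implementation.

```python
from typing import List, Tuple

def generate_3d_guidance_path(
    path_2d: List[Tuple[int, int]],
    scale_factor: float,
    fixed_flight_altitude: float,
) -> List[Tuple[float, float, float]]:
    """Lift the 2D grid path found by Q-learning into 3D waypoints by scaling
    the grid indices and assigning a fixed flight altitude, mirroring the
    guidance-path loop of Algorithm 1 (terrain/threat modelling omitted)."""
    guidance_path_3d = []
    for r, c in path_2d:
        x = r * scale_factor
        y = c * scale_factor
        z = fixed_flight_altitude
        guidance_path_3d.append((x, y, z))
    return guidance_path_3d

# Example: a short grid path scaled at 100 m per cell with a 50 m altitude
# (both values are illustrative assumptions, not the paper's settings).
print(generate_3d_guidance_path([(0, 0), (1, 1), (2, 1), (3, 2)], 100.0, 50.0))
```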
2.3. Multi-Objective PSO Fine-Tuning Optimization
2.3.1. Multi-Objective Optimization Problem
2.3.2. Constraints and Objective Functions
- (1) Path Length Cost (J1)
- (2) Threat Avoidance Cost (J2), evaluated over three zones (a minimal sketch of this piecewise cost follows this list):
  - Safe zone: if the distance to the threat is greater than the sum of the threat radius, the drone’s size, and a predefined safety margin (danger_dist), the path is considered safe and the cost is zero.
  - Collision zone: if the path enters the area defined by the threat radius plus the drone’s size, it is considered a collision. This violates a hard constraint, and the cost is set to infinity (∞).
  - Danger zone: if the path lies between the safe and collision zones, a penalty cost is incurred, proportional to the depth of the intrusion.
- (3) Flight Altitude Cost (J3)
- (4) Path Smoothness (J4)
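As referenced under item (2), a minimal sketch of the piecewise threat-avoidance cost J2 follows; the linear normalization of the danger-zone penalty, the summation over waypoints and threats, and the default parameter values are assumptions rather than the authors' exact formulation.

```python
import math

def threat_cost(dist_to_center: float, threat_radius: float,
                drone_size: float, danger_dist: float) -> float:
    """Piecewise threat-avoidance cost for one waypoint/threat pair: zero in
    the safe zone, infinite in the collision zone, and proportional to the
    intrusion depth inside the danger zone."""
    collision_boundary = threat_radius + drone_size
    safe_boundary = collision_boundary + danger_dist
    if dist_to_center > safe_boundary:
        return 0.0                # safe zone
    if dist_to_center < collision_boundary:
        return math.inf           # collision zone: hard-constraint violation
    # danger zone: penalty grows linearly with intrusion depth, normalized to [0, 1]
    return (safe_boundary - dist_to_center) / danger_dist

def path_threat_cost(waypoints, threats, drone_size=1.0, danger_dist=10.0):
    """Sum the per-waypoint cost over all spherical threats
    (the aggregation and the parameter defaults are assumptions)."""
    total = 0.0
    for point in waypoints:
        for (cx, cy, cz, radius) in threats:
            d = math.dist(point, (cx, cy, cz))
            total += threat_cost(d, radius, drone_size, danger_dist)
    return total
```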
2.3.3. The Modified MOPSO
- (1) Transformation of Q-Learning Knowledge into MOPSO Guidance, built on two components:
  - The guidance path (P_QL): a sequence of 3D waypoints representing the optimal path found by the Q-learning agent. This path serves as a baseline for the MOPSO search.
  - Guidance strength (γ): a scalar hyperparameter that determines the magnitude of the Q-learning path’s influence on the optimization process.
- (2) Guided Population Initialization
- (3) Modified Velocity Update Guided by Q-Learning (a minimal sketch follows this list)
- (4) “Q-Learning Path Deviation” Objective Function (J5; also sketched after this list)
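The following is a minimal sketch of the guided velocity update in item (3) and the J5 path-deviation objective in item (4); the decay schedule, the acceleration coefficients, and the nearest-waypoint deviation measure are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def guided_velocity(v, x, pbest, leader, guide_point, t, t_max,
                    w=1.0, c1=1.5, c2=1.5, gamma0=1.0):
    """PSO velocity update with an extra attraction toward the Q-learning
    guidance path; the guidance strength decays with iteration t.
    The linear decay and coefficient values are illustrative assumptions."""
    r1, r2, r3 = np.random.rand(3)
    gamma_t = gamma0 * (1.0 - t / t_max)            # decaying guidance strength
    return (w * v
            + c1 * r1 * (pbest - x)                 # cognitive term
            + c2 * r2 * (leader - x)                # social term (archive leader)
            + gamma_t * r3 * (guide_point - x))     # pull toward the guidance path

def path_deviation_J5(particle_waypoints, guidance_path):
    """J5: mean distance from each particle waypoint to its nearest waypoint on
    the Q-learning guidance path (one plausible way to measure the deviation)."""
    P = np.asarray(particle_waypoints, dtype=float)   # shape (n, 3)
    G = np.asarray(guidance_path, dtype=float)        # shape (m, 3)
    d = np.linalg.norm(P[:, None, :] - G[None, :, :], axis=-1)  # (n, m) distances
    return float(d.min(axis=1).mean())
```

Because gamma_t shrinks toward zero, the swarm relies on the guidance path early in the search and gradually reverts to the standard cognitive and social terms.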
Algorithm 2: Multi-Objective PSO Fine-Tuning Optimization
/* Input */
  Maximum number of iterations t_max; population size NP; external archive capacity nRep; PSO parameters: inertia weight w, damping ratio w_damp; mutation probability p_mutation; the guidance path from Stage 1, Path_2D_guided.
/* Initialization */
  Set the generation number t = 1.
  For each particle i = 1 to NP do
    If i <= NP/3 then  // guided initialization
      Initialize particle position P_i based on Path_2D_guided with random perturbation.
    Else  // random initialization
      Initialize particle position P_i randomly within the solution space.
    End If
    Initialize particle velocity v_i = 0.
    Evaluate the objective vector Cost(P_i) with Equation (4).
    Initialize the personal best position pbest_i = P_i.
  End For
  Establish the external archive.
/* Iteration */
  While t < t_max do
    For each particle i = 1 to NP do
      Select a leader P_leader from the archive.
      Calculate the guidance strength γ, which decays with iteration t.
      Update velocity v_i based on w, pbest_i, P_leader, and the influence of Path_2D_guided controlled by γ.
      Update particle position P_i = P_i + v_i.
      If rand < p_mutation then
        Perform a mutation operation on P_i.
      End If
      Evaluate the new objective vector Cost(P_i).
      If Cost(P_i) dominates Cost(pbest_i) then
        pbest_i = P_i.
      End If
    End For
    Update the external archive.
    t = t + 1.
  End While
/* Output */
  Select the final solution Path_final from the external archive based on user-specified criteria, and output the optimal UAV path Path_final.
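The guided-initialization phase of Algorithm 2 can be sketched as follows; the one-third split comes from the pseudocode, while the Gaussian perturbation, its scale, and the array layout (one row of (x, y, z) per waypoint) are our assumptions.

```python
import numpy as np

def initialize_swarm(NP, guidance_path, lower, upper, sigma=20.0):
    """Initialize NP particles as in Algorithm 2: the first NP/3 particles are
    random perturbations of the Q-learning guidance path, the rest are sampled
    uniformly within the solution space bounds. The Gaussian perturbation and
    its scale sigma are assumptions, not the authors' exact scheme."""
    guide = np.asarray(guidance_path, dtype=float)          # (n_waypoints, 3)
    n_waypoints = guide.shape[0]
    particles = np.empty((NP, n_waypoints, 3))
    for i in range(NP):
        if i < NP // 3:                                     # guided initialization
            particles[i] = guide + np.random.normal(0.0, sigma, guide.shape)
        else:                                               # random initialization
            particles[i] = np.random.uniform(lower, upper, (n_waypoints, 3))
        particles[i] = np.clip(particles[i], lower, upper)  # stay inside map bounds
    return particles
```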
3. Experimental Simulation and Result Analysis
3.1. Experimental Setup
3.2. Validation of the Symmetric Action Set in QL-MOPSO
3.3. Parameter Sensitivity Analysis
- (1) Learning Rate (α): The model exhibits low sensitivity to α. While lower values (e.g., 0.01) yield marginally shorter paths, our chosen value of 0.15 resides in a region of high stability, offering a robust balance between performance and convergence speed.
- (2) Discount Factor (γ): The algorithm is highly robust to variations in γ. Across the tested range of [0.8, 0.999], path length and stability remained consistently excellent, confirming that our choice of 0.95 lies within the optimal performance plateau.
- (3) Exploration Rate (ε): A clear optimal range for ε was identified between 0.2 and 0.5. Our selection of ε = 0.2 is empirically validated, as it ensures sufficient exploration to find the optimal path without being hampered by excessive randomness.
3.4. Comparative Performance Analysis
3.4.1. Experimental Results
3.4.2. Analysis of Experimental Results
- (1) Superior Path Quality: QL-MOPSO achieves the best performance on the most critical metric, delivering the shortest average path length (1234.57 m). Its significantly lower standard deviation (55.41) also confirms its superior consistency over all competitors.
- (2) High Reliability and Efficiency: It demonstrates excellent robustness with a 100% success rate, unlike NSGA-II, which was highly unreliable (60% success). This reliability is achieved with no significant computational penalty, running faster than MOPSO and competitively with MOGWO.
- (3) Computational Efficiency: QL-MOPSO (32.21 s) and MOGWO (32.11 s) showed nearly identical and highly competitive computation times. Notably, the Q-learning guidance mechanism did not introduce a significant computational burden, as QL-MOPSO was even slightly faster than MOPSO (35.56 s).
- (4) Strategic Cost Trade-off: While MOGWO registered a lower composite cost, this reflects our algorithm’s intended design. QL-MOPSO deliberately prioritizes geometric path minimization, excels in this primary objective, and accepts a calculated trade-off in the secondary cost functions.
- (5) Best-Result Display: Figure 7 displays the single best path achieved by each algorithm across all 30 experimental runs, illustrating optimal-case performance. Visually, the path from MOGWO (black stars) appears to be the most geometrically efficient in this best-case scenario. However, this observation must be contextualized by the aggregated statistics in Table 3, which reveal a more nuanced conclusion.
3.4.3. Summary of Comparison
3.5. Ablation Analysis of the Q-Learning Path Deviation Term in MOPSO Fitness
- (1) Full QL-MOPSO (five objectives): Our complete model, which utilizes Q-learning for (a) guided initialization, (b) guided velocity updates, and (c) the J5 path deviation objective.
- (2) Ablated QL-MOPSO (four objectives): Identical to the full model in every respect, including Q-learning for guided initialization and velocity updates, but with the J5 objective removed from the fitness evaluation.
4. Discussion
4.1. The Critical Role of Action Space Geometry in Guidance Quality
4.2. Limitations and Future Directions
- (1) Performance in Dynamic and Real-World Scenarios: The most critical future direction is bridging the sim-to-real gap. Future work will extend the framework to handle dynamic environments, likely by using the trained Q-table as a “warm start” for rapid re-planning. Subsequently, we plan a staged validation process, including hardware-in-the-loop (HIL) simulations and field tests on a physical UAV platform, to assess the algorithm’s robustness against real-world uncertainties.
- (2) Scalability and Algorithmic Enhancements: Further research will analyze the scalability of QL-MOPSO in larger, more complex environments to understand its computational limits. To handle continuous state spaces more directly and potentially improve scalability, we also plan to explore replacing tabular Q-learning with Deep Reinforcement Learning (DRL) techniques.
- (3) Reward Function and Path Biases:
- (4) Adaptation to Real-World Terrain and Dynamic Obstacles:
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
UAV | Unmanned Aerial Vehicle
MOPSO | Multi-Objective Particle Swarm Optimization
QL-MOPSO | Q-Learning-Based MOPSO
NSGA | Non-Dominated Sorting Genetic Algorithm
MOACO | Multi-Objective Ant Colony Optimization
MOGWO | Multi-Objective Grey Wolf Optimizer
RL | Reinforcement Learning
MOO | Multi-Objective Optimization
References
- Yang, Y.; Fu, Y.; Xin, R.; Feng, W.; Xu, K. Multi-UAV Trajectory Planning Based on a Two-Layer Algorithm Under Four-Dimensional Constraints. Drones 2025, 9, 471. [Google Scholar] [CrossRef]
- Chen, H.; Liang, Y.; Meng, X. A UAV Path Planning Method for Building Surface Information Acquisition Utilizing Opposition-Based Learning Artificial Bee Colony Algorithm. Remote Sens. 2023, 15, 4312. [Google Scholar] [CrossRef]
- Lv, F.; Jian, Y.; Yuan, K.; Lu, Y. Unmanned Aerial Vehicle Path Planning Method Based on Improved Dung Beetle Optimization Algorithm. Symmetry 2025, 17, 367. [Google Scholar] [CrossRef]
- Haidar Ahmad, A.; Zahwe, O.; Nasser, A.; Clement, B. Path Planning for Unmanned Aerial Vehicles in Dynamic Environments: A Novel Approach Using Improved A* and Grey Wolf Optimizer. World Electr. Veh. J. 2024, 15, 531. [Google Scholar] [CrossRef]
- Li, W.; Zhang, K.; Xiong, Q.; Chen, X. Three-Dimensional Unmanned Aerial Vehicle Path Planning in Simulated Rugged Mountainous Terrain Using Improved Enhanced Snake Optimizer (IESO). World Electr. Veh. J. 2025, 16, 295. [Google Scholar] [CrossRef]
- Güven, İ.; Yanmaz, E. Multi-objective path planning for multi-UAV connectivity and area coverage. Ad Hoc Netw. 2024, 160, 103520. [Google Scholar] [CrossRef]
- Zhang, W.; Peng, C.; Yuan, Y.; Cui, J.; Qi, L. A novel multi-objective evolutionary algorithm with a two-fold constraint-handling mechanism for multiple UAV path planning. Expert Syst. Appl. 2024, 238, 121862. [Google Scholar] [CrossRef]
- Xu, X.; Xie, C.; Luo, Z.; Zhang, C.; Zhang, T. A multi-objective evolutionary algorithm based on dimension exploration and discrepancy evolution for UAV path planning problem. Inf. Sci. 2024, 657, 119977. [Google Scholar] [CrossRef]
- Petchrompo, S.; Coit, D.W.; Brintrup, A.; Wannakrairot, A.; Parlikad, A.K. A review of Pareto pruning methods for multi-objective optimization. Comput. Ind. Eng. 2022, 167, 108022. [Google Scholar] [CrossRef]
- Yahia, H.S.; Mohammed, A.S. Path planning optimization in unmanned aerial vehicles using meta-heuristic algorithms: A systematic review. Environ. Monit. Assess. 2023, 195, 30. [Google Scholar] [CrossRef]
- Hooshyar, M.; Huang, Y.M. Meta-heuristic algorithms in UAV path planning optimization: A systematic review (2018–2022). Drones 2023, 7, 687. [Google Scholar] [CrossRef]
- Wu, Y.; Liang, T.; Gou, J.; Tao, C.; Wang, H. Heterogeneous mission planning for multiple UAV formations via metaheuristic algorithms. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 3924–3940. [Google Scholar] [CrossRef]
- Xia, S.; Zhang, X. Constrained path planning for unmanned aerial vehicle in 3D terrain using modified multi-objective particle swarm optimization. Actuators 2021, 10, 255. [Google Scholar] [CrossRef]
- Ma, H.; Zhang, Y.; Sun, S.; Liu, T.; Shan, Y. A comprehensive survey on NSGA-II for multi-objective optimization and applications. Artif. Intell. Rev. 2023, 56, 15217–15270. [Google Scholar] [CrossRef]
- Awadallah, M.A.; Makhadmeh, S.N.; Al-Betar, M.A.; Dalbah, L.M.; Al-Redhaei, A.; Kouka, S.; Enshassi, O.S. Multi-objective ant colony optimization. Arch. Comput. Methods Eng. 2025, 32, 995–1037. [Google Scholar] [CrossRef]
- Ntakolia, C.; Lyridis, D.V. A comparative study on Ant Colony Optimization algorithm approaches for solving multi-objective path planning problems in case of unmanned surface vehicles. Ocean. Eng. 2022, 255, 111418. [Google Scholar] [CrossRef]
- Makhadmeh, S.N.; Alomari, O.A.; Mirjalili, S.; Al-Betar, M.A.; Elnagar, A. Recent advances in multi-objective grey wolf optimizer, its versions and applications. Neural Comput. Appl. 2022, 34, 19723–19749. [Google Scholar] [CrossRef]
- Tang, R.; Qi, L.; Ye, S.; Li, C.; Ni, T.; Guo, J.; Liu, H.; Li, Y.; Zuo, D.; Shi, J.; et al. Three-Dimensional Path Planning for AUVs Based on Interval Multi-Objective Secretary Bird Optimization Algorithm. Symmetry 2025, 17, 993. [Google Scholar] [CrossRef]
- Xiong, Q.; Zhang, X.; He, S.; Shen, J. A fractional-order chaotic sparrow search algorithm for enhancement of long distance iris image. Mathematics 2021, 9, 2790. [Google Scholar] [CrossRef]
- Zhang, X.; Xia, S.; Li, X.; Zhang, T. Multi-objective particle swarm optimization with multi-mode collaboration based on reinforcement learning for path planning of unmanned air vehicles. Knowl.-Based Syst. 2022, 250, 109075. [Google Scholar] [CrossRef]
- Zheng, F. Research on robot motion state estimation based on deep reinforcement learning. J. Hunan Univ. Arts Sci. (Nat. Sci.) 2023, 35, 34–39. [Google Scholar]
- Lyu, L.; Shen, Y.; Zhang, S. The advance of reinforcement learning and deep reinforcement learning. In Proceedings of the IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 25–27 February 2022; IEEE: New York, NY, USA, 2022; pp. 644–648. [Google Scholar]
- Puente-Castro, A.; Rivero, D.; Pedrosa, E.; Pereira, A.; Lau, N.; Fernandez-Blanco, E. Q-learning based system for path planning with unmanned aerial vehicles swarms in obstacle environments. Expert Syst. Appl. 2024, 235, 121240. [Google Scholar] [CrossRef]
- Souto, A.; Alfaia, R.; Cardoso, E.; Araújo, J.; Francês, C. UAV Path Planning Optimization Strategy: Considerations of Urban Morphology, Microclimate, and Energy Efficiency Using Q-Learning Algorithm. Drones 2023, 7, 123. [Google Scholar] [CrossRef]
- Sonny, A.; Yeduri, S.R.; Cenkeramaddi, L.R. Q-learning-based unmanned aerial vehicle path planning with dynamic obstacle avoidance. Appl. Soft Comput. 2023, 147, 110773. [Google Scholar] [CrossRef]
- Wu, J.; Sun, Y.; Li, D.; Shi, J. An adaptive conversion speed Q-learning algorithm for search and rescue UAV path planning in unknown environments. IEEE Trans. Veh. Technol. 2023, 72, 15391–15404. [Google Scholar] [CrossRef]
- Wang, L.; Zhang, L.; Liu, Y.; Wang, Z. Extending Q-learning to continuous and mixed strategy games based on spatial reciprocity. Proc. R. Soc. A 2023, 479, 20220667. [Google Scholar] [CrossRef]
- Mirzanejad, M.; Ebrahimi, M.; Vamplew, P.; Veisi, H. An online scalarization multi-objective reinforcement learning algorithm: TOPSIS Q-learning. Knowl. Eng. Rev. 2022, 37, e7. [Google Scholar] [CrossRef]
- Zhao, S.; Wu, Y.; Tan, S.; Wu, J.; Cui, Z.; Wang, Y.-G. QQLMPA: A quasi-opposition learning and Q-learning based marine predators algorithm. Expert Syst. Appl. 2023, 213, 119246. [Google Scholar] [CrossRef]
- Sharma, S.; Kumar, V. A comprehensive review on multi-objective optimization techniques: Past, present and future. Arch. Comput. Methods Eng. 2022, 29, 5605–5633. [Google Scholar] [CrossRef]
- Pereira, J.L.J.; Oliver, G.A.; Francisco, M.B.; Cunha, S.S.; Gomes, G.F. A review of multi-objective optimization: Methods and algorithms in mechanical engineering problems. Arch. Comput. Methods Eng. 2022, 29, 2285–2308. [Google Scholar] [CrossRef]
- Dosantos, P.S.; Bouchet, A.; Mariñas-Collado, I.; Montes, S. OPSBC: A method to sort Pareto-optimal sets of solutions in multi-objective problems. Expert Syst. Appl. 2024, 250, 123803. [Google Scholar] [CrossRef]
Q-Learning Parameter | Value | MOPSO Parameter | Value
---|---|---|---
Learning Rate | 0.15 | Population Size | 100 |
Discount Factor | 0.95 | Max Iterations | 500 |
Initial Exploration Rate | 0.2 | Repository Size | 40 |
Exploration Rate Decay | 0.95 | Dynamic Inertia Weight | 1.0/0.98 |
Number of Training Episodes | 5000 | Grid Divisions per Objective | 5 |
Grid Resolution | 10 × 10 | Leader Selection Parameter | 0.1 |
State Space Size | 10 × 10 | Archive Deletion Parameter | 2 |
Metric | Symmetric (Eight Actions) | Asymmetric (Four Actions)
---|---|---
Q-Learning Guide Path Length (steps) | 12 | 19
MOPSO Final Path Length (m) | 1102.81 | 1388.31 |
MOPSO Final Cost Vector [J1–J4] | [0.1349, 0.0000, 0.1861, 0.1806] | [0.2897, 0.0000, 0.2998, 0.1311] |
Computation Time (s) | 67.88 | 68.71 |
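For the action-set comparison in the table above, the sketch below enumerates an eight-connected symmetric action set (the four axis moves plus the four diagonals) together with a boundary-clamped move on the 10 × 10 grid; the ordering of the actions is our assumption, and the asymmetric four-action variant is not reproduced because its exact composition is not stated.

```python
# Assumed enumeration of the symmetric eight-action set: four axis moves plus
# four diagonals, which is invariant under grid rotations and reflections.
ACTIONS_SYMMETRIC_8 = [(-1, 0), (1, 0), (0, -1), (0, 1),
                       (-1, -1), (-1, 1), (1, -1), (1, 1)]

def apply_action(cell, action, grid_size=10):
    """Apply a (dr, dc) move and clamp the result to the grid boundaries."""
    r, c = cell
    dr, dc = action
    return (min(max(r + dr, 0), grid_size - 1),
            min(max(c + dc, 0), grid_size - 1))
```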
Algorithm | Time Cost (s) | Final Composite Cost | 3D Path Length (m) | Success Rate
---|---|---|---|---
MOPSO | 35.56 ± 0.71 | 0.1642 ± 0.0237 | 1298.84 ± 132.37 | 100% |
NSGA-II | 8.92 ± 0.18 | 0.2003 ± 0.0584 | 1303.41 ± 145.82 | 60% |
MOGWO | 32.11 ± 0.58 | 0.1137 ± 0.0220 | 1265.96 ± 100.20 | 100% |
DE | 69.69 ± 2.16 | 0.1500 ± 0.0300 | 1317.05 ± 137.45 | 100% |
QL-MOPSO | 32.21 ± 0.52 | 0.2041 ± 0.0525 | 1234.57 ± 55.41 | 100% |
Algorithm | Time Cost (s) | Final Composite Cost | 3D Path Length (m) | Success Rate
---|---|---|---|---
QL-MOPSO | 32.21 ± 0.52 | 0.2041 ± 0.0525 | 1234.57 ± 55.41 | 100% |
Ablated QL-MOPSO | 30.26 ± 0.53 | 0.1825 ± 0.0293 | 1325.76 ± 132.82 | 100% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).