A Reinforcement Learning-Based Adaptive Grey Wolf Optimizer for Simultaneous Arrival in Manned/Unmanned Aerial Vehicle Dynamic Cooperative Trajectory Planning
Highlights
- A novel reinforcement learning-based Grey Wolf Optimizer (RL-GWO) is proposed, which adaptively selects search strategies to balance exploration and exploitation.
- Experimental results show the proposed RL-GWO significantly outperforms standard GWO and DE in both convergence speed and final solution quality for cooperative path planning.
- The proposed method provides a more efficient and robust solution for achieving high-precision time synchronization among heterogeneous UAVs in complex environments.
- The developed dual-layer dynamic planning framework demonstrates high practical value, enabling rapid and effective online replanning to ensure safety against sudden threats.
Abstract
1. Introduction
1.1. Research Background and Significance
- Heterogeneity Coordination: Significant differences exist between fixed-wing UAVs and helicopters regarding flight speed envelopes, maneuverability, and operating altitudes. Planning spatiotemporally coupled trajectories that satisfy their respective physical constraints is the primary objective for achieving efficient coordination [7].
- Dynamic Environment: The mission airspace is volatile, potentially harboring unforeseen dynamic threats [8]. This demands that the planning system possesses rapid online response and replanning capabilities to perceive environmental changes in real-time during mission execution and quickly generate new safe paths for the cluster.
- Time Synchronization: The success of many missions relies on precise temporal coordination between vehicles. For instance, the strike helicopter must arrive precisely at the target within the brief “window” created by the illuminating UAV [9]. This stringent requirement for “simultaneous arrival” capability across the entire cluster elevates path planning from a traditional three-dimensional (3D) spatial problem to a four-dimensional (4D) spatiotemporal problem, imposing extremely high demands on the algorithm’s optimization capability and precision.
1.2. Literature Review
1.3. Main Contributions
- (1) A Novel Hybrid Intelligent Optimization Algorithm (RL-GWO): To overcome the standard GWO’s tendency to converge to local optima and its slow convergence speed when solving complex path planning problems, a reinforcement learning mechanism is introduced. The core idea of RL-GWO is to utilize a Q-Learning model to adaptively adjust optimization strategies, effectively balancing the algorithm’s global exploration and local exploitation capabilities and significantly improving solution quality and convergence efficiency.
- (2) An Integrated Optimization Scheme for Heterogeneous Clusters: A comprehensive multi-objective fitness function is designed, unifying total flight distance, mission time, time coordination error, and collision risks with static/dynamic obstacles. This transforms the complex 4D cooperative path planning problem into a clear minimization problem, achieving holistic, end-to-end optimization for the entire heterogeneous cluster.
- (3) Construction and Validation of a Dual-Layer Dynamic Planning Framework: A practical “offline global planning + online dynamic replanning” framework is proposed. This framework first utilizes the RL-GWO algorithm for global cooperative path planning to obtain an initial optimal solution. During mission execution, an efficient collision prediction mechanism triggers local replanning, invoking RL-GWO again to rapidly generate new safe paths. Simulation experiments prove that this framework effectively balances global planning quality with online dynamic response capability.
2. Problem Formulation and Mathematical Modeling
2.1. Environment Modeling
2.2. Heterogeneous Vehicle Model
- Start and Target Positions: $P_i^{\text{start}}$ and $P_i^{\text{target}}$ represent its start point and target point, respectively;
- Velocity Constraints: $v_i$ is the cruise speed of vehicle $i$, constrained by its physical performance, i.e., $v_i^{\min} \le v_i \le v_i^{\max}$. Different vehicle types (e.g., helicopters vs. fixed-wing UAVs) have distinct speed constraint ranges, which is a core manifestation of “heterogeneity”;
- Safety Radius: $R_i$ is the safety radius assigned to the vehicle for collision detection (a code sketch of these parameters follows this list).
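To make the heterogeneous vehicle model concrete, the following minimal sketch groups the per-vehicle parameters listed above into a single structure. The field names and the numeric safety radius are illustrative assumptions, not values taken from the paper; the start/target coordinates and speed envelopes mirror the scenario tables in Section 4.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Vehicle:
    """One heterogeneous vehicle; field names are illustrative."""
    name: str
    start: np.ndarray    # start point (x, y, z), km
    target: np.ndarray   # target point (x, y, z), km
    v_min: float         # minimum cruise speed, km/h
    v_max: float         # maximum cruise speed, km/h
    r_safe: float        # safety radius for collision detection, km (value assumed)


# Heterogeneity appears as different speed envelopes per vehicle type.
heli = Vehicle("Heli-1", np.array([0.0, 30.0, 5.0]),
               np.array([100.0, 28.0, 7.0]), 30.0, 180.0, 0.5)
uav = Vehicle("UAV-1", np.array([18.0, 60.0, 20.0]),
              np.array([100.0, 30.0, 10.0]), 108.0, 180.0, 0.5)
```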
2.3. Path Representation
2.4. Optimization Objective Function
- Terrain Penalty: For any point $q$ on the path, compute its vertical clearance $h(q)$ above the underlying terrain surface. If $h(q) < h_{\text{safe}}$, apply a large penalty proportional to the intrusion depth.
- Inter-Vehicle Separation Penalty: This penalty ensures that the flight paths of any two vehicles $i$ and $j$ in the cluster maintain a minimum safe distance $d_{\text{safe}}$, mitigating potential collision risks. This is a hard constraint guaranteeing physical separation in space, independent of the vehicles’ temporal synchronization. For any two distinct paths $X_i$ and $X_j$, the penalty accumulates every pairwise violation of the safe distance:

  $$J_{\text{sep}} = \sum_{a \in X_i} \sum_{b \in X_j} \max\big(0,\; d_{\text{safe}} - \lVert a - b \rVert\big)$$

- Dynamic Threat Penalty: This is the core mechanism for handling dynamic environments. For any point $q$ on path $X_i$, first compute the time $t_q$ at which vehicle $i$ reaches that point; then predict the position $p_j(t_q)$ of dynamic threat $j$ at that time. If the spatiotemporal distance $\lVert q - p_j(t_q) \rVert < R_j$, apply a large penalty. A code sketch of all three penalty terms follows this list.
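Below is a minimal sketch of how the three penalty terms could be evaluated on discretized paths, assuming each path is an (N, 3) array of waypoints flown at a constant cruise speed. The penalty weight `BIG`, the thresholds, and the `terrain_height` callable are our assumptions, not the paper’s settings.

```python
import numpy as np

BIG = 1e6  # assumed large penalty weight


def terrain_penalty(path, terrain_height, h_safe=0.2):
    """Penalize waypoints whose clearance above the terrain is below h_safe (km)."""
    ground = np.array([terrain_height(p[0], p[1]) for p in path])
    intrusion = np.maximum(0.0, h_safe - (path[:, 2] - ground))
    return BIG * intrusion.sum()


def separation_penalty(path_i, path_j, d_safe=1.0):
    """Hinge penalty on all pairwise waypoint distances below d_safe (km)."""
    d = np.linalg.norm(path_i[:, None, :] - path_j[None, :, :], axis=-1)
    return BIG * np.maximum(0.0, d_safe - d).sum()


def dynamic_threat_penalty(path, speed, threat_pos0, threat_vel, r_threat):
    """Spatiotemporal check: predict the threat's position at the time the
    vehicle reaches each waypoint, then apply a hinge penalty on proximity."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=-1)
    t = np.concatenate([[0.0], np.cumsum(seg)]) / speed   # arrival times, h
    threat = threat_pos0 + t[:, None] * threat_vel        # predicted positions
    d = np.linalg.norm(path - threat, axis=-1)
    return BIG * np.maximum(0.0, r_threat - d).sum()
```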
3. Reinforcement Learning-Based Adaptive Grey Wolf Optimization Method
3.1. Overall Algorithm Framework
- Global Planning Stage: Before the mission starts, the system takes predefined start/target points and static environment information as input. It invokes the RL-GWO algorithm for sufficient iterations to solve an initial cooperative path plan satisfying all constraints and achieving global optimality.
- Dynamic Replanning Stage: During mission execution, the system continuously performs collision prediction. Upon predicting a potential collision with a dynamic threat, it immediately triggers the replanning module, using the current positions of all vehicles as new starting points. The RL-GWO algorithm is invoked again with a smaller iteration budget to rapidly solve a new locally optimal path that avoids the dynamic threat, and the vehicles seamlessly switch to the new path to continue the mission. The control loop below sketches this two-stage flow.
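This is a schematic of the framework only; the function names (`rl_gwo`, `predict_collision`), the iteration budgets, and the environment/plan interfaces are assumptions standing in for whatever simulation infrastructure is used.

```python
def dual_layer_planning(vehicles, env, rl_gwo, predict_collision,
                        n_iter_global=500, n_iter_replan=100):
    """Offline global planning, then online replanning triggered on demand."""
    # Stage 1: offline global planning with a full iteration budget.
    plan = rl_gwo(vehicles, env, n_iter=n_iter_global)

    # Stage 2: online execution with continuous collision prediction.
    for t in env.timeline():
        env.update_threats(t)                 # observe dynamic threats
        if predict_collision(plan, env, t):   # look-ahead along current plan
            # Replan from the vehicles' current positions with a small
            # budget so a new safe path is available quickly.
            for v in vehicles:
                v.start = plan.position_at(v, t)
            plan = rl_gwo(vehicles, env, n_iter=n_iter_replan)
    return plan
```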
3.2. Standard Grey Wolf Optimization Algorithm
3.3. Proposed RL-GWO Algorithm
3.3.1. Improved GWO Search Strategies
3.3.2. Reinforcement Learning-Based Adaptive Strategy Selection
- State: The optimization process is discretized into $K$ stages. The current iteration stage serves as the state $s$ of the RL agent, $s \in \{1, 2, \ldots, K\}$.
- Action: The agent’s action space $A$ is the set containing the two hybrid update operators described above: $A = \{a_1, a_2\}$, where $a_1$ executes the elite-guided strategy and $a_2$ executes the hybrid DE strategy. The Q-table structure is shown in Table 1.

  Table 1. Q-table framework.

  | State/Action | $a_1$ | $a_2$ |
  |---|---|---|
  | $s_1$ | $Q(s_1, a_1)$ | $Q(s_1, a_2)$ |
  | ⋮ | ⋮ | ⋮ |
  | $s_K$ | $Q(s_K, a_1)$ | $Q(s_K, a_2)$ |

- Reward: The reward mechanism is designed to directly reflect the contribution of the selected action to the optimization progress. It is primarily the improvement in the global best fitness; if there is no improvement, the reward is 0 or a small negative value. This directly encourages actions that effectively enhance solution quality:

  $$r_t = \begin{cases} f_{\text{best}}^{\,t-1} - f_{\text{best}}^{\,t}, & f_{\text{best}}^{\,t} < f_{\text{best}}^{\,t-1} \\ 0 \text{ (or a small negative value)}, & \text{otherwise} \end{cases}$$
- Policy and Q-table Update: For action selection, the $\varepsilon$-greedy policy is used to balance exploitation and exploration. At each decision point, the agent chooses the action with the highest current Q-value with high probability $1-\varepsilon$; conversely, with small probability $\varepsilon$, it ignores the Q-values and selects a random action from the action space. This simple yet effective mechanism ensures that while the agent predominantly leverages its learned knowledge, it continually dedicates a fraction of its trials to discovering potentially superior strategies, which is crucial for escaping suboptimal policies. After obtaining the reward in each iteration, the Q-table is updated according to the standard Q-Learning formula (a code sketch follows):

  $$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$
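The sketch below makes the adaptive strategy selection concrete: the iteration index is mapped to one of $K$ stages (the state), an action is drawn $\varepsilon$-greedily from the Q-table, and the table is updated with the standard one-step Q-Learning rule. The hyperparameter values ($K$, $\alpha$, $\gamma$, $\varepsilon$) and the small negative reward are illustrative assumptions, not the paper’s settings.

```python
import numpy as np

K = 4                                 # number of iteration stages (assumed)
N_ACTIONS = 2                         # a1: elite-guided, a2: hybrid DE strategy
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1     # learning rate, discount, exploration (assumed)

Q = np.zeros((K, N_ACTIONS))          # Q-table over (stage, strategy), cf. Table 1


def stage(it, max_it):
    """Map the current iteration to a discrete stage index (the RL state)."""
    return min(K - 1, it * K // max_it)


def select_action(s, rng):
    """Epsilon-greedy: mostly exploit the best known action, explore with prob. EPS."""
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))


def update_q(s, a, r, s_next):
    """Standard one-step Q-Learning update."""
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])


def reward(f_prev_best, f_new_best):
    """Improvement of the global best fitness, else a small negative value
    (the -0.01 is an assumed placeholder)."""
    return f_prev_best - f_new_best if f_new_best < f_prev_best else -0.01
```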
3.4. Application of RL-GWO to Multi-Vehicle Path Planning
- (1) Define Cooperative Task Scenario and Encoding Scheme: Load all mission-related parameters within the 3D mission space, including the start position, target position, and speed constraints for each vehicle (helicopters and UAVs) in the cluster, as well as environmental information on static terrain and dynamic threats. Determine the dimension of each solution (i.e., grey wolf individual), defined by the total number of 3D waypoints for all vehicles plus the speed variables.
- (2) Initialize Population and Fitness Evaluation: Generate initial waypoints between the start and end points of each vehicle using linear interpolation with added random perturbations. Concatenate the coordinates of all waypoints and randomly generated speed values for all vehicles to form the initial position vector of each grey wolf individual. Then use the comprehensive cost function J defined in Section 2.4 as the fitness function to compute the fitness value of each initial individual, sort the population by fitness to determine the initial $\alpha$, $\beta$, and $\delta$ wolves, and initialize the RL Q-table.
- (3) Iterative Optimization: Enter the main loop of the algorithm. In each iteration, the RL module selects an update strategy based on the current iteration stage, the population is updated with the selected strategy and re-evaluated, and the resulting reward is used to update the Q-table.
- (4) Output Optimal Cooperative Solution: When the algorithm reaches the maximum iteration count or satisfies a stopping condition, terminate the search and output the best individual found (i.e., the one with the lowest comprehensive cost). Decoding this individual yields the optimal multi-vehicle cooperative path plan. If the stopping conditions are not met, return to Step (3) and continue iterating. The outer loop is sketched in code below.
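Putting steps (1)–(4) together, the outer loop could look like the following sketch, reusing the Q-Learning helpers from the Section 3.3.2 sketch. The `fitness` callable stands for the comprehensive cost J of Section 2.4 and `strategies` holds the two improved update operators; all names, the population size, and the iteration budget are illustrative assumptions.

```python
import numpy as np


def rl_gwo(fitness, dim, bounds, strategies, n_wolves=30, max_it=500, seed=0):
    """RL-GWO outer loop (sketch): per iteration the RL agent picks one of the
    improved GWO operators, the population is updated and re-evaluated, and the
    improvement of the global best cost is fed back as the reward."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    # Step (2): initialize wolves (flattened waypoint coordinates + speeds)
    # and evaluate the comprehensive cost J as fitness.
    wolves = rng.uniform(lo, hi, size=(n_wolves, dim))
    fit = np.array([fitness(w) for w in wolves])
    best = fit.min()
    best_w = wolves[np.argmin(fit)].copy()

    for it in range(max_it):                       # step (3): main loop
        s = stage(it, max_it)                      # RL state = iteration stage
        a = select_action(s, rng)                  # pick an update strategy
        wolves = strategies[a](wolves, fit, rng)   # apply the chosen operator
        wolves = np.clip(wolves, lo, hi)
        fit = np.array([fitness(w) for w in wolves])
        r = reward(best, fit.min())
        if fit.min() < best:                       # track the global best
            best = fit.min()
            best_w = wolves[np.argmin(fit)].copy()
        update_q(s, a, r, stage(it + 1, max_it))

    # Step (4): return the lowest-cost individual found, to be decoded into
    # the multi-vehicle cooperative path plan.
    return best_w, best
```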
4. Simulation Experiments and Analysis
4.1. Experimental Setup
4.2. Simulation Results in Experiment 1
4.2.1. Performance Comparison with Three-UAV Cluster
4.2.2. Scalability Verification with Six-UAV Cluster
4.3. Experiment 2: Dynamic Environment Replanning Capability Verification
5. Conclusions
- (1) A Novel RL-GWO Hybrid Intelligent Optimization Algorithm: Addressing the tendency of standard metaheuristics to converge to local optima when solving complex multi-objective problems, multiple improved Grey Wolf Optimization (GWO) strategies with distinct search characteristics were designed. A reinforcement learning (RL) framework was innovatively introduced to dynamically and adaptively select among these strategies. This method intelligently balances global exploration and local exploitation based on optimization progress, significantly enhancing solution quality and efficiency.
- (2) A Complete Dual-Layer Dynamic Planning Framework: This framework combines “offline global planning” with “online dynamic replanning.” The global planning stage utilizes RL-GWO for thorough optimization to generate an initial optimal path satisfying high-precision time coordination. The online replanning stage enables rapid response to sudden dynamic threats, maximally preserving cluster coordination while ensuring safety.
- (3) Systematic Validation via Simulation Experiments: Experimental results demonstrate that the proposed RL-GWO algorithm exhibits good convergence and robustness. It achieves time synchronization accuracy at the second level while satisfying multiple complex constraints such as terrain avoidance and inter-vehicle safety separation, and it ensures path smoothness and economy. In dynamic scenarios, the constructed dual-layer framework enables fast and effective online replanning, proving the method’s practical value in handling environmental uncertainty.
Future work will focus on the following directions:
- (1) Introducing more refined vehicle dynamic models;
- (2) Investigating the impact of communication delays and uncertainties on cooperative planning;
- (3) Further validating the method’s feasibility in real-world systems.
Author Contributions
Funding
Data Availability Statement
DURC Statement
Conflicts of Interest
References
- Yu, X.; Jiang, N.; Wang, X.; Li, M. A Hybrid Algorithm Based on Grey Wolf Optimizer and Differential Evolution for UAV Path Planning. Expert Syst. Appl. 2022, 215, 119327. [Google Scholar] [CrossRef]
- Zhou, X.Y.; Jia, W.; He, R.F.; Sun, W. High-Precision localization tracking and motion state estimation of ground-based moving target utilizing unmanned aerial vehicle high-altitude reconnaissance. Remote Sens. 2025, 17, 735. [Google Scholar] [CrossRef]
- Skorobogatov, G.; Barrado, C.; Salamí, E. Multiple UAV systems: A survey. Unmanned Syst. 2020, 8, 149–169. [Google Scholar] [CrossRef]
- Han, Z.; Feng, X.; Lv, Z.; Yang, L. An Improved Environment Modeling Method for UAV Path Planning. Inf. Control 2018, 47, 371–378. (In Chinese) [Google Scholar]
- Zhang, H.; Li, W.; Zheng, J.; Liu, H.; Zhang, P.; Gao, P.; Gan, X. Manned/Unmanned Aerial Vehicle Cooperative Operations: Concepts, Technologies and Challenges. Acta Aeronaut. Astronaut. Sin. 2024, 45, 029653. (In Chinese) [Google Scholar]
- Zhen, Z.Y.; Xing, D.J.; Gao, C. Cooperative search-attack mission planning for multi-UAV based on intelligent self-organized algorithm. Aerosp. Sci. Technol. 2018, 76, 402–411. [Google Scholar] [CrossRef]
- Erdelj, M.; Natalizio, E.; Chowdhury, K.R.; Akyildiz, I.F. Help from the sky: Leveraging UAVs for disaster management. IEEE Pervasive Comput. 2017, 16, 24–32. [Google Scholar] [CrossRef]
- Xu, L.; Cao, X.B.; Du, W.B.; Li, Y.M. Cooperative path planning optimization for multiple UAVs with communication constraints. Knowl.-Based Syst. 2023, 260, 110164. [Google Scholar] [CrossRef]
- Ayawli, B.; Mei, X.; Shen, M.; Appiah, A.Y.; Kyeremeh, F. Mobile Robot Path Planning in Dynamic Environment Using Voronoi Diagram and Computation Geometry Technique. IEEE Access 2019, 7, 86026–86040. [Google Scholar] [CrossRef]
- Zhan, W.; Wang, W.; Chen, N.; Wang, C. Efficient UAV Path Planning with Multiconstraints in a 3D Large Battlefield Environment. Math. Probl. Eng. 2014, 2014, 597092. [Google Scholar] [CrossRef]
- Qadir, Z.; Zafar, M.H.; Moosavi, S.K.R.; Le, K.N.; Mahmud, M.A.P. Autonomous UAV path-planning optimization using metaheuristic approach for predisaster assessment. IEEE Internet Things 2021, 9, 12505–12514. [Google Scholar] [CrossRef]
- Liu, X.; Zhang, D.; Zhang, T.; Cui, Y.; Chen, L.; Liu, S. Novel Best Path Selection Approach Based on Hybrid Improved A* Algorithm and Reinforcement Learning. Appl. Intell. 2021, 51, 9015–9029. [Google Scholar] [CrossRef]
- Beak, J.; Han, S.I.; Han, Y. Energy-efficient UAV Routing for Wireless Sensor Networks. IEEE Trans. Veh. Technol. 2019, 69, 1741–1750. [Google Scholar] [CrossRef]
- Dewangan, R.K.; Shukla, A.; Godfrey, W.W. Three dimensional path planning using Grey wolf optimizer for UAVs. Appl. Intell. 2019, 49, 2201–2217. [Google Scholar] [CrossRef]
- Lu, L.; Dai, Y.; Ying, J.; Zhao, Y. UAV Path Planning Based on APSODE-MS Algorithm. Control Decis. 2022, 37, 1695–1704. (In Chinese) [Google Scholar]
- Lv, L.; Liu, H.; He, R.; Jia, W.; Sun, W. A novel HGW optimizer with enhanced differential perturbation for efficient 3D UAV path planning. Drones 2025, 9, 212. [Google Scholar] [CrossRef]
- Wang, F.; Wang, X.; Sun, S. A reinforcement learning level-based particle swarm optimization algorithm for large-scale optimization. Inf. Sci. 2022, 602, 298–312. [Google Scholar] [CrossRef]
- Meng, Q.; Chen, K.; Qu, Q. PPSwarm: Multi-UAV Path Planning Based on Hybrid PSO in Complex Scenarios. Drones 2024, 8, 192. [Google Scholar] [CrossRef]
- Gupta, H.; Verma, O.P. A novel hybrid Coyote-Particle Swarm Optimization Algorithm for three-dimensional constrained trajectory planning of Unmanned Aerial Vehicle. Appl. Soft Comput. 2023, 147, 110776. [Google Scholar] [CrossRef]
- Yu, Z.H.; Si, Z.J.; Li, X.B.; Wang, D.; Song, H.B. A novel hybrid particle swarm optimization algorithm for path planning of UAVs. IEEE Internet Things 2022, 9, 22547–22558. [Google Scholar] [CrossRef]
- Yu, X.; Luo, W. Reinforcement Learning-based Multi-strategy Cuckoo Search Algorithm for 3D UAV Path Planning. Expert Syst. Appl. 2022, 223, 119910. [Google Scholar]
- Du, Y.; Peng, Y.; Shao, S.; Liu, B. Multi-UAV Cooperative Path Planning Based on Improved Particle Swarm Optimization. Sci. Technol. Eng. 2020, 20, 13258–13264. (In Chinese) [Google Scholar]
- Pan, Z.; Zhang, C.; Xia, Y.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II Express Briefs 2021, 69, 1129–1133. [Google Scholar]
- Niu, Y.; Yan, X.; Wang, Y.; Niu, Y. Three-dimensional Collaborative Path Planning for Multiple UCAVs Based on Improved Artificial Ecosystem Optimizer and Reinforcement Learning. Knowl.-Based Syst. 2022, 276, 110782. [Google Scholar]
- He, W.; Qi, X.; Liu, L. A Novel Hybrid Particle Swarm Optimization for Multi-UAV Cooperate Path Planning. Appl. Intell. 2021, 51, 7350–7364. [Google Scholar] [CrossRef]
- Jiang, W.; Lyu, Y.X.; Li, Y.F.; Guo, Y.C.; Zhang, W.G. UAV path planning and collision avoidance in 3D environments based on POMPD and improved grey wolf optimizer. Aerosp. Sci. Technol. 2022, 121, 107314. [Google Scholar]
| Task | Vehicle | Start Point (km) | Target Point (km) | Speed Range (km/h) |
|---|---|---|---|---|
| 1 | Heli-1 | (0, 30, 5) | (100, 28, 7) | (30, 180) |
|   | UAV-1 | (18, 60, 20) | (100, 30, 10) | (108, 180) |
|   | UAV-2 | (18, 10, 22) | (100, 25, 10) | (108, 180) |
| 2 | Heli-1 | (0, 30, 5) | (100, 28, 7) | (30, 180) |
|   | Heli-2 | (0, 50, 5) | (100, 35, 10) | (30, 180) |
|   | UAV-1 | (18, 60, 20) | (100, 40, 10) | (108, 180) |
|   | UAV-2 | (18, 37, 18) | (100, 32, 10) | (108, 180) |
|   | UAV-3 | (18, 10, 22) | (100, 25, 10) | (108, 180) |
|   | UAV-4 | (18, 20, 18) | (100, 30, 12) | (108, 180) |
| Vehicle | Start Point (km) | Target Point (km) | Speed Range (km/h) |
|---|---|---|---|
| Heli-1 | (0, 35, 5) | (100, 35, 8) | (30, 180) |
| UAV-1 | (15, 60, 20) | (100, 40, 10) | (108, 180) |
| UAV-2 | (15, 10, 22) | (100, 30, 10) | (108, 180) |
| Vehicle | Metric | GWO | DE | SDPSO | RL-GWO |
|---|---|---|---|---|---|
| Heli-1 | Path Length (km) | 126.42 | 108.23 | 106.70 | 103.47 |
|   | Optimized Speed (km/h) | 134.23 | 163.76 | 169.37 | 175.49 |
|   | Flight Time (h) | 0.94 | 0.66 | 0.63 | 0.59 |
| UAV-1 | Path Length (km) | 120.35 | 97.91 | 97.73 | 90.10 |
|   | Optimized Speed (km/h) | 127.62 | 148.84 | 155.13 | 152.83 |
|   | Flight Time (h) | 0.94 | 0.66 | 0.63 | 0.59 |
| UAV-2 | Path Length (km) | 120.20 | 90.69 | 89.48 | 85.82 |
|   | Optimized Speed (km/h) | 127.62 | 137.61 | 142.03 | 145.57 |
|   | Flight Time (h) | 0.94 | 0.66 | 0.63 | 0.59 |
| Overall | Computation Time (s) | 5.26 | 5.58 | 8.86 | 6.13 |
| Vehicle | Path Length (km) | Speed (km/h) | Flight Time (h) |
|---|---|---|---|
| Heli-1 | 104.32 | 140.99 | 0.74 |
| Heli-2 | 107.41 | 145.31 | 0.74 |
| UAV-1 | 91.43 | 123.67 | 0.74 |
| UAV-2 | 83.33 | 112.73 | 0.74 |
| UAV-3 | 87.26 | 118.02 | 0.74 |
| UAV-4 | 86.89 | 117.43 | 0.74 |
| Threat Name | Initial Position (km) | Velocity Vector (km/h) | Radius (km) | Appearance Time (h) |
|---|---|---|---|---|
| DynObs-1 | (50, 25, 12) | (−30, 20, 0) | 5 | 0.1 |
| Vehicle | Path Length (km) | Speed (km/h) | Flight Time (h) |
|---|---|---|---|
| Heli-1 | 101.85 | 163.07 | 0.62 |
| UAV-1 | 91.29 | 146.14 | 0.62 |
| UAV-2 | 88.47 | 141.64 | 0.62 |
| Vehicle | Remaining Path Length (km) | Speed (km/h) | Estimated Time to Target (h) | Distance Flown (km) |
|---|---|---|---|---|
| Heli-1 | 67.78 | 167.38 | 0.40 | 39.14 |
| UAV-1 | 58.48 | 145.41 | 0.40 | 35.07 |
| UAV-2 | 54.54 | 136.13 | 0.40 | 33.99 |