Cooperative Path Planning for Autonomous UAV Swarms Using MASAC-CA Algorithm
Abstract
1. Introduction
- We propose a hierarchically structured reward mechanism designed for precise simultaneous-arrival missions. It integrates sparse terminal rewards with dense intermediate incentives and adds a post-arrival holding mechanism for the leader UAV to prevent the formation from disintegrating. The framework combines stagnation penalties, temporal efficiency incentives, and collision avoidance penalties to achieve multi-objective cooperative optimization. In addition, formation maintenance rewards and obstacle avoidance penalties together form a dual-layer collision prevention system that improves operational safety while preserving formation integrity throughout mission execution.
- We introduce a heterogeneous Markov Decision Process formulation that addresses multi-UAV coordination challenges in simultaneous-arrival scenarios [6]. The framework employs distinct state representations: the leader UAV's state space incorporates a mission completion flag, enabling real-time awareness of the terminal condition, while each wingman's state includes the leader UAV's positional coordinates [7]. This design makes the leader's arrival status and position essential state information, with the completion flag serving as a global coordination signal [8]. Wingmen therefore track the leader's position continuously throughout all mission phases, including the phase after the leader reaches the target. The resulting state-level cooperative coupling models the leader–follower formation accurately and mitigates the formation instability risks inherent in conventional homogeneous approaches.
- Our methodology follows the Centralized Training with Decentralized Execution (CTDE) [9] paradigm to balance coordination efficiency with operational autonomy. During training, agent experiences are integrated centrally for joint policy optimization, which prevents behavioral conflicts and suboptimal solutions [10]. During execution, each agent operates independently, feeding only its locally observable state to its own decentralized policy network, which supports system robustness, real-time responsiveness, and resilience to communication latency [11]. This approach enables effective cooperative behavior while maintaining the operational independence required in dynamic environments [12]. The overall framework is illustrated in Figure 1.
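As a rough illustration of the heterogeneous state design described in the contributions above, the following Python sketch builds one observation vector for the leader and one for a wingman. The `UAV` class, the function names, and the exact feature layout are assumptions made for this example and are not taken from the paper; only the two distinguishing ingredients (the leader's completion flag, and the leader's position inside each wingman state) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec2 = Tuple[float, float]


@dataclass
class UAV:
    """Hypothetical 2D UAV record used only for this illustration."""
    position: Vec2
    velocity: Vec2


def leader_state(leader: UAV, goal: Vec2, arrived: bool) -> List[float]:
    """Leader observation: own kinematics, offset to the goal, completion flag."""
    return [
        *leader.position,
        *leader.velocity,
        goal[0] - leader.position[0],
        goal[1] - leader.position[1],
        1.0 if arrived else 0.0,  # mission completion flag (leader only)
    ]


def wingman_state(wingman: UAV, leader: UAV, goal: Vec2) -> List[float]:
    """Wingman observation: own kinematics, offset to the goal, leader position."""
    return [
        *wingman.position,
        *wingman.velocity,
        goal[0] - wingman.position[0],
        goal[1] - wingman.position[1],
        *leader.position,  # leader coordinates (wingmen only)
    ]


leader = UAV(position=(0.0, 0.0), velocity=(1.0, 0.0))
wingman = UAV(position=(-30.0, 0.0), velocity=(1.0, 0.0))
goal = (100.0, 50.0)

s_leader = leader_state(leader, goal, arrived=False)
s_wingman = wingman_state(wingman, leader, goal)
```

Because the wingman vector always carries the leader's current position, a wingman policy can keep tracking the leader after the flag in the leader's own state switches to 1. The feature counts here (7 and 8) are a by-product of the assumed layout and need not match the actor input size reported in the paper.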
2. Related Work
2.1. Traditional Path Planning Methods
2.2. Reinforcement Learning Methods
3. Problem Formulation
3.1. Problem Statement for UAV Swarms Cooperative Path Planning
- The leader UAV must avoid obstacles to prevent collisions that could damage it, and it must not enter no-fly zones (NFZs) or jamming zones.
- The leader UAV and the wingman UAVs are considered to have arrived simultaneously if they reach the target location within a time difference of less than ten time steps.
- The wingmen should, as far as possible, keep a safe distance from the leader UAV in order to maintain the formation.
- The leader UAV collides with an obstacle, resulting in its destruction, or enters an NFZ or jamming zone.
- The leader UAV fails to reach the goal location within the stipulated timeframe, indicating inadequate planning.
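The arrival criterion and the failure conditions listed above can be sketched as two small predicates. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, the time difference is measured between each wingman and the leader, and an arrival step of `None` stands for "has not arrived yet". Only the ten-step tolerance and the three failure conditions come from the text.

```python
from typing import Optional, Sequence

ARRIVAL_TOLERANCE_STEPS = 10  # "less than ten time steps" in the problem statement


def arrived_simultaneously(
    leader_step: Optional[int],
    wingman_steps: Sequence[Optional[int]],
    tolerance: int = ARRIVAL_TOLERANCE_STEPS,
) -> bool:
    """True if every UAV has arrived and each wingman is within the tolerance.

    Assumption: the difference is taken pairwise against the leader's arrival step.
    """
    if leader_step is None or any(step is None for step in wingman_steps):
        return False
    return all(abs(step - leader_step) < tolerance for step in wingman_steps)


def mission_failed(
    leader_collided: bool,
    leader_in_restricted_zone: bool,
    leader_step: Optional[int],
    current_step: int,
    max_steps: int,
) -> bool:
    """True if the leader crashed, entered an NFZ or jamming zone, or timed out."""
    timed_out = leader_step is None and current_step >= max_steps
    return leader_collided or leader_in_restricted_zone or timed_out
```

For example, with the leader arriving at step 100, wingmen arriving at steps 95 and 109 satisfy the criterion, whereas a wingman arriving at step 110 does not, because the difference must be strictly less than ten.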
3.2. UAV Kinematic Model
3.3. MDP Modeling of UAV Swarm Cooperative Path Planning
- (1) State Set
- (2) Action Set
- (3) State Transition Probability
- (4) Discount Factor
- (5) Reward Set
  - (a) Boundary penalty
  - (b) Obstacle avoidance penalty
  - (c) Goal reward
  - (d) Formation distance reward
  - (e) Stagnation penalty
  - (f) Time penalty
  - (g) Speed coordination reward
  - (h) Temporal coordination reward
  - (i) Terminal achievement reward
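To make the structure of such a composite reward more concrete, the sketch below implements two of the listed terms and sums them per step. The functional forms (a linear penalty outside a tolerance band, and a linear penalty on displacement shortfall) are assumptions chosen for illustration and are not the paper's equations; all function names are hypothetical. The three distance constants are the values given in the reward parameter table.

```python
import math
from typing import Dict, Tuple

Vec2 = Tuple[float, float]

IDEAL_FORMATION_DISTANCE = 30.0  # m, ideal formation distance
FORMATION_TOLERANCE = 15.0       # m, formation distance tolerance threshold
STAGNATION_DISPLACEMENT = 5.0    # m, stagnation determination displacement threshold


def formation_distance_term(wingman: Vec2, leader: Vec2, coefficient: float) -> float:
    """Assumed form: zero inside the tolerance band, linear penalty outside it."""
    deviation = abs(math.dist(wingman, leader) - IDEAL_FORMATION_DISTANCE)
    excess = max(0.0, deviation - FORMATION_TOLERANCE)
    return -coefficient * excess / IDEAL_FORMATION_DISTANCE


def stagnation_term(previous: Vec2, current: Vec2, weight: float) -> float:
    """Assumed form: penalize a step whose displacement is below the threshold."""
    shortfall = max(0.0, STAGNATION_DISPLACEMENT - math.dist(previous, current))
    return -weight * shortfall


def step_reward(terms: Dict[str, float]) -> float:
    """Dense per-step reward as the plain sum of its named components."""
    return sum(terms.values())


# A wingman 60 m from the leader that did not move during the last step.
terms = {
    "formation": formation_distance_term((60.0, 0.0), (0.0, 0.0), coefficient=300.0),
    "stagnation": stagnation_term((60.0, 0.0), (60.0, 0.0), weight=1.0),
}
total = step_reward(terms)
```

Keeping each component as a named entry makes it straightforward to log the contribution of every term during training, which helps when tuning the relative weights of dense and sparse rewards.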
4. Method
4.1. MASAC-CA Algorithm
4.2. Algorithm Update Procedure
| Algorithm 1: MASAC-CA training procedure |
| Multi-Agent Soft Actor-Critic with Cooperative Arrival (MASAC-CA) |
| Input: number of agents, learning rates, soft update rate, discount factor, replay buffer size, batch size |
| Output: trained policy networks for all agents |
| 1. Initialize: |
| Actor network parameters for each agent |
| Critic network parameters |
| Value network parameters |
| Target value network parameters |
| Replay buffer with the given capacity |
| Entropy coefficient |
| 2. for each episode do |
| 3. Initialize environment, obtain initial heterogeneous state |
| 4. for each time step do |
| 5. for each agent do |
| 6. Sample action from the agent's policy |
| 7. end for |
| 8. Execute joint action |
| 9. Observe comprehensive reward and next heterogeneous state |
| 10. Store transition in the replay buffer |
| 11. if then |
| 12. Sample minibatch of size |
| 13. //Update Critic networks |
| 14. for each agent do |
| 15. # Equation (17) |
| 16. Update by minimizing: |
| # Equation (17) |
| 17. end for |
| 18. //Update Value networks |
| 19. for each agent do |
| 20. Sample |
| 21. # Equation (18) |
| 22. Update by minimizing: |
| # Equation (18) |
| 23. end for |
| 24. //Update Actor networks |
| 25. for each agent do |
| 26. Update by minimizing: |
| # Equation (20) |
| 27. end for |
| 28. //Soft update target networks |
| 29. for each agent do |
| 30. # Equation (19) |
| 31. end for |
| 32. end if |
| 33. end for |
| 34. end for |
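Two of the mechanical pieces of the procedure above, the replay buffer and the soft update of the target value network, can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: parameters are represented as lists of floats rather than neural network tensors, the class and function names are hypothetical, and the soft update uses the standard Polyak averaging rule from Soft Actor-Critic, which is assumed here to correspond to the referenced Equation (19). The default capacity (20,000) and batch size (128) are the values from the hyperparameter table.

```python
import random
from collections import deque
from typing import Any, Dict, List

Params = Dict[str, List[float]]


class ReplayBuffer:
    """Fixed-capacity buffer that discards the oldest transitions first."""

    def __init__(self, capacity: int = 20_000) -> None:
        self._storage: deque = deque(maxlen=capacity)

    def store(self, transition: Any) -> None:
        self._storage.append(transition)

    def __len__(self) -> int:
        return len(self._storage)

    def sample(self, batch_size: int = 128) -> List[Any]:
        """Uniformly sample a minibatch without replacement."""
        return random.sample(list(self._storage), batch_size)


def soft_update(target: Params, online: Params, tau: float) -> None:
    """Polyak averaging in place: target <- tau * online + (1 - tau) * target."""
    for name, online_values in online.items():
        target_values = target[name]
        for i, weight in enumerate(online_values):
            target_values[i] = tau * weight + (1.0 - tau) * target_values[i]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(("state", "action", float(step), "next_state"))

# Assumed guard: only update once at least one full minibatch is available.
batch = buffer.sample(batch_size=2) if len(buffer) >= 2 else []

target_value = {"w": [0.0, 1.0]}
online_value = {"w": [1.0, 1.0]}
soft_update(target_value, online_value, tau=0.1)
```

A small `tau` makes the target network trail the online network slowly, which is the usual way to stabilize the bootstrapped targets used in the critic update.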
5. Experiments
5.1. Simulation Environment and Algorithm Parameter Settings
- (1) The real-world three-dimensional motion of UAVs is simplified into a two-dimensional form.
- (2) The position information of the goal location and obstacles is assumed known, acquired by ground-based radar and communicated to the UAVs.
5.2. Training Process Comparison
5.3. Testing Process Comparison
5.3.1. Wingman Quantity Analysis
5.3.2. Cooperative Behavior Analysis
5.3.3. Algorithm Robustness Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shao, S.K.; Peng, Y.; He, C.L.; Du, Y. Efficient path planning for UAV formation via comprehensively improved particle swarm optimization. ISA Trans. 2020, 97, 415–430.
- Wu, Y.; Gou, J.Z.; Hu, X.T.; Huang, Y. A new consensus theory-based method for formation control and obstacle avoidance of UAVs. Aerosp. Sci. Technol. 2020, 107, 106332.
- Qu, C.Z.; Gai, W.D.; Zhong, M.Y.; Zhang, J. A novel reinforcement learning based grey wolf optimizer algorithm for unmanned aerial vehicles (UAVs) path planning. Appl. Soft Comput. 2020, 89, 106099.
- Fang, C.L.; Yang, F.S.; Pan, Q. Multi-UAV cooperative path planning based on MASAC reinforcement learning algorithm. Sci. Sin. Informationis 2024, 54, 1871–1883.
- Northwestern Polytechnical University Shenzhen Research Institute. UAV Path Planning Method and Device Based on Maximum Entropy Safe Reinforcement Learning. CN202410423432.7, 14 June 2024.
- Kim, S.; Jung, D. Multiresolution approximation MDP for multi-target reconnaissance online planning. Int. J. Aeronaut. Space Sci. 2025, 26, 2657–2676.
- Bany Salameh, H.; Hussienat, A.; Alhafnawi, M.; Al-Ajlouni, A. Autonomous UAV-based surveillance system for multi-target detection using reinforcement learning. Clust. Comput. 2024, 27, 9381–9394.
- Ryu, S.K.; Jeong, B.M.; Choi, H.L. MDP formulation for multi-UAVs mission planning with refueling constraints. In Robot Intelligence Technology and Applications 7, Proceedings of the 10th International Conference on Robot Intelligence Technology and Applications (RiTA 2022), Kuala Lumpur, Malaysia, 7–9 December 2022; Jo, J., Myung, H., Alshehhi, A.A., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2023; Volume 642, pp. 89–103.
- Zheng, Y.; Xin, B.; He, B.; Ding, Y. Mean policy-based proximal policy optimization for maneuvering decision in multi-UAV air combat. Neural Comput. Appl. 2024, 36, 19667–19690.
- Xu, L.; Zhang, X.; Xiao, D.; Liu, B.; Liu, A. Research on heterogeneous multi-UAV collaborative decision-making method based on improved PPO. Appl. Intell. 2024, 54, 9892–9905.
- Qu, P.; He, C.; Wu, X.; Wang, E.; Xu, S.; Liu, H.; Sun, X. Double mixing networks based monotonic value function decomposition algorithm for swarm intelligence in UAVs. Auton. Agent Multi-Agent Syst. 2025, 39, 16.
- Zhang, Y.; Wang, S.; Chen, Z.; Xu, X.; Funiak, S.; Liu, J. Towards cost-efficient federated multi-agent RL with learnable aggregation. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2024), Taipei, Taiwan, 7–10 May 2024; Yang, D.N., Xie, X., Tseng, V.S., Pei, J., Huang, J.W., Lin, J.C.W., Eds.; Lecture Notes in Computer Science; Springer: Singapore, 2024; Volume 14646, pp. 154–168.
- Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271.
- Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107.
- Stentz, A. Optimal and efficient path planning for partially-known environments. In Proceedings of the IEEE International Conference on Robotics and Automation, San Diego, CA, USA, 8–13 May 1994; pp. 3310–3317.
- Dewangan, R.K.; Shukla, A.; Godfrey, W.W. Three dimensional path planning using grey wolf optimizer for UAVs. Appl. Intell. 2019, 49, 2201–2217.
- Han, Z.; Chen, M.; Shao, S.; Zhou, T.; Wu, Q. Path planning of unmanned autonomous helicopter based on hybrid satisficing decision-enhanced swarm intelligence algorithm. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 1371–1385.
- Wu, W.H.; Guo, X.F.; Zhou, S.Y. Dynamic path planning based on improved constrained differential evolution algorithm. Control Decis. 2020, 35, 2381–2390.
- Yu, X.B.; Jiang, N.J.; Wang, X.M.; Li, M. A hybrid algorithm based on grey wolf optimizer and differential evolution for UAV path planning. Expert Syst. Appl. 2023, 215, 119327.
- Xu, L.; Cao, X.B.; Du, W.B.; Li, Y. Cooperative path planning optimization for multiple UAVs with communication constraints. Knowl.-Based Syst. 2023, 260, 110164.
- Zhu, D.; Yang, S.X. Bio-inspired neural network-based optimal path planning for UUVs under the effect of ocean currents. IEEE Trans. Intell. Veh. 2022, 7, 231–239.
- Lin, L.; Zhang, X.S.; Han, C.L.; Ma, H. UAV maneuvering target tracking based on Kalman filter and DDQN algorithm. Tactical Missile Technol. 2022, 98–104.
- Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.-Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466.
- Hua, X.; Wang, X.Q.; Rui, T.; Shao, F.; Wang, D. Vision-based end-to-end target tracking control technology for UAV. J. Zhejiang Univ. 2022, 56, 1–9.
- Zhang, H.H.; He, P.K.; Zhang, M. UAV target tracking method based on deep reinforcement learning. In Proceedings of the 2022 International Conference on Cyber-Physical Social Intelligence (ICCSI), Beijing, China, 18–21 November 2022; pp. 274–277.
- Xiang, X.J. Coordinated formation control for fixed-wing UAVs based on deep reinforcement learning. Acta Aeronaut. Astronaut. Sin. 2021, 42, 1–10.
- Masmitja, I.; Martin, M.; Katija, K.; Gomariz, S.; Navarro, J. A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles. IEEE J. Ocean. Eng. 2022, 47, 689–702.
- Wen, C.; Dong, W.H.; Xie, W.J.; Cai, M.; Hu, D.X. Autonomous tracking and obstacle avoidance for UAV swarm based on decoupled MADDPG. Flight Dyn. 2022, 40, 24–31.
- Si, P.; Wu, B.; Yang, R.; Li, M.; Sun, Y. UAV Path Planning Based on Multi-Agent Deep Reinforcement Learning. J. Beijing Univ. Technol. 2023, 49, 449–458.
- Gallici, M.; Martin, M.; Masmitja, I. TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), London, UK, 29 May–2 June 2023; pp. 1–9.
- Ragi, S.; Chong, E.K.P. UAV path planning in a dynamic environment via partially observable Markov decision process. IEEE Trans. Aerosp. Electron. Syst. 2013, 49, 2397–2412.
- Zhang, T.T.; Yang, X.J. Autonomous coordination saturation attacks method for loitering munitions in urban scenarios based on reinforcement learning. J. Command Control 2023, 9, 457–468.
- Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Hanzo, L. Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing. IEEE Trans. Cogn. Commun. Netw. 2020, 7, 73–84.
- Bellman, R. A Markovian decision process. J. Math. Mech. 1957, 6, 679–684.
- Bertsekas, D. Reinforcement Learning and Optimal Control; Athena Scientific: Belmont, MA, USA, 2019.
- Yuksek, B.; Demirezen, M.U.; Inalhan, G.; Tsourdos, A. Cooperative planning for an unmanned combat aerial vehicle fleet using reinforcement learning. J. Aerosp. Inf. Syst. 2021, 18, 739–750.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
- Guo, T.; Jiang, N.; Li, B.; Zhu, X.; Wang, Y.; Du, W. UAV Navigation in High Dynamic Environments: A Deep Reinforcement Learning Approach. Chin. J. Aeronaut. 2021, 34, 479–489.

| Parameter | Hyperparameter | Actor Network | Critic Network |
|---|---|---|---|
| Actor learning rate: | Max steps: 1000 | Input: 10 | Input: 10 × N + 2 |
| Critic learning rate: | Batch size: 128 | Hidden: (256, 256) | Hidden: (256, 256) |
| Entropy learning rate: | Max episodes: 800 | Output: 2 | Output: 1 |
| Soft update rate: | Replay buffer: 20,000 | Activation: ReLU, tanh | Activation: ReLU |
| Discount factor: 0.9 | Optimizer: Adam | Weights: N(0, 0.1) | Weights: N(0, 0.1) |

| Reward Function Parameters | Parameter Symbol | Parameter Description | Leader Value | Wingman Value |
|---|---|---|---|---|
| Environment Interaction Parameters | | Obstacle radius | 20 m | 20 m |
| | | Target arrival determination threshold | 40 m | 40 m |
| | | Target proximity threshold | 50 m | 50 m |
| | | Basic safety boundary threshold | 20 m | 20 m |
| Penalty Coefficients | | Collision penalty | −500 | −500 |
| | | Basic boundary penalty coefficient | −50 | −50 |
| | | Heading deviation penalty coefficient | 10 | 5 |
| | | Formation penalty coefficient | 300 | 400 |
| | | Stagnation penalty weight coefficient | 0.1 | 1 |
| | | Time penalty coefficient | −0.6 | −0.5 |
| | | Remaining flight time difference penalty coefficient | N/A | 100 |
| Formation Coordination Parameters | | Ideal formation distance | 30 m | 30 m |
| | | Formation distance tolerance threshold | 15 m | 15 m |
| | | Formation coordination distance threshold | 50 m | 50 m |
| | | Velocity difference tolerance | 2 m/s | 2 m/s |
| | | Formation coordination time threshold | 10 s | 10 s |
| | | Stagnation determination displacement threshold | 5 m | 5 m |
| Reward Parameters | | Target achievement reward | 2000 | 1000 |
| | | Full formation coordination reward | 1500 | N/A |
| | | Core mission achievement reward | 1000 | N/A |

| Metric Name | Symbol | Formula |
|---|---|---|
| Mission Success Rate | ||
| Formation Keeping Rate | ||
| Total Flight Duration | ||
| Flight Trajectory Length | ||
| Flight Energy Consumption | ||
| Number of Collisions | ||
| Abruptness |

| Number | Algorithm | | | | | |
|---|---|---|---|---|---|---|
| 2 | MASAC | 87.0 | 22.83 | 203.93 | 116.63 | 229.45 |
| | MASAC-CA | 99.0 | 59.68 | 253.81 | 142.69 | 282.85 |
| | Random Strategy | 0.0 | 7.10 | 999.0 | 510.05 | 999.83 |
| | MADDPG | 0.0 | 5.55 | 215.01 | 122.86 | 184.78 |
| 3 | MASAC | 83.0 | 10.77 | 195.05 | 111.78 | 219.55 |
| | MASAC-CA | 99.0 | 40.91 | 253.52 | 140.62 | 283.34 |
| | Random Strategy | 0.0 | 1.54 | 999.0 | 501.34 | 999.66 |
| | MADDPG | 0.0 | 1.11 | 218.67 | 123.55 | 187.64 |
| 4 | MASAC | 83.0 | 1.75 | 200.34 | 111.67 | 229.99 |
| | MASAC-CA | 99.0 | 33.81 | 253.31 | 140.34 | 279.19 |
| | Random Strategy | 0.0 | 0.35 | 999.0 | 495.85 | 1002.41 |
| | MADDPG | 97.0 | 1.96 | 209.99 | 121.99 | 180.68 |
| 5 | MASAC | 87.0 | 1.57 | 214.61 | 119.63 | 242.24 |
| | MASAC-CA | 99.0 | 26.29 | 257.27 | 137.99 | 287.19 |
| | Random Strategy | 0.0 | 0.15 | 999.0 | 499.76 | 1001.76 |
| | MADDPG | 0.0 | 0.13 | 218.46 | 124.45 | 188.63 |

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hu, W.; Zhang, M.; Xu, X.; Qiu, S.; Liao, T.; Yue, L. Cooperative Path Planning for Autonomous UAV Swarms Using MASAC-CA Algorithm. Symmetry 2025, 17, 1970. https://doi.org/10.3390/sym17111970

