Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm
Abstract
1. Introduction
1.1. Related Work
1.2. Motivations and Contributions
- A learning-based cooperative path-planning method for urban low-altitude multi-UAV inspection scenarios is developed under the MADDPG-based CTDE framework, where dynamic constraints, obstacle-avoidance safety, and formation cooperation are jointly incorporated into the POMDP formulation and reward design.
- A prioritized experience replay mechanism is introduced, in which sample priorities are constructed by aggregating multi-agent TD errors, thereby improving the utilization of critical samples and enhancing training convergence efficiency.
- An adaptive exploration-noise mechanism is proposed to automatically regulate exploration intensity without changing the deterministic policy-gradient structure, which helps alleviate sparse-reward and local-optimum problems while improving training stability.
- A centralized Critic network enhanced by a multi-head attention mechanism is designed to explicitly model inter-agent interaction dependencies and to improve the accuracy of value estimation for cooperative decision-making.
- Extensive comparative experiments, scalability analysis, and ablation studies are conducted in a three-dimensional urban inspection simulation environment. The effectiveness and robustness of the proposed method are validated using metrics including mission completion rate, formation maintenance rate, flight efficiency, and energy proxy.
2. Problem Description and Modeling
2.1. Urban Low-Altitude Cooperative Inspection Scenario
2.2. UAV Dynamics Model and Constraints
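The constraint ranges that bound this dynamics model are listed in the UAV parameter table (speed, acceleration, angular velocity, pitch angle, minimum safety distance). As a minimal, non-authoritative sketch, the following constrained kinematic update clips the commanded acceleration and yaw rate to the leader UAV's limits before integrating one step; the paper's full model, including pitch and vertical motion, is richer than this planar simplification, and the state layout used here is an assumption.

```python
import numpy as np

def kinematic_step(state, accel_cmd, yaw_rate_cmd, dt=1.0,
                   v_max=18.0, a_max=1.0, w_max=0.55):
    """Hedged sketch of a constrained kinematic update.
    state = (x, y, z, v, psi); the bounds mirror the parameter table for the
    leader UAV (speed in [0, 18] m/s, acceleration in [-1, 1] m/s^2,
    angular velocity in [-0.55, 0.55] rad/s). Pitch/altitude dynamics are omitted."""
    x, y, z, v, psi = state
    a = np.clip(accel_cmd, -a_max, a_max)        # enforce the acceleration limit
    w = np.clip(yaw_rate_cmd, -w_max, w_max)     # enforce the angular-velocity limit
    v = np.clip(v + a * dt, 0.0, v_max)          # enforce the speed limit
    psi = (psi + w * dt + np.pi) % (2 * np.pi) - np.pi  # wrap heading to [-pi, pi)
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    return (x, y, z, v, psi)
```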
2.3. POMDP Decision Model
2.3.1. State Space
2.3.2. Observation Space
2.3.3. Action Space
2.3.4. Transition Probability
2.3.5. Observation Probability
2.3.6. Multi-Objective Reward Functions
1. Boundary reward: penalizes flying outside the feasible airspace.
2. Obstacle-avoidance reward: guides each UAV to maintain a safe distance from obstacles.
3. Inspection-target reward: guides the leader UAV toward the current inspection target point.
4. Formation-maintenance reward: guides each follower UAV to maintain the preset formation distance from the leader UAV. For follower UAV $i$, let $d_i$ denote its distance to the leader; the formation-maintenance sub-reward penalizes deviations of $d_i$ from the optimal formation distance.
5. Velocity-cooperation reward: guides the UAVs to keep their speeds and headings consistent. For the leader UAV, let $\bar{v}$ and $\bar{\psi}$ denote the average speed and average heading angle of the follower UAVs; the leader's velocity-cooperation sub-reward is defined in terms of the deviation of its own speed and heading from $\bar{v}$ and $\bar{\psi}$.

An illustrative sketch of how these sub-rewards can be combined into a per-agent total reward is given below.
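The following minimal Python sketch shows one way the five sub-rewards can be weighted and summed. The functional forms, weights, and argument names (for example `obstacle_dist`, `target_dist`, `leader_dist`) are illustrative assumptions rather than the paper's exact definitions; only the minimum safety distance (30 m), optimal formation distance (300 m), and maximum speed (18 m/s) are taken from the parameter table.

```python
import numpy as np

def composite_reward(pos, obstacle_dist, target_dist, leader_dist,
                     v_self, v_avg, psi_self, psi_avg,
                     bounds=((0, 2000), (0, 2000), (50, 300)),
                     d_safe=30.0, d_form=300.0, is_leader=False,
                     weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Hedged sketch combining the sub-rewards of Section 2.3.6 with fixed weights.
    d_safe and d_form mirror the parameter table; the airspace bounds, weights,
    and normalizations are placeholders."""
    w_b, w_o, w_t, w_f, w_v = weights

    # 1. Boundary reward: penalize leaving the feasible airspace.
    out = any(p < lo or p > hi for p, (lo, hi) in zip(pos, bounds))
    r_boundary = -1.0 if out else 0.0

    # 2. Obstacle-avoidance reward: penalty grows as the UAV closes below d_safe.
    r_obstacle = -max(0.0, (d_safe - obstacle_dist) / d_safe)

    # 3. Inspection-target reward (leader only): denser reward closer to the target.
    r_target = -target_dist / 1000.0 if is_leader else 0.0

    # 4. Formation-maintenance reward (followers only): penalize deviation from d_form.
    r_form = 0.0 if is_leader else -abs(leader_dist - d_form) / d_form

    # 5. Velocity-cooperation reward: penalize speed/heading mismatch with the group.
    heading_err = np.arctan2(np.sin(psi_self - psi_avg), np.cos(psi_self - psi_avg))
    r_vel = -(abs(v_self - v_avg) / 18.0 + abs(heading_err) / np.pi)

    return (w_b * r_boundary + w_o * r_obstacle + w_t * r_target
            + w_f * r_form + w_v * r_vel)
```

In practice the weights would be tuned together with the training hyperparameters, since the relative scale of the shaping terms strongly affects convergence.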
2.4. Performance Evaluation Metrics
1. Mission Completion Rate (MCR)
2. Obstacle Avoidance Rate (OAR)
3. Flight Time (FT)
4. Flight Distance (FD)
5. Energy Proxy (EP)
6. Formation Keeping Rate (FKR)

An illustrative sketch of computing these metrics from per-episode statistics is given below.
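As a hedged illustration of how the six metrics could be computed from logged episodes, the sketch below assumes a hypothetical `EpisodeLog` record; all field names are assumptions, and the paper's exact metric definitions (for example the tolerance used for formation keeping) may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeLog:
    """Hypothetical per-episode record; field names are assumptions."""
    reached_target: bool      # leader reached the inspection target
    collided: bool            # any UAV violated the minimum safety distance
    steps: int                # episode length in control steps
    distance: float           # total flight distance of the team (m)
    energy_proxy: float       # accumulated control-effort proxy
    formation_ok_steps: int   # steps with formation error within tolerance

def evaluate(logs: List[EpisodeLog]):
    n = len(logs)
    mcr = 100.0 * sum(l.reached_target and not l.collided for l in logs) / n
    oar = 100.0 * sum(not l.collided for l in logs) / n
    fkr = 100.0 * sum(l.formation_ok_steps / l.steps for l in logs) / n
    ft  = sum(l.steps for l in logs) / n
    fd  = sum(l.distance for l in logs) / n
    ep  = sum(l.energy_proxy for l in logs) / n
    return {"MCR (%)": mcr, "OAR (%)": oar, "FKR (%)": fkr,
            "FT (step)": ft, "FD (m)": fd, "EP": ep}
```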
3. Cooperative Path Planning Method Based on PE-MADDPG
3.1. MADDPG Basic Framework
3.2. Prioritized Experience Replay Mechanism
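As summarized in the contributions and in Algorithm 1 (steps 11, 13, 14, and 19), sample priorities are built by aggregating per-agent TD errors, and importance-sampling weights correct for the resulting non-uniform sampling. The proportional-PER sketch below is a minimal illustration of that idea; the exponents `alpha` and `beta` and the max-abs aggregation rule are assumptions and need not match the paper's exact formulation.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional-PER sketch with multi-agent TD-error aggregation."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        max_p = self.priorities.max() if self.data else 1.0  # new samples get max priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # importance-sampling weights correct the bias of non-uniform sampling
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors_per_agent):
        # td_errors_per_agent: array of shape (batch, n_agents)
        agg = np.abs(td_errors_per_agent).max(axis=1)  # aggregate multi-agent TD errors
        self.priorities[idx] = agg + self.eps
```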
3.3. Adaptive Exploration Noise Mechanism
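The adaptive exploration-noise mechanism regulates exploration intensity at the episode level without altering the deterministic policy-gradient structure (Algorithm 1, steps 30 and 31). A minimal sketch under the assumption of a simple success-driven additive rule is shown below; the initial value, bounds, and step size mirror the hyperparameter table (0.2, [0.02, 0.5], 0.02), while the exact adaptation law used by PE-MADDPG may differ.

```python
class AdaptiveNoiseScale:
    """Hedged sketch of an episode-level noise schedule: shrink the scale after
    successful episodes, enlarge it after failures, clipped to [sigma_min, sigma_max]."""

    def __init__(self, sigma0=0.2, sigma_min=0.02, sigma_max=0.5, step=0.02):
        self.sigma = sigma0
        self.sigma_min, self.sigma_max, self.step = sigma_min, sigma_max, step

    def update(self, episode_success: bool) -> float:
        self.sigma += -self.step if episode_success else self.step
        self.sigma = min(max(self.sigma, self.sigma_min), self.sigma_max)
        return self.sigma
```

At execution time the scale multiplies a zero-mean noise perturbation added to the Actor output, which is then clipped to the action bounds; the Actor and Critic updates themselves are unchanged.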
3.4. Attention-Enhanced Critic Network
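The attention-enhanced centralized Critic lets the value estimate attend to the (observation, action) embeddings of all agents. The PyTorch sketch below is one plausible realization consistent with the network table (hidden layers of 256 units with ReLU and a scalar output); the embedding width, number of heads, and exact wiring are assumptions, and in the MADDPG setting one such Critic would be instantiated per agent.

```python
import torch
import torch.nn as nn

class AttentionCritic(nn.Module):
    """Hedged sketch of a centralized critic with multi-head attention
    over per-agent (observation, action) embeddings."""

    def __init__(self, obs_dim, act_dim, n_agents, embed_dim=128, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(embed_dim * n_agents, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, n_agents, obs_dim); act: (batch, n_agents, act_dim)
        x = torch.relu(self.embed(torch.cat([obs, act], dim=-1)))
        h, _ = self.attn(x, x, x)                  # each agent attends to all agents
        return self.head(h.flatten(start_dim=1))   # centralized Q value
```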
3.5. PE-MADDPG Algorithm Flow
Algorithm 1. Complete training procedure of PE-MADDPG

Input: number of agents $N$; Actor networks $\mu_{\theta_i}$; Critic networks $Q_{\phi_i}$; target networks $\mu_{\theta_i'}$, $Q_{\phi_i'}$; prioritized replay buffer $\mathcal{D}$; initial exploration-noise scale $\sigma_0$; mini-batch size $B$; discount factor $\gamma$; soft-update coefficient $\tau$
Output: trained Actor networks $\mu_{\theta_i}$ and their target networks $\mu_{\theta_i'}$
1: Initialize the Actor network $\mu_{\theta_i}$, the Critic network $Q_{\phi_i}$, and the corresponding target networks $\mu_{\theta_i'}$, $Q_{\phi_i'}$ for each agent $i = 1, \dots, N$
2: Initialize the prioritized replay buffer $\mathcal{D}$ and set the episode index
3: Initialize the exploration-noise scale $\sigma \leftarrow \sigma_0$
4: for each episode do
5:   Reset the environment and obtain the initial global state $s$ and local observations $\{o_i\}$
6:   for each time step do
7:     for each agent $i$ do
8:       Select action $a_i = \mu_{\theta_i}(o_i) + \sigma \epsilon_t$, where $\epsilon_t$ is the exploration noise
9:     end for
10:    Execute the joint action $(a_1, \dots, a_N)$ and observe the reward vector $(r_1, \dots, r_N)$, the next global state $s'$, the next local observations $\{o_i'\}$, and the termination flag
11:    Store the transition into $\mathcal{D}$ with an initial (maximal) priority
12:    if $|\mathcal{D}| \geq B$ then
13:      Sample a mini-batch of $B$ transitions from $\mathcal{D}$ according to the prioritized sampling probabilities
14:      Compute importance-sampling weights for all sampled transitions
15:      for each sampled transition do
16:        Generate the target joint action $(a_1', \dots, a_N')$ with the target Actors, where $a_j' = \mu_{\theta_j'}(o_j')$
17:        Use the attention-enhanced Critic to estimate $Q_{\phi_i}(s, a_1, \dots, a_N)$ and the target value $y_i = r_i + \gamma Q_{\phi_i'}(s', a_1', \dots, a_N')$ for each agent
18:        Compute the per-agent TD errors $\delta_i = y_i - Q_{\phi_i}(s, a_1, \dots, a_N)$
19:        Aggregate the multi-agent TD errors $\{\delta_i\}$ into a sample-level priority signal and update the priority of the transition in $\mathcal{D}$
20:      end for
21:      for each agent $i$ do
22:        Update the attention-enhanced Critic $Q_{\phi_i}$ by minimizing the importance-weighted TD loss
23:        Update the Actor $\mu_{\theta_i}$ using the deterministic policy gradient
24:        Softly update the target networks: $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i'$, $\phi_i' \leftarrow \tau \phi_i + (1 - \tau)\phi_i'$
25:      end for
26:    end if
27:    Set $s \leftarrow s'$ and $o_i \leftarrow o_i'$
28:    if the episode has terminated then break
29:  end for
30:  Determine the episode-level task success indicator
31:  Adapt the exploration-noise scale $\sigma$ according to the success indicator, clipped to $[\sigma_{\min}, \sigma_{\max}]$
32: end for
33: Return the trained Actor networks
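Two small utilities round out the training loop of Algorithm 1: the importance-weighted TD loss of steps 14 and 22, and the soft (Polyak) target update of step 24. Both are standard operations; only the function names below are assumptions.

```python
import torch

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.005):
    """Polyak averaging of target-network parameters (Algorithm 1, step 24)."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(sp, alpha=tau)

def weighted_td_loss(q_pred, q_target, is_weights):
    """Importance-sampling-weighted TD loss (Algorithm 1, steps 14 and 22).
    q_pred, q_target: tensors of shape (batch, 1); is_weights: (batch,) from the PER buffer."""
    td_error = q_target.detach() - q_pred
    w = torch.as_tensor(is_weights, dtype=q_pred.dtype).unsqueeze(1)
    loss = (w * td_error.pow(2)).mean()
    return loss, td_error
```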
4. Simulation Settings
4.1. Experimental Environment Setup
4.1.1. Simulation Environment
4.1.2. UAV Parameter Settings
4.1.3. Training Settings
5. Results
5.1. Comparative Analysis of Training Process
5.2. Sensitivity Analysis
5.3. Performance Comparison Experiments of Different Algorithms
5.4. Scalability Verification
5.5. Cooperative Behavior and Trajectory Visualization Analysis
5.6. Ablation Experiment
- MADDPG-Baseline: Standard MADDPG algorithm.
- MADDPG-PER: Only the prioritized experience replay mechanism is introduced.
- MADDPG-Noise: Only the adaptive exploration noise mechanism is introduced.
- MADDPG-Attention: Only the multi-head attention-enhanced Critic network is introduced.
- PE-MADDPG: The complete algorithm including all improved modules.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
| Parameter | Leader UAV | Follower UAV |
|---|---|---|
| Velocity (m/s) | [0, 18] | [0, 18] |
| Acceleration (m/s²) | [−1, 1] | [−1.5, 1.5] |
| Angular velocity (rad/s) | [−0.55, 0.55] | [−0.6, 0.6] |
| Optimal formation distance (m) | 300 | 300 |
| Pitch angle range (rad) | | |
| Minimum safety distance (m) | 30 | 30 |
| Core Hyperparameter | Value |
|---|---|
| Actor network learning rate | 1 × 10⁻⁴ |
| Critic network learning rate | 2 × 10⁻⁴ |
| Initial value of exploration noise | 0.2 |
| Lower bound of exploration noise | 2 × 10⁻² |
| Upper bound of exploration noise | 0.5 |
| Adaptive step size of noise | 2 × 10⁻² |
| Soft update rate | 0.5 × 10⁻² |
| Discount factor | 0.95 |
| Maximum training episodes | 500 |
| Batch size | 128 |
| Maximum steps per episode | 1000 |
| Maximum capacity of experience pool | 2 × 10⁴ |
| Actor network | Input → Hidden (256, 256, ReLU) → Output (3, Tanh) |
| Critic network | Input → Embedding → Multi-head Attention → Hidden (256, 256, ReLU) → Output (1) |
| Weight initialization | Normal distribution (mean = 0, std = 0.1) |
| Variant | Avg. Reward (Last 50 Episodes) | Avg. Error Band | Convergence Episode | Improvement vs. Baseline |
|---|---|---|---|---|
| MADDPG-Baseline | 487.08 | 153.52 | 321 | +0.00% |
| MADDPG-Entropy | −470.37 | 137.2 | - | −196.57% |
| MADDPG-PER | 494.51 | 153.52 | 411 | +1.52% |
| MADDPG-Attention | −2.12 | 137.2 | - | −100.44% |
| PE-MADDPG | 982.88 | 100 | 234 | +101.7% |
| Algorithm | MCR (%) | OAR (%) | FKR (%) | FT (Step) | FD (m) | EP |
|---|---|---|---|---|---|---|
| DDPG | 45.0 ± 5.5 | 72.3 ± 6.2 | 2.3 ± 1.2 | 650.8 ± 95.6 | 1580.2 ± 245.4 | 680.7 ± 125.5 |
| MAPPO | 78.0 ± 4.2 | 89.5 ± 4.8 | 28.5 ± 3.8 | 480.5 ± 65.3 | 1120.8 ± 185.2 | 520.3 ± 95.7 |
| MADDPG | 85.0 ± 3.5 | 91.3 ± 4.1 | 32.1 ± 3.2 | 410.7 ± 58.4 | 980.6 ± 150.3 | 450.8 ± 85.4 |
| PE-MADDPG | 92.0 ± 2.1 | 98.6 ± 1.5 | 45.8 ± 2.8 | 350.4 ± 42.6 | 820.3 ± 120.5 | 380.6 ± 75.2 |
| Variant | MCR (%) | OAR (%) | FKR (%) | FT (Step) | FD (m) | EP |
|---|---|---|---|---|---|---|
| MADDPG-Baseline | 85.1 ± 3.4 | 92.3 ± 3.5 | 32.2 ± 3.1 | 411.5 ± 57.8 | 988.7 ± 151.6 | 451.3 ± 84.7 |
| MADDPG-Noise | 87.6 ± 3.0 | 94.5 ± 2.8 | 34.8 ± 2.9 | 395.8 ± 52.3 | 952.4 ± 142.3 | 432.6 ± 80.5 |
| MADDPG-PER | 90.2 ± 2.5 | 96.8 ± 2.1 | 35.1 ± 3.0 | 372.4 ± 48.6 | 895.3 ± 130.5 | 408.5 ± 78.2 |
| MADDPG-Attention | 88.5 ± 2.8 | 95.2 ± 2.6 | 38.4 ± 2.6 | 388.6 ± 50.1 | 926.8 ± 138.4 | 425.7 ± 82.1 |
| PE-MADDPG | 92.3 ± 2.1 | 98.7 ± 1.4 | 45.6 ± 2.7 | 352.6 ± 43.5 | 832.5 ± 122.8 | 381.2 ± 74.9 |