Research on Decision-Making Strategies for Multi-Agent UAVs in Island Missions Based on Rainbow Fusion MADDPG Algorithm
Highlights
- This study presents an enhanced algorithm that integrates the Rainbow module to improve the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm for multi-agent UAV cooperative and competitive scenarios.
- The proposed algorithm incorporates Prioritized Experience Replay (PER) and multi-step temporal-difference (TD) updates to improve long-term reward estimation and learning efficiency, and employs behavioral cloning to accelerate convergence during initial training (a minimal warm-start sketch follows these highlights).
- Experimental results on a UAV island-capture simulation demonstrate that the enhanced algorithm outperforms the original MADDPG, converging roughly 40% faster and doubling the combat-power preservation rate.
- The algorithm proves to be a robust and efficient solution for complex, dynamic, multi-agent game environments.
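As a concrete illustration of the behavioral-cloning warm start mentioned in the highlights, the sketch below regresses an actor network onto expert state–action pairs before reinforcement learning begins. It assumes a PyTorch actor and a pre-collected expert dataset; it is a minimal sketch, not the authors' code, and the function name behavior_clone is illustrative.

```python
# Minimal sketch (assumptions: a PyTorch actor network and a pre-collected
# expert dataset of (observation, action) pairs; NOT the authors' code).
# Behavioral cloning regresses the actor onto expert actions so that MADDPG
# training starts from a sensible policy rather than random behavior.
import torch
import torch.nn as nn

def behavior_clone(actor: nn.Module, expert_obs: torch.Tensor,
                   expert_act: torch.Tensor, epochs: int = 50,
                   lr: float = 1e-3) -> nn.Module:
    """Warm-start the actor by minimizing MSE between its output and expert actions."""
    optimizer = torch.optim.Adam(actor.parameters(), lr=lr)
    for _ in range(epochs):
        pred_act = actor(expert_obs)                     # continuous actions
        loss = nn.functional.mse_loss(pred_act, expert_act)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return actor
```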
Abstract
1. Introduction
2. Algorithm Design
2.1. Theoretical Framework Analysis
2.1.1. Multi-Agent Markov Game Formalization
2.1.2. Component Synergy Mechanism Analysis
2.1.3. Multi-Agent Challenge Solutions
2.1.4. Convergence Guarantees
2.2. Rainbow Algorithm Improvement
Algorithm 1 Rainbow-MADDPG algorithm

Input: actor networks μ_i (parameters θ_i) and critic networks Q_i (parameters φ_i) for each UAV i, replay buffer D, the number of UAVs N, the number of episodes M, the maximum episode length T, the n-step value n, the priority exponent α, and the importance-sampling correction β.
for episode = 1 to M do
    Set up the UAVs' adversarial game environment.
    Initialize the state s at step t = 1; initialize local observations o_i and actions a_i for each UAV i.
    for step t = 1 to T do
        for UAV i = 1 to N do
            Obtain local observation o_i; select action a_i = μ_i(o_i).
        end for
        Execute joint action a = (a_1, …, a_N); observe rewards r_i, next state s′, and completion marks d_i.
        TD_error = Calculate_TD_Error(s, a, r, s′)
        Store experience (s, a, r, s′, d) with priority |TD_error|^α in replay buffer D.
        if all d_i are true then reset the environment. end if
        samples, weights = Sample_Experience_Batch(D, β)
        for UAV i = 1 to N do
            target = Compute_n_Step_TD(samples, r_i, γ, n, Q_i′)
            loss = Compute_Critic_Loss(Q_i, samples, target, weights)
            Update_Critic_Network(φ_i, loss)
            advantage = Compute_Advantage(Q_i, samples, μ_i)
            actor_gradient = Compute_Actor_Gradient(advantage, μ_i, samples, weights)
            Update_Actor_Network(θ_i, actor_gradient)
        end for
        if fixed_step_update then soft-update the target networks: θ_i′ ← τθ_i + (1 − τ)θ_i′, φ_i′ ← τφ_i + (1 − τ)φ_i′. end if
    end for
end for
Output: the learned strategies (policies) μ_i of all UAVs.
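The two Rainbow components most central to Algorithm 1 are prioritized experience replay and the n-step TD target. Below is a minimal sketch of a buffer that combines both; it is not the authors' implementation, and names such as PrioritizedNStepBuffer, alpha, and beta are illustrative (the default values mirror the parameter table in Section 4.2).

```python
# Minimal sketch (illustrative names, NOT the authors' implementation): a replay
# buffer combining the two Rainbow components used in Algorithm 1 — prioritized
# sampling (priority exponent alpha, importance-sampling correction beta) and
# n-step return folding. Defaults mirror the parameter table (alpha=0.6, n=3).
import numpy as np
from collections import deque

class PrioritizedNStepBuffer:
    def __init__(self, capacity, n_step=3, gamma=0.95, alpha=0.6):
        self.capacity, self.n_step = capacity, n_step
        self.gamma, self.alpha = gamma, alpha
        self.data, self.priorities = [], []
        self.pending = deque(maxlen=n_step)   # transitions waiting to be folded

    def add(self, obs, act, rew, next_obs, done, td_error):
        """Fold the last n rewards into one n-step transition and store it."""
        self.pending.append((obs, act, rew, next_obs, done))
        if len(self.pending) < self.n_step and not done:
            return
        n_step_return = sum(self.gamma ** k * t[2] for k, t in enumerate(self.pending))
        first_obs, first_act = self.pending[0][0], self.pending[0][1]
        last_next_obs, last_done = self.pending[-1][3], self.pending[-1][4]
        if len(self.data) >= self.capacity:   # evict the oldest experience
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append((first_obs, first_act, n_step_return, last_next_obs, last_done))
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)
        if done:
            self.pending.clear()

    def sample(self, batch_size, beta=0.4):
        """Sample proportionally to priority and return importance-sampling weights."""
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights = weights / weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        """Refresh priorities with newly computed TD errors."""
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```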
3. UAV Modeling
3.1. State Space Modeling
3.2. Action Space Modeling
3.3. Strategy Design
- (1) Attack strategies: For enemy units beyond our forces' attack range, the system compares the relative distances of our fighters and assigns tracking responsibility to the nearest available attack unit. A constraint simultaneously limits the number of attack units tracking a single enemy, ensuring tracking effectiveness while preserving the remaining attack resources. For enemy units within attack range, available combat units are allocated through a coordinated mechanism based on the following operational principles (a greedy allocation sketch appears after this strategy list):
- Attack all enemy units within our attack range if possible;
- To conserve resources, limit the number of our units attacking the same enemy unit;
- To increase attack efficiency, the reconnaissance unit's radar range is extended so that it can guide units still en route in completing the required maneuvers.
- (2) Interference frequency setting strategy: Given the periodic variation pattern of enemy radar frequencies, the interference frequency strategy primarily employs an online learning approach. Learning begins as soon as the simulation starts and continues throughout the entire engagement. The specific procedure is as follows (a count-table sketch appears after this strategy list):
- Record the enemy aircraft's radar frequency changes, treating the frequencies at three consecutive time points as one sample;
- Use the first two frequencies, in temporal order, as features to predict and store the probability distribution of the third frequency.
- (3) Avoidance strategies: Our reconnaissance units acquire enemy posture information over two consecutive time steps, from which potential enemy maneuvering directions are calculated. Combined with our units' previous state information, this allows probable enemy trajectories and tracking patterns to be projected, and the corresponding reconnaissance and attack units execute evasive maneuvers accordingly (a one-step extrapolation sketch appears after this strategy list).
- (4) Detection unit posture reconstruction: The state information of our two detection units in the island capture confrontation environment is structured as follows:
- Own basic attributes: the unit's survival status, X coordinate, Y coordinate, heading, radar status, and radar frequency;
- Friendly basic information: distance to the other friendly detection unit, distances to all friendly attack units;
- Basic enemy information: distances to all enemy units detected by the radar.
- (5) Attack unit posture reconstruction: The state information of our attack units in the game-adversarial island capture environment is organized as follows (an observation-vector sketch appears after this strategy list):
- Own basic attributes: the unit's survival status, X coordinate, Y coordinate, heading, radar status, radar frequency, jamming radar status, and jamming radar frequency;
- Friendly basic information: distances to all friendly detection units, distances to the other surviving friendly attack units;
- Basic enemy information: distances to enemy units actively detected by the radar, distances to enemy units passively detected by the jamming radar, the bearing of each enemy unit, and each enemy unit's radar frequency.
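A minimal sketch of the attack-strategy allocation in item (1): each enemy is paired with the nearest still-free attacker, never exceeding a per-enemy tracker cap. The greedy pairing rule and the names (assign_trackers, max_trackers_per_enemy) are illustrative assumptions, not the authors' exact mechanism.

```python
# Minimal sketch of the allocation rule in item (1): pair each enemy with the
# nearest still-free attacker, capping the number of trackers per enemy.
import numpy as np

def assign_trackers(attacker_xy: np.ndarray, enemy_xy: np.ndarray,
                    max_trackers_per_enemy: int = 2) -> dict:
    """Return {enemy_index: [attacker_indices]} via a nearest-pair-first greedy rule."""
    n_att, n_enemy = len(attacker_xy), len(enemy_xy)
    dists = np.linalg.norm(attacker_xy[:, None, :] - enemy_xy[None, :, :], axis=-1)
    assignment = {j: [] for j in range(n_enemy)}
    free_attackers = set(range(n_att))
    # Walk (attacker, enemy) pairs from closest to farthest.
    pairs = sorted(((i, j) for i in range(n_att) for j in range(n_enemy)),
                   key=lambda ij: dists[ij])
    for i, j in pairs:
        if i in free_attackers and len(assignment[j]) < max_trackers_per_enemy:
            assignment[j].append(i)
            free_attackers.discard(i)
    return assignment
```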
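The interference-frequency strategy in item (2) amounts to online learning of the radar's frequency-transition pattern. A minimal sketch, assuming discrete frequency bins (an illustrative model, not the authors' learner): each observed triplet of consecutive frequencies updates a count table keyed by the first two frequencies, and the normalized counts give the predicted distribution of the third.

```python
# Minimal sketch of the procedure in item (2), assuming discrete frequency bins.
from collections import defaultdict, Counter

class FrequencyPredictor:
    def __init__(self):
        self.counts = defaultdict(Counter)   # (f[t-2], f[t-1]) -> Counter of f[t]

    def update(self, f_prev2, f_prev1, f_now):
        """Incorporate one observed triplet of consecutive frequencies."""
        self.counts[(f_prev2, f_prev1)][f_now] += 1

    def predict(self, f_prev2, f_prev1) -> dict:
        """Return the predicted probability distribution of the next frequency."""
        c = self.counts[(f_prev2, f_prev1)]
        total = sum(c.values())
        return {f: n / total for f, n in c.items()} if total else {}
```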
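For the avoidance strategy in item (3), a minimal sketch under a constant-velocity assumption (illustrative only; the paper's trajectory projection may differ): the enemy's next position is extrapolated from its last two observed positions, and the evading unit steers directly away from that projected point.

```python
# Minimal sketch of the avoidance idea in item (3) under a constant-velocity assumption.
import numpy as np

def evade_heading(enemy_prev: np.ndarray, enemy_now: np.ndarray,
                  own_pos: np.ndarray) -> float:
    """Return an evasion heading in radians, pointing away from the projected enemy."""
    projected = enemy_now + (enemy_now - enemy_prev)   # one-step linear extrapolation
    away = own_pos - projected
    return float(np.arctan2(away[1], away[0]))
```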
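For the posture reconstruction in items (4) and (5), the sketch below flattens an attack unit's reconstructed state into the fixed-length observation vector its actor consumes. The field order and argument names are assumptions drawn from the lists above, not the authors' exact encoding; a detection unit's observation is built analogously from its shorter attribute list.

```python
# Minimal sketch for items (4)-(5): flatten an attack unit's reconstructed
# posture into an observation vector (field order and names are assumed).
import numpy as np

def attack_unit_observation(alive, x, y, heading, radar_on, radar_freq,
                            jammer_on, jammer_freq,
                            dist_to_detect_units, dist_to_friendly_attackers,
                            dist_active_enemies, dist_passive_enemies,
                            enemy_bearings, enemy_freqs) -> np.ndarray:
    own = [float(alive), x, y, heading, float(radar_on), radar_freq,
           float(jammer_on), jammer_freq]
    friendly = list(dist_to_detect_units) + list(dist_to_friendly_attackers)
    enemy = (list(dist_active_enemies) + list(dist_passive_enemies)
             + list(enemy_bearings) + list(enemy_freqs))
    return np.asarray(own + friendly + enemy, dtype=np.float32)
```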
3.4. Assessment of Indicators
4. Experimental Simulation Design
4.1. Configuration and Operating Instructions
4.2. Parameter Design
4.3. Heterogeneous Multi-Agent Environment Setting
4.4. Analysis of Experimental Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Unit Type | Categorization | Meaning | Reward Value
---|---|---|---
Attack unit | Attack result returns | Enemy unit destroyed | 1
Attack unit | Attack result returns | Attack on an enemy unit fails | −1
Attack unit | Detection returns | Enemy unit detected | 1
Attack unit | Destroyed returns | Attack unit destroyed | −14
Detection unit | Detection returns | Enemy unit detected | 1
Detection unit | Destroyed returns | Detection unit destroyed | −14
Common (shared) | Survival returns | Own unit survives | 14
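A minimal sketch of how the reward values in the table above might be combined into an attack unit's per-step reward. The event flags are illustrative; only the numeric values come from the table.

```python
# Minimal sketch combining the tabulated reward values for an attack unit.
def attack_unit_reward(destroyed_enemy: bool, attack_failed: bool,
                       detected_enemy: bool, was_destroyed: bool,
                       survived: bool) -> float:
    reward = 0.0
    reward += 1.0 if destroyed_enemy else 0.0    # attack result return
    reward -= 1.0 if attack_failed else 0.0      # failed attack
    reward += 1.0 if detected_enemy else 0.0     # detection return
    reward -= 14.0 if was_destroyed else 0.0     # destroyed return
    reward += 14.0 if survived else 0.0          # shared survival return
    return reward
```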
Parameter | Value
---|---
Max-episode | 100
Time-steps | 13,000
Lr-actor | 1 × 10⁻⁴
Lr-critic | 1 × 10⁻³
γ | 0.95
τ | 0.01
Buffer size | 5 × 10⁵
Batch size | 256
Optimizer | Adam
Activation function | ReLU
α (PER) | 0.6
n (multi-step TD) | 3
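For convenience, the training hyperparameters from the table above gathered into a single configuration dictionary (the key names are illustrative; the values are the table's).

```python
# Training hyperparameters from the table above (key names are illustrative).
CONFIG = {
    "max_episodes": 100,
    "time_steps": 13_000,
    "lr_actor": 1e-4,
    "lr_critic": 1e-3,
    "gamma": 0.95,
    "tau": 0.01,
    "buffer_size": 500_000,
    "batch_size": 256,
    "optimizer": "Adam",
    "activation": "ReLU",
    "per_alpha": 0.6,
    "n_step": 3,
}
```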
Parameter | Value |
---|---|
Map_x_limit | 800 |
Map_y_limit | 500 |
Random_limit | 50 |
Speed | 6 |
Bloods | 100 |
Turn_range | 0.26 |
Attack_bias | 1 |
Parameter | Value |
---|---|
Fighter_attack_percent | 1 |
Fighter_detect_range | 70 |
Fighter_damage | 100 |
Fighter_damage_range | 150 |
Fighter_turn_range | 3.14 |
Reconnaissance_detect_range | 100 |
Reconnaissance_turn_range | 3.14 |
Scale of Confrontation | MADDPG | MADDPG + Rainbow | Improvement (times) |
---|---|---|---|
5 vs. 5 | 2.5 | 13.5 | 4.4 |
6 vs. 6 | 2 | 14.1 | 6.05 |
7 vs. 7 | 4 | 14 | 2.5 |
8 vs. 8 | 4.1 | 11.8 | 1.88 |
9 vs. 9 | 5 | 12.7 | 1.54 |
10 vs. 10 | 7.3 | 12.5 | 0.71 |