Scalable Pursuit–Evasion Game for Multi-Fixed-Wing UAV Based on Dynamic Target Assignment and Hierarchical Reinforcement Learning
Highlights
- A hierarchical collaborative game framework that combines hierarchical reinforcement learning with dynamic target allocation is proposed for autonomous pursuit–evasion.
- The hierarchical reinforcement learning method, based on trajectory prediction and stable auxiliary gradients, improves the win rate and generates smooth flight paths.
- Simulations of pursuit–evasion games at several scales demonstrate the framework's advantages in training time and large-scale scalability.
- The hierarchical reinforcement learning method based on trajectory prediction and stable auxiliary gradients offers a feasible route to deployment on real UAVs.
Abstract
1. Introduction
2. Problem Modeling
2.1. Game Model of Pursuit–Evasion with Multi-Fixed-Wing UAV
2.2. Fixed-Wing UAV Model
2.3. Markov Decision Process
2.4. PPO Algorithm
3. Hierarchical Maneuvering Decision-Making Algorithm
3.1. Reinforcement Learning Aircraft Controller with Stable Auxiliary Gradients
3.1.1. Observation Space and Action Space
3.1.2. Reward Function
- where Δh is the difference between the expected altitude and the current altitude, and k_h is the scale factor for the altitude error.
- where Δψ is the difference between the expected heading and the current heading, and k_ψ is the scale factor for the heading error.
- where Δv is the difference between the expected velocity and the current velocity, and k_v is the scale factor for the velocity error.
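The three shaping terms above can be sketched in code. Note this is a minimal illustration, not the authors' exact formulation: the exponential form of each term, the scale-factor values, and all function names are assumptions chosen only to show how per-channel tracking errors combine into one control reward.

```python
import math

def tracking_reward(delta, scale):
    """Assumed exponential form: reward in (0, 1], equal to 1 at zero error."""
    return math.exp(-scale * abs(delta))

def control_reward(dh, dpsi, dv, k_h=0.01, k_psi=1.0, k_v=0.1):
    # One term per controlled channel: altitude, heading, velocity.
    r_h = tracking_reward(dh, k_h)        # altitude error term (Δh, k_h)
    r_psi = tracking_reward(dpsi, k_psi)  # heading error term (Δψ, k_ψ)
    r_v = tracking_reward(dv, k_v)        # velocity error term (Δv, k_v)
    # Multiplying keeps the total in (0, 1] and forces all three errors small.
    return r_h * r_psi * r_v
```

With zero error on every channel the reward is exactly 1, and it decays smoothly as any single tracking error grows.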
3.1.3. Stabilizing the Auxiliary Gradient
| Algorithm 1: SAG-PPO |  |
|---|---|
| 1 | Initialize the policy network, the value network, and the auxiliary network; initialize the Lagrange multiplier λ = 0.1, the hyperparameters, and the upper bound on smoothness |
| 2 | **for** step = 1 **to** Max-steps **do** |
| 3 | step the environment |
| 4 | collect (s, a, r, s′, κ, done) and store it in the buffer |
| 5 | **if** the buffer is full **then** |
| 6 | calculate the critic loss |
| 7 | calculate its gradient and update the critic parameters |
| 8 | calculate the main loss |
| 9 | calculate the loss of the auxiliary network |
| 10 | combine them to obtain the total loss |
| 11 | calculate the gradient and update the actor parameters |
| 12 | calculate the gradient and update the auxiliary-network parameters |
| 13 | update the Lagrange multiplier |
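Lines 10 and 13 of Algorithm 1 can be sketched as a standard Lagrangian combination plus dual ascent. The exact loss terms are not reproduced here, so every name (`main_loss`, `aux_loss`, `kappa`, `bound`, `lr_dual`) and the update form are illustrative assumptions, not the paper's definitions.

```python
def sag_total_loss(main_loss, aux_loss, lam):
    # Line 10 (assumed form): total loss = main PPO loss + λ · auxiliary loss.
    return main_loss + lam * aux_loss

def dual_ascent(lam, kappa, bound, lr_dual=0.01):
    # Line 13 (assumed form): raise λ when the smoothness measure κ exceeds
    # its upper bound, relax it otherwise, and keep λ non-negative.
    return max(0.0, lam + lr_dual * (kappa - bound))
```

Under this sketch, λ grows only while the smoothness constraint is violated, so the auxiliary penalty fades once the policy's actions are smooth enough.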
3.2. Maneuver Decision-Making Layer Based on Trajectory Prediction and Hierarchical Reinforcement Learning
3.2.1. Trajectory Prediction
3.2.2. Observation Space and Action Space
3.2.3. Reward Shaping
3.2.4. Hierarchical Reinforcement Learning Algorithm
| Algorithm 2: Hierarchical PPO |  |
|---|---|
| 1 | Load the low-level model and set it to eval mode; initialize the actor and critic parameters and the hyperparameters |
| 2 | **for** step = 1 **to** Max-steps **do** |
| 3 | get an action from the actor |
| 4 | pass the action to the low-level model |
| 5 | output the low-level action to the environment |
| 6 | step the environment |
| 7 | collect (s, a, r, s′, κ, done) and store it in the buffer |
| 8 | **if** the buffer is full **then** |
| 9 | calculate the critic loss |
| 10 | calculate its gradient and update the critic parameters |
| 11 | calculate the actor loss |
| 12 | calculate its gradient and update the actor parameters |
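Lines 3–5 of Algorithm 2 describe the hierarchical rollout: the high-level actor selects a macro-action, and the frozen low-level controller translates it into control commands. The sketch below illustrates only that data flow; the class, the control-surface keys, and the goal format (Δh, Δψ, Δv) are hypothetical placeholders for the pretrained model.

```python
class FrozenLowLevel:
    """Stand-in for the pretrained low-level controller held in eval mode."""
    def act(self, state, goal):
        dh, dpsi, dv = goal
        # A real controller would map (state, goal) to stick/throttle commands;
        # here we just echo the goal to show the interface.
        return {"elevator": dh, "aileron": dpsi, "throttle": dv}

def hierarchical_step(high_policy, low_level, state):
    goal = high_policy(state)             # line 3: get macro-action from actor
    control = low_level.act(state, goal)  # line 4: pass it to the low level
    return control                        # line 5: control sent to the env
```

Because the low-level model is frozen, only the high-level actor and critic receive gradient updates in lines 9–12.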
3.3. Dynamic Target Allocation Algorithm Based on Dynamic Value Adjustment
3.3.1. Situation Assessment
3.3.2. Dynamic Target Allocation Method
4. Simulation Results
4.1. Simulation Comparison of Flight Control Layer
4.2. Simulation Comparison of Maneuvering Decision-Making Layer
4.2.1. Trajectory Prediction Simulation
4.2.2. The Training Results of Hierarchical Reinforcement Learning
4.3. Target Allocation Simulation
4.4. Overall Algorithm Comparison
4.5. Analysis of the Impact of Termination Conditions
4.6. Failed Case Analysis
4.7. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
DURC Statement
Conflicts of Interest
| Variable Type | Variable Symbol | Meaning |
|---|---|---|
| Basic state variables | h | Height |
|  | v_x | Velocity component in the x direction |
|  | v_y | Velocity component in the y direction |
|  | v_z | Velocity component in the z direction |
|  | v | Magnitude of velocity |
|  | Δv | Change in velocity |
|  | q | Pitch angular velocity |
| Relative geometric state variables | s_t | Current state of the opponent UAV |
|  | s_{t+Δt} | Predicted state of the opponent UAV one prediction interval ahead |
|  | s_{t+2Δt} | Predicted state of the opponent UAV two prediction intervals ahead |
| Action Variables | Scope |
|---|---|
| Δh (km) | {−0.2, −0.1, 0, 0.1, 0.2} |
| Δψ (radian) | {−π/6, −π/12, 0, π/12, π/6} |
| Δv (Mach) | {−0.2, −0.1, 0, 0.1, 0.2} |
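The discrete macro-action table above can be enumerated directly: one value is chosen per channel, so the joint action set has 5 × 5 × 5 = 125 combinations. The variable names below are illustrative; only the numerical values come from the table.

```python
import itertools
import math

DELTA_H = [-0.2, -0.1, 0.0, 0.1, 0.2]                               # altitude change, km
DELTA_PSI = [-math.pi/6, -math.pi/12, 0.0, math.pi/12, math.pi/6]   # heading change, rad
DELTA_V = [-0.2, -0.1, 0.0, 0.1, 0.2]                               # velocity change, Mach

# Joint discrete action space: every (Δh, Δψ, Δv) combination.
ACTIONS = list(itertools.product(DELTA_H, DELTA_PSI, DELTA_V))
```

A discrete policy head then only needs to output an index in [0, 125), which keeps the high-level action space small regardless of the number of UAVs.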
| Name | Value |
|---|---|
| n_rollout_threads | 32 |
| use_valuenorm | True |
| hidden_sizes | [128, 128] |
| activation_func | relu |
| gain | 0.01 |
| use_recurrent_policy | True |
| recurrent_n | 1 |
| learn_rate | 0.0005 |
| Scale | Initial Conditions | Similar Conditions | Simple Conditions | Strict Conditions |
|---|---|---|---|---|
| 3V3 | 63.8% | 60.8% | 53.4% | 25.4% |
| 6V6 | 76.5% | 73.5% | 48.3% | 32.5% |
| 9V9 | 84.7% | 79.4% | 55.6% | 38.5% |
| 12V12 | 88.0% | 85.6% | 59.1% | 41.3% |
| Method | 3V3 Win Rate | 3V3 Avg. OSSM | 6V6 Win Rate | 6V6 Avg. OSSM | 9V9 Win Rate | 9V9 Avg. OSSM | 12V12 Win Rate | 12V12 Avg. OSSM |
|---|---|---|---|---|---|---|---|---|
| Ours (baseline) | 63.8% | 0.32 | 76.5% | 0.35 | 84.7% | 0.38 | 88.0% | 0.34 |
| Ours w/o DTA | 56.3% | 0.33 | 50.1% | 0.36 | 42.6% | 0.39 | 31.5% | 0.32 |
| Ours w/o TP | 52.3% | 0.31 | 68.5% | 0.36 | 73.5% | 0.35 | 76.7% | 0.31 |
| Ours w/o SAG | 60.6% | 0.85 | 72.5% | 0.83 | 80.3% | 0.81 | 83.5% | 0.86 |
| Ours w/o HL | 51.6% | 1.23 | 59.5% | 1.18 | 64.6% | 1.33 | 71.5% | 1.21 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Tan, M.; Sun, H.; Ding, D.; Zhou, H.; Liu, Y. Scalable Pursuit–Evasion Game for Multi-Fixed-Wing UAV Based on Dynamic Target Assignment and Hierarchical Reinforcement Learning. Drones 2026, 10, 5. https://doi.org/10.3390/drones10010005