# Hierarchical Maneuver Decision Method Based on PG-Option for UAV Pursuit-Evasion Game

## Abstract


## 1. Introduction

- (1) Comprehensively considering the constraint information in the UAV pursuit-evasion process, this paper constructs a flight control model for a UAV and introduces the concept of a threat zone, which makes the simulation more realistic;
- (2) Considering the various situations that arise during UAV pursuit-evasion, four meta-policies are designed for UAV flight decision-making, which not only enrich the maneuver library but also improve the effectiveness of the algorithm;
- (3) A hierarchical maneuver decision-making method is designed to ensure that the UAV can flexibly switch between meta-policies under different situations and achieve victory.

## 2. Problem Formulation and Preliminaries

#### 2.1. Scenario Description and Assumptions

- (a) The UAV is assumed to be a rigid body, and the gravitational acceleration is treated as a constant;
- (b) This paper ignores the influence of the earth's curvature, rotation, and revolution on UAV flight;
- (c) To reduce the complexity of the UAV flight model, this paper considers only the kinematics model.

#### 2.2. UAV Model
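
The kinematic model itself is not reproduced in this excerpt. As an illustration consistent with assumption (c), a common 3-DOF point-mass kinematic update can be sketched as follows; the state layout, function name, and step size are assumptions, not the paper's exact equations:

```python
import math

def step_kinematics(state, controls, dt=0.1):
    """Advance a 3-DOF point-mass UAV kinematic model by one time step.

    state    = (x, y, z, v, pitch, yaw)  -- position (m), speed (m/s), angles (rad)
    controls = (dv, dpitch, dyaw)        -- commanded changes per step

    Illustrative model only, not the paper's exact equations.
    """
    x, y, z, v, pitch, yaw = state
    dv, dpitch, dyaw = controls
    # Apply the commanded changes to speed and attitude
    v += dv
    pitch += dpitch
    yaw += dyaw
    # Move the point mass along its velocity vector
    x += v * math.cos(pitch) * math.cos(yaw) * dt
    y += v * math.cos(pitch) * math.sin(yaw) * dt
    z += v * math.sin(pitch) * dt
    return (x, y, z, v, pitch, yaw)
```

With zero control inputs and level flight at 100 m/s, each 0.1 s step advances the UAV 10 m along its heading.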

## 3. Hierarchical Maneuver Decision Method for UAV Pursuit-Evasion Game

#### 3.1. Meta-Policy Model Training Algorithm Based on SAC

- ${R}_{mq1}$ is the continuous angle penalty for our UAV, which penalizes the agent at every step of a round: ${R}_{mq1}=-{q}_{m}/180$;
- ${R}_{mq2}$ is the sparse angle reward for our UAV, granted only under certain conditions in a round: ${R}_{mq2}=1,\ \mathrm{if}\ {q}_{m}\le {q}_{\mathrm{max}}$;
- ${R}_{tq1}$ is the continuous angle penalty for the opponent UAV, which penalizes the agent at every step of a round: ${R}_{tq1}=-{q}_{t}/180$;
- ${R}_{tq2}$ is the sparse angle reward for the opponent UAV, granted only under certain conditions in a round: ${R}_{tq2}=1,\ \mathrm{if}\ {q}_{t}\le {q}_{\mathrm{max}}$;
- ${R}_{d1}$ rewards the agent when the opponent UAV is inside the threat zone of our UAV: ${R}_{d1}=1,\ \mathrm{if}\ {D}_{\mathrm{min}}<d<{D}_{\mathrm{max}}$;
- ${R}_{d2}$ penalizes the agent when the opponent UAV is outside the threat zone of our UAV: ${R}_{d2}=-3,\ \mathrm{if}\ d\ge {D}_{\mathrm{max}}\ \mathrm{or}\ d\le {D}_{\mathrm{min}}$;
- ${R}_{d3}$ is the ratio of the distance between the two sides to the maximum threat distance: ${R}_{d3}=d/{D}_{\mathrm{max}}$.
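
The component rewards above can be written as a short Python sketch. The relational operators in the sparse terms are reconstructions, and the function names and default thresholds (taken from the parameter table in Section 4) are illustrative:

```python
def angle_rewards(q_m, q_t, q_max=20.0):
    """Continuous penalties and sparse rewards for the relative azimuths (degrees)."""
    R_mq1 = -q_m / 180.0                  # continuous angle penalty (our UAV)
    R_mq2 = 1.0 if q_m <= q_max else 0.0  # sparse angle reward (our UAV)
    R_tq1 = -q_t / 180.0                  # continuous angle penalty (opponent UAV)
    R_tq2 = 1.0 if q_t <= q_max else 0.0  # sparse angle reward (opponent UAV)
    return R_mq1, R_mq2, R_tq1, R_tq2

def distance_rewards(d, D_min=1000.0, D_max=3000.0):
    """Threat-zone distance rewards; d and the thresholds share one unit (metres)."""
    R_d1 = 1.0 if D_min < d < D_max else 0.0            # opponent inside the threat zone
    R_d2 = -3.0 if (d >= D_max or d <= D_min) else 0.0  # opponent outside the threat zone
    R_d3 = d / D_max                                    # normalized separation
    return R_d1, R_d2, R_d3
```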

- Advantage Game (AG): AG refers to the situation in which the two UAVs are in each other’s threat zones. In this situation, our UAV needs to escape from the opponent’s threat zone as soon as possible while keeping the distance between the two sides within a small range, so as to facilitate capturing the opponent. For the AG task, the total reward is ${R}_{AG}$: ${R}_{AG}={w}_{1}*({R}_{mq1}+{R}_{mq2})-{w}_{2}*({R}_{tq1}+{R}_{tq2})+{w}_{3}*({R}_{d1}+{R}_{d2}+{R}_{d3}/5)$. This paper holds that the rewards of both sides are equally important to the AG task, so ${w}_{1}={w}_{2}=0.4$. Furthermore, because the primary task of AG is to ensure that the UAV escapes from the opponent’s threat zone quickly, the weight of the distance reward is small; that is, ${w}_{3}=0.2$.
- Quick Escape (QE): QE means that our UAV is located in the opponent’s threat zone, the opponent UAV is not in our threat zone, and the opponent’s advantage is greater than ours. The primary task of our UAV is to maneuver quickly to escape the opponent’s threat zone. For the QE task, the total reward is ${R}_{QE}$: ${R}_{QE}={w}_{1}*({R}_{tq1}+{R}_{tq2})-{w}_{2}*({R}_{d1}+{R}_{d3}/5)$. In the QE task, the angle reward and the distance reward are equally important; that is, ${w}_{1}={w}_{2}=0.5$.
- Situation Change (SC): SC means that neither UAV is in the other’s threat zone, but the opponent’s advantage is greater than ours. Our primary task is to maneuver quickly to change the situation so that our advantage becomes greater than the opponent’s. For the SC task, the total reward is ${R}_{SC}$: ${R}_{SC}={w}_{1}*({R}_{mq1}+{R}_{mq2})-{w}_{2}*({R}_{tq1}+{R}_{tq2})+{w}_{3}*{R}_{d3}$. This paper holds that our angle reward and the distance reward are equally important to the SC task; that is, ${w}_{1}={w}_{3}=0.4$, ${w}_{2}=0.2$.
- Quick Pursuit (QP): QP means that neither UAV is in the other’s threat zone, and our UAV’s advantage is greater than the opponent’s. The primary task of our UAV is to capture the opponent UAV; our UAV needs to maneuver quickly to make the opponent fall into our threat zone. For the QP task, the total reward is ${R}_{QP}$: ${R}_{QP}={w}_{1}*({R}_{mq1}+{R}_{mq2})+{w}_{2}*({R}_{d1}+{R}_{d2}/5)$. In the QP task, the angle reward and the distance reward are equally important; that is, ${w}_{1}={w}_{2}=0.5$.
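
Using the weights stated above, the four task rewards combine the component rewards as follows; `task_reward` is an illustrative name, and the component values are assumed to be computed as defined in the reward list:

```python
def task_reward(task, R_mq1, R_mq2, R_tq1, R_tq2, R_d1, R_d2, R_d3):
    """Combine component rewards per meta-policy, using the weights stated in the text."""
    if task == "AG":  # w1 = w2 = 0.4, w3 = 0.2
        return 0.4 * (R_mq1 + R_mq2) - 0.4 * (R_tq1 + R_tq2) + 0.2 * (R_d1 + R_d2 + R_d3 / 5)
    if task == "QE":  # w1 = w2 = 0.5
        return 0.5 * (R_tq1 + R_tq2) - 0.5 * (R_d1 + R_d3 / 5)
    if task == "SC":  # w1 = w3 = 0.4, w2 = 0.2
        return 0.4 * (R_mq1 + R_mq2) - 0.2 * (R_tq1 + R_tq2) + 0.4 * R_d3
    if task == "QP":  # w1 = w2 = 0.5
        return 0.5 * (R_mq1 + R_mq2) + 0.5 * (R_d1 + R_d2 / 5)
    raise ValueError(f"unknown meta-policy: {task}")
```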

Algorithm 1: Meta-Policy Training Algorithm

1. Randomly generate the parameters $\theta ,{\phi}_{1},{\phi}_{2}$
2. Initialize the policy network ${\pi}_{\theta}$ and two soft-Q networks ${Q}_{{\phi}_{1}},{Q}_{{\phi}_{2}}$
3. Initialize the target networks of the soft-Q networks: ${\phi}_{1}^{\prime}\leftarrow {\phi}_{1}$, ${\phi}_{2}^{\prime}\leftarrow {\phi}_{2}$
4. FOR t = 0 to T:
5. Get state $s$
6. IF t < start_size:
7. $a=random()$
8. ELSE:
9. $\mu ,\log \sigma ={\pi}_{\theta}(s)$, $a=\tanh (\mu +\epsilon \cdot \exp (\log \sigma ))$, $\epsilon \sim \mathcal{N}(0,1)$
10. The agent performs action $a$ and receives the reward $r$ and the next state $s^{\prime}$
11. Store the tuple $<s,a,r,s^{\prime}>$ in the experience pool
12. IF the number of transitions in the experience pool > batch_size:
13. Sample batch_size tuples $<s,a,r,s^{\prime}>$ from the experience pool
14. Update the soft-Q network parameters: ${\phi}_{i}\leftarrow {\phi}_{i}-{\lambda}_{Q}{\hat{\nabla}}_{{\phi}_{i}}{J}_{Q}({\phi}_{i})$
15. Update the policy network parameters: $\theta \leftarrow \theta -{\lambda}_{\pi}{\hat{\nabla}}_{\theta}{J}_{\pi}(\theta )$
16. Adjust the temperature parameter: $\alpha \leftarrow \alpha -{\lambda}_{\alpha}{\hat{\nabla}}_{\alpha}J(\alpha )$
17. Update the target network parameters: ${\phi}_{1}^{\prime}\leftarrow \tau {\phi}_{1}+(1-\tau ){\phi}_{1}^{\prime}$, ${\phi}_{2}^{\prime}\leftarrow \tau {\phi}_{2}+(1-\tau ){\phi}_{2}^{\prime}$
18. END IF
19. Let $s\leftarrow s^{\prime}$ and return to step 5
20. END FOR
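
The control flow of Algorithm 1 can be sketched in Python. The environment, policy, and gradient updates are stubbed out (names such as `env_step` and `start_size` are illustrative), leaving only the random warm-up, experience pool, batch sampling, and the soft target update of step 17:

```python
import random
from collections import deque

import numpy as np

# --- Stubs standing in for the real environment and SAC networks (illustrative) ---
def env_reset():
    return np.zeros(4)

def env_step(s, a):
    # Returns next state, reward, done flag; a toy stand-in for the pursuit-evasion env
    return np.random.randn(4), -abs(float(a)), False

def policy(s):
    return float(np.tanh(np.random.randn()))  # placeholder for pi_theta

def soft_update(target, source, tau=0.005):
    """phi' <- tau * phi + (1 - tau) * phi', applied element-wise (step 17)."""
    return tau * source + (1.0 - tau) * target

pool = deque(maxlen=100_000)      # experience pool
batch_size, start_size, T = 64, 100, 500
phi1 = np.ones(8)                 # stand-in for soft-Q weights
phi1_target = np.zeros(8)         # stand-in for target-network weights

s = env_reset()
for t in range(T):
    # Warm-up with random actions, then sample from the policy (steps 6-9)
    a = random.uniform(-1.0, 1.0) if t < start_size else policy(s)
    s_next, r, done = env_step(s, a)
    pool.append((s, a, r, s_next))                     # step 11
    if len(pool) > batch_size:
        batch = random.sample(list(pool), batch_size)  # step 13
        # ... gradient updates of the Q networks, policy, and temperature go here ...
        phi1_target = soft_update(phi1_target, phi1)   # step 17
    s = env_reset() if done else s_next                # step 19
```

The target weights drift slowly toward the online weights, which is what stabilizes the soft-Q bootstrapping in SAC.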

#### 3.2. Hierarchical Decision Method Based on PG-Option

Algorithm 2: Hierarchical Maneuver Decision Algorithm Based on PG-Option

1. Randomly generate the policy-selector network’s parameters $\varphi$; initialize the experience pool $D$
2. Initialize the state variable ${s}_{previous}$ and the agent’s initial state ${s}_{0}$; let ${s}_{previous}={s}_{0}$
3. FOR m = 1 to M:
4. Initialize the stage counter: $p=0$
5. Periodically select a meta-policy ${g}_{\pi}$ at a frequency of 10 Hz
6. FOR t = 1 to T:
7. Generate and execute the action ${a}_{t}$; obtain the reward ${r}_{t}$ and the new state ${s}_{0+t}$
8. IF the current meta-policy execution ends:
9. Save $({s}_{previous},{g}_{\pi},{r}_{t})$ into the experience pool
10. ${s}_{previous}={s}_{0+t}$, $p=p+1$
11. END IF
12. IF the current round ends:
13. IF the opponent UAV is captured:
14. Update the reward value ${r}_{t}=r(\tau )/t$
15. ELSE:
16. Update the reward value ${r}_{t}=-3$
17. END IF
18. Update the policy selector’s network parameters $\varphi$ and clear the experience pool $D$
19. Reinitialize the environment to obtain the agent’s initial state
20. END IF
21. END FOR
22. END FOR
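
The mapping from situation to meta-policy described in Section 3.1 can be illustrated with a hand-written selector. In the paper this choice is made by the trained policy-selector network, so the rules, names, and placeholder rewards below are only a sketch:

```python
META_POLICIES = ["AG", "QE", "SC", "QP"]

def select_meta_policy(opponent_in_our_zone, we_in_their_zone, our_adv, their_adv):
    """Rule-of-thumb mapping from situation to meta-policy, mirroring Section 3.1.

    Illustrative only: the paper learns this choice with a policy selector.
    """
    if opponent_in_our_zone and we_in_their_zone:
        return "AG"   # mutual threat: advantage game
    if we_in_their_zone and not opponent_in_our_zone:
        return "QE"   # only we are threatened: quick escape
    if not opponent_in_our_zone and not we_in_their_zone:
        return "SC" if their_adv > our_adv else "QP"
    return "QP"       # only the opponent is threatened: pursue

# Hierarchical loop sketch: the selector picks an option, the chosen meta-policy
# runs until it ends, and (state, option, reward) tuples feed the selector update.
experience_pool = []
s_previous, p = "s0", 0
for t in range(1, 6):
    g_pi = select_meta_policy(False, False, our_adv=0.3, their_adv=0.7)
    r_t = -0.1                                   # placeholder step reward
    experience_pool.append((s_previous, g_pi, r_t))
    s_previous, p = f"s{t}", p + 1
```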

## 4. Experiments and Results

#### 4.1. Parameters and Hardware

#### 4.2. Results of Train and Simulation

#### 4.3. Generalization Simulation

#### 4.3.1. Active Capture Situation

#### 4.3.2. Pursued Situation

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References


Meta-Policy | Success | Failure
---|---|---
AG | Moves more than 1000 m outside the opponent UAV’s threat zone | Fails to break away from the opponent’s threat zone within 500 steps
QE | Moves more than 2000 m outside the opponent UAV’s threat zone | Fails to break away from the opponent’s threat zone within 500 steps
SC | The relative azimuth of the opponent is less than 25° | Our relative azimuth does not exceed the opponent’s relative azimuth within 500 steps
QP | Successfully captures the opponent | The opponent UAV fails to enter our threat zone within 500 steps

Parameter | Value
---|---
Our UAV pitch angle range | [0°, 4°]
Our UAV heading angle range | [0°, 10°]
Our UAV speed change range | [0 m/s, 10 m/s]
Our UAV threat distance range | [1 km, 3 km]
Our UAV maximum speed | 200 m/s
Our UAV maximum threat angle | 20°
Our UAV maximum time threshold | 2 s
Opponent UAV pitch angle range | [0°, 4°]
Opponent UAV heading angle range | [0°, 10°]
Opponent UAV speed change range | [0 m/s, 10 m/s]
Opponent UAV threat distance range | [1 km, 3 km]
Opponent UAV maximum speed | 200 m/s
Opponent UAV maximum threat angle | 20°
Opponent UAV maximum time threshold | 2 s

Hyperparameter | Value | Hyperparameter | Value
---|---|---|---
Actor network learning rate | 3 × 10^{−4} | Total number of rounds | 10,000
Critic network learning rate | 3 × 10^{−4} | Max steps in one round | 500
Experience pool size | 100,000 | Soft update parameter | 0.005
Batch size | 64 | Discount rate | 0.99
Target entropy value | −3 | Regularization coefficient of entropy | 1

 | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
---|---|---|---|---|---|---|---|---
Our UAV | 2 | 3.5 | −3 | 2 | 50 | 70 | 7.457 | 97.3
Opponent UAV | −3.5 | 3 | 2 | 1 | −40 | 75 | |

 | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
---|---|---|---|---|---|---|---|---
Our UAV | 2 | 2 | 3 | 2 | 50 | 70 | 7.991 | 29.6
Opponent UAV | −2 | 9 | 3.5 | 1 | −40 | 80 | |

 | X (km) | Y (km) | Z (km) | Pitch (°) | Yaw (°) | Speed (m/s) | Distance (km) | Azimuth (°)
---|---|---|---|---|---|---|---|---
Our UAV | 2.1 | 2.9 | 3.0 | 2 | 50 | 70 | 8.01 | 165.1
Opponent UAV | −1.8 | 8.7 | 3.6 | 1 | 70 | 90 | |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Li, B.; Zhang, H.; He, P.; Wang, G.; Yue, K.; Neretin, E.
Hierarchical Maneuver Decision Method Based on PG-Option for UAV Pursuit-Evasion Game. *Drones* **2023**, *7*, 449.
https://doi.org/10.3390/drones7070449
