An Improved Approach towards Multi-Agent Pursuit–Evasion Game Decision-Making Using Deep Reinforcement Learning

A pursuit–evasion game is a classical maneuver confrontation problem in the multi-agent systems (MASs) domain. An online decision technique based on deep reinforcement learning (DRL) was developed in this paper to address the problem of environment sensing and decision-making in pursuit–evasion games. A control-oriented framework developed from the DRL-based multi-agent deep deterministic policy gradient (MADDPG) algorithm was built to implement multi-agent cooperative decision-making to overcome the limitation of the tedious state variables required for the traditionally complicated modeling process. To address the effects of errors between a model and a real scenario, this paper introduces adversarial disturbances. It also proposes a novel adversarial attack trick and adversarial learning MADDPG (A2-MADDPG) algorithm. By introducing an adversarial attack trick for the agents themselves, uncertainties of the real world are modeled, thereby optimizing robust training. During the training process, adversarial learning was incorporated into our algorithm to preprocess the actions of multiple agents, which enabled them to properly respond to uncertain dynamic changes in MASs. Experimental results verified that the proposed approach provides superior performance and effectiveness for pursuers and evaders, and both can learn the corresponding confrontational strategy during training.


Introduction
With the development of the three generations of artificial intelligence [1], the technology of multi-agent systems (MASs) has been widely used in many areas of society, such as multi-agent motion planning, complex IT systems, and computer communication technology [2-5]. Pursuit-evasion games have been widely investigated in MASs in recent years and have been extended to various fields, including maneuvering target tracking, surveillance early warning, anti-intrusion protection, and intelligent transportation [6,7]. The goal of these studies is to provide good strategies for pursuers and evaders. Pursuers aim to round up as many evaders as possible through cooperative decision-making. Evaders need to choose the best strategy based on the actions of the pursuers and design an escape path that prevents capture [8].
To address this problem, a series of research activities on agent-based pursuit-evasion games has been carried out in the differential gaming field. Isaacs [9] proposed a one-to-one robot hunting problem in which partial differential equations describing the pursuer and the evader were created and solved analytically. Furthermore, a generalized maximum-minimum solving method of the Hamilton-Jacobi equation for pursuit-evasion games was provided by Krasovskii [10]. Because directly solving differential equations in complex control problems is very complicated and consumes many computing resources, researchers proposed intelligent optimization algorithms that provide new ideas for solving the differential equation problems associated with pursuit-evasion games. Chen et al. [11] simulated fish foraging behavior and proposed a cooperative pursuit strategy that studied pursuit and evasion when trackers have a constrained turning rate. Wang et al. [12] introduced an alliance generation algorithm that generates a synergistic strategy based on the emotional factors of multi-robot systems, which ensured that a team's agents worked towards a common goal. However, the complicated control process governing these issues involves many constraints and state variables, which makes the solution intricate, especially in a complex and dynamic scenario with multi-agent confrontation. Therefore, more intelligent algorithms are needed to effectively solve the problem of pursuit-evasion games.
By combining deep learning's ability to perceive high-dimensional data [13] and reinforcement learning's decision-making ability [14], deep reinforcement learning (DRL) provides a new optimization scheme for intelligent decision-making or control. Because techniques based on deep reinforcement learning do not require the establishment of a differential game model and agents can learn the optimal confrontation strategy only through interaction with the environment [15], some scholars introduced deep reinforcement learning in pursuit-evasion games and acquired the Nash equilibrium of the problem. Xu et al. [16] established a multi-agent reinforcement learning model for UAV pursuit-evasion in which relative motion state equations were employed. As a result, the pursuit-evasion issue was converted into a zero-sum game addressed through minimax-Q learning. In predatory games, Park et al. [17] set up a co-evolution framework for predator and prey to allow multiple agents to learn good policies by deep reinforcement learning. Gu et al. [18] presented an attention-based fault-tolerant model, which could also be applied to pursuit-evasion games; the key idea was to use the multi-head attention mechanism to select the correct and useful information for estimating the critics. To solve the complicated training problems caused by discrete action sets introduced by deep Q networks [19], Liu et al. [20] transformed a space rendezvous optimization problem between a space vehicle and a noncooperative target into a pursuit-evasion differential game. They introduced a branching architecture with a group of parallel neural networks and shared decision modules. To overcome the unstable recognition ability of pursuers, Qadir et al. [21] proposed a novel approach for self-organizing feature maps and deep reinforcement learning based on the agent group role membership function model. Experiments verified the effectiveness of this method for facilitating the capture of evaders by mobile agents. Singh et al. [22] built on the actor-critic model-free multi-agent deep deterministic policy gradient algorithm to operate over the continuous spaces of pursuit-evasion games. In their approach, the evader's strategy is not learned. It is based on Voronoi regions that pursuers try to minimize and evaders try to maximize.
Although they represent progress, previous studies on DRL-based pursuit-evasion games are still in their early stages. In these studies, pursuing platforms are assumed to be equipped with error-free identification and measurement systems that allow them to acquire precise information about the position, velocity, and other characteristics of evaders and cooperators [6,23]. However, sensors and other equipment configured in an unmanned system encounter positioning, sensing, and actuator errors in reality [24,25]. These errors make the environment uncertain, thereby affecting the strategies of the pursuers and evaders and degrading their performance. Therefore, designing a robust algorithm for MASs that effectively mitigates these errors would be significant for application research in real-world multi-agent decision-making. This paper introduces a novel multi-agent algorithm to address the decision-making problem of pursuit-evasion games. The algorithm can solve pursuit-evasion games in complex virtual and real environments, where there are static or moving obstacles that pursuers and evaders need to avoid while making decisions. Specifically, we make the following contributions in this paper:
(1) We develop an actor-critic-based motion control framework based on the multi-agent deep deterministic policy gradient (MADDPG) [26], which can take the state and behavior of other partners into account and is used to provide collaborative decision-making capabilities for each agent in the MAS;
(2) We propose an advanced algorithm called A2-MADDPG, which uses two techniques to make the training strategy robust. The first is an adversarial attack trick for agents. It samples the agent's state after stochastic Gaussian noise is applied, and this approach can train a robust agent to cope with measurement errors in the real world. The second is the optimized adversarial learning technique [27]. It is introduced to improve agent stability and to assist in adapting to noise produced by interactions between multiple agents;
(3) We verified the effectiveness and robustness of the algorithm in simulation experiments. We compared the performance of the proposed method with two common and advanced algorithms, namely the MADDPG and the independent multi-agent deep deterministic policy gradient (IMDDPG), where the IMDDPG is a natural extension of the DDPG [28] to the multi-agent setting. Through a series of experiments, we show that the proposed method outperforms the MADDPG and IMDDPG for both pursuers and evaders under the same hyperparameter and simulation environment settings, and it can help them both develop robust motion strategies.
The rest of the paper is structured as follows: Section 2 provides background information about multi-agent pursuit-evasion games and describes related theoretical approaches. Section 3 introduces a framework for collaborative pursuit missions and an improved A2-MADDPG algorithm where an adversarial attack trick and an adversarial learning-based optimization method are combined with the MADDPG. Section 4 verifies the robustness and high performance of the algorithm through simulation experiments. Section 5 provides a conclusion and envisages future work.

Background
In this section, the kinematic and observation model of agents executing a pursuit-evasion task is presented. In addition, the essential theoretical background of the DRL-based MADDPG algorithm and adversarial learning is introduced.

Problem Definition
The multi-agent pursuit-evasion game problem can be described as follows: there are pursuers (red agents) and evaders (blue agents), as shown in Figure 1. The two agent types have different tasks based on their maneuverability. Each agent can perceive the relative position of the threat zone (gray circle) using radar and sensors. The velocity and position of each agent are provided by its navigation equipment, and agents share information by transmitting it through a signaling connection. Pursuers are equipped with an attack or shielding interference device (the red circle represents the attack range of the pursuer), and their mission is considered successful when they suppress an evader by approaching it. Evaders must stay away from pursuers. Neither pursuers nor evaders can exceed their boundaries.

Comparisons of Operators
A general decision-making program for pursuit-evasion gaming is primarily used to determine the communication and cooperation between platforms and achieve target pursuit, without fully considering the maneuvering characteristics of the platforms. Both agent types in this paper are mobile UAVs flying at a fixed altitude with nonholonomic constraints [29], as portrayed in Figure 2. The status update of each UAV is described in terms of the following quantities: $p^t$, $v^t$, and $\phi^t$ denote the position, velocity, and yaw angle, where the superscript $t$ represents time $t$; $\Delta t$ is the time interval; and $a$ is the UAV's acceleration. Considering power systems and mechanical limitations, the maximal velocity and acceleration are assumed to be $v_{max}$ and $a_{max}$, whose values are given in the simulation settings.
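As a rough illustration of the status update described above, the following sketch advances a planar UAV by one time step. Only the quantities $p^t$, $v^t$, $\phi^t$, $a$, $\Delta t$, $v_{max}$, and $a_{max}$ come from the text; the explicit-Euler form, the function name `step_uav`, the yaw-rate input `omega`, and the clipping rules are assumptions made for illustration and may differ from the update actually used in the paper.

```python
import math

def step_uav(p, v, phi, a, omega, dt, v_max, a_max):
    """Advance a planar UAV by one time step (assumed explicit-Euler form)."""
    # Respect the assumed actuator limit on acceleration.
    a = max(-a_max, min(a_max, a))
    # Position is advanced with the current speed and heading.
    x, y = p
    p_next = (x + v * math.cos(phi) * dt, y + v * math.sin(phi) * dt)
    # Speed is clipped to [0, v_max].
    v_next = max(0.0, min(v_max, v + a * dt))
    # Yaw is advanced by the (hypothetical) yaw rate and wrapped to [-pi, pi].
    phi_raw = phi + omega * dt
    phi_next = math.atan2(math.sin(phi_raw), math.cos(phi_raw))
    return p_next, v_next, phi_next
```

Because the heading enters the position update, the vehicle cannot move sideways, which is the nonholonomic constraint mentioned above.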

Observation Model
The observation model provides the agent with the ability to sense the environment [30]. In this multi-agent pursuit-evasion task, $(x_a^i, y_a^i)$ is the position of each agent in the pursuit formation, and $(x_e^j, y_e^j)$ and $(x_o^k, y_o^k)$ represent the positions of the center point of the evader and of the threatened area, respectively. The numbers of pursuers, evaders, and obstacles in the environment are defined as num_P, num_E, and num_o, respectively. Since pursuers and evaders must avoid obstacles when making decisions, these obstacles make the pursuit-evasion problem more difficult to solve. The formation of all pursuers is denoted as $A$. An agent $i$ on the pursuers' team can use radar detection and communication transmission to obtain its own local observation from the environment, which combines its self-observation with $O_c^i$, $O_e^i$, and $O_o^i$. Here, $v_{sx}^i$, $v_{sy}^i$, $x_s^i$, and $y_s^i$ represent the self-observed velocity and position of the agent on the x and y axes. $O_c^i$ indicates the observed location of other pursuers in the formation, and $l$ is the sequence number of other pursuers on the team. $O_e^i$ denotes the observed location of an evader, and $j$ represents its sequence number. $O_o^i$ represents information observed about an obstacle, and $k$ is the obstacle number. Considering a real mission scenario, a set of range sensors is employed to help the unmanned system detect possible threats from obstacles ahead of it within the sensor range. As shown in Figure 3, the 90° sector containing the blue arc within the sensor range is the agent's threat detection area. An agent's observation of an obstacle is divided into five parts, $d_1, \ldots, d_5$, which denote the five sensor indications. We set $d_1, \ldots, d_5 = L$ when a threat is not detected. Based on the comprehensive observation information above, an agent can perceive and assess the situation.
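The five sensor indications can be illustrated with a small ray-casting sketch. It assumes, purely for illustration, that the five rays are spread evenly over the 90° detection sector, that obstacles are circles given as `(center, radius)` pairs, and that `max_range` plays the role of $L$; the function names are hypothetical and not taken from the paper.

```python
import math

def ray_circle_distance(origin, angle, center, radius, max_range):
    """Distance along a ray to a circle, or max_range if nothing is hit."""
    dx, dy = math.cos(angle), math.sin(angle)
    fx, fy = origin[0] - center[0], origin[1] - center[1]
    # Solve |origin + t*d - center|^2 = radius^2 for the nearest t >= 0.
    b = fx * dx + fy * dy
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - c
    if disc < 0:
        return max_range  # the ray misses the circle
    t = -b - math.sqrt(disc)
    if t < 0 or t > max_range:
        return max_range  # hit is behind the sensor or out of range
    return t

def obstacle_observation(position, heading, obstacles, max_range):
    """Return five readings d_1..d_5 over a 90-degree sector ahead."""
    offsets = [math.radians(k) for k in (-45.0, -22.5, 0.0, 22.5, 45.0)]
    readings = []
    for off in offsets:
        hits = (ray_circle_distance(position, heading + off, c, r, max_range)
                for c, r in obstacles)
        # With no obstacle in view, the reading defaults to max_range (L).
        readings.append(min(hits, default=max_range))
    return readings
```

For an agent at the origin heading along the x axis, with one obstacle of radius 1 centered at (5, 0) and `max_range = 10`, only the central ray reports a hit in this sketch, at a distance of 4.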

Theoretical Context
Deep reinforcement learning is a representative intelligent machine learning algorithm, and adversarial learning can increase the stability and robustness of the model trained by reinforcement learning [31]. They provide new research ideas for multi-agent pursuit-evasion decision-making. In this section, adversarial learning, the DRL-based DDPG algorithm, and the MADDPG algorithm are introduced.

Adversarial Learning
Adversarial learning is a technique for defending against adversarial samples [32]. This approach attempts to improve the accuracy of neural network models by training on adversarial samples and normal samples together and reducing the interference of the adversarial samples. The robustness and generalization ability of the resulting network are improved. Adversarial training can be expressed as follows:
$$\min_{\theta} \max_{D(x, x') < \varepsilon} L_{adv}(x', y; \theta)$$
where $x$ and $x'$ denote the original sample and the adversarial sample, respectively, $y$ is the label value, and $\theta$ is the weight of the networks. $D(x, x')$ represents the distance measurement between the original sample and the adversarial sample, and $L_{adv}(x', y; \theta)$ represents the adversarial loss function. In the min-max form, the internal maximization problem is to find the optimal adversarial sample, and the external minimization problem is to minimize the loss function. The learning process of adversarial training is depicted in Figure 4.
The fast gradient sign method (FGSM) efficiently generates adversarial samples [33]. The FGSM uses the gradient of a model's objective loss function with respect to the input vector to calculate an adversarial perturbation, which it adds to the corresponding input. This generates adversarial samples that correspond to the original samples. Suppose that in a classification problem, the output label of the model is class = {0, 1}. After adversarial training, the model will have higher prediction confidence; that is, the model will output the correct sample label even if a small adversarial disturbance is added to the sample. This process can be defined as:
$$x^* = x + \eta_{adv}$$
where $\eta_{adv}$ represents the sample perturbation added and $x^*$ represents the adversarial sample after adding the perturbation. Each time the model is trained, the FGSM performs an optimization along the gradient direction of the adversarial loss function $L_{adv}(x', y; \theta)$, and adversarial samples are obtained. The generation process of the sample disturbance $\eta_{adv}$ can be expressed as:
$$\eta_{adv} = \varepsilon \, \mathrm{sign}(g), \qquad g = \nabla_x L(x, y; \theta)$$
where $\varepsilon$ denotes the magnitude of the disturbance and $g$ is the gradient of the loss function with respect to the input vector. Moreover, the FGSM-based target loss function can be defined as:
$$\tilde{L}(x, y; \theta) = c \, L(x, y; \theta) + (1 - c) \, L\left(x + \varepsilon \, \mathrm{sign}(\nabla_x L(x, y; \theta)), y; \theta\right)$$
where $c$ is an equilibrium coefficient that is used to balance the original and attack samples. As a result, the adversarial examples used in the adversarial learning method can improve the generalization ability of a model by adding a regularization term to the loss function. The goal of adversarial training is to minimize the loss function in the worst case.
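To make the FGSM steps above concrete, here is a minimal sketch on a logistic-regression classifier, for which the input gradient has a closed form. The model, the function names, and the numeric values in the usage note are assumptions chosen for illustration; the standard FGSM form $x^* = x + \varepsilon\,\mathrm{sign}(\nabla_x L)$ is used, together with the balanced objective weighted by $c$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    """Binary cross-entropy loss for a single sample with label y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Adversarial sample x* = x + eps * sign(grad_x L)."""
    g = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def adversarial_objective(w, b, x, y, eps, c):
    """Weighted sum of the clean loss and the loss on the FGSM sample."""
    x_adv = fgsm(w, b, x, y, eps)
    return c * loss(w, b, x, y) + (1 - c) * loss(w, b, x_adv, y)
```

With `w = [1.0, -2.0]`, `b = 0.0`, `x = [0.5, 0.5]`, `y = 1`, and `eps = 0.1`, the perturbation moves each input coordinate in the direction that increases the loss, so the loss on the adversarial sample is larger than on the clean one.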

DRL-Based DDPG Algorithm
During the deep reinforcement learning process, an agent completes its interaction with the environment by perceiving the environment and taking appropriate actions. It performs adaptive iterative optimization according to a reward signal, as shown in Figure 5. An effective approach to describe the DRL-based training process is the Markov decision process (MDP) [34], which is represented by a five-tuple $\langle S, A, R, P, \gamma \rangle$. At each time step, an agent interacts with the environment and makes observations, which comprise the agent's state $s \in S$. The agent then performs the action $a \in A$ and obtains the reward $R$ as the environment moves from $s$ to a new state $s'$. $P$ denotes the environmental model, and it represents the probability distribution of transitioning to a new state. $\gamma$ is a discount factor used to balance the impact of instantaneous returns and long-term returns on cumulative rewards.
The deep deterministic policy gradient (DDPG) is an algorithm that combines policy-based actor neural networks with value-based critic neural networks and can be employed for continuous control [28]. The actor online network $\mu$ reacts according to the agent's current observation state $s_t$ and generates a reasonable action $a_t = \mu(s_t)$. The critic online network $Q$ is responsible for evaluating the current action and outputting the action value function $Q(s_t, a_t; \theta^Q)$. $\theta^\mu$ and $\theta^Q$ denote the corresponding parameters of the actor online network and the critic online network. In addition, actor target networks $\mu'$ and critic target networks $Q'$ are constructed for future updates.
After each decision, a training sample $[s_t, a_t, r_t, s'_t]$ for time $t$ is collected in the experience buffer $M$, which is used to iteratively improve the agent's strategies. That is, in the update optimization phase, a stochastic mini-batch of $N$ samples of this format is extracted at every training step. The critic online network is updated according to the TD-error, which is defined as:
$$L(\theta^Q) = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i; \theta^Q) \right)^2, \qquad y_i = r_i + \gamma \, Q'\left(s'_i, \mu'(s'_i; \theta^{\mu'}); \theta^{Q'}\right)$$
Here, $L(\theta^Q)$ is the loss function of the critic network, $y_i$ is the target value (Q-target), and $i$ indexes the extracted samples. Additionally, the actor online network is trained with the following policy gradient:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a; \theta^Q)\big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s; \theta^\mu)\big|_{s = s_i}$$
At regular intervals, the soft update approach with update factor $\tau$ is used to copy the network parameters to the target networks, which can be expressed as:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$$
The MADDPG algorithm is an effective DRL algorithm derived from the DDPG algorithm and can be used to address problems with multi-agent strategies. In the MADDPG, each agent has its own actor-critic framework [35]. For a multi-agent system, the observation set consisting of $n$ agents is $x = (s_1, s_2, \ldots, s_n)$, the action set is $a = (a_1, a_2, \ldots, a_n)$, and the reward set is $r = (r_1, r_2, \ldots, r_n)$. For each agent, its observation and action at a point in time are denoted as $s_i$ and $a_i = \mu_i(s_i|\theta^{\mu_i})$. The agent's actor online network outputs a policy according to the agent's own observations, and its critic online network estimates a centralized action value function $Q_i^{\mu}(x, a_1, a_2, \ldots, a_n)$. This function is based on the status and actions of all agents, as depicted in Figure 6.
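The DDPG update quantities described in this subsection (the Q-target, the mean-squared TD loss, and the soft update with factor $\tau$) reduce to a few lines of arithmetic. The sketch below works on plain lists of numbers rather than neural networks; the function names and the flat-list representation of network parameters are assumptions made for illustration.

```python
def td_targets(rewards, next_q_values, gamma):
    """Q-targets y_i = r_i + gamma * Q'(s'_i, mu'(s'_i))."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]

def critic_loss(targets, q_values):
    """Mean squared TD error over a mini-batch."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n

def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target: tau*theta + (1-tau)*theta'."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

A small `tau` makes the target networks trail the online networks slowly, which is what keeps the Q-target from chasing its own updates.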
During each interaction with the environment, an agent stores relevant experiences in the experience buffer. Unlike the DDPG, the $N$ comprehensive learning samples $[x, a_1, a_2, \ldots, a_n, r_1, r_2, \ldots, r_n, x']$ in the MADDPG are drawn and spliced from the experience buffer of all agents each time an agent is trained. For agent $i$, the critic online network is updated according to:
$$L(\theta^{Q_i}) = \frac{1}{N} \sum_k \left( y^k - Q_i^{\mu}(x^k, a_1^k, \ldots, a_n^k; \theta^{Q_i}) \right)^2, \qquad y^k = r_i^k + \gamma \, Q_i^{\mu'}(x'^k, a'_1, \ldots, a'_n; \theta^{Q'_i})\big|_{a'_j = \mu'_j(s'^k_j)}$$
Meanwhile, its actor online network can be optimized with the policy gradient, which is expressed as:
$$\nabla_{\theta^{\mu_i}} J \approx \frac{1}{N} \sum_k \nabla_{\theta^{\mu_i}} \mu_i(s_i^k|\theta^{\mu_i}) \, \nabla_{a_i} Q_i^{\mu}(x^k, a_1^k, \ldots, a_i, \ldots, a_n^k; \theta^{Q_i})\big|_{a_i = \mu_i(s_i^k)}$$
The MADDPG algorithm also borrows the soft update technique described in Equation (13) from the DDPG.
Although agents trained with the MADDPG can achieve good results in some simple environments, the multi-agent system is very sensitive to changes in the training environment, and the convergence strategies obtained by agents are likely to fall into a local optimum. That is, when the strategies of other agents change, the agent cannot produce the optimal action strategy. In order to improve the robustness of the strategy, this paper combines the MADDPG and adversarial learning to propose the A2-MADDPG algorithm, which is introduced in Section 3.
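The joint samples $[x, a_1, \ldots, a_n, r_1, \ldots, r_n, x']$ used by the MADDPG can be held in a single shared buffer. The class below is a minimal sketch of such a buffer; the name `JointReplayBuffer`, the fixed capacity with oldest-first eviction, and the uniform sampling are assumptions for illustration rather than details given in the paper.

```python
import random
from collections import deque

class JointReplayBuffer:
    """Stores joint transitions (x, actions, rewards, x_next) for all agents."""

    def __init__(self, capacity, seed=None):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, x, actions, rewards, x_next):
        self.storage.append((x, tuple(actions), tuple(rewards), x_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement; raises ValueError if the
        # buffer holds fewer than batch_size transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

Each agent's critic would read the full tuple from a sampled transition, while its actor would only need the agent's own slice of `x`.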

Proposed Method
This section proposes an approach for realizing control for pursuers and evaders in a game that takes place in an uncertain environment. There are obstacles in the environment that both pursuers and evaders need to avoid. In addition, when acquiring specific values in the state space, sensors and other devices encounter positioning, sensing, and actuator errors, resulting in inaccurate values, so the environment is uncertain for pursuers and evaders. An MADDPG-based control framework for multi-agent systems is presented, and it includes the action space, the state space, and specific reward functions. Furthermore, an improved approach called the A2-MADDPG is described. The A2-MADDPG incorporates an adversarial attack trick and adversarial learning into the MADDPG algorithm.

Action Space
When addressing DRL-based multi-agent decision-making, state and action spaces must be defined based on the MDP framework. To ensure that mission control is more similar to the real world, UAVs use dual-channel control; that is, the force on a UAV is controlled directly. The effects of this force are then applied to the UAV's movement attitude and flight velocity. The action output $A_i$ of a dual-channel thrust UAV can be expressed as:
$$A_i = [F_x^i, F_y^i]$$
where the superscript $i$ represents the sequence number of the UAV in an MAS, and $F_x^i$ and $F_y^i$ represent the forces on the x and y axes that the UAV receives. Therefore, the acceleration can be given by:
$$a_x^i = \frac{F_x^i}{m_u}, \qquad a_y^i = \frac{F_y^i}{m_u}$$
where $m_u$ represents the mass of the UAV. The UAV's attitude can then be adjusted when combined with Equation (1).
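The mapping from the dual-channel force action to an acceleration is a division by the mass $m_u$. The sketch below adds one assumption that is not stated in the text: the resulting acceleration vector is rescaled so that its magnitude does not exceed $a_{max}$. The function name is hypothetical.

```python
import math

def force_to_acceleration(force, mass, a_max):
    """Convert the action (F_x, F_y) into an acceleration (a_x, a_y)."""
    ax, ay = force[0] / mass, force[1] / mass
    norm = math.hypot(ax, ay)
    # Assumed saturation: keep the direction, cap the magnitude at a_max.
    if norm > a_max:
        scale = a_max / norm
        ax, ay = ax * scale, ay * scale
    return ax, ay
```

Scaling the vector rather than clipping each axis separately preserves the commanded direction when the actuator saturates.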

State Space
The state space of a UAV provides useful information based on an agent's observation model, which helps the agent sense its surroundings and make decisions. To help both sides during confrontation training, the state spaces of both pursuers and evaders are presented. As per Section 2.1.2, each pursuer's state information is processed and integrated, and it includes the pursuer's position relative to its partners. In this state, $s_{ps}^i = [v_{px}^i, v_{py}^i, x_{ps}^i, y_{ps}^i]$ denotes the position and speed of pursuer $i$ based on self-observed information that has not been processed. The state space of an evader $j$ is defined similarly.

Reward Function
In the traditional MADDPG algorithm, formation cooperation cannot be uniformly controlled since each agent has an independent actor and critic network. When a unit successfully hunts down a target, all agents belonging to the formation receive a positive reward regardless of whether the agent was in effective tracking range or played a positive role in the mission. This is contrary to the incentive policy of real pursuit-evasion scenarios.
To address this problem, a reward function based on the team strategy is presented. An agent receives a positive reward only if it is within a certain distance $\zeta_{att}^{range}$ of the target when the mission terminates. The reward is shaped by three basic elements: (1) distance $r_{distance}^i$: the Euclidean distance is used to judge whether the agent successfully pursued the evader; (2) maneuvering safety $r_{safe}^i$: the agent is punished if it has collided with obstacles or collaborators; and (3) mission criteria $r_{mission}^i$: the threshold $\zeta_{att}^{range}$ is used to judge whether the agent completed the mission. In these three subreward functions, $i$ is the sequence number of the agent, and the function $dis(a, b)$ is used to calculate the Euclidean distance between positions $a$ and $b$. $dis_{obstacle}$ represents the radius of an obstruction, and $dis_{safe}$ represents the minimum safe distance between pursuers. To summarize, the reward function for a pursuer $i$ can be formulated as:
$$r^i = \beta_1 r_{distance}^i + \beta_2 r_{safe}^i + \beta_3 r_{mission}^i$$
The three relative gain factors $\beta_1$, $\beta_2$, and $\beta_3$ represent the respective weights of the three rewards or punishments. Among them, $\beta_1$ is negative, and both $\beta_2$ and $\beta_3$ are positive. To train an autonomous evader, a specific reward function was built according to the distances among the pursuers, evaders, and obstacles, and it has weights that are the inverse of those in the pursuer's reward function.
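A shaped pursuer reward of this kind can be sketched as a weighted sum of three terms. Since the exact subreward definitions are not reproduced in the text above, the forms used here are assumptions: the distance term is the distance to the nearest evader, the safety term is -1 on any collision and 0 otherwise, and the mission term is 1 when the nearest evader is within the attack range. The function and parameter names are hypothetical.

```python
import math

def dis(a, b):
    """Euclidean distance between two 2D positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pursuer_reward(pos, evaders, obstacles, partners,
                   attack_range, dis_safe, betas):
    """Weighted sum of distance, safety, and mission terms (assumed forms)."""
    beta1, beta2, beta3 = betas
    # Distance term: distance to the nearest evader (weighted by beta1 < 0).
    r_distance = min(dis(pos, e) for e in evaders)
    # Safety term: penalty if inside an obstacle or too close to a partner.
    hit_obstacle = any(dis(pos, c) < radius for c, radius in obstacles)
    too_close = any(dis(pos, q) < dis_safe for q in partners)
    r_safe = -1.0 if (hit_obstacle or too_close) else 0.0
    # Mission term: only an agent within the attack range is rewarded.
    r_mission = 1.0 if r_distance <= attack_range else 0.0
    return beta1 * r_distance + beta2 * r_safe + beta3 * r_mission
```

Because the mission term is evaluated per agent, a pursuer far from the captured evader gets no mission bonus in this sketch, which is the team-strategy property the text asks for.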

Adversarial Attack Trick for the Agent
When perceiving the environment in a real scenario, an agent would encounter unavoidable errors due to the detection process, image recognition, signal processing, and satellite position-based parameter measurements. Improving model robustness is of great significance, especially in key intelligent control fields such as UAVs and robots, where tiny errors or noise could lead to immeasurable and undesirable consequences.
To train a robust agent to adapt to measurement errors and other noise in real environments, an adversarial attack trick for the agent itself is proposed. This approach aims to generate random noise in the agent's status, thereby confusing its perception and assisting it in producing a strategy for abnormal conditions [36]. Algorithm 1 summarizes the adversarial attack process, in which the inputs are the action $a_i = \mu_i(s_i|\theta^{\mu_i})$ of an actor online network, the action value $Q_i^{\mu}(x, a_1, a_2, \ldots, a_n)$ of a critic target network, and the state $s_i$ of agent $i$. Over a limited number of iterations $N_a$, the state is combined with stochastic Gaussian noise, and the noisy state that yields the minimum Q value found is retained. Algorithm 1 gives the pseudocode for an agent's adversarial attack trick.

Algorithm 1 Adversarial attack trick for the agent.
1: input: $a_i = \mu_i(s_i|\theta^{\mu_i})$, $Q_i^{\mu}(x, a_1, a_2, \ldots, a_n)$, $s_i$
2: $Q_i = Q_i^{\mu}(x, a_1, a_2, \ldots, a_n)$
3: for $i = 1, N_a$ do
4: $s_{i(noise)} = s_i + \mathrm{Gaussian}(0, \sigma_s^2)$
5: $a_{i(noise)} = \mu_i(s_{i(noise)}|\theta^{\mu_i})$
6: $Q_{i(noise)} = Q_i^{\mu}(x, a_1, \ldots, a_{i(noise)}, \ldots, a_n)$
7: if $Q_{i(noise)} < Q_i$ then
8: keep $s_{i(noise)}$ as the current worst-case state and set $Q_i = Q_{i(noise)}$
9: end if
10: end for
11: return $s_{i(noise)}$

The action of agent $i$ can be remodeled based on its state after including stochastic Gaussian noise. Similarly, the robustness of the multi-agent intelligent control model can be optimized according to the adversarial attack of all agents by modeling the indeterminacies of the real world.
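The adversarial attack trick can be sketched as a random search over noisy states. In the sketch below, `actor` and `critic` are hypothetical callables standing in for the actor online network and the critic network, and the rule of keeping the noisy state with the lowest Q value seen so far is an interpretation of the description above rather than a transcription of the paper's pseudocode.

```python
import random

def adversarial_state(state, actor, critic, joint_obs, actions,
                      agent_idx, n_iters, sigma, rng):
    """Search for a Gaussian-perturbed state that minimizes the Q value."""
    best_state = list(state)
    best_q = critic(joint_obs, actions)
    for _ in range(n_iters):
        # Perturb every component of the agent's own state.
        noisy = [s + rng.gauss(0.0, sigma) for s in state]
        # Recompute only this agent's action from the noisy state.
        noisy_actions = list(actions)
        noisy_actions[agent_idx] = actor(noisy)
        q_noisy = critic(joint_obs, noisy_actions)
        # Keep the perturbation that hurts the agent the most so far.
        if q_noisy < best_q:
            best_q, best_state = q_noisy, noisy
    return best_state, best_q

# Toy usage with stand-in functions (not trained networks).
rng = random.Random(0)
toy_actor = lambda s: sum(s)
toy_critic = lambda x, acts: -sum(a * a for a in acts)
s_adv, q_adv = adversarial_state([0.5, -0.2], toy_actor, toy_critic,
                                 None, [0.3, 1.0], 0, 20, 0.1, rng)
```

If no perturbation lowers the Q value, the sketch returns the unperturbed state, so the agent is never trained on noise that happens to help it.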

Adversarial Learning for Cooperators
In addition to uncertain influences from a real environment, an agent is susceptible to strategy changes made by other agents in the overall system [37]. In other words, an agent cannot produce the optimal action strategy to match other agents when those agents change strategies. Our algorithm preprocesses the actions of cooperators using adversarial training techniques, so the agent's strategy is updated based on the worst decisions of other agents. Specifically, as the neural network is updated, the cumulative return of agent $i$ is optimized under the condition that all cooperators use adversarial strategies. The cumulative return of agent $i$ combined with adversarial learning is the expected discounted sum of the rewards $r_i$, taken over the state distribution $\rho^{\mu}$ and minimized over the actions of the other agents. This means that $x^{t+1}$ is influenced by the actions of all agents. Furthermore, the action value Q function can be defined in a recursive form as:
$$Q_i^{\mu}(x, a_1, \cdots, a_n) = r_i(x, a_1, \cdots, a_n) + \gamma \, \mathbb{E}_{x'} \left[ \min_{a'_{j \ne i}} Q_i^{\mu}(x', a'_1, \cdots, a'_n)\big|_{a'_i = \mu_i(s'_i)} \right]$$
The single-step gradient descent method was introduced to overcome high computing costs [38]. Using this method, the actions taken by cooperators are those exhibiting a mixed disturbance, and the direction of the disturbance is the orientation in which the Q function is decreasing. To summarize, the update process of the critic online network can be formulated as:
$$L(\theta^{Q_i}) = \frac{1}{N} \sum_k \left( y^k - Q_i^{\mu}(x^k, a_1^k, \ldots, a_n^k; \theta^{Q_i}) \right)^2 \quad (27)$$
$$y^k = r_i^k + \gamma \, Q_i^{\mu'}(x'^k, a_1'^{*}, \ldots, a'_i, \ldots, a_n'^{*}; \theta^{Q'_i})$$
$$a'_i = \mu'_i(s'_i), \qquad a_j'^{*} = a'_j + \eta_j \quad (j \ne i)$$
where $\theta^{Q'_i}$ represents the critic target network, $a^{*}_{j \ne i}$ represents the action of other agents in their minimum conditions, and $\eta_{j \ne i}$ represents the disturbance added for agent $j$. By linearizing the Q function, the parameter $\eta_j$ is used to denote the gradient direction of $Q_i^{\mu}(x, a_1, \ldots, a_n; \theta^{Q_i})$ at $a_j$. $\eta_j$ can therefore be replaced with a gradient approximation, namely a single small step along the negative gradient of the Q function with respect to $a_j$. The critic network structure of the MADDPG combined with adversarial learning is illustrated in Figure 7.
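The single-step perturbation of the cooperators' actions can be illustrated without a neural network by using a finite-difference gradient. In this sketch, `q_fn` is a hypothetical stand-in for agent $i$'s critic, `alpha` is an assumed step size for the disturbance $\eta_j$, and all gradients are evaluated at the unperturbed joint action; a real implementation would obtain the gradient from automatic differentiation instead.

```python
def numerical_gradient(q_fn, actions, j, k, h=1e-5):
    """Central-difference estimate of dQ / d actions[j][k]."""
    plus = [list(a) for a in actions]
    minus = [list(a) for a in actions]
    plus[j][k] += h
    minus[j][k] -= h
    return (q_fn(plus) - q_fn(minus)) / (2.0 * h)

def worst_case_actions(q_fn, actions, i, alpha):
    """Shift every other agent's action one step in the Q-decreasing direction."""
    perturbed = [list(a) for a in actions]
    for j, a_j in enumerate(actions):
        if j == i:
            continue  # agent i keeps the action chosen by its own policy
        for k in range(len(a_j)):
            grad = numerical_gradient(q_fn, actions, j, k)
            # eta_j = -alpha * grad, so Q decreases to first order.
            perturbed[j][k] = a_j[k] - alpha * grad
    return perturbed
```

For a linear toy critic such as `Q = a_0 + 2*a_1 - 3*a_2`, the perturbed joint action lowers `Q` while leaving agent 0's own action unchanged.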
When the MADDPG is implemented, adversarial interference must be added to the actions of other agents without requiring the critic network to be remodeled. To summarize, the A2-MADDPG algorithm proposed in this paper employs an adversarial attack trick and adversarial learning to process an agent's state information and other agents' actions during training. This bridges the gap between simulated training and the real world by adding adversarial disturbances. The overall A2-MADDPG algorithm is described in Algorithm 2.

Algorithm 2 Adversarial attack trick and adversarial learning MADDPG (A2-MADDPG) algorithm.
1: for N agents, randomly initialize their critic network $Q_i(s_i, a_i; \theta^{Q_i})$ and actor network $\mu_i(s_i; \theta^{\mu_i})$
2: initialize the corresponding target networks with the same weights
3: initialize hyperparameters: experience buffer $D$, mini-batch size $m$, max episode $M$, max step $T$, actor learning rate $l_a$, critic learning rate $l_c$, discount factor $\gamma$, soft update rate $\tau$
4: for episode = 1, M do
5: reset environment, and receive the initial state $x$
6: initialize exploration noise of action $N_{action}$
7: for t = 1, T do
8: for each agent $i$, obtain $s_{i(noise)}$ from Algorithm 1 with inputs $Q_i^{\mu}(x, a_1, a_2, \ldots, a_n)$, $\mu_i(s_i|\theta^{\mu_i})$, and $s_i$
9: for each agent $i$, select action $a_i = \mu_i(s_{i(noise)}; \theta^{\mu_i}) + N_{action}$
10: execute $a_1, \ldots, a_n$, and receive rewards $r_1, \ldots, r_n$ and next state $x'$
11: store sample $(x, a_1, \ldots, a_n, r_1, \ldots, r_n, x')$ in $D$
12: for agent i = 1, n do
13: randomly extract $m$ samples $(x^k, a^k, r^k, x'^k)$
14: update critic network by Equation (27)
15: update actor network by the sampled policy gradient
16: end for
17: update target networks by the soft update with rate $\tau$
18: end for
19: end for

Experiment and Result Analysis
This section describes the simulation's settings and a series of experiments implemented to analyze the effectiveness and performance of the approaches proposed in previous sections.

Simulation Environment Settings
The experiments were conducted using Pycharm and the gym module on an Ubuntu16.04 system with an Intel i7-6700K CPU, a GeForce1660Ti graphics card, and 16 G of RAM. As shown in Figure 8, the testbed was a square (20 km on a side) in a two-dimensional plane. Each obstacle was assumed to have a round threat area with radius r ob ∈ [0.6, 1.3] km (black circle). The attack range of the pursuers (red circles with a UAV inside) was set to 1.2 km, which means the mission was considered successful for pursuers when the distance between them and at least one evader (blue UAVs) was within 1.2 km. Table 1 provides the parameters used for the platforms.  In the DRL-based multipursuer framework, a two-layer perceptron model was constructed for the actor and critic networks. Two fully connected 15 × 64 × 64 × 2 neural networks were created for the actor network and its target. Furthermore, two fully connected 17 × 64 × 64 × 1 neural networks were created for the critic network and its target. Each round ends when the pursuers capture an evader, a platform collides with an obstacle, or the simulation reaches the maximum number of steps. After each round, the environment was reset, and the next round began. Network training ended when the experience buffer was full, and an Adam optimizer was used to determine the neural network parameters. The hyperparameters of the network are shown in Table 2. To examine the performance of trained pursuers, this study used different algorithms to train them to fight evaders that were trained using the constant MADDPG algorithm from Experiment 1. Specifically, we employed the IMDDPG, MADDPG, and A2-MADDPG algorithms in the MAS of pursuers and present comparative data about the average return values for the last 1000 training episodes in Figure 9. As illustrated in Figure 9a, UAVs trained using all three algorithms required roughly 8000 episodes to converge to a steady value with an average reward. 
In Figure 9b,c, the pursuers and evaders play against each other, and their respective average rewards do not converge to a stable state until about 16,000 episodes. The A2-MADDPG converged to a higher average reward for the pursuers and a lower one for the evaders. This means that the proposed algorithm produced pursuers that were highly successful at pursuing targets and preventing them from escaping, which resulted in lower rewards for the evaders. To assess the practical performance of our algorithm, the average time of first pursuit and the average success of the pursuers in the last 1000 training episodes are recorded in Figure 10.

Figure 10a shows that the MAS pursuers can complete their pursuit after being trained using any of the three DRL algorithms. As training progresses, A2-MADDPG UAV formations pursue evaders in a shorter time. This means that the A2-MADDPG model has stronger cooperative encirclement capabilities. Figure 10b shows how the eligible pursuing times, that is, the number of successful pursuits per round, varied during training. The pursuers initially completed a large number of pursuits in each round, but their success gradually decreased and then stabilized as the two groups of maneuvering agents confronted one another. This means that the MADDPG evaders could also make intelligent decisions autonomously to flee. Ultimately, the average eligible pursuing time of A2-MADDPG UAVs was about four per round, which is better than that of the other two algorithms.
To present the pros and cons of each algorithm in the steady state, the last 10,000 rounds (from 40,000 to 50,000 episodes in Figures 9 and 10) were analyzed. The results are presented in Table 3 and include the average return values for pursuers and evaders, the average and maximum first pursuit time, and the average and maximum successes.

Table 3 shows the performance of each algorithm after stable convergence. The IMDDPG pursuers, which used a distributed critic network, had an average return value of 42.15, while the MADDPG pursuers had 72.87. The A2-MADDPG algorithm improved the pursuers' performance further, raising the average return value to 103.20. Driven by the shaped reward function, A2-MADDPG pursuers developed efficient strategies, thereby obtaining a higher round reward. The earliest pursuit time indicator reflects how time-efficient the pursuers' decisions were. Table 3 shows that the A2-MADDPG shortened the earliest pursuit time: its average value fell from IMDDPG's 32.17 to 24.61, and its maximum fell from 35.662 to 26.802. The pursuer success number describes how many pursuits were successful in each round. This indicator was better for the A2-MADDPG than for the other two algorithms, which indicates that the MAS pursuers based on it performed better.
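To put the Table 3 figures quoted above on a common scale, the relative changes can be worked out directly. The sketch below uses only the values stated in the text; the helper name `relative_change` and the dictionary layout are illustrative assumptions.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Steady-state pursuer values quoted in the text for Table 3.
avg_return = {"IMDDPG": 42.15, "MADDPG": 72.87, "A2-MADDPG": 103.20}
avg_first_pursuit = {"IMDDPG": 32.17, "A2-MADDPG": 24.61}

gain_vs_maddpg = relative_change(avg_return["MADDPG"], avg_return["A2-MADDPG"])
gain_vs_imddpg = relative_change(avg_return["IMDDPG"], avg_return["A2-MADDPG"])
time_change = relative_change(
    avg_first_pursuit["IMDDPG"], avg_first_pursuit["A2-MADDPG"]
)

print(f"Average return vs. MADDPG: {gain_vs_maddpg:+.1f}%")
print(f"Average return vs. IMDDPG: {gain_vs_imddpg:+.1f}%")
print(f"Average first pursuit time vs. IMDDPG: {time_change:+.1f}%")
```

On these figures, the A2-MADDPG average return is roughly 42% above that of MADDPG and roughly 145% above that of IMDDPG, while the average first pursuit time is about 23.5% shorter than IMDDPG's.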

Performance of Evaders
In Experiment 2, we trained evaders using the IMDDPG, MADDPG, and A2-MADDPG to challenge MADDPG pursuers. The average reward results for the last 1000 training episodes are presented in Figure 11. As illustrated in Figure 11c, the evaders achieved the highest round average using the A2-MADDPG, followed by MADDPG, and finally, IMDDPG. This means that A2-MADDPG evaders had a larger advantage during confrontations, which helped them avoid being attacked by pursuers more often.
We recorded the earliest completion time and the eligible pursuing time of the pursuers in each round to verify the algorithm's performance, as shown in Figure 12. In Figure 12a, the blue curve has the highest values once the experiment stabilized; that is, the first completion of the pursuing task took longest in confrontations between MADDPG pursuers and A2-MADDPG evaders. This means that A2-MADDPG evaders made efficient decisions when avoiding pursuers. Figure 12b shows the eligible pursuing time over 50,000 training episodes, from which we observed that the average eligible pursuing time was smallest in confrontations between MADDPG pursuers and A2-MADDPG evaders. A2-MADDPG evaders therefore demonstrated better escape strategies. In addition, experimental data from the last 10,000 rounds were analyzed for comparison, as presented in Table 4. The analysis included the average return value of the pursuing UAV formation and evader, the average value of the first time of pursuing task completion, the minimum value of the first time of pursuing task completion, the average eligible pursuing time, and the maximum eligible pursuing time.

Table 4 gives the results of each algorithm after stable convergence. Compared with IMDDPG evaders, which use a distributed critic network, centralized MADDPG evaders had higher average return values, took longer to be pursued, and were caught less often. A2-MADDPG evaders, with superior maneuverability, could generate effective actions to flee from capture by pursuers. This means that the proposed A2-MADDPG also optimized the evaders' strategies.

In this section, the algorithm was simulated and analyzed for a specific 3V1 pursuit and evasion confrontation. Table 5 provides the initial positions of the UAVs and obstacles in Experiment 3.

Table 5. Initial positions of UAVs and obstacles in Experiment 3.

Elements    Position
Pursuer 1 (red UAV)    (−7.3, 7.7)
Pursuer 2 (green UAV)

Figure 13 presents the confrontation process of the mission starting from the same initial state. The IMDDPG pursuer formation successfully captured the evader after 39 steps. The MADDPG pursuer formation generated better encirclement strategies and took less time to hunt down the target (27 steps). The A2-MADDPG formation adjusted the direction and speed of each UAV to reach the more maneuverable evader more effectively, and the pursuers needed only 13 steps to reach the escaping target.

The algorithms were examined in a random test environment with stochastic initial states and environments in Experiment 4. Based on the chase game trained in Section 4.2.1, the results regarding task completion for 1000 test episodes are shown in Table 6. The improved A2-MADDPG achieved a mission success rate of 88.9% for pursuers, which was higher than the MADDPG's 75.3% and the IMDDPG's 70.4%. Moreover, the A2-MADDPG pursuers were able to catch the evader in less time: the average first time of pursuit completion was 23.698 for the A2-MADDPG, lower than for the MADDPG and IMDDPG, which shows that the effectiveness of their pursuits was enhanced.

To test the effectiveness of escaping, 1000 random environments were generated in Experiment 5 to compare the IMDDPG, MADDPG, and A2-MADDPG. The results are presented in Table 7. IMDDPG evaders had a success rate of 17.9%, whereas A2-MADDPG evaders created better evasion strategies, which increased the success rate to 31.4%. The average first time of being pursued was also highest for the A2-MADDPG, at 30.692, which further demonstrated its effectiveness for evaders. This means that A2-MADDPG evaders devised more effective escape strategies.
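The test indicators reported for Experiments 4 and 5 (success rate and first time of pursuit completion) can be derived from recorded trajectories using the 1.2 km attack range given in the simulation settings. The following is a minimal sketch under stated assumptions: a single evader, step indices starting at 0, and no obstacle collisions. The function names and the toy trajectories are hypothetical and are not taken from the authors' code.

```python
import math

ATTACK_RANGE_KM = 1.2  # attack range of the pursuers stated in the settings


def first_capture_step(pursuer_paths, evader_path, attack_range=ATTACK_RANGE_KM):
    """Return the first step at which any pursuer is within range, else None.

    pursuer_paths: one list of (x, y) positions in km per pursuer.
    evader_path:   list of (x, y) evader positions in km, same length.
    """
    for step, evader_pos in enumerate(evader_path):
        for path in pursuer_paths:
            if math.dist(path[step], evader_pos) <= attack_range:
                return step
    return None


def summarize(episodes):
    """Pursuer success rate and mean first-capture step over test episodes."""
    capture_steps = [first_capture_step(p, e) for p, e in episodes]
    captured = [s for s in capture_steps if s is not None]
    success_rate = len(captured) / len(episodes)
    mean_step = sum(captured) / len(captured) if captured else None
    return success_rate, mean_step


# Two toy episodes (assumed data): one capture and one escape.
episode_a = (
    [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]],  # one pursuer closing in
    [(4.0, 0.0), (3.5, 0.0), (3.0, 0.0)],    # evader
)
episode_b = (
    [[(0.0, 0.0), (0.0, 1.0)]],
    [(5.0, 5.0), (6.0, 6.0)],
)
rate, mean_step = summarize([episode_a, episode_b])
print(rate, mean_step)
```

The evader success rate in this simplified setting is the complement of the pursuer rate; in the paper's environment, rounds may also end through obstacle collisions, which this sketch does not model.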
In summary, compared with the IMDDPG and MADDPG, the evaluation indicators of the A2-MADDPG were significantly better under the same hyperparameter and training environment settings. In the same test environment, the pursuit and escape strategies trained by the A2-MADDPG were also clearly more robust and more efficient than those trained by the other two algorithms. Therefore, the A2-MADDPG had superior performance in the experiments.

Conclusions
In this paper, deep reinforcement learning was applied to multi-agent pursuit-evasion decision-making without building a complicated control system, as is commonly done in traditional approaches. An elaborate MADDPG-based framework was constructed to provide online decision-making schemes and to determine the cooperative control of multi-agent systems. By introducing adversarial disturbances, an improved A2-MADDPG was proposed that effectively reduced the influence of errors between models and real scenarios. Introducing an adversarial attack trick improved the robustness of the multi-agent intelligent control model by incorporating adversarial attacks from all agents. An adversarial learning technique was incorporated into our algorithm to overcome its vulnerability to the changes introduced by other agents. This was done by processing data in the input layer of the critic network. Experimental results showed that the proposed algorithm improved the performance of both types of players in pursuit-evasion games and that the trained agents could devise effective strategies autonomously in confrontational missions.
We intend to expand the pursuit-evasion missions in the future by changing the number of pursuers and evaders and by increasing the number of obstacles to make the environment more complex, so as to evaluate the performance, efficiency, and robustness of our algorithms in a more realistic and dynamic space. In addition, we would like to apply the trained robust strategies to drones or unmanned vehicles so that they can make decisions based on environmental information obtained by cameras with a realistic sensing range. This will accelerate the transfer of this work from virtual digital simulations to real multi-agent systems.
Author Contributions: Conceptualization, investigation, methodology, and writing-original draft preparation, K.W.; resources, software, visualization, and validation, D.W. and Y.Z.; writing-review and editing, B.L.; project administration and funding acquisition, X.G.; data curation and formal analysis, Z.H. All authors have read and agreed to the published version of the manuscript.