Online Adaptive Dynamic Programming-Based Solution of Networked Multiple-Pursuer and Single-Evader Game

Abstract: This paper presents a new scheme for the online solution of a networked multi-agent pursuit–evasion game based on an online adaptive dynamic programming method. Since the agents in the game can form an Internet of Things (IoT) system, the relative distance and the control energy are incorporated into the performance index, and the expression of the policies at the Nash equilibrium is obtained and proved by the minmax principle. By constructing a Lyapunov function, the capture conditions of the game are obtained and discussed. In order to enable each agent to obtain its Nash equilibrium policy in real time, the online adaptive dynamic programming method is used to solve the game problem. Furthermore, the parameters of the neural network are fitted by value function approximation, which avoids the difficulties of solving the Hamilton–Jacobi–Isaacs equation, and the numerical solution of the Nash equilibrium is obtained. Simulation results demonstrate the feasibility of the proposed method for multi-agent pursuit–evasion games.


Introduction
Recently, the differential game of multi-agents has attracted the attention of many scholars owing to its important application prospects [1][2][3][4]. Among them, games with multiple pursuers and a single evader have been widely considered in many guidance and interception problems. In many scenarios, the agents may involve large-scale or complex dynamic systems, which make the decisions of the agents difficult to resolve [5], and the constraint of energy consumption is often considered with the application of renewable energy. Although many methods can solve differential games, most of the existing algorithms cannot solve them online or require a large amount of empirical data. Therefore, this paper studies the problem of an online multi-pursuer single-evader game and resolves the decisions of the agents through the method of integral reinforcement learning.
The pursuit-evasion game is a common kind of differential game, which is usually used in competitive games, the optimization of IoT resources, military attacks, and so on [6,7]. Among them, the simplest scenario is the single-pursuer single-evader game. This game is a zero-sum game, in which the pursuer and the evader have mutually exclusive interests, and no other agents interfere [8]. However, for a game problem involving multiple agents, the payoffs of the game become complex [9]. The difficulty of solving the multi-agent pursuit-evasion game is closely related to the size of the communication network between individuals and the complexity of the game model [10]. Among the existing studies, ref. [11] gives a detailed derivation and solution for the capture game problem of agents. Cappello [12] divided the pursuit-evasion game of multi-agents into three cases and gave the solution of its Nash equilibrium. In addition, when the number of agents in the game is quite large, Peng [13] adopted a distributed network so that the decision-making of the other individuals is not affected after the failure or damage of any individual, which means the system has scalability, self-organization, and strong robustness [14].
For an actual multi-agent pursuit-evasion (MPE) game, the core is to solve the decision of each agent according to the setting of the value function. Nowadays, there are many methods with which to solve the strategies of agents in the game. In Refs. [15,16], the scholars solved the pursuit and evasion game problem of multi-agents with an analytical method, obtained the Nash equilibrium solution of each agent, and proved the existence condition of the Nash equilibrium. However, for a complex game system, obtaining an analytical solution incurs high computational and time costs. In Ref. [17], the reinforcement learning method was used in the Nash equilibrium game for solving the decisions of agents, and a numerical solution equivalent to the analytical solution was obtained. In addition, many off-line algorithms have recently been able to simplify the solution of multi-agent pursuit and evasion games [18][19][20]. The off-line algorithm for the pursuit-evasion problem in the differential game is becoming more and more mature. However, off-line methods lack the flexibility to respond to sudden changes during an online confrontation.
Therefore, how to solve the MPE game online has become a hot topic in the academic community. In 2002, Murray et al. first proposed the iterative adaptive dynamic programming (ADP) algorithm for continuous systems [21] and first adopted the policy iteration algorithm in Ref. [22]. At that time, the ADP algorithm could only be used as an off-line iteration. Vamvoudakis et al. proposed an online adaptive method based on policy iteration [23][24][25] to solve the optimal control problem of continuous nonlinear systems and theoretically proved the stability of this online adaptive algorithm. This online adaptive method has also been studied in discrete systems. The use of the ADP method to solve differential game problems has also begun to develop in the direction of online learning. An online adaptive control scheme based on policy iteration for multi-person non-zero-sum differential game problems was proposed in [26]. Because some information about a system may not be known, Wei [27] considered both linear and nonlinear systems to develop an online learning method for optimal control with unknown information about the system matrix, and an event-triggered ADP method with multiple triggering conditions was developed for multi-player non-zero-sum (NZS) games [28]. In Ref. [29], a novel data-based ADP method was presented to solve the optimal consensus tracking control problem for discrete-time (DT) multi-agent systems (MASs) with multiple time delays. However, in the process of policy iteration, the excitation signal may drop to a very low level over time, making the approximation algorithm difficult to operate; thus, Karg [30] focused on the fulfillment of the persistent excitation condition for signals which result from transformations by means of polynomials. Moreover, Li [31] discussed the feasibility of this method in a distributed system.
For the multi-agent PE game problem with uncertain system parameters, it is still difficult to solve because of the complexity of the scale of the actual systems. Furthermore, solving the MPE game problem in real time without knowing all the information about the systems has become a valuable research topic.
In this paper, the ADP method is used to solve the networked MPE game problem with multiple pursuers and a single evader in real time. In order to implement and solve the game, we divide the game process into short time intervals through integral reinforcement learning, and then we obtain the policy of each agent by a policy iteration method. Through continuous iterations, the game converges to the Nash equilibrium, and the policy of each agent converges to its Nash equilibrium policy. In order to eliminate the difficulty of solving the HJI equation in a multi-agent game, and since the state information of each agent can be perceived mutually under the IoT system, the value function approximation method is used, and finally the numerical solution of each agent's Nash equilibrium policy is obtained. Using the state information provided by the IoT system, we can obtain the Nash equilibrium policies that consider both distance and energy control in the game. Simulation experiments are used to verify the effectiveness of the method. The main contributions of this paper are listed below:
1. The relative distance and the control energy are incorporated as the performance index, and the Nash equilibrium of an MPE game is obtained by using the minmax principle, in which the Lyapunov function is constructed to determine whether capture occurs;
2. The whole game process is divided into multiple iterative intervals by means of the integral reinforcement learning model, and the policy of each iterative interval is obtained by the policy iteration method, which is proved to converge to the Nash equilibrium solution;
3. The online ADP method is adopted to overcome the difficulty of solving the HJI equation, and a set of approximation functions is established by using the method of value function approximation. The numerical solution of the policies of the agents is obtained.
In simple terms, the original contribution of this paper is to establish an MPE game model by building a weighted value function and creating a basis function to fit the value function using the ADP method. We obtain the Nash equilibrium solution of the game by learning the neural network parameters. The capture conditions are also analyzed by using the Lyapunov function method.
The remainder of this paper is organized as follows: Section 2 constructs the physical model of the multi-agent pursuit-evasion game. Section 3 discusses the Nash equilibrium solution and capture conditions of the game problem. Section 4 discusses the integral reinforcement learning method, i.e., the policy iteration and value function approximation, to obtain the policies without solving the HJI equation directly. Section 5 demonstrates the simulation of a practical problem. Section 6 is the conclusion of the paper.

Formulation of the Game
The multi-agent game system is based on the communication network among agents, integrating a data storage module, a navigation system, an electric actuator, a decision evaluation system, and a decision updating system. Its main structure is shown in Figure 1. Each pursuer is equivalent to a piece of equipment in the network. It executes the policy through the electric actuator and transmits position, speed, and other status information from the navigation system to the communication network. Agents can exchange status information via the network layer. All information is transmitted to the platform layer through the communication system and evaluated by the decision evaluation system; new policies are generated through the decision updating system and transmitted to the electric actuators of the agents via the communication network.
Consider a game system with several pursuers and a single evader, where each agent has its own goal, defined as follows: In an MPE game with multiple pursuers and a single evader, the evader tries to escape from being captured by every pursuer, while each pursuer tries to capture the single evader. The performance index of each pursuer is to minimize its distance to the evader and its own control energy, and the performance index of the evader is to maximize its distance to each pursuer and to minimize its own control energy.
The states of both the pursuers and the evader change in real time; thus, the system is a multi-agent differential game. Here, the dynamics of each agent are expressed as a set of differential equations. Consider a team of N agents as the pursuers, each of which follows the dynamics:

$$\dot{x}_{pi} = A x_{pi} + B_i u_{pi}, \quad i = 1, \ldots, N \tag{1}$$

The dynamics of the single evader can be expressed as

$$\dot{x}_e = A x_e + B_e u_e \tag{2}$$

where $x_{pi}$, $u_{pi}$, $x_e$, and $u_e$ refer to the state variables and controls of the i-th pursuer and the evader, respectively. $A$, $B_i$, and $B_e$ are the system matrices. Let $x_{pi}$ represent the position and velocity of the i-th pursuer along each dimension, and let $u_{pi}$ contain the accelerations of the i-th pursuer along those dimensions. Similarly, $x_e$ and $u_e$ are those of the single evader. Let $\delta_i$ be the difference between the state variables of pursuer i and the evader, which is expressed as

$$\delta_i = x_{pi} - x_e \tag{3}$$

For the multi-agent PE game problem we discuss, each pursuer tries to minimize the distance to the evader, while the evader tries to maximize the distance to every pursuer. Substituting Equations (1) and (2) into Equation (3) gives the time derivative of $\delta_i$ as

$$\dot{\delta}_i = A \delta_i + B_i u_{pi} - B_e u_e \tag{4}$$
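As an illustration of Equations (1)–(4), the following minimal Python sketch evaluates the right-hand sides for one pursuer and the evader and checks that the difference of their state derivatives matches the relative dynamics (4). The double-integrator matrices, the numerical values, the explicit Euler step, and the helper names (`mat_vec`, `agent_rhs`, `relative_rhs`, `euler_step`) are assumptions made for this example and are not taken from the paper.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def agent_rhs(A, B, x, u):
    """Right-hand side of x_dot = A x + B u, as in Equations (1) and (2)."""
    return [a + b for a, b in zip(mat_vec(A, x), mat_vec(B, u))]


def relative_rhs(A, B_i, B_e, delta, u_p, u_e):
    """Right-hand side of delta_dot = A delta + B_i u_p - B_e u_e, Equation (4)."""
    drift = mat_vec(A, delta)
    push_p = mat_vec(B_i, u_p)
    push_e = mat_vec(B_e, u_e)
    return [d + p - e for d, p, e in zip(drift, push_p, push_e)]


def euler_step(state, rate, dt):
    """One explicit Euler step (assumed integrator, chosen for brevity)."""
    return [s + dt * r for s, r in zip(state, rate)]


# Assumed example: 1-D double integrator, state = [position, velocity].
A = [[0.0, 1.0], [0.0, 0.0]]
B_i = [[0.0], [1.0]]
B_e = [[0.0], [1.0]]

x_p, x_e = [0.0, 1.0], [2.0, 0.5]
u_p, u_e = [0.8], [0.3]
delta = [p - e for p, e in zip(x_p, x_e)]  # Equation (3)

rate_direct = relative_rhs(A, B_i, B_e, delta, u_p, u_e)
rate_from_agents = [
    p - e
    for p, e in zip(agent_rhs(A, B_i, x_p, u_p), agent_rhs(A, B_e, x_e, u_e))
]
delta_next = euler_step(delta, rate_direct, dt=0.01)
```

Both ways of computing the rate agree because the pursuers and the evader share the same matrix A, which is what allows the game to be written in the relative state alone.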
For each agent, there exists a performance index. We establish the integral performance index of agent i as

$$J_i = \int_{t_0}^{\infty} \left( \delta_i^T Q_i \delta_i + u_{pi}^T R_{pi} u_{pi} - u_e^T R_e u_e \right) d\tau \tag{5}$$

where $Q_i$ is a positive-definite matrix and $R_{pi}$ and $R_e$ are two symmetric positive-definite matrices. $u_{pi}$ and $u_e$ stand for the policies of the i-th pursuer and the evader, respectively. $\delta_i^T Q_i \delta_i$ is the weighted term of the relative state variable, which is used to restrict the relative position between the i-th pursuer and the evader. $u_{pi}^T R_{pi} u_{pi}$ and $u_e^T R_e u_e$ are the energy consumption terms of the controls of the i-th pursuer and the evader, respectively, which are used to realize the constraints on the control variables.
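To make the structure of the performance index concrete, the sketch below evaluates its integrand (relative-state term plus pursuer energy minus evader energy) on a few samples and integrates it with the trapezoidal rule. The weights, the hand-made sample trajectory, and the function names are assumptions for illustration only; a finite sampled horizon merely approximates the infinite-horizon integral.

```python
def quad_form(v, M):
    """Return v^T M v for a vector v and a square matrix M."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))


def running_cost(delta, u_p, u_e, Q, R_p, R_e):
    """Integrand of the index: distance term + pursuer energy - evader energy."""
    return quad_form(delta, Q) + quad_form(u_p, R_p) - quad_form(u_e, R_e)


def trapezoid(values, dt):
    """Trapezoidal rule for equally spaced samples."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


# Assumed weights and a short, hand-made sample trajectory.
Q = [[1.0, 0.0], [0.0, 1.0]]
R_p = [[1.0]]
R_e = [[2.0]]
samples = [
    ([2.0, 0.0], [1.0], [0.5]),
    ([1.0, 0.0], [1.0], [0.5]),
    ([0.0, 0.0], [0.0], [0.0]),
]

costs = [running_cost(d, up, ue, Q, R_p, R_e) for d, up, ue in samples]
J_approx = trapezoid(costs, dt=0.5)
```

The minus sign in front of the evader's energy term is what makes the index a zero-sum payoff: the pursuer minimizes it, while the evader maximizes it.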
Since each pursuer participates in the PE game and thereby affects the final policy of the evader, we define the overall index function J as the weighted sum of the index functions $J_i$ in Equation (5) for i = 1, . . . , N, as given in Equation (6). When the i-th pursuer and the evader employ certain policies, the value of the game is evaluated as follows:

$$V_i(\delta_i(t)) = \int_{t}^{\infty} \left( \delta_i^T Q_i \delta_i + u_{pi}^T R_{pi} u_{pi} - u_e^T R_e u_e \right) d\tau \tag{7}$$

When pursuer i and the evader both adopt their optimal policies, the corresponding optimal value is given by Equation (8). By summing the N value functions in the game, the overall value function in Equation (9) is obtained, where δ is the set of $\delta_1, \ldots, \delta_N$. The aim is to find the appropriate control for each agent that satisfies Equation (9). However, several difficulties arise when calculating the numerical values of the control policies. In the actual system, the analytical solution of each individual strategy is difficult to obtain directly, or it occupies too many computing resources and too much time, so the focus of this paper is to iteratively obtain the Nash equilibrium solution by using the idea of reinforcement learning.
The difficulty addressed by this paper lies in finding the Nash equilibrium solution of multi-agent systems in pursuit-evasion games. We generalize the zero-sum game scenario to the case of multiple pursuers. We use the integral reinforcement learning method to solve for the policy of each agent, as shown in Section 4.

Solution of the MPE Game Problem
According to the physical model of the PE game problem, the Nash equilibrium solution to the problem is obtained by using the minimax principle. The capture conditions of the PE game are discussed by using the Lyapunov function method.
For multi-agent PE games, each individual's decision will affect the system. The Nash equilibrium solution of the game can be obtained by using the minmax principle, that is, the set of decisions where all parties reach the optimal situation at the same time.
The multi-agent game model is derived from the game of two agents. The differential game of two agents is established based on bilateral optimal control theory. For the differential game of multiple agents, each index function corresponds to an optimization problem, the policy of which is obtained by the minmax principle. In the connected graph, the evader is linked with multiple pursuers, so it needs to consider the policy of each interconnected pursuer. Suppose there is a set of policies with which the game reaches the Nash equilibrium and the evader is linked with N pursuers; then we call $(u_{p1}^*, \ldots, u_{pN}^*, u_e^*)$ the game-theoretic saddle point.
The differential expression of Equation (7) is equivalent to the Bellman equation of the game. According to Equations (1) and (2), the Bellman equation of pursuer i and the evader is obtained as Equation (10), where $H(\cdot)$ is the Hamilton function of the PE game, $u_{pi}$ and $u_e$ are the admissible control policies of the i-th pursuer and the evader, and $\nabla V$ stands for $\partial V / \partial \delta_i$. In order to find the optimal policies of the game, the conditions in Equations (11)–(14) should be satisfied: the stationary conditions of the minmax principle, and the conditions on the second derivatives of the Hamilton function with respect to the control of each agent. The optimal policies of the pursuers and the evader are thus found as Equations (15) and (16). For time-invariant systems over an infinite horizon, the solution of the game is determined by (15) and (16), where the function $V_i$ is the solution of the Hamilton-Jacobi-Isaacs (HJI) equation of the game, Equation (17).
We need to further prove the above conclusions. That is, for the MPE game with multiple pursuers and a single evader, when the policy of each agent satisfies Equations (15) and (16), the game reaches the Nash equilibrium.
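For intuition, the stationarity and second-order conditions can be checked in the simplest linear-quadratic setting. The sketch below assumes a scalar relative state, a single pursuer (N = 1), and a quadratic value function V = p·δ²; under these assumptions the HJI equation reduces to the scalar Riccati equation q + 2ap − s·p² = 0 with s = b_p²/r_p − b_e²/r_e. These simplifications, the parameter values, and the function names are illustrative and do not reproduce the paper's general N-pursuer expressions.

```python
import math


def riccati_root(a, q, s):
    """Positive root of q + 2*a*p - s*p**2 = 0 (requires s > 0)."""
    return (a + math.sqrt(a * a + q * s)) / s


def hamiltonian(delta, u_p, u_e, p, a, b_p, b_e, q, r_p, r_e):
    """H = q d^2 + r_p u_p^2 - r_e u_e^2 + V'(d) * (a d + b_p u_p - b_e u_e)."""
    grad_v = 2.0 * p * delta  # derivative of the assumed V = p * delta^2
    return (q * delta ** 2 + r_p * u_p ** 2 - r_e * u_e ** 2
            + grad_v * (a * delta + b_p * u_p - b_e * u_e))


# Assumed scalar game data.
a, b_p, b_e = 0.0, 1.0, 1.0
q, r_p, r_e = 1.0, 1.0, 2.0
s = b_p ** 2 / r_p - b_e ** 2 / r_e
p = riccati_root(a, q, s)

delta = 1.0
grad_v = 2.0 * p * delta
u_p_star = -0.5 * b_p * grad_v / r_p  # from dH/du_p = 0
u_e_star = -0.5 * b_e * grad_v / r_e  # from dH/du_e = 0

args = (p, a, b_p, b_e, q, r_p, r_e)
h_star = hamiltonian(delta, u_p_star, u_e_star, *args)
h_pursuer_deviates = hamiltonian(delta, u_p_star + 0.1, u_e_star, *args)
h_evader_deviates = hamiltonian(delta, u_p_star, u_e_star + 0.1, *args)
```

In this toy case the Hamiltonian vanishes at the stationary pair, increases when the pursuer deviates (second derivative 2·r_p > 0), and decreases when the evader deviates (second derivative −2·r_e < 0), which is the saddle-point behaviour the minmax conditions encode.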
Before proving this conclusion, we need to use the basic property of the Hamilton function in a multilateral optimal control problem, which is embodied in Lemma 1.

Lemma 1.
Let $V^*$ satisfy the HJI Equation (17), so that the Hamilton function satisfies $H(\delta, \nabla V^*, u_{p1}^*, \ldots, u_{pN}^*, u_e^*) = 0$. Then (10) becomes Equation (18).
Proof of Lemma 1. Upon adding and subtracting the terms $u_p^{*T} R_p u_p^*$, $u_e^{*T} R_e u_e^*$, $\nabla V^T B u_p^*$, and $\nabla V^T B u_e^*$, the Hamiltonian (10) yields Equation (19). When the value function V attains the Nash equilibrium value $V^*$, we have Equation (20). From the HJI equation (17), the Hamilton function of the game satisfies $H(\nabla V^*, u_{p1}^*, \ldots, u_{pN}^*, u_e^*) = 0$ when the value function attains the optimal value, which completes the proof.
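The add-and-subtract step in the proof can be illustrated in a scalar, single-pursuer setting. The derivation below is a sketch under assumed notation that does not appear in the paper: scalar coefficients $a$, $b_p$, $b_e$, $q$, $r_p$, $r_e$, and $V'$ for the derivative of the value function. It shows the familiar completed-square form that such a step produces; the general N-pursuer statement remains the one given in the lemma.

```latex
\begin{aligned}
H(\delta, V', u_p, u_e) &= q\delta^2 + r_p u_p^2 - r_e u_e^2
    + V'\,(a\delta + b_p u_p - b_e u_e), \\
u_p^* &= -\frac{b_p}{2 r_p} V', \qquad u_e^* = -\frac{b_e}{2 r_e} V'
    \quad\Longrightarrow\quad b_p V' = -2 r_p u_p^*, \;\; b_e V' = -2 r_e u_e^*, \\
H(\delta, V', u_p, u_e) &= q\delta^2 + a\delta V'
    + r_p\left(u_p^2 - 2 u_p^* u_p\right) - r_e\left(u_e^2 - 2 u_e^* u_e\right) \\
  &= H(\delta, V', u_p^*, u_e^*) + r_p (u_p - u_p^*)^2 - r_e (u_e - u_e^*)^2 .
\end{aligned}
```

When $V'$ is the derivative of a solution of the HJI equation, the first term on the last line vanishes, so in this scalar case the Hamiltonian reduces to a difference of two weighted squares of the deviations from the equilibrium policies.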
The Hamilton function can be transformed using Lemma 1. With the control variables $u_{pi}$, $u_e$ and the Nash equilibrium condition $H(\nabla V^*, u_{p1}^*, \ldots, u_{pN}^*, u_e^*) = 0$ established, it is easier to prove the Nash equilibrium for the MPE problem stated in Theorem 1.

Theorem 1. Consider the kinetic equations of vehicles (1) and (2) with the value function as given in (7). Let $V^*$ be a positive definite smooth solution of the HJI Equation (17). Then, $(u_{p1}^*, \ldots, u_{pN}^*, u_e^*)$ given by (15) and (16) becomes the game-theoretic saddle point and $V^*$ becomes the Nash equilibrium value of the MPE game.
Proof of Theorem 1. In order to prove that $(u_{p1}^*, \ldots, u_{pN}^*, u_e^*)$ is the game-theoretic saddle point of the game, we need to show that the best action for the evader to maximize the value function (15) is $u_e^*$ when all the pursuers execute the policies as given in (21). On the other hand, the best action for the i-th pursuer to minimize the value function (16) is $u_{pi}^*$ when the other agents execute the policy as given in (22), which indicates the following:

$$u_{pi}^* = \arg\min_{u_{pi}} V_{u_{p1}, \ldots, u_{pN}, u_e}(\delta(t)) \tag{21}$$

$$u_e^* = \arg\max_{u_e} V_{u_{p1}, \ldots, u_{pN}, u_e}(\delta(t)) \tag{22}$$

which are equivalent to:

$$V_{u_{pi}^*, u_e}(\delta(t)) \le V_{u_{pi}^*, u_e^*}(\delta(t)) \le V_{u_{pi}, u_e^*}(\delta(t)) \tag{23}$$

where $V_{u_{pi}, u_e}(\delta(t))$ is the corresponding solution of the Hamiltonian (18). Define $V(\delta(t_0))$ as the initial value of the game. Assume that a capture occurs in the interval $t \in [t_0, \infty)$, which implies $\lim_{t \to +\infty} V_{u_p, u_e}(\delta(t)) = 0$. Adding this term to Equation (7), we obtain Equation (24). From Equation (24), it is obvious that $V_{u_{p1}^*, \ldots, u_{pN}^*, u_e^*}(\delta(t)) = V^*(\delta(t_0))$. Upon using Lemma 1, (24) becomes Equation (25). Let ε(V) be the integral term in Equation (25). In order to prove Equation (23), we just need to verify that $\varepsilon(V_{u_p, u_e^*}) \ge 0$ and $\varepsilon(V_{u_p^*, u_e}) \le 0$. Using (25), we obtain Equations (26) and (27), which completes the proof.

Remark 1.
From Theorem 1, we know that when the game reaches the Nash equilibrium, the value function does not decrease no matter how the i-th pursuer unilaterally changes its policy. On the other hand, no matter how the single evader unilaterally changes its policy, the value function does not increase. Once the policy set of the game reaches the game-theoretic saddle point, any unilateral change of policy by an agent is contrary to its own interest, and the other side of the game reaps the reward. Intuitively speaking, when the game reaches the Nash equilibrium, if any pursuer unilaterally changes its policy, the evader becomes more difficult to capture. Conversely, the evader can be captured more easily if it unilaterally changes its policy at the saddle point.
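The remark can be checked numerically on a toy problem. The sketch below assumes a scalar relative state, one pursuer, and linear feedback policies u_p = −k_p·δ and u_e = −k_e·δ with a stable closed loop, for which the value is V = c·δ² with a closed-form coefficient c. The equilibrium gains are computed from the scalar Riccati root; all numbers and names are assumptions for illustration, not the paper's simulation setup.

```python
import math


def value_coefficient(k_p, k_e, a, b_p, b_e, q, r_p, r_e):
    """Coefficient c in V = c * delta^2 for u_p = -k_p*delta, u_e = -k_e*delta.

    Valid only when the closed loop delta_dot = a_cl * delta is stable.
    """
    a_cl = a - b_p * k_p + b_e * k_e
    if a_cl >= 0.0:
        raise ValueError("closed loop is not stable, the cost is unbounded")
    stage = q + r_p * k_p ** 2 - r_e * k_e ** 2
    return stage / (-2.0 * a_cl)


# Assumed scalar game data.
a, b_p, b_e = 0.0, 1.0, 1.0
q, r_p, r_e = 1.0, 1.0, 2.0
params = (a, b_p, b_e, q, r_p, r_e)

s = b_p ** 2 / r_p - b_e ** 2 / r_e
p_star = (a + math.sqrt(a * a + q * s)) / s  # root of q + 2ap - s p^2 = 0
k_p_star = b_p * p_star / r_p
k_e_star = b_e * p_star / r_e

v_star = value_coefficient(k_p_star, k_e_star, *params)
# Unilateral deviations of the pursuer (evader keeps its equilibrium gain).
v_pursuer = [value_coefficient(k_p_star + dk, k_e_star, *params)
             for dk in (-0.3, 0.3)]
# Unilateral deviations of the evader (pursuer keeps its equilibrium gain).
v_evader = [value_coefficient(k_p_star, k_e_star + dk, *params)
            for dk in (-0.3, 0.3)]
```

In this example every pursuer deviation raises the value above `v_star` and every evader deviation lowers it, matching the two-sided inequality that defines the saddle point.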
So far, we can summarize Algorithm 1 as follows:

Algorithm 1: The optimality of Nash equilibrium policies for each agent
Step 1: Obtain the system Nash equilibrium expression. The Hamiltonian function (10) is obtained according to the system parameters, and the expression of the multi-agent Nash equilibrium policies (15) and (16) is obtained through the minimax conditions (11)~(14).
Step 2: Construct the difference form of the Hamilton function. The Nash equilibrium function term (19) is constructed in the Hamiltonian function, and the differential Hamiltonian function (20) is obtained to facilitate the comparison of the properties of the Nash equilibrium policies and other policies.
Step 3: Prove the Nash equilibrium of the game. The validity of inequality (23) is proved according to the difference method. By comparing the signs of the integral term, i.e., Equations (26) and (27), it is found that the Nash equilibrium realizes the minmax strategy, which is the property of the value function (9).
The process of proving the optimality of Nash equilibrium is shown in Figure 2.
Figure 2. Proof of the optimality of Nash equilibrium policies for each agent. (Flowchart steps: obtain Nash equilibrium expression; construct difference form Hamilton function; prove the Nash equilibrium of game; with labels Lemma 1 and Theorem 1.)
In this problem, the pursuers share state and control information, so there is a cooperative relationship among the pursuers, which affects the decisions of each agent. According to Equation (6), the value function V is composed of the sum of $V_1, \ldots, V_N$. Due to the coupling between individuals, $\nabla V_k$ is given by Equation (28). When there is no communication between the pursuers, the game model is equivalent to N separate single-pursuer single-evader game problems, and there is no coupling between individuals, which means $\partial V_k / \partial \delta_i = 0$ if $k \neq i$. In that case, the sum of $V_1, \ldots, V_N$ is meaningless, $V_k$ needs to be calculated separately, and $\nabla V_k$ in Equation (28) changes to Equation (29). As this paper considers the case where communication exists between the individuals, the coupling between agents is included in the following parts.
In an MPE game problem, whether the pursuers can capture the evader is important information. If so, the MPE problem is likely to become a finite-time game, as in an interception problem. Next, we demonstrate the conditions under which the capture scenario occurs in the game.
In Theorem 1, we assume that a finite-time capture occurs in the interval $t \in [t_0, +\infty)$ to ensure the existence of the Nash equilibrium. Theorem 2 gives the sufficient conditions for the occurrence of the capture.

Theorem 2.
Let the pursuers satisfy the dynamics given by Equation (1) and the evader satisfy Equation (2). Moreover, let (15) and (16) be the control policies of the pursuers and the single evader, respectively, where the function V(δ) is the solution of the HJI Equation (17). Then, a capture of the MPE game occurs in the sense that the dynamics (4) are asymptotically stable.
Proof of Theorem 2. In order to prove stability, we take the value function as a Lyapunov function. Since V(δ) is the solution of the HJI Equation (17), with $V(\delta) \ge 0$ and $V(\delta(t_0)) = 0$, the derivative $\dot{V}(\delta)$ can be expressed as Equation (30). From Equation (30), if $B_i R_{pi}^{-1} B_i^T - B_e R_e^{-1} B_e^T \ge 0$ holds for i = 1, . . . , N, the derivative of the Lyapunov function is negative, so the dynamics of the MPE game are asymptotically stable, and all pursuers have the potential to catch up with the evader. On the other hand, if $B_i R_{pi}^{-1} B_i^T - B_e R_e^{-1} B_e^T < 0$, then the Lyapunov stability condition is not satisfied, and the state variables of system (4) may diverge. The divergence of the distance between two agents makes it impossible for the capture to occur. In this case, the pursuers cannot capture the evader.

Remark 2.
In particular, when $B_i = B_e = B$ for all i = 1, . . . , N, it can be predicted that the distance between pursuer i and the evader approaches 0 as time passes, provided that $R_{pi}^{-1} - R_e^{-1}$ is positive. On the contrary, if $R_{pi}^{-1} - R_e^{-1}$ is not positive, pursuer i may not be able to catch the evader. In the value function (7), $u_{pi}^T R_{pi} u_{pi}$ and $u_e^T R_e u_e$ represent the control energy consumption of pursuer i and the evader. The matrices $R_{pi}$ and $R_e$ represent soft constraints on the control effort of pursuer i and the evader, acting as control penalties. The larger the control energy weight of the evader or the smaller the control energy weight of the pursuer, the more easily the capture scenario is triggered.
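The capture condition can be explored with a scalar stand-in for the matrix inequality. The sketch below assumes scalar inputs, so that the condition becomes b_i²/r_pi − b_e²/r_e ≥ 0, and a quadratic value function V = p·δ², so that the closed loop under the equilibrium policies is δ̇ = (a − margin·p)·δ. The function names, parameter values, and Euler simulation are assumptions for illustration.

```python
import math


def capture_margin(b_i, r_pi, b_e, r_e):
    """Scalar version of B_i R_pi^{-1} B_i^T - B_e R_e^{-1} B_e^T."""
    return b_i ** 2 / r_pi - b_e ** 2 / r_e


def closed_loop_rate(a, q, margin):
    """Rate a - margin*p under the saddle-point policies, where p is the
    positive root of q + 2*a*p - margin*p**2 = 0 (needs margin > 0)."""
    p = (a + math.sqrt(a * a + q * margin)) / margin
    return a - margin * p


def simulate(delta0, rate, dt, steps):
    """Explicit Euler simulation of delta_dot = rate * delta."""
    delta = delta0
    for _ in range(steps):
        delta += dt * rate * delta
    return delta


a, q = 0.5, 1.0  # assumed open-loop drift (unstable) and state weight

# Pursuer control is cheaper than the evader's: the margin is positive.
margin_ok = capture_margin(b_i=1.0, r_pi=1.0, b_e=1.0, r_e=2.0)
rate_ok = closed_loop_rate(a, q, margin_ok)
final_gap = simulate(delta0=5.0, rate=rate_ok, dt=0.01, steps=1000)

# Pursuer control is more expensive: the sufficient condition fails.
margin_bad = capture_margin(b_i=1.0, r_pi=2.0, b_e=1.0, r_e=1.0)
```

With the positive margin the closed-loop rate equals −sqrt(a² + q·margin), so the relative state decays even though the open-loop drift is unstable; with the negative margin the sufficient condition of the theorem is not met and no capture guarantee follows.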

Solution Using Adaptive Dynamic Programming
In Section 3, we obtained the Nash equilibrium policy of each agent in the MPE game. However, it is challenging to compute the exact policy for actual systems. For the numerical solution of the policy, Pontani and Conway used a genetic algorithm to calculate the off-line control policies of the agents in a zero-sum differential game [11]. However, for the online game problem of a continuous system, an off-line algorithm may not be able to track the states and controls of all the agents. Therefore, based on the concept of reinforcement learning, this section solves for the policies through the policy iteration (PI) method. In addition, it is difficult to directly obtain the derivative of the value function with respect to the state, so it is fitted by an approximation algorithm called value function approximation (VFA). When the PI method iterates to convergence, the Nash equilibrium policies of all agents are obtained.

Policy Iteration
Since the value function of the MPE game above has an integral form, the value function can be decomposed by dividing the upper and lower bounds of the integral. By forming the terms of integral reinforcement learning, the policy iteration (PI) method is executed to solve the game.
Define an infinite-horizon integral cost associated with the control input as:

$$V(\delta(t)) = \int_{t}^{\infty} \Gamma(\delta(\tau), u_p(\tau), u_e(\tau)) \, d\tau \tag{31}$$

with $\Gamma(\delta(\tau), u_p(\tau), u_e(\tau)) = \delta^T Q \delta + u_p^T R_p u_p - u_e^T R_e u_e$. Selecting T as a time period, Equation (31) can be expanded as follows:

$$V(\delta(t)) = \int_{t}^{t+T} \Gamma(\delta(\tau), u_{p1}(\tau), \ldots, u_{pN}(\tau), u_e(\tau)) \, d\tau + V(\delta(t+T)) \tag{32}$$

The integral of $\Gamma(\delta(\tau), u_{p1}(\tau), \ldots, u_{pN}(\tau), u_e(\tau))$ over the time interval [t, t + T] is known as the integral reinforcement. Note that T is not a state or control variable, but a hyper-parameter of the algorithm. Choosing different time intervals can affect the final solution and the efficiency of the algorithm. Divide the whole time period of the game into multiple short time intervals, and assume that [t, t + T] is the j-th time interval. In this time interval, the pursuers and the evader adopt the policies $u_{pi}^{(j)}$ and $u_e^{(j)}$, and the value function under these policies is evaluated by Equation (33). The controls for the next time interval are then obtained from the result of (33) by Equation (34). Note that only the state information and control information of both sides is needed in the solving process. The system matrix A does not participate in the above operation. In practical applications, one often has to solve games for systems with unknown parameters, such as in the modeling of an unconventional aircraft. Therefore, this algorithm can also be used to obtain an online solution effectively.
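The interval decomposition of the value function can be verified numerically in a case where everything is known in closed form. The sketch below assumes a scalar relative state, one pursuer, and fixed linear policies u_p = −k_p·δ, u_e = −k_e·δ with a stable closed loop; it compares the value at time t with the integral reinforcement over [t, t + T] plus the value at t + T. The gains, weights, and interval length are arbitrary illustrative choices.

```python
import math

# Assumed scalar example with fixed linear policies.
a, b_p, b_e = 0.5, 1.0, 1.0
q, r_p, r_e = 1.0, 1.0, 2.0
k_p, k_e = 3.0, 0.5
a_cl = a - b_p * k_p + b_e * k_e  # closed-loop rate, here -2.0


def stage_cost(delta):
    """Gamma = q d^2 + r_p u_p^2 - r_e u_e^2 with u_p = -k_p d, u_e = -k_e d."""
    return (q + r_p * k_p ** 2 - r_e * k_e ** 2) * delta ** 2


def value(delta):
    """Closed-form infinite-horizon cost for the stable closed loop."""
    return stage_cost(delta) / (-2.0 * a_cl)


T, n = 0.2, 2000
dt = T / n
delta_t = 1.5

# Exact closed-loop trajectory delta(tau) = delta_t * exp(a_cl * tau),
# sampled on [t, t + T] and integrated with the trapezoidal rule.
samples = [stage_cost(delta_t * math.exp(a_cl * i * dt)) for i in range(n + 1)]
reinforcement = dt * (sum(samples) - 0.5 * (samples[0] + samples[-1]))

delta_next = delta_t * math.exp(a_cl * T)
lhs = value(delta_t)
rhs = reinforcement + value(delta_next)
```

The two sides agree up to the quadrature error, which is why a learner that only measures the states at the interval endpoints and the accumulated cost can evaluate a policy without using the drift term.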
Equations (33) and (34) compose one cycle of policy iteration. As the cycle continues, the PI method makes the MPE game converge to the Nash equilibrium; Theorem 3 proves the convergence of the PI method. In its proof, comparing the iterated policies with $(u_{pi}^*, u_e^*)$ shows that the function sequence $V_{u_{pi}^*, u_e^{(j)}}(\delta(t))$ increases monotonically. According to Dini's theorem and the uniqueness of the value function (7), the value of the game converges; by Equation (35), the policies converge to $(u_{pi}^*, u_e^*)$ as the value function converges, which completes the proof. Based on Theorem 3, the continuous MPE game problem can be solved online via the PI method, whose solution converges to the Nash equilibrium as the iteration cycle proceeds.
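A policy-iteration cycle of this kind can be sketched for a scalar, single-pursuer game. The code below is a simplified illustration under stated assumptions, not the paper's implementation: the value function is assumed quadratic (V = p·δ²), both gains are updated simultaneously, the state is reset to 1.0 at the start of every cycle to keep the data informative, and a small RK4 simulator (`rollout`) stands in for the real plant. Only the simulator uses the drift term; the learner sees the endpoint states and the accumulated reinforcement.

```python
import math

Q, R_P, R_E = 1.0, 1.0, 2.0  # assumed cost weights, known to the learner
B_P, B_E = 1.0, 1.0          # assumed input gains, known to the learner


def rollout(delta0, k_p, k_e, T, steps=200):
    """Stand-in for the plant: integrate the state and the running cost over
    [t, t+T] with RK4. The drift a_true is used only inside this simulator."""
    a_true = 0.5
    a_cl = a_true - B_P * k_p + B_E * k_e
    c = Q + R_P * k_p ** 2 - R_E * k_e ** 2

    def rhs(d):
        return a_cl * d, c * d * d

    h = T / steps
    d, rho = delta0, 0.0
    for _ in range(steps):
        k1 = rhs(d)
        k2 = rhs(d + 0.5 * h * k1[0])
        k3 = rhs(d + 0.5 * h * k2[0])
        k4 = rhs(d + h * k3[0])
        d, rho = (d + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
                  rho + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)
    return d, rho


def policy_iteration(k_p, k_e, T=0.1, iters=10):
    history = []
    for _ in range(iters):
        delta0 = 1.0  # state reset each cycle (simplification)
        delta_T, rho = rollout(delta0, k_p, k_e, T)
        # Policy evaluation: p * (delta0^2 - delta_T^2) = integral reinforcement.
        p = rho / (delta0 ** 2 - delta_T ** 2)
        # Policy improvement from the gradient of V = p * delta^2.
        k_p = B_P * p / R_P
        k_e = B_E * p / R_E
        history.append(p)
    return k_p, k_e, history


# The initial gains must stabilize the closed loop.
k_p, k_e, history = policy_iteration(k_p=4.0, k_e=0.0)

# Closed-form Riccati root, used here only to check the learned value.
s = B_P ** 2 / R_P - B_E ** 2 / R_E
p_reference = (0.5 + math.sqrt(0.5 ** 2 + Q * s)) / s
```

In this scalar case the simultaneous update coincides with a Newton iteration on the Riccati equation, so `history` settles on `p_reference` within a few cycles; whether the same behaviour carries over to a given multi-pursuer problem depends on the conditions of the convergence theorem.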

Remark 3.
For the continuous-time MPE problem, using the PI method to solve for the Nash equilibrium parameters of the game does not change its convergence, and the policy of each agent will eventually approach the analytical solution. In addition, the algorithm remains applicable to time-varying systems. If A changes suddenly, then as long as the current controller stabilizes the new A, the situation is equivalent to solving for the Nash equilibrium from a new state. By Theorem 3, the game will converge to the Nash equilibrium corresponding to the final form of the system.

Remark 4.
In the PI method, accurate information about the system matrix A is not required when solving for the policies of the agents. This means that for systems whose structure is partly unknown, the policies can still be obtained and the game can still converge to the Nash equilibrium. In each iteration cycle, the state variables $\delta_t$ and $\delta_{t+T}$ and control policies u

Value Function Approximation
In practical applications, solving the HJI equation analytically is often difficult, and in an MPE game it may have no analytical solution at all. Thus, we apply an approximate method that uses a parameterized structure to approximate the value function.
Assume that a finite set of linearly independent basis functions $\phi_k(\delta)$, $k = 1, \dots, L$, can be determined that approximate the value function V. The value function is usually composed of polynomials of the state variables. The value function V can then be approximately represented as

$$V(\delta) \approx w_L^{T} \phi_L(\delta) = \sum_{k=1}^{L} w_k \phi_k(\delta) \quad (36)$$

where L is the number of basis functions in the approximation, $\phi_L(\delta)$ is the vector of basis functions, which is composed of multiple state variables of the agents, and $w_L$ is the vector of unknown weights $w_k$ $(k = 1, \dots, L)$ to be determined. Substituting the VFA into the cost function gives the approximate form of the HJI equation in Equation (37). The focus is then to solve for the unknown weight vector $w_L$ in Equation (36). The weights are initialized before the iteration starts, so a residual error is present until they converge to their analytical values. This residual error, defined from Equation (37), indicates the difference between the actual weight parameters and the parameters in the solving process, and it can be viewed as a temporal difference residual error for the game system.
In the VFA algorithm, we try to fit the value function through the basis function method and learn the weight parameters by building a neural network. At this time, the weight parameters are also called neural network parameters.
At each iteration step, the least-squares method is used to obtain the weight parameters $w_L^{(j)}$ of the VFA that approximates the value function $V^{(j)}$. During convergence, the absolute value of the residual gradually decreases, and when the game converges to the Nash equilibrium, the residual converges to 0.
Hence, the weight parameters $w_L^{(j)}$ of the VFA are adapted so as to minimize the quadratic integral residual S in Equation (39). This residual represents the cumulative error of the algorithm, and it reaches its minimum when its first partial derivative with respect to the weight parameter $w_L^{(j)}$ vanishes, which yields the update in Equation (41), where

$$\Phi = \int \left(\phi_L(\delta(t+T)) - \phi_L(\delta(t))\right) \left(\phi_L(\delta(t+T)) - \phi_L(\delta(t))\right)^{T} d\delta$$

and

$$\Theta = \int \left(\phi_L(\delta(t+T)) - \phi_L(\delta(t))\right) \rho \, d\delta$$

When the residual function S reaches its minimum, the existing basis functions fit the value function as well as they can with the neural network parameters obtained in Equation (41). According to Theorem 3, the value function approximated by the basis functions converges to $V^{*}$. Thus, the policies of every player converge to the Nash equilibrium policies $(u_{pi}^{*}, u_e^{*})$ as the value function converges. In this way, the ADP algorithm avoids the difficulty of solving the composite HJI equation, and the Nash equilibrium policies it obtains are consistent with those of the analytical method.
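The least-squares step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the basis is taken to be all quadratic monomials of the state, each sample is a hypothetical tuple `(delta_t, delta_t_plus_T, rho)` with `rho` the measured integral reinforcement, and the regression uses the convention $w^{T}(\phi(\delta(t)) - \phi(\delta(t+T))) = \rho$, which may differ in sign from the paper's Φ and Θ.

```python
def quadratic_basis(delta):
    """All monomials delta_i * delta_j with i <= j (an assumed basis)."""
    n = len(delta)
    return [delta[i] * delta[j] for i in range(n) for j in range(i, n)]


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: samples are not informative enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(samples, basis=quadratic_basis):
    """Least-squares fit of w from (delta_t, delta_t_plus_T, rho) samples."""
    rows = [[p - q for p, q in zip(basis(d0), basis(d1))] for d0, d1, _ in samples]
    targets = [rho for _, _, rho in samples]
    num_weights = len(rows[0])
    if len(rows) < num_weights:
        raise ValueError("need at least as many samples as weights")
    # Normal equations: (X^T X) w = X^T y
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(num_weights)]
            for i in range(num_weights)]
    moment = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(num_weights)]
    return solve_linear(gram, moment)
```

The explicit sample-count check reflects the fact that the weights cannot be regressed from fewer samples than unknowns; in practice a library least-squares solver would normally replace the hand-written elimination.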

Remark 5.
The value function iteration is embedded in the policy iteration algorithm, i.e., n batches of least-square sample points need to be selected in each iteration process. The number of sample points n in each interval should be more than L (the number of the reinforcement learning parameters to be determined). Otherwise, the reinforcement learning parameter w L cannot be regressed, resulting in the termination of the algorithm iteration. Moreover, the policies of both sides are no longer obtainable.

Numerical Simulation
In this section, we simulate an MPE game with multiple pursuers and a single evader. In the simulation process, the system of agents adopts a class of second-order form, that is, the kind of dynamic model which takes acceleration as control. The trajectory and speed of the agents change in real time.
Consider the MPE game in which the i-th pursuer and the evader follow the dynamics in Equations (42) and (43), respectively, where $s_{pix}$, $s_{piy}$, $v_{pix}$, and $v_{piy}$ are the state variables of the i-th pursuer, representing its position and velocity along the two directions. Similarly, $s_{ex}$, $s_{ey}$, $v_{ex}$, and $v_{ey}$ are the state variables of the single evader, representing its position and velocity along the same two directions. $(a_{pix}, a_{piy})$ and $(a_{ex}, a_{ey})$ are the acceleration pairs of the i-th pursuer and the single evader, which serve as their respective policies.

Subtracting the model of the evader (43) from that of the i-th pursuer (42) gives the difference between their state variables, $\delta = [d_x, \Delta v_x, d_y, \Delta v_y]$, where $d_x$ and $d_y$ are the distance projections in the X and Y directions, respectively. The dynamics of the subtracted model are given by Equation (44). The distance between the two agents is expressed as a function of the state difference in Equation (44) as:

$$d = \sqrt{d_x^{2} + d_y^{2}} \quad (45)$$

In addition, in order to determine whether the pursuers can capture the evader, set the capture radius as l. When the distance between the two agents is less than l, a successful capture takes place, and the MPE game comes to an end.
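The relative model and the capture test can be sketched as below. This assumes a pure double-integrator model (acceleration as the only control), a pursuer-minus-evader sign convention, and a forward-Euler step; the function names and these modeling choices are illustrative assumptions, not taken from the paper.

```python
import math


def step_relative_state(delta, a_p, a_e, dt):
    """One forward-Euler step of the relative state [d_x, dv_x, d_y, dv_y].

    Assumes double-integrator dynamics and delta = pursuer - evader, so the
    relative acceleration is a_p - a_e in each direction.
    """
    d_x, dv_x, d_y, dv_y = delta
    return (
        d_x + dv_x * dt,
        dv_x + (a_p[0] - a_e[0]) * dt,
        d_y + dv_y * dt,
        dv_y + (a_p[1] - a_e[1]) * dt,
    )


def distance(delta):
    """Euclidean distance computed from the position components d_x and d_y."""
    return math.hypot(delta[0], delta[2])


def is_captured(delta, capture_radius):
    """Capture occurs when the distance drops below the capture radius l."""
    return distance(delta) < capture_radius
```

A simulation loop would call `step_relative_state` repeatedly for each pursuer and stop as soon as `is_captured` returns true for any of them.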
Since the benefits of agents are determined by the distance between them, and the velocity is irrelevant, matrix Q in the value function (7) is set as Q = diag (1, 1, 0, 0).
Set the capture radius as l = 0.05 m to start the MPE game simulation. The resulting trajectories of the agents are shown in Figure 3, and Table 1 shows the state of the agents at the end of the game.
After the game starts, the evader accelerates toward the direction where no pursuer exists, and the three pursuers adjust their policies according to the position of the evader. The evader can also adjust its policy to avoid being captured, but it is still captured by pursuer 1 because of its greater energy constraints.
As the three pursuers have the same performance and follow value function (8), they can maintain good coordination in pursuit and finally capture the evader. The evader takes all factors of the state of the three pursuers into consideration and makes the decision to accelerate away from the directions of the pursuers.
During the game, the gradient of the distance first becomes steeper and then gentler. This indicates that the pursuers close the distance as quickly as possible at the beginning, whereas in the later stage, as the distance shrinks, the control energy terms gradually become dominant. The agents thus carry out their pursuit and evasion decisions while keeping their energy consumption as small as possible.
The capture happens at $t_c$ = 4.12 s, when the coordinates of pursuer 1 and the evader are (10.2406, 12.4123) and (10.2419, 12.4489), respectively. The distance between pursuer 1 and the evader is shown in Figure 4.

As the iteration proceeds, the neural network parameters converge as the game reaches the Nash equilibrium, which means that the policies of both the pursuers and the evader converge to their optimal values. The distance between pursuer 1 and the evader at the time of capture is 0.0366 m. Note that the control policies of the agents are updated at each interval T = 0.4 s.
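The reported capture figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the coordinates, capture time, capture radius, and interval length stated in the text; the variable names are illustrative.

```python
import math

# Values reported in the text for the moment of capture.
pursuer_1 = (10.2406, 12.4123)
evader = (10.2419, 12.4489)
capture_radius = 0.05   # l, in metres
capture_time = 4.12     # t_c, in seconds
interval = 0.4          # T, in seconds

# Euclidean distance between pursuer 1 and the evader at capture.
capture_distance = math.hypot(evader[0] - pursuer_1[0], evader[1] - pursuer_1[1])

# Number of full policy-iteration cycles completed before capture.
completed_cycles = int(capture_time // interval)

print(round(capture_distance, 4), capture_distance < capture_radius, completed_cycles)
```

The computed distance rounds to 0.0366 m, which is below the capture radius, and ten full intervals of 0.4 s fit into 4.12 s.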
From the beginning of the game until the capture happens, 10 cycles of the policy iteration are completed. During this period, the parameters of the neural network converge gradually and usually become stable at fixed values from the fourth iteration cycle onward. We verify the effectiveness of the algorithm by obtaining the analytical solution of the network parameters, which can be seen in Figure 5. The coordinates and time of capture are shown in Table 2.

Figure 5. (a) Neural network parameters $w_{1i}$, $i = 1, \dots, 6$; (b) neural network parameters $w_{2i}$, $i = 1, \dots, 6$; (c) neural network parameters $w_{3i}$, $i = 1, \dots, 6$.

As long as the conditions in Theorem 3 are met, different initial values may affect the speed and pattern of convergence of the neural network parameters, but the parameters will eventually converge to the analytical values.
The selection of the time interval T for each iteration step may also influence the solution of the MPE game. The iteration period is selected mainly according to the computing performance of the agents. We now choose a different iteration time interval, T = 0.2 s, and recompute the game. Keeping the other initial states and parameters of the agents unchanged, we run the MPE game and obtain the trajectories of the pursuers and the evader illustrated in Figure 6 and the distances shown in Figure 7.
We can see in Figure 6 that a shorter iteration time interval brings the result closer to the Nash equilibrium. However, the shorter the iteration period, the more iterations are needed over the whole game, so the computational cost increases with the number of iterations, although the accuracy improves.
When a shorter iteration time interval is selected, the pursuers can catch up with the evader faster. This is mainly because the evader in the simulation has weaker maneuverability, so its policy changes only slightly as the time interval varies. For the pursuers, which have strong maneuverability, the improvement is obvious, so the capture distance is shorter and the capture time is smaller. A shorter iteration time interval means that the policies are updated more frequently and the VFA algorithm is executed over a shorter period, which completes the optimization faster. In general, within the limits of the computing performance of the agents, refining the iteration time interval enables the agents to obtain the Nash equilibrium policies faster.
In order to test the stability of the algorithm, we now consider possible errors in the agents. Suppose an error occurs for pursuer 2, meaning that pursuer 2 cannot receive the information from the others needed to execute the policy obtained by the algorithm. It can then only execute its initially defined policy, while pursuer 1 and pursuer 3 remain in a normal state. Running the above simulation under this condition gives the trajectories shown in Figure 8.
It can be seen that when an error occurs for pursuer 2, the other pursuers can still capture the evader. In this case, the capture location is slightly farther away than in the normal case, and the capture time is slightly longer, at 4.32 s. The distances between the three pursuers and the evader are shown in Figure 9, together with the distances when no pursuer has an error for comparison. The capture conditions are shown in Table 3.
Therefore, when there is an error on pursuer 2, the pursuers do not strictly implement the Nash equilibrium policies, so the capture requires a longer distance, and the capture time increases slightly compared with the normal state. Pursuers 1 and 3 try their best to achieve their optimal policies, while pursuer 2 gradually leaves the game due to its error, which also confirms that the Nash equilibrium policies are optimal for each agent in the MPE game.
This simulation experiment evaluated the sensitivity of the algorithm to the initial values and found that changing the sampling interval affects the convergence of the neural network parameters. If the sampling interval is too short, it may lead to over-fitting, which affects the accuracy of the algorithm. In addition, the selection of the initial value of the state does not affect the stability of the algorithm, and the game still gradually converges to the Nash equilibrium during the iteration.

Conclusions
This paper discusses the solution of an MPE game with networked multiple pursuers and a single evader, and the expression of the policy of each agent is obtained by using the minmax strategy. It is proved that when the MPE game reaches the Nash equilibrium, the policy of each agent converges to its Nash equilibrium policy, i.e., the optimal policy of the agent. The adaptive dynamic programming method is used for online policy iteration, and the VFA is adopted, using the data provided through the IoT system to avoid the difficulties of solving the complicated HJI equation. It was shown that the game reaches the Nash equilibrium after multiple iterations.
In this paper, the online ADP method is used to solve the MPE game problem. For a game with an integral value function similar to Equation (7), the PI method proposed in Section 4 can be used to find its solution. However, game problems with terminal or compound value functions are difficult to solve with the ADP method, which limits the use of such methods. In addition, this paper did not take into account complex constraints in the game process, and subsequent research should focus on constraints on the state variables and control variables.
In the future, we will consider the pursuit-evasion game scenarios of more complicated systems. It is worth studying scenarios when there are more evaders and pursuers in one game system. In addition, the algorithm still has room for improvement with regard to computing large-scale neural network parameters.