Robust Tracking Control for Non-Zero-Sum Games of Continuous-Time Uncertain Nonlinear Systems

Abstract: In this paper, a new adaptive critic design is proposed to approximate the online Nash equilibrium solution for the robust trajectory tracking control of non-zero-sum (NZS) games in continuous-time uncertain nonlinear systems. First, an augmented system is constructed by combining the tracking error and the reference trajectory. By modifying the cost function, the robust tracking control problem is transformed into an optimal tracking control problem. Based on adaptive dynamic programming (ADP), a single critic neural network (NN) is applied for each player to approximately solve the coupled Hamilton–Jacobi–Bellman (HJB) equations, and the obtained control laws are regarded as the feedback Nash equilibrium. Two additional terms are introduced in the weight update law of each critic NN, which strengthen the weight update process and eliminate the strict requirement for an initial stabilizing control policy. More importantly, the stability of the closed-loop system is guaranteed through Lyapunov theory, and the robust tracking performance is analyzed. Finally, the effectiveness of the proposed scheme is verified by two examples.


Introduction
Control theory has gradually developed to meet the needs of engineering. In practical engineering, environmental uncertainties, such as noise and temperature, greatly affect the stability of a system, so it is very important to find a control method to handle them. In recent years, several methods to deal with disturbance or uncertainty have been proposed, such as sliding mode control, type-2 fuzzy control [1,2], internal model control, and so on. Robust control can also be applied to solve the control problems of uncertain dynamic systems [3,4]. Based on the development of adaptive dynamic programming (ADP) and control algorithms, several methods have proven effective in solving robust control problems, including guaranteed cost control [5], the system transformation method [6-8], and control schemes for robust stabilization using integral reinforcement learning (IRL) methods [9,10]. These methods mainly embody the ideas of reinforcement learning and adaptive dynamic programming. When studying the optimal control problem using adaptive dynamic programming [11-13], the key is to solve the Hamilton-Jacobi-Bellman (HJB) equation; however, due to the curse of dimensionality, it is almost impossible to solve it directly. Combining neural network (NN) approximation methods with ADP ideas, the adaptive critic design has been widely used in robust control [14,15]. With the adaptive critic design, an approximate solution of the HJB equation can be obtained to cope with robust control problems [5,9,15]. For a system with uncertainties, an upper bound function of the uncertainties is usually given, and the cost function is then modified so that the robust control problem can be transformed into the optimal control problem of a nominal system [15]. This inspired our treatment of the uncertain disturbance. From the above results, it can be seen that the basic regulation problem has been solved.
As the complexity of a system increases, a large class of systems often has multiple controllers, such as immune systems [16] and interconnected systems [17]. Game theory considers individual predictive and practical behavior in a game and studies optimization strategies, so multi-controller system issues can be well addressed by it [18]. As an important branch of game theory, non-zero-sum (NZS) game theory was first proposed in [19]; it aims to find a set of feedback control strategies that achieve the so-called Nash equilibrium while satisfying the defined performance indicators and guaranteeing the system's stability. In this process, the most important step is to solve the coupled HJB equations. Since the coupled HJB equations are difficult to solve directly, many advanced algorithms have been developed. In general, iteration-based algorithms can be used to approximate the solution of the HJB equations. A policy-iteration-based algorithm was used to solve NZS game problems in [20,21]. Considering that it is difficult to know the specific dynamics of complex systems, in [22], based on the iteration algorithm, the Nash equilibrium was obtained approximately by data-based IRL, which does not need known system dynamics. As policy iteration requires an initial stable control policy, an off-policy IRL method was given to solve the coupled HJB equations in [23]. Recently, the ADP method has become an effective tool for solving the coupled HJB equations. To solve the NZS game of unknown nonlinear systems, an approximately optimal control scheme based on the ADP method and a generalized fuzzy hyperbolic model was presented in [24]. Combining the ADP method with the NN structure, the adaptive critic design has also been applied to NZS games. Based on an actor-critic NN structure, an adaptive algorithm was proposed for NZS games of nonlinear systems in [25]. In [26], using experience replay techniques and the framework of a single critic NN, the NZS game of unknown dynamical systems was studied. The methods above can effectively solve the NZS game; however, there are few studies on NZS games with uncertain disturbances. Therefore, based on adaptive critic design, the NZS game of nonlinear systems with uncertain perturbations is studied in this work.
Initially, our research was limited to making the state of the system converge to the origin; however, many controller designs also require the controlled object to track a reference trajectory, especially in noisy and uncertain environments, which is a very common control problem. Trajectory tracking control problems have been solved by several algorithms in [27-34]. Iterative algorithms can be effectively applied to trajectory tracking control. In [27], to overcome some shortcomings of the traditional controller, an adaptive iterative algorithm was proposed for the robot trajectory tracking problem. Considering disturbance, an iterative algorithm based on Q-learning was presented to solve the H∞ tracking problem of discrete-time systems in [28], which did not require system dynamics. In [29], the tracking problem was transformed into a tracking error regulation problem through system transformation, which was solved by the iterative ADP algorithm. Non-iterative algorithms for tracking problems were then proposed in [30-34]. In [30], optimal tracking control was studied using online approximators, but this method required the invertibility of the control matrix. To overcome this requirement, some new methods were proposed. In [31], based on system transformation, a self-learning optimal control method was used to solve the robust trajectory tracking design of uncertain nonlinear systems. Considering the need for multiple outputs in some systems, the robust tracking control of discrete-time systems with multiple inputs and multiple outputs was studied via adaptive critic design in [32]. By modifying the cost function and introducing a discount factor, the guaranteed cost tracking problem was transformed into an optimal tracking problem, and by developing a new critic NN the optimal tracking control problem could be addressed without policy iteration in [33]. For systems with unmatched perturbation, an NN-based ADP algorithm was used to obtain the approximately optimal tracking control law of uncertain nonlinear systems with a predefined cost function in [34]. In this paper, based on the critic NN structure and the ADP method, an augmented system is used to solve the tracking control problem for NZS games with perturbation.
The main contributions of this paper are as follows: (1) An augmented system is constructed by combining the tracking error and the reference trajectory. The robust tracking control problem is transformed into an optimal tracking control problem of the nominal augmented system by modifying the cost functions. This method no longer strictly requires the control matrix to be invertible. Moreover, whereas robust tracking control is usually applied to special systems, here we consider a general system similar to a spring-mass-damper system [31]. (2) For the NZS game between two players with uncertainties, a newly improved adaptive critic design is proposed to solve the revised coupled HJB equations. Two additional terms are introduced in the critic NN weight design: one ensures that the system always remains stable without the need for an initial stabilizing control policy, and the other is used to analyze the stability of the system. (3) Compared with the actor-critic NN structure, each player uses only one critic NN to approximate its value function and control policy, which greatly reduces the amount of computation. By Lyapunov theory, the stability of the closed-loop system is proved, and the trajectory tracking performance is analyzed. What is more, the adaptive critic design can be carried out online.
The rest of this paper is arranged as follows. In the second section, the description of the two-player NZS game with uncertain terms and the construction of the augmented system are given. In the third section, a single critic NN structure is used to approximate the value function for each player, and the approximate feedback Nash equilibrium is then solved; moreover, the system stability analysis and the tracking performance analysis are given. Finally, the effectiveness of the proposed scheme is verified by two examples.

Problem Statement
A class of continuous-time uncertain nonlinear dynamical systems for two-player NZS games is given by

ẋ(t) = f(x(t)) + ∆f(x(t)) + g(x(t))u(t) + k(x(t))v(t), (1)

where x ∈ R^n is the system state, u ∈ R^m is the first control input, and v ∈ R^q is the second control input. The known functions f(·), g(·) and k(·) are Lipschitz continuous on a compact set Ω containing the origin. The uncertain term has the form ∆f(x) = M(x)d(x), where M(·) ∈ R^{n×r} is a known function and d(·) ∈ R^r is an uncertain function with d(0) = 0. One chooses the initial state as x(0) = x_0, and lets the uncertain term ∆f(x(t)) be bounded by a known function.

Here, we introduce a reference trajectory command generator to implement the trajectory tracking, that is,

ṡ(t) = ϕ(s(t)), (2)

where s(t) ∈ R^n denotes the bounded reference trajectory. Let the initial trajectory be s(0) = s_0, where ϕ(s(t)) is a Lipschitz continuous function with ϕ(0) = 0. The tracking error is defined as

e_r(t) = x(t) − s(t). (3)

Then, the initial error vector is e_r(0) = e_r0 = x_0 − s_0. According to (1)-(3), the tracking error dynamics can be obtained as

ė_r(t) = f(x(t)) + ∆f(x(t)) + g(x(t))u(t) + k(x(t))v(t) − ϕ(s(t)). (4)

To introduce the augmented system, we define an augmented state vector ζ(t) = [e_r^T(t), s^T(t)]^T ∈ R^{2n}, and we choose its initial condition as

ζ(0) = ζ_0 = [e_r0^T, s_0^T]^T. (5)

Combining (2) and (4), the augmented system dynamics are simplified to

ζ̇(t) = F(ζ(t)) + ∆F(ζ(t)) + G(ζ(t))u(t) + K(ζ(t))v(t), (6)

where F(·), G(·) and K(·) are the new system matrices and ∆F(ζ) represents the augmented system uncertainty. They are written in the following specific form:

F(ζ) = [(f(e_r + s) − ϕ(s))^T, ϕ(s)^T]^T, G(ζ) = [g(e_r + s)^T, 0]^T, K(ζ) = [k(e_r + s)^T, 0]^T, ∆F(ζ) = [∆f(e_r + s)^T, 0]^T.

It is easy to conclude that ∆F(ζ) is upper bounded, since ‖∆F(ζ)‖ = ‖∆f(e_r + s)‖ and ∆f is bounded by the known function above. In order to better analyze the NZS game with the uncertain perturbation, we decompose the uncertain term ∆F(ζ) into two components associated with the two players,

∆F(ζ) = M_1(ζ)d_1(ζ) + M_2(ζ)d_2(ζ),

where M_1(·) and M_2(·) are known functions in the uncertain term.
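To make the augmented construction concrete, the following minimal sketch builds ζ = [e_r^T, s^T]^T and evaluates the nominal part of ζ̇ = F(ζ) + G(ζ)u + K(ζ)v. The concrete f, g, k and ϕ below are illustrative placeholders, not the paper's example systems.

```python
import numpy as np

# Placeholder dynamics (assumptions for illustration only):
def f(x):      # nominal drift of the original system
    return np.array([x[1], -2.0 * x[0] - 0.5 * x[1]])

def g(x):      # input matrix for player 1
    return np.array([[0.0], [1.0]])

def k(x):      # input matrix for player 2
    return np.array([[0.0], [0.5]])

def phi(s):    # reference trajectory command generator s_dot = phi(s)
    return np.array([s[1], -s[0]])

def augmented_dynamics(zeta, u, v):
    """zeta = [e_r; s]; returns the nominal zeta_dot = F + G u + K v."""
    n = zeta.size // 2
    e_r, s = zeta[:n], zeta[n:]
    x = e_r + s                                   # recover original state
    e_r_dot = f(x) - phi(s) + g(x) @ u + k(x) @ v # tracking error dynamics
    s_dot = phi(s)                                # reference dynamics
    return np.concatenate([e_r_dot, s_dot])

zeta0 = np.array([-1.5, -1.0, 0.5, 0.5])          # [e_r0; s0]
zdot = augmented_dynamics(zeta0, np.array([0.0]), np.array([0.0]))
```

Note that the last n components of ζ̇ depend only on s, which is exactly why the reference generator can be folded into the state without changing the error dynamics.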
Assumption 1. The control function matrices g(x) and k(x) are bounded such that ‖g(x)‖ ≤ λ_g and ‖k(x)‖ ≤ λ_k [31], where λ_g and λ_k are positive constants; hence ‖G(ζ)‖ ≤ λ_g and ‖K(ζ)‖ ≤ λ_k.

By constructing the augmented dynamics (6), the feedback control laws u(ζ) and v(ζ) are designed to make the state of the system move along the reference trajectory while the closed-loop system remains asymptotically stable under the influence of the uncertain term. Next, we give appropriate cost functions to transform the robust control problem into the optimal control problem for the nominal system. For the augmented system (6), we focus on the nominal system part

ζ̇(t) = F(ζ(t)) + G(ζ(t))u(t) + K(ζ(t))v(t). (15)

The two-player cost functions are

J_1(ζ_0) = ∫_0^∞ [Γ_1(ζ) + U_1(ζ, u, v)] dτ, (16)
J_2(ζ_0) = ∫_0^∞ [Γ_2(ζ) + U_2(ζ, u, v)] dτ, (17)

where U_1(ζ, u, v) and U_2(ζ, u, v) are the basic parts of the utility functions; moreover, the admissible feedback policies stabilize (15) on Ω, and the cost functions (16) and (17) are finite ∀ζ_0 ∈ Ω.
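A small sketch of evaluating such a modified utility is given below, under the assumed quadratic form U_i(ζ, u, v) = ζ^T Q_i ζ + u^T R_i1 u + v^T R_i2 v plus an uncertainty-related term Γ_i(ζ). The weighting matrices and the bound function are placeholders, not the paper's choices.

```python
import numpy as np

# Evaluate a modified two-player utility of the assumed form
#   U_i + Gamma_i = zeta^T Q_i zeta + u^T R_i1 u + v^T R_i2 v + Gamma_i(zeta)
# All concrete matrices below are illustrative assumptions.

def utility(zeta, u, v, Q, R1, R2, gamma):
    return float(zeta @ Q @ zeta + u @ R1 @ u + v @ R2 @ v + gamma(zeta))

Q1 = np.eye(4)                      # placeholder state weighting
R11 = np.array([[1.0]])             # placeholder control weighting, player 1
R12 = np.array([[1.0]])             # placeholder control weighting, player 2
gamma1 = lambda z: z[:2] @ z[:2]    # placeholder uncertainty bound term

zeta = np.array([1.0, 0.0, 0.5, 0.0])
U1 = utility(zeta, np.array([0.2]), np.array([-0.1]), Q1, R11, R12, gamma1)
```

Adding Γ_i(ζ) to the running cost is the mechanism by which the robust problem is converted into an optimal one: a policy that is optimal for the inflated cost tolerates any uncertainty under the stated bound.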

Given admissible feedback policies u(ζ) ∈ A(Ω) and v(ζ) ∈ A(Ω), one can define value functions that correspond to the cost functions as

V_1(ζ(t)) = ∫_t^∞ [Γ_1(ζ) + U_1(ζ, u, v)] dτ, (18)
V_2(ζ(t)) = ∫_t^∞ [Γ_2(ζ) + U_2(ζ, u, v)] dτ, (19)

where Γ_1(ζ) and Γ_2(ζ) are defined from the upper bound of the uncertain term. In this paper, a 2-tuple of policies {u, v} is sought to minimize (18) and (19); thus, the optimal value functions V_1* and V_2* are defined as

V_1*(ζ) = min_u V_1(ζ, u, v), V_2*(ζ) = min_v V_2(ζ, u, v).

In addition, there exists a Nash equilibrium in the NZS game between the two players, defined as follows.

Definition 2. (Nash equilibrium policies) A 2-tuple of policies {u*, v*} with u, v ∈ A(Ω) is said to constitute a Nash equilibrium solution for the two-player game [35] if the following two inequalities are satisfied for all u, v ∈ A(Ω):

V_1(u*, v*) ≤ V_1(u, v*), V_2(u*, v*) ≤ V_2(u*, v).

Under the admissible feedback policies, if the value functions (18) and (19) are continuously differentiable, their differential equivalents are given by

0 = Γ_i(ζ) + U_i(ζ, u, v) + (∇V_i)^T (F(ζ) + G(ζ)u + K(ζ)v), i = 1, 2, (26), (27)

with V_i(0) = 0 and ∇V_i = ∂V_i/∂ζ, i = 1, 2. Define the Hamiltonian functions

H_i(ζ, u, v, ∇V_i) = Γ_i(ζ) + U_i(ζ, u, v) + (∇V_i)^T (F(ζ) + G(ζ)u + K(ζ)v), i = 1, 2.

According to the stationarity conditions [36], the two players' optimal feedback control policies are given by

u* = −(1/2)R_11^{-1}G^T(ζ)∇V_1*, (30)
v* = −(1/2)R_22^{-1}K^T(ζ)∇V_2*. (31)

Combining (26), (27), (30) and (31), one obtains the coupled HJB equations

H_1(ζ, u*, v*, ∇V_1*) = 0, H_2(ζ, u*, v*, ∇V_2*) = 0, (32), (33)

where V_1*(0) = 0 and V_2*(0) = 0. To simplify the subsequent expressions, eight non-negative matrices built from G, K and the weighting matrices are introduced; their exact forms follow from substituting (30) and (31) into the Hamiltonians. Since it is difficult to solve the coupled HJB equations directly, we next approximate their solutions using the NN-based adaptive critic design.
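The stationarity-condition policies of the form u* = −(1/2)R_11^{-1}G^T(ζ)∇V_1* can be sketched as below, using an assumed quadratic value function V(ζ) = ζ^T P ζ purely for illustration; the matrices P1, R11 and the vector G are placeholders.

```python
import numpy as np

# Evaluate u* = -1/2 R^{-1} B(zeta)^T grad_V(zeta) for an assumed
# quadratic value function V(zeta) = zeta^T P zeta (illustrative only).

def optimal_policy(P, R, B, zeta):
    grad_V = 2.0 * P @ zeta                       # gradient of zeta^T P zeta
    return -0.5 * np.linalg.solve(R, B.T @ grad_V)

n = 2
P1 = np.eye(2 * n)                                # placeholder value matrix
R11 = np.array([[2.0]])                           # placeholder control weight
G = np.array([[0.0], [1.0], [0.0], [0.0]])        # placeholder G(zeta)
zeta = np.array([1.0, -1.0, 0.5, 0.5])
u_star = optimal_policy(P1, R11, G, zeta)
```

The same function with K(ζ) and R_22 in place of G and R_11 yields the second player's policy, which is what couples the two HJB equations: each player's Hamiltonian contains the other's policy.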

Robust Trajectory Tracking Design for Non-Zero-Sum Games
This section mainly includes two parts.First, the solution of coupled HJB equations is approximated by the adaptive critic design based on a single NN structure, so that the so-called Nash equilibrium is found.Secondly, the stability of the system is proved and the tracking performance is analyzed via the Lyapunov theory.

Neural Network Implementation
In order to realize the neural network approximation, we first introduce the Weierstrass high-order approximation theorem [37,38]. According to Assumption 2, there exist complete independent basis sets {φ_i(ζ)} and {µ_i(ζ)} such that the solutions to (26) and (27) and their gradients are uniformly approximated; that is, there exist coefficients c_i and z_i such that the truncated expansions converge, with the residual terms converging uniformly to zero as K → ∞. Next, we give the specific content of the value function approximation.
For the augmented dynamics (15), the value functions are re-expressed as

V_1(ζ) = W_1^T φ_1(ζ) + ε_1(ζ), (39)
V_2(ζ) = W_2^T φ_2(ζ) + ε_2(ζ), (40)

where W_1 and W_2 are the ideal weight vectors, φ_1(ζ) and φ_2(ζ) are defined as activation function vectors, K is the number of hidden neurons, and ε_1 and ε_2 are the critic NN approximation errors. When K → ∞, ε_1 and ε_2 converge to zero; however, when K is a fixed constant, they are bounded.
Assumption 3. In order to ensure the boundedness, we make the following assumptions, as in [26].
(1) The critic NN activation functions and their gradients are bounded such that ‖φ_i‖ ≤ λ_φi and ‖∇φ_i‖ ≤ λ_dφi, i = 1, 2, where λ_φi and λ_dφi are positive constants. (2) The critic NN approximation errors and their gradients are bounded such that ‖ε_i‖ ≤ λ_εi and ‖∇ε_i‖ ≤ λ_dεi, i = 1, 2, where λ_εi and λ_dεi are positive constants.
(3) The critic NN weights are upper bounded such that ‖W_i‖ ≤ W̄_i, i = 1, 2, where W̄_i are positive constants.
The derivatives of (39) and (40) with respect to ζ are

∇V_1(ζ) = ∇φ_1^T(ζ)W_1 + ∇ε_1(ζ), (41)
∇V_2(ζ) = ∇φ_2^T(ζ)W_2 + ∇ε_2(ζ), (42)

where ∇φ_i = ∂φ_i/∂ζ and ∇ε_i = ∂ε_i/∂ζ, i = 1, 2. Noticing (30), (31), (41) and (42), the optimal control laws are written as

u* = −(1/2)R_11^{-1}G^T(ζ)(∇φ_1^T(ζ)W_1 + ∇ε_1(ζ)), (43)
v* = −(1/2)R_22^{-1}K^T(ζ)(∇φ_2^T(ζ)W_2 + ∇ε_2(ζ)). (44)

Then the associated Bellman equations can be derived, with Bellman equation errors ε_B1 and ε_B2. When the number of critic NN hidden neurons K → ∞, these errors converge to zero [36]; however, when K is a fixed constant, they are bounded by constants. Based on (32), (33), (43) and (44), one obtains the coupled HJB equation approximation errors ε_HJ1 and ε_HJ2, as shown in [36]. Without loss of generality, as K → ∞ they converge to zero; when K is a fixed constant, they are bounded by positive constants such that ‖ε_HJi‖ ≤ λ_εHJi, i = 1, 2.

Since the ideal weights W_1 and W_2 are unknown, they are estimated as Ŵ_1 and Ŵ_2, and the weight estimation errors are defined as W̃_i = W_i − Ŵ_i, i = 1, 2. The estimated value functions are given by

V̂_1(ζ) = Ŵ_1^T φ_1(ζ), V̂_2(ζ) = Ŵ_2^T φ_2(ζ).

Meanwhile, the approximate optimal control policies are presented as

û = −(1/2)R_11^{-1}G^T(ζ)∇φ_1^T(ζ)Ŵ_1, (51)
v̂ = −(1/2)R_22^{-1}K^T(ζ)∇φ_2^T(ζ)Ŵ_2. (52)

Based on (32), (33), (51) and (52), the approximate Hamiltonian functions yield residual errors e_1 and e_2. The next task is to train the neural networks, i.e., to design Ŵ_1 and Ŵ_2 so as to minimize the target function E = (1/2)e_1^T e_1 + (1/2)e_2^T e_2, so that Ŵ_1 and Ŵ_2 converge to W_1 and W_2.
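The residual errors that drive the training can be sketched as follows: for one critic, the approximate Hamiltonian reduces to e = Ŵ^T ∇φ(ζ) ζ̇ + U(ζ, û, v̂). The quadratic feature basis, the weights, and the utility value below are illustrative assumptions (shown here for a 2-dimensional ζ for brevity).

```python
import numpy as np

# Bellman/Hamiltonian residual for one critic with polynomial features:
#   e = W_hat^T grad_phi(zeta) zeta_dot + U(zeta, u_hat, v_hat)
# Features, weights, and the utility value are illustrative placeholders.

def phi_features(zeta):
    # quadratic basis zeta_i * zeta_j, i <= j (n = 2 for brevity)
    z1, z2 = zeta
    return np.array([z1 * z1, z1 * z2, z2 * z2])

def grad_phi(zeta):
    z1, z2 = zeta
    return np.array([[2 * z1, 0.0],
                     [z2, z1],
                     [0.0, 2 * z2]])

def bellman_residual(W_hat, zeta, zeta_dot, utility_value):
    return W_hat @ (grad_phi(zeta) @ zeta_dot) + utility_value

W_hat = np.array([0.3, 0.1, 0.2])        # placeholder critic weights
zeta = np.array([1.0, -0.5])
zeta_dot = np.array([-0.5, 0.25])
U = 1.25                                 # placeholder utility value
e1 = bellman_residual(W_hat, zeta, zeta_dot, U)
```

Driving e_1 and e_2 toward zero along the trajectory is what the gradient-descent weight update in the next part implements.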
To overcome the difficulty of finding initial admissible controllers, the following assumption is given, and an additional term is developed to strengthen the learning process of the critic NN. Assumption 4. Given the cost functions (16) and (17), for the nominal augmented system (15), under the optimal control policies of the two players, we define a continuously differentiable Lyapunov function candidate J_s(ζ) satisfying J̇_s(ζ) = (∇J_s(ζ))^T ζ̇ < 0, where ∇J_s(ζ) = ∂J_s(ζ)/∂ζ. Suppose there exists a positive definite matrix Ξ(ζ) such that (∇J_s(ζ))^T ζ̇ = −(∇J_s(ζ))^T Ξ(ζ)∇J_s(ζ) holds [5].
Here θ is a positive constant [5]. Let λ_m and λ_M denote the minimum and maximum eigenvalues of the matrix Ξ(ζ); then we obtain −λ_M‖∇J_s(ζ)‖^2 ≤ (∇J_s(ζ))^T ζ̇ ≤ −λ_m‖∇J_s(ζ)‖^2. Here, J_s(ζ) can be selected as J_s(ζ) = 0.5ζ^T ζ. Now, based on the normalized gradient descent algorithm, the weights of the critic NN for each player are tuned with two additional terms, as in (58) and (59), where a > 0 is the learning rate of the critic NN and of the third term, and b > 0 is the learning rate of the second term.
In addition, J_s(ζ) is given in Assumption 4. In (58) and (59), Π(ζ, û, v̂) is the additional stabilizing term. Remark 2. The introduced second term guarantees that the system remains stable during the weight update process. When the system is stable, the value of this term is 0; when the system is unstable, this term is activated to reinforce system stability by enhancing the training process. The additional stability term makes the weights update in the direction opposite to J̇_s(ζ): if J̇_s(ζ) ≥ 0, the reinforced training process drives it back to a negative value. On the other hand, when probing noise is needed to satisfy the persistent excitation (PE) condition, the additional stabilizing term keeps the system in a closed-loop stable state, so the system no longer needs an initial stabilizing control. The third terms given in (58) and (59) are used in the subsequent stability analysis.
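A hedged sketch of such a normalized-gradient update with a conditionally activated stabilizing term is shown below. The paper's exact term involves ∇J_s(ζ) and the system matrices; here a simplified surrogate direction is used purely to illustrate the switching logic.

```python
import numpy as np

# Normalized-gradient critic update of the schematic form
#   W_hat_dot = -a * sigma / (sigma^T sigma + 1)^2 * e  +  (stabilizing term)
# where sigma = grad_phi(zeta) zeta_dot, and the stabilizing term is
# active only while J_s_dot >= 0 (system drifting away from stability).
# The surrogate stabilizing direction is an illustrative assumption.

def critic_update(W_hat, sigma, e, a, b, Js_dot, stab_direction):
    normalized = sigma / (sigma @ sigma + 1.0) ** 2
    dW = -a * normalized * e          # gradient descent on 1/2 e^T e
    if Js_dot >= 0.0:                 # instability detected:
        dW += b * stab_direction      # activate the additional term
    return dW

sigma = np.array([-1.0, 0.5, -0.25])
W_dot = critic_update(np.zeros(3), sigma, e=0.95, a=2.0, b=0.5,
                      Js_dot=-0.1, stab_direction=np.ones(3))
```

With J̇_s < 0 the update is pure normalized gradient descent; the conditional branch is what removes the need for an initial stabilizing policy, since a destabilizing transient simply switches the extra term on.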

Stability Analysis
In this section, we give several theorems, together with some additional assumptions, to prove the stability of the closed-loop nominal augmented system and to analyze the tracking performance. Assumption 5. Assume that the matrices associated with each player's control input have upper bounds. Theorem 1. For the nominal augmented system (15), let the pair of feedback control laws {û, v̂} be derived by (51) and (52), and let the weight vectors of the critic NNs be trained by (58) and (59), respectively. Then the closed-loop system state and the critic NN weight estimation errors are both uniformly ultimately bounded (UUB).

Proof. See the Appendix A.
According to Theorem 1, it is easy to conclude that the feedback control laws converge.
Corollary 1.The control policies converge to the approximate Nash equilibrium solution of the NZS game.
In addition to the convergence of the system state, the tracking performance of the system is also an important indicator. Therefore, we present Theorem 2 to show that system (1) can track the reference trajectory (2) well. Theorem 2. Given the cost functions (16) and (17), for the nominal augmented system (15), the approximate optimal control laws obtained by (51) and (52) ensure that the tracking error dynamics are UUB.

Proof. See the Appendix A.
Remark 3. In this section, we have given an optimal robust tracking control scheme for the NZS game, which can, in theory, be extended to the N-player NZS game.

Simulation

4.1. Two-Player Linear Non-Zero-Sum Game
Consider a continuous-time uncertain linear system (67), where x = [x_1, x_2]^T ∈ R^2 is the state variable, u ∈ R and v ∈ R are the control inputs, and η_1 and η_2 are the uncertain parameters. The last term of system (67) is the uncertain term, which is bounded by a known function. The reference trajectory s(t) is generated by system (68), where s = [s_1, s_2]^T ∈ R^2 is the reference state. One lets the initial reference state vector be s_0 = [0.5, 0.5]^T. Defining the tracking error as e_r = x − s, so that ė_r = ẋ − ṡ, let the augmented state vector be ζ = [e_r^T, s^T]^T. Then we have the augmented system dynamics (69), which include the uncertain term of the augmented system. Here, we choose M_1(ζ) = [1, 0, 0, 0]^T and M_2(ζ) = [0, 1, 0, 0]^T, which give the respective decomposed uncertain terms. Therefore, the initial state of the augmented system is ζ_0 = [−1.5, 0.5, 0.5, 0.5]^T, with the initial tracking error vector e_r0 = [−1.5, 0.5]^T. Let the learning rates be a = 2 and b = 0.5. Moreover, a probing noise is introduced to satisfy the persistence of excitation (PE) condition. The state trajectories and reference trajectories are displayed in Figures 1 and 2. After the learning process, Figures 3 and 4 show that the weights of critic NN1 and critic NN2 converged to [0.2521, 0.0627, −0.0501, 0.0213, −0.0487, 0.0373, 0.0134, 0.0188, 0.0171, 0.0273]^T and [0.1934, −0.0558, 0.0248, 0.2574, 0.1487, −0.0026, −0.1406, −0.0039, −0.0134, 0.0928]^T, respectively. Since the initial weights were all set to zero, we can conclude that the system did not require initial stabilizing control policies. The control trajectories for each player are shown in Figure 5.
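The closed-loop behaviour reported here can be reproduced in spirit with a short numerical check. The sketch below is not system (67): it uses a placeholder linear error system with fixed stabilizing gains standing in for the learned critic policies, and simply verifies the UUB-style decay of the tracking error under Euler integration.

```python
import numpy as np

# Illustrative closed-loop check: e_dot = A e + B u + C v with assumed
# linear feedback for both players. All matrices and gains below are
# placeholders, not the paper's example system (67).

A = np.array([[0.0, 1.0], [-1.0, -1.0]])
B = np.array([[0.0], [1.0]])          # player 1 input channel
C = np.array([[0.0], [0.5]])          # player 2 input channel
K_u = np.array([[1.0, 1.0]])          # assumed feedback gain, player 1
K_v = np.array([[0.5, 0.5]])          # assumed feedback gain, player 2

e = np.array([-1.5, -1.0])            # initial tracking error
dt = 0.01
for _ in range(2000):                 # 20 s of forward-Euler integration
    u = -K_u @ e
    v = -K_v @ e
    e = e + dt * (A @ e + B @ u + C @ v)
```

The closed-loop matrix A − B K_u − C K_v here has eigenvalues with real part −1.125, so the error norm contracts to a small neighbourhood of the origin, mirroring the convergence shown in Figure 6.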
Figure 6 demonstrates that the tracking errors converged to 0, which indicates that system (67) can track the reference trajectory (68) well. To verify the robustness of the method, one can choose η_1 = −0.5 and η_2 = 0.5 and then repeat the simulation. The resulting tracking error and control input are depicted in Figures 7 and 8, which again demonstrate the desired trajectory tracking performance.

4.2. Two-Player Nonlinear Non-Zero-Sum Game
Consider a continuous-time uncertain nonlinear system (70). In this example, the reference signal s(t) is generated by system (71). The critic NN activation functions and the learning rates a and b are the same as in the first example. Similarly, the augmented system dynamics can be constructed. Let the initial system state vector be x_0 = [−0.5, −0.5]^T and the initial reference trajectory vector be s_0 = [0.5, 0.5]^T; then the initial state of the augmented system is ζ_0 = [−1, −1, 0.5, 0.5]^T. The state trajectories and reference trajectories are displayed in Figures 9 and 10. Figures 11 and 12 show that the weights of critic NN1 and critic NN2 converge to [0.4582, 0.2514, −0.2907, −0.2567, 0.1455, −0.1353, −0.1050, 0.1527, 0.1321, 0.1112]^T and [0.2622, 0.0666, −0.0854, −0.0858, 0.0879, −0.0610, −0.0470, 0.0601, 0.0487, 0.0406]^T, respectively. It can again be seen that initial stabilizing control policies were not required. The control trajectories for each player are shown in Figure 13. The tracking errors are displayed in Figure 14, which indicates that system (70) can track the reference trajectory (71) well. These experimental results verify the effectiveness of the proposed method.

Conclusions
In this paper, an ADP-based robust tracking control design was proposed for the NZS game of nonlinear systems with dynamic uncertainties. Firstly, the tracking error and reference trajectory were used to construct the augmented system, and the coupled HJB equations were modified by defining appropriate performance indicators. Then, a new adaptive critic design was proposed to solve the coupled HJB equations, in which a single-network structure was used to approximate the value function and control policy for each player. With the modified critic NN weight tuning law, the control policies of the two players converged to the Nash equilibrium of the NZS game. What is more, the proof that the system state, tracking error and weight estimation errors are UUB was given via Lyapunov theory. Finally, two simulation results verified the effectiveness of the proposed scheme. In the future, we will consider input constraints and state constraints for this problem.
To summarize, if the inequality p > max(M_1, M_2) = M or ‖∇J_s(ζ)‖ > max(N_1, N_2) = N holds, then L̇ < 0, and hence the system state and the weight estimation errors are UUB. This completes the proof.
Proof of Theorem 2. We choose the Lyapunov function candidate given in (A13). Differentiating L_1 along ζ yields an expression containing ε_F1 = ∇ε_1^T ∆F(ζ); since ∇ε_1 and ∆F(ζ) are bounded, let ‖ε_F1‖ ≤ λ_εF1. The analysis of V̂_2 is similar to that of V̂_1. To summarize, if the inequality ‖ζ‖ > C_1 or q > C_2 holds, then L̇_1 < 0, and hence the tracking errors of the closed-loop uncertain augmented system are UUB. This completes the proof.
R_21 and R_22 are positive definite matrices. Γ_1(ζ) and Γ_2(ζ) are related to the dynamical uncertainty, with Γ_1(ζ) ≥ 0 and Γ_2(ζ) ≥ 0. What is more, the feedback controllers required to solve the optimal control problem are admissible. The definition of admissible policies is given below. Definition 1. (Admissible policies) Control functions u(ζ) and v(ζ) are said to be admissible with respect to (16) and (17) on Ω if they are continuous, stabilize the nominal system (15) on Ω, and render the cost functions (16) and (17) finite ∀ζ_0 ∈ Ω.

Figure 3. Convergence curves of the critic NN1 weights for player 1.

Figure 4. Convergence curves of the critic NN2 weights for player 2.