Optimal Tracking Control of a Nonlinear Multiagent System Using Q-Learning via Event-Triggered Reinforcement Learning

This article presents an optimal tracking control method for unknown nonlinear multiagent systems (MASs) that combines an event-triggered technique with the internal reinforcement Q-learning (IrQL) algorithm. Relying on the internal reinforcement reward (IRR) formula, a Q-function is calculated, and an iterative IrQL method is then developed. In contrast to time-triggered mechanisms, the event-triggered algorithm reduces the transmission rate and computational load, since the controller is updated only when the predetermined triggering conditions are met. In addition, to implement the suggested scheme, a reinforce-critic-actor (RCA) neural network structure is created that can assess the performance indices and learn the event-triggering mechanism online. This strategy is data-driven and requires no in-depth knowledge of the system dynamics. We develop an event-triggered weight-tuning rule that modifies the parameters of the actor neural network (ANN) only in response to triggering events. In addition, a Lyapunov-based convergence study of the reinforce-critic-actor neural networks (NNs) is presented. Lastly, an example demonstrates the feasibility and efficiency of the suggested approach.


Introduction
Recently, distributed coordination control of MASs has received a great deal of attention as a result of its extensive applications in power systems [1,2], multi-vehicle systems [3], multi-area power systems [4], and other fields. MASs present a variety of problems, such as consensus control [5][6][7], synchronization control [8,9], anti-synchronization control [10], and tracking control [11]. Reinforcement learning (RL) [12] and adaptive dynamic programming (ADP) methods [13,14] have been employed by researchers as a means of solving optimal control problems. Due to their excellent ability for global approximation, neural networks are well suited for dealing with nonlinearities and uncertainties [15]. ADP has great online learning and adaptive ability when it uses neural networks. Furthermore, researchers have used RL/ADP algorithms to settle optimal coordination control matters in many directions, including tracking control [16][17][18][19], graphical games [19], consensus control [20], containment control [21], and formation control [22]. The controllers designed in the above works rely on traditional time-triggered methods. In [23,24], it was suggested that the traditional time-triggered implementation be changed to an event-triggered one. Because of the increasing number of agents, MASs incur substantial computing costs related to the exchange of information. Traditionally, the controller or actuator is updated continuously over a fixed period while the system is in operation. In order to minimize computation and preserve resources, aperiodic sampling is employed in the event-triggering method to improve the controller's computational efficiency, and a number of event-based methods have been developed for discrete-time systems. The main contributions of this article are as follows: (1) With respect to nonlinear MAS tracking control, the authors of [32] proposed an IrQL framework, which differs from [18,33,34], and the design of a new long-term IRR signal is completed.
This signal was designed on the basis of the neighbors' data to provide more information to the agent. The IRR function is used to define a Q-function, and an iterative IrQL method is proposed for obtaining optimally distributed control schemes.
(2) A new triggering condition is designed and executed in an asynchronous and distributed manner [24]. As a result, each agent triggers at its own time, so there is no need to update the controller periodically. For the purpose of achieving online learning, an event-triggered reinforce-critic-actor neural network is established to determine the optimal event-triggered control scheme. In contrast with other papers [18,33,35,36], this paper adjusts the weights aperiodically, and the ANN is adjusted only when a trigger event is encountered.
(3) The objective of this paper is to develop an effective tracking control method using a new triggering mechanism built on the IrQL method. For the event-triggered optimal control mechanism, the Lyapunov approach is used to provide rigorous stability assurance for the closed-loop multi-agent network. The designed RCA-NN framework [32] offers an effective means of executing the proposed method online without requiring any knowledge of the system dynamics. We compare the proposed method with the traditional triggering method and the IrQL method. According to the simulation results, the designed algorithm solves the tracking control problem with good tracking performance.
This article is organized as follows. Section 2 provides an overview of graph theory foundations and the problem formulation. In Section 3, the IrQL-based HJB equations are obtained. In Section 4, the event-triggered optimal controller is designed to build the proposed algorithm. Section 5 develops the RCA-NN, and the convergence of the neural network weights is established using Lyapunov techniques. The effectiveness and correctness of the method are demonstrated through simulation examples and comparisons in Section 6. The last section presents our conclusions.

Theoretical Basis of Graphs
The exchange of information between agents can be modeled using a directed graph G = (V, E, A), in which V = {υ_1, υ_2, . . . , υ_N} represents a nonempty set of N nodes and E = {(υ_i, υ_j) | υ_i, υ_j ∈ V} ⊆ V × V represents the edge set, where (υ_i, υ_j) ∈ E indicates that agent i can receive data from agent j. We define the adjacency matrix A = [a_ij], which contains no negative elements: a_ij > 0 is satisfied if (υ_i, υ_j) ∈ E; otherwise, a_ij = 0. N_i = {j | (υ_i, υ_j) ∈ E} is defined as the set of neighbors of node i, and a_ij > 0 holds for each j ∈ N_i. We denote the in-degree matrix D = diag{d_i}, where d_i = Σ_{j∈N_i} a_ij. A leader's relationship with its followers is the subject of this article. In order to describe leader-follower interactions, we use an augmented directed graph Ĝ = (V̂, Ê), in which V̂ = {0, 1, 2, . . . , N} and Ê ⊆ V̂ × V̂. The leader's communication with follower i is determined by b_i: if b_i > 0, the follower receives information from the leader; otherwise, b_i = 0. B = diag{b_1, . . . , b_N} ∈ R^{N×N} is defined as the matrix of leader connections.
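To make these graph quantities concrete, the following pure-Python sketch builds the adjacency matrix A, the in-degree matrix D, the leader connection matrix B, and the neighbor sets N_i described above. The three-follower graph, its edge weights, and the pinning gains are illustrative assumptions, not taken from this article.

```python
# Hypothetical 3-follower graph: agent 1 hears agent 2, agent 2 hears
# agent 3, agent 3 hears agent 1; only agent 1 is pinned to the leader.
N = 3
A = [[0, 1, 0],   # a_ij > 0 iff agent i receives data from agent j
     [0, 0, 1],
     [1, 0, 0]]
b = [1, 0, 0]     # leader pinning gains b_i

# In-degree matrix D = diag{d_i}, with d_i = sum_j a_ij
d = [sum(row) for row in A]
D = [[d[i] if i == j else 0 for j in range(N)] for i in range(N)]

# Leader connection matrix B = diag{b_1, ..., b_N}
B = [[b[i] if i == j else 0 for j in range(N)] for i in range(N)]

# Neighbor sets N_i = {j | a_ij > 0}
neighbors = [{j for j in range(N) if A[i][j] > 0} for i in range(N)]
print(d, neighbors)
```

Each follower here has exactly one neighbor, so every in-degree d_i equals 1.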

Problem Formulation
Consider a nonlinear MAS with one leader and N followers, where the dynamics of the ith follower are as follows: x_i(k + 1) = Ax_i(k) + B_i u_i(k). (1) In this case, x_i ∈ R^n represents the system state, u_i ∈ R^{p_i} represents the control input, and A ∈ R^{n×n} and B_i ∈ R^{n×p_i} represent the unknown plant and input matrices.
The leader is written as follows: x_0(k + 1) = Ax_0(k), (2) where x_0 ∈ R^n represents the leader state.

Assumption 1.
The augmented graph Ĝ contains a spanning tree with the leader as the root node, and Ĝ does not contain repeated edges.

Definition 1.
We design a control scheme u_i(k) that requires only local agent information, so that the followers can track the leader. A perfect control scheme is achieved when the following condition is met [32]: lim_{k→∞} ||x_i(k) − x_0(k)|| = 0 for i = 1, . . . , N. The MAS's local consensus error is expressed as follows: e_i(k) = Σ_{j∈N_i} a_ij (x_i(k) − x_j(k)) + b_i (x_i(k) − x_0(k)). The global error vector is then e(k) = ((L + B) ⊗ I_n)(x(k) − x̄_0(k)), where L = D − A is the graph Laplacian, x(k) = (x_1^T(k), . . . , x_N^T(k))^T ∈ R^{nN}, x̄_0(k) = 1_N ⊗ x_0(k) ∈ R^{nN}, and 1_N is the N-dimensional vector of ones.
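The local consensus error above can be sketched in code as follows; scalar agent states and a hypothetical two-follower chain (one follower pinned to the leader) are assumptions made for brevity.

```python
def local_consensus_error(i, x, x0, A, b, neighbors):
    """Local neighborhood tracking error for agent i (scalar states):
    e_i = sum_{j in N_i} a_ij * (x_i - x_j) + b_i * (x_i - x0)."""
    e = sum(A[i][j] * (x[i] - x[j]) for j in neighbors[i])
    return e + b[i] * (x[i] - x0)

# Two followers in a chain; follower 0 is pinned to the leader.
A = [[0, 0], [1, 0]]
b = [1, 0]
neighbors = [set(), {0}]
x, x0 = [1.5, 2.0], 1.0
e0 = local_consensus_error(0, x, x0, A, b, neighbors)
e1 = local_consensus_error(1, x, x0, A, b, neighbors)
print(e0, e1)  # 0.5 0.5
```

Both errors vanish exactly when every follower state equals the leader state, matching the tracking condition above.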

Design of the IrQL Method
To resolve the tracking control issue in multiagent systems, the authors of [32] developed the IrQL method. Importantly, in order to provide agents with a greater level of local information from other agents and the environment, IRR information is introduced, thereby improving control and learning efficiency. In addition, a Q-function is defined for each agent, and the relevant HJB equation is derived using the IrQL method.
As an example, consider the following IR function for the ith agent: r_i(k) = e_i^T(k) Q_ii e_i(k) + u_i^T(k) R_ii u_i(k) + Σ_{j∈N_i} e_j^T(k) Q_ij e_j(k), where u_{−i} = {u_j | j ∈ N_i} represents the inputs of the agent's neighbors, and the weight matrices R_ii > 0, Q_ii > 0, and Q_ij > 0 are positive definite.
According to the IR function, the IRR function is expressed as R_i(k) = Σ_{t=k}^{∞} r^{t−k} r_i(t), where r ∈ (0, 1] is its discount factor. The following performance index must be minimized for every agent to solve the optimal tracking control problem: J_i(k) = Σ_{t=k}^{∞} β^{t−k} R_i(t), (10) where β ∈ (0, 1] is the performance index discount factor.
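The IRR and the performance index built on it can be illustrated numerically. The finite-horizon truncation and the sample IR sequence below are assumptions made for demonstration; the article's sums run to infinity.

```python
def irr(ir_seq, k, r):
    """Internal reinforcement reward: discounted sum of the one-step
    IR values from step k onward, R_i(k) = sum_t r**(t-k) * ir(t),
    truncated to the finite sequence supplied."""
    return sum(r**(t - k) * ir_seq[t] for t in range(k, len(ir_seq)))

def performance_index(ir_seq, k, r, beta):
    """Performance index built on the IRR rather than the IR:
    J_i(k) = sum_t beta**(t-k) * R_i(t)."""
    return sum(beta**(t - k) * irr(ir_seq, t, r)
               for t in range(k, len(ir_seq)))

ir_seq = [1.0, 1.0, 1.0]      # illustrative one-step IR values
R0 = irr(ir_seq, 0, 0.5)      # 1 + 0.5 + 0.25 = 1.75
print(R0)
```

Note that the computed values obey the Bellman-style recursion R_i(k) = r_i(k) + r · R_i(k + 1) used later in the derivation.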

Remark 1.
The designed IRR function incorporates accumulated prospective long-term reward data from the IR function. In contrast with the majority of methods, the performance index is measured based on the IRR rather than the IR. The advantage is that the control actions can be enhanced, and the learning process can be accelerated by using a greater amount of data.

Remark 2.
Intrinsic motivation (IM) provides a possible method for enhancing the faculty of abstract actions or overcoming the difficulties associated with exploring the environment in reinforcement learning. The IRR acts as a drive by which the agent learns skills through intrinsic motivation [32].

Definition 2.
In order to resolve the MAS's tracking control issue, we propose a distributed tracking control scheme such that e_i(k) → 0 as the time step k approaches infinity while simultaneously minimizing the performance index (10).
Based on the control methods of the agent and its neighbors, u_i(k) and u_{−i}(k), we can obtain the state value function V_i(e_i(k)) = Σ_{t=k}^{∞} β^{t−k} R_i(t). (11) Equation (11) can also be expressed in the following recursive form: V_i(e_i(k)) = R_i(k) + βV_i(e_i(k + 1)). Based on optimality theory, the optimal state value function satisfies V_i*(e_i(k)) = min_{u_i(k)} [R_i(k) + βV_i*(e_i(k + 1))], where the IRR function in Bellman form is expressed as R_i(k) = r_i(k) + rR_i(k + 1). On the basis of the stationarity condition ∂V_i(e_i(k))/∂u_i(k) = 0, the optimal distributed control method is given below: u_i*(k) = −(1/2) R_ii^{−1} ((d_i + b_i)B_i)^T [r ∂R_i(k + 1)/∂e_i(k + 1) + β ∂V_i*(e_i(k + 1))/∂e_i(k + 1)], (15) where ∂e_i(k + 1)/∂u_i(k) = (d_i + b_i)B_i.

Remark 3.
As is well known, the state value function V_i(e_i(k)) is defined only over the state space. The Q-learning method in RL is instead designed on the state-action function. Each agent can use the Q-function to estimate the value of all possible decisions in the current situation, and we can determine the best behavior of the agent at each step by using the Q-function.
The Q-function is written as follows: Q_i(e_i(k), u_i(k), u_{−i}(k)) = R_i(k) + βV_i(e_i(k + 1)). (16) In accordance with the optimal scheme, the optimal Q-function is given by Q_i*(e_i(k), u_i(k), u_{−i}(k)) = R_i(k) + βV_i*(e_i(k + 1)). (17) Based on Equations (16) and (17), we can express the optimal solution as follows: u_i*(k) = arg min_{u_i(k)} Q_i*(e_i(k), u_i(k), u_{−i}(k)). (18) In comparison with the control method of Equation (15), the optimal Q-function provides the optimal control scheme without requiring explicit knowledge of the system dynamics. As a result, we intend to calculate the solution to Equation (17).
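A minimal tabular sketch of a Q-learning backup of this kind is given below. The tabular setting, the two states and actions, and the step size are illustrative assumptions, since this article actually approximates the Q-function with neural networks; the point is the Bellman target built from a reward plus a discounted minimum over successor actions.

```python
def q_update(Q, s, a, reward, s_next, actions, beta, alpha):
    """One iterative backup (tabular sketch): move Q(s, a) toward the
    Bellman target R + beta * min_a' Q(s', a'). Here `reward` plays the
    role of the IRR signal R_i(k); the index is a cost, so the target
    uses a minimum rather than a maximum."""
    target = reward + beta * min(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = [0, 1]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
q_update(Q, 0, 1, 2.0, 1, actions, beta=0.9, alpha=1.0)
print(Q[(0, 1)])  # 2.0 on the first backup (successor values are zero)
```

Repeating such backups over visited state-action pairs is the tabular analogue of the iterative IrQL scheme.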

Designs of the Event-Driven Controller
In a previous work [18], a time-triggered controller was developed. Here, in contrast, a new event-triggering mechanism is designed to minimize computing costs.
The sequence {k_s^i} (s = 0, 1, 2, . . .) is defined as the sequence of trigger times of agent i. At the triggering instant, the sampled disagreement error is expressed as ê_s^i = e_i(k_s^i). The triggering times depend on the threshold value and the error. The control scheme is updated only when k = k_s^i and is held constant under all other circumstances. To design a triggering condition, we propose a function that measures the gap between the current error and the previously sampled error: s_i(k) = ê_s^i − e_i(k). The triggering error is reset to zero at k = k_s^i.
The dynamics of the local error under the event-triggered control approach can then be written in terms of the sampled error ê_s^i, from which the event-triggered form of the error dynamics is obtained, and the optimal tracking control under the event-triggered approach is expressed by evaluating the optimal control law at the sampled error ê_s^i. Assumption 2. There is a constant L that satisfies the following inequality: ||u_i(e_i(k)) − u_i(ê_s^i)|| ≤ L||e_i(k) − ê_s^i||. Assumption 3. The triggering condition is as follows: an event is triggered, and the controller is updated, when ||s_i(k)||² > π_i T, where π_i T represents the triggering threshold and L ∈ (0, √2/2) [24]. Once the multi-agent system dynamics have stabilized, the followers are able to track their leader.
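The gap-versus-threshold test can be sketched as follows. The exact threshold form used in this article may differ, so the constant and the vectors below are illustrative assumptions.

```python
def norm_sq(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)

def should_trigger(e_sampled, e_now, threshold):
    """Event test (illustrative form): the gap s_i(k) between the last
    sampled error and the current error is compared with a threshold;
    the controller is updated only when the gap grows past it."""
    gap = [a - b for a, b in zip(e_sampled, e_now)]
    return norm_sq(gap) > threshold

e_sampled = [1.0, 0.0]
print(should_trigger(e_sampled, [0.9, 0.1], 0.5))  # small gap: no update
print(should_trigger(e_sampled, [0.0, 1.0], 0.5))  # large gap: update
```

Between events the gap stays below the threshold and the controller holds its last value, which is exactly how transmissions are saved.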

Neural Network Implementation for the Event-Triggered Approach Using the IrQL Method
This section discusses the three-network structure, known as the RCA-NN. Three virtual networks are included in this structure.

Reinforce Neural Network (RNN) Learning Model
The reinforce NN is employed to approximate the IRR signal as follows: R̂_i(k) = ω_r2i^T(k) φ_ri(ω_r1i^T(k) Z_ri(k)), where Z_ri(k) represents the input vector composed of e_i(k), u_i(k), and u_{−i}(k); ω_r1i represents the input-to-hidden weight matrix; ω_r2i represents the hidden-to-output weight matrix; and φ_ri(·) represents the activation function [24]. The associated error function of the reinforce NN is e_ri(k) = R̂_i(k) − r_i(k) − rR̂_i(k + 1), and the loss function is written as E_ri(k) = (1/2) e_ri²(k). For convenience, only the matrix ω_r2i is updated; the matrix ω_r1i remains unchanged during the training process.
The RNN's update law is expressed as ω_r2i(k + 1) = ω_r2i(k) − α_ri ∂E_ri(k)/∂ω_r2i(k), where α_ri represents the rate at which the RNN learns. Applying the gradient descent rule (GDR) yields the following weight update for the reinforce NN: ω_r2i(k + 1) = ω_r2i(k) − α_ri e_ri(k) ∆_ri(k), where ∆_ri(k) = φ_ri(ω_r1i^T(k) Z_ri(k)).
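Updating only the output-layer weights by the gradient descent rule can be sketched with a small two-layer approximator. The tanh activation, the dimensions, and the target value are illustrative assumptions; the frozen input-to-hidden weights mirror keeping ω_r1i fixed.

```python
import math

def forward(w1, w2, z):
    """Two-layer approximator: hidden h = tanh(w1 z), output = w2 . h.
    w1 (input-to-hidden) is frozen; only w2 (hidden-to-output) trains."""
    h = [math.tanh(sum(w1[i][j] * z[j] for j in range(len(z))))
         for i in range(len(w1))]
    return sum(w2[i] * h[i] for i in range(len(w2))), h

def gdr_step(w2, h, err, alpha):
    """Gradient descent on the loss 0.5 * err**2 with respect to w2:
    the gradient is err * h, so w2 <- w2 - alpha * err * h."""
    return [w2[i] - alpha * err * h[i] for i in range(len(w2))]

w1 = [[0.5, -0.2], [0.1, 0.3]]   # frozen input-layer weights
w2 = [0.0, 0.0]                  # trainable output-layer weights
z, target = [1.0, 1.0], 0.7
for _ in range(200):
    y, h = forward(w1, w2, z)
    w2 = gdr_step(w2, h, y - target, alpha=0.5)
y, _ = forward(w1, w2, z)
print(round(y, 3))
```

Because the hidden features are fixed, the problem is linear in w2 and the GDR iteration converges toward the target.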

Critic Neural Network (CNN) Learning Model
In this section, the critic NN is designed to closely approximate the Q-function: Q̂_i(k) = ω_c2i^T(k) φ_ci(ω_c1i^T(k) Z_ci(k)), where Z_ci(k) represents the input vector composed of R̂_i(k), e_i(k), u_i(k), and u_{−i}(k), while ω_c1i(k) and ω_c2i(k) represent the input-layer and output-layer weight matrices.
The error function of the CNN can be expressed as e_ci(k) = Q̂_i(k) − R̂_i(k) − βQ̂_i(k + 1), and its loss function is written as E_ci(k) = (1/2) e_ci²(k). As with the RNN, only ω_c2i is updated, and ω_c1i remains unchanged.
With the help of the gradient descent rule (GDR), the weight update law can be expressed as ω_c2i(k + 1) = ω_c2i(k) − α_ci ∂E_ci(k)/∂ω_c2i(k), where α_ci represents the critic NN's learning rate. Furthermore, we can obtain the weight update scheme for the critic NN: ω_c2i(k + 1) = ω_c2i(k) − α_ci e_ci(k) ∆_ci(k), where ∆_ci(k) = φ_ci(ω_c1i^T(k) Z_ci(k)).

Actor Neural Network (ANN) Learning Model
Based on the actor NN, the approximate optimal scheme is defined as follows: û_i(k) = ω_a2i^T(k) φ_ai(ω_a1i^T(k) Z_ai(k)), where the input of the ANN is Z_ai(k) = e_i(k), ω_a1i represents the input-layer weight matrix, and ω_a2i represents the output-layer weight matrix. The prediction error of the actor NN is obtained from the critic's evaluation, and the corresponding loss function E_ai(k) is its squared norm. As with the RNN and CNN, ω_a1i remains unchanged throughout the learning process. The actor NN update law is defined as ω_a2i(k + 1) = ω_a2i(k) − α_ai ∂E_ai(k)/∂ω_a2i(k), where α_ai represents the ANN learning rate; the resulting weight-tuning scheme propagates the critic's sensitivity to û_i(k) through the chain rule, with ∆_ai(k) = φ_ai(ω_a1i^T(k) Z_ai(k)). Algorithm 1 describes in detail how the controller is designed using the RCA-NNs and event triggering. The actor NN is updated only when the trigger conditions are met.
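The event-triggered actor-update schedule of Algorithm 1 can be caricatured as follows. The scalar weight, the stand-in update rule, and the error sequence are all illustrative assumptions rather than the full RCA-NN tuning law; what the sketch shows is that the actor parameters move only at trigger instants.

```python
def run_episode(errors, threshold, alpha):
    """Sketch of the event-triggered actor-weight schedule: the (scalar)
    actor weight moves only at trigger instants, i.e. when the squared
    gap between the current local error and the last sampled one
    exceeds the threshold. Between events the controller holds."""
    w, e_sampled, triggers = 0.0, errors[0], 0
    for e in errors:
        if (e - e_sampled) ** 2 > threshold:   # triggering condition
            e_sampled = e                      # resample the error
            w -= alpha * e                     # event-triggered ANN step
            triggers += 1
        # otherwise: no transmission, no weight update
    return w, triggers

w, n = run_episode([1.0, 0.95, 0.4, 0.38, -0.2, -0.19], 0.1, 0.1)
print(n)  # only 2 of the 6 steps cause an update
```

The trigger count n is what Figure 3 tallies per agent; fewer triggers mean fewer controller updates and transmissions.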
In the following section, we present an analysis of stability and convergence based on the Lyapunov method. Assumption 4. The weight matrices are bounded, i.e., ||ω_r2i(k)|| ≤ ω_rim, ||ω_c2i(k)|| ≤ ω_cim, and ||ω_a2i(k)|| ≤ ω_aim. The activation functions are bounded, i.e., ||∆_ri(k)|| ≤ ∆_rim, ||∆_ci(k)|| ≤ ∆_cim, and ||∆_ai(k)|| ≤ ∆_aim. Moreover, the activation function φ_ai(·) is Lipschitz, satisfying ||φ_ai(e_i(k_s^i)) − φ_ai(e_i(k))|| ≤ θ_ai ||e_i(k_s^i) − e_i(k)|| = θ_ai ||s_i(k)|| ≤ θ_ai π_i T, where θ_ai and π_i T are positive constants. The approximation errors of the NN outputs are defined as δ_ci(k) = ω_c2i(k)∆_ci(k), δ_ai(k) = ω_a2i(k)∆_ai(k), and ϑ_ri(k) = ω_r2i(k)∆_ri(k). Theorem 1. Assume that Assumptions 1 and 2 hold and that the CNN and ANN weights are updated by (36) and (42). When the triggering condition (26) is satisfied, the local consensus error e_i(k) and the critic and actor estimation errors are uniformly ultimately bounded. Furthermore, the control method u_i converges to the optimal value u_i*.
Proof. At the triggering instants, the first-order differences ∆L_1(k), ∆L_2(k), and ∆L_3(k) of the components of the Lyapunov function candidate are computed in turn. Adding Equations (47), (51), and (56), we obtain ∆L(k). If the stated conditions are met, we can derive ∆L(k) ≤ 0. The proof is completed.

Statistical Data Illustration
To demonstrate the viability of the proposed method, a simulation is presented in this section.

Nonlinear MAS Consisting of One Leader and Six Followers
A nonlinear MAS consisting of one leader and six followers was considered. Figure 1 depicts the communication graph of the studied MAS: node 0 is the leader, and nodes 1, 2, 3, 4, 5, and 6 are the followers. The corresponding adjacency matrix satisfies a_14 = a_21 = a_32 = a_43 = a_52 = a_65 = 1. There is a weighted relationship between the leader and the followers determined by the pinning gains b_i, and the weight matrices were chosen accordingly. The learning rates are α_ri = 0.95, α_ai = 0.90, and α_ci = 0.07 (i = 1, 2, 3, 4, 5, 6), with discount factors r = 0.57 and β = 0.9.
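The adjacency matrix of this example can be constructed directly from the listed entries; the pure-Python sketch below does so and checks the in-degrees. The pinning gains b_i are not reproduced here, since their values are not given above.

```python
# Adjacency of the six-follower graph in the example: the only nonzero
# entries are a_14 = a_21 = a_32 = a_43 = a_52 = a_65 = 1 (1-based).
N = 6
edges = [(1, 4), (2, 1), (3, 2), (4, 3), (5, 2), (6, 5)]
A = [[0] * N for _ in range(N)]
for i, j in edges:
    A[i - 1][j - 1] = 1

d = [sum(row) for row in A]   # in-degrees d_i
print(d)  # every follower listens to exactly one neighbor
```

With exactly one incoming edge per follower (plus the leader pinning shown in Figure 1), information from the leader can reach every node, consistent with the spanning-tree condition of Assumption 1.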
For the agents, the activation functions of the RNNs and ANNs were selected appropriately. According to Figure 2, all followers were able to accurately follow the leader, and the whole MAS achieved synchronization. Figure 3 illustrates the six agents' cumulative number of trigger instants. On average, the number of trigger instants for the six agents was approximately 220, whereas with the traditional RL method, the number was approximately 1000. As a result, the computational burden was reduced by 78.0% in comparison with the conventional time-triggered method. Figure 4 illustrates the trigger mechanism of each agent, which indicates that the actor network weights are updated only when the triggering condition is satisfied. As can be seen in Figure 5, which plots the triggering error trajectory ||s_i(k)||² together with the triggering thresholds π_i T (i = 1, 2, 3, 4, 5, 6), the triggering error converges over time. Figures 6 and 7 illustrate the local neighborhood errors under the proposed control method, showing that they converge to 0 by k = 60. The local neighborhood errors of [32] are shown in Figures 8 and 9. In comparison with Figures 8 and 9, our proposed control method produced a better convergence effect. Figures 10 and 11 show the estimation of the ANN weight parameters. With the proposed control method, the actor network weights stabilize faster than with the IrQL method alone.

Conclusions
In this study, an event-triggered optimal control problem for model-free MASs was examined using the IrQL method based on RL. A new IrQL method was introduced by adding an additional IRR function [32]; as a result, more information could be obtained by the agent. Having defined the IRR formula, we defined the Q-function and derived the corresponding HJB equation. An iterative IrQL approach was designed to calculate the optimal control strategy. Building on the IrQL algorithm, an event-triggered controller was presented; it updates the controller only at the triggering instants to reduce the burden on computing resources and the transmission network. An RCA-NN was used to implement the suggested approach, which eliminated the need for a model of the system. The convergence of the neural network weights was established using the Lyapunov method. To assess the performance and control efficiency of the suggested algorithm, a simulation model was used. Further research will examine the effect of the discount factors on system reliability.