Article

Optimal Tracking Control of a Nonlinear Multiagent System Using Q-Learning via Event-Triggered Reinforcement Learning

College of Electronic and Information Engineering, Southwest University, Chongqing 400700, China
* Author to whom correspondence should be addressed.
Entropy 2023, 25(2), 299; https://doi.org/10.3390/e25020299
Submission received: 13 December 2022 / Revised: 25 January 2023 / Accepted: 27 January 2023 / Published: 5 February 2023
(This article belongs to the Section Multidisciplinary Applications)

Abstract

This article offers an optimal tracking control method that combines an event-triggered technique with the internal reinforcement Q-learning (IrQL) algorithm to address the tracking control problem of unknown nonlinear multiagent systems (MASs). Relying on the internal reinforcement reward (IRR) formula, a Q-learning function is calculated, and the iterative IrQL method is then developed. In contrast to time-triggered mechanisms, the event-triggered algorithm reduces the transmission rate and computational load, since the controller is updated only when the predetermined triggering conditions are met. In addition, in order to implement the suggested scheme, a neural reinforce-critic-actor (RCA) network structure is created that can assess the performance indices and learn the event-triggering mechanism online. This strategy is intended to be data-driven, without requiring in-depth knowledge of the system dynamics. We then develop the event-triggered weight-tuning rule, which modifies the parameters of the actor neural network (ANN) only in response to triggering events. In addition, a Lyapunov-based convergence study of the reinforce-critic-actor neural networks (NNs) is presented. Lastly, an example demonstrates the feasibility and efficiency of the suggested approach.

1. Introduction

Recently, distributed coordination control of MASs has received a great deal of attention as a result of its extensive applications in power systems [1,2], multi-vehicle systems [3], multi-area power systems [4], and other fields. MASs involve a variety of problems, such as consensus control [5,6,7], synchronization control [8,9], anti-synchronization control [10], and tracking control [11]. Reinforcement learning (RL) [12] and adaptive dynamic programming (ADP) methods [13,14] have been employed by researchers as a means of solving optimal control problems. Due to their excellent global approximation ability, neural networks are well suited to dealing with nonlinearities and uncertainties [15], and ADP gains strong online learning and adaptive ability when it uses neural networks. Furthermore, researchers have used RL/ADP algorithms to solve optimal coordination control problems in many directions, including tracking control [16,17,18,19], graphical games [19], consensus control [20], containment control [21], and formation control [22]. The controllers designed in the above works rely on traditional time-triggered methods; in [23,24], it was suggested that the traditional implementation be replaced by an event-triggered one.
With an increase in the number of agents, MASs must bear large computational and communication costs related to information exchange. Traditionally, the controller or actuator is updated at a fixed sampling period during system operation. To lessen the computational burden and save resources, aperiodic sampling is used in the event-triggering scheme to improve the associated controller’s computational efficiency. Researchers have developed several event-based methods to address discrete-time systems [25] as well as continuous-time systems [26,27]. In these results, the system dynamics are assumed to be known accurately ahead of time. However, accurate dynamics are not always available in practice. In [24], an event-triggered controller was proposed for systems with inaccurate or unknown dynamics.
Early applications of reinforcement learning (RL) included the application of Q-learning to process control [28], chemical process control, industrial process automation, and other areas. The Q-learning algorithm provides a model-free, data-driven method for solving control problems. It is important to note that all potential actions in the present state [29] are evaluated in the Q-learning method, relying on the Q-function. At present, Q-learning is used primarily for routing optimization and reception processing in network communication within the domain of network management [30]. Since the emergence of AlphaGo, active research has been conducted on game theory-based control with real-time reinforcement learning [31]. At present, there is also some research on tracking control issues for nonlinear MASs based on Q-learning, such as in [32].
As mentioned above, the optimal control problem of MASs has been solved using RL/ADP methods. The majority of the above results share two common features. First, the direct use of the instantaneous or immediate reward (IR) signal to define each agent’s performance index function provides limited learning information. Second, a state value function is used to derive the Hamilton–Jacobi–Bellman (HJB) equation, and the corresponding controller is then designed using RL/ADP. In a wide range of realistic applications, it is beneficial to provide each agent with richer information signals in order to enhance its learning capability, and performance can be viewed from a broader perspective than the current state alone. The purpose of our research is to overcome the limitations described above.
Taking into consideration the aforementioned findings, this work investigates a solution to the optimal control problem for MASs with unknown nonlinearity to enhance both the learning process and the effectiveness of the control system. Utilizing graph theory, the coordination control problem is first formulated. Based on the gathered IR information, internal reinforcement reward (IRR) signals are constructed to provide a longer-term reward. Based on the IRR function, a Q-function is then developed to assess the efficacy of each agent’s control scheme. In addition, a tracking control technique is developed using iterative IrQL to derive the HJB equation for each agent. Then, based on the IrQL technique, a triggering mechanism is employed to establish the tracking control scheme. Finally, an optimal event-triggered controller based on a reinforce-critic-actor network structure is created. The event-triggering mechanism in a closed-loop approach guarantees that the network weights converge and the system remains stable. The main contributions of this study are as follows:
(1) With respect to nonlinear MAS tracking control, an IrQL framework following [32], which differs from [18,33,34], is adopted, and a new long-term IRR signal is designed. This signal is constructed on the basis of neighbors’ data to provide more information to the agent. The IRR function is used to define a Q-function, and an iterative IrQL method is proposed for obtaining optimal distributed control schemes.
(2) A new triggering condition is designed in an asynchronous and distributed manner [24]. As a result, each agent triggers at its own instants, and there is no need to update the controller periodically. For the purpose of achieving online learning, an event-triggered reinforce-critic-actor neural network is established to determine the optimal control scheme at the triggering instants. Compared with other papers [18,33,35,36], this paper adjusts the weights aperiodically, and the ANN is adjusted only when a triggering event occurs.
(3) The objective of this paper is to develop an effective tracking control method using a new triggering mechanism built on the IrQL method. For the event-triggered optimal control mechanism, the Lyapunov approach is used to provide a rigorous stability guarantee for the closed-loop multi-agent network. The designed RCA-NN framework [32] offers an effective means of executing the proposed method online without requiring any knowledge of the system dynamics. We compared the proposed method with the traditional time-triggered activation method. According to the simulation results, the designed algorithm solves the tracking control problem with good tracking performance.
This article is organized as follows. Section 2 provides an overview of the graph-theoretic foundations and the problem formulation. In Section 3, the IrQL-based HJB equations are obtained. Section 4 presents the event-triggered optimal controller design underlying the proposed algorithm. Section 5 develops the RCA-NNs, and the Lyapunov technique is used to prove convergence of the neural network weights. The effectiveness and correctness of the method are demonstrated through simulation examples and comparisons in Section 6. The last section concludes the paper.

2. Preliminaries

2.1. Theoretical Basis of Graphs

The exchange of information between agents can be modeled using a directed graph $G = (V, E, A)$, in which $V = \{\upsilon_1, \upsilon_2, \ldots, \upsilon_N\}$ represents $N$ nonempty nodes and $E = \{(\upsilon_i, \upsilon_j) \mid \upsilon_i, \upsilon_j \in V\} \subseteq V \times V$ represents the edge set, where $(\upsilon_i, \upsilon_j) \in E$ indicates that agent $i$ can obtain data from agent $j$. We define the adjacency matrix $A = [a_{ij}]$ with nonnegative elements $a_{ij}$, where $a_{ij} > 0$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise. $N_i = \{j \mid (i, j) \in E\}$ is defined as the set of neighbors of node $i$, and $a_{ij} > 0$ for each $j \in N_i$. We denote the in-degree matrix $D = \mathrm{diag}\{d_i\}$, where $d_i = \sum_{j \in N_i} a_{ij}$. The Laplacian matrix is then defined as $L = D - A \in \mathbb{R}^{N \times N}$.
This article considers the relationship between a leader and its followers. In order to describe the leader-follower interactions, we use an augmented directed graph $\hat{G} = (\hat{V}, \hat{E})$, in which $\hat{V} = \{0, 1, 2, \ldots, N\}$ and $\hat{E} \subseteq \hat{V} \times \hat{V}$. The communication between the leader and follower $i$ is described by $b_i$: if $b_i > 0$, follower $i$ receives information from the leader; otherwise, $b_i = 0$. $B = \mathrm{diag}\{b_1, \ldots, b_N\} \in \mathbb{R}^{N \times N}$ is defined as the leader connection matrix.
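As an illustration of the matrices defined above, the following sketch (a non-authoritative Python/NumPy example with a hypothetical three-agent edge set) builds the adjacency matrix A, the in-degree matrix D, the Laplacian L = D − A, and the leader connection matrix B.

```python
import numpy as np

# Hypothetical directed graph with N = 3 followers; a_ij > 0 means that
# agent i receives data from agent j.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# In-degree matrix D = diag{d_i} with d_i = sum_{j in N_i} a_ij.
D = np.diag(A.sum(axis=1))

# Laplacian matrix L = D - A.
L = D - A

# Leader connection matrix B = diag{b_1, ..., b_N}; here only the first
# follower is assumed to receive the leader's information.
B = np.diag([1.0, 0.0, 0.0])

print(L)
print(L + B)   # (L + B) appears in the global error Equation (5)
```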

2.2. Problem Formulation

Consider a nonlinear MAS with one leader and $N$ followers, where the dynamics of the $i$th follower are given as follows:
$$ x_i(k+1) = A x_i(k) + B_i u_i(k), \tag{1} $$
where $x_i \in \mathbb{R}^n$ represents the system state, $u_i \in \mathbb{R}^{p_i}$ represents the control input, and $A \in \mathbb{R}^{n \times n}$ and $B_i \in \mathbb{R}^{n \times p_i}$ represent the unknown plant and input matrices.
The leader is written as follows:
$$ x_0(k+1) = A x_0(k), \tag{2} $$
where $x_0 \in \mathbb{R}^n$ represents the leader state.
Assumption 1.
The communication graph $\hat{G}$ contains a spanning tree with the leader as the root node, and $\hat{G}$ does not contain repeated edges.
Definition 1.
The goal is to design a control scheme $u_i(k)$ that relies only on local agent information such that the followers track the leader, i.e., the following condition is satisfied [32]:
$$ \lim_{k \to \infty} \left\| x_i(k) - x_0(k) \right\| = 0, \quad i = 1, 2, \ldots, N. \tag{3} $$
The local consensus error of the MAS is expressed as follows:
$$ e_i(k) = \sum_{j \in N_i} a_{ij} \left( x_i(k) - x_j(k) \right) + b_i \left( x_i(k) - x_0(k) \right). \tag{4} $$
The global error vector is then presented as follows:
$$ e(k) = \left( (L + B) \otimes I_n \right) \left( x(k) - \hat{x}_0(k) \right), \tag{5} $$
where $e(k) = (e_1^T(k), e_2^T(k), \ldots, e_N^T(k))^T \in \mathbb{R}^{nN}$, $x(k) = (x_1^T(k), x_2^T(k), \ldots, x_N^T(k))^T \in \mathbb{R}^{nN}$, $\hat{x}_0(k) = \mathbf{1}_N \otimes x_0(k) \in \mathbb{R}^{nN}$, and $\mathbf{1}_N$ is the $N$-dimensional vector of ones.
The tracking error is written as $\zeta_i(k) = x_i(k) - x_0(k)$, which has the vector form
$$ \zeta(k) = x(k) - \hat{x}_0(k), \tag{6} $$
where $\zeta(k) = (\zeta_1^T(k), \zeta_2^T(k), \ldots, \zeta_N^T(k))^T \in \mathbb{R}^{nN}$ and $\hat{x}_0(k) = (x_0^T(k), x_0^T(k), \ldots, x_0^T(k))^T$.
Consequently, according to Equations (1) and (4), the local neighborhood error $e_i(k)$ evolves in the following manner:
$$ e_i(k+1) = A e_i(k) + (d_i + b_i) B_i u_i(k) - \sum_{j \in N_i} a_{ij} B_j u_j(k) = F_i(e_i(k), u_i(k)). \tag{7} $$
Given Equations (5) and (6), $e(k)$ and $\zeta(k)$ are related such that $\lim_{k \to \infty} \zeta(k) = 0$ when $\lim_{k \to \infty} e(k) = 0$. Consequently, the tracking control problem is resolved when the local neighborhood error converges to zero.
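To make the local neighborhood error (4) concrete, the sketch below computes e_i(k) from hypothetical follower and leader states; the graph data and state values are illustrative assumptions, not taken from the paper's simulation.

```python
import numpy as np

def local_error(i, x, x0, A, b):
    """Local neighborhood error e_i(k) of Equation (4).

    x  : (N, n) array of follower states at step k
    x0 : (n,) leader state at step k
    A  : (N, N) adjacency matrix, b : (N,) leader weights
    """
    e_i = b[i] * (x[i] - x0)
    for j in range(x.shape[0]):
        e_i += A[i, j] * (x[i] - x[j])
    return e_i

# Hypothetical data: three followers with two-dimensional states.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
b = np.array([1.0, 0.0, 0.0])
x = np.array([[0.5, 0.6], [0.4, 0.7], [0.9, 0.7]])
x0 = np.array([0.67, 0.79])
print(local_error(0, x, x0, A, b))
```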

3. Design of the IrQL Method

To resolve the tracking control problem of multi-agent systems, the authors of [32] developed the IrQL method. The key idea is that, in order to provide agents with more local information from other agents or the environment, IRR information is introduced, thereby improving control and learning efficiency. In addition, a Q-function is defined for each agent, and the relevant HJB equation is obtained using the IrQL method.
Consider the following IR function for the $i$th agent:
$$ j_i(e_i(k), u_i(k), u_{N_i}(k)) = e_i^T(k) R_{ii} e_i(k) + u_i^T(k) Q_{ii} u_i(k) + \sum_{j \in N_i} u_j^T(k) Q_{ij} u_j(k), \tag{8} $$
where $u_{N_i} = \{ u_j \mid j \in N_i \}$ denotes the control inputs of agent $i$'s neighbors, and the weight matrices $R_{ii} > 0$, $Q_{ii} > 0$, and $Q_{ij} > 0$ are positive definite.
Based on the IR function, the IRR function is expressed as follows:
$$ R_i(e_i(k), u_i(k), u_{N_i}(k)) = \sum_{s=k}^{\infty} \varrho^{\,s-k} j_i(e_i(s), u_i(s), u_{N_i}(s)), \tag{9} $$
where $\varrho \in (0, 1]$ is the discount factor of the IRR function.
To solve the optimal tracking control problem, the following performance index must be minimized for every agent:
$$ J_i(e_i(0), u_i(0), u_{N_i}(0)) = \sum_{t=0}^{\infty} \beta^t R_i(e_i(t), u_i(t), u_{N_i}(t)), \tag{10} $$
where $\beta \in (0, 1]$ is the discount factor of the performance index.
Remark 1.
The designed IRR function accumulates prospective long-term reward information from the IR function. In contrast to the majority of methods, the performance index is measured in terms of the IRR rather than the IR. The advantage is that the control actions can be improved and the learning process accelerated by exploiting a larger amount of data.
Remark 2.
Intrinsic motivation (IM) provides a possible way of enhancing the ability to abstract actions or of overcoming the difficulties associated with exploring the environment in reinforcement learning. The IRR acts as an intrinsic motivation that drives the agent to learn skills [32].
Definition 2.
The distributed tracking control problem of the MAS is solved if a distributed control scheme is found such that $e_i(k) \to 0$ as the time step $k$ approaches infinity while the performance index (10) is minimized simultaneously.
Based on the control schemes of the agent and its neighbors, $u_i(t)$ and $u_{N_i}(t)$, the state value function is obtained as follows:
$$ V_i(e_i(k)) = \sum_{t=k}^{\infty} \beta^{\,t-k} R_i(e_i(t), u_i(t), u_{N_i}(t)). \tag{11} $$
Equation (11) can also be expressed as the following recursive formula:
$$ V_i(e_i(k)) = R_i(e_i(k), u_i(k), u_{N_i}(k)) + \beta V_i(e_i(k+1)). \tag{12} $$
According to optimality theory, the optimal state value function satisfies the following:
$$ V_i^{*}(e_i(k)) = \min_{u_i(k)} \left[ R_i(e_i(k), u_i(k), u_{N_i}(k)) + \beta V_i^{*}(e_i(k+1)) \right]. \tag{13} $$
In this case, the IRR function is expressed in Bellman form as
$$ R_i(e_i(k), u_i(k), u_{N_i}(k)) = j_i(e_i(k), u_i(k), u_{N_i}(k)) + \varrho R_i(e_i(k+1), u_i(k+1), u_{N_i}(k+1)). \tag{14} $$
Based on the stationarity condition $\partial V_i(e_i(k)) / \partial u_i(k) = 0$, the optimal distributed control scheme is given below:
$$ u_i^{*}(k) = \arg\min_{u_i(k)} \left[ R_i(e_i(k), u_i(k), u_{N_i}(k)) + \beta V_i^{*}(e_i(k+1)) \right] = -\frac{1}{2} \beta (d_i + b_i) Q_{ii}^{-1} h_i^T(x_i(k)) \nabla V_i^{*}(e_i(k+1)), \tag{15} $$
where $\nabla V_i^{*}(e_i(k+1)) = \partial V_i^{*}(e_i(k+1)) / \partial e_i(k+1)$.
Remark 3.
As is well known, the state value function $V_i(e_i(k))$ is closely tied to the state space. The Q-learning method is an RL method designed on the basis of the state-action value function. Each agent can use the Q-function to estimate the quality of all possible decisions in the current situation, and the best behavior of the agent at each step can be determined from the Q-function.
The Q-function is written as follows:
$$ Q_i(e_i(k), u_i(k), u_{N_i}(k)) = R_i(e_i(k), u_i(k), u_{N_i}(k)) + \beta V_i(e_i(k+1)). \tag{16} $$
In accordance with the optimal scheme, the optimal Q-function is given by
$$ Q_i^{*}(e_i(k), u_i(k), u_{N_i}(k)) = R_i(e_i(k), u_i(k), u_{N_i}(k)) + \beta Q_i^{*}(e_i(k+1), u_i^{*}(k+1), u_{N_i}^{*}(k+1)). \tag{17} $$
Based on Equations (16) and (17), we can express the optimal control scheme as follows:
$$ u_i^{*}(k) = \arg\min_{u_i(k)} Q_i^{*}(e_i(k), u_i(k), u_{N_i}(k)). \tag{18} $$
Compared with the control scheme of Equation (15), the optimal Q-function directly provides the optimal control scheme here. Therefore, we aim to solve Equation (17).
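The following minimal sketch illustrates how the stage cost (8) and a truncated version of the IRR (9) could be evaluated numerically; the finite-horizon truncation, the scalar dimensions, and the matrix values are assumptions made purely for illustration.

```python
import numpy as np

def stage_cost(e_i, u_i, u_neighbors, R_ii, Q_ii, Q_ij):
    """Instantaneous reward j_i of Equation (8)."""
    cost = e_i @ R_ii @ e_i + u_i @ Q_ii @ u_i
    cost += sum(u_j @ Q_ij @ u_j for u_j in u_neighbors)
    return cost

def irr(stage_costs, rho):
    """Truncated internal reinforcement reward of Equation (9):
    sum over s of rho**(s - k) * j_i(s), evaluated over a finite window."""
    return sum(rho**s * j for s, j in enumerate(stage_costs))

# Hypothetical one-dimensional example.
R_ii = np.array([[1.0]])
Q_ii = np.array([[1.0]])
Q_ij = np.array([[1.0]])
j0 = stage_cost(np.array([0.3]), np.array([0.1]), [np.array([0.05])],
                R_ii, Q_ii, Q_ij)
print(irr([j0, 0.8 * j0, 0.6 * j0], rho=0.57))
```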

4. Design of the Event-Triggered Controller

A time-triggered controller was developed in a previous work [18]. Nevertheless, a new event-triggering mechanism is designed here to reduce the computational cost.
Let $\{ k_{t_s}^i \}$ denote the sequence of triggering instants of agent $i$. At a triggering instant, the sampled disagreement error is expressed as $\hat{e}_i^s = e_i(k_{t_s}^i)$.
The triggering instants are determined by the triggering threshold and the triggering error. The control scheme is updated only when $k = k_{t_s}^i$ and is held constant under all other circumstances:
$$ u_i(k) = u_i(k_{t_s}^i), \quad k \in [k_{t_s}^i, k_{t_{s+1}}^i). \tag{19} $$
To design a triggering condition, we introduce a function that measures the gap between the current error and the previously sampled error:
$$ \epsilon_i^s(k) = \hat{e}_i^s - e_i(k), \quad k \in [k_{t_s}^i, k_{t_{s+1}}^i). \tag{20} $$
The triggering error is reset to zero at $k = k_{t_s}^i$.
The local error dynamics under the event-triggered control approach can be written as
$$ e_i(k+1) = F_i(e_i(k), u_i(k_{t_s}^i)). \tag{21} $$
Thus, the event-triggered Bellman equations are obtained:
$$ V_i^{*}(e_i(k)) = \min_{u_i(k_{t_s}^i)} \left[ R_i(e_i(k), u_i(k_{t_s}^i), u_{N_i}(k_{t_s}^i)) + \beta V_i^{*}\!\left( F_i(e_i(k), u_i(k_{t_s}^i)) \right) \right], \tag{22} $$
$$ Q_i^{*}(e_i(k)) = R_i(e_i(k), u_i(k_{t_s}^i), u_{N_i}(k_{t_s}^i)) + \beta Q_i^{*}\!\left( F_i(e_i(k), u_i(k_{t_s}^i)) \right). \tag{23} $$
The optimal tracking control under the event-triggered approach can be expressed in the following way:
$$ u_i^{*}(k) = \arg\min_{u_i(k_{t_s}^i)} Q_i^{*}(e_i(k)). \tag{24} $$
Assumption 2.
There exists a constant $L$ such that the following inequality holds:
$$ \left\| F_i(e_i(k), u_i(k_{t_s}^i)) \right\| \le L \left\| e_i(k) \right\| + L \left\| \epsilon_i^s(k) \right\|. \tag{25} $$
Assumption 3.
The triggering condition is given as follows:
$$ \left\| \epsilon_i^s(k) \right\|^2 \le \frac{1 - 2L^2}{2L^2} \left\| e_i(k) \right\|^2 = \pi_i^T, \tag{26} $$
where $\pi_i^T$ represents the triggering threshold and $L \in (0, \sqrt{2}/2)$ [24]. Once the multi-agent system dynamics have stabilized, the followers are able to track the leader.
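A minimal sketch of how the triggering rule could be checked in code is given below; the constant L and the error values are illustrative assumptions, and an update is triggered when the condition of Equation (26) is violated.

```python
import numpy as np

def should_trigger(e_k, e_sampled, L=0.6):
    """Check whether the triggering condition of Equation (26) is violated.

    e_k       : current local neighborhood error e_i(k)
    e_sampled : error held since the last triggering instant, hat{e}_i^s
    L         : constant from Assumption 2, assumed in (0, sqrt(2)/2)
    """
    eps = e_sampled - e_k                             # triggering error (20)
    threshold = (1.0 - 2.0 * L**2) / (2.0 * L**2) * np.dot(e_k, e_k)
    return np.dot(eps, eps) > threshold               # violation => update

# The controller is held between events and recomputed only when
# should_trigger(...) returns True.
print(should_trigger(np.array([0.2, -0.1]), np.array([0.35, -0.05])))
```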

5. Neural Network Implementation for the Event-Triggered Approach Using the IrQL Method

This section discusses the three-network structure, referred to as the RCA-NNs, which consists of three neural networks: the reinforce, critic, and actor networks.

5.1. Reinforce Neural Network (RNN) Learning Model

The reinforce NN is employed to approximate the IRR signal as follows:
$$ \hat{R}_i(Z_{ri}(k)) = \varphi_{ri}\!\left( \omega_{r2i}^T(k) \cdot \varphi_{ri}\!\left( \omega_{r1i}^T(k) \cdot Z_{ri}(k) \right) \right), \tag{27} $$
where $Z_{ri}(k)$ represents the input vector, which consists of $e_i(k)$, $u_i(k)$, and $u_{N_i}(k)$; $\omega_{r1i}$ represents the input-to-hidden-layer weight matrix; $\omega_{r2i}$ represents the hidden-to-output-layer weight matrix; and $\varphi_{ri}(\cdot)$ represents the activation function [24].
The associated error function of the reinforce NN is as follows:
$$ e_{ri}(k) = j_i(e_i(k-1), u_i(k-1), u_{N_i}(k-1)) + \varrho \hat{R}_i(Z_{ri}(k)) - \hat{R}_i(Z_{ri}(k-1)). \tag{28} $$
The loss function is written as
$$ E_{ri}(k) = \frac{1}{2} e_{ri}^2(k). \tag{29} $$
For convenience, only the matrix $\omega_{r2i}$ is updated, and the matrix $\omega_{r1i}$ remains unchanged during the training process.
The RNN's update law is expressed as
$$ \omega_{r2i}(k+1) = \omega_{r2i}(k) - \alpha_{ri} \cdot \frac{\partial E_{ri}(k)}{\partial \omega_{r2i}(k)}, \tag{30} $$
where $\alpha_{ri}$ represents the learning rate of the RNN.
Applying the gradient descent rule (GDR) to the reinforce NN's weights yields the following expanded update law:
$$ \omega_{r2i}(k+1) = \omega_{r2i}(k) - \alpha_{ri} \cdot \frac{\partial E_{ri}(k)}{\partial e_{ri}(k)} \cdot \frac{\partial e_{ri}(k)}{\partial \hat{R}_i(Z_{ri}(k))} \cdot \frac{\partial \hat{R}_i(Z_{ri}(k))}{\partial \omega_{r2i}(k)} = \omega_{r2i}(k) - \alpha_{ri} \varrho e_{ri}(k) \left( 1 - \varphi_{ri}^2\!\left( \omega_{r2i}^T(k) \cdot \Delta_{ri}(k) \right) \right) \Delta_{ri}(k), \tag{31} $$
where $\Delta_{ri}(k) = \varphi_{ri}\!\left( \omega_{r1i}^T(k) \cdot Z_{ri}(k) \right)$.
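As a sketch of how the update law (31) could be realized, the snippet below performs one gradient step on the output-layer weights of the reinforce NN, assuming tanh activations (so that the derivative contributes the (1 − φ²) factor); the layer sizes and data are hypothetical.

```python
import numpy as np

def reinforce_nn_update(w_r1, w_r2, z_prev, z_curr, j_prev, rho, alpha_r):
    """One gradient-descent step for the reinforce NN output weights,
    following Equations (27), (28), and (31) with tanh activations."""
    def r_hat(z):                                  # Equation (27), scalar output
        return np.tanh(w_r2 @ np.tanh(w_r1.T @ z))

    e_r = j_prev + rho * r_hat(z_curr) - r_hat(z_prev)   # error signal (28)
    delta_r = np.tanh(w_r1.T @ z_curr)                   # hidden-layer output
    grad = rho * e_r * (1.0 - np.tanh(w_r2 @ delta_r) ** 2) * delta_r
    return w_r2 - alpha_r * grad                   # only output weights move

# Hypothetical sizes: 4-dimensional input and 6 hidden neurons.
rng = np.random.default_rng(0)
w_r1 = rng.uniform(0.0, 1.0, (4, 6))
w_r2 = rng.uniform(0.0, 1.0, 6)
w_r2 = reinforce_nn_update(w_r1, w_r2, rng.standard_normal(4),
                           rng.standard_normal(4), j_prev=0.8,
                           rho=0.57, alpha_r=0.95)
```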

5.2. Critic Neural Network (CNN) Learning Model

The critic NN is designed to closely approximate the Q-function:
$$ \hat{Q}_i(Z_{ci}(k)) = \omega_{c2i}^T(k) \cdot \varphi_{ci}\!\left( \omega_{c1i}^T(k) \cdot Z_{ci}(k) \right), \tag{32} $$
where $Z_{ci}(k)$ represents the input vector, which consists of $\hat{R}_i(k)$, $e_i(k)$, $u_i(k)$, and $u_{N_i}(k)$, while $\omega_{c1i}(k)$ and $\omega_{c2i}(k)$ represent the input-layer and output-layer weight matrices, respectively.
The error function of the CNN is expressed as
$$ e_{ci}(k) = \hat{R}_i(Z_{ri}(k-1)) + \beta \hat{Q}_i(Z_{ci}(k)) - \hat{Q}_i(Z_{ci}(k-1)). \tag{33} $$
Its loss function is written as
$$ E_{ci}(k) = \frac{1}{2} e_{ci}^2(k). \tag{34} $$
As with the RNN, only $\omega_{c2i}$ is updated, and $\omega_{c1i}$ remains unchanged.
With the help of the gradient descent rule (GDR), the weight update law is expressed as
$$ \omega_{c2i}(k+1) = \omega_{c2i}(k) - \alpha_{ci} \frac{\partial E_{ci}(k)}{\partial \omega_{c2i}(k)}, \tag{35} $$
where $\alpha_{ci}$ represents the learning rate of the critic NN. Furthermore, the weight update scheme of the critic NN is obtained as
$$ \omega_{c2i}(k+1) = \omega_{c2i}(k) - \alpha_{ci} \frac{\partial E_{ci}(k)}{\partial e_{ci}(k)} \cdot \frac{\partial e_{ci}(k)}{\partial \hat{Q}_i(Z_{ci}(k))} \cdot \frac{\partial \hat{Q}_i(Z_{ci}(k))}{\partial \omega_{c2i}(k)} = \omega_{c2i}(k) - \alpha_{ci} \beta \left[ \hat{R}_i(Z_{ri}(k-1)) + \beta \omega_{c2i}^T(k) \cdot \Delta_{ci}(k) - \omega_{c2i}^T(k-1) \cdot \Delta_{ci}(k-1) \right] \Delta_{ci}(k), \tag{36} $$
where $\Delta_{ci}(k) = \varphi_{ci}\!\left( \omega_{c1i}^T(k) Z_{ci}(k) \right)$.

5.3. Actor Neural Network (ANN) Learning Model

Based on the actor NN, the approximate optimal control scheme is defined as follows:
$$ \hat{u}_i(k) = \omega_{a2i}^T \cdot \varphi_{ai}\!\left( \omega_{a1i}^T \cdot Z_{ai}(k) \right), \tag{37} $$
where $Z_{ai}(k) = e_i(k)$ is the input of the ANN, $\omega_{a1i}$ represents the input-layer weight matrix, and $\omega_{a2i}$ represents the output-layer weight matrix.
The prediction error of the actor NN is given by
$$ e_{ai}(k) = \hat{Q}_i(Z_{ci}(k)) - U_c. \tag{38} $$
The loss function of the ANN is expressed as
$$ E_{ai}(k) = \frac{1}{2} e_{ai}^2(k). \tag{39} $$
As with the RNN and CNN, $\omega_{a1i}$ remains unchanged throughout the learning process. The actor NN update law is defined as follows:
$$ \omega_{a2i}(k+1) = \omega_{a2i}(k) - \alpha_{ai} \frac{\partial E_{ai}(k)}{\partial \omega_{a2i}(k)}, \tag{40} $$
where $\alpha_{ai}$ represents the learning rate of the ANN. The weight-tuning scheme of the ANN can be expanded as follows:
$$ \omega_{a2i}(k+1) = \omega_{a2i}(k) - \alpha_{ai} \cdot \frac{\partial E_{ai}(k)}{\partial e_{ai}(k)} \cdot \frac{\partial e_{ai}(k)}{\partial \hat{Q}_i(Z_{ci}(k))} \times \frac{\partial \hat{Q}_i(Z_{ci}(k))}{\partial \hat{u}_i(k)} \cdot \frac{\partial \hat{u}_i(k)}{\partial \omega_{a2i}(k)} = \omega_{a2i}(k) - \alpha_{ai} \Delta_{ai}(k) \, \omega_{a2i}^T(k) \, \nabla_{ci}(k) \, \omega_{c1i}^T(k) \, \nabla_{\hat{u}_i} Z_{ci}(k) \, \omega_{a2i}^T \Delta_{ci}(k), \tag{41} $$
where $\Delta_{ai}(k) = \varphi_{ai}\!\left( \omega_{a1i}^T(k) Z_{ai}(k) \right)$, $\nabla_{ci}(k) = \partial \varphi_{ci}\!\left( \omega_{c1i}^T(k) Z_{ci}(k) \right) / \partial \left( \omega_{c1i}^T(k) Z_{ci}(k) \right)$, and $\nabla_{\hat{u}_i} Z_{ci}(k) = \partial Z_{ci}(k) / \partial \hat{u}_i(k)$.
Furthermore, the event-triggered weight update of the ANN is obtained as
$$ \omega_{a2i}(k+1) = \begin{cases} \omega_{a2i}(k) - \alpha_{ai} \Delta_{ai}(k) \, \omega_{a2i}^T(k) \, \nabla_{ci}(k) \, \omega_{c1i}^T(k) \, \nabla_{\hat{u}_i} Z_{ci}(k) \, \omega_{a2i}^T \Delta_{ci}(k), & k = k_{t_s}^i, \\ \omega_{a2i}(k), & k \in (k_{t_s}^i, k_{t_{s+1}}^i). \end{cases} \tag{42} $$
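The event-triggered hold-or-update logic of Equation (42) can be sketched as follows; grad_fn is a placeholder for the gradient term from (41), and the learning rate value is taken from the simulation section only for illustration.

```python
import numpy as np

def actor_weight_step(w_a2, grad_fn, triggered, alpha_a=0.90):
    """Event-triggered actor update in the spirit of Equation (42): the
    output-layer weights move only at triggering instants and are held
    constant otherwise. grad_fn stands in for the gradient of (41)."""
    if triggered:
        return w_a2 - alpha_a * grad_fn(w_a2)
    return w_a2                                    # hold between events

# Illustrative loop: the weights change only on the flagged steps.
w = np.ones(3)
for k, trig in enumerate([True, False, False, True]):
    w = actor_weight_step(w, lambda wa: 0.01 * wa, trig)
    print(k, w)
```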
Algorithm 1 describes in detail how the controller is designed using the RCA-NNs and event triggering. The actor NN is updated only when the triggering condition is met.
A Lyapunov-based stability and convergence analysis is presented in the following.
Assumption 4.
The weights are bounded as $\| \omega_{r2i}(k) \| \le \omega_{rim}$, $\| \omega_{c2i}(k) \| \le \omega_{cim}$, and $\| \omega_{a2i}(k) \| \le \omega_{aim}$. The activation functions are bounded, i.e., $\| \Delta_{ri}(k) \| \le \Delta_{rim}$, $\| \Delta_{ci}(k) \| \le \Delta_{cim}$, and $\| \Delta_{ai}(k) \| \le \Delta_{aim}$. Moreover, the activation function $\varphi_{ai}(\cdot)$ is Lipschitz and satisfies $\| \varphi_{ai}(e_i(k_{t_s}^i)) - \varphi_{ai}(e_i(k)) \| \le \theta_{ai} \| e_i(k_{t_s}^i) - e_i(k) \| = \theta_{ai} \| \epsilon_i^s(k) \| \le \theta_{ai} \pi_i^T$, where $\theta_{ai}$ and $\pi_i^T$ are positive constants. The approximation errors of the NN outputs are defined as $\delta_{ci}(k) = \tilde{\omega}_{c2i}(k) \Delta_{ci}(k)$, $\delta_{ai}(k) = \tilde{\omega}_{a2i}(k) \Delta_{ai}(k)$, and $\vartheta_{ri}(k) = \tilde{\omega}_{r2i}(k) \Delta_{ri}(k)$.
Theorem 1.
Assume that Assumptions 1 and 2 hold and that the CNN and ANN weights are updated by (36) and (42). If the triggering condition (26) is satisfied, then the local neighborhood error $e_i(k)$, the critic estimation error, and the actor estimation error are uniformly ultimately bounded. Furthermore, the control scheme $u_i$ converges to the optimal value $u_i^{*}$.
Proof. Let $\tilde{\omega}_{r2i}(k) = \omega_{r2i}(k) - \omega_{r2i}^{*}$ denote the weight estimation error between the optimal RNN weights $\omega_{r2i}^{*}$ and their estimate $\omega_{r2i}(k)$; let $\tilde{\omega}_{c2i}(k) = \omega_{c2i}(k) - \omega_{c2i}^{*}$ denote the weight estimation error between the ideal CNN weights $\omega_{c2i}^{*}$ and their estimate $\omega_{c2i}(k)$; and let $\tilde{\omega}_{a2i}(k) = \omega_{a2i}(k) - \omega_{a2i}^{*}$ denote the weight estimation error between the ideal ANN weights $\omega_{a2i}^{*}$ and their estimate $\omega_{a2i}(k)$.
Algorithm 1 RCA neural networks based on the IrQL method with event triggering.
Initialization:
1: Set initial values for $\omega_{r2i}(0)$, $\omega_{a2i}(0)$, and $\omega_{c2i}(0)$ within $(0, 1)$;
2: Set a small computational precision threshold $E$;
3: Initialize the states $x_i(0)$ and $x_0(0)$ within $(0, 1)$.
Iterative process: Set $k = 0$ and compute the local neighborhood error $e_i(k)$;
4: Repeat;
5: Estimate $\hat{u}_i(k)$ with the actor NN via (37);
6: Update the reinforce NN:
7: Input $[e_i(k), u_i(k), u_{N_i}(k)]$ into the reinforce NN to obtain the estimated IRR function $\hat{R}_i(Z_{ri}(k))$ via (27);
8: Obtain $e_{ri}(k)$ by (28);
9: Update the weight matrix $\omega_{r2i}(k)$ by (31);
10: Update the critic NN:
11: Input $[\hat{R}_i(Z_{ri}(k)), e_i(k), u_i(k), u_{N_i}(k)]$ into the critic NN to obtain the estimated Q-function via (32);
12: Obtain $e_{ci}(k)$ by (33);
13: Update the weight matrix $\omega_{c2i}(k)$ by (36);
14: Update the actor NN:
15: Input $e_i(k)$ into the actor NN to obtain the estimated control $\hat{u}_i(k)$ via (37);
16: Calculate $e_{ai}(k)$ via (38);
17: If the triggering condition is met, update the weight matrix $\omega_{a2i}(k)$ of the actor NN using (41);
18: Otherwise, do not update the weight matrix $\omega_{a2i}(k)$;
19: Until $\| \omega_{c2i}(k+1) - \omega_{c2i}(k) \| \le E$; otherwise, set $k = k + 1$ and go to step 5;
20: Output $\omega_{r2i}(k)$, $\omega_{c2i}(k)$, and $\omega_{a2i}(k)$ as the optimal weights.
(1) At the triggering instants, consider the following Lyapunov function candidate:
$$ L(k) = L_1(k) + L_2(k) + L_3(k) + L_4(k) + L_5(k), $$
where
$$ L_1(k) = \frac{1}{\alpha_{ri}} \mathrm{tr}\!\left( \tilde{\omega}_{r2i}^T(k) \tilde{\omega}_{r2i}(k) \right), \quad L_2(k) = \frac{1}{\alpha_{ci}} \mathrm{tr}\!\left( \tilde{\omega}_{c2i}^T(k) \tilde{\omega}_{c2i}(k) \right), \quad L_3(k) = \frac{1}{\alpha_{ai}} \mathrm{tr}\!\left( \tilde{\omega}_{a2i}^T(k) \tilde{\omega}_{a2i}(k) \right), \quad L_4(k) = \varrho^k \hat{R}_i(k), \quad L_5(k) = \beta^k \hat{Q}_i(k). $$
The first-order difference $\Delta L_1(k)$ is written as
$$ \Delta L_1(k) = \frac{1}{\alpha_{ri}} \mathrm{tr}\!\left( \tilde{\omega}_{r2i}^T(k+1) \tilde{\omega}_{r2i}(k+1) - \tilde{\omega}_{r2i}^T(k) \tilde{\omega}_{r2i}(k) \right), $$
where
$$ \tilde{\omega}_{r2i}(k+1) = \omega_{r2i}(k+1) - \omega_{r2i}^{*} = \tilde{\omega}_{r2i}(k) - \alpha_{ri} \varrho \left[ j(k-1) + \varrho \hat{R}(k) - \hat{R}(k-1) \right] \delta_{ri}(k) \Delta_{ri}(k). $$
Furthermore, we have
$$ \begin{aligned} \Delta L_1(k) ={} & -2 \varrho^2 \delta_{ri}(k) \left[ \varrho^{-1} j(k) + \hat{R}(k) - \hat{R}(k-1) \right] + \alpha_{ri} \varrho^4 \left[ \varrho^{-1} j(k) + \hat{R}(k) - \hat{R}(k-1) \right]^2 \vartheta_{ri}^2(k) \\ ={} & \left\| \delta_{ri}(k) - \varrho^2 \left[ \varrho^{-1} j(k) + \hat{R}(k) - \hat{R}(k-1) \right] \right\|^2 - \left( 1 - \alpha_{ri}(k) \Delta_{ri}^2(k) \right) \varrho^4 \left[ \varrho^{-1} j(k) + \hat{R}(k) - \hat{R}(k-1) \right]^2 \vartheta_{ri}^2(k) - \delta_{ri}^2(k). \end{aligned} $$
$\Delta L_2(k)$ can be written as
$$ \Delta L_2(k) = \frac{1}{\alpha_{ci}} \mathrm{tr}\!\left( \tilde{\omega}_{c2i}^T(k+1) \tilde{\omega}_{c2i}(k+1) - \tilde{\omega}_{c2i}^T(k) \tilde{\omega}_{c2i}(k) \right), $$
where
$$ \tilde{\omega}_{c2i}(k+1) = \omega_{c2i}(k+1) - \omega_{c2i}^{*} = \tilde{\omega}_{c2i}(k) - \alpha_{ci} \beta \Delta_{ci}(k) \left[ \hat{R}_i(k-1) + \beta \left( \tilde{\omega}_{c2i}(k) + \omega_{c2i}^{*} \right) \Delta_{ci}(k) - \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right]. $$
Furthermore, we have
$$ \Delta L_2(k) = \frac{1}{\alpha_{ci}} \left( D_1 + D_2 + D_3 - \tilde{\omega}_{c2i}^T(k) \tilde{\omega}_{c2i}(k) \right), $$
where
$$ \begin{aligned} D_1 ={} & \tilde{\omega}_{c2i}^T(k) \left( I - \alpha_{ci} \beta^2 \Delta_{ci}(k) \Delta_{ci}^T(k) \right)^2 \tilde{\omega}_{c2i}(k) = \left\| \tilde{\omega}_{c2i}(k) \right\|^2 - 2 \alpha_{ci} \beta^2 \left\| \delta_{ci}(k) \right\|^2 + \alpha_{ci}^2 \beta^4 \left\| \Delta_{ci}(k) \right\|^2 \left\| \delta_{ci}(k) \right\|^2, \\ D_2 ={} & -2 \alpha_{ci} \beta^2 \delta_{ci}(k) \left[ \beta^{-1} \hat{R}_i(k-1) + (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \beta^{-1} \omega_{c2i}(k-1) \Delta_{ci}(k-1) \right] \left\| \Delta_{ci}(k) \right\|^2, \\ D_3 ={} & \alpha_{ci}^2 \beta^4 \left[ \beta^{-1} \hat{R}_i(k-1) + (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \beta^{-1} \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right]^T \left[ \beta^{-1} \hat{R}_i(k-1) + (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \beta^{-1} \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right]. \end{aligned} $$
The following result is obtained by computation:
$$ \Delta L_2(k) = -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} \hat{R}_i(k-1) + (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \beta^{-1} \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right\|^2 + \left\| \hat{R}_i(k-1) + \beta (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right\|^2. $$
For the first-order difference of $L_3(k)$, we can obtain
$$ \Delta L_3(k) = \frac{1}{\alpha_{ai}} \left( \tilde{\omega}_{a2i}^T(k+1) \tilde{\omega}_{a2i}(k+1) - \tilde{\omega}_{a2i}^T(k) \tilde{\omega}_{a2i}(k) \right), $$
where
$$ \tilde{\omega}_{a2i}(k+1) = \omega_{a2i}(k+1) - \omega_{a2i}^{*} = \tilde{\omega}_{a2i}(k) - \alpha_{ai} \Delta_{ai}(k) \, \omega_{c2i}^T(k) C(k) \left[ \omega_{c2i}^T(k) \Delta_{ci}(k) \right]. $$
Therefore, we have
$$ \Delta L_3(k) = \frac{1}{\alpha_{ai}} \left( E_1 - \tilde{\omega}_{a2i}^T(k) \tilde{\omega}_{a2i}(k) \right), $$
where
$$ E_1 = \left\| \tilde{\omega}_{a2i}(k) \right\|^2 - 2 \alpha_{ai} \omega_{c2i}^T(k) C(k) \delta_{ai}(k) \left[ \omega_{c2i}^T(k) \Delta_{ci}(k) \right] + \alpha_{ai} \left\| \omega_{c2i}^T(k) \Delta_{ci}(k) \right\|^2 \left\| \Delta_{ai}(k) \right\|^2 \left\| \omega_{c2i}^T(k) C(k) \right\|^2. $$
The simplified form of $\Delta L_3(k)$ is given below:
$$ \Delta L_3(k) = -\left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| \omega_{c2i}^T(k) \Delta_{ci}(k) \right\|^2 \left\| \omega_{c2i}^T(k) C(k) \right\|^2 - \left\| \delta_{ai}(k) \right\|^2 + \left\| \omega_{c2i}^T(k) C(k) \Delta_{ai}(k) - \omega_{c2i}^T(k) \delta_{ai}(k) \right\|^2. $$
By adding the expressions for $\Delta L_1(k)$, $\Delta L_2(k)$, and $\Delta L_3(k)$ obtained above, together with $\Delta L_4(k)$ and $\Delta L_5(k)$, we can obtain $\Delta L(k)$ as follows:
$$ \begin{aligned} \Delta L(k) ={} & \Delta L_1(k) + \Delta L_2(k) + \Delta L_3(k) + \Delta L_4(k) + \Delta L_5(k) \\ ={} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} \hat{R}_i(k-1) + (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \beta^{-1} \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right\|^2 \\ & - \left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| \omega_{c2i}^T(k) \Delta_{ci}(k) \right\|^2 \left\| \omega_{c2i}^T(k) C(k) \right\|^2 + \left\| \hat{R}_i(k-1) + \beta (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \omega_{c2i}^T(k-1) \Delta_{ci}(k-1) \right\|^2 \\ & + \left\| \omega_{c2i}^T(k) C(k) \Delta_{ci}^T(k) - \omega_{c2i}^T(k) \delta_{ai}(k) \right\|^2 - \left( 1 - \alpha_{ri} \Delta_{ri}^2(k) \right) \varrho^4 \left\| \varrho^{-1} j(k) + \hat{R}_i(k) - \varrho^{-1} \hat{R}_i(k-1) \right\|^2 \vartheta^2(k) \\ & + \left\| \delta_{ri}(k) - \varrho^2 \left[ \varrho^{-1} j(k) + \hat{R}_i(k) - \varrho^{-1} \hat{R}_i(k-1) \right] \right\|^2 - \left\| \delta_{ai}(k) \right\|^2 - \left\| \delta_{ri}(k) \right\|^2 \\ & + \beta^{k+1} Q_i(k+1) - \beta^k Q_i(k) + \varrho^{k+1} R_i(k+1) - \varrho^k R_i(k). \end{aligned} $$
Therefore, we can obtain
$$ \begin{aligned} \Delta L(k) ={} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 - \left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| X_1(k) \right\|^2 \left\| W_1(k) \right\|^2 \\ & + \left\| V_1(k) \right\|^2 + \left\| W_1(k) X_1^T(k) - \delta_{ai}(k) \right\|^2 - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^4 \left\| \varrho^{-1} Y_1(k) \right\|^2 \vartheta_{ri}^2(k) \\ & - \left\| \delta_{ai}(k) \right\|^2 - \left\| \delta_{ri}(k) \right\|^2 + \beta^{k+1} Q_i(k+1) - \beta^k Q_i(k) + \varrho^{k+1} R_i(k+1) - \varrho^k R_i(k), \end{aligned} $$
where $V_1(k) = \hat{R}_i(k-1) + \beta (\omega_{c2i}^{*})^T \Delta_{ci}(k) - \omega_{c2i}^T(k-1) \Delta_{ci}(k-1)$, $W_1(k) = \omega_{c2i}^T(k) C(k)$, $X_1(k) = \omega_{c2i}^T(k) \Delta_{ci}(k)$, and $Y_1(k) = j(k) + \varrho \hat{R}_i(k) - \hat{R}_i(k-1)$, with $\left\| V_1(k) \right\| \le V_{1m}$, $\left\| W_1(k) \right\| \le W_{1m}$, $\left\| X_1(k) \right\| \le X_{1m}$, and $\left\| Y_1(k) \right\| \le Y_{1m}$. Next, we can obtain
$$ \begin{aligned} \Delta L(k) \le{} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 - \left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| X_1(k) \right\|^2 \left\| W_1(k) \right\|^2 \\ & + \left\| V_1(k) \right\|^2 + 2 \left\| W_1(k) X_1^T(k) \right\|^2 + \left\| \delta_{ai}(k) \right\|^2 - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) \\ & + 2 \left\| \delta_{ri}(k) \right\|^2 + 2 \left\| Y_1(k) \right\|^2 - \beta^k Q_i(k) - \varrho^k R_i(k). \end{aligned} $$
Moreover, we can obtain
$$ \begin{aligned} \Delta L(k) \le{} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 - \left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| X_1(k) \right\|^2 \left\| W_1(k) \right\|^2 \\ & + V_{1m}^2 + 2 W_{1m}^2 X_{1m}^2 + 2 \left\| (\omega_{a2i}^{*})^T \Delta_{ai}(k) \right\|^2 + 2 \left\| \omega_{a2i}^T \Delta_{ai}(k) \right\|^2 - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) \\ & + 2 \left\| \delta_{ri}(k) \right\|^2 + 2 \left\| Y_1(k) \right\|^2 - \beta^k Q_i(k) - \varrho^k R_i(k) \\ \le{} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 - \left( 1 - \alpha_{ai} \left\| \Delta_{ai}(k) \right\|^2 \right) \left\| X_1(k) \right\|^2 \left\| W_1(k) \right\|^2 \\ & + V_{1m}^2 + 2 W_{1m}^2 X_{1m}^2 + 4 \omega_{aim}^2 \Delta_{aim}^2 - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) + 2 \delta_{rim}^2 + 2 Y_{1m}^2 - \beta^k Q_i(k) - \varrho^k R_i(k). \end{aligned} $$
If the following conditions are satisfied:
$$ \alpha_{ri} \le \frac{1}{\left\| \Delta_{ri}(k) \right\|^2}, \quad \alpha_{ci} \le \frac{1}{\beta^2 \left\| \Delta_{ci}(k) \right\|^2}, \quad \alpha_{ai} \le \frac{1}{\left\| \Delta_{ai}(k) \right\|^2}, \quad \left\| \delta_{ci}(k) \right\| > \sqrt{ \left( V_{1m}^2 + 2 W_{1m}^2 X_{1m}^2 + 4 \omega_{aim}^2 \Delta_{aim}^2 + 2 \delta_{rim}^2 + 2 Y_{1m}^2 \right) / \beta^2 }, $$
then we can derive $\Delta L(k) \le 0$, which completes the proof for the triggering instants.
(2) When the triggering condition is not satisfied, consider the following:
$$ L(k) = L_1(k) + L_2(k) + L_4(k), $$
where
$$ L_1(k) = \frac{1}{\alpha_{ri}} \mathrm{tr}\!\left( \tilde{\omega}_{r2i}^T(k) \tilde{\omega}_{r2i}(k) \right), \quad L_2(k) = \frac{1}{\alpha_{ci}} \mathrm{tr}\!\left( \tilde{\omega}_{c2i}^T(k) \tilde{\omega}_{c2i}(k) \right), \quad L_4(k) = e_i^T(k) e_i(k). $$
Then
$$ \begin{aligned} \Delta L(k) ={} & \Delta L_1(k) + \Delta L_2(k) + \Delta L_4(k) \\ ={} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 + \left\| V_1(k) \right\|^2 \\ & - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) + 2 \delta_{rim}^2 + 2 Y_{1m}^2 + e_i^T(k+1) e_i(k+1) - e_i^T(k) e_i(k), \end{aligned} $$
and hence
$$ \begin{aligned} \Delta L(k) \le{} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 + \left\| V_1(k) \right\|^2 \\ & - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) + 2 \delta_{rim}^2 + 2 Y_{1m}^2 + \left( \iota \left\| e_i(k) \right\| + \iota \left\| \epsilon_i^s \right\| \right)^2 - \left\| e_i(k) \right\|^2 \\ \le{} & -\beta^2 \left\| \delta_{ci}(k) \right\|^2 - \beta^2 \left( 1 - \alpha_{ci} \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right) \left\| \delta_{ci}(k) + \beta^{-1} V_1(k) \right\|^2 + V_{1m}^2 \\ & - \left( 1 - \alpha_{ri} \left\| \Delta_{ri}(k) \right\|^2 \right) \varrho^2 \left\| Y_1(k) \right\|^2 \vartheta_{ri}^2(k) + 2 \delta_{rim}^2 + 2 Y_{1m}^2 - \left( 1 - 2 \iota^2 \right) \left\| e_i(k) \right\|^2 + 2 \iota^2 \left\| \epsilon_i^s \right\|^2. \end{aligned} $$
If it is satisfied that $\alpha_{ri} \le 1 / \left\| \Delta_{ri}(k) \right\|^2$, $\alpha_{ci} \le 1 / \left( \beta^2 \left\| \Delta_{ci}(k) \right\|^2 \right)$, $\alpha_{ai} \le 1 / \left\| \Delta_{ai}(k) \right\|^2$, and $\left\| \delta_{ci}(k) \right\| > \sqrt{ \left( V_{1m}^2 + 2 \delta_{rim}^2 + 2 Y_{1m}^2 \right) / \beta^2 }$, then $\Delta L(k) \le 0$. Thus, $\Delta L(k) \le 0$ holds in both cases, and the proof is completed. □

6. Simulation Example

To demonstrate the viability of the proposed method, a simulation is presented in this section.

Nonlinear MAS Consisting of One Leader and Six Followers

A nonlinear MAS consisting of one leader and six followers was considered. Figure 1 depicts the communication graph of the studied MAS, where node 0 is the leader and nodes 1, 2, 3, 4, 5, and 6 are the followers. The corresponding adjacency matrix satisfies $a_{14} = a_{21} = a_{32} = a_{43} = a_{52} = a_{65} = 1$. The weights between the leader and the followers are $b_1 = 1$ and $b_2 = b_3 = b_4 = b_5 = b_6 = 0$, so only agent 1 receives the leader's information directly. The system model parameters for the MAS with one leader and six followers are as follows: $A = \begin{bmatrix} 0.995 & 0.09980 \\ -0.09982 & 0.995 \end{bmatrix}$, $B_1 = [0, 0.2]^T$, $B_2 = [0, 0.5]^T$, $B_3 = [0, 0.4]^T$, $B_4 = [0, 0.3]^T$, $B_5 = [0, 0.6]^T$, and $B_6 = [0, 0.7]^T$.
The weight matrices are as follows: $Q_{11} = Q_{22} = Q_{33} = Q_{44} = Q_{55} = Q_{66} = 1$, $R_{11} = R_{22} = R_{33} = R_{44} = R_{55} = R_{66} = I_{2 \times 2}$, and $Q_{14} = Q_{21} = Q_{32} = Q_{43} = Q_{52} = Q_{65} = I_{2 \times 2}$. The learning rates are $\alpha_{ri} = 0.95$, $\alpha_{ai} = 0.90$, and $\alpha_{ci} = 0.07$ $(i = 1, 2, \ldots, 6)$, with discount factors $\varrho = 0.57$ and $\beta = 0.9$.
For the agents, the input vectors of the RNNs and ANNs are as follows: $Z_{r1}(k) = [e_1^T(k), u_1^T(k_{t_s}^1), u_4^T(k_{t_s}^4)]^T$, $Z_{a1}(k) = e_1(k_{t_s}^1)$, $Z_{r2}(k) = [e_2^T(k), u_2^T(k_{t_s}^2), u_1^T(k_{t_s}^1)]^T$, $Z_{a2}(k) = e_2(k_{t_s}^2)$, $Z_{r3}(k) = [e_3^T(k), u_3^T(k_{t_s}^3), u_2^T(k_{t_s}^2)]^T$, $Z_{a3}(k) = e_3(k_{t_s}^3)$, $Z_{r4}(k) = [e_4^T(k), u_4^T(k_{t_s}^4), u_3^T(k_{t_s}^3)]^T$, $Z_{a4}(k) = e_4(k_{t_s}^4)$, $Z_{r5}(k) = [e_5^T(k), u_5^T(k_{t_s}^5), u_2^T(k_{t_s}^2)]^T$, $Z_{a5}(k) = e_5(k_{t_s}^5)$, $Z_{r6}(k) = [e_6^T(k), u_6^T(k_{t_s}^6), u_5^T(k_{t_s}^5)]^T$, and $Z_{a6}(k) = e_6(k_{t_s}^6)$. The initial values of the leader and followers are $x_0(0) = [0.6675, 0.7940]^T$, $x_1(0) = [0.5734, 0.6000]^T$, $x_2(0) = [0.5667, 0.7348]^T$, $x_3(0) = [0.8694, 0.7140]^T$, $x_4(0) = [1.0212, 1.3842]^T$, $x_5(0) = [0.8606, 1.5565]^T$, and $x_6(0) = [0.5274, 1.3235]^T$.
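To show how the dynamics (1) and (2) roll forward with the parameters listed above, the following sketch simulates the leader and followers with placeholder (zero) control inputs; the negative sign in A and the zero inputs are assumptions for illustration, since the actual inputs come from the event-triggered RCA-NNs.

```python
import numpy as np

# Leader/follower dynamics of Equations (1) and (2) with the Section 6
# parameters (the sign pattern of A is assumed).
A = np.array([[0.995, 0.09980],
              [-0.09982, 0.995]])
B = [np.array([0.0, 0.2]), np.array([0.0, 0.5]), np.array([0.0, 0.4]),
     np.array([0.0, 0.3]), np.array([0.0, 0.6]), np.array([0.0, 0.7])]

x0 = np.array([0.6675, 0.7940])                    # leader initial state
x = [np.array([0.5734, 0.6000]), np.array([0.5667, 0.7348]),
     np.array([0.8694, 0.7140]), np.array([1.0212, 1.3842]),
     np.array([0.8606, 1.5565]), np.array([0.5274, 1.3235])]

for k in range(100):
    u = [0.0] * 6        # placeholder inputs; the paper computes these with
                         # the event-triggered RCA-NN controller
    x = [A @ xi + Bi * ui for xi, Bi, ui in zip(x, B, u)]
    x0 = A @ x0          # leader state rolls forward
```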
As shown in Figure 2, all followers were able to accurately track the leader, and the whole MAS achieved synchronization. Figure 3 illustrates the cumulative number of triggering instants of the six agents. On average, the number of triggering instants for the six agents was approximately 220, whereas with the traditional time-triggered RL method the number was approximately 1000. As a result, the computational burden was reduced by 78.0% in comparison with the conventional time-triggered method. Figure 4 illustrates the triggering instants of each agent, which indicates that the actor network weights are updated only when the triggering condition is satisfied. As can be seen in Figure 5, the triggering error $\| \epsilon_i^s(k) \|^2$ evolves together with the triggering thresholds $\pi_i^T$, and the triggering error converges over time. Figure 6 and Figure 7 illustrate the local neighborhood errors under the proposed control method, which converge to zero at approximately k = 60. The local neighborhood errors of [32] are shown in Figure 8 and Figure 9. In comparison with Figure 8 and Figure 9, the proposed control method produced a better convergence effect. Figure 10 and Figure 11 show the estimation of the ANN weight parameters. With the proposed control method, the actor network weights stabilize faster than with the IrQL method of [32].

7. Conclusions

In this study, the event-triggered optimal tracking control problem of model-free MASs was examined using the RL-based IrQL method. A new IrQL method was introduced by adding an additional IRR function [32], so that the agent can obtain more information. Based on the defined IRR formula, we defined the Q-function and derived the corresponding HJB equation. An iterative IrQL approach was designed to calculate the optimal control strategy. Using the IrQL algorithm, an event-triggered controller was presented that updates the control scheme only at the triggering instants, reducing the burden on computing resources and the transmission network. An RCA-NN structure was used to implement the suggested approach, which eliminates the need for a model of the system. The convergence of the neural network weights was established using the Lyapunov method. A simulation was used to assess the performance and control efficiency of the suggested algorithm. Further research will examine the effect of the discount factors on system reliability.

Author Contributions

Software, Y.T., Y.L. and J.H.; Writing—review & editing, Z.W.; Supervision, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, G.; Yu, X.; Liu, Z.W.; Yu, W. Adaptive consensus-based robust strategy for economic dispatch of smart grids subject to communication uncertainties. IEEE Trans. Ind. Inform. 2018, 14, 2484–2496. [Google Scholar] [CrossRef]
  2. Li, P.; Hu, J.; Qiu, L.; Zhao, Y.; Ghosh, B.K. A distributed economic dispatch strategy for power-water networks. IEEE Trans. Control Netw. Syst. 2021, 9, 356–366. [Google Scholar] [CrossRef]
  3. Fax, J.A.; Murray, R.M. Information flow and cooperative control of vehicle formations. IEEE Trans. Autom. Control 2004, 49, 1465–1476. [Google Scholar] [CrossRef]
  4. Wen, S.; Yu, X.; Zeng, Z.; Wang, J. Event-triggering load frequency control for multiarea power systems with communication delays. IEEE Trans. Ind. Electron. 2016, 63, 1308–1317. [Google Scholar] [CrossRef]
  5. Wen, G.; Wang, P.; Huang, T.; Lü, J.; Zhang, F. Distributed consensus of layered multi-agent systems subject papers. IEEE Trans. Circuits Syst. 2020, 67, 3152–3162. [Google Scholar] [CrossRef]
  6. Wu, Z.G.; Xu, Y.; Pan, Y.J.; Su, H.; Tang, Y. Event-triggered control for consensus problem in multi-agent systems with quantized relative state measurements and external disturbance. IEEE Trans. Circuits Syst. 2018, 65, 2232–2242. [Google Scholar] [CrossRef]
  7. Liu, H.; Cheng, L.; Tan, M.; Hou, Z.G. Exponential finite-time consensus of fractional-order multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 1549–1558. [Google Scholar] [CrossRef]
  8. Shi, K.; Wang, J.; Zhong, S.; Zhang, X.; Liu, Y.; Cheng, J. New reliable nonuniform sampling control for uncertain chaotic neural networks under Markov switching topologies. Appl. Math. Comput. 2019, 347, 169–193. [Google Scholar] [CrossRef]
  9. He, W.; Chen, G.; Han, Q.L.; Du, W.; Cao, J.; Qian, F. Multi-agent systems on multilayer networks: Synchronization analysis and network design. IEEE Trans. Syst. 2017, 47, 1655–1667. [Google Scholar]
  10. Hu, J.; Wu, Y. Interventional bipartite consensus on coopetition networks with unknown dynamics. J. Frankl. Inst. 2017, 354, 4438–4456. [Google Scholar] [CrossRef]
  11. Hu, J.P.; Feng, G. Distributed tracking control of leader follower multi-agent systems under noisy measurement. Automatica 2010, 46, 1382–1387. [Google Scholar] [CrossRef] [Green Version]
  12. Wu, X.; Tang, Y.; Cao, J. Input-to-State Stability of Time-Varying Switched Systems with Time Delays. IEEE Trans. Autom. Control 2019, 64, 2537–2544. [Google Scholar] [CrossRef]
  13. Chen, D.; Liu, X.; Yu, W. Finite-time fuzzy adaptive consensus for heterogeneous nonlinear multi-agent systems. IEEE Trans. Netw. Sci. Eng. 2021, 7, 3057–3066. [Google Scholar] [CrossRef]
  14. Wang, J.L.; Wang, Q.; Wu, H.N.; Huang, T. Finite-time consensus and finite-time H consensus of multi-agent systems under directed topology. IEEE Trans. Netw. Sci. Eng. 2020, 7, 1619–1632. [Google Scholar] [CrossRef]
  15. Ren, Y.; Zhao, Z.; Zhang, C.; Yang, Q.; Hong, K.S. Adaptive neural-network boundary control for a flexible manipulator with input constraints and model uncertainties. IEEE Trans. Cybern. 2021, 51, 4796–4807. [Google Scholar] [CrossRef]
  16. Mu, C.; Zhao, Q.; Gao, Z.; Sun, C. Q-learning solution for optimal consensus control of discrete-time multiagent systems using reinforcement learning. J. Frankl. Inst. 2019, 356, 6946–6967. [Google Scholar] [CrossRef]
  17. Peng, Z.; Zhao, Y.; Hu, J.; Ghosh, B.K. Data-driven optimal tracking control of discrete-time multi-agent systems with two-stage policy iteration algorithm. Inf. Sci. 2019, 481, 189–202. [Google Scholar] [CrossRef]
  18. Zhang, H.; Jiang, H.; Luo, Y.; Xiao, G. Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans. Ind. Electron. 2017, 64, 4091–4100. [Google Scholar] [CrossRef]
  19. Abouheaf, M.I.; Lewis, F.L.; Vamvoudakis, K.G.; Haesaert, S.; Babuska, R. Multi-agent discrete-time graphical games and reinforcement learning solutions. Automatica 2014, 50, 3038–3053. [Google Scholar] [CrossRef]
  20. Peng, Z.; Zhao, Y.; Hu, J.; Luo, R.; Ghosh, B.K.; Nguang, S.K. Input–output data-based output antisynchronization control of multiagent systems using reinforcement learning approach. IEEE Trans. Ind. Inform. 2021, 17, 7359–7367. [Google Scholar] [CrossRef]
  21. Peng, Z.; Hu, J.; Ghosh, B.K. Data-driven containment control of discrete-time multi-agent systems via value iteration. Sci. China Inf. Sci. 2020, 63, 189205. [Google Scholar] [CrossRef]
  22. Wen, G.; Chen, C.P.; Feng, J.; Zhou, N. Optimized multi-agent formation control based on an identifier-actor-critic reinforcement learning algorithm. IEEE Trans. Fuzzy Syst. 2018, 26, 2719–2731. [Google Scholar] [CrossRef]
  23. Bai, W.; Li, T.; Long, Y.; Chen, C.P. Event-triggered multigradient recursive reinforcement learning tracking control for multiagent systems. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 366–379. [Google Scholar] [CrossRef] [PubMed]
  24. Peng, Z.; Luo, R.; Hu, J.; Shi, K.; Ghosh, B.K. Distributed optimal tracking control of discrete-time multiagent systems via event-triggered reinforcement learning. IEEE Trans. Circuits Syst. 2022, 69, 3689–3700. [Google Scholar] [CrossRef]
  25. Hu, J.; Chen, G.; Li, H.X. Distributed event-triggered tracking control of leader-follower multi-agent systems with communication delays. Kybernetika 2011, 47, 630–643. [Google Scholar]
  26. Eqtami, A.; Dimarogonas, D.V.; Kyriakopoulos, K.J. Event-triggered control for discrete-time systems. In Proceedings of the American Control Conference, Baltimore, MD, USA, 30 June–2 July 2010; pp. 4719–4724. [Google Scholar]
  27. Chen, X.; Hao, F. Event-triggered average consensus control for discrete-time multi-agent systems. IET Control Theory Appl. 2012, 6, 2493–2498. [Google Scholar] [CrossRef]
  28. Jiang, Y.; Fan, J.; Chai, T.; Li, J.; Lewis, F.L. Data-driven flotation industrial process operational optimal control based on reinforcement learning. IEEE Trans. Ind. Inform. 2018, 14, 1974–1989. [Google Scholar] [CrossRef]
  29. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  30. Alsheikh, M.A.; Lin, S.; Niyato, D.; Tan, H.P. Machine learning in wireless sensor networks: Algorithms, strategies, and applications. IEEE Commun. Surv. Tutor. 2014, 16, 1996–2018. [Google Scholar] [CrossRef]
  31. Vamvoudakis, K.G.; Modares, H.; Kiumarsi, B.; Lewis, F.L. Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online. IEEE Control Syst. 2017, 37, 33–52. [Google Scholar]
  32. Peng, Z.; Luo, R.; Hu, J. Optimal tracking control of nonlinear multiagent systems using internal reinforce Q-learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4043–4055. [Google Scholar] [CrossRef] [PubMed]
  33. Wang, D.; Liu, D.; Wei, Q.; Zhao, D.; Jin, N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 2012, 48, 1825–1832. [Google Scholar] [CrossRef]
  34. Peng, Z.; Hu, J.; Shi, K.; Luo, R.; Huang, R.; Ghosh, B.K.; Huang, J. A novel optimal bipartite consensus control scheme for unknown multi-agent systems via model-free reinforcement learning. Appl. Math. Comput. 2020, 369, 124821. [Google Scholar] [CrossRef]
  35. Zhang, H.; Yue, D.; Dou, C.; Zhao, W.; Xie, X. Data-driven distributed optimal consensus control for unknown multiagent systems with input-delay. IEEE Trans. Cybern. 2019, 49, 2095–2105. [Google Scholar] [CrossRef] [PubMed]
  36. Si, J.; Wang, Y.-T. Online learning control by association and reinforcement. IEEE Trans. Neural Netw. 2001, 12, 264–276. [Google Scholar] [CrossRef]
Figure 1. The topology structure for leader-follower MASs.
Figure 2. The tracks for the leader and followers.
Figure 3. The comparison of the trigger time number involving the suggested method as well as the conventional approach.
Figure 4. The triggering instant for each agent.
Figure 5. The triggering error trajectory $\| \epsilon_i^s(k) \|^2$ in addition to triggering thresholds $\pi_i^T$ $(i = 1, 2, 3, 4, 5, 6)$.
Figure 6. Local neighborhood errors $e_{i1}(k)$ with the proposed control method.
Figure 7. Local neighborhood errors $e_{i2}(k)$ with the proposed control method.
Figure 8. Local neighborhood errors $e_{i1}(k)$ of [32].
Figure 9. Local neighborhood errors $e_{i2}(k)$ of [32].
Figure 10. The estimation of weight parameters of the ANN of [32].
Figure 11. Estimation of the weight parameters of an ANN using the proposed control method.

Citation: Wang, Z.; Wang, X.; Tang, Y.; Liu, Y.; Hu, J. Optimal Tracking Control of a Nonlinear Multiagent System Using Q-Learning via Event-Triggered Reinforcement Learning. Entropy 2023, 25, 299. https://doi.org/10.3390/e25020299
