Article

Stability Analysis of Batch Offline Action-Dependent Heuristic Dynamic Programming Using Deep Neural Networks

Department of Automation and Applied Informatics, Politehnica University of Timisoara, 2, Bd. V. Parvan, 300223 Timisoara, Romania
Mathematics 2025, 13(2), 206; https://doi.org/10.3390/math13020206
Submission received: 2 December 2024 / Revised: 3 January 2025 / Accepted: 6 January 2025 / Published: 9 January 2025

Abstract

In this paper, the theoretical stability of batch offline action-dependent heuristic dynamic programming (BOADHDP) is analyzed for deep neural network (NN) approximators of both the action value function and the controller, which are iteratively improved using experiences collected from the environment. Our findings extend previous research on the stability of online adaptive ADHDP learning with single-hidden-layer NNs by addressing the case of deep neural networks with an arbitrary number of hidden layers, updated offline using batched gradient descent updates. Specifically, our work shows that the learning process of the action value function and controller under BOADHDP is uniformly ultimately bounded (UUB), contingent on certain conditions related to the NN learning rates. The developed theory demonstrates an inverse relationship between the number of hidden layers and the learning rate magnitude. We present a practical implementation involving a twin rotor aerodynamical system to emphasize the difference in impact between single-hidden-layer and multiple-hidden-layer NN architectures in BOADHDP learning settings. The validation case study shows that the BOADHDP implementation with a multiple-hidden-layer NN architecture obtains 0.0034 on the control benchmark, while the single-hidden-layer NN architecture obtains 0.0049, with the former outperforming the latter by 1.58% using the same collected dataset and learning conditions. BOADHDP is also compared with online adaptive ADHDP, proving the superiority of the former over the latter, both in terms of controller performance and data efficiency.

1. Introduction

Adaptive dynamic programming (ADP) has emerged as a powerful methodology for tuning control systems in modern applications, where complexity, nonlinearity, and uncertainty are commonplace. Originating from Werbos’ pioneering work [1], which built on the seminal work on dynamic programming by Bellman [2], ADP soon became a notable stream of research, with multiple ADP designs developed. Among these designs, two distinct classes of solutions have emerged: heuristic dynamic programming (HDP) and dual heuristic programming (DHP) [3]. In the HDP framework, reinforcement learning is employed to determine the cost-to-go from the current state. The HDP convergence for general nonlinear systems is presented in [4]. Conversely, in DHP, neural networks are used to learn the derivative of the cost function with respect to the states, known as the costate vector [5]. The DHP convergence for linear systems was established in [6]. For both of these classes of algorithms, an action-dependent (AD) variant exists [7]. ADP has also addressed discrete-time control problems [8,9,10,11,12,13] and continuous-time systems [14,15,16].
Apart from the theoretical contributions, ADP designs have been validated on a wide array of real applications. In [17], ADP is applied to a helicopter tracking and trimming control task. In [18], neural network controllers tuned with the ADHDP method are applied to engine torque and exhaust air–fuel ratio control for an automotive engine. A practical implementation in the context of an electric water heater is presented in [19], where the collected sensor data were used to learn the Q-function and the controller in a model-free manner.
Convergence and stability proofs of the iterative processes involved in ADP-like techniques have also been developed. In [6], the adaptive critic method is described, where two networks approximate the controller and the Lagrangian multipliers associated with the optimal control, respectively; the convergence of the interleaved successive updates of the two networks is analyzed. In [20], an online generalized ADP scheme is developed for a system with input constraints, and uniform ultimate boundedness (UUB) stability is proved using a Lyapunov approach. The convergence of value-iteration HDP is established for nonlinear discrete-time systems in [4]. In [21], the authors derive UUB stability for direct HDP algorithms, proving that the actor and critic weights remain bounded. The actor and critic were approximated by a multilayer perceptron (MLP) with three layers: input, hidden, and output. However, only the hidden-to-output-layer weights were updated, as in a linear basis function approach. To overcome the practical limitations imposed by linear basis-function-type approximators, such as scalability and overfitting, the authors of [22] extended the stability analysis of [21] to MLPs in which both the input-to-hidden-layer weights and the hidden-to-output-layer weights are updated.
Current research in the field of reinforcement learning (RL), which studies the class of stochastic systems and controllers, shows significant performance when using deep NNs for control applications, both for discretized systems [23] and for continuous control tasks [24]. The advantage of deep neural networks over single-hidden-layer networks lies in their increased approximation capacity, achieved through multiple hidden layers. These layers enable the composition of features at different abstraction levels, creating a robust hierarchical representation. This hierarchical structure allows deep networks to learn and model complex nonlinear relationships within data more effectively than shallow networks. Thus, using multilayer NNs in ADP applications can enhance learning convergence and overall controller performance. Moreover, batch learning methods, which update the NN weights using a set of collected past experiences simultaneously, are more data efficient than single-transition learning, where the weights are updated one transition at a time. Batch learning also breaks temporal correlations, helping NNs generalize better across a system’s state space. Typically, batch learning is combined with offline learning, where the weights are updated exclusively using a fixed dataset of transitions, without any adaptation during the controller’s runtime. Methods such as those in [19,24,25,26] demonstrate the benefits of batch learning through a technique known as experience replay. In contrast, ref. [27] highlights an approach where the entire dataset of collected transitions is used for learning in an offline manner.
This paper makes two key contributions. First, we provide a novel theoretical stability analysis of ADHDP when deep neural networks are used as function approximators for both the action value function and the controller and when batch learning is performed on the entire dataset of transitions collected from the system. This improves upon the stability analyses in [21,22], which were based on single-hidden-layer NN architectures updated online, one transition at a time, during system runtime. To this end, we prove that the batch offline ADHDP (BOADHDP) learning process is uniformly ultimately bounded (UUB) using the Lyapunov stability approach. We show that the stability of the learning process depends on conditions imposed on the NN learning rates and that these conditions also provide a relationship between the learning rate magnitudes and the number of hidden layers in the networks. Second, we carry out a validation study on a twin rotor aerodynamical system (TRAS) to emphasize the benefit of employing multiple hidden layers in the NN approximators in the BOADHDP learning process. We also compare BOADHDP with the online adaptive ADHDP algorithms from [21,22].
The rest of the paper is organized as follows. Section 2 describes the theoretical underpinnings of BOADHDP. Section 3 presents the multilayer neural network approximation of the action value function and controller. Section 4 provides the main theoretical results for the stability of BOADHDP. Section 5 illustrates the TRAS validation case study. Finally, the discussion and concluding remarks are presented in Section 6.

2. Problem Formulation

Consider the discrete-time nonlinear system described by the state equation
$x_{k+1} = F(x_k, u_k)$,  (1)
where $k \in \mathbb{N}$ denotes the time index, $x_k = [x_{1,k}, \ldots, x_{n,k}]^T \in \Omega_X \subset \mathbb{R}^n$ is the system state, $u_k = [u_{1,k}, \ldots, u_{m,k}]^T \in \Omega_U \subset \mathbb{R}^m$ is the control input, $F: \Omega_X \times \Omega_U \to \Omega_X$ is the unknown continuously differentiable system function, and $\Omega_X$ and $\Omega_U$ are compact subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. The control input is generated by $u_k = C(x_k)$, with $C: \Omega_X \to \Omega_U$ a time-invariant, continuous state feedback controller function with respect to the state $x$. By convention, vectors written as $[x_{1,k}, \ldots, x_{n,k}]^T$ are column vectors, while those without the transposition are row vectors.
For the optimal control problem, the objective is to find the optimal controller that minimizes the infinite-horizon value function, defined as follows:
$V(x_k) = \sum_{i=k}^{\infty} r(x_i, C(x_i)) = r(x_k, C(x_k)) + V(x_{k+1})$,  (2)
where the function $r: \Omega_X \times \Omega_U \to \mathbb{R}$, with $r(x_k, u_k) \geq 0$ and $r(0,0) = 0$, is known as the penalty function, defined as $r(x_k, u_k) = \Theta(x_k) + C(x_k)^T R\, C(x_k)$, where $\Theta: \Omega_X \to \mathbb{R}$ is the penalty term describing the system’s desired behavior as a positive semidefinite function, and $R \in \mathbb{R}^{m \times m}$ is a square positive definite command weighting matrix, as in [4]. The optimal value function [1] is defined as
$V^*(x_k) = \min_{C(x_k)}\left\{ r(x_k, C(x_k)) + V^*(x_{k+1}) \right\}$.  (3)
The optimal controller is found by applying the argmin operator to Equation (3), as
$C^*(x_k) = \arg\min_{C(x_k)}\left\{ r(x_k, C(x_k)) + V^*(x_{k+1}) \right\}$.  (4)
With the system function $F$ unknown, the well-known ADP methods cannot be applied directly to system (1) in order to arrive at (3) and (4). Therefore, the introduction of action value functions is mandatory to handle the model-free case.
The action value function proposed in [28] evaluates both the current state and the command. It is defined as
$Q(x_k, u_k) = r(x_k, u_k) + V(x_{k+1})$.  (5)
Compared to the value function (2), the action value function represents the cost of issuing a command $u_k$ in a state $x_k$, plus the value function of the next state $x_{k+1}$. Essentially, Equation (5) evaluates all possible actions $u_k \in \Omega_U$ followed by the controller $C(x_{k+1})$. Equation (5) can also be written, according to [28], as
$Q(x_k, u_k) = r(x_k, u_k) + Q(x_{k+1}, C(x_{k+1}))$.  (6)
From [28], similarly to the value function (3), the optimal action value function is defined as
$Q^*(x_k, u_k) = \min_{C(x_k)}\left\{ r(x_k, u_k) + Q^*(x_{k+1}, C(x_{k+1})) \right\}$,  (7)
and the optimal controller is represented by
$C^*(x_k) = \arg\min_{C(x_k)}\left\{ r(x_k, u_k) + Q^*(x_{k+1}, C(x_{k+1})) \right\}$.  (8)

ADHDP Algorithm

Arriving at the optimal action value function and controller requires an iterative procedure consisting of $j$ steps, in which the action value function and the controller are continuously updated, according to [28]. Starting with an initial controller $C_0(x_k)$ and an initial action value function, e.g., $Q_0(x_k, u_k) = 0$, the action value function evaluation is issued by
$Q_1(x_k, u_k) = r(x_k, u_k) + Q_0(x_{k+1}, C_0(x_{k+1}))$.  (9)
Then, the controller is updated using
$C_1(x_k) = \arg\min_{C(x_k)}\left\{ r(x_k, u_k) + Q_0(x_{k+1}, C(x_{k+1})) \right\}$.  (10)
At the $j$th iteration, the action value function update is
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, C_j(x_{k+1}))$,  (11)
while the controller update law is
$C_{j+1}(x_k) = \arg\min_{C(x_k)}\left\{ r(x_k, u_k) + Q_j(x_{k+1}, C(x_{k+1})) \right\}$.  (12)
The iteration scheme consisting of the repeated application of Equations (11) and (12) runs as $j \to \infty$.
Remark 1.
A policy iteration algorithm requires an initially known stabilizing controller $C_0(x_k)$, whereas value iteration schemes avoid this requirement.
In the next section, the implementation of the controller and action value function updates is described using neural network function approximation for $Q_j(x_k, u_k)$ and $C_j(x_k)$.

3. Neural Network Implementation for BOADHDP

The recursive ADP scheme described by Equations (11) and (12) is practically implemented using function approximators for the action value function and the controller. To this end, neural networks (NNs) are used, due to their universal function approximation property, which makes them able to handle multidimensional nonlinear systems of the form (1). The tuning of the NN weights in each individual layer requires both input–output training data and the backpropagation mechanism, which is best described as a gradient-based update rule.
The training data for the controller and the action value function are collected from the controlled system (1) and take the form of transition tuples $(x_k, u_k, r(x_k, u_k), x_{k+1})$ stored in a dataset $D_M = \{(x_k, u_k, r(x_k, u_k), x_{k+1})\}$, with $k = 1:M$. The main objective of the data collection phase is to uniformly sample the space $\Omega_X \times \Omega_U$, sufficiently exploring the system’s dynamics.
The action value function and controller NN weight tuning algorithm, based on gradient descent, is described in Section 3.1 and Section 3.2. Each weight gradient update uses the entirety of the transitions collected in $D_M$, in contrast to the methods from [21,22], which use only one transition per gradient update. This approach is called batch optimization, and it is common practice in RL and ADP applications to complex nonlinear systems.
For the batch learning implementation, the action value function and controller updates are made simultaneously for the entire dataset $D_M$. Therefore, let $X_p = [x_1, \ldots, x_{M-1}]$ and $X_f = [x_2, \ldots, x_M]$, both of size $n \times (M-1)$, and $\Upsilon = [u_1, \ldots, u_{M-1}]$, of size $m \times (M-1)$, be matrices that lump together all states and commands collected in the dataset $D_M$. Also, let $\Xi = [X_p^T\ \Upsilon^T]^T$ be the matrix obtained by stacking the state and command matrices, resembling the action value function input.
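As an illustration of this batching step, the following minimal NumPy sketch assembles $X_p$, $X_f$, $\Upsilon$, and $\Xi$ from a list of transition tuples; the function name, the tuple layout, and the auxiliary penalty vector are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def build_batch_matrices(dataset):
    """Assemble the batch matrices of Section 3 from transition tuples
    (x_k, u_k, r_k, x_{k+1}); states are vectors of size n, commands of size m."""
    X_p = np.column_stack([x for (x, u, r, x_next) in dataset[:-1]])        # n x (M-1)
    X_f = np.column_stack([x_next for (x, u, r, x_next) in dataset[:-1]])   # n x (M-1)
    U   = np.column_stack([u for (x, u, r, x_next) in dataset[:-1]])        # m x (M-1)
    R   = np.array([r for (x, u, r, x_next) in dataset[:-1]])               # penalties (auxiliary)
    Xi  = np.vstack([X_p, U])      # action value function input: stacked states and commands
    return X_p, X_f, U, R, Xi
```

Each column of $\Xi$ then corresponds to one $(x_k, u_k)$ pair fed to the action value function NN.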
Denoting by $\tilde{Q}_j(x_k, u_k, W_Q)$ and $\tilde{C}_j(x_k, W_C)$ the action value function and controller approximated by NNs, respectively, and by $\hat{W}_Q$ and $\hat{W}_C$ the entirety of the action value function and controller weights, respectively, the gradient descent updates are detailed next.

3.1. Action Value Function NN Approximation

The action value function NN has the role of approximating (11). Having as inputs the state $x_k$ and the command $u_k$, the action value function NN is described as
$Q_j(X_p, \Upsilon, W_Q) = Q(X_p, \Upsilon, W_{Q,j}) = z_Q^{L_Q} = W_{Q,j}^{L_Q}\kappa_Q^{L_Q-1}$,  (13)
where
$z_Q^{l_Q} = W_{Q,j}^{l_Q}\kappa_Q^{l_Q-1}, \quad \text{for } l_Q = 1, \ldots, L_Q$,
$\kappa_Q^{l_Q} = \phi(z_Q^{l_Q})$,
and $L_Q$ is the total number of layers, $W_{Q,j}^{l_Q} \in \mathbb{R}^{h_Q^{l_Q} \times h_Q^{l_Q-1}}$ is the ideal weight matrix at iteration $j$ and layer $l_Q$, and $h_Q^{l_Q}$ is the number of neurons in layer $l_Q$. The size of $Q(X_p, \Upsilon, W_{Q,j})$ is $1 \times (M-1)$. Here, $\phi(\cdot) = \tanh(\cdot)$ represents the activation function; it can take any form, such as $\tanh(\cdot)$, $\mathrm{ReLU}(\cdot)$, $\mathrm{sigmoid}(\cdot)$, and so on. The vector $\kappa_Q^{l_Q}$ is the activation output of layer $l_Q$. For the input layer, $\kappa_Q^{0} = \Xi$.
In general, the ideal weights $W_{Q,j}^{l_Q}$, for $l_Q = 1, \ldots, L_Q$, are unknown due to the approximation errors existing in the weight update backpropagation rule. Hence, working with the ideal action value function $Q_j(X_p, \Upsilon, W_Q)$ is not realistic; one can only work with approximations of it. Denoting by $\hat{W}_{Q,j}$ the entirety of the estimated action value function weights, the output of the approximate action value function NN has the form
$\hat{Q}_j(X_p, \Upsilon, \hat{W}_Q) = \hat{Q}(X_p, \Upsilon, \hat{W}_{Q,j}) = \hat{z}_Q^{L_Q} = \hat{W}_{Q,j}^{L_Q}\hat{\kappa}_Q^{L_Q-1}$,  (14)
where
$\hat{z}_Q^{l_Q} = \hat{W}_{Q,j}^{l_Q}\hat{\kappa}_Q^{l_Q-1}, \quad \text{for } l_Q = 1, \ldots, L_Q$,
$\hat{\kappa}_Q^{l_Q} = \phi(\hat{z}_Q^{l_Q})$,
with $\hat{\kappa}_Q^{0} = \Xi$, and where $\hat{W}_{Q,j}^{l_Q} \in \mathbb{R}^{h_Q^{l_Q} \times h_Q^{l_Q-1}}$ represents an estimate of the ideal weights for $l_Q = 1, \ldots, L_Q$. To update the action value function NN weights, an internal gradient update loop is run for $i_Q = 0, \ldots, I_Q$ steps, with the weights initialized as $\hat{W}_{Q,j,0} = \hat{W}_{Q,j}$. At each iteration $i_Q$, the following optimization problem needs to be solved,
$\hat{W}_{Q,j,i_Q+1} = \arg\min_{\hat{W}} \frac{1}{M} E_{Q,j,i_Q}$,  (15)
where
$E_{Q,j,i_Q} = e_{Q,j,i_Q}e_{Q,j,i_Q}^T$  (16)
and
$e_{Q,j,i_Q} = \hat{Q}(X_p, \Upsilon, \hat{W}) - \eta_{Q,j,i_Q}$,  (17)
with the target $\eta_{Q,j,i_Q} = r(X_p, \Upsilon) + \gamma\hat{Q}\left(X_f, \tilde{C}(X_f, \hat{W}_{C,j}), \hat{W}_{Q,j,i_Q}\right)$, where $\gamma \in (0,1]$ is the discount factor.
Here, $e_{Q,j,i_Q}$ represents the prediction error in the form of a TD error. The action value function weights are updated by the rule
$\hat{W}_{Q,j,i_Q+1}^{l_Q} = \hat{W}_{Q,j,i_Q}^{l_Q} - \alpha_Q\dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{W}_{Q,j,i_Q}^{l_Q}}$,  (18)
where $\alpha_Q > 0$ is the action value function NN learning rate and
$\dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{W}_{Q,j,i_Q}^{l_Q}} = \dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{z}_Q^{l_Q}}\dfrac{\partial \hat{z}_Q^{l_Q}}{\partial \hat{W}_{Q,j,i_Q}^{l_Q}} = \dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{z}_Q^{l_Q}}\left(\hat{\kappa}_Q^{l_Q-1}\right)^T$,  (19)
$\dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{z}_Q^{l_Q}} = \dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{z}_Q^{l_Q+1}}\dfrac{\partial \hat{z}_Q^{l_Q+1}}{\partial \hat{\kappa}_Q^{l_Q}}\dfrac{\partial \hat{\kappa}_Q^{l_Q}}{\partial \hat{z}_Q^{l_Q}} = \left(\hat{W}_{Q,j,i_Q}^{l_Q+1}\right)^T\dfrac{\partial E_{Q,j,i_Q}}{\partial \hat{z}_Q^{l_Q+1}}\odot\dot{\phi}\left(\hat{z}_Q^{l_Q}\right)$.  (20)
The symbol $\odot$ denotes the Hadamard product. Then, the weights of iteration $j+1$ are set as $\hat{W}_{Q,j+1} = \hat{W}_{Q,j,I_Q}$.
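To make the batched update concrete, the following NumPy sketch implements one inner gradient step (18)–(20) over the whole dataset, assuming tanh hidden activations and a linear output layer (as used later in Section 5); the mean-squared TD loss scaling and all names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2          # derivative of the activation

def forward(weights, inp):
    """Forward pass (13)-(14): kappa^0 = input, kappa^l = phi(W^l kappa^{l-1}),
    linear output z^L = W^L kappa^{L-1}. Returns the output and the activations."""
    kappas, k = [inp], inp
    for W in weights[:-1]:
        k = phi(W @ k)
        kappas.append(k)
    return weights[-1] @ k, kappas

def q_batch_step(Wq, Wc, Xp, Xf, U, R, alpha_q, gamma=1.0):
    """One inner critic step (18)-(20) over the whole batch."""
    M = Xp.shape[1]
    u_next, _ = forward(Wc, Xf)                       # C_hat(X_f)
    q_next, _ = forward(Wq, np.vstack([Xf, u_next]))  # bootstrap value
    eta = R.reshape(1, -1) + gamma * q_next           # TD target in (17)
    q, kappas = forward(Wq, np.vstack([Xp, U]))
    delta = 2.0 * (q - eta) / M                       # dE/dz^L for the scaled squared TD error
    grads = [None] * len(Wq)
    for l in reversed(range(len(Wq))):
        grads[l] = delta @ kappas[l].T                # (19): dE/dW^l = dE/dz^l (kappa^{l-1})^T
        if l > 0:                                     # (20): back-propagate through layer l
            delta = (Wq[l].T @ delta) * dphi(Wq[l - 1] @ kappas[l - 1])
    return [W - alpha_q * g for W, g in zip(Wq, grads)]   # (18)
```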

3.2. Controller NN Approximation

The controller NN has the role of approximating $C(x_k)$. Denoting by $W_{C,j}$ the entirety of the ideal controller weights, and having as input the state matrix $X_p$, the output is computed as
$C_j(X_p, W_C) = C(X_p, W_{C,j}) = z_C^{L_C} = W_{C,j}^{L_C}\kappa_C^{L_C-1}$,  (21)
where
$z_C^{l_C} = W_{C,j}^{l_C}\kappa_C^{l_C-1}, \quad \text{for } l_C = 1, \ldots, L_C$,
$\kappa_C^{l_C} = \phi(z_C^{l_C})$,
with $\kappa_C^{0} = X_p$, and where $W_{C,j}^{l_C} \in \mathbb{R}^{h_C^{l_C} \times h_C^{l_C-1}}$ represents the ideal weight matrix at iteration $j$ for the layers $l_C = 1, \ldots, L_C$. Denoting by $\hat{W}_{C,j}$ the estimate of the ideal weights, the output of the approximate controller NN is
$\hat{C}_j(X_p, \hat{W}_C) = \hat{C}(X_p, \hat{W}_{C,j}) = \hat{z}_C^{L_C} = \hat{W}_{C,j}^{L_C}\hat{\kappa}_C^{L_C-1}$,  (22)
with
$\hat{z}_C^{l_C} = \hat{W}_{C,j}^{l_C}\hat{\kappa}_C^{l_C-1}, \quad \text{for } l_C = 1, \ldots, L_C$,
$\hat{\kappa}_C^{l_C} = \phi(\hat{z}_C^{l_C})$,
and where $\hat{W}_{C,j}^{l_C}$ represents an estimate of the ideal weights. To update the controller weights, an internal gradient update loop is run for $i_C = 0, \ldots, I_C$ steps, with the weights initialized as $\hat{W}_{C,j,0} = \hat{W}_{C,j}$. At each iteration $i_C$, the following optimization problem is minimized over the entire collected dataset,
$\hat{W}_{C,j,i_C+1} = \arg\min_{\hat{W}} \frac{1}{M} E_{C,j,i_C}$,  (23)
where
$E_{C,j,i_C} = e_{C,j,i_C}e_{C,j,i_C}^T$  (24)
and
$e_{C,j,i_C} = \hat{Q}\left(X_p, \hat{C}(X_p, \hat{W}), \hat{W}_{Q,j+1}\right)$.  (25)
The update of each individual weight matrix is
$\hat{W}_{C,j,i_C+1}^{l_C} = \hat{W}_{C,j,i_C}^{l_C} - \alpha_C\dfrac{\partial E_{C,j,i_C}}{\partial \hat{W}_{C,j,i_C}^{l_C}}$,  (26)
where $\alpha_C > 0$ represents the controller NN learning rate and
$\dfrac{\partial E_{C,j,i_C}}{\partial \hat{W}_{C,j,i_C}^{l_C}} = \dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{l_C}}\dfrac{\partial \hat{z}_C^{l_C}}{\partial \hat{W}_{C,j,i_C}^{l_C}} = \dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{l_C}}\left(\hat{\kappa}_C^{l_C-1}\right)^T$,  (27)
$\dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{l_C}} = \dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{l_C+1}}\dfrac{\partial \hat{z}_C^{l_C+1}}{\partial \hat{\kappa}_C^{l_C}}\dfrac{\partial \hat{\kappa}_C^{l_C}}{\partial \hat{z}_C^{l_C}} = \left(\hat{W}_{C,j,i_C}^{l_C+1}\right)^T\dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{l_C+1}}\odot\dot{\phi}\left(\hat{z}_C^{l_C}\right)$.  (28)
To apply the recursion (28), it is necessary to compute the gradient of the action value function with respect to the controller output. This is computed as
$\dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_C^{L_C}} = \dfrac{\partial E_{C,j,i_C}}{\partial \hat{C}(X_p, \hat{W}_{C,j,i_C})} = \dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_Q^{1}}\dfrac{\partial \hat{z}_Q^{1}}{\partial \hat{\kappa}_Q^{0}}\dfrac{\partial \hat{\kappa}_Q^{0}}{\partial \hat{C}(X_p, \hat{W}_{C,j,i_C})} = \Psi^T\left(\hat{W}_{Q,j,i}^{1}\right)^T\dfrac{\partial E_{C,j,i_C}}{\partial \hat{z}_Q^{1}} = \Omega_{j,i}$,  (29)
where $\Psi = \begin{bmatrix} 0_{n\times m} \\ I_m \end{bmatrix}$, with $I_m$ the $m \times m$ identity matrix and $0_{n\times m}$ an $n \times m$ matrix of zeros. Then, the weights of iteration $j+1$ are set as $\hat{W}_{C,j+1} = \hat{W}_{C,j,I_C}$.
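For the controller side, the following sketch (reusing `forward`, `phi`, and `dphi` from the previous listing, together with NumPy) performs one inner gradient step (26)–(29) over the batch: it back-propagates the batched action values through the Q network down to its input, selects the command rows as in (29), and then back-propagates that seed through the controller layers. The structure and names are again illustrative assumptions.

```python
def controller_batch_step(Wc, Wq, Xp, alpha_c):
    """One inner controller step (26)-(29), minimizing the predicted action values."""
    n, M = Xp.shape
    u_hat, kappas_c = forward(Wc, Xp)                      # controller forward pass (22)
    q_hat, kappas_q = forward(Wq, np.vstack([Xp, u_hat]))  # Q_hat(X_p, C_hat(X_p))
    delta_q = 2.0 * q_hat / M                              # dE_C/dz_Q^L for E_C = (1/M) e_C e_C^T
    for l in reversed(range(1, len(Wq))):                  # back-propagate inside the Q network
        delta_q = (Wq[l].T @ delta_q) * dphi(Wq[l - 1] @ kappas_q[l - 1])
    dE_dkappa0 = Wq[0].T @ delta_q                         # gradient w.r.t. kappa_Q^0 = [X_p; u]
    Omega = dE_dkappa0[n:, :]                              # Psi^T selects the command rows, Eq. (29)
    delta_c, grads = Omega, [None] * len(Wc)
    for l in reversed(range(len(Wc))):                     # (27)-(28) through the controller layers
        grads[l] = delta_c @ kappas_c[l].T
        if l > 0:
            delta_c = (Wc[l].T @ delta_c) * dphi(Wc[l - 1] @ kappas_c[l - 1])
    return [W - alpha_c * g for W, g in zip(Wc, grads)]    # (26)
```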

3.3. Batch Offline ADHDP with Multiple-Hidden-Layer NN Algorithm

Next, the BOADHDP algorithm using multiple-hidden-layer NN function approximators is detailed. The algorithm consists of consecutive steps in which the action value function and controller NNs are updated; a compact code sketch of the overall loop is given after the step list.
  • Step 1: Initialize $\alpha_Q$, $\alpha_C$, $I_Q$, $I_C$, $\Delta_Q$. Initialize the NN architectures for $\hat{Q}_j(X_p, \Upsilon, \hat{W}_Q)$ and $\hat{C}_j(X_p, \hat{W}_C)$ by setting $L_Q$, $L_C$, and their respective weights. Let $j = 0$ and $i_Q = i_C = 0$.
  • Step 2: Collect $M$ transitions from system (1) and construct the dataset $D_M$.
  • Step 3: At iteration $j$, set $i_Q = 0$ and $\hat{W}_{Q,j,i_Q} = \hat{W}_{Q,j}$. Then, update the weights of all $L_Q$ layers using (18) for $i_Q = \overline{0, I_Q}$. Finally, set $\hat{W}_{Q,j+1} = \hat{W}_{Q,j,I_Q}$.
  • Step 4: Set $i_C = 0$ and $\hat{W}_{C,j,i_C} = \hat{W}_{C,j}$. Then, update the weights of all $L_C$ layers using (26) for $i_C = \overline{0, I_C}$. Finally, set $\hat{W}_{C,j+1} = \hat{W}_{C,j,I_C}$.
  • Step 5: If the condition $\|\hat{W}_{Q,j} - \hat{W}_{Q,j-1}\| < \Delta_Q$ is not met, set $j = j + 1$ and go to Step 3. Otherwise, stop the iterative algorithm.
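A compact sketch of the overall loop, built on the illustrative helpers from the previous listings, reads as follows; the stopping tolerance and iteration caps are assumptions mirroring Steps 1–5 above.

```python
def boadhdp(dataset, Wq, Wc, alpha_q, alpha_c, I_q, I_c, delta_q_tol, max_iter=500):
    """Batch offline ADHDP outer loop (Section 3.3)."""
    Xp, Xf, U, R, _ = build_batch_matrices(dataset)        # Step 2 (dataset already collected)
    for j in range(max_iter):
        Wq_prev = [W.copy() for W in Wq]
        for _ in range(I_q):                               # Step 3: I_Q inner critic updates (18)
            Wq = q_batch_step(Wq, Wc, Xp, Xf, U, R, alpha_q)
        for _ in range(I_c):                               # Step 4: I_C inner controller updates (26)
            Wc = controller_batch_step(Wc, Wq, Xp, alpha_c)
        drift = sum(np.linalg.norm(a - b) for a, b in zip(Wq, Wq_prev))
        if drift < delta_q_tol:                            # Step 5: stopping condition
            break
    return Wq, Wc
```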

4. UUB Convergence

In this section, the convergence of the NN weights to a fixed point is examined. By using a Lyapunov function, the stability of the weight evolution to the fixed point is proved to be UUB under some specific conditions.

4.1. Lyapunov Approach Description

Each iteration $j$ of the BOADHDP algorithm consists of a total of $I = I_Q + I_C$ cumulated gradient steps for the action value function and the controller. Let a new iteration index be defined as $i = 1:jI$, namely $i \in [1, \ldots, I_Q, I_Q+1, \ldots, I, I+1, \ldots, I+I_Q, I+I_Q+1, \ldots, 2I, \ldots]$, which represents a fine-grained iteration count over both the action value function and the controller gradient steps. During $i \in [jI, jI+I_Q]$, only the action value function NN weights $\hat{W}_{Q,j,i}$ are updated using (18), while $\hat{W}_{C,j,i}$ remains unchanged. Conversely, for $i \in [jI+I_Q, jI+I_Q+I_C]$, only the controller weights $\hat{W}_{C,j,i}$ are updated using (26), while the action value function weights $\hat{W}_{Q,j,i}$ remain unchanged. To simplify the notation, we substitute $\hat{W}_{Q,j,i}$ and $\hat{W}_{C,j,i}$ with $\hat{W}_{Q,i}$ and $\hat{W}_{C,i}$, respectively.
Let $W_Q^*$ and $W_C^*$ represent the optimal weights of the action value function NN and of the controller NN, and let the weight estimation errors between the approximations of the real weights and the optimal ones be $\bar{W}_{Q,i} = \hat{W}_{Q,i} - W_Q^*$ and $\bar{W}_{C,i} = \hat{W}_{C,i} - W_C^*$.
Therefore, the difference between the estimated weights and the optimal ones at each layer of both the action value function and the controller NN at each iteration $i$ is, according to (18) and (26),
$\bar{W}_{Q,i+1}^{l_Q} = \hat{W}_{Q,i+1}^{l_Q} - W_Q^{l_Q,*} = \hat{W}_{Q,i}^{l_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial \hat{W}_{Q,i}^{l_Q}} - W_Q^{l_Q,*} = \bar{W}_{Q,i}^{l_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial \hat{W}_{Q,i}^{l_Q}}$,  (30)
$\bar{W}_{C,i+1}^{l_C} = \hat{W}_{C,i+1}^{l_C} - W_C^{l_C,*} = \hat{W}_{C,i}^{l_C} - \alpha_C\dfrac{\partial E_{C,i}}{\partial \hat{W}_{C,i}^{l_C}} - W_C^{l_C,*} = \bar{W}_{C,i}^{l_C} - \alpha_C\dfrac{\partial E_{C,i}}{\partial \hat{W}_{C,i}^{l_C}}$.  (31)
Then, based on (14), (18), (22), and (29), define the following nonlinear difference equation system, where $P$ represents a nonlinear function,
$\begin{bmatrix} \bar{W}_{Q,i+1} \\ \bar{W}_{C,i+1} \end{bmatrix} = \begin{bmatrix} \bar{W}_{Q,i} \\ \bar{W}_{C,i} \end{bmatrix} - P\!\begin{pmatrix} \hat{W}_{Q,i}, \hat{W}_{Q,i-1}, \phi(\hat{W}_{Q,i}^{1}\Xi), \phi(\hat{W}_{Q,i-1}^{1}\Xi), \ldots, \phi(\hat{W}_{Q,i}^{L_Q}\hat{\kappa}_Q^{L_Q-1}), \phi(\hat{W}_{Q,i-1}^{L_Q}\hat{\kappa}_Q^{L_Q-1}), \\ \hat{W}_{C,i}, \hat{W}_{C,i-1}, \phi(\hat{W}_{C,i}^{1}X_p), \phi(\hat{W}_{C,i-1}^{1}X_p), \ldots, \phi(\hat{W}_{C,i}^{L_C}\hat{\kappa}_C^{L_C-1}), \phi(\hat{W}_{C,i-1}^{L_C}\hat{\kappa}_C^{L_C-1}) \end{pmatrix}$.  (32)
Definition 1.
The equilibrium point of the system (32) is said to be uniformly ultimately bounded (UUB) with bound $\chi > 0$ if, for any $\psi > 0$ and $i_0 > 0$, there exists a positive number $N = N(\psi, \chi)$, independent of $i_0$, such that $\left\| [\bar{W}_{Q,i}^T\ \bar{W}_{C,i}^T]^T \right\| \leq \chi$ for all $i \geq N + i_0$ whenever $\left\| [\bar{W}_{Q,i_0}^T\ \bar{W}_{C,i_0}^T]^T \right\| \leq \psi$.

4.2. Preliminary Results

In the following, the UUB property of the system (32) is demonstrated for the update rules (18) and (26), both of which drive the weights of the two approximating NNs into a region centered at the optimal weights $W_Q^*$ and $W_C^*$. Some fundamental assumptions are introduced next.
Assumption 1.
The optimal NN weights of the action value function and of the controller, as well as the activation function $\phi(\cdot)$, are bounded by positive constants, i.e., $\|W_Q^*\| \leq W_{Q,max}^*$, $\|W_C^*\| \leq W_{C,max}^*$, and $\|\phi(\cdot)\| \leq \phi_{max}$.
Lemma 1.
Under Assumption 1, the first difference of $\Gamma_{Q,i}^{L_Q} = \frac{1}{\alpha_Q} tr\{ (\bar{W}_{Q,i}^{L_Q})^T\bar{W}_{Q,i}^{L_Q} \}$ is given by
$\Delta\Gamma_{Q,i}^{L_Q} = -tr\left\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} + \alpha_Q\, tr\left\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\}$.  (33)
Proof. 
Let $\Delta\Gamma_{Q,i}^{L_Q}$ be described as
$\Delta\Gamma_{Q,i}^{L_Q} = \frac{1}{\alpha_Q} tr\left\{ (\bar{W}_{Q,i+1}^{L_Q})^T\bar{W}_{Q,i+1}^{L_Q} - (\bar{W}_{Q,i}^{L_Q})^T\bar{W}_{Q,i}^{L_Q} \right\}$.  (34)
Using (19), (20), and (30), we get
$\bar{W}_{Q,i+1}^{L_Q} = \bar{W}_{Q,i}^{L_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial \hat{W}_{Q,i}^{L_Q}} = \bar{W}_{Q,i}^{L_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial e_{Q,i}}\dfrac{\partial e_{Q,i}}{\partial \hat{z}_Q^{L_Q}}\dfrac{\partial \hat{z}_Q^{L_Q}}{\partial \hat{W}_{Q,i}^{L_Q}} = \bar{W}_{Q,i}^{L_Q} - \alpha_Q e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T$.  (35)
Based on this, we have
$\Delta\Gamma_{Q,i}^{L_Q} = \frac{1}{\alpha_Q} tr\left\{ \left(\bar{W}_{Q,i}^{L_Q} - \alpha_Q e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\right)^T\left(\bar{W}_{Q,i}^{L_Q} - \alpha_Q e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\right) - (\bar{W}_{Q,i}^{L_Q})^T\bar{W}_{Q,i}^{L_Q} \right\}$
$= \frac{1}{\alpha_Q} tr\left\{ -2\alpha_Q\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T + \alpha_Q^2\, e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\}$
$= -tr\left\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} + \alpha_Q\, tr\left\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\}$.  (36)
Lemma 2.
Under Assumption 1, the first difference of $\Gamma_{Q,i}^{l_Q} = \frac{1}{\alpha_Q} tr\{ (\bar{W}_{Q,i}^{l_Q})^T\bar{W}_{Q,i}^{l_Q} \}$, for $l_Q = \overline{1 : L_Q - 1}$, is given by
$\Delta\Gamma_{Q,i}^{l_Q} = -tr\left\{ 2(\bar{W}_{Q,i}^{l_Q})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\} + \alpha_Q\, tr\left\{ \phi(\hat{z}_Q^{l_Q-1})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\}$,  (37)
where $\Phi_i^{l_Q+1} = \partial E_{Q,i}/\partial \hat{z}_Q^{l_Q+1}$.
Proof. 
For any $\Delta\Gamma_{Q,i}^{l_Q}$, with $l_Q = \overline{1 : L_Q - 1}$, we have
$\Delta\Gamma_{Q,i}^{l_Q} = \frac{1}{\alpha_Q} tr\left\{ (\bar{W}_{Q,i+1}^{l_Q})^T\bar{W}_{Q,i+1}^{l_Q} - (\bar{W}_{Q,i}^{l_Q})^T\bar{W}_{Q,i}^{l_Q} \right\}$.  (38)
Based on (19), (20), and (30), we get
$\bar{W}_{Q,i+1}^{l_Q} = \bar{W}_{Q,i}^{l_Q} - \alpha_Q\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1})$,  (39)
with $\Phi_i^{l_Q+1} = \partial E_{Q,i}/\partial \hat{z}_Q^{l_Q+1}$. Based on (38) and (39), one gets
$\Delta\Gamma_{Q,i}^{l_Q} = \frac{1}{\alpha_Q} tr\left\{ \left(\bar{W}_{Q,i}^{l_Q} - \alpha_Q\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1})\right)^T\left(\bar{W}_{Q,i}^{l_Q} - \alpha_Q\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1})\right) - (\bar{W}_{Q,i}^{l_Q})^T\bar{W}_{Q,i}^{l_Q} \right\}$
$= \frac{1}{\alpha_Q} tr\left\{ -2\alpha_Q(\bar{W}_{Q,i}^{l_Q})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) + \alpha_Q^2\,\phi(\hat{z}_Q^{l_Q-1})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\}$
$= -tr\left\{ 2(\bar{W}_{Q,i}^{l_Q})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\} + \alpha_Q\, tr\left\{ \phi(\hat{z}_Q^{l_Q-1})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\}$.  (40)
Lemma 3.
Under Assumption 1, the first difference of $\Gamma_{C,i}^{L_C} = \frac{1}{\alpha_C} tr\{ (\bar{W}_{C,i}^{L_C})^T\bar{W}_{C,i}^{L_C} \}$ is given by
$\Delta\Gamma_{C,i}^{L_C} = -tr\left\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} + \alpha_C\, tr\left\{ \Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\}$.  (41)
Proof. 
Let $\Delta\Gamma_{C,i}^{L_C}$ be described as
$\Delta\Gamma_{C,i}^{L_C} = \frac{1}{\alpha_C} tr\left\{ (\bar{W}_{C,i+1}^{L_C})^T\bar{W}_{C,i+1}^{L_C} - (\bar{W}_{C,i}^{L_C})^T\bar{W}_{C,i}^{L_C} \right\}$.  (42)
Based on (27), (28), and (31), let
$\bar{W}_{C,i+1}^{L_C} = \bar{W}_{C,i}^{L_C} - \alpha_C\dfrac{\partial E_{C,i}}{\partial \hat{W}_{C,i}^{L_C}} = \bar{W}_{C,i}^{L_C} - \alpha_C\,\Omega_i\phi(\hat{z}_C^{L_C-1})^T$.  (43)
Therefore,
$\Delta\Gamma_{C,i}^{L_C} = \frac{1}{\alpha_C} tr\left\{ \left(\bar{W}_{C,i}^{L_C} - \alpha_C\Omega_i\phi(\hat{z}_C^{L_C-1})^T\right)^T\left(\bar{W}_{C,i}^{L_C} - \alpha_C\Omega_i\phi(\hat{z}_C^{L_C-1})^T\right) - (\bar{W}_{C,i}^{L_C})^T\bar{W}_{C,i}^{L_C} \right\}$
$= \frac{1}{\alpha_C} tr\left\{ -2\alpha_C\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T + \alpha_C^2\,\Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\}$
$= -tr\left\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} + \alpha_C\, tr\left\{ \Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\}$.  (44)
Lemma 4.
Under Assumption 1, the first difference of $\Gamma_{C,i}^{l_C} = \frac{1}{\alpha_C} tr\{ (\bar{W}_{C,i}^{l_C})^T\bar{W}_{C,i}^{l_C} \}$, for $l_C = \overline{1 : L_C - 1}$, is given by
$\Delta\Gamma_{C,i}^{l_C} = -tr\left\{ 2(\bar{W}_{C,i}^{l_C})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\} + \alpha_C\, tr\left\{ \phi(\hat{z}_C^{l_C-1})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\}$,  (45)
where $\chi_i^{l_C+1} = \partial E_{C,i}/\partial \hat{z}_C^{l_C+1}$.
Proof. 
For any $\Delta\Gamma_{C,i}^{l_C}$, with $l_C = \overline{1 : L_C - 1}$, we have
$\Delta\Gamma_{C,i}^{l_C} = \frac{1}{\alpha_C} tr\left\{ (\bar{W}_{C,i+1}^{l_C})^T\bar{W}_{C,i+1}^{l_C} - (\bar{W}_{C,i}^{l_C})^T\bar{W}_{C,i}^{l_C} \right\}$.  (46)
Based on (27), (28), and (31), we get
$\bar{W}_{C,i+1}^{l_C} = \bar{W}_{C,i}^{l_C} - \alpha_C\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1})$,  (47)
with $\chi_i^{l_C+1} = \partial E_{C,i}/\partial \hat{z}_C^{l_C+1}$. Based on (46) and (47), one gets
$\Delta\Gamma_{C,i}^{l_C} = \frac{1}{\alpha_C} tr\left\{ \left(\bar{W}_{C,i}^{l_C} - \alpha_C\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1})\right)^T\left(\bar{W}_{C,i}^{l_C} - \alpha_C\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1})\right) - (\bar{W}_{C,i}^{l_C})^T\bar{W}_{C,i}^{l_C} \right\}$
$= -tr\left\{ 2(\bar{W}_{C,i}^{l_C})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\} + \alpha_C\, tr\left\{ \phi(\hat{z}_C^{l_C-1})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\}$.  (48)

4.3. Main Stability Analysis

This section provides the main stability result for the weight estimation error system (32).
Theorem 1.
Consider the BOADHDP algorithm from Section 3.3, which iteratively updates $\hat{W}_{Q,i}$ and $\hat{W}_{C,i}$ using (18) and (26). Then the action value function and controller weights converge to their optimal weights $W_Q^*$ and $W_C^*$, respectively, such that $\bar{W}_{Q,i} \to 0$ and $\bar{W}_{C,i} \to 0$, if
$\alpha_Q < \dfrac{2\left( \bar{W}_{Q,max}^2\phi_{Q,max}^2 + \sum_{l_Q=1}^{L_Q-1}\bar{W}_{Q,max}\phi_{Q,max}^2\hat{W}_{Q,max}\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}\dot{\phi}_{Q,max} \right)}{\bar{W}_{Q,max}^2\phi_{Q,max}^4 + \sum_{l_Q=1}^{L_Q-1}\hat{W}_{Q,max}^2\phi_{Q,max}^4\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}^2\dot{\phi}_{Q,max}^2} = \alpha_{Q,max}$,  (49)
$\alpha_C < \dfrac{2\left( \bar{W}_{C,max}\Omega_{max}\phi_{max} + \sum_{l_C=1}^{L_C-1}\bar{W}_{C,max}\phi_{max}\Omega_{max}\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}\dot{\phi}_{C,max} \right)}{\Omega_{max}^2\phi_{max}^2 + \sum_{l_C=1}^{L_C-1}\phi_{max}^2\Omega_{max}^2\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}^2\dot{\phi}_{C,max}^2} = \alpha_{C,max}$,  (50)
where $\bar{W}_{Q,max}$, $\hat{W}_{Q,max}$, $\bar{W}_{C,max}$, $\hat{W}_{C,max}$, $\phi_{Q,max}$, $\dot{\phi}_{Q,max}$, $\phi_{max}$, $\dot{\phi}_{C,max}$, and $\Omega_{max}$ are the norm bounds defined in the proof.
Proof. 
According to (18) and (26), we have, for each layer of the action value function and controller NNs,
$\bar{W}_{Q,i+1}^{l_Q} = \hat{W}_{Q,i+1}^{l_Q} - W_Q^{l_Q,*} = \hat{W}_{Q,i}^{l_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial \hat{W}_{Q,i}^{l_Q}} - W_Q^{l_Q,*} = \bar{W}_{Q,i}^{l_Q} - \alpha_Q\dfrac{\partial E_{Q,i}}{\partial \hat{W}_{Q,i}^{l_Q}}$,  (51)
$\bar{W}_{C,i+1}^{l_C} = \hat{W}_{C,i+1}^{l_C} - W_C^{l_C,*} = \hat{W}_{C,i}^{l_C} - \alpha_C\dfrac{\partial E_{C,i}}{\partial \hat{W}_{C,i}^{l_C}} - W_C^{l_C,*} = \bar{W}_{C,i}^{l_C} - \alpha_C\dfrac{\partial E_{C,i}}{\partial \hat{W}_{C,i}^{l_C}}$.  (52)
Let the Lyapunov function candidate, defined over the weight matrices of each action value function and controller NN layer $l_Q$ and $l_C$, be described as
$\Gamma_Q = \Gamma_{Q,i}^{1} + \cdots + \Gamma_{Q,i}^{L_Q} = \frac{1}{\alpha_Q} tr\left\{ (\bar{W}_{Q,i}^{1})^T\bar{W}_{Q,i}^{1} \right\} + \cdots + \frac{1}{\alpha_Q} tr\left\{ (\bar{W}_{Q,i}^{L_Q})^T\bar{W}_{Q,i}^{L_Q} \right\}$,  (53)
$\Gamma_C = \Gamma_{C,i}^{1} + \cdots + \Gamma_{C,i}^{L_C} = \frac{1}{\alpha_C} tr\left\{ (\bar{W}_{C,i}^{1})^T\bar{W}_{C,i}^{1} \right\} + \cdots + \frac{1}{\alpha_C} tr\left\{ (\bar{W}_{C,i}^{L_C})^T\bar{W}_{C,i}^{L_C} \right\}$.  (54)
The joint action value function and controller Lyapunov function is
$\Gamma = \Gamma_Q + \Gamma_C$.  (55)
Let the differences of the Lyapunov candidates be
$\Delta\Gamma_Q = \Delta\Gamma_{Q,i}^{1} + \cdots + \Delta\Gamma_{Q,i}^{L_Q}$,  (56)
$\Delta\Gamma_C = \Delta\Gamma_{C,i}^{1} + \cdots + \Delta\Gamma_{C,i}^{L_C}$,  (57)
and the joint Lyapunov difference be $\Delta\Gamma = \Delta\Gamma_Q + \Delta\Gamma_C$.
Next, the proof is divided into two parts: one proving that $\Delta\Gamma_Q < 0$ if inequality (49) holds, and one proving that $\Delta\Gamma_C < 0$ if inequality (50) holds.
(a) According to Lemma 1, $\Delta\Gamma_{Q,i}^{L_Q} = -tr\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \} + \alpha_Q tr\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \}$ and, according to Lemma 2, $\Delta\Gamma_{Q,i}^{l_Q} = -tr\{ 2(\bar{W}_{Q,i}^{l_Q})^T((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}))\phi(\hat{z}_Q^{l_Q-1}) \} + \alpha_Q tr\{ \phi(\hat{z}_Q^{l_Q-1})^T((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}))^T((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}))\phi(\hat{z}_Q^{l_Q-1}) \}$ for all layers $l_Q = \overline{1 : L_Q - 1}$.
The sum $\Delta\Gamma_Q = \Delta\Gamma_{Q,i}^{1} + \cdots + \Delta\Gamma_{Q,i}^{L_Q}$ is lower than 0 if
$\alpha_Q\left( tr\left\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} + \sum_{l_Q=1}^{L_Q-1} tr\left\{ \phi(\hat{z}_Q^{l_Q-1})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\} \right) < tr\left\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} + \sum_{l_Q=1}^{L_Q-1} tr\left\{ 2(\bar{W}_{Q,i}^{l_Q})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\}$.  (58)
For the terms corresponding to layer $L_Q$ in (58), we have
$tr\left\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} \leq \left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2$.  (59)
Also, $tr\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \}$ can be written as
$tr\left\{ 2\bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\} = tr\left\{ \left((\bar{W}_{Q,i}^{L_Q})^T + \phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T\right)^T\left((\bar{W}_{Q,i}^{L_Q})^T + \phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T\right) \right\} - tr\left\{ (\bar{W}_{Q,i}^{L_Q})^T\bar{W}_{Q,i}^{L_Q} \right\} - tr\left\{ e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T\phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\}$
$\leq \left\| (\bar{W}_{Q,i}^{L_Q})^T + \phi(\hat{z}_Q^{L_Q-1})e_{Q,i}^T \right\|^2 - \left\| \bar{W}_{Q,i}^{L_Q} \right\|^2 - \left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2$
$\leq \left\| \bar{W}_{Q,i}^{L_Q} \right\|^2 + 2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\| + \left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2 - \left\| \bar{W}_{Q,i}^{L_Q} \right\|^2 - \left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2 = 2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|$.  (60)
Based on the TD error definition (17), we can write $e_{Q,i} = \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1}) - \eta_{Q,i}$. Then, (59) is bounded as
$\left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2 = \left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T - \eta_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2 \leq \left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2$.  (61)
Also, (60) is bounded as
$2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| e_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\| = 2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T - \eta_{Q,i}\phi(\hat{z}_Q^{L_Q-1})^T \right\| \leq 2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T \right\|$.  (62)
For the terms corresponding to all layers $l_Q = \overline{1 : L_Q - 1}$ in (58), we have
$tr\left\{ \phi(\hat{z}_Q^{l_Q-1})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\} \leq \left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|^2\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\|^2$.  (63)
Also, the term $tr\{ 2(\bar{W}_{Q,i}^{l_Q})^T((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}))\phi(\hat{z}_Q^{l_Q-1}) \}$ is bounded, by completing the square as in (60), as
$tr\left\{ 2(\bar{W}_{Q,i}^{l_Q})^T\left((\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q})\right)\phi(\hat{z}_Q^{l_Q-1}) \right\} \leq \left\| \bar{W}_{Q,i}^{l_Q} \right\|^2 + 2\left\| \bar{W}_{Q,i}^{l_Q} \right\|\left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\| + \left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|^2\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\|^2 - \left\| \bar{W}_{Q,i}^{l_Q} \right\|^2 - \left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|^2\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\|^2 = 2\left\| \bar{W}_{Q,i}^{l_Q} \right\|\left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\|$.  (64)
With $\Phi_i^{l_Q+1} = (\hat{W}_{Q,i}^{l_Q+2})^T\Phi_i^{l_Q+2}\odot\dot{\phi}(\hat{z}_Q^{l_Q+1})$, based on (20), we get, for all NN layers $l_Q+1, l_Q+2, \ldots, L_Q$,
$(\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) = (\hat{W}_{Q,i}^{l_Q+1})^T\left((\hat{W}_{Q,i}^{l_Q+2})^T\Phi_i^{l_Q+2}\odot\dot{\phi}(\hat{z}_Q^{l_Q+1})\right)\odot\dot{\phi}(\hat{z}_Q^{l_Q}) = (\hat{W}_{Q,i}^{l_Q+1})^T\left((\hat{W}_{Q,i}^{l_Q+2})^T\cdots\left((\hat{W}_{Q,i}^{L_Q})^T e_{Q,i}\odot\dot{\phi}(\hat{z}_Q^{L_Q-1})\right)\cdots\right)\odot\dot{\phi}(\hat{z}_Q^{l_Q})$.  (65)
Based on the norm property of the Hadamard product, $\|A\odot B\| \leq \|A\|\cdot\|B\|$, with $A$ and $B$ matrices of the same size, (65) is bounded as
$\left\| (\hat{W}_{Q,i}^{l_Q+1})^T\Phi_i^{l_Q+1}\odot\dot{\phi}(\hat{z}_Q^{l_Q}) \right\| \leq \left\| \hat{W}_{Q,i}^{l_Q+1} \right\|\cdot\left\| \hat{W}_{Q,i}^{l_Q+2} \right\|\cdots\left\| \hat{W}_{Q,i}^{L_Q} \right\|\cdot\left\| e_{Q,i} \right\|\cdot\left\| \dot{\phi}(\hat{z}_Q^{L_Q-1}) \right\|\cdots\left\| \dot{\phi}(\hat{z}_Q^{l_Q}) \right\| = \left\| \hat{W}_{Q,i}^{l_Q+1} \right\|\cdots\left\| \hat{W}_{Q,i}^{L_Q} \right\|\cdot\left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1}) - \eta_{Q,i} \right\|\cdot\left\| \dot{\phi}(\hat{z}_Q^{L_Q-1}) \right\|\cdots\left\| \dot{\phi}(\hat{z}_Q^{l_Q}) \right\| \leq \left\| \hat{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1}) \right\|\prod_{l=l_Q}^{L_Q-1}\left\| \hat{W}_{Q,i}^{l+1} \right\|\left\| \dot{\phi}(\hat{z}_Q^{l}) \right\|$.  (66)
Therefore, based on (61)–(64), the inequality (58) becomes
$\alpha_Q\left( \left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T \right\|^2 + \sum_{l_Q=1}^{L_Q-1}\left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|^2\left\| \hat{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1}) \right\|^2\prod_{l=l_Q}^{L_Q-1}\left\| \hat{W}_{Q,i}^{l+1} \right\|^2\left\| \dot{\phi}(\hat{z}_Q^{l}) \right\|^2 \right) < 2\left\| \bar{W}_{Q,i}^{L_Q} \right\|\left\| \bar{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1})\phi(\hat{z}_Q^{L_Q-1})^T \right\| + 2\sum_{l_Q=1}^{L_Q-1}\left\| \bar{W}_{Q,i}^{l_Q} \right\|\left\| \phi(\hat{z}_Q^{l_Q-1}) \right\|\left\| \hat{W}_{Q,i}^{L_Q}\phi(\hat{z}_Q^{L_Q-1}) \right\|\prod_{l=l_Q}^{L_Q-1}\left\| \hat{W}_{Q,i}^{l+1} \right\|\left\| \dot{\phi}(\hat{z}_Q^{l}) \right\|$.  (67)
Let the following norm bounds be defined:
$\left\| \bar{W}_{Q,i}^{l_Q} \right\| \leq \bar{W}_{Q,max}, \quad \left\| \hat{W}_{Q,i}^{l_Q} \right\| \leq \hat{W}_{Q,max}, \quad \left\| \dot{\phi}(\hat{z}_Q^{l_Q}) \right\| \leq \dot{\phi}_{Q,max}, \quad \text{for all } l_Q = \overline{1 : L_Q}$.
Then, based on Assumption 1, the inequality (67) can be written as
$\alpha_Q\left( \bar{W}_{Q,max}^2\phi_{Q,max}^4 + \sum_{l_Q=1}^{L_Q-1}\hat{W}_{Q,max}^2\phi_{Q,max}^4\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}^2\dot{\phi}_{Q,max}^2 \right) < 2\bar{W}_{Q,max}^2\phi_{Q,max}^2 + 2\sum_{l_Q=1}^{L_Q-1}\bar{W}_{Q,max}\phi_{Q,max}^2\hat{W}_{Q,max}\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}\dot{\phi}_{Q,max}$.  (68)
To guarantee that (68) holds, and hence that $\Delta\Gamma_Q$ is negative, the learning rate needs to be selected as follows:
$\alpha_Q < \dfrac{2\left( \bar{W}_{Q,max}^2\phi_{Q,max}^2 + \sum_{l_Q=1}^{L_Q-1}\bar{W}_{Q,max}\phi_{Q,max}^2\hat{W}_{Q,max}\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}\dot{\phi}_{Q,max} \right)}{\bar{W}_{Q,max}^2\phi_{Q,max}^4 + \sum_{l_Q=1}^{L_Q-1}\hat{W}_{Q,max}^2\phi_{Q,max}^4\prod_{l=l_Q}^{L_Q-1}\hat{W}_{Q,max}^2\dot{\phi}_{Q,max}^2} = \alpha_{Q,max}$.  (69)
(b) According to Lemma 3, $\Delta\Gamma_{C,i}^{L_C} = -tr\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \} + \alpha_C tr\{ \Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \}$ and, according to Lemma 4, $\Delta\Gamma_{C,i}^{l_C} = -tr\{ 2(\bar{W}_{C,i}^{l_C})^T((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}))\phi(\hat{z}_C^{l_C-1}) \} + \alpha_C tr\{ \phi(\hat{z}_C^{l_C-1})^T((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}))^T((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}))\phi(\hat{z}_C^{l_C-1}) \}$ for all layers $l_C = \overline{1 : L_C - 1}$.
The sum $\Delta\Gamma_C = \Delta\Gamma_{C,i}^{1} + \cdots + \Delta\Gamma_{C,i}^{L_C}$ is lower than 0 if
$\alpha_C\left( tr\left\{ \Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} + \sum_{l_C=1}^{L_C-1} tr\left\{ \phi(\hat{z}_C^{l_C-1})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\} \right) < tr\left\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} + \sum_{l_C=1}^{L_C-1} tr\left\{ 2(\bar{W}_{C,i}^{l_C})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\}$.  (70)
For the terms corresponding to layer $L_C$ in (70), we have
$tr\left\{ \Omega_i\phi(\hat{z}_C^{L_C-1})^T\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} \leq \left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\|^2$.  (71)
Also, $tr\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \}$ is bounded, by completing the square as in (60), as
$tr\left\{ 2\bar{W}_{C,i}^{L_C}\phi(\hat{z}_C^{L_C-1})\Omega_i^T \right\} \leq \left\| \bar{W}_{C,i}^{L_C} \right\|^2 + 2\left\| \bar{W}_{C,i}^{L_C} \right\|\left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\| + \left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\|^2 - \left\| \bar{W}_{C,i}^{L_C} \right\|^2 - \left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\|^2 = 2\left\| \bar{W}_{C,i}^{L_C} \right\|\left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\|$.  (72)
For the terms corresponding to all layers $l_C = \overline{1 : L_C - 1}$ in (70), we have
$tr\left\{ \phi(\hat{z}_C^{l_C-1})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\} \leq \left\| \phi(\hat{z}_C^{l_C-1}) \right\|^2\left\| (\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}) \right\|^2$.  (73)
Also, the term $tr\{ 2(\bar{W}_{C,i}^{l_C})^T((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}))\phi(\hat{z}_C^{l_C-1}) \}$ is bounded as
$tr\left\{ 2(\bar{W}_{C,i}^{l_C})^T\left((\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})\right)\phi(\hat{z}_C^{l_C-1}) \right\} \leq 2\left\| \bar{W}_{C,i}^{l_C} \right\|\left\| \phi(\hat{z}_C^{l_C-1}) \right\|\left\| (\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}) \right\|$.  (74)
Having $\chi_i^{l_C+1} = (\hat{W}_{C,i}^{l_C+2})^T\chi_i^{l_C+2}\odot\dot{\phi}(\hat{z}_C^{l_C+1})$, the term $(\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C})$ can be expanded similarly to (65), as
$(\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}) = (\hat{W}_{C,i}^{l_C+1})^T\left((\hat{W}_{C,i}^{l_C+2})^T\cdots\left((\hat{W}_{C,i}^{L_C})^T\Omega_i\odot\dot{\phi}(\hat{z}_C^{L_C-1})\right)\cdots\right)\odot\dot{\phi}(\hat{z}_C^{l_C})$.  (75)
Based on the norm property of the Hadamard product, one gets
$\left\| (\hat{W}_{C,i}^{l_C+1})^T\chi_i^{l_C+1}\odot\dot{\phi}(\hat{z}_C^{l_C}) \right\| \leq \left\| \Omega_i \right\|\cdot\left\| \hat{W}_{C,i}^{l_C+1} \right\|\cdot\left\| \hat{W}_{C,i}^{l_C+2} \right\|\cdots\left\| \hat{W}_{C,i}^{L_C} \right\|\cdot\left\| \dot{\phi}(\hat{z}_C^{L_C-1}) \right\|\cdots\left\| \dot{\phi}(\hat{z}_C^{l_C}) \right\| = \left\| \Omega_i \right\|\prod_{l=l_C}^{L_C-1}\left\| \hat{W}_{C,i}^{l+1} \right\|\left\| \dot{\phi}(\hat{z}_C^{l}) \right\|$.  (76)
Therefore, based on (71)–(74), the inequality (70) becomes
$\alpha_C\left( \left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\|^2 + \sum_{l_C=1}^{L_C-1}\left\| \phi(\hat{z}_C^{l_C-1}) \right\|^2\left\| \Omega_i \right\|^2\prod_{l=l_C}^{L_C-1}\left\| \hat{W}_{C,i}^{l+1} \right\|^2\left\| \dot{\phi}(\hat{z}_C^{l}) \right\|^2 \right) < 2\left\| \bar{W}_{C,i}^{L_C} \right\|\left\| \Omega_i\phi(\hat{z}_C^{L_C-1})^T \right\| + 2\sum_{l_C=1}^{L_C-1}\left\| \bar{W}_{C,i}^{l_C} \right\|\left\| \phi(\hat{z}_C^{l_C-1}) \right\|\left\| \Omega_i \right\|\prod_{l=l_C}^{L_C-1}\left\| \hat{W}_{C,i}^{l+1} \right\|\left\| \dot{\phi}(\hat{z}_C^{l}) \right\|$.  (77)
Let the following norm bounds be defined:
$\left\| \bar{W}_{C,i}^{l_C} \right\| \leq \bar{W}_{C,max}, \quad \left\| \hat{W}_{C,i}^{l_C} \right\| \leq \hat{W}_{C,max}, \quad \left\| \dot{\phi}(\hat{z}_C^{l_C}) \right\| \leq \dot{\phi}_{C,max}, \quad \left\| \Omega_i \right\| \leq \Omega_{max}, \quad \text{for all } l_C = \overline{1 : L_C}$.
Based on Assumption 1, the inequality (77) can be written as
$\alpha_C\left( \Omega_{max}^2\phi_{max}^2 + \sum_{l_C=1}^{L_C-1}\phi_{max}^2\Omega_{max}^2\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}^2\dot{\phi}_{C,max}^2 \right) < 2\bar{W}_{C,max}\Omega_{max}\phi_{max} + 2\sum_{l_C=1}^{L_C-1}\bar{W}_{C,max}\phi_{max}\Omega_{max}\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}\dot{\phi}_{C,max}$.  (78)
To guarantee that (78) holds, and hence that $\Delta\Gamma_C$ is negative, the learning rate needs to be selected as follows:
$\alpha_C < \dfrac{2\left( \bar{W}_{C,max}\Omega_{max}\phi_{max} + \sum_{l_C=1}^{L_C-1}\bar{W}_{C,max}\phi_{max}\Omega_{max}\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}\dot{\phi}_{C,max} \right)}{\Omega_{max}^2\phi_{max}^2 + \sum_{l_C=1}^{L_C-1}\phi_{max}^2\Omega_{max}^2\prod_{l=l_C}^{L_C-1}\hat{W}_{C,max}^2\dot{\phi}_{C,max}^2} = \alpha_{C,max}$.  (79)
In conclusion, if the inequalities (69) and (79) are respected, we get $\Delta\Gamma < 0$.

4.4. Results Interpretation

According to (69) and (79), as the number of hidden layers increases, the upper bounds on the learning rates $\alpha_Q$ and $\alpha_C$ decrease. This is because the denominators in (69) and (79) grow faster with the number of layers than their respective numerators, primarily due to the squared terms they contain. Therefore, the number of hidden layers in both neural networks is inversely related to the magnitude of their respective learning rates. For illustrative purposes, the action value function learning rate bound $\alpha_{Q,max}$ is plotted against the number of hidden layers $L_Q = 1:15$ in Figure 1, based on (69). The norm bounds of the weights were selected as $\bar{W}_{Q,max} = \hat{W}_{Q,max} = 2$ and $\phi_{Q,max} = \dot{\phi}_{Q,max} = 1$ for the activation function $\phi(\cdot) = \tanh(\cdot)$.
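The trend can also be checked numerically with a short sketch of (69), using the same illustrative bounds as in Figure 1 ($\bar{W}_{Q,max} = \hat{W}_{Q,max} = 2$, $\phi_{Q,max} = \dot{\phi}_{Q,max} = 1$); the printed values are only indicative of the qualitative behavior, not a reproduction of the figure.

```python
def alpha_q_max(L_q, w_bar=2.0, w_hat=2.0, phi_max=1.0, dphi_max=1.0):
    """Learning rate upper bound alpha_Q,max of Eq. (69) versus the number of layers L_Q."""
    num = w_bar ** 2 * phi_max ** 2
    den = w_bar ** 2 * phi_max ** 4
    for l_q in range(1, L_q):                       # sum over layers l_Q = 1..L_Q-1
        prod = (w_hat * dphi_max) ** (L_q - l_q)    # product over l = l_Q..L_Q-1
        num += w_bar * phi_max ** 2 * w_hat * prod
        den += w_hat ** 2 * phi_max ** 4 * prod ** 2
    return 2.0 * num / den

for L in range(1, 16):
    print(L, round(alpha_q_max(L), 4))              # the bound shrinks as depth grows
```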
Remark 2.
This inversely proportional relationship between the number of NN hidden layers and the learning rate can be attributed to the increased complexity of the NN optimization surface as the number of hidden layers grows. A high learning rate in such a scenario can lead to erratic updates on this intricate optimization surface, potentially causing the divergence of the learning process. While a smaller learning rate increases the risk of getting stuck in local minima, it is beneficial for stable learning.

5. Simulation Study

Next, the impact of employing multiple hidden layers in the NN approximators, batch learning, and offline computation in the ADHDP learning process, namely the BOADHDP algorithm from Section 3.3, was tested on an output reference model (ORM) tracking task on the TRAS system. First, the system is described along with the data collection settings for BOADHDP. This is followed by a comparison between the BOADHDP learning process using single-hidden-layer NNs and the one using two-hidden-layer NNs for approximating the action value function and the controller. Finally, the online adaptive ADHDP algorithms from [21,22] are compared with BOADHDP, highlighting the advantages of the latter.

5.1. Data Collection Settings on TRAS System

The nonlinear system was characterized as a two-input and two-output system. The horizontal motion, or azimuth, operates as an integrator, whereas the vertical, or pitch, motion experiences different gravitational effects when moving upward versus downward. There was also an interconnection between these two channels. In Figure 2, a system setup is shown. The model used was a simplified deterministic continuous-time state-space representation, consisting of two interconnected state-space subsystems:
$\dot{\omega}_h = \dfrac{sat(U_h) - M_h(\omega_h)}{2.7\cdot 10^{-5}}$,
$K_h = 0.216\,F_h(\omega_h)\cos\alpha_v - 0.058\,\Omega_h + 0.0178\,sat(U_v)\cos\alpha_v$,
$\Omega_h = \dfrac{K_h}{0.0238\cos^2\alpha_v + 3\cdot 10^{-3}}$,
$\dot{\alpha}_h = \Omega_h$,
$\dot{\omega}_v = \dfrac{sat(U_v) - M_v(\omega_v)}{1.63\cdot 10^{-4}}$,
$\dot{\Omega}_v = \dfrac{1}{0.03}\left( 0.2\,F_v(\omega_v) - 0.0127\,\Omega_v - 0.0935\sin\alpha_v - 9.28\cdot 10^{-6}\,\Omega_v\omega_v + 4.17\cdot 10^{-3}\,sat(U_h) - 0.05\cos\alpha_v - 0.021\,\Omega_h^2\sin\alpha_v\cos\alpha_v - 0.093\sin\alpha_v + 0.05 \right)$,
$\dot{\alpha}_v = \Omega_v$,  (80)
where $sat(\cdot)$ is the saturation function on the interval $[-1, 1]$. The horizontal azimuth control input is $U_h = u_1$ and the vertical pitch control input is $U_v = u_2$. The system output is represented by the azimuth angle $\alpha_h \in [-\pi, \pi]$ and the pitch angle $\alpha_v \in [-\pi/2, \pi/2]$. The nonlinear static characteristics were derived from experimental data through polynomial fitting, as in [29]:
$M_v(\omega_v) = 9.05\times 10^{-12}\,\omega_v^3 + 2.76\times 10^{-10}\,\omega_v^2 + 1.25\times 10^{-4}\,\omega_v + 1.66\times 10^{-4}$,  (81)
$F_v(\omega_v) = 1.8\times 10^{-18}\,\omega_v^5 - 7.8\times 10^{-16}\,\omega_v^4 + 4.1\times 10^{-11}\,\omega_v^3 + 2.7\times 10^{-8}\,\omega_v^2 + 3.5\times 10^{-4}\,\omega_v - 0.014$,  (82)
$M_h(\omega_h) = 5.95\times 10^{-13}\,\omega_h^3 - 5.05\times 10^{-10}\,\omega_h^2 + 1.02\times 10^{-4}\,\omega_h + 1.61\times 10^{-3}$,  (83)
$F_h(\omega_h) = 2.56\times 10^{-20}\,\omega_h^5 + 4.09\times 10^{-17}\,\omega_h^4 + 3.16\times 10^{-12}\,\omega_h^3 - 7.34\times 10^{-9}\,\omega_h^2 + 2.12\times 10^{-5}\,\omega_h + 9.13\times 10^{-3}$.  (84)
The process was discretized using a zero-order hold on both inputs and outputs. With a sampling time of $T_s = 0.1$ s, the following discrete-time model was obtained,
$x_{k+1} = f(x_k, u_k), \quad y_k = g(x_k) = [\alpha_{k,h}, \alpha_{k,v}]^T$,  (85)
where the system state is $x_k = [\omega_{k,h}, \Omega_{k,h}, \alpha_{k,h}, \omega_{k,v}, \Omega_{k,v}, \alpha_{k,v}]^T \in \mathbb{R}^6$ and the control input is $u_k = [u_{k,h}, u_{k,v}]^T$, as in [29].
In the ORM tracking paradigm, the controlled system outputs track the outputs of the ORM. In this application, the ORM was defined as in [29] and has the form
$x_{k+1,mh} = 0.9673\,x_{k,mh} + 0.0328\,r_{k,h}, \quad x_{k+1,mv} = 0.9673\,x_{k,mv} + 0.0328\,r_{k,v}, \quad y_{k,m} = [y_{k,mh}, y_{k,mv}]^T = [x_{k,mh}, x_{k,mv}]^T$,  (86)
where $r_{k,h}$ and $r_{k,v}$ are step reference input signals. Therefore, an extended state comprising both the TRAS and the ORM states was defined as $x_k^e = [\omega_{k,h}, \Omega_{k,h}, \alpha_{k,h}, \omega_{k,v}, \Omega_{k,v}, \alpha_{k,v}, x_{k,mh}, x_{k,mv}, r_{k,h}, r_{k,v}]^T \in \mathbb{R}^{10}$.
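For reference, a minimal sketch of the ORM recursion (86) and of the extended state assembly is given below; the variable names are illustrative.

```python
import numpy as np

def orm_step(x_mh, x_mv, r_h, r_v):
    """One step of the output reference model (86): a first-order lag per channel."""
    return 0.9673 * x_mh + 0.0328 * r_h, 0.9673 * x_mv + 0.0328 * r_v

def extended_state(x_tras, x_mh, x_mv, r_h, r_v):
    """Extended state x_k^e in R^10: 6 TRAS states, 2 ORM states, 2 references."""
    return np.concatenate([x_tras, [x_mh, x_mv, r_h, r_v]])
```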
For data collection, the linear diagonal controller
$C(z, \theta) = \begin{bmatrix} P_{11}(z)/(1 - z^{-1}) & 0 \\ 0 & P_{22}(z)/(1 - z^{-1}) \end{bmatrix}$,  (87)
$P_{11}(z) = 2.9341 - 5.8689\,z^{-1} + 3.9303\,z^{-2} - 0.9173\,z^{-3} - 0.0777\,z^{-4}$,  (88)
$P_{22}(z) = 0.6228 - 1.1540\,z^{-1} + 0.5467\,z^{-2}$  (89)
was used in a closed loop with system (85), where the controller parameters were tuned using VRFT as in [29]. With the closed loop stabilized, successive step reference input signals with amplitudes ranging in the intervals $r_{k,h} \in [-2, 2]$ and $r_{k,v} \in [-1.4, 1.1]$ were generated every 17 s and every 25 s for the azimuth and pitch, respectively. To guarantee a satisfactory exploration of the system's state-space domain, random noise was added every two timesteps. The random noise added to $C_{11}(z)$ had an amplitude within $[-1.6, 1.6]$ and the one added to $C_{22}(z)$ had an amplitude within $[-1.7, 1.7]$. A total of $M = 50{,}000$ transitions were collected, creating the dataset $D_{50{,}000} = \{(x_k^e, u_k, r(x_k^e, u_k), x_{k+1}^e)\}$, with $k = 1:50{,}000$. An excerpt of the data exploration is shown in Figure 3. Next, BOADHDP was run with action value function and controller NN approximators for both the single-hidden-layer case ($L_Q = 1$, $L_C = 1$) and the multilayer case ($L_Q = 2$, $L_C = 2$).
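A schematic version of this data collection loop is sketched below; `tras_orm_step` (one discretized step of (85) and (86)), `vrft_controller`, and `reference_generator` are assumed interfaces standing in for the setup described above, and the stored penalty matches the form used in Section 5.2.

```python
import numpy as np

def collect_transitions(tras_orm_step, vrft_controller, reference_generator,
                        M=50_000, noise_amp=(1.6, 1.7)):
    """Collect D_M = {(x^e_k, u_k, r(x^e_k, u_k), x^e_{k+1})} under the stabilizing
    linear controller, with exploration noise injected every two timesteps."""
    dataset, x_e = [], np.zeros(10)
    for k in range(M):
        u = vrft_controller(x_e)
        if k % 2 == 0:                                        # exploration noise
            u = u + np.random.uniform(-1.0, 1.0, size=2) * np.array(noise_amp)
        # penalty r(x^e_k, u_k) = (alpha_h - x_mh)^2 + (alpha_v - x_mv)^2
        penalty = (x_e[2] - x_e[6]) ** 2 + (x_e[5] - x_e[7]) ** 2
        x_e_next = tras_orm_step(x_e, u, reference_generator(k))
        dataset.append((x_e.copy(), u, penalty, x_e_next.copy()))
        x_e = x_e_next
    return dataset
```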

5.2. Comparison of BOADHDP with Single-Layer and Multilayer NN Approximations

For the single-layer NNs, the architecture of the action value function NN was 12-50-1 and that of the controller NN was 10-10-2. The activation functions of the hidden layer were hyperbolic tangents and those of the output layer were linear. The weights were initialized using the Xavier initialization [29]. The numbers of internal gradient updates were $I_Q = 500$ and $I_C = 100$, and the learning rates were selected as $\alpha_Q = 0.01$ and $\alpha_C = 0.001$. The penalty function took the form $r(x_k^e, u_k) = (\alpha_{k,h} - x_{k,mh})^2 + (\alpha_{k,v} - x_{k,mv})^2$. The algorithm ran for a total of 500 iterations. The performance of the NN controller was tested on a simulated scenario in which the tracking capabilities were evaluated on a random reference signal generated from $[-1, 1]$ for 2000 timesteps. Therefore, at each BOADHDP iteration $j$, the performance of the controller was measured on the simulated scenario by the function $J(x_k^e) = \sum_{k=1}^{2000}\left[(\alpha_{k,h} - x_{k,mh})^2 + (\alpha_{k,v} - x_{k,mv})^2\right]/2000$. The convergence of the action value function and the values of $J(x_k^e)$ are shown in Figure 4 in orange. The convergence was assessed by checking the norm of the difference between the weights from successive BOADHDP iterations, namely $\|\hat{W}_Q^{j} - \hat{W}_Q^{j-1}\|_2^2$. The decreasing behavior of the successive weight norms in the first plot of Figure 4 indicates the convergence of the action value function. The second plot presents the performance index $J(x_k^e)$ under the simulated scenario for the controller obtained at each iteration $j$, namely $C_j(x_k^e, W_C)$. The tracking performance of the controller obtained at iteration $j = 500$ is shown in Figure 5, where the performance of the TRAS system (85) in a closed loop with the controller $C_{500}(x_k^e, W_C)$ is presented. The time evolution of the horizontal and vertical axis outputs is plotted in blue, along with the reference signal (yellow) and the reference model output (orange), showing the tracking capacity of the $C_{500}(x_k^e, W_C)$ controller.
For the multilayer NN setup, the architecture of the action value function NN was 15-50-10-2 and that of the controller was 10-10-4-2. The activation functions of the two hidden layers were hyperbolic tangents and those of the output layer were linear. The weights were initialized using the Xavier initialization [29]. The numbers of internal gradient updates were $I_Q = 500$ and $I_C = 100$, and the learning rates took the values $\alpha_Q = 0.01$ and $\alpha_C = 0.001$. The algorithm ran for a total of 500 iterations. The convergence of the action value function and the values of $J(x_k^e)$ are shown in Figure 4 in blue. The tracking performance of the controller obtained at iteration $j = 500$ is shown in Figure 6.
From Figure 4, it can be seen that the two-hidden-layer NN approximators for the action value function and the controller delivered more stable convergence. First, the norm of the successive action value function weight differences in the first plot is less noisy and converges faster in the two-layer case than with the single-layer NN. Second, in the second plot, the function $J(x_k^e)$ converges faster to a lower value, which corresponds to a better-performing controller. The final value of $J(x_k^e)$ was 0.0049 for the single-layer NNs and 0.0031 for the two-layer implementation, with the two-layer implementation outperforming the single-layer one by 1.58%. The difference in tracking performance can be seen in Figure 5 and Figure 6, where the horizontal motion tracking is improved in the case of the two-layer NN controller.

5.3. Comparison Between BOADHDP and the Online Adaptive ADHDP

Next, the online adaptive ADHDP algorithms from [21,22] were applied to the TRAS system. The difference between the two ADHDP methods [21,22] is that the former only updates the weights from the hidden layer to the output layer, while the latter updates all of the NN weights.
For these algorithms, we used the same NN architectures as in the single-layer BOADHDP case, namely the action value function NN was 12-50-1 and the controller NN was 10-10-2. The activation functions of the hidden layer were hyperbolic tangents and those of the output layer were linear. The weights were also initialized using the Xavier initialization [29]. The learning rates were selected as $\alpha_Q = 0.01$ and $\alpha_C = 0.001$. The penalty function took the form $r(x_k^e, u_k) = (\alpha_{k,h} - x_{k,mh})^2 + (\alpha_{k,v} - x_{k,mv})^2$.
Compared with BOADHDP, in these implementations the adaptation of the NNs was made online, using only the transition collected at each time step of the simulated system. The algorithms ran for 200,000 time steps. Every 2000 steps, the controller weights were frozen and their performance was measured under a simulated scenario by the function $J(x_k^e) = \sum_{k=1}^{2000}\left[(\alpha_{k,h} - x_{k,mh})^2 + (\alpha_{k,v} - x_{k,mv})^2\right]/2000$. The convergence of the action value function and of the controller performance on the simulated scenario can be seen in Figure 7 for the ADHDP algorithms from [21,22]. The tracking performance of the ADHDP algorithms from [21,22] on the TRAS system using the aforementioned learning settings is presented in Figure 8. The value of $J(x_k^e)$ was 0.0236 for the ADHDP algorithm from [21] and 0.0258 for the ADHDP algorithm from [22].
The $J(x_k^e)$ values of the BOADHDP and ADHDP algorithms from [21,22] are summarized in Table 1. Also, from Figure 5 and Figure 8, it can be observed that the online adaptive ADHDP algorithms could not deliver the same performance as their batch offline counterpart, BOADHDP. Furthermore, the ADHDP algorithms presented in [21,22] failed to reach a comparable controller performance, even though they utilized four times as many collected transitions from the system. This difference in performance in favor of the BOADHDP algorithm stems, in part, from the batch nature of the learning process. By processing multiple collected transitions from the state-action space at the same time during the NN update, the gradient for the action value and controller NNs is averaged over all transitions, which in turn makes the NN update more stable. By performing the gradient update in an offline manner, the same collected transitions are reused at each iteration, increasing the convergence speed. This is in accordance with the observations from [28], where the authors proved the advantages of batch learning in comparison with the online adaptive single-transition learning of the classical ADHDP methods. Also, from this case study, it can be seen that the number of transitions required for learning was higher in the online adaptive case than in the batch offline case.

6. Discussion and Conclusions

In this paper, we study the theoretical stability of BOADHDP with deep neural networks as function approximators for the action value function and the controller. To this end, we introduce a stability criterion for the iteratively updated action value function and controller NNs. The theory uses the Lyapunov stability approach and shows that the weight estimation errors are UUB provided that certain inequality constraints on the learning rate magnitudes are satisfied. This research extends previous results from the literature, such as [21,22], both theoretically and practically.
  • First, our Lyapunov stability analysis is extended to address NN approximators for action value functions and controllers with multiple hidden layers. Although NNs with a single hidden layer are universal approximators, their use in highly nonlinear applications is hindered by their limited generalization capabilities. In contrast, multilayer NNs can learn complex features effectively, reducing overfitting and generalization issues. The results outlined in Theorem 1 also indicate an inverse proportionality between the number of NN hidden layers and the admissible magnitude of the learning rate, providing a practical heuristic for ADP applications of multilayer NNs.
  • Second, our theoretical Lyapunov stability analysis addresses the batch offline learning of the action value function and controller NNs. Although successful ADP applications have been reported using adaptive update methods, their practical use is often constrained by the significant number of iterations required for convergence. The adoption of batch learning has thus become standard practice, which calls for a corresponding theoretical Lyapunov stability analysis.
  • Finally, from a practical point of view, we validate the advantage of using BOADHDP with multilayer NNs through a case study on a twin rotor aerodynamical system (TRAS). This study compares BOADHDP using neural networks with one and with two hidden layers as function approximators. The results show that the convergence of the normed action value function weight differences is smoother with two-hidden-layer networks, which also leads to a controller with enhanced performance on the control benchmark (0.0049 for the single-layer NNs versus 0.0031 for the two-layer implementation, i.e., an improvement by a factor of approximately 1.58). This demonstrates the superior capability of multilayer networks in managing complex, nonlinear control systems. BOADHDP is also compared with the ADHDP algorithms from [21,22], which obtain 0.0236 and 0.0258, respectively, on the control benchmark while requiring four times more collected transitions from the TRAS system. This demonstrates both the data efficiency of BOADHDP with respect to the number of collected transitions and the performance benefit of batch offline learning, confirming the results from [28].
Our findings highlight the advantages of BOADHDP with deep neural networks in practical applications, underscoring the improved stability and performance in control tasks. Future research may explore extending this batched multilayer approach to adaptive learning scenarios. From a practical point of view, ADP applications that combine deep neural networks with batch learning might benefit from this analysis.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the author on request.

Acknowledgments

I would like to thank Ioan Silea for reading this manuscript and for providing constructive feedback that improved the quality of this research.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Werbos, P.J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA, USA, 1974.
  2. Bellman, R.E. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957.
  3. Werbos, P.J. Approximate dynamic programming for real time control and neural modeling. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches; White, D.A., Sofge, D.A., Eds.; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493–525.
  4. Al-Tamimi, A.; Lewis, F.L. Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans. Syst. Man Cybern. B 2008, 38, 943–949.
  5. Prokhorov, D.V.; Wunsch, D.C. Adaptive critic designs. IEEE Trans. Neural Netw. 1997, 8, 997–1007.
  6. Liu, X.; Balakrishnan, S.N. Convergence analysis of adaptive critic based optimal control. In Proceedings of the American Control Conference, Chicago, IL, USA, 28–30 June 2000.
  7. White, D.A.; Sofge, D.A. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches; Van Nostrand Reinhold: New York, NY, USA, 1992.
  8. Padhi, R.; Unnikrishnan, N.; Wang, X.; Balakrishnan, S.N. A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Netw. 2006, 19, 1648–1660.
  9. Balakrishnan, S.N.; Biega, V. Adaptive-critic-based neural networks for aircraft optimal control. J. Guid. Control Dyn. 1996, 19, 893–898.
  10. Dierks, T.; Jagannathan, S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1118–1129.
  11. Venayagamoorthy, G.K.; Harley, R.G.; Wunsch, D.C. Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Trans. Neural Netw. 2002, 13, 764–773.
  12. Ferrari, S.; Stengel, R.F. Online adaptive critic flight control. J. Guid. Control Dyn. 2004, 27, 777–786.
  13. Ding, J.; Jagannathan, S. An online nonlinear optimal controller synthesis for aircraft with model uncertainties. In Proceedings of the AIAA Guidance, Navigation and Control Conference, Toronto, ON, Canada, 2–5 August 2010.
  14. Vrabie, D.; Pastravanu, O.; Lewis, F.; Abu-Khalaf, M. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484.
  15. Dierks, T.; Jagannathan, S. Optimal control of affine nonlinear continuous-time systems. In Proceedings of the American Control Conference, Baltimore, MD, USA, 30 June–2 July 2010.
  16. Vamvoudakis, K.; Lewis, F. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010, 46, 878–888.
  17. Enns, R.; Si, J. Helicopter trimming and tracking control using direct neural dynamic programming. IEEE Trans. Neural Netw. 2003, 14, 929–939.
  18. Liu, D.; Javaherian, H.; Kovalenko, O.; Huang, T. Adaptive critic learning techniques for engine torque and air–fuel ratio control. IEEE Trans. Syst. Man Cybern. B 2008, 38, 988–993.
  19. Ruelens, F.; Claessens, B.J.; Quaiyum, S.; De Schutter, B.; Babuška, R.; Belmans, R. Reinforcement learning applied to an electric water heater: From theory to practice. IEEE Trans. Smart Grid 2018, 9, 3792–3800.
  20. He, P.; Jagannathan, S. Reinforcement learning-based output feedback control of nonlinear systems with input constraints. IEEE Trans. Syst. Man Cybern. B 2005, 35, 150–154.
  21. Liu, F.; Sun, J.; Si, J.; Guo, W.; Mei, S. A boundness result for the direct heuristic dynamic programming. Neural Netw. 2012, 32, 229–235.
  22. Sokolov, Y.; Kozma, R.; Werbos, L.D.; Werbos, P.J. Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 2015, 59, 9–18.
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  24. Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
  25. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018.
  26. Riedmiller, M. Neural Fitted Q Iteration–First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Proceedings of the European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005.
  27. Radac, M.-B.; Lala, T. Learning output reference model tracking for higher-order nonlinear systems with unknown dynamics. Algorithms 2019, 12, 121.
  28. Watkins, C. Learning from Delayed Rewards. Ph.D. Thesis, Department of Computational Science, University of Cambridge, Cambridge, UK, 1989.
  29. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010.
Figure 1. Relation between the number of NN layers and the bound of the learning rate $\alpha_{Q,max}$.
Figure 2. TRAS system setup [29].
Figure 3. Data collection in relation to the TRAS system: $r_{k,h}$ and $r_{k,v}$ (yellow); $x_{k,mh}$ and $x_{k,mv}$ (red); $\alpha_{k,h}$ and $\alpha_{k,v}$ (blue).
Figure 4. BOADHDP convergence in the TRAS system.
Figure 5. One-hidden-layer controller learned through BOADHDP, at iteration $j = 500$: $r_{k,h}$ and $r_{k,v}$ (yellow); $x_{k,mh}$ and $x_{k,mv}$ (red); $\alpha_{k,h}$ and $\alpha_{k,v}$ (blue). The commands $u_{k,h}$ and $u_{k,v}$ are for the horizontal and vertical axes (blue).
Figure 6. Two-hidden-layer controller learned through BOADHDP, at iteration $j = 500$: $r_{k,h}$ and $r_{k,v}$ (yellow); $x_{k,mh}$ and $x_{k,mv}$ (red); $\alpha_{k,h}$ and $\alpha_{k,v}$ (blue). The commands $u_{k,h}$ and $u_{k,v}$ are for the horizontal and vertical axes (blue).
Figure 7. ADHDP convergence in relation to the TRAS system: ADHDP algorithm from [21] in purple and ADHDP algorithm from [22] in green.
Figure 8. Tracking performance of the ADHDP algorithms from [21,22], at iteration $j = 150{,}000$: $r_{k,h}$ and $r_{k,v}$ (yellow); $x_{k,mh}$ and $x_{k,mv}$ (red); $\alpha_{k,h}$ and $\alpha_{k,v}$ (green for the ADHDP algorithm from [21], purple for the ADHDP algorithm from [22]). The commands $u_{k,h}$ and $u_{k,v}$ are for the horizontal and vertical axes (same color coding).
Table 1. Comparison between the BOADHDP (single- and multiple-hidden-layer NN approximations) and the ADHDP algorithms from [21,22].

Algorithm | $J(x_k^e)$
BOADHDP with NN approximation having a single hidden layer | 0.0049
BOADHDP with NN approximation having two hidden layers | 0.0031
ADHDP from [21] | 0.0236
ADHDP from [22] | 0.0258