Article

The Adaptive Dynamic Programming Toolbox

School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
* Author to whom correspondence should be addressed.
Sensors 2021, 21(16), 5609; https://doi.org/10.3390/s21165609
Submission received: 16 July 2021 / Revised: 13 August 2021 / Accepted: 16 August 2021 / Published: 20 August 2021
(This article belongs to the Collection Robotics, Sensors and Industry 4.0)

Abstract

The paper develops the adaptive dynamic programming toolbox (ADPT), a MATLAB-based software package that computationally solves optimal control problems for continuous-time control-affine systems. The ADPT produces approximate optimal feedback controls by employing the adaptive dynamic programming technique to solve the Hamilton–Jacobi–Bellman equation approximately. A novel implementation method is derived to optimize the memory consumption of the ADPT throughout its execution. The ADPT supports two working modes: a model-based mode and a model-free mode. In the former, the ADPT computes optimal feedback controls given the system dynamics. In the latter, optimal feedback controls are generated from measurements of system trajectories, without requiring knowledge of the system model. Multiple setting options are provided so that various customized circumstances can be accommodated. Compared to other popular software toolboxes for optimal control, the ADPT features computational precision and time efficiency, which is illustrated with its application to a highly non-linear satellite attitude control problem.

1. Introduction

Optimal control is an important branch of control engineering. For continuous-time dynamical systems, finding an optimal feedback control involves solving the so-called Hamilton–Jacobi–Bellman (HJB) equation [1]. For linear systems, the HJB equation simplifies to the well-known Riccati equation, which results in the linear quadratic regulator [2]. For non-linear systems, solving the HJB equation is generally a formidable task due to its inherently non-linear nature. As a result, a great deal of research has been devoted to approximately solving the HJB equation. Al'brekht proposed a power series method for smooth systems to solve the HJB equation [3]: under the assumption that the optimal control and the optimal cost function can be represented in Taylor series, the Taylor expansions of the optimal control and the optimal cost function can be recursively obtained by plugging the series expansions of the dynamics, the cost integrand, the optimal control and the optimal cost function into the HJB equation and collecting terms degree by degree. Similar ideas can be found in [4,5]. A recursive algorithm that starts with an admissible control and sequentially improves the control law until it converges to the optimal one is developed in [6]. This recursive algorithm is commonly referred to as policy iteration (PI) and can also be found in [7,8,9]. The common limitation of these methods is that complete knowledge of the system is required.
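As a concrete illustration of the linear case mentioned above, the Riccati-based solution is a one-line computation in MATLAB; the following minimal sketch uses assumed example matrices rather than any system from this paper:

% For linear dynamics xdot = A*x + B*u with cost int(x'Qx + u'Ru)dt,
% the HJB equation reduces to the algebraic Riccati equation, solved by lqr.
A = [0 1; -2 -3];  B = [0; 1];
Q = eye(2);        R = 1;
[K, P] = lqr(A, B, Q, R);  % P solves A'P + PA - P*B*(R^-1)*B'*P + Q = 0
% optimal feedback u*(x) = -K*x with K = (R^-1)*B'*P; optimal cost V*(x) = x'Px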
In the past few decades, reinforcement learning (RL) [10] has provided a means to design optimal controllers in an adaptive manner from the viewpoint of learning. Adaptive or approximate dynamic programming (ADP), an iterative RL-based adaptive optimal control design method, has been proposed in [11,12,13,14,15]. An approach that employs ADP is proposed in [11] for linear systems without requiring a priori knowledge of the system matrices. An ADP strategy is presented for non-linear systems with partially unknown dynamics in [12], and the requirement of model knowledge is fully removed in [13,14,15].
Together with the growth of optimal control theory and methods, several software tools for optimal control have been developed. Notable examples are the Nonlinear Systems Toolbox [16], the Control Toolbox [17], ACADO [18], its successor ACADOS [19], and GPOPS-II [20]. A common feature of these packages is that they rely on explicit system equations. In addition, the optimal controls generated by [17,18,19,20] are open-loop: an optimal control is computed for each initial state, so if the initial state changes, the optimal control must be computed again. In contrast, the Nonlinear Systems Toolbox [16] produces an optimal feedback control by solving the HJB equation.
The primary objective of this paper is to develop a MATLAB-based toolbox that computationally solves optimal feedback control problems for control-affine systems in the continuous-time domain. More specifically, employing the adaptive dynamic programming technique, we derive a computational methodology to compute approximate optimal feedback controls, based on which we develop the adaptive dynamic programming toolbox (ADPT). In the derivation, the Kronecker product used in [11,14] is replaced by the Euclidean inner product to save memory during execution of the ADPT. The ADPT supports two working modes: the model-based mode and the model-free mode. Knowledge of the system equations is required in the model-based mode. In the model-free mode, the ADPT produces the approximate optimal feedback control from measurements of system trajectories, removing the requirement of knowing the system equations. Moreover, multiple options are provided, so that the user can use the toolbox with great flexibility.
The remainder of the paper is organized as follows. Section 2 reviews the standard optimal control problem for a class of continuous-time non-linear systems and the model-free adaptive dynamic programming technique. Section 3 provides implementation details and software features of the ADPT. In Section 4, the ADPT is applied to a satellite attitude control problem in both the model-based mode and the model-free mode. Conclusions and potential future directions are given in Section 5. The codes of the ADPT are available at https://github.com/Everglow0214/The_Adaptive_Dynamic_Programming_Toolbox, accessed on 10 August 2021.

2. Review of Adaptive Dynamic Programming

We review the adaptive dynamic programming (ADP) technique for solving optimal control problems [13,14]. Consider a continuous-time control-affine system given by

$$\dot{x} = f(x) + g(x)u, \tag{1}$$

where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^m$ is the control, and $f: \mathbb{R}^n \to \mathbb{R}^n$ and $g: \mathbb{R}^n \to \mathbb{R}^{n \times m}$ are locally Lipschitz mappings with $f(0) = 0$. It is assumed that (1) is stabilizable at $x = 0$ in the sense that the system can be locally asymptotically stabilized by a continuous feedback control. To quantify the performance of a control, an integral cost associated with (1) is given by

$$J(x_0, u) = \int_0^\infty \left( q(x(t)) + u(t)^T R\, u(t) \right) dt, \tag{2}$$

where $x_0 = x(0)$ is the initial state, $q: \mathbb{R}^n \to \mathbb{R}_{\geq 0}$ is a positive definite function and $R \in \mathbb{R}^{m \times m}$ is a symmetric, positive definite matrix. A feedback control $u: \mathbb{R}^n \to \mathbb{R}^m$ is said to be admissible if it stabilizes (1) at the origin and makes the cost $J(x_0, u)$ finite for all $x_0$ in a neighborhood of $x = 0$.
The objective is to find a control policy $u$ that minimizes $J(x_0, u)$ for a given $x_0$. Define the optimal cost function $V^*: \mathbb{R}^n \to \mathbb{R}$ by

$$V^*(x) = \min_u J(x, u)$$

for $x \in \mathbb{R}^n$. Then, $V^*$ satisfies the HJB equation

$$0 = \min_u \left\{ \nabla V^*(x)^T \left( f(x) + g(x)u \right) + q(x) + u^T R u \right\},$$

and the minimizer in the HJB equation is the optimal control, which is expressed in terms of $V^*$ as

$$u^*(x) = -\frac{1}{2} R^{-1} g(x)^T \nabla V^*(x).$$

Moreover, the state feedback $u^*$ locally asymptotically stabilizes (1) at the origin and minimizes (2) over all admissible controls [2]. Solving the HJB equation analytically is extremely difficult in general, except in linear cases. Hence, approximate or iterative methods are needed to solve the HJB equation, and the well-known policy iteration (PI) technique [6] is reviewed in Algorithm 1. Let $\{V_i(x)\}_{i \geq 0}$ and $\{u_{i+1}(x)\}_{i \geq 0}$ be the sequences of functions generated by PI in Algorithm 1. It is shown in [6] that $V_{i+1}(x) \leq V_i(x)$ for $i \geq 0$, and that the limit functions $V_\infty(x) = \lim_{i \to \infty} V_i(x)$ and $u_\infty(x) = \lim_{i \to \infty} u_i(x)$ are equal to the optimal cost function $V^*$ and the optimal control $u^*$, respectively.
Algorithm 1. Policy iteration
Input: An initial admissible control $u_0(x)$ and a threshold $\epsilon > 0$.
Output: The approximate optimal control $u_{i+1}(x)$ and the approximate optimal cost function $V_i(x)$.
1: Set $i \leftarrow 0$.
2: while $i \geq 0$ do
3:   Policy evaluation: solve for the continuously differentiable cost function $V_i(x)$ with $V_i(0) = 0$ using
$$\nabla V_i(x)^T \left( f(x) + g(x) u_i(x) \right) + q(x) + u_i(x)^T R\, u_i(x) = 0. \tag{3}$$
4:   Policy improvement: update the control policy by
$$u_{i+1}(x) = -\frac{1}{2} R^{-1} g(x)^T \nabla V_i(x). \tag{4}$$
5:   if $\|u_{i+1}(x) - u_i(x)\| \leq \epsilon$ for all $x$ then
6:     break
7:   end if
8:   Set $i \leftarrow i + 1$.
9: end while
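For intuition, Algorithm 1 can be made concrete in the linear-quadratic case, where the policy evaluation step (3) reduces to a Lyapunov equation and the iteration is known as Kleinman's algorithm. The following is a minimal MATLAB sketch with assumed example matrices, not taken from the toolbox:

% Policy iteration for xdot = A*x + B*u with cost int(x'Qx + u'Ru)dt.
A = [0 1; -2 -3];  B = [0; 1];   % A is Hurwitz, so u_0 = 0 is admissible
Q = eye(2);        R = 1;
K = zeros(1, 2);                 % current policy u_i(x) = K*x
for i = 1:50
    Acl  = A + B*K;                      % closed-loop system matrix
    P    = lyap(Acl', Q + K'*R*K);       % policy evaluation, cf. (3)
    Knew = -R \ (B'*P);                  % policy improvement, cf. (4)
    if norm(Knew - K) <= 1e-10, break; end
    K = Knew;
end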
As proposed in [13,14], consider approximating the solutions to (3) and (4) by ADP instead of obtaining them exactly. For this purpose, choose an admissible feedback control $u_0: \mathbb{R}^n \to \mathbb{R}^m$ for (1) and let $\{V_i(x)\}_{i \geq 0}$ and $\{u_{i+1}(x)\}_{i \geq 0}$ be the sequences of functions generated by PI in Algorithm 1 starting with the control $u_0(x)$. Following [13,14], choose a bounded time-varying exploration signal $\eta: \mathbb{R} \to \mathbb{R}^m$, and apply the sum $u_0(x) + \eta(t)$ to (1) as follows:

$$\dot{x} = f(x) + g(x)\left( u_0(x) + \eta(t) \right). \tag{5}$$

Assume that solutions to (5) are well defined for all positive time. Let $\mathcal{T}(x, u_0, \eta, [r, s]) = \{(x(t), u_0(x(t)), \eta(t)) \mid r \leq t \leq s\}$ denote the trajectory $x(t)$ of the system (5) with the input $u_0 + \eta$ over the time interval $[r, s]$ with $0 \leq r < s$. The system (5) can be rewritten as

$$\dot{x} = f(x) + g(x) u_i(x) + g(x)\nu_i(x, t), \tag{6}$$

where

$$\nu_i(x, t) = u_0(x) - u_i(x) + \eta(t).$$

Combined with (3) and (4), the time derivative of $V_i(x)$ along the trajectory $x(t)$ of (6) is obtained as

$$\dot{V}_i(x) = -q(x) - u_i(x)^T R\, u_i(x) - 2 u_{i+1}(x)^T R\, \nu_i(x, t) \tag{7}$$

for $i \geq 0$. By integrating both sides of (7) over any time interval $[r, s]$ with $0 \leq r < s$, one gets

$$V_i(x(s)) - V_i(x(r)) = -\int_r^s \left( q(x) + u_i(x)^T R\, u_i(x) + 2 u_{i+1}(x)^T R\, \nu_i(x, \tau) \right) d\tau. \tag{8}$$
Let $\phi_j: \mathbb{R}^n \to \mathbb{R}$ and $\varphi_j: \mathbb{R}^n \to \mathbb{R}^m$, $j = 1, 2, \ldots$, be two infinite sequences of continuous basis functions, vanishing at the origin, on a compact set in $\mathbb{R}^n$ containing the origin as an interior point [13,14]. Then, $V_i(x)$ and $u_{i+1}(x)$ for each $i \geq 0$ can be expressed as infinite series of the basis functions. For each $i \geq 0$, let $\hat{V}_i(x)$ and $\hat{u}_{i+1}(x)$ be approximations of $V_i(x)$ and $u_{i+1}(x)$ given by

$$\hat{V}_i(x) = \sum_{j=1}^{N_1} c_{i,j}\, \phi_j(x), \tag{9}$$

$$\hat{u}_{i+1}(x) = \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x), \tag{10}$$

where $N_1 > 0$ and $N_2 > 0$ are integers and $c_{i,j}, w_{i,j} \in \mathbb{R}$ are coefficients to be found for each $i \geq 0$. Then, Equation (8) is approximated by $\hat{V}_i(x)$ and $\hat{u}_{i+1}(x)$ as follows:

$$\sum_{j=1}^{N_1} c_{i,j} \left( \phi_j(x(s)) - \phi_j(x(r)) \right) + \int_r^s \Big( 2 \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x)^T R\, \hat{\nu}_i \Big) d\tau = -\int_r^s \left( q(x) + \hat{u}_i(x)^T R\, \hat{u}_i(x) \right) d\tau, \tag{11}$$

where

$$\hat{u}_0 = u_0, \qquad \hat{\nu}_i = u_0 - \hat{u}_i + \eta. \tag{12}$$
Suppose that we have $K$ trajectories $\mathcal{T}(x, u_0, \eta, [r_k, s_k])$ available, $k = 1, \ldots, K$, where $x(t)$, $u_0(t)$ and $\eta(t)$ satisfy (6) over the $K$ time intervals $[r_k, s_k]$, $k = 1, \ldots, K$. Then, we have $K$ equations of the form (11) for each $i \geq 0$, which can be written as

$$e_{i,k} = 0, \qquad k = 1, \ldots, K, \tag{13}$$

where

$$e_{i,k} := \sum_{j=1}^{N_1} c_{i,j} \left( \phi_j(x(s_k)) - \phi_j(x(r_k)) \right) + \int_{r_k}^{s_k} \Big( 2 \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x)^T R\, \hat{\nu}_i \Big) d\tau + \int_{r_k}^{s_k} \left( q(x) + \hat{u}_i(x)^T R\, \hat{u}_i(x) \right) d\tau.$$

Then, the coefficients $\{c_{i,j}\}_{j=1}^{N_1}$ and $\{w_{i,j}\}_{j=1}^{N_2}$ are obtained by minimizing

$$\sum_{k=1}^K e_{i,k}^2.$$

In other words, the $K$ equations in (13) are solved in the least squares sense for the coefficients $\{c_{i,j}\}_{j=1}^{N_1}$ and $\{w_{i,j}\}_{j=1}^{N_2}$. Thus, two sequences $\{\hat{V}_i(x)\}_{i=0}^{\infty}$ and $\{\hat{u}_{i+1}(x)\}_{i=0}^{\infty}$ can be generated from (11). According to ([14], Cor. 3.2.4), for any $\epsilon > 0$, there exist integers $i^* > 0$, $N_1^{**} > 0$ and $N_2^{**} > 0$ such that

$$\Big\| \sum_{j=1}^{N_1} c_{i^*,j}\, \phi_j(x) - V^*(x) \Big\| \leq \epsilon, \qquad \Big\| \sum_{j=1}^{N_2} w_{i^*,j}\, \varphi_j(x) - u^*(x) \Big\| \leq \epsilon$$

for all $x$ in a neighborhood of the origin, provided $N_1 > N_1^{**}$ and $N_2 > N_2^{**}$.
Remark 1.
The ADP algorithm relies only on measurements of the states, the initial control policy and the exploration signal, lifting the requirement of knowing the precise system model, whereas the conventional policy iteration in Algorithm 1 requires knowledge of the exact system model. Hence, the ADP algorithm is entirely data-based and model-free.
Remark 2.
Equation (11) depends on the initial control $u_0$, the exploration signal $\eta$ and the time interval $[r, s]$, as well as on the index $i$, where the first three, $u_0$, $\eta$ and $[r, s]$, are together equivalent to the trajectory $\mathcal{T}(x, u_0, \eta, [r, s])$ if the initial state $x(r)$ at $t = r$ is given. Hence, we can generate more diverse trajectories by changing $\eta$ and $[r, s]$ as well as the initial state, and enrich the ADP algorithm accordingly, as follows. Suppose that we have $K$ trajectories $\mathcal{T}(x_k, u_0, \eta_k, [r_k, s_k])$, $1 \leq k \leq K$, available, where $x_k$, $u_0$ and $\eta_k$ satisfy (6), i.e.,

$$\dot{x}_k(t) = f(x_k(t)) + g(x_k(t)) \left( u_0(x_k(t)) + \eta_k(t) \right)$$

for $r_k \leq t \leq s_k$. Then, we have $K$ equations of the form (11) for each $i \geq 0$, which can be written as $e_{i,k} = 0$, $k = 1, \ldots, K$, where

$$e_{i,k} := \sum_{j=1}^{N_1} c_{i,j} \left( \phi_j(x_k(s_k)) - \phi_j(x_k(r_k)) \right) + \int_{r_k}^{s_k} \Big( 2 \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x_k)^T R\, \hat{\nu}_i^k \Big) d\tau + \int_{r_k}^{s_k} \left( q(x_k) + \hat{u}_i(x_k)^T R\, \hat{u}_i(x_k) \right) d\tau$$

with $\hat{u}_0 = u_0$ and $\hat{\nu}_i^k = u_0 + \eta_k - \hat{u}_i$. Then, the coefficients $\{c_{i,j}\}_{j=1}^{N_1}$ and $\{w_{i,j}\}_{j=1}^{N_2}$ are obtained by minimizing $\sum_{k=1}^K e_{i,k}^2$. For simplicity of presentation, however, in this paper we fix $\eta$ and the initial states and vary only the time intervals to generate trajectory data.

3. Implementation Details and Software Features

We now discuss implementation details and features of the adaptive dynamic programming toolbox (ADPT). The ADPT provides two modes for generating approximate optimal feedback controls: one mode requires knowledge of the system model, while the other eliminates this requirement, giving rise to the ADPT's unique capability of handling model-free cases.

3.1. Implementation of Computational Adaptive Dynamic Programming

To approximate $V_i(x)$ and $u_{i+1}(x)$ in (3) and (4), monomials in the state variables are selected as basis functions. For a pre-fixed number $d \geq 1$, define a column vector $\Phi_d(x)$ by ordering monomials in graded reverse lexicographic order [21] as

$$\Phi_d(x) = (x_1, \ldots, x_n, x_1^2, x_1 x_2, \ldots, x_n^2, \ldots, x_n^d) \in \mathbb{R}^{N \times 1},$$

where $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ is the state, $d \geq 1$ is the highest degree of the monomials, and $N$ is given by

$$N = \sum_{i=1}^{d} \binom{i+n-1}{n-1}.$$

For example, if $n = 3$ and $d = 3$, the corresponding ordered monomials are

$$x_1, x_2, x_3;\ x_1^2, x_1 x_2, x_1 x_3, x_2^2, x_2 x_3, x_3^2;\ x_1^3, x_1^2 x_2, x_1^2 x_3, x_1 x_2^2, x_1 x_2 x_3, x_1 x_3^2, x_2^3, x_2^2 x_3, x_2 x_3^2, x_3^3.$$
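The count $N$ is easy to check numerically; a one-line MATLAB sketch (not part of the toolbox code) for the example above:

n = 3; d = 3;
N = sum(arrayfun(@(i) nchoosek(i + n - 1, n - 1), 1:d));  % N = 3 + 6 + 10 = 19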
According to (9) and (10), the cost function $V_i(x)$ and the control $u_{i+1}(x)$ are approximated by $\hat{V}_i(x)$ and $\hat{u}_{i+1}(x)$, which are defined as

$$\hat{V}_i(x) = c_i\, \Phi_{d+1}(x), \tag{14}$$

$$\hat{u}_{i+1}(x) = W_i\, \Phi_d(x), \tag{15}$$

where $d \geq 1$ is the approximation degree, and $c_i \in \mathbb{R}^{1 \times N_1}$ and $W_i \in \mathbb{R}^{m \times N_2}$ are composed of the coefficients corresponding to the monomials in $\Phi_{d+1}(x)$ and $\Phi_d(x)$, with

$$N_1 = \sum_{i=1}^{d+1} \binom{i+n-1}{n-1}, \qquad N_2 = \sum_{i=1}^{d} \binom{i+n-1}{n-1}.$$
We take the highest degree of the monomials approximating $V_i$ to be one greater than the approximation degree, since $u_{i+1}$ is obtained by taking the gradient of $V_i$ in (4) and $g(x)$ is constant in most cases.
Theorem 1.
Let a set of trajectories be defined as $S_T = \{\mathcal{T}(x, u_0, \eta, [r_k, s_k]),\ k = 1, 2, \ldots, K\}$ with $K \geq 1$, and let

$$\alpha(x) = R\, \eta\, \Phi_d(x)^T, \qquad \beta(x) = R\, (u_0(x) + \eta)\, \Phi_d(x)^T, \qquad \gamma(x) = \Phi_d(x)\, \Phi_d(x)^T.$$

Then the coefficients $c_i$ and $W_i$ satisfy

$$A_i \begin{bmatrix} c_i^T \\ \mathrm{vec}(W_i) \end{bmatrix} = b_i, \tag{16}$$

where

$$A_0 = \begin{bmatrix} \Phi_{d+1}^{[r_1, s_1]}(x) & 2\, \mathrm{vec}\Big( \int_{r_1}^{s_1} \alpha(x)\, dt \Big)^T \\ \vdots & \vdots \\ \Phi_{d+1}^{[r_K, s_K]}(x) & 2\, \mathrm{vec}\Big( \int_{r_K}^{s_K} \alpha(x)\, dt \Big)^T \end{bmatrix} \in \mathbb{R}^{K \times (N_1 + m N_2)}, \qquad b_0 = \begin{bmatrix} -\int_{r_1}^{s_1} \left( q(x) + u_0(x)^T R\, u_0(x) \right) dt \\ \vdots \\ -\int_{r_K}^{s_K} \left( q(x) + u_0(x)^T R\, u_0(x) \right) dt \end{bmatrix} \in \mathbb{R}^{K \times 1},$$

and for $i = 1, 2, \ldots$,

$$A_i = \begin{bmatrix} \Phi_{d+1}^{[r_1, s_1]}(x) & 2\, \mathrm{vec}\Big( \int_{r_1}^{s_1} \left( \beta(x) - R\, W_{i-1} \gamma(x) \right) dt \Big)^T \\ \vdots & \vdots \\ \Phi_{d+1}^{[r_K, s_K]}(x) & 2\, \mathrm{vec}\Big( \int_{r_K}^{s_K} \left( \beta(x) - R\, W_{i-1} \gamma(x) \right) dt \Big)^T \end{bmatrix} \in \mathbb{R}^{K \times (N_1 + m N_2)}, \qquad b_i = \begin{bmatrix} -\int_{r_1}^{s_1} q(x)\, dt - \Big\langle W_{i-1}^T R\, W_{i-1},\ \int_{r_1}^{s_1} \gamma(x)\, dt \Big\rangle \\ \vdots \\ -\int_{r_K}^{s_K} q(x)\, dt - \Big\langle W_{i-1}^T R\, W_{i-1},\ \int_{r_K}^{s_K} \gamma(x)\, dt \Big\rangle \end{bmatrix} \in \mathbb{R}^{K \times 1},$$

where

$$\Phi_{d+1}^{[r_k, s_k]}(x) = \Phi_{d+1}(x(s_k))^T - \Phi_{d+1}(x(r_k))^T$$

for $k = 1, 2, \ldots, K$, the operator $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product, $\langle E, F \rangle = \sum_{ij} E_{ij} F_{ij}$ for matrices $E = [E_{ij}]$ and $F = [F_{ij}]$ of equal size, and the operator $\mathrm{vec}(\cdot)$ is defined as

$$\mathrm{vec}(Z) = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} \in \mathbb{R}^{mn \times 1}$$

with $z_j \in \mathbb{R}^{m \times 1}$ being the $j$th column of a matrix $Z \in \mathbb{R}^{m \times n}$ for $j = 1, \ldots, n$.
Proof. 
Combining (11), (14) and (15), one has

$$c_0 \left( \Phi_{d+1}(x(s_k)) - \Phi_{d+1}(x(r_k)) \right) + 2 \int_{r_k}^{s_k} \Phi_d(x)^T W_0^T R\, \eta\, dt = -\int_{r_k}^{s_k} \left( q(x) + u_0(x)^T R\, u_0(x) \right) dt, \tag{17}$$

and for $i = 1, 2, \ldots$,

$$c_i \left( \Phi_{d+1}(x(s_k)) - \Phi_{d+1}(x(r_k)) \right) + 2 \int_{r_k}^{s_k} \Phi_d(x)^T W_i^T R\, (u_0(x) + \eta)\, dt - 2 \int_{r_k}^{s_k} \Phi_d(x)^T W_i^T R\, W_{i-1} \Phi_d(x)\, dt = -\int_{r_k}^{s_k} \left( q(x) + \Phi_d(x)^T W_{i-1}^T R\, W_{i-1} \Phi_d(x) \right) dt. \tag{18}$$

By applying the property

$$\langle A, BC \rangle = \langle A C^T, B \rangle = \langle B^T A, C \rangle$$

of the Euclidean inner product, one may rewrite (17) and (18) as

$$c_0 \left( \Phi_{d+1}(x(s_k)) - \Phi_{d+1}(x(r_k)) \right) + 2 \Big\langle W_0,\ \int_{r_k}^{s_k} R\, \eta\, \Phi_d(x)^T dt \Big\rangle = -\int_{r_k}^{s_k} \left( q(x) + u_0(x)^T R\, u_0(x) \right) dt, \tag{19}$$

and for $i = 1, 2, \ldots$,

$$c_i \left( \Phi_{d+1}(x(s_k)) - \Phi_{d+1}(x(r_k)) \right) + 2 \Big\langle W_i,\ \int_{r_k}^{s_k} R\, (u_0(x) + \eta)\, \Phi_d(x)^T dt \Big\rangle - 2 \Big\langle W_i,\ R\, W_{i-1} \int_{r_k}^{s_k} \Phi_d(x) \Phi_d(x)^T dt \Big\rangle = -\Big\langle W_{i-1}^T R\, W_{i-1},\ \int_{r_k}^{s_k} \Phi_d(x) \Phi_d(x)^T dt \Big\rangle - \int_{r_k}^{s_k} q(x)\, dt. \tag{20}$$

Then, the system of linear equations in (16) readily follows from (19) and (20).    □
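The identity underlying this step, $\langle E, F \rangle = \mathrm{vec}(E)^T \mathrm{vec}(F)$ with the column-stacking vec defined above, can be sanity-checked numerically; a minimal sketch with random matrices (MATLAB's Z(:) realizes this vec):

m = 3; N2 = 5;
W = randn(m, N2);  M = randn(m, N2);
lhs = sum(sum(W .* M));   % Euclidean inner product <W, M>
rhs = W(:)' * M(:);       % vec(W)' * vec(M)
assert(abs(lhs - rhs) < 1e-12);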
We now present the computational adaptive dynamic programming algorithm, Algorithm 2, for practical implementation. To solve the least squares problem in line 5 of the algorithm, a sufficiently large number $K$ of trajectories is needed, so that the minimization problem is numerically well conditioned. The algorithm then generates the approximate optimal feedback control $\hat{u}_{i+1} = W_i \Phi_d(x)$.
Algorithm 2. Computational adaptive dynamic programming
Input: An approximation degree $d \geq 1$, an initial admissible control $u_0(x)$, an exploration signal $\eta(t)$, and a threshold $\epsilon > 0$.
Output: The approximate optimal control $\hat{u}_{i+1}(x)$ and the approximate optimal cost function $\hat{V}_i(x)$.
  1:  Apply $u = u_0 + \eta$ as the input over a sufficiently long period and collect the necessary data.
  2:  Set $i \leftarrow 0$.
  3:  while $i \geq 0$ do
  4:  Generate $A_i$ and $b_i$.
  5:  Obtain $c_i$ and $W_i$ by solving the minimization problem
$$\min_{c_i, W_i} \left\| A_i \begin{bmatrix} c_i^T \\ \mathrm{vec}(W_i) \end{bmatrix} - b_i \right\|^2.$$
  6:  if $i \geq 1$ and $\|c_i - c_{i-1}\|^2 + \|W_i - W_{i-1}\|^2 \leq \epsilon^2$ then
  7:        break
  8:  end if
  9:  Set $i \leftarrow i + 1$.
10:  end while
11:  return $\hat{u}_{i+1}(x) = W_i \Phi_d(x)$ and $\hat{V}_i(x) = c_i \Phi_{d+1}(x)$
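In practice, line 5 of Algorithm 2 amounts to one least squares solve followed by unstacking; a minimal MATLAB sketch, assuming Ai, bi, N1, N2 and m are already available:

z  = Ai \ bi;                      % minimizes ||Ai*[ci'; vec(Wi)] - bi||^2
ci = z(1:N1).';                    % 1 x N1 coefficient row of V_hat_i
Wi = reshape(z(N1+1:end), m, N2);  % m x N2 coefficient matrix of u_hat_{i+1}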
Remark 3.
As stated in Theorem 1, several integral terms are included in $A_i$ and $b_i$ for $i \geq 0$. By (12), $u_0$ is not approximated by the basis functions, so the matrices $A_0$ and $b_0$ in Theorem 1 are obtained from $x(r_k)$, $x(s_k)$, $\int_{r_k}^{s_k} q(x)\, dt$, $\int_{r_k}^{s_k} u_0(x)^T R\, u_0(x)\, dt$ and $\int_{r_k}^{s_k} \alpha(x)\, dt$, $1 \leq k \leq K$. For $i \geq 1$, the matrices $A_i$ and $b_i$ in Theorem 1 need, in addition, $\int_{r_k}^{s_k} \beta(x)\, dt$ and $\int_{r_k}^{s_k} \gamma(x)\, dt$, $1 \leq k \leq K$, as well as $W_{i-1}$.
Remark 4.
In Theorem 1, the Kronecker product used in [11,14] for practical implementation is replaced by the Euclidean inner product. Notice that $\int_{r_k}^{s_k} \gamma(x)\, dt \in \mathbb{R}^{N_2 \times N_2}$ is symmetric for $k = 1, \ldots, K$, so only the upper triangular elements of these matrices need to be stored, whereas with the Kronecker product all of their elements must be saved. As a result, Theorem 1 occupies less memory in the processor, especially when the number of basis functions representing the approximate optimal control is large.
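The memory saving can be realized by packing each symmetric matrix into its upper triangle; a minimal sketch, where Gamma stands for one of the matrices $\int_{r_k}^{s_k} \gamma(x)\, dt$ (hypothetical variable names, not ADPT internals):

mask = triu(true(size(Gamma)));
Gs   = Gamma(mask);            % store only N2*(N2+1)/2 entries
G    = zeros(size(Gamma));
G(mask) = Gs;                  % unpack the upper triangle
G    = G + triu(G, 1).';       % restore the full symmetric matrix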
Remark 5.
When the system dynamic equations are known, the ADPT uses the Runge–Kutta method to simultaneously compute the trajectory points $x(r_k)$ and $x(s_k)$ and the integral terms that appear in $A_i$ and $b_i$. When the system equations are unknown but trajectory data are available, the ADPT applies the trapezoidal method to evaluate these integrals numerically. In this case, each trajectory $\mathcal{T}(x, u_0, \eta, [r_k, s_k])$ is represented by a set of its sample points $\{x(t_{k,\ell}), u_0(t_{k,\ell}), \eta(t_{k,\ell})\}_{\ell=1}^{L_k}$, where $\{t_{k,\ell}\}_{\ell=1}^{L_k}$ is a finite sequence satisfying $r_k = t_{k,1} < t_{k,2} < \cdots < t_{k,L_k-1} < t_{k,L_k} = s_k$, and the trapezoidal method is applied to these sample points to numerically evaluate the integrals over the time interval $[r_k, s_k]$. If intermediate points in the interval $[r_k, s_k]$ are not available, so that partitioning $[r_k, s_k]$ is impossible, then the two end points $r_k$ and $s_k$ are used to evaluate the integral by the trapezoidal method as

$$\int_{r_k}^{s_k} h(t)\, dt \approx \frac{(s_k - r_k)\left( h(s_k) + h(r_k) \right)}{2} \tag{21}$$

for a function $h(t)$.
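In the model-free case, each such integral is therefore a single call to the trapezoidal rule over the recorded samples; a minimal sketch, where tk and hk are assumed samples of $t$ and of an integrand $h$ on $[r_k, s_k]$:

% tk: sample times r_k = t_{k,1} < ... < t_{k,L_k} = s_k; hk: h evaluated at tk
Ih  = trapz(tk, hk);                           % composite trapezoidal rule
Ih2 = (tk(end) - tk(1))*(hk(end) + hk(1))/2;   % two-end-point rule (21)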

3.2. Software Features

The codes of the ADPT are available at https://github.com/Everglow0214/The_Adaptive_Dynamic_Programming_Toolbox, accessed on 10 August 2021.

3.2.1. Symbolic Expressions

It is of great importance for an optimal control package that the user can describe functions, such as system equations and cost functions, in a convenient manner. The ADPT uses symbolic expressions for this purpose. Consider an optimal control problem where the system model is of the form (1) with

$$f(x) = \begin{bmatrix} x_2 \\ \dfrac{-k_1 x_1 - k_2 x_1^3 - k_3 x_2}{k_4} \end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0 \\ \dfrac{1}{k_4} \end{bmatrix}, \tag{22}$$

where $x = (x_1, x_2) \in \mathbb{R}^2$ is the state, $u \in \mathbb{R}$ is the control, and $k_1, k_2, k_3, k_4 \in \mathbb{R}$ are system parameters. The cost function is of the form (2) with

$$q(x) = 5 x_1^2 + 3 x_2^2, \qquad R = 2. \tag{23}$$

Then, in the ADPT, the system dynamics and the cost function can be defined as in lines 1–17 of Listing A1 in the Appendix A.

3.2.2. Working Modes

Two working modes are provided in the ADPT: the model-based mode and the model-free mode. The model-based mode deals with the situation where the system model is given, while the model-free mode addresses the situation where the system model is unknown but trajectory data are available. An example of the model-based mode is given in Listing A1, where after defining the system model (22), the cost function (23) and the approximation degree $d$ in lines 1–20, the function adpModelBased returns the coefficients $W_i$ and $c_i$ of the control $\hat{u}_{i+1}$ and the cost function $\hat{V}_i$, respectively, in line 21.
An example of the model-free mode is shown in Listing A2 in the Appendix A, where the system model (22) is assumed to be unknown. The initial control $u_0$ is of the form $u_0(x) = -Fx$ with the feedback gain $F$ defined in line 18. The exploration signal $\eta$ is composed of four sinusoidal signals, as shown in lines 21–22. A list of two initial states, $x(0) = (-3, 2)$ and $x(0) = (2.2, 3)$, is given in lines 28–29, and a list of the corresponding total time spans for simulation is given in lines 30–31, where the time interval $[0, 6]$ is divided into sub-intervals of size 0.002, so that trajectory data are recorded every 0.002 seconds in lines 36–41. The time stamps are saved in the column vector t_save in line 39, and the values of the states are saved in the matrix x_save in line 40, with each row of x_save corresponding to the same row of t_save. Similarly, the values of the initial control $u_0$ and the exploration signal $\eta$ are saved in the vectors u0_save and eta_save in lines 43–44. These measurements are passed to the function adpModelFree in lines 48–49 to compute the optimal control and the optimal cost function approximately.
In both the model-based and model-free modes, the approximate control is saved in the file uAdp.m, which is generated automatically and can be applied by calling u = uAdp(x) without dependence on other files. Similarly, the user may check the approximate cost function through the file VAdp.m.
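As an illustration, the generated file can be plugged directly into a simulation; the following is a minimal sketch, assuming uAdp.m was produced for the example system (22) with the parameter values of Listing A2:

% Closed-loop simulation with the automatically generated controller uAdp.m.
k1 = 3; k2 = 2; k3 = 2; k4 = 5;
fcl = @(t,x) [x(2);
              (-k1*x(1) - k2*x(1)^3 - k3*x(2) + uAdp(x))/k4];
[tt, xx] = ode45(fcl, [0 10], [-3; 2]);
plot(tt, xx); % the states should converge to the origin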

3.2.3. Options

Multiple options are provided so that the user may customize optimal control problems in a convenient way. We illustrate the usage of some of the options here and refer the reader to the user manual, available at https://github.com/Everglow0214/The_Adaptive_Dynamic_Programming_Toolbox (accessed on 10 August 2021), for the other options.
In the model-based mode, the user may set option values through the function adpSetModelBased in a name-value manner before calling adpModelBased; that is, the specified values are assigned to the named options. An example is shown in Listing A3 in the Appendix A, where two sets of initial states, time intervals and exploration signals are specified in lines 1–9. Then, in line 15, the output of adpSetModelBased is passed to adpModelBased for the options to take effect. Otherwise, the default option values are used, as in line 21 of Listing A1.
For the command adpModelFree, option values can be modified with the function adpSetModelFree in the same name-value manner. Among the options, 'stride' enables the user to record the values of the states, the initial control and the exploration signal at a high frequency over a long time, while using only a portion of them in the iteration process inside adpModelFree. To illustrate, let each trajectory in the set $S_T$ of trajectories in the statement of Theorem 1 be represented by two sample points at times $r_k$ and $s_k$; that is, the trapezoidal method evaluates the integrals over $[r_k, s_k]$ by taking values at $r_k$ and $s_k$ as in (21). Suppose that the trajectories in $S_T$ are consecutive, that is, $s_k = r_{k+1}$ for $k = 1, 2, \ldots, K-1$. By setting 'stride' to a positive integer $\delta$, the data used to generate $A_i$ and $b_i$ in Algorithm 2 become $\{\mathcal{T}(x, u_0, \eta, [r_{1+i\delta}, s_{(i+1)\delta}]),\ i \in \mathbb{N},\ (i+1)\delta \leq K\}$. For example, consider 3 consecutive trajectories $\mathcal{T}(x, u_0, \eta, [r_k, r_{k+1}])$ with $k = 1, 2, 3$. If 'stride' is set to 1, one will have three equations from (11) as follows:

$$\sum_{j=1}^{N_1} c_{i,j} \left( \phi_j(x(r_{k+1})) - \phi_j(x(r_k)) \right) + \int_{r_k}^{r_{k+1}} \Big( 2 \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x)^T R\, \hat{\nu}_i \Big) d\tau = -\int_{r_k}^{r_{k+1}} \left( q(x) + \hat{u}_i(x)^T R\, \hat{u}_i(x) \right) d\tau$$

for $k = 1, 2, 3$. These three equations contribute three rows to $A_i$ and three rows to $b_i$ as in Theorem 1. If 'stride' is set to 3, then one will have only one equation from (11) as follows:

$$\sum_{j=1}^{N_1} c_{i,j} \left( \phi_j(x(r_4)) - \phi_j(x(r_1)) \right) + \int_{r_1}^{r_4} \Big( 2 \sum_{j=1}^{N_2} w_{i,j}\, \varphi_j(x)^T R\, \hat{\nu}_i \Big) d\tau = -\int_{r_1}^{r_4} \left( q(x) + \hat{u}_i(x)^T R\, \hat{u}_i(x) \right) d\tau, \tag{24}$$

where the integrals over $[r_1, r_4]$ are evaluated by the trapezoidal method with the interval $[r_1, r_4]$ partitioned into the three sub-intervals $[r_1, r_2]$, $[r_2, r_3]$ and $[r_3, r_4]$, i.e., with the points at $r_1$, $r_2$, $r_3$ and $r_4$. Equation (24) contributes one row to $A_i$ and one row to $b_i$ as in Theorem 1. Provided that $A_i$ still has full rank with 'stride' set to 3, the number of equations in the minimization problem in Algorithm 2 is reduced by two thirds compared with 'stride' set to 1, and as a result, the computational load of the numerical minimization is reduced. It is remarked that with 'stride' equal to 3, all four points $r_1, \ldots, r_4$ are used by the trapezoidal method to evaluate the integrals over $[r_1, r_4]$ in (24), producing a more precise value of the integral than the one that would be obtained with only the two end points $r_1$ and $r_4$. An example of calling adpSetModelFree is shown in Listing A4 in the Appendix A; adpModelFree takes the output of adpSetModelFree as an argument for the specified options to take effect.
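The grouping performed by 'stride' can be pictured with a small sketch, not taken from the ADPT internals; t_save is the vector of sample times as in Listing A2, and hvals denotes assumed samples of an integrand:

% With 'stride' delta, every delta consecutive sample intervals form one
% row of A_i and b_i, with the composite trapezoidal rule applied inside.
delta = 3;
rows  = floor((numel(t_save) - 1)/delta);
I     = zeros(rows, 1);
for i = 0:rows-1
    idx    = (1 + i*delta):(1 + (i+1)*delta); % r_{1+i*delta}, ..., s_{(i+1)*delta}
    I(i+1) = trapz(t_save(idx), hvals(idx));  % one integral entry
end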

4. Applications to the Satellite Attitude Stabilizing Problem

In this section, we apply the ADPT to the satellite attitude stabilizing problem, since a stabilization problem can be formulated as an optimal control problem. In the first example, the system model is known and the controller is computed by the function adpModelBased. In the second example, the same problem is solved again by the function adpModelFree under the assumption that the system dynamics are unknown. The source codes for these two examples are available at https://github.com/Everglow0214/The_Adaptive_Dynamic_Programming_Toolbox (accessed on 10 August 2021), where more applications of the toolbox can be found.

4.1. Model-Based Case

Let $\mathbb{H}$ denote the set of quaternions and $S^3 = \{q \in \mathbb{H} \mid \|q\| = 1\}$. The equations of motion of the continuous-time, fully actuated satellite system are given by

$$\dot{q} = \frac{1}{2}\, q\, \Omega, \tag{25}$$

$$\dot{\Omega} = I^{-1}\left( (I\Omega) \times \Omega \right) + I^{-1} u, \tag{26}$$

where $q \in S^3$ represents the attitude of the satellite, $\Omega \in \mathbb{R}^3$ is the body angular velocity vector, $I \in \mathbb{R}^{3 \times 3}$ is the moment of inertia matrix and $u \in \mathbb{R}^3$ is the control input. The quaternion multiplication is carried out for $q\,\Omega$ on the right-hand side of (25), where $\Omega$ is treated as a pure quaternion. By the stable embedding technique [22], the system (25) and (26), defined on $S^3 \times \mathbb{R}^3$, is extended to the Euclidean space $\mathbb{H} \times \mathbb{R}^3$ [23,24] as

$$\dot{q} = \frac{1}{2}\, q\, \Omega - \alpha\left( |q|^2 - 1 \right) q, \tag{27}$$

$$\dot{\Omega} = I^{-1}\left( (I\Omega) \times \Omega \right) + I^{-1} u, \tag{28}$$

where $q \in \mathbb{H}$, $\Omega \in \mathbb{R}^3$ and $\alpha > 0$.
Consider the problem of stabilizing the system (27) and (28) at the equilibrium point $(q_e, \Omega_e) = ((1, 0, 0, 0), (0, 0, 0))$. The error dynamics are given by

$$\dot{e}_q = \frac{1}{2}(e_q + q_e)\, e_\Omega - \alpha\left( |e_q + q_e|^2 - 1 \right)(e_q + q_e), \qquad \dot{e}_\Omega = I^{-1}\left( (I e_\Omega) \times e_\Omega \right) + I^{-1} u,$$

where $e_q = q - q_e$ and $e_\Omega = \Omega - \Omega_e$ are the state errors. Since the problem of designing a stabilizing controller can be solved by designing an optimal controller, we pose an optimal control problem with the cost integral (2), taking $q(x) = x^T Q x$, where $x = (e_q, e_\Omega) \in \mathbb{R}^7$ and $Q = 2 I_{7 \times 7}$, and $R = I_{3 \times 3}$. The inertia matrix is set to $I = \mathrm{diag}(0.1029, 0.1263, 0.0292)$, and the parameter $\alpha$ that appears in the error dynamics above is set to $\alpha = 1$.
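For reference, the error dynamics can be posed symbolically for adpModelBased along the lines of Listing A1; the following is a minimal sketch under the parameter values above, with variable names (qw, Imat, qcost) chosen here for illustration rather than taken from the paper's scripts:

n = 7; m = 3;         % state x = (e_q, e_Omega), control u
syms x [n,1] real
syms u [m,1] real
syms t real
alpha = 1;
Imat  = diag([0.1029, 0.1263, 0.0292]); % moment of inertia matrix
qe = [1; 0; 0; 0];    % equilibrium attitude
q  = x(1:4) + qe;     % quaternion q = e_q + q_e
w  = x(5:7);          % angular velocity error e_Omega
% quaternion product q*w, with w treated as a pure quaternion
qw = [-q(2)*w(1) - q(3)*w(2) - q(4)*w(3);
       q(1)*w(1) + q(3)*w(3) - q(4)*w(2);
       q(1)*w(2) - q(2)*w(3) + q(4)*w(1);
       q(1)*w(3) + q(2)*w(2) - q(3)*w(1)];
f = [qw/2 - alpha*(q.'*q - 1)*q;   % embedded attitude error dynamics (27)
     Imat \ cross(Imat*w, w)];     % rigid-body error dynamics (28)
g = [zeros(4,3);
     inv(Imat)];
qcost = x.'*(2*eye(7))*x;          % q(x) = x'Qx with Q = 2*I_{7x7}
R = eye(3);
d = 3;                             % approximation degree
% [w_coef, c_coef] = adpModelBased(f, g, x, n, u, m, qcost, R, t, d);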
We set the option 'xInit' with three different initial states. For each initial state, the option 'tSpan' is set to $[0, 15]$. We use the option 'explSymb' to set the exploration signals; for the usage of this option, refer to the user manual available at https://github.com/Everglow0214/The_Adaptive_Dynamic_Programming_Toolbox (accessed on 10 August 2021). For the initial control $u_0$, the default initial control is used, which is an LQR controller computed for the linearization of the error dynamics around the origin with the weight matrices $Q = 2 I_{7 \times 7}$ and $R = I_{3 \times 3}$. We then call the function adpModelBased to generate controllers of degree $d = 1, 2, 3$. The computation time taken by adpModelBased to produce the controllers is recorded in Table 1. For the purpose of comparison, we also apply Al'brekht's method with the Nonlinear Systems Toolbox (NST) [16] to produce controllers of degree $d = 1, 2, 3$ for the same optimal control problem, and record their respective computation times in Table 1. For comparison in terms of optimality, we apply the controllers to the system (27) and (28) for the initial error state $x_0 = ((\cos(\theta/2) - 1, \sin(\theta/2), 0, 0), (0, 0, 0))$ with $\theta = 1.99999\pi$ and compute their corresponding values of the cost integral in Table 1. Since we do not know the exact optimal value of the cost integral $J(x_0, u)$ for this initial state, we employ the software package ACADO [18] to numerically produce the optimal control for this optimal control problem with the given initial state. We note that both NST and ACADO are model-based.
We can see in Table 1 that the ADPT in the model-based mode is superior to NST in terms of optimality, and that the model-based ADPT with $d = 2, 3$ is on par with ACADO in terms of optimality. Notice, however, that ACADO produces an open-loop optimal control for each given initial state, which is a drawback of ACADO, while the ADPT produces an optimal feedback control that is independent of the initial state. Moreover, even for the given initial state, ACADO takes a tremendous amount of time to compute the open-loop optimal controller. From these observations, we can say that the ADPT in the model-based mode is superior to NST and ACADO when optimality, speed and usefulness are all taken into account.

4.2. Model-Free Case

Consider solving the same optimal control problem as in Section 4.1, but now neither the system dynamics (25) and (26) nor, equivalently, the error dynamics are available. Since we do not have real trajectory data available, for the purpose of demonstration we generate trajectories from four initial states for the error dynamics, using the same initial control $u_0$ and exploration signals $\eta$ as in the model-based case in Section 4.1. The simulation for data collection is run over the time interval $[0, 20]$ with a recording period of 0.002 s, producing 10,000 = 20/0.002 sampled points per run. For the function adpModelFree, the option 'stride' is set to 4. Then, the function adpModelFree is called to generate controllers of degree $d = 1, 2, 3$; the computation time taken for each is recorded in Table 1. For the purpose of comparison in terms of optimality, we apply the controllers generated by adpModelFree to the system (27) and (28) with the initial error state $x_0 = ((\cos(\theta/2) - 1, \sin(\theta/2), 0, 0), (0, 0, 0))$, $\theta = 1.99999\pi$, and compute the corresponding values of the cost integral; see Table 1 for the values.
From Table 1, we can see that the ADPT in the model-free mode takes more computation time than in the model-based mode, and that the cost integrals obtained in the model-free mode are slightly higher than those in the model-based mode, since the integrals in the iteration process are evaluated less accurately. However, the ADPT in the model-free mode is superior to NST in terms of optimality and to ACADO in terms of computation time. More importantly, the result of the model-free ADPT is comparable to that of the model-based ADPT, which shows the power of data-based adaptive dynamic programming and of the ADP toolbox.
To see how the computed optimal controller performs in terms of stabilization, the norm of the state error under the degree-3 control generated by the ADPT in the model-free mode is plotted in Figure 1, together with the norm of the state error under the degree-3 NST controller. The convergence to the origin is faster with the model-free ADP controller than with the model-based NST controller, which is consistent with the comparison of the two in terms of optimality.

4.3. Discussion

To compare with other toolboxes for ADP or RL, we investigate the MATLAB Reinforcement Learning Toolbox on the same control problem. Equations (27) and (28) are discretized by the 4th-order Runge–Kutta method to construct the environment in the Reinforcement Learning Toolbox, and the integrand in (2) is taken as the reward function. The deep deterministic policy gradient (DDPG) algorithm [25] is selected to train the RL agent, since the control input in (26) is continuous. However, it is found in simulations that the parameters of the agent generally diverge even after a long training time, and the system cannot be stabilized. A likely reason is that only parameters of a normally distributed exploration signal, such as its mean and standard deviation, can be set, rather than an exploration signal of a specific form being chosen, so the system states may go to infinity in some episodes. Although one may terminate such an episode before all steps run out, the experiences saved in the replay buffer may still be detrimental to the training. In contrast, the options provided by the ADPT allow the user to determine what kind of trajectories are used, so that the optimal feedback control may be found quickly.

5. Conclusions and Future Work

The adaptive dynamic programming toolbox, a MATLAB-based package for optimal control of continuous-time control-affine systems, has been presented. Employing the adaptive dynamic programming technique, we proposed a computational methodology that approximately produces the optimal control and the optimal cost function, in which the Kronecker product used in the previous literature is replaced by the Euclidean inner product for lower memory consumption at runtime. The ADPT works in either a model-based mode or a model-free mode. The model-based mode deals with the situation where the system model is given, while the model-free mode handles the situation where the system dynamics are unknown but system trajectory data are available. Multiple options are provided so that the ADPT can be easily customized. The optimality, the running speed and the utility of the ADPT were illustrated with a satellite attitude stabilizing problem.
Currently, control policies and cost functions are approximated by polynomials in the ADPT. As the mathematical principles of neural networks are being revealed [26,27], we plan to support deep neural networks, in addition to polynomials, to approximately represent optimal controls and optimal cost functions, providing users of the ADPT with more options.

Author Contributions

Conceptualization, X.X. and D.E.C.; methodology, X.X. and D.E.C.; software, X.X.; validation, X.X.; formal analysis, X.X.; investigation, X.X. and D.E.C.; writing—original draft preparation, X.X.; writing—review and editing, X.X. and D.E.C.; visualization, X.X.; supervision, D.E.C.; project administration, D.E.C.; funding acquisition, D.E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by Defense Acquisition Program Administration (DAPA) and Agency for Defense Development (ADD) (UD190031RD), and by the NRF grant funded by the Korea government (MSIT) (2021R1A2C2010585).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Listing A1. An example of the model-based mode.
1 n = 2; % state dimension
2 m = 1; % control dimension
3 %% Symbolic variables.
4 syms x [n,1] real
5 syms u [m,1] real
6 syms t real
7  
8 %% Define the system.
9 k1 = 3; k2 = 2; k3 = 2; k4 = 5;
10 f = [x2;
11      (-k1*x1-k2*x1^3-k3*x2)/k4];
12 g = [0;
13      1/k4];
14 
15 %% Define the cost function.
16 q = 5*x1^2 + 3*x2^2;
17 R = 2;
18 
19 %% Execute ADP iterations.
20 d = 3; % approximation degree
21 [w,c] = adpModelBased(f,g,x,n,u,m,q,R,t,d);
Listing A2. An example of the model-free mode.
1  n = 2; % state dimension
2  m = 1; % control dimension
3   
4  %% Define the cost function.
5  q = @(x) 5*x(1)^2 + 3*x(2)^2;
6  R = 2;
7 
8  %% Generate data.
9  syms x [n,1] real
10  syms t real
11  k1 = 3; k2 = 2; k3 = 2; k4 = 5;
12  % System dynamics.
13  f = [x2;
14      (-k1*x1-k2*x1^3-k3*x2)/k4];
15  g = [0;
16       1/k4];
17 
18  F = [1, 1]; % feedback gain
19 
20  % Exploration signal.
21  eta = 0.8*(sin(7*t)+sin(1.1*t)+sin(sqrt(3)*t)+...
22   sin(sqrt(6)*t));
23  e = matlabFunction(eta,'Vars',t);
24 
25  % To be used in the function ode45.
26  dx = matlabFunction(f+g*(-F*x+eta),'Vars',{t,x});
27   
28  xInit = [-3, 2;
29          2.2, 3];
30  tSpan = [0:0.002:6;
31          0:0.002:6];
32  odeOpts = odeset('RelTol',1e-6,'AbsTol',1e-6);
33 
34  t_save = [];
35  x_save = [];
36  for i = 1:size(xInit,1)
37      [time, states] = ode45(@(t,x)dx(t,x),tSpan(i,:),...
38        xInit(i,:),odeOpts);
39      t_save = [t_save; time];
40      x_save = [x_save; states];
41  end
42
43  u0_save = -x_save * F'; % u0 = -F*x, evaluated along the rows of x_save
44  eta_save = e(t_save);
45 
46  %% Execute ADP iterations.
47  d = 3; % approximation degree
48  [w,c] = adpModelFree(t_save,x_save,n,u0_save,m,...
49    eta_save,d,q,R);
Listing A3. A demonstration of calling the function adpSetModelBased.
1  %% The user may specify settings.
2  xInit = [-3, 2;
3          2.2, 3];
4  tSpan = [0, 10;
5          0, 8];
6  
7  syms t real
8  eta = [0.8*sin(7*t)+sin(3*t);
9        sin(1.1*t)+sin(pi*t)];
10 
11 adpOpt = adpSetModelBased('xInit',xInit,'tSpan',tSpan,...
12   'explSymb',eta);
13 
14 %% Execute ADP iterations.
15 [w,c] = adpModelBased(f,g,x,n,u,m,q,R,t,d,adpOpt);
Listing A4. A demonstration of calling the function adpSetModelFree.
1   %% The user may specify settings.
2   adpOpt = adpSetModelFree('stride',2);
3      
4   %% Execute ADP iterations.
5   [w,c] = adpModelFree(t_save,x_save,n,u0_save,m,...
6    eta_save,d,q,R,adpOpt);

References

  1. Kirk, D.E. Optimal Control Theory: An Introduction; Prentice-Hall: Englewood Cliffs, NJ, USA, 1970.
  2. Lewis, F.L.; Vrabie, D.L.; Syrmos, V.L. Optimal Control; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2012.
  3. Al'brekht, E.G. On the optimal stabilization of nonlinear systems. J. Appl. Math. Mech. 1961, 25, 1254–1266.
  4. Garrard, W.L.; Jordan, J.M. Design of nonlinear automatic flight control systems. Automatica 1977, 13, 497–505.
  5. Nishikawa, Y.; Sannomiya, N.; Itakura, H. A method for suboptimal design of nonlinear feedback systems. Automatica 1971, 7, 703–712.
  6. Saridis, G.N.; Lee, C.-S.G. An approximation theory of optimal control for trainable manipulators. IEEE Trans. Syst. Man Cybern. 1979, SMC-9, 152–159.
  7. Beard, R.W.; Saridis, G.N.; Wen, J.T. Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation. Automatica 1997, 33, 2159–2177.
  8. Beard, R.W.; Saridis, G.N.; Wen, J.T. Approximate solutions to the time-invariant Hamilton-Jacobi-Bellman equation. J. Optim. Theory Appl. 1998, 96, 589–626.
  9. Abu-Khalaf, M.; Lewis, F.L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 2005, 41, 779–791.
  10. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
  11. Jiang, Y.; Jiang, Z.-P. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 2012, 48, 2699–2704.
  12. Vrabie, D.L.; Lewis, F.L. Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Netw. 2009, 22, 237–246.
  13. Jiang, Y.; Jiang, Z.-P. Robust adaptive dynamic programming and feedback stabilization of nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 882–893.
  14. Jiang, Y.; Jiang, Z.-P. Robust Adaptive Dynamic Programming; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2014.
  15. Lee, J.Y.; Park, J.B.; Choi, Y.H. Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 916–932.
  16. Krener, A.J. Nonlinear Systems Toolbox. MATLAB toolbox available upon request from [email protected].
  17. Giftthaler, M.; Neunert, M.; Stäuble, M.; Buchli, J. The Control Toolbox—An open-source C++ library for robotics, optimal and model predictive control. In Proceedings of the 2018 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), Brisbane, Australia, 16–19 May 2018; pp. 123–129.
  18. Houska, B.; Ferreau, H.J.; Diehl, M. ACADO Toolkit—An open source framework for automatic control and dynamic optimization. Optim. Control Appl. Meth. 2011, 32, 298–312.
  19. Verschueren, R.; Frison, G.; Kouzoupis, D.; Frey, J.; van Duijkeren, N.; Zanelli, A.; Novoselnik, B.; Albin, T.; Quirynen, R.; Diehl, M. ACADOS: A modular open-source framework for fast embedded optimal control. arXiv 2019, arXiv:1910.13753.
  20. Patterson, M.A.; Rao, A.V. GPOPS-II: A MATLAB software for solving multiple-phase optimal control problems using hp-adaptive Gaussian quadrature collocation methods and sparse nonlinear programming. ACM Trans. Math. Softw. 2014, 41, 1–37.
  21. Cox, D.A.; Little, J.; O'Shea, D. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra; Springer: New York, NY, USA, 2015.
  22. Chang, D.E. On controller design for systems on manifolds in Euclidean space. Int. J. Robust Nonlinear Control 2018, 28, 4981–4998.
  23. Ko, W. A Stable Embedding Technique for Control of Satellite Attitude Represented in Unit Quaternions. Master's Thesis, Korea Advanced Institute of Science & Technology, Daejeon, Korea, 2020.
  24. Ko, W.; Phogat, K.S.; Petit, N.; Chang, D.E. Tracking controller design for satellite attitude under unknown constant disturbance using stable embedding. J. Electr. Eng. Technol. 2021, 16, 1089–1097.
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  26. Gurney, K. An Introduction to Neural Networks; UCL Press: London, UK, 1997.
  27. Caterini, A.L.; Chang, D.E. Deep Neural Networks in a Mathematical Framework; Springer: New York, NY, USA, 2018.
Figure 1. The state errors $\|x(t)\|$ with the controllers of degree 3 generated by the ADPT in the model-free working mode and by NST.
Table 1. Costs at $x_0$ and computation time by ADPT, NST, and ACADO. $J(x_0, u)$ denotes the integral cost of the corresponding control $u$ for the initial state $x_0$; 'Time [s]' denotes the computation time taken by the method to obtain the controller.

| Method | Degree | $J(x_0, u)$ | Time [s] |
|---|---|---|---|
| ADPT (model-based) | d = 1 | 37.8259 | 1.5994 |
| ADPT (model-based) | d = 2 | 33.6035 | 3.2586 |
| ADPT (model-based) | d = 3 | 33.4986 | 13.1021 |
| ADPT (model-free) | d = 1 | 43.8308 | 0.9707 |
| ADPT (model-free) | d = 2 | 36.8319 | 3.3120 |
| ADPT (model-free) | d = 3 | 37.4111 | 64.8562 |
| NST | d = 1 | 208.9259 | 0.2702 |
| NST | d = 2 | 94.6868 | 0.6211 |
| NST | d = 3 | 64.0721 | 3.6201 |
| ACADO | – | 32.6000 | 2359.67 |