Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic

Radac, Mircea-Bogdan; Precup, Radu-Emil

doi:10.3390/app9091807

Open AccessArticle

Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic

by

Mircea-Bogdan Radac

^*

and

Radu-Emil Precup

Department of Automation and Applied Informatics, Politehnica University of Timisoara, Timisoara 300006, Romania

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2019, 9(9), 1807; https://doi.org/10.3390/app9091807

Submission received: 28 January 2019 / Revised: 19 March 2019 / Accepted: 24 April 2019 / Published: 30 April 2019

(This article belongs to the Special Issue Advances in Deep Learning)

Download

Browse Figures

Versions Notes

Abstract

:

This paper proposes a neural network (NN)-based control scheme in an Adaptive Actor-Critic (AAC) learning framework designed for output reference model tracking, as a representative deep-learning application. The control learning scheme is model-free with respect to the process model. AAC designs usually require an initial controller to start the learning process; however, systematic guidelines for choosing the initial controller are not offered in the literature, especially in a model-free manner. Virtual Reference Feedback Tuning (VRFT) is proposed for obtaining an initially stabilizing NN nonlinear state-feedback controller, designed from input-state-output data collected from the process in open-loop setting. The solution offers systematic design guidelines for initial controller design. The resulting suboptimal state-feedback controller is next improved under the AAC learning framework by online adaptation of a critic NN and a controller NN. The mixed VRFT-AAC approach is validated on a multi-input multi-output nonlinear constrained coupled vertical two-tank system. Discussions on the control system behavior are offered together with comparisons with similar approaches.

Keywords:

adaptive actor-critic; model-free control; data-driven control; reinforcement learning; approximate dynamic programming; output reference model tracking; multi-input multi-output systems; vertical tank systems; Virtual Reference Feedback Tuning

1. Introduction

Data-driven or data-based control techniques rely on data collected from the process in order to learn and tune controllers that prevent control performance degradation due to mismatch between the true process and its model—the main issue with model-based control design approaches [1]. The data-driven controller learning objective can be achieved either by using highly adaptive simplified phenomenological models [2,3], or by using no model at all, except for common structural assumptions about the true process such as linearity or nonlinearity. The latter approach can be considered a true model-free one, with several representative techniques having first emerged from classical control theory, such as: Virtual Reference Feedback Tuning (VRFT), [4], Iterative Feedback Tuning [5], Simultaneous Perturbation Stochastic Approximation [6], Model-free Iterative Learning Control [7,8]. Most of the above approaches relying on instruments specific to optimal control with several recent applications [9,10,11,12,13,14,15,16,17].

Reinforcement learning (RL) [18] is a powerful data-driven technique that solves optimal control problems with parallel developments in the machine learning and control systems communities in which RL is better known as Adaptive (Approximate) Dynamic Programming (ADP) [19] or neuro-dynamic programming [20]. Reinforcement Q-learning [21] with function approximators (FAs) is a particular version of Action Dependent Heuristic Dynamic Programming implemented without a process model [22,23], which is only one of the several types of adaptive actor-critic (AAC) ADP designs [24,25,26,27], besides Heuristic Dynamic Programming [28], Dual Heuristic Programming [29] and all of their action-dependent versions.

For learning high performance control, Action Dependent Heuristic Dynamic Programming (a form of continuous input-state space Q-learning) uses the Q-function as an extension of the cost (value) function and only needs to efficiently explore the input-state space of the unknown process, hence the model-free data-driven label is justified. The class of model-free AAC designs used with FAs is attractive over the majority of the model-based AAC designs, where a partially known nonlinear input-affine state-space representation is at least necessary [22,23]. The main disadvantages of the Action Dependent Heuristic Dynamic Programming schemes are that many transition samples are needed from the process—since the Q-function estimation is more informative, it needs to explore the action space in addition to the state space—and the lack of convergence guarantees in the absence of a process model, when generic FAs are used. Data-driven RL/ADP formulated in terms of control systems theory has also offered recent results regarding different applications and stability and learning convergence, in both model-free and model-based settings [30,31].

In output reference model (ORM) tracking control, the output of the controlled process should track a reference model’s output regarded as a frequently changing time-varying learning goal. This control objective can also be formulated in an optimal control setup. An initial stabilizing state feedback controller that achieves suboptimal ORM tracking control is highly desirable in practice since it could accelerate the learning process. In fact, most of the AAC learning control architectures start the controller learning with respect to some objective using an initial controller, but lack systematic guidelines for obtaining such initial controller.

VRFT is one solution to design data-driven model-free feedback controllers, commonly using input-output data. Its linear time-invariant framework typically needs much fewer samples than model-free AAC designs to obtain an initial controller. Unfortunately, a linear controller cannot ensure good ORM tracking for nonlinear processes acting in wide operating ranges. Since AAC should essentially learn a nonlinear state-feedback controller, it is of interest to obtain such an initial (possibly suboptimal) controller, and this will be shown possible using the VRFT design and tuning framework. This would be significant since model-free AAC approaches are data-hungry in practice and any initial suboptimal solution would shorten the convergence time. Under such motivation, the combination of VRFT and AAC is used to achieve ORM tracking control. The resulting AAC design consists of two neural networks (NNs), one for the controller called the actor NN and one for the cost function approximation called the critic NN. The correction signals during the adaptive learning are backpropagated through the larger NN resulted from cascading the actor and the critic NNs, hence the AAC architecture belongs to the deep reinforcement learning approaches from the literature [32].

The mixed VRFT-AAC approach developed in this paper is applied to a real-world Multi-Input Multi-Output (MIMO) nonlinear coupled constrained laboratory vertical two-tank system for water level control. The approach proposed as follows is novel with respect to the state-of-the-art since:

it introduces an original nonlinear state-feedback neural network-based controller for ORM tracking, tuned with VRFT, serving as initialization for the AAC learning controller that further improves the ORM tracking and accelerates convergence to the optimal controller. This leads to the novel VRFT-AAC combination;
the case study proves implementation of the novel mixed control learning approach for ORM tracking. The MIMO validation scenario also demonstrates good decoupling ability of learned controllers, even under constraints and nonlinearities. Comparisons with a model-free batch fitted Q-learning scheme and with a model-based batch-fitted Q-learning approach are also offered. Statistical characterization case studies in different learning settings are given.
theoretical analysis ensures that the AAC learning scheme preserves the CS stability throughout the updates and converges to the optimal control.

The paper is organized as follows: the next section formulates the ORM tracking control problem in an optimal control framework and offers a way to solve it using VRFT (Section 3) and AAC design (Section 4). Validation case study, useful implementation details, comparisons with similar control learning techniques, thorough investigations and discussions of the observed results, are all presented in Section 5. The concluding remarks are highlighted in Section 6.

2. Output Model Reference Control for Unknown Systems

Let the discrete-time nonlinear unknown open-loop minimum-phase state-space deterministic strictly causal process be

P : {\begin{cases} x_{k + 1} = f (x_{k}, u_{k}), \\ y_{k} = g (x_{k}), \end{cases}

(1)

where k indexes the discrete time,

x_{k} = {[x_{k, 1} \dots x_{k, n}]}^{T} \in X \subset ℜ^{n}

is the n-dimensional state vector (upper T is matrix transpose),

u_{k} = {[u_{k, 1}, \dots, u_{k, m}]}^{T} \in U \subset ℜ^{m}

is the control input signal,

y_{k} = {[y_{k, 1}, \dots, y_{k, p}]}^{T} \in Y \subset ℜ^{p}

is the measurable controlled output,

f : X \times U \to X

is an unknown nonlinear system function,

g : X \to Y

is an unknown nonlinear output function of the states, and the initial conditions are not considered for analysis at this point. It is further assumed that the definition domains X, U, Y are compact convex. The following assumptions common to the data-driven problem formulation [1] are:

A1: System (1) is controllable and fully state observable.

A2: System (1) is internally stable on X × U.

Assumptions A1 and A2 are common in the data-driven control literature and difficult to assess when unknown process models are assumed. They may be supported from the experience on the process operation or from the literature. If no knowledge exists whatsoever, control can be tried in the constraining domains related to the minimum safety operating conditions of the process, which is required minimum information on the process variables. Internal stability is sufficient for output feedback control design and necessary for state-feedback control design using input-state samples.

Concerning the controllability and full state observability assumption A1 imposed to the process, if the observability cannot be verified analytically, data-driven observers can be built using past samples of either the inputs and outputs and/or of the partially measurable state, as shown for linear systems in [33,34] and used for nonlinear systems in [35]. State measurement requires more insight on the process than several pure input-output representations.

Equation (1) is a general form for most controlled processes in practice and it is not restrictive. In this form, it obeys the definition of a deterministic Markov decision process.

The discrete-time known open-loop stable minimum-phase state-space deterministic strictly causal ORM is

O R M : {\begin{cases} x_{k + 1}^{m} = f^{m} (x_{k}^{m}, r_{k}), \\ y_{k}^{m} = g^{m} (x_{k}^{m}), \end{cases}

(2)

where

x_{k}^{m} = {[x_{k, 1}^{m}, \dots, x_{k, n_{m}}^{m}]}^{T} \in X_{m} \subset ℜ^{n_{m}}

is the state vector of the ORM,

r_{k} = {[r_{k, 1}, \dots, r_{k, p}]}^{T} \in R_{m} \subset ℜ^{p}

is the reference input signal,

y_{k}^{m} = {[y_{k, 1}^{m}, \dots, y_{k, p}^{m}]}^{T} \in Y_{m} \subset ℜ^{p}

is the ORM’s output,

f^{m} : X_{m} \times R_{m} \to X_{m}

,

g^{m} : X_{m} \to Y_{m}

are known nonlinear maps. Initial conditions are zero unless stated otherwise. Note that

r_{k}, y_{k}, y_{k}^{m}

have the same size p for square feedback control systems. If the ORM (2) is linear time-invariant in particular, it is always possible to express the ORM as an input-output linear time-invariant transfer matrix

y_{k}^{m} = M (z) r_{k}

, where

M (z)

is an asymptotically stable unit gain (i.e.,

M (1) = I

, where I is the identity matrix) rational transfer matrix and

r_{k}

is the reference input that drives both the feedback control system and the ORM. To extend the process (1) with the ORM (2), we consider the reference input

r_{k}

as a set of measurable exogenous signals that evolve according to

r_{k + 1} = h^{m} (r_{k})

, with unknown

h^{m} : ℜ^{m} \to ℜ^{m}

but measurable

r_{k}

. Piecewise constant

r_{k}

can be modeled for example as

r_{k + 1} = r_{k}

and it will be used throughout this paper. Then the extended state-space model with output equations is

\begin{array}{l} x_{k + 1}^{E} = [\begin{matrix} x_{k + 1} \\ x_{k + 1}^{m} \\ r_{k + 1} \end{matrix}] = [\begin{matrix} f (x_{k}^{}, u_{k}) \\ f^{m} (x_{k}^{m}, r_{k}) \\ h^{m} (r_{k}) \end{matrix}] = F (x_{k}^{E}, u_{k}), x_{k}^{E} \in X^{E} = X \times X_{m} \times R_{m}, \\ y_{k} = g (x_{k}^{E}), \\ y_{k}^{m} = g^{m} (x_{k}^{E}) . \end{array}

(3)

The ORM tracking control problem is formulated in an optimal control framework. Let the infinite horizon cost function (c.f.) to be minimized starting with

x_{i}

be [36]

J (x_{i}^{E}, U_{i, \infty}) = \sum_{k = i}^{\infty} γ^{k - i} Ʋ (x_{k}^{E}, u_{k}), U_{i, \infty} = {u_{i}, \dots, u_{\infty}},

(4)

where i indexes the starting time for

x_{i}^{E}

, the discount factor

0 < γ \leq 1

ensures the convergence of

J (x_{i}^{E}, U_{i, \infty})

[23] and sets the controller’s (or interacting agent’s) horizon, the stage cost Ʋ > 0 depends on

x_{k}^{E}

and

u_{k}

and captures the distance relative to some pre-specified learning goal (target) usually constant in many applications. The unknown control inputs

u_{i}

,

u_{i + 1}

, ..., should minimize

J (x_{i}^{E}, U_{i, \infty})

. A control sequence (or a controller) rendering a finite c.f. are called admissible.

ORM tracking control requires that the undisturbed process output

y_{k}

(also the control system output) tracks the ORM’s output

y_{k}^{m} = M (z) r_{k}

. For stage cost

Ʋ_{M R} = {‖ y_{k}^{m} (x_{k}^{E}) - y_{k} (x_{k}^{E}) ‖}_{2}^{2}

in Equation (4) (measurable

y_{k}

depends via unknown

g ()

on

x_{k}

, but not on

x_{k + 1}

), we introduce the discounted infinite-horizon model reference tracking c.f.

J_{M R}^{\infty} (x_{0}^{E}, θ) = \sum_{k = 0}^{\infty} γ^{k} {‖ y_{k}^{m} (x_{k}^{E}) - y_{k} (x_{k}^{E}, θ) ‖}_{2}^{2} = \sum_{k = 0}^{\infty} γ^{k} {‖ ε_{k}^{} (x_{k}^{E}, θ) ‖}_{2}^{2},

(5)

where

ε_{k} (x_{k}^{E}, θ)

is the model reference tracking error vector,

θ \in ℜ^{n_{θ}}

is a parameterization of a nonlinear feedback admissible controller [23] defined as

u_{k} \overset{d e f}{=} C (x_{k}^{E}, θ)

, which used in Equation (5) reflects the influence of θ on all system trajectories outcomes. This controller coupled with Equation (3) ensures that the output of Equation (1) tracks the ORM’s output.

J_{M R}^{\infty}

in (5) also serves as the value function of using the controller C. For finite

J_{M R}^{\infty}

when γ = 1,

ε_{k}

must be a square-summable sequence which can be obtained with an asymptotically stabilizing controller that ensures

\lim_{k \to \infty} {‖ y_{k}^{m} (x_{k}^{E}) - y_{k} (x_{k}^{E}) ‖}_{2}^{2} = 0

. In the general case when γ < 1,

J_{M R}^{\infty}

will be finite with any stabilizing controller that renders a finite upper bounded

ε_{k}

. Herein, admissible controller for Equations (4) and (5) means the controller that ensures a finite c.f.

J_{M R}^{\infty}

.

A nonlinear reference model M could have been used for tracking purposes as well; however, imposing a linear time-invariant one for the feedback control system ensures indirect feedback linearization of the controlled process. It is extremely beneficial to work with linearized feedback control systems because their behavior generalizes well in wide operating ranges [37]. The ORM tracking problem concerns the control system behavior from the reference input to the controlled output, neglecting potential load disturbances [38]. Extension of the proposed theory to nonlinear ORMs is not difficult. Under classical control rules, the process’s delay and non-minimum-phase character should be included in M. However, the non-minimum-phase zeroes make M non-invertible in addition to requiring their knowledge via identification [38], affecting the subsequent VRFT design, motivating the minimum phase assumption on the process.

3. Nonlinear State-Feedback VRFT for Approximate ORM Tracking Control Using Neural Networks

An initial controller for the system (3) to achieve approximate ORM tracking employs the VRFT concept. Under assumptions A1 and A2, for tuning a nonlinear state-feedback controller, the designer may employ an input-state-output dataset of the form

{{\tilde{u}}_{k}, {\tilde{x}}_{k}, {\tilde{y}}_{k}}, k = \bar{0, N - 1}

, gathered from the process in an open-loop experiment lasting for N sample time steps, where persistently exciting

{\tilde{u}}_{k}

excites all the significant process dynamics. To achieve linear ORM tracking for a nonlinear process, a nonlinear state-feedback controller is more suitable than a linear one, being able to cope with the process nonlinearities.

VRFT concept assumes that, if the controlled output y_k produced in an open-loop experiment conducted on the stable process is both the control system’s output and the ORM’s output, then the closed-loop control system will match the reference model [4,39,40,41,42]. Let

{\tilde{r}}_{k} = M {(z)}^{- 1} {\tilde{y}}_{k}

be the virtual reference input that generates

{\tilde{y}}_{k}

when filtered through M(z) which is assumed to be invertible with respect to the inverse filtering operation. It is called virtual since it is never set as a reference input to the closed-loop control system and it is only used in the offline controller tuning. The virtual states of the ORM are computable from Equation (2) as

{\tilde{x}}_{k + 1}^{m} = f^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k})

serving to reconstruct the virtual extended state as

{\tilde{x}}_{k}^{E} = {[{({\tilde{x}}_{k})}^{T} {({\tilde{x}}_{k}^{m})}^{T} {({\tilde{r}}_{k})}^{T}]}^{T}

A controller that produces

{\tilde{u}}_{k}

when fed by

{\tilde{x}}_{k}^{E}

achieves the ORM tracking. VRFT translates the model reference tracking c.f. in Equation (5) to a controller identification c.f. A finite-time controller identification c.f. is [4]

J_{V R}^{N} (θ) = \sum_{k = 0}^{N - 1} {‖ {\tilde{u}}_{k}^{} - C ({\tilde{x}}_{k}^{E}, θ) ‖}^{2} .

(6)

Let the optimal controller parameter vector θ^* be the solution to the optimization problem

θ^{*} = \arg \min_{θ} J_{V R}^{N} (θ)

. Theorem 2 in [41] shows that if the controller parameterization is rich enough, then

θ^{*}

also minimizes

J_{M R}^{\infty}

, proven for input-output models only. Motivated by [41], a formal proof is given as incentive for using state-feedback controllers tuned by nonlinear multi-input multi-output (MIMO) VRFT. Several other assumptions are considered:

A3: The process (1) has an equivalent input-output form

y_{k} = P (y_{k - 1}, \dots, y_{k - n y}, u_{k - 1}, \dots, u_{k - n u}),

where

n y

,

n u

are unknown process orders and the nonlinear map

P

is invertible with respect to

u

, meaning that for given

y_{k}

,

u_{k}

is recoverable as

u_{k - 1} = P^{- 1} (y_{k})

. Zero initial conditions are assumed at this point. Also, the ORM (2) has an equivalent input-output form

y_{k}^{m} = M (y_{k - 1}^{m}, \dots, y_{k - n y m}^{m}, r_{k - 1}, \dots, r_{k - n r})

where

n y m

,

n r

are known ORM’s orders, M is a nonlinear invertible map with stable inverse, allowing the calculation of

r_{k - 1} = M^{- 1} (y_{k}^{m})

. Zero initial conditions are also assumed.

A4: Let the process (1) and the ORM (2) be formally written as

y_{k} = s (x_{k}, u_{k - 1})

and

y_{k}^{m} = s^{m} (x_{k}^{m}, r_{k - 1}^{})

, respectively, to capture simultaneously both the input-output dependence and the input-state-output one in a compact form. These expressions also reveal the relative degree one from input to output, without loss of generality. Assume zero initial conditions for (1) and assume the map s invertible with

x_{k}^{}, u_{k - 1}^{}

computable from

y_{k}^{}

as

x_{k}^{} = {(s_{x}^{})}^{- 1} (y_{k}^{})

,

u_{k - 1}^{} = {(s_{u}^{})}^{- 1} (y_{k}^{})

. Further assume that

s^{m}

is a continuously differentiable invertible map such that

x_{k}^{m}, r_{k - 1}^{}

are computable from

y_{k}^{m}

as

x_{k}^{m} = {(s_{x}^{m})}^{- 1} (y_{k}^{m}), r_{k - 1}^{} = {(s_{r}^{m})}^{- 1} (y_{k}^{m})

and assume there exists positive constants

B_{s x}^{m} > 0, B_{s r}^{m} > 0

such that

‖ \frac{\partial s^{m} (x_{k}^{m}, r_{k - 1})}{\partial x_{k}^{m}} ‖ < B_{s x}^{m}, ‖ \frac{\partial s^{m} (x_{k}^{m}, r_{k - 1})}{\partial r_{k - 1}} ‖ < B_{s r}^{m}

. Let zero initial conditions hold for (2). These inversion assumptions are natural for state-space systems such as (1) and (2) that have equivalent input-output models according to A4. For example, for given output

y_{k}

of (1), the input is uniquely determined as

u_{k - 1} = P^{- 1} (y_{k})

, after which the state can be generated by recursion from

x_{k + 1} = f (x_{k}, u_{k})

of Equation (1). This is the sense of

x_{k}^{} = {(s_{x}^{})}^{- 1} (y_{k}^{})

.

Moreover, let

s, {(s_{x})}^{- 1}

be continuously differentiable and of bounded derivative to satisfy

\begin{array}{l} ‖ \frac{\partial s (x_{k}, u_{k - 1})}{\partial x_{k}} ‖ < B_{s x}, ‖ \frac{\partial s (x_{k}, u_{k - 1})}{\partial u_{k - 1}} ‖ < B_{s u}, ‖ \frac{\partial {(s_{x})}^{- 1} (y_{k})}{\partial y_{k}} ‖ < B_{s y,} \\ 0 < B_{s x} B_{s y} < 1 . \end{array}

(7)

A5: Let a finite open-loop trajectory collected from the process be

D = {{\tilde{u}}_{k}, {\tilde{x}}_{k}, {\tilde{y}}_{k}} \subset U \times X \times Y, k = \bar{0, N - 1}

where

{\tilde{u}}_{k}^{}

is: (1) persistently exciting, for

{\tilde{y}}_{k}^{}

to capture all process dynamics and (2) ensuring uniform exploration of the entire domain

U \times X \times Y

. Good exploration is achievable for large enough N.

A6: There exists a set of nonlinear parameterized state-feedback continuously differentiable controllers

{C (x_{k}^{E}, θ)}

, a

\hat{θ}

for which

{\hat{u}}_{k}^{} = C ({\hat{x}}_{k}^{E}, \hat{θ})

, and an ε > 0 for which

J_{V R}^{N} (\hat{θ}) = \sum_{k = 0}^{N - 1} {‖ {\tilde{u}}_{k}^{} - C ({\tilde{x}}_{k}^{E}, \hat{θ}) ‖}^{2} < ε^{2},

(8)

‖ \frac{\partial C (x_{k}^{E}, θ)}{\partial x_{k}^{E}} ‖ < B_{c x},

(9)

where

{\tilde{x}}_{k}^{E} = {[{({\tilde{x}}_{k})}^{T} {({\tilde{x}}_{k}^{m})}^{T} {({\tilde{r}}_{k})}^{T}]}^{T}

,

{\hat{x}}_{k}^{E} = {[{({\hat{x}}_{k})}^{T} {({\tilde{x}}_{k}^{m})}^{T} {({\tilde{r}}_{k})}^{T}]}^{T}

and

{\tilde{x}}_{k}^{m} = {(s_{x}^{m})}^{- 1} ({\tilde{y}}_{k}^{}), {\tilde{r}}_{k - 1}^{} = {(s_{r}^{m})}^{- 1} ({\tilde{y}}_{k}^{})

. Technically,

{{\hat{u}}_{k}, {\hat{x}}_{k}, {\hat{y}}_{k}}

are generated with

{\hat{u}}_{k}^{} = C ({\hat{x}}_{k}^{E}, \hat{θ})

in closed-loop, by processing the virtual signals

{\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{}

obtained from

{\tilde{y}}_{k}^{}

.

Theorem 1: Under assumptions A3–A6, there exists a finite B > 0 such that

J_{M R}^{N} (\hat{θ}) = \sum_{k = 1}^{N} {‖ {\hat{y}}_{k} - {\tilde{y}}_{k} ‖}_{}^{2} = \sum_{k = 1}^{N} {‖ s ({\hat{x}}_{k}^{}, {\hat{u}}_{k - 1}) - s^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{}) ‖}_{2}^{2} < B ε^{2} .

(10)

Proof: See Appendix A.

Corollary 1. The controller

C (x_{k}^{E}, \hat{θ})

obtained by minimizing the c.f. (6) is stabilizing and admissible for

J_{M R}^{\infty}

in Equation (5) with

γ < 1

.

Proof. By Equation (8), properly identified

C (x_{k}^{E}, \hat{θ})

renders the finite-time

J_{M R}^{N} (\hat{θ})

(10) arbitrarily small. Secondly, a good exploration of U × X × Y ensured by

D = {{\tilde{u}}_{k}, {\tilde{x}}_{k}, {\tilde{y}}_{k}}

reflects in good exploration of domains

R_{m}, X_{m}

by

{\tilde{r}}_{k}^{}, {\tilde{x}}_{k}^{m}

respectively. In (10),

{\hat{u}}_{k}, {\hat{x}}_{k}, {\hat{y}}_{k}, {\tilde{x}}_{k}^{m}, {\hat{x}}_{k}^{E}, {\tilde{x}}_{k}^{E}

are all generated from the same

{\tilde{r}}_{k}^{}

. If (10) holds for many combinations

{\tilde{r}}_{k}^{}, {\tilde{x}}_{k}^{m}

rendered form exploratory data, then by the arguments of continuous differentiability and bounded derivatives of the maps (7) and by assumption A4, they will hold for any possible combination of

r_{k}^{}

and

x_{k}^{m} = f^{m} (x_{k - 1}^{m}, r_{k - 1})

generated from any

r_{k}^{}

.

To show this, note that both

{\tilde{y}}_{k} = y_{k}^{m} = s^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{})

and

{\hat{y}}_{k} = y_{k} = s ({\hat{x}}_{k}, u_{k - 1}) = s ({\hat{x}}_{k}, C ({\hat{x}}_{k - 1}^{E}, \hat{θ}))

in (10) can be generated from the same

{\tilde{r}}_{k}^{}

(

{\hat{x}}_{k - 1}^{E}

contains

{\tilde{r}}_{k - 1}^{}

). Using this fact, it follows from (10) that the ORM tracking errors

‖ s ({\hat{x}}_{k}, C ({\hat{x}}_{k - 1}^{E} ({\tilde{r}}_{k - 1}^{(1)}))) - s_{}^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{(1)}) ‖

and

‖ s ({\hat{x}}_{k}, C ({\hat{x}}_{k - 1}^{E} ({\tilde{r}}_{k - 1}^{(2)}))) - s_{}^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{(2)}) ‖

are bounded at each time step, for any two training pairs

{\tilde{r}}_{k - 1}^{(1)}

and

{\tilde{r}}_{k - 1}^{(2)}

, since the sum in (10) is bounded. Then, for any

r_{k}^{}

such that

{\tilde{r}}_{k - 1}^{(1)} \leq r_{k} \leq {\tilde{r}}_{k - 1}^{(2)}

(component wise), since

s, s^{m}

are differentiable with bounded derivatives with respect to their arguments, it must hold that

‖ s ({\hat{x}}_{k}, C ({\hat{x}}_{k - 1}^{E} (r_{k - 1}^{}))) - s_{}^{m} (x_{k}^{m}, r_{k - 1}^{}) ‖ = ‖ {\hat{y}}_{k} - y_{k}^{m} ‖

is bounded. Which makes the controller

C (x_{k}^{E}, \hat{θ})

stabilizing for the control system in the sense of bounded output when

r_{k}^{}

is bounded. Then, it is an admissible one for the infinite horizon c.f.

J_{M R}^{\infty}

with γ < 1. This proves the claim.

An NN can be used as a controller for nonlinear state-feedback control learning. Nonlinear VRFT is proposed in [41,42] and successfully applied to NN controllers in [41,43,44,45] but only for output feedback control and not for state-feedback control as in here.

Notice that VRFT control does not need the entire extended state

{\tilde{x}}_{k}^{E}

for feedback (i.e. including the virtual states of the ORM), the process’ initial states would suffice for this purpose. However, state extension is required for preserving the Markov property of the system (3) in order to ensure the correct collection of the transition samples; this is not possible otherwise without special collection design such as using a zero-order-hold for two-by-two consecutive time samples [43]. Correct transition samples collection is required for adaptive actor-critic tuning approach of the same NN controller that is initially tuned via VRFT.

Notice that in the proposed state-feedback VRFT design, knowledge of the output function

y_{k} = g (x_{k})

in Equation (1) is again not needed since

{\tilde{y}}_{k}

is used to calculate the virtual reference

{\tilde{r}}_{k}

, while the controller only uses

{\tilde{x}}_{k}^{E}

for feedback purposes.

4. Adaptive Actor-Critic Learning for ORM Tracking Control

If the system dynamics (3) is known, for a finite-time horizon version of the c.f. (4), numerical dynamic programming solutions can be employed backwards in time only with finite state and action spaces of moderate size, an issue referred to as the “curse of dimensionality”. For infinite horizon c.f.s, Policy Iteration and Value Iteration [23] can be used even for large and/or continuous state and action spaces, where FAs such as NNs are one option.

If the system dynamics in (3) is unknown, the minimization of the c.f. (4) becomes an RL problem. To solve it model-free, an informative c.f. for each state-action pair is defined, called the Q-function (or action-value function). With this respect, the action-value function of acting

u_{k}

in state

x_{k}^{E}

and then following the control (policy)

u_{k} = C (x_{k}^{E})

is defined as

Q^{C} (x_{k}^{E}, u_{k}) = V (x_{k}^{E}, u_{k}^{}) + γ Q^{C} (x_{k + 1}^{E}, C (x_{k + 1}^{E})) .

(11)

The optimal Q-function

Q^{*} (x_{k}^{E}, u_{k}^{*})

satisfies Bellman’s optimality equation

Q^{*} (x_{k}^{E}, u_{k}^{*}) = \min_{u_{k}^{}} (V (x_{k}^{E}, u_{k}^{}) + γ Q^{*} (x_{k + 1}^{E}, u_{k + 1}^{*})),

(12)

with the optimal controller and optimal Q-function

u_{k}^{*} = C^{*} (x_{k}^{E}) = \arg \min_{C} Q^{C} (x_{k}^{E}, u_{k}) = \arg \min_{u} Q^{*} (x_{k}^{E}, u) .

(13)

Then

J^{*} (x_{k}^{E}) = Q^{*} (x_{k}^{E}, u_{k}^{*})

, where

J^{*} (x_{k}^{E}) = \min_{u} J (x_{k}^{E}, u)

is the minimum value c.f. out of the c.f.s defined in Equation (4). Notice that c.f. (4) encompasses (5) thus making the ORM tracking problem consistent with the above equations. The optimal Q-function can be found using Policy Iteration or Value Iteration in a model-free manner, using, e.g., NNs as FAs. The optimal Q-function estimate and the optimal controller estimate can be updated from the transition samples in several ways: in online/offline mode, batch mode, or sample-by-sample update [23,46]. A particular class of online RL approaches is represented by the temporal difference-based AAC design that differs from the batch PI and VI approaches, as it avoids alternate batch back-up of the Q-function FA and of the controller FA.

4.1. Adaptive Actor-Critic Design

The proposed AAC design is a gradient-based scheme designed to converge to the optimal Q-function and optimal controller estimates. Let the temporal-difference error be measured from data as

δ_{k} (x_{k - 1}^{E}, u_{k - 1}) = Ʋ (x_{k - 1}^{E}, u_{k - 1}) + γ {\hat{Q}}_{k - 1} (x_{k}^{E}, C_{k} (x_{k}^{E})) - {\hat{Q}}_{k - 1} (x_{k - 1}^{E}, u_{k - 1})

(14)

where the continuous function

{\hat{Q}}_{k} (\cdot, \cdot)

in its arguments is the Q-function estimate at time k, time at which some controller C_k(x_k) is also available. From this point onward, for notation simplicity, x_k or plain x are used instead of x_k^E The proposed AAC design attempts the Q-function update to online minimize the c.f. E_{c, k} = 0.5 δ _k², while the controller attempts to online minimize the Q-function using gradient descent. Taxonomically, the proposed AAC belongs to the online Policy Iteration schemes where the policy evaluation step (of Bellman error residual minimization type) interleaves with the policy improvement step. The update laws for the AAC design from input-state data are:

u_{k} = C_{k} (x) = C_{k - 1} (x) - α_{a} \cdot {\frac{\partial {\hat{Q}}_{k - 1} (x, u)}{\partial u} |}_{u_{k - 1}}, \forall x,

(15)

\begin{array}{l} {\hat{Q}}_{k} (x_{k - 1}, u_{k - 1}) = {\hat{Q}}_{k - 1} (x_{k - 1}, u_{k - 1}) + α_{c} δ_{k} = \\ = {\hat{Q}}_{k - 1} (x_{k - 1}, u_{k - 1}) + α_{c} (Ʋ (x_{k - 1}, u_{k - 1}) + γ {\hat{Q}}_{k - 1} (x_{k}, u_{k} = C_{k} (x_{k})) - {\hat{Q}}_{k - 1} (x_{k - 1}, u_{k - 1})), \end{array}

(16)

where

α_{a} > 0

,

α_{c} > 0

are learning rates. The controller

C (x_{k})

is imagined as a function (or as an infinitely dense table) mapping any state to a control action.

Comment 1: In particular, for any admissible controller

C_{0}

, repeated calls of (16), under proper exploration (translated to visiting all the pairs

(x_{k - 1}^{}, u_{k - 1}) \in X^{E} \times U

often and to generating the sample

x_{k}^{}

), and under proper selection of

α_{c} > 0

, will update

{\hat{Q}}_{k} (x, u)

,

\forall x \in X^{E}

until

δ_{k}^{} = 0

at which point (11) must hold and the converged

Q^{C_{0}} (x_{k}, u_{k})

evaluates

C_{0}

. This is an online off-policy model-free policy evaluation step. Then

Q_{0} (x, u) = Q^{C_{0}} (x, u)

can be an initialization for AAC. Whereas, an initial admissible controller can be obtained for example using VRFT as shown later in the case study.

Comment 2: The converged Q-function of an admissible control C₀, be it Q^C₀(x_k, u_k), is positive by definition since it accumulates stage costs Ʋ > 0. Moreover, it is always greater than the optimal Q-function, i.e., Q^C₀(x_k, u_k) ≥ Q^*(x_k, u_k) > 0 and obeys the Bellman equation.

Comment 3: From (15), it follows that

Q_{k - 1} (x, C_{k} (x)) \leq Q_{k - 1} (x, C_{k - 1} (x)), \forall x \in X^{E}

(17)

for a small enough

α_{a} > 0

.

Lemma 1. Starting from an admissible controller C₀ with corresponding Q-function initialization

{\hat{Q}}_{0} (x, u) = Q^{C_{0}} (x, u)

, the sequence

{{\hat{Q}}_{k} (x, u)}

is monotonic and non-increasing ensuring that

{\hat{Q}}_{k} (x, u) \leq {\hat{Q}}_{k - 1} (x, u), (x, u) \in X^{E} \times U

(18)

Proof.

{\hat{Q}}_{0} (x, u)

is initialization for Equation (16) and obeys the Bellman Equation (11) for the admissible controller C₀, i.e.,

{\hat{Q}}_{0} (x_{k}^{}, u_{k}) = Ʋ (x_{k}^{}, u_{k}^{}) + γ {\hat{Q}}_{0} (x_{k + 1}^{}, C_{0} (x_{k + 1}^{})) .

(19)

Starting the AAC update law from initial state

x_{0}

,

C_{1} (x)

is updated first by Equation (15), then it follows that

{\hat{Q}}_{1} (x_{0}^{}, u) = {\hat{Q}}_{0} (x_{0}^{}, u) + α_{c} (\underset{\overset{(18)}{=} {\hat{Q}}_{0} (x_{0}^{}, C_{0} (x_{0}))}{\underset{︸}{Ʋ (x_{0}^{}, u) + γ {\hat{Q}}_{0} (x_{1}^{}, C_{1} (x_{1}))}} - {\hat{Q}}_{0} (x_{0}^{}, u)) = {\hat{Q}}_{0} (x_{0}^{}, u) \leq {\hat{Q}}_{0} (x_{0}^{}, u), \forall u .

(20)

Since x₀ can be any state x ∈ X^E, then Equation (18) holds for

k = 1

. Assume by induction that (18) holds for some k. Using Comment 3, it follows that

\begin{array}{l} {\hat{Q}}_{k + 1} (x_{k}^{}, u_{k}) = {\hat{Q}}_{k} (x_{k}^{}, u_{k}) + α_{c} (Ʋ (x_{k}^{}, u_{k}) + γ {\hat{Q}}_{k} (x_{k + 1}^{}, C_{k + 1} (x_{k + 1})) - {\hat{Q}}_{k} (x_{k}^{}, u_{k})) \\ \leq {\hat{Q}}_{k} (x_{k}^{}, u_{k}) + α_{c} (Ʋ (x_{k}^{}, u_{k}) + γ {\hat{Q}}_{k} (x_{k + 1}^{}, C_{k} (x_{k + 1})) - {\hat{Q}}_{k} (x_{k}^{}, u_{k})) \leq \\ \overset{(18)}{\leq} {\hat{Q}}_{k - 1} (x_{k}^{}, u_{k}) + α_{c} (Ʋ (x_{k}^{}, u_{k}) + γ {\hat{Q}}_{k - 1} (x_{k + 1}^{}, C_{k} (x_{k + 1})) - {\hat{Q}}_{k - 1} (x_{k}^{}, u_{k})) = {\hat{Q}}_{k} (x_{k}^{}, u_{k}), \end{array}

(21)

and since

x_{k}, u_{k}

can be any pair (x, u) ∈ X^E × U, the conclusion of Lemma 1 follows.

Theorem 2. Let

{\hat{Q}}_{0} (x, u) > 0

, finite for any finite argument) be an initialization for the Q-function of an initial admissible controller C₀. Starting with any x₀ the control u₀ = C₀(x₀) is applied to the process. Specifically, the AAC update laws ((15), (16)) ensures that at time k = 1, C₁ is updated from C₀,

{\hat{Q}}_{0} (x, u)

is updated with (16) using u₁ = C₁(x₁) in the right-hand side, the control u₁ is sent to the process, and then k ← 2, with the above strategy repeated for subsequent times. Claim: The feedback control system under time-varying control C_k is stabilized for γ < 1 and asymptotically stabilized for γ = 1.

Proof. It is valid for the first three time steps that

{\hat{Q}}_{1} (x_{0}^{}, u_{0}) \overset{(16)}{=} {\hat{Q}}_{0} (x_{0}^{}, u_{0}) + α_{c} [Ʋ (x_{0}^{}, u_{0}^{}) + γ {\hat{Q}}_{0} (x_{1}^{}, u_{1} = C_{1} (x_{1}^{})) - {\hat{Q}}_{0} (x_{0}^{}, u_{0})] \overset{(18)}{\leq} {\hat{Q}}_{0} (x_{0}, u_{0}) .

(22a)

{\hat{Q}}_{2} (x_{1}^{}, u_{1}) \overset{(16)}{=} {\hat{Q}}_{1} (x_{1}^{}, u_{1}) + α_{c} [Ʋ (x_{1}^{}, u_{1}^{}) + γ {\hat{Q}}_{1} (x_{2}^{}, u_{2} = C_{2} (x_{2}^{})) - {\hat{Q}}_{1} (x_{1}^{}, u_{1})] \overset{(18)}{\leq} {\hat{Q}}_{1} (x_{1}, u_{1}) .

(22b)

{\hat{Q}}_{3} (x_{2}^{}, u_{2}) \overset{(16)}{=} {\hat{Q}}_{2} (x_{2}^{}, u_{2}) + α_{c} [Ʋ (x_{2}^{}, u_{2}^{}) + γ {\hat{Q}}_{2} (x_{3}^{}, u_{3} = C_{3} (x_{3}^{})) - {\hat{Q}}_{2} (x_{2}^{}, u_{2})] \overset{(18)}{\leq} {\hat{Q}}_{2} (x_{2}, u_{2}) .

(22c)

Cancelling the same terms in both sides of Equation (22a–c), since

α_{c} > 0

, it follows that the sums in square parentheses are negative. These sums are further refined using Lemma 1 as

Ʋ (x_{0}^{}, u_{0}^{}) + γ {\hat{Q}}_{0} (x_{1}^{}, u_{1}) \leq {\hat{Q}}_{0} (x_{0}, u_{0}),

(23a)

Ʋ (x_{1}^{}, u_{1}^{}) + γ {\hat{Q}}_{1} (x_{2}^{}, u_{2}) \leq {\hat{Q}}_{1} (x_{1}, u_{1}) \overset{(18)}{\leq} {\hat{Q}}_{0} (x_{1}, u_{1}),

(23b)

Ʋ (x_{2}^{}, u_{2}^{}) + γ {\hat{Q}}_{2} (x_{3}^{}, u_{3}) \leq {\hat{Q}}_{2} (x_{2}, u_{2}) \overset{(18)}{\leq} {\hat{Q}}_{1} (x_{2}, u_{2}),

(23c)

Using (23c) in (23b) it follows that

Ʋ (x_{1}^{}, u_{1}^{}) + γ Ʋ (x_{1}^{}, u_{1}^{}) + γ^{2} {\hat{Q}}_{2} (x_{3}^{}, u_{3}) \leq {\hat{Q}}_{0} (x_{1}, u_{1})

, which used in (23a) results in

Ʋ (x_{0}^{}, u_{0}^{}) + γ Ʋ (x_{1}^{}, u_{1}^{}) + γ^{2} Ʋ (x_{2}^{}, u_{2}^{}) + γ^{3} {\hat{Q}}_{2} (x_{3}^{}, u_{3}) \leq {\hat{Q}}_{0} (x_{0}, u_{0})

. Extending the exemplified reasoning backwards from infinity it follows that

\lim_{N \to \infty} (\sum_{i = 0}^{N - 1} γ^{i} Ʋ (x_{i}^{}, u_{i}^{}) + γ^{N} {\hat{Q}}_{N - 1} (x_{N}, u_{N})) \leq {\hat{Q}}_{0} (x_{0}, u_{0}) .

(24)

Since

\lim_{N \to \infty} γ^{N} {\hat{Q}}_{N - 1} (x_{N}, u_{N}) = 0

because

{\hat{Q}}_{N - 1} (x_{N}, u_{N})

is bounded since resulting from a non-increasing sequence, it follows that

\lim_{N \to \infty} \sum_{i = 0}^{N - 1} γ^{i} Ʋ (x_{i}^{}, u_{i}^{}) \leq {\hat{Q}}_{0} (x_{0}, u_{0})

is finite. The left term in the inequality is the cost of using the controller

u_{0} = C_{0} (x_{0}), u_{1} = C_{1} (x_{1}), \dots

. Then it follows that the control system remains stable under the time-varying control of the AAC updates. Moreover, for

γ = 1

, the sequence

{\sqrt{Ʋ (x_{i}^{}, u_{i}^{})}}, i = \bar{0, \infty}

must be square-summable which implies that the control system is asymptotically stabilized by

C_{k}

, thus proving the claim of the theorem.

Comment 4: The above proof resembles the stabilizing action-dependent value iteration of [30], but here relies on gradient-based updates of the Q-function estimate and of the controller, rather than on its minimization. The stability result of the Theorem 2 is valid under continuous updates of the AAC laws (15), (16) under no exploration. It ensures that, starting from an admissible controller

C_{0}

, the AAC updates (15), (16) preserve the control system stability. Since exploratory controls

u_{k}

are critical, a compromising solution is to perform the controller update (15) only for non- exploratory sampling instants, while the Q-function estimates are continuously updated per (16).

4.2. AAC Using Neural Networks As Approximators

In practice, specific FAs such as NNs are employed as approximators for the Q-function and for the controller, respectively. Using NNs as FAs (implying nonlinear features parameterization), the convergence of the learning scheme depends on a large extent to the selection of the learning parameters. However, the advantage of using generic NN architectures is that no manual or automatic feature selection is needed for parameterizing the Q-function and controller estimates.

The proposed AAC design herein uses two NNs to approximate the Q-function (critic, referred to herein as Q-NN), and the controller (actor, referred to herein as C-NN), respectively. Assume an initial admissible NN VRFT state-feedback controller exists. Let the critic NN FA and controller NN FA be parameterized as

{\hat{Q}}_{k} (x_{k}^{E}, u_{k}, θ_{k}^{c})

and

C_{k} (x_{k}^{E}, θ_{k}^{a})

, respectively. With three-layer feed-forward NNs having one hidden layer, fully connected with bias, the critic and the controller are modeled by:

{\hat{Q}}_{k} = W_{c, n_{h c} + 1}^{k} + \sum_{i = 1}^{n_{h c}} W_{c, i}^{k} σ_{i} (\sum_{j = 1}^{n_{i c} + 1} V_{c, j i}^{k} I_{j}^{k}),

(25)

u_{k}^{l} = W_{a, n_{h a} + 1}^{k, l} + \sum_{i = 1}^{n_{h a}} W_{a, i}^{k, l} σ_{i} (\sum_{j = 1}^{n_{i a} + 1} V_{a, j i}^{k} I_{j}^{k}), l = \bar{1, m},

(26)

with

W_{c} = {[W_{_{c, 1}}^{k} \dots W_{c, n_{h c} + 1}^{k}]}^{T}

—the critic output layer weights having n_hc hidden neurons,

V_{c} = {[V_{c, j i}^{k}]}_{i = 1 \dots n_{h c}, j = 1 \dots n_{i c} + 1}

—the critic hidden layer weights matrix,

I_{c}^{k} = {[I_{1}^{k} \dots I_{n_{i c} + 1}^{k}]}^{T} = {[{(x_{k}^{E})}^{T} {(u_{k})}^{T} 1]}^{T}

—the critic input vector of size n_ic + 1 (the bias input is constant 1),

W_{a}^{k, l} = {[W_{a, 1}^{k, l} \dots W_{a, n_{h a} + 1}^{k, l}]}^{T}, l = 1 .. m

– the output layer weights of the l-th controller output

u_{k} = {[u_{k}^{1} \dots u_{k}^{m}]}^{T}

, having n_ha hidden neurons, and

V_{a} = {[V_{a, j i}^{k}]}_{i = 1 \dots n_{h a}, j = 1 \dots n_{i a} + 1}

—the controller hidden layer weights matrix,

I_{a}^{k} = {[I_{1}^{k} \dots I_{n_{i a} + 1}^{k}]}^{T} = {[{(x_{k}^{E})}^{T} 1]}^{T}

– the controller input vector of size n_ia + 1 (bias input 1 included). σ_i = tanh_i, is the hyperbolic tangent activation function at the output of i^th hidden neuron. The controller and critic are cascaded, with the actor output u_k as a part of the critic input alongside

x_{k}^{E}

The actor and critic weights are formally parameterized as

θ_{a}^{k} = {[{(W_{a}^{k, l})}^{T} {(V_{a}^{k})}^{T}]}^{T}

and

θ_{c}^{k} = {[{(W_{c}^{k})}^{T} {(V_{c}^{k})}^{T}]}^{T}

, respectively.

The AAC’s gradient descent tuning rules for the critic (25) and controller (26) as parameterized variants of (15) and (16) are

{\begin{cases} θ_{k}^{c} = θ_{k - 1}^{c} + α_{c} \cdot {\frac{\partial \hat{Q} (x, u, θ)}{\partial θ} |}_{(x_{k - 1}^{E}, u_{k - 1}, θ_{c}^{k - 1})} \cdot δ_{k}, detailed as : \\ W_{c, i}^{k} = W_{c, i}^{k - 1} + δ_{k} α_{c} \times {\begin{cases} σ_{i} (\sum_{j = 1}^{n_{i c} + 1} V_{c, j i}^{k - 1} I_{j}^{k - 1}), & if i \neq n_{h c}^{} + 1, \\ 1, & if i = n_{h c}^{} + 1, \end{cases} \\ V_{c, j i}^{k} = V_{c, j i}^{k - 1} + δ_{k} α_{c} W_{c, i}^{k - 1} I_{j}^{k - 1} σ_{i}^{'} (\sum_{j = 1}^{n_{i c} + 1} V_{c, j i}^{k - 1} I_{j}^{k - 1}), \end{cases}

(27)

{\begin{cases} θ_{k}^{a} = θ_{k - 1}^{a} - α_{a} \cdot {\frac{\partial \hat{Q} (x, u, θ)}{\partial u} |}_{(x_{k - 1}^{E}, u_{k - 1}, θ_{c}^{k - 1})} \cdot {\frac{\partial u (x, θ)}{\partial θ} |}_{(x_{k - 1}^{E}, θ_{a}^{k - 1})}, as : \\ W_{a, i}^{k, l} = W_{a, i}^{k - 1, l} - α_{a} \underset{{\frac{\partial \hat{Q} (x, u, θ)}{\partial u} |}_{(x_{k - 1}^{E}, u_{k - 1}, θ_{c}^{k - 1})}}{\underset{︸}{(\sum_{i = 1}^{n_{h c}} W_{c, i}^{k - 1} V_{c, ξ i}^{k - 1} σ_{i}^{'} (\sum_{j = 1}^{n_{i c} + 1} V_{c, j i}^{k - 1} I_{j}^{k - 1}))}} \times {\begin{cases} σ_{i} (\sum_{j = 1}^{n_{i a} + 1} V_{a, j i}^{k - 1} I_{j}^{k - 1}), & if i \neq n_{h a} + 1, \\ 1, & if i \neq n_{h a} + 1, \end{cases} \\ V_{a, j i}^{k} = V_{a, j i}^{k} - α_{a} {\frac{\partial \hat{Q} (x, u, θ)}{\partial u} |}_{(x_{k - 1}^{E}, u_{k - 1}, θ_{c}^{k - 1})} W_{a, i}^{k - 1, l} I_{j}^{k - 1} {σ^{'}}_{i} (\sum_{j = 1}^{n_{i a} + 1} V_{a, j i}^{k - 1} I_{j}^{k - 1}), \end{cases}

(28)

with α_a, α_c—the learning rate magnitudes of the critic and controller training rules, respectively and

σ_{i}^{'}

is the derivative of σ_i w.r.t its argument and ξ is the index of

u_{k}^{l}

in the

I_{c}^{k}

.

In many practical applications, the designer chooses to perform either full or partial adaptation of the NNs’ weights, the latter implying only output weights adaptation. In this latter case, the Q-NN and C-NN parameterizations are:

\begin{array}{l} {\hat{Q}}_{k} (x_{k}^{E}, u_{k}) = {(W_{c}^{k})}^{T} σ (V_{c}^{k} {[{(x_{k}^{E})}^{T} {(u_{k})}^{T} 1]}^{T}) = {(W_{c}^{k})}^{T} Φ_{c}^{k} (x_{k}^{E}, u_{k}), \\ u_{k} = C_{k} (x_{k}^{E}) = {(W_{a}^{k})}^{T} σ (V_{a}^{k} {[{(x_{k}^{E})}^{T} 1]}^{T}) = {(W_{a}^{k})}^{T} Φ_{c}^{k} (x_{k}^{E}), \end{array}

(29)

where

Φ_{c}^{k} (x_{k}^{E}, u_{k}), Φ_{a}^{k} (x_{k}^{E})

are the matrices of basis functions (or input features) and W_c^k, W_a^k are the tunable output weights parameters, rendering

{\hat{Q}}_{k}, u_{k}

as linear combination of basis functions. This linear parameterization simplifies the convergence analysis but also requires manual features selection and training as a disadvantage.

As it is well-known, the AAC architecture performs online with the C-NN sending controls to the process and the Q-NN serving both to estimate the Q-function and to adaptively tune the C-NN. The closed-loop control system with the process (3) combined with the AAC tuning rules (27), (28) has the unique property that its dynamics is mainly driven by the reference input r_k viewed as a particular state of the extended state vector and, possibly, by exogenous unknown disturbances. Since r_k is user selectable, it can be used to drive the control system in a wide operating range to ensure efficient exploration of the state space. Enhanced exploration of the domain X^E × U can be performed by trying random actions in every state, usually as additive uniform random actions.

4.3. Convergence of the AAC Learning Scheme with NNs

While the results from Section 4.1 are formulated under generic functions for the Q-function and for the controller, the convergence to the optimal controller and optimal Q-function is not ensured. In the following, the convergence to of the AAC learning with NNs is shown. Linear parameterization is the most widely used and supports tractable analysis. Let the output weight parameterization (29) of the Q-NN and C-NN lead to the update laws

\begin{array}{l} W_{c}^{k} = W_{c}^{k - 1} + α_{c} δ_{k} Φ_{c}^{k} (x_{k - 1}^{E}, u_{k - 1}), \\ W_{a}^{k} = W_{a}^{k - 1} - α_{a} Φ_{a}^{k} (x_{k - 1}^{E}) {(W_{c}^{k - 1})}^{T} \frac{\partial Φ_{c}}{\partial u} (x_{k - 1}^{E}, u_{k - 1}), \end{array}

(30)

be compactly written as

\begin{array}{l} W_{c}^{k} = W_{c}^{k - 1} + α_{c} δ_{k} Φ_{c}^{k - 1}, \\ W_{a}^{k} = W_{a}^{k - 1} - α_{a} Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1}, \end{array}

(31)

where

δ_{k} = \underset{U^{k - 1}}{\underset{︸}{Ʋ (x_{k - 1}^{E}, u_{k - 1})}} + {(W_{c}^{k - 1})}^{T} (γ Φ_{c}^{k} - Φ_{c}^{k - 1})

Let the weights of the optimal Q-NN and optimal C-NN be W_c^*, W_a^*, the estimation errors being

{\tilde{W}}_{c}^{k} = W_{c}^{k} - W_{c}^{*}, {\tilde{W}}_{a}^{k} = W_{a}^{k} - W_{a}^{*}

that render the estimation error dynamics

\begin{array}{l} {\tilde{W}}_{c}^{k} = {\tilde{W}}_{c}^{k - 1} + α_{c} δ_{k} Φ_{c}^{k - 1}, \\ {\tilde{W}}_{a}^{k} = {\tilde{W}}_{a}^{k - 1} - α_{a} Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1}, \end{array}

(32)

Some assumptions follow:

A7. Let the estimation error of the critic be denoted as

ζ_{c}^{k - 1} = {({\tilde{W}}_{c}^{k - 1})}^{T} Φ_{c}^{k - 1}

, let the critic’s and actor’s hidden activation layers be bounded as

{‖ Φ_{c}^{k - 1} ‖}^{2} \leq {\bar{φ}}_{c}, {‖ Φ_{a}^{k - 1} ‖}^{2} \leq {\bar{φ}}_{a}

and let the critic’s activation layer derivative w.r.t. u be bounded as

{‖ Φ_{c, u}^{k - 1} ‖}^{2} \leq {\bar{φ}}_{c, u}

where Frobenius norm was used, which is equivalent to the Euclidean norm when it is applied to vectors. The above upper bounds follow since the activation functions are bounded and so are their derivatives.

Theorem 3. Under A7, the AAC learning scheme converges to a vicinity of the optimal controller and optimal Q-function since the estimation errors

{\tilde{W}}_{c}^{k}, {\tilde{W}}_{a}^{k}

are uniformly ultimately bounded provided that

α_{c} > 4 {\bar{φ}}_{c}^{2} - 2 {\bar{φ}}_{c} + 1

, for

{\bar{φ}}_{c} > 1

.

Proof: See Appendix B.

Comment 5. Notice that the temporal difference error

δ_{k}

is calculated in terms of the Q-NN’s output. Then

δ_{k}

is backpropagated to correct the Q-NN weights. Moreover,

δ_{k}

is further backpropagated to correct the C-NN weights, since the C-NN output is an input to the Q-NN. Hence, the resulted AAC architecture belongs to the deep learning approaches (the architecture is presented in Figure 1b of the next Section).

4.4. Summary of the Mixed VRFT-AAC Design Approach

The steps of the VRFT-AAC design approach are summarized next:

S1. Collect input-state-output samples from the open-loop stable process (1) in a dataset

D = {{\tilde{u}}_{k}, {\tilde{x}}_{k}, {\tilde{y}}_{k}} \subset U \times X \times Y, k = \bar{0, N - 1}

where

{\tilde{u}}_{k}^{}

persistently exciting, under conditions of A5.

S2. Obtain the initial state-feedback VRFT controller by minimizing the c.f. in Equation (6) as

θ_{k}^{a} = \arg \min_{θ} J_{V R}^{N} (θ)

. When NNs are used, minimization of the c.f. (6) is equivalent to training the NN. The obtained controller is

C_{k} (x_{k}^{E}, θ_{k}^{a})

, which is a controller for both the process (1) and for the extended process (3). It is also a close initialization to the optimal controller that minimizes

J_{M R}^{\infty}

from (5), since VRFT identifies a controller that approximately minimizes

J_{M R}^{N}

from (10) as a finite horizon version of

J_{M R}^{\infty}

. This is supported by Theorem 1.

S3. Close the control system loop on process (1) (it is equivalent to closing it on extended process (3)) using controller

C_{k} (x_{k}^{E}, θ_{k}^{a})

. The architecture is presented in Figure 2b). Use update (16) (in explicitly parameterized form, use (28)) under an exploratory reference input

r_{k}

in order to learn the Q-function of the controller

C_{k} (x_{k}^{E}, θ_{k}^{a})

. This serves as properly initializing

{\hat{Q}}_{k} (x_{k}^{E}, u_{k}, θ_{k}^{c})

for the subsequent AAC tuning.

S4. Use the updates (15), (16), (27), (28), in explicitly parameterized form), in this exact order and under an exploratory reference input

r_{k}

, to learn the optimal controller

C^{*} (x_{k}^{E}, θ_{k}^{a}^{*} = W_{a}^{*})

and the optimal Q-function

{\hat{Q}}^{*} (x_{k}^{E}, u_{k}, θ_{k}^{c}^{*} = W_{c}^{*})

. Using the above updates for a finite time on a random learning scenario is called a learning episode.

S5. After every learning episode, measure the tracking performance on a standard test scenario. When the prescribed number of maximum tests is reached or the tracking performance on the standard scenario is not improving anymore, the controller learning is stopped. Otherwise, proceed to the next learning episode.

All implementation details of the above VRFT-AAC design are presented in the following Section 5.1 when validation is performed on the complex multivariable tank system case study.

5. Validation Case Study

5.1. AAC Design for a MIMO Vertical Tank System

The controlled process is a vertical MIMO two-tank system (Figure 1c) built around a three-tank laboratory equipment [47] with the continuous-time state-space equations

\begin{array}{l} {\dot{H}}_{1} = \frac{k}{a w} u_{1} - \frac{1}{a w} {\bar{C}}_{1} H_{1}^{α 1} (2.5 {\tilde{u}}_{2} - 0.5), \\ {\dot{H}}_{2} = \frac{1}{a w} {\bar{C}}_{1} H_{1}^{α 1} (2.5 {\tilde{u}}_{2} - 0.5) - \frac{1}{c w + \frac{H_{2}}{H_{2 \max}} b w} {\bar{C}}_{2} H_{2}^{α 2}, \\ {\tilde{u}}_{2} = \min (\max (u_{2}, 0.6), 1), \end{array}

(33)

with

a = 0.25 [m]

,

w = 0.035 [m]

,

c = 0.1 [m]

,

b = 0.345 [m]

and

H_{1 \max} = H_{2 \max} = 0.35 [m]

.

x_{1} = y_{1} = H_{1} \in [0, H_{1 \max}]

and

x_{2} = y_{2} = H_{2} \in [0, H_{2 \max}]

are the water levels in the two tanks considered as system states and controlled outputs. The control inputs

u_{1}, u_{2} \in [0, 1]

(also expressible in [%]) are the duty cycles of the pump direct current (DC) motor and of the electrically controlled valve

C_{1}

, respectively.

k = 1.66 \cdot 10^{- 4} [m^{3} / (s \cdot %)]

is the gain from the pump input to the inflow,

{\bar{C}}_{1} = 5.65 \cdot 10^{- 5} [m^{3 - α 1} / s]

,

{\bar{C}}_{2} = 8 \cdot 10^{- 5} [m^{3 - α 2} / s]

are the resistances of the outflow orifices of the first (upper) and second (lower) tank, called

T_{1}

and

T_{2}

, respectively, and α₁ = 0.29, α₂ = 0.22. The third equality in Equation (33) reflects the dead-zone plus saturation in the second control input u₂.

Features of this process include: no water level setpoint for the tank

T_{2}

can be set if the water level in

T_{1}

is zero; the electrical valve controlling the outflow from

T_{1}

(which is inflow to

T_{2}

) is changed by

{\tilde{u}}_{2} (u_{2})

; no setpoint can be tracked for each tank if there is more outflow than inflow;

T_{1}

’s outflow has a minimum value and can be zero only when

H_{1} = 0

, as per first equality in (33).

T_{2}

’s dynamics is slower than

T_{1}

’s. Proper selection of the parameters

{\bar{C}}_{1}, {\bar{C}}_{2}

through manual valves allows feasible control trajectories in the constrained input-state-output space. Discretization of (33) reveals its Markov form. The water level is measured using piezoelectric sensors (

P S_{1}

and

P S_{2}

in Figure 1c). Protection logic disables the pump voltage when water level exceeds the upper bound. The sampling period used for control experiments is

T_{s} = 0.5 s

. Model (33) is not used for control design.

For VRFT-based control design, the ORM is selected as the ZOH discretization of

M (s) = d i a g (M_{1} (s), M_{2} (s))

.

M_{1} (s) = ω_{0}^{2} / (s^{2} + 2 ς ω_{0} s + ω_{0}^{2})

, with the damping factor

ς = 1.0

and the natural frequency

ω_{0} = 0.5 r a d / s

selects the speed and shape of the desired response

y_{k, 1}^{m}

while similar

M_{2} (s)

with

ς = 1.0

and

ω_{0} = 0.2 r a d / s

describes

y_{k, 2}^{m}

. For collecting the open-loop input-state-output data

{{\tilde{u}}_{k}, {\tilde{x}}_{k}, {\tilde{y}}_{k}}_{k = 0 \dots N - 1}

, 16,000 samples have been generated from 16,000 samples of a uniformly random sequence of persistently exciting steps lasting for 20 s

{\tilde{u}}_{k} = {[{\tilde{u}}_{k, 1} {\tilde{u}}_{k, 2}]}^{T} \in [0, 0.5] \times [0, 1]

, for an experiment of 8000 s. The collected data is displayed in Figure 2 and ensures the exploratory conditions from Assumption A5.

The controllable canonical state-space realizations (A₁, B₁, C₁, D₁) and

(A_{2}, B_{2}, C_{2}, D_{2})

and of

M_{1} (z)

and

M_{2} (z)

are, respectively:

\begin{array}{l} A_{1} = (\begin{matrix} 1.5576 & - 0.6065 \\ 1 & 0 \end{matrix}), B_{1} = (\begin{matrix} 1 \\ 0 \end{matrix}), C_{1}^{} = {(\begin{array}{l} 0.0265 \\ 0.0224 \end{array})}^{T}, D_{1} = 0, \\ A_{2} = (\begin{matrix} 1.8097 & - 0.8187 \\ 1 & 0 \end{matrix}), B_{2} = (\begin{matrix} 1 \\ 0 \end{matrix}), C_{2}^{} = {(\begin{array}{l} 0.0047 \\ 0.0044 \end{array})}^{T}, D_{2} = 0, \end{array}

(34)

The ORM state will then be

x_{k}^{m} = {[x_{k, 1}^{m} x_{k, 2}^{m} x_{k, 3}^{m} x_{k, 4}^{m}]}^{T}

. The virtual reference

{\bar{r}}_{k} = M {(z)}^{- 1} {\tilde{y}}_{k}

is used as input to the state-space models (34) to obtain the ORM’s virtual states

{\tilde{x}}_{k}^{m}

. The extended virtual state vector

{\tilde{x}}_{k}^{E} = {[{({\tilde{x}}_{k})}^{T} {({\tilde{x}}_{k}^{m})}^{T} {({\tilde{r}}_{k})}^{T}]}^{T} \in ℜ^{8}

is used to offline compute the C-NN via VRFT by fitting the inputs

{\tilde{x}}_{k}^{E}

to the outputs

{\tilde{u}}_{k}

. Note that using two second order ORM’s produces four states in the extended state-space, which is disadvantageous. The ORM’s orders should to be as low as possible (usually one) but a second order model offers greater flexibility in output response shaping.

The VRFT C-NN architecture (Figure 1a) is a feedforward 8–10–2 fully connected one with biases having

n_{h a} = 10

hidden neurons with

\tanh ()

activation function and the output activation functions are linear. The weights are initialized with random uniform numbers in

[- 1.5, 1.5]

. Since the NN training is performed offline, standard gradient backpropagation training with Levernberg-Marquardt [48] is used for maximum 50 epochs to learn a stabilizing VRFT C-NN controller

C (x_{k}^{E})

for the MIMO control system, that minimizes

J_{V R}^{N}

. 80% of the data is effectively used for training while the rest of 20% serves as validation data. Early stopping is used after six consecutive increases in the mean sum of squared error evaluated on the validation data. Other offline training algorithms such as Broyden-Fletcher-Goldfarb-Shanno [49,50] and conjugate gradient [51,52] may be similarly efficient while their computational burden is prohibitive for online real-time training.

Results on a standard test scenario with the initial VRFT C-NN controller are shown in Figure 3. It is observed that the ORM tracking errors are bounded, since the VRFT controller is stabilizing (though not asymptotically) and validates the theoretical results of Theorem 1 and Corollary 1. Then it is an admissible controller for

J_{M R}^{\infty}

in Equation (5) with

γ < 1

. The initial controller tuning using VRFT is attractive also because it has learned a feature matrix

Φ_{a} (x_{k}^{E})

, so from this point onwards, the designer can choose to perform either output weights adaptation or full weights adaptation.

The Q-function estimate of the VRFT C-NN (i.e., the critic Q-NN) is next learned in a policy evaluation step in order to serve as a good initial estimate of the Q-function that is needed for the following AAC learning and also to fulfil the requirements of Lemma 1. This step is possible since the VRFT controller is admissible and, for properly selected learning rate, the weights of the Q-NN will converge. The critic Q-NN approximating the Q-function has similar architecture with the C-NN, of size 10–25–1 (eight states and two controls), with

n_{h c} = 25

. The critic Q-NN output weights are randomly drawn from a zero-mean normal distribution with variance

σ^{2} = 90

while the hidden layer weighs are uniformly randomly initialized in

[- 1.5, 1.5]

. Setting

γ = 0.95

, the learning rates

α_{c} = 0.01

in (27) and

α_{a} = 0

in (28) (no controller tuning), all the Q-NN weights are updated using the gradient back-propagation in (27), by driving the MIMO control system with a sequence of uniformly random piecewise constant steps in

r_{k} = {[r_{k, 1} r_{k, 2}]}^{T} \in [0.05, 0.25] \times [0.01, 0.2]

. This procedure also serves as a tuning step for

α_{c}

. With

r_{k, 1}

and

r_{k, 2}

lasting 20 s and 33 s, respectively, we ensure they do not switch simultaneously, to better reveal the coupling effects between the control channels. After 500 s, the critic weights stabilize, the output weights being shown in Figure 4 for 4500 s. This pre-tuned Q-NN will be used as initialization to the following case studies. After this intermediate tuning step of the Q-NN, the designer can choose for full weight adaptation of the Q-NN or only for output weights’ adaptation of the Q-NN while the features matrix

Φ_{c} (x_{k}^{E}, u_{k})

is kept constant.

The C-NN is now further tuned (in the architecture from Figure 1b to improve the ORM control performance. Setting

α_{a} = 10^{- 8}

in (28) and

α_{c} = 0.01

in (27) (critic adaptation should generally be faster than actor adaptation), both the C-NN and Q-NN are adaptively trained online. Although carried out in an adaptive framework, the training unfolds on consecutive episodes where the feedback control system is driven by a sequence of random reference input steps for 700 s. The reference inputs are uniformly random piecewise constant steps in

r_{k} = {[r_{k, 1} r_{k, 2}]}^{T} \in [0.05, 0.25] \times [0.01, 0.2]

, with

r_{k, 1}

and

r_{k, 2}

lasting 20 s and 33 s, respectively. Updates (27), (28) are skipped when either

r_{k, 1}

or

r_{k, 2}

switch, to preserve the Markov property of the extended model. The controller parameters at the end of an episode are the initial ones for the following episode, the Q-NN weights following the same transfer rule. To ensure enhanced exploration of the state-action space, the C-NN controller output is perturbed every third sample time with probing noise according to:

u_{k} = C (x_{k}^{E}) + (\begin{array}{l} r a n d \\ r a n d \end{array}) \cdot Ω, Ω = {\begin{matrix} 1 & if \mod (k, 3) = 0, \\ 0 & otherwise, \end{matrix}

(35)

where

r a n d

is a normally distributed random number with zero-mean and variance

σ^{2} = 3.56

and

\mod (k, s)

is the remainder after dividing k by s. This is in fact a form of ε⁰-greedy exploration strategy useful to try many actions in the vicinity of the current state. A typical learning episode is shown in Figure 5. After each learning episode, the learning is stopped and the C-NN performance is measured on the standard test scenario from Figure 3 and the decrease of a finite-time version of the c.f.

J_{M R}^{\infty}

from Equation (5), namely

J_{M R}^{1400}

, is aimed. This standard test scenario is not seen during training. The learning then resumes with the next episode. After maximum 30 learning episodes (meaning 21,000 s and 42,000 samples), the C-NN and Q-NN adaptations are stopped and the learning trial (comprising of learning episodes) converges under Theorem 3. The final adaptively learned C-NN and the initial VRFT controller are shown performing in Figure 3. The episodic learning allows us to test the improvement in control performance between episodes. Figure 6 illustrates the

J_{M R_A C}^{1400}

decrease over episodes of a convergent learning trial. Throughout each learning episode, under the AAC update laws, the control system preserves its stability, ensured by Theorem 2.

For comparisons, a model-free approximated batch-fitted Q-learning (BFQ) controller [53,54] is also proposed, using the same Q-NN and C-NN architectures with the same sizes. BFQ alternates offline training of the C-NN and the Q-NN, using 12,000 transition samples collected under the randomly perturbed model-free single-input single-output VRFT linear controllers

C_{1} (z) = (2.6092 + 0.1184 z^{- 1} - 2.3609 z^{- 2}) / (1 - z^{- 1})

and

C_{2} (z) = (1.5735 + 0.2405 z^{- 1} - 1.3547 z^{- 2}) / (1 - z^{- 1})

, independently designed for the two tanks, respectively. BFQ implements a Value Iteration algorithm. The training settings assume that the weights of both NNs are initialized to uniform random numbers in

[- 1.5, 1.5]

. Maximum 200 epochs are used for training with Levenberg-Marquardt on 80% effectively training data and 20% validation data. Early stopping is employed to prevent overfitting after six maximum increases of the mean sum of squared errors on the validation data. 200 iterations of BFQ take about 2 hours and the best controller is saved.

The value of

J_{M R}^{1400}

for the initial VRFT controller is

J_{M R_V R F T}^{1400} = 1.58

, for the final adaptively learned controller is

J_{M R_A A C}^{1400} = 0.45

and for the BFQ controller is

J_{M R_M B B F Q}^{1400} = 0.32

.

Additionally, a model-based approximated BFQ solution is also offered for comparisons. A first dimensionality reduction of the extended state space is performed by inspecting that, for both the ORMs,

x_{k + 1, 2}^{m} = x_{k, 1}^{m}

and

x_{k + 1, 4}^{m} = x_{k, 3}^{m}

from the state-space matrices in Equation (34) and, moreover,

y_{k, 1}^{m} = C_{1}^{} {[x_{k, 1}^{m} x_{k, 2}^{m}]}^{T} \approx 2 C_{1} (1) x_{k, 1}^{m}

,

y_{k, 2}^{m} = C_{2}^{} {[x_{k, 3}^{m} x_{k, 4}^{m}]}^{T} \approx 2 C_{2} (1) x_{k, 3}^{m}

. Then

x_{k, 2}^{m}, x_{k, 4}^{m}

are considered approximate duplicates of

x_{k, 1}^{m}, x_{k, 3}^{m}

and removed from the extended state vector, now defined as a reduced extended state vector

x_{k}^{E R} = {[x_{k, 1}^{}, x_{k, 2}^{}, x_{k, 1}^{m}, x_{k, 3}^{m}, r_{k, 1}, r_{k, 2}]}^{T} \in ℜ^{6}

when used for feedback and controller learning. For

[0.05, 0.25] \times [0.03, 0.15] \times [0.5, 4.5] \times [1, 12] \times [0.05, 0.25] \times [0.03, 0.15]

as the domain of

x_{k}^{E R}

and

[0, 1] \times [0.5, 1]

the domain of

u_{k}

, we generate a grid of

N P = 5 \times 5 \times 7 \times 7 \times 6 \times 6 \times 3 \times 3 = 396900

linearly spaced points. 5 points for each of

x_{k, 2}^{m}, x_{k, 4}^{m}

would have led to

N P

of almost 10 million. Let the discrete domains be denoted

X_{d}^{E}

and

U_{d}

, respectively. Domains of

x_{k, 1}^{m}, x_{k, 3}^{m}

and

r_{k, 1}^{}, r_{k, 2}^{}

are found by simulating the ORMs offline such that

y_{k, 1}^{m}, y_{k, 2}^{m}

overlap the constrained domains of

x_{k, 1}^{} = y_{k, 1}, x_{k, 2}^{} = y_{k, 2}

. For a Q-NN of size 8–8–1, a C-NN of size 6–6–2 and

γ = 0.8

found to ensure learning convergence, each iteration of the model-based BFQ trains both the Q-NN and the C-NN. For the Q-NN current iteration estimate (indexed by

i t e r

) denoted

{\hat{Q}}_{i t e r} (x_{k}^{E R}, u_{k}^{})

, the inputs patterns are

{{[{(x_{k}^{E R})}^{T} {(u_{k})}^{T}]}^{T}}

and the target patterns are

{Ʋ_{MR} (x_{k}^{E R}) + γ \min_{u \in U_{d}} {\hat{Q}}_{i t e r - 1} (F (x_{k}^{E R}, u_{k}^{}), u)} .

, evaluated for all the points in

X_{d}^{E} \times U_{d}

. For the current iteration C-NN

C_{i t e r} (x_{k}^{E R})

, the input patterns are

{(x_{k}^{E R})}

and the target patterns are

{u_{k} = \arg \min_{u \in U_{d}} {\hat{Q}}_{i t e r} (x_{k}^{E R}, u)}

. Note that evaluation of

F (x_{k}^{E R}, u_{k}^{})

to get

x_{k + 1}^{E R}

uses the original extended state vector where

x_{k, 2}^{m}, x_{k, 4}^{m}

are copies of generated

x_{k, 1}^{m}, x_{k, 3}^{m}

and a piecewise constant reference input generative model is used where

r_{k + 1, 1}^{} = r_{k, 1}, r_{k + 1, 2}^{} = r_{k, 2}

. To keep the training computationally tractable and timely, only one third of uniformly sampled data points form the training set are used, differently for each of the Q-NN and the C-NN. The weights of both NNs are initialized to uniform random numbers in

[- 1.5, 1.5]

. Maximum 200 epochs are used for training with Levenberg-Marquardt on 80% training data and 20% validation data. Early stopping is employed to prevent overfitting after six maximum increases of the mean sum of squared errors on the validation data. Just 14 iterations (taking about 20 min) of this approximate model-based BFQ produce the control results in Figure 3 (in magenta), with

J_{M R_M F B F Q}^{1400} = 0.25

naturally the smallest, with the best ORM tracking performance, since it uses the process model.

5.2. Statistical Investigations of the AAC Control Performance

Several thorough investigation case studies are considered, for which the initial C-NN tuned by VRFT and the initial Q-NN are the same. The investigation concerns full vs. partial tuning of the Q-NN and C-NN weights, while measuring the probing noise effect on the convergence. All statistics are measured on learning trials of maximum 50 episodes. The minimal and average

J_{M R_A A C}^{1400}

values on 100 trials are measured along with the success percentage of convergent learning trials. The average number of episodes until reaching the minimal

J_{M R_A A C}^{1400}

of a successful learning trial together with the standard deviation of the number of episodes in a successful learning trial, are both rendered in Table 1.

Case 1. Under learning rates

α_{c} = 0.01

in (27) and

α_{a} = 10^{- 8}

in (28), with full adaptation of the Q-NN and of the C-NN, the learning process convergence starting from the initial C-NN tuned by VRFT and initial Q-NN is investigated. For a convergent learning trial, for the constant learning rates being used, the best performance did never drop below

J_{M R_A A C}^{1400} = 0.46

, which is inferior to the BFQ performance, suggesting that the proposed adaptive learning strategy is prone to getting stuck in local minima under the adaptive gradient-based update rules. In fact, BFQ is generally advertised as being more data-efficient, although actor-critic learning architectures also allow alternative updates of the C-NN and Q-NN for improving data usage efficiency. The learning trials converges in about 88% of the cases, comparable with other perturbed AAC designs [55,56] given the wide operating range used for the controlled process.

Case 2. For full weights adaptation of both Q-NN and C-NN, without random perturbation of the control action, with

α_{c} = 0.01

in (27) and

α_{a} = 10^{- 7}

in (28), the convergence rate drops to 53% and the best performance is

J_{M R_A A C}^{1400} = 0.45

.

Case 3. In the case of output weights only tuning of both Q-NN and C-NN, with random perturbation of the control action, with

α_{c} = 0.01

in (27) and

α_{a} = 10^{- 6}

in (28), 100% convergence rate was observed, but the performance never dropped below

J_{M R_A A C}^{1400} = 0.59

.

Case 4. With output weights only tuning of both the Q-NN and C-NN, starting from the initialized C-NN tuned by VRFT and the initial C-NN, in the absence of random perturbation of the control action, with

α_{c} = 0.01

in (27) and

α_{a} = 10^{- 5}

in (28), the AAC learning is 100% convergent in all trials but the performance never drops below

J_{M R_A A C}^{1400} = 0.50

. For

α_{a} = 10^{- 6}

, the average number of episodes per trial increases only to 11.

The above four case studies are statistically characterized in Table 1. Concluding, full weight tuning of Q-NN and C-NN offers better performance (smaller

J_{M R_A A C}^{1400}

) than when output weights only tuning is used. But full weight tuning lowers the convergence rate, as prone to stuck in local minima. In Case 1 vs. Case 2, the probing noise significantly improves the convergence, slightly improves the average

J_{M R_A A C}^{1400}

and decreases the average number of episodes per convergent trial. While full weights adaptation is more sensitive since even small corrections in the input-to-hidden layer weights may lead to learning divergence.

The output weights only tuning is more robust, with 100% convergence success rate to an improved solution, but with inferior achievable performance. The perturbing noise in this case worsens the average number of episodes per trial and the best achievable performance. Guaranteed convergence to an improved solution corresponding to a local minimum in Cases 3 and 4 is also caused by the good initial tuning offered by VRFT. Case 4 with output weights only tuning without probing noise offers the best compromise regarding convergence, performance and few episodes per trial (i.e., fewer transition samples until convergence).

The initial Q-function learning of the NN-VRFT controller is not necessary and learning convergence was obtained without this step also. For the selected critic learning rate, the weights converge fast enough, however, this step serves for tuning the critic learning rate and also for initialization of the features matrix when output weights only adaptation is sought. This tuning step is achievable exactly because an initially stabilizing VRFT controller exists.

5.3. Comments on the AAC Learning Performance

The AAC learning data efficiency is clearly inferior to the BFQ strategy as a comparable model-free approach. The BFQ control is learned from scratch just from transition samples, whereas the proposed AAC controller learns from a NN-VRFT controller delivering an initial suboptimal ORM tracking solution. Both AAC and model-free BFQ are inferior to the model-based BFQ solution which exploits the process model knowledge.

AAC is a form of Action Dependent Heuristic Dynamic Programming which is also less data-efficient than other similar approaches such as Dual Heuristic Programming, where the learned co-state vector carries more information than the Q-function. On the other hand, AAC is less computationally demanding and requires less memory than Dual Heuristic Programming, BFQ and model-based BFQ, owing to AAC’s adaptive implementation. However, AAC becomes competitive when used together with VRFT since the VRFT pre-tuning provides an initial controller close to the optimal one, which then can be fine-tuned using AAC. The initial NN VRFT controller ensures stabilized exploration over a wide operating range for ensuring ORM tracking in a wide range, which is equivalent to indirect feedback linearization. Then the combined VRFT-AAC design for ORM tracking is more attractive for practical data-driven applications [57,58].

6. Conclusions

A model-free combination of VRFT and AAC design approach was successfully validated to learn improved nonlinear state-feedback control for linear ORM tracking in a wide operating range. Learned controllers indirectly account for several nonlinearities such as actuator saturation plus dead-zone and output saturation, while they also show good decoupling abilities. AAC design shares similar conceptual framework with model-free techniques like Q-Learning, or SARSA, VRFT, Iterative Feedback Tuning and model-free Iterative Learning Control, by exploiting only the process model structure but not its parameters. The convergence of the proposed adaptive learning strategy relies on several key aspects: efficient exploration correlated with the size of the training dataset and with the process complexity, selected learning architecture and selection of the approximators with appropriate parameterizations. In a wider context, VRFT shows significant potential for obtaining close-to-optimal initially admissible controllers with respect to the ORM objective.

Future work targets the validation of the proposed tuning approach to other difficult nonlinear processes and its improvement using data-driven techniques.

Author Contributions

M.-B.R. developed the theoretical results, wrote the paper and performed the experiments; R.-E.P. revised the mathematical formulations, analyzed the algorithms and ensured the hardware and software support. All authors have read and approved the final paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Theorem 1.

Let

s ({\hat{x}}_{k}, {\hat{u}}_{k - 1}) - s ({\tilde{x}}_{k}, {\tilde{u}}_{k - 1}^{}) = {\hat{y}}_{k} - {\tilde{y}}_{k} = Δ y_{k}

, where

{\hat{y}}_{k} = s ({\hat{x}}_{k}, {\hat{u}}_{k - 1} = C ({\hat{ζ}}_{k - 1}, \hat{θ}))

. In VRFT, it is also valid that

{\tilde{y}}_{k} = s ({\tilde{x}}_{k}, {\tilde{u}}_{k - 1}^{}) = s^{m} ({\tilde{x}}_{k}^{m}, {\tilde{r}}_{k - 1}^{})

is the output of both the process and of the ORM driven by

{\tilde{r}}_{k - 1} = M^{- 1} ({\tilde{y}}_{k}^{})

. By the mean value theorem, there is a

0 < Γ < 1

making

Δ y_{k} = \frac{\partial s (x_{k}^{Γ}, u_{k - 1}^{Γ})}{\partial x_{k}} ({\hat{x}}_{k} - {\tilde{x}}_{k}) + \frac{\partial s (x_{k}^{Γ}, u_{k - 1}^{Γ})}{\partial u_{k - 1}} ({\hat{u}}_{k - 1} - {\tilde{u}}_{k - 1}),

(A1)

leading to

‖ Δ y_{k} ‖ \leq B_{s x} ‖ Δ x_{k} ‖ + B_{s u} ‖ Δ u_{k - 1} ‖,

(A2)

where

x_{k}^{Γ} = Γ {\hat{x}}_{k} + (1 - Γ) {\tilde{x}}_{k}, u_{k - 1}^{Γ} = Γ {\hat{u}}_{k - 1} + (1 - Γ) {\tilde{u}}_{k - 1}

, and

Δ x_{k} = {\hat{x}}_{k} - {\tilde{x}}_{k}, Δ u_{k - 1} = {\hat{u}}_{k - 1} - {\tilde{u}}_{k - 1}

.

Observe that (8) implies

‖ {\tilde{u}}_{k}^{} - C ({\tilde{x}}_{k}^{E}, \hat{θ}) ‖ < ε

,

\forall k = \bar{0, N - 1}

. But

Δ u_{k} = C ({\hat{x}}_{k}^{E}, \hat{θ}) - {\tilde{u}}_{k} = C ({\hat{x}}_{k}^{E}, \hat{θ}) - C ({\tilde{x}}_{k}^{E}, \hat{θ}) + C ({\tilde{x}}_{k}^{E}, \hat{θ}) - {\tilde{u}}_{k}

and by the MVT there is a

0 < Γ < 1

such that

Δ u_{k} = \frac{\partial C (x_{k}^{E Γ}, θ)}{\partial x_{k}^{E}} ({\hat{x}}_{k}^{E} - {\tilde{x}}_{k}^{E}) + C ({\tilde{x}}_{k}^{E}, \hat{θ}) - {\tilde{u}}_{k},

(A3)

with

x_{k}^{E Γ} = Γ {\hat{x}}_{k}^{E} + (1 - Γ) {\tilde{x}}_{k}^{E},

leading to

‖ Δ u_{k} ‖ \leq B_{c x} ‖ Δ x_{k}^{E} ‖ + ε,

(A4)

for

Δ x_{k}^{E} = {\hat{x}}_{k}^{E} - {\tilde{x}}_{k}^{E}

. But

Δ x_{k}^{E} = {\hat{x}}_{k}^{E} - {\tilde{x}}_{k}^{E} = {[{({\hat{x}}_{k} - {\tilde{x}}_{k})}^{T} 0^{T} 0^{T}]}^{T}

resulting in

‖ Δ x_{k}^{E} ‖ = ‖ Δ x_{k}^{} ‖

which transforms (A4) into

‖ Δ u_{k} ‖ \leq B_{c x} ‖ Δ x_{k}^{} ‖ + ε .

(A5)

Next, by the mean value theorem, there is a

0 < Γ < 1

ensuring that

Δ x_{k}^{} = {\hat{x}}_{k}^{} - {\tilde{x}}_{k}^{} = {(s_{x}^{})}^{- 1} ({\hat{y}}_{k}^{}) - {(s_{x}^{})}^{- 1} ({\tilde{y}}_{k}^{}) = \frac{\partial {(s_{x})}^{- 1} (y_{k}^{Γ})}{\partial y_{k}} ({\hat{y}}_{k}^{} - {\tilde{y}}_{k}^{}), y_{k}^{Γ} = Γ {\hat{y}}_{k}^{} + (1 - Γ) {\tilde{y}}_{k}^{} . .

(A6)

It then follows that

‖ Δ x_{k}^{} ‖ \leq B_{s y} ‖ Δ y_{k}^{} ‖ .

(A7)

Using (A5) and (A7) in (A2) it results

‖ Δ y_{k} ‖ \leq B_{s x} B_{s y} ‖ Δ y_{k} ‖ + B_{s u} (B_{c x} ‖ Δ x_{k - 1}^{} ‖ + ε) \leq B_{s x} B_{s y} ‖ Δ y_{k} ‖ + B_{s u} B_{c x} B_{s y} ‖ Δ y_{k - 1}^{} ‖ + B_{s u} ε,

(A8)

which is equivalent to

‖ Δ y_{k} ‖ \leq \underset{B_{1}}{\underset{︸}{\frac{B_{s u} B_{c x} B_{s y}}{1 - B_{s x} B_{s y}}}} ‖ Δ y_{k - 1}^{} ‖ + \underset{B_{2}}{\underset{︸}{\frac{B_{s u}}{1 - B_{s x} B_{s y}}}} ε .

(A9)

One can write that

J_{M R}^{N} (\hat{θ}) = \sum_{k = 1}^{N} {‖ Δ y_{k} ‖}_{}^{2} \leq {(B_{2} \sum_{k = 1}^{N} \sum_{j = 0}^{k - 1} B_{1}^{j})}^{2} ε^{2} = B ε^{2} .

(A10)

which is the conclusion (10), and the proof of Theorem 1 is completed.

Appendix B

Proof of Theorem 3.

Let

δ_{k} = U^{k - 1} + {(W_{c}^{k - 1})}^{T} (γ Φ_{c}^{k} - Φ_{c}^{k - 1})

be expressed further as

δ_{k} = \underset{E 1}{\underset{︸}{U^{k - 1} + {(W_{c}^{*})}^{T} (γ Φ_{c}^{k} - Φ_{c}^{k - 1})}} + {({\tilde{W}}_{c}^{k - 1})}^{T} (γ Φ_{c}^{k} - Φ_{c}^{k - 1})

. Let

Φ_{c}^{k} = Φ_{c}^{k - 1} + Δ Φ_{c}^{k}

make

δ_{k} = E 1 + (γ - 1) ζ_{c}^{k - 1} + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1}

.

Define the Lyapunov function

L^{k - 1} = \frac{1}{α_{c}} t r {{({\tilde{W}}_{c}^{k - 1})}^{T} {\tilde{W}}_{c}^{k - 1}} + \frac{1}{α_{a}} t r {{({\tilde{W}}_{a}^{k - 1})}^{T} {\tilde{W}}_{a}^{k - 1}} = L_{1} + L_{2} .

(A11)

(tr{.} meaning the matrix trace operator) leading to the first order differences in L₁ and L₂

\begin{array}{l} Δ L_{1} = \frac{1}{α_{c}} [t r {{({\tilde{W}}_{c}^{k})}^{T} {\tilde{W}}_{c}^{k}} - t r {{({\tilde{W}}_{c}^{k - 1})}^{T} {\tilde{W}}_{c}^{k - 1}}], \\ Δ L_{2} = \frac{1}{α_{a}} [t r {{({\tilde{W}}_{a}^{k})}^{T} {\tilde{W}}_{a}^{k}} - t r {{({\tilde{W}}_{a}^{k - 1})}^{T} {\tilde{W}}_{a}^{k - 1}}] . \end{array}

(A12)

Using estimation error dynamics (32),

Δ L_{1}, Δ L_{2}

are refined further. Let

\begin{array}{l} Δ L_{1} = \frac{1}{α_{c}} [t r {2 α_{c} δ_{k} \underset{ζ_{c}^{k - 1}}{\underset{︸}{{(Φ_{c}^{k - 1})}^{T} {\tilde{W}}_{c}^{k}}} + α_{c} δ_{k}^{2} {(Φ_{c}^{k - 1})}^{T} Φ_{c}^{k - 1}] = 2 α_{c} (E 1 + (γ - 1) ζ_{c}^{k - 1} + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1}) ζ_{c}^{k - 1} + \\ + α_{c} {(E 1 + (γ - 1) ζ_{c}^{k - 1} + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1})}^{2} {‖ Φ_{c}^{k - 1} ‖}^{2} \leq 2 α_{c} (γ - 1) {(ζ_{c}^{k - 1})}^{2} + 2 α_{c} (E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1}) ζ_{c}^{k - 1} \\ + α_{c} {\bar{φ}}_{c} (2 {(γ - 1)}^{2} {(ζ_{c}^{k - 1})}^{2} + 2 {(E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1})}^{2}) = 2 α_{c} (γ - 1) {(ζ_{c}^{k - 1})}^{2} - {\underset{E 2}{\underset{︸}{(ζ_{c}^{k - 1} - α_{c} (E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1}))}}}^{2} \\ + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2} {(ζ_{c}^{k - 1})}^{2} + 2 α_{c} {\bar{φ}}_{c} {(E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1})}^{2} + {(ζ_{c}^{k - 1})}^{2} + α_{c}^{2} {(E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1})}^{2} = \\ = = [1 + 2 α_{c} (γ - 1) + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2}] {(ζ_{c}^{k - 1})}^{2} - E 2 + \underset{E 3}{\underset{︸}{(2 α_{c} {\bar{φ}}_{c} + α_{c}^{2}) {(E 1 + γ {({\tilde{W}}_{c}^{k - 1})}^{T} Δ Φ_{c}^{k - 1})}^{2}}} . \end{array}

(A13)

One can show that the coefficient of (ζ_e^k−1) in the first term, is a function of γ, i.e.,

\tilde{f} (γ) = 1 + 2 α_{c} (γ - 1) + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2}

, which will have two real roots for

α_{c} > 2 {\bar{φ}}_{c}

. Let these roots be

γ_{1} < γ_{2}

. By ensuring

γ_{1} < 0, γ_{2} > 1

then for any

γ

such that

γ_{1} < 0 < γ \leq 1 < γ_{2}

, it means that

\tilde{f} (γ) < 0

. A sufficient condition to ensure

\tilde{f} (γ) < 0

for all

0 < γ \leq 1

, is to select

α_{c} > \max {2 {\bar{φ}}_{c}, 4 {\bar{φ}}_{c}^{2} - 2 {\bar{φ}}_{c} + 1} = 4 {\bar{φ}}_{c}^{2} - 2 {\bar{φ}}_{c} + 1

for all

{\bar{φ}}_{c} > 1

.

Note further that the term E3 in (A13) is positive. Let

Δ L_{2}

be further expressed as

\begin{array}{l} Δ L_{2} = \frac{1}{α_{a}} [t r {{({\tilde{W}}_{a}^{k})}^{T} {\tilde{W}}_{a}^{k}} - t r {{({\tilde{W}}_{a}^{k - 1})}^{T} {\tilde{W}}_{a}^{k - 1}}] = \\ = \frac{1}{α_{a}} ({‖ {\tilde{W}}_{a}^{k - 1} - α_{a} Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1} ‖}^{2} - {‖ {\tilde{W}}_{a}^{k - 1} ‖}^{2}) \leq \\ \leq \frac{1}{α_{a}} ({(‖ {\tilde{W}}_{a}^{k - 1} ‖ + ‖ α_{a} Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1} ‖)}^{2} - {‖ {\tilde{W}}_{a}^{k - 1} ‖}^{2}) = \\ = 2 α_{a} ‖ {\tilde{W}}_{a}^{k - 1} ‖ ‖ Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1} ‖ + α_{a} {‖ Φ_{a}^{k - 1} {(W_{c}^{k - 1})}^{T} Φ_{c, u}^{k - 1} ‖}^{2} \leq \\ \leq 2 α_{a} {\bar{φ}}_{a} {\bar{φ}}_{c, u} ‖ {\tilde{W}}_{a}^{k - 1} ‖ ‖ W_{c}^{k - 1} ‖ + α_{a} {\bar{φ}}_{a}^{2} {\bar{φ}}_{c, u}^{2} {‖ W_{c}^{k - 1} ‖}^{2} = E 4 > 0 . \end{array}

(A14)

Eventually,

Δ L = Δ L_{1} + Δ L_{2} = [1 + 2 α_{c} (γ - 1) + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2}] {(ζ_{c}^{k - 1})}^{2} - E 2 + E 3 + E 4,

(A15)

can be shown negative if

[1 + 2 α_{c} (γ - 1) + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2}] {(ζ_{c}^{k - 1})}^{2} + E 3 + E 4 < 0

(A16)

holds, where, considering the first term as negative for

α_{c} > 4 {\bar{φ}}_{c}^{2} - 2 {\bar{φ}}_{c} + 1

while E3 > 0, E4 > 0, it follows that it suffices to have

{(ζ_{c}^{k - 1})}^{2} > \frac{E 3 + E 4}{| 1 + 2 α_{c} (γ - 1) + 2 α_{c} {\bar{φ}}_{c} {(γ - 1)}^{2} |}

(A17)

in order to make

Δ L

negative definite. Then by Lyapunov extension theorem [59], the AAC learning is stable and the NN estimation errors are uniformly ultimately bounded, concluding the proof.

References

Hou, Z.-S.; Wang, Z. From model-based control to data-driven control: Survey, classification and perspective. Inf. Sci. 2013, 235, 3–35. [Google Scholar] [CrossRef]
Fliess, M.; Join, C. Model-free control. Int. J. Control 2013, 86, 2228–2252. [Google Scholar] [CrossRef]
Hou, Z.-S.; Jin, S. Data-driven model-free adaptive control for a class of MIMO nonlinear discrete-time systems. IEEE Trans. Neural Netw. 2011, 22, 2173–2188. [Google Scholar] [PubMed]
Campi, M.C.; Lecchini, A.; Savaresi, S.M. Virtual reference feedback tuning: A direct method for the design of feedback controllers. Automatica 2002, 38, 1337–1346. [Google Scholar] [CrossRef]
Hjalmarsson, H. Iterative feedback tuning—An overview. Int. J. Adapt. Control Signal Process. 2002, 16, 373–395. [Google Scholar] [CrossRef]
Spall, J.C.; Cristion, J.A. Model-free control of nonlinear stochastic systems with discrete-time measurements. IEEE Trans. Autom. Control 1998, 43, 1198–1210. [Google Scholar] [CrossRef]
Butcher, M.; Karimi, A.; Longchamp, R. Iterative learning control based on stochastic approximation. In Proceedings of the 17th IFAC World Congress, Seoul, Korea, 6–11 July 2008; pp. 1478–1483. [Google Scholar]
Radac, M.-B.; Precup, R.-E.; Petriu, E.M. Optimal behaviour prediction using a primitive-based data-driven model-free iterative learning control approach. Comp. Ind. 2015, 74, 95–109. [Google Scholar] [CrossRef]
Li, Y.; Hou, Z.; Feng, Y.; Chi, R. Data-driven approximate value iteration with optimality error bound analysis. Automatica 2017, 78, 79–87. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E.; Petriu, E.M.; Preitl, S. Iterative data-driven tuning of controllers for nonlinear systems with constraints. IEEE Trans. Ind. Electron. 2014, 61, 6360–6368. [Google Scholar] [CrossRef]
Pang, Z.-H.; Liu, G.-P.; Zhou, D.; Sun, D. Data-based predictive control for networked nonlinear systems with network-induced delay and packet dropout. IEEE Trans. Ind. Electron. 2016, 63, 1249–1257. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E. Data-driven model-free slip control of anti-lock braking systems using reinforcement Q-learning. Neurocomputing 2018, 275, 317–329. [Google Scholar] [CrossRef]
Qiu, J.; Wang, T.; Yin, S.; Gao, H. Data-Based Optimal Control for Networked Double-Layer Industrial Processes. IEEE Trans. Ind. Electron. 2017, 64, 4179–4186. [Google Scholar] [CrossRef]
Chi, R.; Hou, Z.-S.; Jin, S.; Huang, B. An improved data-driven point-to-point ILC using additional on-line control inputs with experimental verification. IEEE Trans. Syst. Man Cybern. Syst. 2017, 49, 687–696. [Google Scholar] [CrossRef]
Liu, D.; Yang, G.-H. Model-free adaptive control design for nonlinear discrete-time processes with reinforcement learning techniques. Int. J. Syst. Sci. 2018, 49, 2298–2308. [Google Scholar] [CrossRef]
Chi, R.; Huang, B.; Hou, Z.; Jin, S. Data-driven high-order terminal iterative learning control with a faster convergence speed. Int. J. Robust Nonlinear Control 2018, 28, 103–119. [Google Scholar] [CrossRef]
Jeng, J.-C.; Ge, G.-P. Data-based approach for feedback–feedforward controller design using closed-loop plant data. ISA Trans. 2018, 80, 244–256. [Google Scholar] [CrossRef] [PubMed]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
Werbos, P.J. Approximate dynamic programming for real-time control and neural modeling. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches; White, D.A., Sofge, D.A., Eds.; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493–525. [Google Scholar]
Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996. [Google Scholar]
Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1991, 8, 279–292. [Google Scholar] [CrossRef]
Wang, F.-Y.; Zhang, H.; Liu, D. Adaptive dynamic programming: An introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47. [Google Scholar] [CrossRef]
Lewis, F.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag. 2012, 32, 76–105. [Google Scholar]
Prokhorov, D.V.; Wunsch, D.C. Adaptive critic designs. IEEE Trans. Neural Netw. 1997, 8, 997–1007. [Google Scholar] [CrossRef]
Wei, Q.; Lewis, F.; Liu, D.; Song, R.; Lin, H. Discrete-time local value iteration adaptive dynamic programming: Convergence analysis. IEEE Trans. Syst. Man Cybern. Syst. 2018, 48, 875–891. [Google Scholar] [CrossRef]
Heydari, A. Revisiting approximate dynamic programming and its convergence. IEEE Trans. Cybern. 2014, 44, 2733–2743. [Google Scholar] [CrossRef] [PubMed]
Mu, C.; Ni, Z.; Sun, C.; He, H. Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems. IEEE Trans. Cybern. 2017, 47, 1460–1470. [Google Scholar] [CrossRef]
Venayagamoorthy, G.K.; Harley, R.G.; Wunsch, D.C. Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Trans. Neural Netw. 2002, 13, 764–773. [Google Scholar] [CrossRef]
Ni, Z.; He, H.; Zhong, Z.; Prokhorov, D.V. Model-free dual heuristic dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 1834–1839. [Google Scholar] [CrossRef] [PubMed]
Heydari, A. Optimal triggering of networked control systems. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3011–3021. [Google Scholar] [CrossRef] [PubMed]
Zhang, K.; Zhang, H.; Gao, Z.; Su, H. Online adaptive policy iteration based fault-tolerant control algorithm for continuous-time nonlinear tracking systems with actuator failures. J. Frankl. Inst. 2018, 355, 6947–6968. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcouglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Lewis, F.L.; Vamvoudakis, K.G. Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man Cybern. B Cybern. 2011, 41, 14–25. [Google Scholar] [CrossRef]
Wang, Z.; Liu, D. Data-based controllability and observability analysis of linear discrete-time systems. IEEE Trans. Neural Netw. 2011, 22, 2388–2392. [Google Scholar] [CrossRef]
Ruelens, F.; Claessens, B.J.; Vandael, S.; de Schutter, B.; Babuška, R.; Belmans, R. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Trans. Smart Grid 2017, 8, 2149–2159. [Google Scholar] [CrossRef]
Liu, D.; Javaherian, H.; Kovalenko, O.; Huang, T. Adaptive critic learning techniques for engine torque and air-fuel ratio control. IEEE Trans. Syst. Man Cybern. B Cybern. 2008, 38, 988–993. [Google Scholar]
Radac, M.-B.; Precup, R.-E.; Petriu, E.M. Model-free primitive-based iterative learning control approach to trajectory tracking of MIMO systems with experimental validation. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2925–2938. [Google Scholar] [CrossRef]
Campestrini, L.; Eckhard, D.; Bazanella, A.S.; Gevers, M. Data-driven model reference control design by prediction error identification. J. Frankl. Inst. 2017, 354, 2628–2647. [Google Scholar] [CrossRef]
Campestrini, L.; Eckhard, D.; Gevers, M.; Bazanella, A. Virtual reference feedback tuning for non-minimum phase plants. Automatica 2011, 47, 1778–1784. [Google Scholar] [CrossRef]
Formentin, S.; Savaresi, S.M.; Del Re, L. Non-iterative direct data-driven controller tuning for multivariable systems: Theory and application. IET Control Theory Appl. 2012, 6, 1250–1257. [Google Scholar] [CrossRef]
Yan, P.; Liu, D.; Wang, D.; Ma, H. Data-driven controller design for general MIMO nonlinear systems via virtual reference feedback tuning and neural networks. Neurocomputing 2016, 171, 815–825. [Google Scholar] [CrossRef]
Campi, M.C.; Savaresi, S.M. Direct nonlinear control design: The virtual reference feedback tuning (VRFT) approach. IEEE Trans. Autom. Control 2006, 51, 14–27. [Google Scholar] [CrossRef]
Esparza, A.; Sala, A.; Albertos, P. Neural networks in virtual reference tuning. Eng. Appl. Artif. Intell. 2011, 24, 983–995. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E. Three-level hierarchical model-free learning approach to trajectory tracking control. Eng. Appl. Artif. Intell. 2016, 55, 103–118. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E.; Roman, R.-C. Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. Int. J. Syst. Sci. 2017, 48, 1071–1083. [Google Scholar] [CrossRef]
Busoniu, L.; Ernst, D.; de Schutter, B.; Babuska, R. Approximate reinforcement learning: An overview. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Paris, France, 11–15 April 2011; pp. 1–8. [Google Scholar]
Inteco Ltd. Multitank System, User’s Manual; Inteco Ltd.: Krakow, Poland, 2007. [Google Scholar]
Hagan, M.T.; Menhaj, M.B. Training feed-forward networks with the Marquardt algorithm. IEEE Trans. Neural Netw. 1994, 5, 989–993. [Google Scholar] [CrossRef]
Liu, Q.; Liu, J.; Sang, R.; Li, J.; Zhang, T.; Zhang, Q. Fast neural network training on FPGA using quasi-Newton optimisation method. IEEE Trans. Very Large Scale Integr. Syst. 2018, 26, 1575–1579. [Google Scholar] [CrossRef]
Livieris, I.E. Improving the classification efficiency of an ANN utilizing a new training methodology. Informatics 2019, 6, 1. [Google Scholar] [CrossRef]
Møller, M.F. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 1993, 6, 525–533. [Google Scholar] [CrossRef]
Livieris, I.E.; Pintelas, P. A new conjugate gradient algorithm for training neural networks based on a modified secant equation. Appl. Math. Comput. 2013, 221, 491–502. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E. Data-driven MIMO model-free reference tracking control with nonlinear state-feedback and fractional order controllers. Appl. Soft Comput. 2018, 73, 992–1003. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E.; Roman, R.-C. Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-Learning. ISA Trans. 2018, 73, 227–238. [Google Scholar] [CrossRef] [PubMed]
He, H.; Ni, Z.; Fu, J. A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 2012, 78, 3–13. [Google Scholar] [CrossRef]
Zhao, D.; Wang, B.; Liu, D. A supervised actor-critic approach for adaptive cruise control. Soft Comput. 2013, 17, 2089–2099. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E.; Petriu, E.M.; Preitl, S.; Dragos, C.-A. Data-driven reference trajectory tracking algorithm and experimental validation. IEEE Trans. Ind. Inf. 2013, 9, 2327–2336. [Google Scholar] [CrossRef]
Radac, M.-B.; Precup, R.-E. Data-based two-degree-of-freedom iterative control approach to constrained non-linear systems. IET Control Theory Appl. 2015, 9, 1000–1010. [Google Scholar] [CrossRef]
Yang, Q.; Jagannathan, S. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Trans. Syst. Man Cybern. B Cybern. 2012, 42, 377–390. [Google Scholar] [CrossRef] [PubMed]

Figure 1. (a) control system with the VRFT controller; (b) control system with the VRFT NN controller further tuned by AAC design in a deep learning architecture; (c) the vertical tank system.

Figure 2. VRFT initial experiment data collection. Red line in (b) is the dead-zone threshold for

u_{2}

.

Figure 2. VRFT initial experiment data collection. Red line in (b) is the dead-zone threshold for

u_{2}

.

Figure 3. The initial VRFT controller (black dotted), the final adaptively learned controller (black solid), the BFQ controller (blue), the model-based BFQ controller (magenta) and the ORM outputs (red).

Figure 4. Q-NN output weights in Q-function learning of VRFT control.

Figure 5. Typical training and learning episode. The perturbed control (black), the reference inputs to the CS (green) and the RM outputs (red).

Figure 6. Evolution of

J_{M R_A C}^{1400}

with each episode for a typical trial of 30 episodes.

Figure 6. Evolution of

J_{M R_A C}^{1400}

with each episode for a typical trial of 30 episodes.

Table 1. AAC tuning statistics over maximum 50 episodes per trial.

Scenario		Avg. $J_{M R_A A C}^{1400}$	Min. $J_{M R_A A C}^{1400}$	Success Rate	Avg. Episodes	Std. Episodes
Full Tuning	Perturbed	Avg. $J_{M R_A A C}^{1400}$	Min. $J_{M R_A A C}^{1400}$	Success Rate	Avg. Episodes	Std. Episodes
yes	yes	0.48	0.46	88%	25	4.9
yes	no	0.76	0.45	53%	43	6.3
no	yes	0.64	0.59	100%	16	3.2
no	no	0.526	0.50	100%	10	2.1

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Radac, M.-B.; Precup, R.-E. Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic. Appl. Sci. 2019, 9, 1807. https://doi.org/10.3390/app9091807

AMA Style

Radac M-B, Precup R-E. Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic. Applied Sciences. 2019; 9(9):1807. https://doi.org/10.3390/app9091807

Chicago/Turabian Style

Radac, Mircea-Bogdan, and Radu-Emil Precup. 2019. "Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic" Applied Sciences 9, no. 9: 1807. https://doi.org/10.3390/app9091807

APA Style

Radac, M.-B., & Precup, R.-E. (2019). Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic. Applied Sciences, 9(9), 1807. https://doi.org/10.3390/app9091807

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic

Abstract

1. Introduction

2. Output Model Reference Control for Unknown Systems

3. Nonlinear State-Feedback VRFT for Approximate ORM Tracking Control Using Neural Networks

4. Adaptive Actor-Critic Learning for ORM Tracking Control

4.1. Adaptive Actor-Critic Design

4.2. AAC Using Neural Networks As Approximators

4.3. Convergence of the AAC Learning Scheme with NNs

4.4. Summary of the Mixed VRFT-AAC Design Approach

5. Validation Case Study

5.1. AAC Design for a MIMO Vertical Tank System

5.2. Statistical Investigations of the AAC Control Performance

5.3. Comments on the AAC Learning Performance

6. Conclusions

Author Contributions

Conflicts of Interest

Appendix A

Appendix B

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI