Article

Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic

by Mircea-Bogdan Radac * and Radu-Emil Precup
Department of Automation and Applied Informatics, Politehnica University of Timisoara, Timisoara 300006, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(9), 1807; https://doi.org/10.3390/app9091807
Submission received: 28 January 2019 / Revised: 19 March 2019 / Accepted: 24 April 2019 / Published: 30 April 2019
(This article belongs to the Special Issue Advances in Deep Learning)

Abstract: This paper proposes a neural network (NN)-based control scheme in an Adaptive Actor-Critic (AAC) learning framework designed for output reference model tracking, as a representative deep-learning application. The control learning scheme is model-free with respect to the process model. AAC designs usually require an initial controller to start the learning process; however, systematic guidelines for choosing the initial controller are not offered in the literature, especially in a model-free manner. Virtual Reference Feedback Tuning (VRFT) is proposed for obtaining an initially stabilizing NN nonlinear state-feedback controller, designed from input-state-output data collected from the process in an open-loop setting. The solution offers systematic design guidelines for initial controller design. The resulting suboptimal state-feedback controller is next improved under the AAC learning framework by online adaptation of a critic NN and a controller NN. The mixed VRFT-AAC approach is validated on a multi-input multi-output nonlinear constrained coupled vertical two-tank system. Discussions on the control system behavior are offered together with comparisons with similar approaches.

1. Introduction

Data-driven or data-based control techniques rely on data collected from the process in order to learn and tune controllers that prevent control performance degradation due to mismatch between the true process and its model, which is the main issue with model-based control design approaches [1]. The data-driven controller learning objective can be achieved either by using highly adaptive simplified phenomenological models [2,3], or by using no model at all, except for common structural assumptions about the true process such as linearity or nonlinearity. The latter approach can be considered a truly model-free one, with several representative techniques having first emerged from classical control theory, such as Virtual Reference Feedback Tuning (VRFT) [4], Iterative Feedback Tuning [5], Simultaneous Perturbation Stochastic Approximation [6], and Model-free Iterative Learning Control [7,8]. Most of these approaches rely on instruments specific to optimal control, with several recent applications [9,10,11,12,13,14,15,16,17].
Reinforcement learning (RL) [18] is a powerful data-driven technique for solving optimal control problems, with parallel developments in the machine learning and control systems communities, where RL is better known as Adaptive (Approximate) Dynamic Programming (ADP) [19] or neuro-dynamic programming [20]. Reinforcement Q-learning [21] with function approximators (FAs) is a particular version of Action Dependent Heuristic Dynamic Programming implemented without a process model [22,23], which is only one of several types of adaptive actor-critic (AAC) ADP designs [24,25,26,27], besides Heuristic Dynamic Programming [28], Dual Heuristic Programming [29], and all of their action-dependent versions.
For learning high-performance control, Action Dependent Heuristic Dynamic Programming (a form of continuous input-state space Q-learning) uses the Q-function as an extension of the cost (value) function and only needs to efficiently explore the input-state space of the unknown process, hence the model-free data-driven label is justified. The class of model-free AAC designs used with FAs is attractive compared with the majority of model-based AAC designs, where at least a partially known nonlinear input-affine state-space representation is necessary [22,23]. The main disadvantages of Action Dependent Heuristic Dynamic Programming schemes are that many transition samples are needed from the process (since the Q-function estimate is more informative, it must explore the action space in addition to the state space) and that convergence guarantees are lacking in the absence of a process model when generic FAs are used. Data-driven RL/ADP formulated in terms of control systems theory has also produced recent results on different applications and on stability and learning convergence, in both model-free and model-based settings [30,31].
In output reference model (ORM) tracking control, the output of the controlled process should track a reference model's output, regarded as a frequently changing, time-varying learning goal. This control objective can also be formulated in an optimal control setup. An initial stabilizing state-feedback controller that achieves suboptimal ORM tracking is highly desirable in practice since it can accelerate the learning process. In fact, most AAC learning control architectures start the controller learning with respect to some objective from an initial controller, but lack systematic guidelines for obtaining such an initial controller.
VRFT is one solution for designing data-driven model-free feedback controllers, commonly using input-output data. Its linear time-invariant framework typically needs far fewer samples than model-free AAC designs to obtain an initial controller. Unfortunately, a linear controller cannot ensure good ORM tracking for nonlinear processes operating over wide ranges. Since the AAC should essentially learn a nonlinear state-feedback controller, it is of interest to obtain such an initial (possibly suboptimal) controller, and this will be shown to be possible using the VRFT design and tuning framework. This is significant because model-free AAC approaches are data-hungry in practice and any initial suboptimal solution shortens the convergence time. Under this motivation, the combination of VRFT and AAC is used to achieve ORM tracking control. The resulting AAC design consists of two neural networks (NNs), one for the controller, called the actor NN, and one for the cost function approximation, called the critic NN. The correction signals during adaptive learning are backpropagated through the larger NN resulting from cascading the actor and the critic NNs; hence, the AAC architecture belongs to the deep reinforcement learning approaches from the literature [32].
The mixed VRFT-AAC approach developed in this paper is applied to a real-world Multi-Input Multi-Output (MIMO) nonlinear coupled constrained laboratory vertical two-tank system for water level control. The proposed approach is novel with respect to the state of the art because:
  • it introduces an original nonlinear state-feedback neural-network-based controller for ORM tracking, tuned with VRFT, serving as initialization for the AAC learning controller that further improves the ORM tracking and accelerates convergence to the optimal controller; this leads to the novel VRFT-AAC combination;
  • the case study demonstrates the implementation of the novel mixed control learning approach for ORM tracking. The MIMO validation scenario also demonstrates the good decoupling ability of the learned controllers, even under constraints and nonlinearities. Comparisons with a model-free batch-fitted Q-learning scheme and with a model-based batch-fitted Q-learning approach are also offered, along with statistical characterization case studies in different learning settings;
  • the theoretical analysis ensures that the AAC learning scheme preserves the control system stability throughout the updates and converges to the optimal control.
The paper is organized as follows: the next section formulates the ORM tracking control problem in an optimal control framework and offers a way to solve it using VRFT (Section 3) and AAC design (Section 4). The validation case study, useful implementation details, comparisons with similar control learning techniques, and thorough investigations and discussions of the observed results are presented in Section 5. The concluding remarks are highlighted in Section 6.

2. Output Model Reference Control for Unknown Systems

Let the discrete-time nonlinear unknown open-loop minimum-phase state-space deterministic strictly causal process be
$$ P: \begin{cases} x_{k+1} = f(x_k, u_k), \\ y_k = g(x_k), \end{cases} \qquad (1) $$
where $k$ indexes the discrete time, $x_k = [x_{k,1} \; \dots \; x_{k,n}]^T \in X \subset \mathbb{R}^n$ is the $n$-dimensional state vector (the superscript $T$ denotes transposition), $u_k = [u_{k,1}, \dots, u_{k,m}]^T \in U \subset \mathbb{R}^m$ is the control input signal, $y_k = [y_{k,1}, \dots, y_{k,p}]^T \in Y \subset \mathbb{R}^p$ is the measurable controlled output, $f: X \times U \to X$ is an unknown nonlinear system function, $g: X \to Y$ is an unknown nonlinear output function of the states, and the initial conditions are not considered for analysis at this point. It is further assumed that the definition domains $X$, $U$, $Y$ are compact and convex. The following assumptions, common to the data-driven problem formulation [1], are made:
A1: System (1) is controllable and fully state observable.
A2: System (1) is internally stable on X × U.
Assumptions A1 and A2 are common in the data-driven control literature and are difficult to assess when the process model is unknown. They may be supported by experience with the process operation or by the literature. If no such knowledge exists whatsoever, control can be attempted within the constrained domains corresponding to the minimum safety operating conditions of the process, which constitutes the minimum required information on the process variables. Internal stability is sufficient for output feedback control design and necessary for state-feedback control design using input-state samples.
Concerning the controllability and full state observability assumption A1 imposed on the process: if observability cannot be verified analytically, data-driven observers can be built using past samples of the inputs and outputs and/or of the partially measurable state, as shown for linear systems in [33,34] and used for nonlinear systems in [35]. State measurement requires more insight into the process than purely input-output representations.
Equation (1) is a general form for most controlled processes in practice and it is not restrictive. In this form, it obeys the definition of a deterministic Markov decision process.
The discrete-time known open-loop stable minimum-phase state-space deterministic strictly causal ORM is
$$ ORM: \begin{cases} x^m_{k+1} = f_m(x^m_k, r_k), \\ y^m_k = g_m(x^m_k), \end{cases} \qquad (2) $$
where $x^m_k = [x^m_{k,1}, \dots, x^m_{k,n_m}]^T \in X_m \subset \mathbb{R}^{n_m}$ is the state vector of the ORM, $r_k = [r_{k,1}, \dots, r_{k,p}]^T \in R_m \subset \mathbb{R}^p$ is the reference input signal, $y^m_k = [y^m_{k,1}, \dots, y^m_{k,p}]^T \in Y_m \subset \mathbb{R}^p$ is the ORM's output, and $f_m: X_m \times R_m \to X_m$, $g_m: X_m \to Y_m$ are known nonlinear maps. Initial conditions are zero unless stated otherwise. Note that $r_k$, $y_k$, $y^m_k$ have the same size $p$ for square feedback control systems. If the ORM (2) is linear time-invariant in particular, it is always possible to express the ORM as an input-output linear time-invariant transfer matrix $y^m_k = M(z) r_k$, where $M(z)$ is an asymptotically stable unit-gain (i.e., $M(1) = I$, where $I$ is the identity matrix) rational transfer matrix and $r_k$ is the reference input that drives both the feedback control system and the ORM. To extend the process (1) with the ORM (2), the reference input $r_k$ is considered a set of measurable exogenous signals that evolve according to $r_{k+1} = h_m(r_k)$, with unknown $h_m: \mathbb{R}^p \to \mathbb{R}^p$ but measurable $r_k$. A piecewise constant $r_k$ can be modeled, for example, as $r_{k+1} = r_k$ and will be used throughout this paper. Then the extended state-space model with output equations is
$$ x^E_{k+1} = \begin{bmatrix} x_{k+1} \\ x^m_{k+1} \\ r_{k+1} \end{bmatrix} = \begin{bmatrix} f(x_k, u_k) \\ f_m(x^m_k, r_k) \\ h_m(r_k) \end{bmatrix} = F(x^E_k, u_k), \quad x^E_k \in X_E = X \times X_m \times R_m, \quad y_k = g(x^E_k), \quad y^m_k = g_m(x^E_k). \qquad (3) $$
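For illustration, the propagation of the extended state in Equation (3) can be sketched as follows; the maps f, f_m, and h_m are placeholders (f is unknown in practice and appears here only to simulate transitions), and the piecewise-constant reference model r_{k+1} = r_k used throughout the paper is shown explicitly.

```python
import numpy as np

def h_m_piecewise_constant(r):
    """Piecewise-constant reference generative model r_{k+1} = r_k."""
    return np.asarray(r, dtype=float).copy()

def step_extended(x, x_m, r, u, f, f_m, h_m=h_m_piecewise_constant):
    """One step of the extended system (3).

    f   : process transition x_{k+1} = f(x_k, u_k)      (unknown in practice;
          placeholder used only for simulation)
    f_m : ORM transition     x^m_{k+1} = f_m(x^m_k, r_k) (known)
    h_m : reference model    r_{k+1} = h_m(r_k)
    """
    x_next = f(x, u)
    x_m_next = f_m(x_m, r)
    r_next = h_m(r)
    # the extended state x^E stacks process state, ORM state, and reference
    x_E_next = np.concatenate([x_next, x_m_next, r_next])
    return x_E_next, (x_next, x_m_next, r_next)
```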
The ORM tracking control problem is formulated in an optimal control framework. Let the infinite-horizon cost function (c.f.) to be minimized starting with $x^E_i$ be [36]
$$ J(x^E_i, U_{i,\infty}) = \sum_{k=i}^{\infty} \gamma^{k-i} \, \Upsilon(x^E_k, u_k), \quad U_{i,\infty} = \{u_i, \dots, u_\infty\}, \qquad (4) $$
where $i$ indexes the starting time for $x^E_i$, the discount factor $0 < \gamma \le 1$ ensures the convergence of $J(x^E_i, U_{i,\infty})$ [23] and sets the controller's (or interacting agent's) horizon, and the stage cost $\Upsilon > 0$ depends on $x^E_k$ and $u_k$ and captures the distance relative to some pre-specified learning goal (target), usually constant in many applications. The unknown control inputs $u_i, u_{i+1}, \dots$ should minimize $J(x^E_i, U_{i,\infty})$. A control sequence (or a controller) rendering a finite c.f. is called admissible.
ORM tracking control requires that the undisturbed process output $y_k$ (also the control system output) tracks the ORM's output $y^m_k = M(z) r_k$. For the stage cost $\Upsilon_{MR} = \| y^m_k(x^E_k) - y_k(x^E_k) \|_2^2$ in Equation (4) (the measurable $y_k$ depends via the unknown $g(\cdot)$ on $x_k$, but not on $x_{k+1}$), we introduce the discounted infinite-horizon model reference tracking c.f.
$$ J_{MR}(x^E_0, \theta) = \sum_{k=0}^{\infty} \gamma^k \| y^m_k(x^E_k) - y_k(x^E_k, \theta) \|_2^2 = \sum_{k=0}^{\infty} \gamma^k \| \varepsilon_k(x^E_k, \theta) \|_2^2, \qquad (5) $$
where $\varepsilon_k(x^E_k, \theta)$ is the model reference tracking error vector and $\theta \in \mathbb{R}^{n_\theta}$ is a parameterization of a nonlinear feedback admissible controller [23] defined as $u_k \overset{\mathrm{def}}{=} C(x^E_k, \theta)$, which, used in Equation (5), reflects the influence of $\theta$ on all system trajectory outcomes. This controller, coupled with Equation (3), ensures that the output of Equation (1) tracks the ORM's output. $J_{MR}$ in (5) also serves as the value function of using the controller $C$. For a finite $J_{MR}$ when $\gamma = 1$, $\varepsilon_k$ must be a square-summable sequence, which can be obtained with an asymptotically stabilizing controller that ensures $\lim_{k \to \infty} \| y^m_k(x^E_k) - y_k(x^E_k) \|_2^2 = 0$. In the general case when $\gamma < 1$, $J_{MR}$ will be finite with any stabilizing controller that renders $\varepsilon_k$ finitely upper bounded. Herein, an admissible controller for Equations (4) and (5) means a controller that ensures a finite c.f. $J_{MR}$.
A nonlinear reference model M could have been used for tracking purposes as well; however, imposing a linear time-invariant one for the feedback control system ensures indirect feedback linearization of the controlled process. It is extremely beneficial to work with linearized feedback control systems because their behavior generalizes well over wide operating ranges [37]. The ORM tracking problem concerns the control system behavior from the reference input to the controlled output, neglecting potential load disturbances [38]. Extension of the proposed theory to nonlinear ORMs is not difficult. Under classical control rules, the process's delay and non-minimum-phase character should be included in M. However, non-minimum-phase zeros make M non-invertible, in addition to requiring their knowledge via identification [38], which affects the subsequent VRFT design and motivates the minimum-phase assumption on the process.

3. Nonlinear State-Feedback VRFT for Approximate ORM Tracking Control Using Neural Networks

An initial controller for the system (3) that achieves approximate ORM tracking is obtained using the VRFT concept. Under assumptions A1 and A2, for tuning a nonlinear state-feedback controller, the designer may employ an input-state-output dataset of the form $\{\tilde{u}_k, \tilde{x}_k, \tilde{y}_k\}$, $k = \overline{0, N-1}$, gathered from the process in an open-loop experiment lasting for $N$ sample time steps, where the persistently exciting $\tilde{u}_k$ excites all the significant process dynamics. To achieve linear ORM tracking for a nonlinear process, a nonlinear state-feedback controller is more suitable than a linear one, being able to cope with the process nonlinearities.
The VRFT concept assumes that, if the controlled output $\tilde{y}_k$ produced in an open-loop experiment conducted on the stable process is regarded as both the control system's output and the ORM's output, then the closed-loop control system will match the reference model [4,39,40,41,42]. Let $\tilde{r}_k = M(z)^{-1} \tilde{y}_k$ be the virtual reference input that generates $\tilde{y}_k$ when filtered through $M(z)$, which is assumed to be invertible with respect to the inverse filtering operation. It is called virtual since it is never set as a reference input to the closed-loop control system and is used only in the offline controller tuning. The virtual states of the ORM are computable from Equation (2) as $\tilde{x}^m_{k+1} = f_m(\tilde{x}^m_k, \tilde{r}_k)$, serving to reconstruct the virtual extended state as $\tilde{x}^E_k = [(\tilde{x}_k)^T \ (\tilde{x}^m_k)^T \ (\tilde{r}_k)^T]^T$. A controller that produces $\tilde{u}_k$ when fed by $\tilde{x}^E_k$ achieves the ORM tracking. VRFT translates the model reference tracking c.f. in Equation (5) into a controller identification c.f. A finite-time controller identification c.f. is [4]
$$ J^N_{VR}(\theta) = \sum_{k=0}^{N-1} \| \tilde{u}_k - C(\tilde{x}^E_k, \theta) \|^2. \qquad (6) $$
Let the optimal controller parameter vector $\theta^*$ be the solution to the optimization problem $\theta^* = \arg\min_\theta J^N_{VR}(\theta)$. Theorem 2 in [41] shows that if the controller parameterization is rich enough, then $\theta^*$ also minimizes $J_{MR}$; however, this is proven for input-output models only. Motivated by [41], a formal proof is given as an incentive for using state-feedback controllers tuned by nonlinear multi-input multi-output (MIMO) VRFT. Several other assumptions are considered:
A3: The process (1) has an equivalent input-output form $y_k = P(y_{k-1}, \dots, y_{k-n_y}, u_{k-1}, \dots, u_{k-n_u})$, where $n_y$, $n_u$ are unknown process orders and the nonlinear map $P$ is invertible with respect to $u$, meaning that for given $y_k$, $u_k$ is recoverable as $u_{k-1} = P^{-1}(y_k)$. Zero initial conditions are assumed at this point. Also, the ORM (2) has an equivalent input-output form $y^m_k = M(y^m_{k-1}, \dots, y^m_{k-n_{y_m}}, r_{k-1}, \dots, r_{k-n_r})$, where $n_{y_m}$, $n_r$ are known ORM orders and $M$ is a nonlinear invertible map with stable inverse, allowing the calculation of $r_{k-1} = M^{-1}(y^m_k)$. Zero initial conditions are also assumed.
A4: Let the process (1) and the ORM (2) be formally written as $y_k = s(x_k, u_{k-1})$ and $y^m_k = s_m(x^m_k, r_{k-1})$, respectively, to capture simultaneously both the input-output dependence and the input-state-output one in a compact form. These expressions also reveal the relative degree one from input to output, without loss of generality. Assume zero initial conditions for (1) and assume the map $s$ is invertible, with $x_k$, $u_{k-1}$ computable from $y_k$ as $x_k = (s^x)^{-1}(y_k)$, $u_{k-1} = (s^u)^{-1}(y_k)$. Further assume that $s_m$ is a continuously differentiable invertible map such that $x^m_k$, $r_{k-1}$ are computable from $y^m_k$ as $x^m_k = (s^x_m)^{-1}(y^m_k)$, $r_{k-1} = (s^r_m)^{-1}(y^m_k)$, and assume there exist positive constants $B_{s^x_m} > 0$, $B_{s^r_m} > 0$ such that $\| \partial s_m(x^m_k, r_{k-1}) / \partial x^m_k \| < B_{s^x_m}$, $\| \partial s_m(x^m_k, r_{k-1}) / \partial r_{k-1} \| < B_{s^r_m}$. Let zero initial conditions hold for (2). These inversion assumptions are natural for state-space systems such as (1) and (2) that have equivalent input-output models according to A3. For example, for a given output $y_k$ of (1), the input is uniquely determined as $u_{k-1} = P^{-1}(y_k)$, after which the state can be generated by recursion from $x_{k+1} = f(x_k, u_k)$ of Equation (1). This is the sense of $x_k = (s^x)^{-1}(y_k)$.
Moreover, let $s$ and $(s^x)^{-1}$ be continuously differentiable and of bounded derivative, satisfying
$$ \left\| \frac{\partial s(x_k, u_{k-1})}{\partial x_k} \right\| < B_{s^x}, \quad \left\| \frac{\partial s(x_k, u_{k-1})}{\partial u_{k-1}} \right\| < B_{s^u}, \quad \left\| \frac{\partial (s^x)^{-1}(y_k)}{\partial y_k} \right\| < B_{s^y}, \quad 0 < B_{s^x} B_{s^y} < 1. \qquad (7) $$
A5: Let a finite open-loop trajectory collected from the process be $D = \{\tilde{u}_k, \tilde{x}_k, \tilde{y}_k\} \subset U \times X \times Y$, $k = \overline{0, N-1}$, where $\tilde{u}_k$ is: (1) persistently exciting, such that $\tilde{y}_k$ captures all process dynamics, and (2) ensuring uniform exploration of the entire domain $U \times X \times Y$. Good exploration is achievable for a large enough $N$.
A6: There exists a set of nonlinear parameterized state-feedback continuously differentiable controllers $\{C(x^E_k, \theta)\}$, a $\hat{\theta}$ for which $\hat{u}_k = C(\hat{x}^E_k, \hat{\theta})$, and an $\varepsilon > 0$ for which
$$ J^N_{VR}(\hat{\theta}) = \sum_{k=0}^{N-1} \| \tilde{u}_k - C(\tilde{x}^E_k, \hat{\theta}) \|^2 < \varepsilon^2, \qquad (8) $$
$$ \left\| \frac{\partial C(x^E_k, \theta)}{\partial x^E_k} \right\| < B_{c^x}, \qquad (9) $$
where $\tilde{x}^E_k = [(\tilde{x}_k)^T \ (\tilde{x}^m_k)^T \ (\tilde{r}_k)^T]^T$, $\hat{x}^E_k = [(\hat{x}_k)^T \ (\tilde{x}^m_k)^T \ (\tilde{r}_k)^T]^T$, and $\tilde{x}^m_k = (s^x_m)^{-1}(\tilde{y}_k)$, $\tilde{r}_{k-1} = (s^r_m)^{-1}(\tilde{y}_k)$. Technically, $\{\hat{u}_k, \hat{x}_k, \hat{y}_k\}$ are generated with $\hat{u}_k = C(\hat{x}^E_k, \hat{\theta})$ in closed loop, by processing the virtual signals $\tilde{x}^m_k$, $\tilde{r}_{k-1}$ obtained from $\tilde{y}_k$.
Theorem 1: Under assumptions A3–A6, there exists a finite $B > 0$ such that
$$ J^N_{MR}(\hat{\theta}) = \sum_{k=1}^{N} \| \hat{y}_k - \tilde{y}_k \|^2 = \sum_{k=1}^{N} \| s(\hat{x}_k, \hat{u}_{k-1}) - s_m(\tilde{x}^m_k, \tilde{r}_{k-1}) \|_2^2 < B \varepsilon^2. \qquad (10) $$
Proof: See Appendix A.
Corollary 1. The controller $C(x^E_k, \hat{\theta})$ obtained by minimizing the c.f. (6) is stabilizing and admissible for $J_{MR}$ in Equation (5) with $\gamma < 1$.
Proof. By Equation (8), a properly identified $C(x^E_k, \hat{\theta})$ renders the finite-time $J^N_{MR}(\hat{\theta})$ in (10) arbitrarily small. Secondly, good exploration of $U \times X \times Y$ ensured by $D = \{\tilde{u}_k, \tilde{x}_k, \tilde{y}_k\}$ reflects in good exploration of the domains $R_m$, $X_m$ by $\tilde{r}_k$, $\tilde{x}^m_k$, respectively. In (10), $\hat{u}_k$, $\hat{x}_k$, $\hat{y}_k$, $\tilde{x}^m_k$, $\hat{x}^E_k$, $\tilde{x}^E_k$ are all generated from the same $\tilde{r}_k$. If (10) holds for many combinations of $\tilde{r}_k$, $\tilde{x}^m_k$ rendered from exploratory data, then, by the arguments of continuous differentiability and bounded derivatives of the maps in (7) and by assumption A4, it will hold for any possible combination of $r_k$ and $x^m_k = f_m(x^m_{k-1}, r_{k-1})$ generated from any $r_k$.
To show this, note that both $\tilde{y}_k = y^m_k = s_m(\tilde{x}^m_k, \tilde{r}_{k-1})$ and $\hat{y}_k = y_k = s(\hat{x}_k, u_{k-1}) = s(\hat{x}_k, C(\hat{x}^E_{k-1}, \hat{\theta}))$ in (10) can be generated from the same $\tilde{r}_k$ ($\hat{x}^E_{k-1}$ contains $\tilde{r}_{k-1}$). Using this fact, it follows from (10) that the ORM tracking errors $s(\hat{x}_k, C(\hat{x}^E_{k-1}(\tilde{r}^{(1)}_{k-1}))) - s_m(\tilde{x}^m_k, \tilde{r}^{(1)}_{k-1})$ and $s(\hat{x}_k, C(\hat{x}^E_{k-1}(\tilde{r}^{(2)}_{k-1}))) - s_m(\tilde{x}^m_k, \tilde{r}^{(2)}_{k-1})$ are bounded at each time step, for any two training pairs $\tilde{r}^{(1)}_{k-1}$ and $\tilde{r}^{(2)}_{k-1}$, since the sum in (10) is bounded. Then, for any $r_k$ such that $\tilde{r}^{(1)}_{k-1} \le r_k \le \tilde{r}^{(2)}_{k-1}$ (component-wise), since $s$, $s_m$ are differentiable with bounded derivatives with respect to their arguments, it must hold that $s(\hat{x}_k, C(\hat{x}^E_{k-1}(r_{k-1}))) - s_m(x^m_k, r_{k-1}) = \hat{y}_k - y^m_k$ is bounded. This makes the controller $C(x^E_k, \hat{\theta})$ stabilizing for the control system in the sense of bounded output when $r_k$ is bounded. Then, it is an admissible one for the infinite-horizon c.f. $J_{MR}$ with $\gamma < 1$. This proves the claim.
An NN can be used as a controller for nonlinear state-feedback control learning. Nonlinear VRFT was proposed in [41,42] and successfully applied to NN controllers in [41,43,44,45], but only for output feedback control and not for state-feedback control as done here.
Notice that VRFT control does not need the entire extended state $\tilde{x}^E_k$ (i.e., including the virtual states of the ORM) for feedback; the process states would suffice for this purpose. However, the state extension is required for preserving the Markov property of the system (3) in order to ensure the correct collection of the transition samples; this is not possible otherwise without a special collection design such as using a zero-order hold for two-by-two consecutive time samples [43]. Correct transition sample collection is required for the adaptive actor-critic tuning of the same NN controller that is initially tuned via VRFT.
Notice that in the proposed state-feedback VRFT design, knowledge of the output function $y_k = g(x_k)$ in Equation (1) is again not needed, since $\tilde{y}_k$ is used to calculate the virtual reference $\tilde{r}_k$, while the controller only uses $\tilde{x}^E_k$ for feedback purposes.
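As an illustration of the state-feedback VRFT design of this section, the sketch below assembles the virtual signals and fits an NN controller to the recorded inputs. It assumes a diagonal linear ORM M(z) whose inverse filter is realizable (otherwise its relative degree must be compensated by an added delay), and it uses scikit-learn's MLPRegressor merely as a convenient stand-in for the NN training; none of these choices should be read as the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter
from sklearn.neural_network import MLPRegressor

def vrft_state_feedback(u_tilde, x_tilde, y_tilde, M_num, M_den, A_m, B_m):
    """State-feedback VRFT sketch (Section 3).

    u_tilde, x_tilde, y_tilde : open-loop data of shapes (N, m), (N, n), (N, p)
    M_num, M_den              : per-channel numerator/denominator of a diagonal
                                ORM transfer matrix M(z) (inverse assumed realizable)
    A_m, B_m                  : discrete ORM state-space used for the virtual states
    """
    N, p = y_tilde.shape
    # 1) virtual reference r~_k = M(z)^{-1} y~_k, channel by channel
    r_tilde = np.column_stack(
        [lfilter(M_den[j], M_num[j], y_tilde[:, j]) for j in range(p)])
    # 2) virtual ORM states x~^m_{k+1} = A_m x~^m_k + B_m r~_k
    x_m = np.zeros((N, A_m.shape[0]))
    for k in range(N - 1):
        x_m[k + 1] = A_m @ x_m[k] + B_m @ r_tilde[k]
    # 3) extended virtual state and controller fit, i.e., minimization of J_VR^N in (6)
    x_E = np.hstack([x_tilde, x_m, r_tilde])
    c_nn = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                        max_iter=500, early_stopping=True)
    c_nn.fit(x_E, u_tilde)          # fit u~_k ~ C(x~^E_k, theta)
    return c_nn
```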

4. Adaptive Actor-Critic Learning for ORM Tracking Control

If the system dynamics (3) is known, for a finite-time horizon version of the c.f. (4), numerical dynamic programming solutions can be employed backwards in time only with finite state and action spaces of moderate size, an issue referred to as the “curse of dimensionality”. For infinite horizon c.f.s, Policy Iteration and Value Iteration [23] can be used even for large and/or continuous state and action spaces, where FAs such as NNs are one option.
If the system dynamics in (3) is unknown, the minimization of the c.f. (4) becomes an RL problem. To solve it model-free, an informative c.f. for each state-action pair is defined, called the Q-function (or action-value function). In this respect, the action-value function of applying $u_k$ in state $x^E_k$ and then following the control (policy) $u_k = C(x^E_k)$ is defined as
$$ Q^C(x^E_k, u_k) = \Upsilon(x^E_k, u_k) + \gamma Q^C(x^E_{k+1}, C(x^E_{k+1})). \qquad (11) $$
The optimal Q-function $Q^*(x^E_k, u^*_k)$ satisfies Bellman's optimality equation
$$ Q^*(x^E_k, u^*_k) = \min_{u_k} \big( \Upsilon(x^E_k, u_k) + \gamma Q^*(x^E_{k+1}, u^*_{k+1}) \big), \qquad (12) $$
with the optimal controller and optimal Q-function
$$ u^*_k = C^*(x^E_k) = \arg\min_C Q^C(x^E_k, u_k) = \arg\min_u Q^*(x^E_k, u). \qquad (13) $$
Then $J^*(x^E_k) = Q^*(x^E_k, u^*_k)$, where $J^*(x^E_k) = \min_u J(x^E_k, u)$ is the minimum value c.f. out of the c.f.s defined in Equation (4). Notice that c.f. (4) encompasses (5), thus making the ORM tracking problem consistent with the above equations. The optimal Q-function can be found using Policy Iteration or Value Iteration in a model-free manner, using, e.g., NNs as FAs. The optimal Q-function estimate and the optimal controller estimate can be updated from the transition samples in several ways: in online/offline mode, batch mode, or sample-by-sample update [23,46]. A particular class of online RL approaches is represented by the temporal-difference-based AAC design, which differs from the batch PI and VI approaches in that it avoids alternating batch back-ups of the Q-function FA and of the controller FA.

4.1. Adaptive Actor-Critic Design

The proposed AAC design is a gradient-based scheme designed to converge to the optimal Q-function and optimal controller estimates. Let the temporal-difference error be measured from data as
$$ \delta_k(x^E_{k-1}, u_{k-1}) = \Upsilon(x^E_{k-1}, u_{k-1}) + \gamma \hat{Q}_{k-1}(x^E_k, C_k(x^E_k)) - \hat{Q}_{k-1}(x^E_{k-1}, u_{k-1}), \qquad (14) $$
where the continuous function $\hat{Q}_k(\cdot, \cdot)$ in its arguments is the Q-function estimate at time $k$, at which time some controller $C_k(x_k)$ is also available. From this point onward, for notation simplicity, $x_k$ or plain $x$ is used instead of $x^E_k$. The proposed AAC design attempts, via the Q-function update, to online minimize the c.f. $E_{c,k} = 0.5\,\delta_k^2$, while the controller attempts to online minimize the Q-function using gradient descent. Taxonomically, the proposed AAC belongs to the online Policy Iteration schemes, where the policy evaluation step (of Bellman error residual minimization type) interleaves with the policy improvement step. The update laws for the AAC design from input-state data are:
$$ u_k = C_k(x) = C_{k-1}(x) - \alpha_a \left. \frac{\partial \hat{Q}_{k-1}(x, u)}{\partial u} \right|_{u_{k-1}, x}, \qquad (15) $$
$$ \hat{Q}_k(x_{k-1}, u_{k-1}) = \hat{Q}_{k-1}(x_{k-1}, u_{k-1}) + \alpha_c \delta_k = \hat{Q}_{k-1}(x_{k-1}, u_{k-1}) + \alpha_c \big( \Upsilon(x_{k-1}, u_{k-1}) + \gamma \hat{Q}_{k-1}(x_k, u_k = C_k(x_k)) - \hat{Q}_{k-1}(x_{k-1}, u_{k-1}) \big), \qquad (16) $$
where $\alpha_a > 0$, $\alpha_c > 0$ are learning rates. The controller $C(x_k)$ can be imagined as a function (or as an infinitely dense table) mapping any state to a control action.
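To make the interplay of (14)–(16) concrete, a minimal sketch of one online AAC step with generic function approximators is given below; the objects Q and C and their methods (value, grad_u, update) are placeholders assumed only for illustration, standing in for automatic differentiation or for the explicit backpropagation rules (27), (28) given later.

```python
def aac_step(x_prev, u_prev, x, stage_cost, Q, C, alpha_a, alpha_c, gamma):
    """One adaptive actor-critic update following (14)-(16).

    Q : critic placeholder exposing Q.value(x, u), Q.grad_u(x, u) (dQ/du) and
        Q.update(x, u, correction) for a gradient step on its parameters.
    C : actor placeholder exposing C.value(x) and C.update(x, correction).
    """
    # (15) controller improvement: move the control opposite to dQ/du
    C.update(x_prev, -alpha_a * Q.grad_u(x_prev, u_prev))
    u = C.value(x)                      # improved control applied at time k
    # (14) temporal-difference error measured from data
    delta = (stage_cost(x_prev, u_prev)
             + gamma * Q.value(x, u) - Q.value(x_prev, u_prev))
    # (16) critic correction toward satisfying the Bellman equation
    Q.update(x_prev, u_prev, alpha_c * delta)
    return u, delta
```

Setting alpha_a = 0 recovers the pure policy-evaluation step of Comment 1 below (and step S3 of Section 4.4).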
Comment 1: In particular, for any admissible controller $C_0$, repeated calls of (16), under proper exploration (translated into visiting all pairs $(x_{k-1}, u_{k-1}) \in X_E \times U$ often and generating the sample $x_k$) and under proper selection of $\alpha_c > 0$, will update $\hat{Q}_k(x, u)$, $\forall x \in X_E$, until $\delta_k = 0$, at which point (11) must hold and the converged $Q^{C_0}(x_k, u_k)$ evaluates $C_0$. This is an online off-policy model-free policy evaluation step. Then $Q_0(x, u) = Q^{C_0}(x, u)$ can serve as an initialization for the AAC, whereas an initial admissible controller can be obtained, for example, using VRFT, as shown later in the case study.
Comment 2: The converged Q-function of an admissible controller $C_0$, denoted $Q^{C_0}(x_k, u_k)$, is positive by definition since it accumulates stage costs $\Upsilon > 0$. Moreover, it is always greater than or equal to the optimal Q-function, i.e., $Q^{C_0}(x_k, u_k) \ge Q^*(x_k, u_k) > 0$, and it obeys the Bellman equation.
Comment 3: From (15), it follows that
$$ Q_{k-1}(x, C_k(x)) \le Q_{k-1}(x, C_{k-1}(x)), \quad \forall x \in X_E, \qquad (17) $$
for a small enough $\alpha_a > 0$.
Lemma 1. Starting from an admissible controller $C_0$ with corresponding Q-function initialization $\hat{Q}_0(x, u) = Q^{C_0}(x, u)$, the sequence $\{\hat{Q}_k(x, u)\}$ is monotonic and non-increasing, ensuring that
$$ \hat{Q}_k(x, u) \le \hat{Q}_{k-1}(x, u), \quad \forall (x, u) \in X_E \times U. \qquad (18) $$
Proof. $\hat{Q}_0(x, u)$ is the initialization for Equation (16) and obeys the Bellman Equation (11) for the admissible controller $C_0$, i.e.,
$$ \hat{Q}_0(x_k, u_k) = \Upsilon(x_k, u_k) + \gamma \hat{Q}_0(x_{k+1}, C_0(x_{k+1})). \qquad (19) $$
Starting the AAC update law from the initial state $x_0$, $C_1(x)$ is updated first by Equation (15); then it follows that
$$ \hat{Q}_1(x_0, u) = \hat{Q}_0(x_0, u) + \alpha_c \big( \Upsilon(x_0, u) + \gamma \hat{Q}_0(x_1, C_1(x_1)) - \hat{Q}_0(x_0, u) \big) \le \hat{Q}_0(x_0, u) + \alpha_c \big( \Upsilon(x_0, u) + \gamma \hat{Q}_0(x_1, C_0(x_1)) - \hat{Q}_0(x_0, u) \big) = \hat{Q}_0(x_0, u), \quad \forall u, \qquad (20) $$
where the inequality follows from (17) and the last equality from the Bellman Equation (19).
Since $x_0$ can be any state $x \in X_E$, Equation (18) holds for $k = 1$. Assume by induction that (18) holds for some $k$. Using Comment 3, it follows that
$$ \begin{aligned} \hat{Q}_{k+1}(x_k, u_k) &= \hat{Q}_k(x_k, u_k) + \alpha_c \big( \Upsilon(x_k, u_k) + \gamma \hat{Q}_k(x_{k+1}, C_{k+1}(x_{k+1})) - \hat{Q}_k(x_k, u_k) \big) \\ &\le \hat{Q}_k(x_k, u_k) + \alpha_c \big( \Upsilon(x_k, u_k) + \gamma \hat{Q}_k(x_{k+1}, C_k(x_{k+1})) - \hat{Q}_k(x_k, u_k) \big) \\ &\overset{(18)}{\le} \hat{Q}_{k-1}(x_k, u_k) + \alpha_c \big( \Upsilon(x_k, u_k) + \gamma \hat{Q}_{k-1}(x_{k+1}, C_k(x_{k+1})) - \hat{Q}_{k-1}(x_k, u_k) \big) = \hat{Q}_k(x_k, u_k), \end{aligned} \qquad (21) $$
and since $(x_k, u_k)$ can be any pair $(x, u) \in X_E \times U$, the conclusion of Lemma 1 follows.
Theorem 2. Let $\hat{Q}_0(x, u) > 0$ (finite for any finite argument) be an initialization of the Q-function for an initial admissible controller $C_0$. Starting with any $x_0$, the control $u_0 = C_0(x_0)$ is applied to the process. Specifically, the AAC update laws (15), (16) ensure that at time $k = 1$, $C_1$ is updated from $C_0$, $\hat{Q}_0(x, u)$ is updated with (16) using $u_1 = C_1(x_1)$ in the right-hand side, the control $u_1$ is sent to the process, and then $k \leftarrow 2$, with the above strategy repeated for subsequent times. Claim: The feedback control system under the time-varying control $C_k$ is stabilized for $\gamma < 1$ and asymptotically stabilized for $\gamma = 1$.
Proof. It is valid for the first three time steps that
$$ \hat{Q}_1(x_0, u_0) \overset{(16)}{=} \hat{Q}_0(x_0, u_0) + \alpha_c \big[ \Upsilon(x_0, u_0) + \gamma \hat{Q}_0(x_1, u_1 = C_1(x_1)) - \hat{Q}_0(x_0, u_0) \big] \overset{(18)}{\le} \hat{Q}_0(x_0, u_0), \qquad (22a) $$
$$ \hat{Q}_2(x_1, u_1) \overset{(16)}{=} \hat{Q}_1(x_1, u_1) + \alpha_c \big[ \Upsilon(x_1, u_1) + \gamma \hat{Q}_1(x_2, u_2 = C_2(x_2)) - \hat{Q}_1(x_1, u_1) \big] \overset{(18)}{\le} \hat{Q}_1(x_1, u_1), \qquad (22b) $$
$$ \hat{Q}_3(x_2, u_2) \overset{(16)}{=} \hat{Q}_2(x_2, u_2) + \alpha_c \big[ \Upsilon(x_2, u_2) + \gamma \hat{Q}_2(x_3, u_3 = C_3(x_3)) - \hat{Q}_2(x_2, u_2) \big] \overset{(18)}{\le} \hat{Q}_2(x_2, u_2). \qquad (22c) $$
Cancelling the same terms on both sides of Equations (22a–c), since $\alpha_c > 0$, it follows that the sums in square brackets are non-positive. These sums are further refined using Lemma 1 as
$$ \Upsilon(x_0, u_0) + \gamma \hat{Q}_0(x_1, u_1) \le \hat{Q}_0(x_0, u_0), \qquad (23a) $$
$$ \Upsilon(x_1, u_1) + \gamma \hat{Q}_1(x_2, u_2) \le \hat{Q}_1(x_1, u_1) \overset{(18)}{\le} \hat{Q}_0(x_1, u_1), \qquad (23b) $$
$$ \Upsilon(x_2, u_2) + \gamma \hat{Q}_2(x_3, u_3) \le \hat{Q}_2(x_2, u_2) \overset{(18)}{\le} \hat{Q}_1(x_2, u_2). \qquad (23c) $$
Using (23c) in (23b), it follows that $\Upsilon(x_1, u_1) + \gamma \Upsilon(x_2, u_2) + \gamma^2 \hat{Q}_2(x_3, u_3) \le \hat{Q}_0(x_1, u_1)$, which used in (23a) results in $\Upsilon(x_0, u_0) + \gamma \Upsilon(x_1, u_1) + \gamma^2 \Upsilon(x_2, u_2) + \gamma^3 \hat{Q}_2(x_3, u_3) \le \hat{Q}_0(x_0, u_0)$. Extending the exemplified reasoning backwards from an arbitrarily long horizon, it follows that
$$ \lim_{N \to \infty} \left( \sum_{i=0}^{N-1} \gamma^i \Upsilon(x_i, u_i) + \gamma^N \hat{Q}_{N-1}(x_N, u_N) \right) \le \hat{Q}_0(x_0, u_0). \qquad (24) $$
Since $\lim_{N \to \infty} \gamma^N \hat{Q}_{N-1}(x_N, u_N) = 0$ (because $\hat{Q}_{N-1}(x_N, u_N)$ is bounded, resulting from a non-increasing sequence), it follows that $\lim_{N \to \infty} \sum_{i=0}^{N-1} \gamma^i \Upsilon(x_i, u_i) \le \hat{Q}_0(x_0, u_0)$ is finite. The left term in the inequality is the cost of using the controls $u_0 = C_0(x_0), u_1 = C_1(x_1), \dots$. Then it follows that the control system remains stable under the time-varying control of the AAC updates. Moreover, for $\gamma = 1$, the sequence $\{\Upsilon(x_i, u_i)\}$, $i = \overline{0, \infty}$, must be a summable sequence (hence $\Upsilon(x_i, u_i) \to 0$), which implies that the control system is asymptotically stabilized by $C_k$, thus proving the claim of the theorem.
Comment 4: The above proof resembles the stabilizing action-dependent value iteration of [30], but here it relies on gradient-based updates of the Q-function estimate and of the controller, rather than on its minimization. The stability result of Theorem 2 is valid under continuous updates of the AAC laws (15), (16) with no exploration. It ensures that, starting from an admissible controller $C_0$, the AAC updates (15), (16) preserve the control system stability. Since exploratory controls $u_k$ are critical, a compromise solution is to perform the controller update (15) only at non-exploratory sampling instants, while the Q-function estimate is continuously updated per (16).

4.2. AAC Using Neural Networks As Approximators

In practice, specific FAs such as NNs are employed as approximators for the Q-function and for the controller, respectively. Using NNs as FAs (implying nonlinear feature parameterization), the convergence of the learning scheme depends to a large extent on the selection of the learning parameters. However, the advantage of using generic NN architectures is that no manual or automatic feature selection is needed for parameterizing the Q-function and controller estimates.
The proposed AAC design herein uses two NNs to approximate the Q-function (the critic, referred to herein as Q-NN) and the controller (the actor, referred to herein as C-NN), respectively. Assume an initial admissible NN VRFT state-feedback controller exists. Let the critic NN FA and the controller NN FA be parameterized as $\hat{Q}_k(x^E_k, u_k, \theta^c_k)$ and $C_k(x^E_k, \theta^a_k)$, respectively. With three-layer feed-forward NNs having one hidden layer, fully connected, with bias, the critic and the controller are modeled by:
$$ \hat{Q}_k = W^k_{c, n_{hc}+1} + \sum_{i=1}^{n_{hc}} W^k_{c,i} \, \sigma_i \Big( \sum_{j=1}^{n_{ic}+1} V^k_{c,ji} I^k_j \Big), \qquad (25) $$
$$ u^l_k = W^{k,l}_{a, n_{ha}+1} + \sum_{i=1}^{n_{ha}} W^{k,l}_{a,i} \, \sigma_i \Big( \sum_{j=1}^{n_{ia}+1} V^k_{a,ji} I^k_j \Big), \quad l = \overline{1, m}, \qquad (26) $$
with $W^k_c = [W^k_{c,1} \ \dots \ W^k_{c,n_{hc}+1}]^T$ the critic output-layer weights (the critic has $n_{hc}$ hidden neurons), $V^k_c = [V^k_{c,ji}]_{i=\overline{1,n_{hc}},\, j=\overline{1,n_{ic}+1}}$ the critic hidden-layer weight matrix, $I^k_c = [I^k_1 \ \dots \ I^k_{n_{ic}+1}]^T = [(x^E_k)^T \ (u_k)^T \ 1]^T$ the critic input vector of size $n_{ic}+1$ (the bias input is the constant 1), $W^{k,l}_a = [W^{k,l}_{a,1} \ \dots \ W^{k,l}_{a,n_{ha}+1}]^T$, $l = \overline{1, m}$, the output-layer weights of the $l$-th controller output of $u_k = [u^1_k \ \dots \ u^m_k]^T$ (the controller has $n_{ha}$ hidden neurons), $V^k_a = [V^k_{a,ji}]_{i=\overline{1,n_{ha}},\, j=\overline{1,n_{ia}+1}}$ the controller hidden-layer weight matrix, and $I^k_a = [I^k_1 \ \dots \ I^k_{n_{ia}+1}]^T = [(x^E_k)^T \ 1]^T$ the controller input vector of size $n_{ia}+1$ (bias input 1 included). $\sigma_i = \tanh_i$ is the hyperbolic tangent activation function at the output of the $i$-th hidden neuron. The controller and the critic are cascaded, with the actor output $u_k$ being part of the critic input alongside $x^E_k$. The actor and critic weights are formally parameterized as $\theta^a_k = [(W^{k,l}_a)^T \ (V^k_a)^T]^T$ and $\theta^c_k = [(W^k_c)^T \ (V^k_c)^T]^T$, respectively.
The AAC's gradient-descent tuning rules for the critic (25) and the controller (26), as parameterized variants of (16) and (15), respectively, are
$$ \theta^c_k = \theta^c_{k-1} + \alpha_c \left. \frac{\partial \hat{Q}(x, u, \theta)}{\partial \theta} \right|_{(x^E_{k-1}, u_{k-1}, \theta^c_{k-1})} \delta_k, \ \text{detailed as:} \quad W^k_{c,i} = W^{k-1}_{c,i} + \delta_k \alpha_c \times \begin{cases} \sigma_i \Big( \sum_{j=1}^{n_{ic}+1} V^{k-1}_{c,ji} I^{k-1}_j \Big), & \text{if } i \ne n_{hc}+1, \\ 1, & \text{if } i = n_{hc}+1, \end{cases} \quad V^k_{c,ji} = V^{k-1}_{c,ji} + \delta_k \alpha_c W^{k-1}_{c,i} I^{k-1}_j \, \sigma_i' \Big( \sum_{j=1}^{n_{ic}+1} V^{k-1}_{c,ji} I^{k-1}_j \Big), \qquad (27) $$
$$ \theta^a_k = \theta^a_{k-1} - \alpha_a \left. \frac{\partial \hat{Q}(x, u, \theta)}{\partial u} \right|_{(x^E_{k-1}, u_{k-1}, \theta^c_{k-1})} \left. \frac{\partial u(x, \theta)}{\partial \theta} \right|_{(x^E_{k-1}, \theta^a_{k-1})}, \ \text{detailed as:} \quad W^{k,l}_{a,i} = W^{k-1,l}_{a,i} - \alpha_a \underbrace{\left( \sum_{i=1}^{n_{hc}} W^{k-1}_{c,i} V^{k-1}_{c,\xi i} \, \sigma_i' \Big( \sum_{j=1}^{n_{ic}+1} V^{k-1}_{c,ji} I^{k-1}_j \Big) \right)}_{\partial \hat{Q}(x, u, \theta) / \partial u \,|\, (x^E_{k-1}, u_{k-1}, \theta^c_{k-1})} \times \begin{cases} \sigma_i \Big( \sum_{j=1}^{n_{ia}+1} V^{k-1}_{a,ji} I^{k-1}_j \Big), & \text{if } i \ne n_{ha}+1, \\ 1, & \text{if } i = n_{ha}+1, \end{cases} \quad V^k_{a,ji} = V^{k-1}_{a,ji} - \alpha_a \left. \frac{\partial \hat{Q}(x, u, \theta)}{\partial u} \right|_{(x^E_{k-1}, u_{k-1}, \theta^c_{k-1})} W^{k-1,l}_{a,i} I^{k-1}_j \, \sigma_i' \Big( \sum_{j=1}^{n_{ia}+1} V^{k-1}_{a,ji} I^{k-1}_j \Big), \qquad (28) $$
with $\alpha_a$, $\alpha_c$ the learning-rate magnitudes of the controller and critic training rules, respectively, where $\sigma_i'$ is the derivative of $\sigma_i$ with respect to its argument and $\xi$ is the index of $u^l_k$ in $I^k_c$.
In many practical applications, the designer chooses to perform either full or partial adaptation of the NNs’ weights, the latter implying only output weights adaptation. In this latter case, the Q-NN and C-NN parameterizations are:
$$ \hat{Q}_k(x^E_k, u_k) = (W^k_c)^T \sigma \big( V^k_c [(x^E_k)^T \ (u_k)^T \ 1]^T \big) = (W^k_c)^T \Phi^k_c(x^E_k, u_k), \quad u_k = C_k(x^E_k) = (W^k_a)^T \sigma \big( V^k_a [(x^E_k)^T \ 1]^T \big) = (W^k_a)^T \Phi^k_a(x^E_k), \qquad (29) $$
where $\Phi^k_c(x^E_k, u_k)$, $\Phi^k_a(x^E_k)$ are the matrices of basis functions (or input features) and $W^k_c$, $W^k_a$ are the tunable output-weight parameters, rendering $\hat{Q}_k$, $u_k$ as linear combinations of basis functions. This linear parameterization simplifies the convergence analysis but has the disadvantage of requiring manual feature selection and training.
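As an illustration of the output-weights-only case, the sketch below performs one update of the linearly parameterized laws (30) given later, for a scalar control input; approximating the critic gradient with respect to u by a finite difference instead of the analytic feature derivative is a simplification made only for this sketch.

```python
import numpy as np

def feat(V, z):
    """Fixed hidden layer: sigma(V [z; 1]) with hyperbolic tangent activations."""
    return np.tanh(V @ np.append(z, 1.0))

def aac_output_weight_step(Wc, Wa, Vc, Va, x_prev, u_prev, x, stage_cost,
                           alpha_c, alpha_a, gamma, du=1e-4):
    """One output-weights-only AAC update: only Wc, Wa are tuned, while the
    hidden layers Vc, Va act as fixed feature matrices (scalar control case)."""
    # actor update: gradient step on the critic with respect to the control
    q0 = Wc @ feat(Vc, np.append(x_prev, u_prev))
    q1 = Wc @ feat(Vc, np.append(x_prev, u_prev + du))
    dQ_du = (q1 - q0) / du
    Wa = Wa - alpha_a * dQ_du * feat(Va, x_prev)
    u = Wa @ feat(Va, x)                      # improved control applied at time k
    # critic update driven by the temporal-difference error
    phi_prev = feat(Vc, np.append(x_prev, u_prev))
    delta = (stage_cost(x_prev, u_prev)
             + gamma * (Wc @ feat(Vc, np.append(x, u))) - Wc @ phi_prev)
    Wc = Wc + alpha_c * delta * phi_prev
    return Wc, Wa, u
```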
As is well known, the AAC architecture performs online, with the C-NN sending controls to the process and the Q-NN serving both to estimate the Q-function and to adaptively tune the C-NN. The closed-loop control system consisting of the process (3) combined with the AAC tuning rules (27), (28) has the unique property that its dynamics is mainly driven by the reference input $r_k$, viewed as a particular state of the extended state vector, and, possibly, by exogenous unknown disturbances. Since $r_k$ is user-selectable, it can be used to drive the control system over a wide operating range to ensure efficient exploration of the state space. Enhanced exploration of the domain $X_E \times U$ can be performed by trying random actions in every state, usually as additive uniform random actions.

4.3. Convergence of the AAC Learning Scheme with NNs

While the results of Section 4.1 are formulated for generic functions approximating the Q-function and the controller, the convergence to the optimal controller and optimal Q-function is not yet ensured. In the following, the convergence of the AAC learning scheme with NNs is shown. Linear parameterization is the most widely used and supports tractable analysis. Let the output-weight parameterization (29) of the Q-NN and C-NN lead to the update laws
$$ W^k_c = W^{k-1}_c + \alpha_c \delta_k \Phi^k_c(x^E_{k-1}, u_{k-1}), \quad W^k_a = W^{k-1}_a - \alpha_a \Phi^k_a(x^E_{k-1}) (W^{k-1}_c)^T \Phi_{c,u}(x^E_{k-1}, u_{k-1}), \qquad (30) $$
which can be compactly written as
$$ W^k_c = W^{k-1}_c + \alpha_c \delta_k \Phi^{k-1}_c, \quad W^k_a = W^{k-1}_a - \alpha_a \Phi^{k-1}_a (W^{k-1}_c)^T \Phi^{k-1}_{c,u}, \qquad (31) $$
where $\delta_k = \Upsilon_{k-1} + (W^{k-1}_c)^T (\gamma \Phi^k_c - \Phi^{k-1}_c)$, with $\Upsilon_{k-1} = \Upsilon(x^E_{k-1}, u_{k-1})$. Let the weights of the optimal Q-NN and optimal C-NN be $W^*_c$, $W^*_a$, the estimation errors being $\tilde{W}^k_c = W^k_c - W^*_c$, $\tilde{W}^k_a = W^k_a - W^*_a$, which render the estimation error dynamics
$$ \tilde{W}^k_c = \tilde{W}^{k-1}_c + \alpha_c \delta_k \Phi^{k-1}_c, \quad \tilde{W}^k_a = \tilde{W}^{k-1}_a - \alpha_a \Phi^{k-1}_a (W^{k-1}_c)^T \Phi^{k-1}_{c,u}. \qquad (32) $$
Some assumptions follow:
A7. Let the estimation error of the critic be denoted as $\zeta^{k-1}_c = (\tilde{W}^{k-1}_c)^T \Phi^{k-1}_c$, let the critic's and actor's hidden activation layers be bounded as $\| \Phi^{k-1}_c \|_2 \le \bar{\varphi}_c$, $\| \Phi^{k-1}_a \|_2 \le \bar{\varphi}_a$, and let the critic's activation-layer derivative with respect to $u$ be bounded as $\| \Phi^{k-1}_{c,u} \|_2 \le \bar{\varphi}_{c,u}$, where the Frobenius norm is used, which is equivalent to the Euclidean norm when applied to vectors. The above upper bounds follow since the activation functions are bounded and so are their derivatives.
Theorem 3. Under A7, the AAC learning scheme converges to a vicinity of the optimal controller and optimal Q-function, since the estimation errors $\tilde{W}^k_c$, $\tilde{W}^k_a$ are uniformly ultimately bounded, provided that $\alpha_c > \frac{4 \bar{\varphi}_c^2}{2 \bar{\varphi}_c + 1}$, for $\bar{\varphi}_c > 1$.
Proof: See Appendix B.
Comment 5. Notice that the temporal-difference error $\delta_k$ is calculated in terms of the Q-NN's output. Then $\delta_k$ is backpropagated to correct the Q-NN weights. Moreover, $\delta_k$ is further backpropagated to correct the C-NN weights, since the C-NN output is an input to the Q-NN. Hence, the resulting AAC architecture belongs to the deep learning approaches (the architecture is presented in Figure 1b of the next section).

4.4. Summary of the Mixed VRFT-AAC Design Approach

The steps of the VRFT-AAC design approach are summarized next:
S1. Collect input-state-output samples from the open-loop stable process (1) in a dataset $D = \{\tilde{u}_k, \tilde{x}_k, \tilde{y}_k\} \subset U \times X \times Y$, $k = \overline{0, N-1}$, where $\tilde{u}_k$ is persistently exciting, under the conditions of A5.
S2. Obtain the initial state-feedback VRFT controller by minimizing the c.f. in Equation (6) as $\theta^a_k = \arg\min_\theta J^N_{VR}(\theta)$. When NNs are used, the minimization of the c.f. (6) is equivalent to training the NN. The obtained controller is $C_k(x^E_k, \theta^a_k)$, which is a controller for both the process (1) and the extended process (3). It is also a close initialization to the optimal controller that minimizes $J_{MR}$ from (5), since VRFT identifies a controller that approximately minimizes $J^N_{MR}$ from (10) as a finite-horizon version of $J_{MR}$. This is supported by Theorem 1.
S3. Close the control loop on the process (1) (equivalent to closing it on the extended process (3)) using the controller $C_k(x^E_k, \theta^a_k)$. The architecture is presented in Figure 2b. Use update (16) (in explicitly parameterized form, use (27)) under an exploratory reference input $r_k$ in order to learn the Q-function of the controller $C_k(x^E_k, \theta^a_k)$. This serves to properly initialize $\hat{Q}_k(x^E_k, u_k, \theta^c_k)$ for the subsequent AAC tuning.
S4. Use the updates (15), (16) (or (27), (28) in explicitly parameterized form), in this exact order and under an exploratory reference input $r_k$, to learn the optimal controller $C^*(x^E_k, \theta^{a*}_k = W^*_a)$ and the optimal Q-function $\hat{Q}^*(x^E_k, u_k, \theta^{c*}_k = W^*_c)$. Using the above updates for a finite time on a random learning scenario is called a learning episode.
S5. After every learning episode, measure the tracking performance on a standard test scenario. When the prescribed maximum number of tests is reached or the tracking performance on the standard scenario no longer improves, the controller learning is stopped. Otherwise, proceed to the next learning episode.
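A high-level sketch of steps S1–S5 is given below; every helper is a placeholder standing in for the corresponding procedure described above, and the simple stopping rule is one possible reading of S5 rather than the authors' exact criterion.

```python
def vrft_aac(collect_open_loop_data, vrft_fit, pretune_critic,
             run_learning_episode, run_test_scenario, max_episodes=30):
    """Sketch of the mixed VRFT-AAC design (all helpers are placeholders):
      collect_open_loop_data -> dataset D = {u~, x~, y~}                 (S1)
      vrft_fit               -> initial state-feedback C-NN from D       (S2)
      pretune_critic         -> Q-NN of the VRFT controller, alpha_a = 0 (S3)
      run_learning_episode   -> online AAC updates (15), (16) under an
                                exploratory reference input              (S4)
      run_test_scenario      -> tracking cost on the fixed test scenario (S5)
    """
    D = collect_open_loop_data()
    controller = vrft_fit(D)
    critic = pretune_critic(controller)
    best = run_test_scenario(controller)
    for _ in range(max_episodes):
        controller, critic = run_learning_episode(controller, critic)
        cost = run_test_scenario(controller)
        if cost >= best:        # stop once the tracking cost stops improving
            break
        best = cost
    return controller, critic
```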
All implementation details of the above VRFT-AAC design are presented in the following Section 5.1, where validation is performed on the complex multivariable tank system case study.

5. Validation Case Study

5.1. AAC Design for a MIMO Vertical Tank System

The controlled process is a vertical MIMO two-tank system (Figure 1c), built around three-tank laboratory equipment [47], with the continuous-time state-space equations
$$ \dot{H}_1 = \frac{k}{a w} u_1 - \frac{1}{a w} \bar{C}_1 H_1^{\alpha_1} (2.5 \tilde{u}_2 - 0.5), \quad \dot{H}_2 = \frac{1}{a w} \bar{C}_1 H_1^{\alpha_1} (2.5 \tilde{u}_2 - 0.5) - \frac{1}{c w + \frac{H_2}{H_{2\max}} b w} \bar{C}_2 H_2^{\alpha_2}, \quad \tilde{u}_2 = \min(\max(u_2, 0.6), 1), \qquad (33) $$
with $a = 0.25$ m, $w = 0.035$ m, $c = 0.1$ m, $b = 0.345$ m, and $H_{1\max} = H_{2\max} = 0.35$ m. $x_1 = y_1 = H_1 \in [0, H_{1\max}]$ and $x_2 = y_2 = H_2 \in [0, H_{2\max}]$ are the water levels in the two tanks, considered as system states and controlled outputs. The control inputs $u_1, u_2 \in [0, 1]$ (also expressible in %) are the duty cycles of the pump direct current (DC) motor and of the electrically controlled valve $C_1$, respectively. $k = 1.66 \times 10^{-4}\ \mathrm{m^3/(s \cdot \%)}$ is the gain from the pump input to the inflow, $\bar{C}_1 = 5.65 \times 10^{-5}\ \mathrm{m^{3-\alpha_1}/s}$ and $\bar{C}_2 = 8 \times 10^{-5}\ \mathrm{m^{3-\alpha_2}/s}$ are the resistances of the outflow orifices of the first (upper) and second (lower) tank, called $T_1$ and $T_2$, respectively, and $\alpha_1 = 0.29$, $\alpha_2 = 0.22$. The third equality in Equation (33) reflects the dead zone plus saturation on the second control input $u_2$.
Features of this process include: no water level setpoint for tank $T_2$ can be set if the water level in $T_1$ is zero; the electrical valve controlling the outflow from $T_1$ (which is the inflow to $T_2$) is driven by $\tilde{u}_2(u_2)$; no setpoint can be tracked for either tank if there is more outflow than inflow; $T_1$'s outflow has a minimum value and can be zero only when $H_1 = 0$, as per the first equality in (33). $T_2$'s dynamics is slower than $T_1$'s. Proper selection of the parameters $\bar{C}_1$, $\bar{C}_2$ through manual valves allows feasible control trajectories in the constrained input-state-output space. Discretization of (33) reveals its Markov form. The water level is measured using piezoelectric sensors ($PS_1$ and $PS_2$ in Figure 1c). Protection logic disables the pump voltage when the water level exceeds the upper bound. The sampling period used for the control experiments is $T_s = 0.5$ s. Model (33) is not used for control design.
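For reference, the tank model (33) can be simulated with the short sketch below; it mirrors the right-hand sides of (33) as reconstructed above and uses a forward-Euler discretization at Ts = 0.5 s, which is an assumption of this sketch only (the paper does not state a discretization method, and the model is not used for control design).

```python
# parameters of Equation (33)
a, w, c, b = 0.25, 0.035, 0.1, 0.345            # [m]
H1_MAX = H2_MAX = 0.35                          # [m]
K_PUMP = 1.66e-4                                # [m^3/(s*%)]
C1_BAR, C2_BAR = 5.65e-5, 8e-5
ALPHA1, ALPHA2 = 0.29, 0.22
TS = 0.5                                        # [s] sampling period

def tank_step(H1, H2, u1, u2, Ts=TS):
    """Forward-Euler step of the vertical two-tank dynamics (33) (sketch only)."""
    u2_t = min(max(u2, 0.6), 1.0)                        # dead zone plus saturation on u2
    q12 = C1_BAR * H1**ALPHA1 * (2.5 * u2_t - 0.5)       # T1 outflow = T2 inflow
    dH1 = (K_PUMP * u1 - q12) / (a * w)
    dH2 = q12 / (a * w) - C2_BAR * H2**ALPHA2 / (c * w + H2 / H2_MAX * b * w)
    H1 = min(max(H1 + Ts * dH1, 0.0), H1_MAX)            # levels are physically constrained
    H2 = min(max(H2 + Ts * dH2, 0.0), H2_MAX)
    return H1, H2
```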
For VRFT-based control design, the ORM is selected as the ZOH discretization of $M(s) = \mathrm{diag}(M_1(s), M_2(s))$, where $M_1(s) = \omega_0^2/(s^2 + 2\varsigma\omega_0 s + \omega_0^2)$ with damping factor $\varsigma = 1.0$ and natural frequency $\omega_0 = 0.5$ rad/s selects the speed and shape of the desired response $y^m_{k,1}$, while a similar $M_2(s)$ with $\varsigma = 1.0$ and $\omega_0 = 0.2$ rad/s describes $y^m_{k,2}$. For collecting the open-loop input-state-output data $\{\tilde{u}_k, \tilde{x}_k, \tilde{y}_k\}_{k=0}^{N-1}$, 16,000 samples have been generated from a uniformly random sequence of persistently exciting steps $\tilde{u}_k = [\tilde{u}_{k,1}\ \tilde{u}_{k,2}]^T \in [0, 0.5] \times [0, 1]$, each step lasting 20 s, for an experiment of 8000 s. The collected data are displayed in Figure 2 and ensure the exploratory conditions from Assumption A5.
The controllable canonical state-space realizations $(A_1, B_1, C_1, D_1)$ and $(A_2, B_2, C_2, D_2)$ of $M_1(z)$ and $M_2(z)$ are, respectively:
$$ A_1 = \begin{pmatrix} 1.5576 & -0.6065 \\ 1 & 0 \end{pmatrix}, \ B_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \ C_1 = \begin{pmatrix} 0.0265 & 0.0224 \end{pmatrix}, \ D_1 = 0, \quad A_2 = \begin{pmatrix} 1.8097 & -0.8187 \\ 1 & 0 \end{pmatrix}, \ B_2 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \ C_2 = \begin{pmatrix} 0.0047 & 0.0044 \end{pmatrix}, \ D_2 = 0. \qquad (34) $$
The ORM state is then $x^m_k = [x^m_{k,1}\ x^m_{k,2}\ x^m_{k,3}\ x^m_{k,4}]^T$. The virtual reference $\tilde{r}_k = M(z)^{-1} \tilde{y}_k$ is used as input to the state-space models (34) to obtain the ORM's virtual states $\tilde{x}^m_k$. The extended virtual state vector $\tilde{x}^E_k = [(\tilde{x}_k)^T\ (\tilde{x}^m_k)^T\ (\tilde{r}_k)^T]^T \in \mathbb{R}^8$ is used to offline compute the C-NN via VRFT by fitting the inputs $\tilde{x}^E_k$ to the outputs $\tilde{u}_k$. Note that using two second-order ORMs produces four states in the extended state space, which is disadvantageous. The ORM's order should be as low as possible (usually one), but a second-order model offers greater flexibility in shaping the output response.
The VRFT C-NN architecture (Figure 1a) is a feedforward 8–10–2 fully connected one with biases, having $n_{ha} = 10$ hidden neurons with $\tanh(\cdot)$ activation functions; the output activation functions are linear. The weights are initialized with random uniform numbers in $[-1.5, 1.5]$. Since the NN training is performed offline, standard gradient backpropagation training with Levenberg-Marquardt [48] is used for a maximum of 50 epochs to learn a stabilizing VRFT C-NN controller $C(x^E_k)$ for the MIMO control system that minimizes $J^N_{VR}$. 80% of the data is effectively used for training, while the remaining 20% serves as validation data. Early stopping is used after six consecutive increases in the mean sum of squared errors evaluated on the validation data. Other offline training algorithms such as Broyden-Fletcher-Goldfarb-Shanno [49,50] and conjugate gradient [51,52] may be similarly efficient, whereas their computational burden is prohibitive for online real-time training.
Results on a standard test scenario with the initial VRFT C-NN controller are shown in Figure 3. It is observed that the ORM tracking errors are bounded, since the VRFT controller is stabilizing (though not asymptotically), which validates the theoretical results of Theorem 1 and Corollary 1. It is therefore an admissible controller for $J_{MR}$ in Equation (5) with $\gamma < 1$. The initial controller tuning using VRFT is also attractive because it has learned a feature matrix $\Phi_a(x^E_k)$, so from this point onwards the designer can choose to perform either output-weight adaptation or full-weight adaptation.
The Q-function estimate of the VRFT C-NN (i.e., the critic Q-NN) is next learned in a policy evaluation step, in order to serve as a good initial estimate of the Q-function needed for the subsequent AAC learning and also to fulfil the requirements of Lemma 1. This step is possible since the VRFT controller is admissible and, for a properly selected learning rate, the weights of the Q-NN will converge. The critic Q-NN approximating the Q-function has an architecture similar to the C-NN, of size 10–25–1 (eight states and two controls), with $n_{hc} = 25$. The critic Q-NN output weights are randomly drawn from a zero-mean normal distribution with variance $\sigma^2 = 90$, while the hidden-layer weights are uniformly randomly initialized in $[-1.5, 1.5]$. Setting $\gamma = 0.95$, the learning rates $\alpha_c = 0.01$ in (27) and $\alpha_a = 0$ in (28) (no controller tuning), all the Q-NN weights are updated using the gradient backpropagation in (27), by driving the MIMO control system with a sequence of uniformly random piecewise constant steps $r_k = [r_{k,1}\ r_{k,2}]^T \in [0.05, 0.25] \times [0.01, 0.2]$. This procedure also serves as a tuning step for $\alpha_c$. With $r_{k,1}$ and $r_{k,2}$ lasting 20 s and 33 s, respectively, we ensure that they do not switch simultaneously, to better reveal the coupling effects between the control channels. After 500 s, the critic weights stabilize, the output weights being shown in Figure 4 for 4500 s. This pre-tuned Q-NN will be used as initialization for the following case studies. After this intermediate tuning step of the Q-NN, the designer can choose either full weight adaptation of the Q-NN or output-weight-only adaptation of the Q-NN, in which case the feature matrix $\Phi_c(x^E_k, u_k)$ is kept constant.
The C-NN is now further tuned (in the architecture from Figure 1b) to improve the ORM control performance. Setting $\alpha_a = 10^{-8}$ in (28) and $\alpha_c = 0.01$ in (27) (critic adaptation should generally be faster than actor adaptation), both the C-NN and the Q-NN are adaptively trained online. Although carried out in an adaptive framework, the training unfolds over consecutive episodes, in which the feedback control system is driven by a sequence of random reference input steps for 700 s. The reference inputs are uniformly random piecewise constant steps $r_k = [r_{k,1}\ r_{k,2}]^T \in [0.05, 0.25] \times [0.01, 0.2]$, with $r_{k,1}$ and $r_{k,2}$ lasting 20 s and 33 s, respectively. Updates (27), (28) are skipped when either $r_{k,1}$ or $r_{k,2}$ switches, to preserve the Markov property of the extended model. The controller parameters at the end of an episode are the initial ones for the following episode, with the Q-NN weights following the same transfer rule. To ensure enhanced exploration of the state-action space, the C-NN controller output is perturbed every third sample time with probing noise according to:
$$ u_k = C(x^E_k) + (\mathrm{rand} - \mathrm{rand}) \, \Omega, \quad \Omega = \begin{cases} 1, & \text{if } \mathrm{mod}(k, 3) = 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (35) $$
where $\mathrm{rand}$ is a normally distributed random number with zero mean and variance $\sigma^2 = 3.56$, and $\mathrm{mod}(k, s)$ is the remainder after dividing $k$ by $s$. This is in fact a form of $\varepsilon_0$-greedy exploration strategy, useful to try many actions in the vicinity of the current state. A typical learning episode is shown in Figure 5. After each learning episode, the learning is stopped and the C-NN performance is measured on the standard test scenario from Figure 3, aiming for the decrease of a finite-time version of the c.f. $J_{MR}$ from Equation (5), namely $J^{1400}_{MR}$. This standard test scenario is not seen during training. The learning then resumes with the next episode. After a maximum of 30 learning episodes (meaning 21,000 s and 42,000 samples), the C-NN and Q-NN adaptations are stopped and the learning trial (comprising the learning episodes) converges under Theorem 3. The final adaptively learned C-NN and the initial VRFT controller are both shown performing in Figure 3. The episodic learning allows us to test the improvement in control performance between episodes. Figure 6 illustrates the decrease of $J^{1400}_{MR\_AAC}$ over the episodes of a convergent learning trial. Throughout each learning episode, under the AAC update laws, the control system preserves its stability, as ensured by Theorem 2.
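A small sketch of the probing-noise rule (35) used during the learning episodes follows; the literal (rand − rand) perturbation is kept as stated, and the helper name is hypothetical.

```python
import numpy as np

def probing_control(C, x_E, k, sigma2=3.56):
    """Rule (35): every third sample the C-NN output is perturbed by the
    difference of two zero-mean normal draws of variance sigma^2 = 3.56."""
    u = np.asarray(C(x_E), dtype=float)
    if k % 3 == 0:
        u = u + (np.random.normal(0.0, np.sqrt(sigma2), size=u.shape)
                 - np.random.normal(0.0, np.sqrt(sigma2), size=u.shape))
    return u
```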
For comparison, a model-free approximated batch-fitted Q-learning (BFQ) controller [53,54] is also proposed, using the same Q-NN and C-NN architectures with the same sizes. BFQ alternates offline training of the C-NN and of the Q-NN, using 12,000 transition samples collected under the randomly perturbed model-free single-input single-output VRFT linear controllers $C_1(z) = (2.6092 + 0.1184 z^{-1} - 2.3609 z^{-2})/(1 - z^{-1})$ and $C_2(z) = (1.5735 + 0.2405 z^{-1} - 1.3547 z^{-2})/(1 - z^{-1})$, independently designed for the two tanks, respectively. BFQ implements a Value Iteration algorithm. The training settings assume that the weights of both NNs are initialized to uniform random numbers in $[-1.5, 1.5]$. A maximum of 200 epochs is used for training with Levenberg-Marquardt on 80% training data and 20% validation data. Early stopping is employed to prevent overfitting after six consecutive increases of the mean sum of squared errors on the validation data. 200 iterations of BFQ take about 2 hours and the best controller is saved.
The value of $J^{1400}_{MR}$ is $J^{1400}_{MR\_VRFT} = 1.58$ for the initial VRFT controller, $J^{1400}_{MR\_AAC} = 0.45$ for the final adaptively learned controller, and $J^{1400}_{MR\_MFBFQ} = 0.32$ for the model-free BFQ controller.
Additionally, a model-based approximated BFQ solution is also offered for comparison. A first dimensionality reduction of the extended state space is performed by observing that, for both ORMs, $x^m_{k+1,2} = x^m_{k,1}$ and $x^m_{k+1,4} = x^m_{k,3}$ from the state-space matrices in Equation (34) and, moreover, $y^m_{k,1} = C_1 [x^m_{k,1}\ x^m_{k,2}]^T \approx 2 C_1(1) x^m_{k,1}$, $y^m_{k,2} = C_2 [x^m_{k,3}\ x^m_{k,4}]^T \approx 2 C_2(1) x^m_{k,3}$. Then $x^m_{k,2}$, $x^m_{k,4}$ are considered approximate duplicates of $x^m_{k,1}$, $x^m_{k,3}$ and are removed from the extended state vector, now defined as a reduced extended state vector $x^{ER}_k = [x_{k,1}, x_{k,2}, x^m_{k,1}, x^m_{k,3}, r_{k,1}, r_{k,2}]^T \in \mathbb{R}^6$ when used for feedback and controller learning. For $[0.05, 0.25] \times [0.03, 0.15] \times [0.5, 4.5] \times [1, 12] \times [0.05, 0.25] \times [0.03, 0.15]$ as the domain of $x^{ER}_k$ and $[0, 1] \times [0.5, 1]$ as the domain of $u_k$, we generate a grid of $N_P = 5 \times 5 \times 7 \times 7 \times 6 \times 6 \times 3 \times 3 = 396{,}900$ linearly spaced points. Using 5 points for each of $x^m_{k,2}$, $x^m_{k,4}$ would have led to an $N_P$ of almost 10 million. Let the discrete domains be denoted $X^E_d$ and $U_d$, respectively. The domains of $x^m_{k,1}$, $x^m_{k,3}$ and $r_{k,1}$, $r_{k,2}$ are found by simulating the ORMs offline such that $y^m_{k,1}$, $y^m_{k,2}$ overlap the constrained domains of $x_{k,1} = y_{k,1}$, $x_{k,2} = y_{k,2}$. For a Q-NN of size 8–8–1, a C-NN of size 6–6–2, and $\gamma = 0.8$ found to ensure learning convergence, each iteration of the model-based BFQ trains both the Q-NN and the C-NN. For the current-iteration Q-NN estimate (indexed by $iter$), denoted $\hat{Q}_{iter}(x^{ER}_k, u_k)$, the input patterns are $\{[(x^{ER}_k)^T\ (u_k)^T]^T\}$ and the target patterns are $\{\Upsilon_{MR}(x^{ER}_k) + \gamma \min_{u \in U_d} \hat{Q}_{iter-1}(F(x^{ER}_k, u_k), u)\}$, evaluated for all the points in $X^E_d \times U_d$. For the current-iteration C-NN $C_{iter}(x^{ER}_k)$, the input patterns are $\{x^{ER}_k\}$ and the target patterns are $\{u_k = \arg\min_{u \in U_d} \hat{Q}_{iter}(x^{ER}_k, u)\}$. Note that the evaluation of $F(x^{ER}_k, u_k)$ to obtain $x^{ER}_{k+1}$ uses the original extended state vector, where $x^m_{k,2}$, $x^m_{k,4}$ are copies of the generated $x^m_{k,1}$, $x^m_{k,3}$, and a piecewise constant reference input generative model is used, with $r_{k+1,1} = r_{k,1}$, $r_{k+1,2} = r_{k,2}$. To keep the training computationally tractable and timely, only one third of the uniformly sampled data points from the training set are used, differently for each of the Q-NN and the C-NN. The weights of both NNs are initialized to uniform random numbers in $[-1.5, 1.5]$. A maximum of 200 epochs is used for training with Levenberg-Marquardt on 80% training data and 20% validation data. Early stopping is employed to prevent overfitting after six consecutive increases of the mean sum of squared errors on the validation data. Just 14 iterations (taking about 20 min) of this approximate model-based BFQ produce the control results in Figure 3 (in magenta), with $J^{1400}_{MR\_MBBFQ} = 0.25$, naturally the smallest and with the best ORM tracking performance, since it uses the process model.
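For clarity, one iteration of the model-based BFQ baseline described above can be sketched as follows; all helpers (Q_prev, F, stage_cost, fit_regressor) are placeholders for the gridded evaluation and NN regression steps, not the authors' actual implementation.

```python
import numpy as np

def bfq_iteration(Q_prev, X_grid, U_grid, F, stage_cost, fit_regressor, gamma=0.8):
    """One batch-fitted Q-learning iteration over gridded domains X_grid, U_grid.

    Q_prev([x; u])      : previous Q-NN estimate on a stacked state-action input
    F(x, u)             : extended-model transition (hence model-based)
    stage_cost(x)       : Upsilon_MR evaluated on the reduced extended state
    fit_regressor(X, y) : returns a trained callable (e.g., an NN regressor)
    """
    # Q-NN regression targets: Upsilon_MR + gamma * min_u' Q_prev(F(x, u), u')
    XU, q_targets = [], []
    for x in X_grid:
        for u in U_grid:
            x_next = F(x, u)
            q_next = min(Q_prev(np.concatenate([x_next, u2])) for u2 in U_grid)
            XU.append(np.concatenate([x, u]))
            q_targets.append(stage_cost(x) + gamma * q_next)
    Q_new = fit_regressor(np.array(XU), np.array(q_targets))

    # C-NN regression targets: greedy actions u = argmin_u Q_new([x; u])
    U_greedy = [min(U_grid, key=lambda u: Q_new(np.concatenate([x, u])))
                for x in X_grid]
    C_new = fit_regressor(np.array(list(X_grid)), np.array(U_greedy))
    return Q_new, C_new
```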

5.2. Statistical Investigations of the AAC Control Performance

Several thorough case studies are considered, in all of which the initial C-NN tuned by VRFT and the initial Q-NN are the same. The investigation concerns full vs. partial tuning of the Q-NN and C-NN weights, while measuring the effect of the probing noise on convergence. All statistics are measured on learning trials of at most 50 episodes. The minimal and average $J_{MR\_AAC}^{1400}$ values over 100 trials are measured, along with the success percentage of convergent learning trials. The average number of episodes until reaching the minimal $J_{MR\_AAC}^{1400}$ of a successful learning trial, together with the standard deviation of the number of episodes in a successful learning trial, are reported in Table 1.
Case 1. Under learning rates $\alpha_c = 0.01$ in (27) and $\alpha_a = 10^{-8}$ in (28), with full adaptation of the Q-NN and of the C-NN, the convergence of the learning process starting from the initial C-NN tuned by VRFT and the initial Q-NN is investigated. For a convergent learning trial with the constant learning rates being used, the best performance never dropped below $J_{MR\_AAC}^{1400} = 0.46$, which is inferior to the BFQ performance, suggesting that the proposed adaptive learning strategy is prone to getting stuck in local minima under the adaptive gradient-based update rules. In fact, BFQ is generally advertised as more data-efficient, although actor-critic learning architectures also allow alternative updates of the C-NN and Q-NN to improve data usage efficiency. The learning trials converge in about 88% of the cases, comparable with other perturbed AAC designs [55,56], given the wide operating range used for the controlled process.
Case 2. For full weight adaptation of both the Q-NN and the C-NN, without random perturbation of the control action, with $\alpha_c = 0.01$ in (27) and $\alpha_a = 10^{-7}$ in (28), the convergence rate drops to 53% and the best performance is $J_{MR\_AAC}^{1400} = 0.45$.
Case 3. In the case of output-weights-only tuning of both the Q-NN and the C-NN, with random perturbation of the control action, with $\alpha_c = 0.01$ in (27) and $\alpha_a = 10^{-6}$ in (28), a 100% convergence rate was observed, but the performance never dropped below $J_{MR\_AAC}^{1400} = 0.59$.
Case 4. With output-weights-only tuning of both the Q-NN and the C-NN, starting from the initial C-NN tuned by VRFT and the initial Q-NN, in the absence of random perturbation of the control action, with $\alpha_c = 0.01$ in (27) and $\alpha_a = 10^{-5}$ in (28), the AAC learning is 100% convergent over all trials, but the performance never drops below $J_{MR\_AAC}^{1400} = 0.50$. For $\alpha_a = 10^{-6}$, the average number of episodes per trial increases only to 11.
The above four case studies are statistically characterized in Table 1. In conclusion, full weight tuning of the Q-NN and C-NN offers better performance (smaller $J_{MR\_AAC}^{1400}$) than output-weights-only tuning, but it lowers the convergence success rate, being prone to getting stuck in local minima. Comparing Case 1 with Case 2, the probing noise significantly improves the convergence rate, slightly improves the average $J_{MR\_AAC}^{1400}$ and decreases the average number of episodes per convergent trial. Full weight adaptation is also more sensitive, since even small corrections in the input-to-hidden layer weights may lead to learning divergence.
Output-weights-only tuning is more robust, with a 100% success rate of convergence to an improved solution, but with inferior achievable performance. The probing noise in this case increases the average number of episodes per trial and worsens the best achievable performance. The guaranteed convergence to an improved solution corresponding to a local minimum in Cases 3 and 4 is also due to the good initial tuning offered by VRFT. Case 4, with output-weights-only tuning and without probing noise, offers the best compromise among convergence, performance and the number of episodes per trial (i.e., fewer transition samples until convergence).
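The difference between the two adaptation modes can be summarized by the following minimal sketch for a single-hidden-layer NN; the gradient expressions are generic and the function names are illustrative, i.e., this is not the paper's exact update laws (27) and (28).

```python
# Generic sketch of full vs. output-weights-only adaptation of a tanh NN
# trained by gradient descent on 0.5*||e||^2; names are illustrative only.
import numpy as np

def nn_forward(x, w_in, w_out):
    """Single hidden tanh layer followed by a linear output layer."""
    phi = np.tanh(w_in @ x)      # hidden feature vector
    return w_out @ phi, phi

def adapt(x, e, w_in, w_out, lr, full_tuning):
    """One gradient step; e is the output error whose squared norm is minimized."""
    _, phi = nn_forward(x, w_in, w_out)
    grad_out = np.outer(e, phi)                        # d(0.5*||e||^2)/dW_out
    if full_tuning:                                    # also adapt the hidden layer
        delta_hidden = (w_out.T @ e) * (1.0 - phi**2)  # backpropagated error
        w_in = w_in - lr * np.outer(delta_hidden, x)
    return w_in, w_out - lr * grad_out                 # output weights always adapt
```

Freezing w_in turns the hidden layer into a fixed feature basis, which matches the more robust but less flexible behavior observed in Cases 3 and 4.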
The initial Q-function learning for the NN-VRFT controller is not strictly necessary; learning convergence was also obtained without this step. For the selected critic learning rate, the weights converge fast enough. However, this step serves for tuning the critic learning rate and for initializing the feature matrix when output-weights-only adaptation is sought. This tuning step is achievable precisely because an initially stabilizing VRFT controller exists.
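For reference, a generic form of this critic pre-tuning step can be written with the temporal-difference error used in Appendix B; the plain gradient update shown below is a standard choice and is given only as an assumed, simplified stand-in for the actual critic update law (27):
$$\delta_k = U_{k-1} + (W_c^{k-1})^T\big(\gamma\,\Phi_c^k - \Phi_c^{k-1}\big), \qquad W_c^{k} = W_c^{k-1} - \alpha_c\,\delta_k\,\Phi_c^{k-1},$$
with the controller held fixed at the NN-VRFT solution while the Q-NN weights adapt along the collected closed-loop trajectory.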

5.3. Comments on the AAC Learning Performance

The data efficiency of AAC learning is clearly inferior to that of the BFQ strategy, a comparable model-free approach. The BFQ controller is learned from scratch, just from transition samples, whereas the proposed AAC controller starts from an NN-VRFT controller delivering an initial suboptimal ORM tracking solution. Both AAC and model-free BFQ are inferior to the model-based BFQ solution, which exploits knowledge of the process model.
AAC is a form of Action Dependent Heuristic Dynamic Programming, which is also less data-efficient than similar approaches such as Dual Heuristic Programming, where the learned co-state vector carries more information than the Q-function. On the other hand, AAC is less computationally demanding and requires less memory than Dual Heuristic Programming, BFQ and model-based BFQ, owing to its adaptive implementation. However, AAC becomes competitive when used together with VRFT, since the VRFT pre-tuning provides an initial controller close to the optimal one, which can then be fine-tuned by AAC. The initial NN-VRFT controller ensures stabilized exploration over a wide operating range for ORM tracking, which is equivalent to indirect feedback linearization. The combined VRFT-AAC design for ORM tracking is therefore attractive for practical data-driven applications [57,58].

6. Conclusions

A model-free design approach combining VRFT and AAC was successfully validated for learning improved nonlinear state-feedback control that tracks a linear ORM over a wide operating range. The learned controllers indirectly account for several nonlinearities, such as actuator saturation with dead-zone and output saturation, while also showing good decoupling abilities. The AAC design shares a similar conceptual framework with model-free techniques such as Q-learning, SARSA, VRFT, Iterative Feedback Tuning and model-free Iterative Learning Control, by exploiting only the structure of the process model and not its parameters. The convergence of the proposed adaptive learning strategy relies on several key aspects: efficient exploration correlated with the size of the training dataset and with the process complexity, the selected learning architecture, and the choice of approximators with appropriate parameterizations. In a wider context, VRFT shows significant potential for obtaining close-to-optimal, initially admissible controllers with respect to the ORM objective.
Future work targets the validation of the proposed tuning approach on other challenging nonlinear processes and its improvement using data-driven techniques.

Author Contributions

M.-B.R. developed the theoretical results, wrote the paper and performed the experiments; R.-E.P. revised the mathematical formulations, analyzed the algorithms and ensured the hardware and software support. All authors have read and approved the final paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Theorem 1.
Let $s(\hat{x}_k, \hat{u}_{k-1}) - s(\tilde{x}_k, \tilde{u}_{k-1}) = \hat{y}_k - \tilde{y}_k = \Delta y_k$, where $\hat{y}_k = s(\hat{x}_k, \hat{u}_{k-1} = C(\hat{\zeta}_{k-1}, \hat{\theta}))$. In VRFT, it is also valid that $\tilde{y}_k = s(\tilde{x}_k, \tilde{u}_{k-1}) = s^m(\tilde{x}^m_k, \tilde{r}_{k-1})$ is the output of both the process and of the ORM driven by $\tilde{r}_{k-1} = M^{-1}(\tilde{y}_k)$. By the mean value theorem, there is a $0 < \Gamma < 1$ making
$$\Delta y_k = \frac{\partial s(x_k^{\Gamma}, u_{k-1}^{\Gamma})}{\partial x_k}\,(\hat{x}_k - \tilde{x}_k) + \frac{\partial s(x_k^{\Gamma}, u_{k-1}^{\Gamma})}{\partial u_{k-1}}\,(\hat{u}_{k-1} - \tilde{u}_{k-1}), \tag{A1}$$
leading to
$$\|\Delta y_k\| \le B_{sx}\|\Delta x_k\| + B_{su}\|\Delta u_{k-1}\|, \tag{A2}$$
where $x_k^{\Gamma} = \Gamma\hat{x}_k + (1-\Gamma)\tilde{x}_k$, $u_{k-1}^{\Gamma} = \Gamma\hat{u}_{k-1} + (1-\Gamma)\tilde{u}_{k-1}$, and $\Delta x_k = \hat{x}_k - \tilde{x}_k$, $\Delta u_{k-1} = \hat{u}_{k-1} - \tilde{u}_{k-1}$.
Observe that (8) implies $\|\tilde{u}_k - C(\tilde{x}_k^E, \hat{\theta})\| < \varepsilon$, $k = \overline{0, N-1}$. But $\Delta u_k = C(\hat{x}_k^E, \hat{\theta}) - \tilde{u}_k = C(\hat{x}_k^E, \hat{\theta}) - C(\tilde{x}_k^E, \hat{\theta}) + C(\tilde{x}_k^E, \hat{\theta}) - \tilde{u}_k$ and, by the MVT, there is a $0 < \Gamma < 1$ such that
$$\Delta u_k = \frac{\partial C(x_k^{E\Gamma}, \hat{\theta})}{\partial x_k^E}\,(\hat{x}_k^E - \tilde{x}_k^E) + C(\tilde{x}_k^E, \hat{\theta}) - \tilde{u}_k, \tag{A3}$$
with $x_k^{E\Gamma} = \Gamma\hat{x}_k^E + (1-\Gamma)\tilde{x}_k^E$, leading to
$$\|\Delta u_k\| \le B_{cx}\|\Delta x_k^E\| + \varepsilon, \tag{A4}$$
for $\Delta x_k^E = \hat{x}_k^E - \tilde{x}_k^E$. But $\Delta x_k^E = \hat{x}_k^E - \tilde{x}_k^E = [\,(\hat{x}_k - \tilde{x}_k)^T \ 0^T \ 0^T\,]^T$, resulting in $\|\Delta x_k^E\| = \|\Delta x_k\|$, which transforms (A4) into
$$\|\Delta u_k\| \le B_{cx}\|\Delta x_k\| + \varepsilon. \tag{A5}$$
Next, by the mean value theorem, there is a 0 < Γ < 1 ensuring that
$$\Delta x_k = \hat{x}_k - \tilde{x}_k = (s^x)^{-1}(\hat{y}_k) - (s^x)^{-1}(\tilde{y}_k) = \frac{\partial (s^x)^{-1}(y_k^{\Gamma})}{\partial y_k}\,(\hat{y}_k - \tilde{y}_k), \quad y_k^{\Gamma} = \Gamma\hat{y}_k + (1-\Gamma)\tilde{y}_k. \tag{A6}$$
It then follows that
$$\|\Delta x_k\| \le B_{sy}\|\Delta y_k\|. \tag{A7}$$
Using (A5) and (A7) in (A2) it results
Δ y k B s x B s y Δ y k + B s u ( B c x Δ x k 1 + ε ) B s x B s y Δ y k + B s u B c x B s y Δ y k 1 + B s u ε ,
which is equivalent to
$$\|\Delta y_k\| \le \underbrace{\frac{B_{su}B_{cx}B_{sy}}{1 - B_{sx}B_{sy}}}_{B_1}\,\|\Delta y_{k-1}\| + \underbrace{\frac{B_{su}}{1 - B_{sx}B_{sy}}}_{B_2}\,\varepsilon. \tag{A9}$$
One can write that
$$J_{MR}^N(\hat{\theta}) = \sum_{k=1}^{N}\|\Delta y_k\|^2 \le \Big(B_2\sum_{k=1}^{N}\sum_{j=0}^{k-1}B_1^j\Big)^2\varepsilon^2 = B\,\varepsilon^2, \tag{A10}$$
which is the conclusion (10), and the proof of Theorem 1 is completed.

Appendix B

Proof of Theorem 3.
Let $\delta_k = U_{k-1} + (W_c^{k-1})^T(\gamma\Phi_c^k - \Phi_c^{k-1})$ be expressed further as $\delta_k = \underbrace{U_{k-1} + (W_c^{*})^T(\gamma\Phi_c^k - \Phi_c^{k-1})}_{E_1} + (\tilde{W}_c^{k-1})^T(\gamma\Phi_c^k - \Phi_c^{k-1})$. Letting $\Phi_c^k = \Phi_c^{k-1} + \Delta\Phi_c^{k-1}$ makes $\delta_k = E_1 + (\gamma - 1)\zeta_c^{k-1} + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}$.
Define the Lyapunov function
$$L_{k-1} = \frac{1}{\alpha_c}\operatorname{tr}\{(\tilde{W}_c^{k-1})^T\tilde{W}_c^{k-1}\} + \frac{1}{\alpha_a}\operatorname{tr}\{(\tilde{W}_a^{k-1})^T\tilde{W}_a^{k-1}\} = L_1 + L_2 \tag{A11}$$
(with $\operatorname{tr}\{\cdot\}$ denoting the matrix trace operator), leading to the first-order differences of $L_1$ and $L_2$
$$\Delta L_1 = \frac{1}{\alpha_c}\big[\operatorname{tr}\{(\tilde{W}_c^{k})^T\tilde{W}_c^{k}\} - \operatorname{tr}\{(\tilde{W}_c^{k-1})^T\tilde{W}_c^{k-1}\}\big], \quad \Delta L_2 = \frac{1}{\alpha_a}\big[\operatorname{tr}\{(\tilde{W}_a^{k})^T\tilde{W}_a^{k}\} - \operatorname{tr}\{(\tilde{W}_a^{k-1})^T\tilde{W}_a^{k-1}\}\big]. \tag{A12}$$
Using estimation error dynamics (32), Δ L 1 , Δ L 2 are refined further. Let
$$\begin{aligned}
\Delta L_1 &= \frac{1}{\alpha_c}\Big[\operatorname{tr}\big\{2\alpha_c^2\delta_k(\Phi_c^{k-1})^T\tilde{W}_c^{k-1}\big\} + \alpha_c^2\delta_k^2(\Phi_c^{k-1})^T\Phi_c^{k-1}\Big]\\
&= 2\alpha_c\big(E_1 + (\gamma-1)\zeta_c^{k-1} + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)\zeta_c^{k-1} + \alpha_c\big(E_1 + (\gamma-1)\zeta_c^{k-1} + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)^2\|\Phi_c^{k-1}\|^2\\
&\le 2\alpha_c(\gamma-1)(\zeta_c^{k-1})^2 + 2\alpha_c\big(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)\zeta_c^{k-1} + \alpha_c\bar{\varphi}_c\Big(2(\gamma-1)^2(\zeta_c^{k-1})^2 + 2\big(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)^2\Big)\\
&= 2\alpha_c(\gamma-1)(\zeta_c^{k-1})^2 - \big(\zeta_c^{k-1} - \alpha_c(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1})\big)^2 + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2(\zeta_c^{k-1})^2 + 2\alpha_c\bar{\varphi}_c\big(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)^2 + (\zeta_c^{k-1})^2 + \alpha_c^2\big(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)^2\\
&\le \underbrace{\big[1 + 2\alpha_c(\gamma-1) + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2\big](\zeta_c^{k-1})^2}_{E_2} + \underbrace{\big(2\alpha_c\bar{\varphi}_c + \alpha_c^2\big)\big(E_1 + \gamma(\tilde{W}_c^{k-1})^T\Delta\Phi_c^{k-1}\big)^2}_{E_3}.
\end{aligned} \tag{A13}$$
One can show that the coefficient of $(\zeta_c^{k-1})^2$ in the first term is a function of $\gamma$, i.e., $\tilde{f}(\gamma) = 1 + 2\alpha_c(\gamma-1) + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2$, which has two real roots for $\alpha_c > 2\bar{\varphi}_c$. Let these roots be $\gamma_1 < \gamma_2$. By ensuring $\gamma_1 < 0$ and $\gamma_2 > 1$, then for any $\gamma$ such that $\gamma_1 < 0 < \gamma < 1 < \gamma_2$ it follows that $\tilde{f}(\gamma) < 0$. A sufficient condition to ensure $\tilde{f}(\gamma) < 0$ for all $0 < \gamma < 1$ is to select $\alpha_c > \max\{2\bar{\varphi}_c, 4\bar{\varphi}_c^2 - 2\bar{\varphi}_c + 1\} = 4\bar{\varphi}_c^2 - 2\bar{\varphi}_c + 1$ for all $\bar{\varphi}_c > 1$.
Note further that the term E3 in (A13) is positive. Let Δ L 2 be further expressed as
$$\begin{aligned}
\Delta L_2 &= \frac{1}{\alpha_a}\Big[\operatorname{tr}\{(\tilde{W}_a^{k})^T\tilde{W}_a^{k}\} - \operatorname{tr}\{(\tilde{W}_a^{k-1})^T\tilde{W}_a^{k-1}\}\Big]
= \frac{1}{\alpha_a}\Big(\big\|\tilde{W}_a^{k-1} - \alpha_a\Phi_a^{k-1}(W_c^{k-1})^T\Phi_{c,u}^{k-1}\big\|^2 - \big\|\tilde{W}_a^{k-1}\big\|^2\Big)\\
&\le \frac{1}{\alpha_a}\Big(\big(\|\tilde{W}_a^{k-1}\| + \alpha_a\|\Phi_a^{k-1}(W_c^{k-1})^T\Phi_{c,u}^{k-1}\|\big)^2 - \|\tilde{W}_a^{k-1}\|^2\Big)\\
&= 2\|\tilde{W}_a^{k-1}\|\,\|\Phi_a^{k-1}(W_c^{k-1})^T\Phi_{c,u}^{k-1}\| + \alpha_a\|\Phi_a^{k-1}(W_c^{k-1})^T\Phi_{c,u}^{k-1}\|^2\\
&\le 2\bar{\varphi}_a\bar{\varphi}_{c,u}\|\tilde{W}_a^{k-1}\|\,\|W_c^{k-1}\| + \alpha_a\bar{\varphi}_a^2\bar{\varphi}_{c,u}^2\|W_c^{k-1}\|^2 = E_4 > 0.
\end{aligned} \tag{A14}$$
Eventually,
$$\Delta L = \Delta L_1 + \Delta L_2 \le \underbrace{\big[1 + 2\alpha_c(\gamma-1) + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2\big](\zeta_c^{k-1})^2}_{E_2} + E_3 + E_4, \tag{A15}$$
can be shown negative if
$$\big[1 + 2\alpha_c(\gamma-1) + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2\big](\zeta_c^{k-1})^2 + E_3 + E_4 < 0 \tag{A16}$$
holds. Considering that the first term is negative for $\alpha_c > 4\bar{\varphi}_c^2 - 2\bar{\varphi}_c + 1$, while $E_3 > 0$ and $E_4 > 0$, it suffices to have
$$(\zeta_c^{k-1})^2 > \frac{E_3 + E_4}{\big|1 + 2\alpha_c(\gamma-1) + 2\alpha_c\bar{\varphi}_c(\gamma-1)^2\big|} \tag{A17}$$
in order to make $\Delta L$ negative definite. Then, by the Lyapunov extension theorem [59], the AAC learning is stable and the NN estimation errors are uniformly ultimately bounded, concluding the proof.

References

  1. Hou, Z.-S.; Wang, Z. From model-based control to data-driven control: Survey, classification and perspective. Inf. Sci. 2013, 235, 3–35. [Google Scholar] [CrossRef]
  2. Fliess, M.; Join, C. Model-free control. Int. J. Control 2013, 86, 2228–2252. [Google Scholar] [CrossRef]
  3. Hou, Z.-S.; Jin, S. Data-driven model-free adaptive control for a class of MIMO nonlinear discrete-time systems. IEEE Trans. Neural Netw. 2011, 22, 2173–2188. [Google Scholar] [PubMed]
  4. Campi, M.C.; Lecchini, A.; Savaresi, S.M. Virtual reference feedback tuning: A direct method for the design of feedback controllers. Automatica 2002, 38, 1337–1346. [Google Scholar] [CrossRef]
  5. Hjalmarsson, H. Iterative feedback tuning—An overview. Int. J. Adapt. Control Signal Process. 2002, 16, 373–395. [Google Scholar] [CrossRef]
  6. Spall, J.C.; Cristion, J.A. Model-free control of nonlinear stochastic systems with discrete-time measurements. IEEE Trans. Autom. Control 1998, 43, 1198–1210. [Google Scholar] [CrossRef]
  7. Butcher, M.; Karimi, A.; Longchamp, R. Iterative learning control based on stochastic approximation. In Proceedings of the 17th IFAC World Congress, Seoul, Korea, 6–11 July 2008; pp. 1478–1483. [Google Scholar]
  8. Radac, M.-B.; Precup, R.-E.; Petriu, E.M. Optimal behaviour prediction using a primitive-based data-driven model-free iterative learning control approach. Comp. Ind. 2015, 74, 95–109. [Google Scholar] [CrossRef]
  9. Li, Y.; Hou, Z.; Feng, Y.; Chi, R. Data-driven approximate value iteration with optimality error bound analysis. Automatica 2017, 78, 79–87. [Google Scholar] [CrossRef]
  10. Radac, M.-B.; Precup, R.-E.; Petriu, E.M.; Preitl, S. Iterative data-driven tuning of controllers for nonlinear systems with constraints. IEEE Trans. Ind. Electron. 2014, 61, 6360–6368. [Google Scholar] [CrossRef]
  11. Pang, Z.-H.; Liu, G.-P.; Zhou, D.; Sun, D. Data-based predictive control for networked nonlinear systems with network-induced delay and packet dropout. IEEE Trans. Ind. Electron. 2016, 63, 1249–1257. [Google Scholar] [CrossRef]
  12. Radac, M.-B.; Precup, R.-E. Data-driven model-free slip control of anti-lock braking systems using reinforcement Q-learning. Neurocomputing 2018, 275, 317–329. [Google Scholar] [CrossRef]
  13. Qiu, J.; Wang, T.; Yin, S.; Gao, H. Data-Based Optimal Control for Networked Double-Layer Industrial Processes. IEEE Trans. Ind. Electron. 2017, 64, 4179–4186. [Google Scholar] [CrossRef]
  14. Chi, R.; Hou, Z.-S.; Jin, S.; Huang, B. An improved data-driven point-to-point ILC using additional on-line control inputs with experimental verification. IEEE Trans. Syst. Man Cybern. Syst. 2017, 49, 687–696. [Google Scholar] [CrossRef]
  15. Liu, D.; Yang, G.-H. Model-free adaptive control design for nonlinear discrete-time processes with reinforcement learning techniques. Int. J. Syst. Sci. 2018, 49, 2298–2308. [Google Scholar] [CrossRef]
  16. Chi, R.; Huang, B.; Hou, Z.; Jin, S. Data-driven high-order terminal iterative learning control with a faster convergence speed. Int. J. Robust Nonlinear Control 2018, 28, 103–119. [Google Scholar] [CrossRef]
  17. Jeng, J.-C.; Ge, G.-P. Data-based approach for feedback–feedforward controller design using closed-loop plant data. ISA Trans. 2018, 80, 244–256. [Google Scholar] [CrossRef] [PubMed]
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  19. Werbos, P.J. Approximate dynamic programming for real-time control and neural modeling. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches; White, D.A., Sofge, D.A., Eds.; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493–525. [Google Scholar]
  20. Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996. [Google Scholar]
  21. Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1991, 8, 279–292. [Google Scholar] [CrossRef]
  22. Wang, F.-Y.; Zhang, H.; Liu, D. Adaptive dynamic programming: An introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47. [Google Scholar] [CrossRef]
  23. Lewis, F.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag. 2012, 32, 76–105. [Google Scholar]
  24. Prokhorov, D.V.; Wunsch, D.C. Adaptive critic designs. IEEE Trans. Neural Netw. 1997, 8, 997–1007. [Google Scholar] [CrossRef]
  25. Wei, Q.; Lewis, F.; Liu, D.; Song, R.; Lin, H. Discrete-time local value iteration adaptive dynamic programming: Convergence analysis. IEEE Trans. Syst. Man Cybern. Syst. 2018, 48, 875–891. [Google Scholar] [CrossRef]
  26. Heydari, A. Revisiting approximate dynamic programming and its convergence. IEEE Trans. Cybern. 2014, 44, 2733–2743. [Google Scholar] [CrossRef] [PubMed]
  27. Mu, C.; Ni, Z.; Sun, C.; He, H. Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems. IEEE Trans. Cybern. 2017, 47, 1460–1470. [Google Scholar] [CrossRef]
  28. Venayagamoorthy, G.K.; Harley, R.G.; Wunsch, D.C. Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Trans. Neural Netw. 2002, 13, 764–773. [Google Scholar] [CrossRef]
  29. Ni, Z.; He, H.; Zhong, Z.; Prokhorov, D.V. Model-free dual heuristic dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 1834–1839. [Google Scholar] [CrossRef] [PubMed]
  30. Heydari, A. Optimal triggering of networked control systems. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3011–3021. [Google Scholar] [CrossRef] [PubMed]
  31. Zhang, K.; Zhang, H.; Gao, Z.; Su, H. Online adaptive policy iteration based fault-tolerant control algorithm for continuous-time nonlinear tracking systems with actuator failures. J. Frankl. Inst. 2018, 355, 6947–6968. [Google Scholar] [CrossRef]
  32. Mnih, V.; Kavukcouglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  33. Lewis, F.L.; Vamvoudakis, K.G. Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man Cybern. B Cybern. 2011, 41, 14–25. [Google Scholar] [CrossRef]
  34. Wang, Z.; Liu, D. Data-based controllability and observability analysis of linear discrete-time systems. IEEE Trans. Neural Netw. 2011, 22, 2388–2392. [Google Scholar] [CrossRef]
  35. Ruelens, F.; Claessens, B.J.; Vandael, S.; de Schutter, B.; Babuška, R.; Belmans, R. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Trans. Smart Grid 2017, 8, 2149–2159. [Google Scholar] [CrossRef]
  36. Liu, D.; Javaherian, H.; Kovalenko, O.; Huang, T. Adaptive critic learning techniques for engine torque and air-fuel ratio control. IEEE Trans. Syst. Man Cybern. B Cybern. 2008, 38, 988–993. [Google Scholar]
  37. Radac, M.-B.; Precup, R.-E.; Petriu, E.M. Model-free primitive-based iterative learning control approach to trajectory tracking of MIMO systems with experimental validation. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2925–2938. [Google Scholar] [CrossRef]
  38. Campestrini, L.; Eckhard, D.; Bazanella, A.S.; Gevers, M. Data-driven model reference control design by prediction error identification. J. Frankl. Inst. 2017, 354, 2628–2647. [Google Scholar] [CrossRef]
  39. Campestrini, L.; Eckhard, D.; Gevers, M.; Bazanella, A. Virtual reference feedback tuning for non-minimum phase plants. Automatica 2011, 47, 1778–1784. [Google Scholar] [CrossRef]
  40. Formentin, S.; Savaresi, S.M.; Del Re, L. Non-iterative direct data-driven controller tuning for multivariable systems: Theory and application. IET Control Theory Appl. 2012, 6, 1250–1257. [Google Scholar] [CrossRef]
  41. Yan, P.; Liu, D.; Wang, D.; Ma, H. Data-driven controller design for general MIMO nonlinear systems via virtual reference feedback tuning and neural networks. Neurocomputing 2016, 171, 815–825. [Google Scholar] [CrossRef]
  42. Campi, M.C.; Savaresi, S.M. Direct nonlinear control design: The virtual reference feedback tuning (VRFT) approach. IEEE Trans. Autom. Control 2006, 51, 14–27. [Google Scholar] [CrossRef]
  43. Esparza, A.; Sala, A.; Albertos, P. Neural networks in virtual reference tuning. Eng. Appl. Artif. Intell. 2011, 24, 983–995. [Google Scholar] [CrossRef]
  44. Radac, M.-B.; Precup, R.-E. Three-level hierarchical model-free learning approach to trajectory tracking control. Eng. Appl. Artif. Intell. 2016, 55, 103–118. [Google Scholar] [CrossRef]
  45. Radac, M.-B.; Precup, R.-E.; Roman, R.-C. Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. Int. J. Syst. Sci. 2017, 48, 1071–1083. [Google Scholar] [CrossRef]
  46. Busoniu, L.; Ernst, D.; de Schutter, B.; Babuska, R. Approximate reinforcement learning: An overview. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Paris, France, 11–15 April 2011; pp. 1–8. [Google Scholar]
  47. Inteco Ltd. Multitank System, User’s Manual; Inteco Ltd.: Krakow, Poland, 2007. [Google Scholar]
  48. Hagan, M.T.; Menhaj, M.B. Training feed-forward networks with the Marquardt algorithm. IEEE Trans. Neural Netw. 1994, 5, 989–993. [Google Scholar] [CrossRef]
  49. Liu, Q.; Liu, J.; Sang, R.; Li, J.; Zhang, T.; Zhang, Q. Fast neural network training on FPGA using quasi-Newton optimisation method. IEEE Trans. Very Large Scale Integr. Syst. 2018, 26, 1575–1579. [Google Scholar] [CrossRef]
  50. Livieris, I.E. Improving the classification efficiency of an ANN utilizing a new training methodology. Informatics 2019, 6, 1. [Google Scholar] [CrossRef]
  51. Møller, M.F. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 1993, 6, 525–533. [Google Scholar] [CrossRef]
  52. Livieris, I.E.; Pintelas, P. A new conjugate gradient algorithm for training neural networks based on a modified secant equation. Appl. Math. Comput. 2013, 221, 491–502. [Google Scholar] [CrossRef]
  53. Radac, M.-B.; Precup, R.-E. Data-driven MIMO model-free reference tracking control with nonlinear state-feedback and fractional order controllers. Appl. Soft Comput. 2018, 73, 992–1003. [Google Scholar] [CrossRef]
  54. Radac, M.-B.; Precup, R.-E.; Roman, R.-C. Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-Learning. ISA Trans. 2018, 73, 227–238. [Google Scholar] [CrossRef] [PubMed]
  55. He, H.; Ni, Z.; Fu, J. A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 2012, 78, 3–13. [Google Scholar] [CrossRef]
  56. Zhao, D.; Wang, B.; Liu, D. A supervised actor-critic approach for adaptive cruise control. Soft Comput. 2013, 17, 2089–2099. [Google Scholar] [CrossRef]
  57. Radac, M.-B.; Precup, R.-E.; Petriu, E.M.; Preitl, S.; Dragos, C.-A. Data-driven reference trajectory tracking algorithm and experimental validation. IEEE Trans. Ind. Inf. 2013, 9, 2327–2336. [Google Scholar] [CrossRef]
  58. Radac, M.-B.; Precup, R.-E. Data-based two-degree-of-freedom iterative control approach to constrained non-linear systems. IET Control Theory Appl. 2015, 9, 1000–1010. [Google Scholar] [CrossRef]
  59. Yang, Q.; Jagannathan, S. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Trans. Syst. Man Cybern. B Cybern. 2012, 42, 377–390. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) control system with the VRFT controller; (b) control system with the VRFT NN controller further tuned by AAC design in a deep learning architecture; (c) the vertical tank system.
Figure 2. VRFT initial experiment data collection. Red line in (b) is the dead-zone threshold for u 2 .
Figure 3. The initial VRFT controller (black dotted), the final adaptively learned controller (black solid), the BFQ controller (blue), the model-based BFQ controller (magenta) and the ORM outputs (red).
Figure 4. Q-NN output weights in Q-function learning of VRFT control.
Figure 5. Typical training and learning episode. The perturbed control (black), the reference inputs to the CS (green) and the RM outputs (red).
Figure 6. Evolution of $J_{MR\_AAC}^{1400}$ with each episode for a typical trial of 30 episodes.
Table 1. AAC tuning statistics over maximum 50 episodes per trial.

| Full Tuning | Perturbed | Avg. $J_{MR\_AAC}^{1400}$ | Min. $J_{MR\_AAC}^{1400}$ | Success Rate | Avg. Episodes | Std. Episodes |
| yes | yes | 0.48 | 0.46 | 88% | 25 | 4.9 |
| yes | no | 0.76 | 0.45 | 53% | 43 | 6.3 |
| no | yes | 0.64 | 0.59 | 100% | 16 | 3.2 |
| no | no | 0.526 | 0.50 | 100% | 10 | 2.1 |
