Data-Driven Model-Free Tracking Reinforcement Learning Control with VRFT-based Adaptive Actor-Critic

This paper proposes a neural network (NN)-based control scheme in an Adaptive Actor-Critic (AAC) learning framework designed for output reference model tracking, as a representative deep-learning application. The control learning scheme is model-free with respect to the process model. AAC designs usually require an initial controller to start the learning process; however, systematic guidelines for choosing the initial controller are not offered in the literature, especially in a model-free manner. Virtual Reference Feedback Tuning (VRFT) is proposed for obtaining an initially stabilizing NN nonlinear state-feedback controller, designed from input-state-output data collected from the process in an open-loop setting. The solution offers systematic design guidelines for initial controller design. The resulting suboptimal state-feedback controller is next improved under the AAC learning framework by online adaptation of a critic NN and a controller NN. The mixed VRFT-AAC approach is validated on a multi-input multi-output nonlinear constrained coupled vertical two-tank system. Discussions on the control system behavior are offered together with comparisons with similar approaches.


Introduction
Data-driven or data-based control techniques rely on data collected from the process in order to learn and tune controllers that prevent control performance degradation due to mismatch between the true process and its model, the main issue with model-based control design approaches [1]. The data-driven controller learning objective can be achieved either by using highly adaptive simplified phenomenological models [2,3], or by using no model at all, except for common structural assumptions about the true process such as linearity or nonlinearity. The latter approach can be considered a truly model-free one, with several representative techniques having first emerged from classical control theory, such as: Virtual Reference Feedback Tuning (VRFT) [4], Iterative Feedback Tuning [5], Simultaneous Perturbation Stochastic Approximation [6], and Model-free Iterative Learning Control [7,8]. Most of the above approaches rely on instruments specific to optimal control, with several recent applications [9][10][11][12][13][14][15][16][17].
Reinforcement learning (RL) [18] is a powerful data-driven technique that solves optimal control problems with parallel developments in the machine learning and control systems communities in which RL is better known as Adaptive (Approximate) Dynamic Programming (ADP) [19] or neuro-dynamic programming [20]. Reinforcement Q-learning [21] with function approximators (FAs) is a particular version of Action Dependent Heuristic Dynamic Programming implemented without a process model [22,23], which is only one of the several types of adaptive actor-critic (AAC) ADP designs [24][25][26][27], besides Heuristic Dynamic Programming [28], Dual Heuristic Programming [29] and all of their action-dependent versions.
For learning high-performance control, Action Dependent Heuristic Dynamic Programming (a form of continuous input-state space Q-learning) uses the Q-function as an extension of the cost (value) function and only needs to efficiently explore the input-state space of the unknown process, hence the model-free data-driven label is justified. The class of model-free AAC designs used with FAs is more attractive than the majority of model-based AAC designs, where a partially known nonlinear input-affine state-space representation is at least necessary [22,23]. The main disadvantages of the Action Dependent Heuristic Dynamic Programming schemes are that many transition samples are needed from the process (since the Q-function estimation is more informative, it must explore the action space in addition to the state space) and the lack of convergence guarantees in the absence of a process model, when generic FAs are used. Data-driven RL/ADP formulated in terms of control systems theory has also offered recent results regarding different applications and stability and learning convergence, in both model-free and model-based settings [30,31].
In output reference model (ORM) tracking control, the output of the controlled process should track a reference model's output, regarded as a frequently changing time-varying learning goal. This control objective can also be formulated in an optimal control setup. An initial stabilizing state-feedback controller that achieves suboptimal ORM tracking control is highly desirable in practice since it could accelerate the learning process. In fact, most AAC learning control architectures start the controller learning with respect to some objective using an initial controller, but lack systematic guidelines for obtaining such an initial controller.
VRFT is one solution to design data-driven model-free feedback controllers, commonly using input-output data. Its linear time-invariant framework typically needs far fewer samples than model-free AAC designs to obtain an initial controller. Unfortunately, a linear controller cannot ensure good ORM tracking for nonlinear processes acting in wide operating ranges. Since AAC should essentially learn a nonlinear state-feedback controller, it is of interest to obtain such an initial (possibly suboptimal) controller, and this will be shown possible using the VRFT design and tuning framework. This is significant since model-free AAC approaches are data-hungry in practice, and any initial suboptimal solution shortens the convergence time. Under this motivation, the combination of VRFT and AAC is used to achieve ORM tracking control. The resulting AAC design consists of two neural networks (NNs), one for the controller called the actor NN and one for the cost function approximation called the critic NN. The correction signals during the adaptive learning are backpropagated through the larger NN resulting from cascading the actor and the critic NNs, hence the AAC architecture belongs to the deep reinforcement learning approaches from the literature [32].
The mixed VRFT-AAC approach developed in this paper is applied to a real-world Multi-Input Multi-Output (MIMO) nonlinear coupled constrained laboratory vertical two-tank system for water level control. The proposed approach is novel with respect to the state-of-the-art since:
• it introduces an original nonlinear state-feedback neural network-based controller for ORM tracking, tuned with VRFT, serving as the initialization for the AAC learning controller that further improves the ORM tracking and accelerates convergence to the optimal controller; this leads to the novel VRFT-AAC combination;
• the case study demonstrates the implementation of the novel mixed control learning approach for ORM tracking; the MIMO validation scenario also demonstrates the good decoupling ability of the learned controllers, even under constraints and nonlinearities; comparisons with a model-free batch-fitted Q-learning scheme and with a model-based batch-fitted Q-learning approach are also offered, together with statistical characterization case studies in different learning settings;
• a theoretical analysis ensures that the AAC learning scheme preserves the control system (CS) stability throughout the updates and converges to the optimal controller.
The paper is organized as follows: the next section formulates the ORM tracking control problem in an optimal control framework and offers a way to solve it using VRFT (Section 3) and AAC design (Section 4). Validation case study, useful implementation details, comparisons with similar control learning techniques, thorough investigations and discussions of the observed results, are all presented in Section 5. The concluding remarks are highlighted in Section 6.

Output Model Reference Control for Unknown Systems
Let the discrete-time nonlinear unknown open-loop minimum-phase state-space deterministic strictly causal process be P: x_{k+1} = f(x_k, u_k), y_k = g(x_k), where k indexes the discrete time, x_k = [x_{k,1} . . . x_{k,n}]^T ∈ X ⊂ R^n is the n-dimensional state vector (superscript T denotes matrix transpose), u_k = [u_{k,1}, . . . , u_{k,m}]^T ∈ U ⊂ R^m is the control input signal, y_k = [y_{k,1}, . . . , y_{k,p}]^T ∈ Y ⊂ R^p is the measurable controlled output, f : X × U → X is an unknown nonlinear system function, g : X → Y is an unknown nonlinear output function of the states, and the initial conditions are not considered for analysis at this point. It is further assumed that the definition domains X, U, Y are compact and convex. The following assumptions, common to the data-driven problem formulation [1], are: A1: System (1) is controllable and fully state observable. A2: System (1) is internally stable on X × U.
Assumptions A1 and A2 are common in the data-driven control literature and difficult to assess when the process model is unknown. They may be supported by experience with the process operation or by the literature. If no knowledge exists whatsoever, control can be attempted within constrained domains related to the minimum safe operating conditions of the process, which is the minimum required information on the process variables. Internal stability is sufficient for output-feedback control design and necessary for state-feedback control design using input-state samples.
Concerning the controllability and full state observability assumption A1 imposed on the process, if the observability cannot be verified analytically, data-driven observers can be built using past samples of the inputs and outputs and/or of the partially measurable state, as shown for linear systems in [33,34] and used for nonlinear systems in [35]. State measurement requires more insight into the process than several pure input-output representations.
Equation (1) is a general form for most controlled processes in practice and it is not restrictive. In this form, it obeys the definition of a deterministic Markov decision process.
The discrete-time known open-loop stable minimum-phase state-space deterministic strictly causal ORM is ORM: x^m_{k+1} = f^m(x^m_k, r_k), y^m_k = g^m(x^m_k), where f^m, g^m are known nonlinear maps. Initial conditions are zero unless stated otherwise. Note that r_k, y_k, y^m_k have the same size p for square feedback control systems. If the ORM (2) is linear time-invariant in particular, it is always possible to express the ORM as an input-output linear time-invariant transfer matrix y^m_k = M(z) r_k, where M(z) is an asymptotically stable unit-gain (i.e., M(1) = I, where I is the identity matrix) rational transfer matrix and r_k is the reference input that drives both the feedback control system and the ORM. To extend the process (1) with the ORM (2), we consider the reference input r_k as a set of measurable exogenous signals that evolve according to r_{k+1} = h^m(r_k), with unknown h^m but measurable r_k. A piecewise-constant r_k can be modeled, for example, as r_{k+1} = r_k, and this model is used throughout this paper. Then the extended state-space model (3), with its output equations, follows. The ORM tracking control problem is formulated in an optimal control framework. Let the infinite-horizon cost function (c.f.) to be minimized starting with x^E_i be [36] J(x^E_i, U_{i,∞}) = Σ_{k=i}^{∞} γ^{k−i} V(x^E_k, u_k), where i indexes the starting time for x^E_i, the discount factor 0 < γ ≤ 1 ensures the convergence of J(x^E_i, U_{i,∞}) [23] and sets the controller's (or interacting agent's) horizon, and the stage cost V > 0 depends on x^E_k and u_k and captures the distance relative to some pre-specified learning goal (target), usually constant in many applications. The unknown control inputs u_i, u_{i+1}, . . . should minimize J(x^E_i, U_{i,∞}). A control sequence (or a controller) rendering a finite c.f. is called admissible.
ORM tracking control requires that the undisturbed process output y_k (also the control system output) tracks the ORM's output y^m_k in Equation (4). Since the measurable y_k depends via the unknown g(·) on x_k, but not on x_{k+1}, we introduce the discounted infinite-horizon model reference tracking c.f.
where ε_k(x^E_k, θ) = y_k − y^m_k is the model reference tracking error vector and θ ∈ R^{n_θ} is a parameterization of a nonlinear feedback admissible controller [23] defined as u_k := C(x^E_k, θ), which, used in Equation (5), reflects the influence of θ on all system trajectory outcomes. This controller, coupled with Equation (3), ensures that the output of Equation (1) tracks the ORM's output. J^∞_MR in (5) also serves as the value function of using the controller C. For finite J^∞_MR when γ = 1, ε_k must be a square-summable sequence, which can be obtained with an asymptotically stabilizing controller ensuring lim_{k→∞} ε_k = 0. In the general case when γ < 1, J^∞_MR will be finite with any stabilizing controller that renders ε_k upper bounded. Herein, an admissible controller for Equations (4) and (5) means a controller that ensures a finite c.f. J^∞_MR. A nonlinear reference model M could have been used for tracking purposes as well; however, imposing a linear time-invariant one on the feedback control system ensures indirect feedback linearization of the controlled process. It is extremely beneficial to work with linearized feedback control systems because their behavior generalizes well over wide operating ranges [37]. The ORM tracking problem concerns the control system behavior from the reference input to the controlled output, neglecting potential load disturbances [38]. Extension of the proposed theory to nonlinear ORMs is not difficult. Under classical control rules, the process's delay and non-minimum-phase character should be included in M. However, non-minimum-phase zeros make M non-invertible, in addition to requiring their knowledge via identification [38], affecting the subsequent VRFT design; this motivates the minimum-phase assumption on the process.
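As a concrete reading of the tracking c.f. above, a finite-horizon, discounted evaluation of the tracking error can be sketched as follows; the quadratic stage cost on ε_k and the helper itself are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def tracking_cost(y, y_m, gamma=0.95):
    """Finite-horizon approximation of J_MR: sum_k gamma^k * ||y_k - y_m_k||^2.

    y, y_m: arrays of shape (N,) or (N, p) holding the controlled output and
    the ORM output over N samples; gamma in (0, 1] is the discount factor.
    """
    eps = np.asarray(y, dtype=float) - np.asarray(y_m, dtype=float)
    eps = eps.reshape(len(eps), -1)  # tracking error vectors eps_k, shape (N, p)
    return float(sum(gamma**k * float(e @ e) for k, e in enumerate(eps)))
```

With γ < 1 the tail of the sum is geometrically damped, so any controller keeping ε_k bounded yields a finite cost, in line with the admissibility discussion above.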

Nonlinear State-Feedback VRFT for Approximate ORM Tracking Control Using Neural Networks
An initial controller for the system (3) achieving approximate ORM tracking is obtained with the VRFT concept. Under assumptions A1 and A2, for tuning a nonlinear state-feedback controller, the designer may employ an input-state-output dataset of the form {ũ_k, x̃_k, ỹ_k}, k = 0, N − 1, gathered from the process in an open-loop experiment lasting for N sample time steps, where the persistently exciting ũ_k excites all the significant process dynamics. To achieve linear ORM tracking for a nonlinear process, a nonlinear state-feedback controller is more suitable than a linear one, being able to cope with the process nonlinearities.
The VRFT concept assumes that, if the controlled output ỹ_k produced in an open-loop experiment conducted on the stable process is both the control system's output and the ORM's output, then the closed-loop control system will match the reference model [4,[39][40][41][42]]. Let r̃_k = M(z)^{−1} ỹ_k be the virtual reference input that generates ỹ_k when filtered through M(z), which is assumed to be invertible with respect to the inverse filtering operation. It is called virtual since it is never set as a reference input to the closed-loop control system and is only used in the offline controller tuning. The virtual states of the ORM are computable from Equation (2) as x̃^m_{k+1} = f^m(x̃^m_k, r̃_k), and the VRFT controller parameter results as θ* = argmin_θ J^N_VR(θ). Theorem 2 in [41] shows that if the controller parameterization is rich enough, then θ* also minimizes J^∞_MR, proven for input-output models only. Motivated by [41], a formal proof is given as an incentive for using state-feedback controllers tuned by nonlinear multi-input multi-output (MIMO) VRFT. Several other assumptions are considered: A3: The process (1) has an equivalent input-output form y_k = P(y_{k−1}, . . . , y_{k−ny}, u_{k−1}, . . . , u_{k−nu}), where ny, nu are unknown process orders and the nonlinear map P is invertible with respect to u, meaning that for a given y_k, the input is recoverable as u_{k−1} = P^{−1}(y_k). Zero initial conditions are assumed at this point. Also, the ORM (2) has an equivalent input-output form y^m_k = M(y^m_{k−1}, . . . , y^m_{k−nym}, r_{k−1}, . . . , r_{k−nr}), where nym, nr are the known ORM orders and M is a nonlinear invertible map with stable inverse, allowing the calculation of r_{k−1} = M^{−1}(y^m_k). Zero initial conditions are also assumed. A4: Let the process (1) and the ORM (2) be formally written as y_k = s(x_k, u_{k−1}) and y^m_k = s^m(x^m_k, r_{k−1}), respectively, to capture simultaneously both the input-output dependence and the input-state-output one in a compact form.
These expressions also reveal the relative degree one from input to output, without loss of generality. Assume zero initial conditions for (1) and assume the map s invertible, with x_k, u_{k−1} computable from y_k as x_k = (s^x)^{−1}(y_k), u_{k−1} = (s^u)^{−1}(y_k). Further assume that s^m is a continuously differentiable invertible map such that x^m_k, r_{k−1} are computable from y^m_k as x^m_k = (s^m_x)^{−1}(y^m_k), r_{k−1} = (s^m_r)^{−1}(y^m_k), and assume there exist positive constants B^m_sx > 0, B^m_sr > 0 bounding the derivatives of (s^m_x)^{−1} and (s^m_r)^{−1}, respectively. Let zero initial conditions hold for (2). These inversion assumptions are natural for state-space systems such as (1) and (2) that have equivalent input-output models according to A4. For example, for a given output y_k of (1), the input is uniquely determined as u_{k−1} = P^{−1}(y_k), after which the state can be generated by recursion from x_{k+1} = f(x_k, u_k) of Equation (1). This is the sense of x_k = (s^x)^{−1}(y_k).
Moreover, let s and (s^x)^{−1} be continuously differentiable and of bounded derivative. A5: Let a finite open-loop trajectory collected from the process be D = {ũ_k, x̃_k, ỹ_k} ⊂ U × X × Y, k = 0, N − 1, where ũ_k is: (1) persistently exciting, for ỹ_k to capture all process dynamics, and (2) ensuring uniform exploration of the entire domain U × X × Y. Good exploration is achievable for large enough N.
A6: There exists a set of nonlinear parameterized state-feedback continuously differentiable controllers C(x^E_k, θ), a θ̂ for which û_k = C(x^E_k, θ̂), and an ε > 0 bounding the VRFT c.f. (6) as J^N_VR(θ̂) ≤ ε. Theorem 1. Under A1-A6, the controller C(x^E_k, θ̂) obtained by minimizing the c.f. (6) renders the finite-horizon ORM tracking c.f. J^N_MR(θ̂) in (10) arbitrarily small for small enough ε. Proof: See Appendix A.
Corollary 1. The controller C(x E k ,θ) obtained by minimizing the c.f. (6) is stabilizing and admissible for J ∞ MR in Equation (5) with γ < 1.
Proof. By Equation (8), a properly identified C(x^E_k, θ̂) renders the finite-time J^N_MR(θ̂) in (10) arbitrarily small. Secondly, the good exploration of U × X × Y ensured by D = {ũ_k, x̃_k, ỹ_k} reflects in good exploration of the domains of r̃_k and x̃^m_k, respectively. In (10), the compared trajectories are all generated from the same r̃_k. If (10) holds for many combinations of r̃_k, x̃^m_k rendered from exploratory data, then by the arguments of continuous differentiability and bounded derivatives of the maps (7) and by assumption A4, it will hold for any possible combination of r_k and x^m_k = f^m(x^m_{k−1}, r_{k−1}) generated from any r_k. To show this, note that both outputs in (10) can be generated from the same r̃_k (x^E_{k−1} contains r̃_{k−1}). Using this fact and the bounded derivatives of the maps with respect to their arguments, it follows from (10) that the ORM tracking error s(x_k, C(x^E_{k−1}(r_{k−1}))) − s^m(x^m_k, r_{k−1}) = ŷ_k − y^m_k is bounded, which makes the controller C(x^E_k, θ̂) stabilizing for the control system in the sense of bounded output when r_k is bounded. Then it is an admissible one for the infinite-horizon c.f. J^∞_MR with γ < 1. This proves the claim. An NN can be used as a controller for nonlinear state-feedback control learning. Nonlinear VRFT is proposed in [41,42] and successfully applied to NN controllers in [41,[43][44][45]], but only for output-feedback control and not for state-feedback control as done here.
Notice that VRFT control does not need the entire extended state x̃^E_k for feedback (i.e., including the virtual states of the ORM); the process states alone would suffice for this purpose. However, the state extension is required for preserving the Markov property of the system (3) in order to ensure the correct collection of the transition samples; this is not possible otherwise without a special collection design, such as using a zero-order hold for two-by-two consecutive time samples [43]. Correct transition-sample collection is required for the adaptive actor-critic tuning of the same NN controller that is initially tuned via VRFT.
Notice that in the proposed state-feedback VRFT design, knowledge of the output function of (1) is again not needed, since ỹ_k is used to calculate the virtual reference r̃_k, while the controller only uses x̃^E_k for feedback purposes.
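For a scalar first-order ORM, the virtual-reference construction described in this section can be sketched as follows; the particular unit-gain model M(z) = (1 − a)/(z − a) and its coefficient are illustrative assumptions:

```python
import numpy as np

# Hypothetical unit-gain first-order ORM: y_m[k+1] = a*y_m[k] + (1 - a)*r[k].
def reference_model(r, a=0.6, y0=0.0):
    """Simulate the ORM output for a reference sequence r."""
    y = [y0]
    for rk in r:
        y.append(a * y[-1] + (1.0 - a) * rk)
    return np.array(y)

def virtual_reference(y_tilde, a=0.6):
    """Invert M(z) sample-wise: r_virt[k] = (y[k+1] - a*y[k]) / (1 - a).

    y_tilde is the measured open-loop output; the returned sequence is the
    (virtual) reference that would have produced y_tilde through the ORM.
    """
    y = np.asarray(y_tilde, dtype=float)
    return (y[1:] - a * y[:-1]) / (1.0 - a)
```

Filtering any measured open-loop output ỹ_k through this inverse recovers the virtual reference r̃_k used only in the offline controller fit, never as an actual closed-loop reference.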

Adaptive Actor-Critic Learning for ORM Tracking Control
If the system dynamics (3) is known, for a finite-time horizon version of the c.f. (4), numerical dynamic programming solutions can be employed backwards in time, but only for finite state and action spaces of moderate size, a limitation referred to as the "curse of dimensionality". For infinite-horizon c.f.s, Policy Iteration and Value Iteration [23] can be used even for large and/or continuous state and action spaces, where FAs such as NNs are one option.
If the system dynamics in (3) is unknown, the minimization of the c.f. (4) becomes an RL problem. To solve it model-free, an informative c.f. for each state-action pair is defined, called the Q-function (or action-value function). In this respect, the action-value function of applying u_k in state x^E_k and then following the control (policy) C is Q^C(x^E_k, u_k) = V(x^E_k, u_k) + γ J^C(x^E_{k+1}). The optimal Q-function Q*(x^E_k, u*_k) satisfies Bellman's optimality equation Q*(x^E_k, u_k) = V(x^E_k, u_k) + γ min_u Q*(x^E_{k+1}, u), with the optimal controller C*(x^E_k) = argmin_u Q*(x^E_k, u); the optimal Q-function corresponds to the minimum value c.f. out of the c.f.s defined in Equation (4). Notice that c.f. (4) encompasses (5), thus making the ORM tracking problem consistent with the above equations. The optimal Q-function can be found using Policy Iteration or Value Iteration in a model-free manner, using, e.g., NNs as FAs. The optimal Q-function estimate and the optimal controller estimate can be updated from the transition samples in several ways: in online/offline mode, batch mode, or sample-by-sample update [23,46]. A particular class of online RL approaches is represented by the temporal-difference-based AAC design, which differs from the batch PI and VI approaches, as it avoids alternate batch back-up of the Q-function FA and of the controller FA.
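On a toy finite MDP, the Bellman optimality relation for the Q-function stated above can be sketched with tabular Q-iteration; the two-state example and cost values are illustrative assumptions, while the paper works with continuous spaces and NN approximators:

```python
import numpy as np

def q_value_iteration(P_next, stage_cost, gamma=0.9, iters=500):
    """Iterate Q <- V + gamma * min_u' Q(x', u') until (near) convergence.

    P_next[x, u]    : deterministic next-state index, shape (nx, nu)
    stage_cost[x, u]: stage cost V(x, u) > 0 toward the learning target
    """
    Q = np.zeros_like(stage_cost, dtype=float)
    for _ in range(iters):
        # Bellman backup: min over next actions, gathered at the next states
        Q = stage_cost + gamma * Q.min(axis=1)[P_next]
    return Q
```

The greedy controller is then C*(x) = argmin_u Q*(x, u), and for γ < 1 the backup is a contraction, so the converged Q satisfies the Bellman optimality equation to numerical precision.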

Adaptive Actor-Critic Design
The proposed AAC design is a gradient-based scheme designed to converge to the optimal Q-function and optimal controller estimates. Let the temporal-difference error be measured from data as δ_k = V(x_{k−1}, u_{k−1}) + γ Q̂_k(x_k, u_k) − Q̂_k(x_{k−1}, u_{k−1}), where Q̂_k(·, ·), continuous in its arguments, is the Q-function estimate at time k, a time at which some controller C_k(x_k) is also available. From this point onward, for notation simplicity, x_k or plain x is used instead of x^E_k. The proposed AAC design performs the Q-function update to minimize online the c.f. E_{c,k} = 0.5 δ_k^2, while the controller attempts to minimize online the Q-function using gradient descent. Taxonomically, the proposed AAC belongs to the online Policy Iteration schemes, where the policy evaluation step (of Bellman residual minimization type) interleaves with the policy improvement step. The update laws for the AAC design from input-state data are (15) and (16), where α_a > 0, α_c > 0 are learning rates. The controller C(x_k) is imagined as a function (or as an infinitely dense table) mapping any state to a control action. Comment 1: In particular, for any admissible controller C_0, repeated calls of (16), under proper exploration (translating to visiting all pairs (x_{k−1}, u_{k−1}) ∈ X^E × U often and to generating the sample x_k) and under proper selection of α_c > 0, will update Q̂_k(x, u), ∀x ∈ X^E, until δ_k = 0, at which point (11) must hold and the converged Q^{C_0}(x_k, u_k) evaluates C_0. This is an online off-policy model-free policy evaluation step. Then Q̂_0(x, u) = Q^{C_0}(x, u) can serve as an initialization for the AAC, whereas an initial admissible controller can be obtained, for example, using VRFT, as shown later in the case study.
Comment 2: The converged Q-function of an admissible controller C_0, denoted Q^{C_0}(x_k, u_k), is positive by definition since it accumulates stage costs V > 0. Moreover, it is never smaller than the optimal Q-function, i.e., Q^{C_0}(x_k, u_k) ≥ Q*(x_k, u_k) > 0, and it obeys the Bellman equation.
Comment 3: From (15), it follows that Q̂_k(x_k, C_{k+1}(x_k)) ≤ Q̂_k(x_k, C_k(x_k)) for a small enough α_a > 0. Lemma 1. Starting from an admissible controller C_0 with the corresponding Q-function initialization Q̂_0(x, u) = Q^{C_0}(x, u), the AAC updates (15), (16) produce controllers that satisfy inequality (18) at every step. Proof. Q̂_0(x, u) is the initialization for Equation (16) and obeys the Bellman Equation (11) for the admissible controller C_0, i.e., Q̂_0(x, u) = Q^{C_0}(x, u). Starting the AAC update law from the initial state x_0, C_1(x) is updated first by Equation (15); by Comment 3, it then follows that Q̂_0(x_0, C_1(x_0)) ≤ Q̂_0(x_0, C_0(x_0)). Since x_0 can be any state x ∈ X^E, Equation (18) holds for k = 1. Assume by induction that (18) holds for some k. Using Comment 3, it follows that (18) also holds for k + 1 and, since x_k, u_k can be any pair (x, u) ∈ X^E × U, the conclusion of Lemma 1 follows.
Theorem 2. Let Q̂_0(x, u) > 0 (finite for any finite argument) be an initialization for the Q-function of an initial admissible controller C_0. Starting with any x_0, the control u_0 = C_0(x_0) is applied to the process. Specifically, the AAC update laws (15), (16) ensure that at time k = 1, C_1 is updated from C_0, Q̂_0(x, u) is updated with (16) using u_1 = C_1(x_1) in the right-hand side, the control u_1 is sent to the process, and then k ← 2, with the above strategy repeated for subsequent times. Claim: the feedback control system under the time-varying control C_k is stabilized for γ < 1 and asymptotically stabilized for γ = 1.
Proof. For the first three time steps, the relations (22a-c) hold. Cancelling the same terms on both sides of Equations (22a-c), since α_c > 0, it follows that the sums in square parentheses are negative. These sums are further refined using Lemma 1. Extending the exemplified reasoning backwards from infinity, and noting that the limit in the inequality is the cost of using the controller u_0 = C_0(x_0), u_1 = C_1(x_1), . . ., it follows that the control system remains stable under the time-varying control of the AAC updates. Moreover, for γ = 1, the stage costs must form a square-summable sequence, which implies that the control system is asymptotically stabilized by C_k, thus proving the claim of the theorem. Comment 4: The above proof resembles the stabilizing action-dependent value iteration of [30], but here it relies on gradient-based updates of the Q-function estimate and of the controller, rather than on its minimization. The stability result of Theorem 2 is valid under continuous updates of the AAC laws (15), (16) with no exploration. It ensures that, starting from an admissible controller C_0, the AAC updates (15), (16) preserve the control system stability. Since exploratory controls u_k are critical, a compromise solution is to perform the controller update (15) only at non-exploratory sampling instants, while the Q-function estimate is continuously updated per (16).
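A minimal sample-wise sketch of the interleaved updates (15), (16), assuming a linear-in-features critic and actor and a finite-difference actor gradient; all feature maps, dimensions, and step sizes here are illustrative assumptions, not the paper's exact rules:

```python
import numpy as np

def aac_step(w_c, w_a, phi_c, phi_a, x_prev, u_prev, x, V_prev,
             gamma=0.95, alpha_c=0.05, alpha_a=0.01, eps=1e-4):
    """One interleaved critic/actor update from the transition (x_prev, u_prev) -> x."""
    u = float(w_a @ phi_a(x))                       # current control C_k(x)
    # Temporal-difference (Bellman residual) error on the critic estimate:
    delta = (V_prev + gamma * float(w_c @ phi_c(x, u))
             - float(w_c @ phi_c(x_prev, u_prev)))
    # Critic: gradient step on E_c = 0.5 * delta^2 (semi-gradient form):
    w_c = w_c + alpha_c * delta * phi_c(x_prev, u_prev)
    # Actor: descend the critic's estimate of Q(x, .) w.r.t. u (finite difference):
    dQ_du = (float(w_c @ phi_c(x, u + eps))
             - float(w_c @ phi_c(x, u - eps))) / (2 * eps)
    w_a = w_a - alpha_a * dQ_du * phi_a(x)
    return w_c, w_a, delta
```

Per Comment 4, the actor step would be skipped at exploratory sampling instants, while the critic keeps updating at every sample.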

AAC Using Neural Networks As Approximators
In practice, specific FAs such as NNs are employed as approximators for the Q-function and for the controller, respectively. Using NNs as FAs (implying nonlinear feature parameterization), the convergence of the learning scheme depends to a large extent on the selection of the learning parameters. However, the advantage of using generic NN architectures is that no manual or automatic feature selection is needed for parameterizing the Q-function and controller estimates.
The proposed AAC design herein uses two NNs to approximate the Q-function (critic, referred to herein as Q-NN) and the controller (actor, referred to herein as C-NN), respectively. Assume an initial admissible NN VRFT state-feedback controller exists. With three-layer feed-forward NNs having one hidden layer, fully connected with bias, the critic and the controller are modeled by Equations (23) and (24), with:
• W_c = [W^k_{c,1} . . . W^k_{c,n_hc+1}]^T, the critic output-layer weights, with n_hc hidden neurons;
• W^h_c, the critic hidden-layer weight matrix;
• I^k_c, the critic input vector of size n_ic + 1 (the bias input is constant 1);
• W^{k,l}_a = [W^{k,l}_{a,1} . . . W^{k,l}_{a,n_ha+1}]^T, l = 1..m, the output-layer weights of the l-th controller output, with n_ha hidden neurons;
• W^h_a, the controller hidden-layer weight matrix;
• I^k_a, the controller input vector of size n_ia + 1 (bias input 1 included).
σ_i = tanh(·) is the hyperbolic tangent activation function at the output of the i-th hidden neuron. The controller and the critic are cascaded, with the actor output u_k as part of the critic input alongside x^E_k. The AAC's gradient-descent tuning rules for the critic (25) and the controller (26), as parameterized variants of (15) and (16), are detailed in (27) and (28), with α_a, α_c the learning-rate magnitudes of the controller and critic training rules, respectively, σ'_i the derivative of σ_i with respect to its argument, and ξ the index of u^l_k in I^k_c. In many practical applications, the designer chooses to perform either full or partial adaptation of the NNs' weights, the latter implying only output-weight adaptation. In this latter case, the Q-NN and C-NN parameterizations are given by (29), where Φ^k_c(x^E_k, u_k), Φ^k_a(x^E_k) are the matrices of basis functions (or input features) and W^k_c, W^k_a are the tunable output weights, rendering Q̂_k and u_k as linear combinations of basis functions. This linear parameterization simplifies the convergence analysis but, as a disadvantage, requires manual feature selection and training.
As it is well-known, the AAC architecture performs online with the C-NN sending controls to the process and the Q-NN serving both to estimate the Q-function and to adaptively tune the C-NN. The closed-loop control system with the process (3) combined with the AAC tuning rules (27), (28) has the unique property that its dynamics is mainly driven by the reference input r k viewed as a particular state of the extended state vector and, possibly, by exogenous unknown disturbances. Since r k is user selectable, it can be used to drive the control system in a wide operating range to ensure efficient exploration of the state space. Enhanced exploration of the domain X E × U can be performed by trying random actions in every state, usually as additive uniform random actions.

Convergence of the AAC Learning Scheme with NNs
While the results from Section 4.1 are formulated for generic functions representing the Q-function and the controller, the convergence to the optimal controller and optimal Q-function is not yet ensured. In the following, the convergence of the AAC learning with NNs is shown. Linear parameterization is the most widely used and supports tractable analysis. Let the output-weight parameterization (29) of the Q-NN and C-NN lead to the update laws, compactly written in terms of the feature matrices Φ^k_c, Φ^k_a. Let the weights of the optimal Q-NN and optimal C-NN be W*_c, W*_a, the estimation errors being the differences between the current weights and the optimal ones. Notice that the temporal-difference error δ_k is calculated in terms of the Q-NN's output. Then δ_k is backpropagated to correct the Q-NN weights. Moreover, δ_k is further backpropagated to correct the C-NN weights, since the C-NN output is an input to the Q-NN. Hence, the resulting AAC architecture belongs to the deep learning approaches (the architecture is presented in Figure 1b of the next Section).

Summary of the Mixed VRFT-AAC Design Approach
The steps of the VRFT-AAC design approach are summarized next: S1. Collect input-state-output samples from the open-loop stable process (1) in a dataset D = {ũ_k, x̃_k, ỹ_k} ⊂ U × X × Y, k = 0, N − 1, where ũ_k is persistently exciting, under the conditions of A5. S2. Obtain the initial state-feedback VRFT controller by minimizing the c.f. in Equation (6) as θ^a_k = argmin_θ J^N_VR(θ). When NNs are used, minimization of the c.f. (6) is equivalent to training the NN. The obtained controller is C_k(x^E_k, θ^a_k), which is a controller both for the process (1) and for the extended process (3). It is also a close initialization to the optimal controller that minimizes J^∞_MR from (5), since VRFT identifies a controller that approximately minimizes J^N_MR from (10), a finite-horizon version of J^∞_MR. This is supported by Theorem 1.
S3. Close the control system loop on process (1) (equivalent to closing it on the extended process (3)) using the controller C_k(x_k^E, θ_k^a). The architecture is presented in Figure 2b. Use update (16) (in explicitly parameterized form, use (28)) under an exploratory reference input r_k in order to learn the Q-function of the controller C_k(x_k^E, θ_k^a). This serves as a proper initialization of Q̂_k(x_k^E, u_k, θ_k^c) for the subsequent AAC tuning.
S4. Use the updates (15), (16), (27), (28) (in explicitly parameterized form), in this exact order and under an exploratory reference input r_k, to learn the optimal controller C*(x_k^E, θ_k^{a*} = W_a^*) and the optimal Q-function Q̂*(x_k^E, u_k, θ_k^{c*} = W_c^*). Using the above updates for a finite time on a random learning scenario is called a learning episode.
S5. After every learning episode, measure the tracking performance on a standard test scenario. When the prescribed maximum number of tests is reached or the tracking performance on the standard scenario no longer improves, the controller learning is stopped. Otherwise, proceed to the next learning episode.
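Steps S3-S5 form an episodic improve-and-test loop, which can be sketched as follows. The callback names (train_episode, evaluate) and the stopping tolerance are illustrative assumptions; the loop structure (train an episode, evaluate on the standard scenario, stop when no longer improving) is the one in the steps above.

```python
# Hedged sketch of the episodic VRFT-AAC workflow (steps S3-S5).
# train_episode and evaluate stand in for the online AAC updates and for
# the tracking-cost measurement on the standard test scenario.

def run_vrft_aac(initial_controller, evaluate, train_episode,
                 max_episodes=30, tol=0.0):
    """Keep training while the tracking cost on the test scenario decreases."""
    controller = initial_controller
    best_cost = evaluate(controller)           # J_MR on the standard scenario
    for _ in range(max_episodes):
        candidate = train_episode(controller)  # one AAC learning episode (S4)
        cost = evaluate(candidate)             # S5: measure performance
        if cost < best_cost - tol:             # still improving: keep learning
            controller, best_cost = candidate, cost
        else:
            break                              # no improvement: stop (S5)
    return controller, best_cost
```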
All implementation details of the above VRFT-AAC design are presented in the following Section 5.1, where validation is performed on the complex multivariable tank system case study.

AAC Design for a MIMO Vertical Tank System
The controlled process is a vertical MIMO two-tank system (Figure 1c) built around a three-tank laboratory equipment [47], with the continuous-time state-space equations (33). Features of this process include: no water level setpoint for tank T2 can be tracked if the water level in T1 is zero; the electrical valve controlling the outflow from T1 (which is the inflow to T2) is driven by u2; no setpoint can be tracked for either tank if there is more outflow than inflow; T1's outflow has a minimum value and can be zero only when H1 = 0, as per the first equality in (33). T2's dynamics is slower than T1's. Proper selection of the parameters C1, C2 through manual valves allows feasible control trajectories in the constrained input-state-output space. Discretization of (33) reveals its Markov form. The water level is measured using piezoelectric sensors (PS1 and PS2 in Figure 1c). Protection logic disables the pump voltage when the water level exceeds the upper bound. The sampling period used for control experiments is Ts = 0.5 s. Model (33) is not used for control design.
For VRFT-based control design, the ORM is selected as the ZOH discretization of M(s) = diag(M1(s), M2(s)). The open-loop data collection (Figure 2) ensures the exploratory conditions from Assumption A5. The controllable canonical state-space realizations (A1, B1, C1, D1) and (A2, B2, C2, D2) of M1(z) and M2(z), respectively, are given in (34). The ORM state is then x_k^m = [x_{k,1}^m x_{k,2}^m x_{k,3}^m x_{k,4}^m]^T. The virtual reference r̃_k = M(z)^{-1} ỹ_k is used as input to the state-space models (34) to obtain the ORM's virtual states x̃_k^m. The extended virtual state x̃_k^E ∈ ℝ^8 is used to compute the C-NN offline via VRFT, by fitting the inputs x̃_k^E to the outputs ũ_k. Note that using two second-order ORMs produces four states in the extended state space, which is disadvantageous. The ORMs' orders should be as low as possible (usually one), but a second-order model offers greater flexibility in output response shaping.
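The virtual reference computation r̃_k = M(z)^{-1} ỹ_k can be sketched for a single channel. For brevity, a first-order ZOH-discretized ORM y_{k+1} = a·y_k + (1 − a)·r_k is assumed here instead of the paper's second-order models; the inversion principle (solving the ORM difference equation for the reference, sample by sample) is the same.

```python
# Hedged single-channel sketch of the virtual-reference computation
# r~_k = M(z)^{-1} y~_k; the first-order ORM is an assumption for brevity.

def orm_step(a, y, r):
    """One step of the discretized output reference model."""
    return a * y + (1.0 - a) * r

def virtual_reference(a, y_data):
    """Invert the ORM difference equation sample by sample:
    r~_k = (y~_{k+1} - a * y~_k) / (1 - a)."""
    return [(y_data[k + 1] - a * y_data[k]) / (1.0 - a)
            for k in range(len(y_data) - 1)]
```

Fed with measured outputs ỹ_k, virtual_reference returns the reference sequence that would have produced them through the ORM; the C-NN is then fitted to map the corresponding states to the recorded inputs ũ_k.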
The VRFT C-NN architecture (Figure 1a) is a feedforward 8-10-2 fully connected one with biases, having n_ha = 10 hidden neurons with tanh(·) activation functions, while the output activation functions are linear. The weights are initialized with uniform random numbers in [−1.5, 1.5]. Since the NN training is performed offline, standard gradient backpropagation training with Levenberg-Marquardt [48] is used for a maximum of 50 epochs to learn a stabilizing VRFT C-NN controller C(x_k^E) for the MIMO control system that minimizes J^N_VR. 80% of the data is effectively used for training, while the remaining 20% serves as validation data. Early stopping is applied after six consecutive increases in the mean sum of squared errors evaluated on the validation data. Other offline training algorithms such as Broyden-Fletcher-Goldfarb-Shanno [49,50] and conjugate gradient [51,52] may be similarly efficient, but their computational burden is prohibitive for online real-time training.
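The early-stopping rule used during offline training can be sketched as follows. The callbacks train_epoch and val_error are assumed interfaces standing in for one Levenberg-Marquardt epoch and the validation-error evaluation; only the stopping rule itself (halt after `patience` consecutive validation-error increases, within a maximum epoch budget) reflects the text.

```python
# Sketch of the early-stopping rule: stop after `patience` consecutive
# increases of the validation error; train_epoch/val_error are assumed stubs.

def train_with_early_stopping(train_epoch, val_error, max_epochs=50,
                              patience=6):
    best, best_epoch = float("inf"), 0
    increases, prev = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        train_epoch()                 # one offline training epoch
        err = val_error()             # mean sum of squared validation errors
        increases = increases + 1 if err > prev else 0
        if err < best:
            best, best_epoch = err, epoch
        prev = err
        if increases >= patience:     # six consecutive increases: stop
            break
    return best, best_epoch
```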
Results on a standard test scenario with the initial VRFT C-NN controller are shown in Figure 3. It is observed that the ORM tracking errors are bounded, since the VRFT controller is stabilizing (though not asymptotically), which validates the theoretical results of Theorem 1 and Corollary 1. It is then an admissible controller for J^∞_MR in Equation (5) with γ < 1. The initial controller tuning using VRFT is also attractive because it has learned a feature matrix Φ_a(x_k^E), so from this point onwards, the designer can choose to perform either output weights adaptation or full weights adaptation.
The Q-function estimate of the VRFT C-NN (i.e., the critic Q-NN) is learned next, in a policy evaluation step, in order to serve as a good initial estimate of the Q-function needed for the subsequent AAC learning and also to fulfil the requirements of Lemma 1. This step is possible since the VRFT controller is admissible and, for a properly selected learning rate, the weights of the Q-NN converge. The critic Q-NN approximating the Q-function has a similar architecture to the C-NN, of size 10-25-1 (eight states and two controls), with n_hc = 25. The critic Q-NN output weights are randomly drawn from a zero-mean normal distribution with variance σ² = 90, while the hidden layer weights are uniformly randomly initialized in [−1.5, 1.5]. Setting γ = 0.95 and the learning rates α_c = 0.01 in (27) and α_a = 0 in (28) (no controller tuning), all the Q-NN weights are updated using the gradient backpropagation in (27), by driving the MIMO control system with a sequence of uniformly random piecewise constant steps in r_k = [r_{k,1} r_{k,2}]^T ∈ [0.05, 0.25] × [0.01, 0.2]. This procedure also serves as a tuning step for α_c. With r_{k,1} and r_{k,2} lasting 20 s and 33 s, respectively, they do not switch simultaneously, which better reveals the coupling effects between the control channels. After 500 s, the critic weights stabilize, the output weights being shown in Figure 4 for 4500 s. This pre-tuned Q-NN is used as initialization for the following case studies. After this intermediate tuning step of the Q-NN, the designer can opt for full weight adaptation of the Q-NN or only for output weights' adaptation of the Q-NN, while the features matrix Φ_c(x_k^E, u_k) is kept constant.
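The critic-only policy evaluation step (α_a = 0) can be illustrated on a toy scalar problem. The linear process, cost weights, fixed controller gain and quadratic critic features below are all illustrative assumptions, not the tank model; only the mechanism (TD gradient descent on the critic's output weights while the fixed admissible controller drives the loop) follows the text.

```python
# Toy sketch of critic-only policy evaluation (alpha_a = 0): TD updates of
# linear critic output weights under a fixed controller. All dynamics,
# features and gains are assumptions for illustration.

import random

def policy_evaluation(steps=4000, gamma=0.95, alpha_c=0.01, seed=0):
    rng = random.Random(seed)
    W = [0.0, 0.0, 0.0]                       # critic output weights
    phi = lambda x, u: [x * x, x * u, u * u]  # assumed quadratic features
    Q = lambda x, u: sum(w * f for w, f in zip(W, phi(x, u)))
    x = rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        u = -0.5 * x                          # fixed (admissible) controller
        x_next = 0.8 * x + u + 0.01 * rng.uniform(-1.0, 1.0)
        cost = x * x + 0.1 * u * u            # stage cost (positive)
        delta = Q(x, u) - (cost + gamma * Q(x_next, -0.5 * x_next))
        # Gradient descent on delta^2 / 2 over the output weights only.
        W = [w - alpha_c * delta * f for w, f in zip(W, phi(x, u))]
        x = x_next
    return W
```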
The C-NN is now further tuned (in the architecture from Figure 1b) to improve the ORM control performance. Setting α_a = 10^−8 in (28) and α_c = 0.01 in (27) (critic adaptation should generally be faster than actor adaptation), both the C-NN and Q-NN are adaptively trained online. Although carried out in an adaptive framework, the training unfolds over consecutive episodes in which the feedback control system is driven by a sequence of random reference input steps for 700 s. The reference inputs are uniformly random piecewise constant steps in r_k = [r_{k,1} r_{k,2}]^T ∈ [0.05, 0.25] × [0.01, 0.2], with r_{k,1} and r_{k,2} lasting 20 s and 33 s, respectively. Updates (27), (28) are skipped when either r_{k,1} or r_{k,2} switches, to preserve the Markov property of the extended model. The controller parameters at the end of an episode are the initial ones for the following episode, the Q-NN weights following the same transfer rule. To ensure enhanced exploration of the state-action space, the C-NN controller output is perturbed every third sample time with probing noise, where rand is a normally distributed random number with zero mean and variance σ² = 3.56, and mod(k, s) is the remainder after dividing k by s. This is in fact a form of ε_0-greedy exploration strategy, useful for trying many actions in the vicinity of the current state. A typical learning episode is shown in Figure 5. After each learning episode, the learning is stopped, the C-NN performance is measured on the standard test scenario from Figure 3, and the aim is the decrease of a finite-time version of the c.f. J^∞_MR from Equation (5), namely J^1400_MR.
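The probing-noise schedule can be sketched directly from the description: the C-NN output is perturbed only when mod(k, 3) = 0, with zero-mean Gaussian noise. The function name and the lack of saturation handling are assumptions kept for brevity.

```python
# Sketch of the mod(k, s)-based probing-noise exploration schedule: the
# nominal C-NN output is perturbed every `period`-th sample (assumed helper).

import random

def probed_action(u_nominal, k, rng, sigma2=3.56, period=3):
    """Return the (possibly perturbed) control action at sample k."""
    if k % period == 0:
        # Zero-mean Gaussian probing noise with variance sigma2.
        return u_nominal + rng.gauss(0.0, sigma2 ** 0.5)
    return u_nominal
```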
This standard test scenario is not seen during training. The learning then resumes with the next episode. After at most 30 learning episodes (meaning 21,000 s and 42,000 samples), the C-NN and Q-NN adaptations are stopped and the learning trial (comprising learning episodes) converges under Theorem 3. The final adaptively learned C-NN and the initial VRFT controller are shown performing in Figure 3. The episodic learning allows testing the improvement in control performance between episodes. Figure 6 illustrates the J^1400_MR_AAC decrease over the episodes of a convergent learning trial. Throughout each learning episode, under the AAC update laws, the control system preserves its stability, as ensured by Theorem 2. For comparison, a model-free approximated batch-fitted Q-learning (BFQ) controller [53,54] is also proposed, using the same Q-NN and C-NN architectures with the same sizes. BFQ alternates offline training of the C-NN and the Q-NN, using 12,000 transition samples collected under the randomly perturbed model-free single-input single-output VRFT linear controllers C1(z) = (2.6092 + 0.1184z^−1 − 2.3609z^−2)/(1 − z^−1) and C2(z) = (1.5735 + 0.2405z^−1 − 1.3547z^−2)/(1 − z^−1), independently designed for the two tanks. BFQ implements a Value Iteration algorithm. The training settings assume that the weights of both NNs are initialized to uniform random numbers in [−1.5, 1.5].
A maximum of 200 epochs is used for training with Levenberg-Marquardt, on 80% effectively training data and 20% validation data. Early stopping is employed to prevent overfitting, after six consecutive increases of the mean sum of squared errors on the validation data. 200 iterations of BFQ take about 2 h and the best controller is saved.
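The BFQ value-iteration loop used for comparison can be sketched at a coarse level. The fit_q and fit_policy callbacks stand in for the offline Levenberg-Marquardt trainings of the Q-NN and C-NN (assumed interfaces); the structure (rebuild Bellman targets from a fixed batch of transitions, then refit the two regressors alternately) is the one described.

```python
# Coarse sketch of batch-fitted Q-learning (BFQ) value iteration over a
# fixed batch of transitions (x, u, r, x_next); fit_q/fit_policy are
# assumed stand-ins for the offline NN trainings.

def bfq(transitions, q_fn, policy, fit_q, fit_policy,
        gamma=0.95, iterations=200):
    for _ in range(iterations):
        # Bellman targets from the current Q estimate and current policy.
        targets = [r + gamma * q_fn(x2, policy(x2))
                   for (x, u, r, x2) in transitions]
        # Refit the critic to the targets, then the actor to the critic.
        q_fn = fit_q([(x, u) for (x, u, r, x2) in transitions], targets)
        policy = fit_policy(q_fn, [x for (x, u, r, x2) in transitions])
    return q_fn, policy
```

On a one-state toy batch with unit cost, the fitted Q approaches the discounted sum 1/(1 − γ) = 20 over the iterations, which is the expected value-iteration behavior.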
The value of J^1400_MR for the initial VRFT controller is J^1400_MR_VRFT = 1.58, for the final adaptively learned controller it is J^1400_MR_AAC = 0.45, and for the model-free BFQ controller it is J^1400_MR_MFBFQ = 0.32. Additionally, a model-based approximated BFQ solution is also offered for comparison. A first dimensionality reduction of the extended state space is performed by observing that, for both ORMs, x^m_{k+1,2} = x^m_{k,1} and x^m_{k+1,4} = x^m_{k,3} from the state-space matrices in Equation (34). For the current iteration C-NN C_iter(x_k^ER), the input patterns are x_k^ER and the target patterns are u_k = argmin_{u∈U_d} Q_iter(x_k^ER, u). Note that evaluation of F(x_k^ER, u_k) to get x_{k+1}^ER uses the original extended state vector, where x^m_{k,2}, x^m_{k,4} are copies of the generated x^m_{k,1}, x^m_{k,3}, and a piecewise constant reference input generative model is used, where r_{k+1,1} = r_{k,1}, r_{k+1,2} = r_{k,2}. To keep the training computationally tractable, only one third of the uniformly sampled data points from the training set are used, sampled differently for each of the Q-NN and the C-NN. The weights of both NNs are initialized to uniform random numbers in [−1.5, 1.5]. A maximum of 200 epochs is used for training with Levenberg-Marquardt, on 80% training data and 20% validation data. Early stopping is employed to prevent overfitting, after six consecutive increases of the mean sum of squared errors on the validation data. Just 14 iterations (taking about 20 min) of this approximate model-based BFQ produce the control results in Figure 3 (in magenta), with J^1400_MR_MBBFQ = 0.25, naturally the smallest, with the best ORM tracking performance, since it uses the process model.

Statistical Investigations of the AAC Control Performance
Several thorough investigation case studies are considered, for which the initial C-NN tuned by VRFT and the initial Q-NN are the same. The investigation concerns full vs. partial tuning of the Q-NN and C-NN weights, while measuring the probing noise effect on the convergence. All statistics are measured on learning trials of at most 50 episodes. The minimal and average J^1400_MR_AAC values over 100 trials are measured, along with the success percentage of convergent learning trials. The average number of episodes until reaching the minimal J^1400_MR_AAC of a successful learning trial, together with the standard deviation of the number of episodes in a successful learning trial, are both rendered in Table 1.
Case 1. With random perturbation of the control action, with α_c = 0.01 in (27) and α_a = 10^−8 in (28), and with full adaptation of the Q-NN and of the C-NN, the learning process convergence starting from the initial C-NN tuned by VRFT and the initial Q-NN is investigated. For a convergent learning trial, with the constant learning rates being used, the best performance never dropped below J^1400_MR_AAC = 0.46, which is inferior to the BFQ performance, suggesting that the proposed adaptive learning strategy is prone to getting stuck in local minima under the adaptive gradient-based update rules. In fact, BFQ is generally advertised as being more data-efficient, although actor-critic learning architectures also allow alternating updates of the C-NN and Q-NN for improving data usage efficiency. The learning trials converge in about 88% of the cases, comparable with other perturbed AAC designs [55,56], given the wide operating range used for the controlled process.
Case 2. For full weights adaptation of both Q-NN and C-NN, without random perturbation of the control action, with α_c = 0.01 in (27) and α_a = 10^−7 in (28), the convergence rate drops to 53% and the best performance is J^1400_MR_AAC = 0.45. Case 3. In the case of output-weights-only tuning of both Q-NN and C-NN, with random perturbation of the control action, with α_c = 0.01 in (27) and α_a = 10^−6 in (28), a 100% convergence rate was observed, but the performance never dropped below J^1400_MR_AAC = 0.59. Case 4. With output-weights-only tuning of both the Q-NN and C-NN, starting from the initial C-NN tuned by VRFT and the initial Q-NN, in the absence of random perturbation of the control action, with α_c = 0.01 in (27) and α_a = 10^−5 in (28), the AAC learning is convergent in all trials (100%), but the performance never drops below J^1400_MR_AAC = 0.50. For α_a = 10^−6, the average number of episodes per trial increases only to 11.
The above four case studies are statistically characterized in Table 1. Concluding, full weight tuning of the Q-NN and C-NN offers better performance (smaller J^1400_MR_AAC) than output-weights-only tuning, but it lowers the convergence rate, being prone to getting stuck in local minima. Comparing Case 1 and Case 2, the probing noise significantly improves the convergence, slightly improves the average J^1400_MR_AAC and decreases the average number of episodes per convergent trial. Full weights adaptation is also more sensitive, since even small corrections in the input-to-hidden layer weights may lead to learning divergence.
The output-weights-only tuning is more robust, with a 100% convergence success rate to an improved solution, but with inferior achievable performance. The probing noise in this case worsens the average number of episodes per trial and the best achievable performance. The guaranteed convergence to an improved solution corresponding to a local minimum in Cases 3 and 4 is also due to the good initial tuning offered by VRFT. Case 4, with output-weights-only tuning and without probing noise, offers the best compromise regarding convergence, performance and few episodes per trial (i.e., fewer transition samples until convergence).
The initial Q-function learning of the NN-VRFT controller is not strictly necessary, as learning convergence was also obtained without this step. For the selected critic learning rate, the weights converge fast enough; however, this step serves for tuning the critic learning rate and also for initializing the features matrix when output-weights-only adaptation is sought. This tuning step is achievable exactly because an initially stabilizing VRFT controller exists.

Comments on the AAC Learning Performance
The AAC learning data efficiency is clearly inferior to that of the BFQ strategy, a comparable model-free approach. The BFQ control is learned from scratch, just from transition samples, whereas the proposed AAC controller starts from an NN-VRFT controller delivering an initial suboptimal ORM tracking solution. Both AAC and model-free BFQ are inferior to the model-based BFQ solution, which exploits knowledge of the process model.
AAC is a form of Action-Dependent Heuristic Dynamic Programming, which is also less data-efficient than other similar approaches such as Dual Heuristic Programming, where the learned co-state vector carries more information than the Q-function. On the other hand, AAC is less computationally demanding and requires less memory than Dual Heuristic Programming, BFQ and model-based BFQ, owing to its adaptive implementation. However, AAC becomes competitive when used together with VRFT, since the VRFT pre-tuning provides an initial controller close to the optimal one, which can then be fine-tuned using AAC. The initial NN-VRFT controller ensures stabilized exploration over a wide operating range for ORM tracking, which is equivalent to indirect feedback linearization. The combined VRFT-AAC design for ORM tracking is therefore more attractive for practical data-driven applications [57,58].

Conclusions
A model-free combination of the VRFT and AAC design approaches was successfully validated to learn improved nonlinear state-feedback control for linear ORM tracking over a wide operating range. The learned controllers indirectly account for several nonlinearities, such as actuator saturation plus dead-zone and output saturation, while also showing good decoupling abilities. The AAC design shares a similar conceptual framework with model-free techniques like Q-learning, SARSA, VRFT, Iterative Feedback Tuning and model-free Iterative Learning Control, by exploiting only the process model structure but not its parameters. The convergence of the proposed adaptive learning strategy relies on several key aspects: efficient exploration correlated with the size of the training dataset and with the process complexity, the selected learning architecture, and the selection of approximators with appropriate parameterizations. In a wider context, VRFT shows significant potential for obtaining close-to-optimal initially admissible controllers with respect to the ORM objective.
Future work targets the validation of the proposed tuning approach to other difficult nonlinear processes and its improvement using data-driven techniques.
Using (A5) and (A7) in (A2), it results that
‖Δy_k‖ ≤ B_sx B_sy ‖Δy_k‖ + B_su (B_cx ‖Δx_{k−1}‖ + ε) ≤ B_sx B_sy ‖Δy_k‖ + B_su B_cx B_sy ‖Δy_{k−1}‖ + B_su ε. (A9)
One can then write the bound that is the conclusion (10), and the proof of Theorem 1 is completed.
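For completeness, a hedged sketch of how (A9) leads to the conclusion, assuming the small-gain-type condition B_sx B_sy < 1 (the exact constants appearing in (10) are those of the paper):

```latex
% Rearranging (A9) under B_{sx}B_{sy} < 1 gives a linear recursion
\|\Delta y_k\| \le \rho\,\|\Delta y_{k-1}\| + c,
\qquad
\rho := \frac{B_{su}B_{cx}B_{sy}}{1 - B_{sx}B_{sy}},
\quad
c := \frac{B_{su}\,\varepsilon}{1 - B_{sx}B_{sy}},
% which, iterated from k = 0 and summed as a geometric series, yields
\|\Delta y_k\| \le \rho^{k}\|\Delta y_0\|
  + c\,\frac{1-\rho^{k}}{1-\rho}
  \le \rho^{k}\|\Delta y_0\| + \frac{c}{1-\rho}
  \qquad (\rho < 1),
```

so the output deviation is ultimately bounded by a term proportional to ε, which is the form of the conclusion.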
Note further that the term E3 in (A13) is positive. Expressing ΔL_2 further, ΔL can eventually be shown to be negative provided the stated condition holds: considering the first term negative for α_c > 4ϕ_c² − 2ϕ_c + 1, while E3 > 0 and E4 > 0, it suffices to satisfy the remaining bound in order to make ΔL negative definite. Then, by the Lyapunov extension theorem [59], the AAC learning is stable and the NN estimation errors are uniformly ultimately bounded, concluding the proof.