Virtual State Feedback Reference Tuning and Value Iteration Reinforcement Learning for Unknown Observable Systems Control

Abstract: In this paper, a novel Virtual State-Feedback Reference Tuning (VSFRT) and Approximate Iterative Value Iteration Reinforcement Learning (AI-VIRL) are applied for learning linear reference model output (LRMO) tracking control of observable systems with unknown dynamics. For the observable system, a new state representation in terms of input/output (IO) data is derived. Consequently, the Virtual Reference Feedback Tuning (VRFT)-based solution is redefined to accommodate virtual state-feedback control, leading to an original stability-certified Virtual State-Feedback Reference Tuning (VSFRT) concept. Both VSFRT and AI-VIRL use neural network controllers. We find that AI-VIRL is significantly more computationally demanding and more sensitive to the exploration settings, while leading to inferior LRMO tracking performance when compared to VSFRT. Nor is it helped by transfer-learning the VSFRT control as an initialization for AI-VIRL. State dimensionality reduction using machine learning techniques such as principal component analysis and autoencoders does not improve on the best learned tracking performance, although it trades off the learning complexity. Surprisingly, unlike AI-VIRL, the VSFRT control is one-shot (non-iterative) and learns stabilizing controllers even in poorly explored open-loop environments, proving superior in learning LRMO tracking control. Validation on two nonlinear, coupled, multivariable complex systems serves as a comprehensive case study.


Introduction
Learning control from input/output (IO) system data is a significant current research area. The idea stems from data-driven control research, where it is strongly believed that the gap between an identified system model and the true system is an important factor leading to control performance degradation.
Value Iteration (VI) is, together with Policy Iteration, a popular approximate dynamic programming [1][2][3][4][5][6][7] and reinforcement learning algorithm [8][9][10][11][12][13]. The VI Reinforcement Learning (VIRL) algorithm comes in many implementation flavors: online or offline, off-policy or on-policy, batch-wise or adaptive, with known or unknown system dynamics. In this work, the class of offline off-policy VIRL for unknown dynamical systems is adopted, based on neural network (NN) function approximators; hence, it will be coined Approximate Iterative VIRL (AI-VIRL). For this model-free offline off-policy learning variant, a database of transition samples (or experiences) is required to learn the optimal control. Most practical implementations are of the actor-critic type, where function approximators (most often NNs) are used to approximate the cost function and the controller, respectively.
Energies 2021, 14, 1006

One of the crucial aspects of AI-VIRL convergence is appropriate exploration, translated into visiting as many state-action combinations as possible while uniformly covering the state-action domains. The exploration aspect is especially problematic with general nonlinear systems, whereas for linear ones, linearly parameterized function approximators for the cost function and for the controller relieve this issue to some extent, owing to their better generalization capacity. Good exploration is not easily achieved in uncontrolled environments and, usually, pre-existing stabilizing controllers can increase the exploration quality significantly. An additional reason for using a pre-stabilizing controller is that, in mechatronic systems, an uncontrolled environment commonly implies instability, under which dangerous conditions could lead to physical damage. This is different from the virtual environments specific, e.g., to video games [14] (or even simulated mechatronic systems), where instability leads to an episode (or simulation) termination but physical damage is not a threat.
Another issue with reinforcement learning algorithms such as AI-VIRL is the state representation. Most often, in the case of unknown systems, the measured data from the system cannot be assumed to fully capture the system state, leading to the partial observability issues associated with reinforcement learning algorithms [15][16][17][18]. Therefore, new state representations are needed to ensure that learning takes place in a fully observable environment. A new state representation is proposed in this work based on the assumption that the controlled system is observable. Therefore, a virtual state built from present and past IO samples is introduced as an alias for the true state. The new virtual state-space representation is a fully observable system which allows controlling the original underlying system.
A similar approach to AI-VIRL for learning optimal control in offline off-policy mode is Virtual Reference Feedback Tuning (VRFT) [19][20][21][22][23][24][25]. It also relies on a database of (usually IO) samples collected from the system in a dedicated experimental interaction step. Traditionally, VRFT was proposed for output feedback error IO controllers and has not yet been sufficiently exploited for state-feedback control; certainly not for the virtual state-feedback control required by observable systems for which direct state measurement is impossible. One contribution of this work is to propose, for the first time, such a model-free framework, called Virtual State-Feedback Reference Tuning (VSFRT), which learns control based on the feedback provided by the virtual state representation.
VSFRT also requires exploration of the controlled system dynamics by using persistently exciting input signals, in order to visit many IO (or input-state-output) combinations. Principally, it turns the system identification problem into a direct controller identification problem. This paradigm shift can be thought of as a typical supervised machine learning problem, especially since VRFT has been used before with NN controllers [24][25][26][27]. To date, VRFT has been applied mainly to output-feedback error-based IO controllers, with few reported results with state-feedback control [26,27] but none with feedback control based on a virtual state constructed from IO data.
Both AI-VIRL and VRFT lend themselves to the reference model output tracking problem framework. It is therefore of interest to compare their learning capacity in terms of resources needed, achievable tracking performance, sensitivity to the exploration issue and type of approximators being used. In particular, the linear reference model output (LRMO) tracking control setting is advantageous, ensuring indirect state-feedback linearization of control systems. Such a linearity property of control systems is critical for higher-level learning paradigms such as Iterative Learning Control [28][29][30][31][32][33][34] and primitive-based learning [34][35][36][37][38][39][40], as representative hierarchical learning control paradigms [41][42][43][44].
The contributions of this work are:
• A new state representation for systems with unknown dynamics. A virtual state is constructed from historical input/output data samples, under observability assumptions.
• An original Virtual State Feedback Reference Tuning (VSFRT) neural controller tuning based on the new state representation. Stability certification is analyzed.
• Performance comparison of VSFRT and AI-VIRL data-driven neural controllers for LRMO tracking.
• Analysis of the transfer learning suitability for the VSFRT controller to provide initial admissible controllers for the iterative AI-VIRL process.
• Analysis of the impact of state representation dimensionality reduction upon the learning performance, using unsupervised machine learning tools such as principal component analysis (PCA) and autoencoders (AE).
Section 2 introduces the LRMO tracking problem formulation, while Section 3 proposes the VSFRT and AI-VIRL solution concepts. The two comprehensive validation case studies of Sections 4 and 5, respectively, validate this work's objectives. Conclusions are presented in the final Section 6.

The LRMO Tracking Problem
Let the dynamical discrete-time system be described by the state-space plus output model

s_{k+1} = f(s_k, u_k), y_k = g(s_k), (1)

where s_k = [s_{k,1}, ..., s_{k,n}]^T is the un-measurable state, u_k = [u_{k,1}, ..., u_{k,m_u}]^T is the control input and y_k = [y_{k,1}, ..., y_{k,p}]^T is the sensed output. The dynamics f, g of (1) are unknown but considered continuously differentiable (CD) maps. Additional assumptions about (1) require that it is IO observable and controllable. IO observability implies that, as in the case of the more well-known linear systems, the state can be fully recovered from present and past IO samples of the system. Given the observable system (1), IO samples u_k, y_k are employed to form a virtual state-space model having u_k as input and y_k as output (similarly to (1)), with a different state vector, according to [45]. The resulting virtual state-space model is

v_{k+1} = F(v_k, u_k), y_k = h(v_k), (2)

defined in terms of the virtual state vector v_k = [y_k^T, y_{k-1}^T, ..., y_{k-τ}^T, u_{k-1}^T, ..., u_{k-τ}^T]^T (the upper T means vector/matrix transposition). Model (2) has partially known dynamics (the unknown part stems from the unknown dynamics of (1)) and it is fully state observable [45]. Such transformations are well known for linear systems. Here, τ is correlated with the observability index and should be chosen empirically, since it cannot be established analytically due to the partially unknown dynamics. It should be selected as large as possible, accounting for the fact that, for a value larger than the true observability index, there is no more information gain in explaining the true state s_k through v_k [45]. The observability index is well known in linear systems theory (please consult Appendix A). Its minimal value K, for which τ ≥ K, ensures that the observability matrix has full column rank equal to the state dimension n, meaning that the state is fully observable from at least K past input samples and at least K past output samples. In light of the above remark, we define the unknown observability index of the nonlinear system (1) as the minimal value K for which any τ ≥ K ensures that s_k is observable from v_k. Transformations such as (2) can easily accommodate time delays in the input/state of (1) by properly introducing supplementary states. Such operations preserve the full observability of (2) [45]. IO controlling (2) is the same as IO controlling (1), since they have the same input and output, while any potential state-feedback control mapping the state to the control input would differ for (1) and (2), since s_k ∈ R^n and v_k ∈ R^{p(τ+1)+m_u·τ}. The control objective is aimed at shaping the IO behavior of (1) by indirectly controlling the IO behavior of (2) through state-feedback. This is achieved by the reference model output tracking framework, presented next.
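As a minimal sketch of the virtual state construction described above, the snippet below stacks the present and τ past outputs together with τ past inputs into one vector. The ordering (outputs first, then past inputs) is an assumption for illustration, since the excerpt does not fully fix it:

```python
import numpy as np

def virtual_state(u_hist, y_hist, k, tau):
    """Build the virtual state v_k = [y_k, ..., y_{k-tau}, u_{k-1}, ..., u_{k-tau}]
    from IO histories. u_hist has shape (N, m_u), y_hist has shape (N, p).
    Requires k >= tau so that all needed past samples exist."""
    ys = [y_hist[k - i] for i in range(tau + 1)]      # y_k, y_{k-1}, ..., y_{k-tau}
    us = [u_hist[k - i] for i in range(1, tau + 1)]   # u_{k-1}, ..., u_{k-tau}
    return np.concatenate(ys + us)

# dimension check: v_k lives in R^{p(tau+1) + m_u*tau}
p, m_u, tau, N = 2, 2, 3, 20
rng = np.random.default_rng(0)
y = rng.standard_normal((N, p))
u = rng.standard_normal((N, m_u))
v = virtual_state(u, y, k=10, tau=tau)   # dimension 2*(3+1) + 2*3 = 14
```

The dimension of the returned vector matches the p(τ+1) + m_u·τ count used in the text.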
A strictly causal linear reference model (LRM) is

s^m_{k+1} = G s^m_k + H r_k, y^m_k = L s^m_k, (3)

where s^m_k is the LRM state, r_k = [r_{k,1}, ..., r_{k,p}]^T is the reference input to the control system and y^m_k = [y^m_{k,1}, ..., y^m_{k,p}]^T is the LRM output. The LRM dynamics are known and characterized by the matrices G, H, L. Its linear pulse transfer matrix IO dependence can be established as y^m_k = T_LRM(q)r_k, where "q" is the pulse transfer time-based operator, analogous to the "z" operator.
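A short sketch of simulating such a strictly causal LRM follows. The matrices G, H, L here are hypothetical placeholders (the paper leaves them as design choices); the example only illustrates the state update and the unit DC gain one would typically design for:

```python
import numpy as np

# Hypothetical scalar LRM matrices for illustration (not from the paper).
G = np.array([[0.9]])
H = np.array([[0.1]])
L = np.array([[1.0]])

def lrm_step(sm, r):
    """One step of the strictly causal LRM: y^m_k = L s^m_k, s^m_{k+1} = G s^m_k + H r_k."""
    ym = L @ sm
    sm_next = G @ sm + H @ r
    return sm_next, ym

sm = np.zeros(1)
for _ in range(200):                  # unit-step reference r_k = 1
    sm, ym = lrm_step(sm, np.array([1.0]))
# steady-state output approaches L (I - G)^{-1} H = 1 (unit DC gain)
```

Strict causality shows up as y^m_k depending only on s^m_k, not on the current r_k.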
The LRMO tracking goal is to search for the control input which causes the output y_k of system (1) to track the output y^m_k of the LRM (3), given any reference input r_k. This objective is captured as an optimal control problem searching for the optimal input satisfying

u*_k = argmin_{u_k} V^∞_LRMO, V^∞_LRMO = Σ_{k=0}^{∞} ||y^m_k − y_k(u_k)||²_2, (4)

where the dependence of y_k on u_k is suggested. In the expression above, ||·||_2 measures the L_2 distance over vectors. It is assumed that a solution u*_k to (4) exists. The above control problem is a form of imitation learning where the LRM is the "expert" (or teacher, or supervisor) and the system (1) is a learner which must mimic the LRM's IO behavior. For the given problem, the dynamics of (3) must be designed beforehand; however, they may be unknown to the learner system (1).
In the subsequent section, the problem (4) is solved by learning two state-feedback closed-loop control solutions for the system (2). The implication is straightforward: if a state-feedback controller of the form u_k = C(v_k) is learned to control (2), then this control action can be set as the actual control input for the system (1), based on the feedback v_k built from present and past IO samples u_k, y_k of the system (1). Hence, learning control for (2) by solving (4) is the same as solving (4) for the underlying system (1). Notice the recurrence in v_k, which contains past values of the input u_k. Several observations concerning the LRM selection are mentioned. According to the classical control rules in model reference control [46], the LRM dynamics must be correlated with the bandwidth of the system (1). The time delay of (1) and its possible non-minimum-phase (NMP) character must be accounted for inside the LRM, as they should not be compensated. These knowledge requirements are satisfiable based on working experience with the system or from technical datasheets; however, they do not interfere with the "unknown dynamics" assumption. Since the virtual state-feedback control design is attempted based on the VSFRT principle, it is known from classical VRFT control that the NMP property of (1) requires special care; therefore, for simplification, it will be assumed that (1) is minimum-phase. IO data collection necessary for both the VSFRT and AI-VIRL designs requires that (1) is either open-loop stable or stabilized in closed-loop. It will be assumed further that (1) is open-loop stable, although closed-loop stabilization for IO sample collection can (and will also) be employed to reduce the collection phase duration and to enhance exploration.
The following section proposes and details the VSFRT and AI-VIRL for learning optimal control in (4), in virtual state feedback-based closed-form solutions.

Recapitulating VRFT for Error-Feedback IO Control
In the well-known VRFT for error-feedback control defined for linear mono-variable unknown systems [19], a database of IO samples DB = {u_k, y_k}, k = 0, N−1, is considered available after being collected from the base system (1). Either open- or closed-loop collection could be considered. Based on the VRFT principle, it is assumed that y_k is also the output of the given LRM conveyed by T_LRM(q). One can offline calculate, in a noncausal fashion, the virtual reference r̃_k = T^{−1}_LRM(q)y_k which, set as input to T_LRM(q), would render y_k at its output. The tilde means offline calculation. A virtual feedback error is next defined as ẽ_k = r̃_k − y_k. Selecting some prior linear controller transfer function structure C(ẽ_k, ϑ) (ϑ is the parameter vector) as a function of the virtual feedback error, the controller identification problem is defined as the minimization over ϑ of the cost

V^N_VR(ϑ) = (1/N) Σ_{k=0}^{N−1} ||u_k − C(ẽ_k, ϑ)||²_2,

meaning that the controller C(ẽ_k, ϑ) outputs u_k when driven by ẽ_k. This conceptually implies that the control system having the loop closed by C(ẽ_k, ϑ) produces the signals u_k and y_k when driven by r̃_k. This would eventually match the closed-loop with the LRM. VRFT circumvents direct controlled-system identification, hence it is model-free.
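The virtual reference computation r̃_k = T^{−1}_LRM(q)y_k can be sketched as below for a first-order LRM, which is an assumption made here purely for illustration. The inversion is offline and non-causal: each r̃_{k−1} is recovered from y_k and y_{k−1}:

```python
import numpy as np

# Assumed first-order LRM: y_k = a*y_{k-1} + (1-a)*r_{k-1}, i.e.
# T_LRM(q) = (1-a) q^{-1} / (1 - a q^{-1}).
a = 0.9

def virtual_reference(y):
    """Offline non-causal inversion: find r~ such that T_LRM(q) r~ = y,
    i.e. r~_{k-1} = (y_k - a*y_{k-1}) / (1 - a)."""
    r = np.empty(len(y) - 1)
    for k in range(1, len(y)):
        r[k - 1] = (y[k] - a * y[k - 1]) / (1 - a)
    return r

# sanity check: feeding r~ back through the LRM reproduces the measured y
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(50)) * 0.1
r = virtual_reference(y)
y_check = np.zeros_like(y)
y_check[0] = y[0]
for k in range(1, len(y)):
    y_check[k] = a * y_check[k - 1] + (1 - a) * r[k - 1]
```

The round-trip check mirrors the VRFT premise that r̃_k, applied to T_LRM(q), renders exactly the recorded y_k.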
The work [19] analyzed the approximate theoretical equivalence between V^N_VR(ϑ) and the LRM cost V^∞_LRMO from (4). A linear prefilter called "the L-filter" (sometimes denoted by M), with pulse transfer function L(q), was used to enhance this equivalence by replacing u_k, ẽ_k in V^N_VR(ϑ) with their filtered variants u^L_k = L(q)u_k, ẽ^L_k = L(q)ẽ_k. The VRFT extension to the multi-variable case was studied in [20], while its extension to the nonlinear system and nonlinear controller case has afterwards been exploited in works like [23][24][25][26][27], where the L-filter was dropped when richly parameterized controllers were used (e.g., NNs).

VSFRT-The Virtual State Feedback-Based VRFT Solution for the LRM Output Tracking
Following the rationale behind the classical model-free VRFT, a database of IO samples DB = {u_k, y_k}, k = 0, N−1, is considered available after being collected from the base system (1). It is irrelevant for the following discussion whether the samples were collected in open- or closed-loop. An input-state-output database is constructed as (u_k, ṽ_k, y_k), where ṽ_k is constructed from the historical data u_k, y_k. Based on the VRFT principle, it is assumed that y_k is also the output of the given LRM (characterized by T_LRM(q)). A non-causal filtering then allows for virtual reference calculation as r̃_k = T^{−1}_LRM(q)y_k. Similar to VRFT applied to error-feedback IO control, a virtual state-feedback reference tuning (VSFRT) controller is now searched for. This controller, denoted C, should make the LRM's output be tracked by the output of system (2), and it is identified to minimize the cost

V^N_VR(ϑ) = (1/N) Σ_{k=0}^{N−1} ||u_k − C(s̃^{ex−}_k, ϑ)||²_2, (5)

with s̃^{ex−}_k being an extended state regressor vector constructed from the virtual state ṽ_k of (2), and with ϑ a controller parameter vector.
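The controller identification step is ordinary supervised regression: pair each recorded input u_k with its regressor and minimize the squared mismatch. The paper uses an NN controller; the sketch below uses a linear-in-parameters controller fitted by least squares, which is an assumption made here only to show the principle behind (5) compactly:

```python
import numpy as np

def vsfrt_fit_linear(S, U):
    """Identify a linear virtual state-feedback controller u_k = Theta @ s_k by
    minimizing (1/N) * sum_k ||u_k - Theta s_k||^2, the least-squares analogue
    of the VSFRT cost. S: (N, n_s) regressor rows, U: (N, m_u) recorded inputs."""
    Theta, *_ = np.linalg.lstsq(S, U, rcond=None)
    return Theta.T                      # shape (m_u, n_s)

# sanity check on synthetic data generated by a known feedback gain
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 4))
K_true = np.array([[1.0, -0.5, 0.2, 0.0]])
U = S @ K_true.T
K_hat = vsfrt_fit_linear(S, U)          # recovers K_true on noise-free data
```

With a richly parameterized NN in place of the linear map, the same targets (in = regressor, t = u_k) drive backpropagation training, as the algorithm in the text describes.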
A VSFRT controller rendering V^N_VR(ϑ) = 0 should lead to V^∞_LRMO → 0. It is the VRFT principle which establishes the equivalence between V^N_VR(ϑ) and V^∞_LRMO(ϑ), being supported in practice by using richly parameterized controllers (such as NNs [25]), coupled with a wise selection of the LRM dynamics. A controller NN called C-NN will be employed in this framework, with ϑ capturing the NN trainable weights. The algorithm initializes the C-NN weights along with the index j = 1 counting the number of trainings, and then proceeds:
4. The C-NN is trained with input patterns in = s̃^{ex−}_k and output patterns t = u_k. This is equivalent to minimizing (5) w.r.t. ϑ.
5. If j < MaxTrain, set j = j + 1 and repeat from the 3rd Step; otherwise, finish the algorithm.
The above algorithm produces several trained neural controllers C-NN, with the best one selected based on some predefined criterion (minimal value of V^N_VR(ϑ) or minimal value of some tracking performance measure on a given test scenario). Stability of the closed-loop with the VSFRT C-NN is asserted, under the following assumptions [47]:
A1. The system (2) supports the equivalent IO description y_k = T(y_{k−1}, ..., y_{k−ny}, u_{k−1}, ..., u_{k−nu}), with ny, nu unknown system orders, and the nonlinear map T is invertible with respect to u: for a given y_k, the input u_{k−1} is computable as u_{k−1} = T^{−1}(y_k) (the "k−1" subscript in u vs. the "k" subscript in y formally suggests a strictly causal system). Additionally, the LRM (3) supports the IO recurrence y^m_k = T_LRM(y^m_{k−1}, ..., y^m_{k−nym}, r_{k−1}, ..., r_{k−nr}) with known orders nym, nr, and T_LRM is a linear invertible map with stable inverse, allowing one to find r_{k−1} = T^{−1}_LRM(y^m_k).
A2. The system (2) and the LRM (3) are formally representable under the mappings M, M^m, to simultaneously convey the IO and the input-state-output dependence in a single compact form. With no loss of generality, the representation indicates relative degree one between input and output. It is assumed that there are constants κ^m_Mx > 0, κ^m_Mr > 0 bounding the partial derivatives of M^m, where ||·|| is an appropriate matrix norm induced by the L_2 vector norm. The model inversion assumptions are natural for the state-space systems (2) and (3) being characterized by IO models. The above inequalities with upper bounds on the maps' partial derivatives are reasonable and commonly used in control. Moreover, let M, (M^v)^{−1} be continuously differentiable (CD) and of bounded derivative. The maps M, (M^v)^{−1} are CD since they emerge from the CD map F in (2), whose CD property is a consequence of the CD property of f, g from (1) [45]. Their bounded derivatives are reasonable assumptions for that matter.
A3. Let DB = {u_k, ṽ_k, y_k} ⊂ U × V × Y, k = 0, N−1, be a trajectory collected from the system (2) within the respective domains U, V, Y and with u_k being: (1) persistently exciting (PE), to make sure that y_k senses all system dynamics; (2) uniformly exploring the entire domain U × V × Y. The larger N, the better the exploration obtained.
A4. There exists a set of nonlinear parameterized state-feedback continuously differentiable controllers C(s^{ex−}_k, ϑ), a θ̂ for which û_k = C(ŝ^{ex−}_k, θ̂), and an ε > 0 bounding the fitting cost. The quantities û_k, v̂_k, ŷ_k would be collected under û_k = C(ŝ^{ex−}_k, θ̂) in closed-loop, as dictated by the evolution of the virtual signal r̃_{k−1} for a given θ̂. The bounded derivative condition for the controller is natural when smooth NN function approximators are used.

Theorem 1 [47]. Under assumptions A1-A4, there exists a finite κ > 0 such that V^N_LRMO(θ̂) < κε².

Proof. We introduce notation making s^{ex−}_k the state of a natural state-space model [47].

Since s^{ex−}_k does not contain the LRM state s^m_k, this discussion is deferred for now to Section 3.4. Then, following the rationale of the proof of Theorem 1 in the Appendix of [47], the proof of the current Theorem 1 follows. □

Corollary 1. The controller C(s^{ex−}_k, θ̂), where θ̂ is obtained by minimizing (5), is stabilizing for the system (2) in the uniformly ultimately bounded (UUB) sense.
Proof. When θ̂ is the value found to minimize V^N_VR(ϑ) from (5), it makes the first inequality in A4 hold for arbitrarily small ε > 0. From Theorem 1 it follows that ||ŷ_k − y_k||²_2 is bounded. Notice that y_k is bounded from the experimental collection phase and that ŷ_k is generated in closed-loop with C(ŝ^{ex−}_k, θ̂). However, the closed-loop that generates ŷ_k is driven by r̃_k obtained from y_k as r̃_{k−1} = T^{−1}_LRM(y_k). The PE condition on u_k makes y_k explore its domain well, which subsequently makes r̃_k explore its domain well; let this domain be called R_r. Based on the CD and bounded derivative properties of the maps, it means that, for any other values r̃_k ⊂ R_r, the term ||ŷ_k − y_k||²_2 is bounded. The UUB stability of the closed-loop follows. □

Remark 1. The output y_k in (8) is also the LRM's output, according to the initial VSFRT assumption; therefore, making V^N_VR(θ̂) < ε² is equivalent to making ε sufficiently small, i.e., ε → 0. Then V^N_LRMO(θ̂) < κε² can be made arbitrarily small. Since V^N_LRMO(θ̂) is the finite-length (and also the parameterized closed-form control) version of V^∞_LRMO from (4), it is expected that, for sufficiently large N, an equivalence holds between minimizing V^N_VR(θ̂) in (5) and minimizing V^∞_LRMO in (4).
Remark 2. The controller C(s^{ex−}_k, θ̂), which stabilizes (2) in the IO UUB sense, also stabilizes (1), since (1) has the same input and output as (2).

The AI-VIRL Solution for the LRM Output Tracking
Solving (4) with machine learning AI-VIRL requires an MDP formulation of the system dynamics. To proceed, the virtual system (2) and the LRM dynamics (3) are combined to form the extended virtual state-space system

s^{ex}_{k+1} = S(s^{ex}_k, u_k), (9)

where r_{k+1} = Γ(r_k) is any valid generative dynamical model of the reference input, herein modeled as a piecewise constant signal, i.e., r_{k+1} = r_k, and S is the extended system's nonlinear dynamics mapping. The dimension of s^{ex}_k is n_x = p(τ + 2) + m_u·τ + n_m. For the extended system (9) with partially unknown dynamics, an offline off-policy batch AI-VIRL algorithm will be used, which is also known by the name of model-free batch-fitted Q-learning. It relies on a database of transition samples collected from the extended system (9) and uses two function approximators: one to model the well-known action-state Q-function and another one modeling a parameterized dependence of the control on the state, i.e., u_k = C(s^{ex}_k, ϑ) with parameter vector ϑ. Commonly, NNs are used thanks to their well-developed training software and customizable architecture. The Q-function approximator is parameterized by a weight vector π. A major relaxation of model-free AI-VIRL w.r.t. other reinforcement learning algorithms is that it does not require an initial admissible (i.e., stabilizing) controller, but manages to converge to the optimal control, which must be stabilizing for the closed-loop control system.
To solve (4) and reach the optimal control, now expressed as a direct state dependence u*_k = C(s^{ex}_k, ϑ*) and also parameterized by ϑ, AI-VIRL relies on the database of transition samples (sometimes called experiences) DB = {(s^{ex}_k, u_k, s^{ex}_{k+1})}. These samples can be collected in different styles, as later pointed out in the case study, making AI-VIRL a batch offline off-policy approach. Randomly initializing the Q-function NN (Q-NN) weight vector π_0 and the controller NN weight vector ϑ_0, AI-VIRL alternates the Q-function weight vector update step (10) with the controller weight vector update step (11) until, e.g., no more changes occur in π_j or ϑ_j, implying convergence to ϑ*. With NN approximators for both the Q-function and the controller, solutions to (10) and (11) are embodied as the classical NN backpropagation-based training, recognizing the cost functions in (10) and (11) as the mean sum of squared errors. The NN training procedure requires gradient calculation w.r.t. the NN weights.
For solving (11), another trick is possible for a low-dimensional control input: (1) firstly, find the approximate minimizers u*_k by enumerating all combinations of discretized control values over a fine grid discretization of u's domain; (2) secondly, establish these minimizers as targets for the C-NN C(s^{ex}_k, ϑ), then gradient-train for the parameters ϑ of this NN.
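The alternation between the Q-function update and the greedy controller extraction can be sketched on a toy problem. Here tables stand in for the Q-NN and C-NN, and a tiny hand-made deterministic MDP stands in for the sampled extended system; both are assumptions for illustration only:

```python
import numpy as np

# Toy deterministic MDP: states 0..2, actions 0..1, state 2 is an absorbing
# goal with zero stage cost; every other (s, a) pair costs 1.
trans = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 0, (2, 0): 2, (2, 1): 2}
cost = {(s, a): (0.0 if s == 2 else 1.0) for (s, a) in trans}
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(100):
    # analogue of step (10): refit Q to the value-iteration targets
    # c(s,a) + gamma * min_u' Q(s', u') built from the transition samples
    Q = np.array([[cost[s, a] + gamma * Q[trans[s, a]].min() for a in (0, 1)]
                  for s in (0, 1, 2)])

# analogue of step (11): extract the greedy controller from the fitted Q
policy = Q.argmin(axis=1)
```

In the paper's setting the tabular refit is replaced by gradient training of the Q-NN on the same targets, and the greedy minimization over the action grid supplies the C-NN's training targets.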
The AI-VIRL algorithm is summarized next.

3. Train the C-NN with inputs in = s^{ex}_k and target outputs t = u*_k. This is equivalent to solving (11).
With all transition samples participating in AI-VIRL, this model-free Q-learning-like algorithm benefits from a form of experience replay, widely used in reinforcement learning. Under certain assumptions, convergence of the AI-VIRL C-NN to the optimal controller, which implies stability of the closed-loop, has been analyzed before in the literature [2,3,[6][7][8]11,12] and is not discussed here.

The Neural Transfer Learning Capacity
The AI-VIRL solution is a computationally expensive approach, while the VSFRT solution is one-shot and obtained much faster in terms of computing time. It is of interest to check whether AI-VIRL convergence is helped by initializing the controller with an admissible (i.e., stabilizing) one, e.g., one learned with VSFRT. This is coined transfer learning.
Notice that for VSFRT, the state regressor does not include the LRM state. We stress that "~" is only meaningful in the offline computation phase for VSFRT; outside this scope, and in the following, the state vectors are referred to as s^m_k and s^{ex−}_k. The case study will use controllers modeled as a three-layer NN with a linear input layer, nonlinear activation for a given number of neurons in the hidden layer, and linear output activation. Let the controller be u_k = C(s^{ex}_k, ϑ). A fully connected feedforward NN equation models each output as

u_{k,i} = Σ_{l=1}^{n_H} w^{out}_{l,i} Φ(Σ_j w^{in}_{j,l} s^{ex}_{k,j} + b^{in}_l) + b^{out}_i, (12)

where w^{in}_{j,l}, w^{out}_{l,i}, b^{in}_l, b^{out}_i are the input layer weights, output layer weights, input layer bias weights and output layer bias weights, respectively. Φ is a given differentiable nonlinear activation function (e.g., tanh, logsig, ReLU, etc.) and n_H is the number of hidden layer neurons (or nodes). The parameter vector ϑ gathers all trainable weights of the NN.
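A forward pass of such a three-layer controller can be sketched directly from the description above (linear input layer, one nonlinear hidden layer, linear output layer); tanh is assumed here as the activation:

```python
import numpy as np

def cnn_forward(s, W_in, b_in, W_out, b_out, phi=np.tanh):
    """Three-layer feedforward controller:
    u_i = sum_l W_out[i, l] * phi(W_in[l, :] @ s + b_in[l]) + b_out[i]."""
    h = phi(W_in @ s + b_in)        # hidden layer with n_H nodes
    return W_out @ h + b_out        # linear output layer

# shapes: n_s state inputs, n_H hidden nodes, m_u control outputs
n_s, n_H, m_u = 5, 8, 2
rng = np.random.default_rng(1)
W_in = rng.standard_normal((n_H, n_s))
b_in = rng.standard_normal(n_H)
W_out = rng.standard_normal((m_u, n_H))
b_out = rng.standard_normal(m_u)
u = cnn_forward(rng.standard_normal(n_s), W_in, b_in, W_out, b_out)
```

The trainable parameter vector ϑ of the text corresponds to the flattened (W_in, b_in, W_out, b_out).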
The controller transfer learning from VSFRT to AI-VIRL starts by observing that s^{ex}_k extends s^{ex−}_k with the LRM state s^m_k. Then, all the input weights and biases from (12) corresponding to the inputs s^m_k are set to zero, while the other weights and biases are copied from the VSFRT NN controller. The learned VSFRT controller is thus transferred as an initialization for the AI-VIRL controller.
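The weight-copying step can be sketched as follows, assuming (for illustration) that the extra s^m_k inputs are appended after the VSFRT inputs in the AI-VIRL state vector. Zeroing their input-weight columns makes the initialized AI-VIRL controller reproduce the VSFRT controller exactly:

```python
import numpy as np

def transfer_vsfrt_to_aivirl(W_in_vsfrt, b_in, W_out, b_out, n_sm):
    """Initialize the AI-VIRL C-NN from a trained VSFRT C-NN. The AI-VIRL state
    has n_sm extra LRM-state inputs unseen by VSFRT; their input-weight columns
    are zeroed, all other weights and biases are copied verbatim."""
    n_H = W_in_vsfrt.shape[0]
    W_in_ai = np.hstack([W_in_vsfrt, np.zeros((n_H, n_sm))])  # new s^m_k columns -> 0
    return W_in_ai, b_in.copy(), W_out.copy(), b_out.copy()

# equivalence check: the extended controller ignores the s^m_k inputs initially
n_H, n_vs, n_sm = 4, 6, 3
rng = np.random.default_rng(2)
W = rng.standard_normal((n_H, n_vs))
b = rng.standard_normal(n_H)
Wo = rng.standard_normal((1, n_H))
bo = rng.standard_normal(1)
W_ai, b_ai, Wo_ai, bo_ai = transfer_vsfrt_to_aivirl(W, b, Wo, bo, n_sm)
s = rng.standard_normal(n_vs)
sm = rng.standard_normal(n_sm)
pre_ext = W_ai @ np.concatenate([s, sm]) + b_ai   # hidden pre-activations, extended state
pre_org = W @ s + b                               # hidden pre-activations, original state
```

The zeroed columns guarantee identical hidden pre-activations, hence identical control outputs, at the start of AI-VIRL training.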

First Validation Case Study
A two-axis motion control system stemming from an aerial model is subjected to control learning via VSFRT and AI-VIRL. It is nonlinear and coupled and allows for vertical and horizontal positioning, being described by the model (13), where S^1_{−1}(·) means a saturating function on [−1, 1], and u_1, u_2 are the horizontal motion control input and the vertical motion control input, respectively. s_{a,3} (rad) is the vertical angle, the other states being described in [48]. The nonlinear static maps M_p(s_{p,1}), F_p(s_{p,1}), M_a(s_{a,1}), F_a(s_{a,1}) are fitted polynomials obtained for s_{p,1}, s_{a,1} ∈ (−4000; 4000) [48].
A zero-order hold on the inputs, combined with a sampler on the outputs of (13), leads to an equivalent discrete-time model of relative degree one, suitable for IO data collection and control. The system's unknown dynamics will not be employed herein for learning control.
The objective of problem (4) is here translated to finding the controller which makes the system's outputs track the outputs of the LRM given as T_LRM(q) = diag(T_1(q), T_2(q)), with T_1(q), T_2(q) the discrete-time variants of T_1(s) = T_2(s) = 1/(τs + 1), τ = 3, obtained for the sampling interval ∆ = 0.1 s.
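The discretization of each first-order LRM channel can be sketched as below, assuming a zero-order-hold equivalent (the paper does not state which discretization it uses). For T(s) = 1/(τs + 1) with sampling interval dt, the ZOH equivalent is y_{k+1} = a·y_k + (1 − a)·r_k with a = exp(−dt/τ):

```python
import math

# ZOH discretization of T(s) = 1/(tau*s + 1) with sampling interval dt
tau, dt = 3.0, 0.1
a = math.exp(-dt / tau)            # pole of the discrete-time LRM channel

y = 0.0
for _ in range(1000):              # unit-step response of the discrete channel
    y = a * y + (1 - a) * 1.0      # y_k -> 1, matching the unit DC gain T(0) = 1
```

Both diagonal channels of T_LRM(q) use the same a, since T_1(s) = T_2(s) here.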

IO Data Collected in Closed-Loop
An initial linear MIMO IO controller is first found based on an IO variant of VRFT; the linear diagonal controller learned with IO VRFT is given in (14). For exploration enhancement, the reference inputs have their amplitudes uniformly randomly distributed in r_{k,1} ∈ [−2; 2], r_{k,2} ∈ [−1.4; 1.4] and they present as piecewise constant sequences of 3 s and 7 s, respectively. Every 5th sampling instant, a uniform random number in [−1; 1] is added to the system inputs u_{k,1}, u_{k,2}. Then r_k drives the LRM. With controller (14) in closed-loop over the output feedback error, u_k = C(q)(r_k − y_k), IO data {u_k, y_k} is collected for 2000 s, as rendered in Figure 1 (only for the first 200 s).
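The exploration signals described above (piecewise-constant random references plus a periodic input dither) can be sketched as follows; the helper name and its exact segment handling are illustrative assumptions:

```python
import numpy as np

dt = 0.1
rng = np.random.default_rng(0)

def piecewise_constant(T_hold, amp, n, dt=dt, rng=rng):
    """Piecewise-constant reference: draw a fresh uniform amplitude in
    [-amp, amp] every T_hold seconds and hold it in between."""
    hold = int(round(T_hold / dt))
    n_seg = -(-n // hold)                      # ceiling division
    levels = rng.uniform(-amp, amp, n_seg)
    return np.repeat(levels, hold)[:n]

n = 20000                                       # 2000 s of data at dt = 0.1 s
r1 = piecewise_constant(3.0, 2.0, n)            # r_{k,1} in [-2, 2], 3 s holds
r2 = piecewise_constant(7.0, 1.4, n)            # r_{k,2} in [-1.4, 1.4], 7 s holds
# uniform dither in [-1, 1] added to the plant inputs every 5th sampling instant
dither = np.where(np.arange(n) % 5 == 0, rng.uniform(-1, 1, n), 0.0)
```

The references r1, r2 drive the LRM and the closed-loop, while the dither perturbs u_{k,1}, u_{k,2} directly to enrich the visited IO combinations.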

Learning the VSFRT Controller from Closed-Loop IO Data

Using the collected output data y_k and the given reference model T_LRM(q), the virtual reference input is computed as r̃_k = T^{−1}_LRM(q)y_k. Afterwards, the extended state s̃^{ex−}_k is built from the IO database {u_k, y_k} and from r̃_k. The C-NN is trained MaxTrain times, each time starting with reinitialized weights. Each trained C-NN was tested on the standard scenario shown in Figure 2 (only for the first 600 s), by measuring a normalized cost function V over N = 12,000 samples in 1200 s. The best C-NN (in terms of the minimal value of V per trial) is retained at each trial. One trial lasts about 30 min on a standard desktop computer, with all calculations performed on CPU. For a fair evaluation, the learning process is repeated for five trials. In Table 2, the value of V for the five trials is filled in and then averaged. The best C-NN, found in trial 3, has V = 0.0051332.

Learning the AI-VIRL Controller from Closed-Loop IO Data
In this case, the extended state used to learn the C-NN with AI-VIRL is built using the same dataset {u k , y k }, the states from the reference model s m k and the reference input r k . Before using the raw database, the transition samples at the time instants where r k+1 ≠ r k are to be deleted (the piecewise constant generative model r k+1 = r k is not a valid state-space transition model at the switching instants), and the resulting database DB is used for learning. The controller NN and the Q-function NN settings are depicted in Table 3. To find the Q-NN's minimizers for training the C-NN in Step 3 of the AI-VIRL algorithm, all possible input combinations u k,1 × u k,2 resulting from 19 discrete values in [−1; 1] for each input are enumerated, for each s ex[i] k ; the minimizing combination is set as the target for the C-NN, for the given input s ex[i] k , at the current algorithm iteration. Each AI-VIRL iteration produces a C-NN that is tested on the same standard test scenario used with VSFRT, by measuring a finite-time normalized c.f. V = 1/N • V N LRMO for N = 12,000 samples over 1200 s. AI-VIRL is iterated for MaxIter = 1000 times and all stabilizing controllers that improve on the previous ones on the standard test scenario are recorded. For a fair evaluation, AI-VIRL is also run for five trials.
Two cases are analyzed, to test the impact of the number of training epochs on the controller NN. In the first case, both the C-NN and the Q-NN are trained for at most 500 epochs. The best C-NN is found at iteration 822 of trial 3 and measures V = 0.0039271 on the standard test scenario shown in Figure 3 (only the first 600 s are shown). In the second case, the Q-NN is trained for at most 500 epochs and the C-NN for at most 100 epochs. The best C-NN is found at iteration 573 of the first trial and measures V = 0.0070567 on the standard test scenario shown in Figure 4 (only the first 600 s are shown).
The iterative learning process specific to AI-VIRL lasts approximately 6 h in the first case and about 4 h in the second case, on a standard desktop computer with all operations on the CPU. All measurements corresponding to the five trials in the two cases are filled in Table 2.
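The grid search over discretized inputs that generates C-NN training targets from the Q-function can be sketched as follows; here `q_fn` is a stand-in callable for the trained Q-NN, and the toy quadratic Q is only for the sanity check.

```python
import numpy as np

def greedy_targets(states, q_fn, n_grid=19, lo=-1.0, hi=1.0):
    """For each extended state, enumerate all (u1, u2) combinations on an
    n_grid x n_grid grid in [lo, hi]^2 and return the Q-minimizing input
    as the training target for the controller NN."""
    grid = np.linspace(lo, hi, n_grid)
    candidates = np.array([(u1, u2) for u1 in grid for u2 in grid])
    targets = []
    for s in states:
        q_vals = np.array([q_fn(s, u) for u in candidates])
        targets.append(candidates[np.argmin(q_vals)])
    return np.array(targets)

# toy quadratic Q whose exact minimizer over u is u* = -s (on-grid here)
q_toy = lambda s, u: float(np.sum((u + s) ** 2))
states = np.array([[1/3, -2/3], [1.0, 0.0]])
targets = greedy_targets(states, q_toy)
print(targets)
```

The exhaustive 19 × 19 enumeration is tractable for two inputs; for higher-dimensional action spaces a gradient-based minimization of the Q-NN would be needed instead.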
After learning the AI-VIRL control, the conclusion w.r.t. the VSFRT-learned control can be drawn: the VSFRT control is computationally cheaper to learn (one-shot instead of iterative), and the average tracking control performance is better with VSFRT (by about an order of magnitude).

Learning VSFRT and AI-VIRL Control Using IO Data Collected in Open-Loop
Next, we repeat the learning process for the two controllers (the VSFRT one and the AI-VIRL one), but the data used for learning are collected in open-loop, as depicted in Figure 5. For the sake of exploration, the system inputs have their amplitudes uniformly randomly distributed in u k,1 ∈ [−0.45; 0.6], u k,2 ∈ [−0.5; 0.4], and they present as piecewise constant sequences of 3 s and 7 s, respectively. Each of u k,1 , u k,2 is first filtered through the lowpass dynamics 1/(s+1). Then, every 5th sampling instant, a uniform random number in [−1; 1] is added to the filtered u k,1 , u k,2 . A crucial difference w.r.t. the closed-loop collection scenario is that the reference inputs drive only the LRM; since there is no controller in closed loop, the reference inputs r k and the LRM's outputs y m k = T LRM (q)r k evolve as correlated with each other but independently of the system outputs, which were excited by u k . For exploration, the reference inputs have their amplitudes uniformly randomly distributed in r k,1 ∈ [−6; 6], r k,2 ∈ [−1; 2], and they are also piecewise constant sequences of 10 s and 15 s, respectively. We underline that, for VSFRT, r k and y m k are not needed, as they are automatically calculated offline by the VSFRT principle. However, AI-VIRL uses them for forming the extended state-space model.
It is clear when comparing Figure 1 with Figure 5 that the open-loop data collection in the uncontrolled environment does not satisfactorily explore the input-state space (there is significantly less variation in the outputs). This may affect learning convergence and the final LRM tracking performance.
The extended states for VSFRT and AI-VIRL and their NNs settings are kept the same as in the closed-loop IO collection scenario.
For a fair evaluation, the learning process is repeated five times for all the cases previously described. Table 4 lists the value of V for each trial, together with the average value. The best VSFRT controller, obtained in trial 4, measures V = 0.0043692 on the test scenario from Figure 6, a value close to the minimal value obtained in the closed-loop IO data collection scenario. The AI-VIRL controllers' performance is obviously inferior: the C-NN with the minimal V in the case when it is trained for at most 100 epochs is found at iteration 15 of the second trial and measures V = 0.24 (tracking is shown in Figure 7, for the first 600 s only); the C-NN with the minimal V (best tracking) in the case when trained for at most 500 epochs is found at iteration 132 of the fifth trial and measures V = 0.20267 on the standard test scenario shown in Figure 8, for the first 600 s only. The conclusion is that the poor exploration prevents AI-VIRL from converging to a good controller, whereas the VSFRT controller is learned to about the same LRM tracking performance. VSFRT learns better control (a V about two orders of magnitude smaller than with AI-VIRL) while being less computationally demanding, despite the poor exploration.

Testing the Transfer Learning Advantage
It is checked whether initializing the AI-VIRL algorithm with an admissible VSFRT controller helps improve the learning convergence, using the transfer learning capability. An initial admissible controller is not, however, mandatory for AI-VIRL. Since both the VSFRT and the AI-VIRL controllers aim at solving the same LRMO tracking control problem, it is expected that the VSFRT controller initializes AI-VIRL closer to the optimal controller.
The VSFRT controller from the closed-loop IO data collection case is transferred, since this case has proved good LRMO tracking for both the VSFRT and the AI-VIRL controllers. Two cases were again analyzed: when the C-NN of the AI-VIRL is trained for 100 epochs at each iteration, and when it is trained for 500 epochs at each iteration. In both cases, the Q-NN has its weights randomly initialized at each trial.
The results unveil that transfer learning does not speed up the AI-VIRL convergence, nor does it result in better AI-VIRL controllers after the trial ends. The reason is that the learning process taking place in the high-dimensional space spanned by the C-NN weights remains uncontrollable directly; it is controllable only indirectly, through hyper-parameter settings such as the database exploration quality and size, the training overfitting prevention mechanism and an adequate NN architecture setup.

VSFRT and AI-VIRL Performance under State Dimensionality Reduction with PCA and Autoencoders
State representation is a notorious problem in approximate dynamic programming and reinforcement learning. In the observability-based framework, the state comprises past IO samples, which are time-correlated, meaning that the desired uniform coverage of the state space is significantly affected. Intuitively, the more recent IO samples in the virtual state better reflect the actual system state, whereas the older IO samples characterize it less.
Analyzing the state representation for VSFRT and for AI-VIRL, two sources of correlation appear. First, time-correlation, since the virtual state is constructed from past IO samples. Second, correlation between the independent reference inputs (used in closed-loop IO data collection) and the LRM's states and outputs, as well as the system's inputs and outputs. It is therefore justified to strive for state dimensionality reduction.
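A minimal sketch of building such a virtual state from past IO samples is given below, for a SISO channel with a window of n samples; the exact component ordering and window length used in the paper may differ, so the layout s k = [y k , ..., y k−n+1 , u k−1 , ..., u k−n ] is only one common choice.

```python
import numpy as np

def virtual_state(u, y, n):
    """Stack the last n outputs and the last n inputs into a virtual state
    s_k = [y_k, ..., y_{k-n+1}, u_{k-1}, ..., u_{k-n}] for each valid k.
    Window layout is an assumption, not the paper's exact definition."""
    u, y = np.asarray(u, float), np.asarray(y, float)
    states = []
    for k in range(n, len(y)):
        past_y = y[k - n + 1 : k + 1][::-1]   # y_k ... y_{k-n+1}
        past_u = u[k - n : k][::-1]           # u_{k-1} ... u_{k-n}
        states.append(np.concatenate([past_y, past_u]))
    return np.array(states)

u = np.arange(10.0)          # toy input sequence 0..9
y = np.arange(10.0, 20.0)    # toy output sequence 10..19
S = virtual_state(u, y, n=3)
print(S.shape)   # (7, 6)
print(S[0])      # [13. 12. 11.  2.  1.  0.]
```

The time-correlation discussed above is visible directly in the rows of S: consecutive rows share 2n − 2 of their 2n entries.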
One of the popular unsupervised machine learning tools for dimensionality reduction is principal component analysis (PCA). Thanks to its offline nature, it can be used only with offline learning schemes, such as VSFRT and AI-VIRL. Let the state vector s k recordings be arranged in a matrix S, where the columns correspond to the state components and the row number is the record time index k. The state s k can be either sex− k for VSFRT or s ex k for AI-VIRL. Let the empirical estimate of the covariance of S be obtained after centering the columns of S to zero mean, and let the square matrix V contain its eigenvectors on each column, arranged from left to right in descending order of the eigenvalues' amplitudes. The number n P of principal components counts the first n P leftmost columns of V, which are the principal components. Let this leftmost slicing of V be V L . The reduced represented state is then calculated as s red = (V L ) T s.
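The PCA reduction just described can be sketched in plain numpy as follows; the rank-2 test matrix S is synthetic, used only to check that two components capture nearly all of the variation.

```python
import numpy as np

def pca_reduce(S, n_p):
    """PCA state reduction: center the columns of S (rows = time index k),
    eigendecompose the empirical covariance, keep the n_p leading
    eigenvectors V_L, and map each state as s_red = V_L^T s."""
    S_c = S - S.mean(axis=0)
    cov = np.cov(S_c, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending eigenvalues
    V_L = eigvecs[:, order[:n_p]]
    explained = eigvals[order[:n_p]].sum() / eigvals.sum()
    return S_c @ V_L, V_L, explained

rng = np.random.default_rng(1)
latent = rng.standard_normal((500, 2))
mix = rng.standard_normal((2, 6))
S = latent @ mix + 0.01 * rng.standard_normal((500, 6))  # rank-2 + tiny noise
S_red, V_L, explained = pca_reduce(S, n_p=2)
print(S_red.shape, explained > 0.99)
```

The `explained` ratio plays the role of the 98.32%/98.51% figures reported in the text for the first four and six components.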
The state dimensionality reduction effect on the tracking performance is tested both for VSFRT and for AI-VIRL. The used database is the one from the closed-loop collection scenario, since it offers a learning chance to both the VSFRT and the AI-VIRL controllers.
For VSFRT, the dimension of the state sex− k is 12, whereas for AI-VIRL, the dimension of s ex k is 14. For VSFRT, the first four principal components explain 98.32% of the data variation, while for AI-VIRL, the first six components explain 98.51%, confirming the existing high correlations between the state vector components. This leads to reduced C-NN architecture sizes of 4-6-2 for VSFRT and 6-6-2 for AI-VIRL. For AI-VIRL, the case with at most 100 epochs for training the C-NN was employed. The variation explained in the data is shown for the first four principal components only for VSFRT, in Figure 9.
The best learned C-NNs with VSFRT and with AI-VIRL measure V = 0.1653 and V = 1.3488, respectively, on the test scenario. The VSFRT controller performs well only on controlling y k,1 , while poorly on controlling y k,2 , whereas the AI-VIRL control is unexploitable on either of the two axes. The conclusion is that the exploration issue is exacerbated by the state dimensionality reduction, although learning still takes place to some extent, even when the data variation is explainable by a reduced number of principal components, due to the many apparent correlations.
Under the state information loss incurred when performing dimensionality reduction, no improvement in the best tracking performance is expected. However, learning still occurs, which encourages using dimensionality reduction as a trade-off for reducing the learning computational effort, by reducing the state space size and subsequently the NN architecture size.
A standard fully-connected single-hidden-layer feedforward autoencoder (AE) is next used to test the dimensionality reduction and its effect on the learning performance, in a different unsupervised machine learning paradigm. The details of the autoencoder are: six hidden neurons, which are the encoder's outputs and also the number of reduced features; sigmoidal activation functions from the input-to-hidden layer as well as from the hidden-to-output layer; number of training epochs set to at most 500. The training cost function is of the form T MSSE + c 1 T L 2 + c 2 T sparse , where T MSSE penalizes the mean summed squared deviation of the AE outputs from the same inputs, T L 2 is the L 2 weight regularization term responsible for limiting the AE weights' amplitude, and the T sparse term encourages sparsity at the AE hidden layer's output. T sparse measures the Kullback-Leibler divergence of the averaged hidden layer output activation values with respect to a desirable value, set as 0.15 in this case study, while the other parameters in the cost are set as c 1 = 0.004, c 2 = 4. The AE's targeted output is the copied input.
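The AE training cost described above can be written out as follows; this is an illustrative re-implementation of the three terms (MSSE, L 2 and KL-divergence sparsity), not the exact formula of any particular toolbox.

```python
import numpy as np

def ae_cost(x, x_hat, h, weights, c1=0.004, c2=4.0, rho=0.15):
    """T_MSSE + c1*T_L2 + c2*T_sparse for a sparse autoencoder:
    x     : inputs (= targets), one row per sample
    x_hat : AE reconstructions
    h     : hidden-layer activations in (0, 1), one row per sample
    weights: list of AE weight matrices for the L2 term.
    Illustrative sketch; toolbox formulas may scale terms differently."""
    t_msse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    t_l2 = sum(np.sum(w ** 2) for w in weights)
    # mean activation of each hidden unit, clipped away from 0 and 1
    rho_hat = np.clip(h.mean(axis=0), 1e-8, 1 - 1e-8)
    # KL divergence of rho_hat from the target sparsity rho = 0.15
    t_sparse = np.sum(rho * np.log(rho / rho_hat)
                      + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return t_msse + c1 * t_l2 + c2 * t_sparse

x = np.array([[0.2, 0.4], [0.6, 0.8]])
h = np.full((2, 3), 0.15)            # hidden activations exactly at rho
cost = ae_cost(x, x, h, weights=[np.zeros((2, 3))])
print(np.isclose(cost, 0.0))         # perfect reconstruction, zero weights, rho_hat = rho
```

Each term vanishes independently, which is why the toy check with perfect reconstruction, zero weights and mean activation exactly at rho gives a zero cost.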
The AE-based dimensionality reduction is only applied to the VSFRT control learning.
Let the encoder map the input to the dimensionally-reduced feature sred k . The VSFRT C-NN training now uses the pairs ( sred k , u k ), with the same value MaxTrain = 200 and the same five-trial approach. The best C-NN learned with VSFRT measured V = 0.1838, which is on par with the best performance of the VSFRT C-NN learned with PCA reduction. The result unveils that no significant improvement is obtained using the AE-based reduction, which is expected due to the sensible information loss; hence, there is no preference for either PCA or AE for this purpose. It also confirms that no better performance is attainable. However, it is advised to use dimensionality reduction tools, since with a reduced virtual state, simpler (as in reduced number of parameters) NN architectures are usable, leading to less computational effort in the training, testing and real-world implementation phases.

Second Validation Case Study
A two-joint rigid planar robot arm serves in the following case study. Its dynamics are described in [49] by the model (16), where u = [u 1 , u 2 ] T are the motor input torques for the base joint and for the tip joint, respectively. Both inputs are limited inside the model, within their domains [−0.2; 0.2] Nm and [−0.1; 0.1] Nm, respectively, through the component-wise saturation function Sat(•). s = [s 1 , s 2 ] T , measured in radians (rad), represents the angles of the base and tip joints, respectively, while the joints' angular velocities, which are unmeasurable, are captured in the remaining states. A sample period ∆ = 0.05 s characterizes a zero-order hold applied on the inputs and on the outputs of (16), rendering it into an equivalent discrete-time model. It is intended to search for the control which makes the closed-loop match the continuous-time decoupled LRM model T LRM (s) = diag(1/(0.5s+1), 1/(0.2s+1)), which is subsequently transformed to the discrete-time observable canonical state-space form; T LRM (q) is the discrete-time counterpart of T LRM (s), calculated for the sample period ∆.
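The ZOH discretization of the decoupled first-order LRM channels at ∆ = 0.05 s admits a closed form, sketched below; `lrm_step` is a hypothetical helper name used only for the check.

```python
import numpy as np

# ZOH discretization of the decoupled LRM T_LRM(s) = diag(1/(0.5s+1), 1/(0.2s+1))
# at the sample period Delta = 0.05 s. For a first-order lag 1/(tau*s + 1), the
# exact zero-order-hold equivalent is y_{k+1} = a*y_k + (1-a)*r_k, a = exp(-Delta/tau).
Delta = 0.05

def discretize_first_order(tau, delta):
    a = np.exp(-delta / tau)
    return a, 1.0 - a                 # (discrete pole, input gain)

def lrm_step(y, r, a, b):
    """One step of a discrete LRM channel (hypothetical helper)."""
    return a * y + b * r

a1, b1 = discretize_first_order(0.5, Delta)
a2, b2 = discretize_first_order(0.2, Delta)
print(round(a1, 4), round(a2, 4))     # 0.9048 0.7788

# the unit-step response settles at 1, i.e. ZOH preserves the unit DC gain
y = 0.0
for _ in range(1000):
    y = lrm_step(y, 1.0, a1, b1)
print(np.isclose(y, 1.0))             # True
```

The same recursion, stacked per channel, is the discrete-time T LRM (q) that the learned closed loop is asked to match.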
For driving the resulting closed-loop control system, r k,1 is modeled as a sequence of piecewise constant values held for 2 s, with zero-mean random amplitudes of variance 0.6, and r k,2 as a sequence of piecewise constant values held for 1 s, with zero-mean random amplitudes of variance 0.65. For enhanced exploration, random additive disturbances are used on the inputs u k,1 , u k,2 : a uniform random number in [−3, 3] is added to the first input every 2nd sample, and a similarly distributed random number is added to the second input every 3rd sample. A total of 15,000 samples of r k,1 , r k,2 , u k,1 , u k,2 , y k,1 , y k,2 , y m k,1 , y m k,2 are collected. The collection is shown in Figure 10.
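The exploration signals described above (piecewise-constant random references plus periodic input dither) can be generated as in this sketch; the sample counts follow from ∆ = 0.05 s, and the helper name `piecewise_constant` is illustrative.

```python
import numpy as np

def piecewise_constant(n_samples, hold_samples, std, rng):
    """Zero-mean piecewise-constant exploration signal: a new random
    amplitude of standard deviation std is held for hold_samples steps."""
    n_levels = int(np.ceil(n_samples / hold_samples))
    levels = rng.normal(0.0, std, n_levels)
    return np.repeat(levels, hold_samples)[:n_samples]

rng = np.random.default_rng(42)
Delta = 0.05        # sample period in seconds
N = 15000
# r_{k,1}: held 2 s (40 samples), variance 0.6; r_{k,2}: held 1 s, variance 0.65
r1 = piecewise_constant(N, int(2.0 / Delta), np.sqrt(0.6), rng)
r2 = piecewise_constant(N, int(1.0 / Delta), np.sqrt(0.65), rng)
# additive input dither: uniform in [-3, 3] on u_1 every 2nd sample,
# and similarly on u_2 every 3rd sample (illustrative indexing)
d1 = np.zeros(N); d1[::2] = rng.uniform(-3, 3, len(d1[::2]))
d2 = np.zeros(N); d2[::3] = rng.uniform(-3, 3, len(d2[::3]))
print(len(r1), len(r2), np.all(r1[:40] == r1[0]))
```

The dither superimposed on the piecewise-constant profile is what keeps the closed-loop data persistently exciting between reference switches.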

Learning the VSFRT Controller from Closed-Loop IO Data
Using the collected output data y k and the given reference model T LRM (q), the virtual reference input is computed as rk = T −1 LRM (q)y k . Afterwards, the extended state is built from the IO database samples {u k , y k } and from rk , all components being lumped in sex− k . The resulting database of the form DB = ( sex− k , u k ) is used to learn the VSFRT NN controller according to the VSFRT algorithm. The C-NN settings are described in Table 5. Each learning trial trains the C-NN for MaxSteps = 200 times, starting with reinitialized weights. Each trained C-NN was tested in closed-loop on a scenario where the test reference inputs are identical to those used in Figure 11, by measuring a normalized c.f. V = 1/N • V N LRMO for N = 400 samples in 20 s. The best C-NN (in terms of the minimal value of V per trial) is retained at each trial. One trial lasts about 30 min on a standard desktop computer with all calculations performed on the CPU. For a fair evaluation, the learning process is repeated for five trials. In Table 6, the value of V for the five trials is filled and then averaged. About 95% of the learned controllers are stabilizing, in spite of the underlying nonlinear optimization specific to VSFRT, indirectly solved by C-NN training.
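One plausible reading of the normalized cost V = 1/N • V N LRMO is the mean squared tracking error between the system outputs and the LRM outputs, sketched below; the paper's exact definition of V N LRMO may weight the terms differently.

```python
import numpy as np

def tracking_cost(y, y_m):
    """Finite-time normalized LRMO tracking cost
    V = 1/N * sum_k ||y_k - y^m_k||^2, used to rank learned C-NNs.
    Assumed form; the paper's V_N^LRMO may differ in weighting."""
    y, y_m = np.asarray(y, float), np.asarray(y_m, float)
    return np.sum((y - y_m) ** 2) / len(y)

# toy check: a closer output trajectory scores a smaller V
y_m = np.ones((400, 2))          # N = 400 samples, two output channels
y_good = y_m + 0.01              # small constant offset on both channels
y_bad = y_m + 0.5
print(tracking_cost(y_good, y_m) < tracking_cost(y_bad, y_m))  # True
print(round(tracking_cost(y_good, y_m), 6))                    # 0.0002
```

A smaller V means the closed loop stays closer to the LRM output over the N test samples, which is exactly how the best C-NN per trial is selected.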

Learning the AI-VIRL Controller from Closed-Loop IO Data
The extended state used to learn the C-NN with AI-VIRL is built using the same database {u k , y k } collected for VSFRT, together with the states from the reference model s m k and the reference input r k . Before using the raw database, the transition samples at the time instants where r k+1 ≠ r k are to be excluded (the piecewise constant generative model r k+1 = r k is not a valid state-space transition model at the switching instants), and the resulting database DB is used for learning. The controller NN and the Q-function NN settings are depicted in Table 7. To find the Q-NN's minimizers for training the C-NN in Step 3 of the AI-VIRL algorithm, all possible input combinations u k,1 × u k,2 resulting from 25 discrete values in [−2; 2] for each input are enumerated, for each s ex[i] k ; the minimizing combination is set as the target for the C-NN, for the given input s ex[i] k , at the current algorithm iteration. Each AI-VIRL iteration produces a C-NN that is tested on the same standard test scenario used with VSFRT, by measuring the c.f. V for N = 400 samples over 20 s. AI-VIRL is iterated MaxIter = 200 times and all stabilizing controllers that improve on the previous ones on the standard test scenario are recorded. For a fair evaluation, AI-VIRL is also run for five trials, and the best measurement of each trial (together with the average value over the trials) is filled in Table 6.
After learning the AI-VIRL controllers, the conclusion based on Table 6 is clear: VSFRT is again better than AI-VIRL, in spite of being computationally less demanding and one-shot. The learning took place on the same database, in a controlled environment where good exploration was attainable.

Learning VSFRT and AI-VIRL Control Based on IO Data Collected in Open-Loop
The robotic arm acts as an open-loop integrator on each joint, hence it is marginally stable. The open-loop collection is driven by zero-mean impulse inputs u k,1 , u k,2 . For exploration's sake, the reference inputs have their amplitudes uniformly randomly distributed in r k,1 ∈ [−1.5; 1.5], r k,2 ∈ [−2; 2], and they present as sequences of piecewise constant signals lasting 2.5 s and 2 s, respectively. The difference now w.r.t. the closed-loop collection scenario is that the reference inputs r k drive only the LRM; since there is no controller in closed loop, the reference inputs and the LRM's outputs y m k = T LRM (q)r k evolve independently of the system's outputs y k , which were driven by u k . The open-loop collection is captured in Figure 12, from where it is clear that y k,1 , y k,2 do not often intersect with r k,1 , r k,2 and with y m k,1 , y m k,2 , respectively. The exact same learning settings from the closed-loop case were used for VSFRT and for AI-VIRL. Notice that r k,1 , r k,2 , y m k,1 and y m k,2 are not used for learning control with VSFRT, since they do not enter the extended state sex− k . Five learning trials are executed both for VSFRT and for AI-VIRL, with the LRMO tracking performance measure V recorded in Table 8. The best learned VSFRT and AI-VIRL controllers are shown performing on the standard tracking test scenario in Figure 11. There is a dramatic difference in favor of VSFRT's superior tracking performance, in spite of the poorer input-state space exploration. As expected, AI-VIRL's convergence to a good controller is affected by the poor exploration, and its performance is not better than in the closed-loop collection case. Concluding, the superiority of VSFRT over AI-VIRL is again confirmed, both in terms of reduced computational complexity/time and in terms of tracking performance.

Conclusions
Learning controllers from IO data to achieve high LRM output tracking performance has been validated for both VSFRT and AI-VIRL. Learning takes place in the space of the new virtual state representation built from historical IO samples, in order to address observable systems. The resulting control is in fact learned to be applied to the original underlying system. Surprisingly, in the two illustrated case studies, using the same IO data and the same LRM, VSFRT showed no worse tracking performance than AI-VIRL, despite its significantly smaller computation demands, while in the cases of poor exploration, VSFRT was clearly superior.
Aside from the performance comparisons, several additional studies were conducted. The transfer learning opportunity, in which VSFRT provides an initial admissible controller for AI-VIRL, showed no advantage. While transfer learning could theoretically accelerate the learning convergence to the optimal control, the result indicates that learning in the high-dimensional space spanned by the C-NN weights remains uncontrollable directly; it is controllable only indirectly, through hyper-parameter settings such as the database exploration quality and size, the training overfitting prevention mechanism and an adequate NN architecture setup. Since the resulting state representation will generally lead to large virtual state vectors, the impact of dimensionality reduction through standard unsupervised machine learning techniques such as principal component analysis and autoencoders was studied. The obtained results indicate that no significant improvement is obtained; hence, there is no preference for either PCA or AE for this purpose. They also confirm that no better tracking performance is attainable. However, it is advised to use dimensionality reduction, since with a reduced virtual state, simpler (in the sense of fewer parameters) NN architectures are used, leading to less computational effort in the training, testing and real-world implementation phases.
Note that sm k (computable from rk ) does not appear in sex− k , meaning that sex− k represents only a part of s ex k from (9). The reasons are twofold: (i) firstly, since s m k is correlated with y m k via the output equation y m k = Ls m k in (3), in the light of the VRFT principle s m k is redundant, because y k = y m k already appears in v k ; (ii) secondly, different from AI-VIRL, where the reference input generative model r k+1 = r k used in the learning and testing phases is the same as the model used in the transition sample collection phase, the VSFRT setting differs: the virtual reference rk is imposed by y k and is different from the reference used in the testing phase, implying that sm k , computable from rk , also has different versions in the learning and testing phases. This has been observed beforehand in the experimental case study and motivates the exclusion of sm k from sex− k .


Figure 4 .
Figure 4. AI-VIRL NN controller: (a) u k,1 ; (b) y k,1 (black), y m k,1 (red); (c) u k,2 ; (d) y k,2 (black), y m k,2 (red).

Figure 9 .
Figure 9. Proportion of state variation explained as function of the number of principal components, for VSFRT state input.



Figure 10 .
Figure 10. Closed-loop data collection: (a) u k,1 ; (b) y k,1 (black), y m k,1 (red); (c) u k,2 ; (d) y k,2 (black), y m k,2 (red).

Figure 11 .
Figure 11. VSFRT (black lines) and AI-VIRL (blue lines) learned from open-loop data, tested in closed-loop. y m k,1 (b) and y m k,2 (d) are in red. (a) and (c) show u k,1 and u k,2 , respectively.


Figure 12 .
Figure 12. Open-loop data collection: (a) u k,1 ; (b) y k,1 (black), y m k,1 (blue), r k,1 (red); (c) u k,2 ; (d) y k,2 (black), y m k,2 (blue), r k,2 (red).

Table 2 .
VSFRT and Approximate Iterative Value Iteration Reinforcement Learning (AI-VIRL) tracking performance when learning uses closed-loop IO data.

Table 4 .
VSFRT and AI-VIRL tracking performance when learning uses open-loop IO data.


Table 6 .
VSFRT and AI-VIRL tracking performance when learning uses closed-loop IO data.

Table 8 .
VSFRT and AI-VIRL tracking performance when learning uses open-loop IO data.