Article

Virtual State Feedback Reference Tuning and Value Iteration Reinforcement Learning for Unknown Observable Systems Control

by
Mircea-Bogdan Radac
* and
Anamaria-Ioana Borlea
Department of Automation and Applied Informatics, Politehnica University of Timisoara, 300223 Timisoara, Romania
*
Author to whom correspondence should be addressed.
Energies 2021, 14(4), 1006; https://doi.org/10.3390/en14041006
Submission received: 27 December 2020 / Revised: 10 February 2021 / Accepted: 12 February 2021 / Published: 15 February 2021
(This article belongs to the Special Issue Intelligent Control for Future Systems)

Abstract:
In this paper, a novel Virtual State-Feedback Reference Tuning (VSFRT) approach and Approximate Iterative Value Iteration Reinforcement Learning (AI-VIRL) are applied for learning linear reference model output (LRMO) tracking control of observable systems with unknown dynamics. For the observable system, a new state representation in terms of input/output (IO) data is derived. Consequently, the Virtual Reference Feedback Tuning (VRFT)-based solution is redefined to accommodate virtual state-feedback control, leading to an original, stability-certified VSFRT concept. Both VSFRT and AI-VIRL use neural network controllers. We find that AI-VIRL is significantly more computationally demanding and more sensitive to the exploration settings, while leading to inferior LRMO tracking performance compared to VSFRT. Nor is it helped by transfer learning, i.e., by using the VSFRT control as an initialization for AI-VIRL. State dimensionality reduction using machine learning techniques such as principal component analysis and autoencoders does not improve the best learned tracking performance; however, it reduces the learning complexity. Surprisingly, unlike AI-VIRL, VSFRT is one-shot (non-iterative) and learns stabilizing controllers even in poorly explored, open-loop environments, proving to be superior in learning LRMO tracking control. Validation on two nonlinear, coupled, multivariable complex systems serves as a comprehensive case study.

1. Introduction

Learning control from input/output (IO) system data is currently a significant research area. The idea stems from data-driven control research, where it is strongly believed that the gap between an identified system model and the true system is an important factor leading to control performance degradation.
Value Iteration (VI) is one popular approximate dynamic programming [1,2,3,4,5,6,7] and reinforcement learning [8,9,10,11,12,13] algorithm, together with Policy Iteration. The VI Reinforcement Learning (VIRL) algorithm comes in many implementation flavors: online or offline, off-policy or on-policy, batch-wise or adaptive-wise, with known or unknown system dynamics. In this work, the class of offline off-policy VIRL for unknown dynamical systems is adopted, based on neural network (NN) function approximators, hence it will be coined Approximate Iterative VIRL (AI-VIRL). For this model-free offline off-policy learning variant, a database of transition samples (or experiences) is required to learn the optimal control. Most practical implementations are of the actor-critic type, where function approximators (most often NNs) are used to approximate the cost function and the controller, respectively.
One of the crucial aspects of AI-VIRL convergence is appropriate exploration, translated to visiting as many state-action combinations as possible while uniformly covering the state-action domains. The exploration aspect is especially problematic with general nonlinear systems, whereas for linear systems, linearly parameterized function approximators for the cost function and for the controller relieve this issue to some extent, owing to their better generalization capacity. Good exploration is not easily achieved in uncontrolled environments, and usually, pre-existing stabilizing controllers can increase the exploration quality significantly. An additional reason for using a pre-stabilizing controller is that, in mechatronic systems, an uncontrolled environment commonly implies instability, under which dangerous conditions could lead to physical damage. This is different from virtual environments such as video games [14] (or even simulated mechatronic systems), where instability leads to an episode (or simulation) termination but physical damage is not a threat.
Another issue with reinforcement learning algorithms such as AI-VIRL is the state representation. Most often, in the case of unknown systems, the measured data from the system cannot be assumed to fully capture the system state, leading to the partial observability issues associated with reinforcement learning algorithms [15,16,17,18]. Therefore, new state representations are needed to ensure that learning takes place in a fully observable environment. A new state representation is proposed in this work based on the assumption that the controlled system is observable. A virtual state built from present and past IO samples is therefore introduced as an alias for the true state. The new virtual state-space representation is a fully observable system which allows controlling the original underlying system.
A similar approach to AI-VIRL for learning optimal control in offline off-policy mode is Virtual Reference Feedback Tuning (VRFT) [19,20,21,22,23,24,25]. It also relies on a database of (usually IO) samples collected from the system in a dedicated experimental interaction step. Traditionally, VRFT was proposed for output feedback error IO controllers and has not yet been sufficiently exploited for state-feedback control, certainly not for the virtual state-feedback control required by observable systems whose state cannot be measured directly. One contribution of this work is to propose for the first time such a model-free framework, called Virtual State-Feedback Reference Tuning (VSFRT), which learns control based on the feedback provided by the virtual state representation.
VSFRT also requires exploration of the controlled system dynamics using persistently exciting input signals, in order to visit many IO (or input-state-output) combinations. In principle, it turns the system identification problem into a direct controller identification problem. This paradigm shift can be thought of as a typical supervised machine learning problem, especially since VRFT has been used before with NN controllers [24,25,26,27]. To date, VRFT has been applied mainly to output-feedback error-based IO controllers, with few reported results on state-feedback control [26,27] and none on feedback control based on a virtual state constructed from IO data.
Both AI-VIRL and VRFT lend themselves to the reference model output tracking framework. It is therefore of interest to compare their learning capacity in terms of resources needed, achievable tracking performance, sensitivity to the exploration issue and the type of approximators being used. In particular, the linear reference model output (LRMO) tracking control setting is advantageous, ensuring indirect state-feedback linearization of control systems. Such a linearity property of control systems is critical for higher-level learning paradigms such as Iterative Learning Control [28,29,30,31,32,33,34] and primitive-based learning [34,35,36,37,38,39,40], as representative hierarchical learning control paradigms [41,42,43,44].
The contributions of this work are:
  • A new state representation for systems with unknown dynamics. A virtual state is constructed from historical input/output data samples, under observability assumptions.
  • An original Virtual State Feedback Reference Tuning (VSFRT) neural controller tuning based on the new state representation. Stability certification is analyzed.
  • Performance comparison of VSFRT and AI-VIRL data-driven neural controllers for LRMO tracking.
  • Analysis of the transfer learning suitability for the VSFRT controller to provide initial admissible controllers for the iterative AI-VIRL process.
  • Analysis of the impact of state representation dimensionality reduction on the learning performance, using unsupervised machine learning tools such as principal component analysis (PCA) and autoencoders (AE).
Section 2 introduces the LRMO tracking problem formulation, while Section 3 proposes the VSFRT and the AI-VIRL solution concepts. The two comprehensive validation case studies of Section 4 and Section 5, respectively, validate this work’s objectives. Conclusions are presented in Section 6.

2. The LRMO Tracking Problem

Let the dynamical discrete-time system be described by the state-space plus output model equation
s_{k+1} = f(s_k, u_k), \quad y_k = g(s_k),  (1)
where s_k = [s_{k,1}, ..., s_{k,n}]^T is the unmeasurable state, u_k = [u_{k,1}, ..., u_{k,m_u}]^T is the control input, and y_k = [y_{k,1}, ..., y_{k,p}]^T is the sensed output. The dynamics f, g of (1) are unknown but considered continuously differentiable (CD) maps. Additional assumptions about (1) require that it is IO observable and controllable. IO observability implies that, as in the more familiar linear case, the state can be fully recovered from present and past IO samples of the system. Given the observable system (1), the IO samples u_k, y_k are employed to form a virtual state-space model having u_k as input and y_k as output (similarly to (1)), but with a different state vector, according to [45]. The resulting virtual state-space model is:
v_{k+1} = F(v_k, u_k), \quad y_k = v_{k,1},  (2)
defined in terms of the virtual state vector v_k = [Y_{k,k−τ}^T, U_{k−1,k−τ}^T]^T ≜ [v_{k,1}, v_{k,2}, ..., v_{k,2τ+1}] ∈ R^{p(τ+1)+m_u τ}, with Y_{k,k−τ} = [y_k^T ... y_{k−τ}^T]^T ≜ [v_{k,1}^T ... v_{k,τ+1}^T]^T and U_{k−1,k−τ} = [u_{k−1}^T ... u_{k−τ}^T]^T ≜ [v_{k,τ+2}^T ... v_{k,2τ+1}^T]^T (superscript T denotes vector/matrix transposition). Model (2) has partially known dynamics (the unknown part stems from the unknown dynamics of (1)) and it is fully state observable [45]. Such transformations are well known for linear systems. Here, τ is correlated with the observability index and must be chosen empirically, since it cannot be established analytically due to the partially unknown dynamics. It should be selected as large as possible, accounting for the fact that, for a value larger than the true observability index, there is no further information gain in explaining the true state s_k through v_k [45]. The observability index is well known in linear systems theory (please consult Appendix A): its minimal value K for which τ ≥ K ensures that the observability matrix has full column rank equal to the state dimension n, meaning that the state is fully observable from at least K past input samples and at least K past output samples. In the light of this remark, we define the unknown observability index of the nonlinear system (1) as the minimal value K for which any τ ≥ K ensures that s_k is observable from Y_{k,k−τ}, U_{k−1,k−τ}. Transformations such as (2) can easily accommodate time delays in the input/state of (1) by properly introducing supplementary states; such operations preserve the full observability of (2) [45].
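As an illustration only, a minimal Python sketch of assembling the virtual state v_k from recorded IO samples is given below; the helper name and the NumPy-based storage are assumptions for illustration and are not part of the original method description.

```python
import numpy as np

def virtual_state(y_hist, u_hist, k, tau):
    """Builds v_k = [y_k, ..., y_{k-tau}, u_{k-1}, ..., u_{k-tau}]^T
    from recorded output samples y_hist[k] (shape (p,)) and
    input samples u_hist[k] (shape (m_u,)). Assumes k >= tau."""
    outputs = [y_hist[k - i] for i in range(tau + 1)]      # y_k ... y_{k-tau}
    inputs = [u_hist[k - i] for i in range(1, tau + 1)]    # u_{k-1} ... u_{k-tau}
    return np.concatenate(outputs + inputs)                # dimension p*(tau+1) + m_u*tau
```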
Controlling the IO behavior of (2) is the same as controlling the IO behavior of (1), since they share the same input and output, whereas any state-feedback control mapping the state to the control input would differ between (1) and (2), since s_k ∈ R^n and v_k ∈ R^{p(τ+1)+m_u τ}. The control objective is to shape the IO behavior of (1) by indirectly controlling the IO behavior of (2) through state feedback. This is achieved by the reference model output tracking framework, presented next.
A strictly causal linear reference model (LRM) is:
s^m_{k+1} = G s^m_k + H r_k, \quad y^m_k = L s^m_k,  (3)
where the LRM state is s^m_k = [s^m_{k,1}, ..., s^m_{k,n_m}]^T, the LRM input r_k = [r_{k,1}, ..., r_{k,p}]^T will serve as the reference input to the control system, and the LRM output is y^m_k = [y^m_{k,1}, ..., y^m_{k,p}]^T. The LRM dynamics are known and characterized by the matrices G, H, L. Its linear pulse transfer matrix IO dependence can be written as y^m_k = T_{LRM}(q) r_k, where "q" is the time-domain pulse transfer operator, analogous to the "z" operator.
The LRMO tracking goal is to search for the control input which causes the output y_k of system (1) to track the output y^m_k of the LRM (3), given any reference input r_k. This objective is captured as an optimal control problem searching for the optimal input satisfying
u_k^* = \arg\min_{u_k} V_{LRMO}, \quad V_{LRMO} = \sum_{k=0}^{\infty} \| y_k(u_k) - y^m_k \|_2^2, \quad \text{s.t. } (1), (3),  (4)
where the dependence of y_k on u_k is made explicit. In the expression above, ||·||_2 denotes the L2 vector norm. It is assumed that a solution u_k^* for (4) exists. The above control problem is a form of imitation learning where the LRM is the "expert" (or teacher or supervisor) and the system (1) is a learner which must mimic the LRM's IO behavior. For the given problem, the dynamics of (3) must be designed beforehand; however, they may be unknown to the learner system (1).
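A minimal sketch of the finite-horizon, normalized version of this cost (the performance measure V = (1/N) V_LRMO^N used later in the case studies) could be, assuming the trajectories are stored as NumPy arrays:

```python
import numpy as np

def lrmo_cost(y, y_m):
    """Normalized finite-horizon LRMO tracking cost:
    V = (1/N) * sum_k ||y_k - y^m_k||_2^2,
    with y, y_m arrays of shape (N, p)."""
    return float(np.mean(np.sum((y - y_m) ** 2, axis=1)))
```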
In the subsequent section, the problem (4) is solved by learning two state-feedback closed-loop control solutions for the system (2). The implication is straightforward. If a state-feedback controller of the form u_k = C(v_k) is learned to control (2), then this control action can be set as the actual control input for the system (1), based on the feedback v_k built from present and past IO samples u_k, y_k of the system (1). Hence, learning control for (2) by solving (4) is the same as solving (4) for the underlying system (1). Notice the recurrence in u_k = C(v_k), since v_k includes u_{k−1}, ..., u_{k−τ}.
Several observations concerning the LRM selection are in order. According to the classical rules in model reference control [46], the LRM dynamics must be correlated with the bandwidth of the system (1). The time delay of (1) and its possible non-minimum-phase (NMP) character must be accounted for inside the LRM, as they should not be compensated. These knowledge requirements are satisfiable based on working experience with the system or from technical datasheets, and they do not interfere with the "unknown dynamics" assumption. Since the virtual state-feedback control design is attempted based on the VSFRT principle, it is known from classical VRFT control that the NMP property of (1) requires special care; therefore, for simplification, it will be assumed that (1) is minimum-phase. IO data collection necessary for both the VSFRT and the AI-VIRL designs requires that (1) is either open-loop stable or stabilized in closed loop. It will be assumed further that (1) is open-loop stable, although closed-loop stabilization for IO sample collection can (and will) also be employed to reduce the collection phase duration and to enhance exploration.
The following section proposes and details the VSFRT and AI-VIRL approaches for learning the optimal control in (4), as virtual state-feedback closed-form solutions.

3. LRM Output Tracking Problem Solution

3.1. Recapitulating VRFT for Error-Feedback IO Control

In the well-known VRFT for error-feedback control, defined for linear mono-variable unknown systems [19], a database of IO samples collected from the base system (1) is considered available; let this database be DB = {(u_k, y_k)}, k = 0, ..., N−1. Either open- or closed-loop collection could be considered. Based on the VRFT principle, it is assumed that y_k is also the output of the given LRM conveyed by T_{LRM}(q). One can offline calculate, in a noncausal fashion, the virtual reference r̃_k = T_{LRM}^{-1}(q) y_k which, set as input to T_{LRM}(q), would render y_k at its output. The tilde denotes offline calculation. A virtual feedback error is next defined as ẽ_k = r̃_k − y_k. Selecting some prior linear controller transfer function structure C(ẽ_k, ϑ) (ϑ is the parameter vector) as a function of the virtual feedback error, the controller identification problem is defined as the minimization over ϑ of the cost V_{VR}^N(ϑ) = (1/N) Σ_{k=0}^{N−1} ||u_k − C(ẽ_k, ϑ)||_2^2, meaning that the controller C(ẽ_k, ϑ) outputs u_k when driven by ẽ_k. This conceptually implies that the control system having the loop closed by C(ẽ_k, ϑ) produces the signals u_k and y_k when driven by r̃_k, which would eventually match the closed loop with the LRM. VRFT circumvents direct controlled-system identification, hence it is model-free.
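A minimal sketch of this classical VRFT recipe for a scalar, first-order LRM is given below; the first-order inverse model, the FIR controller structure and all function names are illustrative assumptions (the L-filter discussed next is omitted).

```python
import numpy as np

def virtual_signals(y, a):
    """For a first-order LRM  y^m_{k+1} = a*y^m_k + (1-a)*r_k,
    invert it noncausally on the recorded output y (shape (N,)):
    r_tilde_k = (y_{k+1} - a*y_k) / (1 - a),  e_tilde_k = r_tilde_k - y_k.
    The last sample is dropped since y_{N} is unavailable."""
    r_tilde = (y[1:] - a * y[:-1]) / (1.0 - a)
    e_tilde = r_tilde - y[:-1]
    return r_tilde, e_tilde

def fit_linear_controller(e_tilde, u, n_theta=3):
    """Identify a linear error-feedback controller u_k ~ theta^T [e_k, e_{k-1}, ...]
    by least squares on the virtual error, i.e., minimize the VRFT cost V_VR^N."""
    N = len(e_tilde)
    Phi = np.column_stack([np.concatenate([np.zeros(i), e_tilde[:N - i]])
                           for i in range(n_theta)])
    theta, *_ = np.linalg.lstsq(Phi, u[:N], rcond=None)
    return theta
```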
The work [19] analyzed the approximate theoretical equivalence between V_{VR}^N(ϑ) and the LRM cost V_{LRMO} = Σ_{k=0}^{∞} ||y_k(u_k) − y^m_k||_2^2 from (4). A linear prefilter called "the L-filter" (sometimes denoted by M), with pulse transfer function L(q), was used to enhance this equivalence by replacing u_k, ẽ_k in V_{VR}^N(ϑ) with their filtered variants u_k^L = L(q) u_k, ẽ_k^L = L(q) ẽ_k. The VRFT extension to the multi-variable case was studied in [20], while its extension to the nonlinear system and nonlinear controller case has afterwards been exploited in works such as [23,24,25,26,27], where the L-filter was dropped when richly parameterized controllers (e.g., NNs) were used.

3.2. VSFRT—The Virtual State Feedback-Based VRFT Solution for the LRM Output Tracking

Following the rationale behind the classical model-free VRFT, a database of IO samples collected from the base system (1) is considered available; let it be denoted DB = {(u_k, y_k)}, k = 0, ..., N−1. It is irrelevant for the following discussion whether the samples were collected in open or closed loop. An input-state-output database {(u_k, ṽ_k, y_k)} is constructed, where ṽ_k is built from the historical data u_k, y_k. Based on the VRFT principle, it is assumed that y_k is also the output of the given LRM (characterized by T_{LRM}(q)). A noncausal filtering then allows the virtual reference calculation r̃_k = T_{LRM}^{-1}(q) y_k. Similarly to VRFT applied for error-feedback IO control, a virtual state-feedback reference tuning (VSFRT) controller is now searched for. This controller, denoted C, should make the LRM's output be tracked by the output of system (2), and it is identified to minimize the cost
V_{VR}^N(\vartheta) = \frac{1}{N} \sum_{k=0}^{N-1} \| u_k - C(\tilde{s}_k^{ex}, \vartheta) \|_2^2  (5)
with s̃_k^{ex} = [(ṽ_k)^T (r̃_k)^T]^T being an extended state regressor vector constructed from the virtual state ṽ_k of (2), and with a controller parameter vector ϑ.
A VSFRT controller rendering V_{VR}^N(ϑ) = 0 should lead to V_{LRMO} ≈ 0. It is the VRFT principle which establishes the equivalence between V_{VR}^N(ϑ) and V_{LRMO}(ϑ), supported in practice by using richly parameterized controllers (such as NNs [25]) coupled with a wise selection of the LRM dynamics. A controller NN, called C-NN, will be employed in this framework, with ϑ capturing the trainable NN weights. It is straightforward to regard V_{VR}^N(ϑ) as the mean sum of squared errors (MSSE) criterion for training the C-NN, with input pattern {s̃_k^{ex}} and output pattern {u_k}.
The model-free VSFRT algorithm is presented below.
  • The IO database DB = {(u_k, y_k)}, k = 0, ..., N−1 is first collected, then ṽ_k is built. Both r̃_k and s̃_k^{ex} are computed.
  • A C-NN is parameterized by properly selecting an NN architecture type together with its training details. The maximum number of trainings MaxTrain is set, along with the index j = 1 counting the number of trainings.
  • The NN weight vector ϑ is initialized (e.g., randomly).
  • The C-NN is trained with input patterns in = {s̃_k^{ex}} and output patterns t = {u_k}. This is equivalent to minimizing (5) w.r.t. ϑ.
  • If j < MaxTrain, set j = j + 1 and repeat from Step 3; otherwise finish the algorithm.
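A minimal sketch of the training loop above (Steps 2–5) is shown next, using scikit-learn's MLPRegressor as an illustrative C-NN; the architecture, training settings and the selection of the best controller by fitting cost are assumptions, since the actual NN settings are given later in the case-study tables.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def vsfrt_learn(S_ex, U, max_train=200, hidden=(6,)):
    """Repeatedly train a C-NN on input pattern {s_ex_k} and output pattern {u_k}
    (the MSSE reading of Eq. (5)), re-initializing the weights each time and
    keeping the controller with the smallest fitting cost.
    S_ex: (N, n_ex) extended virtual states, U: (N, m_u) recorded inputs."""
    best_net, best_cost = None, np.inf
    for j in range(max_train):
        net = MLPRegressor(hidden_layer_sizes=hidden, activation='tanh',
                           max_iter=500, random_state=j)
        net.fit(S_ex, U)
        cost = np.mean(np.sum((U - net.predict(S_ex).reshape(U.shape)) ** 2, axis=1))
        if cost < best_cost:
            best_net, best_cost = net, cost
    return best_net, best_cost
```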
The above algorithm produces several trained neural controllers C-NN with parameters ϑ, i.e., C(s̃_k^{ex}, ϑ). The best one is selected based on some predefined criterion (the minimal value of V_{VR}^N(ϑ) or the minimal value of some tracking performance measure on a given test scenario). Stability of the closed loop with the VSFRT C-NN is asserted under the following assumptions [47]:
A1. The system (2) supports the equivalent IO description y_k = T(y_{k−1}, ..., y_{k−n_y}, u_{k−1}, ..., u_{k−n_u}), with unknown system orders n_y, n_u, and the nonlinear map T is invertible with respect to u: for a given y_k, the input is computable as u_{k−1} = T^{-1}(y_k) (the "k−1" subscript in u vs. the "k" subscript in y formally indicates a strictly causal system). Additionally, the LRM (3) supports the IO recurrence y^m_k = T_{LRM}(y^m_{k−1}, ..., y^m_{k−n_{ym}}, r_{k−1}, ..., r_{k−n_r}) with known orders n_{ym}, n_r, and T_{LRM} is a linear invertible map with stable inverse, allowing one to find r_{k−1} = T_{LRM}^{-1}(y^m_k).
A2. The system (2) and the LRM (3) are formally representable as y_k = M(v_k, u_{k−1}) and y^m_k = M^m(s^m_k, r_{k−1}), to simultaneously convey the IO and the input-state-output dependence in a single compact form, under the mappings M, M^m. With no loss of generality, this representation indicates a relative degree of one between input and output. Assume the map M is invertible, with v_k, u_{k−1} computable from y_k as v_k = (M^v)^{-1}(y_k), u_{k−1} = (M^u)^{-1}(y_k). Also assume that M^m is a CD invertible map such that s^m_k, r_{k−1} are computable from y^m_k as s^m_k = (M^m_s)^{-1}(y^m_k), r_{k−1} = (M^m_r)^{-1}(y^m_k), and that there are constants κ_{M^m_x} > 0, κ_{M^m_r} > 0 fulfilling ||∂M^m(s^m_k, r_{k−1})/∂s^m_k|| < κ_{M^m_x} and ||∂M^m(s^m_k, r_{k−1})/∂r_{k−1}|| < κ_{M^m_r}, where ||·|| is an appropriate matrix norm induced by the L2 vector norm. The model inversion assumptions are natural for the state-space systems (2) and (3) being characterized by IO models. Practically, the idea behind v_k = (M^v)^{-1}(y_k) is as follows: for a certain y_k in (2), one can calculate u_{k−1} = T^{-1}(y_k), then generate v_{k+1} = F(v_k, u_k) based on (2). The above inequalities with upper bounds on the maps' partial derivatives are reasonable and commonly used in control. Moreover, let M, (M^v)^{-1} be continuously differentiable (CD) and of bounded derivative, satisfying
\left\| \frac{\partial M(v_k, u_{k-1})}{\partial v_k} \right\| < \kappa_{M_v}, \quad \left\| \frac{\partial M(v_k, u_{k-1})}{\partial u_{k-1}} \right\| < \kappa_{M_u}, \quad \left\| \frac{\partial (M^v)^{-1}(y_k)}{\partial y_k} \right\| < \kappa_{M_y}, \quad 0 < \kappa_{M_v} \kappa_{M_y} < 1.  (6)
The maps M, (M^v)^{-1} are CD since they emerge from the CD map F in (2), whose continuous differentiability is a consequence of the CD property of f, g from (1) [45]. Their bounded derivatives are reasonable assumptions.
A3. Let DB = {(u_k, ṽ_k, y_k)} ⊂ U × V × Y, k = 0, ..., N−1 be a trajectory collected from the system (2) within the respective domains U, V, Y, with u_k being: (1) persistently exciting (PE), to make sure that y_k senses all system dynamics; (2) uniformly exploring the entire domain U × V × Y. The larger N, the better the obtained exploration.
A4. There exists a set of nonlinear parameterized continuously differentiable state-feedback controllers {C(s_k^{ex}, ϑ)}, a ϑ̂ for which û_k = C(ŝ_k^{ex}, ϑ̂), and an ε > 0 for which
V_{VR}^N(\hat{\vartheta}) = \sum_{k=0}^{N-1} \| u_k - C(\hat{s}_k^{ex}, \hat{\vartheta}) \|_2^2 < \varepsilon^2, \quad \left\| \frac{\partial C(\hat{s}_k^{ex}, \vartheta)}{\partial \hat{s}_k^{ex}} \right\| < \kappa_{c_s},  (7)
where ŝ_k^{ex} = [(v̂_k)^T (r̃_k)^T]^T and r̃_{k−1} = (M^m_r)^{-1}(y_k). The quantities {û_k, v̂_k, ŷ_k} would be collected under û_k = C(ŝ_k^{ex}, ϑ̂) in closed loop, as dictated by the evolution of the virtual signal r̃_{k−1}, for a given ϑ̂. The bounded derivative condition for the controller is natural when smooth NN function approximators are used.
Theorem 1.
[47]: Under assumptions A1–A4, there exists a finite κ > 0 such that
V_{LRMO}^N(\hat{\vartheta}) \triangleq \sum_{k=1}^{N} \| \hat{y}_k(\hat{\vartheta}) - y_k \|_2^2 = \sum_{k=1}^{N} \| \Delta y_k \|_2^2 < \kappa \varepsilon^2.  (8)
Proof. 
We introduce the notation s̃_k^{ex} = [(ṽ_k)^T (r̃_k)^T]^T and make the notational equivalences s̃_k^{ex} ↔ ζ̃_k, ŝ_k^{ex} ↔ ζ̂_k, ṽ_k ↔ x̃_k, v̂_k ↔ x̂_k. Note that v_k is the state of the virtual state-space model (2), different from x_k in [47], which is the state of a natural state-space model. Additionally, s̃_k^{ex} does not contain the LRM state s^m_k; this discussion is deferred to Section 3.4. Then, following the rationale of the proof of Theorem 1 in the Appendix of [47], the proof of the current Theorem 1 follows.         □
Corollary 1.
The controller C(s_k^{ex}, ϑ̂), where ϑ̂ is obtained by minimizing (5), is stabilizing for the system (2) in the uniformly ultimately bounded (UUB) sense.
Proof. 
When ϑ̂ is the value found to minimize V_{VR}^N(ϑ) from (5), it makes the first inequality in A4 hold for arbitrarily small ε > 0. From Theorem 1 it follows that ||ŷ_k − y_k||_2^2 is bounded. Notice that y_k is bounded from the experimental collection phase and that ŷ_k is generated in closed loop with C(ŝ_k^{ex}, ϑ̂). However, the closed loop that generates ŷ_k is driven by r̃_k obtained from y_k as r̃_{k−1} = T_{LRM}^{-1}(y_k). The PE condition on u_k makes y_k explore its domain well, which subsequently makes r̃_k explore its domain well; let this domain be called R_r. Based on the CD and bounded-derivative properties of the maps, it follows that, for any other values r̃_k ∈ R_r, the term ||ŷ_k − y_k||_2^2 is bounded. The UUB stability of the closed loop follows.         □
Remark 1.
The output y_k in V_{LRMO}^N(ϑ̂) from (8) is also the LRM's output, according to the initial VSFRT assumption, therefore V_{LRMO}^N(ϑ̂) = Σ_{k=1}^{N} ||ŷ_k(ϑ̂) − y^m_k||_2^2 < κ ε^2. Additionally, a good controller identification rendering a small V_{VR}^N(ϑ̂) = Σ_{k=0}^{N−1} ||u_k − C(ŝ_k^{ex}, ϑ̂)||_2^2 < ε^2 is equivalent to making ε sufficiently small, i.e., ε → 0. Then V_{LRMO}^N(ϑ̂) < κ ε^2 can be made arbitrarily small. Since V_{LRMO}^N(ϑ̂) is the finite-length (and also the parameterized closed-form control) version of V_{LRMO} from (4), it is expected that, for sufficiently large N, an equivalence holds between minimizing V_{VR}^N(ϑ̂) in (5) and minimizing V_{LRMO} in (4).
Remark 2.
The controller C(s_k^{ex}, ϑ̂) which stabilizes (2) in the IO UUB sense also stabilizes (1), since (1) has the same input and output as (2).

3.3. The AI-VIRL Solution for the LRM Output Tracking

Solving (4) with the machine learning AI-VIRL approach requires a Markov decision process (MDP) formulation of the system dynamics. To proceed, the virtual system (2) and the LRM dynamics (3) are combined to form the extended virtual state-space system
s_{k+1}^{ex} = \begin{pmatrix} v_{k+1} \\ r_{k+1} \\ s_{k+1}^m \end{pmatrix} = \begin{pmatrix} F(v_k, u_k) \\ \Gamma(r_k) \\ G s_k^m + H r_k \end{pmatrix} = S(s_k^{ex}, u_k), \quad y_k = v_{k,1}, \quad y_k^m = L s_k^m,  (9)
where r_{k+1} = Γ(r_k) is any valid generative dynamical model of the reference input, herein modeled as a piecewise constant signal, i.e., r_{k+1} = r_k, and S is the extended-system nonlinear mapping. The dimension of s_k^{ex} is n_x = p(τ+2) + m_u τ + n_m.
For the extended system (9) with partially unknown dynamics, an offline off-policy batch AI-VIRL algorithm will be used, also known as model-free batch-fitted Q-learning. It relies on a database of transition samples collected from the extended system (9) and uses two function approximators: one to model the well-known state-action Q-function and another one modeling a parameterized dependence of the control on the state, i.e., u_k = C(s_k^{ex}, ϑ) with parameter vector ϑ. Commonly, NNs are used thanks to their well-developed training software and customizable architecture. The Q-function approximator is parameterized as Q(s_k^{ex}, u_k, π). A major relaxation of model-free AI-VIRL w.r.t. other reinforcement learning algorithms is that it does not require an initial admissible (i.e., stabilizing) controller, yet it manages to converge to the optimal control, which must be stabilizing for the closed-loop control system.
To solve (4) and reach the optimal control, now expressed as a direct state dependence u_k^* = C(s_k^{ex}, ϑ^*) parameterized by ϑ, AI-VIRL relies on the database of transition samples (sometimes called experiences) DB = {(s_k^{ex}[1], u_k[1], s_{k+1}^{ex}[1]), ..., (s_k^{ex}[N], u_k[N], s_{k+1}^{ex}[N])}. These samples can be collected in different styles, as pointed out later in the case study, making AI-VIRL a batch offline off-policy approach. Randomly initializing the Q-function NN (Q-NN) weight vector π_0 and the controller NN weight vector ϑ_0, AI-VIRL alternates the Q-function weight vector update step
\pi_j = \arg\min_{\pi} \frac{1}{N} \sum_{i=1}^{N} \Big( Q(s_k^{ex}[i], u_k[i], \pi) - r(s_k^{ex}[i], u_k[i]) - Q(s_{k+1}^{ex}[i], C(s_{k+1}^{ex}[i], \vartheta_{j-1}), \pi_{j-1}) \Big)^2  (10)
with the controller weight vector update step
\vartheta_j = \arg\min_{\vartheta} \frac{1}{N} \sum_{i=1}^{N} Q(s_k^{ex}[i], C(s_k^{ex}[i], \vartheta), \pi_j),  (11)
until, e.g., no more changes occur in π_j or ϑ_j, implying convergence to ϑ^*. With NN approximators for both the Q-function and the controller, solutions to (10) and (11) are embodied as classical NN backpropagation-based training, recognizing the cost functions in (10) and (11) as mean sums of squared errors. The NN training procedure requires gradient calculation w.r.t. the NN weights.
For solving (11), another trick is possible for low-dimensional control inputs: (1) first, find the approximate minimizers u_k^{min}[i] of Q(s_k^{ex}[i], u, π_j) over u for all s_k^{ex}[i], by enumerating all combinations of discretized control values over a fine grid discretization of u's domain; (2) second, establish these minimizers as targets for the C-NN C(s_k^{ex}[i], ϑ), then train this NN for the parameters ϑ with gradient-based methods.
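A minimal sketch of this grid-enumeration step for a two-input control is shown below; q_net is a hypothetical callable returning the scalar Q-value for a given state-action pair (in the case studies, 19 or 25 discrete values per input channel are used).

```python
import numpy as np

def greedy_action_grid(q_net, s_ex, grid_1, grid_2):
    """Approximately minimize Q(s_ex, u) over u by enumerating a Cartesian
    grid of discretized control values (practical only for low-dimensional u).
    grid_1, grid_2 are 1-D arrays of candidate values for each input channel."""
    best_u, best_q = None, np.inf
    for u1 in grid_1:
        for u2 in grid_2:
            u = np.array([u1, u2])
            q = q_net(s_ex, u)
            if q < best_q:
                best_u, best_q = u, q
    return best_u
```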
The AI-VIRL algorithm is summarized next.
  • Select the C-NN and Q-NN architectures and training settings. Initialize π_0, ϑ_0, the termination threshold ε, the maximum number of iterations MaxIter and the iteration index j = 1. Prepare the transition samples database DB = {(s_k^{ex}[i], u_k[i], s_{k+1}^{ex}[i])}, i = 1, ..., N.
  • Train the Q-NN with inputs in = {[s_k^{ex}[i]^T  u_k[i]^T]^T} and target outputs t = {r(s_k^{ex}[i], u_k[i]) + Q(s_{k+1}^{ex}[i], C(s_{k+1}^{ex}[i], ϑ_{j−1}), π_{j−1})}. This is equivalent to solving (10).
  • Train the C-NN with inputs in = {s_k^{ex}[i]} and target outputs t = {u_k^{min}[i]}. This is equivalent to solving (11).
  • If j < MaxIter and ||ϑ_j − ϑ_{j−1}||_2^2 > ε (with some threshold ε > 0), set j = j + 1 and go to Step 2; otherwise terminate the algorithm.
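A minimal, illustrative sketch of this batch fitted-Q iteration is given below; the function approximators (scikit-learn MLPRegressor), their sizes, the initialization by fitting zero targets and the omission of the weight-change stopping test are assumptions for brevity, not the exact implementation used in the case studies.

```python
import numpy as np
from itertools import product
from sklearn.neural_network import MLPRegressor

def ai_virl(S, U, S_next, stage_cost, grid_1, grid_2, max_iter=200):
    """Batch offline off-policy VI sketch (fitted Q-learning).
    S, U, S_next: transition samples (s_ex, u, s_ex') of shapes (N, n_x), (N, m_u), (N, n_x).
    stage_cost(S, U) returns the (N,) stage costs r(s_ex, u)."""
    N = S.shape[0]
    U_grid = np.array(list(product(grid_1, grid_2)))         # all discretized control combinations
    q_net = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh', max_iter=500)
    c_net = MLPRegressor(hidden_layer_sizes=(6,), activation='tanh', max_iter=500)
    q_net.fit(np.hstack([S, U]), np.zeros(N))                # arbitrary initialization of Q-NN
    c_net.fit(S, np.zeros_like(U))                           # arbitrary initialization of C-NN
    for j in range(max_iter):
        # Q-NN update, Eq. (10): targets = r + Q(s', C(s', theta_{j-1}), pi_{j-1})
        u_next = c_net.predict(S_next).reshape(U.shape)
        targets = stage_cost(S, U) + q_net.predict(np.hstack([S_next, u_next]))
        q_net.fit(np.hstack([S, U]), targets)
        # C-NN update, Eq. (11): greedy targets over the discretized control grid
        u_min = np.empty_like(U)
        for i in range(N):
            s_rep = np.repeat(S[i:i + 1], len(U_grid), axis=0)
            q_vals = q_net.predict(np.hstack([s_rep, U_grid]))
            u_min[i] = U_grid[np.argmin(q_vals)]
        c_net.fit(S, u_min)
    return c_net, q_net
```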
With all transition samples participating in AI-VIRL, this model-free Q-learning-like algorithm benefits from a form of experience replay, widely used in reinforcement learning. Under certain assumptions, convergence of the AI-VIRL C-NN to the optimal controller, which implies stability of the closed loop, has been analyzed before in the literature [2,3,6,7,8,11,12] and is not discussed here.

3.4. The Neural Transfer Learning Capacity

The AI-VIRL solution is computationally expensive, while the VSFRT solution is one-shot and obtained much faster in terms of computing time. It is of interest to check whether the AI-VIRL convergence is helped by initializing the controller with an admissible (i.e., stabilizing) one, e.g., learned with VSFRT. This is coined transfer learning.
Notice that, for VSFRT, s̃^m_k (computable from r̃_k) does not appear in s̃_k^{ex}, meaning that s̃_k^{ex} represents only a part of s_k^{ex} from (9). The reasons are twofold: (i) first, since s^m_k is correlated with y^m_k via the output equation y^m_k = L s^m_k in (3), in the light of the VRFT principle s^m_k is redundant, since y_k = y^m_k already appears in v_k; (ii) second, unlike AI-VIRL, where the reference input generative model r_{k+1} = r_k used in the learning and testing phases is the same as the model used in the transition samples collection phase, the VSFRT specifics differ: the virtual reference r̃_k is imposed by y_k and is different from the one used in the testing phase, implying that s̃^m_k computed from r̃_k also has different versions in the learning and testing phases. This has been observed beforehand in the experimental case study and motivates the exclusion of s̃^m_k from s̃_k^{ex}.
We stress that "~" is only meaningful in the offline computation phase of VSFRT; outside this scope and in the following, the state vectors are referred to as s^m_k and s_k^{ex}.
The case studies will use controllers modeled as a three-layer NN with a linear input layer, nonlinear activation for a given number of neurons in the hidden layer, and linear output activation. Let the controller be u_k = C(s_k^{ex}, ϑ). A fully connected feedforward NN models each output as
u_{k,i} = b_{out}^i + \sum_{j=1}^{n_H} W_{out}^{j,i} \, \sigma\Big( \sum_{l=1}^{n_x} W_{in}^{l,j} s_k^l + b_{in}^j \Big), \quad i = \overline{1, m_u},  (12)
where s_k^l is the l-th component of s_k^{ex} at time k, and W_in^{l,j}, W_out^{j,i}, b_in^j, b_out^i are the input layer weights, output layer weights, input layer bias weights and output layer bias weights, respectively. σ(·) is a given differentiable nonlinear activation function (e.g., tanh, logsig, ReLU) and n_H is the number of hidden layer neurons (nodes). The parameter vector ϑ gathers all trainable weights of the NN.
The controller transfer learning from VSFRT to AI-VIRL starts by observing that the AI-VIRL extended state stacks the VSFRT-type extended state and the LRM state, i.e., s_k^{ex} = [(s̄_k^{ex})^T (s^m_k)^T]^T, where s̄_k^{ex} denotes the VSFRT-type part. Then, all the input weights and biases from (12) corresponding to the inputs s^m_k are set to zero, while the other weights and biases are copied from the VSFRT NN controller. The learned VSFRT controller is thus transferred as an initialization for the AI-VIRL controller.
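A minimal sketch of this weight transfer for the single-hidden-layer controller of Eq. (12) is shown below; the array layouts and the assumption that the n_m LRM states are appended last in the AI-VIRL state ordering are illustrative choices.

```python
import numpy as np

def transfer_vsfrt_to_aivirl(W_in, b_in, W_out, b_out, n_m):
    """Initialize the AI-VIRL C-NN from a trained VSFRT C-NN.
    The AI-VIRL state appends the n_m LRM states s^m to the VSFRT extended state,
    so the input weights for those extra inputs are set to zero; all other
    weights and biases are copied unchanged.
    W_in: (n_ex, n_H) input weights, b_in: (n_H,), W_out: (n_H, m_u), b_out: (m_u,)."""
    n_ex, n_H = W_in.shape
    W_in_new = np.vstack([W_in, np.zeros((n_m, n_H))])   # zero weights for the s^m inputs
    return W_in_new, b_in.copy(), W_out.copy(), b_out.copy()
```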

4. First Validation Case Study

A two-axis motion control system stemming from an aerial model is subjected to control learning via VSFRT and AI-VIRL. It is nonlinear and coupled and allows for vertical and horizontal positioning, being described as
\begin{cases} \dot{s}_{a,1} = (S_{-1}^{1}(u_1) - M_h(s_{a,1}))/(2.7 \times 10^{-5}), \\ s_{a,2} = \tau/(23.8 \times 10^{-3} \cos^2 s_{p,3} + 3 \times 10^{-3}), \quad \dot{\tau} = 0.216\, F_a(s_{a,1}) \cos s_{p,3} - 0.058\, s_{a,2} + 0.0178\, S_{-1}^{1}(u_2) \cos s_{p,3}, \\ \dot{s}_{a,3} = s_{a,2}, \end{cases} \qquad \begin{cases} \dot{s}_{p,1} = (S_{-1}^{1}(u_2) - M_p(s_{p,1}))/(1.63 \times 10^{-4}), \\ \dot{s}_{p,2} = 33.33\,(0.2\, F_p(s_{p,1}) - 0.0127\, s_{p,2} - 0.0935 \sin s_{p,3} - 9.28 \times 10^{-6}\, s_{p,2} |s_{p,1}| + 4.17 \times 10^{-3}\, S_{-1}^{1}(u_1) - 0.05 \cos s_{p,3} - 0.021\, s_{a,2}^2 \sin s_{p,3} \cos s_{p,3} + 0.05), \\ \dot{s}_{p,3} = \Omega_p, \end{cases}  (13)
where S_{-1}^{1}(·) denotes the saturation function on [−1, 1], and u_1, u_2 are the horizontal and vertical motion control inputs, respectively. s_{a,3} (rad) = y_1 ∈ [−π, π] is the horizontal angle and s_{p,3} (rad) = y_2 ∈ [−π/2, π/2] is the vertical angle; the other states are described in [48]. The nonlinear static maps M_p(s_{p,1}), F_p(s_{p,1}), M_a(s_{a,1}), F_a(s_{a,1}) are fitted polynomials obtained for s_{p,1}, s_{a,1} ∈ (−4000; 4000) [48].
A zero-order hold on the inputs, combined with an output sampler applied to (13), leads to an equivalent discrete-time model of relative degree one, suitable for IO data collection and control. The system's unknown dynamics will not be employed herein for learning control.
The objective of problem (4) here translates to finding the controller which makes the system's outputs track the outputs of the LRM given as T_{LRM}(q) = diag(T(q), T(q)), with T(q) the discrete-time variant of T(s) = 1/(τs + 1), τ = 3 s, obtained for the sampling interval Δ = 0.1 s.

4.1. IO Data Collected in Closed-Loop

An initial linear MIMO IO controller is first found based on an IO variant of VRFT. The linear diagonal controller learned with IO VRFT is
C(q) = \frac{1}{D(q)} \begin{bmatrix} C_{11}(q) & 0 \\ 0 & C_{22}(q) \end{bmatrix}, \quad C_{11}(q) = 2.6674 - 5.3354 q^{-1} + 3.5730 q^{-2} - 0.0339 q^{-3} + 0.0706 q^{-4}, \quad C_{22}(q) = 0.5662 - 1.0491 q^{-1} + 0.4970 q^{-2}, \quad D(q) = 1 - q^{-1}.  (14)
For exploration enhancement, the reference inputs have their amplitudes uniformly randomly distributed in r_{k,1} ∈ [−2; 2], r_{k,2} ∈ [−1.4; 1.4], and they are piecewise constant sequences held for 3 s and 7 s, respectively. Every 5th sampling instant, a uniform random number in [−1; 1] is added to the system inputs u_{k,1}, u_{k,2}. Then r_k drives the LRM
\begin{cases} s_{k+1,1}^m = 0.9672\, s_{k,1}^m + 0.03278\, r_{k,1}, \\ s_{k+1,2}^m = 0.9672\, s_{k,2}^m + 0.03278\, r_{k,2}, \\ y_k^m = [y_{k,1}^m \;\; y_{k,2}^m]^T = [s_{k,1}^m \;\; s_{k,2}^m]^T. \end{cases}  (15)
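As a quick worked check, the discrete LRM coefficients above follow from the zero-order-hold discretization of T(s) = 1/(τs + 1) with τ = 3 s and Δ = 0.1 s, for which the pole is e^(−Δ/τ) and the gain is its complement:

```python
import numpy as np

# ZOH discretization of T(s) = 1/(tau*s + 1):  y^m_{k+1} = a*y^m_k + (1 - a)*r_k
tau, Delta = 3.0, 0.1
a = np.exp(-Delta / tau)
print(a, 1.0 - a)   # ~0.9672 and ~0.0328, matching the discrete LRM coefficients above
```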
With controller (14) in the loop closed over the output feedback error, u_k = C(q)(r_k − y_k), IO data {u_k, y_k} are collected for 2000 s, as rendered in Figure 1 (only the first 200 s are shown).

4.2. Learning the VSFRT Controller from Closed-Loop IO Data

Using the collected output data y_k and the given reference model T_{LRM}(q), the virtual reference input is computed as r̃_k = T_{LRM}^{-1}(q) y_k. Afterwards, the extended state is built from the IO database {u_k, y_k} and from r̃_k, lumped as s̃_k^{ex} = [y_{k,1}  y_{k,2}  ...  y_{k−3,1}  y_{k−3,2}  u_{k−1,1}  u_{k−1,2}  r̃_{k,1}  r̃_{k,2}]^T.
The extended state s̃_k^{ex} fills a database of the form DB = {(s̃_k^{ex}, u_k)}, which is used to learn the VSFRT NN controller with the VSFRT algorithm. The NN settings are described in Table 1. Each learning trial trains the C-NN MaxTrain = 200 times, each time starting with reinitialized weights. Each trained C-NN was tested on the standard scenario shown in Figure 2 (only the first 600 s), by measuring the normalized cost function V = (1/N) V_{LRMO}^N for N = 12,000 samples over 1200 s. The best C-NN (in terms of the minimal value of V per trial) is retained at each trial. One trial lasts about 30 min on a standard desktop computer, with all calculations performed on the CPU.
For a fair evaluation, the learning process is repeated for five trials. In Table 2, the value of V for each of the five trials is given, together with the average. The best C-NN, found in trial 3, has V = 0.0051332.

4.3. Learning the AI-VIRL Controller from Closed-Loop IO Data

In this case, the extended state used to learn the C-NN with AI-VIRL is built using the same dataset {u_k, y_k}, the reference model states s^m_k and the reference input r_k, as s_k^{ex} = [y_{k,1}  y_{k,2}  ...  y_{k−3,1}  y_{k−3,2}  u_{k−1,1}  u_{k−1,2}  r_{k,1}  r_{k,2}  s^m_{k,1}  s^m_{k,2}]^T.
Before using the raw database, the transition samples at time instants where r_k ≠ r_{k+1} are deleted (the piecewise constant reference model is not a valid state-space transition model at the switching instants), and the resulting database used for learning is DB = {(s_k^{ex}[i], u_k[i], s_{k+1}^{ex}[i])}, i = 1, ..., P, where P < N.
The controller NN and the Q-function NN settings are depicted in Table 3.
To find the Q-NN's minimizers for training the C-NN in Step 3 of the AI-VIRL algorithm, all possible input combinations u_{k,1} × u_{k,2} resulting from 19 discrete values in [−1; 1] for each input are enumerated, for each s_k^{ex}[i]. The minimizing combination is set as the target for the C-NN, for the given input s_k^{ex}[i], at the current algorithm iteration.
Each AI-VIRL iteration produces a C-NN that is tested on the same standard test scenario used with VSFRT, by measuring the finite-time normalized cost function V = (1/N) V_{LRMO}^N for N = 12,000 samples over 1200 s. AI-VIRL is iterated MaxIter = 1000 times and all stabilizing controllers that improve upon the previous ones on the standard test scenario are recorded. For a fair evaluation, AI-VIRL is also run for five trials.
Two cases are analyzed, to test the impact of the number of training epochs on the controller NN. In the first case, both the C-NN and the Q-NN are trained for a maximum of 500 epochs. The best C-NN is found at iteration 822 of trial 3 and measures V = 0.0039271 on the standard test scenario shown in Figure 3 (only the first 600 s shown).
In the second case, the Q-NN is trained for a maximum of 500 epochs and the C-NN for a maximum of 100 epochs. The best C-NN is found at iteration 573 of the first trial and measures V = 0.0070567 on the standard test scenario shown in Figure 4 (only the first 600 s shown).
The iterative learning process specific to AI-VIRL lasts approximately 6 h in the first case and about 4 h in the second case, on a standard desktop computer with all operations on the CPU. All measurements corresponding to the five trials in the two cases are given in Table 2.
After learning the AI-VIRL control, the conclusion w.r.t. the VSFRT-learned control can be drawn: the VSFRT control is computationally cheaper to learn (one-shot instead of iterative) and the average tracking performance is better with VSFRT (by about an order of magnitude).

4.4. Learning VSFRT and AI-VIRL Control Using IO Data Collected in Open-Loop

Next, we repeat the learning process for the two controllers (VSFRT and AI-VIRL), but the data used for learning are collected in open loop, as depicted in Figure 5. For the sake of exploration, the system inputs have their amplitudes uniformly randomly distributed in u_{k,1} ∈ [−0.45; 0.6], u_{k,2} ∈ [−0.5; 0.4], and they are piecewise constant sequences held for 3 s and 7 s, respectively. Each of u_{k,1}, u_{k,2} is first filtered through the low-pass dynamics 1/(s+1). Then, every 5th sampling instant, a uniform random number in [−1; 1] is again added to the filtered u_{k,1}, u_{k,2}. A crucial difference w.r.t. the closed-loop collection scenario is that the reference inputs drive only the LRM; since there is no controller in closed loop, the reference inputs r_k and the LRM's outputs y^m_k = T_{LRM}(q) r_k evolve correlated with each other but independently of the system outputs, which were excited by u_k. For exploration, the reference inputs have their amplitudes uniformly randomly distributed in r_{k,1} ∈ [−6; 6], r_{k,2} ∈ [−1; 2], and they are also piecewise constant sequences held for 10 s and 15 s, respectively. We underline that, for VSFRT, r_k and y^m_k are not needed, since they are automatically calculated offline by the VSFRT principle. However, AI-VIRL uses them for forming the extended state-space model.
It is clear when comparing Figure 1 with Figure 5 that the open-loop data collection in the uncontrolled environment does not satisfactorily explore the input-state space (there is significantly less variation in the outputs). This may affect learning convergence and the final LRM tracking performance.
The extended states for VSFRT and AI-VIRL and their NN settings are kept the same as in the closed-loop IO collection scenario.
For a fair evaluation, the learning process is repeated five times for all the cases described previously. Table 4 reports the value of V for each trial and the average value.
The best VSFRT controller, obtained in trial 4, measures V = 0.0043692 on the test scenario from Figure 6, a value close to the minimal value obtained in the closed-loop IO data collection scenario. The AI-VIRL controllers' performance is obviously inferior: the C-NN with the minimal V when trained for a maximum of 100 epochs is found at iteration 15 of the second trial and measures V = 0.24 (tracking is shown in Figure 7, first 600 s only); the C-NN with the minimal V (best tracking) when trained for a maximum of 500 epochs is found at iteration 132 of the fifth trial and measures V = 0.20267 on the standard test scenario shown in Figure 8 (first 600 s only).
The conclusion is that the poor exploration prevents AI-VIRL from converging to a good controller, whereas the VSFRT controller is learned to about the same LRM tracking performance. VSFRT learns better control (about two orders of magnitude smaller V than AI-VIRL) while being less computationally demanding, despite the poor exploration.

4.5. Testing the Transfer Learning Advantage

It is checked whether initializing the AI-VIRL algorithm with an admissible VSFRT controller helps improve the learning convergence, using the transfer learning capability. An initial admissible controller is, however, not mandatory for AI-VIRL. Since both the VSFRT and the AI-VIRL controllers aim at solving the same LRMO tracking control problem, it is expected that the VSFRT controller initializes AI-VIRL closer to the optimal controller.
The VSFRT controller from the closed-loop IO data collection case is transferred, since this case has proved good LRMO tracking for both the VSFRT and the AI-VIRL controllers. Two cases were again analyzed: when the C-NN of the AI-VIRL is trained for 100 epochs at each iteration and when it is trained for 500 epochs at each iteration, while the Q-NN has its weights randomly initialized with each trial.
The results reveal that transfer learning does not speed up the AI-VIRL convergence, nor does it result in better AI-VIRL controllers after the trial ends. The reason is that the learning process taking place in the high-dimensional space spanned by the C-NN weights remains uncontrollable directly; it can be influenced only indirectly through hyper-parameter settings such as the database exploration quality and size, the training overfitting prevention mechanism and an adequate NN architecture setup.

4.6. VSFRT and AI-VIRL Performance under State Dimensionality Reduction with PCA and Autoencoders

State representation is a notorious problem in approximate dynamic programming and reinforcement learning. In the observability-based framework, the state comprises past IO samples which are time-correlated, meaning that the desired uniform coverage of the state space is significantly affected. Intuitively, the more recent IO samples in the virtual state better reflect the actual system state, whereas the older IO samples characterize it less.
Analyzing the state representation for VSFRT and for AI-VIRL, two sources of correlation appear. First, time-correlation, since the virtual state is constructed from past IO samples. Second, correlation between the independent reference inputs (used in closed-loop IO data collection) and the LRM's states and outputs, as well as the system's inputs and outputs. It is therefore justified to strive for state dimensionality reduction.
One of the popular unsupervised machine learning tools for dimensionality reduction is principal component analysis (PCA). Thanks to its offline nature, it can be used only with offline learning schemes, such as VSFRT and AI-VIRL. Let the state vector s_k recordings be arranged in a matrix S whose columns correspond to the state components and whose row index is the record time index k. The state s_k can be either s̃_k^{ex} for VSFRT or s_k^{ex} for AI-VIRL. Let S̃ be the empirical estimate of the covariance of S, obtained after centering the columns of S to zero mean, and let the square matrix V contain S̃'s eigenvectors as columns, arranged from left to right in descending order of the eigenvalue magnitudes. The number n_P of principal components counts the first n_P leftmost columns of V, which are the principal components. Let this leftmost slice of V be V_L. The reduced state representation is then calculated as s^{red} = (V_L)^T s.
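A minimal sketch of this PCA reduction over the batch of recorded states is given below (the eigendecomposition-of-covariance route described above; the function name is illustrative):

```python
import numpy as np

def pca_reduce(S, n_p):
    """PCA state reduction: center the state recordings S (rows are time samples,
    columns are state components), eigendecompose the empirical covariance,
    keep the n_p leading eigenvectors V_L and map s_red = V_L^T s."""
    S_c = S - S.mean(axis=0)
    cov = np.cov(S_c, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                 # descending eigenvalue order
    V_L = eigvecs[:, order[:n_p]]
    explained = eigvals[order[:n_p]].sum() / eigvals.sum()
    return S_c @ V_L, V_L, explained                  # reduced states, loading matrix, explained variance
```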
The effect of state dimensionality reduction on tracking performance is tested both for VSFRT and for AI-VIRL. The database used is the one from the closed-loop collection scenario, since this offers a learning chance to both the VSFRT and the AI-VIRL controllers. For VSFRT, the dimension of the state s̃_k^{ex} is 12, whereas for AI-VIRL, the dimension of s_k^{ex} is 14. For VSFRT, the first 4 principal components explain 98.32% of the data variation, while for AI-VIRL, the first 6 components explain 98.51% of the data variation, confirming the existing high correlations between the state vector components. This leads to reduced C-NN architecture sizes of 4–6–2 for VSFRT and 6–6–2 for AI-VIRL. For AI-VIRL, the case with 100 epochs for training the C-NN was employed. The explained variation in the data is shown for the first four principal components, for VSFRT only, in Figure 9.
The best C-NNs learned with VSFRT and with AI-VIRL measure V = 0.1653 and V = 1.3488, respectively, on the test scenario. The VSFRT controller performs well only on controlling y_{k,1} and poorly on controlling y_{k,2}, whereas the AI-VIRL control is unexploitable on either of the two axes. The conclusion is that the exploration issue is exacerbated by the state dimensionality reduction, although learning still takes place to some extent, even when the data variation is explainable by a reduced number of principal components due to the many apparent correlations.
Under the state information loss caused by dimensionality reduction, no improvement of the best tracking performance is expected. However, learning still occurs, which encourages using dimensionality reduction as a trade-off for reducing the learning computational effort, by reducing the state-space size and subsequently the NN architecture size.
A standard fully connected single-hidden-layer feedforward autoencoder (AE) is next used to test dimensionality reduction and its effect on the learning performance, in a different unsupervised machine learning paradigm. The autoencoder details are: six hidden neurons, which are the encoder's outputs and also the number of reduced features; sigmoidal activation functions from the input-to-hidden layer as well as from the hidden-to-output layer; a maximum of 500 training epochs. The training cost function is of the form T_{MSSE} + c_1 T_{L2} + c_2 T_{sparse}, where T_{MSSE} penalizes the mean summed squared deviation of the AE outputs from the corresponding inputs, T_{L2} is the L2 weight regularization term responsible for limiting the AE weight amplitudes, and the T_{sparse} term encourages sparsity of the AE hidden layer's output. T_{sparse} measures the Kullback–Leibler divergence of the averaged hidden layer activation values with respect to a desired value, set to 0.15 in this case study, while the other parameters in the cost are set as c_1 = 0.004, c_2 = 4. The AE's target output is the copied input.
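A minimal PyTorch sketch of an autoencoder with this cost structure is shown below; the class and function names are illustrative, and the use of PyTorch (rather than the actual training software used by the authors) is an assumption.

```python
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    """Single-hidden-layer AE with sigmoidal encoder and decoder activations,
    as described above (six hidden neurons give the reduced features)."""
    def __init__(self, n_in, n_hidden=6):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))          # reduced features s_red = Enc(s)
        return torch.sigmoid(self.dec(h)), h

def ae_loss(x, x_hat, h, model, rho=0.15, c1=0.004, c2=4.0):
    """T_MSSE + c1*T_L2 + c2*T_sparse, with a KL sparsity penalty on the
    average hidden activation (target sparsity rho = 0.15)."""
    mse = torch.mean(torch.sum((x - x_hat) ** 2, dim=1))
    l2 = sum(torch.sum(p ** 2) for p in model.parameters())
    rho_hat = torch.mean(h, dim=0).clamp(1e-6, 1 - 1e-6)
    kl = torch.sum(rho * torch.log(rho / rho_hat)
                   + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat)))
    return mse + c1 * l2 + c2 * kl
```

Training would then minimize ae_loss over mini-batches with any gradient-based optimizer (e.g., Adam), assuming the state data are scaled to [0, 1] to match the sigmoidal output layer.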
The AE-based dimensionality reduction is applied only to the VSFRT control learning. Let the encoder map the input to the dimensionally reduced feature as s̃_k^{red} = Enc(s̃_k^{ex}). The VSFRT C-NN training now uses the pairs {s̃_k^{red}, u_k}, with the same value MaxTrain = 200 and the same five-trial approach. The best C-NN learned with VSFRT measured V = 0.1838, which is on par with the best performance of the VSFRT C-NN learned with PCA reduction. The result reveals that no significant improvement is obtained using the AE-based reduction, which is expected due to the sensible information loss; there is thus no preference for either PCA or AE for this purpose. It also confirms that no better performance is attainable. However, it is advisable to use dimensionality reduction tools, since with a reduced virtual state, simpler NN architectures (with fewer parameters) are usable, leading to less computational effort in the training, testing and real-world implementation phases.

5. Second Validation Case Study

A two-joint rigid planar robot arm serves in the following case study. Its dynamics are described by [49]
m_1(s)\ddot{s} + m_2(s, \dot{s})\dot{s} = Sat(u),  (16)
where u = [u_1, u_2]^T are the motor input torques for the base joint and the tip joint, respectively. Both inputs are limited inside the model within their domains [−0.2; 0.2] Nm and [−0.1; 0.1] Nm, respectively, through the component-wise saturation function Sat(·). s = [s_1, s_2]^T, measured in radians (rad), represents the angles of the base and tip joints, respectively, while the joints' angular velocities, which are unmeasurable, are captured in ṡ = [ṡ_1 = s_3, ṡ_2 = s_4]^T and physically limited to [−2π, 2π] rad/s. When no gravitational forces affect the planar arm, the matrices in Equation (16) are
m_1(s) = \begin{bmatrix} m_1 c_1^2 + m_2 l_1^2 + I_1 + m_2 c_2^2 + I_2 + 2 m_2 l_1 c_2 \cos s_2 & m_2 c_2^2 + I_2 + m_2 l_1 c_2 \cos s_2 \\ m_2 c_2^2 + I_2 + m_2 l_1 c_2 \cos s_2 & m_2 c_2^2 + I_2 \end{bmatrix}, \quad m_2(s, \dot{s}) = \begin{bmatrix} b_1 - m_2 l_1 c_2 s_4 \sin s_2 & -m_2 l_1 c_2 (s_3 + s_4) \sin s_2 \\ m_2 l_1 c_2 s_3 \sin s_2 & b_2 \end{bmatrix},  (17)
with the parameters’ numerical values set as l1 = 0.1, l2 = 0.1, m1 = 1.25, m2 = 1, I1 = 0.004, I2 = 0.003, c1 = 0.05, c2 = 0.05, b1 = 0.1, b2 = 0.02. These parameters’ interpretation is irrelevant to the following developments.
A sample period Δ = 0.05 s characterizes a zero-order hold applied to the inputs and outputs of (16), rendering it into an equivalent discrete-time model with inputs u_k = [u_{k,1}, u_{k,2}]^T and outputs y_k = [y_{k,1} = s_{k,1}, y_{k,2} = s_{k,2}]^T.
It is intended to search for the control which makes the closed loop match the continuous-time decoupled LRM T_{LRM}(s) = diag(1/(0.5s + 1), 1/(0.2s + 1)), which is subsequently transformed into the discrete-time observable canonical state-space form
\begin{cases} s_{k+1,1}^m = 0.9048\, s_{k,1}^m + 0.0952\, r_{k,1}, \\ s_{k+1,2}^m = 0.7788\, s_{k,2}^m + 0.2212\, r_{k,2}, \\ y_k^m = [y_{k,1}^m \;\; y_{k,2}^m]^T = [s_{k,1}^m \;\; s_{k,2}^m]^T. \end{cases}  (18)
T_{LRM}(q) is the discrete-time counterpart of T_{LRM}(s), calculated for the sample period Δ.

5.1. IO Data Collection in Closed-Loop

The transition samples must be collected first. Initial non-decoupling proportional-type controllers are used, given as u_{k,1} = 0.15 × (r_{k,1} − y_{k,1}) and u_{k,2} = 0.05 × (r_{k,2} − y_{k,2}), respectively.
For driving the resulting closed-loop control system, r_{k,1} is modeled as a sequence of piecewise constant values held for 2 s with zero-mean random amplitudes of variance 0.6, and r_{k,2} as a sequence of piecewise constant values held for 1 s with zero-mean random amplitudes of variance 0.65. For enhanced exploration, random additive disturbances are applied to the inputs u_{k,1}, u_{k,2}: a uniform random number in [−3, 3] is added to the first input every 2nd sample, and a similarly distributed random number is added to the second input every 3rd sample. A total of 15,000 samples of r_{k,1}, r_{k,2}, u_{k,1}, u_{k,2}, y_{k,1}, y_{k,2}, y^m_{k,1}, y^m_{k,2} are collected. The collection is shown in Figure 10.

5.2. Learning the VSFRT Controller from Closed-Loop IO Data

Using the collected output data y_k and the given reference model T_{LRM}(q), the virtual reference input is computed as r̃_k = T_{LRM}^{-1}(q) y_k. Afterwards, the extended state is built from the IO database samples {u_k, y_k} and from r̃_k, all components being lumped in s̃_k^{ex} = [y_{k,1}  y_{k,2}  y_{k−1,1}  y_{k−1,2}  y_{k−2,1}  y_{k−2,2}  u_{k−1,1}  u_{k−1,2}  r̃_{k,1}  r̃_{k,2}]^T.
The extended state s̃_k^{ex} fills a database of the form DB = {(s̃_k^{ex}, u_k)}, which is used to learn the VSFRT NN controller according to the VSFRT algorithm. The C-NN settings are described in Table 5. Each learning trial trains the C-NN MaxTrain = 200 times, each time starting with reinitialized weights. Each trained C-NN was tested in closed loop on a scenario where the test reference inputs are identical to those used in Figure 11, by measuring the normalized cost function V = (1/N) V_{LRMO}^N for N = 400 samples over 20 s. The best C-NN (in terms of the minimal value of V per trial) is retained at each trial. One trial lasts about 30 min on a standard desktop computer with all calculations performed on the CPU. For a fair evaluation, the learning process is repeated for five trials. In Table 6, the value of V for each of the five trials is given, together with the average. About 95% of the learned controllers are stabilizing, in spite of the underlying nonlinear optimization specific to VSFRT, indirectly solved by C-NN training.

5.3. Learning the AI-VIRL Controller from Closed-Loop IO Data

The extended state used to learn the C-NN with AI-VIRL is built using the same database {u_k, y_k} collected for VSFRT, together with the reference model states s^m_k and the reference input r_k. The extended state vector is s_k^{ex} = [y_{k,1}  y_{k,2}  y_{k−1,1}  y_{k−1,2}  y_{k−2,1}  y_{k−2,2}  u_{k−1,1}  u_{k−1,2}  r_{k,1}  r_{k,2}  s^m_{k,1}  s^m_{k,2}]^T.
Before using the raw database, the transition samples at time instants where $r_k \neq r_{k+1}$ must be excluded (the piecewise-constant reference model is not a valid state-space transition model at the switching instants). The resulting database used for learning is $DB = \{(s_k^{ex}[i], u_k[i], s_{k+1}^{ex}[i])\}$, $i = \overline{1, P}$, where $P < N$.
The controller NN and the Q function NN settings are depicted in Table 7.
To find the Q-NN minimizers used to train the C-NN in Step 3 of the AI-VIRL algorithm, all input combinations $u_{k,1} \times u_{k,2}$ resulting from 25 discrete values in $[-2, 2]$ per input are enumerated for each $s_k^{ex}[i]$. The minimizing combination is set as the C-NN target for the given input $s_k^{ex}[i]$ at the current algorithm iteration.
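A minimal sketch of one AI-VIRL iteration on the transition database is given below: the Q-NN is updated with one-step bootstrapped targets and the C-NN is then retrained on the Q-minimizing inputs found by enumerating the $25 \times 25$ grid. The regressor interface, the discount factor, the stage-cost vector and the warm-started networks are assumptions made for illustration, not the authors' implementation:

```python
import itertools
import numpy as np

# Illustrative sketch of one AI-VIRL iteration over the transition database
# DB = {(s_ex[i], u[i], s_ex_next[i])}. q_nn and c_nn are any already-fitted
# regressors exposing fit/predict (e.g. small MLPs); costs, gamma and the
# warm-started networks are assumptions made for illustration only.
U_GRID = np.array(list(itertools.product(np.linspace(-2.0, 2.0, 25),
                                         np.linspace(-2.0, 2.0, 25))))  # 625 (u1, u2) pairs

def ai_virl_iteration(S, U, S_next, costs, q_nn, c_nn, gamma=0.95):
    # Value-iteration update of the Q-NN:
    # Q(s, u) <- cost(s, u) + gamma * Q(s', C(s')), with s' the stored next state.
    u_next = c_nn.predict(S_next)
    q_targets = costs + gamma * q_nn.predict(np.hstack([S_next, u_next]))
    q_nn.fit(np.hstack([S, U]), q_targets)

    # Step 3: controller update. For every s_ex[i], enumerate the 25 x 25 input
    # grid and set the Q-minimizing pair as the C-NN training target.
    u_best = np.empty_like(U)
    for i, s in enumerate(S):
        q_vals = q_nn.predict(np.hstack([np.tile(s, (len(U_GRID), 1)), U_GRID]))
        u_best[i] = U_GRID[np.argmin(q_vals)]
    c_nn.fit(S, u_best)
    return q_nn, c_nn
```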
Each AI-VIRL iteration produces a C-NN that is tested on the same standard test scenario used with VSFRT, by measuring the cost function $V$ over $N = 400$ samples (20 s). AI-VIRL is iterated $MaxIter = 200$ times, and every stabilizing controller that improves upon the previous best on the standard test scenario is recorded. For a fair evaluation, AI-VIRL is also run for five trials, and the best measurement of each trial (together with the average over the trials) is reported in Table 6.
After learning the AI-VIRL controllers, the conclusion based on Table 6 is clear: VSFRT again outperforms AI-VIRL, while also being computationally less demanding and one-shot (non-iterative). Learning took place on the same database, in a controlled environment where good exploration was attainable.

5.4. Learning VSFRT and AI-VIRL Control Based on IO Data Collected in Open-Loop

The robotic arm acts as an open-loop integrator on each joint, hence it is marginally stable. The open-loop collection is driven by zero-mean impulse inputs $u_{k,1}, u_{k,2}$. For exploration's sake, the reference inputs have amplitudes uniformly randomly distributed in $r_{k,1} \in [-1.5, 1.5]$, $r_{k,2} \in [-2, 2]$ and consist of sequences of piecewise-constant signals held for 2.5 s and 2 s, respectively. The difference w.r.t. the closed-loop collection scenario is that the reference inputs $r_k$ drive only the LRM; since there is no controller in the loop, the reference inputs and the LRM outputs $y_k^m = T_{LRM}(q) r_k$ evolve independently of the system outputs $y_k$, which are driven by $u_k$. The open-loop collection is captured in Figure 12, from which it is clear that $y_{k,1}, y_{k,2}$ do not often overlap with $r_{k,1}, r_{k,2}$ nor with $y_{k,1}^m, y_{k,2}^m$, respectively.
The exact same learning settings from the closed-loop case were used for VSFRT and for AI-VIRL. Notice that $r_{k,1}, r_{k,2}$ and $y_{k,1}^m, y_{k,2}^m$ are not used for learning the VSFRT control, since they do not enter the extended state $\tilde{s}_k^{ex}$.
Five learning trials are executed both for VSFRT and for AI-VIRL, with the LRMO tracking performance measure V recorded in Table 8. The best learned VSFRT and AI-VIRL controllers are shown performing on the standard tracking test scenario in Figure 11.
There is a dramatic difference in favor of VSFRT, whose tracking performance is superior in spite of the poorer input-state space exploration. As expected, AI-VIRL's convergence to a good controller is affected by the poor exploration, and its performance does not improve on the closed-loop collection case. In conclusion, the superiority of VSFRT over AI-VIRL is again confirmed, both in terms of reduced computational complexity/time and in terms of tracking performance.

6. Conclusions

Learning controllers from IO data to achieve high LRM output tracking performance has been validated for both VSFRT and AI-VIRL. Learning takes place in the space of the new virtual state representation built from historical IO samples, in order to address observable systems, while the resulting control is applied to the original underlying system. Surprisingly, in the two illustrated case studies, using the same IO data and the same LRM, VSFRT showed no worse tracking performance than AI-VIRL, despite its significantly lower computational demands; in the cases of poor exploration, VSFRT was clearly superior.
Aside from the performance comparisons, several additional studies were conducted. Transfer learning from VSFRT, used to provide an initial admissible controller for AI-VIRL, showed no advantage. While transfer learning could theoretically accelerate convergence to the optimal control, the result indicates that learning in the high-dimensional space spanned by the C-NN weights cannot be steered directly, but only indirectly, through hyper-parameter settings such as the exploration quality and size of the database, the overfitting-prevention mechanism used in training, and an adequate NN architecture. Since the proposed state representation generally leads to large virtual state vectors, the impact of dimensionality reduction through standard unsupervised machine learning techniques such as principal component analysis (PCA) and autoencoders (AE) was also studied. The results indicate no significant improvement and no clear preference between PCA and AE for this purpose; they also confirm that no better tracking performance is attainable. However, dimensionality reduction is still advisable: a reduced virtual state allows simpler NN architectures (fewer parameters), leading to less computational effort in the training, testing and real-world implementation phases.

Author Contributions

Conceptualization, methodology, formal analysis, M.-B.R.; software, validation, investigation, A.-I.B.; writing—original draft preparation, M.-B.R. and A.-I.B.; writing—review and editing, M.-B.R. and A.-I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the Romanian Ministry of Education and Research, CNCS-UEFISCDI, project number PN-III-P1-1.1-TE-2019-1089, within PNCDI III.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

For an observable linear discrete-time process of the form
$$\begin{cases} s_{k+1} = G s_k + H u_k,\\ y_k = L s_k, \end{cases}$$
with $s_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^{m_u}$, $y_k \in \mathbb{R}^p$, $G \in \mathbb{R}^{n \times n}$, $H \in \mathbb{R}^{n \times m_u}$, $L \in \mathbb{R}^{p \times n}$, a data-based state observer can be built from IO samples and the known process matrices. Let $\tau$ past samples of $y_k$ and $u_k$ be collected in $v_k = [(Y_{\overline{k-1,k-\tau}})^T, (U_{\overline{k-1,k-\tau}})^T]^T \in \mathbb{R}^{(p+m_u)\tau}$ and let the controllability and observability matrices be $C_o = [H \;\; GH \;\; \cdots \;\; G^{\tau-1}H]$ and $\Xi = [(LG^{\tau-1})^T, \ldots, (LG)^T, L^T]^T$, respectively. There also exists an observability index $K$ such that $\tau < K \Rightarrow \mathrm{rank}(\Xi) < n$, while $\tau \geq K \Rightarrow \mathrm{rank}(\Xi) = n$. Assuming $\tau \geq K$, $\Xi$ has full column rank. With the impulse response coefficients matrix
$$\Pi = \begin{pmatrix} 0 & LH & LGH & \cdots & LG^{\tau-2}H \\ 0 & 0 & LH & \cdots & LG^{\tau-3}H \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & LH \\ 0 & 0 & \cdots & 0 & 0 \end{pmatrix},$$
the data-based observer expresses the state $s_k$ in terms of $v_k$ as
$$s_k = [N_y \;\; N_u]\, v_k, \quad \text{where} \quad N_y = G^{\tau} (\Xi^T \Xi)^{-1} \Xi^T, \quad N_u = C_o - N_y \Pi.$$
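A direct numpy transcription of this construction is sketched below, assuming the stacking order $v_k = [y_{k-1}^T, \ldots, y_{k-\tau}^T, u_{k-1}^T, \ldots, u_{k-\tau}^T]^T$; it is provided only to illustrate the Appendix A formulas:

```python
import numpy as np

def data_based_observer(G, H, L, tau):
    """Return N_y, N_u so that s_k = [N_y  N_u] v_k, following the Appendix A formulas."""
    p, mu = L.shape[0], H.shape[1]
    # C_o = [H, G H, ..., G^{tau-1} H]
    Co = np.hstack([np.linalg.matrix_power(G, i) @ H for i in range(tau)])
    # Xi = [(L G^{tau-1})^T, ..., (L G)^T, L^T]^T (full column rank for tau >= K)
    Xi = np.vstack([L @ np.linalg.matrix_power(G, tau - 1 - i) for i in range(tau)])
    # Block upper-triangular impulse-response (Markov parameter) matrix Pi
    Pi = np.zeros((p * tau, mu * tau))
    for i in range(tau):
        for j in range(i + 1, tau):
            Pi[i*p:(i+1)*p, j*mu:(j+1)*mu] = L @ np.linalg.matrix_power(G, j - i - 1) @ H
    Ny = np.linalg.matrix_power(G, tau) @ np.linalg.pinv(Xi)  # G^tau (Xi^T Xi)^{-1} Xi^T
    Nu = Co - Ny @ Pi
    return Ny, Nu
```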

Figure 1. Closed-loop IO data from the process: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (blue), $r_{k,1}$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (blue), $r_{k,2}$ (red).
Figure 2. VSFRT neural network (NN) controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 3. AI-VIRL NN controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 4. AI-VIRL NN controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 5. Open-loop input/output (IO) data from the system: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (blue), $r_{k,1}$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (blue), $r_{k,2}$ (red).
Figure 6. VSFRT NN controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 7. AI-VIRL NN controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 8. AI-VIRL NN controller: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 9. Proportion of state variation explained as a function of the number of principal components, for the VSFRT state input.
Figure 10. Closed-loop data collection: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (red).
Figure 11. VSFRT (black lines) and AI-VIRL (blue lines) learned from open-loop data, tested in closed-loop. $y_{k,1}^m$ (b) and $y_{k,2}^m$ (d) are in red; (a) and (c) show $u_{k,1}$ and $u_{k,2}$, respectively.
Figure 12. Open-loop data collection: (a) $u_{k,1}$; (b) $y_{k,1}$ (black), $y_{k,1}^m$ (blue), $r_{k,1}$ (red); (c) $u_{k,2}$; (d) $y_{k,2}$ (black), $y_{k,2}^m$ (blue), $r_{k,2}$ (red).
Table 1. Virtual State-feedback Reference Feedback Tuning (VSFRT) C-NN settings.
Setting | C-NN
Architecture | 12-6-2 (12 inputs for $s_k^{ex} \in \mathbb{R}^{12}$, 6 hidden layer neurons and 2 outputs: $u_{k,1}$, $u_{k,2}$)
Activation function in hidden layer | tansig
Activation function in output layer | linear
Initial weights | uniform random numbers in [0; 1]
Training algorithm | scaled conjugate gradient
Maximum number of epochs to train | 100
Validation/training ratio | 10–90%
Maximum validation failures | 50
Minimum performance gradient | $10^{-30}$
Training cost function | mean sum of squared errors (MSSE)
Table 2. VSFRT and Approximate Iterative Value Iteration Reinforcement Learning (AI-VIRL) tracking performance when learning uses closed-loop IO data.
Trial | VSFRT | AI-VIRL (500 Epochs) | AI-VIRL (100 Epochs)
1 | 0.0058104 | 0.010102 | 0.0070567
2 | 0.0054775 | 0.0066765 | 0.017531
3 | 0.0051332 | 0.0039271 | 0.034047
4 | 0.0092913 | 0.024098 | 0.01389
5 | 0.0058103 | 0.0069556 | 0.013097
Average | 0.00630454 | 0.01035184 | 0.03424868
Table 3. AI-VIRL C-NN and Q-NN settings.
Setting | C-NN | Q-NN
Architecture | 14-10-2 | 16-30-1
Activation function in hidden layer | tansig | tansig
Activation function in output layer | linear | linear
Initial weights | uniform random numbers in [0; 1] | uniform random numbers in [0; 1]
Training algorithm | scaled conjugate gradient | scaled conjugate gradient
Maximum number of epochs to train | 500 or 100 | 500
Validation/training ratio | 10–90% | 10–90%
Maximum validation failures | 50 | 50
Minimum performance gradient | $10^{-30}$ | $10^{-30}$
Training cost function | MSSE | MSSE
Table 4. VSFRT and AI-VIRL tracking performance when learning uses open-loop IO data.
Trial | VSFRT | AI-VIRL (500 Epochs) | AI-VIRL (100 Epochs)
1 | 0.0057742 | 1.6159 | 1.0426
2 | 0.004507 | 0.24 | 0.96942
3 | 0.0055276 | 1.8268 | 1.0055
4 | 0.0043692 | 0.63566 | 1.3044
5 | 0.0049202 | 0.84485 | 0.20267
Average | 0.00501964 | 1.032642 | 0.904918
Table 5. VSFRT C-NN settings.
Setting | C-NN
Architecture | 10-6-2
Activation function in hidden layer | tansig
Activation function in output layer | linear
Initial weights | uniform random numbers in [0; 1]
Training algorithm | scaled conjugate gradient
Maximum number of epochs to train | 100
Validation/training ratio | 10–90%
Maximum validation failures | 50
Minimum performance gradient | $10^{-30}$
Training cost function | MSSE
Table 6. VSFRT and AI-VIRL tracking performance when learning uses closed-loop IO data.
Trial | VSFRT | AI-VIRL (500 Epochs)
1 | 0.0043144 | 0.0039152
2 | 0.0050468 | 0.0051058
3 | 0.0040801 | 0.0068728
4 | 0.0044891 | 0.0058930
5 | 0.0045807 | 0.0055730
Average | 0.0045022 | 0.0054719
Table 7. AI-VIRL C-NN and Q-NN settings.
Setting | C-NN | Q-NN
Architecture | 12-6-2 | 14-40-1
Activation function in hidden layer | tansig | tansig
Activation function in output layer | linear | linear
Initial weights | uniform random numbers in [0; 1] | uniform random numbers in [0; 1]
Training algorithm | scaled conjugate gradient | scaled conjugate gradient
Maximum number of epochs to train | 100 | 500
Validation/training ratio | 10–90% | 10–90%
Maximum validation failures | 50 | 50
Minimum performance gradient | $10^{-30}$ | $10^{-30}$
Training cost function | MSSE | MSSE
Table 8. VSFRT and AI-VIRL tracking performance when learning uses open-loop IO data.
Trial | VSFRT | AI-VIRL (100 Epochs)
1 | 0.0021623 | 0.127480
2 | 0.0021440 | 0.273590
3 | 0.0021695 | 0.865850
4 | 0.0021951 | 0.202360
5 | 0.0021497 | 0.175330
Average | 0.0021641 | 0.328922
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
