Open Access: this article is freely available and re-usable.

*Algorithms* **2019**, *12*(6), 121; https://doi.org/10.3390/a12060121

Article

Learning Output Reference Model Tracking for Higher-Order Nonlinear Systems with Unknown Dynamics

Department of Automation and Applied Informatics, Politehnica University of Timisoara, 2 Bd. V. Parvan, 300223 Timisoara, Romania

^{*} Author to whom correspondence should be addressed.

Received: 1 May 2019 / Accepted: 9 June 2019 / Published: 12 June 2019

## Abstract

Linearly and nonlinearly parameterized approximate dynamic programming approaches used for output reference model (ORM) tracking control are proposed. The ORM tracking problem is of significant interest in practice since, with a linear ORM, the closed-loop control system is indirectly feedback linearized and value iteration (VI) offers the means to achieve ORM tracking without using process dynamics. Ranging from linear to nonlinear parameterizations, a successful approximate VI implementation for continuous state-action spaces depends on several key parameters such as: problem dimension, exploration of the state-action space, the state-transitions dataset size, and suitable selection of the function approximators. We show that using the same transitions dataset and under a general linear parameterization of the Q-function, high performance ORM tracking can be achieved with an approximate VI scheme, on the same performance level as that of a neural-network (NN)-based implementation that is more complex and takes significantly more time to learn. However, the latter proves to be more robust to hyperparameters selection, dataset size, and to exploration strategies, recommending it as the de facto practical implementation. The case study is aimed at ORM tracking of a real-world nonlinear two inputs–two outputs aerodynamic process with ten internal states, as a representative high order system.

Keywords: approximate dynamic programming; reinforcement learning; data-driven control; model-free control; reference trajectory tracking; output reference model; multivariable control; aerodynamic rotor system; neural networks; learning systems

## 1. Introduction

The output reference model (ORM) tracking problem is of significant interest in practice, especially for nonlinear systems control, since selecting a linear ORM enforces feedback linearization on the controlled process. The closed-loop control system can then act linearly over a wide range and not only in the vicinity of an operating point. Linearized control systems can subsequently be subjected to higher-level learning schemes such as Iterative Learning Control, with practical implications such as primitive-based learning [1] that can extrapolate optimal behavior to previously unseen tracking scenarios.

On the other hand, selecting a suitable ORM is not straightforward, for several reasons. The ORM has to be matched with the process bandwidth and with process nonlinearities such as input and output saturations. From classical control theory, the dead-time and non-minimum-phase character of the process cannot be compensated for and must be reflected in the ORM. Apart from this information, which can be measured or inferred from working experience with the process, avoiding knowledge of the process’ state transition function (process dynamics), which is the most time-consuming to identify and the most uncertain part of the process, is very attractive in practice when designing high-performance control.

Reinforcement Learning (RL) has developed both from artificial intelligence [2] and from classical control [3,4,5,6,7], where it is better known as Adaptive (Approximate, Neuro) Dynamic Programming (ADP). Certain ADP variants can ensure ORM tracking control without knowing the state-space (transition function) dynamics of the controlled process, which is of high importance in the practice of model-free (herein understood as unknown dynamics) and data-driven control schemes that are able to compensate for poor modeling and process model uncertainty. Thus, ADP relies only on data collected from the process, called state transitions. While plenty of mature ADP schemes already exist in the literature, tuning such schemes for a particular problem requires significant experience. Firstly, it must be specified whether ADP deals with continuous (infinite) or discrete (finite) state-action spaces. Then, the intended implementation determines the choice between online and offline and between adaptive and batch processing, as well as the selection of the approximator used for the extended cost function (called the Q-function) and/or for the controller. Afterwards, linear or nonlinear parameterizations are sought. Exploration of the state-action space is critical, as are the hyperparameters of the overall learning scheme, such as the number of transition samples and the trade-off between exploration and exploitation. Although success stories of RL and ADP applied to large state-action spaces are reported mainly in artificial intelligence [8], most approaches in control theory use low-order processes as representative case studies, mainly in linear quadratic regulator (LQR)-like settings (regulating states to zero). While the reference input tracking control problem has been tackled before in ADP for linear time-invariant (LTI) processes under the name of Linear Quadratic Tracking (LQT) [9,10], ORM tracking for nonlinear processes has rarely been addressed [11].

The iterative model-free approximate Value Iteration (IMF-AVI) proposed in this work belongs to the family of batch-fitted Q-learning schemes [12,13] known as action-dependent heuristic dynamic programming (ADHDP) that are popular and representative ADP approaches, owing to their simplicity and model-free character. These schemes have been implemented in many variants: online vs. offline, adaptive or batch, for discrete/continuous states and actions, with/without function approximators, such as Neural Networks (NNs) [14,15,16,17,18,19,20,21,22].

Concerning the exploration issue in ADP for control, a suitable exploration that covers the state-action space as well as possible is not trivially ensured. Randomly generated control input signals will almost surely fail to guide the exploration over the entire state-action space, at least not in a reasonable amount of time. Instead, a priori designed feedback controllers can be used under a variable reference input that serves to guide the exploration [23]. The existence of an initial stabilizing feedback controller, not necessarily a high-performance one, can accelerate the collection of the transition samples dataset under exploration. This allows for offline IMF-AVI based on large datasets, leading to improved convergence speed for high-dimensional processes. However, such input–output (IO) or input-state feedback controllers could traditionally not be designed without a process model, until the advent of data-driven model-free controller design techniques from the field of control theory: Virtual Reference Feedback Tuning (VRFT) [24], Iterative Feedback Tuning [25], Model Free Iterative Learning Control [26,27,28], Model Free (Adaptive) Control [29,30], with representative applications [31,32,33]. This work shows a successful example of a model-free output feedback controller used to collect input-to-state transition samples from the process for learning state-feedback ADP-based ORM tracking control. It therefore fits with the recent data-driven control [34,35,36,37,38,39,40,41,42] and reinforcement learning [43,44] applications.

The case study deals with the challenging ORM tracking control for a nonlinear real-world two-inputs–two-outputs aerodynamic system (TITOAS) having six natural states that are extended with four additional ones according to the proposed theory. The process uses aerodynamic thrust to create vertical (pitch) and horizontal (azimuth) motion. It is shown that IMF-AVI can be used to attain ORM tracking of first order lag type, despite the high order of the multivariable process, and despite the pitch motion being naturally oscillatory and the azimuth motion practically behaving close to an integrator. The state transitions dataset is collected under the guidance of an input–output (IO) feedback controller designed using model-free VRFT.

As a main contribution, the paper focuses on a detailed comparison of the advantages and disadvantages of using linear and nonlinear parameterizations for the IMF-AVI scheme, while covering complete implementation details. To the best of the authors’ knowledge, the ORM tracking context with linear parameterizations has not been studied before for high-order real-world processes. Moreover, theoretical analysis shows convergence of the IMF-AVI while accounting for approximation errors and explains the robust learning convergence of the NN-based IMF-AVI. The results indicate that the nonlinearly parameterized NN-based IMF-AVI implementation should be the de facto choice in practice since, although more time-consuming, it automatically manages the basis function selection, is more robust to dataset size and exploration settings, and is generally better suited for nonlinear processes with unknown dynamics.

## 2. Output Model Reference Control for Unknown Dynamics Nonlinear Processes

#### 2.1. The Process

A discrete-time, nonlinear, unknown, open-loop stable, deterministic, strictly causal state-space process is defined as

$$P:\{{\mathbf{x}}_{k+1}=\mathbf{f}({\mathbf{x}}_{k},{\mathbf{u}}_{k}),\ {\mathbf{y}}_{k}=\mathbf{g}\left({\mathbf{x}}_{k}\right)\},\tag{1}$$

where k indexes the discrete time, ${\mathbf{x}}_{k}={[{x}_{k,1},\dots ,{x}_{k,n}]}^{\top}\in {\mathsf{\Omega}}_{X}\subset {\mathbb{R}}^{n}$ is the n-dimensional state vector, ${\mathbf{u}}_{k}={[{u}_{k,1},\dots ,{u}_{k,{m}_{u}}]}^{\top}\in {\mathsf{\Omega}}_{U}\subset {\mathbb{R}}^{{m}_{u}}$ is the control input signal, ${\mathbf{y}}_{k}={[{y}_{k,1},\dots ,{y}_{k,p}]}^{\top}\in {\mathsf{\Omega}}_{Y}\subset {\mathbb{R}}^{p}$ is the measurable controlled output, $\mathbf{f}:{\mathsf{\Omega}}_{X}\times {\mathsf{\Omega}}_{U}\mapsto {\mathsf{\Omega}}_{X}$ is an unknown nonlinear system function that is continuously differentiable within its domain, and $\mathbf{g}:{\mathsf{\Omega}}_{X}\mapsto {\mathsf{\Omega}}_{Y}$ is an unknown nonlinear continuously differentiable output function. Initial conditions are not accounted for at this point. The known domains ${\mathsf{\Omega}}_{X},{\mathsf{\Omega}}_{U},{\mathsf{\Omega}}_{Y}$ are assumed to be compact and convex. Equation (1) is a general, unrestrictive form for most controlled processes. The following assumptions, common to data-driven formulations, are made:

**Assumption 1 (A1).** Process (1) is fully state controllable with measurable states.

**Assumption 2 (A2).** Process (1) is input-to-state stable on the known domain ${\mathsf{\Omega}}_{U}\times {\mathsf{\Omega}}_{X}$.

**Assumption 3 (A3).** Process (1) is minimum-phase (MP).

A1 and A2 are widely used in data-driven control. They cannot be checked analytically for the unknown model (1) but can be inferred from historical and working knowledge of the process. Should such information not be available, the user can attempt process control under restrictive safe operating conditions, which are usually handled at the supervisory control level. Input-to-state stability (A2) is necessary if open-loop input-state samples are to be collected for state-space control design. Assumption A2 can be relaxed if a stabilizing state-space controller is already available and used just for the purpose of input-state data collection. A3 is the least restrictive assumption and it is used in the context of the VRFT design of a feedback controller based on input–output process data. Although solutions exist to deal with nonminimum-phase processes, the MP assumption simplifies the VRFT design and the output reference model selection (to be introduced in the following section).

Comment 1.

Model (1) accounts for a wide range of processes, including fixed time-delay ones. For a positive integer delay d on the control input ${\mathbf{u}}_{k-d}$, the additional states ${\mathbf{x}}_{k,1}^{u}={\mathbf{u}}_{k-1},{\mathbf{x}}_{k,2}^{u}={\mathbf{u}}_{k-2},\dots ,{\mathbf{x}}_{k,d}^{u}={\mathbf{u}}_{k-d}$ can extend the initial process model (1) to arrive at a state-space model without delays, in which the additional states are measurable as past input samples. A delay in the original states in (1), i.e., ${\mathbf{x}}_{k-d}$, is treated similarly.
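The state augmentation of Comment 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `make_delay_free`, the toy process `f_toy`, and the buffer layout are hypothetical and not taken from the paper.

```python
import numpy as np

def make_delay_free(f, d, m_u):
    """Wrap x_{k+1} = f(x_k, u_{k-d}) as a delay-free model.

    The augmented state is [x_k, u_{k-1}, ..., u_{k-d}] (hypothetical layout).
    """
    def f_aug(x_aug, u):
        n = x_aug.size - d * m_u
        x, past = x_aug[:n], x_aug[n:].reshape(d, m_u)
        x_next = f(x, past[-1])                 # the oldest stored input u_{k-d} acts now
        past_next = np.vstack([u[None, :], past[:-1]])  # shift the input buffer by one step
        return np.concatenate([x_next, past_next.ravel()])
    return f_aug

# Toy scalar process with an input delay of d = 2 samples (assumed for illustration).
f_toy = lambda x, u: 0.5 * x + u
f_aug = make_delay_free(f_toy, d=2, m_u=1)

x_aug = np.zeros(3)                             # [x_0, u_{-1}, u_{-2}]
for k in range(4):
    x_aug = f_aug(x_aug, np.array([k + 1.0]))   # apply u_k = k + 1
```

The augmented state contains only the original state and past inputs, so it stays measurable whenever the original state is.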

#### 2.2. Output Reference Model Control Problem Definition

Let the discrete-time, known, open-loop stable, minimum-phase (MP), deterministic, strictly causal state-space ORM be

$$M:\{{\mathbf{x}}_{k+1}^{m}={\mathbf{f}}^{m}({\mathbf{x}}_{k}^{m},{\mathbf{r}}_{k}),\ {\mathbf{y}}_{k}^{m}={\mathbf{g}}^{m}({\mathbf{x}}_{k}^{m})\},\tag{2}$$

where ${\mathbf{x}}_{k}^{m}={[{x}_{k,1}^{m},\dots ,{x}_{k,n}^{m}]}^{\top}\in {\mathsf{\Omega}}_{{X}_{m}}\subset {\mathbb{R}}_{m}^{n}$ is the ORM state, ${\mathbf{r}}_{k}={[{r}_{k,1},\dots ,{r}_{k,p}]}^{\top}\in {\mathsf{\Omega}}_{{R}_{m}}\subset {\mathbb{R}}^{p}$ is the reference input signal, ${\mathbf{y}}_{k}^{m}={[{y}_{k,1}^{m},\dots ,{y}_{k,p}^{m}]}^{\top}\in {\mathsf{\Omega}}_{{Y}_{m}}\subset {\mathbb{R}}^{p}$ is the ORM output, and ${\mathbf{f}}^{m}:{\mathsf{\Omega}}_{{X}_{m}}\times {\mathsf{\Omega}}_{{R}_{m}}\mapsto {\mathsf{\Omega}}_{{X}_{m}}$, ${\mathbf{g}}^{m}:{\mathsf{\Omega}}_{{X}_{m}}\mapsto {\mathsf{\Omega}}_{{Y}_{m}}$ are known nonlinear mappings. Initial conditions are zero unless otherwise stated. Notice that ${\mathbf{r}}_{k},{\mathbf{y}}_{k},{\mathbf{y}}_{k}^{m}$ are of size p for square feedback control systems (CSs). If the ORM (2) is LTI, it is always possible to express it as an IO LTI transfer function (t.f.) $\mathbf{M}\left(z\right)$ ensuring ${\mathbf{y}}_{k}^{m}=\mathbf{M}\left(z\right){\mathbf{r}}_{k}$, where $\mathbf{M}\left(z\right)$ is commonly an asymptotically stable unit-gain rational t.f. and ${\mathbf{r}}_{k}$ is the reference input that drives both the feedback CS and the ORM. We introduce an extended process comprising the process (1) coupled with the ORM (2). For this, we consider the reference input ${\mathbf{r}}_{k}$ as a set of measurable exogenous signals (possibly interpreted as a disturbance) that evolve according to ${\mathbf{r}}_{k+1}={\mathbf{h}}^{m}\left({\mathbf{r}}_{k}\right)$, with known nonlinear ${\mathbf{h}}^{m}:{\mathbb{R}}^{m}\mapsto {\mathbb{R}}^{m}$. Herein, ${\mathbf{h}}^{m}(.)$ is a generative model for the reference input.

The class of LTI generative models ${\mathbf{h}}^{m}(.)$ has been studied before in [9], but it is a rather restrictive one. For example, reference input signals modeled as a sequence of steps of constant amplitude cannot be modeled by LTI generative models. A step reference input signal with constant amplitude over time can be modeled as ${\mathbf{r}}_{k+1}={\mathbf{r}}_{k}$ with some initial condition ${\mathbf{r}}_{0}$. On the other hand, a sinusoidal scalar reference input signal ${r}_{k}$ can be modeled only through a second-order state-space model. To see this, let the Laplace transform of $\cos(\omega t)\sigma(t)$ ($\sigma(t)$ is the unit step function) be $H(s)=\mathcal{L}\{\cos(\omega t)\sigma(t)\}$, with the complex Laplace variable s. If $sH(s)$ is considered a t.f. driven by the unit step function with Laplace transform $\mathcal{L}\{\sigma(t)\}=1/s$, then the LTI discrete-time state-space model associated with $sH(s)$, acting as a generative model for ${r}_{k}$, is of the form

$$\begin{array}{c}{\mathbf{o}}_{k+1}=\mathbf{A}{\mathbf{o}}_{k}+\mathbf{B}{\sigma}_{k},\\ {r}_{k}=\mathbf{C}{\mathbf{o}}_{k}+D{\sigma}_{k},\end{array}\tag{3}$$

with known $\mathbf{A}\in {\mathbb{R}}^{2\times 2},\mathbf{B}\in {\mathbb{R}}^{2\times 1},\mathbf{C}\in {\mathbb{R}}^{1\times 2},D\in \mathbb{R},{\mathbf{o}}_{0}={[0,0]}^{\top}$, where ${\sigma}_{k}=\{1,1,1,\dots \}$ is the discrete-time unit step function. The combination of $H(s)$ driven by the Dirac impulse $\delta(t)$, with Laplace transform $\mathcal{L}\{\delta(t)\}=1$, could also have been considered as a generative model. Based on the state-space model above, modeling p sinusoidal reference inputs ${\mathbf{r}}_{k}\in {\mathsf{\Omega}}_{{R}_{m}}\subset {\mathbb{R}}^{p}$ requires $2p$ states. Generally speaking, the generative model of the reference input must obey the Markov property.
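As an illustration of such a second-order generative model, the sketch below builds one possible step-driven realization that reproduces a sampled cosine exactly. The matrices are an assumption chosen for this example (a rotation matrix for $\mathbf{A}$, with $\theta$ standing for $\omega$ times the sampling period); they are not the matrices used by the authors.

```python
import numpy as np

theta = 0.2                                   # omega * Ts in rad/sample (hypothetical value)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v = np.array([1.0, 0.0])

# One realization of o_{k+1} = A o_k + B sigma_k, r_k = C o_k + D sigma_k.
A = R
B = (np.eye(2) - R) @ v                       # gives o_k = v - R^k v when o_0 = 0
C = np.array([-1.0, 0.0])
D = 1.0

def generate_reference(n_steps):
    """Simulate the generative model from o_0 = [0, 0] under a unit step."""
    o = np.zeros(2)
    r = np.empty(n_steps)
    for k in range(n_steps):
        sigma = 1.0                           # discrete-time unit step
        r[k] = C @ o + D * sigma
        o = A @ o + B * sigma
    return r

r = generate_reference(50)                    # equals cos(theta * k) for k = 0..49
```

Two states are needed because the cosine alone does not determine its next sample; the second state carries the quadrature (sine) component, which is what makes the model Markov.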

Consider next that the extended state-space model that consists of (1), (2), and the state-space generative model of the reference input signal is, in the most general form:

$${\mathbf{x}}_{k+1}^{E}=\left[\begin{array}{c}{\mathbf{x}}_{k+1}\\ {\mathbf{x}}_{k+1}^{m}\\ {\mathbf{r}}_{k+1}\end{array}\right]=\left[\begin{array}{c}\mathbf{f}({\mathbf{x}}_{k},{\mathbf{u}}_{k})\\ {\mathbf{f}}^{m}({\mathbf{x}}_{k}^{m},{\mathbf{r}}_{k})\\ {\mathbf{h}}^{m}\left({\mathbf{r}}_{k}\right)\end{array}\right]=\mathbf{E}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}),\quad {\mathbf{x}}_{k}^{E}\in {\mathsf{\Omega}}_{{X}^{E}},\tag{4}$$

where ${\mathbf{x}}_{k}^{E}$ is called the extended state vector. Note that the extended state-space model fulfils the Markov property. The ORM tracking control problem is formulated in an optimal control framework. Let the infinite horizon cost function (c.f.) to be minimized starting with ${\mathbf{x}}_{0}^{E}$ be [6]

$${J}_{MR}^{\infty}({\mathbf{x}}_{0}^{E},\mathbf{\theta})=\sum _{k=0}^{\infty}{\gamma}^{k}{\parallel {\mathbf{y}}_{k}^{m}\left({\mathbf{x}}_{k}^{E}\right)-{\mathbf{y}}_{k}({\mathbf{x}}_{k}^{E},\mathbf{\theta})\parallel}_{2}^{2}=\sum _{k=0}^{\infty}{\gamma}^{k}{\parallel {\mathbf{\epsilon}}_{k}({\mathbf{x}}_{k}^{E},\mathbf{\theta})\parallel}_{2}^{2}.\tag{5}$$

In (5), the discount factor $0<\gamma \le 1$ sets the controller’s horizon; $\gamma <1$ is usually used in practice to guarantee learning convergence to optimal control. ${\parallel \mathbf{x}\parallel}_{2}=\sqrt{{\mathbf{x}}^{\top}\mathbf{x}}$ is the Euclidean norm of the column vector $\mathbf{x}$. ${\upsilon}_{MR}={\parallel {\mathbf{y}}_{k}^{m}\left({\mathbf{x}}_{k}^{E}\right)-{\mathbf{y}}_{k}\left({\mathbf{x}}_{k}^{E}\right)\parallel}_{2}^{2}$ is the stage cost, where the measurable ${\mathbf{y}}_{k}$ depends on ${\mathbf{x}}_{k}$ via the unknown $\mathbf{g}$ in (1), and ${\upsilon}_{MR}$ penalizes the deviation of ${\mathbf{y}}_{k}$ from the ORM’s output ${\mathbf{y}}_{k}^{m}$. In ORM tracking, the stage cost does not penalize the control effort with some positive definite function $W\left({\mathbf{u}}_{k}\right)>0$, for two reasons. Firstly, ORM tracking instills an inertia on the CS that indirectly acts as a regularizer on the control effort. Secondly, if the reference inputs ${\mathbf{r}}_{k}$ do not settle to zero, the ORM’s outputs do not either. For most processes, the corresponding constant steady-state control will then be non-zero, so penalizing it would make ${J}_{MR}^{\infty}\left(\mathbf{\theta}\right)$ infinite when $\gamma =1$.
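The role of the discount factor can be illustrated with a finite-horizon truncation of the cost (5). The function name and the constant-error data below are assumptions made for this sketch only: with a persistent stage cost, the undiscounted sum grows with the horizon, whereas the discounted sum stays below the geometric bound $\upsilon/(1-\gamma)$.

```python
import numpy as np

def discounted_tracking_cost(y_m, y, gamma):
    """Truncated version of (5): sum_k gamma^k * ||y_m[k] - y[k]||_2^2.

    y_m, y: arrays of shape (N, p) holding ORM outputs and CS outputs.
    """
    stage = np.sum((np.asarray(y_m) - np.asarray(y)) ** 2, axis=1)
    discounts = gamma ** np.arange(stage.size)
    return float(stage @ discounts)

# Hypothetical data: a constant tracking error of 0.1 over N = 100 samples.
N = 100
y_m = np.ones((N, 1))
y = 0.9 * np.ones((N, 1))

J_undiscounted = discounted_tracking_cost(y_m, y, gamma=1.0)  # grows linearly with N
J_discounted = discounted_tracking_cost(y_m, y, gamma=0.9)    # bounded by 0.01 / (1 - 0.9)
```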

Herein, $\mathbf{\theta}\in {\mathbb{R}}^{{n}_{\theta}}$ parameterizes a nonlinear state-feedback admissible controller [6] defined as ${\mathbf{u}}_{k}\stackrel{\Delta}{=}\mathbf{C}({\mathbf{x}}_{k}^{E},\mathbf{\theta})$, which, when used in (4), shows that all of the CS’s trajectories depend on $\mathbf{\theta}$. Any stabilizing controller sequence (or controller) rendering a finite c.f. is called admissible. A finite ${J}_{MR}^{\infty}$ holds if ${\mathbf{\epsilon}}_{k}$ is a square-summable sequence, which is ensured by an asymptotically stabilizing controller if $\gamma =1$ or by a stabilizing controller if $\gamma <1$. ${J}_{MR}^{\infty}\left(\mathbf{\theta}\right)$ in (5) is the value function of using the controller $\mathbf{C}\left(\mathbf{\theta}\right)$. Let the optimal controller ${\mathbf{u}}_{k}^{*}=\mathbf{C}({\mathbf{x}}_{k}^{E},{\mathbf{\theta}}^{*})$ that minimizes (5) be

$${\mathbf{\theta}}^{*}=\arg\underset{\mathbf{\theta}}{\mathrm{min}}\,{J}_{MR}^{\infty}({\mathbf{x}}_{0}^{E},\mathbf{\theta}).\tag{6}$$

A nonlinear ORM can also be tracked; however, tracking a linear one renders a highly desirable indirect feedback linearization of the CS, where a linear CS’s behavior generalizes well over wide operating ranges [1]. The ORM tracking control problem of this work should then make ${\upsilon}_{MR}\approx 0$ when ${\mathbf{r}}_{k}$ drives both the CS and the ORM.

Under classical control rules, following Comment 1, the process time delay and non-minimum-phase (NMP) character should be accounted for in $\mathbf{M}\left(z\right)$. However, the NMP zeros make $\mathbf{M}\left(z\right)$ non-invertible, in addition to requiring their identification, thus placing a burden on the subsequent VRFT IO control design [45]. This motivates the MP assumption on the process.

Depending on the learning context, the user may select a piece-wise constant generative model for the reference input signal such as ${\mathbf{r}}_{k+1}={\mathbf{r}}_{k}$, or a ramp-like model, a sine-like model, etc. In all cases, the states of the generative model are known, measurable and need to be introduced in the extended state vector, to fulfill the Markov property of the extended state-space model. In many practical applications, for the ORM tracking problem, the CS’s outputs are required to track the ORM’s outputs when both the ORM and the CS are driven by the piece-wise constant reference input signal expressed by a generative model of the form ${\mathbf{r}}_{k+1}={\mathbf{r}}_{k}$. This generative model will be used subsequently in this paper for learning ORM tracking controllers. Obviously, the learnt solution will depend on the proposed reference input generative model, while changing this model requires re-learning.
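A minimal sketch of one transition of the extended model (4) under the piece-wise constant generative model ${\mathbf{r}}_{k+1}={\mathbf{r}}_{k}$ is given below. The first-order lag ORM with pole `a_m`, the toy process `f_toy`, and the function name `extended_step` are hypothetical choices made for illustration; in the model-free setting the process map is never evaluated in code, only sampled through experiments.

```python
import numpy as np

def extended_step(x_ext, u, f, a_m, n, p):
    """One transition of x^E = [x, x^m, r] (hypothetical layout).

    f    : process map (x, u) -> x_next, a stand-in for the unknown dynamics
    a_m  : pole of a unit-gain first-order lag ORM, one lag per output channel
    """
    x, x_m, r = x_ext[:n], x_ext[n:n + p], x_ext[n + p:]
    x_next = f(x, u)
    x_m_next = a_m * x_m + (1.0 - a_m) * r   # ORM state, unit steady-state gain
    r_next = r.copy()                        # generative model r_{k+1} = r_k
    return np.concatenate([x_next, x_m_next, r_next])

# Toy scalar example (assumed): n = 1 state, p = 1 output, reference held at 1.
f_toy = lambda x, u: 0.8 * x + 0.2 * u
x_ext0 = np.array([0.0, 0.0, 1.0])
x_ext1 = extended_step(x_ext0, np.array([1.0]), f_toy, a_m=0.5, n=1, p=1)
```

Because the reference is part of the state, a step change in the reference is simply a new initial condition for the extended model, which is why piece-wise constant references fit this formulation.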

## 3. Solution to the ORM Tracking Problem

For unknown extended process dynamics (4), minimization of (5) can be tackled using an iterative model-free approximate Value Iteration (IMF-AVI). A c.f. that extends ${J}_{MR}^{\infty}\left({\mathbf{x}}_{k}^{E}\right)$ called the Q-function (or action-value function) is first defined for each state-action pair. Let the Q-function of acting as ${\mathbf{u}}_{k}$ in state ${\mathbf{x}}_{k}^{E}$ and then following the control (policy) ${\mathbf{u}}_{k}=\mathbf{C}\left({\mathbf{x}}_{k}^{E}\right)$ be

$${Q}^{\mathbf{C}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}^{\mathbf{C}}({\mathbf{x}}_{k+1}^{E},\mathbf{C}\left({\mathbf{x}}_{k+1}^{E}\right)).\tag{7}$$

The optimal Q-function ${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ corresponding to the optimal controller obeys Bellman’s optimality equation

$${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\underset{\mathbf{C}(.)}{\mathrm{min}}\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}^{\mathbf{C}}({\mathbf{x}}_{k+1}^{E},\mathbf{C}\left({\mathbf{x}}_{k+1}^{E}\right))\},\tag{8}$$

where the optimal controller and the optimal Q-function are

$${\mathbf{u}}_{k}^{*}={\mathbf{C}}^{*}\left({\mathbf{x}}_{k}^{E}\right)=\arg\underset{\mathbf{C}}{\mathrm{min}}\,{Q}^{\mathbf{C}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}),\quad {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\underset{\mathbf{C}(.)}{\mathrm{min}}\,{Q}^{\mathbf{C}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}).\tag{9}$$

Then, for ${J}_{MR}^{\infty ,*}={\mathrm{min}}_{\mathbf{u}}{J}_{MR}^{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ it follows that ${J}_{MR}^{\infty ,*}={Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}^{*}={\mathbf{C}}^{*}\left({\mathbf{x}}_{k}^{E}\right))$. This implies that finding ${Q}^{*}$ is equivalent to finding the optimal c.f. ${J}_{MR}^{\infty ,*}$.

The optimal Q-function and optimal controller can be found using either Policy Iteration (PoIt) or Value Iteration (VI) strategies. For continuous state-action spaces, IMF-AVI is one possible solution, using different linear and/or nonlinear parameterizations for the Q-function and/or for the controller. NNs are the most widely used nonlinearly parameterized function approximators. As is well known, VI alternates two steps: the Q-function estimate update step and the controller improvement step. Several Q-function parameterizations allow for explicit analytic calculation of the improved controller as the solution of the following optimization problem

$$\tilde{\mathbf{C}}({\mathbf{x}}_{k}^{E},\mathbf{\pi})=\arg\underset{\mathbf{C}}{\mathrm{min}}\,{Q}^{\mathbf{C}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},\mathbf{\pi}),\tag{10}$$

by directly minimizing ${Q}^{\mathbf{C}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},\mathbf{\pi})$ w.r.t. ${\mathbf{u}}_{k}$, where the parameterization $\mathbf{\pi}$ has been moved from the controller into the Q-function. Equation (10) is the controller improvement step specific to both the PoIt and VI algorithms. In these special cases, it is possible to eliminate the controller approximator and use a single approximator, for the Q-function. Then, given a dataset D of transition samples, $D=\left\{({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},{\mathbf{x}}_{k+1}^{E})\right\},k=\overline{1,N}$, the IMF-AVI amounts to solving the following optimization problem (OP) at every iteration j

$${\mathbf{\pi}}_{j+1}=\arg\underset{\mathbf{\pi}}{\mathrm{min}}\sum _{k=1}^{N}{(Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},\mathbf{\pi})-\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})-\gamma Q({\mathbf{x}}_{k+1}^{E},\tilde{\mathbf{C}}({\mathbf{x}}_{k+1}^{E},{\mathbf{\pi}}_{j}),{\mathbf{\pi}}_{j}))}^{2},\tag{11}$$

which is a Bellman residual minimization problem where the (usually separate) controller improvement step is now embedded inside the OP (11). More explicitly, for a linear parameterization $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},\mathbf{\pi})={\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\mathbf{\pi}$ using a set of ${n}_{\mathsf{\Phi}}$ basis functions of the form ${\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=[{\mathsf{\Phi}}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}),\dots ,{\mathsf{\Phi}}_{{n}_{\mathsf{\Phi}}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})]$, solving (11) is equivalent to solving the following over-determined linear system of equations w.r.t. ${\mathbf{\pi}}_{j+1}$ in the least-squares sense:

$$\left[\begin{array}{c}{\mathsf{\Phi}}^{\top}({\mathbf{x}}_{1}^{E},{\mathbf{u}}_{1})\\ \dots \\ {\mathsf{\Phi}}^{\top}({\mathbf{x}}_{N}^{E},{\mathbf{u}}_{N})\end{array}\right]{\mathbf{\pi}}_{j+1}=\left[\begin{array}{c}\upsilon ({\mathbf{x}}_{1}^{E},{\mathbf{u}}_{1})+\gamma {\mathsf{\Phi}}^{\top}({\mathbf{x}}_{2}^{E},\tilde{\mathbf{C}}({\mathbf{x}}_{2}^{E},{\mathbf{\pi}}_{j})){\mathbf{\pi}}_{j}\\ \dots \\ \upsilon ({\mathbf{x}}_{N}^{E},{\mathbf{u}}_{N})+\gamma {\mathsf{\Phi}}^{\top}({\mathbf{x}}_{N+1}^{E},\tilde{\mathbf{C}}({\mathbf{x}}_{N+1}^{E},{\mathbf{\pi}}_{j})){\mathbf{\pi}}_{j}\end{array}\right].\tag{12}$$
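The least-squares iteration (12) can be sketched on a toy problem. Everything specific below is an assumption for illustration rather than the paper's case study: a scalar linear process used only to generate transitions, a quadratic basis $[x^2, xu, u^2]$ whose minimizer over $u$ is available in closed form, and a stage cost $x^2 + 0.1u^2$. The small control penalty departs from the ORM stage cost of the paper and is added only so that the closed-form minimizer is well defined from the first iteration.

```python
import numpy as np

def features(x, u):
    # Quadratic basis Phi(x, u) = [x^2, x*u, u^2], one row per sample.
    return np.stack([x * x, x * u, u * u], axis=-1)

def greedy_action(x, pi):
    # Closed-form minimizer over u of pi[0]*x^2 + pi[1]*x*u + pi[2]*u^2.
    if pi[2] <= 1e-12:            # flat in u (e.g., pi = 0): any u is a minimizer
        return np.zeros_like(x)
    return -pi[1] * x / (2.0 * pi[2])

def lp_imf_avi(x, u, x_next, stage_cost, gamma, iters=200):
    """Repeatedly solve the over-determined system (12) in the least-squares sense."""
    Phi = features(x, u)
    pi = np.zeros(3)              # pi_0 = 0, i.e., Q_0 = 0
    for _ in range(iters):
        u_next = greedy_action(x_next, pi)                   # improvement step (10)
        targets = stage_cost + gamma * features(x_next, u_next) @ pi
        pi, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # update step (11)
    return pi

# Transition dataset collected from a toy process (assumed, used only as data).
rng = np.random.default_rng(0)
a, b, gamma = 0.9, 0.5, 0.9
x = rng.uniform(-1.0, 1.0, 500)
u = rng.uniform(-1.0, 1.0, 500)
x_next = a * x + b * u
cost = x ** 2 + 0.1 * u ** 2

pi = lp_imf_avi(x, u, x_next, cost, gamma)
gain = pi[1] / (2.0 * pi[2])      # learned state-feedback law: u = -gain * x
```

In this toy setting the basis matches the true Q-function exactly, so the learned parameters satisfy the discounted Riccati equation; for a nonlinear process the choice of basis functions becomes the main design burden.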

In conclusion, starting with an initial parameterization ${\mathbf{\pi}}_{0}$, the IMF-AVI approach with a linearly parameterized Q-function that allows explicit calculation of the controller improvement as in (10) embeds both VI steps into solving (12). Linearly parameterized IMF-AVI (LP-IMF-AVI) will be validated in the case study and compared to nonlinearly parameterized IMF-AVI (NP-IMF-AVI). Convergence of the generally formulated IMF-AVI is analysed next under approximation errors.

#### IMF-AVI Convergence Analysis with Approximation Errors for ORM Tracking

The proposed iterative model-free VI-based Q-learning Algorithm 1 consists of the following steps.

**Algorithm 1** VI-based Q-learning.

S1: Initialize controller ${\mathbf{C}}_{0}$ and the Q-function value to ${Q}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0$, initialize iteration index $j=1$

S2: Use one step backup equation for the Q-function as in (13)

S3: Improve the controller using Equation (14)

S4: Set $j=j+1$ and repeat steps S2, S3, until convergence

The steps are detailed as follows:

S1. Select an initial (not necessarily admissible) controller ${\mathbf{C}}_{0}$ and an initialization value ${Q}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0$ of the Q-function. Initialize iteration $j=1$.

S2. Use the one-step backup equation for the Q-function

$$\begin{array}{rl}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})&=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))\\ &=\underset{\mathbf{u}}{\mathrm{min}}\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}\end{array}\tag{13}$$

S3. Improve the controller using the equation

$${\mathbf{C}}_{j}\left({\mathbf{x}}_{k}^{E}\right)=\arg\underset{\mathbf{u}}{\mathrm{min}}\,{Q}_{j}({\mathbf{x}}_{k}^{E},\mathbf{u}).\tag{14}$$

S4. Set $j=j+1$ and repeat steps S2, S3, until convergence.
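Steps S1 to S4 can be sketched on a small finite problem where the Q-function is stored exactly in a table. The five-state chain, the three actions, and the quadratic stage cost are assumptions chosen for illustration; the paper targets continuous spaces with function approximators.

```python
import numpy as np

n_states, actions, gamma = 5, (-1, 0, 1), 0.9

def step(s, a):
    # Deterministic toy transition: move along a chain, clipped at both ends.
    return min(max(s + a, 0), n_states - 1)

def stage_cost(s, a):
    # Penalize the distance of the current state from the target state 0.
    return float(s * s)

Q = np.zeros((n_states, len(actions)))            # S1: Q_0 = 0
for j in range(1, 501):
    Q_prev = Q.copy()
    for s in range(n_states):
        for i, a in enumerate(actions):
            s_next = step(s, a)
            # S2: one-step backup; min over u equals Q_{j-1}(s', C_{j-1}(s'))
            Q[s, i] = stage_cost(s, a) + gamma * Q_prev[s_next].min()
    C = Q.argmin(axis=1)                          # S3: greedy controller (action indices)
    if np.max(np.abs(Q - Q_prev)) < 1e-10:        # S4: stop when the table no longer changes
        break

V = Q.min(axis=1)                                 # optimal cost-to-go for each state
```

The greedy controller moves every non-zero state one step towards 0, and the resulting cost-to-go follows the recursion $V(s)=s^{2}+\gamma V(s-1)$.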

**Lemma 1.** For an arbitrary sequence of controllers $\left\{{\mathbf{\kappa}}_{j}\right\}$ define the VI-update for extended c.f. ${\xi}_{j}$ as [46]

$${\xi}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\xi}_{j}({\mathbf{x}}_{k+1}^{E},{\mathbf{\kappa}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right)).\tag{15}$$

If ${Q}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={\xi}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0$, then ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$.

**Proof.** It is valid that

$$\begin{array}{rl}{Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})&=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \stackrel{0}{\overbrace{{Q}_{0}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{0}\left({\mathbf{x}}_{k+1}^{E}\right))}}\\ &=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \stackrel{0}{\overbrace{{\xi}_{0}({\mathbf{x}}_{k+1}^{E},{\mathbf{\kappa}}_{0}\left({\mathbf{x}}_{k+1}^{E}\right))}}={\xi}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}).\end{array}\tag{16}$$

This means that ${Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\xi}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Assume by induction that ${Q}_{j-1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\xi}_{j-1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Then

$$\begin{array}{rl}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})&=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))\\ &\le \upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{\kappa}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))\\ &\le \upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\xi}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{\kappa}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))={\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}),\end{array}\tag{17}$$

which completes the proof. Here, it was used that ${\mathbf{C}}_{j-1}\left({\mathbf{x}}_{k}^{E}\right)$ is the optimal controller for ${Q}_{j-1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ per (14); hence, for any other controller $\mathbf{C}\left({\mathbf{x}}_{k}^{E}\right)$ (in particular, it can also be ${\mathbf{\kappa}}_{j-1}\left({\mathbf{x}}_{k}^{E}\right)$), it follows that

$${Q}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))\le {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{C}\left({\mathbf{x}}_{k+1}^{E}\right)).\tag{18}$$

□

**Lemma 2.** For the sequence $\{{Q}_{j}\}$ from (13), under controllability assumption A1, it is valid that:

- (1) $0\le {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ with $B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ an upper bound.
- (2) If there exists a solution ${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ to (8), then $0\le {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$.

**Proof.**

For any fixed admissible controller $\mathbf{\eta}\left({\mathbf{x}}_{k}^{E}\right)$, ${Q}^{\mathbf{\eta}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}^{\eta}({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))$ is the Bellman equation. Update (13) renders

$$\begin{array}{cc}\hfill {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})& =\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j-1}\left({\mathbf{x}}_{k+1}^{E}\right))\stackrel{\left(18\right)}{\le}\hfill \\ & \stackrel{\left(18\right)}{\le}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))\hfill \\ \hfill {Q}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))& =\upsilon ({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))+\gamma {Q}_{j-2}({\mathbf{x}}_{k+2}^{E},{\mathbf{C}}_{j-2}\left({\mathbf{x}}_{k+2}^{E}\right))\stackrel{\left(18\right)}{\le}\hfill \\ & \stackrel{\left(18\right)}{\le}\upsilon ({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))+\gamma {Q}_{j-2}({\mathbf{x}}_{k+2}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+2}^{E}\right))\hfill \\ & \dots \hfill \\ \hfill {Q}_{1}({\mathbf{x}}_{k+j-1}^{E},\mathbf{\eta}({\mathbf{x}}_{k+j-1}^{E}))& =\upsilon ({\mathbf{x}}_{k+j-1}^{E},\mathbf{\eta}({\mathbf{x}}_{k+j-1}^{E}))+\gamma \stackrel{0}{\overbrace{{Q}_{0}({\mathbf{x}}_{k+j}^{E},{\mathbf{C}}_{0}({\mathbf{x}}_{k+j}^{E}))}}\hfill \end{array}$$

Replacing from the last inequality towards the first it follows that

$$\begin{array}{cc}\hfill {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})& \le \upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \upsilon ({\mathbf{x}}_{k+1}^{E},\mathbf{\eta}\left({\mathbf{x}}_{k+1}^{E}\right))+\dots +{\gamma}^{j-1}\upsilon ({\mathbf{x}}_{k+j-1}^{E},\mathbf{\eta}({\mathbf{x}}_{k+j-1}^{E}))\hfill \\ & \le \upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\sum _{i=1}^{\infty}{\gamma}^{i}\upsilon ({\mathbf{x}}_{k+i}^{E},\mathbf{\eta}({\mathbf{x}}_{k+i}^{E}))={Q}^{\mathbf{\eta}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}),\hfill \end{array}$$

then, setting ${Q}^{\mathbf{\eta}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ proves the first part of Lemma 2.

Among all admissible controllers, the optimal one renders the Q-function with the lowest value therefore ${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{\mathbf{\eta}}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. If $\mathbf{\eta}\left({\mathbf{x}}_{k}^{E}\right)={\mathbf{C}}^{*}\left({\mathbf{x}}_{k}^{E}\right)$ is the optimal controller, it follows that ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Then the second part of Lemma 2 follows as $0\le {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le B({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. □

**Theorem 1.** For the extended process (4) under A1, A2, with c.f. (5), with the sequences {${\mathit{C}}_{j}$} and {${Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$} generated by the Q-learning Algorithm 1, it is true that:

- (1) {${Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$} is a non-decreasing sequence for which ${Q}_{j+1}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})\ge {Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$ holds, $\forall j,\forall ({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$ and
- (2) ${\mathrm{lim}}_{j\to \infty}{\mathit{C}}_{j}\left({\mathit{x}}_{k}^{E}\right)={\mathit{C}}^{*}\left({\mathit{x}}_{k}^{E}\right)$ and ${\mathrm{lim}}_{j\to \infty}{Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})={Q}^{*}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$.

**Proof.**

Let ${Q}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={\xi}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0$ and assume the update

$${\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\xi}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right)).$$

By induction it is shown that ${Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\ge {\xi}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ since

$${Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{0}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{0}\left({\mathbf{x}}_{k+1}^{E}\right))=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \cdot 0\ge 0={\xi}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}).$$

Assume next that ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\ge {\xi}_{j-1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ and show that

$$\begin{array}{c}\hfill {Q}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})-{\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right))-\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})-\\ \hfill \gamma {\xi}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right))=\gamma [{Q}_{j}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right))-{\xi}_{j-1}({\mathbf{x}}_{k+1}^{E},{\mathbf{C}}_{j}\left({\mathbf{x}}_{k+1}^{E}\right))]\ge 0.\end{array}$$

The expression above leads to ${Q}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\ge {\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Since by Lemma 1 one has that ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$, it follows that ${Q}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\ge {\xi}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\ge {Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$, proving the first part of Theorem 1.

Any non-decreasing, upper-bounded sequence must have a limit, thus ${\mathrm{lim}}_{j\to \infty}{\mathbf{C}}_{j}={\mathbf{C}}_{\infty}$ and ${\mathrm{lim}}_{j\to \infty}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ with ${\mathbf{C}}_{\infty}$ an admissible controller. For any admissible controller $\mathbf{\eta}\left({\mathbf{x}}_{k}^{E}\right)={\mathbf{C}}_{\infty}\left({\mathbf{x}}_{k}^{E}\right)$ that is non-optimal, it follows from (20) that ${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Still, part 2 of Lemma 2 states that ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$, implying ${Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Then from ${Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ it must hold true that ${Q}_{\infty}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ and ${\mathbf{C}}_{\infty}\left({\mathbf{x}}_{k}^{E}\right)={\mathbf{C}}^{*}\left({\mathbf{x}}_{k}^{E}\right)$, which proves the second part of Theorem 1. □
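The monotone and bounded behaviour stated in Theorem 1 can be checked numerically on a toy problem. The sketch below is not the paper's case study: it assumes a scalar regulation problem $x_{k+1}=ax_{k}+bu_{k}$ with stage cost $x_{k}^{2}+\rho u_{k}^{2}$, where the control penalty $\rho>0$ is an assumption added only so that the minimizer is unique. Under these assumptions, VI started from $Q_{0}=0$ gives $Q_{j}(x,u)=x^{2}+\rho u^{2}+\gamma p_{j-1}{(ax+bu)}^{2}$ and ${\mathrm{min}}_{u}Q_{j}(x,u)=p_{j}x^{2}$, so the whole Q-function sequence is described by the scalar sequence $\{p_{j}\}$. The function name `vi_scalar_lq` is hypothetical.

```python
def vi_scalar_lq(a=0.8187, b=0.1813, rho=0.1, gamma=0.9, iters=200):
    """Exact value iteration for x+ = a*x + b*u with stage cost x^2 + rho*u^2.

    Starting from Q_0 = 0 (p_0 = 0), min_u Q_j(x, u) = p_j * x^2 with
    p_j = 1 + gamma*a^2*rho*p_{j-1} / (rho + gamma*b^2*p_{j-1}).
    """
    p = [0.0]
    for _ in range(iters):
        prev = p[-1]
        p.append(1.0 + gamma * a**2 * rho * prev / (rho + gamma * b**2 * prev))
    return p


p_seq = vi_scalar_lq()
# Increments p_{j+1} - p_j: Theorem 1 predicts they are non-negative.
increments = [p_seq[j + 1] - p_seq[j] for j in range(len(p_seq) - 1)]
print("first iterates:", p_seq[:4])
print("last iterate:  ", p_seq[-1])
```

Since $Q_{j}(x,u)$ is increasing in $p_{j-1}$ for every $(x,u)$, a non-decreasing $\{p_{j}\}$ implies a non-decreasing $\{Q_{j}\}$ in this toy setting.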

Comment 2.

Update (13) is practically solved in the sense of the OP (11) (either as a linear or nonlinear regression) using a batch (dataset) of transition samples collected from the process using any controller, that is, in off-policy mode, while the controller improvement step (14) can be solved either as a regression or explicitly (analytically) when the expression of ${Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ allows it. Moreover, (13) and (14) can be solved batch-wise in either online or offline mode. When the batch of transition samples is updated with one sample at a time, the VI-scheme becomes adaptive.

Comment 3.

Theorem 1 proves the VI-based learning convergence of the sequence of Q-functions ${\mathrm{lim}}_{j\to \infty}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ assuming that the true Q-function parameterization is used. In practice, this is rarely possible; one exception is the case of LTI systems. For general nonlinear processes of type (1), different function approximators are employed for the Q-function, most commonly NNs. Then the VI Q-learning scheme converges to a suboptimal controller and to a suboptimal Q-function, owing to the approximation errors. A generic convergence proof of the learning scheme under approximation errors is shown next, accounting for general Q-function parameterizations [47].

Let the IMF-AVI Algorithm 2 consist of the following steps.

Algorithm 2 IMF-AVI.

S1: Initialize controller ${\tilde{\mathbf{C}}}_{0}$ and Q-function value ${\tilde{Q}}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0,\forall ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Initialize iteration $j=1$

S2: Update the approximate Q-function using Equation (24)

S3: Improve the approximate controller using Equation (25)

S4: Set $j=j+1$ and repeat steps S2, S3, until convergence

To be detailed as follows:

S1. Select an initial (not necessarily admissible) controller ${\tilde{\mathbf{C}}}_{0}$ and an initialization value ${\tilde{Q}}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=0,\forall ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ of the Q-function. Initialize iteration $j=1$.

S2. Use the update equation for the approximate Q-function

$$\begin{array}{c}\hfill {\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j-1}({\mathbf{x}}_{k+1}^{E},{\tilde{\mathbf{C}}}_{j-1}({\mathbf{x}}_{k+1}^{E}))+{\delta}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\\ \hfill =\underset{\mathbf{u}}{\mathrm{min}}\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}+{\delta}_{j}\end{array}$$

S3. Improve the approximate controller using

$${\tilde{\mathbf{C}}}_{j}\left({\mathbf{x}}_{k}^{E}\right)=arg\underset{\mathbf{u}}{\mathrm{min}}{\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},\mathbf{u})$$

S4. Set $j=j+1$ and repeat steps S2, S3, until convergence.

Comment 4.

In Algorithm 2, the sequences $\{{\tilde{\mathbf{C}}}_{j}({\mathbf{x}}_{k}^{E})\}$ and $\{{\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},\mathbf{u})\}$ are approximations of the true sequences $\{{\mathbf{C}}_{j}({\mathbf{x}}_{k}^{E})\}$ and $\{{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\}$. Since the true Q-function and controller parameterizations are not generally known, (24) must be solved in the sense of the OP (11) with respect to the unknown ${\tilde{Q}}_{j}$, in order to minimize the residuals ${\delta}_{j}$ at each iteration. If the true parameterizations of the Q-function and of the controller were known, then ${\delta}_{j}=0$ and the IMF-AVI updates (24), (25) coincide with (13), (14), respectively. Next, let the following assumption hold.

A3. There exist two positive scalar constants $\underline{\psi},\overline{\psi}$ such that $0<\underline{\psi}\le 1\le \overline{\psi}<\infty $, ensuring

$$\begin{array}{c}\hfill \underset{\mathbf{u}}{\mathrm{min}}\{\underline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}\le {\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le \\ \hfill \underset{\mathbf{u}}{\mathrm{min}}\{\overline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}.\end{array}$$

Comment 5.

Inequalities from (26) account for nonzero positive or negative residuals ${\delta}_{j}$, i.e., for the approximation errors in the Q-function, since ${\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ can over- or under-estimate ${\mathrm{min}}_{\mathbf{u}}\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j-1}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}$ in (24). $\underline{\psi},\overline{\psi}$ can span large intervals ($\underline{\psi}$ close to 0 and $\overline{\psi}$ very large). The hope is that, if $\underline{\psi},\overline{\psi}$ are close to 1 (meaning low approximation errors), then the entire IMF-AVI process preserves ${\delta}_{j}\approx 0$. In practice, this amounts to using high-performance approximators. For example, with NNs, adding more layers and more neurons enhances the approximation capability and theoretically reduces the residuals in (24).

**Theorem 2.** Let the sequences $\{{\tilde{\mathit{C}}}_{j}({\mathit{x}}_{k}^{E})\}$ and $\{{\tilde{Q}}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})\}$ evolve as in (24), (25), the sequences $\{{\mathit{C}}_{j}({\mathit{x}}_{k}^{E})\}$ and $\{{Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})\}$ evolve as in (13), (14). Initialize ${\tilde{Q}}_{0}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})={Q}_{0}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})=0,\forall ({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$ and let A3 hold. Then

$$\underline{\psi}{Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})\le {\tilde{Q}}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})\le \overline{\psi}{Q}_{j}({\mathit{x}}_{k}^{E},{\mathit{u}}_{k})$$

**Proof.**

First, the development proceeds by induction for the left inequality. For $j=0$ it is clear that $\underline{\psi}{Q}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\tilde{Q}}_{0}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. For $j=1$, (13) produces ${Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ and the left-hand side of (26) reads ${\mathrm{min}}_{\mathbf{u}}\{\underline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+0\}\le {\tilde{Q}}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Then $\underline{\psi}{Q}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\tilde{Q}}_{1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$. Next assume that

$$\underline{\psi}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$$

holds at iteration j. Based on (28) used in (26), it is valid that

$$\begin{array}{c}\hfill \underset{\mathbf{u}}{\mathrm{min}}\{\underline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \underline{\psi}{Q}_{j}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}\le \\ \hfill \underset{\mathbf{u}}{\mathrm{min}}\{\underline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {\tilde{Q}}_{j}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}\le {\tilde{Q}}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k}).\end{array}$$

Notice from (29) that

$$\begin{array}{c}\hfill \underset{\mathbf{u}}{\mathrm{min}}\{\underline{\psi}\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma \underline{\psi}{Q}_{j}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}=\underline{\psi}\underset{\mathbf{u}}{\mathrm{min}}\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma {Q}_{j}({\mathbf{x}}_{k+1}^{E},\mathbf{u})\}\\ \hfill \stackrel{\left(13\right)}{=}\underline{\psi}{Q}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\end{array}$$

From (29), (30) it follows that $\underline{\psi}{Q}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\le {\tilde{Q}}_{j+1}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ proving the left side of (27) by induction. The right side of (27) is shown similarly, proving Theorem 2. □
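The band (27) can be illustrated on a small finite problem. The sketch below is an assumption-laden toy, not the paper's setup: it builds a hypothetical deterministic process with a handful of discrete states and actions and non-negative stage costs, runs the exact update (13) next to a perturbed update in which the stage cost is multiplied at every iteration by a random factor in $[\underline{\psi},\overline{\psi}]$ (one simple way to satisfy A3), and checks that $\underline{\psi}{Q}_{j}\le {\tilde{Q}}_{j}\le \overline{\psi}{Q}_{j}$ holds at every iteration. All names are illustrative.

```python
import random


def vi_band_check(n_states=6, n_actions=3, gamma=0.9,
                  psi_lo=0.8, psi_hi=1.3, iters=50, seed=0):
    """Run exact and perturbed value iteration side by side; verify band (27)."""
    rng = random.Random(seed)
    # Hypothetical deterministic process: successor state and stage cost >= 0.
    nxt = [[rng.randrange(n_states) for _ in range(n_actions)]
           for _ in range(n_states)]
    cost = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]

    Q = [[0.0] * n_actions for _ in range(n_states)]    # exact sequence, Q_0 = 0
    Qt = [[0.0] * n_actions for _ in range(n_states)]   # approximate sequence
    band_holds = True
    for _ in range(iters):
        V = [min(row) for row in Q]      # min_u Q_{j-1}(x', u)
        Vt = [min(row) for row in Qt]    # min_u Q~_{j-1}(x', u)
        Q = [[cost[s][a] + gamma * V[nxt[s][a]] for a in range(n_actions)]
             for s in range(n_states)]
        # Perturbed update: a factor psi in [psi_lo, psi_hi] on the stage cost
        # plays the role of the residual delta_j in (24) and satisfies (26).
        Qt = [[rng.uniform(psi_lo, psi_hi) * cost[s][a] + gamma * Vt[nxt[s][a]]
               for a in range(n_actions)] for s in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                lo = psi_lo * Q[s][a] - 1e-12
                hi = psi_hi * Q[s][a] + 1e-12
                band_holds = band_holds and (lo <= Qt[s][a] <= hi)
    return band_holds, Q, Qt


band_ok, Q_exact, Q_approx = vi_band_check()
print("band (27) held at every iteration:", band_ok)
```

The perturbed sequence is not required to converge in this sketch; it only has to stay inside the band around the exact one, which is what Theorem 2 states.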

Comment 6.

Theorem 2 shows that the trajectory of $\{{\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\}$ closely follows that of $\{{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\}$ within a band set by $\underline{\psi},\overline{\psi}$. It does not ensure that $\{{\tilde{Q}}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\}$ converges to a steady-state value, but in the worst case, it oscillates around ${Q}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={\mathrm{lim}}_{j\to \infty}{Q}_{j}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ in a band that can be made arbitrarily small by using powerful approximators. By minimizing both sides of (27) over ${\mathbf{u}}_{k}$, similar conclusions result for the controller sequence $\{{\tilde{\mathbf{C}}}_{j}({\mathbf{x}}_{k}^{E})\}$ that closely follows $\{{\mathbf{C}}_{j}\left({\mathbf{x}}_{k}^{E}\right)\}$.

In the following Section, the IMF-AVI is validated on two illustrative examples. The provided theoretical analysis supports and explains the robust learning performance of the nonlinearly parameterized IMF-AVI with respect to the linearly parameterized one.

## 4. Validation Case Studies

#### 4.1. ORM Tracking for a Linear Process

A first introductory simple example of IMF-AVI for the ORM tracking of a first-order process motivates the more complex validation for the TITOAS process and offers insight into how the IMF-AVI solution scales up with the higher-order processes.

Let a scalar discrete-time process discretized at ${T}_{s}=0.1$ s be ${x}_{k+1}=0.8187{x}_{k}+0.1813{u}_{k}$. The continuous-time ORM $M\left(s\right)=1/(s+1)$, ZOH-discretized at the same ${T}_{s}$, leads to the extended process equivalent to (4) (output equations also given):

$$\left\{\begin{array}{c}{x}_{k+1}=0.8187{x}_{k}+0.1813{u}_{k},\hfill \\ {x}_{k+1}^{m}=0.9048{x}_{k}^{m}+0.09516{r}_{k},\hfill \\ {r}_{k+1}={r}_{k},\hfill \\ {y}_{k}={x}_{k},{y}_{k}^{m}={x}_{k}^{m},\hfill \end{array}\right.\iff {\mathbf{x}}_{k+1}^{E}=\mathbf{E}({\mathbf{x}}_{k}^{E},{u}_{k})$$

where a piece-wise constant reference input generative model is introduced to ensure that the extended process (31) has a fully measurable state.

For data collection, the ORM’s output ${y}_{k}^{m}$ is collected along with ${u}_{k}$, ${x}_{k}$ and the reference input ${r}_{k}$. The measurable extended state vector is then ${\mathbf{x}}_{k}^{E}={[{x}_{k},{x}_{k}^{m},{r}_{k}]}^{\top}$. A discretized version of an integral controller with t.f. $0.25/s$ at sampling period ${T}_{s}=0.1$ s closes the loop of the control system and asymptotically stabilizes it, while calculating the control input ${u}_{k}$ based on the feedback error ${e}_{k}={r}_{k}-{y}_{k}$. This CS setup is used for collecting transition samples of the form $D=\{({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},{\mathbf{x}}_{k+1}^{E})\}$. Data is collected for 500 s, with normally distributed random reference inputs having variance ${\sigma}_{r}^{2}=0.0951$, modeled as piece-wise constant steps that change their values every 20 s. Normally distributed white noise having variance ${\sigma}_{u}^{2}=4.96$ is added on the command ${u}_{k}$ at every time step to ensure a proper exploration by visiting as many combinations of states and actions as possible. Exploration has a critical role in the success of the IMF-AVI. A higher-amplitude additive noise on ${u}_{k}$ increases the chances that the approximate VI approach converges. The state transitions data collection is shown in Figure 1 for the first 1000 samples (100 s).

Notice that a reference input modeled as a sequence of constant amplitude steps is used for exploration purposes, for which it may not be possible to write ${r}_{k+1}={h}^{m}\left({r}_{k}\right)$ as a generative model. To solve this, all transition samples that correspond to the switching times of the reference input are eliminated, therefore, ${r}_{k+1}={r}_{k}$ can be considered as the piece-wise constant generative model of the reference input.

The control objective is to minimize ${J}_{MR}^{\infty}\left(\mathbf{\theta}\right)$ from (5) using the stage cost $\upsilon \left({\mathbf{x}}_{k}^{E}\right)={({y}_{k}^{m}-{y}_{k})}^{2}$ (where the outputs obviously depend on the extended states as per (31)), with the discount factor $\gamma =0.9$. Thus the overall objective is to find the optimal state-feedback controller ${\mathbf{u}}_{k}^{*}={\mathbf{C}}^{*}\left({\mathbf{x}}_{k}^{E}\right)$ that makes the feedback CS match the ORM.

The Q-function is linearly parameterized as $Q({\mathbf{x}}_{k}^{E},{u}_{k})={\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{u}_{k})\mathbf{\pi}$, with the quadratic basis functions vector constructed by the unique terms of the Kronecker product of all input arguments of $Q({\mathbf{x}}_{k}^{E},{u}_{k})$ as

$${\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{u}_{k})=[{x}_{k}^{2},{\left({x}_{k}^{m}\right)}^{2},{r}_{k}^{2},{u}_{k}^{2},{x}_{k}{x}_{k}^{m},{x}_{k}{r}_{k},{x}_{k}{u}_{k},{x}_{k}^{m}{r}_{k},{x}_{k}^{m}{u}_{k},{r}_{k}{u}_{k}]$$

with $\mathbf{\pi}\in {\mathbb{R}}^{10}$. The controller improvement step equivalent to explicitly minimizing the Q-function w.r.t. the control input ${u}_{k}$ is ${\tilde{u}}_{k}^{*}={\tilde{C}}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{\pi}}_{j})=-\frac{1}{2{\pi}_{4,j}}[{\pi}_{7,j},{\pi}_{9,j},{\pi}_{10,j}]{\mathbf{x}}_{k}^{E}$. This improved linear-in-the-state controller is embedded in the linear system of equations (12) that is solved for every iteration of IMF-AVI. Each iteration produces a new ${\mathbf{\pi}}_{j+1}$ that is tested on a test scenario where the uniformly random reference inputs have amplitude ${r}_{k}\in [-1;1]$ and switch every 10 s. The ORM tracking performance is then measured by the Euclidean vector norm $\parallel {y}_{k}^{m}-{y}_{k}{\parallel}_{2}$, while $\parallel {\mathbf{\pi}}_{j+1}-{\mathbf{\pi}}_{j}{\parallel}_{2}$ serves as a stopping condition when it drops below a prescribed threshold. The practically observed convergence process is shown in Figure 2 over the first 400 iterations, with $\parallel {\mathbf{\pi}}_{j+1}-{\mathbf{\pi}}_{j}{\parallel}_{2}$ still decreasing after 1000 iterations, while $\parallel {y}_{k}^{m}-{y}_{k}{\parallel}_{2}$ is very small right from the first iterations, making the process output practically overlap with the ORM’s output.
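The linearly parameterized IMF-AVI iteration for this first-order example can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the transitions are sampled independently at random instead of being collected in closed loop with the integral controller; the process equation is used only to generate data; the least-squares fit is done with `numpy.linalg.lstsq`; and the fallback $u=0$ used while the fitted Q-function does not yet depend on $u$ (first iterations) is an assumption. Function names (`phi`, `greedy_u`, `imf_avi`) are hypothetical.

```python
import numpy as np

A_P, B_P = 0.8187, 0.1813     # process coefficients (data generation only)
A_M, B_M = 0.9048, 0.09516    # discretized ORM coefficients


def phi(xe, u):
    """Quadratic basis: unique Kronecker terms of [x, xm, r, u], one row per sample."""
    x, xm, r = xe[:, 0], xe[:, 1], xe[:, 2]
    return np.column_stack([x*x, xm*xm, r*r, u*u,
                            x*xm, x*r, x*u, xm*r, xm*u, r*u])


def greedy_u(xe, pi, eps=1e-9):
    """Controller improvement: u = -[pi_7, pi_9, pi_10] xE / (2 pi_4)."""
    if abs(pi[3]) < eps:           # Q independent of u so far: assumed fallback
        return np.zeros(len(xe))
    return -(xe @ pi[[6, 8, 9]]) / (2.0 * pi[3])


def imf_avi(xe, u, xe_next, gamma=0.9, iters=10):
    """Fit Q_j(xE_k, u_k) to cost + gamma * Q_{j-1}(xE_{k+1}, C_{j-1}(xE_{k+1}))."""
    cost = (xe[:, 1] - xe[:, 0]) ** 2          # stage cost (ym - y)^2
    Phi = phi(xe, u)
    pi = np.zeros(10)                          # Q_0 = 0
    for _ in range(iters):
        u_next = greedy_u(xe_next, pi)
        target = cost + gamma * (phi(xe_next, u_next) @ pi)
        pi = np.linalg.lstsq(Phi, target, rcond=None)[0]
    return pi


rng = np.random.default_rng(0)
N = 2000
xe = rng.uniform(-1.0, 1.0, size=(N, 3))       # samples of [x, xm, r]
u = rng.uniform(-3.0, 3.0, size=N)             # exploratory inputs
xe_next = np.column_stack([A_P * xe[:, 0] + B_P * u,
                           A_M * xe[:, 1] + B_M * xe[:, 2],
                           xe[:, 2]])           # r_{k+1} = r_k
pi = imf_avi(xe, u, xe_next)
gain = -pi[[6, 8, 9]] / (2.0 * pi[3])          # u = gain @ [x, xm, r]
print("learned state-feedback gain:", gain)
```

Because the stage cost carries no control penalty and the quadratic basis can represent the Q-function of this LTI setup exactly, the fitted gain is expected to approach the one-step model-matching gain $[-0.8187,0.9048,0.09516]/0.1813$, for which ${x}_{k+1}={x}_{k+1}^{m}$.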

Comment 7.

For LTI processes with an LQR-like c.f., an LTI ORM and an LTI generative reference input model, the linear parameterization of the extended Q-function, $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})={\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})\mathbf{\pi}$, is the well-known [9] form $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=[{({\mathbf{x}}_{k}^{E})}^{\top},{({\mathbf{u}}_{k})}^{\top}]\mathbf{P}{[{({\mathbf{x}}_{k}^{E})}^{\top},{({\mathbf{u}}_{k})}^{\top}]}^{\top}$ of the quadratic Q-function, with parameter $\mathbf{\pi}=vec\left(\mathbf{P}\right)$ being the vectorized form of the symmetric positive-definite matrix $\mathbf{P}$, and the basis function vector ${\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ is obtained from the non-repeating terms of the Kronecker product of all the Q-function input arguments.

#### 4.2. IMF-AVI on the Nonlinear TITOAS Aerodynamic System

The ORM tracking problem on the more challenging TITOAS angular position control [48] (Figure 3) is addressed next. The azimuth (horizontal) motion behaves as an integrator while the pitch (vertical) positioning is affected differently by gravity for the up and down motions. Coupling between the two channels is present. A simplified deterministic continuous-time state-space model of this process is given as two coupled state-space sub-systems:

$$\begin{array}{c}\left\{\begin{array}{c}{\dot{\omega}}_{h}=(\mathrm{sat}\left({U}_{h}\right)-{M}_{h}\left({\omega}_{h}\right))/(2.5\cdot {10}^{-5}),\hfill \\ {\dot{K}}_{h}=0.216{F}_{h}\left({\omega}_{h}\right)\cos \left({\alpha}_{v}\right)-0.058{\mathsf{\Omega}}_{h}+0.0178\,\mathrm{sat}\left({U}_{v}\right)\cos \left({\alpha}_{v}\right),\hfill \\ {\mathsf{\Omega}}_{h}={K}_{h}/(0.0238{\cos}^{2}\left({\alpha}_{v}\right)+3\cdot {10}^{-3}),\hfill \\ {\dot{\alpha}}_{h}={\mathsf{\Omega}}_{h},\hfill \end{array}\right.\hfill \\ \left\{\begin{array}{c}{\dot{\omega}}_{v}=(\mathrm{sat}\left({U}_{v}\right)-{M}_{v}\left({\omega}_{v}\right))/(1.63\cdot {10}^{-4}),\hfill \\ {\dot{\mathsf{\Omega}}}_{v}=\frac{1}{0.03}\left(\begin{array}{c}0.2{F}_{v}\left({\omega}_{v}\right)-0.0127{\mathsf{\Omega}}_{v}-0.0935\sin {\alpha}_{v}+\\ -9.28\cdot {10}^{-6}{\mathsf{\Omega}}_{v}\left|{\omega}_{v}\right|+4.17\cdot {10}^{03}\,\mathrm{sat}\left({U}_{h}\right)-0.05\cos {\alpha}_{v}+\\ -0.021{\mathsf{\Omega}}_{h}^{2}\sin {\alpha}_{v}\cos {\alpha}_{v}-0.093\sin {\alpha}_{v}+0.05\end{array}\right)\hfill \\ {\dot{\alpha}}_{v}={\mathsf{\Omega}}_{v},\hfill \end{array}\right.\hfill \end{array}$$

where $\mathrm{sat}(\cdot )$ is the saturation function on $[-1;1]$, ${U}_{h}={u}_{1}$ is the azimuth motion control input, ${U}_{v}={u}_{2}$ is the vertical motion control input, ${\alpha}_{h}={y}_{1}\in [-\pi ,\pi ]$ (rad) is the azimuth angle output, ${\alpha}_{v}={y}_{2}\in [-\pi /2,\pi /2]$ (rad) is the pitch angle output, other states being described in [11,48]. The nonlinear static characteristics obtained by polynomial fitting from experimental data are, for ${\omega}_{v},{\omega}_{h}\in (-4000;4000)$:

$$\begin{array}{ccc}\hfill {M}_{v}\left({\omega}_{v}\right)& =& 9.05\times {10}^{-12}{\omega}_{v}^{3}+2.76\times {10}^{-10}{\omega}_{v}^{2}+1.25\times {10}^{-4}{\omega}_{v}+1.66\times {10}^{-4},\hfill \\ \hfill {F}_{v}\left({\omega}_{v}\right)& =& -1.8\times {10}^{-18}{\omega}_{v}^{5}-7.8\times {10}^{-16}{\omega}_{v}^{4}+4.1\times {10}^{-11}{\omega}_{v}^{3}+2.7\times {10}^{-8}{\omega}_{v}^{2}\hfill \\ & & +3.5\times {10}^{-5}{\omega}_{v}-0.014,\hfill \\ \hfill {M}_{h}\left({\omega}_{h}\right)& =& 5.95\times {10}^{-13}{\omega}_{h}^{3}-5.05\times {10}^{-10}{\omega}_{h}^{2}+1.02\times {10}^{-4}{\omega}_{h}+1.61\times {10}^{-3},\hfill \\ \hfill {F}_{h}\left({\omega}_{h}\right)& =& -2.56\times {10}^{-20}{\omega}_{h}^{5}+4.09\times {10}^{-17}{\omega}_{h}^{4}+3.16\times {10}^{-12}{\omega}_{h}^{3}-7.34\times {10}^{-9}{\omega}_{h}^{2}\hfill \\ & & +2.12\times {10}^{-5}{\omega}_{h}+9.13\times {10}^{-3}.\hfill \end{array}$$

A zero-order hold on the inputs and a sampler on the outputs of (33) lead to an equivalent MP discrete-time model of sampling time ${T}_{s}=0.1$ s and of relative degree 1 (one), suitable for input-state data collection

$$P:\left\{\begin{array}{c}{\mathbf{x}}_{k+1}=\mathbf{f}({\mathbf{x}}_{k},{\mathbf{u}}_{k}),\hfill \\ {\mathbf{y}}_{k}=\mathbf{g}\left({\mathbf{x}}_{k}\right)={[{\alpha}_{h,k},{\alpha}_{v,k}]}^{\top},\hfill \end{array}\right.$$

where ${\mathbf{x}}_{k}={[{\omega}_{h,k},{\mathsf{\Omega}}_{h,k},{\alpha}_{h,k},{\omega}_{v,k},{\mathsf{\Omega}}_{v,k},{\alpha}_{v,k}]}^{\top}\in {\mathbb{R}}^{6}$ and ${\mathbf{u}}_{k}={[{u}_{k,1},{u}_{k,2}]}^{\top}\in {\mathbb{R}}^{2}$. The process dynamics will not be used for learning the control in the following.

#### 4.3. Initial Controller with Model-Free VRFT

An initial model-free multivariable IO controller is first found using model-free VRFT, as described in [11,23,31]. The ORM is $\mathbf{M}\left(z\right)=diag({M}_{1}\left(z\right),{M}_{2}\left(z\right))$ where ${M}_{1}\left(z\right),{M}_{2}\left(z\right)$ are the discrete-time counterparts of ${M}_{1}\left(s\right)={M}_{2}\left(s\right)=1/(3s+1)$ obtained for a sampling period of ${T}_{s}=0.1$ s. The VRFT prefilter is chosen as $\mathbf{L}\left(z\right)=\mathbf{M}\left(z\right)$. A pseudo-random binary signal (PRBS) of amplitude $[-0.1;0.1]$ is used on both inputs ${u}_{k,1},{u}_{k,2}$ to excite the pitch and azimuth dynamics simultaneously in open loop, as shown in Figure 4. The IO data $\{{\tilde{\mathbf{u}}}_{k},{\tilde{\mathbf{y}}}_{k}\}$ is collected with low-amplitude zero-mean inputs ${u}_{k,1},{u}_{k,2}$, in order to keep the process in its linear regime around the mechanical equilibrium, so as to fit the linear VRFT design framework.

A diagonal (non-decoupling) linear output-feedback error controller with the parameters computed by the VRFT approach is

$$\begin{array}{c}\mathbf{C}(z,\mathbf{\theta})=\left[\begin{array}{cc}{P}_{11}\left(z\right)/(1-{z}^{-1})& 0\\ 0& {P}_{22}\left(z\right)/(1-{z}^{-1})\end{array}\right],\hfill \\ {P}_{11}\left(z\right)=2.9341-5.8689{z}^{-1}+3.9303{z}^{-2}-0.9173{z}^{-3}-0.0777{z}^{-4},\hfill \\ {P}_{22}\left(z\right)=0.6228-1.1540{z}^{-1}+0.5467{z}^{-2},\hfill \end{array}$$

where the parameter vector $\mathbf{\theta}$ groups all the coefficients of ${P}_{11}\left(z\right),{P}_{22}\left(z\right)$. Controller (36) is obtained for $\mathbf{\theta}$ as the least squares minimizer of ${J}_{VR}\left(\mathbf{\theta}\right)={\sum}_{k=1}^{N}{\parallel {\tilde{\mathbf{u}}}_{k}^{L}-\mathbf{C}(z,\mathbf{\theta}){\tilde{\mathbf{e}}}_{k}^{L}\parallel}_{2}^{2}$ where ${\tilde{\mathbf{u}}}_{k}^{L}=\mathbf{L}\left(z\right){\tilde{\mathbf{u}}}_{k}=\mathbf{L}\left(z\right){[{\tilde{u}}_{k,1},{\tilde{u}}_{k,2}]}^{\top},{\tilde{\mathbf{e}}}_{k}^{L}=\mathbf{L}\left(z\right){\tilde{\mathbf{e}}}_{k}=\mathbf{L}\left(z\right){[{\tilde{e}}_{k,1},{\tilde{e}}_{k,2}]}^{\top}$, ${[{\tilde{e}}_{k,1},{\tilde{e}}_{k,2}]}^{\top}=({\mathbf{M}}^{-1}\left(z\right)-{\mathbf{I}}_{2}){[{\tilde{y}}_{k,1},{\tilde{y}}_{k,2}]}^{\top}$. Here, ${J}_{VR}\left(\mathbf{\theta}\right)$ is an approximation of the c.f. ${J}_{MR}^{\infty}$ from (5) obtained for $\gamma =1$. The controller (36) will then close the feedback control loop as in ${\mathbf{u}}_{k}=\mathbf{C}(z,\mathbf{\theta})({\mathbf{r}}_{k}-{\mathbf{y}}_{k})$.
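Each diagonal entry of (36), ${P}_{ii}\left(z\right)/(1-{z}^{-1})$, corresponds to the difference equation ${u}_{k}={u}_{k-1}+{\sum}_{i}{p}_{i}{e}_{k-i}$. The sketch below shows one possible way to run it sample by sample; the class name is hypothetical, zero initial conditions are assumed, and no saturation or anti-windup logic is included.

```python
P11 = [2.9341, -5.8689, 3.9303, -0.9173, -0.0777]   # coefficients of P11(z)
P22 = [0.6228, -1.1540, 0.5467]                     # coefficients of P22(z)


class IncrementalFIRController:
    """Realizes u_k = u_{k-1} + sum_i p_i * e_{k-i}, i.e. P(z)/(1 - z^-1) on e_k."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.e_hist = [0.0] * len(self.coeffs)   # [e_k, e_{k-1}, ...], zero start
        self.u_prev = 0.0

    def step(self, e):
        # Shift the error history and accumulate the FIR output (integral action).
        self.e_hist = [e] + self.e_hist[:-1]
        self.u_prev += sum(p * ek for p, ek in zip(self.coeffs, self.e_hist))
        return self.u_prev


c11 = IncrementalFIRController(P11)
c22 = IncrementalFIRController(P22)
# Response of the azimuth channel to a unit step in the feedback error.
u_seq = [c11.step(1.0) for _ in range(3)]
print(u_seq)
```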

Notice that, by formulation, the VRFT controller tuning aims to minimize the undiscounted ($\gamma =1$) ${J}_{MR}^{\infty}$ from (5), but via the output feedback controller (36) that processes the feedback control error ${\mathbf{e}}_{k}={\mathbf{r}}_{k}-{\mathbf{y}}_{k}$. The same goal to minimize (5) is pursued by the subsequent IMF-AVI design of a state-feedback controller tuning for the extended process. Nonlinear (in particular, linear) state-feedback controllers can also be found by VRFT as shown in [23,31], to serve as initializations for the IMF-AVI, or possibly, even for PoIt-like algorithms. However, should this not be necessary, IO feedback controllers are much more data-efficient, requiring significantly less IO data to obtain stabilizing controllers.

#### 4.4. Input–State–Output Data Collection

ORM tracking is intended by making the closed-loop CS match the same ORM $\mathbf{M}\left(z\right)=diag({M}_{1}\left(z\right),{M}_{2}\left(z\right))$. With the linear controller (36) used in closed loop to stabilize the process, input–state–output data is collected for 7000 s. The reference inputs with amplitudes ${r}_{k,1}\in [-2;2],{r}_{k,2}\in [-1.4;1.1]$ are modeled as successive steps whose amplitudes switch to uniformly random values every 17 s and 25 s, respectively. On the outputs ${u}_{k,1},{u}_{k,2}$ of both controllers ${C}_{11}\left(z\right),{C}_{22}\left(z\right)$, noise is added at every 2nd sample as a uniform random number in $[-1.6;1.6]$ for ${C}_{11}\left(z\right)$ and in $[-1.7;1.7]$ for ${C}_{22}\left(z\right)$. These additive disturbances provide an appropriate exploration, visiting many combinations of input–states–outputs. The computed controller outputs are saturated to $[-1;1]$, then sent to the process. The reference inputs ${r}_{k,1},{r}_{k,2}$ drive the ORM:

$$\left\{\begin{array}{c}{x}_{k+1,1}^{m}=0.9672{x}_{k,1}^{m}+0.03278{r}_{k,1},\hfill \\ {x}_{k+1,2}^{m}=0.9672{x}_{k,2}^{m}+0.03278{r}_{k,2},\hfill \\ {\mathbf{y}}_{k}^{m}={[{y}_{k,1}^{m},{y}_{k,2}^{m}]}^{\top}={[{x}_{k,1}^{m},{x}_{k,2}^{m}]}^{\top}.\hfill \end{array}\right.$$
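The ORM coefficients can be reproduced from the continuous-time lag. For a first-order lag $1/(Ts+1)$ with unit gain, the zero-order-hold discretization at sampling period ${T}_{s}$ is ${x}_{k+1}^{m}=a{x}_{k}^{m}+b{r}_{k}$ with $a={e}^{-{T}_{s}/T}$ and $b=1-a$. The helper below is a small sketch with a hypothetical function name.

```python
import math


def zoh_first_order_lag(time_constant, ts):
    """ZOH discretization of 1/(T*s + 1): returns (a, b) of x+ = a*x + b*r."""
    a = math.exp(-ts / time_constant)
    return a, 1.0 - a


a_tito, b_tito = zoh_first_order_lag(3.0, 0.1)   # ORM 1/(3s+1), Ts = 0.1 s
a_lin, b_lin = zoh_first_order_lag(1.0, 0.1)     # ORM 1/(s+1),  Ts = 0.1 s
print(round(a_tito, 4), round(b_tito, 5))
print(round(a_lin, 4), round(b_lin, 5))
```

Rounded to the precision used in the text, these values agree with the coefficients in (37) and with the ORM of the first-order example in Section 4.1.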

Then the states of the ORM (also outputs of the ORM) are also collected along with the states and control inputs of the process, to build the process extended state (4). Let the extended state be:

$${\mathbf{x}}_{k}^{E}={[\underset{{({\mathbf{x}}_{k}^{m})}^{\top}}{\underbrace{{x}_{k,1}^{m},{x}_{k,2}^{m}}},\underset{{\mathbf{r}}_{k}^{\top}}{\underbrace{{r}_{k,1},{r}_{k,2}}},{({\mathbf{x}}_{k})}^{\top}]}^{\top}.$$

Essentially, the collected ${\mathbf{x}}_{k}^{E}$ and ${\mathbf{u}}_{k}$ build the transitions dataset $D=\{({\mathbf{x}}_{1}^{E},{\mathbf{u}}_{1},{\mathbf{x}}_{2}^{E}),\dots ,({\mathbf{x}}_{70000}^{E},{\mathbf{u}}_{70000},{\mathbf{x}}_{70001}^{E})\}$ for $N=\mathrm{70,000}$, used for the IMF-AVI implementation. After collection, an important processing step is the data normalization. Some process states are scaled in order to ensure that all states are inside $[-1;1]$. The scaled process state is ${\tilde{\mathbf{x}}}_{k}={[{\omega}_{h,k}/7200,25\cdot {\mathsf{\Omega}}_{h,k},{\alpha}_{h,k},{\omega}_{v,k}/3500,40\cdot {\mathsf{\Omega}}_{v,k},{\alpha}_{v,k}]}^{\top}\in {\mathbb{R}}^{6}$, with ${\mathbf{u}}_{k}={[{u}_{k,1},{u}_{k,2}]}^{\top}\in {\mathbb{R}}^{2}$. Other variables such as the reference inputs, the ORM states and the saturated process inputs already have values inside $[-1;1]$. The normalized state is eventually used for state feedback. Collected transition samples are shown in Figure 5 only for the process inputs and outputs, ORM’s outputs and reference inputs, for the first 400 s (4000 samples) out of 7000 s.

Note that the reference input signals ${r}_{k,1},{r}_{k,2}$, used as sequences of constant-amplitude steps for ensuring good exploration, do not have a generative model that obeys the Markov assumption. To avoid this problem, the piece-wise constant reference input generative model ${\mathbf{r}}_{k+1}={\mathbf{r}}_{k}$ is employed by eliminating from the dataset D all the transition samples that correspond to instants at which the reference inputs switch (i.e., when at least one of ${r}_{k,1},{r}_{k,2}$ switches).
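
The elimination of switching instants can be sketched as a boolean mask over the transitions. The column positions of the reference inputs inside the extended state (columns 2 and 3, zero-based) follow the extended state definition above; the function name is hypothetical.

```python
import numpy as np

def drop_reference_switches(xE, u, xE_next, r_cols=slice(2, 4)):
    """Keep only transitions with r_{k+1} == r_k (columns 2:4 hold r in x^E)."""
    keep = np.all(xE[:, r_cols] == xE_next[:, r_cols], axis=1)
    return xE[keep], u[keep], xE_next[keep]
```

Exact equality is used because the references are piece-wise constant by construction; a tolerance would be needed if the logged references were noisy.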

#### 4.5. Learning State-Feedback Controllers with Linearly Parameterized IMF-AVI

Details of the LP-IMF-AVI applied to the ORM tracking control problem are provided next. The stage cost is defined as $\upsilon \left({\mathbf{x}}_{k}^{E}\right)={({y}_{k,1}-{y}_{k,1}^{m})}^{2}+{({y}_{k,2}-{y}_{k,2}^{m})}^{2}$ and the discount factor in ${J}_{MR}^{\infty}$ is $\gamma =0.95$. The Q-function is linearly parameterized using the basis functions

$$\begin{array}{cc}\hfill {\mathsf{\Phi}}^{\top}({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})=& [{({x}_{k,1}^{m})}^{2},{({x}_{k,2}^{m})}^{2},{r}_{k,1}^{2},\dots ,{x}_{k,6}^{2},{u}_{k,1}^{2},{u}_{k,2}^{2},{x}_{k,1}^{m}{x}_{k,2}^{m},{x}_{k,1}^{m}{r}_{k,1},\dots ,\hfill \\ & {x}_{k,1}^{m}{u}_{k,2},{x}_{k,2}^{m}{r}_{k,1},\dots ,{u}_{k,1}{u}_{k,2}]\in {\mathbb{R}}^{78}.\hfill \end{array}$$
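
One way to assemble this 78-dimensional feature vector is sketched below. The ordering (the 12 squares first, then the 66 pairwise products taken row by row) is an assumption inferred from the listed terms and from the parameter indices that appear in the controller expression; the name `quad_basis` is hypothetical.

```python
import numpy as np

def quad_basis(xE, u):
    """Quadratic features of z = [xE; u]: 12 squares, then 66 products z_i z_j with i < j."""
    z = np.concatenate([xE, u])            # 10 extended states + 2 inputs
    i, j = np.triu_indices(z.size, k=1)    # pairs (0,1), (0,2), ..., (10,11) in row-major order
    return np.concatenate([z ** 2, z[i] * z[j]])
```

Under this ordering, the squares of the two inputs occupy positions 11 and 12 and their product occupies position 78 (one-based).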

This selection of basis functions is inspired by the shape of the quadratic Q-function resulting from LTI processes with LQR-like penalties (see Comment 7). It is expected to be a sensitive choice since the TITOAS process is nonlinear, and the quadratic Q-function may therefore under-parameterize the true Q-function. Nevertheless, its computational advantage incentivizes the testing of such a solution. Notice that the controller improvement step at each iteration of the LP-IMF-AVI is based on explicit minimization of the Q-function. Solving the linear system of equations that results from setting the derivative of $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})$ w.r.t. ${\mathbf{u}}_{k}$ equal to zero yields

$$\begin{array}{c}\hfill {\tilde{\mathbf{u}}}_{k}^{*}=\left[\begin{array}{c}{u}_{k,1}^{*}\\ {u}_{k,2}^{*}\end{array}\right]={\tilde{\mathbf{C}}}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{\pi}}_{j})=-{\left[\begin{array}{cc}2{\pi}_{j,11}& {\pi}_{j,78}\\ {\pi}_{j,78}& 2{\pi}_{j,12}\end{array}\right]}^{-1}\left[\begin{array}{c}{F}_{1}\left({\mathbf{x}}_{k}^{E}\right)\\ {F}_{2}\left({\mathbf{x}}_{k}^{E}\right)\end{array}\right],\\ \hfill {F}_{1}\left({\mathbf{x}}_{k}^{E}\right)={\pi}_{j,22}{x}_{k,1}^{m}+{\pi}_{j,32}{x}_{k,2}^{m}+{\pi}_{j,41}{r}_{k,1}+{\pi}_{j,49}{r}_{k,2}+{\pi}_{j,56}{x}_{k,1}+\\ \hfill {\pi}_{j,62}{x}_{k,2}+{\pi}_{j,67}{x}_{k,3}+{\pi}_{j,71}{x}_{k,4}+{\pi}_{j,74}{x}_{k,5}+{\pi}_{j,76}{x}_{k,6}={\mathbf{\pi}}_{j,1}^{\top}{\mathbf{x}}_{k}^{E},\\ \hfill {F}_{2}\left({\mathbf{x}}_{k}^{E}\right)={\pi}_{j,23}{x}_{k,1}^{m}+{\pi}_{j,33}{x}_{k,2}^{m}+{\pi}_{j,42}{r}_{k,1}+{\pi}_{j,50}{r}_{k,2}+{\pi}_{j,57}{x}_{k,1}+\\ \hfill {\pi}_{j,63}{x}_{k,2}+{\pi}_{j,68}{x}_{k,3}+{\pi}_{j,72}{x}_{k,4}+{\pi}_{j,75}{x}_{k,5}+{\pi}_{j,77}{x}_{k,6}={\mathbf{\pi}}_{j,2}^{\top}{\mathbf{x}}_{k}^{E}.\end{array}$$
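
The explicit minimization can be sketched in a few lines. For the quadratic parameterization, the stationarity condition reads $\mathbf{M}\mathbf{u}+\mathbf{F}=\mathbf{0}$, with $\mathbf{M}$ the 2×2 matrix built from ${\pi}_{j,11},{\pi}_{j,12},{\pi}_{j,78}$, so the sketch solves $\mathbf{M}\mathbf{u}=-\mathbf{F}$, which is a minimum only if $\mathbf{M}$ is positive definite. The feature ordering in `quad_basis` is an assumption inferred from the parameter indices, and both function names are hypothetical.

```python
import numpy as np

# 0-based positions of the cross terms x^E_i * u_1 and x^E_i * u_2 in the 78-vector pi
IDX_U1 = np.array([22, 32, 41, 49, 56, 62, 67, 71, 74, 76]) - 1
IDX_U2 = np.array([23, 33, 42, 50, 57, 63, 68, 72, 75, 77]) - 1

def quad_basis(xE, u):
    """Squares first, then pairwise products in row-major order (assumed ordering)."""
    z = np.concatenate([xE, u])
    i, j = np.triu_indices(z.size, k=1)
    return np.concatenate([z ** 2, z[i] * z[j]])

def greedy_control(pi, xE):
    """Stationary point of Q = pi^T Phi(xE, u) in u; a minimizer when M is positive definite."""
    M = np.array([[2 * pi[10], pi[77]],
                  [pi[77], 2 * pi[11]]])
    F = np.array([pi[IDX_U1] @ xE, pi[IDX_U2] @ xE])
    return np.linalg.solve(M, -F)      # from dQ/du = M u + F = 0
```

Because the result is linear in the extended state, the improved controller is a static linear state-feedback gain that changes only through the parameter vector.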

The improved controller is embedded in the system (12) of 70,000 linear equations with 78 unknowns corresponding to the parameters of ${\mathbf{\pi}}_{j+1}\in {\mathbb{R}}^{78}$. This linear system (12) is solved in the least squares sense at each of the 50 iterations of the LP-IMF-AVI. The practical convergence results are shown in Figure 6 for $\parallel {\mathbf{\pi}}_{j+1}-{\mathbf{\pi}}_{j}{\parallel}_{2}$ and for the ORM tracking performance in terms of a normalized c.f. ${J}_{test}=1/N(\parallel {y}_{k,1}-{y}_{k,1}^{m}{\parallel}_{2}+\parallel {y}_{k,2}-{y}_{k,2}^{m}{\parallel}_{2})$ measured for samples over 200 s in the test scenario displayed in Figure 7. The test scenario consists of a sequence of piece-wise constant reference inputs that switch at different moments of time for the azimuth and pitch (${y}_{k,1}$ and ${y}_{k,2}$, respectively), to illustrate the existing coupling behavior between the two control channels and the extent to which the learned controller manages to achieve the decoupled behavior requested by the diagonal ORM.
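
The structure of one such iteration can be sketched generically. System (12) is not reproduced in this section, so the sketch assumes the usual fitted value-iteration form, in which the features of each transition are regressed on the stage cost plus the discounted Q-value at the next state under the improved controller. The scalar process, its basis, and all names below are hypothetical and serve only to make the example runnable; they are not the TITOAS setup.

```python
import numpy as np

def avi_step(pi, phi, greedy, xE, u, xE_next, cost, gamma=0.95):
    """One fitted value-iteration step: regress features on bootstrapped targets."""
    Phi = np.array([phi(x, a) for x, a in zip(xE, u)])
    targets = np.array([c + gamma * pi @ phi(xn, greedy(pi, xn))
                        for c, xn in zip(cost, xE_next)])
    pi_next, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return pi_next

# Toy scalar process x_{k+1} = a x_k + b u_k (hypothetical, for illustration only)
a, b, gamma = 0.9, 0.5, 0.95
phi = lambda x, v: np.array([x * x, v * v, x * v])             # quadratic basis
greedy = lambda p, x: -p[2] * x / (2 * p[1]) if p[1] > 1e-9 else 0.0

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
v = rng.uniform(-1, 1, 500)
x_next = a * x + b * v
cost = x ** 2                                                  # stage cost on the state only

pi = np.zeros(3)
for j in range(10):
    pi_new = avi_step(pi, phi, greedy, x, v, x_next, cost, gamma)
    step_norm = np.linalg.norm(pi_new - pi)                    # convergence indicator
    pi = pi_new
```

In this toy case the true Q-function is exactly quadratic, so the parameter increments vanish after a few iterations; for a nonlinear process such as the one studied here, the regression residual does not vanish and convergence is not guaranteed by this sketch.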

The best LP-IMF-AVI controller found over the 50 iterations results in ${J}_{test}=0.0017$ (tracking results in black lines in Figure 7), which is more than 6 times lower than the tracking performance of the VRFT controller used for transition samples collection, for which ${J}_{test}=0.0103$ (tracking results in green lines in Figure 7). The convergence of the LP-IMF-AVI parameters is depicted in Figure 8.
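
For reference, the normalized cost can be computed as sketched below, reading the norms in the definition of ${J}_{test}$ as Euclidean norms over the $N$ test samples of each channel. The function name `j_test` is hypothetical.

```python
import numpy as np

def j_test(y, y_m):
    """(1/N) * (||y_1 - y_1^m||_2 + ||y_2 - y_2^m||_2) for arrays of shape (N, 2)."""
    err = np.asarray(y, dtype=float) - np.asarray(y_m, dtype=float)
    return np.linalg.norm(err, axis=0).sum() / err.shape[0]

ratio = 0.0103 / 0.0017   # VRFT value over best LP-IMF-AVI value, as reported above
```

The ratio of the two reported values is slightly above six, which is the basis of the comparison stated above.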

#### 4.6. Learning State-Feedback Controllers with Nonlinearly Parameterized IMF-AVI Using NNs

The previous LP-IMF-AVI learning scheme for ORM tracking control is next compared against an NP-IMF-AVI implemented with NNs. In this case, two NNs are used to approximate the Q-function and the controller (the latter is sometimes avoidable, see the comments later on in this sub-section). The procedure follows the NP-IMF-AVI implementation described in [23,49]. The same dataset of transition samples is used as for the LP-IMF-AVI. Notice that the NN-based implementation is widely used in the reinforcement learning-based approach to ADP and generally scales better to problems of high dimension.

The controller NN (C-NN) estimate is a 10–3–2 network (10 inputs because ${\mathbf{x}}_{k}^{E}\in {\mathbb{R}}^{10}$, 3 neurons in the hidden layer, and 2 outputs corresponding to ${u}_{k,1},{u}_{k,2}$) with $tanh$ activation function in the hidden layer and linear output activation. The Q-function NN (Q-NN) estimate is a 12–25–1 network with the same settings as the C-NN. Initial weights of both NNs are zero-mean uniform random numbers with variance 0.3. Both NNs are trained using scaled conjugate gradient for a maximum of 500 epochs. The available dataset is randomly divided into training (80%) and validation data (20%). Early stopping during training is enforced after 10 increases of the training c.f., the mean sum of squared errors (MSE), evaluated on the validation data. MSE is herein, for all networks, the default performance function used in training.
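
The sizes of the two networks follow directly from their layer widths. The short sketch below counts weights and biases of a fully connected network, assuming one bias per neuron; the function name is hypothetical.

```python
def mlp_param_count(layers):
    """Number of weights and biases in a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

c_nn_params = mlp_param_count([10, 3, 2])    # controller NN
q_nn_params = mlp_param_count([12, 25, 1])   # Q-function NN
```

These counts agree with the 41 and 351 parameters quoted for the C-NN and the Q-NN in Section 4.7.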

The NP-IMF-AVI proposed herein consists of two steps for each iteration j. The first one calculates the targets for the NN $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},{\mathbf{\pi}}_{j})$ (having inputs ${[{\left({\mathbf{x}}_{k}^{E}\right)}^{\top},{\left({\mathbf{u}}_{k}\right)}^{\top}]}^{\top}$ and current iteration weights ${\mathbf{\pi}}_{j}$) as $\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma Q({\mathbf{x}}_{k+1}^{E},\mathbf{C}({\mathbf{x}}_{k+1}^{E},{\mathbf{\theta}}_{j-1}),{\mathbf{\pi}}_{j-1})\}$, for all transitions in the dataset. Training on these targets results in the Q-function estimator NN $Q({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k},{\mathbf{\pi}}_{j})$ with weights ${\mathbf{\pi}}_{j}$. The second step (the controller improvement) first calculates the targets for the controller $\mathbf{C}({\mathbf{x}}_{k}^{E},{\mathbf{\theta}}_{j})$ (with inputs ${({\mathbf{x}}_{k}^{E})}^{\top}$) as $\{{\mathbf{u}}_{k}=argmi{n}_{\mathbf{u}\in \mathsf{\Lambda}}Q({\mathbf{x}}_{k}^{E},\mathbf{u},{\mathbf{\pi}}_{j})\}$. Note that an additional parameterization, through the controller NN weights ${\mathbf{\theta}}_{j}$, is needed. Training produces the improved controller characterized by the new weights ${\mathbf{\theta}}_{j}$. Here, the discrete set of control actions $\mathsf{\Lambda}\subset {\mathsf{\Omega}}_{U}$ used to minimize the Q-NN estimate for computing the controller targets is the Cartesian product of two identical sets of control actions, each containing 21 equally spaced values in $[-1;1]$, i.e., $\{-1,-0.9,\dots ,0.9,1\}$.
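
The enumeration over the discrete action set can be sketched as follows. The callable `q_fn` is a hypothetical stand-in for the trained Q-NN, assumed to return the Q-values of one extended state for a batch of candidate actions; `ACTIONS` and `greedy_targets` are likewise illustrative names.

```python
import numpy as np

GRID = np.linspace(-1.0, 1.0, 21)                       # 21 equally spaced values per input
ACTIONS = np.array(np.meshgrid(GRID, GRID, indexing="ij")).reshape(2, -1).T  # (441, 2)

def greedy_targets(q_fn, xE_batch):
    """Controller targets: for each state, the grid action with the smallest Q-value."""
    return np.array([ACTIONS[np.argmin(q_fn(x, ACTIONS))] for x in xE_batch])
```

With 21 values per input, each target requires 441 Q-function evaluations, which illustrates why the enumeration becomes costly as the grid is refined or the number of inputs grows.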

A discount factor $\gamma =0.95$ is used, and each iteration of the NP-IMF-AVI produces a C-NN that is tested by measuring the same normalized c.f. ${J}_{test}$ for $N=2000$ on the standard test scenario shown in Figure 7, the same one that was used for the LP-IMF-AVI. The NP-IMF-AVI is iterated 50 times and all the stabilizing controllers that perform better than the VRFT multivariable controller on this test scenario (in terms of smaller ${J}_{test}$) are stored. The best C-NN across 50 iterations renders ${J}_{test}=0.0025$. The tracking performance for the best NN controller found with the NP-IMF-AVI is shown in blue lines in Figure 7. The convergence process is depicted in Figure 9.

A grid search is next performed for the NP-IMF-AVI training process, by varying the dataset size among 30,000, 50,000, and 70,000 transitions, combined with 17, 19, and 21 discrete values per control input used for minimizing the Q-function. For the case of 50,000 transitions with 17 uniformly spaced discrete values for each control input, ${J}_{test}=0.0017$, which is the same as the best performance of the LP-IMF-AVI. Notice that neither the nonlinear state-feedback controller of the NP-IMF-AVI nor the linear state-feedback controller of the LP-IMF-AVI has an integral component, while the linear output feedback controller tuned by VRFT and used for exploration has integrators.

Two additional approaches exist for dealing with an NP-IMF-AVI that uses two NNs, one for the Q-NN approximator and one for the C-NN. For example, [31] cascaded the C-NN and the Q-NN. After training the Q-NN and producing the new weights ${\mathbf{\pi}}_{j}$, the weights of the Q-NN are fixed and only the weights ${\mathbf{\theta}}_{j}$ of the C-NN are trained, with all the targets equal to $\left\{0\right\}$ for all the inputs ${\mathbf{x}}_{k}^{E}$ of the cascaded NN $Q({\mathbf{x}}_{k}^{E},\mathbf{C}({\mathbf{x}}_{k}^{E},{\mathbf{\theta}}_{j}),{\mathbf{\pi}}_{j})$. In this way, the C-NN is forced to minimize the Q-NN. The disadvantage is the vanishing gradient problem of the resulting cascaded network, which deepens through more hidden layers; therefore only small corrections are brought to the C-NN part that is further away from the Q-NN’s output. Yet another solution [13] uses, for the controller improvement step, one or several gradient descent steps ${\mathbf{\theta}}_{j}={\mathbf{\theta}}_{j-1}-\alpha \frac{1}{N}{\sum}_{k=1}^{N}\frac{dQ}{d\mathbf{u}}\frac{d\mathbf{u}}{d\mathbf{\theta}}{|}_{({\mathbf{\theta}}_{j-1},{\mathbf{x}}_{k}^{E})}$ at each iteration of NP-IMF-AVI, with step size $\alpha >0$ and with gradient $\frac{1}{N}{\sum}_{k=1}^{N}\frac{dQ}{d\mathbf{u}}\frac{d\mathbf{u}}{d\mathbf{\theta}}{|}_{({\mathbf{\theta}}_{j-1},{\mathbf{x}}_{k}^{E})}$ accumulated over all inputs ${\mathbf{x}}_{k}^{E}$ of the cascaded NN $Q({\mathbf{x}}_{k}^{E},\mathbf{C}({\mathbf{x}}_{k}^{E},{\mathbf{\theta}}_{j-1}),{\mathbf{\pi}}_{j})$, over fixed Q-NN weights. Essentially, the two approaches described above are equivalent and the number of gradient descent steps at each iteration is user-selectable. Also, no minimization by enumerating a finite set of control actions needs to be performed in either of the two approaches.
The above two equivalent approaches are effectively a particular case of the Neural-Fitted Q-iteration with Continuous Actions (NFQCA) approach [13], which was more recently updated, with some changes, into Deep Deterministic Policy Gradient (DDPG) [50]. DDPG uses two NNs as well, for the Q-NN and for the C-NN. It was originally developed to work in online off-policy mode, hence the need to update the Q-NN and the C-NN in a faster way on a relatively small number of transition samples (called a minibatch) randomly extracted from a replay buffer equivalent to the dataset D, in order to break the time correlation of consecutive samples. The effectiveness of DDPG in real-time online control has yet to be proven.

Two variants of offline off-policy DDPG, called DDPG1 and DDPG2, are run for comparison purposes. Both use minibatches of 128 transitions from the dataset D at each training iteration. Both also use soft target updates of the Q-NN weights ${\mathbf{\pi}}_{j}^{\prime}=\tau {\mathbf{\pi}}_{j}+(1-\tau ){\mathbf{\pi}}_{j-1}^{\prime}$ and of the C-NN weights ${\mathbf{\theta}}_{j}^{\prime}=\tau {\mathbf{\theta}}_{j}+(1-\tau ){\mathbf{\theta}}_{j-1}^{\prime}$, with $\tau =0.005$; ${\mathbf{\pi}}_{j}^{\prime}$ and ${\mathbf{\theta}}_{j}^{\prime}$ are used to calculate the targets for the Q-NN training. At each iteration, DDPG1 makes one update step of the Q-NN weights in the negative direction of the gradient of the MSE w.r.t. ${\mathbf{\pi}}_{j}$ with step size $\alpha =0.001$ and one update step of the C-NN weights in the negative direction of the gradient of the Q-NN’s output w.r.t. ${\mathbf{\theta}}_{j}$ with step size $\alpha =0.001$. DDPG2 differs in that the Q-NN training on each minibatch of each iteration is left to the same settings used for NP-IMF-AVI training (scaled conjugate gradient for a maximum of 500 epochs), while only one gradient descent step is used to update the C-NN weights with the same $\alpha =0.001$. The step sizes were selected to ensure learning convergence. It was observed that DDPG1 has the slowest convergence (convergence appears after more than 20,000 iterations) since it performs only one gradient update step per iteration; DDPG2 converges faster (convergence appears after 5000 iterations) since it allows more gradient steps for Q-NN training; and NP-IMF-AVI converges fastest (convergence appears after 10 iterations), since it allows more training in terms of gradient descent steps (with scaled conjugate gradient direction) for both the Q-NN and the C-NN at each iteration. This indicates that, given the high-dimensional process, it is better to use the entire dataset D for offline training, as was done with NP-IMF-AVI.
On the other hand, the best performance with DDPG1 and DDPG2 is 0.003, not as good as the best one with the more computationally demanding NP-IMF-AVI (0.0017), suggesting that minimizing the Q-NN by enumerating discrete actions to calculate the C-NN targets may actually escape local minima. The total learning time to convergence with DDPG1 and DDPG2 is about the same as with NP-IMF-AVI, which is to be expected since the fewer calculations per iteration in DDPG1 are offset by the larger number of iterations needed until convergence appears. Notice that NP-IMF-AVI does not use soft target updates for its two NNs.
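
The soft target update used by both DDPG variants can be sketched as below, with each network represented as a list of NumPy weight arrays. This representation and the function name `soft_update` are assumptions made for illustration only.

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Return tau * online + (1 - tau) * target for each weight array."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]
```

With $\tau =0.005$ the target weights move only 0.5% of the way toward the online weights at each iteration, which smooths the targets at the price of slower propagation of new information.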

The additional NN controller is not mandatory and the NP-IMF-AVI can be made similar to the LP-IMF-AVI case. In this case, the minimization of the Q-function NN estimate is performed by enumerating the discrete set of control actions $\mathsf{\Lambda}\subset {\mathsf{\Omega}}_{U}$ and the targets calculation for the Q-function NN uses $\{\upsilon ({\mathbf{x}}_{k}^{E},{\mathbf{u}}_{k})+\gamma Q({\mathbf{x}}_{k+1}^{E},argmi{n}_{\mathbf{u}\in \mathsf{\Lambda}}Q({\mathbf{x}}_{k+1}^{E},\mathbf{u},{\mathbf{\pi}}_{j}),{\mathbf{\pi}}_{j-1})\}$. This approach merges the controller improvement step and the Q-function improvement step. However, for real-time control implementation after NP-IMF-AVI convergence, it is more expensive to find ${\mathbf{u}}_{k}^{*}=argmi{n}_{\mathbf{u}\in \mathsf{\Lambda}}{Q}^{*}({\mathbf{x}}_{k}^{E},\mathbf{u},{\mathbf{\pi}}_{j})$, since it requires evaluating the Q-function NN a number of times proportional to the number of combinations of discrete control actions. Hence only slower processes can be accommodated with this implementation. In contrast, when a dedicated controller NN is used, after NP-IMF-AVI convergence, the optimal control ${\mathbf{u}}_{k}^{*}={\mathbf{C}}^{*}({\mathbf{x}}_{k}^{E},{\mathbf{\theta}}_{j})$ is calculated at once, through a single NN evaluation. This dedicated NN controller can also be obtained (trained) as a final step after the NP-IMF-AVI has converged to the optimal Q-function ${Q}^{*}({\mathbf{x}}_{k}^{E},\mathbf{u},{\mathbf{\pi}}_{j})$, with the targets for the controller output calculated as $\{{\mathbf{u}}_{k}^{*}=argmi{n}_{\mathbf{u}\in \mathsf{\Lambda}}{Q}^{*}({\mathbf{x}}_{k}^{E},\mathbf{u},{\mathbf{\pi}}_{j})\}$. Another original solution that uses a single NN Q-function approximator was proposed in [51], such that a quadratic approximation of the NN-fitted Q-function is used to directly derive a linear state-feedback controller at each iteration.

#### 4.7. Comments on the Obtained Results

Some comments follow the validation of the LP-IMF-AVI and NP-IMF-AVI. The results of Figure 6 indicate that convergence of the LP-IMF-AVI is attained in terms of $\parallel {\mathbf{\pi}}_{j}-{\mathbf{\pi}}_{j-1}{\parallel}_{2}\to 0$; however, perfect ORM tracking is not possible, as shown by the nonzero constant value of ${J}_{test}$. On one hand, this is to be expected since the resulting linear state-feedback controller coupled with the process’ nonlinear dynamics is not capable of ensuring a closed-loop linear behavior as requested by the ORM. On the other hand, the NN controller resulting from the NP-IMF-AVI implementation is a nonlinear state-feedback controller; however, the best obtained results are not better than (but on the same level as) those obtained with the linear state-feedback controller of the LP-IMF-AVI, although the nonlinear controller is expected to perform better in terms of lower ${J}_{test}$, since its flexibility should allow it to compensate for the process nonlinearity. Since this flexibility does not turn into an advantage, the reason lies with the additional NN controller parameterization (which introduces additional approximation errors) and with the training process, which relies on approximate minimizations in the calculation step of the controller’s targets.

The iterative evolution of ${J}_{test}$ for both LP-IMF-AVI and NP-IMF-AVI shows stabilization to constant nonzero values, suggesting that neither approach can provide perfect ORM tracking controllers. For the LP-IMF-AVI, the responsibility lies with the under-representation error introduced by the quadratic Q-function (and with the resulting linear state-feedback controller), while for NP-IMF-AVI, the responsibility lies with the errors introduced by the additional controller approximator NN and the targets calculation in the controller improvement step.

The computational resources analysis indicates that the LP-IMF-AVI learns only 78 parameters for the Q-function parameterization, and no intermediate controller approximator is used. The run time for 50 iterations is about 345 seconds (including evaluation steps on the test scenario after each iteration). The NN-based NP-IMF-AVI needs to learn two NNs having 351 parameters (weights) for the Q-function NN and 41 parameters (weights) for the controller NN, respectively. In contrast, the run time for the NP-IMF-AVI is about 3300 seconds, almost ten times more than in the case of the LP-IMF-AVI. Despite the larger parameter learning space, the converged behavior of the NN-based NP-IMF-AVI is very similar to that of the LP-IMF-AVI (see tracking results in Figure 7).

LP-IMF-AVI has shown an increased sensitivity to the size of the transition samples dataset: for fewer or more transition samples in the dataset, the LP-IMF-AVI diverges, under exactly the same exploration settings. However, this divergence appears only after an initial convergence phase similar to that of Figure 6, and not from the very beginning. Whereas having fewer transition samples is intuitively disadvantageous for learning the true Q-function approximation, the fact that a larger number of transition samples leads to divergence is unexpected. The reason is that non-uniform state-action space exploration affects the linear regression. Moreover, given a fixed dataset size, an increased amplitude of the additive disturbance used to stimulate exploration, combined with a more frequent application of this disturbance (such as every 2nd sample), increases the convergence probability. These observations indicate again that the proposed linear parameterization using quadratic basis functions is insufficient for a correct representation of the true Q-function, thus failing the small approximation error assumptions of Theorem 2. The connection between the convergence guarantees and the approximation errors has been analyzed in the literature [52,53,54,55].

In light of the previous paragraph’s observations, the NN-based NP-IMF-AVI proves to be significantly more robust throughout the convergence process, both to various transition samples dataset sizes and to different exploration settings (disturbance amplitude and frequency of its application, how often the reference inputs switch during the transition samples collection phase, etc.). This robustness may well justify the additional controller approximator NN and the extra computation time, since the chances of learning high performance controllers will depend less on the selection of the many parameters involved. Moreover, manual selection of the basis functions is unnecessary with the NN-based NP-IMF-AVI, while the over-parameterization is automatically managed by the NN training mechanism.

Data normalization is a frequently overlooked issue in ADP control but it is critical to successful design since it numerically affects both the regression solution in LP-IMF-AVI and the NN training in NP-IMF-AVI. A diagonal scaling matrix $\mathbf{S}=diag({s}_{1},\dots ,{s}_{n+{n}_{m}+p})$ leads to the scaled extended state ${\overline{\mathbf{x}}}_{k}^{E}={\mathbf{Sx}}_{k}^{E}$ resulting in the extended state-space model ${\overline{\mathbf{x}}}_{k+1}^{E}=\mathbf{S}\cdot \mathbf{E}({\mathbf{S}}^{-1}{\overline{\mathbf{x}}}_{k}^{E},{\mathbf{u}}_{k})$ that still preserves the MDP property.
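
The scaled model can be illustrated with a small wrapper: given any one-step map $\mathbf{E}$ and the diagonal of $\mathbf{S}$, the scaled state evolves through a map that depends only on the current scaled state and input. The function name `scaled_dynamics` and the representation of $\mathbf{S}$ by its diagonal vector `s` are illustrative assumptions.

```python
import numpy as np

def scaled_dynamics(E, s):
    """Given x_{k+1} = E(x_k, u_k) and the diagonal s of S, return the map for x_bar = s * x."""
    s = np.asarray(s, dtype=float)
    return lambda x_bar, u: s * E(x_bar / s, u)
```

Since the wrapper needs nothing beyond the current scaled state and input, the scaled transitions can be used by the learning schemes exactly as the unscaled ones would be.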

## 5. Conclusions

This paper demonstrates a functional design for an IMF-AVI ADP learning scheme dedicated to the challenging problem of ORM tracking control for a high-order real-world complex nonlinear process with unknown dynamics. The investigation revolves around a comparative analysis of a linear vs. a nonlinear parameterization of the IMF-AVI approach. Learning high performance state-feedback control under the model-free mechanism offered by IMF-AVI builds upon the input–states–outputs transition samples collection step, which uses an initial exploratory linear output feedback controller that is also designed in a model-free setup using VRFT. From the practitioners’ viewpoint, the NN-based implementation of IMF-AVI is more appealing since it easily scales up with problem dimension and automatically manages the basis functions selection for the function approximators.

Future work will attempt to validate the proposed design approach on more complex high-order nonlinear processes of practical importance.

## Author Contributions

Conceptualization, M.-B.R.; methodology, M.-B.R.; software, T.L.; validation, T.L.; formal analysis, M.-B.R.; investigation, T.L.; data curation, M.-B.R. and T.L.; writing–original draft preparation, M.-B.R. and T.L.; writing–review and editing, M.-B.R. and T.L.; supervision, M.-B.R.

## Funding

This research received no external funding.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

- Radac, M.B.; Precup, R.E.; Petriu, E.M. Model-free primitive-based iterative learning control approach to trajectory tracking of MIMO systems with experimental validation. IEEE Trans. Neural Netw. Learn. Syst. **2015**, 26, 2925–2938.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
- Bertsekas, D.P.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996.
- Wang, F.Y.; Zhang, H.; Liu, D. Adaptive dynamic programming: an introduction. IEEE Comput. Intell. Mag. **2009**, 4, 39–47.
- Lewis, F.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag. **2012**, 32, 76–105.
- Lewis, F.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ. Syst. Mag. **2009**, 9, 76–105.
- Murray, J.; Cox, C.J.; Lendaris, G.G.; Saeks, R. Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. **2002**, 32, 140–153.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
- Kiumarsi, B.; Lewis, F.L.; Naghibi-Sistani, M.B.; Karimpour, A. Optimal tracking control of unknown discrete-time linear systems using input–output measured data. IEEE Trans. Cybern. **2015**, 45, 2270–2779.
- Kiumarsi, B.; Lewis, F.L.; Modares, H.; Karimpour, A.; Naghibi-Sistani, M.B. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica **2014**, 50, 1167–1175.
- Radac, M.B.; Precup, R.E.; Roman, R.C. Model-free control performance improvement using virtual reference feedback tuning and reinforcement Q-learning. Int. J. Syst. Sci. **2017**, 48, 1071–1083.
- Ernst, D.; Geurts, P.; Wehenkel, L. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. **2005**, 6, 2089–2099.
- Hafner, R.; Riedmiller, M. Reinforcement learning in feedback control. Challenges and benchmarks from technical process control. Mach. Learn. **2011**, 84, 137–169.
- Zhao, D.; Wang, B.; Liu, D. A supervised actor critic approach for adaptive cruise control. Soft Comput. **2013**, 17, 2089–2099.
- Cui, R.; Yang, R.; Li, Y.; Sharma, S. Adaptive neural network control of AUVs with Control input nonlinearities using reinforcement learning. IEEE Trans. Syst. Man Cybern. **2017**, 47, 1019–1029.
- Xu, X.; Hou, Z.; Lian, C.; He, H. Online learning control using adaptive critic designs with sparse kernel machines. IEEE Trans. Neural Netw. Learn. Syst. **2013**, 24, 762–775.
- He, H.; Ni, Z.; Fu, J. A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing **2012**, 78, 3–13.
- Modares, H.; Lewis, F.L.; Jiang, Z.P. H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. **2015**, 26, 2550–2562.
- Li, J.; Modares, H.; Chai, T.; Lewis, F.L.; Xie, L. Off-policy reinforcement learning for synchronization in multiagent graphical games. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 2434–2445.
- Bertsekas, D. Value and policy iterations in optimal control and adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 500–509.
- Yang, Y.; Wunsch, D.; Yin, Y. Hamiltonian-driven adaptive dynamic programming for continuous nonlinear dynamical systems. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 1929–1940.
- Kamalapurkar, R.; Andrews, L.; Walters, P.; Dixon, W.E. Model-based reinforcement learning for infinite horizon approximate optimal tracking. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 753–758.
- Radac, M.B.; Precup, R.E. Data-driven MIMO Model-free reference tracking control with nonlinear state-feedback and fractional order controllers. Appl. Soft Comput. **2018**, 73, 992–1003.
- Campi, M.C.; Lecchini, A.; Savaresi, S.M. Virtual Reference Feedback Tuning: A direct method for the design of feedback controllers. Automatica **2002**, 38, 1337–1346.
- Hjalmarsson, H. Iterative Feedback Tuning—An overview. Int. J. Adapt. Control Signal Process. **2002**, 16, 373–395.
- Janssens, P.; Pipeleers, G.; Swevers, J.L. Model-free iterative learning control for LTI systems and experimental validation on a linear motor test setup. In Proceedings of the 2011 American Control Conference (ACC), San Francisco, CA, USA, 29 June–1 July 2011; pp. 4287–4292.
- Radac, M.B.; Precup, R.E. Optimal behavior prediction using a primitive-based data-driven model-free iterative learning control approach. Comput. Ind. **2015**, 74, 95–109.
- Chi, R.; Hou, Z.S.; Jin, S.; Huang, B. An improved data-driven point-to-point ILC using additional on-line control inputs with experimental verification. IEEE Trans. Syst. Man Cybern. **2017**, 49, 687–696.
- Abouaissa, H.; Fliess, M.; Join, C. On ramp metering: towards a better understanding of ALINEA via model-free control. Int. J. Control **2017**, 90, 1018–1026.
- Hou, Z.S.; Liu, S.; Tian, T. Lazy-learning-based data-driven model-free adaptive predictive control for a class of discrete-time nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. **2017**, 28, 1914–1928.
- Radac, M.B.; Precup, R.E.; Roman, R.C. Data-driven model reference control of MIMO vertical tank systems with model-free VRFT and Q-learning. ISA Trans. **2018**, 73, 227–238.
- Bolder, J.; Kleinendorst, S.; Oomen, T. Data-driven multivariable ILC: enhanced performance by eliminating L and Q filters. Int. J. Robust Nonlinear Control **2018**, 28, 3728–3751.
- Wang, Z.; Lu, R.; Gao, F.; Liu, D. An indirect data-driven method for trajectory tracking control of a class of nonlinear discrete-time systems. IEEE Trans. Ind. Electron. **2017**, 64, 4121–4129.
- Pandian, B.J.; Noel, M.M. Control of a bioreactor using a new partially supervised reinforcement learning algorithm. J. Proc. Control **2018**, 69, 16–29.
- Diaz, H.; Armesto, L.; Sala, A. Fitted q-function control methodology based on takagi-sugeno systems. IEEE Trans. Control Syst. Tech. **2018**.
- Wang, W.; Chen, X.; Fu, H.; Wu, M. Data-driven adaptive dynamic programming for partially observable nonzero-sum games via Q-learning method. Int. J. Syst. Sci. **2019**.
- Mu, C.; Zhang, Y. Learning-based robust tracking control of quadrotor with time-varying and coupling uncertainties. IEEE Trans. Neural Netw. Learn. Syst. **2019**.
- Liu, D.; Yang, G.H. Model-free adaptive control design for nonlinear discrete-time processes with reinforcement learning techniques. Int. J. Syst. Sci. **2018**, 49, 2298–2308.
- Song, F.; Liu, Y.; Xu, J.X.; Yang, X.; Zhu, Q. Data-driven iterative feedforward tuning for a wafer stage: A high-order approach based on instrumental variables. IEEE Trans. Ind. Electr. **2019**, 66, 3106–3116.
- Kofinas, P.; Dounis, A. Fuzzy Q-learning agent for online tuning of PID controller for DC motor speed control. Algorithms **2018**, 11, 148.
- Radac, M.-B.; Precup, R.-E. Three-level hierarchical model-free learning approach to trajectory tracking control. Eng. Appl. Artif. Intell. **2016**, 55, 103–118.
- Radac, M.-B.; Precup, R.-E. Data-based two-degree-of-freedom iterative control approach to constrained non-linear systems. IET Control Theory Appl. **2015**, 9, 1000–1010.
- Salgado, M.; Clempner, J.B. Measuring the emotional state among interacting agents: A game theory approach using reinforcement learning. Expert Syst. Appl. **2018**, 97, 266–275.
- Silva, M.A.L.; de Souza, S.R.; Souza, M.J.F.; Bazzan, A.L.C. A reinforcement learning-based multi-agent framework applied for solving routing and scheduling problems. Expert Syst. Appl. **2019**, 131, 148–171.
- Campestrini, L.; Eckhard, D.; Gevers, M.; Bazanella, M. Virtual reference tuning for non-minimum phase plants. Automatica **2011**, 47, 1778–1784.
- Al-Tamimi, A.; Lewis, F.L.; Abu-Khalaf, M. Discrete-time nonlinear HJB Solution using approximate dynamic programming: Convergence proof. IEEE Trans. Syst. Man Cybern. Cybern. **2008**, 38, 943–949.
- Rantzer, A. Relaxed dynamic programming in switching systems. IEEE Proc. Control Theory Appl. **2006**, 153, 567–574.
- Inteco, LTD. Two Rotor Aerodynamical System; User’s Manual; Inteco, LTD: Krakow, Poland, 2007; Available online: http://ee.sharif.edu/~lcsl/lab/Tras_um_PCI.pdf (accessed on 12 June 2019).
- Radac, M.-B.; Precup, R.-E. Data-driven model-free slip control of anti-lock braking systems using reinforcement Q-learning. Neurocomputing **2018**, 275, 317–329.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. Mach. Learn. **2016**, arXiv:1509.02971.
- Ten Hagen, S.; Krose, B. Neural Q-learning. Neural Comput. Appl. **2003**, 12, 81–88.
- Radac, M.B.; Precup, R.E. Data-Driven model-free tracking reinforcement learning control with VRFT-based adaptive actor-critic. Appl. Sci. **2019**, 9, 1807.
- Dierks, T.; Jagannathan, S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw. Learn. Syst. **2012**, 23, 1118–1129.
- Heydari, A. Theoretical and numerical analysis of approximate dynamic programming with approximation errors. J. Guid. Control Dyn. **2016**, 39, 301–311.
- Heydari, A. Revisiting approximate dynamic programming and its convergence. IEEE Trans. Cybern.
**2014**, 44, 2733–2743. [Google Scholar] [CrossRef] [PubMed]

**Figure 1.** Closed-loop state transitions data collection for Example 1: (top) ${y}_{k}$ (black), ${r}_{k}$ (blue), ${y}_{k}^{m}$ (red); (bottom) ${u}_{k}$.

**Figure 2.** Convergence results of the linearly parameterized iterative model-free approximate Value Iteration (LP-IMF-AVI) for the linear process example.

**Figure 4.** Open-loop input–output (IO) data from the two-inputs–two-outputs aerodynamic system (TITOAS) for Virtual Reference Feedback Tuning (VRFT) controller tuning.

**Figure 5.** IO data collection with the linear controller [36]: (**a**) ${u}_{k,1}$; (**b**) ${y}_{k,1}$ (black), ${y}_{k,1}^{m}$ (red), ${r}_{k,1}$ (black dotted); (**c**) ${u}_{k,2}$; (**d**) ${y}_{k,2}$ (black), ${y}_{k,2}^{m}$ (red), ${r}_{k,1}$ (black dotted).

**Figure 7.** The IMF-AVI convergence on TITOAS: ${y}_{k,1}^{m},{y}_{k,2}^{m}$ (red); ${u}_{k,1},{u}_{k,2},{y}_{k,1},{y}_{k,2}$ for LP-IMF-AVI (black), for NP-IMF-AVI with NNs (blue), and for the initial VRFT controller used for transitions collection (green).

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).