Abstract
This paper proposes a novel reinforcement learning (RL)-based tracking control scheme with fixed-time prescribed performance for a reusable launch vehicle subject to parametric uncertainties, external disturbances, and input constraints. First, a fixed-time prescribed performance function is employed to restrain attitude tracking errors, and an equivalent unconstrained system is derived via an error transformation technique. Then, a hyperbolic tangent function is incorporated into the optimal performance index of the unconstrained system to tackle the input constraints. Subsequently, an actor-critic RL framework with super-twisting-like sliding mode control is constructed to establish a practical solution for the optimal control problem. Benefiting from the proposed scheme, the robustness of the RL-based controller against unknown dynamics is enhanced, and the control performance can be quantitatively prearranged by users. Theoretical analysis shows that the attitude tracking errors converge to a preset region within a preassigned fixed time, and the weight estimation errors of the actor-critic networks are uniformly ultimately bounded. Finally, comparative numerical simulation results are provided to illustrate the effectiveness and improved performance of the proposed control scheme.
1. Introduction
Recent years have witnessed an increasing demand for reliable and economical access to space. Reusable launch vehicles (RLV), as a cost-effective means of undertaking space missions, are attracting increasing attention from researchers [1]. The dynamic model of RLV exhibits strong non-linearity and coupling characteristics due to the complex flight environment of the re-entry phase. External disturbances, uncertain structural and aerodynamic parameters, and input constraints inevitably exist during real flight and have a significant impact on the attitude control system. In this context, attitude control for RLV is a challenging topic and has elicited widespread interest. Various control methodologies, such as adaptive control [2], dynamic inversion control [3], robust control [4], sliding mode control [5,6], and neural network (NN) control [7,8], have been applied over the past decades. Nevertheless, there is still scope to develop an optimal control approach for RLV suffering from complicated non-linear dynamics, parametric uncertainties, and limited inputs.
From a mathematical point of view, solving an optimal control problem requires establishing and solving the Hamilton–Jacobi–Bellman (HJB) equation. However, it is difficult to derive an analytical solution of the HJB equation for non-linear continuous-time systems. Given this, a reinforcement learning (RL) scheme with an actor-critic (AC) structure was initially proposed by Werbos [9], whereby a critic network was exploited to approximate the value function, and an actor network was deployed to obtain the optimal control policy. Informed by Werbos' contribution, Vamvoudakis et al. developed an online AC algorithm to solve the continuous-time infinite-horizon optimal control problem [10]. He et al. proposed a novel online learning and optimization structure by incorporating a reference network into the AC structure [11]. Ma et al. devised a learning-based adaptive sliding mode control scheme for a tethered space robot with limited inputs [12]. Although the above control strategies have provided excellent results in terms of optimal control, they require accurate dynamic models [13]. Given the parametric uncertainties and unknown disturbances, it is, in practice, difficult to exactly determine the system dynamics of RLV, limiting the methods' applicability. Therefore, the robustness of the aforementioned methods must be enhanced for practical systems with unknown dynamics. Fan et al. combined integral sliding mode control (ISMC) with a reinforcement learning control scheme for non-linear systems with partially unknown dynamics [14]. However, the input constraints were not considered, and the ISMC used may lead to unexpected oscillation [15]. Zhang et al. developed a learning-based tracking control scheme in which the uncertainties and input constraints were considered [16]. Nevertheless, the control design is conservative, and the iterative algorithm is rather complicated.
It is of note that previous RL-based control methods only establish asymptotic or finite-time convergence [17]. Consequently, the upper bound of the convergence time is uncontrollable, and the transient and steady-state performance, namely, the maximum overshoot and the steady-state accuracy, cannot be quantitatively prearranged by users. As a promising solution to this problem, the prescribed performance control (PPC) method created by Bechlioulis et al. has attracted widespread attention [18]. The salient feature of PPC is that users can quantitatively prearrange both the transient performance and the steady-state tracking error. In [19], a novel PPC scheme combined with a command filter was proposed for a quadrotor unmanned aerial vehicle subject to error constraints. In [20], an NN-based adaptive non-affine tracking controller was devised for an air-breathing hypersonic vehicle with guaranteed prescribed performance. In [21], a data-driven PPC scheme was developed for an unmanned surface vehicle with unknown dynamics. However, the conventional PPC approach only ensures that the system states converge to the preset region as time tends to infinity [22], which is unsatisfactory for time-limited problems such as the re-entry mission. Moreover, input constraints and optimality are not comprehensively considered in the conventional PPC paradigm. Therefore, the expected performance index cannot be consistently guaranteed and optimized for RLV suffering from poor aerodynamic maneuverability and limited control torques.
Motivated by the foregoing considerations, a novel RL-based tracking controller, with fixed-time prescribed performance for RLV subject to parameter uncertainties, external disturbances and input constraints, is investigated. The main contributions and characteristics of the proposed method can be summarized as follows.
- An online RL-based nearly optimal controller with limited inputs is developed by synthesizing the AC structure and the hyperbolic tangent performance index. In addition, the robustness of the learning-based controller is strengthened by incorporating a super-twisting-like sliding mode control law.
- Compared with the previous learning-based controllers described in [10,11,12], in which the system dynamics are required to be known exactly, the proposed control scheme only requires the input-output data pairs of RLV, such that the system dynamics can be completely unknown.
- In contrast to existing RL-based control schemes with asymptotic or finite-time convergence [10,11,12,17,21], the proposed control scheme can ensure that the tracking errors converge to a preset region within a preassigned fixed time. Moreover, the prescribed transient and steady-state performance can be guaranteed.
- Comparative numerical simulation investigations show that the proposed method provides improved performance in terms of transient response and steady-state accuracy, with less control effort.
2. Problem Statement and Preliminaries
2.1. Problem Statement
Following [2], the control-oriented model of the rigid-body RLV is given as follows:
where represents the attitude angle vector, denotes the angular rate vector, is the control input vector, and are limited to the interval , is the unknown disturbance vector, is the aerodynamic torque vector, and is the external disturbance vector. can be formulated as:
where , and represent the damping moment coefficients, , and are the static stability moment coefficients, V is the velocity, is the dynamic pressure, and and are the cross-sectional area and the reference length of RLV, respectively.
The skew-symmetric matrix , the inertia matrix , and the coordinate transformation matrix are defined by
Defining the guidance command vector , the attitude tracking error vector , and the angular rate tracking error vector , the tracking error dynamics are given as:
where is the control matrix, denotes the lumped disturbance vector.
Assumption A1
([2]). is bounded by .
Assumption A2
([2]). During the re-entry phase, ; thus, is always invertible.
Control Objective: According to the tracking error dynamics (4), the control objective of this paper can be summarized as developing an RL-based optimal control scheme with guaranteed fixed-time prescribed performance such that the attitude tracking errors can converge to a preset region within a preassigned fixed time.
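For a compact statement of what "fixed-time prescribed performance" demands, the objective can be written as follows; the subscripted symbols below are illustrative placeholders rather than the paper's exact notation, since the original equations are not reproduced here.

```latex
% Illustrative formalization of the control objective (notation assumed):
% for each attitude channel i, with an envelope rho_i(t) as in Lemma 2,
\begin{equation*}
  |e_{\Omega,i}(t)| < \rho_i(t) \quad \forall t \ge 0,
  \qquad
  \rho_i(t) = \rho_{T,i} \quad \text{for all } t \ge T,
\end{equation*}
% i.e., the tracking error is confined to the preset region
% |e_{\Omega,i}| < rho_{T,i} no later than the preassigned fixed time T.
```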
2.2. Preliminaries
Lemma 1
([23]). Considering a non-linear function , the following equation
always holds, where is bounded by a real positive constant.
Lemma 2
([24]). Considering the following fixed-time prescribed performance function (FTPPF)
where represent the initial and terminal values of FTPPF, respectively. is the preassigned convergence time. It can be concluded that is a positive, non-increasing and continuous function with , and .
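Since the explicit expression of the FTPPF is not reproduced above, a minimal numerical sketch is given below using one candidate that satisfies every property listed in Lemma 2, namely ρ(t) = (ρ0 − ρT)((T − t)/T)^h + ρT for t < T and ρ(t) = ρT thereafter; this particular form and the exponent h > 1 are assumptions for illustration, not necessarily the function of [24].

```python
import numpy as np

def ftppf(t, rho0=1.0, rhoT=0.05, T=5.0, h=2.0):
    """A candidate fixed-time prescribed performance function:
    positive, non-increasing, continuous, rho(0) = rho0, and
    rho(t) = rhoT for all t >= T (the preassigned convergence time)."""
    t = np.asarray(t, dtype=float)
    decay = (np.clip(T - t, 0.0, None) / T) ** h
    return (rho0 - rhoT) * decay + rhoT

# Quick check of the Lemma 2 properties for this candidate form.
ts = np.linspace(0.0, 8.0, 200)
rho = ftppf(ts)
assert np.all(rho > 0) and np.all(np.diff(rho) <= 1e-12)
assert np.isclose(ftppf(0.0), 1.0) and np.isclose(ftppf(6.0), 0.05)
```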
Lemma 3
([25]). For any , the following inequality holds
where satisfies (i.e., ).
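The inequality of Lemma 3 is not reproduced above; lemmas of this type (see, e.g., [25]) commonly take the form 0 ≤ |x| − x tanh(x/ε) ≤ κε for any ε > 0, where κ ≈ 0.2785 is the constant satisfying κ = e^(−(κ+1)). Assuming that form, the bound can be checked numerically:

```python
import numpy as np

KAPPA = 0.2785  # the constant satisfying kappa = exp(-(kappa + 1))

def tanh_bound_gap(x, eps):
    """Gap |x| - x * tanh(x / eps), presumed to lie in [0, KAPPA * eps]."""
    return np.abs(x) - x * np.tanh(x / eps)

for eps in (0.1, 1.0, 10.0):
    x = np.linspace(-50 * eps, 50 * eps, 100001)
    gap = tanh_bound_gap(x, eps)
    assert gap.min() >= -1e-12 and gap.max() <= KAPPA * eps + 1e-9
```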
3. Controller Design
3.1. Prescribed Performance Constraint
In this subsection, the following constraint is formulated to restrain the attitude tracking errors within the FTPPF
where is defined in Lemma 2. Subsequently, the equivalent unconstrained error variables can be derived via the error transformation method [18]
with . Taking the first and second-order time derivatives of yields
with , and . Substituting (4) into (10), one obtains
where , and , and the symbols and represent the diagonal matrix and the column vector, respectively.
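As a concrete illustration of Section 3.1, the sketch below assumes a symmetric envelope −ρ_i(t) < e_i(t) < ρ_i(t) for constraint (8) and the tanh-type transformation of [18], whose inverse is ε_i = artanh(e_i/ρ_i); the exact constraint and transformation in (8) and (9) may differ in detail.

```python
import numpy as np

def transform_error(e, rho):
    """Map a constrained tracking error -rho < e < rho to an
    unconstrained variable via the inverse of S(eps) = tanh(eps):
        eps = artanh(e / rho) = 0.5 * ln((rho + e) / (rho - e))."""
    ratio = np.clip(e / rho, -1 + 1e-12, 1 - 1e-12)  # guard the open interval
    return np.arctanh(ratio)

# While e stays strictly inside the envelope, eps remains finite;
# eps blows up only as |e| approaches rho, which enforces the constraint.
print(transform_error(0.4, 0.5))  # ~1.0986
```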
3.2. Reinforcement Learning-Based Control Design
Firstly, the following sliding variable is defined as:
where is a positive diagonal matrix. Taking the first time derivative of along (11) yields
where . In order to achieve a satisfactory tracking performance, a traditional super-twisting controller based on (12) and (13) is developed as follows:
where , , and and are positive constants. Nevertheless, the feasibility of (14) may not be guaranteed with limited control inputs, and the time-fuel performance index cannot be approximately optimized. To this end, an online RL-based nearly optimal controller is proposed by integrating the AC structure into the super-twisting controller to comprehensively address the issues mentioned above.
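The super-twisting law (14) is not reproduced above; its classical componentwise form is u = −k1 |s|^(1/2) sign(s) + v with v̇ = −k2 sign(s). A minimal single-channel sketch under that assumed form is given below, with gains and disturbance chosen purely for illustration.

```python
import numpy as np

def super_twisting_step(s, v, k1, k2, dt):
    """One Euler step of the classical super-twisting algorithm:
        u = -k1 * |s|^(1/2) * sign(s) + v,   v_dot = -k2 * sign(s).
    The discontinuity sits inside an integrator, so u itself is continuous,
    which mitigates chattering compared with first-order sliding mode."""
    u = -k1 * np.sqrt(abs(s)) * np.sign(s) + v
    v_next = v - k2 * np.sign(s) * dt
    return u, v_next

# Toy loop: drive s_dot = u + d(t) toward zero under a Lipschitz
# disturbance d(t) = 0.3 sin(2t) (|d_dot| <= 0.6 < k2).
s, v, dt = 1.0, 0.0, 1e-3
for k in range(10000):  # 10 s of simulated time
    u, v = super_twisting_step(s, v, k1=1.5, k2=1.1, dt=dt)
    s += (u + 0.3 * np.sin(2.0 * k * dt)) * dt
print(abs(s))  # expected: small residual at the discretization level
```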
Before elaborating the detailed design procedure, it is assumed that there exists a group of admissible control strategies [10]
Moreover, can achieve the control objective while satisfying the following time-fuel performance index
where , is a positive diagonal matrix, and is chosen as [26]:
where is selected as a positive diagonal matrix, represents the variable of integration, and the upper and lower bounds of are and , respectively. The Lyapunov equation of (16) can be calculated as:
where . The optimal value function is defined as:
and the corresponding HJB equation can be formulated as:
where .
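Equation (17) is not reproduced above; the tanh-based nonquadratic integrand of [26] that it presumably follows reads, per input channel, w(u) = 2r ∫_0^u ū artanh(v/ū) dv, which admits the closed form 2rū·u·artanh(u/ū) + rū²·ln(1 − u²/ū²) and penalizes inputs increasingly steeply as |u| approaches the bound ū. A sketch checking the closed form numerically (the symbols ū and r are assumptions):

```python
import numpy as np
from scipy.integrate import quad

def cost_numeric(u, u_bar, r):
    """w(u) = 2 * r * integral_0^u u_bar * artanh(v / u_bar) dv,
    a nonquadratic integrand that diverges as |u| -> u_bar, so the
    optimal input can never exceed the saturation bound."""
    val, _ = quad(lambda v: 2.0 * r * u_bar * np.arctanh(v / u_bar), 0.0, u)
    return val

def cost_closed_form(u, u_bar, r):
    return (2.0 * r * u_bar * u * np.arctanh(u / u_bar)
            + r * u_bar**2 * np.log(1.0 - (u / u_bar)**2))

u, u_bar, r = 0.7, 1.0, 2.0
print(cost_numeric(u, u_bar, r), cost_closed_form(u, u_bar, r))  # ~1.0818 each
```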
Equation (20) is equivalent to
Solving the partial derivative of (21) yields the following optimal control strategy
Substituting into (17), one obtains
where is a column vector with all elements being one, and is a row vector generated by the elements on the main diagonal of . Combining (20) and (23), it can be derived that
The nearly optimal control can be obtained by solving (24) if is available. Inspired by NN-based control schemes, the following approximation of can be established
where is the optimal weight vector, is the basis vector, and is the approximation error of the NN. Subsequently, the gradient of is
where , . In light of the universal approximation property of NN for smooth functions on prescribed compact sets, the approximation errors and are bounded with a finite dimension of [27]. Moreover, it is assumed that , and are bounded [13,14,16,17,21].
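For concreteness, the sketch below shows a quadratic polynomial basis and its gradient of the kind later adopted in Section 4 ("polynomial combinations of the concerned state variables"); the specific basis vector of (25) and (26) is not reproduced here, so names and shapes are illustrative.

```python
import numpy as np

def quad_basis(x):
    """Quadratic polynomial basis phi(x) = [x_i * x_j for i <= j],
    a common choice for approximating a value function."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def quad_basis_grad(x):
    """Jacobian d(phi)/dx; the value-function gradient estimate is then
    grad V_hat = quad_basis_grad(x).T @ W_hat."""
    n = len(x)
    rows = []
    for i in range(n):
        for j in range(i, n):
            g = np.zeros(n)
            g[i] += x[j]   # handles i == j correctly: d(x_i^2)/dx_i = 2 x_i
            g[j] += x[i]
            rows.append(g)
    return np.array(rows)  # shape: (n * (n + 1) / 2, n)

x = np.array([0.5, -1.0, 2.0])
V_hat = quad_basis(x) @ np.ones(6)  # with an illustrative 6-dim weight vector
```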
Recalling (22), the NN-based nearly optimal control law can be formulated as:
Defining the Bellman error as [14]:
with , (20) can be rewritten as:
where is the bounded HJB error [10].
In this paper, the optimal weight is generated by the online RL scheme with the AC structure. In this context, the nearly optimal control policy is formulated as:
where is the weight of the actor network. Moreover, the performance index (16) can be estimated as:
where is the weight of the critic network. The adaptation laws for and are designed as:
where , ; and are positive diagonal matrices; ; , and are positive real constants. The projection operator is imposed on (33) to guarantee that is bounded [28]. It is assumed that is persistently exciting (PE) [29].
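The adaptation laws (32) and (33) themselves are not reproduced above. Purely as a hedged illustration of the ingredients involved, the sketch below implements a generic normalized-gradient critic update driven by the Bellman error, in the spirit of [10,14], plus a simple norm-ball projection standing in for the projection operator imposed on (33); all gains and names are assumptions, not the paper's laws.

```python
import numpy as np

def critic_step(Wc, sigma, r_cost, eta_c, dt):
    """Generic normalized-gradient critic update: descend the squared
    Bellman error delta = sigma^T Wc + r_cost, where sigma is the
    regressor grad_phi(x) @ x_dot evaluated along the trajectory."""
    delta = sigma @ Wc + r_cost   # Bellman (temporal-difference) error
    m = 1.0 + sigma @ sigma       # normalization against regressor size
    return Wc - eta_c * sigma * delta / m**2 * dt

def project_ball(W, radius):
    """Norm-ball projection keeping the actor weights within a known bound."""
    n = np.linalg.norm(W)
    return W if n <= radius else W * (radius / n)
```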
The proposed control scheme is illustrated by a block diagram in Figure 1.
Figure 1.
The block diagram of the proposed control scheme.
3.3. Stability Analysis
Theorem 1.
Consider the transformed error dynamics (11) with the sliding variable (12), under the nearly optimal control policy (30) and the adaptation laws (32) and (33). Then:
- the sliding variable and the weight estimation errors and are uniformly ultimately bounded (UUB);
- the attitude tracking errors uniformly obey the fixed-time performance envelopes in (8).
Proof of Theorem 1
Consider the Lyapunov function candidate as follows:
where and . Taking the first time derivative of V yields
with , , and
Furthermore, according to (17), it can be deduced that
with . Equation (38) can be rewritten as:
where . Noting that is an odd function, it can be concluded that . Moreover, it follows from the definition in (17) that . Based on the above discussion, (40) can be simplified as:
Subsequently, incorporating (32) with yields
With the aid of Lemma 1 and Lemma 3, (43) can be further simplified as:
where , and are bounded. For ease of notation, two bounded vectors and are defined as:
then (44) can be further sorted into the following structure
With the definition of the generalized vector , the inequality (47) can be rewritten as:
where , . Given that and are bounded vectors, it can be concluded that is a bounded vector, with and . Therefore, is negative if
which implies that the sliding variable and are UUB. Furthermore, in view of the assumption that is PE, the weight estimation error is also UUB [14]. Once the sliding variable converges to the vicinity of the origin, the gradient vector , together with , will also be small enough. Defining , and assuming that there exists a small positive constant satisfying , the dynamics of the sliding variable can be represented as:
Following the lines of deduction in [30], the sliding variable can reach a small neighborhood of the origin; thus, the equivalent unconstrained errors are bounded.
Recalling the error transformation method in (9), it can be derived that
Note that is a smooth, strictly increasing function of and satisfies and ; thus, it can be derived that and hold true if is bounded. Following the above analysis, it can be deduced that never violates the fixed-time prescribed performance constraints in (8) for all , and for .
This completes the proof of Theorem 1. □
4. Numerical Simulations
In this section, numerical simulations for the re-entry phase of RLV are carried out to illustrate the effectiveness and improved performance of the proposed control scheme. The parameters of the RLV are based on [31]. The initial states of the re-entry phase are set as , , , , and . The guidance commands are designed as , and . The uncertainties of the inertia parameters, the aerodynamic parameters and the air density are each set to a +20% bias. The external disturbance is given as [32]:
The saturation bound of the control torques is set as . The attitude tracking errors are required to be less than for . Therefore, the parameters of the FTPPF are selected as and . To balance the transient and steady-state performance, the parameters of the proposed control scheme are carefully chosen as . Inspired by the work in [12,14], suitable basis vectors can be selected as polynomial combinations of the concerned state variables in the performance index (16). The basis vectors of the actor and critic networks are tuned in repeated trials to balance the approximation error and the computational burden. They are identically selected as:
The initial values of and are randomly generated in . The user-defined parameters of the weight adaptation laws are designed as .
Furthermore, in order to demonstrate the superiority of the proposed control scheme, the RL-based finite-time control (RLFTC) with input constraints in [17] and the robust adaptive backstepping control (RABC) method in [33] are implemented. To provide a fair comparison, the time-fuel performance index of the RLFTC is identical to that of the proposed method, and the generation method for the initial values of the actor and critic networks remains the same. The other user-defined parameters of the RLFTC are carefully selected as . The parameters of the RABC remain the same as those in [33]. The simulation results are given in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10.
Figure 2.
Time histories of the attitude angles.
Figure 3.
Time histories of the attitude tracking error .
Figure 4.
Time histories of the attitude tracking error .
Figure 5.
Time histories of the attitude tracking error .
Figure 6.
Time histories of the angular rates.
Figure 7.
Time histories of the sliding manifolds.
Figure 8.
Time histories of the control torques.
Figure 9.
Weights of the actor network.
Figure 10.
Weights of the critic network.
The tracking performance of the attitude angles under the three controllers is shown in Figure 2, Figure 3, Figure 4 and Figure 5. It can clearly be seen that the proposed control scheme provides faster convergence and smaller overshoot in the presence of parametric uncertainties and external disturbances. The angular rates of RLV and the sliding manifolds are shown in Figure 6 and Figure 7, providing further evidence of the improved performance of the proposed control scheme. Moreover, by comparing the tracking performance with the control inputs illustrated in Figure 8, it can be observed that the proposed control scheme exhibits better transient and steady-state performance with limited control inputs. The evolution trajectories of and are depicted in Figure 9 and Figure 10, respectively. It can readily be seen that and converge to the same values, which indicates that the ideal weight vector can be effectively estimated via the proposed adaptation law.
To make a clear comparison, the maximum overshoot and the adjustment time are introduced to evaluate the transient performance of the controllers. Moreover, the integral absolute control effort (IACE) index and the integral of the time and absolute error (ITAE) index of the three control schemes are calculated to evaluate the tracking accuracy and control effort. The performance indices of the three channels are summarized in Table 1, Table 2 and Table 3.
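Assuming the standard definitions ITAE = ∫ t|e(t)| dt and IACE = ∫ |u(t)| dt summed over the three torque channels (the paper's exact definitions are not reproduced here), these indices can be computed from logged trajectories as follows:

```python
import numpy as np

def itae(t, e):
    """Integral of Time and Absolute Error: integral of t * |e(t)| dt."""
    return np.trapz(t * np.abs(e), t)

def iace(t, u):
    """Integral Absolute Control Effort: integral of |u(t)| dt,
    summed over channels when u is an (N, 3) array of torques."""
    return np.trapz(np.abs(u), t, axis=0).sum()

# Usage on hypothetical logged data: t of shape (N,), one error channel
# e_alpha of shape (N,), and control torques u of shape (N, 3).
# print(itae(t, e_alpha), iace(t, u))
```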
Table 1.
Performance indices of the -channel.
Table 2.
Performance indices of the -channel.
Table 3.
Performance indices of the -channel.
From the foregoing simulation results, it can be concluded that the proposed control scheme outperforms RLFTC and RABC in terms of transient performance, tracking accuracy and control effort. Furthermore, by synthesizing the AC structure and the fixed-time PPC paradigm, the proposed control scheme offers an online RL-based model-free solution for controlling RLV and other complex industrial systems.
5. Conclusions
In this paper, an innovative RL-based tracking control scheme with fixed-time prescribed performance has been proposed for RLV under parametric uncertainties, external disturbances, and input constraints. By resorting to the FTPPF, fixed-time performance envelopes have been imposed on the attitude tracking errors. Combined with the AC-based online RL structure and the super-twisting-like sliding mode control, the optimal control policy and the performance index have been learned recursively, and the robustness of the learning process has been further enhanced. Moreover, theoretical analysis has demonstrated that the attitude tracking errors converge to a preset region within a preassigned fixed time, and that the sliding variable and the weight estimation errors of the actor and critic networks are UUB. Comparative simulation results have verified the effectiveness and improved performance of the proposed control scheme. In future work, the angular rate constraints will be considered, and the optimal control problem for the underactuated RLV will be specifically addressed. Experimental investigations, such as hardware-in-the-loop simulations, will also be undertaken.
Author Contributions
Conceptualization, S.X. and Y.G.; methodology, S.X.; software, S.X.; validation, Y.G., C.W. and Y.L.; formal analysis, S.X.; investigation, S.X.; resources, C.W.; data curation, Y.G.; writing—original draft preparation, S.X.; writing—review and editing, S.X., Y.G., C.W., Y.L. and L.X.; visualization, S.X., Y.G. and L.X.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the SAST Foundation grant number SAST2021-028.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| RLV | Reusable Launch Vehicles |
| NN | Neural Network |
| RL | Reinforcement Learning |
| AC | Actor-Critic |
| UUB | Uniformly Ultimately Bounded |
| HJB | Hamilton–Jacobi–Bellman |
| PPC | Prescribed Performance Control |
| FTPPF | Fixed-time Prescribed Performance Function |
| RLFTC | Reinforcement-Learning-Based Finite-Time Control |
| RABC | Robust Adaptive Backstepping Control |
| IACE | Integral Absolute Control Effort |
| ITAE | Integral of Time and Absolute Error |
References
- Stott, J.E.; Shtessel, Y.B. Launch vehicle attitude control using sliding mode control and observation techniques. J. Frankl. Inst. B 2012, 349, 397–412.
- Tian, B.L.; Li, Z.Y.; Zhao, X.P.; Zong, Q. Adaptive Multivariable Reentry Attitude Control of RLV With Prescribed Performance. IEEE Trans. Syst. Man Cybern. Syst. 2022, 1–5.
- Acquatella, P.; Briese, L.E.; Schnepper, K. Guidance command generation and nonlinear dynamic inversion control for reusable launch vehicles. Acta Astronaut. 2020, 174, 334–346.
- Xu, B.; Wang, X.; Shi, Z.K. Robust adaptive neural control of nonminimum phase hypersonic vehicle model. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1107–1115.
- Zhang, L.; Wei, C.Z.; Wu, R.; Cui, N.G. Fixed-time extended state observer based non-singular fast terminal sliding mode control for a VTVL reusable launch vehicle. Aerosp. Sci. Technol. 2018, 82, 70–79.
- Ju, X.Z.; Wei, C.Z.; Xu, H.C.; Wang, F. Fractional-order sliding mode control with a predefined-time observer for VTVL reusable launch vehicles under actuator faults and saturation constraints. ISA Trans. 2022, in press.
- Cheng, L.; Wang, Z.B.; Gong, S.P. Adaptive control of hypersonic vehicles with unknown dynamics based on dual network architecture. Acta Astronaut. 2022, 193, 197–208.
- Xu, B.; Shou, Y.X.; Shi, Z.K.; Yan, T. Predefined-Time Hierarchical Coordinated Neural Control for Hypersonic Reentry Vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2022.
- Werbos, P. Approximate dynamic programming for realtime control and neural modelling. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches; Van Nostrand Reinhold: New York, NY, USA, 1992; pp. 493–525.
- Vamvoudakis, K.G.; Lewis, F.L. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010, 46, 878–888.
- He, H.B.; Ni, Z.; Fu, J. A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 2012, 78, 3–13.
- Ma, Z.Q.; Huang, P.F.; Lin, Y.X. Learning-based Sliding Mode Control for Underactuated Deployment of Tethered Space Robot with Limited Input. IEEE Trans. Aerosp. Electron. Syst. 2021, 58, 2026–2038.
- Wang, N.; Gao, Y.; Zhao, H.; Ahn, C.K. Reinforcement learning-based optimal tracking control of an unknown unmanned surface vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3034–3045.
- Fan, Q.Y.; Yang, G.H. Adaptive actor–critic design-based integral sliding-mode control for partially unknown nonlinear systems with input disturbances. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 165–177.
- Kuang, Z.; Gao, H.; Tomizuka, M. Precise linear-motor synchronization control via cross-coupled second-order discrete-time fractional-order sliding mode. IEEE/ASME Trans. Mechatronics 2020, 26, 358–368.
- Zhang, H.; Cui, X.; Luo, Y.; Jiang, H. Finite-horizon H∞ tracking control for unknown nonlinear systems with saturating actuators. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 1200–1212.
- Wang, N.; Gao, Y.; Yang, C.; Zhang, X.F. Reinforcement learning-based finite-time tracking control of an unknown unmanned surface vehicle with input constraints. Neurocomputing 2022, 484, 26–37.
- Bechlioulis, C.P.; Rovithakis, G.A. Robust adaptive control of feedback linearizable MIMO nonlinear systems with prescribed performance. IEEE Trans. Autom. Control 2008, 53, 2090–2099.
- Cui, G.Z.; Yang, W.; Yu, J.P.; Li, Z.; Tao, C.B. Fixed-time prescribed performance adaptive trajectory tracking control for a QUAV. IEEE Trans. Circuits Syst. II 2021, 69, 494–498.
- Bu, X.W. Guaranteeing prescribed performance for air-breathing hypersonic vehicles via an adaptive non-affine tracking controller. Acta Astronaut. 2018, 151, 368–379.
- Wang, N.; Gao, Y.; Zhang, X.F. Data-driven performance-prescribed reinforcement learning control of an unmanned surface vehicle. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5456–5467.
- Luo, S.B.; Wu, X.; Wei, C.S.; Zhang, Y.L.; Yang, Z. Adaptive finite-time prescribed performance attitude tracking control for reusable launch vehicle during reentry phase: An event-triggered case. Adv. Space Res. 2022, 69, 3814–3827.
- Modares, H.; Sistani, M.B.N.; Lewis, F.L. A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Trans. 2013, 52, 611–621.
- Tan, J.; Guo, S.J. Backstepping control with fixed-time prescribed performance for fixed wing UAV under model uncertainties and external disturbances. Int. J. Control 2022, 95, 934–951.
- Yuan, Y.; Wang, Z.; Guo, L.; Liu, H.P. Barrier Lyapunov functions-based adaptive fault tolerant control for flexible hypersonic flight vehicles with full state constraints. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 3391–3400.
- Lyshevski, S.E. Optimal control of nonlinear continuous-time systems: Design of bounded controllers via generalized nonquadratic functionals. In Proceedings of the 1998 American Control Conference, ACC (IEEE Cat. No. 98CH36207), Philadelphia, PA, USA, 24–26 June 1998; Volume 1, pp. 205–209.
- Hornik, K.; Stinchcombe, M.; White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 1990, 3, 551–560.
- Kamalapurkar, R.; Walters, P.; Dixon, W.E. Model-based reinforcement learning for approximate optimal regulation. Automatica 2016, 64, 94–104.
- Bhasin, S.; Kamalapurkar, R.; Johnson, M.; Vamvoudakis, K.G.; Lewis, F.L.; Dixon, W.E. A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 2013, 49, 82–92.
- Moreno, J.A.; Osorio, M. Strict Lyapunov functions for the super-twisting algorithm. IEEE Trans. Autom. Control 2012, 57, 1035–1040.
- Wei, C.Z.; Wang, M.Z.; Lu, B.G.; Pu, J.L. Accelerated Landweber iteration based control allocation for fault tolerant control of reusable launch vehicle. Chin. J. Aeronaut. 2022, 35, 175–184.
- Zhang, C.F.; Zhang, G.S.; Dong, Q. Fixed-time disturbance observer-based nearly optimal control for reusable launch vehicle with input constraints. ISA Trans. 2022, 122, 182–197.
- Wang, Z.; Wu, Z.; Du, Y. Robust adaptive backstepping control for reentry reusable launch vehicles. Acta Astronaut. 2016, 126, 258–264.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).