1. Introduction
The accurate trajectory tracking of marine surface vessels is a foundational capability for autonomous navigation, offshore operations, and safety-critical station-keeping tasks [1,2]. In practical deployments, tracking performance is challenged by a combination of (i) unmodelled and time-varying hydrodynamic effects, (ii) stochastic environmental loads induced by wind, waves, and currents, and (iii) strict actuation constraints arising from limited thruster authority and allocation feasibility. These factors can jointly induce large tracking errors, severe input saturation, and even qualitative trajectory distortion when the control law and the physical actuation model are not formulated in a coordinate-consistent manner. Thus, purely model-based designs may suffer from performance degradation when the nominal model is inaccurate, whereas aggressive robust designs may achieve stability at the expense of excessive control effort.
From a modelling perspective, the planar 3-degree-of-freedom (3-DOF) vessel dynamics constitute a coupled multi-input multi-output (MIMO) nonlinear system in which the kinematics are naturally expressed in the inertial frame while the generalised forces and moment are applied in the body-fixed frame. This inherent frame mismatch complicates both controller synthesis and theoretical analysis [3]. A common approach is to design tracking controllers via backstepping or strict-feedback transformations, which can provide systematic stability proofs when the system is cast into a suitable cascade form. However, when environmental disturbances and unmodelled dynamics are significant, purely model-based designs often become conservative to preserve robustness, or require increasingly complex adaptation laws [4]. Moreover, the presence of actuator bounds and allocation constraints can invalidate unconstrained stability arguments, since saturation introduces additional nonlinearities and may render a reference trajectory infeasible [5]. A prominent model-light route to robustness is active disturbance rejection control (ADRC), whose core component is the extended state observer (ESO) that estimates the total disturbance—including unmodelled dynamics and external perturbations—in real time [6,7,8]. Systematic tuning principles based on scaling and bandwidth parameterisation further popularised ADRC in engineering practice by explicitly linking observer and controller bandwidth to disturbance rejection and noise sensitivity [9,10]. In marine robotics, ESO-type disturbance rejection has been adopted for USV trajectory tracking to mitigate ocean environmental loads and parameter uncertainty, and recent studies have demonstrated its practicality in simulation and field experiments [11]. Nevertheless, ESO-based tracking designs are typically not optimal in an explicit sense; moreover, under hard input bounds, tuning purely for disturbance rejection may increase peak thrust demand and saturation rate, which can slow down recovery and distort the intended path.
Learning-based control methods, particularly reinforcement learning (RL), offer a complementary direction by improving performance through online policy optimisation [12,13]. Actor–critic structures are especially attractive for continuous-control problems, as they can approximate optimal policies and value functions without requiring an exact analytic solution of the Hamilton–Jacobi–Bellman (HJB) equation [14]. Nevertheless, vanilla actor–critic schemes may exhibit policy drift, sensitivity to stochastic disturbances, and degraded transient behavior when deployed on systems with severe uncertainty and hard actuation limits. In marine applications, these issues are amplified by persistent environmental loads and the coupling between translational and rotational motions [15]. Therefore, a key open problem is how to integrate learning with robust disturbance rejection in a way that (a) preserves coordinate consistency between theory and implementation, (b) remains stable under wind–wave–current disturbances, and (c) is feasible under realistic actuator constraints. For continuous-time nonlinear systems, actor–critic learning—closely related to adaptive dynamic programming (ADP)—connects directly to optimal control through policy iteration, temporal-difference learning, and Hamiltonian residual minimisation, enabling online performance improvement without requiring an explicit solution of the HJB partial differential equation [16]. In the marine domain, deep RL has been explored for path-following and trajectory tracking under uncertainty [17], and very recent work has started to incorporate input saturation characteristics into RL-based USV optimal tracking formulations [18]. However, pure RL controllers can be sensitive to unmodelled dynamics and external disturbances, and stability guarantees often become delicate when exploration, approximation errors, and actuator constraints coexist. Recent studies in other networked and cyber-physical domains also indicate a growing trend toward hybrid frameworks that combine structured optimisation components with learning-based adaptation for dynamic multi-objective decision-making [19]. This motivates a principled integration of disturbance observers with RL so that learning takes place around a robust baseline and focuses on optimal refinement rather than disturbance cancellation.
To highlight the motivation, we briefly compare this work with representative related approaches. Model-based robust tracking controllers provide stability guarantees but may become conservative under strong uncertainty and persistent disturbances, especially when actuator constraints are active. ESO- or ADRC-based designs offer practical disturbance rejection by estimating lumped uncertainties, but they do not explicitly address learning-oriented performance refinement. In contrast, many DRL frameworks in other domains, such as multi-agent DRL and multi-DNN DRL for MEC or SD-WAN optimisation, are primarily designed for slot-based decision problems, where DRL serves as the main decision engine for resource allocation and does not aim to provide Lyapunov-style closed-loop stability guarantees for continuous-time nonlinear dynamics. The present manuscript targets safety-critical continuous-time marine motion control and adopts a hybrid decomposition philosophy, in which the ESO-based controller provides the stabilising robust baseline and the actor–critic component is incorporated as a Lyapunov-guided consistency regularisation and secondary online refinement mechanism. This comparison clarifies the research gap and motivates the proposed ESO-enhanced actor–critic framework.
Motivated by these challenges, this paper develops a unified control framework that couples an ESO with an actor–critic RL module for the constrained trajectory tracking of 3-DOF vessels. The ESO is used to estimate and compensate lumped uncertainties, including wind–wave–current loads and unmodelled dynamics, thereby providing a robust baseline that stabilises the learning process. On top of this robustification, an actor–critic component refines the control policy toward improved performance, such as reduced tracking error and control energy.
To strengthen the mathematical consistency of learning, we introduce an HJB-consistency mechanism based on a computable residual metric and a weight-matching coupling between the actor and critic updates. This coupling aligns the learning dynamics with the optimality structure and mitigates uncontrolled weight growth and oscillatory behavior. Importantly, the overall design is formulated in a coordinate-consistent strict-feedback tracking structure, explicitly mapping inertial-frame tracking errors to physically applied body-fixed forces and moment, and incorporating actuator saturation and optional thruster allocation constraints to ensure feasibility.
The proposed framework is validated on a standard 3-DOF vessel benchmark under stochastic wind–wave–current disturbances. Beyond reporting tracking trajectories, we adopt reproducible evaluation metrics that capture both control performance and learning consistency, such as integrated tracking error, control energy, saturation rate, HJB residual, and actor–critic weight mismatch norms. In addition, systematic ablation studies are provided to quantify the complementary roles of robust disturbance estimation and learning-based optimisation.
The main methodological contributions of this paper are summarised as follows:
We integrate ESO-based lumped-disturbance estimation with actor–critic policy improvement within a single closed-loop architecture suitable for 3-DOF vessel tracking. Compared with ESO-only and disturbance-sensitive RL-only designs, the proposed integration typically yields smaller tracking errors with lower control effort under environmental loads [20,21].
Compared with standard TD-error-driven actor–critic schemes commonly used in the RL literature [22,23], the proposed weight-matching coupling is introduced as a Lyapunov-guided consistency regularisation within a control-oriented framework, which helps to suppress parameter drift and improves robustness under disturbances and actuator constraints.
While most RL controllers operate as black boxes, this work guarantees SGUUB stability with a tracking error bound explicitly tied to the ESO residual. This provides a transparent mechanism for performance tuning, linking the observer bandwidth directly to the achievable control precision, and thus offers a reliability that standard heuristic RL methods lack.
The remainder of the paper is organised as follows. Section 2 presents the vessel model, disturbance representation, and coordinate-consistent strict-feedback tracking error system. Section 3 develops the ESO design, the actor–critic learning law with HJB-consistency, and the overall constrained control synthesis; it provides the closed-loop stability analysis and establishes boundedness and ultimate tracking performance guarantees. Section 4 reports simulation results and ablation studies under wind–wave–current disturbances and actuator limits. Finally, Section 5 concludes the paper and outlines future research directions.
2. Vessel Model and Tracking Objective
Consider the 3-DOF vessel dynamics [24]
$$\dot{\eta} = R(\psi)\,\nu, \qquad M\dot{\nu} + C(\nu)\,\nu + D(\nu)\,\nu = \tau + d(t),$$
where $\eta$ is the earth-fixed position and heading, $\nu$ is the body-fixed velocity, $R(\psi)$ is the planar rotation matrix, $M$ is the inertia matrix, $C(\nu)$ is the Coriolis term, $D(\nu)$ is the damping term, and $d(t)$ collects wind, wave, current, and unmodelled effects.
The control objective is designed as follows: given a smooth desired trajectory with bounded derivatives, design a control input such that the tracking errors converge to a small neighbourhood of zero, while maintaining reasonable control effort and robustness against uncertainty and disturbances.
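As a concrete reference point for the model above, the following Python sketch integrates the standard 3-DOF kinematics and dynamics with explicit Euler steps. The inertia and damping values are illustrative placeholders (not the CyberShip II parameters used later), and the Coriolis term is set to zero for simplicity.

```python
import numpy as np

def rot(psi):
    """Planar rotation R(psi) mapping body-fixed velocity to earth-fixed rates."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def vessel_step(eta, nu, tau, d, M, D, C_fn, dt):
    """One explicit-Euler step of eta_dot = R(psi) nu and
    M nu_dot + C(nu) nu + D nu = tau + d."""
    eta_dot = rot(eta[2]) @ nu
    nu_dot = np.linalg.solve(M, tau + d - C_fn(nu) @ nu - D @ nu)
    return eta + dt * eta_dot, nu + dt * nu_dot

# Illustrative (placeholder) parameters: diagonal inertia/damping, zero Coriolis.
M = np.diag([25.8, 33.8, 2.76])
D = np.diag([2.0, 7.0, 0.5])
C_fn = lambda nu: np.zeros((3, 3))

eta, nu = np.zeros(3), np.zeros(3)
for _ in range(10000):  # 100 s with dt = 0.01
    eta, nu = vessel_step(eta, nu, tau=np.array([2.0, 0.0, 0.0]),
                          d=np.zeros(3), M=M, D=D, C_fn=C_fn, dt=0.01)
print(round(nu[0], 2))  # steady surge velocity approaches tau_u / D_u = 1.0
```

With a constant surge force and linear damping, the surge velocity settles at the ratio of applied force to damping, which gives a quick sanity check of the integration.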
Define the earth-fixed tracking error
$$e_1 = \eta - \eta_d,$$
and the earth-fixed strict-feedback velocity error
$$e_2 = \dot{\eta} - \dot{\eta}_d.$$
Then, the strict-feedback first equation holds exactly:
$$\dot{e}_1 = e_2.$$
Differentiating (4), one obtains the earth-fixed acceleration error dynamics (6). Choose a constant nominal inertia matrix $M_0$ and define a nominal feedforward term from the nominal models $C_0(\nu)$ and $D_0(\nu)$. Implement the body-frame control with design input $u$; substituting (8) into (6) yields
$$\dot{e}_2 = u + \Delta,$$
where the known feedforward part collects measurable signals and the lumped uncertainty $\Delta$ is defined in (10).
Remark 1. The lumped uncertainty in Equation (10) contains an input-coupled mismatch term that is linear in u. To clarify well-posedness, we decompose it as
$$\Delta = \Gamma(t)\,u + \delta(t),$$
where $\Gamma(t)$ denotes the input-coupled mismatch gain and $\delta(t)$ collects the remaining rotation-induced coupling, model mismatch, and environmental disturbance terms. In implementation, the control input u is computed explicitly from measured and estimated signals and then mapped to the physical body frame with actuator saturation. Hence, the closed loop is a well-defined nonlinear ODE with saturation rather than an algebraic loop. We assume that $I + \Gamma(t)$ is uniformly nonsingular in the operating region, which is a mild small-gain-type condition ensuring the existence and uniqueness of solutions and preserving ESO convergence under the input-coupled uncertainty. Thus, the second-order uncertain template is
$$\dot{e}_1 = e_2, \qquad \dot{e}_2 = u + \Delta.$$
Remark 2. The strict-feedback template in (13) is introduced for the synthesis of the actor–critic learning and observer-based compensation. In implementation, the virtual input u is mapped back to the physical body-frame force and moment through the kinematics and thruster allocation, so actuator constraints can be enforced without modifying the learning law.
Assumption 1. The lumped disturbance is bounded and has a bounded derivative, i.e., there exist constants $\Delta_{\max} > 0$ and $d_{\max} > 0$ such that $\|\Delta(t)\| \le \Delta_{\max}$ and $\|\dot{\Delta}(t)\| \le d_{\max}$.
Assumption 2. The reference trajectory is twice continuously differentiable and bounded, together with its derivatives, i.e., $\eta_d$, $\dot{\eta}_d$, and $\ddot{\eta}_d$ are bounded.
Assumption 3. The regressor signal $\phi(t)$ satisfies a finite excitation condition over a sliding window: there exist $T_e > 0$ and $\kappa_e > 0$ such that
$$\int_{t}^{t+T_e} \phi(s)\,\phi(s)^{\top}\, ds \;\ge\; \kappa_e I.$$
This assumption is used to strengthen mismatch convergence statements, such as the convergence of the actor–critic weight mismatch to zero.
Define the infinite-horizon quadratic performance index
$$J = \int_{0}^{\infty} \left( e^{\top} Q\, e + u^{\top} R\, u \right) dt,$$
where $e = [e_1^{\top},\, e_2^{\top}]^{\top}$ and $Q = Q^{\top} > 0$, $R = R^{\top} > 0$. The quadratic form in (14) reflects a standard tracking-to-regulation conversion: the stacked error state $e$ is penalised to enforce accurate tracking, while u is penalised to avoid excessive virtual control action. This optimal-control viewpoint provides a principled way to couple learning with a stabilising baseline in the presence of persistent environmental disturbances.
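The performance index can be evaluated on logged closed-loop data by a simple Riemann sum. The sketch below assumes illustrative Q and R weights and a toy exponentially decaying error trajectory, so its value can be checked against the analytic integral.

```python
import numpy as np

def quadratic_cost(e_log, u_log, Q, R, dt):
    """Discretised index J ≈ sum_k (e_k^T Q e_k + u_k^T R u_k) dt on logged data."""
    J = 0.0
    for e, u in zip(e_log, u_log):
        J += float(e @ Q @ e + u @ R @ u) * dt
    return J

# Toy logs: a decaying error and the corresponding control signal.
t = np.arange(0.0, 10.0, 0.01)
e_log = [np.array([np.exp(-s), 0.0]) for s in t]
u_log = [np.array([-np.exp(-s)]) for s in t]
Q, R = np.diag([1.0, 1.0]), np.diag([0.5])
J = quadratic_cost(e_log, u_log, Q, R, 0.01)
print(round(J, 2))  # close to the analytic value 0.75 (slight Riemann-sum excess)
```

The analytic value is $\int_0^{10} 1.5\,e^{-2s}\,ds \approx 0.75$, which the left Riemann sum slightly overestimates.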
Let $V^{*}(e)$ be the optimal value function. The Hamiltonian is
$$H(e, u, \nabla V) = \nabla V^{\top} \dot{e} + e^{\top} Q\, e + u^{\top} R\, u.$$
At the optimum, the HJB equation implies $\min_{u} H(e, u, \nabla V^{*}) = 0$, and the optimal input satisfies the stationarity condition with respect to u. Since $V^{*}$ is unknown for the vessel tracking problem with uncertainty, we approximate it via a critic and compute an approximately optimal policy via an actor. Using (13), the stationary condition $\partial H / \partial u = 0$ yields
$$u^{*} = -\tfrac{1}{2}\, R^{-1}\, \frac{\partial V^{*}}{\partial e_2}.$$
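The stationarity condition can be checked numerically for a quadratic value-function guess: with the strict-feedback template, the minimiser of the Hamiltonian over u is $-\tfrac{1}{2} R^{-1}\, \partial V / \partial e_2$. The matrices below, including the critic guess P, are illustrative and not taken from the manuscript.

```python
import numpy as np

# Strict-feedback template (13): e1_dot = e2, e2_dot = u (disturbance ignored here).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[2.0]])
P = np.array([[2.0, 1.0], [1.0, 2.0]])   # illustrative positive-definite critic guess
e = np.array([1.0, 0.5])

def hamiltonian(u):
    """H(e, u, ∇V) for V = e^T P e, so ∇V = 2 P e."""
    gradV = 2.0 * P @ e
    return float(gradV @ (A @ e + B @ u) + e @ Q @ e + u @ R @ u)

# Stationarity: u* = -(1/2) R^{-1} ∂V/∂e2 = -(1/2) R^{-1} B^T (2 P e).
u_star = -0.5 * np.linalg.solve(R, B.T @ (2.0 * P @ e))
print(hamiltonian(u_star) <= hamiltonian(u_star + 0.1),
      hamiltonian(u_star) <= hamiltonian(u_star - 0.1))
```

Because the Hamiltonian is strictly convex in u for positive definite R, u* beats any perturbation in either direction, which the printed comparisons confirm.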
Treat $\Delta$ in (13) as an extended state:
$$x_2 = e_2, \qquad x_3 = \Delta.$$
Then, $\dot{x}_2 = u + x_3$ and $\dot{x}_3 = \dot{\Delta}$. A second-order ESO is chosen as
$$\dot{\hat{x}}_2 = u + \hat{x}_3 + \beta_1 (x_2 - \hat{x}_2), \qquad \dot{\hat{x}}_3 = \beta_2 (x_2 - \hat{x}_2),$$
with parameters $\beta_1, \beta_2 > 0$, e.g., chosen via the bandwidth parameterisation $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^{2}$.
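A minimal implementation of such a second-order ESO is sketched below for the scalar template $\dot{e}_2 = u + \Delta$, using the bandwidth parameterisation $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^2$ as one common gain choice; the disturbance value and bandwidth are illustrative.

```python
import numpy as np

def eso_step(x2_hat, d_hat, e2_meas, u, omega0, dt):
    """Second-order ESO for e2_dot = u + Delta: estimates the state e2 and the
    lumped disturbance Delta. Bandwidth gains: beta1 = 2*w0, beta2 = w0^2."""
    err = e2_meas - x2_hat
    x2_hat += dt * (u + d_hat + 2.0 * omega0 * err)
    d_hat += dt * (omega0 ** 2 * err)
    return x2_hat, d_hat

# Plant: e2_dot = u + Delta with constant unknown Delta = 0.7 and u = 0.
dt, omega0 = 0.001, 20.0
x2_true, x2_hat, d_hat = 0.0, 0.0, 0.0
for _ in range(5000):                      # 5 s of simulation
    x2_true += dt * (0.0 + 0.7)
    x2_hat, d_hat = eso_step(x2_hat, d_hat, x2_true, 0.0, omega0, dt)
print(abs(d_hat - 0.7) < 1e-3)  # True: the estimate converges to the disturbance
```

For a constant disturbance, the observer error dynamics have a double pole at $-\omega_0$, so the estimate converges with time constant $1/\omega_0$.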
Use the compensated input
$$u = u_0 - \hat{x}_3.$$
Then, one obtains
$$\dot{e}_2 = u_0 + \tilde{\Delta}, \qquad \tilde{\Delta} = \Delta - \hat{x}_3.$$
Let $\phi(\cdot)$ be a bounded basis-function vector. Define critic and actor weights $\hat{W}_c$ and $\hat{W}_a$ so that the critic parameterises the value-gradient and the actor parameterises the policy. Choose the critic approximation of the value-gradient as a linear combination of the basis functions with weights $\hat{W}_c$, where the adaptation gains $\Gamma_c$ and $\Gamma_a$ are diagonal positive definite matrices. Implement the optimised nominal input
$$u_0 = -\tfrac{1}{2}\, R^{-1}\, \hat{W}_a^{\top} \phi,$$
and use the online updates (23) and (24) with a positive coupling design parameter.
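The weight-matching coupling idea can be illustrated schematically: a critic performs a normalised gradient-type update on a consistency signal, while the actor is pulled toward the critic. This is a simplified stand-in for the update laws (23) and (24), with placeholder signals and gains, intended only to show that the coupling contracts the actor–critic mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Wc = rng.normal(size=n)          # critic weights
Wa = rng.normal(size=n)          # actor weights
kappa, gamma_c, dt = 5.0, 0.5, 0.01   # illustrative coupling/learning gains

mismatch0 = np.linalg.norm(Wa - Wc)
for _ in range(2000):
    phi = rng.normal(size=n)                    # shared regressor sample
    td_like = float(phi @ Wc)                   # placeholder consistency signal
    # Normalised gradient-type critic update (schematic, not Eq. (23) itself).
    Wc = Wc - dt * gamma_c * td_like * phi / (1.0 + phi @ phi)
    # Weight-matching coupling: pull the actor toward the critic.
    Wa = Wa - dt * kappa * (Wa - Wc)
mismatch1 = np.linalg.norm(Wa - Wc)
print(mismatch1 < mismatch0)  # True: the mismatch energy decreases
```

The mismatch obeys a contraction with rate set by the coupling gain, so its norm shrinks even while the critic is still adapting — the mechanism Lemma 2 formalises for the actual update laws.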
Remark 3. The update laws in Equations (23) and (24) are designed as a Lyapunov-guided actor–critic adaptation mechanism for the proposed ESO-based robust-control framework, rather than a standard TD Bellman-residual-gradient deep reinforcement learning update. This control-oriented design is adopted to maintain closed-loop boundedness and stable parameter adaptation under disturbances and actuator constraints. In this context, the actor–critic coupling term is used as a consistency regularisation term, instead of being interpreted as a standalone Bellman-optimal learning rule. Using the HJB equation and its approximation, we define an optimality-related residual by evaluating the approximate Hamiltonian along the closed-loop trajectory. In the present control-oriented design, this residual is introduced as a diagnostic indicator to interpret the subsequent stationarity and weight-matching construction under the adopted approximation. Since Equations (21) and (22) are designed as Lyapunov-guided consistency regularisation rather than TD-error-driven learning, we do not claim that this residual is explicitly minimised or driven to zero by the update law.
Remark 4. The stationarity condition is imposed as a tractable surrogate consistency condition for the parameterised policy with respect to the approximate Hamiltonian constructed using the critic approximation. It provides a convenient direction to promote actor–critic parameter consistency and facilitate Lyapunov analysis. It should be emphasised that this stationarity condition for the approximate Hamiltonian does not imply the pointwise satisfaction of the true HJB equation nor guarantee global optimality. In the present design, the coupled actor–critic updates in Equations (21) and (22) are adopted for Lyapunov-compatible online adaptation within the ESO-based robust control framework.
3. Stability Analysis
Assumption 4. The basis functions used in the actor–critic parameterisation are generated by Gaussian RBFs with fixed centers and widths; hence, $\phi(\eta)$ is uniformly bounded in the operating region and there exists a constant $\phi_{\max} > 0$ such that $\|\phi(\eta)\| \le \phi_{\max}$ for all η in this region. Moreover, the actor and critic weights are assumed to evolve in compact sets and remain bounded, that is, there exist constants $W_{a,\max}$ and $W_{c,\max}$ such that $\|\hat{W}_a(t)\| \le W_{a,\max}$ and $\|\hat{W}_c(t)\| \le W_{c,\max}$ for all t. This boundedness is consistent with the coupled update structure and can also be enforced by standard projection-based constrained adaptation when needed.
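A Gaussian RBF basis with fixed centers satisfies the uniform bound of Assumption 4 by construction, since each entry lies in (0, 1] and hence $\|\phi\| \le \sqrt{N}$ for N basis functions. The centers and width below are illustrative placeholders.

```python
import numpy as np

def rbf_basis(eta, centers, width):
    """Gaussian RBF basis with fixed centers and a shared width; each entry
    lies in (0, 1], so ||phi|| <= sqrt(N) uniformly over the whole state space."""
    diffs = eta[None, :] - centers            # shape (N, dim)
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * width ** 2))

# Illustrative 3x3 grid of centers on a planar operating region.
centers = np.array([[x, y] for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])
phi_max = np.sqrt(len(centers))               # uniform bound, N = 9 here
worst = max(np.linalg.norm(rbf_basis(np.array(p), centers, 0.5))
            for p in [(-2.0, 3.0), (0.0, 0.0), (5.0, -5.0)])
print(worst <= phi_max)  # True: the basis norm never exceeds sqrt(N)
```

This is why fixed-center RBF parameterisations are convenient for the Lyapunov analysis: the regressor bound holds globally without any projection.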
This section establishes the boundedness of the overall ESO–actor–critic closed loop. The analysis proceeds in three steps: (i) show that the ESO yields a uniformly ultimately bounded (UUB) estimation error for the lumped uncertainty, (ii) show that the critic and actor weight dynamics remain bounded under the chosen update laws, and (iii) combine these properties into a composite Lyapunov argument that produces an explicit ultimate tracking-error bound. Denote the estimation errors $\tilde{x}_2 = x_2 - \hat{x}_2$ and $\tilde{x}_3 = x_3 - \hat{x}_3$; the observer error dynamics then follow from (17) and (18).
Lemma 1. Under Assumptions 1, 2, and 4 and under bounded closed-loop signals, there exists a constant $d_{\max} > 0$ such that $\|\dot{\Delta}(t)\| \le d_{\max}$ for all t.
This follows because $\Delta$ depends on bounded kinematic and dynamic states, bounded disturbances, and bounded control input, and its time derivative depends on bounded derivatives of these signals. In particular, the control input is bounded due to actuator saturation and the basis functions have bounded values in the operating region; hence, all terms contributing to $\dot{\Delta}$ remain bounded, which yields the stated bound.
Let the observer Lyapunov function candidate be a quadratic form in the estimation errors $(\tilde{x}_2, \tilde{x}_3)$. Differentiating it along the observer error dynamics yields a dissipation inequality for any sufficiently large observer bandwidth. Hence, the estimation errors $(\tilde{x}_2, \tilde{x}_3)$ are UUB. In particular, there exist finite constants $T_0 > 0$ and $\bar{\delta} > 0$ such that $\|\tilde{x}_3(t)\| \le \bar{\delta}$ for all $t \ge T_0$.
Next, consider the parameter estimation dynamics induced by the coupled actor–critic updates. By standard arguments for gradient-type updates with a shared regressor $\phi$, the actor–critic weight mismatch satisfies a contraction-type differential relation in the idealised approximation setting, which immediately implies a non-increasing quadratic storage function. Consequently, the mismatch is bounded and non-increasing in norm, and the associated mismatch energy is bounded for all time, as formalised in the following lemma.
Lemma 2. Define the actor–critic weight mismatch $\tilde{W} = \hat{W}_a - \hat{W}_c$ and the mismatch energy $P(t) = \|\tilde{W}(t)\|^{2}$. Under the coupled updates (21) and (22), the mismatch dynamics satisfies a contraction relation, and hence $P(t)$ satisfies (32). Therefore, $P(t)$ is non-increasing and $\tilde{W}(t)$ is bounded for all $t \ge 0$.
Remark 5. Lemma 2 shows that Equation (24), together with (21) and (22), enforces a monotonic decrease in the actor–critic mismatch energy P(t). This promotes parameter consistency under the adopted approximation and serves as a Lyapunov-guided consistency regularisation. However, this alone does not imply a pointwise satisfaction of the HJB equation nor exact policy optimality. Hence, the stationarity and weight-matching construction are interpreted as an HJB-inspired surrogate rather than a proof of exact HJB optimality. If the finite excitation condition in Assumption 3 holds, then Proposition 1 further implies that $\tilde{W}(t)$ converges to zero; otherwise, only boundedness and mismatch energy dissipation are guaranteed.
Proposition 1. Assume that the finite excitation condition of Assumption 3 holds, i.e., there exist $T_e > 0$ and $\kappa_e > 0$ such that $\int_{t}^{t+T_e} \phi(s)\phi(s)^{\top} ds \ge \kappa_e I$. Then, $P(t) \to 0$ and hence $\tilde{W}(t) \to 0$.
Choose a scalar $\alpha > 0$ and define the composite storage as a weighted sum of the tracking, observer, and mismatch storage functions. Using Assumption 4, we obtain the global quadratic bounds in (35). Now, define the full Lyapunov candidate $V$ in (34).
Remark 6. The coupling parameter α is introduced to shape the quadratic bounds of $V$ and to enlarge the feasibility margin when the ESO and learning transients coexist. In practice, a smaller α yields a more conservative but more robust bound, while a larger α can improve transient response at the expense of reduced robustness margin.
The candidate $V$ is positive definite and radially unbounded in the composite error because it is quadratically bounded by (35).
Differentiating (34) yields (37). Using eigenvalue bounds gives (38), and applying Young's inequalities with tunable positive scalars to the cross terms gives (39). Substituting (38) and (39) into (37) yields (40). Define the damping coefficients in (41) and choose the gains and Young scalars such that these coefficients are positive. Using Assumption 4 and (33), for any $\epsilon > 0$, it follows that (42) holds; similarly, one obtains (43). Substituting (42) and (43) into (40) gives (44). Define the tightened constants in (45) and choose the residual scalars sufficiently small so that the tightened constants remain positive. From (32), the mismatch energy is non-increasing. Combining (44) with (30) yields the dissipation inequality (47). For $t \ge T_0$, $\|\tilde{x}_3(t)\| \le \bar{\delta}$ holds from the ESO UUB property. Therefore, for all $t \ge T_0$, (47) holds, where one can choose the decay rate λ as in (48) and the explicit ultimate-bound constant ρ as in (49).
Theorem 1. Assume that Assumptions 1, 2, and 4 hold, and that the controller and observer learning gains are selected such that the tightened constants in (45) are positive. Let λ and ρ be defined in (48) and (49), and let $t \ge T_0$. Then, all closed-loop signals remain bounded. Moreover, for all $t \ge T_0$, the composite Lyapunov function satisfies the comparison solution (51), and the tracking errors are semi-globally uniformly ultimately bounded with the explicit ultimate bound (52).
Proof. For $t \ge T_0$, the ESO property yields $\|\tilde{x}_3(t)\| \le \bar{\delta}$, so the dissipation inequality (47) holds with the constants λ and ρ in (48) and (49). The remaining steps follow by standard comparison arguments, as detailed below. Using (35), we have the quadratic sandwich bound (50). Hence, (47) implies the standard comparison inequality
$$\dot{V} \le -\lambda V + \rho,$$
because the negative term in (47) dominates the state part; λ can be conservatively selected according to (48). Therefore,
$$V(t) \le V(T_0)\, e^{-\lambda (t - T_0)} + \frac{\rho}{\lambda}\left(1 - e^{-\lambda (t - T_0)}\right),$$
and the explicit ultimate bound (52) follows. Hence, the closed-loop tracking errors are SGUUB, and all signals remain bounded. □
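The comparison argument in the proof can be illustrated numerically: any storage function obeying $\dot{V} \le -\lambda V + \rho$ decays into the ultimate bound $\rho / \lambda$. The constants below are illustrative, not the values (48) and (49).

```python
lam, rho, dt = 2.0, 0.5, 0.001
V = 10.0                                     # large initial Lyapunov value
history = []
for _ in range(10000):                       # 10 s of the worst-case inequality
    V += dt * (-lam * V + rho)               # Euler step of V_dot = -lam*V + rho
    history.append(V)
ultimate = rho / lam                         # explicit ultimate bound rho/lambda
print(history[-1] <= ultimate + 1e-6)        # True: V has entered the bound
```

The transient decays at rate λ, so after a few time constants the storage sits at ρ/λ regardless of the (semi-global) initial condition, which is exactly the SGUUB behaviour claimed by Theorem 1.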
Remark 7. To avoid confusion with Lyapunov-drift-based single-slot optimisation methods, we emphasise that the Lyapunov function used in this paper is introduced only for the stability analysis of the closed-loop ESO actor–critic system. The long-term control objective is still defined by the infinite-horizon cost in Equation (14), and the actor–critic module performs HJB-inspired online policy refinement for this objective. The Lyapunov function is then used to establish boundedness and the SGUUB property of the tracking, observer, and learning dynamics under disturbances and actuator constraints.
Remark 8. The proposed ESO learning framework is fundamentally different from generic deep reinforcement learning methods for MEC optimisation [25]. Those methods use deep reinforcement learning as the main decision engine for slot-based resource allocation, whereas our method targets continuous-time nonlinear motion control, where the ESO provides the stabilising robust baseline and learning serves as a secondary performance-refinement mechanism. The implementation of the ESO-enhanced actor–critic RL-optimised 3-DOF trajectory tracking is summarised in Algorithm 1.
| Algorithm 1. ESO-enhanced actor–critic RL-optimised trajectory tracking (3-DOF). |
- 1: Initialise: choose the controller gains; choose the ESO bandwidth $\omega_0$ and set $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^{2}$; initialise the actor and critic weights.
- 2: for each control step t do
- 3: Measure the state and compute the references $\eta_d$ and $\dot{\eta}_d$.
- 4: Compute the errors $e_1$ and $e_2$.
- 5: Integrate the ESO (17) and (18) to obtain $\hat{x}_3$.
- 6: Compute the basis vector $\phi$.
- 7: Compute the actor output $u_0$.
- 8: Apply the compensated equivalent input $u = u_0 - \hat{x}_3$.
- 9: Update the critic by integrating (23).
- 10: Update the actor by integrating (24).
- 11: Compute the body-frame force and moment and allocate the thrusters.
- 12: end for
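To make the loop structure of Algorithm 1 concrete, the sketch below runs the ESO and a compensated input on the scalar template $\dot{e}_1 = e_2$, $\dot{e}_2 = u + \Delta$. A fixed stabilising state feedback stands in for the learned actor output $u_0$, and all gains and the disturbance are illustrative placeholders.

```python
import numpy as np

dt, omega0 = 0.001, 30.0
k1, k2 = 4.0, 4.0                       # illustrative feedback gains (actor stand-in)
e1, e2 = 1.0, 0.0                       # initial tracking errors (offset start)
x2_hat, d_hat = 0.0, 0.0                # ESO states

for step in range(20000):               # 20 s
    t = step * dt
    delta = 0.5 + 0.3 * np.sin(0.5 * t)  # slowly varying lumped disturbance
    u0 = -k1 * e1 - k2 * e2              # stand-in for the actor output (step 7)
    u = u0 - d_hat                       # compensated equivalent input (step 8)
    # Plant: the strict-feedback template (13).
    e1 += dt * e2
    e2 += dt * (u + delta)
    # ESO integration (step 5) with beta1 = 2*w0, beta2 = w0^2.
    err = e2 - x2_hat
    x2_hat += dt * (u + d_hat + 2.0 * omega0 * err)
    d_hat += dt * (omega0 ** 2 * err)

print(abs(e1) < 0.02, abs(d_hat - delta) < 0.02)  # small residual error and estimate lag
```

Even with a slowly varying disturbance, the residual tracking error stays inside a small tube because the ESO absorbs the dominant load and the feedback only handles the estimation residual — the decomposition philosophy of the proposed framework.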
4. Simulation Studies
This section provides a reproducible simulation protocol and a theory-aligned ablation study. The uncertainty and the environmental loads are generated continuously by wind–wave–current shaping filters, which is consistent with Assumption 1. To avoid a trivial initial condition, the vessel starts from an offset position that does not lie on the desired circle, and both the start and terminal points are explicitly marked in the trajectory plots. The vessel model parameters are set to those of the CyberShip II model vessel, and the simulations were executed in MATLAB R2025a.
A circular reference trajectory is designed with a constant radius and a constant angular rate. The environmental disturbance $d(t)$ is generated by wind–wave–current filters: the steady current-induced bias is modelled as a constant offset; the wind gust is generated by a low-frequency first-order filter; and the wave-frequency load is produced by a second-order filter for each channel. The disturbance filter parameters are shown in Table 1.
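The shaping-filter construction can be sketched as follows; the bias, gust, and wave parameters below are illustrative placeholders rather than the Table 1 values.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.01, 600.0
n = int(T / dt)

bias = 0.3                                   # steady current-induced bias
tau_w, sigma_w = 5.0, 0.2                    # wind-gust first-order filter
omega_e, zeta, sigma_s = 0.8, 0.1, 0.5       # wave-frequency second-order filter

gust, wave, wave_dot = 0.0, 0.0, 0.0
d_log = np.empty(n)
for k in range(n):
    w1, w2 = rng.normal(), rng.normal()
    # First-order low-pass gust driven by white noise (Euler–Maruyama step).
    gust += dt * (-gust / tau_w) + np.sqrt(dt) * sigma_w * w1
    # Second-order wave filter: x'' + 2*zeta*w*x' + w^2*x = sigma * noise.
    wave_dot += dt * (-2*zeta*omega_e*wave_dot - omega_e**2 * wave) \
                + np.sqrt(dt) * sigma_s * w2
    wave += dt * wave_dot
    d_log[k] = bias + gust + wave
print(np.all(np.isfinite(d_log)))  # the generated load is bounded and persistent
```

Summing a constant bias, a low-frequency gust, and a narrow-band wave component yields a continuous, bounded disturbance with bounded derivative, consistent with Assumption 1.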
All compared controllers share the same plant, the same reference trajectory, and the same actuator constraints. Thus, differences in performance are attributable to the control law rather than feasibility handling. We compare four methods in an ablation study: (i) nominal feedback (no ESO/RL); (ii) ESO-only; (iii) RL-only, with the updates (23) and (24); and (iv) the proposed control design. The following performance metrics are reported over the evaluation window: (i) the RMS position error, (ii) the RMS yaw error, and (iii) the control energy. The controller settings are shown in Table 2.
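The reported metrics can be computed from logged signals as follows; the helper names and the toy inputs are illustrative.

```python
import numpy as np

def rms_position_error(e_xy):
    """RMS of the planar position-error norm over the evaluation window."""
    return float(np.sqrt(np.mean(np.sum(np.asarray(e_xy) ** 2, axis=1))))

def rms_yaw_error(e_psi):
    """RMS yaw error with wrapping to (-pi, pi] so signed offsets are handled."""
    e = (np.asarray(e_psi) + np.pi) % (2.0 * np.pi) - np.pi
    return float(np.sqrt(np.mean(e ** 2)))

def control_energy(u_log, dt):
    """Integrated control energy: sum_k ||u_k||^2 * dt."""
    return float(np.sum(np.sum(np.asarray(u_log) ** 2, axis=1)) * dt)

e_xy = [[0.3, 0.4], [0.0, 0.0]]                # error norms 0.5 and 0.0
print(rms_position_error(e_xy))                # sqrt((0.25 + 0) / 2) ≈ 0.354
print(control_energy([[1.0, 0.0, 0.0]], 0.1))  # one sample: 1.0^2 * 0.1 = 0.1
```

Wrapping the yaw error before squaring matters: an error of $\pi + 0.1$ rad should count as a near-$\pi$ misalignment, not a value larger than $\pi$.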
The ablation results in Table 3 indicate that the dominant robustness improvement in this representative case is provided by the ESO-based disturbance estimation, while the actor–critic component mainly serves as a Lyapunov-guided consistency regularisation and secondary refinement mechanism within the proposed control-oriented framework. Therefore, the numerical margin between ESO-only and ESO plus actor–critic may be modest under the reported disturbance realisation, and we avoid interpreting it as a statistically significant gain without repeated-run dispersion measures.
Figure 1 shows the 2-D trajectory-tracking performance for the circular reference under the same disturbance realisation. All controllers converge to the vicinity of the desired orbit after a short transient caused by the initial offset. However, the proposed method exhibits the tightest overlap with the reference curve and the smallest visible deviation along the orbit, indicating superior steady-state disturbance rejection under persistent environmental loads. In particular, compared with the ESO-only and RL-only baselines, the proposed controller reduces residual drift and maintains a smaller tracking tube around the reference circle. This visual trend is consistent with the intention of using the ESO to handle the dominant disturbance rejection and using the actor–critic coupling to provide a stable refinement around the robust baseline. To highlight the steady tracking regime, the zoomed view in Figure 2 shows that the proposed method achieves the smallest residual error level and mitigates long-term drift most effectively under the same disturbance realisation and actuator constraints. The proposed controller also achieves the fastest decay of the position tracking error after the initial transient: the transient part reflects how quickly the controller recovers from the initial offset, whereas the steady part reflects the residual tracking tube under persistent disturbances, and the proposed method yields a smoother and lower steady error envelope in this representative case. The heading tracking performance is shown in Figure 3. The proposed method maintains a smaller and smoother yaw-error response, which contributes to the reduced lateral deviation on the circular path and prevents error accumulation caused by heading misalignment. Note that a small negative steady value of the yaw tracking error in Figure 3 indicates a slight signed offset rather than instability; such a residual bias is consistent with the SGUUB property under persistent disturbances and actuator constraints.
Figure 4 plots the three-channel commanded input of the proposed controller. The input signals exhibit a larger but short-lived transient effort to recover from the initial offset, followed by a bounded and smooth steady regime. Importantly, the improved tracking accuracy of the proposed method is not achieved by persistent saturation or excessively aggressive actuation; instead, it results from the complementary compensation structure that combines ESO disturbance estimation and the consistency regularisation in the actor–critic adaptation. To further quantify the visual results, the position error norm and the RMS indices are reported in Figure 5 and Table 3. The quantitative results in Table 3 report the tracking metrics for the representative disturbance realisation used in this study; the values are presented to illustrate qualitative performance trends rather than statistical significance. In the reported no-event wind–wave–current case, the proposed ESO+RL controller attains the smallest overall tracking error and the lowest RMS position error while keeping the commanded inputs bounded. The proposed method thus achieves the best tracking performance among the compared methods in this representative case, while the margin over the ESO-only baseline is modest.
Figure 5 summarises the tracking performance indicators for the compared methods in this representative disturbance case. The proposed method attains the smallest overall error level among the compared controllers, while the difference between ESO-only and ESO plus actor–critic remains modest. This supports the interpretation that the ESO contributes the primary robustness gain and the actor–critic coupling provides a secondary refinement under the reported scenario.
Figure 6 reports the evolution of the actor and critic weight norms. Compared with RL-only, the proposed architecture yields a markedly smaller and more stationary critic weight norm, indicating that the ESO and input-effectiveness adaptation reduce the residual uncertainty seen by the RL layer, thus improving learning stability and accelerating convergence. The trajectories indicate that the coupled update law provides a bounded and stable parameter adaptation process, which is consistent with the Lyapunov-guided consistency-regularisation interpretation of Equations (21) and (22). In this manuscript, we do not interpret these curves as evidence of TD-error-driven optimal learning, but as evidence of stable online adaptation within the ESO-based robust control framework. Overall, the results in Figures 1–6 demonstrate that the proposed controller provides superior tracking accuracy under persistent disturbances in a no-event scenario, while maintaining feasible and bounded control inputs under identical actuator constraints.
The results indicate that the ESO module provides the dominant disturbance rejection capability in this scenario, and the actor–critic coupling mainly serves as a Lyapunov-guided consistency regularisation and online refinement component rather than the primary source of robustness. Therefore, the RL-only baseline can be close to the nominal controller and the additional benefit of ESO plus RL over ESO-only may be modest under the reported disturbance realisation.