1. Introduction
The accurate trajectory tracking of marine surface vessels is a foundational capability for autonomous navigation, offshore operations, and safety-critical station-keeping tasks [1,2]. In practical deployments, tracking performance is challenged by a combination of (i) unmodelled and time-varying hydrodynamic effects, (ii) stochastic environmental loads induced by wind, waves, and currents, and (iii) strict actuation constraints arising from limited thruster authority and allocation feasibility. These factors can jointly induce large tracking errors, severe input saturation, and even qualitative trajectory distortion when the control law and the physical actuation model are not formulated in a coordinate-consistent manner. Thus, purely model-based designs may suffer from performance degradation when the nominal model is inaccurate, whereas aggressive robust designs may achieve stability at the expense of excessive control effort.
From a modelling perspective, the planar 3-degree-of-freedom (3-DOF) vessel dynamics constitute a coupled multi-input multi-output (MIMO) nonlinear system in which the kinematics are naturally expressed in the inertial frame while the generalised forces and moment are applied in the body-fixed frame. This inherent frame mismatch complicates both controller synthesis and theoretical analysis [3]. A common approach is to design tracking controllers via backstepping or strict-feedback transformations, which can provide systematic stability proofs when the system is cast into a suitable cascade form. However, when environmental disturbances and unmodelled dynamics are significant, purely model-based designs often become conservative to preserve robustness, or require increasingly complex adaptation laws [4]. Moreover, the presence of actuator bounds and allocation constraints can invalidate unconstrained stability arguments, since saturation introduces additional nonlinearities and may render a reference trajectory infeasible [5]. A prominent model-light route to robustness is active disturbance rejection control (ADRC), whose core component is the extended state observer (ESO) that estimates the total disturbance—including unmodelled dynamics and external perturbations—in real time [6,7,8]. Systematic tuning principles based on scaling and bandwidth parameterisation further popularised ADRC in engineering practice by explicitly linking observer and controller bandwidth to disturbance rejection and noise sensitivity [9,10]. In marine robotics, ESO-type disturbance rejection has been adopted for USV trajectory tracking to mitigate ocean environmental loads and parameter uncertainty, and recent studies have demonstrated its practicality in simulation and field experiments [11]. Nevertheless, ESO-based tracking designs are typically not optimal in an explicit sense; moreover, under hard input bounds, tuning purely for disturbance rejection may increase peak thrust demand and saturation rate, which can slow down recovery and distort the intended path.
Learning-based control methods, particularly reinforcement learning (RL), offer a complementary direction by improving performance through online policy optimisation [12,13]. Actor–critic structures are especially attractive for continuous-control problems, as they can approximate optimal policies and value functions without requiring an exact analytic solution of the Hamilton–Jacobi–Bellman (HJB) equation [14]. Nevertheless, vanilla actor–critic schemes may exhibit policy drift, sensitivity to stochastic disturbances, and degraded transient behavior when deployed on systems with severe uncertainty and hard actuation limits. In marine applications, these issues are amplified by persistent environmental loads and the coupling between translational and rotational motions [15]. Therefore, a key open problem is how to integrate learning with robust disturbance rejection in a way that (a) preserves coordinate consistency between theory and implementation, (b) remains stable under wind–wave–current disturbances, and (c) is feasible under realistic actuator constraints. For continuous-time nonlinear systems, actor–critic learning—closely related to adaptive dynamic programming (ADP)—connects directly to optimal control through policy iteration, temporal-difference learning, and Hamiltonian residual minimisation, enabling online performance improvement without requiring an explicit solution of the HJB partial differential equation [16]. In the marine domain, deep RL has been explored for path-following and trajectory tracking under uncertainty [17], and very recent work has started to incorporate input saturation characteristics into RL-based USV optimal tracking formulations [18]. However, pure RL controllers can be sensitive to unmodelled dynamics and external disturbances, and stability guarantees often become delicate when exploration, approximation errors, and actuator constraints coexist. Recent studies in other networked and cyber-physical domains also indicate a growing trend toward hybrid frameworks that combine structured optimisation components with learning-based adaptation for dynamic multi-objective decision-making [19]. This motivates a principled integration of disturbance observers with RL so that learning takes place around a robust baseline and focuses on optimal refinement rather than disturbance cancellation.
To highlight the motivation, we briefly compare this work with representative related approaches. Model-based robust tracking controllers provide stability guarantees but may become conservative under strong uncertainty and persistent disturbances, especially when actuator constraints are active. ESO- or ADRC-based designs offer practical disturbance rejection by estimating lumped uncertainties, but they do not explicitly address learning-oriented performance refinement. In contrast, many DRL frameworks in other domains, such as multi-agent DRL and multi-DNN DRL for MEC or SD-WAN optimisation, are primarily designed for slot-based decision problems, where DRL serves as the main decision engine for resource allocation and does not aim to provide Lyapunov-style closed-loop stability guarantees for continuous-time nonlinear dynamics. The present manuscript targets safety-critical continuous-time marine motion control and adopts a hybrid decomposition philosophy, in which the ESO-based controller provides the stabilising robust baseline and the actor–critic component is incorporated as a Lyapunov-guided consistency regularisation and secondary online refinement mechanism. This comparison clarifies the research gap and motivates the proposed ESO-enhanced actor–critic framework.
Motivated by these challenges, this paper develops a unified control framework that couples an ESO with an actor–critic RL module for the constrained trajectory tracking of 3-DOF vessels. The ESO is used to estimate and compensate lumped uncertainties, including wind–wave–current loads and unmodelled dynamics, thereby providing a robust baseline that stabilises the learning process. On top of this robustification, an actor–critic component refines the control policy toward improved performance, such as reduced tracking error and control energy.
To strengthen the mathematical consistency of learning, we introduce an HJB-consistency mechanism based on a computable residual metric and a weight-matching coupling between the actor and critic updates. This coupling aligns the learning dynamics with the optimality structure and mitigates uncontrolled weight growth and oscillatory behavior. Importantly, the overall design is formulated in a coordinate-consistent strict-feedback tracking structure, explicitly mapping inertial-frame tracking errors to physically applied body-fixed forces and moment, and incorporating actuator saturation and optional thruster allocation constraints to ensure feasibility.
The proposed framework is validated on a standard 3-DOF vessel benchmark under stochastic wind–wave–current disturbances. Beyond reporting tracking trajectories, we adopt reproducible evaluation metrics that capture both control performance and learning consistency, such as integrated tracking error, control energy, saturation rate, HJB residual, and actor–critic weight mismatch norms. In addition, systematic ablation studies are provided to quantify the complementary roles of robust disturbance estimation and learning-based optimisation.
The main methodological contributions of this paper are summarised as follows:
We integrate ESO-based lumped-disturbance estimation with actor–critic policy improvement within a single closed-loop architecture suitable for 3-DOF vessel tracking. Compared with ESO-only and disturbance-sensitive RL-only designs, the proposed integration typically yields smaller tracking errors with lower control effort under environmental loads [20,21].
Compared with standard TD-error-driven actor–critic schemes commonly used in the RL literature [22,23], the proposed weight-matching coupling is introduced as a Lyapunov-guided consistency regularisation within a control-oriented framework, which helps to suppress parameter drift and improves robustness under disturbances and actuator constraints.
While most RL controllers operate as black boxes, this work guarantees SGUUB stability with a tracking error bound explicitly tied to the ESO residual. This provides a transparent mechanism for performance tuning, linking the observer bandwidth directly to the achievable control precision, and thus offers a reliability that standard heuristic RL methods lack.
The remainder of the paper is organised as follows. Section 2 presents the vessel model, disturbance representation, and coordinate-consistent strict-feedback tracking error system. Section 3 develops the ESO design, the actor–critic learning law with HJB-consistency, and the overall constrained control synthesis; it provides the closed-loop stability analysis and establishes boundedness and ultimate tracking performance guarantees. Section 4 reports simulation results and ablation studies under wind–wave–current disturbances and actuator limits. Finally, Section 5 concludes the paper and outlines future research directions.
2. Vessel Model and Tracking Objective
Consider the 3-DOF vessel dynamics [24]
$$\dot{\eta} = R(\psi)\,\nu, \qquad M\dot{\nu} + C(\nu)\,\nu + D(\nu)\,\nu = \tau + d(t),$$
where $\eta$ is the earth-fixed position and heading, $\nu$ is the body-fixed velocity, $R(\psi)$ is the planar rotation matrix, $M$ is the inertia matrix, $C(\nu)$ is the Coriolis term, $D(\nu)$ is the damping term, and $d(t)$ collects wind, wave, current, and unmodelled effects.
The control objective is designed as follows: given a smooth desired trajectory with bounded derivatives, design a control input such that the tracking errors converge to a small neighbourhood of zero, while maintaining reasonable control effort and robustness against uncertainty and disturbances.
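As a concrete reference point for the model above, the following Python sketch integrates the standard 3-DOF kinematics and dynamics with explicit Euler steps. The inertia and damping values are illustrative placeholders (not the CyberShip II parameters used later), and the Coriolis term is set to zero for simplicity.

```python
import numpy as np

def rot(psi):
    """Planar rotation R(psi) mapping body-fixed velocity to earth-fixed rates."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def vessel_step(eta, nu, tau, d, M, D, C_fn, dt):
    """One explicit-Euler step of eta_dot = R(psi) nu and
    M nu_dot + C(nu) nu + D nu = tau + d."""
    eta_dot = rot(eta[2]) @ nu
    nu_dot = np.linalg.solve(M, tau + d - C_fn(nu) @ nu - D @ nu)
    return eta + dt * eta_dot, nu + dt * nu_dot

# Illustrative (placeholder) parameters: diagonal inertia/damping, zero Coriolis.
M = np.diag([25.8, 33.8, 2.76])
D = np.diag([2.0, 7.0, 0.5])
C_fn = lambda nu: np.zeros((3, 3))

eta, nu = np.zeros(3), np.zeros(3)
for _ in range(10000):  # 100 s with dt = 0.01
    eta, nu = vessel_step(eta, nu, tau=np.array([2.0, 0.0, 0.0]),
                          d=np.zeros(3), M=M, D=D, C_fn=C_fn, dt=0.01)
print(round(nu[0], 2))  # steady surge velocity approaches tau_u / D_u = 1.0
```

With a constant surge force and linear damping, the surge velocity settles at the ratio of applied force to damping, which gives a quick sanity check of the integration.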
Define the earth-fixed tracking error
$$e_1 = \eta - \eta_d,$$
and the earth-fixed strict-feedback velocity error
$$e_2 = \dot{\eta} - \dot{\eta}_d.$$
Then, the strict-feedback first equation holds exactly:
$$\dot{e}_1 = e_2.$$
Differentiating (4), one obtains the earth-fixed acceleration error dynamics (6). Choose a constant nominal inertia matrix $M_0$ and define a nominal feedforward term from the nominal models $C_0(\nu)$ and $D_0(\nu)$. Implement the body-frame control with design input $u$; substituting (8) into (6) yields
$$\dot{e}_2 = u + \Delta,$$
where the known feedforward part collects measurable signals and the lumped uncertainty $\Delta$ is defined in (10).
Remark 1. The lumped uncertainty in Equation (10) contains an input-coupled mismatch term that is linear in u. To clarify well-posedness, we decompose it as
$$\Delta = \Gamma(t)\,u + \delta(t),$$
where $\Gamma(t)$ denotes the input-coupled mismatch gain and $\delta(t)$ collects the remaining rotation-induced coupling, model mismatch, and environmental disturbance terms. In implementation, the control input u is computed explicitly from measured and estimated signals and then mapped to the physical body frame with actuator saturation. Hence, the closed loop is a well-defined nonlinear ODE with saturation rather than an algebraic loop. We assume that $I + \Gamma(t)$ is uniformly nonsingular in the operating region, which is a mild small-gain-type condition ensuring the existence and uniqueness of solutions and preserving ESO convergence under the input-coupled uncertainty. Thus, the second-order uncertain template is
$$\dot{e}_1 = e_2, \qquad \dot{e}_2 = u + \Delta.$$
Remark 2. The strict-feedback template in (13) is introduced for the synthesis of the actor–critic learning and observer-based compensation. In implementation, the virtual input u is mapped back to the physical body-frame force and moment through the kinematics and thruster allocation, so actuator constraints can be enforced without modifying the learning law.
Assumption 1. The lumped disturbance is bounded and has a bounded derivative, i.e., there exist constants $\Delta_{\max} > 0$ and $d_{\max} > 0$ such that $\|\Delta(t)\| \le \Delta_{\max}$ and $\|\dot{\Delta}(t)\| \le d_{\max}$.
Assumption 2. The reference trajectory is twice continuously differentiable and bounded, together with its derivatives, i.e., $\eta_d$, $\dot{\eta}_d$, and $\ddot{\eta}_d$ are bounded.
Assumption 3. The regressor signal $\phi(t)$ satisfies a finite excitation condition over a sliding window: there exist $T_e > 0$ and $\kappa_e > 0$ such that
$$\int_{t}^{t+T_e} \phi(s)\,\phi(s)^{\top}\, ds \;\ge\; \kappa_e I.$$
This assumption is used to strengthen mismatch convergence statements, such as the convergence of the actor–critic weight mismatch to zero.
Define the infinite-horizon quadratic performance index
$$J = \int_{0}^{\infty} \left( e^{\top} Q\, e + u^{\top} R\, u \right) dt,$$
where $e = [e_1^{\top},\, e_2^{\top}]^{\top}$ and $Q = Q^{\top} > 0$, $R = R^{\top} > 0$. The quadratic form in (14) reflects a standard tracking-to-regulation conversion: the stacked error state $e$ is penalised to enforce accurate tracking, while u is penalised to avoid excessive virtual control action. This optimal-control viewpoint provides a principled way to couple learning with a stabilising baseline in the presence of persistent environmental disturbances.
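The performance index can be evaluated on logged closed-loop data by a simple Riemann sum. The sketch below assumes illustrative Q and R weights and a toy exponentially decaying error trajectory, so its value can be checked against the analytic integral.

```python
import numpy as np

def quadratic_cost(e_log, u_log, Q, R, dt):
    """Discretised index J ≈ sum_k (e_k^T Q e_k + u_k^T R u_k) dt on logged data."""
    J = 0.0
    for e, u in zip(e_log, u_log):
        J += float(e @ Q @ e + u @ R @ u) * dt
    return J

# Toy logs: a decaying error and the corresponding control signal.
t = np.arange(0.0, 10.0, 0.01)
e_log = [np.array([np.exp(-s), 0.0]) for s in t]
u_log = [np.array([-np.exp(-s)]) for s in t]
Q, R = np.diag([1.0, 1.0]), np.diag([0.5])
J = quadratic_cost(e_log, u_log, Q, R, 0.01)
print(round(J, 2))  # close to the analytic value 0.75 (slight Riemann-sum excess)
```

The analytic value is $\int_0^{10} 1.5\,e^{-2s}\,ds \approx 0.75$, which the left Riemann sum slightly overestimates.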
Let $V^{*}(e)$ be the optimal value function. The Hamiltonian is
$$H(e, u, \nabla V) = \nabla V^{\top} \dot{e} + e^{\top} Q\, e + u^{\top} R\, u.$$
At the optimum, the HJB equation implies $\min_{u} H(e, u, \nabla V^{*}) = 0$, and the optimal input satisfies the stationarity condition with respect to u. Since $V^{*}$ is unknown for the vessel tracking problem with uncertainty, we approximate it via a critic and compute an approximately optimal policy via an actor. Using (13), the stationary condition $\partial H / \partial u = 0$ yields
$$u^{*} = -\tfrac{1}{2}\, R^{-1}\, \frac{\partial V^{*}}{\partial e_2}.$$
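The stationarity condition can be checked numerically for a quadratic value-function guess: with the strict-feedback template, the minimiser of the Hamiltonian over u is $-\tfrac{1}{2} R^{-1}\, \partial V / \partial e_2$. The matrices below, including the critic guess P, are illustrative and not taken from the manuscript.

```python
import numpy as np

# Strict-feedback template (13): e1_dot = e2, e2_dot = u (disturbance ignored here).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[2.0]])
P = np.array([[2.0, 1.0], [1.0, 2.0]])   # illustrative positive-definite critic guess
e = np.array([1.0, 0.5])

def hamiltonian(u):
    """H(e, u, ∇V) for V = e^T P e, so ∇V = 2 P e."""
    gradV = 2.0 * P @ e
    return float(gradV @ (A @ e + B @ u) + e @ Q @ e + u @ R @ u)

# Stationarity: u* = -(1/2) R^{-1} ∂V/∂e2 = -(1/2) R^{-1} B^T (2 P e).
u_star = -0.5 * np.linalg.solve(R, B.T @ (2.0 * P @ e))
print(hamiltonian(u_star) <= hamiltonian(u_star + 0.1),
      hamiltonian(u_star) <= hamiltonian(u_star - 0.1))
```

Because the Hamiltonian is strictly convex in u for positive definite R, u* beats any perturbation in either direction, which the printed comparisons confirm.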
Treat $\Delta$ in (13) as an extended state:
$$x_2 = e_2, \qquad x_3 = \Delta.$$
Then, $\dot{x}_2 = u + x_3$ and $\dot{x}_3 = \dot{\Delta}$. A second-order ESO is chosen as
$$\dot{\hat{x}}_2 = u + \hat{x}_3 + \beta_1 (x_2 - \hat{x}_2), \qquad \dot{\hat{x}}_3 = \beta_2 (x_2 - \hat{x}_2),$$
with parameters $\beta_1, \beta_2 > 0$, e.g., chosen via the bandwidth parameterisation $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^{2}$.
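A minimal implementation of such a second-order ESO is sketched below for the scalar template $\dot{e}_2 = u + \Delta$, using the bandwidth parameterisation $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^2$ as one common gain choice; the disturbance value and bandwidth are illustrative.

```python
import numpy as np

def eso_step(x2_hat, d_hat, e2_meas, u, omega0, dt):
    """Second-order ESO for e2_dot = u + Delta: estimates the state e2 and the
    lumped disturbance Delta. Bandwidth gains: beta1 = 2*w0, beta2 = w0^2."""
    err = e2_meas - x2_hat
    x2_hat += dt * (u + d_hat + 2.0 * omega0 * err)
    d_hat += dt * (omega0 ** 2 * err)
    return x2_hat, d_hat

# Plant: e2_dot = u + Delta with constant unknown Delta = 0.7 and u = 0.
dt, omega0 = 0.001, 20.0
x2_true, x2_hat, d_hat = 0.0, 0.0, 0.0
for _ in range(5000):                      # 5 s of simulation
    x2_true += dt * (0.0 + 0.7)
    x2_hat, d_hat = eso_step(x2_hat, d_hat, x2_true, 0.0, omega0, dt)
print(abs(d_hat - 0.7) < 1e-3)  # True: the estimate converges to the disturbance
```

For a constant disturbance, the observer error dynamics have a double pole at $-\omega_0$, so the estimate converges with time constant $1/\omega_0$.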
Use the compensated input
$$u = u_0 - \hat{x}_3.$$
Then, one obtains
$$\dot{e}_2 = u_0 + \tilde{\Delta}, \qquad \tilde{\Delta} = \Delta - \hat{x}_3.$$
Let $\phi(\cdot)$ be a bounded basis-function vector. Define critic and actor weights $\hat{W}_c$ and $\hat{W}_a$ so that the critic parameterises the value-gradient and the actor parameterises the policy. Choose the critic approximation of the value-gradient as a linear combination of the basis functions with weights $\hat{W}_c$, where the adaptation gains $\Gamma_c$ and $\Gamma_a$ are diagonal positive definite matrices. Implement the optimised nominal input
$$u_0 = -\tfrac{1}{2}\, R^{-1}\, \hat{W}_a^{\top} \phi,$$
and use the online updates (23) and (24) with a positive coupling design parameter.
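The weight-matching coupling idea can be illustrated schematically: a critic performs a normalised gradient-type update on a consistency signal, while the actor is pulled toward the critic. This is a simplified stand-in for the update laws (23) and (24), with placeholder signals and gains, intended only to show that the coupling contracts the actor–critic mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Wc = rng.normal(size=n)          # critic weights
Wa = rng.normal(size=n)          # actor weights
kappa, gamma_c, dt = 5.0, 0.5, 0.01   # illustrative coupling/learning gains

mismatch0 = np.linalg.norm(Wa - Wc)
for _ in range(2000):
    phi = rng.normal(size=n)                    # shared regressor sample
    td_like = float(phi @ Wc)                   # placeholder consistency signal
    # Normalised gradient-type critic update (schematic, not Eq. (23) itself).
    Wc = Wc - dt * gamma_c * td_like * phi / (1.0 + phi @ phi)
    # Weight-matching coupling: pull the actor toward the critic.
    Wa = Wa - dt * kappa * (Wa - Wc)
mismatch1 = np.linalg.norm(Wa - Wc)
print(mismatch1 < mismatch0)  # True: the mismatch energy decreases
```

The mismatch obeys a contraction with rate set by the coupling gain, so its norm shrinks even while the critic is still adapting — the mechanism Lemma 2 formalises for the actual update laws.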
Remark 3. The update laws in Equations (23) and (24) are designed as a Lyapunov-guided actor–critic adaptation mechanism for the proposed ESO-based robust-control framework, rather than a standard TD Bellman-residual-gradient deep reinforcement learning update. This control-oriented design is adopted to maintain closed-loop boundedness and stable parameter adaptation under disturbances and actuator constraints. In this context, the actor–critic coupling term is used as a consistency regularisation term, instead of being interpreted as a standalone Bellman-optimal learning rule. Using the HJB equation and its approximation, we define an optimality-related residual by evaluating the approximate Hamiltonian along the closed-loop trajectory. In the present control-oriented design, this residual is introduced as a diagnostic indicator to interpret the subsequent stationarity and weight-matching construction under the adopted approximation. Since Equations (21) and (22) are designed as Lyapunov-guided consistency regularisation rather than TD-error-driven learning, we do not claim that this residual is explicitly minimised or driven to zero by the update law.
Remark 4. The stationarity condition is imposed as a tractable surrogate consistency condition for the parameterised policy with respect to the approximate Hamiltonian constructed using the critic approximation. It provides a convenient direction to promote actor–critic parameter consistency and facilitate Lyapunov analysis. It should be emphasised that this stationarity condition for the approximate Hamiltonian does not imply the pointwise satisfaction of the true HJB equation nor guarantee global optimality. In the present design, the coupled actor–critic updates in Equations (21) and (22) are adopted for Lyapunov-compatible online adaptation within the ESO-based robust control framework.
3. Stability Analysis
Assumption 4. The basis functions used in the actor–critic parameterisation are generated by Gaussian RBFs with fixed centers and widths; hence, $\phi(\eta)$ is uniformly bounded in the operating region and there exists a constant $\phi_{\max} > 0$ such that $\|\phi(\eta)\| \le \phi_{\max}$ for all η in this region. Moreover, the actor and critic weights are assumed to evolve in compact sets and remain bounded, that is, there exist constants $W_{a,\max}$ and $W_{c,\max}$ such that $\|\hat{W}_a(t)\| \le W_{a,\max}$ and $\|\hat{W}_c(t)\| \le W_{c,\max}$ for all t. This boundedness is consistent with the coupled update structure and can also be enforced by standard projection-based constrained adaptation when needed.
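A Gaussian RBF basis with fixed centers satisfies the uniform bound of Assumption 4 by construction, since each entry lies in (0, 1] and hence $\|\phi\| \le \sqrt{N}$ for N basis functions. The centers and width below are illustrative placeholders.

```python
import numpy as np

def rbf_basis(eta, centers, width):
    """Gaussian RBF basis with fixed centers and a shared width; each entry
    lies in (0, 1], so ||phi|| <= sqrt(N) uniformly over the whole state space."""
    diffs = eta[None, :] - centers            # shape (N, dim)
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * width ** 2))

# Illustrative 3x3 grid of centers on a planar operating region.
centers = np.array([[x, y] for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])
phi_max = np.sqrt(len(centers))               # uniform bound, N = 9 here
worst = max(np.linalg.norm(rbf_basis(np.array(p), centers, 0.5))
            for p in [(-2.0, 3.0), (0.0, 0.0), (5.0, -5.0)])
print(worst <= phi_max)  # True: the basis norm never exceeds sqrt(N)
```

This is why fixed-center RBF parameterisations are convenient for the Lyapunov analysis: the regressor bound holds globally without any projection.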
This section establishes the boundedness of the overall ESO–actor–critic closed loop. The analysis proceeds in three steps: (i) show that the ESO yields a uniformly ultimately bounded (UUB) estimation error for the lumped uncertainty, (ii) show that the critic and actor weight dynamics remain bounded under the chosen update laws, and (iii) combine these properties into a composite Lyapunov argument that produces an explicit ultimate tracking-error bound. Denote the estimation errors $\tilde{x}_2 = x_2 - \hat{x}_2$ and $\tilde{x}_3 = x_3 - \hat{x}_3$; the observer error dynamics then follow from (17) and (18).
Lemma 1. Under Assumptions 1, 2, and 4 and under bounded closed-loop signals, there exists a constant $d_{\max} > 0$ such that $\|\dot{\Delta}(t)\| \le d_{\max}$ for all t.
This follows because $\Delta$ depends on bounded kinematic and dynamic states, bounded disturbances, and bounded control input, and its time derivative depends on bounded derivatives of these signals. In particular, the control input is bounded due to actuator saturation and the basis functions have bounded values in the operating region; hence, all terms contributing to $\dot{\Delta}$ remain bounded, which yields the stated bound.
Let the observer Lyapunov function candidate be a quadratic form in the estimation errors $(\tilde{x}_2, \tilde{x}_3)$. Differentiating it along the observer error dynamics yields a dissipation inequality for any sufficiently large observer bandwidth. Hence, the estimation errors $(\tilde{x}_2, \tilde{x}_3)$ are UUB. In particular, there exist finite constants $T_0 > 0$ and $\bar{\delta} > 0$ such that $\|\tilde{x}_3(t)\| \le \bar{\delta}$ for all $t \ge T_0$.
Next, consider the parameter estimation dynamics induced by the coupled actor–critic updates. By standard arguments for gradient-type updates with a shared regressor $\phi$, the actor–critic weight mismatch satisfies a contraction-type differential relation in the idealised approximation setting, which immediately implies a non-increasing quadratic storage function. Consequently, the mismatch is bounded and non-increasing in norm, and the associated mismatch energy is bounded for all time, as formalised in the following lemma.
Lemma 2. Define the actor–critic weight mismatch $\tilde{W} = \hat{W}_a - \hat{W}_c$ and the mismatch energy $P(t) = \|\tilde{W}(t)\|^{2}$. Under the coupled updates (21) and (22), the mismatch dynamics satisfies a contraction relation, and hence $P(t)$ satisfies (32). Therefore, $P(t)$ is non-increasing and $\tilde{W}(t)$ is bounded for all $t \ge 0$.
Remark 5. Lemma 2 shows that Equation (24), together with (21) and (22), enforces a monotonic decrease in the actor–critic mismatch energy P(t). This promotes parameter consistency under the adopted approximation and serves as a Lyapunov-guided consistency regularisation. However, this alone does not imply a pointwise satisfaction of the HJB equation nor exact policy optimality. Hence, the stationarity and weight-matching construction are interpreted as an HJB-inspired surrogate rather than a proof of exact HJB optimality. If the finite excitation condition in Assumption 3 holds, then Proposition 1 further implies that $\tilde{W}(t)$ converges to zero; otherwise, only boundedness and mismatch energy dissipation are guaranteed.
Proposition 1. Assume that the finite excitation condition of Assumption 3 holds, i.e., there exist $T_e > 0$ and $\kappa_e > 0$ such that $\int_{t}^{t+T_e} \phi(s)\phi(s)^{\top} ds \ge \kappa_e I$. Then, $P(t) \to 0$ and hence $\tilde{W}(t) \to 0$.
Choose a scalar $\alpha > 0$ and define the composite storage as a weighted sum of the tracking, observer, and mismatch storage functions. Using Assumption 4, we obtain the global quadratic bounds in (35). Now, define the full Lyapunov candidate $V$ in (34).
Remark 6. The coupling parameter α is introduced to shape the quadratic bounds of $V$ and to enlarge the feasibility margin when the ESO and learning transients coexist. In practice, a smaller α yields a more conservative but more robust bound, while a larger α can improve transient response at the expense of reduced robustness margin.
The candidate $V$ is positive definite and radially unbounded in the composite error because it is quadratically bounded by (35).
Differentiating (34) yields (37). Using eigenvalue bounds gives (38), and applying Young's inequalities with tunable positive scalars to the cross terms gives (39). Substituting (38) and (39) into (37) yields (40). Define the damping coefficients in (41) and choose the gains and Young scalars such that these coefficients are positive. Using Assumption 4 and (33), for any $\epsilon > 0$, it follows that (42) holds; similarly, one obtains (43). Substituting (42) and (43) into (40) gives (44). Define the tightened constants in (45) and choose the residual scalars sufficiently small so that the tightened constants remain positive. From (32), the mismatch energy is non-increasing. Combining (44) with (30) yields the dissipation inequality (47). For $t \ge T_0$, $\|\tilde{x}_3(t)\| \le \bar{\delta}$ holds from the ESO UUB property. Therefore, for all $t \ge T_0$, (47) holds, where one can choose the decay rate λ as in (48) and the explicit ultimate-bound constant ρ as in (49).
Theorem 1. Assume that Assumptions 1, 2, and 4 hold, and that the controller and observer learning gains are selected such that the tightened constants in (45) are positive. Let λ and ρ be defined in (48) and (49), and let $t \ge T_0$. Then, all closed-loop signals remain bounded. Moreover, for all $t \ge T_0$, the composite Lyapunov function satisfies the comparison solution (51), and the tracking errors are semi-globally uniformly ultimately bounded with the explicit ultimate bound (52).
Proof. For $t \ge T_0$, the ESO property yields $\|\tilde{x}_3(t)\| \le \bar{\delta}$, so the dissipation inequality (47) holds with the constants λ and ρ in (48) and (49). The remaining steps follow by standard comparison arguments, as detailed below. Using (35), we have the quadratic sandwich bound (50). Hence, (47) implies the standard comparison inequality
$$\dot{V} \le -\lambda V + \rho,$$
because the negative term in (47) dominates the state part; λ can be conservatively selected according to (48). Therefore,
$$V(t) \le V(T_0)\, e^{-\lambda (t - T_0)} + \frac{\rho}{\lambda}\left(1 - e^{-\lambda (t - T_0)}\right),$$
and the explicit ultimate bound (52) follows. Hence, the closed-loop tracking errors are SGUUB, and all signals remain bounded. □
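The comparison argument in the proof can be illustrated numerically: any storage function obeying $\dot{V} \le -\lambda V + \rho$ decays into the ultimate bound $\rho / \lambda$. The constants below are illustrative, not the values (48) and (49).

```python
lam, rho, dt = 2.0, 0.5, 0.001
V = 10.0                                     # large initial Lyapunov value
history = []
for _ in range(10000):                       # 10 s of the worst-case inequality
    V += dt * (-lam * V + rho)               # Euler step of V_dot = -lam*V + rho
    history.append(V)
ultimate = rho / lam                         # explicit ultimate bound rho/lambda
print(history[-1] <= ultimate + 1e-6)        # True: V has entered the bound
```

The transient decays at rate λ, so after a few time constants the storage sits at ρ/λ regardless of the (semi-global) initial condition, which is exactly the SGUUB behaviour claimed by Theorem 1.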
Remark 7. To avoid confusion with Lyapunov-drift-based single-slot optimisation methods, we emphasise that the Lyapunov function used in this paper is introduced only for the stability analysis of the closed-loop ESO actor–critic system. The long-term control objective is still defined by the infinite-horizon cost in Equation (14), and the actor–critic module performs HJB-inspired online policy refinement for this objective. The Lyapunov function is then used to establish boundedness and the SGUUB property of the tracking, observer, and learning dynamics under disturbances and actuator constraints.
Remark 8. The proposed ESO learning framework is fundamentally different from generic deep reinforcement learning methods for MEC optimisation [25]. Those methods use deep reinforcement learning as the main decision engine for slot-based resource allocation, whereas our method targets continuous-time nonlinear motion control, where the ESO provides the stabilising robust baseline and learning serves as a secondary performance-refinement mechanism. The implementation of the ESO-enhanced actor–critic RL-optimised 3-DOF trajectory tracking is summarised in Algorithm 1.
| Algorithm 1. ESO-enhanced actor–critic RL-optimised trajectory tracking (3-DOF). |
- 1: Initialise: choose the controller gains; choose the ESO bandwidth $\omega_0$ and set $\beta_1 = 2\omega_0$, $\beta_2 = \omega_0^{2}$; initialise the actor and critic weights.
- 2: for each control step t do
- 3: Measure the state and compute the references $\eta_d$ and $\dot{\eta}_d$.
- 4: Compute the errors $e_1$ and $e_2$.
- 5: Integrate the ESO (17) and (18) to obtain $\hat{x}_3$.
- 6: Compute the basis vector $\phi$.
- 7: Compute the actor output $u_0$.
- 8: Apply the compensated equivalent input $u = u_0 - \hat{x}_3$.
- 9: Update the critic by integrating (23).
- 10: Update the actor by integrating (24).
- 11: Compute the body-frame force and moment and allocate the thrusters.
- 12: end for
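To make the loop structure of Algorithm 1 concrete, the sketch below runs the ESO and a compensated input on the scalar template $\dot{e}_1 = e_2$, $\dot{e}_2 = u + \Delta$. A fixed stabilising state feedback stands in for the learned actor output $u_0$, and all gains and the disturbance are illustrative placeholders.

```python
import numpy as np

dt, omega0 = 0.001, 30.0
k1, k2 = 4.0, 4.0                       # illustrative feedback gains (actor stand-in)
e1, e2 = 1.0, 0.0                       # initial tracking errors (offset start)
x2_hat, d_hat = 0.0, 0.0                # ESO states

for step in range(20000):               # 20 s
    t = step * dt
    delta = 0.5 + 0.3 * np.sin(0.5 * t)  # slowly varying lumped disturbance
    u0 = -k1 * e1 - k2 * e2              # stand-in for the actor output (step 7)
    u = u0 - d_hat                       # compensated equivalent input (step 8)
    # Plant: the strict-feedback template (13).
    e1 += dt * e2
    e2 += dt * (u + delta)
    # ESO integration (step 5) with beta1 = 2*w0, beta2 = w0^2.
    err = e2 - x2_hat
    x2_hat += dt * (u + d_hat + 2.0 * omega0 * err)
    d_hat += dt * (omega0 ** 2 * err)

print(abs(e1) < 0.02, abs(d_hat - delta) < 0.02)  # small residual error and estimate lag
```

Even with a slowly varying disturbance, the residual tracking error stays inside a small tube because the ESO absorbs the dominant load and the feedback only handles the estimation residual — the decomposition philosophy of the proposed framework.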
4. Simulation Studies
This section provides a reproducible simulation protocol and a theory-aligned ablation study. The uncertainty and the environmental loads are generated continuously by wind–wave–current shaping filters, which is consistent with Assumption 1. To avoid a trivial initial condition, the vessel starts from an offset position that does not lie on the desired circle, and both the start and terminal points are explicitly marked in the trajectory plots. The vessel model parameters are set to those of the CyberShip II model vessel, and the simulations were executed in MATLAB R2025a.
A circular reference trajectory is designed with a constant radius and a constant angular rate. The environmental disturbance $d(t)$ is generated by wind–wave–current filters: the steady current-induced bias is modelled as a constant offset; the wind gust is generated by a low-frequency first-order filter; and the wave-frequency load is produced by a second-order filter for each channel. The disturbance filter parameters are shown in Table 1.
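The shaping-filter construction can be sketched as follows; the bias, gust, and wave parameters below are illustrative placeholders rather than the Table 1 values.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.01, 600.0
n = int(T / dt)

bias = 0.3                                   # steady current-induced bias
tau_w, sigma_w = 5.0, 0.2                    # wind-gust first-order filter
omega_e, zeta, sigma_s = 0.8, 0.1, 0.5       # wave-frequency second-order filter

gust, wave, wave_dot = 0.0, 0.0, 0.0
d_log = np.empty(n)
for k in range(n):
    w1, w2 = rng.normal(), rng.normal()
    # First-order low-pass gust driven by white noise (Euler–Maruyama step).
    gust += dt * (-gust / tau_w) + np.sqrt(dt) * sigma_w * w1
    # Second-order wave filter: x'' + 2*zeta*w*x' + w^2*x = sigma * noise.
    wave_dot += dt * (-2*zeta*omega_e*wave_dot - omega_e**2 * wave) \
                + np.sqrt(dt) * sigma_s * w2
    wave += dt * wave_dot
    d_log[k] = bias + gust + wave
print(np.all(np.isfinite(d_log)))  # the generated load is bounded and persistent
```

Summing a constant bias, a low-frequency gust, and a narrow-band wave component yields a continuous, bounded disturbance with bounded derivative, consistent with Assumption 1.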
All compared controllers share the same plant, the same reference trajectory, and the same actuator constraints. Thus, differences in performance are attributable to the control law rather than feasibility handling. We compare four methods in an ablation study: (i) nominal feedback (no ESO/RL); (ii) ESO-only; (iii) RL-only, with the updates (23) and (24); and (iv) the proposed control design. The following performance metrics are reported over the evaluation window: (i) the RMS position error, (ii) the RMS yaw error, and (iii) the control energy. The controller settings are shown in Table 2.
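The reported metrics can be computed from logged signals as follows; the helper names and the toy inputs are illustrative.

```python
import numpy as np

def rms_position_error(e_xy):
    """RMS of the planar position-error norm over the evaluation window."""
    return float(np.sqrt(np.mean(np.sum(np.asarray(e_xy) ** 2, axis=1))))

def rms_yaw_error(e_psi):
    """RMS yaw error with wrapping to (-pi, pi] so signed offsets are handled."""
    e = (np.asarray(e_psi) + np.pi) % (2.0 * np.pi) - np.pi
    return float(np.sqrt(np.mean(e ** 2)))

def control_energy(u_log, dt):
    """Integrated control energy: sum_k ||u_k||^2 * dt."""
    return float(np.sum(np.sum(np.asarray(u_log) ** 2, axis=1)) * dt)

e_xy = [[0.3, 0.4], [0.0, 0.0]]                # error norms 0.5 and 0.0
print(rms_position_error(e_xy))                # sqrt((0.25 + 0) / 2) ≈ 0.354
print(control_energy([[1.0, 0.0, 0.0]], 0.1))  # one sample: 1.0^2 * 0.1 = 0.1
```

Wrapping the yaw error before squaring matters: an error of $\pi + 0.1$ rad should count as a near-$\pi$ misalignment, not a value larger than $\pi$.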
The ablation results in Table 3 indicate that the dominant robustness improvement in this representative case is provided by the ESO-based disturbance estimation, while the actor–critic component mainly serves as a Lyapunov-guided consistency regularisation and secondary refinement mechanism within the proposed control-oriented framework. Therefore, the numerical margin between ESO-only and ESO plus actor–critic may be modest under the reported disturbance realisation, and we avoid interpreting it as a statistically significant gain without repeated-run dispersion measures.
Figure 1 shows the 2-D trajectory-tracking performance for the circular reference under the same disturbance realisation. All controllers converge to the vicinity of the desired orbit after a short transient caused by the initial offset. However, the proposed method exhibits the tightest overlap with the reference curve and the smallest visible deviation along the orbit, indicating superior steady-state disturbance rejection under persistent environmental loads. In particular, compared with the ESO-only and RL-only baselines, the proposed controller reduces residual drift and maintains a smaller tracking tube around the reference circle. This visual trend is consistent with the intention of using the ESO to handle the dominant disturbance rejection and using the actor–critic coupling to provide a stable refinement around the robust baseline. To highlight the steady tracking regime, the zoomed view in Figure 2 shows that the proposed method achieves the smallest residual error level and mitigates long-term drift most effectively under the same disturbance realisation and actuator constraints. The proposed controller also achieves the fastest decay of the position tracking error after the initial transient: the transient part reflects how quickly the controller recovers from the initial offset, whereas the steady part reflects the residual tracking tube under persistent disturbances, and the proposed method yields a smoother and lower steady error envelope in this representative case. The heading tracking performance is shown in Figure 3. The proposed method maintains a smaller and smoother yaw-error response, which contributes to the reduced lateral deviation on the circular path and prevents error accumulation caused by heading misalignment. Note that a small negative steady value of the yaw tracking error in Figure 3 indicates a slight signed offset rather than instability; such a residual bias is consistent with the SGUUB property under persistent disturbances and actuator constraints.
Figure 4 plots the three-channel commanded input of the proposed controller. The input signals exhibit a larger but short-lived transient effort to recover from the initial offset, followed by a bounded and smooth steady regime. Importantly, the improved tracking accuracy of the proposed method is not achieved by persistent saturation or excessively aggressive actuation; instead, it results from the complementary compensation structure that combines ESO disturbance estimation and the consistency regularisation in the actor–critic adaptation. To further quantify the visual results, the position error norm and the RMS indices are reported in Figure 5 and Table 3. The quantitative results in Table 3 report the tracking metrics for the representative disturbance realisation used in this study; the values are presented to illustrate qualitative performance trends rather than statistical significance. In the reported no-event wind–wave–current case, the proposed ESO+RL controller attains the smallest overall tracking error and the lowest RMS position error while keeping the commanded inputs bounded. The proposed method thus achieves the best tracking performance among the compared methods in this representative case, while the margin over the ESO-only baseline is modest.
Figure 5 summarises the tracking performance indicators for the compared methods in this representative disturbance case. The proposed method attains the smallest overall error level among the compared controllers, while the difference between ESO-only and ESO plus actor–critic remains modest. This supports the interpretation that the ESO contributes the primary robustness gain and the actor–critic coupling provides a secondary refinement under the reported scenario.
Figure 6 reports the evolution of the actor and critic weight norms. Compared with RL-only, the proposed architecture yields a markedly smaller and more stationary critic weight norm, indicating that the ESO and input-effectiveness adaptation reduce the residual uncertainty seen by the RL layer, thus improving learning stability and accelerating convergence. The trajectories indicate that the coupled update law provides a bounded and stable parameter adaptation process, which is consistent with the Lyapunov-guided consistency-regularisation interpretation of Equations (21) and (22). In this manuscript, we do not interpret these curves as evidence of TD-error-driven optimal learning, but as evidence of stable online adaptation within the ESO-based robust control framework. Overall, the results in Figures 1–6 demonstrate that the proposed controller provides superior tracking accuracy under persistent disturbances in a no-event scenario, while maintaining feasible and bounded control inputs under identical actuator constraints.
The results indicate that the ESO module provides the dominant disturbance rejection capability in this scenario, and the actor–critic coupling mainly serves as a Lyapunov-guided consistency regularisation and online refinement component rather than the primary source of robustness. Therefore, the RL-only baseline can be close to the nominal controller and the additional benefit of ESO plus RL over ESO-only may be modest under the reported disturbance realisation.