Article

Hierarchical Control for USV Trajectory Tracking with Proactive–Reactive Reward Shaping

1 Faculty of Artificial Intelligence, Shanghai University of Electric Power, Shanghai 201306, China
2 Marine Design and Research Institute of China, Shanghai 200011, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(12), 2392; https://doi.org/10.3390/jmse13122392
Submission received: 13 November 2025 / Revised: 9 December 2025 / Accepted: 14 December 2025 / Published: 17 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

To address trajectory tracking of underactuated unmanned surface vessels (USVs) under disturbances and model uncertainty, we propose a hierarchical control framework that combines model predictive control (MPC) with proximal policy optimization (PPO). The outer loop runs in the inertial reference frame, where an MPC planner based on a kinematic model enforces velocity and safety constraints and generates feasible body–fixed velocity references. The inner loop runs in the body–fixed reference frame, where a PPO policy learns the nonlinear inverse mapping from velocity to multi–thruster thrust, compensating hydrodynamic modeling errors and external disturbances. On top of this framework, we design a Proactive–Reactive Adaptive Reward (PRAR) that uses the MPC prediction sequence and real–time pose errors to adaptively reweight the reward across surge, sway and yaw, improving robustness and cross–model generalization. Simulation studies on circular and curvilinear trajectories compare the proposed PRAR–driven dual–loop controller (PRAR–DLC) with MPC–PID, PPO–Only, MPC–PPO and PPO variants. On the curvilinear trajectory, PRAR–DLC reduces surge MAE and maximum tracking error from 0.269 m and 0.963 m (MPC–PID) to 0.138 m and 0.337 m, respectively; on the circular trajectory it achieves about an 8.5% reduction in surge MAE while maintaining comparable sway and yaw accuracy to the baseline controllers. Real–time profiling further shows that the average MPC and PPO evaluation times remain below the control sampling period, indicating that the proposed architecture is compatible with real–time onboard implementation and physical deployment.

1. Introduction

As an intelligent waterborne system, the Unmanned Surface Vehicle (USV) has demonstrated significant application potential in areas such as marine monitoring and investigation, patrol, fishery resource monitoring, and safety inspection of water conservancy facilities. Equipped with complete control, sensor, and communication systems, USVs can perform tasks autonomously without continuous human intervention. However, conditions such as environmental variability and model uncertainty may affect the performance of the USV control system [1]. Consequently, achieving accurate trajectory tracking control (ATTC) for USVs has emerged as a pivotal concern within the realm of ocean engineering and control [2]. An underactuated ship is one in which the number of independent control inputs is less than the number of degrees of freedom [3]. Its typical feature is the absence of direct driving force in the lateral or longitudinal degrees of freedom. Owing to the strong nonlinearity and actuation constraints of such marine systems, trajectory tracking and stabilization for underactuated ships are particularly challenging.
In studies on trajectory tracking controllers for underactuated ships, the main approaches can be classified into two categories: model–based methods and data–driven methods. Common model-based control methods include sliding-mode control (SMC), model predictive control (MPC), linear–quadratic regulation (LQR), adaptive control, and backstepping control. When the controlled object is modeled precisely and its parameters are known accurately, these control methods can fully utilize this information and achieve optimal control performance of the system. Accordingly, model-based control has shown substantial practical value in a variety of control tasks. Several studies on USV trajectory tracking have been built on these methods, including the following: Wei et al. [4] proposed an anti-saturation backstepping control strategy. Zhu et al. [5] designed a distributed nonlinear model predictive control (NMPC) method for the multi–target tracking and surrounding control problem of multi-USV systems. Xu et al. [6] designed an adaptive line-of-sight (ALOS) guidance law to transform the fuzzy adaptive fault-tolerant path following control (PFC) problem into the heading tracking control problem. Chen et al. [7] proposed a prescribed performance compensation control scheme to reduce the impact of uncertainty on the USV-UAV control system. However, these algorithms hinge on accurate kinematic and dynamic models. Environmental disturbances such as wind and waves often cause mismatches between the plant’s actual dynamics and its model, thereby degrading controller performance. Moreover, when the plant’s dynamic model is intrinsically complex, the design of such controllers and the real-time computation of the control system become more challenging.
Another widely used approach is data-driven control. Rather than directly relying on an explicit mathematical model of the vessel, this approach designs the controller by exploiting the vessel’s input–output mapping and the data collected during operation. Proportional–integral–derivative (PID) control is a representative data–driven method and has been widely applied to the control of USVs [8,9,10]. However, when confronted with systems exhibiting nonlinear, time–varying, or high–order dynamics, PID controllers often fall short of the desired performance and show marked limitations in disturbance rejection.
Within the field of machine learning, reinforcement learning (RL) is an algorithmic paradigm that focuses on how an agent learns through interactions with its environment to maximize cumulative reward [11]. RL–based controllers adapt their parameters in response to environmental feedback and reward signals. Through continuous experimentation and learning, the controller can effectively adapt to dynamic, complex, and uncertain environments [12]. In recent years, with advances in reinforcement learning, the field of USV control has made significant progress [13,14,15,16], and numerous exemplary studies have made outstanding contributions to RL-based USV trajectory tracking control. To overcome complex environmental disturbances, Zhou et al. [17] proposed a deep reinforcement learning (DRL)-based policy for USV trajectory tracking control. Wu et al. [18] proposed a proximal policy optimization (PPO) control scheme for highly coupled systems with nonlinear relationships. Wen et al. [19] presented a novel approach for achieving high-precision trajectory tracking control through utilization of receding horizon reinforcement learning (RHRL). Wang et al. [20] introduced an integral compensator into the conventional deep deterministic policy gradient (DDPG) algorithm and thus improved the tracking accuracy and robustness of the vessel’s controller. Dai et al. [21] presented adaptive neural tracking control of underactuated surface vessels with modeling uncertainties and time–varying external disturbances. By leveraging system data, these machine learning–based control methods dispense with reliance on an accurate system model and effectively improve the system’s robustness.
To achieve superior control performance, model–based and data–driven methods can be combined to exploit their respective strengths, thereby accommodating diverse models and practical application scenarios. In the field of USV trajectory tracking control research, researchers combine reinforcement learning methods with certain model–based control approaches. By setting rewards to guide the adjustment of controller parameters, they enhance the system’s robustness. Huang et al. [22] proposed a probabilistic neural networks model predictive control (PNMPC) to avoid the computational complexity associated with sample capacity. Cui et al. [23] proposed a local update spectrum probabilistic model predictive control (LUSPMPC) approach to enhance the computational efficiency of USV under unforeseen and unobservable external disturbances. Qin et al. [24] applied DDPG to achieve the optimal control performance for the controlled plant.
Although existing studies have extensively explored USV control from the perspectives of various model-based control methods, deep reinforcement learning, and their hybrid architectures, several critical gaps remain to be addressed. Most existing hybrid approaches focus on coupling RL with conventional controllers at a single level, and their typical forms can be summarized into several categories as shown in Table 1.
As can be seen, these architectures restrict RL to the role of either a parameter scheduler [25] or a high-level decision-maker [26], making it difficult to explicitly handle multi-thruster inverse dynamics and complex execution-layer nonlinearities [27]; alternatively, they concentrate complexity in high-dimensional models and model–based receding–horizon optimization, which imposes stringent requirements on real-time performance and engineering deployment and makes the controller highly sensitive to model structure and system dimension.
For USV trajectory-tracking tasks with kinematic constraints and complex hydrodynamic characteristics, existing work still falls short of fully decoupling the problem in a hierarchical manner. Most methods either directly employ robust or adaptive control and single–layer model–based controllers at the dynamics level, or use RL for high–level path planning, parameter scheduling, or model learning. Such approaches typically depend on relatively accurate dynamic or hydrodynamic modeling. When the plant exhibits unmodeled dynamics, contains parameters that are difficult to obtain, or suffers from uncertainty in the specific parameter values and identification procedures, it becomes challenging to guarantee control performance and closed–loop stability, and the resulting controllers exhibit limited transferability.
Moreover, existing RL-based control studies mostly rely on static, manually specified reward functions based on prior knowledge. These rewards remain fixed throughout training and cannot promptly reflect the dynamic performance requirements of the plant in complex, time-varying environments, which leads to pronounced limitations in terms of convergence smoothness and robustness to disturbances.
Motivated by the above gaps, we design a Proactive–Reactive Adaptive Reward (PRAR)-driven dual–loop USV trajectory–tracking controller, referred to as the PRAR–driven Dual–Loop Controller (PRAR–DLC). The outer loop of this controller adopts Model Predictive Control (MPC), while the inner loop employs Proximal Policy Optimization (PPO). The overall objective is to ensure constraint feasibility at the trajectory–planning level and to achieve high–precision, adaptive, and cross–model generalizable control at the dynamic execution level within a unified control framework. Specifically, the aims of this study are as follows:
(1)
Construct a dual–loop structure with kinematic–dynamic hierarchy.
In the inertial reference frame, the outer loop designs a planning controller based on MPC, performing receding–horizon optimization on a simplified kinematic model. By explicitly incorporating speed, acceleration, and safety constraints, it guarantees the dynamic feasibility of the reference velocity commands. In the body–fixed reference frame, an execution controller is designed for the inner loop. This controller is structurally decoupled from the outer loop so that constrained planning and dynamic tracking are handled by different loops, forming a clear hierarchical control architecture.
(2)
Employ reinforcement learning exclusively at the execution layer to achieve weak dependence on complex dynamics and cross–model generalization.
In the inner loop, we introduce a PPO–based policy network that directly learns a nonlinear inverse–dynamics and compensation mapping from the outer–loop velocity references and USV state observations to thruster forces. In this way, complex hydrodynamic effects and multi–thruster couplings are absorbed into the learned RL policy instead of being explicitly modeled via a detailed dynamic description. The outer loop only requires a simplified kinematic model to generate feasible references, and the control architecture is structurally decoupled from the specific USV dynamics. Consequently, when changing the vessel type or propulsion configuration, one only needs to retrain or fine–tune the inner–loop policy to complete controller transfer.
(3)
Develop a Proactive–Reactive Adaptive Reward (PRAR) to enhance the training efficiency and robustness of the inner–loop PPO.
Leveraging the hierarchical information of the dual–loop structure, we design an adaptive reward mechanism that simultaneously accounts for anticipated future trajectory trends and the current pose–tracking error, allowing the reward to adjust dynamically with the system state and trajectory evolution rather than remain fixed. By using PRAR to shape the PPO update process in a targeted manner, the learned policy attains heightened sensitivity to both long–term trajectory quality and instantaneous tracking error, thereby outperforming conventional static reward designs in terms of sample efficiency, convergence smoothness, and robustness to disturbances and model uncertainties.
The PRAR–DLC, through the coordinated division of labor between the outer–loop MPC and inner–loop PPO, is designed to provide a hierarchical intelligent control scheme for USV trajectory tracking in complex hydrodynamic environments and under diverse propulsion configurations, simultaneously ensuring constraint safety, weak dependence on explicit models, and cross–platform transferability.

2. USV Dynamic Model

For USV trajectory tracking, three–degree–of–freedom (3–DOF) kinematic and dynamic models are typically formulated and analyzed. For notation, we adopt the convention introduced by Fossen [28]. The controlled vessel is a waterjet propulsion vessel with fixed thruster vector angles [29]. An illustration of the vessel in the inertial and body-fixed reference frames is provided in Figure 1.
The USV model comprises two parts [29]. The first relates the 3–DOF control inputs—surge force, sway force, and yaw moment—to the dynamics governing the vessel’s velocities, and can be written as:
\[
\tau = M\dot{\nu} + C(\nu)\nu + D(\nu)\nu,
\]
where τ is a vector of forces and moments generated by the port and starboard thrusters:
\[
\tau = \begin{bmatrix} \tau_x \\ \tau_y \\ \tau_z \end{bmatrix}
= \begin{bmatrix} \tau_{\mathrm{port}} + \tau_{\mathrm{stbd}} \\ 0 \\ \left(\tau_{\mathrm{port}} - \tau_{\mathrm{stbd}}\right)\dfrac{B}{2} \end{bmatrix}.
\]
M is the inertia matrix, comprising the vessel mass, moments of inertia, and added–mass terms, and it characterizes the inertial properties during acceleration. M is denoted by:
\[
M = \begin{bmatrix}
m - X_{\dot{u}} & 0 & -m y_G \\
0 & m - Y_{\dot{v}} & m x_G - Y_{\dot{r}} \\
-m y_G & m x_G - N_{\dot{v}} & I_z - N_{\dot{r}}
\end{bmatrix}.
\]
The Coriolis matrix C ( ν ) captures the coupling between inertia and velocity; it is typically skew-symmetric and accounts for Coriolis and centripetal effects during motion. C ( ν ) is expressed as:
\[
C(\nu) =
\begin{bmatrix}
0 & 0 & -m\left(x_G r + v\right) \\
0 & 0 & -m\left(y_G r - u\right) \\
m\left(x_G r + v\right) & m\left(y_G r - u\right) & 0
\end{bmatrix}
+
\begin{bmatrix}
0 & 0 & \dfrac{2 Y_{\dot{v}} v + \left(Y_{\dot{r}} + N_{\dot{v}}\right) r}{2} \\
0 & 0 & -X_{\dot{u}} u \\
-\dfrac{2 Y_{\dot{v}} v + \left(Y_{\dot{r}} + N_{\dot{v}}\right) r}{2} & X_{\dot{u}} u & 0
\end{bmatrix}.
\]
The damping matrix D ( ν ) describes the influence of fluid damping and frictional losses on the vessel’s motion and usually contains both linear and nonlinear components. D ( ν ) is denoted by:
\[
D(\nu) = -
\begin{bmatrix}
X_u & 0 & 0 \\
0 & Y_v & Y_r \\
0 & N_v & N_r
\end{bmatrix}
-
\begin{bmatrix}
X_{u|u|}|u| & 0 & 0 \\
0 & Y_{v|v|}|v| + Y_{v|r|}|r| & Y_{r|v|}|v| + Y_{r|r|}|r| \\
0 & N_{v|v|}|v| + N_{v|r|}|r| & N_{r|v|}|v| + N_{r|r|}|r|
\end{bmatrix}.
\]
The velocity vector $\nu = [u, v, r]^{\top}$ denotes, in the body-fixed reference frame, the surge and sway linear velocities and the yaw rate, with $u$ and $v$ expressed in m/s and $r$ in rad/s.
The second part of the model describes the kinematic mapping between the body-fixed and inertial reference frames, transforming body-fixed velocities into the time derivatives of the inertial-frame pose, and is given by:
\[
\dot{\eta} = J(\eta)\nu,
\]
where η = [ x , y , ψ ] denotes the position and heading in the inertial reference frame, and J ( η ) is the coordinate transformation matrix that maps velocity vectors between the body-fixed and inertial reference frames:
\[
J(\eta) = \begin{bmatrix}
\cos\psi & -\sin\psi & 0 \\
\sin\psi & \cos\psi & 0 \\
0 & 0 & 1
\end{bmatrix}.
\]
The kinematic relations are formulated in an East–North–Up (ENU) right–handed inertial reference frame. Therefore, ψ > 0 corresponds to a counterclockwise rotation in the horizontal plane.
In the vessel’s dynamic model, the values of certain elements in M, C(ν) and D(ν) are determined by the vessel’s structural configuration. Moreover, the entries of C(ν) and D(ν) are velocity-dependent and exhibit considerable parameter uncertainty [30].
In this study, the specific values of the aforementioned parameters and the identification methods can be found in [1]. This USV dynamic model provides the foundational basis for the entire control system; subsequent controller design and simulation experiments are conducted on this model.
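To make the model concrete for readers who want to reproduce the simulation environment, the following minimal Python sketch implements one forward–Euler step of Equations (1)–(7). The numerical coefficients are illustrative placeholders rather than the identified VTec S-III parameters of [1] and Table A1, and only the linear part of the damping in Equation (5) is retained for brevity.

import numpy as np

# Illustrative (non-identified) rigid-body and hydrodynamic coefficients.
m, Iz, xg, yg = 50.0, 6.0, 0.0, 0.0
Xu_dot, Yv_dot, Yr_dot, Nv_dot, Nr_dot = -5.0, -20.0, -1.0, -1.0, -3.0
Xu, Yv, Yr, Nv, Nr = -10.0, -30.0, -2.0, -2.0, -8.0   # linear damping only

M = np.array([[m - Xu_dot, 0.0,             -m * yg],
              [0.0,        m - Yv_dot,       m * xg - Yr_dot],
              [-m * yg,    m * xg - Nv_dot,  Iz - Nr_dot]])

def C_of(nu):
    """Coriolis/centripetal matrix of Equation (4): rigid-body plus added-mass parts."""
    u, v, r = nu
    c13 = -m * (xg * r + v) + Yv_dot * v + 0.5 * (Yr_dot + Nv_dot) * r
    c23 = -m * (yg * r - u) - Xu_dot * u
    return np.array([[0.0, 0.0, c13],
                     [0.0, 0.0, c23],
                     [-c13, -c23, 0.0]])

def D_of(nu):
    """Linear part of the damping matrix of Equation (5); the |u|u-type terms are omitted."""
    return -np.array([[Xu, 0.0, 0.0],
                      [0.0, Yv, Yr],
                      [0.0, Nv, Nr]])

def J_of(psi):
    """Rotation matrix of Equation (7) from the body-fixed to the inertial frame."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def step(eta, nu, tau, dt=0.01):
    """One forward-Euler update of pose eta = [x, y, psi] and velocity nu = [u, v, r]."""
    nu_dot = np.linalg.solve(M, tau - C_of(nu) @ nu - D_of(nu) @ nu)
    eta_dot = J_of(eta[2]) @ nu
    return eta + dt * eta_dot, nu + dt * nu_dot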

3. Control Scheme Design

To achieve improved tracking performance, the complex USV trajectory-tracking task is decoupled into high–level kinematic planning combined with low–level dynamic control within a hierarchical framework. In this framework, the outer–loop controller performs long–horizon path planning in the inertial reference frame based on a concise and accurate kinematic model and generates the desired velocity commands. The inner–loop controller then addresses the more intricate and uncertain dynamic mapping, compensating model uncertainties and external disturbances through learning. The control architecture is shown in Figure 2.
In the outer loop, the MPC algorithm is applied. Its model-based, long-horizon prediction capability is leveraged to compute optimal reference velocities on the basis of the vessel’s kinematic model, thereby achieving path planning. In the inner loop, a PPO–based agent is deployed. Its model-free, data-driven nature is exploited to learn to execute the MPC–issued commands accurately under complex nonlinear dynamics and unknown disturbances. Further details are provided in this section.

3.1. Outer–Loop Controller Based on MPC

The core task of the outer–loop controller is to generate, via online optimization, the optimal desired velocities in the body–fixed reference frame ( u ref , v ref , r ref ) based on the deviation between the vessel’s current state and the reference trajectory. These commands are then provided as references to the inner–loop PPO agent for execution.
The MPC–based outer–loop controller is built upon a mathematical model capable of accurately predicting the system’s future evolution, namely the kinematic model of the controlled USV introduced in the preceding section [29]:
\[
\dot{\eta} = \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\psi} \end{bmatrix}
= \begin{bmatrix}
\cos\psi & -\sin\psi & 0 \\
\sin\psi & \cos\psi & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} u \\ v \\ r \end{bmatrix}
= J(\eta)\nu.
\]
At each sampling instant k, a first–order Taylor expansion of Equation (8) is performed about the reference point $(\eta_{\mathrm{ref},k}, \nu_{\mathrm{ref},k})$, yielding:
\[
\Delta\dot{\eta} \approx A_c\,\Delta\eta + B_c\,\Delta\nu,
\]
where the Jacobian matrices A c and B c are given by:
\[
A_c = \left.\frac{\partial\, J(\eta)\nu}{\partial \eta}\right|_{\eta_{\mathrm{ref}},\,\nu_{\mathrm{ref}}}
= \begin{bmatrix}
0 & 0 & -u_{\mathrm{ref}}\sin\psi_{\mathrm{ref}} - v_{\mathrm{ref}}\cos\psi_{\mathrm{ref}} \\
0 & 0 & u_{\mathrm{ref}}\cos\psi_{\mathrm{ref}} - v_{\mathrm{ref}}\sin\psi_{\mathrm{ref}} \\
0 & 0 & 0
\end{bmatrix},
\]
\[
B_c = \left.\frac{\partial\, J(\eta)\nu}{\partial \nu}\right|_{\eta_{\mathrm{ref}},\,\nu_{\mathrm{ref}}}
= \begin{bmatrix}
\cos\psi_{\mathrm{ref}} & -\sin\psi_{\mathrm{ref}} & 0 \\
\sin\psi_{\mathrm{ref}} & \cos\psi_{\mathrm{ref}} & 0 \\
0 & 0 & 1
\end{bmatrix}.
\]
After forward Euler discretization, the discrete–time state–space representation is obtained as Equation (12):
\[
\eta_{k+1} = \left(I + T A_c\right)\eta_k + T B_c\,\nu_k = A_{1,k}\eta_k + B_{1,k}\nu_k,
\]
where T denotes the sampling period and is set to T = 0.03 s . Accordingly, at sampling instant k, A 1 and B 1 can be derived as:
\[
A_{1,k} = \begin{bmatrix}
1 & 0 & -T\left(u_{\mathrm{ref}}\sin\psi_{\mathrm{ref}} + v_{\mathrm{ref}}\cos\psi_{\mathrm{ref}}\right) \\
0 & 1 & T\left(u_{\mathrm{ref}}\cos\psi_{\mathrm{ref}} - v_{\mathrm{ref}}\sin\psi_{\mathrm{ref}}\right) \\
0 & 0 & 1
\end{bmatrix},
\]
\[
B_{1,k} = \begin{bmatrix}
T\cos\psi_{\mathrm{ref}} & -T\sin\psi_{\mathrm{ref}} & 0 \\
T\sin\psi_{\mathrm{ref}} & T\cos\psi_{\mathrm{ref}} & 0 \\
0 & 0 & T
\end{bmatrix}.
\]
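As a concrete illustration, the discrete-time matrices of Equations (13) and (14) can be computed directly from the current reference, as in the short Python sketch below; the function name and interface are chosen for illustration and are not part of the original implementation.

import numpy as np

def linearize_discretize(u_ref, v_ref, psi_ref, T=0.03):
    """Discrete-time matrices A_{1,k}, B_{1,k} of Equations (13)-(14): the kinematics
    eta_dot = J(eta) nu are linearized about the reference and discretized by forward Euler."""
    s, c = np.sin(psi_ref), np.cos(psi_ref)
    A1 = np.array([[1.0, 0.0, -T * (u_ref * s + v_ref * c)],
                   [0.0, 1.0,  T * (u_ref * c - v_ref * s)],
                   [0.0, 0.0,  1.0]])
    B1 = np.array([[T * c, -T * s, 0.0],
                   [T * s,  T * c, 0.0],
                   [0.0,    0.0,   T]])
    return A1, B1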
To obtain the complete state and control sequences over a prediction horizon $N_p = 40$ and a control horizon $N_c = 10$, and to compute the optimum of the cost function $J$, an augmented state $\xi_k \in \mathbb{R}^6$ is constructed:
\[
\xi_k = \begin{bmatrix} \eta_k - \eta_{\mathrm{ref},k} \\ \nu_{k-1} \end{bmatrix},
\]
where $\eta_k$ denotes the actual pose (position and heading) of the vessel at sampling instant k; $\eta_{\mathrm{ref},k}$ denotes the reference pose at the same instant; and $\nu_{k-1} = [u_{k-1}, v_{k-1}, r_{k-1}]^{\top}$ denotes the control output applied at the previous instant, namely the velocity commands in the three degrees of freedom. The control increment is defined as $\Delta\nu_k = \nu_k - \nu_{k-1}$. Accordingly, the update equations for the control state and the pose error can be obtained as:
\[
\nu_k = \mathbf{0}\,\eta_k + I\,\nu_{k-1} + I\,\Delta\nu_k,
\]
\[
\eta_{k+1} - \eta_{\mathrm{ref},k+1}
= A_{1,k}\eta_k + B_{1,k}\nu_k - \eta_{\mathrm{ref},k+1}
= A_{1,k}\left(\eta_k - \eta_{\mathrm{ref},k}\right) + B_{1,k}\left(\nu_{k-1} + \Delta\nu_k\right) + A_{1,k}\eta_{\mathrm{ref},k} - \eta_{\mathrm{ref},k+1}.
\]
To simplify the model, the reference trajectory is assumed to vary slowly, so that the last term in the pose-error state update satisfies
\[
A_{1,k}\eta_{\mathrm{ref},k} - \eta_{\mathrm{ref},k+1} \approx 0,
\]
and the update can be approximated by
\[
\eta_{k+1} - \eta_{\mathrm{ref},k+1} = A_{1,k}\left(\eta_k - \eta_{\mathrm{ref},k}\right) + B_{1,k}\left(\nu_{k-1} + \Delta\nu_k\right).
\]
By combining the update of the control state with the pose–error update, the augmented state update can be derived as follows:
\[
\xi_{k+1} = \begin{bmatrix} \eta_{k+1} - \eta_{\mathrm{ref},k+1} \\ \nu_k \end{bmatrix}
= \begin{bmatrix} A_{1,k} & B_{1,k} \\ 0_{3\times 3} & I_{3\times 3} \end{bmatrix}
\begin{bmatrix} \eta_k - \eta_{\mathrm{ref},k} \\ \nu_{k-1} \end{bmatrix}
+ \begin{bmatrix} B_{1,k} \\ I_{3\times 3} \end{bmatrix}\Delta\nu_k,
\]
and the augmented predictive model with the control increment Δ ν k as input can be obtained as:
\[
\xi_{k+1} = A_k\xi_k + B_k\Delta\nu_k,
\]
where
\[
A_k = \begin{bmatrix} A_{1,k} & B_{1,k} \\ 0_{3\times 3} & I_{3\times 3} \end{bmatrix}, \qquad
B_k = \begin{bmatrix} B_{1,k} \\ I_{3\times 3} \end{bmatrix}.
\]
This predictive model clearly characterizes the dynamic evolution of the pose error at the current instant and the previous control input under the action of the control increment. To extract the internal pose error from the augmented state for subsequent prediction and weighting in the cost function, we use an output matrix $C \in \mathbb{R}^{3\times 6}$:
\[
C = \begin{bmatrix} I_{3\times 3} & 0_{3\times 3} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix},
\]
and map the augmented error state ξ k to the output e k :
\[
e_k = C\xi_k = \eta_k - \eta_{\mathrm{ref},k}.
\]
By iterating the predictive model in Equation (21), a relationship is established between the future states over the prediction horizon N p , the current augmented state ξ k , and the sequence of control increments over the control horizon N c [31]. At time step k, the one–step–ahead prediction for the next sampling instant is defined as:
\[
\xi_{k+1|k} = A_k\xi_k + B_k\Delta\nu_k.
\]
The two–step–ahead prediction at time step k + 2 is defined as:
\[
\xi_{k+2|k} = A_k\xi_{k+1|k} + B_k\Delta\nu_{k+1}.
\]
Substituting the expression of $\xi_{k+1|k}$ into the above yields the relationship between the current state $\xi_k$ and the two–step–ahead prediction $\xi_{k+2|k}$:
\[
\xi_{k+2|k} = A_k\left(A_k\xi_k + B_k\Delta\nu_k\right) + B_k\Delta\nu_{k+1}
= A_k^2\xi_k + A_k B_k\Delta\nu_k + B_k\Delta\nu_{k+1}.
\]
By induction, the i–step–ahead prediction at time step k is derived as:
\[
\xi_{k+i|k} = A_k^{\,i}\xi_k + \sum_{j=0}^{i-1} A_k^{\,i-1-j} B_k\,\Delta\nu_{k+j}.
\]
Stack all predicted state vectors from i = 1 to N p into a long vector Ξ k :
\[
\Xi_k = \begin{bmatrix} \xi_{k+1|k} \\ \xi_{k+2|k} \\ \vdots \\ \xi_{k+N_p|k} \end{bmatrix}.
\]
Stack all control increments from j = 0 to N c 1 into another column vector:
\[
\Delta U_k = \begin{bmatrix} \Delta\nu_k \\ \Delta\nu_{k+1} \\ \vdots \\ \Delta\nu_{k+N_c-1} \end{bmatrix}.
\]
Then the state prediction equation can be derived as:
\[
\Xi_k = \Phi_k\xi_k + \Theta_k\Delta U_k,
\]
where the prediction matrices Φ k and Θ k are given by:
\[
\Phi_k = \begin{bmatrix} A_k \\ A_k^2 \\ \vdots \\ A_k^{N_p} \end{bmatrix}, \qquad
\Theta_k = \begin{bmatrix}
B_k & 0 & \cdots & 0 \\
A_k B_k & B_k & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
A_k^{N_p-1} B_k & A_k^{N_p-2} B_k & \cdots & A_k^{N_p-N_c} B_k
\end{bmatrix}.
\]
Since the cost function acts directly on the output e k , the prediction equation is combined with the pose-error output equation to obtain the predicted output sequence over N p steps, denoted by:
\[
E_k = \operatorname{diag}(\underbrace{C,\ldots,C}_{N_p})\,\Xi_k
= \tilde{C}\left(\Phi_k\xi_k + \Theta_k\Delta U_k\right),
\]
where C ˜ is the block–diagonal output matrix. For notational simplicity, the prediction relation is rewritten as:
\[
E_k = \Phi\xi_k + \Theta\Delta U_k,
\]
and the prediction matrices Φ and Θ are redefined as:
\[
\Phi = \begin{bmatrix} C A_k \\ C A_k^2 \\ \vdots \\ C A_k^{N_p} \end{bmatrix}, \qquad
\Theta = \begin{bmatrix}
C B_k & 0 & \cdots & 0 \\
C A_k B_k & C B_k & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
C A_k^{N_p-1} B_k & C A_k^{N_p-2} B_k & \cdots & C A_k^{N_p-N_c} B_k
\end{bmatrix}.
\]
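A possible way to assemble the augmented model of Equations (21) and (22) and the stacked prediction matrices of Equation (35) is sketched below in Python. It is a direct transcription of the block structure above; the function names are ours and the loop is written for clarity rather than efficiency.

import numpy as np

def build_prediction_matrices(A1, B1, Np=40, Nc=10):
    """Form A_k, B_k of Equation (22) and the output-weighted prediction
    matrices Phi and Theta of Equation (35), with C = [I_3  0_3]."""
    A = np.block([[A1, B1], [np.zeros((3, 3)), np.eye(3)]])   # 6x6 augmented A_k
    B = np.vstack([B1, np.eye(3)])                            # 6x3 augmented B_k
    C = np.hstack([np.eye(3), np.zeros((3, 3))])              # pose-error output matrix

    Phi = np.zeros((3 * Np, 6))
    Theta = np.zeros((3 * Np, 3 * Nc))
    A_pow = np.eye(6)
    for i in range(Np):
        A_pow = A_pow @ A                                     # A_k^(i+1)
        Phi[3*i:3*i+3, :] = C @ A_pow
        for j in range(min(i + 1, Nc)):
            # block (i, j) of Theta is C A_k^(i-j) B_k
            Theta[3*i:3*i+3, 3*j:3*j+3] = C @ np.linalg.matrix_power(A, i - j) @ B
    return Phi, Theta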
At each control period, the MPC controller solves a finite–horizon optimization problem formulated as a standard quadratic program. The cost function J is defined as:
\[
J(\Delta U_k, s) = \sum_{i=1}^{N_p}\left\| e_{k+i|k}\right\|_Q^2
+ \sum_{j=0}^{N_c-1}\left\|\Delta\nu_{k+j|k}\right\|_R^2 + \rho s^2,
\]
where Q and R denote the weighting matrices for the tracking error and the control increment, respectively, and are chosen as constant diagonal matrices with Q = diag ( 100 , 100 , 150 ) and R = diag ( 0.5 , 0.5 , 3 ) . s is a slack variable, and ρ > 0 is its penalty coefficient. By substituting the prediction Equation (34) into the cost function Equation (36) and introducing the slack variable s, the optimization is reformulated as a quadratic program (QP). Let the optimization variable be defined as:
\[
z = \begin{bmatrix} \Delta U_k \\ s \end{bmatrix}.
\]
Then the original optimization problem can be expressed in the standard QP objective form as:
\[
\min_{z}\; J(z) = \frac{1}{2} z^{\top} H z + f^{\top} z.
\]
The Hessian matrix H and the gradient vector f are derived from the cost function Equation (36):
\[
H = \begin{bmatrix} 2\left(\Theta^{\top}\tilde{Q}\Theta + \tilde{R}\right) & 0 \\ 0 & 2\rho \end{bmatrix},
\qquad
f = \begin{bmatrix} 2\Theta^{\top}\tilde{Q}\Phi\xi_k \\ 0 \end{bmatrix},
\]
where Q ˜ = diag ( Q , , Q ) and R ˜ = diag ( R , , R ) are the block–diagonal weighting matrices expanded over the entire horizon.
The optimization must also satisfy physical constraints. In this work, amplitude constraints are imposed on both the control inputs and the control increments:
\[
\nu_{\min} \le \nu_k \le \nu_{\max}, \qquad
\Delta\nu_{\min} \le \Delta\nu_k \le \Delta\nu_{\max}
\]
These constraints are converted into linear inequality constraints on the decision variable z . At each sampling step k, the quadratic program is solved online. The first element of the optimal control–increment sequence, Δ ν k , is then used to update the current control input:
\[
\nu_k = \nu_{k-1} + \Delta\nu_k.
\]
The final controller output is the sum of the feedback control input and the feedforward velocity, which is then supplied as the reference signal to the inner–loop PPO controller.
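To make the receding-horizon step concrete, the following Python sketch assembles the weighted cost of Equation (36) and solves the resulting constrained program with CVXPY. The velocity and increment bounds, the slack penalty, and the way the slack softens the velocity limits are illustrative assumptions; the paper does not specify the exact constraint values or the QP solver used.

import numpy as np
import cvxpy as cp

def solve_mpc_step(Phi, Theta, xi, nu_prev, Np=40, Nc=10,
                   Q=np.diag([100.0, 100.0, 150.0]),
                   R=np.diag([0.5, 0.5, 3.0]),
                   rho=1e3,
                   nu_min=np.array([-4.0, -2.0, -1.0]),      # assumed bounds
                   nu_max=np.array([4.0, 2.0, 1.0]),
                   dnu_max=np.array([0.2, 0.2, 0.1])):
    """Solve one receding-horizon step of Equations (36)-(42) and return nu_k."""
    w_q = np.sqrt(np.diag(np.kron(np.eye(Np), Q)))            # sqrt of the Q-tilde diagonal
    w_r = np.sqrt(np.diag(np.kron(np.eye(Nc), R)))            # sqrt of the R-tilde diagonal

    dU = cp.Variable(3 * Nc)                                  # stacked control increments
    s = cp.Variable(nonneg=True)                              # slack variable

    E = Phi @ xi + Theta @ dU                                 # predicted pose-error sequence
    cost = cp.sum_squares(cp.multiply(w_q, E)) \
         + cp.sum_squares(cp.multiply(w_r, dU)) \
         + rho * cp.square(s)

    dnu0 = dU[:3]                                             # first increment, applied to the plant
    cons = [cp.abs(dnu0) <= dnu_max,                          # rate constraint
            nu_prev + dnu0 <= nu_max + s,                     # softened velocity bounds
            nu_prev + dnu0 >= nu_min - s]

    cp.Problem(cp.Minimize(cost), cons).solve()
    return nu_prev + dnu0.value                               # updated reference velocity nu_k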

3.2. Inner–Loop Controller Based on PRAR–PPO

The inner loop adopts a PRAR–enhanced PPO controller. This controller focuses solely on learning the inverse dynamic mapping from velocity to thrust in the body–fixed reference frame, thereby achieving precise tracking of the velocity commands output by the outer-loop MPC controller.

3.2.1. PPO

PPO is an on-policy Actor–Critic algorithm. Its key innovation is the introduction of a clipped surrogate objective function, which mitigates destructive policy degradation caused by excessively large update steps in traditional policy-gradient methods [32].
To quantify and effectively constrain the discrepancy between the new and old policies during updates, PPO defines a probability ratio:
\[
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]
where $\pi_{\theta}$ denotes the policy with the current trainable parameters, while $\pi_{\theta_{\mathrm{old}}}$ denotes the behavior policy used to interact with the environment and collect data. If $r_t(\theta) > 1$, the new policy increases the probability of taking action $a_t$ in state $s_t$; conversely, if $r_t(\theta) < 1$, the new policy decreases that probability.
In the PPO algorithm, the advantage function A t evaluates how good the current action is in the current state:
\[
A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l},
\]
where $\gamma$ is the discount factor; $\lambda$ is the coefficient that trades off bias and variance in the advantage estimate; and $\delta_t$ is the temporal-difference error. The specific form of $\delta_t$ is given by:
\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\]
where V ( s t ) and V ( s t + 1 ) denote the state values at the current state s t and the next state s t + 1 .
Using the above equations, we compute the advantage estimate A t at each sampling time. Unlike standard policy–gradient objectives, PPO adopts a clipped surrogate objective: by introducing a clipping mechanism to constrain r t ( θ ) , it suppresses overly aggressive policy updates. The final optimization objective is given in Equation (46):
\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)A_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)A_t\right)\right],
\]
where ϵ = 0.1 is the clipping width. This clipping operation constrains r t ( θ ) to the interval [ 1 ϵ , 1 + ϵ ] , thereby ensuring stable updates.
By formulating and optimizing the clipped surrogate objective, PPO strictly constrains the magnitude of policy updates, thereby achieving more effective learning while preserving stability.
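A compact PyTorch sketch of the advantage estimate in Equations (44) and (45) and the clipped objective in Equation (46) is given below. It omits the value-function and entropy losses of a complete PPO update, and the discount and GAE coefficients shown are common defaults rather than necessarily the values in Table 3; it is intended only to make the clipping mechanism explicit.

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation, Equations (44)-(45), over one finite episode."""
    adv, last = torch.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]       # TD error delta_t
        last = delta + gamma * lam * last                     # discounted accumulation
        adv[t] = last
    return adv

def ppo_clip_loss(new_logp, old_logp, adv, eps=0.1):
    """Clipped surrogate of Equation (46); the returned scalar is minimized."""
    ratio = torch.exp(new_logp - old_logp)                    # probability ratio r_t(theta)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(surr1, surr2).mean()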

3.2.2. State and Action Space

(1)
State Space
The state space of the inner-loop controller aims to comprehensively represent the state information of the USV dynamics model. Higher-level trajectory-planning states (e.g., pose error) are fully decoupled and handled by the outer-loop MPC; only the velocity command is fed into the inner-loop agent’s state space. Since the input of the inner loop is the reference velocity command from the outer loop, and the output is the actual velocity output by the dynamics model, we define the state space s as a direct combination of these two vectors, as given by Equation (47):
\[
s = \left[\nu_{\mathrm{ref}}(t),\; \nu_{\mathrm{actual}}(t)\right],
\]
where $\nu_{\mathrm{ref}}(t) = [u_{\mathrm{ref}}(t), v_{\mathrm{ref}}(t), r_{\mathrm{ref}}(t)]^{\top}$ denotes the desired velocity command passed from the outer loop to the inner loop, and $\nu_{\mathrm{actual}}(t) = [u_{\mathrm{actual}}(t), v_{\mathrm{actual}}(t), r_{\mathrm{actual}}(t)]^{\top}$ denotes the actual velocity produced by the inner loop; the linear velocity components are measured in m/s and the yaw-rate components in rad/s.
(2)
Action Space
The agent’s action space a corresponds to the control inputs of the USV dynamics model, namely the thrust commands applied to the two thrusters, and is defined as:
\[
a = \left[T_{\mathrm{port}},\; T_{\mathrm{stbd}}\right],
\]
where T port and T stbd denote the control commands for the port–side and starboard–side thrusters, respectively. Subject to the thrusters’ physical capability limits, the command range is normalized to [ 0 , 10 ] . In the simulation environment, this command is linearly mapped via a gain block to the actual physical thrust range [ 0 N , 2000 N ] .
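The observation and action interfaces described above take only a few lines of Python, as sketched below. The beam value is an illustrative stand-in (the VTec S-III dimensions are listed in Table 2), while the gain of 200 N per command unit simply realizes the stated mapping from [0, 10] to [0, 2000] N.

import numpy as np

def make_observation(nu_ref, nu_act):
    """State of Equation (47): reference and actual body-fixed velocities, shape (6,)."""
    return np.concatenate([nu_ref, nu_act]).astype(np.float32)

def action_to_tau(action, beam=1.0, gain=200.0):
    """Map the normalized action [T_port, T_stbd] in [0, 10] to physical thrust in
    [0, 2000] N and to the generalized force vector of Equation (2)."""
    t_port, t_stbd = gain * np.clip(action, 0.0, 10.0)
    return np.array([t_port + t_stbd,                  # surge force
                     0.0,                              # no direct sway actuation
                     (t_port - t_stbd) * beam / 2.0])  # yaw moment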

3.2.3. PRAR

In reinforcement learning, the design of the reward function is crucial to both learning efficiency and the final performance of the policy. A reasonable reward function can transform high–level control objectives into specific, optimizable numerical signals. The static reward function commonly used in the standard PPO algorithm is typically designed as a quadratic penalty on the tracking error, using a fixed set of weight coefficients:
\[
r_t = -\left(w_u\, e_u(t)^2 + w_v\, e_v(t)^2 + w_r\, e_r(t)^2\right),
\]
where e u ( t ) , e v ( t ) , e r ( t ) denote the velocity errors in surge, sway, and yaw rate, respectively, and w u , w v , w r are manually tuned weighting hyperparameters. Such a fixed reward function may not fully accommodate the diverse learning needs across different training stages [33]. Because the reward is computed purely from historical errors, the agent receives negative feedback only after the error has already occurred and begun to accumulate; as a result, the penalty mechanism cannot proactively guide the controller to anticipate imminent disturbances and trajectory deviations, nor to adjust and guard against them in time—leading to a pronounced response lag.
To overcome these limitations, we leverage the characteristics of the outer–loop MPC and design a Proactive–Reactive Adaptive Reward (PRAR) that uses both the pose–tracking error and the rolling prediction sequence generated by the MPC. This mechanism reflects real–time errors while accounting for the future task difficulty, as illustrated in Figure 3:
This method replaces the fixed penalty weights in the baseline reward with a dynamically computed, real-time weight vector W ( t ) . The reward r t is defined as:
\[
r_t = -\,e(t)^{\top}\operatorname{diag}\!\left(W(t)\right)e(t),
\]
where $e(t) = [e_u(t), e_v(t), e_r(t)]^{\top}$ is the instantaneous velocity–tracking error vector, so that each squared error component is weighted by the corresponding entry of $W(t)$.
For the dynamic weight vector W ( t ) , its computation is given by:
\[
W(t) = W_{\mathrm{base}} \odot f_{\mathrm{reactive}}(t) \odot f_{\mathrm{proactive}}(t),
\]
where ⊙ denotes element–wise multiplication, W base is the baseline weight vector, and f reactive ( t ) and f proactive ( t ) are the reactive and predictive factors, respectively.
(1)
Reactive factor
To address the limited adaptivity of static rewards, a reactive factor is designed to enhance the controller’s responsiveness to the current tracking performance. This factor receives the desired pose η = [ x , y , ψ ] and the actual pose η r = [ x r , y r , ψ r ] output by the outer-loop controller. The module then computes the instantaneous pose error:
\[
e_p = \begin{bmatrix} e_x \\ e_y \end{bmatrix} = \begin{bmatrix} x - x_r \\ y - y_r \end{bmatrix},
\]
\[
e_{\psi} = \operatorname{atan2}\!\left(\sin(\psi - \psi_r),\, \cos(\psi - \psi_r)\right),
\]
where e p denotes the position–error vector and e ψ denotes the heading–angle error. To directly apply the error extracted from the outer loop to the USV dynamic model, the error is transformed from the inertial reference frame to the body–fixed reference frame. This coordinate transformation is accomplished via the rotation matrix R ( ψ ) parameterized by the current heading:
\[
R(\psi) = \begin{bmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{bmatrix}.
\]
By concatenating the decoupled longitudinal and lateral position errors with the heading error, a unified error vector e ( t ) R 3 can be obtained:
\[
e(t) = \begin{bmatrix} e_u(t) \\ e_v(t) \\ e_{\psi}(t) \end{bmatrix}
= \begin{bmatrix} e_x\cos\psi + e_y\sin\psi \\ -e_x\sin\psi + e_y\cos\psi \\ e_{\psi} \end{bmatrix}.
\]
Finally, the error vector e ( t ) is mapped, elementwise, by a nonlinear function to the weighting adjustment factors associated with each degree of freedom:
\[
f_{\mathrm{reactive},i}(t) = 1 + A_r\left(1 - \exp\!\left(-\alpha\,|e_i(t)|\right)\right), \qquad i \in \{u, v, r\},
\]
where $e_i$ is the error of each channel, $A_r$ is the reactive amplitude, which specifies the upper bound on the gain provided by the reactive factor, and $\alpha$ is the reactive sensitivity, which controls how quickly the factor grows with the error magnitude.
The reactive factor performs a nonlinear, bounded, and smooth gain mapping of the pose error. As a result, when the vessel deviates from the planned course, the penalty on velocity–tracking error is promptly increased, enabling automatic adaptation to the current control task.
(2)
Proactive factor
To mitigate the response lag inherent in conventional reward functions that rely solely on historical information, the future velocity sequence generated by the outer–loop MPC during receding–horizon optimization is incorporated into the reward design [34], and a predictive adjustment factor is devised accordingly. This factor converts the MPC controller’s forecast of future task complexity into a feedforward modulation signal, thereby allowing the inner–loop agent to prepare in advance before pronounced changes in the control commands occur.
Through a dedicated module, the factor receives the velocity planning sequence over the next N p time steps generated by the outer-loop MPC:
\[
V_{\mathrm{ref}}(t) = \left[\nu_{u,0},\, \nu_{v,0},\, \nu_{r,0},\, \ldots,\, \nu_{u,N_p-1},\, \nu_{v,N_p-1},\, \nu_{r,N_p-1}\right]^{\top} \in \mathbb{R}^{3N_p}.
\]
It is then reshaped as:
\[
V_{\mathrm{plan}}(t) = \begin{bmatrix}
\nu_{u,0} & \nu_{v,0} & \nu_{r,0} \\
\nu_{u,1} & \nu_{v,1} & \nu_{r,1} \\
\vdots & \vdots & \vdots \\
\nu_{u,N_p-1} & \nu_{v,N_p-1} & \nu_{r,N_p-1}
\end{bmatrix} \in \mathbb{R}^{N_p \times 3},
\]
where the k-th row gives the reference velocities for the k-th prediction step across the three channels. To emphasize the importance of near-term predictions for the current decision, a time–attention weight w k is defined:
\[
w_k = \sigma^{k}, \qquad k = 0, \ldots, N_p - 2,
\]
where σ is the temporal decay factor, which assigns weights to the complexity at future steps. Normalization yields:
\[
\alpha_k = \frac{w_k}{\sum_{j=0}^{N_p-2} w_j}, \qquad \sum_{k=0}^{N_p-2}\alpha_k = 1.
\]
Subsequently, a first-order difference is taken between adjacent samples for each channel:
\[
\Delta\nu_{i,k+1} = \left|\nu_{i,k+1} - \nu_{i,k}\right|, \qquad k = 0, \ldots, N_p - 2.
\]
Based on this definition, a time-weighted trajectory complexity (TWTC) metric is introduced to quantify the complexity of the planned velocity sequence:
\[
C_{t,i} = \sum_{k=0}^{N_p-2} \alpha_k \left|\nu_{i,k+1} - \nu_{i,k}\right|, \qquad i \in \{u, v, r\}.
\]
Segments in V ref ( t ) that are closer to the present and exhibit sharper variations yield higher trajectory complexity. This design enables the agent to place greater emphasis on imminent and rapidly changing commands and their impact on policy stability.
After computing the trajectory complexity index for each channel, the hyperbolic tangent is employed as the activation function to map these indices to the final predictive adjustment factors:
\[
f_{\mathrm{proactive},i}(t) = 1 + A_p \tanh\!\left(\beta\, C_{t,i}\right),
\]
where $A_p$ is the proactive amplitude, which specifies the maximum gain attainable by the predictive adjustment factor, and $\beta$ is the proactive sensitivity, which controls how quickly the factor grows with the trajectory complexity.
By quantifying the dynamic complexity of the MPC’s future plan and mapping it nonlinearly into a bounded, smooth gain–modulation signal, the predictive adjustment factor is obtained. This factor enables the controller to increase the penalty on velocity–tracking error in advance when pronounced changes in future commands are anticipated, thereby achieving proactive anticipation and adaptive adjustment to forthcoming task challenges.
Collecting the above definitions, the mapping from the real–time pose error and the MPC–based trajectory complexity to the time-varying reward weights can be written in closed form. For each degree of freedom i { u , v , r } , the adaptive weight applied to the velocity–tracking error at time step k is given by:
\[
W_i(k) = W_{\mathrm{base},i}\left[1 + A_r\left(1 - e^{-\alpha |e_i(k)|}\right)\right]\left[1 + A_p\tanh\!\left(\beta\, C_{k,i}\right)\right],
\qquad
C_{k,i} = \sum_{\ell=0}^{N_p-2}\frac{\sigma^{\ell}}{\sum_{j=0}^{N_p-2}\sigma^{j}}\,\left|\nu_{i,\ell+1}(k) - \nu_{i,\ell}(k)\right|,
\]
which shows that W i ( k ) increases monotonically with both the instantaneous pose error | e i ( k ) | and the predicted trajectory complexity C k , i . The PRAR–weighted reward used by the inner–loop PPO controller is finally given by:
\[
r_k = -\sum_{i\in\{u,v,r\}} W_i(k)\, e_{v,i}^2(k),
\]
where e v , i ( k ) denotes the velocity–tracking error in surge, sway or yaw. In this way, PRAR provides an explicit, time–varying mapping from real–time pose errors and MPC–based complexity indices to the scalar penalty coefficients in the reinforcement–learning reward.
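The complete PRAR computation condenses into a single function, as in the Python sketch below. The hyperparameter values are placeholders rather than the tuned settings of Table 3, and the interface (a body-fixed pose-error vector, the reshaped MPC prediction matrix, and the velocity-error vector) mirrors the quantities defined above.

import numpy as np

def prar_reward(e_pose, v_plan, e_vel,
                w_base=np.array([1.0, 1.0, 1.0]),
                A_r=1.0, alpha=2.0, A_p=1.0, beta=2.0, sigma=0.9):
    """Proactive-Reactive Adaptive Reward of Equations (64)-(65).
    e_pose : body-fixed pose error [e_u, e_v, e_psi] from the outer loop
    v_plan : MPC prediction sequence reshaped to (Np, 3), one row per step
    e_vel  : instantaneous velocity-tracking error [e_u, e_v, e_r]"""
    # Reactive factor, Equation (56): bounded gain driven by the current pose error.
    f_react = 1.0 + A_r * (1.0 - np.exp(-alpha * np.abs(e_pose)))

    # Time-weighted trajectory complexity, Equations (59)-(62).
    Np = v_plan.shape[0]
    w = sigma ** np.arange(Np - 1)
    a = w / w.sum()                                     # normalized attention weights alpha_k
    C = (a[:, None] * np.abs(np.diff(v_plan, axis=0))).sum(axis=0)

    # Proactive factor, Equation (63), and adaptive weights, Equation (64).
    f_pro = 1.0 + A_p * np.tanh(beta * C)
    W = w_base * f_react * f_pro

    # PRAR-weighted quadratic penalty, Equation (65).
    return -np.sum(W * e_vel ** 2)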

3.3. Implementation of PRAR–PPO

The complete implementation is summarized as Algorithm 1:
Algorithm 1 MPC–PRAR–PPO dual–loop control
Require: USV kinematic and dynamic models; prediction horizon N_p; control horizon N_c; sample time T; input and rate constraints; PRAR parameters (A_r, α, A_p, β, σ); policy π_θ and value function V_ϕ
 1: Initialize previous reference velocity ν_{−1}^{ref} ← 0 and base weights W_base ← [w_u, w_v, w_r]
 2: Set time index k ← 0
 3: while mission not finished do
 4:     Measure current pose η_k = [x_k, y_k, ψ_k]^⊤ and body–fixed velocity ν_k^{act} = [u_k, v_k, r_k]^⊤
 5:     Obtain reference pose η_k^{ref} = [x_k^{ref}, y_k^{ref}, ψ_k^{ref}]^⊤ from the path generator
 6:     Form augmented state ξ_k from the pose error (η_k − η_k^{ref}) and the internally stored previous virtual input
 7:     Linearize the kinematic model and build the MPC prediction matrices and quadratic program
 8:     Solve the MPC QP to obtain the current reference velocity ν_k^{ref} = [u_k^{ref}, v_k^{ref}, r_k^{ref}]^⊤ and the prediction sequence V_ref(k) ∈ R^{3N_p}
 9:     Compute inertial position errors e_x = x_k − x_k^{ref}, e_y = y_k − y_k^{ref}
10:     Compute heading error e_ψ = atan2(sin(ψ_k − ψ_k^{ref}), cos(ψ_k − ψ_k^{ref}))
11:     Transform (e_x, e_y) into body-fixed errors using ψ_k: e_u = e_x cos ψ_k + e_y sin ψ_k, e_v = −e_x sin ψ_k + e_y cos ψ_k
12:     Set e(k) = [e_u, e_v, e_ψ]^⊤
13:     for i ∈ {u, v, r} do
14:         Compute reactive factor f_i^{react}(k) = 1 + A_r(1 − exp(−α|e_i(k)|))
15:     end for
16:     Reshape V_ref(k) into V_plan(k) ∈ R^{N_p×3} and set α_ℓ = σ^ℓ / Σ_{j=0}^{N_p−2} σ^j
17:     for i ∈ {u, v, r} do
18:         Compute trajectory complexity C_{k,i} = Σ_{ℓ=0}^{N_p−2} α_ℓ |ν_{i,ℓ+1} − ν_{i,ℓ}|
19:         Compute proactive factor f_i^{pro}(k) = 1 + A_p tanh(β C_{k,i})
20:         Update adaptive weight W_i(k) = W_{base,i} · f_i^{react}(k) · f_i^{pro}(k)
21:     end for
22:     Construct policy input s_k = [(ν_k^{ref})^⊤, (ν_k^{act})^⊤]^⊤ ∈ R^6
23:     Generate action a_k = π_θ(s_k) (or sample from π_θ(·|s_k)) and map it to the thrust command τ_k
24:     Apply τ_k to the USV dynamics to obtain the next pose η_{k+1} and velocity ν_{k+1}^{act}
25:     if learning is enabled then
26:         Compute reward r_k from ν_k^{ref}, ν_k^{act} and W_i(k)
27:         Store transition (s_k, a_k, r_k, s_{k+1}) and update θ, ϕ by PPO
28:     end if
29:     Set k ← k + 1
30: end while

4. Simulation Results and Analysis

To validate the effectiveness of the proposed PRAR–DLC scheme in USV trajectory tracking, five groups of comparative experiments are conducted: (1) a single–loop controller that relies solely on PPO; (2) a dual–loop architecture in which the outer-loop controller is retained while the inner loop is replaced with a carefully tuned PID controller based on the nominal hydrodynamic model; (3) a dual-loop MPC–PPO controller with fixed penalty weights; (4) an MPC–PPO controller incorporating only the Proactive factor; and (5) an MPC–PPO controller incorporating only the Reactive factor. This configuration enables a systematic assessment of how different reward designs influence the overall USV control performance.
The VTec S-III USV is selected as the controlled vessel for the simulation experiments [29], and its basic parameters are listed in Table 2 and Table A1.

4.1. Neural Network Parameters of Inner–Loop

The inner loop comprises an Actor network and a Critic network, each with two hidden layers. The hyperparameter settings for the PRAR–DLC are detailed in Table 3.

4.2. Circular Trajectory Tracking Experiments

The circular trajectory is a circle of radius R = 10 m centered at ( x c , y c ) = ( 0 , 10 ) :
\[
\begin{cases}
x_{\mathrm{ref}}(t) = x_c + R\sin(\omega t) \\
y_{\mathrm{ref}}(t) = y_c - R\cos(\omega t)
\end{cases}, \qquad \omega = \frac{v}{R}.
\]
During the training phase, the linear speed is set to $v_{\mathrm{train}} = 3$ m/s, and the angular speed is $\omega_{\mathrm{train}} = 0.3$ rad/s. Each episode contains 2000 steps (duration 20 s). At episode reset, the initial velocities are slightly randomized, with $u_0 \in [0, 0.1]$ m/s, $r_0 \in [0, 0.1]$ rad/s, and $v_0 = 0$ m/s. During the testing phase, the linear speed is increased to $v_{\mathrm{test}} = 4$ m/s, yielding the corresponding angular speed $\omega_{\mathrm{test}} = 0.4$ rad/s, and each episode contains 1500 steps (duration 15 s). This configuration constitutes a mild out–of–distribution (OOD) test. The centripetal acceleration $a_c = v^2/R$ rises from 0.9 to 1.6 m/s², which imposes higher demands on thrust and control moment and thereby assesses the policy’s generalization and robustness. The remaining parameters and constraints are identical to those in the training phase. To account for stochastic variability, each method is tested over 30 independent runs. The mean absolute tracking error computed across these repetitions is then recorded in Table 4. The performance comparison is based on four error metrics: the mean absolute error (MAE), the root-mean-square (RMS) error, the maximum error, and the 95th-percentile error.
For the circular trajectory, the spatial tracking results in Figure 4 show that all controllers except PPO–Only are able to follow the reference circle, but their accuracy and smoothness differ noticeably. The trajectory of the proposed PRAR–DLC almost coincides with that of the nominal–hydrodynamic–model–based MPC–PID controller, while MPC–PPO and PPO–Re exhibit visible deviations along the arc. This visual impression is consistent with the quantitative error statistics in Table 4.
In the surge direction, PRAR–DLC achieves a longitudinal tracking performance comparable to MPC–PID and clearly better than most learning-based baselines. Its MAE is reduced from 0.899 (MPC–PID) to 0.823 , and the RMS error remains almost identical to MPC–PID. The 95th-percentile error is also slightly smaller (a reduction of about 3.5 % ), although the maximum error is larger, indicating a few more pronounced overshoots. Compared with the RL baselines, PRAR–DLC provides a more balanced behavior: PPO–Pro attains the smallest surge MAE but at the cost of larger lateral and yaw errors, whereas PPO–Only, MPC–PPO and PPO–Re show both higher RMS and much larger tail errors in Figure 5b. Overall, PRAR–DLC delivers near–MPC–PID surge tracking while avoiding the severe drift observed in several PPO variants.
In the sway direction, PRAR–DLC and MPC–PID exhibit almost the same mean error, and PRAR–DLC still clearly outperforms all PPO-based baselines. As shown in Figure 6b, the proposed controller maintains a relatively smooth lateral error with moderate peaks, whereas PPO–Only, MPC–PPO and PPO–Re produce much larger oscillations and offsets during the turning phase. These results suggest that the current PRAR design tends to prioritize longitudinal tracking and overall trajectory shape, accepting a modest increase in lateral peak error compared with the nominal-model PID while still significantly improving over purely learning-based controllers.
For yaw control, PRAR–DLC achieves almost the same mean heading accuracy as MPC–PID: the MAE values are 0.08197 and 0.08209 , respectively. At the same time, all PPO-based baselines show noticeably larger MAE ( 0.12 0.26 ) and much larger spikes, as evidenced by Figure 7b. Similar to the sway case, the RMS and tail yaw errors of PRAR–DLC are higher than those of MPC–PID (RMS and 95th–percentile increased by roughly 22 % and 56 % ), but remain substantially below the levels of PPO–Only, MPC–PPO and PPO–Pro. The yaw error curve of PRAR–DLC is characterized by a slowly varying bias without catastrophic bursts, while learning–only controllers display pronounced oscillations.
Taken together, the circular trajectory experiments show that, under a much weaker model assumption (outer–loop kinematics and a learned PPO execution layer), PRAR–DLC can match the mean surge–sway–yaw accuracy of a nominal–model MPC–PID controller and clearly outperforms all RL–based and RL–model hybrid baselines in terms of overall stability and robustness, at the cost of slightly larger lateral and yaw peak errors.

4.3. Curvilinear Trajectory Tracking Experiments

The desired trajectory in this experiment is a sinusoidal path with multiple tunable parameters, as given by Equation (67):
\[
\begin{cases}
x_{\mathrm{ref}}(t) = v t \\
y_{\mathrm{ref}}(t) = y_0 + A\sin\!\left(\dfrac{\pi}{L}\, v t + \varphi\right)
\end{cases}
\]
The parameter ranges for the trajectory are listed in Table 5. During both training and testing, the specific parameter values of the desired trajectory within each episode are determined by a random seed. Each seed defines a unique and independent trajectory that remains fixed for the entire episode. A seed pool containing ten distinct seeds is prepared. In the training phase, one seed is drawn uniformly at random from this pool at the beginning of each episode to instantiate the desired trajectory. In testing, a seed not contained in the pool is selected and fixed as the desired trajectory for evaluating all models.
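A minimal sketch of this seeded trajectory generation is given below; the sampling ranges are placeholders standing in for Table 5, and the heading reference is derived numerically from the path tangent purely for illustration.

import numpy as np

def make_sinusoidal_reference(seed, v=3.0, T=0.03, n_steps=2000):
    """Instantiate the reference path of Equation (67) from a random seed."""
    rng = np.random.default_rng(seed)
    y0 = rng.uniform(-2.0, 2.0)     # lateral offset (placeholder range)
    A = rng.uniform(2.0, 6.0)       # amplitude (placeholder range)
    L = rng.uniform(15.0, 30.0)     # wavelength parameter (placeholder range)
    phi = rng.uniform(0.0, np.pi)   # phase (placeholder range)

    t = np.arange(n_steps) * T
    x_ref = v * t
    y_ref = y0 + A * np.sin((np.pi / L) * v * t + phi)
    psi_ref = np.arctan2(np.gradient(y_ref, T), np.gradient(x_ref, T))
    return np.stack([x_ref, y_ref, psi_ref], axis=1)

train_pool = list(range(10))        # ten pooled training seeds
test_seed = 42                      # held-out evaluation seed (illustrative)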
During the training phase, the linear speed is set to $v_{\mathrm{train}} = 3$ m/s, and the angular speed is $\omega_{\mathrm{train}} = 0.3$ rad/s. Each episode contains 2000 steps (duration 20 s). At episode reset, the initial velocities are slightly randomized, with $u_0 \in [0, 0.1]$ m/s, $r_0 \in [0, 0.1]$ rad/s, and $v_0 = 0$ m/s. During the testing phase, each episode contains 1500 steps. To account for stochastic variability, each method is tested over 30 independent runs. The mean absolute tracking error computed across these repetitions is then recorded in Table 6.
For the curvilinear trajectory, the spatial tracking result in Figure 8 shows that all controllers can roughly follow the reference path, but their tracking quality differs significantly. The nominal hydrodynamic MPC–PID and the proposed PRAR–DLC stay closest to the desired curve, whereas PPO–Only exhibits a smaller amplitude and MPC–PPO and PPO–Re deviate more noticeably in the middle and latter parts of the trajectory, which is consistent with the error statistics in Table 6.
In the surge direction, PRAR–DLC provides the most robust longitudinal tracking among all methods. Compared with MPC–PID, the surge MAE is reduced from 0.269 to 0.138 (about 49 % reduction), and the RMS error decreases from 0.357 to 0.171 (more than 50 % ). The maximum and 95th–percentile surge errors drop from 0.963 / 0.843 to 0.337 / 0.306 , corresponding to reductions of approximately 65 % and 64 % , respectively. Although PPO–Pro and PPO–Re achieve slightly smaller MAE, their peak and tail errors remain noticeably larger than those of PRAR–DLC, and MPC–PPO suffers from much larger, near–divergent errors. The surge error curves in Figure 9b confirm that PRAR–DLC keeps the longitudinal error tightly bounded over the entire horizon, while several PPO variants accumulate substantial drift. These results indicate that the proposed proactive–reactive adaptive reward effectively suppresses large outliers and rare extreme errors in the longitudinal motion.
In the sway direction, PRAR–DLC does not minimize the average error but offers a reasonable trade–off between accuracy and robustness. Its sway MAE ( 0.577 ) is much smaller than those of PPO–Only and MPC–PPO (both above 1.0 ). The RMS, maximum and 95th-percentile sway errors of PRAR–DLC are moderately larger than those of MPC–PID, but remain in the same order of magnitude. As shown in Figure 10b, PRAR–DLC produces a relatively smooth and bounded lateral error, whereas PPO–Only and MPC–PPO exhibit persistent bias and strong oscillations. This behavior suggests that the current PRAR design intentionally prioritises longitudinal tracking and global trajectory shape, accepting a certain increase in lateral peak error compared with the nominal-model PID, while still clearly outperforming purely learning–based controllers.
For yaw control, PRAR–DLC again exhibits competitive performance. Its yaw MAE ( 0.071 ) is slightly larger than that of MPC–PID ( 0.060 ) but clearly lower than those of PPO–Only ( 0.104 ) and MPC–PPO ( 0.193 ), and comparable to PPO–Pro ( 0.073 ). As shown in Figure 11b, PRAR–DLC yields a low-amplitude, low-frequency heading error without pronounced spikes, whereas the learning–only baselines produce stronger oscillations and intermittent peaks. Overall, the curvilinear trajectory experiments show that, under a weak–model assumption with an outer–loop kinematic planner and an inner PPO execution layer, PRAR–DLC can substantially reduce longitudinal errors and maintain acceptable sway and yaw performance, achieving a favorable balance between tracking accuracy and robustness compared with both the nominal–model MPC–PID controller and the RL–based baselines.

4.4. Real–Time Execution Characteristics and Physical Deployment Considerations

The runtime performance of the proposed hierarchical architecture is systematically profiled under two scenarios. In both cases, the outer MPC loop operates with a sampling period of T_MPC = 0.03 s (f_MPC ≈ 33 Hz), whereas the inner PPO loop runs at T_PPO = 0.01 s (f_PPO = 100 Hz), which reflects the intended hierarchical design where the PPO-based thrust allocation reacts faster than the MPC–based kinematic planning. For each scenario, we recorded the average and maximum solution times of the MPC quadratic program, as well as the average and maximum forward–pass times of the PPO policy network. To emulate adverse operating conditions and assess robustness against fast perturbations, a bounded high-frequency disturbance at 50 Hz is injected into the surge force and yaw moment channels. This frequency is chosen to be close to the Nyquist limit of the inner–loop sampling period T_PPO = 0.01 s, so that, after discretization, it appears as a rapidly alternating excitation at each PPO control step. The disturbance amplitude is selected to remain well below the maximum thrust and torque capabilities of the water–jet actuators, ensuring physical plausibility while still providing a stringent high–frequency stress test for both the inner–loop PPO controller and the outer–loop MPC layer [35].
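The timing statistics and the injected disturbance can be reproduced with a simple harness such as the Python sketch below; the profiling interface and the disturbance amplitudes are illustrative assumptions, with only the 50 Hz frequency and the bounded-amplitude requirement taken from the setup described above.

import time
import numpy as np

def profile_step(mpc_solve, ppo_forward, log):
    """Record one control period's MPC QP solve time and PPO forward time in ms."""
    t0 = time.perf_counter()
    nu_ref = mpc_solve()
    log["mpc_ms"].append(1e3 * (time.perf_counter() - t0))

    t1 = time.perf_counter()
    action = ppo_forward(nu_ref)
    log["ppo_ms"].append(1e3 * (time.perf_counter() - t1))
    return nu_ref, action

def disturbance(t, amp_force=100.0, amp_moment=20.0, f_dist=50.0):
    """Bounded 50 Hz disturbance on the surge-force and yaw-moment channels;
    amplitudes are illustrative and kept well below the actuator limits."""
    s = np.sin(2.0 * np.pi * f_dist * t)
    return np.array([amp_force * s, 0.0, amp_moment * s])

# Mean and maximum times are then obtained as np.mean(log["mpc_ms"]), np.max(log["mpc_ms"]), etc.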

4.4.1. Runtime Performance in Circular Trajectory Tracking Experiments

The real–time execution performance of the MPC–PPO and PRAR–DLC controllers in the circular trajectory tracking experiments is illustrated in Figure 12 and Figure 13 and summarized in Table 7.
As shown in Figure 12 and Figure 13 and Table 7, the proposed hierarchical architecture is compatible with real-time implementation. Under the MPC–PPO baseline, the mean and maximum MPC QP solve times are 2.108 ms and 3.010 ms, while the mean and maximum PPO forward times are 1.030 ms and 4.439 ms. With the PRAR–DLC controller, the average MPC load is further reduced to 1.862 ms, and the PPO forward time remains below 1 ms on average, with a maximum of 2.007 ms. Comparing these figures with the sampling periods of the two loops, T_MPC = 0.03 s (f_MPC ≈ 33.3 Hz) and T_PPO = 0.01 s (f_PPO = 100 Hz), indicates that the average computational cost occupies only a small fraction of each loop’s sampling interval.

4.4.2. Runtime Performance in Curvilinear Trajectory Tracking Experiments

The real-time execution performance of the MPC–PPO and PRAR–DLC controllers in the curvilinear trajectory tracking experiments is illustrated in Figure 14 and Figure 15 and summarized in Table 8.
For the curvilinear trajectory tracking experiments, the runtime behavior reported in Table 8 and Figure 14 and Figure 15 complements the circular trajectory results and illustrates how the proposed architecture behaves under more aggressive, non–uniform maneuvers. Under the MPC–PPO controller, the mean and maximum MPC QP solve times are 2.124 ms and 3.351 ms , while the PPO forward pass requires 1.136 ms on average with a worst case of 2.537 ms . When the PRAR–DLC scheme is activated, the computational footprint of both loops is further reduced: the average MPC time drops to 1.813 ms and the PPO forward time to 0.934 ms , with a maximum of 1.572 ms . These results indicate that the PRAR mechanism enables PRAR–DLC to achieve faster MPC solving and PPO inference compared with the MPC–PPO controller. Moreover, the total computation time required for each update of the hierarchical control system remains well below the system sampling period, thereby satisfying real–time execution requirements. Hence, the additional optimization burden introduced by the outer MPC does not violate real–time constraints and the two loops can be scheduled independently in practice.
Although the dual–loop controller combines MPC, PRAR and PPO and is therefore more computationally demanding than classical single–loop PID or backstepping schemes, its algorithmic complexity remains modest from the perspective of a realistic shipborne computer. In each inner–loop period, the controller solves a small-scale quadratic program with three control inputs and a fixed control horizon, performs a few vector operations over the prediction horizon for the PRAR weighting, and executes one forward pass through two fully connected neural network layers of size 512. This amounts to roughly 10^5 to 10^6 basic arithmetic operations per step, several orders of magnitude below the processing capability of typical industrial embedded CPUs or onboard PCs used on USVs.
From an implementation viewpoint, together with the measured runtimes reported above, these timing and capacity margins imply that the hierarchical controller can run reliably on a realistic onboard computing platform while sharing CPU resources with sensing, navigation and communication modules, without saturating the processor, introducing harmful delays, or degrading the closed–loop behavior of the physical USV under practical operating conditions. The additional complexity is a deliberate design choice that trades inexpensive modern computation for improved constraint handling, stability and disturbance rejection; it is therefore not expected to hinder a real shipborne control system, but rather serves as a prerequisite for high–performance trajectory tracking.

5. Conclusions

This study proposes a dual–loop USV trajectory–tracking framework that combines MPC with PPO. The USV dynamics are explicitly decoupled into a kinematic layer and a dynamic layer, for which the outer and inner controllers are designed, respectively. In the outer loop, an MPC planner operating in the inertial reference frame uses a kinematic model and explicit velocity and safety constraints to generate dynamically feasible body–fixed velocity references together with a prediction sequence over the horizon. In the inner loop, a PPO–based execution controller operating in the body–fixed reference frame learns the nonlinear inverse mapping from commanded velocities to multi-thruster forces and performs online compensation for unmodeled hydrodynamics and external disturbances. This design enforces a feasible–first–then–refine behavior, in which the policy is guided to remain within the outer–loop feasible set while progressively improving tracking accuracy. Building on this dual-loop structure, we further design a Proactive–Reactive Adaptive Reward (PRAR) mechanism that fully exploits the outer loop’s prediction sequence in combination with real–time pose–tracking errors: PRAR derives a time–weighted trajectory–complexity measure from the predicted velocities and adaptively reweights the surge, sway and yaw components of the reward, allowing the controller to anticipate upcoming aggressive maneuvers while remaining sensitive to instantaneous deviations.
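To make the reward-shaping idea concrete, the sketch below shows one possible realization of a proactive–reactive weighting driven by the outer-loop prediction sequence and the instantaneous pose error. It reuses the PRAR parameter values from Table 3, but the saturating exponential form, the array shapes and the final shaped reward are illustrative assumptions rather than the exact definitions used in this work.

```python
import numpy as np

# Illustrative sketch only: A_R, ALPHA, A_P, BETA, SIGMA follow Table 3, while the
# functional forms below are assumed (saturating exponentials, velocity-difference
# complexity measure) and do not reproduce the paper's exact PRAR definition.
A_R, ALPHA, A_P, BETA, SIGMA = 36.0, 6.0, 40.0, 4.0, 0.95

def prar_weights(pose_error, predicted_vel):
    """pose_error: per-axis errors (surge, sway, yaw); predicted_vel: (N, 3) MPC plan."""
    # Reactive part: grows with the instantaneous tracking error on each axis.
    reactive = A_R * (1.0 - np.exp(-ALPHA * np.abs(pose_error)))
    # Proactive part: time-discounted measure of how aggressively the predicted
    # body-fixed velocities change over the horizon (trajectory "complexity").
    decay = SIGMA ** np.arange(predicted_vel.shape[0] - 1)
    complexity = (np.abs(np.diff(predicted_vel, axis=0)) * decay[:, None]).sum(axis=0)
    proactive = A_P * (1.0 - np.exp(-BETA * complexity))
    return reactive + proactive        # per-axis weights for surge, sway and yaw

# One possible shaped reward built from these weights:
# w = prar_weights(pose_error, mpc_prediction)
# reward = -float(np.dot(w, pose_error ** 2))
```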
The simulation results on both circular and curvilinear trajectories show that the USV controlled by the proposed framework maintains high–precision tracking under disturbances and model mismatch. Compared with a nominal hydrodynamic MPC–PID baseline, the PRAR–driven dual–loop controller reduces surge MAE and maximum longitudinal error on the curvilinear trajectory from 0.269 m and 0.963 m to 0.138 m and 0.337 m, respectively, and achieves about an 8.5% reduction in surge MAE on the circular trajectory while keeping sway and yaw MAE at a level comparable to the baseline. Across all scenarios, it also significantly suppresses error tails and avoids the large drifts or oscillations observed in PPO–Only, MPC–PPO and other PPO variants, confirming strong robustness. In addition, the proposed PRAR mechanism integrates the outer–loop MPC prediction sequence with real–time pose errors to adaptively reweight the reward, which, compared with fixed–weight PPO, improves generalization capability while keeping the average MPC and PPO computation times in the low–millisecond range, compatible with real–time on–board implementation.
Future work will focus on two directions. First, hardware–in–the–loop and full-scale waterborne experiments will be conducted to validate the controller under unmodeled actuator dynamics, onboard sensor noise, and network–induced latency. Second, to further reduce computation load and facilitate deployment on resource–limited USVs, we will investigate model lightweighting, including reduced–order MPC, neural surrogate models for prediction, and compact policy networks for thrust allocation. Extending the framework to over–actuated USVs and cooperative multi–vessel missions also constitutes a promising direction for the hierarchical architecture introduced in this study.

Author Contributions

Conceptualization, Z.L. and D.L.; methodology, Z.L.; software, Y.C.; validation, Z.L., S.H. and J.W.; formal analysis, Z.L.; investigation, D.L.; resources, D.D.; data curation, Q.Y.; writing—original draft preparation, Z.L.; writing—review and editing, D.L.; visualization, Z.L.; supervision, D.D.; project administration, D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Laboratory Open Fund (Project number: JCKY2024206D003).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The values of the constants used in the simulation are listed in Table A1.
Table A1. Values of the constants.
Parameter | Definition/Value
$x_G$ | center of mass longitudinal coordinate in body-fixed frame: 0
$y_G$ | center of mass transversal coordinate in body-fixed frame: 0
$I_z$ | moment of inertia around z: 4.1
$X_u$ | force in x created by surge speed
$Y_v$ | $0.5\,(40)\,\rho\lvert v\rvert\left(1.1 + 0.0045\,\frac{L}{T} - 0.1\,\frac{B_{\mathrm{hull}}}{T} + 0.016\left(\frac{B_{\mathrm{hull}}}{T}\right)^{2}\right)\pi T L^{2}$
$Y_r$ | $3\pi\rho\sqrt{u^{2}+v^{2}}\,T^{2}L$
$N_v$ | $0.06\,\pi\rho\sqrt{u^{2}+v^{2}}\,T^{2}L$
$N_r$ | $0.02\,\pi\rho\sqrt{u^{2}+v^{2}}\,T^{2}L^{2}$
$X_{\dot{u}}$ | force in x created by surge acceleration: 2.25
$Y_{\dot{v}}$ | force in y created by sway acceleration: 23.13
$Y_{\dot{r}}$ | force in y created by yaw angular acceleration: 1.31
$N_{\dot{v}}$ | moment about z created by sway acceleration: 16.41
$N_{\dot{r}}$ | moment about z created by yaw angular acceleration: 2.79
$X_{u|u|}$ | force in x created by the product of surge velocity and its absolute value
$Y_{v|v|}$ | force in y created by the product of sway velocity and its absolute value: 99.99
$Y_{v|r|}$ | force in y created by the product of sway velocity and the absolute value of yaw rate: 5.49
$Y_{r|v|}$ | force in y created by the product of yaw rate and the absolute value of sway velocity: 5.49
$Y_{r|r|}$ | force in y created by the product of yaw rate and its absolute value: 8.8
$N_{v|v|}$ | moment about z created by the product of sway velocity and its absolute value: 5.49
$N_{v|r|}$ | moment about z created by the product of sway velocity and the absolute value of yaw rate: 8.8
$N_{r|v|}$ | moment about z created by the product of yaw rate and the absolute value of sway velocity: 8.8
$N_{r|r|}$ | moment about z created by the product of yaw rate and its absolute value: 3.44
$\nu_{\max}$ | $[0.8,\ 0.5,\ 1.4]^{T}$
$\nu_{\min}$ | $[0,\ 0,\ 0]^{T}$
$\Delta\nu_{\max}$ | $[0.05,\ 0.05,\ 0.05]^{T}$
$\Delta\nu_{\min}$ | $[0,\ 0,\ 0]^{T}$
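For readers reimplementing the simulation, the minimal sketch below shows one plausible way to assemble the constants of Table 2 and Table A1 into the inertia matrix of a standard three-degree-of-freedom model in the sense of Fossen [28]. The sign conventions (tabulated added-mass derivatives treated as positive magnitudes) and the omission of the Coriolis and damping terms are simplifying assumptions made only for illustration.

```python
import numpy as np

# Mass/inertia constants from Table 2 and Table A1 (x_G = 0 for this vessel).
m, x_g, I_z = 30.0, 0.0, 4.1
X_du, Y_dv, Y_dr, N_dv, N_dr = 2.25, 23.13, 1.31, 16.41, 2.79

M_RB = np.array([[m,   0.0,     0.0],
                 [0.0, m,       m * x_g],
                 [0.0, m * x_g, I_z]])        # rigid-body inertia
M_A = np.array([[X_du, 0.0,  0.0],
                [0.0,  Y_dv, Y_dr],
                [0.0,  N_dv, N_dr]])          # added-mass contribution (assumed signs)
M = M_RB + M_A                                # total inertia in M @ nu_dot + ... = tau

def combined_speed(u, v):
    """Speed term sqrt(u^2 + v^2) used by the velocity-dependent entries of Table A1."""
    return float(np.hypot(u, v))
```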

References

1. Gonzalez-Garcia, A.; Castañeda, H. Guidance and control based on adaptive sliding mode strategy for a USV subject to uncertainties. IEEE J. Ocean. Eng. 2021, 46, 1144–1154.
2. Er, M.J.; Ma, C.; Liu, T.; Gong, H. Intelligent motion control of unmanned surface vehicles: A critical review. Ocean. Eng. 2023, 280, 114562.
3. Xu, W.; Zhang, X.; Miao, Z. Cooperative trajectory tracking control of USV-UAV with non-singular sliding mode surface and RBF neural network. Ocean. Eng. 2025, 337, 121872.
4. Wei, J.; Zhang, J.; Dong, H.; Liu, Z. Prescribed-time trajectory tracking control for underactuated USV with input amplitude and rate constraints. Ocean. Eng. 2025, 326, 120891.
5. Zhu, Y.; Li, Z.; Li, Y.; Yuan, C.; Guo, G. Multi-target tracking and surrounding control for multi-USV system under unknown disturbances based on distributed NMPC. Ocean. Eng. 2025, 338, 121972.
6. Xu, X.; Zhang, X.; Zhang, Z.; Ma, F.; Guan, C.; Steyskal, F. LOS guidance for fuzzy adaptive fault-tolerant path following control of USV with side-slip compensation and actuator fault. IEEE Trans. Transp. Electrif. 2024, 11, 6875–6886.
7. Chen, G.; Wang, W.; Dong, J. Performance-optimize adaptive robust tracking control for USV-UAV heterogeneous systems with uncertainty. IEEE Trans. Veh. Technol. 2025, 74, 7251–7262.
8. Zhang, H.; Zhao, Z.; Wei, Y.; Liu, Y.; Wu, W. A self-tuning variable universe fuzzy PID control framework with hybrid BAS-PSO-SA optimization for unmanned surface vehicles. J. Mar. Sci. Eng. 2025, 13, 558.
9. Cao, X.; Zheng, Y.; Xiao, H.; Xiao, M. Multi-target point path planning and tracking of unmanned surface vehicle using SFLOS algorithm and incremental PID control with crane-assisted. IEEE Access 2025, 13, 44620–44635.
10. Fossen, T.I.; Pettersen, K.Y.; Galeazzi, R. Line-of-sight path following for Dubins paths with adaptive sideslip compensation of drift forces. IEEE Trans. Control Syst. Technol. 2014, 23, 820–827.
11. Chen, G.; Huang, Z.; Wang, W.; Yang, S. A novel dynamically adjusted entropy algorithm for collision avoidance in autonomous ships based on deep reinforcement learning. J. Mar. Sci. Eng. 2024, 12, 1562.
12. Wang, Y.-L.; Wang, C.-C.; Han, Q.-L.; Wang, X. Networked and deep reinforcement learning-based control for autonomous marine vehicles: A survey. IEEE Trans. Syst. Man Cybern. Syst. 2024, 55, 4–17.
13. Xiong, Y.; Wang, S.; Tian, H.; Liu, G.; Shan, Z.; Yin, Y.; Tao, J.; Ye, H.; Tang, Y. Spatiotemporal meta-reinforcement learning for multi-USV adversarial games using a hybrid GAT-Transformer. J. Mar. Sci. Eng. 2025, 13, 1593.
14. Rao, J.; Wang, C.; Liu, M.; Chen, J.; Lei, J.; Giernacki, W. A deep reinforcement learning approach and its application in multi-USV adversarial game simulation. Appl. Intell. 2025, 55, 1–24.
15. Liu, Y.; Chen, C.; Qu, D.; Zhong, Y.; Pu, H.; Luo, J.; Peng, Y.; Xie, J.; Zhou, R. Multi-USV system antidisturbance cooperative searching based on the reinforcement learning method. IEEE J. Ocean. Eng. 2023, 48, 1019–1047.
16. Cui, Z.; Guan, W.; Zhang, X. USV formation navigation decision-making through hybrid deep reinforcement learning using self-attention mechanism. Expert Syst. Appl. 2024, 256, 124906.
17. Zhou, Y.; Gong, C.; Chen, K. Adaptive control scheme for USV trajectory-tracking under complex environmental disturbances via deep reinforcement learning. IEEE Internet Things J. 2025, 11, 15181–15196.
18. Wu, C.; Yu, W.; Liao, W.; Ou, Y. Deep reinforcement learning with intrinsic curiosity module based trajectory tracking control for USV. Ocean. Eng. 2024, 308, 118342.
19. Wen, Y.; Chen, Y.; Guo, X. USV trajectory tracking control based on receding horizon reinforcement learning. Sensors 2024, 24, 2771.
20. Wang, Y.; Cao, J.; Sun, J.; Zou, X.; Sun, C. Path following control for unmanned surface vehicles: A reinforcement learning-based method with experimental validation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 18237–18250.
21. Dai, S.-L.; He, S.; Wang, M.; Yuan, C. Adaptive neural control of underactuated surface vessels with prescribed performance guarantees. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3686–3698.
22. Huang, W.; Cui, Y.; Li, H.; Wu, X. Effective probabilistic neural networks model for model-based reinforcement learning USV. IEEE Trans. Autom. Sci. Eng. 2025, 22, 11625–11641.
23. Cui, Y.; Shi, W.; Yang, H.; Shao, C.; Peng, L.; Li, H. Probabilistic model-based reinforcement learning unmanned surface vehicles using local update sparse spectrum approximation. IEEE Trans. Ind. Inform. 2023, 20, 1283–1293.
24. Qin, H.; Tan, P.; Chen, Z.; Sun, M.; Sun, Q. Deep reinforcement learning based active disturbance rejection control for ship course control. Neurocomputing 2022, 484, 99–108.
25. Zhu, K.; Zhang, G.; Zhu, C.; Niu, Y.; Liu, J. A bi-level optimization strategy for flexible and economic operation of the CHP units based on reinforcement learning and multi-objective MPC. Appl. Energy 2025, 391, 125850.
26. Liao, Y.; Yu, G.; Chen, P.; Zhou, B.; Li, H. Integration of decision-making and motion planning for autonomous driving based on double-layer reinforcement learning framework. IEEE Trans. Veh. Technol. 2023, 73, 3142–3158.
27. Hu, L.; Ding, L.; Yang, H.; Liu, T.; Zhang, A.; Chen, S.; Gao, H.; Xu, P.; Deng, Z. LNO-driven deep RL-MPC: Hierarchical adaptive control architecture for dynamic legged locomotion. IEEE Trans. Ind. Inform. 2025, 21, 8574–8584.
28. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; John Wiley & Sons: Hoboken, NJ, USA, 2011.
29. Gonzalez-Garcia, A.; Barragan-Alcantar, D.; Collado-Gonzalez, I.; Garrido, L. Adaptive dynamic programming and deep reinforcement learning for the control of an unmanned surface vehicle: Experimental results. Control. Eng. Pract. 2021, 111, 104807.
30. Li, W.; Zhang, X. Data-driven model predictive control for underactuated USV path tracking with unknown dynamics. Ocean. Eng. 2025, 333, 121457.
31. Jin, L.; Liu, L.; Wang, X.; Shang, M.; Wang, F.-Y. Physical-informed neural network for MPC-based trajectory tracking of vehicles with noise considered. IEEE Trans. Intell. Veh. 2024, 9, 4493–4503.
32. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
33. Yan, Q.; Wu, X.; Wang, J.; Fortino, G.; Pupo, F.; Yin, M. EGCARL: A PPO-based reinforcement learning method with expert guidance and dynamic rewards for autonomous driving. Inf. Fusion 2025, 126, 103606.
34. Ramezani, M.; Alandihallaj, M.A.; Yalçın, B.C.; Mendez, M.A.O.; Hein, A.M. Fuel-aware autonomous docking using RL-augmented MPC rewards for on-orbit refueling. Acta Astronaut. 2025, 238, 690–705.
35. Wang, X.; Yi, H.; Xu, J.; Xu, C.; Song, L. PID controller based on improved DDPG for trajectory tracking control of USV. J. Mar. Sci. Eng. 2024, 12, 1771.
Figure 1. Inertial and body–fixed reference frames.
Figure 2. Overall control architecture.
Figure 3. Mechanism of PRAR.
Figure 4. Trajectory tracking results.
Figure 5. Surge outputs and errors.
Figure 6. Sway outputs and errors.
Figure 7. Yaw outputs and errors.
Figure 8. Trajectory tracking results.
Figure 9. Surge outputs and errors.
Figure 10. Sway outputs and errors.
Figure 11. Yaw outputs and errors.
Figure 12. MPC–PPO solve time.
Figure 13. PRAR–DLC solve time.
Figure 14. MPC–PPO solve time.
Figure 15. PRAR–DLC solve time.
Table 1. Typical hybrid architectures combining RL and model–based control.
RL Layer | Model-Based/Classical Layer | Limitations
Tuning and supervisory | Fixed–structure at execution layer | RL only adjusts gains; cannot directly handle strong nonlinearities or actuator coupling.
Meta-control (cost weights, horizon, margins) | Predictive/model–based controller for planning and control | Requires accurate models; optimization burden grows with dimension.
Model learning (NN/GP dynamics) | Planner/optimizer based on learned model | Online planning remains expensive; hard to satisfy real–time execution.
High-level decision/planning | Low-level trajectory/thrust controller | Low-level stays model-dependent; weak adaptability to unmodeled multi–thruster dynamics.
Table 2. Basic parameters of controlled vessel.
Parameter | Value
Length overall (L) | 1.01 [m]
Length on the waterline | 0.93 [m]
Draft (T) | 0.09 [m]
Beam overall (B) | 0.75 [m]
Beam on the waterline | 0.61 [m]
Centerline–to–centerline side hull separation | 0.41 [m]
Individual hull beam ($B_{\mathrm{hull}}$) | 0.27 [m]
Length to beam ratio | 2.47
Mass | 30 [kg]
Longitudinal center of gravity (LCG) | 0.45 [m]
Moment of inertia | 4.1 [kg·m²]
Table 3. Parameters of PRAR–PPO.
Parameter | Value
Actor: hidden layers number | 2
Actor: hidden layer units number | 512
Actor: learning rate | 0.0001
Critic: hidden layers number | 2
Critic: hidden layer units number | 512
Critic: learning rate | 0.0001
PRAR: reactive amplitude ($A_r$) | 36
PRAR: reactive sensitivity ($\alpha$) | 6
PRAR: proactive amplitude ($A_p$) | 40
PRAR: proactive sensitivity ($\beta$) | 4
PRAR: temporal decay factor ($\sigma$) | 0.95
Training: sample time | 0.01 [s]
Training: mini-batch size | 256
Training: epochs | 5
Training: clip ratio | 0.1
Training: entropy weight | 0.001
Training: max episodes | 1000
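As a concrete illustration of the network configuration in Table 3, the sketch below instantiates actor and critic networks with two hidden layers of 512 units and a learning rate of 1e-4. The observation/action dimensions, the tanh activation and the use of separate optimizers are assumptions for illustration and are not specified by the table.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, WIDTH = 12, 3, 512   # I/O sizes are assumed; width follows Table 3

def mlp(in_dim, out_dim):
    """Two hidden layers of WIDTH units, matching the actor/critic sizes in Table 3."""
    return nn.Sequential(nn.Linear(in_dim, WIDTH), nn.Tanh(),
                         nn.Linear(WIDTH, WIDTH), nn.Tanh(),
                         nn.Linear(WIDTH, out_dim))

actor = mlp(OBS_DIM, ACT_DIM)          # outputs (mean) thrust commands
critic = mlp(OBS_DIM, 1)               # outputs the state-value estimate
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)

# The remaining PPO settings in Table 3 (clip ratio 0.1, entropy weight 0.001,
# mini-batch size 256, 5 epochs per update) enter the clipped surrogate loss.
```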
Table 4. Error analysis of circular trajectory tracking experiments.
Method | MAE | RMS | Max|e| | P95|e|
Surge error (m):
MPC–PID | 0.899399 | 1.05529 | 1.86294 | 1.83958
PPO–Only | 0.8526259 | 1.551172 | 4.011608 | 3.698421
MPC–PPO | 1.059727 | 1.355035 | 2.475958 | 2.444204
PPO–Pro | 0.7053956 | 0.8555924 | 1.446061 | 1.405843
PPO–Re | 1.263034 | 1.636648 | 2.988347 | 2.949519
PRAR–DLC | 0.8231149 | 1.059953 | 2.45345 | 1.774853
Sway error (m):
MPC–PID | 0.996826 | 1.15033 | 1.98833 | 1.95831
PPO–Only | 1.2557115 | 1.767043 | 4.136898 | 3.715612
MPC–PPO | 1.39755 | 1.681554 | 3.008116 | 2.9757
PPO–Pro | 1.234283 | 1.524139 | 3.199124 | 3.185034
PPO–Re | 1.625424 | 1.98941 | 3.685232 | 3.641728
PRAR–DLC | 1.002083 | 1.306609 | 2.570468 | 2.537281
Yaw error (rad):
MPC–PID | 0.0820923 | 0.0929039 | 0.143103 | 0.13913
PPO–Only | 0.11874111 | 0.2040948 | 0.8473338 | 0.4651466
MPC–PPO | 0.118652 | 0.1478336 | 0.2490282 | 0.2404278
PPO–Pro | 0.2590863 | 0.3067927 | 0.6715411 | 0.6193831
PPO–Re | 0.1501432 | 0.193635 | 0.3369412 | 0.3262109
PRAR–DLC | 0.08196562 | 0.1133658 | 0.2320107 | 0.2174889
Table 5. Ranges of parameters.
Parameter | Range
$v$ | $[9, 11]$ m/s
$A$ | $[8, 12]$ m
$L$ | $[32, 48]$ m
$k = \pi/L$ | $[\pi/48, \pi/32]$
$\varphi$ | $[0, 2\pi)$
$y_0$ | $[-2, 2]$ m
Table 6. Error analysis of curvilinear trajectory tracking experiments.
Method | MAE | RMS | Max|e| | P95|e|
Surge error (m):
MPC–PID | 0.269015 | 0.356648 | 0.963048 | 0.843426
PPO–Only | 0.1882976 | 0.2524888 | 0.6630578 | 0.5575619
MPC–PPO | 2.392861 | 2.570056 | 3.61758 | 3.585265
PPO–Pro | 0.10255073 | 0.1887792 | 0.7251616 | 0.5331297
PPO–Re | 0.09407365 | 0.1518097 | 0.4962796 | 0.4071088
PRAR–DLC | 0.1377696 | 0.1708874 | 0.3371085 | 0.3061041
Sway error (m):
MPC–PID | 0.450823 | 0.888231 | 3.19389 | 2.86015
PPO–Only | 1.305216 | 1.434991 | 3.201511 | 2.497105
MPC–PPO | 1.077909 | 1.352311 | 3.196557 | 3.127173
PPO–Pro | 0.3764644 | 0.8215966 | 3.199087 | 2.643689
PPO–Re | 0.5269082 | 1.0263736 | 3.198586 | 3.109152
PRAR–DLC | 0.5765827 | 1.0281397 | 3.196482 | 3.091985
Yaw error (rad):
MPC–PID | 0.0595616 | 0.130843 | 0.543868 | 0.420101
PPO–Only | 0.10403506 | 0.13738 | 0.5232575 | 0.2328551
MPC–PPO | 0.1934058 | 0.255236 | 0.7189844 | 0.6317281
PPO–Pro | 0.07345435 | 0.1560732 | 0.7342426 | 0.4029403
PPO–Re | 0.07969775 | 0.1547191 | 0.6783097 | 0.411587
PRAR–DLC | 0.07061012 | 0.1456756 | 0.6031172 | 0.4506205
Table 7. Runtime statistics of circular trajectory tracking experiments.
Method | Mean MPC Solve Time (ms) | Max MPC Solve Time (ms) | Mean PPO Forward Time (ms) | Max PPO Forward Time (ms)
MPC–PPO | 2.108 | 3.010 | 1.030 | 4.439
PRAR–DLC | 1.862 | 5.158 | 0.962 | 2.007
Table 8. Runtime statistics of curvilinear trajectory tracking experiments.
Method | Mean MPC Solve Time (ms) | Max MPC Solve Time (ms) | Mean PPO Forward Time (ms) | Max PPO Forward Time (ms)
MPC–PPO | 2.124 | 3.351 | 1.136 | 2.537
PRAR–DLC | 1.813 | 6.244 | 0.934 | 1.572