Direct X-Rudder Path-Following Control for Underactuated AUVs via TIB-CSAC

Tan, Jiehui; Sun, Yushan; Zhang, Liwen; Chai, Puxin; Liu, Zhan

doi:10.3390/jmse14121100

by Jiehui Tan ^1,2, Yushan Sun ^1,2,3,*, Liwen Zhang ^1,2, Puxin Chai ^1,2 and Zhan Liu ^1,2

¹ National Key Laboratory of Autonomous Marine Vehicle Technology, Harbin Engineering University, Harbin 150001, China
² College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China
³ Shandong Future Robot Co., Ltd., Weihai 264209, China
* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(12), 1100; https://doi.org/10.3390/jmse14121100 (registering DOI)

Submission received: 25 April 2026 / Revised: 4 June 2026 / Accepted: 10 June 2026 / Published: 14 June 2026

(This article belongs to the Special Issue Advanced Studies in Marine Vessel Motion Control)


Abstract

To improve the path-following performance of an underactuated autonomous underwater vehicle (AUV) under varying path geometries and desired velocities, this study proposes a direct X-rudder control method based on Task-Informed Inductive-Bias Conservative Soft Actor–Critic (TIB-CSAC). The proposed method directly learns the X-rudder control policy from the path-following information of the current and subsequent path segments in a data-driven way, thereby avoiding the complex design and manual tuning of guidance laws and attitude controllers for rudder command generation. To support such two-segment policy learning, a task-informed inductive-bias encoder is proposed to construct structured and conditioned state representations, thereby improving sample efficiency and overall training quality. In addition, given the long-tail characteristics of task difficulty in agent training, a multi-head conservative value evaluation mechanism is incorporated to mitigate return drawdowns induced by challenging tasks in the tail stage of training and to enhance tail-stage convergence stability. The path-following performance is validated in three representative scenarios with different path pitch, path heading variations, and desired surge velocity conditions. The results show that, compared with the baseline soft actor–critic (SAC) method, TIB-CSAC improves multiple vertical and horizontal error metrics, including maximum absolute error, mean absolute error, tail error, and error threshold exceedance ratio. These results indicate that TIB-CSAC not only improves overall adherence to the reference path, but also more effectively suppresses extreme errors and tail errors, thereby demonstrating stronger path-following robustness and reliability.

Keywords: autonomous underwater vehicles; X-rudder control; path following; deep reinforcement learning; soft actor–critic; intelligent control

1. Introduction

Owing to their simple structure and high maneuvering efficiency, X-rudder autonomous underwater vehicles (AUVs) have become important platforms for a wide range of underwater applications, including environmental monitoring, offshore infrastructure inspection, and marine resource exploration. As underwater missions become increasingly diverse, AUVs are expected to exhibit strong generalization capability and reliable performance under a broad range of operating conditions.

Path-following is a fundamental capability of AUVs. It enables an AUV to move along a reference path while minimizing its position error. In practice, the reference path is commonly composed of multiple straight-line segments, and the path-following task can be characterized by parameters such as the path heading angle, path pitch angle, and desired surge velocity. However, achieving path-following performance that is both robust and strongly generalizable remains challenging. First, different task-parameter settings lead to substantial variations in task difficulty and impose different maneuvering demands. This challenge is particularly pronounced during path-segment transitions, where the AUV must rapidly adjust its motion and reduce tracking errors within a limited spatial range; these transitions are therefore more difficult than straight-segment following. Moreover, the underactuated nature of X-rudder AUVs, nonlinear hydrodynamic effects, and coupling among multiple motion variables further increase the complexity of unified controller design. As the task-generalization range expands, the workload and complexity associated with multi-parameter tuning also increase.

Due to the absence of direct lateral and vertical actuation in underactuated AUVs, conventional path-following methods typically use a guidance law to transform position errors into desired attitudes. These commands are then tracked by a controller, such as prescribed performance control [1], sliding-mode control [2,3], or model predictive control [4]. However, such model-based methods rely on accurate hydrodynamic models, and parameter identification is difficult and costly. Model-free controllers, such as proportional–integral–derivative (PID) controllers [5,6] and S-plane controllers [7,8,9], are easier to implement but usually rely on expert tuning. Moreover, under significant variations in path geometry and desired surge velocity, these hierarchical schemes often require re-tuning of the guidance law or controller [10], which limits their adaptability and generalization capability.

Deep reinforcement learning (DRL), with its self-learning capability and data-driven decision-making mechanism, provides a promising approach to path-following [5,11]. Sun et al. [12] proposed an improved deep deterministic policy gradient method (OSAM-DDPG) that learns a control policy conditioned on attitude and velocity errors. However, when a line-of-sight (LOS) guidance law with fixed parameters is used, its path-following performance under different operating conditions may be limited. Fang et al. [13] applied DDPG to achieve three-degree-of-freedom control at constant speed with variable pitch and yaw, while realizing path-following via waypoint tracking. Wang et al. [14] developed a simplified DDPG method that does not require prediction of future rewards and validated its performance on a helical path and an inclined straight path under a single constant desired speed. Wang et al. [15] proposed a SAC-based adaptive PID controller and verified its attitude-control performance in LOS-guided path-following, although its effectiveness in direct position-error regulation remains to be further investigated. Zhang et al. [16] proposed an improved DDPG-based path-following method integrated with a LOS guidance law and trained the algorithm on two fixed paths under a single constant desired speed. Zhang et al. [17] proposed a deep interactive reinforcement learning method, and Dong et al. [18] developed an improved SAC-based two-dimensional path-following method. In both studies, the agent was trained and evaluated using the same path geometries; therefore, its path-following performance across diverse path geometries remains an open issue.

Despite these advances, current DRL-based path-following studies still have limitations in state construction when strong generalization is required. On the one hand, many DRL methods rely on manually designed guidance laws with fixed parameters, making it difficult to adapt to path-following tasks with different parameter configurations. On the other hand, multiple types of original path-following information, including position errors, path heading/pitch variations, and progress relative to adjacent segments, are usually mapped by the guidance law into a small number of desired attitude commands [19]. If the agent mainly receives these attitude commands as inputs, task-level contextual information may not be explicitly represented in the agent state. Such local and indirect state representations may create a representation bottleneck, limiting the agent’s conditional learning capability across operating conditions and weakening its generalization performance [20,21].

Existing path-following studies also have limitations in agent training. Some studies restrict agent training to basic control tasks involving attitude and velocity errors, and some methods are trained under only a single surge-speed setting or a single path-geometry condition. Such limited coverage of the training-task distribution can easily lead to policy overfitting, thereby degrading path-following performance in other tasks. In domain-randomized training, challenging operating conditions, in which task parameters such as the path heading angle, path pitch angle, and desired surge velocity simultaneously approach their boundary values, occur with relatively low probability. Meanwhile, compared with straight-segment following, path-segment transitions are more difficult and are associated with sparser experience samples. As a result, task difficulty spans a wide range under domain-randomized training and exhibits a pronounced long-tail distribution. However, existing methods such as DDPG and SAC are essentially risk-neutral expected-return optimization methods. Although they may achieve satisfactory average performance under most regular operating conditions, their path-following performance under more challenging tasks may remain limited.

To address the limited generalization capability and robustness of underactuated X-rudder AUVs under diverse path-following conditions, this study proposes a direct X-rudder control method based on Task-Informed Inductive-Bias Conservative Soft Actor–Critic (TIB-CSAC) for AUV path-following. The main contributions are summarized as follows:

(1): A direct X-rudder control framework is proposed. In this framework, the X-rudder control policy is learned directly from two-segment path-following information, avoiding the complex hierarchical design of guidance laws and attitude controllers as well as the associated coupled parameter tuning.
(2): A task-informed inductive-bias state encoder is developed to construct structured and task-conditioned representations of two-segment path-following states for direct policy learning. This design reduces the representation burden caused by direct concatenation of heterogeneous state variables and improves sample efficiency.
(3): A multi-head conservative value-evaluation mechanism is introduced to improve value assessment for high-uncertainty state–action pairs, thereby reducing return drawdowns induced by challenging path-following tasks and improving tail-stage convergence stability.
(4): The proposed method is validated in three representative scenarios with distinct path geometries and desired surge velocities. The results show improvements across multiple error metrics, including the maximum absolute error, mean absolute error, tail error, and threshold-exceedance proportion, demonstrating enhanced generalization capability and path-following performance under diverse operating conditions.

The remainder of this paper is organized as follows. Section 2 presents the problem statement. Section 3 describes the proposed method. Section 4 analyzes and discusses the experimental results, and Section 5 concludes the paper and outlines future work.

2. Problem Statement

2.1. Underactuated AUV Model

In this study, the AUV is equipped with a main thruster and an X-rudder configuration. The numbering and rotation directions of the X-rudders, as well as the Earth-fixed (O-XYZ) and body-fixed (O_B-X_BY_BZ_B) coordinate systems, are shown in Figure 1. The kinematic and dynamic model of the AUV is given by [22]

\dot{η} = J (η) ν

(1)

M \dot{ν} + C (ν) ν + D (ν) ν + g (η) = τ^{a} + τ^{d}

(2)

where J, M, C, D, and g represent the kinematic transformation matrix, inertia matrix, Coriolis–centripetal matrix, hydrodynamic damping matrix, and restoring-force vector, respectively. η = [x, y, z, θ, ψ]^T and ν = [u, v, w, q, r]^T denote the position-attitude vector in the Earth-fixed frame and the velocity vector in the body-fixed frame, respectively.

In practical AUV design, roll motion is usually expected to remain close to zero to ensure sensor and payload stability. This condition can be maintained passively by the hydrostatic restoring moment and hydrodynamic roll damping. Therefore, under the assumptions that gravity and buoyancy are in equilibrium, the center of buoyancy lies in the vertical plane, and sufficient hydrostatic restoring moment is available for near-zero roll motion, the kinematic and dynamic model can be written as [23,24]

\begin{cases} \dot{x} = u \cos (ψ) \cos (θ) - v \sin (ψ) + w \cos (ψ) \sin (θ) \\ \dot{y} = u \sin (ψ) \cos (θ) + v \cos (ψ) + w \sin (ψ) \sin (θ) \\ \dot{z} = - u \sin (θ) + w \cos (θ) \\ \dot{θ} = q \\ \dot{ψ} = r / \cos (θ) \end{cases}

(3)

\begin{cases} \dot{u} = (m_{22} v r - m_{33} w q - X_{u} u - X_{u |u|} u | u | + τ_{u}^{a} + τ_{u}^{d}) / m_{11} \\ \dot{v} = (- m_{11} u r - Y_{v} v - Y_{v |v|} v | v | + τ_{v}^{d}) / m_{22} \\ \dot{w} = (m_{11} u q - Z_{w} w - Z_{w |w|} w | w | + τ_{w}^{d}) / m_{33} \\ \dot{q} = ((m_{33} - m_{11}) u w - M_{q} q - M_{q |q|} q | q | - \bar{BG}_{z} W \sin θ + τ_{q}^{a} + τ_{q}^{d}) / m_{55} \\ \dot{r} = ((m_{11} - m_{22}) u v - N_{r} r - N_{r |r|} r | r | + τ_{r}^{a} + τ_{r}^{d}) / m_{66} \end{cases}

(4)

where W is the AUV weight, and \bar{BG}_{z} is the vertical offset between the centers of mass and buoyancy. m_{(\cdot)} denotes the diagonal inertia terms of the mass matrix M, including added mass. (X_{u}, Y_{v}, Z_{w}, M_{q}, N_{r}) and (X_{u |u|}, Y_{v |v|}, Z_{w |w|}, M_{q |q|}, N_{r |r|}) represent the linear and nonlinear damping coefficients, respectively. τ^{d} = [τ_{u}^{d}, τ_{v}^{d}, τ_{w}^{d}, τ_{q}^{d}, τ_{r}^{d}]^{T} is the environmental disturbance vector. τ^{a} = [τ_{u}^{a}, 0, 0, τ_{q}^{a}, τ_{r}^{a}]^{T} is the actuator input vector, where τ_{u}^{a} is the thruster input, and τ_{q}^{a} and τ_{r}^{a} are the X-rudder inputs for pitch and yaw motion, respectively.
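As an illustration of the kinematic model in Equation (3), the following minimal Python sketch evaluates the right-hand side for a given state. The function names and the forward-Euler integrator are assumptions made for this example and are not taken from the paper.

```python
import math


def kinematics_rhs(eta, nu):
    """Right-hand side of Equation (3) for near-zero roll.

    eta = (x, y, z, theta, psi) in the Earth-fixed frame,
    nu  = (u, v, w, q, r) in the body-fixed frame.
    """
    _, _, _, theta, psi = eta
    u, v, w, q, r = nu
    x_dot = (u * math.cos(psi) * math.cos(theta) - v * math.sin(psi)
             + w * math.cos(psi) * math.sin(theta))
    y_dot = (u * math.sin(psi) * math.cos(theta) + v * math.cos(psi)
             + w * math.sin(psi) * math.sin(theta))
    z_dot = -u * math.sin(theta) + w * math.cos(theta)
    theta_dot = q
    psi_dot = r / math.cos(theta)  # singular when theta = +/- 90 degrees
    return (x_dot, y_dot, z_dot, theta_dot, psi_dot)


def euler_step(eta, nu, dt):
    """One forward-Euler step (an assumed integrator, for illustration only)."""
    return tuple(e + dt * d for e, d in zip(eta, kinematics_rhs(eta, nu)))
```

For example, a pure surge velocity at zero pitch and yaw moves the vehicle along the Earth-fixed X axis only, while a positive pitch angle produces a negative \dot{z}.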

Unlike a conventional cross-rudder configuration, in which the horizontal and vertical rudders mainly generate pitch and yaw moments separately, the X-rudder contains four inclined control surfaces. As shown in Equation (5), each rudder contributes to both τ_{q}^{a} and τ_{r}^{a}; therefore, the pitch and yaw moments are generated by the combined deflections of all four rudders. This configuration improves rudder-utilization efficiency but also introduces a more complex rudder-moment coupling relationship.

\begin{array}{l} τ_{q}^{a} = u^{2} (M_{δ_{1}} δ_{1} + M_{δ_{2}} δ_{2} + M_{δ_{3}} δ_{3} + M_{δ_{4}} δ_{4}) \\ τ_{r}^{a} = u^{2} (N_{δ_{1}} δ_{1} + N_{δ_{2}} δ_{2} + N_{δ_{3}} δ_{3} + N_{δ_{4}} δ_{4}) \end{array}

(5)

where M_{δ_{i}} and N_{δ_{i}} (i = 1, 2, 3, 4) are the hydrodynamic parameters of the i-th rudder.
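The coupling in Equation (5) can be sketched in a few lines of Python. The coefficient values used in the usage note are hypothetical placeholders chosen only to show the sign pattern; the actual rudder coefficients are those of the vehicle in Table 1.

```python
def rudder_moments(u, delta, m_coef, n_coef):
    """Pitch and yaw moments from Equation (5).

    u      : surge velocity
    delta  : four rudder angles (delta_1 .. delta_4)
    m_coef : pitch coefficients M_delta_i (hypothetical values in examples)
    n_coef : yaw coefficients N_delta_i (hypothetical values in examples)
    """
    if not (len(delta) == len(m_coef) == len(n_coef) == 4):
        raise ValueError("expected four rudders")
    tau_q = u ** 2 * sum(m * d for m, d in zip(m_coef, delta))
    tau_r = u ** 2 * sum(n * d for n, d in zip(n_coef, delta))
    return tau_q, tau_r
```

With illustrative unit coefficients m_coef = [1, -1, -1, 1] and n_coef = [-1, -1, 1, 1], the deflection pattern [0.1, -0.1, -0.1, 0.1] yields a pure pitch moment and zero yaw moment, which shows how all four surfaces contribute to one axis. Because both moments scale with u², rudder authority vanishes at zero speed.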

2.2. Path Following Objective

The desired path is represented as a polyline of n^p straight-line segments, where the i-th segment with endpoints P_{i - 1}^{p} and P_{i}^{p} (i = 1, 2, …, n^p) is defined by Equation (6).

P_{i}^{p} = \begin{bmatrix} x_{i}^{p} \\ y_{i}^{p} \\ z_{i}^{p} \end{bmatrix} = P_{i - 1}^{p} + L_{i}^{p} \begin{bmatrix} \cos ψ_{i}^{p} \cos θ_{i}^{p} \\ \sin ψ_{i}^{p} \cos θ_{i}^{p} \\ - \sin θ_{i}^{p} \end{bmatrix}, \quad \begin{cases} |ψ_{i}^{p} - ψ_{i - 1}^{p}| \leq Δ ψ_{\max}^{p} \\ |θ_{i}^{p}| \leq θ_{\max}^{p} \\ L_{i}^{p} \geq L_{\min} \end{cases}

(6)

where Δ ψ_{\max}^{p} denotes the maximum variation between adjacent path headings ψ_{i}^{p}, and ψ_{0}^{p} is set to the initial yaw angle of the AUV. θ_{\max}^{p} represents the maximum absolute value of the path pitch angle θ_{i}^{p}, and L_min is the minimum value of the path-segment length L_{i}^{p}.
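The recursion in Equation (6) can be illustrated with a short Python sketch that builds the waypoints from segment lengths, headings, and pitch angles. The function name is hypothetical, and the sketch does not enforce the constraints on Δψ, θ, and L; it only evaluates the geometric update.

```python
import math


def build_polyline(p0, lengths, headings, pitches):
    """Return waypoints P_0 .. P_n following Equation (6).

    p0       : starting point (x, y, z)
    lengths  : segment lengths L_i
    headings : path heading angles psi_i in radians
    pitches  : path pitch angles theta_i in radians
    """
    points = [tuple(float(c) for c in p0)]
    for length, psi, theta in zip(lengths, headings, pitches):
        x, y, z = points[-1]
        points.append((
            x + length * math.cos(psi) * math.cos(theta),
            y + length * math.sin(psi) * math.cos(theta),
            z - length * math.sin(theta),  # positive pitch reduces z
        ))
    return points
```

For instance, two 10 m segments with headings 0 and π/2 and pitches 0 and π/6 give the waypoints (0, 0, 0), (10, 0, 0), and approximately (10, 8.66, -5).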

The vertical error is defined as the difference between the z-coordinate of the AUV's projection point on the path segment and the z-coordinate of the AUV:

e_{i}^{v} = (z_{i}^{p} - z_{i - 1}^{p}) ε_{i} + z_{i - 1}^{p} - z

(7)

ε_{i} = (\vec{P_{i - 1}^{p} P_{AUV}} \cdot \vec{P_{i - 1}^{p} P_{i}^{p}}) / {|P_{i - 1}^{p} P_{i}^{p}|}^{2}

(8)

where ε_{i} is an along-track parameter based on the AUV’s position P_AUV = [x, y, z]^T relative to the path segment P_{i - 1}^{p} P_{i}^{p}.
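Equations (7) and (8) translate directly into code. The sketch below is a minimal illustration with hypothetical function names; points are plain (x, y, z) tuples.

```python
def along_track(p_prev, p_next, p_auv):
    """Along-track parameter epsilon_i of Equation (8).

    0 corresponds to the segment start, 1 to the segment end.
    """
    seg = [b - a for a, b in zip(p_prev, p_next)]
    rel = [c - a for a, c in zip(p_prev, p_auv)]
    seg_sq = sum(s * s for s in seg)
    if seg_sq == 0.0:
        raise ValueError("degenerate segment of zero length")
    return sum(r * s for r, s in zip(rel, seg)) / seg_sq


def vertical_error(p_prev, p_next, p_auv):
    """Vertical error e_i^v of Equation (7)."""
    eps = along_track(p_prev, p_next, p_auv)
    return (p_next[2] - p_prev[2]) * eps + p_prev[2] - p_auv[2]
```

An AUV located exactly on the segment returns zero vertical error, and a value of ε greater than one indicates that the vehicle has passed the segment endpoint.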

The horizontal error is defined as the horizontal offset with the minimum absolute value among the two adjacent path segments.

e_{i}^{h} = \begin{cases} d_{k}^{h}, \; k = \underset{j \in \{i, i + 1\}}{\arg \min} |d_{j}^{h}|, & \text{if } i < n^{p} \\ d_{n^{p}}^{h}, & \text{if } i = n^{p} \end{cases}

(9)

d_{i}^{h} = \vec{P_{i - 1}^{h} P_{AUV}^{h}} \times \vec{P_{i - 1}^{h} P_{i}^{h}} / |P_{i - 1}^{h} P_{i}^{h}|

(10)

where d_{i}^{h} denotes the signed horizontal distance from the AUV to the i-th path segment; and P_{i}^{h} = [x_{i}^{p}, y_{i}^{p}]^{T} and P_{AUV}^{h} denote the horizontal positions of the i-th path point and the AUV, respectively.
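A minimal Python sketch of Equations (9) and (10) is given below. The function names are hypothetical, and the tie-breaking rule (preferring the current segment when both distances have equal magnitude) is an assumption, since the source does not specify it.

```python
import math


def horizontal_distance(p_prev, p_next, p_auv):
    """Signed horizontal distance d_i^h of Equation (10) (2-D cross product)."""
    ax, ay = p_auv[0] - p_prev[0], p_auv[1] - p_prev[1]
    sx, sy = p_next[0] - p_prev[0], p_next[1] - p_prev[1]
    return (ax * sy - ay * sx) / math.hypot(sx, sy)


def horizontal_error(points, i, p_auv):
    """Horizontal error e_i^h of Equation (9).

    points : waypoints P_0 .. P_n; segment i runs from points[i-1] to points[i]
    i      : 1-based segment index
    """
    n_p = len(points) - 1
    d_cur = horizontal_distance(points[i - 1], points[i], p_auv)
    if i == n_p:
        return d_cur
    d_next = horizontal_distance(points[i], points[i + 1], p_auv)
    # Assumed tie-break: keep the current segment when magnitudes are equal.
    return d_cur if abs(d_cur) <= abs(d_next) else d_next
```

Taking the smaller of the two offsets near a corner avoids a jump in the error signal when the vehicle cuts the turn toward the next segment.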

In summary, this study aims to minimize both the horizontal and vertical position errors, as shown in Equation (11).

\min J = \sum_{t} (|e_{i^{e}}^{h} (t)| + |e_{i^{e}}^{v} (t)|)

(11)

i^{e} \leftarrow i^{e} + 1, if ε_{i^{e}} > 1

(12)

where i^e denotes the path-segment index used to compute the position error, and ε_{i^{e}} is the along-track parameter in Equation (8). The index i^e is initialized to one and is incremented when ε_{i^{e}} > 1, indicating that the AUV has passed the endpoint of the current segment.

3. X-Rudder Path Following Control Based on TIB-CSAC

3.1. Markov Decision Process Design

3.1.1. States

Let i^o denote the index used for state observation. As shown in Equations (13) and (14), i^o is initialized to one and is incremented when the horizontal error relative to the next target segment remains within a prescribed threshold d_{\min}^{h} for a duration t_{\min}^{o}, thereby preserving state continuity during path transitions. To avoid excessive lag behind i^e, i^o is also forcibly incremented when i^{e} - i^{o} > 2. After each update of i^o, the accumulated time t^o is reset to zero.

i^{o} \leftarrow i^{o} + 1, if t^{o} \geq t_{\min}^{o} or i^{e} - i^{o} > 2

(13)

t^{o} = \begin{cases} t^{o} + Δ t, & \text{if } |d_{i^{o} + 1}^{h}| \leq d_{\min}^{h} \\ 0, & \text{otherwise} \end{cases}

(14)
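The update logic of Equations (13) and (14) can be sketched as a single step function. The function name and the evaluation order (timer first, then index) are assumptions made for this illustration.

```python
def update_observation_index(i_o, i_e, t_o, d_next_h, dt, d_min_h, t_min_o):
    """One update of the observation index i^o and its timer t^o.

    i_o, i_e : observation and error segment indices
    t_o      : time accumulated within the threshold
    d_next_h : horizontal distance to segment i_o + 1
    dt       : control time step
    d_min_h  : distance threshold
    t_min_o  : required stabilization time
    """
    # Equation (14): accumulate time while inside the threshold, else reset.
    t_o = t_o + dt if abs(d_next_h) <= d_min_h else 0.0
    # Equation (13): switch after stabilization or when lagging too far.
    if t_o >= t_min_o or i_e - i_o > 2:
        i_o += 1
        t_o = 0.0  # the timer is reset after each index update
    return i_o, t_o
```

Delaying the switch until the vehicle has settled near the next segment keeps the observed segment pair unchanged during the turn itself, so the policy input does not jump at the waypoint.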

The overall observation state vector consists of four components:

s = {[s_{dyn}^{T}, s_{task}^{T}, s_{i^{o}}^{T}, s_{i^{o} + 1}^{T}]}^{T}

(15)

where s_{dyn} denotes the general dynamic state describing the current motion and actuator conditions of the AUV.

s_{dyn} = {[ν^{T}, \dot{ψ}, δ^{T}]}^{T}

(16)

where ν = [u, v, w, q, r]^T denotes the velocity vector, and δ = [δ₁, δ₂, δ₃, δ₄]^T denotes the X-rudder-angle vector.

s_{task} is the task-level state that provides information on the desired surge velocity, path-geometry variations, and path-following progress:

s_{task} = {[u_{d}, Δ ψ_{i^{o}}^{p}, Δ θ_{i^{o}}^{p}, d_{i^{o}}^{p}, {d_{i^{o}}^{p}}^{'}]}^{T}

(17)

where u_{d} is the desired surge velocity; Δ θ_{i^{o}}^{p} = θ_{i^{o} + 1}^{p} - θ_{i^{o}}^{p} and Δ ψ_{i^{o}}^{p} = ψ_{i^{o} + 1}^{p} - ψ_{i^{o}}^{p} denote the changes in path pitch and path heading, respectively. d_{i^{o}}^{p} and {d_{i^{o}}^{p}}^{'} are the distances from the AUV projection points on segments P_{i^{o} - 1}^{p} P_{i^{o}}^{p} and P_{i^{o}}^{p} P_{i^{o} + 1}^{p} to waypoint P_{i^{o}}^{p}, encoding the progress of the AUV along the two adjacent segments.

d_{i^{o}}^{p} = (\vec{P_{AUV} P_{i^{o}}^{p}} \cdot \vec{P_{i^{o} - 1}^{p} P_{i^{o}}^{p}}) / |P_{i^{o} - 1}^{p} P_{i^{o}}^{p}|

(18)

{d_{i^{o}}^{p}}^{'} = (\vec{P_{i^{o}}^{p} P_{AUV}} \cdot \vec{P_{i^{o}}^{p} P_{i^{o} + 1}^{p}}) / |P_{i^{o}}^{p} P_{i^{o} + 1}^{p}|

(19)

s_{i} (i = i^o, i^o + 1) denotes the segment-specific state that describes the relative path-following error for the current or subsequent path segment. It provides preview information for the upcoming transition and reduces abrupt changes in the policy input.

s_{i} = [Δ θ_{i}^{R}, e_{i}^{v}, {\dot{e}}_{i}^{v}, Δ ψ_{i}^{R}, d_{i}^{h}, {\dot{d}}_{i}^{h}]

(20)

where Δ θ_{i}^{R} = θ_{i}^{p} - θ and Δ ψ_{i}^{R} = ψ_{i}^{p} - ψ represent the relative pitch and yaw angles with respect to segment P_{i - 1}^{p} P_{i}^{p}, respectively. e_{i}^{v} and \dot{e}_{i}^{v} are the vertical error defined in Equation (7) and its rate, respectively; and d_{i}^{h} and \dot{d}_{i}^{h} are the horizontal distance defined in Equation (10) and its rate, respectively.

To avoid index overflow when the AUV reaches the final segment, a virtual path point P_{n^{p} + 1}^{p} is appended so that the extended segment P_{n^{p}}^{p} P_{n^{p} + 1}^{p} maintains the same geometry as P_{n^{p} - 1}^{p} P_{n^{p}}^{p}, with L_{n^{p} + 1}^{p} = L_{n^{p}}^{p}, ψ_{n^{p} + 1}^{p} = ψ_{n^{p}}^{p}, and θ_{n^{p} + 1}^{p} = θ_{n^{p}}^{p}.

3.1.2. Actions

The action vector a_{t} = [Δ δ_{t}^{θ}, Δ δ_{t}^{ψ}] \in {[- 1, 1]}^{2} is defined as the normalized pitch and yaw control increments at each time step. These increments are mapped to X-rudder angles according to Equation (21). The X-rudder design enables all four rudders to jointly generate pitch and yaw moments, reducing the risk of single-rudder saturation and increasing the available equivalent moment under the rudder-angle limit.

δ_{t} = [\begin{matrix} δ_{1, t} \\ δ_{2, t} \\ δ_{3, t} \\ δ_{4, t} \end{matrix}] = Δ δ_{\max} [\begin{matrix} + 1 & - 1 \\ - 1 & - 1 \\ - 1 & + 1 \\ + 1 & + 1 \end{matrix}] [\begin{matrix} Δ δ_{t}^{θ} \\ Δ δ_{t}^{ψ} \end{matrix}] + δ_{t - 1}, |δ| \leq δ_{\max}

(21)

where Δδ_max and δ_max denote the maximum rudder-angle change per time step and the maximum rudder angle, respectively.
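The incremental mapping of Equation (21) can be sketched as follows. The function and constant names are hypothetical, and enforcing |δ| ≤ δ_max by clipping each rudder angle is an assumption about how the limit is applied.

```python
# Sign pattern of the allocation matrix in Equation (21):
# rows are rudders 1..4, columns are (pitch increment, yaw increment).
ALLOCATION = ((+1, -1), (-1, -1), (-1, +1), (+1, +1))


def apply_action(delta_prev, action, d_delta_max, delta_max):
    """Map a normalized action to new X-rudder angles.

    delta_prev  : rudder angles at the previous step
    action      : (pitch increment, yaw increment), each in [-1, 1]
    d_delta_max : maximum rudder-angle change per step
    delta_max   : maximum rudder angle
    """
    d_theta, d_psi = (max(-1.0, min(1.0, a)) for a in action)
    delta = []
    for (c_theta, c_psi), prev in zip(ALLOCATION, delta_prev):
        angle = prev + d_delta_max * (c_theta * d_theta + c_psi * d_psi)
        # Assumed saturation: clip to the rudder-angle limit.
        delta.append(max(-delta_max, min(delta_max, angle)))
    return delta
```

A pure pitch action (1, 0) moves all four rudders by the same magnitude with the sign pattern (+, -, -, +), whereas a combined action (1, 1) concentrates the increment on rudders 2 and 4.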

3.1.3. Reward

The reward r_{RL} consists of two position-error penalties and a velocity penalty that prevents the AUV from moving opposite to the tangent vector t^{p} of the desired path. In addition, the reward terms are normalized to reduce the influence of scale differences among components, preventing the learning process from being dominated by a single error term.

r_{R L} = - k_{1}^{r} sat (|e_{i^{e}}^{h} / e_{\max}^{h}|) - k_{2}^{r} sat (|e_{i^{e}}^{v} / e_{\max}^{v}|) + k_{3}^{r} sat (r_{i^{e}}^{v} / r_{\max}^{v})

(22)

r_{i^{e}}^{v} = \min (0, [\dot{x}, \dot{y}, \dot{z}] \cdot t^{p})

(23)

t^{p} = [\cos (ψ_{i^{e}}^{p}) \cos (θ_{i^{e}}^{p}), \sin (ψ_{i^{e}}^{p}) \cos (θ_{i^{e}}^{p}), - \sin (θ_{i^{e}}^{p})]

(24)

where k_{i}^{r} (i = 1, 2, 3) denote the weighting coefficients of the reward terms, and e_{\max}^{h}, e_{\max}^{v}, and r_{\max}^{v} denote the corresponding normalization factors. The saturation function is sat (x) = \min (1, \max (- 1, x)).
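Equations (22) to (24) can be illustrated with the short Python sketch below. The function names and the argument layout are assumptions for this example; the weights and normalization factors are supplied by the caller.

```python
import math


def sat(x):
    """Saturation to [-1, 1]."""
    return min(1.0, max(-1.0, x))


def path_tangent(psi_p, theta_p):
    """Unit tangent t^p of the path segment, Equation (24)."""
    return (math.cos(psi_p) * math.cos(theta_p),
            math.sin(psi_p) * math.cos(theta_p),
            -math.sin(theta_p))


def reward(e_h, e_v, velocity, psi_p, theta_p, k, e_h_max, e_v_max, r_v_max):
    """Reward r_RL of Equation (22).

    velocity : Earth-frame velocity (x_dot, y_dot, z_dot)
    k        : weights (k1, k2, k3)
    """
    t_p = path_tangent(psi_p, theta_p)
    # Equation (23): only motion against the path direction is penalized.
    r_v = min(0.0, sum(a * b for a, b in zip(velocity, t_p)))
    return (-k[0] * sat(abs(e_h / e_h_max))
            - k[1] * sat(abs(e_v / e_v_max))
            + k[2] * sat(r_v / r_v_max))
```

Since r_v is never positive, the third term can only lower the reward, so forward progress is not rewarded in itself and the policy is driven mainly by the two position-error terms.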

3.1.4. Termination

The episode terminates when i^{e} = n^{p} + 1, indicating that the AUV has passed the endpoint of the final path segment, or when the time exceeds the maximum limit t_{\max}^{t}.

d_{t} = \begin{cases} \text{True}, & i^{e} = n^{p} + 1 \\ \text{True}, & t > t_{\max}^{t} \\ \text{False}, & \text{otherwise} \end{cases}

(25)

3.2. Task-Informed Inductive-Bias Conservative SAC Algorithm

3.2.1. Task-Informed Inductive-Bias Encoder

In this study, the input state vector consists of dynamic states, segment-specific path-following states, and task-level contextual states, which play distinct functional roles in decision making. Most DRL methods flatten and concatenate the state variables into a single unstructured vector and process them using a shared multilayer perceptron (MLP). This approach relies on implicit representation learning to recover the latent structure during training. When the state variables are high-dimensional and heterogeneous, interference among irrelevant or weakly relevant features may increase the burden of representation learning and, in turn, limit sample efficiency and overall training quality. To mitigate this limitation, a Task-Informed Inductive-Bias (TIB) encoder is proposed to provide structured and task-conditioned state representations.

The overall framework is illustrated in Figure 2. Specifically, the dynamic state vector is first embedded into a compact latent feature h_{dyn}. Two segment-specific branches with identical architectures but independent parameters then encode the path-following states associated with the current and next path segments, respectively. Conditioned on h_{dyn}, each branch extracts segment-specific features, which are fused into a single representation h_{fuse}.

h_{dyn} = f_{dyn} (s_{dyn}; θ_{dyn})

(26)

h_{cur} = f_{cur} ([h_{dyn}, s_{i^{o}}]; θ_{cur})

(27)

h_{next} = f_{next} ([h_{dyn}, s_{i^{o} + 1}]; θ_{next})

(28)

h_{fuse} = f_{fuse} ([h_{cur}, h_{next}]; θ_{fuse})

(29)

where f_{dyn} (\cdot), f_{cur} (\cdot), f_{next} (\cdot), and f_{fuse} (\cdot) are implemented as MLPs, and θ_{dyn}, θ_{cur}, θ_{next}, and θ_{fuse} denote their parameters.

To adapt encoder features to segment-specific path-following states, a feature-wise linear modulation (FiLM) network g (\cdot) generates scaling and shifting vectors (ω, β) from the task-level state s_{task}. The fused feature h_{fuse} is then modulated element-wise.

[ω, β] = g (s_{task}; θ_{m})

(30)

\tilde{h} = ω ⊙ h_{fuse} + β

(31)

where ω, β \in ℝ^{|h_{fuse}|}; g (\cdot) is implemented as an MLP, and θ_{m} denotes its parameters; and ⊙ denotes element-wise multiplication.
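The FiLM step in Equations (30) and (31) is sketched below in plain Python. The single linear layer used for the generator is a hypothetical stand-in for the MLP g(·) and is included only to show how one output vector is split into ω and β.

```python
def toy_film_generator(s_task, weights, bias):
    """Hypothetical one-layer stand-in for g(.) in Equation (30).

    weights has 2 * |h_fuse| rows; the first half of the output is omega
    (scale) and the second half is beta (shift).
    """
    out = [sum(w * s for w, s in zip(row, s_task)) + b
           for row, b in zip(weights, bias)]
    half = len(out) // 2
    return out[:half], out[half:]


def film_modulate(h_fuse, omega, beta):
    """Element-wise modulation of Equation (31): omega * h_fuse + beta."""
    if not (len(h_fuse) == len(omega) == len(beta)):
        raise ValueError("omega and beta must match the size of h_fuse")
    return [w * h + b for w, h, b in zip(omega, h_fuse, beta)]
```

Because ω and β depend only on s_{task}, the same fused feature is rescaled differently for, say, a slow shallow turn and a fast steep turn, which is how the task context conditions the downstream policy and critic heads.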

In general, the overall TIB mapping can be written compactly as

\tilde{h} = F_{enc} (s; Θ_{enc})

(32)

where Θ_{enc} = \{θ_{dyn}, θ_{cur}, θ_{next}, θ_{fuse}, θ_{m}\} denotes all encoder parameters.

Accordingly, the actor encoder produces a latent feature, which is then fed into the policy network to parameterize a squashed Gaussian policy. The action is subsequently sampled using the reparameterization trick.

{\tilde{h}}^{π} = F_{enc} (s; Θ_{enc}^{π})

(33)

[μ (s), \log σ (s)] = f_{π} ({\tilde{h}}^{π}; θ_{π})

(34)

a = \tanh (μ (s) + σ (s) ⊙ ε)

(35)

where Θ_{enc}^{π} and θ_{π} denote the parameters of the actor encoder and policy network, respectively; ε \sim N (0, I); and μ(s) and σ(s) represent the mean vector and standard deviation vector of the diagonal Gaussian distribution, respectively.

Similarly, the critics adopt the same TIB encoder architecture as the actor but use different parameters to avoid gradient interference between policy and critic updates.

{\tilde{h}}^{Q} = F_{enc} (s; Θ_{enc}^{Q})

(36)

Q_{i} (s, a) = f_{Q_{i}} ([{\tilde{h}}^{Q}, a]; θ_{Q_{i}})

(37)

where i = 1, 2, …, n^Q, and n^Q denotes the number of critic heads. Θ_{enc}^{Q} and θ_{Q_{i}} denote the parameters of the critic encoder and the critic heads, respectively.

3.2.2. Conservative Soft Actor–Critic Algorithm

During path-following training, task difficulty spans a broad range and exhibits a long-tail distribution. Conventional SAC is a risk-neutral optimization method that primarily maximizes the overall expected return and lacks an explicit mechanism for handling rare but challenging operating conditions. When sample coverage is insufficient in these difficult regimes, critic value learning for the corresponding state–action pairs relies more heavily on function approximation with sparse data support, resulting in higher estimation uncertainty and lower reliability. Combined with the actor’s tendency to favor actions with high Q-values, this can produce locally optimistic value estimates and thus limit path-following performance in challenging regions.

To address this issue, this paper introduces a multi-head conservative value-evaluation mechanism. Specifically, n^Q (n^Q ≥ 2) critic heads with independent parameters are built on the shared critic TIB encoder. For any state–action pair (s,a), the mean Q-value μ_Q and the critic disagreement σ_Q are given by

μ_{Q} (s, a) = \frac{1}{n^{Q}} \sum_{i = 1}^{n^{Q}} Q_{i} (s, a)

(38)

σ_{Q} (s, a) = \sqrt{\frac{1}{n^{Q}} \sum_{i = 1}^{n^{Q}} {(Q_{i} (s, a) - μ_{Q} (s, a))}^{2}}

(39)

Policy improvement is guided by μ_Q, which represents the overall Q-value level, and σ_Q, which characterizes estimation uncertainty. This design discourages the policy from favoring actions with high predicted Q-values but large critic disagreement.

Q_{C} (s, a) = μ_{Q} (s, a) - \sqrt{\ln n^{Q}} σ_{Q} (s, a)

(40)

L_{π} = E_{s \sim D, a \sim π (\cdot | s)} [\begin{matrix} α \log π (a | s) - Q_{C} (s, a) \end{matrix}]

(41)

where \sqrt{\ln n^{Q}} denotes the uncertainty-penalty coefficient. As n^Q increases, σ_Q can characterize value-estimation uncertainty more stably, and the penalty strength is therefore moderately increased. This growth is kept sublinear to avoid introducing an overly conservative bias that would impair policy-optimization efficiency.
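The conservative value of Equations (38) to (40) reduces to a few arithmetic steps per state–action pair. The sketch below uses plain Python lists of head outputs; the function name is hypothetical.

```python
import math


def conservative_q(q_values):
    """Conservative value Q_C from the outputs of n^Q critic heads.

    Implements the mean (38), the population standard deviation (39),
    and the penalized value (40) with coefficient sqrt(ln n^Q).
    """
    n_q = len(q_values)
    if n_q < 2:
        raise ValueError("at least two critic heads are required")
    mu_q = sum(q_values) / n_q
    sigma_q = math.sqrt(sum((q - mu_q) ** 2 for q in q_values) / n_q)
    return mu_q - math.sqrt(math.log(n_q)) * sigma_q
```

When all heads agree, σ_Q is zero and Q_C equals the mean. With two heads predicting 0 and 2, the mean is 1 and σ_Q is 1, so Q_C = 1 - \sqrt{\ln 2} ≈ 0.167, which shows how disagreement lowers the value seen by the actor.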

The loss function for the entropy temperature coefficient α is defined as

L_{α} = E_{s \sim D, a \sim π_{ϕ} (\cdot | s)} [- α (\log π (a | s) + H_{tar})]

(42)

where Ɗ denotes the replay buffer, and H_{tar} is the target entropy.

To mitigate value overestimation and error propagation in critic bootstrapping, the temporal-difference (TD) target for the i-th critic head is calculated using the minimum of two randomly selected and distinct target critic heads. Let a_{t + 1} \sim π (\cdot | s_{t + 1}); the TD target is

y_{t, i} = r_{R L, t} + γ (1 - d_{t}) (\min_{j = n_{1}, n_{2}} {\bar{Q}}_{j} (s_{t + 1}, a_{t + 1}) - α \log π (a_{t + 1} ∣ s_{t + 1}))

(43)

where {\bar{Q}}_{j} denotes the Q-value estimated by the j-th target critic; n₁, n₂ ∈ {1, 2, …, n^Q} and n₁ ≠ n₂; γ is the discount factor; and d_t is the episode termination indicator.
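A scalar version of the TD target in Equation (43) is sketched below. The function name and the `rng` parameter are assumptions for this illustration; the target-head outputs and the log-probability of the next action are passed in as plain numbers.

```python
import random


def td_target(reward, done, target_q_next, log_prob_next, alpha, gamma, rng=random):
    """TD target of Equation (43) for one transition.

    target_q_next : target-critic values Q_bar_j(s', a') for all heads
    log_prob_next : log pi(a' | s') of the sampled next action
    rng           : object providing sample(); defaults to the random module
    """
    # Draw two distinct head indices n1 != n2.
    n1, n2 = rng.sample(range(len(target_q_next)), 2)
    soft_value = min(target_q_next[n1], target_q_next[n2]) - alpha * log_prob_next
    return reward + gamma * (1.0 - float(done)) * soft_value
```

With two heads the selection is deterministic. For a reward of 1, head values 3 and 5, log-probability -2, α = 0.1, and γ = 0.9, the target is 1 + 0.9 × (3 + 0.2) = 3.88; for a terminal transition it reduces to the reward alone.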

To preserve critic diversity for uncertainty estimation, each critic head is updated using its own TD regression loss on an independently sampled mini-batch:

L_{i}^{Q} = E_{(s_{t}, a_{t}, r_{R L, t}, s_{t + 1}, d_{t}) \sim B_{i}, B_{i} ~ D} [{(Q_{i} (s_{t}, a_{t}) - y_{t, i})}^{2}]

(44)

where B_{i} denotes the mini-batch independently sampled for the i-th value head from the replay buffer Ɗ.

To ensure consistent state representations, the parameters of the TIB critic encoder are updated using the gradient of the mean TD loss:

L_{enc}^{Q} = \frac{1}{n^{Q}} \sum_{i = 1}^{n^{Q}} L_{i}^{Q}

(45)

The target critic encoder and its heads are updated through soft updates:

{\bar{Θ}}_{enc}^{Q} \leftarrow τ Θ_{enc}^{Q} + (1 - τ) {\bar{Θ}}_{enc}^{Q}

(46)

{\bar{θ}}_{Q_{i}} \leftarrow τ θ_{Q_{i}} + (1 - τ) {\bar{θ}}_{Q_{i}}

(47)
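The soft updates in Equations (46) and (47) are Polyak averages of the online parameters. A minimal sketch over flat parameter lists follows; the function name is hypothetical, and in a deep-learning framework the same rule would be applied tensor by tensor.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_params, target_params)]
```

A small τ makes the target networks change slowly, which stabilizes the bootstrapped target in Equation (43); τ = 1 copies the online parameters outright.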

The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Training procedure

Input: Critic size n^Q; time interval Δt; soft-update rate τ; discount factor γ; batch size n^B; termination flag d_t;
Initialize: actor π (a | s); critics {Q_{i}}_{i = 1}^{n^{Q}}; temperature α; target critics {{\bar{Q}}_{i}}_{i = 1}^{n^{Q}}; replay buffer Ɗ;
1: // Interaction
2: for each environment step do
3: Get current observation s_t;
4: Sample action a_{t} ~ π_{ϕ} (\cdot | s_{t});
5: Execute a_t, obtain r_{RL, t}, s_{t + 1}, d_t;
6: If d_t = true, reset the environment;
7: Store (s_{t}, a_{t}, r_{RL, t}, s_{t + 1}, d_{t}) into Ɗ;
8: // Parameter updates
9: if |Ɗ| > n^B then
10: Sample independent mini-batches {B_{i}}_{i = 1}^{n^{Q}};
11: Sample next action a_{t + 1} ~ π_{ϕ} (\cdot | s_{t + 1}) for each s_{t + 1} in {B_{i}}_{i = 1}^{n^{Q}};
12: Compute TD target using Equation (43);
13: Update critic using Equations (44) and (45);
14: Update actor using Equation (41);
15: Update α using Equation (42);
16: Soft update target critics using Equations (46) and (47);
17: end if
18: end for

4. Experiments

4.1. Basic Settings

To facilitate path-following learning, each training episode consists of a two-segment reference path randomly initialized under the constraints in Equation (6). At the beginning of each episode, the initial AUV velocity is set to zero, and the desired surge speed u_d is randomly sampled from [0.5, 2.0] m/s. This setting covers the acceleration process from zero speed to the desired surge speed, thereby including the initial low-authority phase of the X-rudder in the training distribution. To preserve state continuity and ensure stable policy updates, the observation index i^o is kept fixed at its initial value for most of the episode. Therefore, an episode terminates when either the maximum time limit t_{\max}^{t} is exceeded or the stabilization-time criterion in Equation (13) is satisfied. The observation index i^o is updated only in the latter case, as shown in Figure 3.

The hydrodynamic parameters are adopted from Ref. [25] and are provided in Table 1; the training task settings are summarized in Table 2. The bounded time-varying external disturbances acting on the AUV hydrodynamics are given in Equation (48). The sine and cosine terms approximate low-frequency force/moment disturbances with wave-like periodic characteristics.

\begin{cases} τ_{u}^{d} (t) = \frac{u_{d}}{u_{d, \max}} \left( 5 \sin (k_{1}^{d} t + φ_{1}^{d}) + 5 \cos (k_{1}^{d} t + φ_{2}^{d}) + 5 \cos (k_{1}^{d} t + φ_{3}^{d}) \right) \\ τ_{v}^{d} (t) = \frac{u_{d}}{u_{d, \max}} \left( 0.5 \sin (k_{1}^{d} t + φ_{1}^{d}) + \cos (k_{1}^{d} t + φ_{2}^{d}) + 0.1 \cos (k_{1}^{d} t + φ_{3}^{d}) \right) \\ τ_{w}^{d} (t) = \frac{u_{d}}{u_{d, \max}} \left( 0.5 \sin (k_{1}^{d} t + φ_{1}^{d}) + \cos (k_{1}^{d} t + φ_{2}^{d}) + 0.1 \cos (k_{1}^{d} t + φ_{3}^{d}) \right) \\ τ_{q}^{d} (t) = \frac{u_{d}}{u_{d, \max}} \left( \sin (k_{2}^{d} t + φ_{1}^{d}) + 2 \cos (k_{2}^{d} t + φ_{2}^{d}) + 2 \cos (k_{1}^{d} t + φ_{3}^{d}) \right) \\ τ_{r}^{d} (t) = \frac{u_{d}}{u_{d, \max}} \left( \sin (k_{2}^{d} t + φ_{1}^{d}) + \cos (k_{2}^{d} t + φ_{2}^{d}) + 2 \cos (k_{1}^{d} t + φ_{3}^{d}) \right) \end{cases}

(48)

where

k_{1}^{d}

and

k_{2}^{d}

denote the disturbance angular frequencies, while

φ_{1}^{d}

,

φ_{2}^{d}

, and

φ_{3}^{d}

denote the disturbance phase shifts, which are randomly sampled at the beginning of each episode within the predefined ranges in Table 2. In the subsequent comparative tests, these parameters are fixed to

k_{1}^{d}

= 0.1,

k_{2}^{d} = 0.2

,

φ_{1}^{d}

= 0,

φ_{2}^{d}

= π/4, and

φ_{3}^{d}

= π/6.
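The disturbance model in Equation (48) can be evaluated directly. The following sketch uses the fixed test parameters quoted above; the function name `disturbance`, the table layout, and the value u_{d,max} = 2.0 m/s (taken here as the upper end of the sampled speed range) are assumptions made for illustration.

```python
import math

# Per channel: (sin, cos, cos) amplitudes and the frequency index (1 or 2)
# used by each of the three terms, transcribed from Equation (48).
CHANNELS = {
    "u": ((5.0, 5.0, 5.0), (1, 1, 1)),
    "v": ((0.5, 1.0, 0.1), (1, 1, 1)),
    "w": ((0.5, 1.0, 0.1), (1, 1, 1)),
    "q": ((1.0, 2.0, 2.0), (2, 2, 1)),
    "r": ((1.0, 1.0, 2.0), (2, 2, 1)),
}


def disturbance(t, u_d, u_d_max=2.0, k=(0.1, 0.2),
                phi=(0.0, math.pi / 4, math.pi / 6)):
    """Return the five disturbance channels at time t (seconds).

    u_d_max = 2.0 is an assumed value; k and phi default to the fixed
    comparative-test settings.
    """
    scale = u_d / u_d_max
    out = {}
    for name, (amp, freq) in CHANNELS.items():
        w1, w2, w3 = (k[i - 1] for i in freq)
        out[name] = scale * (
            amp[0] * math.sin(w1 * t + phi[0])
            + amp[1] * math.cos(w2 * t + phi[1])
            + amp[2] * math.cos(w3 * t + phi[2])
        )
    return out


sample = disturbance(t=10.0, u_d=1.5)
```

Because each channel is a sum of three unit-bounded sinusoids, its magnitude cannot exceed the sum of its amplitudes scaled by u_d / u_{d,max}, which gives a simple sanity check on any implementation.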

In this study, the AUV speed is regulated by an incremental S-plane controller, with the thruster parameters and control law provided in Appendix A. X-rudder path-following control is realized using three off-policy DRL methods: SAC [26], DDPG [27], and the proposed method. SAC is selected as the main baseline to directly evaluate the contributions of the proposed modules within the SAC framework. DDPG is included as another representative off-policy continuous-control algorithm, providing a deterministic-policy comparison. A conventional hierarchical model-free benchmark is also included, combining LOS guidance [18], an S-plane controller, and pseudoinverse-based control allocation. The algorithm settings are summarized in Table 3. Shared or comparable hyperparameters and network scales are kept consistent wherever applicable so that the comparison is conducted under unified algorithmic configurations and comparable model capacities.

4.2. Training Results

To mitigate stochastic effects and ensure reliable comparisons, each method was trained independently for 1200 episodes using five different random seeds. The training results are shown in Figure 4, where the solid lines represent the episode-wise mean return

{\bar{R}}_{j}

across five runs, and the shaded regions indicate the corresponding episode-wise standard deviation

σ_{j}

. As illustrated in Figure 4, TIB-CSAC converges faster during training; the magnified view also shows superior late-stage stability, without the sharp declines in episode return observed for DDPG and SAC.

{\bar{R}}_{j} = \frac{1}{K} \sum_{k = 1}^{K} R_{j, k}

(49)

σ_{j} = \sqrt{\frac{1}{K - 1} \sum_{k = 1}^{K} {(R_{j, k} - {\bar{R}}_{j})}^{2}}

(50)

where

R_{j, k}

denotes the episode return at episode j in the k-th run, k = 1, 2,…, K, K = 5.
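Equations (49) and (50) amount to a per-episode mean and sample standard deviation (divisor K − 1) across runs. A minimal sketch follows, assuming the returns are stored as one list per run; the function name `episode_stats` and the toy data are hypothetical.

```python
import statistics


def episode_stats(returns_by_run):
    """Per-episode mean and sample standard deviation across runs.

    returns_by_run[k][j] is the return of episode j in run k.
    statistics.stdev uses the K - 1 divisor, matching Equation (50).
    """
    per_episode = list(zip(*returns_by_run))
    means = [statistics.mean(col) for col in per_episode]
    stds = [statistics.stdev(col) for col in per_episode]
    return means, stds


# Toy data: three runs, two episodes each.
means, stds = episode_stats([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```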

To quantify tail-stage performance, as defined in Equations (51)–(53), let

{\bar{R}}_{tail}

denote the mean episode return over the tail stage of training averaged across runs, and let

σ_{tail}

denote the average standard deviation of episode returns over the tail stage within each run. As detailed in Table 4, compared with baseline SAC and DDPG, the proposed method increases

{\bar{R}}_{tail}

from −2.7927 × 10² and −5.5372 × 10² to −1.7244 × 10² and reduces

σ_{tail}

from 5.1210 × 10² and 4.6465 × 10² to 1.2368 × 10², indicating improved tail-stage training performance and enhanced tail-stage stability.

{\bar{R}}_{tail, k} = \frac{1}{N_{tail}} \sum_{j = n^{eps} - N_{tail} + 1}^{n^{eps}} R_{j, k}

(51)

{\bar{R}}_{tail} = \frac{1}{K} \sum_{k = 1}^{K} {\bar{R}}_{tail, k}

(52)

σ_{tail} = \frac{1}{K} \sum_{k = 1}^{K} \sqrt{\frac{1}{N_{tail} - 1} \sum_{j = n^{eps} - N_{tail} + 1}^{n^{eps}} {(R_{j, k} - {\bar{R}}_{tail, k})}^{2}}

(53)

where N_tail denotes the number of episodes in the tail-stage training and is set to 200.
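The tail-stage metrics in Equations (51)–(53) can be computed per run and then averaged. The sketch below assumes each run is a list of episode returns in training order; the function name `tail_metrics` is illustrative.

```python
import statistics


def tail_metrics(returns_by_run, n_tail=200):
    """Return (mean tail return, mean within-run tail standard deviation).

    For each run, the last n_tail episodes are used. The standard deviation
    is computed within each run (divisor n_tail - 1) and then averaged over
    runs, following Equation (53).
    """
    tail_means, tail_stds = [], []
    for run in returns_by_run:
        tail = run[-n_tail:]
        tail_means.append(statistics.mean(tail))
        tail_stds.append(statistics.stdev(tail))
    return statistics.mean(tail_means), statistics.mean(tail_stds)


# Toy data with a two-episode tail.
r_tail, sigma_tail = tail_metrics(
    [[0.0, 0.0, 1.0, 3.0], [0.0, 0.0, 2.0, 6.0]], n_tail=2
)
```

Note that σ_tail measures fluctuation inside each run rather than spread between runs, so a single run with late return drawdowns raises it even when the run means agree.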

To quantify training efficiency, episode returns are first smoothed using a W-episode moving average, as defined in Equation (54). For the k-th run, TRE_k is the earliest episode t at which the smoothed return satisfies

{\tilde{R}}_{t, k} \geq R_{th}

, as defined in Equation (55), and the overall TRE is obtained by averaging over the K runs according to Equation (56). Here, W = 50 and

R_{th} = - 1 \times 10^{3}

. Moreover, the average area under the training curve is used to quantify overall training quality, as defined in Equation (57).

{\tilde{R}}_{t, k} = \frac{1}{W} \sum_{j = 1}^{W} R_{t - j, k}, t \geq W

(54)

{TRE}_{k} = \min \{ t \in \{W, \dots, n^{eps}\} \mid {\tilde{R}}_{t, k} \geq R_{th} \}

(55)

\bar{TRE} = \frac{1}{K} \sum_{k = 1}^{K} {TRE}_{k}

(56)

\bar{AUC} = \frac{1}{K} \sum_{k = 1}^{K} \int_{1}^{n^{eps}} R_{j, k} d j

(57)
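The efficiency metrics in Equations (54)–(57) can be sketched for a single run as follows. Two conventions are assumptions here: the moving-average window is taken to end at episode t (shifting the index by one gives the other reading of Equation (54)), and the integral in Equation (57) is approximated by the trapezoidal rule with unit episode spacing. The function names are hypothetical.

```python
def smoothed(returns, window):
    """Trailing moving average; entry i corresponds to episode t = window + i."""
    if len(returns) < window:
        return []
    running = sum(returns[:window])
    out = [running / window]
    for i in range(window, len(returns)):
        running += returns[i] - returns[i - window]
        out.append(running / window)
    return out


def tre(returns, window, threshold):
    """Earliest 1-based episode whose smoothed return reaches the threshold."""
    for i, value in enumerate(smoothed(returns, window)):
        if value >= threshold:
            return window + i
    return None  # threshold never reached in this run


def auc(returns):
    """Trapezoidal area under the return curve with unit episode spacing."""
    return sum(0.5 * (a + b) for a, b in zip(returns, returns[1:]))


# Toy run: the return improves from -5 to -1 after three episodes.
run = [-5, -5, -5, -1, -1, -1]
tre_run = tre(run, window=2, threshold=-1)
auc_run = auc(run)
```

The per-run values would then be averaged over the K runs to obtain the reported quantities.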

As detailed in Table 4, TIB-CSAC reaches the near-final performance threshold in fewer training episodes than SAC and DDPG, with

\bar{TRE}

equal to 2.9560 × 10² compared with 7.3320 × 10² and 9.3920 × 10², respectively. It also attains a less negative

\bar{AUC}

, namely −2.6398 × 10⁶ versus −6.5231 × 10⁶ and −9.2165 × 10⁶, indicating improved sample efficiency and overall training quality.

4.3. Path Following Results

4.3.1. Case 1 Boustrophedon Path

A boustrophedon path is used to assess horizontal path-following performance during aggressive orthogonal turns under varying desired speeds u_d. In the horizontal plane, the path consists of 400 m survey lines connected by 100 m short transects. Moreover, u_d is updated after the AUV completes two consecutive turns and then travels 100 m along the path, with u_d taking values of 2.0, 1.7, 1.4, 1.1, 0.8, and 0.5 m/s. The detailed results are shown in Figure 5, Figure 6, Figure 7 and Figure 8.

As shown by the overall trajectories in Figure 5, all methods accomplish the prescribed boustrophedon path-following task. The zoomed-in views in Figure 5b further indicate that the main differences among these methods are concentrated in a few challenging regions, particularly around orthogonal turns and the subsequent transition phases. As shown in Figure 6 and Figure 7, under different velocity conditions, TIB-CSAC achieves tighter adherence to the reference path, with overall position errors remaining within a relatively small range. During orthogonal-turn transitions, despite variations in rudder hydrodynamic effectiveness across velocities, TIB-CSAC consistently maintains lower horizontal overshoot and faster error decay. By contrast, although the baseline methods keep the horizontal error close to zero under most conditions, they tend to exhibit larger overshoot peaks or oscillations and slower error decay around orthogonal turns. Overall, the results of Case 1 show that TIB-CSAC provides more robust horizontal path-following performance across a range of surge velocities, especially in the small subset of challenging orthogonal-turn conditions.

4.3.2. Case 2 Trapezoidal Path

A 3D trapezoidal path is used to validate the transient path-following response under segment-wise simultaneous variations in the desired path heading, path pitch, and surge velocity. The desired surge velocity u_d is set sequentially to [2, 1.875, 1.75, 1.625, 1.5, 1.375, 1.25, 1.125, 1, 0.875, 0.75, 0.625, 0.5] m/s; the path pitch is set to [−0.1222, 0, 0.1047, 0, −0.0873, 0, 0.0698, 0, −0.0524, 0, 0.0349, 0, −0.0175] rad; and the path heading is set to [0, 1.5708, 0, 1.309, 0, 1.0472, 0, 0.7854, 0, 0.5236, 0, 0.2618, 0] rad. The length of each segment is 100 m. The detailed results are shown in Figure 9, Figure 10, Figure 11 and Figure 12.
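The segment list above can be turned into waypoints by accumulating 100 m steps along each heading and pitch. The sketch below is illustrative only: it assumes a north-east-down frame in which positive path pitch means ascending (z decreases), and a start at the origin, neither of which is stated for this case. The function name `build_waypoints` is hypothetical.

```python
import math

HEADING = [0, 1.5708, 0, 1.309, 0, 1.0472, 0, 0.7854, 0, 0.5236, 0, 0.2618, 0]
PITCH = [-0.1222, 0, 0.1047, 0, -0.0873, 0, 0.0698, 0,
         -0.0524, 0, 0.0349, 0, -0.0175]
SEGMENT_LENGTH = 100.0  # metres per segment


def build_waypoints(headings, pitches, length, start=(0.0, 0.0, 0.0)):
    """Accumulate straight segments into a list of (x, y, z) waypoints."""
    points = [start]
    for psi, theta in zip(headings, pitches):
        x, y, z = points[-1]
        dx = length * math.cos(theta) * math.cos(psi)
        dy = length * math.cos(theta) * math.sin(psi)
        dz = -length * math.sin(theta)  # assumed convention: z points down
        points.append((x + dx, y + dy, z + dz))
    return points


waypoints = build_waypoints(HEADING, PITCH, SEGMENT_LENGTH)
```

In degrees, the pitch magnitudes step down from 7° to 1° and the heading changes from 90° to 15°, so the transitions become progressively milder as the desired speed decreases.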

As shown in Figure 9, all methods complete the 3D trapezoidal path-following task, and the main discrepancies arise around path-segment transitions. As indicated by Figure 10 and Figure 11, TIB-CSAC yields smaller overall position errors and more stable transitions between path segments than the baseline methods under simultaneous variations in surge velocity and path geometry. In particular, under higher-velocity conditions, TIB-CSAC more effectively suppresses horizontal error peaks caused by large path-heading changes and mitigates persistent vertical-error oscillations associated with large path-pitch variations. As shown in Figure 12, the X-rudder angles of TIB-CSAC remain smooth and continuous except for brief pulse-like adjustments during path-segment transitions. Overall, Case 2 demonstrates that TIB-CSAC provides a more stable transient response under simultaneous variations in surge velocity and path geometry.

4.3.3. Case 3 Random Path

This section evaluates path-following performance on a 3D random path under time-varying surge-velocity conditions. Table 5 summarizes the waypoint configuration of the reference path, which consists of 20 waypoints in total, with the coordinate origin defined as the path-starting point. The velocity is set to

u_{d} (t) = - 0.75 \cos (0.01 t) + 1.25

. The results are shown in Figure 13, Figure 14, Figure 15 and Figure 16.

As shown in Figure 13, Figure 14 and Figure 15, the main differences in the 3D random path-following results are concentrated in challenging regions associated with higher surge velocities and large path-geometry variations. The baseline methods exhibit different error patterns: DDPG and LOS+SP show larger horizontal overshoot and more pronounced vertical-error oscillations in several regions, whereas SAC exhibits horizontal and vertical error fluctuations in certain intervals, accompanied by sustained oscillations in surge velocity. By contrast, under the combined effects of time-varying speed and random path-geometry variations, TIB-CSAC more effectively mitigates horizontal overshoot and vertical oscillations, demonstrating more robust path-following performance under diverse operating conditions.

4.3.4. Metric Analyses

Four metrics are used to quantitatively evaluate path-following performance: the maximum absolute error (MXAE), mean absolute error (MAE), conditional value at risk at 95% (CVaR₉₅), and threshold-exceedance time ratio V_ε. CVaR₉₅ and V_ε are defined in Equations (58)–(60). Specifically, CVaR₉₅ denotes the conditional mean of the position-error magnitude beyond the 95th percentile of |e(t)|, while V_ε denotes the fraction of time for which |e(t)| exceeds the threshold ε = 0.1 m. Detailed quantitative results are summarized in Table 6 and Table 7.

{CVaR}_{95} = E \left[ \, |e (t)| \;\middle|\; |e (t)| \geq q_{95} \right]

(58)

q_{95} = \inf \{q : \Pr (|e (t)| \leq q) \geq 0.95\}

(59)

V_{ε} = \frac{1}{T} \int_{0}^{T} χ (| e (t) | > ε) d t

(60)

where e(t) denotes the position error, q₉₅ is the 95th percentile of |e(t)|, Pr(⋅) denotes probability, inf{⋅} denotes the infimum, χ(⋅) is the indicator function, and T denotes the task duration.
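For a uniformly sampled error signal, the four metrics reduce to simple sample statistics. The sketch below assumes uniform sampling, so that the time fraction in Equation (60) becomes a sample fraction, and uses the empirical quantile implied by Equation (59); the function name `error_metrics` and the toy data are hypothetical.

```python
import math


def error_metrics(errors, eps=0.1, level=0.95):
    """Return MXAE, MAE, CVaR at the given level, and V_eps for one signal."""
    mags = sorted(abs(e) for e in errors)
    n = len(mags)
    # Smallest sample q with empirical P(|e| <= q) >= level, Equation (59).
    # The small tolerance guards against floating-point round-off in level * n.
    idx = max(math.ceil(level * n - 1e-9) - 1, 0)
    q = mags[idx]
    tail = [m for m in mags if m >= q]
    return {
        "MXAE": mags[-1],
        "MAE": sum(mags) / n,
        "CVaR": sum(tail) / len(tail),
        "V_eps": sum(1 for m in mags if m > eps) / n,
    }


# Toy error trace in metres.
metrics = error_metrics([-0.2, 0.05, 0.0, 0.3, -0.05])
```

CVaR₉₅ averages only the worst 5% of error magnitudes, so it responds to brief transients at path-segment transitions that barely move the MAE.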

Averaged over the three cases, TIB-CSAC yields lower overall error metrics than baseline SAC. In terms of MXAE, the average vertical and horizontal MXAE values of TIB-CSAC are reduced by 1.4645 × 10⁻¹ m and 6.4623 m, respectively, indicating a stronger ability to suppress extreme errors. For MAE, the corresponding reductions are 1.2965 × 10⁻² m and 4.2044 × 10⁻¹ m, indicating closer overall adherence to the reference path. For CVaR₉₅, the corresponding reductions are 3.7903 × 10⁻² m and 3.8225 m, indicating more effective suppression of tail errors under difficult operating conditions. For V_ε, the corresponding reductions are 1.5675 × 10⁻² and 3.5120 × 10⁻¹, indicating a lower proportion of time with threshold-exceeding errors and better maintenance of small-error path-following.

Overall, TIB-CSAC exhibits reliable path-following performance under diverse surge-velocity and path-geometry conditions, with clear advantages in overall path adherence, error-peak suppression, tail-error reduction, and persistence within the prescribed error threshold.

Several supplementary metrics are also introduced to quantitatively evaluate X-rudder control behavior and task-level effort. These metrics include the rudder saturation ratio

R_{δ}^{sat}

, second-order rudder-angle difference index

J_{Δ^{2} δ}

, rudder-rate smoothness index

J_{\dot{δ}}

, speed-weighted rudder-effort proxy

J_{δ, u}

, and AUV energy-consumption proxy

J_{AUV}

, as defined in Equations (61)–(65). Here, N is the number of time steps. The quantitative results are summarized in Table 8.

R_{δ}^{sat} = \frac{1}{4 N} \sum_{i = 1}^{4} \sum_{k = 1}^{N} χ (| δ_{i} (k) | \geq 0.95 δ_{\max})

(61)

J_{Δ^{2} δ} = \frac{1}{4 (N - 2)} \sum_{i = 1}^{4} \sum_{k = 3}^{N} {(δ_{i} (k) - 2 δ_{i} (k - 1) + δ_{i} (k - 2))}^{2}

(62)

J_{\dot{δ}} = \frac{1}{4 (N - 1)} \sum_{i = 1}^{4} \sum_{k = 2}^{N} {((δ_{i} (k) - δ_{i} (k - 1)) / Δ t)}^{2}

(63)

J_{δ, u} = \frac{1}{4 N} \sum_{i = 1}^{4} \sum_{k = 1}^{N} {(u (k) δ_{i} (k))}^{2}

(64)

J_{AUV} = Δ t \sum_{k = 1}^{N} u^{3} (k)

(65)
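Equations (61)–(65) can be evaluated from logged rudder angles and surge speeds. The sketch below assumes the log is stored as one list per rudder plus one list of speeds; the function name `rudder_metrics` and the toy values are illustrative.

```python
def rudder_metrics(delta, u, dt, delta_max):
    """Compute the five control-behavior metrics of Equations (61)-(65).

    delta[i][k]: angle of rudder i at step k (rad); u[k]: surge speed (m/s).
    """
    n = len(u)
    n_rud = len(delta)  # 4 for the X-rudder
    r_sat = sum(abs(d) >= 0.95 * delta_max
                for row in delta for d in row) / (n_rud * n)
    j_d2 = sum((row[k] - 2 * row[k - 1] + row[k - 2]) ** 2
               for row in delta for k in range(2, n)) / (n_rud * (n - 2))
    j_rate = sum(((row[k] - row[k - 1]) / dt) ** 2
                 for row in delta for k in range(1, n)) / (n_rud * (n - 1))
    j_du = sum((u[k] * row[k]) ** 2
               for row in delta for k in range(n)) / (n_rud * n)
    j_auv = dt * sum(speed ** 3 for speed in u)
    return {"R_sat": r_sat, "J_d2": j_d2, "J_rate": j_rate,
            "J_du": j_du, "J_AUV": j_auv}


# Toy log: four rudders ramping linearly at constant speed.
ramp = [0.0, 0.1, 0.2, 0.3]
log = rudder_metrics([ramp] * 4, u=[1.0] * 4, dt=0.5, delta_max=0.3)
```

A linear ramp has a nonzero rate index but a second-difference index of essentially zero, which shows why the two smoothness measures can rank methods differently.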

As shown in Table 8, TIB-CSAC maintains a relatively low rudder saturation ratio in all three cases, although LOS+SP or DDPG achieves lower

R_{δ}^{sat}

values in some cases. For the second-order rudder-angle difference

J_{Δ^{2} δ}

, SAC generally yields smaller values, while TIB-CSAC remains lower than LOS+SP and DDPG. This indicates that the proposed method introduces necessary corrective rudder actions during path transitions without causing severe rudder-angle fluctuations. In terms of the rudder-rate smoothness index

J_{\dot{δ}}

, TIB-CSAC achieves the lowest values in all three cases, suggesting smoother rudder-rate behavior. For the speed-weighted rudder-effort proxy

J_{δ, u}

, TIB-CSAC is not always the lowest, but it is consistently lower than SAC. Moreover, TIB-CSAC achieves the lowest AUV energy-consumption proxy

J_{AUV}

and the shortest task-completion time T in all three cases. These results indicate that the improved path-following performance of TIB-CSAC is not obtained through persistent rudder saturation or excessive task-level effort but through generally reasonable actuator behavior.

4.4. Ablation Study

To further analyze the contribution of each module, this section conducts an ablation study on the full TIB-CSAC method and three comparison variants: the baseline SAC; CSAC, in which the TIB encoder is replaced by fully connected layers of comparable size; and TIB-SAC, which combines the TIB encoder with the baseline SAC. Each method is trained for 1200 episodes using five random seeds.

4.4.1. Quantitative Analysis of the TIB Encoder

Table 9 and Figure 17 quantify the contribution of the TIB encoder through two controlled comparisons: CSAC versus TIB-CSAC and SAC versus TIB-SAC. Compared with TIB-CSAC, the

\bar{TRE}

of CSAC increases from 2.9560 × 10² to 6.4180 × 10², corresponding to 2.17 times that of TIB-CSAC. Meanwhile, the

\bar{AUC}

decreases from −2.6398 × 10⁶ to −5.4341 × 10⁶, corresponding to 2.06 times the absolute magnitude of TIB-CSAC. A consistent trend is also observed in the comparison between TIB-SAC and SAC. Specifically, the

\bar{TRE}

of TIB-SAC decreases from 7.3320 × 10² to 4.3060 × 10², corresponding to 0.59 times that of SAC, while the

\bar{AUC}

increases from −6.5231 × 10⁶ to −3.9544 × 10⁶, corresponding to 0.61 times the absolute magnitude of SAC. In addition, the tail-stage metrics show consistent but secondary improvements after introducing the TIB encoder, with

σ_{tail}

decreasing in both SAC versus TIB-SAC and CSAC versus TIB-CSAC. Overall, these results indicate that the structured and conditioned representations induced by the TIB encoder primarily accelerate policy formation and improve overall training quality.

4.4.2. Quantitative Analysis of the Conservative Value Evaluation Mechanism

The contribution of the conservative value-evaluation mechanism is quantified through two controlled comparisons: TIB-SAC versus TIB-CSAC and SAC versus CSAC. Relative to TIB-CSAC, the

σ_{tail}

of TIB-SAC increases from 1.2368 × 10² to 3.5529 × 10², corresponding to 2.87 times that of TIB-CSAC. A similar effect is observed in the comparison between CSAC and SAC, where the

σ_{tail}

of CSAC decreases from 5.1210 × 10² to 1.3339 × 10², corresponding to 0.26 times that of SAC. Collectively, these results indicate that the multi-head conservative value-evaluation mechanism effectively mitigates tail-stage return fluctuations and reduces return drawdowns induced by challenging tasks, thereby improving tail-stage training stability.

4.4.3. Overall Ablation Discussion

Compared with baseline SAC, TIB-CSAC reduces

\bar{TRE}

and

σ_{tail}

to 0.40 and 0.24 times those of SAC, respectively, while yielding less negative

\bar{AUC}

and

{\bar{R}}_{tail}

values whose absolute magnitudes are reduced to 0.40 and 0.62 times those of SAC, respectively. These results indicate that combining the TIB encoder with the conservative value-evaluation mechanism enables TIB-CSAC to achieve faster convergence, higher overall training quality, and better tail-stage performance.
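As a quick arithmetic check, these four ratios follow directly from the Table 4 values quoted in Section 4.2, with TIB-CSAC in the numerator and SAC in the denominator:

\frac{2.9560 \times 10^{2}}{7.3320 \times 10^{2}} \approx 0.40, \quad \frac{1.2368 \times 10^{2}}{5.1210 \times 10^{2}} \approx 0.24, \quad \frac{2.6398 \times 10^{6}}{6.5231 \times 10^{6}} \approx 0.40, \quad \frac{1.7244 \times 10^{2}}{2.7927 \times 10^{2}} \approx 0.62

The four quotients correspond, in order, to \bar{TRE}, σ_{tail}, |\bar{AUC}|, and |{\bar{R}}_{tail}|.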

4.5. Sensitivity Study

4.5.1. Sensitivity Analysis on Critic Size

To further investigate the influence of a key hyperparameter on the performance of the proposed algorithm, this section conducts a sensitivity study on the critic size n^Q used in the multi-head conservative value evaluation mechanism. For each parameter setting, training is performed for 1200 episodes using five random seeds.

As shown in Figure 18 and Table 10, when n^Q = 2, the return curves remain in a low-return region with severe fluctuations, indicating that an overly small n^Q is insufficient to provide reliable conservative value evaluation for effective policy learning. With n^Q = 4, the return curves begin to enter the high-return region, but noticeable return drawdowns still persist after convergence, suggesting that value-evaluation robustness remains inadequate. Increasing n^Q from 4 to 6 effectively improves tail-stage stability, with

σ_{tail}

decreasing from 5.3876 × 10² to 1.2368 × 10², corresponding to 0.23 times that of the n^Q = 4 setting. By contrast, further increasing n^Q from 6 to 8 yields only a modest additional reduction in

σ_{tail}

, from 1.2368 × 10² to 1.0508 × 10², corresponding to 0.85 times that of the n^Q = 6 setting, while convergence becomes slower, with

\bar{TRE}

increasing from 2.9560 × 10² to 3.3480 × 10², corresponding to 1.13 times that of the n^Q = 6 setting. This trend suggests that an overly large n^Q may further suppress tail-stage volatility, but the associated increase in conservatism may also make policy improvement more cautious, thereby reducing sample efficiency. Therefore, under the algorithmic setting adopted in this study, n^Q = 6 provides a favorable trade-off between convergence speed and tail-stage performance.

4.5.2. Sensitivity Analysis on Reward Weights

To further evaluate the influence of reward weights on path-following performance, a sensitivity analysis was conducted for k_{r,1}, k_{r,2}, and k_{r,3}. In addition to the default setting k_{r,1} = k_{r,2} = k_{r,3} = 5, three reward-weight combinations were tested in Cases 1–3: [k_{r,1}, k_{r,2}, k_{r,3}] = [7.50, 3.75, 3.75], [3.75, 7.50, 3.75], and [3.75, 3.75, 7.50]. These settings increase the relative weights of the horizontal-error penalty, vertical-error penalty, and reverse-motion penalty, respectively. The averaged metric results over Cases 1–3 are shown in Figure 19, and the detailed metrics for each case are provided in Table 11 and Table 12.

As shown in Figure 19, the relative allocation of reward weights affects the emphasis of path-following performance. Relative to the default setting, increasing the relative weight of k_{r,1} improves most horizontal error metrics, especially by reducing the horizontal threshold-exceedance ratio V_ε and horizontal tail error CVaR₉₅. Increasing the relative weight of k_{r,2} reduces the vertical V_ε, CVaR₉₅, and MXAE compared with the default setting, but it also increases the horizontal error metrics, indicating that emphasizing vertical-error suppression may weaken horizontal path-following performance. After k_{r,3} is increased, the horizontal and vertical error metrics remain generally close to those of the default setting. This occurs because the reverse-motion penalty mainly serves as an auxiliary constraint: when the AUV rarely tends to move opposite to the path direction, increasing k_{r,3} has a relatively limited effect on the final path-following error metrics.

Overall, although the setting k_{r,1} = k_{r,2} = k_{r,3} = 5 does not achieve the best value for every individual metric, it provides a more balanced performance between horizontal and vertical path-following errors and avoids significant degradation in either error category. Therefore, this weight setting is adopted as the default reward-function configuration.

4.6. Stochastic Disturbance

To further evaluate the robustness of the proposed controller under stochastic environmental disturbances, an additional test is conducted using a first-order Gauss–Markov disturbance model [28]. The test is performed on a 3D random path with 20 waypoints, where the coordinate origin is defined as the path-starting point. The desired surge velocity is set to

u_{d} (t) = - 0.75 \cos (0.01 t) + 1.25

. The waypoint configuration is summarized in Table 13, and the results are shown in Figure 20, Figure 21 and Figure 22 and in Table 14 and Table 15.

\begin{array}{l} τ_{k}^{d} = clip (τ_{k}^{d}, - A, A) \\ τ_{k + 1}^{d} = ρ τ_{k}^{d} + σ_{d} \sqrt{1 - ρ^{2}} ε_{k} \\ ρ = \exp (- Δ t / T_{c}) \end{array}

(66)

where

τ^{d} = {[τ_{u}^{d}, τ_{v}^{d}, τ_{w}^{d}, τ_{q}^{d}, τ_{r}^{d}]}^{T}

; A = [15.0, 1.6, 1.6, 5.0, 4.0]^T denotes the theoretical upper bound of each disturbance channel in Equation (48);

ε_{k} ~ N (0, I_{5})

;

σ_{d} = A / 3

; and T_c = 30 s.
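The recursion in Equation (66) can be simulated in a few lines. The sketch below follows the printed order (clip, then propagate) and uses the stated A, σ_d = A / 3, and T_c = 30 s; the zero initial state, the step size Δt = 0.1 s, and the function name `simulate_disturbance` are assumptions for illustration.

```python
import math
import random

A = [15.0, 1.6, 1.6, 5.0, 4.0]  # per-channel bounds for [u, v, w, q, r]


def simulate_disturbance(n_steps, dt, t_c=30.0, seed=0):
    """Generate a clipped first-order Gauss-Markov disturbance sequence."""
    rng = random.Random(seed)
    rho = math.exp(-dt / t_c)
    gain = math.sqrt(1.0 - rho ** 2)
    sigma = [a / 3.0 for a in A]
    tau = [0.0] * len(A)  # assumed zero initial disturbance
    history = []
    for _ in range(n_steps):
        # Clip the current value to [-A, A] before it is applied.
        tau = [max(-a, min(a, x)) for x, a in zip(tau, A)]
        history.append(tau)
        # Propagate with independent standard normal noise per channel.
        tau = [rho * x + s * gain * rng.gauss(0.0, 1.0)
               for x, s in zip(tau, sigma)]
    return history


trace = simulate_disturbance(n_steps=500, dt=0.1)
```

The factor \sqrt{1 - ρ^{2}} keeps the stationary standard deviation of the unclipped process equal to σ_d, so with σ_d = A / 3 the clipping is active only on roughly three-sigma excursions.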

As shown in Figure 20, all methods complete the 3D path-following task under stochastic disturbances. The error curves in Figure 22 show that the main error fluctuations are concentrated around path-segment transitions and high-curvature regions. Under stochastic disturbances, TIB-CSAC suppresses both vertical and horizontal error fluctuations more effectively than the baseline methods. As summarized in Table 14, TIB-CSAC achieves the lowest MXAE, MAE, CVaR₉₅, and V_ε, indicating better suppression of peak errors, average errors, tail errors, and threshold-exceeding errors under stochastic environmental disturbances. The supplementary control-related results in Table 15 further show that this improvement is not accompanied by excessive rudder saturation or unreasonable task-level effort. Therefore, the Gauss–Markov disturbance test provides further evidence of the stochastic-disturbance robustness of the proposed method.

5. Conclusions

This study proposes a TIB-CSAC method to improve the X-rudder path-following performance of an underactuated AUV under varying path geometries and desired surge velocities. The proposed method learns the X-rudder control policy directly from two-segment path-following information, thereby avoiding the hierarchical guidance-control architecture and the associated coupled parameter tuning. In addition, a task-informed inductive-bias encoder is introduced to support segment-wise policy learning, improving sample efficiency and overall training quality. A multi-head conservative value-evaluation mechanism is also incorporated to mitigate return drawdowns induced by challenging path-following tasks and to improve tail-stage convergence stability. Overall, compared with baseline SAC, TIB-CSAC reduces

\bar{TRE}

and

σ_{tail}

to 0.40 and 0.24 times those of SAC, respectively, while yielding less negative

\bar{AUC}

and

{\bar{R}}_{tail}

values whose absolute magnitudes are reduced to 0.40 and 0.62 times those of SAC, respectively.

The path-following performance of the proposed method is validated in three representative scenarios with distinct path geometries and desired surge velocities. The results show that TIB-CSAC improves multiple position-error metrics compared with baseline SAC. Averaged over the three cases, the vertical and horizontal MXAE values are reduced by 1.4645 × 10⁻¹ m and 6.4623 m, respectively. The corresponding reductions in MAE are 1.2965 × 10⁻² m and 4.2044 × 10⁻¹ m, while those in CVaR₉₅ are 3.7903 × 10⁻² m and 3.8225 m. In addition, the vertical and horizontal V_ε values are reduced by 1.5675 × 10⁻² and 3.5120 × 10⁻¹, respectively. These results indicate that TIB-CSAC not only improves overall adherence to the reference path but also more effectively suppresses extreme position errors, reduces tail errors, and maintains path-following within the prescribed error threshold, thereby demonstrating improved robustness and reliability.

The current results are mainly obtained through simulation-based evaluation. To reduce the mismatch between the simulator and the real environment, sim-to-real transfer methods provide a promising direction for adapting DRL policies trained in simulation to real AUV systems with limited real-world data [29,30]. For example, Wu et al. improved the real-world deployment of a simulation-trained navigation policy using human-guided interventions and demonstrations [31], while Hsu et al. fine-tuned a simulation-trained policy in a safety-constrained laboratory environment before deploying it on a quadrupedal robot for real-world visual navigation [32]. Therefore, future work will focus on software-system integration, real-world data collection, sim-to-real policy transfer learning, and open-water experiments to further evaluate the proposed method. In addition, the adopted 5-DOF model may be insufficient when rudder faults or asymmetric rudder effectiveness occur. Future work will also investigate fault-tolerant X-rudder control in a full 6-DOF framework by further considering roll dynamics and roll control.

Author Contributions

Conceptualization, J.T.; Methodology, J.T.; Software, L.Z.; Formal analysis, P.C.; Investigation, Z.L.; Resources, Y.S.; Writing—original draft, J.T.; Writing—review & editing, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the Natural Science Foundation of Heilongjiang Province of China under Grant No. ZD2020E005. Additional support was provided by the Special Funds of the Taishan Industrial Leading Talents Project.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Conflicts of Interest

Author Yushan Sun was employed by the company Shandong Future Robot Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

The actuator input of the thruster is modeled as

τ_{u}^{a} (n) = ρ n^{2} D^{4} \left( K_{T p 1} \cdot \frac{u}{n D} + K_{T p 2} \right)

(A1)

where the seawater density is set to ρ = 1025 kg/m³, the propeller diameter is D = 0.25 m, and the linear coefficients are

K_{T p 1}

= −0.3077 and

K_{T p 2}

= 0.220.

The thruster rotational speed n_t is generated by an incremental S-plane controller:

n_{t} = {\bar{n}}_{t} \cdot n_{\max}

(A2)

{\bar{n}}_{t} = sat ({\bar{n}}_{t - 1} + η_{n} \cdot S (e_{u}, {\dot{e}}_{u}))

(A3)

e_{u} = sat (\frac{u_{d} - u}{Δ u_{\max}}), {\dot{e}}_{u} = sat (\frac{{\dot{u}}_{d} - \dot{u}}{Δ {\dot{u}}_{\max}})

(A4)

S (e, \dot{e}) = \frac{2}{1 + \exp (- k_{1}^{u} e - k_{2}^{u} \dot{e})} - 1

(A5)

where

{\bar{n}}_{t}

is the normalized control output. The maximum rotational speed is set to n_max = 1800 rpm, the update rate is

η_{n}

= 0.08, the error normalization factors are

Δ u_{\max}

= 2.0 and

Δ {\dot{u}}_{\max}

= 4.0, and the S-plane controller factors are

k_{1}^{u} = 2.0

and

k_{2}^{u} = 1.8

.
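Equations (A1)–(A5) can be combined into a single speed-control step. The sketch below uses the parameter values listed above; the saturation limits of [−1, 1], the conversion of n_max from rpm to revolutions per second inside the thrust model, and all function names are assumptions made for illustration.

```python
import math

RHO, D = 1025.0, 0.25            # seawater density (kg/m^3), propeller diameter (m)
K_TP1, K_TP2 = -0.3077, 0.220    # linear thrust coefficients
N_MAX_RPM = 1800.0
ETA_N = 0.08                     # update rate
DU_MAX, DUDOT_MAX = 2.0, 4.0     # error normalization factors
K1_U, K2_U = 2.0, 1.8            # S-plane factors


def sat(x, lo=-1.0, hi=1.0):
    """Clamp x to [lo, hi]; the limits are assumed."""
    return max(lo, min(hi, x))


def s_plane(e, e_dot):
    """Sigmoid surface of Equation (A5), bounded in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-K1_U * e - K2_U * e_dot)) - 1.0


def speed_step(n_bar_prev, u_d, u, u_d_dot=0.0, u_dot=0.0):
    """One incremental update of the normalized command, Equations (A3)-(A4)."""
    e = sat((u_d - u) / DU_MAX)
    e_dot = sat((u_d_dot - u_dot) / DUDOT_MAX)
    return sat(n_bar_prev + ETA_N * s_plane(e, e_dot))


def thrust(n_bar, u):
    """Thrust from Equation (A1), with n converted to rev/s (assumed units)."""
    n = n_bar * N_MAX_RPM / 60.0
    if n == 0.0:
        return 0.0
    return RHO * n ** 2 * D ** 4 * (K_TP1 * u / (n * D) + K_TP2)


# Starting from rest with a 2 m/s target, the command rises by a bounded increment.
n_bar = speed_step(0.0, u_d=2.0, u=0.0)
```

Because the S-plane output is bounded in (−1, 1), each update changes the normalized command by less than η_n, which limits how fast the thruster command can ramp.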

References

Shen, Z.; Wang, Q.; Dong, S.; Yu, H. Prescribed performance dynamic surface control for trajectory-tracking of unmanned surface vessel with input saturation. Appl. Ocean Res. 2021, 113, 102736.
He, L.; Zhang, Y.; Li, S.; Li, B.; Yuan, Z. Three-Dimensional Path Following Control for Underactuated AUV Based on Ocean Current Observer. Drones 2024, 8, 672.
Zheng, J.; Song, L.; Liu, L.; Yu, W.; Wang, Y.; Chen, C. Fixed-time sliding mode tracking control for autonomous underwater vehicles. Appl. Ocean Res. 2021, 117, 102928.
Rout, R.; Subudhi, B. Design of Line-of-Sight Guidance Law and a Constrained Optimal Controller for an Autonomous Underwater Vehicle. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 416–420.
Wang, L.; Li, S.; Liu, J.; Wu, Q. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning. Appl. Ocean Res. 2022, 124, 103207.
S., P.; Rajendran, S. A unified seakeeping and manoeuvring model with a PID controller for path following of a KVLCC2 tanker in regular waves. Appl. Ocean Res. 2021, 116, 102860.
He, Y.; Xie, Y.; Pan, G.; Cao, Y.; Huang, Q.; Ma, S.; Zhang, D.; Cao, Y. Depth and Heading Control of a Manta Robot Based on S-Plane Control. J. Mar. Sci. Eng. 2022, 10, 1698.
Xu, F.; Zhang, L.; Zhong, J. Three-Dimensional Path Tracking of Over-Actuated AUVs Based on MPC and Variable Universe S-Plane Algorithms. J. Mar. Sci. Eng. 2024, 12, 418.
Jiang, C.; Lv, J.; Wan, L.; Wang, J.; He, B.; Wu, G. An Improved S-Plane Controller for High-Speed Multi-Purpose AUVs with Situational Static Loads. J. Mar. Sci. Eng. 2023, 11, 646.
He, L.; Xie, M.; Zhang, Y. A Review of Path Following, Trajectory Tracking, and Formation Control for Autonomous Underwater Vehicles. Drones 2025, 9, 286.
Wang, D.; Shen, Y.; Wan, J.; Sha, Q.; Li, G.; Chen, G.; He, B. Sliding mode heading control for AUV based on continuous hybrid model-free and model-based reinforcement learning. Appl. Ocean Res. 2022, 118, 102960.
Sun, Y.; Ran, X.; Zhang, G.; Wang, X.; Xu, H. AUV path following controlled by modified Deep Deterministic Policy Gradient. Ocean Eng. 2020, 210, 107360.
Fang, Y.; Huang, Z.; Pu, J.; Zhang, J. AUV position tracking and trajectory control based on fast-deployed deep reinforcement learning method. Ocean Eng. 2022, 245, 110452.
Wang, Z.; Li, Y.; Ma, C.; Yan, X.; Jiang, D. Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning. Ocean Eng. 2023, 268, 113407.
Wang, Y.; Hou, Y.; Lai, Z.; Cao, L.; Hong, W.; Wu, D. An adaptive PID controller for path following of autonomous underwater vehicle based on Soft Actor–Critic. Ocean Eng. 2024, 307, 118171.
Zhang, C.; Cheng, P.; Du, B.; Dong, B.; Zhang, W. AUV path tracking with real-time obstacle avoidance via reinforcement learning under adaptive constraints. Ocean Eng. 2022, 256, 111453.
Zhang, Q.; Lin, J.; Sha, Q.; He, B.; Li, G. Deep Interactive Reinforcement Learning for Path Following of Autonomous Underwater Vehicle. IEEE Access 2020, 8, 24258–24268.
Dong, N.; Liu, S.; Ip, A.W.H.; Yung, K.L.; Gao, Z.; Juan, R.; Wang, Y. End-to-end autonomous underwater vehicle path following control method based on improved soft actor–critic for deep space exploration. J. Ind. Inf. Integr. 2025, 45, 100792.
Gu, N.; Wang, D.; Peng, Z.; Wang, J.; Han, Q.-L. Advances in line-of-sight guidance for path following of autonomous marine vehicles: An overview. IEEE Trans. Syst. Man. Cybern. Syst. 2022, 53, 12–28.
Fan, J.; Li, W. DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Baltimore, MD, USA, 17–23 July 2022; pp. 6074–6102.
Schaul, T.; Horgan, D.; Gregor, K.; Silver, D. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Lille, France, 6–11 July 2015; pp. 1312–1320.
Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; Wiley: Hoboken, NJ, USA, 2011.
Chai, P.; Sun, Y.; Wu, W.; Zhou, H.; Gao, F. AUV preset performance approximate dynamic programming path following control based on fixed time guidance law. Ocean Eng. 2025, 321, 120425.
Zhang, L.; Sun, Y.; Chai, P.; Tan, J.; Zheng, H. Prescribed-performance time-delay compensation control for UUV trajectory tracking in main-branch water conveyance tunnel transitions under unknown input delays. Ocean Eng. 2025, 342, 122941.
Pettersen, K.Y.; Egeland, O. Time-varying exponential stabilization of the position and attitude of an underactuated autonomous underwater vehicle. IEEE Trans. Autom. Control. 1999, 44, 112–115.
Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905.
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
An, X.; Xing, H.; Li, H.; Shi, H.; Gu, Y. Construction and verification of the autonomous underwater helicopter’s digital twin system for missions simulation. Ocean Eng. 2025, 341, 122641. [Google Scholar] [CrossRef]
Ju, H.; Juan, R.; Gomez, R.; Nakamura, K.; Li, G. Transferring policy of deep reinforcement learning from simulation to reality for robotics. Nat. Mach. Intell. 2022, 4, 1077–1087. [Google Scholar] [CrossRef]
Zhu, W.; Guo, X.; Owaki, D.; Kutsuzawa, K.; Hayashibe, M. A Survey of Sim-to-Real Transfer Techniques Applied to Reinforcement Learning for Bioinspired Robots. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3444–3459. [Google Scholar] [CrossRef] [PubMed]
Wu, J.; Zhou, Y.; Yang, H.; Huang, Z.; Lv, C. Human-Guided Reinforcement Learning With Sim-to-Real Transfer for Autonomous Navigation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14745–14759. [Google Scholar] [CrossRef]
Hsu, K.-C.; Ren, A.Z.; Nguyen, D.P.; Majumdar, A.; Fisac, J.F. Sim-to-Lab-to-Real: Safe reinforcement learning with shielding and generalization guarantees. Artif. Intell. 2023, 314, 103811. [Google Scholar] [CrossRef]

Figure 1. Coordinate system.

Figure 2. The overall architecture of the TIB-CSAC algorithm.

Figure 3. Schematic diagram of the training process.

Figure 4. Episode return curves in the comparative study.

Figure 5. Trajectories in Case 1 (the dashed line with arrows indicates the path direction).

Figure 6. Path-following errors in Case 1.

Figure 7. Surge velocities in Case 1.

Figure 8. TIB-CSAC rudder angles in Case 1.

Figure 9. Trajectories in Case 2 (the dashed line with arrows indicates the path direction).

Figure 10. Path-following errors in Case 2.

Figure 11. Surge velocity in Case 2.

Figure 12. TIB-CSAC rudder angles in Case 2.

Figure 13. Trajectories in Case 3 (the dashed line with arrows indicates the path direction).

Figure 14. Path-following errors in Case 3.

Figure 15. Surge velocity in Case 3.

Figure 16. TIB-CSAC rudder angles in Case 3.

Figure 17. Episode return curves in the ablation study.

Figure 18. Episode return curves in the sensitivity study.

Figure 19. Average path-following metrics of Cases 1–3.

Figure 20. 3D trajectories.

Figure 21. Surge velocity.

Figure 22. Path-following error.

Table 1. Hydrodynamic parameters.

Parameter	Value	Parameter	Value
Inertia terms	$(m_{11}, m_{22}, m_{33}) = (215, 265, 265) kg$ $(m_{55}, m_{66}) = 80 kg \cdot m^{2}$	Rudder rate	Δδ_max = 0.3491 rad/s
Linear damping coefficients	$(X_{u}, Y_{v}, Z_{w}) = (70, 100, 100) kg \cdot m^{2}$ $(M_{q}, N_{r}) = 50 kg \cdot m^{2} / s$	Rudder limit	δ_max = 0.4189 rad
Quadratic damping coefficients	$(X_{u|u|}, Y_{v|v|}, Z_{w|w|}) = 100 \cdot (1, 2, 2) kg / m$ $(M_{q|q|}, N_{r|r|}) = 50 kg \cdot m^{2}$	AUV weight	W = 1813 N
Rudder coefficients	$M_{δ} = 0.1595 \cdot [+1, -1, -1, +1]$ $N_{δ} = 0.1595 \cdot [+1, -1, -1, +1]$	Vertical offset	$\overline{BG}_{z} = 0.02 m$
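
As a quick check on the actuator limits in Table 1, the rudder limit and rudder rate limit correspond to about 24° and 20°/s. The sketch below converts both values and shows one plausible way of applying them to a commanded rudder angle. The helper name `limit_rudder` and the order of operations (rate limit first, then angle limit) are assumptions made for illustration and are not taken from the paper; the 0.2 s step is the Δt listed in Table 2.

```python
import math

DELTA_MAX = 0.4189       # rudder limit from Table 1, rad
DELTA_RATE_MAX = 0.3491  # rudder rate limit from Table 1, rad/s
DT = 0.2                 # time step from Table 2, s


def limit_rudder(prev_angle, commanded_angle, dt=DT):
    """Hypothetical helper: apply the rate limit, then the angle limit."""
    max_step = DELTA_RATE_MAX * dt
    # Clamp the requested change to what the actuator can move in one step.
    step = max(-max_step, min(max_step, commanded_angle - prev_angle))
    angle = prev_angle + step
    # Clamp the resulting angle to the mechanical deflection limit.
    return max(-DELTA_MAX, min(DELTA_MAX, angle))


print(round(math.degrees(DELTA_MAX), 1))       # rudder limit in degrees
print(round(math.degrees(DELTA_RATE_MAX), 1))  # rate limit in degrees per second
print(limit_rudder(0.0, DELTA_MAX))            # one step toward full deflection
```

Under these assumptions, a rudder starting at zero needs several 0.2 s steps to reach full deflection, because each step can change the angle by at most 0.3491 × 0.2 rad.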

Table 2. Training task settings.

Parameter	Value	Parameter	Value
Initial AUV states	η₀ = 0, ν₀ = 0	Path segments	n^p = 2
Maximum adjacent heading change	$Δ ψ_{\max}^{p} = 1.5707 rad$	Time step	Δt = 0.2 s
Maximum absolute path pitch angle	$Δ θ_{\max}^{p} = 0.1222 rad$	Time threshold	$t_{\max}^{o} = 10 s$
Minimum segment length	L_min = 100 m	Error threshold	$d_{\min}^{h} = 0.1 m$
Normalization factors	$(e_{\max}^{h}, e_{\max}^{v}, r_{\max}) = (4, 2, 4)$	Reward weights	$k_{i}^{r} = 5 (i = 1, 2, 3)$
Maximum task time limit	$t_{\max}^{t} = 1.3 \sum (L^{p} / \hat{u})$	Disturbance phase shifts	$|φ_{1}^{d}| < π / 8$
Disturbance frequencies	$0.08 < k_{1}^{d} < 0.12$	Disturbance phase shifts	$π / 8 < |φ_{2}^{d}| < 3 π / 8$
Disturbance frequencies	$0.18 < k_{2}^{d} < 0.22$	Disturbance phase shifts	$π / 12 < |φ_{3}^{d}| < π / 4$
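
The maximum task time limit in Table 2 scales the nominal traversal time of the path by 1.3. A small worked sketch follows, assuming a single desired surge speed for all segments; the 1.0 m/s speed and the two 100 m segments are hypothetical inputs chosen only to make the arithmetic concrete, and `task_time_limit` is an illustrative name.

```python
DT = 0.2           # time step from Table 2, s
TIME_FACTOR = 1.3  # multiplier in the task time limit from Table 2


def task_time_limit(segment_lengths, desired_speed):
    """Time limit = 1.3 * sum(L_p / u_hat), assuming one desired speed for all segments."""
    return TIME_FACTOR * sum(length / desired_speed for length in segment_lengths)


# Hypothetical task: two segments at the minimum length, 1.0 m/s desired surge speed.
limit_s = task_time_limit([100.0, 100.0], desired_speed=1.0)
max_steps = round(limit_s / DT)  # number of control steps within the limit
print(limit_s, max_steps)
```

With these inputs the nominal traversal time is 200 s, so the limit leaves a 30% margin before an episode would time out.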

Table 3. Algorithm settings.

Parameter	Value	Parameter	Value
Learning rate	lr = 2 × 10⁻⁴	Dynamic encoder	MLP(2 × 32), ReLU
Discount factor	γ = 0.99	Segment branches	MLP(2 × 64), ReLU
Critic number	n^Q = 6	Fusion layer	MLP(1 × 64), ReLU
Soft update coefficient	τ = 1 × 10⁻³	FiLM modulation	g:MLP(1 × 64), ReLU
Training episodes	n^eps = 1200	Policy network	MLP(2 × 128), ReLU
Replay Buffer size	n^buf = 1 × 10⁶	Critic head network	MLP(2 × 128), ReLU
Batch size	n^bat = 256	Target entropy	$H_{tar} = - 2$
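
The soft update coefficient τ in Table 3 is conventionally used in SAC-style algorithms for Polyak averaging of target-network parameters. The following is a minimal sketch with plain Python lists, assuming the standard rule θ_target ← τ θ + (1 − τ) θ_target. The function `soft_update` and the toy parameter vectors are hypothetical, and the paper's training code may differ.

```python
TAU = 1e-3  # soft update coefficient from Table 3


def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Toy parameters for illustration only.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(1000):
    target = soft_update(target, online)
print(target)
```

With τ = 1 × 10⁻³, the target closes roughly 63% of the gap to a fixed online value after 1000 updates, since 1 − 0.999¹⁰⁰⁰ ≈ 0.63. This slow tracking is the usual reason for choosing a small τ.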

Table 4. Training metric results in the comparative study.

Method	${\bar{R}}_{tail}$	$σ_{tail}$	$\bar{TRE}$	$\bar{AUC}$
TIB-CSAC	−1.7244 × 10²	1.2368 × 10²	2.9560 × 10²	−2.6398 × 10⁶
SAC	−2.7927 × 10²	5.1210 × 10²	7.3320 × 10²	−6.5231 × 10⁶
DDPG	−5.5372 × 10²	4.6465 × 10²	9.3920 × 10²	−9.2165 × 10⁶

Table 5. Waypoint Coordinates of Random Paths.

Index	1	2	3	4	5	6	7	8	9	10
$x_{i}^{p}$ (m)	111.79	238.38	434.02	546.22	486.21	562.53	674.69	596.73	441.31	433.10
$y_{i}^{p}$ (m)	92.60	158.75	179.64	115.61	−47.99	−174.84	−288.56	−370.22	−325.50	−185.18
$z_{i}^{p}$ (m)	2.56	14.46	14.68	15.56	3.68	11.19	15.01	20.85	11.28	18.72
Index	11	12	13	14	15	16	17	18	19	20
$x_{i}^{p}$ (m)	464.11	375.32	450.57	628.69	794.98	883.33	1001.32	1110.58	1264.07	1387.24
$y_{i}^{p}$ (m)	−74.54	99.49	201.61	131.19	68.23	−97.93	−52.65	54.26	135.41	54.28
$z_{i}^{p}$ (m)	20.23	24.20	22.87	32.63	31.45	43.51	32.77	35.95	51.05	51.89

Table 6. Vertical error metrics.

Case	Metric	TIB-CSAC	LOS+SP	SAC	DDPG
C1	MXAE (m)	1.7299 × 10⁻¹	2.6719 × 10⁻¹	2.6679 × 10⁻¹	4.2489 × 10⁻¹
	MAE (m)	4.1841 × 10⁻³	1.0881 × 10⁻²	1.3561 × 10⁻²	4.7823 × 10⁻²
	CVaR₉₅ (m)	4.7757 × 10⁻²	8.6580 × 10⁻²	6.3900 × 10⁻²	2.4309 × 10⁻¹
	V_ε	7.4956 × 10⁻³	1.4532 × 10⁻²	3.0235 × 10⁻³	1.4487 × 10⁻¹
C2	MXAE (m)	3.6035 × 10⁻¹	4.8607 × 10⁻¹	5.8540 × 10⁻¹	5.5639 × 10⁻¹
	MAE (m)	1.0904 × 10⁻²	3.8505 × 10⁻²	2.1844 × 10⁻²	8.1225 × 10⁻²
	CVaR₉₅ (m)	1.0198 × 10⁻¹	2.5271 × 10⁻¹	1.5847 × 10⁻¹	4.0743 × 10⁻¹
	V_ε	1.7398 × 10⁻²	1.1855 × 10⁻¹	3.2938 × 10⁻²	2.5016 × 10⁻¹
C3	MXAE (m)	3.4549 × 10⁻¹	3.7410 × 10⁻¹	4.6598 × 10⁻¹	6.7275 × 10⁻¹
	MAE (m)	1.2928 × 10⁻²	3.3489 × 10⁻²	3.1507 × 10⁻²	9.6232 × 10⁻²
	CVaR₉₅ (m)	1.2817 × 10⁻¹	2.1243 × 10⁻¹	1.6925 × 10⁻¹	4.4115 × 10⁻¹
	V_ε	2.5180 × 10⁻²	8.2075 × 10⁻²	6.1138 × 10⁻²	3.3564 × 10⁻¹

Table 7. Horizontal error metrics.

Case	Metric	TIB-CSAC	LOS+SP	SAC	DDPG
C1	MXAE (m)	2.3813 × 10⁰	1.0942 × 10¹	1.0892 × 10¹	1.0348 × 10¹
	MAE (m)	6.1981 × 10⁻²	5.9013 × 10⁻¹	4.7040 × 10⁻¹	4.8171 × 10⁻¹
	CVaR₉₅ (m)	8.4313 × 10⁻¹	7.1942 × 10⁰	5.3551 × 10⁰	5.1160 × 10⁰
	V_ε	6.8683 × 10⁻²	3.7126 × 10⁻¹	3.8413 × 10⁻¹	7.3974 × 10⁻¹
C2	MXAE (m)	1.0324 × 10⁰	1.0575 × 10¹	8.3592 × 10⁰	1.4654 × 10¹
	MAE (m)	3.6557 × 10⁻²	7.0307 × 10⁻¹	4.6165 × 10⁻¹	7.6099 × 10⁻¹
	CVaR₉₅ (m)	3.8266 × 10⁻¹	6.1709 × 10⁰	4.2428 × 10⁰	7.2433 × 10⁰
	V_ε	6.6016 × 10⁻²	5.7557 × 10⁻¹	4.9173 × 10⁻¹	7.9612 × 10⁻¹
C3	MXAE (m)	2.9708 × 10⁰	1.0196 × 10¹	6.5206 × 10⁰	1.0790 × 10¹
	MAE (m)	8.7358 × 10⁻²	5.9247 × 10⁻¹	5.1517 × 10⁻¹	7.0218 × 10⁻¹
	CVaR₉₅ (m)	1.0650 × 10⁰	4.9855 × 10⁰	4.1604 × 10⁰	5.4241 × 10⁰
	V_ε	1.4977 × 10⁻¹	5.8885 × 10⁻¹	4.6221 × 10⁻¹	8.4074 × 10⁻¹
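
Tables 6 and 7 report the maximum absolute error (MXAE), mean absolute error (MAE), tail error (CVaR₉₅), and threshold exceedance ratio (V_ε). The sketch below shows how such metrics could be computed from an error time series. It assumes the common definitions: CVaR₉₅ as the mean of the largest 5% of absolute errors, and V_ε as the fraction of samples whose absolute error exceeds ε. The exact definitions and the value of ε are not given in this section, so `epsilon=0.1` is a placeholder and `error_metrics` is an illustrative name.

```python
def error_metrics(errors, epsilon=0.1, tail_percent=5):
    """Summarize a path-following error series (assumed metric definitions).

    epsilon is a placeholder threshold; tail_percent=5 gives a CVaR over the worst 5%.
    """
    abs_err = sorted(abs(e) for e in errors)
    n = len(abs_err)
    # Integer ceiling of n * tail_percent / 100, with at least one sample in the tail.
    tail_count = max(1, (n * tail_percent + 99) // 100)
    tail = abs_err[n - tail_count:]
    return {
        "MXAE": abs_err[-1],
        "MAE": sum(abs_err) / n,
        "CVaR95": sum(tail) / tail_count,
        "V_eps": sum(1 for e in abs_err if e > epsilon) / n,
    }


# Synthetic error series for illustration only (not data from the paper).
sample = [0.01 * k for k in range(-10, 30)]
print(error_metrics(sample))
```

Because CVaR₉₅ averages only the worst samples, it separates controllers with similar MAE but different behaviour at path corners, which is where the horizontal errors in Table 7 differ most between methods.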

Table 8. Control-related metrics.

Case	Metric	TIB-CSAC	LOS+SP	SAC	DDPG
C1	$R_{δ}^{sat}$	3.6415 × 10⁻²	3.5432 × 10⁻²	7.1948 × 10⁻²	1.9965 × 10⁻²
	$J_{Δ^{2} δ}$	5.9789 × 10⁻²	7.9926 × 10⁻¹	2.0554 × 10⁻²	6.3354 × 10⁰
	$J_{\dot{δ}}$	7.5725 × 10⁰	1.0895 × 10¹	7.7465 × 10⁰	4.2052 × 10¹
	$J_{δ, u}$	2.7156 × 10¹	1.8364 × 10¹	7.7713 × 10¹	1.4899 × 10¹
	$J_{AUV}$	5.5553 × 10³	5.7613 × 10³	5.9402 × 10³	5.6670 × 10³
	$T$	3.7622 × 10³	3.8948 × 10³	3.9028 × 10³	3.8116 × 10³
C2	$R_{δ}^{sat}$	6.0163 × 10⁻³	1.0941 × 10⁻²	5.5885 × 10⁻²	2.7831 × 10⁻³
	$J_{Δ^{2} δ}$	3.1153 × 10⁻¹	2.3985 × 10⁰	1.0248 × 10⁻¹	6.8936 × 10⁰
	$J_{\dot{δ}}$	1.8446 × 10¹	3.1841 × 10¹	2.1538 × 10¹	4.7438 × 10¹
	$J_{δ, u}$	3.2003 × 10¹	2.6966 × 10¹	1.7262 × 10²	1.9909 × 10¹
	$J_{AUV}$	2.2942 × 10³	2.4531 × 10³	2.5507 × 10³	2.4031 × 10³
	$T$	1.2300 × 10³	1.2704 × 10³	1.2812 × 10³	1.2576 × 10³
C3	$R_{δ}^{sat}$	4.1298 × 10⁻²	2.0751 × 10⁻²	9.3597 × 10⁻²	2.6845 × 10⁻²
	$J_{Δ^{2} δ}$	6.1293 × 10⁻¹	1.5771 × 10⁰	1.1750 × 10⁻¹	1.3719 × 10¹
	$J_{\dot{δ}}$	1.6996 × 10¹	1.9188 × 10¹	2.1262 × 10¹	8.9930 × 10¹
	$J_{δ, u}$	3.4639 × 10¹	2.3822 × 10¹	2.0899 × 10²	2.6849 × 10¹
	$J_{AUV}$	7.5449 × 10³	7.5713 × 10³	7.7279 × 10³	7.5638 × 10³
	$T$	2.4384 × 10³	2.5830 × 10³	2.6988 × 10³	2.5480 × 10³

Table 9. Training metric results in the ablation study.

Method	$\bar{TRE}$	$\bar{AUC}$	$σ_{tail}$	${\bar{R}}_{tail}$
SAC	7.3320 × 10²	−6.5231 × 10⁶	5.1210 × 10²	−2.7927 × 10²
CSAC	6.4180 × 10²	−5.4341 × 10⁶	1.3339 × 10²	−2.0564 × 10²
TIB-SAC	4.3060 × 10²	−3.9544 × 10⁶	3.5529 × 10²	−2.3032 × 10²
TIB-CSAC	2.9560 × 10²	−2.6398 × 10⁶	1.2368 × 10²	−1.7244 × 10²
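
To make the ablation comparison in Table 9 easier to read, the sketch below expresses each variant's σ_tail as a relative reduction with respect to the SAC baseline. It reads σ_tail as a tail-stage variability measure for which lower is better; that interpretation is an assumption, since the metric is not defined in this section. The values are copied from Table 9 and `relative_reduction` is an illustrative helper.

```python
# sigma_tail values copied from Table 9.
SIGMA_TAIL = {
    "SAC": 5.1210e2,
    "CSAC": 1.3339e2,
    "TIB-SAC": 3.5529e2,
    "TIB-CSAC": 1.2368e2,
}


def relative_reduction(value, baseline):
    """Fractional reduction of value with respect to baseline (positive means lower)."""
    return (baseline - value) / baseline


for method, sigma in SIGMA_TAIL.items():
    pct = 100.0 * relative_reduction(sigma, SIGMA_TAIL["SAC"])
    print(f"{method}: {pct:.1f}% lower sigma_tail than SAC")
```

For TIB-CSAC the reduction comes out at roughly 76% relative to SAC. The two variants with the conservative critic (CSAC and TIB-CSAC) show much larger reductions than TIB-SAC.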

Table 10. Training metric results in the sensitivity study.

n^Q	$σ_{tail}$	$\bar{TRE}$	$\bar{AUC}$	${\bar{R}}_{tail}$
2	2.4269 × 10³	1.2000 × 10³	−7.9829 × 10⁶	−3.7323 × 10³
4	5.3876 × 10²	3.8780 × 10²	−3.2756 × 10⁶	−2.7346 × 10²
6	1.2368 × 10²	2.9560 × 10²	−2.6398 × 10⁶	−1.7244 × 10²
8	1.0508 × 10²	3.3480 × 10²	−2.6063 × 10⁶	−1.6741 × 10²

Table 11. Horizontal error metrics for sensitivity analysis of reward weights.

Case	$[k_{1}^{r}, k_{2}^{r}, k_{3}^{r}]$	[7.50, 3.75, 3.75]	[3.75, 7.50, 3.75]	[3.75, 3.75, 7.50]	[5.0, 5.0, 5.0]
C1	MXAE (m)	2.8895 × 10⁰	3.2273 × 10⁰	2.7073 × 10⁰	2.3813 × 10⁰
	MAE (m)	9.3567 × 10⁻²	1.2752 × 10⁻¹	8.1307 × 10⁻²	6.1981 × 10⁻²
	CVaR₉₅ (m)	1.2507 × 10⁰	1.1357 × 10⁰	1.0947 × 10⁰	8.4313 × 10⁻¹
	V_ε	8.7584 × 10⁻²	3.7600 × 10⁻¹	7.3903 × 10⁻²	6.8683 × 10⁻²
C2	MXAE (m)	1.1418 × 10⁰	1.6357 × 10⁰	1.1187 × 10⁰	1.0324 × 10⁰
	MAE (m)	4.0079 × 10⁻²	8.7472 × 10⁻²	5.9850 × 10⁻²	3.6557 × 10⁻²
	CVaR₉₅ (m)	4.0264 × 10⁻¹	7.0743 × 10⁻¹	5.4121 × 10⁻¹	3.8266 × 10⁻¹
	V_ε	6.0212 × 10⁻²	2.0276 × 10⁻¹	1.1075 × 10⁻¹	6.6016 × 10⁻²
C3	MXAE (m)	2.6098 × 10⁰	2.5967 × 10⁰	2.6334 × 10⁰	2.9708 × 10⁰
	MAE (m)	5.2106 × 10⁻²	6.1310 × 10⁻²	6.4482 × 10⁻²	8.7358 × 10⁻²
	CVaR₉₅ (m)	5.2805 × 10⁻¹	5.6101 × 10⁻¹	6.0352 × 10⁻¹	1.0650 × 10⁰
	V_ε	7.0330 × 10⁻²	9.8560 × 10⁻²	9.6979 × 10⁻²	1.4977 × 10⁻¹

Table 12. Vertical error metrics for sensitivity analysis of reward weights.

Case	$[k_{1}^{r}, k_{2}^{r}, k_{3}^{r}]$	[7.50, 3.75, 3.75]	[3.75, 7.50, 3.75]	[3.75, 3.75, 7.50]	[5.0, 5.0, 5.0]
C1	MXAE (m)	2.5174 × 10⁻¹	1.5607 × 10⁻¹	1.8283 × 10⁻¹	1.7299 × 10⁻¹
	MAE (m)	1.0644 × 10⁻²	1.0499 × 10⁻²	6.4236 × 10⁻³	4.1841 × 10⁻³
	CVaR₉₅ (m)	1.1995 × 10⁻¹	5.1650 × 10⁻²	7.6575 × 10⁻²	4.7757 × 10⁻²
	V_ε	2.4152 × 10⁻²	4.2438 × 10⁻⁴	1.6045 × 10⁻²	7.4956 × 10⁻³
C2	MXAE (m)	3.8565 × 10⁻¹	3.0402 × 10⁻¹	3.7452 × 10⁻¹	3.6035 × 10⁻¹
	MAE (m)	1.0350 × 10⁻²	1.6619 × 10⁻²	1.0264 × 10⁻²	1.0904 × 10⁻²
	CVaR₉₅ (m)	1.1253 × 10⁻¹	9.9581 × 10⁻²	9.9590 × 10⁻²	1.0198 × 10⁻¹
	V_ε	2.0667 × 10⁻²	1.5981 × 10⁻²	1.5450 × 10⁻²	1.7398 × 10⁻²
C3	MXAE (m)	3.2619 × 10⁻¹	2.6435 × 10⁻¹	3.2154 × 10⁻¹	3.4549 × 10⁻¹
	MAE (m)	1.0312 × 10⁻²	1.9053 × 10⁻²	1.1628 × 10⁻²	1.2928 × 10⁻²
	CVaR₉₅ (m)	9.7986 × 10⁻²	8.2500 × 10⁻²	9.0378 × 10⁻²	1.2817 × 10⁻¹
	V_ε	1.6776 × 10⁻²	1.1930 × 10⁻²	1.5267 × 10⁻²	2.5180 × 10⁻²

Table 13. Coordinates of the reference-path waypoints.

Index	1	2	3	4	5	6	7	8	9	10
$x_{i}^{p}$ (m)	112.13	307.52	497.67	657.61	813.02	835.02	731.12	619.40	621.43	525.10
$y_{i}^{p}$ (m)	0.00	−32.65	−33.77	52.53	155.41	311.17	365.13	209.29	96.76	30.63
$z_{i}^{p}$ (m)	4.72	9.98	3.04	8.81	13.64	23.01	20.88	24.59	22.71	19.30
Index	11	12	13	14	15	16	17	18	19	20
$x_{i}^{p}$ (m)	408.72	315.79	229.22	138.34	161.95	218.32	283.15	423.69	509.14	595.53
$y_{i}^{p}$ (m)	−33.07	−152.37	−264.84	−196.47	−52.90	98.86	197.75	133.86	224.75	297.61
$z_{i}^{p}$ (m)	16.14	6.18	4.09	4.13	12.66	12.21	10.74	22.41	14.44	11.01

Table 14. Path following error metrics.

	Metric	TIB-CSAC	LOS+SP	SAC	DDPG
Vertical	MXAE (m)	3.0580 × 10⁻¹	3.1465 × 10⁻¹	3.9217 × 10⁻¹	5.5085 × 10⁻¹
	MAE (m)	8.9867 × 10⁻³	2.3031 × 10⁻²	2.5653 × 10⁻²	4.6731 × 10⁻²
	CVaR₉₅ (m)	6.6192 × 10⁻²	1.3515 × 10⁻¹	1.3048 × 10⁻¹	2.6511 × 10⁻¹
	V_ε	7.9594 × 10⁻³	4.0323 × 10⁻²	3.1242 × 10⁻²	1.2729 × 10⁻¹
Horizontal	MXAE (m)	3.0449 × 10⁰	1.0180 × 10¹	6.7056 × 10⁰	1.5886 × 10¹
	MAE (m)	6.2952 × 10⁻²	5.0292 × 10⁻¹	3.0592 × 10⁻¹	7.7647 × 10⁻¹
	CVaR₉₅ (m)	4.6887 × 10⁻¹	4.9808 × 10⁰	2.6753 × 10⁰	6.6688 × 10⁰
	V_ε	1.2621 × 10⁻¹	4.2244 × 10⁻¹	5.8750 × 10⁻¹	9.3249 × 10⁻¹

Table 15. Rudder metrics.

Metric	TIB-CSAC	LOS+SP	SAC	DDPG
$R_{δ}^{sat}$	7.3909 × 10⁻³	4.1181 × 10⁻³	5.3944 × 10⁻²	4.9386 × 10⁻³
$J_{Δ^{2} δ}$	5.0755 × 10⁻¹	2.2047 × 10⁰	2.1324 × 10⁻¹	1.8090 × 10¹
$J_{\dot{δ}}$	1.3140 × 10¹	2.5033 × 10¹	3.1093 × 10¹	1.1789 × 10²
$J_{δ, u}$	2.5234 × 10¹	1.4796 × 10¹	1.7047 × 10²	1.6914 × 10¹
$J_{AUV}$	7.2190 × 10³	7.4023 × 10³	7.4040 × 10³	7.3943 × 10³
$T$	2.2866 × 10³	2.3312 × 10³	2.3302 × 10³	2.3286 × 10³

Share and Cite

MDPI and ACS Style

Tan, J.; Sun, Y.; Zhang, L.; Chai, P.; Liu, Z. Direct X-Rudder Path-Following Control for Underactuated AUVs via TIB-CSAC. J. Mar. Sci. Eng. 2026, 14, 1100. https://doi.org/10.3390/jmse14121100
