Article

Direct X-Rudder Path-Following Control for Underactuated AUVs via TIB-CSAC

1
National Key Laboratory of Autonomous Marine Vehicle Technology, Harbin Engineering University, Harbin 150001, China
2
College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China
3
Shandong Future Robot Co., Ltd., Weihai 264209, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(12), 1100; https://doi.org/10.3390/jmse14121100 (registering DOI)
Submission received: 25 April 2026 / Revised: 4 June 2026 / Accepted: 10 June 2026 / Published: 14 June 2026
(This article belongs to the Special Issue Advanced Studies in Marine Vessel Motion Control)

Abstract

To improve the path-following performance of an underactuated autonomous underwater vehicle (AUV) under varying path geometries and desired velocities, this study proposes a direct X-rudder control method based on Task-Informed Inductive-Bias Conservative Soft Actor–Critic (TIB-CSAC). The proposed method directly learns the X-rudder control policy from the path-following information of the current and subsequent path segments in a data-driven way, thereby avoiding the complex design and manual tuning of guidance laws and attitude controllers for rudder command generation. To support such two-segment policy learning, a task-informed inductive-bias encoder is proposed to construct structured and conditioned state representations, thereby improving sample efficiency and overall training quality. In addition, given the long-tail characteristics of task difficulty in agent training, a multi-head conservative value evaluation mechanism is incorporated to mitigate return drawdowns induced by challenging tasks in the tail stage of training and to enhance tail-stage convergence stability. The path-following performance is validated in three representative scenarios with different path pitch, path heading variations, and desired surge velocity conditions. The results show that, compared with the baseline soft actor–critic (SAC) method, TIB-CSAC improves multiple vertical and horizontal error metrics, including maximum absolute error, mean absolute error, tail error, and error threshold exceedance ratio. These results indicate that TIB-CSAC not only improves overall adherence to the reference path, but also more effectively suppresses extreme errors and tail errors, thereby demonstrating stronger path-following robustness and reliability.

1. Introduction

Owing to their simple structure and high maneuvering efficiency, X-rudder autonomous underwater vehicles (AUVs) have become important platforms for a wide range of underwater applications, including environmental monitoring, offshore infrastructure inspection, and marine resource exploration. As underwater missions become increasingly diverse, AUVs are expected to exhibit strong generalization capability and reliable performance under a broad range of operating conditions.
Path-following is a fundamental capability of AUVs. It enables an AUV to move along a reference path while minimizing its position error. In practice, the reference path is commonly composed of multiple straight-line segments, and the path-following task can be characterized by parameters such as the path heading angle, path pitch angle, and desired surge velocity. However, achieving path-following performance that is both robust and strongly generalizable remains challenging. First, different task-parameter settings lead to substantial variations in task difficulty and impose different maneuvering demands. This challenge is particularly pronounced during path-segment transitions, where the AUV must rapidly adjust its motion and reduce tracking errors within a limited spatial range; these transitions are therefore more difficult than straight-segment following. Moreover, the underactuated nature of X-rudder AUVs, nonlinear hydrodynamic effects, and coupling among multiple motion variables further increase the complexity of unified controller design. As the task-generalization range expands, the workload and complexity associated with multi-parameter tuning also increase.
Due to the absence of direct lateral and vertical actuation in underactuated AUVs, conventional path-following methods typically use a guidance law to transform position errors into desired attitudes. These commands are then tracked by a controller, such as prescribed performance control [1], sliding-mode control [2,3], or model predictive control [4]. However, such model-based methods rely on accurate hydrodynamic models, and parameter identification is difficult and costly. Model-free controllers, such as proportional–integral–derivative (PID) controllers [5,6] and S-plane controllers [7,8,9], are easier to implement but usually rely on expert tuning. Moreover, under significant variations in path geometry and desired surge velocity, these hierarchical schemes often require re-tuning of the guidance law or controller [10], which limits their adaptability and generalization capability.
Deep reinforcement learning (DRL), with its self-learning capability and data-driven decision-making mechanism, provides a promising approach to path-following [5,11]. Sun et al. [12] proposed an improved deep deterministic policy gradient method (OSAM-DDPG) that learns a control policy conditioned on attitude and velocity errors. However, when a line-of-sight (LOS) guidance law with fixed parameters is used, its path-following performance under different operating conditions may be limited. Fang et al. [13] applied DDPG to achieve three-degree-of-freedom control at constant speed with variable pitch and yaw, while realizing path-following via waypoint tracking. Wang et al. [14] developed a simplified DDPG method that does not require prediction of future rewards and validated its performance on a helical path and an inclined straight path under a single constant desired speed. Wang et al. [15] proposed a SAC-based adaptive PID controller and verified its attitude-control performance in LOS-guided path-following, although its effectiveness in direct position-error regulation remains to be further investigated. Zhang et al. [16] proposed an improved DDPG-based path-following method integrated with a LOS guidance law and trained the algorithm on two fixed paths under a single constant desired speed. Zhang et al. [17] proposed a deep interactive reinforcement learning method, and Dong et al. [18] developed an improved SAC-based two-dimensional path-following method. In both studies, the agent was trained and evaluated using the same path geometries; therefore, its path-following performance across diverse path geometries remains an open issue.
Despite these advances, current DRL-based path-following studies still have limitations in state construction when strong generalization is required. On the one hand, many DRL methods rely on manually designed guidance laws with fixed parameters, making it difficult to adapt to path-following tasks with different parameter configurations. On the other hand, multiple types of original path-following information, including position errors, path heading/pitch variations, and progress relative to adjacent segments, are usually mapped by the guidance law into a small number of desired attitude commands [19]. If the agent mainly receives these attitude commands as inputs, task-level contextual information may not be explicitly represented in the agent state. Such local and indirect state representations may create a representation bottleneck, limiting the agent’s conditional learning capability across operating conditions and weakening its generalization performance [20,21].
Existing path-following studies also have limitations in agent training. Some studies restrict agent training to basic control tasks involving attitude and velocity errors, and some methods are trained under only a single surge-speed setting or a single path-geometry condition. Such limited coverage of the training-task distribution can easily lead to policy overfitting, thereby degrading path-following performance in other tasks. In domain-randomized training, challenging operating conditions, in which task parameters such as the path heading angle, path pitch angle, and desired surge velocity simultaneously approach their boundary values, occur with relatively low probability. Meanwhile, compared with straight-segment following, path-segment transitions are more difficult and are associated with sparser experience samples. As a result, task difficulty spans a wide range under domain-randomized training and exhibits a pronounced long-tail distribution. However, existing methods such as DDPG and SAC are essentially risk-neutral expected-return optimization methods. Although they may achieve satisfactory average performance under most regular operating conditions, their path-following performance under more challenging tasks may remain limited.
To address the limited generalization capability and robustness of underactuated X-rudder AUVs under diverse path-following conditions, this study proposes a direct X-rudder control method based on Task-Informed Inductive-Bias Conservative Soft Actor–Critic (TIB-CSAC) for AUV path-following. The main contributions are summarized as follows:
(1) A direct X-rudder control framework is proposed. In this framework, the X-rudder control policy is learned directly from two-segment path-following information, avoiding the complex hierarchical design of guidance laws and attitude controllers as well as the associated coupled parameter tuning.
(2) A task-informed inductive-bias state encoder is developed to construct structured and task-conditioned representations of two-segment path-following states for direct policy learning. This design reduces the representation burden caused by direct concatenation of heterogeneous state variables and improves sample efficiency.
(3) A multi-head conservative value-evaluation mechanism is introduced to improve value assessment for high-uncertainty state–action pairs, thereby reducing return drawdowns induced by challenging path-following tasks and improving tail-stage convergence stability.
(4) The proposed method is validated in three representative scenarios with distinct path geometries and desired surge velocities. The results show improvements across multiple error metrics, including the maximum absolute error, mean absolute error, tail error, and threshold-exceedance proportion, demonstrating enhanced generalization capability and path-following performance under diverse operating conditions.
The remainder of this paper is organized as follows. Section 2 presents the problem statement. Section 3 describes the proposed method. Section 4 analyzes and discusses the experimental results, and Section 5 concludes the paper and outlines future work.

2. Problem Statement

2.1. Underactuated AUV Model

In this study, the AUV is equipped with a main thruster and an X-rudder configuration. The numbering and rotation directions of the X-rudders, as well as the Earth-fixed ($O$-$XYZ$) and body-fixed ($O_B$-$X_B Y_B Z_B$) coordinate systems, are shown in Figure 1. The kinematic and dynamic model of the AUV is given by [22]
$$\dot{\eta} = J(\eta)\,\nu \tag{1}$$
$$M\dot{\nu} + C(\nu)\nu + D(\nu)\nu + g(\eta) = \tau^{a} + \tau^{d} \tag{2}$$
where $J$, $M$, $C$, $D$, and $g$ represent the kinematic transformation matrix, inertia matrix, Coriolis–centripetal matrix, hydrodynamic damping matrix, and restoring-force vector, respectively. $\eta = [x, y, z, \theta, \psi]^{T}$ and $\nu = [u, v, w, q, r]^{T}$ denote the position-attitude vector in the Earth-fixed frame and the velocity vector in the body-fixed frame, respectively.
In practical AUV design, roll motion is usually expected to remain close to zero to ensure sensor and payload stability. This condition can be maintained passively by the hydrostatic restoring moment and hydrodynamic roll damping. Therefore, under the assumptions that gravity and buoyancy are in equilibrium, the center of buoyancy lies in the vertical plane, and sufficient hydrostatic restoring moment is available for near-zero roll motion, the kinematic and dynamic model can be written as [23,24]
$$\begin{aligned}
\dot{x} &= u\cos\psi\cos\theta - v\sin\psi + w\cos\psi\sin\theta \\
\dot{y} &= u\sin\psi\cos\theta + v\cos\psi + w\sin\psi\sin\theta \\
\dot{z} &= -u\sin\theta + w\cos\theta \\
\dot{\theta} &= q \\
\dot{\psi} &= r/\cos\theta
\end{aligned} \tag{3}$$
$$\begin{aligned}
\dot{u} &= \left(m_{22}vr - m_{33}wq - X_{u}u - X_{uu}u|u| + \tau_{u}^{a} + \tau_{u}^{d}\right)/m_{11} \\
\dot{v} &= \left(-m_{11}ur - Y_{v}v - Y_{vv}v|v| + \tau_{v}^{d}\right)/m_{22} \\
\dot{w} &= \left(m_{11}uq - Z_{w}w - Z_{ww}w|w| + \tau_{w}^{d}\right)/m_{33} \\
\dot{q} &= \left((m_{33} - m_{11})uw - M_{q}q - M_{qq}q|q| - \overline{BG}_{z}W\sin\theta + \tau_{q}^{a} + \tau_{q}^{d}\right)/m_{55} \\
\dot{r} &= \left((m_{11} - m_{22})uv - N_{r}r - N_{rr}r|r| + \tau_{r}^{a} + \tau_{r}^{d}\right)/m_{66}
\end{aligned} \tag{4}$$
where $W$ is the AUV weight, and $\overline{BG}_{z}$ is the vertical offset between the centers of mass and buoyancy. $m_{(\cdot)}$ denotes the diagonal terms of the inertia matrix $M$, including added mass. $(X_{u}, Y_{v}, Z_{w}, M_{q}, N_{r})$ and $(X_{uu}, Y_{vv}, Z_{ww}, M_{qq}, N_{rr})$ represent the linear and nonlinear damping coefficients, respectively. $\tau^{d} = [\tau_{u}^{d}, \tau_{v}^{d}, \tau_{w}^{d}, \tau_{q}^{d}, \tau_{r}^{d}]^{T}$ is the environmental disturbance vector. $\tau^{a} = [\tau_{u}^{a}, 0, 0, \tau_{q}^{a}, \tau_{r}^{a}]^{T}$ is the actuator input vector, where $\tau_{u}^{a}$ is the thruster input, and $\tau_{q}^{a}$ and $\tau_{r}^{a}$ are the X-rudder inputs for pitch and yaw motion, respectively.
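As a rough illustration of the zero-roll kinematics in Equation (3), the following Python sketch maps body-frame velocities to Earth-frame rates. It assumes the usual sign convention of this 5-DOF model (a positive pitch angle reduces $z$); the function name `kinematics` is hypothetical and is not taken from the authors' code.

```python
import math

def kinematics(u, v, w, q, r, theta, psi):
    """Earth-frame rates from body-frame velocities (zero-roll model, Eq. (3))."""
    x_dot = (u * math.cos(psi) * math.cos(theta)
             - v * math.sin(psi)
             + w * math.cos(psi) * math.sin(theta))
    y_dot = (u * math.sin(psi) * math.cos(theta)
             + v * math.cos(psi)
             + w * math.sin(psi) * math.sin(theta))
    z_dot = -u * math.sin(theta) + w * math.cos(theta)
    theta_dot = q
    # The yaw-rate mapping is singular at theta = +/- 90 degrees.
    psi_dot = r / math.cos(theta)
    return x_dot, y_dot, z_dot, theta_dot, psi_dot
```

With zero attitude angles the surge velocity maps directly to $\dot{x}$, which gives a quick sanity check of the signs.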
Unlike a conventional cross-rudder configuration, in which the horizontal and vertical rudders mainly generate pitch and yaw moments separately, the X-rudder contains four inclined control surfaces. As shown in Equation (5), each rudder contributes to both $\tau_{q}^{a}$ and $\tau_{r}^{a}$; therefore, the pitch and yaw moments are generated by the combined deflections of all four rudders. This configuration improves rudder-utilization efficiency but also introduces a more complex rudder-moment coupling relationship.
$$\begin{aligned}
\tau_{q}^{a} &= u^{2}\left(M_{\delta_1}\delta_1 + M_{\delta_2}\delta_2 + M_{\delta_3}\delta_3 + M_{\delta_4}\delta_4\right) \\
\tau_{r}^{a} &= u^{2}\left(N_{\delta_1}\delta_1 + N_{\delta_2}\delta_2 + N_{\delta_3}\delta_3 + N_{\delta_4}\delta_4\right)
\end{aligned} \tag{5}$$
where $M_{\delta_i}$ and $N_{\delta_i}$ ($i = 1, \ldots, 4$) are the hydrodynamic parameters of the rudders.

2.2. Path Following Objective

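The desired path is represented as a polyline of $n_p$ straight-line segments, where the $i$-th segment with endpoints $P_{i-1}^{p}$ and $P_{i}^{p}$ ($i = 1, 2, \ldots, n_p$) is defined by Equation (6).
$$P_{i}^{p} = \begin{bmatrix} x_{i}^{p} \\ y_{i}^{p} \\ z_{i}^{p} \end{bmatrix} = P_{i-1}^{p} + L_{i}^{p}\begin{bmatrix} \cos\psi_{i}^{p}\cos\theta_{i}^{p} \\ \sin\psi_{i}^{p}\cos\theta_{i}^{p} \\ -\sin\theta_{i}^{p} \end{bmatrix}, \quad \left|\psi_{i}^{p} - \psi_{i-1}^{p}\right| \le \Delta\psi_{\max}^{p}, \quad \left|\theta_{i}^{p}\right| \le \theta_{\max}^{p}, \quad L_{i}^{p} \ge L_{\min} \tag{6}$$
where $\Delta\psi_{\max}^{p}$ denotes the maximum variation between adjacent path headings $\psi_{i}^{p}$, and $\psi_{0}^{p}$ is set to the initial yaw angle of the AUV. $\theta_{\max}^{p}$ represents the maximum absolute value of the path pitch angle $\theta_{i}^{p}$, and $L_{\min}$ is the minimum path-segment length $L_{i}^{p}$.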
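The vertical error is defined as the vertical offset between the AUV and its projection point on the path segment:
$$e_{i}^{v} = \left(z_{i}^{p} - z_{i-1}^{p}\right)\varepsilon_{i} + z_{i-1}^{p} - z \tag{7}$$
$$\varepsilon_{i} = \frac{\overrightarrow{P_{i-1}^{p}P_{\mathrm{AUV}}} \cdot \overrightarrow{P_{i-1}^{p}P_{i}^{p}}}{\left\|\overrightarrow{P_{i-1}^{p}P_{i}^{p}}\right\|^{2}} \tag{8}$$
where $\varepsilon_{i}$ is an along-track parameter based on the AUV's position $P_{\mathrm{AUV}} = [x, y, z]^{T}$ relative to the path segment $P_{i-1}^{p}P_{i}^{p}$.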
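The horizontal error is defined as the horizontal offset with the minimum absolute value among the two adjacent path segments.
$$e_{i}^{h} = \begin{cases} d_{k}^{h},\; k = \arg\min_{j \in \{i,\, i+1\}} \left|d_{j}^{h}\right|, & \text{if } i < n_p \\ d_{n_p}^{h}, & \text{if } i = n_p \end{cases} \tag{9}$$
$$d_{i}^{h} = \frac{\overrightarrow{P_{i-1}^{h}P_{\mathrm{AUV}}^{h}} \times \overrightarrow{P_{i-1}^{h}P_{i}^{h}}}{\left\|\overrightarrow{P_{i-1}^{h}P_{i}^{h}}\right\|} \tag{10}$$
where $d_{i}^{h}$ denotes the horizontal distance from the AUV to the $i$-th path segment, and $P_{i}^{h} = [x_{i}^{p}, y_{i}^{p}]^{T}$ and $P_{\mathrm{AUV}}^{h}$ denote the horizontal positions of waypoint $P_{i}^{p}$ and the AUV, respectively.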
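The error definitions in Equations (7), (8), and (10) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, points are plain `(x, y, z)` tuples, and the sign of the horizontal distance simply follows the operand order of the 2-D cross product assumed here.

```python
import math

def along_track(p_prev, p_next, p_auv):
    """Eq. (8): normalized projection of the AUV position onto the segment."""
    seg = [b - a for a, b in zip(p_prev, p_next)]
    rel = [c - a for a, c in zip(p_prev, p_auv)]
    return sum(s * r for s, r in zip(seg, rel)) / sum(s * s for s in seg)

def vertical_error(p_prev, p_next, p_auv):
    """Eq. (7): z of the projection point minus the AUV z."""
    eps = along_track(p_prev, p_next, p_auv)
    return (p_next[2] - p_prev[2]) * eps + p_prev[2] - p_auv[2]

def horizontal_distance(p_prev, p_next, p_auv):
    """Eq. (10): signed 2-D cross product over the horizontal segment length."""
    sx, sy = p_next[0] - p_prev[0], p_next[1] - p_prev[1]
    rx, ry = p_auv[0] - p_prev[0], p_auv[1] - p_prev[1]
    return (rx * sy - ry * sx) / math.hypot(sx, sy)
```

For a level segment from `(0, 0, 0)` to `(10, 0, 0)` and an AUV at `(5, 2, 1)`, the along-track parameter is 0.5, so the AUV is halfway along the segment.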
In summary, this study aims to minimize both the horizontal and vertical position errors, as shown in Equation (11).
$$\min J = \sum_{t}\left(\left|e_{i_e}^{h}(t)\right| + \left|e_{i_e}^{v}(t)\right|\right) \tag{11}$$
$$i_e \leftarrow i_e + 1, \quad \text{if } \varepsilon_{i_e} > 1 \tag{12}$$
where $i_e$ denotes the path-segment index used to compute the position error, and $\varepsilon_{i_e}$ is the along-track parameter in Equation (8). The index $i_e$ is initialized to one and is incremented when $\varepsilon_{i_e} > 1$, indicating that the AUV has passed the endpoint of the current segment.

3. X-Rudder Path Following Control Based on TIB-CSAC

3.1. Markov Decision Process Design

3.1.1. States

Let $i_o$ denote the index used for state observation. As shown in Equations (13) and (14), $i_o$ is initialized to one and is incremented when the horizontal error relative to the next target segment remains within a prescribed threshold $d_{\min}^{h}$ for a duration $t_{\min}^{o}$, thereby preserving state continuity during path transitions. To avoid excessive lag behind $i_e$, $i_o$ is also forcibly incremented when $i_e - i_o > 2$. After each update of $i_o$, the dwell timer $t_o$ is reset to zero.
$$i_o \leftarrow i_o + 1, \quad \text{if } t_o \ge t_{\min}^{o} \ \text{or}\ i_e - i_o > 2 \tag{13}$$
$$t_o = \begin{cases} t_o + \Delta t, & \text{if } \left|d_{i_o+1}^{h}\right| \le d_{\min}^{h} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$
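The observation-index logic of Equations (13) and (14) can be sketched as a single update step. The function name, argument names, and the order of the two updates (timer first, then index) are assumptions made for illustration only.

```python
def update_observation_index(i_o, i_e, t_o, d_next_h, dt, t_min_o, d_min_h):
    """One step of Eqs. (13)-(14); returns the new (i_o, t_o)."""
    # Eq. (14): accumulate dwell time while the offset to the next
    # segment stays within the threshold, otherwise restart the timer.
    t_o = t_o + dt if abs(d_next_h) <= d_min_h else 0.0
    # Eq. (13): advance after enough dwell time, or when the observation
    # index lags too far behind the error index.
    if t_o >= t_min_o or i_e - i_o > 2:
        return i_o + 1, 0.0  # the timer is reset after each index update
    return i_o, t_o
```

Calling it once per control step keeps the policy input on the same pair of segments until the vehicle has settled on the next one.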
The overall observation state vector consists of four components:
$$s = \left[s_{\mathrm{dyn}}^{T},\; s_{\mathrm{task}}^{T},\; s_{i_o}^{T},\; s_{i_o+1}^{T}\right]^{T} \tag{15}$$
where $s_{\mathrm{dyn}}$ denotes the general dynamic state describing the current motion and actuator conditions of the AUV.
$$s_{\mathrm{dyn}} = \left[\nu^{T},\; \dot{\psi},\; \delta^{T}\right]^{T} \tag{16}$$
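where $\nu = [u, v, w, q, r]^{T}$ denotes the velocity vector, and $\delta = [\delta_1, \delta_2, \delta_3, \delta_4]^{T}$ denotes the X-rudder-angle vector. $s_{\mathrm{task}}$ is the task-level state that provides information on the desired surge velocity, path-geometry variations, and path-following progress:
$$s_{\mathrm{task}} = \left[u_d,\; \Delta\psi_{i_o}^{p},\; \Delta\theta_{i_o}^{p},\; \overleftarrow{d}_{i_o}^{p},\; \overrightarrow{d}_{i_o}^{p}\right]^{T} \tag{17}$$
where $\Delta\theta_{i_o}^{p} = \theta_{i_o+1}^{p} - \theta_{i_o}^{p}$ and $\Delta\psi_{i_o}^{p} = \psi_{i_o+1}^{p} - \psi_{i_o}^{p}$ denote the changes in path pitch and path heading, respectively. $\overleftarrow{d}_{i_o}^{p}$ and $\overrightarrow{d}_{i_o}^{p}$ are the distances from the AUV projection points on segments $P_{i_o-1}^{p}P_{i_o}^{p}$ and $P_{i_o}^{p}P_{i_o+1}^{p}$, respectively, to waypoint $P_{i_o}^{p}$, encoding the progress of the AUV along the two adjacent segments.
$$\overleftarrow{d}_{i_o}^{p} = \frac{\overrightarrow{P_{\mathrm{AUV}}P_{i_o}^{p}} \cdot \overrightarrow{P_{i_o-1}^{p}P_{i_o}^{p}}}{\left\|\overrightarrow{P_{i_o-1}^{p}P_{i_o}^{p}}\right\|} \tag{18}$$
$$\overrightarrow{d}_{i_o}^{p} = \frac{\overrightarrow{P_{i_o}^{p}P_{\mathrm{AUV}}} \cdot \overrightarrow{P_{i_o}^{p}P_{i_o+1}^{p}}}{\left\|\overrightarrow{P_{i_o}^{p}P_{i_o+1}^{p}}\right\|} \tag{19}$$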
$s_{i}$ ($i = i_o, i_o + 1$) denotes the segment-specific state that describes the relative path-following error for the current or subsequent path segment. It provides preview information for the upcoming transition and reduces abrupt changes in the policy input.
$$s_{i} = \left[\Delta\theta_{i}^{R},\; e_{i}^{v},\; \dot{e}_{i}^{v},\; \Delta\psi_{i}^{R},\; d_{i}^{h},\; \dot{d}_{i}^{h}\right] \tag{20}$$
where $\Delta\theta_{i}^{R} = \theta_{i}^{p} - \theta$ and $\Delta\psi_{i}^{R} = \psi_{i}^{p} - \psi$ represent the relative pitch and yaw angles with respect to segment $P_{i-1}^{p}P_{i}^{p}$, respectively; $e_{i}^{v}$ and $\dot{e}_{i}^{v}$ are the vertical error defined in Equation (7) and its rate, respectively; and $d_{i}^{h}$ and $\dot{d}_{i}^{h}$ are the horizontal distance defined in Equation (10) and its rate, respectively.
To avoid index overflow when the AUV reaches the final segment, a virtual path point $P_{n_p+1}^{p}$ is appended so that the extended segment $P_{n_p}^{p}P_{n_p+1}^{p}$ maintains the same geometry as $P_{n_p-1}^{p}P_{n_p}^{p}$, with $L_{n_p+1}^{p} = L_{n_p}^{p}$, $\psi_{n_p+1}^{p} = \psi_{n_p}^{p}$, and $\theta_{n_p+1}^{p} = \theta_{n_p}^{p}$.

3.1.2. Actions

The action vector $a_t = [\Delta\delta_{t}^{\theta}, \Delta\delta_{t}^{\psi}]^{T} \in [-1, 1]^{2}$ is defined as the normalized pitch and yaw control increments at each time step. These increments are mapped to X-rudder angles according to Equation (21). The X-rudder design enables all four rudders to jointly generate pitch and yaw moments, reducing the risk of single-rudder saturation and increasing the available equivalent moment under the rudder-angle limit.
$$\delta_t = \begin{bmatrix} \delta_{1,t} \\ \delta_{2,t} \\ \delta_{3,t} \\ \delta_{4,t} \end{bmatrix} = \Delta\delta_{\max}\begin{bmatrix} +1 & -1 \\ -1 & -1 \\ -1 & +1 \\ +1 & +1 \end{bmatrix}\begin{bmatrix} \Delta\delta_{t}^{\theta} \\ \Delta\delta_{t}^{\psi} \end{bmatrix} + \delta_{t-1}, \quad \left|\delta_{i,t}\right| \le \delta_{\max} \tag{21}$$
where $\Delta\delta_{\max}$ and $\delta_{\max}$ denote the maximum rudder-angle change per time step and the maximum rudder angle, respectively.
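A minimal sketch of the incremental action-to-rudder mapping in Equation (21) is given below. The sign pattern of the allocation matrix is read row by row from the equation, and enforcing the rudder-angle limit by clipping is an assumption; `rudder_update` and its argument names are hypothetical.

```python
# Rows of the 4x2 allocation matrix (pitch column, yaw column), assumed
# from Eq. (21); the two columns are orthogonal.
ALLOCATION = [(+1, -1), (-1, -1), (-1, +1), (+1, +1)]

def rudder_update(delta_prev, action, d_delta_max, delta_max):
    """Map normalized pitch/yaw increments in [-1, 1] to four rudder angles."""
    d_theta, d_psi = action
    new_angles = []
    for (c_theta, c_psi), prev in zip(ALLOCATION, delta_prev):
        cmd = prev + d_delta_max * (c_theta * d_theta + c_psi * d_psi)
        # Assumed saturation: clip each rudder to the +/- delta_max limit.
        new_angles.append(max(-delta_max, min(delta_max, cmd)))
    return new_angles
```

Because the action is an increment, the per-step change of each rudder is bounded by twice `d_delta_max`, which keeps the commanded deflections smooth.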

3.1.3. Reward

The reward $r_{RL}$ consists of two position-error penalties and a velocity penalty that prevents the AUV from moving opposite to the tangent vector $t_p$ of the desired path. In addition, the reward terms are normalized to reduce the influence of scale differences among components, preventing the learning process from being dominated by a single error term.
$$r_{RL} = -k_{1}^{r}\,\mathrm{sat}\!\left(\left|e_{i_e}^{h}\right| / e_{\max}^{h}\right) - k_{2}^{r}\,\mathrm{sat}\!\left(\left|e_{i_e}^{v}\right| / e_{\max}^{v}\right) + k_{3}^{r}\,\mathrm{sat}\!\left(r_{i_e}^{v} / r_{\max}^{v}\right) \tag{22}$$
$$r_{i_e}^{v} = \min\!\left(0,\; [\dot{x}, \dot{y}, \dot{z}] \cdot t_p\right) \tag{23}$$
$$t_p = \left[\cos\psi_{i_e}^{p}\cos\theta_{i_e}^{p},\; \sin\psi_{i_e}^{p}\cos\theta_{i_e}^{p},\; -\sin\theta_{i_e}^{p}\right] \tag{24}$$
where $k_{i}^{r}$ ($i = 1, 2, 3$) denotes the weighting coefficients of the reward terms; $e_{\max}^{h}$, $e_{\max}^{v}$, and $r_{\max}^{v}$ denote the corresponding normalization factors; and $\mathrm{sat}(x) = \min(1, \max(-1, x))$.
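The reward of Equations (22)–(24) could be sketched as follows. The weights and normalization factors used as defaults are placeholders, not the values from the paper's tables, and the tangent vector uses the sign convention assumed for the path pitch (positive pitch reduces $z$).

```python
import math

def sat(x):
    """Saturation to [-1, 1]."""
    return min(1.0, max(-1.0, x))

def reward(e_h, e_v, velocity, psi_p, theta_p,
           k=(1.0, 1.0, 1.0), e_h_max=5.0, e_v_max=5.0, r_v_max=1.0):
    """Two position-error penalties plus a backward-motion penalty."""
    # Eq. (24): unit tangent of the current path segment (assumed signs).
    t_p = (math.cos(psi_p) * math.cos(theta_p),
           math.sin(psi_p) * math.cos(theta_p),
           -math.sin(theta_p))
    # Eq. (23): only velocity components opposing the tangent are penalized.
    r_v = min(0.0, sum(a * b for a, b in zip(velocity, t_p)))
    # Eq. (22): normalized, saturated terms with placeholder weights.
    return (-k[0] * sat(abs(e_h) / e_h_max)
            - k[1] * sat(abs(e_v) / e_v_max)
            + k[2] * sat(r_v / r_v_max))
```

With these placeholder settings the reward lies in $[-3, 0]$, and it is zero only when both errors vanish and the vehicle is not moving against the path direction.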

3.1.4. Termination

The episode terminates when $i_e = n_p + 1$, indicating that the AUV has passed the endpoint of the final path segment, or when the time exceeds the maximum limit $t_{\max}^{t}$.
$$d_t = \begin{cases} \text{True}, & i_e = n_p + 1 \\ \text{True}, & t > t_{\max}^{t} \\ \text{False}, & \text{otherwise} \end{cases} \tag{25}$$

3.2. Task-Informed Inductive-Bias Conservative SAC Algorithm

3.2.1. Task-Informed Inductive-Bias Encoder

In this study, the input state vector consists of dynamic states, segment-specific path-following states, and task-level contextual states, which play distinct functional roles in decision making. Most DRL methods typically flatten and concatenate the state variables into a single unstructured vector and process them using a shared multilayer perceptron (MLP). This approach relies on implicit representation learning to recover the latent structure during training. When the state variables are high-dimensional and heterogeneous, interference among irrelevant or weakly relevant features may increase the burden of representation learning. As a result, sample efficiency and overall training quality may be limited. To mitigate this limitation, a Task-Informed Inductive-Bias (TIB) encoder is proposed to provide structured and task-conditioned state representations.
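The overall framework is illustrated in Figure 2. Specifically, the dynamic state vector is first embedded into a compact latent feature $h_{\mathrm{dyn}}$. Two segment-specific branches with identical architectures but independent parameters then encode the path-following states associated with the current and next path segments, respectively. Conditioned on $h_{\mathrm{dyn}}$, each branch extracts segment-specific features, which are fused into a single representation $h_{\mathrm{fuse}}$.
$$h_{\mathrm{dyn}} = f_{\mathrm{dyn}}\!\left(s_{\mathrm{dyn}};\, \theta_{\mathrm{dyn}}\right) \tag{26}$$
$$h_{\mathrm{cur}} = f_{\mathrm{cur}}\!\left([h_{\mathrm{dyn}}, s_{i_o}];\, \theta_{\mathrm{cur}}\right) \tag{27}$$
$$h_{\mathrm{next}} = f_{\mathrm{next}}\!\left([h_{\mathrm{dyn}}, s_{i_o+1}];\, \theta_{\mathrm{next}}\right) \tag{28}$$
$$h_{\mathrm{fuse}} = f_{\mathrm{fuse}}\!\left([h_{\mathrm{cur}}, h_{\mathrm{next}}];\, \theta_{\mathrm{fuse}}\right) \tag{29}$$
where $f_{\mathrm{dyn}}(\cdot)$, $f_{\mathrm{cur}}(\cdot)$, $f_{\mathrm{next}}(\cdot)$, and $f_{\mathrm{fuse}}(\cdot)$ are implemented as MLPs, and $\theta_{\mathrm{dyn}}$, $\theta_{\mathrm{cur}}$, $\theta_{\mathrm{next}}$, and $\theta_{\mathrm{fuse}}$ denote their parameters.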
To adapt encoder features to segment-specific path-following states, a feature-wise linear modulation (FiLM) network $g(\cdot)$ generates scaling and shifting vectors $(\omega, \beta)$ from the task-level state $s_{\mathrm{task}}$. The fused feature $h_{\mathrm{fuse}}$ is then modulated element-wise.
$$[\omega, \beta] = g\!\left(s_{\mathrm{task}};\, \theta_{m}\right) \tag{30}$$
$$\tilde{h} = \omega \odot h_{\mathrm{fuse}} + \beta \tag{31}$$
where $\omega$ and $\beta$ have the same dimension as $h_{\mathrm{fuse}}$, $g(\cdot)$ is implemented as an MLP with parameters $\theta_{m}$, and $\odot$ denotes element-wise multiplication.
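The overall TIB mapping can be written compactly as
$$\tilde{h} = F_{\mathrm{enc}}\!\left(s;\, \Theta_{\mathrm{enc}}\right) \tag{32}$$
where $\Theta_{\mathrm{enc}} = \{\theta_{\mathrm{dyn}}, \theta_{\mathrm{cur}}, \theta_{\mathrm{next}}, \theta_{\mathrm{fuse}}, \theta_{m}\}$ denotes all encoder parameters.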
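The data flow of Equations (26)–(31) can be sketched with the sub-networks passed in as plain callables. This only illustrates the wiring: `tib_encode` is a hypothetical name, features are Python lists, and the real sub-networks are MLPs whose sizes are not assumed here.

```python
def tib_encode(s_dyn, s_task, s_cur, s_next, f_dyn, f_cur, f_next, f_fuse, g):
    """Structured encoding: dynamic embedding, two segment branches, FiLM."""
    h_dyn = f_dyn(s_dyn)                 # Eq. (26)
    h_cur = f_cur(h_dyn + s_cur)         # Eq. (27): list concatenation
    h_next = f_next(h_dyn + s_next)      # Eq. (28)
    h_fuse = f_fuse(h_cur + h_next)      # Eq. (29)
    film = g(s_task)                     # Eq. (30): scale and shift together
    dim = len(h_fuse)
    if len(film) != 2 * dim:
        raise ValueError("g must return one scale and one shift per feature")
    omega, beta = film[:dim], film[dim:]
    # Eq. (31): feature-wise linear modulation of the fused feature.
    return [w * h + b for w, h, b in zip(omega, h_fuse, beta)]
```

Keeping the task-level state out of the branches and applying it only through the scale and shift is what makes the representation task-conditioned rather than a flat concatenation.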
Accordingly, the actor encoder produces a latent feature, which is then fed into the policy network to parameterize a squashed Gaussian policy. The action is subsequently sampled using the reparameterization trick.
$$\tilde{h}^{\pi} = F_{\mathrm{enc}}\!\left(s;\, \Theta_{\mathrm{enc}}^{\pi}\right) \tag{33}$$
$$[\mu(s), \log\sigma(s)] = f^{\pi}\!\left(\tilde{h}^{\pi};\, \theta^{\pi}\right) \tag{34}$$
$$a = \tanh\!\left(\mu(s) + \sigma(s) \odot \varepsilon\right) \tag{35}$$
where $\Theta_{\mathrm{enc}}^{\pi}$ and $\theta^{\pi}$ denote the parameters of the actor encoder and policy network, respectively; $\varepsilon \sim \mathcal{N}(0, I)$; and $\mu(s)$ and $\sigma(s)$ represent the mean vector and standard deviation vector of the diagonal Gaussian distribution, respectively.
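The reparameterized sampling step in Equation (35) can be illustrated with the standard library alone. The function name and the `rng` argument are assumptions for this sketch; a training implementation would also need the log-probability with the tanh correction, which is omitted here.

```python
import math
import random

def sample_action(mu, log_sigma, rng=random):
    """Draw a squashed Gaussian action: a = tanh(mu + sigma * eps)."""
    action = []
    for m, ls in zip(mu, log_sigma):
        eps = rng.gauss(0.0, 1.0)        # eps ~ N(0, 1), one per dimension
        action.append(math.tanh(m + math.exp(ls) * eps))
    return action
```

The tanh squashing keeps each component inside the normalized action range used for the pitch and yaw increments.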
Similarly, the critics adopt the same TIB encoder architecture as the actor but use different parameters to avoid gradient interference between policy and critic updates.
$$\tilde{h}^{Q} = F_{\mathrm{enc}}\!\left(s;\, \Theta_{\mathrm{enc}}^{Q}\right) \tag{36}$$
$$Q_{i}(s, a) = f^{Q_i}\!\left([\tilde{h}^{Q}, a];\, \theta^{Q_i}\right) \tag{37}$$
where $i = 1, 2, \ldots, n_Q$, and $n_Q$ denotes the number of critic heads. $\Theta_{\mathrm{enc}}^{Q}$ and $\theta^{Q_i}$ denote the parameters of the critic encoder and the critic heads, respectively.

3.2.2. Conservative Soft Actor–Critic Algorithm

During path-following training, task difficulty spans a broad range and exhibits a long-tail distribution. Conventional SAC is a risk-neutral optimization method that primarily maximizes the overall expected return and lacks an explicit mechanism for handling rare but challenging operating conditions. When sample coverage is insufficient in these difficult regimes, critic value learning for the corresponding state–action pairs relies more heavily on function approximation with sparse data support, resulting in higher estimation uncertainty and lower reliability. Combined with the actor’s tendency to favor actions with high Q-values, this can produce locally optimistic value estimates and thus limit path-following performance in challenging regions.
To address this issue, this paper introduces a multi-head conservative value-evaluation mechanism. Specifically, $n_Q$ ($n_Q \ge 2$) critic heads with independent parameters are built on the shared critic TIB encoder. For any state–action pair $(s, a)$, the mean Q-value $\mu_Q$ and the critic disagreement $\sigma_Q$ are given by
$$\mu_{Q}(s, a) = \frac{1}{n_Q}\sum_{i=1}^{n_Q} Q_{i}(s, a) \tag{38}$$
$$\sigma_{Q}(s, a) = \sqrt{\frac{1}{n_Q}\sum_{i=1}^{n_Q}\left(Q_{i}(s, a) - \mu_{Q}(s, a)\right)^{2}} \tag{39}$$
Policy improvement is guided by $\mu_Q$, which represents the overall Q-value level, and $\sigma_Q$, which characterizes estimation uncertainty. This design discourages the policy from favoring actions with high predicted Q-values but large critic disagreement.
$$Q_{C}(s, a) = \mu_{Q}(s, a) - \ln(n_Q)\,\sigma_{Q}(s, a) \tag{40}$$
$$L_{\pi} = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot|s)}\left[\alpha\log\pi(a|s) - Q_{C}(s, a)\right] \tag{41}$$
where $\ln n_Q$ denotes the uncertainty-penalty coefficient. As $n_Q$ increases, $\sigma_Q$ can characterize value-estimation uncertainty more stably, and the penalty strength is therefore moderately increased. This growth is kept sublinear to avoid introducing an overly conservative bias that would impair policy-optimization efficiency.
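As a small numerical illustration of Equations (38)–(40), the conservative value for one state–action pair can be computed from the head outputs as below. The function name is hypothetical, and the disagreement is taken as the population standard deviation across heads, which is an assumption about the garbled original.

```python
import math

def conservative_q(q_values):
    """Mean Q-value minus ln(n_Q) times the critic disagreement."""
    n_q = len(q_values)
    if n_q < 2:
        raise ValueError("at least two critic heads are required")
    mu = sum(q_values) / n_q                                        # Eq. (38)
    sigma = math.sqrt(sum((q - mu) ** 2 for q in q_values) / n_q)   # Eq. (39)
    return mu - math.log(n_q) * sigma                               # Eq. (40)
```

When all heads agree the penalty vanishes and the conservative value equals the mean; with heads at 0 and 2, the mean of 1 is reduced by $\ln 2$.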
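The loss function for the entropy temperature coefficient $\alpha$ is defined as
$$L_{\alpha} = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot|s)}\left[-\alpha\left(\log\pi(a|s) + H_{\mathrm{tar}}\right)\right] \tag{42}$$
where $\mathcal{D}$ denotes the replay buffer, and $H_{\mathrm{tar}}$ is the target entropy.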
To mitigate value overestimation and error propagation in critic bootstrapping, the temporal-difference (TD) target for the $i$-th critic head is calculated using the minimum of two randomly selected and distinct target critic heads. Let $a_{t+1} \sim \pi(\cdot|s_{t+1})$; the TD target is
$$y_{t,i} = r_{RL,t} + \gamma(1 - d_t)\left[\min_{j \in \{n_1, n_2\}} \bar{Q}_{j}(s_{t+1}, a_{t+1}) - \alpha\log\pi(a_{t+1}|s_{t+1})\right] \tag{43}$$
where $\bar{Q}_{j}$ denotes the Q-value estimated by the $j$-th target critic; $n_1, n_2 \in \{1, 2, \ldots, n_Q\}$ and $n_1 \ne n_2$; $\gamma$ is the discount factor; and $d_t$ is the episode termination indicator.
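The TD target of Equation (43) for a single transition might look as follows in plain Python. The argument names are hypothetical; `target_q_next` is assumed to hold the outputs of all target heads at $(s_{t+1}, a_{t+1})$.

```python
import random

def td_target(reward, done, target_q_next, log_prob_next, alpha, gamma,
              rng=random):
    """Soft TD target using the minimum of two distinct random target heads."""
    # Pick two distinct head indices n1 != n2 without replacement.
    n1, n2 = rng.sample(range(len(target_q_next)), 2)
    soft_value = (min(target_q_next[n1], target_q_next[n2])
                  - alpha * log_prob_next)
    # Terminal transitions do not bootstrap.
    return reward + gamma * (1.0 - float(done)) * soft_value
```

Drawing a fresh pair of heads for each target is one way to obtain the randomized clipped minimum described in the text while keeping the heads diverse.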
To preserve critic diversity for uncertainty estimation, each critic head is updated using its own TD regression loss on an independently sampled mini-batch:
$$L_{i}^{Q} = \mathbb{E}_{(s_t, a_t, r_{RL,t}, s_{t+1}, d_t) \sim B_i,\; B_i \sim \mathcal{D}}\left[\left(Q_{i}(s_t, a_t) - y_{t,i}\right)^{2}\right] \tag{44}$$
where $B_i$ denotes the mini-batch independently sampled for the $i$-th value head from the replay buffer $\mathcal{D}$.
To ensure consistent state representations, the parameters of the TIB critic encoder are updated using the gradient of the mean TD loss:
$$L_{\mathrm{enc}}^{Q} = \frac{1}{n_Q}\sum_{i=1}^{n_Q} L_{i}^{Q} \tag{45}$$
The target critic encoder and its heads are updated through soft updates:
$$\bar{\Theta}_{\mathrm{enc}}^{Q} \leftarrow \tau\,\Theta_{\mathrm{enc}}^{Q} + (1 - \tau)\,\bar{\Theta}_{\mathrm{enc}}^{Q} \tag{46}$$
$$\bar{\theta}^{Q_i} \leftarrow \tau\,\theta^{Q_i} + (1 - \tau)\,\bar{\theta}^{Q_i} \tag{47}$$
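The soft updates in Equations (46) and (47) amount to an exponential moving average of the online parameters. A minimal sketch over flat parameter lists (a simplifying assumption; real networks hold tensors) is:

```python
def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_bar
            for w, w_bar in zip(online_params, target_params)]
```

A small `tau` makes the target critics change slowly, which stabilizes the bootstrapped TD targets.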
The overall algorithm is summarized in Algorithm 1.
Algorithm 1 Training procedure
Input: number of critic heads $n_Q$; time interval $\Delta t$; soft-update rate $\tau$;
   discount factor $\gamma$; batch size $n_B$; termination flag $d_t$;
Initialize: actor $\pi(a|s)$; critics $\{Q_i\}_{i=1}^{n_Q}$; temperature $\alpha$;
     target critics $\{\bar{Q}_i\}_{i=1}^{n_Q}$; replay buffer $\mathcal{D}$;
for each environment step do
  // Interaction
  Get current observation $s_t$;
  Sample action $a_t \sim \pi(\cdot|s_t)$;
  Execute $a_t$, obtain $r_{RL,t}$, $s_{t+1}$, $d_t$;
  If $d_t$ = true, reset the environment;
  Store $(s_t, a_t, r_{RL,t}, s_{t+1}, d_t)$ into $\mathcal{D}$;
  // Parameter updates
  if $|\mathcal{D}| > n_B$ then
    Sample independent mini-batches $\{B_i\}_{i=1}^{n_Q}$;
    Sample next action $a_{t+1} \sim \pi(\cdot|s_{t+1})$ for each $s_{t+1}$ in $\{B_i\}_{i=1}^{n_Q}$;
    Compute TD target using Equation (43);
    Update critic using Equations (44) and (45);
    Update actor using Equation (41);
    Update $\alpha$ using Equation (42);
    Soft update target critics using Equations (46) and (47);
  end if
end for

4. Experiments

4.1. Basic Settings

To facilitate path-following learning, each training episode consists of a two-segment reference path randomly initialized under the constraints in Equation (6). At the beginning of each episode, the initial AUV velocity is set to zero, and the desired surge speed $u_d$ is randomly sampled from [0.5, 2.0] m/s. This setting covers the acceleration process from zero speed to the desired surge speed, thereby including the initial low-authority phase of the X-rudder in the training distribution. To preserve state continuity and ensure stable policy updates, the observation index $i_o$ is kept fixed at its initial value for most of the episode. Therefore, an episode terminates when either the maximum time limit $t_{\max}^{t}$ is exceeded or the stabilization-time criterion in Equation (13) is satisfied. The observation index $i_o$ is updated only in the latter case, as shown in Figure 3.
The hydrodynamic parameters are adopted from Ref. [25] and are provided in Table 1; the training task settings are summarized in Table 2. The bounded time-varying external disturbances acting on the AUV are given in Equation (48). The sine and cosine terms approximate low-frequency force/moment disturbances with wave-like periodic characteristics.
$$\begin{aligned}
\tau_{u}^{d}(t) &= \frac{u_d}{u_{d,\max}}\left(5\sin(k_{1}^{d}t + \varphi_{1}^{d}) + 5\cos(k_{1}^{d}t + \varphi_{2}^{d}) + 5\cos(k_{1}^{d}t + \varphi_{3}^{d})\right) \\
\tau_{v}^{d}(t) &= \frac{u_d}{u_{d,\max}}\left(0.5\sin(k_{1}^{d}t + \varphi_{1}^{d}) + \cos(k_{1}^{d}t + \varphi_{2}^{d}) + 0.1\cos(k_{1}^{d}t + \varphi_{3}^{d})\right) \\
\tau_{w}^{d}(t) &= \frac{u_d}{u_{d,\max}}\left(0.5\sin(k_{1}^{d}t + \varphi_{1}^{d}) + \cos(k_{1}^{d}t + \varphi_{2}^{d}) + 0.1\cos(k_{1}^{d}t + \varphi_{3}^{d})\right) \\
\tau_{q}^{d}(t) &= \frac{u_d}{u_{d,\max}}\left(\sin(k_{2}^{d}t + \varphi_{1}^{d}) + 2\cos(k_{2}^{d}t + \varphi_{2}^{d}) + 2\cos(k_{1}^{d}t + \varphi_{3}^{d})\right) \\
\tau_{r}^{d}(t) &= \frac{u_d}{u_{d,\max}}\left(\sin(k_{2}^{d}t + \varphi_{1}^{d}) + \cos(k_{2}^{d}t + \varphi_{2}^{d}) + 2\cos(k_{1}^{d}t + \varphi_{3}^{d})\right)
\end{aligned} \tag{48}$$
where $k_{1}^{d}$ and $k_{2}^{d}$ denote the disturbance angular frequencies, while $\varphi_{1}^{d}$, $\varphi_{2}^{d}$, and $\varphi_{3}^{d}$ denote the disturbance phase shifts, which are randomly sampled at the beginning of each episode within the predefined ranges in Table 2. In the subsequent comparative tests, these parameters are fixed to $k_{1}^{d} = 0.1$, $k_{2}^{d} = 0.2$, $\varphi_{1}^{d} = 0$, $\varphi_{2}^{d} = \pi/4$, and $\varphi_{3}^{d} = \pi/6$.
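The disturbance model in Equation (48) can be sketched as a table of amplitudes plus one evaluation function. The defaults use the fixed test values quoted above; taking $u_{d,\max} = 2.0$ m/s from the upper end of the sampling range is an assumption, and the names `CHANNELS` and `disturbance` are hypothetical.

```python
import math

# Per channel: amplitudes of the sin, cos, cos terms and, for each term,
# which angular frequency it uses (0 -> k1, 1 -> k2), as read from Eq. (48).
CHANNELS = {
    "u": ((5.0, 5.0, 5.0), (0, 0, 0)),
    "v": ((0.5, 1.0, 0.1), (0, 0, 0)),
    "w": ((0.5, 1.0, 0.1), (0, 0, 0)),
    "q": ((1.0, 2.0, 2.0), (1, 1, 0)),
    "r": ((1.0, 1.0, 2.0), (1, 1, 0)),
}

def disturbance(channel, t, u_d, u_d_max=2.0, k=(0.1, 0.2),
                phi=(0.0, math.pi / 4, math.pi / 6)):
    """Bounded periodic disturbance, scaled by the desired surge speed."""
    amps, k_idx = CHANNELS[channel]
    waves = (math.sin, math.cos, math.cos)
    total = sum(a * f(k[i] * t + p)
                for a, f, i, p in zip(amps, waves, k_idx, phi))
    return u_d / u_d_max * total
```

Since every term is a bounded sinusoid, the magnitude of each channel never exceeds the sum of its amplitudes times the speed ratio.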
In this study, the AUV speed is regulated by an incremental S-plane controller, with the thruster parameters and control law provided in Appendix A. X-rudder path-following control is realized using three off-policy DRL methods: SAC [26], DDPG [27], and the proposed method. SAC is selected as the main baseline to directly evaluate the contributions of the proposed modules within the SAC framework. DDPG is included as another representative off-policy continuous-control algorithm, providing a deterministic-policy comparison. A conventional hierarchical model-free benchmark is also included, combining LOS guidance [18], an S-plane controller, and pseudoinverse-based control allocation. The algorithm settings are summarized in Table 3. Shared or comparable hyperparameters and network scales are kept consistent wherever applicable so that the comparison is conducted under unified algorithmic configurations and comparable model capacities.

4.2. Training Results

To mitigate stochastic effects and ensure reliable comparisons, each method was trained independently for 1200 episodes using five different random seeds. The training results are shown in Figure 4, where the solid lines represent the episode-wise mean return R ¯ j across five runs, and the shaded regions indicate the corresponding episode-wise standard deviation σ j . As illustrated in Figure 4, TIB-CSAC converges faster during training; the magnified view also shows superior late-stage stability, with no sharp declines in episode rewards as observed for DDPG and SAC.
$$
\bar{R}_j = \frac{1}{K}\sum_{k=1}^{K} R_{j,k}
\tag{49}
$$
$$
\sigma_j = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(R_{j,k}-\bar{R}_j\right)^2}
\tag{50}
$$
where $R_{j,k}$ denotes the episode return at episode $j$ in the $k$-th run, $k = 1, 2, \ldots, K$, and $K = 5$.
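As a small illustration of the two statistics above, the following sketch computes the episode-wise mean and sample standard deviation across runs. The function name `episode_stats` and the nested-list layout `returns[k][j]` are assumptions made for this example only.

```python
import statistics


def episode_stats(returns):
    """Episode-wise mean and sample standard deviation across runs.

    returns[k][j] is the return of episode j in run k; all runs are
    assumed to have the same length, and at least two runs are needed.
    """
    n_eps = len(returns[0])
    mean, std = [], []
    for j in range(n_eps):
        column = [run[j] for run in returns]
        mean.append(statistics.fmean(column))
        # statistics.stdev uses K - 1 in the denominator, matching the text.
        std.append(statistics.stdev(column))
    return mean, std
```

The mean list corresponds to the solid curves and the standard-deviation list to the shaded bands of a plot such as Figure 4.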
To quantify tail-stage performance, as defined in Equations (51)–(53), let $\bar{R}_{\text{tail}}$ denote the mean episode return over the tail stage of training, averaged across runs, and let $\sigma_{\text{tail}}$ denote the within-run standard deviation of episode returns over the tail stage, averaged across runs. As detailed in Table 4, compared with the SAC and DDPG baselines, the proposed method increases $\bar{R}_{\text{tail}}$ from $-2.7927 \times 10^{2}$ and $-5.5372 \times 10^{2}$, respectively, to $-1.7244 \times 10^{2}$ and reduces $\sigma_{\text{tail}}$ from $5.1210 \times 10^{2}$ and $4.6465 \times 10^{2}$ to $1.2368 \times 10^{2}$, indicating improved tail-stage training performance and enhanced tail-stage stability.
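$$
\bar{R}_{\text{tail},k} = \frac{1}{N_{\text{tail}}}\sum_{j=n_{\text{eps}}-N_{\text{tail}}+1}^{n_{\text{eps}}} R_{j,k}
\tag{51}
$$
$$
\bar{R}_{\text{tail}} = \frac{1}{K}\sum_{k=1}^{K}\bar{R}_{\text{tail},k}
\tag{52}
$$
$$
\sigma_{\text{tail}} = \frac{1}{K}\sum_{k=1}^{K}\sqrt{\frac{1}{N_{\text{tail}}-1}\sum_{j=n_{\text{eps}}-N_{\text{tail}}+1}^{n_{\text{eps}}}\left(R_{j,k}-\bar{R}_{\text{tail},k}\right)^2}
\tag{53}
$$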
where $N_{\text{tail}}$ denotes the number of episodes in the tail stage of training and is set to 200.
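The tail-stage metrics of Equations (51)–(53) reduce to a per-run mean and sample standard deviation over the last 200 episodes, followed by an average over runs. A minimal sketch, assuming the hypothetical name `tail_metrics` and the same `returns[k][j]` layout as a list of per-run return lists:

```python
import statistics


def tail_metrics(returns, n_tail=200):
    """Return (R_tail, sigma_tail) following Equations (51)-(53).

    returns[k] is the list of episode returns of run k; each run is
    assumed to contain at least n_tail episodes (and n_tail >= 2).
    """
    tails = [run[-n_tail:] for run in returns]
    r_tail_per_run = [statistics.fmean(t) for t in tails]     # Eq. (51)
    r_tail = statistics.fmean(r_tail_per_run)                 # Eq. (52)
    sigma_per_run = [statistics.stdev(t) for t in tails]      # inner term of Eq. (53)
    sigma_tail = statistics.fmean(sigma_per_run)              # Eq. (53)
    return r_tail, sigma_tail
```

Note that the standard deviation is taken within each run before averaging, so $\sigma_{\text{tail}}$ reflects late-stage fluctuations of individual runs rather than the spread between seeds.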
To quantify training efficiency, episode returns are first smoothed using a $W$-episode moving average. For the $k$-th run, $\mathrm{TRE}_k$ is defined as the earliest episode $t$ at which the smoothed return satisfies $\tilde{R}_{t,k} \ge R_{\text{th}}$, as in Equation (55), and the overall $\overline{\mathrm{TRE}}$ is obtained by averaging over the $K$ runs according to Equation (56). Here, $W = 50$ and $R_{\text{th}} = -1 \times 10^{3}$. Moreover, the average area under the training curve, $\overline{\mathrm{AUC}}$, is used to quantify overall training quality, as defined in Equation (57).
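$$
\tilde{R}_{t,k} = \frac{1}{W}\sum_{j=1}^{W} R_{t-j,k}, \quad t \ge W
\tag{54}
$$
$$
\mathrm{TRE}_k = \min\left\{\, t \in \{W, \ldots, n_{\text{eps}}\} \;\middle|\; \tilde{R}_{t,k} \ge R_{\text{th}} \right\}
\tag{55}
$$
$$
\overline{\mathrm{TRE}} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{TRE}_k
\tag{56}
$$
$$
\overline{\mathrm{AUC}} = \frac{1}{K}\sum_{k=1}^{K}\int_{1}^{n_{\text{eps}}} R_{j,k}\,\mathrm{d}j
\tag{57}
$$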
As detailed in Table 4, TIB-CSAC reaches the near-final performance threshold in fewer training episodes than SAC and DDPG: its $\overline{\mathrm{TRE}}$ is $2.9560 \times 10^{2}$, compared with $7.3320 \times 10^{2}$ and $9.3920 \times 10^{2}$, respectively. It also attains a less negative $\overline{\mathrm{AUC}}$, namely $-2.6398 \times 10^{6}$ versus $-6.5231 \times 10^{6}$ and $-9.2165 \times 10^{6}$, indicating improved sample efficiency and overall training quality.
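The threshold-reaching episode and the area under the training curve of Equations (55)–(57) can be sketched for a single run as follows. The names `tre` and `auc` are hypothetical, the moving-average window is taken here as the `window` episodes ending at episode `t` (an indexing assumption that may differ by one episode from the smoothing used in the study), and the integral is approximated with the trapezoidal rule at unit episode spacing, which is also an assumption.

```python
def tre(run, r_th, window=50):
    """Earliest 1-based episode t whose trailing moving average reaches r_th.

    Returns None if the smoothed return never reaches the threshold.
    """
    for t in range(window, len(run) + 1):
        smoothed = sum(run[t - window:t]) / window  # mean of the window ending at t
        if smoothed >= r_th:
            return t
    return None


def auc(run):
    """Trapezoidal area under the return curve with unit episode spacing."""
    return sum(0.5 * (a + b) for a, b in zip(run, run[1:]))
```

Averaging `tre` and `auc` over the $K$ runs then gives the two aggregate quantities; a run that never reaches the threshold returns `None` and would need separate handling before averaging.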

4.3. Path Following Results

4.3.1. Case 1 Boustrophedon Path

A boustrophedon path is used to assess horizontal path-following performance during aggressive orthogonal turns under varying desired speeds $u_d$. In the horizontal plane, the path consists of 400 m survey lines connected by 100 m short transects. Moreover, $u_d$ is updated after the AUV completes two consecutive turns and then travels 100 m along the path, with $u_d$ taking values of 2.0, 1.7, 1.4, 1.1, 0.8, and 0.5 m/s. The detailed results are shown in Figure 5, Figure 6, Figure 7 and Figure 8.
As shown by the overall trajectories in Figure 5, all methods accomplish the prescribed boustrophedon path-following task. The zoomed-in views in Figure 5b further indicate that the main differences among these methods are concentrated in a few challenging regions, particularly around orthogonal turns and the subsequent transition phases. As shown in Figure 6 and Figure 7, under different velocity conditions, TIB-CSAC achieves tighter adherence to the reference path, with overall position errors remaining within a relatively small range. During orthogonal-turn transitions, despite variations in rudder hydrodynamic effectiveness across velocities, TIB-CSAC consistently maintains lower horizontal overshoot and faster error decay. By contrast, although the baseline methods keep the horizontal error close to zero under most conditions, they tend to exhibit larger overshoot peaks or oscillations and slower error decay around orthogonal turns. Overall, the results of Case 1 show that TIB-CSAC provides more robust horizontal path-following performance across a range of surge velocities, especially in the small subset of challenging orthogonal-turn conditions.

4.3.2. Case 2 Trapezoidal Path

A 3D trapezoidal path is used to validate the transient path-following response under simultaneous segment-wise variations in the desired path heading, path pitch, and surge velocity. The desired surge velocity $u_d$ is set sequentially to [2, 1.875, 1.75, 1.625, 1.5, 1.375, 1.25, 1.125, 1, 0.875, 0.75, 0.625, 0.5] m/s; the path pitch is set to [−0.1222, 0, 0.1047, 0, −0.0873, 0, 0.0698, 0, −0.0524, 0, 0.0349, 0, −0.0175] rad; and the path heading is set to [0, 1.5708, 0, 1.309, 0, 1.0472, 0, 0.7854, 0, 0.5236, 0, 0.2618, 0] rad. The length of each segment is 100 m. The detailed results are shown in Figure 9, Figure 10, Figure 11 and Figure 12.
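The three sequences above follow a regular pattern that is easier to see in degrees. The sketch below regenerates them under the reading that the pitch magnitudes are 7°, 6°, …, 1° with alternating sign on every other segment, and that the headings step down from 90° in 15° increments; this reading is an inference from the listed values, and the variable names are chosen for this example only.

```python
import math

N_SEG = 13  # number of 100 m segments in Case 2

# Desired surge velocity: 2.0 m/s decreasing by 0.125 m/s per segment.
speeds = [2.0 - 0.125 * i for i in range(N_SEG)]

# Pitch: nonzero on even-indexed segments, magnitude 7, 6, ..., 1 deg, alternating sign.
pitch_deg = [((-1) ** (i // 2 + 1)) * (7 - i // 2) if i % 2 == 0 else 0
             for i in range(N_SEG)]

# Heading: nonzero on odd-indexed segments, 90, 75, ..., 15 deg.
heading_deg = [90 - 15 * (i // 2) if i % 2 == 1 else 0 for i in range(N_SEG)]

# Convert to radians, rounded to the four decimals used in the text.
pitch_rad = [round(math.radians(p), 4) for p in pitch_deg]
heading_rad = [round(math.radians(h), 4) for h in heading_deg]
```

Under this reading, both the turn angle and the speed shrink along the path, so the sharpest heading and pitch changes coincide with the highest speeds in the first segments.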
As shown in Figure 9, all methods complete the 3D trapezoidal path-following task, and the main discrepancies arise around path-segment transitions. As indicated by Figure 10 and Figure 11, TIB-CSAC yields smaller overall position errors and more stable transitions between path segments than the baseline methods under simultaneous variations in surge velocity and path geometry. In particular, under higher-velocity conditions, TIB-CSAC more effectively suppresses horizontal error peaks caused by large path-heading changes and mitigates persistent vertical-error oscillations associated with large path-pitch variations. As shown in Figure 12, the X-rudder angles of TIB-CSAC remain smooth and continuous except for brief pulse-like adjustments during path-segment transitions. Overall, Case 2 demonstrates that TIB-CSAC provides a more stable transient response under simultaneous variations in surge velocity and path geometry.

4.3.3. Case 3 Random Path

This section evaluates path-following performance on a 3D random path under time-varying surge-velocity conditions. Table 5 summarizes the waypoint configuration of the reference path, which consists of 20 waypoints in total, with the coordinate origin defined as the path-starting point. The velocity is set to $u_d(t) = 0.75\cos(0.01t) + 1.25$. The results are shown in Figure 13, Figure 14, Figure 15 and Figure 16.
As shown in Figure 13, Figure 14 and Figure 15, the main differences in the 3D random path-following results are concentrated in challenging regions associated with higher surge velocities and large path-geometry variations. The baseline methods exhibit different error patterns: DDPG and LOS+SP show larger horizontal overshoot and more pronounced vertical-error oscillations in several regions, whereas SAC exhibits horizontal and vertical error fluctuations in certain intervals, accompanied by sustained oscillations in surge velocity. By contrast, under the combined effects of time-varying speed and random path-geometry variations, TIB-CSAC more effectively mitigates horizontal overshoot and vertical oscillations, demonstrating more robust path-following performance under diverse operating conditions.

4.3.4. Metric Analyses

Four metrics are used to quantitatively evaluate path-following performance: the maximum absolute error (MXAE), mean absolute error (MAE), conditional value at risk at 95% (CVaR95), and threshold-exceedance time ratio Vε. CVaR95 and Vε are defined in Equations (58)–(60). Specifically, CVaR95 denotes the conditional mean of the position-error magnitude beyond the 95th percentile of |e(t)|, while Vε denotes the fraction of time for which |e(t)| exceeds the threshold ε = 0.1 m. Detailed quantitative results are summarized in Table 6 and Table 7.
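$$
\mathrm{CVaR}_{95} = \mathbb{E}\left[\,|e(t)| \;\middle|\; |e(t)| \ge q_{95}\right]
\tag{58}
$$
$$
q_{95} = \inf\left\{ q : \Pr\left(|e(t)| \le q\right) \ge 0.95 \right\}
\tag{59}
$$
$$
V_{\varepsilon} = \frac{1}{T}\int_{0}^{T}\chi\left(|e(t)| > \varepsilon\right)\mathrm{d}t
\tag{60}
$$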
where e(t) denotes the position error, q95 is the 95th percentile of |e(t)|, Pr(⋅) denotes probability, inf{⋅} denotes the infimum, χ(⋅) is the indicator function, and T denotes the task duration.
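For a uniformly sampled error signal, Equations (58)–(60) reduce to an empirical quantile, a tail mean, and a counting ratio. The sketch below is one way to compute them; the function name `cvar95_and_exceedance` is hypothetical, and the empirical-quantile convention (smallest sample with at least 95% of the samples at or below it) is an assumption about how the percentile is evaluated on discrete data.

```python
import math


def cvar95_and_exceedance(errors, eps=0.1, level=0.95):
    """Return (CVaR, quantile, V_eps) for uniformly sampled position errors."""
    mags = sorted(abs(e) for e in errors)
    n = len(mags)
    # Smallest sample q such that at least `level` of the samples are <= q.
    # The small offset guards against floating-point round-up in level * n.
    idx = max(0, math.ceil(level * n - 1e-9) - 1)
    q = mags[idx]
    tail = [m for m in mags if m >= q]
    cvar = sum(tail) / len(tail)                  # mean error magnitude in the tail
    v_eps = sum(1 for m in mags if m > eps) / n   # fraction of samples above eps
    return cvar, q, v_eps
```

With uniform sampling, the time integral in the exceedance ratio becomes a simple fraction of samples, which is why no time step appears in the function.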
Averaged over the three cases, TIB-CSAC yields lower overall error metrics than baseline SAC. In terms of MXAE, the average vertical and horizontal MXAE values of TIB-CSAC are reduced by $1.4645 \times 10^{-1}$ m and 6.4623 m, respectively, indicating a stronger ability to suppress extreme errors. For MAE, the corresponding reductions are $1.2965 \times 10^{-2}$ m and $4.2044 \times 10^{-1}$ m, indicating closer overall adherence to the reference path. For CVaR95, the corresponding reductions are $3.7903 \times 10^{-2}$ m and 3.8225 m, indicating more effective suppression of tail errors under difficult operating conditions. For Vε, the corresponding reductions are $1.5675 \times 10^{-2}$ and $3.5120 \times 10^{-1}$, indicating a lower proportion of time with threshold-exceeding errors and better maintenance of small-error path-following.
Overall, TIB-CSAC exhibits reliable path-following performance under diverse surge-velocity and path-geometry conditions, with clear advantages in overall path adherence, error-peak suppression, tail-error reduction, and persistence within the prescribed error threshold.
Several supplementary metrics are also introduced to quantitatively evaluate X-rudder control behavior and task-level effort. These metrics include the rudder saturation ratio $R_{\delta}^{\text{sat}}$, the second-order rudder-angle difference index $J_{\Delta^2\delta}$, the rudder-rate smoothness index $J_{\dot{\delta}}$, the speed-weighted rudder-effort proxy $J_{\delta,u}$, and the AUV energy-consumption proxy $J_{\text{AUV}}$, as defined in Equations (61)–(65). Here, $N$ is the number of time steps. The quantitative results are summarized in Table 8.
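$$
R_{\delta}^{\text{sat}} = \frac{1}{4N}\sum_{i=1}^{4}\sum_{k=1}^{N}\chi\left(|\delta_i(k)| \ge 0.95\,\delta_{\max}\right)
\tag{61}
$$
$$
J_{\Delta^2\delta} = \frac{1}{4(N-2)}\sum_{i=1}^{4}\sum_{k=3}^{N}\left(\delta_i(k) - 2\delta_i(k-1) + \delta_i(k-2)\right)^2
\tag{62}
$$
$$
J_{\dot{\delta}} = \frac{1}{4(N-1)}\sum_{i=1}^{4}\sum_{k=2}^{N}\left(\frac{\delta_i(k) - \delta_i(k-1)}{\Delta t}\right)^2
\tag{63}
$$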
$$
J_{\delta,u} = \frac{1}{4N}\sum_{i=1}^{4}\sum_{k=1}^{N}\left(u(k)\,\delta_i(k)\right)^2
\tag{64}
$$
$$
J_{\text{AUV}} = \Delta t\sum_{k=1}^{N} u^{3}(k)
\tag{65}
$$
As shown in Table 8, TIB-CSAC maintains a relatively low rudder saturation ratio in all three cases, although LOS+SP or DDPG achieves lower $R_{\delta}^{\text{sat}}$ values in some cases. For the second-order rudder-angle difference index $J_{\Delta^2\delta}$, SAC generally yields smaller values, while TIB-CSAC remains lower than LOS+SP and DDPG. This indicates that the proposed method introduces necessary corrective rudder actions during path transitions without causing severe rudder-angle fluctuations. In terms of the rudder-rate smoothness index $J_{\dot{\delta}}$, TIB-CSAC achieves the lowest values in all three cases, suggesting smoother rudder-rate behavior. For the speed-weighted rudder-effort proxy $J_{\delta,u}$, TIB-CSAC is not always the lowest, but it is consistently lower than SAC. Moreover, TIB-CSAC achieves the lowest AUV energy-consumption proxy $J_{\text{AUV}}$ and the shortest task-completion time $T$ in all three cases. These results indicate that the improved path-following performance of TIB-CSAC is not obtained through persistent rudder saturation or excessive task-level effort but through generally reasonable actuator behavior.
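The three purely rudder-based indices (saturation ratio, second-order difference, and rate smoothness) can be computed from logged rudder angles as sketched below. The function name `rudder_metrics` and the `delta[i][k]` layout are assumptions for this example; the default `delta_max = 0.4189` rad is the rudder limit listed in Table 1. The two speed-dependent proxies are left out of this sketch.

```python
def rudder_metrics(delta, dt, delta_max=0.4189):
    """Return (saturation ratio, second-difference index, rate index).

    delta[i][k] is the angle in rad of rudder i at time step k; all rudders
    are assumed to share the same number of steps N >= 3.
    """
    n_rud = len(delta)   # four rudders for an X-rudder configuration
    n = len(delta[0])
    # Fraction of samples at or beyond 95% of the rudder limit.
    sat = sum(abs(d) >= 0.95 * delta_max for row in delta for d in row) / (n_rud * n)
    # Mean squared second-order difference of the rudder angles.
    d2 = sum((row[k] - 2 * row[k - 1] + row[k - 2]) ** 2
             for row in delta for k in range(2, n)) / (n_rud * (n - 2))
    # Mean squared finite-difference rudder rate.
    rate = sum(((row[k] - row[k - 1]) / dt) ** 2
               for row in delta for k in range(1, n)) / (n_rud * (n - 1))
    return sat, d2, rate
```

The second-difference index penalizes changes in rudder rate (chattering), whereas the rate index penalizes fast rudder motion in general, so the two can rank controllers differently.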

4.4. Ablation Study

To further analyze the contribution of each module, this section conducts an ablation study on the full TIB-CSAC method and three comparison variants: the baseline SAC; CSAC, in which the TIB encoder is replaced by fully connected layers of comparable size; and TIB-SAC, which combines the TIB encoder with the baseline SAC. Each method is trained for 1200 episodes using five random seeds.

4.4.1. Quantitative Analysis of the TIB Encoder

Table 9 and Figure 17 quantify the contribution of the TIB encoder through two controlled comparisons: CSAC versus TIB-CSAC and SAC versus TIB-SAC. Relative to TIB-CSAC, CSAC increases $\overline{\mathrm{TRE}}$ from $2.9560 \times 10^{2}$ to $6.4180 \times 10^{2}$, which is 2.17 times that of TIB-CSAC. Meanwhile, $\overline{\mathrm{AUC}}$ decreases from $-2.6398 \times 10^{6}$ to $-5.4341 \times 10^{6}$, which is 2.06 times the absolute magnitude of TIB-CSAC. A consistent trend is observed in the comparison between TIB-SAC and SAC. Specifically, TIB-SAC reduces $\overline{\mathrm{TRE}}$ from $7.3320 \times 10^{2}$ to $4.3060 \times 10^{2}$, which is 0.59 times that of SAC, while $\overline{\mathrm{AUC}}$ increases from $-6.5231 \times 10^{6}$ to $-3.9544 \times 10^{6}$, which is 0.61 times the absolute magnitude of SAC. In addition, the tail-stage metrics show consistent but secondary improvements after introducing the TIB encoder, with $\sigma_{\text{tail}}$ decreasing in both SAC versus TIB-SAC and CSAC versus TIB-CSAC. Overall, these results indicate that the structured and conditioned representations induced by the TIB encoder primarily accelerate policy formation and improve overall training quality.

4.4.2. Quantitative Analysis of the Conservative Value Evaluation Mechanism

The contribution of the conservative value-evaluation mechanism is quantified through two controlled comparisons: TIB-SAC versus TIB-CSAC and SAC versus CSAC. Relative to TIB-CSAC, the $\sigma_{\text{tail}}$ of TIB-SAC increases from $1.2368 \times 10^{2}$ to $3.5529 \times 10^{2}$, which is 2.87 times that of TIB-CSAC. A similar effect is observed in the comparison between CSAC and SAC, where CSAC reduces $\sigma_{\text{tail}}$ from $5.1210 \times 10^{2}$ to $1.3339 \times 10^{2}$, which is 0.26 times that of SAC. Collectively, these results indicate that the multi-head conservative value-evaluation mechanism effectively mitigates tail-stage return fluctuations and reduces return drawdowns induced by challenging tasks, thereby improving tail-stage training stability.

4.4.3. Overall Ablation Discussion

Compared with baseline SAC, TIB-CSAC reduces $\overline{\mathrm{TRE}}$ and $\sigma_{\text{tail}}$ to 0.40 and 0.24 times those of SAC, respectively, while yielding less negative $\overline{\mathrm{AUC}}$ and $\bar{R}_{\text{tail}}$ values whose absolute magnitudes are reduced to 0.40 and 0.62 times those of SAC, respectively. These results indicate that combining the TIB encoder with the conservative value-evaluation mechanism enables TIB-CSAC to achieve faster convergence, higher overall training quality, and better tail-stage performance.
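The ratios quoted above follow directly from the values reported in Sections 4.2 and 4.4. As a quick arithmetic check, the snippet below recomputes them; the dictionary layout and variable names are illustrative only.

```python
# (TRE, AUC, R_tail, sigma_tail) as quoted in the text for the two methods.
metrics = {
    "SAC":      (7.3320e2, -6.5231e6, -2.7927e2, 5.1210e2),
    "TIB-CSAC": (2.9560e2, -2.6398e6, -1.7244e2, 1.2368e2),
}

# Ratio of each TIB-CSAC value to the corresponding SAC value, two decimals.
ratios = [round(a / b, 2) for a, b in zip(metrics["TIB-CSAC"], metrics["SAC"])]
tre_ratio, auc_ratio, r_tail_ratio, sigma_tail_ratio = ratios
```

For the two negative-valued metrics (area under the curve and tail return), a ratio below one means a smaller absolute magnitude, that is, a less negative and therefore better value.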

4.5. Sensitivity Study

4.5.1. Sensitivity Analysis on Critic Size

To further investigate the influence of a key hyperparameter on the performance of the proposed algorithm, this section conducts a sensitivity study on the critic size $n_Q$ used in the multi-head conservative value evaluation mechanism. For each parameter setting, training is performed for 1200 episodes using five random seeds.
As shown in Figure 18 and Table 10, when $n_Q = 2$, the return curves remain in a low-return region with severe fluctuations, indicating that an overly small $n_Q$ is insufficient to provide reliable conservative value evaluation for effective policy learning. With $n_Q = 4$, the return curves begin to enter the high-return region, but noticeable return drawdowns still persist after convergence, suggesting that value-evaluation robustness remains inadequate. Increasing $n_Q$ from 4 to 6 effectively improves tail-stage stability, with $\sigma_{\text{tail}}$ decreasing from $5.3876 \times 10^{2}$ to $1.2368 \times 10^{2}$, which is 0.23 times that of the $n_Q = 4$ setting. By contrast, further increasing $n_Q$ from 6 to 8 yields only a modest additional reduction in $\sigma_{\text{tail}}$, from $1.2368 \times 10^{2}$ to $1.0508 \times 10^{2}$ (0.85 times that of the $n_Q = 6$ setting), while convergence becomes slower, with $\overline{\mathrm{TRE}}$ increasing from $2.9560 \times 10^{2}$ to $3.3480 \times 10^{2}$ (1.13 times that of the $n_Q = 6$ setting). This trend suggests that an overly large $n_Q$ may further suppress tail-stage volatility, but the associated increase in conservatism may also make policy improvement more cautious, thereby reducing sample efficiency. Therefore, under the algorithmic setting adopted in this study, $n_Q = 6$ provides a favorable trade-off between convergence speed and tail-stage performance.

4.5.2. Sensitivity Analysis on Reward Weights

To further evaluate the influence of reward weights on path-following performance, a sensitivity analysis was conducted for $k_{r,1}$, $k_{r,2}$, and $k_{r,3}$. In addition to the default setting $k_{r,1} = k_{r,2} = k_{r,3} = 5$, three reward-weight combinations were tested in Cases 1–3: $[k_{r,1}, k_{r,2}, k_{r,3}]$ = [7.50, 3.75, 3.75], [3.75, 7.50, 3.75], and [3.75, 3.75, 7.50]. These settings increase the relative weights of the horizontal-error penalty, vertical-error penalty, and reverse-motion penalty, respectively. The averaged metric results over Cases 1–3 are shown in Figure 19, and the detailed metrics for each case are provided in Table 11 and Table 12.
As shown in Figure 19, the relative allocation of reward weights affects the emphasis of path-following performance. Relative to the default setting, increasing the relative weight of $k_{r,1}$ improves most horizontal error metrics, especially by reducing the horizontal threshold-exceedance ratio Vε and horizontal tail error CVaR95. Increasing the relative weight of $k_{r,2}$ reduces the vertical Vε, CVaR95, and MXAE compared with the default setting, but it also increases the horizontal error metrics, indicating that emphasizing vertical-error suppression may weaken horizontal path-following performance. After $k_{r,3}$ is increased, the horizontal and vertical error metrics remain generally close to those of the default setting. This occurs because the reverse-motion penalty mainly serves as an auxiliary constraint: when the AUV rarely tends to move opposite to the path direction, increasing $k_{r,3}$ has a relatively limited effect on the final path-following error metrics.
Overall, although the setting $k_{r,1} = k_{r,2} = k_{r,3} = 5$ does not achieve the best value for every individual metric, it provides a more balanced performance between horizontal and vertical path-following errors and avoids significant degradation in either error category. Therefore, this weight setting is adopted as the default reward-function configuration.

4.6. Stochastic Disturbance

To further evaluate the robustness of the proposed controller under stochastic environmental disturbances, an additional test is conducted using a first-order Gauss–Markov disturbance model [28]. The test is performed on a 3D random path with 20 waypoints, where the coordinate origin is defined as the path-starting point. The desired surge velocity is set to $u_d(t) = 0.75\cos(0.01t) + 1.25$. The waypoint configuration is summarized in Table 13, and the results are shown in Figure 20, Figure 21 and Figure 22 and Table 14 and Table 15.
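$$
\begin{aligned}
\tau_k^d &= \operatorname{clip}\left(\tau_k^d, -A, A\right)\\
\tau_{k+1}^d &= \rho\,\tau_k^d + \sigma^d\sqrt{1-\rho^2}\,\varepsilon_k\\
\rho &= \exp\left(-\Delta t / T_c\right)
\end{aligned}
\tag{66}
$$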
where $\tau^d = [\tau_u^d, \tau_v^d, \tau_w^d, \tau_q^d, \tau_r^d]^{\mathrm{T}}$; $A = [15.0, 1.6, 1.6, 5.0, 4.0]^{\mathrm{T}}$ denotes the theoretical upper bound of each disturbance channel in Equation (48); $\varepsilon_k \sim \mathcal{N}(0, I_5)$; $\sigma^d = A/3$; and $T_c = 30$ s.
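A first-order Gauss–Markov process of this kind can be simulated with a one-line recursion per channel. The sketch below uses the bounds, standard deviations, and correlation time given above; the function name `gauss_markov_step`, the zero initial state, and the time step passed by the caller are assumptions, since they are not specified in this section.

```python
import math
import random

A = [15.0, 1.6, 1.6, 5.0, 4.0]      # per-channel bounds (surge, sway, heave, pitch, yaw)
SIGMA = [a / 3.0 for a in A]        # stationary standard deviations, A / 3
T_C = 30.0                          # correlation time in seconds


def gauss_markov_step(tau, dt, rng):
    """Advance the five-channel disturbance by one step of length dt."""
    rho = math.exp(-dt / T_C)
    gain = math.sqrt(1.0 - rho ** 2)
    nxt = [rho * x + s * gain * rng.gauss(0.0, 1.0) for x, s in zip(tau, SIGMA)]
    # Clip every channel to [-A, A] so the propagated state stays bounded.
    return [max(-a, min(a, x)) for x, a in zip(nxt, A)]
```

A typical use is `tau = gauss_markov_step(tau, dt, random.Random(seed))` inside the simulation loop, starting from `tau = [0.0] * 5`. With $\sigma^d = A/3$, the clipping bound sits at three standard deviations of the stationary distribution, so clipping is only rarely active.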
As shown in Figure 20, all methods complete the 3D path-following task under stochastic disturbances. The error curves in Figure 22 show that the main error fluctuations are concentrated around path-segment transitions and high-curvature regions. Under stochastic disturbances, TIB-CSAC suppresses both vertical and horizontal error fluctuations more effectively than the baseline methods. As summarized in Table 14, TIB-CSAC achieves the lowest MXAE, MAE, CVaR95, and Vε, indicating better suppression of peak errors, average errors, tail errors, and threshold-exceeding errors under stochastic environmental disturbances. The supplementary control-related results in Table 15 further show that this improvement is not accompanied by excessive rudder saturation or unreasonable task-level effort. Therefore, the Gauss–Markov disturbance test provides further evidence of the stochastic-disturbance robustness of the proposed method.

5. Conclusions

This study proposes a TIB-CSAC method to improve the X-rudder path-following performance of an underactuated AUV under varying path geometries and desired surge velocities. The proposed method learns the X-rudder control policy directly from two-segment path-following information, thereby avoiding the hierarchical guidance-control architecture and the associated coupled parameter tuning. In addition, a task-informed inductive-bias encoder is introduced to support segment-wise policy learning, improving sample efficiency and overall training quality. A multi-head conservative value-evaluation mechanism is also incorporated to mitigate return drawdowns induced by challenging path-following tasks and to improve tail-stage convergence stability. Overall, compared with baseline SAC, TIB-CSAC reduces $\overline{\mathrm{TRE}}$ and $\sigma_{\text{tail}}$ to 0.40 and 0.24 times those of SAC, respectively, while yielding less negative $\overline{\mathrm{AUC}}$ and $\bar{R}_{\text{tail}}$ values whose absolute magnitudes are reduced to 0.40 and 0.62 times those of SAC, respectively.
The path-following performance of the proposed method is validated in three representative scenarios with distinct path geometries and desired surge velocities. The results show that TIB-CSAC improves multiple position-error metrics compared with baseline SAC. Averaged over the three cases, the vertical and horizontal MXAE values are reduced by $1.4645 \times 10^{-1}$ m and 6.4623 m, respectively. The corresponding reductions in MAE are $1.2965 \times 10^{-2}$ m and $4.2044 \times 10^{-1}$ m, while those in CVaR95 are $3.7903 \times 10^{-2}$ m and 3.8225 m. In addition, the vertical and horizontal Vε values are reduced by $1.5675 \times 10^{-2}$ and $3.5120 \times 10^{-1}$, respectively. These results indicate that TIB-CSAC not only improves overall adherence to the reference path but also more effectively suppresses extreme position errors, reduces tail errors, and maintains path-following within the prescribed error threshold, thereby demonstrating improved robustness and reliability.
The current results are mainly obtained through simulation-based evaluation. To reduce the mismatch between the simulator and the real environment, sim-to-real transfer methods provide a promising direction for adapting DRL policies trained in simulation to real AUV systems with limited real-world data [29,30]. For example, Wu et al. improved the real-world deployment of a simulation-trained navigation policy using human-guided interventions and demonstrations [31], while Hsu et al. fine-tuned a simulation-trained policy in a safety-constrained laboratory environment before deploying it on a quadrupedal robot for real-world visual navigation [32]. Therefore, future work will focus on software-system integration, real-world data collection, sim-to-real policy transfer learning, and open-water experiments to further evaluate the proposed method. In addition, the adopted 5-DOF model may be insufficient when rudder faults or asymmetric rudder effectiveness occur. Future work will also investigate fault-tolerant X-rudder control in a full 6-DOF framework by further considering roll dynamics and roll control.

Author Contributions

Conceptualization, J.T.; Methodology, J.T.; Software, L.Z.; Formal analysis, P.C.; Investigation, Z.L.; Resources, Y.S.; Writing—original draft, J.T.; Writing—review & editing, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the Natural Science Foundation of Heilongjiang Province of China under Grant No. ZD2020E005. Additional support was provided by the Special Funds of the Taishan Industrial Leading Talents Project.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions.

Conflicts of Interest

Author Yushan Sun was employed by the company Shandong Future Robot Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

The actuator input of the thruster is modeled as
$$
\tau_u^a(n) = \rho\, n^2 D^4\left(K_{Tp1}\,\frac{u}{nD} + K_{Tp2}\right)
$$
where the seawater density is set to $\rho = 1025$ kg/m$^3$, the propeller diameter is $D = 0.25$ m, and the linear coefficients are $K_{Tp1} = -0.3077$ and $K_{Tp2} = 0.220$.
The thruster rotational speed $n_t$ follows an incremental S-plane controller:
$$
n_t = \bar{n}_t\, n_{\max}
$$
$$
\bar{n}_t = \operatorname{sat}\left(\bar{n}_{t-1} + \eta_n S(e_u, \dot{e}_u)\right)
$$
$$
e_u = \operatorname{sat}\left(\frac{u_d - u}{\Delta u_{\max}}\right), \quad \dot{e}_u = \operatorname{sat}\left(\frac{\dot{u}_d - \dot{u}}{\Delta \dot{u}_{\max}}\right)
$$
$$
S(e, \dot{e}) = \frac{2}{1 + \exp\left(-k_1 e - k_2 \dot{e}\right)} - 1
$$
where $\bar{n}_t$ is the normalized control output. The maximum rotational speed is set to $n_{\max} = 1800$ rpm, the update rate is $\eta_n = 0.08$, the error normalization factors are $\Delta u_{\max} = 2.0$ and $\Delta \dot{u}_{\max} = 4.0$, and the S-plane controller factors are $k_1^u = 2.0$ and $k_2^u = 1.8$.
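The speed loop described in this appendix can be sketched as follows. The parameter values are those listed above; the function names, the saturation range of [−1, 1] for both the normalized errors and the normalized output, and the conversion of rpm to revolutions per second before evaluating the thrust expression are assumptions made for this illustration and are not stated in the appendix.

```python
import math

RHO_W, D = 1025.0, 0.25             # seawater density (kg/m^3), propeller diameter (m)
K_TP1, K_TP2 = -0.3077, 0.220       # linear thrust-coefficient terms
N_MAX_RPM, ETA_N = 1800.0, 0.08     # maximum speed (rpm), update rate
DU_MAX, DUDOT_MAX = 2.0, 4.0        # error normalization factors
K1, K2 = 2.0, 1.8                   # S-plane controller factors


def sat(x, lo=-1.0, hi=1.0):
    """Saturate x to [lo, hi]; the [-1, 1] range is an assumption."""
    return max(lo, min(hi, x))


def s_plane(e, e_dot):
    """Sigmoid-shaped S-plane surface with output in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-K1 * e - K2 * e_dot)) - 1.0


def speed_step(n_bar_prev, u_d, u, u_dot_d=0.0, u_dot=0.0):
    """One incremental update; returns (normalized output, speed in rpm)."""
    e = sat((u_d - u) / DU_MAX)
    e_dot = sat((u_dot_d - u_dot) / DUDOT_MAX)
    n_bar = sat(n_bar_prev + ETA_N * s_plane(e, e_dot))
    return n_bar, n_bar * N_MAX_RPM


def thrust(n_rpm, u):
    """Thrust from the appendix model, assuming n is converted to rev/s."""
    n = n_rpm / 60.0
    if n == 0.0:
        return 0.0  # avoid division by zero in the advance ratio u / (n D)
    return RHO_W * n ** 2 * D ** 4 * (K_TP1 * u / (n * D) + K_TP2)
```

Because the controller is incremental, the S-plane output adjusts the previous command rather than setting it directly, so the normalized output settles once the speed error and its rate are both zero.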

References

  1. Shen, Z.; Wang, Q.; Dong, S.; Yu, H. Prescribed performance dynamic surface control for trajectory-tracking of unmanned surface vessel with input saturation. Appl. Ocean Res. 2021, 113, 102736.
  2. He, L.; Zhang, Y.; Li, S.; Li, B.; Yuan, Z. Three-Dimensional Path Following Control for Underactuated AUV Based on Ocean Current Observer. Drones 2024, 8, 672.
  3. Zheng, J.; Song, L.; Liu, L.; Yu, W.; Wang, Y.; Chen, C. Fixed-time sliding mode tracking control for autonomous underwater vehicles. Appl. Ocean Res. 2021, 117, 102928.
  4. Rout, R.; Subudhi, B. Design of Line-of-Sight Guidance Law and a Constrained Optimal Controller for an Autonomous Underwater Vehicle. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 416–420.
  5. Wang, L.; Li, S.; Liu, J.; Wu, Q. Data-driven path-following control of underactuated ships based on antenna mutation beetle swarm predictive reinforcement learning. Appl. Ocean Res. 2022, 124, 103207.
  6. S., P.; Rajendran, S. A unified seakeeping and manoeuvring model with a PID controller for path following of a KVLCC2 tanker in regular waves. Appl. Ocean Res. 2021, 116, 102860.
  7. He, Y.; Xie, Y.; Pan, G.; Cao, Y.; Huang, Q.; Ma, S.; Zhang, D.; Cao, Y. Depth and Heading Control of a Manta Robot Based on S-Plane Control. J. Mar. Sci. Eng. 2022, 10, 1698.
  8. Xu, F.; Zhang, L.; Zhong, J. Three-Dimensional Path Tracking of Over-Actuated AUVs Based on MPC and Variable Universe S-Plane Algorithms. J. Mar. Sci. Eng. 2024, 12, 418.
  9. Jiang, C.; Lv, J.; Wan, L.; Wang, J.; He, B.; Wu, G. An Improved S-Plane Controller for High-Speed Multi-Purpose AUVs with Situational Static Loads. J. Mar. Sci. Eng. 2023, 11, 646.
  10. He, L.; Xie, M.; Zhang, Y. A Review of Path Following, Trajectory Tracking, and Formation Control for Autonomous Underwater Vehicles. Drones 2025, 9, 286.
  11. Wang, D.; Shen, Y.; Wan, J.; Sha, Q.; Li, G.; Chen, G.; He, B. Sliding mode heading control for AUV based on continuous hybrid model-free and model-based reinforcement learning. Appl. Ocean Res. 2022, 118, 102960.
  12. Sun, Y.; Ran, X.; Zhang, G.; Wang, X.; Xu, H. AUV path following controlled by modified Deep Deterministic Policy Gradient. Ocean Eng. 2020, 210, 107360.
  13. Fang, Y.; Huang, Z.; Pu, J.; Zhang, J. AUV position tracking and trajectory control based on fast-deployed deep reinforcement learning method. Ocean Eng. 2022, 245, 110452.
  14. Wang, Z.; Li, Y.; Ma, C.; Yan, X.; Jiang, D. Path-following optimal control of autonomous underwater vehicle based on deep reinforcement learning. Ocean Eng. 2023, 268, 113407.
  15. Wang, Y.; Hou, Y.; Lai, Z.; Cao, L.; Hong, W.; Wu, D. An adaptive PID controller for path following of autonomous underwater vehicle based on Soft Actor–Critic. Ocean Eng. 2024, 307, 118171.
  16. Zhang, C.; Cheng, P.; Du, B.; Dong, B.; Zhang, W. AUV path tracking with real-time obstacle avoidance via reinforcement learning under adaptive constraints. Ocean Eng. 2022, 256, 111453.
  17. Zhang, Q.; Lin, J.; Sha, Q.; He, B.; Li, G. Deep Interactive Reinforcement Learning for Path Following of Autonomous Underwater Vehicle. IEEE Access 2020, 8, 24258–24268.
  18. Dong, N.; Liu, S.; Ip, A.W.H.; Yung, K.L.; Gao, Z.; Juan, R.; Wang, Y. End-to-end autonomous underwater vehicle path following control method based on improved soft actor–critic for deep space exploration. J. Ind. Inf. Integr. 2025, 45, 100792.
  19. Gu, N.; Wang, D.; Peng, Z.; Wang, J.; Han, Q.-L. Advances in line-of-sight guidance for path following of autonomous marine vehicles: An overview. IEEE Trans. Syst. Man. Cybern. Syst. 2022, 53, 12–28.
  20. Fan, J.; Li, W. DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Baltimore, MD, USA, 17–23 July 2022; pp. 6074–6102.
  21. Schaul, T.; Horgan, D.; Gregor, K.; Silver, D. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Lille, France, 6–11 July 2015; pp. 1312–1320.
  22. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; Wiley: Hoboken, NJ, USA, 2011.
  23. Chai, P.; Sun, Y.; Wu, W.; Zhou, H.; Gao, F. AUV preset performance approximate dynamic programming path following control based on fixed time guidance law. Ocean Eng. 2025, 321, 120425.
  24. Zhang, L.; Sun, Y.; Chai, P.; Tan, J.; Zheng, H. Prescribed-performance time-delay compensation control for UUV trajectory tracking in main-branch water conveyance tunnel transitions under unknown input delays. Ocean Eng. 2025, 342, 122941.
  25. Pettersen, K.Y.; Egeland, O. Time-varying exponential stabilization of the position and attitude of an underactuated autonomous underwater vehicle. IEEE Trans. Autom. Control. 1999, 44, 112–115.
  26. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905.
  27. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  28. An, X.; Xing, H.; Li, H.; Shi, H.; Gu, Y. Construction and verification of the autonomous underwater helicopter’s digital twin system for missions simulation. Ocean Eng. 2025, 341, 122641.
  29. Ju, H.; Juan, R.; Gomez, R.; Nakamura, K.; Li, G. Transferring policy of deep reinforcement learning from simulation to reality for robotics. Nat. Mach. Intell. 2022, 4, 1077–1087.
  30. Zhu, W.; Guo, X.; Owaki, D.; Kutsuzawa, K.; Hayashibe, M. A Survey of Sim-to-Real Transfer Techniques Applied to Reinforcement Learning for Bioinspired Robots. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3444–3459.
  31. Wu, J.; Zhou, Y.; Yang, H.; Huang, Z.; Lv, C. Human-Guided Reinforcement Learning With Sim-to-Real Transfer for Autonomous Navigation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14745–14759.
  32. Hsu, K.-C.; Ren, A.Z.; Nguyen, D.P.; Majumdar, A.; Fisac, J.F. Sim-to-Lab-to-Real: Safe reinforcement learning with shielding and generalization guarantees. Artif. Intell. 2023, 314, 103811.
Figure 1. Coordinate system.
Figure 2. The overall architecture of the TIB-CSAC algorithm.
Figure 3. Schematic diagram of the training process.
Figure 4. Episode return curves in the comparative study.
Figure 5. Trajectories in Case 1 (the dashed line with arrows indicates the path direction).
Figure 6. Path-following errors in Case 1.
Figure 7. Surge velocities in Case 1.
Figure 8. TIB-CSAC rudder angles in Case 1.
Figure 9. Trajectories in Case 2 (the dashed line with arrows indicates the path direction).
Figure 10. Path-following errors in Case 2.
Figure 11. Surge velocity in Case 2.
Figure 12. TIB-CSAC rudder angles in Case 2.
Figure 13. Trajectories in Case 3 (the dashed line with arrows indicates the path direction).
Figure 14. Path-following errors in Case 3.
Figure 15. Surge velocity in Case 3.
Figure 16. TIB-CSAC rudder angles in Case 3.
Figure 17. Episode return curves in the ablation study.
Figure 18. Episode return curves in the sensitivity study.
Figure 19. Average path-following metrics for Cases 1–3.
Figure 20. 3D trajectories.
Figure 21. Surge velocity.
Figure 22. Path-following error.
Table 1. Hydrodynamic parameters.
| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| Inertia terms | $(m_{11}, m_{22}, m_{33}) = (215, 265, 265)$ kg; $(m_{55}, m_{66}) = 80$ kg·m² | Rudder rate | $\Delta\delta_{\max} = 0.3491$ rad/s |
| Damping coefficient | $(X_u, Y_v, Z_w) = (70, 100, 100)$ kg·m²; $(M_q, N_r) = 50$ kg·m²/s | Rudder limit | $\delta_{\max} = 0.4189$ rad |
| Damping coefficient | $(X_{u\lvert u\rvert}, Y_{v\lvert v\rvert}, Z_{w\lvert w\rvert}) = 100\,(1, 2, 2)$ kg/m; $(M_{q\lvert q\rvert}, N_{r\lvert r\rvert}) = 50$ kg·m² | AUV weight | $W = 1813$ N |
| Rudder coefficients | $M_{\delta} = 0.1595\,[+1, 1, 1, +1]$; $N_{\delta} = 0.1595\,[+1, 1, 1, +1]$ | Vertical offset | $\overline{BG}_z = 0.02$ m |
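As an illustration of how the rudder rate and rudder limit in Table 1 bound a commanded deflection, the following Python sketch applies a rate limit followed by a saturation to each of the four X-rudder angles. It is a minimal sketch, not the authors' implementation: the helper name `limit_rudder` is hypothetical, the time step is taken from Table 2, and applying the limits independently to each rudder is an assumption.

```python
DELTA_MAX = 0.4189       # rudder angle limit from Table 1, rad
DELTA_RATE_MAX = 0.3491  # rudder rate limit from Table 1, rad/s
DT = 0.2                 # time step from Table 2, s


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def limit_rudder(delta_prev, delta_cmd, dt=DT):
    """Rate-limit, then saturate, each rudder command (hypothetical helper).

    delta_prev: rudder angles applied at the previous step, rad
    delta_cmd:  rudder angles requested by the policy, rad
    """
    max_step = DELTA_RATE_MAX * dt  # largest allowed change per step, rad
    limited = []
    for prev, cmd in zip(delta_prev, delta_cmd):
        step = clamp(cmd - prev, -max_step, max_step)
        limited.append(clamp(prev + step, -DELTA_MAX, DELTA_MAX))
    return limited


if __name__ == "__main__":
    print(limit_rudder([0.0, 0.0, 0.0, 0.4], [1.0, -1.0, 0.01, 0.5]))
```

With these values, a single 0.2 s step can move a rudder by at most 0.3491 × 0.2 ≈ 0.0698 rad, so a full sweep from zero to the 0.4189 rad limit takes about six steps.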
Table 2. Training task settings.
| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| Initial AUV states | $\eta_0 = 0$, $\nu_0 = 0$ | Path segments | $n_p = 2$ |
| Maximum adjacent heading change | $\Delta\psi^{p}_{\max} = 1.5707$ rad | Time step | $\Delta t = 0.2$ s |
| Maximum absolute path pitch angle | $\Delta\theta^{p}_{\max} = 0.1222$ rad | Time threshold | $t^{o}_{\max} = 10$ s |
| Minimum segment length | $L_{\min} = 100$ m | Error threshold | $d^{h}_{\min} = 0.1$ m |
| Normalization factors | $(e^{h}_{\max}, e^{v}_{\max}, r_{\max}) = (4, 2, 4)$ | Reward weights | $k^{r}_{i} = 5$, $i = 1, 2, 3$ |
| Maximum task time limit | $t^{t}_{\max} = 1.3\,\mathrm{sum}(L^{p}/\hat{u})$ | Disturbance phase shifts | $\varphi^{d}_{1} < \pi/8$ |
| Disturbance frequencies | $0.08 < k^{d}_{1} < 0.12$ | Disturbance phase shifts | $\pi/8 < \varphi^{d}_{2} < 3\pi/8$ |
| Disturbance frequencies | $0.18 < k^{d}_{2} < 0.22$ | Disturbance phase shifts | $\pi/12 < \varphi^{d}_{3} < \pi/4$ |
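The maximum task time limit in Table 2 scales the nominal traversal time of the path by a factor of 1.3. The Python sketch below evaluates that expression for a list of waypoints. It assumes that the segment length is the Euclidean distance between consecutive waypoints and that one desired surge velocity is given per segment; the function name `task_time_limit` and the example values are hypothetical.

```python
import math


def task_time_limit(waypoints, desired_speeds, margin=1.3):
    """Evaluate margin * sum(L_p / u_hat) over all path segments.

    waypoints:      sequence of (x, y, z) points in metres
    desired_speeds: one desired surge velocity per segment, m/s (assumed)
    margin:         the factor 1.3 listed in Table 2
    """
    segments = list(zip(waypoints, waypoints[1:]))
    if len(segments) != len(desired_speeds):
        raise ValueError("need exactly one desired speed per segment")
    nominal = sum(
        math.dist(start, end) / speed
        for (start, end), speed in zip(segments, desired_speeds)
    )
    return margin * nominal


if __name__ == "__main__":
    # Two 100 m segments (the minimum segment length in Table 2),
    # traversed at illustrative speeds of 1 m/s and 2 m/s.
    path = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 100.0, 0.0)]
    print(task_time_limit(path, [1.0, 2.0]))
```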
Table 3. Algorithm settings.
| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| Learning rate | $l_r = 2 \times 10^{-4}$ | Dynamic encoder | MLP(2 × 32), ReLU |
| Discount factor | $\gamma = 0.99$ | Segment branches | MLP(2 × 64), ReLU |
| Critic number | $n_Q = 6$ | Fusion layer | MLP(1 × 64), ReLU |
| Soft update coefficient | $\tau = 1 \times 10^{-3}$ | FiLM modulation | g: MLP(1 × 64), ReLU |
| Training episodes | $n_{\mathrm{eps}} = 1200$ | Policy network | MLP(2 × 128), ReLU |
| Replay buffer size | $n_{\mathrm{buf}} = 1 \times 10^{6}$ | Critic head network | MLP(2 × 128), ReLU |
| Batch size | $n_{\mathrm{bat}} = 256$ | Target entropy | $H_{\mathrm{tar}} = 2$ |
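Table 3 lists a soft update coefficient τ = 1 × 10⁻³. In SAC-type methods this coefficient usually controls Polyak averaging of the target critic parameters, and the Python sketch below shows that update under this assumption. Parameters are represented as flat lists of floats for brevity, and the name `soft_update` is hypothetical; it does not reproduce the authors' code.

```python
TAU = 1e-3  # soft update coefficient from Table 3


def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        (1.0 - tau) * target + tau * online
        for target, online in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    # One update moves each target value 0.1% of the way to the online value.
    print(soft_update(target, online))
```

With τ this small, the target parameters track the online parameters slowly, which is the usual motivation for a soft update instead of a periodic hard copy.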
Table 4. Training metric results in the comparative study.
| Method | $\bar{R}_{\mathrm{tail}}$ | $\sigma_{\mathrm{tail}}$ | $\overline{\mathrm{TRE}}$ | $\overline{\mathrm{AUC}}$ |
| --- | --- | --- | --- | --- |
| TIB-CSAC | −1.7244 × 10² | 1.2368 × 10² | 2.9560 × 10² | −2.6398 × 10⁶ |
| SAC | −2.7927 × 10² | 5.1210 × 10² | 7.3320 × 10² | −6.5231 × 10⁶ |
| DDPG | −5.5372 × 10² | 4.6465 × 10² | 9.3920 × 10² | −9.2165 × 10⁶ |
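To read Table 4 in relative terms, the tabulated values can be compared against the SAC baseline. The Python sketch below does this arithmetic; the helper name `relative_change` and the convention of normalizing by the magnitude of the baseline are assumptions made here for illustration, and the numbers are copied from the table.

```python
# Values copied from Table 4, in column order:
# (mean tail return, tail standard deviation, TRE, AUC)
TABLE4 = {
    "TIB-CSAC": (-1.7244e2, 1.2368e2, 2.9560e2, -2.6398e6),
    "SAC": (-2.7927e2, 5.1210e2, 7.3320e2, -6.5231e6),
    "DDPG": (-5.5372e2, 4.6465e2, 9.3920e2, -9.2165e6),
}
COLUMNS = ("R_tail", "sigma_tail", "TRE", "AUC")


def relative_change(value, baseline):
    """Signed change of value relative to the magnitude of baseline."""
    return (value - baseline) / abs(baseline)


if __name__ == "__main__":
    ours, sac = TABLE4["TIB-CSAC"], TABLE4["SAC"]
    for name, a, b in zip(COLUMNS, ours, sac):
        print(f"{name}: {100.0 * relative_change(a, b):+.1f}% vs. SAC")
```

Under this convention a positive change in the return-type columns and a negative change in the spread-type columns both favour TIB-CSAC.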
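Table 5. Waypoint Coordinates of Random Paths.
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $x_i^p$ (m) | 111.79 | 238.38 | 434.02 | 546.22 | 486.21 | 562.53 | 674.69 | 596.73 | 441.31 | 433.10 |
| $y_i^p$ (m) | 92.60 | 158.75 | 179.64 | 115.61 | −47.99 | −174.84 | −288.56 | −370.22 | −325.50 | −185.18 |
| $z_i^p$ (m) | 2.56 | 14.46 | 14.68 | 15.56 | 3.68 | 11.19 | 15.01 | 20.85 | 11.28 | 18.72 |

| Index | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $x_i^p$ (m) | 464.11 | 375.32 | 450.57 | 628.69 | 794.98 | 883.33 | 1001.32 | 1110.58 | 1264.07 | 1387.24 |
| $y_i^p$ (m) | −74.54 | 99.49 | 201.61 | 131.19 | 68.23 | −97.93 | −52.65 | 54.26 | 135.41 | 54.28 |
| $z_i^p$ (m) | 20.23 | 24.20 | 22.87 | 32.63 | 31.45 | 43.51 | 32.77 | 35.95 | 51.05 | 51.89 |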
Table 6. Vertical error metrics.
| Case | Metric | TIB-CSAC | LOS+SP | SAC | DDPG |
| --- | --- | --- | --- | --- | --- |
| C1 | MXAE (m) | 1.7299 × 10⁻¹ | 2.6719 × 10⁻¹ | 2.6679 × 10⁻¹ | 4.2489 × 10⁻¹ |
| | MAE (m) | 4.1841 × 10⁻³ | 1.0881 × 10⁻² | 1.3561 × 10⁻² | 4.7823 × 10⁻² |
| | CVaR95 (m) | 4.7757 × 10⁻² | 8.6580 × 10⁻² | 6.3900 × 10⁻² | 2.4309 × 10⁻¹ |
| | Vε | 7.4956 × 10⁻³ | 1.4532 × 10⁻² | 3.0235 × 10⁻³ | 1.4487 × 10⁻¹ |
| C2 | MXAE (m) | 3.6035 × 10⁻¹ | 4.8607 × 10⁻¹ | 5.8540 × 10⁻¹ | 5.5639 × 10⁻¹ |
| | MAE (m) | 1.0904 × 10⁻² | 3.8505 × 10⁻² | 2.1844 × 10⁻² | 8.1225 × 10⁻² |
| | CVaR95 (m) | 1.0198 × 10⁻¹ | 2.5271 × 10⁻¹ | 1.5847 × 10⁻¹ | 4.0743 × 10⁻¹ |
| | Vε | 1.7398 × 10⁻² | 1.1855 × 10⁻¹ | 3.2938 × 10⁻² | 2.5016 × 10⁻¹ |
| C3 | MXAE (m) | 3.4549 × 10⁻¹ | 3.7410 × 10⁻¹ | 4.6598 × 10⁻¹ | 6.7275 × 10⁻¹ |
| | MAE (m) | 1.2928 × 10⁻² | 3.3489 × 10⁻² | 3.1507 × 10⁻² | 9.6232 × 10⁻² |
| | CVaR95 (m) | 1.2817 × 10⁻¹ | 2.1243 × 10⁻¹ | 1.6925 × 10⁻¹ | 4.4115 × 10⁻¹ |
| | Vε | 2.5180 × 10⁻² | 8.2075 × 10⁻² | 6.1138 × 10⁻² | 3.3564 × 10⁻¹ |
Table 7. Horizontal error metrics.
| Case | Metric | TIB-CSAC | LOS+SP | SAC | DDPG |
| --- | --- | --- | --- | --- | --- |
| C1 | MXAE (m) | 2.3813 × 10⁰ | 1.0942 × 10¹ | 1.0892 × 10¹ | 1.0348 × 10¹ |
| | MAE (m) | 6.1981 × 10⁻² | 5.9013 × 10⁻¹ | 4.7040 × 10⁻¹ | 4.8171 × 10⁻¹ |
| | CVaR95 (m) | 8.4313 × 10⁻¹ | 7.1942 × 10⁰ | 5.3551 × 10⁰ | 5.1160 × 10⁰ |
| | Vε | 6.8683 × 10⁻² | 3.7126 × 10⁻¹ | 3.8413 × 10⁻¹ | 7.3974 × 10⁻¹ |
| C2 | MXAE (m) | 1.0324 × 10⁰ | 1.0575 × 10¹ | 8.3592 × 10⁰ | 1.4654 × 10¹ |
| | MAE (m) | 3.6557 × 10⁻² | 7.0307 × 10⁻¹ | 4.6165 × 10⁻¹ | 7.6099 × 10⁻¹ |
| | CVaR95 (m) | 3.8266 × 10⁻¹ | 6.1709 × 10⁰ | 4.2428 × 10⁰ | 7.2433 × 10⁰ |
| | Vε | 6.6016 × 10⁻² | 5.7557 × 10⁻¹ | 4.9173 × 10⁻¹ | 7.9612 × 10⁻¹ |
| C3 | MXAE (m) | 2.9708 × 10⁰ | 1.0196 × 10¹ | 6.5206 × 10⁰ | 1.0790 × 10¹ |
| | MAE (m) | 8.7358 × 10⁻² | 5.9247 × 10⁻¹ | 5.1517 × 10⁻¹ | 7.0218 × 10⁻¹ |
| | CVaR95 (m) | 1.0650 × 10⁰ | 4.9855 × 10⁰ | 4.1604 × 10⁰ | 5.4241 × 10⁰ |
| | Vε | 1.4977 × 10⁻¹ | 5.8885 × 10⁻¹ | 4.6221 × 10⁻¹ | 8.4074 × 10⁻¹ |
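The four error metrics reported in Tables 6 and 7 (maximum absolute error, mean absolute error, tail error, and threshold exceedance ratio) can be computed from a logged error series along the following lines. This Python sketch rests on assumed definitions: CVaR95 is taken as the mean of the worst 5% of absolute errors, and Vε as the fraction of samples whose absolute error exceeds a threshold ε. The function name `path_error_metrics` and the value of ε in the example are hypothetical.

```python
import math


def path_error_metrics(errors, eps, alpha=0.95):
    """Summarize a path-following error series (assumed definitions).

    errors: signed cross-track or vertical errors, m
    eps:    exceedance threshold in metres (value chosen by the user)
    alpha:  tail level; 0.95 gives CVaR95
    """
    abs_err = sorted(abs(e) for e in errors)
    n = len(abs_err)
    if n == 0:
        raise ValueError("errors must not be empty")
    # Number of samples in the upper (1 - alpha) tail; the rounding guards
    # against floating-point noise such as 5.000000000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    return {
        "MXAE": abs_err[-1],
        "MAE": sum(abs_err) / n,
        "CVaR": sum(abs_err[-k:]) / k,
        "V_eps": sum(1 for e in abs_err if e > eps) / n,
    }


if __name__ == "__main__":
    # Synthetic errors with alternating sign, magnitudes 0.00 to 0.99 m.
    demo = [((-1) ** i) * i / 100 for i in range(100)]
    print(path_error_metrics(demo, eps=0.895))
```

Reporting a tail statistic next to the mean matters here because a controller can have a small MAE while still producing rare large excursions, which is the behaviour the MXAE and CVaR95 columns are meant to expose.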
Table 8. Control-related metrics.
| Case | Metric | TIB-CSAC | LOS+SP | SAC | DDPG |
| --- | --- | --- | --- | --- | --- |
| C1 | $R_{\delta}^{\mathrm{sat}}$ | 3.6415 × 10⁻² | 3.5432 × 10⁻² | 7.1948 × 10⁻² | 1.9965 × 10⁻² |
| | $J_{\Delta^{2}\delta}$ | 5.9789 × 10⁻² | 7.9926 × 10⁻¹ | 2.0554 × 10⁻² | 6.3354 × 10⁰ |
| | $J_{\dot{\delta}}$ | 7.5725 × 10⁰ | 1.0895 × 10¹ | 7.7465 × 10⁰ | 4.2052 × 10¹ |
| | $J_{\delta,u}$ | 2.7156 × 10¹ | 1.8364 × 10¹ | 7.7713 × 10¹ | 1.4899 × 10¹ |
| | $J_{\mathrm{AUV}}$ | 5.5553 × 10³ | 5.7613 × 10³ | 5.9402 × 10³ | 5.6670 × 10³ |
| | $T$ | 3.7622 × 10³ | 3.8948 × 10³ | 3.9028 × 10³ | 3.8116 × 10³ |
| C2 | $R_{\delta}^{\mathrm{sat}}$ | 6.0163 × 10⁻³ | 1.0941 × 10⁻² | 5.5885 × 10⁻² | 2.7831 × 10⁻³ |
| | $J_{\Delta^{2}\delta}$ | 3.1153 × 10⁻¹ | 2.3985 × 10⁰ | 1.0248 × 10⁻¹ | 6.8936 × 10⁰ |
| | $J_{\dot{\delta}}$ | 1.8446 × 10¹ | 3.1841 × 10¹ | 2.1538 × 10¹ | 4.7438 × 10¹ |
| | $J_{\delta,u}$ | 3.2003 × 10¹ | 2.6966 × 10¹ | 1.7262 × 10² | 1.9909 × 10¹ |
| | $J_{\mathrm{AUV}}$ | 2.2942 × 10³ | 2.4531 × 10³ | 2.5507 × 10³ | 2.4031 × 10³ |
| | $T$ | 1.2300 × 10³ | 1.2704 × 10³ | 1.2812 × 10³ | 1.2576 × 10³ |
| C3 | $R_{\delta}^{\mathrm{sat}}$ | 4.1298 × 10⁻² | 2.0751 × 10⁻² | 9.3597 × 10⁻² | 2.6845 × 10⁻² |
| | $J_{\Delta^{2}\delta}$ | 6.1293 × 10⁻¹ | 1.5771 × 10⁰ | 1.1750 × 10⁻¹ | 1.3719 × 10¹ |
| | $J_{\dot{\delta}}$ | 1.6996 × 10¹ | 1.9188 × 10¹ | 2.1262 × 10¹ | 8.9930 × 10¹ |
| | $J_{\delta,u}$ | 3.4639 × 10¹ | 2.3822 × 10¹ | 2.0899 × 10² | 2.6849 × 10¹ |
| | $J_{\mathrm{AUV}}$ | 7.5449 × 10³ | 7.5713 × 10³ | 7.7279 × 10³ | 7.5638 × 10³ |
| | $T$ | 2.4384 × 10³ | 2.5830 × 10³ | 2.6988 × 10³ | 2.5480 × 10³ |
Table 9. Training metric results in the ablation study.
| Method | $\overline{\mathrm{TRE}}$ | $\overline{\mathrm{AUC}}$ | $\sigma_{\mathrm{tail}}$ | $\bar{R}_{\mathrm{tail}}$ |
| --- | --- | --- | --- | --- |
| SAC | 7.3320 × 10² | −6.5231 × 10⁶ | 5.1210 × 10² | −2.7927 × 10² |
| CSAC | 6.4180 × 10² | −5.4341 × 10⁶ | 1.3339 × 10² | −2.0564 × 10² |
| TIB-SAC | 4.3060 × 10² | −3.9544 × 10⁶ | 3.5529 × 10² | −2.3032 × 10² |
| TIB-CSAC | 2.9560 × 10² | −2.6398 × 10⁶ | 1.2368 × 10² | −1.7244 × 10² |
Table 10. Training metric results in the sensitivity study.
| $n_Q$ | $\sigma_{\mathrm{tail}}$ | $\overline{\mathrm{TRE}}$ | $\overline{\mathrm{AUC}}$ | $\bar{R}_{\mathrm{tail}}$ |
| --- | --- | --- | --- | --- |
| 2 | 2.4269 × 10³ | 1.2000 × 10³ | −7.9829 × 10⁶ | −3.7323 × 10³ |
| 4 | 5.3876 × 10² | 3.8780 × 10² | −3.2756 × 10⁶ | −2.7346 × 10² |
| 6 | 1.2368 × 10² | 2.9560 × 10² | −2.6398 × 10⁶ | −1.7244 × 10² |
| 8 | 1.0508 × 10² | 3.3480 × 10² | −2.6063 × 10⁶ | −1.6741 × 10² |
Table 11. Horizontal error metrics for sensitivity analysis of reward weights.
| Case | $[k_{r,1}, k_{r,2}, k_{r,3}]$ | [7.50, 3.75, 3.75] | [3.75, 7.50, 3.75] | [3.75, 3.75, 7.50] | [5.0, 5.0, 5.0] |
| --- | --- | --- | --- | --- | --- |
| C1 | MXAE (m) | 2.8895 × 10⁰ | 3.2273 × 10⁰ | 2.7073 × 10⁰ | 2.3813 × 10⁰ |
| | MAE (m) | 9.3567 × 10⁻² | 1.2752 × 10⁻¹ | 8.1307 × 10⁻² | 6.1981 × 10⁻² |
| | CVaR95 (m) | 1.2507 × 10⁰ | 1.1357 × 10⁰ | 1.0947 × 10⁰ | 8.4313 × 10⁻¹ |
| | Vε | 8.7584 × 10⁻² | 3.7600 × 10⁻¹ | 7.3903 × 10⁻² | 6.8683 × 10⁻² |
| C2 | MXAE (m) | 1.1418 × 10⁰ | 1.6357 × 10⁰ | 1.1187 × 10⁰ | 1.0324 × 10⁰ |
| | MAE (m) | 4.0079 × 10⁻² | 8.7472 × 10⁻² | 5.9850 × 10⁻² | 3.6557 × 10⁻² |
| | CVaR95 (m) | 4.0264 × 10⁻¹ | 7.0743 × 10⁻¹ | 5.4121 × 10⁻¹ | 3.8266 × 10⁻¹ |
| | Vε | 6.0212 × 10⁻² | 2.0276 × 10⁻¹ | 1.1075 × 10⁻¹ | 6.6016 × 10⁻² |
| C3 | MXAE (m) | 2.6098 × 10⁰ | 2.5967 × 10⁰ | 2.6334 × 10⁰ | 2.9708 × 10⁰ |
| | MAE (m) | 5.2106 × 10⁻² | 6.1310 × 10⁻² | 6.4482 × 10⁻² | 8.7358 × 10⁻² |
| | CVaR95 (m) | 5.2805 × 10⁻¹ | 5.6101 × 10⁻¹ | 6.0352 × 10⁻¹ | 1.0650 × 10⁰ |
| | Vε | 7.0330 × 10⁻² | 9.8560 × 10⁻² | 9.6979 × 10⁻² | 1.4977 × 10⁻¹ |
Table 12. Vertical error metrics for sensitivity analysis of reward weights.
| Case | $[k_{r,1}, k_{r,2}, k_{r,3}]$ | [7.50, 3.75, 3.75] | [3.75, 7.50, 3.75] | [3.75, 3.75, 7.50] | [5.0, 5.0, 5.0] |
| --- | --- | --- | --- | --- | --- |
| C1 | MXAE (m) | 2.5174 × 10⁻¹ | 1.5607 × 10⁻¹ | 1.8283 × 10⁻¹ | 1.7299 × 10⁻¹ |
| | MAE (m) | 1.0644 × 10⁻² | 1.0499 × 10⁻² | 6.4236 × 10⁻³ | 4.1841 × 10⁻³ |
| | CVaR95 (m) | 1.1995 × 10⁻¹ | 5.1650 × 10⁻² | 7.6575 × 10⁻² | 4.7757 × 10⁻² |
| | Vε | 2.4152 × 10⁻² | 4.2438 × 10⁻⁴ | 1.6045 × 10⁻² | 7.4956 × 10⁻³ |
| C2 | MXAE (m) | 3.8565 × 10⁻¹ | 3.0402 × 10⁻¹ | 3.7452 × 10⁻¹ | 3.6035 × 10⁻¹ |
| | MAE (m) | 1.0350 × 10⁻² | 1.6619 × 10⁻² | 1.0264 × 10⁻² | 1.0904 × 10⁻² |
| | CVaR95 (m) | 1.1253 × 10⁻¹ | 9.9581 × 10⁻² | 9.9590 × 10⁻² | 1.0198 × 10⁻¹ |
| | Vε | 2.0667 × 10⁻² | 1.5981 × 10⁻² | 1.5450 × 10⁻² | 1.7398 × 10⁻² |
| C3 | MXAE (m) | 3.2619 × 10⁻¹ | 2.6435 × 10⁻¹ | 3.2154 × 10⁻¹ | 3.4549 × 10⁻¹ |
| | MAE (m) | 1.0312 × 10⁻² | 1.9053 × 10⁻² | 1.1628 × 10⁻² | 1.2928 × 10⁻² |
| | CVaR95 (m) | 9.7986 × 10⁻² | 8.2500 × 10⁻² | 9.0378 × 10⁻² | 1.2817 × 10⁻¹ |
| | Vε | 1.6776 × 10⁻² | 1.1930 × 10⁻² | 1.5267 × 10⁻² | 2.5180 × 10⁻² |
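Table 13. Coordinates of the reference-path waypoints.
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $x_i^p$ (m) | 112.13 | 307.52 | 497.67 | 657.61 | 813.02 | 835.02 | 731.12 | 619.40 | 621.43 | 525.10 |
| $y_i^p$ (m) | 0.00 | −32.65 | −33.77 | 52.53 | 155.41 | 311.17 | 365.13 | 209.29 | 96.76 | 30.63 |
| $z_i^p$ (m) | 4.72 | 9.98 | 3.04 | 8.81 | 13.64 | 23.01 | 20.88 | 24.59 | 22.71 | 19.30 |

| Index | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $x_i^p$ (m) | 408.72 | 315.79 | 229.22 | 138.34 | 161.95 | 218.32 | 283.15 | 423.69 | 509.14 | 595.53 |
| $y_i^p$ (m) | −33.07 | −152.37 | −264.84 | −196.47 | −52.90 | 98.86 | 197.75 | 133.86 | 224.75 | 297.61 |
| $z_i^p$ (m) | 16.14 | 6.18 | 4.09 | 4.13 | 12.66 | 12.21 | 10.74 | 22.41 | 14.44 | 11.01 |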
Table 14. Path-following error metrics.
| Direction | Metric | TIB-CSAC | LOS+SP | SAC | DDPG |
| --- | --- | --- | --- | --- | --- |
| Vertical | MXAE (m) | 3.0580 × 10⁻¹ | 3.1465 × 10⁻¹ | 3.9217 × 10⁻¹ | 5.5085 × 10⁻¹ |
| | MAE (m) | 8.9867 × 10⁻³ | 2.3031 × 10⁻² | 2.5653 × 10⁻² | 4.6731 × 10⁻² |
| | CVaR95 (m) | 6.6192 × 10⁻² | 1.3515 × 10⁻¹ | 1.3048 × 10⁻¹ | 2.6511 × 10⁻¹ |
| | Vε | 7.9594 × 10⁻³ | 4.0323 × 10⁻² | 3.1242 × 10⁻² | 1.2729 × 10⁻¹ |
| Horizontal | MXAE (m) | 3.0449 × 10⁰ | 1.0180 × 10¹ | 6.7056 × 10⁰ | 1.5886 × 10¹ |
| | MAE (m) | 6.2952 × 10⁻² | 5.0292 × 10⁻¹ | 3.0592 × 10⁻¹ | 7.7647 × 10⁻¹ |
| | CVaR95 (m) | 4.6887 × 10⁻¹ | 4.9808 × 10⁰ | 2.6753 × 10⁰ | 6.6688 × 10⁰ |
| | Vε | 1.2621 × 10⁻¹ | 4.2244 × 10⁻¹ | 5.8750 × 10⁻¹ | 9.3249 × 10⁻¹ |
Table 15. Rudder metrics.
| Metric | TIB-CSAC | LOS+SP | SAC | DDPG |
| --- | --- | --- | --- | --- |
| $R_{\delta}^{\mathrm{sat}}$ | 7.3909 × 10⁻³ | 4.1181 × 10⁻³ | 5.3944 × 10⁻² | 4.9386 × 10⁻³ |
| $J_{\Delta^{2}\delta}$ | 5.0755 × 10⁻¹ | 2.2047 × 10⁰ | 2.1324 × 10⁻¹ | 1.8090 × 10¹ |
| $J_{\dot{\delta}}$ | 1.3140 × 10¹ | 2.5033 × 10¹ | 3.1093 × 10¹ | 1.1789 × 10² |
| $J_{\delta,u}$ | 2.5234 × 10¹ | 1.4796 × 10¹ | 1.7047 × 10² | 1.6914 × 10¹ |
| $J_{\mathrm{AUV}}$ | 7.2190 × 10³ | 7.4023 × 10³ | 7.4040 × 10³ | 7.3943 × 10³ |
| $T$ | 2.2866 × 10³ | 2.3312 × 10³ | 2.3302 × 10³ | 2.3286 × 10³ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Tan, Jiehui, Yushan Sun, Liwen Zhang, Puxin Chai, and Zhan Liu. 2026. "Direct X-Rudder Path-Following Control for Underactuated AUVs via TIB-CSAC" Journal of Marine Science and Engineering 14, no. 12: 1100. https://doi.org/10.3390/jmse14121100