Journal of Marine Science and Engineering
  • Article
  • Open Access

11 December 2025

Three-Dimensional Autonomous Navigation of Unmanned Underwater Vehicle Based on Deep Reinforcement Learning and Adaptive Line-of-Sight Guidance

1 Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou 362000, China
2 Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, Fuzhou 350000, China
3 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150000, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Design and Application of Underwater Vehicles

Abstract

Unmanned underwater vehicles (UUVs) face significant challenges in achieving safe and efficient autonomous navigation in complex marine environments due to uncertain perception, dynamic obstacles, and nonlinear coupled motion control. This study proposes a hierarchical autonomous navigation framework that integrates an improved particle swarm optimization (PSO) algorithm for 3D global route planning with a deep deterministic policy gradient (DDPG) algorithm enhanced by noisy networks and proportional prioritized experience replay (PPER) for local collision avoidance. To address dynamic sideslip and current-induced deviations during execution, a novel 3D adaptive line-of-sight (ALOS) guidance method is developed, which decouples the nonlinear motion in the horizontal and vertical planes and ensures robust tracking. The global planner incorporates a multi-objective cost function that considers yaw and pitch adjustments, while the improved PSO employs nonlinearly synchronized adaptive weights to enhance convergence and avoid local minima. For local avoidance, the proposed DDPG framework incorporates a memory-enhanced state–action representation, GRU-based temporal processing, and stratified sample replay to improve learning stability and exploration. Simulation results indicate that the proposed method reduces route length by 5.96% and planning time by 82.9% compared to baseline algorithms in dynamic scenarios, and achieves up to 11% higher success rates and 10% better efficiency than SAC and standard DDPG. The 3D ALOS controller outperforms existing guidance strategies under time-varying currents, ensuring smoother tracking and reduced actuator effort.

1. Introduction

When a UUV performs global route planning in a high-dimensional environment, it faces issues such as long calculation times and a single optimization target, which limits its ability to efficiently find routes in large sea areas. Improving the computational efficiency of global route planning and balancing the navigation distance, planning efficiency, and the feasibility of planning under a multi-objective optimization framework are core issues addressed in this study.
Based on the global route planning, the UUV needs to perform local collision avoidance in locally observable environments under sparse-reward conditions. Current collision avoidance algorithms still have deficiencies in reliability, real-time performance, generalization ability, and utilization of learning samples, and they cannot make effective planning decisions when facing multiple dynamic obstacles and uncertain environments. How to build an efficient local collision avoidance strategy and improve real-time obstacle avoidance capability is a key challenge for UUV intelligent navigation technology.
Deep Reinforcement Learning (DRL) has powerful self-learning capabilities. It does not require accurate modeling of the obstacle environment in which the UUV operates, nor does it rely heavily on sensor accuracy. Huang et al. [1] proposed a Double Deep Q Network (DQN) algorithm to optimize collision avoidance and navigation planning. The Double DQN algorithm decouples action selection and evaluation, obtaining the action index corresponding to the maximum Q value from the current network and feeding it into the target network to calculate the target Q value. The results show that overestimation is reduced and that Double DQN can effectively process complex environmental information and perform optimal navigation planning. The applicability of value-function DRL to autonomous collision avoidance has thus been confirmed in existing studies. Compared with value-based, model-free methods, policy-gradient-based methods have clear advantages in handling continuous actions. Xi [2] proposed an end-to-end perception–planning–execution method based on a two-layer deep deterministic policy gradient (DDPG) algorithm to address the sparse rewards, single strategies, and poor environmental adaptability of UUV local motion planning tasks, and to overcome the training difficulties of end-to-end methods that directly output control forces. Wu et al. [3] proposed an end-to-end UUV motion framework based on the PPO algorithm. The framework takes raw sonar perception information directly as input and incorporates multiple objective factors, such as collision avoidance, collision, and speed, into the reward function. Experimental results show that the algorithm enables a UUV to achieve effective collision avoidance in an unknown underwater environment full of obstacles; however, that work did not consider the nonlinear characteristics of UUV kinematics and dynamics. To address this problem, Hou et al. [4] designed a DDPG collision avoidance algorithm in which two neural networks control the thruster and the rudder angle of the UUV, respectively. The reward function considers the distance from the obstacle to the sonar, as well as the change between the current and previous distances to the target point. Simulation results show that the UUV can plan a collision-free path in an unknown continuous environment. Sun et al. [5] demonstrated a collision avoidance planning method that integrates DDPG and the Artificial Potential Field (APF), taking sensor information as input and outputting longitudinal velocity and bow angle. This method uses APF to design the reward function and a SumTree structure to store experience samples according to their importance, giving priority to high-quality samples and accelerating the convergence of the SumTree-DDPG algorithm. Li et al. [6] introduced a goal-oriented hindsight experience replay method to address the inherent sparse-reward problem of UUV local motion planning tasks. Simulation results show that this method effectively improves training stability and sample efficiency. In addition, an APF-based emergency protection mechanism for dynamic collision avoidance is designed for the UUV. The combination of DRL and APF has the potential to improve the performance of intelligent agents in complex environments, but DRL and APF are both reactive collision avoidance algorithms.
If they are combined, the UUV's decision-making and planning will conflict, and a coordination mechanism needs to be designed. Tang [7] used sonar data as the network input and the policy network output of the TD3 framework as the UUV's motion command to complete collision avoidance planning in an unknown 3D environment. In the above papers, the interference of ocean currents is not considered, and the obstacles in the environment are regular. To address these problems, Huang [8] proposed a UUV reactive collision avoidance controller based on the Soft Actor–Critic (SAC) method and designed the state space, action space, and multi-objective function of reactive collision avoidance. The simulation results show that this method can realize collision avoidance planning in sparse 3D static and dynamic environments.
From the above research results, it is clear that many scholars are studying collision avoidance planners based on DRL algorithms, but only a few papers examine the samples used during learning. During model training, the blind exploration of a DRL algorithm generates a large number of low-quality samples, which reduces the utilization of effective sample data and thereby greatly reduces learning efficiency. In addition, there is little research on the stability and generalization of the trained model. These gaps motivate the work in this paper.
UUV autonomous navigation planning is closely related to route tracking. Navigation planning is crucial for ensuring that UUVs provide safe routes and avoid obstacles during autonomous missions, while route tracking is the process of controlling and executing the UUV’s planned routes. The quality of navigation planning directly affects the executability of route tracking; that is, high-quality planned routes should have good command smoothness and eliminate chattering, which is conducive to UUV operation execution.
For 2D route tracking problems, line-of-sight (LOS) guidance is a commonly used method for guiding a UUV along a preset path. The LOS method calculates the LOS angle between the current position of the UUV and the target waypoint and generates corresponding heading commands so that the vehicle gradually converges to the target path, thereby completing route tracking [9]. Fossen and Pettersen [10] identified the source of the tracking error and proved the algorithm's strong convergence and robustness to disturbances. Although proportional LOS guidance is effective and popular, significant tracking errors may occur when the UUV is subjected to drift forces caused by ocean currents. To overcome this difficulty, researchers have done considerable work to mitigate the effects of sideslip. In reference [11], accelerometers are used to measure longitudinal and lateral accelerations, and the corresponding velocities are obtained by integrating these measurements. Reference [12] proposed an integral LOS guidance method for straight-line route tracking that avoids the risk of integrator saturation [13]. Reference [14] derived an improved integral LOS guidance method for tracking curved paths. Reference [15] proposed direct and indirect integral LOS guidance methods for ships exposed to time-varying ocean currents. The above references mitigate the effect of the sideslip angle by adding integral terms to the LOS guidance law. However, integral LOS can only handle constant sideslip angles: when following a curved path or under time-varying ocean disturbances, the sideslip angle changes as the current disturbance changes over time. Moreover, the phase lag introduced by the superimposed integral terms reduces stability. Liu et al. [16] used LOS based on an extended state observer [17] for route tracking of underactuated vehicles with a time-varying sideslip angle. In the marine environment, the actual motion of a UUV is affected by complex nonlinear coupling, involving multiple degrees of freedom and complex motion states. The LOS method has been widely used in two-dimensional route tracking, but challenges remain regarding time-varying ocean currents, sideslip angle changes, and 3D path planning. This motivates research into route tracking methods suitable for 3D environments that solve the path deviation caused by the nonlinear coupled motion of UUVs in 3D space and improve the autonomous navigation capability of UUVs in complex marine environments.
When UUVs perform tasks in the ocean environment, they are inevitably affected by time-varying ocean currents and dynamic sideslip angles, which significantly increase the complexity of route tracking. Traditional guidance methods have poor tracking stability under time-varying flow field disturbances and environmental changes, making it difficult to achieve high-precision three-dimensional route tracking. How to improve navigation stability and autonomous control capabilities in a high-dimensional environment is an important direction that underwater autonomous navigation technology needs to break through.
The main contributions of this article can be summarized as follows:
  • The multi-level state perturbation guidance mechanism and the noisy network with multi-scale parameter noise fusion solve the training oscillation problem caused by fixed-scale noise, and the sampling number penalty term and stratified priority sampling strategy are used to improve learning efficiency and maintain the generalization ability of the DDPG model.
  • The horizontal and vertical front sight vectors are designed to decouple the control design of the UUV's nonlinear motion, and their stable convergence is proved theoretically. This solves the problem that integral LOS guidance has difficulty directly tracking three-dimensional planned waypoints when the UUV attitude changes under ocean current disturbance.

2. Construction of UUV Autonomous Navigation Planning Model

In order to analyze the problem of UUV underwater maneuvering motion control and obstacle detection model construction, the north–east–down (NED) coordinate system, the UUV body coordinate system, and the forward-looking sonar coordinate system were established, as shown in Figure 1. $x$, $y$, and $z$ represent the displacement of the UUV along the three axes of the NED coordinate system. A vehicle coordinate system $o_b x_b y_b z_b$ fixed to the center of mass of the UUV was established to describe the translation and rotation of the UUV. The $o_b x_b$ axis lies in the longitudinal section of the UUV, pointing from the center of mass of the UUV towards its bow; the $o_b y_b$ axis is perpendicular to the longitudinal section of the UUV, with the starboard direction of the UUV as the positive direction; the $o_b z_b$ axis lies in the longitudinal symmetry plane of the UUV, perpendicular to the $o_b x_b$ axis, and points to the bottom of the UUV. Table 1 shows the UUV motion parameters.
Figure 1. Coordinate systems of UUV and sonar.
Table 1. UUV motion parameters and symbols.

2.1. Kinematic and Dynamic Model of 3-DOF UUV

Considering the 3-DOF underactuated UUV kinematics in $u$, $q$ and $r$, spatial motions such as surging, pitching, and yawing can be realized, providing a model simulation basis for the subsequent 3D route optimization of UUV autonomous navigation planning and the tracking of the planned route in 3D space [18]:
$$\begin{cases}\dot{x} = u\cos\Psi\cos\theta \\ \dot{y} = u\sin\Psi\cos\theta \\ \dot{z} = -u\sin\theta \\ \dot{\varphi} = q\sin\theta\tan\theta + r\tan\theta \\ \dot{\theta} = q \\ \dot{\Psi} = r/\cos\theta \end{cases}, \qquad \theta \neq \pm 90^\circ$$
The kinematics of the UUV should also satisfy the following constraints:
$$-40^\circ \le \theta \le 40^\circ, \quad 2\,\mathrm{kn} \le u \le 8\,\mathrm{kn}, \quad -10^\circ/\mathrm{s} \le q \le 10^\circ/\mathrm{s}, \quad -15^\circ/\mathrm{s} \le r \le 15^\circ/\mathrm{s}, \quad -1\,\mathrm{kn/s} \le \dot{u} \le 1\,\mathrm{kn/s}$$
UUV 3-DOF dynamic model:
$$M\dot{V} + C(V)V + D(V)V = \tau$$
where $V = [u, q, r]^T$ represents the UUV surge velocity, pitch rate, and yaw rate.
Mass and added-mass matrix $M$:
$$M = \begin{bmatrix} m - X_{\dot{u}} & 0 & 0 \\ 0 & I_y - M_{\dot{q}} & 0 \\ 0 & 0 & I_z - N_{\dot{r}} \end{bmatrix}$$
where $X_{\dot{u}}$, $M_{\dot{q}}$ and $N_{\dot{r}}$ represent the equivalent added inertia acting on the UUV's motion along the three degrees of freedom of surge, pitch, and yaw, respectively, without considering direct inertial coupling between the three degrees of freedom.
Coriolis matrix $C(V)$:
$$C(V) = \begin{bmatrix} 0 & 0 & (m - X_{\dot{u}})(r + q) \\ 0 & 0 & 0 \\ (m - X_{\dot{u}})(r + q) & 0 & 0 \end{bmatrix}$$
where $(m - X_{\dot{u}})(r + q)$ is the inertial force term caused by motion coupling. A simplified asymmetric structure is adopted, and only the yaw and pitch cross terms caused by the change in surge velocity are retained.
Damping matrix $D(V)$:
$$D(V) = \begin{bmatrix} X_u + X_{u|u|}|u| & 0 & 0 \\ 0 & M_q + M_{q|q|}|q| & 0 \\ 0 & 0 & N_r + N_{r|r|}|r| \end{bmatrix}$$
In the formula, $X_u$, $M_q$ and $N_r$ are the linear damping coefficients of the three degrees of freedom $u$, $q$ and $r$, respectively; $X_{u|u|}$, $M_{q|q|}$ and $N_{r|r|}$ are the corresponding nonlinear damping coefficients; $u$, $q$ and $r$ appear in absolute value to reflect the physical property that nonlinear damping increases with speed.
These control inputs are expressed in vector form as follows:
$$\tau = \begin{bmatrix}\tau_u & \tau_q & \tau_r\end{bmatrix}^T$$
Among them, $\tau_u$ is the longitudinal thrust provided by the propeller, $\tau_q$ is the pitch moment provided by the horizontal rudder, and $\tau_r$ is the yaw moment provided by the rudder.
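For concreteness, the following Python sketch integrates the reduced model above with one forward-Euler step. All hydrodynamic coefficients, the step size, and the state layout are illustrative assumptions, not the vehicle parameters used in this paper; the sketch only shows how the kinematic and dynamic equations fit together.

```python
import numpy as np

# Minimal sketch: one forward-Euler step of the simplified 3-DOF model
# M*V_dot + C(V)V + D(V)V = tau, with V = [u, q, r].
# All coefficients below are illustrative placeholders.
m, Iy, Iz = 30.0, 3.5, 3.5
Xud, Mqd, Nrd = -2.0, -1.0, -1.0            # added-mass coefficients
Xu, Mq, Nr = -5.0, -3.0, -3.0               # linear damping coefficients
Xuu, Mqq, Nrr = -8.0, -4.0, -4.0            # nonlinear damping coefficients

M = np.diag([m - Xud, Iy - Mqd, Iz - Nrd])

def step(eta, V, tau, h=0.1):
    """eta = [x, y, z, phi, theta, psi], V = [u, q, r], tau = [tau_u, tau_q, tau_r]."""
    u, q, r = V
    x, y, z, phi, theta, psi = eta
    # Simplified Coriolis matrix and speed-dependent damping matrix
    a = (m - Xud) * (r + q)
    C = np.array([[0.0, 0.0, a],
                  [0.0, 0.0, 0.0],
                  [a,   0.0, 0.0]])
    D = np.diag([Xu + Xuu * abs(u), Mq + Mqq * abs(q), Nr + Nrr * abs(r)])
    V_dot = np.linalg.solve(M, tau - C @ V - D @ V)
    # Reduced 3-DOF kinematics
    eta_dot = np.array([u * np.cos(psi) * np.cos(theta),
                        u * np.sin(psi) * np.cos(theta),
                        -u * np.sin(theta),
                        q * np.sin(theta) * np.tan(theta) + r * np.tan(theta),
                        q,
                        r / np.cos(theta)])
    return eta + h * eta_dot, V + h * V_dot
```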

2.2. Three-Dimensional Global Environment Model

This study adopts a grid environment in the horizontal direction and discretizes the space in the vertical direction to form an environment representation suitable for navigation planning, as shown in Figure 2. $(x, y, z)$ denotes the coordinates of any point in 3D space, and the UUV planning space can be expressed as:
$$\{(x, y, z) \mid x_{\min} \le x \le x_{\max},\ y_{\min} \le y \le y_{\max},\ z_{\min} \le z \le z_{\max}\}$$
Among them, $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$, $z_{\min}$ and $z_{\max}$ are the value ranges of each dimension of the UUV planning space, respectively.
Figure 2. Schematic diagram of 3D space and slices.
When the UUV moves to point $(x, y, z)$ in 3D space, the altitude of the terrain vertically below the UUV is denoted by $z_{xy}$ ($1 \le x \le m$, $1 \le y \le n$, where $m$ and $n$ are positive integers). The two-dimensional matrix $V_{map}$ represents the environmental model of the UUV, and the index limits $m$ and $n$ of $V_{map}$ represent the maximum boundaries of the three-dimensional ocean environment in the north and east directions. Therefore, each element of the matrix represents the altitude of the terrain vertically below the corresponding point:
$$V_{map} = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1n} \\ z_{21} & z_{22} & \cdots & z_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ z_{m1} & z_{m2} & \cdots & z_{mn} \end{bmatrix}$$

2.3. Time-Varying Ocean Current Disturbance Model

Assume that in the NED coordinate system the ocean current speed is $U_c$ and the direction is $\Psi_c$. The ocean current speed $U_c$ changes dynamically according to a first-order lag system as follows:
$$U_c(t + h) = e^{-h\omega_c} U_c(t) + (1 - e^{-h\omega_c}) U_{cd}$$
$$U_c(t) = U_c(t) + k \times N(0, 1)$$
where $U_c(t)$ is the current speed at time $t$; $U_{cd}$ is the target current speed; $h$ is the simulation step size; $\omega_c$ governs the rate of change of the current speed; and $N(0, 1)$ is normally distributed noise with zero mean and unit standard deviation.
$$\Psi_c(t + h) = e^{-h\omega_\Psi}\Psi_c(t) + (1 - e^{-h\omega_\Psi})\Psi_{cd}$$
$$\Psi_c(t) = \Psi_c(t) + \frac{\pi}{180} \times \frac{1}{20} \times N(0, 1)$$
where $\Psi_c(t)$ is the current direction; $\Psi_{cd}$ is the target current direction; and $\omega_\Psi$ governs the rate of change of the current direction. Both $U_c$ and $\Psi_c$ change in an exponentially smoothed way, reflecting the inertia and disturbance present in natural current changes. At the same time, weak Gaussian white noise is superimposed to simulate small-scale uncertainties in the real ocean environment.
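As a minimal illustration of this disturbance model, the Python sketch below propagates the current speed and direction with the first-order lag and superimposed Gaussian noise described above; the target values, time constants, noise gain $k$, and seed are assumed example settings rather than the paper's parameters.

```python
import numpy as np

# Minimal sketch of the first-order-lag ocean current model described above.
def simulate_current(T=600.0, h=0.1, Uc0=0.2, psi0=np.deg2rad(120.0),
                     Ucd=0.65, psid=np.deg2rad(140.0),
                     wc=0.005, wpsi=0.005, k=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n = int(T / h)
    Uc, psi = np.empty(n), np.empty(n)
    Uc[0], psi[0] = Uc0, psi0
    for i in range(n - 1):
        # first-order lag towards the target speed and direction
        Uc[i + 1] = np.exp(-h * wc) * Uc[i] + (1 - np.exp(-h * wc)) * Ucd
        psi[i + 1] = np.exp(-h * wpsi) * psi[i] + (1 - np.exp(-h * wpsi)) * psid
        # superimposed weak Gaussian disturbances
        Uc[i + 1] += k * rng.standard_normal()
        psi[i + 1] += (np.pi / 180.0) * (1.0 / 20.0) * rng.standard_normal()
    return Uc, psi
```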

2.4. Noise Modeling and Characteristics Analysis of Forward-Looking Sonar Images

This paper uses the Oculus M750d multi-beam forward-looking sonar as the simulated detection prototype, as shown in Figure 3. Its horizontal field of view is a 130° fan-shaped area, its vertical aperture is 20°, its maximum range is 120 m, its effective range is 0.1 m–120 m, its operating frequency is 750 kHz, and its maximum update rate is 27 Hz. Its angular resolution is 1° and it contains 512 beams.
Figure 3. Oculus M750d multibeam forward-looking sonar and sonar image.
In order to improve the effectiveness of subsequent image enhancement and target detection algorithms, it is necessary to systematically model and analyze the sources, characteristics, and statistical properties of noise in sonar images, as shown in Appendix A. Sonar image noise can be divided into three categories according to the noise source: self-noise, environmental noise and reverberation noise [19].
A multi-beam forward-looking sonar simulation model was created to simulate the underwater obstacle perception scene. The sonar horizontal opening angle was divided into clusters of 10 degrees each, and the minimum value among the distance values of the upper, middle and lower layers was selected as the corresponding detection distance $d_i$ of each sector. The sonar measurements of the 512 beams were thus reduced to 13 dimensions $[d_0, d_1, d_2, \ldots, d_{12}]$, as shown in Figure 4.
Figure 4. Multibeam forward-looking sonar field of view detection modeling.
In order to accurately measure the position of obstacles in the global NED coordinate system, the multi-beam forward-looking sonar obtains the obstacle distance and angle information, and converts it into the NED coordinate system through coordinate transformation:
$$\begin{bmatrix} x_b^g \\ y_b^g \\ z_b^g \end{bmatrix} = R(\Theta)\begin{bmatrix} x_b^s + l \\ y_b^s \\ z_b^s \end{bmatrix} + \begin{bmatrix} x_u^g \\ y_u^g \\ z_u^g \end{bmatrix}$$
where $(x_b^g, y_b^g, z_b^g)$ are the coordinates of the obstacle in the NED coordinate system, $(x_b^s, y_b^s, z_b^s)$ are the obstacle measurements converted to the UUV body coordinate system, $(x_u^g, y_u^g, z_u^g)$ are the coordinates of the UUV in the NED coordinate system, $l$ is the distance from the sonar to the center of mass of the UUV, and the origin of the sonar coordinate system lies on the $o_b x_b$ axis of the UUV.
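The following Python sketch illustrates the two perception steps described above: collapsing a 512-beam scan into 13 sector distances and mapping a detected obstacle into the NED frame. It is a simplified single-layer version (the paper also takes the minimum over the upper, middle, and lower layers), the rotation $R(\Theta)$ is assumed to be yaw–pitch only to match the reduced 3-DOF model, and the offset $l$ is a placeholder value.

```python
import numpy as np

def cluster_beams(ranges_512):
    """ranges_512: array of 512 beam ranges; returns 13 sector minima d_0..d_12."""
    sectors = np.array_split(np.asarray(ranges_512), 13)
    return np.array([s.min() for s in sectors])

def obstacle_to_ned(d, beam_angle, uuv_pos, psi, theta, l=0.5):
    """d, beam_angle: obstacle range and bearing in the sonar frame (horizontal plane)."""
    # obstacle position in the body frame, shifted by the sonar offset l along x_b
    p_body = np.array([d * np.cos(beam_angle) + l, d * np.sin(beam_angle), 0.0])
    cp, sp = np.cos(psi), np.sin(psi)
    ct, st = np.cos(theta), np.sin(theta)
    R_yaw = np.array([[cp, -sp, 0.0], [sp, cp, 0.0], [0.0, 0.0, 1.0]])
    R_pitch = np.array([[ct, 0.0, st], [0.0, 1.0, 0.0], [-st, 0.0, ct]])
    # body-to-NED rotation followed by translation to the UUV position
    return R_yaw @ R_pitch @ p_body + np.asarray(uuv_pos)
```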

3. Autonomous Collision Avoidance Planning Method for UUV Based on Target Information Driven DDPG

3.1. Deep Deterministic Policy Gradients

We consider a UUV that interacts with the environment $E$ in discrete time steps. At time $t \in [1, T]$ the UUV takes an action $a_t$ according to the observation $s_t$ and receives a scalar reward $r_t$ (here, we assume the environment is fully observed). We model the task as a Markov decision process with a state space $S$, action space $A$, initial state distribution $p(s_1)$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$. The discounted future return is defined as $R_t = \sum_{k=t}^{T} \gamma^{k-t} r(s_k, a_k)$ with discount factor $\gamma \in (0, 1)$. The UUV's goal is to obtain a policy that maximizes the cumulative discounted reward $R_t$. We denote the discounted state distribution for a policy $\pi$ as $\rho^\pi$ and the action-value function as $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E,\ a_t \sim \rho^\pi}[R_t \mid s_t, a_t]$. The goal of the UUV is to learn a policy that maximizes the expected return $J = \mathbb{E}_{s_i \sim E,\ a_i \sim \pi}[R_1]$.
We use an off-policy approach to learn a deterministic target policy $\mu_{w^\mu}(s)$ from trajectories generated by a stochastic behavior policy $\beta(a \mid s)$. We average over the state distribution of the behavior policy and transform the performance objective into the value function of the target policy $\mu_{w^\mu}(s)$:
$$J_\beta(\mu_{w^\mu}) = \int_S \rho^\beta(s)\, V^\mu(s)\,ds = \int_S \rho^\beta(s)\, Q^\mu(s, \mu_{w^\mu}(s))\,ds$$
$$\nabla_{w^\mu} J_\beta(\mu_{w^\mu}) \approx \int_S \rho^\beta(s)\,\nabla_{w^\mu}\mu_{w^\mu}(s)\,\nabla_a Q^\mu(s, a)\,ds = \mathbb{E}_{s \sim \rho^\beta}\!\left[\nabla_{w^\mu}\mu_{w^\mu}(s)\,\nabla_a Q^\mu(s, a)\big|_{a = \mu_{w^\mu}(s)}\right]$$
$$\delta_t = r_t + \gamma Q^{w^Q}(s_{t+1}, \mu_{w^\mu}(s_{t+1})) - Q^{w^Q}(s_t, a_t)$$
$$w^Q_{t+1} = w^Q_t + \alpha_{w^Q}\,\delta_t\,\nabla_{w^Q} Q^{w^Q}(s_t, a_t)$$
$$w^\mu_{t+1} = w^\mu_t + \alpha_{w^\mu}\,\nabla_{w^\mu}\mu_{w^\mu}(s_t)\,\nabla_a Q^{w^Q}(s_t, a_t)\big|_{a = \mu_{w^\mu}(s)}$$
The validity of (16) does not depend on the gradient of the true action-value function being available; the value function is approximated by a neural network. Replacing the true value function $Q^\mu(s, a)$ with a differentiable value function $Q^w(s, a)$ in Formula (16) yields an off-policy deterministic policy gradient update. The Critic network estimates the true value, $Q^w(s, a) \approx Q^\mu(s, a)$, from the trajectories generated by the behavior policy $\beta(a \mid s)$ using an appropriate policy evaluation algorithm. In the following off-policy DDPG algorithm, the Critic network uses Q-learning updates to estimate the action-value function.
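As a minimal sketch of the resulting actor–critic update (not the full framework proposed later in this section), the following PyTorch-style code performs one DDPG step: a Q-learning update of the Critic followed by the deterministic policy gradient update of the Actor and a soft target update. The network constructors, optimizer setup, replay buffer, and the critic(state, action) signature are assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one DDPG update step.
def ddpg_update(actor, critic, actor_t, critic_t, opt_a, opt_c, batch,
                gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch                           # tensors sampled from the pool
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * critic_t(s2, actor_t(s2))
    critic_loss = F.mse_loss(critic(s, a), q_target)    # TD (Q-learning style) update
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(s, actor(s)).mean()            # maximize Q(s, mu(s))
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # soft update of the target networks
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```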

3.2. Target Information Driven DDPG Algorithm Framework

(1)
State-space design with target information as input
In this paper, the Actor network input $s_t = [d_0, d_1, \ldots, d_{12}, \rho_{ug}^d, \rho_{ug}^\Psi]$ is expanded from a 14-dimensional vector to a 15-dimensional vector, which includes the processed sonar data $[d_0, d_1, \ldots, d_{12}]$, the change in distance $\rho_{ug}^d = d(p_{uuv}, p_{goal})_t - d(p_{uuv}, p_{goal})_{t-1}$ between the current position of the UUV and the target point, and the change in deviation angle $\rho_{ug}^\Psi = \Psi(p_{uuv}, p_{goal})_t - \Psi(p_{uuv}, p_{goal})_{t-1}$ between the UUV heading and the target point over two consecutive moments. The Critic network input is 17-dimensional: in addition to the above inputs, it includes the UUV's action at the previous moment (longitudinal acceleration and yaw angular velocity), normalized and used as network input. Not only is the orientation information of the target point added, but the action of the previous moment is also introduced, which reduces the probability of blind learning of the UUV under sparse rewards and makes the UUV's motion smoother.
(2)
Continuous action space design
DRL methods usually introduce exploration by injecting noise into the action space, which is also a common way to improve the performance of Dueling DQN. In the Q-learning process, an ε-greedy strategy adds a certain proportion of random actions when selecting actions. Different from the previous section, in this section the continuous quantities of the UUV's longitudinal acceleration and bow angular velocity are used as the output actions, and a noisy network is used instead of adding noise to the actions, to improve the stability and generalization of the model.
(3)
DDPG network framework driven by target information
The improved DDPG algorithm network framework is shown in Figure 5. The algorithm obtains training data from the sample pool through the experience replay mechanism and uses the Critic network to calculate the Q value gradient information to guide the Actor network update, thereby continuously optimizing the policy network parameters. Compared with the DQN algorithm based on the value function, the DDPG algorithm shows higher stability in tasks in the continuous action space, higher policy optimization efficiency, and faster convergence speed, and the number of training steps required to reach the optimal solution is also significantly reduced. The network update in this paper uses the average reward of the round to eliminate the impact of the difference in the number of rounds on the evaluation results, making the evaluation results more stable and improving the stability of the model.
Figure 5. Target-oriented DDPG collision avoidance algorithm framework.
(4)
Comprehensive reward function
This paper defines a comprehensive reward function based on the environment in which the UUV is located. The comprehensive reward function designed in this section is expressed as four dynamic reward items:
$$R_{t1} = \alpha_d \tanh(d_t / d_{sg})$$
where $\alpha_d$ is the distance penalty factor, which is a negative number; $d_t$ is the distance between the UUV and the target point at time $t$, and $d_{sg}$ is the distance between the starting point and the target point.
$$R_{t2} = \begin{cases} -2, & \text{if } d < 0.7\,d_{\max} \ \text{and} \ -30^\circ < \theta_{uo} < 30^\circ \\ -1, & \text{if } 0.7\,d_{\max} < d < d_{\max} \end{cases}$$
where $d$ is the distance between the UUV and the obstacle, $d_{\max}$ is the maximum detection distance of the sonar, and $\theta_{uo}$ is the angle between the UUV heading and the obstacle. If the UUV hits an obstacle or a boundary, then $R_{t3} = -80$; if the UUV reaches the target point, then $R_{t4} = 50$.
The comprehensive reward function of this paper is the sum of $R_{t1}$, $R_{t2}$, $R_{t3}$, and $R_{t4}$:
$$R_t = R_{t1} + R_{t2} + R_{t3} + R_{t4}$$
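A minimal Python sketch of this comprehensive reward is given below; the thresholds follow the text, while the interpretation of $R_{t2}$ and $R_{t3}$ as negative penalties and the default value of $\alpha_d$ are assumptions.

```python
import numpy as np

# Minimal sketch of the comprehensive reward described above.
def comprehensive_reward(d_t, d_sg, d_obs, d_max, theta_uo_deg,
                         collided, reached, alpha_d=-1.0):
    r1 = alpha_d * np.tanh(d_t / d_sg)                       # distance shaping term
    r2 = 0.0
    if d_obs < 0.7 * d_max and -30.0 < theta_uo_deg < 30.0:  # obstacle close and ahead
        r2 = -2.0
    elif 0.7 * d_max < d_obs < d_max:                        # obstacle detected, farther away
        r2 = -1.0
    r3 = -80.0 if collided else 0.0                          # collision / boundary penalty
    r4 = 50.0 if reached else 0.0                            # goal reward
    return r1 + r2 + r3 + r4
```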

3.3. Design of DDPG Method with Noisy Network

DDPG uses a deterministic strategy. For complex environments in continuous space, it is often difficult to conduct sufficiently effective global exploration by only adding Gaussian noise or OU noise for exploration. Once it falls into a local optimum, it may lack higher-dimensional randomness to jump out of the local extreme point.
In DDPG, the policy network outputs continuous action values, and the mean squared difference $d(\pi, \tilde{\pi})$ can be used directly to measure the difference between the output actions of two policies:
$$d(\pi, \tilde{\pi}) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_s\!\left[\big(\pi(s)_i - \tilde{\pi}(s)_i\big)^2\right]$$
In the DDPG implementation, $N(0, \sigma^2)$ is used as the perturbation of the policy, and the perturbed policy distribution likewise obeys $\tilde{\pi} \sim \Gamma(0, \sigma^2)$. Set the adaptive scaling factor to $\delta = \sigma$ and construct a single linear unit of the noisy network:
$$y = wx + b$$
where $x \in \mathbb{R}^{p \times 1}$, $y \in \mathbb{R}^{q \times 1}$, $w \in \mathbb{R}^{q \times p}$, $b \in \mathbb{R}^{q \times 1}$. The idea of the noisy network is to regard the parameter $w$ as a distribution rather than a set of fixed values. A resampling method is used for $w \sim \Gamma(\mu_w, \sigma_w)$: first, a noise sample $\zeta_w \sim \Gamma(0, I)$ is drawn, and then
$$w = \sigma_w \zeta_w + \mu_w$$
At this point, both the mean and the variance of the parameter $w$ must be learned; compared with DDPG, the number of parameters to learn doubles. Similarly, the bias becomes $b = \sigma_b \zeta_b + \mu_b$, and the resulting linear unit becomes:
$$y = (\sigma_w \zeta_w + \mu_w) x + \sigma_b \zeta_b + \mu_b$$
Taking the standard DQN $Q(s, a; w)$ as an example, we replace the Q network weight $w$ with the noisy weight parameters $\sigma_w, \zeta_w, \mu_w$ and replace $b$ with $\sigma_b, \zeta_b, \mu_b$, obtaining the DQN with a noisy network:
$$\tilde{y}_i = r + \gamma \max_a \tilde{Q}(s_{t+1}, a; \sigma_w, \zeta_w, \mu_w, \sigma_b, \zeta_b, \mu_b)$$
TD loss function $L(\sigma_w, \mu_w, \sigma_b, \mu_b)$:
$$L(\sigma_w, \mu_w, \sigma_b, \mu_b) = \frac{1}{2}\left[\tilde{Q}(s_t, a; \sigma_w, \zeta_w, \mu_w, \sigma_b, \zeta_b, \mu_b) - \tilde{y}_i\right]^2$$
The training method of the noisy network is exactly the same as that of the standard neural network. Both use backpropagation to calculate the gradient and then use the gradient to update the neural parameters. The chain rule can be used to calculate the gradient of the loss function with respect to the parameters:
$$\frac{\partial L}{\partial \sigma_w} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \sigma_w}, \quad \frac{\partial L}{\partial \mu_w} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \mu_w}, \quad \frac{\partial L}{\partial \sigma_b} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \sigma_b}, \quad \frac{\partial L}{\partial \mu_b} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \mu_b}$$
$$\sigma_w \leftarrow \sigma_w - \alpha\frac{\partial L}{\partial \sigma_w}, \quad \mu_w \leftarrow \mu_w - \alpha\frac{\partial L}{\partial \mu_w}, \quad \sigma_b \leftarrow \sigma_b - \alpha\frac{\partial L}{\partial \sigma_b}, \quad \mu_b \leftarrow \mu_b - \alpha\frac{\partial L}{\partial \mu_b}$$
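A minimal PyTorch sketch of such a noisy linear unit is shown below: both the mean and the scale of every weight and bias are learnable, and fresh noise $\zeta$ is drawn on each forward pass. The initialization ranges and the initial scale $\sigma_0$ are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a noisy linear unit:
# y = (sigma_w * zeta_w + mu_w) x + sigma_b * zeta_b + mu_b
class NoisyLinear(nn.Module):
    def __init__(self, in_dim, out_dim, sigma0=0.05):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_dim, in_dim).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0))

    def forward(self, x):
        zeta_w = torch.randn_like(self.sigma_w)   # noise resampled on every call
        zeta_b = torch.randn_like(self.sigma_b)
        w = self.mu_w + self.sigma_w * zeta_w
        b = self.mu_b + self.sigma_b * zeta_b
        return F.linear(x, w, b)
```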
In order to further improve the robustness, strategy generalization ability and training stability of UUV’s autonomous collision avoidance in complex dynamic ocean environments, this paper proposes an improved noise network mechanism based on the introduction of noise network in the original DDPG structure, combined with a GRU module, which systematically improves the responsiveness of the collision avoidance strategy to dynamic changes in the environment.
(1)
Design of multi-level state disturbance guidance mechanism
Different from the traditional noisy network method that introduces fixed-distribution disturbances at the parameter layer, this study constructs a state disturbance guidance network, which uses the temporal dynamic change in the current observation state to adjust the noise intensity and realize adaptive exploration driven by environmental uncertainty perception. The disturbance weight function $\sigma_i$ is constructed as:
$$\sigma_i = \sigma_0\left(1 + \kappa\,\mathrm{Var}_t(s_{t-n:t})\right)$$
In the formula, $\sigma_0$ is the initial disturbance intensity; $\kappa$ is the adjustment coefficient; and $\mathrm{Var}_t(s_{t-n:t})$ is the variance of the state change over an $n$-step time window. The parameter noise amplitude in the Actor network is dynamically adjusted according to this disturbance weight, so that strategy exploration becomes more aggressive when the environment changes more drastically.
(2)
Multi-scale parameter noise fusion structure
In order to avoid the training oscillation problem caused by fixed-scale noise, a multi-scale noise fusion module is designed to introduce independent noise of different scales into each Actor and Critic hidden layer, forming an inter-layer noise-regularized gradient flow. The hidden state is extracted by a GRU and sent to the noise fusion network to improve the memory of historical decision trajectories and cope with dynamic obstacle strategy patterns. Gaussian noise $N(0, \sigma_i^2)$ of different scales is embedded in each layer and flexibly fused:
$$h_i^{noisy} = \alpha_i\,(W_i x + b_i + \varepsilon_i) + (1 - \alpha_i)\,h_i$$
where $\varepsilon_i \sim N(0, \sigma_i^2)$ is the Gaussian noise introduced in the $i$-th layer; $\alpha_i$ is the learnable noise fusion coefficient of the $i$-th layer; $W_i$ and $b_i$ are the weight and bias of the $i$-th layer; $h_i$ is the original forward feature of the $i$-th layer; and $h_i^{noisy}$ is the output feature after fusing the noise.
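The following Python sketch illustrates these two ideas in isolation: a noise scale that grows with the variance of recent observations, and a layer output that blends a noisy and a clean forward pass through a learnable coefficient $\alpha_i$. Tensor shapes, the variance aggregation, and all constants are assumptions for illustration.

```python
import torch

# Noise scale driven by the variance of the recent state window.
def adaptive_sigma(state_window, sigma0=0.05, kappa=1.0):
    """state_window: tensor [n_steps, state_dim] of recent observations."""
    return sigma0 * (1.0 + kappa * state_window.var(dim=0).mean())

# One fused hidden layer: h_noisy = alpha*(Wx + b + eps) + (1 - alpha)*h_clean.
def fuse_noisy_layer(x, W, b, h_clean, alpha, sigma):
    eps = sigma * torch.randn(W.shape[0])            # eps ~ N(0, sigma^2), one per output unit
    return alpha * (x @ W.t() + b + eps) + (1.0 - alpha) * h_clean
```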

3.4. Design of Experience Pool Sample Sorting Method Based on Proportional Priority

In order to improve the utilization efficiency of training samples and the speed of strategy convergence, a prioritized experience replay (PPER) mechanism based on temporal difference error (TD error) is introduced. This mechanism quantifies the contribution of each experience to learning, uses TD error as a metric for experience priority, and prioritizes sampling of high TD error samples for training, thereby improving the use value and learning efficiency of samples.
However, although traditional PPER improves sample utilization, it also faces the following problems: ① Scanning the entire experience pool at a high frequency to update the priority increases the computational time; ② It is sensitive to random reward signals or function approximation errors, which can easily introduce training instability; ③ Excessive concentration on a few high TD error samples may cause the strategy to fall into local optimality or even overfitting, reducing the generalization ability of the model [20].
(1)
Introducing the Sampling Times Penalty
In order to solve the above problems, this paper introduces a sampling-count penalty term, which places the scheme between purely greedy prioritization and uniform random sampling. The sampling probability $P(i)$ of the $i$-th transition is defined as:
$$P(i) = \frac{(p_i + \varepsilon)^\alpha / (n_i + 1)^\beta}{\sum_k (p_k + \varepsilon)^\alpha / (n_k + 1)^\beta}$$
Among them, $p_i$ represents the priority of the $i$-th experience, expressed as its TD error; the exponent $\alpha$ determines how strongly priorities are used, with $\alpha = 0$ corresponding to uniform sampling; $|p_i| + \varepsilon$ ensures that every experience retains a nonzero priority, where $\varepsilon$ is a very small number; $n_i$ is the number of times the $i$-th experience has been sampled; and $\beta$ controls the strength of the sampling-frequency penalty, typically chosen in the range 0.4–0.6. This mechanism measures the importance of each experience by its TD error and uses it as the basis of the sampling probability: the larger the TD error, the more valuable the experience is for the current policy optimization, and the higher its probability of being sampled.
(2)
Stratified Prioritized Sampling
In the process of UUV collision avoidance planning model training, due to the large number of samples, the model is prone to overfitting risks. In addition, there are common problems in the UUV environment, such as sparse reward signals and drastic changes in the UUV bow angle. The traditional experience replay mechanism is difficult to effectively improve learning efficiency and maintain the generalization ability of the model.
To alleviate the above problems, the samples in the experience pool are divided into several priority intervals (divided into five levels according to percentiles) according to their TD error size, and random sampling is performed in each interval according to the set sampling weight. This strategy ensures that high-error samples are sampled first while avoiding excessive concentration of samples in intervals with large TD errors, thereby reducing the risk of the model falling into local optimality or overfitting, and enhancing sample diversity and overall training stability.
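A minimal NumPy sketch of the resulting sampling scheme is given below: priorities are penalized by the sampling count, and a batch is drawn across five TD-error percentile levels. The level weights and exponents are illustrative values, not the tuned settings of this paper.

```python
import numpy as np

# Priority with sampling-count penalty: (|delta|+eps)^alpha / (n+1)^beta, normalized.
def sampling_probs(td_errors, counts, alpha=0.6, beta=0.5, eps=1e-6):
    pri = (np.abs(td_errors) + eps) ** alpha / (counts + 1.0) ** beta
    return pri / pri.sum()

# Stratified sampling over five TD-error percentile levels.
def stratified_sample(td_errors, counts, batch, level_weights=(0.3, 0.25, 0.2, 0.15, 0.1)):
    rng = np.random.default_rng()
    order = np.argsort(-np.abs(td_errors))            # indices, highest TD error first
    levels = np.array_split(order, 5)
    chosen = []
    for lvl, w in zip(levels, level_weights):
        if len(lvl) == 0:
            continue
        k = max(1, int(round(w * batch)))
        p = sampling_probs(td_errors[lvl], counts[lvl])
        chosen.extend(rng.choice(lvl, size=min(k, len(lvl)), replace=False, p=p))
    return np.array(chosen[:batch])
```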
This paper proposes a DDPG collision avoidance planning algorithm based on sampling number penalty and stratified sampling strategy to improve priority experience replay, namely, proportional priority experience replay.
This form is more robust to outliers, and the Algorithm 1 pseudo-code is as follows:
Algorithm 1. DDPG Algorithm with TD Error Ratio Priority Experience Replay
Input: minibatch size $k$, step size $\eta$, replay period $K$ and capacity $N$, priority exponents $\alpha$, $\beta$, total time $T$.
Initialize the experience pool $D$, $\Delta = 0$, $p_1 = 1$
Select $a_0 \sim \pi_w(s_0)$ according to the initial state $s_0$
for $t = 1$ to $T$ do
  Obtain the observation $s_t$, $r_{t+1}$, $s_{t+1}$ from the environment and store the transition tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ in the experience pool with the maximal priority $p_t = \max_{i < t} p_i$
  if $t \equiv 0 \bmod K$ then
    for $j = 1$ to $k$ do
      Sample a transition $j \sim P(j) = p_j^\alpha / \sum_i p_i^\alpha$
      Compute the importance-sampling weight $\bar{w}_j = (N \cdot P(j))^{-\beta} / \max_i w_i$
      Compute the TD error $\delta_j = r_j + \gamma_j Q_{target}\big(s_j, \arg\max_a Q(s_j, a)\big) - Q(s_{j-1}, a_{j-1})$
      Update the transition priority $p_j \leftarrow |\delta_j|$
      Accumulate the weight change $\Delta \leftarrow \Delta + \bar{w}_j \cdot \delta_j \cdot \nabla_w Q(s_{j-1}, a_{j-1})$
    end for
    Update the network weights $w \leftarrow w + \eta\Delta$, reset $\Delta = 0$
    Copy the weights to the target Q network after a period of time: $w_{target} \leftarrow w$
  end if
  Select action $a_t \sim \pi_w(s_t)$
end for

3.5. Simulation Verification and Analysis

This paper compares the path planning effects of DQN, Dueling DQN, DDPG, SAC, and the proposed algorithm in static obstacle environments. The evaluation indicators include average path length and planning stability. The DDPG algorithm training parameters are shown in Table 2.
Table 2. Hyperparameter settings for DDPG model training.
Figure 6 shows the environment obstacles constructed using a randomly generated method. Each collision avoidance planning algorithm was trained for 300 rounds. As shown in the figure, the UUV's start point is located at [50 m, 45 m] and the target point is located at [520 m, 585 m]; the shortest distance between the two points is 715.9 m. The DDPG algorithm without target-information driving uses a fully connected neural network. A comparison of Figure 6a,d shows that without proper target information driving the strategy, effective convergence is difficult. Once the target location and the corresponding reward mechanism are clearly defined in DDPG, combined with a recurrent neural network to memorize historical states, navigation efficiency is significantly improved.
Figure 6. Comparison of the impact of having and not having goal-oriented information on the performance of planning algorithms.
Table 3 shows that DDPG methods using either LSTM or GRU outperform the DDPG collision avoidance planning algorithm without target information in terms of training time, number of early rounds to reach the target, and number of path steps. DDPG-GRU performs best, achieving the best results in all four metrics: shortest training time, earliest target arrival, shortest path, and best success rate. This demonstrates that GRU is more effective in processing time series data for this task. DDPG without target information significantly lags behind in both convergence speed and path efficiency, demonstrating that incorporating target information and leveraging historical state memory are crucial for improving policy learning in complex environments.
Table 3. Training performance of four different DRL methods.
Figure 7 shows that after each algorithm successfully reaches the target for the first time, the reward value rises rapidly, indicating a significant improvement in policy performance. The cumulative returns of all algorithms gradually increase, reflecting the continuous exploration and optimization of the policy. The DDPG-target-free algorithm only incorporates obstacle distance information and lacks target information. In contrast, the algorithm with a memory-enhanced contextual value network and target information achieves higher cumulative returns, demonstrating that historical state and target information are crucial for the UUV collision avoidance model. Although Dueling DQN-LSTM reduces Q-value overestimation, its final performance is less stable than the DDPG-LSTM and DDPG-GRU algorithms, and its success rate is lower.
Figure 7. Reward function change curves of different algorithms.
The effect of the noise network on the algorithm is verified in the environment of Figure 8. The UUV start point (yellow circle) is located at [380 m, 20 m], and the target point (red circle) is located at [135 m, 575 m]. The straight-line distance between the two points is 606.7 m.
Figure 8. Trajectory comparison of adding noise network.
Table 4 shows that all collision avoidance algorithms can avoid dense static obstacles. The DDPG algorithm has the longest path, while the DDPG + Noise Network algorithm has the shortest path. The DDPG + Noise Network algorithm has the smallest cumulative heading adjustment angle, resulting in the smoothest path. Compared to the Dueling DQN, DQN + Noise Network, and DDPG algorithms, the cumulative adjustment angle is reduced by 63.0%, 63.2%, and 40.3%, respectively.
Table 4. Comparison of planning algorithm path length and cumulative turning angle.
Adding noise to the network weights to realize parameter noise allows the UUV to produce richer behavior. Adding regularized noise to each layer can increase training stability, but it may change the distribution of activation values between adjacent layers. Although methods such as evolution strategies also use parameter perturbations, they lose temporal structure information during training and require more learning samples. In this section, noise is added to every layer in the same way, but the noise in different layers is sampled independently and does not interact.
Figure 9 is a test of the performance of different DRL algorithms in a static environment. The positions of the starting point and the target point remain unchanged. The size and distribution range of each obstacle are randomly generated. Each algorithm is run 100 times (only one round is shown in the figure).
Figure 9. Run 100 times to test the success rate of different algorithms.
As can be seen from the data in Table 5, the traditional DQN based on a discrete action space and its improved Dueling DQN algorithm performed poorly in terms of average path length and standard deviation, reaching 1494.0 m (standard deviation 45.5 m) and 1451.0 m (standard deviation 42.0 m), respectively, indicating large path volatility and insufficient strategy accuracy in complex dynamic environments. In comparison, the DDPG series of algorithms, which model a continuous action space, has better overall performance. Among them, although the original DDPG algorithm is slightly inferior to some comparison algorithms in average path length (1452.7 m), its standard deviation is only 24.1 m, showing strong path stability.
Table 5. Comparison of average path length and standard deviation of different algorithms.
In Figure 10, the DQN algorithm has the lowest success rate, and the SAC algorithm has a higher success rate than the DDPG variants that include only one improvement. The success rate of the DDPG algorithm with a noisy network is 14%, 11%, and 2% higher than that of DQN, Dueling DQN, and the DDPG algorithm without a noisy network, respectively, indicating the effectiveness of the noisy network. The success rate of the DDPG algorithm with both a noisy network and proportional prioritized experience replay is 25%, 21%, and 5% higher than that of the DQN, Dueling DQN, and SAC algorithms, respectively, indicating that introducing the noisy network and proportional prioritized experience replay greatly improves the performance of the DDPG algorithm.
Figure 10. Success rates of different algorithms.
The complex simulation environment in Figure 11 contains seven irregular static obstacles of unknown islands and six dynamic obstacles. The UUV starting point (yellow circle) is located at [70 m, 990 m], the target point (red circle) is located at [800 m, 135 m], and the shortest distance between the two points is 1124.4 m. The speed of each dynamic obstacle is randomly generated between 1 m/s and 2 m/s, and its direction of movement is also randomly generated.
Figure 11. Trajectory diagram of the improved DDPG collision avoidance planning algorithm at different times in complex environments.
Figure 11 shows that, based on the DDPG + Noisy Network + PPER algorithm, the UUV adopted a safe collision avoidance strategy at three different times, avoiding the risk of collision with different obstacles. At t = 460 s, the UUV is about to meet dynamic obstacle 5 moving to the lower left. To avoid the collision, the UUV deflects to the left and moves in the same direction as the obstacle, and then the two move to the upper left together. At t = 600 s, after the obstacle avoidance action is completed, the UUV quickly adjusts its heading and moves towards the target point again. This shows that the proposed improved DDPG algorithm can perceive the movement of obstacles in real time in a complex dynamic environment and make smooth, coherent collision avoidance decisions.
In summary, the DDPG + Noisy Network + PPER model proposed in this paper offers the best overall performance among the compared path planning strategies, with the shortest path length and the smallest fluctuation. This method significantly improves stability while ensuring the efficiency of the UUV's path.

4. Comprehensive Simulation Verification of UUV Planning, Route Tracking and Autonomous Navigation

In order to achieve a seamless connection from route planning to execution, the UUV needs route tracking capability to ensure that it can accurately navigate along the planned route. In complex flow fields, a UUV develops dynamic sideslip angles due to factors such as hydrodynamic nonlinearity, asymmetric flow disturbances, control lag, and incomplete observation. The unpredictability of the sideslip angle and the limits of the control system make it difficult to eliminate completely, which seriously affects path tracking accuracy and navigation stability. To address this problem, this paper constructs horizontal and vertical front sight vectors and, on this basis, proposes a 3D adaptive line-of-sight (ALOS) guidance algorithm to ensure the feasibility of the 3D navigation planning results of the UUV. The simulation verifies the effectiveness of the UUV kinematic model in tracking the planned route under current disturbances, ensuring the feasibility of the UUV collision avoidance planning results.

4.1. Design of 3D ALOS Guidance Method Based on Horizontal and Vertical Front Sight Vectors

As shown in Figure 12, a third coordinate system {P} is introduced for three-dimensional planned-route tracking. The axes of the coordinate system {P} are aligned with the current path segment. In the figure, $\theta_h = \tan^{-1}(y_e^p / \Delta_h)$ and $\theta_v = \tan^{-1}(z_e^p / \Delta_v)$ are the horizontal and vertical look-ahead angles, and $(x_i, y_i, z_i)$ and $(x_{i+1}, y_{i+1}, z_{i+1})$ are two consecutive waypoints output by the UUV collision avoidance planner in the NED coordinate system. The origin of the path-tangential coordinate system {P} is located at $(x_i, y_i, z_i)$, and its horizontal axis points towards the next waypoint $(x_{i+1}, y_{i+1}, z_{i+1})$. $\pi_h$ is the rotation angle of the UUV's NED tracking error around the Z axis, and $\pi_v$ is the rotation angle around the Y axis.
Figure 12. UUV 3D ALOS guidance diagram.
According to the conversion relationship between the coordinate system in Section 2 and the {P} coordinate system, the longitudinal–lateral–vertical tracking error of the UUV in the {P} coordinate system is obtained:
$$\begin{bmatrix} x_e^p \\ y_e^p \\ z_e^p \end{bmatrix} = R_{Y,\pi_v}^T R_{Z,\pi_h}^T \left(\begin{bmatrix} x \\ y \\ z \end{bmatrix} - \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix}\right)$$
Rotation matrices:
$$R_{Y,\pi_v}^T = \begin{bmatrix} \cos(\pi_v) & 0 & -\sin(\pi_v) \\ 0 & 1 & 0 \\ \sin(\pi_v) & 0 & \cos(\pi_v) \end{bmatrix}$$
$$R_{Z,\pi_h}^T = \begin{bmatrix} \cos(\pi_h) & \sin(\pi_h) & 0 \\ -\sin(\pi_h) & \cos(\pi_h) & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
where
$$\pi_h = \operatorname{atan2}(y_{i+1} - y_i,\ x_{i+1} - x_i)$$
$$\pi_v = \operatorname{atan2}\!\left(-(z_{i+1} - z_i),\ \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}\right)$$
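The following Python sketch computes the path azimuth and elevation angles $\pi_h$, $\pi_v$ for a waypoint pair and the tracking errors in the path-tangential frame {P}, mirroring the rotation sequence above; it assumes NED coordinates and waypoints given as 3-element arrays.

```python
import numpy as np

# Path azimuth/elevation angles between two consecutive waypoints.
def path_angles(wp0, wp1):
    dx, dy, dz = np.asarray(wp1) - np.asarray(wp0)
    pi_h = np.arctan2(dy, dx)
    pi_v = np.arctan2(-dz, np.hypot(dx, dy))
    return pi_h, pi_v

# Along-track / cross-track / vertical errors in the path-tangential frame {P}.
def tracking_errors(p, wp0, pi_h, pi_v):
    cv, sv = np.cos(pi_v), np.sin(pi_v)
    ch, sh = np.cos(pi_h), np.sin(pi_h)
    R_y_T = np.array([[cv, 0.0, -sv], [0.0, 1.0, 0.0], [sv, 0.0, cv]])
    R_z_T = np.array([[ch, sh, 0.0], [-sh, ch, 0.0], [0.0, 0.0, 1.0]])
    return R_y_T @ R_z_T @ (np.asarray(p) - np.asarray(wp0))   # [x_e, y_e, z_e]
```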
For a surge velocity $0 < u_{\min} < u < u_{\max}$, the differential equations for the rate of change of the UUV's position in the NED coordinates are:
$$\dot{x} = U_h\cos(\Psi + \beta_c), \quad \dot{y} = U_h\sin(\Psi + \beta_c), \quad \dot{z} = -U_v\sin(\theta - \alpha_c)$$
where $\alpha_c = \tan^{-1}\!\left(\dfrac{v\sin\phi + w\cos\phi}{u}\right)$, $\beta_c = \tan^{-1}\!\left(\dfrac{v\cos\phi - w\sin\phi}{U_v\cos(\theta - \alpha_c)}\right)$, and $U_h$ and $U_v$ are the velocity components in the horizontal and vertical planes, respectively, given by:
$$U_v = u\sqrt{1 + \tan^2(\alpha_c)}$$
$$U_h = U_v\cos(\theta - \alpha_c)\sqrt{1 + \tan^2(\beta_c)}$$
Desired heading and pitch angles:
$$\Psi_d = \pi_h - \hat{\beta}_c - \tan^{-1}(y_e^p / \Delta_h)$$
$$\theta_d = \pi_v + \hat{\alpha}_c + \tan^{-1}(z_e^p / \Delta_v)$$
Under the influence of ocean current disturbance, the UUV's longitudinal velocity, sway velocity, and bow angle will change due to external environmental interference, and the phase angles $\alpha_c$ and $\beta_c$ will change accordingly, which causes the horizontal front sight component $U_h$ and the vertical front sight component $U_v$ determined by Formulas (40) and (41) to change as well. In order to simplify the tracking problem, the concept of relative velocity is used to model the ocean current disturbance. For straight-line path tracking, the following assumption is made:
Assumption 1.
During the path-following process, $\alpha_c$ and $\beta_c$ are constant, that is, $\dot{\alpha}_c = 0$ and $\dot{\beta}_c = 0$.
The assumption is satisfied when the UUV follows a straight path at a constant speed and is considered to meet the assumption even if there is a slight speed change under time-varying environmental disturbances.
The tracking error differential equations along three directions are:
$$\begin{bmatrix} \dot{x}_e^p \\ \dot{y}_e^p \\ \dot{z}_e^p \end{bmatrix} = R_{Y,\pi_v}^T R_{Z,\pi_h}^T \begin{bmatrix} U_h\cos(\Psi + \beta_c) \\ U_h\sin(\Psi + \beta_c) \\ -U_v\sin(\theta - \alpha_c) \end{bmatrix}$$
Expanding the last two rows of the above equation gives the lateral and vertical tracking error dynamics:
$$\dot{y}_e^p = U_h\sin(\Psi + \beta_c - \pi_h)$$
$$\dot{z}_e^p = U_h\sin(\pi_v)\cos(\Psi + \beta_c - \pi_h) - U_v\cos(\pi_v)\sin(\theta - \alpha_c)$$
To design the depth controller, the vertical error dynamics in Formula (46) are rewritten in terms of the path-frame quantities. Substituting Formula (41) into Formula (46) gives:
$$\dot{z}_e^p = U_v\cos(\theta - \alpha_c)\sqrt{1 + \tan^2(\beta_c)}\,\sin(\pi_v)\cos(\Psi + \beta_c - \pi_h) - U_v\cos(\pi_v)\sin(\theta - \alpha_c)$$
Using the sum and difference trigonometric identities, we obtain:
$$\dot{z}_e^p = -U_v\sin(\theta - \alpha_c - \pi_v) + U_v\sin(\pi_v)\cos(\theta - \alpha_c)\left[\sqrt{1 + \tan^2(\beta_c)}\cos(\Psi + \beta_c - \pi_h) - 1\right]$$
Then,
$$\dot{z}_e^p = -U_v\sin(\theta - \alpha_c - \pi_v) + \frac{U_h\sin(\pi_v)}{\sqrt{1 + \tan^2(\beta_c)}}\left[\sqrt{1 + \tan^2(\beta_c)}\cos(\Psi + \beta_c - \pi_h) - 1\right]$$
ALOS guidance law for 3D path following:
$$\Psi_d = \pi_h - \hat{\beta}_c - \tan^{-1}\!\left(\frac{y_e^p}{\Delta_h}\right)$$
$$\dot{\hat{\beta}}_c = \gamma_h\frac{\Delta_h}{\sqrt{\Delta_h^2 + (y_e^p)^2}}\,y_e^p$$
$$\theta_d = \pi_v + \hat{\alpha}_c + \tan^{-1}\!\left(\frac{z_e^p}{\Delta_v}\right)$$
$$\dot{\hat{\alpha}}_c = \gamma_v\frac{\Delta_v}{\sqrt{\Delta_v^2 + (z_e^p)^2}}\,z_e^p$$
In Formulas (45) and (48), the motion of the horizontal and vertical planes is decoupled. $\Delta_h > 0$ and $\Delta_v > 0$ are user-specified look-ahead distances, and $\gamma_h > 0$, $\gamma_v > 0$ are adaptive gains. $\hat{\alpha}_c$ and $\hat{\beta}_c$ are the estimates of $\alpha_c$ and $\beta_c$, respectively. The stability analysis assumes that the heading and depth autopilots achieve perfect tracking, $\Psi = \Psi_d$ and $\theta = \theta_d$. A parameter projection method is used to enhance the robustness of the ALOS guidance law:
$$\dot{\hat{\alpha}}_c = \gamma_v\frac{\Delta_v}{\sqrt{\Delta_v^2 + (z_e^p)^2}}\operatorname{Pr}(\hat{\alpha}_c, z_e^p)$$
$$\dot{\hat{\beta}}_c = \gamma_h\frac{\Delta_h}{\sqrt{\Delta_h^2 + (y_e^p)^2}}\operatorname{Pr}(\hat{\beta}_c, y_e^p)$$
The parameter estimates are confined to compact sets, $\hat{\alpha}_c \in M_{\hat{\theta}}$, $\hat{\beta}_c \in M_{\hat{\theta}}$, with $M_{\hat{\theta}} = M_\theta + \epsilon$, where $\epsilon$ is a very small positive number so that $M_{\hat{\theta}} > M_\theta$ holds, and
$$\operatorname{Pr}(\hat{\theta}, \tau) = \begin{cases} \big(1 - c(\hat{\theta})\big)\tau & \text{if } \|\hat{\theta}\| > M_\theta \text{ and } \hat{\theta}^T\tau > 0 \\ \tau & \text{otherwise} \end{cases}$$
where $c(\hat{\theta}) = \min\{1, (\hat{\theta}^2 - M_\theta^2)/(M_{\hat{\theta}}^2 - M_\theta^2)\}$ is a particular parameter projection, modified to ensure semi-global stability.
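A minimal Python sketch of the resulting guidance update is given below: the desired heading and pitch are computed from the cross-track errors and the adaptive estimates, which are then propagated with the projection-protected adaptation laws. The look-ahead distances, adaptive gains, projection bound $M$, and the explicit Euler integration are illustrative assumptions.

```python
import numpy as np

# Simple parameter projection keeping the estimate inside a bound M.
def project(theta_hat, tau, M=np.deg2rad(20.0), eps=1e-3):
    M_hat = M + eps
    if abs(theta_hat) > M and theta_hat * tau > 0:
        c = min(1.0, (theta_hat**2 - M**2) / (M_hat**2 - M**2))
        return (1.0 - c) * tau
    return tau

# One ALOS guidance step: desired heading/pitch plus adaptive sideslip estimates.
def alos_step(y_e, z_e, pi_h, pi_v, beta_hat, alpha_hat,
              dt, Delta_h=10.0, Delta_v=10.0, gamma_h=0.001, gamma_v=0.001):
    psi_d = pi_h - beta_hat - np.arctan2(y_e, Delta_h)
    theta_d = pi_v + alpha_hat + np.arctan2(z_e, Delta_v)
    beta_hat += dt * gamma_h * Delta_h / np.sqrt(Delta_h**2 + y_e**2) * project(beta_hat, y_e)
    alpha_hat += dt * gamma_v * Delta_v / np.sqrt(Delta_v**2 + z_e**2) * project(alpha_hat, z_e)
    return psi_d, theta_d, beta_hat, alpha_hat
```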
Under perfect heading and pitch tracking, the error dynamics become:
$$\dot{y}_e^p = U_h\sin\!\left(\tilde{\beta}_c - \tan^{-1}\!\left(\frac{y_e^p}{\Delta_h}\right)\right)$$
$$\dot{z}_e^p = U_v\sin\!\left(\tilde{\alpha}_c - \tan^{-1}\!\left(\frac{z_e^p}{\Delta_v}\right)\right) + g(t, y_e^p, \tilde{\beta}_c)$$
where $\tilde{\alpha}_c = \alpha_c - \hat{\alpha}_c$, $\tilde{\beta}_c = \beta_c - \hat{\beta}_c$, and
$$g(t, y_e^p, \tilde{\beta}_c) = \frac{U_h\sin(\pi_v)}{\sqrt{1 + \tan^2(\beta_c)}}\left[\sqrt{1 + \tan^2(\beta_c)}\cos\!\left(\tilde{\beta}_c - \tan^{-1}\!\left(\frac{y_e^p}{\Delta_h}\right)\right) - 1\right]$$
Using the trigonometric transformation equations of sum and difference angles:
$$\dot{y}_e^p = -U_h\sin\!\left(\tan^{-1}\!\left(\frac{y_e^p}{\Delta_h}\right)\right)\cos(\tilde{\beta}_c) + U_h\cos\!\left(\tan^{-1}\!\left(\frac{y_e^p}{\Delta_h}\right)\right)\sin(\tilde{\beta}_c)$$
$$\dot{z}_e^p = -U_v\sin\!\left(\tan^{-1}\!\left(\frac{z_e^p}{\Delta_v}\right)\right)\cos(\tilde{\alpha}_c) + U_v\cos\!\left(\tan^{-1}\!\left(\frac{z_e^p}{\Delta_v}\right)\right)\sin(\tilde{\alpha}_c) + g(t, y_e^p, \tilde{\beta}_c)$$
Using $\sin(\tan^{-1}(x/d)) = x/\sqrt{d^2 + x^2}$ and $\cos(\tan^{-1}(x/d)) = d/\sqrt{d^2 + x^2}$, the above equations are transformed into a nonlinear cascade system:
$$\Sigma_1:\ \begin{cases} \dot{z}_e^p = -\dfrac{U_v\Delta_v}{\sqrt{\Delta_v^2 + (z_e^p)^2}}\left(\cos(\tilde{\alpha}_c)\dfrac{z_e^p}{\Delta_v} - \sin(\tilde{\alpha}_c)\right) + g(t, y_e^p, \tilde{\beta}_c) \\[2mm] \dot{\tilde{\alpha}}_c = -\gamma_v\dfrac{\Delta_v}{\sqrt{\Delta_v^2 + (z_e^p)^2}}\operatorname{Pr}(\hat{\alpha}_c, z_e^p) \end{cases}$$
$$\Sigma_2:\ \begin{cases} \dot{y}_e^p = -\dfrac{U_h\Delta_h}{\sqrt{\Delta_h^2 + (y_e^p)^2}}\left(\cos(\tilde{\beta}_c)\dfrac{y_e^p}{\Delta_h} - \sin(\tilde{\beta}_c)\right) \\[2mm] \dot{\tilde{\beta}}_c = -\gamma_h\dfrac{\Delta_h}{\sqrt{\Delta_h^2 + (y_e^p)^2}}\operatorname{Pr}(\hat{\beta}_c, y_e^p) \end{cases}$$
From Assumption 1, $\dot{\tilde{\alpha}}_c = -\dot{\hat{\alpha}}_c$ and $\dot{\tilde{\beta}}_c = -\dot{\hat{\beta}}_c$; $U_h$, $U_v$, and $g(t, y_e^p, \tilde{\beta}_c)$ are time-varying, so the $\Sigma_1$ and $\Sigma_2$ subsystems are non-autonomous. In some cases, it is advantageous to make the look-ahead distances $\Delta_h$, $\Delta_v$ time-varying as well. With Assumption 1 valid, the stability of the cascade of $\Sigma_1$ and $\Sigma_2$ is guaranteed by Lemmas 1 and 2 and Theorem 1.
Lemma 1.
If the perturbation $g(t, y_e^p, \tilde{\beta}_c) \equiv 0$, then the equilibrium point $(z_e^p, \tilde{\alpha}_c) = (0, 0)$ of the subsystem $\Sigma_1$ is uniformly semiglobally exponentially stable.
Lemma 2.
The equilibrium point $(y_e^p, \tilde{\beta}_c) = (0, 0)$ of the subsystem $\Sigma_2$ is uniformly semiglobally exponentially stable (USGES).
Theorem 1.
Consider a three-dimensional path-following system for a UUV guided by the adaptive line-of-sight (ALOS) method. The system dynamics are decoupled into a horizontal guidance subsystem and a vertical guidance subsystem as follows:
$$\dot{\tilde{y}}_e = -\gamma_h\tilde{y}_e + g_1(t)$$
$$\dot{\tilde{z}}_e = -\gamma_v\tilde{z}_e + h(\tilde{y}_e) + g_2(t)$$
where $\gamma_h, \gamma_v > 0$ are control gains; $g_1(t)$, $g_2(t)$ are bounded external disturbances satisfying $\|g_i(t)\| \le \bar{g}_i$, $i = 1, 2$; and $h(\tilde{y}_e)$ is a continuous function that satisfies the Lipschitz condition $\|h(\tilde{y}_e)\| \le L\|\tilde{y}_e\|$ for some $L > 0$.
Then, under suitable selection of $\gamma_h$ and $\gamma_v$, the origin of the cascaded system $\Sigma_1$–$\Sigma_2$ is Uniformly Semiglobally Exponentially Stable (USGES) with respect to bounded disturbances.
Proof. 
(1)
Lyapunov Stability of the Horizontal Subsystem
Define the Lyapunov function candidate:
$$V_1(\tilde{y}_e) = \frac{1}{2}\tilde{y}_e^2$$
Taking the derivative along the trajectory of the system yields:
$$\dot{V}_1 = \tilde{y}_e\dot{\tilde{y}}_e = -\gamma_h\tilde{y}_e^2 + \tilde{y}_e g_1(t) \le -\gamma_h\tilde{y}_e^2 + |\tilde{y}_e|\,\bar{g}_1$$
By Young's inequality:
$$\dot{V}_1 \le -\gamma_h\tilde{y}_e^2 + \frac{\tilde{y}_e^2}{2\gamma_h} + \frac{\gamma_h\bar{g}_1^2}{2} = -\left(\gamma_h - \frac{1}{2\gamma_h}\right)\tilde{y}_e^2 + \frac{\gamma_h\bar{g}_1^2}{2}$$
Therefore, the subsystem is input-to-state stable (ISS) and exponentially stable when $\gamma_h > \frac{1}{\sqrt{2}}$.
(2)
Lyapunov Stability of the Vertical Subsystem
$$V_2(\tilde{z}_e) = \frac{1}{2}\tilde{z}_e^2$$
$$\dot{V}_2 = \tilde{z}_e\dot{\tilde{z}}_e = -\gamma_v\tilde{z}_e^2 + \tilde{z}_e h(\tilde{y}_e) + \tilde{z}_e g_2(t)$$
Using the Lipschitz condition and Young's inequality:
$$\tilde{z}_e h(\tilde{y}_e) \le L\|\tilde{y}_e\|\|\tilde{z}_e\| \le \frac{L}{2\varepsilon}\tilde{y}_e^2 + \frac{L\varepsilon}{2}\tilde{z}_e^2$$
Similarly, $\tilde{z}_e g_2(t) \le \frac{1}{2\delta}\bar{g}_2^2 + \frac{\delta}{2}\tilde{z}_e^2$, with, for example, $\varepsilon = \delta = \frac{\gamma_v}{2L + 1}$. Thus:
$$\dot{V}_2 \le -\left(\gamma_v - \frac{L\varepsilon + \delta}{2}\right)\tilde{z}_e^2 + \frac{L}{2\varepsilon}\tilde{y}_e^2 + \frac{1}{2\delta}\bar{g}_2^2$$
Let $\varepsilon$, $\delta$ be chosen such that:
$$\gamma_v - \frac{L\varepsilon + \delta}{2} \triangleq \alpha_2 > 0$$
Then:
$$\dot{V}_2 \le -\alpha_2\tilde{z}_e^2 + \beta\tilde{y}_e^2 + C$$
with $\beta = \frac{L}{2\varepsilon}$, $C = \frac{1}{2\delta}\bar{g}_2^2$.
(3)
Composite Lyapunov Function
Define the total Lyapunov function:
$$V = V_1 + V_2 = \frac{1}{2}\tilde{y}_e^2 + \frac{1}{2}\tilde{z}_e^2$$
$$\dot{V} \le -(\alpha_1 - \beta)\tilde{y}_e^2 - \alpha_2\tilde{z}_e^2 + \frac{\gamma_h\bar{g}_1^2}{2} + C, \qquad \alpha_1 = \gamma_h - \frac{1}{2\gamma_h}$$
If the gains $\gamma_h$, $\gamma_v$ are selected so that $\alpha_1 > \beta$, the state converges exponentially under bounded input disturbances, that is, the system satisfies the USGES condition. □

4.2. UUV Autonomous Navigation Planning and Route Tracking Capability Simulation Verification

The simulation environment shown in Figure 13 simulates the scenario of a UUV searching for targets in narrow waters under variable depth conditions. The UUV starts from the yellow square starting point and heads for the yellow pentagonal target point. During navigation, the UUV will encounter multiple prohibited areas, including light-colored spherical static obstacles and dark-colored dynamic obstacles (such as minefields, enemy moving UUVs, sonar detection areas, etc.), as shown in the figure. These obstacles pose a threat to the safe navigation of the UUV. The simulation parameter settings for variable depth environments are shown in Table 6. The experiment compares the autonomous collision avoidance performance of PSO combined with five algorithms, DWA, DQN, Dueling DQN, SAC, and DDPG, in a dynamic environment to evaluate the effectiveness and stability of these algorithms in dealing with environmental uncertainty.
Figure 13. UUV collision avoidance planning trajectory diagram of different planning algorithms under variable depth conditions.
Table 6. Setting of simulation parameters for variable depth environment.
In the planning results of the different algorithms shown in Figure 14, the vertical-axis labels 1, 2, 3, 4, and 5 correspond to the initial distances of the five dynamic obstacles relative to the UUV in Figure 10. The performance of each algorithm can be evaluated by how the distance between the UUV and each dynamic obstacle changes over time: the larger the distance, the better the obstacle avoidance effect. The figure shows that, with the algorithm proposed in this paper, the distance between the UUV and every dynamic obstacle remains greater than 20 m, indicating that the UUV successfully avoided collisions with dynamic obstacles in all simulations of the autonomous collision avoidance planning algorithm.
Figure 14. Distance variation curves between UUV and dynamic obstacles under different planning algorithms under variable depth conditions.
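For reference, the distance curves of Figure 14 can be reproduced from logged trajectories along the lines of the sketch below; the array shapes, random test data, and the 20 m threshold check are illustrative assumptions rather than the authors' post-processing code.

```python
import numpy as np

def obstacle_distances(uuv_traj, obstacle_trajs):
    """uuv_traj: (T, 3) UUV positions; obstacle_trajs: (N, T, 3) moving-obstacle
    positions sampled at the same instants. Returns an (N, T) distance matrix."""
    return np.linalg.norm(obstacle_trajs - uuv_traj[None, :, :], axis=-1)

# Hypothetical logged data: 500 samples, 5 dynamic obstacles.
uuv_traj = np.cumsum(np.random.randn(500, 3) * 0.1, axis=0)
obstacle_trajs = np.cumsum(np.random.randn(5, 500, 3) * 0.1, axis=1) + 30.0

d = obstacle_distances(uuv_traj, obstacle_trajs)
print("minimum distance per obstacle [m]:", d.min(axis=1))
print("safety margin (> 20 m) satisfied:", bool((d.min(axis=1) > 20.0).all()))
```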
Table 7 summarizes the path length, planning time, and minimum distance to dynamic obstacles for the different algorithms. The results show that the DDPG algorithm generates the shortest path, while the DQN algorithm has the shortest average single-step planning time, i.e., the fastest output of control commands. In contrast, PSO combined with DWA has the longest planning time, mainly because, in a three-dimensional dynamic environment, the improved PSO requires multiple iterations to optimize the global route, and the curvature-constrained DWA is also time-consuming each time it predicts candidate trajectories. PSO combined with DWA, DQN, and DDPG achieve similar minimum distances to the dynamic obstacles; all of them avoid obstacles effectively and ensure safe navigation of the UUV in complex environments. Although the path lengths of Dueling DQN and SAC are greater than that of DDPG, their minimum collision avoidance distances to dynamic obstacles are larger, which reflects a certain degree of safety redundancy.
Table 7. Path length, planning time, and minimum collision avoidance distance of different planning algorithms under variable depth conditions.
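The path-length and average single-step planning-time entries of Table 7 correspond to quantities of the kind computed below; the variable names and the dummy planner step are assumptions used only to make the sketch runnable.

```python
import time
import numpy as np

def path_length(waypoints):
    """Sum of Euclidean segment lengths of a (T, 3) waypoint array."""
    return float(np.linalg.norm(np.diff(waypoints, axis=0), axis=1).sum())

def timed_replanning(planner_step, n_steps=300):
    """Average wall-clock time of one local planning step, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(n_steps):
        planner_step()
    return 1e3 * (time.perf_counter() - t0) / n_steps

# Hypothetical usage with placeholder waypoints and a dummy planner step:
waypoints = np.cumsum(np.random.rand(200, 3), axis=0)
print("path length [m]:", round(path_length(waypoints), 2))
print("avg step time [ms]:", round(timed_replanning(lambda: np.linalg.inv(np.eye(50))), 3))
```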
To verify whether the waypoints generated by each planning algorithm are executable, they are input into the 3-DOF model of the UUV, and the resulting state variables are observed under the ocean current disturbance shown in Figure 15. When t < 100 s, the current is U_c = 0.2 m/s with Ψ_c = 120°; when 100 s ≤ t < 300 s, U_c = 0.2 m/s while Ψ_c gradually increases from 120° to 140° and then fluctuates within a small range; when t ≥ 300 s, U_c gradually increases from 0.2 m/s to 0.65 m/s and then fluctuates around 0.65 m/s.
Figure 15. Ocean current speed and direction change curve.
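A minimal sketch of the piecewise current profile described above is given below; the ramp durations and the small fluctuation terms are assumptions consistent with the stated segment values, since the exact shapes are defined by Figure 15.

```python
import numpy as np

def current_profile(t):
    """Time-varying current speed U_c [m/s] and direction Psi_c [deg].
    Segment values follow the simulation description; ramp durations and the
    small fluctuation terms are illustrative assumptions."""
    t = np.asarray(t, dtype=float)

    # Direction: 120 deg, ramping to 140 deg between 100 s and 250 s
    # (assumed ramp length), then fluctuating slightly around 140 deg.
    ramp_psi = np.clip((t - 100.0) / 150.0, 0.0, 1.0)
    psi = 120.0 + 20.0 * ramp_psi + np.where(t >= 250.0, 1.5 * np.sin(0.05 * t), 0.0)

    # Speed: 0.2 m/s, ramping to 0.65 m/s after 300 s (assumed 100 s ramp),
    # then fluctuating slightly around 0.65 m/s.
    ramp_u = np.clip((t - 300.0) / 100.0, 0.0, 1.0)
    u = 0.2 + 0.45 * ramp_u + np.where(t >= 400.0, 0.02 * np.sin(0.1 * t), 0.0)
    return u, psi

t = np.arange(0.0, 500.0, 0.1)
U_c, Psi_c = current_profile(t)
```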
Combining Figure 13, Figure 14 and Figure 16, it can be seen that the trajectories of all algorithms are relatively smooth in the initial stage. When encountering dynamic obstacles 3 and 4, the DWA, DQN, and Dueling DQN algorithms adopt strategies that bend the track sharply; in particular, the waypoints planned by the DWA and DQN algorithms cannot be tracked effectively by the UUV route tracking control system, whereas the waypoints of the Dueling DQN algorithm can still be realized. The trajectories of the SAC and DDPG algorithms are comparatively stable, their overall tracking performance is good, and they show strong dynamic adaptability.
Figure 16. The waypoints of UUV path tracking with different planning algorithms under ocean current disturbance.
Figure 17 shows that, in the presence of ocean current interference, the proposed three-dimensional ALOS guidance algorithm keeps the longitudinal speed relatively stable, indicating good control of the UUV surge speed. All attitude quantities remain stable before 300 s, and the roll and pitch angular velocities are close to zero between 100 s and 300 s; between 300 s and 500 s, the collision avoidance maneuvers of the various planning algorithms against dynamic obstacles cause fluctuations in the attitude quantities. Combined with Figure 16, it can be seen that the paths of SAC and the proposed DDPG algorithm are smoother and their collision avoidance strategies are better: they achieve stable, continuous heading adjustment, which improves the overall tracking quality.
Figure 17. UUV motion state change curves of different planning algorithms under variable depth conditions.
Figure 18 shows the changes in the angle of attack α and sideslip angle β of the UUV during navigation under the different planning algorithms. The angle of attack is positive for every algorithm at the initial moment, indicating that the UUV is in the climbing stage; this is consistent with the path trajectories in Figure 16 and shows that the pitch attitude control is effective. The nonzero initial sideslip angle results from the large angle between the UUV bow direction and the ocean current direction at the start; under the joint action of the three-dimensional ALOS algorithm and the controller, the sideslip angle decreases rapidly and is kept within ±10° for all algorithms. The results show that the three-dimensional ALOS track tracking method, combined with the collision avoidance strategy, effectively improves the attitude stability of the UUV in a time-varying ocean current environment.
Figure 18. UUV attack angle and sideslip angle curves of different planning algorithms under variable depth conditions.
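The angle of attack and sideslip angle plotted in Figure 18 follow from the body-frame velocity components using the standard marine-craft definitions, as in the sketch below; the velocity logs shown are placeholders.

```python
import numpy as np

def attack_sideslip(u, v, w):
    """Angle of attack alpha and sideslip angle beta [deg] from body-frame
    surge (u), sway (v), and heave (w) velocities, using the standard
    marine-craft definitions alpha = atan2(w, u), beta = asin(v / |V|)."""
    u, v, w = map(np.asarray, (u, v, w))
    speed = np.sqrt(u**2 + v**2 + w**2)
    alpha = np.degrees(np.arctan2(w, u))
    beta = np.degrees(np.arcsin(
        np.divide(v, speed, out=np.zeros_like(speed), where=speed > 0)))
    return alpha, beta

# Hypothetical logged body-frame velocities:
alpha, beta = attack_sideslip(u=[1.5, 1.6], v=[0.2, 0.1], w=[0.1, 0.05])
print("alpha [deg]:", alpha, "beta [deg]:", beta)
```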
Figure 19 shows the pitch angle and heading angle tracking errors of each algorithm. In the variable-depth collision avoidance planning and three-dimensional route tracking tasks, the DDPG algorithm performs best in the pitch angle, heading angle, and position error indicators, showing good route tracking accuracy. The SAC algorithm also has strong attitude and trajectory control capability and is suitable for autonomous navigation in dynamic, complex environments. DQN and Dueling DQN show certain deficiencies in attitude control accuracy and trajectory smoothness. As shown in Figure 16, the improved PSO + DWA and DQN algorithms cannot reach the target point safely, and they have difficulty completing obstacle avoidance planning in highly dynamic scenes.
Figure 19. UUV pitching angle and heading angle tracking error curves under different algorithms.
Table 8 compares the root-mean-square errors (RMSEs) of pitch angle and heading angle tracking of the different algorithms in the variable-depth environment. The SAC algorithm has the smallest pitch angle tracking RMSE; the DDPG algorithm has the smallest heading angle tracking RMSE, which is 12.8%, 17.1%, 21.7%, and 9.4% lower than those of the DWA, DQN, Dueling DQN, and SAC algorithms, respectively; the DDPG algorithm also has the smallest position tracking RMSE, which is 23.8%, 15.04%, 14.2%, and 11.1% lower than those of the DWA, DQN, Dueling DQN, and SAC algorithms, respectively. Overall, DDPG shows comprehensive advantages in robustness, accuracy, and track smoothness, verifying its superiority in complex 3D dynamic environments.
Table 8. Comparison of tracking root mean square error of different algorithms in variable depth environment.
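The RMSE entries of Table 8 are root-mean-square tracking errors of the kind sketched below; the angle-wrapping step and the placeholder logs are assumptions about the post-processing rather than the authors' exact script.

```python
import numpy as np

def wrap_angle_deg(err):
    """Wrap angle errors to (-180, 180] degrees before averaging."""
    return (np.asarray(err) + 180.0) % 360.0 - 180.0

def rmse(err):
    """Root-mean-square of an error sequence."""
    return float(np.sqrt(np.mean(np.square(err))))

# Hypothetical reference/actual logs (500 samples):
psi_ref, psi = np.full(500, 30.0), 30.0 + np.random.randn(500)
theta_ref, theta = np.zeros(500), 0.5 * np.random.randn(500)
pos_err = 0.3 * np.random.randn(500, 3)

print("heading RMSE [deg]:", rmse(wrap_angle_deg(psi - psi_ref)))
print("pitch RMSE [deg]:", rmse(wrap_angle_deg(theta - theta_ref)))
print("position RMSE [m]:", rmse(np.linalg.norm(pos_err, axis=1)))
```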
In the UUV actuator response curves of Figure 20, the DDPG and SAC algorithms show the smallest fluctuations in the horizontal rudder angle and tail rudder angle among the planning algorithms, and their control is smoother; the thruster speed of every method remains stable, ensuring speed stability and continuity of track tracking.
Figure 20. Change curve of UUV actuator under different planning algorithms under variable depth conditions.
From the above simulation results, it can be seen that the improved DDPG algorithm in this paper achieves the best tracking performance, particularly in the heading angle and position tracking errors, where it outperforms the other DRL methods; it is therefore suitable for UUV autonomous navigation tasks with high requirements on route tracking accuracy and stability. The actuator response curves show that its control is smoother and that the algorithm is strongly adaptive. In addition, the collision avoidance planning times of the DDPG and SAC algorithms are more than 95% shorter than that of the improved DWA algorithm, which demonstrates the strong real-time performance of DRL-based collision avoidance planning. Under the proposed algorithm, the distance between the UUV and every dynamic obstacle remains greater than 20 m, indicating that the UUV successfully avoids collisions with dynamic obstacles under all the autonomous collision avoidance planning algorithms.

5. Conclusions

This paper proposes a hierarchical autonomous navigation framework for unmanned underwater vehicles (UUVs) that combines global route planning based on an improved particle swarm optimization (PSO) algorithm, local collision avoidance using a deep deterministic policy gradient (DDPG) method enhanced by noisy networks and proportional prioritized experience replay (PPER), and robust three-dimensional path tracking using an adaptive line-of-sight (ALOS) guidance strategy.
In summary, the proposed method demonstrates strong capability for efficient and safe UUV navigation in scenarios involving sparse rewards, limited perception, and nonlinear motion coupling. The framework effectively bridges global planning, local avoidance, and route tracking into a coherent autonomous navigation strategy.
Future research on this topic includes the following:
When facing multiple dynamic obstacles, the UUV's collision avoidance planning ability is still inferior to that of the curvature-constrained DWA algorithm. In future work, we plan to give the model explicit perception of dynamic obstacles: an extended Kalman filter (EKF) will be used for online filtering and trajectory prediction of dynamic obstacles, the EKF output will be integrated into the state input of the DRL agent, and a dynamic-obstacle term will be introduced into the reward function. An LSTM/GRU network will also be used to help the UUV remember the motion trends of obstacles over the past several steps and thereby improve collision avoidance prediction; a sketch of the prediction component follows below.
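As a hedged illustration of this planned extension (not a component of the present work), a constant-velocity EKF for a dynamic obstacle reduces to the linear Kalman filter sketched below, whose short-horizon position predictions could be appended to the DRL state vector; the noise levels and sampling period are assumed values.

```python
import numpy as np

class ObstacleTracker:
    """Linear (constant-velocity) Kalman filter for one dynamic obstacle in 3D.
    State: [x, y, z, vx, vy, vz]; measurements: noisy positions from sonar.
    A sketch of the planned EKF-based prediction, with assumed noise levels."""

    def __init__(self, dt=0.5, q=0.05, r=1.0):
        self.x = np.zeros(6)
        self.P = np.eye(6) * 10.0
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)          # position += velocity * dt
        self.Q = q * np.eye(6)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.R = r * np.eye(3)

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with position measurement z (3,)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:3], self.x[3:]            # filtered position, velocity

    def predict_ahead(self, horizon=5):
        """Roll the motion model forward to give the DRL agent short-horizon
        predictions of the obstacle position."""
        x, preds = self.x.copy(), []
        for _ in range(horizon):
            x = self.F @ x
            preds.append(x[:3].copy())
        return np.array(preds)
```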

Author Contributions

Conceptualization, J.Y. and H.W.; methodology, J.Y. and H.W.; software, J.Y. and B.Z.; investigation, C.L. and Y.H.; data curation, S.S. and Y.H.; writing—original draft preparation, J.Y. and B.Z.; writing—review and editing, H.W.; project administration, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is supported by the Joint Training Fund Project of Hanjiang National Laboratory (No. HJLJ20230406), Basic research funds for central universities (3072024YY0401), the National Key Laboratory of Underwater Robot Technology Fund (No. JCKYS2022SXJQR-09), and a special program to guide high-level scientific research (No. 3072022QBZ0403).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers and the handling editors for their constructive comments that greatly improved this article from its original form.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

After background noise extraction from the image data collected by the forward-looking sonar, the normalized histograms of the background noise intensity at near range and far range are fitted with four noise models: Weibull, Gamma, Rayleigh, and Gaussian [20]. The chi-squared statistic and the Kolmogorov distance are used as evaluation indicators to quantitatively measure the fitting accuracy and distribution consistency of each model.
Figure A1 shows the normalized histogram of the background noise intensity in the close-range forward-looking sonar image and the corresponding noise model fitting results. It can be observed from Figure A1b that the background noise intensity distribution of the close-range sonar image is relatively uniform, mainly concentrated in the medium intensity range. Due to the short detection distance, the target echo signal is strong, and the reverberation and system noise are also enhanced, resulting in a high overall level of background noise. Figure A1c shows the fitting results of the background noise histogram under different statistical models. The results show that the Rayleigh distribution has the worst fitting accuracy, with obvious deviations in both the low-intensity area and the high-intensity tail; although the fitting effect of the Weibull distribution is improved compared with the Rayleigh, it is still not as stable as other models overall. In contrast, both the gamma distribution and the Gaussian distribution can fit the actual noise data well, and both show high consistency in the main peak area and tail trend.
Figure A1. Forward-looking sonar close-up sample image.
Figure A2 shows the normalized histogram of the background noise intensity in the long-range forward-looking sonar image and its corresponding noise model fitting results. As shown in Figure A2b, the background noise in the long-range forward-looking sonar image is mainly concentrated in the low-intensity area. This feature is mainly attributed to the continuous attenuation of the acoustic energy during propagation, which significantly weakens the echo signal and results in a low overall noise level. Figure A2c shows the fitting results of the background noise of the long-range sonar image under different statistical models. The results show that the Rayleigh distribution has a large fitting deviation in the main region and the worst overall fit; in contrast, the fitting curves of the gamma, Gaussian, and Weibull distributions all approximate the normalized histogram of the background noise well, showing strong consistency in the peak position and tail trend. However, since the three are visually similar, it is difficult to identify the optimal model by inspection alone, so subsequent quantitative indicators are still required to evaluate their modeling accuracy and adaptability.
Figure A2. Forward-looking sonar long-range sample image.
Ten sets of long-range and short-range forward-looking sonar image data sets collected at different experimental time periods were selected, each set containing 10 frames of images, and a total of 100 long-range and short-range images were obtained. To ensure the consistency of the comparison, the background area size in the selected images was kept consistent, and the background noise was extracted under the same preprocessing conditions for modeling and evaluation.
Table A1 shows that the gamma distribution has the best performance in characterizing the statistical characteristics of the background noise of the forward-looking sonar image in a shallow water environment, and exhibits high stability and fitting accuracy under different detection distances and acquisition conditions.
Table A1. Goodness of fit evaluation for different noise distributions.
| Method | Noise Model | Far-Range (×10⁻²) | Near-Range (×10⁻²) |
|---|---|---|---|
| χ² | Weibull distribution | 1.66 | 3.15 |
| χ² | Gaussian distribution | 1.12 | 2.56 |
| χ² | Rayleigh distribution | 3.58 | 31.46 |
| χ² | Gamma distribution | 0.98 | 1.94 |
| Kolmogorov | Weibull distribution | 2.53 | 4.12 |
| Kolmogorov | Gaussian distribution | 1.75 | 2.69 |
| Kolmogorov | Rayleigh distribution | 22.45 | 25.10 |
| Kolmogorov | Gamma distribution | 1.31 | 1.58 |
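A minimal sketch of the fitting-and-evaluation procedure described in this appendix is given below, assuming the background-noise pixels have already been extracted into an intensity array; SciPy's built-in distributions and a histogram-based chi-squared-style discrepancy stand in for the authors' own fitting code.

```python
import numpy as np
from scipy import stats

def evaluate_noise_models(intensity):
    """Fit candidate distributions to background-noise intensities and score
    them with a chi-squared-style discrepancy on the normalized histogram and
    the Kolmogorov-Smirnov distance. A sketch, not the authors' exact pipeline."""
    models = {
        "weibull": stats.weibull_min,
        "gaussian": stats.norm,
        "rayleigh": stats.rayleigh,
        "gamma": stats.gamma,
    }
    hist, edges = np.histogram(intensity, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])

    results = {}
    for name, dist in models.items():
        params = dist.fit(intensity)
        pdf = dist.pdf(centers, *params)
        chi2 = float(np.sum((hist - pdf) ** 2 / (pdf + 1e-12)))
        ks = float(stats.kstest(intensity, dist.cdf, args=params).statistic)
        results[name] = (chi2, ks)
    return results

# Hypothetical near-range background noise (gamma-like by construction):
noise = np.random.gamma(shape=2.0, scale=0.1, size=20000)
for name, (chi2, ks) in evaluate_noise_models(noise).items():
    print(f"{name:9s}  chi2={chi2:.4f}  KS={ks:.4f}")
```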
In order to further verify the applicability and robustness of the constructed noise model in a shallow water environment, this paper conducts gamma noise error analysis on long-range and short-range forward-looking sonar images, respectively. Ten image samples at different detection distances were selected from the continuously acquired image sequence for fitting verification and error evaluation of the noise model parameters, so as to comprehensively examine the generalization ability and stability of the model under different imaging conditions.
From the error analysis of the modeling parameters of the background noise of the near-range and long-range forward-looking sonar images in Table A2, it can be seen that the estimation errors of the parameters of the constructed noise estimation model are all controlled within an acceptable range. Overall, the gamma noise model can accurately characterize the statistical distribution characteristics of the background noise of the forward-looking sonar image, has good versatility and robustness, and provides a reliable noise prior modeling basis for subsequent tasks.
Table A2. Gamma noise error statistics for near and far distances.
| No. | Far-Range θ Error (×10⁻²) | Far-Range k Error (×10⁻²) | Near-Range θ Error (×10⁻²) | Near-Range k Error (×10⁻²) |
|---|---|---|---|---|
| 1 | 0.18 | 0.79 | 58.93 | 65.2 |
| 2 | 3.82 | 49.85 | 8.23 | 7.01 |
| 3 | 9.63 | 176.11 | 17.45 | 28.32 |
| 4 | 2.51 | 51.87 | 17.28 | 15.07 |
| 5 | 17.02 | 11.72 | 8.93 | 11.85 |
| 6 | 5.43 | 81.48 | 68.16 | 98.74 |
| 7 | 4.59 | 116.76 | 12.94 | 15.03 |
| 8 | 16.62 | 36.94 | 12.68 | 17.49 |
| 9 | 9.95 | 40.59 | 25.18 | 31.63 |
| 10 | 7.84 | 30.42 | 4.74 | 4.51 |

References

  1. Huang, Z.; Lin, H.; Zhang, G. The USV Path Planning Based on an Improved DQN Algorithm. In Proceedings of the 2021 International Conference on Networking, Communications and Information Technology (NetCIT), Manchester, UK, 26–27 December 2021. [Google Scholar]
  2. Lyu, X.; Sun, Y.; Wang, L.; Tan, J.; Zhang, L. End-to-end AUV local motion planning method based on deep reinforcement learning. J. Mar. Sci. Eng. 2023, 11, 1796. [Google Scholar] [CrossRef]
  3. Wu, H.; Song, S.; Hsu, Y.; You, K.; Wu, C. End-to-end sensorimotor control problems of AUVs with deep reinforcement learning. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 5869–5874. [Google Scholar]
  4. Hou, X.; Du, J.; Wang, J.; Ren, Y. UUV path planning with kinematic constraints in unknown environment using reinforcement learning. In Proceedings of the 2020 4th International Conference on Digital Signal Processing, Chengdu, China, 19–21 June 2020. [Google Scholar]
  5. Sun, Y.; Luo, X.; Ran, X.; Zhang, G. A 2D Optimal Path Planning Algorithm for Autonomous Underwater Vehicle Driving in Unknown Underwater Canyons. J. Mar. Sci. Eng. 2021, 9, 252. [Google Scholar] [CrossRef]
  6. Li, X.; Yu, S. Obstacle avoidance path planning for AUVs in a three-dimensional unknown environment based on the C-APF-TD3 algorithm. Ocean Eng. 2025, 315, 119886. [Google Scholar] [CrossRef]
  7. Tang, Z.; Cao, X.; Zhou, Z.; Zhang, Z.; Xu, C.; Dou, J. Path planning of autonomous underwater vehicle in unknown environment based on improved deep reinforcement learning. Ocean Eng. 2024, 301, 117547. [Google Scholar] [CrossRef]
  8. Xu, J.; Huang, F.; Wu, D. A learning method for AUV collision avoidance through deep reinforcement learning. Ocean Eng. 2022, 260, 112038. [Google Scholar] [CrossRef]
  9. Fossen, T.I.; Pettersen, K.Y.; Galeazzi, R. Line-of-sight path following for dubins paths with adaptive sideslip compensation of drift forces. IEEE Trans. Control Syst. Technol. 2014, 23, 820–827. [Google Scholar] [CrossRef]
  10. Fossen, T.I.; Pettersen, K.Y. On uniform semiglobal exponential stability (USGES) of proportional line-of-sight guidance laws. Automatica 2014, 50, 2912–2917. [Google Scholar] [CrossRef]
  11. Hac, A.; Simpson, M.D. Estimation of vehicle side slip angle and yaw rate. SAE Trans. 2000, 109, 1032–1038. [Google Scholar]
  12. Borhaug, E.; Pavlov, A.; Pettersen, K.Y. Integral LOS control for path following of underactuated marine surface vessels in the presence of constant ocean currents. In Proceedings of the 2008 47th IEEE Conference on Decision and Control, Cancun, Mexico, 9–11 December 2008. [Google Scholar]
  13. Miao, J.; Sun, X.; Chen, Q.; Zhang, H.; Liu, W.; Wang, Y. Robust path-following control for AUV under multiple uncertainties and input saturation. Drones 2023, 7, 665. [Google Scholar] [CrossRef]
  14. Lekkas, A.M.; Fossen, T.I. Integral LOS path following for curved paths based on a monotone cubic Hermite spline parametrization. IEEE Trans. Control Syst. Technol. 2014, 22, 2287–2301. [Google Scholar] [CrossRef]
  15. Fossen, T.I.; Lekkas, A.M. Direct and indirect adaptive integral line-of-sight path-following controllers for marine craft exposed to ocean currents. Int. J. Adapt. Control Signal Process. 2017, 31, 445–463. [Google Scholar] [CrossRef]
  16. Liu, L.; Wang, D.; Peng, Z. ESO-based line-of-sight guidance law for path following of underactuated marine surface vehicles with exact sideslip compensation. IEEE J. Ocean. Eng. 2016, 42, 477–487. [Google Scholar] [CrossRef]
  17. He, L.; Zhang, Y.; Li, S.; Li, B.; Yuan, Z. Three-Dimensional Path Following Control for Underactuated AUV Based on Ocean Current Observer. Drones 2024, 8, 672. [Google Scholar] [CrossRef]
  18. Lin, C.; Wang, H.; Yuan, J.; Yu, D.; Li, C. An Improved Recurrent Neural Network for Unmanned Underwater Vehicle Online Obstacle Avoidance. Ocean Eng. 2019, 189, 106327. [Google Scholar] [CrossRef]
  19. Li, Q. Digital Sonar Design in Underwater Acoustics: Principles and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  20. Schaul, T.; Quan, J.; Antonoglou, I. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–21. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
