Abstract
Unmanned underwater vehicles (UUVs) face significant challenges in achieving safe and efficient autonomous navigation in complex marine environments due to uncertain perception, dynamic obstacles, and nonlinear coupled motion control. This study proposes a hierarchical autonomous navigation framework that integrates improved particle swarm optimization (PSO) for 3D global route planning and a deep deterministic policy gradient (DDPG) algorithm enhanced by noisy networks and proportional prioritized experience replay (PPER) for local collision avoidance. To address dynamic sideslip and current-induced deviations during execution, a novel 3D adaptive line-of-sight (ALOS) guidance method is developed, which decouples nonlinear motion in the horizontal and vertical planes and ensures robust tracking. The global planner incorporates a multi-objective cost function that considers yaw and pitch adjustments, while the improved PSO employs nonlinearly synchronized adaptive weights to enhance convergence and avoid local minima. For local avoidance, the proposed DDPG framework incorporates a memory-enhanced state–action representation, GRU-based temporal processing, and stratified sample replay to enhance learning stability and exploration. Simulation results indicate that the proposed method reduces route length by 5.96% and planning time by 82.9% compared with baseline algorithms; in dynamic scenarios, it achieves up to an 11% higher success rate and 10% better efficiency than SAC and standard DDPG. The 3D ALOS controller outperforms existing guidance strategies under time-varying currents, ensuring smoother tracking and reduced actuator effort.
1. Introduction
When a UUV performs global route planning in a high-dimensional environment, it faces issues such as long calculation times and a single optimization target, which limits its ability to efficiently find routes in large sea areas. Improving the computational efficiency of global route planning and balancing the navigation distance, planning efficiency, and the feasibility of planning under a multi-objective optimization framework are core issues addressed in this study.
Based on global route planning, the UUV needs to perform local collision avoidance in observable environments under sparse-reward conditions. Existing collision avoidance algorithms still have deficiencies in reliability, real-time performance, generalization ability, and utilization of learning samples, and they cannot make effective collision avoidance decisions when facing multiple dynamic obstacles and uncertain environments. How to build an efficient local collision avoidance strategy and improve real-time obstacle avoidance capability is a key challenge for UUV intelligent navigation technology.
Deep reinforcement learning (DRL) has powerful self-learning capabilities; it requires neither an accurate model of the obstacle environment in which the UUV operates nor a heavy reliance on sensor accuracy. Huang et al. [1] proposed a Double Deep Q Network (DQN) algorithm to optimize collision avoidance and navigation planning. Double DQN decouples action selection from evaluation: the current network provides the index of the action with the maximum Q value, which is then fed into the target network to calculate the target Q value. The results show that overestimation is reduced and that Double DQN can effectively process complex environmental information and perform optimal navigation planning. The applicability of value-function DRL to autonomous collision avoidance has thus been confirmed in existing studies. Compared with value-based, model-free methods, policy-gradient-based methods have clear advantages in handling continuous actions. Lyu et al. [2] proposed an end-to-end perception–planning–execution method based on a two-layer deep deterministic policy gradient (DDPG) algorithm to address the problems of sparse rewards, single strategies, and poor environmental adaptability in UUV local motion planning tasks, and to overcome the training difficulties of end-to-end methods that directly output control forces. Wu et al. [3] proposed an end-to-end UUV motion framework based on the PPO algorithm. The framework directly takes raw sonar perception information as input and incorporates multiple objective factors, such as collision avoidance, collisions, and speed, into the reward function. Experimental results show that the algorithm enables a UUV to achieve effective collision avoidance in an unknown underwater environment full of obstacles; however, it does not consider the nonlinear characteristics of UUV kinematics and dynamics. To address this problem, Hou et al. [4] designed a DDPG collision avoidance algorithm in which two neural networks control the UUV's thruster and rudder angle, respectively. The reward function considers the sonar-measured distance to obstacles as well as the change in distance to the target point between the current and previous positions. Simulation results show that the UUV can plan a collision-free path in an unknown continuous environment. Sun et al. [5] demonstrated a collision avoidance planning method that integrates DDPG and the Artificial Potential Field (APF), taking sensor information as input and outputting longitudinal velocity and heading angle. This method uses APF to design the reward function and a SumTree structure to store experience samples according to their importance, giving priority to extracting high-quality samples and accelerating the convergence of the SumTree-DDPG algorithm. Li et al. [6] introduced a goal-oriented hindsight experience replay method to address the inherent problem of sparse rewards in UUV local motion planning tasks. Simulation results show that this method effectively improves training stability and sample efficiency. In addition, an APF-based dynamic collision avoidance emergency protection mechanism is designed for the UUV. Combining DRL and APF has the potential to improve the performance of intelligent agents in complex environments, but both DRL and APF are reactive collision avoidance algorithms.
If they are combined, the UUV's decision-making and planning will conflict, and a coordination mechanism needs to be designed. Tang et al. [7] use sonar data as the network input and the policy network output of the TD3 framework as the UUV motion command to complete collision avoidance planning in an unknown 3D environment. In the above papers, the interference of ocean currents is not considered, and the obstacles in the environment are regular. To address these problems, Xu et al. [8] proposed a UUV reactive collision avoidance controller based on the Soft Actor–Critic (SAC) method and designed the state space, action space, and multi-objective function for reactive collision avoidance. Simulation results show that this method can realize collision avoidance planning in sparse 3D static and dynamic environments.
The above results show that a large number of scholars have studied collision avoidance planners based on DRL algorithms, but only a few papers examine the samples used in the learning process. During model training, the blind exploration of DRL algorithms generates a large number of low-quality samples, which reduces the utilization of effective sample data and thereby greatly reduces learning efficiency. In addition, there is little research on the stability and generalization of the trained model. This motivates the work in this paper.
UUV autonomous navigation planning is closely related to route tracking. Navigation planning is crucial for ensuring that UUVs provide safe routes and avoid obstacles during autonomous missions, while route tracking is the process of controlling and executing the UUV’s planned routes. The quality of navigation planning directly affects the executability of route tracking; that is, high-quality planned routes should have good command smoothness and eliminate chattering, which is conducive to UUV operation execution.
For 2D route tracking problems, line-of-sight (LOS) guidance is a commonly used method to steer a UUV along a preset path. The LOS method calculates the LOS angle between the current position of the UUV and the target waypoint and generates the corresponding heading command so that the vehicle gradually converges to the target path, thereby completing route tracking [9]. Fossen and Pettersen [10] identified the source of the tracking error and proved the strong convergence of the guidance law and its robustness to disturbances. Although the proportional LOS guidance method is effective and popular, significant tracking errors may occur when the UUV is subjected to drift forces caused by ocean currents. To overcome this difficulty, researchers have done considerable work to mitigate the effects of sideslip. In reference [11], accelerometers are used to measure longitudinal and lateral accelerations, and the corresponding velocities are calculated by integrating these measurements. Reference [12] proposed an integral LOS guidance method for straight-line route tracking that avoids the risk of integrator saturation [13]. Reference [14] derived an improved integral LOS guidance method for tracking curved paths. Reference [15] proposed direct and indirect integral LOS guidance methods for ships exposed to time-varying ocean currents. The above references mitigate the effect of the sideslip angle by adding integral terms to the LOS guidance law. However, integral LOS can only handle constant sideslip angles: when the vehicle follows a curved path or is exposed to time-varying ocean disturbances, the sideslip angle also changes over time. Moreover, the phase lag introduced by the superimposed integral term reduces stability. Liu et al. [16] used LOS based on an extended state observer [17] for route tracking of underactuated vehicles with a time-varying sideslip angle. In the marine environment, the actual motion of a UUV is affected by complex nonlinear coupling involving multiple degrees of freedom and complex motion states. The LOS method has been widely used in two-dimensional route tracking, but challenges remain regarding time-varying ocean currents, sideslip angle changes, and 3D path planning. It is therefore necessary to study route tracking methods suitable for 3D environments that solve the path deviation caused by the nonlinear coupled motion of UUVs in 3D space, so as to improve the autonomous navigation capability of UUVs in complex marine environments.
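As a simple illustration of the classical proportional LOS law described above, the following minimal Python sketch computes a heading command from the cross-track error with respect to a straight path segment; the function name, waypoint format, and lookahead distance are illustrative assumptions and not the formulation developed later in this paper.

```python
import math

def los_heading_command(x, y, wp_prev, wp_next, lookahead=50.0):
    """Proportional line-of-sight (LOS) guidance in the horizontal plane.

    Given the vehicle position (x, y) and the straight path segment from
    wp_prev to wp_next, return the desired heading (rad) that steers the
    cross-track error toward zero. `lookahead` is the along-track
    lookahead distance (an illustrative value).
    """
    # Path-tangential angle of the segment
    alpha = math.atan2(wp_next[1] - wp_prev[1], wp_next[0] - wp_prev[0])
    # Cross-track error expressed in the path-fixed frame
    e = -(x - wp_prev[0]) * math.sin(alpha) + (y - wp_prev[1]) * math.cos(alpha)
    # Proportional LOS law: aim at a point `lookahead` metres ahead on the path
    return alpha + math.atan2(-e, lookahead)

# Example: vehicle 10 m to the side of an eastward path segment
print(math.degrees(los_heading_command(20.0, 10.0, (0.0, 0.0), (100.0, 0.0))))
```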
When UUVs perform tasks in the ocean environment, they are inevitably affected by time-varying ocean currents and dynamic sideslip angles, which significantly increase the complexity of route tracking. Traditional guidance methods have poor tracking stability under time-varying flow field disturbances and environmental changes, making it difficult to achieve high-precision three-dimensional route tracking. How to improve navigation stability and autonomous control capabilities in a high-dimensional environment is an important direction that underwater autonomous navigation technology needs to break through.
The main contributions of this article can be summarized as follows:
- A multi-level state perturbation guidance mechanism and a noisy network with multi-scale parameter noise fusion are proposed to solve the training oscillation problem caused by fixed-scale noise, and a sampling-count penalty term and a stratified priority sampling strategy are used to improve learning efficiency and preserve the generalization ability of the DDPG model.
- Horizontal and vertical line-of-sight vectors are designed to decouple the control design of the UUV's nonlinear motion, and their stable convergence is proved theoretically. This solves the problem that the integral LOS guidance method cannot directly track 3D planned waypoints when the UUV's attitude changes under ocean current disturbances.
3. Autonomous Collision Avoidance Planning Method for UUV Based on Target Information Driven DDPG
3.1. Deep Deterministic Policy Gradients
We consider a UUV that interacts with the environment in discrete time steps. At each time step $t$, the UUV takes an action $a_t$ according to the observation $o_t$ and receives a scalar reward $r_t$ (here, we assume the environment is fully observed, so $s_t = o_t$). We model the interaction as a Markov decision process with a state space $S$, an action space $A$, an initial state distribution $p(s_1)$, transition dynamics $p(s_{t+1}\mid s_t,a_t)$, and a reward function $r(s_t,a_t)$. The sum of discounted future rewards is defined as $R_t=\sum_{i=t}^{T}\gamma^{\,i-t}r(s_i,a_i)$ with a discounting factor $\gamma\in[0,1]$. We denote the discounted state distribution of a policy $\pi$ as $\rho^{\pi}$ and the action-value function as $Q^{\pi}(s_t,a_t)=\mathbb{E}\left[R_t\mid s_t,a_t\right]$. The goal of the UUV is to learn a policy $\pi$ that maximizes the expected return $J=\mathbb{E}\left[R_1\right]$.
We use an off-policy approach to learn a deterministic target policy $\mu_{\theta}$ from the trajectories generated by a stochastic behavior policy $\beta$. We average over the state distribution of the behavior policy and transform the performance objective into the value function of the target policy:

$$J_{\beta}(\mu_{\theta})=\int_{S}\rho^{\beta}(s)\,Q^{\mu}\!\left(s,\mu_{\theta}(s)\right)\mathrm{d}s=\mathbb{E}_{s\sim\rho^{\beta}}\!\left[Q^{\mu}\!\left(s,\mu_{\theta}(s)\right)\right] \tag{16}$$
The validity of (16) does not depend on the particular behavior policy used to generate the trajectories; the true value function, however, must be approximated by a neural network. Replacing the true value function $Q^{\mu}(s,a)$ with a differentiable value function $Q_{w}(s,a)$ in Formula (16) yields an off-policy deterministic policy gradient update method. The Critic network estimates the true value from the trajectories generated by the behavior policy using an appropriate policy evaluation algorithm; in the off-policy DDPG algorithm below, the Critic network uses Q-learning updates to estimate the action-value function.
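For concreteness, the following PyTorch sketch shows the two updates just described: a Q-learning (TD) update for the Critic and a deterministic policy gradient update for the Actor, together with soft target-network updates. The network sizes, learning rates, and soft-update coefficient are illustrative values, not the settings used in this paper.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 15, 2, 0.99, 0.005   # illustrative values

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One DDPG step on a minibatch of transitions (s, a, r, s2, done)."""
    # Critic: Q-learning target computed with the target actor and target critic
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_tgt(torch.cat([s2, actor_tgt(s2)], 1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], 1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) update of the target networks
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for pt, p in zip(tgt.parameters(), src.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)

# Usage with a random minibatch of 32 transitions
B = 32
ddpg_update(torch.randn(B, state_dim), torch.rand(B, action_dim) * 2 - 1,
            torch.randn(B, 1), torch.randn(B, state_dim), torch.zeros(B, 1))
```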
3.2. Target Information Driven DDPG Algorithm Framework
- (1)
- State-space design with target information as input
In this paper, the Actor network input is expanded from a 14-dimensional vector to a 15-dimensional vector, which includes the processed sonar data, the distance between the current position of the UUV and the target point, and the deviation angles between the UUV heading and the target point at two consecutive moments. The Critic network input is 17-dimensional: in addition to the above inputs, it also includes the UUV's strategy at the previous moment (the longitudinal acceleration and yaw angular velocity), which is normalized and used as network input. Not only is the orientation information of the target point added, but the strategy of the previous moment is also introduced to reduce the probability of blind learning by the UUV under sparse rewards and to make the UUV's movement smoother.
- (2)
- Continuous action space design
DRL methods usually introduce exploration by injecting noise into the action space. For example, adding noise to the action space is a common way to improve the performance of Dueling DQN: in the Q-learning process, the ε-greedy strategy adds a certain proportion of random actions during action selection. Different from the previous section, in this section the continuous longitudinal acceleration and yaw angular velocity of the UUV are used as the output strategy, and a noisy network is used instead of adding noise to the strategy, so as to improve the stability and generalization of the model.
- (3)
- DDPG network framework driven by target information
The improved DDPG algorithm network framework is shown in Figure 5. The algorithm obtains training data from the sample pool through the experience replay mechanism and uses the Critic network to calculate the Q-value gradient information to guide the Actor network update, thereby continuously optimizing the policy network parameters. Compared with the value-function-based DQN algorithm, the DDPG algorithm shows higher stability in continuous action space tasks, higher policy optimization efficiency, and faster convergence, and the number of training steps required to reach the optimal solution is also significantly reduced. The network update in this paper uses the average reward of each round to eliminate the impact of differences in round length on the evaluation results, making the evaluation more stable and improving the stability of the model.
Figure 5.
Target-oriented DDPG collision avoidance algorithm framework.
- (4)
- Comprehensive reward function
This paper defines a comprehensive reward function based on the environment in which the UUV is located. The comprehensive reward function designed in this section is expressed as four dynamic reward terms: a distance reward $r_{1}$, an obstacle reward $r_{2}$, a collision penalty $r_{3}$, and a target arrival reward $r_{4}$.
For the distance reward $r_{1}$, $\lambda_{d}$ represents the distance penalty factor, which is a negative number; $d_{t}$ is the distance between the UUV and the target point at time $t$; and $d_{0}$ is the distance between the starting point and the target point.
For the obstacle reward $r_{2}$, $d_{\mathrm{obs}}$ is the distance between the UUV and the obstacle, $d_{\max}$ is the maximum detection distance of the sonar, and $\theta_{\mathrm{obs}}$ is the angle between the UUV heading and the obstacle. If the UUV hits an obstacle or a boundary, the collision penalty $r_{3}$ takes a large negative value; if the UUV reaches the target point, the arrival reward $r_{4}$ takes a large positive value.
The comprehensive reward function of this paper is the sum of the four terms:

$$r = r_{1} + r_{2} + r_{3} + r_{4}$$
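To make the structure of this composite reward concrete, the following Python sketch combines the four terms described above. The coefficients, normalizations, and exact functional forms are illustrative assumptions, not the paper's published equations.

```python
def step_reward(d_t, d_prev, d0, d_obs, d_sonar_max, collided, reached,
                k_dist=-1.0, k_obs=-5.0, r_collide=-100.0, r_goal=100.0):
    """Illustrative composite reward r = r1 + r2 + r3 + r4 (all coefficients assumed).

    r1: progress toward the goal (distance penalty normalised by the start-goal distance)
    r2: proximity penalty once an obstacle enters the sonar detection range
    r3: large negative reward on collision or boundary violation
    r4: large positive reward when the target point is reached
    """
    r1 = k_dist * d_t / d0 + (d_prev - d_t)           # shrinking distance to goal is rewarded
    r2 = k_obs * max(0.0, 1.0 - d_obs / d_sonar_max)  # closer obstacle -> larger penalty
    r3 = r_collide if collided else 0.0
    r4 = r_goal if reached else 0.0
    return r1 + r2 + r3 + r4

# Example: UUV 300 m from the goal (start-goal distance 700 m),
# nearest obstacle at 40 m with a 120 m sonar range
print(step_reward(d_t=300.0, d_prev=305.0, d0=700.0, d_obs=40.0,
                  d_sonar_max=120.0, collided=False, reached=False))
```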
3.3. Design of DDPG Method with Noisy Network
DDPG uses a deterministic strategy. For complex environments in continuous space, it is often difficult to conduct sufficiently effective global exploration by only adding Gaussian noise or OU noise for exploration. Once it falls into a local optimum, it may lack higher-dimensional randomness to jump out of the local extreme point.
In DDPG, the policy network outputs continuous action values, so the squared error can be used directly to measure the difference between the output actions of two policies:

$$d(\pi,\tilde{\pi})=\sqrt{\mathbb{E}_{s}\!\left[\left(\pi(s)-\tilde{\pi}(s)\right)^{2}\right]}$$

In the DDPG implementation, Gaussian noise $\varepsilon\sim\mathcal{N}(0,\sigma^{2}I)$ is used as the perturbation of the strategy parameters, and the perturbed strategy parameters also obey a Gaussian distribution. With an adaptive scaling factor applied to the perturbation, a single linear unit of the noisy network is constructed as:

$$y=wx+b$$

where $x\in\mathbb{R}^{p}$ is the input, $y\in\mathbb{R}^{q}$ is the output, $w\in\mathbb{R}^{q\times p}$ is the weight matrix, and $b\in\mathbb{R}^{q}$ is the bias. The idea of the noisy network is to regard the parameter $w$ as a distribution rather than a set of fixed values. The resampling (reparameterization) method is used for $w$: first, a noise sample $\varepsilon_{w}$ is drawn, and the following is obtained:

$$w=\mu_{w}+\sigma_{w}\odot\varepsilon_{w}$$

At this time, the parameter $w$ requires learning both a mean $\mu_{w}$ and a standard deviation $\sigma_{w}$, so compared with standard DDPG the number of learnable parameters doubles. Similarly, the bias becomes $b=\mu_{b}+\sigma_{b}\odot\varepsilon_{b}$, and the resulting linear unit becomes:

$$y=\left(\mu_{w}+\sigma_{w}\odot\varepsilon_{w}\right)x+\mu_{b}+\sigma_{b}\odot\varepsilon_{b}$$

Taking the standard DQN as an example, the Q-network weight $w$ is replaced with the noisy weight parameters $\mu_{w}$, $\sigma_{w}$, $\varepsilon_{w}$, and the bias $b$ with $\mu_{b}$, $\sigma_{b}$, $\varepsilon_{b}$, giving the DQN with a noisy network, $\tilde{Q}(s,a,\varepsilon;\mu,\sigma)$. The TD loss function $L(\mu,\sigma)$ is:

$$L(\mu,\sigma)=\mathbb{E}\!\left[\left(r_{t}+\gamma\max_{a'}\tilde{Q}\!\left(s_{t+1},a',\varepsilon';\mu^{-},\sigma^{-}\right)-\tilde{Q}\!\left(s_{t},a_{t},\varepsilon;\mu,\sigma\right)\right)^{2}\right]$$

The training method of the noisy network is exactly the same as that of a standard neural network: backpropagation is used to compute the gradients, which then update the parameters. The chain rule gives the gradient of the loss function with respect to the noisy parameters:

$$\frac{\partial L}{\partial\mu}=\frac{\partial L}{\partial\tilde{Q}}\cdot\frac{\partial\tilde{Q}}{\partial\mu},\qquad \frac{\partial L}{\partial\sigma}=\frac{\partial L}{\partial\tilde{Q}}\cdot\frac{\partial\tilde{Q}}{\partial\sigma}$$
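The noisy linear unit above can be implemented as a drop-in replacement for an ordinary fully connected layer. The following PyTorch sketch learns a mean and a standard deviation for every weight and bias and resamples independent Gaussian noise at each forward pass; the initialization values and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    Both the mean and the standard deviation of every weight/bias are learnable,
    so the layer has twice the parameters of an ordinary nn.Linear; the noise
    eps is resampled before each forward pass (sigma_init is illustrative).
    """
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)   # resample independent Gaussian noise
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return nn.functional.linear(x, weight, bias)

# A noisy Actor head: two ordinary layers followed by a noisy output layer
actor = nn.Sequential(nn.Linear(15, 64), nn.ReLU(), NoisyLinear(64, 2), nn.Tanh())
print(actor(torch.randn(4, 15)).shape)   # -> torch.Size([4, 2])
```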
In order to further improve the robustness, strategy generalization ability, and training stability of UUV autonomous collision avoidance in complex dynamic ocean environments, this paper proposes an improved noisy network mechanism that builds on the noisy network introduced into the original DDPG structure and combines it with a GRU module, systematically improving the responsiveness of the collision avoidance strategy to dynamic changes in the environment.
- (1)
- Design of multi-level state disturbance guidance mechanism
Different from the traditional noisy network method, which introduces fixed-distribution disturbances at the parameter level, this study constructs a state disturbance guidance network that uses the temporal dynamic change of the current observation state to adjust the noise intensity, realizing adaptive exploration driven by the perception of environmental uncertainty. The disturbance weight function $w(s_{t})$ is constructed as:

$$w(s_{t})=\sigma_{0}\left(1+\beta\,\mathrm{Var}\!\left(s_{t-k:t}\right)\right)$$

In the formula, $\sigma_{0}$ is the initial disturbance intensity; $\beta$ is the adjustment coefficient; and $\mathrm{Var}\!\left(s_{t-k:t}\right)$ is the variance of the state change within a $k$-step time window. The parameter noise amplitude in the Actor network is dynamically adjusted according to the disturbance weight, so that strategy exploration becomes more aggressive when the environment changes more drastically, giving the network state-aware exploration ability.
- (2)
- Multi-scale parameter noise fusion structure
In order to avoid the training oscillation problem caused by fixed-scale noise, a multi-scale noise fusion module is designed in which independent noise of different scales is introduced into each Actor and Critic hidden layer, forming an inter-layer noise-regularized gradient flow. The hidden state extracted by the GRU is fed into the noise fusion network to improve the memory of historical decision trajectories and cope with the strategy patterns of dynamic obstacles. Gaussian noise of different scales is embedded in each layer and flexibly fused:

$$\tilde{h}^{(l)}=h^{(l)}+\alpha^{(l)}\odot\varepsilon^{(l)},\qquad h^{(l)}=W^{(l)}\tilde{h}^{(l-1)}+b^{(l)}$$

where $\varepsilon^{(l)}$ represents the Gaussian noise introduced at layer $l$; $\alpha^{(l)}$ represents the learnable noise fusion coefficient of layer $l$; $W^{(l)}$ and $b^{(l)}$ represent the weight and bias of layer $l$; $h^{(l)}$ represents the original forward feature of layer $l$; and $\tilde{h}^{(l)}$ represents the output feature after fusion of the noise.
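A minimal sketch of this idea is given below, assuming a simple additive fusion of per-layer Gaussian noise scaled by a learnable coefficient and by a state-variance-based disturbance weight. The layer sizes, noise scales, and the exact form of the disturbance weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoiseFusionLayer(nn.Module):
    """One hidden layer with layer-specific fused parameter noise (a sketch).

    The output is h_tilde = relu(W x + b) + alpha * w(s) * eps, where alpha is a
    learnable per-layer fusion coefficient, eps is Gaussian noise drawn at this
    layer's own scale, and w(s) is a disturbance weight derived from the recent
    state variance (functional form assumed here).
    """
    def __init__(self, in_dim, out_dim, noise_scale):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.alpha = nn.Parameter(torch.tensor(0.1))   # learnable fusion coefficient
        self.noise_scale = noise_scale                 # layer-specific noise scale

    def forward(self, x, disturb_weight):
        h = torch.relu(self.linear(x))                 # original forward feature
        eps = self.noise_scale * torch.randn_like(h)   # layer-scale Gaussian noise
        return h + self.alpha * disturb_weight * eps

def disturbance_weight(state_window, sigma0=0.05, beta=1.0):
    """State-perturbation guidance: larger recent state variance -> stronger exploration.
    (sigma0 and beta are illustrative placeholders for the paper's coefficients.)"""
    return sigma0 * (1.0 + beta * state_window.var(dim=0).mean())

# Usage: a window of the last k = 8 observed 15-dimensional states
states = torch.randn(8, 15)
w = disturbance_weight(states)
layer1 = NoiseFusionLayer(15, 64, noise_scale=0.05)
layer2 = NoiseFusionLayer(64, 64, noise_scale=0.02)
out = layer2(layer1(states[-1:], w), w)
print(out.shape, float(w))
```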
3.4. Design of Experience Pool Sample Sorting Method Based on Proportional Priority
In order to improve the utilization efficiency of training samples and the speed of strategy convergence, a prioritized experience replay mechanism based on the temporal difference error (TD error) is introduced. This mechanism quantifies the contribution of each experience to learning, uses the TD error as the metric of experience priority, and preferentially samples high-TD-error transitions for training, thereby improving the value and learning efficiency of the samples.
However, although traditional prioritized experience replay improves sample utilization, it faces the following problems: ① scanning the entire experience pool at high frequency to update priorities increases computation time; ② it is sensitive to random reward signals and function approximation errors, which easily introduces training instability; ③ excessive concentration on a few high-TD-error samples may cause the strategy to fall into local optimality or even overfit, reducing the generalization ability of the model [20].
- (1)
- Introducing the Sampling Times Penalty
In order to solve the above problems, this paper introduces a sampling-count penalty term, which lies between pure greedy prioritization and uniform random sampling. The sampling probability of the $i$-th transition is defined as:

$$P(i)=\frac{\left(p_{i}+\epsilon\right)^{\alpha}\left(1+n_{i}\right)^{-\lambda}}{\sum_{k}\left(p_{k}+\epsilon\right)^{\alpha}\left(1+n_{k}\right)^{-\lambda}}$$

Among them, $p_{i}$ represents the priority of experience $i$, expressed by its TD error; the exponent $\alpha$ determines how strongly the priority is used, with $\alpha=0$ corresponding to uniform sampling; $\epsilon$ is a very small positive number used to ensure that every experience retains a non-zero priority; $n_{i}$ represents the number of times experience $i$ has been sampled; and $\lambda$ controls the penalty intensity of the sampling frequency, typically taking values in the range 0.4–0.6. This mechanism measures the importance of each experience by its TD error and uses it as the basis of the sampling probability: the larger the TD error, the more valuable the experience is to the current strategy optimization, and therefore the higher its probability of being sampled.
- (2)
- Stratified Prioritized Sampling
In the process of UUV collision avoidance planning model training, due to the large number of samples, the model is prone to overfitting risks. In addition, there are common problems in the UUV environment, such as sparse reward signals and drastic changes in the UUV bow angle. The traditional experience replay mechanism is difficult to effectively improve learning efficiency and maintain the generalization ability of the model.
To alleviate the above problems, the samples in the experience pool are divided into several priority intervals (five levels according to TD-error percentiles), and random sampling is performed within each interval according to preset sampling weights. This strategy ensures that high-error samples are sampled first while avoiding excessive concentration of samples in intervals with large TD errors, thereby reducing the risk of the model falling into local optimality or overfitting and enhancing sample diversity and overall training stability.
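The count-penalized sampling probability and the stratified sampling described above can be sketched in Python as follows; the level weights, exponents, and function names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def sampling_probs(td_errors, sample_counts, alpha=0.6, lam=0.5, eps=1e-6):
    """Proportional priority with a sampling-count penalty (a sketch of the formula above).

    Priority grows with |TD error| (exponent alpha; alpha = 0 gives uniform sampling)
    and shrinks with the number of times a transition has already been drawn
    (exponent lam, typically 0.4-0.6).
    """
    p = (np.abs(td_errors) + eps) ** alpha / (1.0 + sample_counts) ** lam
    return p / p.sum()

def stratified_sample(td_errors, sample_counts, batch_size=32,
                      level_weights=(0.35, 0.25, 0.2, 0.12, 0.08)):
    """Split the pool into five priority levels by TD-error percentile and draw a
    fixed share of the minibatch from each level (the weights are illustrative)."""
    order = np.argsort(-np.abs(td_errors))            # highest error first
    levels = np.array_split(order, len(level_weights))
    batch = []
    for idx, w in zip(levels, level_weights):
        n = max(1, int(round(w * batch_size)))
        probs = sampling_probs(td_errors[idx], sample_counts[idx])
        batch.extend(np.random.choice(idx, size=min(n, len(idx)), replace=False, p=probs))
    return np.array(batch[:batch_size])

# Usage on a toy pool of 1000 stored transitions
rng = np.random.default_rng(0)
td = rng.normal(size=1000)
counts = rng.integers(0, 20, size=1000)
print(stratified_sample(td, counts)[:10])
```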
This paper therefore proposes a DDPG collision avoidance planning algorithm that improves prioritized experience replay with a sampling-count penalty and a stratified sampling strategy, namely, proportional prioritized experience replay (PPER).
This form is more robust to outliers; the pseudo-code of Algorithm 1 is as follows:

Algorithm 1. DDPG Algorithm with TD-Error Proportional Prioritized Experience Replay

Input: minibatch size $k$, step length $\eta$, replay period $K$ and experience pool size $N$, priority exponents $\alpha$ and $\lambda$, importance-sampling exponent $\beta$, total time $T$.
1: Initialize the experience pool $H=\varnothing$, $\Delta=0$, $p_{1}=1$
2: Select $a_{0}$ according to the initial state $s_{0}$
3: for $t=1$ to $T$ do
4:  Obtain the observation $(s_{t},r_{t})$ from the environment and store the transition $(s_{t-1},a_{t-1},r_{t},s_{t})$ in $H$ with the largest current priority $p_{t}=\max_{i<t}p_{i}$
5:  if $t \bmod K = 0$ then
6:   for $j=1$ to $k$ do
7:    Sample a transition $j\sim P(j)$ from the experience pool $H$
8:    Calculate the importance-sampling weight $w_{j}$
9:    Calculate the TD error $\delta_{j}$
10:   Update the transition priority $p_{j}\leftarrow\left|\delta_{j}\right|$
11:   Accumulate the weight change $\Delta\leftarrow\Delta+w_{j}\,\delta_{j}\,\nabla_{\theta}Q(s_{j-1},a_{j-1})$
12:  end for
13:  Update the network weights $\theta\leftarrow\theta+\eta\,\Delta$ and reset $\Delta=0$
14:  Copy the weights to the target Q network after a fixed period
15: end if
16: Select action $a_{t}$
17: end for
3.5. Simulation Verification and Analysis
This paper compares the path planning effects of DQN, Dueling DQN, DDPG, SAC, and the proposed algorithm in static obstacle environments. The evaluation indicators include average path length and planning stability. The DDPG algorithm training parameters are shown in Table 2.
Table 2.
Hyperparameter settings for DDPG model training.
Figure 6 shows the environment with obstacles constructed by random generation. Each collision avoidance planning algorithm was trained for 300 rounds. As shown in the figure, the UUV's start point is located at [50 m, 45 m] and the target point at [520 m, 585 m]; the shortest distance between the two points is 715.9 m. The DDPG algorithm without target-information driving uses a fully connected neural network. A comparison of Figure 6a,d shows that, without target information driving the strategy, effective convergence is difficult. Once the target location and the corresponding reward mechanism are clearly defined in DDPG, combined with a recurrent neural network to memorize historical states, navigation efficiency improves significantly.

Figure 6.
Comparison of the impact of having and not having goal-oriented information on the performance of planning algorithms.
Table 3 shows that DDPG methods using either LSTM or GRU outperform the DDPG collision avoidance planning algorithm without target information in terms of training time, number of early rounds to reach the target, and number of path steps. DDPG-GRU performs best, achieving the best results in all four metrics: shortest training time, earliest target arrival, shortest path, and best success rate. This demonstrates that GRU is more effective in processing time series data for this task. DDPG without target information significantly lags behind in both convergence speed and path efficiency, demonstrating that incorporating target information and leveraging historical state memory are crucial for improving policy learning in complex environments.
Table 3.
Training performance of four different DRL methods.
Figure 7 shows that after each algorithm successfully reaches the target for the first time, the reward value rises rapidly, indicating a significant improvement in policy performance. The cumulative returns of all algorithms gradually increase, reflecting the continuous exploration and optimization of the policy. The DDPG-target-free algorithm only incorporates obstacle distance information and lacks target information. In contrast, the algorithm with a memory-enhanced contextual value network and target information achieves higher cumulative returns, demonstrating that historical state and target information are crucial for the UUV collision avoidance model. Although Dueling DQN-LSTM reduces Q-value overestimation, its final performance is less stable than the DDPG-LSTM and DDPG-GRU algorithms, and its success rate is lower.
Figure 7.
Reward function change curves of different algorithms.
The effect of the noise network on the algorithm is verified in the environment of Figure 8. The UUV start point (yellow circle) is located at [380 m, 20 m], and the target point (red circle) is located at [135 m, 575 m]. The straight-line distance between the two points is 606.7 m.
Figure 8.
Trajectory comparison of adding noise network.
Table 4 shows that all collision avoidance algorithms can avoid dense static obstacles. The DDPG algorithm has the longest path, while the DDPG + Noise Network algorithm has the shortest path. The DDPG + Noise Network algorithm has the smallest cumulative heading adjustment angle, resulting in the smoothest path. Compared to the Dueling DQN, DQN + Noise Network, and DDPG algorithms, the cumulative adjustment angle is reduced by 63.0%, 63.2%, and 40.3%, respectively.
Table 4.
Comparison of planning algorithm path length and cumulative turning angle.
Adding noise to the network weights to achieve parameter noise allows the UUV to produce richer behavior. Adding regularized noise to each layer can increase training stability, but it may change the distribution of activation values between adjacent layers. Although methods such as evolution strategies also use parameter perturbations, they lose temporal structure information during training and require more learning samples. In this section, noise of the same form is added to each layer, but the noise in different layers is independent and does not affect the other layers.
Figure 9 is a test of the performance of different DRL algorithms in a static environment. The positions of the starting point and the target point remain unchanged. The size and distribution range of each obstacle are randomly generated. Each algorithm is run 100 times (only one round is shown in the figure).
Figure 9.
Run 100 times to test the success rate of different algorithms.
As can be seen from the data in Table 5, the traditional DQN based on a discrete action space and its improved Dueling DQN algorithm performed poorly in terms of average path length and standard deviation, reaching 1494.0 m, 45.5 m and 1451.0 m, 42.0 m, respectively, indicating that they have problems of large path volatility and insufficient strategy accuracy in complex dynamic environments. In comparison, the DDPG series of algorithms using continuous action space modeling has better overall performance. Among them, although the original DDPG algorithm is slightly inferior to some of the comparison algorithms (1452.7 m) in average path length, its standard deviation is only 24.1 m, showing strong path stability.
Table 5.
Comparison of average path length and standard deviation of different algorithms.
In Figure 10, the DQN algorithm has the lowest success rate, and the SAC algorithm has a higher success rate than the DDPG variants with only a single improvement. The success rate of the DDPG algorithm with a noisy network is 14%, 11%, and 2% higher than that of DQN, Dueling DQN, and the DDPG algorithm without a noisy network, respectively, indicating the effectiveness of the noisy network. The success rate of the DDPG algorithm with both a noisy network and proportional prioritized experience replay is 25%, 21%, and 5% higher than that of the DQN, Dueling DQN, and SAC algorithms, respectively, indicating that introducing the noisy network and proportional prioritized experience replay greatly improves the performance of the DDPG algorithm.
Figure 10.
Success rates of different algorithms.
The complex simulation environment in Figure 11 contains seven irregular static obstacles representing unknown islands and six dynamic obstacles. The UUV starting point (yellow circle) is located at [70 m, 990 m], the target point (red circle) is located at [800 m, 135 m], and the shortest distance between the two points is 1124.4 m. The speed of each dynamic obstacle is randomly generated between 1 m/s and 2 m/s, and its direction of movement is also randomly generated.
Figure 11.
Trajectory diagram of the improved DDPG collision avoidance planning algorithm at different times in complex environments.
Figure 11 shows that the UUV adopted a safe collision avoidance strategy based on the DDPG + NoiseNetwork + PPER algorithm at three different times, avoiding the risk of collision between the UUV and different obstacles. At t = 460 s, the UUV is about to meet the dynamic obstacle 5 moving to the lower left. In order to avoid the collision, the UUV deflects to the left and moves in the same direction as the obstacle, and then the two move to the upper left together. When t = 600 s, after the obstacle avoidance action is completed, the UUV quickly adjusts its heading and moves towards the target point again. This shows that the proposed improved DDPG algorithm can perceive the movement of obstacles in real time in a complex dynamic environment and make smooth and coherent collision avoidance decisions.
In summary, the DDPG + NoisyNetwork + PPER model proposed in this paper offers the best overall path planning performance among the compared methods, with the shortest path length and the smallest fluctuation. It significantly improves stability while ensuring UUV path efficiency.
5. Conclusions
This paper proposes a hierarchical autonomous navigation framework for unmanned underwater vehicles (UUVs), combining global route planning based on an improved particle swarm optimization (PSO) algorithm, local collision avoidance using a deep deterministic policy gradient (DDPG) method enhanced by noisy networks and proportional prioritized experience replay (PPER), and robust three-dimensional path tracking using an adaptive line-of-sight (ALOS) guidance strategy.
In summary, the proposed method demonstrates strong capability for efficient and safe UUV navigation in scenarios involving sparse rewards, limited perception, and nonlinear motion coupling. The framework effectively bridges global planning, local avoidance, and route tracking into a coherent autonomous navigation strategy.
Future research work on this topic includes:
When facing multiple dynamic obstacles, the UUV's collision avoidance planning ability is not as good as that of the DWA algorithm with curvature constraints. In future work, we plan to add the ability to perceive dynamic obstacles to the model: an extended Kalman filter (EKF) will be used for online filtering and trajectory prediction of dynamic obstacles, the EKF output will be integrated into the DRL state input, and a reward term for dynamic obstacles will be introduced. An LSTM/GRU network will help the UUV remember the movement trend of obstacles over the past few steps to improve its collision avoidance prediction ability.
Author Contributions
Conceptualization, J.Y. and H.W.; methodology, J.Y. and H.W.; software, J.Y. and B.Z.; investigation, C.L. and Y.H.; data curation, S.S. and Y.H.; writing—original draft preparation, J.Y. and B.Z.; writing—review and editing, H.W.; project administration, H.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research work is supported by the Joint Training Fund Project of Hanjiang National Laboratory (No. HJLJ20230406), Basic research funds for central universities (3072024YY0401), the National Key Laboratory of Underwater Robot Technology Fund (No. JCKYS2022SXJQR-09), and a special program to guide high-level scientific research (No. 3072022QBZ0403).
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Acknowledgments
The authors would like to thank the anonymous reviewers and the handling editors for their constructive comments that greatly improved this article from its original form.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
After background noise extraction from the image data collected by the forward-looking sonar, the normalized histograms of the background noise intensity at near-range and far-range are fitted with four noise models: Weibull, Gamma, Rayleigh, and Gaussian [20]. The minimum mean squared error (MSE) and Kullback–Leibler divergence are used as evaluation indicators to quantitatively measure the fitting accuracy and distribution consistency of each model.
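As a sketch of this fitting-and-scoring procedure, the following Python snippet (using SciPy, with synthetic data standing in for the real sonar background noise) fits the four candidate distributions by maximum likelihood and scores each against the normalized histogram with MSE and KL divergence; the bin count and sample data are illustrative.

```python
import numpy as np
from scipy import stats

def fit_and_score(samples, bins=64):
    """Fit four candidate noise models to background-noise intensities and score
    them with MSE and KL divergence against the normalised histogram."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    models = {
        "Weibull": stats.weibull_min,
        "Gamma": stats.gamma,
        "Rayleigh": stats.rayleigh,
        "Gaussian": stats.norm,
    }
    scores = {}
    for name, dist in models.items():
        params = dist.fit(samples)                      # maximum-likelihood fit
        pdf = dist.pdf(centers, *params)
        mse = np.mean((pdf - hist) ** 2)
        kl = stats.entropy(hist + 1e-12, pdf + 1e-12)   # KL(histogram || model)
        scores[name] = (mse, kl)
    return scores

# Usage on synthetic gamma-distributed "background noise" intensities
noise = stats.gamma.rvs(a=2.0, scale=30.0, size=20000, random_state=0)
for name, (mse, kl) in fit_and_score(noise).items():
    print(f"{name:9s}  MSE={mse:.2e}  KL={kl:.3f}")
```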
Figure A1 shows the normalized histogram of the background noise intensity in the close-range forward-looking sonar image and the corresponding noise model fitting results. It can be observed from Figure A1b that the background noise intensity distribution of the close-range sonar image is relatively uniform, mainly concentrated in the medium intensity range. Due to the short detection distance, the target echo signal is strong, and the reverberation and system noise are also enhanced, resulting in a high overall level of background noise. Figure A1c shows the fitting results of the background noise histogram under different statistical models. The results show that the Rayleigh distribution has the worst fitting accuracy, with obvious deviations in both the low-intensity area and the high-intensity tail; although the fitting effect of the Weibull distribution is improved compared with the Rayleigh, it is still not as stable as other models overall. In contrast, both the gamma distribution and the Gaussian distribution can fit the actual noise data well, and both show high consistency in the main peak area and tail trend.
Figure A1.
Forward-looking sonar close-up sample image.
Figure A2 shows the normalized histogram of the background noise intensity in the long-range forward-looking sonar image and its corresponding noise model fitting results. As shown in Figure A2b, the background noise in the long-range forward-looking sonar image is mainly concentrated in the low-intensity area. This feature is mainly attributed to the continuous attenuation of the sound wave energy during propagation, which significantly weakens the echo signal, resulting in a low overall noise level. Figure A2c shows the fitting results of the background noise of the long-range sonar image under different statistical models. The results show that the fitting deviation of the Rayleigh distribution in the main area is large, and its overall fitting effect is the worst; in contrast, the fitting curves of the gamma distribution, Gaussian distribution, and Weibull distribution all approximate the normalized histogram of the background noise well, showing strong consistency in the peak position and tail trend. However, since the three are visually similar, it is difficult to judge the optimal model by intuitive observation alone, so subsequent quantitative fitting indicators are still needed to further evaluate their modeling accuracy and adaptability.
Figure A2.
Forward-looking sonar long-range sample image.
Ten sets of long-range and short-range forward-looking sonar images collected at different experimental time periods were selected, each set containing 10 frames, giving a total of 100 long-range and short-range images. To ensure the consistency of the comparison, the background area size in the selected images was kept consistent, and the background noise was extracted under the same preprocessing conditions for modeling and evaluation.
Table A1 shows that the gamma distribution has the best performance in characterizing the statistical characteristics of the background noise of the forward-looking sonar image in a shallow water environment, and exhibits high stability and fitting accuracy under different detection distances and acquisition conditions.
Table A1.
Goodness of fit evaluation for different noise distributions.
| Methods | Noise Model | Far-Range | Near-Range |
|---|---|---|---|
| MSE | Weibull distribution | 1.66 | 3.15 |
| | Gaussian distribution | 1.12 | 2.56 |
| | Rayleigh distribution | 3.58 | 31.46 |
| | Gamma distribution | 0.98 | 1.94 |
| Kolmogorov | Weibull distribution | 2.53 | 4.12 |
| | Gaussian distribution | 1.75 | 2.69 |
| | Rayleigh distribution | 22.45 | 25.10 |
| | Gamma distribution | 1.31 | 1.58 |
In order to further verify the applicability and robustness of the constructed noise model in a shallow water environment, this paper conducts gamma noise error analysis on long-range and short-range forward-looking sonar images, respectively. Ten image samples at different detection distances were selected from the continuously acquired image sequence for fitting verification and error evaluation of the noise model parameters, so as to comprehensively examine the generalization ability and stability of the model under different imaging conditions.
From the error analysis of the modeling parameters of the background noise of the near-range and long-range forward-looking sonar images in Table A2, it can be seen that the estimation errors of the parameters of the constructed noise estimation model are all controlled within an acceptable range. Overall, the gamma noise model can accurately characterize the statistical distribution characteristics of the background noise of the forward-looking sonar image, has good versatility and robustness, and provides a reliable noise prior modeling basis for subsequent tasks.
Table A2.
Gamma noise error statistics for near and far distances.
| No. | Far-Range Error (Parameter 1) | Far-Range Error (Parameter 2) | Near-Range Error (Parameter 1) | Near-Range Error (Parameter 2) |
|---|---|---|---|---|
| 1 | 0.18 | 0.79 | 58.93 | 65.2 |
| 2 | 3.82 | 49.85 | 8.23 | 7.01 |
| 3 | 9.63 | 176.11 | 17.45 | 28.32 |
| 4 | 2.51 | 51.87 | 17.28 | 15.07 |
| 5 | 17.02 | 11.72 | 8.93 | 11.85 |
| 6 | 5.43 | 81.48 | 68.16 | 98.74 |
| 7 | 4.59 | 116.76 | 12.94 | 15.03 |
| 8 | 16.62 | 36.94 | 12.68 | 17.49 |
| 9 | 9.95 | 40.59 | 25.18 | 31.63 |
| 10 | 7.84 | 30.42 | 4.74 | 4.51 |
References
- Huang, Z.; Lin, H.; Zhang, G. The USV Path Planning Based on an Improved DQN Algorithm. In Proceedings of the 2021 International Conference on Networking, Communications and Information Technology (NetCIT), Manchester, UK, 26–27 December 2021. [Google Scholar]
- Lyu, X.; Sun, Y.; Wang, L.; Tan, J.; Zhang, L. End-to-end AUV local motion planning method based on deep reinforcement learning. J. Mar. Sci. Eng. 2023, 11, 1796. [Google Scholar] [CrossRef]
- Wu, H.; Song, S.; Hsu, Y.; You, K.; Wu, C. End-to-end sensorimotor control problems of AUVs with deep reinforcement learning. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 5869–5874. [Google Scholar]
- Hou, X.; Du, J.; Wang, J.; Ren, Y. UUV path planning with kinematic constraints in unknown environment using reinforcement learning. In Proceedings of the 2020 4th International Conference on Digital Signal Processing, Chengdu, China, 19–21 June 2020. [Google Scholar]
- Sun, Y.; Luo, X.; Ran, X.; Zhang, G. A 2D Optimal Path Planning Algorithm for Autonomous Underwater Vehicle Driving in Unknown Underwater Canyons. J. Mar. Sci. Eng. 2021, 9, 252. [Google Scholar] [CrossRef]
- Li, X.; Yu, S. Obstacle avoidance path planning for AUVs in a three-dimensional unknown environment based on the C-APF-TD3 algorithm. Ocean Eng. 2025, 315, 119886. [Google Scholar] [CrossRef]
- Tang, Z.; Cao, X.; Zhou, Z.; Zhang, Z.; Xu, C.; Dou, J. Path planning of autonomous underwater vehicle in unknown environment based on improved deep reinforcement learning. Ocean Eng. 2024, 301, 117547. [Google Scholar] [CrossRef]
- Xu, J.; Huang, F.; Wu, D. A learning method for AUV collision avoidance through deep reinforcement learning. Ocean Eng. 2022, 260, 112038. [Google Scholar] [CrossRef]
- Fossen, T.I.; Pettersen, K.Y.; Galeazzi, R. Line-of-sight path following for dubins paths with adaptive sideslip compensation of drift forces. IEEE Trans. Control Syst. Technol. 2014, 23, 820–827. [Google Scholar] [CrossRef]
- Fossen, T.I.; Pettersen, K.Y. On uniform semiglobal exponential stability (USGES) of proportional line-of-sight guidance laws. Automatica 2014, 50, 2912–2917. [Google Scholar] [CrossRef]
- Hac, A.; Simpson, M.D. Estimation of vehicle side slip angle and yaw rate. SAE Trans. 2000, 109, 1032–1038. [Google Scholar]
- Borhaug, E.; Pavlov, A.; Pettersen, K.Y. Integral LOS control for path following of underactuated marine surface vessels in the presence of constant ocean currents. In Proceedings of the 2008 47th IEEE Conference on Decision and Control, Cancun, Mexico, 9–11 December 2008. [Google Scholar]
- Miao, J.; Sun, X.; Chen, Q.; Zhang, H.; Liu, W.; Wang, Y. Robust path-following control for AUV under multiple uncertainties and input saturation. Drones 2023, 7, 665. [Google Scholar] [CrossRef]
- Lekkas, A.M.; Fossen, T.I. Integral LOS path following for curved paths based on a monotone cubic Hermite spline parametrization. IEEE Trans. Control Syst. Technol. 2014, 22, 2287–2301. [Google Scholar] [CrossRef]
- Fossen, T.I.; Lekkas, A.M. Direct and indirect adaptive integral line-of-sight path-following controllers for marine craft exposed to ocean currents. Int. J. Adapt. Control Signal Process. 2017, 31, 445–463. [Google Scholar] [CrossRef]
- Liu, L.; Wang, D.; Peng, Z. ESO-based line-of-sight guidance law for path following of underactuated marine surface vehicles with exact sideslip compensation. IEEE J. Ocean. Eng. 2016, 42, 477–487. [Google Scholar] [CrossRef]
- He, L.; Zhang, Y.; Li, S.; Li, B.; Yuan, Z. Three-Dimensional Path Following Control for Underactuated AUV Based on Ocean Current Observer. Drones 2024, 8, 672. [Google Scholar] [CrossRef]
- Lin, C.; Wang, H.; Yuan, J.; Yu, D.; Li, C. An Improved Recurrent Neural Network for Unmanned Underwater Vehicle Online Obstacle Avoidance. Ocean Eng. 2019, 189, 106327. [Google Scholar] [CrossRef]
- Li, Q. Digital Sonar Design in Underwater Acoustics: Principles and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–21. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).