1. Introduction
Over the past few decades, driven by continuous progress in autonomous navigation and environmental perception technologies, Autonomous Underwater Vehicles (AUVs) have become indispensable tools for scientific research, underwater exploration, infrastructure inspection, and military operations [1,2,3]. Improvements in autonomy, operational range, and multifunctionality have enabled AUVs to perform complex missions such as ocean mapping, ecosystem monitoring, and maritime surveillance, enhancing their efficiency and capabilities in both civilian and defense applications [4]. However, the widespread deployment of AUVs also raises significant safety concerns. In complex marine environments, currents can vary rapidly and visibility is often limited by suspended particles; moreover, underwater obstacles such as reefs, discarded equipment, and large marine organisms may be widely distributed and move unpredictably. Reliable autonomous collision avoidance is therefore essential for operational safety [5]. A collision-avoidance failure may lead to collisions with other vessels, underwater obstacles, or infrastructure, resulting in vehicle damage, mission disruption, and secondary hazards such as shipping interference, marine pollution, or communication failures. In military contexts, such incidents could further leak classified information and compromise mission deployment.
Early research on AUV obstacle avoidance relied primarily on traditional physics-based models to ensure navigation safety and avoidance reliability. Among these methods, PID control offered fundamental attitude and trajectory regulation due to its simple structure and fast response, effectively supporting obstacle avoidance maneuvers in low-complexity environments [6]. Backstepping control addressed the nonlinear dynamics of AUV navigation through step-by-step controller design, significantly enhancing avoidance stability in nonlinear systems [7]. The artificial potential field method introduced a virtual "attraction–repulsion" potential model, transforming obstacle avoidance and target pursuit into intuitive force-balance decisions; this simplified path planning and provided an effective real-time avoidance solution for early AUV systems [8]. While these studies significantly advanced the field, they rely on precise model parameters: because AUVs are underactuated, exhibit strongly coupled nonlinear dynamics, and have time-varying hydrodynamic coefficients, controllers with fixed parameter designs deliver limited performance. These approaches also lack flexibility in handling multi-objective optimization problems.
Deep reinforcement learning (DRL) has emerged as a powerful alternative, as it learns policies directly through interaction with uncertain and dynamic environments [9,10]. For example, Zheng et al. proposed a PPO-based path-planning method, UP4O, for complex ocean-current conditions; by integrating obstacle features with state information such as relative position, currents, and velocity, it achieves time-efficient planning that balances global guidance and local obstacle avoidance [11]. Bingul et al. introduced a memory-based DRL approach for obstacle avoidance in unknown environments, leveraging recurrent networks with temporal attention to exploit historical observations and mitigate partial observability, thereby increasing collision-free flight distance and reducing energy loss caused by oscillatory motions [12]. Wu et al. developed an improved TD3-based path-following control method, incorporating importance-aware experience replay, smooth regularization, and an adaptive reward design to accelerate convergence and suppress action oscillations while maintaining high tracking accuracy under disturbances [13]. Overall, existing studies show that modern DRL, especially policy-gradient-based approaches, provides an effective tool for learning robust and scalable collision-avoidance policies in high-dimensional state and action spaces.
Clearly, introducing reinforcement learning into AUV obstacle avoidance aligns well with future trends toward intelligent operation. However, a review of both traditional obstacle avoidance methods and reinforcement learning approaches reveals that researchers have paid little attention to dynamic obstacle avoidance for underwater AUVs. This study draws inspiration from the theory of dynamic obstacle avoidance for surface vessels [14], aiming to find an intelligent method suited to the dynamic-obstacle challenges faced by underwater AUVs. Nevertheless, applying DRL to AUV obstacle avoidance still faces a core challenge: training agents stably and efficiently in dynamically uncertain marine environments [15]. When dealing with sparse rewards and sequential decision-making, designing efficient exploration methods that achieve faster and more stable exploration is often a high-priority consideration [16]. Although DRL can handle high-dimensional state and action spaces, its application in obstacle avoidance tasks has significant limitations [17]. Underwater obstacle avoidance scenarios commonly involve sparse rewards, long-term delayed rewards, and hierarchical task structures, under which agents struggle to learn optimal avoidance strategies. Hierarchical reinforcement learning (HRL) offers a promising solution to these limitations [18]: its abstraction mechanism allows complex obstacle avoidance tasks to be decomposed into hierarchical levels. However, new issues arise when applying HRL to AUV obstacle avoidance. Designing task abstraction hierarchies that precisely match the real-time demands of maritime scenarios, such as dynamic ocean currents and sudden obstacles, remains difficult. Furthermore, the generalization capability of hierarchical strategies is highly sensitive to environmental variations, making it challenging to ensure obstacle avoidance stability and safety across different sea areas and operating conditions.
To address the aforementioned limitations, this study proposes a hierarchical reinforcement learning AUV obstacle avoidance method tailored for dynamic obstacle scenarios. The main technical contributions are summarized as follows:
We systematically embed key stochastic factors—such as obstacle behavior patterns and external disturbances—into the AUV obstacle avoidance training environment. Concurrently, we innovatively introduce a Collision Threat Index (CTI) tailored for underwater scenarios, which effectively quantifies the collision risk between the vehicle and dynamic obstacles.
Unlike traditional end-to-end DRL, HDAO fundamentally decouples the navigation task across different time scales based on the principles of HRL. Our framework learns to dynamically select among pure navigation, pure obstacle avoidance, or a combination of both. This approach enables the synergistic optimization of global navigation intents and local avoidance actions, significantly alleviating the high-dimensional training pressure in complex environments.
Traditional hierarchical exploration methods often suffer from inherent defects, including the underutilization of environmental information and slow, inefficient global exploration. To address these limitations, we integrate a curiosity-driven reward mechanism to improve policy exploration. This enhancement boosts the exploration performance of HRL, achieving rapid and highly efficient coverage of the global state space in complex environments.
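The paper defines its Collision Threat Index formally in a later section; purely as an illustrative sketch of the idea of quantifying collision risk between the vehicle and a dynamic obstacle, the fragment below blends a normalized distance term with a closing-speed term. All names, weights, and thresholds here are assumptions for illustration, not the paper's CTI definition.

```python
import math

def illustrative_cti(rel_pos, rel_vel, safe_dist=10.0, w_d=0.6, w_v=0.4):
    """Illustrative collision-threat score in [0, 1] (NOT the paper's CTI).

    rel_pos, rel_vel: 3-tuples, obstacle position/velocity relative to the AUV.
    A small separation and a fast closing speed both raise the threat.
    """
    dist = math.sqrt(sum(p * p for p in rel_pos))
    # Closing speed: positive when the obstacle is approaching the AUV.
    closing = -sum(p * v for p, v in zip(rel_pos, rel_vel)) / max(dist, 1e-9)
    d_term = min(1.0, safe_dist / max(dist, 1e-9))  # near obstacle -> 1
    v_term = 1.0 / (1.0 + math.exp(-closing))       # approaching -> 1
    return w_d * d_term + w_v * v_term
```

Because the score varies continuously with both distance and relative motion, it provides denser risk feedback than a binary collision penalty, which is the property the contribution above emphasizes.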
4. Hierarchical Obstacle Avoidance Network Architecture
4.1. Hierarchical Policy Architecture
In this section, we introduce the framework used to learn the hierarchical policy, as illustrated in Figure 3. Our solution employs a two-layer policy structure (a high-level policy $\pi^{h}$ and a low-level policy $\pi^{l}$). Both policy layers receive the same observation vector $s_t$, which encodes the AUV's own state, the goal, ocean currents, and obstacle-related information.
To handle the hybrid decision-making process between high-level discrete mode selection and low-level continuous control, the system adopts a modular policy activation and routing mechanism in its engineering implementation, as discussed in reference [22]. The discrete mode selector $m_t$ output by the high-level policy serves as a high-level routing switch: it directly determines which branch of the low-level execution logic is activated.
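The routing switch described above can be sketched as a simple dispatch over the discrete mode. The controller functions below are placeholders standing in for the low-level behaviors, not the paper's implementation; the equal-weight blend for the combined mode is an assumption for illustration.

```python
# Minimal sketch of the high-level routing switch; nav_action and avoid_action
# are hypothetical stand-ins for the low-level controllers.
NAV, AVOID, BOTH = 0, 1, 2

def nav_action(obs):    # placeholder goal-directed controller
    return [1.0, 0.0]

def avoid_action(obs):  # placeholder obstacle-avoidance controller
    return [0.0, 1.0]

def route_low_level(mode, obs):
    """Dispatch the low-level execution logic according to the mode selector."""
    if mode == NAV:
        return nav_action(obs)
    if mode == AVOID:
        return avoid_action(obs)
    # BOTH: blend goal-directed and avoidance commands (illustrative 50/50 mix).
    return [0.5 * n + 0.5 * a for n, a in zip(nav_action(obs), avoid_action(obs))]
```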
The high-level policy operates on a longer timescale than the conventional Markov Decision Process (MDP). It takes actions at time steps $t = 0, T, 2T, \ldots$, i.e., once every $T$ steps of the low-level policy, or sooner if the low-level policy has converged to a subgoal. The high-level policy receives an observation $s_t$ and generates a high-level action conditioned on the final goal. If the low-level policy reaches the subgoal early, the high-level policy does not wait for the full $T$ steps: when the subgoal is completed within the $T$-step window, the system queries the high-level policy and generates a new high-level action in advance. The selection of the macro step size $T$ must balance computational efficiency against responsiveness, and should match the typical speed of dynamic obstacles. In this study, setting $T = 10$ proves sufficient.
A key component of the high-level action is the subgoal $g_t$, which encodes the desired relative change in the current state $s_t$. This subgoal defines a target state $s_t + g_t$ for the low-level policy to pursue. It remains active for $T$ consecutive time steps of the low-level policy, persisting until the subgoal is successfully achieved within this window. Upon completion, the HDAO framework queries the high-level policy to generate the next high-level action. At each subsequent time step, the subgoal is updated according to the recurrence relation
$$g_{t+1} = s_t + g_t - s_{t+1},$$
so that the absolute target state $s_t + g_t$ remains fixed while the state evolves.
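Assuming the HIRO-style subgoal transition $g_{t+1} = s_t + g_t - s_{t+1}$ (reconstructed from the surrounding description of the subgoal as a desired relative state change), the update is a one-liner; note that the invariant $s_{t+1} + g_{t+1} = s_t + g_t$ holds by construction.

```python
def update_subgoal(g_t, s_t, s_next):
    """Subgoal transition: keep the absolute target s_t + g_t fixed as the
    state advances, so g always encodes the remaining relative change."""
    return [si + gi - sni for gi, si, sni in zip(g_t, s_t, s_next)]
```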
Another component of the high-level action is a motion mode selector $m_t$. This selector is a discrete variable that determines which path-planning strategy the low-level policy will use to achieve the current subgoal: goal-directed navigation, obstacle avoidance, or a combination of both. Using the motion mode selector, the high-level policy dynamically determines, based on the environmental perception state, the planning mode the AUV should adopt over the $T$-step time window.
The low-level policy operates at each discrete time step $t$. It receives a state observation $s_t$ and generates an action $a_t$ conditioned on the most recent high-level action, which comprises the subgoal $g_t$ and the mode given by the motion mode selector $m_t$. At each time step $t$, the environment provides a state $s_t$. The low-level policy interacts directly with the environment, while the high-level policy guides it via high-level actions and goals according to the current state: by updating $g_t$ and $m_t$, the high level directs the low-level policy toward a target state, allowing it to learn efficiently from the high-level policy's prior experience. The high-level policy updates once every $T$ steps. The low-level policy observes the state $s_t$, goal $g_t$, and mode $m_t$, then produces a low-level action $a_t$ that is applied to the environment. The environment then samples a reward $r_t$ from the reward function $R(s_t, a_t)$ and transitions to a new state $s_{t+1}$ via the transition function $P(s_{t+1} \mid s_t, a_t)$.
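The two-timescale interaction described above can be sketched as a single rollout loop. The environment and policy objects here are hypothetical stand-ins (the subgoal update assumes the HIRO-style transition and a simple distance threshold `tol` for subgoal completion, both illustrative).

```python
def hierarchical_rollout(env, high_policy, low_policy, T=10, horizon=200, tol=0.5):
    """Sketch of the two-timescale loop: the high-level policy is queried every
    T low-level steps, or earlier if the current subgoal has been reached."""
    s = env.reset()
    g, mode = high_policy(s)              # subgoal + discrete mode selector
    steps_left = T
    for _ in range(horizon):
        a = low_policy(s, g, mode)        # low-level policy acts every step
        s_next, r, done = env.step(a)
        # Subgoal transition keeps the absolute target fixed as s advances.
        g = [gi + si - sni for gi, si, sni in zip(g, s, s_next)]
        steps_left -= 1
        subgoal_done = sum(gi * gi for gi in g) ** 0.5 < tol
        if steps_left == 0 or subgoal_done:
            g, mode = high_policy(s_next)  # re-query the high level (early if done)
            steps_left = T
        s = s_next
        if done:
            break
    return s
```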
To accommodate the hierarchical decision-making architecture, this study employs two independent Actor–Critic network pairs; that is, two instances of the PPO algorithm separately optimize the high-level planning policy and the low-level execution policy. The high-level value function $V^{h}(s_t)$ takes as input the AUV's global state at a macroscopic time step and outputs a long-term value estimate for that state; it predicts the cumulative discounted reward from the current moment until the end of the task. The low-level value function $V^{l}(s_t, g_t, m_t)$ constitutes the Critic of the low-level PPO network. Its input comprises not only the AUV's microscopic state but also the instructions issued by the high level, namely the subgoal $g_t$ and the mode $m_t$; it estimates the expected return of executing an atomic action under the current task requirements. Both networks are optimized with the Proximal Policy Optimization (PPO) algorithm to ensure stable gradient updates.
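The key architectural difference between the two critics is their input: the high-level critic sees only the state, while the low-level critic additionally conditions on the high-level instruction. A minimal sketch of the input construction (dimensions and the one-hot mode encoding are assumptions, not the paper's exact feature layout):

```python
def one_hot(mode, n_modes=3):
    """Encode the discrete mode selector (nav / avoid / both) as a one-hot vector."""
    v = [0.0] * n_modes
    v[mode] = 1.0
    return v

def high_critic_input(s):
    """V_high conditions only on the global state at the macro step."""
    return list(s)

def low_critic_input(s, g, mode):
    """V_low conditions on the state plus the high-level instruction:
    the subgoal g and the one-hot encoded mode."""
    return list(s) + list(g) + one_hot(mode)
```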
4.2. Curiosity-Driven Training Mechanism
In the hierarchical structure described in
Section 4.1, the high-level policy generates subgoals and selects motion modes to guide long-term planning, while the low-level policy executes concrete control actions within a fixed time window to produce short-term outputs. Although such temporal abstraction alleviates the optimization difficulty of long-horizon tasks, policy learning in dynamic and uncertain underwater environments still relies heavily on effective exploration; in particular, when the low-level controller faces sparse and strongly delayed extrinsic rewards, exploration can be insufficient and convergence slow. To address these issues, this section introduces a curiosity-driven training mechanism that constructs a pseudo-reward to complement the task reward, encouraging more informative behaviors and striking a dynamic balance between exploration and exploitation, thereby improving training efficiency and policy robustness.
To ensure the diversity of the lower-level policy set, the HDAO algorithm uses information entropy to construct a pseudo-reward. The introduced pseudo-reward function can be expressed as
$$r^{\text{pse}}_t = H\big(A^{l}_t\big) + H\big(A^{h}_t\big),$$
where $H$ denotes the Shannon entropy with base $e$, and $A$ represents the action distribution ($A^{l}_t$ for the lower-level policy and $A^{h}_t$ for the upper-level policy). On the right-hand side of Equation (20), the first term increases the randomness of the lower-level internal policy when selecting actions; the second term enhances the randomness of the upper-level policy when choosing the lower-level internal policy. Incorporating this pseudo-reward into the standard reinforcement learning framework yields the augmented reward function
$$r^{\text{aug}}_t = r_t + \beta\, r^{\text{pse}}_t,$$
where $\beta$ is a hyperparameter that controls the relative importance of the pseudo-reward in the augmented reward. When $\beta$ approaches 0, the augmented reward function converges to the standard reinforcement learning objective. This entropy-maximizing objective encourages diversity in the action distributions of the lower-level internal policies.
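A minimal sketch of the augmented reward, assuming the reconstructed form $r^{\text{aug}} = r + \beta\,(H(A^{l}) + H(A^{h}))$ with discrete distributions (the symbol $\beta$ and the function names are illustrative):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy with base e of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def augmented_reward(r_task, low_action_probs, high_mode_probs, beta=0.01):
    """Task reward plus the entropy pseudo-reward; beta -> 0 recovers the
    standard RL objective."""
    pseudo = shannon_entropy(low_action_probs) + shannon_entropy(high_mode_probs)
    return r_task + beta * pseudo
```

Deterministic policies (all probability mass on one action) contribute zero entropy, so the bonus rewards only genuinely stochastic, exploratory behavior.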
5. Results
To systematically evaluate the performance advantages and applicability of the proposed HDAO algorithm in complex three-dimensional underwater environments, this study conducts simulation experiments on AUV autonomous obstacle avoidance and path planning. The experimental scenario is set within a 3D dynamic underwater space. To authentically replicate the real marine conditions encountered during AUV operations and to enhance the credibility of the experimental validation and the robustness of the algorithm, this study utilizes measured ocean current data from the Integrated Ocean Current Dataset published by the National Marine Science Data Center of China to construct the 3D flow field for the simulation. Flow field disturbances serve as environmental inputs to the AUV’s dynamic model, with the ocean current velocity superimposed onto the AUV’s velocity to simulate the vehicle’s trajectory in a real ocean environment. This study selects PPO as the baseline algorithm. As a stable and generalizable policy-based, model-free reinforcement learning algorithm, PPO is well-suited for continuous action spaces and has been extensively validated in sequential decision-making tasks such as robotic motion control and autonomous navigation. The experiments in this section aim to evaluate the obstacle avoidance performance of the HDAO algorithm. Furthermore, ablation studies are conducted to verify whether the incorporation of a Collision Threat Index benefits dynamic obstacle avoidance and to assess the suitability of the curiosity-driven hierarchical framework for AUV path-finding tasks in 3D underwater environments.
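The superposition of the measured current onto the AUV's velocity described above amounts to vector addition before trajectory integration; a minimal sketch (the time step and function names are illustrative):

```python
def ground_velocity(v_auv, v_current):
    """Superimpose the measured current onto the AUV's through-water velocity
    to obtain the over-ground velocity used for trajectory simulation."""
    return [a + c for a, c in zip(v_auv, v_current)]

def advance_position(p, v_auv, v_current, dt=0.1):
    """One Euler integration step of the AUV position under current disturbance."""
    v = ground_velocity(v_auv, v_current)
    return [pi + vi * dt for pi, vi in zip(p, v)]
```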
The experimental computer was equipped with a 13th Gen Intel Core i7-13650HX 2.60 GHz processor (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM (Samsung, Suwon, Republic of Korea), and an NVIDIA RTX 4060 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). To train and evaluate the proposed HDAO algorithm, we developed a 3D underwater simulation environment based on the standard OpenAI Gym 0.21.0 framework using Python 3.8.
Table 1, Table 2 and Table 3 summarize the basic experimental configuration. The selection of key parameters was based on empirical evaluation and theoretical considerations.
5.1. HDAO Performance Comparison
To verify the superiority of the hierarchical strategy proposed in this paper for AUV dynamic obstacle avoidance in underwater environments, HDAO is compared with mainstream obstacle avoidance algorithms, RMPC [23] and C-APF-TD3 [24], in terms of evaluation indices such as collision avoidance rate, time consumption, minimum distance, and maximum rudder angle variation. Three scenarios are evaluated: a static setting with fixed obstacle positions, a low-speed setting in which obstacles move slowly under natural conditions and remain much slower than the AUV, and a high-speed setting in which obstacles move randomly at speeds comparable to the AUV. Each scenario is independently run 100 times. Four evaluation indices are defined: (1) Avoid: the ratio of AUVs reaching the destination within the time limit without any collision; (2) Time: the average navigation time of all successful cases; (3) MinDis: the minimum distance between the AUV and obstacles during navigation; (4) Rud: the maximum variation in rudder angle during AUV navigation.
To ensure the rigor and fairness of the performance evaluation, the baseline methods RMPC and C-APF-TD3, along with the proposed HDAO algorithm, were deployed under equivalent testing conditions. All methods were based on the identical AUV dynamics model and the same underwater obstacle motion environment, with control inputs subject to the exact same physical constraints. Furthermore, consistent total training interaction steps and stopping conditions were applied.
Under the above experimental settings, the evaluation indices are reported in Table 4, which compares the AUV obstacle avoidance performance of the three algorithms, HDAO, RMPC, and C-APF-TD3, in static, low-speed dynamic, and high-speed dynamic obstacle scenarios. The experimental results show that HDAO maintains a 100% obstacle avoidance success rate across all scenarios, with the shortest average time consumption and the smallest rudder angle variation, demonstrating the most stable and efficient performance. RMPC also achieves a 100% success rate in static and low-speed scenarios, but its success rate drops to 91% in the high-speed scenario, along with a significant increase in rudder angle fluctuation. C-APF-TD3 performs weakly in dynamic environments, particularly in the high-speed scenario, where its obstacle avoidance success rate is only 86%, with the longest time consumption and multiple failures to maintain a safe distance. In terms of obstacle avoidance rate, both HDAO and RMPC are suitable for most low-speed obstacle movements under natural conditions; however, under high-speed obstacle movement, only HDAO achieves a 100% avoidance rate. Furthermore, from the perspectives of average time consumption, minimum distance to obstacles, and rudder angle variation, HDAO outperforms the other two methods in decision efficiency, safety, and control smoothness.
The results presented in Table 4 are visualized as radar charts over the evaluation-index dimensions. The performance of the three algorithms under each scenario is normalized to the range 0–1, where a value closer to 1 indicates better performance on the corresponding metric. The results are shown in Figure 4. The radar charts show that the performance of the HDAO algorithm does not degrade significantly across static, low-speed, and high-speed scenarios; its radar-plot coverage area is the largest of the three algorithms in all scenarios, demonstrating superior overall performance. This visualization is consistent with the quantitative data reported in the table and provides intuitive evidence of the HDAO algorithm's advantages in environmental adaptability and robustness.
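The 0–1 normalization behind the radar charts can be sketched as a per-metric min-max scaling; the direction flip for lower-is-better metrics is an assumption about how the charts were built (the paper does not spell out the normalization formula).

```python
def normalize_metric(values, higher_is_better=True):
    """Min-max normalize one metric across algorithms to [0, 1]. For
    lower-is-better metrics (e.g., time, rudder variation) the scale is
    flipped so that 1 always denotes the best performer."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all algorithms tie on this metric
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - x for x in scaled]
```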
A comparative analysis of training convergence is presented in Figure 5, which compares the reward curves of the three algorithms during training. The horizontal axis represents training episodes and the vertical axis cumulative reward, with shaded areas indicating the 95% confidence intervals across multiple runs. As observed in Figure 5, each curve starts with a low cumulative reward in the early training stage, and the reward gradually increases with more episodes, indicating improved performance for all three algorithms, before converging and stabilizing. The proposed HDAO method reaches the highest cumulative reward earliest in the later training stage, and its final converged reward exceeds that of the other two algorithms. Moreover, HDAO exhibits faster convergence, demonstrating its advantages for DRL training.
Figure 6a–c show the training trajectories of the AUV under the three methods, displayed as two-dimensional projections. In the static scenario (Figure 6a) and the low-speed dynamic scenario (Figure 6b), all three methods successfully avoid obstacles and reach the destination. In the high-speed scenario (Figure 6c), due to the increased obstacle speed, only the HDAO method succeeds in obstacle avoidance, while the other two methods fail. The HDAO method successfully avoids obstacles and accurately reaches the target points in all three scenarios, showing strong environmental adaptability and relatively smooth trajectory changes. In contrast, when the environment changes, the baseline methods RMPC and C-APF-TD3 struggle to maintain good obstacle avoidance performance, resulting in collisions with obstacles.
5.2. Ablation Study
The hierarchical framework and the Collision Threat Index (CTI) are key components enabling HDAO to address AUV obstacle avoidance tasks in dynamic environments. To validate this, the HDAO-CO algorithm was designed and compared with the baseline algorithm PPO. In HDAO-CO, the CTI is replaced with a basic obstacle avoidance reward, while all other rewards and network parameters remain unchanged. This design minimizes the impact of data processing techniques on the experimental results. A comparison of the performance metrics between HDAO and HDAO-CO is presented in Table 5.
Table 5 indicates that, under the same network architecture, data processing techniques, and basic hyperparameters, HDAO significantly outperforms HDAO-CO and the baseline method in terms of obstacle avoidance success rate, navigation efficiency, and stability in dynamic environments. HDAO-CO achieves complete obstacle avoidance only in the static scenario. It tends to fail in dynamic environments, which poses challenges to the navigation safety of the AUV. The baseline method, PPO, can safely avoid obstacles in the low-speed scenario. However, it becomes unsuitable in the high-speed scenario. Another evaluation index with clear discrimination is the rudder angle variation. The rudder angle variation of HDAO-CO and PPO is several times higher than that of HDAO. Among the three methods, HDAO-CO exhibits the largest rudder angle variation. The results indicate that the introduction of the CTI metric is a key factor for obstacle avoidance in dynamic environments. The hierarchical structure further ensures stable navigation of the AUV. These factors jointly guarantee safe and stable obstacle avoidance performance in dynamic environments.
5.3. Discussion on the Design of the CTI Metric and the Results
To validate our theoretical analysis regarding the limitations of traditional maritime metrics in 3D underwater environments, we integrated DCPA and TCPA into our proposed methodological framework and conducted comparative experiments between our proposed CTI and these classic metrics. The comparative results are presented in Table 6, which shows that single-metric evaluation schemes using either DCPA or TCPA perform poorly in dynamic scenarios: they significantly reduce the obstacle avoidance success rate and noticeably increase the average navigation time, accompanied by severe control oscillations. In contrast, our proposed CTI achieves excellent obstacle avoidance results and provides superior navigation efficiency, demonstrating its effectiveness for AUV dynamic obstacle avoidance without depending on the constant-velocity assumption.
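For reference, the classic DCPA/TCPA baselines compared here follow the standard closest-point-of-approach definitions under the constant-velocity assumption: with relative position $p$ and relative velocity $v$, $\text{TCPA} = -\,(p \cdot v)/\lVert v \rVert^{2}$ and $\text{DCPA} = \lVert p + v\,\text{TCPA} \rVert$. A sketch (the clamping of negative TCPA to zero for receding obstacles is a common convention, assumed here):

```python
import math

def dcpa_tcpa(rel_pos, rel_vel):
    """Closest point of approach under the constant-velocity assumption.

    rel_pos, rel_vel: obstacle position/velocity relative to the AUV (3-tuples).
    Returns (DCPA, TCPA); TCPA is clamped at 0 when the obstacle is receding.
    """
    vv = sum(v * v for v in rel_vel)
    if vv < 1e-12:  # no relative motion: distance never changes
        return math.sqrt(sum(p * p for p in rel_pos)), float("inf")
    tcpa = -sum(p * v for p, v in zip(rel_pos, rel_vel)) / vv
    tcpa = max(tcpa, 0.0)
    closest = [p + v * tcpa for p, v in zip(rel_pos, rel_vel)]
    return math.sqrt(sum(c * c for c in closest)), tcpa
```

Because both quantities assume straight-line motion, they degrade when obstacles maneuver, which is precisely the failure mode the CTI comparison above exposes.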
This study systematically evaluates HDAO in three scenarios: static, low-speed dynamic, and high-speed dynamic environments. It performs comparative analyses using obstacle-avoidance success rate, navigation efficiency (travel time), minimum safety distance, and rudder angle variation as evaluation metrics. The results show that HDAO consistently achieves the best overall performance across all three scenarios, with particularly pronounced advantages in the dynamic and high-speed settings. In contrast, HDAO-CO is more prone to obstacle-avoidance failures in dynamic environments, and standard PPO becomes significantly less applicable in high-speed scenarios. Ablation and comparative experiments further identify the specific sources of these improvements. First, the CTI provides a continuous and interpretable risk representation, which gives the policy explicit risk feedback during dynamic interactions and enforces stable avoidance constraints. Second, the hierarchical architecture jointly optimizes global navigation and local avoidance at different time scales, which reduces the learning difficulty of long-horizon tasks and suppresses control oscillations, thereby significantly decreasing rudder angle variation. Furthermore, the intrinsic-reward (curiosity-driven) mechanism alleviates insufficient exploration under sparse rewards, accelerates the formation of effective avoidance behaviors, and improves adaptability in dynamic environments. Beyond these algorithmic advantages, empirical results show that the average inference time for a single decision step (a forward pass) of the HDAO policy network is ms. This low computational latency satisfies the real-time control frequency typically required by modern AUV low-level controllers, demonstrating the algorithm's feasibility for real-world engineering deployment.