1. Introduction
Over the past few decades, driven by continuous progress in autonomous navigation and environmental perception technologies, Autonomous Underwater Vehicles (AUVs) have become indispensable tools for scientific research, underwater exploration, infrastructure inspection, and military operations [1,2,3]. Improvements in autonomy, operational range, and multifunctionality have enabled AUVs to perform complex missions such as ocean mapping, ecosystem monitoring, and maritime surveillance, enhancing their efficiency and capabilities in both civilian and defense applications [4]. However, the widespread deployment of AUVs also raises significant safety concerns. In complex marine environments, currents can vary rapidly and visibility is often limited by suspended particles; moreover, underwater obstacles such as reefs, discarded equipment, and large marine organisms may be widely distributed and move unpredictably. Reliable autonomous collision avoidance is therefore essential for operational safety [5]. A collision-avoidance failure may lead to collisions with other vessels, underwater obstacles, or infrastructure, resulting in vehicle damage, mission disruption, and secondary hazards such as shipping interference, marine pollution, or communication failures. In military contexts, such incidents could further leak classified information and compromise mission deployment.
Early research on AUV obstacle avoidance relied primarily on traditional physics-based models to ensure navigation safety and avoidance reliability. Among these methods, PID control offered fundamental attitude and trajectory regulation due to its simple structure and fast response, effectively supporting obstacle avoidance maneuvers in low-complexity environments [6]. Backstepping control addressed the nonlinear dynamics of AUV navigation through step-by-step controller design, significantly enhancing avoidance stability in nonlinear systems [7]. The artificial potential field method introduced a virtual "attraction–repulsion" potential model, transforming obstacle avoidance and target pursuit into intuitive force-balance decisions; this simplified path planning and provided an effective real-time avoidance solution for early AUV systems [8]. While these studies significantly advanced the field, they rely on precise model parameters: because AUVs are underactuated, exhibit strongly coupled nonlinear dynamics, and have time-varying hydrodynamic coefficients, controllers with fixed parameter designs deliver limited performance. These approaches also lack flexibility in handling multi-objective optimization problems.
Deep reinforcement learning (DRL) has emerged as a powerful alternative, as it learns policies directly through interaction with uncertain and dynamic environments [9,10]. For example, Zheng et al. proposed a PPO-based path-planning method, UP4O, for complex ocean-current conditions; by integrating obstacle features with state information such as relative position, currents, and velocity, it achieves time-efficient planning that balances global guidance and local obstacle avoidance [11]. Bingul et al. introduced a memory-based DRL approach for obstacle avoidance in unknown environments, leveraging recurrent networks with temporal attention to exploit historical observations and mitigate partial observability, thereby increasing collision-free flight distance and reducing energy loss caused by oscillatory motions [12]. Wu et al. developed an improved TD3-based path-following control method, incorporating importance-aware experience replay, smooth regularization, and an adaptive reward design to accelerate convergence and suppress action oscillations while maintaining high tracking accuracy under disturbances [13]. Overall, existing studies show that modern DRL, especially policy-gradient-based approaches, provides an effective tool for learning robust and scalable collision-avoidance policies in high-dimensional state and action spaces.
Clearly, introducing reinforcement learning into AUV obstacle avoidance aligns well with future trends toward intelligent operation. However, a review of both traditional obstacle avoidance methods and reinforcement learning approaches reveals that researchers have paid little attention to dynamic obstacle avoidance for underwater AUVs. This study draws inspiration from the theory of dynamic obstacle avoidance for surface vessels [14], aiming to find an intelligent method suited to the dynamic-obstacle challenges faced by underwater AUVs. Nevertheless, applying DRL to AUV obstacle avoidance still faces a core challenge: training agents stably and efficiently in dynamically uncertain marine environments [15]. When dealing with sparse rewards and sequential decision-making, designing efficient exploration methods that achieve faster and more stable exploration is often a high-priority consideration [16]. Although DRL can handle high-dimensional state and action spaces, its application in obstacle avoidance tasks has significant limitations [17]. Underwater obstacle avoidance scenarios commonly involve sparse rewards, long-term delayed rewards, and hierarchical task structures, under which agents struggle to learn optimal avoidance strategies. Hierarchical reinforcement learning (HRL) offers a promising solution to these limitations [18]: its abstraction mechanism allows complex obstacle avoidance tasks to be decomposed into hierarchical levels. However, new issues arise when applying HRL to AUV obstacle avoidance. Designing task abstraction hierarchies that precisely match the real-time demands of maritime scenarios, such as dynamic ocean currents and sudden obstacles, remains difficult. Furthermore, the generalization capability of hierarchical strategies is highly sensitive to environmental variations, making it challenging to ensure obstacle avoidance stability and safety across different sea areas and operating conditions.
To address the aforementioned limitations, this study proposes a hierarchical reinforcement learning AUV obstacle avoidance method tailored for dynamic obstacle scenarios. The main technical contributions are summarized as follows:
We systematically embed key stochastic factors—such as obstacle behavior patterns and external disturbances—into the AUV obstacle avoidance training environment. Concurrently, we innovatively introduce a Collision Threat Index (CTI) tailored for underwater scenarios, which effectively quantifies the collision risk between the vehicle and dynamic obstacles.
Unlike traditional end-to-end DRL, HDAO fundamentally decouples the navigation task across different time scales based on the principles of HRL. Our framework learns to dynamically select among pure navigation, pure obstacle avoidance, or a combination of both. This approach enables the synergistic optimization of global navigation intents and local avoidance actions, significantly alleviating the high-dimensional training pressure in complex environments.
Traditional hierarchical exploration methods often suffer from inherent defects, including the underutilization of environmental information and slow, inefficient global exploration. To address these limitations, we integrate a curiosity-driven reward mechanism to improve policy exploration. This enhancement boosts the exploration performance of HRL, achieving rapid and highly efficient coverage of the global state space in complex environments.
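The paper defines its Collision Threat Index formally in a later section; purely as an illustrative sketch of the idea of quantifying collision risk between the vehicle and a dynamic obstacle, the fragment below blends a normalized distance term with a closing-speed term. All names, weights, and thresholds here are assumptions for illustration, not the paper's CTI definition.

```python
import math

def illustrative_cti(rel_pos, rel_vel, safe_dist=10.0, w_d=0.6, w_v=0.4):
    """Illustrative collision-threat score in [0, 1] (NOT the paper's CTI).

    rel_pos, rel_vel: 3-tuples, obstacle position/velocity relative to the AUV.
    A small separation and a fast closing speed both raise the threat.
    """
    dist = math.sqrt(sum(p * p for p in rel_pos))
    # Closing speed: positive when the obstacle is approaching the AUV.
    closing = -sum(p * v for p, v in zip(rel_pos, rel_vel)) / max(dist, 1e-9)
    d_term = min(1.0, safe_dist / max(dist, 1e-9))  # near obstacle -> 1
    v_term = 1.0 / (1.0 + math.exp(-closing))       # approaching -> 1
    return w_d * d_term + w_v * v_term
```

Because the score varies continuously with both distance and relative motion, it provides denser risk feedback than a binary collision penalty, which is the property the contribution above emphasizes.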
4. Hierarchical Obstacle Avoidance Network Architecture
4.1. Hierarchical Policy Architecture
In this section, we introduce the framework used to learn the hierarchical policy, as illustrated in Figure 3. Our solution employs a two-layer policy structure (a high-level policy $\pi^{h}$ and a low-level policy $\pi^{l}$). Both policy layers receive the same observation vector $s_t$, which encodes the AUV's own state, the goal, ocean currents, and obstacle-related information.
To handle the hybrid decision-making process between high-level discrete mode selection and low-level continuous control, the system adopts a modular policy activation and routing mechanism in its engineering implementation, as discussed in reference [22]. The discrete mode selector $m_t$ output by the high-level policy serves as a high-level routing switch: it directly determines which branch of the low-level execution logic is activated.
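The routing switch described above can be sketched as a simple dispatch over the discrete mode. The controller functions below are placeholders standing in for the low-level behaviors, not the paper's implementation; the equal-weight blend for the combined mode is an assumption for illustration.

```python
# Minimal sketch of the high-level routing switch; nav_action and avoid_action
# are hypothetical stand-ins for the low-level controllers.
NAV, AVOID, BOTH = 0, 1, 2

def nav_action(obs):    # placeholder goal-directed controller
    return [1.0, 0.0]

def avoid_action(obs):  # placeholder obstacle-avoidance controller
    return [0.0, 1.0]

def route_low_level(mode, obs):
    """Dispatch the low-level execution logic according to the mode selector."""
    if mode == NAV:
        return nav_action(obs)
    if mode == AVOID:
        return avoid_action(obs)
    # BOTH: blend goal-directed and avoidance commands (illustrative 50/50 mix).
    return [0.5 * n + 0.5 * a for n, a in zip(nav_action(obs), avoid_action(obs))]
```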
The high-level policy operates on a longer timescale than the conventional Markov Decision Process (MDP). It takes actions at time steps $t = 0, T, 2T, \ldots$, i.e., once every $T$ steps of the low-level policy, or sooner if the low-level policy has converged to a subgoal. The high-level policy receives an observation $s_t$ and generates a high-level action conditioned on the final goal. If the low-level policy reaches the subgoal early, the high-level policy does not wait for the full $T$ steps: when the subgoal is completed within the $T$-step window, the system queries the high-level policy and generates a new high-level action in advance. The selection of the macro step size $T$ must balance computational efficiency against responsiveness, and should match the typical speed of dynamic obstacles. In this study, setting $T = 10$ proves sufficient.
A key component of the high-level action is the subgoal $g_t$, which encodes the desired relative change in the current state $s_t$. This subgoal defines a target state $s_t + g_t$ for the low-level policy to pursue. It remains active for $T$ consecutive time steps of the low-level policy, persisting until the subgoal is successfully achieved within this window. Upon completion, the HDAO framework queries the high-level policy to generate the next high-level action. At each subsequent time step, the subgoal is updated according to the recurrence relation
$$g_{t+1} = s_t + g_t - s_{t+1},$$
so that the absolute target state $s_t + g_t$ remains fixed while the state evolves.
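Assuming the HIRO-style subgoal transition $g_{t+1} = s_t + g_t - s_{t+1}$ (reconstructed from the surrounding description of the subgoal as a desired relative state change), the update is a one-liner; note that the invariant $s_{t+1} + g_{t+1} = s_t + g_t$ holds by construction.

```python
def update_subgoal(g_t, s_t, s_next):
    """Subgoal transition: keep the absolute target s_t + g_t fixed as the
    state advances, so g always encodes the remaining relative change."""
    return [si + gi - sni for gi, si, sni in zip(g_t, s_t, s_next)]
```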
Another component of the high-level action is a motion mode selector $m_t$. This selector is a discrete variable that determines which path-planning strategy the low-level policy will use to achieve the current subgoal: goal-directed navigation, obstacle avoidance, or a combination of both. Using the motion mode selector, the high-level policy dynamically determines, based on the environmental perception state, the planning mode the AUV should adopt over the $T$-step time window.
The low-level policy operates at each discrete time step $t$. It receives a state observation $s_t$ and generates an action $a_t$ conditioned on the most recent high-level action, which comprises the subgoal $g_t$ and the mode given by the motion mode selector $m_t$. At each time step $t$, the environment provides a state $s_t$. The low-level policy interacts directly with the environment, while the high-level policy guides it via high-level actions and goals according to the current state: by updating $g_t$ and $m_t$, the high level directs the low-level policy toward a target state, allowing it to learn efficiently from the high-level policy's prior experience. The high-level policy updates once every $T$ steps. The low-level policy observes the state $s_t$, goal $g_t$, and mode $m_t$, then produces a low-level action $a_t$ that is applied to the environment. The environment then samples a reward $r_t$ from the reward function $R(s_t, a_t)$ and transitions to a new state $s_{t+1}$ via the transition function $P(s_{t+1} \mid s_t, a_t)$.
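The two-timescale interaction described above can be sketched as a single rollout loop. The environment and policy objects here are hypothetical stand-ins (the subgoal update assumes the HIRO-style transition and a simple distance threshold `tol` for subgoal completion, both illustrative).

```python
def hierarchical_rollout(env, high_policy, low_policy, T=10, horizon=200, tol=0.5):
    """Sketch of the two-timescale loop: the high-level policy is queried every
    T low-level steps, or earlier if the current subgoal has been reached."""
    s = env.reset()
    g, mode = high_policy(s)              # subgoal + discrete mode selector
    steps_left = T
    for _ in range(horizon):
        a = low_policy(s, g, mode)        # low-level policy acts every step
        s_next, r, done = env.step(a)
        # Subgoal transition keeps the absolute target fixed as s advances.
        g = [gi + si - sni for gi, si, sni in zip(g, s, s_next)]
        steps_left -= 1
        subgoal_done = sum(gi * gi for gi in g) ** 0.5 < tol
        if steps_left == 0 or subgoal_done:
            g, mode = high_policy(s_next)  # re-query the high level (early if done)
            steps_left = T
        s = s_next
        if done:
            break
    return s
```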
To accommodate the hierarchical decision-making architecture, this study employs two independent Actor–Critic network pairs; that is, two instances of the PPO algorithm separately optimize the high-level planning policy and the low-level execution policy. The high-level value function $V^{h}(s_t)$ takes as input the AUV's global state at a macroscopic time step and outputs a long-term value estimate for that state; it predicts the cumulative discounted reward from the current moment until the end of the task. The low-level value function $V^{l}(s_t, g_t, m_t)$ constitutes the Critic of the low-level PPO network. Its input comprises not only the AUV's microscopic state but also the instructions issued by the high level, namely the subgoal $g_t$ and the mode $m_t$; it estimates the expected return of executing an atomic action under the current task requirements. Both networks are optimized with the Proximal Policy Optimization (PPO) algorithm to ensure stable gradient updates.
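The key architectural difference between the two critics is their input: the high-level critic sees only the state, while the low-level critic additionally conditions on the high-level instruction. A minimal sketch of the input construction (dimensions and the one-hot mode encoding are assumptions, not the paper's exact feature layout):

```python
def one_hot(mode, n_modes=3):
    """Encode the discrete mode selector (nav / avoid / both) as a one-hot vector."""
    v = [0.0] * n_modes
    v[mode] = 1.0
    return v

def high_critic_input(s):
    """V_high conditions only on the global state at the macro step."""
    return list(s)

def low_critic_input(s, g, mode):
    """V_low conditions on the state plus the high-level instruction:
    the subgoal g and the one-hot encoded mode."""
    return list(s) + list(g) + one_hot(mode)
```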
4.2. Curiosity-Driven Training Mechanism
In the hierarchical structure described in
Section 4.1, the high-level policy generates subgoals and selects motion modes to guide long-term planning, while the low-level policy executes concrete control actions within a fixed time window to produce short-term outputs. Although such temporal abstraction alleviates the optimization difficulty of long-horizon tasks, policy learning in dynamic and uncertain underwater environments still relies heavily on effective exploration; in particular, when the low-level controller faces sparse and strongly delayed extrinsic rewards, exploration can be insufficient and convergence slow. To address these issues, this section introduces a curiosity-driven training mechanism that constructs a pseudo-reward to complement the task reward, encouraging more informative behaviors and striking a dynamic balance between exploration and exploitation, thereby improving training efficiency and policy robustness.
To ensure the diversity of the lower-level policy set, the HDAO algorithm uses information entropy to construct a pseudo-reward. The introduced pseudo-reward function can be expressed as
$$r^{\text{pse}}_t = H\big(A^{l}_t\big) + H\big(A^{h}_t\big),$$
where $H$ denotes the Shannon entropy with base $e$, and $A$ represents the action distribution ($A^{l}_t$ for the lower-level policy and $A^{h}_t$ for the upper-level policy). On the right-hand side of Equation (20), the first term increases the randomness of the lower-level internal policy when selecting actions; the second term enhances the randomness of the upper-level policy when choosing the lower-level internal policy. Incorporating this pseudo-reward into the standard reinforcement learning framework yields the augmented reward function
$$r^{\text{aug}}_t = r_t + \beta\, r^{\text{pse}}_t,$$
where $\beta$ is a hyperparameter that controls the relative importance of the pseudo-reward in the augmented reward. When $\beta$ approaches 0, the augmented reward function converges to the standard reinforcement learning objective. This entropy-maximizing objective encourages diversity in the action distributions of the lower-level internal policies.
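A minimal sketch of the augmented reward, assuming the reconstructed form $r^{\text{aug}} = r + \beta\,(H(A^{l}) + H(A^{h}))$ with discrete distributions (the symbol $\beta$ and the function names are illustrative):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy with base e of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def augmented_reward(r_task, low_action_probs, high_mode_probs, beta=0.01):
    """Task reward plus the entropy pseudo-reward; beta -> 0 recovers the
    standard RL objective."""
    pseudo = shannon_entropy(low_action_probs) + shannon_entropy(high_mode_probs)
    return r_task + beta * pseudo
```

Deterministic policies (all probability mass on one action) contribute zero entropy, so the bonus rewards only genuinely stochastic, exploratory behavior.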
5. Results
To systematically evaluate the performance advantages and applicability of the proposed HDAO algorithm in complex three-dimensional underwater environments, this study conducts simulation experiments on AUV autonomous obstacle avoidance and path planning. The experimental scenario is set within a 3D dynamic underwater space. To authentically replicate the real marine conditions encountered during AUV operations and to enhance the credibility of the experimental validation and the robustness of the algorithm, this study utilizes measured ocean current data from the Integrated Ocean Current Dataset published by the National Marine Science Data Center of China to construct the 3D flow field for the simulation. Flow field disturbances serve as environmental inputs to the AUV’s dynamic model, with the ocean current velocity superimposed onto the AUV’s velocity to simulate the vehicle’s trajectory in a real ocean environment. This study selects PPO as the baseline algorithm. As a stable and generalizable policy-based, model-free reinforcement learning algorithm, PPO is well-suited for continuous action spaces and has been extensively validated in sequential decision-making tasks such as robotic motion control and autonomous navigation. The experiments in this section aim to evaluate the obstacle avoidance performance of the HDAO algorithm. Furthermore, ablation studies are conducted to verify whether the incorporation of a Collision Threat Index benefits dynamic obstacle avoidance and to assess the suitability of the curiosity-driven hierarchical framework for AUV path-finding tasks in 3D underwater environments.
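The superposition of the measured current onto the AUV's velocity described above amounts to vector addition before trajectory integration; a minimal sketch (the time step and function names are illustrative):

```python
def ground_velocity(v_auv, v_current):
    """Superimpose the measured current onto the AUV's through-water velocity
    to obtain the over-ground velocity used for trajectory simulation."""
    return [a + c for a, c in zip(v_auv, v_current)]

def advance_position(p, v_auv, v_current, dt=0.1):
    """One Euler integration step of the AUV position under current disturbance."""
    v = ground_velocity(v_auv, v_current)
    return [pi + vi * dt for pi, vi in zip(p, v)]
```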
The experimental computer was equipped with a 13th Gen Intel Core i7-13650HX 2.60 GHz processor (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM (Samsung, Suwon, Republic of Korea), and an NVIDIA RTX 4060 graphics card (NVIDIA Corporation, Santa Clara, CA, USA). To train and evaluate the proposed HDAO algorithm, we developed a 3D underwater simulation environment based on the standard OpenAI Gym 0.21.0 framework using Python 3.8.
Table 1, Table 2 and Table 3 summarize the basic experimental configuration. The selection of key parameters was based on empirical evaluation and theoretical considerations.
5.1. HDAO Performance Comparison
To verify the superiority of the hierarchical strategy proposed in this paper for AUV dynamic obstacle avoidance in underwater environments, HDAO is compared with mainstream obstacle avoidance algorithms, RMPC [23] and C-APF-TD3 [24], in terms of evaluation indices such as collision avoidance rate, time consumption, minimum distance, and maximum rudder angle variation. Three scenarios are evaluated: a static setting with fixed obstacle positions, a low-speed setting in which obstacles move slowly under natural conditions and remain much slower than the AUV, and a high-speed setting in which obstacles move randomly at speeds comparable to the AUV. Each scenario is independently run 100 times. Four evaluation indices are defined: (1) Avoid: the ratio of AUVs reaching the destination within the time limit without any collision; (2) Time: the average navigation time of all successful cases; (3) MinDis: the minimum distance between the AUV and obstacles during navigation; (4) Rud: the maximum variation in rudder angle during AUV navigation.
To ensure the rigor and fairness of the performance evaluation, the baseline methods RMPC and C-APF-TD3, along with the proposed HDAO algorithm, were deployed under equivalent testing conditions. All methods were based on the identical AUV dynamics model and the same underwater obstacle motion environment, with control inputs subject to the exact same physical constraints. Furthermore, consistent total training interaction steps and stopping conditions were applied.
Under the above experimental settings, the evaluation indices are reported in Table 4, which compares the AUV obstacle avoidance performance of the three algorithms, HDAO, RMPC, and C-APF-TD3, in static, low-speed dynamic, and high-speed dynamic obstacle scenarios. The experimental results show that HDAO maintains a 100% obstacle avoidance success rate across all scenarios, with the shortest average time consumption and the smallest rudder angle variation, demonstrating the most stable and efficient performance. RMPC also achieves a 100% success rate in static and low-speed scenarios, but its success rate drops to 91% in the high-speed scenario, along with a significant increase in rudder angle fluctuation. C-APF-TD3 performs weakly in dynamic environments, particularly in the high-speed scenario, where its obstacle avoidance success rate is only 86%, with the longest time consumption and multiple failures to maintain a safe distance. In terms of obstacle avoidance rate, both HDAO and RMPC are suitable for most low-speed obstacle movements under natural conditions; however, under high-speed obstacle movement, only HDAO achieves a 100% avoidance rate. Furthermore, from the perspectives of average time consumption, minimum distance to obstacles, and rudder angle variation, HDAO outperforms the other two methods in decision efficiency, safety, and control smoothness.
The results presented in Table 4 are visualized as radar charts over the evaluation-index dimensions. The performance of the three algorithms under each scenario is normalized to the range 0–1, where a value closer to 1 indicates better performance on the corresponding metric. The results are shown in Figure 4. The radar charts show that the performance of the HDAO algorithm does not degrade significantly across static, low-speed, and high-speed scenarios; its radar-plot coverage area is the largest of the three algorithms in all scenarios, demonstrating superior overall performance. This visualization is consistent with the quantitative data reported in the table and provides intuitive evidence of the HDAO algorithm's advantages in environmental adaptability and robustness.
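The 0–1 normalization behind the radar charts can be sketched as a per-metric min-max scaling; the direction flip for lower-is-better metrics is an assumption about how the charts were built (the paper does not spell out the normalization formula).

```python
def normalize_metric(values, higher_is_better=True):
    """Min-max normalize one metric across algorithms to [0, 1]. For
    lower-is-better metrics (e.g., time, rudder variation) the scale is
    flipped so that 1 always denotes the best performer."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all algorithms tie on this metric
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - x for x in scaled]
```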
A comparative analysis of training convergence is presented in Figure 5, which compares the reward curves of the three algorithms during training. The horizontal axis represents training episodes and the vertical axis cumulative reward, with shaded areas indicating the 95% confidence intervals across multiple runs. As observed in Figure 5, each curve starts with a low cumulative reward in the early training stage, and the reward gradually increases with more episodes, indicating improved performance for all three algorithms, before converging and stabilizing. The proposed HDAO method reaches the highest cumulative reward earliest in the later training stage, and its final converged reward exceeds that of the other two algorithms. Moreover, HDAO exhibits faster convergence, demonstrating its advantages for DRL training.
Figure 6a–c show the training trajectories of the AUV under the three methods, displayed as two-dimensional projections. In the static scenario (Figure 6a) and the low-speed dynamic scenario (Figure 6b), all three methods successfully avoid obstacles and reach the destination. In the high-speed scenario (Figure 6c), due to the increased obstacle speed, only the HDAO method succeeds in obstacle avoidance, while the other two methods fail. The HDAO method successfully avoids obstacles and accurately reaches the target points in all three scenarios, showing strong environmental adaptability and relatively smooth trajectory changes. In contrast, when the environment changes, the baseline methods RMPC and C-APF-TD3 struggle to maintain good obstacle avoidance performance, resulting in collisions with obstacles.
5.2. Ablation Study
The hierarchical framework and the Collision Threat Index (CTI) are key components enabling HDAO to address AUV obstacle avoidance tasks in dynamic environments. To validate this, the HDAO-CO algorithm was designed and compared with the baseline algorithm PPO. In HDAO-CO, the CTI is replaced with a basic obstacle avoidance reward, while all other rewards and network parameters remain unchanged. This design minimizes the impact of data processing techniques on the experimental results. A comparison of the performance metrics between HDAO and HDAO-CO is presented in Table 5.
Table 5 indicates that, under the same network architecture, data processing techniques, and basic hyperparameters, HDAO significantly outperforms HDAO-CO and the baseline method in terms of obstacle avoidance success rate, navigation efficiency, and stability in dynamic environments. HDAO-CO achieves complete obstacle avoidance only in the static scenario. It tends to fail in dynamic environments, which poses challenges to the navigation safety of the AUV. The baseline method, PPO, can safely avoid obstacles in the low-speed scenario. However, it becomes unsuitable in the high-speed scenario. Another evaluation index with clear discrimination is the rudder angle variation. The rudder angle variation of HDAO-CO and PPO is several times higher than that of HDAO. Among the three methods, HDAO-CO exhibits the largest rudder angle variation. The results indicate that the introduction of the CTI metric is a key factor for obstacle avoidance in dynamic environments. The hierarchical structure further ensures stable navigation of the AUV. These factors jointly guarantee safe and stable obstacle avoidance performance in dynamic environments.
5.3. Discussion on the Design of the CTI Metric and the Results
To validate our theoretical analysis regarding the limitations of traditional maritime metrics in 3D underwater environments, we integrated DCPA and TCPA into our proposed methodological framework and conducted comparative experiments between our proposed CTI and these classic metrics. The comparative results are presented in Table 6, which shows that single-metric evaluation schemes using either DCPA or TCPA perform poorly in dynamic scenarios: they significantly reduce the obstacle avoidance success rate and noticeably increase the average navigation time, accompanied by severe control oscillations. In contrast, our proposed CTI achieves excellent obstacle avoidance results and provides superior navigation efficiency, demonstrating its effectiveness for AUV dynamic obstacle avoidance without depending on the constant-velocity assumption.
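For reference, the classic DCPA/TCPA baselines compared here follow the standard closest-point-of-approach definitions under the constant-velocity assumption: with relative position $p$ and relative velocity $v$, $\text{TCPA} = -\,(p \cdot v)/\lVert v \rVert^{2}$ and $\text{DCPA} = \lVert p + v\,\text{TCPA} \rVert$. A sketch (the clamping of negative TCPA to zero for receding obstacles is a common convention, assumed here):

```python
import math

def dcpa_tcpa(rel_pos, rel_vel):
    """Closest point of approach under the constant-velocity assumption.

    rel_pos, rel_vel: obstacle position/velocity relative to the AUV (3-tuples).
    Returns (DCPA, TCPA); TCPA is clamped at 0 when the obstacle is receding.
    """
    vv = sum(v * v for v in rel_vel)
    if vv < 1e-12:  # no relative motion: distance never changes
        return math.sqrt(sum(p * p for p in rel_pos)), float("inf")
    tcpa = -sum(p * v for p, v in zip(rel_pos, rel_vel)) / vv
    tcpa = max(tcpa, 0.0)
    closest = [p + v * tcpa for p, v in zip(rel_pos, rel_vel)]
    return math.sqrt(sum(c * c for c in closest)), tcpa
```

Because both quantities assume straight-line motion, they degrade when obstacles maneuver, which is precisely the failure mode the CTI comparison above exposes.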
This study systematically evaluates HDAO in three scenarios: static, low-speed dynamic, and high-speed dynamic environments. It performs comparative analyses using obstacle-avoidance success rate, navigation efficiency (travel time), minimum safety distance, and rudder angle variation as evaluation metrics. The results show that HDAO consistently achieves the best overall performance across all three scenarios, with particularly pronounced advantages in the dynamic and high-speed settings. In contrast, HDAO-CO is more prone to obstacle-avoidance failures in dynamic environments, and standard PPO becomes significantly less applicable in high-speed scenarios. Ablation and comparative experiments further identify the specific sources of these improvements. First, the CTI provides a continuous and interpretable risk representation, which gives the policy explicit risk feedback during dynamic interactions and enforces stable avoidance constraints. Second, the hierarchical architecture jointly optimizes global navigation and local avoidance at different time scales, which reduces the learning difficulty of long-horizon tasks and suppresses control oscillations, thereby significantly decreasing rudder angle variation. Furthermore, the intrinsic-reward (curiosity-driven) mechanism alleviates insufficient exploration under sparse rewards, accelerates the formation of effective avoidance behaviors, and improves adaptability in dynamic environments. Beyond these algorithmic advantages, empirical results show that the average inference time for a single decision step (a forward pass) of the HDAO policy network is ms. This low computational latency satisfies the real-time control frequency typically required by modern AUV low-level controllers, demonstrating the algorithm's feasibility for real-world engineering deployment.