Article

Research on Multi-USV Collision Avoidance Based on Priority-Driven and Expert-Guided Deep Reinforcement Learning

1 Ocean College, Jiangsu University of Science and Technology, Zhenjiang 212003, China
2 Jiangsu Marine Technology Innovation Center, Nantong 226000, China
3 School of Mathematical Sciences, Yangzhou University, Yangzhou 225009, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(2), 197; https://doi.org/10.3390/jmse14020197
Submission received: 30 December 2025 / Revised: 13 January 2026 / Accepted: 15 January 2026 / Published: 17 January 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Deep reinforcement learning (DRL) has demonstrated considerable potential for autonomous collision avoidance in unmanned surface vessels (USVs). However, its application in complex multi-agent maritime environments is often limited by challenges such as convergence issues and high computational costs. To address these issues, this paper proposes an expert-guided DRL algorithm that integrates a Dual-Priority Experience Replay (DPER) mechanism with a Hybrid Reciprocal Velocity Obstacles (HRVO) expert module. Specifically, the DPER mechanism prioritizes high-value experiences by considering both temporal-difference (TD) error and collision avoidance quality. The TD error prioritization selects experiences with large TD errors, which typically correspond to critical state transitions with significant prediction discrepancies, thus accelerating value function updates and enhancing learning efficiency. At the same time, the collision avoidance quality prioritization reinforces successful evasive actions, preventing them from being overshadowed by a large volume of ordinary experiences. To further improve algorithm performance, this study integrates a COLREGs-compliant HRVO expert module, which guides early-stage policy exploration while ensuring compliance with regulatory constraints. The expert mechanism is incorporated into the Soft Actor-Critic (SAC) algorithm and validated in multi-vessel collision avoidance scenarios using maritime simulations. The experimental results demonstrate that, compared to traditional DRL baselines, the proposed algorithm reduces training time by 60.37% and, in comparison to rule-based algorithms, achieves shorter navigation times and lower rudder frequencies.

1. Introduction

1.1. Related Works

The efficacy of collision avoidance in multi-USV cooperative scenarios is critically challenged by dynamic obstacle behaviors, complex sea conditions, and stringent maritime navigation regulations [1,2]. Conventional rule-based methods frequently yield oscillatory or deadlock-prone trajectories, while purely data-driven approaches are characterized by slow convergence and inadequate regulatory compliance [3]. Furthermore, many reactive strategies are inadequate in achieving stable and efficient maneuvering, often resulting in path oscillations, velocity fluctuations, and an inherent trade-off between safety and efficiency. Specifically, conservative policies, while reducing the risk of collision, tend to cause detours and delays; conversely, aggressive behaviors may increase the likelihood of collisions [4]. These limitations underscore the urgent need to develop adaptive, regulation-aware obstacle avoidance strategies tailored to complex, multi-agent maritime environments.
The development of vessel tracking and control has evolved significantly in recent decades, transitioning from classical control strategies to learning-based approaches. For example, Proportional-Integral-Derivative (PID) controllers were among the first classical methods to be widely applied to vessel control [5]. Later, Utkin et al. proposed Sliding Mode Control (SMC), which improved robustness to model uncertainty and external disturbances through variable-structure control but suffered from chattering effects [6]. The Line of Sight (LOS) guidance method provided a geometric solution based on heading deviation, yet its performance degraded under strong lateral disturbances [7,8]. With the rise of prediction-based methods, Model Predictive Control (MPC) enabled more flexible trajectory tracking by optimizing actions over a finite horizon while incorporating constraints [9]. However, its high computational cost limits real-time applications. More recently, DRL has provided new approaches for the tracking and control of unmanned vessels: the agent continuously optimizes its decision-making policy through repeated interaction with the environment [10,11]. Among DRL algorithms, the Deep Deterministic Policy Gradient (DDPG) algorithm introduced deterministic actor-critic learning for continuous control tasks such as vessel navigation [12]. Later, Proximal Policy Optimization (PPO) addressed the training instability of DDPG by bounding policy updates, thus improving training stability and sample efficiency and making it suitable for complex marine scenarios [13]. Despite the remaining challenges, DRL offers new opportunities for enhancing autonomous collision avoidance in USVs, particularly in dynamic and uncertain maritime environments.
Typical obstacle avoidance algorithms for USVs include the Artificial Potential Field (APF), Dynamic Window Approach (DWA), and Velocity Obstacle (VO) methods. Khatib proposed the APF algorithm to generate real-time paths by simulating attractive and repulsive virtual forces [14]. The DWA evaluates candidate trajectories based on safety and efficiency metrics to make real-time avoidance decisions [15], but it lacks adaptability to dynamic obstacles. The VO method detects potential collisions by predicting velocity obstacles and selecting safe speeds accordingly [16,17]. However, it suffers from inefficient paths and poor compliance with maritime navigation rules, limiting its real-world applicability [18]. To address the oscillation issue in VO, the Reciprocal Velocity Obstacle (RVO) algorithm introduces mutual responsibility between agents, allowing cooperative avoidance among multiple vessels. This narrows the feasible velocity space and produces smoother, more realistic maneuvers [19]. These methods, while effective in structured scenarios, often fail to comply with the COLREGs, which are essential for safe and lawful maritime navigation [20,21].
While DRL has demonstrated significant potential in addressing complex decision-making problems and has been increasingly integrated with rule-based collision avoidance strategies, several fundamental limitations remain in existing hybrid approaches [22]. Methods combining artificial potential fields with DRL often suffer from local minima and oscillatory behaviors, limiting their robustness in dense multi-vessel encounters. Classical HRVO- or VO-based planners, although effective in generating collision-free velocities under geometric constraints, are typically employed as deterministic controllers and lack long-term optimization and adaptability to complex maritime scenarios. Prioritized experience replay-based DRL methods, such as PER-SAC or PER-DDPG, primarily rely on TD-error to improve learning efficiency, but they do not explicitly emphasize safety-critical avoidance behaviors, causing rare yet crucial evasive experiences to be underrepresented during training. In addition, many COLREGs-aware reinforcement learning approaches mainly incorporate maritime rules through reward shaping or penalty terms, which cannot fully prevent unsafe or non-compliant exploration during early training stages. As a result, despite notable progress, existing methods still face challenges in balancing safety, learning efficiency, and regulatory compliance in dynamic multi-agent maritime environments.

1.2. Novelty and Contributions

To address the limitations of existing hybrid collision avoidance methods in terms of convergence efficiency, safe exploration, and robustness in multi-vessel interactions, this paper proposes an expert-guided DPER-SAC framework. The core idea of this framework is to reconstruct the reinforcement learning process for complex maritime collision avoidance tasks by integrating structured expert priors and task-aware experience replay.
Unlike APF-DRL or reward-shaping methods that indirectly influence policy learning—often leading to decision oscillations or local deadlocks—this paper integrates COLREGs-compliant velocity preferences directly into an HRVO-based expert module. This approach imposes regulatory consistency constraints on the candidate velocity space during the action generation phase. Rather than relying on static reward penalties, we introduce encounter-dependent risk weights to dynamically calibrate COLREGs compliance based on real-time environmental risk and regulatory priorities. This mechanism allows the agent to flexibly adapt its maneuvers to diverse interaction scenarios, effectively mitigating the conflicts and sub-optimal behaviors inherent in rigid reward-shaping frameworks.
Traditional rule-based collision avoidance algorithms, such as VO and HRVO, are often implemented as fixed controllers. However, these methods can lead to overly conservative maneuvers and potential deadlock situations. To overcome these limitations, this paper incorporates HRVO as a decaying expert prior that provides safe and structured guidance during the initial stages of reinforcement learning. This mechanism effectively restricts the search space and avoids the inefficiencies of random exploration. As training progresses, the SAC framework’s long-term reward optimization takes precedence, gradually phasing out the expert’s influence. This transition enables the agent to autonomously develop and execute optimal collision avoidance strategies in complex multi-vessel interactions.
Furthermore, the proposed DPER mechanism effectively addresses the inherent limitations of traditional PER in complex maritime collision avoidance tasks by integrating both TD-error and collision avoidance quality into the prioritization framework. While conventional PER methods rely primarily on TD-error to identify high-learning-potential experiences, successful collision avoidance in multi-vessel scenarios is often a sparse event. Consequently, these critical avoidance behaviors are frequently overshadowed by a vast volume of redundant, non-critical data, leading to a training bias toward common behaviors at the expense of rare, high-value samples. To mitigate this, DPER extends the PER framework by introducing collision avoidance quality as a secondary priority metric. By explicitly prioritizing high-quality avoidance samples directly aligned with task objectives, DPER ensures that these pivotal experiences are sufficiently reinforced, thereby accelerating policy convergence and enhancing robustness.

2. Dynamics of Unmanned Surface Vessel

2.1. USV Model

In this study, a dual-coordinate system was employed to model the motion of a USV. The selection of an appropriate coordinate system ensures accurate representation of the vessel’s orientation and position. As illustrated in Figure 1, the Earth-fixed inertial reference frame ($O_E$-$X_E Y_E Z_E$) was used, with $X_E$ and $Y_E$ aligned to the geographical directions, while $Z_E$ is oriented along the Earth’s gravitational axis. The USV’s local (body-fixed) coordinate system ($O_B$-$X_B Y_B Z_B$) was defined as follows: $X_B$ corresponds to the forward direction, $Y_B$ points to starboard, and $Z_B$ aligns with the vertical direction.

2.2. Vessel Dynamics Characteristics

2.2.1. Dynamic Modeling of Motion Systems

This study draws on the principles of vessel motion mechanics and fluid dynamics to develop a three-degree-of-freedom (3-DOF) model for the USV, focusing on the surge, sway, and yaw motions of the vessel in the horizontal plane. The dynamic equations governing the system are as follows:
$$\dot{u} = f_u(u, v, r, \delta, \tau_{env}), \qquad \dot{v} = f_v(u, v, r), \qquad \dot{r} = f_r(r, \delta, \tau_{env})$$
where $u$, $v$, and $r$ denote the surge velocity, sway velocity, and yaw rate, and $\dot{u}$, $\dot{v}$, and $\dot{r}$ are their time derivatives; $\delta$ denotes the rudder angle and $\tau_{env}$ the effective environmental force. The functions $f_u$, $f_v$, and $f_r$ describe the respective dynamic variations.
Traditional Nomoto models often overlook critical factors such as the angle of attack and interactions with environmental forces. To overcome these limitations, modified models incorporating an environmental moment coefficient $\tau_{env,N}$ have been proposed to more accurately capture the impact of wind, waves, and currents on vessel dynamics. The improved Nomoto model can be expressed as follows:
$$T\dot{r} + r = K\delta + \tau_{env}$$
where $T$ is the time constant of the yaw-rate response and $K$ is the rudder gain coefficient.
In the horizontal motion of USVs, hydrodynamic forces in the Z direction significantly influence the vessel’s speed dynamics. During turning maneuvers, fluid–structure interactions generate nonlinear resistance forces that affect horizontal acceleration. This resistance is described by the following equation:
$$\dot{v} = \frac{-v - c\,u\,r}{d}$$
where $c$ is the velocity-dependent coupling coefficient that characterizes the interaction between the surge velocity and the yaw rate, and $d$ is the horizontal damping coefficient associated with hydrodynamic resistance during sway motion.

2.2.2. Kinematic Characteristics

Vessel kinematics describes the geometric properties of motion, including position, velocity, and heading, and the vessel’s state of motion evolves over time. The variables $\dot{x}$ and $\dot{y}$ represent the real-time rates of change of the vessel’s position in the Earth-fixed frame, obtained by projecting the body-frame velocities $u$ and $v$ through the heading angle $\psi$. The kinematic equations of the vessel are as follows:
$$\dot{x} = u\sin\psi + v\cos\psi, \qquad \dot{y} = u\cos\psi - v\sin\psi, \qquad \dot{\psi} = r$$
where $\dot{x}$ denotes the rate of movement in the eastward direction, $\dot{y}$ denotes the rate of movement in the northward direction, and $\dot{\psi}$ denotes the rate of change of the heading.
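As a concrete illustration, the short Python sketch below advances the Earth-fixed position using the kinematic equations above together with the improved Nomoto yaw model. It is a minimal sketch under assumed values: the step size, the coefficients, and the function name step_kinematics are illustrative and are not taken from this study.

```python
import numpy as np

def step_kinematics(x, y, psi, u, v, r, delta, tau_env,
                    dt=0.1, T=3.0, K=0.5):
    """One forward-Euler step of the USV kinematics plus Nomoto yaw dynamics.

    x, y  : Earth-fixed east/north position
    psi   : heading angle (rad); u, v, r : surge, sway, yaw rate
    delta : rudder angle; tau_env : environmental yaw disturbance
    dt, T, K are illustrative (assumed) values.
    """
    # Improved Nomoto model: T * r_dot + r = K * delta + tau_env
    r_dot = (K * delta + tau_env - r) / T

    # Body-frame velocities projected onto the Earth-fixed axes
    x_dot = u * np.sin(psi) + v * np.cos(psi)   # eastward rate
    y_dot = u * np.cos(psi) - v * np.sin(psi)   # northward rate
    psi_dot = r

    return (x + x_dot * dt,
            y + y_dot * dt,
            psi + psi_dot * dt,
            r + r_dot * dt)
```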

3. Proposed Deep Reinforcement Learning Method for Collision Avoidance

3.1. COLREGs-Aware Velocity Selection with HRVO

3.1.1. The COLREGs Rules

The COLREGs (Rules 13–15) outline standardized protocols for vessel encounters, categorizing them into three primary types: overtaking, head-on, and crossing situations. As depicted in Figure 2, this study utilizes these rules to design scenario-specific avoidance behaviors for autonomous surface vessels. These regulations are essential for preventing accidents in congested maritime environments, where multiple vessels often interact within limited spaces. By adhering to COLREGs, the USV can make decisions that are not only collision-free but also legally compliant, thereby enhancing both safety and efficiency in maritime operations.

3.1.2. Hybrid Reciprocal Velocity Obstacles Method

The VO model offers a geometric framework for predicting potential collisions by identifying the set of velocities that would result in contact based on relative motion. As illustrated in Figure 3, the VO space encompasses all velocity vectors that would lead to a collision with a moving obstacle. However, this approach does not consider the intentions of other agents, which can limit its effectiveness in dynamic, multi-agent environments. Although the VO model provides a simple and intuitive formulation, it places full avoidance responsibility on the ego agent, often leading to oscillatory or overly conservative behaviors.
To address the limitations of conventional VO methods, this study proposes an enhanced HRVO formulation that integrates both reciprocal coordination and maritime regulatory constraints. Departing from traditional methods with rigid assumptions, the HRVO framework introduces a dynamic responsibility allocation mechanism. This approach flexibly adjusts the avoidance responsibilities among interacting agents based on their observed intentions, combining the efficiency of reciprocal cooperation with the robustness required to manage non-cooperative behaviors. Furthermore, velocity selection is guided by COLREGs-based preferences, enabling the expert module to generate candidate actions that are not only collision-free but also compliant with navigational rules.
The HRVO algorithm combines the reciprocal principles of the VO method with speed-space avoidance techniques, offering a more efficient solution for multi-agent collision avoidance. It adopts a hybrid responsibility allocation mechanism to effectively manage obstacle avoidance among multiple intelligent agents in complex environments. The mathematical formulation of the HRVO algorithm is presented as follows:
$$HRVO_{USV|A}^{\tau} = \Big\{ v_{USV} \;\Big|\; \exists\, t \in [0, \tau]:\ \big\| D + w\,(v_{USV} - v_A)\,t \big\| < r_{USV} + r_A \Big\}$$
where $v_{USV} - v_A$ represents the relative velocity between the agents, $\tau$ denotes the time horizon for collision prediction, $D$ is the current distance between the two agents, $r_{USV}$ and $r_A$ are the radii of the safety regions of the USV and agent $A$, and $w \in [0, 1]$ is the dynamic responsibility weight. When $w = 0.5$, the model degenerates into the RVO algorithm, and when $w = 1$, it degenerates into the standard VO algorithm.
Furthermore, the HRVO framework enables velocity selection to be optimized according to specific task objectives and environmental constraints. A representative cost-based velocity selection model is defined as follows:
$$v_{new} = \underset{v \notin HRVO}{\arg\min} \left( \left\| v - v_{pref} \right\|^2 + \lambda\, C_{COLREGs}(v) + \gamma\, C_{dynamic}(v) \right)$$
where $v_{new}$ is the selected goal-directed velocity and $v_{pref}$ denotes the preferred velocity towards the target. $C_{COLREGs}(v)$ measures compliance with maritime collision regulations, and $C_{dynamic}(v)$ penalizes high accelerations or sharp maneuvers. The coefficients $\lambda$ and $\gamma$ control the trade-off between regulation adherence, safety, and motion smoothness.
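A minimal Python sketch of this cost-based selection is given below: candidate velocities are sampled around the preferred velocity, candidates falling inside any HRVO cone are discarded, and the remaining candidates are scored with the three cost terms. The cone representation, the sampling scheme, and the two cost callbacks are illustrative assumptions; the paper does not prescribe a particular optimization procedure.

```python
import numpy as np

def inside_hrvo(v, cone):
    """True if velocity v lies inside a cone given as apex, unit axis, half-angle."""
    rel = v - np.asarray(cone["apex"])
    norm = np.linalg.norm(rel)
    if norm < 1e-9:
        return True
    return np.dot(rel / norm, cone["axis"]) > np.cos(cone["half_angle"])

def select_velocity(v_pref, hrvo_cones, colregs_cost, dynamic_cost,
                    lam=1.0, gamma=0.5, n_samples=200, v_max=2.5):
    """Pick the collision-free velocity minimizing the HRVO cost function.

    colregs_cost / dynamic_cost are callables v -> penalty (assumed interfaces).
    """
    v_pref = np.asarray(v_pref, dtype=float)
    best_v, best_cost = None, np.inf
    for _ in range(n_samples):
        # Sample a candidate velocity in a box around the preferred velocity
        v = v_pref + np.random.uniform(-v_max, v_max, size=2)
        if any(inside_hrvo(v, cone) for cone in hrvo_cones):
            continue  # candidate would lead to a predicted collision
        cost = (np.linalg.norm(v - v_pref) ** 2
                + lam * colregs_cost(v)
                + gamma * dynamic_cost(v))
        if cost < best_cost:
            best_v, best_cost = v, cost
    # Fall back to the preferred velocity if no feasible candidate was found
    return v_pref if best_v is None else best_v
```

In the expert module, colregs_cost(v) would assign lower cost to starboard-side maneuvers in head-on and crossing encounters, reflecting the preferences implied by Rules 14 and 15.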
As illustrated in Figure 4, an adaptive HRVO model is employed for multi-agent collision avoidance. In Figure 4a, the USV is simplified as a reference point for efficiency, while target vessels are represented by elliptical envelopes. Compared to conventional circular models, the elliptical abstraction based on the semi-major axis $a$ and semi-minor axis $b$ provides a tighter fit to the vessel’s physical profile. By integrating real-time heading and dimensions, this approach defines a dynamically rotating safety buffer, ensuring that velocity-space transformations rigorously account for the obstacle’s spatial occupancy and orientation-dependent constraints.
Figure 4b illustrates the construction of velocity obstacles under the HRVO framework. In this representation, each obstacle (A, B, C) is modeled as a dynamic agent with known positions ($p_A$, $p_B$, $p_C$) and velocities ($v_A$, $v_B$, $v_C$), while the USV is assigned a current velocity $V_U$. For each obstacle, a velocity obstacle cone is constructed in velocity space, defining the set of velocities that would result in a collision within a finite time horizon.
Unlike traditional VO methods, the apex of each HRVO cone is not located at the origin but is shifted to a reciprocal velocity point that reflects the shared responsibility between the USV and the obstacle. This reciprocal velocity apex is computed as follows:
$$v = w\, v_{obstacle} + (1 - w)\, v_{USV}$$
where $v$ denotes the shifted velocity apex used for HRVO cone construction, $v_{USV}$ and $v_{obstacle}$ are the velocities of the USV and the obstacle, and $w \in [0, 1]$ is the dynamic responsibility weight. This weighted combination reflects the assumed reciprocal responsibility of both agents in executing avoidance maneuvers.
The collision avoidance velocity space is defined by aggregating the velocity obstacle cones of surrounding dynamic agents. Compared to RVO, HRVO adopts a hybrid approach by incorporating adaptive weighting and trajectory optimization. The integration of these mechanisms enables the framework to respond more effectively to non-cooperative behaviors and dynamic environmental disturbances. In practice, HRVO not only considers relative velocities but also optimizes trajectories by accounting for the evolving intentions of target vessels. When a target vessel demonstrates no avoidance intent, the responsibility weight w is adaptively increased, shifting the primary collision avoidance burden to the agent. Moreover, to accommodate vessel inertia, this weight adjustment occurs gradually rather than instantaneously. By ensuring that w evolves incrementally, the system facilitates smoother, more kinematically feasible maneuvers, avoiding abrupt course changes. This approach effectively balances immediate safety needs with the physical limitations of maritime motion, ensuring robust operational stability in complex scenarios.
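The apex shift and the gradual adjustment of the responsibility weight described above can be sketched as follows; the bounds, the adaptation rate, and the yielding test are illustrative assumptions.

```python
import numpy as np

def hrvo_apex(v_usv, v_obstacle, w):
    """Shifted cone apex: v = w * v_obstacle + (1 - w) * v_usv."""
    return w * np.asarray(v_obstacle) + (1.0 - w) * np.asarray(v_usv)

def update_responsibility(w, obstacle_is_yielding,
                          w_min=0.5, w_max=1.0, rate=0.05):
    """Incrementally shift avoidance responsibility toward the USV when the
    target vessel shows no avoidance intent, and relax it otherwise.

    The bounded, rate-limited update avoids abrupt jumps in w, keeping the
    resulting maneuvers kinematically feasible for an inertial vessel."""
    target = w_min if obstacle_is_yielding else w_max
    return float(w + np.clip(target - w, -rate, rate))
```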
While HRVO serves as a safety-oriented expert prior, it is prone to path sub-optimality and systemic deadlocks, particularly in dense environments. Moreover, HRVO’s focus on immediate safety frequently results in inefficient trajectories. These limitations were empirically observed during training, with vessels either encountering standstills in crowded scenarios or exhibiting excessive maneuvering in sparse ones. To address these issues, the SAC framework utilizes its stochastic exploration and experience replay mechanisms to intervene upon detecting deadlock, increasing the SAC control weight to shorten the duration of the deadlock state. Additionally, during subsequent autonomous control, the agent autonomously re-learns from these high TD-error experiences.
In summary, the HRVO framework provides a robust and interpretable foundation for motion planning in dynamic marine environments. By incorporating COLREGs-compliant velocity selection into the expert model, it supplies structured prior knowledge that not only guides the learning process but also alleviates the generation of rule-violating or overly conservative velocities in DRL.

3.2. Dual-Priority Experience Replay Mechanism

Experience replay is a critical component in reinforcement learning, contributing to both training stability and learning efficiency. However, conventional replay buffers typically employ uniform sampling strategies that disregard the varying significance of individual transitions. This oversight can impede the learning process, particularly in complex, multi-agent environments such as multi-USV scenarios.
To address this limitation, we propose a DPER mechanism that integrates two complementary prioritization strategies: one based on TD error and the other on obstacle avoidance quality. The TD error component emphasizes transitions with high learning potential by prioritizing those that exhibit significant discrepancies between predicted and actual rewards, thereby facilitating more effective value function updates. In parallel, the quality-based component assigns higher importance to transitions involving successful obstacle avoidance, which are evaluated through metrics such as minimum distance to obstacles, safe maneuvering speed, and trajectory alignment. During the sampling process, these two priority scores are combined using a weighted formulation to derive the final probability of selection. This dual-priority scheme ensures that both transitions critical for policy optimization and those demonstrating high-quality behavioral outcomes are systematically emphasized, thereby enhancing sample efficiency, promoting regulatory compliance, and improving the overall robustness of the learned policy.
To represent experience transitions, this study employs a tuple structure defined as $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the state, $a_t$ is the action, $r_t$ is the reward, and $s_{t+1}$ is the next state.
The TD error $\delta_t$, which measures the discrepancy between the current value estimate and its bootstrapped target, is calculated as
$$\delta_t = r_t + \gamma \max_{a'} Q_{target}(s_{t+1}, a') - Q_{current}(s_t, a_t)$$
where $\gamma \in [0, 1]$ denotes the discount factor, $Q_{target}$ represents the target Q-network, and $Q_{current}$ denotes the current Q-network.
The corresponding TD-based priority for sample i is given by
$$P_i = \left( \left| TD_i \right| + \epsilon \right)^{\alpha}$$
where $P_i$ represents the priority of the $i$-th experience, $\epsilon$ is a small positive constant that guarantees a nonzero priority, and $\alpha$ determines the level of prioritization.
In addition to TD error, obstacle avoidance performance is used as a complementary prioritization criterion. A Boolean flag $f_{success} \in \{0, 1\}$ indicates whether the navigation successfully avoided a collision. If successful, a normalized quality score $q_i \in [0, 1]$ is computed to quantify the effectiveness of the maneuver. The overall formulation is defined as follows:
$$q_i = \eta_1 d_{min} + \eta_2 v_{safe} + \eta_3 \theta_{align}$$
where $\eta_1$, $\eta_2$, and $\eta_3$ are the weighting coefficients of the individual indicators, $d_{min}$ is the minimum distance between the vessel and the obstacle during the avoidance maneuver, $v_{safe}$ is the safe speed maintained during avoidance, and $\theta_{align}$ is the degree of alignment between the post-avoidance heading and the target path.
To enable autonomous obstacle avoidance through expert-guided reinforcement learning, it is essential to gradually reduce the reliance on expert knowledge. To this end, a dynamic decay coefficient $\lambda$ is introduced to represent the diminishing influence of expert experience over time. The influence factor $b_t$ is initialized as $b_0$ and decays with the training step $t$, as defined by
$$b_t = b_0 \, \lambda^{t/T}$$
The dual-priority fusion $p_i^{dual}$ is formulated as
$$p_i^{dual} = \left( \left| TD_i \right| + \epsilon \right)^{\alpha} + f_i^{success} \, q_i \, b_t$$
During sampling, experience transitions are drawn with probabilities proportional to their priority levels. The priority exponent $\alpha$ controls the degree of prioritization by adjusting the sharpness of the sampling distribution. The corresponding formula is defined as
$$P_t = \frac{p_t^{\alpha}}{\sum_i p_i^{\alpha}}$$
where $P_t$ is the normalized sampling probability of transition $t$ and $p_t$ is its priority.
Since priority-based sampling biases the distribution of the training data, this bias is corrected with annealed importance-sampling weights. The corresponding adjustment formula is given as
$$w_t = \frac{\left( \dfrac{1}{N \, P_t} \right)^{\beta}}{\max_i w_i}$$
where $w_t$ is the importance-sampling weight for transition $t$, $N$ is the buffer size, and $\beta \in [0, 1]$ is the bias-correction factor. As training progresses, $\beta$ is annealed toward 1 to remove the bias introduced by prioritized replay.
After computing the importance-sampling weights, a prioritized mini-batch B is sampled from the replay buffer D based on the normalized sampling probabilities:
$$B = \mathrm{Sample}\big(D, P(i)\big), \qquad i \sim \mathrm{Discrete}(p_1, p_2, \ldots, p_N)$$
where $p_i$ denotes the normalized sampling probability of the $i$-th transition.
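The following Python sketch assembles these pieces into a minimal dual-priority replay buffer: transitions are stored with the fused priority and sampled with bias-correcting importance weights. The capacity, the decay schedule, and all hyperparameter values are illustrative assumptions, and a practical implementation would typically use a sum-tree rather than the linear scan shown here.

```python
import numpy as np

class DualPriorityReplay:
    """Minimal dual-priority replay buffer combining TD error and avoidance quality."""

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-4,
                 b0=1.0, lam=0.99, decay_T=1_000):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.b0, self.lam, self.decay_T = b0, lam, decay_T
        self.data, self.priorities = [], []
        self.step = 0

    def expert_decay(self):
        # b_t = b_0 * lambda^(t / T): diminishing weight of the quality term
        return self.b0 * self.lam ** (self.step / self.decay_T)

    def add(self, transition, td_error, success, quality):
        # Dual priority: (|TD| + eps)^alpha + f_success * q_i * b_t
        p = (abs(td_error) + self.eps) ** self.alpha \
            + float(success) * quality * self.expert_decay()
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)
        self.step += 1

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                 # P_t = p_t^alpha / sum_i p_i^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights, normalized by their maximum
        w = (len(self.data) * probs[idx]) ** (-beta)
        w /= w.max()
        batch = [self.data[i] for i in idx]
        return batch, idx, w

    def update_priorities(self, idx, td_errors):
        # After a learning step, refresh priorities from the new TD errors
        for i, td in zip(idx, td_errors):
            self.priorities[i] = (abs(td) + self.eps) ** self.alpha
```

During training, update_priorities would be called after each gradient step with the recomputed TD errors of the sampled batch, so that the buffer keeps tracking which transitions remain informative.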
A schematic overview of the proposed DPER mechanism is presented in Figure 5, which integrates TD error and obstacle avoidance quality into a unified experience pool. The agent stores experiences $(s_t, a_t, r_t, s_{t+1})$ in the dual-priority experience pool, from which it draws samples to update the parameters of the actor and critic networks. The priority of each experience is calculated from its TD error and the quality of successful obstacle avoidance, so that DPER embeds experience-filtering logic directly into SAC.
By maintaining a balance between expert-guided and suboptimal experiences, the mechanism mitigates the dominance of noisy samples while preserving critical learning signals. This targeted sampling strategy improves both the learning efficiency and policy robustness.

3.3. DPER-SAC Algorithm Architecture

3.3.1. Network Design and Agent Architecture

Recent advances in machine learning have significantly expanded its applicability across diverse domains. These developments are largely driven by the ability of algorithms to adapt to dynamic environments and continuously improve their learning capacity.
In this section, we propose the DPER-SAC algorithm, which integrates a novel dual-priority sampling mechanism into the SAC framework to improve learning efficiency. SAC is based on the maximum entropy reinforcement learning paradigm, aiming to maximize both the expected return and the entropy of the policy, thereby encouraging more comprehensive exploration.
As illustrated in Figure 6, the network architecture adopted in this study comprises three primary components: the actor network, the critic network, and the value network. The actor network, denoted as $\pi_\theta(\cdot)$, generates actions based on the current state using a parameterized stochastic policy. The critic network evaluates the selected actions by estimating the Q-value and the corresponding state value. Through interactions with the environment, the agent receives state feedback and transitions to the next state accordingly. The actor network architecture includes both the policy network and its corresponding target network. Based on the current state $s_t$, the actor outputs an action distribution as follows:
$$a_t \sim \pi_\theta(\cdot \mid s_t)$$
where $\pi_\theta$ represents the parameterized policy of the policy network.
Secondly, inspired by the Double DQN architecture, SAC incorporates a dual critic network to mitigate the overestimation of Q values. Specifically, during each update, the minimum value between two Q networks is selected to provide a more conservative target estimate. The critic module in SAC consists of an online critic network and a corresponding target critic network. The critic network outputs the Q value as follows:
$$Q(s_t, a_t)$$
The target critic network is used to stabilize the training process, and the output is
$$Q_{target}(s_t, a_t)$$
The objective function is defined as the minimization of losses, and its formula is as follows:
$$L(\pi_\theta, D) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \log \pi_\theta(a_t \mid s_t) \right]$$
Concurrently, the loss function of the network of critics is expressed as follows:
$$L(\phi, D) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim R}\left[ \Big( Q_\phi(s_t, a_t) - \big( r_t + \gamma V(s_{t+1}) \big) \Big)^2 \right]$$
where $R$ represents the data collected by past policies, $Q_\phi(s_t, a_t)$ denotes the critic network, $\gamma$ signifies the discount factor balancing current and future rewards, and $V(s_{t+1})$ is the state-value function that estimates the expected return from the next state $s_{t+1}$.
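For illustration, the PyTorch sketch below computes one critic update and one actor update in the standard SAC form, in which the policy loss also subtracts the clipped double-Q estimate and scales the log-probability by an entropy temperature. The network interfaces, the temperature value, and the variable names are assumptions, and termination masking and the DPER importance weights are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic1, critic2, target_value,
               alpha=0.2, gamma=0.99):
    """Critic and actor losses for one SAC update step (minimal sketch).

    batch        : dict of tensors with keys "s", "a", "r", "s_next"
    actor        : callable s -> (action, log_prob)
    critic1/2    : Q-networks, callable (s, a) -> Q(s, a)
    target_value : target state-value network, callable s -> V(s)
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic target: r_t + gamma * V(s_{t+1})
    with torch.no_grad():
        q_target = r + gamma * target_value(s_next).squeeze(-1)

    q1 = critic1(s, a).squeeze(-1)
    q2 = critic2(s, a).squeeze(-1)
    # With DPER, each squared error would additionally be scaled by w_t
    critic_loss = F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)

    # Actor loss: entropy-regularized objective using the minimum of the two critics
    new_a, log_prob = actor(s)
    q_min = torch.min(critic1(s, new_a), critic2(s, new_a)).squeeze(-1)
    actor_loss = (alpha * log_prob.squeeze(-1) - q_min).mean()

    return critic_loss, actor_loss
```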
By efficiently storing and leveraging past experiences, the agent can rapidly adapt and optimize its policy in dynamic environments. Moreover, the incorporation of TD error and obstacle avoidance quality as prioritization criteria provides more informative feedback during learning, thereby improving the agent’s practical performance.

3.3.2. Reward Function Design

1. The proximity reward and terminal reward
In reinforcement learning for intelligent agents, goal achievement is significantly influenced by the design of proximity and terminal rewards. The proximity reward incentivizes agents to take efficient actions when near the target, thereby improving navigation safety and efficiency. This reward is a function of the distance to the goal, denoted as d g o a l , and is designed to vary with distance, providing differentiated feedback. Such a progressive reward mechanism effectively enhances the exploration capability of the SAC algorithm, as shown in the following formula:
$$r_{proximity} = \begin{cases} 3\exp(-0.05\, d_{goal}), & d_{goal} < 15 \\ \exp(-0.05\, d_{goal}), & 15 \le d_{goal} < 40 \\ 0, & \text{otherwise} \end{cases}$$
In this formulation, the proximity reward is highest when the vessel is within 15 m of the target, encouraging optimal navigation near the goal. As the distance increases, the reward decreases exponentially, reflecting reduced proximity.
2. The heading and distance error reward
The track error is defined as the perpendicular distance between the vessel’s current position and the reference trajectory. Let the vessel’s current position be $(x, y)$, and let the endpoints of the reference track segment be $(x_1, y_1)$ and $(x_2, y_2)$. The track error is calculated using the following formula:
$$\text{Track Error} = \frac{\left| (y_2 - y_1)x + (x_1 - x_2)y + x_2 y_1 - x_1 y_2 \right|}{\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}}$$
The heading error is defined as the angular deviation between the vessel’s current heading and the desired target heading. Let $\theta$ denote the current heading angle and $\theta_{goal}$ the target heading angle, both measured in radians. The heading error is computed using the following formula:
$$\text{Heading Error} = \mathrm{clip}\left( \theta - \theta_{goal},\ -\pi,\ \pi \right)$$
Together, the heading and distance error terms form the reward $r_{heading}$, which enables fine-grained guidance during training, allowing the agent to align its trajectory with the intended path and to correct deviations dynamically during navigation.
3. The obstacle avoidance reward
Collision avoidance rewards in maritime autonomous navigation have traditionally relied on the minimum distance to obstacles. However, this static metric does not capture the dynamic nature of vessel motion. To enhance the agent’s situational awareness, we redefine the reward by incorporating two key maritime safety indices, the Distance to Closest Point of Approach (DCPA) and the Time to Closest Point of Approach (TCPA), thereby enabling a more proactive and multidimensional assessment of collision risks.
The minimum distance reward is calculated as follows:
$$r_{distance} = \begin{cases} -3\exp(-0.4\, d_{obs,min}), & d_{obs,min} < 4 \\ -\exp(-0.3\, d_{obs,min}), & 4 \le d_{obs,min} < 15 \\ 0, & \text{otherwise} \end{cases}$$
where $d_{obs,min}$ denotes the distance between the vessel and the nearest obstacle.
To integrate dynamic risk evaluation, the DCPA and TCPA are introduced to account for relative vessel motion. The DCPA measures the predicted minimum separation between the vessels at their closest point of approach, while the TCPA reflects the time remaining before that point is reached. The dynamic collision risk reward is then calculated as
$$r_{dynamic} = \begin{cases} -\exp(-\alpha \cdot DCPA), & \text{if } DCPA < \text{threshold} \\ -\exp(-\beta \cdot TCPA), & \text{if } TCPA < \text{threshold} \\ 0, & \text{otherwise} \end{cases}$$
where $\alpha$ and $\beta$ are constants that adjust the sensitivity of the reward to the DCPA and TCPA, respectively, and the thresholds are set based on the acceptable risk levels for the environment.
The final avoidance reward $r_{avoidance}$ is defined as the sum of $r_{distance}$ and $r_{dynamic}$. This combined approach enables the agent to make more informed decisions based on both the proximity to obstacles and the dynamic risk of collision, improving navigation safety in dynamic and complex maritime environments.
4. The COLREGs reward
The COLREGs are integrated into the DPER-SAC framework as fundamental constraints to ensure alignment with international maritime safety standards. By embedding these rules into the action selection process, the framework mitigates learning instability and prunes suboptimal trajectories. This integration ensures that the agent’s actions are consistently compliant with maritime navigation rules, thereby enhancing both safety and operational efficiency. The COLREGs reward values are defined as
$$r_{COLREGs} = \begin{cases} +5.0, & \text{if complying with COLREGs} \\ -5.0, & \text{if violating COLREGs} \\ 0, & \text{if no relevant encounter} \end{cases}$$
5. Encounter-dependent risk weighting
An encounter-dependent risk weighting scheme is incorporated to account for varying collision risks and the legal priorities defined by the COLREGs. By dynamically adjusting reward weights based on the encounter context, the framework scales the avoidance penalty according to the assessed risk. High-risk proximities lead to higher weights, reinforcing proactive avoidance actions, while lower-risk scenarios apply reduced weights to prevent unnecessary over-maneuvering.
The encounter weight $w_{encounter}$ is determined by the relative distance and velocity between the vessels and is calculated differently for the various encounter types. The specific calculations are as follows:
$$w_{encounter} = \begin{cases} \alpha_{head\text{-}on} \cdot \dfrac{1}{d_{rel}} \cdot \dfrac{1}{v_{rel}}, & \text{if head-on} \\ \alpha_{overtaking} \cdot \dfrac{1}{d_{rel}} \cdot \exp\left(-\beta_{overtaking}\, \theta_{rel}\right), & \text{if overtaking} \\ \alpha_{crossing} \cdot \dfrac{1}{d_{rel}} \cdot \left( 1 - \exp\left(-\gamma_{crossing}\, \theta_{rel}\right) \right), & \text{if crossing} \end{cases}$$
where $d_{rel}$ represents the relative distance, $v_{rel}$ is the relative velocity, and $\theta_{rel}$ is the relative heading angle. The coefficients $\alpha$, $\beta$, and $\gamma$ are risk-weighting factors specific to each encounter type.
Finally, the total reward $r_{total}$ is calculated by incorporating the encounter-dependent weight $w_{encounter}$ into the overall reward function, which combines the proximity, heading, avoidance, and COLREGs compliance terms:
$$r_{total} = w_{encounter} \left( r_{proximity} + r_{heading} + r_{avoidance} + r_{COLREGs} \right)$$
The encounter-dependent risk weighting mechanism ensures that the agent prioritizes high-risk encounters while maintaining optimal path planning and compliance with maritime regulations.
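To show how the individual terms combine, the Python sketch below evaluates the piecewise components and the encounter-dependent weight defined above. The coefficient values, the helper names, the DCPA/TCPA thresholds, and the fallback weight for non-encounter situations are illustrative assumptions intended only to mirror the formulas in this subsection.

```python
import numpy as np

def proximity_reward(d_goal):
    if d_goal < 15:
        return 3 * np.exp(-0.05 * d_goal)
    if d_goal < 40:
        return np.exp(-0.05 * d_goal)
    return 0.0

def distance_reward(d_obs_min):
    if d_obs_min < 4:
        return -3 * np.exp(-0.4 * d_obs_min)
    if d_obs_min < 15:
        return -np.exp(-0.3 * d_obs_min)
    return 0.0

def dynamic_risk_reward(dcpa, tcpa, a=0.5, b=0.05, dcpa_thr=10.0, tcpa_thr=60.0):
    # Sensitivities and thresholds are assumed values
    if dcpa < dcpa_thr:
        return -np.exp(-a * dcpa)
    if tcpa < tcpa_thr:
        return -np.exp(-b * tcpa)
    return 0.0

def colregs_reward(encounter, compliant):
    if encounter is None:
        return 0.0
    return 5.0 if compliant else -5.0

def encounter_weight(encounter, d_rel, v_rel, theta_rel,
                     a_ho=1.0, a_ov=1.0, a_cr=1.0, b_ov=0.5, g_cr=0.5):
    # Coefficients mirror the piecewise definition above (values assumed)
    if encounter == "head-on":
        return a_ho / (d_rel * v_rel)
    if encounter == "overtaking":
        return a_ov / d_rel * np.exp(-b_ov * theta_rel)
    if encounter == "crossing":
        return a_cr / d_rel * (1 - np.exp(-g_cr * theta_rel))
    return 1.0  # no relevant encounter: leave the summed reward unscaled

def total_reward(d_goal, r_heading, d_obs_min, dcpa, tcpa,
                 encounter, compliant, d_rel, v_rel, theta_rel):
    r_avoidance = distance_reward(d_obs_min) + dynamic_risk_reward(dcpa, tcpa)
    base = (proximity_reward(d_goal) + r_heading
            + r_avoidance + colregs_reward(encounter, compliant))
    return encounter_weight(encounter, d_rel, v_rel, theta_rel) * base
```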

3.4. Expert-Guided Deep Reinforcement Learning Strategy

In complex and dynamic maritime environments, the presence of obstacles poses significant challenges to the control and decision-making capabilities of intelligent systems. To address these challenges, this study proposes an expert-guided DRL mechanism for obstacle avoidance, which integrates expert knowledge into the DRL framework.
The core of the proposed Expert-Guided DRL mechanism lies in the dynamic integration of HRVO-based expert guidance with SAC-based autonomous learning. During the early training phase, the agent operates under high-exploration conditions, while safety is maintained through a heavily weighted local obstacle avoidance strategy derived from the HRVO expert. The HRVO system offers reliable safety constraints and effective collision avoidance behaviors, while the SAC agent explores optimal policies within these safe boundaries.
As training progresses and the agent’s policy matures, the influence of HRVO guidance is progressively attenuated via a decay schedule, enabling a smooth transition from expert-guided exploration to fully autonomous control. This adaptive integration is realized through a probabilistic factor that dynamically balances the contribution of HRVO expert actions and SAC-learned policies at each decision step. In high-risk scenarios or when novel obstacle configurations are encountered, the mechanism temporarily increases reliance on HRVO actions to preserve operational safety. The overall expert-guided reinforcement learning framework is depicted in Figure 7.
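One possible realization of this probabilistic blending is sketched below: at each decision step the HRVO expert action is executed with a probability that decays over training but is temporarily raised in high-risk situations. The decay schedule, the probability floor, and the function interfaces are assumptions rather than the exact scheme used in this study.

```python
import numpy as np

def expert_probability(step, p0=0.9, decay=2_000, p_min=0.05, high_risk=False):
    """Probability of executing the HRVO expert action at a given training step."""
    p = max(p_min, p0 * np.exp(-step / decay))   # smooth decay of expert influence
    # In high-risk or novel encounters, temporarily lean on the expert again
    return max(p, 0.8) if high_risk else p

def select_action(step, state, sac_policy, hrvo_expert, high_risk=False):
    """Blend expert guidance and the learned policy by probabilistic switching."""
    if np.random.rand() < expert_probability(step, high_risk=high_risk):
        return hrvo_expert(state)   # COLREGs-compliant, safety-oriented action
    return sac_policy(state)        # action from the learned SAC policy
```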
The computational cost of the proposed method increases with the number of vessels, $N$, primarily due to the growing interactions between vessels. The complexity is mainly driven by the HRVO calculations and the SAC updates. As the number of vessels increases, the pairwise interactions grow quadratically, raising the computational load of HRVO, while the expanding state space makes the state-action evaluations in the SAC algorithm more expensive. To address this issue, the proposed method dynamically adjusts the computational load through the decay of expert guidance and uses prioritized sampling in the dual-priority experience replay to avoid unnecessary state-action evaluations, concentrating on samples with higher learning value. Additionally, the method models only local interactions, limiting each calculation step to vessels within a specified range while disregarding more distant vessels that pose a lesser threat, thereby reducing the state space and the number of interactions and, consequently, the computational complexity.
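The local-interaction pruning can be illustrated with a simple range filter in Python; the sensing radius is an assumed value.

```python
import numpy as np

def nearby_vessels(own_pos, vessel_positions, sensing_radius=200.0):
    """Indices of vessels within the sensing radius of the own ship.

    Restricting the HRVO cones and the state inputs to these neighbors limits
    the per-step computation to the local neighborhood instead of all N vessels."""
    own = np.asarray(own_pos, dtype=float)
    return [i for i, p in enumerate(vessel_positions)
            if np.linalg.norm(np.asarray(p, dtype=float) - own) <= sensing_radius]
```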
This approach not only accelerates convergence and enhances sample efficiency but also ensures the agent inherits robust safety behaviors from the expert while developing autonomous decision-making capabilities. Compared to conventional DRL or purely expert-driven systems, the proposed method achieves a more effective trade-off between safety and exploration, significantly reducing oscillations and mitigating deadlock issues commonly observed in traditional methods.

4. Simulation Analysis

4.1. Environment and Parameters

This study employs the PyBullet (version 3.1.5) simulation platform and the Gym learning framework, integrated with PyTorch (version 1.8.1), to construct a stable baseline environment for DRL. Additionally, a three-dimensional visual simulation environment was developed to support the control and evaluation of USV operations. The simulation scenarios include various obstacle configurations designed to replicate real-world maritime encounters, enabling comprehensive assessments of different obstacle avoidance algorithms. This setup facilitates the evaluation of the effectiveness and robustness of learning-based obstacle avoidance strategies. The training parameters of the proposed DPER-SAC algorithm are summarized in Table 1.

4.2. Convergence Verification of Loss and Reward Functions

In DRL algorithms, the reward and loss functions are critical indicators of learning stability and convergence. The reward function provides immediate feedback to the agent, serving as the primary driver for behavior optimization. A steadily increasing and low-variance reward signal typically signifies effective policy learning and convergence toward optimal behavior.
As shown in Figure 8, the training reward trends for three algorithms—TD3 (blue), PPO (orange), and the proposed DPER-SAC (green)—are illustrated, with solid lines representing the average reward per episode. Compared to TD3 and PPO, the DPER-SAC algorithm exhibits markedly superior performance in terms of convergence speed and stability. Specifically, its reward curve rapidly rises and stabilizes at a higher level within the first 200 episodes, maintaining minimal variance throughout training. In contrast, although TD3 eventually reaches a relatively high reward, it experiences frequent sharp drops and requires more episodes to achieve stability. PPO displays a slower and more gradual improvement but converges at a lower reward level, reflecting stable yet less efficient learning behavior.
These comparative results demonstrate that the hybrid architecture of DPER-SAC effectively harnesses expert knowledge to guide early-stage exploration. This enables the agent to concentrate on high-value decision regions while maintaining strong safety constraints. The resulting balance between exploration and stability leads to faster convergence and improved final performance compared to baseline DRL algorithms. Traditional reinforcement learning algorithms for obstacle avoidance typically require 3000 episodes to converge to stable policies. In contrast, expert-guided algorithms using DPER converge within 1189 episodes, achieving a 60.37% reduction in both training time and computational load. This highlights the significant improvement in training efficiency offered by the DPER-SAC algorithm over traditional reinforcement learning methods.
In addition to reward trends, the convergence of the loss functions is essential for evaluating the learning dynamics of DRL algorithms. As shown in Figure 9, the training loss curves of the actor and critic networks in the DPER-SAC framework exhibit distinct convergence patterns. The critic loss (blue) initially fluctuates due to environmental adaptation and value estimation instability but gradually decreases and stabilizes by approximately the 180th episode. In contrast, the actor loss (orange) drops sharply to around −17 before gradually rising and converging by the 923rd episode. These patterns indicate a successful shift from early exploration to policy refinement.
The loss trends confirm the effectiveness of the DPER-SAC architecture in achieving stable policy optimization. The integration of HRVO priors accelerates early learning, narrows the search space, and guides the agent toward long-term strategy optimization. The convergence of both networks further demonstrates the method’s robustness in dynamic environments and its compliance with maritime navigation protocols.
In terms of collision avoidance quality, the DPER-SAC framework demonstrates a decisive advantage. As illustrated in Figure 10, DPER-SAC achieves a mean score of 0.842, representing a 76.8% improvement over the 0.476 achieved by PER-SAC. Notably, DPER-SAC attains peak performance in the early training stages and converges rapidly to a stable equilibrium. In contrast, the PER-SAC baseline exhibits substantial volatility throughout the learning process with a standard deviation of 0.15. The significantly narrower fluctuation range of DPER-SAC underscores the efficacy of the dual-priority mechanism in enhancing training stability and policy robustness.

4.3. Two-Vessel Encounter Scenarios in Simulation

To simulate typical maritime encounters, the vessel configurations used in the experiment are shown in Table 2. In the simulation tests, to assess the USV’s extreme collision avoidance capabilities in complex environments, the target vessels are modeled in a non-cooperative, predictable mode. Specifically, although some target vessels are required to give way according to COLREGs, this study assumes they take no avoidance actions and maintain their heading and speed. This configuration effectively simulates and validates the hybrid collision avoidance guidance capability of HRVO.
As shown in Figure 11, the simulation results reveal that the DPER-SAC algorithm significantly improves the USV’s collision avoidance performance compared to the VO algorithm. The speed comparison between the two algorithms, presented in Figure 11a, demonstrates that the DPER-SAC strategy yields smoother and more consistent velocity transitions.
This smoothness is crucial for ensuring the stability of the USV’s motion during the encounter. Additionally, the angular velocity comparison in Figure 11b further corroborates this finding, showing that the DPER-SAC strategy enables smoother and more controlled turns, whereas the VO algorithm results in more abrupt changes in angular velocity. In Figure 11c, the trajectory plot illustrates the USV’s path as it successfully avoids both obstacles (USV2 and USV3), while adhering to the reference path and reaching the goal. The plot shows that the DPER-SAC algorithm enables the USV to navigate safely and efficiently through the crossing scenario, providing superior collision avoidance and trajectory smoothness when compared to the VO algorithm.
As shown in Figure 12, the comparison of heading angle variations between the VO and DPER-SAC algorithms reveals that the latter offers a smoother and more stable response. The DPER-SAC algorithm adjusts the heading angle consistently, aligning with dynamic constraints, enhancing compliance with COLREGs, and ensuring the USV stays on course. In contrast, the VO algorithm exhibits heading angle oscillations, which can lead to inefficient maneuvers and safety risks. The DPER-SAC algorithm’s ability to balance safety and efficiency while adhering to COLREGs underscores its superior performance in autonomous collision avoidance in dynamic maritime environments.
To further assess the performance of different algorithms in practical simulations, Table 3 demonstrates the superiority of the DPER-SAC algorithm across several key metrics. The improved algorithm shows significantly better performance than the VO algorithm, reducing the navigation distance by 7.97 m and the navigation time by 23.4%, indicating faster execution and more efficient path planning. Additionally, the improved algorithm achieves an average speed of 1.92 m/s, which is 25.49% higher than the VO algorithm’s 1.53 m/s, enabling quicker and more effective obstacle avoidance. The average path curvature of the improved algorithm is reduced by 0.31 deg per step compared to the VO algorithm, indicating smoother and more stable navigation, demonstrating its ability to make faster and more accurate collision avoidance decisions.
To ensure statistical rigor, we conducted 200 independent test episodes for each algorithm. The resulting performance distributions are visualized through box plots using normalized metrics, as shown in Figure 13. These plots demonstrate that the DPER-SAC algorithm consistently outperforms the VO algorithm, exhibiting narrower interquartile ranges (IQRs) across most evaluated metrics, indicating enhanced stability and operational reliability. Specifically, DPER-SAC shows significantly lower median values for navigation distance, transit time, path curvature, and heading deviations, while maintaining higher median velocities compared to the VO baseline.
In conclusion, the DPER-SAC algorithm outperforms the VO algorithm in key metrics, such as navigation distance, time, and path efficiency. These improvements highlight its effectiveness in dynamic maritime environments, making it a more efficient and reliable solution for autonomous collision avoidance in two-vessel encounters.

4.4. Multi-Vessel Encounter Scenarios in Simulation

To evaluate the multi-vessel collision avoidance performance, this section presents simulations of scenarios based on the parameters in Table 4, which specify the initial positions, velocities, vessel radii, and safety distances for the main vessel and surrounding obstacle vessels. The collision avoidance scenario for multiple USVs also involves target vessels set in a non-cooperative, predictable mode, with no avoidance actions taken.
To assess the multi-vessel collision avoidance performance, Figure 14 presents the simulation results. In Figure 14a, the main USV performs a slight left turn to avoid the overtaking vessel, USV3, and then smoothly turns right to avoid the right-side approaching vessel, following COLREGs. The algorithm efficiently prioritizes obstacles, managing the avoidance of multiple vessels.
The complete multi-vessel avoidance trajectory in Figure 14 also includes the VO algorithm, represented by the gray solid line: it avoids individual obstacles but struggles when multiple vessels are involved, resulting in oscillations and large turning angles. In contrast, the improved algorithm successfully avoids all vessels, demonstrating superior performance in handling complex multi-vessel interactions.
Figure 14b illustrates the head-on encounter scenario, where the main USV adjusts its heading to avoid the approaching obstacle vessel, while minimizing deceleration to maintain smooth navigation. Figure 14c shows the left-side crossing encounter, where the algorithm computes an efficient heading, ensuring the main USV safely reaches its destination.
Finally, Figure 14d provides the complete multi-vessel avoidance trajectory, showing the overall navigation path of the main USV as it interacts with multiple vessels. The comparison of these scenarios with traditional VO methods clearly indicates that the proposed approach provides more stable and efficient maneuvers, ensuring safer navigation in complex, multi-vessel environments.
To evaluate the collision avoidance performance of the DPER-SAC algorithm in multi-vessel encounter scenarios, Figure 15 compares the speed profiles of both algorithms. The DPER-SAC algorithm achieves smoother and more consistent speed transitions than the VO algorithm, which exhibits abrupt speed changes. This smoother adjustment reduces unnecessary acceleration and deceleration, thus enhancing navigation stability.
In Figure 16, the angular velocity variations for both algorithms are compared. The DPER-SAC algorithm generates more controlled and gradual angular velocity adjustments, whereas the VO algorithm shows noticeable oscillations, which can lead to instability and inefficiency. This further highlights the advantage of the DPER-SAC algorithm in maintaining heading safety.
In Figure 17, the heading angle variations over time for both algorithms are compared. The DPER-SAC algorithm adjusts the heading more smoothly, ensuring the USV follows its intended path without sharp deviations. In contrast, the VO algorithm shows steep fluctuations in heading angle, potentially leading to unpredictable navigation trajectories and posing stability risks.
As shown in Table 5, a comparison is made between the DPER-SAC algorithm and the VO algorithm across key performance metrics. The DPER-SAC algorithm demonstrates superior performance in several areas, significantly improving path planning and execution efficiency. Compared to the VO algorithm, the DPER-SAC algorithm reduces the navigation distance by 25.63 m and shortens the execution time by 46.74%.
Moreover, the improved algorithm significantly enhances path smoothness, reducing both the average path curvature and the maximum single-step heading change by 65.3% and 74.5%, respectively, compared to the VO algorithm. In addition, the DPER-SAC algorithm demonstrates more efficient decision-making, cutting the avoidance state time steps by 106 steps, while achieving a 32.2% increase in average speed.
Figure 18 presents a statistical summary of 200 simulation trials. In terms of navigation efficiency, DPER-SAC demonstrates significant reductions in both navigation distance and transit time, with lower median values and narrower IQRs. Regarding trajectory smoothness, the algorithm shows substantially reduced variability in path curvature and heading deviations, leading to more fluid and consistent maneuvers.
All experiments in this study were conducted in a controlled simulation environment, and the proposed method has not yet been validated with real-world AIS data or sea trials. Although simulations may not fully capture the intricate complexities of the marine environment, the modeling of vessel dynamics and environmental disturbances in this study ensures that the learned collision avoidance logic is consistent with fundamental physical laws. This consistency provides a reliable basis for the potential transition of the algorithm to real-world applications. Future research will focus on validating the robustness and cross-domain generalization of the algorithm using large-scale AIS datasets and practical sea trial verification.

5. Discussion

While the DPER-SAC framework significantly enhances trajectory smoothness and navigation efficiency in simulations, its performance in high-density maritime environments and large fleets warrants further investigation. In scenarios involving more than 10 vessels, the computational complexity of multi-agent interactions increases, potentially leading to slower decision-making and reduced efficiency in path planning and collision avoidance. Future research should focus on addressing these challenges by optimizing the algorithm for larger fleets, exploring distributed learning approaches, and improving the transition from expert-guided to autonomous control. These advancements are critical for deploying the framework in real-world, large-scale autonomous vessel operations.

6. Conclusions

In this paper, we propose a novel expert-guided framework for dynamic obstacle avoidance in USVs, integrating HRVO-based expert strategies with DPER. This approach effectively combines the stability of traditional methods with the adaptability of reinforcement learning. The proposed method accelerates convergence, improves sample efficiency, and reduces training time by 60.37%, addressing key challenges in dynamic obstacle avoidance tasks.
In summary, the DPER-SAC framework successfully bridges the gap between rule-based stability and reinforcement learning adaptability. The simulation results confirm that by integrating HRVO expert priors with a dual-priority sampling mechanism, the system effectively resolves multi-agent conflicts while optimizing trajectory quality. Beyond performance metrics, this study underscores the practical value of hybrid architectures in mitigating ‘cold-start’ risks for individual autonomous vessels. Furthermore, it provides a scalable foundation for fleet management, ensuring that large-scale autonomous traffic remains both COLREGs-compliant and operationally flexible within high-density maritime corridors.

Author Contributions

Conceptualization, L.X. and Z.W.; methodology, L.X. and Z.W.; software, Z.W. and Z.H.; validation, Z.H., K.Y. and C.H.; formal analysis, Z.H., J.Q. and C.H.; investigation, L.X. and Z.W.; resources, L.X. and Z.H.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.H., K.Y. and C.H.; visualization, L.X., Z.W. and J.Q.; supervision, L.X. and Z.H.; project administration, L.X.; funding acquisition, L.X. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant Nos. 2022YFC2806600 and 2022YFC2806604) and the Jiangsu Marine Technology Innovation Center Industry Incubation Project (Grant No. MTIC-2023-RD-0002).

Data Availability Statement

The datasets presented in this article are not readily available because part of the data is still being used in ongoing research. Requests to access the datasets should be directed to zxw_1102@163.com.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Coordinate systems of the USV.
Figure 2. The COLREGs rules for vessel navigation at sea.
Figure 3. Principle of velocity obstacles.
Figure 4. HRVO avoidance principle. (a) Traditional VO position space. (b) Construction of velocity obstacle cones using the HRVO method.
Figure 5. Double priority experience replay mechanism.
Figure 6. Soft actor-critic network architecture.
Figure 7. Expert-guided deep reinforcement learning mechanism.
Figure 8. Reward performance comparison of three algorithms.
Figure 9. Variations in the reward functions in DPER actor-critic.
Figure 10. Comparison of collision avoidance quality between PER-SAC and DPER-SAC.
Figure 11. Simulation results for the crossing encounter scenario. (a) Speed comparison between the VO algorithm and the DPER-SAC algorithm. (b) Angular velocity comparison between the VO algorithm and the DPER-SAC algorithm. (c) Trajectory plot of the USV facing two crossing vessels.
Figure 12. Comparison of heading angle variations between the VO and DPER-SAC algorithms.
Figure 13. Performance comparison of DPER-SAC and VO algorithms for the dual-vessel system.
Figure 14. Simulation results for multi-vessel collision avoidance. (a) Right-side crossing encounter scenario. (b) Head-on encounter scenario. (c) Left-side encounter scenario. (d) Complete multi-vessel avoidance trajectory.
Figure 15. Speed comparison between the DPER-SAC and VO algorithms.
Figure 16. Angular velocity comparison between the DPER-SAC and VO algorithms.
Figure 17. Heading angle comparison between the DPER-SAC and VO algorithms.
Figure 18. Performance comparison of DPER-SAC and VO algorithms for the multi-vessel system.
Table 1. Parameters of the training process.

Parameter                          Value
Discount factor γ                  0.99
Actor network learning rate        0.0001
Critic network learning rate       0.0001
Replay memory size                 100,000
Batch size                         256
Action noise standard deviation    0.3
Soft update τ                      0.01
Training episodes                  1500
Hidden nodes                       128
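For readers who wish to reproduce the setup, the hyperparameters in Table 1 map directly onto a standard SAC training configuration, as in the minimal sketch below; the dataclass and field names are illustrative, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class SACTrainingConfig:
    gamma: float = 0.99            # discount factor γ
    actor_lr: float = 1e-4         # actor network learning rate
    critic_lr: float = 1e-4        # critic network learning rate
    replay_size: int = 100_000     # replay memory size
    batch_size: int = 256
    action_noise_std: float = 0.3  # action noise standard deviation
    tau: float = 0.01              # soft update coefficient τ
    episodes: int = 1500           # training episodes
    hidden_nodes: int = 128        # nodes per hidden layer
```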
Table 2. Parameters for two-vessel scenario testing.

Vessel      Initial Position (m)   Initial Velocity (m/s)   Vessel Radius (m)   Safety Distance (m)
Main USV    (−80, −60)             1.80                     5.0                 -
USV2        (−57.6, 18.3)          1.5                      7.5                 10.0
USV3        (59.7, −9.8)           2.0                      5.0                 8.0
Table 3. Comparison of dual-ship obstacle avoidance using the DPER-SAC and VO algorithms.

Algorithm    Average Navigation Distance (m)    Average Navigation Time (s)    Average Speed (m/s)    Average Path Curvature (deg/step)    Maximum Single-Step Heading Change (rad)    Minimum Avoidance State Time Steps
DPER-SAC     189.74                             97.69                          1.92                   0.53                                 0.0574                                      38 (20.0%)
VO           197.56                             127.43                         1.53                   0.87                                 0.1432                                      145 (73.4%)
Table 4. Parameters for multi-vessel scenario testing.

Vessel      Initial Position (m)   Initial Velocity (m/s)   Vessel Radius (m)   Safety Distance (m)
Main USV    (−80.0, −60.0)         1.80                     5.0                 -
USV2        (−9.2, −42.6)          1.5                      7.5                 10.0
USV3        (−101.0, −96.7)        1.6                      5.0                 8.0
USV4        (49.8, 47.3)           1.5                      6.0                 9.0
USV5        (−28.1, −64.2)         0.85                     5.5                 8.0
Table 5. Comparison of multi-vessel obstacle avoidance using the DPER-SAC and VO algorithms.

Algorithm    Average Navigation Distance (m)    Average Navigation Time (s)    Average Speed (m/s)    Average Path Curvature (deg/step)    Maximum Single-Step Heading Change (rad)    Minimum Avoidance State Time Steps
DPER-SAC     195.73                             98.47                          1.89                   1.23                                 0.0556                                      76 (41.9%)
VO           221.36                             145.21                         1.43                   3.54                                 0.2183                                      182 (76.7%)