Article

Robust Collision Avoidance for ASVs Using Deep Reinforcement Learning with Sim2Real Methods in Static Obstacle Environments

1
Department of Naval Architecture and Ocean Systems Engineering, Korea Maritime & Ocean University, Busan 49112, Republic of Korea
2
Maritime Digital Transformation Research Center, Korea Research Institute of Ships and Ocean Engineering, Daejeon 34103, Republic of Korea
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(9), 1727; https://doi.org/10.3390/jmse13091727
Submission received: 21 July 2025 / Revised: 28 August 2025 / Accepted: 4 September 2025 / Published: 8 September 2025

Abstract

When a policy trained with deep reinforcement learning (DRL) in simulation is deployed in the real world, its performance often deteriorates due to the Sim2Real gap. This study addresses this problem for Autonomous Surface Vessels (ASVs) by developing a robust collision-avoidance framework. We integrate a MATLAB-based ship dynamics model with ROS and Gazebo, and employ the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. To enhance robustness and generalization, we combine domain randomization and curriculum learning. As a result, the trained agent consistently achieved a high success rate of over 90% in unseen environments, significantly outperforming a baseline TD3 agent and a conventional PID controller. This demonstrates that the proposed Sim2Real methods are highly effective for creating robust control policies for ASVs. For future work, we plan to validate the learned policy through real-world experiments.

1. Introduction

Autonomous Surface Vessels (ASVs) are gaining attention as a key technology to improve safety and efficiency in maritime operations. As human error remains a primary factor in many vessel collisions, the development of intelligent collision-avoidance systems has become increasingly essential to reduce accidents [1]. This need is especially pronounced in confined waterways, such as harbors and narrow channels, where vessels have restricted maneuverability and various obstacles (floating debris, vessels, etc.) increase collision risk [2]. Therefore, enabling a vessel to perceive its surroundings and navigate safely in such complex environments is critical.
Traditionally, collision avoidance for ASVs has been addressed through various methodologies, which can be categorized into geometric and kinematic path planning, potential field methods, and model predictive control (MPC). In the realm of path planning, Xu et al. developed a hybrid approach that combines an enhanced A* path planner with a dynamic window approach (DWA) [3]. The proposed method enables robust collision avoidance for Unmanned Surface Vehicles (USVs) against both static and dynamic obstacles within an integrated framework that modulates the vessel’s velocity in compliance with the International Regulations for Preventing Collisions at Sea (COLREGs). Simulation results indicated that this method produced more efficient and safer trajectories than the baseline DWA. Similarly, Cao et al. developed an improved RRT method that pruned redundant branches, smoothed paths using B-splines, and maintained a safety corridor around obstacles [4]. Their planner demonstrated real-time performance and close agreement between planned paths and actual ship trajectories in test reaches of the Yangtze River. In potential field methods, Jadhav et al. developed a modified sink-vortex artificial potential field that generates navigation paths using harmonic functions [5]. The proposed algorithm outperformed the velocity obstacle (VO) method on every evaluated performance metric, including success rate, computational efficiency, path-tracking accuracy, and time to goal. Finally, Johansen et al. developed a scenario-based MPC framework that parameterizes collision-avoidance behaviors by adjusting guidance course angles and propulsion commands [6]. Their method evaluates COLREGs compliance and assesses collision hazards over a finite prediction horizon by simulating control behaviors, thereby explicitly avoiding complex numerical optimization, and has demonstrated effectiveness with multiple dynamic obstacles under uncertainty.
However, because marine vessels have nonlinear dynamics, MPC often relies on linear approximations or discrete scenario evaluations, which can limit control accuracy and robustness, particularly when precise and continuous actions over complex trajectories are required. To address this problem, Li et al. integrated an Enhanced Artificial Potential Field (EAPF) with a nonlinear model predictive control (NMPC) algorithm [7]. In this framework, the EAPF generates a global path for collision avoidance, while the NMPC optimizes rudder commands for path tracking. A notable feature of this approach is the direct integration of a high-fidelity, three-Degrees-of-Freedom (3-DOF) Maneuvering Modeling Group (MMG) ship model into the controller [8]. With the model discretized, the NMPC achieves highly accurate vessel state predictions and implements real-time adjustments. Evaluations in multi-ship scenarios demonstrate that the method produces much smoother trajectories and mitigates heading oscillations. However, such traditional methods often depend on accurate environmental models and struggle to adapt to complex and unpredictable dynamic environments.
Because of these limitations, deep reinforcement learning (DRL), which has recently been validated for unmanned systems such as robots and drones, is being actively applied to ASV collision avoidance. Shen et al. developed a DQN-based approach for automatic collision avoidance among multiple ships in congested waterways [9]. This method uses twelve virtual detection lines and incorporates expert knowledge (ship domain, “bumper” distances, predicted danger zones) into the state and reward design. Simulation results showed that this method can effectively avoid collisions in multi-ship encounter scenarios. However, traditional DQN algorithms tend to overestimate state–action values and may not be well-suited for continuous action spaces. To address this problem, other algorithms such as Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO) have been applied. Yuan et al. proposed a DDPG-based algorithm using a Gated Recurrent Unit (GRU) network, which contains fewer parameters than LSTM networks and thus helps reduce training time [10]. Their reward function simultaneously accounts for the target point, running time, and turning angle. Hua et al. developed an improved DDPG algorithm with several key enhancements that expedite convergence and improve training efficiency [11]. This method includes a continuous, COLREGs-integrated reward function, dynamic region restrictions for nearby obstacles, and a dual-intelligence agent with prioritized experience replay within a constrained action space. Meyer et al. applied PPO to the dual-objective problem of path following and COLREGs compliance [12]. A key finding from their work was that a DRL agent, initially trained in a purely synthetic environment featuring simple obstacle geometries, could successfully generalize its learned policy to unseen scenarios. Their learning architecture embeds high-fidelity terrain models and actual vessel traffic data derived from the Trondheim Fjord. Guan et al. proposed an improved PPO algorithm by integrating Generalized Advantage Estimation (GAE) into the loss function, which allows the baseline in the PPO algorithm to be self-adjusted [13]. This modification led to a convergence rate approximately 25% faster than that of the traditional PPO algorithm. Wu et al. developed a risk-aware framework for mapless USV navigation in congested environments [14]. Instead of maximizing the conventional expected return, their approach optimizes a risk-sensitive objective using Conditional Value at Risk (CVaR). This method explicitly trains the agent to avoid low-probability, high-consequence events, such as collisions, by focusing on the worst-case outcomes of its actions. The resulting policy not only navigates efficiently using raw sensor data but also exhibits more cautious and safer behaviors in uncertain situations, providing a robust solution for safety-critical maritime operations.
However, when a policy learned in a simulation environment is applied to a real environment, a Sim2Real (simulation-to-real) problem arises in which performance is significantly reduced due to uncertainty in the dynamic model, sensor noise, and environmental disturbances [15]. Therefore, addressing Sim2Real is very important for ensuring the robust real-world performance of ASVs. To address this gap, Song et al. proposed an approach based on system identification to enhance the simulation’s fidelity [16]. They collected operational data from a real USV to estimate the vessel’s dynamic characteristics and, based on these, built a high-fidelity simulation environment nearly identical to the real world. They demonstrated that a DRL agent trained in this environment exhibited high Sim2Real performance without additional training, as the gap with the real hardware was minimal. This shows that system identification is a highly effective Sim2Real solution when a real-world system exists. Li et al. proposed a cross-domain transfer learning approach that involved pre-training a DRL agent for obstacle avoidance on an Unmanned Ground Vehicle (UGV) [17]. The principle was that the fundamental concepts of obstacle avoidance could be learned more rapidly in the UGV simulation and then effectively adapted to the USV. This cross-domain approach showed significant results, reducing training time and improving collision-avoidance performance compared to agents trained solely in the USV simulation. Wang et al. applied domain randomization by varying parameters at each training episode, including LiDAR sensor noise, control signal latency, and the shapes and sizes of obstacles [18]. This was combined with curriculum learning, where the agent’s training environment progressively increased in difficulty, starting from a simple open space and advancing to complex scenarios with dense obstacle configurations. This strategy ensured the policy’s robustness against real-world uncertainties, enabling it to be successfully deployed on a physical USV directly from simulation. Batista et al. provided crucial empirical evidence for Sim2Real by conducting field tests on a physical USV [19]. Their study involved deploying a TD3-based policy, trained in a simulation with domain randomization, onto a physical vessel to perform waypoint-tracking missions in a real harbor. The field tests demonstrated that the policy could be deployed directly from simulation, successfully handling significant real-world factors like wind and unmodeled currents. This work not only validates the practical robustness of the Sim2Real approach but also provides a valuable methodology for the in-field evaluation of learned control policies for maritime operations.
In this study, we propose a DRL-based collision-avoidance policy for ASVs that applies Sim2Real methods to enhance its real-world robustness. Rather than employing the conventional two-stage approach of path planning and path tracking, our study develops a direct, end-to-end framework that enables real-time collision avoidance based solely on sensor data. To achieve this, we employ the TD3 algorithm within a co-simulation environment that integrates a MATLAB-based 6-DOF marine dynamics model with the Gazebo simulator. Our approach is distinguished by the synergistic application of two key Sim2Real strategies: domain randomization and curriculum learning. The primary contribution of this paper is to demonstrate that this integrated Sim2Real approach yields a significantly more robust policy compared to a baseline agent, as validated by a comparative analysis of success rates and safety metrics in test simulation environments.
The remainder of the paper is organized as follows. Section 2 describes the simulation environment and the overall learning architecture. Section 3 describes the Markov Decision Process and the applied Sim2Real methods. Section 4 describes the hyperparameter settings, simulation scenarios, and results. Finally, Section 5 presents the conclusions.

2. Simulation Environment and Learning Architecture

2.1. Simulation Environment

We integrated a high-fidelity MATLAB (R2023b)-based ship dynamics model (Fossen’s Marine Systems Simulator (MSS) [20]) with the Gazebo (Gazebo Garden) simulator, interfaced through the Robot Operating System (ROS Noetic). Specifically, we employed the dynamics model of the Otter, the smallest catamaran-type vessel in the MSS toolbox. This model was chosen because its characteristics closely match those of the physical ASV we plan to use in future real-world experiments. Gazebo was used to simulate the vessel’s onboard sensors, providing realistic LiDAR scan data. The simulation operates in a closed loop. First, control commands are sent to the MATLAB-based MSS. Then, MSS computes the resulting changes in the vessel’s 6-DOF (Degrees of Freedom) state (i.e., position and orientation) and transmits the updated state to Gazebo. Finally, Gazebo generates the corresponding sensor data based on the vessel’s new state.
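To make the structure of this closed loop concrete, the sketch below illustrates it in Python. The dynamics and LiDAR functions are hypothetical placeholders standing in for the MSS process and the Gazebo sensor plugin (which in our setup exchange data over ROS topics); the toy coefficients inside them are illustrative only and are not taken from the Otter model.

```python
import numpy as np

# Hypothetical placeholders for the two simulators; in the actual framework these
# calls are ROS topic exchanges with the MATLAB/MSS process and the Gazebo plugin.
def mss_step(state, left_rps, right_rps, dt):
    """Placeholder planar dynamics update (the real model is Fossen's 6-DOF MSS Otter)."""
    x, y, psi = state
    u = 0.01 * (left_rps + right_rps)          # toy surge speed from mean RPS
    r = 0.002 * (right_rps - left_rps)         # toy yaw rate from differential RPS
    return np.array([x + u * np.cos(psi) * dt,
                     y + u * np.sin(psi) * dt,
                     psi + r * dt])

def gazebo_lidar(state):
    """Placeholder for the Gazebo-simulated 360-degree LiDAR scan."""
    return np.full(360, 50.0)                  # max-range returns in open water

# Closed loop: command -> MSS dynamics -> Gazebo sensing -> next command
state = np.zeros(3)                            # [x, y, heading] (planar subset)
for step in range(100):
    left_rps, right_rps = 60.0, 60.0           # example control command
    state = mss_step(state, left_rps, right_rps, dt=0.1)
    scan = gazebo_lidar(state)                 # sensor data fed to the DRL agent
```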

2.2. Learning Architecture

The overall learning architecture for our DRL-based collision-avoidance system is a closed-loop interaction between the agent (ASV) and the simulation environment. As illustrated in Figure 1, the architecture consists of three main components.
On the left is a prioritized experience replay (PER) buffer [21] that stores transitions $(s_t, a_t, r_t, s_{t+1})$, each assigned a priority value. The PER buffer adjusts the sampling probability based on the importance of an experience, allowing the TD3 agent to learn more efficiently and converge faster by replaying high-priority transitions more frequently.
At the center is the TD3 agent, which comprises an Actor network and two Critic networks. The Actor network generates continuous steering and throttle commands based on the current state, while the twin Critic networks estimate the Q-values of state–action pairs to mitigate overestimation bias [22]. At each time step, the agent executes an action, observes the resulting reward and next state, and stores this transition in the PER buffer. The networks are subsequently updated by sampling batches of high-priority experiences from the buffer, which focuses the training process on more significant transitions.
On the right, the simulation environment provides the agent with states and rewards. As detailed in Section 2.1, this environment is a co-simulation framework integrating the MATLAB-based MSS for high-fidelity dynamics and Gazebo for sensor data.
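As a brief illustration of how the twin critics and prioritized sampling fit together, the NumPy sketch below computes the TD3 target with target-policy smoothing and clipped double-Q, and samples minibatch indices with probability proportional to priority. The tiny analytic stand-ins for the target networks, the priority exponent, and the omission of importance-sampling weights are simplifications; only the target policy noise (0.02), noise clip (0.05), and discount factor (0.99) correspond to Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Analytic stand-ins for the target actor/critics; the real agent uses neural
# networks with delayed target copies.
def target_actor(s):      return np.tanh(s.mean(axis=1, keepdims=True))
def target_critic1(s, a): return -(a ** 2).sum(axis=1)
def target_critic2(s, a): return -(1.1 * a ** 2).sum(axis=1)

def td3_targets(batch, gamma=0.99, noise_sigma=0.02, noise_clip=0.05):
    """Clipped double-Q target with target-policy smoothing (core TD3 idea)."""
    s, a, r, s_next, done = batch
    noise = np.clip(noise_sigma * rng.standard_normal(a.shape), -noise_clip, noise_clip)
    a_next = np.clip(target_actor(s_next) + noise, -1.0, 1.0)
    q_next = np.minimum(target_critic1(s_next, a_next), target_critic2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next

def sample_indices(priorities, batch_size, alpha=0.6):
    """PER-style sampling: probability proportional to priority**alpha."""
    p = priorities ** alpha
    return rng.choice(len(priorities), size=batch_size, p=p / p.sum())

# Example with random data (state dim 51, action dim 2, batch of 32 transitions)
s = rng.standard_normal((32, 51)); a = rng.uniform(-1, 1, (32, 2))
r = rng.standard_normal(32); d = np.zeros(32)
y = td3_targets((s, a, r, s, d))
idx = sample_indices(np.abs(y) + 1e-6, batch_size=8)   # priorities from |TD target|
```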

3. Markov Decision Process and Sim2Real Methods

3.1. Markov Decision Process

To enable the application of deep reinforcement learning, we model the ASV collision-avoidance task as a Markov Decision Process (MDP).

3.1.1. State Space

The state space $s_t$, which the agent uses to decide its action, is composed of six elements as follows:
$s_t = \left[\, s_t^{scan},\; s_t^{dist},\; s_t^{angle},\; s_t^{prev\_act},\; s_t^{rate},\; s_t^{rps} \,\right]$  (1)
The primary sensory input for the agent is a 360-degree LiDAR scan, providing distance measurements to surrounding obstacles. However, utilizing the raw 360-point data as a state presents a significant challenge due to its high dimensionality. This issue, commonly known as the curse of dimensionality, was found to impede the DRL agent’s learning convergence. Therefore, to facilitate efficient and stable training, we process the raw scan to formulate a compact yet salient state representation, denoted as $s_t^{scan}$.
$s_t^{scan} = \left[\, d_t^{1},\; d_t^{2},\; \ldots,\; d_t^{39} \,\right]$  (2)
$d_t^{i}$ represents a normalized distance value obtained from LiDAR measurements. The 360° field of view is divided into 39 angular sectors, with 6° intervals in the more significant forward region and 20° intervals in the rear. For each sector, the minimum measured distance to an obstacle is extracted and normalized by dividing it by the maximum sensing range of the LiDAR. As a result, each value lies in the range [0, 1], representing the relative distance to the nearest obstacle in that direction, as shown in Figure 2a.
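One way to implement this sectorization is sketched below. The 30 × 6° forward / 9 × 20° rear split follows from the stated interval sizes and the 39-sector total; the beam ordering (one beam per degree, 0° at the bow, counterclockwise) and the 50 m maximum range are our assumptions.

```python
import numpy as np

def lidar_to_sectors(scan_deg, max_range=50.0):
    """Compress a 360-beam scan into 39 normalized sector minima:
    30 forward sectors of 6 degrees and 9 rear sectors of 20 degrees."""
    scan = np.clip(np.asarray(scan_deg, dtype=float), 0.0, max_range)
    # shift so the array starts at -90 degrees; the first 180 beams cover the forward half
    shifted = np.roll(scan, 90)
    forward = shifted[:180].reshape(30, 6)     # 30 sectors x 6 degrees
    rear = shifted[180:].reshape(9, 20)        # 9 sectors x 20 degrees
    sectors = np.concatenate([forward.min(axis=1), rear.min(axis=1)])
    return sectors / max_range                 # each value in [0, 1]

s_scan = lidar_to_sectors(np.full(360, 50.0))  # open water -> all values equal 1.0
```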
The second component, $s_t^{dist}$, represents the relative distance from the ASV to the goal and is defined as
$s_t^{dist} = \operatorname{clip}\!\left( \dfrac{dist_I^G - dist_t^G}{dist_I^G},\; -1,\; 1 \right)$  (3)
$dist_I^G$ denotes the initial distance to the goal at the beginning of the episode, while $dist_t^G$ represents the current distance to the goal, as shown in Figure 2b.
The third component, $s_t^{angle}$, represents the angle to the goal point and is defined as
$s_t^{angle} = \dfrac{\theta_t^G - \theta_t^{ASV}}{\pi}$  (4)
$\theta_t^G$ and $\theta_t^{ASV}$ represent the angle to the goal and the ASV’s heading, respectively, as shown in Figure 2b.
The fourth component, $s_t^{prev\_act}$, represents the agent’s previous actions. The rationale for including this information in the state is to promote temporally smooth and stable control policies.
$s_t^{prev\_act} = \left[\, a_{t-1}^{speed},\; a_{t-1}^{steer},\; a_{t-2}^{steer},\; a_{t-3}^{steer},\; a_{t-4}^{steer},\; a_{t-5}^{steer} \,\right]$  (5)
Because the steering action is particularly sensitive to chattering, the state includes the past five steering commands but only the previous speed command.
The fifth component, $s_t^{rate}$, represents the ASV’s speed and angular velocity, and is defined as
$s_t^{rate} = \left[\, v_t^{u},\; v_t^{w} \,\right]$  (6)
$v_t^{u}$ is the current surge velocity normalized by the ASV’s maximum surge speed, and $v_t^{w}$ is the current yaw rate normalized by the ASV’s maximum yaw rate.
Finally, $s_t^{rps}$ represents the current propeller RPS (Revolutions Per Second) and is defined as
$s_t^{rps} = \left[\, rps_t^{left},\; rps_t^{right} \,\right]$  (7)
$rps_t^{left}$ and $rps_t^{right}$ are the RPS of the left and right propellers, respectively, each normalized by the ASV’s maximum RPS. Due to actuator dynamics, a response delay exists between a control command and the resulting propeller RPS. Furthermore, uncertainties in actuation and variations in hydrodynamic loads mean that the same command can produce different RPS values depending on the vessel’s state. To address this, we include the propeller RPS as part of the state representation. This allows the agent to directly observe the real-time output of its actuators, enabling it to learn a more robust control policy.
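A minimal sketch of assembling the full 51-dimensional state vector from these six components is given below. The normalization constants (maximum surge speed, yaw rate, and RPS) are placeholder values, and the angle wrapping via arctan2 is our addition to keep $s_t^{angle}$ within [−1, 1].

```python
import numpy as np

def build_state(sectors, dist_init, dist_t, goal_angle, heading,
                prev_speed, prev_steers, surge, yaw_rate, rps_lr,
                max_surge=3.0, max_yaw_rate=1.0, max_rps=100.0):
    """Assemble s_t from the six components defined above (39 + 1 + 1 + 6 + 2 + 2 = 51 values).
    max_surge and max_yaw_rate are placeholder normalization constants."""
    s_dist = np.clip((dist_init - dist_t) / dist_init, -1.0, 1.0)
    s_angle = np.arctan2(np.sin(goal_angle - heading),
                         np.cos(goal_angle - heading)) / np.pi   # wrapped to [-1, 1]
    s_prev = [prev_speed] + list(prev_steers)    # 1 speed + 5 steering commands
    s_rate = [surge / max_surge, yaw_rate / max_yaw_rate]
    s_rps = [rps_lr[0] / max_rps, rps_lr[1] / max_rps]
    return np.concatenate([sectors, [s_dist, s_angle], s_prev, s_rate, s_rps])

state = build_state(np.ones(39), 100.0, 60.0, 0.3, 0.1,
                    0.5, [0.0] * 5, 1.5, 0.05, (60.0, 60.0))
print(state.shape)   # (51,)
```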

3.1.2. Action Space

The action space $a_t$ is the set of all possible actions the agent can take in a given state and consists of two continuous actions as follows:
$a_t = \left[\, a_t^{speed},\; a_t^{steer} \,\right]$  (8)
$a_t^{speed}$ and $a_t^{steer}$ represent the speed command and the steering command of the ASV, respectively. These actions are converted into propeller RPS in the MATLAB-based dynamics model following Equation (9), and the result is saturated between −100 RPS and 100 RPS. To account for the response delay of a real propeller, the commanded rotational speed is passed through a first-order actuator-dynamics filter before being applied, ensuring a gradual change in propeller speed.
$\begin{aligned} rps^{left} &= 100 \times a_t^{speed} - 0.5 \times a_t^{steer} \\ rps^{right} &= 100 \times a_t^{speed} + 0.5 \times a_t^{steer} \end{aligned}$  (9)
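The sketch below reproduces this command-to-RPS pipeline: the mixing of Equation (9), saturation to ±100 RPS, and a first-order lag approximating the propeller response. The time constant and time step are placeholder values, not parameters reported in the paper.

```python
import numpy as np

class PropellerActuator:
    """Commanded RPS from Eq. (9), saturated to +/-100 RPS, then passed through a
    first-order lag that mimics real propeller response. tau and dt are placeholders."""
    def __init__(self, tau=0.5, dt=0.1):
        self.tau, self.dt = tau, dt
        self.rps = np.zeros(2)                       # [left, right]

    def step(self, a_speed, a_steer):
        cmd = np.array([100.0 * a_speed - 0.5 * a_steer,
                        100.0 * a_speed + 0.5 * a_steer])
        cmd = np.clip(cmd, -100.0, 100.0)            # RPS saturation
        # first-order lag: rps moves toward the command with time constant tau
        self.rps += (self.dt / self.tau) * (cmd - self.rps)
        return self.rps.copy()

act = PropellerActuator()
for _ in range(10):
    rps = act.step(a_speed=0.8, a_steer=0.2)         # RPS rises gradually toward the command
```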

3.1.3. Reward Function

The reward function $r_t$, which evaluates the action taken in each state, is composed of seven components as follows:
$r_t = r_{yaw} + r_{dist} + r_{obs} + r_{smooth} + r_{rate} + r_{succeed} + r_{step}$  (10)
The design of this reward function is guided by several key objectives. Specifically, it is structured to encourage the agent to achieve the following: (1) progressively approach the goal while maintaining a stable heading; (2) safely avoid collisions with obstacles; (3) minimize control chattering for smooth actuation; and (4) reach the goal as quickly as possible.
First, the heading alignment reward $r_{yaw}$ guides the agent to maintain its heading toward the goal and is defined as
$r_{yaw} = \lambda_1 \cos\!\left( \theta_t^G - \theta_t^{ASV} \right)$  (11)
Second, the progress reward $r_{dist}$ encourages forward movement toward the goal and is defined as
$r_{dist} = \lambda_2 \times \left( dist_{t-1}^G - dist_t^G \right)$  (12)
Third, the safe distance reward $r_{obs}$ imposes a penalty when the agent is near obstacles to prevent collisions and is defined as
$r_{obs} = \begin{cases} -\lambda_3 \times \left( \lambda_4 - d_t^{min} \right)^2, & \text{if } d_t^{min} < \lambda_4 \\ 0, & \text{otherwise} \end{cases}$  (13)
$d_t^{min}$ represents the shortest distance to an obstacle as measured by the LiDAR.
Fourth, the action smoothness reward $r_{smooth}$ encourages stable actions and is defined as
$r_{smooth} = \lambda_5 \times \left( a_t^{speed} - a_{t-1}^{speed} \right)^2 + \lambda_6 \times \left( a_t^{steer} - a_{t-1}^{steer} \right)^2$  (14)
Fifth, the yaw rate penalty $r_{rate}$ discourages excessive rotational motion and is defined as
$r_{rate} = \lambda_7 \times \left| \dot{\psi}_t \right|$  (15)
$\dot{\psi}_t$ is the ASV’s yaw rate (rad/s).
Sixth, the terminal reward $r_{succeed}$ reflects the outcome of the episode and is defined as
$r_{succeed} = \begin{cases} r_{succeed}, & \text{if succeeded} \\ r_{collision}, & \text{if collision} \\ r_{time\_out}, & \text{if timeout} \\ 0, & \text{otherwise} \end{cases}$  (16)
Timeout is triggered at step 800.
Finally, the small negative reward $r_{step}$ provides a constant penalty at each time step to encourage shorter episodes and is defined as
$r_{step} = \lambda_8$  (17)
The weight parameters were tuned to balance the competing objectives of safety and task completion. Safety is the most important objective, enforced through both step and terminal penalties. Among the per-step penalties, the safe distance reward ($r_{obs}$) is assigned the largest weight to prevent close approaches to obstacles. The terminal penalties are likewise structured to prioritize safety, with the penalty for a collision ($r_{collision}$) being significantly greater in magnitude than that for a timeout ($r_{time\_out}$). To encourage the agent to complete the task despite these numerous penalties, the success reward ($r_{succeed}$) was set to the largest absolute value in the function. The remaining shaping rewards were tuned to refine the agent’s behavior for stability and efficiency without conflicting with the primary objectives. All reward parameters are listed in Section 4.1.
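Putting the seven terms together with the weights of Table 3 gives the following sketch of the per-step reward computation. The negative sign on the obstacle term and the absolute value on the yaw rate follow the penalty interpretation described above; the example inputs are arbitrary.

```python
import numpy as np

# Weights and terminal rewards from Table 3
L1, L2, L3, L4 = 0.5, 5.0, 0.2, 8.0
L5, L6, L7, L8 = -5.0, -4.0, -1.0, -2.0
R_SUCCEED, R_COLLISION, R_TIMEOUT = 1500.0, -1000.0, -500.0

def step_reward(theta_goal, heading, dist_prev, dist_now, d_min,
                a_now, a_prev, yaw_rate, outcome=None):
    """Composite reward r_t; 'outcome' is None, 'succeed', 'collision', or 'timeout'."""
    r = L1 * np.cos(theta_goal - heading)                 # heading alignment
    r += L2 * (dist_prev - dist_now)                      # progress toward goal
    if d_min < L4:                                        # safe-distance penalty
        r += -L3 * (L4 - d_min) ** 2
    r += (L5 * (a_now[0] - a_prev[0]) ** 2
          + L6 * (a_now[1] - a_prev[1]) ** 2)             # action smoothness
    r += L7 * abs(yaw_rate)                               # yaw-rate penalty
    r += L8                                               # per-step penalty
    terminal = {"succeed": R_SUCCEED, "collision": R_COLLISION, "timeout": R_TIMEOUT}
    return r + terminal.get(outcome, 0.0)

r = step_reward(0.2, 0.1, 62.0, 60.0, 12.0, (0.8, 0.1), (0.8, 0.0), 0.05)
```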

3.2. Sim2Real Methods

Policies trained in simulation often experience performance degradation when deployed in the real world, a phenomenon known as the Sim2Real gap. This gap arises from differences between the simulated and real-world domains. First, the dynamic models used in simulations are imperfect approximations that often neglect complex hydrodynamic effects, actuator nonlinearities, and hardware-specific characteristics. Second, simulations typically assume idealized sensing, whereas real sensors are subject to noise, bias, and latency. Lastly, real-world environmental conditions, such as wind, waves, and currents, introduce complex and stochastic disturbances that are difficult to model with complete fidelity. As a result, a policy that performs well in simulation may fail to generalize robustly, leading to degraded performance when exposed to unmodeled real-world uncertainties. To address this, we integrate two key Sim2Real techniques into our training process: domain randomization and curriculum learning.

3.2.1. Domain Randomization

Domain randomization is one of the most common approaches to address the Sim2Real problem in reinforcement learning [23]. This approach involves the random sampling of various environmental parameters such as physical properties (e.g., vehicle mass and friction coefficients), initial conditions (e.g., starting positions), and sensor noise, within predefined ranges. This method prevents the policy from overfitting to a specific environment and ensures that the policy can be robustly applied in the real world.
To implement this, we randomized the ASV’s starting location, heading, and the goal position for each training episode. Furthermore, to account for real-world sensor noise and actuator imprecision, we injected stochastic noise into the sensor measurements, the agent’s state estimation (position and orientation), and the actuator control inputs. The specific probability distributions and parameter ranges used for this randomization are detailed in Table 1.
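A minimal sketch of this randomization is shown below, following the distributions of Table 1. How the sampled position-noise bound and the fixed heading/actuator noise are applied at each step is our assumed reading of the table, and the stated standard deviations are interpreted as Gaussian(0, 0.03²) and Gaussian(0, 0.05²).

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_randomization():
    """Draw per-episode noise parameters following Table 1."""
    return {
        "lidar_sigma": rng.uniform(0.01, 0.05),        # LiDAR range noise std [m]
        "pos_noise":   rng.uniform(0.0, 0.1),          # position noise bound [m]
        "head_sigma":  0.03,                           # heading noise std [rad]
        "act_sigma":   0.05,                           # actuator noise std [-]
    }

def corrupt_observation(scan, pos, heading, action, p):
    """Inject the sampled noise into sensing, state estimation, and control inputs."""
    scan_n = scan + rng.normal(0.0, p["lidar_sigma"], size=scan.shape)
    pos_n = pos + rng.uniform(0.0, p["pos_noise"], size=2)
    head_n = heading + rng.normal(0.0, p["head_sigma"])
    act_n = np.clip(action + rng.normal(0.0, p["act_sigma"], size=2), -1.0, 1.0)
    return scan_n, pos_n, head_n, act_n

p = sample_episode_randomization()
noisy = corrupt_observation(np.full(360, 50.0), np.zeros(2), 0.0, np.zeros(2), p)
```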

3.2.2. Curriculum Learning

Curriculum learning is an approach in which training starts in a simple environment and the difficulty gradually increases as the agent succeeds [24]. In this study, the difficulty of the training environment adjusts automatically based on the agent’s performance. The agent advances to the next level when two criteria are met: (1) the average success rate over the last 1000 episodes exceeds 90%, and (2) the success rate curve stabilizes (slope ≤ 0.01). This approach enables the agent to respond effectively to various changes in the simulation environment and to learn a policy that is robust enough for the real world. We implemented a curriculum composed of four distinct levels in Stage 1, each designed to incrementally increase task difficulty, as shown in Figure 3. Specifically, the number of random obstacles in Level 4 is increased following a successful trial and decreased upon failure.
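The level-advancement check can be sketched as follows. The 90% threshold over the last 1000 episodes and the slope bound of 0.01 come from the criteria above; estimating the slope with a least-squares fit over a moving success-rate curve is our assumed implementation detail.

```python
import numpy as np

def should_advance(success_history, window=1000, rate_threshold=0.9, slope_threshold=0.01):
    """Advance to the next curriculum level when (1) the mean success rate over the
    last `window` episodes exceeds `rate_threshold` and (2) the success-rate curve
    has flattened (|slope| <= slope_threshold)."""
    if len(success_history) < 2 * window:
        return False
    outcomes = np.asarray(success_history, dtype=float)
    if outcomes[-window:].mean() < rate_threshold:
        return False
    # moving success rate over the last 2*window episodes, then its linear trend
    moving = np.convolve(outcomes[-2 * window:], np.ones(window) / window, mode="valid")
    slope = np.polyfit(np.arange(len(moving)), moving, 1)[0]
    return abs(slope) <= slope_threshold

history = [1] * 2500                 # toy history: every recent episode succeeded
print(should_advance(history))       # True
```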

4. Hyperparameters, Simulation Scenarios, and Results

4.1. Hyperparameter Settings

The hyperparameters used for training the TD3 agent are summarized in Table 2, while the parameters used for reward shaping are provided in Table 3.

4.2. Simulation Scenario

4.2.1. Evaluation Methodology

We evaluated the performance of our proposed Sim2Real agent against two baselines: a PID controller combined with the velocity obstacle (VO) method [25] and a TD3 agent trained without Sim2Real methods. For the PID + VO baseline, a PID controller was implemented to regulate the heading error (in degrees), with gains set to Kp = 0.005, Ki = 0.0, and Kd = 0.0085. The ASV’s forward speed was maintained at a constant 2.0 m/s. The VO algorithm was configured with an effective vessel radius of 2.5 m, and an additional safety margin of 1.0 m was applied to all detected obstacles to ensure conservative collision-avoidance maneuvers. The evaluation was conducted in two distinct stages, with the corresponding maps shown in Figure 4. Each agent’s performance was assessed over 1000 episodes per stage.
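For reference, the baseline heading controller can be sketched as below with the stated gains (Kp = 0.005, Ki = 0.0, Kd = 0.0085) acting on the heading error in degrees and a constant forward speed of 2.0 m/s. Mapping the PID output to a normalized steering command in [−1, 1] is our assumption, and the VO-based obstacle-handling layer is omitted here.

```python
class HeadingPID:
    """PID on the heading error (degrees) used by the PID + VO baseline."""
    def __init__(self, kp=0.005, ki=0.0, kd=0.0085, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def steer(self, heading_error_deg):
        self.integral += heading_error_deg * self.dt
        derivative = (heading_error_deg - self.prev_error) / self.dt
        self.prev_error = heading_error_deg
        u = self.kp * heading_error_deg + self.ki * self.integral + self.kd * derivative
        return max(-1.0, min(1.0, u))            # assumed normalized steering command

pid = HeadingPID()
a_steer = pid.steer(heading_error_deg=15.0)      # steer toward the desired course
speed_mps = 2.0                                  # constant forward speed of the baseline
```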

4.2.2. Task and Environment Configuration

In each scenario, the agent was initialized at a random position and heading within a 10 m radius of the starting point (light blue circle). The path consisted of waypoints (green triangles), leading to a final goal (red circle). A waypoint or goal was considered reached upon entering an 8 m success radius. The environment also contained static obstacles of various shapes and sizes (detailed in Figure 5 and Table 4), which were randomly positioned for each scenario.

4.2.3. Sim2Real Test Conditions

To test the policy’s robustness to real-world conditions, the evaluation was conducted in a test environment with differences from the training environment, as summarized in Table 5. These differences included external disturbances that were absent during training, such as currents (up to 0.2 m/s) and wind (up to 5 m/s), with their directions and magnitudes independently randomized for each episode. Specifically, currents were incorporated via a relative-flow formulation, while wind-induced forces and moments followed the Blendermann model [26].
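The test-time disturbances can be sampled per episode as in the sketch below, following the bounds in Table 5 (current up to 0.2 m/s, wind up to 5 m/s, directions randomized independently). The conversion of these values into forces and moments (the relative-flow formulation and the Blendermann wind model) is handled inside the MATLAB dynamics model and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng()

def sample_disturbances():
    """Per-episode environmental disturbances used only at test time."""
    current_speed = rng.uniform(0.0, 0.2)              # [m/s]
    current_dir = rng.uniform(-np.pi, np.pi)           # [rad]
    wind_speed = rng.uniform(0.0, 5.0)                 # [m/s]
    wind_dir = rng.uniform(-np.pi, np.pi)              # [rad]
    current_vel = current_speed * np.array([np.cos(current_dir), np.sin(current_dir)])
    return {"current_vel": current_vel, "wind_speed": wind_speed, "wind_dir": wind_dir}

episode_env = sample_disturbances()                    # drawn once at episode start
```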

4.3. Training Stability and Behavioral Analysis

A critical finding from this experiment was the failure of the agent trained exclusively with domain randomization (DR) to achieve policy convergence within the 70,000 training episodes. This outcome highlights a fundamental trade-off in Sim2Real. While broad environmental randomization is a powerful mechanism for instilling policy robustness, it can also severely destabilize the learning process and impede convergence in complex tasks. Consequently, we could not obtain a viable policy for the DR-only agent, rendering a direct performance comparison against this baseline infeasible.
The established instability of the DR-only approach necessitates a qualitative analysis to understand how curriculum learning (CL) enables and synergizes with DR. To this end, Figure 6 compares the navigation trajectories of an agent trained with ‘CL Only’ against one trained with our proposed ‘CL + DR’. The comparison reveals a distinct divergence in their learned strategies: while the ‘CL Only’ agent (blue solid lines) executes aggressive maneuvers in close proximity to obstacles, our proposed agent (red dashed lines) consistently adopts a more conservative policy, maintaining a significantly wider safety margin. This cautious behavior is a direct manifestation of the robustness instilled by DR. By training on a wide spectrum of simulated uncertainties, such as sensor and position errors, the policy learns to inherently compensate for potential real-world discrepancies.
Ultimately, these findings confirm a crucial synergistic relationship. The convergence failure of the ‘DR Only’ agent empirically demonstrates that curriculum learning acts as an essential stabilizing factor. By providing a structured learning path from simple to complex scenarios, CL creates the stable foundation necessary for the agent to learn successfully. Only upon this foundation can domain randomization be effectively leveraged to build a demonstrably safer and more robust policy.

4.4. Stage 2 Results

Our proposed agent demonstrates superior performance in Stage 2, as depicted in Figure 7a. This figure highlights a particularly challenging and congested segment where all comparative methods falter. The PID + VO, TD3, and CL ONLY agents each fail to navigate this area, resulting in mission termination early in the trial. In contrast, our proposed agent is the only method to successfully navigate this difficult segment and complete the entire waypoint path. This qualitative result provides clear, intuitive evidence of the agent’s enhanced robustness and long-range navigation capability in the presence of unseen environmental disturbances.
This visually demonstrated superiority is rigorously substantiated by the quantitative results summarized in Figure 7b. The high success rate of 94.8% achieved by the proposed agent—significantly outperforming the PID + VO (79.4%), CL ONLY (85.8%), and TD3 (88.7%) baselines—quantifies its ability to succeed where others fail. This robustness is underpinned by superior safety and control efficiency. The agent maintained the largest closest obstacle distance at 9.32 m, indicating more reliable navigation in cluttered environments. Furthermore, its smooth and efficient steering control, demonstrated by the lowest RMS ΔSteer among DRL agents (14.4%), was crucial for handling the complex maneuvers required in the trial, all while maintaining a competitive completion time of 19.47 s. Taken together, these qualitative and quantitative findings validate the effectiveness of our proposed Sim2Real method, confirming its ability to produce robust, high-performance collision avoidance for long-horizon navigation tasks.

4.5. Stage 3 Results

In Stage 3, the most complex environment, the performance gap between our proposed agent and the baseline methods became even more pronounced. The qualitative results from a representative trial are depicted in Figure 8a. This figure clearly shows that all baseline methods—PID + VO, TD3, and CL ONLY—fail to handle the initial series of tight maneuvers near the starting point. In contrast, our proposed agent is once again the only method capable of successfully navigating the entire complex, long-horizon path, providing a definitive visual testament to its superior robustness.
This visually demonstrated success is rigorously quantified by the performance metrics summarized in Figure 8b. The proposed agent achieved the highest success rate of 93.2%, significantly outperforming the PID + VO (74.0%), CL ONLY (82.6%), and TD3 (86.4%) agents in this challenging scenario. The underlying reasons for this success are evident in the other metrics. Our agent maintained a high safety margin, with a closest obstacle distance of 7.87 m. While this value is marginally lower than that of the CL ONLY agent in this specific stage, this is attributed to the proposed agent’s superior capability. It frequently succeeds in navigating through confined waters where other agents typically fail; these successful yet tight passages naturally result in a lower average distance metric. This seemingly paradoxical result, therefore, further highlights its robustness and willingness to undertake more challenging maneuvers when necessary. Furthermore, it demonstrated the most efficient and stable control among the DRL methods, achieving a competitive completion time of 24.66 s and the lowest RMS ΔSteer of 16.6%.
These consistent results across stages of increasing complexity validate the exceptional robustness and generalization capabilities of our proposed Sim2Real method. The agent’s ability to reliably succeed in diverse and unseen environments where both classical and standard DRL methods repeatedly fail confirms its effectiveness for real-world ASV collision-avoidance tasks.

5. Conclusions

This study proposes a deep reinforcement learning (DRL) agent for robust Autonomous Surface Vessel (ASV) collision avoidance in complex, confined waterways. To this end, we integrated two key Sim2Real techniques, domain randomization and curriculum learning, to enhance the policy’s robustness and generalization capabilities.
Performance evaluations confirmed the superiority of our proposed agent over both conventional and standard DRL baselines, particularly in high-complexity scenarios. In the most challenging stage, our agent’s average success rate was nearly 7 percentage points higher than that of a TD3 agent trained without Sim2Real. The performance gap was even more pronounced against the classical PID + VO method, which our agent outperformed by over 19 percentage points in success rate, alongside achieving faster completion times and significantly smoother steering control.
The key contribution of this research is the empirical demonstration that a Sim2Real approach, combining domain randomization and curriculum learning, instills the necessary robustness and generalization for an agent to overcome the limitations of traditional methods. Our agent’s ability to reliably navigate complex scenarios where the classical PID + VO method consistently failed is a direct validation of this approach. These findings confirm that DRL agents integrated with such techniques present a highly effective and reliable solution for real-world maritime collision avoidance.
Our future work will focus on developing a high-fidelity simulation environment grounded in the dynamics of our physical vessel to mitigate the Sim2Real gap further. The initial step will be to perform system identification on our vessel to determine its precise hydrodynamic derivatives, which will then be used to develop a simulation that accurately reflects its real-world dynamics. Furthermore, we will identify and model key hardware characteristics from the physical vessel, such as sensor noise, actuator nonlinearities, and control loop frequencies. These realistic system parameters will be integrated into the learning architecture to enhance the agent’s robustness. The goal of future work is to deploy the policy trained in this data-driven, high-fidelity simulation onto our physical autonomous vessel for empirical validation of its effectiveness and robustness in actual maritime conditions.

Author Contributions

Conceptualization, J.W. and S.P.; methodology, C.H.; software, C.H.; validation, C.H. and J.W.; formal analysis, C.H.; investigation, C.H.; writing—original draft preparation, C.H.; writing—review and editing, J.W. and S.P.; visualization, C.H.; supervision, J.W.; project administration, S.P.; funding acquisition, J.W. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Research Institute of Ships and Ocean Engineering (KRISO) grant from the Endowment Project, “Development of Smart Maritime Technologies based on Digital Twin” (Grant No. PES5590), funded by the Ministry of Oceans and Fisheries, Korea. This work was also supported by the Korea Institute of Marine Science & Technology Promotion (KIMST) grant (No. RS-2024-00432366), funded by the Ministry of Oceans and Fisheries, Korea.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors gratefully acknowledge the support from the Korea Research Institute of Ships and Ocean Engineering (KRISO), the Ministry of Oceans and Fisheries, and the Korea Institute of Marine Science & Technology Promotion (KIMST).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRL        Deep Reinforcement Learning
ASV        Autonomous Surface Vessel
Sim2Real   Simulation-to-Real
TD3        Twin Delayed Deep Deterministic Policy Gradient
VO         Velocity Obstacle
PID        Proportional–Integral–Derivative
ROS        Robot Operating System
MDP        Markov Decision Process
PER        Prioritized Experience Replay

References

  1. Alam, M.S.; Sudha, S.K.R.; Somayajula, A. AI on the water: Applying drl to autonomous vessel navigation. arXiv 2023, arXiv:2310.14938. [Google Scholar] [CrossRef]
  2. Yang, S.; Wang, K.; Wang, W.; Wu, H.; Suo, Y.; Chen, G.; Xian, J. Dual-attention proximal policy optimization for efficient autonomous navigation in narrow channels using deep reinforcement learning. Ocean Eng. 2025, 326, 120707. [Google Scholar] [CrossRef]
  3. Xu, D.; Yang, J.; Zhou, X.; Xu, H. Hybrid path planning method for USV using bidirectional A* and improved DWA considering the manoeuvrability and COLREGs. Ocean Eng. 2024, 298, 117210. [Google Scholar] [CrossRef]
  4. Cao, S.; Fan, P.; Yan, T.; Xie, C.; Deng, J.; Xu, F.; Shu, Y. Inland waterway ship path planning based on improved RRT algorithm. J. Mar. Sci. Eng. 2022, 10, 1460. [Google Scholar] [CrossRef]
  5. Jadhav, A.K.; Pandi, A.R.; Somayajula, A. Collision avoidance for autonomous surface vessels using novel artificial potential fields. Ocean Eng. 2023, 288, 116011. [Google Scholar] [CrossRef]
  6. Johansen, T.A.; Cristofaro, A.; Perez, T. Ship collision avoidance using scenario-based model predictive control. IFAC-Pap. 2016, 49, 14–21. [Google Scholar]
  7. Li, W.; Wang, Y.; Zhu, S.; Xiao, J.; Chen, S.; Guo, J.; Ren, D.; Wang, J. Path tracking and local obstacle avoidance for automated vehicle based on improved artificial potential field. Int. J. Control Autom. Syst. 2023, 21, 1644–1658. [Google Scholar] [CrossRef]
  8. Yasukawa, H.; Yoshimura, Y. Introduction of MMG standard method for ship maneuvering predictions. J. Mar. Sci. Technol. 2015, 20, 37–52. [Google Scholar] [CrossRef]
  9. Shen, H.; Hashimoto, H.; Matsuda, A.; Taniguchi, Y.; Terada, D.; Guo, C. Automatic collision avoidance of multiple ships based on deep Q-learning. Appl. Ocean Res. 2019, 86, 268–288. [Google Scholar] [CrossRef]
  10. Yuan, J.; Han, M.; Wang, H.; Zhong, B.; Gao, W.; Yu, D. AUV Collision Avoidance Planning Method Based on Deep Deterministic Policy Gradient. J. Mar. Sci. Eng. 2023, 11, 2258. [Google Scholar] [CrossRef]
  11. Hua, M.; Zhou, W.; Cheng, H.; Chen, Z. Improved DDPG algorithm-based path planning for unmanned surface vehicles. Intell. Robot. 2024, 4, 363–384. [Google Scholar] [CrossRef]
  12. Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-Compliant Collision Avoidance for Unmanned Surface Vehicle Using Deep Reinforcement Learning. IEEE Access 2020, 8, 165344–165364. [Google Scholar] [CrossRef]
  13. Guan, W.; Cui, Z.; Zhang, X. Intelligent smart marine autonomous surface ship decision system based on improved PPO algorithm. Sensors 2022, 22, 5732. [Google Scholar] [CrossRef] [PubMed]
  14. Wu, X.; Wei, C.; Guan, D.; Ji, Z. Risk-aware deep reinforcement learning for mapless navigation of unmanned surface vehicles in uncertain and congested environments. Ocean Eng. 2025, 322, 120446. [Google Scholar] [CrossRef]
  15. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 737–744. [Google Scholar]
  16. Song, R.; Gao, S.; Li, Y. Sim-to-Real in Unmanned Surface Vehicle Control: A System Identification-Based Approach for Enhanced Training Environments. In Proceedings of the 2024 9th International Conference on Electronic Technology and Information Science (ICETIS), Xiamen, China, 10–12 May 2024. [Google Scholar]
  17. Li, J.; Chavez-Galaviz, J.; Azizzadenesheli, K.; Mahmoudian, N. Dynamic Obstacle Avoidance for USVs Using Cross-Domain Deep Reinforcement Learning and Neural Network Model Predictive Controller. Sensors 2023, 23, 3572. [Google Scholar] [CrossRef]
  18. Wang, N.; Wang, Y.; Zhao, Y.; Wang, Y.; Li, Z. Sim-to-real: Mapless navigation for USVs using deep reinforcement learning. J. Mar. Sci. Eng. 2022, 10, 895. [Google Scholar] [CrossRef]
  19. Batista, L.F.; Aravecchia, S.; Hutchinson, S.; Pradalier, C. Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests. arXiv 2025, arXiv:2505.10033. [Google Scholar] [CrossRef]
  20. Fossen. Marine Systems Simulator (MSS). 2004. Available online: https://github.com/cybergalactic/MSS (accessed on 5 September 2025).
  21. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  22. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  23. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017. [Google Scholar]
  24. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res. 2020, 21, 1–50. [Google Scholar]
  25. Fiorini, P.; Shiller, Z. Motion planning in dynamic environments using velocity obstacles. Int. J. Robot. Res. 1998, 17, 760–772. [Google Scholar] [CrossRef]
  26. Blendermann, W. Parameter identification of wind loads on ships. J. Wind Eng. Ind. Aerodyn. 1994, 51, 339–351. [Google Scholar] [CrossRef]
Figure 1. Learning architecture.
Figure 2. (a) Lidar scan: The green lines delineate the sectors of the Lidar scan, the black circles represent obstacles, and the blue lines indicate the waterway boundaries. (b) ASV coordinate frame.
Figure 3. The red circle indicates the goal location, the white lines represent the waterway boundaries, and the multi-colored objects of various shapes are obstacles. The stages shown are: (a) Open waters, (b) Stage 1 with no obstacles, (c) Stage 1 with obstacles at fixed positions, and (d) Randomized obstacle positions and goal location.
Figure 4. Maps of the two test stages. The black lines represent the waterway boundaries, and the multi-colored objects of various shapes are obstacles. (a) Size: 500 × 550 m, channel width: 50 m; (b) size: 700 × 800 m, channel width: 40 m.
Figure 5. Obstacle shape.
Figure 6. Comparison of trajectory between the ‘CL Only’ and ‘Ours’ (proposed method).
Figure 7. Stage 2 result. (a) Three random scenario trajectories; (b) performance comparison.
Figure 8. Stage 3 result. (a) Three random scenario trajectories; (b) performance comparison.
Table 1. Randomization parameters.
Category       Parameter                        Range                           Rationale
Sensor         LiDAR standard deviation (σ)     σ ~ Uniform[0.01, 0.05] (m)     LiDAR range noise
Position       Position noise (x)               ϵ ~ Uniform[0, 0.1] (m)         Agent position noise
Orientation    Heading angle (ψ)                ϵ ~ Gaussian(0, 0.03²) (rad)    Agent heading noise
Control        Action noise (a)                 ϵ ~ Gaussian(0, 0.05²) (−)      Actuator noise
Table 2. Training hyperparameters.
Parameter                 Value
Discount factor           0.99
Learning rate             0.0001
Soft update factor τ      0.005
Target policy noise σ     0.02
Policy noise clip         0.05
Actor update delay        2
Table 3. Reward parameters.
Parameter            Value
$\lambda_1$          0.5
$\lambda_2$          5.0
$\lambda_3$          0.2
$\lambda_4$          8.0
$\lambda_5$          −5.0
$\lambda_6$          −4.0
$\lambda_7$          −1.0
$\lambda_8$          −2.0
$r_{succeed}$        1500
$r_{collision}$      −1000
$r_{time\_out}$      −500
Table 4. Size of obstacles.
Parameter              Value (m)
Cylinder radius        2, 3, 4, 4.5, 5
Cube width × length    3 × 6, 4 × 7, 5 × 6, 5 × 5, 7 × 7
Sphere radius          2, 2.5, 3, 3.5, 4
Tetrapod height        4.5, 5.0, 5.5, 6.0, 6.5
Table 5. Comparison of the training and test simulation environments.
Category                  Train        Test
Map                       Stage 1      Stages 2–3
Obstacle count            60           80 (Stage 2), 105 (Stage 3)
Waypoint distance [m]     50           50 (Stage 2), 80 (Stage 3)
Current                   None         ≤0.2 m/s (random direction)
Wind                      None         ≤5 m/s (random direction)
