Article

Autonomous Navigation Control and Collision Avoidance Decision-Making of an Under-Actuated ASV Based on Deep Reinforcement Learning

1 School of Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 State Key Laboratory of Ocean Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(11), 2108; https://doi.org/10.3390/jmse13112108
Submission received: 15 October 2025 / Revised: 31 October 2025 / Accepted: 4 November 2025 / Published: 6 November 2025
(This article belongs to the Special Issue Advanced Control Strategies for Autonomous Maritime Systems)

Abstract

To achieve efficient and safe navigation of an autonomous surface vehicle (ASV), this paper proposes an autonomous navigation behavior framework that integrates deep reinforcement learning (DRL) to achieve autonomous decision-making and low-level control actions in path following and collision avoidance. By controlling both the propeller speed and the rudder angle, the policy of each behavior pattern is trained with the soft actor–critic (SAC) algorithm. Moreover, a dynamic obstacle trajectory predictor based on the Kalman filter and a long short-term memory module is developed for obstacle avoidance. Simulations and physical experiments using an under-actuated very large crude carrier (VLCC) model indicate that our DRL-based method produces appreciable performance gains in ASV autonomous navigation under environmental disturbances, enables forecasting of the vessel's expected state over a future time horizon, and improves the operational efficiency of the navigation process.

1. Introduction

The advancement of maritime transportation highlights the significance of ship decision-making and control methods, which are vital for maintaining the safety and efficiency of ship navigation [1]. While human factors are critical for traditional ship navigation, they are also the leading cause of maritime accidents [2]. Autonomous surface vehicles (ASVs) have the potential to minimize human error, thereby guaranteeing more secure operations [3]. When an ASV sails in open water along a pre-determined route, it relies primarily on path-following control to ensure reliable navigation, whereas in complicated environments such as ports or island regions, the ASV must automatically detect and avoid obstacles using local information. A conventional ASV lacks sufficient control inputs to manage motions in all six degrees of freedom (DOF), which categorizes it as an under-actuated system, and external factors such as wind, waves, and currents can greatly affect its motion control. Therefore, it is essential for the ASV to possess real-time and robust path-following control and collision avoidance decision-making functions to minimize path tracking deviations, resist external disturbances, and improve operational efficiency during navigation.
In recent years, artificial intelligence technology, especially reinforcement learning (RL), has garnered widespread attention for its potential in handling complex decision-making and control challenges. The goal of RL is to learn an optimal policy through interactions between the agent and the environment, and it has demonstrated application prospects in fields such as autonomous driving and drone flight [4,5]. Although RL has made considerable progress in various domains, its application to the navigation of ASVs remains at an early stage. Marine vehicles are required to analyze various environmental data in real time and adapt to ever-changing obstacles and complex sea conditions. Traditional rule-based methods [6,7,8] suffer from poor scalability and tend to be inflexible when dealing with dynamic marine settings, while model-based control and decision-making approaches [9,10,11] often depend on a precise mathematical model of the ship. In comparison, RL or deep reinforcement learning (DRL) algorithms transform inputs from multiple channels into executable outputs within a black-box framework. An RL-based algorithm eliminates the need for manually established rules and exact specifications, and it is capable of handling features that are not easy to quantify directly, such as images and natural language [12]. Ref. [13] achieved ship path following through a line-of-sight guidance algorithm and subsequently trained a collision avoidance policy using the proximal policy optimization algorithm, which was verified under unknown environmental disturbances. However, the state space incorporated global coordinate information, and the policy needs to be retrained in new scenarios. Ref. [14] developed a dual-layer navigation system comprising a long-term planner and a short-term decision maker using a deep Q-network (DQN) algorithm: the planner first generates a global path, and the decision maker then uses environmental images as input to a convolutional neural network. Nonetheless, this method did not consider the ship's dynamics. Ref. [15] implemented a dueling deep Q-network for the autonomous navigation and obstacle avoidance of unmanned surface vehicles (USVs). The convergence speed of this algorithm outperformed DQN and Deep SARSA in both static and dynamic environments, although the control variables were the surge force and yaw moment rather than direct commands to the actuators and rudders, and wind velocity and direction were not included in the simulation platform. Ref. [16] presented an autonomous navigation system for USVs that sensed ocean conditions in real time and output rudder angle control commands. They employed a double Q-network to facilitate end-to-end control, but the feasible rudder angles were discrete. Ref. [17] introduced an improved deep deterministic policy gradient method in which LiDAR was integrated to provide inputs for collision avoidance perception, enabling autonomous navigation and collision avoidance by controlling the rudder angle without relying on global information; however, environmental disturbances were not considered. Ref. [18] proposed an autonomous navigation system for maritime autonomous surface ships using dueling deep Q-networks with prioritized replay. Along with the wind, current, and wave data, the system also utilized data from the ship's automatic identification system (AIS) to construct the navigation environment.
However, the state space depended on the geographic locations of both the ship and the target, the action space was discrete, and the throttle operation of the vessel was ignored. Ref. [19] controlled the ship's acceleration and rudder angle in straight-line water channels and compared the performance of various DRL algorithms, but the assumed navigation environment was very simplistic, limiting its applicability to real-world scenarios.
Most existing DRL studies focus on simulation experiments or theoretical evaluations, and challenges in practical applications, including environmental uncertainty, dynamic obstacles, and complicated encounter scenarios, have not been fully addressed. Therefore, this paper presents a hybrid system for autonomous navigation based on the state-of-the-art DRL algorithm soft actor–critic (SAC). In the absence of collision risks, the ASV controls its rudder angle for path following through a course tracking controller; in risky situations, the collision avoidance decision maker generates a target course designed to avert collisions, and the rudder angle and propeller speed are then adjusted to return the vessel to the designated course. After training the policies, we perform testing simulations on path following, collision avoidance, and the autonomous navigation framework in complex environments. A model-scale physical experiment is also carried out to validate the effectiveness of the proposed path-following control policy. The results illustrate that our approach successfully accomplishes ASV navigational decision-making and control, facilitates the prediction of the vessel's expected state, and allows for the optimization of the overall navigation process. The main contributions of this paper include the following:
(1) A framework for DRL-based autonomous navigation is established, and a virtual simulation environment is developed. The policies are trained to exhibit diverse behaviors of under-actuated vessels, including path following and collision avoidance. The non-global perception mechanism enables the ASV to navigate along stochastic paths and execute commands from different navigation patterns. A physical experiment with an under-actuated very large crude carrier (VLCC) ship model is conducted to verify the performance of the path-following policy.
(2) A dynamic collision avoidance approach based on trajectory prediction is proposed. The ASV first receives the historical time-series data of dynamic obstacles. Then, using a kinematic model, the agent predicts the possible near-future trajectories of the dynamic obstacles by combining a Kalman filter (KF) and a long short-term memory (LSTM) network. Concurrently, the propeller rotation speed and rudder angle of the ship are controlled to proactively reduce the risk of collision.
The structure of this paper is as follows: Section 2 introduces the ship mathematical models. Section 3 presents the problem formulations for path following and collision avoidance. Section 4 elaborates on the SAC algorithm, the policy designs, the trajectory prediction algorithm for dynamic obstacles, and the hybrid system framework. Section 5 discusses the algorithm parameters and the results of training, numerical simulation, and model testing. Finally, Section 6 offers a summary and perspectives.

2. Mathematical Modeling of the Ship

For motion control, it is necessary to first define the state of the ship in the reference coordinate systems. Here, we use the Earth coordinate system $O_E X_E Y_E Z_E$ and the body-fixed coordinate system $OXYZ$ to describe the ship's motion parameters. As shown in Figure 1a, the Earth coordinate system is established on a tangent plane to the Earth's surface, with the x-axis pointing north, the y-axis pointing east, and the z-axis perpendicular to the Earth's surface. In the body-fixed coordinate system, the ASV is treated as a rigid body, with the x-axis towards the bow, the y-axis towards starboard, and the z-axis towards the bottom of the ship. This paper considers a three-DOF mathematical model of the ASV operating on the water surface, and the corresponding horizontal reference coordinate systems are illustrated in Figure 1b.
The relationship between the motion parameters in the two coordinate systems is referred to as the ship kinematic equation, as given in Equation (1).
$\dot{\eta} = J(\psi)\,\upsilon, \quad (1)$

$J(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad (2)$

where $\eta = [x, y, \psi]^T$ is the vector of the ship displacements and yaw angle in the Earth coordinate system, $\upsilon = [u, v, r]^T$ is the vector of the linear and angular velocities, and $U = \sqrt{u^2 + v^2}$. $J(\psi)$ is the rotation matrix, as shown in Equation (2). Besides the kinematic equation, the dynamic equation is also essential for the control of the ASV. The time-domain motion equation is given as follows [20]:
$M\dot{\upsilon} + C_{RB}(\upsilon)\,\upsilon + C_A(\upsilon_r)\,\upsilon_r + D(\upsilon_r)\,\upsilon_r = \tau + \tau_{wave} + \tau_{wind}, \quad (3)$
where $\upsilon_r = \upsilon - \upsilon_c \in \mathbb{R}^3$ represents the velocity vector of the vessel relative to the ocean current, which is assumed steady and irrotational. $M$ denotes the total mass matrix, $C_{RB}(\upsilon)$ is the rigid-body Coriolis–centripetal matrix, $C_A(\upsilon_r)$ is the Coriolis–centripetal matrix of the added mass, and $D(\upsilon_r)$ is the matrix of the hydrodynamic damping coefficients.
$\tau \in \mathbb{R}^3$ is the control input vector, and $\tau_{wave}, \tau_{wind} \in \mathbb{R}^3$ correspond to the environmental disturbance forces exerted by waves and wind. Since the left-hand side of Equation (3) already incorporates the velocity of the ocean current, no additional current force term is needed. The relationship between the control input $\tau$, the propeller revolving speed $n$, and the rudder angle $\delta$ is given in Equation (4), where $C$ is the configuration matrix of the propellers and $f_c(\cdot)$ is the conversion function [21].
$\tau = C\, f_c(\upsilon_r, n, \delta) \quad (4)$
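As a concrete illustration of Equations (1) and (2), the following minimal Python sketch propagates the ship pose in the Earth frame with simple Euler integration; the function names, the time step, and the assumption that the body-fixed velocity vector is already available from the dynamics are ours, not part of the original implementation.

```python
import numpy as np

def rotation_matrix(psi: float) -> np.ndarray:
    """Rotation matrix J(psi) from the body-fixed frame to the Earth frame, Eq. (2)."""
    return np.array([
        [np.cos(psi), -np.sin(psi), 0.0],
        [np.sin(psi),  np.cos(psi), 0.0],
        [0.0,          0.0,         1.0],
    ])

def kinematic_step(eta: np.ndarray, nu: np.ndarray, dt: float) -> np.ndarray:
    """One Euler step of Eq. (1): eta_dot = J(psi) * nu, with eta = [x, y, psi]."""
    eta_dot = rotation_matrix(eta[2]) @ nu          # nu = [u, v, r] in the body frame
    eta_next = eta + dt * eta_dot
    eta_next[2] = (eta_next[2] + np.pi) % (2 * np.pi) - np.pi  # wrap yaw to [-pi, pi)
    return eta_next
```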

3. Autonomous Navigation and Collision Avoidance Problem Modeling

3.1. Markov Decision Process

As a branch of machine learning, DRL integrates the benefits of both deep learning and RL [22]. The Markov decision process (MDP) is one of the fundamental building blocks of RL. Here, we define the MDP tuple as $\langle S, A, P, R, \gamma \rangle$, where $S$ is the set of all environment states, with $s_t \in S$ indicating the state the agent is in at time $t$. $A$ represents the action space, and $a_t \in A$ refers to the action executed by the agent at time $t$. $P(s_t, a_t, s_{t+1})$ represents the probability of the agent transitioning from state $s_t$ to the next state $s_{t+1}$ when taking action $a_t$ in state $s_t$. $R$ is the reward function, and $r_{t+1} = R(s_t, a_t)$ represents the reward the agent receives when taking action $a_t$ in state $s_t$. $\gamma \in [0, 1]$ is the discount factor [23]. In RL, the policy $\pi$ is a mapping from the state space to a distribution over actions. Specifically, $\pi(a_t \mid s_t)$ represents the probability of selecting action $a_t$ given state $s_t$ under policy $\pi$. Denoting by $G_t$ the discounted cumulative reward at time $t$, it is calculated by Equation (5).
$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T, \quad (5)$
where T is the terminal time. The objective of RL is to find an optimal policy to maximize the expected discounted return, which ensures that this return is at least equal to that achieved by any alternative policies [24].
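For illustration, a short sketch of Equation (5) that computes the discounted return from a finite reward sequence; the reward values and discount factor below are arbitrary examples.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... + gamma^{T-t-1} * r_T, Eq. (5)."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the terminal reward backwards
        g = r + gamma * g
    return g

# example: a three-step episode
print(discounted_return([1.0, 0.5, -1.0], gamma=0.9))
```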

3.2. Path Following and Course Tracking Control

The basis of ASV autonomous navigation is to follow a specified route while avoiding obstacles in complicated environments. Path-following methods, including direct and indirect techniques, are employed to control the motions along a pre-determined path in spatial dimensions without time constraints [25]. The indirect following method is particularly preferred due to its clarity in variable definition and ease of implementation [25].
Furthermore, as a key part of ASV guidance and control, course tracking is a prerequisite for the implementation of indirect path-following control [26]. Figure 2 illustrates the schematic view of indirect path following based on course tracking control, where the path $\Theta$ is a line comprising $N_\Theta$ discrete points $\{[x_p^i, y_p^i]\}_{i=1}^{N_\Theta}$ and $\Delta_1$ denotes the forward viewing distance. The ASV first determines the nearest path point $P_c = (x_c, y_c)$ on $\Theta$ relative to its current position with Equation (6), where $P_{ASV} = (x_t, y_t)$ is the ship position and $\|\cdot\|_2$ is the L2-norm. From this nearest path point, the ASV then searches forward along $\Theta$ for a lookout point $P_T = (x_T, y_T)$ that lies a distance $\Delta_1$ away from $P_c$, i.e., $\|(x_c, y_c) - (x_T, y_T)\|_2 \geq \Delta_1$. The primary objective of the ASV is to compute, at each moment $t$, the desired course $\varphi_d$ between its position and the lookout point $P_T$, as shown in Equation (7), and to minimize the difference between $\psi$ and $\varphi_d$, i.e., $\|\psi - \varphi_d\|_2 \to 0$.
$P_c = \underset{(x, y) \in \Theta}{\operatorname{argmin}}\ \|(x, y) - (x_t, y_t)\|_2 \quad (6)$
$\varphi_d = \operatorname{atan2}\!\left(y_T - y_t,\; x_T - x_t\right) \quad (7)$
In this study, it is assumed that the rotational speed of the ship propeller remains constant throughout the path-following task. Therefore, at each moment $t$, the ASV determines an optimal rudder angle $\delta_t$ based only on its current position, the desired path, and other relevant local information to control its heading. This means that the path-following problem satisfies the Markov property and can be addressed using DRL algorithms. To improve the generalization performance of the policy, the agent's state space and reward function are designed without reference to global environmental information. Since the controller aims to minimize the deviation between the current heading and the desired course, the state space should encompass the yaw angle $\psi$ and the course $\varphi_d$. Furthermore, since the change in rudder angle within $\Delta t$ should not be excessive, the rudder angle $\delta$ and the angular velocity $r$ are also incorporated into the state representation. Thus, the state vector for the ASV in the path-following task at time $t$ is defined as $s_t = [\sin\psi_t, \cos\psi_t, \sin\varphi_d^t, \cos\varphi_d^t, \delta_t, r_t]$, while the action vector is $a_t = [\delta_t]$.
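A minimal sketch of the guidance step and state assembly described above (Equations (6) and (7)); the forward-search rule for the lookout point (the first path point at least $\Delta_1$ ahead of the nearest point) is one plausible reading of the text, and all names are illustrative.

```python
import numpy as np

def lookout_point(path_xy: np.ndarray, pos: np.ndarray, delta1: float) -> np.ndarray:
    """Nearest path point (Eq. (6)) followed by a forward search for the lookout point P_T."""
    dists = np.linalg.norm(path_xy - pos, axis=1)
    c = int(np.argmin(dists))                      # index of P_c
    for i in range(c, len(path_xy)):               # search forward along the path
        if np.linalg.norm(path_xy[i] - path_xy[c]) >= delta1:
            return path_xy[i]
    return path_xy[-1]                             # fall back to the last path point

def path_following_state(pos, psi, delta, r, path_xy, delta1):
    """State s_t = [sin psi, cos psi, sin phi_d, cos phi_d, delta, r]."""
    p_t = lookout_point(path_xy, pos, delta1)
    phi_d = np.arctan2(p_t[1] - pos[1], p_t[0] - pos[0])   # Eq. (7)
    state = np.array([np.sin(psi), np.cos(psi),
                      np.sin(phi_d), np.cos(phi_d), delta, r])
    return state, phi_d
```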
The reward function $r_p^t$ is divided into two components, $R_{\text{course}}$ and $R_{\text{rudder}}$, formulated in Equations (8) and (9), respectively. Equation (10) gives the expression for $r_p^t$, where $k_1$ and $k_2$ are both positive weight coefficients. For a given path, the agent continuously obtains a positive reward when the absolute error $|\psi_t - \varphi_d^t|$ is small; conversely, a negative penalty is imposed when the angular error is large. For $R_{\text{rudder}}$, the penalty diminishes as the amplitude of the rudder angle change within the time $\Delta t$ decreases.
$R_{\text{course}} = \dfrac{\pi/2 - |\psi_t - \varphi_d^t|}{\pi/2} \quad (8)$
$R_{\text{rudder}} = -\dfrac{|\delta_{t-1} - \delta_t|}{\Delta t} \quad (9)$
$r_p^t = k_1 \cdot R_{\text{course}} + k_2 \cdot R_{\text{rudder}} \quad (10)$
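Assuming the reconstructed forms of Equations (8)-(10) above (normalization by $\pi/2$ in Equation (8) and a rudder-rate penalty in Equation (9)), the step reward could be computed as follows; the default weights are placeholders.

```python
import numpy as np

def angle_diff(a: float, b: float) -> float:
    """Smallest signed difference between two angles, wrapped to [-pi, pi)."""
    return (a - b + np.pi) % (2 * np.pi) - np.pi

def path_following_reward(psi, phi_d, delta, delta_prev, dt, k1=1.0, k2=1.0):
    """r_p = k1 * R_course + k2 * R_rudder, Eq. (10)."""
    r_course = (np.pi / 2 - abs(angle_diff(psi, phi_d))) / (np.pi / 2)   # Eq. (8)
    r_rudder = -abs(delta_prev - delta) / dt                              # Eq. (9)
    return k1 * r_course + k2 * r_rudder
```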

3.3. Collision Avoidance Decision-Making

Most of the time, the ASV navigates along designated routes in open waters and relies only on the path-following controller. However, in more complex waters, the ASV must take appropriate measures to avoid obstacles that could lead to a collision. Collision avoidance algorithms of this kind are local planning methods that must run in real time and account for the dynamic constraints of the vessel [27]. Similar to path-following control, the collision avoidance problem can also be framed as an MDP and combined with DRL for policy optimization.
In this paper, collision avoidance decision-making is based on LiDAR, a sensor that uses laser beams to measure distances and angles, allowing real-time collection of environment data, including the position, shape, and size of obstacles. Figure 3a illustrates a LiDAR on an ASV emitting laser beams in two-dimensional space, where dark circles indicate obstacles, green lines represent laser beams that do not detect any objects, and red lines show beams that make contact with obstacles. Here, we define the maximum detection radius of the LiDAR as $r_{\max}$, the distance between the LiDAR and the point of contact as $r_{\text{obs}}$, and the total number of emitted rays as $N_l$. To make our method suitable for different LiDARs, the detection distances are normalized: if a beam $i$ does not hit any obstacle, the detection result is $r^i = 1$; otherwise, the normalized feedback is $r^i = r_{\text{obs}}^i / r_{\max} \in [0, 1)$.
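A small sketch of this normalization rule; the raw ranges below, with missed beams encoded as infinity, stand in for the output of the actual or simulated LiDAR.

```python
import numpy as np

def normalized_scan(raw_ranges: np.ndarray, r_max: float) -> np.ndarray:
    """Map raw detection distances to [0, 1]; beams with no hit (np.inf) become 1."""
    scan = np.where(np.isfinite(raw_ranges), raw_ranges / r_max, 1.0)
    return np.clip(scan, 0.0, 1.0)

# example: 8 beams, two hits at 21 m and 56 m, r_max = 70 m
raw = np.array([np.inf, 21.0, np.inf, np.inf, 56.0, np.inf, np.inf, np.inf])
print(normalized_scan(raw, r_max=70.0))
```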
Specifically, the decision maker first determines the collision avoidance course $\phi$, which is then used as one of the inputs for the course tracking controller described in Section 3.2. Simultaneously, the propeller revolving speed $n$ is also controlled. Therefore, the agent's action space consists of the collision avoidance course $\phi$ and the propeller speed $n$. Since the ASV needs to avoid collisions while approaching its destination, it should take into account not only the feedback information $F = [r^1, r^2, \ldots, r^{N_l}] \in \mathbb{R}^{N_l}$ provided by the LiDAR, but also its own yaw angle $\psi$, desired goal course $\varphi_d$, collision avoidance course $\phi$, drift angle $\beta$, linear velocities $u$, $v$, and angular velocity $r$. Consequently, the state vector of the decision-making agent at time $t$ is $s_t = [F_t, \sin\psi_t, \cos\psi_t, \sin\varphi_d^t, \cos\varphi_d^t, \sin\phi_t, \cos\phi_t, \sin\beta_t, \cos\beta_t, U_t, r_t]$, and the action vector is $a_t = [\phi_t, n_t]$.
The reward function $r_o^t$ for the collision avoidance decision maker is composed of four parts: $R_{\text{course}}$, $R_{\text{target}}$, $R_{\text{yaw}}$, and $R_{\text{collision}}$. Specifically, $R_{\text{course}}$ and $R_{\text{target}}$ are designed to assist the ASV in approaching $P_T$, which serves as the point guiding the vessel to its destination, while $R_{\text{yaw}}$ aims to prevent excessive changes in the yaw angle. The formulas for $R_{\text{course}}$, $R_{\text{target}}$, and $R_{\text{yaw}}$ are provided below, where $\operatorname{dot}(\cdot)$ is the dot product of two vectors.
$R_{\text{course}} = \operatorname{dot}\!\big((\cos\psi_t, \sin\psi_t),\ (\cos\varphi_d^t, \sin\varphi_d^t)\big) \quad (11)$
$R_{\text{target}} = \|(x_{t-1}, y_{t-1}) - P_T\|_2 - \|(x_t, y_t) - P_T\|_2 \quad (12)$
$R_{\text{yaw}} = \begin{cases} -\,|\psi_t - \psi_{t-1}| & \text{if } |\psi_t - \psi_{t-1}| < \pi \\ -\,\big(2\pi - |\psi_t - \psi_{t-1}|\big) & \text{otherwise} \end{cases} \quad (13)$
For $R_{\text{collision}}$, when the ship collides with an obstacle, it receives a penalty of $-1$. If none of the lasers detect an obstacle, the reward is 0. If the ship does not collide but some of its lasers make contact with obstacles, the reward is calculated from the weight of each laser beam and the corresponding detection distance. As illustrated in Figure 3b, the laser beam directly in front of the ship is assigned an index of 0, with the indices of the other laser beams increasing in a clockwise direction; a darker beam color indicates a higher weight value. The equations for computing the laser weight vector $W \in \mathbb{R}^{N_l}$ are shown in Equations (14) and (15), where $k_3$ is a small negative constant and $\chi_i \in [0, N_l)$ is the index of laser beam $i$. The formula for $R_{\text{collision}}$ is shown in Equation (16), where $\mathbf{1}$ denotes the vector of ones.
$w_i = \begin{cases} \exp\!\big(k_3 \cdot (N_l - \chi_i)\big) & \text{if } (N_l \text{ is even} \wedge \chi_i > N_l/2) \ \vee\ (N_l \text{ is odd} \wedge \chi_i > (N_l - 1)/2) \\ \exp\!\big(k_3 \cdot \chi_i\big) & \text{otherwise} \end{cases} \quad (14)$
$W = [w_1, w_2, \ldots, w_{N_l}], \quad w_i \in [0, 1],\ i = 1, \ldots, N_l \quad (15)$
$R_{\text{collision}} = \begin{cases} -1 & \text{if a collision occurs} \\ 0 & \text{if } \min(F_t) = 1 \\ W_t (F_t - \mathbf{1})^T & \text{otherwise} \end{cases} \quad (16)$
According to the formulas above, if there are obstacles directly in front of the ship, the agent receives a larger negative reward than when obstacles are located in other directions, as obstacles ahead present a higher risk of collision. The total reward function $r_o^t$ can be expressed as Equation (17), where $k_4$, $k_5$, $k_6$, and $k_7$ are all positive weight coefficients.
$r_o^t = k_4 \cdot R_{\text{course}} + k_5 \cdot R_{\text{target}} + k_6 \cdot R_{\text{yaw}} + k_7 \cdot R_{\text{collision}} \quad (17)$
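Putting Equations (14)-(16) together, here is a sketch of the laser-weight vector and the collision term of the reward, following the reconstruction above; the value of $k_3$ is illustrative.

```python
import numpy as np

def laser_weights(n_l: int, k3: float = -0.1) -> np.ndarray:
    """Eq. (14): weights peak at the bow beam (index 0) and decay towards the stern."""
    chi = np.arange(n_l)
    half = n_l / 2 if n_l % 2 == 0 else (n_l - 1) / 2
    return np.where(chi > half, np.exp(k3 * (n_l - chi)), np.exp(k3 * chi))

def collision_reward(scan: np.ndarray, weights: np.ndarray, collided: bool) -> float:
    """Eq. (16): -1 on collision, 0 with no detections, weighted proximity penalty otherwise."""
    if collided:
        return -1.0
    if np.min(scan) >= 1.0:
        return 0.0
    return float(weights @ (scan - 1.0))   # scan entries in [0, 1), so this is negative
```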

4. Autonomous Navigation and Collision Avoidance Methods Based on Deep Reinforcement Learning

4.1. Soft Actor–Critic Algorithm

For an RL agent in state $s$ at time $t$, we denote the expected discounted return with respect to the policy $\pi$ as $V^\pi(s)$. For the agent taking action $a$ in state $s$ at time $t$, the expected return is expressed as $Q^\pi(s, a)$ [23]. The corresponding formulas are shown in Equations (18) and (19).
$V^\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid s_t = s \right] \quad (18)$
$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid s_t = s, a_t = a \right] \quad (19)$
This paper utilizes SAC to train the DRL policy. SAC is a state-of-the-art DRL algorithm that employs the actor–critic framework and learns a stochastic policy while incorporating an entropy regularization term to enhance exploration [28]. It uses a total of five networks: one actor network $\pi$ with parameters $\theta$; two critic networks $Q_1$ and $Q_2$ with parameters $\omega_1$ and $\omega_2$; and two target critic networks $\hat{Q}_1$ and $\hat{Q}_2$ with parameters $\omega_1'$ and $\omega_2'$. Since SAC is an off-policy DRL algorithm, it maintains a replay buffer $D$ for policy training. The actor and critics are updated through gradient back-propagation, as shown in Equations (20) and (21).
$\theta \leftarrow \theta - \lambda \nabla_\theta\, \mathbb{E}_{s \sim D,\, a \sim \pi(\cdot \mid s)}\!\left[ \alpha \log \pi(a \mid s) - \min_{k=1,2} Q_k(s, a) \right] \quad (20)$
$\omega_j \leftarrow \omega_j - \lambda \nabla_{\omega_j}\, \frac{1}{2}\, \mathbb{E}_{(s, a) \sim D}\!\left[ \left( Q_j(s, a) - r(s, a) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\!\left[ \min_{k=1,2} \hat{Q}_k(s', a') \right] \right)^2 \right], \quad j = 1, 2, \quad (21)$
where $\lambda$ is the learning rate and $\alpha$ is the regularization coefficient. The target critics are soft-updated at specific intervals, as shown in Equation (22), where $\zeta$ represents the soft update coefficient.
$\omega_j' \leftarrow \zeta \omega_j + (1 - \zeta)\, \omega_j', \quad j = 1, 2 \quad (22)$
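The following condensed PyTorch-style sketch shows the soft update of Equation (22) and a Bellman target of the form used in Equation (21); the entropy term in the target follows the original SAC formulation [28], and the surrounding training loop, networks, and replay buffer are omitted.

```python
import torch

def soft_update(target_net: torch.nn.Module, net: torch.nn.Module, zeta: float) -> None:
    """Eq. (22): omega' <- zeta * omega + (1 - zeta) * omega'."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - zeta).add_(zeta * p)

def critic_target(reward, next_q1, next_q2, next_logp, gamma, alpha, done):
    """Bellman target for the critic update; -alpha*log pi is the SAC entropy bonus [28]."""
    next_q = torch.min(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * (1.0 - done) * next_q
```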

4.2. Autonomous Navigation Control and Collision Avoidance Decision-Making Method

4.2.1. Policies Training

For the path-following agent, an LSTM module is introduced to improve its prediction ability, considering the delay in ship maneuvering. As illustrated in Figure 4, the LSTM features a memory cell and gate mechanism that effectively captures long- and short-term dependencies in sequential data [29]. Therefore, the past state series $s_{t-N_s}, s_{t-N_s+1}, \ldots, s_{t-1}$, along with the current state $s_t$, are input to the LSTM layer to extract time-series features, which enables the ASV to forecast its own state. The output of the LSTM is then used as the input to the subsequent layers of the model. For the collision avoidance policy, the ASV is trained in a setting with randomly placed static obstacles. The LiDAR first emits $N_l$ laser beams centered on the ship's current position and then iterates over each beam to generate the feedback list $F$. The agent combines its motion parameters and the list $F$ to compute the current state $s_t$. This policy is also trained with the SAC algorithm and likewise incorporates an LSTM module to improve the agent's predictive capability.
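An illustrative sketch (not the exact architecture of Figure A1) of an actor that first processes the recent state sequence with an LSTM and then outputs the parameters of a squashed Gaussian policy; the layer sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Actor head that extracts temporal features from the last N_s + 1 states."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state_seq: torch.Tensor):
        # state_seq: (batch, N_s + 1, state_dim); keep only the last LSTM output
        features, _ = self.lstm(state_seq)
        h = features[:, -1, :]
        return torch.tanh(self.mu(h)), self.log_std(h).clamp(-20, 2)
```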
Figure 5 shows the flowchart of the policy training process. The agent interacts with the environment by taking action(s) and receives rewards at each time step; then, it updates the current state and determines whether the episode is terminated or not. In the case of the path-following policy, if the ship arrives at the target point, the maximum time step is reached, or the ship exceeds the map boundaries, the current episode is terminated, and the next new episode will commence. Whereas for the collision avoidance policy, besides the condition for the distance between the ASV and the target point, the episode also ends if the maximum time step is reached without any collisions.

4.2.2. Trajectory Prediction of Dynamic Obstacle

Since the collision avoidance policy is trained in an environment with static obstacles, applying the policy directly to dynamic scenarios for ASV may not yield ideal results. As illustrated in Figure 6, it is assumed that the historical data of dynamic obstacles’ coordinates within the map area are accessible to the ASV, allowing it to forecast the future trajectories of obstacles. In addition, all the predicted trajectory points are treated as imagined obstacles for the ASV to avoid potential collisions.
The Kalman filter (KF) is a widely used algorithm for estimating states and is composed of three steps: measurement, state update, and prediction [30]. In this paper, we apply an improved KF algorithm for short-term prediction of motion trajectories based on the kinematic model of dynamic obstacles [31]. Here, the state vector $X$ is defined as $X = [x, u, a_x, y, v, a_y]^T$, where $x$, $y$ are the positions, $u$, $v$ are the linear velocities, and $a_x$, $a_y$ are the linear accelerations, all referenced in the Earth coordinate system. The state extrapolation equation is presented in Equation (23).
$X_{t+1} = A X_t + B u_t + E \tilde{j}_t, \quad (23)$
where $X_t$ is the state at time $t$, $u_t \in \mathbb{R}^2$ is the control input, and $\tilde{j}_t \in \mathbb{R}^2$ is the noise. $A$ is the state transition matrix, $B$ is the control matrix, and $E$ is the noise matrix. According to the kinematic model, the expressions for $A$, $B$, and $E$ are shown in Equations (24), (25), and (26), respectively. Since both $u_t$ and $\tilde{j}_t$ are taken as the acceleration rates in this paper, the original state equation (23) can be rewritten as Equation (27).
$A = \begin{bmatrix} A_x & 0 \\ 0 & A_y \end{bmatrix} \quad (24)$
$A_x = A_y = \begin{bmatrix} 1 & \Delta t & \frac{\Delta t^2}{2} \\ 0 & 1 & \Delta t \\ 0 & 0 & 1 \end{bmatrix} \quad (25)$
$B = E = \begin{bmatrix} \frac{\Delta t^3}{6} & \frac{\Delta t^2}{2} & \Delta t & 0 & 0 & 0 \\ 0 & 0 & 0 & \frac{\Delta t^3}{6} & \frac{\Delta t^2}{2} & \Delta t \end{bmatrix}^T \quad (26)$
$X_{t+1} = A X_t + B\,(u_t + \tilde{j}_t) \quad (27)$
We denote the new control vector at time $t$ as $\tilde{u}_t = u_t + \tilde{j}_t$, with its covariance represented as $q_{\tilde{u}_t}$, its estimated value as $\hat{\tilde{u}}_t$, and the covariance of its estimated value as $q_{\hat{\tilde{u}}_t}$. In the KF prediction stage, we define the predicted state and covariance at time $t$ as $\hat{X}_t$ and $P_t$; then, the state $\hat{X}_{t+1}$ and covariance $P_{t+1}$ at time $t+1$ are calculated through Equations (28) and (29). Since the ASV only has access to the coordinate data of obstacles, $\hat{\tilde{u}}_t$ and $q_{\hat{\tilde{u}}_t}$ are predicted by an additional LSTM unit in this paper. We denote the output $h_t$ of the LSTM at time $t$ as $h_t = [\hat{\tilde{u}}_t, q_{\hat{\tilde{u}}_t}]$. Then, we have Equation (30).
$\hat{X}_{t+1} = A \hat{X}_t + B\, \hat{\tilde{u}}_t \quad (28)$
$P_{t+1} = A P_t A^T + B\, q_{\hat{\tilde{u}}_t} q_{\hat{\tilde{u}}_t}^T B^T \quad (29)$
$[h_t, c_t] = \operatorname{LSTM}\!\big(\hat{X}_{t-1}, h_{t-1}, c_{t-1}\big) \quad (30)$
$\sigma \leftarrow \underset{\sigma}{\operatorname{argmin}}\ \operatorname{MSE}\!\big(\operatorname{Predict}(N_h, \sigma),\ N_f\big) \quad (31)$
Let the LSTM parameters be $\sigma$. To train the network, we assume there is a database $D_{KF}$ containing the trajectories of moving objects. We sample a batch of trajectories of size $N_{KF}$, and each trajectory is split into a historical segment with a horizon of $N_h$ and a future segment with a horizon of $N_f$. The predicted values from the LSTM model based on the points within $N_h$, i.e., $\operatorname{Predict}(N_h, \sigma)$, are compared to the future observations within $N_f$ by calculating the mean squared error (MSE) loss. The network parameters are then updated through back-propagation, as shown in Equation (31).
The KF executes measurement, state update, and prediction repeatedly to estimate states [30]. In this paper, the KF is initialized using historical observations from time 0 to $N_h - 1$ as usual. After time $N_h - 1$, however, only the prediction step is performed, without utilizing any future observations within the horizon $N_f$. As illustrated in Algorithm 1, with the KF initialized using the historical data, the state and covariance estimates at time $N_h$ are used as the initial values for trajectory prediction, which is based on Equations (28) and (29).
Algorithm 1: KF-LSTM Algorithm
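A numpy sketch of the prediction stage of Algorithm 1 (Equations (24)-(29)): the transition matrices are built from the kinematic model, and the per-step control estimate $\hat{\tilde{u}}_t$ and its covariance are assumed to be supplied by the trained LSTM; function names are illustrative.

```python
import numpy as np

def transition_matrices(dt: float):
    """A, B from Eqs. (24)-(26) for the state X = [x, u, a_x, y, v, a_y]."""
    ax = np.array([[1.0, dt, dt**2 / 2],
                   [0.0, 1.0, dt],
                   [0.0, 0.0, 1.0]])
    A = np.block([[ax, np.zeros((3, 3))],
                  [np.zeros((3, 3)), ax]])
    b = np.array([dt**3 / 6, dt**2 / 2, dt])
    B = np.zeros((6, 2))
    B[:3, 0] = b          # x-channel
    B[3:, 1] = b          # y-channel
    return A, B

def kf_predict(x_hat, P, u_hat, q_u, dt):
    """Prediction step, Eqs. (28)-(29), with u_hat and q_u supplied by the LSTM."""
    A, B = transition_matrices(dt)
    x_next = A @ x_hat + B @ u_hat
    P_next = A @ P @ A.T + B @ np.outer(q_u, q_u) @ B.T
    return x_next, P_next
```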

4.2.3. Hybrid Autonomous Navigation and Collision Avoidance System in Complex Environments

The flowchart of the system is illustrated in Figure 7. In practice, the ASV encounters various static and dynamic obstacles and is also affected by external factors such as wind, waves, and currents. For navigation in a complex environment, the vessel first conducts global path planning with map data to generate a feasible collision-free path from the starting point to the destination; the path can be designed either manually or through global path planning algorithms. Once a collision-free path is determined, in the path-following module, the ASV continuously updates local lookout points $P_T$ and calculates the desired course $\varphi_d$ at each moment. The angular error is then calculated and fed into the course tracking controller, which produces control commands to steer the vessel safely along the designated path.
The criterion for switching policies is based on the collision risk. When the vessel does not identify any dynamic obstacles during its navigation, the system defaults to the autonomous navigation mode and follows the pre-defined path; otherwise, it shifts to the collision avoidance mode. Here, we introduce the closest point of approach (CPA) as a metric for evaluating the collision risk [32]. As shown in Figure 8, $P_A$ represents the coordinates of the ASV and $P_B$ the coordinates of the dynamic obstacle, and the vectors $\upsilon_A, \upsilon_B \in \mathbb{R}^2$ are their linear velocities in the Earth coordinate system. At time $t_{CPA}$, the ASV and the obstacle simultaneously reach their respective closest points of approach, separated by a distance $d_{CPA}$. The formulas for $t_{CPA}$ and $d_{CPA}$ are shown in Equations (32) and (33), respectively.
$t_{CPA} = \dfrac{(P_B - P_A) \cdot (\upsilon_A - \upsilon_B)}{\|\upsilon_A - \upsilon_B\|_2^2} \quad (32)$
$d_{CPA} = \big\| (P_A + \upsilon_A t_{CPA}) - (P_B + \upsilon_B t_{CPA}) \big\|_2 \quad (33)$
When the ASV satisfies both $0 \leq t_{CPA} \leq t_{\max}$ and $d_{CPA} \leq d_{\min}$, there is a potential collision risk. At this time, the ship predicts the trajectories of the dynamic obstacles, avoiding virtual and actual obstacles at the same time. It is worth noting that, in collision avoidance mode, the forward viewing distance $\Delta_2$ used to search for a local lookout point along the global path needs to exceed the $\Delta_1$ used in path following, in order to avoid urgent maneuvers.
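A short sketch of the collision-risk test based on Equations (32) and (33); the thresholds $t_{\max}$ and $d_{\min}$ below are illustrative, not the values used in the experiments.

```python
import numpy as np

def cpa(p_a, v_a, p_b, v_b):
    """Time and distance at the closest point of approach, Eqs. (32)-(33)."""
    rel_v = v_a - v_b
    denom = np.dot(rel_v, rel_v)
    t_cpa = 0.0 if denom < 1e-9 else np.dot(p_b - p_a, rel_v) / denom
    d_cpa = np.linalg.norm((p_a + v_a * t_cpa) - (p_b + v_b * t_cpa))
    return t_cpa, d_cpa

def collision_risk(p_a, v_a, p_b, v_b, t_max=300.0, d_min=50.0):
    """True if the CPA criterion 0 <= t_CPA <= t_max and d_CPA <= d_min is met."""
    t_cpa, d_cpa = cpa(np.asarray(p_a), np.asarray(v_a), np.asarray(p_b), np.asarray(v_b))
    return 0.0 <= t_cpa <= t_max and d_cpa <= d_min
```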

5. Results and Discussion

5.1. The Parameters Setting

The parameters of the ASV adopted in the present study are based on a 1:300 model-scale VLCC with a single propeller and a single rudder. The mass $m$ of the model is 10 kg, the overall length $L_{OA}$ is 1.11 m, and the width $B$ is 0.165 m. The major parameters of this model-scale vessel are detailed in Table 1. Since the revolution speed of the propeller is proportional to the pulse-width modulation (PWM) signal of the motor, the PWM value is used in place of the revolution speed $n$, with a range of $\text{PWM} \in [-100, 100]$. The permissible range for the rudder angle is $\delta \in [-30\ \text{deg}, 30\ \text{deg}]$. The algorithms are trained on an NVIDIA RTX 4080 GPU using PyTorch 2.1, and the environment is built on OpenAI Gym. With the current hardware and computational capability, the average inference time per step of the proposed policy is less than 50 ms, which meets the real-time requirement for the ASV.
The other algorithm parameters and the network structures of the actor and critics are shown in Table 2 and Figure A1, respectively. In the path-following task, $k_1$ and $k_2$ are set to the same positive value. For the collision avoidance task, the coefficient $k_7$ has the highest value, in order to impose a large penalty when the ASV is close to or collides with obstacles. Both $k_4$ and $k_5$ pertain to the ASV approaching the target point and are greater than $k_6$, which is relatively less important during navigation. The simulation environment for training the path-following and collision avoidance policies is a 1280 m × 720 m collision-free open water area. During policy training and verification, the starting point and the target point of the ASV are randomly initialized at the start of each episode. All components of the initial velocity vector $\nu_0 = [u_0, v_0, r_0]^T$ are set to 0.

5.2. Training Results and Discussion

5.2.1. Path-Following Policy

The policy is trained for a total of 2,000,000 iterative steps, with every 1000 steps counted as one epoch. The model parameters of the actor are saved after each epoch, and the model is then verified over 10 episodes to assess the success rate and average reward per step of the agent. The initial yaw angle of the ASV is randomized before the start of each episode, and the PWM is 80. To illustrate the training process of the policy, Figure 9 shows the ship trajectories at three different training stages with fixed starting and target points. In Figure 9a, the ASV follows an entirely stochastic policy, and no trajectory arrives at the target point. As the number of training epochs increases, the ASV gradually learns how to complete the path-following task, as shown in Figure 9b, and eventually all of the trajectories reach the destination, as shown in Figure 9c. Figure 10a displays the ratio of episodes with successful arrival at the target point to the total number of episodes in the first 200 epochs; "success" here means that the ASV arrives at the target point. Figure 10b illustrates the average reward and standard deviation for each step throughout the entire validation process. Notably, in the first 50 epochs, the success rate experiences considerable fluctuations, reflecting the performance of the DRL-based path-following policy in the early training stage, while the average reward per step keeps increasing. In the subsequent epochs, the success rate levels off at 1. Additionally, the variance of the average reward decreases continuously, with the mean value eventually stabilizing around 0.7.

5.2.2. Collision Avoidance Policy

In this section, the number of static obstacles in the training environment is 25, with a radius range of [7 m, 12 m]. The maximum detection radius $r_{\max}$ of the LiDAR is 70 m, and the number of laser beams $N_l$ is 64. The initial yaw angle is fixed as the angle between the starting point and the target point. The collision avoidance policy is trained for a total of 5,000,000 iterative steps, with every five episodes considered as one epoch. The actor parameters are also saved after each epoch, with the model validated over 10 episodes.
Similar to Figure 9, ship trajectories from three different training stages are shown in Figure 11; the obstacles, the starting point, and the target point remain unchanged. Although most of the trajectories in Figure 11 successfully arrive at the target point, those in Figure 11c are more compact than those from earlier epochs. Figure 12a shows the success rate of the ASV, and Figure 12b presents the average reward and standard deviation for each step. The success rate quickly rises to 1.0 within a few epochs and stabilizes between 0.9 and 1.0 during training. The average reward per step also stabilizes rapidly between 1.0 and 1.2, albeit with distinct fluctuations in the standard deviation.

5.2.3. KF-LSTM Predictor

Here, we utilize a publicly accessible ship AIS dataset provided by the Danish Maritime Authority as $D_{KF}$ for training the KF-LSTM predictor model [33]. Figure 13 visualizes the AIS dataset, which includes a total of 11,888 navigation trajectories, with each trajectory containing the ship's Maritime Mobile Service Identity (MMSI) code, time steps, longitudes, latitudes, headings, and velocities. The sampling batch size $N_{KF}$ is 128, and a total of 1000 epochs are trained. The model contains one LSTM hidden layer with a hidden size of 60. Figure 14 shows the change in the MSE loss during the training and validation processes, indicating a generally decreasing trend from epoch 0 to 200. As the number of epochs increases further, the average training loss ultimately converges to approximately 0.5 and shows less fluctuation than the average validation loss, which stabilizes between 0 and 1.

5.3. Experimental Results and Discussion

5.3.1. Circular Path-Following Simulations

In this subsection, we further test the performance of our path-following policy using two scenarios involving the same circular path. The vessel first navigates along a straight segment before completing a full loop of the circular path. The coordinates of the starting point are (440 m, 300 m), and those of the target point are (439 m, 169 m), with the PWM of the ASV fixed at 80. Environmental disturbances, including wind, current, and waves, are considered in this subsection.
Equation (34) presents the wind load exerted on a structure [34], where $\rho_a$ denotes the air density, $U_R$ indicates the relative wind speed, and $\alpha_R$ signifies the relative wind angle. $A_f$ refers to the frontal projected area of the ship, and $A_s$ is the lateral projected area of the ASV above the waterline. $C_x^{wind}$, $C_y^{wind}$, and $C_n^{wind}$ are the wind load coefficients, which are obtained from standard CFD wind load simulations and shown in Figure 15.
$\tau_{wind} = \frac{1}{2}\rho_a U_R^2 \begin{bmatrix} A_f\, C_x^{wind}(\alpha_R) \\ A_s\, C_y^{wind}(\alpha_R) \\ A_s L_{OA}\, C_n^{wind}(\alpha_R) \end{bmatrix} \quad (34)$
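A sketch of Equation (34); the wind-load coefficient curves are assumed to be tabulated from the CFD results in Figure 15 and interpolated here, so the interpolation call and parameter names are illustrative.

```python
import numpy as np

def wind_load(u_r, alpha_r, rho_a, a_f, a_s, l_oa, cx_tab, cy_tab, cn_tab, alpha_tab):
    """tau_wind from Eq. (34); cx/cy/cn_tab are coefficient tables over relative wind angle."""
    cx = np.interp(alpha_r, alpha_tab, cx_tab)
    cy = np.interp(alpha_r, alpha_tab, cy_tab)
    cn = np.interp(alpha_r, alpha_tab, cn_tab)
    q = 0.5 * rho_a * u_r**2                      # dynamic pressure of the relative wind
    return np.array([q * a_f * cx,                # surge force
                     q * a_s * cy,                # sway force
                     q * a_s * l_oa * cn])        # yaw moment
```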
For the wave load, only the second-order drift forces and moment are considered, and the empirical equations are shown in Equation (35) [35].
$\tau_{wave} = \rho_c g L_{OA} \begin{bmatrix} \cos(\vartheta) \int_0^{\infty} S(\omega)\, C_x^{wave}(\lambda_\omega)\, d\omega \\ \sin(\vartheta) \int_0^{\infty} S(\omega)\, C_y^{wave}(\lambda_\omega)\, d\omega \\ \sin(\vartheta)\, L_{OA} \int_0^{\infty} S(\omega)\, C_n^{wave}(\lambda_\omega)\, d\omega \end{bmatrix}, \quad (35)$
where $\rho_c$ denotes the fluid density, $g$ is the gravitational acceleration, $\vartheta$ represents the wave encounter angle, $\omega$ is the wave circular frequency, and $\lambda_\omega$ is the wavelength corresponding to the frequency $\omega$. $S(\omega)$ is the wave spectral density; in this paper, the spectrum is adapted from [36]. $C_x^{wave}$, $C_y^{wave}$, and $C_n^{wave}$ are the wave load coefficients. The specific values of the environmental parameters used in the simulation are detailed in Table 3. These environmental parameters are selected based on the upper limit of the model-scale VLCC's thrust forces in order to demonstrate its performance in a harsh sea environment.
Figure 16a provides a schematic view of the process in a static water environment (Scenario I), whereas Figure 16b depicts the scenario in an environment subject to disturbances (Scenario II), which include the forces and moments of wind, current, and waves. Figure 17 shows the change in the yaw angle, the distance error between the actual path and the global path, and the rudder angle over time in the two environments, and Table 4 lists the path-following errors for the whole process. In the static environment, the path error initially reaches a maximum of 1.098 m. At this point, the rudder is adjusted significantly to realign the yaw angle to -90 deg, which corresponds to the direction of the local target point on the straight-line segment. The path deviation on this segment is subsequently minimal, and it then fluctuates between 0.2 and 0.4 m on the circular path. The rudder angle remains stable within a range of -10 to 10 deg, and the heading transitions smoothly in accordance with the curvature of the circular path. For the task in the environment with disturbances, the average deviation of 3.979 m is higher than that in the static environment, and the maximum error of 7.886 m occurs at 396.5 s. The changes in heading and rudder steering behavior also exhibit significant differences, attributable to the disturbances in this environment.

5.3.2. Model Tests of Autonomous Navigation Method

Model tests are conducted in this section to evaluate the performance of our autonomous navigation method based on the path-following policy. As illustrated in Figure 18, the prototype system comprises a VLCC ship model, a perception module that incorporates an inertial measurement unit (IMU) and an AprilTag visual recognition system, a decision-making and control module that includes a path points generator, a state calculator, and a path-following policy, and an execution module including a driver, a servo, and a motor.
Figure 19 provides an overview of the two model test maps in a pool. In each map, we perform two model tests with different starting points and target points. At the beginning of each test, a series of guiding points is pre-determined, and the vessel is required to navigate along these points with the propeller's PWM set to 80. Taking test 1-2 and test 2-1 as examples, Figure 20 and Figure 21 present crucial navigation snapshots. In Figure 20a, the ASV has just started its navigation; in Figure 20b,c, it successfully avoids obstacles around the bend; and in Figure 20d, it reaches its target point. Similarly, the ASV is in its initial navigation stage in Figure 21a; in Figure 21b, it bypasses the obstacle; it then gradually approaches and finally reaches its destination, as shown in Figure 21d.
Comparisons between the experimental trajectory and the simulated trajectory in each test are shown in Figure 22. The red line denotes the experimental trajectory, while the black dotted line is the simulated trajectory. Deviations between the trajectories are observed, mainly due to the limited scale of the pool, which restricts the turning space of the vessel; thus, the guiding points only serve to provide the desired course. Figure 23 shows the changes in the deviation value, and Table 5 lists the average and maximum distance errors for each test. The results indicate that the average and maximum values are less than 0.3 m and 0.6 m, respectively. Notably, the maximum trajectory deviation of 0.587 m occurs in test 1-1 at point index 340.

5.3.3. Collision Avoidance in Stochastic Environments

Simulation experiments, focused on the collision avoidance policy in environments characterized by random static and multiple dynamic obstacles, are conducted in this section. First, Figure 24 shows six simulation trajectory results of static obstacle avoidance experiments in calm water. The size of the simulation map is 640 m × 360 m, and the initial yaw angle of the ASV, radius range of the obstacle, detection radius of the LiDAR, and the number of laser beams are the same as those in Section 5.2.2. It is evident that, at various coordinate positions, the vessel effectively detects static obstacles and successfully achieves autonomous collision avoidance.
In Figure 25, the collision avoidance tasks are conducted in calm water, and the ASV encounters a single dynamic obstacle in head-on, crossing 1, crossing 2, and overtaking situations, respectively. The positions of the ASV and the obstacle are recorded every 150 s. Figure 26a shows the change in the distance between the ASV and each dynamic obstacle. The specific minimum distance values are presented in Table 6, and all of the minimum distances are above 40 m.
Figure 27 illustrates schematic representations of the ASV successfully avoiding four dynamic obstacles under environmental disturbances. Figure 26b presents the variation in distance between the vessel and the obstacles over time, with the minimum distances detailed in Table 7. The initial position of the vessel is (140 m, 80 m), and the destination is set at (370 m, 300 m). The data indicate that, prior to 250 s, the vessel proceeds directly towards its target. At 400 s, with virtual obstacles detected, the vessel slightly changes its heading. By 500 s, the trajectories of obstacles No.2 and No.3 have been predicted, and the ship maneuvers to avoid the imagined obstacles, finally bypassing No.2 and No.3 at 550 s. The ship then encounters obstacle No.4 and preemptively maneuvers to avoid its predicted trajectory, navigating to the right side of No.4 while approaching the target point. At 839.5 s, the vessel reaches its closest proximity to obstacle No.4.

5.3.4. Autonomous Navigation and Collision Avoidance in Complex Environment

This section presents simulations of hybrid ship autonomous navigation and collision avoidance in complex environments. The navigation commences at the coordinates (465 m, 634 m) and terminates at (905 m, 514 m). Upon detection of obstacles by the LiDAR system (excluding virtual obstacles), the forward viewing distance $\Delta_2$ is employed to select the local lookout point; in other cases, the forward viewing distance is set to $\Delta_1$. The environmental conditions are consistent with those detailed in Table 3.
Figure 28 shows the simulation results at specific time steps. Figure 29 provides the distances between the ASV and the other obstacles, the distance between the ASV's actual path and the global path, and the rudder angle and propeller PWM over time; the minimum distances are listed in Table 8. At the beginning, a global path avoiding static obstacles is provided. Prior to 450 s, the vessel executes the path-following mode with a constant propeller PWM; due to the environmental disturbances, there is a steady path error of around 5 m. After 450 s, the vessel turns right to avoid No.1, eventually crossing ahead of it; at 665 s, the minimum distance between the two is 76.321 m. Later, at 750 s, the vessel encounters obstacle No.2. It again turns right, with a minimum distance of 37.078 m at 989.5 s, and successfully clears No.2 after 1050 s. At 1350 s, the vessel is positioned directly in front of obstacle No.4; it continues its forward navigation, passing directly ahead of No.4 with a minimum distance of 30.866 m. After 1950 s, the vessel is situated behind obstacle No.3, and at 2100 s, it passes between the static obstacle and No.3, subsequently switching back to the path-following mode and gradually approaching No.5. At 2550 s, the vessel executes a right turn to avoid the impending obstacle, and at 2650 s the distance between the two is 25.081 m. Finally, at 2806.5 s, the vessel successfully arrives at the target point.

6. Conclusions

In this paper, we take an under-actuated VLCC ship model as the research subject. Using the SAC algorithm, a navigation behavior control system suitable for complex marine environments is proposed, enabling the vessel to autonomously follow pre-defined paths and avoid obstacles. To address the challenge of avoiding dynamic obstacles, a trajectory prediction algorithm that combines the KF and LSTM is introduced. When the vessel identifies a potential collision risk using the CPA criteria, it predicts the future trajectories of dynamic obstacles based on their historical coordinate data and treats the points on the predicted trajectories as virtual obstacles to avoid, thus taking collision avoidance measures in advance. Throughout the training process of the path-following policy, the success rate consistently remains at 100% after 50 epochs. For the collision avoidance policy, the success rate stabilizes between 90% and 100%. By performing numerical simulation tests and building a prototype system for the ship model, the path-following policy is validated in different navigation environments. Additionally, through numerical simulations, successful autonomous navigation and collision avoidance tests are completed with static unknown obstacles, a single dynamic obstacle in different encounter scenarios, multiple dynamic obstacles, and complex environments.
However, considering the practical use of the DRL-based autonomous navigation, the following research could be carried out in future work:
  • The real-world conditions may degrade sensor performance. Future work will first integrate multi-sensor fusion and noise-adaptive perception modules to enhance robustness, building on the model-agnostic advantages of the current framework.
  • While this work demonstrates the efficacy of the proposed hybrid framework for under-actuated ASV path following and collision avoidance, comprehensive comparisons with traditional methods or other DRL algorithms are left to future studies, as they require tailored adaptations to partial observability and actuator constraints.
  • This study assumes dynamic obstacles follow predefined trajectories, excluding reciprocal interactions. Future work will integrate game-theoretic, COLREGS, and MARL frameworks to model adaptive multi-agent behaviors, critical for crowded waterways.

Author Contributions

Methodology, Y.W. and L.W.; Validation, Y.W.; Investigation, Z.L. and L.W.; Resources, Z.L.; Data curation, Y.W.; Writing—original draft, Y.W.; Writing—review and editing, Z.L. and L.W.; Supervision, X.W.; Project administration, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Key Laboratory of Maritime Technology and Safety (Grant No. W24CG000040) and the National Natural Science Foundation of China (Grant No. 42406205).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Network Structures of Actor and Critic in SAC

The network structures in this section are applied in both the path-following policy and the collision avoidance policy, as shown in Figure A1 and Figure A2, where $D_s$ is the dimension of the state and $D_a$ is the dimension of the action.
Figure A1. The actor network structure.
Figure A2. The critic network structure.

References

  1. Yan, X.; Wang, S.; Ma, F. Review and prospect for intelligent cargo ships. Chin. J. Ship Res. 2021, 16, 1–6. [Google Scholar] [CrossRef]
  2. Zubowicz, T.; Armiński, K.; Witkowska, A.; Śmierzchalski, R. Marine autonomous surface ship-control system configuration. IFAC-PapersOnLine 2019, 52, 409–415. [Google Scholar] [CrossRef]
  3. de Vos, J.; Hekkenberg, R.G.; Banda, O.A.V. The Impact of Autonomous Ships on Safety at Sea—A Statistical Analysis. Reliab. Eng. Syst. Saf. 2021, 210, 107558. [Google Scholar] [CrossRef]
  4. Azar, A.T.; Koubaa, A.; Ali Mohamed, N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone Deep Reinforcement Learning: A Review. Electronics 2021, 10, 999. [Google Scholar] [CrossRef]
  5. Irshayyid, A.; Chen, J.; Xiong, G. A review on reinforcement learning-based highway autonomous vehicle control. Green Energy Intell. Transp. 2024, 3, 100156. [Google Scholar] [CrossRef]
  6. Lokukaluge, P.; Joao, C.; Carlos, G.S. Fuzzy logic based decision making system for collision avoidance of ocean navigation under critical collision conditions. J. Mar. Sci. Technol. 2011, 16, 84–99. [Google Scholar] [CrossRef]
  7. Campbell, S.; Naeem, W. A Rule-based Heuristic Method for COLREGS-compliant Collision Avoidance for an Unmanned Surface Vehicle. IFAC Proc. Vol. 2012, 45, 386–391. [Google Scholar] [CrossRef]
  8. Wu, B.; Cheng, T.; Yip, T.L.; Wang, Y. Fuzzy logic based dynamic decision-making system for intelligent navigation strategy within inland traffic separation schemes. Ocean. Eng. 2020, 197, 106909. [Google Scholar] [CrossRef]
  9. Fan, Y.; Sun, X.; Wang, G. An autonomous dynamic collision avoidance control method for unmanned surface vehicle in unknown ocean environment. Int. J. Adv. Robot. Syst. 2019, 16, 1729881419831581. [Google Scholar] [CrossRef]
  10. Jawhar, G.; Lamia, I.; Maarouf, S. Adaptive Finite Time Path-Following Control of Underactuated Surface Vehicle with Collision Avoidance. J. Dyn. Syst. Meas. Control 2019, 141, 121008. [Google Scholar] [CrossRef]
  11. Ge, Y.; Zhong, L.; Qiang, Z.J. Research on USV Heading Control Method Based on Kalman Filter Sliding Mode Control. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 1547–1551. [Google Scholar] [CrossRef]
  12. Surjeet, B.; Nishu, G.; Ahmed, A.; Isha, B.; Rami, M.; Sarmad, M.; Firas, A. A survey on deep reinforcement learning architectures, applications and emerging trends. IET Commun. 2022, 19, e12447. [Google Scholar] [CrossRef]
  13. Zhao, L.; Myung-Il, R.; Lee, S. Control method for path following and collision avoidance of autonomous ship based on deep reinforcement learning. J. Mar. Sci. Technol. 2019, 27, 1. [Google Scholar]
  14. Mohammad, E.; Nader, Z.; Mahtab, S.; Amilcar, S.; Bruno, B.M.; Stan, M. Using Deep Reinforcement Learning Methods for Autonomous Vessels in 2D Environments. In Advances in Artificial Intelligence; Springer: Cham, Switzerland, 2020; pp. 220–231. [Google Scholar]
  15. Wu, X.; Chen, H.; Chen, C.; Zhong, M.; Xie, S.; Guo, Y.; Fujita, H. The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method. Knowl.-Based Syst. 2020, 196, 105201. [Google Scholar] [CrossRef]
  16. Yan, N.; Huang, S.; Kong, C. Reinforcement Learning-Based Autonomous Navigation and Obstacle Avoidance for USVs under Partially Observable Conditions. Math. Probl. Eng. 2021, 2021, 5519033. [Google Scholar] [CrossRef]
  17. Zhou, C.; Wang, Y.; Wang, L.; He, H. Obstacle avoidance strategy for an autonomous surface vessel based on modified deep deterministic policy gradient. Ocean. Eng. 2022, 243, 110166. [Google Scholar] [CrossRef]
  18. Gao, M.; Kang, Z.; Zhang, A.; Liu, J.; Zhao, F. MASS autonomous navigation system based on AIS big data with dueling deep Q networks prioritized replay reinforcement learning. Ocean. Eng. 2022, 249, 110834. [Google Scholar] [CrossRef]
  19. Yang, X.; Han, Q. Improved reinforcement learning for collision-free local path planning of dynamic obstacle. Ocean. Eng. 2023, 283, 115040. [Google Scholar] [CrossRef]
  20. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; Wiley: Hoboken, NJ, USA, 2021. [Google Scholar] [CrossRef]
  21. Skjetne, R.; Smogeli, Ø.; Fossen, T.I. Modeling, identification, and adaptive maneuvering of CyberShip II: A complete design with experiments. IFAC Proc. Vol. 2004, 37, 203–208. [Google Scholar] [CrossRef]
  22. Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5064–5078. [Google Scholar] [CrossRef]
  23. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  24. Chen, W.; Qiu, X.; Cai, T.; Dai, H.; Zheng, Z.; Zhang, Y. Deep Reinforcement Learning for Internet of Things: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2021, 23, 1659–1692. [Google Scholar] [CrossRef]
  25. Zheng, Y. Research on Tracking Control of High-Speed Underactuated Unmanned Surface Vessels. Master’s Thesis, Harbin Engineering University, Harbin, China, 2021. [Google Scholar]
  26. Liu, J. Research on Track Planning and Tracking Control Algorithm of Underactuated Ship. Master’s Thesis, Wuhan University of Technology, Wuhan, China, 2022. [Google Scholar]
  27. Zhu, H.; Ding, Y. Optimized Dynamic Collision Avoidance Algorithm for USV Path Planning. Sensors 2023, 23, 4567. [Google Scholar] [CrossRef]
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  29. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  30. Welch, G.; Bishop, G. An Introduction to the Kalman Filter; University of North Carolina at Chapel Hill: Chapel Hill, NC, USA, 1994. [Google Scholar]
  31. Mercat, J.; Zoghby, N.E.; Sandou, G.; Beauvois, D.; Gil, G.P. Kinematic Single Vehicle Trajectory Prediction Baselines and Applications with the NGSIM Dataset. arXiv 2020, arXiv:1908.11472. [Google Scholar] [CrossRef]
  32. Stenersen, T.C. Guidance System for Autonomous Surface Vehicles. Master’s Thesis, NTNU, Trondheim, Norway, 2015. [Google Scholar]
  33. Nguyen, D.; Fablet, R. A Transformer Network with Sparse Augmented Data Representation and Cross Entropy Loss for AIS-Based Vessel Trajectory Prediction. IEEE Access 2024, 12, 21596–21609. [Google Scholar] [CrossRef]
  34. Journée, J.M.J.; Massie, W.W. Offshore Hydromechanics, 1st ed.; Delft University of Technology, Faculty of Civil Engineering and Geosciences: Delft, The Netherlands, 2000. [Google Scholar]
  35. Daidola, J.; Graham, D.; Chandrash, L. A simulation program for vessel’s maneuvering at slow speeds. In Proceedings of the 11th Ship Technology and Research Symposium (STAR), Portland, OR, USA, 21–23 May 1986. [Google Scholar]
  36. Shen, X.; Wang, C.; Lian, S.; Li, S. Wind wave spectrum estimation of small generating area by the maximum entropy method. J. Chang. Univ. Sci. Technol. (Nat. Sci.) 2007, 4, 39–43. [Google Scholar]
Figure 1. The reference coordinate systems. (a) The Earth and body-fixed coordinate systems in the six-DOF scenario. The origin O is located at the mass center of the ship. (b) The three-DOF horizontal Earth and body-fixed coordinate systems.
Figure 2. The general view of the path-following task in the three-DOF horizontal plane, which is based on course tracking control.
Figure 3. Schematic views of the LiDAR on the ASV. (a) The LiDAR on the ASV emits laser beams in the two-dimensional plane. (b) The weight distribution of the LiDAR beams: beams directed ahead of the ASV carry a higher weight, indicating a higher collision risk, whereas beams directed astern carry a lower weight.
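The weighting function behind Figure 3b is not specified in the caption. Purely as an illustration, the sketch below (hypothetical function name and parameter values, not the authors' implementation) assigns each beam a weight that decays with its bearing away from the bow, reproducing the forward-biased pattern described above.

```python
import numpy as np

def beam_weights(n_beams: int = 36, w_min: float = 0.2, w_max: float = 1.0) -> np.ndarray:
    """Hypothetical forward-biased weighting of LiDAR beams.

    Beams are indexed by bearing relative to the bow in [-pi, pi); a beam
    pointing straight ahead receives w_max, a beam pointing astern receives
    w_min, with a cosine taper in between (illustrative choice only).
    """
    bearings = np.linspace(-np.pi, np.pi, n_beams, endpoint=False)
    # cos(bearing) is +1 ahead and -1 astern; rescale it to [w_min, w_max].
    return w_min + (w_max - w_min) * (1.0 + np.cos(bearings)) / 2.0

weights = beam_weights()
print(weights.max(), weights.min())  # ~1.0 for the bow beam, 0.2 for the stern beam
```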
Figure 4. The structure of the LSTM unit, where c represents the value stored in the memory cell, h denotes the output of the LSTM, and x is the external input. The LSTM unit has four inputs and one output. The inputs consist of the vectors x_{t−1}, h_{t−1}, and c_{t−1}, along with the activation signals of the forget gate f_t, the input gate i_t, and the output gate o_t.
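For reference, the notation in Figure 4 corresponds to the standard LSTM update of [29], written here with the current input denoted x_t (the figure indexes the input as x_{t−1}):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1},\,x_t] + b_f\right), \\
i_t &= \sigma\!\left(W_i\,[h_{t-1},\,x_t] + b_i\right), \\
o_t &= \sigma\!\left(W_o\,[h_{t-1},\,x_t] + b_o\right), \\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1},\,x_t] + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned}
```

where σ(·) is the logistic sigmoid and ⊙ denotes element-wise multiplication.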
Figure 5. The flowchart of the policy training process. If the current iteration step exceeds the policy training start step, model training commences.
Figure 6. The points in the predicted trajectory are treated as virtual obstacles that the ASV needs to avoid.
Figure 7. The flowchart of the hybrid system, where P_target represents the coordinates of the final target point and Δ is the distance threshold.
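The precise switching rule of the hybrid system is defined by the flowchart itself. As a minimal sketch only, assuming a trained path-following policy, a trained collision-avoidance policy, and a hypothetical safety range d_safe, one decision step could be organized as follows (the threshold Δ = 10 m matches Table 2):

```python
import numpy as np

def hybrid_step(state, lidar_ranges, p_target, delta=10.0, d_safe=50.0,
                follow_policy=None, avoid_policy=None):
    """One decision step of an assumed hybrid navigation loop (sketch).

    Returns ("done", None) once the vessel is within the distance threshold
    `delta` of the final target point P_target; otherwise it delegates the
    low-level action (propeller speed, rudder angle) to one of the two
    trained policies. `d_safe`, the state layout, and the policy interfaces
    are assumptions made for illustration.
    """
    p_ship = np.asarray(state["position"])            # (x, y) in the Earth frame
    if np.linalg.norm(np.asarray(p_target) - p_ship) < delta:
        return "done", None                           # final target reached
    if np.min(lidar_ranges) < d_safe:                 # an obstacle is inside the safety range
        return "collision_avoidance", avoid_policy(state, lidar_ranges)
    return "path_following", follow_policy(state)
```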
Figure 8. The schematic view of the CPA, where U_A and U_B are the resultant velocities.
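The CPA quantities implied by Figure 8 follow from standard relative-motion kinematics. The sketch below (not the authors' implementation) computes the time to and distance at the closest point of approach, which can then be compared against the thresholds t_max = 50 s and d_min = 200 m from Table 2 to flag a collision risk.

```python
import numpy as np

def cpa(p_a, u_a, p_b, u_b):
    """Closest point of approach between own ship A and target ship B.

    p_a, p_b: current positions (x, y); u_a, u_b: resultant velocity vectors.
    Returns (t_cpa, d_cpa): time to CPA in seconds and distance at CPA in metres.
    """
    r = np.asarray(p_b, float) - np.asarray(p_a, float)   # relative position of B w.r.t. A
    v = np.asarray(u_b, float) - np.asarray(u_a, float)   # relative velocity of B w.r.t. A
    v2 = float(v @ v)
    t_cpa = 0.0 if v2 < 1e-9 else float(-(r @ v) / v2)    # no relative motion -> CPA is now
    t_cpa = max(t_cpa, 0.0)                               # CPA already passed -> use current time
    d_cpa = float(np.linalg.norm(r + v * t_cpa))
    return t_cpa, d_cpa

# Example: a head-on target 300 m ahead and 100 m off track.
t, d = cpa((0.0, 0.0), (2.0, 0.0), (300.0, 100.0), (-2.0, 0.0))
print(t, d)  # 75.0 s to CPA, 100.0 m at CPA
```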
Figure 9. Visualization of the path-following policy training process in different training stages. Sub-figures (a–c) correspond to epochs 1, 20, and 175, respectively.
Figure 10. The training results of the path-following policy. (a) Success rate per epoch. (b) Average return and standard deviation per step in each epoch.
Figure 11. Visualization of the collision avoidance policy training process in different training stages. Sub-figures (a–c) correspond to epochs 1, 25, and 300, respectively.
Figure 12. The training results of the collision avoidance policy. (a) Success rate per epoch. (b) Average return and standard deviation per step in each epoch.
Figure 13. Visualization of the AIS trajectory data.
Figure 14. MSE loss of the KF-LSTM trajectory predictor. (a) Average training loss in each epoch. (b) Average validation loss in each epoch.
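The exact architecture of the KF-LSTM predictor is not reproduced here. As a minimal sketch, assuming a Kalman filter has already smoothed the historical track, an LSTM that maps the N_h = 8 past positions to the N_f = 30 future positions of Table 2 and is trained with the MSE loss of Figure 14 could look like this (PyTorch, hypothetical layer sizes):

```python
import torch
import torch.nn as nn

class KFLSTMPredictor(nn.Module):
    """Minimal LSTM trajectory predictor (illustrative layout, not the paper's)."""

    def __init__(self, n_hist: int = 8, n_fut: int = 30, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_fut * 2)   # predict n_fut (x, y) pairs
        self.n_fut = n_fut

    def forward(self, x):                # x: (batch, n_hist, 2) KF-smoothed positions
        _, (h, _) = self.lstm(x)         # h: (1, batch, hidden), last hidden state
        out = self.head(h[-1])           # (batch, n_fut * 2)
        return out.view(-1, self.n_fut, 2)

model = KFLSTMPredictor()
x = torch.randn(16, 8, 2)                # a batch of smoothed history windows
y_hat = model(x)                         # (16, 30, 2) predicted future positions
loss = nn.functional.mse_loss(y_hat, torch.randn(16, 30, 2))  # MSE objective as in Figure 14
```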
Figure 15. Wind load coefficients of the ASV at different headings. (a) C_{x,wind}. (b) C_{y,wind}. (c) C_{z,wind}.
Figure 16. Path-following task results in different scenarios. (a) Simulation results in calm water. (b) Simulation results in water with external disturbances.
Figure 17. Time domain curves of the yaw angle, path-following error, and rudder angle in Scenarios I and II. (a) Yaw angle (Scenario I). (b) Yaw angle (Scenario II). (c) Path-following error (Scenario I). (d) Path-following error (Scenario II). (e) Rudder angle (Scenario I). (f) Rudder angle (Scenario II).
Figure 18. The automatic behavior system of the under-actuated ASV state calculator.
Figure 19. The overview of the two maps for model tests. (a) The map for tests 1-1 and 1-2. (b) The map for tests 2-1 and 2-2.
Figure 20. Crucial navigation snapshots of test 1-2 in map 1. (a) Initial status. (b) During the navigation. (c) During the navigation. (d) Final stage.
Figure 21. Crucial navigation snapshots of test 2-1 in map 2. (a) Initial status. (b) During the navigation. (c) During the navigation. (d) Final stage.
Figure 22. Comparisons between experimental trajectory and simulated trajectory. (a) Test 1-1. (b) Test 1-2. (c) Test 2-1. (d) Test 2-2.
Figure 23. Distance errors between experimental trajectory and simulated trajectory. The x-axis denotes the index of experimental trajectory points. (a) Test 1-1. (b) Test 1-2. (c) Test 2-1. (d) Test 2-2.
Figure 24. Trajectories of the ASV in autonomous collision avoidance tasks. (a) Task 1. (b) Task 2. (c) Task 3. (d) Task 4. (e) Task 5. (f) Task 6.
Figure 25. The schematic representations of the ASV successfully avoiding four dynamic obstacles in calm water. (a) Task 1. (b) Task 2. (c) Task 3. (d) Task 4.
Figure 26. The variation in distance between the vessel and obstacles. (a) Single dynamic obstacle in different encounter scenarios. (b) Multiple dynamic obstacles scenario.
Figure 27. The schematic representations of the ASV successfully avoiding four dynamic obstacles under environmental disturbances.
Figure 28. The general views of the ASV navigating in a complex environment. (A) The navigation between 50 s and 1400 s. (B) The navigation between 1550 s and the terminal time step.
Figure 29. The variation in the distance between the ASV and the obstacles, together with the changes in the motion parameters, in a complicated environment. (a) The variation in distance between the ASV and the dynamic obstacles. (b) The distance between the ship’s actual path and the global path. (c) Rudder angle. (d) Propeller PWM.
Table 1. Major parameters of the model-scale VLCC.

Parameter | Unit | Value
Overall length | m | 1.11
Overall width | m | 0.165
Draft | m | 0.068
Displacement | kg | 10
X_g | m | 0.562
Y_g | m | 0
Z_g | m | 0.058
R_xx | m | 0.052
R_yy | m | 0.26
R_zz | m | 0.26
Table 2. Hyperparameters of the algorithms.

Parameter | Value
Time step Δt (s) | 0.5
Learning rate λ | 3 × 10^-4
Soft-update coefficient ζ | 5 × 10^-3
Batch size of policy training N_b | 256
Batch size of KF-LSTM training N_KF | 128
Discount factor γ | 0.99
Actor and critic network update interval | 1
Target network update interval | 1
Policy training start step | 1 × 10^4
Replay buffer size | 1 × 10^6
Historical states horizon N_s | 6
KF-LSTM historical horizon N_h | 8
KF-LSTM prediction horizon N_f | 30
Maximum CPA time t_max (s) | 50
Minimum CPA distance d_min (m) | 200
Threshold distance Δ (m) | 10.0
Forward viewing distance Δ_1 (m) | 8.0
Forward viewing distance Δ_2 (m) | 30.0
Coefficient k_1 | 1.0
Coefficient k_2 | 1.0
Coefficient k_3 | −0.1
Coefficient k_4 | 1.0
Coefficient k_5 | 1.0
Coefficient k_6 | 0.6
Coefficient k_7 | 5.0
Table 3. Parameters of the environmental disturbances.

Parameter | Value
Wind speed U_W (m/s) | 1.5
Wind angle α_W (rad) | −3π/4
Wave angle ϑ (rad) | −3π/4
Current speed υ_c (m/s) | 0.2
Current angle ψ_c (rad) | −3π/4
Air density ρ_a (kg/m^3) | 1.29
Fluid density ρ_c (kg/m^3) | 1030
Gravitational acceleration g (m/s^2) | 9.8
Frontal projection area A_f (m^2) | 0.0066
Lateral projection area A_s (m^2) | 0.0444
Table 4. The path-following errors in Scenarios I and II.

Parameter | Value
Average error in Scenario I (m) | 0.210
Average error in Scenario II (m) | 3.979
Maximum error in Scenario I (m) | 1.098
Maximum error in Scenario II (m) | 7.886
Time for maximum error in Scenario I (s) | 5.5
Time for maximum error in Scenario II (s) | 396.5
Table 5. The trajectory deviation values for each model test.

Test | Average Error (m) | Maximum Error (m) | Actual Trajectory Index of Maximum Error
1-1 | 0.270 | 0.587 | 340
1-2 | 0.096 | 0.260 | 336
2-1 | 0.178 | 0.396 | 291
2-2 | 0.165 | 0.331 | 338
Table 6. The minimum distances between the ASV and obstacles in different encounter scenarios.

Encounter Scenario | Minimum Distance (m) | Time (s)
Head-on | 40.658 | 488.0
Crossing 1 | 42.769 | 285.0
Crossing 2 | 55.493 | 342.0
Overtaking | 47.239 | 645.5
Table 7. The minimum distances between the ASV and obstacles in the multiple dynamic obstacles scenario.

Obstacle Index | Minimum Distance (m) | Time (s)
No. 1 | 56.869 | 722.5
No. 2 | 59.884 | 727.0
No. 3 | 58.246 | 480.0
No. 4 | 22.741 | 839.5
Table 8. The minimum distances between the ASV and dynamic obstacles in complicated water.

Obstacle Index | Minimum Distance (m) | Time (s)
No. 1 | 76.321 | 665.0
No. 2 | 37.078 | 989.5
No. 3 | 56.530 | 2041.0
No. 4 | 30.866 | 1421.0
No. 5 | 25.081 | 2650.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

