Article

Autonomous Obstacle Avoidance in Crowded Ocean Environment Based on COLREGs and POND

College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(7), 1320; https://doi.org/10.3390/jmse11071320
Submission received: 2 June 2023 / Revised: 21 June 2023 / Accepted: 27 June 2023 / Published: 28 June 2023
(This article belongs to the Special Issue AI for Navigation and Path Planning of Marine Vehicles)

Abstract

In crowded waters with unknown obstacle motion information, traditional methods often fail to ensure safe and autonomous collision avoidance. To address the challenges of information acquisition and decision delay, this study proposes an optimized autonomous navigation strategy that combines deep reinforcement learning with internal and external rewards. By incorporating random network distillation (RND) with proximal policy optimization (PPO), the interest of autonomous ships in exploring unknown environments is enhanced. Additionally, the proposed approach enables the autonomous generation of intrinsic reward signals for actions. For multi-ship collision avoidance scenarios, an environmental reward is designed based on the International Regulations for Preventing Collisions at Sea (COLREGs). This reward system categorizes dynamic obstacles into four collision avoidance situations. The experimental results demonstrate that the proposed algorithm outperforms the popular PPO algorithm by achieving more efficient and safe collision avoidance decision-making in crowded ocean environments with unknown motion information. This research provides a theoretical foundation and serves as a methodological reference for the route deployment of autonomous ships.

1. Introduction

Amidst the surge in economic globalization, the competition in international trade among nations has intensified. In this milieu, maritime transport assumes a cardinal role, accounting for over 80% of the volume and 70% of the value of goods traded globally, underscoring its significance in trade and development. Consequently, maritime transport can be viewed as an economic barometer [1]. The burgeoning shipping industry has precipitated the designation of specific sea routes as central conduits for maritime transport, bringing maritime security into sharp focus. Ship collisions are a salient concern in this domain, which have elicited widespread attention [2]. The European Maritime Safety Agency [3] posits that ship collisions are the predominant cause of maritime casualties, with a staggering 89.5% of such incidents attributable to human error. Furthermore, approximately 41% of these incidents are reported to occur within port vicinities. These data highlight the perils associated with ship collisions and navigational errors. Addressing these risks is paramount to enhancing maritime safety. Leveraging advancements in intelligent and autonomous technologies, numerous countries have embarked on research into autonomous unmanned surface vessels (USVs) and collision avoidance systems to preclude navigational mishaps. Concurrently, the Maritime Safety Committee (MSC) of the International Maritime Organization (IMO) delineated the regulatory scoping activities for Maritime Autonomous Surface Ships (MASSs) in 2021, indicating that MASSs should be equipped with automation processes and decision support for operational automation. These investigations collectively attest to the potential of intelligent navigation technologies in substantially reducing navigation-related collisions, fatalities, and human errors. Accordingly, research into autonomous USVs with integrated collision avoidance systems is deemed to be the most efficacious strategy to forestall maritime accidents attributable to collisions and human errors.
Furthermore, the Maritime Safety Committee’s (MSC) scoping activities for Maritime Autonomous Surface Ships (MASSs) encompass regulations pertaining to collision avoidance for unmanned vessels, primarily the International Regulations for Preventing Collisions at Sea (COLREGs). These regulations are deemed quintessential and warrant precedence in the development of collision avoidance decision algorithms for autonomous vessels. Maritime vessels are obliged to adhere to the tenets of the COLREGs, which include vessel encounter direction identification, shipping behavior analysis, and collision probability assessment. Notably, real-world maritime operations have manifested instances of non-adherence to the COLREGs, often attributable to human errors and external environmental influences. Data from the Fujian Maritime Bureau reveal that more than one hundred ship collisions have transpired in the vicinity of the Taiwan Strait in recent years. This underscores the exigency for the expeditious development of autonomous collision avoidance systems for unmanned surface vessels (USVs), and the criticality of translating the qualitative principles of the COLREGs into quantifiable implementations for autonomous vessel collision avoidance, thereby bolstering the sagacity of navigational rules and maritime safety.
This study endeavors to surmount the hurdles associated with autonomous vessel collision avoidance. Prevailing collision avoidance algorithms are built to handle the many obstacles endemic to the maritime milieu. Nevertheless, within navigationally intensive maritime realms, such algorithms are prone to converging to local optima, consequently failing to fulfill autonomous navigation objectives. To curtail collisions with other vessels or reefs in densely navigated waters or proximate to coastal ports, this research introduces a robust and efficient methodology, dubbed proactive obstacle navigation and detection (POND). This method harnesses a deep reinforcement learning (DRL) algorithm, synergized with random network distillation (RND), to facilitate intelligent collision avoidance navigation. The structure of this study is as follows: Section 2 presents a review of the literature pertaining to ship collision avoidance models to date. Section 3 outlines the ship collision avoidance system and elucidates the associated COLREGs. In Section 4, the components and operational principles of the POND algorithm are explicated. Section 5 is dedicated to the training and validation of the algorithm. Finally, Section 6 concludes the study by synthesizing the research conclusions and suggesting avenues for future research.

2. Related Work

With the growth in sea transportation, more academics are becoming interested in the study of ship collision avoidance algorithms. At this stage, collision avoidance algorithms can be roughly divided into traditional collision avoidance algorithms and intelligent collision avoidance algorithms.
Traditional collision avoidance algorithms predominantly encompass the A* algorithm, the artificial potential field (APF) algorithm, and the rapidly exploring random tree (RRT). Addressing the issue of global collision avoidance path planning, Rui Song et al. [4] developed an enhanced A* algorithm, incorporating a smoothing component to refine the original A* output, thereby providing a more continuous trajectory. To achieve unmanned surface vehicle (USV) route planning, Zhen Zhang et al. [5] proposed an advanced RRT, employing a flexible hybrid dynamic scale factor to augment the objective function. Daoyong Wang et al. [6] amalgamated the widely used genetic algorithm with membrane computing, leveraging the conventional APF method to efficiently address robotic navigation. However, traditional collision avoidance algorithms are primarily suited for static path planning and fail to account for dynamic obstacles, rendering them impractical for maritime collision avoidance. Consequently, researchers have pivoted toward the development of intelligent algorithms to circumvent these limitations. F. Ding et al. [7] introduced a path planning methodology predicated on particle swarm optimization (PSO), constructing mathematical models of underwater robots and marine environments, and calibrating path generation through PSO. Liu et al. [8] devised a technique based on ant colony optimization (ACO) and clustering, employing a smoothing method to alter the route for global planning and an enhanced ACO to selectively determine the search region. Kozynchenko and Kozynchenko [9] developed a dynamic predictive planning algorithm, integrating neural networks with fuzzy logic to optimize real-time navigation based on maritime conditions. Notwithstanding, early intelligent algorithms relied heavily on pre-existing environmental data and faced substantial practical constraints, prompting researchers to shift their focus to reinforcement learning.
In this context, RL has been widely studied for automobiles [10], unmanned aerial vehicles [11], and multi-robot systems [12]. The issue of collision avoidance route planning was addressed by Mihai Duguleana and Gheorghe Mogan [13], utilizing a technique founded on Q-learning and a neural network. For marine vessels, Chen Chen et al. [14] created a route planning approach founded on Q-learning that learned the action state model to determine the appropriate action policy. It is, however, difficult to analyze all action state data when relying solely on RL algorithms due to a lack of environment information. Using RGB-D physical sensors in combination with DQN structures, a new navigation method was constructed by Lei Tai and Ming Liu [15]. The method addresses the key technique of agent collision avoidance for dynamic obstacles in real-world environments by fully training the system in a simulation platform. The Sarsa algorithm was shown to be feasible for improving obstacle avoidance for unmanned ships by Rubo Zhang et al. [16], who also developed an obstacle avoidance system model for marine vessels based on the Sarsa algorithm. For the problem of collision avoidance by ships after multi-vessel encounters, Woo et al. [17] successfully solved the difficult problem of autonomous collision avoidance by unmanned surface vessels by fusing semi-Markovian decisions into an existing DQN network. Yiquan Du et al. [18] combined the DDPG algorithm with the DP algorithm and used an LSTM neural network to store historical state information, which solved the problem of optimal route planning for coastal vessels and reduced arcs and overshoots in the planned path. In order for underwater robots to develop search tactics and learn from experience, Xiang Cao et al. [19] proposed a target-seeking algorithm based on A3C. This algorithm evaluates the network structure through the designed asynchronous advantage and uses a dual-stream Q-learning algorithm to drive the underwater robot’s movement, ensuring a more effective search strategy.
In addition, in order to make the collision avoidance algorithms proposed by some scholars applicable to maritime ships, the COLREGs have been embedded in the developed obstacle avoidance algorithms so that the ship does not violate the convention and cause an accident while avoiding a collision. Yixiong He et al. [20] proposed a quantitative analysis method for the COLREGs that includes adaptive judgment algorithms founded on the collision risk with the target ship; it is capable of handling situations in which one vessel must avoid numerous target ships simultaneously. Yuxin Zhao et al. [21] used an efficient two-way collision prevention algorithm to identify collision avoidance actions in compliance with the COLREGs, although environmental disturbances were not considered in their study. Within the COLREGs, Jinfen Zhang et al. [22] suggested a decentralized anti-collision decision support framework based on decision trees for multi-vessel circumstances. Zhao [23] examined the validity of methods for collision avoidance on autonomous ships. For the collision avoidance problem of multiple dynamic obstacles around a ship during navigation, Luman Zhao and Myung-Il Roh [24] designed a collision prevention system that adheres to the COLREGs, which solves the collision avoidance problem of a ship facing multiple obstacles by using the obstacle information as the initial input data. At the current research stage, many researchers have also combined navigation rules with deep reinforcement learning; for example, Wang et al. [25] used the action states specified in part of the navigation rules as the network input for training, while introducing a navigation-rules reward mechanism in the loss function, and the training results suggested that their proposed model can elucidate reasonable collision avoidance behaviors for different encounter scenarios. Zhai et al. [26] conducted a study on autonomous ship collision avoidance based on a dual-depth Q-network, in which the COLREGs and human maneuvering experience were introduced in the reward function, thus enabling the proposed model to handle special situations such as multi-ship close encounters, and the actions output by the model were closer to the human operation level while complying with the navigation rules.

3. Collision Avoidance Path Planning in Crowded Ocean Environment

In the existing mainstream design of autonomous ship collision avoidance systems, the ship relies only on its onboard sensors to collect information about the surrounding marine environment. Because of the limited detection range, the ship may not be able to take reasonable avoidance action in time once an obstacle is sensed. Therefore, this study proposes an obstacle avoidance system founded on the cooperation of UAVs and autonomous ships, which identifies dynamic and static obstacles through the visual sensors carried by UAVs and quantitatively evaluates their relative positions, so as to effectively transmit reasonable information to the autonomous ship. Meanwhile, to guarantee that the autonomous ship collision avoidance system can be applied to international shipping, this study combines the COLREGs with the obstacle avoidance assessment to meet the demand that autonomous ships can accurately avoid collisions.
The goal of our research is to tackle the collision avoidance path planning problem for ships in crowded maritime settings. By extending previous research, all ships can make collision avoidance decisions autonomously through the collision avoidance system proposed in this study and can avoid target obstacles while complying with the COLREGs. This section briefly describes the composition and operation of the ship collision avoidance system and introduces the COLREGs provisions relevant to this study.

3.1. Framework for a Ship Path Planning System

Figure 1 shows the general process framework of the autonomous ship collision avoidance system, which mainly relies on UAVs and autonomous ships working together. In the initial stage of environmental information sensing and acquisition, the system mainly relies on the vision sensors carried by the UAV, which generally involves image enhancement, target recognition, and target tracking. The system transmits the marine environment information collected by the visual sensors (such as the type, number, location, and movement of obstacles) to the autonomous ship’s processing center for environmental modeling, i.e., establishing the relative positions of the ship and obstacles and the speeds of the obstacles (such as ships, buoys, and reefs). Using the collision avoidance algorithm proposed in this article, the path planning system then calculates which actions the autonomous ship should take to avoid collision in the current state. When there is no need to avoid an obstacle, the ship simply follows the original route and continues forward. In the last step, sensors are used to determine whether the autonomous vessel has arrived at the destination. After reaching the destination, the UAV lands safely on the deck of the autonomous ship to complete the navigation task.
Therefore, we use a deep reinforcement learning method for the path planning stage and train the discrete collision avoidance actions of the autonomous ship during navigation through the combination of the proximal policy optimization algorithm and random network distillation. In the collision prevention decision-making step, the International Regulations for Preventing Collisions at Sea must be taken into account, and appropriate collision avoidance actions should be determined according to the situation encountered. In the remainder of this section, background concepts for these regulations are presented.

3.2. International Regulations for Preventing Collisions at Sea

While applying the proposed deep reinforcement learning algorithm to the obstacle avoidance problem of autonomous ship path planning, we should consider the practical issue of the safety of the ship navigating on the route. Here, we develop reasonable motion choices based on the International Regulations for Preventing Collisions at Sea (COLREGs), a set of mandatory maritime traffic rules developed by the International Maritime Organization to ensure ship safety and reduce accidents. The COLREGs outline several ship encounter orientation scenarios and the corresponding avoidance maneuvers. Therefore, autonomous ships must establish behaviors founded on the COLREGs in collision avoidance to ensure safety in navigating the seas [27]. Based on the relevant descriptions of the COLREGs, we divided the relative positions of autonomous vessels and other ships into four avoidance action selection areas, as shown in Figure 2.
Figure 2 depicts the four scenarios wherein an autonomous vessel is encircled by evasive vessels. The study classifies the heading ranges of 0°–5° and 355°–360° as cases where the approaching vessel is required to undertake evasive maneuvers. In such instances, the autonomous vessel must execute a 15° starboard turn to avert a collision and proceed along its trajectory. When other vessels are within the 5°–112.5° range, the autonomous vessel perceives them as stationary, non-navigable obstacles, necessitating a starboard turn for collision avoidance. In situations where a vessel is within the 247.5°–355° range relative to the autonomous vessel, the latter assumes the highest priority and retains its original course and velocity without executing any evasive actions. Additionally, the volatile maritime weather conditions pose difficulties in applying navigation rules under low visibility. The International Regulations for Preventing Collisions at Sea (COLREGs) stipulate that if a vessel lies within the 0°–180° bearing of an autonomous vessel, a starboard turn is mandated for safety and navigational efficacy. Consequently, the collision avoidance model proposed in this study is designed to execute the appropriate navigational maneuvers in response to the prevailing circumstances. The study presumes that the visual sensors affixed to the unmanned aerial vehicles (UAVs) are capable of efficaciously monitoring the maritime environment in the vicinity. Thus, the ensuing obstacle and reward function designs are premised on optimal visibility at sea, ensuring accurate control directives from the vessel.
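To make the sector logic above concrete, the following Python sketch maps the relative bearing of a detected vessel to the avoidance behavior described in this section; the function name, the action labels, and the default handling of the astern sector are illustrative assumptions rather than the authors' implementation.

```python
def colregs_action(relative_bearing_deg: float) -> str:
    """Map the relative bearing of a target vessel (0-360 deg, measured clockwise
    from own bow) to the avoidance behavior described above. Sector boundaries
    follow the text; everything else is an assumption for illustration."""
    b = relative_bearing_deg % 360.0
    if b <= 5.0 or b >= 355.0:
        # Head-on situation: alter course 15 deg to starboard and proceed.
        return "starboard_turn_15deg"
    if 5.0 < b <= 112.5:
        # Target treated as a non-navigable obstruction on the starboard bow: give way to starboard.
        return "starboard_turn"
    if 247.5 <= b < 355.0:
        # Own ship is the stand-on vessel: keep original course and speed.
        return "stand_on"
    # Remaining (astern) sector is not explicitly covered in the text; keep course by default.
    return "keep_course"


if __name__ == "__main__":
    for bearing in (2.0, 90.0, 300.0, 180.0):
        print(bearing, "->", colregs_action(bearing))
```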

3.3. Problem Definition for Autonomous Vessel Path Planning

During the whole navigation task, with the help of the automatic identification system (AIS), the ship can plan the complete route from the starting point to the destination while avoiding known obstacles, such as islands on the route; this is referred to as global path planning. For unknown obstacles on a certain section of the route, such as other ships appearing suddenly or uncharted reefs, the corresponding route planning problem is called local path planning. The path planning problem of autonomous ships can thus be regarded as seeking the best navigation route under the safe operating conditions of the ship.
To make favorable decisions from the ship’s current state, a method is urgently needed that can solve the optimal decision-making problem of local path planning with unknown marine environment information. The DRL approach combines the DL and RL methods: it relies on the DL method for strong perception ability, while the RL method endows the system with robust decision-making ability, which can provide a solution to the local optimal route planning problem for vessels.

4. The Model of the Deep Reinforcement Learning Network

4.1. Proximal Policy Optimization Algorithm

Reinforcement learning (RL) is typically characterized through the lens of a Markov decision process (MDP) [28]. Within this framework, the agent in reinforcement learning, which in the context of this paper refers to the autonomous vessel, transitions from an initial state, denoted as $s$, to a subsequent state, denoted as $s'$, while executing actions during navigation. The outcome of this transition is quantified as a reward, represented by $r$. The likelihood of this transition is termed the state transition probability, denoted as $P_{ss'}^{a}$. Through continuous interaction with the environment, autonomous vessels endeavor to learn and ultimately ascertain optimal strategies for attaining their goals.
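As a toy illustration of the MDP elements $s$, $a$, $r$, and $P_{ss'}^{a}$, the sketch below samples one transition of a grid-world "vessel"; the grid, action set, slip probability, and reward values are placeholders and not the environment used later in this paper.

```python
import random

# Toy MDP: states are grid cells, actions are headings, the reward is -1 per step
# and +10 at the goal. All values here are placeholders for exposition only.
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
GOAL = (4, 4)

def step(state, action, slip=0.1):
    """Sample s' ~ P(s'|s,a): with probability `slip` the vessel drifts to a random heading."""
    dx, dy = ACTIONS[action] if random.random() > slip else random.choice(list(ACTIONS.values()))
    next_state = (state[0] + dx, state[1] + dy)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward

s = (0, 0)
s_next, r = step(s, "E")
print(f"s={s}, a='E' -> s'={s_next}, r={r}")
```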
The proximal policy optimization (PPO) algorithm [29], a policy-based reinforcement learning algorithm, was introduced by the OpenAI team and comprises an actor network and a critic network. In the policy gradient method, the agent selects its subsequent action based on the policy, denoted by $\pi$, as opposed to the Q-learning method, which employs a value-based approach for training. Consequently, the policy gradient method caters to both discrete and continuous action spaces, which is particularly pertinent for navigation tasks involving continuous actions such as turning maneuvers. In this study, random network distillation (RND) was employed exclusively with discrete actions to showcase the augmented robustness of the proposed algorithm in cluttered maritime environments.
The PPO algorithm framework primarily encompasses two components: the actor network and the critic network. The actor network is tasked with generating the necessary actions for the ship and facilitating the interaction with the marine environment. In contrast, the critic network evaluates the current behavior by generating a value function, which is subsequently utilized to guide the ship in selecting the most appropriate action. Distinct from traditional policy gradient algorithms, the PPO algorithm learns iteratively with an agent that differs from the one interacting with the environment, indicative of an off-policy approach. Within the realm of off-policy methods, importance sampling is a paramount technique.
To enhance the learning efficacy of the unmanned surface vehicle (USV) collision avoidance model, the POND model introduced in this study employs off-policy learning, akin to the PPO algorithm. This enables the model to utilize one policy for interactive exploration within the marine environment and a distinct policy for training. Off-policy learning necessitates the incorporation of importance sampling. Suppose the reward mechanism of the ship collision avoidance model is a function $T(m)$ and we wish to sample $m$ from a distribution $p$ whose expectation cannot be computed by direct integration. We can instead sample a set of data from another distribution, substitute the samples into the function, and average. Substituting each sample $m_t$ into $T(m)$ and computing the mean of all generated values yields the expected value of $T(m)$:
$$E_{m \sim p}\left[T(m)\right] \approx \frac{1}{N}\sum_{t=1}^{N} T(m_t)$$
The above expectation can be written as $\int T(m)\,p(m)\,\mathrm{d}m$, where $p(m)$ is the weight (density) function; the expectation can then be rewritten in the following form:
$$\int T(m)\,p(m)\,\mathrm{d}m = \int T(m)\,\frac{p(m)}{q(m)}\,q(m)\,\mathrm{d}m = E_{m \sim q}\!\left[T(m)\,\frac{p(m)}{q(m)}\right]$$
With this transformation, the data collected by the agent interacting with the environment can be used to train the learning agent. It is worth noting, however, that the two distributions cannot differ too much; otherwise, the estimated expectations will deviate greatly.
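The identity in Equation (2) can be checked numerically. The sketch below estimates $E_{m \sim p}[T(m)]$ once by sampling from $p$ directly and once by sampling from a different distribution $q$ and reweighting; the Gaussian densities and the test function $T(m) = m^2$ are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = lambda m: m ** 2                      # arbitrary test function T(m)

# Target density p = N(0, 1); sampling density q = N(0.5, 1.2).
p = lambda m: np.exp(-0.5 * m ** 2) / np.sqrt(2 * np.pi)
q = lambda m: np.exp(-0.5 * ((m - 0.5) / 1.2) ** 2) / (1.2 * np.sqrt(2 * np.pi))

m_p = rng.normal(0.0, 1.0, 100_000)       # samples drawn from p
m_q = rng.normal(0.5, 1.2, 100_000)       # samples drawn from q

direct   = T(m_p).mean()                              # (1/N) sum_t T(m_t), m_t ~ p
weighted = (T(m_q) * p(m_q) / q(m_q)).mean()          # E_{m~q}[T(m) p(m)/q(m)]
print(f"direct estimate: {direct:.3f}, importance-sampled estimate: {weighted:.3f}")
# Both approach E_p[T(m)] = 1; the agreement degrades when p and q drift apart,
# which is why PPO constrains the ratio between the two policies.
```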
Using this method, we obtain the expected value of the function indirectly. When training different obstacle avoidance strategies for the participating ships, we assume here that $\theta_2$ is a strategy whose role is to provide data for the training strategy $\theta_1$; i.e., $\theta_2$ interacts with the ocean environment and is not actually used for ship training. The route generated by the interaction with the ocean environment under this strategy is denoted as $S$. The gradient of the expected reward of the training strategy, expressed via sampling from $\theta_2$ (Equation (3)), approximates the original policy gradient under the training strategy (Equation (4)); we only need to supplement the expression with the weight function of the two strategies.
$$\nabla \bar{R}_{\theta_1} = E_{S \sim p_{\theta_2}(S)}\!\left[\frac{p_{\theta_1}(S)}{p_{\theta_2}(S)}\, R(S)\, \nabla \log p_{\theta_1}(S)\right]$$
$$E_{(s_t, a_t) \sim \pi_{\theta_1}}\!\left[A^{\theta_1}(s_t, a_t)\, \nabla \log p_{\theta_1}\!\left(a_t^n \mid s_t^n\right)\right]$$
where both $A^{\theta_1}(s_t, a_t)$ and $R(S)$ represent the reward value derived from the actions performed by the autonomous vessel in the current state.
Starting from the expectation under the policy used for training, we can likewise express it as an expectation under the policy used for interaction with the marine environment, which implements the off-policy algorithm.
$$E_{(s_t, a_t) \sim \pi_{\theta_2}}\!\left[\frac{p_{\theta_1}(s_t, a_t)}{p_{\theta_2}(s_t, a_t)}\, A^{\theta_1}(s_t, a_t)\, \nabla \log p_{\theta_1}\!\left(a_t^n \mid s_t^n\right)\right]$$
The joint distributions of the two strategies can be factorized as follows:
$$p_{\theta_1}(s_t, a_t) = p_{\theta_1}(a_t \mid s_t)\, p_{\theta_1}(s_t)$$
$$p_{\theta_2}(s_t, a_t) = p_{\theta_2}(a_t \mid s_t)\, p_{\theta_2}(s_t)$$
Then, the expectation of the interaction strategy with the environment can be rewritten as
$$E_{(s_t, a_t) \sim \pi_{\theta_2}}\!\left[\frac{p_{\theta_1}(a_t \mid s_t)}{p_{\theta_2}(a_t \mid s_t)}\, \frac{p_{\theta_1}(s_t)}{p_{\theta_2}(s_t)}\, A^{\theta_2}(s_t, a_t)\, \nabla \log p_{\theta_1}\!\left(a_t^n \mid s_t^n\right)\right]$$
In a real ocean environment, the distribution probabilities of the strategies used for observation and training are the same for the same state; so, we can simplify Equation (8), i.e.,
$$E_{(s_t, a_t) \sim \pi_{\theta_2}}\!\left[\frac{p_{\theta_1}(a_t \mid s_t)}{p_{\theta_2}(a_t \mid s_t)}\, A^{\theta_2}(s_t, a_t)\, \nabla \log p_{\theta_1}\!\left(a_t^n \mid s_t^n\right)\right]$$
Building on the above derivation, the PPO algorithm first initializes a set of parameters $\theta$. In the current environment, the current state of the autonomous ship is $s_t$ and the action to be performed is $a_t$; the current $\pi(a_t \mid s_t)$ and $A^{\theta}(s_t, a_t)$ can then be calculated, and the objective function under the parameters $\theta_k$ is computed from the policy value and reward value of the two, as follows:
$$J_{\theta_k}^{PPO}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta_k}}\!\left[\min\!\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)}\, A^{\theta_k}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)},\ 1 - \varepsilon,\ 1 + \varepsilon\right) A^{\theta_k}(s_t, a_t)\right)\right]$$
Here, the clip function is used as a constraint to ensure that the two agents do not drift too far apart. When the advantage is $A^{\theta}(s_t, a_t) > 0$, the action is favorable, so the training policy probability $p_{\theta}(a_t \mid s_t)$ should be increased. However, to ensure that the data collected by the agent interacting with the environment can still be used to train the learning agent, the ratio $p_{\theta}(a_t \mid s_t) / p_{\theta_k}(a_t \mid s_t)$ is only allowed to increase up to an upper limit of $1 + \varepsilon$. When the ratio exceeds this limit, it is clipped to $1 + \varepsilon$, so the objective function gains no further benefit from increasing the execution probability of this action, which increases the stability of the algorithm. Similarly, when $A^{\theta}(s_t, a_t) < 0$, the current action drives the unmanned ship toward an unfavorable trend for the task, so the training policy probability $p_{\theta}(a_t \mid s_t)$ should be reduced, and a lower limit of $1 - \varepsilon$ is imposed on the policy ratio. In this study, $\varepsilon$ = 0.2.
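A small numpy sketch of the clipped surrogate in Equation (10) is given below; the probability and advantage values are made up for illustration, and a real implementation would differentiate through the new policy's probabilities rather than treat them as fixed arrays.

```python
import numpy as np

def ppo_clip_objective(p_new, p_old, advantages, eps=0.2):
    """Clipped surrogate of Equation (10): mean over sampled (s_t, a_t) pairs of
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = p_new / p_old                      # p_theta(a_t|s_t) / p_theta_k(a_t|s_t)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy batch: action probabilities under the new and old policies, plus advantages.
p_new = np.array([0.50, 0.10, 0.80, 0.30])
p_old = np.array([0.40, 0.25, 0.70, 0.35])
adv   = np.array([1.0, -0.5, 0.8, -1.2])
print("surrogate objective:", ppo_clip_objective(p_new, p_old, adv))
```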

4.2. Random Network Distillation (RND)

Because the reward from the external environment alone is not enough to support the ship in completing the navigation task in a crowded environment, a reward mechanism must be added to the existing algorithm to improve the agent’s interest in exploring the unknown environment. However, curiosity-based and count-based methods are difficult to apply in the vast marine environment. In this study, the random network distillation (RND) method [30] was therefore used and combined with the PPO algorithm.
A neural network typically produces only a small error on inputs that are similar to the samples it was trained on. Based on this idea, RND uses the prediction error accumulated from the agent’s past experience to estimate the novelty of new experience. The RND method first fixes a neural network with randomly initialized parameters, which produces an output for the current state from which the exploration reward is determined; at the same time, another neural network is trained on the data. The former is called the target network and is never trained during the whole process; the latter is the predictor network, which is trained to predict the output of the target network. The target network can be regarded as a mapping $f: O \rightarrow \mathbb{R}^{k}$, and the prediction network as a mapping $\hat{f}: O \rightarrow \mathbb{R}^{k}$ ($k$ denotes the common output dimension of both); the hat symbol denotes the network to be trained. During the training phase, the prediction network uses gradient descent to minimize the objective function
$$G_{\theta_{\hat{f}}} = \left\| \hat{f}(x; \theta) - f(x) \right\|^{2}$$
where $x$ denotes the current state coordinates of the autonomous ship within the designated navigation area, and the prediction network parameters $\theta_{\hat{f}}$ are updated by minimizing the difference between the two networks. Through this continuous training process, the randomly initialized target network is gradually distilled into the predictor. When the input state differs greatly from those seen before, the objective function becomes very large, which is not conducive to the training of the network; that is, the prediction changes in a more “random” direction.
Figure 3 shows the proposed RND framework. The main idea of RND is to encourage agents to have more “interest” in exploring areas they have not reached by giving larger reward values to the autonomous ship. When the autonomous ship reaches a certain state, RND calculates the mapping values of the target network and the prediction network for the current state, which are $f_i$ and $\hat{f}_i$, respectively, and then calculates the MSE of the two, namely $\| f_i - \hat{f}_i \|^{2}$. As more states are input into the system, the prediction network receives more known states and becomes more effective at predicting the output of the target network. When revisiting previously reached states, the autonomous ship obtains only a small reward value (because the target output is predictable) and therefore tends not to repeatedly explore these known states.
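The RND mechanism can be sketched with two small networks, as below: a fixed random target mapping and a predictor trained by gradient descent on the objective of Equation (11), with the squared prediction error serving as the intrinsic reward. The layer shapes, learning rate, and plain-numpy implementation are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, K = 4, 8                       # state size and common output dimension k

# Target network f: O -> R^k, randomly initialized and never trained.
W_target = rng.normal(size=(STATE_DIM, K))
f = lambda s: np.tanh(s @ W_target)

# Predictor network f_hat: O -> R^k, trained to imitate the target.
W_pred = rng.normal(size=(STATE_DIM, K))

def intrinsic_reward(state):
    """RND bonus: squared prediction error ||f_hat(s) - f(s)||^2 (larger for novel states)."""
    return float(np.sum((np.tanh(state @ W_pred) - f(state)) ** 2))

def train_predictor(states, lr=0.05, epochs=200):
    """Minimize the objective of Equation (11) by simple gradient descent on W_pred."""
    global W_pred
    for _ in range(epochs):
        pred = np.tanh(states @ W_pred)
        err = pred - f(states)                                   # (N, K) prediction error
        grad = states.T @ (err * (1.0 - pred ** 2)) / len(states)
        W_pred -= lr * grad

visited = rng.normal(size=(256, STATE_DIM))                      # states seen during training
train_predictor(visited)
print("familiar state bonus:", intrinsic_reward(visited[0]))
print("novel state bonus:   ", intrinsic_reward(rng.normal(size=STATE_DIM) * 5.0))
```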

4.3. POND Fusion Algorithm

In order to improve the security of autonomous vessels sailing in a crowded marine environment, we combined the above two algorithms (PPO and RND), in which the internal reward calculated by RND was introduced as an exploration enthusiasm reward. The introduction of RND reward can better enable autonomous ships to try to reach the areas that they did not reach during training, and it can be easily combined with rewards from the external marine environment (such as maritime navigation rules) to encourage agents (autonomous ships) to examine their surroundings more thoroughly. The architecture of our method is shown in Figure 4.
In the actual marine environment, the main concerns of autonomous ships when sailing are their distance from the target point, energy such as fuel consumed, and marine navigation rules. Therefore, we cannot regard RND reward as being at the same level as external environment reward, and if we directly combine the two, it is often difficult to achieve satisfactory results in the training. Therefore, in this study, we designed two weight values to weigh the importance of reward. The total reward function is expressed as follows:
$$A_t^{POND} = \alpha A_t^{e} + \beta A_t^{i}$$
where $A_t^{POND}$ is used to update the actor network in the PPO part of the algorithm; $\alpha$ and $\beta$, respectively, represent the discount factors of the external environment reward and the intrinsic reward, which are taken as $\alpha$ = 1.5 and $\beta$ = 0.8 in this study. The reward module mechanism is shown in Figure 5.
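Equation (12) amounts to a weighted sum of the two reward streams; a one-line sketch with the weights reported above ($\alpha$ = 1.5, $\beta$ = 0.8) follows.

```python
def pond_advantage(extrinsic_reward: float, intrinsic_reward: float,
                   alpha: float = 1.5, beta: float = 0.8) -> float:
    """Combined signal of Equation (12): A_t^POND = alpha * A_t^e + beta * A_t^i."""
    return alpha * extrinsic_reward + beta * intrinsic_reward

# Example step: slightly negative external reward, moderate exploration bonus.
print(pond_advantage(extrinsic_reward=-0.2, intrinsic_reward=0.6))
```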
The external environment reward is divided into four main parts (a combined sketch of these terms is given after the list below).
1. Voyage penalty
For the external environment reward, considering the fuel reserve during the actual navigation procedure, the autonomous vessel’s sailing distance should be minimized while avoiding obstacles as far as possible, that is, the basic external environment reward settings for each step are as follows:
$$R_{\mathrm{Voyage\;Penalty}} = -\frac{1}{\mathrm{gridsize}}$$
where ‘gridsize’ represents the distance of each step of the autonomous ship’s movement (equivalent to the minimum size of the grid modeled by the simulation platform).
2. Arriving reward
To make certain that the autonomous vessel can successfully complete its tasks, in the construction of the reward module of the algorithm, the action reward value that can make the autonomous ship reach the destination is set to 1, and the action reward value that can guide the autonomous ship to leave the sea or collide with obstacles is set to −1.
$$R_{\mathrm{Arriving}} = \begin{cases} 1, & \text{when the agent arrives at the destination} \\ -1, & \text{otherwise} \end{cases}$$
3. COLREGs reward
In the actual navigation, the inevitable problem is how to avoid the moving ship. According to the four situations of encountering other ships proposed by the COLREGs, as described in Section 3.2, this paper gives rewards for actions that meet the rules under corresponding circumstances. In this study, if the autonomous ship avoids collision in each encounter, the reward value of the COLREGs is 1; otherwise, the value is −1. For example, when other ships are approaching, the autonomous ship must make a right turn to avoid other ships. At this time, the output value of the COLREGs reward module is 1, and the output value of collision with obstacles, left turn, berthing, and other actions is −1.
$$R_{\mathrm{COLREGs}} = \begin{cases} 1, & \text{when the agent satisfies the COLREGs} \\ -1, & \text{otherwise} \end{cases}$$
4. Heading-error and cross-error reward
Simultaneously, to help the autonomous ship use as little fuel as possible relative to the best path, reward values are set for the heading error and cross-track error incurred when avoiding other obstacles. In this study, the autonomous vessel was set at a fixed speed, and the heading angle error and cross-track error are shown in Figure 6. The difference between the expected heading angle and the current heading angle was used as the heading angle error reward; see Equation (16) for the specific expression of the reward function. Similarly to the heading angle error, the cross-track error reward function is shown in Equation (17). When the heading angle and cross-track distance are within the acceptable range, the corresponding reward terms remain small in magnitude. In this study, the maximum heading angle was set at 15°, and the maximum cross-track distance was 0.4 of the length between perpendiculars of the autonomous ship model used.
$$R_{\mathrm{heading}} = \begin{cases} -\left\| \psi_e \right\|^{2}, & \text{if } \left| \psi_e \right| < \left| \psi_{\max} \right| \\ -1, & \text{otherwise} \end{cases}$$
$$R_{\mathrm{cross}} = \begin{cases} -\left\| y_e \right\|^{2}, & \text{if } \left| y_e \right| < \left| y_{\max} \right| \\ -1, & \text{otherwise} \end{cases}$$
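Putting the four external terms together, the sketch below assembles a per-step environment reward following Equations (13)–(17); the thresholds, helper flags, and default grid size are illustrative assumptions rather than the values used in the experiments.

```python
import math

def external_reward(arrived: bool, failed: bool, satisfies_colregs: bool,
                    heading_err: float, cross_err: float,
                    grid_size: float = 1.0,
                    psi_max: float = math.radians(15.0), y_max: float = 20.0) -> float:
    """Per-step external reward combining the voyage penalty, arriving reward,
    COLREGs reward, and heading/cross-track error rewards described above.
    Flags, thresholds, and the default grid size are illustrative assumptions."""
    r = -1.0 / grid_size                                            # voyage penalty each step
    r += 1.0 if arrived else (-1.0 if failed else 0.0)              # arriving reward / failure penalty
    r += 1.0 if satisfies_colregs else -1.0                         # COLREGs compliance reward
    r += -heading_err ** 2 if abs(heading_err) < psi_max else -1.0  # heading-error term, Eq. (16)
    r += -cross_err ** 2 if abs(cross_err) < y_max else -1.0        # cross-track error term, Eq. (17)
    return r

# Example: an ordinary step that complies with the COLREGs and stays close to the route.
print(external_reward(arrived=False, failed=False, satisfies_colregs=True,
                      heading_err=math.radians(5.0), cross_err=0.5))
```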

4.4. Implementation of POND Algorithm

The traditional RL algorithm relies on the reward value fed back for each action to train the agent; however, the actions of an autonomous ship in a crowded ocean environment may receive a fixed reward value for a period of time, so the agent cannot make a reasonable choice of the next action and thus cannot complete the obstacle avoidance task. Based on this practical problem, our study proposes a policy-based RL algorithm that drives the autonomous ship to explore the marine environment more efficiently through dual rewards, by setting a curiosity reward as an internal reward function in addition to the external environmental reward. Once the agent is familiar with the obstacle information in the crowded marine environment, including coral reefs and other vessels, it can learn the basic collision avoidance strategies in that environment and extend them to other unknown environments.
First, based on the aerial vision of the UAV, the autonomous ship can obtain panoramic information of the surrounding sea area, build a panoramic chart, and input it as environmental information into the environment module of the algorithm architecture. The algorithm architecture of our method is shown in Figure 4, in which three feature extraction network structures are considered: a simple encoder (simple) composed of two convolutional layers; an encoder composed of three convolutional layers (nature_cnn) [31]; and a Resnet model consisting of three stacked IMPALA Resnet blocks, each containing two residual blocks [32].
The actor network, whose schematic structure is shown in Figure 7, has three hidden layers made up entirely of ReLU-activated neurons, and its final layer employs sigmoid-activated neurons to choose the autonomous ship’s current action. The target network, the RND prediction network, and the critic network are all three-layer neural networks whose hidden layers likewise consist of ReLU-activated neurons; the network structure diagram is shown in Figure 8, and a sketch of these structures is given below.
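The sketch below mirrors the structures described above: a nature_cnn-style encoder [31], a three-hidden-layer ReLU actor with a sigmoid output layer, and a three-layer ReLU network reused for the critic and the RND target/predictor. Filter counts, layer widths, and the input image size are assumptions, since the paper reports only the overall structure.

```python
import tensorflow as tf

def build_encoder(input_shape=(84, 84, 3)):
    """nature_cnn-style encoder: three convolutional layers (filter sizes assumed)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
    ])

def build_actor(feature_dim, n_actions, hidden=128):
    """Actor head: three ReLU hidden layers and a sigmoid output selecting the action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="sigmoid"),
    ])

def build_mlp(feature_dim, out_dim, hidden=128):
    """Three-layer ReLU network reused for the critic and the RND target/predictor heads."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(out_dim),
    ])

encoder = build_encoder()
actor = build_actor(encoder.output_shape[-1], n_actions=3)
critic = build_mlp(encoder.output_shape[-1], out_dim=1)
```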
The flowchart of the proposed POND algorithm is shown in Figure 9. In each iteration, the environmental information $s_t$ is fed into the actor network, and the two computed values are used as the mean and variance of a normal distribution to construct the action distribution. An action is then sampled from this distribution and input into the environment to obtain the environmental reward $R^{E}$ for this action and the next state $s_{t+1}$; at the same time, the intrinsic reward $R^{I} = \| \hat{f}(s_{t+1}) - f(s_{t+1}) \|^{2}$ is calculated, and several [state, action, reward] tuples are stored. The next state value is input into the critic network to obtain $V^{\pi}(s_t)$ and calculate the discounted reward; in the same way, all state combinations are input into the critic network to obtain all values $V^{\pi}(s)$ and calculate $\hat{A}_t$. Finally, the critic network is updated through loss calculation and backpropagation, and all stored states are fed into both the old-actor network and the actor network to obtain two normal distributions. From these two distributions, the probability of the same action under each policy, i.e., the ratio, can be calculated; Equation (8) can then be evaluated and backpropagated to update the actor network. After a certain number of steps, the old-actor network is updated with the actor network weights, and the RND prediction network weights are updated in the current cycle. Through the continuous iteration of the above steps, the autonomous ship can complete the task of autonomous obstacle avoidance. The pseudo code of the POND algorithm is given in Algorithm 1, and a compact Python sketch of its main loop follows it.
Algorithm 1 POND for navigation of USVs
$M$: number of pre-training steps
$N_{opt}$: number of optimization steps
for $i = 1$ to $M$ do
  sample $a_t$
  sample $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
  update observation normalization parameters
end for
for $j = 1$ to $N_{opt}$ do
  sample $a_t \sim \pi(a_t \mid s_t)$
  sample $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
  calculate intrinsic reward
  for $actor = 1, 2, \ldots, N$ do
    calculate environment reward
    calculate combined advantage estimates
  end for
  update reward normalization parameters
  update observation normalization parameters
  optimize objective function
end for
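A compact Python rendering of the main loop of Algorithm 1 is given below. The environment, policy, and intrinsic-reward functions are stand-in stubs (their bodies are assumptions), so the sketch only shows the order of operations: a pre-training phase for observation normalization, rollout collection with combined rewards, and the point where the PPO objective and the RND predictor would be optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Minimal stand-ins so the loop below runs; a real system would plug in the
# --- Unity environment, the encoder/actor/critic networks, and the RND pair.
def env_step(state, action):                 # s_{t+1} ~ p(s_{t+1}|s_t, a_t), external reward R^E
    return state + 0.1 * action + rng.normal(0, 0.01, state.shape), -0.1

def policy_sample(state):                    # a_t ~ pi(a_t|s_t), stubbed as random actions
    return rng.normal(0, 1, state.shape)

def intrinsic_reward(next_state):            # R^I = ||f_hat(s_{t+1}) - f(s_{t+1})||^2, stubbed
    return float(np.sum(next_state ** 2)) * 1e-3

ALPHA, BETA, GAMMA = 1.5, 0.8, 0.99

def pond_training(n_pretrain=50, n_opt=10, horizon=32):
    state = np.zeros(4)
    # Pre-training phase: act randomly to initialize observation normalization statistics.
    obs_buffer = []
    for _ in range(n_pretrain):
        state, _ = env_step(state, policy_sample(state))
        obs_buffer.append(state.copy())
    obs_mean, obs_std = np.mean(obs_buffer, axis=0), np.std(obs_buffer, axis=0) + 1e-8

    for _ in range(n_opt):                    # optimization iterations
        rewards = []
        for _ in range(horizon):              # collect a rollout with combined rewards
            action = policy_sample(state)
            next_state, r_ext = env_step(state, action)
            r_int = intrinsic_reward((next_state - obs_mean) / obs_std)
            rewards.append(ALPHA * r_ext + BETA * r_int)   # Equation (12)
            state = next_state
        # Discounted return used in place of the advantage estimate for this sketch.
        returns = np.array([sum(GAMMA ** k * r for k, r in enumerate(rewards[t:]))
                            for t in range(len(rewards))])
        # Here the clipped PPO objective would be optimized and the RND predictor updated.
    return returns

print(pond_training()[:5])
```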

5. Experiment and Results

5.1. Simulation Platform Settings

To carry out the USV’s autonomous routing and obstacle avoidance experiments, the experiments in this chapter use the Unity 3D simulation platform for simulation experiments. Ml-agent uses the Unity 3D C# development framework as frontend and middleware, and links with the Google TensorFlow backend in Python [33]. Ml-agent for Microsoft Windows has a dynamic link library to link C# and Python. It enables developers to create settings for intelligent agent training. In order to better evaluate the algorithm designed in this paper, the autonomous ship uses the PPO algorithm and the POND algorithm for training and route planning simulation in an unknown marine circumstance, and also uses the convergence indicators in different environments to evaluate the algorithm evaluation. For the simulation software, we constructed four modules:
  • The autonomous ship module is responsible for the selection and decision-making of the physical parameters (such as the principal dimensions of the target ship), motion parameters, and behaviors;
  • The UAV module is responsible for identifying obstacles and judging the geographic location and distance between autonomous ships and obstacles;
  • The marine map module mainly sets the size of the area where the autonomous ship is located, as well as the grid size and quantity;
  • The obstacle component is in charge of determining the fundamental physical characteristics and the position and quantity of obstacles, as well as the movement information of movable obstacles.
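The four modules above can be summarized as a configuration dictionary; the keys and example values below are illustrative assumptions and do not reproduce the authors' actual Unity 3D/ml-agents settings.

```python
# Illustrative configuration for the four simulation modules; values are assumptions,
# not the settings actually used on the Unity 3D / ml-agents platform.
simulation_config = {
    "autonomous_ship": {"length_bp_m": 50.0, "speed_kn": 8.0,
                        "actions": ["keep_course", "port_turn", "starboard_turn"]},
    "uav":             {"sensor": "rgb_camera", "altitude_m": 100.0,
                        "outputs": ["obstacle_type", "position", "relative_distance"]},
    "ocean_map":       {"area_grids": (100, 100), "grid_size_m": 10.0},
    "obstacles":       {"reefs": 4, "obstacle_ships": 3, "random_motion": True},
}
print(simulation_config["ocean_map"])
```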
The simulation diagram of the ocean model detected via UAV vision is shown in Figure 10: (a) shows the marine environment map detected by the UAV in real time, with the experimental autonomous ship directly under the UAV and the remaining objects being obstacle ships and reefs; (b) shows the ocean map after target recognition and grid drawing, in which the blue identification box marks the autonomous ship (agent) and the red identification boxes mark the obstacles, namely Obship and Reef.
The efficacy and performance of the suggested POND algorithm were empirically evaluated using the aforementioned parameter values. We built the simulation platform and ran the trials using two open-source, reputable software packages to prevent any implementation-related problems. The platform was created using the Unity Machine Learning Agent Toolkit, and Keras-RL [34] was used to develop the POND algorithm.
Each experiment was carried out under Windows 10 on a machine with an Intel Core i9 CPU, an Nvidia RTX 3070 GPU, and 32 GB of RAM.

5.2. Algorithm Model and Parameter Settings

In this research, a deep reinforcement learning autonomous driving strategy algorithm founded on the COLREGs was developed to study the tasks of autonomous ships completing autonomous navigation and obstacle avoidance. A comparison experiment with the PPO algorithm was carried out to validate the algorithm’s performance. The POND algorithm suggested in this research is based on a hybrid of the PPO and RND algorithms and adds the rules of collision avoidance at sea. The hyperparameters of the POND algorithm are shown in Table 1. The PPO approach is founded on the actor–critic model, which outputs actions through the actor network; the hyperparameters of its probability distribution model are shown in Table 2, and the hyperparameters of the random network distillation module are shown in Table 3.

5.3. Experimental Results of Reinforcement Methods in Crowded Marine Environments

The training results are shown in Figure 11. We conducted comparative experiments based on the PPO and POND optimization algorithms and three feature extraction networks, and selected six representative sets of experimental data from all of the test results. The x-axis in the figure indicates the number of training steps, and the y-axis depicts the cumulative reward curve of the training process. At the beginning of training, the agents of each method navigated poorly and tended to leave the navigation area, wander erratically, or collide with obstacles. After a period of exploration, the PPO algorithm and the POND algorithm based on the nature_cnn feature extraction network gradually learned a suitable navigation strategy to steer the autonomous vessel to avoid collisions and arrive at the destination. Since the PPO algorithm with the simple network structure in Figure 11 could not complete the task of autonomous navigation and collision avoidance, and its training results clearly showed the agent developing a trend detrimental to the safety of autonomous ship navigation, we plotted the cumulative reward trends of the five other algorithm variants (combinations of the PPO and POND algorithms with different network structures) separately for analysis. It can be clearly seen from Figure 11 that the algorithms with the nature_cnn network can better complete the autonomous navigation and obstacle avoidance task, and that, compared with the traditional PPO algorithm, the POND algorithm exhibits smaller reward fluctuations, indicating better training efficiency and a better convergence effect. Our proposed algorithm first completed the task in about 60,000 steps and started to converge after 200,000 steps. However, the algorithms based on the other two feature extraction networks (Resnet and simple) generally failed to complete the task of autonomous navigation and obstacle avoidance, or the planned route was not optimal. For example, with the combination of the POND algorithm and the Resnet network structure, although training converged, the cumulative rewards differed considerably from those of the two methods above, and it can be clearly concluded that many points were deducted during the sailing process, for example due to longer routes.
In addition, we analyzed the influence of each network structure on whether the algorithm converged. According to Figure 12, for both algorithms, the nature_cnn network converges better than the other networks: its cumulative reward value is higher than that of the other two, the fluctuation is small, and the convergence is relatively stable. The performance of the simple network is worse than that of the other two; it can be clearly seen from the figure that its combination with either algorithm led the agent in a direction detrimental to navigation safety. The reason may be that the network structure is too simple, so useful marine environment information cannot be captured and erroneous environmental information is passed to the agent, leading to training failure. The algorithms equipped with the Resnet network completed the task of autonomous navigation and obstacle avoidance, but their cumulative reward was lower than that of nature_cnn, so they did not plan the best route to guide the autonomous ship. The reason may be that, in a crowded marine environment, there are not that many features to extract, so an overly complex feature extraction network structure has a negative impact. Therefore, when performing autonomous navigation tasks in a crowded ocean environment, the nature_cnn feature extraction network structure is more conducive to the agent’s decision-making than the others.
Figure 13 shows the trend in policy loss values per episode. As expected, the POND algorithm with the nature_cnn network is quick and effective in finding sensible policies, and its policy loss is consistently lower than that of the conventional PPO algorithm. The peak policy loss of the POND algorithm is 0.14, while that of the PPO algorithm is as high as 0.45. Although the final policy loss of both algorithms converges, the overall volatility of POND is significantly smaller than that of the PPO algorithm. The trend in value loss is shown in Figure 12. The peak value loss of the PPO algorithm is 0.75, while that of the POND algorithm is only 0.27. Although the value loss of the PPO algorithm equipped with the Resnet network converges the fastest, its cumulative reward trend shows that the algorithm leads the agent in the wrong direction, such as leaving the sea area or colliding with an obstacle. Therefore, considering the actual sailing situation and the safety of autonomous ships, the POND algorithm equipped with the nature_cnn network performs the best; although its value loss converges more slowly than that of the above PPO algorithm, its loss value is still within an acceptable range.
Figure 14 shows the extrinsic reward from the environment during training. Comparing the rising speed of the extrinsic reward with that of the cumulative reward, it can be inferred that the intrinsic reward diminishes with iteration. At the beginning of training, the agent often receives larger intrinsic rewards as it discovers new states; as its experience grows, it becomes accustomed to the surroundings more quickly. Our model, which incorporates intrinsic rewards, can therefore learn fundamental driving rules more quickly in the early stage and accrue rewards sooner. Meanwhile, because of the crowded ocean environment, the conventional external reward alone cannot reliably indicate the correct driving route to the agent, but with the intrinsic reward mechanism of the RND model, the agent can better learn efficient and correct driving strategies.
To present the training outcomes more realistically, the performance of the POND algorithm for autonomous navigation in Unity 3D in a marine environment of 100 × 100 grids is shown in Figure 15, where the green box marks the autonomous ship and its drone. The drone processes the environmental information from its own perspective and sends it to the agent’s brain for processing. The image captured by the drone is shown in Figure 16, in which the autonomous ship (agent) is marked by the blue identification box, and the red identification boxes mark the obstacle ships (Obship) with unknown motion status and the reefs that need to be avoided in the marine environment. The initial positions of the autonomous ship and the obstacles are shown in Figure 15a. Facing an oncoming obstacle ship, the autonomous ship turns right to avoid it, as shown in Figure 15b. Then, an obstacle ship approaching laterally is detected on the right side while the autonomous ship is heading for the target, as shown in Figure 15c, and the autonomous ship makes a left turn in the face of this obstacle, as shown in Figure 15d. When a static reef lies on its heading, based on the environmental separation between the existing situation and the desired outcome, the autonomous ship continues its left turn and passes the reef to avoid it, as shown in Figure 15e, where the autonomous ship sails to the left of the reef. It then turns back to the right toward the target and finally arrives at the desired location, successfully completing the task of autonomous navigation and obstacle avoidance, as shown in Figure 15f.
Generalization performance: We investigated the generalization performance of our model. Previous studies have typically trained and tested in a fixed environmental state, whereas in actual ocean navigation the motion information of obstacles is often unknown. Therefore, in order to enable the model to perform better in different unknown environments, we applied the model to a simulated ocean environment in which the obstacle motion information was set randomly, and compared its training convergence. Figure 17 and Figure 18 show the performance of the POND model in an unknown marine environment. Figure 17 depicts the cumulative rewards accrued by the generalized model. It is evident that during the initial phase of training, the reward values are low, suggesting that the agent is prone to collisions with obstacles. However, after 200,000 steps, the cumulative rewards consistently exceed zero, signifying that the agent successfully navigates to the target. Notably, this also implies an extended navigation route, which could be attributed to the randomly generated obstacle movements. Figure 18 delineates the dynamics of value loss and policy loss through two distinct curves, ‘a’ and ‘b’, respectively. Curve ‘a’ reveals an initial increase in value loss during successful training, which subsequently diminishes as the reward values stabilize. Conversely, curve ‘b’ demonstrates the policy loss, which oscillates throughout the training phase and remains below 1. These data suggest that the generalized collision avoidance model is adept at handling scenarios with randomly generated obstacle states.
The aforementioned experimental findings demonstrate that POND is capable of avoiding both static and moving objects in both familiar and unfamiliar surroundings. Compared with the traditional PPO algorithm, POND features faster convergence and greater search efficiency. As a result, the autonomous ship can determine the optimal sailing path to reach the objective without running into any obstacles under the supervision of the POND algorithm.

6. Conclusions and Discussion

6.1. Conclusions

In this paper, salient marine environmental constituents such as static (e.g., reefs) and dynamic (e.g., moving vessels) obstacles were taken into account. Furthermore, deep reinforcement learning was amalgamated with random network distillation to engineer an autonomous collision avoidance decision-making model for surface vessels, predicated on the COLREGs. To emulate the actual marine navigation milieu, a comprehensive collision avoidance decision system was developed, wherein an unmanned aerial vehicle (UAV) was deployed to extract features from the marine environment utilizing visual sensors, with obstacles circumvented using the devised algorithm. Additionally, taking into consideration the inherent variability in obstacle motion, the algorithm’s generalization capability was augmented by assigning random values within a specified range to the initial positions, number of obstacles, motion data, and dynamic obstacle target points. Through extensive training, it was ascertained that the autonomous vessel employing the POND algorithm proficiently evades obstacles in congested marine environments and reaches predetermined destinations in adherence with the COLREGs. To ascertain the superiority of the model, it was juxtaposed with the widely utilized PPO model. The results indicate that the decision-making model proffered in this investigation exhibits more rapid convergence and a significantly higher reward value compared to the PPO model.
This study is distinguished from others in the following aspects:
  • While extant research on autonomous collision avoidance algorithms predominantly relies on measurement data from sensors onboard the vessel, which yields limited oceanic information, this investigation utilized an integrated UAV-USV decision system, harnessing UAV observational data to construct an environmental model and thereby maximizing the information collected around the vessel.
  • Existing algorithms generally incorporate the COLREGs into the reward function with minimal consideration of other reward factors, which compromises the reduction in energy consumption. This study introduced a reward function that integrates environmental data with an intrinsic reward mechanism (see the sketch after this list), incentivizing exploration during training and optimal navigational actions, thereby enhancing collision avoidance efficacy and aligning with sustainable navigation practices.
  • In contrast to the majority of collision avoidance models, which focus solely on static obstacle avoidance, this study considered both dynamic and static obstacles and randomized their generation to improve model generalization.
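As a hedged illustration of the second point above, the sketch below shows one common way to combine a COLREGs-shaped environment reward with a weighted RND bonus, together with a relative-bearing encounter classification often used in COLREGs-based reward design. The sector thresholds and the scalar combination are typical choices from the literature, not the exact reward formulation of this paper.

```python
def classify_encounter(rel_bearing_deg: float) -> str:
    """Illustrative COLREGs encounter classification by relative bearing of the
    target ship (0 deg = dead ahead, clockwise). Thresholds are common choices
    in the literature, not necessarily this paper's sector boundaries."""
    b = rel_bearing_deg % 360.0
    if b <= 15.0 or b >= 345.0:
        return "head-on"            # Rule 14: both vessels alter course to starboard
    if 15.0 < b < 112.5:
        return "crossing-give-way"  # Rule 15: target on own starboard side, keep clear
    if 247.5 < b < 345.0:
        return "crossing-stand-on"  # Rule 17: hold course and speed
    return "overtaking"             # Rule 13: the overtaking vessel keeps clear

def step_reward(env_reward: float, intrinsic_reward: float,
                strength: float = 1.0) -> float:
    """Per-step reward fed to the agent: the external COLREGs-shaped environment
    reward plus the weighted RND bonus (strength, cf. Table 3). The shaping terms
    inside env_reward (goal progress, collision penalty, rule-compliant manoeuvre
    bonus) are abstracted away here."""
    return env_reward + strength * intrinsic_reward
```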
In conclusion, the collision avoidance model introduced in this study holds substantial promise for real-world applications and can be integrated into intelligent collision avoidance systems for unmanned vessels.

6.2. Discussion

Collision avoidance is a critical component of the autonomous navigation of maritime vessels. This study aimed to enable an autonomous ship to discern and compute the optimal navigation route via its affiliated unmanned aerial vehicle (UAV), employing only the pertinent visual sensors. However, several aspects warrant augmentation and further investigation before the collision avoidance model can be applied in practice and researched further:
  • In the present study, the collision avoidance model has been evaluated exclusively in simulated environments. The subsequent phase will entail conducting sea trials and employing the collision avoidance model aboard research support vessels for empirical validation.
  • Moving forward, it is imperative to contemplate the integration of unanticipated factors within the collision avoidance model to bolster navigational safety, particularly in scenarios involving erratic positional alterations by human-navigated vessels in real maritime environments.
  • The model, in its current state, is confined to the domain of path planning. However, the maritime navigation process necessitates a heightened focus on the vessel’s state, particularly in regard to crewed ships. Future endeavors should incorporate the vessel’s hydrodynamic performance metrics, such as wave resistance and maneuverability, into the environmental data. Moreover, strategies for the efficacious incorporation of the vessel’s orientation and the aquatic forces acting upon it, as well as the autonomous adjustment of the vessel’s orientation, velocity, and trajectory based on current conditions, should be explored.

Author Contributions

Conceptualization, X.P. and F.H.; methodology, X.P.; software, W.Z. and Y.Z.; validation, X.P., F.H. and G.X.; formal analysis, G.X.; investigation, G.X.; resources, F.H.; data curation, F.H.; writing—original draft preparation, X.P.; writing—review and editing, X.P.; visualization, W.Z.; supervision, F.H.; project administration, F.H.; funding acquisition, F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (grant number 2022YFB3306200) and the Natural Science Foundation of Heilongjiang Province of China (grant number LH2021E047).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Millefiori, L.M.; Braca, P.; Zissis, D.; Spiliopoulos, G.; Marano, S.; Willett, P.K.; Carniel, S. COVID-19 impact on global maritime mobility. Sci. Rep. 2021, 11, 18.
  2. Wang, X.; Liu, Z.; Cai, Y. The ship maneuverability based collision avoidance dynamic support system in close-quarters situation. Ocean Eng. 2017, 146, 486–497.
  3. EMSA. Annual Overview of Marine Casualties and Incidents; EMSA: Tulsa, OK, USA, 2021; pp. 4–5.
  4. Song, R.; Liu, Y.; Bucknall, R. Smoothed A* algorithm for practical unmanned surface vehicle path planning. Appl. Ocean Res. 2019, 83, 9–20.
  5. Zhang, Z.; Wu, D.; Gu, J.; Li, F. A path-planning strategy for unmanned surface vehicles based on an adaptive hybrid dynamic stepsize and target attractive force-RRT algorithm. J. Mar. Sci. Eng. 2019, 7, 132.
  6. Wang, D.; Wang, P.; Zhang, X.; Guo, X.; Shu, Y.; Tian, X. An obstacle avoidance strategy for the wave glider based on the improved artificial potential field and collision prediction model. Ocean Eng. 2020, 206, 107356.
  7. Ding, F.; Zhang, Z.; Fu, M.; Wang, Y.; Wang, C. Energy-efficient path planning and control approach of USV based on particle swarm optimization. In Proceedings of the OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, USA, 22–25 October 2018.
  8. Liu, X.; Li, Y.; Zhang, J.; Zheng, J.; Yang, C. Self-adaptive dynamic obstacle avoidance and path planning for USV under complex maritime environment. IEEE Access 2019, 7, 114945–114954.
  9. Kozynchenko, A.I.; Kozynchenko, S.A. Applying the dynamic predictive guidance to ship collision avoidance: Crossing case study simulation. Ocean Eng. 2018, 164, 640–649.
  10. Chae, H.; Kang, C.M.; Kim, B.; Kim, J.; Chung, C.C.; Choi, J.W. Autonomous braking system via deep reinforcement learning. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems, Yokohama, Japan, 6–19 October 2017.
  11. Kahn, G.; Villaflor, A.; Pong, V.; Abbeel, P.; Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. arXiv 2017, arXiv:1702.01182.
  12. Everett, M.; Chen, Y.F.; How, J.P. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018.
  13. Duguleana, M.; Mogan, G. Neural networks based reinforcement learning for mobile robots obstacle avoidance. Expert Syst. Appl. 2016, 62, 104–115.
  14. Chen, C.; Chen, X.-Q.; Ma, F.; Zeng, X.-J.; Wang, J. A knowledge-free path planning approach for smart ships based on reinforcement learning. Ocean Eng. 2019, 189, 106299.
  15. Tai, L.; Liu, M. Towards cognitive exploration through deep reinforcement learning for mobile robots. arXiv 2016, arXiv:1610.01733.
  16. Zhang, R.; Tang, P.; Su, Y.; Li, X.; Yang, G.; Shi, C. An adaptive obstacle avoidance algorithm for unmanned surface vehicle in complicated marine environments. IEEE/CAA J. Autom. Sin. 2014, 1, 385–396.
  17. Woo, J.; Kim, N. Collision avoidance for an unmanned surface vehicle using deep reinforcement learning. Ocean Eng. 2020, 199, 107001.
  18. Du, Y.; Zhang, X.; Cao, Z.; Wang, S.; Liang, J.; Zhang, F.; Tang, J. An optimized path planning method for coastal ships based on improved DDPG and DP. J. Adv. Transp. 2021, 2021, 7765130.
  19. Cao, X.; Sun, C.; Yan, M. Target search control of AUV in underwater environment with deep reinforcement learning. IEEE Access 2019, 7, 96549–96559.
  20. He, Y.; Jin, Y.; Huang, L.; Xiong, Y.; Chen, P.; Mou, J. Quantitative analysis of COLREG rules and seamanship for autonomous collision avoidance at open sea. Ocean Eng. 2017, 140, 281–291.
  21. Zhao, Y.; Li, W.; Shi, P. A real-time collision avoidance learning system for unmanned surface vessels. Neurocomputing 2016, 182, 255–266.
  22. Zhang, J.; Zhang, D.; Yan, X.; Haugen, S.; Soares, C.G. A distributed anti-collision decision support formulation in multi-ship encounter situations under COLREGs. Ocean Eng. 2015, 105, 336–348.
  23. Zhao, L. Simulation Method to Support Autonomous Navigation and Installation Operation of an Offshore Support Vessel. Doctoral Dissertation, Seoul National University, Seoul, Republic of Korea, 2019.
  24. Zhao, L.; Roh, M.-I. COLREGs-compliant multiship collision avoidance based on deep reinforcement learning. Ocean Eng. 2019, 191, 106436.
  25. Wang, W.; Huang, L.; Liu, K.; Wu, X.; Wang, J. A COLREGs-compliant collision avoidance decision approach based on deep reinforcement learning. J. Mar. Sci. Eng. 2022, 10, 944.
  26. Zhai, P.; Zhang, Y.; Wang, S. Intelligent ship collision avoidance algorithm based on DDQN with prioritized experience replay under COLREGs. J. Mar. Sci. Eng. 2022, 10, 585.
  27. Vagale, A.; Oucheikh, R.; Bye, R.T.; Osen, O.L.; Fossen, T.I. Path planning and collision avoidance for autonomous surface vehicles I: A review. J. Mar. Sci. Technol. 2021, 26, 1292–1306.
  28. Papadimitriou, C.H.; Tsitsiklis, J.N. The complexity of Markov decision processes. Math. Oper. Res. 1987, 12, 441–450.
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
  30. Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv 2018, arXiv:1810.12894.
  31. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  32. Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Kavukcuoglu, K. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
  33. Juliani, A.; Berges, V.P.; Teng, E.; Cohen, A.; Harper, J.; Elion, C.; Lange, D. Unity: A general platform for intelligent agents. arXiv 2018, arXiv:1809.02627.
  34. Keras-rl. Available online: https://github.com/keras-rl/keras-rl (accessed on 6 June 2021).
  35. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438.
Figure 1. Autonomous vessel path planning system.
Figure 2. Four anti-collision action areas according to relative positions.
Figure 3. Random network distillation framework diagram.
Figure 4. POND algorithm framework diagram.
Figure 5. The reward module mechanism.
Figure 6. Schematic diagram of reward function setting.
Figure 7. Schematic diagram of the actor network architecture.
Figure 8. Schematic diagram of the critic network architecture.
Figure 9. Flowchart of the POND algorithm.
Figure 10. Simulation example map. ((a) The overhead view image from the UAV; (b) the overhead view image after processing by the recognition algorithm.) Note: in this simulation platform, the ship's roll, pitch, yaw, surge, sway, and heave are not considered, and other influencing variables such as real wind speed and direction, ocean currents, and power are ignored.
Figure 11. Variation in cumulative reward per step in a crowded ocean environment. ((a) The cumulative reward obtained by five different algorithms during training; (b) the cumulative reward obtained by six different algorithms during training.)
Figure 12. Cumulative reward changes of different network structures at each step. ((a) The cumulative reward obtained by the PPO algorithm trained with three different networks; (b) the cumulative reward obtained by the POND algorithm trained with three different networks.)
Figure 13. Changes in policy loss and value loss of different network structures at each step. ((a) The policy loss obtained during training for six different methods; (b) the value loss obtained during training for six different methods.)
Figure 14. The change in the external marine environment reward at each step of the POND algorithm.
Figure 15. The trajectory of autonomous ships in Unity3D. ((a–f) Images of the simulation environment at the moments when the UAV-USV system executes each action in a representative collision avoidance scenario.)
Figure 16. The trajectory of autonomous ships captured by drones in Unity3D. ((a–f) The environmental position of the autonomous ship after each collision avoidance action, captured by the UAV and processed by the target identification algorithm.)
Figure 17. Cumulative reward change at each step of the POND algorithm in an unknown environment.
Figure 18. Changes in policy loss and value loss at each step of the POND algorithm in an unknown environment. ((a) The value loss obtained by the POND generalization model during training; (b) the policy loss obtained by the POND generalization model during training.)
Table 1. Hyperparameters and training parameters of POND.

Parameter | Content | Value
Trainer_type | RL type | PPO
Summary_freq | Parameter inputs to the next training | 20,000
Time_horizon | Number of training steps inserted into the replay buffer | 5
Max_steps | Total number of training steps | 1,000,000
Learning_rate | Gradient descent rate | 3.0 × 10⁻⁴
Learning_rate_schedule | Gradient descent method | Linear
Batch_size | Number of data selected for each gradient descent | 32
Buffer_size | Amount of data required for each model update | 256
Hidden_units | Number of hidden layer cells in the POND network | 256
Num_layers | Number of hidden layers in the POND network | 256
Normalize | Environment vector input normalization | 3
Vis_encode_type | Vision sensor data encoder selection | Simple/Resnet/Cnn
Table 2. Hyperparameters of PPO.

Parameter | Content | Value
beta | Strategy randomness regularization | 5.0 × 10⁻³
epsilon | Speed of policy change | 0.2
lambd | Regularization parameter used when calculating GAE [35] | 0.95
beta_schedule | The way the beta parameter changes | Linear
epsilon_schedule | The way the epsilon parameter changes | Linear
num_epoch | Number of complete passes through the training data set | 3
Table 3. Hyperparameters of RND.

Parameter | Content | Value
strength | Intrinsic reward weighting | 1.0
gamma | Bonus discount factor | 0.9
learning_rate | Model iteration update rate | 3 × 10⁻⁴
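For convenience, the hyperparameters in Tables 1–3 can be gathered into a single trainer configuration. The Python sketch below arranges them in a structure loosely modelled on a Unity ML-Agents trainer config; the nesting, key names, and the values flagged in the comments are assumptions, not the authors' actual configuration file.

```python
import json

# Hyperparameters from Tables 1-3, arranged in an ML-Agents-like structure.
pond_config = {
    "trainer_type": "ppo",
    "summary_freq": 20000,
    "time_horizon": 5,
    "max_steps": 1_000_000,
    "hyperparameters": {
        "learning_rate": 3.0e-4,
        "learning_rate_schedule": "linear",
        "batch_size": 32,
        "buffer_size": 256,
        "beta": 5.0e-3,            # policy entropy regularization (Table 2)
        "epsilon": 0.2,            # PPO clipping range (Table 2)
        "lambd": 0.95,             # GAE lambda (Table 2)
        "beta_schedule": "linear",
        "epsilon_schedule": "linear",
        "num_epoch": 3,
    },
    "network_settings": {
        "hidden_units": 256,
        "normalize": True,         # assumed boolean; Table 1 lists the raw extracted value
        "vis_encode_type": "simple",  # one of simple / resnet / cnn per Table 1
    },
    "reward_signals": {
        "extrinsic": {"strength": 1.0, "gamma": 0.99},                   # illustrative; not listed in the tables
        "rnd": {"strength": 1.0, "gamma": 0.9, "learning_rate": 3e-4},   # Table 3
    },
}

if __name__ == "__main__":
    print(json.dumps(pond_config, indent=2))
```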
