Abstract
Decision-making for collision avoidance in complex maritime environments is a critical technology in the field of autonomous ship navigation. However, existing collision avoidance decision algorithms still suffer from unstable strategy exploration and poor compliance with regulations. To address these issues, this paper proposes a novel autonomous ship collision avoidance algorithm, the dynamically adjusted entropy proximal policy optimization (DAE-PPO). Firstly, a reward system suitable for complex maritime encounter scenarios is established, integrating the International Regulations for Preventing Collisions at Sea (COLREGs) with collision risk assessment. Secondly, the exploration mechanism is optimized using a quadratically decreasing entropy method to effectively avoid local optima and enhance strategic performance. Finally, a simulation testing environment based on Unreal Engine 5 (UE5) was developed to conduct experiments and validate the proposed algorithm. Experimental results demonstrate that the DAE-PPO algorithm exhibits significant improvements in efficiency, success rate, and stability in collision avoidance tests. Specifically, it shows a 45% improvement in success rate per hundred collision avoidance attempts compared to the classic PPO algorithm and a reduction of 0.35 in the maximum collision risk (CR) value during individual collision avoidance tasks.
1. Introduction
As the global economy rapidly expands, the number of maritime vessels, often regarded as a barometer of the global economic climate, has increased significantly [1]. This growth has made maritime routes increasingly crowded, escalating concerns about navigational safety. Against this backdrop, the development of effective collision avoidance mechanisms for ships has become a critical issue in ensuring navigational safety in complex maritime environments, and the International Regulations for Preventing Collisions at Sea (COLREGs) play a pivotal role in this effort [2]. However, statistics indicate that over 80% of maritime collisions are caused by human error, including the crew’s failure to fully and effectively comply with the COLREGs at critical moments [3]. The primary research goal for autonomous maritime collision avoidance algorithms is therefore to enable ships to make effective collision avoidance decisions in complex environments, thus minimizing or preventing accidents.
Various algorithms for ship collision avoidance automation have been proposed, such as the A* algorithm [4], genetic algorithm (GA) [5], and artificial potential field (APF) [6]. These classic path-planning techniques have supported the preliminary exploration of intelligent maritime navigation. However, these algorithms, while useful, encounter several limitations, including high computational complexity, strong dependence on environmental models and heuristic functions, and a lack of learning capabilities. These issues are particularly problematic in scenarios involving unknown environments, moving obstacles, and other unquantifiable factors that are crucial in unmanned maritime path planning.
In contrast, recent advancements in automation and artificial intelligence have opened new avenues for addressing these challenges in maritime collision avoidance. Deep reinforcement learning (DRL) algorithms, known for their adaptability and versatility, are expected to address these shortcomings and are becoming a focal point of collision avoidance research. Guo et al. [7] combined reinforcement learning algorithms, such as the deep deterministic policy gradient (DDPG), with the APF to enhance learning efficiency and convergence speed; however, their work did not account for ship models or actual environmental conditions, necessitating further validation in complex settings. Cheng and Zhang [8] designed a comprehensive deep reinforcement learning reward function tailored to underactuated unmanned marine vessels, taking environmental disturbances and vessel motion characteristics into account. They successfully implemented path planning in complex maritime areas with numerous static obstacles; however, the applicability of this method in scenarios with dynamic obstacles remains to be evaluated. Huang et al. [9] adjusted the standard Deep Q-Network (DQN) architecture to better handle the complexity and partial observability of marine environments, although further validation is required to assess the algorithm’s performance in dynamic, multi-obstacle settings. Xu et al. [10] developed a COLREGs-compliant intelligent collision avoidance algorithm (CICA) that tracks and updates network weights; however, its action space consists of only three discrete values, resulting in a lack of continuity in actions.
Compared to other DRL algorithms, the proximal policy optimization (PPO) algorithm offers greater stability and efficiency in managing continuous action spaces, making it particularly well suited to the continuous control challenges of ship maneuvering [11]. Xiao et al. [12] introduced a novel distributed sampling strategy that enhances the balance and diversity of sample collection through regional segmentation and incorporated a Beta policy to address action boundary issues in continuous action spaces, thereby increasing the success rate of path planning and reward accumulation. Meyer et al. [13] proposed a new observation vector and reward function design, including a feasibility pooling algorithm for real-time sensor data dimension reduction, and introduced a reward trade-off parameter λ, enabling the agent to dynamically adjust its navigation strategy based on the current policy. Guan et al. [14] proposed an improved algorithm that integrates generalized advantage estimation into the loss function of the PPO algorithm, verifying that unmanned ships can autonomously navigate and avoid collisions without human intervention. Wu et al. [15] developed a collision avoidance path planning method combining the dynamic window approach with the PPO algorithm, which has been validated through simulation experiments to be effective in near-shore navigation.
Although the PPO algorithm has made significant progress in maritime collision avoidance, there remains room for improvement in exploration efficiency and policy performance. Particularly, it exhibits limitations in balancing exploration and exploitation. In complex maritime environments, the algorithm may become trapped in local optima, limiting further performance enhancements. To overcome this challenge, introducing appropriate exploration mechanisms is crucial, especially when dealing with complex or high-dimensional decision spaces. Such mechanisms can help the algorithm escape local optima and explore a broader solution space, thereby improving overall decision quality and strategy robustness.
Entropy regularization is an effective method to avoid local optima by encouraging agents to engage in broader exploration during training, fostering the discovery of new possibilities, and maintaining the randomness of strategies [16,17]. However, applying this technique to complex environments at sea and balancing entropy parameters to achieve optimal performance between exploration and exploitation remains a nuanced challenge. To address these issues, this paper introduces an improved PPO algorithm with dynamically adjusted entropy (DAE-PPO). This research introduces a quadratically decreasing entropy approach designed to maintain a high entropy coefficient early in training, allowing for extensive exploration, thus preventing convergence to local optima and adapting to more complex training environments. As training progresses, the entropy coefficient gradually decreases, mitigating abrupt changes in exploration intensity, ensuring a smooth transition during the training process, reducing instability, and accelerating model convergence. The main contributions of this paper are as follows:
- (1)
- An enhanced proximal policy optimization (PPO) algorithm based on dynamic entropy adjustment is proposed. This approach optimizes the entropy regularization framework to improve the exploration efficiency and policy performance of the PPO algorithm without introducing additional hyperparameters. A PPO network framework specifically designed for maritime collision avoidance has been developed, with various improvements analyzed and compared based on a comprehensive training environment.
- (2)
- A collision risk (CR) metric is introduced into the reward function, based on the distance to the closest point of approach (DCPA) and time to the closest point of approach (TCPA). Regulations from COLREGs are integrated, and factors influencing the collision avoidance process are considered, constructing a refined reward signal tailored for training unmanned ships.
- (3)
- The proposed algorithm is implemented on the Unreal Engine 5 (UE5) physics engine platform, creating a simulation environment that mirrors maritime navigation characteristics. Experimental validation is conducted to demonstrate the effectiveness and practicality of the proposed method.
This study contributes a robust and compliant approach to collision avoidance in autonomous ship navigation, enhancing safety and operational efficiency in complex maritime environments.
The organization of this paper is as follows: Section 2 presents the ship dynamics model, COLREGs, collision risks, and mathematical models of ships. Section 3 covers deep reinforcement learning methods, network frameworks, PPO optimization techniques, the design of action and state spaces, and reward function design. Section 4 describes the establishment of the training environment and the presentation of training results, and it includes tests conducted on the improved ship collision avoidance algorithm within the UE5 environment. The paper concludes with a summary and outlook.
2. Problem Description
2.1. Ship Motion Dynamics
This study utilizes the mathematical models of ship dynamics proposed by Yasukawa and Yoshimura (2015) [18], Sandeepkumar et al. (2022) [19], and Sivaraj et al. (2022) [20] to construct the simulation environment, as shown in Equation (1):
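In the standard three-degree-of-freedom MMG formulation, a representative form of these equations (reproduced from [18] for reference; the authors’ exact grouping of terms may differ slightly) is

$$(m + m_x)\dot{u} - (m + m_y)vr - x_G m r^{2} = X_H + X_P + X_R + X_W$$
$$(m + m_y)\dot{v} + (m + m_x)ur + x_G m \dot{r} = Y_H + Y_R + Y_W$$
$$(I_{zG} + J_z + x_G^{2}m)\dot{r} + x_G m(\dot{v} + ur) = N_H + N_R + N_W$$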
where $m$ represents the mass of the ship; $m_x$ and $m_y$ denote the added mass in the x and y directions; $u$, $v$, $r$ and $\dot{u}$, $\dot{v}$, $\dot{r}$ represent the velocities and accelerations in the x, y, and yaw directions, respectively; $I_{zG}$ and $J_z$ represent the moment of inertia and the added moment of inertia about the z-axis, respectively; and $x_G$ is the longitudinal coordinate of the center of gravity. The hydrodynamic forces are divided into components caused by the hull (subscript $H$), propeller ($P$), rudder ($R$), and waves ($W$).
During navigation, the ship acquires relevant parameters from other vessels and calculates collision avoidance parameters, as shown in Figure 1. The positions of one’s own ship (OS), the target ship (TS), and the goal are $(x_O, y_O)$, $(x_T, y_T)$, and $(x_g, y_g)$, respectively. The OS’s speed is $V_O$ (the blue arrow indicates the direction of the velocity), its heading is $\psi_O$, and its rudder angle is $\delta$. The target ship’s speed is $V_T$, its heading is $\psi_T$, its bearing relative to one’s own ship is $\alpha_T$, the relative bearing from the OS to the goal is $\alpha_g$, and the distance to the nearest point on the reference path is $d_e$.
Figure 1.
Collision avoidance schematic diagram.
2.2. COLREGs
Accurately understanding the behavioral norms required in different collision scenarios, and taking the maritime collision avoidance rules into account, is crucial for the development of autonomous ship collision avoidance systems. Based on the COLREGs and an in-depth analysis of the steering and sailing rules, we have systematically organized the collision avoidance strategies that should be followed when a ship encounters other moving vessels, as shown in Figure 2.
Figure 2.
Collision avoidance according to the COLREGs for each situation.
- (1)
- Overtaking: When a ship is located within the stern sector of another moving ship (relative bearings of 112.5° to 247.5°), it is in an overtaking encounter situation. If the overtaking ship is faster than the ship being overtaken, it must take appropriate measures to avoid a collision, altering course to port or starboard so that the two ships pass at a safe distance.
- (2)
- Head-on: When two ships are on opposing or nearly opposing courses, and the bearing of the approaching ship is within the 0° to 5° or 355° to 360° range relative to one’s own ship, it is defined as a head-on situation. In this case, both ships should alter course to starboard, passing port to port to prevent a collision.
- (3)
- Crossing-stand-on: If a moving ship is located on the port side of one’s own ship, within a bearing range of 247.5° to 355°, it is defined as a crossing-stand-on. In this scenario, one’s own ship, being the stand-on ship, should maintain its course and speed.
- (4)
- Crossing-give-way: When a moving ship is situated on the starboard side of one’s own ship, with a bearing between 0° and 112.5°, it is defined as a crossing-give-way. In this case, one’s own ship should take action to alter course to starboard to ensure a safe distance between the ships and to avoid the crossing ship.
2.3. Collision Risk
The introduction of collision risk (CR) not only simplifies the decision-making process for collision avoidance but also enhances the predictability and controllability of collision avoidance actions. The calculation of CR is based on the quantitative analysis of key parameters such as TCPA and DCPA. As shown in Figure 3, $D$ represents the distance between the two ships, $C_r$ is the relative heading between one’s own ship and the target ship, and $\alpha_T$ indicates the bearing of the target ship relative to one’s own ship. $V_O$, $V_T$, and $V_r$ denote the speed of one’s own ship, the speed of the target ship, and their relative speed, respectively. The blue arrow indicates the direction of the velocity. These parameters form a comprehensive risk assessment system for ship collision avoidance, providing an intuitive and reliable method to assess collision risk.
Figure 3.
Concepts of DCPA and TCPA.
The formulas for calculating the distance to the closest point of approach (DCPA) and the time to the closest point of approach (TCPA) between two ships are as follows:
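Writing $\mathbf{p}_r$ and $\mathbf{v}_r$ for the position and velocity of the TS relative to the OS, a standard kinematic form of these two quantities (the vector notation here is introduced for reference and is not taken from the original displays) is

$$\mathrm{TCPA} = -\frac{\mathbf{p}_r \cdot \mathbf{v}_r}{\lVert \mathbf{v}_r \rVert^{2}}, \qquad \mathrm{DCPA} = \left\lVert \mathbf{p}_r + \mathrm{TCPA}\cdot \mathbf{v}_r \right\rVert$$

with TCPA meaningful only while it is positive, i.e., while the two ships are still approaching each other.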
To accurately assess the CR and effectively implement avoidance strategies, it is essential to consider the ship domain (SD) [21] as a critical factor. The size and shape of the ship domain directly determine the outcome of the CR assessment, defining the scenarios in which ships are deemed to face potential collision risks. Numerous methods exist for determining ship domains [22,23], ranging from complex theoretical algorithms to detailed adjustments based on expert knowledge. In this paper, we employ a circular ship domain as a simple and intuitive method, which is both practical and efficient, as shown in Figure 2, which illustrates the ship domains of the OS and TS with radius r. The circular ship domain is easy to calculate and apply, significantly simplifying operational procedures and enhancing the efficiency and accuracy of ship collision avoidance decisions. Despite its simplicity, this method offers significant advantages in navigational safety and decision-making efficiency, providing a clear approach to addressing challenges in complex maritime conditions. Therefore, the CR assessment designed in this paper follows the CPA-based formulation of [24,25], combining DCPA and TCPA through weighting coefficients.
To determine the coefficients for DCPA and TCPA, the sizes and speeds of the OS and the TS, as well as the operational distance at which actual collision avoidance maneuvers commence, were considered. These coefficients are set so that when the TS is at the edge of the OS’s recognition distance (dr), a constant collision risk threshold, defined as the allowable CR (CRal), is maintained. The dr is defined as the distance at which one’s own ship starts to identify and monitor the target ship, while the CRal determines when one’s own ship should take action to avoid a collision with the target ship. To validate the effectiveness of the designed method, a KVLCC2 tanker [26] was chosen as the example vessel. Table 1 summarizes the main parameters of this example ship.
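As an illustration of how these quantities enter the risk assessment, the sketch below computes DCPA and TCPA from relative kinematics and combines them into a CR value with an exponential weighting. The combination, the coefficients `a` and `b`, and the example inputs are hypothetical placeholders rather than the formula of [24,25]; in the authors’ scheme the coefficients are calibrated so that CR equals CRal when the TS lies at the recognition distance dr.

```python
import math

def cpa(p_os, v_os, p_ts, v_ts):
    """DCPA/TCPA from relative kinematics (positions in m, velocities in m/s)."""
    px, py = p_ts[0] - p_os[0], p_ts[1] - p_os[1]
    vx, vy = v_ts[0] - v_os[0], v_ts[1] - v_os[1]
    v2 = vx * vx + vy * vy
    tcpa = 0.0 if v2 == 0.0 else -(px * vx + py * vy) / v2
    dcpa = math.hypot(px + tcpa * vx, py + tcpa * vy)
    return dcpa, tcpa

def collision_risk(dcpa, tcpa, a=1.0e-3, b=5.0e-3):
    """Illustrative CR in [0, 1]: risk grows as DCPA and TCPA shrink (a, b are placeholders)."""
    if tcpa < 0.0:          # the ships have already passed the closest point of approach
        return 0.0
    return math.exp(-(a * dcpa + b * tcpa))

dcpa, tcpa = cpa((0.0, 0.0), (5.0, 0.0), (3000.0, 1000.0), (-4.0, -1.0))
print(round(dcpa, 1), round(tcpa, 1), round(collision_risk(dcpa, tcpa), 3))
```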
Table 1.
Principal dimensions of KVLCC2 and key parameters for collision avoidance.
3. Collision Avoidance for Unmanned Ships Based on Deep Reinforcement Learning
In the field of machine learning, reinforcement learning (RL) is a core algorithmic paradigm focused on how agents learn through interaction with the environment to maximize cumulative rewards. Policy gradient methods are representative algorithms in this domain, which evaluate the performance of potential strategies to optimize decision-making processes. A significant advantage of these methods is their ability to directly parameterize strategies, particularly suitable for exploring high-dimensional action spaces [27,28]. However, significant changes in policy parameters can lead to performance instability in practice.
To address the challenges of policy gradient methods, researchers have developed trust region policy optimization (TRPO). The TRPO algorithm, inspired by natural policy gradients, centers on the idea of limiting the extent of policy updates to ensure gradual improvement in stability and performance. TRPO employs Kullback-Leibler (KL) divergence to control the magnitude of policy updates, thereby preventing excessively large update steps [29,30].
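In its standard form [29], the constrained update can be written as

$$\max_{\theta}\ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\Vert\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$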
where $\pi_\theta(a_t \mid s_t)$ is the probability of action $a_t$ under state $s_t$ according to the new policy, and $\hat{A}_t$ is the advantage function, used to assess the superiority of action $a_t$ compared to the average policy.
TRPO maintains stability during policy iteration by constructing a trust region, which prevents performance degradation due to overly large update steps. However, the application of TRPO often involves complex second-order optimization computations, limiting its wider application and scalability.
To address the limitations of TRPO, Schulman et al. [31] introduced proximal policy optimization (PPO). The PPO algorithm regulates the policy update process as follows:
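In the standard form of [31], the clipped surrogate objective regulating the update is

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$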
where $r_t(\theta)$ represents the ratio of the new policy to the old policy (the importance sampling ratio). The clipping operation constrains changes in this ratio, preventing excessively large updates and, thus, enhancing the stability of the algorithm. This approach not only simplifies the implementation process but also shows significant advantages in terms of sample complexity, scalability, and performance.
PPO effectively mitigates severe fluctuations in performance that could result from policy updates by limiting the likelihood ratio between new and old policies. The implementation of PPO consists of two fundamental steps: first, executing the policy and collecting data, and second, using a gradient ascent algorithm to optimize the agent’s objective function.
In the PPO algorithm, the loss function consists of three parts: the policy objective function $L^{\mathrm{CLIP}}$, the value function term $L^{\mathrm{VF}}$, and the entropy term $S$ [32]. These components are combined into the total loss function of PPO, where the coefficients $c_1$ and $c_2$ are used to adjust the weights of the value function and entropy terms. The policy objective function aims to optimize decision-making effectiveness, the value function term improves the accuracy of state estimation, and the entropy term enhances policy exploration to prevent premature convergence to local optima.
Specifically, the loss function in PPO can be expressed as follows:
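In the standard notation of [31] (written as an objective to be maximized), this combined loss is

$$L_t(\theta) = \hat{\mathbb{E}}_t\!\left[\,L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t)\,\right]$$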
where $L_t^{\mathrm{CLIP}}(\theta)$ is the policy objective function, $L_t^{\mathrm{VF}}(\theta)$ is the value function loss, and $S[\pi_\theta](s_t)$ represents the policy entropy, i.e., the entropy of the probability distribution over actions $a_t$ in a given state $s_t$ according to policy $\pi_\theta$. Adjusting $c_1$ and $c_2$ controls the influence of the value function and entropy on the total loss, balancing optimization and exploration in the policy.
While simplifying implementation, PPO demonstrates significant advantages in terms of sample complexity, scalability, and performance. By limiting the extent of policy updates, PPO effectively avoids potential instabilities during training, and by replacing TRPO’s complex second-order optimization it enhances training efficiency. However, PPO also has limitations, such as relatively low sample utilization and sensitivity to hyperparameters; although this sensitivity is less critical than in TRPO, multiple hyperparameters still require careful tuning to ensure optimal performance.
Chaudhari et al. [33] introduced entropy as a quantitative measure of randomness into the objective function, increasing the uncertainty of the policy and thereby encouraging exploratory behavior. An increase in entropy implies a more exploratory policy, while a decrease leads to a more deterministic one. In PPO, the resulting objective function can be expressed as follows:
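A standard way to write this entropy-regularized objective (the notation is assumed) is

$$L^{\mathrm{ENT}}(\theta) = \hat{\mathbb{E}}_t\!\left[\,L_t^{\mathrm{CLIP}}(\theta) + \beta\, S[\pi_\theta](s_t)\,\right]$$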
where the entropy regularization term $\beta\,S[\pi_\theta](s_t)$ is added to the objective function, and $\beta$ is a tuning hyperparameter that balances the PPO objective against the entropy regularization. Its range is from 0 to 1, with higher values encouraging more exploratory behavior and increasing policy randomness, thereby avoiding premature policy convergence. Introducing policy entropy as a regularization term in the loss function prevents the policy from focusing too early on a few actions and from converging prematurely to local optima, motivating the algorithm to explore a wider range of state–action pairs during training. This mechanism is particularly useful in scenarios with large action spaces or complex state transitions, aiding the discovery of better long-term strategies.
3.1. The Proposed DAE-PPO Algorithm
This algorithm optimizes the entropy regularization framework within the PPO algorithm without introducing additional hyperparameters, thus promoting more efficient exploration and improved policy performance. To further enhance the exploration capability of the policy, we propose a dynamic entropy adjustment method using a quadratically decreasing entropy approach to unify reward maximization with exploration uncertainty. This refinement allows for more precise entropy adjustments. The improved objective function incorporating the quadratically decreasing entropy term is represented as follows:
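Replacing the fixed coefficient $\beta$ with a time-varying one gives the improved objective, reconstructed here in a form consistent with the description (the symbol names are assumed):

$$L^{\mathrm{DAE}}(\theta) = \hat{\mathbb{E}}_t\!\left[\,L_t^{\mathrm{CLIP}}(\theta) + \beta(t)\, S[\pi_\theta](s_t)\,\right]$$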
Here, $\beta(t)$ represents the entropy coefficient as a function of the training timestep $t$, defined as follows:
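A quadratic schedule matching the description below, decreasing from $\beta_{\max}$ to $\beta_{\min}$ over the training horizon (the exact parameterization used by the authors may differ), is

$$\beta(t) = \beta_{\min} + \left(\beta_{\max} - \beta_{\min}\right)\left(1 - \frac{t}{T}\right)^{2}$$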
where $\beta_{\max}$ represents the maximum exploration coefficient set during training, $\beta_{\min}$ the minimum exploration coefficient, $t$ the current timestep, and $T$ the total number of timesteps. The entropy coefficient decreases quadratically from $\beta_{\max}$ to $\beta_{\min}$ as $t$ progresses toward $T$, providing more exploration space in the early stages of training and preventing premature convergence. As training progresses, the entropy coefficient gradually diminishes, facilitating a smooth transition from exploration to exploitation and helping the policy converge quickly toward the global optimum in the later stages of training.
The core advantage of the quadratically decreasing entropy method lies in its high entropy coefficient settings at the beginning of training, which provides ample exploration space for the model and helps prevent the strategy from converging prematurely to local optima. Compared to simple linearly decreasing methods, this approach maintains higher entropy values longer during the early training phase, thereby enhancing the breadth of policy exploration and increasing the likelihood of discovering global optima, especially in complex and variable training environments. As training progresses, the gradual decay of the entropy coefficient aids in a smooth transition to the exploitation phase, reducing training instabilities caused by sudden changes in exploration intensity. In the later stages of training, the reduced entropy coefficient encourages the strategy to utilize the knowledge acquired, thereby facilitating rapid convergence of the model and enhancing policy performance. Furthermore, the dynamic entropy adjustment method employed in this study not only improves the exploration efficiency of the PPO algorithm but also enhances the adaptability and robustness of the strategy in varied environments. With a carefully designed entropy adjustment strategy, we can more effectively guide agents in making decisions within complex reinforcement learning tasks, providing a novel optimization approach for solving practical problems.
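As a concrete illustration of how such a schedule can be wired into training, the short sketch below implements the quadratic form assumed above in Python; the default values follow the entropy coefficient start/end reported in Section 4.2, and the commented loss line is a placeholder rather than the authors’ code.

```python
def entropy_coefficient(t: int, T: int, beta_max: float = 0.09, beta_min: float = 0.01) -> float:
    """Quadratically decreasing entropy coefficient: beta_max at t = 0, beta_min at t = T."""
    frac = min(max(t / T, 0.0), 1.0)  # training progress clamped to [0, 1]
    return beta_min + (beta_max - beta_min) * (1.0 - frac) ** 2

# Placeholder for how the coefficient would scale the entropy bonus in the PPO loss:
# total_loss = -clip_objective + c1 * value_loss - entropy_coefficient(step, total_steps) * entropy
for step in (0, 5_000, 10_000, 20_000):
    print(step, round(entropy_coefficient(step, 20_000), 4))
```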
3.2. Network Architecture and Initial Settings
In this study, we employ an actor-critic architecture to implement the PPO algorithm, in which the actor network generates a probability distribution over actions and samples from it to determine the actual actions, while the critic network estimates state values. The dimension of the input layer of this architecture is determined by the environment’s observation vector, while the dimension of the output layer is dictated by the action vector. Specifically, the actor network outputs the means and standard deviations of the actions, making its output layer twice the number of actions; the critic network outputs a single state value, with an output layer dimension of 1. For the hidden layers, both the actor and critic networks employ a multi-layer perceptron (MLP) structure with two hidden layers of 128 units each.
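A minimal sketch of this architecture is given below, assuming TensorFlow/Keras (which the experiments in Section 4 use); the observation/action dimensions, activations, and decay steps are illustrative placeholders rather than the authors’ exact settings, and the optimizers anticipate the Adam and exponential learning-rate decay setup described next.

```python
import tensorflow as tf

OBS_DIM, ACT_DIM = 20, 1  # placeholder dimensions; the real sizes follow Section 3.3

def build_actor(obs_dim=OBS_DIM, act_dim=ACT_DIM):
    """MLP with two hidden layers of 128 units; outputs mean and log-std for each action."""
    inputs = tf.keras.Input(shape=(obs_dim,))
    x = tf.keras.layers.Dense(128, activation="tanh")(inputs)
    x = tf.keras.layers.Dense(128, activation="tanh")(x)
    outputs = tf.keras.layers.Dense(2 * act_dim)(x)  # [mean, log_std] per action
    return tf.keras.Model(inputs, outputs)

def build_critic(obs_dim=OBS_DIM):
    """MLP with two hidden layers of 128 units; outputs a single state value."""
    inputs = tf.keras.Input(shape=(obs_dim,))
    x = tf.keras.layers.Dense(128, activation="tanh")(inputs)
    x = tf.keras.layers.Dense(128, activation="tanh")(x)
    outputs = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model(inputs, outputs)

# Adam optimizers with exponentially decaying learning rates (rates as in Section 4.2)
lr_actor = tf.keras.optimizers.schedules.ExponentialDecay(1e-3, decay_steps=1000, decay_rate=0.99)
lr_critic = tf.keras.optimizers.schedules.ExponentialDecay(1e-2, decay_steps=1000, decay_rate=0.99)
actor_opt = tf.keras.optimizers.Adam(learning_rate=lr_actor)
critic_opt = tf.keras.optimizers.Adam(learning_rate=lr_critic)
```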
The initial steps of the training process include creating a Replay Buffer to store interaction experiences, followed by initializing the actor and critic networks, and deciding whether to use initial network weights based on the configuration file. Subsequently, both networks are configured with the Adam optimizer, and an exponential learning rate scheduler is used to adjust the learning rate.
During the training loop, experience data are received from the trainer and stored in the replay buffer. The network parameters are then updated using the PPO strategy, with the updated strategy fed back to the trainer, ensuring synchronous updates of the critic network.
The design of this network architecture aims to leverage the dual advantages of the actor-critic method: parallel optimization of policy and value functions to accelerate the learning process. The actor network’s output of action probability distributions facilitates natural policy exploration, while the introduction of entropy loss and regularization loss further incentivizes exploratory behavior, helping to prevent the strategy from prematurely converging to local optima. The critic network provides a value estimate that offers a stable reference objective for policy optimization. Furthermore, updates to network parameters are performed through a dedicated optimizer, incorporating core features of the PPO algorithm, such as clipping of probability ratios and truncation of value functions. These mechanisms work together to safeguard the update process, preventing instability due to excessively large steps.
Figure 4 illustrates the update process of the PPO algorithm. The PPO algorithm effectively captures complex feature relationships through a multi-layer perceptron structure and uses a replay buffer to store experiences, ensuring sample diversity and stability. Through clipping mechanisms and synchronous update strategies, the PPO algorithm achieves stability in policy and consistency in value estimates during training. This design not only enhances the model’s flexibility and adaptability but also ensures the robustness and reliability of the learning process.
Figure 4.
Algorithm architecture.
3.3. State and Action Space Design
In the reinforcement learning framework, the environment provides the agent with observations of its current state, based on which the agent selects actions. Subsequently, the environment responds to the agent’s actions, providing feedback that includes new state information and corresponding rewards. In the context of autonomous collision avoidance tasks, unmanned ships act as agents, while obstacles, the marine environment, and other vessels constitute their environment. To ensure effective deployment and application in real-world settings, the design of the state space must closely reflect data available from actual sensors. This study has designed a multidimensional state space, as follows:
which includes the following four parts:
- (1)
- The agent’s own navigational status: the unmanned ship’s heading, rate of heading change, rudder angle, rudder angular velocity, speed, and position coordinates, which collectively describe the ship’s immediate navigational state.
- (2)
- Goal-related state: the relative bearing and coordinates of the goal in relation to the unmanned ship, providing target-oriented navigation information.
- (3)
- Target ship state information: for each TS in the environment, the state space includes its true bearing, speed, heading, position coordinates, and CR relative to the OS; this information is crucial for assessing and avoiding collision risks.
- (4)
- Reference path state: the distance between the unmanned ship and the nearest point on the reference path, providing a reference for path planning and navigation.
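The four parts above are concatenated into a single observation vector; the sketch below is purely illustrative, and the field names are hypothetical placeholders rather than the authors’ variable names.

```python
import numpy as np

def build_observation(own, goal, targets, d_path):
    """Assemble the state vector from the four parts described above (illustrative field names)."""
    obs = [own["heading"], own["heading_rate"], own["rudder"], own["rudder_rate"],
           own["speed"], own["x"], own["y"],                  # 1. own navigational status
           goal["bearing"], goal["x"], goal["y"]]              # 2. goal-related state
    for ts in targets:                                         # 3. per-target-ship state
        obs += [ts["bearing"], ts["speed"], ts["heading"], ts["x"], ts["y"], ts["cr"]]
    obs.append(d_path)                                         # 4. reference-path distance
    return np.asarray(obs, dtype=np.float32)
```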
To ensure safe navigation at sea, mariners continuously monitor collision risks and make decisions based on extensive navigation experience to make timely adjustments to the vessel’s course. Through comprehensive training in a simulation environment, the unmanned ship can learn and master these collision avoidance decision-making skills. In this study, the rudder angle serves as the action chosen by the agent through the strategy, capable of changing the unmanned ship’s direction and path.
The rudder angle is designed in a continuous space from −35° to 35°, with its rate of change limited to the range from −5°/s to 5°/s. This design closely mimics the physical characteristics of actual ship maneuvering, enhancing the flexibility and adaptability of the unmanned ship’s navigation.
3.4. Reward Function Design
The reward function plays a crucial role in reinforcement learning, serving as a metric to evaluate the agent’s behavior and guiding the agent to act in a way that maximizes its cumulative rewards. As the training process iterates, the agent learns how to act within the explored environment to maximize its expected future benefits, eventually converging toward a stable behavior strategy. To ensure that the trained strategies effectively accomplish the autonomous collision avoidance tasks, this study divides the reward function into four components: destination reward, navigation reward, collision avoidance reward, and rule compliance reward. The reward values are calculated and accumulated at each frame. Here is the specific design of each part of the reward function:
- (1)
- Destination reward: This reward is designed to guide the agent toward the destination.
Figure 5.
Navigation reward.
- (2)
- Navigation reward: This reward encourages the agent to navigate efficiently along the reference path.
Figure 6.
Simulation experiment scenes in Unreal Engine 5.
- (3)
- Collision avoidance reward: This reward is based on the current collision risk value and a critical threshold, aimed at preventing collisions with obstacle vessels.
- (4)
- Rule compliance reward: According to the COLREGs, this reward guides autonomous ships in making collision avoidance decisions that conform to rules.
COLREGs constrain the behavior of autonomous ships. First, the encounter scenario is assessed against the criteria in Figure 2, and it is then determined whether the OS’s decisions comply with COLREG Rules 13–17 so that conflicts are appropriately avoided and corresponding rewards are given. If the OS alters course in a head-on or starboard-crossing situation while a collision risk is present (CR > 0), the action is considered COLREGs-compliant and receives a reward value of 1; in all other cases, the reward is 0.
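A minimal sketch of this check is given below; the bearing sectors follow Figure 2, while the encounter bookkeeping, the CR > 0 condition, and the reward values are simplified assumptions rather than the authors’ implementation.

```python
def encounter_type(rel_bearing_deg: float) -> str:
    """Classify the TS by its bearing relative to the OS, per the sectors in Figure 2."""
    b = rel_bearing_deg % 360.0
    if b <= 5.0 or b >= 355.0:
        return "head-on"
    if 5.0 < b <= 112.5:
        return "crossing-give-way"   # TS on the starboard side
    if 112.5 < b <= 247.5:
        return "overtaking"          # TS approaching from astern
    return "crossing-stand-on"       # TS on the port side (247.5-355 deg)

def rule_compliance_reward(rel_bearing_deg: float, cr: float, altered_course: bool) -> float:
    """Reward 1 when the OS alters course in a head-on or give-way situation while risk exists."""
    if cr > 0.0 and altered_course and encounter_type(rel_bearing_deg) in ("head-on", "crossing-give-way"):
        return 1.0
    return 0.0
```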
Through meticulous design of the state space, action space, and reward function, this study establishes a comprehensive reinforcement learning training system. At each timestep, the agent’s state information is fed into the policy network, which outputs a probability distribution over actions based on the current parameters and state; the agent samples its action from this distribution, the environment state is updated accordingly, and the critic’s value estimate guides the subsequent policy update. This design ensures that the agent can learn effective collision avoidance strategies in complex maritime environments.
4. Experimental Results and Analysis
4.1. Training Environment
The simulation experiments in this study were conducted on the Unreal Engine 5.3.2 virtual simulation platform, which is renowned for its advanced graphics and physics engine. The hardware configuration comprised a high-performance NVIDIA RTX 3080 GPU (ASUS, Taipei, Taiwan) and an Intel Core i7 4.50 GHz CPU (Intel Corporation, Santa Clara, CA, USA), providing robust computational support for the experiments. On the software side, we used the Python 3.7 programming language, paired with the TensorFlow deep learning framework, to construct and train the experimental models. As shown in Figure 7, calm-sea simulation scenes were built in Unreal Engine 5 (UE5), ensuring the visualization and interactivity of the experiments.
Figure 7.
Simulation experiment scenes in Unreal Engine 5.
As depicted in Figure 8a, for convenience, both the OS and the TSs are assumed to be the same ship type (KVLCC2). The main specifications and parameters of the validated sample ship are summarized in Table 1. Each training session randomly generates two obstacle ships in specified states. The obstacle ship near the destination has 10 possible initial positions, uniformly distributed along a circular arc with an angular interval β of 1° between them; this setup simulates the continuity of situations that might be encountered during ship navigation. The obstacle ship near the own ship has 360 possible initial positions, evenly spaced by the interval β along a full circle, which is used to simulate various encounter scenarios. The light-colored circular area around each ship represents the ship domain, with a radius r closely related to the calculation of the CR value. The line from the own ship’s initial position to the destination is defined as the navigation line, which serves as an input reference for the reward function. Figure 8b shows the termination signal for the state, i.e., the end conditions for experimental training. The termination signal triggers in three scenarios: the own ship collides with an obstacle ship, deviates beyond the maximum allowed distance from the planned route, or reaches the destination, at which point the system marks the training session as “Done”.
Figure 8.
Training environment. (a) Training scene generation; (b) training termination conditions.
4.2. Algorithm Hyperparameters
Based on comprehensive consideration of the algorithm architecture and the simulation environment, the selected training hyperparameters are presented in Table 2. The critic learning rate is used to update the value function (managed by the critic network); an appropriate value is crucial for the stability of the training process. The policy learning rate is that of the policy network, which adjusts the agent’s behavior in response to environmental feedback, and its magnitude directly influences the training dynamics. The learning rates for the actor and critic networks are 0.001 and 0.01, respectively. The learning rate decay is set to 0.99, causing the learning rate to decay exponentially as training progresses, which aids model convergence and reduces the risk of large weight updates in the later stages of training. In the entropy-inclusive PPO algorithm, the entropy coefficient is set to 0.05; this coefficient quantifies the randomness of the strategy and encourages exploratory behavior, and its value was determined through a series of tuning experiments. In the DAE-PPO algorithm, the entropy coefficient start and entropy coefficient end are set to 0.09 and 0.01, respectively, to enable dynamic entropy adjustment during training. Epsilon clip (ε-clip) is set to 0.2 to limit the magnitude of policy updates, preventing training instability due to overly large updates. Lambda (λ) is set to 0.95 to adjust the variance and bias in the generalized advantage estimation (GAE), ensuring the stability of value estimates.
Table 2.
Hyperparameters for training.
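For reference, the values discussed above can be gathered into a single configuration; the key names below are illustrative, with the values as reported in Table 2.

```python
DAE_PPO_CONFIG = {
    "lr_critic": 0.01,           # value-function (critic) learning rate
    "lr_policy": 0.001,          # policy (actor) learning rate
    "lr_decay": 0.99,            # exponential learning-rate decay factor
    "entropy_coef_start": 0.09,  # beta_max for the quadratic entropy schedule
    "entropy_coef_end": 0.01,    # beta_min for the quadratic entropy schedule
    "epsilon_clip": 0.2,         # PPO probability-ratio clipping range
    "gae_lambda": 0.95,          # lambda for generalized advantage estimation
}
```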
4.3. Training Process
Figure 9 shows the trends in average reward and the corresponding confidence intervals for the three algorithms over 20,000 training episodes. The red curve represents the PPO algorithm with dynamically adjusted entropy (DAE-PPO), the green curve the entropy-inclusive PPO algorithm, and the blue curve the classic PPO algorithm. The shaded areas indicate the confidence intervals for each algorithm. In the early stages of training, DAE-PPO has a wider confidence interval, indicating that the algorithm initially adopts a broader exploration strategy, which helps it quickly identify behavior patterns that yield higher rewards. As training progresses, after about 5000 episodes, the average reward for DAE-PPO begins to increase significantly and its confidence interval starts to narrow, indicating that its performance is stabilizing. By 8000 training episodes, the average reward for DAE-PPO stabilizes and ultimately reaches a high level. By employing a quadratically decaying entropy schedule, the DAE-PPO algorithm demonstrates clear advantages in the autonomous navigation and collision avoidance task: the entropy decay strategy optimizes the balance between exploration and exploitation, improving not only the average reward but also the stability of the algorithm while reducing convergence time. In contrast, neither the traditional PPO algorithm nor the entropy-inclusive PPO algorithm surpasses DAE-PPO in reward value or stability, confirming the superiority of DAE-PPO in such tasks.
Figure 9.
Average reward.
Figure 10 provides a visual display of the agent’s performance across various training stages, offering a detailed tracking and evaluation of the entire learning process.
Figure 10.
Training. (a) First episode; (b) 1432nd episode; (c) 3646th episode; (d) 5600th episode; (e) 7763rd episode; (f) 9981st episode.
Figure 10a shows the agent’s training in the first episode. Due to the random initialization of neural network parameters and the agent’s unfamiliarity with the environment, its movement is based on random exploration, leading to circling behavior at the starting position.
Figure 10b displays the agent’s training in episode 1432. The agent learns from feedback that circling at the initial position yields no reward, so it begins to explore the scenario. It then collides with an obstacle ship, and the training episode ends. Although this episode failed, the interaction experience is crucial for subsequent exploration.
Figure 10c illustrates the training effect in episode 3646, where the agent successfully avoids TS2 and begins to try more behaviors to explore the scenario.
Figure 10d shows the training outcome in episode 5600, where the agent’s behavior aligns more closely with the intended guided rewards. The failed experience of colliding with obstacle ship TS1 is essential for achieving better training results.
Figure 10e presents the training result in episode 7763, where the agent meets basic collision avoidance requirements and reaches the destination for the first time, validating the design of the guided rewards.
Figure 10f depicts the training result in episode 9981, where the agent’s behavior has become largely stable and efficient, gradually meeting the expectations.
4.4. Results Analysis
From the reward curve graph in Figure 9, it is evident that after completing 20,000 training cycles, all three algorithms have reached a state of convergence. In particular, the DAE-PPO demonstrates significant advantages in terms of convergence speed, stability after convergence, and average rewards post-convergence, surpassing the other two algorithms. To further visually demonstrate the performance advantages of the improved algorithm after model convergence, this paper specifically selects a typical encounter scenario to conduct comparative tests among the three algorithms.
As shown in Figure 11a,b, the initial positions of OS, TS1, and TS2 are (6117.98, 1863.74), (4832.98, 4510.49), and (687.41, 7271.13), respectively, with the obstacle ships maintaining constant speed and course. Figure 11a,b, respectively, show the trajectory and the relative distance curves of the OS and TS for the classic PPO training model. The classic PPO model exhibits significant fluctuations in ship heading, showing a lack of stability, and the closest distances to TS1 and TS2 are both less than 600 m, indicating a high risk of collision. Moreover, the decision-making during encounters does not comply with the COLREGs.

Figure 11.
Encounter situation. (a) Path planned by classic PPO; (b) distance curve between OS and TSs for classic PPO; (c) path planned by entropy-PPO; (d) distance curve between OS and TSs for entropy-PPO; (e) path planned by DAE-PPO; (f) distance curve between OS and TSs for DAE-PPO.
Figure 11c,d present the trajectory and the relative distance curves of the OS and TSs for the entropy-PPO training model. Compared to the classic PPO model, the entropy-PPO model’s course is more stable and maintains a larger safety distance from the TSs, yet the closest approach remains fairly close at 604.64 m. Moreover, its actions during the encounters with TS1 and TS2, such as too small a turn to port or the lack of a clear maneuver, do not comply with COLREGs.
Figure 11e,f showcase the trajectory and relative distance curves of OS and TS for the DAE-PPO training model. The trajectory indicates that DAE-PPO exhibits greater stability in collision avoidance decisions compared to the previous two models. The encounter decisions comply with COLREGs and are clearly and succinctly executed. As seen in Table 3, the DAE-PPO algorithm performs excellently in collision avoidance, with the widest separation and the lowest collision risk among the algorithms, and it achieves this in the shortest time of 135 s, indicating its strong capability in complex encounter scenarios.
Table 3.
Results of collision avoidance for the example scenario.
To verify the generalization ability of the algorithms, 100 randomized collision avoidance tests were conducted in the same environment, using the same random seed for each algorithm, and the number of successful avoidances was recorded. As shown in Table 4, the DAE-PPO algorithm significantly outperformed the other algorithms in success rate across these experiments, further validating its stability and reliability.
Table 4.
Results of 100 repeated experiments.
Figure 12 displays a challenging encounter scenario involving three obstacle ships, designed to assess the performance of collision avoidance models in handling complex situations.
Figure 12.
Collision avoidance environment. (a) Path planned; (b) distance curve between OS and TSs.
In this scenario, the initial position of the OS is (6121.76, 1857.45), with TS1, TS2, and TS3 starting at (4654.17, 4721.57), (−928.20, 8911.22), and (1458.61, 3880.10), respectively. During the simulation, the courses and speeds of the obstacle ships remain constant. Figure 12a shows the completed path of the OS. When the OS and TS1 formed a crossing-give-way situation with a potential collision risk, the agent made a significant turn to starboard. The minimum distance between the two ships was 766.25 m, achieving safe avoidance, and the decision strictly followed the COLREGs. After the avoidance maneuver, the OS corrected its rudder angle and, once its course had stabilized, entered a crossing-stand-on encounter with TS3, during which the agent decided to turn to port; the minimum distance between the two ships in this process was 1098.60 m. After avoiding TS3, the OS continued to navigate smoothly toward the destination. Subsequently, the OS and TS2 met in a head-on situation. The OS quickly made a significant turn to starboard, and during this process, the minimum distance between the two ships was again 1098.60 m, with the avoidance behavior also complying with COLREGs. After completing the avoidance, the OS began to resume its course and smoothly reached its destination.
Through this simulation, we validated the effectiveness and compliance of the collision avoidance model in handling complex maritime traffic scenarios, ensuring safe and compliant navigation in varied maritime environments.
5. Conclusions
This study addresses the issue of ship collision avoidance in maritime transport environments by proposing an improved PPO algorithm (DAE-PPO) based on dynamically adjusted entropy. The algorithm optimizes the exploration mechanism through a quadratically decreasing entropy method, effectively avoiding local optima and enhancing strategic performance, all while respecting the COLREGs. Simulation results indicate that the DAE-PPO algorithm significantly outperforms the baseline algorithms in efficiency, success rate, and stability in collision avoidance tests.
Moreover, the reward function designed in this study is subdivided into destination, navigation, collision avoidance, and rule compliance rewards, effectively guiding the agent in making effective collision avoidance decisions in complex maritime environments.
Despite the significant achievements of the DAE-PPO algorithm in simulated environments, future research needs to delve deeper into several areas. First, while simulation environments can emulate various scenarios, actual maritime environments are more complex and unpredictable; future work should therefore include testing and validating the algorithm in real maritime conditions. Second, maritime traffic often involves interactions among multiple ships; research could explore collaborative collision avoidance strategies within multi-agent systems to enhance the efficiency and safety of overall maritime traffic. Third, elliptical ship domain models that adjust dynamically with the speed of the unmanned ship could more accurately represent real-world collision avoidance scenarios. Through these subsequent studies, we aim to further enhance the performance of autonomous collision avoidance algorithms for unmanned ships, advance unmanned ship technology, and ultimately achieve safe and efficient autonomous maritime navigation.
Author Contributions
Conceptualization, G.C. and W.W.; methodology, Z.H.; software, Z.H. and W.W.; validation, G.C., Z.H. and W.W.; writing—review and editing, Z.H., G.C. and W.W.; data curation, Z.H.; visualization, G.C. and Z.H.; supervision, S.Y., G.C. and W.W.; project administration, S.Y.; funding acquisition, G.C., W.W. and S.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (no. 52371369), the Key Projects of the National Key R&D Program (no. 2021YFB390150), the Natural Science Project of Fujian Province (no. 2022J01323, 2023J01325, 2023I0019), the Science and Technology Plan Project of Fujian Province (No. 3502ZCQXT2021007), the Natural Science Foundation of Xiamen, China (grant no. 502Z202373038), and the Funds of Fujian Province for Promoting High-Quality Development of the Marine and Fisheries Industry (No. FJHYF-ZH2023-10).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Millefiori, L.M.; Braca, P.; Zissis, D.; Spiliopoulos, G.; Marano, S.; Willett, P.K.; Carniel, S. COVID-19 Impact on Global Maritime Mobility. Sci. Rep. 2021, 11, 18039. [Google Scholar] [CrossRef] [PubMed]
- International Maritime Organization. Convention on the International Regulations for Preventing Collisions at Sea, 1972 (COLREGs); International Maritime Organization: London, UK, 1972. [Google Scholar]
- Tang, P.; Zhang, R.; Liu, D.; Huang, L.; Liu, G.; Deng, T. Local Reactive Obstacle Avoidance Approach for High-Speed Unmanned Surface Vehicle. Ocean. Eng. 2015, 106, 128–140. [Google Scholar] [CrossRef]
- Hart, P.; Nilsson, N.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cyber. 1968, 4, 100–107. [Google Scholar] [CrossRef]
- Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; The MIT Press: Cambridge, MA, USA, 1992; ISBN 978-0-262-27555-2. [Google Scholar]
- Khatib, O. Real-Time Obstacle Avoidance for Manipulators and Mobile Robots. Int. J. Robot. Res. 1986, 5, 90–98. [Google Scholar] [CrossRef]
- Guo, S.; Zhang, X.; Zheng, Y.; Du, Y. An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning. Sensors 2020, 20, 426. [Google Scholar] [CrossRef]
- Cheng, Y.; Zhang, W. Concise Deep Reinforcement Learning Obstacle Avoidance for Underactuated Unmanned Marine Vessels. Neurocomputing 2018, 272, 63–73. [Google Scholar] [CrossRef]
- Huang, Z.; Lin, H.; Zhang, G. The USV Path Planning Based on an Improved DQN Algorithm. In Proceedings of the 2021 International Conference on Networking, Communications and Information Technology (NetCIT), Manchester, UK, 26–27 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 162–166. [Google Scholar]
- Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent Collision Avoidance Algorithms for USVs via Deep Reinforcement Learning under COLREGs. Ocean. Eng. 2020, 217, 107704. [Google Scholar] [CrossRef]
- Peng, X.; Han, F.; Xia, G.; Zhao, W.; Zhao, Y. Autonomous Obstacle Avoidance in Crowded Ocean Environment Based on COLREGs and POND. J. Mar. Sci. Eng. 2023, 11, 1320. [Google Scholar] [CrossRef]
- Xiao, Q.; Jiang, L.; Wang, M.; Zhang, X. An Improved Distributed Sampling PPO Algorithm Based on Beta Policy for Continuous Global Path Planning Scheme. Sensors 2023, 23, 6101. [Google Scholar] [CrossRef] [PubMed]
- Meyer, E.; Robinson, H.; Rasheed, A.; San, O. Taming an Autonomous Surface Vehicle for Path Following and Collision Avoidance Using Deep Reinforcement Learning. IEEE Access 2020, 8, 41466–41481. [Google Scholar] [CrossRef]
- Guan, W.; Cui, Z.; Zhang, X. Intelligent Smart Marine Autonomous Surface Ship Decision System Based on Improved PPO Algorithm. Sensors 2022, 22, 5732. [Google Scholar] [CrossRef] [PubMed]
- Wu, C.; Yu, W.; Li, G.; Liao, W. Deep Reinforcement Learning with Dynamic Window Approach Based Collision Avoidance Path Planning for Maritime Autonomous Surface Ships. Ocean. Eng. 2023, 284, 115208. [Google Scholar] [CrossRef]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D. Asynchronous Methods for Deep Reinforcement Learning. arXiv 2016, arXiv:1602.01783. [Google Scholar]
- Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
- Yasukawa, H.; Yoshimura, Y. Introduction of MMG Standard Method for Ship Maneuvering Predictions. J. Mar. Sci. Technol. 2015, 20, 37–52. [Google Scholar] [CrossRef]
- Sandeepkumar, R.; Rajendran, S.; Mohan, R.; Pascoal, A. A Unified Ship Manoeuvring Model with a Nonlinear Model Predictive Controller for Path Following in Regular Waves. Ocean. Eng. 2022, 243, 110165. [Google Scholar] [CrossRef]
- Sivaraj, S.; Rajendran, S.; Prasad, L.P. Data Driven Control Based on Deep Q-Network Algorithm for Heading Control and Path Following of a Ship in Calm Water and Waves. Ocean. Eng. 2022, 259, 111802. [Google Scholar] [CrossRef]
- Fujii, Y.; Tanaka, K. Traffic Capacity. J. Navig. 1971, 24, 543–552. [Google Scholar] [CrossRef]
- Coldwell, T.G. Marine Traffic Behaviour in Restricted Waters. J. Navig. 1983, 36, 430–444. [Google Scholar] [CrossRef]
- Goodwin, E.M. A Statistical Study of Ship Domains. J. Navig. 1975, 28, 328–344. [Google Scholar] [CrossRef]
- Mou, J.M.; Tak, C.V.D.; Ligteringen, H. Study on Collision Avoidance in Busy Waterways by Using AIS Data. Ocean. Eng. 2010, 37, 483–490. [Google Scholar] [CrossRef]
- Ha, J.; Roh, M.-I.; Lee, H.-W. Quantitative Calculation Method of the Collision Risk for Collision Avoidance in Ship Navigation Using the CPA and Ship Domain. J. Comput. Des. Eng. 2021, 8, 894–909. [Google Scholar] [CrossRef]
- Sakamoto, N.; Ohashi, K.; Araki, M.; Kume, K.; Kobayashi, H. Identification of KVLCC2 Manoeuvring Parameters for a Modular-Type Mathematical Model by RaNS Method with an Overset Approach. Ocean. Eng. 2019, 188, 106257. [Google Scholar] [CrossRef]
- Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
- Ilyas, A.; Engstrom, L.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; Madry, A. A Closer Look at Deep Policy Gradients. arXiv 2020, arXiv:1811.02553. [Google Scholar]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
- Kakade, S.M. A Natural Policy Gradient. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; Volume 14. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Williams, R.J.; Peng, J. Function Optimization Using Connectionist Reinforcement Learning Algorithms. Connect. Sci. 1991, 3, 241–268. [Google Scholar] [CrossRef]
- Chaudhari, P.; Choromanska, A.; Soatto, S.; LeCun, Y.; Baldassi, C.; Borgs, C.; Chayes, J.; Sagun, L.; Zecchina, R. Entropy-SGD: Biasing Gradient Descent into Wide Valleys. J. Stat. Mech. 2019, 2019, 124018. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).