Intelligent Smart Marine Autonomous Surface Ship Decision System Based on Improved PPO Algorithm

With the development of artificial intelligence technology, behavior decision-making for an intelligent smart marine autonomous surface ship (SMASS) has become particularly important. This research proposed local path planning and a behavior decision-making approach based on improved Proximal Policy Optimization (PPO), which could drive an unmanned SMASS to its target without requiring any human experience. In addition, generalized advantage estimation was added to the loss function of the PPO algorithm, which allowed baselines in the PPO algorithm to be self-adjusted. First, the SMASS was modeled with the Nomoto model in a simulation waterway. Then, distances, obstacles, and prohibited areas were regularized as rewards or punishments, which were used to judge the performance of the vessel's maneuvering decisions. Subsequently, improved PPO was introduced to learn the action-reward model, and the neural network model after training was used to manipulate the SMASS's movement. To achieve higher reward values, the SMASS could find an appropriate path or navigation strategy by itself. After a sufficient number of training rounds, a convincing path and manipulation strategies would likely be produced. Compared with existing methods, the proposed approach is more effective in self-learning and continuous optimization and is thus closer to human manipulation.


Introduction and Background
Since the 1970s, the combination of robot technologies and vehicles has led to the emergence of drones, unmanned vehicles, and unmanned ships [1]. Among them, a ship sailing on the sea is seriously affected by wind and surges. The decision-making and path planning of intelligent ships have long been considered significant academic problems. Ships are generally under-actuated due to their large tonnage, slow speed, and relatively weak power. The autonomous navigation of ships has to cope with huge inertia and complex navigation rules; therefore, the requirements for smart ships are much higher than those for unmanned vehicles. A ship operator faces many challenges, including those associated with the dynamic environment, insufficient power, and uncertainties in perception. According to a report of the International Maritime Organization, more than 80 percent of maritime accidents are caused by misunderstandings of the situation and by human error in decision-making resulting from failure to comply with the International Regulations for Preventing Collisions at Sea (COLREGs). Therefore, artificial intelligence for ship navigation is considered very difficult, and its core functions are path planning and intelligent decision-making.
Approaches to intelligent ship decision-making for path planning and obstacle avoidance can be divided into two types. One is the traditional model-based obstacle avoidance algorithm. For many years, the A* algorithm was the dominant approach in relevant research. A Swiss boat named Avalon was capable of generating a persuasive path to a given destination and avoiding both static and dynamic obstacles based on the A* algorithm [2]. The A* algorithm compares several heuristic function values of the current path grid to select the next grid node to expand.

• An intelligent SMASS decision-making system based on the Proximal Policy Optimization (PPO) algorithm was proposed in this paper, which could make the critic network and actor network converge faster.
• Through the Gazebo simulation environment, sensors such as laser radar and navigation radar were used to obtain external environmental information. The intelligent SMASS could make complex path planning decisions in different environments. After training, if unknown obstacles were placed on the map, the intelligent ship could still successfully avoid them.
• The Nomoto model was brought into the training of this experiment. The trained model could meet the needs of practical engineering.
The rest of this paper is organized as follows. Section 2 introduces the composition of the intelligent SMASS system, ship mathematical model, and COLREGs. Section 3 introduces a deep reinforcement learning algorithm and improved Proximal Policy Optimization (PPO) algorithm. Section 4 mainly introduces the reward function setting and network setting. Section 5 mainly introduces the design of the Gazebo simulation and the analysis of experimental results. Section 6 is the summary of this paper and the future research planning.

Intelligent Ship Decision System and Ship Mathematical Model
In building a complete intelligent smart marine autonomous surface ship (SMASS) decision-making system, it was necessary to clarify the components of the system, the functions of each part, and the relationships between the parts [17]. Three parts are included in the intelligent SMASS decision-making system, namely, the sensing part, the decision-making part, and the control part, as shown in Figure 1.


Intelligent Ship Decision System
The sensing part is divided into the SMASS's own state information and navigation environment information. The sensing part is mainly composed of navigation radar, laser radar, GPS, a shaft power sensor, a bathymeter, a speed sensor, and AIS. The SMASS's own state information includes the SMASS's course, speed, position, remaining oil, propeller speed, and hull structure strength [18]. Navigation environment information includes other ships' headings and speeds, TCPA, DCPA, hydrological information, velocity, channel depth, meteorological information (temperature, humidity, wind direction, wind speed), electronic chart information, navigation mark distribution, etc. In this paper, laser radar and positioning systems were used for the environmental perception of the intelligent SMASS. The decision part includes path planning before sailing and obstacle avoidance during autonomous navigation. In this paper, improved PPO algorithms were used for SMASS path planning and obstacle avoidance. The algorithm has the following advantages:

• With autonomous learning ability, the convergence rate was faster than that of the common calculation method.

• The trained intelligent SMASS navigation system could obtain strong generalization, which would solve different scene problems. For example, in local path planning, it could successfully avoid unknown obstacles that did not appear on the electronic chart.
• The path planning problem and the SMASS decision problem could be solved simultaneously. The SMASS could find the optimal path to the target point through known obstacle information. In an unknown environment, the SMASS could detect the position of obstacles by laser radar and accurately avoid them.

Ship Mathematical Model
The mathematical model of ship motion is significant for ship motion simulation. The ship motion model can be divided into the linear model and the nonlinear model. The linear model is mainly used to optimize or train the control simulator, neural network decision-making, and controller design [19]. To describe the motion of a ship, a ship motion coordinate system was established, as shown in Figure 2.


In this figure, G represents the position of the center of gravity of the ship, XOY indicates the hydrostatic water plane, O is the origin, X_OG and Y_OG represent the projections of the center of ship gravity on the X and Y axes, respectively, ψ is the heading of the ship, and δ indicates the ship rudder angle. Considering only the ship lateral drift velocity v and yaw angular velocity r, the ship motion mathematical model could be expressed as:

$$\dot{v} = a_{11}v + a_{12}r + b_{11}\delta, \qquad \dot{r} = a_{21}v + a_{22}r + b_{21}\delta \tag{1}$$

where $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $b_{11}$, and $b_{21}$ are the ship maneuverability parameters [20]. Ignoring the lateral drift velocity v in Equation (1), the response equation of the ship steering rudder to yaw motion can be written as:

$$T_1T_2\ddot{r} + (T_1 + T_2)\dot{r} + r = K_0(\delta + T_3\dot{\delta}) \tag{2}$$

where $T_1$, $T_2$, $T_3$, and $K_0$ are maneuverability indexes. Their values could be estimated by $T_1T_2 = 1/(a_{11}a_{22} - a_{12}a_{21})$, $T_1 + T_2 = (a_{11} + a_{22})/(a_{12}a_{21} - a_{11}a_{22})$, $T_3 = b_{21}/(a_{21}b_{11} - a_{11}b_{21})$, and $K_0 = (a_{21}b_{11} - a_{11}b_{21})/(a_{11}a_{22} - a_{12}a_{21})$.
Then, the Laplace transform of Equation (2) could be carried out to obtain the transfer function of the ship steering control system, as shown in Equation (3):

$$G(s) = \frac{r(s)}{\delta(s)} = \frac{K_0(1 + T_3 s)}{(1 + T_1 s)(1 + T_2 s)} \tag{3}$$

For ships with large inertia, the dynamic characteristics are most important in the low-frequency range [21]. Thus, letting $s = j\omega \to 0$ and ignoring the second- and third-order small quantities, the Nomoto model can be obtained:

$$G(s) = \frac{K}{Ts + 1} \tag{4}$$

where $K = K_0$ and $T = T_1 + T_2 - T_3$. The differential equation form of the Nomoto model is written as shown in Equation (5):

$$T\dot{r} + r = K\delta \tag{5}$$

The value T represents the coefficient ratio of the inertia moment to the damping moment [22]. A large T value indicates a large inertia moment and a small damping moment during ship motion. The value K refers to the yaw angular velocity produced per unit of rudder angle. A large K means a large yaw moment and a small damping moment produced by the rudder.
Taking the ship as a rigid body, when the ship steers at any rudder angle δ, the yaw rate is r, and Equation (5) can be seen as the yaw motion equation of the ship when it steers. When the ship turns, altering her course at any rudder angle, and assuming the initial conditions t = 0, δ = δ_0, and r = 0, the yaw angle at any time can be calculated by Equation (6):

$$\psi(t) = K\delta_0\left[t - T\left(1 - e^{-t/T}\right)\right] \tag{6}$$

The ship yaw rate r is the derivative of ψ with respect to time, as shown in Equation (7):

$$r(t) = \frac{d\psi}{dt} = K\delta_0\left(1 - e^{-t/T}\right) \tag{7}$$
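The first-order response in Equations (5)-(7) can be checked numerically. The following sketch uses illustrative K and T values (not taken from the paper) and simple Euler integration, then compares the simulated yaw rate with the analytic solution of Equation (7):

```python
import math

def nomoto_step(r, delta, K, T, dt):
    """One Euler step of the first-order Nomoto model T*r_dot + r = K*delta."""
    r_dot = (K * delta - r) / T
    return r + r_dot * dt

# Illustrative maneuverability indexes (ship-specific in practice).
K, T = 0.2, 10.0           # yaw-rate gain and time constant [s]
delta0 = math.radians(10)  # fixed rudder angle, as in Equation (6)
dt, steps = 0.1, 3000

r, psi = 0.0, 0.0
for _ in range(steps):
    r = nomoto_step(r, delta0, K, T, dt)
    psi += r * dt          # heading is the integral of yaw rate (Equation (7))

t = steps * dt
r_analytic = K * delta0 * (1.0 - math.exp(-t / T))  # Equation (7)
print(round(r, 5), round(r_analytic, 5))
```

After 300 s the simulated yaw rate settles at the steady-state value Kδ_0, matching the analytic curve, which illustrates why a large K produces a large turning rate per rudder angle.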
There are two advantages of using the Nomoto model in this experiment:

• In the low-frequency range, the spectrum of the Nomoto model is very close to that of the high-order model.
• The designed controller has low order and is easy to implement.

COLREGs
To solve SMASS path planning and obstacle avoidance problems based on DRL, maritime collision avoidance rules should be considered. COLREGs are maritime traffic rules stipulated for the high seas and all navigational waters connected to the high seas to ensure the safety of ship navigation and prevent ship collisions. Therefore, intelligent ship decision-making systems should also act in accordance with COLREGs to ensure the safety of maritime navigation [23]. According to COLREGs, the relative position of the two ships is divided into four obstacle avoidance strategy regions, as in Figure 3.
The four collision avoidance rules involved in COLREGs Chapter 2, Rules 13 to 17, are as follows. The corresponding collision avoidance actions are displayed in Figure 3.

(1) Head-on
The head-on situation refers to two mobile ships on opposite or nearly opposite courses (where the course usually refers to the bow direction of the ship rather than the track direction) in sight of one another, with a risk of collision. Opposite courses mean the relative azimuth of the target ship (TS) from the own ship (OS) is in [355°, 360°] or [0°, 5°]. The two ships should each alter course to starboard so as to pass on the port side of each other. The head-on situation is displayed in Figure 3a.

(2) Overtaking
The overtaking situation means that the speed of the rear ship is greater than that of the front ship. When the own ship chases the target ship from a direction more than 22.5 degrees abaft the target ship's beam, the target ship is the stand-on ship, and the own ship should give way to the target ship. The overtaking situation is displayed in Figure 3b.

(3) Crossing give-way
When two ships meet with a risk of collision and the relative bearing of the target ship from the own ship is in [5°, 112.5°], the own ship should give way to the target ship. According to COLREGs, the own ship should alter her course to starboard to avoid a collision. The crossing give-way situation is displayed in Figure 3c.

(4) Crossing stand-on
When two ships meet and the relative bearing of the target ship from the own ship is in [247.5°, 355°], there is a risk of collision. In this case, the own ship is the stand-on ship, and the target ship should give way to the own ship. If the target ship does not take avoidance action in a timely manner, the own ship should take action to avoid the collision. The crossing stand-on situation is displayed in Figure 3d.
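The four bearing regions above can be sketched as a simple classifier. This is a hypothetical helper based only on the target ship's relative bearing, with the sector boundaries taken from the paper's Figure 3 convention (the remaining sector, (112.5°, 247.5°), is treated as the overtaking region); a full implementation would also consider relative course and collision risk (CPA/TCPA):

```python
def classify_encounter(rel_bearing_deg):
    """Map the target ship's relative bearing (degrees, clockwise from own
    bow) to one of the four COLREGs encounter regions of Figure 3."""
    b = rel_bearing_deg % 360.0
    if b >= 355.0 or b <= 5.0:
        return "head-on"            # Rule 14: alter to starboard, pass port-to-port
    if b <= 112.5:
        return "crossing give-way"  # Rule 15/16: own ship gives way to starboard
    if b < 247.5:
        return "overtaking"         # Rule 13 sector: the ship ahead stands on
    return "crossing stand-on"      # Rule 17: stand on unless TS fails to act

print(classify_encounter(2), classify_encounter(90),
      classify_encounter(180), classify_encounter(300))
```

A decision module could use the returned label to select the reward shaping or the admissible rudder actions for that encounter.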


Deep Reinforcement Learning
At present, artificial intelligence technologies have developed rapidly; especially after AlphaGo defeated the nine-dan Go player Lee Sedol, reinforcement learning has risen rapidly and provided new possibilities for intelligent ship path planning. The Q-learning algorithm could obtain the best behavior decision-making through the optimal action-value function [24]. However, the marine environment is too complex, and a ship sailing on the sea faces many uncertainties. The Q-table is inadequate for solving such complex problems.
The development of deep reinforcement learning is greatly accelerated by neural networks [25]. With the change of the agent's external environment, through the backpropagation of neural networks, the weights of neural networks could be updated to simulate complex functions. Deep reinforcement learning algorithms are divided into two categories: value learning and strategy learning.
Reinforcement learning based on the value function is represented by Deep Q-Learning (DQN), in which the problems of data correlation and non-stationary distributions are solved by the experience replay method. The current Q value is generated by the evaluation network, and the target Q value is generated by the target network [26]. Experience replay stores the transition samples (s_t, a_t, r_t, s_{t+1}) from each time step of the agent's interaction with the environment in a replay memory unit; small batches of data are then sampled from the memory for training. However, the DQN algorithm does not estimate the action value Q accurately, so there are errors. Suppose DQN's estimate of the true action value is unbiased; the error is then zero-mean noise. The TD target is computed from q = max_a Q(s_{t+1}, a; ω). Taking the maximum over noisy estimates makes q ≥ max_a E[Q(s_{t+1}, a; ω)], so the Q value at the next moment is overestimated.
Although the noise does not change the mean, it makes the maximum of the noisy Q estimates greater than the maximum of the true values, and the expectation of that maximum is likewise greater than the true maximum. Updating the DQN estimate at time t with the TD target also means the network updates itself with its own estimates. Uniform overestimation would not cause a problem with action selection, because if every action were overestimated by the same amount, the agent would still choose the highest-scoring action. However, non-uniform overestimation does cause problems in action selection. Double Deep Q-Learning (Double DQN) was proposed by Google DeepMind to solve the overestimation problem of DQN [27]. Although the estimate made by Double DQN is relatively smaller, the overestimation of the maximum value cannot be solved fundamentally. This is why reinforcement learning based on value learning was abandoned in this paper.
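The upward bias described above can be verified with a small simulation: all true action values are equal, yet the maximum of their zero-mean-noise estimates is positive on average (the action count, noise scale, and trial count below are illustrative):

```python
import random

random.seed(0)
true_q = [1.0, 1.0, 1.0, 1.0]   # four actions, all equally good
trials, overshoot = 10000, 0.0
for _ in range(trials):
    # Unbiased estimates: true value plus zero-mean Gaussian noise.
    noisy = [q + random.gauss(0.0, 1.0) for q in true_q]
    overshoot += max(noisy) - max(true_q)

# The mean overshoot is clearly positive: max() of unbiased estimates
# is a biased (too large) estimate of the true maximum.
print(overshoot / trials > 0)
```

This is exactly the overestimation that Double DQN reduces by decoupling action selection from action evaluation.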
The Actor-Critic (AC) algorithm is representative of policy learning. Two neural networks exist in the AC algorithm: one interacts with the environment to select actions, and the other evaluates the quality of those actions; the network parameters are updated by gradient descent. The AC algorithm works well but is difficult to converge. Unlike a stochastic policy, which assigns different probabilities to the actions available in the same state when solving continuous action problems, a deterministic policy effectively takes only the action of maximum probability. Double actor neural networks and double critic neural networks were used in the Deep Deterministic Policy Gradient (DDPG) algorithm to improve the convergence of the neural networks [28]. By removing the probability distribution and always taking the most probable action, the algorithm becomes much simpler. In 2017, the Proximal Policy Optimization (PPO) algorithm was proposed by OpenAI [29]. The Policy Gradient algorithm is very sensitive to the step size, but an appropriate step size is difficult to select: if the difference between the new and old strategies is too large, learning suffers. The problem of the uncertain learning rate in the Policy Gradient algorithm could be solved by the PPO algorithm; if the learning rate is too large, the learned strategy does not converge easily, while if it is too small, training takes a long time. PPO uses the ratio of the current and previous strategies to limit the update range of the current strategy, so that the Policy Gradient algorithm is not so sensitive to a slightly larger learning rate.

Improved PPO Algorithm
The current and previous strategy networks were used by the traditional PPO algorithm to mitigate the uncertainty of the learning rate, but the gradient estimate still had a large variance. Generalized advantage estimation was proposed by John Schulman et al. to improve the TRPO algorithm [30], and it can also be used to improve the PPO algorithm.
First, the application of the baseline in policy learning should be understood. The baseline can be any function b that is independent of the action a, since

$$\mathbb{E}_{a \sim \pi(\cdot|s;\theta)}\left[\nabla_\theta \ln \pi(a|s;\theta)\, b\right] = b\,\nabla_\theta \sum_a \pi(a|s;\theta) = 0$$

where a is the action taken by the agent, s is the current state, and θ is the network parameter; the policy function is, in essence, a probability density function. Subtracting such a baseline inside the policy gradient yields the advantage function.

Although the gradient itself is not affected by the value of b, b does affect the Monte Carlo approximation of the gradient: when b approaches Q_π, the variance of the Monte Carlo approximation decreases and the convergence rate improves. Taking b = V_π(s_t), which is independent of the action a_t, gives the advantage function

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$$

The action value function can be seen as the conditional expectation of the return U_t given s_t and a_t, and the state value function as the conditional expectation of the action value function given s_t:

$$Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid s_t, a_t\right], \qquad V_\pi(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q_\pi(s_t, a_t)\right]$$

A Monte Carlo approximation of Q_π can then be obtained from the observed reward:

$$Q_\pi(s_t, a_t) \approx r_t + \gamma V_\pi(s_{t+1})$$

Because the baseline b = V_π(s_t) is a definite value, it needs no Monte Carlo approximation. The unbiased estimate of the policy gradient can be expressed as

$$\nabla_\theta J(\theta) \approx \nabla_\theta \ln \pi(a_t|s_t;\theta)\left[r_t + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\right]$$

Defining the temporal-difference residual $\eta_t^V = r_t + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)$ and accumulating the k-step advantages over the baseline gives the generalized advantage estimation:

$$\hat{G}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\eta_{t+l}^V$$

The loss function of the PPO algorithm is

$$L(\theta) = \mathbb{E}_t\left[\min\left(\mu_t(\theta)\hat{G}_t,\ \mathrm{clip}\left(\mu_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{G}_t\right)\right]$$

In this equation, μ_t(θ) is the probability ratio: the probability that the current strategy takes a specific action in a specific state, divided by the probability that the strategy before updating takes the same action in the same state. The ratio is clipped to lie between 1 − ε and 1 + ε according to the hyperparameter ε, so there can be no great change between the previous strategy and the current strategy. The PPO loss iteration is shown in Figure 4.


Network Construction and Input and Output Information
State information is input to the actor network and the critic network in the PPO algorithm. The two-dimensional plane coordinates of the ship (x_p, y_p); the rudder angle and rudder angular velocity of the steering system (δ, δ_1); and 24 laser radar vector lines (χ_1, χ_2, χ_3, ..., χ_24) were used as the state information of the environment.
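The observation described above can be assembled as a flat vector; `build_state` is a hypothetical helper, and the exact ordering and any normalization used in the paper are assumptions:

```python
import math

def build_state(x_p, y_p, delta, delta_rate, lidar_ranges):
    """Concatenate position, rudder state, and the 24 laser radar returns
    into the observation fed to both the actor and the critic network."""
    assert len(lidar_ranges) == 24, "one range per radar vector line"
    return [x_p, y_p, delta, delta_rate] + list(lidar_ranges)

state = build_state(10.0, 5.0, math.radians(5), 0.0, [30.0] * 24)
print(len(state))  # 2 coordinates + 2 rudder values + 24 ranges = 28
```

Keeping the state assembly in one place makes it easy to check that the network's input dimension matches the environment.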
To avoid collisions with other ships, the navigator should adjust the course of his own ship to ensure the navigation safety of ships in designated waters. The collision avoidance behavior of an autonomous ship can be created through a sufficient learning process that imitates the appropriate decision-making skills a navigator acquires over a long period of experience [31]. In this experiment, the output data are the rudder angle of the SMASS; the course and path of the SMASS are affected by changes in the rudder angle. Altering course to port is defined as negative, and altering course to starboard is defined as positive, so the action space of this experiment is the rudder angle command. The obstacle avoidance process of the SMASS is shown in Figure 5.

In this paper, a deep neural network was used to fit the policy function π. The actor network adopted two fully connected layers with 128 neurons each and the ReLU activation function; the network input was the state S. The obtained expectation and standard deviation were put into a Gaussian distribution, the probability density function was obtained from the strategy distribution, and the probabilities corresponding to the different actions a were the output.

The critic network adopted two fully connected layers with 128 neurons and the ReLU activation function. The network input was the state S, and the output was the score of the action selected by the actor. The PPO algorithm and environment interaction process are shown in Figure 6. The probability obtained by the previous strategy was optimized with other relevant parameters, and the difference in the new_Actor network was obtained. The obtained difference was put into the new_Actor, so that the strategy of the global network is new and the strategy of the regional network is old. The critic network outputs the value V; the discounted reward, value subtraction, and generalized advantage estimate optimization are used to obtain the advantage function. Then, the gradient descent algorithm was used to calculate the error and update the network parameters. The ratio of the current and previous strategies was multiplied by the advantage function: one part was multiplied directly, and the other part was multiplied after clipping to between 1 − ε and 1 + ε, according to the range of the hyperparameter ε. The minimum of the two was taken, and then the error was calculated.
To break the correlation of the data and ensure the convergence of the policy function, an experience replay memory can be set up to store the historical motion states. At each time step t, the intelligent ship entered a new state after interacting with the environment, and the updated state was put into the memory. In the process of neural network training, small batches of state samples were extracted from the memory to ensure the stability of the training.
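The replay memory described above can be sketched as a fixed-size buffer of (s_t, a_t, r_t, s_{t+1}) transitions; the capacity and batch size below are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s_t, a_t, r_t, s_next) transitions. Sampling
    small random batches breaks the temporal correlation of the trajectory."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=1000)
for t in range(50):
    memory.push([t], 0.1, 1.0, [t + 1])
batch = memory.sample(8)
print(len(batch))  # 8
```

Because `deque(maxlen=...)` silently discards the oldest transitions, the buffer always reflects recent experience without manual bookkeeping.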

Reward Function
According to the task of SMASS path planning and obstacle avoidance, the reward function was set to the following five parts: goal approach reward, yaw angle reward, target point reward, obstacle avoidance reward, and COLREGs reward, as shown in Figure 7.

(1) Goal approach reward
The primary task in intelligent SMASS path planning is to make the SMASS reach the target position. The goal approach reward value was set as follows: where x_p and y_p are the coordinates of the current position of the ship, x_g and y_g are the coordinates of the target point, and λ_g is the weight of the target proximity reward.

(2) Yaw angle reward
When the SMASS is planning the path, the heading angle should be taken as an important indicator. As shown in Figure 8, the line between the current position of the ship and the position of the target point is the shortest distance, and the SMASS motion direction should follow this direction as far as possible. The yaw angle reward function is set as follows: where yaw is the yaw angle between the SMASS and the target point; tr is the reward coefficient of the yaw angle, which indicates that the reward values obtained from different angles are different; λ_a is the weight of the yaw angle reward; and ε is the adjustment parameter of the reward value and distance.



(3) Target point reward
In order to get the SMASS to the target point, it is necessary to set a reward at the target point position. At the same time, the SMASS should also receive a negative reward when it collides with obstacles during navigation. The reward value is set as follows:

(4) Obstacle avoidance reward
The laser radar detection range of the SMASS is a circle, with 24 detection lines launched from its center; R_radar is the radar radius, and the reward is 0 when a static obstacle is outside the radar radius. As shown in Figure 9, S_1 is set as the safe distance between the SMASS and the obstacle. When the distance between the SMASS and the obstacle is less than S_1, a negative reward is obtained. The reward value is set as follows:

Figure 9. The process of obstacle detection by SMASS laser radar.
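As with the other reward terms, the equation here is missing from the extracted text. A minimal Python sketch of the described logic (the 24 beams, R_radar, and S_1 come from the text; the penalty magnitude and the exact shape of the function are assumptions):

```python
def obstacle_reward(beam_distances, r_radar=3.0, s1=0.5, penalty=-10.0):
    """Illustrative obstacle-avoidance reward (assumed form).

    beam_distances -- the 24 laser-radar range readings from the ship's center
    r_radar        -- radar radius; obstacles beyond it yield zero reward
    s1             -- safe distance; readings below it incur a negative reward
    penalty        -- assumed magnitude of the collision-risk punishment
    """
    closest = min(beam_distances)
    if closest >= r_radar:
        return 0.0       # nothing within radar range
    if closest < s1:
        return penalty   # inside the safety margin: punish
    return 0.0           # detected, but still at a safe distance
```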

(5) COLREGs reward
In order to make the trained SMASS behavior satisfy COLREGs, a COLREGs reward function was introduced. The distance between SMASS and the target point was designed in the COLREGs reward.
When the SMASS needs to keep her heading, the rudder angle should be 0. In addition, when the SMASS needs to avoid obstacles or target ships, she should alter her course to starboard. These behaviors are defined as satisfying COLREGs. Conversely, if the SMASS alters her course to port or holds her heading after encountering obstacles or target ships, this is considered a violation of COLREGs. When the SMASS's operations comply with COLREGs, she obtains positive rewards; when she violates COLREGs, she is punished. Hence, the reward function can be set as follows: where λ_c is the weight of the COLREGs reward function. The calculation process of the total reward function is shown in Figure 7 and is expressed as follows:
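Putting the terms together, a hedged sketch of the total reward (the additive form and the default weight values are assumptions; λ_g, λ_a, and λ_c are the weights named in the text):

```python
def total_reward(r_dist, r_yaw, r_goal, r_obs, r_colregs,
                 lambda_g=1.0, lambda_a=0.5, lambda_c=0.5):
    """Illustrative weighted sum of the five reward terms:
    distance, yaw angle, target point, obstacle avoidance, and COLREGs."""
    return (lambda_g * r_dist + lambda_a * r_yaw
            + r_goal + r_obs + lambda_c * r_colregs)
```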

Design of Simulation
A training environment is necessary for intelligent SMASS deep reinforcement learning, and a purpose-built unmanned-ship training environment allows algorithms to be tested quickly [32]. Hence, multiple simulation scenarios were set up to train the mobile SMASS for path planning. Based on the improved PPO algorithm proposed above and the constructed neural network framework, the neural network was trained. The computer configuration was as follows: Intel Core i9-11900K, NVIDIA GTX3090, 24 GB video memory, 32 GB main memory, and 512 GB SSD storage. Gazebo and VScode were used for joint simulation: a three-dimensional navigation environment was established in Gazebo to simulate different waters and build a SMASS model, as shown in Figure 10.
Some restrictions were attached to the SMASS model. The SMASS cannot slow down and can only alter her course during the voyage. The SMASS inertia was appropriately increased to simulate her real motion state. In the steering phase, as the rudder angle increases, the rudder transverse force and rudder moment turn the SMASS. In the transition stage, transverse velocity and angular velocity were generated under the action of the transverse force and the rudder force transfer torque, and the increasingly obvious oblique motion brought the ship into an accelerated rotation state. When the SMASS moved in a fixed-length cycle, the steering force transfer torque, the drift-angle hydrodynamic transfer torque, and the resistance transfer torque were balanced; the rotational angular acceleration was zero, and the rotational angular velocity was at its largest and most stable value. This experiment assumed that the SMASS navigated in still water.
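The abstract states that the SMASS was modeled with the Nomoto model. As a hedged sketch of how such steering dynamics can be stepped in a simulation (the gain K, time constant T, and step size are placeholder values, not the paper's identified ship parameters):

```python
def step_nomoto(psi, r, delta, K=0.2, T=10.0, dt=0.1):
    """One explicit-Euler step of the first-order Nomoto steering model:
        T * dr/dt + r = K * delta
    psi: heading (rad); r: yaw rate (rad/s); delta: rudder angle (rad).
    K and T are illustrative values only.
    """
    r_dot = (K * delta - r) / T
    r_new = r + r_dot * dt
    psi_new = psi + r_new * dt
    return psi_new, r_new

# Under a constant rudder the yaw rate settles toward K * delta,
# matching the steady turning state described above.
psi, r = 0.0, 0.0
for _ in range(2000):
    psi, r = step_nomoto(psi, r, delta=0.4)
```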

Network Training Process
Experimental parameter settings are shown in Table 1. The Gazebo environment platform module is responsible for generating the navigation environment and running the SMASS simulation; it could generate and calculate the SMASS position and movement information. When the SMASS reached the target, the training episode ended and the next one began. When the SMASS encountered obstacles, training stopped immediately and the ship was placed back at the initial position for the next episode. The SMASS obstacle avoidance decision training process was divided into two environments, environment one (Env1) and environment two (Env2). In the experiment, the initial position of the SMASS in the simulation environment was (0, 0). There were six static obstacles in the simulation environment, with coordinates (0.46, 1.78), (−0.57, −1.75), (1.68, 3.78), (0.62, −4.44), (0.13, 6.08), and (−1.15, −6.18). There were two target points, with coordinates (1.00, −7.00) and (2.00, 7.00), as shown in Figure 11. In the early stage of environmental interaction, the ship severely lacked driving and collision avoidance experience, and the trained SMASS could not yet navigate towards the target and avoid collisions.

Figure 11. Gazebo simulation environment (Env1). From right to left are six obstacles (ob1, ob2, ob3, ob4, ob5, and ob6); the blue part is the laser radar range, and the blue lines are the laser radar detection lines. The left purple box is the target point.
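The episode rules above (reset at the start, terminate on reaching the target or hitting an obstacle) can be sketched as a generic training loop. Both ToyEnv and the step/reset interface are illustrative stand-ins, not the Gazebo setup used in the paper:

```python
class ToyEnv:
    """Minimal 1-D stand-in environment: target at x = 5, obstacle at x = -3."""
    def reset(self):
        self.x = 0
        return self.x

    def step(self, action):
        self.x += action
        if self.x >= 5:
            return self.x, 10.0, True, "target reached"
        if self.x <= -3:
            return self.x, -10.0, True, "collision"
        return self.x, -0.1, False, ""

def run_episode(env, policy, max_steps=500):
    """Run one episode: stop when the target is reached or a collision occurs,
    mirroring the training rules described in the text."""
    state = env.reset()                       # place the ship at the start
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                # e.g. a rudder-angle command
        state, reward, done, _info = env.step(action)
        total += reward
        if done:                              # target reached or collision
            break
    return total
```

For example, `run_episode(ToyEnv(), lambda s: 1)` walks straight to the target in five steps and ends the episode there.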
After 1000 training times, the SMASS could avoid obstacle 1 and obstacle 2. When the SMASS sailed on the port side of obstacle 2, the course remained unchanged. When encountering obstacle 2, the SMASS took two consecutive 25° port turns and moved towards the upper right under obstacle 2. When she was 0.6 miles from obstacle 1, her course was altered 45° to port, passing obstacle 1 obliquely upward. The SMASS continuously steered to port and starboard and changed course during movement, but could not reach the target point and collided with the environmental boundary during the wandering process. The SMASS avoiding obstacles 1, 2, and 3 is shown in Figure 12.

After training about 1200 times, the SMASS successfully reached the first target point. Subsequently, the SMASS continuously altered her course 45° to port and sailed towards target point 2. When the SMASS passed under obstacle 3 and navigated towards obstacle 4, her course was altered 25° to starboard, and then the port and starboard rudder were applied alternately to keep the heading stable.

After training 1500 times, the SMASS could maintain her course and sail to the target point. The SMASS first altered her course 25° to port, close to the upper starboard side of obstacle 6, and then turned 25° to starboard twice in succession, passing over obstacle 6 and successfully reaching target point 2. The collision avoidance process is shown in Figure 13. In training environment one, the SMASS successfully avoided all six obstacles. The change of the SMASS steering angle with time during obstacle avoidance is shown in Figure 14.

In environment two, after training about 1200 times, the SMASS frequently operated the rudder and reached the first target point. In the process of sailing to the second target point, the SMASS chose to sail around obstacle 1 from above, as shown in Figure 16b. After reaching the target point, the SMASS chose to alter course 45° to port to sail some distance to the upper left, and then frequently operated the rudder. When the SMASS reached the top left of obstacle 1, she chose to alter course 45° to port to head downward.

After training 1400 times, the SMASS almost never collided with the five obstacles or entered the minimum distance S_1 between the SMASS and a static obstacle. The reward value obtained by crossing between obstacle 1 and obstacle 2 was greater than that obtained by bypassing above obstacle 1. As shown in Figure 16a, when the distance between the SMASS and obstacle 1 was greater than 0.5 miles, the SMASS altered her course 25° to starboard. When the SMASS was 0.4 miles away from obstacle 2, she chose to alter her course 25° to starboard and moved forward 0.5 miles. Subsequently, the SMASS altered her course to port and avoided obstacle 2. When the SMASS arrived at target 2 and got ready to return to target 1, the reward value obtained by passing on the left side of obstacle 5 was larger than that obtained by passing on the right side. After passing obstacle 5, the SMASS chose to alter her course 25° to port, and then altered her course 45° to starboard after passing through obstacle 5. The process of SMASS obstacle avoidance is shown in Figure 17, and the change of the SMASS steering angle during obstacle avoidance is shown in Figure 18.

Comparison Experiment
To verify the effectiveness of the improved PPO algorithm, this paper compared it with other classic policy-based reinforcement learning algorithms (the AC algorithm, the DDPG algorithm, and the traditional PPO algorithm). As shown in Figure 19, over 20,000 training runs, the actor network in the AC algorithm converged after about 11,000 runs and the critic network after about 10,000 runs; the convergence rate of the AC algorithm was not satisfactory, and its loss value was high. The DDPG algorithm converged after about 10,000 training runs but still suffered from a high loss value. When solving the SMASS decision-making problem, the traditional PPO algorithm converged after 8000 training runs, better than the AC and DDPG algorithms. The improved PPO algorithm, however, converged after 6000 training runs; its convergence rate was significantly better than the traditional PPO algorithm, and its loss was greatly reduced. Hence, the convergence rate of the improved PPO algorithm increased by about 25% compared with the traditional PPO algorithm, and by about 50% compared with the traditional DDPG and AC algorithms.

Figure 19. The comparative experiment of the convergence curve.

The Generalized Advantage Estimation Algorithm directly affects the convergence speed and convergence quality of the PPO algorithm.
In this experiment, four groups of comparison experiments were conducted to prove the influence of differences in the generalized advantage estimation on the PPO algorithm. Taking the training environment as an example, four λ values were selected for comparative experiments, which were 0.8, 0.9, 0.95, and 0.99, respectively.
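The generalized advantage estimator under comparison can be written in a few lines. This is the standard GAE recursion (the paper's exact implementation details, such as the discount γ, are not given; the λ values are the ones compared in the experiment):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (standard backward recursion).

    rewards -- r_0 ... r_{T-1} for one trajectory
    values  -- V(s_0) ... V(s_T), including one bootstrap value at the end
    lam     -- the λ varied in the experiments (0.8, 0.9, 0.95, 0.99)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With λ = 0 the estimate collapses to the one-step TD error (low variance, high bias); with λ = 1 it becomes the full Monte Carlo advantage (high variance, low bias), which is why intermediate values such as 0.95 and 0.99 are typically tested.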
The convergence of the actor and critic networks when λ was 0.8 is shown in Figures 20 and 21. The convergence of the actor network was not obvious, and the critic network had not converged clearly after 24,000 training sessions. The convergence of the actor and critic networks when λ was 0.9 is shown in Figures 22 and 23. Compared with the actor network convergence curve when λ was 0.8, the actor network converged better, but the critic network still did not converge after 22,000 training sessions.
The convergence of the actor and critic networks when λ was 0.95 is shown in Figures 24 and 25. The actor network converged faster in the early stage than when λ was 0.9, and an overall convergence trend appeared. In addition, the convergence of the critic network was significantly better than when λ was 0.9.
The convergence of the actor and critic networks when λ was 0.99 is shown in Figures 26 and 27. The convergence rate of the actor network was much faster than when λ was 0.95. In addition, when λ was 0.99, the convergence quality and stability of both the actor and critic networks were better than when λ was 0.95.


Verification Simulation
Generalizability refers to the ability of a trained model to apply to new data and make accurate predictions. When training is insufficient, the fitting ability of the decision-making system is weak, and perturbations in the training data are insufficient to change the decision-making system significantly. As training progresses, the fitting ability of the decision-making system is gradually enhanced and perturbations can be detected by it. Conversely, a model trained too well on the training data, that is, an overfitted model, cannot generalize. To demonstrate the generalization of the proposed SMASS intelligent obstacle avoidance model, several different simulation environments were constructed to verify the generalizability of the trained SMASS obstacle avoidance network.
The eight representative simulation environments are shown in Figure 28, and the initial and end positions of each environment are listed in Table 2. There were five obstacles in environment 3. Environments 4, 5, and 6 were used to simulate the navigation of the SMASS in relatively narrow waters. Environment 7 did not contain many obstacles, but its layout was more complex. There were only two obstacles in environment 8, but the navigable waters were very narrow, simulating SMASS obstacle avoidance in narrow waters. Environment 9 was relatively open, but multiple obstacles were located along a line; this environment was used to test whether the SMASS could find the optimal path when multiple obstacles were present. In environment 10, the navigation area was very narrow and contained more obstacles, simulating complex obstacle avoidance navigation in complex narrow waters. In each environment, the collision avoidance process from the starting position to the end position is described by six graphs (as shown in Figures 29 and 30). Moreover, the SMASS steering rudder angles during the collision avoidance processes in each environment are shown in Figures 31-33.

In addition, avoidance simulations with sailing target ships were carried out to verify the trained SMASS's obstacle avoidance capability. Taking environment No. 9 as an example, the sailing target ships met the trained SMASS under different collision encounter situations, and the trained SMASS could avoid them accurately and safely according to COLREGs.
As shown in Figure 34, the left side of the figure shows the sailing paths of the SMASS and three target ships, and the right side shows the SMASS avoidance process in the simulation environment. The first target ship (TS01) and the SMASS formed a crossing give-way situation, and the SMASS altered her course to starboard to avoid the first target ship. When the SMASS met the second target ship (TS02), the two ships formed a crossing stand-on situation; the SMASS kept her course and then altered to starboard to avoid the second target ship. When the SMASS passed through the middle position, the third target ship (TS03) and the SMASS formed a head-on situation, and the SMASS altered course to starboard to avoid the third target ship.
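The give-way, stand-on, and head-on situations described here are typically distinguished by the target ship's relative bearing. A hedged sketch of that classification (the sector boundaries are conventional COLREGs approximations, not values given in the paper):

```python
def classify_encounter(relative_bearing_deg):
    """Rough COLREGs encounter classification from the target ship's relative
    bearing (degrees; 0 = dead ahead, increasing clockwise to starboard).
    Sector boundaries are conventional approximations, not the paper's values."""
    b = relative_bearing_deg % 360.0
    if b <= 5.0 or b >= 355.0:
        return "head-on"
    if b < 112.5:
        return "crossing (own ship gives way)"   # target on the starboard side
    if b > 247.5:
        return "crossing (own ship stands on)"   # target on the port side
    return "overtaking"
```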

Conclusions
An improved PPO algorithm for path planning and obstacle avoidance in different complex waters was presented in this paper. The SMASS can perform complex local path planning and obstacle avoidance operations even when external information is not fully available. In this experiment, five factors were considered in the design of the reward function, namely the distance to the target, the yaw angle, reaching the target point, safe obstacle avoidance, and COLREGs. The algorithm also performed well in complex waters with different numbers of obstacles. The contributions of this experiment are as follows:

• The improved PPO algorithm is superior to other traditional model-free reinforcement learning algorithms based on policy learning in solving ship decision-making and local path planning problems, with the advantages of fast convergence and a low loss value.
• The improved PPO algorithm has strong self-learning ability and strong generalization, and can solve SMASS local path planning and collision avoidance decision-making simultaneously in different complex navigation environments.
Some work remains for the future. In the experiment, modeling obstacles as cylinders and squares has limitations: real obstacles such as islands and navigable areas are not well represented by basic shapes, so the design of complex obstacles is one direction for future study. In addition, the rudder angle output in this study was the commanded rudder angle, which deviates somewhat from the executed rudder angle; this is also an important factor to be considered in future studies.
Author Contributions: Conceptualization, W.G. and Z.C.; methodology, W.G. and Z.C.; software, Z.C.; writing-original draft preparation, Z.C.; writing-review and editing, W.G. and Z.C.; resources, X.Z. All authors have read and agreed to the published version of the manuscript.