Autonomous Navigation Decision-Making Method for a Smart Marine Surface Vessel Based on an Improved Soft Actor–Critic Algorithm

: In this study, an intelligent hybrid algorithm based on deep-reinforcement learning (DRL) is proposed to achieve autonomous navigation and intelligent collision avoidance for a smart autonomous marine surface vessel (SMASV). First, the kinematic model of the SMASV is used, and clauses 13 to 17 of the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs) are introduced. Then, the electronic chart is rasterized and used for path planning. Next, states, actions, and reward functions are designed, and collision avoidance strategies are formulated. In addition, a temperature factor and a constrained loss function are used to improve the soft actor–critic (SAC) algorithm. This improvement reduces the challenges of hyperparameter adjustment and improves sampling e ﬃ ciency. By comparing the improved SAC algorithm with other deep-reinforcement learning (DRL) algorithms based on strategy learning, it is proved that the improved SAC algorithm converges faster than the other algorithms. During the experiment, some unknown obstacles are added to the simulation environment to verify the collision-avoidance ability of the trained SMASV. Moreover, eight sea areas are randomly selected to verify the generalization ability of the intelligent-navigation system. The results show that the proposed method can plan a path for the SMASV accurately and e ﬀ ectively, and the SMASV decision-making behavior in the collision-avoidance process conforms to the COLREGs in both unknown and dynamic environments.


Introduction and Background
With the acceleration of globalization, maritime traffic is becoming increasingly important.In particular, ships are usually underactuated, in addition to their characteristics of large tonnage, relatively weak power, and slow speed.According to an International Maritime Organization (IMO) report, the performance of non-standard ship operations by pilots is a significant factor leading to frequent maritime accidents at sea [1].Therefore, methods for the autonomous collision avoidance of ships must be developed.In this regard, artificial intelligence has developed rapidly in recent years, and the combination of robot technology and control technology has provided a new solution to the problem of path planning for ships and intelligent collision avoidance [2].
In the process of ship navigation, intelligent navigation systems face several challenges, including an unknown environment and perceived uncertainty.Therefore, the core of developing intelligent ship navigation focuses on path planning and intelligent collision avoidance.One approach assumes that path-planning or decision-making algorithms are based on mathematical-function models or physical-function models.The Avalon ship was designed by Erckens et al. [3] to cross the Atlantic Ocean automatically.The route was planned using the A-star algorithm and the authors considered the effect of dynamic obstacles and airflow in their study.However, the A-star algorithm requires too much memory to compute an optimal solution with the available resources.Song et al. [4] proposed a to guide the agent to the target position and obtain a higher reward value so that the USV could rapidly find the optimal path.A ship-collision-avoidance algorithm based on the DQN algorithm was proposed by Shen et al. [14], who proposed a simplified ship domain and introduced it into the reward function.However, the DQN algorithm had the problem of overestimation, and unmanned surface ships were prone to make actionselection errors in complex environments.A reinforcement learning training method based on electronic charts was proposed by Guo et al. [15].In their new algorithm, automaticidentification-system (AIS) information was regarded as the state space of the neuralnetwork input.The trained neural-network model could be used for autonomous path planning.A new multi-ship collision-avoidance algorithm was adopted by Zhao et al. [16].In particular, the authors considered the powerful guidance ability of the line-of-sight (LOS) algorithm to assist in training the proximal policy optimization (PPO) algorithm.The experimental results showed that the algorithm had a desirable collision-avoidance effect in multi-ship encounter situations.In addition, the collision-avoidance path of each ship conformed to the COLREGs.A deep deterministic policy gradient (DDPG) algorithm, based on a priority sampling mechanism, was improved by Xu et al. [17], and the authors defined five complex reward functions.The algorithm designed the USV navigation situation model and established a geometric model of the two-ship encounter situation, according to navigational practice.A collision cone was introduced for static obstacles, and the COLREGs were monitored for dynamic ships to quantify a collision into five types.Although the algorithm achieved effective collision avoidance in complex environments with dynamic and static obstacles, the reward function was too complex and the experiment did not have a generalization ability.Do-Hyun Chun et al. [18] adopted a risk-assessment algorithm and ship domain to solve the ship-collision-avoidance problem.The algorithm needed to fully consider the DCPA and TCPA between their own ship and the target ship.However, TCPA was related to the time at which the two ships encountered.Thus, calculating the rendezvous time between the two ships during the experiment was challenging.Long et al. [19] proposed a collision-avoidance strategy based on the PPO algorithm.In their experiment, the distributed sensors of a multi-robot system were used.The strategy of the algorithm could directly map the original sensor-measurement results to the agent's steering command.based on the motion speed, and the trained network model had a good generalization.
A summary of recent literature on ship-collision avoidance is provided in Table 1.Based on the above studies, we developed an intelligent-navigation decision-making system for a smart marine autonomous surface vessel (SMASV), based on an improved soft actor-critic (SAC) and A-star algorithm.The main contributions of this study are as follows:

•
The convergence speed of the loss function and the average reward function of the improved SAC algorithm were better than those of other popular strategy-based DRL algorithms.

•
The trained deep-neural-network model for the SMASV had a strong generalization ability, and could solve the problem of multiple SMASV encounters in complex situations.
The remainder of the paper is organized as follows: Section 2 introduces the preliminary considerations and the formulation of the problem.Section 3 introduces the proposed approach.Section 4 presents a simulation and comparative experiments.Finally, Section 5 concludes the study.Overall, the proposed intelligent SMASV navigation system comprises four components: the path planner, the sensor module, the decision-making module, and the control module [20].The planning system refers to the path-planning process before sailing.In this study, the A-star algorithm was used to plan the global path of the SMASV, and the heuristic search was performed on the rasterized electronic chart to plan the optimal path rapidly.In particular, in the path-planning process, the planning system can only plan on an electronic chart with known information and cannot avoid unknown obstacles.
The sensor module contains two perception components: the SMASV's position information and the navigation-environment information [21].The sensor part is mainly composed of the global navigation satellite system (GNSS), sonar, a sounder, an inertial measurement unit (IMU), a gyrocompass, marine radar, lidar, and so on.
The decision-making module is composed of a deep-neural-network (DNN) model trained on the deep reinforcement-learning SAC algorithm.The algorithm can realize the collision avoidance of unknown obstacles and dynamic obstacles by processing the state information and the action information of the sensing module.
Finally, the control module executes the rudder angle command given by the decision system, so that the SMASV can sail safely according to the specified rudder order.The architecture of the intelligent SMASV navigation system is illustrated in Figure 1.

Construction of the Path Planner
The global path-planning of the SMASV includes two parts: rasterization of the electronic chart and global path planning [22].In this study, the grid method was used to model the electronic chart after gridding.The grayscale processing of the electronic chart is shown in Figure 2.
First, the electronic chart was processed into a grayscale map, improving the chart rasterization accuracy.Then, the grayscale image was rasterized to form the final electronic chart.This chart had a simple structure, which helped in developing a simulation environment and path planning.The processing of the marine environmental information using the grid method is illustrated in Figure 2.
The A-star algorithm is a heuristic algorithm with some prior knowledge of path planning.The algorithm not only focuses on the paths that have been passed, but also predicts the points or states that have not yet been passed through.Hence, the A-star algorithm was used in this paper to solve the shortest path in static networks.

Ship-Motion Mathematical Model
The actual motion of the SMASV is complex, with six degrees of freedom (6-DOF).Moreover, overly high-dimensional and elaborate models can contain huge numbers of parameters that are challenging to estimate and difficult to solve [23].Therefore, we simplify the 6-DOF model to a three-degrees-of-freedom (3-DOF) model.The SMASV 3-DOF model has sufficient accuracy for ship path-planning and automatic collision avoidance [24].
As shown in Figure 3, the ship motion Equation [25] is as follows: .
where η = [x, y, ψ] T refers to the position coordinates and the heading angle of the ship, v = [u, v, r] T represents the velocity vector of the ship's 3DOF, R(ψ) represents the rotation matrix of the ship, g(v) = [g u , g v , g r ] T represents the unmodeled dynamic model, and τ w is the sum of the external forces of environmental interference.R(ψ) is expressed as: Matrix C(v) is composed of a centripetal matrix and a Coriolis force matrix that account for the effects of fluid dynamics.
The inertial matrix of the system includes an added mass matrix M A , which accounts for the effects of fluid mechanics, and a rigid body mass matrix, M RB : where r are the hydrodynamic derivatives, m is the mass of the ship, and I z is the moment of inertia of the ship around the Z coordinate axis.D(v) is a nonlinear damping matrix, which can be expressed as the sum of linear damping and nonlinear damping: where Y ( * ) , X ( * ) , N ( * ) are the hydrodynamic derivatives, which represent the hydrodynamic forces and moments acting on the ship.

COLREGs
It is necessary to consider the COLREGs when solving the intelligent-decision problem of ships using deep-reinforcement learning [26].The encounter between two ships is described in four cases, as expressed in clauses 13 to 17 of the COLREGs [12], as shown in Figure 4.

Improved SAC Algorithm
The SAC algorithm was proposed by Haarnoja et al. [27].Note that the Boltzmann distribution of the Q function in a traditional maximum entropy RL soft Q-learning algorithm cannot solve the action problem in continuous space.In this regard, the SAC algorithm solved this limitation and achieved good convergence.
The entropy represents a measure of the randomness of random variables.In reinforcement learning, we use H(π(•|s )) to represent the randomness of strategy π under state s.The idea of maximum entropy reinforcement learning is to make strategies more random in addition to maximizing cumulative rewards.Therefore, an entropy regularization is added to the objective function of RL as follows: where α is a regularization coefficient which is used to control the importance of entropy.
The entropy regularization increases the exploration degree of the reinforcement learning algorithm.The larger the value of α is, the more strongly the algorithm explores, which helps to improve the speed of strategy learning.In addition, a sufficiently large value of α can reduce the possibility of agents falling into the local optima.The difference between traditional reinforcement learning and maximum entropy reinforcement learning for the ship decision-making process is shown in Figure 5.In Figure 5, the ship has two routes to reach the target point.Obviously, Path 1 is shorter and more effective.At this time, the two peaks of the curve in Figure 5 The state value function can be written as follows: According to the soft Bellman equation, policy evaluation can converge to a policy π in finite state and action space.Therefore, the policy improvement formula can be written as follows: This distribution is a form of the Boltzmann distribution (energy-based model).The distribution assigns a non-zero probability to all actions.Thus, the ship will be more inclined to learn all the behaviors that can handle the task, and adapt to the local optimal problem caused by environmental changes.In the SAC algorithm, action value function Q and a policy function π are used for modeling.The algorithm selects a smaller value when selecting the Q value, reducing overestimation.The loss function of the Q network is as follows: where R represents the data collected by the strategy in the past.Because SAC is an offpolicy algorithm, in order to make the strategy more stable, we considered two target Q networks.The network parameters are updated as follows: The loss function for strategy π is derived from the KL divergence.The loss function is expressed as follows: In the maximum entropy reinforcement learning algorithm, the selection of the entropy regularization coefficient is crucial.The choice of entropy weighting value differs in different states.For example, when the optimal action is uncertain, the entropy weighting value should be slightly higher.When some actions are determined, the entropy weighting value can be reduced.The goal of reinforcement learning is expressed as a constrained maximum expected revenue, and the mean of constraint entropy is greater than H 0 .Then we can obtain the loss function of α as follows: When the entropy of the strategy is lower than H 0 , the loss function of the training target will increase α.The importance of the entropy counterpart will also increase when the training strategy function π.When the entropy of the strategy is higher than H 0 , the training target loss function will reduce α.This implies that the maximum value is a critical parameter in strategy training.

Remark 1.
For the algorithm selection, we found that DRL based on value learning [28,29] has an overestimation that cannot be eliminated.This is due to the bootstrapping of such algorithms in the iterative process and the presence of uneven noise.The Actor-Critic (AC) algorithm [30] contains two neural networks, one interacting with the environment to select the action (actor-network) and the other evaluating the quality of the action (critic-network).Subsequently, random sampling is performed in action probability to extract the current action.However, the traditional AC algorithm is difficult to introduce to converge rapidly.Thus, the PPO algorithm [31] was proposed by open AI in 2017.Compared with complex methods such as Taylor expansion approximation, conjugate gradient, and linear search of trust region policy optimization (TRPO) [32], PPO is simpler to use than Penalty or Clip.Nevertheless, both TRPO and PPO are on-policy algorithms, implying that their sample efficiency is low.An improved DDPG algorithm [33] was proposed by Lillicrap et al.Compared with the random strategy, the deterministic strategy adopted by the DDPG algorithm considers the maximum probability action when solving the continuous action problem.Thus, the sampling cost was significantly reduced, improving the sampling efficiency.However, the training of the DDPG algorithm is unstable, with poor convergence.Moreover, it is sensitive to hyperparameters.Therefore, adapting the algorithm to different complex environments is challenging.Multi-agent reinforcement learning, which is based on game theory and involves different agents having cooperative or competitive relationships, has gained substantial interest recently.The multi-agent reinforcement learning [34] approach used to train the network model, however, left it with poor generalization capabilities and insufficient expertise in dealing with challenges involving unknown obstacles.As a result, the multi-agent reinforcement learning method and the DRL based on value learning are not suitable for this study.

Construction of Neural Network
In the improved SAC algorithm, the action selection of the SMASV depends on the state of the system.The decision-making module provides the corresponding action according to the environmental information detected by the SMASV in the past period.The SMASV can obtain the location of obstacles and targets through detection technology and positioning systems.Thus, the SMASV state space and action space must be designed to achieve intelligent decision-making.
Generally, state information is used as the input of the network for the improved SAC algorithm.The states of the network can be expressed as Equation (20): The target-Q functions, evaluation-Q functions, and strategy function π should be fitted by the neural network.The actor network was used to fit the strategy function π.Two fully connected layers were adopted for the network, and each layer contained 256 neurons.In addition, the ReLU activation function was used in this network.Then, the target-Q function and the evaluation-Q function were fitted using four networks, each with four fully connected layers containing 256 neurons and the ReLU activation function.However, the difference between the target-Q network and the evaluation-Q network was that the evaluation-Q network and the target-Q network were used for current and delayed action evaluations, respectively.The interaction process between the decision system and the environment is shown in Figure 6.The RMSprop with a momentum term was used to optimize the network training.The working principle was to dynamically adjust the learning rate of each parameter by using the first-order moment estimation and the second-order moment estimation of the gradient.The learning rate of each iteration had a specific range and made the parameters stationary.
To prevent a reduction in neural network training efficiency, memory with an experience playback function was set up to store historical action and state information.This method can improve the use of data because each sample is used multiple times.In addition, this method can reduce the correlation between continuous samples and the parameter update variance.Using historical data in the training process can enhance the efficiency of the algorithm and accelerate the convergence of the loss value.

Reward Function Setting
The purpose of the reward function setting is to allow the SMASV to follow the planned path while avoiding unknown obstacles and complying with the COLREGs.The reward function is calculated as the sum of the episode reward [35].In particular, the reward function is divided into five parts, as shown in Figure 7. (

1) Goal-approach reward
The goal-approach reward function guides the SMASV to the target.Both the targetpoint position information and the SMASV position information are considered as part of the state information input by the neural network, which can be expressed as follows: (2) Angle reward In the SMASV decision-making process, the yaw angle is crucial in the decision-making system.Therefore, the yaw-angle reward function was added to the reward function.If the SMASV sails in the planned direction, positive reward is given; otherwise, a negative reward is given.The specific settings are as follows: where yaw is the heading angle error, tr is the coefficient of the heading angle error, λ a is the weight of the angle reward, and ε is a parameter in which the reward value changes with the distance. (

3) Reach target point and collision reward
A high reward value is obtained when the SMASV reaches the target location.This causes the SMASV to regard the target point as the learning target.A large negative reward is obtained when the SMASV collides with an obstacle.This allows the SMASV to learn to avoid obstacles.The reward function is set as follows: (4) Obstacle-avoidance reward The SMASV radar, which emits 24 detection lines, can be used to detect whether there are obstacles within the radar coverage.When the SMASV radar senses the presence of an obstacle, it calculates the obstacle-avoidance reward.We set a safe distance at the position of S 1 within the radar radius, as shown in Figure 8.Hence, the obstacle-avoidance reward function is set as follows: (24) where ob represents the distance between the obstacle and the SMASV.(

5) COLREGs reward
The purpose of introducing the COLREGs reward is to make the SMASV comply with the COLREGs in collision-avoidance actions during navigation.When the SMASV decides on an effective measure, complying with the COLREGs and avoiding the target ship, the SMASV obtains a positive reward.However, if the action taken by the SMASV does not conform to the COLREGs, the SMASV will not be rewarded.In particular, the reward function is set as follows: Remark 2. In designing the reward function of the COLREGs, some works of literature [15,16,34,36] are referred to.The SMASV alters her course to starboard to avoid the TS according to the head-on, overtaking, and crossing give-way situations (rules 13-16) in the COLREGs.Additionally, changing the SMASV's course to starboard can more effectively reduce the probability of a collision when she encounters the crossing stand-on situation (rules 17).In order to make the SMASV incline to the starboard alteration during the collision-avoidance decision-making process, we introduced the COLREGs into the reward function.
The reward function equation is expressed as follows:

Experiment 4.1. Design of Simulation
Multiple simulation scenarios were established to verify the practicability and feasibility of the intelligent-SMASV navigation system based on the SAC and A-star algorithms.Based on the above improved SAC algorithm, PyTorch was used to construct a neural network framework and train the neural network.In this experiment, Gazebo and VScode were used for conjoint simulation.Before sailing, the rasterization of the electronic chart and the path-planning of the A-star algorithm were first performed [37].Then, neuralnetwork training and model-parameter debugging were performed in VScode.Finally, a simulation environment was established in Gazebo to test the feasibility and generalization ability of the algorithm in different environments.The length of the SMASV was 116 m; the draft was 5.4 m; the beam was 18 m; the rudder area was 11.8 m 2 ; the maximum rudder angle was 35 • ; and the speed was 12 knots.The hydrodynamic parameters are shown in Table 2.
Table 2. Hydrodynamic parameters of the SMASV.

Hydrodynamic Parameters of Ship-Motion Model
This experiment assumed that the SMASV sailed in still water and some restrictions were attached to the SMASV model.First, the SMASV could not slow down her speed during navigation, but could only change her course, because when a ship navigates in the ocean and encounters obstacles or other target ships, course alteration is a better action for collision avoidance.Second, the inertia of a ship was introduced to simulate the true motion state of the SMASV.Third, during the simulation process, the SMASV did not know the location of unknown obstacles or the subsequent actions of dynamic obstacles.Therefore, the SMASV could only perform response actions based on existing information.

Network Training
The network parameters and the algorithm parameters are listed Table 3.The SMASV kinematics and dynamics information could be generated and calculated through the environment module in the Gazebo simulation environment.These items of information were introduced into the VScode as subscription nodes for network training and parameter updating.When the SMASV reached the target position, the training was completed and the rewards were recorded in cumulative form.At this time, a new target point was generated in the simulation environment and the SMASV started a new task.When the experiment reached its maximum number of steps or when the SMASV encountered an obstacle, the experiment was immediately stopped and the reward value was output.After the training stopped, the SMASV was placed in a new position and started a new training round.Environment 0 (ENV-TRAIN) was used to the training of the intelligent-SMASV decision-making system.The training environment is illustrated in Figure 8. Nine obstacles were present in the simulation environment.The initial position of the SMASV in the experiment was (0, 0).In the early stage of interaction with the environment, the SMASV did not have the ability to drive to the target point and avoid collisions.During exploration, the SMASV constantly collided with the environmental frame.
After 500 training iterations, the SMASV could successfully avoid Obstacle 1, Obstacle 2, and Obstacle 3. When the SMASV passed through the left side of Obstacle 2, the SMASV operated at the rudder angle to maintain her course.When the SMASV was 1.3 nautical miles away from Obstacle 2, the SMASV chose to alter her course to starboard to 45 • and then alter her course to port to 5 • , passing between Obstacle 2 and Obstacle 3. Due to a lack of training experience, the SMASV collided with Obstacle 5.
After 700 training iterations, the SMASV successfully avoided five obstacles on the left side of the simulation environment.At this time, the SMASV found that the reward value obtained by passing above the second obstacle was larger than that obtained by passing below the second obstacle.Therefore, the SMASV changed its strategy.When the SMASV was 0.7 nautical miles away from Obstacle 2, the SMASV chose to alter her course to port 23 • , then altered her course to starboard 5 • and passed Obstacle 1 and Obstacle 2. When the SMASV was 1.5 nautical miles from Obstacle 4, the SMASV chose to alter her course to starboard 15 degrees and keep her course.Then, when the SMASV was 0.6 nautical miles away from Obstacle 4, the SMASV chose to alter her course to starboard 5 • to pass Obstacle 4 and Obstacle 5.The collision-avoidance process is shown in Figure 9A.
After 1000 training iterations, the SMASV could successfully avoid the four obstacles on the right side of the simulation environment.When the SMASV was 0.4 nautical miles away from Obstacle 7, the SMASV altered her course to starboard 20 • .At this time, the SMASV was 0.2 nautical miles from Obstacle 7.Then, the SMASV chose to alter her course to port 17 • , then alter course to starboard 5 • to avoid Obstacles 8 and 9. Finally, the SMASV held her course and approached the target point.The collision-avoidance process is shown in Figure 9B.After 1500 training times, the SMASV could successfully avoid nine obstacles in the simulation environment and reach the target point.
Remark 3. In this study, a static obstacle environment was used to train the neural-network model.In the verification generalization experiment, we found that the SMASV can avoid obstacles well in the dynamic obstacle environment, and the collision-avoidance action is compliant with the COL-REGs.The reason for this phenomenon is that the neural-network model of the COLREGs has been learned during the training process and the COLREGs will also be executed under certain conditions in a dynamic environment.The training of lidar-based course adjustment can be generalized from static environments to dynamic environments.This is because, regardless of whether an environment is dynamic or static, the principle of lidar detecting obstacles is the same.It is shown that the trained neural-network model has a strong generalization ability.Additionally, the multi-ship decision-making problems can be resolved via the proposed reinforcement-learning method.
In addition, while verifying whether the SMASV could avoid obstacles, we also had to guarantee the stability of the algorithm based on the neural-network loss-function curves.During the neural-network training process, the loss curve can directly reflect the network architecture quality.The loss values generated during the training process were stored.Then, the curve was drawn using the tensorboardX drawing tool to observe the changes in the training-loss-function value curves and adjust to the learning-rate parameters.Figure 10 shows the loss-value curves for the Target-Q1 network and the actor network in the improved SAC algorithm.
The target-Q1 network had a large decrease in loss value at the beginning of the training, indicating that the learning rate was appropriate and the gradient descent process was performed.After 5000 iterations of learning, the loss values converged to a stable state, and the change in loss function was not as evident as it was in the beginning states.Simultaneously, the actor network began to converge at approximately 3500 iterations during training and the convergence rate was relatively stable.The network-loss-function value of the improved SAC algorithm was compared to that of the traditional DRL algorithm, as shown in Figure 11.The results showed that the improved SAC algorithm is more stable and converges rapidly for neural-network training.

Comparison Experiment
The maximum-entropy DRL algorithm was compared with traditional DRL algorithms to verify its performance.In particular, the improved SAC algorithm was compared with other strategy-based learning algorithms, including the improved PPO [35], PPO [31], DDPG [33], and AC [30] algorithms.A comparison of the average-reward-convergence curves for the different algorithms is shown in Figure 12.In this experiment, the maximum number of episodes was 2000.After 2000 episodes, the reward-function value of the AC algorithm did not show a convergence trend.After 1800 episodes, the reward-function value of the DDPG algorithm began to converge.The average-reward value of the DDPG algorithm was approximately 21,400.After 1200 episodes, the reward-function value of the PPO algorithm began to converge.As a result, the curve-convergence rate was better than that of the DDPG algorithm and AC algorithm.The average-reward value of the PPO algorithm was approximately 27,600.In addition, the convergence rate of the reward-function value of the improved PPO algorithm was better than that of the PPO algorithm.After 900 episodes, the algorithm began to converge, and the average reward-value was approximately 30,000.
After 800 episodes, the improved SAC algorithm began to converge, and the fluctuation of the reward-function value was very minimal.Under the above conditions, the convergence rate of the improved SAC algorithm outperformed that of the improved PPO algorithm and the PPO algorithm by approximately 12% and 45%, respectively.

Verification Simulation
In this study, a real marine environment was considered as the environmental space for the proposed model verification.First, the grid method was used to rasterize the marine environment and, then, the rasterized environment was converted into binary values "0" and "1", where "1" represented the obstacle, which was the forbidden navigation area and displayed as a black area in the environment model.In contrast, "0" represented the navigable area, displayed as a white area in the environment model.
The simulation environment was used to construct a navigation simulation and modeltraining environment, which contained unknown and dynamic obstacles.We applied the network model trained in the training environment in the real marine environment.However, these obstacles were not displayed in a real marine environment.To reach the target location safely and efficiently, the trained SMASV must avoid obstacles, including unknown obstacles not displayed in the real marine environment.
In addition, to verify the generalization ability of the proposed intelligent-SMASV navigation system, we built several simulation environments in different water areas.In this regard, we extracted eight representative water areas from the constructed simulation environments, as shown in Figure 13.Environment 1 is the Strait of Malacca, a maritime area prone to accidents and an important strait connecting Asia and Africa.Environment 2 is the Bismarck Strait, which contains many islands and reefs.Environment 3 is the Balearic Sea, located on the north side of the Mediterranean Sea, which has a narrow coast.Environment 4 is the Tyrrhenian Sea, which is connected to the Mediterranean Sea and is relatively open.Environment 5 is the Strait of Tartary, located on the northwest side of the Pacific Ocean, which is one of the world's major straits.Environment 6 is the coastal port of Djibouti, one of the modern ports in eastern Africa.Environment 7 is the strait of Dover, which is the channel connecting the North Sea and the Atlantic, where shipping is very busy.Environment 8 is the Mozambique Channel, the longest strait in the world, located between Mozambique and Madagascar in southeastern Africa.
The path planning track and the sailing track of the intelligent-ship navigation system are displayed on the grid chart.The results are shown in Figure 14.The changes in the rudder angle during the ship navigation are shown in Figure 15.In this paper, an A-star algorithm was used to plan the SMASV navigation route.The blue point is the waypoint planned by the A-star algorithm, which was also defined as the virtual target point of the SMASV in different stages of the decision-making algorithm.The SMASV needs to sail to the virtual target point.If the SMASV does not reach the waypoint due to the collision-avoidance task, it will not be punished.The red line is the planned route before sailing, the blue line is the actual sailing track during the voyage, the green pentagram is the destination, and the black circle is the unknown obstacle.To verify that the trained intelligent SMASV navigation system complies with the COL-REGs, we carried out an additional experiment.The three wide waters of Environments 3, 4, and 7 were used as the COLREGs verification environments.As shown in Figure 16, the red line represents the planning path, the black line represents the decision-making track, and there are three target ships in each environment.TS01 was used to simulate the head-on case, TS02 was used to simulate the overtaking case, and TS03 was used to simulate the crossing give-way and crossing stand-on cases.The encounter angle, the minimum encounter distance, and the collision-avoidance action of the SMASV and other TSs in the encounter process are illustrated in Table 4.In Environment 3, the SMASV and TS03 are in a crossing give-way situation.When the SMASV was 0.6 nautical miles away from TS03, the SMASV altered her course to starboard 16 • .Then, the SMASV and TS02 constituted an overtaking situation.The SMASV chose to alter her course to starboard 12 • , then altered her course to port 6 • to avoid TS02.Finally, the SMASV and TS01 constituted a head-on situation.When the SMASV was 0.5 nautical miles away from TS01, the SMASV altered her course to starboard 14 • to avoid TS01.
In Environment 4, the SMASV and TS02 are in an overtaking situation.The SMASV chose to alter her course to starboard 17 • , then alter her course to port 4 • to avoid TS02.Then, the SMASV and TS03 constituted a crossing stand-on situation.Because TS03 did not follow the COLREGs, the SMASV chose to alter her course to starboard 13 • .Finally, the SMASV and TS01 constituted a head-on situation.When the SMASV was 0.3 nautical miles away from TS01, the SMASV altered her course to starboard 10 • to avoid TS01.
In Environment 7, the SMASV and TS02 are in an overtaking situation.The SMASV chose to alter her course to starboard 15 • continuously to avoid TS02.Then, the SMASV and TS03 constituted a crossing give-way situation.The SMASV chose to alter her course to starboard 15 • .Finally, the SMASV and TS01 constituted a head-on situation.When SMASV was 0.4 nautical miles away from TS01, the SMASV altered her course to starboard 15 • , then altered her course to port 7 • to avoid TS01. Figure 17 shows the changes in the rudder angle during the ship navigation.At the end of the experiment, we set up an open water experimental environment to verify the applicability of the proposed collision-avoidance decision-making trained by the improved SAC algorithm.In the experiment, every SMASV was endowed with a network model trained in this study.The typical three-ship and four-ship encounter process is shown in Figure 18.The COLREGs do not specify the specific operation and collision-avoidance actions of each ship in the process of multi-ship encounters.However, the network model trained in this study can solve the problem of multi-ship encounters well, and each ship can reach the target position safely.

Conclusions
In this study, an SMASV autonomous navigation decision-making method based on an improved SAC algorithm was proposed.The A-star algorithm, which can plan a safe route effectively and accurately, was used for path planning before voyages.Furthermore, a deep-reinforcement learning algorithm was used to avoid unknown and dynamic obstacles during the voyage.
The contributions of this study are as follows: • An improved SAC algorithm based on maximum-entropy reinforcement learning has a better convergence speed of average reward value than other traditional DRL algorithms based on strategy learning.In addition, the simulation results showed that the neural network model trained by the proposed method had a strong generalization ability, which can solve unknown-environment and dynamic-environment decisionmaking problems during the voyage.

•
The proposed reward function allowed the SMASV to follow the COLREGs in the decision-making process.In this study, it was proved by experiments that this process can solve the problem of multiple SMASV encounters in complex situations and assist ship navigation well, and that it has high engineering practicability.
Further efforts should be carried out in the following aspects.The obstacles in the training environment were considered as cylinders.The geometry of the obstacles is crucial in simulating a real navigation environment.Therefore, this task should be considered in future research.Moreover, the simulation environment of this study was static waters, and the influence of wind and waves on the SMASV rudder angle was not considered.In future work, a robust control methodology should be designed to perform the decision-making

Figure 1 .
Figure 1.Intelligent navigation system for the SMASV: part (a) is the path planner; part (b) is the sensor module; part (c) is the decision-making module; and part (d) is the control module.

Figure 2 .
Figure 2. Electronic chart rasterization process.Subfigure (a) is the real sea environment, and subfigure (b) is the rasterized electronic chart environment.

Figure 4 .
Figure 4. Collision avoidance in four situations, based on the COLREGs.Among them, (a) is Headon situation, (b) is Overtaking situation, (c) is Crossing give-way situation, and (d) is Crossing stand-on situation.(a) Head-on situation: When two vessels (OS and TS) are meeting on reciprocal or nearly reciprocal courses within an azimuth angle of (0 • , 5 • ) or (355 • , 360 • ), this situation should be judged as a head-on situation.Each vessel (OS and TS) should alter her course to starboard so that each will pass on the port side of the other.The head-on situation is shown in Figure 4a.(b) Overtaking situation: A vessel (OS) is deemed to be overtaking when coming up to another vessel (TS) from a direction within the range of (112.5 • , 247.5 • ) abaft her beam.The OS vessel should alter her course to starboard or port to give way to the TS.The overtaking situation is shown in Figure 4b.(c) Crossing give-way situation: When one vessel (TS) is located within (5 • , 112.5 • ) of the other vessel's (OS)'s azimuth, this situation is considered as the crossing give-way.The OS vessel should alter her course to starboard to avoid the TS vessel.The crossing give-way situation is shown in Figure 4c.(d) Crossing stand-on situation: When one vessel (TS) is located within (247.5 • , 355 • ) of the other vessel's (OS) 's azimuth, this situation is considered as the crossing stand-on.The OS vessel needs to maintain course and speed, and make a timely observation of the TS vessel's dynamics.When the TS and OS pose a collision threat, the OS should alter her course to starboard or port to avoid collision.The crossing stand-on situation is shown in Figure 4d.

Figure 5 .
Figure 5. Differences between the traditional reinforcement learning and the maximum-entropy reinforcement learning.Among them, the red pentagram represents the target point, the black fork represents no navigable path.
correspond to two paths.The high peak represents Path 1, and the low peak represents Path 2. Traditional reinforcement-learning methods typically use a policy distribution centered on the largest Q value, such as the Gaussian distribution represented by the red curve in Figure 5.At this time, the ship's exploration will be completely biased toward Path 1, thus completely ignoring Path 2. Although the agent can learn better behavior under this strategy distribution, once the environment changes, Path 1 is blocked by unknown obstacles.Since the ship only learns Path 1, in order to reach the target again, Path 2 must be learned from scratch.Therefore, the soft Bellman equation is used: ) where x p and y p are the two-dimensional plane coordinates of the SMASV, x g , y g are the two-dimensional plane coordinates of the target point, and (ζ 1 , ζ 2 , ζ 3 ...ζ 24 ) are the 24 laser radar vector lines.To reduce the complexity of the calculation and the reward function, a continuous action space is be set up.Since assuming that the SMASV can only alter her course to port or starboard, or to maintain her course through the rudder steering control, the action space of the SMASV was designed considering the following action: [−20 • ~20 • ].

Figure 6 .
Figure 6.Improved SAC algorithm and environment interaction process.The Q network contains four fully connected layers, with 256 neurons in each layer, using the ReLU activation function.The strategy network comprises two layers of fully connected layers; each layer has 256 neurons, using the ReLU activation function.

Figure 10 .
Figure 10.The loss curves for the target−Q1 network and the actor network in the improved SAC algorithm. ff

Figure 11 .
Figure 11.The loss curves for the Actor network and the Critic network in the different DRL algorithm.

Figure 12 .
Figure 12.The comparison of the average−reward−convergence curves of the improved SAC algorithm with AC, DDPG, PPO, and improved PPO algorithm.

Figure 13 .
Figure 13.Electronic chart rasterization of eight water areas.

Figure 14 .
Figure 14.The planning and sailing tracks on the rasterized electronic chart.

Figure 17 .
Figure 17.The SMASV rudder angle in multiple ship encounter situations.

Figure 18 .
Figure 18.The typical three−ship and four−ship encounter sailing track.

Remark 4 .
The training process of the neural-network model requires strong computing power and has a long-time cost.However, it does not need to spend a lot of hash rate to apply the trained neural-network model to the SMASV.In the process of generalization experiments, we directly called the neural-network model trained in the training environment (ENV-Train) and did not need to retrain the network model according to the different environments.This was also the reason why deep-reinforcement learning is more suitable for solving ship-collision-avoidance problems.

Table 1 .
Summary of literature on ship-collision avoidance.

Table 3 .
Experimental data and network parameters.