An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning

Deep reinforcement learning (DRL) has excellent performance in continuous control problems and is widely used in path planning and other fields. An autonomous path planning model based on DRL is proposed to realize intelligent path planning of unmanned ships in unknown environments. The model utilizes the deep deterministic policy gradient (DDPG) algorithm: through continuous interaction with the environment and the use of historical experience data, the agent learns the optimal action strategy in a simulation environment. To ensure the validity and accuracy of the model, the navigation rules and the ship's encounter situations are transformed into navigation restricted areas, so that the planned path is safe. Ship data provided by the ship automatic identification system (AIS) are used to train this path planning model. Subsequently, an improved DRL model is obtained by combining DDPG with the artificial potential field. Finally, the path planning model is integrated into an electronic chart platform for experiments. Comparative experiments show that the improved model can achieve autonomous path planning, with good convergence speed and stability.


Introduction
With the density of maritime traffic increasing, various types of marine accidents occur frequently. According to the world maritime disaster records filed by the International Maritime Organization (IMO) from 1978 to 2008, 80% of marine accidents are caused by human factors [1]. Therefore, improving the autonomous driving level of ships has become an urgent problem. On the other hand, the environment faced by ships is more and more complex. In some cases, it is not suitable for manned ships to go to the workplace to carry out tasks, while unmanned ships are more suitable for dealing with the complex, changeable, and harsh environment at sea. This requires unmanned ships to have the ability of autonomous path planning and obstacle avoidance, so as to efficiently complete tasks and enhance their comprehensive operation ability [2].
With their strong autonomy and adaptability, unmanned ships have gradually become a new research direction pursued by industry [3]. Unmanned ships can not only perform tasks independently in dangerous sea areas, but also cooperate with manned ships to improve work efficiency. They are widely used in marine exploration, military tasks, material transportation, and other fields. Autonomous path planning is an essential key technology for improving the self-determination ability of unmanned ships. The autonomous path planning of unmanned ships needs to optimize the path in the safe navigation area according to certain navigation rules and crew experience. It can safely avoid obstacles and independently plan an optimal trajectory from the known starting point to the target point. Unmanned ships often face complex and variable navigation environments. Campbell, S et al. [22] presented a USV real-time path planning method based on an improved A* algorithm, which is integrated with a decision-making framework and combined with COLREGS. The results show that the method realizes real-time path planning of the USV in a complex navigation environment. However, the A* algorithm relies on the design of the grid map, and the size and number of grids directly affect the calculation speed and accuracy of the algorithm. Xue, Y et al. [23] introduced a path planning method for unmanned ships based on the artificial potential field (APF), combined with COLREGS. The experimental results showed that the method could effectively realize unmanned ship path search and collision avoidance in complex environments. However, this method has difficulty dealing with autonomous path planning and collision avoidance in an unknown restricted navigation environment.
In addition, many intelligent algorithms, such as genetic algorithms, ant colony optimization algorithms, and neural network algorithms, have also been applied to the autonomous path planning problem of unmanned ships. For example, Vettor, R et al. [24] used an optimized genetic algorithm, taking the environmental information as the initial population, to obtain a navigation path that satisfies the requirements. Lazarowska, A et al. [25] used ant colony optimization to transform ship path planning and collision avoidance into a dynamic optimization problem, with collision risk and range loss as the objective function. The optimal planned path and collision avoidance strategy were obtained based on the motion prediction of dynamic obstacles. Xin, J et al. [26] adopted an improved genetic algorithm that increases the number of offspring by using multi-domain inversion. The results show that the algorithm achieves a desirable balance between path length and time cost, with a shorter optimal path, a faster convergence speed, and better robustness. Xie, S et al. [27] proposed a predictive collision avoidance method for under-actuated surface ships based on an improved beetle antennae search (BAS) method. A predictive optimization strategy for real-time collision avoidance is established, with COLREGS used as a constraint condition while minimizing safety and economic costs. Simulation experiments verify the effectiveness of the improved BAS method. However, such intelligent algorithms are usually computationally intensive; they are mainly used in offline global path planning or auxiliary decision making, and they are difficult to apply to real-time ship action decision problems.
In the field of intelligent ships, applying DRL to the control of unmanned ships has gradually become a new research direction. For example, Chen, C et al. [18] introduced a path planning and maneuvering method for unmanned cargo ships based on Q-learning. This method learns an action reward model and obtains the best action strategy. After enough rounds of training, the ship can find the right path or navigation strategy by itself. Fu, K et al. [28] proposed a ship rotation detection model based on a feature fusion pyramid network and DRL (FFPN-RL). The ship's independent guidance and docking function is realized by applying the Dueling Q network to the inclined ship detection task. Yang, J et al. [29] presented autonomous navigation control for unmanned ships based on the relative value iterative gradient (RVIG) algorithm, and designed the ship's navigation environment using the Unity3D game engine. The simulation results showed that the unmanned ships could successfully avoid obstacles and reach the destination in a complex environment. Shen, H.Q et al. [30] offered a method based on the Dueling DQN algorithm for automatic collision avoidance of multiple ships, combining ship manoeuvrability, crew experience, and COLREGS to verify the path planning and collision avoidance capability of unmanned ships. Zhang, R.B et al. [31] proposed a behavior-based USV local path planning and obstacle avoidance method built on the Sarsa on-policy algorithm, and tested it in a real marine environment. Wang, Y et al. [32] introduced a USV course tracking control plan based on the DDPG algorithm and achieved good experimental results. Zhang, X et al. [33] proposed an adaptive navigation method for maritime autonomous surface ships based on hierarchical DRL. The method combines ship maneuverability and the COLREGS navigation rules for training and learning.
The results show that the method can effectively improve navigation safety and avoid collisions. Zhao, L et al. [34] used the proximal policy optimization (PPO) algorithm, combined with a ship motion model and navigation rules, to propose an autonomous collision avoidance model for unmanned ships in a multi-ship environment. The experimental results show that the model could obtain time-efficient and collision-free paths for multiple ships, and it has good adaptability to unknown complex environments. In addition, as the human factor is composed of the navigators' subjectivity and the indeterminacy of the navigation situation, the actual ship control process usually has a game character [35]. DRL overcomes the shortcoming of conventional intelligent algorithms, which require a certain number of samples. At the same time, it has smaller errors and shorter response times.
Many important autonomous path planning methods have been proposed in the field of unmanned ships. However, these methods mainly focus on small and medium-sized USVs, while research on unmanned ships is relatively rare, and few experts currently apply DDPG to unmanned ship path planning. This paper chooses DDPG for unmanned ship path planning because it has a powerful deep neural network function fitting ability and better generalized learning ability. The fitting ability can achieve higher precision when approximating the motion of the ship, and the learning ability can enable the ship to obtain action outputs close to human characteristics from the decision behavior, which is helpful in further developing human-like traffic models. In addition, the outputs of algorithms such as Q-learning and DQN are discrete behaviors, which leads to discontinuous and incomplete actions. DDPG has the advantages of fast convergence speed and a continuous action space. Compared with traditional path planning, the method proposed in this paper has more continuous action output and less decision error when the unmanned ship is sailing.

Deep Reinforcement Learning
As an important field of machine learning, DL can extract high-precision feature samples from the original input data. This method has greatly promoted the development of visual object recognition, object detection, and voice recognition. DL uses deep neural networks (DNN) to automatically learn the features of high-dimensional data. The core idea of DL is to update the network parameters through the back propagation method and then discover the distributed characteristics of the data from the training process.
RL, as another field of machine learning, is widely used in robot control, game simulation, and scheduling optimization. The purpose of RL is to maximize the cumulative reward obtained by the agent during training and to learn the optimal strategy for achieving the goal. At the same time, the agent directly interacts with the environment, replacing a large amount of training data by evaluating the action value. The principle is shown in Figure 1. The agent observes the state s t from the environment and takes action a t based on a certain policy. The environment then feeds back a reward r t for the executed action and moves to the new state s t+1 . The agent uses the new state and reward to select the next action, and this process repeats so as to maximize the long-term cumulative reward.
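The interaction loop described above can be sketched in a few lines of Python. The one-dimensional environment, its reward logic, and the random placeholder policy below are illustrative assumptions, not the ship simulator used in this paper:

```python
import random

class ToyEnvironment:
    """A hypothetical 1-D environment: the agent moves toward a goal position."""
    def __init__(self, goal=10.0):
        self.goal = goal
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # Apply the action, then reward progress toward the goal.
        old_dist = abs(self.goal - self.state)
        self.state += action
        new_dist = abs(self.goal - self.state)
        reward = old_dist - new_dist  # positive if the agent moved closer
        done = new_dist < 0.5
        return self.state, reward, done

env = ToyEnvironment()
state = env.reset()
total_reward = 0.0
for t in range(100):
    action = random.uniform(-1.0, 1.0)  # placeholder policy, not a learned one
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

In a full RL setup, the placeholder policy would be replaced by one that is updated from the observed rewards.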
The DRL is an end-to-end learning method that combines DL and RL, using the perception ability of DL and the decision-making ability of RL [36]. It has made great progress in continuous motion control and can effectively remedy the shortcomings of traditional unmanned ship path planning. DDPG is a type of algorithm in DRL that can be used to solve continuous action space problems. Among them, deep refers to the deep network structure, and policy gradient is the strategy gradient algorithm, which can randomly select actions in the continuous action space according to the learned strategies (action distribution). The purpose of deterministic is to help the policy gradient avoid random selection and output a specific action value. The DDPG is gradually being applied to the transportation sector, especially in the areas of driverless cars and unmanned ships. These vehicles (buses, small cars, boats, cargo ships, etc.) often require continuous motion control as well as traffic rules and human operating experience to ensure driving safety. The DDPG can solve the above problems well with its powerful self-learning and function fitting abilities. Therefore, DDPG has broad prospects and expansion space in the maritime domain.


AC Algorithm
DDPG is based on AC algorithm, whose structure is shown in Figure 2.
The network structure of the AC framework includes a policy network and an evaluation network.
The policy network is called an Actor network, and the evaluation network is called a Critic network. The Actor network is used to select actions corresponding to the DDPG, and the Critic network evaluates the merits and demerits of the selected actions by calculating the value function.

DDPG Algorithm Structure
The Actor network and the Critic network are two separate networks that share state information. The Actor network uses state information to generate actions, while the environment feeds back rewards for the resulting actions. The Critic network uses the state and reward to estimate the value of the current action and constantly adjusts its own value function. Meanwhile, the Actor network updates its action strategy in the direction of improving the value of the action. In this cycle, the Critic network evaluates the action strategy by means of a value function, giving the Actor network a better gradient estimate and finally obtaining the optimal action strategy. Evaluating the action strategy in the Critic network is very important, as it is conducive to the convergence and stability of the Actor network. These features ensure that the AC algorithm can obtain the optimal action strategy with a lower-variance gradient estimate.

DDPG Algorithm
Figure 3 shows the structure of the DDPG algorithm. The DDPG algorithm takes the information of the initial state as input, and the output result is the action strategy µ(s t ) calculated by the algorithm. Afterwards, random noise is added to the action strategy to obtain the final output action; this process is a typical end-to-end learning mode. When starting the task, the agent outputs an action according to the current state s t . To verify the validity of the output action, a reward function is designed and the action is evaluated, thereby obtaining a feedback reward r t from the environment. An action that helps the agent achieve the goal is given a positive reward; otherwise, a negative reward is given. Afterwards, the current state, the action, the reward, and the state at the next time step (s t , a t , r t , s t+1 ) are stored in the experience buffer pool. At the same time, the neural network trains on this experience and continuously adjusts the action strategy by randomly extracting sample data from the experience buffer pool, and uses the gradient descent approach to update and iterate the network parameters, so as to further enhance the stability and accuracy of the algorithm.
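The experience buffer pool described above can be sketched as follows; the capacity value and the toy transitions are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer pool; the oldest transitions are evicted when full."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for i in range(5):
    buffer.store(i, 0.1 * i, float(i), i + 1)  # toy (s, a, r, s') transitions
batch = buffer.sample(3)
```

Because `deque(maxlen=...)` discards the oldest entry on overflow, this also realizes the buffer-update behavior described later for the Experience Replay Memory module.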
DDPG builds on the AC algorithm and combines it with DQN in order to further enhance the stability and effectiveness of network training, which makes it more suitable for solving problems with continuous state and action spaces. Additionally, DDPG uses DQN's experience replay memory and target network to solve the problem of non-convergence when using a neural network to approximate the value function. Meanwhile, DDPG subdivides the network structure into an online network and a target network. The online network is used to output actions in real time, evaluate actions, and update network parameters through online training; it includes the online Actor network and the online Critic network. The target network includes the target Actor network and the target Critic network, which are used to update the value network system and the Actor network system, but do not carry out online training and updating of network parameters. The target network and the online network have the same neural network structure and initialization parameters. In the training process, the parameters of the target Actor network and target Critic network are updated by slow change (soft replace), instead of directly copying the parameters of the online Actor network and online Critic network, so as to further enhance the stability of the training process. The following formulas describe how to update the parameters.
θ Q′ ← τθ Q + (1 − τ)θ Q′ , θ µ′ ← τθ µ + (1 − τ)θ µ′ (1)

where θ µ and θ Q are the parameters of the online Actor network and the online Critic network, θ µ′ and θ Q′ are the parameters of the target Actor network and the target Critic network, and τ (τ ≪ 1) is the approximation coefficient. Figure 4 is the flow chart of the DDPG algorithm.
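The soft-replace update above can be expressed directly in code. The parameters are shown as plain Python lists, and the value of τ is an assumed illustration, not the setting used in this paper:

```python
TAU = 0.01  # approximation coefficient, tau << 1 (assumed value for illustration)

def soft_update(online_params, target_params, tau=TAU):
    """Soft replace: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online_params, target_params)]

online = [1.0, 2.0]   # stand-in online-network parameters
target = [0.0, 0.0]   # stand-in target-network parameters
target = soft_update(online, target)  # target drifts slowly toward the online values
```

Repeating this update every training step lets the target network track the online network slowly, which is what stabilizes the bootstrapped value targets.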


Structure Design of Autonomous Path Planning Model for Unmanned Ships Based on DDPG Algorithm
The DDPG algorithm is a combination of deep learning and reinforcement learning. Based on this algorithm, this paper designs an autonomous path planning model for unmanned ships. The algorithm structure mainly includes the AC algorithm, the experience replay mechanism, and neural networks. The model's output action becomes gradually more accurate by using the AC algorithm to output and evaluate the ship's action strategy. The experience replay mechanism memorizes the historical information produced when the DDPG algorithm is executed in the experimental environment, and the data are randomly extracted for training and learning. In the course of the experiment, the number of environmental states and behavioral states of the ship is large, so neural networks are needed for fitting and generalization. The current state of the unmanned ship obtained from the environment is used as the input of the neural network, and the output is the Q value of the actions that the ship can perform in the current state. The model learns the optimal action strategy in the current state by continuously training and adjusting the neural network parameters.
Model Structure

Figure 5 shows the model structure of unmanned ship path planning based on the DDPG algorithm. The model mainly includes the AC algorithm, the Environment (Ship Action Controller and Ship Navigation Information Fusion Module), and the Experience Replay Memory. Among them, the model obtains the environmental information and the ship's state data through the Ship Navigation Information Fusion Module, which serves as the input state of the AC algorithm. By randomly extracting data from the experience buffer pool for repeated training and learning, the model outputs the optimal ship action strategy, which satisfies the ship's maximum cumulative return during the learning process. Finally, unmanned ships can avoid obstacles and reach their destinations with the help of the Ship Action Controller.

• AC Algorithm
The AC algorithm includes the Actor network and the Critic network, each of which includes an online network and a target network. The two networks have the same network structure, and they adopt the experience replay technique to randomly extract ship sample data from the experience buffer pool for network training.

• Environment
The training environment of the autonomous path planning model for unmanned ships mainly includes the Ship Action Controller and the Ship Navigation Information Fusion Module. When the ship performs an action, the environment feeds back the reward and the state of the ship.

• Ship Action Controller
The Ship Action Controller converts the model's output action into a heading deflection and a speed increment, which are the actions the unmanned ship should perform in the current state.

• Ship navigation information fusion module
The module can receive and process data from the global positioning system (GPS), the automatic identification system (AIS), the depth sounder, and the anemometer in real time (the information can be integrated and displayed on the electronic chart platform), and it provides the unmanned ship's state information, including the ship's position, heading, speed, the distance between the ship and obstacles, and the angle between the ship and the target point.

• Experience replay memory
The function of this module is to store the state of the ship, the action, the reward, and the state at the next moment. When the maximum capacity of the experience buffer pool is reached, old data are replaced by new data, and the buffer pool is updated in this way.
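As one illustration of the Ship Action Controller, a network output in [−1, 1] can be scaled to a heading deflection and a speed increment. The limit constants below are hypothetical; the paper does not specify them:

```python
# Hypothetical limits; the paper does not specify the exact scaling constants.
MAX_HEADING_DEFLECTION = 30.0  # degrees per decision step
MAX_SPEED_INCREMENT = 2.0      # knots per decision step

def ship_action_controller(heading_action, speed_action):
    """Map network outputs in [-1, 1] to a heading deflection and a speed increment."""
    heading_action = max(-1.0, min(1.0, heading_action))  # guard against out-of-range inputs
    speed_action = max(-1.0, min(1.0, speed_action))
    return (heading_action * MAX_HEADING_DEFLECTION,
            speed_action * MAX_SPEED_INCREMENT)

deflection, speed_delta = ship_action_controller(0.5, -1.0)
```

The clipping step mirrors the Tanh-bounded output of the Actor network described later, so the controller never issues a command outside the physical limits.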
The DDPG algorithm pseudo code is as follows.
Algorithm 1: Pseudo code of the DDPG algorithm.
1: Randomly initialize the Critic network Q(s, a|θ Q ) and the Actor network µ(s|θ µ ) with weight parameters θ Q and θ µ .
2: Initialize the target Critic network Q′ and the target Actor network µ′ with weight parameters θ Q′ ← θ Q and θ µ′ ← θ µ .
3: Initialize the experience replay memory D.
4: for episode = 1 to M do
5:   Initialize the random process N in the action exploration strategy.
6:   Input the initial unmanned ship and environment observation state s 1 : ship latitude and longitude, ship heading, ship speed, angle with the target point, distances from obstacles.
7:   for t = 1 to T do
8:     Choose ship heading and ship speed a t = µ(s t |θ µ ) + N t based on the current strategy µ(s t ) and the exploration noise N.
9:     Implement output action a t to get the reward r t and the new state s t+1 .
10:    Save transition (s t , a t , r t , s t+1 ) into D.
11:    Sample a random batch of N transitions (s i , a i , r i , s i+1 ) from D.
12:    Set y i = r i + γQ′(s i+1 , µ′(s i+1 |θ µ′ )|θ Q′ ).
13:    Update the online Critic network by minimizing the loss: L = (1/N)Σ i (y i − Q(s i , a i |θ Q ))².
14:    Update the online Actor network using the sampled policy gradient: ∇ θ µ J ≈ (1/N)Σ i ∇ a Q(s, a|θ Q )| s=s i , a=µ(s i ) ∇ θ µ µ(s|θ µ )| s i .
15:    Update the target Actor network and the target Critic network: θ Q′ ← τθ Q + (1 − τ)θ Q′ , θ µ′ ← τθ µ + (1 − τ)θ µ′ .
16:   end for
17: end for
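Steps 12 and 13 of Algorithm 1 can be checked numerically with stand-in values for the network outputs. The reward, Q-value, and discount-factor numbers below are purely illustrative assumptions:

```python
GAMMA = 0.99  # discount factor (assumed value for illustration)

def critic_target(rewards, next_q_values, gamma=GAMMA):
    """Step 12: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]

def critic_loss(targets, q_values):
    """Step 13: mean squared Bellman error (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

# Stand-in batch: rewards, target-network Q values, online-network Q values.
rewards = [1.0, 0.0, -1.0]
next_q = [2.0, 2.0, 2.0]
q_vals = [2.5, 1.5, 1.0]
y = critic_target(rewards, next_q)
loss = critic_loss(y, q_vals)
```

In a real implementation, minimizing this loss by gradient descent over the Critic's parameters replaces the direct arithmetic shown here.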

Structural Design of AC Algorithms
The input parameters of the Actor network and the Critic network all come from the ship state data provided by the Ship Navigation Information Fusion Module. The Actor network and the Critic network continuously calculate and update their network parameters as ship state information is input, and finally produce better network outputs, which are respectively the action of the unmanned ship and the value of that action. Figure 6 shows the specific network structure of the AC algorithm. The Actor network structure contains two fully connected hidden layers.
The number of neurons is 300 and 600, respectively. The output nodes of each hidden layer are nonlinearly processed by the activation function to limit the number of output actions to one. First, the Actor network enters the initial ship state information s t , calculates it through two hidden layers of neurons, and then uses the ReLU activation function to limit the output of each layer. At the same time, in the last layer of the network, the Tanh activation function is used to limit the network single output action value a t between [−1, 1], and the Ship Action Controller is then used to convert the network output action into the actual operation action.
The Critic network and The Actor network have the same network structure. The number of neurons of the two hidden layers is 200, and the output node of each hidden layer also adopts the activation function for non-linear processing. The evaluation value of the output action of Actor network is the final network result, which is known as the Q value. Different from the input parameters of the Actor network, the Critic network takes the initial state of the ships s t , and the output action of the Actor network a t as the input parameters. After two hidden layers' calculation, the output is processed by ReLU activation function, but when the final action value Q is output, no activation function is used to ensure that the output result of network is the definite action value. Additionally, the action value is used to evaluate the output action of Actor network.
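The two networks described above can be sketched as plain forward passes. This is a minimal numpy illustration, not the paper's TensorFlow implementation: the layer sizes (300/600 for the Actor, 200 for the Critic), the ReLU hidden activations, the Tanh-bounded action, and the unactivated Q output follow the text, while the random weight initialisation is purely illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class ActorNet:
    """Sketch of the Actor: two fully connected hidden layers
    (300 and 600 neurons, ReLU) and a tanh output bounding a_t to [-1, 1]."""
    def __init__(self, state_dim, rng):
        self.w1 = rng.normal(0, 0.1, (state_dim, 300)); self.b1 = np.zeros(300)
        self.w2 = rng.normal(0, 0.1, (300, 600));       self.b2 = np.zeros(600)
        self.w3 = rng.normal(0, 0.1, (600, 1));         self.b3 = np.zeros(1)

    def forward(self, s):
        h1 = relu(s @ self.w1 + self.b1)
        h2 = relu(h1 @ self.w2 + self.b2)
        return np.tanh(h2 @ self.w3 + self.b3)   # a_t in [-1, 1]

class CriticNet:
    """Sketch of the Critic: takes (s_t, a_t), two hidden layers of
    200 neurons with ReLU, and a linear output so Q is unbounded."""
    def __init__(self, state_dim, action_dim, rng):
        d = state_dim + action_dim
        self.w1 = rng.normal(0, 0.1, (d, 200));   self.b1 = np.zeros(200)
        self.w2 = rng.normal(0, 0.1, (200, 200)); self.b2 = np.zeros(200)
        self.w3 = rng.normal(0, 0.1, (200, 1));   self.b3 = np.zeros(1)

    def forward(self, s, a):
        x = np.concatenate([s, a])
        h1 = relu(x @ self.w1 + self.b1)
        h2 = relu(h1 @ self.w2 + self.b2)
        return h2 @ self.w3 + self.b3            # Q value, no activation
```

The absence of an output activation in the Critic is what lets the Q value range over all reals, as the text requires.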

Action Strategy Design
In the process of DRL, the relationship between exploitation and exploration needs to be handled correctly. An appropriate action exploration strategy lets the agent try more new actions and avoid falling into a local optimum. In this paper, the action strategy of the unmanned ship comprises an action control strategy and an action exploration strategy. When designing the neural network structure, random noise is added to the output action of the Actor network as the exploration strategy, and the output action is converted into a specific executed action, according to its actual physical meaning, as the control strategy.

• Action control strategy
In terms of action control, one of the output actions of the Actor network indicates the deflection heading a_steering, and the Tanh activation function limits the output value to the range [−1, 1]. D_max indicates the maximum deflection angle, whose value range is [0°, 35°], and Steering represents the actual deflection applied to the ship. It is calculated as follows:

Steering = a_steering × D_max

The other output action of the Actor network is the ship speed increment a_shifting, which is also processed by the Tanh activation function, with an output range of [−1, 1]. V_max is the maximum change of the ship's speed, with a value range of [0, 15] kn, and Shifting is the actual change of ship speed:

Shifting = a_shifting × V_max

• Action exploration strategy
In terms of action exploration, since DDPG is an off-policy algorithm, effective exploration in the continuous control space enables the unmanned ship to find better actions. Exploration is realized by adding random noise to the output action of the neural network, defined as follows:

µ′(s_t) = µ(s_t) + N_t

where µ is the exploration strategy and N_t is the added random noise. The Ornstein-Uhlenbeck (OU) process is a temporally correlated process. It is often used as the random noise of the DDPG algorithm and works well in continuous action spaces. In this paper, it is used as the random noise of the unmanned ship's action strategy and is defined as:

dX_t = θ(µ − X_t)dt + σ dW_t

In the above formula, θ is the speed at which the variable reverts to the mean, µ is the action mean value, σ is the fluctuation degree of the random process, and W_t is the Wiener process.
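The OU process above can be discretised in a few lines. This is a sketch, assuming a unit time step; the parameter values (µ = 0, θ = 0.6, σ = 0.3) follow the training setup reported later in the paper, and the fixed seed is only for reproducibility.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, as used by DDPG.
    Discretised as x += theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""
    def __init__(self, mu=0.0, theta=0.6, sigma=0.3, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = mu  # start the process at its mean

    def sample(self):
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * np.sqrt(self.dt) * self.rng.normal())
        return self.x

noise = OUNoise()
# exploratory action = deterministic policy output mu(s_t) plus noise N_t
exploratory_action = np.tanh(0.2) + noise.sample()
```

Because consecutive OU samples are correlated, the perturbed actions drift smoothly rather than jittering, which suits the inertial dynamics of a ship.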

Data Preparation
Based on the ship AIS system, this paper collects the real AIS data of 50 ships on the Dalian-Yantai route from August 1st to 15th, 2018. Through analysis of these data, complete ship navigation data can be obtained, for example: longitude, latitude, speed over ground (SOG), course over ground (COG), heading (HDG), and other information. Table 1 shows the main AIS information. This paper mainly selects the longitude, latitude, heading, and speed of the ship as the input parameters of the model. Since the data are collected in real time, the trajectory of each ship can be clearly displayed. The total mileage is 133,550 nautical miles and the data volume is around 3 GB. This paper randomly selects the data of 20 ships from the 50 as the input data for model training. These data were normalized prior to model training to speed up training and improve the accuracy of the model.
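The normalization step mentioned above can be illustrated with simple min-max scaling. This is a minimal sketch; the paper does not state which normalization scheme it uses, and real AIS preprocessing would also handle missing and out-of-range records.

```python
def min_max_normalize(values):
    """Min-max normalisation: scales a field (e.g. SOG, COG, latitude)
    into [0, 1]. A constant field is mapped to all zeros by convention."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# e.g. normalising a list of speed-over-ground readings in knots
sog = [0.0, 7.5, 15.0]
print(min_max_normalize(sog))  # -> [0.0, 0.5, 1.0]
```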

Design of Reward Function
On the one hand, the unmanned ship needs to maintain its exploration ability to obtain better actions; on the other hand, it also needs to exploit the learned actions to obtain more reward from the environment, so as to fully explore the environment space and obtain the optimal action strategy.
The reward function is also known as the immediate reward or reinforcement signal R. When an unmanned ship performs an action, the environment returns feedback based on that action to evaluate its performance. The reward function is designed from the environment and the decision maker. It is usually a scalar, with a positive value as reward and a negative value as punishment. Obeying the navigation rules is very important for the practicability of the algorithm in the autonomous path planning of unmanned ships.
According to the COLREGS, ship encounters can generally be divided into three types: head-on, crossing, and overtaking. When the sea environment has good visibility, different ship encounters are divided according to COLREGS, as shown in Figure 7. The A region covers (0°, 005°) and (355°, 360°) and is called the head-on situation; the ship should take a steering action of more than 15° to the right. The B region is a crossing situation with a range of (247.5°, 355°). The C region is the overtaking situation, with a range of (112.5°, 247.5°); the ship being overtaken does not usually take action. The D region is a crossing situation with a range of (005°, 112.5°); the ship should take right-turning action. In addition, when the visibility of the sea environment is restricted, there is no separation of responsibility between the stand-on vessel and the give-way vessel. The COLREGS prescribe the following for this situation: for coming ships in the range of (0°, 90°) and (270°, 360°), the self-ship turns to the right; for coming ships in the range of (90°, 180°) and (180°, 270°), the self-ship turns toward the other ship. This article mainly studies the situation of good visibility at sea.
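The region boundaries above translate directly into a classification over the other ship's relative bearing. A minimal sketch follows; the paper does not say how exact boundary values are assigned, so placing each boundary in the lower region is an assumption of this illustration.

```python
def encounter_region(relative_bearing):
    """Classify an encounter by the other ship's relative bearing in
    degrees, following the COLREGS regions A-D quoted in the text:
    A (head-on), D (crossing, give way), C (overtaking), B (crossing)."""
    b = relative_bearing % 360.0
    if b < 5.0 or b >= 355.0:
        return "A: head-on"
    if b < 112.5:
        return "D: crossing (give-way, turn right)"
    if b < 247.5:
        return "C: overtaking (stand on)"
    return "B: crossing"

print(encounter_region(3))    # A: head-on
print(encounter_region(50))   # D: crossing (give-way, turn right)
print(encounter_region(180))  # C: overtaking (stand on)
print(encounter_region(300))  # B: crossing
```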
The rules need to be mathematically quantified and converted into navigational restrictions in order to integrate the COLREGS and the crew's experience into the model. A navigation limit area is set when the distance between the ship and the other ship is less than six nautical miles; otherwise it is not set. The rule conversion in the three scenarios is as follows:
1. Head-on situation. In this case, both ships should turn to the right to avoid collision. The rule is converted as follows: a navigation limit line, three times the ship's length and at 10° clockwise from the bow, is drawn ahead of each of the two ships. If a ship crosses the navigation limit line, the environment punishes it: the ship receives a negative reward.
2. Crossing situation. If the ship is on the right side of the other ship, it should pass astern of the other ship. The rule is converted as follows: a navigation limit line equal to four times the ship's length is drawn in the direction of the other ship's bow, to prevent the self-ship from passing ahead of the other ship. If the ship crosses the other ship's navigation limit line, it is punished and receives a negative reward.
3. Overtaking situation. When the ship is behind the other ship and needs to overtake it, it should pass on either side of the other ship. The rule is converted as follows: a navigation limit line of 1.5 times the ship's length is drawn at the stern of the other ship, perpendicular to its course. If the ship crosses the navigation limit line, it is punished and receives a negative reward.
For the conversion of the crew's experience, this paper adds a virtual area around obstacles, based on the crew's experience, so that the ship can take avoiding action in advance. The obstacle area is set when the distance between the ship and the obstacle is more than 1.5 times the ship's length; otherwise it is not set. As with embedding COLREGS, the crew's experience is converted into a restricted area. To simplify calculation and unify the criteria, the obstacle is approximated by a regular shape (a circle): the center of the obstacle is the circle's center and the longer side of the obstacle is used as the radius. Once the ship enters the obstacle area, the environment punishes it; the reward obtained by the ship is the inverse of the distance between the ship and the obstacle, applied as a penalty. The closer the ship is to the obstacle, the stronger the punishment, and vice versa. The reward changes dynamically until the ship leaves the obstacle area.
By setting the navigation restriction area, the requirement that the ship sail according to COLREGS and the crew's experience is realized, thereby constraining and guiding the ship to select correct and reasonable behavior. The navigation rules are converted into navigational restrictions through the quantitative treatment of COLREGS and the crew's experience. When the ship crosses a restricted navigation area or collides with an obstacle, it is punished with a negative value; when the ship reaches the target point, the return value is positive. At the same time, a goal-based attraction strategy is adopted, in which the reward is positively correlated with the change in the distance between the ship and the target over adjacent time steps. The reward function is designed as follows:

R = 2, if d_t-goal < D_g-min
R = −1, if d_t-obs < D_o-min
R = (d_t′-goal − d_t-goal) − 0.1, otherwise

In the above formula, d_t-goal is the distance between the ship and the target point at the current time t, d_t′-goal is the distance between the ship and the target point at the previous time t′, d_t-obs is the dangerous distance between the ship and the obstacle at the current time t, D_g-min is the minimum threshold for the ship to reach the target point, and D_o-min is the minimum dangerous distance threshold between the ship and the obstacle.
During the training process, when the ship reaches the target range, that is, d_t-goal < D_g-min, the reward is set to 2. When it reaches the obstacle range, that is, d_t-obs < D_o-min, the reward is set to −1. In other cases, the attraction strategy draws the ship toward the target point as quickly as possible, in order to maximize the return value of the action. Whether the ship is approaching the target point is determined by subtracting the current distance to the target point from the distance at the previous time. Since the simulation environment is based on an 800 × 600 pixel two-dimensional map, 1 pixel represents 10 m in the actual environment. Here, (d_t′-goal − d_t-goal) is normalized: the distance difference is divided by the length of the diagonal of the two-dimensional map, so that the resulting reward value lies in [0, 1]. At the same time, 0.1 is subtracted at each step to reduce the number of steps needed to reach the target point and avoid redundancy in the planned path; the reward is calculated as (d_t′-goal − d_t-goal) − 0.1. The experiment contains many training rounds. The current round ends when the ship reaches the target point. If the ship collides with an obstacle, it is stepped back and re-selects an action. The maximum number of steps per round is set to 400; when it is reached, the current round ends and the next round begins.
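The reward scheme described above can be sketched as a single function. This follows the piecewise description in the text: +2 on reaching the goal, −1 on entering the obstacle range, and otherwise the normalized progress toward the goal minus the 0.1 per-step penalty. The argument names are illustrative, not from the paper.

```python
def reward(d_goal, d_goal_prev, d_obs, d_goal_min, d_obs_min, diag):
    """Piecewise reward reconstructed from the text. `diag` is the
    diagonal length of the 800x600-pixel map, used for normalisation."""
    if d_goal < d_goal_min:
        return 2.0   # target reached
    if d_obs < d_obs_min:
        return -1.0  # entered the obstacle range
    return (d_goal_prev - d_goal) / diag - 0.1  # shaped progress term

diag = (800**2 + 600**2) ** 0.5  # = 1000 pixels
print(reward(5.0, 5.0, 100.0, 10.0, 20.0, diag))     # goal reached -> 2.0
print(reward(400.0, 500.0, 10.0, 10.0, 20.0, diag))  # obstacle -> -1.0
print(reward(400.0, 500.0, 100.0, 10.0, 20.0, diag)) # 100/1000 - 0.1 = 0.0
```

Note that the shaped term is positive only when the ship closes more than one tenth of the map diagonal per step, which is what discourages redundant waypoints.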

Model Execution Process
The path planning model of unmanned ships based on the DDPG algorithm abstracts the real, complex environment and transforms it into a simple virtual environment. At the same time, the action strategy of this model is applied in the environment of the electronic chart platform, so as to obtain the optimal planned track of the unmanned ship and realize end-to-end learning in the real environment. Figure 8 shows the execution flow of the autonomous path planning model for unmanned ships based on the DDPG algorithm. The execution process of the model is as follows:
1. When the unmanned ship path planning program starts, the system reads the ship data through the ship navigation information fusion module.
2. The system invokes the model, takes the ship data as the input state, and obtains the ship's action strategy in the current state after the model's processing and calculation.
3. According to the actual motion of the ship, the model converts the action strategy into the actual action that the unmanned ship should take.
4. The ship action controller analyzes the acquired action and performs it in the current unmanned ship state.
5. The model obtains the state information of the unmanned ship at the next moment after the action is executed, and determines whether that state is the end state.
6. If it is not the end state, the model continues with the ship's state information at the next moment, calculates and judges what action the ship should take next, and repeats the cycle. If it is the end state, the unmanned ship has completed the path planning task, and the model ends the calculation and the invocation.
In this section, the unmanned ship first uses the obtained state information as the input to the algorithm. Second, the action strategy is obtained through training based on the navigation restricted area and the reward function. Subsequently, the parameters of the model are updated with historical experience data until the cumulative return is maximal. Finally, the ship executes the action and changes its state, and it determines whether to end the execution process by judging the current state.

Definition of Model Input and Output Parameters
The input parameters of the model are the ship data obtained from the ship navigation information fusion module, which mainly include the ship's own state, target point information, and surrounding information. The output of the model is the execution action of the unmanned ship, including the deflection heading and the speed increment. Table 2 shows the names and definitions of the input parameters used in the model. The angle between the current heading of the ship and the target point is designed as an input parameter so that the unmanned ship can be driven quickly toward the target point, preventing the model training period from becoming too long. At the same time, the obstacle risk within a range of 1000 m is calculated and stored in the set Tracks as the decision-making basis for ship obstacle avoidance. The set of distances to obstacles is differentiated according to the training situation, on the premise that there is a danger of collision, i.e., that the self-ship's heading extension line intersects another ship's heading extension line. In a single-ship encounter, Tracks contains only the distance information of that other ship; when multiple ships meet, Tracks contains the distance information of the most dangerous other ship. The most dangerous other ship is selected as follows: when the heading extension lines of several other ships intersect the self-ship's heading extension line, the smallest distance between the self-ship and the other ships is stored in Tracks. Finally, the above input states are each normalized to limit their range to [0, 1], thereby improving the calculation efficiency of the model. The deflection heading Steering and speed increment Shifting are used as the output actions of the model, realizing the direct conversion from input state to output action. Table 3 shows the output parameter definitions.
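The selection of the most dangerous other ship can be sketched as follows. The data representation (a distance paired with an intersection flag per other ship) is an assumption of this illustration, not the paper's data structure.

```python
def most_dangerous_distance(other_ships):
    """Build the Tracks entry described in the text: among other ships
    whose heading extension line intersects the self-ship's (i.e. those
    with a collision risk), keep the smallest distance. Each ship is a
    (distance_m, intersects) pair; returns None when no ship poses a risk."""
    risky = [d for d, intersects in other_ships if intersects]
    return min(risky) if risky else None

ships = [(1200.0, False), (800.0, True), (450.0, True)]
print(most_dangerous_distance(ships))  # -> 450.0
```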
According to the actual navigation situation, the maximum deflection heading of the unmanned ship is set to 35°. Because the ship steers slowly, it may not be able to complete the currently required deflection angle before receiving a new steering action. Therefore, an accumulation over the same repeated action is used to calculate the heading deflection angle dψ, which is obtained by Formula (7):

dψ = Σ_{i=1}^{k} a_i    (7)

where a_i represents the unmanned ship's deflection angle generated by the model at time i, and k indicates that the same action is performed k (k > 0) times in succession.
In addition, since the ship's heading at the previous moment ψ_p is known, the current heading ψ_d can be obtained by summing the previous heading and the deflection angle, as in Formula (8):

ψ_d = ψ_p + dψ    (8)
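Formulas (7) and (8) can be sketched directly in code. The wrap into [0°, 360°) is an assumption of this illustration; the paper does not state how heading overflow is handled.

```python
def accumulated_deflection(actions):
    """Formula (7): heading deflection d_psi accumulated over the same
    action repeated k times (a_i is the model's deflection output at
    step i, in degrees)."""
    return sum(actions)

def current_heading(psi_prev, d_psi):
    """Formula (8): current heading = previous heading + accumulated
    deflection, wrapped into [0, 360) degrees (assumed convention)."""
    return (psi_prev + d_psi) % 360.0

# a 10-degree turn command held for k = 3 steps
d_psi = accumulated_deflection([10.0, 10.0, 10.0])
print(current_heading(350.0, d_psi))  # -> 20.0
```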

Model Training Process
Python and the electronic chart platform are used for model training and simulation experiments in order to verify the validity and feasibility of the unmanned ship autonomous path planning model based on the DDPG algorithm. The model is trained with the deep learning framework TensorFlow. It is implemented in Python, and the convergence and accuracy of the training can be observed directly. The simulation environment is an electronic chart platform designed and developed with Visual Studio 2013. The state space of the simulation environment is a two-dimensional map of 800 × 600 pixels, and the ship's range of motion is set to the size of this map; the ship is considered to have collided if it crosses the boundary of the map. This paper first uses the DDPG algorithm to train a global route from the start point to the end point, and then trains on situations in which dynamic obstacles (other ships) are encountered during the voyage. Since planning the global static route is not the focus of this paper, we mainly study the part in which the ship avoids obstacles and reaches the end point.
After many experiments, a suitable structure and set of parameters for the neural networks were designed. The Actor and Critic networks both use a fully connected neural network with two hidden layers. The Actor network hyperparameters are set as follows: the numbers of neurons in the hidden layers are 300 and 600, respectively; the learning rate is 10^−4; the output-layer actions are the heading deflection and the speed increment; and different activation functions are selected according to the output action ranges. The heading deflection uses the tanh activation function, with output range [−1, 1]; the speed increment uses the sigmoid activation function, with output range [0, 1]. The Critic network hyperparameters are set as follows: the number of neurons in each hidden layer is 200, the learning rate is 10^−3, and the discount factor γ of the reward function is set to 0.9. In addition, the experience buffer pool size is set to 5000 and the batch learning sample size to 64. The random noise uses the method in Equation (5); µ is set to 0, θ to 0.6, and σ to 0.3. The rate of the soft update method is τ = 0.01. The Actor network and the Critic network both use the Adam optimizer. To prevent infinite training, the maximum number of steps per round is set to 400, for a total of 300 rounds. The parameters of the Actor and Critic networks are updated every 1000 steps in order to further improve the accuracy of the model during training. Based on the above training conditions and parameters, this paper trains the path planning model of the unmanned ship. The training process and results are described as follows. Figure 9 shows the number of training steps per round during the unmanned ship's autonomous path planning process.
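The soft update with rate τ = 0.01 mentioned above is the standard DDPG target-network update. A minimal sketch, with parameters represented as plain numpy arrays rather than TensorFlow variables:

```python
import numpy as np

TAU = 0.01  # soft-update rate from the training setup above

def soft_update(target, source, tau=TAU):
    """Polyak update of the target-network parameters:
    theta_target <- tau * theta_source + (1 - tau) * theta_target.
    `target` and `source` are dicts mapping parameter names to arrays."""
    for name in target:
        target[name] = tau * source[name] + (1.0 - tau) * target[name]

target = {"w": np.zeros(3)}
source = {"w": np.ones(3)}
soft_update(target, source)
print(target["w"])  # each entry moves 1% toward the source: 0.01
```

Because each step mixes only a small fraction of the learned weights into the target networks, the Q-learning targets change slowly, which is what stabilizes DDPG training.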
It can be seen that, in roughly the first 40 rounds, the training steps per round reach the maximum, indicating that the model triggers the termination condition of training and either fails to complete the path planning or falls into a local obstacle area. Between the 45th and 110th rounds, the training steps per round begin to decrease, with the average maintained at about 50 steps, indicating that the unmanned ship learns more and more action strategies and can independently plan a complete path. After about the 115th round, the training steps remain below 40 per round, indicating that the unmanned ship has fully learned the optimal action strategy and runs no collision risk. Near the 275th round, the number of training steps increases due to the exploration strategy of the algorithm, which makes the unmanned ship attempt random actions.
The purpose of the model is to improve the reward of the action strategy through continuous interaction with the environment: the greater the cumulative reward per round, the better the learning effect. Figure 10 shows the cumulative reward of each round of the model. In the first 40 rounds, the reward per round is low and fluctuates, indicating that the unmanned ship has not yet found the correct path and keeps trying new action strategies. Around the 45th round, the reward begins to increase, indicating that a path to the target point has been found. After the 115th round, the cumulative reward per round is basically maintained at its maximum, indicating that the unmanned ship has found the optimal action strategy. Compared with Figure 9, the trend of the cumulative reward per round is consistent with the change in the number of steps per round.
The average reward reflects the effect of the learning process and allows the degree of change in the reward to be observed more directly. Figure 11 shows the average reward of the model every 50 rounds. As can be seen from the figure, the general trend of the average reward is upward. Around the 45th round, the growth rate of the average reward begins to slow and then gradually levels off. After the 110th round, the average reward stabilizes at a large value, indicating that the model has found the optimal action strategy.

Model Integration and Simulation Experiment
The model is tested and observed in the simulation environment in order to verify the validity and correctness of the autonomous path planning model for unmanned ships in Section 4.2. Firstly, the electronic chart platform and ship navigation rules are briefly introduced. Secondly, the model is validated in three different situations to observe whether the unmanned ship is running correctly according to the navigation rules. Finally, the other two unmanned ship path planning methods are selected as comparative experiments, and the training process and simulation results of the three methods are compared and analyzed.

Verification Environment
It is difficult for traditional numerical simulation methods to accurately describe the environmental information, because unmanned ships operate over a wide range and their environment is complex. The electronic chart [37] platform is an important navigation tool for ships and other marine vehicles. It provides the real and complete environmental information needed in navigation, including land, ocean, water depth, obstacles, islands, etc.
In this paper, the electronic chart platform that was developed in C++ language is used as the verification environment of the model, as shown in Figure 12. The platform has the following main features: (1) Displaying standard chart information, which can be used in any scale chart interface.


Validation Results
The autonomous path planning model of unmanned ships is obtained through training, and the model is invoked by the electronic chart platform to further verify the effectiveness of the algorithm. The head-on situation, crossing situation, overtaking situation, and multi-ship encounter situations in the ship navigation process are tested, respectively, in order to verify whether the action strategy made by the model complies with the COLREGS. The parameter conventions used across the experiments are introduced here. Ship-1 represents the self-ship and Ship-i (i = 2, 3, . . . ) represents the other ships. Heading angle represents the initial angle of the ship and Ship Speed indicates the initial speed. The starting point and target point are expressed in latitude and longitude.
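The per-ship parameters used throughout Tables 4–11 (initial heading, speed, and latitude/longitude start and target points) can be collected in a small container. The field names and example values below are illustrative assumptions, not taken from the paper's tables:

```python
from dataclasses import dataclass

@dataclass
class ShipScenario:
    """Initial motion parameters for one ship in an encounter scenario.

    Field names are illustrative; the paper specifies a heading angle,
    a ship speed, and start/target points in latitude and longitude.
    """
    name: str            # e.g. "Ship-1" (self-ship) or "Ship-2"
    heading_deg: float   # initial heading angle, degrees clockwise from north
    speed_kn: float      # initial speed, knots
    start: tuple         # (latitude, longitude) of the starting point
    target: tuple        # (latitude, longitude) of the target point

# Hypothetical head-on setup in the spirit of Table 4 (values invented):
own = ShipScenario("Ship-1", heading_deg=0.0, speed_kn=12.0,
                   start=(38.90, 121.60), target=(39.00, 121.60))
```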
(1) Head-on case. Table 4 sets out the motion parameters of the ships in detail, and Figure 13 shows the experimental results. It can be seen from the waypoint information that Ship-1 turned to the right according to the rules, successfully avoided the other ship, and then headed for the target position. In this case, the planned path length is 6.592 nautical miles.
(2) Crossing case. The crossing of two ships is tested; Tables 5 and 6 show the specific ship motion parameters, and Figures 14 and 15 show the experimental results, respectively. In Figure 14, the other ship approached from the right side: Ship-1 acts as the give-way vessel with respect to Ship-2, deflects its heading to the right according to the rule, and passes the stern of Ship-2. In this case, the planned path length is 6.663 nautical miles. In Figure 15, the other ship approached from the left side; in this case, the planned path length is 6.462 nautical miles.
(3) Overtaking case. Table 7 shows the motion parameters of the two ships. From the test results and the waypoint information, Figure 16 shows that Ship-1 is on the port side of Ship-2 and passes as the give-way vessel astern of Ship-2, while Ship-2 maintains its heading as the stand-on vessel. In this case, the planned path length is 6.416 nautical miles.
(4) Multi-ship encounter case 1. Table 8 shows the motion parameters of the three ships. Ship-1 first forms a head-on encounter with Ship-2 and then a crossing encounter with Ship-3. According to the test results and waypoint information, Ship-1 first turns right, passing along the right side of Ship-2, then turns right again, passing astern of Ship-3, and finally reaches the target point. Figure 17 shows the scenario construction of multi-ship encounter case 1, and Figure 18 shows its verification results. In this case, the planned path length is 14.436 nautical miles.
(5) Multi-ship encounter case 2. Table 9 shows the motion parameters of the three ships. Ship-1 first forms a crossing encounter with Ship-2 and then an overtaking situation with Ship-3. According to the test results and waypoint information, Ship-1 first turns right, passing astern of Ship-2, then turns left, passing along the right side of Ship-3, and finally reaches the target point. Figure 19 shows the scenario construction of multi-ship encounter case 2, and Figure 20 shows its verification results. In this case, the planned path length is 13.651 nautical miles.
(6) Multi-ship encounter case 3. Table 10 shows the motion parameters of the three ships. Ship-1 first forms a crossing encounter with Ship-2 and then an overtaking situation with Ship-3. According to the test results and waypoint information, Ship-1 first turns left, passing astern of Ship-2, then turns left again, passing along the right side of Ship-3, and finally reaches the target point. Figure 21 shows the scenario construction of multi-ship encounter case 3, and Figure 22 shows its verification results. In this case, the planned path length is 15.264 nautical miles.
(7) Multi-ship encounter case 4. Table 11 shows the motion parameters of the four ships. Ship-1 first forms a crossing encounter with Ship-2, then a head-on encounter with Ship-3, and finally an overtaking encounter with Ship-4. According to the test results and waypoint information, Ship-1 first turns right, passing astern of Ship-2 and along the right side of Ship-3, then turns left, passing along the left side of Ship-4, and finally reaches the target point. Figure 23 shows the scenario construction of multi-ship encounter case 4, and Figure 24 shows its verification results. In this case, the planned path length is 18.713 nautical miles.
The above verification tests show that an unmanned ship based on the proposed model achieves a good path planning effect in both single-ship and multi-ship cases: it successfully avoids obstacles according to the navigation rules and finally reaches the target point.
At the same time, the time and path length recorded during path planning further indicate that the planned path is shorter and more consistent with actual navigation experience.
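The head-on, crossing, and overtaking situations exercised above are conventionally distinguished by the other ship's bearing relative to own heading. The sector boundaries below (roughly dead ahead for head-on, abaft 112.5° on either side for overtaking) follow common COLREGS practice; the exact thresholds are an assumption, not taken from the paper:

```python
def classify_encounter(relative_bearing_deg: float) -> str:
    """Classify an encounter by the other ship's bearing relative to own heading.

    Sector boundaries follow common COLREGS practice (an assumption here):
    roughly dead ahead -> head-on; abaft 112.5 degrees on either side ->
    overtaking; everything else -> crossing.
    """
    b = relative_bearing_deg % 360.0
    if b <= 6.0 or b >= 354.0:
        return "head-on"
    if 112.5 < b < 247.5:
        return "overtaking"
    return "crossing"

print(classify_encounter(2.0))    # head-on
print(classify_encounter(90.0))   # crossing
print(classify_encounter(180.0))  # overtaking
```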

Improved Model of Autonomous Path Planning
Although unmanned ship autonomous path planning based on the DDPG algorithm is implemented in Section 4.3.2, the planned path is redundant and not sufficiently smooth. By observing the training process in Section 4.2, it is found that DDPG training takes longer, the convergence speed is slower, and the algorithm easily falls into local iteration. This is because DDPG has no prior knowledge of the environment and can only select actions randomly in the initial stage of learning; therefore, its convergence is slow in complex environments. This section improves DDPG by adding the APF method, yielding an unmanned ship path planning model based on APF-DDPG, in order to improve learning efficiency in the initial stage and speed up convergence.
The APF is a method of virtual gravitational and repulsive fields, which has been widely used in real-time path planning of robots. The basic principle is to virtualize the simulation environment so that each state point in the environment has a corresponding potential energy value. The target point generates a gravitational potential field, the obstacles generate a repulsive potential field, and the total field strength is obtained by superimposing the two. The object approaches the target point under the gravitational field, while the repulsive field is used to avoid obstacles. Generally, the potential is calculated by the following formula:
U(s) = U_a(s) + U_r(s)    (9)
In the above formula, U_a(s) is the potential energy value of the gravitational field at point s, U_r(s) is the potential energy value of the repulsive field at point s, and U(s) is the total potential energy at point s. U_a(s) and U_r(s) are obtained by Formulas (10) and (11), respectively.
In Formula (10), k_a is the scale factor of the gravitational field and ρ_g(s) is the minimum distance between s and the target. In Formula (11), k_r is the scale factor of the repulsive field, ρ_ob(s) is the minimum distance between s and the obstacle, and ρ_0 is the influence range of the obstacle.
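Formulas (10) and (11) themselves do not survive in this excerpt. Under the standard artificial potential field definitions, and using the symbols introduced above, they presumably take the conventional forms (these exact expressions are an assumption):

```latex
% Conventional APF potentials (assumed forms of Formulas (10) and (11)):
U_a(s) = \tfrac{1}{2}\, k_a\, \rho_g^{2}(s) \tag{10}

U_r(s) =
\begin{cases}
\tfrac{1}{2}\, k_r \left( \dfrac{1}{\rho_{ob}(s)} - \dfrac{1}{\rho_0} \right)^{2}, & \rho_{ob}(s) \le \rho_0 \\
0, & \rho_{ob}(s) > \rho_0
\end{cases} \tag{11}
```

Under this convention the repulsion vanishes outside the obstacle influence range ρ_0, which is consistent with the ρ_0 parameter set in the experiments later in this section.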
For the path planning problem of unmanned ships, the immediate reward r is only obtained when the destination is reached or an obstacle is encountered. The sparsity of the reward function leads to low initial efficiency and many iterations, with a large number of invalid iterative searches, especially in large-scale unknown training environments. Therefore, the APF is constructed according to the position information of the target point and the obstacle points. The potential field value of each state then represents the maximum cumulative return V(s_i) of the state s_i, and the relation is expressed by Formula (12), in which U(s_i) is the potential field value of state s_i in the virtual potential field environment, and V(s_i) represents the maximum cumulative return when the optimal action is taken in state s_i.
The steps of the APF-DDPG based autonomous path planning method are as follows:
1. Construct the potential field according to the target point and the obstacles in the virtual environment, establishing the gravitational potential field with the target point as its center.
2. Define the potential energy value U(s_i) in the potential field as the maximum cumulative return V(s_i) under state s_i, according to Equation (12).
3. The ship explores the environment from the starting point and selects an action in the current state s_i. The environment is updated to the next state and an immediate reward r is received.
4. Update the Q value according to the state value function Q(s_i, a) = r + γV(s_i), and then update the online Critic network.
5. Observe whether the ship reaches the target point or the set maximum number of learning steps. If either condition is met, the round of learning ends and the next iteration starts; otherwise, return to step 3.
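The potential-field construction and Q-value target in the steps above can be sketched as follows. The 2-D Euclidean geometry, the sign convention for V, and the helper names are illustrative assumptions, not the paper's implementation:

```python
import math

KA, KR, RHO0 = 1.6, 1.2, 3.0   # scale factors and influence range from this section
GAMMA = 0.9                     # discount factor used in the paper's comparisons

def potential(s, goal, obstacles):
    """Steps 1-2: total potential U(s), used as the cumulative-return estimate."""
    rho_g = math.dist(s, goal)
    u = 0.5 * KA * rho_g ** 2                      # attractive term
    for ob in obstacles:
        rho_ob = math.dist(s, ob)
        if 0 < rho_ob <= RHO0:                     # repulsion only inside range
            u += 0.5 * KR * (1.0 / rho_ob - 1.0 / RHO0) ** 2
    return u

def q_target(r, s_next, goal, obstacles):
    """Step 4: Q(s, a) = r + gamma * V, with V read off the potential field.

    The sign convention is an assumption: U is negated so that states nearer
    the goal (lower potential) have higher value.
    """
    v_next = -potential(s_next, goal, obstacles)
    return r + GAMMA * v_next
```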
With the gravitational field added at the target point, the unmanned ship reaches the target faster and the selection of the action strategy becomes more stable. The experimental procedure based on the APF-DDPG method is described below.
The path planning experimental parameters based on the APF-DDPG are the same as the DDPG parameters in Section 4.3.2. The APF parameters are set as follows: k_a = 1.6, k_r = 1.2, ρ_0 = 3.0. To better compare APF-DDPG with DDPG, we choose the environments of cases (4), (5), and (7) in Section 4.3.2 as the comparison environments. Figure 25 shows the training process of APF-DDPG. From Figure 25a, it can be seen that the number of training steps per round begins to decline and converge in the 43rd round, and fluctuates in the subsequent training process, which is caused by the action exploration strategy of the algorithm. Figure 25b shows the reward value of each round: from the 43rd round, the reward begins to increase and then reaches its maximum, indicating that a better action strategy has been found. Figure 25c shows the average reward per round, which rapidly increases and then stabilizes at its maximum from the 43rd round.
Compared with experimental case (4) in Section 4.3.2, the results are shown in Figures 18 and 26: Figure 18 has more redundant paths and takes longer, while the path in Figure 26 is smoother, shorter, and takes less time. Similarly, compared with experimental cases (5) and (7) in Section 4.3.2, the results are shown in Figures 20 and 27 and in Figures 24 and 28, respectively. APF-DDPG outperforms DDPG in both the distance and the time of path planning; the comparison shows that APF-DDPG has a better decision-making level and faster convergence speed.
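The round-to-round fluctuation attributed above to the action exploration strategy is consistent with DDPG's usual additive exploration noise. The paper does not state which noise process it uses; the Ornstein-Uhlenbeck process below is the common choice from the original DDPG paper, with illustrative parameters:

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise
    commonly added to DDPG's deterministic action output."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * (self.dt ** 0.5) * random.gauss(0.0, 1.0)
        self.x += dx
        return self.x

noise = OUNoise()
# Exploratory rudder command: a deterministic policy output (10.0 here is a
# placeholder) plus scaled noise, clipped to the [-35, 35] degree action range.
action = max(-35.0, min(35.0, 10.0 + 35.0 * noise.sample()))
```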

Experimental Comparison and Analysis
In this section, five path planning methods are compared and experimentally analyzed: the DQN method, the AC method, the DDPG method, the Q-learning method [18], and the path planning method based on APF-DDPG. The five methods are trained to obtain the training steps, the reward, and the average reward under identical ship motion parameters and surrounding environment, and the training processes and performance data for autonomous path planning are analyzed and compared.
The five comparison methods use the same neural network structure and network parameters; the discount factor of the reward function is set to 0.9, and the experience buffer pool size is set to 5000. Among them, the Q-learning and DQN algorithms discretize the deflection heading action into 70 deflection angle values in the range of [−35°, 35°], while the other algorithms directly output a specific action value. Figure 29 shows the comparative experiment of autonomous path planning for unmanned ships.
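Discretizing the rudder action into 70 deflection-angle values over [−35°, 35°], as done here for Q-learning and DQN, can be sketched as follows; whether the paper includes both endpoints is not stated, so the endpoint handling is one plausible convention:

```python
def make_action_table(n=70, lo=-35.0, hi=35.0):
    """Evenly spaced deflection angles for the discrete-action methods.

    Endpoint handling is an assumption: both -35 and +35 degrees are
    included, giving a step of (hi - lo) / (n - 1) degrees.
    """
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

actions = make_action_table()
# A DQN/Q-learning agent picks an index into this table, whereas the
# DDPG/AC methods output a continuous angle directly.
```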
(a), (d), (g), (j), and (m) in Figure 29 are the experimental results for the number of steps per round obtained using the DQN, the AC, the DDPG, Q-learning, and the APF-DDPG, respectively. Comparing the number of execution steps per round, the number of steps based on the DQN (a) starts to decrease at about the 150th round, while the number of steps based on the AC (d) and the DDPG (g) begins to decrease and then gradually converges around the 100th round. The DQN does not converge to the minimum number of steps, but the AC and the DDPG both converge to the minimum number of steps, with the DDPG producing fewer fluctuations in the later process. In addition, the number of steps per round based on Q-learning (j) begins to decrease around the 150th round, and the number of steps per round based on the APF-DDPG (m) begins to decrease around the 43rd round. Comparing the five methods, the path planning method based on the APF-DDPG reaches the target point in fewer rounds, converges faster, and is better than the other algorithms in terms of stability.
(b), (e), (h), (k), and (n) in Figure 29 are the experimental results for the reward per round obtained by the DQN, the AC, the DDPG, Q-learning, and the APF-DDPG, respectively. The round reward based on the DQN (b) and the AC (e) starts to increase at about the 100th round and then gradually approaches the maximum reward, while the round reward based on the DDPG (h) starts to increase at about the 50th round and then gradually stabilizes at the maximum reward, indicating that the unmanned ship using DDPG found the optimal action strategy faster. The round reward based on the DQN fluctuates continuously and does not reach the maximum reward, which indicates that the unmanned ship has not learned the optimal behavior strategy, whereas the AC reaches the maximum reward value, indicating that the optimal behavior strategy has been learned. In addition, the round reward based on Q-learning (k) remains low in the initial rounds, indicating that the correct actions have not yet been learned and the agent is still in the exploratory stage; the reward begins to increase gradually from the 100th round but ultimately does not reach the maximum. The round reward based on the APF-DDPG (n) begins to increase at the 43rd round and reaches the maximum reward value faster. Comparing the five experimental processes shows that the APF-DDPG achieves the maximum round reward in the least time and maintains it consistently. At the same time, the APF-DDPG is more stable than the other algorithms in the later stage, which indicates that the unmanned ship has learned the optimal behavior strategy.
(c), (f), (i), (l), and (o) in Figure 29 are the experimental results for the average reward per round obtained by the DQN, the AC, the DDPG, Q-learning, and the APF-DDPG, respectively. Observing the average reward makes it more intuitive to judge the efficiency and accuracy of convergence. From (c), the average reward based on the DQN starts to increase at about the 110th round and stabilizes around the 180th round. From (f), the average reward based on the AC begins to increase at the 40th round and then stabilizes around the 130th round. From (i), the average reward based on the DDPG starts to increase at about the 10th round, with a downward trend around the 50th round; this is because the random exploration strategy causes the reward of each round to fluctuate, but around the 115th round the average reward reaches its maximum. In addition, the average reward based on Q-learning (l) shows a downward trend at the beginning, indicating the exploration stage; it increases from the 86th round but ultimately does not reach the maximum average reward. From (o), the average reward based on the APF-DDPG increases from the 43rd round and reaches the maximum average reward value faster, converging more quickly than the other algorithms. Comparing the five experimental processes shows that the APF-DDPG achieves the maximum average reward in the least time and maintains it consistently. Figure 30 compares the average reward of the five path planning methods. It can be seen that the Q-learning and DQN methods, based on a discrete action space, converge slowly, gradually increasing in the 80th to 100th rounds and failing to reach the maximum reward value.
The AC and DDPG methods begin to converge in the 10th to 50th rounds; their maximum average reward values are greater than those of the DQN and Q-learning methods, and the DDPG algorithm reaches the maximum average reward value. The average reward value of the APF-DDPG method is higher than that of the other methods from the beginning, indicating that the initial convergence rate is improved after the artificial potential field method is added. At the same time, the APF-DDPG method reaches the maximum average reward value in the 70th round. In summary, the APF-DDPG method is superior to the other methods in terms of convergence speed and stability.
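The per-round average reward curves compared in Figure 30 are, presumably, running means of the episode rewards; a minimal way to compute such a curve:

```python
def running_average(rewards):
    """Average reward after each round: mean of all episode rewards so far."""
    out, total = [], 0.0
    for i, r in enumerate(rewards, start=1):
        total += r
        out.append(total / i)
    return out

print(running_average([0.0, 2.0, 4.0]))  # [0.0, 1.0, 2.0]
```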
To further illustrate the accuracy and effectiveness of the improved model, experimental data from the five methods were extracted and compared in terms of total iteration time, optimal decision time, convergence steps, and the number of obstacle collisions. As shown in Table 12, DQN and Q-learning, based on a discrete action space, have greater overhead in training time and convergence speed. In contrast, AC and DDPG show better training times and convergence effects. Compared with the above methods, the APF-DDPG decision method has a shorter iteration time and optimal decision time; in addition, its convergence speed is improved and the trial-and-error rate is reduced. The simulation results show that the APF-DDPG has a higher convergence speed and better stability.

Conclusions
In traditional path planning algorithms, historical experience data cannot be reused for online training and learning, which results in low accuracy, and the planned path is neither smooth nor free of redundancy. This paper proposes an unmanned ship autonomous path planning method based on the DDPG algorithm. First, ship data are acquired based on the electronic chart platform. Secondly, the model is designed and trained in combination with the COLREGS and crew experience, and validated in three classical encounter situations on the electronic chart platform. The experiments show that unmanned ships take the best and most reasonable actions in an unfamiliar environment, successfully complete the task of autonomous path planning, and realize end-to-end learning for unmanned ships. Finally, by combining APF and DDPG, an unmanned ship autonomous path planning method based on improved DRL is proposed and compared with other classical DRL methods. The results show that the improved DRL has faster convergence speed and higher accuracy, can achieve continuous action output, and has smaller navigation error, which further validates the effectiveness of the proposed method. However, the ship's motion model and a real verification environment are not considered in this paper. How to account for the motion model of a ship in a complex sea area and then verify it in a real environment is the focus of our next research.