Path Planning for Automatic Berthing Using Ship-Maneuvering Simulation-Based Deep Reinforcement Learning

Abstract: Despite receiving much attention from researchers in the field of naval architecture and marine engineering since the early stages of modern shipbuilding, the berthing phase is still one of the biggest challenges in ship maneuvering due to the potential risks involved. Many algorithms have been proposed to solve this problem. This paper proposes a new approach with a path-planning algorithm for automatic berthing tasks using deep reinforcement learning (RL) based on a maneuvering simulation. Unlike conventional path-planning algorithms based on control theory or advanced algorithms using deep learning, a state-of-the-art path-planning algorithm based on reinforcement learning automatically learns, explores, and optimizes the berthing path through trial and error. The results of applying the twin delayed deep deterministic policy gradient (TD3) algorithm combined with the maneuvering simulation show that the approach can propose a feasible and safe path for high-performing automatic berthing tasks.


Introduction
Since the early stages of modern shipbuilding, much attention has been paid to automated methods of ship navigation, particularly with the continuous advancements in artificial intelligence (AI). As a result, the number of autonomous ships has grown rapidly. Autonomous ship navigation offers substantial advantages in terms of safety, efficiency, reliability, and environmental sustainability. By harnessing advanced technologies such as sensor systems, data analysis, and artificial intelligence, and by reducing the risk of human error, autonomous navigation systems ensure the safety of ship operations. These systems operate consistently and reliably, unhindered by human limitations, resulting in more predictable performance and fewer accidents. Autonomous ships can process vast amounts of data, enabling informed decision making, collision avoidance, and adaptation to changing conditions [1]. However, automatic ship berthing remains an extremely complex task, particularly under low-speed conditions where the hydrodynamic forces acting on the ship are highly nonlinear [2]. Controlling the ship becomes challenging and necessitates the expertise of an experienced commander. Numerous researchers have conducted extensive studies on the principles and algorithms for automatic ship berthing. Various ship control algorithms have been developed based on control theories and maneuverability assumptions [3][4][5][6][7][8][9]. These approaches proved effective under well-defined berthing conditions before the advent of AI. The development of AI algorithms has propelled the creation of algorithms and methods that enhance ship control performance, improve safety, and greatly reduce accidents in the marine industry. Many supervised learning algorithms based on neural networks have exhibited promising results with high success rates, as in [10][11][12][13]. Applying AI algorithms eliminates the need for a precise mathematical model of the ship. However, acquiring a substantial amount of labeled training data can be time-consuming and costly.
Unlike the aforementioned methods, reinforcement learning (RL) techniques, which constitute an area of machine learning, do not require a labeled training dataset; instead, they allow the ship to learn and optimize its berthing maneuvers through interactions with a simulated environment. The application of RL to the automatic berthing task has shown good results, with the ship automatically learning the strategy and optimizing the control policy to move to the berthing point [14,15].
In this paper, the initial development of a novel path-planning algorithm for autonomous ship berthing is proposed, using a recent reinforcement learning technique called the twin delayed deep deterministic policy gradient (TD3). The TD3 algorithm was introduced by Scott Fujimoto et al. (2018) and is specifically designed for continuous action spaces. TD3 is an extension of the deep deterministic policy gradient (DDPG) and aims to address certain challenges and improve the stability of learning in complex environments. TD3 exploration allows the agent to explore the environment and gain new experiences that optimize rewards through trial and error. It employs two distinct value function estimators, which mitigate the overestimation bias and stabilize the learning process. Leveraging these two critics, TD3 provides more accurate value estimates and facilitates better policy updates. High performance and stability compared to other RL algorithms were shown in [16]. Combined with the MMG model, the solution for ship-maneuvering motion simulation proposed by the Maneuvering Modeling Group (MMG) in 1977 [17], the algorithm suggests a feasible path, resulting in faster convergence and improved accuracy.
This article is organized into five parts. The first section introduces previous studies conducted in this field. Section 2 presents the equation of motion for the ship based on the MMG model along with the hydrodynamic and interaction coefficients. Section 3 outlines the path-planning algorithm based on the deep reinforcement learning algorithm TD3. Section 4 showcases and discusses simulation results for two berthing cases. Finally, Section 5 concludes this research.

Coordinated System
This paper focuses on the motions of USVs in the horizontal plane only. Thus, two coordinate systems were defined for a maneuvering ship based on the right-hand rule, as shown in Figure 1. The earth-fixed coordinate system is O0-x0y0, where the origin is located on the water surface, and the body-fixed coordinate system is o-xy, where the origin is at the midship. The x and y axes point toward the ship's bow and starboard, respectively. The heading angle ψ represents the angle between the x0 and x axes.

Mathematical Model of USV
The motion equation with three degrees of freedom (3-DOF) is established based on Newton's second law. In this paper, the berthing task is assumed to be performed in calm water; disturbances from the environment, such as waves and wind in the port area, are ignored for the simplicity of the equation of motion. The heave, roll, and pitch motions are relatively small and do not significantly affect the equation of motion; thus, they can be neglected. Furthermore, the low-speed condition makes the 3-DOF motions sufficient to simulate the motion of the vehicle. Additionally, the main purpose of this paper is the path-planning algorithm that generates a feasible path for the berthing task based on the maneuvering simulation, and the 3-DOF motion equations keep the model simple while retaining the characteristics of the system. The MMG model for the 3-DOF equation of motion suggested in [17] divides the total ship force and moment into sub-components: the hull, thruster, and steering system. Thus, the 3-DOF motion equation is expressed as

m(u̇ − vr − x_G r²) = X_H + X_R + X_P
m(v̇ + ur + x_G ṙ) = Y_H + Y_R + Y_P
I_z ṙ + m x_G (v̇ + ur) = N_H + N_R + N_P

where m is the mass of the ship; x_G is the longitudinal position of the center of gravity of the ship; u, v, and r denote the surge, sway, and yaw velocities, respectively; a dot over a variable denotes its derivative with respect to time; I_z is the mass moment of inertia with respect to the z-axis; X, Y, and N represent the surge force, lateral force, and yaw moment, respectively, at the midship; and the subscripts H, R, and P denote the hull, rudder, and propeller, respectively.
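As a minimal illustration, the 3-DOF equations above can be integrated numerically with a simple explicit Euler step. The sketch below assumes the added masses are folded into the force terms, and the function and parameter names are illustrative assumptions, not the simulator used in the paper.

```python
import math

def mmg_step(state, forces, m, Izz, xG, dt=0.1):
    """One Euler step of the 3-DOF equations of motion (hedged sketch).

    state  = (x, y, psi, u, v, r)  earth-fixed position/heading, body velocities
    forces = (X, Y, N)             summed hull + rudder + propeller terms
    """
    x, y, psi, u, v, r = state
    X, Y, N = forces

    # Surge: m(du - v r - xG r^2) = X
    du = X / m + v * r + xG * r * r

    # Coupled sway/yaw:  m(dv + u r + xG dr) = Y ;  Izz dr + m xG (dv + u r) = N
    # Solve the 2x2 linear system [m, m*xG; m*xG, Izz] [dv; dr] = [b1; b2]
    b1 = Y - m * u * r
    b2 = N - m * xG * u * r
    det = m * Izz - (m * xG) ** 2
    dv = (Izz * b1 - m * xG * b2) / det
    dr = (m * b2 - m * xG * b1) / det

    u, v, r = u + du * dt, v + dv * dt, r + dr * dt
    # Kinematics: rotate body-frame velocities into the earth-fixed frame
    x += (u * math.cos(psi) - v * math.sin(psi)) * dt
    y += (u * math.sin(psi) + v * math.cos(psi)) * dt
    psi += r * dt
    return (x, y, psi, u, v, r)
```

With zero forces and pure surge motion, the ship simply advances along its heading, which is a quick sanity check of the kinematics.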
Due to the operating conditions in the berthing phase, the hydrodynamic forces and moments around the midship acting on the ship hull were investigated at low speed over a wide range of drift angles [2]. Thus, the equation of motion retains some of the high-order hydrodynamic coefficients for the forces and moments caused by the hull. The hydrodynamic coefficients are described by a Taylor series expansion in terms of the surge velocity, sway velocity, and yaw rate.
The ship model was equipped with twin-propeller and twin-rudder systems [18]. The thruster model is expressed as

X_P = (1 − t_P)(T^(p) + T^(s)),  T = ρ n_P² D_P⁴ K_T

where D_P is the propeller diameter; n_P is the propeller revolution rate; t_P denotes the thrust deduction factor; y_P is the lateral position of the propeller from the centerline, which enters the yaw moment through the thrust difference between the two propellers; and the superscripts (p) and (s) denote the side of the propeller (port and starboard).
The thrust coefficient K_T is described as a function of the advance ratio J_P, which was obtained through the propeller open-water test. The parameter required for the estimation of thrust is given as

J_P = u(1 − w_P) / (n_P D_P)

where the wake fraction at the propeller position, w_P, was estimated using the wake fraction at the propeller in straight motion, w_P0; the geometrical inflow angle to the propeller position is denoted by β_P; the wake-changing coefficients describe the change in the wake for positive and negative lateral motion; and x_P denotes the longitudinal position of the propeller from the midship.
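The thrust computation above can be sketched as follows. The quadratic K_T(J) coefficients below are placeholder values for illustration, not the open-water fit of the KASS propeller, and the parameter names are assumptions.

```python
def propeller_thrust(n_rps, u, wP0, tP, D, rho=1025.0):
    """Quasi-steady effective thrust of one propeller (hedged sketch).

    n_rps : propeller revolutions per second
    u     : surge velocity (m/s)
    wP0   : wake fraction at the propeller in straight motion
    tP    : thrust deduction factor
    D     : propeller diameter (m)
    """
    k0, k1, k2 = 0.35, -0.28, -0.14               # placeholder K_T(J) fit
    uP = u * (1.0 - wP0)                           # inflow speed at the propeller
    J = uP / (n_rps * D) if n_rps != 0 else 0.0    # advance ratio
    KT = k2 * J ** 2 + k1 * J + k0                 # thrust coefficient
    T = rho * n_rps ** 2 * D ** 4 * KT             # open-water thrust
    return (1.0 - tP) * T                          # effective surge force
```

In the twin-propeller model, this function would be evaluated once per side with the corresponding inflow condition.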
Forces and moments due to the steering system for the twin rudder were calculated based on the rudder normal force and are expressed as

X_R = −(1 − t_R)(F_N^(p) + F_N^(s)) sin δ
Y_R = −(1 + a_H)(F_N^(p) + F_N^(s)) cos δ
N_R = −(x_R + a_H x_H)(F_N^(p) + F_N^(s)) cos δ

where the normal force acting on the rudder is described as follows (Equation (7)):

F_N = (1/2) ρ A_R U_R² f_α sin α_R

The parameters required for estimating the rudder forces and moment during the maneuver are as follows: F_N is the normal rudder force; t_R, a_H, and x_H are the steering resistance deduction factor, the rudder force increase factor, and the position of the additional lateral force component, respectively; U_R is the resultant rudder inflow velocity; f_α is the rudder lift gradient coefficient; Λ is the rudder aspect ratio; α_R is the effective inflow angle to the rudder; u_R and v_R are the longitudinal and lateral inflow velocity components to the rudder; ε is the ratio of the wake fractions at the propeller and rudder positions; γ_R is the flow-straightening coefficient; and β_R is the effective inflow angle to the rudder in the maneuvering motion.
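A minimal sketch of the rudder normal force computation is given below. Fujii's formula for the lift gradient f_α is assumed, and the inflow components u_R and v_R are taken as inputs rather than computed from the propeller wake; the function signature is an illustrative assumption.

```python
import math

def rudder_normal_force(u_R, v_R, delta, A_R, aspect_ratio, rho=1025.0):
    """Rudder normal force F_N = 0.5 rho A_R U_R^2 f_alpha sin(alpha_R).

    u_R, v_R     : longitudinal / lateral inflow velocity components (m/s)
    delta        : rudder angle (rad)
    A_R          : rudder area (m^2)
    aspect_ratio : rudder aspect ratio (Lambda)
    """
    U_R = math.hypot(u_R, v_R)                             # resultant inflow speed
    f_alpha = 6.13 * aspect_ratio / (2.25 + aspect_ratio)  # Fujii's lift gradient
    alpha_R = delta - math.atan2(v_R, u_R)                 # effective inflow angle
    return 0.5 * rho * A_R * U_R ** 2 * f_alpha * math.sin(alpha_R)
```

At zero rudder angle with purely longitudinal inflow, the normal force vanishes, as expected from the sin(α_R) term.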

Hydrodynamic and Interaction Coefficients
Previous studies on the hydrodynamic properties under the operating conditions were carried out by research groups at Changwon National University [19]. Experiments were conducted on the Korean autonomous surface ship (KASS) model, a ship model used in a project carried out by many universities and research institutes to develop autonomous ships. The main characteristics and the shape of the ship model are shown in Table 1 and Figure 2. A cross-comparison between the previous studies and [20] shows similar results. Figure 3 shows the comparison of the turning maneuverability at a rudder angle of 35 degrees at three and six knots. To obtain a feasible path for the automatic berthing task, the equation of motion and the hydrodynamic coefficients of the USV should be investigated carefully using the KASS model. In this paper, the hydrodynamic coefficients were estimated using the captive model test at Changwon National University and compared with the CFD method presented in [21]. The coefficients relative to only the surge velocity were estimated through the resistance test. The hydrodynamic coefficients related to the surge and sway velocities for forces and moments were estimated through a static drift test. The hydrodynamic coefficients related to the yaw rate were estimated using the circular motion test, and those related to the combined effect of sway velocity and yaw rate were estimated using the combined circular motion with drift test. The added mass and interaction coefficients were selected from [21]. A summary of the hydrodynamic and interaction coefficients is shown in Tables 2 and 3.

Maneuverability
To define the problem, it is necessary to assess the maneuverability of the ship. Understanding the maneuverability makes it possible to determine where to start the automatic berthing process. A maneuvering simulation at low speed (1 knot) was conducted to investigate the ship-maneuvering characteristics in the port environment. Figure 4 shows the trajectory of the turning circle test with a rudder angle of 35 degrees at 1 knot. The simulation results show that the ship can turn within a range of approximately three ship lengths (3L). Thus, the berthing area should be more than 3L from the berthing point. The maneuvering characteristics are shown in Table 4.

Path-Planning Approach
In the last few decades, significant advancements have been made in the field of artificial intelligence, particularly in reinforcement learning, a subfield of machine learning. Reinforcement learning trains an agent by assigning rewards and punishments based on its behavior and state. Unlike supervised and semi-supervised learning, reinforcement learning does not rely on pairs of input data and true results, and it does not explicitly evaluate near-optimal actions as true or false. As a result, reinforcement learning offers a solution to complex problems, including the control of robots, self-driving cars, and even applications in the aerospace industry. A noteworthy advancement in reinforcement learning is the introduction of the twin delayed deep deterministic policy gradient (TD3) in 2018. TD3 is an effective model-free, off-policy reinforcement learning method. The TD3 agent is an actor-critic reinforcement learning agent that optimizes the expected long-term reward. Specifically, TD3 builds upon the success of the deep deterministic policy gradient (DDPG) algorithm developed in 2016. DDPG remains highly regarded and successful in continuous action spaces, finding extensive applications in fields such as robotics and self-driving systems.
However, like many algorithms, DDPG has its limitations, including instability and the need to fine-tune hyperparameters for each task. Estimation errors gradually accumulate during training, leading to suboptimal local states, overestimation, or severe forgetfulness on the part of the agent. To address these issues, TD3 was developed with a focus on reducing the overestimation bias prevalent in previous reinforcement learning algorithms. This is achieved through the incorporation of three key features:


- The utilization of twin critic networks, which work in pairs;
- Delayed updates of the actor;
- Action noise regularization.

By implementing these features, TD3 aims to enhance the stability and performance of reinforcement learning algorithms, ultimately improving their applicability in various domains.
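The first and third features above can be sketched as the critic target computation: the target action is perturbed with clipped noise (target policy smoothing), and the smaller of the two target critics' estimates is used (clipped double-Q). The network callables below are hypothetical stand-ins for the real neural networks.

```python
import numpy as np

def td3_target(rewards, next_states, actor_t, critic1_t, critic2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 critic target (hedged sketch).

    actor_t, critic1_t, critic2_t are *target* networks, passed here as
    plain callables mapping states (and actions) to arrays.
    """
    a_next = actor_t(next_states)                     # target policy action
    # Target policy smoothing: clipped Gaussian noise on the target action
    noise = np.clip(np.random.normal(0.0, noise_std, a_next.shape),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Clipped double-Q: take the minimum of the two target critics
    q1 = critic1_t(next_states, a_next)
    q2 = critic2_t(next_states, a_next)
    return rewards + gamma * np.minimum(q1, q2)
```

The second feature, delayed actor updates, is simply a matter of updating the policy (and the target networks) only every few critic updates in the training loop.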

Conception
The path-planning algorithm in this paper was performed using the KASS model mentioned in Section 2.3. The selected port was the Busan port, whose geometry is shown in Figure 5. The objective is to use TD3 (the pseudocode is shown in Algorithm 1) to train a model that can generate the path for the berthing process. First, the TD3 algorithm trains the model in combination with the maneuvering simulation suggested by the MMG model. This approach allows for the integration of realistic ship motion dynamics into the training process. Then, this model is used to generate the desired path for the berthing task by inputting the state of the ship and predicting the control signals n (propeller speed) and δ (rudder angle) based on the input ship state (x, y, ψ, u, v, and r). The concept of path planning for automatic berthing tasks is shown in Figure 6.
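Once trained, the path-generation step can be sketched as a simple rollout loop: the policy maps the ship state to control signals, which the MMG-based simulation integrates forward until the ship reaches the berthing point. The function names and signatures below are illustrative assumptions, not the authors' implementation.

```python
def generate_berthing_path(policy, simulate_step, initial_state,
                           target, tol=1.0, max_steps=5000):
    """Roll a trained policy through the maneuvering simulation (sketch).

    policy        : maps state (x, y, psi, u, v, r) -> (n, delta) controls
    simulate_step : MMG-based environment transition, state -> next state
    target        : (x, y) coordinates of the berthing point
    tol           : distance tolerance for declaring arrival (m)
    """
    state, path = initial_state, [initial_state]
    for _ in range(max_steps):
        n, delta = policy(state)               # propeller speed, rudder angle
        state = simulate_step(state, n, delta)
        path.append(state)
        dx, dy = state[0] - target[0], state[1] - target[1]
        if (dx * dx + dy * dy) ** 0.5 < tol:   # reached the berthing point
            break
    return path
```

The returned list of states is the generated berthing path, which can then be plotted as the trajectory shown in the results section.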

Settings for Reinforcement Learning
In this section, the parameters and variables of the TD3 algorithm for the automatic berthing task were set as follows:


Observation space and state: The observation space and state were defined as the set of physical velocities, position, and orientation. The state vector (x, y, ψ, u, v, r) includes the position (x, y), the orientation given by the heading angle ψ, the linear velocities u and v, and the angular velocity r. In this paper, based on the boundary state, the weights 100, 1000, 2000, and 1000 were assigned to the distance, heading, linear velocity, and angular velocity terms of the reward, respectively. The reward value was described as the sum of the rewards at each time step: it receives a positive value if the state variable changes toward the required value and vice versa. Furthermore, for faster convergence of the reward in the first stage of the berthing task, the distance is prioritized over the speed and heading. Thus, the reward term for the distance is multiplied by a distance coefficient, while the reward terms for the heading, resultant velocity, and yaw rate are multiplied by (1.1 − distance coefficient). The reward coefficients are described in Figure 7.
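This weighting scheme can be sketched as a cost-style (negative) per-step reward; the exact error-to-reward mapping used in the paper may differ, so the formula below is an illustrative assumption that only reproduces the stated weights and the distance-coefficient split.

```python
import math

def berthing_reward(state, target, dist_coef=0.5,
                    w_d=100.0, w_psi=1000.0, w_v=2000.0, w_r=1000.0):
    """Per-step reward sketch using the weights 100/1000/2000/1000.

    The distance term is scaled by dist_coef and the heading, velocity,
    and yaw-rate terms by (1.1 - dist_coef), so distance dominates early
    in the approach when dist_coef is large.
    """
    x, y, psi, u, v, r = state
    xt, yt, psit = target
    dist = math.hypot(x - xt, y - yt)      # distance to the berthing point
    speed = math.hypot(u, v)               # resultant linear velocity
    reward = -dist_coef * w_d * dist
    reward -= (1.1 - dist_coef) * (w_psi * abs(psi - psit)
                                   + w_v * speed
                                   + w_r * abs(r))
    return reward
```

At the target state with zero errors the reward is zero, and any deviation makes it more negative, which is the shaping behavior described above.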


Environment: The environment receives the control input and the current state and then returns the ship's new state and the reward for the action. The environment function was built based on a maneuvering simulation that uses the MMG model as the mathematical model.

Agent: The hyperparameters of the TD3 model were selected as follows: the number of hidden layers was set to two, with 512 units each. The learning rates of the actor and critic networks, α and β, were set to 0.0001. The discount factor was 0.99. The soft update coefficient was 0.005. The batch size was 128. The training process was set to 20,000 warmup steps out of 50,000 total steps, with the exploration noise set as in Table 5.
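The soft update coefficient of 0.005 refers to the standard Polyak averaging of the target network parameters, which can be written as follows (plain lists stand in here for the network parameter tensors):

```python
def soft_update(target_params, source_params, tau=0.005):
    """Polyak-average each target parameter toward the learned network:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With tau = 0.005, the target networks track the learned networks slowly, which is what stabilizes the critic targets during training.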

Boundary Conditions
The simulations were performed at Busan port; a satellite image is shown in Figure 5. The geometry of this port was simplified as shown in Figure 8. Considering the geometry of this port, two berthing cases were selected to investigate the path-plan-generating system.

Simulation Results and Discussion
The USV is considered to have berthed successfully if it approaches the target berthing point with an error within the allowable range established in Section 3.3. The training process stops when the number of training episodes reaches 50,000.
Figures 9 and 10 sequentially present a set of information that includes the trajectory, surge, sway, yaw rate, heading angle, and control inputs (propeller revolution and rudder angle). The USV starts from a random state as described in Section 3.3. Under the automatic control of the TD3 model, the ship berths successfully at the target location. The ship states gradually change from the initial state to the required range. In particular, the adjusted reward function made the berthing process faster and more optimal. As shown in the time series of the surge velocity, in the first phase of the berthing process, the ship's speed is increased to reduce the distance to the berthing point, while the sway velocity and yaw rate do not change much. In the last phase of the berthing process, the surge velocity, yaw rate, and heading angle become much more important than the distance, so these values change more abruptly to meet the requirements of the berthing process.
Figure 11a,b shows the learning performance of TD3 in case 1 over 50,000 episodes. Although the warmup lasts 20,000 episodes, the average reward shows that the model berths successfully and is stable before the warmup phase is finished. This demonstrates that TD3 provides good reinforcement learning for the automatic berthing process. The penalty value also affects the training process: if the penalty is too high, the model can misjudge otherwise promising behavior. For instance, a vessel may move reasonably close to the berthing position but then collide with the wall and receive a heavy point deduction, causing the model to judge the whole approach as wrong and try other actions, which lengthens the training process. These results show that the method proposed in this paper achieves high performance and a high success rate with a low penalty value.
The simulation results show that the combination of TD3 and the maneuvering simulation provides a powerful and accurate system for the automatic berthing process. Comparing the shape of the average reward to the results shown in [15] demonstrates that the TD3 algorithm is more stable than older reinforcement learning algorithms. In particular, the method proposed in this paper is easier to apply because of its ability to learn, explore, and optimize the policy automatically. It can be applied to another ship model if the ship's hydrodynamic characteristics are known.
However, the limitations of this paper are evident. The first is the simplification of the port conditions: due to the simplicity of the model, the effect of disturbances was ignored in the simulation, which causes inaccuracy in the presence of wind or waves. The second limitation is the simplification of the obstacles and port geometry, which has a significant effect on the determination of the initial state and the berthing point. The presence of moving obstacles can increase the training time significantly, so the method proposed in this paper should only be used in a known port without moving obstacles. Finally, this paper proposes the initial development of the path-planning system, and the results are based on maneuvering simulations; the accuracy and performance of this approach in real-world operations need to be carefully considered and evaluated.
Although the approach in this paper uses one of the newest reinforcement learning techniques available at present, the performance of this method should be investigated and compared carefully.

Conclusions and Remarks
In this study, the Korean autonomous surface ship (KASS) model was selected as the target ship for training the path-planning system for the autonomous berthing task. The mathematical model and the hydrodynamic coefficients suggested in previous research conducted at Changwon National University provided an accurate model for solving the motion of a slow ship. By performing the path-planning algorithm based on the combination of TD3 and a maneuvering simulation, the automatic berthing task could be carried out with stable performance using reinforcement learning.
Even though the path-planning system showed high performance, the complex environmental disturbances in the port area need to be included in the model. This increases the training time but is a necessary factor in real situations. Additionally, several algorithms based on control theory should be considered for faster convergence.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available in this article (Tables and Figures).

Conflicts of Interest:
The authors declare no conflict of interest.

Figure 1. Coordinate system of the twin-propeller and twin-rudder ship model.

Figure 3. Simulation results of turning trajectories at a rudder angle of 35 degrees at three and six knots [20].

Figure 4. Simulation results of turning trajectories at a rudder angle of 35 degrees.

Figure 5. Satellite image of the Busan port.

Figure 6. Concept of the path-planning algorithm.


Action: The control action includes the control inputs of the thruster (propeller revolution) and steering (rudder angle) systems. The action signal is continuous in the range [−1, 1], where [−1, 1] maps to [−300, 100] rpm for the thrust system and to [−35, 35] degrees for the steering system.

Reward function: The reward function plays a crucial role in the design of a reinforcement learning application. It serves as a guide for the network training process and helps optimize the model's performance throughout each episode. If the reward function does not accurately capture the objectives of the target task, the model may struggle to achieve desirable performance.
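The mapping from the normalized action signal to the physical control ranges stated above can be written as a simple affine scaling (the function name is an illustrative assumption):

```python
def scale_action(n_norm, delta_norm):
    """Map normalized policy outputs in [-1, 1] to physical controls:
    propeller revolution in [-300, 100] rpm and rudder angle in
    [-35, 35] degrees."""
    # rpm range is asymmetric: centre = -100, half-range = 200
    rpm = -100.0 + 200.0 * n_norm      # -1 -> -300 rpm, +1 -> +100 rpm
    delta_deg = 35.0 * delta_norm      # -1 -> -35 deg,  +1 -> +35 deg
    return rpm, delta_deg
```

Keeping the policy output in [−1, 1] and scaling afterwards is the usual practice with TD3, since the tanh output layer and the action noise clipping both assume a symmetric normalized range.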


Table 2. Hydrodynamic force and moment coefficients.