Article

Human-Navigable Ship-Handling Support Using Improved Deep Deterministic Policy Gradient for Survey Line Tracking

by Hitoshi Yoshioka 1, Hirotada Hashimoto 1,2,* and Akihiko Matsuda 3

1 Graduate School of Engineering, Osaka Metropolitan University, Sakai 599-8531, Osaka, Japan
2 Kobe Ocean-Bottom Exploration Center, Kobe University, Kobe 658-0022, Hyogo, Japan
3 Fisheries Technology Institute, Japan Fisheries Research and Education Agency, Kamisu 314-0408, Ibaraki, Japan
* Author to whom correspondence should be addressed.
Automation 2026, 7(1), 16; https://doi.org/10.3390/automation7010016
Submission received: 30 November 2025 / Revised: 29 December 2025 / Accepted: 4 January 2026 / Published: 8 January 2026
(This article belongs to the Section Industrial Automation and Process Control)

Abstract

This study presents a human-navigable ship-handling support system that employs artificial intelligence (AI) for survey line tracking. AI was developed using the Deep Deterministic Policy Gradient (DDPG), a type of deep reinforcement learning (DRL), and was evaluated through experiments conducted with a research vessel. The experiments revealed several issues inherent to DRL that required improvement. The first issue was the asymmetry observed in the policy learned through the DDPG. To address this, a learning approach that utilizes symmetric training data and symmetry-constrained actor and critic neural networks was proposed. The second issue was excessive steering during tracking maneuvers. To mitigate this, an objective function for actor learning that incorporates a cost term to suppress the magnitude of actions was proposed. The third issue was the frequent oscillation of actions. To resolve this, improved conditioning for action policy smoothing was introduced in the objective function to smooth actions appropriate to the situation. A subsequent experiment at sea was conducted to evaluate the improved AI-based ship-handling support system. As a result, precise path tracking performance with minimal operator discomfort and smooth control actions was achieved through manual ship handling guided by AI-generated instructions under actual sea conditions.

1. Introduction

Japan relies on other countries for most of its resources. With land resources becoming depleted, the focus is shifting toward marine resources. Numerous valuable resources, such as rare earth elements and methane hydrates, exist on the seabed in Japan’s exclusive economic zone. The development of such seabed resources requires surveying a wide sea area. Moreover, seabed investigations are essential for predicting natural hazards such as subduction zone earthquakes and submarine volcanic eruptions. Offshore wind power generation, which does not produce CO2 emissions, has also attracted attention as a measure against global warming. Seabed surveys are indispensable for designing the mooring of floating offshore wind turbines. Consequently, ocean surveys are becoming increasingly important; however, they demand substantial time and cost due to the vast areas that must be surveyed. For this purpose, careful ship handling to precisely track survey lines over a long period of time is required. Therefore, it is necessary to develop human-navigable ship-handling instructions for seafarers to use as practical support.
Proportional–integral–derivative (PID) control is commonly employed for line tracking maneuvers. The control output is calculated with constant gains applied to the error, its time integral, and its rate of change. However, it is challenging to apply PID control to ships with multiple actuators, and it is difficult to achieve precise control under dynamic variations in disturbances and tracking targets, because the gains are fixed and tuned in advance to maximize average performance. In addition, survey line tracking control needs to anticipate the ship’s behavior and changes in the target line. Early action is necessary because the inertial effects are large and a ship does not turn as soon as steering begins. Such prediction-based control is difficult to achieve with PID control, so model predictive control (MPC) is usually used instead [1,2,3]. Because MPC is based on motion prediction, it can reduce overshoots and oscillations of the control variable. MPC outputs the optimized control action within a prediction horizon. However, the computational cost of this process is high, so online control via MPC is difficult when complex control targets and/or constraints are involved [4]. Furthermore, MPC obtains the control output from model predictions, so errors in the maneuvering model can degrade control performance. Because such model errors are difficult to eliminate, control schemes capable of compensating for discrepancies between numerical models and real-world dynamics have been studied [5,6,7,8], and their adaptability to model errors and disturbances has been demonstrated. Although these methods are well established in control theory, designing constraints that satisfy multiple objectives remains challenging: human-navigable control requires the simultaneous consideration of tracking accuracy, control smoothness, robustness to noise, and real-time capability.
Artificial intelligence (AI) has advanced rapidly in recent years. In a previous study, a multilayer neural network was incorporated into an existing control law to estimate model uncertainty, which enhanced performance [9]. Another approach is to develop a control law directly based on deep reinforcement learning (DRL). The Deep Q-Network (DQN), a pioneering DRL method, achieved higher scores than a human player in Atari games [10], demonstrating the potential of AI to exceed human capabilities in specific tasks. Since then, many improved DRL methods and algorithms have been studied and applied in various research fields, including line tracking problems [11,12,13,14]. These studies demonstrated that a well-trained DRL model is effective for autonomous line tracking control. DRL optimizes control policies by considering future vessel motion and target line changes through vast trial-and-error simulation. Its advantages are that it can achieve more precise and flexible control than PID control, and that online control is feasible owing to its speed advantage over MPC and its robustness against model errors. Whereas MPC computes optimal control using a maneuvering model, DRL learns the optimal action directly from data; it can therefore adapt to the system dynamics even in the presence of model errors, because the input data are continuously updated as state feedback.
However, AI-based control systems suffer from certain limitations. One is that the decision-making process is opaque: its criteria are not designed by humans but learned autonomously by the AI. Automatic control and navigation support systems are preferred as white boxes to increase user trust. Although explainable AI (XAI) has been studied to address this issue [15,16], existing methods cannot fully clarify the basis of decision-making. Another problem is that AI outputs can be highly sensitive to small fluctuations in the input data. Because a ship’s observation data are affected by surge, sway, yaw, and roll motions as well as observation noise, the state perceived by the AI oscillates frequently. Consequently, the control outputs change over time even when the actual ship state does not change significantly. Such behavior can reduce ride comfort, damage cargo, strain the steering gear, and reduce user confidence. Event-triggered control is a method for reducing the frequency of output updates [17]. However, when the threshold is set large to decrease the update frequency, the deviation at the moment of each update becomes large; it is therefore difficult to simultaneously reduce both the update frequency and the magnitude of the output change per update. In addition, the timing of event occurrences is difficult for human operators to predict. Consequently, event-triggered control is not well suited to human-navigable control systems.
Another issue is the sensitivity to observation noise, which is problematic for the robustness of AI systems. Furthermore, the control algorithms proposed in previous studies were validated only through numerical simulations; therefore, the performance of maneuvering control systems trained via numerical simulation in a real-world environment has not been clarified. Addressing these DRL-related issues is essential for practical deployment.
This study proposes a DRL-based control system for achieving precise and human-navigable survey line tracking using the improved Deep Deterministic Policy Gradient (DDPG). In this study, the term “human-navigable” is defined as the ability of a human operator to perform operations in the same manner as in normal manual operation while following AI’s instructions. To develop human-navigable AI, it is essential that the control instructions vary smoothly and at a frequency comparable with that of a human operator, that the resulting control actions are similar to those of a human operator, and that the control accuracy is equal to or greater than that achieved by a human operator. The features of the proposed scheme include symmetric maneuvering control, suppression of unnecessary excessive control, and infrequent and smooth control. Symmetric maneuvering control is achieved through a symmetry-constrained neural network architecture. Human-navigable, infrequent, and smooth control can be achieved by devising an objective learning function. These algorithms can contribute to the development of human-like autonomous controls and human-navigable ship-handling support. First, the effectiveness of each algorithm was verified through numerical experiments. Second, an onboard survey line tracking system incorporating all the algorithms was developed for the research vessel. Finally, the developed system was validated through experiments. The results confirmed that symmetric, smooth, and precise line tracking control can be achieved under real sea conditions.

2. Algorithm

2.1. Reinforcement Learning

Reinforcement learning (RL) consists of an agent, an environment, and interactions between them. In the learning process, the agent observes state $s_t$ at time $t$ and selects an action $a_t$ according to a policy $\pi_\theta$. The environment then transitions to the next state $s_{t+1}$, and the agent receives a reward $r_{t+1}$ from the environment. This sequence is repeated, and the agent updates its policy. The process is illustrated in Figure 1.
In RL, the agent learns a policy that maximizes the future cumulative reward through trial and error. The cumulative reward is described by Equation (1), in which rewards at future steps are discounted by a factor $\gamma$ to account for their uncertainty. Actions that will be highly rewarded in the future are learned through this cumulative reward, and by experiencing various states, the agent learns a policy that can handle diverse situations.

$$R_t = \sum_{n=0}^{\infty} \gamma^n r_{t+1+n} \qquad (1)$$
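As a concrete illustration, the discounted return of Equation (1) can be computed directly; the reward sequence and discount factor below are illustrative values, not settings from this study.

```python
# Minimal sketch of Equation (1): discounted cumulative reward.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**n * r_{t+1+n} over the remaining steps."""
    return sum(gamma**n * r for n, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.5]))  # 1.0 + 0.99*0.0 + 0.99**2*0.5 = 1.49005
```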

2.2. Deep Deterministic Policy Gradient

The Deep Deterministic Policy Gradient (DDPG) [18] consists of two neural networks (NNs) and addresses continuous action spaces using actor and critic networks. The actor network predicts the optimal action $a$ from state $s$; the critic network predicts the cumulative reward from state $s$ and action $a$. The actor and critic networks are denoted as $\mu(s|\theta^\mu)$ and $Q(s,a|\theta^Q)$, respectively. The parameters of the critic network are updated by minimizing the loss function in Equation (2), which measures the discrepancy between the predicted and received rewards. The actor network parameters are updated to maximize the critic network’s output, as described in Equation (3). During learning, separate target networks with the same architecture as the actor and critic networks are used for action selection and cumulative reward estimation; their parameters are updated according to Equation (4). Here, $\theta^{\mu'}$ and $\theta^{Q'}$ denote the parameters of the target actor and target critic networks, respectively.
$$L = \frac{1}{N}\sum_i \left( r_i + \gamma Q'\!\left(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})\,\middle|\,\theta^{Q'}\right) - Q(s_i, a_i|\theta^Q) \right)^2 \qquad (2)$$

$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^Q)\big|_{s=s_i,\,a=\mu(s_i|\theta^\mu)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s_i} \qquad (3)$$

$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}, \qquad \theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'} \qquad (4)$$
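For readers who prefer code, the following PyTorch sketch mirrors Equations (2)–(4) for one update step. The layer sizes, learning rates, batch size, and the tanh output are placeholders rather than this paper’s settings (those are given in Tables 2–4), so treat it as a minimal sketch of the algorithm only.

```python
import copy
import torch
import torch.nn as nn

# One DDPG update step following Equations (2)-(4); all dimensions and
# hyperparameters here are illustrative placeholders.
state_dim, action_dim, gamma, tau = 8, 1, 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # target networks
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # Critic update: minimize the TD error of Equation (2).
    with torch.no_grad():
        y = r + gamma * critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
    critic_loss = ((y - critic(torch.cat([s, a], dim=1))) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor update: ascend the critic's value of the actor's action, Equation (3).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Soft target update of Equation (4).
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# Usage with a random minibatch of 32 transitions:
s = torch.randn(32, state_dim); a = torch.rand(32, action_dim) * 2 - 1
r = torch.randn(32, 1); s_next = torch.randn(32, state_dim)
ddpg_update(s, a, r, s_next)
```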

3. Learning

3.1. Learning Environment

The maneuvering motion model of the research vessel TAKAMARU, operated by the Japan Fisheries Research and Education Agency, was developed to describe ship motion in a numerical environment. The equations of motion are given in Equation (5), and the coordinate system of the maneuvering model is shown in Figure 2. To represent the hydrodynamic force components, the so-called MMG model [19] was employed. In the MMG model, hydrodynamic forces are decomposed into components with respect to the ship hull, propeller, rudder, and their interaction. Accordingly, the total force acting on the vessel is expressed as the sum of these components, as shown in Equation (6). Here, the subscripts H, R, and P denote the hull, rudder, and propeller, respectively. The coefficients appearing in the expressions for each force component were determined through captive model experiments. In this study, the maneuvering coefficients were identified with a circular motion test, and the maneuvering motion model was tuned to reproduce the measured motions observed in actual ship experiments. Further details of the MMG model can be found in the literature [19]. A list of the defined symbols in Equations (5) and (6) is shown in Table 1.
$$\begin{aligned} (m+m_x)\dot{u} - (m+m_y)v_m r - x_G m r^2 &= X \\ (m+m_y)\dot{v}_m + (m+m_x)ur + x_G m \dot{r} &= Y \\ (I_{zG} + x_G^2 m + J_z)\dot{r} + x_G m (\dot{v}_m + ur) &= N_m \end{aligned} \qquad (5)$$

$$X = X_H + X_R + X_P, \qquad Y = Y_H + Y_R, \qquad N_m = N_H + N_R \qquad (6)$$
Target lines were generated randomly by combining straight lines and arcs. The ranges of the arc radius and central angle were 50–200 m and 45–315°. The range of the straight line was 300–500 m. An example of a generated target line is shown in Figure 3. The learning time step was set to 5 s, and each training episode consisted of 240 steps.
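A minimal sketch of such a generator is given below. Only the sampling ranges come from the text; the alternation of straight and arc segments, the sampling step, and all function and variable names are our assumptions.

```python
import numpy as np

# Random target line: straight segments of 300-500 m joined to circular
# arcs with radius 50-200 m and central angle 45-315 deg (either direction).
def generate_target_line(n_segments=4, ds=5.0, rng=np.random.default_rng(0)):
    pts, pos, heading = [np.zeros(2)], np.zeros(2), 0.0
    for i in range(n_segments):
        if i % 2 == 0:                                   # straight segment
            length = rng.uniform(300.0, 500.0)
            for _ in range(int(length / ds)):
                pos = pos + ds * np.array([np.cos(heading), np.sin(heading)])
                pts.append(pos.copy())
        else:                                            # circular arc
            radius = rng.uniform(50.0, 200.0)
            angle = np.radians(rng.uniform(45.0, 315.0)) * rng.choice([-1, 1])
            dpsi = ds / radius * np.sign(angle)          # heading change per sample
            for _ in range(int(abs(angle) / abs(dpsi))):
                heading += dpsi
                pos = pos + ds * np.array([np.cos(heading), np.sin(heading)])
                pts.append(pos.copy())
    return np.array(pts)

line = generate_target_line()
```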

3.2. Observation Setting

A state at time $t$ consists of information about the ship’s motion and the target line. The ship information includes the longitudinal velocity, lateral velocity, rate of turn, drift angle, and rudder angle, denoted as $u_t$, $v_t$, $r_t$, $\beta_t$, and $\delta_t$, respectively. The target line is discretized at intervals corresponding to the distance traveled by the ship during one time step, i.e., $V_t\,dt$. The relative coordinates of the target points are included in the representation of the state. The first target point $T_0$ is defined as the intersection of the target line and the perpendicular from the ship to the target line. The number of target points was set to 10 in the experiment, determined according to the time constant of the maneuvering model. The current state and the four preceding states were used as the input data. The definition of the state is given by Equation (7), and the definition of the target points is shown in Figure 4.
$$s_t \equiv \left( u_t,\ v_t,\ r_t,\ \beta_t,\ \delta_t,\ T_{x0},\ T_{y0},\ T_{x1},\ T_{y1},\ \ldots,\ T_{x9},\ T_{y9} \right) \qquad (7)$$
The action of the actor is the continuous rudder angle, bounded within the range of −35° to +35°.
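Under the definitions above, assembling a single-step observation and bounding the action might look as follows; note that the actual input stacks the current and four preceding states, and all names here are illustrative.

```python
import numpy as np

# Single-step state of Equation (7): 5 motion variables plus the relative
# coordinates of 10 target points (25 values in total).
def build_state(u, v, r, beta, delta, target_points_rel):
    assert target_points_rel.shape == (10, 2)  # rows of (T_x, T_y)
    return np.concatenate([[u, v, r, beta, delta], target_points_rel.ravel()])

# The rudder command is clipped to the +/-35 deg action range:
action = np.clip(42.0, -35.0, 35.0)   # -> 35.0
```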

3.3. Reward Setting

Line tracking control is achieved by maintaining the ship’s course parallel to the target line. Thus, the reward is determined by the deviations in distance and angle. The reward function is defined using a Gaussian-shaped function and is calculated with Equation (8), where $d_t$ [m] and $\theta_t$ [°] denote the deviation distance and deviation angle, respectively. In addition, a reward of +0.1 is assigned when these deviations decrease from one step to the next, so that improving actions are explicitly reinforced. The parameters of the reward function, such as 25 and 30, were determined so that the AI exhibited good performance in both simulations and full-scale ship experiments.
$$R_t = 0.1 \times \left( 2 \times 0.5^{\,(d_t/25)^2 + (|\theta_t|/30)^2} - 1 \right) \qquad (8)$$
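A sketch of this reward is shown below, based on our LaTeX reconstruction of Equation (8); since the published formula was recovered from a garbled layout, the exact algebra, and whether the +0.1 bonus requires both deviations to decrease, should be treated as assumptions.

```python
import numpy as np

# Gaussian-shaped base reward plus the +0.1 improvement bonus described
# in the text; the bonus condition (both deviations decreasing) is assumed.
def reward(d, theta, d_prev=None, theta_prev=None):
    base = 0.1 * (2.0 * 0.5 ** ((d / 25.0) ** 2 + (abs(theta) / 30.0) ** 2) - 1.0)
    bonus = 0.0
    if (d_prev is not None and theta_prev is not None
            and abs(d) < abs(d_prev) and abs(theta) < abs(theta_prev)):
        bonus = 0.1
    return base + bonus

print(reward(0.0, 0.0))    # on the line and parallel: +0.1
print(reward(50.0, 60.0))  # far off the line: close to -0.1
```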

3.4. Numerical Validation

The structures of the actor and critic networks are shown in Table 2 and Table 3. Dropout layers with a rate of 0.2 are applied after all non-output layers. The hidden layers use the ReLU activation function, while the output activations are linear for the actor network and softsign for the critic network. The learning hyperparameters are presented in Table 4.
The developed AI system was first evaluated via numerical simulation. An example line tracking result is shown in Figure 5, in which the triangles indicate the ship positions at intervals of 120 s. The average deviation distance from the target line is 2.0 m, against a ship length of 29.5 m, demonstrating that the developed AI system achieves high-precision line tracking. In the next section, the performance of the developed AI system is validated with an actual ship.

3.5. Experimental Validation

3.5.1. Experiment Environment

A full-scale experiment was conducted in Tateyama Bay, Japan. The research vessel TAKAMARU, shown in Figure 6, is capable of receiving rudder commands from an external PC; however, most existing ships cannot be directly controlled by external systems. Therefore, the developed AI system was implemented not as an automatic controller but as a ship-handling support system. Specifically, the AI-generated rudder commands were displayed on a monitor, and a crew member manually steered the vessel in accordance with these instructions.

3.5.2. Clarification of AI’s Intent

A major issue in AI-based control systems is the opacity of the decision-making process. To enhance transparency, the MMG model was used to generate predicted ship motions based on the AI’s rudder commands, which were visualized on a display screen. The prediction horizon exceeded 100 steps (one step corresponds to 5 s) and was updated at a rate of 1 Hz. The user interface (UI) developed for the experiment is shown in Figure 7. The red marker indicates the current position of the ship, the blue line the predicted trajectory, and the black line the target line. The current rudder angle and the rudder angle commanded by the AI system are displayed together, and the helm is operated manually so that the actual rudder follows the commanded rudder. In the example shown, the AI system clearly intends to steer the ship to the right to track the target line; the UI thus helps the operator understand the AI system’s intention in real time.

3.5.3. Ship Experiment

Survey lines were used as the tracking target for the ship experiment. An example of a survey line tracking trajectory in actual ocean-bottom exploration is shown in Figure 8. The vessel follows each survey line sequentially, transitioning from the end of one line to the beginning of the next via a trackable interpolation curve; in this study, circular arcs were employed to construct these curves. Since ships have large inertia and cannot change their heading instantaneously, a margin was introduced between the straight line and the arc to enable a gradual approach parallel to the line. The radius of the interpolation arc, the margin length, and the minimum turning radius are denoted by $R$, $m$, and $R_{min}$, respectively. If the distance between adjacent survey lines is greater than $2R_{min}$, the lines are extended outward by the margin $m$ and connected using a semicircular arc, as shown in Figure 9a; in this case, the radius $R$ is set to half the distance between the lines. When the distance between the lines is smaller, a single semicircular arc of radius $R_{min}$ cannot be placed between them, so the interpolation of Figure 9b is used with $R = R_{min}$: a first circular arc creates sufficient spacing, and a second arc connects to the next line, with the arcs arranged geometrically. Since the radius of the interpolation arcs is set according to the vessel’s maneuverability, trackable interpolation is achieved. Depending on the purpose of the survey, lines perpendicular to the main survey lines may also need to be tracked; in this case, the interpolation shown in Figure 9c was used, again with $R$ determined according to the maneuverability. Based on the maneuvering characteristics of the vessel, the parameters were set to $R_{min}$ = 100 m and $m$ = 30 m.
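The branching between the interpolation patterns of Figure 9a,b can be summarized in a few lines; the helper name is ours, and only R_min = 100 m, m = 30 m, and the 2·R_min threshold come from the text.

```python
# Choosing the interpolation geometry from the spacing of adjacent lines.
R_MIN, MARGIN = 100.0, 30.0  # minimum turning radius and margin [m]

def interpolation_radius(line_spacing):
    if line_spacing > 2.0 * R_MIN:
        # Figure 9a: one semicircle of radius spacing/2, after extending
        # both lines outward by the margin m.
        return line_spacing / 2.0, "single semicircle"
    # Figure 9b: two arcs of radius R_min arranged geometrically.
    return R_MIN, "double arc"

print(interpolation_radius(250.0))  # (125.0, 'single semicircle')
print(interpolation_radius(120.0))  # (100.0, 'double arc')
```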

3.5.4. Results of Experiment

The trajectory and time series of the rudder angle are shown in Figure 10 and Figure 11. The gray and blue lines in Figure 10 represent the target line and the actual trajectory, respectively. The triangles indicate the ship positions at time intervals of 120 s.

4. Developing a Human-Navigable AI System

4.1. Symmetry of AI Control

Control systems designed by humans, such as PID controllers, are symmetric. In contrast, the symmetry of an AI system is not ensured, which could lead to user suspicion and distrust. The asymmetry of AI control arises from the random initialization of the NN parameters and from bias in the learning data. Incorporating symmetry into DRL not only yields symmetric control behavior but may also improve learning efficiency. Therefore, this study considers DRL using symmetric training data and symmetry-constrained neural networks. The symmetric state $\bar{s}_t$ is defined in Equation (9); the symmetry is lateral in the relative coordinates.
$$s_t \equiv (u_t,\ v_t,\ r_t,\ \beta_t,\ \delta_t,\ T_{x0},\ T_{y0},\ T_{x1},\ T_{y1},\ \ldots,\ T_{x9},\ T_{y9})$$
$$\bar{s}_t \equiv (u_t,\ -v_t,\ -r_t,\ -\beta_t,\ -\delta_t,\ T_{x0},\ -T_{y0},\ T_{x1},\ -T_{y1},\ \ldots,\ T_{x9},\ -T_{y9}) \qquad (9)$$
First, symmetry is introduced into the training data. When sampling minibatch data from the replay buffer, symmetric data are generated from the sampled states, and both the original and symmetric data are included in the minibatch. This process is illustrated in Figure 12.
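A sketch of this augmentation is shown below, assuming the single-step index layout of Equation (9); the mirrored action simply changes sign, and the helper names are ours.

```python
import numpy as np

# Mirror a state laterally: v, r, beta, delta and every T_y change sign.
def mirror_state(s):
    s_bar = s.copy()
    s_bar[1:5] *= -1.0      # v, r, beta, delta
    s_bar[6::2] *= -1.0     # T_y0, T_y1, ..., T_y9
    return s_bar

# Double a sampled minibatch with its mirrored copies (Figure 12).
def symmetric_minibatch(states, actions):
    states_aug = np.concatenate([states, np.apply_along_axis(mirror_state, 1, states)])
    actions_aug = np.concatenate([actions, -actions])
    return states_aug, actions_aug
```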
Next, symmetry is incorporated into the NN architecture. The output of the actor network for the symmetric data should have the opposite sign to that for the original data, while the output of the critic network for the symmetric data should equal that for the original data. The NN structure incorporating these symmetry constraints is shown in Figure 13. For the actor network, the symmetric input data $\bar{s}$ are created from the original input data $s$, and common hidden layers are applied to both. Denoting the hidden-layer outputs by $f(s)$ and $f(\bar{s})$, the actor output is the softsign of their difference, following the DDPG. When the input state changes from $s$ to $\bar{s}$, the output changes from $\mathrm{softsign}(f(s) - f(\bar{s}))$ to $\mathrm{softsign}(f(\bar{s}) - f(s))$; thus, an output of opposite sign is obtained. In the critic network, the subtraction layer of the actor network is replaced by an addition layer so that the outputs for the original and symmetric inputs coincide, and the final activation is a linear function.
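The following PyTorch sketch implements the actor-side construction of Figure 13: shared hidden layers are applied to the input and its mirror, and the softsign of their difference guarantees a sign-flipping policy. The layer sizes and the 25-dimensional single-step state are placeholders; the paper’s input stacks five time steps.

```python
import torch
import torch.nn as nn

class SymmetricActor(nn.Module):
    def __init__(self, state_dim=25):
        super().__init__()
        # Shared hidden layers f applied to both s and its mirror s_bar.
        self.f = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                               nn.Linear(128, 1))
        mask = torch.ones(state_dim)
        mask[1:5] = -1.0    # v, r, beta, delta
        mask[6::2] = -1.0   # T_y coordinates
        self.register_buffer("mask", mask)

    def forward(self, s):
        s_bar = s * self.mask
        # softsign(f(s) - f(s_bar)) flips sign exactly when s is mirrored.
        return nn.functional.softsign(self.f(s) - self.f(s_bar))

actor = SymmetricActor()
s = torch.randn(4, 25)
assert torch.allclose(actor(s), -actor(s * actor.mask))  # mirrored input -> opposite action
```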
Symmetric actions can be achieved solely through the use of a symmetric actor network; therefore, a symmetric critic network is not strictly required. However, incorporating symmetry into the critic network may further improve learning efficiency. To examine the effectiveness of symmetry, the following three scenarios were evaluated:
  • ① Symmetric minibatch data.
  • ② A symmetric actor network.
  • ③ A symmetric actor network and a symmetric critic network.
The learning environment and reward settings were the same as those described in the previous section. The network architectures without symmetry correspond to those in Table 2 and Table 3, whereas the symmetry-constrained network architectures are presented in Table 5 and Table 6. The hyperparameters were the same as those listed in Table 4.
Under conditions of symmetry, learning was performed, and the results were compared with those of the conventional DDPG. The episode rewards and the critic network’s loss values during training are shown in Figure 14 and Figure 15. The blue, orange, green, and red lines represent the results of the conventional DDPG, the symmetric actor network, the symmetric actor network and symmetric critic, and the symmetric minibatch data, respectively. Figure 14 shows that the episode rewards for the symmetry-based approaches are generally higher than those for the conventional DDPG. In addition, incorporating symmetry reduced the critic network’s loss. These results indicate that symmetry contributes to improved learning performance.
The line tracking simulation results shown in Figure 16 confirm that high-precision tracking can be achieved. The results obtained using a symmetric actor network exhibit a symmetric time series of the rudder angle. In contrast, learning with symmetric data alone did not fully achieve symmetry; in particular, symmetry was not maintained between steps 150 and 200. Based on these results, the line tracking performance is improved most effectively when both the actor and critic networks incorporate symmetry.

4.2. Limitation of Action Changes

According to the results of the DDPG validation, the AI-generated rudder command sometimes jumped from −35° to 35° in a single step. Such abrupt control actions are not typically executed by human operators and impose a significant load on the steering gear. Changes in the rudder command should be limited in consideration of actuator limits, such as the maximum rate of change of the rudder angle, and of the inertial forces on board, which can cause anxiety for passengers. However, a learning method that explicitly limits the magnitude of action changes has not yet been proposed.
An initial attempt was made to incorporate a penalty for changes in the rudder angle during learning, but the outcomes were not satisfactory. Therefore, this study proposes a method to explicitly consider limits on action changes within the objective function.
In the DDPG, the parameters of the actor network are updated to maximize the objective function. The objective function considering limits on action changes, $J_{\pi_\theta}^{\Delta a}$, is defined by Equation (10), where the first term is the cumulative reward, the standard objective of the conventional DDPG. The second term, $\lambda_{\Delta a} L_{\Delta a}$, restricts the action change $\Delta a$, with $\lambda_{\Delta a}$ controlling the strength of the restriction; the parameters are thus updated so as to suppress the magnitude of action changes. Because the strength of the second term depends on both its value and its gradient, the penalty function should have a large gradient once the action change exceeds the limit. In this study, the Swish function [21] is employed for this purpose, as shown in Equation (10). The resulting shape of the penalty is shown in Figure 17.
$$J_{\pi_\theta}^{\Delta a} = J_{\pi_\theta} - \lambda_{\Delta a} L_{\Delta a}$$
$$L_{\Delta a} = \mathrm{swish}\big(\beta(|\Delta a| - \Delta a_{max})\big) + \frac{|\Delta a|\,\mathrm{swish}(\beta\Delta a_{max})}{\Delta a_{max}} - \mathrm{swish}(\beta\Delta a_{max})$$
$$\Delta a = \pi_\theta(s_t) - \pi_\theta(s_{t+1}) \qquad (10)$$
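The essential behavior of the penalty, near zero below the limit and steeply increasing beyond it, can be sketched with PyTorch’s built-in Swish (SiLU). This is a simplified stand-in for our reconstruction of Equation (10), not the paper’s exact expression, and the coefficient values are illustrative.

```python
import torch
import torch.nn.functional as F

# Swish-shaped penalty on the per-step action change |da|: roughly zero
# below da_max, growing steeply beyond it; beta sharpens the knee.
def action_change_penalty(a_t, a_next, da_max=1.0, beta=5.0):
    da = (a_t - a_next).abs()
    return F.silu(beta * (da - da_max))   # swish(x) = x * sigmoid(x)

a_t, a_next = torch.tensor([0.2]), torch.tensor([1.8])
print(action_change_penalty(a_t, a_next))  # change of 1.6 exceeds the limit -> large penalty
```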
To validate the proposed method, learning was performed under three conditions: no limit, a limit of 20°/step, and a limit of 40°/step. The results are shown in Figure 18; the deviation distances were comparable across all conditions.
The distribution of the magnitude of rudder changes is shown in Figure 19a–c, and the cumulative ratio of the degree of rudder change is shown in Figure 19d. According to Figure 19d, imposing a limit reduces the frequency of large rudder angle changes: in more than 95% of the steps, the magnitude of the action change remains below the specified limit. Although the proposed method does not perfectly enforce the action change constraint, it effectively suppresses excessive changes. Based on these results, the proposed method can be used to incorporate action change limits into the learning process. In this study, the restriction was implemented with the Swish function, although the ReLU function could also be used.

4.3. Situational Action Policy Smoothing

NNs are highly sensitive to small variations in input data. This issue is well known not only in autonomous control but also in image recognition. In real sea conditions, measured data are affected by ship motions (e.g., rolling), sensor errors, and observation noise. If an AI system responds excessively to such variations, rudder commands may fluctuate significantly even when the vessel is moving along a straight line.
A smoothing method for DRL, Conditioning for Action Policy Smoothing (CAPS) [22], has been proposed to address this issue. In CAPS, the objective function of the actor network, $J_{\pi_\theta}^{CAPS}$, is described by Equation (11). The second and third terms are the temporal and spatial smoothing terms, with $\lambda_T$ and $\lambda_S$ controlling their strengths and $D_T$ and $D_S$ denoting distance measures. The actor parameters are updated to maximize the cumulative reward through the first term. The second term encourages temporal smoothness by minimizing the difference between consecutive actions, $\pi_\theta(s_t)$ and $\pi_\theta(s_{t+1})$. The third term suppresses sensitivity to input noise by minimizing the difference between the action for the observed state, $\pi_\theta(s_t)$, and the action for the noise-perturbed state, $\pi_\theta(s_t^{noise})$. The noise distribution was determined from the experimental data.
$$J_{\pi_\theta}^{CAPS} = J_{\pi_\theta} - \lambda_T L_T - \lambda_S L_S$$
$$L_T = D_T\big(\pi_\theta(s_t),\ \pi_\theta(s_{t+1})\big)$$
$$L_S = D_S\big(\pi_\theta(s_t),\ \pi_\theta(s_t^{noise})\big) \qquad (11)$$
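A minimal sketch of the two CAPS regularizers follows, using mean squared differences for the distance measures $D_T$ and $D_S$ and Gaussian input noise; these specific choices are ours, and `actor` is any policy network mapping states to actions.

```python
import torch

# CAPS regularizers of Equation (11): temporal smoothness and noise robustness.
def caps_terms(actor, s_t, s_next, noise_std=0.01):
    a_t = actor(s_t)
    L_T = ((a_t - actor(s_next)) ** 2).mean()               # consecutive actions
    s_noisy = s_t + noise_std * torch.randn_like(s_t)
    L_S = ((a_t - actor(s_noisy)) ** 2).mean()              # noisy-state actions
    return L_T, L_S
```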
The temporal term in CAPS acts as a constraint on changes in the action. Although this helps suppress meandering on straight paths by stabilizing the actor output, it also delays rudder adjustments when transitioning from straight segments to curves. In other words, CAPS does not consider changes in the observed state. To address this, a state-dependent restriction is proposed.
The improved objective function is described by Equation (12), in which the second term restricts temporal action changes. Since the action-change penalty is divided by a factor that grows with the state change, the restriction becomes weaker when the observed state changes significantly. This improved method is called Situational Action Policy Smoothing (SAPS). $L_{Ta}$, $L_{Ts}$, and $L_S$ are the differences between consecutive actions, between consecutive states, and between the actions for the observed and noisy states, respectively. $\lambda_{Ta}$, $\lambda_{Ts}$, and $\lambda_S$ are the strength of the per-step action constraint, the relaxation of the constraint in response to state variations, and the strength of the constraint against noise, respectively.
$$J_{\pi_\theta}^{SAPS} = J_{\pi_\theta} - \lambda_{Ta}\frac{L_{Ta}}{\lambda_{Ts} L_{Ts} + 1} - \lambda_S L_S$$
$$L_{Ta} = D_{Ta}\big(\pi_\theta(s_t),\ \pi_\theta(s_{t+1})\big)$$
$$L_{Ts} = D_{Ts}(s_t,\ s_{t+1})$$
$$L_S = D_S\big(\pi_\theta(s_t),\ \pi_\theta(s_t^{noise})\big) \qquad (12)$$
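The SAPS modification then amounts to dividing the temporal term by a state-change factor, as sketched below with the Table 7 coefficients; the distance measures remain our assumption.

```python
import torch

# SAPS temporal term of Equation (12): the CAPS action-change penalty is
# divided by (lam_Ts * state change + 1), so the constraint relaxes when
# the observed state itself is changing (e.g., entering a curve).
def saps_temporal_term(actor, s_t, s_next, lam_Ta=0.02, lam_Ts=1.0):
    L_Ta = ((actor(s_t) - actor(s_next)) ** 2).mean()   # action change
    L_Ts = ((s_t - s_next) ** 2).mean()                 # state change
    return lam_Ta * L_Ta / (lam_Ts * L_Ts + 1.0)
```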
To validate the proposed method, learning was performed using both CAPS and SAPS, and the two were compared at the same restriction strength. The objective function parameters are listed in Table 7, and the simulated trajectories are shown in Figure 20. SAPS follows both straight and curved lines smoothly and with high accuracy, whereas CAPS causes delayed rudder movement in curved sections. These results demonstrate that SAPS achieves situational smoothing with improved accuracy.
Next, the conventional DDPG and the DDPG with SAPS were compared. The results are presented in Figure 21. The average change in the rudder angle per step decreased from 16.5°/step in the conventional DDPG to 0.73°/step with SAPS. This indicates that the proposed smoothing method is highly effective.

5. Validation Using Ship Experiment

5.1. Results

In the previous section, methods for improving the DDPG were described. The proposed objective function and its parameters are given by Equation (13) and in Table 8. The architectures of the actor and critic networks are the same as those presented in Table 5 and Table 6.
$$J'_{\pi_\theta} = J_{\pi_\theta} - \lambda_S L_S - \lambda_{Ta}\frac{L_{Ta}}{\lambda_{Ts} L_{Ts} + 1} - \lambda_{\Delta a} L_{\Delta a}$$
$$L_S = D_S\big(\pi_\theta(s_t),\ \pi_\theta(s_t^{noise})\big)$$
$$L_{Ta} = D_{Ta}\big(\pi_\theta(s_t),\ \pi_\theta(s_{t+1})\big)$$
$$L_{Ts} = D_{Ts}(s_t,\ s_{t+1})$$
$$L_{\Delta a} = \mathrm{swish}\big(\beta(|\Delta a| - \Delta a_{max})\big) + \frac{|\Delta a|\,\mathrm{swish}(\beta\Delta a_{max})}{\Delta a_{max}} - \mathrm{swish}(\beta\Delta a_{max})$$
$$\Delta a = \pi_\theta(s_{t+1}) - \pi_\theta(s_t) \qquad (13)$$
A line tracking AI system incorporating all the proposed improvements was developed and validated under real sea conditions using survey lines with the three types of interpolation. The results are shown in Figure 22, Figure 23 and Figure 24; in the trajectory figures, the triangles indicate the ship positions at intervals of 120 s. During these experiments, environmental disturbances were mild (average tidal current: 0.4 knots; average wind speed: 2.0 m/s). The average deviation distances were 3.6, 4.2, and 4.6 m, and the maximum deviations were 8.3, 11.8, and 17.5 m. Since even the maximum deviations were on the order of half the vessel length or less, these results demonstrate that high-precision line tracking was achieved in actual sea conditions. Moreover, the rudder movements were smooth, which is favorable for operator comfort and practical use.
Additional experiments were conducted to evaluate the system’s performance under stronger disturbances (average tidal current: 0.8 knots; average wind speed: 6.5 m/s). The results are shown in Figure 25, Figure 26 and Figure 27. The average deviation distances were 3.5, 1.8, and 1.8 m, and the maximum deviations were 13.0, 6.6, and 6.1 m. The rudder movements remained smooth. These results indicate that the proposed method achieves precise and stable line tracking even under strong disturbances.

5.2. Discussion

This paper proposes a steering control support system based on an improved DDPG algorithm and compares its performance with that of the conventional DDPG and a human helmsman. The former serves as a baseline reinforcement learning (RL) method. The latter represents human steering, the approach most commonly employed in actual ship operations, and can be regarded as the best baseline for validating the effectiveness of the proposed method.
The conventional DDPG results shown in Figure 10 and Figure 11 were compared with those of the improved DDPG. The trajectories and time series of the rudder angle are shown in Figure 28 and Figure 29; the triangles indicate the ship positions at intervals of 120 s. The average change in rudder angle was 4.508°/step with the conventional DDPG and 0.355°/step with the improved DDPG, a reduction to approximately 8%. It can therefore be concluded that the proposed method achieves smooth rudder control without compromising tracking accuracy.
Experiments on line tracking by a human operator without AI support were also conducted using the same survey lines. The trajectory and the time series of the rudder angle are shown in Figure 30 and Figure 31; the triangles indicate the ship positions at intervals of 120 s. For a straight line, tracking can be achieved by keeping the ship’s latitude equal to that of the survey line. For a curved or diagonal line, however, both the latitude and longitude of the ship must match those of the survey line, and it was difficult for the human operator to control both simultaneously. In contrast, the developed AI system tracked the curved lines as well; we can therefore conclude that it is effective for automating survey line tracking. The average change in rudder angle per step was 0.51°/step under AI operation and 0.50°/step under human operation; the AI generated human-like smooth control actions while achieving higher tracking accuracy. These results confirm that the developed AI system can provide human-navigable ship-handling support.
The results of all experiments conducted under the same conditions as those in Figure 22 and Figure 23 are shown in Figure 32. The experiments were conducted four times for both the wide and narrow survey lines, with each trial plotted in a different color in the figures, and all trials were performed by the same helmsman. As shown in Figure 32, similar trajectories were obtained across all attempts. Although the helmsman’s control inputs varied between trials, the results indicate that the proposed AI system effectively supports the helmsman in achieving a consistent level of tracking precision.

5.3. Comparison of Line Tracking Between AI-Supported Human Control and AI Control

In the ship experiments described in the previous section, the developed AI system was implemented as a support system that displayed rudder commands to the operator. This approach achieved high-precision line tracking performance. In this section, the results of fully automated AI control are presented and compared with those of AI-supported human control.
In the AI control experiment, rudder commands were transmitted directly from an external laptop to the vessel’s steering system, which executed the corresponding rudder angles automatically. The trajectories and time series of the rudder angle for AI-supported helmsman control and fully automated AI control are shown in Figure 33. The average deviation distances from the target path were 9.79 m under AI-supported helmsman control and 10.08 m under AI control. High-precision line tracking was achieved in both cases, which shows that the developed AI system issues steering instructions that a human operator can feasibly follow. Moreover, even if the operator’s steering response is slightly delayed, the difference in performance remains minimal because the AI-based steering assistance incorporates state feedback.
In conclusion, the developed AI system is effective both as a human-navigable support system and as an autonomous controller. Given that most existing ships cannot be directly controlled by external computers, the ability of the system to operate in human-navigable support mode provides a significant practical advantage and facilitates its deployment in current maritime operations. This capability is particularly valuable during the transitional period before fully autonomous ships become feasible.

6. Conclusions

This study addressed the limitations of deep reinforcement learning (DRL) in achieving human-navigable, AI-based ship-handling support for survey line tracking. First, the symmetry of the maneuvering control was guaranteed by introducing symmetry-constrained actor and critic networks. Second, excessive control actions were suppressed by incorporating a penalty term based on the Swish function into the objective function. Third, situational smoothing of the rudder actions was achieved by introducing improved conditioning for action policy smoothing. Ship experiments were conducted at sea to validate the proposed methods, and they demonstrated that the proposed method achieves smooth and precise control with an appropriate update frequency under real sea conditions. The AI-generated rudder commands were also found to be interpretable and manageable for human helmsmen to follow. Furthermore, comparison with the fully automated AI control experiments showed that line tracking was similarly smooth and accurate, confirming the effectiveness of the system as a human-navigable ship-handling support tool.
The proposed ship-handling support tool using improved DRL for line tracking was validated at sea under natural disturbances. Further investigation of its performance under more severe environmental conditions is required to assess its suitability for practical deployment. In addition, it is necessary to quantitatively evaluate the human-friendliness of the proposed method using appropriate indices, such as the helmsman’s tension level and workload. Furthermore, incorrect or faulty data may be present in real-world environments. Therefore, incorporating fault-tolerant control (FTC) mechanisms, such as those reported in [23], to enhance the robustness of the proposed AI system under practical operating conditions is an important direction for future work.

Author Contributions

Conceptualization, H.Y., H.H. and A.M.; methodology, H.Y.; software, H.Y.; validation, H.Y.; formal analysis, H.Y.; investigation, H.Y.; resources, H.Y., H.H. and A.M.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, H.H. and A.M.; visualization, H.Y.; supervision, H.H.; project administration, H.H.; funding acquisition, H.Y., H.H. and A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by JSPS KAKENHI (Grant Numbers 20H00284 and 23K26321) and JST SPRING (Grant Number JPMJSP2139).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the crew of the research vessel TAKAMARU of the Japan Fisheries Research and Education Agency for conducting the demonstration experiments. During the preparation of this manuscript, the authors used ChatGPT 5.1 for the purposes of English editing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Abdelaal, M.; Fränzle, M.; Hahn, A. Nonlinear model predictive control for trajectory tracking and collision avoidance of underactuated vessels with disturbances. Ocean Eng. 2018, 160, 168–180. [Google Scholar] [CrossRef]
  2. Yang, Y.; Du, J.; Liu, H.; Guo, C.; Abraham, A. A trajectory tracking robust controller of surface vessels with disturbance uncertainties. IEEE Trans. Control Syst. Technol. 2014, 22, 1511–1518. [Google Scholar] [CrossRef]
  3. Liu, C.; Negenborn, R.R.; Chu, X.; Zheng, H. Predictive path following based on adaptive line-of-sight for underactuated autonomous surface vessels. J. Mar. Sci. Technol. 2017, 23, 483–494. [Google Scholar] [CrossRef]
  4. Yuan, S.; Liu, Z.; Zheng, L.; Sun, Y.; Wang, Z. Event-based adaptive horizon nonlinear model predictive control for trajectory tracking of marine surface vessel. Ocean Eng. 2022, 258, 111082. [Google Scholar] [CrossRef]
  5. Shin, J.; Kwak, D.J.; Lee, Y. Adaptive path-following control for an unmanned surface vessel using an identified dynamic model. IEEE/ASME Trans. Mechatron. 2017, 22, 1143–1153. [Google Scholar] [CrossRef]
  6. Zhu, G.; Du, J. Robust adaptive neural trajectory tracking control of unmanned surface vessels under input saturation. In Proceedings of the Chinese Control Conference, Guangzhou, China, 27–30 July 2019. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Qi, X.; Ma, Y.; Li, Z.; Malekian, R.; Sotelo, M.A. Path following optimization for an underactuated USV using smoothly-convergent deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2021, 22, 6208–6220. [Google Scholar] [CrossRef]
  8. Qu, X.; Liang, X.; Hou, Y.; Li, Y.; Zhang, R. Path-following control of unmanned surface vehicles with unknown dynamics and unmeasured velocities. J. Mar. Sci. Technol. 2020, 26, 395–407. [Google Scholar] [CrossRef]
  9. Qiu, B.; Wang, G.; Fan, Y.; Mu, D.; Sun, X. Robust path-following control based on trajectory linearization control for unmanned surface vehicle with uncertainty of model and actuator saturation. IEEJ Trans. Electr. Electron. Eng. 2019, 14, 1681–1690. [Google Scholar] [CrossRef]
  10. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  11. Zheng, Y.; Tao, J.; Hartikainen, J.; Duan, F.; Sun, H.; Sun, M.; Sun, Q.; Zeng, X.; Chen, Z.; Xie, G. DDPG based LADRC trajectory tracking control for underactuated unmanned ship under environmental disturbances. Ocean Eng. 2023, 271, 113667. [Google Scholar] [CrossRef]
  12. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  13. Zhong, W.; Li, H.; Meng, Y.; Yang, X.; Feng, Y.; Ye, H.; Liu, W. USV path following controller based on DDPG with composite state-space and dynamic reward function. Ocean Eng. 2022, 266, 112449. [Google Scholar] [CrossRef]
  14. Zhao, Y.; Qiu, B.; Wang, G.; Fan, Y. Robust path following control of underactuated unmanned surface vehicle with disturbances and input saturation. IEEE Access 2021, 9, 46106–46116. [Google Scholar] [CrossRef]
  15. Ribeiro, M.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016. [Google Scholar] [CrossRef]
  16. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  17. Liu, G.; Liang, H.; Wang, R.; Sui, Z.; Sun, Q. Adaptive event-triggered output feedback control for nonlinear multiagent systems using output information only. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 7639–7650. [Google Scholar] [CrossRef]
  18. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971. [Google Scholar]
  19. Yasukawa, H.; Yoshimura, Y. Introduction of MMG standard method for ship maneuvering predictions. J. Mar. Sci. Technol. 2014, 20, 37–52. [Google Scholar] [CrossRef]
  20. KOBEC, Summary of the First Survey Voyage in Fiscal Year 2008. Available online: http://www.k-obec.kobe-u.ac.jp/cms/wp-content/uploads/2020/04/captain-report_01.pdf (accessed on 14 October 2025). (In Japanese)
  21. Ramachandran, P.; Zoph, B.; Le, Q.V. Swish: A Self-Gated Activation Function. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  22. Mysore, S.; Mabsout, B.; Mancuso, R.; Saenko, K. Regularizing action policies for smooth control with reinforcement learning. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021. [Google Scholar]
  23. Liu, G.; Sun, Q.; Su, H.; Wang, M. Adaptive Cooperative Fault-tolerant control for output-constrained nonlinear multi-agent systems under stochastic FDI attacks. IEEE Trans. Circuits Syst. I Regul. Pap. 2025, 72, 6025–6036. [Google Scholar] [CrossRef]
Figure 1. Interaction between environment and agent in reinforcement learning.
Figure 2. Coordinate system of the MMG model.
Figure 3. Example of trajectory generated in learning environment.
Figure 4. Definition of target points $T_0$–$T_5$. The first target point $T_0$ is located at $(T_{x0}, T_{y0})$.
Figure 5. Simulated trajectory (red line) and target line (dashed line).
Figure 6. Research vessel “TAKAMARU”.
Figure 7. Display of user interface.
Figure 8. Example of trajectories of line tracking for ocean-bottom survey. Different colors indicate tracking results obtained on different days [20].
Figure 9. Interpolation of survey lines.
Figure 10. Trajectory of ship experiment. The dashed line indicates the target line.
Figure 11. Time series of rudder angle.
Figure 12. Creation of symmetric minibatch data.
Figure 13. Structure of neural network considering symmetry.
Figure 14. Transition of episode rewards (circle markers) and moving averages (lines).
Figure 15. Transition of mean absolute error (circle markers) and moving averages (lines).
Figure 16. Trajectory and rudder angle in survey line tracking using symmetric NN.
Figure 17. Loss function with respect to degree of action. Red lines indicate zero and the limit.
Figure 18. Comparison of trajectory and rudder command. The dashed line in the trajectory figure indicates the target line.
Figure 19. Probability density of degree of rudder change.
Figure 20. Comparison of trajectory and rudder angle between CAPS (red) and SAPS (blue). The dashed line in the trajectory figure indicates the target line.
Figure 21. Comparison of trajectory and rudder angle between DDPG (red) and SAPS (blue). The dashed line in the trajectory figure indicates the target line.
Figure 22. Results of ship experiment in calm conditions (wide survey line).
Figure 23. Results of ship experiment in calm conditions (narrow survey line).
Figure 24. Results of ship experiment in calm conditions (perpendicular survey line).
Figure 25. Results of ship experiment under strong disturbance (wide survey line).
Figure 26. Results of ship experiment under strong disturbance (narrow survey line).
Figure 27. Results of ship experiment under strong disturbance (perpendicular survey line).
Figure 28. Trajectories (blue lines) of DDPG (top panel) and improved DDPG (bottom panel), and the target line (dashed line).
Figure 29. Time series of rudder command of DDPG (top panel) and improved DDPG (bottom panel).
Figure 30. Comparison of trajectories under AI operation (top panel) and human operation (bottom panel). The dashed line indicates the target line.
Figure 31. Comparison of time series of rudder angle under AI operation (top panel) and human operation (bottom panel).
Figure 32. Trajectories of all experiments for wide and narrow survey lines. The dashed line indicates the target line.
Figure 33. Comparison of trajectories between full AI control and helmsman control supported by AI. The dashed line indicates the target line.
Table 1. List of symbols.
m — Ship’s mass
m_x, m_y — Added masses in the x-axis and y-axis directions, respectively
I_zG — Moment of inertia of ship around center of gravity
J_z — Added moment of inertia
u — Surge velocity at center of gravity
v_m — Lateral velocity at midship
r — Yaw rate
x_G — Longitudinal coordinate of center of gravity of ship
X, Y, N_m — Surge force, lateral force, and yaw moment around midship, ignoring added-mass components
X_H, Y_H, N_H — Surge force, lateral force, and yaw moment around midship acting on ship hull, ignoring added-mass components
X_P — Surge force due to propeller
X_R, Y_R, N_R — Surge force, lateral force, and yaw moment around midship by steering

Table 2. Structure of hidden layers of actor network.
Layer            Number of Nodes
Fully Connected  128
Fully Connected  512
Fully Connected  512
Fully Connected  128
Fully Connected  1

Table 3. Structure of hidden layers of critic network.
Layer            Number of Nodes
Fully Connected  128
Fully Connected  512
Fully Connected  512
Fully Connected  128
Fully Connected  1

Table 4. Hyperparameters.
Optimizer      Adam
Learning rate  0.01
Memory size    120,000
Batch size     2048

Table 5. Structure of hidden layers of actor network considering symmetry.
Layer            Number of Nodes
Fully Connected  128
Fully Connected  512
Fully Connected  512
Fully Connected  1

Table 6. Structure of hidden layers of critic network considering symmetry.
Layer            Number of Nodes
Fully Connected  128
Fully Connected  512
Fully Connected  512
Fully Connected  1

Table 7. Parameters of CAPS and SAPS.
CAPS           SAPS
λ_S = 0.02     λ_S = 0.02
λ_T = 0.02     λ_Ta = 0.02
—              λ_Ts = 1.0

Table 8. Parameters of proposed objective function.
λ_S      0.02
λ_Ta     0.02
λ_Ts     1.0
λ_Δa     0.02
Δa_max   1.0