1. Introduction
Japan relies on other countries for most of its resources. With land resources becoming depleted, the focus is shifting toward marine resources. Numerous valuable resources, such as rare earth elements and methane hydrates, exist on the seabed in Japan’s exclusive economic zone. The development of such seabed resources requires surveying a wide sea area. Moreover, seabed investigations are essential for predicting natural hazards such as subduction zone earthquakes and submarine volcanic eruptions. Offshore wind power generation, which does not produce CO2 emissions, has also attracted attention as a measure against global warming. Seabed surveys are indispensable for designing the mooring of floating offshore wind turbines. Consequently, ocean surveys are becoming increasingly important; however, they demand substantial time and cost due to the vast areas that must be surveyed. For this purpose, careful ship handling to precisely track survey lines over a long period of time is required. Therefore, it is necessary to develop human-navigable ship-handling instructions for seafarers to use as practical support.
Proportional–integral–derivative (PID) control is commonly employed for line tracking maneuvers. The control output is computed with constant gains applied to the current error, the time integral of the error, and the rate of change of the error. However, PID control is difficult to apply to ships with multiple actuators, and it struggles to achieve precise control under dynamic variations in disturbances and tracking targets, because the gains are fixed and tuned in advance to maximize average performance. In addition, survey line tracking control must anticipate the ship’s behavior and changes in the target line. Early action is necessary because the inertial effect is large and a ship does not turn as soon as steering begins. Such prediction-based control is difficult to achieve with PID control, so model predictive control (MPC) is usually used instead [1,2,3]. MPC is based on motion prediction and can therefore reduce overshoots and oscillations of the control variable. MPC outputs the optimized control action over a prediction horizon. However, the computational cost of this optimization is high, so online control via MPC is difficult when complex control targets and/or constraints are involved [4]. Furthermore, MPC obtains the control output from model predictions, so errors in the maneuvering model can degrade control performance. Because model errors are difficult to eliminate, control schemes capable of compensating for discrepancies between numerical models and real-world dynamics have been studied [5,6,7,8]. The adaptability of such schemes to model errors and disturbances has been demonstrated in previous studies. Although these methods are well established in control theory, designing constraints that satisfy multiple objectives remains challenging. Human-navigable control requires the simultaneous consideration of tracking accuracy, control smoothness, robustness to noise, and real-time capability.
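The fixed-gain PID law described above can be sketched as follows. This is an illustrative implementation, not code from this study; the gains, time step, and the toy first-order error dynamics in the usage example are assumptions chosen only to show the structure of the three terms.

```python
class PID:
    """Fixed-gain PID controller: output from the error, its time
    integral, and its rate of change (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        # Proportional term: reacts to the current error.
        p = self.kp * error
        # Integral term: accumulates the time series of errors.
        self.integral += error * self.dt
        i = self.ki * self.integral
        # Derivative term: reacts to the changing rate of the error.
        if self.prev_error is None:
            d = 0.0
        else:
            d = self.kd * (error - self.prev_error) / self.dt
        self.prev_error = error
        return p + i + d
```

Because all three gains are constants tuned offline, the same response is produced regardless of how the disturbance or the tracking target evolves, which is exactly the limitation noted above.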
Artificial intelligence (AI) has advanced rapidly in recent years. In a previous study, a multilayer neural network was incorporated into an existing control law to estimate model uncertainty, and the control performance was enhanced [9]. Another approach is to build the control law directly on deep reinforcement learning (DRL). DQN, a pioneering DRL method, achieved higher scores in Atari games than a human player [10], demonstrating the potential of AI to exceed human capabilities in specific tasks. Since then, many improved DRL methods and algorithms have been studied and applied in various research fields, including line tracking problems [11,12,13,14]. These studies demonstrated that a well-trained DRL model is effective for developing autonomous line tracking control. DRL optimizes control policies by considering future vessel motion and target line changes through vast trial-and-error simulation. Compared with PID control, DRL can achieve more precise and flexible control; compared with MPC, it enables online control owing to its superior computational speed and its robustness against model errors. Whereas MPC computes the optimal control using a maneuvering model, DRL learns the optimal action directly from data; thus, it can adapt to the system dynamics even when model errors exist, because the input data are continuously updated through state feedback.
However, AI-based control systems suffer from certain limitations. One limitation is that the decision-making process is opaque: its criteria are not designed by humans but learned autonomously by the AI. Automatic control systems and navigation support systems are preferably white boxes, which increases user trust. Although explainable AI (XAI) has been studied to address this issue [15,16], existing methods cannot fully clarify the basis of decision-making. Another problem is that AI outputs can be highly sensitive to small fluctuations in the input data. Because a ship’s observation data are affected by surge, sway, yaw, and roll motions as well as observation noise, the state perceived by the AI oscillates frequently. Consequently, the control outputs change over time even though the actual ship state does not change significantly. Such behavior can reduce ride comfort, damage cargo, strain the steering gear, and reduce user confidence. Event-triggered control is a method for reducing the frequency of output updates [17]. However, when the threshold is set to a large value to decrease the update frequency, the deviation at the moment of a control update becomes large. It is therefore difficult to simultaneously reduce both the frequency of output updates and the magnitude of the output change at a single update step. In addition, the timing of events is difficult for human operators to predict. Event-triggered control is therefore unsuitable for developing human-navigable control systems.
Another issue is the sensitivity to observation noise, which is problematic for the robustness of AI systems. Furthermore, the control algorithms proposed in previous studies were validated only through numerical simulations; therefore, the performance of maneuvering control systems trained via numerical simulation in a real-world environment has not been clarified. Addressing these DRL-related issues is essential for practical deployment.
This study proposes a DRL-based control system for achieving precise and human-navigable survey line tracking based on an improved Deep Deterministic Policy Gradient (DDPG) algorithm. In this study, the term “human-navigable” is defined as the ability of a human operator to perform operations in the same manner as in normal manual operation while following the AI’s instructions. To develop human-navigable AI, it is essential that the control instructions vary smoothly and at a frequency comparable with that of a human operator, that the resulting control actions are similar to those of a human operator, and that the control accuracy is equal to or greater than that achieved by a human operator. The features of the proposed scheme include symmetric maneuvering control, suppression of unnecessary excessive control, and infrequent and smooth control. Symmetric maneuvering control is achieved through a symmetry-constrained neural network architecture. Human-navigable, infrequent, and smooth control is achieved by devising the learning objective function. These algorithms can contribute to the development of human-like autonomous controls and human-navigable ship-handling support. First, the effectiveness of each algorithm was verified through numerical experiments. Second, an onboard survey line tracking system incorporating all the algorithms was developed for the research vessel. Finally, the developed system was validated through experiments. The results confirmed that symmetric, smooth, and precise line tracking control can be achieved under real sea conditions.
4. Developing a Human-Navigable AI System
4.1. Symmetry of AI Control
Control systems designed by humans, such as PID controllers, are symmetrical. In contrast, the symmetry of an AI system is not guaranteed, which could lead to user suspicion and distrust. The asymmetry of AI control arises from the randomness of the initial parameter values of the NNs and from bias in the learning data. Incorporating symmetry into DRL not only yields symmetric control behavior but may also improve learning efficiency. Therefore, this study considers DRL using symmetric training data and symmetry-constrained neural networks. A symmetrical state s′ is defined in Equation (9); the symmetry is lateral in the relative coordinate system.
First, symmetry is introduced into the training data. When sampling minibatch data from the replay buffer, symmetric data are generated from the sampled states, and both the original and symmetric data are included in the minibatch. This process is illustrated in Figure 12.
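The symmetric-minibatch idea can be sketched as below. Which state components flip sign under lateral mirroring (here assumed to be cross-track deviation, heading error, and yaw rate), the sign flip of the rudder action, and the unchanged reward are illustrative assumptions; the actual mirroring follows Equation (9).

```python
import numpy as np

# Assumed per-component signs for lateral mirroring of the state
# (cross-track deviation, heading error, yaw rate all flip sign).
MIRROR = np.array([-1.0, -1.0, -1.0])

def augment_symmetric(states, actions, rewards, next_states):
    """Append laterally mirrored copies of sampled transitions so the
    minibatch always contains both the original and symmetric data."""
    sym_states = states * MIRROR
    sym_actions = -actions            # rudder command flips sign
    sym_next = next_states * MIRROR
    # Reward is symmetric: the same deviation magnitude gives the same reward.
    return (np.concatenate([states, sym_states]),
            np.concatenate([actions, sym_actions]),
            np.concatenate([rewards, rewards]),
            np.concatenate([next_states, sym_next]))
```

The augmentation doubles the minibatch size, so every gradient step sees each sampled situation together with its mirror image.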
Next, symmetry is incorporated into the NN architecture. The output of the actor network for symmetric input data should have the opposite sign to that for the original data, whereas the output of the critic network for symmetric input data should have the same value as that for the original data. The NN structure incorporating these symmetry constraints is shown in Figure 13. For the actor network, the symmetric input data s′ are created from the original input data s. Common hidden layers are applied to both inputs, yielding the outputs h(s) and h(s′). Finally, the difference h(s) − h(s′) is taken as the output of the actor network. The output activation is a softsign function, following the DDPG. When the input state changes from s to s′, the output of the actor network changes from h(s) − h(s′) to h(s′) − h(s); thus, an output value with the opposite sign is obtained. In the critic network, the subtraction layer of the actor network is replaced by an addition layer so that the output for the symmetric input equals that for the original input. The final activation is a linear function.
Symmetric actions can be achieved solely through the use of a symmetric actor network; therefore, a symmetric critic network is not strictly required. However, incorporating symmetry into the critic network may further improve learning efficiency. To examine the effectiveness of symmetry, the following three scenarios were evaluated:
① Symmetric minibatch data.
② A symmetric actor network.
③ A symmetric actor network and a symmetric critic network.
The learning environment and reward settings were the same as those described in the previous section. The network architectures without symmetry correspond to those in Table 2 and Table 3, whereas the symmetry-constrained network architectures are presented in Table 5 and Table 6. The hyperparameters were the same as those listed in Table 4.
Training was performed under each symmetry condition, and the results were compared with those of the conventional DDPG. The episode rewards and the critic network’s loss values during training are shown in Figure 14 and Figure 15, respectively. The blue, orange, green, and red lines represent the results of the conventional DDPG, the symmetric actor network, the combination of symmetric actor and critic networks, and the symmetric minibatch data, respectively.
Figure 14 shows that the episode rewards for the symmetry-based approaches are generally higher than those for the conventional DDPG. In addition, incorporating symmetry reduced the critic network’s loss. These results indicate that symmetry contributes to improved learning performance.
The line tracking simulation results shown in Figure 16 confirm that high-precision tracking can be achieved. The results obtained using a symmetric actor network exhibit a symmetric time series of the rudder angle. In contrast, learning with symmetric data alone did not fully achieve symmetry; in particular, symmetry was not maintained between steps 150 and 200. Based on these results, the line tracking performance is improved most effectively when both the actor and critic networks incorporate symmetry.
4.2. Limitation of Action Changes
According to the DDPG validation results, the AI-generated rudder command could swing from −35° to 35° in a single step. Such abrupt control actions are not typically executed by human operators and impose a significant load on the steering gear. The magnitude and frequency of rudder command changes should be limited, considering actuator limits such as the maximum rudder rate, and to reduce the inertial forces that cause passenger discomfort. However, a learning method that explicitly accounts for a limit on action changes has not yet been proposed.
An initial attempt was made to incorporate a penalty for changes in the rudder angle during learning, but the outcomes were not satisfactory. Therefore, this study proposes a method to explicitly consider limits on action changes within the objective function.
In the DDPG, the parameters of the actor network are updated to maximize an objective function. The objective function considering the limits on action changes, J′, is defined by Equation (10). The first term represents the cumulative reward, which is the standard objective in the conventional DDPG. The second term, f(Δa), represents the restriction on the action change Δa, and λ is the strength of the restriction. The parameters are updated so as to suppress the magnitude of action changes. The influence of the second term depends on both its value and its gradient; therefore, the function should have a large gradient when the action change exceeds the limit. In this study, the Swish function [21] is employed for this purpose, as shown in Equation (10). The resulting shape of the objective function is shown in Figure 17.
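The shape of the Swish-based restriction term can be sketched as follows. The exact form of Equation (10) is defined in the paper; the coefficients and the way the limit enters the argument here are illustrative assumptions that reproduce the intended behavior: a nearly flat penalty below the limit and a nearly linear, steep-gradient penalty above it.

```python
import math

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def action_change_penalty(delta_a, limit, strength=1.0, beta=1.0):
    """Illustrative restriction term: near zero (small gradient) while
    |delta_a| stays below the limit, and growing almost linearly
    (large gradient) once the action change exceeds the limit."""
    return strength * swish(abs(delta_a) - limit, beta)
```

Because the Swish function is smooth everywhere, the penalty remains differentiable at the limit, which is convenient for gradient-based actor updates; a ReLU would give a similar shape with a non-smooth kink at the limit.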
To validate the proposed method, learning was performed under three conditions: no limit, a limit of 20°/step, and a limit of 40°/step. The results are shown in Figure 18. Across all conditions, the deviation distances were comparable.
The distribution of action changes is shown in Figure 19a–c, and the cumulative ratio of the degree of rudder change is shown in Figure 19d. According to Figure 19d, imposing a limit reduces the frequency of large rudder angle changes; in more than 95% of the steps, the magnitude of the action change remains below the specified limit. Although the proposed method does not perfectly enforce the action change constraint, it effectively suppresses excessive changes. Based on these results, the proposed method can be used to incorporate action change limits into the learning process. In this study, the restriction was implemented using the Swish function, although the ReLU function could also be used.
4.3. Situational Action Policy Smoothing
NNs are highly sensitive to small variations in input data. This issue is well known not only in autonomous control but also in image recognition. In real sea conditions, measured data are affected by ship motions (e.g., rolling), sensor errors, and observation noise. If an AI system responds excessively to such variations, rudder commands may fluctuate significantly even when the vessel is moving along a straight line.
A smoothing method for DRL, Conditioning for Action Policy Smoothing (CAPS) [22], has been proposed to address this issue. In CAPS, the objective function of the actor network, J_π, is described by Equation (11). The second and third terms are the temporal and spatial smoothing terms, respectively, with λ_T and λ_S controlling the strengths of these smoothing terms; D_T and D_S denote the distance measures. The actor parameters are updated to maximize the cumulative reward through the first term. The second term encourages temporal smoothness by minimizing the difference between consecutive actions, a_t and a_{t+1}, as shown in the second line of Equation (11). The third term suppresses sensitivity to input noise by minimizing the difference between the action for the observed state, π(s_t), and the action for the noise-perturbed state, π(s̃_t). The noise distribution was determined from the experimental data.
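The two CAPS regularizers can be sketched as below. The Euclidean norm as the distance measure and the Gaussian observation noise are assumptions for illustration; the actual distance measures and noise distribution are those of Equation (11) and the experimental data.

```python
import numpy as np

def caps_regularizers(pi, s_t, s_next, noise_std, lam_t, lam_s, rng):
    """Temporal and spatial smoothing terms of a CAPS-style objective
    (the cumulative-reward term of the actor update is omitted)."""
    # Temporal term: penalize the distance between consecutive actions.
    temporal = lam_t * np.linalg.norm(pi(s_t) - pi(s_next))
    # Spatial term: penalize sensitivity to a noise-perturbed state.
    s_noisy = s_t + rng.normal(0.0, noise_std, size=s_t.shape)
    spatial = lam_s * np.linalg.norm(pi(s_t) - pi(s_noisy))
    return temporal + spatial
```

Both terms are added to the actor loss, so minimizing the total objective trades reward against smoothness according to λ_T and λ_S.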
The temporal term in CAPS acts as a constraint on changes in the action. Although this helps suppress meandering on straight paths by stabilizing the actor output, it also delays rudder adjustments when transitioning from straight segments to curves. In other words, CAPS does not consider changes in the observed state. To address this, a state-dependent restriction is proposed.
The improved objective function is described by Equation (12). The second term is the restriction on temporal action changes. Because the action change is divided by the state change, the restriction is relaxed when the observed state changes significantly. This improved method is called Situational Action Policy Smoothing (SAPS). Δa, Δs, and Δs̃ are the differences between consecutive actions, between consecutive states, and between the observed and noise-perturbed states, respectively. λ_T, λ_D, and λ_S are the strength of the per-step action constraint, the relaxation of the constraint in response to state variations, and the strength of the constraint against noise, respectively.
Learning was performed using both CAPS and SAPS to validate the proposed method, with the two compared at the same restriction strength. The objective function parameters are listed in Table 7. The trajectories of the simulation results are shown in Figure 20. SAPS smoothly follows both straight and curved lines with high accuracy, whereas CAPS causes delayed rudder movement in curved sections. These results demonstrate that SAPS achieves situational smoothing with improved accuracy.
Next, the conventional DDPG and the DDPG with SAPS were compared. The results are presented in Figure 21. The average change in the rudder angle per step decreased from 16.5°/step with the conventional DDPG to 0.73°/step with SAPS. This indicates that the proposed smoothing method is highly effective.