Article

A Deep Reinforcement Learning-Based Path-Following Control Scheme for an Uncertain Under-Actuated Autonomous Marine Vehicle

School of Mechanical and Electrical Engineering, Dalian Minzu University, Dalian 116600, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(9), 1762; https://doi.org/10.3390/jmse11091762
Submission received: 1 August 2023 / Revised: 2 September 2023 / Accepted: 6 September 2023 / Published: 9 September 2023
(This article belongs to the Special Issue Autonomous Marine Vehicle Operations)

Abstract

In this article, a deep reinforcement learning-based path-following control scheme is presented for an under-actuated autonomous marine vehicle (AMV) in the presence of model uncertainties and unknown marine environment disturbances. By virtue of line-of-sight guidance, a surge-heading joint guidance method is developed within the kinematic level, enabling the AMV to follow the desired path accurately. Within the dynamic level, model uncertainties and time-varying environment disturbances are taken into account, and a reinforcement learning control method using the twin-delayed deep deterministic policy gradient (TD3) is developed for the under-actuated vehicle, where path-following actions are generated via the state space and hybrid rewards. Additionally, the actor-critic networks are constructed using the long short-term memory (LSTM) network, allowing the vehicle to make decisions with the aid of historical states and thereby enhancing the convergence rate of the dynamic controllers. Simulation results and comprehensive comparisons on a prototype AMV demonstrate the effectiveness and superiority of the proposed LSTM-TD3-based path-following control scheme.

1. Introduction

The autonomous marine vehicle (AMV) is an intelligent marine platform that performs tasks autonomously or semi-autonomously [1], and it is widely applied in military and civilian fields due to its small size, strong concealment, good flexibility and other advantages [2]. In different missions, path following of the AMV plays a crucial role in realizing autonomous operation. Considering that, in practice, the AMV inevitably suffers from marine environment disturbances, a path-following control method with high precision and efficiency is crucial to the success of an operation, where a parameterized path is expected to be tracked as accurately as possible [3].
In general, the path-following control of an AMV consists of two critical parts: kinematic guidance and dynamic control. In the guidance part, the desired heading angle is calculated so that the path-following errors converge to zero; in the control part, the control inputs, including the surge force and yaw torque, are computed from the desired guidance signals, thus contributing to the path-following performance [4]. For the former, line-of-sight (LOS) guidance has been widely applied because of its high precision and simplicity [5,6,7,8]. For the latter, fruitful methods have been proposed and applied to controller design, such as PID control [9,10], fuzzy control [11,12,13], adaptive control [14,15], active disturbance rejection control [16,17], sliding mode control [18,19], and backstepping control [20,21,22]. In [23], considering path-following control under unknown environment disturbances, a modified integral LOS guidance law and an adaptive sliding mode control law were developed, realizing the desired path following. In [24], to solve the path-following control of an under-actuated autonomous underwater vehicle subject to environment disturbances, an adaptive robust control method was proposed using fuzzy logic, backstepping and sliding mode techniques, where the fuzzy logic system is utilized to approximate the unknown uncertainties. In [25], a novel adaptive robust path-following scheme was proposed by combining trajectory linearization control with a finite-time disturbance observer. In [26], a fuzzy unknown observer-based robust adaptive path-following control scheme was proposed, where the fuzzy observer estimates the lumped unknowns and observer-based robust adaptive tracking control laws are synthesized, ensuring that the guided signals are globally asymptotically tracked. However, the above control methods depend on a system model with high accuracy, and their derivation processes are complex.
With the increasingly rapid development of machine learning, deep reinforcement learning (DRL) algorithms have been widely applied to studies of unmanned system control [27]. DRL is a combination of deep learning and reinforcement learning, which couples the strong decision-making and anti-disturbance abilities of reinforcement learning with the strong representation ability of deep neural networks, thus effectively reducing the complexity and difficulty of controller design. At present, popular DRL algorithms include the soft actor-critic (SAC), the proximal policy optimization (PPO), and the deep deterministic policy gradient (DDPG) [28,29,30]. In [31], the advantage actor-critic (A2C) algorithm was applied to the path-following control of a fish-like robot, where the desired path is a randomly generated curve. In [32], a DRL controller was designed using the DDPG for path following, and simulations show that the proposed method outperforms PID control in terms of transient characteristics. In [33], a distributed DRL method was proposed to solve the path-following control of an under-actuated AMV, where a DDPG-based controller was designed and a radial basis function neural network was utilized to approximate the unknown disturbances. In [34], an improved DDPG control method was proposed for path following based on an optimized sampling pool and an average motion evaluation network, and the simulation results show that the proposed method effectively improves the utilization rate of samples and avoids falling into a local optimum during training. In [35], a linear active disturbance rejection controller based on the SAC was proposed to solve path-following control under wind and wave environments. In [36], path-following control laws were designed using the twin-delayed deep deterministic policy gradient algorithm (TD3), where the desired velocities were generated by LOS guidance.
Considering the path following of an under-actuated AMV under unknown model parameters and environment disturbances, this article establishes the motion model of an AMV and proposes a path-following control method by combining the long short-term memory (LSTM) network with the TD3 algorithm. The main contributions are as follows: (1) within the kinematic level, a surge-heading joint guidance law is developed based on the LOS, where the desired velocity signals are generated to guide the vehicle along the desired path; (2) within the dynamic level, a TD3-based surge-heading controller is developed for the vehicle, where the states, actions and reward functions are defined; and (3) to enhance the convergence rate of the controller networks, an LSTM layer that exploits historical states is added to the TD3.
The remainder of this article is organized as follows: preliminaries and problem statement are described in Section 2; Section 3 presents the kinematic guidance law and the DRL-based dynamic controller of an AMV; simulation results and analysis are presented in Section 4; and Section 5 contains the conclusion.

2. Preliminaries and Problem Statement

2.1. Reinforcement Learning

Reinforcement learning is based on the Markov decision process, whose four basic elements are defined as { S, A, P, R }, where S is the set of all states, A is the set of all actions, P is the state transition probability, and R is the reward function [37]. The discounted sum of all rewards from a certain state to the final state can be calculated by

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (1)

where γ is the discount factor, satisfying γ ∈ [0, 1], and r_{t+k+1} is the reward received k + 1 steps after the current time.
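For illustration, the discounted return in (1) can be computed by accumulating rewards backward in time. The following Python sketch is a minimal illustration (not taken from the paper) of this recursion.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t of Eq. (1) for a finite reward sequence r_{t+1}, r_{t+2}, ...

    Iterating backward uses the recursion R_t = r_{t+1} + gamma * R_{t+1},
    which is equivalent to the (truncated) sum in Eq. (1).
    """
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example: three rewards observed after time t
print(discounted_return([1.0, 0.5, 2.0], gamma=0.9))  # 1.0 + 0.9*0.5 + 0.81*2.0 = 3.07
```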
Additionally, the value functions under a policy μ include the action-value function Q^μ(s_t, a_t) and the state-value function V^μ(s_t), where s_t and a_t are the state and action at the current time. The value functions are described as

Q^{\mu}(s_t, a_t) = \mathbb{E}_{\mu}[R_t \mid s_t, a_t] = \mathbb{E}_{\mu}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t, a_t\Big], \qquad V^{\mu}(s_t) = \mathbb{E}_{\mu}[R_t \mid s_t] = \mathbb{E}_{\mu}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t\Big]    (2)

where \mathbb{E} denotes the expectation. The optimal policy μ* can be achieved by maximizing the state-action value functions [38]:

\mu^* = \arg\max_{\mu} V^{\mu}(s_t) = \arg\max_{\mu} Q^{\mu}(s_t, a_t)    (3)

2.2. LSTM Network

The LSTM network has a strong memory ability, retaining important data and discarding irrelevant noise, thereby relieving the vanishing-gradient problem of conventional recurrent neural networks and the memory burden of the network [39]. The neuronal structure is shown in Figure 1, where x_t is the input; h_t is the output; c_t is the state of the memory cell at the current time; h_{t-1} and c_{t-1} are the input signals from the previous time; f_t is the forgetting gate; i_t is the input gate; o_t is the output gate; and σ is the sigmoid function.
As shown in Figure 1, when information enters the neuron, it first passes through the forgetting gate and the input gate, and then through the output gate. The state of the memory cell c_t is calculated based on f_t and i_t, and the output is finally calculated based on o_t and c_t. The update process can be described by

f_t = \sigma(W_1 x_t + W_2 h_{t-1} + b_1)
i_t = \sigma(W_3 x_t + W_4 h_{t-1} + b_2)
o_t = \sigma(W_7 x_t + W_8 h_{t-1} + b_4)
c_t = c_{t-1} \times f_t + \tanh(W_5 x_t + W_6 h_{t-1} + b_3) \times i_t
h_t = o_t \times \tanh(c_t)    (4)

where W_i (i = 1, 2, …, 8) are the weight coefficients, b_h (h = 1, …, 4) are the biases, and tanh is the activation function [40].
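As a concrete reading of (4), the following NumPy sketch implements one forward step of the LSTM cell. The weight and bias shapes are left to the caller and are illustrative assumptions, not the network sizes used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One update of Eq. (4). W is a list of eight weight matrices W1..W8
    and b a list of four bias vectors b1..b4 (shapes chosen by the user)."""
    f_t = sigmoid(W[0] @ x_t + W[1] @ h_prev + b[0])   # forgetting gate
    i_t = sigmoid(W[2] @ x_t + W[3] @ h_prev + b[1])   # input gate
    o_t = sigmoid(W[6] @ x_t + W[7] @ h_prev + b[3])   # output gate
    c_t = c_prev * f_t + np.tanh(W[4] @ x_t + W[5] @ h_prev + b[2]) * i_t
    h_t = o_t * np.tanh(c_t)                           # cell output
    return h_t, c_t
```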

2.3. Under-Actuated AMV Model

As described in [41], the three-degree-of-freedom model of an under-actuated AMV in the horizontal plane is written as

\dot{\eta} = R(\eta)\nu, \qquad M\dot{\nu} + C(\nu)\nu + D(\nu)\nu = \tau_d + \tau    (5)

where η = [x, y, ψ]^T denotes the positions and heading angle of the AMV in the earth-fixed frame, and ν = [u, v, r]^T denotes the surge, sway and yaw velocities in the body-fixed frame. τ = [τ_u, 0, τ_r]^T are the path-following control inputs, and τ_d = [τ_{ud}, τ_{vd}, τ_{rd}]^T are the time-varying marine environment disturbances. R(η) is the rotation matrix from the body-fixed frame to the earth-fixed frame, defined as

R(\eta) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}    (6)

M is the inertia matrix and satisfies M = M^T > 0, written as

M = \begin{bmatrix} m_{11} & 0 & 0 \\ 0 & m_{22} & m_{23} \\ 0 & m_{32} & m_{33} \end{bmatrix}    (7)

C(ν) is the Coriolis-centripetal matrix and satisfies C(\nu) = -C(\nu)^T, written as

C(\nu) = \begin{bmatrix} 0 & 0 & c_{13}(\nu) \\ 0 & 0 & c_{23}(\nu) \\ -c_{13}(\nu) & -c_{23}(\nu) & 0 \end{bmatrix}    (8)

and D(ν) is the damping matrix, written as

D(\nu) = \begin{bmatrix} d_{11}(\nu) & 0 & 0 \\ 0 & d_{22}(\nu) & d_{23}(\nu) \\ 0 & d_{32}(\nu) & d_{33}(\nu) \end{bmatrix}    (9)

with m_{11} = m - X_{\dot u}, m_{22} = m - Y_{\dot v}, m_{33} = I_z - N_{\dot r}, m_{23} = m x_g - Y_{\dot r}, m_{32} = m x_g - N_{\dot v}, c_{13}(\nu) = -m_{11} v - m_{23} r, c_{23}(\nu) = m_{11} u, d_{11}(\nu) = -X_u - X_{|u|u}|u| - X_{uuu}u^2, d_{22}(\nu) = -Y_v - Y_{|v|v}|v|, d_{33}(\nu) = -N_r - N_{|v|r}|v| - N_{|r|r}|r|, and Y_{\dot r} = N_{\dot v}, where m is the AMV mass and I_z is the moment of inertia in yaw. X_*, Y_* and N_* are the hydrodynamic coefficients.
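To make the model concrete, a simple forward-Euler simulation step of (5) can be sketched as follows. The matrices M, C(ν) and D(ν) are assumed to be supplied by the user from the hydrodynamic coefficients above, and the step size is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

def rotation(psi):
    """Rotation matrix R(eta) of Eq. (6)."""
    return np.array([[np.cos(psi), -np.sin(psi), 0.0],
                     [np.sin(psi),  np.cos(psi), 0.0],
                     [0.0,          0.0,         1.0]])

def amv_step(eta, nu, tau, tau_d, M, C, D, dt=0.01):
    """One forward-Euler step of the 3-DOF model (5).

    eta = [x, y, psi], nu = [u, v, r]; C(nu) and D(nu) return 3x3 matrices.
    """
    eta_dot = rotation(eta[2]) @ nu
    nu_dot = np.linalg.solve(M, tau + tau_d - (C(nu) + D(nu)) @ nu)
    return eta + dt * eta_dot, nu + dt * nu_dot
```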
As shown in Figure 2, the desired path (x_d(s), y_d(s)) is a continuous parameterized curve with a time-independent path variable s. For any moving point on the curve, the path-tangential angle in the earth-fixed frame is defined as

\alpha = \operatorname{atan2}\big(y_d'(s),\, x_d'(s)\big)    (10)

where y_d'(s) = \partial y_d/\partial s and x_d'(s) = \partial x_d/\partial s. The errors between (x, y) and (x_d, y_d) can be formulated as

\begin{bmatrix} x_e \\ y_e \end{bmatrix} = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}^{T} \begin{bmatrix} x - x_d(s) \\ y - y_d(s) \end{bmatrix}    (11)

where x_e is the along-track error and y_e is the cross-track error.
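A direct implementation of (10)-(11) is sketched below; the path derivatives with respect to s are assumed to be available analytically.

```python
import numpy as np

def path_errors(x, y, x_d, y_d, dxd_ds, dyd_ds):
    """Along-track and cross-track errors of Eq. (11)."""
    alpha = np.arctan2(dyd_ds, dxd_ds)                 # path-tangential angle, Eq. (10)
    rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                    [np.sin(alpha),  np.cos(alpha)]])
    x_e, y_e = rot.T @ np.array([x - x_d, y - y_d])    # Eq. (11)
    return x_e, y_e, alpha
```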
In this article, our objective is to design the DRL-based path-following control scheme for an uncertain under-actuated AMV, such that the vehicle can follow the desired path with the desired velocities regardless of model uncertainties and unknown marine environment disturbances. To be specific, the objective can be formalized as
\lim_{t\to\infty} |x_e| \le \delta_x, \qquad \lim_{t\to\infty} |y_e| \le \delta_y    (12)

where δ_x and δ_y are any small positive constants.

3. DRL-Based Path-Following Control Scheme

In this section, a DRL-based path-following control scheme is established for an under-actuated AMV in the presence of model uncertainties and unknown marine environment disturbances. The diagram of the proposed control scheme is shown in Figure 3, where kinematic guidance and dynamic control are designed, respectively.
Within the kinematic level, according to the position errors and related motion states obtained from the AMV model, the desired surge velocity and heading angle are generated by the designed surge-heading joint guidance law based on line-of-sight guidance. Within the dynamic level, DRL-based surge-heading controllers are presented to track the desired guidance signals. The reward function generates rewards by comparing the desired signals with the actual vehicle states in the environment. The controllers generate the control inputs based on the rewards, and the vehicle tracks the desired signals precisely based on the control inputs and the new environment states, thus realizing path-following control. By combining the kinematic guidance with the dynamic control, the objective (12) can be achieved.

3.1. Kinematic Guidance Design

Firstly, the kinematic guidance is designed in this subsection, where the desired surge velocities and heading angles are produced. Differentiating (11) along (5) yields

\dot{x}_e = u\cos(\psi - \alpha) - v\sin(\psi - \alpha) + \dot{\alpha}\, y_e - u_s, \qquad \dot{y}_e = u\sin(\psi - \alpha) + v\cos(\psi - \alpha) - \dot{\alpha}\, x_e    (13)

where u_s is the velocity of the virtual point along the desired path, defined by

u_s = \dot{s}\sqrt{x_d'^2(s) + y_d'^2(s)}    (14)
Define the sideslip angle of the under-actuated vehicle as

\beta = \arctan\frac{v}{u}    (15)

In this context, the path-following error dynamics (13) can be rewritten as

\dot{x}_e = u\cos(\psi - \alpha) - u\sin(\psi - \alpha)\tan\beta + \dot{\alpha}\, y_e - u_s, \qquad \dot{y}_e = u\sin(\psi - \alpha) + u\cos(\psi - \alpha)\tan\beta - \dot{\alpha}\, x_e    (16)
Then, select the Lyapunov function related to the path-following errors as

V = \frac{1}{2}\big(x_e^2 + y_e^2\big)    (17)

The time derivative of (17) along the solutions of (16) is

\dot{V} = x_e\dot{x}_e + y_e\dot{y}_e = x_e\big[u\cos(\psi - \alpha) - u\sin(\psi - \alpha)\tan\beta + \dot{\alpha}\, y_e - u_s\big] + y_e\big[u\sin(\psi - \alpha) + u\cos(\psi - \alpha)\tan\beta - \dot{\alpha}\, x_e\big]    (18)
Thus, the surge-heading joint guidance law is designed as

u_d = k_1\sqrt{y_e^2 + \Delta^2}, \qquad \psi_d = \alpha - \beta_d - \arctan\frac{y_e}{\Delta}    (19)

where k_1 > 0; Δ > 0 is the look-ahead distance; β_d = arctan(v/u_d); and the virtual velocity u_s is determined by

u_s = U_d\cos(\beta_d + \psi - \alpha) + k_2 x_e    (20)

with k_2 > 0 and U_d = \sqrt{u_d^2 + v^2}.
Using the facts

\sin\Big(\tan^{-1}\frac{y_e}{\Delta}\Big) = \frac{y_e}{(y_e^2 + \Delta^2)^{1/2}}, \qquad \cos\Big(\tan^{-1}\frac{y_e}{\Delta}\Big) = \frac{\Delta}{(y_e^2 + \Delta^2)^{1/2}}    (21)

and substituting the guidance law (19) into (18) yields

\dot{V} = x_e\big(\dot{\alpha}\, y_e - k_2 x_e\big) + y_e\Big(-\frac{k_1}{\cos\beta_d}\, y_e - \dot{\alpha}\, x_e\Big) = -k_2 x_e^2 - \frac{k_1}{\cos\beta_d}\, y_e^2    (22)

Since 0 < cos β_d ≤ 1, it follows that \dot{V} \le -k_2 x_e^2 - k_1 y_e^2, which indicates that the path-following errors x_e and y_e globally asymptotically converge to the origin under the proposed guidance method.
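The guidance laws (19)-(20) translate almost line by line into code. The sketch below uses the gains from Section 4 (k_1 = 0.2, k_2 = 2, Δ = 3) as defaults and is an illustration rather than the authors' implementation.

```python
import numpy as np

def surge_heading_guidance(x_e, y_e, alpha, psi, v, k1=0.2, k2=2.0, Delta=3.0):
    """Surge-heading joint guidance law, Eqs. (19)-(20)."""
    u_d = k1 * np.sqrt(y_e**2 + Delta**2)                # desired surge velocity
    beta_d = np.arctan2(v, u_d)                          # desired sideslip angle
    psi_d = alpha - beta_d - np.arctan2(y_e, Delta)      # desired heading angle
    U_d = np.sqrt(u_d**2 + v**2)
    u_s = U_d * np.cos(beta_d + psi - alpha) + k2 * x_e  # virtual-point speed, Eq. (20)
    return u_d, psi_d, u_s
```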

3.2. Dynamic Control Design

Because of inaccurate measurements and environment disturbances, the AMV model parameters cannot be obtained completely, resulting in dynamic uncertainties. To enhance engineering practicality and reduce controller complexity, a TD3-based reinforcement learning control method is presented within the dynamic level, where the control inputs of the AMV are generated.
The main purpose of the DRL algorithm is to make the vehicle take appropriate actions under different path information. The proposed TD3 algorithm is based on the actor-critic structure, where policy functions are produced by the actor networks and the critic networks are used to judge the performance of the actor [42]. Additionally, LSTM network layers are introduced into the TD3 to enhance the utilization of historical states.
Firstly, the network structure of the TD3 algorithm is shown in Figure 4. Based on the initial environment states, actions of the AMV are generated by the actor network, and rewards are calculated by the reward functions; the states are then updated with the generated actions. The empirical value is defined as e(t) = { s, a, r, s_{t+1} } and saved into the buffer Memory D. Through repeated training, the empirical replay sequence D = { e_1, e_2, …, e_n } is formed. Considering that adjacent path-following actions are strongly correlated, a batch of empirical sequences is selected randomly for training. The target actor network generates the action a_{t+1} according to the state s_{t+1} in the empirical replay sequence, and the target critic networks calculate Q_min, where Q_min is the smaller of the two Q_target values generated by the target networks. The two critic networks are updated based on Q, Q_min and the loss functions. The actor network generates actions from states; the critic networks generate the Q value from states and actions, which is then used to calculate the policy gradient and update the actor network.
The specific update process is as follows. Since the TD3 algorithm adopts a deterministic policy and employs target policy smoothing regularization, random noise ε is added to the target actions, i.e., a_{t+1} = μ'(s_{t+1} | w') + ε, where μ' is the policy of the target actor network and w' is its parameter.
The target value is calculated as

y = r + \gamma\, Q_{\min}(s_{t+1}, a_{t+1} \mid \theta_i')    (23)

where i = 1, 2 and θ_i' denotes the parameters of the target critic networks.
The loss function is defined as

L(\theta_i) = \frac{1}{N}\sum\big(y - Q(s, a \mid \theta_i)\big)^2    (24)

where N is the mini-batch size and θ_i denotes the parameters of the critic networks. The gradient is updated by

\frac{\partial L(\theta_i)}{\partial \theta_i} = \mathbb{E}\Big[\big(y - Q(s, a \mid \theta_i)\big)\,\frac{\partial Q(s, a \mid \theta_i)}{\partial \theta_i}\Big]    (25)
Subsequently, the policy gradient of the actor network is updated by

\frac{\partial J(w)}{\partial w} = \mathbb{E}\Big[\frac{\partial Q(s, a \mid \theta)}{\partial a}\,\frac{\partial \mu(s \mid w)}{\partial w}\Big], \qquad \nabla_w J \approx \frac{1}{N}\sum\Big(\nabla_a Q(s, a \mid \theta_1)\big|_{a = \mu(s)}\,\nabla_w \mu(s \mid w)\Big)    (26)

After a number of update cycles, the target parameters are softly updated by

\theta_i' = \xi\theta_i + (1 - \xi)\theta_i', \qquad w' = \xi w + (1 - \xi) w'    (27)

where ξ ∈ (0, 1) represents the soft-update rate of the target networks.
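Equations (23)-(27) correspond to a few lines of a standard TD3 update. The PyTorch sketch below is a generic illustration that assumes the actor/critic modules, their targets and their optimizers already exist; the noise limits, the episode-termination mask `done`, the action bounds and the default soft-update rate are assumptions not specified in the paper.

```python
import torch
import torch.nn.functional as F

def td3_update(s, a, r, s_next, done, actor, actor_t, critic1, critic2,
               critic1_t, critic2_t, opt_c, opt_a, gamma=0.99, xi=0.005,
               noise_std=0.2, noise_clip=0.5, update_actor=False):
    # Target action with smoothing noise: a_{t+1} = mu'(s_{t+1}|w') + eps
    with torch.no_grad():
        eps = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (actor_t(s_next) + eps).clamp(-1.0, 1.0)
        q_min = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_min             # target value, Eq. (23)

    # Critic update, Eqs. (24)-(25)
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    opt_c.zero_grad()
    critic_loss.backward()
    opt_c.step()

    if update_actor:                                     # delayed policy update
        actor_loss = -critic1(s, actor(s)).mean()        # ascends the gradient of Eq. (26)
        opt_a.zero_grad()
        actor_loss.backward()
        opt_a.step()
        # Soft update of target parameters, Eq. (27)
        for net, net_t in ((critic1, critic1_t), (critic2, critic2_t), (actor, actor_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.copy_(xi * p.data + (1.0 - xi) * p_t.data)
```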
Then, the LSTM network is introduced into the actor-critic networks, yielding the LSTM-TD3-based reinforcement learning controllers. The LSTM-TD3 network retains the actor-critic structure, where the LSTM input is a sequence of states of fixed length. According to the real-time navigation information, consecutive states are saved into the sequence. The LSTM layer produces the final hidden state h_t, which is a one-dimensional array. Through the multi-layer perceptron (MLP) layers, the path-following control inputs of the AMV, i.e., the surge force and the yaw torque, are generated. The network structure of the actor is shown in Figure 5. Note that the critic has a similar network structure to the actor, but takes both states and actions as inputs and outputs the Q value used to evaluate the actions.
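Based on the description above and the layer sizes reported in Table 2 (11-dimensional input, an LSTM layer of width 100, a fully connected layer of width 200 and a 2-dimensional output), a possible PyTorch sketch of the LSTM-TD3 actor is given below. The activation functions and the output scaling are assumptions, since they are not stated in the paper.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Actor sketch: state sequence -> LSTM -> MLP -> [tau_u, tau_r]."""
    def __init__(self, state_dim=11, lstm_hidden=100, fc_hidden=200,
                 action_dim=2, action_scale=1.0):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, lstm_hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(lstm_hidden, fc_hidden), nn.ReLU(),
                                 nn.Linear(fc_hidden, action_dim), nn.Tanh())
        self.action_scale = action_scale

    def forward(self, state_seq):
        # state_seq: (batch, sequence length L, state_dim)
        _, (h_t, _) = self.lstm(state_seq)        # final hidden state h_t
        return self.action_scale * self.mlp(h_t[-1])

# Example: a batch containing one sequence of L = 20 states
actor = LSTMActor()
print(actor(torch.zeros(1, 20, 11)).shape)        # torch.Size([1, 2])
```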
Finally, the state space, action space and reward function are designed as follows. Specifically, the state space represents the environment information perceived by the AMV, which is the basis for decision-making and reward evaluation. The action space represents the control inputs of the AMV, including the surge force and yaw torque. The reward function is used to evaluate the current state of the AMV.
In this context, the state space is defined as

s_t = [x_e,\ y_e,\ \psi,\ \psi_e,\ u,\ v,\ r,\ u_e,\ \tau_u(t-1),\ \tau_r(t-1)]    (28)

where τ_u(t-1) and τ_r(t-1) are the control inputs at the previous moment, u_e = u - u_d and ψ_e = ψ - ψ_d, with u_d and ψ_d generated by the guidance law (19).
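The state vector (28) can be assembled directly from the guidance outputs and the previous control inputs, as in the short sketch below. Note that Table 2 reports an 11-dimensional actor input, so the published implementation may include one additional element beyond those listed in (28); the sketch simply follows the listing.

```python
import numpy as np

def build_state(x_e, y_e, psi, u, v, r, u_d, psi_d, tau_u_prev, tau_r_prev):
    """Assemble s_t of Eq. (28) from guidance signals and previous inputs."""
    return np.array([x_e, y_e, psi, psi - psi_d, u, v, r, u - u_d,
                     tau_u_prev, tau_r_prev])
```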
Taking the path-following errors, the surge velocity error and the heading angle error into consideration, the reward function r_1 is designed as

r_1 = \lambda\Big[\big(2\exp(-k_3|u_e|) - 1\big) + \big(2\exp(-k_4|\psi_e|) - 1\big) + \big(2\exp\big(-k_5\sqrt{x_e^2 + y_e^2}\big) - 1\big)\Big]    (29)

where k_i > 0 (i = 3, 4, 5) and λ > 0. Note that the exponential function is used to calculate the rewards, which bounds their magnitude and avoids excessively high rewards.
Furthermore, the reward function r_2 is designed as

r_2 = \exp(-k_6|\sigma_{\tau_u}|) + \exp(-k_6|\sigma_{\tau_r}|) - 1    (30)

where k_6 > 0, and σ_{τ_u} and σ_{τ_r} are the standard deviations of the two control inputs, which are used to reduce the chattering of the control inputs.
By combining (29) and (30), the hybrid reward of path-following control is established as

r = r_1 + k_7 r_2    (31)

where k_7 > 0 is a weighting factor satisfying k_7 ∈ (0, 1).
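The hybrid reward (29)-(31) can be computed as follows. The gains are the values given in Section 4, and the placement of the constant offsets follows the reconstruction of (29)-(30) above.

```python
import numpy as np

def hybrid_reward(u_e, psi_e, x_e, y_e, sigma_tau_u, sigma_tau_r,
                  k3=1.5, k4=6.0, k5=1.0, k6=1.0, k7=0.3, lam=0.8):
    """Hybrid reward r = r1 + k7 * r2 of Eqs. (29)-(31)."""
    r1 = lam * ((2.0 * np.exp(-k3 * abs(u_e)) - 1.0)
                + (2.0 * np.exp(-k4 * abs(psi_e)) - 1.0)
                + (2.0 * np.exp(-k5 * np.hypot(x_e, y_e)) - 1.0))
    r2 = np.exp(-k6 * abs(sigma_tau_u)) + np.exp(-k6 * abs(sigma_tau_r)) - 1.0
    return r1 + k7 * r2
```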
The framework of the dynamic control algorithm is summarized in Algorithm 1.
Algorithm 1. Dynamic control algorithm of an AMV.
Inputs: learning rates ξ, l_θ and l_w; regular factor ε; gradient threshold parameter g; discount factor γ; sequence length L; maximum number of steps per training episode K; target network update period d; number of training episodes F; mini-batch size N.
Initialize: critic networks Q(s, a | θ_i) and actor network μ(s | w) with random parameters θ_1, θ_2 and w; target networks θ_1' ← θ_1, θ_2' ← θ_2 and w' ← w; experience replay buffer Memory D; navigation environment of the AMV.
Procedure:
1: for n_1 = 1, …, F do
2:   for n_2 = 1, …, K do
3:     Select the action with exploration noise a ~ μ(s | w) + ε and obtain the reward r and the next state s_{t+1}
4:     Save the transition tuple (s, a, r, s_{t+1}) into Memory D
5:     Sample N transitions (s, a, r, s_{t+1}) and compute
         a_{t+1} ← μ'(s_{t+1} | w') + ε,  y ← r + γ Q_min(s_{t+1}, a_{t+1} | θ_i')
6:     Update the critic network parameters θ_i as
         θ_i ← arg min_{θ_i} (1/N) Σ (y − Q(s, a | θ_i))^2
7:     if t mod d = 0 then
8:       Update the actor network parameters w using
         ∇_w J ≈ (1/N) Σ (∇_a Q(s, a | θ_1)|_{a = μ(s)} ∇_w μ(s | w))
9:       Update the target networks as
         θ_i' ← ξθ_i + (1 − ξ)θ_i',  w' ← ξw + (1 − ξ)w'
       end if
     end for
   end for
Outputs: actor network parameter w and critic network parameters θ_1 and θ_2.

4. Simulation Studies

In this section, simulation studies are conducted to verify the effectiveness and superiority of the proposed DRL-based path-following control method. Consider an under-actuated AMV described by (5) subject to model uncertainties and unknown marine environment disturbances. The model parameters of the prototype AMV can be found in [43].
Within the kinematic guidance level, the relevant parameters are chosen as k_1 = 0.2, k_2 = 2 and Δ = 3. Within the dynamic control level, the relevant parameters are chosen as k_3 = 1.5, k_4 = 6, k_5 = 1, k_6 = 1, λ = 0.8 and k_7 = 0.3. The training hyperparameters and network parameters of the LSTM-TD3 are given in Table 1 and Table 2, respectively.
Episode rewards under different DRL algorithms for the path-following control of an AMV are shown in Figure 6. It can be seen that the initial reward is extremely low, since the vehicle explores the environment randomly during the initial training stage. After collecting enough data, the rewards converge to a stable value under each DRL control method. Compared to the asynchronous advantage actor-critic (A3C) developed in [44], the TD3 and LSTM-TD3 effectively increase the accumulated reward. Additionally, the proposed LSTM-TD3 shows a faster convergence rate and a more stable convergence process than the other two algorithms.
After training for 1000 episodes, the optimal actor network parameters of the LSTM-TD3 and TD3 are saved and utilized. The simulation time is set to 200 s. The initial positions and attitude are η = [10, 0, 0]^T and the initial velocities are ν = [0, 0, 0]^T. The time-varying marine environment disturbances are defined as

\tau_d = \begin{bmatrix} 5\sin(0.1 t + \pi/5) \\ 2.2\cos(0.1 t + 6) \\ 1.2\cos(0.1 t + 3) \end{bmatrix}    (32)

The desired path is defined as

x_d = s, \qquad y_d = 10\sin(0.3 s) + 2 s    (33)
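For reference, the disturbance (32) and the desired path (33) used in the simulation can be written as the following functions; the phase terms follow the reconstruction of (32) above, and the path derivatives are included because the guidance law of Section 3.1 requires them.

```python
import numpy as np

def tau_d(t):
    """Time-varying marine environment disturbances, Eq. (32)."""
    return np.array([5.0 * np.sin(0.1 * t + np.pi / 5.0),
                     2.2 * np.cos(0.1 * t + 6.0),
                     1.2 * np.cos(0.1 * t + 3.0)])

def desired_path(s):
    """Desired path of Eq. (33) and its derivatives with respect to s."""
    x_d, y_d = s, 10.0 * np.sin(0.3 * s) + 2.0 * s
    dxd_ds, dyd_ds = 1.0, 3.0 * np.cos(0.3 * s) + 2.0
    return x_d, y_d, dxd_ds, dyd_ds
```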
Simulation results are shown in Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11. Figure 7 shows the path-following control performance of the under-actuated AMV, where the desired path and the actual paths under the proposed LSTM-TD3 and the traditional TD3 are plotted. Obviously, the proposed control method has significant superiority in terms of transient response and steady-state performance. Figure 8 shows the path-following errors of the AMV subject to model uncertainties and marine environment disturbances. It can be seen that the tangential error and the normal error converge to the origin faster under the proposed control method. Figure 9 shows the surge velocity and heading angle, where the desired signals are generated by the guidance law (19). The actual velocities gradually converge to the desired values with the aid of the DRL controller. Note that the slight chattering of the velocities occurs at large path inflection points, where the under-actuated AMV has to reduce its speed to follow the desired path. Figure 10 shows the velocity error and heading angle error of the AMV. It can be seen that the LSTM network takes historical states into account and thus enhances the control performance. Figure 11 shows the path-following control inputs of the AMV, including the surge force and the yaw torque. Because of the hybrid reward with the standard deviation term, where 20 consecutive inputs are taken as a calculation group and a dynamic sliding window is introduced, the control chattering is effectively relieved.

5. Conclusions

This article has studied the path-following control of an under-actuated AMV subject to unknown model parameters and marine environment disturbances. Within the kinematic level, a surge-heading joint guidance law is presented, which makes the vehicle follow the desired path. Within the dynamic level, an LSTM-TD3-based reinforcement learning controller is presented, where the vehicle actions are generated from the state space and the hybrid reward. Additionally, the actor-critic networks are developed using the LSTM network, so that the vehicle can make decisions with the aid of historical states, thus enhancing the convergence rate of the controller networks. Simulation results and comprehensive comparisons demonstrate the effectiveness and superiority of the proposed path-following control method. With the aid of the proposed controller, the AMV can achieve path following regardless of marine environment disturbances.

Author Contributions

Conceptualization, X.Q. and Y.J.; methodology, X.Q.; software, Y.J.; validation, X.Q., Y.J. and R.Z.; formal analysis, F.L.; investigation, Y.J.; resources, X.Q.; data curation, X.Q.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q.; visualization, F.L.; supervision, F.L.; project administration, R.Z.; funding acquisition, R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 61673084).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jorge, V.A.; Granada, R.; Maidana, R.G.; Jurak, D.A.; Heck, G.; Negreiros, A.P.; Dos Santos, D.H.; Gonçalves, L.M.; Amory, A.M. A Survey on Unmanned Surface Vehicles for Disaster Robotics: Main Challenges and Directions. Sensors 2019, 19, 702. [Google Scholar] [CrossRef]
  2. Liu, T.; Dong, Z.; Du, H.; Song, L.; Mao, Y. Path Following Control of the Underactuated USV Based on the Improved Line-of-Sight Guidance Algorithm. Pol. Marit. Res. 2017, 24, 3–11. [Google Scholar] [CrossRef]
  3. Mu, D.; Wang, G.; Fan, Y.; Bai, Y.; Zhao, Y. Fuzzy-Based Optimal Adaptive Line-of-Sight Path Following for underactuated unmanned surface vehicle with uncertainties and time-varying disturbances. Math. Probl. Eng. 2018, 2018, 7512606. [Google Scholar] [CrossRef]
  4. Koh, S.; Zhou, B.; Fang, H.; Yang, P.; Yang, Z.; Yang, Q.; Guan, L.; Ji, Z. Real-time deep reinforcement learning based vehicle navigation. Appl. Soft Comput. 2020, 96, 106694. [Google Scholar] [CrossRef]
  5. Mu, D.; Wang, G.; Fan, Y.; Bai, Y.; Zhao, Y. Path following for podded propulsion unmanned surface vehicle: Theory, simulation and experiment. IEEJ Trans. Electr. Electron. Eng. 2018, 13, 911–923. [Google Scholar] [CrossRef]
  6. Lekkas, A.M.; Fossen, T.I. Integral LOS Path Following for Curved Paths Based on a Monotone Cubic Hermite Spline Parametrization. IEEE Trans. Control Syst. Technol. 2014, 22, 2287–2301. [Google Scholar] [CrossRef]
  7. Fossen, T.I.; Lekkas, A.M. Direct and indirect adaptive integral line-of-sight path-following controllers for marine craft exposed to ocean currents. Int. J. Adapt. Control Signal Process. 2017, 31, 445–463. [Google Scholar] [CrossRef]
  8. Fossen, T.I.; Pettersen, K.Y.; Galeazzi, R. Line-of-Sight Path Following for Dubins Paths with Adaptive Sideslip Compensation of Drift Forces. IEEE Trans. Control Syst. Technol. 2014, 23, 820–827. [Google Scholar] [CrossRef]
  9. Liu, Z.; Song, S.; Yuan, S.; Ma, Y.; Yao, Z. ALOS-Based USV Path-Following Control with Obstacle Avoidance Strategy. J. Mar. Sci. Eng. 2022, 10, 1203. [Google Scholar] [CrossRef]
  10. Rout, R.; Subudhi, B. Inverse optimal self-tuning PID control design for an autonomous underwater vehicle. Int. J. Syst. Sci. 2017, 48, 367–375. [Google Scholar] [CrossRef]
  11. Yu, C.; Xiang, X.; Lapierre, L.; Zhang, Q. Nonlinear guidance and fuzzy control for three-dimensional path following of an underactuated autonomous underwater vehicle. Ocean Eng. 2017, 146, 457–467. [Google Scholar] [CrossRef]
  12. Xiang, X.; Yu, C.; Zhang, Q. Robust fuzzy 3D path following for autonomous underwater vehicle subject to uncertainties. Comput. Oper. Res. 2017, 84, 165–177. [Google Scholar] [CrossRef]
  13. Zhang, J.; Xiang, X.; Lapierre, L.; Zhang, Q.; Li, W. Approach-angle-based three-dimensional indirect adaptive fuzzy path following of under-actuated AUV with input saturation. Appl. Ocean Res. 2021, 107, 102486. [Google Scholar] [CrossRef]
  14. Sahu, B.K.; Subudhi, B. Adaptive tracking control of an autonomous underwater vehicle. Int. J. Autom. Comput. 2014, 11, 299–307. [Google Scholar] [CrossRef]
  15. Shin, J.; Kwak, D.J.; Lee, Y. Adaptive Path-Following Control for an Unmanned Surface Vessel Using an Identified Dynamic Model. IEEE/ASME Trans. Mechatron. 2017, 22, 1143–1153. [Google Scholar] [CrossRef]
  16. Lamraoui, H.C.; Zhu, Q. Path following control of fully-actuated autonomous underwater vehicle in presence of fast-varying disturbances. Appl. Ocean Res. 2019, 86, 40–46. [Google Scholar] [CrossRef]
  17. Zhang, H.; Zhang, X.; Cao, T.; Bu, R. Active disturbance rejection control for ship path following with Euler method. Ocean Eng. 2022, 247, 110516. [Google Scholar] [CrossRef]
  18. Zhang, G.; Huang, H.; Wan, L.; Li, Y.; Cao, J.; Su, Y. A novel adaptive second order sliding mode path following control for a portable AUV. Ocean Eng. 2018, 151, 82–92. [Google Scholar] [CrossRef]
  19. Zhang, H.; Zhang, X.; Bu, R. Radial Basis Function Neural Network Sliding Mode Control for Ship Path Following Based on Position Prediction. J. Mar. Sci. Eng. 2021, 9, 1055. [Google Scholar] [CrossRef]
  20. Wang, J.; Wang, C.; Wei, Y.; Zhang, C. Three-Dimensional Path Following of an Underactuated AUV Based on Neuro-Adaptive Command Filtered Backstepping Control. IEEE Access 2018, 6, 74355–74365. [Google Scholar] [CrossRef]
  21. Yan, Z.; Yang, Z.; Zhang, J.; Zhou, J.; Jiang, A.; Du, X. Trajectory tracking control of UUV based on backstepping sliding mode with fuzzy switching gain in diving plane. IEEE Access 2019, 7, 166788–166795. [Google Scholar] [CrossRef]
  22. Zhou, J.; Zhao, X.; Chen, T.; Yan, Z.; Yang, Z. Trajectory tracking control of an underactuated AUV based on backstepping sliding mode with state prediction. IEEE Access 2019, 7, 181983–181993. [Google Scholar] [CrossRef]
  23. Chen, X.; Liu, Z.; Zhang, J.; Zhou, D.; Dong, J. Adaptive sliding-mode path following control system of the underactuated USV under the influence of ocean currents. J. Syst. Eng. Electron. 2018, 29, 1271–1283. [Google Scholar] [CrossRef]
  24. Liang, X.; Wan, L.; Blake, J.I.; Shenoi, R.A.; Townsend, N. Path Following of an Underactuated AUV Based on Fuzzy Backstepping Sliding Mode Control. Int. J. Adv. Robot. Syst. 2016, 13, 122. [Google Scholar] [CrossRef]
  25. Qiu, B.; Wang, G.; Fan, Y.; Mu, D.; Sun, X. Path Following of Underactuated Unmanned Surface Vehicle Based on Trajectory Linearization Control with Input Saturation and External Disturbances. Int. J. Control Autom. Syst. 2020, 18, 2108–2119. [Google Scholar] [CrossRef]
  26. Wang, N.; Sun, Z.; Yin, J.; Zou, Z.; Su, S. Fuzzy unknown observer-based robust adaptive path following control of underactuated surface vehicles subject to multiple unknowns. Ocean Eng. 2019, 176, 57–64. [Google Scholar] [CrossRef]
  27. Havenstrøm, S.T.; Rasheed, A.; San, O. Deep reinforcement learning controller for 3D path following and collision avoidance by autonomous underwater vehicles. Front. Robot. AI 2021, 7, 211. [Google Scholar] [CrossRef]
  28. Meyer, E.; Heiberg, A.; Rasheed, A.; San, O. COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. IEEE Access 2020, 8, 165344–165364. [Google Scholar] [CrossRef]
  29. Sola, Y.; Le Chenadec, G.; Clement, B. Simultaneous control and guidance of an auv based on soft actor–critic. Sensors 2022, 22, 6072. [Google Scholar] [CrossRef]
  30. Fang, Y.; Huang, Z.; Pu, J.; Zhang, J. AUV position tracking and trajectory control based on fast-deployed deep reinforcement learning method. Ocean Eng. 2022, 245, 110452. [Google Scholar] [CrossRef]
  31. Zhang, T.; Tian, R.; Wang, C.; Xie, G. Path-Following Control of Fish-like Robots: A Deep Reinforcement Learning Approach. IFAC-PapersOnLine 2020, 53, 8163–8168. [Google Scholar] [CrossRef]
  32. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  33. Han, Z.; Wang, Y.; Sun, Q. Straight-Path Following and Formation Control of USVs Using Distributed Deep Reinforcement Learning and Adaptive Neural Network. IEEE/CAA J. Autom. Sin. 2023, 10, 572–574. [Google Scholar] [CrossRef]
  34. Sun, Y.; Ran, X.; Zhang, G.; Wang, X.; Xu, H. AUV path following controlled by modified Deep Deterministic Policy Gradient. Ocean Eng. 2020, 210, 107360. [Google Scholar] [CrossRef]
  35. Zheng, Y.; Tao, J.; Sun, Q.; Sun, H.; Chen, Z.; Sun, M.; Xie, G. Soft Actor–Critic based active disturbance rejection path following control for unmanned surface vessel under wind and wave disturbances. Ocean Eng. 2022, 247, 110631. [Google Scholar] [CrossRef]
  36. Liang, Z.; Qu, X.; Zhang, Z.; Chen, C. Three-Dimensional Path-Following Control of an Autonomous Underwater Vehicle Based on Deep Reinforcement Learning. Pol. Marit. Res. 2022, 29, 36–44. [Google Scholar] [CrossRef]
  37. Liu, Y.; Peng, Y.; Wang, M.; Xie, J.; Zhou, R. Multi-usv system cooperative underwater target search based on reinforcement learning and probability map. Math. Probl. Eng. 2020, 2020, 7842768. [Google Scholar] [CrossRef]
  38. Havenstrøm, S.T.; Sterud, C.; Rasheed, A.; San, O. Proportional integral derivative controller assisted reinforcement learning for path following by autonomous underwater vehicles. arXiv 2020, arXiv:2002.01022. [Google Scholar] [CrossRef]
  39. Zhang, W.; Wu, P.; Peng, Y.; Liu, D. Roll motion prediction of unmanned surface vehicle based on coupled CNN and LSTM. Future Internet 2019, 11, 243. [Google Scholar] [CrossRef]
  40. Li, J.; Tian, Z.; Zhang, G.; Li, W. Multi-AUV Formation Predictive Control Based on CNN-LSTM under Communication Constraints. J. Mar. Sci. Eng. 2023, 11, 873. [Google Scholar] [CrossRef]
  41. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  42. Chu, Z.; Sun, B.; Zhu, D.; Zhang, M.; Luo, C. Motion control of unmanned underwater vehicles via deep imitation reinforcement learning algorithm. IET Intell. Transp. Syst. 2020, 14, 764–774. [Google Scholar] [CrossRef]
  43. Wang, N.; Gao, Y.; Yang, C.; Zhang, X. Reinforcement learning-based finite-time tracking control of an unknown unmanned surface vehicle with input constraints. Neurocomputing 2022, 484, 26–37. [Google Scholar] [CrossRef]
  44. Xie, S.; Chu, X.; Zheng, M.; Liu, C. A composite learning method for multi-ship collision avoidance based on reinforcement learning and inverse control. Neurocomputing 2020, 411, 375–392. [Google Scholar] [CrossRef]
Figure 1. Neuronal structure of the LSTM.
Figure 2. Diagram of horizontal path-following control.
Figure 3. Diagram of the proposed path-following control scheme.
Figure 4. Network structure of the TD3.
Figure 5. Network structure of the actor.
Figure 6. Episode rewards with different DRL algorithms.
Figure 7. Path-following control performance of an AMV.
Figure 8. Path-following errors of an AMV under LSTM-TD3 and TD3.
Figure 9. Surge velocity and heading angle of an AMV.
Figure 10. Velocity and heading errors of an AMV under LSTM-TD3 and TD3.
Figure 11. Surge force and yaw torque of an AMV.
Table 1. Training hyperparameters.
Parameter | Value
Discount factor γ | 0.99
State sequence length L | 20
Training cycle F | 1000
Maximum number of steps K | 1000
Capacity of buffer D | 100,000
Learning rate l | 0.001
Optimizer | Adam
Gradient threshold parameter g | 1
Regular factor ε | 0.00005
Mini-batch N | 128
Table 2. LSTM-TD3 network parameters.
Parameter | Value
Input layer of actor network | 11
Input layer of critic network | 13
Fully connected layer | 200
LSTM layer of actor network | 100
LSTM layer of critic network | 100
Output layer of actor | 2
Output layer of critic | 1
