USV Trajectory Tracking Control Based on Receding Horizon Reinforcement Learning

We present a novel approach for achieving high-precision trajectory tracking control of an unmanned surface vehicle (USV) using receding horizon reinforcement learning (RHRL). The control architecture combines feedforward and feedback components. The feedforward component is derived directly from the curvature of the reference path and the dynamic model. The feedback component is obtained by applying the RHRL algorithm to solve the optimal tracking control problem. The proposed method incorporates a rolling time domain (receding horizon) optimization mechanism, which converts the infinite time domain optimal control problem into a sequence of finite time domain control problems that can be solved. In contrast to Lyapunov model predictive control (LMPC) and sliding mode control (SMC), the RHRL controller yields an explicit state feedback control law, which allows the controller both to be deployed directly after offline learning and to learn online. Within each prediction time domain, we employ a time-independent executive-evaluator (actor-critic) network structure to learn the optimal value function and control strategy. Furthermore, we prove the convergence of the RHRL algorithm in each prediction time domain and analyze the stability of the closed-loop system. Finally, USV trajectory control tests are carried out in a simulated environment.


Introduction
A USV is inherently a complex nonlinear system that is subject to environmental disturbances during navigation. Consequently, enhancing the path-tracking accuracy of USV motion control is a pressing concern.
At present, common methods for achieving such control include PID control [1,2], which is the most widely used, feedback control [3,4], fuzzy control [5,6], model predictive control (MPC) [7,8], and reinforcement learning (RL)-based control [9,10]. Of these approaches, PID control stands out because it does not require a model of the unmanned ship, which makes it a robust and easily implementable controller. However, it is difficult to ensure the optimality of specific performance indices. While a fuzzy controller can deduce and reproduce expert behavior, its application is hindered by the difficulty of crafting fuzzy rules, which arises primarily from the complexity of the navigation environment.
The feedback controller typically computes the heading and lateral deviations by analyzing the geometric relationship between the USV and the desired path. Based on these deviations, it directly determines the steering wheel angle for steering control. The tracking methods, which differ in how they relate the selected path anchor point to the USV position, are the single-point tracking method, the preview distance method, and the Stanley method. Both the single-point tracking method [11] and the preview distance method [12,13] offer simple algorithms and ease of implementation. However, the selection of the preview distance depends on the experiential judgment of the designer. The Stanley method, initially introduced by Stanford University for an unmanned vehicle fleet, is well suited to lower vehicle speeds and requires the reference trajectory to have continuous curvature.
Numerous studies have addressed the application of MPC to vehicle motion control [14][15][16][17]. Among these works, Falcone et al. [15] introduced an MPC motion controller based on the continuous linearization model, and their simulation results underscore the efficacy of this design approach in reducing computational costs. Carvalho et al. [17] studied a local path planning algorithm using locally linearized MPC, with linearization and convex approximation of the nonlinear obstacle avoidance boundaries. Liniger et al. [18] proposed a lateral motion method based on model predictive contouring control (MPCC). In this method, the lateral deviation is calculated by estimating the position of the projection point, which reduces the computational complexity to a certain extent. Ostafew et al. [19] adopted Gaussian process regression to build a nonparametric model of a mobile robot. For unmanned surface vehicles, a trajectory tracking controller based on MPC typically requires real-time numerical computation to solve for an open-loop control sequence. Its performance depends on the precision of the model, and the complexity of the online computation is unavoidable. Collectively, current control strategies are limited by suboptimal tracking accuracy and constrained computational efficiency.
In recent years, approximate dynamic programming (ADP) and reinforcement learning (RL) have been widely adopted in the design of robot decision and control algorithms, thanks to their efficiency in solving optimization problems and their adaptive learning capabilities [20,21]. Yang [22] developed a learning method based on PID control for vehicle tracking control. To reduce the tracking deviation of robots, the dual heuristic programming (DHP) algorithm was employed to adjust the PID parameters in real time, enhancing path-tracking accuracy. Gong et al. [23] designed a finite-time dynamic positioning controller for surface vessels. Shen et al. [24] introduced an LMPC framework aimed at enhancing trajectory tracking performance. Jiang et al. [25] proposed sliding mode control to improve the tracking performance of USVs.
Recent advances include works that employ deep learning and deep reinforcement learning to design controllers based on image or state information for USV trajectory control [26][27][28]. A key advantage of this approach lies in leveraging deep networks to enhance the feature representation capabilities of both reinforcement learning and supervised learning. Notably, the training process is entirely data driven and requires no dynamic model information. However, it has the following disadvantages: (1) Due to the inherent complexity of deep networks, the method is limited to training control strategies offline for online deployment. Moreover, its control performance is sensitive to factors such as the quantity and distribution of training samples. (2) For deep network learning, the analysis of theoretical properties such as convergence and robustness remains a crucial and challenging open issue.
Motivated by the challenges outlined above, we propose an RHRL-based control method aimed at achieving high-precision lateral control for USVs. The first step is to construct a dynamic deviation model for the USV. The steering control comprises two parts: feedforward and feedback. Feedforward control is derived directly from the curvature of the reference path and the deviation model. Feedback control is obtained by solving the optimal tracking problem with the RHRL algorithm proposed in this paper. Unlike conventional optimal control methods based on reinforcement learning, RHRL employs a rolling horizon optimization mechanism, which converts the infinite time domain optimal control problem into a sequence of finite time domain heuristic dynamic programming problems. In contrast to MPC, which solves for an open-loop control sequence, the strategy learned by this method is an explicit state feedback control law that can be deployed directly after offline learning or learned online. Furthermore, in Section 3, the convergence of the proposed RHRL algorithm within each prediction time domain and the stability of the closed loop are analyzed theoretically. Finally, simulations and comparative experiments for USV trajectory control using the RHRL algorithm are conducted. In these simulation tests, the control performance is found to be comparable to that of LMPC, with notable advantages in computational efficiency, sample complexity, and learning efficiency. To verify the algorithm's robustness and anti-interference capabilities, simulations incorporating disturbances are also conducted.
The remaining sections of this manuscript are arranged as follows. In Section 2, a dynamic model of a USV is built. Then, a USV trajectory control algorithm based on RHRL is proposed and shown to be stable. In Section 3, the simulation and comparison experiments are carried out, and disturbances are added. Section 4 contains the conclusions.

Modeling
In contemporary vehicle modeling, models with three degrees of freedom (DOF) and six DOF predominate. However, since the USV investigated in this study navigates on the sea surface, we adopt a three-DOF model to avoid unnecessary complexity.
In establishing the dynamical equations, a crucial decision is the choice of coordinate system. Direct application of Newton's laws of motion requires the equations to be written in an inertial coordinate system. Nevertheless, several considerations lead us to derive the dynamic equations in a body-fixed coordinate system. One reason is to obtain dynamic equations that are direction independent. Additionally, the body-fixed coordinate system allows forces and control moments to be assigned directly. However, the resulting frame of reference is not inertial. Hence, to account for the non-inertial reference frame, Coriolis and centripetal forces are introduced, which allows the remaining dynamics to be derived as if they were in an inertial reference frame.
The USV under investigation features a catamaran-like structure, incorporating two fixed propellers positioned at the extremities of each hull. In Figure 1, variables U_1 and U_2 denote the speeds of the two thrusters, while θ represents the heading angle.
Considering its actual working environment, trajectory tracking control of the USV on the horizontal plane will be the focus of our study.
A body frame (BF) is fixed to the USV, with its origin chosen to coincide with the center of gravity. Global information is recorded in the inertial frame (IF). The USV's motion can thus be described by the kinematic and dynamic equations together with the coordinate transformation between these two frames.
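The kinematic equation is

$$\dot{\xi} = R(\theta)\,\mathbf{v} \tag{1}$$

where $\xi = [x, y, \theta]^T$ represents the USV's position and heading in the IF; $\mathbf{v} = [u, v, r]^T$ represents the USV's velocity in the BF; and the rotation matrix $R(\theta)$ depends on the heading angle $\theta$. $R(\theta)$ can be expressed by the following equation:

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

According to Newton's laws of motion, the dynamic equation can be established as follows:

$$M\dot{\mathbf{v}} + C(\mathbf{v})\mathbf{v} + D(\mathbf{v})\mathbf{v} + g(\xi) = \kappa \tag{3}$$

where $\kappa = [F_u, F_v, F_r]^T$ represents the thrust force generated by the propellers. The matrix $M$ is the inertia matrix, which takes the added mass into consideration; $C(\mathbf{v})$ is the Coriolis and centripetal matrix; $D(\mathbf{v})$ is the USV's damping matrix; and $g(\xi)$ denotes the restoring force. The concrete forms of $M$, $C(\mathbf{v})$, and $D(\mathbf{v})$ are given in Equation (4).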
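As a minimal sketch of the kinematic equation above, the following Python snippet builds the planar rotation matrix and evaluates the IF velocity from a BF velocity. The function names and the forward-Euler integrator are illustrative assumptions and are not taken from the paper.

```python
import math

def rotation_matrix(theta):
    """Planar rotation R(theta) mapping BF velocities into the IF."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def kinematics(xi, nu):
    """Return d(xi)/dt = R(theta) * nu for xi = [x, y, theta], nu = [u, v, r]."""
    R = rotation_matrix(xi[2])
    return [sum(R[i][j] * nu[j] for j in range(3)) for i in range(3)]

def euler_step(xi, nu, dt):
    """One forward-Euler step of the kinematics (assumed integrator)."""
    xi_dot = kinematics(xi, nu)
    return [xi[i] + dt * xi_dot[i] for i in range(3)]

# With a heading of 90 degrees, pure surge at 1 m/s moves the USV along +y in the IF.
xi_dot = kinematics([0.0, 0.0, math.pi / 2], [1.0, 0.0, 0.0])
```

Only the kinematics are covered here; the dynamic equation additionally needs the model matrices, which this sketch does not assume.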
The thrusters $\tau = [\tau_1, \tau_2, \tau_3]^T$ generate the thrust force $\kappa$ according to $\kappa = B(\alpha)\tau$, where $\alpha$ denotes the thrusters' azimuth vector in the BF. The thrust distribution is given by Equation (5), where $B$ denotes a constant input matrix. $B$ is a 3 × 3 matrix that distributes power to the thrusters in three directions, and $B$ satisfies the condition that $B^T B$ is nonsingular. $l_1, l_2 \in (0, 1)$ are the thrusters' efficiency factors.
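Therefore, by combining Equations (1), (3), and (5), we can derive the dynamic model of the USV for trajectory tracking:

$$\dot{\mathbf{x}} = \begin{bmatrix} R(\theta)\mathbf{v} \\ M^{-1}\left(B\tau - C(\mathbf{v})\mathbf{v} - D(\mathbf{v})\mathbf{v} - g(\xi)\right) \end{bmatrix} = f(\mathbf{x}, \tau) \tag{6}$$

where $\mathbf{x} = [x, y, \theta, u, v, r]^T$ is the state and $\tau = [\tau_1, \tau_2, \tau_3]^T$ is the control input. This completes the derivation of the dynamic equation governing USV operation on the water surface.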

The USV Trajectory Control Algorithm Based on RHRL
In this section, the USV trajectory control algorithm based on RHRL is elaborated. We first formulate the performance index for the finite time domain trajectory control problem of the USV. Subsequently, we outline the core concepts of the associated reinforcement learning algorithm along with the design and implementation of the controller. A detailed convergence analysis of this approach is also included.
When conducting tracking control, it is necessary to describe the relative position between the USV and the desired path, as shown in Figure 2. The point P is the point on the desired path closest to the USV, which is called the road projection point. $P(X_p, Y_p, \varphi_d, \kappa)$ denotes the path information at the projection point, where $X_p, Y_p$ are the global coordinates of P; $\varphi_d$ is the angle between the tangent line at P and the X-axis, also known as the direction of the path; and $\kappa$ is the curvature of the path at P. The distance between P and the USV centroid is called the lateral deviation $e_y$, with $e_y > 0$ when the USV is located on the left side of the path and $e_y < 0$ when it is on the right side. Therefore, with $(X, Y)$ denoting the global coordinates of the USV centroid, the lateral deviation can be expressed as

$$e_y = -(X - X_p)\sin\varphi_d + (Y - Y_p)\cos\varphi_d$$

The course deviation $e_\varphi$ of the USV is defined as the difference between the USV heading $\varphi$ and the direction of the path, which is

$$e_\varphi = \varphi - \varphi_d$$

The first derivatives of $e_y$ and $e_\varphi$ are

$$\dot{e}_y = v_x \sin(e_\varphi) + v_y \cos(e_\varphi), \qquad \dot{e}_\varphi = \dot{\varphi} - \dot{\varphi}_d$$

where $v_x$ and $v_y$ denote the longitudinal and lateral velocities of the USV. It is assumed that $v_x$ remains constant, that there is no sideslip during motion, and that the expected yaw velocity of the USV's desired path is constant; then, the lateral acceleration of the USV when it stably tracks the path is $a_y = v_x^2 \kappa$.
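The deviation definitions above can be illustrated with a short Python sketch. It assumes the path is given as a list of waypoints, approximates the projection point by the nearest waypoint, and estimates the path direction by a finite difference; these choices and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math

def nearest_point_index(path, X, Y):
    """Index of the waypoint closest to the USV position (X, Y)."""
    return min(range(len(path)),
               key=lambda i: (path[i][0] - X) ** 2 + (path[i][1] - Y) ** 2)

def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def deviations(path, X, Y, heading):
    """Lateral deviation e_y (positive on the left) and course deviation e_phi."""
    i = nearest_point_index(path, X, Y)
    j = min(i + 1, len(path) - 1)
    i0 = j - 1  # segment (i0, j) is used to estimate the path direction
    phi_d = math.atan2(path[j][1] - path[i0][1], path[j][0] - path[i0][0])
    Xp, Yp = path[i]
    e_y = -(X - Xp) * math.sin(phi_d) + (Y - Yp) * math.cos(phi_d)
    e_phi = wrap_angle(heading - phi_d)
    return e_y, e_phi

# Straight path along the X-axis; the USV is 1 m to the left, heading 0.1 rad off.
path = [(float(k), 0.0) for k in range(5)]
e_y, e_phi = deviations(path, 2.0, 1.0, 0.1)
```

The sign of `e_y` follows the convention in the text: positive when the USV lies to the left of the path direction.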
Assuming that the course deviation $e_\varphi$ is small, the small-angle approximation gives $\sin(e_\varphi) \approx e_\varphi$ and $\cos(e_\varphi) \approx 1$. Then, the second derivative of the lateral deviation with respect to time can be expressed as

$$\ddot{e}_y = (\dot{v}_y + v_x\dot{\varphi}) - v_x^2\kappa = \dot{v}_y + v_x\dot{e}_\varphi$$

and the first derivative can be approximated as

$$\dot{e}_y \approx v_y + v_x e_\varphi$$

Combining Equations (1), (3), and (4) with Equations (8)-(10) yields the continuous-time deviation model, Equation (11a), in which $\omega_d = \dot{\varphi}_d$, $e = [e_y, \dot{e}_y, e_\varphi, \dot{e}_\varphi]^T$, and the control quantity is $u = \delta_f$. Given a sampling period $\Delta t$, Equation (11a) is discretized to obtain the discrete-time model, Equation (12), where $k$ is a discrete time point. For the model in Equation (12), it is assumed that the path information $\{(X_i, Y_i)\}_{i=1}^{M}$ is given. The purpose of this paper is to design a lateral control algorithm based on RHRL (as shown in Figure 3) such that, during the control process, the lateral error state gradually converges to 0, that is, $e \to 0$.

Design of Performance Index for the Finite Time Domain Trajectory Control Problem
In this section, a detailed control algorithm based on RHRL is presented. We begin by designing the performance index for the USV finite time domain lateral control problem. Subsequently, we outline the core concept of the RHRL algorithm and describe its design, implementation, and convergence analysis based on the executive-evaluator structure. For the system deviation model of Equation (12), the control quantity can be decomposed into a feedforward component $u_f$ plus a feedback component $u_b$ such that $u = u_f + u_b$, as shown in Figure 3. The feedforward control quantity represents the expected control input during steady-state operation, when the vehicle is stably following the reference path. In that case, $e(k) = e(k+1) = 0$ holds and $u_b = 0$ as well, so $u_f$ is determined by substituting this steady-state condition into Equation (12). Since $u_f$ can be easily solved at any current time $k$, we assume that $u_f$ remains constant throughout the prediction time domain $[k, k+N]$. The feedback control quantity $u_b$ to be solved then needs to meet the following constraint:

$$u_b \in U_b = \{\, u_b \mid \underline{u} - u_f \le u_b \le \bar{u} - u_f \,\}$$

where $\bar{u}$ is the maximum of $u$ and $\underline{u}$ is the minimum of $u$. The RHRL algorithm introduced in this paper seeks to minimize the following performance index by optimizing $u_b \in U_b$ in each prediction time domain:

$$V(e(k)) = \sum_{l=k}^{k+N-1} L(e(l), u_b(l)) + V_f(e(k+N)) \tag{15}$$

where the cost function is $L(e(l), u_b(l)) = e^T(l) Q e(l) + P u_b(l)^2$, $Q \in R^{4\times4}$ is a positive definite matrix, $P$ is a preset positive real number, and the terminal cost function of the prediction time domain is

$$V_f(e(k+N)) = e^T(k+N)\, R\, e(k+N)$$

where the penalty matrix $R \in R^{4\times4}$ is a positive definite matrix obtained by solving a Lyapunov equation for the closed-loop matrix $F$, with the feedback gain matrix $K$ chosen such that $F$ is Schur-stable. (A matrix $F$ of a discrete-time linear system is Schur-stable if all of its eigenvalues lie strictly inside the unit circle.)
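To make the performance index concrete, the sketch below evaluates the sum of stage costs plus the quadratic terminal cost for a given error and feedback sequence, and clips the feedback input so that the total input stays within its bounds. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
def quad_form(e, M):
    """Return e^T M e for a vector e and a square matrix M (nested lists)."""
    n = len(e)
    return sum(e[i] * M[i][j] * e[j] for i in range(n) for j in range(n))

def clip_feedback(u_b, u_f, u_min, u_max):
    """Keep the total input u = u_f + u_b inside [u_min, u_max]."""
    return max(u_min - u_f, min(u_max - u_f, u_b))

def horizon_cost(errors, feedbacks, Q, P, R):
    """Stage costs over the horizon plus the terminal cost.

    errors    : e(k), ..., e(k+N)      (N + 1 vectors)
    feedbacks : u_b(k), ..., u_b(k+N-1) (N scalars)
    """
    assert len(errors) == len(feedbacks) + 1
    stage = sum(quad_form(e, Q) + P * u * u
                for e, u in zip(errors[:-1], feedbacks))
    return stage + quad_form(errors[-1], R)

# Hypothetical data: identity Q, terminal weight R = 3 * I, input weight P = 2.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
R3 = [[3.0 * I4[i][j] for j in range(4)] for i in range(4)]
errors = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
cost = horizon_cost(errors, [0.2, 0.1], I4, 2.0, R3)
```

In this example the terminal error is zero, so the cost reduces to the two stage costs.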

Path Control Algorithm Based on RHRL
The implementation of the finite time domain reinforcement learning algorithm using the executive-evaluator structure involves the following main steps. First, according to Equation (15), for any $l \in [k, k+N-1]$, we can express the value function in difference form:

$$V(e(l)) = L(e(l), u_b(l)) + V(e(l+1)) \tag{18}$$

where $V(e(k+N)) = V_f(e(k+N))$. At the $l$-th prediction moment, let $V^*(e(l))$ denote the optimal value function. The HJB equation of the above finite time domain optimal control problem is then

$$V^*(e(l)) = \min_{u_b(l) \in U_b} \{ L(e(l), u_b(l)) + V^*(e(l+1)) \} \tag{19}$$

and the optimal control strategy is

$$u_b^*(e(l)) = \arg\min_{u_b(l) \in U_b} \{ L(e(l), u_b(l)) + V^*(e(l+1)) \} \tag{20}$$

In fact, due to the control constraints, it is difficult to obtain analytical solutions for $V^*$ and $u_b^*$ from Equations (19) and (20). In principle, we can approximate the optimal value function and control strategy through value iteration. For any $l \in [k, k+N-1]$, starting from the initial value $V^0(e(l)) = 0$, the following two steps are repeated for $i = 0, 1, 2, \ldots$ until $V^{i+1}(e(l)) - V^i(e(l)) \to 0$.
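(1) Strategy update

$$u_b^i(e(l)) = \arg\min_{u_b(l) \in U_b} \{ L(e(l), u_b(l)) + V^i(e(l+1)) \}$$

(2) Value update

$$V^{i+1}(e(l)) = L(e(l), u_b^i(l)) + V^i(e(l+1))$$

In conclusion, the task of trajectory tracking is accomplished through continuous updating of the strategy and the value function.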
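The alternation between strategy update and value update can be illustrated on a scalar linear system with quadratic cost, for which both steps have closed forms. This is only a sketch under simplifying assumptions: the system `e(l+1) = a*e + b*u`, the parameter values, and the absence of input constraints are illustrative and do not correspond to the USV model.

```python
def value_iteration(a, b, q, p, tol=1e-12, max_iter=10_000):
    """Value iteration for e(l+1) = a*e + b*u with stage cost q*e^2 + p*u^2.

    With V^i(e) = s_i * e^2 and V^0 = 0, each sweep performs
      strategy update: u^i(e) = -k_i * e, k_i = a*b*s_i / (p + b^2 * s_i)
      value update:    s_{i+1} = q + p*k_i^2 + s_i * (a - b*k_i)^2
    """
    s, k = 0.0, 0.0
    for i in range(max_iter):
        k = a * b * s / (p + b * b * s)                  # strategy update
        s_next = q + p * k * k + s * (a - b * k) ** 2    # value update
        if abs(s_next - s) < tol:
            return s_next, k, i + 1
        s = s_next
    return s, k, max_iter

# Hypothetical open-loop unstable system (a > 1) that the learned gain stabilizes.
a, b, q, p = 1.1, 0.5, 1.0, 0.1
s, k, iters = value_iteration(a, b, q, p)
```

At convergence, `s` satisfies the fixed-point (Riccati) equation of the unconstrained problem, and the closed-loop factor `a - b*k` has magnitude below one.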

Rolling Time Domain Executive-Evaluator Learning Implementation
We employ the executive-evaluator structure to implement the finite time domain value function iteration algorithm described above. In existing finite time domain reinforcement learning control algorithms [17], the value function in the prediction time domain is regarded as a time-dependent function.

Assumption 1.
There exists a control strategy u_b(e) = Φ(v(e)) such that the system in Equation (12) is asymptotically stable under the control strategy u = u_b + u_f, where Φ(v(e)) is a continuous function satisfying u_b(e) ∈ U_b, ∀v(e) ∈ R.
This assumption essentially expresses the stabilizability of the system in Equation (12). It is also worth noting that the dynamic model in Equation (12) is controllable, so there must be a continuous function u_b(e) ∈ U_b that renders the system asymptotically stable under the control strategy u = u_b + u_f. Therefore, the assumption is reasonable.
We define χ f as a control invariant set under the control law u b = Ke ∈ U b , then we can state the following theorem.
Theorem 1 (Time-independent value function). Suppose that the prediction horizon N is such that, in any prediction time domain [k, k + N] and for any initial state e(k) ∈ R^4, the terminal state satisfies e(k + N) ∈ χ_f under the control strategy u(e(l)), l ∈ [k, k + N − 1], of the system in Equation (12). Then there exists a control strategy u_b(e) ∈ U_b such that V(e(l)), l ∈ [k, k + N − 1], is a function that is independent of time.
Proof of Theorem 1. First, consider the case $e(k) \in \chi_f$. By the definition of $\chi_f$, there is a control law $u_b = Ke = \Phi(v(e)) \in U_b$ that ensures that the state at any future time satisfies $e(l) \in \chi_f$, and the corresponding value function can be solved explicitly. For the case $e(k) \notin \chi_f$, according to Assumption 1, there exist a control strategy $u_b = \Phi(v(e))$ and a finite prediction horizon $N$ such that $e(k+N) \in \chi_f$; in particular, one may let $v = Ke$. Hence, a value function and a strategy that are independent of time exist.

Drawing on this result, we adopt a time-independent executive-evaluator structure to carry out the finite time domain value function iteration described above. First, an evaluator network is designed to approximate the value function:

$$\hat{V}(e) = W_c^T \phi(e)$$

where $W_c \in R^{N_c}$ is the weight of the evaluator network, $N_c$ is the number of network nodes, and $\phi(e)$ is the basis function vector of the network. According to the definition of the evaluator network, two errors are formed: the error $E(l)$, which is the residual of Equation (18) under the network approximation, and the terminal error $E_f$, which is the residual of the terminal condition $V(e_f) = V_f(e_f)$. Therein, $e_f = e(k+N)$, which can be randomly valued around 0. By minimizing $E_c(l) = E(l)^2 + E_f^2$, the update rule for the weights of the evaluator network, Equation (27), is derived by gradient descent:

$$W_c(l+1) = W_c(l) - \mu_c \frac{\partial E_c(l)}{\partial W_c(l)}$$

where $\mu_c > 0$ is the learning rate of the evaluator network. Next, to deal with the control constraints, we construct the actuator network as follows:

$$\hat{u}_b(e) = \Phi\left(W_a^T \sigma(e)\right)$$

where $W_a \in R^{N_a}$ is the weight of the actuator network, $\sigma(e)$ is the basis function vector of the network, and $N_a$ is the number of network nodes. Given that the actuator network aims to approximate the optimal control strategy, we define the control quantity deviation $E_a$ between the output of the actuator network and the optimal control strategy. By minimizing $E_a^2$, we obtain the update rule for the network weight, Equation (30):

$$W_a(l+1) = W_a(l) - \mu_a \frac{\partial E_a^2(l)}{\partial W_a(l)}$$

where $\mu_a > 0$ is the learning rate of the actuator network.
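The evaluator update can be illustrated with a one-weight example. The sketch below assumes a scalar closed loop `e(l+1) = a_cl * e(l)`, a stage cost `c * e^2`, and a single basis function `phi(e) = e^2`; it performs gradient steps on the squared residual only and omits the terminal error term. All names and values are hypothetical.

```python
def critic_update(w, e, e_next, stage_cost, mu_c):
    """One gradient step on E^2 for V_hat(e) = w * phi(e), with phi(e) = e^2.

    E is the residual of V(e(l)) = L + V(e(l+1)) under the approximation.
    """
    phi, phi_next = e * e, e_next * e_next
    E = stage_cost + w * phi_next - w * phi
    grad = 2.0 * E * (phi_next - phi)   # d(E^2)/dw
    return w - mu_c * grad

# Hypothetical stable closed loop and cost weight.
a_cl, c, mu_c = 0.5, 1.0, 0.5
w = 0.0
for _ in range(200):
    e = 1.0                             # sampled deviation state
    w = critic_update(w, e, a_cl * e, c * e * e, mu_c)

w_exact = c / (1.0 - a_cl ** 2)         # exact weight for this toy problem
```

For this toy problem the exact value function is `c / (1 - a_cl**2) * e**2`, so the learned weight can be compared directly with `w_exact`.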

Algorithm 1. Main steps of the finite time domain reinforcement learning algorithm based on the executive-evaluator structure.
(I) Initialize the weights W_c, W_a, and obtain the initial state Z(0).
(II) At time t = k∆t, find the projection point P according to the state Z(t) and calculate the deviation state e(t).

Convergence Analysis of the Weight of Finite Time Domain Actuator and Evaluator
Next, we present the convergence analysis of the above RHRL algorithm in each prediction time domain [k, k + N − 1]. First, the (local) optimal value function and control strategy can be represented by the evaluator and actuator networks with ideal weights W_c and W_a, up to the reconstruction errors κ_c and κ_a.

Assumption 3 (Persistent excitation).
There are positive real numbers q_1, q_2 (q_1 < q_2) that bound the excitation of the network basis functions from below and above.
The following theorem is stated compactly in terms of the auxiliary constants γ_1, γ_2, and α.
Theorem 2. Under Assumptions 2 and 3, if the learning rates µ_c and µ_a and the constants {β_i}, i = 0, ..., 3, are chosen so that γ_1 > 0 and α − γ_2 > 0, then the network weights Ŵ_c and Ŵ_a of Equations (27) and (30) asymptotically converge to a bounded region determined by the error E_t, where W̃_c = W_c − Ŵ_c, W̃_a = W_a − Ŵ_a, and ξ_a = W̃_a^T ψ. Furthermore, if κ_{c,m}, κ̄_{c,m}, κ_{a,m} → 0, then W̃_c and ξ_a converge asymptotically to 0.
Proof of Theorem 2. The Lyapunov function is composed of a term $L_c$ in $\tilde{W}_c$ and the term $L_a = \mathrm{tr}(\tilde{W}_a^T \eta_a^{-1} \tilde{W}_a)$. The difference $\Delta L_c(l+1)$ is calculated based on Equation (26), with the terminal reconstruction error evaluated at $k+N$, and is then bounded according to Equations (27), (35), and (36). Similarly, $\Delta L_a(l+1)$ can be expressed in terms of $\tilde{W}_a$ and $\bar{g} = g^T g$. Applying Young's inequality, with $E_a = (1/\beta_2 + 1/\beta_3)\bar{\kappa}_a^2$, and defining $E_t = E_{c,m} + E_{a,m}$, we obtain the convergence region stated in Theorem 2. On this basis, if $\bar{\kappa}_{c,m}, \kappa_{c,m}, \kappa_{a,m} \to 0$, then $E_t \to 0$, and $\tilde{W}_c$ and $\xi_a$ asymptotically converge to 0. This concludes the proof of Theorem 2.
The conclusion of the above theorem indicates that we can make $u_b$ converge to $u_b^*$ with an arbitrarily small error by increasing the number of basis function nodes in the actuator and the evaluator. Therefore, provided that Assumption 1 holds, if a sufficiently large $N$ is chosen, the system in Equation (12) satisfies the terminal state constraint $e(k+N|k) \in \chi_f$, and a feasible control strategy for the next prediction time domain can be constructed using the terminal control law $Ke(k+N|k)$. We denote the loss produced by this feasible strategy by $Los_f(k+1|k)$. Following Rawlings [29], $Los_f(k+1|k) - Los^*(k|k) \le -L(e(k|k), u_b(k|k))$. Since $Ke(k+N|k)$ is suboptimal, $Los^*(k+1|k+1) \le Los_f(k+1|k)$, and we may safely derive

$$Los^*(k+1|k+1) - Los^*(k|k) \le -L(e(k|k), u_b(k|k))$$

from which the stability of the closed-loop system follows by Lyapunov stability analysis.

Simulation Analysis
To ensure a precise comparison of the control performance of RHRL, Lyapunov-based MPC (LMPC), and sliding mode control (SMC), the controlled variable method was adopted, using experimental parameters from [24,25]. In the simulations, all of the hydrodynamic parameters in the equations are based on the Falcon model [30].
The simulation results presented in this section showcase the advantages of the RHRL method. The simulations were run in Matlab 2021b on an R7-5800H processor.
In this section, the desired trajectory tracking simulation of a USV based on RHRL is carried out to demonstrate the feasibility and efficiency of the RHRL algorithm proposed earlier. The parameters for the USV simulation are presented in Table 1.

Tracking Performance
Figure 4a,c depicts the tracking results for Path I. The USV trajectories are represented by the blue curve for the LMPC controller, the green curve for the SMC controller, and the red curve for the RHRL controller, while the black curve is the desired sinusoidal trajectory. The results demonstrate that all controllers succeed in guiding the USV along the desired trajectory, affirming the stability of the closed loop. However, the RHRL method converges considerably faster than the LMPC and SMC methods. This acceleration in convergence is attributed to the selection of the control gain matrices K_p and K_d, which are small. The simulation results show that the improvement in tracking accuracy is due to synchronous online incremental learning and deployment. Figure 4b illustrates the thrust output of each propeller. It is evident that at the commencement of tracking, the RHRL controller makes maximal use of the onboard thrust capability to achieve convergence as swiftly as possible. In essence, the state remains within the prescribed boundary, aligning with expectations. It is also notable that RHRL demonstrates superior adjustment capability and adjusts more rapidly.
The outcomes for Path II are presented in Figure 5. Similar observations can be made: the USV converges more quickly to the desired trajectory with RHRL.

Robustness Experiment with Disturbance
The receding horizon implementation introduces feedback into the closed-loop system. One of the inherent advantages of the RHRL controller is its robustness toward disturbances and emergencies, making it particularly well suited for control systems in marine and submarine environments. The robustness of RHRL is examined and demonstrated through simulations. A simulated disturbance of fixed magnitude [100 N, 100 N, 0 Nm]^T was added. To provide a clearer visualization of the deviation among the three algorithms, the reference trajectory, indicated by a black line, is also included in this experiment.
In the outcomes shown in Figures 6 and 7, RHRL tracking control consistently guides the USV to converge toward the desired trajectory. In contrast, substantial tracking errors are exhibited under LMPC, and even greater errors under SMC. Figures 6b and 7b illustrate that the RHRL controller consistently provides feedback that responds within a short time, ensuring minimal deviation. The mean square errors (MSEs) for both paths are consolidated in Tables 2 and 3. Generally, the MSEs are approximately 10 times smaller for RHRL than for LMPC and SMC, especially in the case of Path II. Since smaller MSEs correspond to reduced tracking error, the RHRL algorithm significantly enhances tracking accuracy. To demonstrate the performance of the algorithm more objectively, we also conduct a quantitative analysis based on an additional factor, namely thrust output. A smaller average thrust corresponds to lower energy consumption and enhanced cost-effectiveness. The specific data are shown in Tables 4 and 5. As can be seen from the tables, the energy consumption of RHRL is reduced by 43.85% and 41.65% relative to LMPC for Paths I and II, respectively. The data show that RHRL is much more economical than LMPC. However, due to the algorithm characteristics, RHRL does not have a significant advantage over SMC in this respect. The observed disparity stems from RHRL's ability to learn and adapt online, using online optimization to dynamically adjust control gains and effectively compensate for interference. Conversely, both LMPC and SMC lack this flexibility. Consequently, robustness is significantly enhanced by RHRL control.
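The two quantities used in this comparison, the mean square error of the tracked trajectory and the relative reduction in average thrust, can be computed as in the following sketch. The sample values are hypothetical and are not taken from Tables 2 to 5.

```python
def mean_square_error(actual, reference):
    """Mean of squared differences between two equally long sequences."""
    assert len(actual) == len(reference) and len(actual) > 0
    return sum((a - r) ** 2 for a, r in zip(actual, reference)) / len(actual)

def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

# Hypothetical tracked and reference positions (m) and average thrusts (N).
mse = mean_square_error([0.0, 1.0, 2.0], [0.0, 1.5, 2.5])
saving = percent_reduction(200.0, 150.0)
```

A positive `saving` means that the second controller uses less average thrust than the baseline.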

Conclusions
In this paper, a trajectory control algorithm for USVs based on RHRL is introduced, in which reinforcement learning is integrated with a rolling time domain optimization mechanism. Infinite time self-learning optimization problems are thereby converted into a series of finite time optimization problems, which can be solved using an executive-evaluator algorithm. The rolling time domain mechanism significantly enhances the learning efficiency of the RL algorithm. Moreover, compared to LMPC and SMC, the executive-evaluator optimization method improves computational efficiency. Unlike the majority of existing finite time domain executive-evaluator learning algorithms, the proposed RHRL employs a time-independent single-network structure, which reduces the complexity of network design and of online computation. We also analyzed the stability of the closed-loop system theoretically. Scenarios involving significant errors in the learned approximation strategy will be analyzed in depth in our future research. The simulation results demonstrate that our algorithm is effective in comparison with typical traditional algorithms: RHRL control is superior to LMPC and SMC in terms of control performance and computational efficiency while also being more economical than LMPC. RHRL control also has lower sample complexity and higher learning efficiency.

Figure 1. Diagram of the BF (left) and IF (right).

Figure 3. Trajectory tracking control block diagram of the USV.

Figure 4. The USV trajectory tracking performance in Path I. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.

Figure 5. The USV trajectory tracking performance in Path II. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.

Figure 6. The USV trajectory tracking performance in Path I with disturbance. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.

Figure 7. The USV trajectory tracking performance in Path II with disturbance. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.

Table 1. Parameters for USV simulation.

Table 2. MSE for disturbances in Path I.

Table 3. MSE for disturbances in Path II.

Table 4. The average thrust output with disturbances in Path I.

Table 5. The average thrust output with disturbances in Path II.