A Trajectory Tracking and Local Path Planning Control Strategy for Unmanned Underwater Vehicles

Abstract: The control strategy of an underactuated unmanned underwater vehicle (UUV) equipped with a front sonar and subject to actuator faults in a continuous task environment is investigated. Considering trajectory tracking and local path planning in complex-obstacle environments, we propose a task transition strategy under an event-triggered mechanism and design the corresponding state space and action space for the trajectory tracking task under a deep reinforcement learning framework. Meanwhile, a feed-forward compensation mechanism, combined with a reduced-order extended state observer, is designed to counteract the effects of external disturbances and actuator faults. For the path planning task under the rapidly exploring random tree (RRT) framework, reward components and angular factors are introduced to optimize the growth and exploration points of the extended tree under the shortest-distance, optimal-energy-consumption, and steering-angle constraints. The effectiveness of the proposed method was verified through continuous task simulations of trajectory tracking and local path planning.


Introduction
With the emergence of unmanned underwater vehicles (UUVs) demonstrating high maneuverability, stealth, and intelligence in tasks such as detecting underwater pipelines, surveying underwater topography, exploring oil and gas resources, and undertaking military surveillance and patrol activities, their control strategies in harsh ocean environments with poor communication conditions and low visibility have become a hot research topic for many scholars [1]. Since the underwater trajectory tracking and local path planning of a single UUV are the basis and prerequisite for accomplishing various complex tasks, granting UUVs these two capabilities simultaneously in complex underwater environments has become a top priority.
There have been numerous attempts at trajectory tracking with classical control methods. A fault-tolerant controller using feedback-linearized control to solve the consensus control problem for multi-UUV systems was presented in [2]. Additionally, in [3], the target tracking control issue of underactuated UUVs was effectively tackled through the implementation of line-of-sight (LOS) and backstepping methods. In [4], a novel Lyapunov-based MPC framework was proposed to improve UUV tracking performance, and trajectory tracking control was explored using the model predictive control (MPC) approach. In order to attenuate the effect of the accuracy of the UUV mathematical model on actual UUV control, many scholars have made improvements to the robustness and control algorithms of UUVs. In [5], an improved active disturbance rejection control (ADRC) strategy was proposed, which estimated unknown disturbances by designing a generalized extended state observer (ESO). A known drawback of the RRT algorithm, however, is that the planned paths tend to lose some of their optimal characteristics, which is why we chose to incorporate angular constraints.
In our task environment, the advantage of RRT is that its search is based on a tree structure with random sampling, and the edges from the nodes to be extended to the new nodes can be generated in accordance with kinematic and dynamic simulations of the robot, which makes its combination with the tracking task in the same temporal and spatial domain more efficient [21]. Second, the search process of RRT is more concise than that of other metaheuristic algorithms and does not require precise spatial information about the obstacles.
The main contributions of our work are summarized below:

• A trajectory tracking controller is designed using an event-triggered mechanism together with the fault and disturbance states of the UUV actuators, which realizes the classification and collection of the relevant states during model training and simulation as well as the seamless connection of successive tasks.

• Combining the deep reinforcement learning framework with a metaheuristic algorithm in the same task space, the distance, energy consumption, and steering angle constraints required by the UUV in the task environment are considered from various perspectives, and reward components and angular factors are introduced into the RRT algorithm to plan paths locally.

• In response to the effects of external perturbations and actuator faults, a joint reduced-order extended state observer and feed-forward compensation mechanism is designed to enable the UUV to counteract these unfavorable effects by automatically adjusting its own output.

Preliminaries
In this section, we model the mission environment. Starting from a mathematical model of an underactuated UUV with thruster faults, we combine deep reinforcement learning and RRT to model the environments of the tracking task and the local planning task.

Underactuated UUV Model Description
A torpedo-shaped underactuated UUV, which employs a cross-rudder and single-propeller actuator layout, was chosen as the research object of this paper [23]. We define the position, attitude, and velocity of the underactuated UUV in the Earth-fixed coordinate system (O_E − X_E Y_E Z_E) and the body-axis coordinate system (O_B − X_B Y_B Z_B), respectively. As shown in Figure 1, (x, y, z) denotes the position of the UUV in the Earth-fixed coordinate system. θ and ϕ represent the pitch angle and yaw angle of the UUV in the Earth-fixed coordinate system, respectively, and their positive directions follow the right-hand rule. u, v, and w are the velocities in surge, sway, and heave, respectively. Finally, q and r refer to the pitch angular velocity and yaw angular velocity in the body-axis coordinate system, respectively, and their positive directions are the same as those of θ and ϕ.
Based on the description of Fossen's equations [24] associated with UUV thruster failure [25], we can obtain the five-degrees-of-freedom UUV dynamics equations with unknown perturbations:

\[
\begin{aligned}
m_{11}\dot{u} &= m_{22}vr - m_{33}wq - f_u + \lambda_{uh}\tau_u + \omega_{uh} + \tau_{du}(t),\\
m_{22}\dot{v} &= -m_{11}ur - f_v + \tau_{dv}(t),\\
m_{33}\dot{w} &= m_{11}uq - f_w + \tau_{dw}(t),\\
m_{55}\dot{q} &= (m_{33}-m_{11})uw - f_q - \rho_e g \nabla\, GM_L \sin\theta + \lambda_{qh}\tau_q + \omega_{qh} + \tau_{dq}(t),\\
m_{66}\dot{r} &= (m_{11}-m_{22})uv - f_r + \lambda_{rh}\tau_r + \omega_{rh} + \tau_{dr}(t),
\end{aligned}
\tag{1}
\]

where m_11, m_22, m_33, m_55, and m_66 represent the mass and inertia parameters of the UUV, and τ_u, τ_q, and τ_r represent the control force and moments generated by the tail propeller and rudders. f_u, f_v, f_w, f_q, and f_r represent hydrodynamic damping and friction terms. τ_du(t), τ_dv(t), τ_dw(t), τ_dq(t), and τ_dr(t) represent time-varying bounded environmental disturbances caused by factors such as wind and ocean currents. GM_L, ∇, g, and ρ_e represent the longitudinal metacentric height, displacement, gravitational acceleration, and water density, respectively.
In order to more realistically simulate the impact of the UUV's thruster faults on its drive accuracy and efficiency in underwater environments, we simulate the loss of control signals caused by rudder vibration and propeller entanglement through a combination of multiplicative and additive faults. In Equation (1), τ_i (i = u, q, r) is the input of the actuator; the multiplicative faults λ_uh, λ_qh, and λ_rh are set to fixed values; and the additive faults ω_uh, ω_qh, and ω_rh are time-varying variables.
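The combined multiplicative/additive fault model above can be sketched in a few lines. The function name and the sinusoidal additive fault below are illustrative assumptions, not the paper's exact fault signals.

```python
import numpy as np

def apply_actuator_fault(tau, lam=0.8, omega=0.0):
    """Combined multiplicative/additive actuator fault model.

    tau   : commanded control force or moment (tau_i, i = u, q, r)
    lam   : multiplicative fault (effectiveness loss), held at a fixed value
    omega : additive fault, possibly time-varying
    Returns the force/moment actually delivered by the faulty actuator.
    """
    return lam * tau + omega

# Illustrative example: a 20% effectiveness loss plus a slow sinusoidal
# additive fault on the surge channel.
t = np.linspace(0.0, 10.0, 101)
omega_uh = 2.0 * np.sin(0.5 * t)          # time-varying additive fault
tau_u_actual = apply_actuator_fault(50.0, lam=0.8, omega=omega_uh)
```

Under this model, a commanded surge force of 50 N with λ = 0.8 delivers only 40 N before the additive term is superimposed.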

Tracking the State Model of the Mission Environment
In the tracking task, we select the environment state as the relative distance ρ, relative pitch angle β, and relative yaw angle α between the UUV and the desired trajectory at each time step, obtained by the LOS method [3]. As shown in Figure 2, (x_m, y_m, z_m) are the coordinates of the desired trajectory point in the Earth-fixed coordinate system, and ∆x = x_m − x, ∆y = y_m − y, and ∆z = z_m − z denote the three-dimensional coordinate differences of the trajectory point with respect to the UUV. (x_r, y_r, z_r) are the relative errors and can be expressed as

\[
[x_r,\, y_r,\, z_r]^T = R_1 [\Delta x,\, \Delta y,\, \Delta z]^T,
\tag{2}
\]

where R_1 can be expressed as

\[
R_1 =
\begin{bmatrix}
\cos\theta\cos\phi & \cos\theta\sin\phi & -\sin\theta\\
-\sin\phi & \cos\phi & 0\\
\sin\theta\cos\phi & \sin\theta\sin\phi & \cos\theta
\end{bmatrix}.
\tag{3}
\]

Finally, we can obtain the relative distance ρ, relative pitch angle β, and relative yaw angle α:

\[
\rho = \sqrt{x_r^2 + y_r^2 + z_r^2}, \qquad
\beta = \arctan2\!\left(-z_r,\, \sqrt{x_r^2 + y_r^2}\right), \qquad
\alpha = \arctan2(y_r,\, x_r).
\tag{4}
\]

Remark 1. Some UUV physical constraints and assumptions are as follows:

• The motion of the UUV in the roll direction is neglected.

• The UUV has neutral buoyancy, and the origin of the body-fixed coordinates is located at the center of mass.

• The yaw angle is limited to (−π, π], and the pitch angle is limited to (−π/6, π/6].

To train UUVs to track a target through deep reinforcement learning, the policy network input states, output actions, and reward functions need to be formulated in conjunction with the states obtained from Equation (4) relative to the desired trajectory.
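The relative-state extraction described above can be sketched as follows. The angle conventions (signs and axis order) in this sketch are assumptions rather than the paper's exact Equation (4).

```python
import math

def los_relative_states(uuv_pos, uuv_theta, uuv_phi, target_pos):
    """LOS-style relative states to a desired trajectory point.

    uuv_pos    : (x, y, z) of the UUV in the Earth-fixed frame
    uuv_theta  : pitch angle of the UUV
    uuv_phi    : yaw angle of the UUV
    target_pos : (x_m, y_m, z_m) of the trajectory point
    Returns (rho, beta, alpha): relative distance, pitch, and yaw.
    """
    dx = target_pos[0] - uuv_pos[0]
    dy = target_pos[1] - uuv_pos[1]
    dz = target_pos[2] - uuv_pos[2]
    rho = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Line-of-sight angles measured relative to the current attitude.
    alpha = math.atan2(dy, dx) - uuv_phi                     # relative yaw
    beta = math.atan2(-dz, math.hypot(dx, dy)) - uuv_theta   # relative pitch
    return rho, beta, alpha
```

For a level UUV at the origin and a target at (3, 4, 0), this yields ρ = 5 with zero relative pitch, matching the intuition that the LOS angles vanish when the vehicle already points at the target.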

1. Trajectory tracking task states

We selected the velocity state, actuator fault state, and perturbation state, which are directly related to the update of the UUV position, velocity, and acceleration, as the base states for the policy network inputs. The velocity state can be expressed as

\[
g_s = [u,\, v,\, w,\, q,\, r].
\tag{5}
\]

Since the multiplicative faults are approximated as fixed values, they have no effect on the learning update process of the policy network; the fault state is therefore set to the additive faults, denoted as

\[
f_s = [\omega_{uh},\, \omega_{qh},\, \omega_{rh}].
\tag{6}
\]

The five-degrees-of-freedom UUV perturbation state can be defined as

\[
d_s = [\tau_{du},\, \tau_{dv},\, \tau_{dw},\, \tau_{dq},\, \tau_{dr}].
\tag{7}
\]

In addition, combining the relative distance, relative pitch angle, and relative yaw angle obtained from Equation (4), we set the error state of the trajectory tracking task as

\[
e_s = [\rho - \rho_d,\; \beta - \beta_d,\; \alpha - \alpha_d],
\tag{8}
\]

where ρ_d, β_d, and α_d denote the desired relative distance, relative pitch angle, and relative yaw angle, respectively.

Finally, we can obtain the input state of the trajectory tracking task in the deep reinforcement learning framework:

\[
s_t = [g_s,\, f_s,\, d_s,\, e_s].
\tag{9}
\]

2. Action space

The UUV studied in this paper is powered by a propeller at the tail and rudders on both sides, with control forces and moments applied to the UUV to update its state over time. Therefore, we directly chose the control force and moments as the output of the deep reinforcement learning policy network, which can be expressed as

\[
a = [\tau_u,\, \tau_q,\, \tau_r].
\tag{10}
\]

3. Reward function

The trajectory tracking task expects the UUV to track the target quickly and maintain a fixed tracking state within a limited time, so the reward function should first consider the error between the UUV and the desired trajectory, following the principle that the larger the error, the smaller the reward. Combining this with Equation (8), one obtains

\[
r_s = -\lVert e_s \rVert.
\tag{11}
\]

In addition, we expect the outputs of the UUV propeller and rudders to remain stable and energy-efficient, which means that, at the same control accuracy, the smaller the output value, the better the corresponding action. Thus, we define the reward function related to the controller output as

\[
r_p = -\lVert a \rVert^2, \qquad r = \zeta_s r_s + \zeta_p r_p,
\tag{12}
\]

where ζ_s and ζ_p denote the weight parameters of the two-part reward function.
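A minimal sketch of the two-part reward follows. The exact functional forms of the error and output penalties are not reproduced above, so the L2-norm penalties here are assumptions that preserve only the stated monotonicity (larger error or larger output yields a smaller reward); ζ_s and ζ_p appear as `zeta_s` and `zeta_p`.

```python
import numpy as np

def tracking_reward(e_s, a, zeta_s=1.0, zeta_p=0.01):
    """Two-part tracking reward: error penalty plus output-effort penalty.

    e_s : error state [rho - rho_d, beta - beta_d, alpha - alpha_d]
    a   : actuator output [tau_u, tau_q, tau_r]
    """
    r_error = -np.linalg.norm(e_s)       # larger error  -> smaller reward
    r_effort = -np.linalg.norm(a) ** 2   # larger output -> smaller reward
    return zeta_s * r_error + zeta_p * r_effort
```

The small default for `zeta_p` reflects the usual practice of weighting the effort penalty well below the tracking penalty so that energy saving never dominates tracking accuracy.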

Reduced-Order Extended State Observer
To deal with the unknown perturbations and actuator faults of underactuated UUVs, we design a reduced-order extended state observer (RESO) to estimate the perturbation and additive fault terms of UUVs in real environments. First, we define the combined term of external perturbations and actuator additive faults as

\[
\chi = [\tau_{du} + \omega_{uh},\; \tau_{dv},\; \tau_{dw},\; \tau_{dq} + \omega_{qh},\; \tau_{dr} + \omega_{rh}]^T.
\tag{13}
\]

Assuming that the external perturbation and the actuator fault are both continuously differentiable, χ is also continuously differentiable. We can rewrite Equation (1) in matrix form:

\[
M\dot{v} + C(v)v + D(v)v + g(\eta) = \lambda a + \chi,
\tag{14}
\]

where M, C, and D represent the inertia matrix and the Coriolis and centripetal terms matrices, respectively. The multiplicative faults λ of the three actuators are set to 0.8. v = [u, v, w, q, r]^T represents the velocity vector, and η = [x, y, z, θ, ϕ]^T represents the position and attitude vector. a denotes the same action space as in Equation (10). Then, this is reduced to the standard form of a reduced-order observation problem:

\[
M\dot{v} = u + \chi - C(v)v - D(v)v - g(\eta),
\tag{15}
\]

where u represents λa in Equation (14). For a detailed derivation of the reduced-order extended state observer, please refer to [26]. Ultimately, the reduced-order extended observer is constructed from two parametric functions f_1 and f_2, where k_1, k_2, and ∂ are set to constant values. For the proof of the Lyapunov stability of this observer, please refer to Appendix A.
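The idea of the reduced-order observer can be illustrated on a single scalar channel with dynamics v' = u + χ. The sketch below uses linear gains in place of the paper's nonlinear parametric functions f_1 and f_2, and the gain values are assumptions chosen only to make the error dynamics stable.

```python
def reso_step(chi_hat, v_hat, v_meas, u, dt, k1=20.0, k2=100.0):
    """One Euler step of a linear reduced-order ESO sketch.

    Scalar plant assumed: v' = u + chi, with chi the lumped
    disturbance + additive fault term to be estimated.
    chi_hat : current estimate of chi
    v_hat   : current estimate of the velocity v
    v_meas  : measured velocity
    u       : known control input acting on this channel
    Returns the updated (chi_hat, v_hat).
    """
    e = v_meas - v_hat                       # observation error
    v_hat = v_hat + dt * (u + chi_hat + k1 * e)
    chi_hat = chi_hat + dt * (k2 * e)        # error drives the estimate
    return chi_hat, v_hat
```

With k1 = 20 and k2 = 100, the error dynamics have a double pole at −10, so the estimate converges within a few tenths of a second of simulated time for a constant lumped term.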

Localized Planning Mission Environment Model
To realize omnidirectional detection centered on the UUV, a 360-degree scanning sonar was selected to obtain scanning data [27]. As shown in Figure 3, an obstacle is considered detected if its relative distance to the UUV is less than or equal to ρ_max. At the same time, we introduce the event-trigger flag T_d. When the UUV detects an obstacle, T_d becomes 1, indicating that the trajectory tracking task is temporarily stopped and the local planning task is started; when the UUV travels along the planned collision-free path to the point of the path where the obstacle is no longer detected, T_d turns back to 0, indicating that the local planning task is completed, and the UUV then continues to follow the desired trajectory.
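The event-trigger flag T_d described above reduces to a simple range test on one sonar sweep; `rho_max` below stands for the detection radius ρ_max.

```python
def event_trigger(distances, rho_max=10.0):
    """Event-trigger flag T_d from one 360-degree sonar sweep.

    distances : iterable of obstacle ranges returned by the sweep
    Returns 1 (switch to local planning) if any obstacle lies within
    rho_max, else 0 (continue trajectory tracking).
    """
    return 1 if any(d <= rho_max for d in distances) else 0
```

A sweep of [15.0, 12.0] leaves T_d at 0, while [15.0, 8.0] raises it to 1 and hands control to the local planner.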
Since the RRT-planned path is a time series consisting of the coordinates of the UUV at each time step in the Earth-fixed coordinate system, the planned path can be represented as a sequence of waypoints p(s), s ∈ [0, s_n], where s denotes the path length corresponding to the unit time step, and s_n denotes the distance between the planning start and end points. At the same time, we introduce the concepts of the relative pitch angle and relative yaw angle from Equation (4) into two neighboring nodes of the planned path and discretize them as β̄(s) and ᾱ(s).

The Principles of the Basic RRT Algorithm
The concept of the RRT algorithm can be briefly described as follows. First, the random tree T is initialized with the start node of the UUV; then, a random point is uniformly sampled in the obstacle-free space, and the nearby node in the random tree is found by traversing it with the distance evaluation function. Second, a new node is generated by advancing a unit step from the nearby node toward the random node. If there is a collision between the nearby node and the new node, the new node is discarded without any modification to the random tree T; otherwise, the new node is added to T [28]. The iterative process above is repeated until the target node becomes a leaf node, or the search ends when the set number of iterations is exceeded. Tracing back from the target node to the starting point yields the planned path. Figure 4 shows the node expansion process of basic RRT, as described above.
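The expansion loop described above can be sketched in 2D as follows; representing obstacles as circles and accepting the goal within a tolerance are simplifying assumptions of this sketch, and collision checking is reduced to the new node only to keep it short.

```python
import math
import random

def basic_rrt(start, goal, obstacles, step=2.0, max_iter=2000,
              bounds=(0.0, 100.0), goal_tol=2.0, seed=0):
    """Minimal 2-D RRT expansion loop.

    obstacles : list of (cx, cy, r) circles; a node is valid when it
                lies outside every circle.
    Returns the path from start to goal, or None on failure.
    """
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}

    def free(p):
        return all(math.hypot(p[0] - cx, p[1] - cy) > r
                   for cx, cy, r in obstacles)

    for _ in range(max_iter):
        q_rand = (rng.uniform(*bounds), rng.uniform(*bounds))
        # Nearest tree node under the Euclidean distance metric.
        i_near = min(range(len(nodes)),
                     key=lambda i: math.dist(nodes[i], q_rand))
        q_near = nodes[i_near]
        d = math.dist(q_near, q_rand)
        if d == 0.0:
            continue
        # Advance one unit step from q_near toward q_rand.
        q_new = (q_near[0] + step * (q_rand[0] - q_near[0]) / d,
                 q_near[1] + step * (q_rand[1] - q_near[1]) / d)
        if not free(q_new):
            continue                      # collision: discard q_new
        parent[len(nodes)] = i_near
        nodes.append(q_new)
        if math.dist(q_new, goal) <= goal_tol:
            # Walk back from the goal-reaching leaf to the root.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

Because the sampler is uniform and unbiased, the tree wanders before reaching the goal; the reward components introduced in Section 3.1 are precisely what replaces this blind expansion with a biased one.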

Improved RRT Algorithm with Reinforcement Learning Applications
In this section, a steering angle optimization RRT algorithm based on a reward mechanism is proposed.The logic of deep reinforcement learning applied in this paper is also introduced.Finally, the structural framework of complete trajectory tracking combined with local planning is obtained.

Improved RRT Algorithm
In order to simulate the real underwater motion of the UUV, we limit the generation of new nodes in the RRT algorithm by using the maximum rotation angles θ max and ϕ max of the UUV rotating around the y-axis and z-axis of the body-axis coordinate system as the constraints.
The starting point q_init in the state space is used as the root node, and a point q_rand is randomly selected from the search space. If this point falls in the non-obstacle region, T is traversed to find the node q_nearest closest to this random point [29]. Then, by applying Equation (19), we find the relative pitch angle β̄(s) and relative yaw angle ᾱ(s) between q_rand and q_nearest. Finally, we calculate q_new under the maximum-angle constraints β̄_max and ᾱ_max by Equation (20). In addition, for the case where β̄(n) and ᾱ(n) are both within the angular constraints, the generation of new nodes follows the rules of the basic RRT.
In fact, even though we limit the maximum steering angle of the planned path, we still cannot avoid the search instability and hysteresis caused by the strong randomness of the RRT algorithm. We therefore introduce a reward function for the behavior of generating a new node q_new in order to reduce unnecessary exploration. The setting of the reward weights should consider the following three aspects:

1. Rewarding the generation of new nodes that are close to the goal point q_goal.

According to the basic RRT algorithm, the expansion vector of the nearest point q_near(n) can be expressed by the following formula [21]:

\[
R(n) = \tau \, \frac{q_{rand}(n) - q_{near}(n)}{\lVert q_{rand}(n) - q_{near}(n) \rVert},
\tag{21}
\]

where R(n) denotes the expansion vector from the nth nearest point q_near(n) to the nth random point q_rand(n), and τ represents the step size in the RRT algorithm. Then, the reward component from q_goal to q_near(n) can be defined as

\[
G_1(n) = g_1 \, \frac{q_{goal} - q_{near}(n)}{\lVert q_{goal} - q_{near}(n) \rVert},
\tag{22}
\]

where g_1 denotes the reward gain coefficient that varies the target bias of the generated random vector.

2. Rewarding the generation of new nodes that are close to the desired trajectory.

The continuous task environment of trajectory tracking and localized obstacle avoidance requires the UUV to switch between the two task states with the smallest possible energy consumption. The error with respect to the desired trajectory must therefore be kept small while the path is planned. As shown in Figure 3, of the two candidate paths, only the one closer to the desired trajectory is selected. Emulating the form of the reward component with respect to q_goal, we obtain the reward component G_2(n) with the desired trajectory point of the next time step as the goal.

3. Rewarding the generation of new nodes that stay away from obstacles.

The new node q_new is affected by the repulsive reward component G_3(n), introduced between the obstacle and q_near(n), as it is generated. Summarizing the above analysis, the generation process of the new node q_new is visualized in Figure 5.
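Since the exact forms of the reward components are not reproduced above, the following sketch assumes an artificial-potential-style vector sum of the three components G_1, G_2, and G_3 with the basic expansion vector; the blend weights `g1`, `g2`, `g3` and the obstacle point `q_obs` are illustrative assumptions.

```python
import math

def unit(v):
    """Unit vector of a 2-D vector (zero vector maps to zero)."""
    n = math.hypot(*v)
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)

def steer_with_rewards(q_near, q_rand, q_goal, q_track, q_obs,
                       tau=2.0, g1=0.5, g2=0.5, g3=0.5):
    """Reward-biased node expansion.

    The basic expansion direction toward q_rand is blended with
    - attraction toward the goal point q_goal        (component G1),
    - attraction toward the desired trajectory point (component G2),
    - repulsion away from the nearest obstacle q_obs (component G3),
    then renormalised to the step size tau.
    """
    ex, ey = unit((q_rand[0] - q_near[0], q_rand[1] - q_near[1]))
    gx, gy = unit((q_goal[0] - q_near[0], q_goal[1] - q_near[1]))
    tx, ty = unit((q_track[0] - q_near[0], q_track[1] - q_near[1]))
    ox, oy = unit((q_near[0] - q_obs[0], q_near[1] - q_obs[1]))
    dx, dy = unit((ex + g1 * gx + g2 * tx + g3 * ox,
                   ey + g1 * gy + g2 * ty + g3 * oy))
    return (q_near[0] + tau * dx, q_near[1] + tau * dy)
```

When the random sample, the goal, and the trajectory point all lie in the same direction and the obstacle lies behind, the components reinforce each other and the new node advances a full step along that direction.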

Applications of Deep Reinforcement Learning
To counteract the effects of external perturbations and actuator failures on UUV operation, we train the network model using a UUV dynamics model that excludes external perturbations and actuator additive faults. In the actual test environment, we introduce feed-forward compensation for the estimated perturbation and additive fault terms before the UUV actuators execute the vehicle's action. We rewrite Equation (1) without the perturbation and additive fault terms as Equation (25), which is used for network-model training, in conjunction with the state space in Section 2.2. The state we select for training consists of g_s, the perturbations τ_dv and τ_dw, and e_s (Equation (26)), where g_s and e_s are consistent with the meanings expressed in Equation (9). Since the UUV does not have actuators in the v and w velocity directions, it cannot offset the effects of perturbations and faults in those directions through feed-forward compensation; we therefore incorporate the perturbations τ_dv and τ_dw into the state space during the training of the network model and train the UUV to output appropriate actions to resist them. The feed-forward compensation then acts directly on the output action of the UUV actuators by subtracting the RESO estimates of the lumped terms in the actuated channels (Equation (27)). The compensated force and torque output of the actuators ensures that the UUV can navigate safely as if free of disturbances and actuator faults.
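The feed-forward compensation step can be sketched as a subtraction of the RESO estimates from the policy output in the actuated (u, q, r) channels. The subtraction form is an assumption consistent with cancelling the estimated lumped term at the dynamics level, not the paper's exact Equation (27).

```python
import numpy as np

def compensate(a_policy, chi_hat):
    """Feed-forward compensation of the estimated lumped terms.

    a_policy : [tau_u, tau_q, tau_r] from the trained policy network
    chi_hat  : RESO estimates of the disturbance + additive fault
               in the actuated u, q, r channels
    Returns the compensated actuator command: the output is shifted so
    that the estimated lumped term cancels when the plant adds it back.
    """
    return np.asarray(a_policy, dtype=float) - np.asarray(chi_hat, dtype=float)
```

For example, a policy output of [50, 5, 5] with estimates [2, 1, −1] is commanded as [48, 4, 6], so the net effect at the dynamics level approximates the nominal, fault-free command.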
Based on the state space of the task environment established in Section 2.2, the architecture of the trajectory tracking task can be represented by Figure 6.
Given the powerful performance of the TD3 algorithm, we embed it in the proposed deep reinforcement learning-based task framework. In the TD3 algorithm, the state-action value function is mainly used to evaluate the current policy, which we define as

\[
Q^{\pi}(s_t, a_t) = \mathbb{E}\big[r_t + \gamma V(s_{t+1})\big],
\tag{28}
\]

where V(s_t) is the state value function and γ is the discount factor. In the state-action value network update, gradient descent is applied to an objective of the form

\[
J(\theta_k) = \mathbb{E}\big[\big(Q_{\theta_k}(s_t, a_t) - y_t\big)^2\big], \quad k = 1, 2,
\tag{29}
\]

where ε is random noise obeying a clipped normal distribution, and the TD3 algorithm introduces two target state-action value networks Q_θ̄k(s_t, a_t), k = 1, 2, choosing the smaller of the two computed values as the state-action value of the next state-action pair [30]. The update target can then be expressed as

\[
y_t = r_t + \gamma \min_{k=1,2} Q_{\bar{\theta}_k}\big(s_{t+1},\, \pi_{\bar{\omega}}(s_{t+1}) + \varepsilon\big),
\tag{30}
\]

where the expectation is taken over samples drawn from the replay buffer D used to store s_{i−1}, a_{i−1}, and s_i, as shown in Figure 6.
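The clipped double-Q target and target-policy smoothing of TD3 can be sketched as follows; the noise scale and clipping bound below are common TD3 defaults, not the paper's tuned values.

```python
import numpy as np

def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target used by TD3.

    Takes the smaller of the two target-critic values for the next
    state-action pair, which curbs overestimation bias.
    """
    q_min = min(q1_next, q2_next)
    return r + (0.0 if done else gamma * q_min)

def smoothed_target_action(pi_next, sigma=0.2, clip=0.5, rng=None):
    """Target-policy smoothing: add clipped Gaussian noise to the
    target action before it is fed to the target critics."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(pi_next)),
                  -clip, clip)
    return np.asarray(pi_next) + eps
```

Taking the minimum over the two critics is what distinguishes TD3 from DDPG here, and it is one reason for the smaller fluctuations observed for TD3 in the training curves of Section 4.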
Unlike the state-action value update described above, we update the parameters ω of the policy network π_ω(a_t|s_t) by gradient ascent so that the policy network converges to the vicinity of the maximum of the Q-values. Its update goal can be expressed as

\[
\nabla_{\omega} J_{\pi}(\omega) = \mathbb{E}\big[\nabla_{a} Q_{\theta_1}(s_t, a)\big|_{a=\pi_{\omega}(s_t)} \, \nabla_{\omega} \pi_{\omega}(s_t)\big].
\tag{31}
\]

In addition, to improve the generalization ability of the policy network, we include a reference value in the training (Equation (32)), where i is a random variable obeying a uniform distribution U(ε_dmin, ε_dmax).
At this point, the task strategies of trajectory tracking and local planning have been established. Next, incorporating the description of the event-triggering mechanism in Section 2.3, we can obtain the execution rules of these two task strategies in the continuous task space, which are represented by the following pseudo-code (Algorithm 1).
Remark 2. Please refer to [30] for a detailed description of the TD3 algorithm, which is not expanded upon in this paper. The relevant parameters of the algorithm used in the simulation are given in Section 4.1.

Simulation
In this section, the UUV's trajectory tracking and local planning capabilities and continuous task completion capabilities will be tested in the framework of the above tasks.

Trajectory Tracking Capability
The parameter settings for the deep reinforcement learning algorithms were kept consistent before the simulation began. As shown in Table 1, for the 3D trajectory tracking task, we set the capacity of the replay memory B to 100,000 and the batch size b_s to 128. The learning rate was set to 0.0003. Next, we selected the desired trajectory for model training as x_d = 8 sin((0.10 + 0.1x_m)t), where x_m, y_m, and z_m are random variables with the same meaning as i in Equation (32). At the same time, the perturbation parameters required for the state space during training were set as follows:

ϑ_j (j = u, v, w, q, r): 3.0, 1.5, 2.5, 1.5, 2.0
γ_j (j = u, v, w, q, r): 0.0
ω_j (j = u, v, w, q, r): 0.9, 0.5, 0.4, 0.5, 0.6
ξ_j (j = u, v, w, q, r): 0.0
κ_j (j = u, v, w, q, r): 1.0

For training, we only needed the perturbations in the v and w velocity directions as input states, with the parameter values updated each training episode.
Meanwhile, we set ϑ_v = 1.5, ϑ_w = 1.6, ω_v = 0.8, and ω_w = 0.5. In the actual test, we set the error state based on Equation (8), where ρ_d is 1.0 and β_d and α_d are both 0, and the desired trajectory was set to x_d = 10 sin(0.1t). In addition, the parameters related to the disturbance terms in the actual test were all set to constant values; please refer to the data in Table 1.
The training process is illustrated in Figure 7, where the average reward was calculated by evaluating the learned action strategy every 100 time steps, such that in each episode we evaluated the action strategy 10 times and finally calculated the average reward per episode. In Figure 7, the top and bottom of the shaded area indicate the maximum and minimum of the average reward over the 10 assessments, respectively. The end of the 100th episode was taken as the dividing line: before it, data were collected under a stochastic control strategy; after it, the action strategy was optimized iteratively by gradient descent until the optimal control strategy was obtained. The entire training process took 8.3 h on a computer with a CPU (Intel(R) Core(TM) i5-12400h @2.5 GHz, made in Portland, OR, United States) and a GPU (NVIDIA GeForce RTX 3070 Ti 8 GB, made in Santa Clara, CA, United States). Comparing the deep reinforcement learning algorithms, the reward of the TD3 algorithm converged faster and its training was more stable than that of the DDPG algorithm (the DDPG algorithm showed occasional larger fluctuations near the end of training).
The results of the test are shown in Figures 8-11. For comparison with the classical control methods mentioned in Section 1, we reproduced the trajectory tracking controller based on the LOS method in [3]. Additionally, within the deep reinforcement learning context, the DDPG algorithm was used as a benchmark in the same task environment to verify the convergence and stability performance of the TD3 algorithm in this task. Figures 8 and 9 show clearly that the TD3 algorithm approached the desired trajectory faster and demonstrated better convergence performance and less fluctuation than the classical and DDPG algorithms. The classical algorithm showed poor convergence performance and took a long time to track the target, while the DDPG algorithm, although equal to the TD3 algorithm in tracking speed, had poor stability. In addition, we show the compensated action output values in Figure 10. The DDPG algorithm produced large-amplitude action outputs, which would cause dysfunction in the UUV propulsion structure. The estimates of the designed reduced-order extended state observer are shown in Figure 11: the first graph shows the velocity estimates for the five-degrees-of-freedom UUV, and the subsequent graphs show the estimates of the combined perturbation and additive fault terms for the five degrees of freedom, where the red dashed lines indicate the actual values. The observer quickly converged to the target values after a short period of oscillation.

Localized Path Planning Capability
To test the local path planning ability of the UUV, we designed a multi-obstacle environment and compared the RRT algorithms before and after the improvement. We assumed that the simulation environment was a space of 100 m × 100 m, that the detection radius of the UUV was 10 m, and that the starting and ending coordinates were (0, 0) and (99, 99), respectively. The results are shown in Figure 12, where RRT* denotes the improved RRT algorithm obtained in Section 3.1. The convergence of the spanning tree toward the goal point was greatly improved after the introduction of the gravitational component, and path smoothness was guaranteed after the addition of the angle restriction. Figure 13 compares the above algorithms in terms of search speed and exploration degree. The exploration efficiency and the success rate of finding the shortest collision-free path of the proposed RRT* algorithm were greatly improved in the multi-obstacle environment.

Trajectory Tracking Combined with Localized Path Planning Capabilities
Based on the event-triggering mechanism in Section 2.3, we completed a joint test of the UUV trajectory tracking and local planning capabilities in a 3D environment, with the desired trajectories set as in Equation (36). For this desired trajectory, the tracking results of the three algorithms (TD3, DDPG, and the classical algorithm) are shown in Figures 14-16 and were essentially the same as the results in Section 4.1. However, the classical algorithm deviated at about the 210th time step and moved away from the desired trajectory. The TD3 algorithm outperformed the DDPG algorithm in tracking speed, tracking stability, and the magnitude of the output force and moment. To fully test the performance of the UUV in the joint task of trajectory tracking and local planning, we established a continuous-obstacle environment and a complex-obstacle environment on the basis of the above tracking trajectories; the experimental results are analyzed below. As shown in Figures 17 and 18, in the continuous-obstacle environment, the four RRT-based algorithms realized the switching of task states through the event-triggering mechanism and completed the planning of local collision-free paths. Due to the randomness of the RRT algorithm, when the reward component was not added, the local path was generated away from the desired direction at a distance larger than half of the sphere. After the addition of the reward component, however, the local path was generated closer to the desired trajectory and the target point. Meanwhile, the smoothness of the planned paths improved after angle optimization was implemented. Figures 19 and 20 show the joint experiment in a complex-obstacle environment, where the local planning task was activated after the UUV detected an obstacle (T_d = 1). The four algorithms under the RRT framework planned different collision-free paths before T_d returned to 0.
It is easy to see in the side view of Figure 19 that the paths planned by the basic RRT and RRT with angular constraints algorithms were relatively long and did not find narrow passages between obstacles.Conversely, after adding the reward component, as shown in the top view of Figure 20, the generation of the localized path moved towards the desired trajectory and successfully searched for the narrow passage.The effectiveness of the proposed algorithm was confirmed.

Conclusions
In this paper, a framework for the joint task of UUV underwater trajectory tracking and local path planning was proposed. In this framework, an event-triggering mechanism was utilized to switch task states, and a trajectory tracking controller under a deep reinforcement learning framework was combined with an RRT algorithm to realize local planning in multi-obstacle environments. We introduced two kinds of reward components into the RRT algorithm, which directed the spanning tree toward the target point and toward the desired trajectory, respectively, as well as angular constraints, which prevent the UUV from encountering steering difficulties in the real environment. To resist the effects of external perturbations, we proposed observing the perturbations with a reduced-order extended observer and incorporating reinforcement learning features for targeted compensation. Finally, simulation experiments on the UUV's trajectory tracking ability, local planning ability, and continuous task completion ability were conducted under our proposed framework, which was compared with the classical control algorithm and the DDPG algorithm described in Section 1. The results showed that our method achieved better convergence to the desired trajectories and better search performance during planning.
Although this paper demonstrates the feasibility and superiority of our methodology in the theoretical and simulation domains, and the treatment of actuator faults and external perturbations gives the methodology the potential to be applied in practical environments, engineering application remains the focus of our next work. In addition, the incorporated RRT method is not efficient for underwater exploration. In the future, we will use deep reinforcement learning as a framework to explore more efficient and safer methods to tackle real engineering problems.
The Lyapunov function G is positive definite outside V_0. In the region V_1, because δ > Γ_0 and 0 ≤ δ_1 ≤ εδ, if κ is sufficiently small, we can easily obtain that dG/dt < 0. In the same way, it can be proved that dG/dt < 0 in the region V_3. Under the conditions ε = β_1 − κ, Γ_0 < Γ, and |δ| ≤ Γ, in order to make the formula negative, inequality (A7) needs to be satisfied. Substituting ε = β_1 − κ into Equation (A7), we obtain Equation (A8); if κ is sufficiently small, Equation (A8) can be rewritten as Equation (A9). After analysis, two further inequalities can be obtained, and inequality (A9) holds when they are satisfied. According to the above analysis, as long as Γ satisfies Equation (A10) and κ is sufficiently small, Equation (A9) holds in the range |δ| ≤ Γ. Based on this detailed analysis, when Equation (A10) is satisfied, dG/dt < 0 holds in the region V_2. In the same way, it can be proved that dG/dt < 0 in the region V_4. This completes the proof.

Figure 2. Schematic of the trajectory tracking task environment.

Figure 3. Schematic of the local path planning task environment.

Figure 5. Constructing new nodes by increasing the reward component.

Figure 6. Component representation of the trajectory tracking task architecture. (A) Interaction between environment and network model. (B) The trajectory tracking task simulation environment, whose composition is consistent with that presented in Section 2.2. (C) The control policy and its target network, i.e., the mapping from the state space to the action space. (D) Calculation of Q-values using the state-action value networks and their target networks.

Algorithm 1 Pseudo-code of TD3 for UUV trajectory tracking with local path planning

1: Initialize the interactive environment
2: Randomly initialize state-action value networks Q_θ1 and Q_θ2 and policy network π_ω with parameter vectors θ1, θ2, and ω
3: Initialize target state-action value networks Q_θ̄1 and Q_θ̄2 with parameter vectors θ̄1 ← θ1 and θ̄2 ← θ2
4: Initialize replay buffer D to capacity B and empty it
5: Initialize global shared step counter C ← 0 and the maximum time steps of each episode T
6: Set the maximum number of iterations C_max and the other necessary parameters (refer to Section 4.1)
7: for C ∈ {0, 1, 2, ..., C_max} do
8:   Reset the environment, E_d ← 0, s_t ← s_0, and so on
…:   Perform the action with exploration noise a ← π_ω(s) + χ, χ ∼ N(0, σ)
…:   Store the sample in the replay buffer D ← D ∪ {(s_t, a, s_{t+1}, r)}
15:  Sample b_s samples (s_t, a, s_{t+1}, r) from D
…:   Update ω ← ω − λ_ω ∇_ω J_π(ω)
…:   Execute the action a output by the trained policy network π_ω, tracking the desired trajectory
28:  q_near ← Nearest(tree, q_rand)
32:  (q_new, U_new) ← Steer(q_near, q_rand)    ▷ refer to Section 3.1
33:  Judge the relationship between ᾱ(n) and ᾱ_max, and between β̄(n) and β̄_max
34:  Update q_new by Equation (20)
35:  if ObstacleFree(q_new) then R ← q_new
36:  end if
37:  end if
40:  until the local path is completed (T_d = 0)
41:  Continue to complete the trajectory tracking task

Table 1 .
The values of the shared parameters.