A 2D Optimal Path Planning Algorithm for Autonomous Underwater Vehicle Driving in Unknown Underwater Canyons

Abstract: This research aims to solve the safe navigation problem of autonomous underwater vehicles (AUVs) in the deep ocean, a complex and changeable environment with varied mountainous terrain. When an AUV navigates in the deep sea, it encounters many underwater canyons whose hard valley walls seriously threaten its safety. To enable safe driving in underwater canyons and autonomous obstacle avoidance in uncertain environments, this work proposes an improved AUV path planning algorithm based on the deep deterministic policy gradient (DDPG) algorithm. The method is an end-to-end path planning algorithm that optimizes the policy directly: it takes sensor information as input and outputs driving speed and yaw angle. The planner can reach the predetermined target point while avoiding large-scale static obstacles, such as valley walls in the simulated underwater canyon environment, as well as sudden small-scale dynamic obstacles, such as marine life and other vehicles. In addition, to address the multi-objective structure of obstacle avoidance in path planning, this research designs a modular reward function and combines it with the artificial potential field method to set continuous rewards. This research also proposes a new algorithm, the SumTree deep deterministic policy gradient algorithm (SumTree-DDPG), which improves the random storage and extraction strategy that DDPG applies to experience samples. Samples are classified by importance and stored in a SumTree structure so that high-quality samples are extracted continuously, which speeds up model convergence. Finally, a simulated underwater canyon environment is written in Python, and a deep reinforcement learning simulation platform is built on a high-performance computer to train the AUV in simulation. The simulations verify that the proposed path planning method can guide the under-actuated underwater robot to the target without colliding with any obstacle. Compared with the DDPG algorithm, the improved SumTree-DDPG planner achieves better stability, total training reward, and robustness.


Introduction
In recent years, autonomous underwater vehicles (AUVs) have attracted wide attention because they have revolutionized oceanic research, with applications in numerous scientific fields such as marine geoscience, submarine oil exploration, submarine salvage, submarine pipeline repair, and archeology [1][2][3]. Among all AUV functions, autonomous obstacle avoidance is the most important, because obstacles in the underwater environment are usually unknown to the AUV; the vehicle can therefore easily run into them, which may cause a malfunction or even damage the robot [4].
Several autonomous navigation methods for obstacle avoidance have been reported in the literature. Lozano et al. [5,6] proposed the visibility graph algorithm. In this algorithm, the AUV is regarded as a point, and obstacles are considered planar polygons. The starting point, the goal point, and the vertices of each polygonal obstacle are connected, and all connecting lines that are not blocked by obstacles are considered collision-free paths. Finally, a visibility graph is formed, and a search algorithm is used to find the optimal path. The principle of this method is simple and easy to realize. Takahashi et al. [7] proposed the Voronoi diagram method in 1983. With this method, a certain distance between the planned path and obstacles can be maintained, so factors such as safety can be considered. The planning time varies with the density of obstacles. Although the shortest path can be determined by this method, it lacks flexibility. In Takahashi et al. [8,9], an exact cell decomposition algorithm divides free space into non-overlapping grid cells. A series of parallel lines is drawn through each obstacle vertex, and each line stops where it meets an obstacle edge or the boundary of the planning environment. The space is thereby decomposed into a series of trapezoidal areas to realize obstacle avoidance. In literature [10,11], quad-tree and octree decomposition methods were used to establish the planar sea area obstacle model and the submarine terrain model, respectively. In addition, the current velocity and direction could be used as grid attribute information to establish the current model. A* and D* are widely used path search algorithms. The A* algorithm selects the optimal path node by calculating the evaluation function of all candidate nodes with respect to the target point, which makes it suitable for static path planning [12]. In the bionic fish path planning problem, Qiang et al. [13] adopted the deployable point method to reduce the number of search nodes and improve the search efficiency; however, environmental factors were not considered. The D* algorithm [14] is a dynamic A* algorithm that is suitable for dynamic path planning problems because it detects changes in the previous or nearby nodes of the shortest path. The artificial potential field method is a virtual force method proposed by Khatib et al. [15]. This method is widely used in the path planning field, and its concept is to construct virtual potential fields for the path planning of the AUV [16]. Warren [17] used the artificial potential field method to carry out path planning for underwater robots and realized the global path planning of AUVs in two-dimensional (2D) and three-dimensional (3D) spaces by reducing the local minima through heuristic knowledge. Chao [18] adopted optimization theory, combined the artificial potential field with obstacle constraints, and transformed the path planning problem into solving constrained and semi-constrained problems. Cheng et al. [19] used a velocity vector synthesis algorithm so that the resultant of the ocean current velocity and the AUV velocity points to the target, thereby minimizing resource consumption. Ferrari et al. [20] addressed the collaborative planning of multiple AUVs that must avoid a network of multiple detection platforms: each detection platform was considered a virtual obstacle, and the planning result could determine the minimum exposure probability and the collision-free path by modifying the fitness function.
With the progress of computer technology, artificial intelligence has received extensive attention in various fields. Artificial intelligence-based path planning aims to transform the behavior and thinking of natural animals into algorithms for the path planning of mobile devices. Artificial intelligence algorithms, such as the particle swarm optimization algorithm, ant colony algorithm, evolutionary computing [21], genetic algorithm, and self-organizing neural network [22], have emerged and are widely applied. Xu et al. [23] used a hybrid genetic algorithm and particle swarm optimization (GA-PSO) planner to realize AUV global path planning under ocean current conditions. Wang et al. [24] designed cutting and handicap operators to solve the ant colony path planning problem and realized AUV global path planning in a 2D grid environment model. In paper [25], particle swarm optimization (PSO) was used to solve the path planning problem in a dynamic environment, and the speed and heading of the robot were introduced into the objective function; the results verify that PSO has good real-time performance in solving path planning problems. Xin et al. [26] improved ant colony path planning, designed cutting and handicap operators, and realized AUV global path planning in a 2D grid environment model. However, these methods require knowledge of the global environment and cannot learn or explore paths in an unknown environment.
AUV is one of the most important means to explore the deep sea world. The deep ocean is a complex and changeable environment which is distributed with various mountains. When the underwater autonomous vehicle reaches the deep sea, it will face many large and small underwater canyons, and hard valley walls and other serious threats to the safety of the underwater autonomous vehicle [27]. So, path planning and obstacle avoidance are important components of autonomous navigation for AUV. The goal is to find a collision-free path from the start to the end in a complex underwater environment. Algorithm design is the core of path planning. The learning algorithms of artificial intelligence are regarded by the majority of researchers as the future of artificial intelligence. Neural network is an important content of machine learning. In the recent path planning of underwater robots, a large number of scholars used sensor data as network input and behavior and actions as network outputs; moreover, network models were obtained through training [28,29]. In paper [30], a 2D environment traversal path planning method based on biologically inspired neural networks was proposed. A recurrent neural network with convolution was developed [31] to improve the autonomous ability and intelligence of obstacle avoidance planning. Zhu et al. [32] focused on the study of sudden obstacles and used environmental changes to cause variations in neuron excitation and activity output values, thereby outputting collision-free path points. Reinforcement learning (RL) is an artificial intelligence algorithm that does not require prior knowledge and directly performs trial-and-error iterations with the environment to obtain feedback information to optimize strategies and is therefore widely used in mobile robot path planning in complex environments [33,34]. In paper [35], an adaptive neural network obstacle control method of AUVs with control input nonlinearities using RL was considered. 
In addition to improving accuracy, RL can learn control strategies from data, which avoids cumbersome manual parameter tuning [36,37]. In 1989, Watkins [38] proposed a typical model-free RL algorithm called Q-learning, which is one of the most widely used algorithms in RL solutions [39]. Because Q-learning [40] can guarantee convergence without knowing the model and can obtain good path planning results when the state space is small, some scholars [41,42] have applied it to robot path planning. However, underwater robot path planning and obstacle avoidance involve high-dimensional, large state spaces, for which solving the optimal policy with Q-learning is difficult. Mnih et al. [43] proposed a deep reinforcement learning (DRL) algorithm based on the Deep Q Network (DQN). This algorithm reached human-level performance in many Atari games; however, it cannot be applied directly to control problems with high-dimensional continuous action spaces. Cheng et al. [44] proposed a DRL-based obstacle avoidance planning algorithm for underwater robots, in which two convolutional layers extract features from the input state. Its reward function has four terms (R_distance, R_collisions, R_end, and R_drift) covering the distance to the target point, the distance to obstacles, the speed near the endpoint, and drift. However, no obvious advantage over traditional path planning algorithms was noted. Moreover, most researchers considered only static obstacles, and some other works [45,46] presented applications in which depth exploration in semi-static conditions could be improved; real-time avoidance of dynamic obstacles has seldom been studied.
Lillicrap et al. [47] proposed the deep deterministic policy gradient (DDPG) algorithm based on DQN and the deterministic policy gradient (DPG). This algorithm shows strong robustness and stability and performs well on high-dimensional continuous action space control tasks. It has solved more than 20 complex control tasks, but it has not been applied to AUV path planning and obstacle avoidance. Therefore, this work studies path planning and obstacle avoidance for AUVs in unknown underwater canyons based on the DDPG algorithm.
The remainder of this paper is organized as follows: In Section 2, four mathematical models required for AUV navigation in unknown underwater canyons are established. In Section 3, the path planning and obstacle avoidance algorithm are designed. In Section 4, the path planning and obstacle avoidance in unknown canyon simulation tests are discussed. In Section 5, the study is concluded.

Preliminaries

AUV-Kinematic Model with 3 Degrees of Freedom
A differential AUV model is constructed. In Figure 1, a 2-dimensional (2D) AUV model is shown in the geodetic-fixed frame (ξ − E − η). The length of the AUV model is 1.46 m, its mass is 45 kg, and its center of gravity in the body-fixed frame is at (0, 0). The AUV has seven range-finding sonars ($S_i$, $i = 1, 2, \ldots, 7$, in Figure 1), which obtain obstacle information around the AUV. The red dotted lines in Figure 1 indicate the sonar detection beams and show their arrangement. The sampling frequency of the range-finding sonars is 2 Hz, and their detection distance is 150 m. The stern of the robot has three propellers, $P_i$ ($i = 1, 2, 3$), located 0.2 m from the y-axis in the body-fixed frame. Each propeller can generate 15 kg of force. The AUV uses inertial navigation to measure velocity, position, and attitude.
In this paper, the path planning and obstacle avoidance research is based on the kinematic model of the AUV, because a route planned with the real movement process of the AUV taken into account is continuous and smooth. An AUV usually moves in 3-dimensional (3D) space with 6 degrees of freedom (DOFs), which results in coupled dynamics between its planar and diving motions. To facilitate control design, the model is usually decoupled, whereas the designed control is validated using the coupled nonlinear dynamics [44]. We consider the horizontal motion of the AUV with 3 DOFs (Figure 1), described by the motion components surge, sway, and yaw. On this basis, $\nu = [u, v, r]^T \in \mathbb{R}^3$ denotes the velocity vector, whereas $\eta = [x, y, \psi]^T \in \mathbb{R}^3$ denotes the position vector. Here $(x, y)$ is the position of the AUV and $\psi$ is its yaw angle in the earth-fixed inertial frame, and the body-fixed velocities $u$, $v$, and $r$ correspond to surge, sway, and yaw, respectively. The horizontal maneuvering model [48] of the AUV can be expressed as:
$$\dot{\eta} = R(\psi)\,\nu \tag{1}$$
where $R(\psi)$ is the rotation matrix for the horizontal motion of the AUV with three DOFs, which can be expressed as:
$$R(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$
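The 3-DOF kinematic model described above can be illustrated numerically. The following is a minimal sketch, assuming the standard horizontal-plane form in which the earth-fixed rates are the body-fixed velocities rotated by the yaw angle. The function names, the explicit Euler integrator, and the yaw wrapping are assumptions made for illustration; the 0.5 s default step simply matches the 2 Hz sonar sampling frequency stated in the text.

```python
import numpy as np

def rotation_matrix(psi: float) -> np.ndarray:
    """Rotation matrix R(psi) for horizontal 3-DOF motion (surge, sway, yaw)."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def kinematic_step(eta: np.ndarray, nu: np.ndarray, dt: float = 0.5) -> np.ndarray:
    """One explicit-Euler step of eta_dot = R(psi) @ nu.

    eta = [x, y, psi] in the earth-fixed frame, nu = [u, v, r] in the body frame.
    The Euler integrator is an assumption, not the paper's stated scheme.
    """
    eta_next = eta + dt * rotation_matrix(eta[2]) @ nu
    # Wrap the yaw angle to [-pi, pi).
    eta_next[2] = (eta_next[2] + np.pi) % (2.0 * np.pi) - np.pi
    return eta_next

# Hypothetical example: AUV at the origin heading along +y, surging at 1 m/s.
eta0 = np.array([0.0, 0.0, np.pi / 2])
nu0 = np.array([1.0, 0.0, 0.0])
eta1 = kinematic_step(eta0, nu0)
```

With a pure surge command, the vehicle advances along its current heading, so the example moves the AUV 0.5 m in the +y direction while the yaw angle stays unchanged.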

Obstacle Avoidance Strategy
This research aims to solve the safe navigation problem of AUVs in the deep ocean, a complex and changeable environment with varied mountainous terrain. When an AUV navigates in the deep sea, it faces many underwater canyons, whose hard valley walls seriously threaten its safety. In addition, other submersibles navigating in the deep sea and moving marine life also threaten the safety of AUVs. The valley wall of an underwater canyon is a large-scale continuous obstacle relative to the AUV, whereas other submersibles and moving marine organisms are dynamic obstacles of similar size to the AUV. Different obstacle avoidance strategies are therefore needed. The following subsections present the strategies proposed in this study for these two types of obstacles.

Large-Scale Continuous Obstacle Avoidance Strategy of AUV
When an AUV goes to the deep sea, it faces many underwater canyons, whose hard valley walls seriously threaten its safety. In this study, a large area of underwater canyon wall of which the AUV sonars can detect only a very small part (less than 20 percent of its overall size) is regarded as a large-scale continuous obstacle, and the distance of the sensors from the AUV center is neglected. The large-scale continuous obstacle avoidance strategy of the AUV is presented as follows.
First, the AUV is assumed to be able to measure, using the seven sonars installed on the left and right sides ahead along the X axis, the distance $D_i$ ($i = 1, 2, \ldots, 7$) to the obstacle along each sonar direction, which makes the angle $\varsigma_i$ ($i = 1, 2, \ldots, 7$) with the X axis, and the azimuth angle $\psi_w$ of the underwater canyon wall in the geodetic-fixed frame (ξ − E − η).
Second, the AUV obstacle avoidance model in the face of a large-scale continuous obstacle is detailed as follows.
When the AUV drives into an unknown underwater canyon environment, it maintains its current dive depth to ensure its own safety and avoids the canyon rock wall in the horizontal plane. In this situation, the AUV kinematic model conforms to Equation (1) (the AUV 3-DOF kinematic model), and Equation (1) is rewritten at time t to obtain Equation (4):
$$\dot{\eta}(t) = R(\psi(t))\,V(t), \quad \eta(t) = [x(t), y(t), \psi(t)]^T, \quad V(t) = [u(t), v(t), r(t)]^T \tag{4}$$
where $\eta(t)$ is the horizontal position vector of the AUV at time t, including the horizontal position coordinates $x(t)$, $y(t)$ and the yaw angle $\psi(t)$; $\dot{\eta}(t)$ is the derivative of $\eta(t)$ with respect to time t; $\dot{\psi}(t)$ is the derivative of $\psi(t)$ with respect to time t; $V(t)$ is the horizontal velocity vector of the AUV in the body-fixed frame at time t, including the longitudinal velocity $u(t)$ along the X axis, the lateral velocity $v(t)$ along the Y axis, and the yaw angular velocity $r(t)$; and $R(\psi(t))$ is the rotation matrix for the horizontal motion of the AUV with three DOFs at time t, which can be expressed as:
$$R(\psi(t)) = \begin{bmatrix} \cos(\psi(t)) & -\sin(\psi(t)) & 0 \\ \sin(\psi(t)) & \cos(\psi(t)) & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{5}$$
Figure 2 shows the geometric relationship of an AUV facing a large-scale continuous obstacle, such as a wall or canyon. In Figure 2, the bold line represents the continuous obstacle surface and the triangle symbol represents the AUV position; $T_i$ ($i = 1, 2, \ldots, 7$) is the intersection of the beam of sonar $i$ with the continuous obstacle surface; $D_i$ ($i = 1, 2, \ldots, 7$) is the distance from the AUV to the point $T_i$; $\psi$ is the AUV yaw angle; and $\varsigma_i$ is the angle between the pointing direction of sonar $i$ and the AUV body axis X. From the geometric relationship in Figure 2, Equation (6) can be deduced, where $V_{T_i}$ ($i = 1, 2, \ldots, 7$) is the moving speed of the intersection point $T_i$ detected by the sensor along the continuous obstacle surface, $V$ is the horizontal speed of the AUV, and $\psi_w$ is the azimuth angle of the underwater canyon wall in the geodetic-fixed frame (ξ − E − η). Because the angle $\varsigma_i$ between the sonar pointing direction and the body axis X is an inherent property of the AUV, determined by the installation position and installation angle of the sonar, it is constant:
$$\dot{\varsigma}_i = 0, \quad i = 1, 2, \ldots, 7 \tag{7}$$
Combining Equations (6) and (7) gives the differential equation of the distance $D_i$ ($i = 1, 2, \ldots, 7$) of the obstacle detected by each sonar. It can be seen from Equation (9) that $D_i$ is related to the AUV's yaw angle $\psi$ and horizontal speed $V$.
If the minimum safety distance of the AUV is $\rho_s$ and the sonar detection distances are $D_i$ ($i = 1, 2, \ldots, 7$), then when the AUV faces a large-scale continuous obstacle, such as a canyon wall, the condition for the AUV not to collide is:
$$\min_i (D_i) \ge \rho_s, \quad i = 1, 2, \ldots, 7 \tag{10}$$
where $\rho_s$ is a positive constant. Combining Equations (9) and (10), the AUV can ensure the obstacle avoidance condition $\min(D_i) \ge \rho_s$ by adjusting its yaw angle $\psi$ and horizontal speed $V$, so that it can drive safely and autonomously in an unknown canyon environment.
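The collision-free condition can be checked numerically once sonar ranges are available. The sketch below is an illustration under stated assumptions rather than the paper's simulator: the canyon wall is idealized as an infinite straight line with azimuth psi_w, the seven beams are assumed to be spread evenly from -90° to +90° about the body X axis (the actual arrangement is given in Figure 1), and the names `sonar_ranges` and `is_safe` are hypothetical. Only the 150 m detection distance and the condition min(D_i) ≥ ρ_s come from the text.

```python
import numpy as np

MAX_RANGE = 150.0  # sonar detection distance from the text (m)

def sonar_ranges(pos, psi, beam_angles, wall_point, psi_w):
    """Range D_i of each sonar beam to a straight wall (idealized geometry).

    pos: AUV position (x, y); psi: AUV yaw; beam_angles: mounting angles
    relative to the body X axis; wall_point: any point on the wall;
    psi_w: wall azimuth. Beams that miss the wall return MAX_RANGE.
    """
    normal = np.array([-np.sin(psi_w), np.cos(psi_w)])  # unit normal of the wall
    offset = normal @ (np.asarray(wall_point, float) - np.asarray(pos, float))
    ranges = []
    for sigma in beam_angles:
        theta = psi + sigma
        direction = np.array([np.cos(theta), np.sin(theta)])
        denom = normal @ direction
        # A beam (nearly) parallel to the wall never intersects it.
        dist = offset / denom if abs(denom) > 1e-9 else np.inf
        ranges.append(dist if 0.0 <= dist <= MAX_RANGE else MAX_RANGE)
    return np.array(ranges)

def is_safe(ranges, rho_s):
    """Obstacle avoidance condition: min(D_i) >= rho_s."""
    return bool(np.min(ranges) >= rho_s)

# Hypothetical example: wall along the line x = 50 m, AUV at the origin heading +x.
beams = np.linspace(-np.pi / 2, np.pi / 2, 7)  # assumed beam layout
d = sonar_ranges((0.0, 0.0), 0.0, beams, (50.0, 0.0), np.pi / 2)
```

In this example the forward beam measures 50 m and the ±60° beams measure 100 m, so the AUV is safe for a 10 m safety distance but would violate a 60 m one.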

Multi-Dynamic Obstacle Avoidance Strategy of AUV
During deep-sea navigation, AUVs encounter not only static obstacles but also dynamic obstacles, such as other underwater vehicles, marine life, and floating objects. In this paper, we consider only small-scale dynamic obstacles whose shape and size can be completely detected by the sonars of the AUV. This research draws on the artificial potential field method to design a multi-dynamic obstacle avoidance strategy. When navigating underwater, the AUV must reach the designated target area; this is its target behavior. This research establishes the target behavior of the AUV as the potential field function [15] of Equation (11), where $k_1$ is a negative constant, $(x_{goal}, y_{goal})$ is the coordinate of the center of the target area in the Cartesian coordinate system, and $(x_t, y_t)$ is the coordinate of the center of the AUV in the Cartesian coordinate system at time t. Unlike the artificial potential field methods of other researchers, which establish a repulsion potential field around each obstacle, this study establishes an anticipatory repulsion potential field around the AUV itself. When a dynamic obstacle enters the scope of the AUV's repulsion field, the smaller the distance between them, the greater the repulsion force; conversely, the larger the distance, the smaller the repulsion force. When the AUV changes course and speed so that the obstacle leaves the scope of its repulsion potential field, the repulsion force on the AUV is 0.
In this study, the behavior of the AUV avoiding dynamic obstacles is established as the repulsion potential field function of the AUV, as shown in Equation (12), where $k_2$ is a negative constant; $(x_t, y_t)$ is the position coordinate of the AUV in the Cartesian coordinate system at time t, and the position coordinate of the dynamic obstacle at time t is expressed in the same coordinate system; and $L_1$ and $L_2$ are the lengths of the major axis and the minor axis, respectively, of the ellipse obtained by expanding the AUV. The shape of the AUV is known, and most AUVs are approximately ellipsoidal, whereas the shape of dynamic obstacles is uncertain. Therefore, in this study the AUV is expanded to provide a safety margin, as shown in Equation (13):
$$L_1 = \alpha L, \quad L_2 = \alpha B \tag{13}$$
where $\alpha$ is a constant greater than 1, and $L$ and $B$ are the maximum length and width of the underwater vehicle, respectively. In this study, $\alpha = 2.25$. As shown in Figure 3, the blue ellipse represents the 2D safety boundary, in which the length of the major axis is 2.25L and the length of the minor axis is 2.25B. The position of a dynamic obstacle and its distance from the AUV are detected by the sonars of the AUV. When the sonars detect a dynamic obstacle that has not entered the scope of the AUV's repulsion field, the AUV is safe. Some processing, such as the kinematic model for predicting and estimating dynamic obstacles, is not the focus of this research and is not explained. If a dynamic obstacle enters the scope of the repulsion field, the AUV avoids it by continuously changing its heading angle and navigation speed. If the dynamic obstacle remains within the scope of the repulsion potential field and the distance between the obstacle and the AUV becomes less than the safe distance, the AUV collides with the obstacle.
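The expanded safety boundary can be made concrete with a point-in-ellipse test. This is a minimal sketch under stated assumptions: the ellipse is centered on the AUV and aligned with its body axes, its major and minor axis lengths are 2.25L and 2.25B as stated in the text, and the obstacle is treated as a point. The vehicle width of 0.3 m and the function name `inside_safety_ellipse` are hypothetical; only L = 1.46 m and α = 2.25 come from the text.

```python
import math

ALPHA = 2.25  # expansion factor from the text

def inside_safety_ellipse(auv_xy, psi, obs_xy, length, width, alpha=ALPHA):
    """Return True if a point obstacle lies inside the expanded AUV ellipse.

    The major axis has length alpha * length and the minor axis has length
    alpha * width, so the semi-axes are half of those values.
    """
    a = alpha * length / 2.0  # semi-major axis, along the body X axis (assumed)
    b = alpha * width / 2.0   # semi-minor axis, along the body Y axis (assumed)
    dx = obs_xy[0] - auv_xy[0]
    dy = obs_xy[1] - auv_xy[1]
    # Rotate the relative position from the earth-fixed frame into the body frame.
    xb = math.cos(psi) * dx + math.sin(psi) * dy
    yb = -math.sin(psi) * dx + math.cos(psi) * dy
    return (xb / a) ** 2 + (yb / b) ** 2 <= 1.0

# Hypothetical example: L = 1.46 m (from the text), B = 0.3 m (assumed).
ahead = inside_safety_ellipse((0.0, 0.0), 0.0, (1.0, 0.0), 1.46, 0.3)
abeam = inside_safety_ellipse((0.0, 0.0), 0.0, (0.0, 1.0), 1.46, 0.3)
```

Because the ellipse is elongated along the heading, an obstacle 1 m ahead is inside the boundary while an obstacle 1 m abeam is outside it; rotating the AUV by 90° swaps the two results.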

MDP Model of AUV Path Planning
This section describes how to combine the contents of the first three sections of this chapter to generate a Markov Decision Process (MDP) model of AUV that can be trained in deep RL.
MDP is a classic formalization of sequential decision-making and is usually used to model RL problems [49]. An MDP is a four-tuple MDP = (S, A, P, R), where S represents the state set, A represents the action set, P : S × A × S → [0, 1] represents the state transition probability, and R is the reward function. The AUV interacts with the environment as shown in Figure 4. After receiving the state $s_t$ at time t, the AUV outputs the action $a_t \in A$. The reward generated at time t is $R_t = f(s_t, a_t, s_{t+1})$, and the state becomes $s_{t+1}$. The action $a_t$ output by the agent is determined by the policy π, which maps the state $s_t$ to a probability distribution over actions, π : S → P(A).

[Figure 4. Markov decision process; diagram labels: State, Action, Reward, Policy.]
The MDP model of AUV path planning in this paper follows two assumptions: (1) the path planning task is horizontal with three degrees of freedom; (2) time is discretized: to meet the real-time requirements, the planning method outputs at regular intervals with a sampling period of $T_S = 0.5$ s.
As introduced in Section 2.1, the AUV used in this research is equipped with 7 obstacle avoidance sonar sensors that detect the distance to obstacles or targets in real time. The sonars are the devices through which the AUV interacts directly with the environment. Therefore, in this study, the distances $D_i(t)$ ($i = 1, 2, \ldots, 7$) to the obstacle or target measured by the 7 obstacle avoidance sonars at time t are set as the state space S of the MDP model of AUV path planning, and the limited detection capability of the sonar sensors is used as the state constraint. Hence, the detection distance range is [0, 150 m], the state constraint is set to [0, 150 m], and the final state set expression at time t is shown in Equation (14):
$$s_t = [D_1(t), D_2(t), D_3(t), D_4(t), D_5(t), D_6(t), D_7(t)] \tag{14}$$
Table 1. State sets.
State | Range of Values (m)
D_i(t), i = 1, 2, ..., 7 | [0, 150]
According to the 3-DOF kinematic model of the AUV in Section 2.1, the action space A of the MDP model in this study is defined by the yaw angular velocity ω(t) and the horizontal velocity V(t), and the action space constraints correspond to the limits of the AUV's maneuverability. The AUV used in this research has three thrusters, one tail thruster and two side thrusters, which together realize turning and forward and backward motion. Therefore, the heading angle range is [−180°, +180°]; considering the limits of maneuverability, the yaw rate range is [−1.0 rad/s, 1.5 rad/s] and the sailing speed range is [−1.0 m/s, 1.5 m/s]. $(x_t, y_t)$ is the position coordinate of the AUV at time t in the geodetic-fixed frame (ξ − E − η). The action set $a_t$ at time t is given by Equation (15), and the action set A of the MDP model for AUV path planning is listed in Table 2, where $[x_{min}, x_{max}]$ and $[y_{min}, y_{max}]$ are the horizontal and vertical limits of the AUV driving range in the geodetic coordinate system, respectively.
Table 2. Action sets (columns: Action, Type, Range of Values).
In this study, the large-scale continuous static obstacle avoidance strategy proposed in Section 2.2.1 and the multiple dynamic obstacle avoidance strategy proposed in Section 2.2.2 are integrated into the reward settings of the MDP model of deep RL, which are introduced in Section 3.3. Finally, P : S × A × S → [0, 1] of the MDP model represents the state transition probability, which is updated through the DRL algorithm.
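The state and action constraints of the MDP model can be enforced with simple clipping. The sketch below is illustrative only: it assumes the state is the vector of seven sonar distances limited to [0, 150] m and that an action is a (yaw rate, speed) pair limited to the ranges quoted in the text. The pair layout and the names `make_state` and `clip_action` are assumptions, since the exact form of the action set is not reproduced here.

```python
import numpy as np

SONAR_MAX = 150.0             # sonar detection range limit from the text (m)
YAW_RATE_RANGE = (-1.0, 1.5)  # yaw rate limits quoted in the text
SPEED_RANGE = (-1.0, 1.5)     # sailing speed limits quoted in the text (m/s)

def make_state(sonar_readings):
    """Build the 7-dimensional state from raw sonar distances, clipped to [0, 150] m."""
    s = np.asarray(sonar_readings, dtype=float)
    if s.shape != (7,):
        raise ValueError("expected 7 sonar readings")
    return np.clip(s, 0.0, SONAR_MAX)

def clip_action(yaw_rate, speed):
    """Clip a (yaw rate, speed) action to the maneuverability limits (assumed layout)."""
    return (float(np.clip(yaw_rate, *YAW_RATE_RANGE)),
            float(np.clip(speed, *SPEED_RANGE)))

# Hypothetical raw readings: one beyond range, one negative (sensor noise).
state = make_state([200.0, -5.0, 10.0, 20.0, 30.0, 40.0, 150.0])
action = clip_action(2.0, -3.0)
```

Clipping keeps every value the network sees or emits inside the constraint sets, which is one common way to realize such constraints; squashing the actor output with a bounded activation is another.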

SumTree-DDPG Algorithm
This study uses an RL method based on DDPG. Unlike traditional value-based RL, this method searches for the policy directly and can therefore be applied to continuous, high-dimensional action spaces. DDPG is an actor-critic algorithm. This section introduces four aspects, the critic, the actor, the reward function, and the replay memory, and proposes an improved DDPG algorithm (SumTree-DDPG) for AUV path planning and obstacle avoidance.
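The SumTree structure behind the improved replay memory can be sketched as a binary tree whose leaves hold sample priorities and whose internal nodes hold the sum of their children, so that sampling in proportion to priority takes logarithmic time. The following is a generic, minimal sketch of that data structure, not the paper's implementation: how importance is turned into a priority (for example, from the temporal-difference error) is an assumption left to the caller, and the class and method names are hypothetical.

```python
import random

class SumTree:
    """Binary sum tree: leaves store priorities, parents store sums of children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity
        self.write = 0                           # next leaf slot to overwrite

    def total(self):
        """Sum of all stored priorities (value at the root)."""
        return self.tree[0]

    def add(self, priority, sample):
        """Store a sample with the given priority, overwriting the oldest slot."""
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        """Set a leaf priority and propagate the change up to the root."""
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Return (leaf index, priority, sample) for a cumulative value in [0, total]."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # descend until a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self):
        """Draw one sample with probability proportional to its priority."""
        return self.get(random.uniform(0.0, self.total()))

tree = SumTree(4)
for p, name in [(1.0, "a"), (2.0, "b"), (3.0, "c"), (4.0, "d")]:
    tree.add(p, name)
```

With priorities 1, 2, 3, and 4, the cumulative intervals are (0, 1], (1, 3], (3, 6], and (6, 10], so sample "d" is drawn four times as often as sample "a"; uniform sampling, as in plain DDPG, corresponds to giving every sample the same priority.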

Critic
The critic fits the state-action value function. It comprises a target Q network and an online Q network, and the two networks are updated alternately. The initial parameters of the two networks, $\theta^{Q}$ and $\theta^{Q'}$, are equal. After a minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ is sampled at random from the experience buffer pool, the online Q network is updated by minimizing the loss $L$, which is calculated as shown in Equation (16).

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q\left(s_i, a_i \mid \theta^{Q}\right)\right)^{2} \tag{16}$$

In Equation (16), $y_i$ is the target Q value, as shown in Equation (17).

$$y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right) \tag{17}$$

Unlike the online Q network, which is updated in real time, the target Q network is updated at intervals, and its update rule is shown in Equation (18).

$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\,\theta^{Q'} \tag{18}$$

where τ is a preset constant and γ in Equation (17) is the discount factor.
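The critic update described above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the networks are replaced by plain arrays of Q values and parameters, and the values of `gamma` and `tau` are assumed.

```python
import numpy as np


def td_targets(rewards, next_q_target, gamma=0.99):
    """Target Q values: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return np.asarray(rewards, dtype=float) + gamma * np.asarray(next_q_target, dtype=float)


def critic_loss(targets, q_online):
    """Mean squared error between the targets y_i and the online Q values."""
    diff = np.asarray(targets, dtype=float) - np.asarray(q_online, dtype=float)
    return float(np.mean(diff ** 2))


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

A small `tau` makes the target network change slowly, which keeps the regression target in the loss nearly stationary between updates.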

Actor
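In the DDPG algorithm, a policy network with parameters $\theta^{\mu}$ represents the deterministic policy $a = \mu(s \mid \theta^{\mu})$. The actor fits the policy function; its main task is to output the deterministic action $a_t$ for the input state $s_t$. The online policy network parameters are updated along the policy gradient shown in Equation (19).

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\Big|_{a=\mu(s)}\, \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\right] \tag{19}$$

In Equation (19), the state $s$ follows the distribution $\rho^{\beta}$, and $\theta^{\mu}$ is the online policy network parameter. The target policy network is updated in the same way as the target Q network, at intervals, as shown in Equation (20):

$$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\,\theta^{\mu'} \tag{20}$$

In Equation (20), the parameter τ is a preset constant.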
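To make the chain rule in the actor update concrete, the following toy sketch applies the deterministic policy gradient to a one-parameter linear policy. Everything here is assumed for illustration: the policy mu(s) = theta * s, the known quadratic critic Q(s, a) = -(a - w_star * s)^2, and the learning rate. None of it is the paper's network.

```python
import numpy as np


def actor_gradient(theta, states, w_star):
    """Sampled policy gradient: mean of grad_a Q(s, a) * grad_theta mu(s)."""
    actions = theta * states                    # a = mu(s | theta)
    dq_da = -2.0 * (actions - w_star * states)  # gradient of the toy critic w.r.t. a
    dmu_dtheta = states                         # gradient of the linear policy w.r.t. theta
    return float(np.mean(dq_da * dmu_dtheta))


def train_actor(theta=0.0, w_star=0.7, lr=0.1, steps=200, seed=0):
    """Gradient ascent on J; theta should approach w_star for this toy critic."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        states = rng.uniform(-1.0, 1.0, size=64)
        theta += lr * actor_gradient(theta, states, w_star)
    return theta
```

Because the toy critic is maximized at a = w_star * s, ascending the gradient drives theta toward w_star, which mirrors how the actor follows the critic's action gradient in DDPG.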

Reward Function Design
The reward function plays an important role in RL tasks, and it points to the direction of the network parameter update of actors and critics [50]. The reward function of this study is mainly designed according to the large-scale continuous static obstacle avoidance strategies in Section 2.2.1 and multiple dynamic obstacle avoidance strategies in Section 2.2.2. Aiming at the AUV obstacle avoidance problem, this paper designs a reward function algorithm that considers the three aspects of goal, safety, and stability.
The AUV's tendency to move toward the target is reflected in the target module reward $r_1(s_t, a_t, s_{t+1})$, the first component of the reward value. This study sets it by using the gravitational potential field function in Section 2.2.2. The target module reward function $r_1(s_t, a_t, s_{t+1})$ is given by Equation (21), where $(x_{goal}, y_{goal})$ is the center of the target area in Cartesian coordinates and $(x_t, y_t)$ is the center of the AUV in Cartesian coordinates at time $t$. When the AUV reaches the target area, the target module reward is updated according to Equation (22), where $R_1$ is a positive constant.

The AUV's obstacle avoidance behavior is captured by the safety module reward $r_2(s_t, a_t, s_{t+1})$, the second component of the reward value. The obstacles considered in this study include a large-scale continuous static obstacle and multiple dynamic obstacles. Following Section 2.2.1, large-scale continuous static obstacles are avoided by keeping the distance detected by each of the AUV's seven sonars greater than or equal to the safe radius of the AUV. Following Section 2.2.2, collisions with dynamic obstacles are avoided by setting the scope of the AUV's repulsion potential field. Combining the two obstacle avoidance strategies, the safety module reward is shown in Equation (23), where $r_2(s_t, a_t, s_{t+1})$ is the reward value of the safety module; $r_2^1(s_t, a_t, s_{t+1})$ is its first component, which is used to avoid large-scale continuous static obstacles; and $r_2^2(s_t, a_t, s_{t+1})$ is its second component, which is used to avoid small dynamic obstacles.
The reward $r_2^1(s_t, a_t, s_{t+1})$ is set as follows. When the minimum detection distance $\min(D_i(t))$ $(i = 1, 2, \ldots, 7)$ of the AUV's seven sonars is more than twice the safe distance $r_s$ at time step $t$, the AUV is safe and $r_2^1$ is 0. When $r_s \le \min(D_i(t)) \le 2r_s$, the AUV is about to collide with the large-scale continuous static obstacle and obtains the continuous negative reward $-(\min(D_i(t)) - r_s)^2$. When $\min(D_i(t)) \le r_s$, the AUV collides with the large-scale continuous static obstacle and obtains the negative reward $-R_2$. Therefore, the expression of $r_2^1(s_t, a_t, s_{t+1})$ is:

$$r_2^1(s_t, a_t, s_{t+1}) = \begin{cases} 0, & \min(D_i(t)) > 2r_s \\ -\left(\min(D_i(t)) - r_s\right)^2, & r_s \le \min(D_i(t)) \le 2r_s \\ -R_2, & \min(D_i(t)) \le r_s \end{cases} \tag{24}$$

where $\min(D_i(t))$ $(i = 1, 2, \ldots, 7)$ is the minimum distance detected by the AUV's seven sonars between the AUV and the large-scale continuous static obstacle at time step $t$; $r_s$ is the set safety margin; and $R_2$ is a positive constant.

The reward $r_2^2(s_t, a_t, s_{t+1})$ is set as follows. When the seven sonars detect that the dynamic obstacle has not entered the repulsion area of the AUV, the AUV is safe and the reward value is 0. When the dynamic obstacle enters the repulsion area, the AUV receives a continuous negative reward, and the closer the obstacle is to the AUV, the more negative the reward. If the dynamic obstacle finally reaches the safe radius of the AUV, the two collide and the negative reward $-R_2$ is obtained.

The expression of $r_2^2(s_t, a_t, s_{t+1})$ is given by Equation (25), where $(x_t, y_t)$ is the position of the AUV in Cartesian coordinates at time $t$; the corresponding coordinate pair of the dynamic obstacle is its position in Cartesian coordinates at time $t$; $L_1$ and $L_2$ are the long-axis and short-axis lengths of the ellipsoid obtained by expanding the AUV; $r_s$ is the set safety margin; and $R_2$ is a positive constant.
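The three-band static-obstacle component $r_2^1$ described above can be sketched as a small function. The function name and the numeric values used when calling it are hypothetical. The text assigns the boundary case min D = r_s to both the warning band and the collision band, so this sketch assumes it counts as a collision.

```python
def static_safety_reward(sonar_distances, r_s, R2):
    """Static-obstacle safety reward from the minimum of the sonar ranges.

    r_s is the safety margin and R2 the collision penalty (a positive constant).
    The boundary d_min == r_s is treated as a collision (an assumption).
    """
    d_min = min(sonar_distances)
    if d_min <= r_s:
        return -float(R2)            # collision with the canyon wall
    if d_min <= 2.0 * r_s:
        return -(d_min - r_s) ** 2   # continuous penalty inside the warning band, as written in the text
    return 0.0                       # safe: farther than twice the safety margin
```

For example, with seven readings whose minimum is 3.0 m, `r_s = 2.0` and `R2 = 100`, the AUV is in the warning band and the reward is −1.0.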
To improve the robustness of the AUV obstacle avoidance method and to help the AUV maintain its heading and speed when it is in a safe local area and approaching the target point, this paper designs the stability reward function given by Equation (26), where $r_3(s_t, a_t, s_{t+1})$ is the stability module component of the total reward $r(s_t, a_t, s_{t+1})$ at time step $t$; $\omega_t$ and $\omega_{t+1}$ in Equation (26) are the AUV's yaw angular velocity at the current moment and the next moment, respectively; and $v_t$ and $v_{t+1}$ in Equation (26) are the AUV's speed at the current moment and the next moment, respectively. In this paper, the reward function used for AUV path planning and obstacle avoidance is shown in Equation (27).

$$r(s_t, a_t, s_{t+1}) = \tau_1 r_1(s_t, a_t, s_{t+1}) + \tau_2 r_2(s_t, a_t, s_{t+1}) + \tau_3 r_3(s_t, a_t, s_{t+1}) \tag{27}$$
where $\tau_1$, $\tau_2$, and $\tau_3$ are the weights of the three modules. The larger a weight, the more the trained model focuses on the corresponding factor. The specific values need to be set according to the specific environment and requirements. The pseudocode of the reward function for AUV obstacle avoidance is shown in Algorithm 1.

Algorithm 1. Reward Algorithm for AUV Obstacle Avoidance
1: Initialize reward value $r(s_t, a_t, s_{t+1}) = 0$
2: Take action $a_t$ and observe $s_{t+1}$
3: Get the stability reward value $r_3(s_t, a_t, s_{t+1})$; if transition from safe region to safe region
4: then get the target module reward value $r_1(s_t, a_t, s_{t+1})$ from Equation (21), where $(x_{goal}, y_{goal})$ is the center of the target area and $(x_t, y_t)$ is the center of the AUV at time $t$
5: else transition from safe region to unsafe region
6: if AUV encounters large-scale continuous obstacle
7: then the safety module reward value $r_2(s_t, a_t, s_{t+1}) \leftarrow r_2^1(s_t, a_t, s_{t+1})$
8: else if AUV encounters multi-dynamic obstacle
9: then $r_2(s_t, a_t, s_{t+1}) \leftarrow r_2^2(s_t, a_t, s_{t+1})$
10: else if transition from unsafe region to obstacle region
11: then $r_2(s_t, a_t, s_{t+1}) \leftarrow r_2(s_t, a_t, s_{t+1}) - R_2$ and restart the exploration
12: else transition from unsafe region to safe region
13: then $r_2(s_t, a_t, s_{t+1}) = 0$
14: if transition from safe region to goal region
15: then $r_1(s_t, a_t, s_{t+1}) \leftarrow r_1(s_t, a_t, s_{t+1}) + R_1$
16: $r(s_t, a_t, s_{t+1}) \leftarrow r(s_t, a_t, s_{t+1}) + \tau_1 r_1(s_t, a_t, s_{t+1}) + \tau_2 r_2(s_t, a_t, s_{t+1}) + \tau_3 r_3(s_t, a_t, s_{t+1})$, where $0 \le \tau_1 \le 1$, $0 \le \tau_2 \le 1$, and $0 \le \tau_3 \le 1$ are the weights of the various factors
17: end
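The final weighted combination in step 16 of Algorithm 1 can be sketched as follows. The default weights are hypothetical placeholders; the text only requires each weight to lie in [0, 1] and leaves the values to the environment and requirements.

```python
def total_reward(r1, r2, r3, tau1=0.5, tau2=0.3, tau3=0.2):
    """Weighted sum of the target (r1), safety (r2), and stability (r3) rewards."""
    for tau in (tau1, tau2, tau3):
        if not 0.0 <= tau <= 1.0:
            raise ValueError("each weight must lie in [0, 1]")
    return tau1 * r1 + tau2 * r2 + tau3 * r3
```

Raising `tau2` relative to the other weights, for instance, makes the same near-wall penalty cost more in the total, so the trained policy is pushed toward wider clearance at the expense of path length.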

Replay Memory
The DDPG algorithm uses experience replay: it stores the experience samples generated by the agent's interaction with the environment in the experience buffer pool and draws random samples from the pool to train the network. Random sampling considers neither the differing importance of different data nor the diversity of the samples drawn, which slows model convergence. To solve this problem, the sample storage and extraction strategy in this paper uses priority extraction according to the importance of the data, which improves the convergence speed of the model.
In this article, minibatches are sampled not at random but according to the sample priority in the memory bank, so the samples that most need to be learned are found more effectively. In the DDPG algorithm, the parameters of the policy network depend on the value network, and the parameters of the value network are determined by the loss function of the value network. The sample priority P can therefore be defined by the expectation of the difference between the target Q value and the actual Q value. The greater this difference, the lower the prediction accuracy of the current network parameters, that is, the more the sample needs to be learned and the higher its priority P. Given the priority P, this article uses the SumTree method to sample efficiently based on P. The SumTree method does not sort the stored samples, which reduces the computational cost compared with a sorting-based approach.
SumTree is a tree structure (Figure 5). The priority P of each sample is stored in a leaf node, and each leaf node corresponds to an index value through which the corresponding sample can be accessed. Every two leaf nodes correspond to a parent node one level up. The priority of a parent node is equal to the sum of the priorities of its left and right child nodes. Thus, the top of the SumTree is sum(P).
When sampling, this study first divides the priority of the root node (the sum of the priorities of all leaf nodes) by the number of samples N, which splits the range from 0 to the total priority into N intervals. Then, a number is drawn at random from each interval. Because a node with higher priority occupies a longer interval, it is more likely to be drawn, which achieves priority sampling. Each time a leaf node is drawn, its priority and the corresponding sample pool data are returned.
N samples $(s_i^k, a_i^k, r_i^k, s_{i+1}^k)$, $k = 1, 2, \ldots, N$, are collected from the SumTree, and the sampling probability and weight of each sample are shown in Equations (28) and (29), respectively.

$$P(k) = \frac{p_k}{\sum_{m} p_m} \tag{28}$$

$$\omega_k = \frac{P(k)}{\min_{j} P(j)} \tag{29}$$
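The storage and stratified sampling scheme described above can be sketched with an array-backed binary tree. This is a minimal illustration rather than the authors' implementation: the class and method names are hypothetical, the minimum in the weight is taken over the sampled minibatch (an assumption), and every stored priority is assumed to be strictly positive.

```python
import numpy as np


class SumTree:
    """Binary tree whose leaves hold priorities and whose root holds their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes first, then leaves
        self.data = [None] * capacity           # samples, aligned with the leaves
        self.write = 0                          # next data slot to overwrite

    def total(self):
        return self.tree[0]

    def update(self, leaf, priority):
        """Set a leaf priority and propagate the change up to the root."""
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def add(self, priority, sample):
        """Store a sample, overwriting the oldest one once the tree is full."""
        self.data[self.write] = sample
        self.update(self.write + self.capacity - 1, priority)
        self.write = (self.write + 1) % self.capacity

    def get(self, value):
        """Descend from the root to the leaf whose interval contains value."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, n, rng):
        """Draw one leaf from each of n equal-width intervals of [0, total]."""
        segment = self.total() / n
        picks = [self.get(rng.uniform(k * segment, (k + 1) * segment))
                 for k in range(n)]
        probs = np.array([p for _, p, _ in picks]) / self.total()
        weights = probs / probs.min()  # weight formula as stated in the text
        return picks, probs, weights
```

In use, each new transition would be added with a priority derived from its TD error, for example its absolute value plus a small constant (an assumed choice), and `update` would be called on the sampled leaves after each critic step. Both insertion and lookup cost O(log H) for capacity H, which is why no sorting is needed.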
By improving DDPG experience replay and combining it with Algorithm 1, the algorithm for AUV path planning and obstacle avoidance is obtained, which we call SumTree-DDPG (Algorithm 2).

Algorithm 2. SumTree-DDPG Algorithm
1: Randomly initialize critic network $Q(s, a \mid \theta^{Q})$ and actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
2: Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: Initialize the SumTree $H = \emptyset$ and define its capacity size
4: for episode = 1, M do
5: Initialize a random process $\mathcal{N}$ for action exploration
6: Receive initial observation state $s_1$
7: for step = 1, T do
8: Select action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ according to the current policy and exploration noise
9: Take action $a_t$ and observe $s_{t+1}$
10: Decide reward $r_t(s_t, a_t, s_{t+1})$ using Algorithm 1
11: Store transition $(s_t, a_t, r_t, s_{t+1})$ in SumTree $H$
12: Sample a minibatch of $N$ transitions $(s_t^k, a_t^k, r_t^k, s_{t+1}^k)$, $k = 1, 2, \ldots, N$, from SumTree $H$ with sampling probability $P(k) = p_k / \sum_m p_m$ and importance-sampling weight $\omega_k = P(k) / \min_j P(j)$
13: Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
14: Update critic by minimizing the loss: $L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$
15: Update the actor policy using the sampled policy gradient (Equation (19))
16: Update the target networks (Equations (18) and (20))

In addition, the structure diagram of the SumTree-DDPG algorithm applied to AUV online path planning is shown in Figure 6.

Simulations
To verify the feasibility of the proposed method, this study first used the Python programming language and the pyglet module to build two underwater canyon simulation test environments. Then, the DDPG algorithm and the SumTree-DDPG algorithm were each applied to AUV path planning and obstacle avoidance for comparative analysis. Following the control variable principle, the same simulation environment was used in both cases.

Simulation of AUV Crossing Unknown Underwater Canyon
First, the simulation training process of the AUV in this research is based on four assumptions:
(1) Assumption 1: When the AUV is trained in the underwater canyon simulation environment, the details of the AUV model have negligible influence on the generation of obstacle avoidance paths;
(2) Assumption 2: The effects of environmental disturbances, such as deep ocean currents, are ignored;
(3) Assumption 3: The AUV avoids obstacles in the horizontal plane only;
(4) Assumption 4: The next state of the AUV depends only on the current state, and the conditional distribution of the next state given the current state does not change with time.
The obstacle avoidance process of the AUV is established as the MDP model in Section 2.3.
This study constructed two unknown 2D underwater simulation environments in Python on a high-performance computer to simulate an AUV traveling through an unknown underwater canyon at a fixed depth. Environment 1, shown in Figure 7, is an unknown underwater environment with irregular narrow terrain that simulates an AUV driving in an unknown underwater canyon. Environment 2, shown in Figure 8, adds to Environment 1 some small-scale dynamic obstacles, represented by blue squares in Figure 8, to simulate other vehicles or marine life traveling in the underwater canyon. The simulation environment size is 1000 m × 300 m. In Figures 7 and 8, the black irregular blocks represent the walls of the unknown underwater canyon, the green square represents the target, and the red rectangle represents the AUV. The solid black lines around the AUV simulate the sound detection beams used for obstacle avoidance. The initial position of the AUV is the map coordinate point (980 m, 125 m), and the target center position is (30 m, 100 m). Environment 1 is mainly used to verify the ability of the AUV obstacle avoidance algorithm to avoid unknown large-scale continuous static obstacles (e.g., the walls of the underwater canyon) and reach distant target points. The dynamic obstacles added in Environment 2 move in uniform linear motion or uniformly accelerated linear motion, and the red dotted lines in Figure 8 represent their motion trajectories. This environment is further used to verify the planning ability of the AUV obstacle avoidance algorithm in the case of abrupt dynamic obstacles.
After setting up simulation Environments 1 and 2, this study combines the DDPG algorithm and the SumTree-DDPG algorithm with the AUV model in Chapter 2 to establish two AUV path planning methods: the DDPG path planning method and the SumTree-DDPG path planning method. Finally, the pros and cons of the two planning methods are analyzed by simulation.
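The two motion types used for the dynamic obstacles in Environment 2 follow the usual constant-acceleration kinematics, which the sketch below evaluates. The function name and all numeric values in the example call are hypothetical; the actual obstacle parameters are those listed in Table 5.

```python
import numpy as np


def obstacle_position(p0, v0, t, accel=(0.0, 0.0)):
    """Position at time t: p0 + v0 * t + 0.5 * accel * t**2.

    With accel = (0, 0) this is uniform linear motion; a nonzero constant
    accel gives uniformly accelerated linear motion.
    """
    p0, v0, accel = (np.asarray(x, dtype=float) for x in (p0, v0, accel))
    return p0 + v0 * t + 0.5 * accel * t ** 2
```

For instance, `obstacle_position((500.0, 150.0), (1.0, 0.0), 10.0)` moves an obstacle 10 m along the x axis, and adding `accel=(0.2, 0.0)` moves it a further 10 m over the same interval.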

Large-Scale Continuous Obstacle Avoidance Simulations Results
First, the two planning methods were tested in Environment 1, a single-target environment in which the walls of the unknown underwater canyon are regarded as large-scale continuous static obstacles to the AUV. During the test, the principle of controlling variables is always maintained; that is, the basic parameters of the two planning methods are consistent (Table 3). Figure 9 shows the historical trajectories of the AUV during the online pathfinding process of the two planners (the green line is the AUV motion trajectory). Both methods can guide the AUV to avoid large-scale continuous static obstacles and reach the target area. Qualitatively, SumTree-DDPG plans a better path to the target than the DDPG algorithm.
Table 4 shows that over 1000 rounds (2000 steps per round), the AUV reached the target area only once during DDPG planner training, and the DDPG planner did not converge to the optimal path within 1000 episodes. As shown in Figure 10, the optimal path planned by the DDPG planner in Environment 1 is 1263.50 m. The SumTree-DDPG planner was also run for 1000 rounds in the same environment (2000 steps per round). During its online path planning process, the AUV reached the target area 218 times, and the program converged to the optimal path at episode 796. The optimal path planned is 1128.50 m. Compared with the DDPG planner, when the AUV is trained with the SumTree-DDPG planner in Environment 1, the success rate increases from 0.10% to 21.80%, the number of collisions with obstacles is reduced, the training efficiency is improved, and the planned collision-free optimal path is shorter.
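The figures quoted for Environment 1 can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (episode counts, successes, path lengths, and the start and target coordinates); the variable names are hypothetical, and the straight-line distance is an added reference value that ignores the canyon walls.

```python
import math

EPISODES = 1000  # training rounds per planner


def success_rate(successes, episodes=EPISODES):
    """Percentage of episodes in which the AUV reached the target area."""
    return 100.0 * successes / episodes


ddpg_rate = success_rate(1)         # DDPG planner, Environment 1
sumtree_rate = success_rate(218)    # SumTree-DDPG planner, Environment 1

path_saving_m = 1263.50 - 1128.50   # difference between the two best paths
path_saving_pct = 100.0 * path_saving_m / 1263.50

# Straight-line distance from the start (980 m, 125 m) to the target (30 m, 100 m).
straight_line_m = math.hypot(980.0 - 30.0, 125.0 - 100.0)
```

The SumTree-DDPG path is 135 m (about 10.7%) shorter than the DDPG path, and both exceed the straight-line distance of roughly 950.3 m, as expected for a route that must follow the canyon.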

Figure 10. The optimized path of AUV in Environment 1.

Figure 11 shows the average cumulative reward curves obtained over 1000 rounds of the two algorithms. The curves indicate that the stability, total training reward, and robustness of the SumTree-DDPG planner are better than those of the DDPG planner.

Dynamic Obstacle Avoidance Control Simulations Results
To consider the influence of dynamic obstacles, such as other ships or underwater floats, on the AUV's travel in underwater canyons, we designed Environment 2 based on Environment 1. Environment 2 adds to Environment 1 some small-scale dynamic obstacles, whose shape and size can be detected by the sonars, to simulate other underwater vehicles or marine life traveling in the underwater canyon. The two planning methods were tested in Environment 2, a single-target unknown underwater environment with a large-scale continuous static obstacle and several small-scale dynamic obstacles. In addition, during each episode of online training, the dynamic obstacles move along the four red dotted lines in Figure 8, and the specific parameters of the dynamic obstacles in the geodetic-fixed frame are shown in Table 5.

To consider the influence of dynamic obstacles, such as other ships or underwater floats, on the AUV's travel in underwater canyons, we designed Environment 2 based on Environment 1. Environment 2 adds to Environment 1 several small-scale dynamic obstacles, whose shape and size can be detected by sonar, to simulate other underwater vehicles or marine life travelling in the underwater canyon. The two planning methods were tested in Environment 2, a single-target unknown underwater environment with a large-scale continuous static obstacle and several small-scale dynamic obstacles. During each episode of online training, the dynamic obstacles move along the four red dotted lines in Figure 10, and their parameters in the geodetic-fixed frame are given in Table 5. Throughout the test, the principle of controlling variables is maintained, that is, the basic parameters of the two planning methods are kept consistent (Table 6).
Environment 1 is used to verify the ability of the AUV obstacle avoidance algorithm to avoid unknown large-scale continuous static obstacles (e.g., the walls of an underwater canyon) and reach distant target points. The dynamic obstacles added in Environment 2 move in uniform linear motion or uniformly accelerated linear motion, and the red dotted lines in Figure 8 represent their motion trajectories. Environment 2 is further used to verify the planning ability of the algorithm when dynamic obstacles appear abruptly. After setting up simulation Environments 1 and 2, this study combines the DDPG algorithm and the SumTree-DDPG algorithm with the AUV model in Section 2 to establish two AUV path planning methods: the DDPG planner and the SumTree-DDPG planner. Finally, the pros and cons of the two planning methods are analyzed by simulation.
Figure 12 shows the trajectories recorded during the online pathfinding process of the two planners (the green line is the AUV motion trajectory). The DDPG planner failed to find a collision-free path to the target area. Only the SumTree-DDPG planner found multiple safe, collision-free paths in the unknown underwater canyon with multiple dynamic obstacles, and the AUV can safely reach the target area by driving along these paths. Qualitatively, the SumTree-DDPG planner plans better paths to the target in Environment 2 than the DDPG planner.
Table 7 shows that over 1000 rounds (2000 steps per round), the AUV never reached the target area during DDPG planner training, and the planner did not converge to a path that reaches the target within 1000 episodes. As shown in Figure 13, the DDPG planner did not find a safe, collision-free path to the target area. The SumTree-DDPG planner was also run for 1000 rounds in the same environment (2000 steps per round). During its online path planning, the AUV reached the target area 39 times, and the best of these safe driving paths is 1253 m long. In addition, the training of both algorithms stabilized within 1000 rounds, but neither planner converged to the optimal path, and the DDPG algorithm fell into a local optimum earlier than the SumTree-DDPG algorithm. In the complex Environment 2, the SumTree-DDPG planner has a higher success rate than the DDPG planner: the success rate rises from 0 to 3.90% (39 of 1000 rounds).
Figure 14 shows the average cumulative reward curves of DDPG and SumTree-DDPG over 1000 rounds. In the complex Environment 2 with dynamic obstacles, the reward obtained by the DDPG algorithm in each round is mostly the negative reward incurred by failing to complete the task. The AUV is seldom guided past the obstacles to the target position during the 1000 episodes of online path planning, which shows that the path planner based on the DDPG algorithm learns poorly. In contrast, because the memory pool uses the SumTree structure, the SumTree-DDPG algorithm continuously accumulates good learning samples and displaces bad memories, and its learning improves until it converges to a path that avoids all obstacles and reaches the goal. Figure 14 indicates that the stability, training effect, and robustness of the SumTree-DDPG planner are better than those of the DDPG planner.
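As an illustration of the two motion types assigned to the dynamic obstacles in Environment 2 (uniform linear motion and uniformly accelerated linear motion), the following Python sketch advances an obstacle along a straight line. The class name, its fields, and the time step are hypothetical and are not taken from the simulation platform; the actual obstacle parameters are those listed in Table 5.

```python
import math
from dataclasses import dataclass


@dataclass
class DynamicObstacle:
    """Hypothetical straight-line obstacle in the geodetic-fixed frame."""
    x: float            # position, m
    y: float            # position, m
    speed: float        # speed along the heading, m/s
    heading: float      # direction of travel, rad
    accel: float = 0.0  # m/s^2; 0 gives uniform linear motion

    def step(self, dt: float) -> None:
        # Displacement along the heading: s = v*dt + 0.5*a*dt^2
        s = self.speed * dt + 0.5 * self.accel * dt * dt
        self.x += s * math.cos(self.heading)
        self.y += s * math.sin(self.heading)
        # Speed grows linearly under constant acceleration
        self.speed += self.accel * dt


# Uniform linear motion: constant speed, no acceleration
uniform = DynamicObstacle(x=0.0, y=0.0, speed=1.0, heading=0.0)
# Uniformly accelerated linear motion: starts at rest
accelerated = DynamicObstacle(x=0.0, y=0.0, speed=0.0, heading=0.0, accel=2.0)
for _ in range(2):
    uniform.step(1.0)
    accelerated.step(1.0)
```

With a constant acceleration, stepping with the exact displacement formula gives the same position regardless of how the time interval is subdivided, which keeps the obstacle trajectory independent of the simulation step size.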

Analysis of Simulations
The simulation results show that the AUV path planning and obstacle avoidance methods based on the DDPG algorithm and the SumTree-DDPG algorithm are effective and can solve the problem of safe AUV driving in underwater canyons. Moreover, in both the unknown underwater canyon Environment 1 and the simulated underwater canyon Environment 2 established in this paper, the proposed SumTree-DDPG algorithm learns better and is more stable than the DDPG algorithm. Section 4.2 shows that the AUV path planning and obstacle avoidance method based on the SumTree-DDPG algorithm is effective in an unknown underwater canyon environment. This method enables autonomous AUV obstacle avoidance in an uncertain environment. However, the method in this paper plans in a 2D plane, whereas the actual environment is 3D, and neither energy consumption optimization nor the influence of ocean waves in underwater canyons is considered. We leave these extensions to future work.

Conclusions
To solve the problem of safe AUV driving in underwater canyons and tap the potential of autonomous AUV obstacle avoidance in uncertain environments, this paper proposes an improved AUV path planning method based on DDPG. The method is an end-to-end path planning strategy that is optimized directly: sensor information is the input, and driving speed and yaw angle are the outputs. The method can reach the predetermined target point while avoiding two kinds of obstacles. The first are large-scale static obstacles, such as valley walls in the simulated underwater canyon environment, of which the AUV's sonars can detect only a very small part (less than 20 percent of the overall size). The second are sudden small-scale dynamic obstacles, such as marine life and other underwater vehicles, whose shape and size can be completely detected by the AUV's sonars. In addition, in view of the multi-objective structure of obstacle avoidance in path planning, this research modularizes the reward function design and combines it with the artificial potential field method to set continuous rewards, which solves the sparse reward problem in complex environments. This research also proposes the SumTree-DDPG algorithm, which improves the random storage and extraction strategy for the experience samples of the DDPG algorithm. To raise the model convergence rate, the algorithm uses the SumTree structure to classify and store the samples according to their importance and continuously extracts high-quality samples. Finally, the effectiveness of the method is verified by simulation.
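To make the idea of a continuous, modular reward concrete, the sketch below builds a reward from a goal module and an obstacle module using the classical attractive and repulsive potentials of the artificial potential field method. This is an assumed form for illustration only: the function name, the gains `k_att` and `k_rep`, and the influence distance `d0` are hypothetical, and the reward modules actually used in this work may differ.

```python
import math


def apf_reward(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=10.0):
    """Continuous reward as the negative of an APF-style potential.

    pos, goal: (x, y) tuples; obstacles: iterable of (x, y) tuples.
    All gains and the influence distance d0 are illustrative assumptions.
    """
    # Goal module: quadratic attractive potential, smaller nearer the goal
    d_goal = math.dist(pos, goal)
    u_att = 0.5 * k_att * d_goal ** 2

    # Obstacle module: repulsive potential, active only within distance d0
    u_rep = 0.0
    for obs in obstacles:
        d = max(math.dist(pos, obs), 1e-6)  # guard against division by zero
        if d <= d0:
            u_rep += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2

    # Lower total potential means higher reward at every step
    return -(u_att + u_rep)


far = apf_reward((0.0, 0.0), (10.0, 0.0), [])
near = apf_reward((5.0, 0.0), (10.0, 0.0), [])
near_with_obstacle = apf_reward((5.0, 0.0), (10.0, 0.0), [(5.0, 1.0)])
```

Because the reward changes at every step as the AUV moves relative to the goal and the obstacles, the agent receives a learning signal long before it reaches the target or collides, which is how a continuous reward addresses sparsity.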
The main contributions of this paper can be summarized as follows: (1) To solve the problem of safe driving of under-actuated AUVs in underwater canyons, this research proposes a large-scale continuous obstacle avoidance model and obstacle avoidance models for obstacles in uniform linear motion and uniformly accelerated linear motion. These models simulate large-scale static obstacles, such as valley walls in the underwater canyon environment, and sudden small-scale dynamic obstacles, such as marine life and other vehicles. (2) On the basis of the AUV dynamic model, this paper transforms the traditional AUV path planning process into a Markov decision process (MDP) model, which can be used for deep reinforcement learning (DRL) of the AUV. (3) According to the multi-objective structure of obstacle avoidance in motion planning, this research designs the reward function in modules and combines it with the artificial potential field method to set continuous rewards. (4) This research also proposes the SumTree-DDPG algorithm, which improves the random storage and extraction strategy for the experience samples of the DDPG algorithm. The samples are classified and stored in the SumTree structure according to their importance, and high-quality samples are continuously extracted, which ultimately improves the convergence speed of the model.
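The storage and extraction idea in contribution (4) can be sketched with a generic sum tree, in which each leaf holds the priority of one experience sample and each internal node holds the sum of its children, so that a sample is drawn with probability proportional to its priority. The code below is a minimal sketch under assumptions: the class and method names are hypothetical, and how SumTree-DDPG assigns the priority (importance) of a sample is not reproduced here.

```python
import random


class SumTree:
    """Binary sum tree over sample priorities (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity           # stored experience samples
        self.write = 0                          # next slot to overwrite

    def add(self, priority, sample):
        # Store the sample and set its leaf priority (oldest is overwritten)
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        # Propagate the priority change from the leaf up to the root
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def total(self):
        return self.tree[0]

    def get(self, value):
        # Descend from the root to the leaf whose cumulative range holds value
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, batch_size):
        # Stratified draw: one uniform value from each equal priority segment
        seg = self.total() / batch_size
        return [self.get(random.uniform(i * seg, (i + 1) * seg))
                for i in range(batch_size)]


memory = SumTree(capacity=4)
for priority, transition in zip([1.0, 2.0, 3.0, 4.0], "abcd"):
    memory.add(priority, transition)
batch = memory.sample(2)
```

Both the priority update and the draw take time logarithmic in the memory size, so sampling high-priority experiences more often adds little overhead compared with uniform random replay.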
Funding: This research was funded by the Natural Science Foundation of Heilongjiang Province, grant number ZD2020E005; the Shaanxi Provincial Water Conservancy Science and Technology Program, grant number 2020slkj-5; and the China National Natural Science Foundation, grant numbers 51779057 and 51709061.