Hierarchical Motion Planning of AUVs in Three Typical Marine Environments

Broad waters, harbor waters, and waterway waters make up more than 90% of autonomous underwater vehicles (AUV) navigation area, and each of them has its typical environmental characteristics. In this paper, a three-layer AUV motion planning architecture was designed to improve the planning logic of an AUV when completing complex underwater tasks. The AUV motion planning ability was trained by the improved deep deterministic policy gradient (DDPG) combined with the experience pool of classification. Compared with the traditional DDPG algorithm, the proposed algorithm is more efficient. Using the strategy obtained from the training and the motion planning architecture proposed in the paper, the tasks of AUVs searching in broad waters, crossing in waterway waters and patrolling in harbor waters were realized in the simulation experiment. The reliability of the planning system was verified in field tests.


Introduction
The autonomous underwater vehicle (AUV) is an important tool in the field of ocean development [1]. It can be used for environmental monitoring and topographic detection in broad waters, patrolling in harbor waters, and crossing in waterway waters. Broad waters, harbor waters, and waterway waters make up more than 90% of AUVs navigation area, and each of them has its typical environmental characteristics. Broad waters belong to the ideal area for AUVs navigation, with large space for avoidance and few restrictions on behavior and action. There are irregular dynamic and static obstacles in harbor waters, and AUVs can move within a narrow range. The complex waterway environment and obstacles such as fast-moving ships require real-time high performance AUV motion planning. When an AUV performs underwater tasks, it is faced with complex environments and situations such as sudden obstacles and unknown water environment. Therefore, it is urgent to improve AUV's motion planning capability under complex dynamic environment [2].
Eichhorn, Mike et al. [3] proposed several important requirements for optimal motion planning for the "SLOCUM Glider" AUV, as well as a solution based on fast graph algorithms. According to the marine environment of the Continental shelf of Newfoundland and Labrador, this paper planned the optimal navigation path of the waterway area.
Sun, Zhu et al. [4] proposed an optimized fuzzy control algorithm for AUVs motion planning. The model could be used for motion planning in complex underwater environment. The virtual acceleration and speed of an AUV were obtained by using the environmental information collected by sonar and the fuzzy system with an acceleration and braking module, so that the AUV could avoid dynamic obstacles automatically. However, considering that the choice of a fuzzy boundary is subjective, the generated path could not be guaranteed to be optimal. The following year, they [5] proposed a new discrete centralized planning strategy based on glasius bio-inspired neural network for AUVs full coverage motion planning. The algorithm had low computational cost and high efficiency.
It solved the problem that a single AUV is difficult to carry out full coverage task with long range. Ramos, Garcia-Garrido et al. [6] planned a path to optimize scientific impact and navigation efficiency according to the complex space-time structure of ocean flow field and dynamic system. They pre-planned the Silbo glider's mission to cross the North Atlantic from April 2016 to March 2017. The planning capability of the system in broad waters and waterway waters was demonstrated. MahmoudZadeh, Somaiyeh et al. [7] designed a hierarchical dynamic task planning framework for an AUV to complete task assignment within a limited time in an uncertain underwater environment. An advanced reactive task planner was developed to guide an AUV towards the target and finish the task on time. They designed a low-level motion planner to handle unexpected changes in dynamic terrain by regenerating the optimal trajectory. Based on the biogeography optimization (BBO) algorithm, the task was rearranged by updating the terrain, and the motion planning of broad water area was verified.
The above research content only focuses on a single water area and lacks detailed obstacle avoidance strategies. Different underwater environments have different characteristics, and AUVs motion planning needs a logical, detailed and accurate hierarchical structure.
In order to solve the problem of AUVs planning in three typical underwater environments, a three-layer AUV motion planning architecture is proposed, and the planning logic of AUV in complex tasks is given in the paper. At the same time, in order to reduce the complexity of programming, the improved reinforcement learning method is used to train the planning strategy of AUVs. DDPG algorithm is improved in the aspect of critic network. The improved algorithm combined with the classified experience pool is used to train the AUV motion planning ability. Compared with the traditional DDPG algorithm, the proposed algorithm is more efficient. In addition, other traditional methods, such as A* and ant colony algorithm, can only carry out global optimization and are easy to fall into local minimum. By contrast, the method proposed in this paper can obtain the local obstacle avoidance strategy for AUVs, and the intelligence level of AUVs can be improved.
The structure of the paper is as follows: The Section 2 establishes the three-layer motion planning architecture of the AUV. In Section 3, an improved reinforcement learning algorithm is proposed, and an algorithm model is established by combining with the artificial experience pool. In Section 4, the comparative experiment of the algorithm and the motion planning experiment of an AUV under three typical environments are carried out. The fifth chapter carries on the field experiments of an AUV motion planning. Section 6 concludes the study.

The Hierarchy of Motion Planning
AUV motion planning is divided into three levels of "task-behavior-action". The work which an AUV needs to complete is defined as tasks, such as: motion planning, target following, terrain detection, underwater search, etc. Target commands generated by an AUV underwater navigation are defined as actions, such as: left turn, right turn, forward, etc. The set of a series of actions generated by an AUV in order to complete the task is defined as the behavior, such as obstacle avoidance, target search, path following, etc. The three levels can be understood as: task decomposition layer, behavior planning layer, and action execution layer. The task decomposition layer breaks down the task commands received by the AUV into behaviors. The behavior planning layer plans the behavior based on the environmental information. The action execution layer uses reinforcement learning method to train AUV's movements to complete the motion planning. The AUV actuator is controlled to generate the action to achieve the target commends.
The motion planning task of AUVs is divided into three layers, as shown in Figure 1. The top layer is the task layer, which is the motion planning task. The second layer is the behavior layer, including obstacle avoidance behavior and navigation to the target. The third layer is the action layer, including: change velocities, change directions, up and down, emergency braking and back action. The task layer is the highest level of AUV learning system, the last learning layer and the slowest learning layer. Therefore, in practical design, task design at the root task level is relatively simple, making it easy for the AUV to learn experience and make decisions. Decision design is the most important part of the root task level, so a better decision strategy should be designed for the root task layer.
The behavior layer realizes the analysis of environmental information, the update of knowledge and the choice of behavior. The AUV must analyze the environmental information collected by the sensor and send the analysis result to the decision-making strategy module, and then make the choice of behavior according to the corresponding policy decision. At the same time, the learning system can update the analysis results to its knowledge database, and then send the knowledge database to the decision-making strategy module of this layer.
The action layer selects actions according to its own decision-making strategies. It is the bottom layer of the hierarchy and also the execution layer of the motion planning task. There are options for multiple actions that are executed immediately after selected.
The decision-making process of the system is a top-down process. The task layer does not directly select policies according to the detected environment state, but makes decisions from the top to the bottom layer, until the basic action instructions to be executed are finally decided.
Decision making is a bottom-up learning process. First, learn the action layer. Then, the behavioral layer learns to select strategies for each behavior. Finally, the task layer learns how to implement the AUV's final task decision. This process can be extended based on decision problems.
The planning logic of an AUV navigation is as follows: 1. Load the existing map environment and initialize the parameters. 2. The planning mission is confirmed and the AUV begins to sail. 3. According to the multi-constraint model, the optimal navigation behavior to the target point is planned. 4. Real-time sensor data is received to determine the AUV location and supplement the environmental map. 5. Whether the AUV has reached the target point: Yes, end the task. No, proceed to the next step. 6. Whether unknown obstacle is detected: No, go to 7. Yes, go to 8. 7. Whether an AUV deviates from the path of global planning: No, maintain this behavior. Yes, preplanning to navigate to the target. Go to 9. The task layer is the highest level of AUV learning system, the last learning layer and the slowest learning layer. Therefore, in practical design, task design at the root task level is relatively simple, making it easy for the AUV to learn experience and make decisions. Decision design is the most important part of the root task level, so a better decision strategy should be designed for the root task layer.
The behavior layer realizes the analysis of environmental information, the update of knowledge and the choice of behavior. The AUV must analyze the environmental information collected by the sensor and send the analysis result to the decision-making strategy module, and then make the choice of behavior according to the corresponding policy decision. At the same time, the learning system can update the analysis results to its knowledge database, and then send the knowledge database to the decision-making strategy module of this layer.
The action layer selects actions according to its own decision-making strategies. It is the bottom layer of the hierarchy and also the execution layer of the motion planning task. There are options for multiple actions that are executed immediately after selected.
The decision-making process of the system is a top-down process. The task layer does not directly select policies according to the detected environment state, but makes decisions from the top to the bottom layer, until the basic action instructions to be executed are finally decided.
Decision making is a bottom-up learning process. First, learn the action layer. Then, the behavioral layer learns to select strategies for each behavior. Finally, the task layer learns how to implement the AUV's final task decision. This process can be extended based on decision problems.
The planning logic of an AUV navigation is as follows: 1.
Load the existing map environment and initialize the parameters.

2.
The planning mission is confirmed and the AUV begins to sail.

3.
According to the multi-constraint model, the optimal navigation behavior to the target point is planned.

4.
Real-time sensor data is received to determine the AUV location and supplement the environmental map.

5.
Whether the AUV has reached the target point: Yes, end the task. No, proceed to the next step. 6.
Whether unknown obstacle is detected: No, go to 7. Yes, go to 8.

7.
Whether an AUV deviates from the path of global planning: No, maintain this behavior. Yes, preplanning to navigate to the target. Go to 9. 8.
According to the multi-constraint model, the optimal obstacle avoidance behavior is programmed. Go to 9. 9.
Plan actions; 10. The AUV performs one step of the planned action, go to 4.

Motion Planning Algorithm Modeling
The AUV's ability to reach the target and avoid obstacles is trained by improved deep deterministic policy gradient (DDPG) [8], combining artificial experience [9] with reinforcement learning algorithms. Artificial experience is used to improve the training efficiency of reinforcement learning. Reinforcement learning is used to optimize the obstacle avoidance strategy of artificial experience.

The Improved DDPG
DDPG combines Actor Critic [10] and Deep Q Network (DQN) [11], which is to apply the memory banks in the DQN structure and the idea of two neural networks update on Actor Critic. The Critic network in DDPG only evaluates the overall actions' performance. However, there are 6-dof outputs which are coupled with each other in the motion system of AUVs, including: longitudinal velocity u, lateral velocity v, angular velocity ω, heeling angle ϕ, trim angle θ, and heading angle ψ. The critic network does not evaluate the actions in each dimension, so the difference between the state value of the optimal and non-optimal actions is small, resulting in low learning efficiency of the algorithm. This study proposes to establish the average motion critic network to solve this problem.
DDPG is divided into two parts: Main Net and Target Net, as shown in Figure 2. Each part includes an ActorNet and six CriticNets. The ActorNet outputs actions of 6-dof, and the CriticNet evaluates each action. Considering a standard Reinforcement Learning (RL) problem, a finite Markov Decision Process (MDP) which comprises a current state s t , an action space a t , a reward function r, and the next state s t+1 is established.

Obstacle Avoidance Strategy Based on Artificial Experience
The AUV adopts the method of artificial experience to avoid obstacles during underwater navigation. 15 (1) The calculation formula of the heading angle is as follows: The process of MainNet The process of TargetNet Update The output to loss function Extract data from the experience pool for learning. Input the 6-dof actions at form (s t , a t , r, s t+1 ) into six critic networks of Main Net, and each critic network calculates the value function of each action, respectively, by Bellman equation: where, i is between 1 and 6. If the target policy is deterministic, it can be described as a function µ:s ← a , and the inner expectation is as followings: Get the average Q value Q(s t , a t ): According to Q-learning [12], considering function approximators parameterized by θ Q , the loss function is: Train the neural network to minimize the loss function so that the actual Q value tends to the target Q value.
On basis of policy gradient, a parameterized actor function µ(s|θ µ ) which specifies the current policy by deterministically mapping states to a specific action is maintained. The parameters of the motion estimation network are updated by following function:

Obstacle Avoidance Strategy Based on Artificial Experience
The AUV adopts the method of artificial experience to avoid obstacles during underwater navigation. 15 (1) The calculation formula of the heading angle is as follows: where, β is the target heading. β is the current heading. (x 1 , y 1 ) is the current position of an AUV. (x 2 , y 2 ) is the target position of an AUV.
(2) When an AUV detects a static obstacle within 5 m in front, avoid it. Make an edge in the direction perpendicular to the AUV navigation, while reserving a 2 m safe range to rectangle the obstacle. The rectangle angle nearest to an AUV is taken as the target point and the heading angle is calculated by Formula 6, as shown in Figure 3.
(3) When the AUV detects a dynamic obstacle within 5 m in front, avoid it. The node at the end of the dynamic obstacle is taken as the target point, and the heading angle is calculated based on Formula 6, as shown in Figure 4.
The artificial experience algorithm uses simple logic to avoid obstacles. However, it is not optimal; the logic of the artificial experience algorithm needs to be trained.  (3) When the AUV detects a dynamic obstacle within 5 m in front, at the end of the dynamic obstacle is taken as the target point, and th calculated based on formula 6, as shown in Figure 4. The artificial experience algorithm uses simple logic to avoid obs is not optimal; the logic of the artificial experience algorithm needs to b

Sample Space Classification
The sample space of DDPG algorithm is divided into four categ sample space D0, high error sample space D1, diversity sample spac sample space D3； In DDPG algorithm, the network samples randomly from the buff buffer is finite in size in which   1 , , , t t t s a r s  is stored and sampled ra to the exploration strategy. However, due to the limited of sampling general learning effect of samples in the early stage, the learning rate the later stage, and the behavior cannot be significantly improved, m  (3) When the AUV detects a dynamic obstacle within 5 m in front, at the end of the dynamic obstacle is taken as the target point, and th calculated based on formula 6, as shown in Figure 4. The artificial experience algorithm uses simple logic to avoid obst is not optimal; the logic of the artificial experience algorithm needs to b

Sample Space Classification
The sample space of DDPG algorithm is divided into four categ sample space D0, high error sample space D1, diversity sample space sample space D3； In DDPG algorithm, the network samples randomly from the buffe buffer is finite in size in which   1 , , , t t t s a r s  is stored and sampled ra to the exploration strategy. However, due to the limited of sampling general learning effect of samples in the early stage, the learning rate the later stage, and the behavior cannot be significantly improved, m range of the controller unstable. Therefore, in the second stage of lea space should be increased to get better samples, so as to obtain more c Before that, the bisimulation is showed as following.

Sample Space Classification
The sample space of DDPG algorithm is divided into four categories: the original sample space D 0 , high error sample space D 1 , diversity sample space D 2 , and artificial sample space D 3 ; In DDPG algorithm, the network samples randomly from the buffer for training. The buffer is finite in size in which (s t , a t , r, s t+1 ) is stored and sampled randomly according to the exploration strategy. However, due to the limited of sampling space size and the general learning effect of samples in the early stage, the learning rate will slow down in the later stage, and the behavior cannot be significantly improved, making the control range of the controller unstable. Therefore, in the second stage of learning, the sample space should be increased to get better samples, so as to obtain more control experience. Before that, the bisimulation is showed as following. Definition 1. The bisimulation [13]: If E ⊆ S × S is the relationship of bisimulation, then for s 1 , s 2 ∈ S: (1) ∀a ∈ A, r(s 1 , a) = r(s 2 , a); (2) ∀a ∈ A, ∀C ∈ S E , ∑ t∈C P(t|s 1 , a ) = ∑ t∈C P(t|s 2 , a ); (8) where, the equivalent set of states S with respect to E is denoted by S E . P(t|s 1 , a ) represents the probability that the system takes an action a and moves to the next state at time t in the process of MDP. If the two states satisfy the bisimulation relationship, it can be denoted as s 1 ∼ s 2 . Lemma 1. Bisimulation metric [14]: D is defined as the set of metrics on the state set S, and let d ∈ D. For ∀s 1 , s 2 ∈ S, define: is the Kantorovich metric [15], which is defined by the following linear program: On the basis of the above questions, this study constructed two sample spaces besides the original sample space D 0 , namely, high error sample space D 1 and diversity sample space D 2 . In the space D 1 , the temporal-difference (TD) error is used as the heuristic information to sort samples in the sample space so as to improve the probability of selecting samples with large error. The bisimulation measurement method is used to measure the distance between samples in D 2 . During the algorithm learning process, the training samples will be selected in proportion from the D 1 and the D 2 , respectively, so as to give consideration to the diversity and value of samples.
The AUV's position within 5 m of the obstacle is set as the dangerous state. The system selects an action based on artificial experience. When the AUV encounters an obstacle and correctly outputs the heading angle according to the artificial experience, the reward value will be 1, which will be stored in the artificial experience pool D 3 .
During the process of training, the sample (s t , a t , r, s t+1 ) is put into D 0 . TD error of the action function is calculated, as shown in formula (10). When the TD error is larger than a threshold value, it indicates that the sample has a great influence on the change of the action value function, and it can be considered that the modified sample has high value. Thus, the sample is put in D 1 . When the samples in D 1 are sufficient, the training samples are not only selected from D 0 , but from D 0 and D 1 , respectively, in a certain proportion. Meanwhile, Algorithm 1 is used to calculate the distance between samples in D 0 . Lowsimilarity samples are put into D 2 to ensure the diversity of selected samples. After that, Samples from D 1 and D 2 are selected, respectively, in a certain proportion for learning. The distance between samples is calculated by Algorithm 1. The larger the distance, the lower the similarity of samples. When the state of AUV is in danger, the samples is put in D 3 .

Algorithm Model Establishment
The algorithm mainly includes three models [11]: input model, output model, and reward value model.
The input model of the algorithm is the state of an AUV, including position coordinates, velocity, and heading of an AUV. The output of the algorithm is the action of an AUV. The reward value model is designed based on multiple constraints. The constraints of motion planning are: following target constraint, obstacle constraint, and current constraint.
A multi-constraint model is established, which includes the position coordinates and attitude of the AUV, ocean currents, and collision distance of obstacles. By adding the weight coefficient of economy, safety and concealment, the navigation value is obtained.
C is the constraint. The larger the C, the more costly the AUV navigation. So, C should be as small as possible. α is the angle between the current and the AUV attitude. l ta is the distance between the current position of the AUV and the target position. L max is the distance from the initial position of the AUV to the target position. l ob is the distance between the current position of the AUV and the obstacle. ς is the security coefficient, and the value is (0,1). The larger ς is, the more security is considered in the planning. The AUV tends to yaw at an angle of 90 • with the current. Therefore, the attitude of AUV is planned to avoid a 90 • angle to the current. ζ is the economic coefficient, and the value is (0,1). The larger ζ is, the more economical it is considered in the planning. In short, the AUV should try to navigate downstream and avoid obstacles to reach the target as soon as possible.
The multi-constraint model of AUV is designed as the reward value during the training of DDPG algorithm, R = −C. The higher the C value, the higher the AUV navigation cost, the worse the reward value. The smaller the C value, the smaller the AUV navigation cost and the better the reward value. The heading of the AUV is treated as an action, which is a continuous output value. The planned combination of actions is the behavior and, in this article, the path.
In addition, if the AUV hits the obstacle, the round ends, and the reward R = −1. If the AUV reaches the target, the round ends, and the reward R = 1. During the path following, the system will be rewarded in real time: (12) l d represents the vertical distance between the AUV position and the target path.

Contrast Experiment
AUV motion planning is simulated by using the algorithm designed in this paper. AUV's position coordinates, AUV's current heading angle, speed, distance between the AUV and the target, distance between the AUV and the obstacle are taken as input, and AUV's actions are output to train AUV's ability to follow towards the target and avoid obstacles. The training process is as follows in Algorithm 2:

Algorithm 2 The training process.
Initialize parameters. for episode in MAX_EPISODES do Put low-similarity samples into D 1 ; Targets and obstacles appear randomly within the environment; for step in MAX_EP_STEPS: Learning and output action β= a =µ(s t |θ µ ); According to AUV's heading and velocity, the AUV kinematic model is used to calculate its position at the next moment; Calculate the constraint C, R = −C; if AUV reaches the target: R = 1; done = true; step = 0; break; else if collides with obstacles: R = −1 done = true; step = 0; break; else: on_goal = 0; Store (s, a, r, s ) in D 0 ; if AUV is in critical condition: Select a certain number of samples from D 3 Randomly; else if D 1 = null: Select a certain number of samples from D 1 and D 2 Randomly; else if D 2 = null: Select a certain number of samples from D 0 and D 2 Randomly; else Select a certain number of samples from D 0 Randomly; Update the critic network: if AUV is in critical condition: Put (s, a, r, s ) in D 3 ; Update the actor network: ∇ a Q s, a θ Q s=s t ,a=µ(s t ) ∇ θ µ µ(s|θ µ )| s=s t ; Update the network parameters: After the training, complete obstacle avoidance and following strategies were obtained. The results are shown in Figure 5.
In Figure 5, the abscissa represents the episodes of training, whereas the ordinate represents the total reward of each episode. A total of 10,000 episodes are created during simulation training. Each training episode is updated with 500 steps. Although the learning curves of both artificial experience-DDPG and DDPG algorithm converges to a satisfactory value, the learning process of artificial experience-DDPG is more stable. It can be seen form Figure 5 that artificial experience-DDPG is more effective than conventional DDPG. Figure 5 shows that when the reward value converges to 230-250, the training is successful. In addition, in order for the AUV to learn all the motions, there is a strategy of randomly selecting the actions during the training. So, the curves in Figure 5 do not converge.
end for step; end for episode; After the training, complete obstacle avoidance and following strategies were obtained. The results are shown in Figure 5. In Figure 5, the abscissa represents the episodes of training, whereas the ordinate represents the total reward of each episode. A total of 10,000 episodes are created during simulation training. Each training episode is updated with 500 steps. Although the learning curves of both artificial experience-DDPG and DDPG algorithm converges to a satisfactory value, the learning process of artificial experience-DDPG is more stable. It can be seen form Figure 5 that artificial experience-DDPG is more effective than conventional DDPG. Figure 5 shows that when the reward value converges to 230-250, the training is successful. In addition, in order for the AUV to learn all the motions, there is a strategy of randomly selecting the actions during the training. So, the curves in Figure 5 do not converge.

Simulation Experiment Platform
The architecture of the simulation experiment platform is shown in Figure 6. The system consists of several computers, which are visual simulation computers for the operation environment modeling system, motion planning computers for the operation planning system, and sensor data processor for the operation perception processing system. They integrate into a system via network switches. Motion planning control, environment awareness and information transmission of typical environment model are accomplished through network communication.
A total of three typical environmental models, the AUV dynamics and kinematics models, obstacle models, and target models, were established in the environmental modeling system. The perception processing system detects the target in the environment, filters the perception data, and sends the processed data to the planning system. The planning system plans behaviors and actions according to perceptual data, environmental map and the AUV real-time location. The planning instruction is sent to the environmental system. The AUV is controlled to move in the environment to realize target following and obstacle avoidance. When the sensing system detects a sudden obstacle or a moving

Simulation Experiment Platform
The architecture of the simulation experiment platform is shown in Figure 6. The system consists of several computers, which are visual simulation computers for the operation environment modeling system, motion planning computers for the operation planning system, and sensor data processor for the operation perception processing system. They integrate into a system via network switches. Motion planning control, environment awareness and information transmission of typical environment model are accomplished through network communication. obstacle, the planning system carries out re-planning to realize the real-time obstacle avoidance behavior.

Motion Planning Simulation Experiment
After the simulation platform was built, models of broad waters, waterway waters and harbor waters were established. Training experience was introduced, and simulation experiments were carried out respectively.
In broad waters, the AUV was assigned an area search task. The AUV was greatly influenced by topography. The AUV path following and obstacle avoidance were simulated. The result is shown in Figure 7.  A total of three typical environmental models, the AUV dynamics and kinematics models, obstacle models, and target models, were established in the environmental modeling system. The perception processing system detects the target in the environment, filters the perception data, and sends the processed data to the planning system. The planning system plans behaviors and actions according to perceptual data, environmental map and the AUV real-time location. The planning instruction is sent to the environmental system. The AUV is controlled to move in the environment to realize target following and obstacle avoidance. When the sensing system detects a sudden obstacle or a moving obstacle, the planning system carries out re-planning to realize the real-time obstacle avoidance behavior.

Motion Planning Simulation Experiment
After the simulation platform was built, models of broad waters, waterway waters and harbor waters were established. Training experience was introduced, and simulation experiments were carried out respectively.
In broad waters, the AUV was assigned an area search task. The AUV was greatly influenced by topography. The AUV path following and obstacle avoidance were simulated. The result is shown in Figure 7.  In Figure 7, the yellow polygons represent the known environmental terrain. The green polygons represent sudden obstacles. The blue lines are the path of the global plan. The red lines indicate the AUV's path. In this simulation experiment, the AUV path following and obstacle avoidance are realized. Figure 7 shows that in broad waters, the AUV actions can be correctly planned based on the algorithm designed in this paper, enabling the AUV to follow the path and avoid obstacles.
In the waterway, the task of transversal crossing was assigned to the AUV. The AUV was greatly influenced by passing ships. The AUV behavior of avoiding moving obstacles was realized. The result is shown in Figure 8. In Figure 7, the yellow polygons represent the known environmental terrain. The green polygons represent sudden obstacles. The blue lines are the path of the global plan. The red lines indicate the AUV's path. In this simulation experiment, the AUV path following and obstacle avoidance are realized. Figure 7 shows that in broad waters, the AUV actions can be correctly planned based on the algorithm designed in this paper, enabling the AUV to follow the path and avoid obstacles.
In the waterway, the task of transversal crossing was assigned to the AUV. The AUV was greatly influenced by passing ships. The AUV behavior of avoiding moving obstacles was realized. The result is shown in Figure 8.  In Figure 8, the yellow polygons represent the known environmental terrain. The blue polygons represent moving obstacles. The blue lines are the path of the obstacles. The red lines indicate the AUV's path. The motion planning of AUV waterway crossing and obstacle avoidance is realized in the simulation experiment. Figure 8 shows that in the waterway, the algorithm designed in this paper can correctly plan AUV's actions, enabling the AUV to navigate and avoid moving obstacles.
In the harbor waters, the task of patrolling was assigned to the AUV. The AUV was influenced by both terrain and stationary and moving obstacles. However, different from the obstacles with regular movement in waterway, the obstacles in harbor waters area were complex and irregular. AUV patrol and obstacle avoidance were realized. The result is shown in Figure 9. In Figure 8, the yellow polygons represent the known environmental terrain. The blue polygons represent moving obstacles. The blue lines are the path of the obstacles. The red lines indicate the AUV's path. The motion planning of AUV waterway crossing and obstacle avoidance is realized in the simulation experiment. Figure 8 shows that in the waterway, the algorithm designed in this paper can correctly plan AUV's actions, enabling the AUV to navigate and avoid moving obstacles.
In the harbor waters, the task of patrolling was assigned to the AUV. The AUV was influenced by both terrain and stationary and moving obstacles. However, different from the obstacles with regular movement in waterway, the obstacles in harbor waters area were complex and irregular. AUV patrol and obstacle avoidance were realized. The result is shown in Figure 9.
In Figure 9, the yellow polygons represent the known environmental terrain. The blue polygons represent moving obstacles. The blue lines are the path of the obstacles. The red lines indicate the AUV's path. In this simulation experiment, the AUV patrol and obstacle avoidance are realized. Figure 9 shows that in harbor waters, the algorithm designed in this paper can correctly plan AUV's actions, enabling the AUV to realize patrol and obstacle avoidance.  In Figure 9, the yellow polygons represent the known environmental terrain. The blue polygons represent moving obstacles. The blue lines are the path of the obstacles. The red lines indicate the AUV's path. In this simulation experiment, the AUV patrol and obstacle avoidance are realized. Figure 9 shows that in harbor waters, the algorithm designed in this paper can correctly plan AUV's actions, enabling the AUV to realize patrol and obstacle avoidance.
To sum up, compared with the traditional algorithm, the algorithm designed in this paper has a great improvement in training efficiency. With the experience gained from algorithm training, the AUV motion planning under three typical environments can be well realized, so as to complete the scheduled tasks. In addition, other traditional methods, such as A* and ant colony algorithm, can only carry out global optimization and are easy to fall into a local minimum. By contrast, the method proposed in this paper can obtain the local obstacle avoidance strategy for AUVs, and the intelligence level of AUVs can be improved., To sum up, compared with the traditional algorithm, the algorithm designed in this paper has a great improvement in training efficiency. With the experience gained from algorithm training, the AUV motion planning under three typical environments can be well realized, so as to complete the scheduled tasks. In addition, other traditional methods, such as A* and ant colony algorithm, can only carry out global optimization and are easy to fall into a local minimum. By contrast, the method proposed in this paper can obtain the local obstacle avoidance strategy for AUVs, and the intelligence level of AUVs can be improved.

Field Tests
The applicability of the algorithm was verified by experiments in the field. The AUV system in the experiment was composed of sensing, planning and control system [16]. The AUV carried out motion planning under the guidance and supervision of global knowledge, and completed the assigned task. As the decision maker of the AUV, the planning system received the information of sensor and navigation system in real time, analyzed and gave corresponding planning instructions, and then sent them to the control system to control the movement of AUV. Within each step, the data of the AUV's current state information (position information, attitude information, planning instructions, etc.) was recorded. Parameters of the AUV were shown in Table 1. The system hardware architecture of the AUV is shown in Figure 10. At the end of the test, the data results were derived to verify the technical indicators. ning system received the information of sensor and navigation system in real time, analyzed and gave corresponding planning instructions, and then sent them to the control system to control the movement of AUV. Within each step, the data of the AUV's current state information (position information, attitude information, planning instructions, etc.) was recorded. Parameters of the AUV were shown in Table 1. The system hardware architecture of the AUV is shown in Figure 10. At the end of the test, the data results were derived to verify the technical indicators.  Area search experiment in broad waters was shown in Figure 11 which was completed in an inland river area in Qinghai Province, China. The experimental area is 1000 m × 1000 m. The maximum water depth is 30 m. The AUV navigated along the predetermined route to complete the search task in the test area. The test results are shown in the Area search experiment in broad waters was shown in Figure 11 which was completed in an inland river area in Qinghai Province, China. The experimental area is 1000 m × 1000 m. The maximum water depth is 30 m. The AUV navigated along the predetermined route to complete the search task in the test area. The test results are shown in the Figure 12. The test results show that, based on the AUV motion planning algorithm designed in this paper, the system outputs planning instructions to the control system in real time, and the control system controls the AUV navigation, which can realize the task of the AUV path following in broad waters.
The test of the AUV following along the wall in the waterway was shown in the Figure 13, which was completed in Zhejiang Province, China. The experimental area is 1500 m × 50 m. The maximum water depth is 8m. The ranging sonar mounted by the AUV was used to detect the distance between the AUV and one side of the waterway in real time, and the distance was kept to 3 m. The ranging sonar mounted by the AUV is shown in Figure 14. The test results are shown in the Figure 14. The sonar parameters are shown in Table 2.  The test results show that, based on the AUV motion planning algorithm designed in this paper, the system outputs planning instructions to the control system in real time, and the control system controls the AUV navigation, which can realize the task of the AUV path following in broad waters.
The test of the AUV following along the wall in the waterway was shown in the Figure 13, which was completed in Zhejiang Province, China. The experimental area is 1500 m × 50 m. The maximum water depth is 8m. The ranging sonar mounted by the AUV was used to detect the distance between the AUV and one side of the waterway in real time, and the distance was kept to 3 m. The ranging sonar mounted by the AUV is shown in Figure 14. The test results are shown in the Figure 14. The sonar parameters are shown in Table 2.  The test results show that, based on the AUV motion planning algorithm designed in this paper, the system outputs planning instructions to the control system in real time, and the control system controls the AUV navigation, which can realize the task of the AUV path following in broad waters.
The test of the AUV following along the wall in the waterway was shown in the Figure 13, which was completed in Zhejiang Province, China. The experimental area is 1500 m × 50 m. The maximum water depth is 8m. The ranging sonar mounted by the AUV was used to detect the distance between the AUV and one side of the waterway in real time, and the distance was kept to 3 m. The ranging sonar mounted by the AUV is shown in Figure 14. The test results are shown in the Figure 14. The sonar parameters are shown in Table 2.
As shown in the Figure 15, based on the motion planning system designed in this paper, the AUV can maintain a distance of about 3 m from one side of the waterway. The reliability of the proposed planning system is verified.
The test of the AUV following along the wall in the waterway was shown in the Figure 13, which was completed in Zhejiang Province, China. The experimental area is 1500 m × 50 m. The maximum water depth is 8m. The ranging sonar mounted by the AUV was used to detect the distance between the AUV and one side of the waterway in real time, and the distance was kept to 3 m. The ranging sonar mounted by the AUV is shown in Figure 14. The test results are shown in the Figure 14. The sonar parameters are shown in Table 2. Figure 13. The test of AUV following along the wall. Figure 13. The test of AUV following along the wall.  The sonar parameters Name：200 kHz-50 kHz Underwater ranging transducer.
As shown in the Figure 15, based on the motion planning system designed in this paper, the AUV can maintain a distance of about 3 m from one side of the waterway. The reliability of the proposed planning system is verified.

Conclusions
In order to improve the AUV's motion planning ability in underwater complex tasks, Figure 14. The ranging sonar. Table 2. The sonar parameters.

The Sonar Parameters
Name  The sonar parameters Name：200 kHz-50 kHz Underwater ranging transducer.
As shown in the Figure 15, based on the motion planning system designed in this paper, the AUV can maintain a distance of about 3 m from one side of the waterway. The reliability of the proposed planning system is verified.

Conclusions
In order to improve the AUV's motion planning ability in underwater complex tasks, this paper proposes a three-layer AUV motion planning architecture. The logical order of AUV underwater motion planning is pointed out. In the motion critic network, the DDPG algorithm is improved. The AUV obstacle avoidance and target following motion plan- Figure 15. The AUV test results for wall tracking.

Conclusions
In order to improve the AUV's motion planning ability in underwater complex tasks, this paper proposes a three-layer AUV motion planning architecture. The logical order of AUV underwater motion planning is pointed out. In the motion critic network, the DDPG algorithm is improved. The AUV obstacle avoidance and target following motion planning ability is trained by combining with classified experience pool. Compared with the traditional DDPG algorithm, the training efficiency of this method is higher. The motion planning simulation test platform of the AUV under three typical environments was built. By using the strategy obtained from training and based on the motion planning architecture designed in this paper, the tasks of the AUV scanning in broad waters, navigating in waterway waters and patrolling in harbor waters were realized in the simulation experiment. The area search test and the waterway following test were carried out in the field. Experimental results were obtained to verify the reliability of the planning system.

Conflicts of Interest:
The authors declare no conflict of interest.