AUV 3D Path Planning Based on the Improved Hierarchical Deep Q Network

Abstract: This study proposed 3D path planning for an autonomous underwater vehicle (AUV) using a hierarchical deep Q network (HDQN) combined with prioritized experience replay. The path planning task was divided into three layers, which reduced the dimensionality of the state space and mitigated the curse of dimensionality. An artificial potential field was used to design the positive rewards of the algorithm to shorten the training time. According to the different requirements of the task, this study modified the rewards during training to obtain different paths. Path planning simulations and field tests were carried out. The results corroborated that the training time of the proposed method was shorter than that of the traditional method, and the path obtained by simulation training proved to be safe and effective.


Introduction
As a key technology in the marine industry, autonomous underwater vehicles (AUVs) have been given considerable attention and application [1]. They play an important role in the development of marine resources and environmental protection. In civil applications, AUVs are mostly used in underwater topography exploration, hydrological information and water quality detection, underwater wreck detection, and other aspects [2][3][4].
Path planning is one of the keys to the AUV system and has been extensively studied. Ben Li [5] proposed an improved genetic algorithm for the global three-dimensional path planning of an under-actuated AUV, in which the shortest path was obtained by hierarchical path planning. However, the genetic algorithm is prone to premature convergence, and its local optimization ability is poor. J. D. Hernández [6] presented a framework for planning collision-free paths online for AUVs in unknown environments. It was composed of three main modules that incrementally explored the environment while solving start-to-goal queries, and it planned paths for the SPARUS II AUV performing autonomous missions in a two-dimensional workspace. However, the obstacles in the simulation were too simple. Petres [7] designed a continuous-state approach that used an anisotropic fast marching algorithm to complete the AUV path planning task. However, this method only used a linear evaluation function, which had certain limitations. Sun B [8] proposed an optimal fuzzy control algorithm for 3D path planning. Based on the environment information, the virtual acceleration and velocity of the AUV could be obtained through the fuzzy system, so that the AUV could avoid dynamic obstacles automatically. However, because of the subjectivity of fuzzy boundary selection, the generated path could not be guaranteed to be optimal.
There are some problems with the above path planning methods. Currently, artificial intelligence technology is developing rapidly [9] and can greatly improve the intelligence level and autonomy of AUVs [10]. Reinforcement learning has been studied for AUV path planning. Hiroshi et al. [11] proposed a multi-layer training structure based on Q-learning and carried out a planning simulation experiment on the R-ONE vehicle. However, Q-learning is difficult to apply in a continuous environment. Yang and Zhang [12] integrated reinforcement learning with the fuzzy logic method for AUV local planning under a sea flow field. Q-learning was used to adjust the peak points of the fuzzy membership functions, and the behavior recommendations were integrated through adjustable weighting factors to generate the final motion command for AUVs. However, the environment model was simple and unrealistic. Liu [13] used Q-learning for local path planning of AUVs. The simulation was carried out on an electronic chart, but the rewards were not set effectively. Cheng et al. [14] proposed a motion planning method based on DRL, using a CNN to extract the characteristics of sensor information in order to make motion decisions. However, this type of training requires a long period.
In view of the problems in existing studies, this study proposed an improved hierarchical deep Q network (HDQN) method with prioritized experience replay to realize the three-dimensional path planning of AUVs. The AUV path planning task was divided into three layers, and the planning strategy of the AUV was trained layer by layer to reduce the learning time and improve the learning efficiency. Compared with standard practice for AUVs, HDQN had four benefits: (1) HDQN did not require researchers to pre-program tasks; (2) the HDQN method ensured the safety of the AUV by training in simulation; (3) the method solved the curse of dimensionality through the idea of stratification; (4) the method solved the problem of the local optimal solution. This study used a triangular prism mesh to discretize the environmental state model, which increased the choice of horizontal motions to optimize the path and simplified the vertical motions to reduce the environmental state space. Based on the water flow and terrain obstacles, the rewards of the AUV training process were set in detail. Combining prioritized experience replay [15] with the HDQN (HDP) improved the learning rate of AUVs and shortened the learning time. The idea of the artificial potential field was added to the HDP (HDPA) to alleviate the problem of sparse rewards and to further shorten the training time. According to the different requirements of tasks, the rewards were modified to obtain the paths of different selection strategies.
The remainder of this paper is organized as follows: Section 2 designs the path planning algorithm; Section 3 discusses the path planning simulation tests; Section 4 presents a field experiment proving that the path obtained by training is reliable; and Section 5 concludes the study. As shown in Figure 1, in the path planning and control process of AUVs, the global path planning of AUVs is realized in the upper computer. After launching an AUV, the path nodes obtained by global path planning are transmitted to the lower computer by radio [16]. The AUV navigates to the path nodes according to the path following strategy. Furthermore, the AUV detects the surrounding environment and completes the task by using the detection equipment. The target heading, target velocity, and target depth are calculated in the planning system, which sends them to the control system. The control system controls the AUV in navigating according to the target command [17]. In this study, the improved HDQN algorithm was proposed to realize global path planning.

HDQN and Prioritized Experience Replay
HDQN is an improved reinforcement learning algorithm based on hierarchical thinking. Reinforcement learning mimics the learning process of humans: rewards are set artificially, and agents search for the optimal strategy through constant trial and error [18]. Figure 2 depicts reinforcement learning as a process of trial and error [19]. The path planning task is seen as a three-layer model, as shown in Figure 3. The algorithm framework is a bottom-up hierarchical structure consisting of the environment interaction layer, the subtask selection layer, and the root task collaboration layer. The environment interaction layer acquires environment information and interacts with the accumulated experience of the AUV; the accumulated environmental information is stored in the learning experience database. The data is compared with the current state of the marine environment, and the result of the comparison is passed to the root task collaboration layer. The root task collaboration layer generates the subtask decision based on the current environment state and transmits the action sequence to the subtask selection layer. The subtask selection layer receives the output from the root task layer and selects subtasks or actions based on the policy. The AUV selects actions to act on the environment and updates the learning experience based on feedback from the environment.
In conclusion, the hierarchical reinforcement learning method combines a top-down decision-making process with a bottom-up learning process.

HDQN
The Q-learning method was proposed by Watkins in 1989 [20]. It is a model-free, off-policy temporal difference method. Similar to reinforcement learning in general, the idea of Q-learning is to construct a control strategy that maximizes the agent's behavioral performance [21]. Agents process the information perceived from the complex environment. In this process, four elements are required, namely, the action set, the rewards, the environment state set, and the action-utility function. In the training process, the Bellman equation [22] is used to update the Q-table:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$  (1)

In Equation (1), $Q$ is the action-utility function, which evaluates the action taken in the current state; $r$ is the reward; $s$ is the environment state; $a$ is the action; $\alpha$ is the learning rate; and $\gamma$ is the discount factor.
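As a concrete illustration, the tabular update in Equation (1) takes only a few lines. The state/action counts, learning rate, and discount factor below are illustrative values, not the paper's settings.

```python
import numpy as np

# Illustrative sketch of the tabular Q-learning update in Equation (1).
# State/action counts and hyperparameters are example values.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.9            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=1)
```

Repeating this update over many transitions drives the Q-table toward the optimal action-utility function.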
The purpose of dynamic programming is to optimize the evaluation function, that is, to maximize the expectation of infinitely discounted rewards. The optimal evaluation function is defined as:

$V^*(s) = \max_{\pi} E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$  (2)

where $r_t$ represents the reward obtained by the system in selecting an action according to the current environment state, and $E$ is the expectation of the cumulative rewards. According to Equation (2), a sufficient and necessary condition for the optimal evaluation function is:

$V^*(s) = \max_{a} \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$  (3)

where $P(s' \mid s, a)$ is the state transition probability function, which indicates the probability of the system moving to the next state after choosing an action in the current state. The optimal strategy is found by strategy iteration: starting with a random strategy, the strategy is evaluated and improved until the optimal one is found. For any state, the evaluation function of a strategy $\pi$ is calculated as follows:

$V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V^{\pi}(s')$  (4)

In the path planning task, obstacle avoidance and approaching the target are defined as subtasks.
The transition function is defined as $P(s', N \mid s, a)$: starting at state $s$ and performing action $a$ according to the strategy, the system changes state step by step (from $s_j$ to $s_{j+1}$) and terminates at state $s'$ after $N$ time steps with this probability. The evaluation function of a sub-task is defined as:

$V^{\pi}(i, s) = Q^{\pi}(i, s, \pi(s))$  (6)

where $i$ is the sub-task. $Q^{\pi}(i, s, a)$ is defined as the expectation of the accumulated rewards:

$Q^{\pi}(i, s, a) = V^{\pi}(a, s) + \sum_{s', N} P(s', N \mid s, a)\, \gamma^{N} Q^{\pi}(i, s', \pi(s'))$  (7)

where $a$ is the action performed in the sub-task. Equations (6) and (7) are the evaluation functions of the hierarchical algorithm structure.
The neural network is used to approximate the Q function. The states are taken as the inputs, and the Q values of all actions are the outputs of the neural network. According to the Q-learning idea, the action with the maximum Q value is directly selected as the next action. Figure 4 illustrates that, when training the AUV with the neural network, the Q value of the action, calculated by Equation (1), is required first. The Q value is estimated by updating the neural network. Thereafter, the action corresponding to the maximum estimated Q value is selected to obtain rewards from the environment. The estimated Q value is subtracted from the target Q value to obtain the loss. The loss function is as follows:

$L(\theta) = E\left[ \left( \mathrm{TargetQ} - Q(s, a; \theta) \right)^2 \right]$  (8)

where $\theta$ is the parameter of the network. The target Q value, expressed by TargetQ, is:

$\mathrm{TargetQ} = r + \gamma \max_{a'} Q(s', a'; \theta)$  (9)

The loss function is determined on the basis of the target term of the Q-learning update. The gradient of $L(\theta)$ with respect to $\theta$ is then obtained, and the network parameters are updated by SGD [23] or other methods.
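The loss-based update can be sketched with a linear Q approximation $Q(s, \cdot; \theta) = \theta^{\top} s$. The network form, learning rate, and single-sample SGD step here are simplifying assumptions, not the paper's architecture.

```python
import numpy as np

# Sketch of the DQN-style update: one SGD step on
# L(theta) = (TargetQ - Q(s, a; theta))^2, with a linear Q network.
rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 8, 4, 0.9, 0.01
theta = rng.normal(scale=0.1, size=(n_features, n_actions))  # online network
theta_target = theta.copy()                                  # target network

def q_values(params, s):
    return s @ params  # Q(s, a) for every action a

def dqn_step(s, a, r, s_next, done):
    """Compute TargetQ from the target network, then descend the loss gradient."""
    target_q = r if done else r + gamma * np.max(q_values(theta_target, s_next))
    td_error = target_q - q_values(theta, s)[a]
    # Gradient of the squared loss w.r.t. theta for the chosen action is -2*td_error*s.
    theta[:, a] += lr * td_error * s
    return td_error

s = rng.normal(size=n_features)
s_next = rng.normal(size=n_features)
err_before = abs(dqn_step(s, 1, 1.0, s_next, done=False))
err_after = abs(dqn_step(s, 1, 1.0, s_next, done=False))
```

With the target network held fixed, repeating the step on the same transition shrinks the TD error, which is the behavior the loss in Equation (8) is designed to produce.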

Prioritized Experience Replay
Experience replay is an important component of the HDQN. The HDQN uses a matrix named the memory bank to store learning experiences. The structure of each memory is (s, a, r, sʹ). The q-target is determined on the basis of the reward, and the maximum Q value is obtained by the target-net network, whose parameters are copied from the evaluation network at intervals of N steps. The HDQN method stores the transfer samples in the memory bank, and some memories are extracted randomly during training. However, a problem arises with this experience replay method: the learning rate of agents slows down for lack of successful experiences when positive rewards are few in the early stage. A strategy is therefore needed for the learning system to prioritize the good experiences in the memory bank, that is, prioritized experience replay. Prioritized experience replay does not sample randomly but samples according to the priority of the samples in the memory bank, making learning more efficient. The priority of a sample is determined by its td-error, computed from q-eval, which determines the order of learning. The larger the td-error, the lower the prediction accuracy; therefore, such a sample must be learned with high priority. The priority is calculated on the basis of the following equation:

$p = |\delta| + \epsilon$  (10)

where $\delta$ is the td-error and $\epsilon$ is a small positive constant that prevents a zero priority. The method named Sum Tree is used to sample memories from the bank after determining their priority. Sum Tree is a tree structure: each leaf stores the priority p of one sample, each branch node has exactly two children, and the value of a node is the sum of its two children's values. Thus, the top of the Sum Tree is the sum of all values of p. Figure 5 affirms that the lowest leaf layer stores the p of each sample. When sampling, the total priority is divided into intervals according to the batch size, and a random number g is chosen in each interval. Starting from the root, g is compared with the value of the left child (e.g., h1 in Figure 5): if g is not larger, the search descends to the left child; otherwise, the left child's value is subtracted from g, and the search descends to the right child (e.g., from h1 to t1 or t2, and then to g1). Repeating this comparison down to the bottom of the tree selects one leaf sample.
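The Sum Tree traversal described above can be sketched as follows. The capacity and the priorities are illustrative; the paper does not specify its implementation details.

```python
# Minimal sum-tree sketch for prioritized sampling. Leaves hold sample
# priorities; each internal node holds the sum of its two children, so the
# root is the total priority mass.
class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # index 1 is the root; leaves start at `capacity`

    def update(self, idx, priority):
        """Set the priority of leaf `idx` and propagate the sums up to the root."""
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, g):
        """Descend from the root: go left if g fits there, else subtract and go right."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            if g <= self.tree[left]:
                i = left
            else:
                g -= self.tree[left]
                i = left + 1
        return i - self.capacity  # leaf (sample) index

tree = SumTree(4)
for idx, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(idx, p)
# Total priority is 10; any g in (6, 10] lands on the highest-priority leaf.
```

Because a leaf is chosen with probability proportional to its priority, high-td-error samples are replayed more often, which is exactly the prioritization described above.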

Set the Rewards and Actions
The artificial potential field method is introduced to design positive rewards so that the AUV finds the target position more quickly. When the distance between the AUV's current position and the target position is smaller than that between the AUV's previous position and the target position, the AUV is rewarded positively by the following potential function:

$r_p = \dfrac{l_{max} - l}{l_{max}}$  (11)

where $l_{max}$ is the distance between the AUV's start position and the target position, and $l$ is the distance between the AUV's current position and the target position. The influence of water flow on AUV navigation is inevitable. AUVs are affected most by side flow and least by downstream or upstream flow. Therefore, when an AUV is set to navigate in a current, the flow reward is calculated as follows:

$r_c = \left| \cos(\varphi - \psi) \right| - 1$  (12)

where $\varphi$ represents the navigation direction of the AUV and $\psi$ represents the water flow direction; $\varphi$ and $\psi$ are in the same coordinate system, and their value ranges are $[0, 2\pi)$. The rewards are constantly adjusted during the simulation training and are set as follows:

$r = \begin{cases} 10 & \text{reaching the target position} \\ -1.5 & \text{colliding with an obstacle} \\ -0.05 & \text{leaving the specified depth} \\ -0.01 & \text{any other movement} \end{cases}$  (13)

In the process of AUV 3D path planning training, the AUV receives a maximum reward of 10 when it reaches the target position. When the AUV collides with an obstacle, it receives a minimum reward of −1.5. A negative reward of −0.01 is set for each movement of the AUV to avoid reciprocating movement. The AUV usually navigates at a specified depth to complete tasks; therefore, the negative reward for sailing away from the specified depth is set to be larger, that is, the reward for avoiding obstacles in the vertical plane is −0.05.
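The reward terms described above can be gathered into a single function. The weighting constants `k_p` and `k_c` and the exact shapes of the potential-field and flow terms are assumptions for illustration; the paper tunes its own values during simulation training.

```python
import math

# Sketch of the reward shaping described in the text. Constants k_p and k_c
# and the term shapes are illustrative assumptions.
def reward(reached_target, collided, left_depth, l, l_prev, l_max,
           heading, flow_dir, k_p=0.1, k_c=0.05):
    if reached_target:
        return 10.0                # maximum reward at the target position
    if collided:
        return -1.5                # minimum reward for hitting an obstacle
    r = -0.01                      # step cost to discourage reciprocating motion
    if left_depth:
        r -= 0.05                  # leaving the specified depth costs more
    if l < l_prev:                 # potential-field bonus: moved closer to target
        r += k_p * (l_max - l) / l_max
    # Flow term: side flow (heading perpendicular to the current) is penalized most.
    r += k_c * (abs(math.cos(heading - flow_dir)) - 1.0)
    return r
```

For example, a collision-free step that halves the remaining distance while sailing with the current earns the step cost plus the full potential-field bonus.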
The AUV's actions are divided into 14 parts. The AUV moves vertically first and then starts the horizontal navigation. Therefore, the vertical motion of the AUV is divided into two movements: up and down. The horizontal motion of the AUV is divided into 12 actions with 30° of separation between the bow angles to ensure that the AUV has more direction choices in optimizing paths (Figure 6).
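The 14-action set can be written out directly as direction vectors: two vertical motions plus twelve horizontal headings spaced 30° apart. Unit step lengths are assumed for illustration.

```python
import math

# The 14 discrete actions described above: 2 vertical + 12 horizontal.
VERTICAL = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]  # up, down
HORIZONTAL = [
    (math.cos(math.radians(30 * k)), math.sin(math.radians(30 * k)), 0.0)
    for k in range(12)                          # bow angles 0°, 30°, ..., 330°
]
ACTIONS = VERTICAL + HORIZONTAL
```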

Algorithm Process
After confirming the parameters of the HDQN, the 3D path planning for AUVs based on the improved HDQN can be carried out. The training process is summarized in Algorithm 1: the path planning training process.
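Algorithm 1 itself is not reproduced in the text. A self-contained toy sketch of such a training loop follows, using a one-dimensional stand-in environment, a tabular agent, and uniform (rather than prioritized) replay as simplifying assumptions; the real system substitutes the AUV simulator and the prioritized HDQN for these stand-ins.

```python
import random

# Toy sketch of the episode loop: act, store the transition, learn from replay.
class LineEnv:
    """Walk on states 0..4; reaching 4 is the 'target' and yields +10."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                    # a: 0 = left, 1 = right
        self.s = max(0, self.s + (1 if a == 1 else -1))
        done = self.s == 4
        return self.s, (10.0 if done else -0.01), done

class Agent:
    def __init__(self, alpha=0.5, gamma=0.9):
        self.Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
        self.memory = []                  # replay buffer (uniform, for simplicity)
        self.alpha, self.gamma = alpha, gamma
    def act(self, s, eps):
        if random.random() < eps:         # epsilon-greedy exploration
            return random.choice((0, 1))
        return max((0, 1), key=lambda a: self.Q[(s, a)])
    def store(self, *transition):
        self.memory.append(transition)
    def learn(self):                      # sample one stored transition, Q-update
        s, a, r, s2, done = random.choice(self.memory)
        target = r if done else r + self.gamma * max(self.Q[(s2, 0)], self.Q[(s2, 1)])
        self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

def train(env, agent, episodes=200, eps=0.1):
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                   # collision/target ends an episode
            a = agent.act(s, eps)
            s2, r, done = env.step(a)
            agent.store(s, a, r, s2, done)
            agent.learn()
            s = s2
```

After training, the learned Q values favor moving toward the rewarding terminal state, mirroring how the AUV's policy converges toward the target position.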

Experimental Definition
The autonomous underwater vehicle named AUV-R, as shown in Figure 7, developed by the Science and Technology on Underwater Vehicle Laboratory of Harbin Engineering University, was used to perform comb scanning in a river in Qinghai Province, China. The terrain height of the test water area was measured by the altimeter, and the ADCP was used to measure the flow of the experimental water area. Figure 8b shows the height map in grayscale: the lighter the color, the higher the terrain. The terrain was higher near the river bank and became lower away from the shore.
The gray values in Figure 8b were identified by the simulation system to determine the height of each coordinate and establish a three-dimensional environment model. By summarizing existing modeling ideas [24,25], the three-dimensional environment model was discretized, and a triangular prism mesh model was established. Figure 9a shows the horizontal modeling: the environment was modeled in the form of isosceles triangles, and the black points in the figure are grid nodes. The twelve actions in the horizontal plane are shown in Figure 9a by the red points and red lines. Figure 9b shows the three-dimensional state model; some lines were used to connect nodes to make the model clear.


Path Planning
The AUV was simulated and tested in the simulation system. The task was to allow the AUV to navigate safely to the target position at a depth of 10 m. The 3D path planning simulation training was carried out for the AUV using the improved HDQN. The rewards and actions were set in accordance with Section 2.2. The initial position of the AUV was set as the starting point, and the task node of the AUV was set as the target point. The training was set to 100,000 episodes. During the simulation, the AUV colliding with an obstacle or arriving at the target point represented the completion of one episode. Figure 11 and Figure 12 show the simulation results. Figure 11 shows the path from the starting point to the endpoint of the AUV obtained by the simulation system. The improved HDQN algorithm could obtain a suitable path at a fixed depth while avoiding topographic obstacles with little influence from the water flow. Figure 12 shows the accumulated rewards of the AUV in each episode. It took approximately 19,000 episodes before the AUV reached the target position for the first time; after 75,000 training episodes, better results could be obtained. The system selected actions randomly with a 10% probability; therefore, the AUV could not reach the target position every time, even in the later stage of training.
The HDQN algorithm, the HDQN algorithm combined with prioritized experience replay (HDP), and the HDP combined with the artificial potential field method (HDPA) were used to train the path planning of the AUV. The number of times the AUV arrived at the target position during training was recorded, with the results shown in Figure 13. In Figure 13, the x-coordinate represents the number of times the AUV arrived at the target point, and the y-coordinate represents the episode count of the training process. Results showed that the HDP curve had a lower slope than the HDQN curve in the early training period, meaning that the interval between successive arrivals at the target position was shorter with the HDP algorithm: the system made effective use of its successful training experience. In addition, the blue curve in Figure 13 shows that the artificial potential field module enabled the AUV to reach the target position faster and gain successful experience earlier, allowing the system to achieve a suitable path faster. The experiments thus demonstrated that the HDPA algorithm let the AUV reach the target point earlier and gather more useful experience, proving that the learning rate of the HDPA method was higher than that of the other two methods.
To remove the constraint of navigating at a fixed depth of 10 m, the rewards in Equation (13) obtained when the AUV navigated away from the specified depth must be modified. The negative reward for the AUV leaving the target depth must be reduced, as shown in Equation (14), to realize vertical obstacle avoidance. The parameters in Equation (14) were obtained through multiple adjustments during simulation training. A total of 100,000 training episodes were set, and the simulation results are shown in Figure 14 and Figure 15.
In Figure 14, the red line is the path; in Figure 14c, the red and yellow lines represent paths at different depths. Results showed that the AUV chose the vertical obstacle avoidance strategy when meeting terrain obstacles. Figure 15 shows that the AUV reached the target position for the first time at nearly the 18,000th episode. After approximately 50,000 training episodes, the AUV path planning simulation system could obtain good results. Figures 11 and 14 illustrate that the path of the AUV with the horizontal obstacle avoidance strategy was longer than that with vertical obstacle avoidance. However, given the limitations of some tasks, the AUV must keep a constant depth as much as possible. By setting different rewards in the improved HDQN, the AUV could be trained to choose the appropriate path. Overall, the improved HDQN omitted the process of establishing an optimal model of the task, so the optimal solution could be found conveniently and quickly.

Field Experiment
The field experiment, completed in an inland river area of Qinghai Province, China, was carried out to prove that the trained path was reliable. As the river was much longer than it was wide, the AUV was required to keep a set depth and avoid obstacles so as to navigate safely along the river bank and collect information about the test water area efficiently and quickly.
Six collision avoidance sonars were mounted on the AUV's head, tail, and sides to ensure its safety, as shown in Figure 16. They measured the distance from the AUV to obstacles in four directions. When the distance was less than 3 m, the AUV would avoid obstacles. The obstacle avoidance strategy of this experiment adopted the manual experience method:

$\beta'' = \begin{cases} \beta' \pm \pi/18 & \text{when an obstacle is detected within 3 m} \\ \beta & \text{otherwise} \end{cases}$  (15)

where $\beta$ and $\beta'$ are the target and current heading angles of the AUV, $\beta''$ is the planned heading angle of the system, and $\pi/18$ is an empirical value obtained from multiple field tests. The simulation results shown in Figure 11 were used to complete the field experiment of the AUV (Figure 17). After the AUV was launched into the water and the nodes of the path were sent to it, the AUV navigated to the path nodes. The test results are shown in Figure 18. Figure 18 shows that the AUV tracked the planned path after navigating to the first path node. In this experiment, the AUV resurfaced twice to correct its position. Dead reckoning based on the DVL, magnetic compass, depth meter, and altimeter was used to determine the position of the AUV. Table 1 shows the specifications of the DVL. When the AUV rose to the surface, GPS was used to correct its position, as shown in the rectangular box in Figure 18: the AUV's position was corrected from the orange point to the blue point. The distance of this test voyage was short, resulting in small error accumulation, so there was little deviation between the dead-reckoning position and the actual position. Figure 19 records the data collected by sonars 1, 2, and 3; to display the data clearly, readings over 20 m were cut off. Since the field experiment was carried out in inland waters, the underwater conditions were relatively stable. As can be seen from Figure 19, the distance measured by sonar was always more than 3 m, and the AUV sailed in a safe area without triggering the obstacle avoidance strategy.
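The manual-experience avoidance rule can be sketched as a small heading selector. The grouping of sonars into left and right sets and the turn-direction convention are assumptions for illustration; the text only specifies the 3 m threshold and the π/18 offset.

```python
import math

# Sketch of the manual-experience avoidance rule: below the 3 m sonar limit,
# offset the current heading by pi/18 away from the obstacle; otherwise steer
# toward the target heading. Left/right grouping and signs are assumptions.
def planned_heading(target, current, ranges_left, ranges_right, limit=3.0):
    if min(ranges_left) < limit:
        return current + math.pi / 18   # obstacle on the left: turn right
    if min(ranges_right) < limit:
        return current - math.pi / 18   # obstacle on the right: turn left
    return target                       # safe: steer to the target heading
```

In the field test described above, the sonar ranges never dropped below 3 m, so a rule like this would simply keep returning the target heading.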

Conclusion
In this study, a real underwater environment model was established. The HDQN, combined with prioritized experience replay, improved the learning rate of the agent and shortened the learning time. The path planning task was divided into three layers to solve the curse of dimensionality. The idea of an artificial potential field was added to alleviate the problem of sparse rewards. Different paths could be obtained by modifying the reward settings. The field experiment proved that the path obtained by simulation training was safe and effective. However, the framework proposed in this study cannot yet realize real-time obstacle avoidance of AUVs in an unknown environment; this will be the focus of future research.

Acknowledgments:
The author would like to thank the reviewers for their comments on improving the quality of the paper. Special thanks also go to the employees of Harbin Engineering University for their assistance during the field experiments.

Conflicts of Interest:
The authors declare no conflict of interest.