Dynamic Target Tracking of Autonomous Underwater Vehicle Based on Deep Reinforcement Learning

: Due to the unknown motion model and the complexity of the environment, the problem of target tracking for autonomous underwater vehicles (AUVs) became one of the major difﬁculties in model-based controllers. Therefore, the target tracking task of AUV is modeled as a Markov decision process (MDP) with unknown state transition probabilities. Based on actor–critic framework and experience replay technique, a model-free reinforcement learning algorithm is proposed to realize the dynamic target tracking of AUVs. In order to improve the performance of the algorithm, an adaptive experience replay scheme is further proposed. Speciﬁcally, the proposed algorithm utilizes the experience replay buffer to store and disrupt the samples, so that the time series samples can be used for training the neural network. Then, the sample priority is arranged according to the temporal difference error, while the adaptive parameters are introduced in the sample priority calculation, thus improving the experience replay rules. The results conﬁrm the quick and stable learning of the proposed algorithm, when tracking the dynamic targets in various motion states. Additionally, the results also demonstrate good control performance regarding both stability and computational complexity, thus indicating the effectiveness of the proposed algorithm in target tracking tasks.


Introduction
An autonomous underwater vehicle (AUV) is an important tool for exploring the ocean. AUVs possess the ability to complete underwater tasks independently based on the configured sensors and pre-implanted control programs [1,2]. Few underwater tasks require a good target tracking ability, including underwater target reconnaissance, port defense, and underwater docking, etc. [3][4][5][6]. However, during the task execution, the motion of AUV may be easily disturbed by the ocean currents, waves, and other factors [7,8]. Additionally, due to its complex structure, the dynamics of AUVs are nonlinear [9,10]. As a result, the AUV control problem becomes harder to solve.
Most of the methods presented in literature are based on accurate dynamic models for addressing the control problem of AUVs. Prestero et al. [11] established a set of sixdegree-of-freedom kinematic and dynamic models based on the REMUS AUV. Assuming that the inclination angle of an AUV is small, the model is linearized, and a proportionalintegral-differential controller is designed to control its depth. Refsnes et al. [12] designed a three-degree-of-freedom flow field dynamics model and a five-degree-of-freedom AUV model, combined with the flow field dynamics model, the influence of the seabed flow field on AUVs was accurately described, and an output feedback controller was proposed to control the direction and inclination of the AUV tracking target [13]. All the aforementioned control methods perform well in specific tasks, however, their control performance declines, and an accurate dynamic model is difficult to obtain. Unlike the traditional controllers, the controller obtained using reinforcement learning (RL) does not depend on the accuracy of the model, i.e., it can learn through its interactions with the environment [14,15].
In recent years, the reinforcement learning was developed and successfully applied to the UAV control, autonomous driving and path planning tasks [16,17]. In AUV research, Fakdi et al. [18] used a direct policy search method to learn the strategies for tracking the electric cable, for the state space and action space are discrete. Sun et al. [19] discretized the action space of AUV for target tracking. However, it is noteworthy that poor precision discretization is unable to meet the requirements of control precision, and high precision discretization leads to curse of dimensionality [20].
With the development of deep learning, batch learning, and experience replay techniques [21][22][23], deep reinforcement learning shows an excellent performance in continuous control tasks [24]. Mnih et al. [25] proposed a deep Q-learning (DQN) algorithm, which achieves human level control accuracy in various challenging tasks. However, this method is unable to address the problem of high dimensional state space and high dimensional action space simultaneously [26,27]. Based on the actor-critic framework [28], Lillicrap et al. [20] proposed deep deterministic policy gradient (DDPG) algorithm. This algorithm uses the action network to approximate the deterministic strategy and solves the problem that the DQN algorithm is unable to solve directly in a continuous action space.
When applying the RL algorithms in various tasks, it is critical to establish the corresponding Markov decision process (MDP), including the design of action, state, and reward functions [29]. In complex tasks, the reward function design may involve multiple conflicting control objectives, thus making the task very difficult [30,31]. Therefore, it is necessary to design a reward function that can be used for multiple control objectives. Chowdhury et al. [32] converted the multi-objective problem into a single-objective MDP by scalarizing the one-step rewards with a weight, and completed three simulated mission scenarios to elucidate the approach.
The experience replay technique allows the samples to be used for neural network training [33]. However, in underwater tasks involving AUVs, the battery and storage capacity are limited. As a result, it is impossible to collect samples indefinitely. Moreover, the number of collected samples is generally not enough for appropriately training the model. The original experience replay technique uses the equal probability extraction method to replay the samples, resulting in invalid or low-quality training samples [34]. The temporal difference error reflects the impact of samples on the training process [35]. In order to improve sample utilization, this work considers the temporal difference error based on samples for improving the existing experience replay technique based on the actor-critic algorithm. We propose the adaptive experience replay deterministic policy gradient (AERDPG) for addressing the aforementioned issues.
The main contributions of this work are summarized below.
(1). We design reward function for multiple control objectives and formulate MDP for the AUV target tracking task. (2). The use of the experience replay technique for training leads to low sample utilization.
In order to improve the training efficiency, combined with the temporal difference error, an adaptive experience replay rule is proposed. (3). Based on the actor-critic algorithm, a reinforcement learning training framework is built, and the target tracking of AUV is realized combined with the experience replay technique.
The rest of this paper is organized as follows: In Section 2, the AUV motion and target tracking are described. In Section 3, the target tracking task is modeled as MDP, and the state, action, and reward functions are designed. In Section 4, an RL algorithm, namely AERDPG, is applied to solve the MDP. In Section 5, the target tracking simulation experiments are performed, and the proposed algorithm is compared with DDPG and PER DDPG algorithms to verify its effectiveness.

Coordinate Systems of AUVs
The AUV motion description involves two sets of coordinate systems [36,37], namely the inertial coordinate system and motion coordinate system, as shown in Figure 1, where (φ, θ, ψ) represents the roll angle, pitch angle, and yaw angle of the AUV, respectively, (u, v, w) denotes the turbulence speed, sway speed, and heave speed, respectively, (p, q, r) denotes the roll angular velocity, pitch angular velocity, and yaw angular velocity, respectively, and (X e , Y e , Z e ) represents the position of the AUV in the inertial coordinate system. tion experiments are performed, and the proposed algorithm is compared with DDPG and PER DDPG algorithms to verify its effectiveness.

Coordinate Systems of AUVs
The AUV motion description involves two sets of coordinate systems [36,37], namely the inertial coordinate system and motion coordinate system, as shown in Figure   1, where ( ) , , φ θ ψ represents the roll angle, pitch angle, and yaw angle of the AUV, respectively, ( ) represents the position of the AUV in the inertial coordinate system.

Dynamic Model of AUVs
The equations of motion of six-DOF AUVs [38] are mathematically expressed as,  x y z represents the center coordinate of AUV, u , v , w , p , q and r represent the velocities and angular velocities of AUV, u  , v  , w  , p  , q  and r  represent the accelerations and angular accelerations, X , Y and Z represent the surge force, sway force, and heave force, respectively, K , M and N represent the torques.

Dynamic Model of AUVs
The equations of motion of six-DOF AUVs [38] are mathematically expressed as, where m represents the mass, I xx , I yy , I zz , I xy , I xz , and I yz represent the moments of inertia of AUVs' center of gravity relative to its origin, x g , y g , z g represents the center coordinate of AUV, u, v, w, p, q and r represent the velocities and angular velocities of AUV, r represent the accelerations and angular accelerations, X, Y and Z represent the surge force, sway force, and heave force, respectively, K, M and N represent the torques.
The AUV motion is regarded as the sum of the motions in the vertical plane and the surge direction. In this work, the tracking problem is only considered in the vertical plane to simplify the AUV motion. Additionally, the following two common assumptions are made, (1). The AUV center of buoyancy coincides with its center of gravity.
Finally, the AUV motion equation in the vertical plane can be expressed as follows, where Y |v|v , Y . v , Y uv , Z |w|w and Z . w represent the hydrodynamic coefficients and ∆d is the time interval. τ = [Y, Z] T is the input vector of the AUV, due to the drive limitation, the values of Y and Z have saturated boundaries [9], and they can be expressed as follows, where sign(·) represents sign function, Y M and Z M represent the saturated boundary of Y and Z respectively. The output vector of AUV is η = [y, z] T , which is obtained by the visual sensors. y and z are the relative coordinates of the AUV and the tracked target along the Y e and Z e axes in the inertial coordinate system, respectively.

Target Tracking Tasks
As presented in Figure 2, at time t, the target being tracked is located at the bottom right of the AUV. At this time, the AUV is moved to the bottom right, so that the relative position point is returned to the target area. Similarly, the AUV is moved above the left at t + 1. This is the target tracking process in a unit time period. The continuous time period constitutes the continuous target tracking process of the AUV.
Finally, the AUV motion equation in the vertical plane can be expressed as follows, where [ ] , T Y Z τ = is the input vector of the AUV, due to the drive limitation, the values of Y and Z have saturated boundaries [9], and they can be expressed as follows, Y and e Z axes in the inertial coordinate system, respectively.

Target Tracking Tasks
As presented in Figure 2, at time t , the target being tracked is located at the bottom right of the AUV. At this time, the AUV is moved to the bottom right, so that the relative position point is returned to the target area. Similarly, the AUV is moved above the left at 1 t + . This is the target tracking process in a unit time period. The continuous time period constitutes the continuous target tracking process of the AUV.  The tracked target motion is time-varying. p is defined as a constant, which represents the probability of the tracked target changing its motion direction and speed. The magnitude of motion change follows normal distribution. Considering the upper and lower limits of motion, the target dynamic equation is mathematically expressed as follows, where denotes the change in the states of tracked target motion, p 0 is a randomly generated number between 0 and 1, n a ∼ N(0, a) and n b ∼ N(0, b) are the random variables subject to normal distributions, 0 ≤ v obj ≤ v max is the speed of the target being tracked, and 0 ≤ d obj ≤ 2π is the motion direction.

MDP of Target Tracking Problem
The goal of MDP is to find a strategy that maximizes the cumulative discount reward, and strategy π that maps the states to actions. Therefore, the following optimization objective can be expressed as follows, where P represents the strategy space, γ represents the discount factor from 0 to 1, and T represents the maximum step size. Based on the input vector, the action a t at t is defined as follows, in addition, the state s t is defined by the output vector, where sin θ t and cos θ t jointly describe the relative motion speed directions, and v t is the relative speed between t − 1 and t, which is expressed as follows, Due to the complexity and uncertainty of the underwater environment, the state transition model is defined as a black box function as follows, the model shows that, given the state variables and action variables at the current moment, the state variables at the next moment are returned under the interference of noise ∆χ t . Target tracking is a multi-objective control problem that includes tracking, stability, and energy saving. First, in order to perform tracking, the AUV motion is controlled to retain the relative motion within the target area. Second, smooth control is achieved by considering the speed of the relative position point entering the target area. Finally, the AUV energy loss during a mission is considered.
During the initial stage of target tracking, there is a distance between the relative position point and the target area. In order to reduce this distance, energy must be consumed to increase the speed of the AUV. In this process, there is a contradiction between the distance from the position point to the target area and the relative speed. In order to mitigate this contradiction between multiple targets during the initial training stage, a variable κ related to time is introduced. It is used to measure the effect of objective functions on the training process during different periods. The design reward function is mathematically expressed as follows, where ρ 1 , ρ 2 and ρ 3 represent the weight coefficients of each item, −ρ 1 η t T η t represents the minimal distance between the target location point and the target area, −ρ 2 v t represents the minimal relative velocity, and −ρ 3 a T a is the minimal consumed energy. The total reward can be expressed as, where T represents the number of time steps.

Algorithm Structure
Based on the actor-critic framework [28], an algorithm denoted as adaptive experience replay deterministic policy gradient (AERDPG) is proposed. The proposed algorithm is developed by using a dual network structure and experience replay technique, as shown in Figure 3. mitigate this contradiction between multiple targets during the initial training stage, a variable κ related to time is introduced. It is used to measure the effect of objective functions on the training process during different periods. The design reward function is mathematically expressed as follows, where 1 ρ , 2 ρ and 3 ρ represent the weight coefficients of each item, sents the minimal distance between the target location point and the target area, 2 t v ρ − represents the minimal relative velocity, and 3 T a a ρ − is the minimal consumed energy. The total reward can be expressed as, where T represents the number of time steps.

Algorithm Structure
Based on the actor-critic framework [28], an algorithm denoted as adaptive experience replay deterministic policy gradient (AERDPG) is proposed. The proposed algorithm is developed by using a dual network structure and experience replay technique, as shown in Figure 3. As presented in Figure 3, the samples generated based on the interaction with the environment are stored in the adaptive experience replay buffer. The data are then selected from the replay buffer as the input of both the main and the target network. In the main network, the evaluation network estimates the function value Q(s t , a t |θ π ) by applying the temporal difference (TD) method for evaluating both the advantages and disadvantages of the policy network selection action. The evaluation network uses the off-policy method for updating the parameters. The update formula is mathematically expressed as follows, where α Q is the learning rate of the evaluation network, δ t is the temporal difference error (TD error), and a(s t+1 |θ π ) is the output of the policy network, which is responsible for guiding the agent to interact with the environment and update the parameters based on the deterministic policy gradient method [39]. The parameter update formula is expressed as follows, where α π denotes the learning rate of the policy network. In the target network, the parameters of each network are transferred by the main network. The output function of the target network is used to construct the loss function.
In order to solve the existing gradient disappearance problem, the neural network uses the ReLU activation function. Then, an adaptive experience replay technique is proposed to improve the sample utilization.

Neural Network Structure
Considering the nonlinear dynamic characteristics of the AUV, two neural network structures are designed, i.e., the evaluation network and the policy network, with parameters θ Q and θ µ , respectively.
In the neural network structure, either sigmoid or tanh functions are generally used as the activation functions. These activation functions are very sensitive to the near-zero inputs and insensitive to larger inputs. This saturation property leads to gradient disappearance. In other words, under large input, the element gradient approaches zero, which is not beneficial for network training. The ReLU activation function avoids this problem since it only suppresses the changes in one direction, as shown in Figure 4. Furthermore, the simple ReLU activation function accelerates the network calculation, which is in accordance with the requirements of online tracking.
where π α denotes the learning rate of the policy network. In the targ parameters of each network are transferred by the main network. The ou the target network is used to construct the loss function.
In order to solve the existing gradient disappearance problem, the uses the ReLU activation function. Then, an adaptive experience repl proposed to improve the sample utilization.

Neural Network Structure
Considering the nonlinear dynamic characteristics of the AUV, two structures are designed, i.e., the evaluation network and the policy ne rameters Q θ and μ θ , respectively. In the neural network structure, either sigmoid or tanh functions are as the activation functions. These activation functions are very sensitive inputs and insensitive to larger inputs. This saturation property leads to pearance. In other words, under large input, the element gradient ap which is not beneficial for network training. The ReLU activation func problem since it only suppresses the changes in one direction, as sho Furthermore, the simple ReLU activation function accelerates the netw which is in accordance with the requirements of online tracking. Therefore, two networks with ReLU activation functions are designed to approximate the strategy function µ(s|θ µ ) and value function Q s, χ θ Q , as shown in Figure 5. Figure 5a is the policy network structure and Figure 5b is the evaluation network structure.
The policy network has three layers for generating the policies. Its input comprises an amount of state s t and its output consists of an amount of action a t . The policy network has two hidden layers and an output layer. In order to control the actions within a certain interval, the tanh function is used as the activation function of the output layer. The evaluation network has three layers for evaluating the policy. Its input comprises action a t and state s t , which pass through two hidden layers, are then combined, and enter in the third implicit layer. Finally, we obtain the value function by using the linear unit of the output layer.  Figure   5. Figure 5a is the policy network structure and Figure 5b is the evaluation network structure.
(a) (b) The policy network has three layers for generating the policies. Its input comprises an amount of state t s and its output consists of an amount of action t a . The policy network has two hidden layers and an output layer. In order to control the actions within a certain interval, the tanh function is used as the activation function of the output layer. The evaluation network has three layers for evaluating the policy. Its input comprises action t a and state t s , which pass through two hidden layers, are then combined, and enter in the third implicit layer. Finally, we obtain the value function by using the linear unit of the output layer.

Adaptive Experience Replay Technique
The experience replay technique eliminates the requirements for independent samples and their identical distribution during the training process of the network. This is done by disrupting the sample order and eliminating the correlation between the samples. Furthermore, the original experience replay technique replays all the samples in the replay buffer with equal probability, and the extracted samples contain a large number of invalid or low-quality individuals. The TD error guides the parameter update of evaluation and strategy networks simultaneously. Additionally, in the experience replay buffer, TD error generated by different samples after entering the evaluation network varies.

Adaptive Experience Replay Technique
The experience replay technique eliminates the requirements for independent samples and their identical distribution during the training process of the network. This is done by disrupting the sample order and eliminating the correlation between the samples. Furthermore, the original experience replay technique replays all the samples in the replay buffer with equal probability, and the extracted samples contain a large number of invalid or lowquality individuals. The TD error guides the parameter update of evaluation and strategy networks simultaneously. Additionally, in the experience replay buffer, TD error generated by different samples after entering the evaluation network varies. Therefore, different samples affect the back propagation differently-the greater the TD error, the greater the effect on the network back propagation. The original experience replay technique ignores the difference between the TD errors of the samples, resulting in low sample utilization.
In this work, the sample priority is determined by the absolute sample TD error. The samples with higher priority are considered in subsequent training. The replay probability is defined as follows, where L denotes the number of samples. However, during the initial training stage, differences between samples cannot be evaluated properly due to the randomly generated initial parameters of the evaluation network. Thus, the TD error is not accurate, thus making it difficult for samples with a low TD error to be replayed during the initial training stage. As the training process progresses, these samples are replaced by new samples, which degrade the diversity of training samples. Therefore, it is necessary to reduce the impact of TD error on replay probability during the initial training stage. Additionally, the probability of replaying low TD error should be increased to ensure that each sample has a similar replay probability. The probability of replaying samples is mathematically expressed as follows, where α is a constantly changing value used to adapt to the training process and control the priority degree of the TD error. During the initial training stage α = 0, the probability of replaying each sample is the same. As the training progresses, α increases until it reaches the maximum value, i.e., α = 1. The µ is the normalization coefficient used to ensure that the total sample replay probability is always 1, whereas b is an offset value required to improve the probability of replaying low TD error samples during the whole training process.

Algorithm Process
Combined with the adaptive experience replay technique and the aforementioned neural network structure, the AERDPG algorithm was proposed. The major steps involved in the proposed algorithm are presented in Algorithm 1.

Algorithm 1. AERDPG for the target tracking of the AUV
Input: number of episodes M, number of steps for each episode T, batch size N, learning rates for evaluation and policy networks α Q and α µ , respectively, discount factor γ, experience replay buffer capacity L max , target motion change probability p, and current number of samples in experience replay buffer L. Output: Target tracking strategy a t = µ(s t |θ µ ). 1 Initialize the evaluation network Q s t , a t θ Q and the policy network µ(s t |θ µ ). Initialize the current number of samples in the experience replay buffer L. 2 for i = 1 : M do: 3 Reset the initial state s 0 ; 4 for j = 1 : T do: 5 if R < R max : 6 Generate a control strategy a t = µ(s t |θ µ ) + ∆σ with additional exploration noise ∆σ obtained from the current policy network; 7 Reward r t+1 and next state s t+1 are obtained from the current state s t and action a t ; 8 Store quadruples [s t , a t , r t+1 , s t+1 ] into experience replay buffer; 9 end for; 10 else: 11 The priority of each sample is calculated by Equation (17); 12 Select group [s t , a t , r t+1 , s t+1 ] from the experience replay buffer based on P(j); 13 Update the current evaluation network by Equation (14); 14 Update the current policy network by Equation (15); 15 Transfer the current network parameters to the target network after a pre-defined time step; 16 end for; 17 end for; 18 end for.

Experiment Settings
Using the RL algorithm to perform target tracking requires an AUV to explore the environment. Please note that obtaining a large amount of data in the real environment using an AUV is expensive. Therefore, the simulation experiments based on REMUS AUV are performed to obtain the training data. The REMUS AUV was developed and designed by a hydroid company in cooperation with the Naval Research Office. It is one of the widely used AUVs. The specific values of hydrodynamic coefficients [11] are shown in Table 1.
For various target motion changing probabilities p, the performances of DDPG (based on the original experience replay mechanism) and PER DDPG (based on the priority experience replay mechanism) are compared with the proposed AERDPG algorithm. The number of epochs is set to 2000 and there are 200 steps in each epoch. The learning rates of both the policy and evaluation networks are 0.0005, the hidden layers of policy network is 2, the hidden layers of evaluation networks is 3, the hidden unit is 256, the cumulative reward discount factor is γ = 0.99, the experience replay buffer capacity is 100,000, the batch size is 64, and the soft update parameter is τ = 0.01.

Training Process Comparison
The training processes for three algorithms are compared and the total reward value curve is obtained, as shown in Figure 6; obviously, no matter what probability, the target changes speed and direction randomly, and the algorithm in this paper can achieve a higher total reward value in 600 episodes, which verifies the feasibility of the algorithm. changes speed and direction randomly, and the algorithm in this paper can achieve a higher total reward value in 600 episodes, which verifies the feasibility of the algorithm. Moreover, in order to further compare the performance of the proposed algorithm during the training process, various evaluation indicators are introduced. However, unlike supervised learning, the RL has no separate dataset for training and testing, thus making it difficult to evaluate the online training algorithms. Thus, the sample utilization rate is evaluated by increasing the reward value curve speed. As the time required by the training curve to reach a higher reward value decreases, the curve rises quickly, resulting in a higher sample utilization rate. In addition, the average reward and reward variance are used to evaluate the training stability. The larger the average reward, the smaller the reward variance, indicating that the training process is more stable. The experiments show that the AUV achieves a better tracking result when the reward value is −40. Therefore, the number of rounds required to reach the reward value −40 is used as the index to evaluate the sample utilization. Moreover, the average reward Moreover, in order to further compare the performance of the proposed algorithm during the training process, various evaluation indicators are introduced. However, unlike supervised learning, the RL has no separate dataset for training and testing, thus making it difficult to evaluate the online training algorithms. Thus, the sample utilization rate is evaluated by increasing the reward value curve speed. As the time required by the training curve to reach a higher reward value decreases, the curve rises quickly, resulting in a higher sample utilization rate. In addition, the average reward and reward variance are used to evaluate the training stability. The larger the average reward, the smaller the reward variance, indicating that the training process is more stable.
The experiments show that the AUV achieves a better tracking result when the reward value is −40. Therefore, the number of rounds required to reach the reward value −40 is used as the index to evaluate the sample utilization. Moreover, the average reward and its reward variance in both the middle and later training stages are used as the indices to evaluate the training stability. The resulting values are presented in Table 2. By analyzing the aforementioned values, it is evident that the DDPG reward value curve slowly rises, whereas the PER DDPG algorithm performs poorly for the average reward and reward variance. The results imply a certain relationship between the sample utilization and the training sample selection rules. The sample selection affects the training stability during the middle and later stages of the training process. Additionally, the results show that the proposed AERDPG algorithm has advantages in terms of sample utilization and training stability.

Controller Performance Comparison
In order to further verify the performance of the control strategy trained by using the proposed AERDPG algorithm, the tracking experiments using the static and dynamic targets are performed. The control performance is evaluated by changing the relative position between the AUV and the target.
First, a static target with the motion changing probability p = 0 is tracked. The same initial position is set in each experiment to perform the comparison between the experimental results. The initial position is located at the upper left AUV segment and the angle between the connecting line (connecting the target and the center) and the horizontal line is 45 • . As shown in Figure 7, during the initial control stage, the AUV quickly adjusts both the sway and heave force to keep the target in the middle of the relative coordinate system. The relative position in the sway and heave directions approaches zero. However, small amplitude oscillation still occurs due to noise interference. the angle between the connecting line (connecting the target and the center) and the horizontal line is 45 °. As shown in Figure 7, during the initial control stage, the AUV quickly adjusts both the sway and heave force to keep the target in the middle of the relative coordinate system. The relative position in the sway and heave directions approaches zero. However, small amplitude oscillation still occurs due to noise interference. In order to verify the computational complexity and accuracy of the proposed AERDPG controller, steady-state error, overshoot, and regulation time are observed. The attained values are compared with those values that are obtained by using the DDPG and In order to verify the computational complexity and accuracy of the proposed AERDPG controller, steady-state error, overshoot, and regulation time are observed. The attained values are compared with those values that are obtained by using the DDPG and PER DDPG controllers. It should be noted that the regulation time refers to the time required for the controlled object to reach a steady state. The overshoot denotes the difference between the maximum output value and the steady-state value. On the contrary, the steady-state error denotes the difference between the expected and actual output. The comparison results are shown in Table 3. The numerical analysis shows that the AERDPG and DDPG controllers perform better as compared to the PER DDPG controller in terms of steady-state error, overshoot, and regulation time.
In the next step, a tracking experiment using the dynamic target is conducted with the motion change probability p = 0.25. The experimental results are shown in Figure 8. During the initial control stage, the controller makes the target reach the central target area quickly. The relative positions in the sway and heave directions approach zero. Due to the randomness of the tracked target motion, the tracking produces oscillations. However, the target is kept in the central area, thus meeting the tracking control requirements. PER DDPG controllers. It should be noted that the regulation time refers to the time required for the controlled object to reach a steady state. The overshoot denotes the difference between the maximum output value and the steady-state value. On the contrary, the steady-state error denotes the difference between the expected and actual output. The comparison results are shown in Table 3. The numerical analysis shows that the AERDPG and DDPG controllers perform better as compared to the PER DDPG controller in terms of steady-state error, overshoot, and regulation time.
In the next step, a tracking experiment using the dynamic target is conducted with the motion change probability 0.25 p = . The experimental results are shown in Figure 8. During the initial control stage, the controller makes the target reach the central target area quickly. The relative positions in the sway and heave directions approach zero. Due to the randomness of the tracked target motion, the tracking produces oscillations. However, the target is kept in the central area, thus meeting the tracking control requirements. Due to significant curve oscillations, the performance cannot be evaluated based on the steady-state error, overshoot, and adjustment time. Therefore, the mean relative position and its square deviation are used as the indicators for evaluating the performance. The comparison results are shown in Table 4. Table 4. The performance indicators for the three controllers (Bold represents the optimal value). Due to significant curve oscillations, the performance cannot be evaluated based on the steady-state error, overshoot, and adjustment time. Therefore, the mean relative position and its square deviation are used as the indicators for evaluating the performance. The comparison results are shown in Table 4. The numerical analysis shows that when the motion changing probability of the tracked target is p = 0.25, the AERDPG and DDPG controllers provide better mean and mean square deviation (of the relative position) results as compared to the PER DDPG controller.
When tracking the static and dynamic targets, the AERDPG controller controls the AUV motion, ensuring that it is always directly behind the target. Its performance is proven to be an improvement with respect to the PER DDPG controller in terms of speed and stability.

Conclusions
In this paper, a control method based on deep reinforcement learning is proposed for the AUV target tracking. Considering the time variability in the motion of the target being tracked, the tracked target motion model with different probabilities of motion change is established. The reward function is used to establish the MDP of the corresponding problem. Based on the reinforcement learning actor-critic framework, combined with the timing differential error and adaptive parameters to improve the original experience replay technique, the AERDPG algorithm is proposed. The proposed algorithm combines the determination of strategy gradient and timing differential update neural network, and trains the network with adaptive prioritization samples to improve the utilization of samples. The tracking experimental results of targets with different motion change probabilities confirm the quick and stable learning of the AERDPG algorithm. Additionally, the results also show that the AERDPG algorithm is better as compared to the previous DDPG algorithm in terms of training speed, and the trained controller can track the probability of different motion changes.