An AUV Target-Tracking Method Combining Imitation Learning and Deep Reinforcement Learning

Abstract: This study aims to solve the problems of sparse rewards and local convergence that arise when a reinforcement learning algorithm is used as the controller of an AUV. Based on the generative adversarial imitation learning (GAIL) algorithm combined with multi-agent training, a multi-agent GAIL (MAG) algorithm is proposed. GAIL enables the AUV to learn directly from expert demonstrations, overcoming the difficulty of slow initial training of the network. Parallel training of multiple agents reduces the high correlation between samples to avoid local convergence. In addition, a reward function is designed to help training. Finally, tests on the Unity simulation platform show that the proposed algorithm has a strong decision-making ability in the tracking process.


Introduction
The autonomous underwater vehicle (AUV) is an unmanned underwater device programmed with a controller to perform its tasks automatically [1]. As an important tool for humans to understand the ocean, AUVs have been widely used for a variety of underwater tasks [2], such as subsea mapping [3], image processing [4-6], pipeline maintenance [7], and field source search [8]. AUVs have also been involved in scientific investigations of the ocean and lakes [9].
Control technology is an important part of AUV design. Linear controllers such as the PID and the linear quadratic regulator are widely used in science and engineering [10,11] but suffer from problems such as slow response and oscillation, which make them unstable under variations in the environment and in the AUV parameters [12,13]. Christudas et al. [14] proposed a non-linear long short-term memory recurrent neural network control algorithm. Zamfirache et al. [15] proposed a control method that combines the deep Q-learning (DQL) algorithm and the meta-heuristic gravitational search algorithm (GSA). Precup et al. introduced a series of data-driven, model-free controllers [16,17].
Reinforcement learning (RL) has developed rapidly in the autonomous control community, with applications such as aircraft control, factory scheduling tasks, and industrial process control [18,19]. Compared with traditional control strategies, RL does not require prior knowledge of environmental model parameters and has strong adaptability and autonomous control capability [20]. The emergence of neural networks has enabled deep learning to achieve tremendous development in hyperspectral image classification, agricultural science, and biomedicine [21][22][23]. Introducing neural networks also solved the problem that traditional RL algorithms such as Q-learning cannot control agents in high-dimensional state and action spaces [24]. In addition, neural networks provide an efficient method for designing non-linear controllers [25]. Schulman et al. [26] proposed the proximal policy optimization (PPO) algorithm using neural networks, which can perform control tasks in a high-dimensional continuous action space. Haarnoja et al. [27] proposed the soft actor-critic (SAC) algorithm, which introduces entropy to make the agent explore as much as possible while completing the task.
A problem in RL is that rewards are sparse in some complex environments [28], which makes it difficult to obtain a positive reward at the beginning of training. Research shows that imitation learning (IL) learns control strategies from demonstration trajectory samples and can solve the sparse reward problem [29]. In IL, the agent can obtain the information needed in the motion control process from the demonstration trajectory samples. IL algorithms have been applied in many fields, including human-computer interaction and machine vision [30,31]. IL is mainly divided into two categories. The first is the behavioral cloning (BC) algorithm, which learns the expected strategy directly from expert demonstrations [32]. However, its generalization ability is poor, and it is only applicable to limited data samples. The other is the inverse reinforcement learning (IRL) algorithm, which is suitable for areas where the reward function is difficult to quantify accurately [33]. However, it takes a long time to rebuild the reward function. By introducing generative adversarial networks (GAN), the generative adversarial imitation learning (GAIL) algorithm can learn strategies directly from expert trajectories without recovering the reward function [34]. This greatly reduces the computational complexity and mitigates the long training times of the IRL algorithm.
Another problem in RL is that the data sample information generated is highly correlated, leading to local convergence. One solution to this problem is to perform random sampling in the experience replay buffer, but this solution is only suitable for off-policy RL [35]. Another solution is multi-agent RL. With the increase in the number of agents, the computational complexity of multi-agent RL also increases. Gupta et al. [36] proposed a parameter sharing trust-region policy optimization algorithm that combines parameter sharing and trust-region policy optimization (TRPO) to solve the problem of computational complexity. Multi-agent RL makes samples come from different AUVs to avoid the correlation of the samples [37].
The main contributions of this research are as follows. We propose a multi-agent GAIL (MAG) algorithm to control AUVs tracking a target. GAIL is used to help training: it enables AUVs to learn directly from expert demonstrations, which solves the sparse reward problem. Multi-agent RL reduces the high correlation of the training samples. Finally, the experiments show that the tracking effect of the proposed algorithm is better than that of PPO, SAC, and GAIL in three different environment models.

Coordinate Systems of AUVs
The coordinate system is shown in Figure 1. The study of the dynamics of the AUV can be divided into two parts: kinematics and kinetics. Kinematics describes the movement of the AUV, and kinetics represents the forces acting on the AUV during the movement. This paper assumes that the target is moving in a two-dimensional plane on the seabed, so the AUV is considered to track the target only in that plane. When describing the motion of the AUV, there are usually two sets of coordinate systems: the inertial coordinate system $O\text{-}XYZ$ and the motion coordinate system $o_a\text{-}uvw$. $(\xi, \vartheta, \zeta)$ represents the roll angle, pitch angle, and yaw angle of the AUV. $(u, v, w)$ are the surge speed, sway speed, and heave speed; $(p, q, r)$ are the roll angular velocity, pitch angular velocity, and yaw angular velocity; and the point $(x_a, y_a, z_a)$ represents the position of the AUV in the inertial coordinate system.
The kinematic equation of the AUV is as follows: where $\eta = [x_a, y_a, z_a]^T$ and $\varphi = [u, v, w]^T$, $M$ is the mass matrix, $C(\varphi)$ is the Coriolis-centripetal matrix, $D(\varphi)$ is the damping matrix, $G(\varphi)$ is the input matrix, $J(\varphi)$ is the transformation matrix, and $\tau$ is the joint torques. The coefficients in $M$, $C(\varphi)$, $G(\varphi)$, and $D(\varphi)$ are given [38]: where $m_1$ to $m_5$, $c_1$, $c_2$, $d_1$ to $d_5$, $g_1$, and $g_2$ are unknown functions of the hydrodynamic coefficients.

Problem Description
As shown in Figure 2, the trajectory of the target covers as much of the pool as possible, and random noise is considered. Therefore, in the experiment, the target is set to circle the pool in periodic motion. In other words, the target moves along the seafloor plane $\{wp_1 \to wp_2 \to wp_3 \to wp_4 \to wp_1 \ldots\}$; the point $(x_g, y_g)$ represents the position of the target in the inertial coordinate system.
The AUV is equipped with a light sensor. The light sensor transmits the position of the target and the position of the AUV to the action-decision system; the strategy network then outputs the action command according to the state of the AUV, and the AUV is finally controlled through the propeller power unit and the bias aerodynamic moment. The AUV may collide with the target during the tracking process, thereby changing the motion state of the target. The target is considered to be tracked by the AUV if the distance between them does not exceed a certain value, that is, $\{(x_a, y_a), (x_g, y_g) \in \mathbb{R}^2 : [(x_a - x_g)^2 + (y_a - y_g)^2]^{1/2} \le \varepsilon\}$. AUV tracking control can therefore be described as designing an algorithm that keeps the distance between the AUV and the target below a certain value.
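The tracking condition can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `is_tracked` and the argument layout are hypothetical, and the threshold value passed as `eps` is an illustrative choice, since the text does not state the value of ε.

```python
import math


def is_tracked(auv_xy, target_xy, eps):
    """Return True when the planar AUV-target distance is at most eps.

    auv_xy and target_xy are (x, y) pairs in the inertial frame;
    eps is the tracking threshold (its value is an assumption here).
    """
    dx = auv_xy[0] - target_xy[0]
    dy = auv_xy[1] - target_xy[1]
    return math.hypot(dx, dy) <= eps


# Illustrative call: the distance between (0, 0) and (3, 4) is 5.
tracked = is_tracked((0.0, 0.0), (3.0, 4.0), eps=5.0)
```

Only the planar coordinates enter the test, which matches the assumption that the target moves in a two-dimensional plane on the seabed.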

Figure 2. AUV tracking target diagram.

State Space and Action Space of AUV
The AUV obtains the observation vector from the state space as the input of the neural network, and the action-decision system provides the appropriate action for the AUV to accurately track the target. Therefore, the observation vector is selected to contain the state information needed to complete the task. To reduce the dimensionality of the state space as much as possible, the data must be independent. As mentioned in the previous section, the state space is the position information of the target and the AUV. In practice, we normalize the observation vector based on its mean and variance, which speeds up the convergence of the neural network. The speed and direction of the AUV during the movement determine the success or failure of the tracking task, so they are set as the action vector, which is also the output of the neural network. It can be expressed as follows: where $u_a$ represents the velocity of the AUV, and $\vec{P}$ represents the direction vector of the AUV towards the target.
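The mean-and-variance normalization of the observation vector can be sketched as follows. This is one possible implementation, not the one used in the experiments: the class name `RunningNormalizer`, the running (Welford) update, and the small constant `eps` are all assumptions introduced for illustration.

```python
class RunningNormalizer:
    """Normalize observations with a running mean and variance (Welford update)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps         # avoids division by zero (assumed value)

    def update(self, obs):
        """Fold one observation vector into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Return (obs - mean) / sqrt(variance + eps), element-wise."""
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((x - self.mean[i]) / (var + self.eps) ** 0.5)
        return out


# Illustrative observation layout (x_a, y_a, x_g, y_g); the ordering is an assumption.
norm = RunningNormalizer(dim=4)
norm.update([0.0, 0.0, 0.0, 0.0])
norm.update([2.0, 2.0, 2.0, 2.0])
scaled = norm.normalize([2.0, 2.0, 2.0, 2.0])
```

A running estimate is convenient in RL because observations arrive step by step and their distribution shifts as the policy improves.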

MAG Path-Planning Algorithm
As shown in Figure 3, MAG consists of three modules. The first is the interaction module. The main function of this module is to interact with the environment to generate samples and output them to the training queue. The second is the sample module, which contains generated samples input from the training queue and expert samples input from the demonstration. The third is the network update module, which updates the discriminant network $D_\psi$ and the generative network $G_\theta$. After the update, it outputs the action sequence to the AUV waiting to execute the action. After the training is completed, the generator network $G_\theta$ is the controller of the AUV.

Markov Decision Process
A Markov decision process is defined by a five-tuple $(S, A, T, R, \gamma)$. $S$ is the state space, which contains all the situations of the agent interacting with the environment; $A$ is the action space, which contains the actions that take the agent from state $s$ to state $s'$; $T$ is the state transition model, where $T(s'|s, a)$ is the probability that the agent performs action $a$ in state $s$ and reaches state $s'$; $R$ is the reward function, which provides rewards when the agent interacts with the environment; and $\gamma$ is the discount factor, which determines the current value of the rewards that the agent will obtain in the future.
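The role of the discount factor can be made concrete with a short computation of the discounted return $\sum_k \gamma^k r_k$. The sketch below is illustrative only; the function name `discounted_return` and the sample reward list are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backwards for stability."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# With gamma = 0.5, three unit rewards are worth 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller $\gamma$ makes the agent weigh near-term rewards more heavily, whereas a value close to 1 makes distant rewards count almost as much as immediate ones.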

Discriminant Network
GAIL defines IL as the problem of matching demonstrations of the expert policy $\pi_E$ by introducing a GAN framework. A demonstration is a series of state-action pairs $\{s_1, a_1, s_2, a_2, \ldots\}$ produced by the agent interacting with the environment. Following the implementation process of generative adversarial networks, a generative network and a discriminant network are built first. The discriminant network $D_\psi$, parameterized by $\psi$, learns to distinguish whether trajectories are formed by experts or non-experts, whereas the generative network $G_\theta$, parameterized by $\theta$, continuously imitates the expert policy until the discriminant network $D_\psi$ cannot distinguish the source of the trajectory. The objective of GAIL is to optimize the function [39]: The discriminant network $D_\psi$ learns to output a high score when encountering a state-action pair from $\pi_E$ and a low score when encountering a state-action pair from generated samples. As $\pi$ gradually approaches $\pi_E$, the scores for the two sources gradually converge.
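The scoring behavior described above corresponds to a binary classification loss. The sketch below uses a standard binary cross-entropy in which expert pairs are labeled 1 and generated pairs are labeled 0; this is a common way to train a GAIL discriminator and is an assumption here, not a transcription of Equation (8). The function name and the score lists are hypothetical.

```python
import math


def discriminator_loss(expert_scores, generated_scores, eps=1e-8):
    """Binary cross-entropy with expert pairs labeled 1 and generated pairs labeled 0.

    Scores are discriminator outputs in (0, 1); eps guards against log(0).
    """
    expert_term = sum(math.log(d + eps) for d in expert_scores) / len(expert_scores)
    generated_term = sum(math.log(1.0 - d + eps) for d in generated_scores) / len(generated_scores)
    return -(expert_term + generated_term)


# A discriminator that separates the two sources has a lower loss
# than one that outputs 0.5 for everything.
separating = discriminator_loss([0.9, 0.9], [0.1, 0.1])
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

When the generated policy becomes indistinguishable from the expert, the best the discriminator can do is output 0.5 everywhere, which is the "confused" case.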

Generative Network
The generative network $G_\theta$ is a strategy network whose input is a state and whose output is an action, so it can generate state-action pairs. It uses PPO to perform generative network updates. Because the step size is limited to a control range, the network can be updated smoothly even when the variance of the reward value is large [26], and the limit of this range is controlled by the Kullback-Leibler (KL) divergence. The generative network can be stably updated by restricting the KL divergence between the new and old generative networks to a certain constraint range, as shown in Equation (9), where $D^{\max}_{KL}(\pi_{old}, \pi)$ is the KL divergence of the old and new strategies, and $\delta$ is the constraint parameter of the KL divergence.
The original GAIL uses the output of the discriminant network as an alternative reward function. To speed up training and to allow comparison with the RL algorithms, an incomplete reward function that covers only a few situations was added. The objective that the policy network needs to optimize is shown in Equation (10), where $E_p$ is the accumulated reward value and $A_\theta(s, a)$ is the advantage function for $\pi_\theta$.
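The KL constraint can be checked numerically for a continuous action space. The sketch below assumes that both the old and the new policies output diagonal Gaussian action distributions, which is a common choice for PPO but is not stated in the text; the function names and the value of `delta` are hypothetical.

```python
import math


def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL(old || new) between diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, sigma_old, mu_new, sigma_new):
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl


def max_kl(old_params, new_params):
    """Largest per-state KL over a batch of (mu, sigma) pairs."""
    return max(
        gaussian_kl(mu0, s0, mu1, s1)
        for (mu0, s0), (mu1, s1) in zip(old_params, new_params)
    )


def within_trust_region(old_params, new_params, delta):
    """True when the maximum KL over the batch does not exceed delta."""
    return max_kl(old_params, new_params) <= delta


# Two states, one action dimension each; only the second mean has moved.
old = [([0.0], [1.0]), ([0.0], [1.0])]
new = [([0.0], [1.0]), ([1.0], [1.0])]
worst = max_kl(old, new)
```

Taking the maximum over states mirrors the $D^{\max}_{KL}$ notation; practical implementations often replace it with the batch mean or with a clipped surrogate objective.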

Multi-Agent Training
AUVs in multiple identical environments share the same set of network parameters, and their decision-making does not affect one another. Two queues are included in the training: the prediction queue and training queue.
The prediction queue inputs all current AUV prediction requests into the generative network. After the prediction is completed, it transmits the requested action to each corresponding AUV that is waiting to execute the action.
The training queue submits the samples of the interaction between the AUV and the environment to the generative network and the discriminant network for parameter updates. Collecting samples in multiple identical environments reduces the correlation between samples and thus helps avoid local convergence in the AUV training process.
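The interplay of the two queues can be sketched with a toy example. Everything below is a simplified stand-in: `ToyEnv` is a hypothetical one-dimensional environment rather than the Unity scene, and `shared_policy` is a single-gain controller standing in for the generative network. The point is only that one set of parameters serves every agent while the samples come from independent environment copies.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical 1-D stand-in for one copy of the simulated environment."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = self.rng.uniform(-1.0, 1.0)

    def step(self, action):
        # Apply the action plus a small disturbance and return the new state.
        self.state += action + self.rng.gauss(0.0, 0.05)
        return self.state


def shared_policy(states, gain=-0.5):
    """One set of parameters (here a single gain) answers every agent's request."""
    return [gain * s for s in states]


def collect(envs, steps):
    """Gather transitions from all environment copies into one training queue."""
    training_queue = deque()
    for _ in range(steps):
        states = [env.state for env in envs]   # batched prediction requests
        actions = shared_policy(states)        # one pass serves all agents
        for agent_id, (env, s, a) in enumerate(zip(envs, states, actions)):
            next_s = env.step(a)
            training_queue.append((agent_id, s, a, next_s))
    return training_queue


envs = [ToyEnv(seed) for seed in range(4)]
queue = collect(envs, steps=8)
```

Because consecutive entries in the queue come from different environment copies, a batch drawn from it mixes trajectories instead of containing one long, highly correlated sequence.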

Algorithm Process
According to the aforementioned methods, the algorithm process can be described as shown in Algorithm 1.

Algorithm 1. The path-plan training process.
Create multiple identical underwater environments in the Unity software, and create an AUV with a light sensor and a target in each environment.
Initialize the training queue and the prediction queue.
Randomly initialize the discriminant network D(s, a|ψ) and the generative network G(s|θ); each AUV shares the KL divergence and the parameters of D(s, a|ψ) and G(s|θ).
for episode = 1 to M:
    for step = 1 to N:
        AUVs input state s to the training queue and wait for the output of an action;
        The generative network G(s|θ) outputs action sequences for the prediction queue and obtains the generative trajectory {τ_1, τ_2, ..., τ_N};
        Batch the generative trajectory {τ_1, τ_2, ..., τ_N};
        step = step + 1;
        if (hit the target or obstacle):
            Calculate the cumulative rewards;
            break;
    end for;
    The discriminant network D(s, a|ψ) scores the generative trajectory {τ_1, τ_2, ..., τ_N};
    Update the generative network G(s|θ) parameters θ_i to θ_{i+1} by maximizing Equation (10);
    Update the discriminant network D(s, a|ψ) parameters ψ_i to ψ_{i+1} by maximizing Equation (8);
    episode = episode + 1;
end for;
Output: the trained generative network to the AUV as a controller.

Environmental Model and Training Parameters
In this section, simulation experiments were used to verify the performance of AUV subsea dynamic target tracking. Unity software was used to visually simulate the underwater environment. The motion model of the target and the ocean current disturbance were implemented as scripts written in C# and Python. The generative network and discriminant network were constructed with the machine learning library PyTorch, and the model was trained on a GPU. The experiment used an AMD Ryzen 7 2700 Eight-Core 3.2 GHz processor (AMD Inc., Santa Clara, CA, USA) and an 8 GB RAM graphics card. The overall experimental environment was an underwater environment of 25.0 × 25.0 × 1.0 m. This paper demonstrates the tracking effect of the controller in three scenes: Scene 1 is without obstacles and disturbance; Scene 2 is without obstacles and with disturbance; and Scene 3 is with four obstacles and without disturbance. The experiment set the center of the underwater environment as the origin (0, 0). The positions of the obstacles were (−10, −10), (−10, 10), (10, −10), and (10, 10). During training, the agent observed the environment using the light sensor; the angle range was [−60°, 60°], and the farthest distance measurement was 1.5 m.
In each round, the initial reward is set to $R_0 = 0$, and the AUV obtains a −0.001 reward per step to speed up training. The round ends when any of the following three situations occurs:
1. The AUV collides with the target and gets a 0.1 reward.
2. The AUV collides with an obstacle and gets a −0.1 reward.
3. The AUV has completed 1000 steps.
It can be expressed as follows: where j = 1, 2, . . . , 1000. The episode ends on collision or tracking, or when the number of iterations j reaches 1000, and the accumulated reward at this time is recorded. Then, the parameters are reinitialized and the next round starts.
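The three termination rules and the per-step penalty can be sketched as a small reward function. The numeric constants come from the description above; the function name, the argument layout, and the choice to apply the step penalty on the terminal step as well are assumptions made for illustration.

```python
STEP_PENALTY = -0.001     # per-step reward that encourages fast tracking
TARGET_REWARD = 0.1       # AUV collides with the target
OBSTACLE_PENALTY = -0.1   # AUV collides with an obstacle
MAX_STEPS = 1000          # step budget per round


def step_reward(hit_target, hit_obstacle, step):
    """Return (reward, done) for step number `step` (1-based).

    Assumption: the step penalty is also applied on the terminal step.
    """
    reward = STEP_PENALTY
    if hit_target:
        return reward + TARGET_REWARD, True
    if hit_obstacle:
        return reward + OBSTACLE_PENALTY, True
    return reward, step >= MAX_STEPS


def episode_return(events):
    """Accumulate rewards over a list of (hit_target, hit_obstacle) flags."""
    total = 0.0
    for step, (hit_target, hit_obstacle) in enumerate(events, start=1):
        reward, done = step_reward(hit_target, hit_obstacle, step)
        total += reward
        if done:
            break
    return total


# Four uneventful steps followed by reaching the target on the fifth.
total = episode_return([(False, False)] * 4 + [(True, False)])
```

The small negative reward on every step means that, among trajectories that reach the target, shorter ones accumulate a higher return.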
The parameters of the algorithm were set according to previous work [40][41][42], and the experimental conditions such as random noise and obstacles were also considered. The parameters of the PPO were set as: the sizes of batch, buffer, and memory were 128, 2048, and 256, respectively; the learning rate was 0.0001; the discount factor was 0.9; the epsilon was 0.2; the hidden layers were 2; and the hidden units were 256. The SAC parameters were as follows: the sizes of batch, buffer, and memory were 128, 2048, and 256, respectively; the discount factor was 0.9; the learning rate was 0.0003; the hidden layers were 2; and the hidden units were 256. The GAIL parameters were as follows: the sizes of batch, buffer, and memory were 64, 2048, and 256, respectively; the discount factor was 0.9; the epsilon was 0.2; the hidden layers were 2; the hidden units were 128; and the learning rate was 0.0001. The MAG parameters were as follows: the sizes of batch, buffer, and memory were 64, 2048, and 256, respectively; the discount factor was 0.9; the epsilon was 0.2; the hidden layers were 2; the hidden units were 128; the learning rate was 0.0003; and the sequence length was 64.
We performed 2000 Monte Carlo simulations for each of the four methods, and the distance between the AUV and the target is defined as follows: where i = 1, 2, . . . , N, and N = 2000 is the number of Monte Carlo simulations. $(x_a^i, y_a^i)$ and $(x_g^i, y_g^i)$ are the corresponding coordinates of the AUV and the target, respectively.
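A distance statistic over the Monte Carlo runs can be computed as in the sketch below. It assumes that the per-run distance is the planar Euclidean distance between the AUV and the target, consistent with the tracking condition in the Problem Description, and that the mean and the population variance are the statistics of interest; the function names and sample coordinates are hypothetical.

```python
import math
import statistics


def run_distances(auv_xy, target_xy):
    """Euclidean AUV-target distance for each Monte Carlo run."""
    return [
        math.hypot(xa - xg, ya - yg)
        for (xa, ya), (xg, yg) in zip(auv_xy, target_xy)
    ]


def summarize(auv_xy, target_xy):
    """Return (mean, population variance) of the per-run distances."""
    d = run_distances(auv_xy, target_xy)
    return statistics.fmean(d), statistics.pvariance(d)


# Two illustrative runs with distances 5 and 10.
mean_d, var_d = summarize([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (6.0, 8.0)])
```

A lower mean indicates that the AUV stays closer to the target on average, and a lower variance indicates more consistent tracking across runs.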

Evaluation Standard
In general, the performance of the AUV controller is based on the cumulative reward, training time, the motion path of the AUV and the target, and the number of collision violations (CV) [43][44][45][46][47]. The reward is obtained by the AUV from the environment during training. The higher the cumulative reward, the more tasks the AUV will complete. The training time represents the computational complexity of a controller. The motion path represents the tracking accuracy of the controller during the entire movement process, and the CV represents the AUV's ability to avoid obstacles.

Results
To verify the effectiveness of the algorithm proposed in this paper, the control algorithms PPO, SAC, and GAIL were selected for comparison. PPO is a classic on-policy RL algorithm that can be used to solve the sequential decision-making problem of unmanned aerial vehicles and non-linear attitude control problems [40,48]. SAC is a representative off-policy RL algorithm that can realize low-level control of quadrotors and map-free navigation and obstacle avoidance of hybrid unmanned underwater vehicles [49,50]. GAIL is a representative IL algorithm that predicts airport-airside motion of aircraft-taxi trajectories and enables mobile robots to learn to navigate in dynamic pedestrian environments in a socially desirable manner [51,52].
From Table 1, it can be seen that the average cumulative reward of MAG is higher than that of PPO, SAC, and GAIL. The target-tracking task was completed more times, indicating that MAG has the better performance in Scene 1. It can also be seen that the variance of MAG is lower than that of PPO and SAC, indicating that MAG makes the training process more stable. The mean and variance of GAIL are slightly lower than those of MAG. Table 2 shows that MAG needs the most training time: it improves obstacle avoidance at the cost of increased computational complexity. As shown in Figure 4, although the four methods finally converge to around 1.0 after training for 100,000 steps, the fluctuation range of the curves indicates that MAG is more stable during training and that its network updates are more stable. In Scene 2, the reward curves of the other three methods oscillate more obviously, indicating that MAG has better anti-disturbance ability under the influence of the ocean current. It can be seen from Figure 4c that the rewards of the four methods are lower than in Scenes 1 and 2 once obstacles are added, but the MAG controller still converges the fastest and updates the most smoothly. Figure 5 shows how the distance between the AUV and the target changes. When there is no noise, the three algorithms can gradually decrease this distance, and MAG brings the AUV closer to the target than PPO and GAIL. After noise interference is added, the distance always stays at a high value when using PPO, whereas the AUV controlled by MAG and GAIL can still track the target well. This shows that the anti-interference ability of MAG and GAIL is better than that of PPO.
In Table 3 In Figure 6a,e, the AUV can effectively track the target using PPO during the 0-300 steps, but the Y-coordinate change of the AUV and the target have a large deviation after 300 steps, indicating that the AUV was unable to track the target during this period. Figure 6b,f shows that the distance between the AUV and the target is always kept at a small value throughout the tracking process except during the 1000-1300 steps. It shows that the training effect of SAC is better than that of PPO. As can be seen from Figure 6c,g, the trajectories of the AUV and the target are more similar using GAIL compared to SAC. In addition, in the whole tracking process, there is a large difference in the Y-coordinate during 1500-1600 steps only, indicating that the GAIL controller has a better tracking effect than SAC. It can be seen from Figure 6d,h that the AUV is always close to the target throughout the tracking process using the MAG controller. Therefore, the tracking accuracy of the MAG controller is the best among the four methods.
In Figure 7a,e, there is a large distance error between the AUV and the target using PPO only during steps 1300-1400. Compared with the trajectory of the AUV and the target in Figure 6a,e, the tracking accuracy of the PPO controller is less affected by noise. Figure 7b,f shows that the AUV loses the target for more time in Scene 2 than in Scene 1, indicating that the tracking ability of the SAC controller deteriorated due to the noise. From Figure 7c,d,g,h, it can be seen that the distance is smaller using GAIL and MAG than using PPO during the whole tracking process. This shows that in Scene 2, both the GAIL and MAG controllers have better tracking accuracy, but compared with GAIL, the MAG controller can bring the AUV closer to the target. This indicates that the MAG controller has the best anti-interference ability.
Figure 6. The coordinate change curves in Scene 1 during the process of AUV tracking the target using PPO, SAC, GAIL, and MAG, respectively: (a-d) the X-coordinate change; (e-h) the Y-coordinate change.
Figure 8 shows the trajectory of the AUV and the target using the four methods in Scene 3. It can be seen that the control effect of the four methods becomes significantly worse than in Scene 1. This suggests that the AUV acts more cautiously when collisions with obstacles are negatively rewarded. In Figure 8a,b,e,f, the AUV hesitated for a long time and went further and further away from the target when using PPO or SAC. When using the GAIL controller, the trajectories of the AUV and the target differ considerably throughout the tracking process, so the AUV is considered unable to complete the task of tracking the target. Compared with the other methods, the MAG controller makes the AUV track the target for the longest time, and its tracking effect is also better than that of the other three methods.
To verify that the proposed method still performs well with less training, we reduced the number of training steps to 20,000. To simplify the experiment, we compared MAG only with PPO and GAIL in Scenes 1 and 2. Figure 9 shows that after the number of training steps was reduced, the tracking accuracy of PPO and GAIL dropped significantly in Scenes 1 and 2, whereas the MAG controller could still enable the AUV to track the target, indicating that the MAG controller already performs well after few training steps.
In Table 4, $(\bar{x}_a, \bar{y}_a)$ is the average value of the coordinates of the AUV, and $(\bar{x}_g, \bar{y}_g)$ is the average value of the coordinates of the target. After the number of training steps is reduced, in terms of distance, the tracking accuracy of MAG is better than that of PPO and GAIL, and the variance is also smaller, indicating that the tracking process is smoother. In terms of coordinates, the coordinates of the AUV using MAG are closer to those of the target, indicating that MAG can train the optimal network earlier.
Figure 10 shows the movement trajectories of the AUV and the target while the AUV tracks the target using the three methods in Scene 1. Figure 10a,d shows that with PPO the distance becomes large and the tracking accuracy is poor. In Figure 10b,e, the distance decreased to a certain extent using GAIL, but the target still could not be tracked effectively. Figure 10c,f shows that the distance is the smallest with MAG, meaning that the tracking task could be completed successfully with a better tracking effect, which indicates that multi-agent training can improve network training.
Figure 10. The coordinate change curves during the process of AUV tracking the target using PPO, GAIL, and MAG, respectively, training 20,000 times in Scene 1: (a-c) the X-coordinate change; and (d-f) the Y-coordinate change.
Figure 11 shows the motion trajectories of the AUV and the target while the AUV tracks the target using the three methods in Scene 2. Figure 11a,b,d,e shows that when the AUV uses PPO or GAIL to track the target, its trajectory is very different from that of the target. In contrast, Figure 11c,f indicates that when the AUV tracks the target using MAG, the gap between the target trajectory and the AUV trajectory is small and the tracking effect is better. This shows that with fewer training steps, adding multi-agent training is a more efficient way to train the AUV and accelerates exploration.
Figure 11. The coordinate change curves during the process of AUV tracking the target using PPO, GAIL, and MAG, respectively, training 20,000 times in Scene 2: (a-c) the X-coordinate change; and (d-f) the Y-coordinate change.
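The quantities reported in Table 4 (average AUV and target coordinates, and the mean and variance of the tracking distance) can be computed from logged trajectories along the following lines. This is a minimal sketch under stated assumptions: the function name and sample data are hypothetical, and the population variance is used because the text does not say which variance estimator was applied.

```python
import math
import statistics

def tracking_summary(auv_xy, target_xy):
    """Summarize one tracking run from per-step (x, y) positions."""
    dists = [math.hypot(ax - tx, ay - ty)
             for (ax, ay), (tx, ty) in zip(auv_xy, target_xy)]
    return {
        # Average coordinates of the AUV and of the target.
        "mean_auv": (statistics.mean(x for x, _ in auv_xy),
                     statistics.mean(y for _, y in auv_xy)),
        "mean_target": (statistics.mean(x for x, _ in target_xy),
                        statistics.mean(y for _, y in target_xy)),
        # Mean tracking distance and its spread (population variance, an assumption).
        "mean_distance": statistics.mean(dists),
        "distance_variance": statistics.pvariance(dists),
    }

# Toy two-step run (hypothetical values): distances are 1.0 and 3.0.
auv = [(0.0, 0.0), (2.0, 0.0)]
target = [(0.0, 1.0), (2.0, 3.0)]
summary = tracking_summary(auv, target)
print(summary)
```

A smaller mean distance corresponds to higher tracking accuracy, and a smaller variance to a smoother tracking process, which is how the comparison between PPO, GAIL, and MAG is read above.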

Conclusions
This paper proposes a multi-agent training algorithm based on PPO combined with a GAN framework to control an AUV tracking a target. The AUV learns directly from expert demonstrations, which addresses the sparse reward problem, and multi-agent training helps it avoid local convergence. Simulation experiments show that when performing dynamic target-tracking tasks, this controller has better tracking accuracy than PPO, SAC, and GAIL. However, the starting position and the complexity of the environment have a certain impact on the performance of the underwater robot when tracking the target. In the future, the proposed method will be further improved to reduce the training time. Real-world experiments are also very important for actual motion control. Since the purpose of this paper is to verify the feasibility and effectiveness of the control algorithm, only random noise interference and four obstacles were added to the planar environment. When the method is applied to an actual underwater system, the change of hydrodynamic coefficients at different depths and other factors need to be considered.