A Reinforcement Learning Approach Based on Automatic Policy Amendment for Multi-AUV Task Allocation in Ocean Current

Abstract: In this paper, the task allocation (TA) problem for multiple autonomous underwater vehicles (AUVs) in an ocean current environment is studied using a novel reinforcement learning approach. First, an ocean current environment with direction and intensity is established and a reward function is designed, under which the AUVs are required to consider the ocean current, the task emergency and the energy constraints to find the optimal TA strategy. Then, an automatic policy amendment algorithm (APAA) is proposed to address the slow convergence of reinforcement learning (RL). In APAA, the task sequences with higher team cumulative reward (TCR) are recorded to construct a task sequence matrix (TSM). After that, the TCR, the subtask reward (SR) and the entropy are used to evaluate the TSM and generate an amendment probability, which adjusts the action distribution to increase the chance of choosing more valuable actions. Finally, simulation results are provided to verify the effectiveness of the proposed approach. APAA also shows better convergence performance than DDQN, PER and PPO-Clip.


Introduction
Recently, with the development of autonomous underwater vehicle (AUV) technology [1,2], AUVs have been widely applied in hunting [3,4], rescue [5], detection [6,7] and other tasks [8-10]. Compared with a single-AUV system, a system of multiple AUVs can handle more complex tasks [11]. Therefore, the problem of cooperation between AUVs has attracted wide attention. Among the many cooperation problems, task allocation (TA) [12] is critical for AUVs to perform tasks successfully. The TA problem for a multi-AUV system in the ocean current is illustrated in Figure 1. Suppose some soluble targets, denoted as $\{T_1, T_2, \ldots, T_5\}$, are drifting in the ocean current as a result of a transport ship accident. The surrounding AUVs, denoted as $\{U_1, U_2, U_3\}$, need to collaborate to complete the task, that is, rescuing the five targets immediately. The AUVs establish a temporary communication network over which they share the locations of the targets as well as the locations and speeds of all AUVs, and a drifting target is salvaged when the total power of the nearby AUVs exceeds its weight. Besides, the AUVs must also consider the impact of the ocean current, the energy consumption and the task emergency, and must avoid collisions with other AUVs. Because of this harsh environment, an optimal salvage strategy is needed to ensure that the AUVs accomplish the task safely and quickly.
At present, the methods for solving the TA problem with multiple constraints mainly include market-based methods [13], swarm intelligence methods [14] and reinforcement learning (RL) [15]. Although the market-based method can find the optimal solution, it places high demands on the real-time performance and communication ability of the multi-AUV system. The swarm intelligence method can find acceptable solutions, but it generalizes poorly and performs poorly when dealing with unknown factors.
RL is an emerging field, which has been widely used in automatic control [16], intelligent decision-making [17], optimization [18], scheduling [19], etc. [20]. Using RL, AUVs Efficient use of samples is commonly used to improve the performance of traditional RL. The experience replay is used to learn existing samples by random sampling, which reduces the correlation between samples while improving sample efficiency [26]. However, each sample has a different influence on learning, and the effect of uniform sampling is very limited. Schaul et al. [27] improved the traditional experience replay method by using TD-error to evaluate the importance of samples in the experience pool (EP), which improved the convergence of the importance sampling reinforcement learning. Horgan et al. [28] presented shared prior experience replay so that RL can learn more data in distributed architecture training. Zhao et al. [29] presented that learning high-return samples can effectively improve the convergence rate. In the algorithm, the authors took the sample sequence that has high-expectation rewards and TD-error as the basis of importance sampling to achieve good results. Zhang et al. [30] proposed an adaptive priority correction algorithm to estimate the real sampling probability by evaluating the predicted TD error and the real TD error of the experience pool. Almost all the methods proposed above use either reward information or error generated by sample training to evaluate the importance.
Others have suggested that sample information of different aspects can also improve performance. M. Ramicic and A. Bonarini [31] improved the learning efficiency by using entropy to quantify the state space and carrying out importance sampling. Yang et al. [32] constructed directed association graph (DAG) by using sample trajectory, and introduced episodic memory and DAG into traditional deep reinforcement learning (DRL) loss, which made DRL learn from different aspects and improved sample utilization rate.
The balance between exploration and exploitation remains challenging. Undirected exploration makes the algorithm converge slowly, whereas excessive use of existing experience usually finds only a non-optimal policy. Pathak et al. [33] proposed the curiosity mechanism to make exploration more efficient. Zhu et al. [34] used dropout regularization to predict the distribution of Q values and selected actions by maximizing over that distribution. This method can effectively evaluate how well the environment has been learned and can trade off exploration and exploitation in a non-stationary environment. Kumra et al. [35] introduced a loss-adjusted exploration strategy that determines candidate actions based on a Boltzmann distribution of loss estimates, ensuring the balance of exploration and exploitation. Other studies assume prior knowledge, which can be used to explore more efficiently and improve performance. Shi et al. [36] decomposed complex tasks into several sub-tasks, solved them separately, and used transfer learning to accelerate the learning of new tasks by combining the prior knowledge of the sub-tasks. Pakizeh et al. [37] constructed the Q tables by sharing knowledge among agents to improve convergence.
Compared with previous research, the contributions of this paper are summarized as follows: (1) In traditional methods, sample reuse extracts experience by learning from samples in the replay buffer, and it cannot directly improve the quality of samples by guiding the behavior of the policy. Furthermore, although experience extracted from samples can improve convergence, the effect depends on how well the experience is extracted. The proposed algorithm extracts useful information from samples and uses it directly in decisions. It is not overly dependent on the training effect and can directly improve the sample quality. (2) Traditional methods based on sample reuse do not take into account the influence of exploitation on policy exploration. The automatic policy amendment algorithm (APAA) considers the balance between exploration and exploitation, and it uses entropy to evaluate the information extracted from samples, aiming to maintain a certain exploration ability in action decision-making and to avoid being trapped in a non-optimal policy. (3) Traditional methods based on importance evaluation generally evaluate the importance of samples by the expected reward, and do not distinguish between samples with the same expected reward under environmental changes. To overcome this shortcoming, a subtask reward evaluation method is incorporated to distinguish the influence of the same reward value on the policy in different situations.
The remainder of this paper is organized as follows. In Section 2, the environment of ocean current and the motions of AUVs are described. Section 3 introduces the related RL algorithm. Sections 4 and 5 give the mathematical description of the reward function and the algorithm design, respectively. The simulation results and efficiency analysis are introduced in Sections 6 and 7, respectively. Finally, a conclusion is presented in Section 8.

Ocean Current Environment
Ocean current is the flow of seawater in the ocean. Proper use of the ocean current can help AUVs save energy and accomplish the task more quickly; otherwise, the current may hinder the completion of the task and even damage the AUVs. The ocean current model is composed of several randomly distributed eddies. Following [38], each eddy is defined by Equation (1), where $p = (p_x, p_y)$ is the central coordinate of the eddy, $(c_x, c_y)$ is the ocean current at $(x, y)$, $a = (a_x, a_y)$ is the intensity coefficient of the eddy, and $a_y$ determines the rotation direction of the eddy. When $a_y$ is positive, the eddy rotates clockwise; otherwise, it rotates counterclockwise. $\mathrm{sgn}(\cdot)$ is the sign function. The ocean flow field is formed by the superposition of $m$ eddies, in which $\mathrm{rand}(\cdot)$ is a random function.

AUVs Model
The multi-AUV system consists of $N_u$ AUVs, denoted by $U = \{U_1, U_2, \ldots, U_{N_u}\}$. Each $U_i \in U$ is modeled as a four-tuple $(up_i, v_i, pow_i, eg_i)$, whose elements represent the position, maximum speed, salvage capability and energy of $U_i$, respectively. The energy loss of $U_i$ in time $t$ is defined to be proportional to the third power of its current propulsion velocity, with $k$ denoting the drag coefficient.
Let $U_i$ have eight discrete directions in its action space, denoted by $D = \{d_1, d_2, \ldots, d_8\}$, as shown in Figure 2. At time $t$, the position of $U_i$ changes with both its own movement and the ocean current $c = (c_x, c_y)$, as shown in Figure 3.
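The motion and energy model described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the unit time step `dt`, the even 45-degree spacing of the eight headings, and the exact form `k * v**3 * dt` of the energy loss are assumptions that are merely consistent with the description (energy loss proportional to the cube of the propulsion speed, position changed by propulsion plus current).

```python
import math

# Eight discrete headings d_1..d_8, assumed to be evenly spaced at 45 degrees.
DIRECTIONS = [(math.cos(i * math.pi / 4), math.sin(i * math.pi / 4))
              for i in range(8)]

def step_position(pos, direction, speed, current, dt=1.0):
    """Position after one step: own propulsion plus ocean-current drift."""
    dx, dy = direction
    cx, cy = current
    return (pos[0] + (speed * dx + cx) * dt,
            pos[1] + (speed * dy + cy) * dt)

def energy_loss(speed, k, dt=1.0):
    """Energy spent in one step, proportional to the cube of propulsion speed."""
    return k * speed ** 3 * dt
```

With zero propulsion speed the AUV only drifts with the current and `energy_loss` returns 0, which matches the stationary action mentioned later in the paper.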

Task Model
A task consists of $N_m$ targets; target $i$ is denoted $T_i$, $i \in \{1, 2, \ldots, N_m\}$, and the target set is $T = \{T_1, T_2, \ldots, T_{N_m}\}$. Each $T_i \in T$ is represented by $(mp_i, weight_i, emerg_i, cpl_i)$, which denote the location, weight, emergency and completion flag of the target, respectively. We assume that the weight and emergency of these soluble targets decrease over time. The weight of $T_i$ at each time step $t$ is updated using $\max(\cdot, \cdot)$, the function that returns the larger of two values, and the weight attenuation coefficient $\alpha$. The emergency of $T_i$ at time $t + 1$ then changes according to $weight_i(t + 1)$. The task model requires the AUVs not only to consider their own energy in the ocean current, but also to salvage the targets as soon as possible, before they dissolve in the water. The distance between $U_i$ and $T_j$ at time $t$ is used to decide which AUVs can take part in a salvage. Let $R_c$ be the salvage radius of the targets. Each $T_j \in T$ that is not fully dissolved in the water is salvaged when the sum of the capabilities of the AUVs within $R_c$ is greater than its weight, as shown in Figure 4, after which $cpl_j$ is set accordingly. The targets, like the AUVs, are affected by the ocean current and drift along with it, as shown in Figure 5.
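The salvage condition of the task model can be illustrated with a short Python sketch. It assumes Euclidean distance and treats an AUV exactly on the boundary of the salvage radius as inside it; the function names and the `(position, capability)` pair layout are hypothetical and chosen only for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def is_salvaged(target_pos, target_weight, auvs, salvage_radius):
    """True when the summed capability of AUVs inside the radius exceeds the weight.

    `auvs` is a list of (position, capability) pairs. A fully dissolved
    target (weight <= 0) can no longer be salvaged.
    """
    if target_weight <= 0:
        return False
    total = sum(cap for pos, cap in auvs
                if distance(pos, target_pos) <= salvage_radius)
    return total > target_weight
```

For instance, two AUVs of capability 2 within the radius can salvage a target of weight 3, while neither of them could do so alone, which is the cooperative case the paper targets.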

Background
Reinforcement Learning (RL)
RL can be described by a Markov decision process (MDP), defined by the five-tuple $(S, A, P, R, \gamma)$, in which $S$ denotes the set of states, $A$ is the set of actions, $P$ denotes the state transition probability, $R$ is a bounded reward function, and $\gamma$ is the discount factor [21]. At each discrete time step $t \in \{1, 2, \ldots\}$, the agent interacts with the environment: it selects action $a_t$ based on the current state $s_t$, receives a reward $r_t$ and transfers to the next state $s_{t+1}$ according to probability $p_t$, aiming to receive the maximum reward in one episode. The sum of the rewards obtained by the policy is the discounted return $\sum_{t} \gamma^{t-1} r_t$.
In this paper, the proposed APAA is based on the DDQN framework, a typical DRL model; DDQN and some related algorithms are then compared with APAA. DDQN, one of the most commonly used variants of DQN, mitigates the overestimation of Q values. The target network is used to evaluate the optimal action of the current network, and the loss function takes the standard DDQN form
$$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma\, Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta^{-}\right) - Q(s_t, a_t; \theta)\right)^2\right], \quad (13)$$
where $\theta$ denotes the current network used for policy selection, and $\theta^{-}$ denotes the target network used for policy evaluation. The target network is usually synchronized with the current network after several iterations.
move at maximum speed or at 70% of maximum speed, and even remain stationary, that is, it does not perform any of its own movement, but only relies on the ocean current to change its position.
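The DDQN update described in this section can be sketched for a single transition as follows. This is a generic illustration of double Q-learning, not code from the paper: the Q values of the next state are passed in as plain lists, and the function names are hypothetical.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Double-DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

def ddqn_loss(q_sa, target):
    """Squared TD error for a single transition."""
    return (target - q_sa) ** 2
```

Separating action selection (online network) from action evaluation (target network) is what reduces the overestimation that plain DQN suffers from when one network does both.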

Reward Model
The reward an AUV received in a time step is composed of four parts, i.e., energy consumption, moving evaluation, collision detection and task completion.
• Energy consumption is determined by the AUV's speed. The faster the AUV moves in a time step, the larger the energy consumption and the lower the reward. In the reward for energy consumption, MAXE is the initial energy of $U_i$ and MAXDE is the energy attenuation ratio at the maximum speed. The energy consumption drops to 0 when the AUV moves only with the ocean current, in which case it receives the maximum energy consumption reward of 0.
• Moving evaluation is determined by whether the AUV is closer to the nearest target than it was at the previous time step.
• Collision detection judges whether an AUV collides with others. An AUV receives a negative reward when it collides with another AUV. In the reward for collision, $\tau$ is a small positive number that represents the minimum safe distance between AUVs, as shown in Figure 6.
• Task completion reward is determined by whether the AUVs salvage the targets. If the AUVs salvage a target at time $t$, all the AUVs receive the reward, which also depends on the emergency of the target.
The reward obtained by $U_i$ in a time step is then calculated from these four parts.

Automatic Policy Amendment Algorithm (APAA)
The proposed APAA records the task sequence of each AUV when the AUVs finish an entire TA. The AUVs add the task sequence to their Task Sequence Matrix (TSM) if they obtain a high TCR in that TA. The TCR serves as the AUVs' reference for each task sequence when making policy decisions in the future. Entropy is used to measure the uncertainty of the TSM to ensure the diversity of the AUVs' learning samples. SR enables the AUVs to balance the TCR against their own reward. The amendment probability associated with the TSM is generated to adjust the AUVs' action probability distribution, improve the sample quality, and accelerate DDQN training.

Task Sequence Matrix (TSM)
The task sequence represents the order in which $U_i$ salvages the targets in each task allocation. In general, the higher the team cumulative reward (TCR) of a task sequence in an environment, the more valuable the sequence is. Let the number of records in the TSM be $N$. Each AUV maintains an $N \times N_m$ matrix to store its own task sequences, and $V_R^i$ denotes the TCR corresponding to row $i$ of the TSM. When a new task sequence emerges, each AUV adds it to the TSM, replacing the sequence with the lowest TCR, if the new sequence meets the update condition, in which $R_i$ is the cumulative reward obtained by $U_i$ during the TA. Table 1 shows the top 10 task sequences generated by the three AUVs after fifty iterations. For each $U_i \in U$ in the table, column $j$ indicates the $j$th target that $U_i$ performs, denoted as $UT^i_j$. For example, $T_1$ and $T_3$ appear most frequently in $UT^1_1$, indicating that $U_1$ can contribute a higher TCR if it performs $T_1$ or $T_3$ in $UT^1_1$.
Table 1. 10 optimal task sequences in the TSMs.
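The TSM update rule can be sketched as a bounded record list that evicts its lowest-TCR entry. This is a simplified reading of the update condition, assuming the TCR of the finished episode is already available as a single number; the `(tcr, sequence)` record layout and the function name are hypothetical.

```python
def update_tsm(tsm, new_sequence, new_tcr, capacity):
    """Store a task sequence in the TSM, evicting the lowest-TCR record if full.

    `tsm` is a list of (tcr, sequence) records and is modified in place.
    Returns True if the new sequence was stored.
    """
    if len(tsm) < capacity:
        tsm.append((new_tcr, list(new_sequence)))
        return True
    worst = min(range(len(tsm)), key=lambda i: tsm[i][0])
    if new_tcr > tsm[worst][0]:
        tsm[worst] = (new_tcr, list(new_sequence))
        return True
    return False
```

Because only sequences that beat the current worst record are admitted, the average TCR of the TSM never decreases once it is full, which is also why the diversity of the stored sequences can shrink over time.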

Automatic Policy Amendment Matrix (APAM)
In a TA, the target selection preference of the AUVs determines the task completion efficiency, as described in Table 1. We wish to use the TSM to influence the AUVs' preference when performing the task. To this end, an $N_m \times N_m$ matrix called APAM is constructed by evaluating the TSM through three indicators: TCR, entropy and SR. Note that $APAM_{i,j}$ represents the probability that an AUV selects $T_i$ when performing $UT^*_j$, where $UT^*_j$ denotes the $j$th target performed by an AUV.
(1) Team Cumulative Reward (TCR)
The TCR is designed to assign different importance weights to the sequences in the TSM. The higher the TCR of a sequence, the greater its influence on the AUVs. The weight $w_{tcr}$ of each sequence in the TSM is defined in terms of $MIN\_R$ and $MAX\_R$, the lowest and highest TCR the AUVs can achieve, respectively. Then, for each $T_i \in T$, the weight corresponding to the AUVs selecting target $T_i$ when performing $UT^*_j$, $j \in \{1, 2, \ldots, N_m\}$, is calculated according to $w_{tcr}$ and stored in $M_{tcr}$, an $N_m \times N_m$ matrix. Finally, Equation (22) is used to transform $M_{tcr}$ into the probability matrix $P_{tcr}$.
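One possible reading of the TCR indicator is sketched below. Two assumptions are made because the corresponding equations are not reproduced here: the sequence weight is taken to be a min-max normalization of the TCR between `min_r` and `max_r`, and the transformation into a probability matrix is taken to be a per-column normalization. Target indices are 0-based in this sketch.

```python
def tcr_probability(tsm, tcr_values, n_targets, min_r, max_r):
    """Build a column-normalized N_m x N_m matrix from TCR-weighted sequences.

    tsm[r][j] is the target index executed j-th in record r;
    tcr_values[r] is the TCR of that record.
    """
    m = [[0.0] * n_targets for _ in range(n_targets)]
    for seq, tcr in zip(tsm, tcr_values):
        w = (tcr - min_r) / (max_r - min_r)  # assumed min-max weight
        for j, target in enumerate(seq):
            m[target][j] += w
    # Normalize each column so it sums to 1 (uniform if the column is empty).
    for j in range(n_targets):
        col = sum(m[i][j] for i in range(n_targets))
        for i in range(n_targets):
            m[i][j] = m[i][j] / col if col > 0 else 1.0 / n_targets
    return m
```

Under these assumptions, entry `[i][j]` of the result can be read as how strongly the high-reward records favor executing target `i` at position `j` of a sequence.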
(2) Entropy
Based on the update rule of the TSM described in Section 5.1, the TSM records any new sequence whose TCR is greater than that of the worst sequence in the TSM. From Table 1, $U_3$ may select $T_1$, $T_3$ or $T_4$ when performing $UT^3_3$. However, as the TSM is updated with new sequences, the diversity of the targets in the TSM may decrease dramatically, and $U_3$ may select only $T_3$ after several iterations. This is undesirable in early training because the policy may converge to a non-optimal one. Entropy is therefore used to measure the effect of the sequences in the TSM on the diversity of the AUVs' behaviors; under the entropy measure, multiple similar records with high reward receive lower weights. The entropy weight of a new sequence is calculated by multiplying the change in the TSM's average TCR by the change in the TSM's entropy after the sequence is added.
For column $k$ of the TSM, the number of occurrences of $T_i$ is counted in an $N_m \times N_m$ matrix $C$, which is transformed into a probability matrix in the same way as Equation (22). The entropy $ie$ is then calculated from this probability matrix. When a new sequence is added, the entropy of the TSM changes. For each sequence in the TSM, the weight under the entropy metric is computed from $er_{old}$ and $er_{new}$, the average TCR of the TSM before and after the new sequence is added, and from $ie_{old}$ and $ie_{new}$, the entropy before and after the new sequence is added.
Note that $M_e$ is transformed into the probability matrix $P_e$ in the same way as Equation (22).
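The entropy indicator can be illustrated as follows. The sketch assumes Shannon entropy with the natural logarithm, summed over sequence positions, and interprets "change" as a difference when forming the entropy weight; both choices are assumptions, since the exact equations are not reproduced here. Target indices are 0-based.

```python
import math

def column_entropy(tsm, n_targets):
    """Sum over sequence positions of the Shannon entropy of target choices."""
    n_positions = max(len(seq) for seq in tsm)
    total = 0.0
    for k in range(n_positions):
        counts = [0] * n_targets
        for seq in tsm:
            if k < len(seq):
                counts[seq[k]] += 1
        n = sum(counts)
        total -= sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return total

def entropy_weight(er_old, er_new, ie_old, ie_new):
    """Assumed reading: change in average TCR times change in entropy."""
    return (er_new - er_old) * (ie_new - ie_old)
```

A TSM whose records all agree has zero entropy, so under this reading a new record that raises the average TCR but reduces diversity receives a negative weight, which is the behavior the text asks for in early training.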
(3) Subtask Reward (SR)
SR is defined as the cumulative reward obtained by an AUV during the salvage of the $i$th target, and it lets the AUV balance the TCR against its own reward based on the actual environment. Let $V^{SR}$ be an $N \times N_m$ matrix whose element $V^{SR}_{i,j}$ represents the cumulative reward for finishing $TSM_{i,j}$. Based on the TSM, the weight of each sequence is calculated from $V^{SR}$ and collected in the matrix $M_{sr}$. Finally, the probability matrix $P_{sr}$ is calculated from $M_{sr}$ by Equation (22).
(4) Probability Weighting
$P_{tcr}$, $P_e$ and $P_{sr}$ are the probability matrices with which each target is selected by the AUVs at each position in the sequence, according to the three different indicators. In addition, if the variance of a subtask reward in the TSM is large, the selected target is considered to have a great influence on the TCR. In this case, $P_{sr}$ receives a high proportion coefficient, calculated as
$$w_1 = \min\left(\eta, \frac{\arctan\left(\mathrm{var}\left(V^{SR}_{*,j}\right)\right)}{\pi}\right), \quad (29)$$
where $V^{SR}_{*,j}$ represents column $j$ of $V^{SR}$, $\min(\cdot, \cdot)$ returns the smaller of two values, $\mathrm{var}(\cdot)$ is the variance of the data set, and $0 < \eta < 1$ restricts the influence of $P_{sr}$. The reward and entropy terms of the TSM records share the same coefficient. The APAM is then obtained as the weighted combination of the three probability matrices.

(5) Probability Prediction
APAM is updated by the new sequences, and the probability changes in the historical experience of the matrix can be used as momentum to predict the future change of APAM, which further accelerates DDQN training. A momentum matrix is constructed, where apam is the change in the momentum of APAM, $0 < w_2 < 1$ and $0 < w_3 < 1$. apam is updated by the proportional coefficient when a new sequence satisfies the update condition of the TSM; otherwise, apam is attenuated by a certain coefficient. The prediction of the new APAM is then computed from this momentum.

Action Conduct by APAM
As shown in Table 2, the APAM is generated from the TSM. Different AUVs clearly have different preferences for the targets. This preference is used to reduce the state-action space and to address the undirected exploration of traditional RL. The Q values are converted into probabilities by softmax, and the probabilities are then amended according to the information provided by the APAM. A proper amendment method leads to good training results. The action distribution reflects the trade-off among the multiple constraints in the current environment, while the probability generated by the APAM only represents which target has a higher priority for execution. The probability of an action is therefore motivated or restrained according to the action distribution rather than the distribution of the APAM. If an AUV's estimate of an action is similar to the expected behavior in the APAM, the action is motivated according to the similarity between the action and the expected direction of the APAM; otherwise, it is not motivated. Since motivating one action implicitly inhibits the others, no additional inhibiting operation is needed for the other actions.
Table 2. The APAM of the three AUVs.
When $U_m$ performs $UT^m_j$, column $j$ of the APAM influences its decision. Let $p_q$ represent the probabilities that $U_m$ moves in each direction. For each $p^k_q$ in $p_q$, the positional relationship between each unfinished target and $U_m$ is calculated, then the cosine similarity between these positional relationships and $d_k$ is computed, and finally $p^k_q$ is amended according to the APAM of $U_m$ by Equation (33), where $\mathrm{cosSim}(\cdot, \cdot)$ represents the cosine similarity of two vectors.
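The amendment step can be sketched as follows. The amendment equation itself is not reproduced here, so the multiplicative boost `p_k * (1 + gain)` followed by renormalization is an assumption chosen to match the verbal description: directions pointing toward targets that the APAM favors are motivated, directions pointing away are left unboosted, and renormalizing implicitly inhibits the rest. All function and parameter names are hypothetical.

```python
import math

def softmax(q):
    """Convert Q values to a probability distribution."""
    m = max(q)
    e = [math.exp(x - m) for x in q]
    s = sum(e)
    return [x / s for x in e]

def cos_sim(u, v):
    """Cosine similarity of two 2-D vectors (0 if either is the zero vector)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def amend(q_values, directions, auv_pos, targets, apam_col):
    """Amend the action distribution with one APAM column.

    targets: {target_index: position} for the unfinished targets.
    apam_col[i]: APAM probability of target i at the current sequence position.
    """
    p = softmax(q_values)
    boosted = []
    for p_k, d_k in zip(p, directions):
        gain = 0.0
        for i, pos in targets.items():
            to_target = (pos[0] - auv_pos[0], pos[1] - auv_pos[1])
            # Only directions that point toward the target are motivated.
            gain += apam_col[i] * max(0.0, cos_sim(to_target, d_k))
        boosted.append(p_k * (1.0 + gain))
    total = sum(boosted)
    return [b / total for b in boosted]
```

With four axis-aligned directions, equal Q values and a single favored target due east, the eastward probability rises from 0.25 to 0.4 and the other three fall to 0.2 each.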

Algorithm Summary
Algorithm 1 gives the algorithm flow of APAA.
    while s_t^i is not a terminal state do
        for i = 1 to N_u do
            Generate the action probability distribution p_q according to state s_t^i;
            if n > M then
                if n % 2 == 0 then
                end if
                for k = 1 to D do
                    Correct p_q^k according to Equation (33);
                end for
            end if
            Choose action a_t^i by ε-greedy according to p_q, and get reward r_t^i;
            Put (s_t^i, a_t^i, r_t^i, s_{t+1}^i) into EP;
        end for
    if n % L == 0 then
        Randomly select a batch of samples from EP;
        Train network θ by Equation (13);
    end if
    Execute θ^- = θ after several iterations;
end for

Simulation Results
In this section, simulation results for APAA are presented and compared with those obtained by DDQN, Prioritized Experience Replay (PER) and Proximal Policy Optimization Clip (PPO-Clip). The simulation is implemented in MATLAB 2018b on a personal computer with an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz and 8 GB of RAM.

Experiment Parameters
The experiment involves a group of three AUVs and five targets distributed in a 10 m × 10 m ocean current region, and the ocean current is modeled by Equation (1). The weight of some targets is greater than the power of a single AUV, so they need to be salvaged by the cooperation of multiple AUVs. In the experimental comparison, all algorithms have the same network parameters and structure and the same initial conditions. Table 3 shows the network and training parameters. An ε-greedy exploration strategy is adopted in action selection. At each step, the AUVs randomly select an action with probability ε, and with probability 1 − ε select the action with the highest expected reward. In addition, the decay factor β is used to make ε decrease with the iterations, which increases the probability of choosing the optimal action. ε is updated after each training of the policy. Parameters of the RL are shown in Table 4. Table 5 shows the parameters used in APAA. The values of these parameters affect the performance of APAA. First, the effect of APAA depends on the experience in the TSM. A small $N$ leads to insufficient experience diversity in the TSM, so that the policy is easily trapped in a non-optimal solution. Second, although η gives the AUVs the ability to balance between the team and themselves, its value should not be large in a collaborative task, as this may cause the task to fail. Finally, $w_2$ and $w_3$ control the changes of APAM, and both should be large because experience collection in the TSM is essentially Monte Carlo sampling, and noise from random sampling can make APAM unstable.

Parameter                  Symbol   Value
Size of the TSM            N        15
P_sr impact factor         η        0.1
APAM attenuation factor    w_2      0.9
APAM update factor         w_3      0.7

Tables 6 and 7 show the attributes of the AUVs and the targets, respectively. In the experiment, the initial positions of the AUVs and the targets are randomly initialized within the ocean current region, the weights of the targets are randomly initialized between 2 kg and 5 kg, the powers of the AUVs are randomly initialized between 1 kg and 4 kg, and the velocities of the AUVs are randomly initialized between 1 m/s and 3 m/s. For convenience, we show one set of parameters. Parameters involved in the TA model are shown in Table 8.

Experiment Result
We compare APAA with DDQN, PER and PPO-Clip in the same scenario and run each of them several times to obtain the average performance. The performance is shown in Figure 7; APAA achieves the best convergence performance among the algorithms for the same number of episodes. The task in this paper is a typical multi-objective optimization problem, which generally results in a large state-action space. DDQN clearly requires a lot of exploration to converge in such a complex task and shows weak stability in convergence. PER, which is based on DDQN, uses the TD-error of the samples to carry out priority sampling, which improves the sample efficiency. The advantage of PER becomes visible after the first 2000 iterations, after which it achieves better performance than DDQN. PPO-Clip is an on-policy algorithm based on Actor-Critic, which improves the intelligence of the AUVs in an adversarial way, and it achieves the weakest performance. In addition, the convergence time of the algorithms is given in Table 9, and APAA has the highest efficiency. By contrast, PPO-Clip struggles to reach high efficiency because it trains two networks. PER uses a sum tree to improve the sampling efficiency, but it still incurs a high time cost in sampling and in updating the priorities of the samples.
From Section 4.3, TCR consists of four goals. The performance on the task completion reward is shown in Figure 8. In the experiment, the maximum reward for salvaging the five targets is 125. APAA obtains the highest task completion reward among the algorithms. Figure 9 shows the performance of the algorithms in energy loss and collision detection. In Table 6, the total energy reserve carried by the AUVs is 1800 J, and APAA consumes only 7% of this energy to salvage all the targets, that is, it plans a better path in the ocean current. In addition, the performance of APAA in collision detection shows a lower probability of collisions while the AUVs execute the task.
Table 10 shows the performance of the algorithms after convergence. An additional metric, time consumption, is provided, indicating the time the AUVs take to complete the task. Figure 10 shows the trajectories of the AUVs and the targets with APAA after training. The hexagonal stars are the starting positions of the AUVs, and the asterisks are the end positions of the AUVs. The squares are the starting positions of the targets, the circles are the positions at which the targets are salvaged, and the arrows show the direction of the ocean current at the corresponding coordinates. As can be seen from the path planned by the AUVs at each step, they always take full advantage of the ocean current moving in the same direction.
this part can be expressed as $O(md)$. Finally, the computational cost of APAA is the sum of the computational complexity of the two parts, i.e., $O(3n^2 + md)$.
Although APAA appears to have a high computational complexity, targets that have already been salvaged are not considered when amending the action distribution. As a result, the computational complexity of the second part decreases as the overall task progresses, until it can almost be ignored. In addition, the samples of RL are only related to the actions of the AUVs at each time step, and it is difficult for the AUVs to extract the relationship between behaviors and task results from the massive number of samples. APAM, however, extracts information related to the task, which effectively accelerates learning, so we believe that such a computational cost is worthwhile. In contrast, the computational complexity of PER depends on the size of the replay buffer and increases dramatically for complex tasks. The parameter sensitivity of PPO-Clip and its need to train two networks result in a higher computational complexity. Therefore, APAA also outperforms PER and PPO-Clip at the same training time.

Conclusions
In this paper, a new RL approach is proposed to solve the task allocation problem for multi-AUV systems in ocean currents. First, the ocean current model and a reward function are constructed. The ocean current, the energy, the task emergency and collisions with other AUVs need to be taken into account when the AUVs perform the task. Many classical RL algorithms improve the efficiency with which traditional samples are used, but traditional samples are not directly related to the task, which makes it difficult for the AUVs to understand how their behavior affects the final result. To overcome this drawback, the Automatic Policy Amendment Algorithm (APAA) is introduced. The TSM is built from the task sequences of each AUV and represents the task preference with which the AUVs obtain the highest TCR. Such task-related information can effectively guide policy learning. After that, the APAM is calculated from the TSM, using TCR, entropy and SR to adjust the decisions of the AUVs. Finally, the simulation results show that APAA accelerates convergence and improves the overall performance compared with DDQN, PER and PPO-Clip. In future work, we will deal with more complex optimal planning tasks in 3D scenarios.