Article

A Reinforcement Learning Approach Based on Automatic Policy Amendment for Multi-AUV Task Allocation in Ocean Current

College of Computer and Cyber Security, Fujian Normal University, Fuzhou 350117, China
* Author to whom correspondence should be addressed.
Drones 2022, 6(6), 141; https://doi.org/10.3390/drones6060141
Submission received: 20 April 2022 / Revised: 26 May 2022 / Accepted: 30 May 2022 / Published: 7 June 2022

Abstract

In this paper, the task allocation (TA) problem for multiple autonomous underwater vehicles (AUVs) in an ocean current environment is studied with a novel reinforcement learning approach. First, an ocean current environment including direction and intensity is established and a reward function is designed, in which the AUVs must account for the ocean current, the task emergency and the energy constraints to find the optimal TA strategy. Then, an automatic policy amendment algorithm (APAA) is proposed to overcome the slow convergence of reinforcement learning (RL). In APAA, the task sequences with higher team cumulative reward (TCR) are recorded to construct a task sequence matrix (TSM). The TCR, the subtask reward (SR) and the entropy are then used to evaluate the TSM and generate an amendment probability, which adjusts the action distribution to increase the chance of choosing the more valuable actions. Finally, simulation results are provided to verify the effectiveness of the proposed approach, whose convergence performance is also better than that of DDQN, PER and PPO-Clip.

1. Introduction

Recently, with the development of autonomous underwater vehicle (AUV) technology [1,2], AUVs have been widely applied in hunting [3,4], rescue [5], detection [6,7] and other tasks [8,9,10]. Compared with a single-AUV system, multiple autonomous underwater vehicles (AUVs) can handle more complex tasks [11]. Therefore, the problem of cooperation between AUVs has attracted wide attention. Among the many cooperation problems, task allocation (TA) [12] is critical for AUVs to perform tasks successfully. The TA problem for a multi-AUV system in an ocean current is illustrated in Figure 1. Suppose some soluble targets, denoted { T 1 , T 2 , … , T 5 }, drift in the ocean current as a result of a transport ship accident; the surrounding AUVs, denoted { U 1 , U 2 , U 3 }, need to collaborate to complete the task, that is, to rescue the five targets immediately. The AUVs establish a temporary communication network through which they share the locations of the targets as well as the locations and speeds of all AUVs, and a drifting target is salvaged when the total power of the nearby AUVs exceeds its weight. In addition, the AUVs are required to consider the impact of the ocean current, energy consumption and task emergency, and to avoid collisions with other AUVs. As a result, due to the tough environment, an optimal salvage strategy is needed to ensure that the AUVs accomplish the task safely and quickly.
At present, the methods for solving the TA problem with multiple constraints mainly include market-based methods [13], swarm intelligence methods [14] and reinforcement learning [15]. Although the market-based method can find the optimal solution, it places high demands on the real-time performance and communication ability of the AUV system. The swarm intelligence method can find acceptable solutions, but it has poor generalization ability and performs poorly when dealing with unknown factors.
RL is an emerging field that has been widely used in automatic control [16], intelligent decision-making [17], optimization [18], scheduling [19], etc. [20]. Using RL, AUVs interact with the environment and obtain feedback signals by trial and error. Good behaviors are reinforced while bad behaviors are weakened by these signals, and RL gradually learns the mapping between states and actions to obtain the optimal policy with the maximum expected reward [21]. Q-learning [22] is one of the classical RL algorithms; it maps state-action pairs to optimal value functions using a Q table. However, Q-learning suffers from the curse of dimensionality, which makes it difficult to handle problems with high-dimensional continuous state spaces. To overcome this difficulty, the deep Q network (DQN) was proposed, which uses neural networks to approximate the optimal value function [23,24]. Double Deep Q-Network (DDQN) [25], one of the most commonly used DQN variants, adopts a dual network structure: the current network is used for policy execution and the target network for policy evaluation, which solves the overestimation problem in DQN.
Efficient use of samples is a common way to improve the performance of traditional RL. Experience replay learns from existing samples by random sampling, which reduces the correlation between samples while improving sample efficiency [26]. However, each sample has a different influence on learning, and the effect of uniform sampling is limited. Schaul et al. [27] improved the traditional experience replay method by using the TD-error to evaluate the importance of samples in the experience pool (EP), which improved the convergence of importance-sampling reinforcement learning. Horgan et al. [28] presented shared prioritized experience replay so that RL can learn from more data in distributed training. Zhao et al. [29] showed that learning from high-return samples can effectively improve the convergence rate; in their algorithm, sample sequences with high expected rewards and high TD-error are taken as the basis of importance sampling, achieving good results. Zhang et al. [30] proposed an adaptive priority correction algorithm that estimates the real sampling probability by evaluating the predicted TD-error and the real TD-error of the experience pool. Almost all the methods above use either reward information or the error generated by sample training to evaluate importance. Others have suggested that sample information of other kinds can also improve performance. M. Ramicic and A. Bonarini [31] improved learning efficiency by using entropy to quantify the state space and carrying out importance sampling. Yang et al. [32] constructed a directed association graph (DAG) from sample trajectories and introduced episodic memory and the DAG into the traditional deep reinforcement learning (DRL) loss, which lets DRL learn from different aspects and improves sample utilization.
The balance between exploration and exploitation remains challenging. Undirected exploration of the space makes the algorithm converge slowly, while excessive use of existing experience usually finds only a non-optimal policy. Pathak et al. [33] proposed a curiosity mechanism to make exploration more efficient. Zhu et al. [34] used dropout regularization to predict the distribution of Q values and selected actions by maximizing over this distribution; the method can effectively evaluate how well the environment has been learned and trade off exploration and exploitation in a non-stationary environment. Kumra et al. [35] introduced a loss-adjusted exploration strategy that determines candidate actions from a Boltzmann distribution over the loss estimate, ensuring a balance of exploration and exploitation. Other studies assume prior knowledge, which can be used to explore the space more efficiently and improve performance. Shi et al. [36] decomposed complex tasks into several sub-tasks, solved them separately, and used transfer learning to accelerate the learning of new tasks by combining the prior knowledge of the sub-tasks. Pakizeh et al. [37] constructed Q tables by sharing knowledge among agents to improve convergence.
Compared with the previous research results, the contributions of this paper are summarized as follows:
(1)
In traditional methods, sample reuse extracts experience by learning samples from the replay buffer and cannot directly improve the quality of the samples by guiding the behavior of the policy. Furthermore, experience extracted from samples can improve convergence, but the effect depends on how the experience is extracted. The algorithm we propose extracts available information from samples and uses it directly in decisions. Our algorithm is not overly dependent on the training effect and can directly improve the sample quality.
(2)
Traditional methods based on sample reuse do not take into account the influence of exploitation on policy exploration. The Automatic Policy Amendment Algorithm (APAA) considers the balance between exploration and exploitation, and uses entropy to evaluate the information extracted from samples, aiming to maintain a certain exploration ability in action decision-making and avoid getting trapped in a non-optimal policy.
(3)
Traditional methods based on importance evaluation generally evaluate the importance of samples by the expected reward and do not distinguish between samples with the same expected reward under environmental changes. To overcome this shortcoming, a subtask reward evaluation method is incorporated to distinguish the influence of the same reward value on the policy under different situations.
The remainder of this paper is organized as follows. Section 2 describes the ocean current environment and the motions of the AUVs. Section 3 introduces the related RL algorithm. Section 4 and Section 5 give the mathematical description of the reward function and the algorithm design, respectively. The simulation results and efficiency analysis are presented in Section 6 and Section 7, respectively. Finally, the conclusion is presented in Section 8.

2. Problem Statement

2.1. Ocean Current Environment

Ocean current is the flow of seawater in the ocean. Proper use of the ocean current can help AUVs save energy and accomplish the task more quickly; otherwise, it may hinder the completion of the task and even damage the AUVs. The ocean current model is composed of several randomly distributed eddies, and it is defined in [38] as
$\mathrm{eddy}\{p,a\}:\; f(x,y) = (x - p_x)^2 + (y - p_y)^2,$  (1)
$c_x = \left[-|a_x|\frac{\partial f}{\partial x} - |a_y|\frac{\partial f}{\partial y}\right]\frac{1}{2\sqrt{f}},$  (2)
$c_y = \left[\mathrm{sgn}(a_y)\,|a_x|\frac{\partial f}{\partial x} - |a_y|\frac{\partial f}{\partial y}\right]\frac{1}{2\sqrt{f}},$  (3)
where p = ( p_x , p_y ) is the central coordinate of the eddy, ( c_x , c_y ) is the ocean current at ( x , y ), a = ( a_x , a_y ) is the intensity coefficient of the eddy, and a_y determines the rotation direction of the eddy: when a_y is positive, the eddy rotates clockwise; otherwise, it rotates counterclockwise. sgn(·) is the sign function.
The ocean flow field is formed by the superposition of m eddies as
$F = \sum_{i=1}^{m} \mathrm{eddy}\{\mathrm{rand}(p_i), \mathrm{rand}(a_i)\},$  (4)
where r a n d ( . ) is a random function.
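A minimal sketch (not the authors' code) of how the flow field in Equations (1)-(4) can be sampled. The operators inside the single-eddy expressions are only partly recoverable from the source, so the exact signs in Equations (2)-(3) are assumptions here; only the superposition of randomly placed eddies follows Equation (4) directly.

```python
import numpy as np

def eddy_velocity(x, y, p, a):
    """Current (c_x, c_y) induced at (x, y) by one eddy with centre p and intensity a.
    The signs inside the brackets are assumptions; sgn(a_y) sets the rotation direction."""
    px, py = p
    ax, ay = a
    f = (x - px) ** 2 + (y - py) ** 2 + 1e-9             # Equation (1), regularised at the centre
    dfdx, dfdy = 2.0 * (x - px), 2.0 * (y - py)
    scale = 1.0 / (2.0 * np.sqrt(f))
    cx = (-abs(ax) * dfdx - abs(ay) * dfdy) * scale                  # Equation (2), reconstructed
    cy = (np.sign(ay) * abs(ax) * dfdx - abs(ay) * dfdy) * scale     # Equation (3), reconstructed
    return np.array([cx, cy])

def flow_field(x, y, eddies):
    """Superpose m randomly generated eddies, Equation (4)."""
    return sum(eddy_velocity(x, y, p, a) for p, a in eddies)

# usage: three random eddies in the 10 m x 10 m region used in Section 6
rng = np.random.default_rng(0)
eddies = [(rng.uniform(0, 10, size=2), rng.uniform(-1, 1, size=2)) for _ in range(3)]
c = flow_field(4.0, 6.0, eddies)                         # local current vector at (4, 6)
```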

2.2. AUVs Model

The multi-AUV system consists of N_u AUVs and is denoted by U = { U_1 , U_2 , … , U_{N_u} }. Each U_i ∈ U is modeled as a quadruple ⟨ up_i , v_i , pow_i , eg_i ⟩, whose elements represent U_i's position, maximum speed, salvage capability and energy, respectively. We define the energy loss of U_i at time t as proportional to the third power of its current propulsion velocity,
$LE_i(t) = k\,\|v_i(t)\|_2^3, \quad i \in \{1,2,\ldots,N_u\},$  (5)
where k is the drag coefficient.
U_i has eight discrete movement directions in its action space, denoted D = { d_1 , d_2 , … , d_8 }, as shown in Figure 2. At time t, U_i's position changes with its own movement and the ocean current, and is calculated as
$up_i(t+1) = up_i(t) + v_i(t) + c,$  (6)
where c = ( c_x , c_y ), as shown in Figure 3.
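The following sketch (our own naming, not code from the paper) combines Equations (5) and (6) into a single per-step update for one AUV; the drag coefficient value is taken from Table 8.

```python
import numpy as np

# unit vectors for the eight discrete headings d_1, ..., d_8 (Figure 2)
DIRECTIONS = [np.array([np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)]) for k in range(8)]

def step_auv(up, speed, heading_idx, current, k_drag=3.425):
    """Advance one AUV one time step under its own propulsion plus the local current."""
    v = speed * DIRECTIONS[heading_idx]                 # chosen propulsion velocity v_i(t)
    energy_loss = k_drag * np.linalg.norm(v) ** 3       # Equation (5): LE_i(t) = k ||v_i(t)||^3
    new_up = up + v + current                           # Equation (6): up_i(t+1) = up_i(t) + v_i(t) + c
    return new_up, energy_loss

# usage: U_1 from Table 6 moving north-east through a weak current
up, loss = step_auv(np.array([5.0, 1.0]), speed=1.0, heading_idx=1, current=np.array([0.1, -0.05]))
```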

2.3. Task Model

A task consists of N_m targets. Target i is defined as T_i, i ∈ {1, 2, …, N_m}, and the target set is T = { T_1 , T_2 , … , T_{N_m} }. Each T_i ∈ T is represented by ⟨ mp_i , weight_i , emerg_i , cpl_i ⟩, which denote the location, weight, emergency and completion flag of the target, respectively. We assume that the weight and emergency of these soluble targets decrease over time. The weight of T_i at each time step t evolves as
$weight_i(t+1) = \max\!\big(0,\; weight_i(t) - \alpha\, weight_i(0)\big), \quad i \in \{1,2,\ldots,N_m\},$  (7)
where max(·,·) returns the larger of its two arguments and α is the weight attenuation coefficient.
Then, the emergency of T_i at time t + 1 changes with weight_i(t+1), defined by
$emerg_i(t+1) = \frac{weight_i(t+1)}{weight_i(0)}, \quad i \in \{1,2,\ldots,N_m\}.$  (8)
The task model requires the AUVs not only to manage their own energy in the ocean current, but also to salvage the targets as quickly as possible, before they dissolve in the water. The distance between U_i and T_j at time t is calculated as
$dist_{i,j}(t) = \|up_i(t) - mp_j(t)\|_2, \quad i \in \{1,2,\ldots,N_u\},\; j \in \{1,2,\ldots,N_m\}.$  (9)
Let R_c be the salvage radius of the targets. Each T_j ∈ T that has not fully dissolved in the water is salvaged when the sum of the capabilities of the AUVs within R_c is greater than its weight, as shown in Figure 4. Accordingly, cpl_j is given by
$cpl_j = \begin{cases} 1 & \text{if } \sum_{i:\, dist_{i,j}(t) < R_c} pow_i > weight_j(t) \\ 0 & \text{otherwise} \end{cases}, \quad i \in \{1,2,\ldots,N_u\},\; j \in \{1,2,\ldots,N_m\}.$  (10)
Not only the AUVs but also the targets are affected by the ocean current, which makes them drift along with it, as shown in Figure 5. The movement model of a target is
$mp_i(t+1) = mp_i(t) + c, \quad i \in \{1,2,\ldots,N_m\}.$  (11)
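A sketch of the target-side dynamics and the salvage test (Equations (7)-(11)); the function names and data layout are ours, and the parameter values come from Table 8.

```python
import numpy as np

def step_target(mp, weight, weight0, current, alpha=0.01):
    """One time step of a drifting, slowly dissolving target."""
    weight = max(0.0, weight - alpha * weight0)         # Equation (7)
    emerg = weight / weight0                            # Equation (8)
    mp = mp + current                                   # Equation (11): targets drift with the flow
    return mp, weight, emerg

def is_salvaged(target_mp, target_weight, auv_positions, auv_powers, r_c=0.5):
    """Equation (10): salvaged when the summed power of the AUVs within R_c exceeds the weight."""
    dists = [np.linalg.norm(up - target_mp) for up in auv_positions]   # Equation (9)
    power_nearby = sum(p for p, d in zip(auv_powers, dists) if d < r_c)
    return power_nearby > target_weight
```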

3. Background

Reinforcement Learning (RL)

RL can be described by a Markov decision process (MDP), defined by the five-tuple ( S , A , P , R , γ ), in which S denotes the set of states, A the set of actions, P the state transition probability, R a bounded reward function, and γ the discount factor [21]. The agent interacts with the environment at each discrete time step t ∈ {1, 2, …}: it selects an action a_t based on the current state s_t, receives a reward r_t and moves to the next state s_{t+1} according to the transition probability, aiming to receive the maximum reward in one episode. The cumulative reward obtained by the policy is expressed as
$G_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}.$  (12)
The proposed APAA is built on the DDQN framework, a typical DRL model; DDQN and some related algorithms are later compared with APAA. DDQN, one of the most commonly used variants of DQN, solves the overestimation of the Q value: the target network is used to evaluate the optimal action selected by the current network, and the loss function is defined as
$L(\theta) = \Big( r_t + \gamma\, Q_{\theta'}\!\big(s_{t+1}, \arg\max_a Q_{\theta}(s_{t+1}, a)\big) - Q_{\theta}(s_t, a_t) \Big)^2,$  (13)
where θ is the current network used for policy selection and θ′ is the target network used for policy evaluation. The target network is usually synchronized with the current network after several iterations.
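For reference, a compact sketch of the double-Q target in Equation (13). Here `q_net` and `q_target` are stand-ins for the current and target networks and are assumed to return a vector of Q values for a state; the discount factor value is illustrative, not taken from the paper.

```python
import numpy as np

def ddqn_td_error(q_net, q_target, s, a, r, s_next, done, gamma=0.95):
    """Squared TD error with the double-Q target: the action is chosen by the current
    network and evaluated by the target network (Equation (13))."""
    if done:
        y = r                                           # no bootstrap at a terminal state
    else:
        a_star = int(np.argmax(q_net(s_next)))          # arg max_a Q_theta(s_{t+1}, a)
        y = r + gamma * q_target(s_next)[a_star]        # evaluated with Q_theta'
    return (y - q_net(s)[a]) ** 2
```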

4. Model Design

4.1. State Model

Assume that each AUV has access to the direction of the ocean current, all of its own state information, the states of the other AUVs, and the states of all targets. The state perceived by U_i at time t is expressed as s_t^i = { us_t^i , gs_t^i , ts_t^i , ed_t^i }, in which us_t^i = ⟨ up_i , v_i , pow_i , eg_i ⟩ represents the state of U_i at time t, gs_t^i = { ⟨ up_1 , v_1 ⟩ , ⟨ up_2 , v_2 ⟩ , … } denotes the states of the other AUVs at time t, ts_t^i = { ⟨ mp_1 , weight_1 , emerg_1 , cpl_1 ⟩ , ⟨ mp_2 , weight_2 , emerg_2 , cpl_2 ⟩ , … } represents the states of the targets at time t, and ed_t^i = ⟨ c_x , c_y ⟩ is the ocean current direction perceived by U_i at up_i(t).

4.2. Action Model

For each AUV, the action space is divided into eight directions: north, northeast, east, southeast, south, southwest, west and northwest. In each direction, the AUV can move at its maximum speed or at 70% of its maximum speed; it can also remain stationary, that is, perform no movement of its own and rely only on the ocean current to change its position. A sketch of the resulting action set is given below.
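The enumeration below is our own illustration of the discrete action set described above: eight headings at full or 70% speed plus a stationary action, i.e., 17 candidate propulsion velocities per AUV.

```python
import numpy as np

def build_action_set(max_speed):
    """Eight headings x {100%, 70%} of max speed, plus the 'drift with the current' action."""
    headings = [k * np.pi / 4 for k in range(8)]        # N, NE, E, SE, S, SW, W, NW
    actions = [np.zeros(2)]                             # remain stationary: zero propulsion
    for frac in (1.0, 0.7):
        actions += [frac * max_speed * np.array([np.cos(h), np.sin(h)]) for h in headings]
    return actions                                      # 17 candidate propulsion velocities

actions = build_action_set(max_speed=3.0)               # e.g., U_2 or U_3 in Table 6
```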

4.3. Reward Model

The reward an AUV received in a time step is composed of four parts, i.e., energy consumption, moving evaluation, collision detection and task completion.
  • Energy consumption is determined by the AUV's speed: the faster the AUV moves in a time step, the larger the energy consumption and the lower the reward. The reward for energy consumption is defined as
    $r_e^i = \frac{-2\, LE_i(t)}{MAXE \cdot MAXDE}, \quad i \in \{1,2,\ldots,N_u\},$  (14)
    where MAXE is the initial energy of U_i and MAXDE is the energy attenuation ratio at the maximum speed. When the AUV moves only with the ocean current, its energy consumption drops to 0 and it receives the maximum energy-consumption reward of 0.
  • Moving evaluation is determined by whether the AUV is closer to its nearest target than it was at the previous time step. The reward for the moving evaluation is defined as
    $r_m^i = \begin{cases} 0 & \text{if } \min_{j=1,\ldots,N_m} dist_{i,j}(t) > \min_{j=1,\ldots,N_m} dist_{i,j}(t+1), \\ -0.5 & \text{otherwise}. \end{cases}$  (15)
  • Collision detection judges whether an AUV collides with the others; a collision yields a negative reward. The reward for collision is defined as
    $r_o^i = \begin{cases} 0 & \text{if } \|up_i(t) - up_j(t)\|_2 \geq \tau \\ -2 & \text{otherwise} \end{cases}, \quad i, j \in \{1,2,\ldots,N_u\},\; i \neq j,$  (16)
    where τ is a small positive number representing the minimum safe distance between AUVs to avoid collision, as shown in Figure 6.
  • Task completion reward is determined by whether the AUVs salvage a target. If the AUVs salvage a target at time t, all the AUVs receive a reward related to the emergency of that target. The reward for task completion is defined as
    $r_c = \begin{cases} emerg_j & \text{if } cpl_j = 1 \\ 0 & \text{otherwise} \end{cases}, \quad j \in \{1,2,\ldots,N_m\}.$  (17)
Then, the reward obtained by U_i in one time step is calculated as
$r_i = r_e^i + r_m^i + r_o^i + r_c.$  (18)
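A sketch that assembles the four reward terms of Equations (14)-(18) for one AUV in one time step; the constants follow the text above, while the exact normalisation in Equation (14) reflects our reconstruction and should be treated as an assumption.

```python
def step_reward(energy_loss, max_e, max_de, got_closer, collided, salvaged_emergencies):
    """Per-step reward r_i = r_e + r_m + r_o + r_c (Equation (18))."""
    r_e = -2.0 * energy_loss / (max_e * max_de)         # Equation (14): equals 0 when only drifting
    r_m = 0.0 if got_closer else -0.5                   # Equation (15)
    r_o = -2.0 if collided else 0.0                     # Equation (16)
    r_c = sum(salvaged_emergencies)                     # Equation (17): emergencies of targets salvaged now
    return r_e + r_m + r_o + r_c
```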

5. Automatic Policy Amendment Algorithm (APAA)

The proposed APAA records the task sequence of each AUV when the AUVs finish an entire TA. If the AUVs obtain a high TCR in a TA, they add the task sequence to their Task Sequence Matrix (TSM). The TCR serves as the AUVs' reference for each task sequence when making policy decisions in the future. Entropy is used to measure the uncertainty of the TSM, which preserves the diversity of the AUVs' learning samples, and the SR enables the AUVs to balance the TCR against their own reward. The amendment probability associated with the TSM is generated to adjust the AUVs' action probability distribution, improve the sample quality and accelerate DDQN training.

5.1. Task Sequence Matrix (TSM)

The task sequence represents the order in which U_i salvages the targets in each task allocation. In general, the higher the team cumulative reward (TCR) of a task sequence in an environment, the more valuable it is. Let the number of records in the TSM be N. Each AUV maintains an N × N_m matrix to store its own task sequences, and VR_i denotes the TCR corresponding to row i of the TSM. When a new task sequence emerges, each AUV adds it to the TSM, removing the sequence with the lowest TCR, if it satisfies
$\sum_{i=1}^{N_u} R_i > \min_{j=1,\ldots,N} VR_j,$  (19)
where R_i is the cumulative reward obtained by U_i during the TA.
Table 1 shows the top 10 task sequences generated by the three AUVs after fifty iterations. For each U_i ∈ U in the table, column j indicates the jth target that U_i performs, denoted UT_j^i. For example, T_1 and T_3 appear most frequently in UT_1^1, indicating that U_1 can contribute a higher TCR if it performs T_1 or T_3 first.
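A sketch of the TSM bookkeeping described above (Equation (19)): keep the N task sequences with the highest TCR and replace the worst record when a better one appears. The list-based layout and names are ours; N = 15 follows Table 5.

```python
def update_tsm(tsm, vr, new_sequence, new_tcr, capacity=15):
    """tsm: list of task sequences (rows), vr: list of the corresponding TCR values VR_i."""
    if len(tsm) < capacity:                             # still room: always record
        tsm.append(new_sequence)
        vr.append(new_tcr)
    elif new_tcr > min(vr):                             # Equation (19): beat the worst stored TCR
        worst = vr.index(min(vr))
        tsm[worst] = new_sequence
        vr[worst] = new_tcr
    return tsm, vr
```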

5.2. Automatic Policy Amendment Matrix (APAM)

In a TA, the target selection preferences of the AUVs determine the task completion efficiency, as described in Table 1. We wish to use the TSM to influence the preference with which the AUVs perform the task. To this end, an N_m × N_m matrix called APAM is constructed by evaluating the TSM through three indicators: TCR, entropy and SR. Note that APAM_{i,j} represents the probability that an AUV selects T_i when performing UT_j^*, where UT_j^* denotes the jth target performed by that AUV.
(1)
Team Cumulative Reward (TCR)
The TCR is used to assign a different weight to each sequence in the TSM: the higher the TCR of a sequence, the greater its influence on the AUVs. The weight of each sequence in the TSM is defined as
$w_{tcr}^{\,i} = \frac{VR_i - MIN\_R}{MAX\_R - MIN\_R}, \quad i \in \{1,2,\ldots,N\},$  (20)
where MIN_R and MAX_R are the lowest and highest TCR the AUVs can achieve, respectively.
Then, for each T_i ∈ T, the weight corresponding to the AUVs selecting target T_i when performing UT_j^*, j ∈ {1, 2, …, N_m}, is accumulated according to w_tcr as
$M_{tcr}^{\,i,j} = \begin{cases} M_{tcr}^{\,i,j} + w_{tcr}^{\,k} & \text{if } TSM_{k,j} = T_i \\ M_{tcr}^{\,i,j} & \text{otherwise} \end{cases}, \quad i, j \in \{1,2,\ldots,N_m\},\; k \in \{1,2,\ldots,N\},$  (21)
where M_tcr is an N_m × N_m matrix.
Finally, Equation (22) is used to transform M t c r into probability matrix P t c r .
$P_{tcr}^{\,i,j} = \frac{M_{tcr}^{\,i,j}}{\sum_{i=1}^{N_m} M_{tcr}^{\,i,j}}, \quad j \in \{1,2,\ldots,N_m\}.$  (22)
(2)
Entropy
Based on the update rule of the TSM described in Section 5.1, the TSM records a new sequence whenever its TCR is greater than that of the worst sequence in the TSM. From Table 1, U_3 may select T_1, T_3 or T_4 when performing UT_3^3. However, as the TSM is updated with new sequences, the diversity of the targets in the TSM may decrease dramatically, and U_3 may select only T_3 after several iterations. This is undesirable in early training because it leads to convergence to a non-optimal policy. Entropy is therefore used to measure the effect of the sequences in the TSM on the diversity of the AUVs' behaviors: multiple similar records with high reward receive lower weights. The entropy weight of a new sequence is calculated by multiplying the change in the TSM's average TCR by the change in the TSM's entropy after the sequence is added.
For column j of the TSM, the number of occurrences of T_i is counted as
$C_{i,j} = \begin{cases} C_{i,j} + 1 & \text{if } TSM_{k,j} = T_i \\ C_{i,j} & \text{otherwise} \end{cases}, \quad i, j \in \{1,2,\ldots,N_m\},\; k \in \{1,2,\ldots,N\},$  (23)
where C is an N_m × N_m matrix that is transformed into a probability matrix C′ in the same way as Equation (22). After that, the entropy ie is calculated by
$ie = -\sum_{j=1}^{N_m} \sum_{i=1}^{N_m} C'_{i,j} \log\!\big(C'_{i,j}\big).$  (24)
When a new sequence is added, the entropy of the TSM changes. For each sequence in the TSM, the weight under the entropy metric is
$w_e^{\,i} = e^{\,(er_{new} - er_{old})(ie_{new} - ie_{old})}, \quad i \in \{1,2,\ldots,N\},$  (25)
where er_old and er_new are the average TCR of the TSM before and after the new sequence is added, respectively, and ie_old and ie_new are the entropy before and after the new sequence is added, respectively.
Similar to Equation (21), the weights of each target performed by the AUVs in different orders are accumulated according to w_e as
$M_e^{\,i,j} = \begin{cases} M_e^{\,i,j} + w_e^{\,k} & \text{if } TSM_{k,j} = T_i \\ M_e^{\,i,j} & \text{otherwise} \end{cases}, \quad i, j \in \{1,2,\ldots,N_m\},\; k \in \{1,2,\ldots,N\}.$  (26)
Note that M_e is transformed into the probability matrix P_e in the same way as Equation (22).
(3)
Subtask Reward (SR)
SR is defined as the cumulative reward obtained by an AUV while salvaging its ith target, and it allows the AUV to balance the TCR against its own reward in the actual environment. Let VSR be an N × N_m matrix whose entry VSR_{i,j} is the cumulative reward for finishing TSM_{i,j}. Based on the TSM, the weight of each entry is calculated as
$w_{sr}^{\,i,j} = \frac{VSR_{i,j} - MIN\_R}{MAX\_R - MIN\_R}, \quad j \in \{1,2,\ldots,N_m\}.$  (27)
Then, we have
$M_{sr}^{\,i,j} = \begin{cases} M_{sr}^{\,i,j} + w_{sr}^{\,k,j} & \text{if } TSM_{k,j} = T_i \\ M_{sr}^{\,i,j} & \text{otherwise} \end{cases}, \quad i, j \in \{1,2,\ldots,N_m\},\; k \in \{1,2,\ldots,N\}.$  (28)
Finally, the probability matrix P_sr is calculated from M_sr by Equation (22).
(4)
Probability Weighted
P_tcr, P_e and P_sr are the probability matrices that describe, under the three indicators, which target the AUVs select at each order position. In addition, if the variance of a subtask reward column in the TSM is large, the corresponding target selection is considered to have a great influence on the TCR. In this case, P_sr receives a higher proportion coefficient, calculated as
$w_1 = \min\!\left(\eta,\; \frac{\arctan\!\big(\mathrm{var}(VSR_{*,j})\big)}{\pi}\right),$  (29)
where VSR_{*,j} denotes column j of VSR, min(·,·) returns the smaller of two values, var(·) is the variance of a data set, and 0 < η < 1 restricts the influence of P_sr. The reward and entropy indicators derived from the TSM records are weighted equally, so the APAM is given by
$APAM = \frac{\big[w_1 P_{sr} + (1 - w_1) P_{tcr}\big] + P_e}{2}.$  (30)
(5)
Probability Prediction
APAM is updated through the new sequences, and the probability changes in the historical experience of the matrix can be used as momentum to predict the future change of APAM, which further accelerates DDQN training. The momentum matrix is constructed as
$\Delta apam = \begin{cases} w_3\, \Delta apam + (1 - w_3)\big(APAM_{new} - APAM_{old}\big) & \text{if TSM updated}, \\ w_2\, \Delta apam & \text{otherwise}, \end{cases}$  (31)
where Δapam is the momentum of the APAM change, 0 < w_2 < 1 and 0 < w_3 < 1. Δapam is updated with the proportional coefficient when the new sequence satisfies the update condition of the TSM; otherwise, Δapam is attenuated by a fixed coefficient. The prediction of the new APAM is given by
$APAM_{pdt} = APAM + \Delta apam.$  (32)
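The sketch below outlines how the APAM is assembled from the TSM (Equations (20)-(30)); only the TCR path is written out in full, while P_e and P_sr would be produced with the same counting pattern from the entropy weights (25) and the subtask-reward weights (27). The variable names, the placeholder for P_e and P_sr, and the column-wise treatment of Equation (29) are our own assumptions; the momentum prediction of Equations (31)-(32) is omitted.

```python
import numpy as np

def weight_matrix(tsm, weights, n_targets):
    """Accumulate, for target i and order position j, the weights of the TSM rows whose
    j-th entry is T_i (the counting pattern shared by Equations (21), (26) and (28))."""
    m = np.zeros((n_targets, n_targets))
    for row, w in zip(tsm, weights):
        for j, target in enumerate(row):
            if target is not None:                      # '-' entries in Table 1
                m[target, j] += w
    return m

def to_probability(m):
    """Column-wise normalisation, Equation (22)."""
    col_sum = m.sum(axis=0, keepdims=True)
    return np.divide(m, col_sum, out=np.zeros_like(m), where=col_sum > 0)

def build_apam(tsm, vr, vsr, min_r, max_r, eta=0.1):
    """tsm: N rows of target indices (or None); vr: TCR per row; vsr: N x N_m subtask rewards."""
    n_targets = len(tsm[0])
    w_tcr = (np.array(vr) - min_r) / (max_r - min_r)                # Equation (20)
    p_tcr = to_probability(weight_matrix(tsm, w_tcr, n_targets))    # Equations (21)-(22)
    # P_e and P_sr would be built analogously from w_e (25) and w_sr (27);
    # they are set equal to P_tcr here purely as a placeholder.
    p_e = p_sr = p_tcr
    apam = np.zeros((n_targets, n_targets))
    for j in range(n_targets):
        w1 = min(eta, np.arctan(np.var(vsr[:, j])) / np.pi)         # Equation (29)
        mix = w1 * p_sr[:, j] + (1.0 - w1) * p_tcr[:, j]
        apam[:, j] = (mix + p_e[:, j]) / 2.0                        # Equation (30)
    return apam
```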

5.3. Action Conduct by APAM

As shown in Table 2, the APAM is generated from the TSM. Obviously, different AUVs have different preferences for the targets. This preference is used to reduce the state-action space and to address the undirected exploration of traditional RL. The Q values are converted into probabilities by a softmax, and these probabilities are then amended according to the information provided by the APAM. A proper amendment method leads to a good training effect. The action distribution is the trade-off made in the current environment with multiple constraints, while the probability generated by APAM only indicates which target has a higher execution priority. The probability of an action is therefore motivated or restrained according to the action distribution rather than the distribution of APAM. If an AUV's estimate of an action is similar to the behavior expected by APAM, the action is motivated according to the similarity between the action and the direction expected by APAM; otherwise, it is not. In fact, motivating one action implicitly inhibits the others, so no additional inhibition of other actions is needed.
When U_m performs UT_j^m, column j of the APAM influences its decision. Let p_q denote the probability that U_m moves in each direction. For each p_q^k in p_q, the positional relationship between each unfinished target and U_m is computed, the cosine similarity between these positional relationships and d_k is calculated, and p_q^k is then amended according to the APAM of U_m:
$p_q^k \leftarrow p_q^k \left(1 + \sum_{i=1}^{N_m} APAM_{i,j}^{\,m}\, \max\!\big(0,\, \mathrm{cosSim}(d_k,\; mp_i(t) - up_m(t))\big)\right), \quad j \in \{1,2,\ldots,N_m\},$  (33)
where cosSim(·,·) denotes the cosine similarity of two vectors.
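A sketch of the amendment step in Equation (33): each heading's softmax probability is boosted in proportion to how well it points toward the targets currently favoured by the APAM column for this order position. The final renormalisation is our addition so that p_q remains a distribution.

```python
import numpy as np

def amend_action_probs(p_q, apam_col, directions, target_positions, auv_position):
    """p_q: probabilities over the headings d_k; apam_col: APAM column j of U_m;
    target_positions: positions mp_i(t) of the unfinished targets."""
    p_q = np.array(p_q, dtype=float)
    for k, d_k in enumerate(directions):
        boost = 0.0
        for i, mp in enumerate(target_positions):
            rel = mp - auv_position                     # mp_i(t) - up_m(t)
            cos_sim = np.dot(d_k, rel) / (np.linalg.norm(d_k) * np.linalg.norm(rel) + 1e-9)
            boost += apam_col[i] * max(0.0, float(cos_sim))
        p_q[k] *= 1.0 + boost                           # Equation (33)
    return p_q / p_q.sum()                              # renormalise (our addition)
```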

5.4. Algorithm Summarize

Algorithm 1 summarizes the flow of APAA.
Algorithm 1 APAA.
Input:
  Nu, Nm, D, EPISODE, L, M.
Output:
   θ * .
1: Initialize: θ , θ , E P , T S M , A P A M , Δ a p a m .
2: for n = 1 to E P I S O D E  do
3:     t = 1 ;
4:    Generate APAM by Equations (20)–(30);
5:    while  s t i is not terminal state do
6:      for  i = 1 to N u  do
7:         Generate action probability distribution p q according to state s t i ;
8:         if  n > M  then
9:           if  n % 2 = = 0  then
10:               A P A M i = A P A M i + Δ a p a m i ;
11:           end if
12:           for  k = 1 to D do
13:               p q k is corrected according to Equation (33);
14:           end for
15:         end if
16:         Choose action a t i by ϵ -greedy according to p q , and get reward r t i ;
17:         Put s t i , a t i , r t i , s t + 1 i into E P ;
18:      end for
19:       t = t + 1 ;
20:    end while
21:    For a new sequence, update TSM according to Equation (19);
22:    if  n % L = = 0  then
23:      Randomly select a batch of samples from E P ;
24:      Train network θ by Equation (13);
25:    end if
26:    Set θ ′ = θ after several iterations;
27: end for

6. Simulation Results

In this section, simulation results for APAA are presented and compared with those obtained by DDQN, Prioritized Experience Replay (PER) and Proximal Policy Optimization Clip (PPO-Clip). The simulation is implemented in MATLAB 2018b on a personal computer with an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz and 8 GB of RAM.

6.1. Experiment Parameters

The designed experiment involves a group of three AUVs and five targets distributed in a 10 m × 10 m ocean current region, where the ocean current is modeled by Equation (1). The weight of some targets is greater than the power of any single AUV, so they must be salvaged through the cooperation of multiple AUVs. In the experimental comparison, each algorithm has the same network parameters, structure and initial conditions. Table 3 shows the network and training parameters.
The ϵ-greedy exploration strategy is adopted for action selection: at each step, the AUVs select a random action with probability ϵ and the action with the highest expected reward with probability 1 − ϵ. In addition, a decay factor β makes ϵ decrease over the iterations, which increases the probability of choosing the optimal action. ϵ is updated after each training of the policy as
$\epsilon(t+1) = \beta\, \epsilon(t).$  (34)
Parameters of the RL are shown in Table 4.
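A few lines illustrating the ϵ-greedy selection with the decay of Equation (34), using the values in Table 4; the sampling helper is our own.

```python
import numpy as np

def select_action(p_q, epsilon, rng):
    """Random action with probability epsilon, otherwise the most probable action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(p_q)))
    return int(np.argmax(p_q))

epsilon, beta = 0.8, 0.995                              # epsilon(0) and beta from Table 4
rng = np.random.default_rng(0)
# after each policy training step: epsilon = beta * epsilon   (Equation (34))
```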
Table 5 shows the parameters used in APAA; their values affect its performance. First, the effect of APAA depends on the experience stored in the TSM: a small N leads to insufficient experience diversity in the TSM and makes it easy to get trapped in a non-optimal policy. Second, although η lets the AUVs balance the team objective against their own, its value should not be large in a collaborative task, as this may cause the task to fail. Finally, w_2 and w_3 control how ΔAPAM changes, and both should be large, because collecting TSM experience is essentially Monte Carlo sampling and the noise from random sampling can make ΔAPAM unstable.
Table 6 and Table 7 show the attributes of the AUVs and the targets, respectively. In the experiment, the initial positions of the AUVs and the targets are randomly initialized within the ocean current region, the weights of the targets are randomly initialized between 2 kg and 5 kg, the powers of the AUVs are randomly initialized between 1 kg and 4 kg, and the AUVs' speeds are randomly initialized between 1 m/s and 3 m/s. For convenience, we show one such set of parameters.
Parameters involved in the TA model are shown in Table 8.

6.2. Experiment Result

We compare APAA with DDQN, PER and PPO-Clip in the same scenario and run each of them several times to obtain average performance. The results are shown in Figure 7: APAA achieves the best convergence performance among the algorithms under the same number of episodes. In fact, the task in this paper is a typical multi-objective optimization problem, which generally results in a large state-action space. DDQN clearly requires a lot of exploration to converge on this complex task and has weak convergence stability. PER, built on DDQN, uses the TD-error of the samples for prioritized sampling, which improves sample efficiency; its advantage becomes visible after the first 2000 iterations, and it achieves better performance than DDQN. PPO-Clip is an on-policy algorithm based on the Actor-Critic framework, and it achieves the weakest performance here. In addition, the convergence time of the algorithms is given in Table 9, where APAA has the highest efficiency. By contrast, PPO-Clip struggles to be efficient because two networks must be trained, and PER, although it uses a sum tree to improve sampling efficiency, still incurs a high time cost in updating sample priorities and sampling.
From Section 4.3, TCR consists of four goals. The performance of task completion reward is shown in Figure 8. In the experiment, the maximum reward for salvaging the five targets is 125. APAA gets the highest task completion reward compared with the other algorithms.
Figure 9 shows the performance of the algorithms in energy loss and collision detection, respectively. In Table 6, the total energy reserve carried by AUVs is 1800 J, and APAA only consumes 7% of the energy to salvage all the targets, that is, it can plan a better path in the ocean current. In addition, the performance of APAA in collision detection also highlights the lower probability of collisions occurring during AUVs executing the task.
Table 10 shows the performance of the algorithms after convergence. An additional metric, the time consumed, indicates how long the AUVs take to complete the task.
Figure 10 shows the trajectories of the AUVs and the targets under APAA after training. The hexagonal stars are the starting positions of the AUVs and the asterisks are their end positions. The squares are the starting positions of the targets, the circles are the positions where the targets are salvaged, and the arrows show the direction of the ocean current at each coordinate. As can be seen from the path planned by the AUVs at each step, they take full advantage of the ocean current whenever it moves in the same direction as they do.

7. Analysis

In this section, the validity and computational complexity of APAA are analyzed theoretically.

7.1. Validity Analysis

In order to speed up RL, the policy subspace with greater potential benefit in the state-action space needs to be considered. Let π^(k) be the policy in the kth iteration and p^(k)(s, a) be the probability of choosing an action from the state-action space. According to APAA, we have
$p^{(k)}(s_t, a_t) \leftarrow p^{(k)}(s_t, a_t) + apaa(s_t, a_t), \quad s_t \in S,\; a_t \in A,$  (35)
where apaa(s_t, a_t) is the probability amendment for the state-action pair.
If (s_t, a_t) is a better state-action pair, then apaa(s_t, a_t) > 0; otherwise apaa(s_t, a_t) ≤ 0, and ∑_{a ∈ A} apaa(s_t, a) = 0. As a result, there exist S_l ⊂ S and A_l ⊂ A such that, for s_t ∈ S_l and a_t ∈ A_l, p^(k)(s_t, a_t) + apaa(s_t, a_t) = 0. This subspace is not explored when p^(k)(S_l, A_l) = 0, so APAA is more efficient than DDQN for the same number of episodes.

7.2. Computational Complexity Analysis

The computational cost of APAA mainly consists of two parts: calculating the APAM from the TSM and amending the AUVs' decisions with the APAM. First, let the number of actions of the AUVs be d and let the TSM be an n × m matrix with n > m. As described in Section 5.2, each element of the TSM is evaluated by TCR, SR and entropy to generate the APAM, so the computational complexity of this part can easily be estimated as O(3mn) < O(3n²). After that, the APAM amends the probability distribution over the AUVs' actions, with complexity O(md). Finally, the computational cost of APAA is the sum of the two parts, i.e., O(3n² + md).
Although APAA appears to have a high computational complexity, targets that have already been salvaged are no longer considered when amending the action distribution. As a result, the complexity of the second part decreases as the overall task progresses, so that it can almost be ignored. In addition, the samples of RL are only related to the actions of the AUVs at each time step, and it is difficult for the AUVs to extract the relationship between behaviors and task results from the massive samples; APAM, however, extracts information directly related to the task, which effectively accelerates learning, so we believe such a computational cost is worthwhile. In contrast, the computational complexity of PER is related to the size of the replay buffer and increases dramatically for complex tasks, while the parameter sensitivity of PPO-Clip and the need to train two networks result in higher computational complexity. Therefore, APAA also outperforms PER and PPO-Clip for the same training time.

8. Conclusions

In this paper, a new RL approach is proposed to solve the task allocation problem of multiple AUVs in ocean currents. First, the ocean current model and a reward function are constructed; the ocean current, the energy, the task emergency and collisions with other AUVs must all be taken into account when the AUVs perform the task. Many classical RL algorithms improve the efficiency of traditional samples, but these samples are not directly related to the task, which makes it difficult for the AUVs to understand how their behavior affects the final result. To overcome this drawback, the Automatic Policy Amendment Algorithm (APAA) is introduced. The TSM is built from the task sequences of each AUV and represents the task preferences that yield the highest TCR; such task-related information can effectively guide policy learning. The APAM is then calculated from the TSM and uses TCR, entropy and SR to adjust the decisions of the AUVs. Finally, the simulation results show that APAA accelerates convergence and improves the overall performance compared with DDQN, PER and PPO-Clip. In future work, we will deal with more complex optimal planning tasks in 3D scenarios.

Author Contributions

Conceptualization, Z.Z. and C.D.; methodology, C.D. and Z.Z.; software, C.D.; validation, C.D.; writing original draft preparation, C.D.; writing review and editing, C.D. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors are grateful for the National Natural Science Foundation of China (No. 61873033), the Science Foundation of Fujian Normal University (No. Z0210553), and the Natural Science Foundation of Fujian Province (No. 2020H0012, No. 2017J01740).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Allotta, B.; Bartolini, F.; Caiti, A.; Costanzi, R.; Di Corato, F.; Fenucci, D.; Gelli, J.; Guerrini, P.; Monni, N.; Munafò, A.; et al. Typhoon at CommsNet13: Experimental Experience on AUV Navigation and Localization. Annu. Rev. Control 2015, 40, 157–171. [Google Scholar] [CrossRef]
  2. Allotta, B.; Costanzi, R.; Pugi, L.; Ridolfi, A. Identification of the Main Hydrodynamic Parameters of Typhoon AUV from A Reduced Experimental Dataset. Ocean. Eng. 2018, 147, 77–88. [Google Scholar] [CrossRef]
  3. Liu, Q.; Sun, B.; Zhu, D. A Multi-AUVs Cooperative Hunting Algorithm for Environment with Ocean Current. In Proceedings of the 2018 37th Chinese Control Conference, Wuhan, China, 25–27 July 2018. [Google Scholar]
  4. Li, L.; Li, Y.; Zeng, J.; Xu, G.; Zhang, Y.; Feng, X. A Research of Multiple Autonomous Underwater Vehicles Cooperative Target Hunting Based on Formation Control. In Proceedings of the 2021 6th International Conference on Automation, Control and Robotics Engineering, Dalian, China, 15–17 July 2021. [Google Scholar]
  5. Wu, J.; Song, C.; Ma, J.; Wu, J.; Han, G. Reinforcement Learning and Particle Swarm Optimization Supporting Real-Time Rescue Assignments for Multiple Autonomous Underwater Vehicles. IEEE Trans. Intell. Transp. Syst. 2021. accepted. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Wu, Z.; Deng, Z.; Qin, H.; Wang, X. An Ocean Bottom Flying Node AUV for Seismic Observations. In Proceedings of the 2018 IEEE/OES Autonomous Underwater Vehicle Workshop (AUV), Porto, Portugal, 6–9 November 2018. [Google Scholar]
  7. Liu, S.; Xu, H.L.; Lin, Y.; Gao, L. Visual Navigation for Recovering an AUV by Another AUV in Shallow Water. Sensors 2019, 19, 1889. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Shen, C.; Buckham, B.; Shi, Y. Modified C/GMRES Algorithm for Fast Nonlinear Model Predictive Tracking Control of AUVs. IEEE Trans. Control Syst. Technol. 2017, 25, 1896–1904. [Google Scholar] [CrossRef]
  9. Carreras, M.; Hernandez, J.D.; Vidal, E.; Palomeras, N.; Ribas, D.; Ridao, P. Sparus II AUV-A Hovering Vehicle for Seabed Inspection. IEEE J. Ocean. Eng. 2018, 43, 344–355. [Google Scholar] [CrossRef]
  10. Kojima, M.; Asada, A.; Mizuno, K.; Nagahashi, K.; Katase, F.; Saito, Y.; Ura, T. AUV IRSAS for Submarine Hydrothermal Deposits Exploration. In Proceedings of the 2016 IEEE/OES Autonomous Underwater Vehicles (AUV), Tokyo, Japan, 6–9 November 2016. [Google Scholar]
  11. Savkin, A.V.; Verma, S.C.; Anstee, S. Optimal Navigation of an Unmanned Surface Vehicle and an Autonomous Underwater Vehicle Collaborating for Reliable Acoustic Communication with Collision Avoidance. Drones 2022, 6, 27. [Google Scholar] [CrossRef]
  12. Yu, X.; Gao, X.; Wang, L.; Wang, X.; Ding, Y.; Lu, C.; Zhang, S. Cooperative Multi-UAV Task Assignment in Cross-Regional Joint Operations Considering Ammunition Inventory. Drones 2022, 6, 77. [Google Scholar] [CrossRef]
  13. Ferri, G.; Munafo, A.; Tesei, A.; LePage, K. A Market-based Task Allocation Framework for Autonomous Underwater Surveillance Networks. In Proceedings of the Oceans Aberdeen Conference, Aberdeen, UK, 19–22 June 2018. [Google Scholar]
  14. Ma, Y.N.; Gong, Y.J.; Xiao, C.F.; Gao, Y.; Zhang, J. Path Planning for Autonomous Underwater Vehicles: An Ant Colony Algorithm Incorporating Alarm Pheromone. IEEE Trans. Veh. Technol. 2019, 68, 141–154. [Google Scholar] [CrossRef]
  15. Han, G.; Gong, A.; Wang, H.; Martinez-Garcia, M.; Peng, Y. Multi-AUV Collaborative Data Collection Algorithm Based on Q-learning in Underwater Acoustic Sensor Networks. IEEE Trans. Veh. Technol. 2021, 70, 9294–9305. [Google Scholar] [CrossRef]
  16. Xi, L.; Zhou, L.; Xu, Y.; Chen, X. A Multi-Step Unified Reinforcement Learning Method for Automatic Generation Control in Multi-area Interconnected Power Grid. IEEE Trans. Sustain. Energy 2020, 12, 1406–1415. [Google Scholar] [CrossRef]
  17. Zhang, J.; Yang, Q.; Shi, G.; Lu, Y.; Wu, Y. UAV Cooperative Air Combat Maneuver Decision Based on Multi-agent Reinforcement Learning. J. Syst. Eng. Electron. 2021, 32, 1421–1438. [Google Scholar]
  18. Zhang, Z.; Wang, D.; Gao, J. Learning Automata-based Multiagent Reinforcement Learning for Optimization of Cooperative Tasks. IEEE Trans. Neural. Netw. Learn. Syst. 2021, 32, 4639–4652. [Google Scholar] [CrossRef] [PubMed]
  19. Guo, W.; Tian, W.; Ye, Y.; Xu, L.; Wu, K. Cloud Resource Scheduling with Deep Reinforcement Learning and Imitation Learning. IEEE Internet Things J. 2021, 8, 3576–3586. [Google Scholar] [CrossRef]
  20. Hoseini, S.A.; Hassan, J.; Bokani, A.; Kanhere, S.S. In Situ MIMO-WPT Recharging of UAVs Using Intelligent Flying Energy Sources. Drones 2021, 5, 89. [Google Scholar] [CrossRef]
  21. Sutton, R.; Barto, A. Reinforcement Learning:An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  22. Watkins, C.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  23. Geist, M.; Pietquin, O. Algorithmic Survey of Parametric Value Function Approximation. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 845–867. [Google Scholar] [CrossRef] [Green Version]
  24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  25. Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. In Proceedings of the 30th Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  26. Lin, L.J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
  27. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the International Conference on Learning Representations 2016, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  28. Horgan, D.; Quan, J.; Budden, D.; Barth Maron, G.; Hessel, M.; Van Hasselt, H.; Silver, D. Distributed Prioritized Experience Replay. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  29. Zhao, Y.; Liu, P.; Zhao, W.; Tang, X. Twice Sampling Method in Deep Q-network. Acta Autom. Sin. 2019, 14, 1870–1882. [Google Scholar]
  30. Zhang, H.J.; Qu, C.; Zhang, J.D.; Li, J. Self-Adaptive Priority Correction for Prioritized Experience Replay. Appl. Sci. 2020, 10, 6925. [Google Scholar] [CrossRef]
  31. Ramicic, M.; Bonarini, A. Entropy-based Prioritized Sampling in Deep Q-learning. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017. [Google Scholar]
  32. Yang, D.; Qin, X.; Xu, X.; Li, C.; Wei, G. Sample-efficient Deep Reinforcement Learning with Directed Associative Graph. China Commun. 2021, 18, 100–113. [Google Scholar] [CrossRef]
  33. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  34. Zhu, J.; Wei, Y.T. Adaptive Deep Reinforcement Learning for Non-stationary Environments. Sci. China Inf. Sci. 2021. accepted. [Google Scholar]
  35. Kumra, S.; Joshi, S.; Sahin, F. Learning Robotic Manipulation Tasks via Task Progress Based Gaussian Reward and Loss Adjusted Exploration. IEEE Robot. Autom. Lett. 2022, 7, 534–541. [Google Scholar] [CrossRef]
  36. Shi, H.; Xu, M. A Multiple-Attribute Decision-Making Approach to Reinforcement Learning. IEEE Trans. Cogn. Dev. Syst. 2020, 12, 695–708. [Google Scholar] [CrossRef]
  37. Pakizeh, E.; Palhang, M.; Pedram, M.M. Multi-criteria Expertness Based Cooperative Q-learning. Appl. Intell. 2013, 39, 28–40. [Google Scholar] [CrossRef]
  38. Yao, X.; Wang, F.; Wang, J. Energy-optimal Path Planning for AUV with Time-variable Ocean Currents. Control Decis. 2020, 35, 2424–2432. [Google Scholar]
Figure 1. The description of the TA for multi-AUV system.
Figure 2. Eight discrete directions of the AUVs in action space.
Figure 3. Influence of ocean current on AUV's direction.
Figure 4. AUVs cooperate to complete the task.
Figure 5. Movement model of the target.
Figure 6. AUVs collision detection.
Figure 7. Team cumulative reward (TCR) of the team.
Figure 8. Task reward of the team.
Figure 9. The performance of (a) energy consumption and (b) collision detection.
Figure 10. The trajectories of the AUVs and the targets in APAA.
Table 1. 10 optimal task sequences in the TSMs (each AUV's entries are listed in execution order, 1st to 5th; '-' means no target in that slot).

#  | U1                 | U2               | U3                 | VR
1  | T1, -, -, -, -     | -, -, -, -, -    | T2, T3, T4, T5, -  | −108
2  | T5, -, -, -, -     | T3, -, -, -, -   | T1, T2, T4, -, -   | −92
3  | -, -, -, -, -      | T5, -, -, -, -   | T2, -, T3, T1, -   | −115
4  | T1, T4, -, -, -    | -, -, -, -, -    | T5, T2, T3, -, -   | −113
5  | T3, T1, T2, T4, T5 | -, -, -, -, -    | -, -, -, -, -      | −131
6  | T3, -, -, -, -     | T2, T4, T1, -, - | T5, -, -, -, -     | −107
7  | T3, -, -, -, -     | T5, T2, -, -, -  | T4, T1, -, -, -    | −111
8  | T3, T4, -, -, -    | -, -, -, -, -    | T2, T5, T1, -, -   | −117
9  | T3, T5, T1, T2, -  | -, -, -, -, -    | T4, -, -, -, -     | −123
10 | T1, -, -, -, -     | -, -, -, -, -    | T2, T3, T4, T5, -  | −125
Table 2. The APAM of the three AUVs (columns within each AUV correspond to the execution order positions 1 to 5).

   | U1 (order 1-5)          | U2 (order 1-5)          | U3 (order 1-5)
T1 | 0.403, 0.425, -, -, -   | -, -, -, -, -           | 0.122, 0.123, 0.282, 0.333, -
T2 | -, -, 1, -, -           | -, 0.717, -, -, -       | 0.508, 0.369, -, -, -
T3 | 0.468, -, -, -, -       | 0.419, -, -, -, -       | -, 0.255, 0.289, -, -
T4 | -, 0.575, -, 1, -       | -, -, -, -, -           | 0.243, 0.127, 0.429, -, -
T5 | 0.129, -, -, -, 1       | 0.581, 0.283, -, -, -   | 0.127, 0.126, -, 0.667, -
Table 3. Network structure and training parameters.

Hidden Layers | Transfer Function | Optimization Function | Epochs | Learning Rate | Batch | Regularization
2             | tanh              | adam                  | 500    | 0.001         | 300   | L2
Table 4. Parameters of the RL training.

EPISODE | ϵ(0) | β
5000    | 0.8  | 0.995
Table 5. Parameters of APAA algorithm.

Parameter                 | Symbol | Value
Size of the TSM           | N      | 15
P_sr impact factor        | η      | 0.1
ΔAPAM attenuation factor  | w_2    | 0.9
ΔAPAM update factor       | w_3    | 0.7
Table 6. Attributes of the AUVs.

AUV | Position | Power (kg) | Speed (m/s) | Energy (J)
U1  | (5,1)    | 5          | 1           | 600
U2  | (10,3)   | 2          | 3           | 600
U3  | (3,6)    | 2          | 3           | 600
Table 7. Attributes of the targets.

Target | Position | Weight (kg) | Emergency
T1     | (5,7)    | 4           | 9.7894
T2     | (4,9)    | 2           | 7.3135
T3     | (6,2)    | 2           | 8.66
T4     | (8,10)   | 3           | 7.3227
T5     | (2,1)    | 2           | 8.405
Table 8. Parameters of the TA.

Parameter                             | Symbol | Value
Number of AUVs                        | N_u    | 3
Number of targets                     | N_m    | 5
Salvage radius                        | R_c    | 0.5 m
Collision radius                      | τ      | 0.1 m
Drag coefficient                      | k      | 3.425
Targets weight attenuation coefficient| α      | 0.01
Table 9. Convergence time of the algorithms.

Algorithm | Time (h)
DDQN      | 2.75
PER       | 3.77
PPO-Clip  | 0.74
APAA      | 0.5
Table 10. Convergence performance of the algorithms (after 5000 episodes).

Algorithm | Task Reward | Collision Frequency | Energy Consumption (J) | Time Consumed (s)
DDQN      | 105         | 3                   | 560                    | 84
PER       | 110         | 2                   | 650                    | 75
PPO-Clip  | 82          | 4                   | 1350                   | 111
APAA      | 117         | 0                   | 337                    | 12
