Multi-Agent Reinforcement Learning-Based Cooperative Encirclement Control of Autonomous Surface Vehicles Against Multiple Targets

Qu, Xingru; Li, Chu; Jiang, Shang; Liu, Guanqun; Zhang, Rubo

doi:10.3390/jmse13081558

by Xingru Qu ¹, Chu Li ¹, Shang Jiang ²,*, Guanqun Liu ¹ and Rubo Zhang ¹,*

¹ College of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian 116600, China

² Department of Automatic Control, Dalian Naval Academy, Dalian 116013, China

* Authors to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2025, 13(8), 1558; https://doi.org/10.3390/jmse13081558

Submission received: 12 July 2025 / Revised: 5 August 2025 / Accepted: 11 August 2025 / Published: 14 August 2025

(This article belongs to the Section Ocean Engineering)

Abstract

Autonomous surface vehicles (ASVs) have been widely applied in ocean engineering due to their small size, low cost, and high mobility. However, most relevant encirclement control methods are limited to simple many-to-one scenarios and do not consider the system dynamics. This article proposes a cooperative encirclement control method for ASVs against multiple targets based on multi-agent reinforcement learning. Firstly, a dynamic target allocation algorithm is designed based on the location information of both vehicles and targets, enabling vehicles to select encirclement targets in real time according to relative distances. Subsequently, the whole encirclement process is divided into multiple stages, and a multi-stage reward function is developed based on curriculum learning to guide ASVs in completing encirclement tasks progressively, from simpler to more complex scenarios. Then, actor and critic networks incorporating long short-term memory are constructed, and a multi-agent soft actor-critic reinforcement learning algorithm is employed to train ASVs, enhancing cooperative target encirclement maneuvers. Finally, the effectiveness and superiority of the proposed method are validated through a six-on-two encirclement simulation.

Keywords:

autonomous surface vehicles; multiple targets; cooperative encirclement control; multi-agent reinforcement learning

1. Introduction

In recent years, autonomous surface vehicles (ASVs) have garnered unprecedented attention and experienced rapid development in both military and civilian domains [1,2]. With the continuous advancement of vehicle intelligence, the operation and application of ASVs have expanded from safe environments focused on reconnaissance and surveillance to complex scenarios involving combat and confrontation [3,4]. Target encirclement represents a crucial research direction in cooperative control of multiple ASVs, where non-cooperative targets are enclosed within a specific circular domain, thereby diminishing their threat capabilities [5,6]. However, targets in real-world scenarios such as pollutant spills, illegal fishing, or autonomous interdiction are not stationary; they can move and disperse [7]. Sequential encirclement of such targets is frequently ineffective and prone to uncontrolled spreading or escape, leading to excessive duration and mission failure. Therefore, the study of cooperative encirclement for ASVs against multiple targets has significant practical value.

Existing target encirclement control methods can be broadly categorized into three approaches: bio-inspired encirclement [8], differential game-based encirclement [9], and reinforcement learning-based encirclement [10]. To be specific, bio-inspired approaches primarily mimic the dynamic interactions observed between predators and prey in nature. For example, Ref. [11] simulates the cooperative behaviors of biological groups to design a bio-inspired distributed encirclement controller, achieving efficient encirclement for non-cooperative targets. Ref. [12] develops a switching encirclement control strategy using animal group behavior. Ref. [13] designs a bio-inspired optimization algorithm based on orca predation behavior, simulating orcas’ search, encirclement, and attack behaviors. By combining with agent position-based area partitioning and bio-inspired neural networks, Ref. [14] proposes a cooperative encirclement method using an improved crayfish optimization algorithm. However, these methods exhibit limited adaptability to complex environments, and abrupt changes in target behavior often render the system incapable of adjusting the encirclement strategy effectively.

Differential game-based methods construct pursuit-evasion models and derive optimal encirclement strategies by solving Hamilton–Jacobi equations [15,16]. For example, Ref. [17] constructs a cooperative encirclement framework based on differential games, which is solved analytically via the Hamilton–Jacobi–Isaacs equations. Ref. [18] proposes a hybrid differential game-based encirclement guidance method, enabling pursuit-evasion missions under static obstacle constraints. Ref. [19] develops a pursuit-evasion matching framework through global game decomposition. Ref. [20] designs a cooperative control method based on stochastic potential games and multi-agent reinforcement learning that estimates Nash equilibrium strategies using temporal relative motion information. However, as the size of the agent group increases, these methods require extensive assumptions and rules, making them less feasible for real-time control in large-scale scenarios.

Due to the rapid advancement of artificial intelligence technology, deep reinforcement learning has been extensively applied in decision-making, control, and optimization [21,22]. Reinforcement learning-based encirclement methods can optimize encirclement strategies through agent–environment interactions, effectively handling high-dimensional state spaces and demonstrating strong adaptability in complex environments [23,24]. For example, Ref. [25] balances individual and collective benefits through knowledge-embedded encirclement rewards. Ref. [26] establishes an obstacle-assisted cooperative encirclement method based on multi-agent reinforcement learning. Ref. [27] conducts decentralized encirclement training via curriculum learning. Ref. [28] utilizes a centralized training-decentralized execution framework to solve the pursuit evasion problem. Ref. [29] proposes a reinforcement learning-based adaptive encirclement control method under an obstacle environment. Ref. [30] uses virtual barriers and curriculum learning techniques during the training process, improving the generalization capabilities and convergence speed of the capture policy for limited perception ASVs. However, the reward functions employed in these approaches often lead to insufficient learning motivation for individual agents. Some agents may exploit collective rewards and cease optimizing the encirclement strategies, thereby prolonging the training process. Note that most existing studies focus on single-target encirclement, with limited attention paid to multi-target cooperative encirclement scenarios.

Motivated by these research gaps, this article investigates the cooperative encirclement problem for ASVs against multiple targets based on multi-agent reinforcement learning (MARL). By integrating dynamic target allocation and multi-stage reward guidance, the method utilizes a MARL framework to train ASVs for cooperative target encirclement. Simulation results for a six-on-two scenario show that the proposed control method converges faster and achieves a higher encirclement success rate than a traditional reward design. The main contributions are as follows: (1) A dynamic target allocation algorithm based on proximity principles and encirclement distance is developed to optimize the target selection process for ASVs. (2) A curriculum learning-inspired multi-stage reward function is designed, covering the search, besiegement, and capture stages, increasing the success rate of target encirclement. (3) A cooperative control solution is proposed by employing a multi-agent soft actor-critic reinforcement learning framework with long short-term memory networks, resulting in more efficient and stable ASV encirclement maneuvers.

The rest of this article is composed of the following sections. Section 2 formulates the encirclement problem of ASVs against multiple targets. The dynamic target assignment, cooperative encirclement reward, and reinforcement learning control are presented in Section 3. Section 4 verifies the performance of the designed control method by simulations. Finally, Section 5 concludes this article.

Notations: Throughout this article, $\|\cdot\|$ denotes the Euclidean norm of a vector. $t$ denotes the time step. $E$ denotes the mathematical expectation. $\odot$ denotes the Hadamard product and $\nabla$ denotes the gradient operator.

2. Problem Statement

As shown in Figure 1, consider multiple high-speed ASVs and multiple low-speed targets in the horizontal plane. $p_{i} = [x_{i}, y_{i}]^{T}$ denotes the position of the $i$th ASV with $i = 1, 2, \dots, N$. $p_{Tm} = [x_{Tm}, y_{Tm}]^{T}$ denotes the position of the $m$th target with $m = 1, 2, \dots, M$. $d_{capture}$ denotes the encirclement radius. $\chi_{ij}$ denotes encirclement angles formed by adjacent ASVs.

According to Ref. [31], the kinematic and kinetic equations of the $i$th ASV can be expressed as

$$\begin{cases} \dot{\eta}_{i} = R(\psi_{i}) \nu_{i} \\ M \dot{\nu}_{i} = f(\nu_{i}) + \tau_{i} + \tau_{wi} \end{cases} \tag{1}$$

where $\eta_{i} = [x_{i}, y_{i}, \psi_{i}]^{T}$ denotes the position and heading angle information of the ASV in the earth-fixed frame. $\nu_{i} = [u_{i}, v_{i}, r_{i}]^{T}$ denotes the surge, sway and yaw velocities of the ASV in the body-fixed frame. $\tau_{i} = [\tau_{ui}, 0, \tau_{ri}]^{T}$ denotes the input of the ASV including surge forces and yaw moments. $\tau_{wi} = [\tau_{wui}, \tau_{wvi}, \tau_{wri}]^{T}$ denotes the time-varying environmental disturbances. $R(\psi_{i})$ denotes the rotation matrix from the body-fixed frame to the earth-fixed frame, which can be written as

$$R(\psi_{i}) = \begin{bmatrix} \cos\psi_{i} & -\sin\psi_{i} & 0 \\ \sin\psi_{i} & \cos\psi_{i} & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

while the inertial matrix is $M = \mathrm{diag}(m_{ui}, m_{vi}, m_{ri})$ and the nominal dynamics are $f(\nu_{i}) = [f_{ui}, f_{vi}, f_{ri}]^{T}$ with $f_{ui} = m_{vi} v_{i} r_{i} - d_{ui} u_{i}$, $f_{vi} = -m_{ui} u_{i} r_{i} - d_{vi} v_{i}$, and $f_{ri} = (m_{ui} - m_{vi}) u_{i} v_{i} - d_{ri} r_{i}$. $d_{ui}$, $d_{vi}$ and $d_{ri}$ denote the fluid damping. $m_{ui}$, $m_{vi}$ and $m_{ri}$ denote the inertial components in the three degrees of freedom.
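To make the model in Equations (1) and (2) concrete, the following minimal sketch advances one ASV by a single forward Euler step. The inertia values are those reported in Section 4, and the damping coefficients are read off the nominal dynamics given there. The function name `asv_step`, the Euler scheme, and the step size `dt` are illustrative assumptions and are not taken from the article.

```python
import math

# Inertia and damping values as reported in Section 4 of the article.
M_U, M_V, M_R = 19.0, 35.2, 4.2
D_U, D_V, D_R = 4.0, 1.0, 10.0


def asv_step(eta, nu, tau, tau_w=(0.0, 0.0, 0.0), dt=0.1):
    """One forward Euler step of Equations (1)-(2) (assumed integrator).

    eta = (x, y, psi) in the earth-fixed frame,
    nu = (u, v, r) in the body-fixed frame,
    tau = (tau_u, 0, tau_r) control input, tau_w = disturbance.
    """
    x, y, psi = eta
    u, v, r = nu

    # Kinematics: eta_dot = R(psi) * nu
    x_dot = u * math.cos(psi) - v * math.sin(psi)
    y_dot = u * math.sin(psi) + v * math.cos(psi)
    psi_dot = r

    # Nominal dynamics f(nu)
    f_u = M_V * v * r - D_U * u
    f_v = -M_U * u * r - D_V * v
    f_r = (M_U - M_V) * u * v - D_R * r

    # Kinetics: M * nu_dot = f(nu) + tau + tau_w, with diagonal M
    u_dot = (f_u + tau[0] + tau_w[0]) / M_U
    v_dot = (f_v + tau[1] + tau_w[1]) / M_V
    r_dot = (f_r + tau[2] + tau_w[2]) / M_R

    eta_next = (x + dt * x_dot, y + dt * y_dot, psi + dt * psi_dot)
    nu_next = (u + dt * u_dot, v + dt * v_dot, r + dt * r_dot)
    return eta_next, nu_next
```

Starting from rest, a surge force of 19 produces a surge acceleration of 1, so the surge velocity after one step of 0.1 is 0.1 while the position is still unchanged.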

In this article, a MARL-based cooperative encirclement control method is constructed for ASVs to guarantee that vehicles are evenly distributed around the targets with desired distances and angles under the premise of safety. To be specific, the control objective can be formalized as

$$\begin{cases} \lim_{t \to \infty} d_{\min} \leq \|p_{i} - p_{Tm}\| \leq d_{capture}, & i \in D_{m} \\ \lim_{t \to \infty} |\chi_{ij} - \chi_{d}| \leq \chi_{0}, & i, j \in D_{m},\ i \neq j \end{cases} \tag{3}$$

where $D_{m}$ denotes the encirclement alliance for the $m$th target. $d_{\min}$ denotes the minimum distance between the ASV and the target. $\chi_{d} = 2\pi / K$ denotes the desired encirclement angle with $K$ being the number of ASVs in one alliance. $\chi_{0}$ is a small positive constant.
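As an illustration of the objective in Equation (3), the sketch below checks whether one alliance currently satisfies both the distance and the angle conditions. It assumes that "adjacent" ASVs are those neighbouring each other when sorted by bearing around the target, and it uses `atan2` to obtain bearings. The function name `encirclement_achieved` is hypothetical, and the tolerance `chi_0` must be supplied by the user, since the article only states that it is a small positive constant.

```python
import math


def encirclement_achieved(asvs, target, d_min, d_capture, chi_0):
    """Check the two conditions of Equation (3) for one alliance.

    asvs: list of (x, y) ASV positions in the alliance.
    target: (x, y) position of the assigned target.
    """
    k = len(asvs)

    # Distance condition: d_min <= ||p_i - p_Tm|| <= d_capture
    dists = [math.dist(p, target) for p in asvs]
    if not all(d_min <= d <= d_capture for d in dists):
        return False

    # Angle condition: gaps between bearing-sorted neighbours (assumption)
    bearings = sorted(
        math.atan2(p[1] - target[1], p[0] - target[0]) for p in asvs
    )
    chi_d = 2.0 * math.pi / k
    gaps = [
        (bearings[(i + 1) % k] - bearings[i]) % (2.0 * math.pi)
        for i in range(k)
    ]
    return all(abs(g - chi_d) <= chi_0 for g in gaps)
```

With the values used later in the simulation ($d_{\min} = 10$, $d_{capture} = 15$), three ASVs spaced 120 degrees apart on a circle of radius 12 pass the check, whereas three ASVs bunched on one side of the target do not.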

Remark 1.

In this paper, the encirclement mission of ASVs is not subject to geographical boundary constraints. To guarantee the existence of a feasible solution, we focus exclusively on high-speed ASVs and low-speed targets. Moreover, considering the minimum number of nodes required for a closed loop in the two-dimensional plane, the number of ASVs must satisfy $N \geq 3M$.

Remark 2.

By virtue of the artificial potential field method, the maneuvering strategy of targets based on position information is generated by

$$v_{Tm} = \sum_{i=1}^{N} \frac{k_{a} (p_{Tm} - p_{i})}{\|p_{i} - p_{Tm}\|} \tag{4}$$

where $k_{a} > 0$.
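The escape rule in Equation (4) sums one repulsive unit vector per ASV, scaled by $k_{a}$. A minimal sketch is given below; the function name `target_velocity` and the default gain are assumptions for illustration, and the case of an ASV coinciding exactly with the target (zero distance) is not handled.

```python
import math


def target_velocity(target, asvs, k_a=1.0):
    """Repulsive velocity command of Equation (4) for one target.

    target: (x, y) position of the target.
    asvs: list of (x, y) positions of all N ASVs.
    """
    vx, vy = 0.0, 0.0
    for p in asvs:
        d = math.dist(p, target)  # ||p_i - p_Tm||, assumed nonzero
        vx += k_a * (target[0] - p[0]) / d
        vy += k_a * (target[1] - p[1]) / d
    return vx, vy
```

A single ASV located to the left of the target pushes it to the right, while two ASVs placed symmetrically on either side cancel each other, which is why an evenly distributed encirclement leaves the target without a preferred escape direction.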

3. Reinforcement Learning-Based Cooperative Encirclement Control

This section presents the proposed cooperative encirclement control design. The entire control architecture is illustrated in Figure 2, which includes dynamic target assignment, cooperative encirclement reward and reinforcement learning control modules. To be specific, the assignment module is used to assign targets to the ASVs based on the shortest encirclement distance. The cooperative encirclement reward module is responsible for guiding ASVs to approach and capture the assigned targets. The multi-agent soft actor-critic reinforcement learning module is used to train ASVs with unknown dynamics, optimizing encirclement performance through agent–environment interactions.

3.1. Dynamic Target Assignment

In this part, a dynamic task allocation algorithm is designed using the total encirclement distance. To obtain the minimal encirclement distance at any time, the position-based objective function is designed as

$$J_{allocat} = -\sum_{i=1}^{N} \sum_{m=1}^{M} \|p_{i} - p_{Tm}\| H_{im} \tag{5}$$

with $H_{im}$ satisfying

$$\begin{cases} \sum_{i=1}^{N} H_{im} \geq K, & m = 1, \dots, M \\ \sum_{m=1}^{M} H_{im} = 1, & i = 1, \dots, N \end{cases} \tag{6}$$

Note that in the cooperative encirclement process, each ASV can choose one target to encircle, while each target requires at least $K$ vehicles to form an effective encirclement. If the ASV is matched with a target, then $H_{im} = 1$. Otherwise, $H_{im} = 0$.

The optimal assignment between ASVs and targets at each time step is obtained by solving the maximization of Equation (5). The dynamic task allocation is presented in Algorithm 1, where $H^{k}$ $(k = 1, 2, \dots, K)$ denotes assignment matrices that satisfy Equation (6).

Algorithm 1: Target allocation for cooperative encirclement.

Inputs: Target position $p_{Tm}$, ASV position $p_{i}$, the number of ASVs/targets $N$ and $M$.
Initialization: Optimal allocation relation matrix $H_{d} \leftarrow \text{None}$, extreme value $J^{*} \leftarrow -\infty$.
1: Calculate the Euclidean distances between ASVs and targets.
2: for $k = 1, 2, \dots, K$ do
3:   if $J^{*} < J_{allocat}(H^{k})$ then
4:     $J^{*} = J_{allocat}(H^{k})$
5:     $H_{d} = H^{k}$
6:   end if
7: end for
Outputs: Optimal allocation relation matrix $H_{d}$.
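Algorithm 1 can be sketched as an exhaustive search over feasible assignments. The article does not state how the candidate matrices $H^{k}$ are generated, so the enumeration with `itertools.product` below is an assumption, as are the function name `allocate_targets` and the encoding of an assignment as a tuple of target indices (one per ASV) instead of a 0/1 matrix.

```python
import itertools
import math


def allocate_targets(asv_pos, target_pos, k_min):
    """Brute-force maximisation of Equation (5) under Equation (6).

    asv_pos: list of (x, y) ASV positions (length N).
    target_pos: list of (x, y) target positions (length M).
    k_min: minimum number of ASVs per target (K in Equation (6)).
    Returns (assignment, J) where assignment[i] is the target index
    chosen for ASV i, or (None, -inf) if no feasible assignment exists.
    """
    n, m = len(asv_pos), len(target_pos)

    # Step 1: Euclidean distances between ASVs and targets
    dist = [[math.dist(p, q) for q in target_pos] for p in asv_pos]

    best_j, best_assign = -math.inf, None
    # Each ASV picks exactly one target (second row of Equation (6))
    for assign in itertools.product(range(m), repeat=n):
        # Each target needs at least k_min ASVs (first row of Equation (6))
        if any(assign.count(t) < k_min for t in range(m)):
            continue
        j = -sum(dist[i][assign[i]] for i in range(n))
        if j > best_j:
            best_j, best_assign = j, assign
    return best_assign, best_j
```

The search visits $M^{N}$ candidates, which is only 64 for the six-on-two scenario studied in the article; larger fleets would call for a dedicated assignment solver.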

3.2. Cooperative Encirclement Reward

In this part, to overcome the learning inefficiency caused by sparse rewards, a curriculum learning-based encirclement reward function is developed for one alliance, which divides the whole process into three stages: search, besiegement, and capture, progressively enhancing the cooperative encirclement capability of ASVs.

To be specific, the condition of the search stage is defined as

$$\sum_{i=1}^{N-1} S_{TH_{i}H_{i+1}} + S_{TH_{N}H_{1}} > S_{H_{1}, \dots, H_{N}} \tag{7}$$

where $S_{(\cdot)}$ denotes the area of the polygon formed by connecting the subscripted vertices, so that $S_{H_{1}, \dots, H_{N}}$ is the polygonal area formed by connecting all ASVs. The search stage-induced reward is designed as

$$r_{search}^{i} = -k_{0} \left( \sum_{i=1}^{N-1} S_{TH_{i}H_{i+1}} + S_{TH_{N}H_{1}} - S_{H_{1}, \dots, H_{N}} \right) + \sum_{i=1}^{N} \left( d_{Ti}(t-1) - d_{Ti}(t) \right) \tag{8}$$

where $k_{0} > 0$ and $d_{Ti}$ denotes the distance between the $i$th ASV and the target.
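The area test in Equation (7) can be evaluated with the shoelace formula. In the sketch below, the helper names `polygon_area`, `area_gap`, and `in_search_stage` are hypothetical, the ASV positions are assumed to be listed in order around their polygon, and a small tolerance is added to guard against rounding when the two sides are equal.

```python
def polygon_area(vertices):
    """Unsigned area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0


def area_gap(target, asvs):
    """Left side minus right side of Equation (7).

    Sums the triangles (T, H_i, H_{i+1}), including the closing
    triangle (T, H_N, H_1), and subtracts the ASV polygon area.
    """
    n = len(asvs)
    tri_sum = sum(
        polygon_area([target, asvs[i], asvs[(i + 1) % n]]) for i in range(n)
    )
    return tri_sum - polygon_area(asvs)


def in_search_stage(target, asvs, tol=1e-9):
    """True when the condition of Equation (7) holds (tol is an assumption)."""
    return area_gap(target, asvs) > tol
```

For a convex ASV polygon the gap is zero when the target lies inside it and positive when the target lies outside, so the first term of Equation (8) penalizes configurations in which the target has not yet been enclosed.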

The condition of the besiegement stage is defined as

$$d_{ij} \leq k_{1} (d_{Ti} + d_{Tj}) \tag{9}$$

where $0 < k_{1} < 1$, $d_{ij}$ denotes the Euclidean distance between two vehicles. The besiegement stage-induced reward is designed as

$$r_{besieged}^{i} = -k_{2} \left( d_{ij} - k_{1} (d_{Ti} + d_{Tj}) \right) + k_{3} \exp\left( -|d_{Ti} - d_{capture}| \right) \tag{10}$$

where $k_{2} > 0$, $k_{3} > 0$, and $j = i + 1$.

The condition of the capture stage is defined as

$$\max d_{Ti} \leq d_{capture} \tag{11}$$

where $\max$ denotes the maximum value. The capture stage-induced reward is designed as

$$\begin{cases} r_{capture}^{i} = k_{4} \exp(g(\varphi_{i})) \\ g(\varphi_{i}) = -\left( \left( \sum_{i=1}^{N} \sin(\varphi_{i}) \right)^{2} + \left( \sum_{i=1}^{N} \cos(\varphi_{i}) \right)^{2} \right) \end{cases} \tag{12}$$

where $k_{4} > 0$ and $\varphi_{i} = \arctan((y_{i} - y_{T}) / (x_{i} - x_{T}))$.
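The capture reward in Equation (12) is largest when the bearing unit vectors of the ASVs sum to zero, which is the case for an even angular distribution. The sketch below illustrates this; it uses `atan2` in place of the single-argument arctangent so that all four quadrants are distinguished, which is an assumption about the intended angle convention, and the function name `capture_reward` is hypothetical.

```python
import math


def capture_reward(asv_pos, target_pos, k4=1.0):
    """Capture-stage reward of Equation (12) for one alliance.

    asv_pos: list of (x, y) ASV positions in the alliance.
    target_pos: (x_T, y_T) position of the target.
    """
    xt, yt = target_pos
    # Bearing of each ASV as seen from the target (atan2 is an assumption)
    phis = [math.atan2(y - yt, x - xt) for x, y in asv_pos]
    s = sum(math.sin(p) for p in phis)
    c = sum(math.cos(p) for p in phis)
    g = -(s * s + c * c)  # zero for evenly spread bearings, negative otherwise
    return k4 * math.exp(g)
```

Three ASVs spaced 120 degrees apart obtain the maximum reward $k_{4}$, whereas three ASVs stacked on the same bearing give $g = -9$ and hence a reward of $k_{4} e^{-9}$.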

By combining rewards Equations (8), (10), and (12), the stage-induced reward of cooperative target encirclement is written as

$$r_{encirclement}^{i} = \begin{cases} r_{search}^{i}, & \sum_{i=1}^{N-1} S_{TH_{i}H_{i+1}} + S_{TH_{N}H_{1}} > S_{H_{1}, \dots, H_{N}} \\ r_{besieged}^{i}, & d_{ij} \leq k_{1} (d_{Ti} + d_{Tj}) \\ r_{capture}^{i}, & \max d_{Ti} \leq d_{capture} \end{cases} \tag{13}$$

Moreover, to prevent collision among ASVs, the collision avoidance reward is designed as

$$r_{collision}^{i} = -k_{5} \exp(-\min d_{ij}) - k_{6} \exp(-d_{Ti}) \tag{14}$$

where $k_{5} > 0$ and $k_{6} > 0$.

Moreover, to reduce the control input chattering, the constraint reward is designed as

$$r_{inputs}^{i} = -\left( |\tau_{ui}(t) - \tau_{ui}(t-1)| + |\tau_{ri}(t) - \tau_{ri}(t-1)| \right) \tag{15}$$

In this context, the cooperative target encirclement reward function for the $i$th ASV at time step $t$ can be computed by

$$r_{t}^{i} = r_{encirclement}^{i} + r_{inputs}^{i} + r_{collision}^{i} \tag{16}$$
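The way Equations (14) to (16) combine can be sketched as follows. The defaults $k_{5} = k_{6} = 100$ are the values reported in Section 4; the function names and the representation of the control input as a pair `(tau_u, tau_r)` are assumptions made for illustration.

```python
import math


def collision_reward(min_d_ij, d_ti, k5=100.0, k6=100.0):
    """Equation (14): penalty that grows as an ASV nears a neighbour or the target."""
    return -k5 * math.exp(-min_d_ij) - k6 * math.exp(-d_ti)


def input_reward(tau, tau_prev):
    """Equation (15): penalty on the change of (tau_u, tau_r) between steps."""
    return -(abs(tau[0] - tau_prev[0]) + abs(tau[1] - tau_prev[1]))


def total_reward(r_encirclement, tau, tau_prev, min_d_ij, d_ti):
    """Equation (16): stage reward plus input and collision terms."""
    return (
        r_encirclement
        + input_reward(tau, tau_prev)
        + collision_reward(min_d_ij, d_ti)
    )
```

Because both exponentials in Equation (14) decay quickly with distance, the collision term is negligible at typical encirclement distances and only dominates when vehicles come within a few units of each other or of the target.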

3.3. Reinforcement Learning Control

By virtue of centralized training with a decentralized execution framework, a maximum entropy-based multi-agent soft actor-critic algorithm is developed for ASVs against multiple targets. In centralized training, the critic network of each ASV evaluates the global encirclement status by incorporating observations and actions from all vehicles, which can alleviate coordination inefficiencies caused by local observations. In distributed execution, the actor network of each ASV independently executes policy without relying on behavioral information from other vehicles, just utilizing its local observations to execute the encirclement strategy.

Firstly, the observation and action of the $i$th ASV are defined as

$$\begin{cases} o_{t}^{i} = [x_{i}, y_{i}, u_{i}, \psi_{i}, x_{T1}, \dots, x_{Tm}, y_{T1}, \dots, y_{Tm}] \\ a_{t}^{i} = [\tau_{ui}, \tau_{ri}] \end{cases} \tag{17}$$

which satisfies $\tau_{ui} \in [0, 100]$ and $\tau_{ri} \in [-25, 25]$ according to the maneuverability of ASVs.

The actor network $\pi_{\omega}^{i}$ of the $i$th ASV fits the policy function $f_{\omega^{i}}(o_{t}^{i}, \xi)$, whose outputs are the mean and standard deviation of a Gaussian distribution. The action is sampled by reparameterization

$$a_{t}^{i} = f_{\omega^{i}}(o_{t}^{i}, \xi) = \tanh\left( \vartheta_{\omega^{i}}(o_{t}^{i}) + \delta_{\omega^{i}}(o_{t}^{i}) \odot \xi \right) \tag{18}$$

where $\xi$ denotes the Gaussian noise, $\vartheta_{\omega^{i}}(o_{t}^{i})$ denotes the mean value, $\delta_{\omega^{i}}(o_{t}^{i})$ denotes the standard deviation and $\omega^{i}$ is the parameter of the actor network for the $i$th ASV.
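The reparameterized sampling in Equation (18) can be sketched as below, with the mean and standard deviation passed in directly in place of the actor network outputs. The tanh output lies in $(-1, 1)$, so some mapping to the actuator ranges of Equation (17) is needed; the affine rescaling in `scale_action` is an assumption, since the article does not state how the bounds are enforced. Both function names are hypothetical.

```python
import math
import random


def sample_action(mean, std, rng):
    """Equation (18): a = tanh(mean + std * xi), with xi ~ N(0, 1) per component."""
    return [math.tanh(m + s * rng.gauss(0.0, 1.0)) for m, s in zip(mean, std)]


def scale_action(a, low, high):
    """Map each component from (-1, 1) to [low, high] (assumed affine scaling)."""
    return [lo + 0.5 * (x + 1.0) * (hi - lo) for x, lo, hi in zip(a, low, high)]


# Usage with the bounds of Equation (17): tau_u in [0, 100], tau_r in [-25, 25]
rng = random.Random(0)
a = sample_action([0.3, -0.2], [1.0, 1.0], rng)
tau_u, tau_r = scale_action(a, [0.0, -25.0], [100.0, 25.0])
```

Writing the sample as a deterministic function of the network outputs and an independent noise term is what allows the gradient of the objective to flow through the sampled action during training.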

Within the reinforcement learning, the maximum entropy mechanism is adopted, enabling ASVs to enhance the random exploration ability of the policy while maximizing cumulative rewards. Consequently, the objective function of actor networks is designed based on the policy entropy and action value function

$$J(\pi_{\omega}^{i}) = E_{D}\left[ \min_{k=1,2} Q_{\theta_{k}}^{i}(o, a) \right] + \alpha_{i} H(\pi_{\omega}^{i}(\cdot | o_{t}^{i})) \tag{19}$$

where $H(\pi_{\omega}^{i}(\cdot | o_{t}^{i})) = -E_{D} \log(\pi_{\omega}^{i}(a_{t}^{i} | o_{t}^{i}))$, $o = \{o_{t}^{1}, o_{t}^{2}, \dots, o_{t}^{N}\}$, $a = \{a_{t}^{1}, a_{t}^{2}, \dots, a_{t}^{N}\}$. $\min_{k=1,2} Q_{\theta_{k}}^{i}(o, a)$ represents the smaller value generated by the two critic networks in the main network, where $\theta_{k}$ represents the parameter of two critic networks. $D$ denotes the experience replay buffer, and each set of experience data is stored in the form of tuple $\{o, a, r, o'\}$. $r = \{r_{t}^{1}, r_{t}^{2}, \dots, r_{t}^{N}\}$ represents the reward set of all ASVs. $o'$ denotes the observation information set at the next moment. $\alpha_{i} > 0$ denotes the regularization coefficient, which is updated by minimizing the loss function

$$L(\alpha_{i}) = E_{D}\left[ -\alpha_{i} \log(\pi_{\omega}^{i}(a_{t}^{i} | o_{t}^{i})) - \alpha_{i} H_{d} \right] \tag{20}$$

where $H_{d}$ is the predefined entropy threshold.

The critic network of the $i$th ASV evaluates the current encirclement status based on the global observation and action information, which is updated by minimizing the following loss function

$$L(Q_{\theta_{k}}^{i}) = E_{D}\left[ \frac{1}{2} \left( Q_{\theta_{k}}^{i}(o, a) - y_{d}^{i} \right)^{2} \right] \tag{21}$$

with

$$y_{d}^{i} = r_{t}^{i} + \gamma \left( \min_{k=1,2} Q_{\bar{\theta}_{k}}^{i}(o', \tilde{a}') - \alpha_{i} \log(\pi_{\omega}^{i}(\tilde{a}_{t+1}^{i} | o_{t+1}^{i})) \right) \tag{22}$$

where $\gamma$ is the discount factor and $\min_{k=1,2} Q_{\bar{\theta}_{k}}^{i}(o', \tilde{a}')$ represents the smaller value generated by two target critic networks. $\bar{\theta}_{k}$ denotes the parameters of the target critic network. $\tilde{a}' = \{\tilde{a}_{t+1}^{1}, \tilde{a}_{t+1}^{2}, \dots, \tilde{a}_{t+1}^{N}\}$ with $\tilde{a}_{t+1}^{i}$ being obtained by real-time sampling, that is $\tilde{a}_{t+1}^{i} \sim \pi_{\omega}^{i}(\cdot | o_{t+1}^{i})$, rather than from the experience replay buffer.

The target network of the $i$th ASV is updated using soft update

$$\bar{\theta}_{k}^{i} \leftarrow \sigma \theta_{k}^{i} + (1 - \sigma) \bar{\theta}_{k}^{i} \tag{23}$$

where $\sigma$ denotes the update rate.
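The target value of Equation (22) and the soft update of Equation (23) reduce to a few arithmetic operations once the network outputs are available. The sketch below works on plain numbers and flat parameter lists; the function names and the scalar interface are assumptions for illustration, and in practice the same expressions would be applied to batched tensors.

```python
def td_target(reward, q1_next, q2_next, log_prob_next, alpha, gamma):
    """Equation (22): soft target using the smaller of the two target critics."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


def soft_update(target_params, params, sigma):
    """Equation (23): move each target parameter a fraction sigma toward the main one."""
    return [
        sigma * p + (1.0 - sigma) * tp for tp, p in zip(target_params, params)
    ]


# Usage with illustrative numbers
y = td_target(1.0, 5.0, 4.0, -2.0, alpha=0.5, gamma=0.9)
theta_bar = soft_update([0.0, 2.0], [1.0, 2.0], sigma=0.1)
```

Taking the minimum of the two target critics counteracts value overestimation, and a small $\sigma$ keeps the target networks changing slowly so that the regression target in Equation (21) stays stable.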

Moreover, to enhance the sequential modeling capability of the deep network structure for historical encirclement information, the actor and critic networks are constructed using long short-term memory (LSTM), as shown in Figure 3, thereby capturing the temporal dependencies in the encirclement process.

The computation process of the actor network is as follows

$$\begin{cases} (\tilde{h}_{t}, C_{t}) = \mathrm{LSTM}(o_{t}^{R}, \tilde{h}_{t-1}, W_{a1}) \\ e_{t}^{a1} = F_{a1}(C_{t}, W_{a2}), \quad e_{t}^{a2} = F_{a2}(e_{t}^{a1}, W_{a3}) \\ e_{t}^{a3} = F_{a3}(e_{t}^{a2}, W_{a4}), \quad e_{t}^{a4} = F_{a4}(e_{t}^{a2}, W_{a5}) \\ \vartheta_{\omega} = O_{a1}(e_{t}^{a3}, W_{a6}), \quad \delta_{\omega} = O_{a2}(e_{t}^{a4}, W_{a7}) \end{cases} \tag{24}$$

where $o_{t}^{R}$ denotes the historical observation sequence. $\tilde{h}_{t}$ and $C_{t}$ denote the hidden state and cell state generated by the LSTM layer. $F_{a*}$ denotes fully connected (FC) layer. $e_{t}^{a*}$ denotes the output of the fully connected layer. $O_{a*}$ denotes output layer. $W_{a*}$ denotes learnable network weights.

The computation process of the critic network is as follows

$$\begin{cases} (h_{t}^{1}, C_{t}^{1}) = \mathrm{LSTM}_{1}(o_{t}^{R}, h_{t-1}^{1}, W_{c1}) \\ (h_{t}^{2}, C_{t}^{2}) = \mathrm{LSTM}_{2}(a_{t}^{R}, h_{t-1}^{2}, W_{c2}) \\ e_{t}^{c1} = F_{c1}(C_{t}^{1}, W_{c3}), \quad e_{t}^{c2} = F_{c2}(C_{t}^{2}, W_{c4}) \\ \kappa_{t} = T(e_{t}^{c1}, e_{t}^{c2}, W_{c5}), \quad e_{t}^{c3} = F_{c3}(\kappa_{t}, W_{c6}) \\ Q = O_{c}(e_{t}^{c3}, W_{c7}) \end{cases} \tag{25}$$

where $a_{t}^{R}$ denotes the historical action sequence. $h_{t}^{*}$ and $C_{t}^{*}$ denote the hidden state and cell state generated by the two LSTM layers. $F_{c*}$ denotes fully connected layer. $e_{t}^{c*}$ denotes the output of the fully connected layer. $T$ denotes connection layer and $\kappa_{t}$ is its output. $O_{c}$ denotes output layer. $W_{c*}$ denotes learnable network weights.

The specific training process is presented in Algorithm 2, where $\lambda_{*}$ denotes learning rates and satisfies $0 < \lambda_{*} < 1$.

Algorithm 2: Cooperative encirclement control via MARL.

Inputs: Actor network parameters $\omega^{i}$, critic network parameters $\theta_{1}^{i}$ and $\theta_{2}^{i}$.
Initialization: Target network parameters $\bar{\theta}_{1}^{i} \leftarrow \theta_{1}^{i}$ and $\bar{\theta}_{2}^{i} \leftarrow \theta_{2}^{i}$, experience replay buffer $D \leftarrow \emptyset$, cooperative encirclement environment.
1: for $n_{1} = 1, \dots, F$ do
2:   for $n_{2} = 1, \dots, Q$ do
3:     Sample action $a_{t}^{i} \sim \pi_{\omega}^{i}(\cdot | o_{t}^{i})$ from policy based on observation $o_{t}^{i}$
4:     Compute reward $r_{t}^{i}$ and obtain $o_{t+1}^{i}$
5:     Store $\{o, a, r, o'\}$ into the experience replay buffer $D$
6:   end for
7:   Replay mini-batch samples from $D$
8:   Update network parameters using gradient descent/ascent and soft update
       $\theta_{k} \leftarrow \theta_{k} - \lambda_{cr} \nabla_{\theta_{k}} L(Q_{\theta_{k}}^{i})$ for $k \in \{1, 2\}$
       $\omega \leftarrow \omega + \lambda_{ac} \nabla_{\omega} J(\pi_{\omega}^{i})$
       $\alpha_{i} \leftarrow \alpha_{i} - \lambda_{\alpha} \nabla_{\alpha} L(\alpha_{i})$
       $\bar{\theta}_{k}^{i} \leftarrow \sigma \theta_{k}^{i} + (1 - \sigma) \bar{\theta}_{k}^{i}$ for $k \in \{1, 2\}$
9: end for
Outputs: Network parameters $\omega^{i}$, $\theta_{1}^{i}$ and $\theta_{2}^{i}$.

4. Simulation Results

In order to demonstrate the effectiveness of the proposed cooperative encirclement control method, simulation studies together with comprehensive comparisons are conducted. Principal parameters of the ASV are as follows: $m_{ui} = 19$, $m_{vi} = 35.2$ and $m_{ri} = 4.2$. The nominal dynamics are given by $f_{ui} = 35.2 v_{i} r_{i} - 4 u_{i}$, $f_{vi} = -19 u_{i} r_{i} - 1 v_{i}$ and $f_{ri} = -16.2 u_{i} v_{i} - 10 r_{i}$ [32]. Time-varying environmental disturbances are added to the ASV model at the beginning of each episode and remain throughout the episode, which are given by

$$\begin{cases} \dot{\tau}_{wui} + \alpha_{ui} \tau_{wui} = m_{ui} G(s) w_{ui} \\ \dot{\tau}_{wvi} + \alpha_{vi} \tau_{wvi} = m_{vi} G(s) w_{vi} \\ \dot{\tau}_{wri} + \alpha_{ri} \tau_{wri} = m_{ri} G(s) w_{ri} \end{cases} \tag{26}$$

where $\alpha_{*i}$ is a positive constant. $w_{*i}$ represents a Gaussian white noise process, $* = u, v, r$. $G(s) = 0.255 s / (s^{2} + 0.485 s + 0.8^{2})$ represents the transfer function.

The simulation scenario is as follows. Consider a networked system composed of six ASVs and two targets in the horizontal plane without geographical boundary constraints. If the distance between the ASV and the target is less than 50 m, the target adopts the maneuver strategy given by (4). When the criteria (3) are met, it is determined that the target encirclement mission is completed.

At the beginning of each episode, the positions and heading angles of ASVs are randomly initialized within a specified range. To be specific, $x_{i} \in [-30, 30]$, $y_{i} \in [-30, 30]$, $\psi_{i} \in [0, \pi]$. The initial positions of the two targets are $(20, 100)$ and $(50, 100)$, and initial heading angles are both $\pi / 4$. Let the encirclement radius be $d_{capture} = 15$ and the minimum distance between ASVs and targets be $d_{\min} = 10$. After numerous simulations and parameter adjustments, the reward coefficients are set as $k_{0} = 2$, $k_{1} = 10$, $k_{5} = 100$, and $k_{6} = 100$. The training hyperparameters are listed in Table 1.

In order to demonstrate the superiority of the proposed stage-induced reward function, the traditional reward method in [33] is deployed to derive comparison results. Specifically, the traditional encirclement reward function is governed by

$$r_{encirclement}^{i} = 100 \frac{\chi_{left} + \chi_{right}}{4\pi} \exp(-0.03 \sigma(\chi)) + r_{guide}^{i} \tag{27}$$

with $r_{guide}^{i} = 0.5$ if $d_{Ti} \leq d_{capture}$ and $r_{guide}^{i} = 0.5 \exp(-0.005 d_{capture})$ if $d_{Ti} > d_{capture}$, where $\chi_{left}$ and $\chi_{right}$ are the encirclement angles of the $i$th vehicle and the neighbor vehicle, respectively. $\sigma(\chi)$ is the standard deviation of the encirclement angle. Note that the superiority and comparisons of the LSTM can be found in our previous work [34].

The encirclement performance is evaluated from two aspects: episode rewards and task success rates. Episode rewards of cooperative target encirclement using different reward functions are shown in Figure 4, where the shaded areas represent the standard deviation. As illustrated, all methods converge within the training steps. However, policy training using the stage-induced reward function clearly exhibits faster convergence and higher reward values, which can be explained by the fact that ASVs using the curriculum rewards gradually learn to search, besiege, and capture, avoiding unnecessary exploration of the dynamic environment. The success rate of cooperative target encirclement using different reward functions is shown in Figure 5. During the first 2000 episodes, the success rate remains close to zero, indicating that the ASVs are still exploring and have not yet developed a successful policy for cooperative target encirclement. After 2500 episodes, with the assistance of curriculum learning, the success rate achieved by the stage-induced reward function continues to increase (approaching 90%), while the success rate of the comparison method rises more slowly.

After the training is completed, we save the network parameters with the highest reward value, and conduct a six-on-two cooperative target encirclement test. The vehicles’ initial positions are set as $p_{1} = [-20, -20]^{T}$, $p_{2} = [0, -20]^{T}$, $p_{3} = [20, -20]^{T}$, $p_{4} = [-20, 0]^{T}$, $p_{5} = [0, 0]^{T}$, and $p_{6} = [20, 0]^{T}$. The other values are the same as those in the training process. Note that if the target meets the encirclement condition (3), the encirclement mission will be immediately terminated.

The trajectories of both ASVs and targets throughout the encirclement process are depicted in Figure 6. Notably, the six vehicles set off from their initial positions and gradually capture the two moving targets, forming an encirclement circle around each. It is worth mentioning that the six vehicles are gradually divided into two encirclement alliances. Each alliance has three ASVs, and an effective encirclement of the two moving targets is achieved in the end. The target allocation result is shown in Figure 7, where the second ASV and the fifth ASV switch encirclement targets at about 40 s. Finally, the first, fourth, and fifth ASVs encircle the first target, while the second, third, and sixth ASVs encircle the second target. In addition, the two encirclement alliances terminate at different times.

The distances between ASVs and the corresponding targets are shown in Figure 8 and Figure 9. The encirclement angles of adjacent ASVs are shown in Figure 10 and Figure 11. One can conclude that both distances and angles satisfy the desired requirements. Moreover, under the incentive of curriculum rewards, ASVs first meet the distance constraint and then the angle constraint when conducting the cooperative encirclement, achieving the given control objective from the easier to the more difficult. Figure 12 and Figure 13 present the velocities and heading angles of ASVs, respectively, where each vehicle adjusts the velocity and angle in real time according to the moving target. Figure 14 and Figure 15 show control inputs of ASVs, including the surge forces and yaw moments, which are bounded and realistic from a practical viewpoint. Both exhibit minor fluctuations during cooperative target encirclement. These fluctuations arise from the time-varying environmental disturbances.

5. Conclusions

This article focuses on the cooperative encirclement problem for multiple ASVs and proposes a multi-target cooperative encirclement control method using MARL. Considering practical mission requirements and operational constraints, the conditions for successful cooperative encirclement by multiple ASVs are established. A dynamic target allocation algorithm, based on the positional information of both vehicles and targets, is designed to optimize target selection in real time. To meet the demands of high-performance cooperative training, a curriculum learning-based multi-stage encirclement reward function is developed, guiding ASVs to approach targets progressively from simpler to more complex tasks. Within a centralized training and decentralized execution framework, a multi-agent soft actor-critic algorithm incorporating long short-term memory is designed to compute the control inputs of vehicles, enabling effective multi-target encirclement. Finally, simulation results validate the effectiveness and superiority of the proposed control method.

Further investigations may aim at the cooperative encirclement of ASVs against unknown targets. It is also desirable to investigate the practical implementation of the proposed method for target encirclement with hardware constraints, communication delays, and measurement errors.

Author Contributions

Conceptualization, X.Q. and S.J.; methodology, X.Q.; software, C.L.; validation, X.Q., S.J. and R.Z.; formal analysis, G.L.; investigation, C.L.; resources, R.Z.; data curation, X.Q.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q.; visualization, G.L.; supervision, S.J. and R.Z.; project administration, G.L. and R.Z.; funding acquisition, R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grant 61673084, in part by the Liaoning Province Applied Basic Research Program Project under Grant 2025JH2/101300065, and in part by the Fundamental Research Funds for the Central Universities under Grant 044420250027.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Cooperative encirclement diagram of ASVs against multiple targets.

Figure 2. Architecture of the proposed cooperative encirclement control method.

Figure 3. Structure of the actor network and critic network.

Figure 4. Episode rewards of cooperative target encirclement using different reward functions.

Figure 5. Success rates of cooperative target encirclement using different reward functions.

Figure 6. Cooperative encirclement performance of ASVs using the proposed method.

Figure 7. Target allocation process for cooperative target encirclement.

Figure 8. Distances between ASVs and the target with the first alliance.

Figure 9. Distances between ASVs and the target with the second alliance.

Figure 10. Encirclement angles of adjacent ASVs with the first alliance.

Figure 11. Encirclement angles of adjacent ASVs with the second alliance.

Figure 12. Velocities of ASVs during cooperative target encirclement.

Figure 13. Heading angles of ASVs during cooperative target encirclement.

Figure 14. Surge forces of ASVs during cooperative target encirclement.

Figure 15. Yaw moments of ASVs during cooperative target encirclement.

Table 1. Main training hyperparameters.

Hyperparameters	Value
Discount factor $γ$	0.99
Maximum training episodes $F$	10,000
Maximum steps per training episode $Q$	1000
Critic network learning rate $λ_{cr}$	0.0003
Actor network learning rate $λ_{ac}$	0.0002
Entropy learning rate $λ_{en}$	0.0005
Target network update rate $σ$	0.0002
Entropy threshold $H_{d}$	−2
Mini-batch	256
Experience replay buffer size	1,000,000
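
For readers who wish to reproduce the setup, the values in Table 1 can be collected in a single configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and the soft (Polyak) target update shown is the form commonly used with soft actor-critic, assumed here to be the way the target network update rate $σ$ is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values from Table 1; field names are hypothetical."""
    discount_factor: float = 0.99         # gamma
    max_episodes: int = 10_000            # F
    max_steps_per_episode: int = 1_000    # Q
    critic_lr: float = 3e-4               # lambda_cr
    actor_lr: float = 2e-4                # lambda_ac
    entropy_lr: float = 5e-4              # lambda_en
    target_update_rate: float = 2e-4      # sigma
    entropy_threshold: float = -2.0       # H_d
    mini_batch_size: int = 256
    replay_buffer_size: int = 1_000_000


def soft_update(target_params, online_params, rate):
    """Polyak averaging (assumed form): target <- (1 - rate) * target + rate * online."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target_params, online_params)]


cfg = TrainingConfig()
# Toy parameters standing in for network weights.
updated = soft_update([0.0, 1.0], [1.0, 1.0], cfg.target_update_rate)
```

With an update rate this small, the target parameters move only 0.02% of the way toward the online parameters at each step, so the target critic changes slowly.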


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
