Article

An Adaptive Difference Policy Gradient Method for Cooperative Multi-USV Pursuit in Multi-Agent Reinforcement Learning

School of Navigation, Jimei University, No. 185, Yinjiang Road, Jimei District, Xiamen 361021, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(3), 252; https://doi.org/10.3390/jmse14030252
Submission received: 27 December 2025 / Revised: 19 January 2026 / Accepted: 23 January 2026 / Published: 25 January 2026
(This article belongs to the Section Ocean Engineering)

Abstract

In constrained waters, multi-USV cooperative encirclement of highly maneuverable targets is strongly affected by partial observability as well as obstacle and boundary constraints, posing substantial challenges to stable cooperative control. Existing deep reinforcement learning methods often suffer from low exploration efficiency, pronounced policy oscillations, and difficulties in maintaining the desired encirclement geometry in complex environments. To address these challenges, this paper proposes an adaptive difference-based multi-agent policy gradient method (MAADPG) under the centralized training and decentralized execution (CTDE) paradigm. MAADPG deeply integrates potential-field-inspired geometric guidance with a multi-agent deterministic policy gradient framework. Specifically, a guidance module generates geometrically interpretable candidate actions for each pursuer. Moreover, a difference-driven adaptive action adoption mechanism is introduced at the behavior policy execution level, where guided actions and policy actions are locally compared and the guided action is adopted only when it yields a significantly positive return difference. This design enables MAADPG to select higher-quality interaction actions, improve exploration efficiency, and enhance policy stability. Experimental results demonstrate that MAADPG consistently achieves fast convergence, stable coordination, and reliable encirclement formation across representative pursuit–encirclement scenarios, including obstacle-free, sparsely obstructed, and densely obstructed environments, thereby validating its applicability and stability for multi-USV encirclement tasks in constrained waters.

1. Introduction

The constrained encirclement problem (CEP) concerns coordinating multiple unmanned surface vehicles (USVs) to effectively surround and constrain a highly maneuverable target within a restricted maritime domain, under finite-time and safety constraints [1,2,3,4,5,6,7]. As an integrative problem involving cooperative control, adversarial game dynamics, and multiple operational constraints, CEP addresses pressing needs in maritime search and rescue, waterway lockdown, target interception, illegal intrusion prevention, and port security. By enabling coordinated multi-USV operations, it enhances area-coverage efficiency and target-control accuracy, while allowing unmanned execution in high-risk environments to reduce both cost and personnel risk. In this context, the application of machine learning methods in the fields of maritime and ocean engineering has been systematically reviewed and has attracted broad attention, particularly for autonomous marine control and intelligent decision-making tasks, where it demonstrates considerable potential [8]. Based on these characteristics and its significance, the constrained encirclement problem (CEP) has been widely adopted as a key benchmark task in multi-agent systems (MAS) research for evaluating cooperative capability and robustness [9].
In the literature, early studies on the constrained encirclement problem (CEP) primarily relied on geometric and potential-field-based approaches, in which rapid target approaching and basic formation maintenance were achieved using analytical potential functions, velocity guidance laws, and closed-form trajectories [10]. Owing to their structural simplicity and low computational overhead, these methods are well-suited to regular maritime environments with sparse obstacles. Nevertheless, potential-field designs are typically handcrafted, which may induce field conflicts and local minima in obstacle-rich scenarios with complex boundaries or highly maneuvering targets, ultimately resulting in formation disruption. To enhance stability and practicability, subsequent works incorporated conventional USV techniques, such as interception guidance, observation compensation, and robust control, and further established integrated sensing–control pipelines to improve trajectory convergence and attitude regulation [11]. However, their performance may still degrade under strong nonlinear disturbances and environmental uncertainties.
Data-driven approaches further attempted to unify historical experience, feasible-region constraints, and obstacle-avoidance rules into policy design, thereby improving deployment safety and reproducibility [12]. Yet, the absence of sufficient online adaptation often leads to limited generalization in unseen environments or in the presence of dynamic obstacles. Model predictive control (MPC) explicitly embeds constraints, including velocity, acceleration, and boundary limits, into a receding-horizon optimization framework, enabling smooth trajectories and constraint consistency [13,14,15]. However, MPC generally requires accurate models and substantial computational resources, which hinders its scalability to real-time multi-agent encirclement. Imitation learning can facilitate rapid policy initialization using expert demonstrations [16], but limited coverage of demonstrations may cause policy degradation in multi-target, multi-obstacle, and dynamically adversarial settings. Differential games and analytical envelope control provide theoretical convergence guarantees and closed-form characterizations [17]; nevertheless, they often rely on restrictive assumptions regarding system dynamics, observability, and environmental structures, making them difficult to generalize to high-dimensional, unstructured, and dynamically constrained maritime environments.
Overall, while traditional approaches have advanced stability enhancement, constraint handling, and policy initialization, they still exhibit limited adaptability in high-dimensional and highly interactive maritime encirclement tasks. Specifically, the coupling complexity between multiple targets and multiple agents is difficult to fully capture [18]. Moreover, changes in constraints and environmental configurations frequently necessitate redesign and retuning of control laws, resulting in limited cross-scenario transferability [19]. When the task is further extended to cooperative navigation and collision avoidance with communication coupling, decision complexity and environmental non-stationarity are further amplified [20]. Therefore, intelligent approaches that do not rely on precise models and that support adaptive cooperative decision-making are highly desirable as complementary solutions.
Compared with conventional control approaches that rely on explicit model structures and handcrafted priors, reinforcement learning (RL) offers several advantages, including independence from accurate dynamics modeling, the capability to handle high-dimensional continuous action spaces, and the ability to adaptively optimize policies through interaction. Consequently, RL has shown strong potential for addressing complex, unstructured, and strongly coupled CEP tasks. In particular, RL can directly learn decision-making patterns under uncertainty, partial observability, and dynamic obstacles, thereby compensating for the limitations of traditional methods in adaptability, generalization, and cooperative decision-making. Early studies adopted deep reinforcement learning (DRL) to learn end-to-end encirclement policies. By jointly modeling reward shaping and obstacle-avoidance constraints, these methods achieved stable formations and closed-loop encirclement in multi-obstacle maritime environments, demonstrating the feasibility of DRL [21]. However, most of them were designed for single-task and closed scenarios, and the learned policies are often sensitive to the training distribution, resulting in limited cross-scenario generalization.
To enhance cooperation and cross-domain transferability, distributed and transferable policy frameworks introduced mechanisms of policy sharing and transfer, enabling multiple USVs to learn transferable joint strategies across multiple targets and operating conditions, which were further validated on engineering platforms [22]. For cooperative encirclement tasks involving multiple targets and dynamic obstacles, subsequent research incorporated boundary constraints, obstacle avoidance, and inter-agent collision avoidance into joint training, facilitating the emergence of cooperative behaviors in both simulation and real-world USV experiments [23]. To address perception-limited scenarios, graph attention networks and graph-structured encoders have been employed to extract critical neighborhood information from local observations, thereby improving capture success rates and policy stability [24]. Meanwhile, RL has increasingly been integrated with safety modeling, fault-tolerant compensation, and online adaptation mechanisms, giving rise to preliminary safe reinforcement learning frameworks that provide new avenues for feasibility guarantees under disturbances and multi-constraint conditions [25]. Nevertheless, existing RL and multi-agent reinforcement learning (MARL) approaches still face several challenges, including pronounced instability and performance fluctuations in early-stage training, low sample efficiency, difficulties in cooperative learning under non-stationary interactions, and ineffective integration between prior knowledge and learned policies. Therefore, CEP demands a new RL framework that can stabilize early exploration, improve learning efficiency, and enable controllable injection of prior knowledge, thereby strengthening cooperative decision-making in multi-agent systems.
Although multi-agent reinforcement learning (MARL) has demonstrated considerable potential in theoretical studies and simulation-based platforms, its application to encirclement tasks in constrained environments still faces several critical challenges. On the one hand, the strong coupling among agent policies induces pronounced non-stationarity during parallel learning, which often leads to policy oscillations, slow convergence, and degraded training stability. On the other hand, directly incorporating geometric priors, imitation demonstrations, or external guidance strategies into the learning framework may improve exploration quality to a certain extent; however, in the absence of difference-driven evaluation and selection mechanisms based on returns, such external strategies can instead interfere with policy updates, resulting in ambiguous exploration directions and reduced learning efficiency. Collectively, these factors significantly limit the effective deployment of MARL methods for cooperative encirclement tasks in complex and constrained maritime environments.
To address the above challenges, this paper proposes the Multi-Agent Guided Adaptive Difference Policy Gradient (MAADPG) method. Unlike existing hybrid MARL approaches that directly incorporate geometric heuristics, action masking, or residual control into policy outputs, MAADPG explicitly considers the risk of negative transfer and policy interference induced by external priors, and introduces a difference-driven adaptive action adoption mechanism. Instead of imposing a constant bias on the learned policy, the proposed mechanism performs online screening based on one-step return differences: the guidance action is allowed to intervene only when it yields an improvement in terms of the immediate return. In this way, MAADPG achieves controllable and reversible prior injection without altering the CTDE policy-gradient training pipeline, thereby mitigating the risk of negative transfer caused by prior mismatch and enhancing the stability of early-stage exploration. The main contributions of this paper are summarized as follows:
(1)
A difference-driven adaptive action adoption mechanism is proposed. By performing a prospective evaluation between the autonomous policy action and the guidance action, the mechanism adaptively determines whether to accept the guidance action according to their immediate return difference, thereby preventing prior guidance from imposing a persistent bias on the policy output. Consequently, without altering the original training pipeline, it effectively stabilizes early-stage exploration, reduces the risk of negative transfer introduced by external priors, and maintains the consistency and controllability of policy updates.
(2)
Collaborative framework between structural priors and learned policies. A collaborative framework is established that combines structural priors with learned policies. Guidance forces are generated via a potential-field-based heuristic, and, together with the single-step difference evaluation mechanism, the algorithm gradually internalizes environmental constraints and coordination patterns. This leads to an organic integration of prior knowledge and deep reinforcement learning.
(3)
Progressive consistency evaluation strategy. A progressive consistency evaluation strategy is proposed which adopts a sequential action adoption and immediate update mechanism. This ensures that single-step policy evaluations across multiple agents are conducted under a consistent state premise, avoiding contradictory assessments and enhancing the coherence of joint decision-making.
The remainder of this paper is organized as follows: Section 2 formulates the problem, including the ship motion model and CEP task definition; Section 3 presents the underlying RL algorithms and the proposed MAADPG framework; Section 4 validates algorithm performance through simulation experiments; Section 5 concludes the paper and outlines future research directions.

2. Problem Formulation and Preliminaries

2.1. Multi-Agent Reinforcement Learning

Deep reinforcement learning (DRL) enables agents to learn and update policies through real-time interactions with the environment. At each decision step, the agent generates an action based on its current state and observations. The environment then transitions to a new state and returns an immediate reward to evaluate the action taken. The agent uses the reward signal and interaction trajectories to update its parameterized policy online, thus gradually improving its long-term return.
In multi-agent encirclement tasks, individual decisions are inherently coupled and often occur under partial observability. Agents simultaneously select actions at each time step, and the joint action determines the environment transition, after which each agent receives an individualized reward. Additionally, the continuous adaptation of other agents’ policies introduces non-stationarity and complicates credit assignment [26]. To model such multi-agent interactions, we formulate the environment as a Markov game [27]:
$$G = \left\langle N,\, S,\, \{A_i\}_{i=1}^{N},\, \{O_i\}_{i=1}^{N},\, P,\, \{r_i\}_{i=1}^{N},\, \gamma \right\rangle,$$
where $N = \{1, \ldots, n\}$ denotes the set of agents and $S$ is the global state space; $A_i$ and $O_i$ denote the action space and the local observation space of agent $i$, respectively, with $o_i = O_i(s)$ representing the output of the observation function; $P(\cdot \mid s, a)$ denotes the joint state-transition function; $r_i : S \times \prod_i A_i \to \mathbb{R}$ is the immediate reward function; and $\gamma \in (0, 1)$ is the discount factor. At time step $t$, given the current state $s_t \in S$, all agents simultaneously select a joint action $a_t = (a_{1,t}, \ldots, a_{n,t}) \in \prod_i A_i$. The environment then evolves according to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and returns to each agent its individual reward $r_{i,t} = r_i(s_t, a_t)$ and observation $o_{i,t} = O_i(s_t)$. The objective is to learn a policy $\pi_i$ for each agent that maximizes its expected cumulative discounted return during the interaction process, defined as:
$$J_i(\theta_i) = \mathbb{E}_{(s, o) \sim D}\left[ Q_i^{\mu}\bigl(s,\, \mu_1(o_1), \ldots, \mu_N(o_N)\bigr) \right].$$
To alleviate the non-stationarity caused by the mutual learning of multiple agents, this article adopts the centralized training and decentralized execution (CTDE) paradigm of MADDPG [28]: during training, each agent is equipped with a centralized critic and a decentralized actor; the critic $Q_i(s, a; \phi_i)$ has access to the global state $s$ and the joint action $a$, while the actor $\mu_i(o_i; \theta_i)$ depends only on the local observation $o_i$. During execution, agents act independently using $\mu_i$ without requiring global information. The critic is trained by minimizing the temporal-difference (TD) error [29]:
$$L_i(\phi_i) = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \bigl( Q_i(s, a; \phi_i) - y_i \bigr)^2 \right],$$
where the target value is computed by a delayed target network:
$$y_i = r_i + \gamma\, Q_i'\bigl(s',\, a_1', \ldots, a_N';\, \phi_i'\bigr),$$
where $a_j' = \mu_j'(o_j'; \theta_j')$ and $o_j' = O_j(s')$. Correspondingly, the actor is updated according to the deterministic policy gradient (DPG) [30]:
$$\nabla_{\theta_i} J_i(\theta_i) = \mathbb{E}_{(s, o) \sim D}\left[ \nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}(s,\, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right].$$
To improve training stability and sample efficiency, all interaction data are uniformly stored in the replay buffer $D$ for off-policy training, and the target networks are updated softly:
$$\phi_i' \leftarrow \tau \phi_i + (1 - \tau)\, \phi_i',$$
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\, \theta_i',$$
where $\tau \ll 1$ serves to suppress training oscillations and enhance convergence. Under the above mechanism and without altering the information constraints at execution time, the critic can exploit global information during training to provide consistent evaluation of joint behaviors, thereby enabling stable and efficient policy learning in non-stationary multi-agent environments.
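The soft target-network update above is a Polyak average over parameters. A minimal sketch, in which plain NumPy arrays stand in for network weights (this is illustrative, not the authors' implementation):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """In-place Polyak update: phi' <- tau * phi + (1 - tau) * phi',
    applied per parameter tensor; tau << 1 keeps targets slow-moving."""
    for name in target_params:
        target_params[name] = (tau * online_params[name]
                               + (1.0 - tau) * target_params[name])
    return target_params
```

Calling this once per gradient step keeps the target critic and actor lagging smoothly behind their online counterparts, which damps oscillations in the TD targets.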
Building on this policy-optimization framework, this paper focuses on the constrained encirclement problem (CEP)—a constrained multi-agent encirclement task whose objective is to effectively blockade and intercept a highly maneuverable target in a maritime environment. To standardize the encirclement performance metric and the subsequent optimization objective, the team-level expected discounted return of the pursuer (USV) fleet is defined as
$$J(\mu \mid s_0) = \mathbb{E}_{\tau \sim \rho_{\mu}(s_0)}\left[ \sum_{t=0}^{T_{\max}} \gamma^t\, r_t(s_t, a_t) \right],$$
where $\mu = (\mu_1, \ldots, \mu_N) \in \Pi$ denotes the joint deterministic policy of the encircling vessels; $\gamma \in (0, 1)$ is the discount factor; $T_{\max}$ is the upper bound on the episode length; and $r_t(s_t, a_t)$ is the team reward function used to comprehensively evaluate the encirclement performance.
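The team-level return defined above is a finite-horizon discounted sum over an episode's rewards. A minimal sketch (the truncation at $T_{\max}$ is implicit in the length of the reward sequence):

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon discounted return: sum over t of gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```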
This paper formulates the encirclement process as a cooperative multi-agent Markov game [31] and trains the encircling team under the CTDE paradigm to maximize the expected discounted return at the team level:
$$\mu^* \in \arg\max_{\mu \in \Pi} J(\mu \mid s_0).$$
For the sake of policy learning efficiency and numerical stability, a planar point-mass kinematic model with acceleration control is adopted. Each USV outputs control variables within a normalized action space, which are uniformly mapped by the environment into physical accelerations to propagate the system state. The heading angle is not integrated as an independent state variable but is instead instantaneously determined by the velocity direction.
Considering the local sensing, limited communication, and obstacle-avoidance requirements inherent in multi-USV systems, the CEP is further formulated as a Markov game and solved under the CTDE paradigm to obtain the optimal policy of the encircling team, denoted as $\mu^*$.

2.2. Ship Motion Model

In the study of maneuver decision-making for multi-USV encirclement, we construct a continuous scenario in which each USV moves on the two-dimensional horizontal plane Ω = [ 0 , L ] × [ 0 , L ] . The two-dimensional USV kinematic model is illustrated in Figure 1.
For tractable policy learning and numerical stability, we adopt a planar point-mass model with acceleration control: the policy outputs a two-dimensional control in the normalized action space $[-1, 1]^2$, which is uniformly mapped by the environment to physical accelerations to propagate the state. The continuous-time model is
$$\dot{p}_i(t) = v_i(t), \qquad a_i(t) = \dot{v}_i(t) = a_{\max}^{(i)}\, u_i(t),$$
where $p_i = [x_i, y_i]^\top$ and $v_i = [v_{x,i}, v_{y,i}]^\top$ denote position and velocity, respectively; $u_i = [u_{x,i}, u_{y,i}]^\top \in [-1, 1]^2$ is the normalized action, and $a_{\max}^{(i)} > 0$ is the acceleration upper bound. The heading angle is not integrated as an independent state but is given instantaneously by the velocity direction. To avoid numerical jitter near zero speed, we adopt the following thresholded definition:
$$\psi_i(t) = \begin{cases} \operatorname{atan2}\bigl(v_{y,i}(t),\, v_{x,i}(t)\bigr), & \lVert v_i(t) \rVert > \varepsilon, \\ \psi_i(t^-), & \text{otherwise}, \end{cases}$$
where $\varepsilon > 0$ is a small threshold and $t^-$ denotes the instant immediately preceding time $t$ (i.e., the heading of the previous step is retained). The velocity and position are updated via first-order explicit Euler integration, with the velocity norm and the position domain projected at each step:
$$v_{i,t+1} = \Pi_{\lVert \cdot \rVert \le v_{\max}^{(i)}}\bigl(v_{i,t} + a_{\max}^{(i)} u_{i,t}\, \Delta t\bigr), \qquad p_{i,t+1} = \Pi_{\Omega}\bigl(p_{i,t} + v_{i,t+1}\, \Delta t\bigr),$$
where $\Pi_{\lVert \cdot \rVert \le v_{\max}^{(i)}}(\cdot)$ denotes the velocity-norm saturation projection (onto the ball of radius $v_{\max}^{(i)}$), and $\Pi_{\Omega}(\cdot)$ clips the position componentwise to the workspace $\Omega$. From Equations (10) and (11), all USVs receive continuous control under a common normalized scale. The corresponding physical constraints are
$$\lVert v_{i,t} \rVert \le v_{\max}, \qquad \lVert a_{i,t} \rVert \le a_{\max}.$$
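The Euler update with both projections can be sketched directly; parameter values below are illustrative placeholders, not the paper's configuration:

```python
import numpy as np

def step_usv(p, v, u, a_max=1.0, v_max=2.0, dt=0.1, L=100.0):
    """One explicit Euler step of the point-mass model with velocity-norm
    saturation (ball of radius v_max) and componentwise clipping to [0, L]^2."""
    u = np.clip(np.asarray(u, float), -1.0, 1.0)   # normalized action
    v_new = np.asarray(v, float) + a_max * u * dt
    speed = np.linalg.norm(v_new)
    if speed > v_max:                              # saturation projection
        v_new *= v_max / speed
    p_new = np.clip(np.asarray(p, float) + v_new * dt, 0.0, L)
    return p_new, v_new
```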
To support partial observability and obstacle avoidance, we adopt a sector-based ray-distance sensing model (see Figure 2). Around the body heading, $N_r = 16$ rays are placed at equal angular intervals, with a maximum sensing range of $L_s$. The measurement vector of the $i$-th USV at time $t$ is
$$L_i(t) = \bigl[\ell_{i,1}(t), \ldots, \ell_{i,16}(t)\bigr] \in [0, L_s]^{16},$$
where $\ell_{i,k}(t) \in [0, L_s]$ denotes the distance along the $k$-th ray to either the environmental boundary or an obstacle; if no intersection exists, we set $\ell_{i,k}(t) = L_s$. For network input, the measurements are normalized as $\hat{L}_i = L_i / L_s \in [0, 1]^{16}$ and, together with geometric quantities such as $(p_i(t), v_i(t))$, constitute the observation $o_i$.
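A minimal ray-casting sketch for circular obstacles, assuming the setup above (boundary intersection is omitted for brevity, and names and defaults are illustrative):

```python
import numpy as np

def ray_distances(p, psi, obstacles, n_rays=16, L_s=20.0):
    """Normalized distances along n_rays equally spaced rays (starting from
    heading psi) to the nearest circular obstacle, clipped to range L_s."""
    p = np.asarray(p, float)
    angles = psi + 2.0 * np.pi * np.arange(n_rays) / n_rays
    out = np.full(n_rays, L_s)
    for k, a in enumerate(angles):
        d = np.array([np.cos(a), np.sin(a)])       # unit ray direction
        for c, r in obstacles:                     # circle center c, radius r
            oc = np.asarray(c, float) - p
            proj = oc @ d                          # along-ray foot of center
            h2 = oc @ oc - proj * proj             # squared perpendicular dist
            if h2 <= r * r and proj >= 0.0:
                t = proj - np.sqrt(r * r - h2)     # first intersection
                if 0.0 <= t < out[k]:
                    out[k] = t
    return out / L_s                               # hat-L in [0, 1]^n_rays
```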

2.3. Definition and Statement of the Multi-Agent Capture Problem

In cooperative encirclement tasks involving multiple unmanned surface vehicles (USVs) operating in constrained maritime environments, the pursuers are required to establish a stable and effective enclosure of a highly maneuverable target within a finite time horizon while satisfying strict safety constraints. The inherent complexity of this task arises from three primary aspects. First, maritime boundaries and internal obstacles partition the feasible operating space into multiple non-convex regions, which significantly restrict the maneuverability of encircling USVs during the approach and closure phases. Under such conditions, safe navigation often relies on traversing narrow feasible corridors, where constraint violations or collisions may easily occur. Second, the decision-making process is characterized by strong inter-agent coupling: the control action of an individual USV not only determines its own motion but also directly affects the geometric configuration of the overall formation. This coupling introduces pronounced non-stationarity into the multi-agent system, rendering the encirclement formation vulnerable to fragmentation, over-contraction, or target penetration. Third, the target USV typically exhibits superior speed and maneuverability and may employ stochastic evasion strategies, such that temporarily formed encirclement structures remain continuously exposed to disturbances and degradation. Without a unified geometric representation and explicit constraint-aware formulation, the encirclement process tends to exhibit failure modes including incomplete closure despite successful approach, or closure accompanied by unsafe inter-vehicle proximity, thereby undermining the feasibility and robustness of cooperative encirclement control.
To provide a unified mathematical framework for characterizing the non-convex environmental constraints, inter-agent coupling, and dynamic disturbances inherent in cooperative encirclement tasks, and to support subsequent policy learning with a tractable and verifiable formulation, it is necessary to formally define the state space, distance measures, feasible regions, and geometric closure conditions of the encirclement scenario. Accordingly, this paper investigates a cooperative encirclement problem involving multiple unmanned surface vehicles (USVs) in a two-dimensional constrained maritime domain $W \subset \mathbb{R}^2$. The encirclement team consists of three USVs, and the target is denoted by $T$. The domain contains $K$ static circular obstacles. For each USV, the state and control at time $t$ are $x_{i,t} = [p_{i,t}, v_{i,t}]$ and the normalized action $u_{i,t} \in [-1, 1]^2$, respectively; action mapping, state propagation, and saturation follow Equations (10)–(13), while the heading angle is given instantaneously by the velocity direction and is not integrated as an independent state. To characterize the spatial feasibility between a USV and surrounding obstacles, the minimum distance from a point to the obstacle boundary is defined as follows:
$$D_O(p) = \min_{k = 1, \ldots, K} \bigl( \lVert p - c_k \rVert - r_k \bigr).$$
It provides a quantitative metric for evaluating the minimum safety margin of an individual relative to all obstacles and boundary constraints. Given the USV–USV and USV–obstacle safety thresholds d min > 0 , d obs > 0 , the feasible state set is
$$X_{\text{feas}} = \bigl\{ s \,\big|\, p_{i,t} \in W,\; \lVert p_{i,t} - p_{j,t} \rVert \ge d_{\min}\ (i \ne j),\; D_O(p_{i,t}) \ge d_{\text{obs}},\; \forall i \bigr\}.$$
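The clearance measure $D_O$ and the feasibility conditions translate directly into code; a sketch with illustrative thresholds:

```python
import numpy as np

def d_obstacle(p, obstacles):
    """Minimum clearance D_O(p) = min_k (||p - c_k|| - r_k)."""
    p = np.asarray(p, float)
    return min(np.linalg.norm(p - np.asarray(c, float)) - r
               for c, r in obstacles)

def feasible(positions, obstacles, d_min=1.0, d_obs=0.5, L=100.0):
    """Membership test for X_feas: in-workspace, pairwise spacing >= d_min,
    obstacle clearance >= d_obs, for every USV."""
    P = np.asarray(positions, float)
    if np.any(P < 0.0) or np.any(P > L):
        return False
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            if np.linalg.norm(P[i] - P[j]) < d_min:
                return False
        if d_obstacle(P[i], obstacles) < d_obs:
            return False
    return True
```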
That is, all individuals remain within the workspace and satisfy the safety constraints with respect to both inter-agent distances and obstacles. To further quantify the degree of geometric closure of the encirclement formation around the target, the maximum angular gap is introduced as follows:
$$\Gamma_t = \max_{m = 1, \ldots, N} \bigl( \theta_{(m+1),t} - \theta_{(m),t} \bigr),$$
where $\theta_{i,t} = \operatorname{atan2}\bigl(r_{i,t}^{y},\, r_{i,t}^{x}\bigr)$ is the polar angle of the $i$-th encircling USV relative to the target position $p_t^T$; $\theta_{(m),t}$ denotes the sequence obtained by sorting $\theta_{i,t}$ in ascending order, with the index taken cyclically so that the gap across $\pm\pi$ is included; and $\Gamma_t$ takes the maximum angular difference between adjacent elements. A smaller value of $\Gamma_t$ indicates a higher degree of circumferential coverage.
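The maximum angular gap can be computed by sorting the polar angles and including the wrap-around gap across $\pm\pi$; a minimal sketch:

```python
import numpy as np

def max_angular_gap(pursuers, target):
    """Gamma_t: largest circular gap between sorted pursuer polar angles,
    measured about the target position."""
    R = np.asarray(pursuers, float) - np.asarray(target, float)
    th = np.sort(np.arctan2(R[:, 1], R[:, 0]))
    wrap = 2.0 * np.pi - (th[-1] - th[0])   # gap across the -pi/pi seam
    return max(np.diff(th).max(), wrap)
```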
By jointly considering feasibility, safety, and geometric closure, encirclement success is defined as the condition under which the target is enclosed by a closed geometric structure formed by the pursuing USVs in the spatial domain, while all safety constraints are satisfied. Formally, the success set is defined as follows:
$$\begin{aligned} S_{\text{enc}} = \bigl\{ s_t \,\big|\,\; & p_t^T \in \operatorname{conv}\{p_{1,t}, \ldots, p_{N,t}\},\quad \rho_{\min} \le \lVert p_{i,t} - p_t^T \rVert \le \rho_{\max}\ \forall i,\quad \Gamma_t \le \Gamma_{\max}, \\ & p_{i,t} \in W,\quad \lVert p_{i,t} - p_{j,t} \rVert \ge d_{\min}\ (i \ne j),\quad D_O(p_{i,t}) \ge d_{\text{obs}}\ \forall i \bigr\}, \end{aligned}$$
where the four categories correspond to (1) convex-hull inclusion: the target lies inside the encircling USVs' convex hull; (2) radial bandwidth: the distances from all encircling USVs to the target fall within $[\rho_{\min}, \rho_{\max}]$; (3) circumferential closure: the maximum angular gap does not exceed the threshold $\Gamma_{\max}$; and (4) feasibility: the boundary constraint $p_{i,t} \in W$, the minimum inter-USV spacing $\lVert p_{i,t} - p_{j,t} \rVert \ge d_{\min}$, and the obstacle-safety requirement $D_O(p_{i,t}) \ge d_{\text{obs}}$ are simultaneously satisfied. In this paper, an instantaneous consensus-based criterion is adopted: the encirclement is regarded as successful if and only if there exists a time instant at which all conditions in Equation (18) are simultaneously satisfied.
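For $N = 3$ pursuers, the convex-hull inclusion test reduces to a point-in-triangle check. The sketch below covers conditions (1)–(3) with illustrative thresholds; the feasibility terms of condition (4) are assumed to be checked separately:

```python
import numpy as np

def cross2(u, v):
    """Scalar 2-D cross product."""
    return u[0] * v[1] - u[1] * v[0]

def in_triangle(q, pts):
    """Point-in-triangle test via consistent cross-product signs."""
    a, b, c = (np.asarray(p, float) for p in pts)
    q = np.asarray(q, float)
    s = [cross2(b - a, q - a), cross2(c - b, q - b), cross2(a - c, q - c)]
    return all(x >= 0 for x in s) or all(x <= 0 for x in s)

def encirclement_success(pursuers, target, rho_min=1.0, rho_max=10.0,
                         gamma_max=2.5):
    """Instantaneous test: hull inclusion, radial band, max angular gap."""
    P = np.asarray(pursuers, float)
    T = np.asarray(target, float)
    if not in_triangle(T, P):
        return False
    d = np.linalg.norm(P - T, axis=1)
    if d.min() < rho_min or d.max() > rho_max:
        return False
    th = np.sort(np.arctan2((P - T)[:, 1], (P - T)[:, 0]))
    gap = max(np.diff(th).max(), 2.0 * np.pi - (th[-1] - th[0]))
    return gap <= gamma_max
```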

3. Multi-Agent Guided Adaptive Difference Policy Gradient

This section presents the MAADPG algorithm in detail, including its motivating improvements, core ideas, complete design, and procedural workflow.

3.1. MAADPG

In the multi-agent reinforcement learning formulation presented in Section 2, the constrained encirclement problem (CEP) is characterized as a cooperative policy optimization task subject to multiple geometric constraints, safety distance requirements, and coupled agent dynamics. Although the centralized training and decentralized execution (CTDE) framework enables consistent value estimation for multi-agent critics, the corresponding actors still face substantial challenges in complex geometric environments. Specifically, (1) early-stage exploration lacks directional guidance, making it difficult to form stable approach and enclosure structures; (2) relying solely on policy updates often fails to preserve the geometric consistency of the formation, rendering the encirclement structure vulnerable to disruption by local optima or environmental disturbances; and (3) policy gradient methods operating in high-dimensional continuous action spaces are prone to severe oscillations, resulting in poor training stability. These limitations significantly hinder the effectiveness of conventional policy gradient methods in addressing CEP tasks in constrained maritime environments.
To address these structural challenges, the proposed MAADPG is developed on the theoretical foundations established in Equations (2), (5), and (8), and introduces a structured extension at the behavior-policy level to explicitly incorporate the intrinsic geometric structures and safety constraints of the CEP. The core idea of MAADPG is to preserve the CTDE framework while explicitly embedding encirclement-related geometric information into the policy output stage. This is achieved by constructing guidance actions endowed with task-specific structural priors, thereby compensating for the lack of directionality and geometric consistency exhibited by purely learned policies in complex state spaces.
Meanwhile, a difference-driven adaptive action adoption mechanism is introduced to perform local performance evaluation between the guidance actions and the policy-generated actions. Guided actions are adopted only when they yield an immediate performance improvement, which effectively prevents excessive intervention in policy learning. Notably, the proposed adoption mechanism is formulated as a condition-triggered guidance scheme. In contrast to long-term biasing strategies, such as action masking, residual superposition, or fixed-weight mixing, guidance intervention is activated only when the one-step return difference meets a predefined improvement criterion; otherwise, the policy action is preserved. This design effectively mitigates negative transfer and policy interference induced by prior mismatch. Through this structured enhancement, MAADPG enables more coordinated decision-making across target approach, obstacle avoidance, and formation geometry maintenance, resulting in policy generation with clear directionality and geometric consistency, and providing a more stable and reliable data distribution for subsequent gradient-based updates.
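The condition-triggered adoption rule described above can be sketched as a simple selector. In this sketch, `one_step_return` is a caller-supplied estimate of the immediate return (for example, a critic query or a one-step model rollout); both that evaluator and the margin `delta` are assumptions of the sketch, not the paper's exact implementation:

```python
def select_action(a_policy, a_guided, one_step_return, state, delta=0.05):
    """Difference-driven adaptive adoption: take the guided action only when
    its estimated one-step return beats the policy action's by margin delta;
    otherwise keep the policy action, so guidance imposes no persistent bias."""
    r_policy = one_step_return(state, a_policy)
    r_guided = one_step_return(state, a_guided)
    if r_guided - r_policy > delta:
        return a_guided, True    # guidance intervenes this step
    return a_policy, False       # autonomous policy action preserved
```

Because the rule is evaluated independently at each step, intervention is reversible: as the learned policy improves, the return difference shrinks and guidance naturally switches itself off.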
In the multi-agent reinforcement learning formulation given in Section 2, the capture task is modeled as a Markov game. For agent i, the optimization objective is the discounted return maximization problem defined in Equation (2); under the centralized-critic framework, the corresponding multi-agent deterministic policy gradient update is given by Equation (5), and the overall strategy optimization further incorporates the CEP capture performance metric in Equation (8).
Building on this unified framework, this paper develops the Multi-Agent Guided Adaptive Difference Policy Gradient (MAADPG) method for constrained multi-agent capture. MAADPG takes Equations (2), (5), and (8) as its theoretical foundation, and injects the explicit geometric structure and safety constraints of the cooperative encirclement and pursuit (CEP) problem into a structural extension at the behavior-policy level. While keeping the global optimization objective and the centralized-critic update form unchanged, it constructs a geometry-informed behavioural operator with adaptive difference constraints, so that policy improvement is jointly governed by the learned policy, the geometric guidance, and the online estimated return difference rather than being purely data-driven. The resulting strategy architecture is shown in Figure 3. Specifically, the core idea is embodied in the execution mechanism. For each encircling agent, two candidate actions are generated in parallel: the policy action produced by the actor network $a_i^r = \mu_i(o_i; \theta_i)$ and the guidance action generated by the GAD module $a_i^P = G_i(s, o_i; \psi_i)$. These actions are then subjected to a difference-driven single-step policy evaluation and comparison under a unified normalization scale. The guidance action is adaptively adopted only when it yields a significant improvement over the policy action. In this manner, early-stage exploration is stabilized and sample efficiency is enhanced without altering the training objective or the gradient computation process. The detailed implementation is provided in the pseudocode of Algorithm 1.
Actions are represented in normalized form and compared within the same normalized domain. Let the action dimension be $m$ and the normalized domain be $[-1,1]^m$. Here, the term "normalized domain" denotes that the policy network outputs a dimensionless action vector, in which each component corresponds to a candidate control command (generated by either the Actor or the PFM) scaled within the admissible range of the associated control channel. As a result, actions produced from different sources (i.e., the autonomous policy and the geometric guidance) can be evaluated and compared on a unified scale. Define the element-wise clipping operator $\Pi_{[-1,1]^m}(\cdot)$ and the environment action mapping $G : [-1,1]^m \to \mathcal{A} \subseteq \mathbb{R}^n$, which projects normalized quantities to the physical scale. The outputs $a_i^r$ and $a_i^g$ of the actor and the PFM are both first clipped to $[-1,1]^m$ and then passed through $G$ in a unified manner, so that the two action branches lie in the same action space, satisfy the same scale constraints, and can be directly compared.
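As an illustration, this shared clip-then-scale pipeline can be sketched in a few lines of Python; the function names `clip_to_unit` and `to_physical` are ours (not from the paper), and the per-channel limit `a_max` plays the role of the action map $G$ for a symmetric command range:

```python
import numpy as np

def clip_to_unit(u):
    """Element-wise projection onto the normalized action domain [-1, 1]^m."""
    return np.clip(np.asarray(u, dtype=float), -1.0, 1.0)

def to_physical(u, a_max):
    """Map a normalized action to the physical command scale (clip, then scale)."""
    return a_max * clip_to_unit(u)
```

Because both the actor output and the guidance output pass through the same two steps, they can be compared directly in the normalized domain before being sent to the environment.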
The guidance module adopts an artificial potential-field heuristic [32]: the two-dimensional guidance force is the superposition of target attraction, obstacle repulsion, and teammate repulsion:
$$F_i = \frac{k_t}{\|p_t - p_i\|^2}\,\frac{p_t - p_i}{\|p_t - p_i\|} + \sum_{o \in O} \frac{k_o}{d_o^2}\,\frac{p_i - p_o}{\|p_i - p_o\|} + \sum_{j \in N \setminus \{i\}} \frac{k_r}{\|p_i - p_j\|^2}\,\frac{p_i - p_j}{\|p_i - p_j\|},$$
where $N$ denotes the set of pursuers, $O$ the set of obstacles, $p_t$ the target position, and $d_o = \|p_i - p_o\| - r_o$ the shortest distance from the $i$-th pursuer to the obstacle surface. The relative influence of the potential-field components is regulated by three weighting parameters, $k_t$, $k_o$, and $k_r$, which control the contribution strengths of the target attraction, obstacle repulsion, and inter-agent repulsion terms, respectively. Specifically, $k_t$ determines the magnitude of the attractive force toward the target and directly affects the speed and intensity with which an agent converges on the target center. The parameter $k_o$ adjusts the obstacle repulsion term, enabling agents to maintain a safe clearance from obstacles while preventing the excessive sensitivity caused by overly strong repulsive forces. Meanwhile, $k_r$ governs the inter-agent repulsion term, which preserves the geometric distribution among agents during encirclement, preventing collision risks due to overly compact formations or degradation of the enclosure structure caused by excessive dispersion. To be consistent with the action constraints, $F_i$ is normalized and projected onto the action domain to obtain the guided action:
$$a_{i,t}^{S} = \Pi_{[-1,1]^m}\!\left( \frac{F_i}{\max(\varepsilon, \|F_i\|)} \right), \qquad 0 < \varepsilon \ll 1.$$
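The superposition of the three force terms and the subsequent normalization can be sketched as follows; the helper name `guidance_action`, the default gains, and the representation of obstacles as (center, radius) pairs are our own illustrative choices:

```python
import numpy as np

def guidance_action(p_i, p_t, obstacles, teammates,
                    k_t=1.0, k_o=1.0, k_r=1.0, eps=1e-6):
    """Potential-field guidance: target attraction plus obstacle and teammate
    repulsion, followed by magnitude normalization onto [-1, 1]^m."""
    def unit(v):
        n = np.linalg.norm(v)
        return v / max(n, eps), n

    # Target attraction: k_t / ||p_t - p_i||^2 along the unit direction to the target.
    d_t, n_t = unit(p_t - p_i)
    F = k_t / max(n_t, eps) ** 2 * d_t

    # Obstacle repulsion, using the surface distance d_o = ||p_i - p_o|| - r_o.
    for p_o, r_o in obstacles:
        d_dir, n_o = unit(p_i - np.asarray(p_o))
        d_o = max(n_o - r_o, eps)
        F += k_o / d_o ** 2 * d_dir

    # Teammate repulsion keeps the formation from collapsing onto itself.
    for p_j in teammates:
        d_dir, n_j = unit(p_i - np.asarray(p_j))
        F += k_r / max(n_j, eps) ** 2 * d_dir

    # Keep only the direction, then clip to the normalized action domain.
    return np.clip(F / max(eps, np.linalg.norm(F)), -1.0, 1.0)
```

With no obstacles or teammates, the guided action reduces to the unit vector pointing at the target, which matches the direction-only intent of the normalization.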
Algorithm 1 MAADPG training procedure
Require: number of agents $N$; guided-agent set $H \subseteq \{1, \dots, N\}$; batch size $B$; horizon $T$; number of episodes $E$; discount factor $\gamma$; soft-update rate $\tau$; update interval $K$; threshold $\beta$; stability coefficient $\lambda$; PFM parameters $(k_t, k_o, k_r)$; action map $a = A_{\max} u$, $u \in [-1,1]^m$
Ensure: trained policies $\{\mu_i\}_{i=1}^{N}$
 1: Initialize actors $\mu_i$, critics $Q_i$, target networks $Q_i^{\mathrm{tgt}}, \mu_i^{\mathrm{tgt}}$, and the replay buffer $\mathcal{D}$.
 2: Note: $\hat{R}_i(\cdot)$ denotes the environment immediate reward $r_i(s_t, a_t)$ defined in Equation (34).
 3: for episode $= 1$ to $E$ do
 4:   reset the environment and obtain $(s_1, o_1)$
 5:   for $t = 1$ to $T$ do
 6:     compute PFM actions for $i \in H$: $a_{i,t}^{p} \leftarrow \Pi_{[-1,1]^m}\big(F_i / \max(\varepsilon, \|F_i\|)\big)$
 7:     compute Actor actions for all $i$: $a_{i,t}^{r} \leftarrow \mathrm{clip}\big(\mu_i(o_{i,t}) + \text{noise},\, -1,\, 1\big)$
 8:     one-step evaluation for $i \in H$ (Peek step):
 9:       $\tilde{r}_{i,t}^{p} \leftarrow \hat{R}_i\big(s_t, (a_{i,t}^{p}, a_{-i,t}^{p})\big)$,
10:       $\tilde{r}_{i,t}^{r} \leftarrow \hat{R}_i\big(s_t, (a_{i,t}^{r}, a_{-i,t}^{p})\big)$
11:     compute the normalized return difference
12:       $\Delta_i(s_t) \leftarrow \dfrac{\tilde{r}_{i,t}^{p} - \tilde{r}_{i,t}^{r}}{|\tilde{r}_{i,t}^{p}| + \lambda} \times 100\%$
13:     apply the $\beta$-gate truncation for $i \in H$:
14:       $a_{i,t}^{*} \leftarrow T_i\big(a_{i,t}^{r}, a_{i,t}^{p}; s_t\big)$
15:     set $a_{j,t}^{*} \leftarrow a_{j,t}^{r}$ for $j \notin H$
16:     step the environment with $\{A_{\max}\, a_{i,t}^{*}\}_{i=1}^{N}$ to obtain $(r_t, o_{t+1}, s_{t+1}, d_t)$
17:     store $(s_t, o_t, a_t^{*}, r_t, s_{t+1}, o_{t+1}, d_t)$ in $\mathcal{D}$
18:     if $t \bmod K = 0$ and $|\mathcal{D}| \ge B$ then
19:       sample a minibatch from $\mathcal{D}$ and compute the TD targets
20:         $y_i = r_i + \gamma (1 - d)\, Q_i^{\mathrm{tgt}}(s', a')$,
21:       then update $Q_i$, $\mu_i$, $Q_i^{\mathrm{tgt}}$, $\mu_i^{\mathrm{tgt}}$ as in standard MADDPG
22:     end if
23:   end for
24: end for
To transform the resultant potential-field force generated by the PFM into a guidance action, the target attraction, obstacle repulsion, and inter-agent repulsion forces are first aggregated into a resultant vector $F_i$, which is then normalized by its magnitude so that only directional information is retained. This treatment prevents force magnitudes and potential-field parameters from exerting unintended scaling effects on the control intensity. During normalization, $\max(\varepsilon, \|F_i\|)$ is adopted as the denominator, where $\varepsilon \ll 1$ is a regularization constant introduced to prevent numerical instability when $\|F_i\|$ approaches zero. This formulation keeps the normalization numerically stable even under extremely small resultant forces. Subsequently, the element-wise clipping operator maps each component of the normalized vector into $[-1,1]$, ensuring that the resulting guidance action lies in the same normalized action domain $[-1,1]^m$ as the policy action. This procedure transforms the geometric information encoded in the potential field into a guidance action that can be directly compared with the Actor output on a unified scale. The primary purpose of this normalization is to align the magnitudes of the multi-source guidance components, preventing any single component from dominating the action output due to differences in physical units or amplitude ranges, thereby improving numerical stability and enhancing the controllability and cross-scenario applicability of the guidance signal.
To characterize when geometric guidance should intervene in behavioural decision-making, MAADPG constructs a difference truncation operator based on one-step return differences. Given the current global state $s_t$ and the joint action $a_{-i,t}$ of all other agents, the two candidate actions are evaluated separately by $\tilde{r}_i^{\pi} = \tilde{R}_i\big(s_t, a_{i,t}, a_{-i,t}\big)$ and $\tilde{r}_i^{P} = \tilde{R}_i\big(s_t, a_{i,t}^{S}, a_{-i,t}\big)$. These two quantities can be interpreted as local evaluations of the same $Q_i$-function at different actions under state $s_t$. On this basis, we define the normalized return-difference measure as follows:
$$\Delta_i = \frac{\tilde{r}_i^{P} - \tilde{r}_i^{\pi}}{|\tilde{r}_i^{P}| + \lambda} \times 100\%, \qquad \lambda > 0.$$
At each decision step, $\Delta_i$ is computed online from the current policy output and the corresponding one-step return from the environment. It quantifies the immediate return improvement achieved by the guidance action relative to the policy action, and the resulting local improvement signal is used as the decision criterion for subsequent action adoption. It should be emphasized that $\Delta_i$ serves as a condition-triggered gating signal: when the guidance action yields only marginal or uncertain gains, the gate tends to suppress intervention, thereby avoiding unnecessary perturbations to the learned policy. Hence, this metric is not intended for long-horizon advantage estimation or multi-step return approximation. Moreover, $\lambda > 0$ is a stabilization coefficient introduced to prevent numerical amplification or instability when the denominator $|\tilde{r}_i^{P}|$ becomes very small or approaches zero, ensuring that the evaluation of the local return difference remains smooth and well-controlled across different reward scales.
Based on this difference measure, MAADPG defines a difference truncation operator in the action space as follows:
$$T_i(a_i^{\pi}, a_i^{P}; s_t) = a_i^{\pi} + \mathbb{I}\{\Delta_i \ge \beta\}\,\big(a_i^{P} - a_i^{\pi}\big), \qquad \beta \in [0,1].$$
where I { · } denotes the indicator function that characterizes whether a given condition holds, taking the value 1 when the condition is satisfied and 0 otherwise. Accordingly, the guidance action is regarded as an effective improvement over the current policy and replaces the learned action only when Δ i is no less than the predefined threshold; otherwise, the original policy output is preserved.
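The gating rule of Equations (21) and (22) can be sketched in a few lines; the function names are ours, and we express $\Delta_i$ as a fraction rather than a percentage, so that the threshold $\beta$ is compared directly:

```python
def return_difference(r_guided, r_policy, lam=1e-6):
    """Normalized one-step return difference Delta_i (Eq. (21)), as a fraction;
    the paper reports it multiplied by 100%."""
    return (r_guided - r_policy) / (abs(r_guided) + lam)

def gated_action(a_policy, a_guided, r_policy, r_guided, beta=0.1, lam=1e-6):
    """Difference truncation operator T_i (Eq. (22)): keep the policy action
    unless the guided action improves the one-step return by at least beta."""
    delta = return_difference(r_guided, r_policy, lam)
    return (a_guided if delta >= beta else a_policy), delta
```

For example, with $\beta = 0.1$ a guided return of 1.5 against a policy return of 1.0 triggers adoption, while a guided return of 1.05 does not, so the learned policy is left undisturbed by marginal gains.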
Under the difference truncation operator, the behavioral policy of the i-th agent in the CEP task can be expressed as
$$\mu_i^{\mathrm{AD}}(o_{i,t}, s_t) = T_i\big(\mu_i(o_{i,t}; \theta_i),\; G_i(s_t, o_{i,t}; \psi_i);\; s_t\big).$$
By replacing the behavioural action $a_i$ in Equation (5) with $a_i^{*} = \mu_i^{\mathrm{AD}}(o_i, s)$, the deterministic multi-agent policy gradient formulation under MAADPG is obtained. Consequently, the TD targets of the critic remain given by Equations (3) and (4), and the overall optimisation objective still follows the definitions in Equations (2) and (8). The key distinction is that MAADPG introduces, at the behavior-policy level, an adaptive action adoption mechanism driven by return differences: the one-step return difference $\Delta_i$ defined in Equation (21) is used as the decision signal, and the threshold operator in Equation (22) is employed to gate the selection between the guidance action and the policy action. When $\Delta_i \ge \beta$, the system tends to adopt the guidance action to obtain a significant immediate return gain; otherwise, the original policy action is preserved. It should be noted that the "return-difference constraint" specifies a local selection rule for behavior-action generation, aiming to suppress unnecessary guidance injection when the incremental return gain is insufficient, thereby enhancing the stability and autonomy of the learning process. This mechanism does not substitute long-horizon advantage estimation for the one-step return difference, nor does it modify the critic learning objective under the CTDE framework. Therefore, MAADPG can be regarded as a geometry-constrained extension of the standard multi-agent policy gradient paradigm at the policy-improvement operator level: during early training, geometric guidance reshapes the exploration distribution to improve sample utilization; during later training, as policy performance improves and $\Delta_i$ decreases, the behavior policy naturally reverts to being dominated by the learned policy, enabling a smooth transition from guidance-dominant behaviors to policy-dominant decision-making.

3.2. State Space

All definitions in this paper are consistent with the implementation as follows: actions are generated in the normalized domain $[-1,1]^m$ and mapped to physical accelerations on the environment side; observations are formed by concatenating the self state, teammate information, sensor readings, and the relative target (or relative pursuer) information; and the centralized critic receives the global state vector obtained by concatenating all agents' observations. A rigorous mathematical specification is provided below, together with pointers to the key implementation locations.
Let the environment boundary scale be L, and define the distance normalization factor as d norm = 2 L . For the i-th hunter, its observation is formed by concatenating four parts: self state, teammate positions, laser range readings, and relative target information. The hunter’s observation is thus expressed as
$$o_i = \Big[\, \tfrac{x_i}{L},\; \tfrac{y_i}{L},\; \tfrac{v_{x,i}}{v_{\max}},\; \tfrac{v_{y,i}}{v_{\max}},\; \big\{\tfrac{x_j}{L}, \tfrac{y_j}{L}\big\}_{j \in H \setminus \{i\}},\; \ell_i^{\mathrm{all}},\; \tfrac{\|X_e - X_i\|}{d_{\mathrm{norm}}},\; \operatorname{atan2}\big(y_e - y_i,\, x_e - x_i\big) \,\Big],$$
where $H$ denotes the set of hunters, $X_e$ the target position, and $\ell_i^{\mathrm{all}} \in \mathbb{R}^{16}$ the laser-array readings of the $i$-th hunter. The observation vector of the $i$-th hunter is assembled item by item in a fixed order.
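A sketch of how such an observation vector might be assembled in this fixed order; the helper name `hunter_observation` and the concrete configuration (two teammates, a 16-beam laser array) are illustrative, and the resulting dimension depends on the team size:

```python
import math

def hunter_observation(self_pos, self_vel, teammate_pos, lasers, target_pos,
                       L=1.0, v_max=1.0):
    """Assemble a pursuer observation o_i in a fixed order: self state,
    teammate positions, laser readings, and relative target information."""
    d_norm = 2.0 * L                        # distance normalization factor
    obs = [self_pos[0] / L, self_pos[1] / L,
           self_vel[0] / v_max, self_vel[1] / v_max]
    for (xj, yj) in teammate_pos:           # teammates, fixed order
        obs += [xj / L, yj / L]
    obs += list(lasers)                     # laser-array readings
    dx = target_pos[0] - self_pos[0]
    dy = target_pos[1] - self_pos[1]
    obs += [math.hypot(dx, dy) / d_norm,    # normalized target distance
            math.atan2(dy, dx)]             # bearing to the target
    return obs
```

Concatenating all such vectors in a fixed agent order yields the global state consumed by the centralized critic.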
For the target vessel (target), the observation is formed by concatenating its self state, laser sensing readings, and, for each pursuer, the relative distance and bearing angle.
$$o_e = \Big[\, \tfrac{x_e}{L},\; \tfrac{y_e}{L},\; \tfrac{v_{x,e}}{v_{\max}^{(e)}},\; \tfrac{v_{y,e}}{v_{\max}^{(e)}},\; \ell_e^{\mathrm{all}},\; \big\{ \tfrac{\|X_j - X_e\|}{d_{\mathrm{norm}}},\; \operatorname{atan2}\big(y_j - y_e,\, x_j - x_e\big) \big\}_{j \in H} \,\Big].$$
The overall observation dimension of the pursuer and target vessels is kept identical, while the perceptual information is divided into two categories and processed separately: the first consists of range measurements used solely for constructing individual observations, and the second consists of range measurements used exclusively for safety constraints and reward computation. This separation prevents information leakage during training and improves both the interpretability of the model and the consistency of performance evaluation.
Within the CTDE framework, the central critic during training takes the global state s t = ( o 1 , t , , o N , t ) as input [33], constructed by concatenating the local observations of all agents in a fixed order, so that its information content is equal to the sum of the individual observation contents; it also receives the joint action consisting of the actions of all agents for value evaluation. In this way, centralized value learning and stable policy improvement are realized on the training side, while at execution time each agent makes decisions solely from its own observation, preserving decentralized execution.

3.3. Action Space

Each agent selects a two-dimensional continuous action $a_i = (a_{x,i}, a_{y,i})$ in the normalized domain $[-1,1]^m$. On the environment side, it is mapped to a physical acceleration using the acceleration limit $a_{\max}^{(i)}$ (with $a_{\max}$ / $a_{\max}^{(e)}$ for pursuers / the target, respectively) and the speed limit $v_{\max}^{(i)}$, and then integrated to velocity and position, producing a discrete-time dynamic that is fully consistent with the following implementation:
$$v_{i,t+1} = \Pi_{\|\cdot\| \le v_{\max}^{(i)}}\!\big( v_{i,t} + a_{\max}^{(i)}\, a_{i,t}\, \Delta t \big),$$
$$x_{i,t+1} = x_{i,t} + v_{i,t+1}\, \Delta t,$$
where $\Pi_{\|\cdot\| \le v_{\max}}$ denotes the projection enforcing the norm constraint on the velocity vector, and $\Delta t$ is the simulation time step. In (25), the acceleration term is obtained by scaling the normalized action component-wise according to the per-axis limits; the projection operator ensures that the velocity norm does not exceed $v_{\max}$. After the position update in (26), a workspace-consistency constraint (e.g., boundary projection or clipping) is applied to the position so that the state always remains within the feasible domain [34].
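The update in Equations (25) and (26) can be sketched as follows; treating the workspace as $[0, L]^2$ and handling the boundary by coordinate clipping is one of the admissible choices mentioned above (boundary projection would be another):

```python
import numpy as np

def step_dynamics(pos, vel, action, a_max, v_max, dt, L):
    """One discrete-time step: scale the normalized action to a physical
    acceleration, integrate, project the speed onto the norm ball, and
    clip the position to the workspace (assumed here to be [0, L]^2)."""
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    vel = np.asarray(vel, dtype=float) + a_max * action * dt
    speed = np.linalg.norm(vel)
    if speed > v_max:                       # norm projection of the velocity
        vel = vel * (v_max / speed)
    pos = np.clip(np.asarray(pos, dtype=float) + vel * dt, 0.0, L)
    return pos, vel
```

Because the same routine is used during training and during the one-step peek evaluation, candidate actions are compared under identical dynamics.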

3.4. Reward Function

In complex multi-agent adversarial environments, to alleviate the learning difficulty caused by sparse rewards that are triggered only by terminal events, this paper employs a dense reward-shaping strategy for the instantaneous reward, thereby providing gradient-consistent signals aligned with the task objective throughout training. The encirclement reward is composed of the following sub-terms: motion-progress reward R prog , environmental safety margin reward R safe , geometric phase-advancement reward R stage , cooperative separation penalty R sep , target contact buffer/collision penalty R hit , and terminal success reward R term . The definitions are as follows.
The motion-progress reward R prog is introduced to characterize the projection magnitude of the pursuer’s velocity along the target-directed heading, which is defined as follows:
$$R_{\mathrm{prog}}(i,t) = \frac{\|v_{i,t}\|}{v_{\max}} \cdot \frac{v_{i,t} \cdot (x_{e,t} - x_{i,t})}{\|v_{i,t}\|\, \|x_{e,t} - x_{i,t}\|},$$
where $x_i$ and $v_i$ denote the position and velocity of pursuer $i$, $x_e$ is the target position, and $v_{\max}$ is the pursuer speed limit.
To encourage each agent to navigate safely within the feasible region, an environment-safety reward R safe is defined. This term is constructed based on laser-range distance information via a piecewise formulation as follows:
$$R_{\mathrm{safe}}(t) = \begin{cases} -\,c_{\mathrm{coll}}, & \text{collided}, \\[4pt] \dfrac{\min(\ell_{\mathrm{obs}}) - L_{\mathrm{sens}}}{L_{\mathrm{sens}}}, & \text{otherwise}, \end{cases}$$
where $\ell_{\mathrm{obs}}$ denotes the reward/safety lidar measurement, $L_{\mathrm{sens}}$ the sensing range, and $c_{\mathrm{coll}} > 0$ the collision penalty constant.
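This piecewise term transcribes directly into code; the collision penalty value used here is illustrative:

```python
def safety_reward(collided, laser_ranges, L_sens, c_coll=10.0):
    """Piecewise safety term R_safe: a fixed collision penalty, otherwise a
    non-positive shaping term that shrinks toward zero as the closest laser
    reading approaches the sensing range."""
    if collided:
        return -c_coll
    return (min(laser_ranges) - L_sens) / L_sens
```

The shaping branch is zero only when no obstacle is detected within the sensing range, so safe navigation is the neutral case rather than a reward source.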
The encirclement task exhibits a clear stage-wise structure and can be broadly characterized as a “tracking–encircling–capturing” process. To quantify the progression across different phases, a piecewise function is adopted to define the stage-progress reward R stage :
$$R_{\mathrm{stage}}(t) = \begin{cases} -\dfrac{\Sigma_d}{d_{\max}}, & \Sigma_s > S_4,\; \Sigma_d \ge d_{\mathrm{lim}},\; \min_k d_k \ge d_{\mathrm{cap}}, \\[6pt] \dfrac{1}{3}\log\!\Big(\dfrac{\Sigma_s}{S_4} + 1\Big), & \Sigma_s > S_4,\; \big(\Sigma_d < d_{\mathrm{lim}} \ \text{or} \ \min_k d_k < d_{\mathrm{cap}}\big), \\[6pt] \exp\!\Big(\dfrac{\Sigma_d^{(-)} - \Sigma_d}{3\, v_{\max}}\Big), & \Sigma_s \le S_4,\; \max_k d_k > d_{\mathrm{cap}}, \end{cases}$$
where $\Sigma_s = S_1 + S_2 + S_3$ and $S_4$ denote the geometric areas associated with the encirclement formation, $\Sigma_d = \sum_i d_i$ (with $\Sigma_d^{(-)}$ being its value at the previous step) represents the total distance from the pursuers to the target, and $d_{\mathrm{cap}}$, $d_{\mathrm{lim}}$, and $d_{\max}$ correspond to the capture radius, the stage threshold, and a scaling constant, respectively.
Cooperative encirclement must also avoid inter-agent interference caused by excessive proximity; therefore, we introduce a cooperative separation penalty:
$$R_{\mathrm{sep}}(i,j,t) = -\,k_{\mathrm{keep}}\, \frac{\big(r_{\mathrm{keep}} - d_{ij}(t)\big)_{+}}{r_{\mathrm{keep}} + \varepsilon},$$
where $(x)_{+} = \max(x, 0)$, $k_{\mathrm{keep}} > 0$ is the penalty coefficient, and $\varepsilon > 0$ is a numerical stabilization term.
To penalize hazardous proximity, a target-collision penalty term R hit is constructed, which consists of a safety-buffer penalty and a collision-triggered penalty as follows:
$$R_{\mathrm{hit}}(i,t) = -\,k_{\mathrm{hit}}\, \frac{\big(r_{\mathrm{safe}}^{(\mathrm{ht})} - d_{\mathrm{ht}}(t)\big)_{+}}{r_{\mathrm{safe}}^{(\mathrm{ht})} + \varepsilon} \;-\; \mathbb{I}\big\{ d_{\mathrm{ht}}(t) < r_{\mathrm{hard}}^{(\mathrm{ht})} \big\}\, c_{\mathrm{hard}},$$
where $d_{\mathrm{ht}}(t)$ denotes the minimum distance between the pursuers and the target, $r_{\mathrm{safe}}^{(\mathrm{ht})}$ and $r_{\mathrm{hard}}^{(\mathrm{ht})}$ are the soft and hard radii, respectively, and $k_{\mathrm{hit}}, c_{\mathrm{hard}} > 0$.
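The two penalty terms can be sketched together; the coefficient values below are illustrative, not the ones used in the experiments:

```python
def sep_penalty(d_ij, r_keep, k_keep=1.0, eps=1e-6):
    """Cooperative separation penalty R_sep: active only when two pursuers
    are closer than the keep-out radius r_keep."""
    return -k_keep * max(r_keep - d_ij, 0.0) / (r_keep + eps)

def hit_penalty(d_ht, r_soft, r_hard, k_hit=1.0, c_hard=10.0, eps=1e-6):
    """Target-contact penalty R_hit: a soft buffer term that grows as the
    pursuer enters the safety buffer, plus a hard collision-triggered term."""
    soft = -k_hit * max(r_soft - d_ht, 0.0) / (r_soft + eps)
    hard = -c_hard if d_ht < r_hard else 0.0
    return soft + hard
```

Both terms vanish when the corresponding distances exceed their radii, so they shape behavior only near unsafe configurations.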
When the geometric conditions are satisfied, that is, the target lies inside the triangle formed by the pursuers and all pursuer-target distances are within the capture radius, the encirclement is deemed successful, a one-time terminal reward is issued, and the episode is terminated.
$$R_{\mathrm{term}}(t) = \mathbb{I}\big\{ \Sigma_s \le S_4 \,\wedge\, \max_k (d_k) \le d_{\mathrm{cap}} \big\}\, c_{\mathrm{succ}},$$
where $c_{\mathrm{succ}} > 0$ is the success constant. In summary, the total per-step reward for agent $i$ is
$$r_{i,t} = R_{\mathrm{prog}}(i,t) + R_{\mathrm{safe}}(t) + R_{\mathrm{stage}}(t) + \sum_{j \in H \setminus \{i\}} R_{\mathrm{sep}}(i,j,t) + R_{\mathrm{hit}}(i,t) + R_{\mathrm{term}}(t).$$
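The terminal success test (target inside the pursuer triangle with all pursuer-target distances within $d_{\mathrm{cap}}$) can be sketched via the sub-triangle area criterion: the three sub-triangle areas sum to the full triangle area exactly when the target lies inside. The function names and the numerical tolerance are ours:

```python
import math

def tri_area(a, b, c):
    """Area of the triangle spanned by the 2D points a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    return 0.5 * abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax))

def capture_success(p1, p2, p3, p_t, d_cap, tol=1e-9):
    """Terminal test: target inside the pursuer triangle (sub-area sum equals
    the full area, up to tol) and every pursuer within the capture radius."""
    S4 = tri_area(p1, p2, p3)
    S_sum = (tri_area(p_t, p1, p2) + tri_area(p_t, p2, p3)
             + tri_area(p_t, p3, p1))
    inside = S_sum <= S4 + tol
    close = all(math.hypot(p[0] - p_t[0], p[1] - p_t[1]) <= d_cap
                for p in (p1, p2, p3))
    return inside and close
```

Since the sub-area sum always meets or exceeds the full area, the inequality with a small tolerance is a robust containment test under floating-point arithmetic.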
The training and single-step evaluation phases share an identical pipeline for dynamics propagation, sensor updates, and reward computation. On the evaluation side, an equivalent one-step rollout is executed from the current state to obtain an instantaneous assessment signal that is consistent with the training objective in both scale and constraints. This design enables candidate actions to be compared directly under unified units and normalization conditions, thereby improving the comparability and statistical stability of action selection, while enhancing the consistency and controllability of the policy update process.
It should be noted that the above reward terms are designed to characterize the encirclement stages and safety constraints defined in this work, and their specific forms are therefore task-dependent. The core contribution of our method lies in the behavior-policy-level, difference-driven action adoption and the guidance–policy fusion mechanism. Importantly, this mechanism itself is not tied to any particular reward formulation; thus, under alternative encirclement criteria or reward structures, the proposed framework can be readily adapted by accordingly modifying the task reward terms and termination conditions. During training, obstacle layouts are randomly generated according to predefined rules, enabling the policy to interact and learn under diverse geometric configurations, which further improves adaptability to environmental variations.

4. Simulation Experiments and Results Analysis

In this section, we evaluate the performance of MAADPG in simulation. We first describe the simulation setup and then compare MAADPG against three baselines: DDPG, MADDPG, and MADDPGApprox. Experiments are conducted in a constrained simulation environment. Four RL-based pursuit strategies are trained under randomized initial conditions: DDPG, MAADPG, MADDPG, and MADDPGApprox. During training, model checkpoints are saved at fixed intervals; for testing, the best checkpoint of each method (by validation performance) is selected and evaluated. Convergence curves are obtained by counting the number of successful captures across the periodically saved checkpoints for all four RL agents.

4.1. Parameter Settings

The simulation timestep is set to 1 s, and the operational area is defined as a 2 km × 2 km two-dimensional square domain. At the beginning of each training episode, the initial positions of all USVs and the target are randomly initialized. To reflect the higher maneuverability of the target, distinct constraints on maximum velocity and acceleration are applied to the target and pursuers, respectively. The decision-making model adopts two fully connected neural networks: the actor network has a structure of 28 × 128 × 128 × 2, while the critic network has a structure of 28 × 128 × 128 × 1. The learning rates are initialized to 0.0005 for the actor and 0.001 for the critic. The remaining training hyperparameters mainly follow the canonical MADDPG configuration and commonly adopted empirical ranges [35], with minor adjustments performed to improve convergence stability for the considered task. Detailed hyperparameter settings are provided in Table 1. Each episode terminates when any of the following conditions is met: successful task completion, collision between a USV and the boundary, or the simulation reaches the predefined maximum number of steps. Upon termination, the environment resets and the next episode begins.
In addition, for the difference-constraint-driven adaptive action adoption mechanism, the stabilization coefficient in Equation (21) is set to $\lambda = 10^{-6}$ to prevent excessively large values when the denominator becomes small due to low return magnitudes, thereby improving the numerical stability of computing the return-difference quantity $\Delta_i$. The threshold parameter is set to $\beta = 0.1$ (i.e., a 10% improvement margin) to regulate the adoption strength of the guidance action: the guidance action is accepted only when it brings a sufficiently significant one-step return improvement over the policy action (i.e., $\Delta_i \ge \beta$); otherwise, the original policy action is preserved. It should be noted that all experiments in this work adopt the same $(\lambda, \beta)$ configuration without scenario-specific tuning, so as to ensure consistency in comparative evaluation and to demonstrate the transferability and robustness of the proposed difference-constraint mechanism across different environments. This setting follows the principle of "intervention only upon significant improvement": when $\beta$ is small, the gate is more inclined to adopt the guidance action frequently, whereas a larger $\beta$ yields more conservative intervention. In the extreme case, $\beta \to 0$ implies that the guidance action is almost always adopted, while increasing $\beta$ sufficiently suppresses adoption and reduces the mechanism to the original policy output.

4.2. Simulation Scenario and Settings

The simulation experiments were conducted on a hardware platform equipped with an Intel Core i9-12900K processor and an NVIDIA GeForce RTX 3060Ti GPU, providing sufficient computational resources for large-scale training. The task scenario is defined within a bounded two-dimensional square domain, involving three unmanned surface vehicles (USVs) acting as pursuers and one mobile target, forming a many-to-one pursuit-evasion setup. A visual representation of the simulated environment is shown in Figure 4, where the three red USVs denote the pursuers and the blue vehicle represents the evading target. Each USV is surrounded by a dashed circular area indicating its capture radius, while the blue circle denotes the safety margin. Static obstacles are illustrated as gray circular regions. The obstacle parameters are detailed in Table 2. The implementation was developed in Python (version 3.12) with PyTorch (version 2.3), and the training algorithm was implemented using Stable-Baselines3 (version 2.3.2). OriginPro (version 2023) was used for data analysis and plotting.
The number of obstacles in each training episode is randomly sampled from 0 to 3, and their radii vary uniformly between 0.10 km and 0.15 km. Obstacles remain static within each episode. This design increases environmental uncertainty and enhances the robustness requirements of the learned policies. The specific initialization parameters of the USVs are listed in Table 3.
To ensure the validity of the adversarial training setting, the target USV employs a single-agent deep reinforcement learning policy based on MADDPG for evasive behavior, while the pursuing USVs are trained using a centralized training and decentralized execution (CTDE) framework for cooperative pursuit control.

4.3. Comparative Analysis of Algorithms

In this experiment, the target vessel exhibits dynamic evasive behavior with policy uncertainty. It makes autonomous decisions based on deep reinforcement learning, featuring both high maneuverability and non-deterministic actions, which substantially raises the demands of the encirclement task on the cooperative decision-making, trajectory planning, and policy robustness of the multi-USV system. Based on the aforementioned simulation environment and parameter settings, four algorithms are employed to train the encirclement maneuver-decision model, and their learning processes and final performance are comparatively analyzed under a unified training configuration. Figure 5 illustrates the team cumulative return curves over 1500 training episodes (with the episode index on the horizontal axis and the total return of the three USVs on the vertical axis), where the blue, orange, green, and red curves correspond to MADDPG, DDPG, MADDPGApprox, and the proposed method (MAADPG), respectively. Overall, all four methods exhibit a typical learning pattern of low initial returns, rapid improvement, and continuous optimization with eventual saturation: the returns remain negative in the early stage and generally transition from negative to positive within approximately the first 200 episodes, indicating that cooperative encirclement behaviors begin to emerge and the learning process enters an effective stage. Compared with the baselines, the proposed method consistently achieves higher return levels and a more stable optimization trajectory throughout training. After becoming positive, its return continues to increase steadily and undergoes a more pronounced growth phase in the middle stage, thereby reaching a high-return plateau earlier. Although certain fluctuations are observed in the later stage, the overall return remains within a high range without evident degradation, demonstrating superior convergence stability and a higher performance upper bound.
These results suggest that the difference-driven adaptive action adoption mechanism can effectively suppress severe oscillations during early exploration and reduce the risk of negative transfer induced by external prior mismatch through a return-consistent screening of guidance actions, leading to a more consistent and controllable policy update process. By contrast, MADDPG shows an overall increasing trend but exhibits more pronounced fluctuations in the mid-to-late stages, characterized by alternating phases of improvement and rollback, which indicates insufficient convergence stability under complex interactions. MADDPGApprox improves rapidly in the middle stage and reaches a moderate plateau; however, its subsequent growth is limited, and its long-term performance remains inferior to that of the proposed method. DDPG improves the slowest and displays larger variance, with a notably lower final plateau, reflecting its disadvantages in sample efficiency and stability for multi-agent cooperative tasks.
As training proceeds, MAADPG enters a relatively stable plateau after approximately 750 episodes and consistently outperforms the three baselines in the later stage, reflecting a more consistent and controllable policy update process. By contrast, DDPG and MADDPGApprox converge more slowly and exhibit larger fluctuations, whereas MADDPG achieves intermediate performance but still shows certain instability in the later stage.
During training, randomized environment generation is employed, so that all four methods interact and learn under the same generation rules and parameter ranges (with obstacle layouts varying randomly across episodes), thereby enhancing policy adaptability to environmental variations. Except for the algorithmic core, all other training hyperparameters (e.g., network architectures, optimizers, and experience replay settings) are kept identical, and model checkpoints are saved at fixed intervals. During evaluation, offline validation is conducted on a set of fixed, consistent test environments, ensuring that the different methods are compared in terms of returns and encirclement performance under the same scenarios. The overall results indicate that MAADPG achieves superior sample efficiency and faster convergence with smoother training curves, and it maintains effective cooperative encirclement performance even when the target employs a highly uncertain escape policy.
In the same simulation settings, further offline evaluations were performed in a complex environment with three static obstacles, using the capture success rate (CSR) as the primary performance metric. The evaluation setup mirrored the training conditions in terms of map structure, obstacle configuration, and initialization, and was repeated in multiple randomized trials. A trial was deemed successful if the target was captured within the number of steps allowed without collisions or boundary violations. As shown in Figure 6, MAADPG achieves the highest capture success rate among all compared methods, demonstrating stronger policy robustness under high uncertainty conditions and further validating the effectiveness and adaptability of the proposed approach.
To clarify the computational overhead introduced by the Peek mechanism during training, we conduct a stage-wise profiling of the action adoption module. Without Peek, the policy network outputs actions via a single forward inference, yielding an average decision time of approximately 0.282 ms per step. With Peek enabled, the method performs a local comparison among candidate actions at the current decision step: candidate generation and evaluation (four candidates) take about 0.676 ms, while gated screening and final action selection take about 0.947 ms, resulting in a total Peek procedure time of approximately 1.623 ms (including baseline inference). These results indicate that the additional computation is mainly concentrated in the gating stage, consistent with the structural overhead representation “Ng + gate” in Table 4. Here, N denotes the number of candidate actions, g represents the per-candidate evaluation cost (policy inference and Critic/Q assessment), and gate corresponds to the adoption decision based on local return differences. Notably, Peek serves as an auxiliary exploration and stabilization mechanism during training: it compares candidate actions locally and adopts the guided action only when the improvement is significantly positive, thereby suppressing ineffective exploration and improving convergence stability. During evaluation/testing, the trained policy is executed through standard forward inference to ensure a lightweight and consistent online execution pipeline; therefore, the extra candidate evaluation and gating steps required by Peek are not performed.
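The candidate comparison and β-gate described above can be sketched as follows. The toy critic, the action vectors, and the threshold value are stand-ins for illustration only; the paper's actual gate operates on local return differences within the full MAADPG pipeline.

```python
import numpy as np

def peek_adopt(q_value, state, policy_action, guided_action, beta=0.05):
    """Adopt the guided action only when its one-step value estimate exceeds
    the policy action's by more than the gate margin `beta`."""
    q_policy = q_value(state, policy_action)
    q_guided = q_value(state, guided_action)
    if q_guided - q_policy > beta:   # beta-gate: significantly positive difference
        return guided_action
    return policy_action             # otherwise keep the policy's own proposal

# Toy critic that favors actions aligned with a goal direction encoded in the state.
def toy_q(state, action):
    return float(np.dot(state, action))

state = np.array([1.0, 0.0])          # goal direction: +x
a_policy = np.array([0.2, 0.9])       # policy proposal, mostly lateral
a_guided = np.array([0.9, 0.1])       # geometry-guided proposal, toward the target
chosen = peek_adopt(toy_q, state, a_policy, a_guided)
# Here q_guided - q_policy = 0.9 - 0.2 = 0.7 > beta, so the guided action is adopted.
```

Note that when the two value estimates are within β of each other, the gate rejects the guidance, which is what suppresses ineffective interventions during training.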
Under this setting, Table 4 compares the end-to-end computational overhead of different methods in terms of per-step wall-clock latency and runtime throughput. The results show that MAADPG remains in the same runtime order as multi-agent baselines: its wall-clock step time is 0.387 ms/step, which is about 4.4% and 8.1% lower than MADDPG (0.405 ms/step) and MADDPG-approx (0.421 ms/step), respectively. Meanwhile, MAADPG achieves a throughput of 2582.3 steps/s, outperforming MADDPG (2471.1 steps/s) and MADDPG-approx (2377.2 steps/s) by approximately 4.5% and 8.6%. Although single-agent DDPG exhibits a smaller per-step runtime due to its simplified computation pipeline, it lacks cooperative decision-making capability and is therefore not suitable for the multi-agent pursuit task considered in this work. Overall, MAADPG trades a controllable amount of additional computation during training for higher-quality action adoption and a more stable learning process, while maintaining comparable evaluation-time efficiency to existing multi-agent methods, demonstrating favorable practical deployability.
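The percentage gains quoted above follow directly from the Table 4 figures; the short check below reproduces the arithmetic (latency gains relative to each baseline's step time, throughput gains relative to each baseline's throughput).

```python
# Per-step wall-clock latency (ms) and throughput (steps/s) from Table 4.
step_ms = {"MAADPG": 0.387, "MADDPG": 0.405, "MADDPG-approx": 0.421}
throughput = {"MAADPG": 2582.3, "MADDPG": 2471.1, "MADDPG-approx": 2377.2}

for base in ("MADDPG", "MADDPG-approx"):
    latency_gain = (step_ms[base] - step_ms["MAADPG"]) / step_ms[base] * 100
    throughput_gain = (throughput["MAADPG"] / throughput[base] - 1) * 100
    print(f"{base}: {latency_gain:.1f}% lower latency, "
          f"{throughput_gain:.1f}% higher throughput")
# MADDPG: 4.4% lower latency, 4.5% higher throughput
# MADDPG-approx: 8.1% lower latency, 8.6% higher throughput
```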
In the three-obstacle setting (Figure 7a–d), the behavior of the policy can be explained by several reward-signal mechanisms. Initial stage. The algorithm provides sustained positive feedback for reducing the effective path distance to the target and the relative heading error; consequently, the pursuers quickly close in from their random initial poses and advance with low curvature along the central or lateral corridors (see Figure 7a). At the same time, minimum safety margins around obstacles/boundaries and among teammates are explicitly measured and converted into positive/negative signals: maintaining sufficient clearance is rewarded, while approaching the thresholds is penalized, which induces buffer keeping, avoids unnecessary sharp turns, and mitigates mutual interference when traversing the obstacle belt (Figure 7b). Middle stage. The geometric progress of pressing and enveloping is captured by formation-structure metrics, e.g., angular spread, monotonic shrinkage of the enclosure radius, and the centrality of the target. Improvements in these metrics yield phase rewards, enabling the three pursuers to form a stable semi-arc in the near field and progressively seal escape channels (Figure 7c). To prevent the formation from stretching or crowding, penalties are assigned when inter-agent spacing departs from the desired range: excessively large gaps weaken lateral/rear pressure, whereas overly small gaps induce interference.
As a result, the three trajectories maintain a balanced division of roles (approach, encirclement, and capture) even while threading through obstacles. Final stage. For near-target contact, a soft buffer is set outside the capture radius: entering this buffer without meeting the capture condition incurs small deductions, whereas actual collisions or boundary violations trigger substantial penalties. This explains why the pursuers complete closure through minor speed reconfigurations and curvature adjustments rather than direct ramming (Figure 7d). When the legal capture condition is satisfied (a closed enclosure with overlapping capture disks and no boundary violations or collisions), the algorithm grants a terminal bonus, encouraging the policy to maintain stable geometry and adequate safety margins until the event is triggered. In general, this signal design produces a coherent approach–encirclement–capture evolution: the early rapid approach is driven by path-progress signals; the robust mid-phase envelope arises from geometric-progress and coordination constraints; and collision-free completion is induced by buffer/collision deterrence and terminal bonuses, yielding the trajectories and stable enclosure observed in Figure 7.
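The staged signals described above (path progress, clearance margins, formation spacing, and terminal events) can be summarized in a minimal reward-shaping sketch. All weights, thresholds, and the function signature are assumed values for illustration, not the authors' reward function.

```python
def shaped_reward(progress, min_clearance, spacing_err, captured, collided,
                  clear_thresh=0.2, w_prog=1.0, w_clear=0.5, w_form=0.3):
    """Toy staged reward combining the signal types discussed in the text.

    progress:      reduction in effective path distance this step (km)
    min_clearance: smallest margin to obstacles/boundaries/teammates (km)
    spacing_err:   deviation of inter-agent spacing from the desired range (km)
    """
    r = w_prog * progress                      # positive feedback for closing in
    if min_clearance < clear_thresh:           # nearing a safety threshold
        r -= w_clear * (clear_thresh - min_clearance)
    r -= w_form * abs(spacing_err)             # penalize stretched/crowded formation
    if collided:
        r -= 10.0                              # substantial penalty on collision
    elif captured:
        r += 10.0                              # terminal bonus on legal capture
    return r

# A clean approach step earns a small positive reward; a legal capture dominates it.
r_safe = shaped_reward(0.1, 0.3, 0.0, captured=False, collided=False)
r_capture = shaped_reward(0.0, 0.3, 0.0, captured=True, collided=False)
assert r_capture > r_safe
```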
To evaluate the proposed multi-USV pursuit strategy under complex conditions, the MAADPG policy is executed in four independent test environments (see Figure 8), where Figure 8a–d correspond to obstacle counts of 0, 1, 2, and 3, respectively. Each subfigure presents the complete approach–encirclement–capture process for its setting. In Figure 8a (no obstacles), the three hunter USVs advance along the shortest feasible corridors to rapidly approach the evader, establish complementary bearings, and smoothly shrink the enclosure radius, culminating in capture with low path curvature. In Figure 8b (one obstacle), the hunter closest to the obstacle adopts a “detour–rejoin” maneuver, gradually widening the lateral spacing to preserve safety margins, while the other two maintain frontal/rear pressure to preserve the encirclement geometry; near-field closure then yields capture.
In Figure 8c (two obstacles), two bottleneck corridors induce a brief role allocation: one hunter preemptively blocks the likely escape channel, while the other two progress with distinct curvature radii, driving a steadily shrinking enclosure and a stable semi-ring that enables capture. In Figure 8d (three obstacles), feasible corridors narrow considerably, requiring multiple path replans and speed readjustments; the encirclement phase lengthens and trajectories become more tortuous, yet coordinated multi-angle tightening still achieves effective capture while maintaining safety. Side-by-side comparison indicates that, as the number of obstacles increases, both the time to form a stable encirclement and the path curvature/length grow; nevertheless, across all scenarios the policy responds promptly to changes in the evader’s direction, enforces safety margins, and preserves the enclosure geometry, thus reliably completing the approach–encirclement–capture workflow.
To further elucidate the underlying kinematics, Figure 9, Figure 10, Figure 11 and Figure 12 report the velocity profiles for the 0-, 1-, 2-, and 3-obstacle settings (each figure comprises three panels: speed magnitude, x-axis velocity, and y-axis velocity). A consistent pattern emerges: the evader accelerates rapidly in the early stage and operates near its speed limit for an extended period; the three hunters likewise accelerate to near their maximum speeds during approach and then assume complementary roles in the x/y components during encirclement, using lateral and longitudinal coordination to compress the enclosure. Immediately before capture, the hunters exhibit synchronized deceleration accompanied by brief direction switches in the velocity components to complete closure. As the obstacle count rises from 0 to 3, fluctuations and adjustment frequency in the x/y components increase, small-scale speed reconfigurations occur more often in the mid-to-late phase, and the encirclement interval lengthens; even so, all four environments show a terminal joint drop in speed magnitude and convergence of the hunters’ velocity components, indicating successful completion of the capture maneuver.

5. Conclusions and Future Work

This paper focuses on the cooperative encirclement task of USVs in confined waters and addresses key challenges encountered by multi-agent deep reinforcement learning in complex geometric environments, including unstable exploration, policy oscillations, and difficulties in maintaining encirclement formations. To this end, we propose an adaptive difference-based multi-agent policy gradient method (MAADPG). Under the centralized training and decentralized execution (CTDE) paradigm, MAADPG introduces a difference-driven adaptive action adoption mechanism at the behavior-policy level, which performs a prospective one-step evaluation of the policy action and the guidance action, and then adaptively determines whether to accept the guidance action according to the one-step return difference. In this way, MAADPG stabilizes early-stage exploration and mitigates the risk of negative transfer caused by prior mismatch, while preserving the original training pipeline, optimization objective, and network architecture. Combined with the stage-aware evaluation procedure for the “tracking–encircling–capturing” process, simulation results under dynamic targets and multi-obstacle scenarios demonstrate the effectiveness of the proposed method.
Despite the favorable performance of MAADPG in cooperative multi-USV encirclement tasks, several limitations remain. On the one hand, the proposed approach is primarily validated in idealized simulation environments, and the impacts of practical factors such as communication delays and perception errors on the difference-driven mechanism have not yet been systematically examined. On the other hand, although geometric guidance is introduced to alleviate blind exploration, the learning process during the initial training stage still inevitably relies on a certain degree of random exploration to establish effective policy updates, which may lead to reduced training efficiency in highly complex environments.
Future research will be conducted in direct response to the identified limitations. Specifically, communication delays and perception uncertainties will be explicitly incorporated into the MAADPG framework to systematically analyze the stability and robustness of the difference-driven action adoption mechanism under information-constrained conditions. In addition, further improvements to the adaptivity of the execution-level guidance mechanism will be explored to reduce reliance on random exploration during the early training stage, thereby enhancing training efficiency in complex encirclement environments. Finally, a systematic sensitivity analysis will be conducted on key hyperparameters such as λ and β, and a value-adaptive adjustment strategy based on training stages and performance feedback will be investigated, aiming to achieve a better trade-off between stability and guidance-intervention strength, thereby improving the generalization capability and transferability of the method in complex environments.

Author Contributions

Conceptualization, Z.D. and W.W.; methodology, Z.D.; software, Z.D.; validation, Z.D.; formal analysis, Z.D.; investigation, Z.D.; resources, S.Y. and W.W.; data curation, Z.D.; writing—original draft preparation, Z.D.; writing—review and editing, Z.D. and W.W.; visualization, Z.D.; supervision, S.Y. and W.W.; project administration, S.Y. and W.W.; funding acquisition, S.Y. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52371369), the Natural Science Foundation of Fujian Province (Grant No. 2024J01702), and the Natural Science Foundation of Xiamen (Grant No. 502Z202373038).

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

USV	Unmanned Surface Vehicle
DRL	Deep Reinforcement Learning
MARL	Multi-Agent Reinforcement Learning
CTDE	Centralized Training and Decentralized Execution
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
DDPG	Deep Deterministic Policy Gradient
MAADPG	Multi-Agent Adaptive Difference Policy Gradient
CEP	Constrained Encirclement Problem
CSR	Capture Success Rate

Figure 1. Two-dimensional kinematic model of a USV, where the blue circle denotes the capture range d_capture.
Figure 2. USV sensing and detection model, where red dashed rays indicate detected obstacles and black rays indicate no detection.
Figure 3. MAADPG flowchart. The arrows indicate the data flow and update directions among the environment, replay buffer, and actor–critic networks. The orange and green blocks denote the actor and critic networks, respectively, while the purple block represents the GAD module (Peek Step and β-Gate) and the gray block denotes the PFM-based prior guidance. The symbol “*” denotes the finally adopted action a_i^*, selected between the policy action a_i^r and the prior-guided action a_i^p.
Figure 4. Simulation environment.
Figure 5. Reward curve comparison.
Figure 6. Capture success rate comparison.
Figure 7. Visualization of the pursuit process based on the MAADPG algorithm. The colored solid curves denote the trajectories of the pursuer USVs, the blue solid curve represents the trajectory of the evading target, gray circles indicate static obstacles, and the outer dashed circles illustrate the capture/safety regions during the encirclement process.
Figure 8. Pursuit snapshots using the MAADPG algorithm at different simulation steps: (a) step 22, (b) step 107, (c) step 191, and (d) step 210. The colored solid curves denote the trajectories of the pursuer USVs, the blue solid curve represents the trajectory of the evading target, gray circles indicate static obstacles, and the outer dashed circles illustrate the capture/safety regions around each USV.
Figure 9. Speed profiles of USVs in the capture task with 0 obstacles.
Figure 10. Speed profiles of USVs in the capture task with 1 obstacle.
Figure 11. Speed profiles of USVs in the capture task with 2 obstacles.
Figure 12. Speed profiles of USVs in the capture task with 3 obstacles.
Table 1. Neural network hyperparameter settings.

Parameter                        Value
Discount factor                  0.95
Learning rate of actor network   0.0005
Learning rate of critic network  0.001
Soft-update rate τ               0.01
Replay buffer size               1,000,000
Batch size                       1024
Hidden layers                    4
Hidden units per layer           128
Number of neurons                512
Table 2. Obstacle settings.

Parameter                          Value
Number of obstacles (per episode)  0–3
Obstacle radius                    0.10–0.15 km
Table 3. USV parameter settings.

Parameter                               Value
Initial velocity of hunters             0 km/s
Maximum speed of hunters                0.010 km/s
Maximum acceleration of hunters         0.004 km/s²
Detection range of sensor               0.2 km
Round-up (capture) distance of hunters  0.15 km
Initial velocity of target              0 km/s
Maximum speed of target                 0.011 km/s
Maximum acceleration of target          0.005 km/s²
Table 4. Computational overhead comparison (per decision step/per training update).

Method         Env Steps/Decision  Policy Forward(s)  Critic/Q Eval(s)  State Copy/Rollback  Wall-Clock Step Time (ms)  Training Throughput (Steps/s)
MAADPG         +1 (peek)           +Ng + gate         +(1–2)·Ng         +1                   0.387                      2582.3
MADDPG         0                   +Ng                +Ng               0                    0.405                      2471.1
MADDPG-approx  0                   +Ng                +Ng               0                    0.421                      2377.2
DDPG           0                   baseline           baseline          0                    0.102                      9837.2


