An Improved MAPPO for Multi-Surface Vessel Collaboration

Wang, Guangyu; Tian, Feng; Ren, Chengcheng

doi:10.3390/act15020121

by Guangyu Wang ¹, Feng Tian ² and Chengcheng Ren ²,*

¹ The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei 230601, China
² The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Electrical Engineering and Automation, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.

Actuators 2026, 15(2), 121; https://doi.org/10.3390/act15020121

Submission received: 31 December 2025 / Revised: 11 February 2026 / Accepted: 11 February 2026 / Published: 14 February 2026

(This article belongs to the Special Issue Feature Papers in Actuators for Surface Vehicles)

Abstract

Collaborative control of multiple surface vessels remains a significant challenge in autonomous maritime operations, particularly within environments characterized by sparse rewards. Conventional Multi-Agent Proximal Policy Optimization (MAPPO) often suffers from inefficient credit assignment and slow convergence in such scenarios. To address these limitations, this paper proposes an enhanced MAPPO framework that integrates a counterfactual baseline—derived from Counterfactual Multi-Agent Policy Gradients (CMAPG)—into the Generalized Advantage Estimation (GAE) formulation. Furthermore, a Prioritized Experience Replay (PER) mechanism with importance sampling is incorporated to improve sample efficiency. The counterfactual baseline is necessary to provide precise, agent-specific learning signals within the on-policy paradigm, directly tackling the credit assignment problem. The PER mechanism, carefully adapted with importance sampling, is essential to break the sample-inefficiency barrier by strategically reusing valuable past experiences without compromising stability. This synergistic approach refines credit assignment by isolating individual contributions and maximizes the utility of valuable historical experiences. Simulation results and comparisons validate the enhanced control performance of the proposed controller.

Keywords: multi-agent reinforcement learning; MAPPO; surface vessel collaboration; sparse rewards; credit assignment; experience replay

1. Introduction

The coordination and control of multiple Autonomous Surface Vessels (ASVs) pose substantial challenges in maritime applications such as search and rescue, environmental monitoring, and port logistics. These tasks demand efficient cooperation among vessels operating in dynamic and uncertain environments, where rewards associated with successful coordination are often sparse and significantly delayed. Multi-Agent Reinforcement Learning (MARL) [1] has emerged as a powerful framework for addressing such challenges by enabling agents to learn cooperative strategies under the paradigm of centralized training with decentralized execution (CTDE) [2]. Among existing MARL algorithms, MAPPO has gained considerable attention due to its favorable stability and scalability in cooperative multi-agent settings [3]. Nevertheless, standard MAPPO exhibits notable performance degradation in sparse-reward scenarios, primarily due to inefficient credit assignment and low sample efficiency [4].

Credit assignment concerns the challenge of accurately quantifying each agent’s contribution to the collective outcome. In collaborative ASV missions, global rewards are typically delayed, making it difficult for individual vessels to infer the effectiveness of their local actions. Although approaches such as counterfactual baselines [5] and attention-based mechanisms [6] have been proposed to alleviate this issue, their incorporation into on-policy algorithms like MAPPO remains limited. Furthermore, MAPPO’s on-policy learning paradigm requires discarding collected trajectories after each policy update, resulting in poor sample efficiency. While techniques such as PER [7] have demonstrated significant benefits in off-policy reinforcement learning by reusing informative experiences, their direct application to on-policy methods is nontrivial due to policy distribution mismatch.

To address these limitations, this paper proposes an improved MAPPO variant, termed Multi-Agent Proximal Policy Optimization with Counterfactual Baseline and Prioritized Experience Replay (MAPPO-CF-PER), which integrates a counterfactual baseline into the GAE framework and introduces a PER mechanism augmented with importance sampling. Specifically, the counterfactual baseline, inspired by Counterfactual Multi-Agent Policy Gradients [5], improves credit assignment by contrasting actual returns with counterfactual scenarios in which agents execute default actions. In parallel, the proposed PER mechanism prioritizes experience samples based on a combination of temporal-difference errors and counterfactual effectiveness, enabling more efficient reuse of informative trajectories. Importance sampling weights are employed to correct for policy divergence, thereby preserving the stability guarantees of on-policy optimization.

The main contributions of this work can be summarized as follows:

(1): To address the challenge of credit assignment under sparse rewards, we introduce a novel counterfactual baseline directly embedded within the GAE framework. Unlike prior works that primarily applied counterfactual reasoning in off-policy settings, this integration within the on-policy MAPPO structure provides a more stable and efficient mechanism to quantify individual agent contributions, leading to sharper policy gradients in collaborative ASV tasks.
(2): To overcome the inherent sample inefficiency of on-policy learning, we propose a tailored PER mechanism. A key innovation here is the use of a composite priority metric that combines both temporal-difference error and counterfactual effectiveness. Furthermore, to safely reuse off-policy data without destabilizing training, we incorporate importance sampling weights within the PPO clipping objective. This approach is a non-trivial extension of PER to the multi-agent on-policy domain, effectively breaking the data discard-after-one-use limitation of standard MAPPO.

2. Related Work

2.1. Multi-Agent Reinforcement Learning

Multi-Agent Reinforcement Learning (MARL) generalizes single-agent reinforcement learning to environments in which multiple agents interact and learn concurrently. A dominant paradigm in modern MARL is CTDE, which allows agents to exploit global state information during training while relying solely on local observations during execution [2]. Representative CTDE-based algorithms include Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [8] and QMIX [9], both of which are off-policy methods. Although effective in certain settings, off-policy MARL algorithms often suffer from instability caused by the non-stationarity induced by concurrently learning agents.

In contrast, on-policy approaches, most notably MAPPO [3], mitigate non-stationarity by updating policies exclusively with data generated by the current policy. This design choice significantly enhances training stability and has led to strong empirical performance in cooperative tasks and benchmark multi-agent environments. However, the reliance on fresh data substantially reduces sample efficiency, and MAPPO’s performance deteriorates in sparse-reward scenarios such as cooperative ASV coordination [10].

Recent research has focused on scaling MARL to large-scale multi-agent systems. Hierarchical learning architectures have been proposed to decompose complex coordination tasks [11], while graph neural networks have been leveraged to explicitly model inter-agent interactions and relational structures. Despite their effectiveness, these methods often require predefined communication graphs or domain-specific assumptions, limiting their generality. In maritime domains, MARL has been applied to problems such as target tracking and formation control, yet existing approaches typically lack principled mechanisms for adaptive credit assignment, which is critical for cooperative performance under delayed global rewards.

2.2. Credit Assignment in Multi-Agent Systems

Effective credit assignment constitutes a fundamental challenge in MARL, particularly within cooperative settings where agents receive shared global rewards. Counterfactual methods represent a prominent research direction in this domain. The Counterfactual Multi-Agent (COMA) algorithm [5] introduces a counterfactual baseline that marginalizes the action of an individual agent to estimate the contribution of that agent to the joint outcome. Alternative strategies include difference rewards [12] and attention-based critics [6], which aim to highlight influential agents or actions.

Beyond the shaping of extrinsic rewards, researchers have explored intrinsic motivation mechanisms to address the issue of sparse rewards. For instance, social influence rewards encourage agents to affect the behavior of other agents positively [13], whereas curiosity-driven and prediction-based methods promote exploration in under-specified environments [14]. Furthermore, variational inference techniques have been employed to model the behaviors of teammates or opponents [15]. Despite the effectiveness of these approaches, they typically require auxiliary models or additional optimization objectives, which increases computational complexity.

Crucially, the majority of counterfactual and intrinsic reward methods have been developed within off-policy learning frameworks. Although recent efforts have attempted to incorporate causal or counterfactual reasoning into on-policy learning, the integration of these concepts with MAPPO, specifically within the advantage estimation mechanism, remains largely unexplored [16,17,18]. In contrast, this work embeds counterfactual reasoning directly into the GAE framework to provide an effective solution for improving credit assignment in on-policy MARL.

2.3. Experience Replay in On-Policy Learning

Experience replay is a widely adopted technique for enhancing sample efficiency in reinforcement learning. PER [7] improves learning efficiency by emphasizing transitions with high temporal-difference (TD) error. However, the direct application of PER to on-policy algorithms is challenging due to policy distribution mismatch, which can introduce biased gradient estimates.

Several studies have sought to address this limitation by incorporating importance sampling to correct for distribution divergence or by restricting replay buffers to recent trajectories to limit policy drift. Hybrid replay strategies and selective episode reuse have also been proposed to balance stability and efficiency. While these methods demonstrate that limited replay can be beneficial for on-policy learning, their extension to multi-agent settings remains underdeveloped.

In MARL, PER has primarily been explored in off-policy contexts with centralized critics or value decomposition methods. Although selective replay of critical experiences has shown promise for accelerating on-policy learning [19], existing approaches do not explicitly account for multi-agent credit assignment. Our proposed PER mechanism uses counterfactual effectiveness as a prioritization criterion, so that replayed experiences are favored when they are both informative and directly relevant to cooperative credit assignment.

2.4. Multi-Surface Vessel Collaboration

Traditional approaches to ASV coordination rely on rule-based or optimization-driven frameworks, such as auction-based task allocation [20] and contract net protocols [21]. While effective in structured environments, these methods often struggle with scalability, adaptability, and robustness to uncertainty.

More recently, MARL has emerged as a promising alternative for multi-ASV collaboration. Learning-based approaches have been applied to problems including cooperative path planning, formation control, and dynamic task allocation. Despite encouraging results, existing MARL-based methods typically overlook the challenges posed by sparse rewards and delayed feedback in maritime environments. Simulation studies have consistently highlighted the need for adaptive coordination strategies capable of handling uncertainty and long-horizon objectives.

Motivated by these observations, this work advances the state of the art by integrating counterfactual credit assignment and prioritized experience replay into MAPPO, providing a principled and efficient framework for cooperative ASV control under sparse and delayed rewards.

3. Proposed Method

3.1. Background: MAPPO Formulation

We consider a cooperative multi-agent control problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP), which is defined by the tuple $(S, A, P, R, O, N, γ)$, where:

$S$ denotes the global state space;
$A = \prod_{i = 1}^{N} A_{i}$ represents the joint action space composed of individual action spaces $A_{i}$;
$P (s' \mid s, a)$ is the state transition probability function;
$R (s, a)$ is the shared reward function;
$O$ denotes the local observation space for each agent;
$N$ is the number of agents;
$γ \in [0, 1)$ is the discount factor.

At each time step $t$, agent $i$ receives a local observation $o_{i}^{t} \in O$ correlated with the underlying global state $s^{t} \in S$, and selects an action $a_{i}^{t} \in A_{i}$ according to its stochastic policy $π_{θ_{i}} (a_{i}^{t} \mid o_{i}^{t})$, where $θ_{i}$ denotes the policy parameters of agent $i$. The joint action $a^{t} = (a_{1}^{t}, \dots, a_{N}^{t})$ induces a state transition $s^{t + 1} \sim P (\cdot \mid s^{t}, a^{t})$ and yields a shared reward $r^{t} = R (s^{t}, a^{t})$.

The collective objective of all agents is to maximize the expected discounted return, as expressed in Equation (1).

$$J (θ) = \mathbb{E}_{π_{θ}} \left[ \sum_{t = 0}^{\infty} γ^{t} r^{t} \right], \tag{1}$$

where $θ = \{θ_{1}, \dots, θ_{N}\}$ represents the parameters of all agent policies, and $\mathbb{E}$ denotes the mathematical expectation operator.
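As a small illustration of the return inside Equation (1), the sketch below evaluates the discounted sum for a single finite episode. The function name `discounted_return` and the toy reward sequence are illustrative assumptions, not part of the original implementation; the infinite sum is truncated at the episode end.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r^t for one finite episode."""
    total = 0.0
    # Iterate backwards so each step needs a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sparse-reward episode: the shared reward arrives only at the end.
sparse_episode = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(sparse_episode, gamma=0.5))  # 0.125
```

With a sparse terminal reward, the return seen from early time steps shrinks geometrically with the remaining horizon, which is one reason early actions receive weak learning signals.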

MAPPO follows the Centralized Training with Decentralized Execution (CTDE) paradigm. During training, a centralized value function $V_{ϕ} (s^{t})$, parameterized by $ϕ$, is employed to estimate the expected return from the global state $s^{t}$ under the current joint policy. This critic has access to global state information (or the joint observations of all agents as a surrogate) during training. During execution, however, each agent independently selects actions based solely on its local observation $o_{i}^{t}$.

Policy optimization in MAPPO is performed by maximizing the clipped surrogate objective from Proximal Policy Optimization (PPO). The objective function, denoted as $L^{CLIP} (θ)$, aims to maximize policy performance while constraining the update size to enhance stability. It is formally defined as Equation (2):

$$L^{CLIP} (θ) = \mathbb{E}_{t} \left[ \min \left( ρ_{t} (θ) A^{t}, \operatorname{clip} (ρ_{t} (θ), 1 - ϵ, 1 + ϵ) A^{t} \right) \right], \tag{2}$$

where $ρ_{t} (θ) = \frac{π_{θ} (a^{t} \mid o^{t})}{π_{θ_{old}} (a^{t} \mid o^{t})}$ is the probability ratio between the current policy and the behavior policy used to collect the data, $A^{t}$ is an estimate of the advantage function at time step $t$, and $ϵ$ is a clipping parameter, typically set within the range (0.1, 0.3), which limits the magnitude of policy updates by constraining the probability ratio to the interval $[1 - ϵ, 1 + ϵ]$.
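The clipped objective in Equation (2) can be sketched in a few lines of plain Python. This is a minimal sample-average version under the assumption that log-probabilities and advantage estimates are already available as lists; the function name `clipped_surrogate` and the toy numbers are illustrative only.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Sample average of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(lp_new - lp_old)  # probability ratio rho_t(theta)
        rho_clipped = max(1.0 - eps, min(rho, 1.0 + eps))
        total += min(rho * adv, rho_clipped * adv)
    return total / len(advantages)


# Two hypothetical samples: ratio 1.5 with A = +1 and ratio 0.5 with A = -1.
logp_old = [0.0, 0.0]
logp_new = [math.log(1.5), math.log(0.5)]
advantages = [1.0, -1.0]
print(round(clipped_surrogate(logp_new, logp_old, advantages, eps=0.2), 6))  # 0.2
```

In both samples the clipped branch is selected (1.2 instead of 1.5, and -0.8 instead of -0.5), so a ratio that has already moved past the interval in the direction favored by the advantage earns no additional objective value.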

The centralized value function is trained by minimizing the mean squared error between predicted state values and empirical return estimates:

$$L^{VF} (ϕ) = \mathbb{E}_{t} \left[ \left( V_{ϕ} (s^{t}) - \hat{R}^{t} \right)^{2} \right], \tag{3}$$

where $\hat{R}^{t}$ denotes the estimated return, typically computed using discounted rewards or GAE.

To encourage sufficient exploration, an entropy regularization term is incorporated into the objective. The overall training loss, which is minimized, is given by

$$L^{TOTAL} = - L^{CLIP} (θ) + c_{1} L^{VF} (ϕ) - c_{2} H (π), \tag{4}$$

where $H (π)$ denotes the policy entropy, and $c_{1}$ and $c_{2}$ are weighting coefficients that balance value fitting and exploration, respectively. The clipped surrogate enters with a negative sign because it is maximized, whereas the value loss is minimized. The policy entropy $H (π)$, a concept borrowed from information theory, measures the randomness or uncertainty of an agent’s policy $π (a \mid o)$. As shown in Equation (5), it is calculated as the expected negative log-probability of the actions.

$$H (π) = - \mathbb{E}_{a \sim π (\cdot \mid o)} \left[ \log π (a \mid o) \right] \tag{5}$$

In reinforcement learning, maximizing policy entropy acts as an intrinsic exploration driver. Its primary purpose is to encourage the agent to behave more stochastically, thereby preventing premature convergence to sub-optimal deterministic policies. In our framework, the coefficient $c_{2}$ modulates the strength of this entropy regularization, aiming to maintain a critical balance between exploration (trying new actions) and exploitation (using known good actions). This balance is particularly vital for learning effective cooperative strategies in environments with sparse rewards.
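To make the entropy term of Equation (5) concrete, the following sketch evaluates it for a discrete (categorical) policy. The discrete action set and the two example distributions are assumptions chosen for illustration; the paper does not specify the policy parameterization here.

```python
import math


def categorical_entropy(probs):
    """H(pi) = -sum_a pi(a|o) * log pi(a|o); zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory over four actions
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic

print(categorical_entropy(uniform))  # equals log(4), the maximum for four actions
print(categorical_entropy(peaked))   # much smaller
```

A larger entropy coefficient rewards policies closer to the uniform case, which is why it slows collapse onto a single action before sparse rewards have been observed.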

3.2. Counterfactual Baseline for GAE

A fundamental challenge in cooperative multi-agent reinforcement learning is credit assignment, namely, identifying each agent’s individual contribution to the collective outcome when only sparse and shared rewards are available. In MAPPO, the GAE mechanism is employed to reduce variance in policy gradient updates by aggregating TD residuals over time. Specifically, the standard GAE is defined as

$$A_{t}^{GAE (γ, λ)} = \sum_{l = 0}^{\infty} (γ λ)^{l} δ_{t + l}, \tag{6}$$

where the TD error is given by

$$δ_{t} = r_{t} + γ V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t}).$$

While $A_{t}^{GAE}$ provides an effective estimate of the advantage of the joint action $a_{t}$, it does not explicitly distinguish the individual contributions of agents’ actions $a_{i}^{t}$. Consequently, in cooperative tasks with delayed and sparse rewards, individual agents may struggle to infer how their local decisions influence the global return.
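The GAE sum in Equation (6) is usually evaluated with a backward recursion over a finite trajectory. The sketch below assumes the infinite sum is truncated at the episode end and that `values` carries one extra bootstrap entry; the function name `compute_gae` and the toy inputs are illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical sparse episode with an untrained critic (all values zero).
print(compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
# [0.25, 0.5, 1.0]
```

Every agent receives this same joint advantage, which is exactly the limitation the counterfactual term is meant to address.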

To address this limitation, we draw inspiration from counterfactual reasoning in CMAPG and incorporate a counterfactual baseline directly into the advantage estimation process. The counterfactual advantage is defined using a deterministic baseline action (the most probable action under the policy) rather than full action marginalization, primarily to balance computational efficiency and estimation accuracy in continuous action spaces. While this approximates a difference reward, it provides a practical trade-off for real-time MARL applications. For each agent $i$, a counterfactual advantage is defined as

$$A_{i}^{CF} (s, a) = Q (s, a) - Q (s, (a_{-i}, c_{i})), \tag{7}$$

where $Q (s, a)$ denotes the centralized state–action value function, which may be approximated by a learned critic or derived from the centralized value function $V_{ϕ} (s)$. The term $(a_{-i}, c_{i})$ represents a counterfactual joint action in which all agents except agent $i$ execute their original actions $a_{-i}$, while agent $i$ instead takes a predefined baseline action $c_{i}$.

The baseline action $c_{i}$ represents the default behavior of agent $i$ under the current policy. In this work, the baseline action is chosen as

$$c_{i} = \arg\max_{a \in A_{i}} π_{θ_{i}} (a \mid o_{i}), \tag{8}$$

which corresponds to the most probable action under the policy of agent $i$. The counterfactual value term $Q (s, (a_{-i}, c_{i}))$ therefore estimates the expected return under the hypothetical condition that agent $i$ follows its default behavior while all other agents remain unchanged. By subtracting this baseline, the counterfactual advantage $A_{i}^{CF}$ isolates the marginal contribution of the executed action $a_{i}$.
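The counterfactual advantage of Equation (7) with the mode baseline of Equation (8) can be illustrated on a toy problem. Everything below is a sketch under stated assumptions: actions are discrete indices, and `toy_q` is a hypothetical stand-in for the learned centralized critic, not the critic used in the paper.

```python
def mode_action(action_probs):
    """Baseline c_i: index of the most probable action under the agent's policy."""
    return max(range(len(action_probs)), key=lambda a: action_probs[a])


def counterfactual_advantage(q_fn, state, joint_action, agent, action_probs):
    """Q(s, a) minus Q(s, a with agent i's action swapped for its baseline c_i)."""
    cf_action = list(joint_action)
    cf_action[agent] = mode_action(action_probs)  # other agents stay unchanged
    return q_fn(state, tuple(joint_action)) - q_fn(state, tuple(cf_action))


def toy_q(state, joint_action):
    """Hypothetical critic: team value equals the number of agents choosing action 1."""
    return float(sum(1 for a in joint_action if a == 1))


joint_action = (1, 0)  # agent 0 deviates from its default, agent 1 does not
print(counterfactual_advantage(toy_q, None, joint_action, 0, [0.7, 0.3]))  # 1.0
print(counterfactual_advantage(toy_q, None, joint_action, 1, [0.6, 0.4]))  # 0.0
```

An agent that executes its own most probable action always receives a counterfactual advantage of zero under this baseline, so the signal is nonzero only for exploratory deviations.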

To formulate an agent-specific advantage estimator, we integrate the counterfactual advantage with the standard GAE:

$$A_{i}^{GAE - CF} (s, a) = A_{t}^{GAE (γ, λ)} + β A_{i}^{CF} (s, a), \tag{9}$$

where $β \geq 0$ is a weighting coefficient that balances the global temporal advantage and the individual counterfactual contribution. The global GAE term preserves the stability and low-variance properties of MAPPO, while the counterfactual component refines credit assignment by explicitly accounting for each agent’s marginal impact on the joint return. As $β \to 0$, the estimator reduces to standard GAE; larger $β$ values emphasize agent-specific credit assignment. This design is compatible with PPO clipping, as the counterfactual term introduces bounded bias in gradient estimation.
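A short numerical sketch of Equation (9) shows how one shared GAE value becomes agent-specific once the counterfactual term is added. The function name `gae_cf_advantages` and the input numbers are hypothetical.

```python
def gae_cf_advantages(gae_advantage, cf_advantages, beta=0.5):
    """Per-agent advantage: shared GAE term plus beta times each agent's A_i^CF."""
    return [gae_advantage + beta * a_cf for a_cf in cf_advantages]


shared_gae = 0.8               # identical for every agent at this time step
cf_per_agent = [1.0, 0.0, -0.4]  # hypothetical counterfactual advantages

print(gae_cf_advantages(shared_gae, cf_per_agent, beta=0.0))  # [0.8, 0.8, 0.8]
print(gae_cf_advantages(shared_gae, cf_per_agent, beta=0.5))
```

With `beta=0.0` all agents receive the same signal; with a positive `beta` the agent whose action outperformed its default is credited more than the agent whose action underperformed it.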

The proposed counterfactual advantage estimator introduces a modified policy gradient direction, whose properties are analyzed here to address theoretical concerns. Specifically, when the baseline action is chosen as the mode of the policy, the gradient bias remains bounded provided the action distribution is unimodal, with the bias scaling proportionally to the variance of the advantage estimate, an effect controlled by the clipping mechanism in PPO. This update corresponds to optimizing a surrogate objective that integrates the standard PPO clip loss with a counterfactual regularization term, thereby aligning with the entropy regularization in Equation (5) to promote exploration while enhancing credit assignment. Moreover, the clip range $ϵ$ in Equation (2) ensures stability by constraining the probability ratio, which inherently limits the impact of variations in the counterfactual term. Collectively, this analysis establishes a theoretical foundation for the method’s robustness, mitigating heuristic design concerns.

The resulting advantage $A_{i}^{GAE - CF}$ is employed for policy updates of each agent within the PPO-Clip framework. This formulation encourages agents to select actions that are beneficial for collective performance while also providing clear guidance on individual responsibility, thereby improving learning efficiency in cooperative environments with sparse and delayed rewards.

3.3. Prioritized Experience Replay with Importance Sampling

A major limitation of on-policy algorithms such as MAPPO lies in low sample efficiency, since collected trajectories are typically discarded after a single policy update. Experience Replay (ER) is a well-established mechanism for improving sample efficiency through the reuse of historical data. However, direct application of ER to on-policy learning is challenging because samples stored in a replay buffer are generated by previous policies, while policy updates are performed using the current policy. This mismatch between behavior policies and target policies introduces distribution bias and may destabilize training.

To address this issue, a PER mechanism is incorporated and carefully adapted to the on-policy structure of MAPPO. The proposed design consists of three tightly coupled components: a trajectory-level priority definition, importance sampling correction, and a constrained replay strategy.

(1): Priority Scheme

Each trajectory $τ = \{(s^{t}, o^{t}, a^{t}, r^{t}, s^{t + 1})\}_{t = 0}^{T - 1}$ stored in the replay buffer $D$ is assigned a priority value $p (τ)$. The priority is defined as a combination of the maximum TD error within the trajectory and a counterfactual relevance measure:

$$p (τ) = \max_{t \in τ} δ_{t} + α \cdot CFE (τ), \tag{10}$$

where $δ_{t}$ denotes the TD error at time step $t$, and $CFE (τ)$ represents the counterfactual effectiveness of trajectory $τ$, defined as

$$CFE (τ) = \frac{1}{N} \sum_{i = 1}^{N} A_{i}^{CF} (s^{t}, a^{t}).$$

The scalar parameter $α$ regulates the relative contribution of counterfactual information. This priority formulation favors trajectories that are either highly informative for value estimation, as indicated by large TD errors, or particularly important for credit assignment, as reflected by large counterfactual advantages.
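The composite priority of Equation (10) can be sketched as follows. One detail is an assumption of this sketch rather than something stated in the text: the counterfactual effectiveness is written per time step, so the per-step agent means are averaged over the trajectory to obtain a single value. The function name and sample numbers are hypothetical.

```python
def trajectory_priority(td_errors, cf_advantages, alpha=0.5):
    """p(tau) = max_t delta_t + alpha * CFE(tau).

    td_errors: one TD error per time step.
    cf_advantages: per time step, a list with one A_i^CF value per agent.
    """
    max_td = max(td_errors)
    # Assumption: average the agent-mean counterfactual advantage over time.
    per_step_mean = [sum(step) / len(step) for step in cf_advantages]
    cfe = sum(per_step_mean) / len(per_step_mean)
    return max_td + alpha * cfe


td = [0.1, 0.4, 0.2]
cf = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]  # two agents, three steps
print(trajectory_priority(td, cf, alpha=0.5))  # 0.4 + 0.5 * (1/3)
```

Setting `alpha=0.0` recovers a purely TD-error-driven priority, which makes the contribution of the counterfactual term easy to ablate.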

(2): Importance Sampling Weighting

To compensate for the discrepancy between the policy used to generate replayed data and the current policy, importance sampling is employed during mini-batch updates. The importance weight associated with time step t is computed as the product of probability ratios across all agents:

$$w_{t} = \prod_{i = 1}^{N} \frac{π_{θ_{i}^{new}} (a_{i}^{t} \mid o_{i}^{t})}{π_{θ_{i}^{old}} (a_{i}^{t} \mid o_{i}^{t})}. \tag{11}$$

The importance weight $w_{t}$ is critical for correcting the bias introduced by sampling data from an older policy. However, the product of ratios across $N$ agents can lead to weights with high variance, and their theoretical range $(0, \infty)$ is unbounded above. A weight much greater than 1 indicates that the action under the current policy is much more likely than under the old policy, and vice versa. Unconstrained, these large weights can dominate the gradient update and lead to unstable training.

Since unbounded importance weights can lead to high variance and unstable optimization, the weights are clipped in a manner consistent with the PPO framework:

$$w_{t}^{clip} = \min (w_{t}, 1 + ϵ). \tag{12}$$

The clipped weight $w_{t}^{clip}$ is used to scale the policy gradient loss during optimization. When the deviation between the current policy and the behavior policy remains small, this correction yields an approximately unbiased update while maintaining numerical stability.
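The joint weight of Equation (11) and its one-sided clip in Equation (12) can be sketched as below. Summing log-probabilities before exponentiating is an implementation choice assumed here for numerical robustness; the function names and the five-agent example are illustrative.

```python
import math


def joint_importance_weight(logp_new, logp_old):
    """Product over agents of pi_new / pi_old, evaluated in log space."""
    return math.exp(sum(logp_new) - sum(logp_old))


def clip_weight(weight, eps=0.2):
    """Upper clip only: min(w, 1 + eps)."""
    return min(weight, 1.0 + eps)


# Five agents, each only 10% more likely to repeat its action under the new policy.
logp_old = [0.0] * 5
logp_new = [math.log(1.1)] * 5
w = joint_importance_weight(logp_new, logp_old)
print(round(w, 5))                # 1.61051
print(clip_weight(w, eps=0.2))    # 1.2
```

Small per-agent ratios compound multiplicatively across the team, so the joint weight leaves the clip range quickly; weights below 1 pass through unchanged because the clip has no lower bound.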

While the product of probability ratios across agents in Equation (11) can indeed increase the variance of importance weights, the clipping operation in Equation (12) serves as a crucial variance-reduction mechanism. This clipping not only prevents gradient instability from excessively large weights but also implicitly constrains policy divergence. To further ensure stability, we monitor the Kullback–Leibler (KL) divergence between consecutive policies during training, ensuring it remains within a bounded threshold—a standard practice in PPO-based algorithms that preserves on-policy stability despite limited data reuse.

The bias-variance trade-off inherent in our PER design is carefully balanced: while clipping the importance weights introduces some bias under policy mismatch, the constrained replay strategy (limiting trajectory reuse to 4–8 updates) ensures this bias remains manageable. The composite priority scheme in Equation (10) selectively amplifies learning signals from high-advantage transitions, ultimately reducing variance in policy gradient estimates and accelerating convergence.

(3): Constrained Replay and Buffer Management

A constrained replay strategy is adopted to preserve the on-policy characteristics of MAPPO. During each training iteration, policy updates are performed using a mixture of freshly collected trajectories and replayed trajectories sampled from $D$ according to the defined priority distribution. To prevent excessive policy drift, the lifetime of stored trajectories is explicitly limited. Each trajectory is discarded after being reused for a predefined number of updates, typically ranging from four to eight.

This restriction ensures that replayed data remain sufficiently close to the current policy distribution, thereby preserving the stability properties of PPO. At the same time, selective reuse of informative trajectories significantly improves sample efficiency. By combining prioritized replay, importance sampling correction, and constrained buffer management, the proposed PER mechanism enables effective data reuse within an on-policy multi-agent learning framework.
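The constrained replay strategy can be sketched as a small buffer that drops the oldest trajectory on overflow and retires any trajectory once it has been reused a fixed number of times. The class name `ConstrainedReplayBuffer`, the proportional sampling rule, and the priority floor of `1e-6` are assumptions of this sketch; the paper does not specify how priorities are mapped to sampling probabilities.

```python
import random


class ConstrainedReplayBuffer:
    """Priority-sampled buffer with a per-trajectory reuse limit."""

    def __init__(self, capacity, max_reuse=4, seed=0):
        self.capacity = capacity
        self.max_reuse = max_reuse
        self.items = []  # each entry: [trajectory, priority, reuse_count]
        self.rng = random.Random(seed)

    def add(self, trajectory, priority):
        self.items.append([trajectory, priority, 0])
        if len(self.items) > self.capacity:
            self.items.pop(0)  # remove the oldest trajectory

    def sample(self, k):
        """Draw up to k distinct trajectories, proportionally to priority."""
        pool = list(range(len(self.items)))
        picked = []
        while pool and len(picked) < k:
            # Assumption: floor priorities so non-positive values stay sampleable.
            weights = [max(self.items[i][1], 1e-6) for i in pool]
            idx = self.rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(idx)
            picked.append(idx)
        batch = [self.items[i][0] for i in picked]
        for i in picked:
            self.items[i][2] += 1
        # Retire trajectories that have reached their reuse limit.
        self.items = [item for item in self.items if item[2] < self.max_reuse]
        return batch


buffer = ConstrainedReplayBuffer(capacity=2, max_reuse=2)
for name, priority in [("tau_a", 1.0), ("tau_b", 2.0), ("tau_c", 3.0)]:
    buffer.add(name, priority)
print(len(buffer.items))  # 2, the oldest trajectory was dropped
buffer.sample(2)
buffer.sample(2)
print(len(buffer.items))  # 0, both trajectories hit the reuse limit
```

Bounding reuse keeps replayed data only a few updates old, which is what keeps the importance weights close to 1 in practice.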

4. Experiments

4.1. Experimental Setup

4.1.1. Simulation Environment

All experiments were conducted on a computational platform equipped with an NVIDIA RTX 4090 GPU and an Intel i9-13900 CPU. The proposed algorithm and all baseline comparisons were implemented using the PyTorch framework within a unified codebase to ensure fairness and reproducibility.

The experimental environment, shown in Figure 1, is a multi-agent pursuit-evasion simulation within a 100 km × 100 km maritime domain, with the upper-left corner of the coordinate system at [0,0] and the lower-right corner at [100,100]. In this scenario, a team of five cooperative ASVs is tasked with pursuing two adversarial ASVs. Key simulation parameters are user-configurable to ensure experimental flexibility, including the number of agents on both sides, their movement speeds (friendly ASVs initially set to 50 km/h and adversarial ASVs to 30 km/h), the simulation step size (1 km/step), and the maximum number of steps per episode.

Each ASV is initialized with a fuel capacity of 200 L, consumed at a rate of 1 L per kilometer traveled. The friendly ASV team starts from the position (10,10), while the two adversarial ASVs start from (10,90) and (90,90), respectively, initially heading toward (50,50). The adversarial ASVs navigate randomly throughout the domain, with their motion governed by kinematic models that respect realistic vessel dynamics, thereby preventing non-physical maneuvers such as sharp turns. An adversarial ASV is considered neutralized when any friendly ASV enters its 5 km detection radius. Upon neutralization, the target is visually marked with a red cross, becomes immobilized, and the successful friendly ASV is dynamically reassigned a new target. An episode terminates when all adversarial targets are neutralized or the maximum number of simulation steps is reached.
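The neutralization and fuel rules described above can be expressed as two small helper functions. This is a sketch of the stated rules only, not the authors' simulator: the function names are hypothetical, and treating a distance exactly equal to 5 km as inside the radius is an assumption.

```python
import math

DETECTION_RADIUS_KM = 5.0  # neutralization radius around an adversarial ASV
FUEL_CAPACITY_L = 200.0    # initial fuel per ASV
FUEL_PER_KM_L = 1.0        # consumption rate


def is_neutralized(target_xy, friendly_positions, radius=DETECTION_RADIUS_KM):
    """True if any friendly ASV lies within the target's detection radius."""
    tx, ty = target_xy
    return any(math.hypot(fx - tx, fy - ty) <= radius for fx, fy in friendly_positions)


def remaining_fuel(distance_travelled_km, capacity=FUEL_CAPACITY_L):
    """Fuel left after travelling the given distance, floored at zero."""
    return max(0.0, capacity - FUEL_PER_KM_L * distance_travelled_km)


print(is_neutralized((10.0, 90.0), [(10.0, 10.0)]))                # False
print(is_neutralized((10.0, 90.0), [(10.0, 10.0), (12.0, 88.0)]))  # True
print(remaining_fuel(150.0))                                       # 50.0
```

With 200 L of fuel at 1 L per kilometer, each vessel can travel at most 200 km, which exceeds the domain diagonal of roughly 141 km but still constrains long detours during pursuit.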

Our simulation environment intentionally adopts simplified kinematics and omits environmental disturbances to establish a controlled benchmark for evaluating the core algorithmic contributions—namely, credit assignment and sample efficiency in sparse-reward settings. This approach follows established precedents in multi-agent reinforcement learning research where isolating specific learning challenges is necessary before advancing to more complex dynamics. While real-world maritime operations involve additional complexities like communication constraints and ocean currents, this initial simplification allows us to rigorously validate the proposed method’s fundamental capabilities in cooperative decision-making before such extensions.

The dynamic task allocation for the friendly ASVs is governed by the MAPPO-CF-PER algorithm. The algorithm’s decision-making is based on a comprehensive observation space, encompassing individual agent states (e.g., position, heading, remaining fuel), the team state (relative positions and fuel levels of all friendly vessels), and the adversarial state (relative positions, distances, and neutralization status). Task reassessment is triggered either periodically every 10 simulation steps (a configurable interval) or immediately upon critical events such as fuel depletion or the neutralization of an adversarial unit.

Algorithm 1 outlines the complete workflow of the proposed MAPPO-CF-PER. The algorithm begins with the initialization of agent policy networks, a centralized value network, and a replay buffer. During each main iteration, it first collects on-policy trajectories by executing the current policies. For each trajectory, counterfactual-enhanced GAE is computed to sharpen credit assignment. These trajectories are then stored in a prioritized replay buffer, with their priorities calculated, ensuring efficient experience reuse. The core learning phase involves performing multiple update steps. In each step, a mini-batch mixing newly collected and replay data is sampled. The policy parameters are updated by maximizing a clipped objective that utilizes importance sampling weights, while the value network is updated to minimize its loss. Finally, the TD errors and priorities for the sampled replay trajectories are recomputed to refresh the buffer. This integrated design enables stable, sample-efficient multi-agent learning by dynamically balancing on-policy exploration with prioritized off-policy credit assignment.

Algorithm 1 MAPPO-CF-PER

1: Initialization:
2:   Initialize policy networks $π_{θ_{i}}$ for all agents $i = 1, \dots, N$;
3:   Initialize centralized value network $V_{ϕ}$;
4:   Initialize replay buffer $D$ with capacity $C$;
5:   Set hyperparameters: learning rates, $ϵ$, $γ$, $λ$, $β$, $α$, and entropy coefficient.
6: for iteration $= 1$ to $M$ do
7:   Data Collection:
8:     Execute current policies $π_{θ_{i}}$ in the environment;
9:     Collect a batch of trajectories $\{τ_{1}, \dots, τ_{K}\}$.
10:   Advantage Computation:
11:     For each trajectory and each agent, compute $A_{i}^{\text{GAE-CF}}$ using counterfactual-enhanced GAE.
12:   Replay Buffer Update:
13:     Compute trajectory priorities $p(τ)$;
14:     Store new trajectories in $D$;
15:     if $|D| > C$ then
16:       Remove oldest trajectories from $D$.
17:     end if
18:   Learning Phase:
19:   for update step $= 1$ to $U$ do
20:     Sample a mini-batch composed of:
21:       Newly collected on-policy data;
22:       Replay data sampled from $D$ according to $p(τ)$.
23:     Compute clipped importance weights $w_{t}^{\text{clip}}$ for replayed samples;
24:     Update policy parameters $θ_{i}$ by maximizing: $\mathbb{E}\left[ w_{t}^{\text{clip}} \cdot \min\left( ρ_{t} A_{i}^{\text{GAE-CF}}, \ \mathrm{clip}(ρ_{t}, 1 - ϵ, 1 + ϵ) \, A_{i}^{\text{GAE-CF}} \right) \right]$;
25:     Update value network parameters $ϕ$ by minimizing the value loss;
26:     Recompute TD errors and counterfactual effectiveness;
27:     Update priorities of sampled trajectories in $D$.
28:   end for
29:   Optionally decay entropy regularization coefficient.
30: end for
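The policy objective in step 24 of Algorithm 1 can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the authors' code: the function name `weighted_clipped_surrogate` is hypothetical, the inputs are assumed to be precomputed per-sample probability ratios, advantages, and importance weights, and freshly collected on-policy samples are assumed to carry a weight of 1.

```python
from typing import Sequence


def weighted_clipped_surrogate(ratios: Sequence[float],
                               advantages: Sequence[float],
                               is_weights: Sequence[float],
                               eps: float = 0.2) -> float:
    """Batch mean of w * min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    if not (len(ratios) == len(advantages) == len(is_weights)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for rho, adv, w in zip(ratios, advantages, is_weights):
        # Restrict the probability ratio to the trust interval [1 - eps, 1 + eps].
        clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
        # Pessimistic (lower) bound of the clipped and unclipped terms,
        # scaled by the importance weight of the sample.
        total += w * min(rho * adv, clipped_rho * adv)
    return total / len(ratios)
```

In a training loop this quantity would be maximized with respect to the policy parameters (for instance by minimizing its negative with an automatic-differentiation library); the sketch only shows how each sample contributes to the objective.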

The algorithm employs the Adam optimizer, with learning rates for the policy network (Actor) and value network (Critic) set to $3 \times 10^{-4}$ and $1 \times 10^{-3}$, respectively, to provide differentiated update rates. Within the PPO framework, the discount factor $γ$ is set to $0.99$ to balance immediate and future returns; the $λ$ parameter in GAE is set to $0.95$ to strike a balance between bias and variance in advantage estimation; and the clip factor $ϵ$ for policy updates is $0.2$, which is crucial for ensuring training stability. Both the policy network and the value network are implemented as single-hidden-layer Multi-Layer Perceptrons (MLPs) with 128 hidden units each.
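To make the roles of $γ$ and $λ$ concrete, the following sketch computes standard GAE for a single trajectory with the values quoted above. It deliberately omits the counterfactual term of the proposed method, so it should be read as the plain GAE recursion only; the function name `gae` and the list-based interface are illustrative assumptions.

```python
from typing import List, Sequence


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a sparse reward that arrives only at the last step, the advantage of each earlier step is discounted by $γλ = 0.9405$ per step, which shows how slowly credit propagates backward along a trajectory in this setting.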

To reflect the innovations of the proposed algorithm, the weight coefficient for the counterfactual baseline module ($λ_{cf}$) is set to $0.5$, ensuring effective integration of the agent’s global advantage estimate and its individual counterfactual contribution. For the PER module, the priority exponent ($α_{per}$) and the importance sampling weight coefficient ($β_{per}$) are set to $0.6$ and $0.4$, respectively. This enables the algorithm to utilize experience samples with high learning value more efficiently while maintaining the stability of on-policy learning through importance sampling correction.
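The effect of the two PER exponents can be seen in a small sketch that follows the standard prioritized-replay formulation of Schaul et al. It assumes that a scalar priority per stored trajectory is already available; how the paper combines TD error and counterfactual contribution into that priority is not reproduced here. The function names and the small `eps` offset are illustrative assumptions.

```python
from typing import List, Sequence


def sampling_probabilities(priorities: Sequence[float],
                           alpha: float = 0.6,
                           eps: float = 1e-6) -> List[float]:
    """P(k) = p_k^alpha / sum_j p_j^alpha, with eps keeping every p_k positive."""
    scaled = [(p + eps) ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs: Sequence[float], beta: float = 0.4) -> List[float]:
    """w_k = (N * P(k))^(-beta), normalized so that the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]
```

With $α_{per} = 0.6$ the sampling is skewed toward high-priority trajectories but less sharply than with purely proportional sampling ($α = 1$), and with $β_{per} = 0.4$ the resulting bias is only partially corrected, so frequently sampled trajectories receive weights below 1.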

4.1.2. Baselines

To demonstrate the effectiveness of the proposed MAPPO-CF-PER algorithm, comparisons were conducted against several representative baseline methods:

1. Standard MAPPO: The baseline version of Multi-Agent PPO, which utilizes a centralized value function but lacks the proposed CF and PER components. This direct ablation baseline is crucial for isolating and demonstrating the specific performance gains contributed by our innovations in credit assignment and sample efficiency.
2. MADDPG: A classical actor-critic method for mixed cooperative-competitive environments, MADDPG employs CTDE with deterministic policies. It is included to contrast the on-policy, stochastic policy optimization of MAPPO-CF-PER with an off-policy, deterministic policy alternative.
3. IPPO: This approach trains each agent with a separate PPO learner, treating other agents as part of the environment. It serves as a fundamental baseline for decentralized learning without explicit coordination mechanisms, highlighting the benefits of centralized training in our approach.
4. QMIX: A leading value-based algorithm that enforces monotonicity between the joint and individual action-values through a mixing network. It represents the value-decomposition paradigm in CTDE (which also includes VDN) and provides a critical comparison to policy-based methods.

Standard MAPPO serves as our primary ablation baseline, highlighting the specific contribution of our proposed counterfactual baseline and PER mechanisms. IPPO, which trains agents independently, tests the necessity of centralized training in our collaborative task. The off-policy algorithms, MADDPG and QMIX, represent alternative paradigms within the CTDE framework. MADDPG uses deterministic policies and can be sensitive to hyperparameters, while QMIX’s value decomposition can be less sample-efficient than policy-based methods in complex continuous environments like ours. This theoretical framing informs the expected performance hierarchy and helps explain the empirical results.

To validate the stability of our on-policy learning with PER, we monitored policy divergence throughout training. The average KL divergence between successive policy updates remained below 0.02, confirming that our importance sampling with clipping effectively maintained the on-policy characteristics while benefiting from selective experience reuse.
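A monitoring check of this kind might look like the sketch below. It assumes diagonal Gaussian policies, which this section does not state explicitly, and compares the action distributions of the old and updated policies on the same batch of states. The function names and the `(mu, sigma)` data layout are hypothetical; only the 0.02 level comes from the text.

```python
import math
from typing import List, Sequence, Tuple

GaussianParams = Tuple[Sequence[float], Sequence[float]]  # (means, std devs)

KL_LEVEL = 0.02  # average KL level reported in the text


def gaussian_kl(mu_old: Sequence[float], sigma_old: Sequence[float],
                mu_new: Sequence[float], sigma_new: Sequence[float]) -> float:
    """KL(old || new) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, sigma_old, mu_new, sigma_new):
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl


def mean_kl(old_batch: List[GaussianParams], new_batch: List[GaussianParams]) -> float:
    """Average KL over a batch of states evaluated under both policies."""
    kls = [gaussian_kl(mo, so, mn, sn)
           for (mo, so), (mn, sn) in zip(old_batch, new_batch)]
    return sum(kls) / len(kls)
```

Logging `mean_kl(old_batch, new_batch)` after each update step and comparing it against `KL_LEVEL` gives a simple signal for detecting updates that drift too far from the data-collecting policy.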

4.2. Results and Analysis

The experimental results demonstrate the method’s efficacy in the implemented benchmark environment, showing particular strength in sparse-reward coordination. While the current simulation simplifies certain aspects, the learning principles established—especially the counterfactual credit assignment and experience reuse mechanisms—provide a foundation that can be extended to more realistic conditions in future work. The stable convergence patterns observed (Figure 2 and Figure 3) suggest robust performance that would likely transfer well to more complex environments with appropriate adaptations.

Figure 2 presents the training reward curves of the proposed method (“ours”) and four baseline algorithms (MAPPO, IPPO, MADDPG, QMIX) over 30,000 interaction steps. As shown, all algorithms start with near-zero rewards. The proposed method demonstrates superior learning efficiency and final performance. It achieves the fastest initial ascent, takes the lead after a brief period of high initial variance, and eventually converges to the highest stable reward level of approximately 25–26. In comparison, MAPPO also learns rapidly but plateaus at a lower reward of around 23–24. IPPO shows steady yet slower improvement, converging near 19–20. MADDPG progresses moderately to a final reward of about 16. Notably, QMIX fails to learn an effective policy in this environment, with its reward remaining close to zero throughout training. These results indicate that our method not only learns faster but also discovers a policy with a higher cumulative return than the multi-agent reinforcement learning baselines considered.

Figure 3 illustrates the training success curves, plotting the task success rate (success means that all enemy units have been eliminated by our units) against the number of training epochs for the five methods. Our method achieves superior learning efficacy and final performance. It not only demonstrates the steepest initial ascent but also consistently maintains a leading position throughout the training process, ultimately converging to the highest success rate. In contrast, the MAPPO and IPPO algorithms exhibit moderate performance, showing a positive yet slower learning trend and plateauing at lower success levels. Both the MADDPG and QMIX methods perform comparatively poorly in this environment, with their success rates remaining at the lower end of the spectrum and showing minimal improvement over the 30k epochs. This clear performance hierarchy supports the effectiveness of our approach in solving the target task more reliably and efficiently than the benchmarked multi-agent reinforcement learning algorithms.

The design of MAPPO-CF-PER is inherently synergistic, making a fully isolated quantitative ablation challenging to interpret, as the components are deeply interconnected. The counterfactual baseline within GAE provides a more precise credit assignment, which is evidenced by the steeper initial learning curve compared to standard MAPPO. This indicates that agents receive clearer, less noisy learning signals, allowing them to identify effective cooperative strategies more rapidly. Furthermore, the prioritized replay mechanism directly addresses the sample inefficiency of on-policy learning. The stability of our method’s learning curve, especially in the later stages of training, suggests that the reuse of high-value experiences with importance sampling correction enables more robust policy convergence, unlike the higher variance observed in IPPO.

The failure of QMIX in this environment provides an important insight into the problem’s challenges. QMIX relies on a monotonic factorization of the joint action-value function, an assumption that may be too restrictive for the complex, sparse-reward coordination required in our maritime pursuit task. The continuous action space further exacerbates this issue, as QMIX is fundamentally designed for discrete actions. Similarly, the moderate performance of MADDPG can be attributed to the challenges of training deterministic policies in a multi-agent setting with sparse rewards, where exploratory stochastic policies are more effective. These failures highlight the particular suitability of an enhanced on-policy, policy-based algorithm like MAPPO-CF-PER for this domain.
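For reference, the monotonic factorization mentioned above is usually stated as a sign constraint on the mixing network. The notation below follows the QMIX paper listed in the references rather than the symbols used elsewhere in this article:

$$\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \quad \forall a \in \{1, \dots, N\}$$

Here $Q_{tot}$ is the joint action-value and $Q_{a}$ is the individual action-value of agent $a$. The constraint guarantees that each agent's greedy action with respect to its own $Q_{a}$ is consistent with the greedy joint action with respect to $Q_{tot}$, but it excludes joint value functions in which the best action of one agent depends non-monotonically on what the other agents do.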

5. Conclusions

This paper presents an enhanced MAPPO framework that addresses credit assignment and sample efficiency in multi-surface vessel collaboration via theoretically grounded integrations. The counterfactual baseline within GAE and the prioritized experience replay mechanism provide a principled approach for sparse-reward environments, as validated by empirical results. Future work will extend this to heterogeneous agents and real-world deployments, further solidifying its methodological novelty. The primary contributions of this work can be summarized as follows:

(1) A counterfactual baseline mechanism was integrated into the MAPPO framework to enable more accurate credit assignment in cooperative multi-agent tasks.
(2) A prioritized experience replay strategy suitable for on-policy learning was developed using importance weighting and a novel priority definition that accounts for temporal difference error and counterfactual contribution.
(3) Extensive experimental evaluation was conducted in simulated multi-surface vessel scenarios, demonstrating consistent and substantial performance gains over state-of-the-art baseline methods.

Experimental results demonstrated that the proposed MAPPO-CF-PER algorithm achieved a substantially higher success rate and significantly reduced the training burden compared with the baseline algorithms. These improvements underscore the algorithm’s strong potential for real-world maritime scenarios, where minimizing expensive training data requirements and ensuring high mission reliability are paramount.

The current algorithm’s performance is contingent on idealized assumptions of perfect communication and reliable sensors. Its robustness is untested against practical challenges like transmission delays or sensor malfunctions. Future research will focus on extending the framework to heterogeneous vessel teams with diverse sensing and actuation capabilities. In addition, transfer learning techniques will be investigated to facilitate policy generalization across different maritime environments. Finally, validation on physical surface vessel platforms will be pursued to assess performance under real-world operational constraints.

Author Contributions

Study conceptualization, formal analysis, data curation, methodology, and investigation were performed by G.W. and F.T. Project administration and supervision were performed by C.R. Validation, writing of the original draft, and review and editing were performed by G.W., F.T. and C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62403003), the Anhui Provincial Natural Science Foundation Youth Project (Grant No. 2408085QF204), and the Anhui Higher Education Scientific Research Project (Grant No. 2024AH050061).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Oroojlooy, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 2023, 53, 13677–13722.
Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. arXiv 2017, arXiv:1706.05296.
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
Ma, X.; Yang, Y.; Li, C.; Lu, Y.; Zhao, Q.; Jun, Y. Modeling the interaction between agents in cooperative multi-agent reinforcement learning. arXiv 2021, arXiv:2102.06042.
Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 2961–2970.
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2017, arXiv:1706.02275.
Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51.
Hu, L.; Wei, C.; Yin, L. MAPPO-ITD3-IMLFQ algorithm for multi-mobile robot path planning. Adv. Eng. Inform. 2025, 65, 103398.
Watanabe, T.; Takahashi, Y. Hierarchical reinforcement learning using a modular fuzzy model for multi-agent problem. In Proceedings of the 2007 IEEE International Conference on Systems, Man and Cybernetics, Montreal, QC, Canada, 7–10 October 2007; pp. 1681–1686.
Du, X.; Ye, Y.; Zhang, P.; Yang, Y.; Chen, M.; Wang, T. Situation-dependent causal influence-based cooperative multi-agent reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17362–17370.
Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.J.; Leibo, J.Z.; De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 3040–3049.
Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 2778–2787.
Xie, A.; Losey, D.; Tolsma, R.; Finn, C.; Sadigh, D. Learning latent representations to influence multi-agent interaction. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: Cambridge, MA, USA, 2021; pp. 575–588.
Ding, Z.; Huang, T.; Lu, Z. Learning individually inferred communication for multi-agent cooperation. Adv. Neural Inf. Process. Syst. 2020, 33, 22069–22079.
Kim, W.; Park, J.; Sung, Y. Communication in Multi-Agent Reinforcement Learning: Intention Sharing. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021.
Zhao, Y.; Hu, L.; Wang, Y.; Hou, M.; Zhang, H.; Ding, K.; Zhao, J. Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs. arXiv 2025, arXiv:2510.11062.
Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
Ahmadzadeh, A.; Jadbabaie, A.; Kumar, V.; Pappas, G.J. Multi-UAV cooperative surveillance with spatio-temporal specifications. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 5293–5298.
Braquet, M.; Bakolas, E. Greedy decentralized auction-based task allocation for multi-agent systems. IFAC-PapersOnLine 2021, 54, 675–680.

Figure 1. The experimental environment diagram. The environment is a 100 km × 100 km maritime domain. Five cooperative ASVs (blue) start from (10,10) and are tasked with pursuing two adversarial ASVs (red), which start from (10,90) and (90,90). The detection radius for neutralizing an adversarial ASV is 5 km, indicated by the circle around the red agent. The diagram illustrates the initial setup and key spatial relationships.

Figure 2. The reward curves of the experiment.

Figure 3. The success curve of the experiment.

