An Improved MAPPO for Multi-Surface Vessel Collaboration
Abstract
1. Introduction
- (1) To address the challenge of credit assignment under sparse rewards, we introduce a novel counterfactual baseline embedded directly within the generalized advantage estimation (GAE) framework. Unlike prior works, which primarily applied counterfactual reasoning in off-policy settings, this integration within the on-policy MAPPO structure provides a more stable and efficient mechanism for quantifying individual agent contributions, leading to sharper policy gradients in collaborative ASV tasks.
- (2) To overcome the inherent sample inefficiency of on-policy learning, we propose a tailored prioritized experience replay (PER) mechanism. Its key innovation is a composite priority metric that combines temporal-difference error with counterfactual effectiveness. Furthermore, to safely reuse off-policy data without destabilizing training, we incorporate importance sampling weights within the PPO clipping objective. This approach is a non-trivial extension of PER to the multi-agent on-policy domain and removes standard MAPPO's limitation of discarding data after a single use.
2. Related Work
2.1. Multi-Agent Reinforcement Learning
2.2. Credit Assignment in Multi-Agent Systems
2.3. Experience Replay in On-Policy Learning
2.4. Multi-Surface Vessel Collaboration
3. Proposed Method
3.1. Background: MAPPO Formulation
- $\mathcal{S}$ denotes the global state space;
- $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_N$ represents the joint action space composed of individual action spaces $\mathcal{A}_i$;
- $P$ is the state transition probability function;
- $R$ is the shared reward function;
- $\mathcal{O}$ denotes the local observation space for each agent;
- $N$ is the number of agents;
- $\gamma$ is the discount factor.
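For orientation, the standard quantities that MAPPO builds on can be written as follows. This is the textbook form of GAE and the PPO clipped surrogate with a shared advantage, not a reproduction of this paper's own equations. The symbols $V$ (centralized value function), $\lambda$ (GAE parameter), $\epsilon$ (clip range), and $\rho_t^i$ (probability ratio for agent $i$) are notation assumed here.

```latex
\begin{align}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t), \\
\hat{A}_t &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \\
\rho_t^i(\theta) &= \frac{\pi_\theta(a_t^i \mid o_t^i)}{\pi_{\theta_{\mathrm{old}}}(a_t^i \mid o_t^i)}, \\
L^{\mathrm{clip}}(\theta) &= \mathbb{E}_t\!\left[ \frac{1}{N} \sum_{i=1}^{N}
  \min\!\Big( \rho_t^i(\theta)\, \hat{A}_t,\;
  \operatorname{clip}\big(\rho_t^i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right].
\end{align}
```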
3.2. Counterfactual Baseline for GAE
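One plausible reading of a counterfactual baseline embedded in GAE is sketched below, under stated assumptions: the state value subtracted in each TD residual is replaced by a COMA-style baseline that marginalises agent i's own action while the other agents' actions are held fixed. The function names, the discrete action set, and the array `q_alt` (centralized critic outputs for every alternative action of agent i) are hypothetical and are not taken from the paper.

```python
import numpy as np

def counterfactual_baseline(q_alt, pi):
    """b_t = sum_a pi(a | o_t) * Q(s_t, (a, a_-i)).

    q_alt: (T, n_actions) critic values for each alternative action of agent i,
           with the other agents' actions fixed (assumed input).
    pi:    (T, n_actions) agent i's policy probabilities.
    """
    return np.sum(pi * q_alt, axis=-1)

def counterfactual_gae(rewards, next_values, q_alt, pi, dones,
                       gamma=0.99, lam=0.95):
    """GAE recursion whose TD residual subtracts the counterfactual baseline.

    This replaces V(s_t) in the usual residual by b_t; it is an illustrative
    assumption, not necessarily the paper's exact formulation.
    """
    baseline = counterfactual_baseline(q_alt, pi)
    not_done = 1.0 - dones
    deltas = rewards + gamma * next_values * not_done - baseline
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}, cut at episode ends.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * not_done[t] * running
        advantages[t] = running
    return advantages
```

Each agent would call `counterfactual_gae` with its own `q_alt` and `pi`, so agents that share a team reward can still receive different advantage signals.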
3.3. Prioritized Experience Replay with Importance Sampling
- (1) Priority Scheme
- (2) Importance Sampling Weighting
- (3) Constrained Replay and Buffer Management
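The first two components above can be sketched as follows, as a minimal illustration rather than the authors' implementation. The weighted-sum form of the composite priority, the mixing coefficient `beta_mix`, and all function names are assumptions. The sampling probabilities and importance sampling weights follow the standard PER definitions, and the weights simply rescale each sample's PPO clipped surrogate term. Buffer management is not covered here.

```python
import numpy as np

def composite_priority(td_error, cf_advantage, beta_mix=0.5, eps=1e-6):
    """Hypothetical priority: weighted sum of |TD error| and |counterfactual term|."""
    return beta_mix * np.abs(td_error) + (1.0 - beta_mix) * np.abs(cf_advantage) + eps

def sampling_probs(priorities, alpha=0.6):
    """Standard PER sampling distribution P(j) = p_j^alpha / sum_k p_k^alpha."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def is_weights(probs, idx, beta=0.4):
    """Standard PER weights w_j = (N * P(j))^(-beta), normalised by the batch max."""
    w = (len(probs) * probs[idx]) ** (-beta)
    return w / w.max()

def weighted_clipped_loss(logp_new, logp_old, adv, w, clip=0.2):
    """PPO clipped surrogate with each sample scaled by its importance weight."""
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv)
    return -np.mean(w * surrogate)

# Usage: draw a prioritized minibatch and compute its weights.
rng = np.random.default_rng(0)
td = rng.normal(size=8)            # placeholder TD errors
cf = rng.normal(size=8)            # placeholder counterfactual advantages
probs = sampling_probs(composite_priority(td, cf))
batch_idx = rng.choice(len(probs), size=4, p=probs)
weights = is_weights(probs, batch_idx)
```

Scaling by `weights` reduces the influence of samples that were drawn more often than uniform sampling would draw them, while the clip on the ratio limits how far stale data can move the policy.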
4. Experiments
4.1. Experimental Setup
4.1.1. Simulation Environment
Algorithm 1. MAPPO-CF-PER
4.1.2. Baselines
- 1. Standard MAPPO: The baseline version of Multi-Agent PPO, which utilizes a centralized value function but lacks the proposed CF and PER components. This direct ablation baseline is crucial for isolating and demonstrating the specific performance gains contributed by our innovations in credit assignment and sample efficiency.
- 2. MADDPG: As a classical actor-critic method for mixed cooperative-competitive environments, MADDPG employs CTDE using deterministic policies. It is included to contrast the on-policy, stochastic policy optimization of MAPPO-CF-PER with an off-policy, deterministic policy alternative.
- 3. IPPO: This approach trains each agent using a separate PPO algorithm, treating other agents as part of the environment. It serves as a fundamental baseline for decentralized learning without explicit coordination mechanisms, highlighting the benefits of centralized training in our approach.
- 4. QMIX: This method is a leading value-based algorithm that enforces monotonicity between joint and individual action-values through a mixing network. It represents the value-decomposition paradigm in CTDE, which also includes value decomposition networks (VDN), and provides a critical comparison to policy-based methods.
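The monotonicity constraint mentioned for QMIX can be illustrated with a small sketch. It is a simplification under stated assumptions: the mixing weights are fixed arrays passed in by the caller, whereas QMIX generates them with hypernetworks conditioned on the global state, and a ReLU stands in for the activation. The name `monotonic_mix` is hypothetical.

```python
import numpy as np

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Combine per-agent Q-values into Q_tot using non-negative mixing weights.

    Taking absolute values of the weights and using a non-decreasing activation
    guarantees dQ_tot / dQ_i >= 0 for every agent i.
    """
    hidden = np.maximum(agent_qs @ np.abs(w1) + b1, 0.0)  # ReLU for brevity
    return hidden @ np.abs(w2) + b2

# Usage: raising one agent's Q-value can never lower the mixed value.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.1
qs = rng.normal(size=3)
bumped = qs.copy()
bumped[1] += 1.0
q_tot, q_tot_bumped = monotonic_mix(qs, w1, b1, w2, b2), monotonic_mix(bumped, w1, b1, w2, b2)
```

Because of this property, each agent's greedy action with respect to its own Q-value is also greedy with respect to the joint value, which is what permits decentralized execution.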
4.2. Results and Analysis
5. Conclusions
- (1) A counterfactual baseline mechanism was integrated into the MAPPO framework to enable more accurate credit assignment in cooperative multi-agent tasks.
- (2) A prioritized experience replay strategy suitable for on-policy learning was developed using importance weighting and a novel priority definition that accounts for both temporal-difference error and counterfactual contribution.
- (3) Extensive experimental evaluation was conducted in realistic multi-surface vessel scenarios, demonstrating consistent and substantial performance gains over state-of-the-art baseline methods.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Oroojlooy, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 2023, 53, 13677–13722.
- Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. arXiv 2017, arXiv:1706.05296.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Ma, X.; Yang, Y.; Li, C.; Lu, Y.; Zhao, Q.; Jun, Y. Modeling the interaction between agents in cooperative multi-agent reinforcement learning. arXiv 2021, arXiv:2102.06042.
- Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 2961–2970.
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2017, arXiv:1706.02275.
- Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51.
- Hu, L.; Wei, C.; Yin, L. MAPPO-ITD3-IMLFQ algorithm for multi-mobile robot path planning. Adv. Eng. Inform. 2025, 65, 103398.
- Watanabe, T.; Takahashi, Y. Hierarchical reinforcement learning using a modular fuzzy model for multi-agent problem. In Proceedings of the 2007 IEEE International Conference on Systems, Man and Cybernetics, Montreal, QC, Canada, 7–10 October 2007; pp. 1681–1686.
- Du, X.; Ye, Y.; Zhang, P.; Yang, Y.; Chen, M.; Wang, T. Situation-dependent causal influence-based cooperative multi-agent reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17362–17370.
- Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.J.; Leibo, J.Z.; De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 3040–3049.
- Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 2778–2787.
- Xie, A.; Losey, D.; Tolsma, R.; Finn, C.; Sadigh, D. Learning latent representations to influence multi-agent interaction. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: Cambridge, MA, USA, 2021; pp. 575–588.
- Ding, Z.; Huang, T.; Lu, Z. Learning individually inferred communication for multi-agent cooperation. Adv. Neural Inf. Process. Syst. 2020, 33, 22069–22079.
- Kim, W.; Park, J.; Sung, Y. Communication in Multi-Agent Reinforcement Learning: Intention Sharing. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021.
- Zhao, Y.; Hu, L.; Wang, Y.; Hou, M.; Zhang, H.; Ding, K.; Zhao, J. Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs. arXiv 2025, arXiv:2510.11062.
- Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
- Ahmadzadeh, A.; Jadbabaie, A.; Kumar, V.; Pappas, G.J. Multi-UAV cooperative surveillance with spatio-temporal specifications. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 5293–5298.
- Braquet, M.; Bakolas, E. Greedy decentralized auction-based task allocation for multi-agent systems. IFAC-PapersOnLine 2021, 54, 675–680.



Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, G.; Tian, F.; Ren, C. An Improved MAPPO for Multi-Surface Vessel Collaboration. Actuators 2026, 15, 121. https://doi.org/10.3390/act15020121

