An Adaptive Difference Policy Gradient Method for Cooperative Multi-USV Pursuit in Multi-Agent Reinforcement Learning
Abstract
1. Introduction
- (1) A difference-driven adaptive action adoption mechanism. The mechanism prospectively evaluates the autonomous policy action against the guidance action and adaptively decides whether to accept the guidance action based on their immediate return difference, preventing prior guidance from imposing a persistent bias on the policy output. Without altering the original training pipeline, this stabilizes early-stage exploration, reduces the risk of negative transfer introduced by external priors, and keeps policy updates consistent and controllable.
- (2) A collaborative framework combining structural priors with learned policies. Guidance forces are generated by a potential-field-based heuristic and, together with the single-step difference evaluation mechanism, allow the algorithm to gradually internalize environmental constraints and coordination patterns, organically integrating prior knowledge with deep reinforcement learning.
- (3) A progressive consistency evaluation strategy. A sequential action adoption and immediate update mechanism ensures that single-step policy evaluations across multiple agents are conducted under a consistent state premise, avoiding contradictory assessments and enhancing the coherence of joint decision-making.
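The adoption mechanism in contribution (1) can be sketched as a simple critic-based gate. This is an illustrative reconstruction, not the paper's exact implementation: the toy critic `toy_q` and the action vectors below are assumptions chosen only to make the comparison concrete.

```python
import numpy as np

def difference_gate(q_fn, state, policy_action, guidance_action):
    """Adaptive action-adoption gate (illustrative sketch).

    Evaluates both candidate actions with a critic q_fn and adopts the
    guidance action only when its estimated one-step return exceeds that
    of the autonomous policy action, so an unhelpful prior cannot bias
    the executed action.
    """
    q_policy = q_fn(state, policy_action)
    q_guide = q_fn(state, guidance_action)
    if q_guide > q_policy:
        return guidance_action, True   # accept the external guidance
    return policy_action, False        # keep the autonomous action

# Toy critic: prefers actions that move the agent toward the origin (the "target").
def toy_q(state, action):
    return -np.linalg.norm(state + action)

state = np.array([1.0, 0.0])
a_policy = np.array([0.2, 0.0])    # moves away from the target
a_guide = np.array([-0.2, 0.0])    # heuristic moves toward the target
action, used_guidance = difference_gate(toy_q, state, a_policy, a_guide)
```

Because the gate compares single-step return estimates rather than overwriting the policy output, guidance is only ever adopted when it is locally advantageous.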
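For contribution (2), the guidance force can be sketched with a classic artificial-potential-field form: an attractive term toward the target plus a repulsive term near obstacles. The gains `k_att`, `k_rep` and the influence radius `d0` here are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def guidance_force(p_usv, p_target, obstacles, k_att=1.0, k_rep=0.5, d0=0.3):
    """Potential-field heuristic guidance force (illustrative sketch).

    obstacles: iterable of (center, radius) pairs. The attractive term
    pulls the pursuer toward the target; the repulsive term pushes it
    away from any obstacle whose surface lies within radius d0.
    """
    force = k_att * (p_target - p_usv)               # attraction to the target
    for p_obs, r_obs in obstacles:
        diff = p_usv - p_obs
        dist = np.linalg.norm(diff)
        d = dist - r_obs                             # distance to obstacle surface
        if 0 < d < d0:
            # Classic repulsive gradient: grows like 1/d near the obstacle.
            force += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (diff / dist)
    return force

p_usv = np.array([0.0, 0.0])
p_target = np.array([1.0, 0.0])
f_free = guidance_force(p_usv, p_target, [])                         # no obstacles
f_near = guidance_force(p_usv, p_target, [(np.array([0.2, 0.05]), 0.1)])
```

With no obstacles the force reduces to pure attraction; with a nearby obstacle the repulsive term dominates, which is exactly the kind of structural prior the difference gate then accepts or rejects step by step.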
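The strategy in contribution (3) can be sketched as a fixed-order sweep over agents in which each gate decision is evaluated against the state already updated by earlier agents' adopted actions, so every single-step comparison shares one premise. The toy critic and transition below are assumptions for illustration.

```python
import numpy as np

def sequential_gate(state, policy_actions, guidance_actions, q_fn, transition):
    """Progressive consistency evaluation (illustrative sketch).

    Agents are processed in a fixed order; each agent's single-step
    comparison uses the state produced by the actions adopted for the
    preceding agents, avoiding contradictory assessments.
    """
    adopted, s = [], state
    for a_pi, a_g in zip(policy_actions, guidance_actions):
        a = a_g if q_fn(s, a_g) > q_fn(s, a_pi) else a_pi
        adopted.append(a)
        s = transition(s, a)  # immediate update before the next agent's evaluation
    return adopted

# Toy setting: two pursuers, critic prefers states closer to the origin.
q = lambda s, a: -np.linalg.norm(s + a)
step = lambda s, a: s + a
adopted = sequential_gate(
    np.array([1.0, 0.0]),
    [np.array([0.2, 0.0]), np.array([-0.3, 0.0])],   # autonomous policy actions
    [np.array([-0.2, 0.0]), np.array([-0.1, 0.0])],  # heuristic guidance actions
    q, step,
)
```

Here the first agent adopts its guidance action, and the second agent, evaluated against the updated state, keeps its own policy action.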
2. Problem Formulation and Preliminaries
2.1. Multi-Agent Reinforcement Learning
2.2. Ship Motion Model
2.3. Definition and Statement of the Multi-Agent Capture Problem
3. Multi-Agent Guided Adaptive Difference Policy Gradient
3.1. MAADPG
Algorithm 1: MAADPG training procedure
3.2. State Space
3.3. Action Space
3.4. Reward Function
4. Simulation Experiments and Results Analysis
4.1. Parameter Settings
4.2. Simulation Scenario and Settings
4.3. Comparative Analysis of Algorithms
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| USV | Unmanned Surface Vehicle |
| DRL | Deep Reinforcement Learning |
| MARL | Multi-Agent Reinforcement Learning |
| CTDE | Centralized Training and Decentralized Execution |
| MADDPG | Multi-Agent Deep Deterministic Policy Gradient |
| DDPG | Deep Deterministic Policy Gradient |
| MAADPG | Multi-Agent Guided Adaptive Difference Policy Gradient |
| CEP | Constrained Encirclement Problem |
| CSR | Capture Success Rate |
References
- Liu, Y.; Liu, C.; Meng, Y.; Ren, X.; Wang, X. Velocity Domain-Based Distributed Pursuit-Encirclement Control for Multi-USVs with Incomplete Information. IEEE Trans. Intell. Veh. 2023, 9, 3246–3257.
- Qu, X.; Li, C.; Jiang, Y.; Long, F.; Zhang, R. Cooperative Pursuit of Unmanned Surface Vehicles Using Multi-Agent Reinforcement Learning. J. Shanghai Jiaotong Univ. Sci. 2025, 1–8.
- Sun, Z.; Sun, H.; Li, P.; Zou, J. Self-Organizing Cooperative Pursuit Strategy for Multi-USV with Dynamic Obstacle Ships. J. Mar. Sci. Eng. 2022, 10, 562.
- Chen, Z.; Zhao, Z.; Xu, J.; Wang, X.; Lu, Y.; Yu, J. A Cooperative Hunting Method for Multi-USV Based on the A* Algorithm in an Environment with Obstacles. Sensors 2023, 23, 7058.
- Yang, X.; Wang, X.; Ye, H.; Xiang, Z.; Zhang, B. Design and Verification of IHA-MATD3: A Novel Multi-USV Cooperative Pursuit-Evasion Scheme. Ocean Eng. 2025, 340, 122446.
- Pan, C.; Wang, A.; Peng, Z.; Han, B.; Lyu, G.; Zhang, W. Pursuit-Evasion Game of Under-Actuated ASVs Based on Deep Reinforcement Learning and Model Predictive Path Integral Control. Neurocomputing 2025, 638, 130045.
- Jiang, Y.; Peng, Z.; Wang, J. Constrained Control of Autonomous Surface Vehicles for Multitarget Encirclement via Fuzzy Modeling and Neurodynamic Optimization. IEEE Trans. Fuzzy Syst. 2022, 31, 875–889.
- Panda, J.P. Machine Learning for Naval Architecture, Ocean and Marine Engineering. J. Mar. Sci. Technol. 2023, 28, 1–26.
- Qu, X.; Gan, W.; Song, D.; Zhou, L. Pursuit-Evasion Game Strategy of USV Based on Deep Reinforcement Learning in Complex Multi-Obstacle Environment. Ocean Eng. 2023, 273, 114016.
- Wang, Y.; Wang, X.; Zhou, W.; Yan, H.; Xie, S. Threat Potential Field Based Pursuit–Evasion Games for Underactuated Unmanned Surface Vehicles. Ocean Eng. 2023, 285, 115381.
- Sun, Z.; Sun, H.; Li, P.; Li, X.; Du, L. Pursuit–Evasion Problem of Unmanned Surface Vehicles in a Complex Marine Environment. Appl. Sci. 2022, 12, 9120.
- Luo, Q.; Wang, H.; Li, N.; Su, B.; Zheng, W. Model-Free Predictive Trajectory Tracking Control and Obstacle Avoidance for Unmanned Surface Vehicle with Uncertainty and Unknown Disturbances via Model-Free Extended State Observer. Int. J. Control Autom. Syst. 2024, 22, 1985–1997.
- Luo, Q.; Wang, H.; Li, N.; Zheng, W. Multi-Unmanned Surface Vehicle Model-Free Sliding Mode Predictive Adaptive Formation Control and Obstacle Avoidance in Complex Marine Environment via Model-Free Extended State Observer. Ocean Eng. 2024, 293, 116773.
- Yang, S.; Wang, K.; Wang, W.; Wu, H.; Suo, Y.; Chen, G.; Xian, J. Dual-Attention Proximal Policy Optimization for Efficient Autonomous Navigation in Narrow Channels Using Deep Reinforcement Learning. Ocean Eng. 2025, 326, 120707.
- Wang, W.; Wu, H.; Yang, S.; Li-Ching, K. LNPP: Logical Neural Path Planning of Mobile Beacon for Ocean Sensor Networks in Uncertain Environments Using Hierarchical Reinforcement Learning. Ocean Eng. 2025, 12, 2606–2621.
- Chaysri, P.; Spatharis, C.; Blekas, K.; Vlachos, K. Unmanned Surface Vehicle Navigation through Generative Adversarial Imitation Learning. Ocean Eng. 2023, 282, 114989.
- Sinha, A.; Cao, Y. Three-Dimensional Guidance Law for Target Enclosing within Arbitrary Smooth Shapes. J. Guid. Control Dyn. 2023, 46, 2224–2234.
- Qu, X.; Li, C.; Jiang, S.; Liu, G.; Zhang, R. Multi-Agent Reinforcement Learning-Based Cooperative Encirclement Control of Autonomous Surface Vehicles Against Multiple Targets. J. Mar. Sci. Eng. 2025, 13, 1558.
- Zhang, J.; Yang, Y.; Liu, K.; Li, T. Solving Dynamic Encirclement for Multi-ASV Systems Subjected to Input Saturation via Time-Varying Formation Control. Ocean Eng. 2024, 310, 118707.
- Wang, Y.; Zhao, Y. Multiple Ships Cooperative Navigation and Collision Avoidance Using Multi-Agent Reinforcement Learning with Communication. arXiv 2024, arXiv:2410.21290.
- Gan, W.; Qu, X.; Song, D.; Yao, P. Multi-USV Cooperative Chasing Strategy Based on Obstacles Assistance and Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2023, 21, 5895–5910.
- Zhang, C.; Zeng, R.; Lin, B.; Zhang, Y.; Xie, W.; Zhang, W. Multi-USV Cooperative Target Encirclement through Learning-Based Distributed Transferable Policy and Experimental Validation. Ocean Eng. 2025, 318, 120124.
- Li, Y.; Li, X.; Wei, X.; Wang, H. Sim-Real Joint Experimental Verification for an Unmanned Surface Vehicle Formation Strategy Based on Multi-Agent Deterministic Policy Gradient and Line of Sight Guidance. Ocean Eng. 2023, 270, 113661.
- Peng, Z.; Wu, G.; Luo, B.; Wang, L. Multi-UAV Cooperative Pursuit Strategy with Limited Visual Field in Urban Airspace: A Multi-Agent Reinforcement Learning Approach. IEEE/CAA J. Autom. Sin. 2025, 12, 1350–1367.
- Qu, X.; Zeng, L.; Qu, S.; Long, F.; Zhang, R. An Overview of Recent Advances in Pursuit–Evasion Games with Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2025, 13, 458.
- Li, F.; Yin, M.; Wang, T.; Huang, T.; Yang, C.; Gui, W. Distributed Pursuit–Evasion Game of Limited Perception USV Swarm Based on Multiagent Proximal Policy Optimization. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6435–6446.
- Wang, C.; Wang, Y.; Shi, P.; Wang, F. Scalable-MADDPG-Based Cooperative Target Invasion for a Multi-USV System. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 17867–17877.
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
- Dong, R.; Du, J.; Liu, Y.; Heidari, A.A.; Chen, H. An Enhanced Deep Deterministic Policy Gradient Algorithm for Intelligent Control of Robotic Arms. Front. Neuroinform. 2023, 17, 1096053.
- Oroojlooy, A.; Hajinezhad, D. A Review of Cooperative Multi-Agent Deep Reinforcement Learning. Appl. Intell. 2023, 53, 13677–13722.
- He, Z.; Chu, X.; Liu, C.; Wu, W. A Novel Model Predictive Artificial Potential Field Based Ship Motion Planning Method Considering COLREGs for Complex Encounter Scenarios. ISA Trans. 2023, 134, 58–73.
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
- Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; de Las Casas, D.; Budden, D.; Abdolmaleki, A.; Merel, J.; Lefrancq, A.; et al. DeepMind Control Suite. arXiv 2018, arXiv:1801.00690.
- Xu, S.; Dang, Z. Emergent Behaviors in Multiagent Pursuit-Evasion Games within a Bounded 2D Grid World. Sci. Rep. 2025, 15, 29376.

MAADPG training hyperparameters:

| Parameter | Value |
|---|---|
| Discount factor | 0.95 |
| Learning rate of actor network | 0.0005 |
| Learning rate of critic network | 0.001 |
| Soft-update rate | 0.01 |
| Replay buffer size | 1,000,000 |
| Batch size | 1024 |
| Hidden layers | 4 |
| Hidden units per layer | 128 |
| Total hidden units | 512 |
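The soft-update rate of 0.01 listed above corresponds to the Polyak averaging used by DDPG-family algorithms to track target networks. A minimal sketch, assuming parameters are represented as flat lists of floats (the actual implementation would operate on network tensors):

```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    With tau = 0.01 the target network drifts slowly toward the online
    network, which stabilizes the bootstrapped critic targets.
    """
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]

# Target starts at 0.0 and 1.0; online network holds 1.0 for both parameters.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```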
Obstacle settings:

| Parameter | Value |
|---|---|
| Number of obstacles (per episode) | 0–3 |
| Obstacle radius | 0.10–0.15 km |
Hunter and target USV settings:

| Parameter | Value |
|---|---|
| Initial velocity of hunters | 0 km/s |
| Maximum speed of hunters | 0.010 km/s |
| Maximum acceleration of hunters | 0.004 km/s² |
| Detection range of sensor | 0.2 km |
| Round-up (capture) distance of hunters | 0.15 km |
| Initial velocity of target | 0 km/s |
| Maximum speed of target | 0.011 km/s |
| Maximum acceleration of target | 0.005 km/s² |
| Method | Env Step(s)/Decision | Policy Forward(s) | Critic/Q Eval(s) | State Copy/Rollback | Wall-Clock Step Time (ms) | Training Throughput (Steps/s) |
|---|---|---|---|---|---|---|
| MAADPG | +1 (peek) | +Ng + gate | +(1–2)·Ng | +1 | 0.387 | 2582.3 |
| MADDPG | 0 | +Ng | +Ng | 0 | 0.405 | 2471.1 |
| MADDPG-approx | 0 | +Ng | +Ng | 0 | 0.421 | 2377.2 |
| DDPG | 0 | baseline | baseline | 0 | 0.102 | 9837.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Du, Z.; Yang, S.; Wang, W. An Adaptive Difference Policy Gradient Method for Cooperative Multi-USV Pursuit in Multi-Agent Reinforcement Learning. J. Mar. Sci. Eng. 2026, 14, 252. https://doi.org/10.3390/jmse14030252

