Adaptive Policy Switching for Multi-Agent ASVs in Multi-Objective Aquatic Cleaning Environments
Abstract
1. Introduction
2. Materials and Methods
2.1. Problem Formulation
2.2. Scenario and Vehicle Properties
2.3. Trash Dynamics and Recollection
2.4. Deep Reinforcement Learning Framework
2.5. Policy Network Training
2.5.1. Phase Construction
2.5.2. Action Space
2.5.3. States
- Trash model: shared among all agents, it represents the detected trash positions and captures the partial observability of the environment.
- Agent trail position: records the agent's detection history over the past 10 steps; cells currently inside the detection mask are set to 1, and older detections decay by 0.1 per step down to a floor of 0.1.
- Position of other agents: the union of the detection masks of all other agents, with cells inside the masks set to 1, providing awareness of the fleet's distribution (a minimal sketch of these three channels follows this list).
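Since the exact tensor layout is not reproduced here, the following Python sketch illustrates one plausible construction of these three observation channels; the function name, array shapes, and use of NumPy are illustrative assumptions, not the paper's code.

```python
import numpy as np

def build_observation(trash_model, trail_history, detection_masks, agent_idx):
    """Stack the three state channels described above into one observation.

    Assumed (hypothetical) inputs:
    - trash_model:     (H, W) float array shared by all agents (detected trash).
    - trail_history:   list of the agent's last 10 boolean detection masks,
                       newest last.
    - detection_masks: list of (H, W) boolean masks, one per agent.
    """
    h, w = trash_model.shape

    # Channel 2: agent trail; current detection cells at 1.0, decaying by
    # 0.1 per step, down to a floor of 0.1 for the oldest remembered step.
    trail = np.zeros((h, w), dtype=np.float32)
    for age, mask in enumerate(reversed(trail_history[-10:])):  # age 0 = now
        value = max(1.0 - 0.1 * age, 0.1)
        trail = np.maximum(trail, value * mask.astype(np.float32))

    # Channel 3: union of the other agents' detection masks, cells set to 1.
    others = np.zeros((h, w), dtype=np.float32)
    for i, mask in enumerate(detection_masks):
        if i != agent_idx:
            others = np.maximum(others, mask.astype(np.float32))

    # Channel 1: the shared trash model as-is.
    return np.stack([trash_model.astype(np.float32), trail, others])
```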
2.5.4. Rewards
- Visit reward: proportional to the number of newly discovered cells and shared among overlapping agents, i.e., each agent's share is divided by the number of agents whose detection masks cover the same cells.
- Inactivity penalty: penalizes consecutive steps without discovering new areas; a counter accumulates the number of consecutive steps without new cells, and a scaling coefficient controls the impact of this penalty.
- Redundancy penalty: penalizes overlap with areas already explored by the agent itself or by other agents within a recent window of steps.
- Trash collection reward: rewards collecting the trash located at the agent's current position.
- Distance reward: promotes movement towards nearby trash using inverse Dijkstra distances.
- Model update reward: encourages adaptation to the dynamic movement of trash.
- Time penalty: discourages unnecessary delays (a hedged sketch combining these terms follows this list).
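The analytic forms of these terms were not recoverable from the extracted text; the Python sketch below shows one plausible way to combine them into a per-step scalar reward. Every weight and the exact shape of each term are illustrative assumptions, not the paper's definitions.

```python
def step_reward(new_cells, overlap_count, idle_steps, redundant_cells,
                trash_collected, dist_to_trash, model_updated,
                weights=None):
    """Hedged sketch of one agent's per-step reward from the terms above.

    Each line encodes only the stated intent of the corresponding term;
    all weights are illustrative placeholders.
    """
    w = weights or dict(visit=1.0, inactive=0.1, redundant=0.5,
                        collect=10.0, dist=1.0, update=1.0, time=0.01)
    r = 0.0
    r += w["visit"] * new_cells / max(overlap_count, 1)  # shared visit reward
    r -= w["inactive"] * idle_steps                      # inactivity penalty
    r -= w["redundant"] * redundant_cells                # redundancy penalty
    r += w["collect"] * trash_collected                  # trash collection
    r += w["dist"] / (1.0 + dist_to_trash)               # inverse-distance reward
    r += w["update"] * float(model_updated)              # model update bonus
    r -= w["time"]                                       # constant time penalty
    return r
```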
2.5.5. Multi-Task Policy Network
2.6. Reward-Greedy Policy Selection and Pareto Front Construction
- Weighted Sum (WS): The most common approach to multi-objective optimization; it forms a convex combination of the two rewards, $f_{\text{WS}} = w_1 r_1 + w_2 r_2$ with $w_1 + w_2 = 1$. Its main disadvantage is that it cannot find a diverse set of solutions when the Pareto front is non-convex.
- Weighted Power (WP): Each reward is raised to a power before applying the weight, $f_{\text{WP}} = w_1 r_1^p + w_2 r_2^p$. This allows certain objectives to be emphasized non-linearly and can be tuned via the parameter $p$.
- Weighted Product of Powers (WPOP): This multiplicative approach raises each reward to the power of its respective weight and multiplies the results, $f_{\text{WPOP}} = r_1^{w_1} \cdot r_2^{w_2}$. It is useful for requiring that all objectives be high, since a low value in any objective strongly reduces the total scalarized reward.
- Exponential Weighted Criterion (EWC): In response to the inability of the weighted sum method to capture points on non-convex portions of the Pareto-optimal surface, Ref. [32] proposes the exponential weighted criterion, $f_{\text{EWC}} = \sum_i \left(e^{p w_i} - 1\right) e^{p r_i}$. The performance of the method depends on the value of $p$, and a large value of $p$ is usually needed (a sketch of all four scalarizations follows this list).
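For reference, the sketch below implements the four scalarizations for a two-objective reward vector. The exponent 3 for WP and $p = 1$ for EWC match the settings table in Section 3.1, and the EWC expression follows the standard form of Ref. [32]; the function names and the weight sweep are illustrative assumptions.

```python
import numpy as np

def weighted_sum(r, w):
    """WS: convex combination of the reward vector r with weights w."""
    return np.dot(w, r)

def weighted_power(r, w, p=3):
    """WP: each reward raised to the power p before weighting."""
    return np.dot(w, np.power(r, p))

def weighted_product_of_powers(r, w):
    """WPOP: product of the rewards, each raised to its own weight."""
    return np.prod(np.power(r, w))

def exponential_weighted_criterion(r, w, p=1):
    """EWC in the standard form of Athan and Papalambros [32]."""
    return np.sum((np.exp(p * w) - 1.0) * np.exp(p * r))

# Sweep weights along the simplex to trace candidate Pareto points.
r = np.array([0.6, 0.8])  # e.g., (cleaning, exploration) episode returns
for w1 in np.linspace(0.0, 1.0, 11):
    w = np.array([w1, 1.0 - w1])
    print(w1, weighted_sum(r, w), weighted_product_of_powers(r, w))
```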
3. Results
3.1. Simulation Settings
3.2. Metrics and Hyperparameter Sensitivity Analysis
- Percentage of the Map Visited (PMV): It is the exploration objective to be maximized, and it measures the percentage of the map that has been visited during the mission, defined as the proportion of the total navigable area that has been visited at least once by any ASV. A higher PMV value indicates a more comprehensive exploration, ensuring that the agents gather sufficient information about the environment’s structure, obstacles, and trash distribution, which is crucial for the subsequent cleaning phase. Mathematically, PMV is given by $\text{PMV} = \frac{|\mathcal{V}|}{|\mathcal{N}|} \times 100$, where $\mathcal{V}$ is the set of visited cells and $\mathcal{N}$ the set of navigable cells.
- Percentage of Trash Cleaned (PTC): It is the cleaning objective to be maximized, and it measures the proportion of trash that has been effectively eliminated from the environment, relative to the total amount of trash present at the beginning of the episode ($K$). Optimizing PTC requires efficient path planning and coordination among agents to maximize trash collection before exceeding their distance limits. Mathematically, $\text{PTC} = \frac{k_{\text{collected}}}{K} \times 100$, where $k_{\text{collected}}$ is the number of trash items collected.
- Consecutive Spacing: The spacing metric measures the variance of distances between consecutive solutions along the Pareto front, sorted by the first objective. It captures gaps along the front better than nearest-neighbor spacing in 2D. For a sorted Pareto front $\{p_1, \dots, p_n\}$, the consecutive distances are $d_i = \lVert p_{i+1} - p_i \rVert$, $i = 1, \dots, n-1$, and the spacing metric is defined as the standard deviation of these distances, $S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n-1} \left(d_i - \bar{d}\right)^2}$, where $\bar{d} = \frac{1}{n-1} \sum_{i=1}^{n-1} d_i$. A lower $S$ indicates more uniform spacing along the front.
- Zitzler’s Metric ($M_3$): The $M_3$ metric measures the extent of the Pareto front along all objectives. For a Pareto front in $m$ objectives, let $R_i = \max_{p \in PF} p_i - \min_{p \in PF} p_i$ be the range of the front along the $i$-th objective. Then the metric is defined as $M_3 = \sqrt{\sum_{i=1}^{m} R_i}$. A higher $M_3$ indicates a larger extent of the front across all objectives.
- Hypervolume (HV): The hypervolume quantifies the size of the objective space dominated by the Pareto Front (PF) with respect to a reference point $(x_{\text{ref}}, y_{\text{ref}})$. For discrete non-dominated points sorted in ascending order of the first objective (e.g., PTC), the hypervolume can be approximated as the sum of rectangular areas between consecutive PF points, $\text{HV} = \sum_{i=1}^{n} (x_i - x_{i-1})(y_i - y_{\text{ref}})$ with $x_0 = x_{\text{ref}}$, where $x_i$ and $y_i$ denote the objective values of the $i$-th non-dominated solution in the sorted PF. A higher HV indicates a larger and more dominant Pareto region, reflecting both convergence and diversity (Python sketches of these front metrics follow this list).
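The sketch below computes the three front metrics for a 2-D maximization front under the definitions above. The sample front points are taken from the policy table in Section 3.3; the reference point, function names, and NumPy usage are assumptions for illustration.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """2-D hypervolume for maximization: sum of rectangle areas between
    consecutive Pareto points sorted by the first objective (e.g., PTC)."""
    pts = sorted(points)                  # ascending in objective 1
    hv, x_prev = 0.0, ref[0]
    for x, y in pts:                      # y decreases along a max-front
        hv += (x - x_prev) * (y - ref[1])
        x_prev = x
    return hv

def consecutive_spacing(points):
    """Standard deviation of distances between consecutive sorted points."""
    pts = np.array(sorted(points))
    d = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    return float(np.std(d))

def zitzler_m3(points):
    """Extent metric: square root of the summed per-objective ranges."""
    pts = np.array(points)
    return float(np.sqrt(np.sum(pts.max(axis=0) - pts.min(axis=0))))

# Sample (PTC, PMV) points from the policy table in Section 3.3.
front = [(35.2, 90.8), (53.9, 82.7), (57.3, 79.3)]
print(hypervolume_2d(front, ref=(0.0, 0.0)), zitzler_m3(front))
```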
3.3. Scalarization Results
3.3.1. Comparison
3.3.2. Limitations
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| POMG | Partially Observable Markov Game |
| ASV | Autonomous Surface Vehicle |
| ML | Marine Litter |
| DRL | Deep Reinforcement Learning |
| DQL | Deep Q-Learning |
| DQN | Deep Q-Network |
| MADRL | Multi-Agent Deep Reinforcement Learning |
| MDQN | Multi-Agent Deep Q-Network |
| MT-MADRL | Multi-Task Multi-Agent Deep Reinforcement Learning |
| CNN | Convolutional Neural Network |
| C-DQL | Censoring Deep Q-Learning |
| MDP | Markov Decision Process |
| RGPS | Reward Greedy Policy Selection |
| WS | Weighted Sum |
| WP | Weighted Power |
| WPOP | Weighted Product of Powers |
| PTC | Percentage of Trash Cleaned |
| PMV | Percentage of the Map Visited |
| HV | Hypervolume |
| FP2S | Fixed-Phase Pareto Set |
| PDS | Phase Duration Set |
| MOEAs | Multi-Objective Evolutionary Algorithms |
| HRL | Hierarchical Reinforcement Learning |
References
- Le, V.G.; Nguyen, H.L.; Nguyen, M.K.; Lin, C.; Hung, N.T.Q.; Khedulkar, A.P.; Hue, N.K.; Trang, P.T.T.; Mungray, A.K.; Nguyen, D.D. Marine macro-litter sources and ecological impact: A review. Environ. Chem. Lett. 2024, 22, 1257–1273.
- Egger, M.; Booth, A.M.; Bosker, T.; Everaert, G.; Garrard, S.L.; Havas, V.; Huntley, H.S.; Koelmans, A.A.; Kvale, K.; Lebreton, L.; et al. Evaluating the environmental impact of cleaning the North Pacific Garbage Patch. Sci. Rep. 2025, 15, 16736.
- Dunbabin, M.; Grinham, A.; Udy, J. An autonomous surface vehicle for water quality monitoring. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), Sydney, Australia, 2–4 December 2009; pp. 2–4.
- Kamarudin, N.; Mohd Nordin, I.N.A.; Misman, D.; Khamis, N.; Razif, M.; Hanim, F. Development of Water Surface Mobile Garbage Collector Robot. Alinteri J. Agric. Sci. 2021, 36, 534–540.
- Balestrieri, E.; Daponte, P.; De Vito, L.; Lamonaca, F. Sensors and measurements for unmanned systems: An overview. Sensors 2021, 21, 1518.
- Katsouras, G.; Dimitriou, E.; Karavoltsos, S.; Samios, S.; Sakellari, A.; Mentzafou, A.; Tsalas, N.; Scoullos, M. Use of unmanned surface vehicles (USVs) in water chemistry studies. Sensors 2024, 24, 2809.
- Diop, D.S.; Luis, S.Y.; Esteve, M.P.; Marín, S.L.T.; Reina, D.G. Decoupling Patrolling Tasks for Water Quality Monitoring: A Multi-Agent Deep Reinforcement Learning Approach. IEEE Access 2024, 12, 75559–75576.
- Seck, D.; Yanes, S.; Perales, M.; Gutiérrez, D.; Toral, S. Multiobjective Environmental Cleanup with Autonomous Surface Vehicle Fleets Using Multitask Multiagent Deep Reinforcement Learning. Adv. Intell. Syst. 2025, e202500434.
- Barrionuevo, A.M.; Luis, S.Y.; Reina, D.G.; Marín, S.L.T. Optimizing Plastic Waste Collection in Water Bodies Using Heterogeneous Autonomous Surface Vehicles With Deep Reinforcement Learning. IEEE Robot. Autom. Lett. 2025, 10, 4930–4937.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv 2017, arXiv:1710.02298.
- Casado-Pérez, A.; Yanes, S.; Toral, S.L.; Perales-Esteve, M.; Gutiérrez-Reina, D. Variational Autoencoder for the Prediction of Oil Contamination Temporal Evolution in Water Environments. Sensors 2025, 25, 1654.
- Luis, S.Y.; Peralta, F.; Córdoba, A.T.; del Nozal, Á.R.; Marín, S.T.; Reina, D.G. An evolutionary multi-objective path planning of a fleet of ASVs for patrolling water resources. Eng. Appl. Artif. Intell. 2022, 112, 104852.
- Yanes Luis, S.; Shutin, D.; Marchal Gómez, J.; Gutiérrez Reina, D.; Toral Marín, S. Deep Reinforcement Multiagent Learning Framework for Information Gathering with Local Gaussian Processes for Water Monitoring. Adv. Intell. Syst. 2024, 6, 2300850.
- Liu, Q.; Szepesvári, C.; Jin, C. Sample-efficient reinforcement learning of partially observable Markov games. Adv. Neural Inf. Process. Syst. 2022, 35, 18296–18308.
- Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; Basar, T. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Proceedings of Machine Learning Research; Volume 80, pp. 5872–5881.
- Xia, J.; Luo, Y.; Liu, Z.; Zhang, Y.; Shi, H.; Liu, Z. Cooperative multi-target hunting by unmanned surface vehicles based on multi-agent reinforcement learning. Def. Technol. 2023, 29, 80–94.
- Liu, L.T.; Dogan, U.; Hofmann, K. Decoding multitask DQN in the world of Minecraft. In Proceedings of the 13th European Workshop on Reinforcement Learning (EWRL), Barcelona, Spain, 3–4 December 2016.
- Mossalam, H.; Assael, Y.M.; Roijers, D.M.; Whiteson, S. Multi-Objective Deep Reinforcement Learning. arXiv 2016, arXiv:1610.02707.
- Li, K.; Zhang, T.; Wang, R. Deep reinforcement learning for multiobjective optimization. IEEE Trans. Cybern. 2020, 51, 3103–3114.
- Basaklar, T.; Gumussoy, S.; Ogras, U.Y. PD-MORL: Preference-driven multi-objective reinforcement learning algorithm. arXiv 2022, arXiv:2208.07914.
- Corsi, D.; Camponogara, D.; Farinelli, A. Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning. Reinf. Learn. J. 2025, 3, 1106–1123.
- Golovin, D.; Krause, A. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J. Artif. Intell. Res. 2011, 42, 427–486.
- Krause, A.; Golovin, D. Submodular function maximization. Tractability 2014, 3, 3.
- Hansen, E.A.; Bernstein, D.S.; Zilberstein, S. Dynamic programming for partially observable stochastic games. In Proceedings of the AAAI, San Jose, CA, USA, 25–29 July 2004; Volume 4, pp. 709–715.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
- Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; Proceedings of Machine Learning Research; Volume 48, pp. 1995–2003.
- Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Best Papers, São Paulo, Brazil, 8–12 May 2017; Revised Selected Papers 16; Springer: New York, NY, USA, 2017; pp. 66–83.
- Wong, A.; Bäck, T.; Kononova, A.V.; Plaat, A. Deep multiagent reinforcement learning: Challenges and directions. Artif. Intell. Rev. 2023, 56, 5023–5056.
- Marler, R.T.; Arora, J.S. Survey of multi-objective optimization methods for engineering. Struct. Multidiscip. Optim. 2004, 26, 369–395.
- Athan, T.W.; Papalambros, P.Y. A note on weighted criteria methods for compromise solutions in multi-objective optimization. Eng. Optim. 1996, 27, 155–176.
| Category | Parameter | Value |
|---|---|---|
| Scalarization | Weighted Power (WP) exponent | 3 |
| | Exponential Weighted Criterion (p) | 1 |
| Environment | Number of ASVs (N) | 4 |
| | Detection radius | 2 nodes |
| | Distance budget | 200 units |
| | Movement length | 1 node |
| Training | Replay buffer size | |
| | Batch size | 128 |
| | Learning rate | |
| | Target network hard update | 2000 steps |
| | ε-greedy interval | |
| | Threshold | |
| | Discount factor (γ) | |
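For convenience, these settings can be collected into a single configuration object. The sketch below mirrors the table above; the dataclass and field names are illustrative, and values lost in extraction are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    # Environment settings from the table above.
    n_asvs: int = 4
    detection_radius: int = 2            # nodes
    distance_budget: int = 200           # units
    movement_length: int = 1             # node
    # Training settings; None marks values lost in extraction.
    batch_size: int = 128
    target_update_steps: int = 2000
    replay_buffer_size: Optional[int] = None
    learning_rate: Optional[float] = None
    gamma: Optional[float] = None        # discount factor
    # Scalarization settings.
    wp_exponent: int = 3                 # Weighted Power
    ewc_p: int = 1                       # Exponential Weighted Criterion
```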
| Scalarization | Hypervolume | M3 | Spacing | Nº of Points |
|---|---|---|---|---|
| WP | 5400.8738 | 5.0406 | 0.9193 | 21 |
| WS | 5779.8875 | 5.4511 | 1.4368 | 19 |
| WPOP | 5750.2380 | 5.0639 | 0.6876 | 21 |
| EWC | 5706.1609 | 5.2529 | 1.0877 | 23 |
| Combined | 5810.0283 | 5.4977 | 0.7035 | 40 |
| Literature | 5091.7528 | 5.7991 | 2.1422 | 9 |
| Comparison | Cleaning (p-Value) | Exploration (p-Value) |
|---|---|---|
| WP vs. WS | 0.2753 | 0.0955 |
| WP vs. WPOP | 0.0239 | 0.5392 |
| WP vs. EWC | 0.6578 | 0.0022 |
| WS vs. WPOP | 0.2935 | 0.1956 |
| WS vs. EWC | 0.3321 | 0.1956 |
| WPOP vs. EWC | 0.1111 | 0.1193 |
| Policy (Type @ PDS) | Cleaning (%) | Exploration (%) |
|---|---|---|
| Best Exploration @ 1.0 | 35.2 | 90.8 |
| Best Exploration @ 0.9 | 41.3 | 89.6 |
| Best Cleaning @ 0.9 | 41.8 | 87.5 |
| Best Exploration @ 0.8 | 48.0 | 86.6 |
| Best Exploration @ 0.7 | 51.3 | 84.1 |
| Best Cleaning @ 0.7 | 53.4 | 83.4 |
| Final Trained @ 0.7 | 53.9 | 82.7 |
| Best Exploration @ 0.6 | 57.0 | 80.4 |
| Best Cleaning @ 0.6 | 57.3 | 79.3 |
| Scalarization | Hypervolume | M3 | Spacing | Nº of Points |
|---|---|---|---|---|
| EWC | 7483.7189 | 5.1127 | 1.5835 | 12 |
| WPOP | 6873.0391 | 5.5886 | 4.4323 | 8 |
| WS | 7507.0533 | 4.2034 | 2.5124 | 6 |
| WP | 6721.3104 | 3.4942 | 0.7545 | 9 |
| Combined | 7533.9085 | 5.0035 | 1.5274 | 11 |