Risk-Aware Distributional Reinforcement Learning for Safe Path Planning of Surface Sensing Agents
Abstract
1. Introduction
1.1. Motivation and Problem Statement
1.2. Limitations of Existing Methods
1.3. Method and Contributions
- Main contributions.
- We propose an EMA-smoothed prioritized replay (EMA-PER) scheme that sets each transition's priority from the exponential moving average of its TD errors (see the sketch following this list). This reduces sampling noise, preserves diversity, and focuses updates on transitions with high learning value. EMA-PER is integrated directly into Adaptive IQN.
- We develop a hybrid expert that combines A* global routing with RVO local avoidance to produce safe, feasible demonstrations. A* plans waypoints on a grid map, and RVO resolves short-horizon interactions with other surface sensing agents, yielding trajectories that respect collision constraints.
- We extend EMA-PER to jointly schedule sampling from expert demonstrations and self-generated experience. By gradually discounting the early expert bias while preserving informative expert segments, EMA-PER accelerates early learning and improves final robustness.
- We deliver a complete training and execution pipeline that fuses the above components with Adaptive IQN’s risk-sensitive evaluation, producing policies that maintain safety and efficiency across increasing agent counts and environmental complexity.
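The sketch below illustrates the EMA-smoothed priority update from the first contribution; the decay coefficient `ema_decay`, the floor `eps`, and the exponent `alpha` are illustrative values, not the settings used in our experiments.

```python
import numpy as np

def ema_priority(prev_ema_td, new_td_error, ema_decay=0.9, eps=1e-3, alpha=0.6):
    """EMA-smoothed prioritized replay: the priority tracks an exponential
    moving average of the absolute TD error rather than its latest value."""
    ema_td = ema_decay * prev_ema_td + (1.0 - ema_decay) * abs(new_td_error)
    priority = (ema_td + eps) ** alpha   # eps keeps every transition sampleable
    return ema_td, priority
```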
2. Related Work
2.1. Decentralized Multi-Surface Sensing Agent Path Planning
2.2. Distributional RL in Robotics
2.3. Experience Replay
2.4. Hybrid Path Planning
2.5. Learning from Demonstrations and Expert Guidance
3. Methodology
3.1. System Overview
3.2. Kinematic Model
3.3. Ocean Current and Obstacles
3.4. Reinforcement-Learning Formulation
3.4.1. Observation Design
3.4.2. Action Space and Kinematics
3.4.3. Task Objective and Termination
3.4.4. Reward Shaping
3.5. Framework for Distributional RL-Based Decentralized Navigation and Collision Avoidance
3.5.1. Adaptive Implicit Quantile Network
3.5.2. Implicit Quantile Network (IQN)
- Distributional view.
- Quantile parameterisation.
- Quantile features and fusion.
- Learning objective.
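As a concrete illustration of the learning objective, the following is a minimal PyTorch sketch of the quantile-regression Huber loss used by IQN (Dabney et al.); the tensor shapes and the Huber threshold `kappa` are illustrative assumptions, not the exact configuration of our network.

```python
import torch

def quantile_huber_loss(pred_q, target_q, taus, kappa=1.0):
    """Quantile-regression Huber loss for IQN.
    pred_q:   (B, N)  predicted quantile values at sampled fractions taus
    target_q: (B, M)  target quantile values from the target network
    taus:     (B, N)  quantile fractions used for the predictions
    """
    # Pairwise TD errors between every target and predicted quantile: (B, M, N)
    td = target_q.unsqueeze(2) - pred_q.unsqueeze(1)
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric quantile weighting: under/over-estimation penalized by tau.
    weight = (taus.unsqueeze(1) - (td.detach() < 0).float()).abs()
    per_sample = (weight * huber / kappa).sum(dim=2).mean(dim=1)
    return per_sample.mean()
```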
3.5.3. Adaptive Risk Sensitivity
3.6. Prioritized Replay with EMA Smoothing
- Motivation and data structure.
- Initial priority.
- Sampling rule.
- Stabilising updates via EMA.
- Bias correction via importance sampling.
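To make the sampling rule and bias correction concrete, the sketch below draws a prioritized batch from a flat priority array and computes importance-sampling weights. A production buffer would use a sum-tree for O(log N) sampling; the constants `alpha` and `beta` here are illustrative.

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Draw indices with probability proportional to priority**alpha and
    return importance-sampling weights normalized by their maximum.
    priorities: 1-D numpy array of per-transition priorities."""
    rng = rng or np.random.default_rng()
    scaled = priorities ** alpha
    probs = scaled / scaled.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    n = len(priorities)
    weights = (n * probs[idx]) ** (-beta)   # corrects the non-uniform sampling bias
    weights /= weights.max()                # keep update magnitudes bounded
    return idx, weights
```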
3.7. Hybrid Expert Planner (A* and RVO)
3.7.1. Global Path Planning
3.7.2. RVO-Based Local Collision Avoidance
3.7.3. Expert Trajectory Generation
| Algorithm 1 Hybrid Expert (A* global routing + RVO local avoidance)—Trajectory Generator |
| Require: start p_0, goal g, static_map, init_vel, all_agents_states |
| Ensure: expert trajectory T |
| p ← p_0, T ← [p_0], v ← init_vel |
| W ← compute global path (waypoint list) from p_0 to g on static_map using A* search |
| while p has not reached g do |
| if p is within the arrival tolerance of the current waypoint w then advance w to the next waypoint of W |
| end if |
| sense nearby agents within perception range of p from all_agents_states |
| compute collision-free velocity v_new using RVO with current velocity v, preferred velocity v_pref (directed toward w), and nearby agents; |
| p ← p + v_new · Δt; v ← v_new; |
| append p to T; |
| end while |
| return T |
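The following Python sketch mirrors the structure of Algorithm 1. It follows A* waypoints toward the goal and, as a deliberately simplified stand-in for full RVO, applies a repulsive velocity correction near other agents; all names and tolerances (`arrive_tol`, `safe_dist`) are illustrative, not the parameters of our expert.

```python
import numpy as np

def expert_trajectory(start, goal, waypoints, other_agents,
                      speed=1.0, dt=0.1, arrive_tol=0.5, safe_dist=1.0, max_steps=2000):
    """Simplified sketch of Algorithm 1: follow A* waypoints toward the goal,
    with a crude repulsive correction standing in for full RVO avoidance."""
    pos = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    waypoints = [np.asarray(w, dtype=float) for w in waypoints] + [goal]
    traj = [pos.copy()]
    wp = 0
    for _ in range(max_steps):
        if np.linalg.norm(pos - goal) < arrive_tol:
            break                                   # goal reached
        if wp < len(waypoints) - 1 and np.linalg.norm(pos - waypoints[wp]) < arrive_tol:
            wp += 1                                 # advance to the next waypoint
        heading = waypoints[wp] - pos
        v = speed * heading / (np.linalg.norm(heading) + 1e-8)   # preferred velocity
        for q in other_agents:                      # steer away from close agents
            offset = pos - np.asarray(q, dtype=float)
            dist = np.linalg.norm(offset)
            if dist < safe_dist:
                v += speed * offset / (dist + 1e-8)
        pos = pos + dt * v
        traj.append(pos.copy())
    return np.array(traj)
```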
| Algorithm 2 AQP Training with Prioritized Replay (PER) |
| Require: simulator, replay capacity C, batch size B, learn_freq, target_period, total_steps; α (priority exponent), β (IS anneal); ε (exploration) |
| Ensure: trained policy π_θ |
| initialize replay buffer D with capacity C; initialize Adaptive IQN parameters θ; set target θ⁻ ← θ |
| reset simulator; obtain initial states {s_i} |
| (optional seed) insert expert transitions into D with high priority |
| for t = 1, …, total_steps do |
| // Agent Interaction with Environment |
| for each agent i do |
| choose a_i by ε-greedy w.r.t. θ; step simulator to obtain (r_i, s'_i, done_i) |
| push (s_i, a_i, r_i, s'_i, done_i) into D with initial priority abs(δ_0) + ϵ_p (δ_0: initial TD error) |
| end for |
| if t mod learn_freq = 0 and size(D) ≥ B then |
| // Sample Prioritized Batch |
| sample indices {j} from D with P(j) ∝ p_j^α |
| compute IS weights w_j = (N · P(j))^(−β) with N = size(D); normalize w_j ← w_j / max_k w_k; anneal β toward 1 |
| compute distributional targets with θ⁻; evaluate predicted quantiles with θ |
| update θ by minimizing the w-weighted quantile Huber loss |
| refresh priorities in D with EMA-smoothed TD errors |
| if t mod target_period = 0 then θ⁻ ← θ |
| end if |
| end if |
| if episode terminated or timeout then |
| reset simulator; reinitialize {s_i} |
| end if |
| end for |
| return π_θ |
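The schematic below shows how the pieces of Algorithm 2 fit together in code. `env`, `agent`, and `buffer` are placeholder objects with hypothetical interfaces used only for illustration; they are not our actual implementation.

```python
# Schematic outer loop of Algorithm 2. `env`, `agent`, and `buffer` are
# placeholder objects with hypothetical interfaces; this is an illustrative sketch.
def train(env, agent, buffer, total_steps, batch_size=64,
          learn_freq=4, target_period=1000, eps=0.1):
    states = env.reset()
    for t in range(1, total_steps + 1):
        actions = {i: agent.act(s, eps) for i, s in states.items()}   # epsilon-greedy per agent
        next_states, rewards, dones, _ = env.step(actions)
        for i in states:
            # New transitions enter the buffer with a high initial priority.
            buffer.add((states[i], actions[i], rewards[i], next_states[i], dones[i]))
        states = next_states
        if t % learn_freq == 0 and len(buffer) >= batch_size:
            batch, idx, weights = buffer.sample(batch_size)           # prioritized sampling
            td_errors = agent.learn(batch, weights)                   # IS-weighted quantile loss
            buffer.update_priorities(idx, td_errors)                  # EMA-smoothed refresh
        if t % target_period == 0:
            agent.sync_target()                                       # copy theta to the target net
        if all(dones.values()):
            states = env.reset()
    return agent
```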
4. Experiments
4.1. Simulation Setup
4.2. Results Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Xing, B.; Yu, M.; Liu, Z.; Tan, Y.; Sun, Y.; Li, B. A Review of Path Planning for Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2023, 11, 1556. [Google Scholar] [CrossRef]
- Balasubramanyam, A.; Hiremath, S.; Santhosh, G.; Menon, D.; Kumar, D.; Honnavalli, P.B. Indian Pedestrian Behaviour Modelling Using Imitation and Reinforcement Learning. Preprint/Working Paper, 2025. Available online: https://www.preprints.org/frontend/manuscript/1af899c59efbe45e3f1a24834bc985b1/download_pub (accessed on 4 December 2025).
- Liu, C.; Van Kampen, E.J.; De Croon, G.C. Adaptive Risk-Tendency: Nano Drone Navigation in Cluttered Environments with Distributional Reinforcement Learning. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7198–7204. [Google Scholar]
- Fu, M.; Huang, L.; Li, F.; Qu, H.; Xu, C. A Fully Value Distributional Deep Reinforcement Learning Framework for Multi-Agent Cooperation. Neural Netw. 2025, 184, 107035. [Google Scholar] [CrossRef] [PubMed]
- Lin, X.; Huang, Y.; Chen, F.; Englot, B. Decentralized Multi-Robot Navigation for Autonomous Surface Vehicles with Distributional Reinforcement Learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8327–8333. [Google Scholar]
- Yang, C.; Zhao, Y.; Cai, X.; Wei, W.; Feng, X.; Zhou, K. Path Planning Algorithm for Unmanned Surface Vessel Based on Multiobjective Reinforcement Learning. Comput. Intell. Neurosci. 2023, 2023, 2146314. [Google Scholar] [CrossRef]
- Dou, J.; Ouyang, K.; Wu, Z.; Hu, Z.; Lin, J.; Wang, H. From Invariance to Symmetry Breaking in FIM-Aware Cooperative Heterogeneous Agent Networks. Symmetry 2025, 17, 1899. [Google Scholar] [CrossRef]
- Brignoni, L.; Johansen, T.A.; Breivik, M. Environmental Disturbances and Uncertainty in Maritime Autonomous Surface Ships. Ocean. Eng. 2022, 259, 111956. [Google Scholar]
- Xing, B.; Wang, X.; Yang, L.; Liu, Z.; Wu, Q. An Algorithm of Complete Coverage Path Planning for Unmanned Surface Vehicle Based on Reinforcement Learning. J. Mar. Sci. Eng. 2023, 11, 645. [Google Scholar] [CrossRef]
- Sharan, R.; Ayanian, N. Consensus-Based Multi-Robot Navigation in Dynamic Environments. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11418–11424. [Google Scholar]
- Du, Z.; Li, W.; Shi, G. Multi-USV Collaborative Obstacle Avoidance Based on Improved Velocity Obstacle Method. ASCE-ASME J. Risk Uncertain. Eng. Syst. Part A Civ. Eng. 2024, 10, 04023049. [Google Scholar] [CrossRef]
- Liu, Y.; Sun, Z.; Wan, J.; Li, H.; Yang, D.; Li, Y.; Fu, W.; Yu, Z.; Sun, J. Hybrid Path Planning Method for SSA Based on Improved A-Star and DWA. J. Mar. Sci. Eng. 2025, 13, 934. [Google Scholar] [CrossRef]
- Zhang, J.; Ren, J.; Cui, Y.; Fu, D.; Cong, J. Multi-SSA Task Planning Method Based on Improved Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 18549–18567. [Google Scholar] [CrossRef]
- Dabney, W.; Ostrovski, G.; Silver, D.; Munos, R. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1096–1105. [Google Scholar]
- Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5064–5078. [Google Scholar] [CrossRef]
- Van Den Berg, J.; Lin, M.; Manocha, D. Reciprocal Velocity Obstacles for Real-Time Multi-Agent Navigation. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation (ICRA), Pasadena, CA, USA, 19–23 May 2008; pp. 1928–1935. [Google Scholar]
- Chen, S.; Feng, L.; Bao, X.; Jiang, Z.; Xing, B.; Xu, J. An Optimal-Path-Planning Method for Unmanned Surface Vehicles Based on a Novel Group Intelligence Algorithm. J. Mar. Sci. Eng. 2024, 12, 477. [Google Scholar] [CrossRef]
- Liang, J.; Miao, H.; Li, K.; Tan, J.; Wang, X.; Luo, R.; Jiang, Y. A Review of Multi-Agent Reinforcement Learning Algorithms. Electronics 2025, 14, 820. [Google Scholar] [CrossRef]
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 24611–24624. [Google Scholar]
- Wang, H.; Lou, S.; Jing, J.; Wang, Y.; Liu, W.; Liu, T. The EBS-A* Algorithm: An Improved A* Algorithm for Path Planning. PLoS ONE 2022, 17, e0263841. [Google Scholar] [CrossRef]
- Xu, Z.; Xia, Y.; Bai, H.; Song, S.; Wang, X. Vision-Based Deep Reinforcement Learning of Unmanned Aerial Vehicle (UAV) Autonomous Navigation Using Privileged Information. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3–17. [Google Scholar]
- Wang, Y.; Du, Y.; Xu, J.; Han, X.; Chen, Z. Real-Time Local Path Planning Strategy Based on Deep Distributional Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24562–24574. [Google Scholar]
- Pham, T.; Lee, G.H.; Chung, W. Autonomous Navigation for Mobile Robots Using Deep Reinforcement Learning in Partially Observable Environments. IEEE Access 2021, 9, 155297–155311. [Google Scholar]
- Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
- Kabir, R.; Watanobe, Y.; Islam, M.R.; Naruse, K. Enhanced Robot Motion Block of A-Star Algorithm for Robotic Path Planning. Sensors 2024, 24, 1422. [Google Scholar] [CrossRef]
- Chauvin, T.; Gianazza, D.; Durand, N. ORCA-A: A Hybrid Reciprocal Collision Avoidance and Route Planning Algorithm for UAS in Dense Urban Areas. In Proceedings of the SESAR Innovation Days 2024, Rome, Italy, 12–15 November 2024. [Google Scholar]
- Shi, Z.; Wang, K.; Zhang, J.J. Improved Reinforcement Learning Path Planning Algorithm Integrating Prior Knowledge. PLoS ONE 2023, 18, e0284942. [Google Scholar] [CrossRef] [PubMed]
- Lipowska, D.; Lipowski, A.; Ferreira, A.L. Homonyms and Context in Signalling Game with Reinforcement Learning. PLoS ONE 2025, 20, e0322743. [Google Scholar] [CrossRef] [PubMed]
- Chen, X.; Liu, Y.; Everett, M.; How, J.P. Decentralized Non-Communicating Multiagent Collision Avoidance with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 285–292. [Google Scholar]
- Liu, Y.; Wang, C.; Zhao, C.; Wu, H.; Wei, Y. A Soft Actor-Critic Deep Reinforcement-Learning-Based Robot Navigation Method Using LiDAR. Remote Sens. 2024, 16, 2072. [Google Scholar] [CrossRef]
- Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]






| Scene | # SSAs (Set 1) | # Vortices (Set 1) | # Obstacles (Set 1) | # SSAs (Set 2) | # Vortices (Set 2) | # Obstacles (Set 2) |
|---|---|---|---|---|---|---|
| Scene 1 | 3 | 4 | 0 | 3 | 4 | 4 |
| Scene 2 | 4 | 5 | 0 | 4 | 5 | 5 |
| Scene 3 | 5 | 6 | 0 | 5 | 6 | 6 |
| Scene 4 | 6 | 7 | 0 | 6 | 7 | 7 |
| Scene 5 | 7 | 8 | 0 | 7 | 8 | 8 |

Set 1: no static obstacles; Set 2: with static obstacles.
| Algorithm | Average Latency (ms) | Real-Time Capable |
|---|---|---|
| APF | 0.2 | Yes |
| DQN | 1.1 | Yes |
| IQN | 1.7 | Yes |
| Adaptive-IQN | 1.9 | Yes |
| Proposed | 2.6 | Yes |

