Electronics
  • Article
  • Open Access

8 December 2025

Risk-Aware Distributional Reinforcement Learning for Safe Path Planning of Surface Sensing Agents

1 Dalian Naval Academy, Dalian 116018, China
2 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue The Collaborative Perception, Localization, and Decision Control of Intelligent Vehicles

Abstract

In spatially constrained water domains, surface sensing agents (SSAs) must achieve safe path planning under uncertain currents and sensor noise. We present a decentralized motion planning and collision-avoidance framework based on distributional reinforcement learning (DRL) that models the full return distribution to enable risk-aware decision making. Each surface sensing agent proceeds autonomously to its designated coordinates without rigid spatial constraints, coordinating implicitly through learned policies and a lightweight safety shield that enforces separation and kinematic limits. The method integrates (i) distributional value estimation for controllable risk sensitivity near hazards, (ii) domain randomization of sea states and disturbances for robustness, and (iii) a shielded action layer compatible with standard reactive rules (e.g., velocity obstacle-style constraints) to guarantee feasible maneuvers. In simulations across cluttered maps and stochastic current fields, the proposed approach improves success rates and reduces near-miss events compared with non-distributional RL and classical planners, while maintaining competitive path length and computation time. The results indicate that DRL-based surface sensing agent navigation is a practical path toward safe, efficient environmental monitoring and surveying.

1. Introduction

1.1. Motivation and Problem Statement

In recent years, surface sensing agents have emerged as a novel solution for civilian marine observation. Capable of unattended operation, they facilitate long-term continuous monitoring of critical parameters such as temperature, salinity, and biological characteristics. These agents offer significant advantages, including low operational costs and broad spatiotemporal coverage [1]. In complex, dynamic marine environments, enabling a decentralized surface sensing agent network to navigate safely and efficiently is both scientifically valuable and operationally critical [2,3,4]. The operational workspace is intrinsically uncertain and time-varying: multi-scale currents, waves, wind, and sporadic events generate nonstationary conditions [5,6,7]; and marine traffic introduces numerous dynamic obstacles (e.g., vessels or other surface sensing agents), while topography, submerged formations, and floating markers represent static obstacles. These factors form a multi-source, partially observed decision problem that is difficult to model precisely [8].
For a surface sensing agent network, mission success requires joint task allocation and motion coordination so that each agent reaches its goal without collisions, while maintaining system-level efficiency [9]. Each surface sensing agent must act using only sensing within a limited horizon, under unknown environmental disturbances and interactions with nearby agents. Under such decentralized, partial-information conditions, purely global plans or rigid rule sets are brittle; moreover, policies must balance safety, efficiency, and robustness in real time [10].
The core difficulty lies in coupling local perception with cooperative navigation under flow-driven dynamics. First, sensing is partial and can be noisy, leading to incomplete or delayed information. Second, the environment is nonstationary: background currents evolve and other intelligent agents adapt, making fixed priors insufficient. Third, long-horizon routing subject to collision-avoidance and safety margins is computationally demanding. Consequently, decentralized policies must learn to avoid both dynamic and static obstacles while ensuring goal reachability [5].
This work targets decentralized navigation for surface sensing agent networks. Each surface sensing agent plans and acts autonomously from local observations, avoiding dynamic obstacles (surrounding moving objects) and static obstacles (e.g., geological features), and reaching its assigned goal without centralized coordination while maintaining safety margins in cluttered flow fields. Additional uncertainty from regional ocean circulation (e.g., rapid water movements) and from navigation errors further increases problem complexity [11].

1.2. Limitations of Existing Methods

Classical route-planning and navigation (e.g., APF, A*, Dijkstra, optimal control/MPC) perform well when sensing is complete and environments are static or piecewise static [12]. In realistic marine settings, however, dynamics are strongly time varying and stochastic, making accurate modeling and reliable long-horizon planning difficult, especially under decentralized, multi-agent coordination [1,8]. In cooperative scenarios, agents must negotiate crossings [13,14], maintain separation, and communicate under partial observability; rule-based APF-style behaviors are brittle and often fail to generalize when traffic density or flow patterns change [9,10]. These approaches typically assume precise maps and disturbance models; model errors, unmodeled interactions, and unforeseen events (e.g., uncertain obstacle motion or incidents) degrade performance and can lead to unsafe or overly conservative plans. More critically, they lack mechanisms to continuously adapt from streaming experience, so performance degrades as tasks, flow regimes, or traffic patterns shift [8].
DRL offers a data-driven alternative that can cope with uncertainty, rich interactions, and partial observability by learning policies directly from interaction. MARL with CTDE further supports cooperative behaviors such as collision avoidance, transit passage selection, and task-aware routing. Nevertheless, standard DRL brings its own challenges: sample inefficiency and unstable updates, safety during exploration, and limited generalization across scenes without careful replay, risk handling, and inductive biases. These limitations motivate hybrid solutions that combine reliable priors with adaptive learning to achieve both safety and efficiency in dynamic marine environments.
To address high-risk decision contexts, distributional RL models the full return distribution rather than only its expectation [15]. This enables policies that explicitly reason about variability and tail events, supporting risk-aware decision making and improved safety. Tail measures such as CVaR can be used to modulate risk sensitivity online, complementing safe RL by reducing catastrophic outcomes while preserving performance. In marine navigation, distributional RL is therefore a natural fit: it captures flow and interaction uncertainty, allows principled risk calibration, and provides a basis for reliable autonomous operation under partial observability [16].
Lin et al. [5,17] introduced an Adaptive Implicit Quantile Network (Adaptive IQN) for decentralized multi-agent surface navigation. By learning return quantiles, the method optimized a risk-aware policy that improved robustness without centralized coordination. The work provided an important step in bringing distributional RL to aquatic multi-agent systems and empirically demonstrated the potential for safer, more stable behavior. At the same time, challenges remained in sample efficiency, stability under dense traffic, and scalability to highly cluttered scenes—gaps that motivate the enhancements proposed in this paper.

1.3. Method and Contributions

Decentralized navigation for surface sensing agent networks is challenging due to partial observability, nonstationary flow, and tightly coupled multi-agent interactions. We build on distributional RL—specifically, Adaptive IQN—and introduce a replay mechanism that emphasizes informative experience. The core idea is to couple learning with a reliable expert that provides safe guidance in cluttered scenes: an A* global router delivers waypoints, while RVO performs local, dynamic collision avoidance. The hybrid expert generates high-quality demonstrations that are blended with on-policy data through an EMA-smoothed prioritized replay (EMA-PER), enabling Adaptive IQN to learn risk-aware policies more efficiently and stably.
  • Main contributions.
    • We propose an EMA-smoothed prioritized replay scheme that moves priority toward the exponential moving average of TD errors. This reduces sampling noise, preserves diversity, and focuses updates on transitions with high learning value. EMA-PER is integrated directly into Adaptive IQN.
    • We develop a mixed expert that combines A* global routing with RVO local avoidance to produce safe, feasible demonstrations. A* plans waypoints under a grid map, and RVO resolves short-horizon interactions with other surface sensing agents, yielding trajectories that respect collision constraints.
    • We extend EMA-PER to jointly schedule sampling from expert demonstrations and self-generated experience. By gradually discounting early expert bias while preserving informative expert segments, the resulting scheme (MEA-PER) accelerates early learning and improves final robustness.
    • We deliver a complete training and execution pipeline that fuses the above components with Adaptive IQN’s risk-sensitive evaluation, producing policies that maintain safety and efficiency across increasing agent counts and environmental complexity.
The approach exploits complementary strengths: classical planning (A*, RVO) contributes feasibility and safety priors, while distributional RL adapts to uncertain, time-varying flows. By guiding learning toward reliable and informative transitions (EMA-PER/MEA-PER), the framework reduces unsafe exploration, improves sample efficiency, and yields higher-quality policies than either RL-only or planner-only baselines.

3. Methodology

3.1. System Overview

We study a group of N surface sensing agents operating in a 2D operational workspace to perform route planning and navigation. Agent i starts from $p_{\text{start}}^{i}$ and aims to reach $p_{\text{goal}}^{i}$. Control is decentralized: each agent selects actions using only local observations. No fixed structure is needed; agents travel independently toward their goals. The environment includes background currents and obstacles, with geological features modeled as static obstacles and other agents acting as dynamic obstacles.

3.2. Kinematic Model

For tractability, each agent is modeled as a point with pose ( x , y , θ ) , where ( x , y ) are global Cartesian coordinates and θ is the heading. The agent velocity is decomposed into an ambient current and self-generated motion:
$\dfrac{dX(t)}{dt} = V_{\text{cur}}\big(X(t)\big) + V_{\text{ctrl}}(t).$
Here V cur is the Eulerian current field evaluated at the agent location and V ctrl is the commanded translational velocity produced by the navigation policy and steering constraints.

3.3. Ocean Current and Obstacles

The flow field is represented by a Rankine vortex, which comprises a solid-body core and an irrotational exterior. In the core ($r \le a$), the fluid rotates with constant angular velocity $\Omega = \Gamma/(2\pi a^{2})$, yielding a tangential speed
$v_\theta(r) = \Omega r = \dfrac{\Gamma}{2\pi a^{2}}\, r.$
Outside the core ($r > a$), the flow is irrotational with
$v_\theta(r) = \dfrac{\Gamma}{2\pi r}.$
Equivalently,
$v_\theta(r) = \dfrac{\Gamma}{2\pi}\begin{cases} r/a^{2}, & r \le a, \\ 1/r, & r > a. \end{cases}$
Static obstacles (geological features) are represented as impenetrable regions; other agents are treated as dynamic obstacles in the planning layer.
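To make the model concrete, the sketch below evaluates the Rankine tangential velocity of Section 3.3 and advances the kinematic relation of Section 3.2 by one explicit-Euler step. It is a minimal illustration under our own naming and a counter-clockwise single-vortex assumption, not the simulator's implementation.

```python
import numpy as np

def rankine_velocity(pos, center, gamma, a):
    """Tangential velocity of a Rankine vortex at position `pos`.

    Inside the core (r <= a) the fluid rotates as a solid body,
    v_theta = Gamma/(2*pi*a^2) * r; outside it decays as Gamma/(2*pi*r).
    The counter-clockwise sign convention is an assumption.
    """
    d = np.asarray(pos, float) - np.asarray(center, float)
    r = float(np.linalg.norm(d))
    if r < 1e-9:
        return np.zeros(2)
    speed = gamma / (2.0 * np.pi) * (r / a**2 if r <= a else 1.0 / r)
    tangent = np.array([-d[1], d[0]]) / r  # unit vector perpendicular to the radial direction
    return speed * tangent

def euler_step(pos, v_ctrl, center, gamma, a, dt):
    """One explicit-Euler step of dX/dt = V_cur(X) + V_ctrl."""
    return np.asarray(pos, float) + dt * (rankine_velocity(pos, center, gamma, a)
                                          + np.asarray(v_ctrl, float))

# Example: drift of an agent commanding 0.5 m/s eastward near a vortex with core radius 0.5 m.
p = np.array([2.0, 0.0])
for _ in range(10):
    p = euler_step(p, [0.5, 0.0], center=[0.0, 0.0], gamma=2.0, a=0.5, dt=0.1)
```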

3.4. Reinforcement-Learning Formulation

We cast decentralized navigation as a decentralized partially observable Markov decision process. Each surface sensing agent learns a mapping from its local observation to a continuous action that maximizes the discounted return.

3.4.1. Observation Design

At step t, the observation concatenates three parts:
$s_{\text{self}}^{t} = [\, p_{x}^{\text{goal}},\; p_{y}^{\text{goal}},\; v_{x},\; v_{y} \,],$
$s_{\text{static}}^{t} = [\, o_{1}^{r}, \ldots, o_{m}^{r} \,], \qquad o_{i}^{r} = [\, p_{i}^{sx},\; p_{i}^{sy},\; r_{i}^{s} \,],$
$s_{\text{dynamic}}^{t} = [\, o_{1}^{d}, \ldots, o_{n}^{d} \,], \qquad o_{i}^{d} = [\, p_{i}^{dx},\; p_{i}^{dy},\; v_{i}^{dx},\; v_{i}^{dy} \,].$
Candidates are sorted by distance to the ego agent; if fewer than m or n are detected, zero padding keeps a fixed input size.
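A minimal sketch of this vectorization, under our own naming, is given below; expressing the goal relative to the ego position is an assumption, since the text only specifies the ordering and zero-padding behavior.

```python
import numpy as np

def build_observation(ego_pos, ego_vel, goal, statics, dynamics, m, n):
    """Assemble the fixed-size observation described above.

    `statics` is a list of (x, y, radius); `dynamics` a list of (x, y, vx, vy)
    for perceived moving agents. Candidates are sorted by distance to the ego
    agent and zero-padded to m / n slots.
    """
    ego_pos, goal = np.asarray(ego_pos, float), np.asarray(goal, float)
    s_self = np.concatenate([goal - ego_pos, np.asarray(ego_vel, float)])

    def top_k(items, k, width):
        items = sorted(items, key=lambda o: np.linalg.norm(np.array(o[:2]) - ego_pos))[:k]
        block = np.zeros((k, width))
        for i, o in enumerate(items):
            block[i] = o
        return block.ravel()

    s_static = top_k(statics, m, 3)    # (x, y, radius) per static obstacle
    s_dynamic = top_k(dynamics, n, 4)  # (x, y, vx, vy) per moving agent
    return np.concatenate([s_self, s_static, s_dynamic])
```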

3.4.2. Action Space and Kinematics

The action is $\mathbf{a}_t = (a_t, \omega_t)$, where $a_t$ is the commanded tangential acceleration and $\omega_t$ is the commanded turn rate. The forward speed is constrained to $[0, v_{\max}]$, and $(a_t, \omega_t)$ respect platform limits.

3.4.3. Task Objective and Termination

An episode terminates upon (i) goal arrival with $\| p_{\text{goal}} - p_t \| < d_{\text{arrive}}$, (ii) collision with any obstacle or another agent when the separation distance falls below $d_{\text{collide}}$, or (iii) exhaustion of the step budget.

3.4.4. Reward Shaping

Our surface sensing agent framework is illustrated in Figure 1. The objective is to learn a policy π ( s ) that minimizes travel time while avoiding collisions with static obstacles and other agents. A collision is declared when the Euclidean distance between a surface sensing agent and any static obstacle or another surface sensing agent is less than d collision . Arrival is declared when the Euclidean distance to the goal is less than d arrive .
Figure 1. Decision framework for a surface sensing agent. Local inputs comprise the ego state, the nearest static obstacles, and nearby moving obstacles. An interaction module encodes agent–environment relations and feeds a distributional value head that predicts return distributions. A risk-sensitivity setting selects the evaluation quantile, and the action with the maximal selected value is executed ( arg max ). The procedure is decentralized and runs independently on each agent.
Action. At each time step the surface sensing agent executes $\mathbf{a}_t = (a_t, \omega_t)$, where $a_t$ is the rate of change of the magnitude of $V_S(t)$ and $\omega_t$ is the rate of change of its heading. The policy is written as $\pi : s \mapsto (a, \omega)$, and the forward speed is constrained to $[0, v_{\max}]$.
Reward. The reward at time t is computed as
$r_t = r_{\text{step}} + r_{\text{forward},t} + \mathrm{collision}(s_t)\cdot r_{\text{collision}} + \mathrm{reach\_goal}(s_t)\cdot r_{\text{goal}},$
where $r_{\text{step}}$ is a fixed step penalty with value $-1.0$ (a negative reward that encourages the agent to reach the goal quickly so as to reduce cumulative penalties); $r_{\text{forward},t}$ is a forward-progress reward computed as $\| p_{\text{goal}} - p_{t-1} \| - \| p_{\text{goal}} - p_t \|$, i.e., the reduction in goal distance between two consecutive time steps; $\mathrm{collision}(s_t)$ is an indicator that equals 1 if a collision occurs at state $s_t$ and 0 otherwise, with $r_{\text{collision}} = -50.0$; and $\mathrm{reach\_goal}(s_t)$ is an indicator that equals 1 if the goal is reached at state $s_t$ and 0 otherwise, with $r_{\text{goal}} = 100.0$.
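The sketch below restates the reward in code form. The sign conventions follow the text; the function and variable names are illustrative.

```python
import numpy as np

R_STEP, R_COLLISION, R_GOAL = -1.0, -50.0, 100.0  # weights from the text

def step_reward(p_prev, p_curr, goal, collided, reached):
    """Step penalty + forward progress + terminal terms.

    Forward progress is the decrease in goal distance between consecutive
    steps; `collided` / `reached` play the role of the indicator functions.
    """
    p_prev, p_curr, goal = (np.asarray(v, float) for v in (p_prev, p_curr, goal))
    progress = np.linalg.norm(goal - p_prev) - np.linalg.norm(goal - p_curr)
    return R_STEP + progress + float(collided) * R_COLLISION + float(reached) * R_GOAL
```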
Each agent optimizes its own objective, yet coherent group behaviour emerges from local policies and their interactions. System success under currents and disturbances depends on decentralized coordination rather than a central dispatcher. We use an A*–RVO expert as a purely decentralized heuristic to shape interactions during training. If every agent maintains safety and efficiency locally, group navigation succeeds as an emergent property of the ensemble.
To align with sparse sensing and the fixed-size requirements of neural inputs, we vectorize perceived context by sorting candidates by distance and retaining at most m static obstacles and n moving agents; zero padding yields a consistent input. This keeps dimensionality low and prioritizes safety-critical context while reducing parameter count and training variance. The design, however, imposes a perceptual horizon: information beyond the top- m / n may be omitted. In practice this trade-off between capability and computability is mitigated by (i) distance-based ordering, which biases inputs toward imminent hazards; and (ii) conservative fallback logic (A*–RVO).
The reward weights adopted in this paper, $r_{\text{step}} = -1.0$, $r_{\text{collision}} = -50.0$, and $r_{\text{goal}} = 100.0$, were determined by combining domain knowledge of multi-agent surface navigation with iterative validation experiments. The collision penalty $r_{\text{collision}} = -50.0$ is calibrated to avoid the collision rate exceeding 15% observed in dense scenarios with $r_{\text{collision}} = -20.0$, and to prevent the overly conservative behavior (roughly a 30% increase in travel time) observed with $r_{\text{collision}} = -80.0$. The goal reward $r_{\text{goal}} = 100.0$ is twice the maximum cumulative step penalty of a typical successful episode, strongly incentivizing task completion while preserving safety. The step penalty $r_{\text{step}} = -1.0$ balances navigation efficiency and exploration with a moderate negative weight, avoiding both rushing toward the goal while ignoring obstacles and aimless wandering.
The impact of the weights was quantified over 30 independent training runs with different random seeds: increasing the magnitude of $r_{\text{collision}}$ accelerates convergence of collision-avoidance behavior but tends to induce conservatism; the agent's motivation toward the goal weakens when $r_{\text{goal}} < 80.0$, with the stagnation rate rising to 12%, while the collision rate increases by 8–10% when $r_{\text{goal}} > 120.0$; and $r_{\text{step}} < -1.0$ shortens training episodes but impairs path exploration in complex environments, whereas a non-negative step weight doubles the average travel time. The selected weights achieve the best balance, yielding a 98.2% success rate and a 0.3% collision rate with near-minimal travel time and energy consumption.

3.5. Framework for Distributional RL-Based Decentralized Navigation and Collision Avoidance

3.5.1. Adaptive Implicit Quantile Network

The core policy module is an Adaptive Implicit Quantile Network (Adaptive IQN), a distributional reinforcement-learning architecture that models the full return distribution for each action. Risk sensitivity is controlled by selecting evaluation quantiles (including CVaR-oriented thresholds), enabling explicit trade-offs between efficiency and caution under environmental uncertainty. The network embeds observations with cosine features and fully connected layers, applies rectified-linear activations, and—given an observation and a chosen quantile—outputs action–return distributions, as summarized in Figure 2.
Figure 2. Architecture of the Adaptive Quantile Planner (AQP). The policy uses an Adaptive IQN head to predict action–return distributions; risk sensitivity is applied via quantile/CVaR selection. The symbol * denotes the multiplication of the sampled quantiles by the CVaR threshold.

3.5.2. Implicit Quantile Network (IQN)

  • Distributional view.
Instead of modeling only the expected return, distributional reinforcement learning treats the return as a random variable [29,31,32]. Let Z π ( s , a ) denote the return under policy π . Its one-step relation is
$Z^{\pi}(s, a) \overset{D}{=} R(s, a) + \gamma\, Z^{\pi}(s', a'), \qquad a' \sim \pi(\cdot \mid s'),$
where $\overset{D}{=}$ indicates equality in distribution, and $R(s, a)$ is the immediate-reward distribution.
  • Quantile parameterisation.
IQN represents the return distribution through its continuous quantile function,
$Z_{\theta}(s, a; \tau) \approx F^{-1}_{Z^{\pi}(\cdot \mid s, a)}(\tau), \qquad \tau \in (0, 1),$
with three components: (i) an observation encoder ψ ( s ) that maps the raw input (ego state, static obstacles, moving obstacles) to a fixed-dimensional embedding via fully connected layers; (ii) a quantile embedding ϕ ( τ ) ; and (iii) a head that combines ψ ( s ) and ϕ ( τ ) to predict action–quantile values.
  • Quantile features and fusion.
Risk preference is injected by a monotone transform $f(\tau; \varphi)$ applied to samples $\tau \sim U(0, 1)$, where $\varphi \in (0, 1]$ can be interpreted as a CVaR-style threshold. The transformed quantile is embedded with a cosine basis followed by a linear–ReLU projection:
$\phi(\tau) = \mathrm{ReLU}\Big( \mathrm{Linear}\big( \big[ \cos\big(\pi \cdot 0 \cdot f(\tau; \varphi)\big),\; \cos\big(\pi \cdot 1 \cdot f(\tau; \varphi)\big),\; \ldots,\; \cos\big(\pi \cdot 63 \cdot f(\tau; \varphi)\big) \big] \big) \Big).$
We fuse state and quantile features by a Hadamard product and a small multilayer head:
$q_{\theta}(s, a, \tau) = \mathrm{FC}\big( \psi(s) \odot \phi(\tau) \big),$
which yields the predicted quantile value for action a at level τ .
  • Learning objective.
Training minimises the quantile Huber loss using $N = N' = 8$ quantile samples $\tau_i, \tau'_j \sim U(0, 1)$ and parameter $\kappa = 1.0$:
$\mathcal{L}(\theta) = \dfrac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^{\kappa}_{\tau_i}\big( \delta_{\tau_i, \tau'_j} \big), \qquad \delta_{\tau_i, \tau'_j} = r_t + \gamma\, Z^{\tau'_j}_{\theta^-}\big( s_{t+1}, \pi(s_{t+1}) \big) - Z^{\tau_i}_{\theta}\big( s_t, a_t \big),$
where $\theta^-$ denotes a target network. This objective aligns predicted quantiles with the distributional Bellman target while retaining robustness to outliers.
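The two ingredients just described, the cosine quantile embedding and the quantile Huber objective, can be written compactly in NumPy as sketched below. The normalization over quantile samples and all names are our own conventions, not the paper's network code.

```python
import numpy as np

def quantile_embedding(taus, phi_risk, weight, bias, n_basis=64):
    """Cosine quantile embedding: cos(pi * i * f(tau; phi)) -> linear -> ReLU.

    `weight` (n_basis x d) and `bias` (d,) are the linear-projection parameters;
    the simple scaling f(tau; phi) = phi * tau matches the adaptive CVaR mechanism.
    """
    t = phi_risk * np.asarray(taus, float)[:, None]            # (K, 1)
    basis = np.cos(np.pi * np.arange(n_basis)[None, :] * t)    # (K, n_basis)
    return np.maximum(basis @ weight + bias, 0.0)              # ReLU

def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Quantile Huber loss between predicted quantiles (shape (N,)) and
    distributional targets (shape (N',)), averaged over the outer samples."""
    delta = target[None, :] - pred[:, None]                    # (N, N'), delta_ij
    abs_d = np.abs(delta)
    huber = np.where(abs_d <= kappa, 0.5 * delta**2, kappa * (abs_d - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (delta < 0.0).astype(float))  # |tau_i - 1{delta<0}|
    return float((weight * huber / kappa).sum(axis=1).mean())

# Toy usage with random parameters and samples.
rng = np.random.default_rng(0)
taus = rng.uniform(size=8)
emb = quantile_embedding(taus, phi_risk=1.0, weight=rng.normal(size=(64, 128)), bias=np.zeros(128))
loss = quantile_huber_loss(pred=rng.normal(size=8), target=rng.normal(size=8), taus=taus)
```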

3.5.3. Adaptive Risk Sensitivity

We use an adaptive risk mechanism guided by CVaR. Operationally, we bias evaluation toward the lower tail by shrinking the sampled quantile:
$\tau' = \phi\, \tau.$
The scaling $\phi$ depends on proximity to obstacles (static or moving). Let $X$ be the agent position and $\{X_o\}$ the obstacle set. With the clearance
$d_{\min}\big(X, \{X_o\}\big) = \min_{o} \| X - X_o \|,$
and a distance threshold $d_0 > 0$, we set
$\phi = \begin{cases} d_{\min}\big(X, \{X_o\}\big)/d_0, & \text{if } d_{\min}\big(X, \{X_o\}\big) \le d_0, \\ 1.0, & \text{if } d_{\min}\big(X, \{X_o\}\big) > d_0. \end{cases}$
Far from obstacles ($> d_0$) we keep $\phi = 1$ to favour efficiency; close to obstacles, $\phi$ decreases smoothly toward zero, shifting evaluation to more conservative quantiles and promoting avoidance without altering the underlying policy class.
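A direct transcription of this proximity rule is given below; the function name and argument layout are ours.

```python
import numpy as np

def risk_scale(agent_pos, obstacle_positions, d0):
    """Proximity-dependent CVaR scaling phi.

    Clearance above d0 keeps phi = 1 (risk-neutral); below d0, phi shrinks
    linearly toward 0 so that tau' = phi * tau concentrates evaluation on
    the lower tail of the return distribution.
    """
    if len(obstacle_positions) == 0:
        return 1.0
    clearance = min(np.linalg.norm(np.asarray(agent_pos, float) - np.asarray(o, float))
                    for o in obstacle_positions)
    return min(clearance / d0, 1.0)

# tau_prime = risk_scale(x, obstacles, d0) * tau is then fed to the quantile embedding.
```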

3.6. Prioritized Replay with EMA Smoothing

  • Motivation and data structure.
Rather than drawing minibatches uniformly, we bias sampling toward transitions that appear most informative for policy improvement [2,28,31]. Transitions ( s t , a t , r t , s t + 1 ) are stored in a replay memory D together with a nonnegative priority value. To support fast updates and priority-proportional draws, we maintain a binary accumulator tree (often called a “sum tree”): leaf nodes hold individual priorities; internal nodes cache partial sums; and the root stores the total mass, enabling O ( log n ) inserts and queries.
  • Initial priority.
When a new transition arrives, its priority is seeded from the current temporal-difference (TD) error and a small floor ε > 0 to keep every item eligible:
$p_t = | \delta_t | + \varepsilon.$
  • Sampling rule.
Let $\alpha \in [0, 1]$ tune the strength of prioritisation. The probability of drawing index $t$ is
$P(t) = \dfrac{p_t^{\alpha}}{\sum_{k} p_k^{\alpha}},$
which reduces to uniform sampling when α = 0.
  • Stabilising updates via EMA.
After a sampled item is replayed, its priority is refreshed using an exponential moving average (EMA) of TD errors. This limits abrupt changes caused by noisy single-step estimates:
$p_{\text{new}} = \gamma\, p_{\text{old}} + (1 - \gamma)\big( | \delta_{\text{new}} | + \varepsilon \big),$
with smoothing factor γ ( 0 , 1 ) . Large γ yields slower, steadier adaptation that leverages historical signals; small γ reacts quickly to newly informative samples. This EMA variant is the only modification to otherwise standard prioritized replay and is used throughout our experiments.
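A compact sketch of the priority seeding and EMA refresh rules follows. The class name is illustrative, and a flat array stands in for the sum tree described above.

```python
import numpy as np

class EmaPriorities:
    """Minimal priority store illustrating the EMA refresh rule."""

    def __init__(self, capacity, eps=1e-3, ema=0.5):
        self.p = np.zeros(capacity)   # per-transition priorities
        self.eps, self.ema = eps, ema

    def seed(self, idx, td_error):
        # Initial priority: p_t = |delta_t| + eps
        self.p[idx] = abs(td_error) + self.eps

    def refresh(self, idx, td_error_new):
        # EMA update: p_new = gamma * p_old + (1 - gamma) * (|delta_new| + eps)
        self.p[idx] = self.ema * self.p[idx] + (1.0 - self.ema) * (abs(td_error_new) + self.eps)
```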

Bias Correction via Importance Sampling

Nonuniform replay introduces sampling bias. We correct it with importance-sampling (IS) weights computed from the sampling probability $P(t)$ and buffer size $N_{\text{buf}}$:
$w_t = \left( \dfrac{1}{N_{\text{buf}}\, P(t)} \right)^{\beta},$
where $\beta \in [0, 1]$ controls the strength of the correction. We anneal $\beta$ from an initial value $\beta_0$ toward 1 during training so that early updates are less noisy and later updates become unbiased.
For numerical stability, the weights in a minibatch can be normalized by their maximum, $\tilde{w}_t = w_t / \max_{t' \in B} w_{t'}$. The PER loss uses these weights to scale the base loss $\mathcal{L}(\theta)$ (e.g., the quantile Huber objective):
$\mathcal{L}_{\text{PER}}(\theta) = \dfrac{1}{|B|} \sum_{t \in B} \tilde{w}_t\, \mathcal{L}_t(\theta).$
This restores the correct gradient direction under prioritized sampling while keeping updates well-conditioned.
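The bias correction itself takes only a few lines, as sketched below with illustrative names; it consumes the per-sample quantile Huber losses computed elsewhere in the update.

```python
import numpy as np

def per_weighted_loss(priorities, batch_idx, per_sample_losses, alpha, beta):
    """Bias-corrected minibatch loss under prioritized sampling.

    `priorities` covers the whole buffer, `batch_idx` the sampled indices,
    and `per_sample_losses` the base losses L_t(theta) for those indices.
    """
    probs = priorities ** alpha
    probs = probs / probs.sum()                       # P(t) = p_t^alpha / sum_k p_k^alpha
    w = (1.0 / (len(priorities) * probs[batch_idx])) ** beta
    w = w / w.max()                                   # normalize by the batch maximum
    return float(np.mean(w * np.asarray(per_sample_losses, float)))
```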

3.7. Hybrid Expert Planner (A* and RVO)

To supply reliable demonstrations for RL, we use a hybrid planner that couples A* for global routing with RVO for local dynamic avoidance. The modules run per agent in a decentralized manner.

3.7.1. Global Path Planning

A* searches on a known map containing only static obstacles and generates a waypoint sequence $P_{\text{global}} = \{ \mathrm{WP}_1, \mathrm{WP}_2, \ldots, \mathrm{WP}_{\text{goal}} \}$. The path avoids all known static obstacles and provides long-horizon guidance from the current position to the goal.

3.7.2. RVO-Based Local Collision Avoidance

RVO operates in a local neighborhood to resolve interactions with moving agents while accounting for nearby static boundaries. The preferred velocity $v_{\text{pref}}$ points toward the next waypoint $\mathrm{WP}_{\text{next}}$; RVO selects a new command $v_{\text{new}}$ that is as close as possible to $v_{\text{pref}}$ while remaining outside all reciprocal velocity–obstacle regions, thereby ensuring collision-free motion.
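To convey the selection principle, the sketch below uses a simplified sampling-based velocity-obstacle check with constant neighbor velocities over a finite horizon, rather than the full reciprocal formulation used by the expert; the names and the sampling strategy are ours.

```python
import numpy as np

def vo_pick_velocity(pos, v_pref, neighbors, radius, horizon, v_max, n_samples=200):
    """Pick the candidate velocity closest to v_pref that stays collision-free.

    Collisions are tested at the point of closest approach assuming neighbors
    keep constant velocity over `horizon`. `neighbors` is a list of
    (position, velocity) pairs; `radius` is the combined safety radius.
    """
    rng = np.random.default_rng(0)
    pos, v_pref = np.asarray(pos, float), np.asarray(v_pref, float)
    candidates = np.vstack([v_pref[None, :], rng.uniform(-v_max, v_max, size=(n_samples, 2))])

    def collides(v):
        for p_o, v_o in neighbors:
            rel_p = np.asarray(p_o, float) - pos
            rel_v = v - np.asarray(v_o, float)
            denom = float(rel_v @ rel_v)
            t = 0.0 if denom < 1e-9 else float(np.clip((rel_p @ rel_v) / denom, 0.0, horizon))
            if np.linalg.norm(rel_p - rel_v * t) < radius:
                return True
        return False

    feasible = [v for v in candidates if np.linalg.norm(v) <= v_max and not collides(v)]
    if not feasible:
        return np.zeros(2)  # conservative fallback: stop
    return min(feasible, key=lambda v: float(np.linalg.norm(v - v_pref)))
```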

3.7.3. Expert Trajectory Generation

At each decision step, the A* module supplies the current waypoint on the guidance path. Using this waypoint and local context (especially the states of nearby agents), the RVO module computes the velocity to be executed in the next time step (Algorithm 1). From the current state $s_t$, RVO yields the expert action $a_t^{\text{expert}}$, the reward $r_t$, and the next state $s_{t+1}$. The resulting transition $(s_t, a_t^{\text{expert}}, r_t, s_{t+1})$ is stored in the replay buffer and used alongside trajectories collected by the learning policy.
Algorithm 1 Hybrid Expert (A* global routing + RVO local avoidance)—Trajectory Generator
Require: start s_0, goal g, static_map, init_vel, all_agents_states
Ensure: expert trajectory T = [(s, a, r, s′)]
    T ← [ ];  x ← s_0;  v ← init_vel;  k ← 0;  t ← 0
    s ← FORMULATERLSTATE(x, v, SENSENEARBYAGENTS(x, all_agents_states), g)
    P_global ← compute global path from s_0 to g on static_map using A* search
    while ‖x − g‖ ≥ GOAL_THRESHOLD and t < MAX_STEPS do
       if ‖x − WP_k‖ < WAYPOINT_ARRIVAL_THRESHOLD then
            k ← min(k + 1, |P_global| − 1)
       end if
        v_pref ← MAX_SPEED · NORMALIZE(WP_k − x)
        N ← sense nearby agents within perception range of x from all_agents_states
        u ← collision-free velocity from RVO given current velocity v, preferred velocity v_pref, and nearby agents N;  u ← CLIPTOLIMITS(u)
        x′ ← x + u Δt;  v ← u
        r ← CALCULATEREWARD(state(x, v), u, x′)
        s′ ← FORMULATERLSTATE(x′, v, SENSENEARBYAGENTS(x′, all_agents_states), g)
        Append((s, u, r, s′), T);  (x, s, t) ← (x′, s′, t + 1)
   end while
   return  T
The expert trajectory generator and training workflow operate in tandem (Figure 3, Algorithm 2). The expert combines a global waypoint plan from A* with local collision avoidance from RVO: at each step, the current waypoint defines a preferred motion, nearby agents and obstacles constrain feasible velocities, and RVO outputs a safe control that advances the agent while respecting speed limits. The resulting transition ( s t , a t expert , r t , s t + 1 ) is written to the replay buffer exactly like data collected by the learning policy. During training, the buffer is sampled with prioritized replay so that informative events—expert demonstrations, near-collision interactions, and high-TD-error moments—are seen more often. The policy network (Adaptive IQN) learns from full return distributions rather than scalar values, and its updates are driven by TD errors that also refresh sampling priorities. This division of roles yields a stable and sample-efficient loop: the expert supplies a feasible, safety-oriented backbone in cluttered flows; prioritized replay concentrates gradient effort where it matters; and the distributional policy preserves efficiency while reducing variance and collision risk as task complexity grows.
Algorithm 2 AQP Training with Prioritized Replay (PER)
Require: simulator, replay capacity C, batch size B, learn_freq, target_period, total_steps; α (priority exponent), β 0 , β 1 (IS anneal); ϵ max , ϵ min (exploration)
Ensure: trained policy π θ
    initialize replay M with capacity C; initialize Adaptive IQN parameters θ; set target θ⁻ ← θ
    reset simulator; obtain initial states { s_i }
    (optional seed) insert expert transitions into M with high priority
    for t = 1 to total_steps do
         ϵ ← linear_anneal(t, total_steps, ϵ_max, ϵ_min)
        // Agent Interaction with Environment
        for each agent i do
            choose a_i by ϵ-greedy w.r.t. π_θ(s_i); step simulator → (r_i, s′_i)
            push (s_i, a_i, r_i, s′_i) into M with initial priority p ← |δ| + ε (δ: initial TD error)
            s_i ← s′_i
        end for
        if t mod learn_freq = 0 and |M| ≥ B then
            // Sample Prioritized Batch
            sample indices {j}_1^B from M with P(j) = p_j^α / Σ_k p_k^α
            β ← linear_anneal(t, total_steps, β_0, β_1);   w_j ← (1 / (|M| · P(j)))^β;   w̃_j ← w_j / max_k w_k
            compute distributional targets with θ⁻; evaluate predicted quantiles with θ
            L ← (1/B) Σ_j w̃_j · QuantileHuber(Z_θ(s_j, a_j), Z_{θ⁻}(s′_j))
            update θ ← θ − η ∇_θ L
            refresh priorities p_j ← EMA(p_j, |δ_j| + ε) in M
            if t mod target_period = 0 then
                 θ⁻ ← θ
            end if
        end if
        if episode terminated or timeout then
            reset simulator; reinitialize { s i }
        end if
  end for
  return  π θ
Figure 3. Workflow of the proposed training framework. The A*–RVO expert produces demonstration transitions $(s_t, a_t, r_t, s_{t+1})$, which are stored in a replay buffer together with on-policy experience. Minibatches are drawn by prioritized sampling and fed to the Adaptive IQN. The policy interacts with the simulator to choose $a_t$, while TD errors both drive gradient updates and refresh sampling priorities.

4. Experiments

4.1. Simulation Setup

We implement a custom Python simulator for a planar aquatic domain, built on Python 3.8 with Gymnasium 0.29.1 for environment construction and Pygame 2.5.2 for rendering. The map size is 50 × 50 m. Static obstacles, when present, are circular disks with radii of 1 m placed at random. Each surface sensing agent has a sensing range of 15 m with a 360° field of view. Background flow is modeled by Rankine vortices with core radii of 0.5 m.
The Rankine vortex used in this study is not intended to represent any specific hydrodynamic phenomenon. Instead, it serves as an abstract but structured flow field designed to introduce coherent circulation, smooth velocity gradients, and non-uniform flow patterns. These characteristics are broadly representative of oceanic dynamics, while the model remains computationally lightweight and highly reproducible. Thus, the objective of the flow model is not high-fidelity physical simulation, but to provide an environment with realistic variability for evaluating decentralized multi-agent navigation.
We evaluate our proposed approach and compare it against Adaptive IQN, IQN, DQN, and APF. The first suite trains surface sensing agents to avoid dynamic obstacles only; static obstacles are absent. We use 500 randomly generated scenarios (100 per scene type). The second suite uses another 500 evaluation scenarios constructed in the same way but with additional static obstacles. The detailed parameter settings for these scenarios are summarized in Table 1. An episode is marked as a failure if not all surface sensing agents reach their goals within 3 min. The energy cost of each surface sensing agent is computed as the sum over time of action magnitudes along its trajectory.
Table 1. Simulation scenario settings.
All algorithms were trained on an NVIDIA RTX 3090 GPU with an Intel Core i9-13900K CPU. The hardware platform was equipped with 128 GB DDR5 RAM and ran 64-bit Ubuntu 22.04 LTS. The proposed innovation framework required approximately 18 h of training for convergence, while adaptive-IQN and standard IQN required 14 h and 11 h, respectively. The DQN baseline converged in 6 h, and the APF method does not require learning-based training. The average training speed was around 0.35 s per episode, and the final performance results were obtained after full convergence.
The additional training cost of our method mainly stems from (i) the multi-agent coordination module and (ii) the distributional value estimation. Despite this overhead, the overall training demand remains within a practical range for modern GPU hardware. To further assess the practical feasibility of the proposed approach, we analyzed the computational complexity and empirical runtime during the inference phase. The per-step complexity is dominated by the forward pass of the multi-agent distributional critic, scaling as O ( N · P ) , where N is the number of agents and P represents the network parameters. In terms of execution speed, our method achieves an average latency of 2.6 ms per decision step. Although this introduces a modest increase compared to Adaptive-IQN (1.9 ms), IQN (1.7 ms), DQN (1.1 ms), and APF (0.2 ms), as summarized in Table 2, the latency remains well within real-time constraints. Furthermore, we evaluated the scalability of the algorithm. The inference latency grows approximately linearly, increasing from 2.6 ms for three agents to 4.1 ms for seven agents. This mild slope indicates that the proposed policy architecture introduces minimal overhead as the number of agents increases, ensuring efficient deployment even in complex multi-agent scenarios.
Table 2. Comparison of average inference latency per decision step across different algorithms.
Unless otherwise noted, we train Adaptive IQN with $N = N' = 8$ quantile samples and the quantile Huber loss with $\kappa = 1.0$. Risk is handled via a CVaR-style quantile scaling with base threshold $\phi = 0.1$. Prioritised replay uses exponent $\alpha = 0.6$, an initial IS exponent $\beta_0 = 0.4$ that is linearly annealed to 1.0, a small priority bias $\varepsilon_{\text{PER}} = 10^{-3}$, and an EMA smoothing factor $\gamma = 0.5$ for priority updates. The A* router employs Euclidean edge costs with a 1 m grid resolution. RVO parameters are an agent radius of 0.5 m, at most 5 neighbors, an agent time horizon of 5 s, and an obstacle time horizon of 2 s. Optimization uses Adam with a learning rate of $1 \times 10^{-4}$, a replay capacity of $1 \times 10^{5}$, a discount factor of 0.99, and a mini-batch size of 64; the target network is updated every 1000 environment steps. Exploration follows $\epsilon$-greedy with $\epsilon$ linearly decayed from 0.6 to 0.05 over training.
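For reference, these hyperparameters can be collected into a single configuration object, as sketched below; the key names are illustrative.

```python
# Hyperparameters gathered from Section 4.1, expressed as one config dict.
CONFIG = {
    "n_quantiles": 8,             # N = N' quantile samples
    "huber_kappa": 1.0,
    "cvar_base_threshold": 0.1,   # base phi for quantile scaling
    "per_alpha": 0.6,
    "per_beta0": 0.4,             # IS exponent, linearly annealed during training
    "per_eps": 1e-3,
    "per_ema_gamma": 0.5,
    "astar_grid_resolution_m": 1.0,
    "rvo_agent_radius_m": 0.5,
    "rvo_max_neighbors": 5,
    "rvo_agent_horizon_s": 5.0,
    "rvo_obstacle_horizon_s": 2.0,
    "lr": 1e-4,
    "replay_capacity": int(1e5),
    "discount": 0.99,
    "batch_size": 64,
    "target_update_steps": 1000,
    "eps_greedy": (0.6, 0.05),    # linearly decayed over training
}
```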

4.2. Results Discussion

Figure 4 illustrates six representative flow–obstacle scenes and the resulting surface sensing agent trajectories. (a) Four surface sensing agents with well-separated goals in a weakly sheared region produce near-straight motions punctuated by minor curvature induced by the background flow; all agents avoid purple static obstacles with large clearance. (b) With stronger vortical features, paths elongate and bend around eddies; agents execute coordinated detours while preserving mutual separation as they transit a cluttered central zone. (c) Goals placed near a vortex core tighten curvature and evoke conservative speed modulation: trajectories wrap partially around the core before converging, demonstrating risk-aware behavior in high-shear pockets. (d) In the five-SSA case with roughly symmetric start–goal placements, intersecting corridors are resolved by local interaction rules (RVO), yielding temporal staggering and side-step maneuvers that prevent head-on encounters. (e) A narrow passage between obstacles causes the agents’ trajectories to funnel into a single file. As they approach the bottleneck, their paths show brief hesitation and local congestion effects, after which the trajectories straighten once the agents exit the constricted region. (f) Under dense clutter and stronger circulation, the agents produce multi-turn curved paths around obstacle clusters and flow-induced vortical regions. Despite these complex deviations, all trajectories consistently converge to the yellow star goals without collision.
Figure 4. Representative scenarios for surface sensing agents in a flow field. The simulation environment is a 50 m × 50 m marine area. Panels (a–c): four-SSA case; panels (d–f): five-SSA case. Green square with red arrow = current pose and velocity; purple discs = static obstacles; yellow circles = surface sensing agent initial positions; yellow stars = surface sensing agent goals; blue background = flow magnitude/isocontours. Colored curves show the executed trajectories for each surface sensing agent. Right insets depict the local sensing footprint for a representative agent (labeled “ssa 3” or “ssa 4” accordingly).
Figure 5 and Figure 6 summarize and explain the comparative outcomes. In Set 1 (no static obstacles), AQP achieves the top success rate for all team sizes (Figure 5a) and simultaneously shifts the time and energy distributions leftward with tighter interquartile ranges (Figure 5b,c). The shorter upper whiskers and fewer extreme outliers indicate fewer near-stall episodes and more consistent progress. Reward distributions (Figure 5d) are correspondingly less negative and less variable, implying that AQP accrues stepwise penalties for shorter durations and avoids large collision penalties.
Figure 5. Performance on Set 1 (no static obstacles) as a function of the number of surface sensing agents. (a) Success rate. (b) Completion time (s). (c) Energy cost. (d) Cumulative reward. Methods: AQP (ours), Adaptive IQN, IQN, DQN, and APF. Boxes/whiskers in (bd) show median, interquartile range, and extrema with outliers.
Figure 6. Performance on Set 2 (with static obstacles) under increasing numbers of surface sensing agents. (a) Success rate. (b) Completion time (s). (c) Energy cost. (d) Cumulative reward. Same method legend and box-plot conventions as Figure 5.
When Set 2 introduces static obstacles, the gap widens. Baselines degrade as agent count increases: APF suffers local minima and oscillations, DQN exhibits high variance and frequent timeouts, and vanilla IQN under-explores safe corridors in dense clutter. In contrast, AQP sustains high success (Figure 6a) and preserves compact time/energy boxes (Figure 6b,c), with reward distributions that retain narrow spread and higher medians (Figure 6d). The tail behavior is especially telling: AQP shows markedly fewer long-time or high-energy outliers, reflecting robust avoidance of vortex cores and narrow passages even for 6–7 agents.
By dissecting the performance differentials across the baseline (IQN), the risk-aware variant (Adaptive IQN), and the proposed Innovation Framework, we isolate the independent and synergistic contributions of each module. The substantial leap in success rate from IQN to Adaptive IQN in obstacle-rich environments (Figure 6a) empirically validates the independent efficacy of the Adaptive Risk strategy in guaranteeing safety boundaries. Furthermore, the complete framework transcends this safety baseline, achieving superior energy efficiency and reward convergence (Figure 6c,d) compared to the risk-only variant. This secondary elevation attests to the synergistic value of the Hybrid Expert and EMA-PER: while the risk module secures survival, the expert guidance and prioritized replay optimize the policy manifold for minimal cost, demonstrating that these components function not merely additively, but multiplicatively to enhance the overall learning dynamics.
Mechanistically, these trends match the design. The A*–RVO expert furnishes a feasible, safety-oriented backbone that prevents catastrophic exploration in clutter; EMA-PER then prioritizes demonstrative and near-decision-boundary transitions, accelerating credit assignment and reducing update noise; finally, Adaptive IQN adjusts evaluation quantiles with proximity, yielding cautious actions near obstacles but efficiency in benign flow. Together, this yields a favorable Pareto profile—high success at lower time/energy with reduced variance—which becomes increasingly pronounced as scene complexity and team size grow.
It is worth noting that the outliers observed in the box plots (Figure 5 and Figure 6), particularly in task completion time and energy consumption, originate from the inherent stochasticity of multi-agent exploration, where agents occasionally exhibit extreme behaviors under uncertainty. Crucially, these outliers do not compromise the validity of our core conclusions. The interquartile ranges and medians demonstrate consistent central trends across all scenarios, and statistical verification (Wilcoxon rank-sum test, p < 0.01 ) confirms that the performance advantage of AQP over baselines remains statistically significant and robust, independent of these sporadic deviations.

5. Conclusions

We address decentralized navigation and collision avoidance for surface sensing agent networks operating in dynamic hydrodynamic fields and propose a distributional–RL framework that integrates prior knowledge with adaptive learning. The core technical contributions are (i) an EMA–PER mechanism that stabilizes and accelerates training by emphasizing informative transitions, including demonstrations; (ii) an Adaptive IQN policy that reasons over full return distributions and modulates risk; and (iii) a hybrid A*–RVO expert that supplies safe, feasible guidance in cluttered flow fields. Extensive experiments show consistent gains in success rate, travel time, energy cost, and reward stability across scenarios of increasing complexity, demonstrating strong efficiency and robustness. The method couples expert priors with distributional reinforcement learning to obtain complementary strengths. The A*–RVO component offers a reliable backbone plan, ensuring feasibility and safety in densely cluttered regions. EMA–PER focuses updates on salient and diverse experiences, improving sample efficiency and stabilizing early learning. Adaptive IQN exploits the full return distribution and adjusts risk sensitivity according to local context, yielding cautious behavior near obstacles while maintaining efficiency in benign flow. The framework scales with the number of agents and scene complexity, and its components are modular: the expert can be replaced by other planners, and the risk scheduler in Adaptive IQN can be tuned to mission requirements.
Looking ahead, we see several promising directions for further advancing this work. These include (a) incorporating richer and more realistic oceanographic flow models together with domain randomization to enhance sim-to-real generalization; (b) developing closed-loop adaptation strategies that adjust the risk schedule based on online performance feedback; (c) integrating communication-aware coordination mechanisms to support dense multi-agent surface sensing agent teams; and (d) performing sim-to-real validation on field robots to assess robustness under real-world sensing and actuation uncertainties. These extensions would further strengthen the practicality of the approach while preserving its safety and efficiency advantages. In addition, the planned physical experiments will be an essential step toward demonstrating the real-world applicability of the proposed framework.

Author Contributions

Conceptualization, J.D., Z.L., Y.W., and H.W.; Methodology, Z.L., Y.W., and W.X.; Software, Y.W., Z.L., and W.X.; Validation, Y.W., H.W., and J.L.; Formal analysis, K.O. and J.L.; Investigation, Y.W.; Resources, H.W. and J.D.; Data curation, K.O. and Y.W.; Writing—original draft preparation, Y.W. and K.O.; Writing—review & editing, J.D., H.W., W.X., and Z.L.; Visualization, Z.L. and J.D.; Supervision, H.W., J.D., and J.L.; Project administration, J.D. and H.W.; Funding acquisition, J.D. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China (No. 52302171), Shandong Provincial Natural Science Foundation (ZR2023QF005), the Fundamental Research Funds for the Central Universities (3072024XX2606, 3072025YC0401, 3072025YC0402).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xing, B.; Yu, M.; Liu, Z.; Tan, Y.; Sun, Y.; Li, B. A Review of Path Planning for Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2023, 11, 1556. [Google Scholar] [CrossRef]
  2. Balasubramanyam, A.; Hiremath, S.; Santhosh, G.; Menon, D.; Kumar, D.; Honnavalli, P.B.; Indian Pedestrian Behaviour Modelling Using Imitation and Reinforcement Learning. Preprint/Working Paper. 2025. Available online: https://www.preprints.org/frontend/manuscript/1af899c59efbe45e3f1a24834bc985b1/download_pub (accessed on 4 December 2025).
  3. Liu, C.; Van Kampen, E.J.; De Croon, G.C. Adaptive Risk-Tendency: Nano Drone Navigation in Cluttered Environments with Distributional Reinforcement Learning. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7198–7204. [Google Scholar]
  4. Fu, M.; Huang, L.; Li, F.; Qu, H.; Xu, C. A Fully Value Distributional Deep Reinforcement Learning Framework for Multi-Agent Cooperation. Neural Netw. 2025, 184, 107035. [Google Scholar] [CrossRef] [PubMed]
  5. Lin, X.; Huang, Y.; Chen, F.; Englot, B. Decentralized Multi-Robot Navigation for Autonomous Surface Vehicles with Distributional Reinforcement Learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8327–8333. [Google Scholar]
  6. Yang, C.; Zhao, Y.; Cai, X.; Wei, W.; Feng, X.; Zhou, K. Path Planning Algorithm for Unmanned Surface Vessel Based on Multiobjective Reinforcement Learning. Comput. Intell. Neurosci. 2023, 2023, 2146314. [Google Scholar] [CrossRef]
  7. Dou, J.; Ouyang, K.; Wu, Z.; Hu, Z.; Lin, J.; Wang, H. From Invariance to Symmetry Breaking in FIM-Aware Cooperative Heterogeneous Agent Networks. Symmetry 2025, 17, 1899. [Google Scholar] [CrossRef]
  8. Brignoni, L.; Johansen, T.A.; Breivik, M. Environmental Disturbances and Uncertainty in Maritime Autonomous Surface Ships. Ocean. Eng. 2022, 259, 111956. [Google Scholar]
  9. Xing, B.; Wang, X.; Yang, L.; Liu, Z.; Wu, Q. An Algorithm of Complete Coverage Path Planning for Unmanned Surface Vehicle Based on Reinforcement Learning. J. Mar. Sci. Eng. 2023, 11, 645. [Google Scholar] [CrossRef]
  10. Sharan, R.; Ayanian, N. Consensus-Based Multi-Robot Navigation in Dynamic Environments. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11418–11424. [Google Scholar]
  11. Du, Z.; Li, W.; Shi, G. Multi-USV Collaborative Obstacle Avoidance Based on Improved Velocity Obstacle Method. ASCE-ASME J. Risk Uncertain. Eng. Syst. Part A Civ. Eng. 2024, 10, 04023049. [Google Scholar] [CrossRef]
  12. Liu, Y.; Sun, Z.; Wan, J.; Li, H.; Yang, D.; Li, Y.; Fu, W.; Yu, Z.; Sun, J. Hybrid Path Planning Method for SSA Based on Improved A-Star and DWA. J. Mar. Sci. Eng. 2025, 13, 934. [Google Scholar] [CrossRef]
  13. Zhang, J.; Ren, J.; Cui, Y.; Fu, D.; Cong, J. Multi-SSA Task Planning Method Based on Improved Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 18549–18567. [Google Scholar] [CrossRef]
  14. Dabney, W.; Ostrovski, G.; Silver, D.; Munos, R. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1096–1105. [Google Scholar]
  15. Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5064–5078. [Google Scholar] [CrossRef]
  16. Van Den Berg, J.; Lin, M.; Manocha, D. Reciprocal Velocity Obstacles for Real-Time Multi-Agent Navigation. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation (ICRA), Pasadena, CA, USA, 19–23 May 2008; pp. 1928–1935. [Google Scholar]
  17. Chen, S.; Feng, L.; Bao, X.; Jiang, Z.; Xing, B.; Xu, J. An Optimal-Path-Planning Method for Unmanned Surface Vehicles Based on a Novel Group Intelligence Algorithm. J. Mar. Sci. Eng. 2024, 12, 477. [Google Scholar] [CrossRef]
  18. Liang, J.; Miao, H.; Li, K.; Tan, J.; Wang, X.; Luo, R.; Jiang, Y. A Review of Multi-Agent Reinforcement Learning Algorithms. Electronics 2025, 14, 820. [Google Scholar] [CrossRef]
  19. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  20. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 24611–24624. [Google Scholar]
  21. Wang, H.; Lou, S.; Jing, J.; Wang, Y.; Liu, W.; Liu, T. The EBS-A* Algorithm: An Improved A* Algorithm for Path Planning. PLoS ONE 2022, 17, e0263841. [Google Scholar] [CrossRef]
  22. Xu, Z.; Xia, Y.; Bai, H.; Song, S.; Wang, X. Vision-Based Deep Reinforcement Learning of Unmanned Aerial Vehicle (UAV) Autonomous Navigation Using Privileged Information. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3–17. [Google Scholar]
  23. Wang, Y.; Du, Y.; Xu, J.; Han, X.; Chen, Z. Real-Time Local Path Planning Strategy Based on Deep Distributional Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24562–24574. [Google Scholar]
  24. Pham, T.; Lee, G.H.; Chung, W. Autonomous Navigation for Mobile Robots Using Deep Reinforcement Learning in Partially Observable Environments. IEEE Access 2021, 9, 155297–155311. [Google Scholar]
  25. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  26. Kabir, R.; Watanobe, Y.; Islam, M.R.; Naruse, K. Enhanced Robot Motion Block of A-Star Algorithm for Robotic Path Planning. Sensors 2024, 24, 1422. [Google Scholar] [CrossRef]
  27. Chauvin, T.; Gianazza, D.; Durand, N. ORCA-A: A Hybrid Reciprocal Collision Avoidance and Route Planning Algorithm for UAS in Dense Urban Areas. In Proceedings of the SESAR Innovation Days 2024, Rome, Italy, 12–15 November 2023. [Google Scholar]
  28. Shi, Z.; Wang, K.; Zhang, J.J. Improved Reinforcement Learning Path Planning Algorithm Integrating Prior Knowledge. PLoS ONE 2023, 18, e0284942. [Google Scholar] [CrossRef] [PubMed]
  29. Lipowska, D.; Lipowski, A.; Ferreira, A.L. Homonyms and Context in Signalling Game with Reinforcement Learning. PLoS ONE 2025, 20, e0322743. [Google Scholar] [CrossRef] [PubMed]
  30. Chen, X.; Liu, Y.; Everett, M.; How, J.P. Decentralized Non-Communicating Multiagent Collision Avoidance with Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 285–292. [Google Scholar]
  31. Liu, Y.; Wang, C.; Zhao, C.; Wu, H.; Wei, Y. A Soft Actor-Critic Deep Reinforcement-Learning-Based Robot Navigation Method Using LiDAR. Remote Sens. 2024, 16, 2072. [Google Scholar] [CrossRef]
  32. Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
