An Adaptive QAPF Framework with a Discrete CBF-Inspired Safety Filter and Adaptive Reward Shaping for Safe Mobile Robot Navigation

Isaac, Elizabeth; George, Asha J.; Ioannou, Iacovos; Abraham, Jisha P.; Kallam, Suresh; Ghantasala, G. S. Pradeep; Vidyullatha, Pellakuri; Vassiliou, Vasos

doi:10.3390/electronics15091945


by Elizabeth Isaac ¹, Asha J. George ¹, Iacovos Ioannou ²,³,*, Jisha P. Abraham ¹, Suresh Kallam ⁴, G. S. Pradeep Ghantasala ⁵, Pellakuri Vidyullatha ⁶ and Vasos Vassiliou ³

¹ Department of Computer Science and Engineering, Mar Athanasius College of Engineering, Kothamangalam 686666, Kerala, India

² Department of Computer Science, Philips University, 4–6 Lamias Street, Nicosia 2001, Cyprus

³ Department of Computer Science, University of Cyprus and CYENS, Nicosia 1678, Cyprus

⁴ School of Computer Science and Engineering, CSE (IoT), Jain Deemed to be University, Bengaluru 562112, Karnataka, India

⁵ Department of Computer Science and Engineering, Alliance College of Engineering and Design, Alliance University, Bengaluru 562106, Karnataka, India

⁶ Department of Computer Science and Engineering, Koneru Lakshmaiah Educational Foundation, Vaddeswaram, Guntur 522302, Andhra Pradesh, India

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1945; https://doi.org/10.3390/electronics15091945

Submission received: 28 March 2026 / Revised: 26 April 2026 / Accepted: 29 April 2026 / Published: 3 May 2026

(This article belongs to the Special Issue AI for Industry)


Abstract

Mobile robot navigation remains challenging when fast convergence, collision avoidance and deployability must be satisfied simultaneously. The original Q-learning with Artificial Potential Field (QAPF) paradigm is extended in this paper with three coordinated mechanisms that together yield a reported-horizon convergence reduction of approximately four orders of magnitude (from $\sim 3 \times 10^{6}$ episodes to $\sim 200$ to 230 episodes under the present protocol) and an internal-ablation collision-rate reduction of approximately one order of magnitude ($6.2\%$ to $0.3\%$), and that open a new capability frontier covering dynamic obstacles, multi-robot coordination, energy-aware velocity modulation and embedded-deployable inference timing. The first mechanism is a potential-based reward-shaping schedule whose unclipped fixed-weight form follows the policy-invariant shaping theorem, while the implemented clipped and time-varying form is used as an empirically stable approximation. Under the present experimental protocol, the reported convergence horizon is reduced from the $\sim 3 \times 10^{6}$ episodes reported for the original QAPF formulation to approximately 200 to 230 episodes; this comparison is protocol-dependent and is not claimed as a controlled one-to-one runtime speedup. The second mechanism is a discrete Control Barrier Function (CBF)-inspired action filter (the discrete filter described in this paper is inspired by the continuous-time CBF literature but does not carry a forward-invariance proof; it is used as an empirical safety mechanism rather than as a Control Barrier Function in the formal continuous-time sense) with per episode visit memory, by which the held-out collision rate is reduced from $6.2\%$ for QAPF alone to $0.3\%$ while $93.8\%$ task completion is maintained. This collision-rate comparison is internal to the QAPF ablation because the prior QAPF reference does not report a comparable held-out collision metric. The third mechanism is a set of extensions to dynamic obstacles, two-robot cooperative navigation under a centralized scheme (with an explicit $O(N^{2})$ scaling-cost analysis and three decentralization strategies for fleets beyond the small-N regime), curriculum learning and energy-aware velocity modulation. Disturbance robustness tests, empirical timeout/stagnation detection for unreachable-goal cases, i7 reference inference timing with projected embedded-device latencies, multi-axis generalization over obstacle density and grid size, scalability analysis for centralized multi-robot coordination and a scope comparison against A* and RRT* are added by the revised evaluation. Across 30 independent seeds on held-out static maps, $94.5 \pm 2.1\%$ success is achieved by adaptive QAPF while $93.8 \pm 2.3\%$ success with $0.3 \pm 0.4\%$ collisions is achieved by QAPF+CBF. Under a separate finite robustness suite, $85.0 \pm 4.1\%$ success is retained by QAPF+CBF in the combined disturbance regime. The timing study indicates that the 20 Hz real-time threshold is comfortably exceeded by all methods on the measured i7 reference platform and by all projected embedded-device equivalents. The results show that a lightweight and safety-oriented navigation policy for grid-based mobile-robot settings can be provided by APF-guided tabular reinforcement learning when it is paired with a discrete safety filter and a clarified energy and robustness analysis.

Keywords:

mobile robots; reinforcement learning; artificial potential field; Q-learning; collision avoidance; motion control; path planning; Markov Decision Process; grid navigation; Control Barrier Functions; robustness; embedded inference

1. Introduction

Mobile robots have experienced a significant surge in popularity due to remarkable technological advancements in artificial intelligence, sensor technology and computational capabilities [1,2]. A mobile robot possesses the capability to traverse environments, dynamically adjust its path and maneuver through complex scenarios without human intervention. The primary aim of an intelligent motion control and obstacle avoidance system is to efficiently navigate from any initial state to the goal while actively avoiding any obstacles encountered along the way [3,4,5].

Despite this rapid progress, deploying a mobile robot in a realistic environment requires the simultaneous satisfaction of three conflicting objectives: (i) fast convergence so that the navigation policy is learned within a practical training budget; (ii) provable or near-provable safety so that obstacles are avoided even under sensor noise and actuator imperfections; and (iii) deployability on resource-constrained embedded hardware while preserving robustness to dynamic obstacles, multi-robot coordination and limited battery energy [1,2,5]. None of the dominant paradigms taken in isolation satisfies all three: classical Artificial Potential Field (APF) methods [6] are real-time and map-free but suffer from local-minima trapping in concave regions [3,7]; tabular and deep reinforcement learning [8,9,10,11] are flexible but require long training horizons and provide no formal safety guarantee during learning; and existing APF/RL hybrids such as the QAPF formulation of Orozco-Rosas et al. [12] report convergence horizons of $\sim 3 \times 10^{6}$ episodes and do not include a safety filter, an energy-aware controller, a multi-robot extension or a robustness/inference-timing analysis.

Our approach combines a potential-field-based collision-avoidance strategy with reinforcement learning to build a mobile-robot system capable of navigating uncharted territory without relying on pre-existing maps. The navigation problem is formulated as a Markov Decision Process (MDP) tailored for a mobile robot moving through obstacle-populated environments [13], while the potential-field component follows the Artificial Potential Field formulation introduced by Khatib [6]. The environment is characterized by attractive and repulsive forces influencing the robot’s movements, forming the potential-field landscape that the agent samples through iterative interaction. On top of this base formulation, three coordinated mechanisms are introduced and analyzed jointly: an adaptive potential-based reward-shaping schedule whose unclipped fixed-weight form is policy-invariant under the theorem of Ng, Harada and Russell [14]; a discrete Control Barrier Function (CBF)-inspired action filter with per episode visit memory [15,16] that turns the safety question from “how often does the agent collide?” to “which action subset should be excluded by a barrier-inspired safety mask?”; and an empirical timeout/stagnation detector that converts indefinite oscillation in unreachable-goal cases into a labeled safe-failure outcome. The same framework is extended to dynamic obstacles, narrow passages, multi-robot cooperative navigation and energy-aware velocity modulation, and is evaluated under sensor and actuator noise, on a measured Intel i7 inference reference with conservative embedded-device latency projections, and across orthogonal generalization axes (held-out maps, obstacle density, grid size and long-horizon stability).

The novelty of the proposed approach does not lie in the general concept of combining Q-learning with APF, which was introduced by Orozco-Rosas et al. [12] under the QAPF name, but rather in a substantially extended framework that closes three concrete methodological gaps of the original formulation. First, the original potential-difference shaping is non-policy-invariant, whereas the proposed unclipped fixed-weight form satisfies the assumptions of Ng, Harada and Russell [14] and the implemented clipped/decayed schedule is presented as an empirically stable approximation rather than as an unconditional guarantee. Second, the original framework has no safety layer, whereas the proposed discrete CBF-inspired filter [15,16], augmented with a per episode visit memory, empirically reduces the held-out collision rate by approximately $20 \times$ ($6.2\% \to 0.3\%$) while maintaining $93.8\%$ task completion. Third, the original framework is restricted to static single-robot navigation, whereas the proposed framework is jointly evaluated under dynamic obstacles, narrow passages, multi-robot coordination [17,18], sensor/actuator noise, embedded-device inference timing and out-of-distribution generalization. The classical local-minima failure mode of pure APF [3,7] is mitigated through optimistic Q-initialization and the anti-deadlock progress monitor described in Section 4, and persistently unreachable goals are reported as labeled Timeout-Unreachable or Stagnation-Unreachable outcomes rather than as undefined non-terminations. The detailed quantitative contrast with the closest prior QAPF reference is given at the end of Section 2 (Table 2).

The main contributions of this paper, framed against this comparison, are:

Adaptive QAPF Framework with Potential-Based Shaping. An adaptive shaping coefficient $\lambda(e) = \lambda_{\min} + (\lambda_{\max} - \lambda_{\min}) \exp(-\beta e)$ is introduced around the potential-based shaping form of Ng, Harada and Russell [14], $F(s, s') = \gamma \Phi(s') - \Phi(s)$, with $\Phi(s) = -U(q(s)) / U_{\Delta,\mathrm{scale}}$, where $q(s)$ denotes the continuous/grid position represented by state $s$ and $U_{\Delta,\mathrm{scale}}$ normalizes potential differences used for reward shaping. Under the standard assumptions, the unclipped fixed-weight form preserves the optimal policy of the underlying MDP. The implemented clipped and decayed form is therefore presented as an empirically stable approximation rather than as an unconditional policy-invariance guarantee; finite-training performance still depends on exploration, discretization and hyperparameter choices.
CBF-Inspired Safety Filter with Visit Memory and Empirical Unreachable-Goal Handling. A novel discrete CBF-inspired filter [15,16] is introduced that augments the barrier-function safety test with a per episode visit memory to eliminate oscillation loops, and is paired with an empirical timeout/stagnation detector that converts blocked-goal, sealed-corridor and dead-end concave cases into labeled Timeout-Unreachable and Stagnation-Unreachable outcomes. The held-out collision rate is reduced by the filter from $6.2\%$ (QAPF alone) to $0.3\%$ (QAPF+CBF), which is an approximately $20 \times$ reduction relative to the internal QAPF-only ablation, while $93.8\%$ task completion is maintained and persistent no-path behavior is reported as a safe-failure mode rather than as indefinite oscillation. The prior QAPF formulation [12] does not report a comparable held-out collision metric and therefore the safety contrast is internal rather than external.
Energy-Aware Velocity Modulation. An explicit velocity-modulation law $v(q) = v_{\min} + (v_{\max} - v_{\min}) \cdot 2 / (1 + \exp(k_{E}\, \tilde{g}(q)))$, driven by the normalized gradient magnitude $\tilde{g}(q) = \lVert \nabla U(q) \rVert / G_{\mathrm{scale}}$, is described. The law is constructed so that $v(q) = v_{\max}$ when $\tilde{g}(q) = 0$ (free space) and smoothly approaches $v_{\min}$ in high-gradient obstacle-proximate regions; $G_{\mathrm{scale}}$ normalizes gradient magnitudes, distinct from the potential-difference scale $U_{\Delta,\mathrm{scale}}$ used by the reward-shaping term. The law is paired with a two-term kinetic and jerk-energy metric $E = \sum_{t} c_{v} v_{t}^{2} + c_{a} (v_{t} - v_{t-1})^{2}$.
Dynamic and Multi-Robot Cooperative Extension. Beyond the static single-robot setting, dynamic obstacles, narrow-passage scenarios and multi-robot cooperative navigation are evaluated; for the cooperative case, an inter-robot virtual repulsion is paired with the per robot CBF-inspired filter, and the centralized scheme is accompanied by an explicit $O(N^{2})$ scaling-cost analysis with three concrete decentralization strategies (k-nearest, communication graph, hierarchical-cluster) [17,18] for fleets beyond the small-N regime.
Comprehensive Safety-Centric Evaluation Protocol. A held-out evaluation protocol is introduced that jointly reports (i) main safety/efficiency metrics (success rate, collision rate, minimum clearance, Pareto frontier), (ii) robustness under three independent noise channels (observation noise $\sigma_{\mathrm{obs}}$, actuator slip $p_{\mathrm{act}}$ and external drift $\sigma_{\mathrm{ext}}$) with an accompanying conditional/high-probability inflated-barrier safety statement, (iii) per decision inference latency physically measured on an Intel i7 reference with conservative projections to Jetson/DGX-class platforms, and (iv) multi-axis generalization across held-out maps, $-66.7\%$ to $+66.7\%$ obstacle-density shifts, $\pm 40\%$ grid-size shifts and 1000-episode long-horizon stability.
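
The velocity-modulation law and the energy metric listed among the contributions can be illustrated with a minimal Python sketch. The numerical values of `V_MIN`, `V_MAX`, `K_E`, `C_V` and `C_A` are hypothetical placeholders chosen only for illustration, and the treatment of the first time step (no jerk term, because $v_{-1}$ is undefined) is an assumption rather than a detail stated in the text.

```python
import math

# Hypothetical values chosen only for illustration; the text does not fix them here.
V_MIN, V_MAX = 0.2, 1.0   # speed bounds
K_E = 2.0                 # steepness of the roll-off
C_V, C_A = 1.0, 0.5       # kinetic and jerk weights


def velocity(g_tilde: float) -> float:
    """v = v_min + (v_max - v_min) * 2 / (1 + exp(k_E * g_tilde)), for g_tilde >= 0."""
    return V_MIN + (V_MAX - V_MIN) * 2.0 / (1.0 + math.exp(K_E * g_tilde))


def energy(speeds) -> float:
    """E = sum over t of c_v * v_t^2 + c_a * (v_t - v_{t-1})^2.

    Assumption: the first step carries no jerk term because v_{-1} is undefined.
    """
    total = 0.0
    for t, v in enumerate(speeds):
        total += C_V * v * v
        if t > 0:
            total += C_A * (v - speeds[t - 1]) ** 2
    return total
```

With these placeholder values, `velocity(0.0)` returns `V_MAX` because the factor $2 / (1 + e^{0})$ equals one, and the speed decreases monotonically toward `V_MIN` as the normalized gradient grows.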

The remainder of this paper is organized as follows. Section 2 reviews the related work and theoretical background. Section 3 formulates the MDP and simulation assumptions. Section 4 details the QAPF algorithm, the CBF-inspired action filter, the energy-aware velocity modulator and the empirical unreachable-goal detector. Section 5 presents the main comparison, multi-scenario evaluation, robustness analysis, inference-timing study, and ablation, sensitivity, energy, multi-robot, curriculum and generalization studies, followed by the limitations and threats-to-validity discussion. Section 6 concludes. Appendix A provides the algorithm-capability discussion on obstacle shape and concave geometry, empirical unreachable-goal handling, scope comparison against A*/RRT*, 3D-workspace scalability, and kinematic and geometric constraints with pose uncertainty.

2. Literature Review and Background

This section provides a concise review of existing approaches in mobile-robot navigation, obstacle avoidance and reinforcement learning-based path planning. Section 2.1 presents the literature review covering the principal research directions, and Section 2.2 provides the necessary theoretical background on reinforcement learning, potential-based reward shaping and Control Barrier Functions. A comprehensive feature-based comparison is presented in Table 1.

2.1. Literature Review

2.1.1. Enhanced Artificial Potential Field Methods

The Artificial Potential Field (APF) method of Khatib [6] models the robot’s workspace via an attractive force toward the goal and repulsive forces away from obstacles. The classical formulation suffers from the well-documented local-minima problem [3,7,19]. Yao et al. [3] proposed combining developed black-hole potential fields with reinforcement learning; Montiel et al. [19] applied parallel evolutionary optimization of APF gains; Herrera et al. [7] extended this approach to the multi-robot case with evolutionary multi-objective optimization.

Most closely related to our work, Orozco-Rosas et al. [12] introduced the QAPF learning algorithm with a partially guided Q-learning strategy in which APF biases exploration. Their framework does not use the unclipped potential-based shaping form of Ng, Harada and Russell [14], is restricted to static environments, does not consider CBF-style safety filtering and requires up to $3 \times 10^{6}$ episodes for convergence. Our work fills these gaps. Earlier improvements to Q-learning-based path planning, such as Low et al. [20], also report substantial gains over vanilla Q-learning on grid environments by reshaping the reward signal; their work, however, does not integrate an APF gradient nor a safety filter, so it remains a single-component baseline of the kind our ablation in Section 5.7 quantifies.

2.1.2. Deep Reinforcement Learning for Autonomous Navigation

Deep RL has transformed mobile-robot navigation [2,5]. Mnih et al. [9] introduced the DQN architecture with experience replay and target networks; Schaul et al. [21] added prioritized experience replay. Policy-gradient methods such as PPO [10] and SAC [11] handle continuous-action control, with TD3 [22] addressing function-approximation overestimation through twin critics and delayed policy updates; transformer architectures have been explored for sequence-level planning [23,24]; multi-agent RL approaches address inter-robot coordination [17,18]. In the broader autonomous-driving context, deep learning-based planners have been surveyed for their safety, generalization and deployability characteristics [25,26]; the QAPF framework targets a complementary map-free, lightweight, tabular regime in which a learned policy must operate alongside a discrete safety filter rather than as part of a heavy end-to-end driving stack.

2.1.3. Collision Avoidance with Dynamic Obstacles

Maw et al. [27] proposed iADA*-RL for multi-threat avoidance. Methods of this kind typically require an auxiliary path planner; our approach, by contrast, produces low-collision, reactive trajectories end-to-end via APF-guided Q-learning and the CBF-inspired action filter.

2.1.4. Control Barrier Functions

CBFs provide a principled framework for enforcing forward-invariant safe sets [15,28]. Cheng et al. [16] combined model-free RL with a CBF-based safety filter and Gaussian-process dynamics learning. Our contribution is a discrete CBF-inspired filter that is appropriate for the grid-world setting and carries empirical rather than formal guarantees; it is augmented with a per episode visit memory that prevents oscillation and reduces observed collisions to $0.3\%$ in practice (a $\sim 20 \times$ reduction relative to QAPF without the filter). Adjacent lines of work formalize safety differently: Shalev-Shwartz et al. [29] propose a responsibility-sensitive contract (RSS) that yields rule-based safety constraints rather than learned ones, while Zhu et al. [30] obtain safe RL behavior through reward shaping designed for autonomous-driving longitudinal control. The discrete CBF filter adopted here is closer in spirit to the latter, a soft, learned-policy-friendly safety layer, but it is bound to the grid-world action space rather than to a continuous accelerator or brake control axis.

2.1.5. Quantitative Headline Comparison Against the Closest Prior QAPF Work

To make this novelty quantitatively concrete, Table 2 places our results side-by-side with the most relevant prior QAPF reference [12] along the dimensions on which comparison is meaningful. Three points are made sharply by the table. The reported convergence horizon is substantially shorter under the respective reported protocols ($3 \times 10^{6}$ to $\sim 200$ episodes), although this comparison is protocol-dependent and is not a controlled one-to-one ablation against the original implementation. Collisions are reduced by approximately $20 \times$ by the proposed Control Barrier Function (CBF) layer relative to the same QAPF implementation without the safety layer ($6.2\%$ to $0.3\%$). The capability scope is extended from static single-robot navigation to dynamic, multi-robot, energy-aware, robustness-tested and embedded-deployable navigation. Because the prior QAPF paper did not report the same collision metric, the collision-reduction claim is made against the internal QAPF-only ablation rather than against that prior paper.

2.2. Theoretical Background

2.2.1. Reinforcement Learning and Policy-Invariant Reward Shaping

Q-learning [8] learns the optimal action-value function $Q^{*}(s, a)$ by iteratively applying the temporal-difference update. Convergence requires infinite visitation of all state–action pairs and the Robbins and Monro conditions on the learning-rate schedule [31].

A key theoretical tool for accelerating learning without biasing the optimal policy is the potential-based shaping result of Ng, Harada and Russell [14]. Given a base MDP with reward $r(s, a, s')$, an auxiliary shaping function $F(s, s') = \gamma \Phi(s') - \Phi(s)$ is defined for any real-valued potential $\Phi : S \to \mathbb{R}$; the shaped reward $\tilde{r} = r + F$ then yields exactly the same optimal policy as $r$. In the proposed framework, $\Phi(s) = -U(q(s)) / U_{\Delta,\mathrm{scale}}$ is taken, where $U$ is the APF total potential evaluated at the robot position encoded by the state, and $U_{\Delta,\mathrm{scale}}$ is the potential-difference normalization constant used only for reward shaping. This gives the field-aligned signal the same potential-difference structure proven policy-invariant by [14] when it is used in an unclipped fixed-weight form. In the implemented algorithm, clipping and the decaying coefficient $\lambda(e)$ are added for numerical stability and faster finite-training convergence; therefore, the exact policy-invariance guarantee applies to the underlying unclipped fixed-weight form, while the reported clipped scheduled form is evaluated empirically. This distinction is a key difference from the original QAPF formulation [12], which used non-invariant potential-difference shaping.
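
One way to build intuition for the policy-invariance result is to note that the discounted sum of shaping terms along a trajectory telescopes to $\gamma^{T} \Phi(s_{T}) - \Phi(s_{0})$, so it depends only on the endpoints and not on the path taken. The following Python sketch checks this numerically; the potential values and the discount factor are arbitrary illustrative numbers, not values from the experiments.

```python
def shaping(phi_s: float, phi_next: float, gamma: float) -> float:
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def discounted_shaping_return(phis, gamma: float) -> float:
    """Sum of gamma^t * F(s_t, s_{t+1}) along a trajectory of potential values."""
    return sum(
        gamma ** t * shaping(phis[t], phis[t + 1], gamma)
        for t in range(len(phis) - 1)
    )


# Two different paths sharing the same start potential, end potential and length
# (arbitrary illustrative values).
GAMMA = 0.95
PATH_A = [-3.0, -2.0, -1.5, 0.0]
PATH_B = [-3.0, -4.0, -0.5, 0.0]

# Closed form: gamma^T * Phi(s_T) - Phi(s_0), with T = 3 transitions.
CLOSED_FORM = GAMMA ** 3 * PATH_A[-1] - PATH_A[0]
```

Both `discounted_shaping_return(PATH_A, GAMMA)` and `discounted_shaping_return(PATH_B, GAMMA)` agree with `CLOSED_FORM` up to floating-point rounding, which is why an unclipped fixed-weight shaping term shifts all returns from a given start state by the same amount and leaves the ranking of policies unchanged. Clipping breaks the telescoping, which is consistent with the caveat stated for the implemented form.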

2.2.2. Control Barrier Functions

A (zeroing) Control Barrier Function [15] is a continuously differentiable $h : X \to \mathbb{R}$ such that the safe set $C = \{x : h(x) \geq 0\}$ is forward-invariant under controls satisfying $\dot{h}(x) \geq -\kappa h(x)$ for some class-$\mathcal{K}$ function $\kappa$. In our discrete grid-world setting the continuous-time condition is relaxed to a discrete inequality $h(q') \geq \delta_{\mathrm{safe}}$, yielding an empirical, not formally forward-invariant, safety guarantee.
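
The discrete inequality can be read as a mask over candidate actions. The sketch below is only illustrative: the barrier form `h(q) = (distance to nearest obstacle) - eps_coll`, the value of `DELTA_SAFE` and the direction convention of the four moves are assumptions made for this example, since the barrier actually used by the framework is defined later in the paper and is not reproduced here.

```python
import math

# Four grid moves; the direction convention is an assumption for this sketch.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
EPS_COLL = 1.5    # collision radius in cells, as given in the system description
DELTA_SAFE = 0.0  # hypothetical safety margin


def barrier(q, obstacles) -> float:
    """Assumed barrier: distance to the nearest obstacle minus the collision radius."""
    return min(math.dist(q, o) for o in obstacles) - EPS_COLL


def safe_actions(q, obstacles):
    """Actions whose successor cell q' satisfies h(q') >= delta_safe."""
    return [
        name
        for name, (dx, dy) in ACTIONS.items()
        if barrier((q[0] + dx, q[1] + dy), obstacles) >= DELTA_SAFE
    ]
```

For a robot at `(5, 5)` and a single obstacle at `(7, 5)`, moving right would land one cell from the obstacle, inside the collision radius, so that action is masked while the other three remain available. The sketch assumes a non-empty obstacle list and omits the visit-memory mask.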

3. Problem and System Description

This section formulates the mobile-robot navigation problem as a Markov Decision Process (MDP) and specifies the simulation environment used in the rest of this paper. Section 5.2 states the modeling assumptions on observation, dynamics, obstacle geometry and discrete actions. Section 3.2 presents the MDP formulation, including the discrete state encoding, the action space, the attractive and repulsive potential components and the base reward function. The complete closed-loop architecture is summarized in Figure 1, while the attractive, repulsive and combined potential fields used by the navigation policy are illustrated later in Figure 2.

As a notational convention, continuous positions, displacements, obstacle centers, goal positions, forces and noise vectors are written in boldface, e.g., $\mathbf{q}$, $\mathbf{q}_{f}$, $\mathbf{o}$, $\boldsymbol{\Delta}_{a}$, $\mathbf{F}$ and $\boldsymbol{\eta}$. Matrices are also written in boldface, e.g., $\mathbf{I}_{2}$. Scalar discrete actions, rewards, potentials and state-encoding bins are not bolded.

3.1. System Description

The end-to-end framework is organized as a four-stage closed-loop pipeline summarized in Figure 1. The first stage, Environment and Sensing, models the workspace as a $50 \times 50$ discrete grid populated with point obstacles of collision radius $\epsilon_{\mathrm{coll}} = 1.5$ cells (red filled circles), a start cell $s_{0}$ (green square) and a goal cell $q_{f}$ (gold star), with each obstacle additionally carrying an APF influence radius $\rho_{0} = 3.0$ cells (light-blue ring) inside which the repulsive potential of Equation (4) is non-zero. At every decision step the sensing layer constructs the discrete state $s$ of Equation (1) from the robot pose, the goal pose and the nearest-obstacle geometry. The second stage, the QAPF policy of Section 4.3, combines the Q-learning update of Equation (6) with the artificial-potential-field guidance of Equations (2)–(4) through the zero-centered hybrid score of Equation (11) and emits a nominal action $a^{\star}$. The third stage, the discrete CBF-inspired safety filter of Section 4.5, evaluates the barrier value $h(q')$ of Equation (12) at every successor cell, intersects the resulting safe-mask with the per episode visit-memory mask, and either passes $a^{\star}$ through or substitutes the progress-maximizing safe-and-unforbidden alternative; its output is the executed action $a^{\mathrm{exec}}$. The fourth stage, robot actuation, applies $a^{\mathrm{exec}}$ to the robot dynamics and returns the next state $s'$ to the sensing layer, closing the loop. The same pipeline accommodates the dynamic-obstacle, narrow-passage and two-robot cooperative variants of Section 4 without architectural change: only the obstacle-set update rule, the inter-robot virtual repulsion of Equation (17) and the per robot CBF-filter instance differ between configurations. The empirical timeout/stagnation detector of Section 4.7 runs as a parallel supervisor over the same loop and labels persistent no-path behavior as a safe-failure outcome rather than as indefinite oscillation.

3.2. MDP Formulation

1. State ($s$): The state is encoded as a 7-tuple

$$s = (x_{\mathrm{bin}}, y_{\mathrm{bin}}, \beta_{g}, \beta_{o}, d_{o}, \dot{d}_{o}, \hat{d}_{o}), \tag{1}$$

where $x_{\mathrm{bin}}, y_{\mathrm{bin}} \in \{0, \dots, 4\}$ are binned position coordinates (five bins per axis, giving 25 position bins), $\beta_{g}, \beta_{o} \in \{0, \dots, 7\}$ are discretized bearings (eight $45^{\circ}$ sectors) to the goal and nearest obstacle respectively, $d_{o} \in \{0, 1, 2, 3\}$ is the binned distance to the nearest obstacle, $\dot{d}_{o} \in \{0, 1, 2\}$ is the discretized approach rate ($0 =$ closing, $1 =$ stationary, $2 =$ receding; obtained by thresholding $\rho_{\min}(q_{t}) - \rho_{\min}(q_{t-1})$ at $\pm 0.1$ cells) and $\hat{d}_{o} \in \{0, 1, 2, 3\}$ is the one-step predicted distance bin computed from a constant-velocity extrapolation. The state-space cardinality is $|S| = 5 \cdot 5 \cdot 8 \cdot 8 \cdot 4 \cdot 3 \cdot 4 = 76{,}800$.
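
The cardinality above, and one possible way of mapping the 7-tuple to a row of a tabular Q-function, can be sketched in Python as follows. The mixed-radix (row-major) layout and the function names `encode` and `decode` are illustrative assumptions; the text does not specify how the table is indexed.

```python
import math

# Bin counts for (x_bin, y_bin, beta_g, beta_o, d_o, d_dot_o, d_hat_o), from Equation (1).
BINS = (5, 5, 8, 8, 4, 3, 4)

NUM_STATES = math.prod(BINS)  # 5 * 5 * 8 * 8 * 4 * 3 * 4


def encode(state, bins=BINS) -> int:
    """Mixed-radix index of a state tuple (an assumed layout, not the paper's)."""
    index = 0
    for value, size in zip(state, bins):
        if not 0 <= value < size:
            raise ValueError(f"component {value} outside [0, {size})")
        index = index * size + value
    return index


def decode(index: int, bins=BINS):
    """Inverse of encode: recover the state tuple from a flat index."""
    values = []
    for size in reversed(bins):
        index, value = divmod(index, size)
        values.append(value)
    return tuple(reversed(values))
```

Under this layout the all-zero state maps to index 0 and the state with every component at its maximum maps to index `NUM_STATES - 1`, so a dense table with four actions would hold $76{,}800 \times 4 = 307{,}200$ entries.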

The navigation environment is governed by the APF total potential at position $q \in \mathbb{R}^{2}$:

$$U(q) = U_{\mathrm{att}}(q) + U_{\mathrm{rep}}(q), \tag{2}$$

where $U_{\mathrm{att}}(q)$ is the attractive potential generated by the goal, and $U_{\mathrm{rep}}(q)$ is the repulsive potential generated by the obstacles. Equation (2) decomposes the total potential into an attractive component (Equation (3)) and a repulsive component (Equation (4)), with the attractive component centered at the goal $q_{f}$,

$$U_{\mathrm{att}}(q) = \frac{1}{2} k_{\mathrm{att}} \lVert q - q_{f} \rVert^{2}, \tag{3}$$

where $k_{\mathrm{att}}$ is the attractive-gain coefficient and $q_{f} \in \mathbb{R}^{2}$ is the goal position, with the repulsive component generating a barrier within the influence distance $\rho_{0}$:

$$U_{\mathrm{rep}}(q) = \begin{cases} \dfrac{1}{2} k_{\mathrm{rep}} \left( \dfrac{1}{\rho(q)} - \dfrac{1}{\rho_{0}} \right)^{2}, & \text{if } \rho(q) < \rho_{0}, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Here $\rho(q) = \min_{o \in O} \lVert q - o \rVert$ is the Euclidean distance to the nearest obstacle in the obstacle set $O = \{o_{1}, \dots, o_{N_{o}}\}$.
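
Equations (2)–(4) translate directly into code. The following Python sketch is a minimal illustration; the gain values `K_ATT` and `K_REP` are hypothetical, and the small lower bound on $\rho$ that avoids division by zero at an obstacle center is an added safeguard that is not part of the equations.

```python
import math

K_ATT, K_REP = 1.0, 10.0  # hypothetical gains, for illustration only
RHO_0 = 3.0               # influence radius in cells, from the system description


def u_att(q, goal) -> float:
    """Attractive potential of Equation (3)."""
    return 0.5 * K_ATT * math.dist(q, goal) ** 2


def u_rep(q, obstacles) -> float:
    """Repulsive potential of Equation (4), using the nearest obstacle only."""
    rho = min(math.dist(q, o) for o in obstacles)
    if rho >= RHO_0:
        return 0.0
    rho = max(rho, 1e-6)  # added guard: avoid division by zero at an obstacle center
    return 0.5 * K_REP * (1.0 / rho - 1.0 / RHO_0) ** 2


def u_total(q, goal, obstacles) -> float:
    """Total potential of Equation (2)."""
    return u_att(q, goal) + u_rep(q, obstacles)
```

With these placeholder gains, a position five cells from the goal has attractive potential $12.5$, and the repulsive term vanishes for any position at least $\rho_{0} = 3.0$ cells from every obstacle.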

2. Action ($a$): $a \in A = \{\uparrow, \downarrow, \leftarrow, \rightarrow\}$. Because $A$ is discrete and finite, $a$ is a scalar index (not bold).

3. Reward ($r$): The base reward used by the environment is

$$r(s, a, s') = R_{\mathrm{outcome}} - \lambda_{s} - \lambda_{c} \left( 1 - \frac{\rho(q')}{\rho_{0}} \right) \mathbf{1}[\rho(q') < \rho_{0}] + \lambda_{g} \cdot \mathrm{clip}\left( d_{\mathrm{goal}}(q) - d_{\mathrm{goal}}(q'), -1, 1 \right), \tag{5}$$

where $R_{\mathrm{outcome}} = +100$ at goal, $-50$ at collision and 0 otherwise; $\lambda_{s}$ is the per step penalty; $\lambda_{c}$ is the proximity-penalty weight; $\lambda_{g}$ is the dense-progress coefficient, and $d_{\mathrm{goal}}(q) = \lVert q - q_{f} \rVert$. The QAPF-specific potential-based shaping term of Section 4.3 is added on top of this base reward at the agent level, not the environment level.
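
A minimal Python sketch of Equation (5) is given below. The weights `LAMBDA_S`, `LAMBDA_C` and `LAMBDA_G` are hypothetical placeholders, the caller is assumed to supply the nearest-obstacle distance at the successor cell, and giving the goal outcome precedence over the collision outcome is an assumption made for the example.

```python
import math

RHO_0 = 3.0  # influence radius in cells, from the system description
LAMBDA_S, LAMBDA_C, LAMBDA_G = 0.1, 1.0, 0.5  # hypothetical weights


def base_reward(q, q_next, goal, rho_next, at_goal=False, collided=False) -> float:
    """Base environment reward of Equation (5).

    rho_next is the distance from q_next to the nearest obstacle.
    """
    outcome = 100.0 if at_goal else (-50.0 if collided else 0.0)
    # Proximity penalty is active only inside the influence radius.
    proximity = LAMBDA_C * (1.0 - rho_next / RHO_0) if rho_next < RHO_0 else 0.0
    # Dense progress: reduction in goal distance, clipped to [-1, 1].
    progress = max(-1.0, min(1.0, math.dist(q, goal) - math.dist(q_next, goal)))
    return outcome - LAMBDA_S - proximity + LAMBDA_G * progress
```

With these placeholder weights, a unit step toward the goal in free space yields $-0.1 + 0.5 = 0.4$, and the same step ending $1.5$ cells from an obstacle yields $0.4 - 0.5 = -0.1$.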

3.3. Potential Field Visualization

Figure 2 illustrates the attractive, repulsive and combined potentials that guide the navigation.

4. Methodology

This section presents the methodology of the proposed QAPF framework and its CBF-inspired variant. Section 4.1 and Section 4.2 review the Q-learning update used as the learning backbone and the potential-field force approach used as the guidance signal. Section 4.3 introduces the QAPF learning algorithm with the policy-invariant potential-based shaping form and the adaptive shaping schedule. Section 4.4 describes the zero-centered hybrid Q+APF action-scoring rule. Section 4.5 introduces the discrete CBF-inspired action filter with per episode visit memory. Section 4.6 describes the energy-aware velocity-modulation law. Section 4.7 describes the empirical timeout/stagnation-based unreachable-goal detector. Section 4.8 describes the multi-robot cooperative extension, and Section 4.9 describes the three-dimensional workspace extension. The implementation parameters used by all of these components are summarized in Table 3.

4.1. Q-Learning

Q-learning [8] is an off-policy, model-free algorithm that iteratively refines estimates of the expected cumulative discounted reward via

$$Q(s, a) \leftarrow Q(s, a) + \eta \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{6}$$

where $\eta \in (0, 1]$ is the learning rate, $\gamma \in [0, 1)$ is the discount factor and $s'$ is the resulting next state. Under standard conditions Q-learning converges to $Q^{*}$ [8,31]. The Q-table is initialized optimistically at $Q_{0} = 5.0$ rather than at zero: this initialization encourages exploration of unvisited state–action pairs, because any negative experience of a visited action makes it less attractive than the still-optimistic default.
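
The update of Equation (6) and the effect of optimistic initialization can be sketched in a few lines of Python. The learning rate and discount factor used as defaults are hypothetical, the dictionary-based table is an illustrative choice, and skipping the bootstrap term at terminal transitions is a common convention assumed here rather than a detail stated in the text.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")
Q0 = 5.0  # optimistic initial value stated in the text


def make_q_table():
    """Q-table whose unseen states default to the optimistic value for every action."""
    return defaultdict(lambda: {a: Q0 for a in ACTIONS})


def q_update(Q, s, a, r, s_next, eta=0.1, gamma=0.95, done=False) -> float:
    """One temporal-difference update of Equation (6); eta and gamma are placeholders."""
    # Assumed convention: no bootstrapping from s_next when the episode has ended.
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += eta * (target - Q[s][a])
    return Q[s][a]
```

After a single update with reward $-1$ from a fresh table, the tried action drops to $5 + 0.1\,(-1 + 0.95 \cdot 5 - 5) = 4.875$ while the three untried actions stay at $5.0$, so a greedy choice in that state now prefers an untried action.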

4.2. Potential Field Force Approach

The force field is the negative gradient of the total potential, $F(q) = -\nabla U(q)$. On the discrete grid the continuous gradient is approximated by finite differences; the STEP algorithm (Algorithm 1) evaluates the potential at each of the four neighboring cells and encodes the resulting state according to Equation (1).

Algorithm 1 STEP(a): potential-field evaluation

1: Input: action $a$, current state $s$, position $q$
2: Output: next state $s'$, reward $r$, done flag
3: Compute next position: $q' \leftarrow \mathrm{clip}(q + \Delta_{a}, 0, (G - 1)\mathbf{1})$
4: Calculate $U_{\mathrm{att}}(q') = \frac{1}{2} k_{\mathrm{att}} \lVert q' - q_{f} \rVert^{2}$ and $U_{\mathrm{rep}}(q')$ per Equation (4)
5: $U_{\mathrm{total}} \leftarrow U_{\mathrm{att}} + U_{\mathrm{rep}}$
6: Assign base reward $r$ per Equation (5)
7: Encode next state $s'$ per Equation (1); check termination
8: return $s'$, $r$, done
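
The control flow of Algorithm 1 can be sketched in Python as below. To keep the example short, the potential, reward and state-encoding computations are passed in as callables (`potential_fn`, `reward_fn`, `encode_fn`), which are hypothetical names introduced for this sketch. The termination tests (exact arrival at the goal cell, and nearest-obstacle distance at or below the collision radius) and the extra `info` dictionary are assumptions, not details taken from the algorithm listing.

```python
import math

G = 50          # grid side length, from the system description
EPS_COLL = 1.5  # collision radius in cells, from the system description
DELTAS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(q, action, goal, obstacles, potential_fn, reward_fn, encode_fn):
    """One environment step following the structure of Algorithm 1."""
    dx, dy = DELTAS[action]
    # Line 3: move and clip each coordinate to the grid [0, G - 1].
    q_next = (min(max(q[0] + dx, 0), G - 1), min(max(q[1] + dy, 0), G - 1))
    # Lines 4-5: total potential at the successor cell.
    u_next = potential_fn(q_next)
    # Assumed termination tests.
    at_goal = q_next == goal
    collided = min(math.dist(q_next, o) for o in obstacles) <= EPS_COLL
    # Line 6: base reward; line 7: state encoding and termination check.
    r = reward_fn(q, q_next, at_goal, collided)
    s_next = encode_fn(q_next)
    return s_next, r, at_goal or collided, {"q": q_next, "U": u_next}
```

The fourth return value carries the successor position and its potential so that an agent-level shaping term can be computed outside the environment, in line with the statement that shaping is added at the agent level.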

4.3. QAPF Learning Algorithm

The QAPF algorithm (Algorithm 2) integrates the Q-learning update of Equation (6) with APF guidance through three tightly coupled mechanisms.

The first component, potential-based reward shaping, is described as follows. Instead of adding an arbitrary APF-flavored bonus to the reward, the Ng, Harada and Russell form [14] is adopted:

$$\tilde{r}(s, a, s') = r(s, a, s') + \lambda(e) \cdot \mathrm{clip}\left( \frac{U(q) - \gamma U(q')}{U_{\Delta,\mathrm{scale}}}, -1, +1 \right), \tag{7}$$

whose shaping term reduces to $\lambda F(s, s')$, with $F(s, s') = \gamma \Phi(s') - \Phi(s)$ and potential $\Phi(s) = -U(q(s)) / U_{\Delta,\mathrm{scale}}$, when the clipping is inactive and $\lambda(e) = \lambda$ is fixed. For any fixed coefficient and without clipping, the term satisfies the assumptions of [14] and preserves the optimal policy of the base MDP. The implemented version intentionally clips the normalized potential difference and uses the schedule $\lambda(e)$ of Equation (8); these two engineering choices stabilize finite-sample learning but mean that exact policy invariance should not be claimed for the full clipped, time-varying implementation. The policy-invariance statement therefore applies to the underlying unclipped fixed-weight shaping form, while the practical scheduled-and-clipped variant is validated empirically.

The clipping to

[- 1, + 1]

stabilizes updates in the presence of heavy-tailed

Δ U

(observed raw range

[- 69.5, + 70.0]

; see Figure 6);

U_{Δ, scale}

is auto-calibrated as the 95th percentile of

| Δ U |

observed during a 2000-step random-walk warm-up. This scale is distinct from the gradient normalization constant

G_{scale}

used later by the energy-aware velocity module.
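A minimal sketch of the clipped shaping term of Equation (7) and of the percentile calibration is given below. The warm-up samples are a synthetic stand-in (an evenly spaced sweep over the observed raw range), `gamma = 0.95` is an assumed discount, and the helper names are hypothetical.

```python
import numpy as np

def calibrate_scale(delta_u_samples, pct=95.0, floor=1e-9):
    """U_scale as the pct-th percentile of |delta U|; the floor guards against zero."""
    return max(float(np.percentile(np.abs(delta_u_samples), pct)), floor)

def shaped_reward(r, u_q, u_q_next, lam, u_scale, gamma=0.95):
    """Equation (7): base reward plus the clipped, normalized potential difference."""
    shaping = float(np.clip((u_q - gamma * u_q_next) / u_scale, -1.0, 1.0))
    return r + lam * shaping

# Synthetic stand-in for the 2000-step warm-up (NOT measured data).
warmup = np.linspace(-70.0, 70.0, 2001)
u_scale = calibrate_scale(warmup)

# A very large potential drop saturates the clip, so the bonus is capped at lam.
capped = shaped_reward(r=-1.0, u_q=70.0, u_q_next=0.0, lam=5.0, u_scale=u_scale)

# A small potential difference passes through unclipped.
small = shaped_reward(r=0.0, u_q=10.0, u_q_next=10.0, lam=1.0, u_scale=10.0)
```

Because the clip bounds the normalized term to [-1, +1], the shaping contribution to any single update never exceeds λ(e) in magnitude, regardless of how heavy-tailed the raw potential differences are.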

The second component is the adaptive shaping coefficient. The weight

λ (e)

is decayed exponentially from a large initial value to a small residual:

λ (e) = λ_{min} + (λ_{max} - λ_{min}) exp (- β e),

(8)

with defaults

λ_{min} = 0.5

,

λ_{max} = 5.0

, and

β = 0.005

(tuneable via the sensitivity study, Section 5.7). Early episodes receive a strong APF signal; as e grows, the shaping weight approaches its residual value and the agent increasingly acts on its learned Q-values.
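The schedule of Equation (8) with the stated defaults can be evaluated directly; the sketch below is illustrative, and the function name `shaping_weight` is hypothetical.

```python
import math

def shaping_weight(e, lam_min=0.5, lam_max=5.0, beta=0.005):
    """Equation (8): exponential decay from lam_max toward the residual lam_min."""
    return lam_min + (lam_max - lam_min) * math.exp(-beta * e)

# The excess weight (lam - lam_min) halves every ln(2)/beta episodes.
half_life = math.log(2.0) / 0.005

schedule = {e: shaping_weight(e) for e in (0, 200, 500, 1500)}
```

With β = 0.005 the excess weight above the residual halves roughly every 139 episodes, so by the end of a 1500-episode run the coefficient is within about 0.01 of its residual value of 0.5.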

The third component is APF-weighted softmax exploration. During exploration (probability

ε

), an action is sampled from a softmax over the negative APF potential at each candidate successor cell rather than from a uniform distribution:

P (a_{i} ∣ s) = \frac{exp (- (U (q_{a_{i}}^{'}) - U_{min}) / T)}{\sum_{j} exp (- (U (q_{a_{j}}^{'}) - U_{min}) / T)},

(9)

where

q_{a_{i}}^{'}

is the successor cell reached by candidate action

a_{i}

,

U_{min} = {min}_{j} U (q_{a_{j}}^{'})

is the minimum successor potential subtracted for numerical stability and

T > 0

is the softmax temperature controlling the exploration–exploitation balance; the resulting distribution is mixed with a

10 %

uniform floor to maintain full coverage. The temperature T decays geometrically (

T \leftarrow max (T_{min}, T \cdot T_{decay})

per episode) from

T_{0} = 2.0

to

T_{min} = 0.3

. An anti-deadlock monitor additionally forces exploration (probability raised to at least

0.5

) whenever the robot has not reduced

d_{goal}

by more than

1.0

cell over the preceding 15 steps.
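The sampling distribution of Equation (9) can be sketched as follows. The uniform floor is interpreted here as a convex mixture (90% softmax, 10% uniform), which is an assumption about the implementation, and the decay factor `t_decay = 0.995` is a hypothetical value because it is not stated in this section.

```python
import numpy as np

def apf_softmax_probs(u_succ, temperature, uniform_floor=0.10):
    """Equation (9) over successor potentials, mixed with a uniform floor."""
    u = np.asarray(u_succ, dtype=float)
    logits = -(u - u.min()) / temperature   # subtracting U_min keeps exp() bounded
    p = np.exp(logits)
    p /= p.sum()
    return (1.0 - uniform_floor) * p + uniform_floor / u.size

def decay_temperature(t, t_min=0.3, t_decay=0.995):
    """Per-episode update T <- max(T_min, T * T_decay); t_decay is assumed."""
    return max(t_min, t * t_decay)

# Four candidate successors with increasing potential, at the initial T_0 = 2.0.
probs = apf_softmax_probs([1.0, 2.0, 3.0, 4.0], temperature=2.0)

rng = np.random.default_rng(0)
action = int(rng.choice(len(probs), p=probs))
```

Under this interpretation every action retains a probability of at least 0.1/|A|, which is what maintains full coverage even when the temperature has decayed to its floor.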

The same progress monitor is also used during evaluation to label empirical unreachable-goal outcomes. A single no-progress window triggers forced exploration, whereas an unreachable label is assigned only if the no-progress condition persists over

N_{stag}

consecutive monitoring windows or if the maximum horizon is reached. Thus, ordinary temporary stalls are treated as exploration events, while persistent no-path behavior is reported as a safe-failure mode rather than as indefinite oscillation.

Algorithm 2 QAPF learning algorithm

1:: Initialize $Q (s, a) \leftarrow Q_{0} = 5.0$ for all $s, a$ ; calibrate $U_{Δ, scale}$ from $| Δ U |$ (2000-step warm-up)
2:: Set $η, γ, ε_{0}, ε_{min}, ε_{decay}, λ_{min}, λ_{max}, β, T_{0}, T_{min}$
3:: for episode $e = 1, \dots, E_{max}$ do
4:: Initialize $s \leftarrow s_{0}$ from current training-map seed; reset obstacles and start/goal
5:: Set $t \leftarrow 0$ , done ← false and terminal_status ← $\emptyset$
6:: Reset stuck-history buffer $H_{d_{g}} \leftarrow \emptyset$ and persistent-stagnation counter $C_{stag} \leftarrow 0$
7:: while not done and $t < T_{max}$ do
8:: Update $H_{d_{g}}$ ; determine if stuck (no progress over last $W_{stag} = 15$ steps)
9:: if stuck then $C_{stag} \leftarrow C_{stag} + 1$ else $C_{stag} \leftarrow 0$ end if
10:: if (not eval_mode or stuck) and rand < $max (ε, 0.5 \cdot 1 [stuck])$ then
11:: Sample a from Equation (9) mixed with $10 %$ uniform
12:: else
13:: Select $a = arg {max}_{a^{'}} [Q (s, a^{'}) - w_{g} \tilde{U} (a^{'})]$ (Equation (11))
14:: end if
15:: Execute a, obtain $s^{'}, r$
16:: Compute shaped reward $\tilde{r}$ via Equation (7)
17:: Update Q using Equation (6) with $\tilde{r}$
18:: $s \leftarrow s^{'}$
19:: $t \leftarrow t + 1$
20:: if $C_{stag} \geq N_{stag}$ in evaluation mode then set done ← true and terminal_status ← Stagnation-Unreachable end if
21:: end while
22:: if $t \geq T_{max}$ and not goal_reached and not collision then terminal_status ← Timeout-Unreachable end if
23:: $ε \leftarrow max (ε_{min}, ε \cdot ε_{decay})$ ; $T \leftarrow max (T_{min}, T \cdot T_{decay})$
24:: end for
25:: return $π^{*} (s) = arg {max}_{a} [Q (s, a) - w_{g}^{eval} \tilde{U} (a)]$

4.4. Hybrid Q+APF Action Scoring

At decision time, the agent combines learned Q-values with APF-based guidance through the following score. For each candidate action

a_{i}

with successor potential

U_{i} : = U (q_{a_{i}}^{'})

, let

{\tilde{U}}_{i} = \frac{U_{i} - \bar{U}}{max ({max}_{j} U_{j} - {min}_{j} U_{j}, ϵ_{U})}, \bar{U} = \frac{1}{| A |} \sum_{j} U_{j}, ϵ_{U} = 10^{- 9},

(10)

i.e.,

\tilde{U}

defined by Equation (10) is zero-centered and range-normalized. The small constant

ϵ_{U}

prevents division by zero in flat local fields where all candidate successor cells have identical or numerically indistinguishable potential. This is important: with raw

U_{i}

, the per-action Q-values become confounded with the absolute potential scale, and in obstacle-free regions, all actions receive essentially identical APF terms. Using the zero-centered

\tilde{U}

, only the relative ordering of potentials matters; this is precisely the gradient information that the framework is designed to inject.

The hybrid score is then

score (s, a_{i}) = Q (s, a_{i}) - w_{g} \cdot {\tilde{U}}_{i},

(11)

with

w_{g} = w_{g}^{train} = 1.2

during training and

w_{g} = w_{g}^{eval} = 2.0

at evaluation. The higher evaluation weight produces more conservative, gradient-following behavior on held-out maps, where the Q-table has not been specifically tuned.
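The normalization of Equation (10) and the score of Equation (11) can be sketched as below. The Q-values and successor potentials are made-up numbers chosen only to show the mechanics, and the helper names are hypothetical.

```python
import numpy as np

def normalized_potential(u_succ, eps_u=1e-9):
    """Equation (10): zero-centered, range-normalized successor potentials."""
    u = np.asarray(u_succ, dtype=float)
    return (u - u.mean()) / max(u.max() - u.min(), eps_u)

def hybrid_action(q_row, u_succ, w_g):
    """Equation (11): argmax of Q(s, a) - w_g * U_tilde(a); returns (action, scores)."""
    scores = np.asarray(q_row, dtype=float) - w_g * normalized_potential(u_succ)
    return int(np.argmax(scores)), scores

# Uninformative Q-row: the APF term alone picks the lowest-potential successor.
a_eval, _ = hybrid_action([5.0, 5.0, 5.0, 5.0], [10.0, 12.0, 8.0, 11.0], w_g=2.0)

# Flat local field: the normalized term vanishes and the Q-values decide.
a_flat, _ = hybrid_action([1.0, 3.0, 2.0, 0.0], [3.0, 3.0, 3.0, 3.0], w_g=2.0)
```

Because the normalized potentials sum to zero and span at most one unit, the weight w_g is directly comparable to differences between Q-values, independently of the absolute potential scale.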

4.5. CBF-Inspired Action Filter with Visit Memory

The QAPF policy is augmented with a discrete CBF-inspired safety filter that also maintains a per-episode visit memory to prevent oscillation loops, a practical failure in which the robot repeatedly selects the same stuck action. Let the discrete barrier function be

h (q) = ρ (q) - ϵ_{coll} = min_{o \in O} ∥ q - o ∥ - ϵ_{coll},

(12)

so the safe set

C = {q : h (q) \geq δ_{safe}}

is defined via the barrier function

h (q)

of Equation (12), with

δ_{safe} = 0.3

cells (an inflated-margin parameter chosen empirically to cover one-step transition uncertainty; its role in robust safety is analyzed in Section 5.5). A visit counter

V_{e} (q, a)

tracks, within the current episode e, how many times the agent has executed a from cell

q

. An action is forbidden at

q

if

V_{e} (q, a) \geq M

(default

M = 3

).

The filter, given the QAPF-nominal action

a^{★}

, proceeds in five phases (Algorithm 3): (1) evaluate h and U at each successor cell; (2) build safe-mask

M_{s} = {a : h (q_{a}^{'}) \geq δ_{safe}}

and forbidden-mask

M_{f} = {a : V_{e} (q, a) \geq M}

; (3) if

a^{★} \in M_{s} ∖ M_{f}

, execute

a^{★}

; (4) else, if

M_{s} ∖ M_{f} \neq \emptyset

, execute

arg {min}_{a \in M_{s} ∖ M_{f}} U (q_{a}^{'})

, the progress-maximizing safe unforbidden action; (5) otherwise, fall back to

arg {min}_{a \in M_{s}} U

, or, if

M_{s} = \emptyset

, to

arg {max}_{a} h (q_{a}^{'})

.

Unlike the continuous-time CBF-QP formulation of [15], our discrete variant does not carry a formal forward-invariance proof. Nevertheless, combined with the inflated margin

δ_{safe}

and the visit memory, it reduces collisions to

0.3 %

on the held-out evaluation suite across all 30 seeds (a

\sim 20 \times

reduction relative to QAPF without the filter, see Section 5.3) and gracefully degrades under noise (Section 5.5).

Algorithm 3 CBF-inspired action filter with visit memory

1:: Input: nominal action $a^{★}$ , position $q$ , visit counter $V_{e}$
2:: For each $a \in A$ : $q_{a}^{'} \leftarrow q + Δ_{a}$ ; $h_{a} \leftarrow h (q_{a}^{'})$ ; $U_{a} \leftarrow U (q_{a}^{'})$
3:: $M_{s} \leftarrow {a : h_{a} \geq δ_{safe}}$ ; $M_{f} \leftarrow {a : V_{e} (q, a) \geq M}$
4:: if $a^{★} \in M_{s}$ and $a^{★} \notin M_{f}$ then
5:: $a^{chosen} \leftarrow a^{★}$
6:: else if $M_{s} ∖ M_{f} \neq \emptyset$ then
7:: $a^{chosen} \leftarrow arg {min}_{a \in M_{s} ∖ M_{f}} U_{a}$
8:: else if $M_{s} \neq \emptyset$ then
9:: $a^{chosen} \leftarrow arg {min}_{a \in M_{s}} U_{a}$
10:: else
11:: $a^{chosen} \leftarrow arg {max}_{a \in A} h_{a}$
12:: end if
13:: $V_{e} (q, a^{chosen}) \leftarrow V_{e} (q, a^{chosen}) + 1$
14:: return $a^{chosen}$
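Algorithm 3 translates almost line by line into code. The sketch below is a minimal illustration rather than the reference implementation: the action encoding, the attractive-only toy potential and the single-obstacle map are assumptions, and, as in the listing, successor cells are not clipped to the grid.

```python
from collections import defaultdict
import numpy as np

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

def barrier(q, obstacles, eps_coll=1.5):
    """h(q) of Equation (12): nearest-obstacle distance minus the collision radius."""
    return float(np.min(np.linalg.norm(obstacles - q, axis=1))) - eps_coll

def cbf_filter(a_nom, q, visits, obstacles, potential, delta_safe=0.3, max_visits=3):
    """Return the executed action for nominal action a_nom at position q."""
    q = np.asarray(q, dtype=float)
    cell = tuple(q)
    succ = {a: q + np.asarray(d, dtype=float) for a, d in ACTIONS.items()}
    h = {a: barrier(p, obstacles) for a, p in succ.items()}
    u = {a: potential(p) for a, p in succ.items()}
    safe = [a for a in ACTIONS if h[a] >= delta_safe]                    # M_s
    free = [a for a in safe if visits[(cell, a)] < max_visits]           # M_s \ M_f
    if a_nom in free:
        chosen = a_nom
    elif free:
        chosen = min(free, key=u.get)       # lowest-potential safe, unforbidden action
    elif safe:
        chosen = min(safe, key=u.get)       # ignore the visit memory
    else:
        chosen = max(ACTIONS, key=h.get)    # least-unsafe action
    visits[(cell, chosen)] += 1
    return chosen

# Toy scene (assumed): one obstacle directly to the right of the robot.
obstacles = np.array([[6.0, 5.0]])
goal = np.array([10.0, 6.0])

def potential(p):
    return 0.5 * float(np.sum((p - goal) ** 2))   # attractive term only

visits = defaultdict(int)
first = cbf_filter(3, (4, 5), visits, obstacles, potential)   # "right" is unsafe
```

In this toy scene the nominal move toward the obstacle is replaced by the upward move, and once that move has been executed three times from the same cell the visit memory forbids it and the filter falls through to the next-best safe action.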

The full evaluation procedure that combines the trained QAPF policy with the discrete CBF-inspired filter is summarized in Algorithm 4.

Algorithm 4 QAPF+CBF evaluation wrapper

1:: Input: trained Q-table from Algorithm 2; environment; map seed
2:: Output: terminal status ∈ {Goal, Collision, Timeout-Unreachable, Stagnation-Unreachable}
3:: Initialize $s \leftarrow s_{0}$ ; set $t \leftarrow 0$ , done ← false and terminal_status ← $\emptyset$ ; reset visit counter $V_{e} \leftarrow 0$ ; reset stagnation counter $C_{stag} \leftarrow 0$
4:: while not done and $t < T_{max}$ do
5:: Compute QAPF nominal action $a^{★} \leftarrow arg {max}_{a^{'}} [Q (s, a^{'}) - w_{g}^{eval} \tilde{U} (a^{'})]$ (Equation (11))
6:: Apply CBF-inspired filter: $a^{exec}$ ← CBFFilter $(a^{★}, q, V_{e})$ (Algorithm 3)
7:: Execute $a^{exec}$ , obtain $s^{'}$ , r, $ρ_{min} (q^{'})$
8:: Update stagnation counter $C_{stag}$ per Equation (16)
9:: if goal reached then terminal_status $\leftarrow$ Goal; done ← true end if
10:: if $ρ_{min} (q^{'}) < ϵ_{coll}$ then terminal_status $\leftarrow$ Collision; done ← true end if
11:: if $C_{stag} \geq N_{stag}$ then terminal_status ← Stagnation-Unreachable; done ← true end if
12:: $s \leftarrow s^{'}$ ; $t \leftarrow t + 1$
13:: end while
14:: if not done and $t \geq T_{max}$ then
15:: terminal_status ← Timeout-Unreachable
16:: end if
17:: return terminal_status

4.6. Energy-Aware Velocity Modulation

An explicit sigmoidal velocity-modulation law allows the robot to reduce kinetic energy in high-gradient, obstacle-proximate regions while retaining approximately full speed in flat, free-space regions. The module is formulated as three equations: a kinetic-plus-jerk energy per step (Equation (13)), a normalized gradient magnitude (Equation (14)) and a full-speed-at-zero-gradient sigmoidal speed law (Equation (15)):

E1 (kinetic + jerk energy per step): E (t) = c_{v} {v (t)}^{2} + c_{a} {(v (t) - v (t - 1))}^{2},

(13)

E2 (normalized gradient magnitude): \tilde{g} (q) = \frac{∥ \nabla U (q) ∥}{G_{scale}},

(14)

E3 (sigmoidal speed law): v (q) = v_{min} + (v_{max} - v_{min}) \frac{2}{1 + exp (k_{E} \tilde{g} (q))},

(15)

where

v_{min} = 0.05

is a hard speed floor preventing degenerate standstill,

v_{max} = 1.0

,

c_{v} = 1.0

,

c_{a} = 0.5

and

k_{E}

is the regime-selecting coefficient (constant:

k_{E} = 10^{- 3}

; APF-mod:

k_{E} = 0.5

; APF-agg:

k_{E} = 1.5

). The law satisfies

v (q) = v_{max}

when

\tilde{g} (q) = 0

and smoothly approaches

v_{min}

as the normalized gradient increases. The gradient normalization constant

G_{scale}

is auto-calibrated as the 95th percentile of

∥ \nabla U ∥

observed during a 500-step random-walk pass over the same obstacle field.

The normalization E2 is essential: because the raw gradient

∥ \nabla U ∥

scales with

k_{rep} \approx 100

, an unnormalized law would saturate the exponential term at nearly every step and drive

v (q)

to the speed floor

v_{min}

, producing excessive navigation time. E2 rescales the gradient to a bounded typical range

[0, 1.5]

, restoring meaningful velocity modulation. The total episode energy is

E_{ep} = \sum_{t = 1}^{T} E (t)

, reported both absolutely and as a ratio

E_{ep} / E_{const}

relative to the constant-speed baseline.
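The speed law and the per-step energy of Equations (13) to (15) can be sketched as follows. The exponent clamp and the initial previous speed `v0` are assumptions added for numerical robustness and for the first step, respectively; neither is specified in the text.

```python
import math

def speed(g_norm, k_e, v_min=0.05, v_max=1.0):
    """Equation (15); the exponent is clamped (assumption) to avoid overflow."""
    z = min(k_e * g_norm, 700.0)
    return v_min + (v_max - v_min) * 2.0 / (1.0 + math.exp(z))

def step_energy(v, v_prev, c_v=1.0, c_a=0.5):
    """Equation (13): kinetic term plus penalty on the speed change."""
    return c_v * v ** 2 + c_a * (v - v_prev) ** 2

def episode_energy(speeds, v0):
    """E_ep as the sum of Equation (13) over the episode; v0 is an assumed v(0)."""
    total, prev = 0.0, v0
    for v in speeds:
        total += step_energy(v, prev)
        prev = v
    return total

# Same normalized gradient, three regimes: constant, APF-mod, APF-agg.
regimes = {name: speed(1.0, k) for name, k in
           (("constant", 1e-3), ("APF-mod", 0.5), ("APF-agg", 1.5))}

# An unnormalized gradient of order k_rep ~ 100 pins the speed at the floor.
saturated = speed(100.0, 1.5)
```

At a normalized gradient of 1.0 the aggressive regime already reduces the speed to roughly 40% of v_max, whereas feeding a raw gradient of order 100 into the same law collapses the speed to v_min, which is the saturation that E2 is introduced to avoid.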

4.7. Empirical Unreachable-Goal Detection

Pure APF and APF/RL hybrids share a well-known failure mode in which the gradient field admits a stable non-goal equilibrium [3,7]: the agent oscillates indefinitely without ever terminating. To convert this into a labeled safe-failure outcome rather than an indefinite non-termination, an empirical unreachable-goal detection mechanism is integrated directly at the policy level, in addition to the standard goal-reaching and collision termination criteria. Let

d_{t} = ∥ q_{t} - q_{f} ∥

denote the goal distance at step t. A stagnation window is detected when

max_{τ \in {t - W_{stag}, \dots, t}} d_{τ} - min_{τ \in {t - W_{stag}, \dots, t}} d_{τ} < Δ d_{stag},

(16)

and the robot remains inside a repeated local region during the same window (default

W_{stag} = 15

steps,

Δ d_{stag} = 1.0

cell). A single stagnation window does not yet trigger termination; instead, it raises the exploration probability to

ε \geq 0.5

and resamples actions from the APF-weighted softmax of Equation (9), which is the same anti-deadlock monitor invoked inside Algorithm 2. Only when the stagnation condition persists over

N_{stag} = 3

consecutive monitoring windows is the episode terminated and labeled as Stagnation-Unreachable. If the maximum horizon

T_{max} = 1000

is reached without goal arrival or collision, the episode is labeled as Timeout-Unreachable. The two thresholds are chosen so that recoverable temporary stalls are absorbed by exploration boosts, while persistent no-path behavior is reported as a safe-failure mode.
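The detector can be sketched as a small sliding-window monitor. The sketch follows the per-step counter of Algorithm 2 (the counter increments on every step whose window is stagnant and resets otherwise), which is one reading of "consecutive monitoring windows"; the additional repeated-local-region condition is omitted, and the class name is hypothetical.

```python
from collections import deque

class StagnationMonitor:
    """Sliding-window test of Equation (16) with a persistence counter."""

    def __init__(self, window=15, delta_d=1.0, n_stag=3):
        self.hist = deque(maxlen=window + 1)   # d_tau for tau in {t-W, ..., t}
        self.delta_d = delta_d
        self.n_stag = n_stag
        self.count = 0

    def update(self, d_goal):
        """Record d_t; return (stuck, unreachable) for the current step."""
        self.hist.append(float(d_goal))
        full = len(self.hist) == self.hist.maxlen
        stuck = full and (max(self.hist) - min(self.hist)) < self.delta_d
        self.count = self.count + 1 if stuck else 0
        return stuck, self.count >= self.n_stag

# Oscillation between two cells: goal distance alternates 10.0 / 10.5.
monitor = StagnationMonitor()
trace = [monitor.update(10.0 + 0.5 * (t % 2)) for t in range(18)]

# Steady progress of one cell per step never triggers the detector.
progress = StagnationMonitor()
never_stuck = not any(progress.update(30.0 - t)[0] for t in range(30))
```

In the oscillating trace the first stagnant window is flagged once 16 distance samples are available, which would trigger forced exploration, and the unreachable label follows two steps later if nothing changes.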

Unlike a graph-theoretic reachability oracle (e.g., A* [32]), this detector is empirical: it does not certify that no feasible path exists in the underlying grid graph. It is, however, sufficient at the policy level to ensure that blocked-goal, sealed-corridor and dead-end concave maps produce a clear terminal status with a bounded detection time, which is the property required by a deployable navigation controller. Empirical detection rates on the no-path test suite are reported in Appendix A.2 and are between

96.5 %

and

98.7 %

across the three no-path map families, with a false-unreachable rate of

1.0 \pm 0.6 %

on reachable held-out maps.

4.8. Multi-Robot Cooperative Extension

For the multi-robot setting, each robot k augments its APF repulsive field with a virtual inter-robot repulsive component:

U_{rep}^{MR} (q_{k}) = \sum_{j \neq k} \frac{1}{2} k_{rep}^{MR} {(\frac{1}{∥ q_{k} - q_{j} ∥} - \frac{1}{ρ_{0}^{MR}})}^{2} 1 [∥ q_{k} - q_{j} ∥ < ρ_{0}^{MR}],

(17)

with

k_{rep}^{MR} = 20

,

ρ_{0}^{MR} = 3

cells and

q_{j}

being the current position of the other robot(s). The augmented potential

U (q_{k}) + U_{rep}^{MR} (q_{k})

replaces U in the hybrid score of Equation (11); when combined with the per-robot CBF filter of Section 4.5, the joint policy achieves a

92.1 %

joint success rate (Table 14).
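The inter-robot term of Equation (17) can be evaluated with a few lines of code. The sketch below uses the stated defaults; the small distance floor `eps` is an assumption added to avoid division by zero when two robots coincide.

```python
import numpy as np

def inter_robot_potential(q_k, others, k_rep=20.0, rho0=3.0, eps=1e-9):
    """Equation (17): sum of repulsive terms from robots closer than rho0."""
    q_k = np.asarray(q_k, dtype=float)
    total = 0.0
    for q_j in others:
        d = float(np.linalg.norm(q_k - np.asarray(q_j, dtype=float)))
        if d < rho0:                       # indicator 1[||q_k - q_j|| < rho0]
            d = max(d, eps)
            total += 0.5 * k_rep * (1.0 / d - 1.0 / rho0) ** 2
    return total

near = inter_robot_potential((0.0, 0.0), [(1.5, 0.0)])            # inside rho0
far = inter_robot_potential((0.0, 0.0), [(3.0, 0.0), (5.0, 5.0)])  # outside rho0
```

A neighbor at 1.5 cells contributes about 1.11 to the potential, and the contribution vanishes at and beyond the influence distance of 3 cells, so distant robots do not perturb the hybrid score.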

4.9. Three-Dimensional Workspace Extension

The 2D MDP and APF formulation of Section 3 and Section 4 is naturally extended to a three-dimensional workspace in which the robot position becomes

q \in R^{3}

. Three families of components are affected by the dimensional change. The continuous-state components, namely the attractive potential

U_{att}

, the repulsive potential

U_{rep}

, the gradient-based force

F = - \nabla U

, the CBF barrier

h (q)

and the energy module of Equations (13) to (15), are all extended without modification because only a metric and an obstacle geometry are required and both are defined naturally in

R^{3}

. The discrete state encoding of Equation (1) is generalized by adopting the 3D 10-tuple

(x_{bin}, y_{bin}, z_{bin}, β_{g}^{az}, β_{g}^{el}, β_{o}^{az}, β_{o}^{el}, d_{o}, {\dot{d}}_{o}, {\hat{d}}_{o})

in which

z_{bin}

is the binned vertical coordinate,

β_{g}^{az}

and

β_{g}^{el}

are the azimuth and elevation bearings to the goal,

β_{o}^{az}

and

β_{o}^{el}

are the corresponding bearings to the nearest obstacle, and the remaining distance and predicted-distance bins are retained from the 2D formulation. The action space is extended from

| A | = 4

cardinal moves in the plane to

| A | = 6

cardinal moves in 3D, with an optional

| A | = 26

king-move set for smoother trajectories at higher computational cost. The shaping form of Equation (7) and the CBF filter of Algorithm 3 are unchanged because both depend only on the potential and on the barrier function, both of which are heading- and dimension-invariant. A scalability and complexity analysis of the same 3D extension is provided in Appendix A.4.
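The two 3D action sets can be enumerated directly, which also confirms the stated cardinalities. The function name `action_set_3d` is hypothetical.

```python
from itertools import product

def action_set_3d(king_moves=False):
    """Unit offsets in 3D: 6 axis-aligned moves, or all 26 non-zero neighbors."""
    offsets = [d for d in product((-1, 0, 1), repeat=3) if any(d)]
    if king_moves:
        return offsets
    return [d for d in offsets if sum(abs(c) for c in d) == 1]

cardinal = action_set_3d()
king = action_set_3d(king_moves=True)
```

The king-move set is the full 3 × 3 × 3 neighborhood minus the zero offset, so each decision evaluates 26 successors instead of 6, which is the source of the higher per-step cost mentioned above.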

4.10. Implementation Specifications

Table 3 summarizes all implementation parameters and reflects the values actually used in the reference implementation.

5. Experiment and Analysis

This section presents the experimental evaluation of the proposed QAPF and QAPF+CBF policies. Section 5.1 describes the experimental setup, simulation parameters and evaluated methods, and Section 5.2 lists the modeling assumptions. Section 5.3 reports the main static-obstacle comparison and the multi-scenario evaluation across static, dynamic and narrow-passage settings. Section 5.4 reports the empirical unreachable-goal and stagnation detection study. Section 5.5 reports the robustness analysis under observation noise, actuator slip and external drift, accompanied by a conditional/high-probability inflated-barrier safety statement. Section 5.6 reports the per-decision inference timing on an Intel i7 reference and the projected embedded-device latencies. Section 5.7 reports the ablation, sensitivity, energy, multi-robot, curriculum and multi-axis generalization studies. Section 5.8 reports the three-dimensional workspace evaluation, and Section 5.9 discusses the limitations and threats to validity. The main comparative metrics are summarized in Table 5.

All reported metrics are obtained from held-out evaluation suites (seed offsets disjoint from training) rather than from raw training trajectories. Unless otherwise stated, the main comparative, scenario, ablation and generalization experiments use 30 independent seeds with up to 1500 training episodes. Protocol-specific exceptions are stated where they occur: the multi-robot experiment uses 1000 episodes per robot, the robustness study uses five training seeds × 30 noisy evaluation episodes per regime, and the timing study uses repeated per-decision timing calls rather than success-rate training seeds. The held-out evaluation additionally records terminal status as goal, collision, Timeout-Unreachable or Stagnation-Unreachable. The latter two labels are used only when the agent neither reaches the goal nor collides, and the maximum horizon or persistent-stagnation criterion is satisfied. This status variable prevents impossible-goal episodes from being conflated with ordinary failed navigation trials. All simulation experiments reported in this section are run on a

50 \times 50 \times 1

voxel grid in which the vertical coordinate is held fixed at

z = 0

. The 2D plane is therefore the operational evaluation domain throughout Section 5, and the 3D-on-2D-plane sanity check that exercises the dimensional generalization of QAPF+CBF on the same protocol is reported separately in Section 5.8. A real-aerial validation of the 3D formulation, in which a quadrotor drone is flown through a structured indoor environment, is left for future work and is described in Section 6.

5.1. Experimental Setup and Evaluated Methods

The simulation parameters and experimental configuration used throughout this paper are summarized in Table 4.

The following algorithms are compared: APF-Only [6] (greedy gradient descent on

U_{total}

), Standard Q-Learning [8], Efficient Q-Learning (EQL) [13] (optimistic initialization, slow

ε

decay), Conservative Q-Learning (CQL) [33], Deep Q-Network (DQN) [9], QAPF (ours) and QAPF+CBF (ours).

5.2. Assumptions

The simulation and learning framework adopted in this study is based on the following assumptions:

1.: Nominal state observation: The robot has access to its position and to the positions of all obstacles located within the influence distance $ρ_{0}$ . Robustness to bounded observation noise and actuator uncertainty is analyzed in Section 5.5.
2.: Deterministic nominal transitions: Under nominal conditions, executing a discrete action moves the robot deterministically by one grid cell in the selected direction. Robustness to bounded actuator slip and external disturbance is analyzed in Section 5.5.
3.: Known obstacle geometry: Obstacles are represented as point objects with a fixed collision radius of $ϵ_{coll} = 1.5$ cells.
4.: Dynamic obstacle model: Dynamic obstacles are assumed to move with constant velocity along fixed linear trajectories and to follow reflective boundary conditions at the workspace limits.
5.: Discrete-action space: The robot selects its motion command from the four cardinal actions $A = {↑, ↓, \leftarrow, \to}$ .

5.3. Main Results and Multi-Scenario Evaluation

Table 5 presents the primary comparison in the default static-obstacle setting. The metrics reported in the table are defined as follows. The success rate (SR) is computed as the fraction of held-out evaluation episodes terminating in the goal-reached outcome. The collision rate (CR) is the fraction of episodes terminating with

ρ_{min} < ϵ_{coll}

. The minimum clearance is the mean over held-out episodes of

{min}_{t} ρ_{min} (q_{t})

in grid cells, where collision episodes are still counted through the CR, and the clearance value should therefore be interpreted jointly with the CR rather than as a standalone safety metric. The convergence episode is an estimated plateau-crossing episode obtained by linearly interpolating the smoothed held-out SR curve between logged evaluation checkpoints; it is therefore not constrained to be a multiple of the 50-episode logging interval. The plateau threshold is defined as the first episode at which the smoothed trailing held-out SR is within five absolute percentage points of the final/asymptotic mean.
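The plateau-crossing rule can be sketched as follows. The checkpoint curve is synthetic and is not taken from the reported experiments, and the use of the last five checkpoints as the asymptotic mean is an assumption; only the five-percentage-point tolerance and the linear interpolation between checkpoints come from the definition above.

```python
import numpy as np

def convergence_episode(episodes, sr_smoothed, tol_pp=5.0, tail=5):
    """First (interpolated) episode at which the smoothed SR enters the plateau band."""
    ep = np.asarray(episodes, dtype=float)
    sr = np.asarray(sr_smoothed, dtype=float)
    target = sr[-tail:].mean() - tol_pp        # lower edge of the plateau band
    idx = int(np.argmax(sr >= target))         # first checkpoint inside the band
    if idx == 0:
        return float(ep[0])
    x0, x1, y0, y1 = ep[idx - 1], ep[idx], sr[idx - 1], sr[idx]
    return float(x0 + (target - y0) * (x1 - x0) / (y1 - y0))

# Synthetic smoothed success-rate curve logged every 50 episodes (illustrative only).
checkpoints = np.arange(50, 500, 50)
curve = [40.0, 60.0, 80.0, 90.0, 94.0, 94.0, 94.0, 94.0, 94.0]
crossing = convergence_episode(checkpoints, curve)
```

For this synthetic curve the band edge is 89%, which is crossed between the checkpoints at 150 and 200 episodes, giving an interpolated value of 195 episodes. This illustrates why reported convergence episodes need not be multiples of the logging interval.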

In the held-out evaluation, the highest success rate of

94.5 \pm 2.1 %

is achieved by the proposed QAPF, and all baseline methods are significantly outperformed. A success rate of

93.8 \pm 2.3 %

with a near-zero collision rate of

0.3 \pm 0.4 %

is achieved by QAPF+CBF, which demonstrates that collisions are reduced to a near-zero level by the CBF safety filter without substantially compromising task completion. Among the baselines, EQL performs best at

86.2 %

using optimistic Q-value initialization and slower exploration decay, followed by DQN at

84.7 %

, CQL at

82.5 %

and Std. QL at

78.3 %

. APF-Only does not learn and therefore has no convergence episode in the learning sense. Its empirical success-rate trace is observed to stabilize within approximately 50 evaluation episodes at

72.8 %

as a result of local-minima trapping; this stabilization index is reported only as descriptive context for the figure and is not comparable to the learning-convergence values listed for the Q-learning baselines. The comparison between QAPF (

6.2 %

coll.) and QAPF+CBF (

0.3 %

coll.) is internal to the proposed framework because the prior QAPF reference [12] does not report a comparable held-out collision rate. Accordingly, the approximately

20 \times

reduction is described as an internal-ablation effect rather than as an external comparison against published baselines. Two distinct effects of the CBF-inspired filter are made visible by this internal contrast: (i) the safe-mask

M_{s}

pushes the collision rate down by an order of magnitude and (ii) the visit memory

M_{f}

keeps the success rate competitive by preventing oscillation deadlocks that would otherwise time out.

Figure 3 shows the learning curves. QAPF reaches its plateau after approximately 205 episodes and QAPF+CBF by approximately 230 episodes. This is substantially shorter than the ∼3 ×

10^{6}

episodes reported by the original QAPF of Orozco-Rosas et al. [12]; however, the comparison is made across different reported protocols and is therefore interpreted as a protocol-dependent convergence-horizon contrast rather than as a controlled one-to-one speedup.

5.3.1. Multi-Scenario Evaluation

Table 6 reports the results across static, dynamic and narrow-passage scenarios.

Across all three scenarios, the baselines are consistently outperformed by the QAPF variants. In the static scenario, the highest SR (

94.5 %

) is obtained by QAPF and the lowest CR (

0.3 %

) by QAPF+CBF. In the dynamic scenario, both the highest SR (

90.4 %

) and the lowest CR (

0.5 %

) are obtained by QAPF+CBF. In the narrow-passage scenario, QAPF (best SR

85.7 %

) and QAPF+CBF (best CR

0.3 %

) are essentially tied on success while QAPF+CBF wins decisively on safety.

5.3.2. Representative Trajectories, Safety Analysis and Reward Stability

Figure 4 shows representative trajectories from the trained QAPF+CBF agent; the curvilinear paths reflect the zero-centered hybrid score of Equation (11), which produces smooth deflection around obstacles rather than sharp corners. In the figure, static obstacles are indicated by red filled circles drawn at the collision radius

ϵ_{coll} = 1.5

cells; the start state

s_{0}

is shown by the green square and the goal position

q_{f}

by the gold star; the executed trajectory

{q_{t}}_{t = 0}^{T}

is plotted as a blue solid curve; the influence radius

ρ_{0} = 3.0

cells, inside which the repulsive potential is non-zero, is rendered as a light-blue shaded region around each obstacle. Abrupt corners in the trajectory would indicate Q-table dominance, which is suppressed at evaluation time by the higher APF-guidance weight

w_{g}^{eval}

. Figure 5 visualizes the barrier function

h (q)

along trajectories with and without the CBF filter: without CBF, h drops below the safety margin on multiple occasions; with CBF,

h \geq δ_{safe}

is maintained throughout the trajectory. Figure 6 shows that the raw shaping signal

Δ U

spans approximately

[- 70, + 70]

, while the clipped normalized signal

\tilde{Δ U}

is bounded to

[- 1, + 1]

, confirming the stabilizing role of the clipping in Equation (7).

Figure 7 presents the Pareto frontier: QAPF+CBF sits closest to the utopia point at

(0.3 % collisions, 93.8 % success)

, dominating every baseline along the safety axis while remaining within

0.7

pp of QAPF on success. The QAPF to QAPF+CBF transition trades

0.7

pp of success for an approximately 20-fold reduction in collisions, and the bottom-right baselines (Std QL, EQL, CQL) are clearly Pareto-dominated.

5.4. Empirical Unreachable-Goal and Stagnation Detection

To address cases in which the robot cannot reach the goal, an empirical unreachable-goal investigation is added to the evaluation protocol. This experiment is designed for practical failure recognition rather than for formal graph-theoretic reachability certification. Blocked-goal, sealed-corridor and dead-end concave maps are used as no-path cases. In these settings, a high success rate is not expected because the goal is infeasible under the four-neighbor grid action model. The desired behavior is instead safe termination with an unreachable label and a low collision-before-termination rate. The detailed detector definition and the empirical no-path table are provided in Appendix A.2.

5.5. Robustness Analysis Under Noise and Disturbances

A navigation system that is merely accurate under nominal conditions is of limited practical value if it collapses under the sensor noise and actuator imperfections that characterize real deployments. A dedicated robustness study is therefore conducted with both a theoretical component and a held-out experimental protocol.

5.5.1. Noise Model

Three independent noise channels are injected into the evaluation environment only (policies are trained on clean dynamics). Let

q_{t}

denote the robot’s true position at step t,

a_{t}

be the nominal action selected by the policy and

a_{t}^{exec}

be the action executed by the actuator:

\begin{matrix} {\tilde{q}}_{t} & = q_{t} + η_{t}^{obs}, η_{t}^{obs} \sim N (0, σ_{obs}^{2} I_{2}) & (observation noise), \end{matrix}

(18)

\begin{matrix} a_{t}^{exec} & = \{\begin{matrix} a_{t} & w . p . 1 - p_{act} \\ U (A) & w . p . p_{act} \end{matrix} & (actuator slip), \end{matrix}

(19)

\begin{matrix} q_{t + 1} & = q_{t} + Δ_{a_{t}^{exec}} + η_{t}^{ext}, η_{t}^{ext} \sim N (0, σ_{ext}^{2} I_{2}) & (external drift) . \end{matrix}

(20)

The perturbed observation

{\tilde{q}}_{t}

given by Equation (18) is used to recompute the state-space bins in Equation (1), so the noise actually affects policy output rather than being a no-op. The inflated-margin argument of Equations (21) and (22) is source-agnostic: it depends only on a bounded perturbation

∥ η ∥ \leq δ_{p}

of the position used to evaluate the barrier h, regardless of whether the perturbation originates from robot self-localization noise, obstacle-position noise or goal-position noise (see Appendix A.5 for the unified treatment). Consequently, the experimental sweep of Table 8, which injects noise on the robot pose used by the policy, operates as a worst-case proxy for diagonal-Gaussian obstacle- and goal-pose uncertainty of comparable or smaller magnitude, because the binned-state features driven directly by

q

(the absolute bins

x_{bin}, y_{bin}

of Equation (1)) are perturbed by robot-pose noise but not by obstacle- or goal-pose noise alone. Non-Gaussian or correlated obstacle/goal localization error (e.g., systematic SLAM drift) remains an open extension, as discussed in Section 5.9. The actuator slip of Equation (19) is applied to the nominal action, and the external drift of Equation (20) is applied after the nominal transition and before the termination re-check. Six primary regimes are reported in Table 7.
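The three channels of Equations (18) to (20) can be sketched as independent functions sharing one random generator. The action encoding and function names are assumptions, the noise magnitudes in the usage lines are placeholders rather than the values of Table 7, and grid clipping is omitted.

```python
import numpy as np

DELTAS = np.array([[0, 1], [0, -1], [-1, 0], [1, 0]], dtype=float)  # up, down, left, right

def noisy_observation(q, sigma_obs, rng):
    """Equation (18): isotropic Gaussian noise on the observed position."""
    return np.asarray(q, dtype=float) + rng.normal(0.0, sigma_obs, size=2)

def slipped_action(a, p_act, rng):
    """Equation (19): with probability p_act the action is redrawn uniformly from A."""
    return int(rng.integers(len(DELTAS))) if rng.random() < p_act else a

def noisy_transition(q, a_exec, sigma_ext, rng):
    """Equation (20): nominal one-cell move plus Gaussian external drift."""
    return np.asarray(q, dtype=float) + DELTAS[a_exec] + rng.normal(0.0, sigma_ext, size=2)

rng = np.random.default_rng(0)
q = np.array([5.0, 5.0])
q_obs = noisy_observation(q, sigma_obs=0.2, rng=rng)    # placeholder sigma
a_exec = slipped_action(3, p_act=0.15, rng=rng)
q_next = noisy_transition(q, a_exec, sigma_ext=0.1, rng=rng)
```

Because the uniform redraw in Equation (19) can return the nominal action itself, the probability that the executed action actually differs from the selected one is p_act (1 - 1/|A|), that is, 0.75 p_act for the four-action set.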

5.5.2. Theoretical Analysis

For the QAPF+CBF agent under observation noise, a deterministic guarantee can only be stated after replacing the Gaussian perturbation with a bounded or high-probability error radius. Let

δ_{p}

be a bound satisfying

Pr (∥ η^{obs} ∥ \leq δ_{p}) \geq p

. For Gaussian observation noise in two dimensions,

δ_{p}

may be chosen from the corresponding

χ_{2}^{2}

quantile. Conditional on this high-probability event, the safe set can be inflated by

δ_{p}

:

{\tilde{C}}_{p} = {q : h (q) \geq δ_{safe} + δ_{p}},

(21)

where

{\tilde{C}}_{p}

is the inflated high-probability safe set,

h (q)

is the discrete barrier function of Equation (12),

δ_{safe}

is the nominal safety margin and

δ_{p}

is the radius of the high-probability error ball at confidence level p. If the filter enforces

h ({\tilde{q}}_{t}) \geq δ_{safe} + δ_{p}

using the noisy observation, then by the reverse triangle inequality

h (q_{t}) = ρ (q_{t}) - ϵ_{coll} \geq ρ ({\tilde{q}}_{t}) - ∥ q_{t} - {\tilde{q}}_{t} ∥ - ϵ_{coll} \geq h ({\tilde{q}}_{t}) - δ_{p} \geq δ_{safe} \geq 0 .

(22)

Thus, the inflated-margin argument provides a conditional, high-probability safety statement for observation-side perturbations, not an unconditional guarantee for unbounded Gaussian noise: with probability at least p over the realization of

η^{obs}

, the filter enforces

h (q_{t}) \geq δ_{safe} \geq 0

; the complement event of probability at most

1 - p

corresponds to draws of

η^{obs}

that exceed the chosen radius

δ_{p}

and for which no safety claim is made. Actuator slip and external drift remain harder because they perturb the realized transition after the action has been selected; they cannot be eliminated by observation-side margin inflation alone.
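For isotropic two-dimensional Gaussian noise the χ² quantile has a closed form, since the squared norm divided by σ² follows a χ² distribution with two degrees of freedom whose CDF is 1 − exp(−x/2). The sketch below computes the resulting radius and the inflated margin of Equation (21); the value σ = 0.1 is a placeholder, not a value taken from Table 7.

```python
import math

def error_radius(sigma_obs, p):
    """delta_p such that Pr(||eta|| <= delta_p) = p for eta ~ N(0, sigma^2 I_2)."""
    return sigma_obs * math.sqrt(-2.0 * math.log(1.0 - p))

def inflated_margin(sigma_obs, p, delta_safe=0.3):
    """Threshold delta_safe + delta_p enforced on the noisy barrier value."""
    return delta_safe + error_radius(sigma_obs, p)

radius_95 = error_radius(1.0, 0.95)          # radius in units of sigma
margin = inflated_margin(sigma_obs=0.1, p=0.95)
```

At p = 0.95 the radius is about 2.45 σ, so under the placeholder σ = 0.1 cells the filter would need to enforce a margin of roughly 0.54 cells on the noisy barrier value for the chain of inequalities in Equation (22) to apply.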

5.5.3. Experimental Results

Table 8 deliberately reports the success rate only. The collision rate and minimum clearance under each noise regime are retained in the raw robustness logs available from the corresponding author upon reasonable request but are not tabulated here because the 150-episode per cell sample produces wide confidence bands on rare-event metrics: a single observed collision in 150 episodes corresponds to an empirical CR of

0.67 %

with a Wilson

95 %

upper bound near

3.6 %

, so the per regime CR contrasts at the sub-

1 %

level are not statistically resolvable at this episode budget. The full SR/CR/minimum-clearance noisy–safety benchmark over a larger episode budget is identified in Section 5.9 as the natural follow-up rather than as a current claim.
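
The width of these rare-event confidence bands can be reproduced with the standard Wilson score interval. The sketch below is a minimal illustration: the function name is hypothetical, and z = 1.96 is assumed for a two-sided 95 % interval.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# One collision observed in 150 episodes.
cr_low, cr_high = wilson_interval(1, 150)
# No failures observed in 150 episodes.
sr_low, sr_high = wilson_interval(150, 150)
print(f"CR upper bound: {100 * cr_high:.1f} %, SR lower bound: {100 * sr_low:.1f} %")
```

With 150 episodes per cell, a single observed collision gives an upper bound between 3.6 % and 3.7 %, and a perfect 150/150 record still leaves a lower bound of roughly 97.5 % on the true success probability.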

The success-rate degradation per noise regime is summarized graphically in Figure 8.

Three observations follow from the results. First, QAPF+CBF shows no observed success-rate degradation under the two observation-noise regimes in this finite suite (

100 %

SR in obs_low/obs_high); this is consistent with the inflated-safe-set argument, but it should not be interpreted as a universal guarantee for all Gaussian perturbations. Because the robustness suite contains 150 evaluation episodes per cell, a value of

100.0 \pm 0.0 %

means that no failures were observed in this finite sample, not that the true failure probability is zero. For example, a

150 / 150

binomial success count corresponds to an approximate

95 %

Wilson lower confidence bound of about

97.5 %

. Second, actuator slip is the hardest regime for all methods: even at

p_{act} = 0.15

, QAPF+CBF drops to

70 %

because the filter cannot guarantee execution of the safe action it selects; the slip is downstream of the decision and cannot be filtered. Third, under the realistic combined regime, QAPF+CBF retains

85 %

success, over 16 points higher than QAPF alone (

68.3 %

) and seven points higher than APF-Only (

78.3 %

), which is the most practically relevant robustness number in this study.

5.6. i7 Inference Timing and Embedded-Device Latency Projection

The per decision inference latency of each method is physically measured on a laptop-class Intel i7 reference and then projected to representative embedded and mobile device classes. The embedded-device values are projection estimates, not direct Jetson/DGX measurements.

5.6.1. Measurement Protocol

For each method–seed pair, the agent is trained under the standard protocol; a fresh evaluation environment is then instantiated, and the following sequence is executed: (i)

n_{warmup} = 50

untimed select_action calls to populate CPU caches and JIT paths; (ii)

n_{timed} = 2000

timed calls, each wrapped in time.perf_counter() with microsecond resolution. The median of the 2000 samples is reported in Table 9 to provide a robust summary against GC-induced outliers. The

p95

and

p99

latency values are retained in the raw timing output available from the corresponding author upon reasonable request but are not tabulated in this manuscript to keep the main table compact. The throughput column follows the implementation and is computed from the mean-call latency as

{Hz}_{ref} = 1 / \bar{t}

; therefore, it should not be recomputed as the reciprocal of the median-latency column.
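
The protocol above can be sketched as a small timing harness. This is an illustration under stated assumptions, not the authors' benchmark code: `time_policy` and the stand-in lookup policy are hypothetical, and only the warm-up count, the timed-call count and the use of `time.perf_counter()` are taken from the text.

```python
import statistics
import time
from typing import Callable, Dict

def time_policy(select_action: Callable[[], object],
                n_warmup: int = 50,
                n_timed: int = 2000) -> Dict[str, float]:
    """Time repeated calls and summarize latency in microseconds."""
    for _ in range(n_warmup):          # untimed warm-up calls
        select_action()
    samples_us = []
    for _ in range(n_timed):
        start = time.perf_counter()
        select_action()
        samples_us.append((time.perf_counter() - start) * 1e6)
    mean_us = statistics.fmean(samples_us)
    return {
        "n": float(len(samples_us)),
        "median_us": statistics.median(samples_us),
        "mean_us": mean_us,
        "hz_ref": 1e6 / mean_us,       # throughput from the mean, not the median
    }

# Stand-in policy (hypothetical): a dictionary lookup in place of a trained agent.
q_table = {(i, j): [0.0, 0.1, 0.2, 0.3] for i in range(5) for j in range(5)}
stats = time_policy(lambda: max(range(4), key=q_table[(2, 3)].__getitem__))
print(stats)
```

Because the mean is pulled upwards by occasional slow calls, the reciprocal of the median generally differs from the mean-based throughput, which is why the two columns should be read separately.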

5.6.2. Device Extrapolation

Direct on-device measurements across all target platforms were not performed in this study. Device-class projections are therefore reported, obtained by multiplying the measured i7 reference latency by conservative CPU-side scaling factors (Table 10). These projections are intended to answer whether the method is plausibly real time on embedded hardware; they do not replace physical Jetson/DGX measurements. DGX is also server-class rather than a mobile-robot platform, so it is included only as an upper-bound compute reference. Real on-device measurements can vary by

\pm 20

to

50 %

, depending on cache effects, Python/C++ implementation, power mode and memory bandwidth.

5.6.3. Results

The lightweight decision policies are the fastest: Std. QL uses a single Q-table lookup, while the compact DQN baseline uses a short one-hidden-layer forward pass. APF-Only, QAPF and QAPF+CBF all require evaluating the potential at four candidate successor cells, which involves per obstacle distance computations, an

O (| A | \cdot | O |)

operation. Despite this overhead, all methods exceed the 20 Hz real-time threshold on the measured i7 reference and under the projected embedded-device latency model. For QAPF+CBF, the implementation reports approximately

4.9

kHz mean-call throughput on the i7 reference, while the projected median latencies are

417.0

μ

s

on Jetson Nano and

156.4

μ

s

on Jetson AGX Orin. These projected median-latency equivalents should not be directly compared with the “Hz @ i7” mean-throughput column.
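
The relation between the projected median latencies and the 20 Hz threshold can be checked with a few lines of arithmetic. The helper names below are hypothetical; the two latency values are the projections quoted in the text, and the resulting rates are median-latency equivalents rather than mean throughput.

```python
REAL_TIME_HZ = 20.0  # real-time threshold used in the text

def latency_to_rate_hz(latency_us: float) -> float:
    """Decisions per second implied by a per-decision latency in microseconds."""
    return 1e6 / latency_us

def budget_fraction(latency_us: float, threshold_hz: float = REAL_TIME_HZ) -> float:
    """Fraction of the per-decision time budget (1 / threshold) that is consumed."""
    return latency_us * threshold_hz / 1e6

projected_median_us = {"Jetson Nano": 417.0, "Jetson AGX Orin": 156.4}
for device, latency in projected_median_us.items():
    rate = latency_to_rate_hz(latency)
    print(f"{device}: {rate:.0f} Hz equivalent, "
          f"{100 * budget_fraction(latency):.2f} % of the 20 Hz budget")
```

Both projected latencies consume less than 1 % of the 50 ms budget implied by a 20 Hz control loop, so the real-time conclusion is not sensitive to the stated 20 to 50 % uncertainty of the projection.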

5.7. Ablation, Sensitivity and Extended Studies

5.7.1. Ablation Study

The contribution of each QAPF component is isolated in Table 11 and visualized in Figure 9: RL-Only (a strict component-isolation configuration: plain tabular Q-learning on the QAPF state encoding and reward, with all QAPF-specific components removed; this is intentionally distinct from the Std. QL baseline of Table 5, which is tuned as a fair comparison opponent), APF-Only (no learning), QAPF-Fixed-

λ

(no adaptive schedule), the full QAPF-Full and QAPF+CBF.

The ablation isolates two effects. The adaptive shaping schedule of Equation (8) improves over the fixed-

λ

variant by approximately

5.6

percentage points (

94.5 %

vs.

88.9 %

), which supports the value of the decay schedule for late-stage policy refinement. The CBF filter trades a small drop in the success rate (

94.5 %

to

93.8 %

) for an approximately

20 \times

reduction in collisions (

6.2 %

to

0.3 %

), and the minimum clearance is simultaneously improved from

2.78

to

3.15

cells, which indicates that the filter keeps the robot farther from obstacles.

5.7.2. Sensitivity Analysis

Table 12 reports a diagnostic QAPF sensitivity sweep over the shaping magnitude

λ_{max}

and decay rate

β

of Equation (8). This sweep uses a separate exploratory configuration in the code (

λ_{min} = 1.0

,

T_{0} = 5.0

,

T_{min} = 0.5

,

T_{decay} = 0.99

,

w_{g}^{train} = 0.30

,

w_{g}^{eval} = 0.90

) and characterizes behavior around a wider parameter neighborhood; the final QAPF/QAPF+CBF parameters used by make_agent to produce Table 5, Table 6 and Table 11 are those reported in Table 3:

λ_{min} = 0.5

,

λ_{max} = 5.0

,

β = 0.005

,

T_{0} = 2.0

,

T_{min} = 0.3

,

T_{decay} = 0.995

and APF-guidance weights

w_{g}^{train} = 1.2

,

w_{g}^{eval} = 2.0

. The flat success-rate landscape across the center of the table (all SRs within

\pm 3

pp of the maximum) is consistent with a robust setting of the shaping coefficient, which is the operationally important property of a sensitivity analysis. Figure 10 visualizes the same data as a success-rate landscape over

(λ_{max}, β)

together with the corresponding convergence-episode landscape.

5.7.3. Energy Modulation Study

The kinetic and jerk energy and navigation time of Equations (13) to (15) are reported in Table 13 across the three regimes. The same

94.5 %

success rate as that of the unmodulated QAPF policy is maintained across all regimes, which confirms the compatibility of the energy module with the QAPF policy. Figure 11 visualizes the resulting velocity profiles and the energy and navigation-time trade-off across the three modes.

APF-mod delivers an

18 %

energy reduction for an

8.5 %

increase in navigation time, a favorable trade-off for battery-limited deployments. APF-agg pushes energy savings to

30 %

but at a

46.5 %

higher navigation time. The sigmoidal speed law of Equation (15) is observed to produce velocity dips only near obstacles, which concentrates the energy savings where safety is most important. The

94.5 %

success rate is preserved across all three modes, which is consistent with the static-scenario QAPF row of Table 6: velocity modulation is shown to alter the energy and navigation-time trade-off without affecting task completion.

5.7.4. Multi-Robot Study

Table 14 reports three configurations for the same two-robot evaluation setting: independent (no coordination), cooperative (inter-robot APF repulsion per Equation (17)) and Cooperative+CBF. Figure 12 visualizes the joint success rate and inter-robot collision count across the three two-robot configurations. Adding inter-robot APF repulsion alone is shown to be insufficient: residual

5.1

inter-robot collisions are observed in the Cooperative configuration. When the inter-robot repulsion is paired with the per robot CBF filter, the joint success is driven to

92.1 %

with only

0.5

residual inter-robot collisions.

Scalability Discussion

The Cooperative+CBF configuration evaluated above uses a centralized, fully communicative coordination scheme: at each decision step, every robot k computes the augmented potential

U (q_{k}) + \sum_{j \neq k} U_{rep}^{MR} (q_{k}, q_{j})

over the positions of all other robots. This design is appropriate for the two-robot regime evaluated here (warehouse aisles, lab-scale fleets), but its scaling properties deserve explicit comment for larger deployments. The dominant costs are summarized in Table 15.

The

O (N^{2})

pair-evaluation and communication costs remain tractable for small fleets but become increasingly restrictive as N grows. More importantly, fleet-level success decreases multiplicatively: even if each robot individually maintains the QAPF+CBF success rate of

93.8 %

, the independent-failure upper bound is

0.938^{N}

, which falls below

50 %

at

N \approx 11

. Beyond small fleets, three concrete decentralization strategies become necessary: (i) k-nearest-neighbor repulsion (restrict the inter-robot sum to the

k = 3

to 5 closest other robots, dropping cost to

O (N k)

[18]); (ii) communication-graph-aware coordination (explicit communication graph respecting wireless range and bandwidth, decentralized consensus protocol [17,18]); and (iii) hierarchical-cluster coordination (3 to 5-robot clusters with gateway-robot boundary handling, giving approximately

O (N)

overall coordination cost).

A limitation of the present evaluation follows from this analysis. The

92.1 %

joint SR result in Table 14 should not be read as a claim about N-robot scalability: it is a two-robot evaluation, and as N grows, the joint SR is mathematically bounded above by

\prod_{k} {SR}_{k}

(under independent-failure assumptions), so even individually-strong robots can yield joint SR

≪ 92 %

for large fleets; specifically,

0.938^{N} < 0.5

already at

N \approx 11

. Fleet-scale validation is treated as the most important extension of the multi-robot story for future work.
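
The two scaling limits discussed above can be reproduced numerically. The sketch below assumes independent per-robot failures, as the text does, and counts one repulsion term per ordered robot pair; all function names are hypothetical.

```python
import math

def joint_success_bound(sr_single: float, n_robots: int) -> float:
    """Joint success rate under independent failures: sr_single ** n_robots."""
    return sr_single ** n_robots

def smallest_fleet_below(sr_single: float, target: float) -> int:
    """Smallest N with sr_single ** N < target (0 < sr_single, target < 1)."""
    n = math.ceil(math.log(target) / math.log(sr_single))
    # Guard against the boundary case where sr_single ** n equals target.
    while joint_success_bound(sr_single, n) >= target:
        n += 1
    return n

def all_pairs_terms(n_robots: int) -> int:
    """Repulsion terms when every robot sums over all others: N (N - 1)."""
    return n_robots * (n_robots - 1)

def knn_terms(n_robots: int, k: int) -> int:
    """Repulsion terms when each robot uses at most its k nearest neighbours."""
    return n_robots * min(k, n_robots - 1)

n_half = smallest_fleet_below(0.938, 0.5)
print(n_half, round(joint_success_bound(0.938, n_half), 3))
print(all_pairs_terms(20), knn_terms(20, 3))
```

For a per-robot success rate of 93.8 % the bound first drops below 50 % at eleven robots, and for a 20-robot fleet the k-nearest-neighbor restriction with k = 3 reduces the number of repulsion terms from 380 to 60.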

5.7.5. Curriculum Learning Study

Table 16 compares fixed-density training against a progressive-curriculum schedule that starts at low obstacle density and increases towards the target density across episodes; Figure 13 visualizes the corresponding learning curves.

Two effects are visible. On the asymptotic success rate, fixed-density and curriculum training are statistically indistinguishable: the

1.7

percentage-point difference (

94.5 %

vs.

96.2 %

) is comparable to one standard deviation of either estimate. On convergence speed, however, the curriculum schedule is clearly faster (

120 \pm 25

vs.

205 \pm 35

episodes; the gap is well outside one standard deviation), which is the more interesting result for sample-efficiency-constrained deployments. The curriculum benefit is most pronounced when the gap between the easy starting density and the target density is wide, because the agent receives a more informative shaping signal early in training; when the starting density is already close to the target, curriculum and fixed schedules collapse to essentially the same trajectory.

Curriculum Schedule Details

The progressive-curriculum schedule used in this study is parameterized as follows. The initial obstacle density is

N_{o}^{(0)} = 5

, the target density is

N_{o}^{(target)} = 15

and the ramp duration is

E_{ramp} = 400

episodes. The density at episode e is set to

N_{o} (e) = N_{o}^{(0)} + (N_{o}^{(target)} - N_{o}^{(0)}) \cdot min (e / E_{ramp}, 1)

. After episode

E_{ramp}

, the density is held fixed at

N_{o}^{(target)}

for the remainder of training. The same map seed is used at every density step so that the obstacle layout grows monotonically rather than being resampled at each episode.
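
The ramp above translates directly into code. The sketch below uses the stated parameters as defaults; the function names are hypothetical, and the rounding of fractional densities to an integer obstacle count is an assumption of this sketch because the text does not specify it.

```python
def obstacle_density(episode: int,
                     n_start: float = 5.0,
                     n_target: float = 15.0,
                     e_ramp: int = 400) -> float:
    """Linear ramp N_o(e) = N0 + (Ntarget - N0) * min(e / E_ramp, 1)."""
    progress = min(episode / e_ramp, 1.0)
    return n_start + (n_target - n_start) * progress

def obstacle_count(episode: int) -> int:
    """Integer obstacle count; rounding to nearest is an assumption of this sketch."""
    return int(round(obstacle_density(episode)))

for e in (0, 120, 200, 400, 1000):
    print(e, obstacle_density(e), obstacle_count(e))
```

The density grows by one obstacle every 40 episodes during the ramp and stays at the target value afterwards.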

5.7.6. Multi-Axis Generalization Study

Generalization is measured by the gap between performance on seen obstacle configurations, which are drawn from the same seed pool used during training, and performance on out-of-distribution configurations that the agent never encountered during training. Four orthogonal generalization axes are evaluated, each on 1000 held-out evaluation episodes per cell (30 seeds and approximately 33 episodes per seed). The four axes are listed below. Axis A covers held-out maps under the training distribution (

N_{o} = 15

,

G = 50

, disjoint seed pool). Axis B covers out-of-distribution obstacle density (

N_{o} \in {5, 10, 15, 20, 25}

, a

- 66.7 %

to

+ 66.7 %

shift around the training value

N_{o} = 15

). Axis C covers out-of-distribution grid size (

G \in {30, 50, 70}

, with the relative density held fixed at

15 / 2500

). Axis D covers long-horizon stability over 1000 consecutive episodes with the per decile success rate.
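
The shift magnitudes on Axes B and C follow from simple ratios against the training configuration. The sketch below is illustrative only: the helper names are hypothetical, and because the text does not state how fractional obstacle counts are rounded at fixed density, the counts are left unrounded.

```python
TRAIN_OBSTACLES = 15
TRAIN_GRID = 50

def relative_shift_percent(value: float, reference: float) -> float:
    """Signed relative shift of `value` against `reference`, in percent."""
    return 100.0 * (value - reference) / reference

def obstacles_at_fixed_density(grid: int) -> float:
    """Obstacle count that keeps the density 15 / 2500 on a grid x grid map."""
    return TRAIN_OBSTACLES * grid * grid / (TRAIN_GRID * TRAIN_GRID)

for n_o in (5, 10, 15, 20, 25):
    print(f"N_o = {n_o:2d}: {relative_shift_percent(n_o, TRAIN_OBSTACLES):+.1f} %")
for g in (30, 50, 70):
    print(f"G = {g}: {relative_shift_percent(g, TRAIN_GRID):+.0f} % side length, "
          f"{obstacles_at_fixed_density(g):.1f} obstacles")
```

At fixed relative density, the 30-cell and 70-cell grids correspond to about 5.4 and 29.4 obstacles, so Axis C changes the absolute obstacle count substantially even though the density is unchanged.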

Table 17 (Axes A to C) shows the four-method comparison aggregated across each axis, while Table 18 (Axis D) reports the per decile stability of QAPF+CBF.

Three findings from the expanded study are worth highlighting. On Axis B, QAPF+CBF is observed to degrade by only

14.3

percentage points (

92.8 %

to

78.5 %

) when the obstacle count is scaled up by

66.7 %

relative to training, whereas 20 points are lost by APF-Only and 18 points by DQN. This robustness to obstacle-density shift is a direct consequence of the relative-feature state encoding (Equation (1)), because

d_{o}

and

β_{o}

depend only on the nearest obstacle, and the policy is therefore largely insensitive to how many obstacles are placed in the rest of the field. On Axis C, the policy is observed to transfer cleanly to grid sizes the agent never trained on, with an

88 %

to

90 %

SR retained at both

G = 30

and

G = 70

, because the bin counts (five position bins per axis) are normalized by grid size during state encoding. On Axis D, a flat

93.4 %

to

94.0 %

SR is maintained by QAPF+CBF across all ten deciles of a 1000-episode evaluation, with the slight per decile fluctuation well within the per decile sampling noise. No evidence of long-horizon drift or wear-out is observed in this finite evaluation, and deployment claims for tasks that run for many thousands of decisions per session are supported.
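
The size of the per decile sampling noise can be estimated with a binomial standard error. The sketch below assumes 100 independent episodes per decile (1000 episodes split into ten deciles) and a success probability of 0.938; both are assumptions of this illustration, and if each decile pools several seeds the effective sample is larger and the standard error correspondingly smaller.

```python
import math

def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a success rate estimated from n independent episodes."""
    return math.sqrt(p * (1.0 - p) / n)

episodes_per_decile = 1000 // 10
se = binomial_standard_error(0.938, episodes_per_decile)
spread = 0.940 - 0.934
print(f"per-decile standard error: {100 * se:.1f} pp, observed spread: {100 * spread:.1f} pp")
```

Under these assumptions the standard error is about 2.4 percentage points, roughly four times the 0.6-point spread between the lowest and highest decile.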

5.8. Three-Dimensional Workspace Evaluation

A simulation-based examination of the 3D extension described in Section 4.9 is reported here so that the dimensional generalization of QAPF+CBF can be evaluated on the same protocol as the 2D study. We state two scope limitations of this evaluation up-front. First, the simulator used here is a

50 \times 50 \times 1

voxel grid with the vertical coordinate held at

z = 0

; the 2D plane is therefore recovered as a degenerate special case of the 3D formulation, and the experiment is a code-path regression check rather than a non-trivial vertical-trajectory benchmark. Second, no stacked-obstacle, vertical-corridor or fly-over/fly-under scenario is exercised in this paper. A non-trivial 3D evaluation in which the z-axis is genuinely exercised is identified as future work item (viii) in Section 6, with a real aerial validation, in which a quadrotor drone is flown through a structured indoor environment, as the planned vehicle. The 3D-on-2D-plane evaluation reproduces the static-obstacle headline metrics within sampling noise: a SR of

93.6 \pm 2.6 %

with a CR of

0.4 \pm 0.5 %

is observed for QAPF+CBF, and a SR of

94.3 \pm 2.3 %

with a CR of

6.4 \pm 1.9 %

is observed for QAPF on the 3D simulator. These numbers are consistent with the 2D static-obstacle results of Table 5 and indicate that the 3D code path does not degrade performance on a problem that the 2D agent solves cleanly.

5.9. Discussion, Limitations and Threats to Validity

The experimental results consistently support the central claim that adaptive APF-guided Q-learning, combined with a CBF-inspired discrete-action filter with per episode visit memory, produces a safe and efficient navigation policy. The CBF filter delivers an exceptional safety-vs-efficiency trade-off: ablating it lifts the success rate marginally from

93.8 %

to

94.5 %

but reintroduces a

\sim 20 \times

higher collision rate (

0.3 % \to 6.2 %

), which is the wrong direction for any safety-critical deployment.

The robustness analysis shows that QAPF+CBF degrades gracefully: it shows no observed degradation under observation noise up to

σ_{obs} = 0.8

cells in the finite robustness suite (courtesy of the inflated-margin argument of Section 5.5) and loses at most 30 percentage points under severe actuator slip. Under the realistic combined regime the agent retains

85 %

success, which we view as evidence of deployability.

Because the robustness table reports the SR rather than the full SR/CR/clearance triplet used in Table 5, it should be interpreted as a finite robustness screen rather than as a complete noisy–safety benchmark. A full noisy-environment safety benchmark reporting the SR, CR and minimum clearance under each disturbance regime is left as a follow-up experiment.

The inference-timing analysis indicates real-time viability on the measured Intel i7 reference and under projected embedded-device scaling: all methods remain above 20 Hz, with QAPF+CBF projected to have a median decision latency of

417.0

μ

s

on Jetson Nano. The scaling-factor projection is an approximation; physical Jetson and DGX-class timing measurements are future work and should be performed before any platform-specific deployment claim.

The multi-axis generalization study (Section 5.7.6) shows that clean transfer is achieved by QAPF+CBF across

- 66.7 %

to

+ 66.7 %

obstacle-density shifts, across

40 %

grid-size shifts in either direction and across 1000 consecutive evaluation episodes without drift.

Several limitations should be acknowledged. The continuous-dynamics limitation is that the

50 \times 50

grid-world with simplified point-robot kinematics does not capture the continuous dynamics, actuator limits, wheel slip, turning-radius limits or higher-dimensional sensor streams of physical robots, and a sim-to-real transfer study on a Jetson-mounted physical platform remains future work. The discrete CBF limitation is that the empirical

0.3 %

collision rate is achieved by the filter partly through the visit-memory mechanism rather than through classical forward-invariance theory, and the formal safety guarantees of [15,28] do not transfer directly to the discrete setting. The noise-model limitation is that i.i.d. Gaussian noise is injected by the robustness study, whereas real sensor noise is often correlated, biased and coupled to the perception pipeline, so richer noise models are a natural extension. The projected timing limitation is that the timing results for Jetson and DGX-class devices are projected rather than measured on-device, and DGX is a server-class reference rather than a mobile-robot platform. The unreachable-goal limitation is not the absence of a practical response: timeout and persistent-stagnation outcomes are explicitly detected and labeled by the current implementation. The remaining limitation is only formal certification, because the present empirical detector does not prove graph-theoretic infeasibility of the goal. The classical-planner comparison limitation is that A* and RRT* are discussed as scope baselines but are not experimentally benchmarked under a matched replanning or perception protocol. The 3D and kinematic-validation limitation is that the 3D extension is currently exercised only on the

z = 0

plane sanity check of Section 5.8; a non-trivial vertical-corridor benchmark and a physical non-holonomic platform validation are not executed in this study, and a real-aerial drone-based evaluation is identified as future work. The multi-robot scalability limitation is that the present centralized cooperative scheme is bounded by two distinct scaling limits, as discussed in Section 5.7.4: a joint-success-rate bound (under independent failures the joint SR is upper-bounded by

\prod_{k} {SR}_{k} = 0.938^{N}

, which falls below

50 %

at

N \approx 11

) and a compute and communication bound from

O (N^{2})

pair evaluations and state sharing. The decentralized, k-nearest and hierarchical variants enumerated in Section 5.7.4 are motivated by both bounds as the natural path beyond the small-fleet regime that the centralized scheme handles cleanly.

6. Conclusions and Future Work

This paper presented an extended QAPF framework that addresses key limitations of the original formulation [12] through three core contributions: (i) potential-based reward shaping whose unclipped fixed-weight form is policy-invariant, together with a clipped adaptive

λ (e)

schedule used for stable finite-training performance; (ii) a discrete CBF-inspired action filter augmented with a per episode visit memory; and (iii) extensions to dynamic-obstacle, multi-robot and energy-aware navigation scenarios described by the explicit three-equation velocity-modulation law of Equations (13) to (15), which preserves full speed in low-gradient regions and reduces speed near obstacles.

Across 30 independent seeds and held-out maps, a SR of

93.8 \pm 2.3 %

with only

0.3 \pm 0.4 %

collisions is achieved by QAPF+CBF in the static-obstacle setting, while

90.4 %

and

85.5 %

success are retained in the dynamic and narrow-passage scenarios respectively. No observed degradation is shown by the robustness analysis under observation noise up to

σ_{obs} = 0.8

cells in the finite robustness suite, while

85 %

retention is reported under realistic combined noise. The inference-timing analysis shows real-time viability (>20 Hz) on the measured i7 reference and across all considered projected embedded platforms from Jetson Nano to AGX Orin, with QAPF+CBF reporting approximately

4.9

kHz mean-call throughput on the i7 reference and a projected median latency of

417.0

μ

s

on Jetson Nano. Physical Jetson and DGX-class measurements are treated as future work. The generalization study demonstrates clean transfer across

- 66.7 %

to

+ 66.7 %

obstacle-density shifts and 1000-episode horizons, while the small-fleet regime is identified by the multi-robot scalability discussion as the natural operating range of the present centralized scheme, and three concrete decentralization strategies are enumerated for larger fleets.

Future work will target nine directions. (i) Physical sim-to-real validation on a Jetson-mounted ground robot, with end-to-end on-device latency measurement on Jetson AGX Orin and calibration of the projection multipliers in Table 10 against direct measurements; the optional DGX-class server timing is retained only as an upper-bound compute reference. The sim-to-real transfer will be studied through CARLA [34] and domain-randomization [35] to bridge the reality gap surveyed in [36]. (ii) Correlated and bias-shifted sensor-noise models that stress-test the inflated-margin CBF under more realistic perceptual errors. (iii) Adversarial robustness, in which worst-case input perturbations are evaluated in the spirit of neural-network safety verification [37]. (iv) An adaptive

δ_{safe}

that tightens in open regions and inflates near clutter so that conservatism is traded against the path length. (v) Scaling the multi-robot framework to fleets of

N \geq 20

via the decentralization strategies of Section 5.7.4. (vi) Formal unreachable-goal certification beyond the empirical timeout/stagnation detector used in this paper, through a two-tier architecture combining a periodic A* [32] reachability oracle, optional RRT*-based escape [38] and a

∥ \nabla U ∥

plus Q-variance local-minimum diagnostic. (vii) A direct experimental A*/RRT* benchmark run under a matched replanning and perception protocol rather than under the ideal static-map assumptions of classical planners, together with a hybrid architecture in which a global path is generated by A* and local execution under dynamic disturbances is delegated to QAPF+CBF. (viii) Experimental validation of the 3D extension of Section 4.9 on real aerial platforms, in which a quadrotor drone is flown through a structured indoor environment so that the simulated

(x, y, z)

behavior reported in Section 5.8 is corroborated under physical actuator and perception conditions. (ix) Experimental validation on concave obstacles along the lines of Appendix A.1.

Author Contributions

Conceptualization, E.I., I.I., V.V. and A.J.G.; methodology, E.I., I.I. and V.V.; software, E.I., I.I. and V.V.; validation, E.I., A.J.G., I.I. and V.V.; formal analysis, E.I., I.I. and V.V.; investigation, E.I., I.I. and V.V.; resources, I.I., V.V. and J.P.A.; data curation, E.I., I.I. and V.V.; writing—original draft preparation, E.I., I.I. and V.V.; writing—review and editing, I.I., V.V., S.K., G.S.P.G. and P.V.; visualization, E.I., I.I. and V.V.; supervision, I.I., V.V., A.J.G. and J.P.A.; project administration, I.I., V.V. and A.J.G.; funding acquisition, I.I. and V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 739578 and the Government of the Republic of Cyprus through the Deputy Ministry of Research, Innovation and Digital Policy.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulation code, raw result files and regenerated figures are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithm Capabilities, Limitations and Comparisons

The algorithmic scope of the proposed framework is clarified in this appendix by addressing (i) obstacle-shape handling and concave geometries, (ii) empirical timeout/stagnation handling for impossible-goal cases, (iii) comparison against classical sampling and graph-based planners (RRT* and A*), (iv) scalability to 3D workspaces and (v) kinematic and geometric constraints together with pose uncertainty. In the present implementation, unreachable-goal cases are handled empirically rather than by a formal reachability certificate. The maximum-step budget identifies Timeout-Unreachable episodes, while the persistent-stagnation monitor identifies Stagnation-Unreachable behavior when the distance to the goal fails to improve over repeated windows and the robot remains trapped in a local region. Thus, impossible-goal scenarios are not left undefined: they are explicitly terminated and labeled as safe empirical failures. What remains outside the scope of this paper is a formal graph-theoretic proof that no feasible path exists.

Appendix A.1. Obstacle Shape and Concave Geometry

Each obstacle is represented in the evaluated framework as a point object with a circular collision radius

ϵ_{coll} = 1.5

cells, so that the repulsive potential of Equation (4) is radially symmetric around each obstacle center. This representation is sufficient for convex obstacle approximations and constitutes the standard simplification adopted in the QAPF literature [6,12]. Three concrete extensions are provided to cover non-trivial shapes.

The first extension is the convex polygonal obstacle case, in which the radial

ρ (q)

is replaced by the Euclidean distance from

q

to the obstacle convex-hull boundary. Both

ρ

and

\nabla ρ

remain well defined and Lipschitz away from the boundary, so the QAPF action selection of Equation (11), the CBF filter of Algorithm 3 and the inflated-margin argument of Equations (21) and (22) generalize directly with the new

ρ

. The second extension is the point-decomposition representation of concave obstacles, in which a concave obstacle is represented as a union of K point obstacles (more precisely, as a union of K disks, that is, the Minkowski sum of the K sample points with a disk of the collision radius) chosen so that the obstacle boundary is densely sampled. The total repulsive potential becomes

U_{rep} (q) = \sum_{k = 1}^{K} U_{rep, k} (q)

with no algorithmic change and the standard accuracy and cost trade-off applies: K is proportional to the boundary perimeter divided by

(ρ_{0} / 2)

for sub-influence-radius coverage. In our

50 \times 50

grid this corresponds to approximately

K \sim 20

extra point obstacles per non-trivial concave shape, which remains well within the per decision

O (N_{obs})

inference budget characterized in Section 5.6. The third extension addresses the local-minima failure mode associated with deeply concave obstacles, such as U-shapes and narrow dead-end pockets, which are the classical failure mode of pure APF navigation because the attractive and repulsive gradients can balance into a stable non-goal equilibrium. The Q-learning component of QAPF mitigates this in two complementary ways: an optimistic Q-initialization

Q_{0} = 5.0

pushes the agent to attempt every action at least once before settling, even when the local potential gradient indicates a stay-still preference; and the anti-deadlock monitor of Algorithm 2, which forces

ε \geq 0.5

exploration when

d_{goal}

has not improved over the last 15 steps, provides an explicit escape mechanism. Concave-trap scenarios have not been exhaustively benchmarked in this paper, and their systematic evaluation together with the unreachable-goal detection problem is identified as the most natural extension of the present framework.
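
The point-decomposition extension can be sketched as a boundary-sampling routine followed by a sum of per-point potentials. Everything below is illustrative: the function names, the U-shaped wall layout and the influence radius of 3.0 cells are hypothetical, and `u_rep_single` is a placeholder for the single-obstacle potential of Equation (4), which is not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]

def sample_boundary(vertices: Sequence[Point], spacing: float) -> List[Point]:
    """Place point obstacles along an open polyline, at most `spacing` apart."""
    if spacing <= 0.0:
        raise ValueError("spacing must be positive")
    points: List[Point] = [vertices[0]]
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        steps = max(1, math.ceil(length / spacing))
        for i in range(1, steps + 1):
            t = i / steps
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points

def total_repulsion(q: Point, centres: Sequence[Point],
                    u_rep_single: Callable[[Point, Point], float]) -> float:
    """U_rep(q) = sum_k U_rep,k(q); `u_rep_single` stands in for Equation (4)."""
    return sum(u_rep_single(q, c) for c in centres)

# Hypothetical U-shaped trap: three walls of 10 cells each, opening upwards.
u_shape = [(20.0, 30.0), (20.0, 20.0), (30.0, 20.0), (30.0, 30.0)]
rho_0 = 3.0  # illustrative influence radius, not a value taken from the paper
centres = sample_boundary(u_shape, spacing=rho_0 / 2.0)
print(len(centres))
```

For this layout the routine places 22 point obstacles, which is of the same order as the roughly 20 extra points per concave shape quoted above.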

Appendix A.2. Empirical Handling of Impossible-Goal Cases

The present framework addresses cases in which the goal is impossible to reach through empirical timeout and stagnation detection. A goal is treated as empirically unreachable when the robot neither reaches the goal nor collides before the maximum episode horizon, or when persistent stagnation indicates that the controller is trapped in a local no-progress region. Let

d_{t} = ∥ q_{t} - q_{f} ∥

be the goal distance. A stagnation window is detected when

max_{τ \in {t - W_{stag}, \dots, t}} d_{τ} - min_{τ \in {t - W_{stag}, \dots, t}} d_{τ} < Δ d_{stag},

(A1)

and the robot remains inside a repeated local region during the same window. In this study,

W_{stag} = 15

and

Δ d_{stag} = 1.0

cell match the anti-deadlock monitor of Algorithm 2. To avoid prematurely labeling recoverable temporary stalls as impossible-goal cases, the terminal Stagnation-Unreachable label is assigned only after

N_{stag} = 3

consecutive stagnation windows. The Timeout-Unreachable label is assigned if

T_{max} = 1000

steps are exhausted without goal arrival or collision.
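
The labeling rule above can be sketched as a small stateful detector. This is an illustration under stated assumptions rather than the authors' implementation: the class name is hypothetical, windows are treated as non-overlapping, a non-stagnant window resets the consecutive count, and the additional "repeated local region" condition and the goal-arrival and collision checks are assumed to be handled elsewhere.

```python
from typing import List, Optional

class UnreachableDetector:
    """Timeout and persistent-stagnation labels in the spirit of Equation (A1)."""

    def __init__(self, w_stag: int = 15, delta_d: float = 1.0,
                 n_stag: int = 3, t_max: int = 1000) -> None:
        self.w_stag, self.delta_d = w_stag, delta_d
        self.n_stag, self.t_max = n_stag, t_max
        self.window: List[float] = []
        self.stagnant_windows = 0
        self.steps = 0

    def update(self, goal_distance: float) -> Optional[str]:
        """Feed d_t after each step; returns a terminal label or None."""
        self.steps += 1
        self.window.append(goal_distance)
        if len(self.window) == self.w_stag + 1:   # tau in {t - W, ..., t}
            spread = max(self.window) - min(self.window)
            # Assumption: windows are non-overlapping and a non-stagnant
            # window resets the consecutive count.
            stagnant = spread < self.delta_d
            self.stagnant_windows = self.stagnant_windows + 1 if stagnant else 0
            self.window.clear()
            if self.stagnant_windows >= self.n_stag:
                return "Stagnation-Unreachable"
        if self.steps >= self.t_max:
            return "Timeout-Unreachable"
        return None

# Oscillating goal distance with no net progress (synthetic input).
detector = UnreachableDetector()
label, step = None, 0
while label is None:
    step += 1
    label = detector.update(12.0 + 0.2 * (step % 2))
print(label, step)
```

With these assumptions a robot that makes no net progress is labeled after three windows of 16 samples each, that is, after 48 steps, well before the 1000-step timeout.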

Table A1 summarizes the empirical behavior of this detector on reachable control maps and three no-path settings. The blocked-goal and sealed-corridor settings make the goal topologically unreachable under the four-neighbor grid action model. The dead-end concave setting creates a local trap in which APF-driven motion can repeatedly return to the same region if no failure detector is used. The desired behavior in all three no-path cases is not path success but safe termination with an unreachable label and without collision.

Table A1. Empirical impossible-goal investigation using timeout and persistent-stagnation detection on the finite no-path test suite.

Scenario	Unreachable Detection (%)	False-Unreachable (%)	Collision Before Label (%)	Mean Detection Time (Steps)
Reachable held-out maps	−	$1.0 \pm 0.6$	$0.3 \pm 0.4$	−
Blocked-goal maps	$98.7 \pm 1.2$	−	$0.0 \pm 0.0$	$72 \pm 18$
Sealed-corridor maps	$97.8 \pm 1.5$	−	$0.3 \pm 0.4$	$94 \pm 22$
Dead-end concave maps	$96.5 \pm 2.0$	−	$0.7 \pm 0.5$	$118 \pm 34$

Unreachable Detection is the percentage of no-path episodes terminated by either Timeout-Unreachable or Stagnation-Unreachable. False-Unreachable is measured only on reachable held-out maps and denotes episodes incorrectly labeled unreachable although a feasible route exists. Collision Before Label is the percentage of episodes that collide before the unreachable label is assigned. Mean Detection Time is conditional on correctly labeled no-path episodes.
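As an illustration of how these column definitions could be computed from per-episode logs, the following Python sketch aggregates one no-path scenario. The log format (a list of `(outcome, steps)` tuples), the outcome string `"Collision"` and the function name `summarize_no_path` are assumptions made for this example and are not taken from the paper; the sketch also ignores the averaging across seeds that produces the reported ± values.

```python
UNREACHABLE_LABELS = {"Timeout-Unreachable", "Stagnation-Unreachable"}


def summarize_no_path(episodes):
    """Aggregate no-path episodes given as (outcome, steps) tuples (hypothetical format)."""
    n = len(episodes)
    # Steps at which an unreachable label was assigned, one entry per labeled episode.
    detection_steps = [steps for outcome, steps in episodes
                       if outcome in UNREACHABLE_LABELS]
    collisions = sum(1 for outcome, _ in episodes if outcome == "Collision")
    return {
        "unreachable_detection_pct": 100.0 * len(detection_steps) / n,
        "collision_before_label_pct": 100.0 * collisions / n,
        # Conditional on correctly labeled episodes, as in the table notes.
        "mean_detection_time": (sum(detection_steps) / len(detection_steps)
                                if detection_steps else None),
    }


example = [
    ("Stagnation-Unreachable", 60),
    ("Timeout-Unreachable", 1000),
    ("Collision", 12),
    ("Stagnation-Unreachable", 80),
]
summary = summarize_no_path(example)
```

The example episodes are invented solely to exercise the function and do not correspond to any result in Table A1.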

This mechanism is intentionally described as empirical detection rather than formal certification. It is suitable for finite-horizon robotic deployment because it prevents endless oscillation and provides a clear terminal status for blocked-goal, sealed-corridor and dead-end cases. However, it does not prove that no path exists in the underlying grid graph. A formal reachability oracle based on A* or graph connectivity analysis can be added as an optional supervisory layer, but the QAPF+CBF controller itself already provides a practical timeout/stagnation response to impossible-goal scenarios.

Appendix A.3. Comparison Against RRT* and A*

Classical motion-planning methods such as A* [32] and RRT* [38] solve a different problem from QAPF and are best understood as complementary rather than universally dominated baselines. No claim is therefore made that QAPF+CBF is always preferable to A* or RRT*. Instead, Table A2 identifies the operating regimes in which each class of method is preferable.

A* and RRT* are not included as quantitative baselines in the main experimental comparison because they solve a different problem under different assumptions. Under exact static-map and exact-collision-checking assumptions, A* returns a shortest-path optimum, and RRT* is asymptotically optimal; in that regime, A* and RRT* dominate QAPF+CBF, which, as a learned reactive policy, carries no optimality guarantee. The complementary regime evaluated in this paper (partial observability, dynamic obstacles, sensor noise and bounded per-decision compute) is the regime in which a single static-map plan is repeatedly invalidated, and the question is whether the controller can react in real time without reconstructing the global plan. A controlled head-to-head benchmark in which A* is forced to replan at every dynamic-obstacle move (approximately 90 replans per dynamic episode in our setup) under the same noisy-feature input distribution is identified as future work item (vii) in Section 6; it is the experiment that would convert the present scope comparison into a head-to-head answer. The two regimes in which QAPF+CBF is most attractive are the dynamic-obstacle regime, for which fast online reaction is required (a $90.4 %$ SR with a $0.5 %$ CR is reported for the dynamic-scenario setting), and the partial-observability regime with sensor noise, for which an additional estimator or maintained map is required by A* and RRT*, whereas noisy local features are used directly by QAPF+CBF (an $85 %$ SR is reported for the combined-noise robustness regime). For static and fully-mapped environments in which the computational budget allows offline planning, A* and RRT* offer formal optimality and safety guarantees that QAPF+CBF does not. A natural hybrid architecture is therefore identified, in which a global path is generated by A*, and local execution under dynamic disturbances is delegated to QAPF+CBF; this architecture is left for future work. The present manuscript provides a scope and deployment-regime comparison against A*/RRT* rather than a claim of empirical superiority over these planners under their ideal static-map assumptions.

Table A2. Scope comparison of the proposed QAPF+CBF framework against classical motion planners A* [32] and RRT* [38]. This is not a matched experimental benchmark; A*/RRT* are deterministic planners that require a known map or configuration space, whereas QAPF+CBF is a learned reactive policy that operates from local sensor features.

Property	A*	RRT*	QAPF+CBF (Ours)
requires explicit map/model?	yes (occupancy grid)	yes (configuration space)	no global map required; local features used
optimality guarantee	shortest discrete path	asymptotic optimality	no (learned heuristic)
handles dynamic obstacles?	via repeated replanning	via repeated replanning	yes, as a reactive policy (Section 5.3)
handles sensor noise?	requires estimator/map filtering	requires estimator/map filtering	evaluated directly under noisy local features (Section 5.5)
per decision time	$O (\| V \| \log \| V \|)$ for fresh plan	$O (\log n)$ for query, but tree growth is offline	$O (\| A \| \cdot N_{obs})$ , real time
empirical SR (this paper)	not benchmarked	not benchmarked	$93.8 %$
collision-free condition	yes, if the occupancy map and collision checking are exact	yes, if samples and edges are collision-checked in a valid configuration space	empirical $0.3 %$ CR
operating regime	static, fully-observed	static, fully-observed	dynamic, partially-observed, learnable

Appendix A.4. Scalability to 3D Workspaces

Extension to a 3D workspace ($q \in R^{3}$) is conceptually direct and computationally tractable, with the principal cost being the inflation of the discrete state space. The continuous-state components, namely the APF potentials $U_{att}$ and $U_{rep}$, the gradient-based force $F = - \nabla U$, the CBF barrier $h (q)$ and the energy module of Equations (13) to (15), all generalize to $R^{3}$ without modification because they require only a metric and an obstacle geometry, both of which extend naturally. The discrete state space is generalized by replacing the 2D 7-tuple of Equation (1) with a 3D 10-tuple $(x_{bin}, y_{bin}, z_{bin}, β_{g}^{az}, β_{g}^{el}, β_{o}^{az}, β_{o}^{el}, d_{o}, {\dot{d}}_{o}, {\hat{d}}_{o})$, in which both azimuth and elevation angular bins are introduced and the one-step predicted distance bin ${\hat{d}}_{o}$ is retained from the 2D formulation. With five position bins per axis and eight angular bins per dimension, the state cardinality grows from $| S_{2D} | = 76,800$ to $| S_{3D} | = 5^{3} \cdot 8^{4} \cdot 4 \cdot 3 \cdot 4 = 24,576,000 \approx 2.46 \times 10^{7}$, a $320 \times$ increase. The Q-table memory is therefore $24.576 \times 10^{6} \times 6 \times 4$ bytes $\approx 590$ MB for six 3D cardinal actions using single-precision floats, or $24.576 \times 10^{6} \times 26 \times 4$ bytes $\approx 2.56$ GB for 26-neighbor actions, when Python object overhead is excluded; this footprint is feasible in a compact array implementation on workstation-class hardware but is no longer negligible on small embedded platforms. The action space is naturally extended from $| A | = 4$ (cardinal in 2D) to $| A | = 6$ (cardinal in 3D); an option of $| A | = 26$ (king-moves) is available at higher computational cost, with potentially smoother trajectories. The state-space inflation typically translates into approximately $O (| S_{3D} | / | S_{2D} |) \sim 320 \times$ more training episodes for the tabular variant; for practical deployment, a Deep Q-Network function approximator with the same APF-shaped reward would be substituted because the policy-invariance argument of Equation (7) in its unclipped fixed-weight form is independent of the function-approximation class. The methodological details of the 3D extension are described in Section 4.9, and a simulation evaluation of the same extension, in which the z-axis is held at zero so that the 2D plane is recovered as a degenerate case, is reported in Section 5.8; a real drone-based validation of the 3D formulation is identified as future work item (viii) in Section 6.
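The cardinality and memory figures quoted above can be checked with a few lines of Python. The per-component bin counts are those listed in the symbol table (Table A3); the dictionary keys and the helper names `cardinality` and `q_table_bytes` are illustrative only, and megabytes are taken as $10^{6}$ bytes, which is the convention that reproduces the quoted 590 MB.

```python
from math import prod

# Bin counts per state component; 2D 7-tuple of Equation (1) and the 3D 10-tuple.
BINS_2D = {"x": 5, "y": 5, "beta_g": 8, "beta_o": 8,
           "d_o": 4, "d_dot_o": 3, "d_hat_o": 4}
BINS_3D = {"x": 5, "y": 5, "z": 5,
           "beta_g_az": 8, "beta_g_el": 8, "beta_o_az": 8, "beta_o_el": 8,
           "d_o": 4, "d_dot_o": 3, "d_hat_o": 4}


def cardinality(bins):
    """Number of discrete states: product of the bin counts."""
    return prod(bins.values())


def q_table_bytes(n_states, n_actions, bytes_per_entry=4):
    """Dense Q-table size for single-precision entries, without object overhead."""
    return n_states * n_actions * bytes_per_entry


s2d = cardinality(BINS_2D)            # 76,800
s3d = cardinality(BINS_3D)            # 24,576,000
ratio = s3d // s2d                    # 320
mb_6 = q_table_bytes(s3d, 6) / 1e6    # 589.824 MB (six cardinal actions)
gb_26 = q_table_bytes(s3d, 26) / 1e9  # 2.555904 GB (26-neighbor actions)
print(s2d, s3d, ratio, mb_6, gb_26)
```

The factor of 320 is exactly $5 \cdot 8^{2}$: one extra position axis and two extra angular components.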

Appendix A.5. Kinematic and Geometric Constraints, and Pose Uncertainty

The current framework treats the robot as a point object with discrete and instantaneously executable cardinal actions. Real mobile platforms are subject to three families of constraints not modeled in our experiments. The first family is non-holonomic kinematics: differential-drive and Ackermann-steering robots cannot move sideways and have bounded turning radii. The standard incorporation in the QAPF framework is to replace the four cardinal actions with $| A | = k$ feasible $(v, ω)$ velocity-pair primitives that respect the platform kinematics and to include the heading $θ$ in the state vector, giving an extended state $s^{kin} = (q, θ)$ with an additional 8-bin angular component. The shaping form of Equation (7) is unchanged because $F$ depends only on the potential $Φ (s) = - U (q) / U_{Δ, scale}$, which is heading-invariant. Empirically, prior continuous-state extensions of QAPF [12] report an SR penalty of approximately 3 to 5 percentage points for non-holonomic kinematics relative to the omnidirectional case. The second family is the geometric robot-footprint constraint: a non-zero robot radius $r_{robot}$ is incorporated by inflating the collision radius, $ϵ_{coll} \to ϵ_{coll} + r_{robot}$, which is a Minkowski-sum operation; the CBF margin $δ_{safe}$ should also be inflated proportionally by the same reasoning as the noise-inflation argument of Equation (21). The third family is pose uncertainty in obstacle and goal localization: real sensing produces uncertain obstacle and goal positions, modeled as ${\tilde{o}}_{k} = o_{k} + η_{k}^{obs-loc}$ with $η_{k}^{obs-loc} \sim N (0, Σ_{o})$ and similarly for ${\tilde{q}}_{f}$. The robustness analysis of Section 5.5 already covers the diagonal-Gaussian case for robot self-localization because the inflated-margin theorem of Equations (21) and (22) generalizes directly: the only requirement is a bounded perturbation $∥ η ∥ \leq δ$, regardless of which physical quantity (robot pose, obstacle position or goal position) the perturbation applies to. For non-Gaussian or correlated pose uncertainty arising, for example, from a SLAM front-end with biased odometry, additional treatment is required: either a more conservative $δ_{safe}$ sized to the worst-case credible region or a chance-constrained variant of the CBF that admits probabilistic safety. This case is noted as an open extension.
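The footprint and bounded-perturbation arguments can be made concrete with a small Python sketch. It uses the barrier $h = ρ - ϵ_{coll}$ and the nominal values $ϵ_{coll} = 1.5$ and $δ_{safe} = 0.3$ from the symbol table. The additive treatment of the perturbation bound (requiring $h \geq δ_{safe} + δ$) is an assumption made for this example, justified only by the triangle inequality on the nearest-obstacle distance; the exact form of Equation (21) is not reproduced here, and the function names are hypothetical.

```python
import math

EPS_COLL = 1.5    # collision radius in cells (symbol table)
DELTA_SAFE = 0.3  # nominal safety margin in cells (symbol table)


def barrier(q, obstacles, r_robot=0.0):
    """h(q) = rho(q) - (eps_coll + r_robot), rho being the nearest-obstacle distance.

    Adding r_robot to the collision radius is the Minkowski-sum inflation
    described in the text for a disc-shaped footprint.
    """
    rho = min(math.dist(q, o) for o in obstacles)
    return rho - (EPS_COLL + r_robot)


def is_safe(q, obstacles, r_robot=0.0, delta_noise=0.0):
    """Safe-set test with the margin widened by a perturbation bound delta_noise.

    Assumption for this sketch: if the measured position is off by at most
    delta_noise, the true rho differs by at most delta_noise, so requiring
    h >= delta_safe + delta_noise on the measurement implies h >= delta_safe
    at the true position.
    """
    return barrier(q, obstacles, r_robot) >= DELTA_SAFE + delta_noise


obstacles = [(0.0, 0.0)]
point_robot_ok = is_safe((2.0, 0.0), obstacles)               # h = 0.5
disc_robot_ok = is_safe((2.0, 0.0), obstacles, r_robot=0.5)   # h = 0.0
noisy_ok = is_safe((2.0, 0.0), obstacles, delta_noise=0.3)    # needs h >= 0.6
```

A cell that is admissible for a point robot with exact sensing can therefore be rejected once either the footprint or the perturbation bound is accounted for, which is the conservative behavior the paragraph describes.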

Appendix B. Symbols and Abbreviations

Appendix B.1. Symbol Table

The mathematical symbols used throughout this manuscript are summarized in Table A3. The symbols are grouped by category (state, MDP, APF, Q-learning, CBF, multi-robot and noise) and are listed together with their domain and a one-line description. Readers can use Table A3 as a quick reference when reading the methodology and the experimental analysis.

Table A3. Symbol table.

Symbol	Domain/Units	Description
State and action
$q$ , $q_{t}$	$R^{2}$ (cells)	Robot position; t indexes the decision step.
$q_{f}$	$R^{2}$	Goal position.
$q_{a}^{'}$	$R^{2}$	Successor position when action a is executed.
$s$ , $s^{'}$	$S$	Discrete state and successor state in the MDP.
$S$	Finite	Discrete state space, $\| S \| = 76,800$ .
$x_{bin}, y_{bin}$	${0, \dots, 4}$	Binned position coordinates (5 bins per axis).
$β_{g}, β_{o}$	${0, \dots, 7}$	Discretized bearing to goal and to nearest obstacle.
$d_{o}$	${0, 1, 2, 3}$	Binned distance to nearest obstacle.
${\dot{d}}_{o}$	${0, 1, 2}$	Discretized approach rate (closing, stationary, receding).
${\hat{d}}_{o}$	${0, 1, 2, 3}$	One-step predicted distance bin.
a, $a_{i}$	$A$	Action; $A = {↑, ↓, \leftarrow, \to}$ .
$O$	Set	Obstacle set, $O = {o_{1}, \dots, o_{N_{o}}}$ .
$N_{o}$	$Z_{> 0}$	Number of obstacles in the scene.
Artificial Potential Field
$U (q)$	Scalar	Total potential at $q$ .
$U_{att} (q)$	Scalar	Attractive potential generated by the goal.
$U_{rep} (q)$	Scalar	Repulsive potential generated by obstacles.
$k_{att}$	Scalar	Attractive-gain coefficient ( $k_{att} = 1.0$ ).
$k_{rep}$	Scalar	Repulsive-gain coefficient ( $k_{rep} = 100.0$ ).
$ρ (q)$	Cells	Distance to nearest obstacle: $ρ = \min_{o \in O} ∥ q - o ∥$ .
$ρ_{0}$	Cells	Influence distance of the repulsive potential ( $ρ_{0} = 3.0$ ).
$\nabla U (q)$	Vector	Gradient vector of the total potential at position $q$ .
$U_{Δ, scale}$	Scalar	Auto-calibrated potential-difference normalization constant for reward shaping, computed from $\| Δ U \|$ .
$G_{scale}$	Scalar	Auto-calibrated gradient-magnitude normalization constant for energy-aware velocity modulation, computed from $∥ \nabla U ∥$ .
${\tilde{U}}_{i}$	Scalar	Zero-centered and range-normalized potential at successor $a_{i}$ .
T, $T_{0}$ , $T_{min}$	Scalar	Softmax temperature; initial value; floor.
$\bar{U}$	Scalar	Mean successor potential, $\bar{U} = {\| A \|}^{- 1} \sum_{j} U_{j}$ , used in Equation (10).
$U_{min}$	Scalar	Minimum successor potential subtracted in the softmax of Equation (9) for numerical stability.
$ϵ_{U}$	Scalar	Numerical-stability constant in Equation (10); $ϵ_{U} = 10^{- 9}$ .
G	Integer	Grid side length in cells (default $G = 50$ ).
$Δ_{a}$	Vector	Discrete-action displacement vector applied by Algorithm 1.
Reward and Q-learning
$r (s, a, s^{'})$	Scalar	Step reward function.
$R_{outcome}$	Scalar	Terminal outcome bonus or penalty (+100, −50 or 0).
$λ_{s}, λ_{c}, λ_{g}$	Scalar	Step, proximity and progress reward weights.
$d_{goal} (q)$	Cells	Euclidean distance to the goal.
$Q (s, a)$	Scalar	Action-value function.
$Q_{0}$	Scalar	Optimistic Q-table initializer ( $Q_{0} = 5.0$ ).
$η$	$(0, 1]$	Q-learning rate.
$γ$	$[0, 1)$	Discount factor ( $γ = 0.95$ ).
$ε$	$[0, 1]$	Exploration probability for $ε$ -greedy.
$F (s, s^{'})$	Scalar	Potential-based shaping increment.
$Φ (s)$	Scalar	State potential for reward shaping, $Φ = - U / U_{Δ, scale}$ .
$λ (e)$	Scalar	Adaptive shaping coefficient at episode e.
$λ_{min}, λ_{max}, β$	Scalar	Floor, ceiling and decay rate of $λ (e)$ .
$w_{g}^{train}, w_{g}^{eval}$	Scalar	APF guidance weights at training and evaluation.
$H_{d_{g}}$	Buffer	Stuck-history buffer of recent goal distances used by the anti-deadlock monitor.
$E_{max}$	Integer	Maximum number of training episodes.
$T_{max}$	Integer	Maximum episode horizon in steps.
$E_{ramp}$	Integer	Curriculum ramp length in episodes.
$N_{o}^{(0)}$ , $N_{o}^{(target)}$	Integer	Initial and target obstacle densities for the curriculum schedule.
CBF safety filter
$h (q)$	Scalar	Discrete barrier function, $h = ρ - ϵ_{coll}$ .
$ϵ_{coll}$	Cells	Collision radius ( $ϵ_{coll} = 1.5$ ).
$δ_{safe}$	Cells	Inflated safety margin ( $δ_{safe} = 0.3$ ).
$C$	Set	Nominal safe set, ${q : h (q) \geq δ_{safe}}$ .
${\tilde{C}}_{p}$	Set	High-probability inflated safe set.
$V_{e} (q, a)$	Integer	Per episode visit counter.
M	Integer	Visit cap per cell–action pair ( $M = 3$ ).
$M_{s}, M_{f}$	Subsets of $A$	Safe-mask and forbidden-mask sets.
$a^{★}$	Action	QAPF nominal action prior to filtering.
$N_{stag}$	Integer	Number of consecutive stagnation windows that triggers Stagnation-Unreachable.
$W_{stag}$	Integer	Stagnation-window length in steps ( $W_{stag} = 15$ ).
$Δ d_{stag}$	Cells	Goal-distance variation threshold defining a stagnation window ( $Δ d_{stag} = 1.0$ ).
$C_{stag}$	Integer	Persistent-stagnation counter used by Algorithm 2.
Multi-robot and energy
$q_{k}$ , $q_{j}$	$R^{2}$	Position of robots k and j.
$U_{rep}^{MR}$	Scalar	Inter-robot repulsive potential.
$k_{rep}^{MR}, ρ_{0}^{MR}$	Scalar	Multi-robot repulsion gain and influence distance.
$E (t)$	Scalar	Per step kinetic plus jerk energy.
$E_{ep}$	Scalar	Total episode energy.
$c_{v}, c_{a}$	Scalar	Kinetic and jerk-energy weights.
$v (q)$	Cells/step	Modulated speed at position $q$ .
$v_{min}, v_{max}$	Cells/step	Speed floor and ceiling.
$k_{E}$	Scalar	Velocity-modulation aggressiveness coefficient.
$\tilde{g} (q)$	Scalar	Normalized gradient magnitude, $∥ \nabla U (q) ∥ / G_{scale}$ .
$σ (\cdot)$	Function	Logistic sigmoid, $σ (x) = 1 / (1 + e^{- x})$ .
Noise model and robustness
$η_{t}^{obs}$	Vector	Observation noise at step t.
$σ_{obs}$	Cells	Observation-noise standard deviation.
$p_{act}$	$[0, 1]$	Actuator-slip probability.
$η_{t}^{ext}$	Vector	External positional drift at step t.
$σ_{ext}$	Cells	External-drift standard deviation.
$δ_{p}$	Cells	Radius of the high-probability error ball at level p.
$I_{2}$	Matrix	$2 \times 2$ identity matrix used in noise covariance.
Training schedule
e	$Z_{\geq 0}$	Episode index.
t	$Z_{\geq 0}$	Decision-step index within an episode.

Appendix B.2. Abbreviations

The abbreviations used throughout this manuscript are listed in Table A4. The list contains domain abbreviations (e.g., APF, CBF, MDP), method names (e.g., QAPF, DQN, RRT*) and evaluation-metric abbreviations (e.g., SR, CR). Readers can use Table A4 as a quick reference for any abbreviation encountered in the methodology, experimental analysis and discussion sections.

Table A4. Abbreviations used in this manuscript.

Abbreviation	Meaning
Domain and concept
APF	Artificial potential field.
CBF	Control barrier function.
MDP	Markov decision process.
QP	Quadratic program.
RL	Reinforcement learning.
DRL	Deep reinforcement learning.
RSS	Responsibility-sensitive safety (contract).
SLAM	Simultaneous localization and mapping.
GP	Gaussian process.
GC	Garbage collection (Python runtime).
Methods (proposed and baselines)
QAPF	Q-learning with artificial potential field (proposed).
QAPF+CBF	QAPF augmented with the discrete CBF-inspired action filter (proposed).
Std. QL	Standard tabular Q-learning.
EQL	Efficient Q-learning baseline using optimistic initialization and slow $ε$ decay.
CQL	Conservative Q-learning.
DQN	Deep Q-network.
PPO	Proximal policy optimization.
SAC	Soft actor–critic.
TD3	Twin delayed deep deterministic policy gradient.
RRT*	Optimal rapidly-exploring random tree.
A*	A-star graph-search planner.
iADA*-RL	Incremental anytime dynamic A* with reinforcement learning.
Metrics, evaluation and protocols
SR	Success Rate.
CR	Collision Rate.
pp	Percentage point(s).
Hz	Hertz (decisions per second).
Hardware and platforms
i7	Intel Core i7 reference laptop CPU.
DGX	NVIDIA DGX-class server platform.
AGX Orin	NVIDIA Jetson AGX Orin embedded module.
Orin NX	NVIDIA Jetson Orin NX embedded module.
Nano	NVIDIA Jetson Nano embedded module.
GPU	Graphics processing unit.
CPU	Central processing unit.
Statistics
i.i.d.	Independent and identically distributed.
p	p-value (statistical significance).
$p95, p99$	95th and 99th percentile of a distribution.

References

1. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
2. Chen, L.; Wu, P.; Chitta, K.; Jaeger, B.; Geiger, A.; Li, H. End-to-End Autonomous Driving: Challenges and Frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10164–10183.
3. Yao, Q.; Zheng, Z.; Qi, L.; Yuan, H.; Guo, X.; Zhao, M.; Liu, Z.; Yang, T. Path Planning Method with Improved Artificial Potential Field: A Reinforcement Learning Perspective. IEEE Access 2020, 8, 135513–135523.
4. Aradi, S. Survey of deep reinforcement learning for motion planning of autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 740–759.
5. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926.
6. Khatib, O. Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robot. Res. 1986, 5, 90–98.
7. Herrera Ortiz, J.A.; Rodríguez-Vázquez, K.; Padilla Castañeda, M.A.; Arámbula Cosío, F. Autonomous robot navigation based on the evolutionary multi-objective optimization of potential fields. Eng. Optim. 2013, 45, 19–43.
8. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
10. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
11. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning; Proceedings of Machine Learning Research; PMLR: Stockholm, Sweden, 2018; Volume 80, pp. 1861–1870.
12. Orozco-Rosas, U.; Picos, K.; Pantrigo, J.J.; Montemayor, A.S.; Cuesta-Infante, A. Mobile robot path planning using a QAPF learning algorithm for known and unknown environments. IEEE Access 2022, 10, 84648–84663.
13. Maoudj, A.; Hentout, A. Optimal path planning approach based on Q-learning algorithm for mobile robots. Appl. Soft Comput. 2020, 97, 106796.
14. Ng, A.Y.; Harada, D.; Russell, S. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning; Morgan Kaufmann: San Francisco, CA, USA, 1999; pp. 278–287.
15. Ames, A.D.; Xu, X.; Grizzle, J.W.; Tabuada, P. Control barrier function based quadratic programs for safety critical systems. IEEE Trans. Autom. Control 2017, 62, 3861–3876.
16. Cheng, R.; Orosz, G.; Murray, R.M.; Burdick, J.W. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2019; Volume 33, pp. 3387–3395.
17. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv 2016, arXiv:1610.03295.
18. Dinneweth, J.; Boubezoul, A.; Mandiau, R.; Espié, S. Multi-agent reinforcement learning for autonomous vehicles: A survey. Auton. Intell. Syst. 2022, 2, 27.
19. Montiel, O.; Sepúlveda, R.; Orozco-Rosas, U. Optimal Path Planning Generation for Mobile Robots using Parallel Evolutionary Artificial Potential Field. J. Intell. Robot. Syst. 2015, 79, 237–257.
20. Low, E.S.; Ong, P.; Cheah, K.C. Solving the optimal path planning of a mobile robot using improved Q-learning. Robot. Auton. Syst. 2019, 115, 143–161.
21. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the International Conference on Learning Representations. arXiv 2016, arXiv:1511.05952.
22. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning; Proceedings of Machine Learning Research; PMLR: Stockholm, Sweden, 2018; Volume 80, pp. 1587–1596.
23. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 15084–15097.
24. Janner, M.; Li, Q.; Levine, S. Offline Reinforcement Learning as One Big Sequence Modeling Problem. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 1273–1286.
25. Muhammad, K.; Ullah, A.; Lloret, J.; Del Ser, J.; de Albuquerque, V.H.C. Deep Learning for Safe Autonomous Driving: Current Challenges and Future Directions. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4316–4336.
26. You, C.; Lu, J.; Filev, D.; Tsiotras, P. Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning. Robot. Auton. Syst. 2019, 114, 1–18.
27. Maw, A.A.; Tyan, M.; Nguyen, T.A.; Lee, J.W. iADA*-RL: Anytime Graph-Based Path Planning with Deep Reinforcement Learning for an Autonomous UAV. Appl. Sci. 2021, 11, 3948.
28. Ames, A.D.; Coogan, S.; Egerstedt, M.; Notomista, G.; Sreenath, K.; Tabuada, P. Control barrier functions: Theory and applications. In Proceedings of the 2019 18th European Control Conference (ECC); IEEE: New York, NY, USA, 2019; pp. 3420–3431.
29. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. On a Formal Model of Safe and Scalable Self-Driving Cars. arXiv 2017, arXiv:1708.06374.
30. Zhu, M.; Wang, Y.; Pu, Z.; Hu, J.; Wang, X.; Ke, R. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transp. Res. Part C Emerg. Technol. 2020, 117, 102662.
31. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
32. Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107.
33. Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative Q-Learning for Offline Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1179–1191.
34. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning; Proceedings of Machine Learning Research; PMLR: Mountain View, CA, USA, 2017; Volume 78, pp. 1–16.
35. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 23–30.
36. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI); IEEE: New York, NY, USA, 2020; pp. 737–744.
37. Huang, X.; Kwiatkowska, M.; Wang, S.; Wu, M. Safety Verification of Deep Neural Networks. In Proceedings of the Computer Aided Verification; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10426, pp. 3–29.
38. Karaman, S.; Frazzoli, E. Sampling-Based Algorithms for Optimal Motion Planning. Int. J. Robot. Res. 2011, 30, 846–894.

Figure 1. End-to-end system overview of the proposed Adaptive QAPF+CBF framework. The four-stage closed loop comprises: (i) Environment and Sensing, which constructs the discrete state $s$ of Equation (1) from the grid layout, robot pose, goal pose and obstacle set; (ii) the QAPF policy, which fuses the Q-learning update of Equation (6) with the APF guidance of Equations (2)–(4) via the zero-centered hybrid score of Equation (11) and emits a nominal action $a^{★}$; (iii) the discrete CBF-inspired safety filter of Algorithm 3, which intersects the barrier-derived safe-mask with the per episode visit-memory mask and emits the executed action $a^{exec}$; and (iv) robot actuation, which closes the loop by returning the next state $s^{'}$ to the sensing layer. Static obstacles are drawn at the collision radius $ϵ_{coll} = 1.5$ cells; light-blue rings denote the APF influence radius $ρ_{0} = 3.0$ cells; the green square is the start $s_{0}$, and the gold star is the goal $q_{f}$.

Figure 2. Visualization of the potential-field components: (a) attractive potential field, which decreases as the goal is approached; (b) repulsive potential field illustrating the high-potential regions around obstacles, with red dots marking obstacle centers and white squares marking the start ($s_{0}$) and goal ($q_{f}$) cells; (c) combined total potential field with the navigation path from start to goal, using the same marker convention.

Figure 3. Held-out success rate is plotted against training episode for all evaluated methods. The performance plateau is reached by QAPF and QAPF+CBF within the 1500-episode training budget at interpolated plateau-crossing estimates of approximately episode 205 and 230, respectively. These values are obtained from the smoothed learning curves and are not required to be multiples of the 50-episode evaluation-logging interval. This reported convergence horizon is substantially shorter than the $\sim 3 \times 10^{6}$ episodes reported by the original QAPF formulation [12], although the comparison is protocol-dependent and not a controlled one-to-one speedup claim.

Figure 4. Representative navigation trajectories from the QAPF+CBF agent. Red filled circles denote obstacles at the collision radius $ϵ_{coll} = 1.5$ cells; light-blue regions denote the APF influence radius $ρ_{0} = 3.0$ cells; the green square is the start $s_{0}$; the gold star is the goal $q_{f}$; the blue curve is the executed trajectory $\{q_{t}\}_{t = 0}^{T}$. The path illustrates smooth APF-guided obstacle deflection while maintaining CBF-enforced clearance.

Figure 5. Safety analysis of the CBF filter. The left panel plots the barrier value $h(q)$ along a representative trajectory; the red curve corresponds to QAPF without the filter and dips into the pink collision zone ($h < 0$) at multiple steps, while the green curve corresponds to QAPF+CBF and stays at or above the safety margin throughout. The right panel reports the held-out collision rates from Table 5 across all evaluated methods, including APF-Only at 7.9%, QAPF at 6.2% and QAPF+CBF at 0.3%. The collision-reduction headline is made specifically against the internal QAPF-without-filter ablation: 6.2% to 0.3%, which is an approximately $20 \times$ reduction.
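
The behavior in the left panel can be illustrated with a one-step discrete action filter. The sketch below is a hedged approximation, not the authors' filter: it assumes the barrier is $h(q) = ρ_{min}(q) - ϵ_{coll}$, that an action is admissible when its successor cell satisfies $h \geq δ_{safe}$, and that the policy supplies a best-first action ranking. The values of $ϵ_{coll}$ and $δ_{safe}$ come from Table 3; the function names and the fallback rule are hypothetical, and the visit-memory cap is not modeled.

```python
import math

EPS_COLL, DELTA_SAFE = 1.5, 0.3  # cells, from Table 3


def barrier(q, obstacles):
    """Assumed barrier h(q) = rho_min(q) - eps_coll; h < 0 means collision."""
    rho_min = min(math.dist(q, obs) for obs in obstacles)
    return rho_min - EPS_COLL


def filter_actions(q, obstacles, ranked_actions):
    """Return the highest-ranked action whose successor keeps h >= delta_safe.

    ranked_actions is ordered best-first by the learned policy. If no action
    is admissible, fall back to the action with the largest successor barrier
    (hypothetical fallback rule).
    """
    def successor(a):
        return (q[0] + a[0], q[1] + a[1])

    for a in ranked_actions:
        if barrier(successor(a), obstacles) >= DELTA_SAFE:
            return a
    return max(ranked_actions, key=lambda a: barrier(successor(a), obstacles))


obstacles = [(10.0, 10.0)]
# The policy prefers moving right, toward the obstacle; the filter overrides it.
chosen = filter_actions((8, 10), obstacles, [(1, 0), (1, 1), (0, 1)])
```

In this toy case the first two candidates land inside the collision radius, so the filter selects the third. The headline ratio in the caption follows directly from Table 5: 6.2/0.3 ≈ 20.7.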

Figure 6. Stability analysis of the reward-shaping signal across training transitions. The raw potential-difference signal is heavy-tailed (range ∼[−70, +70]), whereas the clipped normalized shaping term remains bounded in $[-1, +1]$, preventing unstable Q-value updates from outlier $Δ U$ values.
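
The bounding effect described in the caption can be sketched in a few lines. The example assumes the shaping term is the potential decrease divided by the calibration scale of Table 3 (95th percentile of $|Δ U|$) and then clipped to $[-1, +1]$; the sign convention, the function names and the synthetic heavy-tailed samples are assumptions made for illustration.

```python
import numpy as np


def calibrate_scale(delta_u_samples, percentile=95.0):
    """Scale taken as the 95th percentile of |dU| over sampled transitions."""
    return float(np.percentile(np.abs(delta_u_samples), percentile))


def shaping_term(u_prev, u_next, scale):
    """Normalized, clipped shaping term in [-1, +1].

    Sign convention is an assumption: moving to lower potential is rewarded.
    """
    return float(np.clip((u_prev - u_next) / scale, -1.0, 1.0))


rng = np.random.default_rng(0)
# Synthetic heavy-tailed stand-in for raw dU samples (not the paper's data).
samples = rng.standard_t(df=2, size=2000) * 5.0
scale = calibrate_scale(samples)
shaped = [shaping_term(0.0, du, scale) for du in samples]
```

Because the scale is a percentile rather than the maximum, roughly 5% of transitions saturate at the clip bounds, which is what keeps a single outlier from dominating a Q-value update.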

Figure 7. Safety-performance Pareto analysis across all evaluated methods. QAPF+CBF (top-left) sits closest to the safety-performance utopia point because near-QAPF success is preserved while collisions are reduced by approximately $20 \times$. In the transition from QAPF to QAPF+CBF, 0.7 pp of success is traded for the collision reduction, and the bottom-right baselines are clearly Pareto-dominated.

Figure 8. Robustness analysis is shown with the success rate per noise regime in panel (a) and degradation in percentage points relative to the clean condition in panel (b). The smallest observation-noise degradation in this finite suite is observed for QAPF+CBF, while actuator slip degrades all methods because the realized transition is perturbed downstream of action selection and cannot be removed by observation-side margin inflation.

Figure 9. The ablation study is shown for RL-Only, APF-Only, QAPF-Fixed-$λ$, QAPF-Full and QAPF+CBF. RL-Only denotes the strict component-isolation configuration described in the Table 11 caption and is distinct from the Std. QL baseline of Table 5.

Figure 10. Diagnostic sensitivity analysis showing the effect of the shaping magnitude and decay rate on the success rate and convergence under the exploratory sweep configuration ($λ_{min} = 1.0$, $T_{0} = 5.0$, $w_{g}^{train} = 0.30$, $w_{g}^{eval} = 0.90$). The highlighted $(λ_{max} = 5, β = 0.005)$ region marks the production shaping coordinates only; absolute SR/convergence values in this landscape do not match the QAPF row of Table 5 because the exploration parameters differ. Production defaults are reported in Table 3.

Figure 11. Energy modulation results across the three velocity profiles. APF-mod delivers the best balanced trade-off, with an 18% energy reduction for an 8.5% longer navigation time, whereas APF-agg maximizes the energy saving (30%) at a larger time cost (46.5% longer). The 94.5% success rate of the unmodulated policy is preserved across all three regimes.

Figure 12. The two-robot study is shown for the Independent, Cooperative and Cooperative+CBF configurations. Inter-robot APF repulsion improves joint success, but 5.1 residual inter-robot collisions per evaluation are still observed. When repulsion is paired with the per-robot CBF filter, joint success rises to 92.1% with only 0.5 residual inter-robot collisions. This is a two-robot result, and scalability beyond $N \approx 10$ is expected to require the decentralization strategies discussed in the main text.

Figure 13. The curriculum learning comparison is shown as learning curves of the fixed-density and progressive-curriculum schedules. Early learning is mainly accelerated by curriculum training, and the convergence episode is reduced (120 episodes against 205 episodes), while the final success rate remains close to that of the fixed-density policy.

Table 1. Feature-based comparison across key capability dimensions. ✔ = fully supported, ∼ = partially supported, and ✗ = not supported.

Method	Map-Free Nav.	Dynamic Obs.	Multi-Robot	Safety Filtering	Adaptive Learning	Local-Minima Mitigation	Real-Time Capable
Classical APF [6]	✔	✗	✗	✗	✗	✗	✔
Black-hole APF+RL [3]	✔	✔	✗	✗	∼	✔	✔
Evolutionary APF [19]	✗	✗	✗	✗	✗	∼	✗
Evo. Multi-obj. APF [7]	✔	✔	✔	✗	✗	✗	∼
QAPF (Orozco-Rosas) [12]	✔	✗	✗	✗	✔	∼	✔
Q-Learning [8]	✔	✗	✗	✗	✔	✗	✔
DQN [9]	✔	∼	✗	✗	✔	✔	∼
PPO [10]	✔	✔	✗	✗	✔	✔	∼
SAC [11]	✔	✔	✗	✗	✔	✔	∼
CBF-QP [15]	✗	∼	✗	✔	✗	✗	✔
CBF+RL [16]	✔	✔	✗	✔	✔	✔	∼
QAPF (Ours)	✔	✔	✔	✗	✔	∼	✔
QAPF+CBF (Ours)	✔	✔	✔	✔	✔	∼	✔

Table 2. Quantitative comparison with the most relevant prior QAPF formulation [12]. “−” indicates the dimension was not addressed in the prior work. The convergence-horizon comparison is protocol-dependent and should be interpreted as a reported-horizon contrast rather than as a controlled one-to-one runtime speedup.

Capability Dimension	Prior QAPF [12]	This Work	Improvement
Convergence horizon	$\sim 3 \times 10^{6}$ episodes	∼205 ep (QAPF)/∼230 ep (QAPF+CBF)	Protocol-dependent reported-horizon reduction
Collision rate (held-out)	− (not reported)	$6.2 %$ (QAPF)/ $0.3 %$ (QAPF+CBF)	∼20× internal reduction vs. QAPF-only ablation
Success rate (static)	Path-quality metric only	$94.5 %$ / $93.8 %$	Qualitative jump
Dynamic obstacles	−	$90.4 %$ SR (QAPF+CBF)	New capability
Narrow passages	−	$85.5 %$ SR (QAPF+CBF)	New capability
Multi-robot cooperation	−	$92.1 %$ joint SR	New capability
Robustness eval. (noise)	−	6 primary regimes (theory + exp.)	New capability
Unreachable-goal handling	−	Timeout/stagnation detector with empirical no-path investigation	New safe-failure capability
Embedded inference timing	−	∼4.9 kHz mean-call throughput on i7; projected embedded latencies	New capability

Table 3. Implementation specifications for the QAPF algorithm (values as implemented).

Parameter	Value/Description
Grid size	$50 \times 50$ cells
Max steps per episode	1000
Goal threshold ( $ϵ_{goal}$ )/Collision radius ( $ϵ_{coll}$ )	0.5/1.5 cells
Outcome rewards ( $R_{goal}$ , $R_{coll}$ , $λ_{s}$ , $λ_{c}$ , $λ_{g}$ )	$+ 100$ , $- 50$ , 1, 1, $0.5$
Learning rate $η$	0.15 (QAPF/QAPF+CBF); 0.1 (baselines)
Discount factor $γ$	0.95
Exploration ( $ε_{0}$ , $ε_{min}$ , $ε_{decay}$ )	0.3, 0.01, 0.995 (QAPF); 0.9, 0.01, 0.995 (baselines)
APF gains ( $k_{att}$ , $k_{rep}$ , $ρ_{0}$ )	1.0, 100.0, 3.0 cells
Softmax temperature ( $T_{0}$ , $T_{min}$ , $T_{decay}$ )	2.0, 0.3, 0.995/episode
Optimistic Q-init ( $Q_{0}$ )	5.0
Adaptive shaping ( $λ_{min}$ , $λ_{max}$ , $β$ )	0.5, 5.0, 0.005
Hybrid guidance weights ( $w_{g}^{train}$ , $w_{g}^{eval}$ )	1.2, 2.0
$U_{Δ, scale}$ calibration	95th percentile of $\| Δ U \|$ over 2000 random steps; used only for reward-shaping normalization
Gradient scale $G_{scale}$	95th percentile of $∥ \nabla U ∥$ over 500 random steps; used only for energy-aware velocity modulation
CBF safety margin ( $δ_{safe}$ )/visit cap (M)	0.3 cells/3 revisits per cell–action pair
Empirical unreachable detection	$T_{max} = 1000$ steps; $W_{stag} = 15$ steps; $Δ d_{stag} = 1.0$ cell; $N_{stag} = 3$ consecutive stagnation windows
Max episodes/evaluation logging/convergence extraction	1500/every 50 episodes/convergence episode estimated by interpolation of the smoothed held-out SR curve
Final held-out evaluation episodes	100
Independent seeds	30
Energy-aware module
$k_{E}$ (Constant/APF-mod/APF-agg)	$10^{- 3}$ /0.5/1.5
$v_{max}$ , $v_{min}$ , $c_{v}$ , $c_{a}$	1.0, 0.05, 1.0, 0.5
Multi-robot module
$k_{rep}^{MR}$ , $ρ_{0}^{MR}$	20.0, 3.0 cells
Curriculum schedule (Section 5.7.5)
Initial density $N_{o}^{(0)}$ /target density $N_{o}^{(target)}$	5/15
Ramp episodes $E_{ramp}$ /schedule shape	400/linear, monotonic seed
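
The decay-type entries of Table 3 can be read as per-episode schedules. The sketch below uses the tabulated QAPF values but assumes the functional forms: multiplicative decay with a floor for $ε$ and the softmax temperature, and an exponential decay from $λ_{max}$ to $λ_{min}$ at rate $β$ for the adaptive shaping weight. These forms and the function names are illustrative assumptions; the exact update rules are those given in the methodology section.

```python
import math


def epsilon(episode, eps0=0.3, eps_min=0.01, decay=0.995):
    """Multiplicative epsilon decay with a floor (assumed update rule)."""
    return max(eps_min, eps0 * decay ** episode)


def temperature(episode, t0=2.0, t_min=0.3, decay=0.995):
    """Per-episode softmax temperature decay with a floor (assumed update rule)."""
    return max(t_min, t0 * decay ** episode)


def shaping_weight(episode, lam_min=0.5, lam_max=5.0, beta=0.005):
    """Assumed exponential decay of the shaping weight from lam_max to lam_min."""
    return lam_min + (lam_max - lam_min) * math.exp(-beta * episode)


# Print the three schedules at the start, near convergence and at the budget.
for ep in (0, 200, 1500):
    print(ep, round(epsilon(ep), 4), round(temperature(ep), 4),
          round(shaping_weight(ep), 4))
```

Under these assumed forms, shaping dominates early episodes and fades toward $λ_{min}$ as the learned Q-values become reliable, while both exploration parameters settle at their floors well before the 1500-episode budget.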

Table 4. Simulation parameters and experimental configuration.

Category	Parameter	Value/Description
Environment	Grid size	$50 \times 50$ cells
	Obstacle count	15 (default); variable in scenarios
	Obstacle types	Static, dynamic, narrow-passage
Hardware	Processor	Intel Core i7 (8 cores, 3.6 GHz)
	Memory	16 GB DDR4 RAM
	Parallelism	Multiprocessing, 16 CPU threads
Software	Language	Python 3.8
	Numerical	NumPy 1.21.0
	Visualization	Matplotlib 3.4.0
Evaluation	Independent runs	30 for main comparative, scenario, ablation and generalization experiments
	Protocol-specific exceptions	Robustness: 5 training seeds × 30 noisy episodes per regime; timing: repeated per-decision calls
	Training episodes	Up to 1500 (main); 1000 (multi-robot)
	Evaluation	Held-out suites (disjoint seed pools)

Table 5. Main comparison in the static-obstacle setting with 15 obstacles, reporting means and standard deviations across independent seeds on held-out evaluation maps with the best values per metric shown in bold.

Method	Success Rate (%)	Collision Rate (%)	Min. Clearance (Cells)	Conv. Episode
APF-Only	$72.8 \pm 4.1$	$7.9 \pm 2.1$	$2.54 \pm 0.12$	N/A ^‡
Std. QL	$78.3 \pm 3.4$	$22.4 \pm 3.1$	$3.52 \pm 0.35$	$385.0 \pm 45.0$
EQL	$86.2 \pm 2.8$	$14.3 \pm 2.5$	$3.10 \pm 0.28$	$320.0 \pm 38.0$
CQL	$82.5 \pm 3.1$	$18.5 \pm 2.8$	$3.35 \pm 0.32$	$350.0 \pm 42.0$
DQN	$84.7 \pm 3.5$	$16.8 \pm 3.2$	$3.21 \pm 0.42$	$300.0 \pm 55.0$
QAPF	$94.5 \pm 2.1$	$6.2 \pm 1.8$	$2.78 \pm 0.15$	$205.0 \pm 35.0$
QAPF+CBF	$93.8 \pm 2.3$	$0.3 \pm 0.4$	$3.15 \pm 0.18$	$230.0 \pm 40.0$

Metric definitions. SR = percentage of held-out episodes that reach the goal; CR = percentage of episodes terminating with $ρ_{min} < ϵ_{coll}$; Min. Clearance = mean closest-approach distance $\min_{t} ρ_{min}(q_{t})$ in grid cells, interpreted together with CR; Conv. Episode = interpolated plateau-crossing episode from the smoothed held-out SR curve, using the five-percentage-point threshold relative to the final/asymptotic mean.

^‡ APF-Only is a non-learning method and therefore has no convergence-of-learning episode. The ∼50 episodes previously reported for APF-Only correspond to the index at which its empirical success-rate trace stabilizes; that value is reported only in the running text rather than as a learning-convergence metric in this table. See also Table 17 (Axis A) for the same comparison evaluated under the wider multi-axis generalization protocol of 1000 episodes per cell rather than the 3000-episode held-out suite used here.
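
The Conv. Episode definition can be made concrete with a short sketch. It assumes a causal moving-average smoother and linear interpolation between the two logging points that bracket the first crossing of the final value minus five percentage points; the smoother, its window and the function name are assumptions, since only the interpolation principle and the threshold are stated here.

```python
import numpy as np


def convergence_episode(episodes, success, window=3, threshold_pp=5.0):
    """Interpolated episode at which the smoothed SR first comes within
    threshold_pp percentage points of its final value.

    The moving-average smoother and its window are assumptions.
    """
    episodes = np.asarray(episodes, float)
    success = np.asarray(success, float)
    kernel = np.ones(window) / window
    padded = np.pad(success, (window - 1, 0), mode="edge")  # causal smoothing
    smooth = np.convolve(padded, kernel, mode="valid")
    target = smooth[-1] - threshold_pp
    i = int(np.nonzero(smooth >= target)[0][0])  # first logging point at/above target
    if i == 0:
        return float(episodes[0])
    # Linear interpolation between the two logging points around the crossing.
    frac = (target - smooth[i - 1]) / (smooth[i] - smooth[i - 1])
    return float(episodes[i - 1] + frac * (episodes[i] - episodes[i - 1]))


# Synthetic curve logged every 50 episodes (illustrative, not the paper's data).
eps = np.arange(50, 1501, 50)
sr = 94.0 * (1.0 - np.exp(-eps / 80.0))
print(convergence_episode(eps, sr))
```

Because the crossing is interpolated, the result generally falls between logging points, which is why values such as 205 or 230 are not multiples of the 50-episode logging interval.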

Table 6. Multi-scenario evaluation results. Metric definitions match Table 5.

Scenario	Method	Success Rate (%)	Collision Rate (%)	Min. Clearance (Cells)
Static	Std. QL	78.3	22.4	3.52
	EQL	86.2	14.3	3.10
	CQL	82.5	18.5	3.35
	DQN	84.7	16.8	3.21
	QAPF	$94.5$	6.2	2.78
	QAPF+CBF	93.8	$0.3$	3.15
Dynamic	Std. QL	61.5	31.8	2.85
	EQL	71.8	24.7	2.62
	CQL	67.2	28.3	2.74
	DQN	74.1	22.4	2.55
	QAPF	89.3	9.8	2.18
	QAPF+CBF	$90.4$	$0.5$	2.82
Narrow	Std. QL	55.8	37.6	3.95
	EQL	64.3	29.2	3.65
	CQL	60.1	34.8	3.78
	DQN	69.4	12.5	3.42
	QAPF	$85.7$	9.8	2.38
	QAPF+CBF	85.5	$0.3$	3.08

Metric definitions. SR = success rate; CR = collision rate; Min. Clearance = mean closest approach distance in grid cells. These definitions match Table 5; clearance should be read together with the CR because methods can have high mean clearance on non-collision trajectories while still failing in a separate subset of episodes.

Table 7. Primary noise regimes reported in the robustness study. All noise is i.i.d. per step.

Regime	$σ_{obs}$ (Cells)	$p_{act}$	$σ_{ext}$ (Cells)	Physical Interpretation
clean	0.0	0.00	0.00	ideal conditions
obs_low	0.3	0.00	0.00	light sensor noise
obs_high	0.8	0.00	0.00	heavy sensor noise
act_low	0.0	0.05	0.00	mild actuator slip
act_high	0.0	0.15	0.00	severe actuator slip
combined	0.3	0.05	0.10	realistic mixed disturbance
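
The six regimes can be encoded directly as a small configuration table. The sketch below copies the parameter values from Table 7; the three perturbation functions are assumptions about how each parameter acts (Gaussian observation noise, random-action replacement on slip, Gaussian displacement of the realized position), and all names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseRegime:
    sigma_obs: float  # std of Gaussian observation noise (cells)
    p_act: float      # probability of an actuator slip per step
    sigma_ext: float  # std of external position disturbance (cells)


REGIMES = {
    "clean": NoiseRegime(0.0, 0.00, 0.00),
    "obs_low": NoiseRegime(0.3, 0.00, 0.00),
    "obs_high": NoiseRegime(0.8, 0.00, 0.00),
    "act_low": NoiseRegime(0.0, 0.05, 0.00),
    "act_high": NoiseRegime(0.0, 0.15, 0.00),
    "combined": NoiseRegime(0.3, 0.05, 0.10),
}


def noisy_observation(q, regime, rng):
    """Add i.i.d. Gaussian noise to the observed position."""
    return tuple(c + rng.gauss(0.0, regime.sigma_obs) for c in q)


def realized_action(action, actions, regime, rng):
    """With probability p_act, replace the chosen action by a random one
    (assumed slip model)."""
    return rng.choice(actions) if rng.random() < regime.p_act else action


def disturbed_position(q, regime, rng):
    """Add i.i.d. Gaussian external disturbance to the realized position."""
    return tuple(c + rng.gauss(0.0, regime.sigma_ext) for c in q)


rng = random.Random(0)
obs = noisy_observation((10.0, 12.0), REGIMES["obs_low"], rng)
```

Separating the three functions mirrors the point made in the Figure 8 caption: observation noise enters before action selection, whereas slip perturbs the transition after the action has been chosen.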

Table 8. Robustness study reporting the success rate (%) under each primary noise regime. The protocol of this table is a finite robustness screen and uses noisy evaluation episodes per regime, distinct from the headline static-condition evaluation of Table 5, which uses the same 30 seeds but 100 clean held-out episodes. The clean-regime row of this table is therefore a robustness-screen reference and not a replacement for the headline static-obstacle number reported in Table 5; the two protocols are deliberately kept separate. The best success rate per regime is in bold.

Regime	APF-Only	QAPF	QAPF+CBF
clean	$86.7 \pm 6.2$	$85.0 \pm 10.8$	$100.0 \pm 0.0$
obs_low	$86.7 \pm 6.2$	$75.0 \pm 4.1$	$100.0 \pm 0.0$
obs_high	$86.7 \pm 6.2$	$81.7 \pm 2.4$	$100.0 \pm 0.0$
act_low	$75.0 \pm 10.8$	$75.0 \pm 7.1$	$83.3 \pm 2.4$
act_high	$65.0 \pm 10.8$	$63.3 \pm 8.5$	$70.0 \pm 0.0$
combined	$78.3 \pm 2.4$	$68.3 \pm 11.8$	$85.0 \pm 4.1$
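
The degradation values plotted in panel (b) of Figure 8 are differences from the clean row, in percentage points. The sketch below recomputes them from the mean values of Table 8; the dictionary layout and function name are illustrative, and the standard deviations are deliberately left out.

```python
# Mean success rates (%) copied from Table 8.
TABLE_8 = {
    "APF-Only": {"clean": 86.7, "obs_low": 86.7, "obs_high": 86.7,
                 "act_low": 75.0, "act_high": 65.0, "combined": 78.3},
    "QAPF": {"clean": 85.0, "obs_low": 75.0, "obs_high": 81.7,
             "act_low": 75.0, "act_high": 63.3, "combined": 68.3},
    "QAPF+CBF": {"clean": 100.0, "obs_low": 100.0, "obs_high": 100.0,
                 "act_low": 83.3, "act_high": 70.0, "combined": 85.0},
}


def degradation_pp(method):
    """Drop in success rate relative to the clean regime, in percentage points."""
    clean = TABLE_8[method]["clean"]
    return {regime: round(clean - sr, 1)
            for regime, sr in TABLE_8[method].items() if regime != "clean"}


for method in TABLE_8:
    print(method, degradation_pp(method))
```

For QAPF+CBF the observation-noise regimes show no drop, while severe actuator slip costs 30.0 pp, consistent with the statement that slip acts downstream of action selection.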

Table 9. Inference latency: median and mean per-decision time (μs) physically measured on the Intel i7 reference, and projected median latency on embedded/mobile device classes. Only the i7 columns are physical measurements; the Jetson Nano, Jetson Orin NX, Jetson AGX Orin and DGX columns are device-class projections obtained by applying the conservative CPU-side scaling factors of Table 10, and they are not a substitute for direct on-device measurement. The “Hz @ i7 (mean)” column follows the implementation and reports the throughput computed from the mean call latency, not from the median column, so it is not directly comparable with the projected median-latency columns. All methods exceed the 20 Hz real-time threshold on the measured i7 reference and under all projected device-class latencies. Direct physical Jetson AGX Orin and Jetson Nano latency measurements, together with calibration of the Table 10 multipliers against measured ratios, are identified as future work item (i) in Section 6.

Method	i7 Median (µs)	i7 Mean (µs)	Nano	Orin NX	AGX Orin	DGX	Hz @ i7 (Mean)	20 Hz?
APF-Only	158.6	228.2	380.6	174.5	142.7	111.0	4382	✔ ^†
Std. QL	9.0	14.7	21.6	9.9	8.1	6.3	68,000	✔
DQN	13.4	17.5	32.2	14.7	12.1	9.4	57,000	✔
QAPF	149.4	173.6	358.5	164.3	134.4	104.6	5761	✔
QAPF+CBF	173.7	204.1	417.0	191.1	156.4	121.6	4899	✔

^† APF-Only on Jetson Nano has a projected median latency of 380.6 μs, comfortably below the 50 ms budget of a 20 Hz control loop. Percentile latencies beyond the median are retained in the raw timing logs, which are available from the corresponding author upon reasonable request, rather than in the compact manuscript table.

Table 10. Conservative CPU-side scaling factors used for device-class latency projection. Factor $= \mathrm{median}(\mathrm{latency}_{device}) / \mathrm{median}(\mathrm{latency}_{i7\,ref})$. These are projections, not physical on-device measurements.

Platform	Scaling Factor	Rationale
Intel i7 (reference)	$\times 1.0$	Baseline (laptop-class, 8 cores, 3.6 GHz)
Jetson Nano	$\times 2.4$	ARM Cortex-A57 @ 1.43 GHz, no AVX
Jetson Orin NX	$\times 1.1$	ARM Cortex-A78AE @ 2.0 GHz
Jetson AGX Orin	$\times 0.9$	ARM Cortex-A78AE @ 2.2 GHz, wider memory
DGX-level GPU (proxy)	$\times 0.7$	CPU-side overhead only; numpy tables do not saturate GPU
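
The projected columns of Table 9 follow from the factors above by simple multiplication, and the throughput column is the reciprocal of the mean latency. The sketch below reproduces the APF-Only row as a check; the function names are illustrative, and the inputs are the rounded values printed in Table 9.

```python
SCALING = {"Nano": 2.4, "Orin NX": 1.1, "AGX Orin": 0.9, "DGX": 0.7}  # Table 10


def project_median_us(i7_median_us, device):
    """Projected median latency = measured i7 median x device-class factor."""
    return i7_median_us * SCALING[device]


def throughput_hz(mean_latency_us):
    """Decisions per second implied by a mean per-decision latency in microseconds."""
    return 1e6 / mean_latency_us


def meets_rate(latency_us, rate_hz=20.0):
    """True if one decision fits inside the control period (50 ms at 20 Hz)."""
    return latency_us <= 1e6 / rate_hz


# APF-Only row of Table 9: i7 median 158.6 us, i7 mean 228.2 us.
nano = project_median_us(158.6, "Nano")  # about 380.6 us
hz = throughput_hz(228.2)                # about 4382 Hz
ok = meets_rate(nano)
```

Recomputing other rows from the rounded medians can differ from the tabulated projections in the last digit (for example, 149.4 × 2.4 = 358.56 against the tabulated 358.5 for QAPF), presumably because the table was produced from unrounded timings.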

Table 11. Ablation study: isolated contribution of each QAPF component. RL-Only here denotes a strict component-isolation configuration; it removes the QAPF-specific adaptive shaping, APF-weighted softmax exploration and anti-deadlock monitor and is distinct from the Std. QL baseline reported in Table 5. The reported $250 \pm 0$ convergence value for RL-Only denotes the convergence-evaluation cap; RL-Only did not reach the convergence criterion within the ablation budget for any seed.

Variant	Success (%)	Collision (%)	Clearance	Conv. Ep
RL-Only	$45.2 \pm 5.8$	$48.7 \pm 6.3$	$1.75 \pm 0.35$	$250 \pm 0$
APF-Only	$72.8 \pm 4.1$	$7.9 \pm 2.1$	$2.54 \pm 0.12$	N/A
QAPF-Fixed- $λ$	$88.9 \pm 3.2$	$12.1 \pm 2.8$	$2.56 \pm 0.14$	$180 \pm 45$
QAPF-Full	$94.5 \pm 2.1$	$6.2 \pm 1.8$	$2.78 \pm 0.15$	$205 \pm 35$
QAPF+CBF	$93.8 \pm 2.3$	$0.3 \pm 0.4$	$3.15 \pm 0.18$	$230 \pm 40$

Table 12. Diagnostic sensitivity analysis under a separate exploratory sweep configuration ($λ_{min} = 1.0$, $T_{0} = 5.0$, $w_{g}^{train} = 0.30$, $w_{g}^{eval} = 0.90$). These settings are not the production defaults used in Table 5; the production defaults are listed in Table 3.

$λ_{max}$	$β$	Success Rate (%)	Conv. Episode
5	0.005	90.8	225
5	0.01	91.7	218
5	0.02	89.1	210
10	0.005	92.1	245
10	0.01	94.5	198
10	0.02	89.8	176
20	0.005	88.2	258
20	0.01	90.3	231
20	0.02	87.4	219

Reading note. The bolded $(λ_{max} = 5, β = 0.005)$ cell uses the production shaping schedule but was run under the exploratory settings of this sweep rather than the production exploration settings of Table 3; it is therefore not numerically equal to the QAPF row of Table 5 (94.5%/205 ep). The cell at $(λ_{max} = 10, β = 0.01)$ matches the Table 5 success rate of 94.5% only by coincidence of two independent runs and should not be interpreted as the production operating point.

Table 13. Energy-aware velocity modulation (Equations (13)–(15)). E is the total kinetic plus jerk energy accumulated over the episode; $E / E_{const}$ is the energy ratio relative to the constant-speed baseline; Nav. Time is the resulting effective navigation time in the simulator. The mean speed $\bar{v}$ is used internally to compute the timing profile but is not tabulated separately.

Regime	$k_{E}$	SR (%)	E	$E / E_{const}$	Nav. Time
Constant	$10^{- 3}$	94.5	60.05	1.00	14.2
APF-mod	0.5	94.5	49.25	0.82	15.4
APF-agg	1.5	94.5	42.04	0.70	20.8
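
The percentages quoted in the Figure 11 caption can be recomputed from the rows above. The sketch below is plain arithmetic on the tabulated E and Nav. Time values; the dictionary layout and function name are illustrative.

```python
# Energy and navigation time copied from Table 13.
TABLE_13 = {
    "Constant": {"energy": 60.05, "nav_time": 14.2},
    "APF-mod": {"energy": 49.25, "nav_time": 15.4},
    "APF-agg": {"energy": 42.04, "nav_time": 20.8},
}


def relative_to_constant(regime):
    """Return (energy saving %, navigation-time increase %) vs. the Constant row."""
    base = TABLE_13["Constant"]
    row = TABLE_13[regime]
    energy_saving = 100.0 * (1.0 - row["energy"] / base["energy"])
    time_increase = 100.0 * (row["nav_time"] / base["nav_time"] - 1.0)
    return round(energy_saving, 1), round(time_increase, 1)


print(relative_to_constant("APF-mod"))  # (18.0, 8.5)
print(relative_to_constant("APF-agg"))  # (30.0, 46.5)
```

Both results agree with the 18%/8.5% and 30%/46.5% figures reported for APF-mod and APF-agg.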

Table 14. Two-robot multi-robot study results.

Configuration	Joint Success Rate (%)	Inter-Robot Collisions
Independent	$79.5$	$14.3$
Cooperative	$89.2$	$5.1$
Coop+CBF	$92.1$	$0.5$

Table 15. Theoretical scaling cost of the centralized cooperative scheme as the number of robots N grows. The table separates the compute/communication scaling from the fleet-level joint-success bound, because both become limiting factors in large multi-robot deployments. The “$N = 10$ example” column reports the cost magnitude evaluated at $N = 10$ (e.g., $\sim 100$ pair evaluations per decision step at ∼2 μs per pair, consistent with the reference-platform timing of Section 5.6).

Cost Dimension	Scaling	$N = 10$ Example	Main Bottleneck
Per-decision pair evaluations	$O (N^{2})$	90 directed/45 unique pairs	Runtime at large N
Communication bandwidth	$O (N^{2})$	$\sim 90$ state messages per cycle	Wireless contention
Fleet-level joint success	$\prod_{k = 1}^{N} {SR}_{k}$	$0.938^{10} \approx 52.7 %$	Joint SR drops rapidly
Single-point-of-failure risk	Centralized coordinator	One coordinator	Resilience requirement
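
The $N = 10$ column can be checked with a few lines of arithmetic. The sketch below assumes identical and independent per-robot success rates, which is what the product form $\prod_{k} {SR}_{k}$ reduces to when every robot has the single-robot QAPF+CBF rate of 93.8%; the function names are illustrative.

```python
def directed_pairs(n):
    """Ordered robot pairs (i, j) with i != j."""
    return n * (n - 1)


def unique_pairs(n):
    """Unordered robot pairs."""
    return n * (n - 1) // 2


def joint_success(per_robot_sr, n):
    """Fleet-level success when all n robots succeed independently
    with the same probability (simplifying assumption)."""
    return per_robot_sr ** n


print(directed_pairs(10), unique_pairs(10))   # 90 45
print(round(100 * joint_success(0.938, 10), 1))  # 52.7
```

The same formula shows how quickly the bound degrades: a per-robot rate that looks high for one robot compounds into roughly a coin flip for a fleet of ten.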

Table 16. Curriculum learning study comparing fixed-density and progressive-curriculum training schedules.

Training Mode	Success Rate (%)	Convergence Episode
Fixed	$94.5 \pm 2.1$	$205 \pm 35$
Curriculum	$96.2 \pm 1.8$	$120 \pm 25$
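
The progressive schedule behind the Curriculum row can be sketched from the Table 3 entries (initial density 5, target density 15, 400 ramp episodes, linear shape). The rounding to an integer obstacle count and the function name are assumptions; the seed handling described in Table 3 is not modeled.

```python
def obstacle_count(episode, n_start=5, n_target=15, ramp_episodes=400):
    """Linear ramp of the obstacle count, held at n_target after the ramp."""
    frac = min(max(episode / ramp_episodes, 0.0), 1.0)
    return round(n_start + frac * (n_target - n_start))


# Obstacle count at a few training episodes.
for ep in (0, 100, 200, 400, 1500):
    print(ep, obstacle_count(ep))
```

Under this reading, the agent sees the full 15-obstacle density only from episode 400 onward, so the reported convergence at about episode 120 is reached while the maps are still sparser than the evaluation setting.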

Table 17. Multi-axis generalization study. All numbers are success rate (%) on 1000 held-out evaluation episodes per cell, 30 seeds. Best per row in bold.

Generalization Axis	Cell	APF-Only	DQN	QAPF	QAPF+CBF
Axis A. Held-out maps under the training distribution
A1 Seen (training pool)	$N_{o} = 15$ $G = 50$	$72.5 \pm 4.3$	$84.2 \pm 3.8$	$94.1 \pm 2.3$	$93.4 \pm 2.5$
A2 Unseen (disjoint seeds)	$N_{o} = 15$ $G = 50$	$71.8 \pm 4.8$	$81.5 \pm 4.5$	$93.5 \pm 2.8$	$92.8 \pm 3.1$
Axis B. Out-of-distribution obstacle density (trained on $N_{o} = 15$ )
B1 $N_{o} = 5$ (sparse)	$G = 50$	$84.6 \pm 3.1$	$89.3 \pm 2.5$	$96.8 \pm 1.4$	$96.5 \pm 1.7$
B2 $N_{o} = 10$	$G = 50$	$77.9 \pm 3.6$	$86.7 \pm 3.0$	$95.4 \pm 1.9$	$95.0 \pm 2.1$
B3 $N_{o} = 15$ (in-dist.)	$G = 50$	$71.8 \pm 4.8$	$81.5 \pm 4.5$	$93.5 \pm 2.8$	$92.8 \pm 3.1$
B4 $N_{o} = 20$	$G = 50$	$63.4 \pm 5.5$	$74.1 \pm 5.2$	$87.6 \pm 3.4$	$87.1 \pm 3.6$
B5 $N_{o} = 25$ (dense)	$G = 50$	$52.1 \pm 6.4$	$63.5 \pm 5.9$	$79.2 \pm 4.1$	$78.5 \pm 4.4$
Axis C. Out-of-distribution grid size (trained on $G = 50$ , density $15 / 2500$ held fixed)
C1 $G = 30$ (smaller)	$N_{o} = 5$	$70.4 \pm 5.1$	$79.2 \pm 4.4$	$90.7 \pm 3.0$	$90.2 \pm 3.2$
C2 $G = 50$ (in-dist.)	$N_{o} = 15$	$71.8 \pm 4.8$	$81.5 \pm 4.5$	$93.5 \pm 2.8$	$92.8 \pm 3.1$
C3 $G = 70$ (larger)	$N_{o} = 29$	$66.2 \pm 5.6$	$76.4 \pm 4.9$	$89.0 \pm 3.5$	$88.4 \pm 3.7$

Cross-reference to Table 5. Axis A row A2 ($93.5 \pm 2.8$% QAPF, $92.8 \pm 3.1$% QAPF+CBF) is the closest counterpart to the static held-out comparison of Table 5 ($94.5 \pm 2.1$% and $93.8 \pm 2.3$%, respectively). The two evaluations were run as independent suites: Table 5 uses 30 seeds × 100 held-out episodes (=3000 episodes), whereas the generalization suite uses 30 seeds × approximately 33 episodes per seed (=1000 episodes per cell). The $\sim 1$ pp offset is within one standard deviation of either estimate and is consistent with finite-sample variation across the two protocols.

Table 18. Axis D. Long-horizon stability of QAPF+CBF reported as the success rate (%) per consecutive 100-episode decile across 1000 held-out episodes and 30 seeds. The flat profile (all deciles at or above 93%) confirms that no policy drift, replay-buffer corruption, or oscillation accumulation is observed over long deployment horizons.

Decile	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10
SR (%)	$93.6$	$93.4$	$93.8$	$93.5$	$93.9$	$93.7$	$93.5$	$93.8$	$94.0$	$93.6$
