Improved D3QN Intelligent Vehicle Path Planning Guided by the Dynamic Window Approach

Na, Jiahui; Wang, Wensheng

doi:10.3390/a19070528


by Jiahui Na and Wensheng Wang *

School of Mechanical and Electrical Engineering, Beijing Information Science and Technology University, Beijing 100192, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(7), 528; https://doi.org/10.3390/a19070528

Submission received: 30 April 2026 / Revised: 19 June 2026 / Accepted: 21 June 2026 / Published: 30 June 2026

(This article belongs to the Special Issue Algorithms for Smart Cities (3rd Edition))


Abstract

To address the prevalent issues of slow convergence, low exploration efficiency, and large value estimation bias in traditional Deep Q-Networks for intelligent vehicle path planning, this paper proposes an improved Dueling Double Deep Q-Network (D3QN) path-planning method guided by the Dynamic Window Approach (DWA) heuristic. The Dueling Double DQN architecture decouples state value and action advantage representations, while the dual estimator of Double DQN mitigates Q-value overestimation. A Prioritized Experience Replay (PER) mechanism samples transitions non-uniformly based on Temporal Difference error with importance sampling correction, improving the reuse of critical samples and training stability. DWA evaluation criteria are transformed into dense heuristic reward signals, enabling the agent to receive continuous multi-dimensional guidance during exploration without executing online trajectory optimization. The environment augments the sparse navigation objective with a Chebyshev goal-progress term motivated by potential-based reward shaping theory together with auxiliary DWA-style channels. The policy-invariance property of potential-based shaping is referenced only for the goal term added to the sparse task reward rather than for the full composite training return. A continuous Ackermann steering kinematic model with a pure-pursuit path-tracking controller is adopted for deployment to ensure executable trajectories under non-holonomic constraints. The proposed method (DWA-D3QN) is systematically evaluated against sparse-reward D3QN, PBRS-guided D3QN, DQN, DDQN, Dueling DQN, APF-DQN, PPO, SAC, TD3, A*, and classical DWA in a grid map environment with static and dynamic obstacles. Results are reported with statistical significance over multiple random seeds. 
Under complex difficulty, DWA-D3QN achieves a success rate of 94.1 ± 3.4% with a collision rate of 5.9 ± 3.4% over 15 seeds, representing improvements of 64.1 and 8.4 percentage points over the sparse-reward and PBRS-guided D3QN baselines, respectively. Ablation experiments reveal the differentiated contributions of clearance, heading, and velocity shaping terms: clearance awareness provides the strongest single contribution, heading alignment reinforces directional guidance, and velocity regularization refines trajectory quality under the joint constraints of the former two. The full composite reward achieves the lowest variance among all evaluated DRL methods, confirming enhanced training stability. Comparisons with PPO, SAC, and TD3 confirm the statistically significant advantages of the proposed framework (PPO: $p = 0.0010$, SAC: $p = 0.0007$, TD3: $p = 0.0024$). ROS/Gazebo validation with an Ackermann-steered vehicle achieves a success rate of 96.0% with a collision rate of 4.0% over 50 trials, further confirming the applicability of the learned policy in continuous-state environments with realistic vehicle kinematics.

Keywords:

deep reinforcement learning; path planning; D3QN; intelligent vehicle; dynamic window approach

1. Introduction

Autonomous navigation is essential for intelligent vehicles in industrial manufacturing, unmanned delivery, and medical rescue. As a core component of autonomous driving, path planning directly determines traffic efficiency and operational safety in complex, unknown environments. Modern applications increasingly require operation in unstructured, dynamic settings where obstacle distributions exhibit strong randomness. Generating optimal collision-free trajectories rapidly and smoothly in spaces with multiple dynamic obstacles remains a central challenge in both academia and industry. Deep reinforcement learning (DRL), with its capacity for high-dimensional perception and end-to-end decision making, has emerged as a leading approach for autonomous navigation [1].

DRL enables agents to learn nonlinear mappings from state space to action space through continuous environment interaction within a Markov Decision Process (MDP), demonstrating clear advantages for autonomous vehicles, unmanned aerial vehicles, and robotic platforms [2]. Unlike traditional approaches, DRL does not require high-precision environment maps and can address decision making in high-dimensional state spaces. Deep Q-Network (DQN), as a foundational DRL algorithm, has shown considerable success in such tasks.

Despite its success, DQN suffers from several well-known limitations. A primary concern is the systematic overestimation of Q-values introduced by the max operator in Bellman updates. Ni et al. [3] proposed an improved A-DDQN algorithm incorporating Double DQN to decouple action selection and value evaluation combined with artificial potential field (APF) heuristics for step-wise rewards and Prioritized Experience Replay (PER) for sample efficiency. While effective, the reward design still relies on APF-based guidance with limited density and directionality. Huang et al. [4] developed V-D D3QN, which employs dual Dueling DQN networks with alternating updates to suppress overestimation. The patent by Beijing University of Technology [5] combines D3QN with an intrinsic curiosity module to address sparse rewards. However, these methods still predominantly rely on sparse or simple distance-based rewards, lacking continuous multi-dimensional guidance.

Hybrid architectures integrating DRL with classical planning have attracted considerable attention, aiming to combine the adaptability of learning methods with the reliability of deterministic approaches [6]. Venu and Gurusamy [7] provide a comprehensive review showing that DRL excels in generalization while classical algorithms retain advantages in global optimality. Zhang et al. [8] proposed a three-layer architecture combining A*, DQN, and DWA, where DWA operates as a low-level online executor. Their results demonstrate that the hybrid approach outperforms standalone DWA or DQN. However, this method uses a basic DQN, which still suffers from Q-value overestimation, and it positions DWA solely as an executor without exploiting its evaluation criteria for reward shaping.

To address these deficiencies, this paper proposes an improved D3QN path-planning algorithm that internalizes the Dynamic Window Approach (DWA) evaluation criteria as dense reward signals. These signals guide training without requiring online DWA execution at deployment, which avoids the per-cycle computational overhead of velocity-space sampling. The Dueling Double DQN architecture decouples state value and action advantage while separating action selection from value evaluation, suppressing Q-value overestimation. A PER mechanism adjusts sampling weights based on TD error, improving the reuse of high-value transitions. The heading, obstacle distance, and velocity evaluation criteria of DWA are reconstructed as dense reward signals, providing continuous heuristic guidance that mitigates initial exploration blindness. Goal progress is encoded with a Chebyshev ($L^{\infty}$) potential consistent with the simulator’s termination predicate ($d_{\infty} \leq 1$). The policy-invariance property of potential-based shaping applies only to the goal term added to the sparse task reward; DWA-inspired heading, clearance, velocity proxies, repulsion penalties, and regularization terms are dense heuristics beyond that scope. This fusion retains the convergence advantages of D3QN while enhancing early-stage exploration efficiency. Experimental results demonstrate that the proposed DWA-D3QN algorithm achieves competitive convergence speed, robustness, and planning quality in complex dynamic environments.

1.1. Research Gap

Gap 1: DWA as online executor versus offline signal. Existing DRL-DWA frameworks employ DWA as a low-level online trajectory optimizer while the DRL module provides high-level waypoints [8]. This design incurs continuous velocity-space sampling at every control cycle, and the DWA evaluation criteria—heading alignment, obstacle clearance, and velocity efficiency—are discarded after producing a single velocity command. In contrast, the proposed method repurposes the DWA criteria as dense reward-shaping signals that guide training without requiring online DWA execution at deployment, thereby avoiding the per-cycle computational overhead of velocity-space sampling.

Gap 2: Q-value overestimation in DRL-DWA hybrids. Prior hybrid approaches typically employ vanilla DQN as the learning backbone [8], which suffers from the systematic overestimation bias introduced by the max operator in Bellman updates. While the Dueling Double DQN (D3QN) architecture has been independently validated for addressing this limitation, its combination with DWA-informed reward shaping for path planning has received limited attention. This paper investigates such an integration, examining whether D3QN’s debiasing properties can complement DWA-based dense rewards to improve learning stability.

Gap 3: Lack of theoretical grounding for heuristic reward shaping. Many DRL-based navigation works introduce heuristic reward terms—goal attraction, obstacle repulsion, and motion penalties—without clarifying their relationship to potential-based reward shaping (PBRS) theory [9]. PBRS provides a formal condition under which a shaping reward preserves the optimal policy. Without explicitly distinguishing which reward components satisfy PBRS constraints, the theoretical status of different reward terms remains ambiguous. This paper categorizes the goal-progress term within the PBRS framework while acknowledging that DWA-inspired clearance, heading, and velocity terms fall outside PBRS guarantees and function as empirical heuristics.

1.2. Scientific Contribution

To address the above research gaps, this paper makes the following contributions:

(1): DWA-to-reward transformation. We reconstruct the three core DWA evaluation criteria—heading alignment, obstacle clearance, and velocity efficiency—as dense, multi-dimensional reward signals embedded into the DRL training loop. Our method internalizes DWA priors offline, requiring no velocity search at deployment and reducing inference cost while retaining kinematic guidance.
(2): Integration of D3QN with DWA-informed shaping. We adopt the Dueling Double DQN architecture as the backbone, decoupling state value and action advantage while separating action selection from value evaluation. This paper investigates an integration of D3QN with DWA-derived reward shaping, examining whether this combination can jointly address Q-value overestimation, sample inefficiency, and exploration blindness.
(3): PBRS goal term with explicit scope demarcation. We derive a Chebyshev-distance-based potential function and prove the associated goal-progress term satisfies the canonical PBRS condition. We further demarcate that policy invariance applies exclusively to $(R_{task}, F_{g}^{Ng})$, while DWA-inspired terms are classified as dense heuristics $H$ beyond PBRS scope, resolving a prevalent ambiguity in the navigation literature.

2. Materials and Methods

2.1. MDP Formulation: Observation, Action, and Dynamics

The grid navigation task is modeled as a finite-horizon MDP $(S, A, P, R)$ on a grid map with side length $N = 20$. At each step, the agent receives a 15-dimensional observation vector $s_{t} = [s_{0}, \dots, s_{14}]^{\top} \in \mathbb{R}^{15}$ constructed as specified in Table 1. The action space is discrete with $|A| = 9$ (Table 2). Environment transitions follow the grid dynamics in Table 3.

The 8 LiDAR rays are cast in directions $(0, -1), (0, +1), (-1, 0), (+1, 0), (-1, -1), (+1, -1), (-1, +1), (+1, +1)$. Each ray extends from $(x_{t}, y_{t})$ until it exits the map boundary or encounters an occupied cell (static or dynamic obstacle). $d_{k}$ denotes the Euclidean distance to the terminating cell. Dynamic obstacles are not explicitly encoded as separate state dimensions; they are perceived only through the LiDAR rays.

Supplementary Environment Parameters. The 8 LiDAR rays (indices $s_{7}$ to $s_{14}$ in Table 1) extend from the agent position until they encounter an occupied cell or exit the grid boundary, with no fixed maximum range limit; each ray is terminated at the map edge if no obstacle is detected along that direction. The reported value is $\min(d_{k} / (N \sqrt{2}), 1)$, where $d_{k}$ is the Euclidean distance to the terminating cell. Dynamic obstacles move at a fixed speed of 1 grid cell per environment step along predefined linear trajectories, reciprocating between start and end points. Each dynamic obstacle completes one round trip every $2 \times L_{path}$ steps, where $L_{path}$ is the Manhattan distance between its start and end positions. Static obstacles occupy single grid cells and remain fixed throughout the episode. The agent’s discrete actions (Table 2) are executed instantaneously without velocity or acceleration constraints during training; continuous kinematic constraints are applied only at deployment (Section 2.4).
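The LiDAR channels described above can be sketched as follows. This is a minimal illustration rather than the environment's actual code: the function name `lidar_observation`, the `occupied` set of cells, and the choice to treat the first out-of-map cell as the terminating cell at the boundary are all assumptions made for the example.

```python
import math

# Ray directions in the order listed in Section 2.1.
DIRECTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0), (-1, -1), (1, -1), (-1, 1), (1, 1)]

def lidar_observation(x, y, occupied, n=20):
    """Return 8 normalized ray readings min(d_k / (n * sqrt(2)), 1).

    `occupied` is a set of (x, y) cells holding static or dynamic obstacles.
    Assumption: a ray that leaves the map terminates at the first cell
    outside the grid.
    """
    readings = []
    for dx, dy in DIRECTIONS:
        cx, cy = x, y
        while True:
            cx, cy = cx + dx, cy + dy
            outside = not (0 <= cx < n and 0 <= cy < n)
            if outside or (cx, cy) in occupied:
                break
        d_k = math.hypot(cx - x, cy - y)  # Euclidean distance to the terminating cell
        readings.append(min(d_k / (n * math.sqrt(2)), 1.0))
    return readings

# Hypothetical scene: agent at (5, 5), obstacles at (5, 2) and (8, 8).
obs = lidar_observation(5, 5, occupied={(5, 2), (8, 8)})
```

Dividing by $N \sqrt{2}$, the grid diagonal, keeps every reading in $(0, 1]$ regardless of ray direction.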

2.2. DQN Algorithm

Reinforcement learning enables an agent to learn an optimal policy that maximizes the long-term cumulative return through interaction with an unknown environment. Within the MDP framework, the agent executes an action based on the current state; the environment returns a successor state and an immediate reward. The agent refines its policy using this feedback, achieving adaptive optimization for complex dynamic scenarios.

Q-Learning, a classic off-policy, model-free algorithm, derives the optimal policy through iteratively updated action–value functions. The update follows the Bellman equation, combining the immediate reward with the maximum discounted expected return of the next state:

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_{t}, a_{t}) \right] \tag{1}$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r_{t+1}$ is the reward received after executing $a_{t}$ and transitioning from $s_{t}$ to $s_{t+1}$. An $\varepsilon$-greedy strategy balances exploration and exploitation: a random action is selected with probability $\varepsilon$, and the greedy action is selected with probability $1 - \varepsilon$:

$$\pi(a \mid s) = \begin{cases} \varepsilon / |A| + 1 - \varepsilon, & a = a^{*} = \arg\max_{a' \in A} Q(s, a') \\ \varepsilon / |A|, & \text{otherwise} \end{cases} \tag{2}$$
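Equations (1) and (2) can be illustrated with a small tabular sketch. The dictionary-based Q-table, the function names, and the value $\alpha = 0.1$ are assumptions chosen for the example; they are not taken from the paper's implementation.

```python
import random

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply Equation (1) in place to a tabular Q stored as {state: [values]}."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of Equation (2)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # uniform exploration mass
    probs[greedy] += 1.0 - epsilon     # extra mass on the greedy action
    return probs

def select_action(q_values, epsilon, rng=random):
    """Sample an action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
updated = q_learning_update(q, "s0", 0, r=1.0, s_next="s1")
probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
```

Note that the greedy action also receives its share $\varepsilon / |A|$ of the exploration mass, which is why its total probability is $\varepsilon / |A| + 1 - \varepsilon$ rather than $1 - \varepsilon$.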

For tasks with high-dimensional state spaces, tabular Q-value recording suffers from the curse of dimensionality. The Deep Q-Network (DQN) proposed by Mnih et al. [10] combined deep learning with reinforcement learning, using a deep neural network as a nonlinear function approximator. DQN optimizes network weights by minimizing the mean squared error loss:

$$L(\theta) = \mathbb{E}\left[ \left( r(s, a, s') + \gamma \max_{a' \in A} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] \tag{3}$$

where $\theta^{-}$ denotes the parameters of the target network, which are periodically copied from the main network to stabilize training. The converged optimal Q-function is

$$Q^{*}(s, a; \theta) = \mathbb{E}_{s'}\left[ r(s, a, s') + \gamma \max_{a' \in A} Q^{*}(s', a') \mid s, a \right] \tag{4}$$

The DQN training framework is illustrated in Figure 1.

2.3. Improved DWA-D3QN Algorithm

To address Q-value overestimation, slow convergence, unsmooth trajectories, and low exploration efficiency in basic DQN, this paper proposes DWA-D3QN, a Dueling Double Deep Q-Network guided by DWA heuristics. The algorithm adopts D3QN as the decision-making backbone, enhances sample efficiency through PER, and transforms DWA kinematic evaluation criteria into heuristic reward shaping signals, integrating DRL decision making with DWA local motion optimization.

2.3.1. Overall Algorithm Flow

DWA-D3QN follows the closed-loop logic of “environment interaction—experience storage—sampling training—network update—convergence output”, integrating three core modules: D3QN decision making, PER sample optimization, and DWA reward shaping.

The robot observes the current state $s_{t}$, encompassing pose, velocity, obstacle distance, and target bearing. The state is fed into the D3QN dueling network, which extracts features through a shared layer and computes Q-values via dual-branch aggregation. An action $a_{t}$ is selected using $\varepsilon$-greedy.

Upon execution, the robot receives the next state $s_{t+1}$ and a composite reward $R_{total}$ integrating the base reward, a Chebyshev goal-progress channel, DWA heuristic signals, per-step costs, and auxiliary terms (Section 2.3.5). The transition $(s_{t}, a_{t}, R_{total}, s_{t+1})$ is stored in the PER buffer.

When the buffer reaches the batch threshold, high-priority samples are drawn. TD errors and importance sampling weights are computed to obtain the weighted loss $L_{PER}$, and the online network $\theta$ is updated via backpropagation. The target network $\theta^{-}$ receives a soft update after a fixed interval.

The process repeats until convergence: the average return stabilizes, the loss converges, and the robot consistently achieves collision-free navigation. The optimal policy is then output.

The overall workflow of the DWA-D3QN algorithm is shown in Figure 2.

2.3.2. Dueling Double DQN Base Network Structure

D3QN integrates Double DQN and Dueling DQN with PER, achieving improvements in convergence, stability, and sample utilization [11]. Double DQN [12] decouples action selection from value evaluation: the main network selects the optimal action, while the target network evaluates its Q-value in parallel, mitigating positive bias accumulation. The target value is

$$y_{t} = r_{t} + \gamma \, Q\left( s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta^{-} \right) \tag{5}$$

Dueling DQN [13] splits the feature stream into a state value branch $V$ and an action advantage branch $A$ with mean-centering correction:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \tag{6}$$
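A minimal sketch of Equations (5) and (6) on plain Python lists may help make the two mechanisms concrete. The function names are hypothetical, and the `done` branch (returning the bare reward at terminal states) is a common convention assumed here, not something Equation (5) states.

```python
def dueling_q(value, advantages):
    """Equation (6): Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Equation (5): the online network selects, the target network evaluates."""
    if done:  # assumed terminal-state convention
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]

q_vals = dueling_q(2.0, [1.0, 0.0, -1.0])
y = double_dqn_target(1.0, q_online_next=[0.2, 0.9, 0.1], q_target_next=[0.5, 0.3, 0.8])
```

In this toy case the online network prefers action 1, so the target uses the target network's estimate 0.3 for that action rather than the target network's own maximum of 0.8; this is the decoupling that limits overestimation.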

2.3.3. Network Architecture

The policy network uses a Dueling architecture. A shared feature extractor (two fully connected layers of 128 units with LayerNorm and ReLU) processes the 15-dimensional state input. The extracted features branch into a value stream (128 units, ReLU, output dimension 1) and an advantage stream (128 units, ReLU, output dimension 9). The policy network contains 53,386 parameters; an identical target network brings the total to 106,772.
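The stated parameter count can be checked arithmetically. The sketch below assumes that every linear layer has a bias, that each of the two shared layers is followed by a LayerNorm with learnable scale and shift, and that each stream consists of one 128-unit hidden layer followed by a linear output; under these assumptions the total matches the figure reported above.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layernorm_params(n):
    """Learnable scale and shift of a LayerNorm over n features."""
    return 2 * n

shared = (linear_params(15, 128) + layernorm_params(128)
          + linear_params(128, 128) + layernorm_params(128))
value_stream = linear_params(128, 128) + linear_params(128, 1)
advantage_stream = linear_params(128, 128) + linear_params(128, 9)

policy_total = shared + value_stream + advantage_stream
both_networks = 2 * policy_total  # policy network plus identical target network
```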

The training workflow is shown in Figure 3. Both networks are initialized with identical weights. The agent explores using $\varepsilon$-greedy. Once the PER buffer reaches the batch threshold, prioritized sampling with importance correction drives gradient updates on the main network, after which sample priorities are refreshed. The target network is periodically synchronized via soft update.

2.3.4. Prioritized Experience Replay (PER) Mechanism

Standard DQN samples transitions uniformly, ignoring differences in learning value. PER [14] prioritizes samples by the TD-error magnitude, giving higher sampling frequency to high-value transitions [15]. The priority metric is

$$p_{t} = |\delta_{t}| + \varepsilon \tag{7}$$

where $\delta_{t}$ is the TD error and $\varepsilon$ ensures non-zero probability. The sampling probability is

$$P(i) = \frac{p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}} \tag{8}$$

where $\alpha$ controls prioritization intensity. Importance sampling weights correct the resulting distribution bias:

$$w_{i} = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta} \tag{9}$$

where $N$ is the buffer capacity and $\beta$ anneals from $\beta_{start}$ to 1. The weighted loss is

$$L_{PER} = \frac{1}{B} \sum_{i=1}^{B} w_{i} \cdot \left( y_{i} - Q(s_{i}, a_{i}; \theta) \right)^{2} \tag{10}$$

The workflow of the PER mechanism is illustrated in Figure 4.
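Equations (7) to (10) can be sketched on plain lists as follows. The function names and the constant `eps = 1e-6` are assumptions; $N$ is taken here as the number of stored transitions, which equals the capacity once the buffer is full. Dividing the weights by their maximum is a common stabilization that Equation (9) does not include, so it is offered only as an optional flag.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Equations (7) and (8): priorities |delta| + eps, raised to alpha, normalized."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, beta=0.5, normalize=False):
    """Equation (9), with N taken as the number of stored transitions."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    if normalize:  # optional max-normalization, not part of Equation (9)
        w_max = max(weights)
        weights = [w / w_max for w in weights]
    return weights

def per_loss(weights, targets, q_values):
    """Equation (10): importance-weighted mean squared TD error over a batch."""
    batch = len(weights)
    return sum(w * (y - q) ** 2 for w, y, q in zip(weights, targets, q_values)) / batch

probs = per_probabilities([2.0, 1.0, 0.0])
weights = importance_weights(probs)
loss = per_loss([1.0, 1.0], targets=[1.0, 2.0], q_values=[0.0, 0.0])
```

Transitions that are sampled more often than uniform receive weights below 1, which offsets their over-representation in the gradient.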

2.3.5. DWA Dynamic Window Kinematic Model and Reward Shaping

The Dynamic Window Approach (DWA) [16] searches for feasible velocity commands within the robot’s velocity space and evaluates them using a weighted combination of heading, safety distance, and velocity. Zheng et al. [17] enhanced DWA with hierarchical safety zones and fuzzy logic for real-time posture and speed adjustment.

The velocity dynamic window is

$$V_{d} = \{ (v, \omega) \mid v \in [v_{min}, v_{max}], \ \omega \in [\omega_{min}, \omega_{max}] \} \tag{11}$$

where $v$ and $\omega$ denote linear and angular velocity. DWA scores each candidate pair using three evaluation functions:

Heading evaluation measures alignment with the target direction:

$$\text{heading}(v, \omega) = 180^{\circ} - | \theta_{target} - \theta_{predicted} | \tag{12}$$

Obstacle clearance evaluation assesses the minimum distance to obstacles along the predicted trajectory:

$$\text{dist}(v, \omega) = \min_{o \in O} \text{distance}(\text{trajectory}(v, \omega), o) \tag{13}$$

Velocity efficiency evaluation incentivizes higher speeds:

$$\text{velocity}(v, \omega) = | v | \tag{14}$$

The comprehensive evaluation is a normalized weighted sum:

$$G(v, \omega) = \alpha \cdot \text{heading}(v, \omega) + \beta \cdot \text{dist}(v, \omega) + \eta \cdot \text{velocity}(v, \omega) \tag{15}$$

Here, $\eta$ is the DWA velocity weight, which is distinct from the RL discount factor $\gamma$.
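To make Equations (12) to (15) concrete, the sketch below scores a handful of candidate velocity pairs. It is an illustration under stated assumptions, not the authors' implementation: the unicycle forward model in `rollout`, the time step and horizon, the sum-based normalization of each criterion across candidates, and the equal weights are all choices made for the example.

```python
import math

def rollout(x, y, theta, v, omega, dt=0.1, steps=10):
    """Forward-simulate an assumed unicycle model and return the visited poses."""
    poses = []
    for _ in range(steps):
        theta += omega * dt
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        poses.append((x, y, theta))
    return poses

def dwa_scores(state, goal, obstacles, candidates, weights=(1.0, 1.0, 1.0)):
    """Return G(v, omega) of Equation (15) for each candidate (v, omega)."""
    raw = []
    for v, omega in candidates:
        poses = rollout(*state, v, omega)
        xe, ye, te = poses[-1]
        bearing = math.atan2(goal[1] - ye, goal[0] - xe)
        # Wrap the heading error into [0, pi] before converting to degrees.
        diff = abs(math.atan2(math.sin(bearing - te), math.cos(bearing - te)))
        heading = 180.0 - math.degrees(diff)                         # Equation (12)
        dist = min(math.hypot(px - ox, py - oy)
                   for px, py, _ in poses for ox, oy in obstacles)   # Equation (13)
        raw.append((heading, dist, abs(v)))                          # Equation (14)
    # Normalize each criterion by its sum over candidates (assumed scheme).
    totals = [sum(column) or 1.0 for column in zip(*raw)]
    a, b, e = weights
    return [a * h / totals[0] + b * d / totals[1] + e * s / totals[2]
            for h, d, s in raw]

candidates = [(1.0, 0.0), (1.0, 1.0), (0.5, -1.0)]
scores = dwa_scores((0.0, 0.0, 0.0), goal=(5.0, 0.0),
                    obstacles=[(2.0, 2.0)], candidates=candidates)
best = max(range(len(scores)), key=lambda i: scores[i])
```

In the proposed method these criteria are not searched online at deployment; the same three quantities are instead folded into the training reward.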

Potential-Based Reward Shaping (PBRS) and its Application Scope. Ng et al. [9] proved that a shaping reward $F(s, s') = \gamma \Phi(s') - \Phi(s)$ added to a fixed sparse task reward preserves the optimal policy. Grześ [18] and Lidayan et al. [19] have extended these theoretical frameworks. In this paper, DWA-inspired heading, clearance, velocity proxies, repulsion, turn-back, and per-step penalties are dense, action-dependent, or schedule-dependent signals. Therefore, the policy-invariance guarantee is strictly invoked only for the canonical PBRS goal-progress term, and it is not asserted for the full composite return $R'$.

Chebyshev Goal Progress and its PBRS Justification. The simulator employs Chebyshev distance $d_{\infty}$ for termination ($d_{\infty} \leq 1$). The goal potential function is

$$\Phi_{g}(s) = -\lambda \cdot d_{\infty}(s, s_{goal}), \tag{16}$$

where $\lambda = \texttt{pbrs\_goal\_weight} > 0$. The canonical PBRS goal-progress term is

$$F_{g}^{Ng}(s, s') = \gamma \Phi_{g}(s') - \Phi_{g}(s) = \lambda \left[ d_{\infty}(s, s_{goal}) - \gamma \, d_{\infty}(s', s_{goal}) \right]. \tag{17}$$

In practice, we compute the direct distance difference:

$$F_{g}^{impl}(s, s') = \lambda \left[ d_{\infty}(s, s_{goal}) - d_{\infty}(s', s_{goal}) \right]. \tag{18}$$

With $\gamma = 0.99$, the approximation error $| F_{g}^{Ng} - F_{g}^{impl} | = \lambda (1 - \gamma) \, d_{\infty}(s', s_{goal}) < 0.2$ for all practical distances. $F_{g}^{impl}$ coincides with the canonical term for $\gamma = 1$ and otherwise deviates from it only by this bounded amount, so it is treated as a close approximation of the PBRS-compliant term.
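The relationship between Equations (17) and (18) can be verified numerically. The sketch below assumes $\lambda = 1$ purely for illustration (the text only requires $\lambda > 0$) and represents states as integer grid coordinates; the function names are hypothetical.

```python
def chebyshev(p, q):
    """Chebyshev (L-infinity) distance between two grid cells."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def goal_shaping(s, s_next, goal, lam=1.0, gamma=0.99):
    """Return (F_g^Ng, F_g^impl) from Equations (17) and (18)."""
    d, d_next = chebyshev(s, goal), chebyshev(s_next, goal)
    f_ng = lam * (d - gamma * d_next)   # canonical PBRS term
    f_impl = lam * (d - d_next)         # implemented distance difference
    return f_ng, f_impl

# One diagonal step from the lower-left corner towards a goal at the upper-right corner.
f_ng, f_impl = goal_shaping((0, 0), (1, 1), goal=(19, 19))
gap = f_ng - f_impl                     # equals lam * (1 - gamma) * d_inf(s', goal)
worst_case = 1.0 * (1 - 0.99) * 19      # largest possible d_inf on a 20 x 20 grid
```

On a $20 \times 20$ grid the largest Chebyshev distance is 19, so with $\lambda = 1$ the gap stays below 0.2; a larger $\lambda$ would scale the bound proportionally.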

Complete Reward Decomposition. The total reward $R'(s, a, s')$ fed to the learner is

$$R'(s, a, s') = R_{task}(s, s') + F_{g}^{impl}(s, s') + H(s, a, s') \tag{19}$$

where

$$\begin{aligned} R_{task}(s, s') &= R_{step} + R_{base}, \\ H(s, a, s') &= \lambda_{1} R_{heading} + \lambda_{2} R_{obs} + \lambda_{3} R_{smooth} + R_{rep} + R_{tb} + R_{dwa}. \end{aligned} \tag{20}$$

$R_{step}$ is a small per-step penalty for efficiency; $R_{base}$ awards a large positive reward at goal arrival and a large negative penalty upon collision. $R_{heading}$, $R_{obs}$, and $R_{smooth}$ are the DWA-inspired dense channels. $R_{rep}$ provides obstacle repulsion, $R_{tb}$ penalizes backtracking, and $R_{dwa}$ is a scheduled bonus when the agent’s action aligns with the DWA-optimal command.

Reward Coefficient Specification. Table 4 lists all reward coefficients, DWA schedule weights, and post-processing parameters as implemented in the environment. The total raw reward is

$$R_{raw} = R_{step} + F_{g} + R_{dir} + R_{rep} + R_{back} + R_{turn} + R_{event} + R_{dwa} \tag{21}$$

with the final reward scaled and clipped: $R' = \text{clip}(R_{raw} / 10, [-10, 10])$.
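The post-processing step applied to Equation (21) is simple enough to state in a few lines. The function name and the dictionary of component values are hypothetical; the actual coefficients are those of Table 4 and are not reproduced here.

```python
def postprocess_reward(components, scale=10.0, low=-10.0, high=10.0):
    """Sum the raw reward terms, divide by `scale`, then clip to [low, high]."""
    r_raw = sum(components.values())
    return max(low, min(high, r_raw / scale))

# Placeholder magnitudes for illustration only; see Table 4 for the real coefficients.
example_terms = {"R_step": -0.1, "F_g": 1.0, "R_turn": -0.2, "R_event": 50.0}
r_final = postprocess_reward(example_terms)
```

Because the raw sum is divided by 10 before clipping to $[-10, 10]$, the clip only takes effect when $|R_{raw}|$ exceeds 100.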

Unified Reward Summary. Table 4 consolidates the complete reward specification. All methods share the base terms ($R_{step}$, $F_{g}$, $R_{dir}$, $R_{rep}$, $R_{back}$, $R_{turn}$); DWA-inspired shaping terms are active only for DWA-D3QN. The total is scaled and clipped: $R' = \text{clip}(R_{raw} / 10, [-10, 10])$.

Remark (Scope of Policy Invariance). Only a shaping function of the form $F(s, s') = \gamma \Phi(s') - \Phi(s)$ added to a fixed sparse task reward preserves policy invariance [9]. This guarantee applies exclusively to $(R_{task}, F_{g}^{Ng})$. The composite return $R'$ includes $H$, which contains action-, history-, and schedule-dependent terms that fall outside the PBRS framework. All reported results optimize the scalarized, clipped composite return $R'$.

The transformation from DWA evaluation criteria to dense reward functions is illustrated in Figure 5.

2.4. Kinematic Model for Deployment

Deployment uses an Ackermann steering bicycle model. Let the vehicle pose be $q = [x, y, \theta]^{\top}$ (rear-axle center) with control input $u = [v, \varphi]^{\top}$ (longitudinal velocity and front steering angle). Under the non-holonomic constraint $\dot{x} \sin\theta - \dot{y} \cos\theta = 0$, the kinematics are

$$\dot{x} = v \cos\theta, \quad \dot{y} = v \sin\theta, \quad \dot{\theta} = \frac{v}{L} \tan\varphi, \tag{22}$$

where $L$ is the wheelbase and $\rho$ is the turning radius, so that $\tan\varphi = L / \rho$ (Figure 6).

In the discrete-time implementation, the equations are integrated via Euler discretization with time step $\Delta t$, and physical limits are imposed: $0 \leq v \leq v_{max}$, $|\varphi| \leq \varphi_{max}$, and $|\dot{\varphi}| \leq \dot{\varphi}_{max}$, ensuring that control commands remain within the vehicle’s steering geometry and actuator saturation limits. The grid-trained DRL policy outputs discrete action indices, which are converted to global waypoints and then tracked by a path-following controller that generates continuous $(v, \varphi)$ commands. In the ROS/Gazebo simulation, tire–road friction further introduces trajectory deviations due to tire slip, providing a test of consistency between the kinematic model and dynamic simulation.
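One Euler step of Equation (22) with the stated saturation limits might look like the sketch below. All numeric defaults (time step, wheelbase, speed, steering, and steering-rate limits) are placeholder values assumed for the example, and the function names are hypothetical; the paper's deployment parameters are not given in this section.

```python
import math

def clamp(value, low, high):
    return max(low, min(high, value))

def ackermann_step(pose, v_cmd, phi_cmd, phi_prev, dt=0.05, wheelbase=0.3,
                   v_max=1.0, phi_max=0.5, phi_rate_max=1.0):
    """Advance the rear-axle pose by one Euler step of the bicycle model.

    Returns the new pose and the steering angle actually applied.
    """
    x, y, theta = pose
    v = clamp(v_cmd, 0.0, v_max)                       # 0 <= v <= v_max
    phi = clamp(phi_cmd, -phi_max, phi_max)            # |phi| <= phi_max
    max_delta = phi_rate_max * dt                      # |phi_dot| <= phi_rate_max
    phi = clamp(phi, phi_prev - max_delta, phi_prev + max_delta)
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v / wheelbase * math.tan(phi) * dt
    return (x, y, theta), phi

straight_pose, _ = ackermann_step((0.0, 0.0, 0.0), v_cmd=1.0, phi_cmd=0.0, phi_prev=0.0)
turn_pose, applied_phi = ackermann_step((0.0, 0.0, 0.0), v_cmd=5.0, phi_cmd=0.5, phi_prev=0.0)
```

In the second call the speed command is saturated at `v_max` and the steering change is limited to `phi_rate_max * dt`, so the vehicle cannot jump directly to the requested steering angle.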

It should be noted that the training phase still uses the discrete grid action space described in Section 2.1. The kinematic model in this section is used exclusively for deployment, executability analysis, and simulation validation, thereby addressing the requirements of Ackermann steering dynamics, non-holonomic constraints, and continuous steering control.

3. Results

3.1. Algorithm Parameter Settings

Experiments were conducted using Python 3.11 (Python Software Foundation, Wilmington, DE, USA), PyTorch 2.6.1 (Meta Platforms, Inc., Menlo Park, CA, USA) with CUDA 12.8 (NVIDIA Corporation, Santa Clara, CA, USA), and Ubuntu 24.04 LTS (Canonical Ltd., London, UK). Hardware includes 64 GB memory (Kingston Technology, Fountain Valley, CA, USA), an Intel Core i7-14700K CPU (Intel Corporation, Santa Clara, CA, USA), and an NVIDIA RTX 5070 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). Hyperparameter settings for DWA-D3QN are detailed in Table 5.

3.2. Hyperparameter Tuning and Convergence

Hyperparameters were selected via two-stage tuning. First, learning rate and batch size were coarsely searched over $[1 \times 10^{-4}, 1 \times 10^{-3}]$ and $\{64, 128, 256, 512\}$ using the D3QN baseline under complex difficulty; learning rate $5.0 \times 10^{-4}$ and batch size 256 yielded the most stable convergence. Second, PER parameters ($\alpha$, $\beta_{start}$) were tuned on DWA-D3QN via grid search over $\alpha \in \{0.4, 0.5, 0.6, 0.7\}$ and $\beta_{start} \in \{0.4, 0.5, 0.6\}$, with $\alpha = 0.6$ and $\beta_{start} = 0.5$ providing the best efficiency–stability trade-off. Exploration parameters ($\varepsilon_{start} = 1.0$, $\varepsilon_{end} = 0.02$) follow standard DRL practice. Robustness is validated in Section 3.8.

The training loss convergence curve is shown in Figure 7.

The hyperparameter search space and selected values are summarized in Table 6.

3.3. Evaluation Metrics

Unless otherwise noted, all reported performance metrics are computed as the mean ± standard deviation over 15 random seeds for DRL methods and over 120 fixed maps for classical planning methods (A* and DWA). The paired Wilcoxon signed-rank test ($\alpha = 0.05$) is used to assess the statistical significance of performance differences between methods. Bootstrap resampling with 10,000 iterations is used to compute 95% confidence intervals for the success rate and collision rate.
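A bootstrap confidence interval of the kind described above can be sketched as follows. The text does not state which bootstrap variant or resampling unit is used, so this example assumes a percentile bootstrap over per-seed values; the function name, the fixed seed, and the sample data are made up for illustration and are not results from the paper.

```python
import random

def bootstrap_ci(samples, iterations=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(iterations))
    lower = means[int((1.0 - level) / 2.0 * iterations)]
    upper = means[int((1.0 + level) / 2.0 * iterations) - 1]
    return lower, upper

# Made-up per-seed success rates, for illustration only.
per_seed_success = [0.90, 0.96, 0.94, 0.98, 0.92, 0.94, 0.96, 0.90, 0.98, 0.94]
ci_low, ci_high = bootstrap_ci(per_seed_success)
```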

All methods are evaluated using five metrics computed over the last 50 training episodes. The success rate is the fraction of episodes where the agent reaches the goal ($d_{\infty} \leq 1$). The collision rate is the fraction where the agent collides with an obstacle. Mean steps is the average episode length. Min clearance is the average minimum Euclidean distance to any obstacle across all steps.

Trajectory smoothness is defined as the complement of the per-step discrete action-switch rate. A turn event is counted at step $t$ when both the current and previous actions involve movement ($a \neq 0$) and the action index changes:

$$\mathbb{1}_{turn}(t) = \begin{cases} 1, & a_{i,t} \neq 0 \land a_{i,t-1} \neq 0 \land a_{i,t} \neq a_{i,t-1} \\ 0, & \text{otherwise} \end{cases}$$

with the convention $a_{i,0} = 0$. For episode $i$ with $T_{i}$ steps and $N_{i}^{turn}$ total turns,

$$\text{Smoothness}_{i} = \begin{cases} 1 - \frac{N_{i}^{turn}}{T_{i}}, & T_{i} > 0 \\ 1, & T_{i} = 0 \end{cases}$$

The reported smoothness is the average over the last 50 episodes. This metric takes values in $[0, 1]$, with higher values indicating fewer action switches per step. It is not a geometric curvature measure but rather a discrete action-switch rate complement, which is consistent with the environment’s turn penalty $R_{turn} = -0.2$.
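The smoothness definition above translates directly into code. The sketch assumes that an episode is given as a list of integer action indices $a_{i,1}, \dots, a_{i,T_i}$ and that index 0 is the non-moving action, as implied by the condition $a \neq 0$; the function name is hypothetical.

```python
def smoothness(actions):
    """Complement of the per-step action-switch rate for one episode."""
    steps = len(actions)
    if steps == 0:
        return 1.0
    turns = 0
    prev = 0  # convention a_{i,0} = 0
    for a in actions:
        # A turn needs two consecutive moving actions with different indices.
        if a != 0 and prev != 0 and a != prev:
            turns += 1
        prev = a
    return 1.0 - turns / steps

steady = smoothness([1, 1, 1, 1])
zigzag = smoothness([1, 2, 1, 2])
paused = smoothness([1, 0, 2, 2])
```

Alternating between two moving actions counts a turn at every step after the first, while a switch that passes through the non-moving action is not counted as a turn.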

3.4. Map Difficulty Definition and Examples

Two difficulty levels are defined: simple and complex. Both use the same start (lower-left) and goal (upper-right) positions with static and dynamic obstacles generated via random seeds. Figure 8 shows representative maps; black cells denote static obstacles, red segments indicate dynamic obstacle trajectories, and green and yellow markers denote start and goal points.

Complexity is quantified by static obstacle occupancy $\rho_{occ} = N_{static} / N^{2}$ and the number of dynamic obstacles $N_{dyn}$. Simple difficulty uses $\rho_{occ} \approx 0.10$ and $N_{dyn} = 2$, yielding open traversable space with sparse dynamic interactions. Complex difficulty uses $\rho_{occ} \approx 0.14$ and $N_{dyn} = 4$, requiring frequent dynamic obstacle avoidance under tighter passage constraints.

All methods are evaluated on identical map sets generated from fixed random seeds to ensure reproducibility.

3.5. Quantitative Results

Table 7 compares DWA-D3QN with the sparse-reward D3QN baseline, which uses only $R_{task}$ without PBRS or DWA channels, across simple and complex difficulties. All runs use a fixed budget of 200,000 steps. Results are reported over 15 seeds.

DWA-D3QN significantly outperforms the sparse-reward D3QN across all metrics under both difficulties. Under simple difficulty, the success rate improves from 76.5% to 93.7% (a 17.2 pp gain) and the collision rate drops from 21.3% to 6.3%. Under complex difficulty, the success rate increases from 79.7% to 94.1% (a 14.4 pp gain), the collision rate falls from 19.5% to 5.9%, and the mean steps are reduced from 45.14 to 23.93, which is a reduction of 47%. Under complex difficulty, the 95% CI of the DWA-D3QN success rate is [92.3%, 95.8%], and the 95% CI of the collision rate is [4.1%, 7.7%]. All improvements are statistically significant ($p < 0.001$, Wilcoxon signed-rank test). A comprehensive benchmark comparison is provided in Section 3.13.

3.6. Ablation Experiments

Ablation experiments under complex difficulty use a fixed 200,000-step budget, identical D3QN-PER architecture, and shared hyperparameters, varying only the reward composition. Three single-term variants (Heading only, Clearance only, Velocity only) and two composite variants (DWA dense only, APF-Euclidean) are compared against the full DWA-D3QN. Results are reported as mean ± standard deviation over 15 seeds (Table 8).

Among single-term variants, Clearance only achieves the highest success rate (88.5%) and lowest collision rate (11.3%), indicating that obstacle safety distance awareness is the most critical single shaping dimension. Heading only reaches 84.0%, confirming that directional alignment also contributes substantially to task completion. Velocity only achieves 83.3% with the lowest mean steps (25.68), suggesting that velocity incentives promote path efficiency but at the cost of higher collision risk.

The full DWA-D3QN (94.1%) outperforms Heading only (84.0%) by 10.1 pp and Clearance only (88.5%) by 5.6 pp, confirming that the multi-dimensional composite provides benefits beyond any single term. The improvement over Heading only is statistically significant ($p < 0.01$, Wilcoxon test), as is the improvement over Clearance only ($p < 0.05$).

DWA dense only (62.7%) exhibits high variance (±34.7%) with several seeds failing to learn effectively, underscoring the indispensable role of the PBRS Chebyshev goal term. The 31.4 pp gap between DWA dense only and the full model demonstrates that kinematic guidance alone is insufficient without the goal-progress signal. APF-Euclidean (80.3%) underperforms the Chebyshev formulation, validating that the $L^{\infty}$ potential aligned with the grid-world termination predicate provides more effective guidance than a generic Euclidean measure.
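To make the contrast between the two potentials concrete, the sketch below compares the Chebyshev and Euclidean distances on an 8-connected grid and evaluates the standard potential-based shaping term $F = γ Φ(s') - Φ(s)$ of Ng et al. It assumes $Φ(s) = -d(s, g)$ and an illustrative discount factor; the exact scaling of the potential used in the paper may differ, and the positions are hypothetical.

```python
import math

def chebyshev(p, g):
    """L-infinity distance: number of king moves on an obstacle-free 8-connected grid."""
    return max(abs(g[0] - p[0]), abs(g[1] - p[1]))

def euclidean(p, g):
    """L2 distance between two grid positions."""
    return math.hypot(g[0] - p[0], g[1] - p[1])

def shaping(dist, s, s_next, goal, gamma=0.99):
    """Potential-based term F = gamma * Phi(s') - Phi(s) with Phi = -dist (assumed form)."""
    return gamma * (-dist(s_next, goal)) - (-dist(s, goal))

goal = (19, 19)   # hypothetical goal cell
state = (10, 4)   # hypothetical current cell
for name, nxt in [("diagonal", (11, 5)), ("straight", (11, 4))]:
    print(name,
          round(shaping(chebyshev, state, nxt, goal), 3),
          round(shaping(euclidean, state, nxt, goal), 3))
```

On such a grid the Chebyshev distance changes by exactly one unit, or not at all, per move, so the shaping term directly tracks the remaining step count, whereas the Euclidean change depends on the direction of travel.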

The full DWA-D3QN achieves the best results across all metrics: 94.1% success, 5.9% collision, 0.674 smoothness, and 0.981 min clearance, demonstrating that the multi-dimensional composite reward successfully balances competing objectives.

3.7. Reward Weight Sensitivity Analysis

We evaluate sensitivity over the three DWA-inspired shaping weights $λ_{1}$ (Heading), $λ_{2}$ (Clearance), and $λ_{3}$ (Velocity). Three configurations are tested under complex difficulty with all other hyperparameters held constant. Table 9 reports the results as mean ± SD over 15 seeds.

Performance remains stable and even improves under moderate weight perturbations. Heading-dominant achieves 95.6% success, which is marginally higher than the default 94.1%, with improved smoothness (0.723 vs. 0.674). Velocity-suppressed achieves the highest success rate (97.2%) and lowest collision rate (2.8%), suggesting that the velocity term, while beneficial for trajectory quality, may slightly constrain success in the default configuration. All three configurations maintain success rates above 94%, confirming that the composite reward is not brittle to weight tuning.

3.8. Hyperparameter Robustness Analysis

We evaluate the robustness under perturbation of the learning rate, batch size, and $ε_{end}$. Table 10 summarizes the results under complex difficulty as mean ± SD over 15 seeds.

DWA-D3QN maintains strong performance across all perturbations. Reducing the learning rate or batch size yields 96.5% success, slightly exceeding the default 94.1%. Increasing $ε_{end}$ to 0.05 produces a modest decrease to 93.5% with a higher collision rate (6.5%), confirming that extended exploration provides limited benefit. Across all perturbations, success rates remain within [93.5%, 96.5%] with overlapping confidence intervals, confirming insensitivity to precise hyperparameter tuning.

3.9. Computational Efficiency

Table 11 summarizes the computational profile of DWA-D3QN. Each 200,000-step training run completes in approximately 9.7 min on an NVIDIA RTX 5070 Ti GPU, encompassing environment simulation, PER-based priority updates, and periodic network optimization (every four steps). A full sweep of 10 configurations requires approximately 1.5 h, enabling systematic ablation and sensitivity studies within a practical time budget.

Inference latency averages 0.63 ms per step (P99: 1.46 ms), confirming sub-millisecond average decision latency suitable for real-time control. The maximum of 12.73 ms reflects GPU cold-start overhead. Since DWA-D3QN shares the same Dueling network architecture as the D3QN baseline (53,386 parameters), its inference cost matches that of the baseline, demonstrating that the proposed DWA-derived reward shaping introduces no additional computational burden at deployment. In contrast, classical DWA performs velocity-space sampling and trajectory evaluation at each control cycle, incurring higher per-step computation.
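Latency statistics of this kind can be gathered with a simple timing harness. The sketch below is a minimal illustration: the policy is a hypothetical stand-in rather than the trained network, and the percentile uses the nearest-rank definition, which is an assumption about how P99 was computed.

```python
import time

def latency_summary(samples_ms):
    """Mean, nearest-rank P99, and maximum of per-step latencies in milliseconds."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = -(-99 * n // 100)  # ceil(0.99 * n) in exact integer arithmetic
    return {"mean": sum(ordered) / n, "p99": ordered[rank - 1], "max": ordered[-1]}

def time_policy(policy, observations):
    """Time each policy call with a monotonic high-resolution clock."""
    samples = []
    for obs in observations:
        start = time.perf_counter()
        policy(obs)
        samples.append((time.perf_counter() - start) * 1e3)
    return samples

def dummy_policy(obs):
    """Hypothetical stand-in: greedy choice among nine dummy action scores."""
    return max(range(9), key=lambda a: obs[a])

stats = latency_summary(time_policy(dummy_policy, [[0.1] * 15] * 1000))
print(stats)
```

When timing a network on a GPU, the device has to be synchronized before the clock is read, and the first call is best reported separately, which is consistent with the cold-start maximum noted above.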

3.10. Deployment-Time Computational Comparison

Table 12 presents a paired deployment-time comparison between classical DWA and DWA-D3QN under the env_decision protocol (3000 control steps on 20 complex maps). Continuous classical DWA requires 92.4 ms/step on CPU due to 200 $(v, ω)$ candidate evaluations with 20-step trajectory rollouts, whereas DWA-D3QN greedy inference requires 0.77 ms/step on GPU and 0.47 ms/step on CPU, representing speedups of approximately 120× and 197×, respectively. The discrete grid DWA baseline, which performs nine-action heuristic scoring, achieves 0.54 ms/step, which is comparable to DWA-D3QN in raw latency (1.15×). Table 13 further demonstrates that even the lightest continuous DWA configuration (50 × 10 candidates) remains 21.8× slower than DWA-D3QN, confirming that the speedup is robust to velocity sampling resolution.
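The quoted ratios follow directly from the per-step latencies. The short sketch below reproduces the arithmetic; the dictionary keys are illustrative labels, and the grid DWA ratio is taken against the CPU figure for DWA-D3QN.

```python
# Per-step latencies in milliseconds, as reported in the text.
latency_ms = {
    "continuous DWA (CPU)": 92.4,
    "grid DWA": 0.54,
    "DWA-D3QN (GPU)": 0.77,
    "DWA-D3QN (CPU)": 0.47,
}

def speedup(baseline, method):
    """Ratio of baseline latency to method latency."""
    return latency_ms[baseline] / latency_ms[method]

gpu_speedup = speedup("continuous DWA (CPU)", "DWA-D3QN (GPU)")
cpu_speedup = speedup("continuous DWA (CPU)", "DWA-D3QN (CPU)")
grid_ratio = speedup("grid DWA", "DWA-D3QN (CPU)")

# Work per control cycle for continuous DWA: candidates times rollout steps.
pose_updates_per_cycle = 200 * 20

print(round(gpu_speedup), round(cpu_speedup), round(grid_ratio, 2), pose_updates_per_cycle)
```

The 4000 simulated pose updates per control cycle explain why the continuous planner is two orders of magnitude slower than a single forward pass of a compact network.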

3.11. Path Visualization

To elucidate the behavioral mechanisms underlying the quantitative results, three representative obstacle layouts under complex difficulty are analyzed using greedy inference on both trained policies. The selected scenarios span narrow corridors (Scenario A), high local decision demands (Scenario B), and dense obstacle distributions (Scenario C), covering diverse navigation challenges.

3.11.1. Scenario A: Narrow Corridor

Figure 9 shows the results in a narrow corridor scenario where the start and goal are separated by multiple static obstacles. D3QN fails to reach the goal and collides (path length: 24.14, 20 steps). DWA-D3QN reaches the goal without collision (geometric length: 30.04, 23 steps). The D3QN trajectory exhibits directional deviations in obstacle-dense regions, while DWA-D3QN navigates smoothly along the traversable channel.

3.11.2. Scenario B: Path Coherence

Figure 10 shows the results in a scenario requiring precise local decisions. D3QN requires 796 steps (geometric length: 930.62); although no collision occurs, it fails to reach the goal, exhibiting high-frequency oscillations and local loops. DWA-D3QN reaches the goal in 23 steps (geometric length: 30.46), without collision, following a monotonic path with substantially fewer turns.

3.11.3. Scenario C: Dense Obstacles

Figure 11 shows the results in a scenario with increased obstacle density and severely constrained traversable space. D3QN reaches the goal in 29 steps (geometric length: 34.38), without collision, but exhibits detour behavior. DWA-D3QN achieves a shorter path in 25 steps (geometric length: 30.38) while maintaining safety, selecting a more compact trajectory that closely follows obstacle boundaries.

Across all three scenarios, DWA-D3QN consistently outperforms D3QN: it succeeds where D3QN fails (Scenario A), eliminates oscillations (Scenario B), and achieves more compact paths (Scenario C). These behavioral observations corroborate the quantitative improvements in success rate, collision rate, smoothness, and step efficiency.

3.12. Convergence Speed and Training Stability

Learning curves for D3QN and DWA-D3QN are compared under complex difficulty with a 200,000-step budget. A sliding window of size 50 smooths the raw curves; shaded regions indicate variance across random seeds.
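The smoothing and shading steps can be sketched as follows. The window is assumed to be trailing (the text does not say whether it is trailing or centered), and the shaded band is assumed to be the per-step standard deviation across seeds; both are illustrative choices.

```python
import statistics
from collections import deque

def sliding_mean(values, window=50):
    """Trailing moving average; early points average over the samples seen so far."""
    buffer, total, smoothed = deque(), 0.0, []
    for value in values:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            total -= buffer.popleft()
        smoothed.append(total / len(buffer))
    return smoothed

def seed_band(curves):
    """Per-step mean and sample standard deviation across seeds (equal-length curves)."""
    columns = list(zip(*curves))
    means = [statistics.mean(col) for col in columns]
    spreads = [statistics.stdev(col) for col in columns]
    return means, spreads

# Hypothetical reward curves for two seeds.
curves = [sliding_mean([0, 1, 2, 3, 4], window=2), sliding_mean([1, 1, 3, 3, 5], window=2)]
print(seed_band(curves))
```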

3.12.1. Average Episode Reward Convergence

Figure 12 presents the reward curves. DWA-D3QN rises rapidly and plateaus at approximately $2 \times 10^{4}$ steps with small fluctuations, while D3QN converges markedly slower with persistent oscillations.

3.12.2. Success Rate Convergence and Stability

Figure 13 shows the success rate curves. DWA-D3QN achieves high success earlier and approaches saturation sooner with smaller fluctuation amplitudes and weaker regressions during the plateau phase.

3.13. Comparative Analysis

DWA-D3QN is compared against three categories of baselines under complex difficulty: value-based DRL (DQN, DDQN, Dueling DQN), heuristic-guided DRL (APF-DQN, PBRS-guided D3QN), and classical planning (A*, DWA). A* replans at each step to adapt to dynamic obstacles; classical DWA performs local velocity-space sampling and trajectory evaluation at each control cycle. All methods share an identical test set to ensure fair comparison. DRL results are reported as mean ± standard deviation over 15 seeds; A* and DWA results are reported as mean ± standard error over 120 fixed maps. Statistical significance was assessed using the paired Wilcoxon signed-rank test ($α = 0.05$).

Paired Wilcoxon tests with Cohen’s d confirm statistical significance: PPO vs. DWA-D3QN success $p = 0.0010$, $d = -1.38$ (very large); SAC $p = 0.0007$, $d = -1.20$ (very large); TD3 $p = 0.0024$, $d = -1.01$ (large); D3QN (PBRS) vs. DWA-D3QN success $p = 0.0109$, $d = -0.51$ (medium). All three policy gradient methods achieve significantly lower success rates than DWA-D3QN. Despite the additional training budget and exploration mechanisms, PPO achieves only 54.8% success with high variance (±28.2%). All three methods require substantially more steps per episode than DWA-D3QN (64.7–240.8 vs. 23.93), indicating less efficient exploration in the discrete grid navigation task.

Two D3QN configurations are evaluated: D3QN (sparse) in Table 7 uses only $R_{task}$; D3QN (PBRS) in Table 14 adds the Chebyshev goal term. The 55.7 pp gap (85.7% vs. 30.0%) isolates the contribution of goal-progress shaping.

DQN achieves 82.3% success with notable variance (±21.9%). DDQN reaches 89.5% (±5.1%), confirming that Double Q-learning reduces overestimation and stabilizes training. APF-DQN attains 80.3% (±17.1%), indicating unreliable guidance from binary attraction–repulsion signals. Dueling DQN achieves 81.9% (±21.2%) with high variance suggesting that the Dueling architecture alone is insufficient for stable learning under sparse rewards.

A* achieves 76.7% success with mean steps 130.19, roughly 5.4× that of DWA-D3QN, due to repeated global replanning under dynamic obstacles. Classical DWA reaches 93.3% success, demonstrating strong real-time obstacle avoidance though with higher collision rate (6.7%) than DWA-D3QN (5.9%).

DWA-D3QN achieves 94.1% success and 0.674 smoothness. The overall performance comparison of success rate and collision rate across all methods is visualized in Figure 14, while trajectory quality metrics including smoothness, mean steps, and minimum obstacle clearance are illustrated in Figure 15. Paired Wilcoxon tests ($n = 15$, seed 4 excluded) with Cohen’s d confirm statistical significance: DWA-D3QN vs. DDQN success $p = 0.0046$, $d = 1.01$ (large); vs. D3QN (PBRS) success $p = 0.0109$, $d = 0.51$ (medium), smoothness $p = 6.10 \times 10^{-5}$, $d = 2.00$ (very large), collision $p = 0.0101$, $d = -0.53$ (medium). DWA-D3QN achieves the lowest standard deviation (±3.4%) among all DRL methods, confirming that dense kinematic priors stabilize policy learning.
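With 15 paired seeds, the smallest two-sided p-value an exact Wilcoxon signed-rank test can return is $2 / 2^{15} \approx 6.10 \times 10^{-5}$, which occurs when every seed favors the same method; this coincides with the smoothness p-value reported above. The sketch below computes the exact null distribution by dynamic programming. It assumes no zero differences and no tied magnitudes, uses hypothetical scores, and is an illustration rather than the authors' analysis script.

```python
def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples without zeros or ties."""
    diffs = [a - b for a, b in zip(x, y)]
    magnitudes = sorted(abs(d) for d in diffs)
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("exact test sketch requires nonzero, untied differences")
    rank = {m: i + 1 for i, m in enumerate(magnitudes)}
    n = len(diffs)
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)

    # counts[s] = number of subsets of ranks {1..n} whose sum is s (0/1 knapsack).
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    lower = sum(counts[: w_plus + 1])  # subsets with sum <= w_plus
    upper = sum(counts[w_plus:])       # subsets with sum >= w_plus
    p_value = min(1.0, 2 * min(lower, upper) / 2 ** n)
    return w_plus, p_value

# Hypothetical per-seed scores in which method A wins on every seed.
method_a = [60 + 2 * i for i in range(15)]
method_b = [59 + i for i in range(15)]
w_stat, p_val = wilcoxon_exact(method_a, method_b)
print(w_stat, p_val)
```

Because this floor depends only on the number of pairs, p-values at this level indicate a consistent direction across seeds rather than the size of the effect, which is why the effect sizes are reported alongside them.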

3.14. ROS/Gazebo Preliminary Validation

To assess sim-to-real transferability, the grid-world-trained DWA-D3QN policy was deployed in a ROS/Gazebo environment without fine tuning. An Ackermann-steered vehicle model equipped with a simulated 2D LiDAR sensor navigated among dynamic obstacles, following the kinematic model described in Section 2.4. The grid policy’s discrete actions are converted to global waypoints and then tracked by a path-following controller that generates continuous $(v, φ)$ commands. Quantitative evaluation was conducted over 50 independent trials with varying obstacle configurations. The policy achieved a success rate of 96.0% with a collision rate of 4.0%. Figure 16 illustrates the navigation process at different stages.
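The conversion from discrete actions to steering commands can be sketched with the textbook pure-pursuit law, $φ = \arctan(2 L \sin α / l_{d})$, and the standard rear-axle bicycle relation $ω = v \tan φ / L$. The cell size, wheelbase, and waypoint convention (cell centers) below are hypothetical; the controller of Section 2.4 may select its lookahead point and gains differently.

```python
import math

def action_to_waypoint(cell, action, cell_size=1.0):
    """Map a grid cell and a discrete move (dx, dy) to the center of the next cell in meters."""
    (cx, cy), (dx, dy) = cell, action
    return ((cx + dx + 0.5) * cell_size, (cy + dy + 0.5) * cell_size)

def pure_pursuit_steering(pose, target, wheelbase):
    """Front steering angle phi = atan(2 L sin(alpha) / l_d) for a rear-axle reference pose."""
    x, y, theta = pose
    dx, dy = target[0] - x, target[1] - y
    lookahead = math.hypot(dx, dy)
    if lookahead == 0.0:
        return 0.0  # already at the target; no steering demand
    alpha = math.atan2(dy, dx) - theta  # bearing of the target in the vehicle frame
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)

def yaw_rate(speed, steering, wheelbase):
    """Kinematic bicycle model: omega = v * tan(phi) / L."""
    return speed * math.tan(steering) / wheelbase

waypoint = action_to_waypoint((3, 4), (1, 1))
phi = pure_pursuit_steering((3.5, 4.5, 0.0), waypoint, wheelbase=0.3)
print(waypoint, phi, yaw_rate(0.5, phi, 0.3))
```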

The quantitative results are summarized in Table 15.

The average cross-track error over all successful trials was 0.094 m (RMS: 0.108 m, maximum: 0.27 m), which was measured as the perpendicular distance between the vehicle rear-axle center and the reference path segment at each control cycle (10 Hz). This level of tracking accuracy confirms that the grid-trained policy transfers effectively to continuous steering control under the Ackermann kinematic model described in Section 2.4.
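The error statistics can be computed as in the sketch below, which takes logged rear-axle positions and a piecewise-linear reference path. Using the nearest segment for each sample is an assumption, and the example poses are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from p to segment ab (perpendicular when the foot lies on the segment)."""
    (ax, ay), (bx, by) = a, b
    vx, vy = bx - ax, by - ay
    seg_len_sq = vx * vx + vy * vy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - ax, p[1] - ay)
    t = ((p[0] - ax) * vx + (p[1] - ay) * vy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp the projection to the segment
    return math.hypot(p[0] - (ax + t * vx), p[1] - (ay + t * vy))

def cross_track_stats(positions, path):
    """Mean, RMS, and maximum cross-track error over all logged positions."""
    segments = list(zip(path, path[1:]))
    errors = [min(point_segment_distance(p, a, b) for a, b in segments) for p in positions]
    n = len(errors)
    return {
        "mean": sum(errors) / n,
        "rms": math.sqrt(sum(e * e for e in errors) / n),
        "max": max(errors),
    }

# Hypothetical log: two samples taken at 10 Hz along a straight reference segment.
example = cross_track_stats([(1.0, 0.1), (5.0, -0.3)], [(0.0, 0.0), (10.0, 0.0)])
print(example)
```

The RMS value is never smaller than the mean of the absolute errors, which matches the ordering of the reported figures (0.094 m mean, 0.108 m RMS).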

Qualitative results demonstrate that the learned policy successfully transfers to the continuous-state environment. The robot avoids dynamic obstacles and reaches the goal in 48 of the 50 trials. The Rviz trajectory (Figure 17) shows a smooth, continuous path without oscillations, confirming that the DWA-derived shaping signals produce trajectories that are executable under the Ackermann steering kinematic model. Minor trajectory deviations in early deployment steps are attributable to the domain gap between discrete grid representation and continuous LiDAR observations. The two failure cases occurred in scenarios with narrow corridors and intersecting dynamic obstacle trajectories, representing edge cases at the boundary of the grid-world training distribution.

4. Discussion

4.1. Analysis of DWA-D3QN Versus Baseline Methods

DWA-D3QN achieves comprehensive superiority under both simple and complex difficulties. The dual-configuration D3QN evaluation decomposes performance gains: sparse-reward D3QN achieves 30.0%, PBRS-guided D3QN reaches 85.7% (a 55.7 pp gain from the Chebyshev goal term), and DWA-D3QN reaches 94.1% (an additional 8.4 pp from DWA-derived shaping). This confirms that both the goal signal and kinematic heuristics are essential. All improvements are statistically significant ($p < 0.001$ for sparse-reward vs. PBRS-guided; $p < 0.05$ for PBRS-guided vs. DWA-D3QN; Wilcoxon test).

DWA-D3QN achieves the lowest standard deviation (±3.4%) among all DRL methods compared to ±17.0% for D3QN (PBRS) and ±21.9% for DQN. This progressive reduction demonstrates that Double Q-learning, the Dueling architecture, and DWA shaping each contribute to training stability.

DDQN (89.5%) outperforms D3QN (PBRS) (85.7%), which is likely because the Dueling architecture introduces additional parameters that increase early-training variance under sparse rewards. DWA-derived dense signals mitigate this, enabling the Dueling architecture’s representational capacity to manifest in the full model.

In terms of path quality, DWA-D3QN achieves the best smoothness among DRL methods (0.674) and the highest min clearance (0.981). The smoothness improvement over D3QN (PBRS) is statistically significant (Wilcoxon $p < 0.0001$). APF-DQN’s mean steps (42.05, 1.8× DWA-D3QN) highlight the limitations of binary attraction–repulsion heuristics.

4.2. Mechanistic Insights from Ablation Studies

Clearance awareness is the most critical single shaping dimension (88.5% success, 11.3% collision), indicating that obstacle safety distance awareness provides the strongest individual contribution to task completion. Heading guidance reaches 84.0%, confirming that directional alignment also plays a key role. Velocity smoothing alone achieves 83.3% with the lowest mean steps (25.68), suggesting a trade-off between path efficiency and collision risk.

The full model (94.1%) outperforms Heading only by 10.1 pp and Clearance only by 5.6 pp, with both improvements statistically significant ($p < 0.01$ and $p < 0.05$, respectively, Wilcoxon test), confirming the benefit of multi-dimensional shaping.

The 31.4 pp gap between DWA dense only (62.7%) and the full model (94.1%) demonstrates that kinematic guidance alone is insufficient without the goal-progress signal. DWA dense only exhibits high variance (±34.7%) with several seeds failing to converge, further highlighting the stabilizing role of the Chebyshev goal term. APF-Euclidean (80.3%) underperforms Chebyshev, validating the $L^{\infty}$ potential.

4.3. Behavioral Analysis from Path Visualization

Across the three analyzed scenarios, DWA-D3QN exhibits superior performance over D3QN in every case: it succeeds where D3QN fails (Scenario A), eliminates oscillations (Scenario B), and achieves more compact paths (Scenario C). These observations corroborate the quantitative results.

4.4. Training Dynamics and Convergence Behavior

DWA-D3QN converges faster (plateau at $\sim 2 \times 10^{4}$ steps) with smaller fluctuations, whereas D3QN requires more steps with persistent oscillations. DWA heuristic signals provide continuous guidance, accelerating policy formation.

Robustness analysis confirms stable performance under perturbations of learning rate, batch size, and exploration schedule, with the success rate remaining within 93.5–96.5% (Section 3.8). Sensitivity analysis shows the composite reward is not brittle: the success rate remains above 94% across the tested weight configurations, including 97.2% with the velocity term suppressed (Section 3.7).

4.5. Performance Relative to Traditional Planning Algorithms

A* (76.7%, steps: 130.19) suffers from its static environment assumption, requiring frequent replanning. Classical DWA achieves high success (93.3%) but with a higher collision rate (6.7%) than DWA-D3QN (5.9%). The deployment-time comparison (Table 12 and Table 13) quantifies the computational advantage: continuous classical DWA requires 92.4 ms per decision cycle, whereas DWA-D3QN completes inference in 0.47–0.77 ms, representing speedups of 120–197×. Even the lightest continuous DWA configuration remains 21.8× slower. On the discrete grid, DWA-D3QN (0.47 ms) is comparable to grid DWA (0.54 ms); the core advantage lies not in raw latency on the grid but in the offline internalization of kinematic priors that enables deployment on continuous-state platforms without online velocity search. DWA-D3QN internalizes DWA criteria through offline training, combining local obstacle avoidance with global motion coherence.

4.6. Failure Mode Analysis

Despite the strong overall performance, several failure modes are observed. In scenarios with dense dynamic obstacle congestion (four or more dynamic obstacles intersecting in a confined region), DWA-D3QN occasionally exhibits oscillatory behavior where the agent alternates between two actions without making progress, resulting in timeout failures. This occurs when the DWA-derived clearance and heading terms produce conflicting gradients in the Q-value landscape. In local minima situations, such as U-shaped obstacle formations, the agent may exhibit repetitive back-and-forth motion, as the Chebyshev goal-progress signal is insufficient to guide escape from non-convex obstacle configurations. These failure cases typically represent less than 6% of all episodes and are more prevalent in the early training phase with frequency decreasing as training progresses. Reward conflicts between the heading incentive (pulling toward the goal through obstacles) and the clearance incentive (pushing away from obstacles) can occasionally produce indecisive policies in narrow passages, highlighting a limitation of the scalarized multi-objective reward formulation. Future work could address these failure modes through curriculum learning that gradually increases obstacle density or through adaptive weight scheduling that dynamically adjusts the trade-off between competing reward terms based on local obstacle density.

4.7. Comparison with Policy Gradient Methods

PPO, SAC, and TD3 achieve success rates of 53.9–56.5% compared with 94.1% for DWA-D3QN (Table 16). All differences are statistically significant (Wilcoxon $p < 0.01$). We attribute this gap to two factors. First, the discrete action space (nine actions) may not fully leverage the continuous control capabilities of policy gradient methods. Second, the DWA-derived dense reward shaping provides structured kinematic guidance that is particularly effective for discrete grid navigation, whereas the sparse task reward limits the learning signal available to PPO/SAC/TD3. The high variance of PPO (±28.2%) and TD3 (±37.2%) further indicates sensitivity to random seed initialization. These methods also require roughly 2.7–10 times as many steps per episode (64.7–240.8 vs. 23.93), suggesting less efficient exploration. These findings highlight that value-based methods with task-specific reward shaping may be more suitable for discrete navigation tasks, while policy gradient methods excel in continuous control domains.

5. Conclusions

This paper proposed DWA-D3QN, a Dueling Double Deep Q-Network guided by DWA heuristics for intelligent vehicle path planning. The D3QN architecture decouples state value and action advantage while mitigating Q-value overestimation through the double estimator. PER enhances critical sample utilization. The reward integrates a Chebyshev goal-progress signal motivated by PBRS with dense DWA-derived proxies and auxiliary terms; the policy-invariance property of Ng et al. applies only to the standard PBRS coupling of the goal term to the sparse task reward rather than to the full training return. The policy network employs a compact Dueling architecture with 53,386 parameters and achieves 0.63 ms average inference latency.

Under complex difficulty, DWA-D3QN achieves 94.1 ± 3.4% success and 5.9 ± 3.4% collision over 15 seeds with the lowest variance among all DRL methods. A dual-configuration evaluation reveals a 55.7 pp gain from the PBRS Chebyshev goal term and a further 8.4 pp gain from DWA-derived shaping. Over the sparse-reward baseline, DWA-D3QN improves success from 30% to 94.1%, which is a 64.1 pp gain. Ablation studies confirm that clearance and heading terms provide essential safety and directional guidance, while velocity smoothing refines trajectory quality. The full model outperforms Heading only by 10.1 pp and Clearance only by 5.6 pp. The PBRS Chebyshev goal term contributes 31.4 pp over DWA-dense-only shaping. Comparisons with PPO, SAC, and TD3 confirm statistically significant advantages (PPO: $p = 0.0010$, SAC: $p = 0.0007$, TD3: $p = 0.0024$). Deployment-time benchmarks confirm 120–197× speedup over continuous classical DWA. ROS/Gazebo quantitative validation achieves 96.0% success rate over 50 trials, further confirming sim-to-real transferability. The deep integration of reinforcement learning and classical kinematic evaluation achieves a balanced optimization of task completion, obstacle avoidance safety, and trajectory quality in dynamic environments.

Limitations and Future Work

Several limitations should be acknowledged. (1) While ROS/Gazebo validation with an Ackermann-steered vehicle model demonstrates basic sim-to-real transferability, the validation environment remains relatively simplified compared to real-world conditions. Quantitative evaluation under more challenging scenarios including sensor noise, communication delays, varying terrain conditions, and diverse weather effects remains for future work. The current action space uses omnidirectional discrete grid movement during training; DWA is employed solely for training-phase reward shaping rather than for online control. (2) This paper focuses on value-based DRL; comparisons with policy-gradient methods such as PPO, SAC, and TD3 are deferred to enable a comprehensive assessment of how different DRL paradigms interact with the proposed DWA-derived reward shaping. (3) The framework addresses single-agent navigation; multi-agent scenarios with inter-agent coordination, collision avoidance, and communication require further investigation. (4) Failure modes including oscillatory behaviors in dense dynamic obstacle congestion and local minima entrapment in non-convex obstacle configurations highlight limitations of the scalarized multi-objective reward formulation.

Future work includes (1) extending the current Gazebo validation to the CARLA simulator with domain randomization, dynamic weather, and more realistic sensor models to comprehensively evaluate sim-to-real transfer; (2) investigating prediction-based anticipation mechanisms for dynamic obstacle trajectories to improve performance in densely crowded environments; (3) extending the framework to multi-agent cooperative navigation with decentralized DWA-D3QN policies; (4) accelerating training via distributed replay and parallel environment sampling; (5) automating reward parameter learning via meta-learning or population-based training to reduce manual tuning effort; (6) integrating the DWA heuristic signals directly into the training action space, enabling the agent to learn under non-holonomic constraints from the outset rather than relying on post hoc kinematic adaptation; and (7) developing curriculum learning strategies and adaptive weight scheduling to address the identified failure modes of oscillatory behavior and local minima entrapment.

Author Contributions

Conceptualization, J.N. and W.W.; methodology, J.N.; software, J.N.; validation, J.N. and W.W.; formal analysis, J.N.; investigation, J.N.; resources, W.W.; data curation, J.N.; writing—original draft preparation, J.N.; writing—review and editing, J.N. and W.W.; visualization, J.N.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Undergraduate Innovation and Entrepreneurship Training Program (grant number C20261588) and the Beijing Information Science and Technology University “Xingguang” Science and Technology Innovation Fund (grant number XG2026PT28).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this paper are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DQN	Deep Q-Network
DDQN	Double Deep Q-Network
D3QN	Dueling Double Deep Q-Network
DWA	Dynamic Window Approach
PER	Prioritized Experience Replay
PBRS	Potential-Based Reward Shaping
APF	Artificial Potential Field
PPO	Proximal Policy Optimization
SAC	Soft Actor–Critic
TD3	Twin Delayed DDPG
TD	Temporal Difference
MDP	Markov Decision Process
DRL	Deep Reinforcement Learning

References

Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
Xu, L.; Zhang, W. Survey on path planning based on deep reinforcement learning. In Proceedings of the 2025 2nd International Conference on Machine Learning and Intelligent Computing, Virtual, 27 October 2025; PMLR: New York, NY, USA, 2025; pp. 685–695.
Peilong, N.; Pengjun, M.; Ning, W.; Mengjie, Y. Robot path planning based on improved A-DDQN algorithm. J. Syst. Simul. 2025, 37, 2420.
Huang, Y.; Wei, G.L.; Wang, Y.X. V-D D3QN: The variant of double deep Q-learning network with dueling architecture. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 9356–9361.
Ruan, X.; Lin, C.; Huang, J.; Li, Y. A Mobile Robot Autonomous Navigation Method Combining Deep Reinforcement Learning and Intrinsic Motivation. Chinese Patent CN116147627A, 4 January 2023.
Nguyen, T.T.; Nahavandi, S.; Razzak, I.; Nguyen, D.; Pham, N.T.; Nguyen, Q.V.H. The emergence of deep reinforcement learning for path planning. arXiv 2025, arXiv:2507.15469.
Venu, S.; Gurusamy, M. A comprehensive review of path planning algorithms for autonomous navigation. Results Eng. 2025, 28, 107750.
Zhang, Y.; Cui, C.; Zhao, Q. Path planning of mobile robot based on A star algorithm combining DQN and DWA in complex environment. Appl. Sci. 2025, 15, 4367.
Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), Bled, Slovenia, 27–30 June 1999; pp. 278–287.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Ruan, X.; Ren, D.; Zhu, X.; Huang, J. Mobile robot navigation based on deep reinforcement learning. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019; pp. 6174–6178.
van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. arXiv 2015, arXiv:1509.06461.
Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 20–22 June 2016; pp. 1995–2003.
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
Li, A.A.; Lu, Z.; Miao, C. Revisiting prioritized experience replay: A value perspective. arXiv 2021, arXiv:2102.03261.
Fox, D.; Burgard, W.; Thrun, S. The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 1997, 4, 23–33.
Wei, Z.; Liu, D.; Wang, H.; Hao, R. DWA with three-level buffer and fuzzy logic and its application. J. Chin. Comput. Syst. 2022, 43, 1615–1624.
Grześ, M. Reward shaping with potential functions in continuous state-action spaces: A survey of theory and applications. Artif. Intell. Rev. 2023, 56, 14873–14916.
Lidayan, A.; Dennis, M.; Russell, S. BAMDP Shaping: A Unified Framework for Intrinsic Motivation and Reward Shaping. arXiv 2024, arXiv:2409.05358.

Figure 1. Deep Q-Network (DQN) algorithm framework.

Figure 2. Overall workflow framework of the DWA-D3QN algorithm.

Figure 3. D3QN algorithm training framework with Prioritized Experience Replay.

Figure 4. Workflow of the Prioritized Experience Replay mechanism.

Figure 5. Transformation framework from DWA dynamic window evaluation criteria to dense reward function.

Figure 6. Kinematic model of an Ackermann-steered vehicle (rear-axle reference). $(x, y)$ is the rear-axle center; $θ$ is the heading angle; $φ$ is the front steering angle; $L$ is the wheelbase; $ρ$ is the turning radius to the instantaneous center of rotation (ICR); $ω$ is the yaw rate.

Figure 7. Training loss curve of DWA-D3QN under complex difficulty (learning rate $5.0 \times 10^{-4}$, batch size 256, 200k steps). The loss converges rapidly within the first $5 \times 10^{4}$ steps and remains stable thereafter, confirming the suitability of the selected hyperparameters.

Figure 8. Representative grid map examples at the two difficulty levels. (a) Simple difficulty ($ρ_{occ} \approx 0.10$, $N_{dyn} = 2$); (b) complex difficulty ($ρ_{occ} \approx 0.14$, $N_{dyn} = 4$).

Figure 9. Scenario A: D3QN (left) fails and collides; DWA-D3QN (right) reaches the goal with a smooth trajectory.

Figure 10. Scenario B: D3QN (left) oscillates and fails; DWA-D3QN (right) produces a smooth, efficient path.

Figure 11. Scenario C: D3QN (left) shows detours; DWA-D3QN (right) produces a shorter, more compact trajectory.

Figure 12. Average episode reward convergence curves for D3QN and DWA-D3QN under complex difficulty.

Figure 13. Success rate convergence curves for D3QN and DWA-D3QN under complex difficulty.

Figure 14. Task completion capability under complex difficulty. Error bars denote ± one standard deviation. DWA-D3QN achieves the highest success rate and lowest collision rate with minimal variance.

Figure 15. Trajectory quality and efficiency under complex difficulty. Error bars denote ± one standard deviation. DWA-D3QN achieves the best smoothness and clearance with competitive step efficiency.

Figure 16. ROS/Gazebo simulation of DWA-D3QN policy. Quantitative results over 50 trials: success rate 96.0%, collision rate 4.0%.

Figure 17. RViz trajectory visualization of the DWA-D3QN policy in the ROS/Gazebo environment. The green curve shows the robot’s actual trajectory from start to goal, demonstrating smooth navigation with effective dynamic obstacle avoidance.

Table 1. The 15-D observation vector specification ($N = 20$, agent position $p_{t} = (x_{t}, y_{t})$, goal $g$).

Index	Symbol	Definition	Range
0–1	$s_{0}, s_{1}$	$x_{t} / N, y_{t} / N$	$[0, 1]$
2–3	$s_{2}, s_{3}$	$(g_{x} - x_{t}) / N, (g_{y} - y_{t}) / N$	$[- 1, 1]$
4	$s_{4}$	$∥ g - p_{t} ∥_{2} / (N \sqrt{2})$	$[0, 1]$
5–6	$s_{5}, s_{6}$	$(Δ x_{a}, Δ y_{a}) / 2$ from $a_{t - 1}$	$\{- 0.5, 0, 0.5\}$
7–14	$s_{7 + k}$	8-ray LiDAR: $min (d_{k} / (N \sqrt{2}), 1)$	$[0, 1]$
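
The layout in Table 1 can be sketched as a small Python function that assembles the 15 entries in index order. This is a hedged illustration: the function name `build_observation`, the argument names, and the assumption that the eight LiDAR ray lengths are already available in grid-cell units are hypothetical, and the ray-casting itself is not shown.

```python
import math


def build_observation(pos, goal, prev_offset, lidar_dists, n=20):
    """Assemble the 15-D observation of Table 1 (names here are illustrative).

    pos, goal    : (x, y) grid coordinates
    prev_offset  : (dx, dy) grid offset of the previous action a_{t-1}
    lidar_dists  : eight ray lengths d_k, assumed to be in grid-cell units
    n            : grid side length N
    """
    if len(lidar_dists) != 8:
        raise ValueError("expected 8 LiDAR rays")
    x, y = pos
    gx, gy = goal
    diag = n * math.sqrt(2.0)  # grid diagonal used for normalization
    obs = [
        x / n, y / n,                          # s0, s1: normalized position
        (gx - x) / n, (gy - y) / n,            # s2, s3: normalized goal offset
        math.hypot(gx - x, gy - y) / diag,     # s4: normalized Euclidean distance
        prev_offset[0] / 2.0, prev_offset[1] / 2.0,  # s5, s6: previous action
    ]
    obs.extend(min(d / diag, 1.0) for d in lidar_dists)  # s7..s14: LiDAR
    return obs
```

Every entry falls inside the range listed in the table as long as the positions lie on the grid and the previous offset is one of the nine actions of Table 2.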

Table 2. Action space specification (nine discrete actions).

$a_{t}$	Name	Grid Offset ( $Δ$ x, $Δ$ y)
0	Stay	$(0, 0)$
1	Up	$(0, - 1)$
2	Down	$(0, + 1)$
3	Left	$(- 1, 0)$
4	Right	$(+ 1, 0)$
5–8	Four diagonals	$(\pm 1, \pm 1)$

Table 3. Environment transition and termination conditions. The symbol → denotes “triggers” or “leads to”.

Step	Description
1	Agent position update: $x^{'} = clip (x + Δ x, 0, N - 1)$ , $y^{'}$ similarly
2	Dynamic obstacles: reciprocate between start–end points, 1 cell/step along Manhattan direction
3	Collision: agent position coincides with any obstacle cell → episode terminates
4	Success: $max (| x^{'} - g_{x} |, | y^{'} - g_{y} |) \leq 1$ (Chebyshev distance) → episode terminates
5	Timeout: step count ≥ 600 → episode truncates
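
Tables 2 and 3 together describe one environment step, which can be sketched as follows. Several details are assumptions made only for illustration: the ordering of the four diagonal actions (the table lists them only as 5–8), the convention that `obstacles` already contains the dynamic obstacles after their update for this step, and the priority of collision over success over timeout, which simply follows the row order of Table 3.

```python
# Grid offsets per action index; the diagonal ordering (5-8) is an assumption.
ACTION_OFFSETS = {
    0: (0, 0),    # Stay
    1: (0, -1),   # Up
    2: (0, 1),    # Down
    3: (-1, 0),   # Left
    4: (1, 0),    # Right
    5: (-1, -1), 6: (1, -1), 7: (-1, 1), 8: (1, 1),
}


def step_agent(pos, action, goal, obstacles, step_count, n=20, max_steps=600):
    """Apply one transition; returns (new_pos, status, new_step_count).

    obstacles : set of occupied (x, y) cells, static plus already-moved dynamic ones
    status    : "collision", "success", "timeout", or "running"
    """
    dx, dy = ACTION_OFFSETS[action]
    x = min(max(pos[0] + dx, 0), n - 1)  # clip to the grid
    y = min(max(pos[1] + dy, 0), n - 1)
    step_count += 1
    if (x, y) in obstacles:
        return (x, y), "collision", step_count   # episode terminates
    if max(abs(x - goal[0]), abs(y - goal[1])) <= 1:
        return (x, y), "success", step_count     # Chebyshev distance <= 1
    if step_count >= max_steps:
        return (x, y), "timeout", step_count     # episode truncates
    return (x, y), "running", step_count
```

Keeping termination (collision, success) distinct from truncation (timeout) matters for bootstrapping, since only truncated episodes still have a meaningful next-state value.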

Table 4. Reward coefficient specification.

Term	Formula/Description	Value
Base terms (shared by all methods)
$R_{step}$	Constant per-step penalty	$- 0.1$
$F_{g}$ (PBRS)	$λ [γ d_{\infty} (s) - d_{\infty} (s^{'})]$	$λ = 2.0$ , $γ = 0.99$
$F_{g}$ (APF, APF-DQN only)	$λ [d_{Euc} (s) - d_{Euc} (s^{'})]$	$λ = 2.0$
$R_{dir}$	$0.5 cos ∠ (Δ p, g)$	$0.5$
$R_{rep}$	$- 0.5 {(1 / d_{min} - 1 / d_{safe})}^{2}$	$d_{safe} = 2.0$
$R_{back}$	Return to $s_{t - 1}$	$- 0.5$
$R_{turn}$	$a_{t} \neq a_{t - 1}$	$- 0.2$
Success	$d_{\infty} \leq 1$	$+ 100$
Collision	Occupies obstacle cell	$- 50$
DWA shaping (DWA-D3QN only)
Schedule	$k = min (t_{train} / T_{warm}, 1)$	$T_{warm} = 55{,}000$
Global multiplier	$s_{dwa}$	$1.12$
Heading $λ_{1}$	Fast: $1.00 (1 - k)$ , Safe: $0.52 k$	$λ_{1} = 1.0$
Clearance $λ_{2}$	Fast: $0.38 (1 - k)$ , Safe: $0.58 k$	$λ_{2} = 1.0$
Velocity $λ_{3}$	Fast: $0.20 (1 - k)$ , Safe: $0.06 k$	$λ_{3} = 1.0$
Post-processing (all methods)
Scaling	$\div 10$
Clipping	$[- 10, 10]$
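
A few rows of Table 4 can be made concrete with a short Python sketch: the goal-progress term, the warm-up schedule with its fast and safe channel coefficients, and the post-processing. The goal term is transcribed exactly as printed in the table, with the discount applied to the current-state distance; the canonical potential-based form with potential $-λ d_{\infty}$ would instead read $λ [d_{\infty} (s) - γ d_{\infty} (s^{'})]$. How the fast and safe coefficients are combined with the channel values is not stated in the table, so the sketch only returns them. Applying scaling before clipping is an assumption based on the row order, and all function names are hypothetical.

```python
def chebyshev(p, g):
    """Chebyshev (L-infinity) distance between two grid cells."""
    return max(abs(p[0] - g[0]), abs(p[1] - g[1]))


def goal_progress_term(pos, next_pos, goal, lam=2.0, gamma=0.99):
    """F_g as printed in Table 4: lam * (gamma * d(s) - d(s'))."""
    return lam * (gamma * chebyshev(pos, goal) - chebyshev(next_pos, goal))


def dwa_channel_coefficients(t_train, t_warm=55_000):
    """Warm-up factor k and the per-channel fast/safe coefficients of Table 4."""
    k = min(t_train / t_warm, 1.0)
    fast = {"heading": 1.00 * (1 - k), "clearance": 0.38 * (1 - k), "velocity": 0.20 * (1 - k)}
    safe = {"heading": 0.52 * k, "clearance": 0.58 * k, "velocity": 0.06 * k}
    return k, fast, safe


def postprocess(raw_reward, scale=10.0, low=-10.0, high=10.0):
    """Divide by 10, then clip to [-10, 10] (order assumed from the table)."""
    return min(max(raw_reward / scale, low), high)
```

Under this reading the schedule shifts weight from the fast coefficients to the safe ones as training approaches $T_{warm}$, after which only the safe coefficients remain nonzero.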

Table 5. Hyperparameter settings for the DWA-D3QN algorithm.

Parameter Name	Parameter Value
Learning rate	$5.0 \times 10^{- 4}$
Optimizer	Adam
Reward discount factor $γ$	0.99
Batch size	256
Experience replay buffer size	120,000
Priority coefficient $α$ (PER)	0.6
Importance sampling initial coefficient $β_{start}$ (PER)	0.5
Exploration attenuation	$ε_{start} = 1.0$ , $ε_{end} = 0.02$
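
To show how the two PER coefficients in Table 5 enter the sampling step, the following Python sketch uses the proportional variant of Prioritized Experience Replay, with $P (i) = p_{i}^{α} / \sum_{j} p_{j}^{α}$ and importance weights $w_{i} = (N P (i))^{- β}$ normalized by the largest weight in the buffer. Whether the authors use the proportional or the rank-based variant is not stated in this section, so this choice, the small constant `eps`, and the function names are assumptions. The annealing of $β$ from $β_{start}$ is left to the caller because its schedule is not given in the table.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def per_sample(td_errors, batch_size, alpha=0.6, beta=0.5, rng=None):
    """Draw indices non-uniformly and return normalized importance weights."""
    rng = rng or random.Random()
    probs = per_probabilities(td_errors, alpha)
    n = len(probs)
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    # The least likely transition has the largest raw weight; dividing by it
    # keeps every weight in (0, 1].
    max_weight = (n * min(probs)) ** (-beta)
    weights = [((n * probs[i]) ** (-beta)) / max_weight for i in indices]
    return indices, weights
```

The returned weights would multiply the per-sample TD loss, which reduces the bias introduced by sampling high-error transitions more often.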

Table 6. Hyperparameter search space and selected values.

Parameter	Search Range	Selected	Criterion
Learning rate	$[1 \times 10^{- 4}, 1 \times 10^{- 3}]$	$5.0 \times 10^{- 4}$	Convergence stability
Batch size	$\{64, 128, 256, 512\}$	256	Convergence stability
PER $α$	$\{0.4, 0.5, 0.6, 0.7\}$	0.6	Efficiency–stability trade-off
PER $β_{start}$	$\{0.4, 0.5, 0.6\}$	0.5	Efficiency–stability trade-off

Table 7. Performance comparison of D3QN (sparse reward) and DWA-D3QN under simple and complex difficulties. Results over 15 seeds. (Bold values indicate the best performance among compared methods).

Difficulty	Method	Success (%)	Collision (%)	Smoothness	Mean Steps	Min Clearance
Simple	D3QN (sparse)	76.5	21.3	0.555	69.15	0.853
Simple	DWA-D3QN	93.7	6.3	0.726	27.82	1.179
Complex	D3QN (sparse)	79.7	19.5	0.500	45.14	0.816
Complex	DWA-D3QN	94.1	5.9	0.674	23.93	0.981

Table 8. Comparison of ablation experiment results under complex difficulty. Results: mean ± SD over 15 seeds.

Method	Success (%)	Collision (%)	Smoothness	Mean Steps	Min Clearance
Heading only	84.0 ± 16.6	15.5 ± 16.5	0.627 ± 0.085	29.20 ± 13.62	0.862 ± 0.152
Clearance only	88.5 ± 7.5	11.3 ± 7.1	0.615 ± 0.076	29.36 ± 9.94	0.905 ± 0.109
Velocity only	83.3 ± 10.1	16.7 ± 10.1	0.619 ± 0.066	25.68 ± 7.12	0.841 ± 0.108
DWA dense only	62.7 ± 34.7	32.0 ± 31.3	0.573 ± 0.121	89.76 ± 90.82	0.685 ± 0.313
APF-Euclidean	80.3 ± 17.1	19.7 ± 17.1	0.519 ± 0.078	42.05 ± 31.63	0.814 ± 0.186
Full DWA-D3QN	94.1 ± 3.4	5.9 ± 3.4	0.674 ± 0.068	23.93 ± 2.26	0.981 ± 0.120

Table 9. Reward weight sensitivity analysis under complex difficulty. Results: mean ± SD over 15 seeds.

Configuration ( $λ_{1}$ : $λ_{2}$ : $λ_{3}$ )	Success (%)	Collision (%)	Smoothness	Mean Steps
Default (1:1:1)	94.1 ± 3.4	5.9 ± 3.4	0.674 ± 0.068	23.93 ± 2.26
Heading-dominant (2:1:1)	95.6 ± 2.7	4.4 ± 2.7	0.723 ± 0.045	22.47 ± 3.27
Velocity-suppressed (1:1:0)	97.2 ± 3.0	2.8 ± 3.0	0.713 ± 0.073	22.93 ± 2.83

Table 10. Hyperparameter robustness analysis under complex difficulty. Results: mean ± SD over 5 seeds.

Parameter Setting	Success (%)	Collision (%)	Smoothness	Mean Steps
Default ( $α = 5.0 \times 10^{- 4}$ , $B = 256$ , $ε_{end} = 0.02$ )	94.1 ± 3.4	5.9 ± 3.4	0.674 ± 0.068	23.93 ± 2.26
Learning rate $1.0 \times 10^{- 4}$	96.5 ± 2.4	3.5 ± 2.4	0.702 ± 0.081	23.41 ± 3.25
Batch size 128	96.5 ± 3.0	3.5 ± 3.0	0.715 ± 0.067	23.97 ± 5.15
$ε_{end} = 0.05$	93.5 ± 4.5	6.5 ± 4.5	0.673 ± 0.062	23.12 ± 2.20

Table 11. Computational efficiency of DWA-D3QN.

Metric	Value
Training time (200k steps)	≈9.7 min
Training time (10 configurations)	≈1.5 h
Inference latency (mean)	0.63 ms/step
Inference latency (P99)	1.46 ms/step
Policy network parameters	53,386
Target network parameters	53,386

Table 12. Paired deployment-time latency on complex $20 \times 20$ (env_decision protocol, 3000 control steps, 20 maps).

Method	Device	Mean (ms/Step)	P99 (ms/Step)	Notes
Classical DWA (discrete grid)	CPU	0.54	1.15	9-action heuristic scoring
DWA-D3QN (proposed)	CPU	0.47	1.29	Greedy DuelingNet forward
DWA-D3QN (proposed)	GPU	0.77	2.47	GPU deployment
Classical DWA (continuous)	CPU	92.4	130.0	200 $(v, ω)$ × 20-step rollout

Table 13. Continuous DWA latency sensitivity to velocity sampling resolution.

Configuration	Candidates × Steps	Mean (ms)	Slowdown vs. D3QN
Light	50 × 10	10.2	21.8×
Medium	100 × 20	44.6	94.9×
Default	200 × 20	84.7	180.2×
Heavy	400 × 20	169.2	360.1×
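
The slowdown column of Table 13 can be approximately recomputed by dividing each continuous-DWA mean latency by the DWA-D3QN mean. The sketch below assumes that the baseline is the 0.47 ms/step CPU figure from Table 12, which is an inference rather than something the table states. With the rounded published values, the ratios agree with the reported ones to within about 0.1, which would be consistent with the original ratios having been computed from unrounded latencies.

```python
# Assumed baseline: DWA-D3QN mean CPU latency from Table 12 (ms/step).
D3QN_CPU_MEAN_MS = 0.47

# Mean latencies of continuous DWA from Table 13 (ms).
CONTINUOUS_DWA_MEAN_MS = {
    "Light": 10.2,
    "Medium": 44.6,
    "Default": 84.7,
    "Heavy": 169.2,
}


def slowdown(mean_ms, baseline_ms=D3QN_CPU_MEAN_MS):
    """Ratio of a planner's mean latency to the baseline latency."""
    return mean_ms / baseline_ms


for name, mean_ms in CONTINUOUS_DWA_MEAN_MS.items():
    print(f"{name}: {slowdown(mean_ms):.1f}x")
```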

Table 14. Performance comparison under complex difficulty. DRL: mean ± SD over 15 seeds; A*/DWA: mean ± SE over 120 maps. (Bold values indicate the best performance among compared methods).

Method	Success (%)	Collision (%)	Smoothness	Mean Steps	Min Clearance
Value-based DRL
DQN	82.3 ± 21.9	17.7 ± 21.9	0.534 ± 0.082	29.52 ± 13.52	0.830 ± 0.225
DDQN	89.5 ± 5.1	10.5 ± 5.1	0.559 ± 0.081	23.79 ± 3.01	0.906 ± 0.073
Dueling DQN	81.9 ± 21.2	18.1 ± 21.2	0.533 ± 0.070	29.87 ± 16.39	0.826 ± 0.219
Heuristic-guided DRL
APF-DQN	80.3 ± 17.1	19.7 ± 17.1	0.519 ± 0.078	42.05 ± 31.63	0.814 ± 0.186
D3QN (PBRS)	85.7 ± 17.0	14.0 ± 16.0	0.513 ± 0.057	32.90 ± 24.95	0.867 ± 0.168
Classical planning
A*	76.7 ± 3.9	23.3 ± 3.9	0.811 ± 0.013	130.19 ± 18.70	0.767 ± 0.039
DWA	93.3 ± 2.3	6.7 ± 2.3	0.811 ± 0.008	53.14 ± 10.95	0.933 ± 0.023
DWA-D3QN
DWA-D3QN	94.1 ± 3.4	5.9 ± 3.4	0.674 ± 0.068	23.93 ± 2.26	0.981 ± 0.120

Table 15. Quantitative ROS/Gazebo validation results over 50 independent trials.

Metric	Value
Trials	50
Success rate	96.0%
Collision rate	4.0%
Average cross-track error	0.094 m
Max cross-track error	0.27 m
RMS cross-track error	0.108 m

Table 16. Comparison with policy gradient and actor–critic methods under complex difficulty. Results: mean ± SD over 15 seeds. PPO trained for 400k steps with random warmup of 15k steps and entropy annealing from 0.08 to 0.01.

Method	Success (%)	Collision (%)	Smoothness	Mean Steps	Min Clearance
PPO $^{†}$	54.8 ± 28.2	41.3 ± 21.5	0.628 ± 0.077	64.7 ± 104.3	0.593 ± 0.223
SAC	53.9 ± 31.2	26.3 ± 13.7	0.432 ± 0.038	240.8 ± 145.9	0.737 ± 0.137
TD3	56.1 ± 37.2	32.4 ± 31.0	0.562 ± 0.156	151.9 ± 138.7	0.676 ± 0.310
D3QN (PBRS)	85.7 ± 17.0	14.0 ± 16.0	0.513 ± 0.057	32.90 ± 24.95	0.867 ± 0.168
DWA-D3QN	94.1 ± 3.4	5.9 ± 3.4	0.674 ± 0.068	23.93 ± 2.26	0.981 ± 0.120

$^{†}$ PPO: 400k training steps, random warmup 15k steps, entropy annealing from 0.08 to 0.01.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Na, J.; Wang, W. Improved D3QN Intelligent Vehicle Path Planning Guided by the Dynamic Window Approach. Algorithms 2026, 19, 528. https://doi.org/10.3390/a19070528
