Figure 1.
Deep Q-Network (DQN) algorithm framework.
Figure 2.
Overall workflow framework of the DWA-D3QN algorithm.
Figure 3.
D3QN algorithm training framework with Prioritized Experience Replay.
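
D3QN combines a dueling value/advantage head with the Double DQN target. As a minimal sketch of those two pieces (function and argument names are illustrative and not taken from the paper's code; the mean-subtracted aggregation is the commonly used form and is assumed here):

```python
from typing import List, Sequence


def dueling_q_values(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


def double_dqn_target(
    reward: float,
    done: bool,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
) -> float:
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

Decoupling action selection from action evaluation is what distinguishes this target from the plain DQN target, which takes the maximum over the target network's own estimates.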
Figure 4.
Workflow of the Prioritized Experience Replay mechanism.
Figure 5.
Transformation framework from DWA dynamic window evaluation criteria to dense reward function.
Figure 6.
Kinematic model of an Ackermann-steered vehicle (rear-axle reference). The model is described by the rear-axle center, the heading angle, the front steering angle, the wheelbase L, the turning radius to the instantaneous center of rotation (ICR), and the yaw rate.
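
The quantities named in this caption are related by the standard rear-axle kinematic bicycle model. The sketch below is one possible discretization (forward Euler); the function names, argument names, and the time step `dt` are assumptions for illustration and do not come from the paper.

```python
import math
from typing import Tuple


def ackermann_step(
    x: float, y: float, heading: float,
    speed: float, steer: float, wheelbase: float, dt: float,
) -> Tuple[float, float, float, float]:
    """Advance the rear-axle center by one Euler step.

    Returns the new (x, y, heading) and the yaw rate used for the step.
    """
    yaw_rate = speed * math.tan(steer) / wheelbase  # omega = v * tan(delta) / L
    x_next = x + speed * math.cos(heading) * dt
    y_next = y + speed * math.sin(heading) * dt
    heading_next = heading + yaw_rate * dt
    return x_next, y_next, heading_next, yaw_rate


def turning_radius(steer: float, wheelbase: float) -> float:
    """Signed distance from the rear-axle center to the ICR: R = L / tan(delta)."""
    if abs(steer) < 1e-9:
        return math.inf  # driving straight: the ICR is at infinity
    return wheelbase / math.tan(steer)
```

With these definitions the yaw rate equals `speed / turning_radius(steer, wheelbase)` whenever the steering angle is nonzero.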
Figure 7.
Training loss curve of DWA-D3QN under complex difficulty (learning rate , batch size 256, 200k steps). The loss converges rapidly within the first steps and remains stable thereafter, confirming the suitability of the selected hyperparameters.
Figure 8.
Representative grid map examples at the two difficulty levels. (a) Simple difficulty (, ); (b) complex difficulty (, ).
Figure 9.
Scenario A: D3QN (left) fails and collides; DWA-D3QN (right) reaches the goal with a smooth trajectory.
Figure 10.
Scenario B: D3QN (left) oscillates and fails; DWA-D3QN (right) produces a smooth, efficient path.
Figure 11.
Scenario C: D3QN (left) shows detours; DWA-D3QN (right) produces a shorter, more compact trajectory.
Figure 12.
Average episode reward convergence curves for D3QN and DWA-D3QN under complex difficulty.
Figure 13.
Success rate convergence curves for D3QN and DWA-D3QN under complex difficulty.
Figure 14.
Task completion capability under complex difficulty. Error bars denote ± one standard deviation. DWA-D3QN achieves the highest success rate and lowest collision rate with minimal variance.
Figure 15.
Trajectory quality and efficiency under complex difficulty. Error bars denote ± one standard deviation. DWA-D3QN achieves the best smoothness and clearance with competitive step efficiency.
Figure 16.
ROS/Gazebo simulation of the DWA-D3QN policy. Quantitative results over 50 trials: success rate 96.0%, collision rate 4.0%.
Figure 17.
Rviz trajectory visualization of the DWA-D3QN policy in the ROS/Gazebo environment. The green curve shows the robot’s actual trajectory from start to goal, illustrating smooth navigation with dynamic obstacle avoidance.
Table 1.
The 15-D observation vector specification (, agent position , goal g).
| Index | Symbol | Definition | Range |
|---|---|---|---|
| 0–1 | | | |
| 2–3 | | | |
| 4 | | | |
| 5–6 | | from | |
| 7–14 | | 8-ray LiDAR: | |
Table 2.
Action space specification (nine discrete actions).
| Action index | Name | Grid Offset (x, y) |
|---|---|---|
| 0 | Stay | |
| 1 | Up | |
| 2 | Down | |
| 3 | Left | |
| 4 | Right | |
| 5–8 | Four diagonals | |
Table 3.
Environment transition and termination conditions. The symbol → denotes “triggers” or “leads to”.
| Step | Description |
|---|---|
| 1 | Agent position update: , similarly |
| 2 | Dynamic obstacles: reciprocate between start–end points, 1 cell/step along Manhattan direction |
| 3 | Collision: agent position coincides with any obstacle cell → episode terminates |
| 4 | Success: (Chebyshev distance) → episode terminates |
| 5 | Timeout: step count ≥ 600 → episode truncates |
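
The transition and termination rules in Table 3 can be sketched as a single step function. Several details are assumptions made only for illustration: the action-to-offset mapping (including which y direction counts as "Up"), clamping the position to the map bounds, the `goal_radius` threshold, the check order, and all names. Only the 600-step timeout and the use of the Chebyshev distance come from the table. Dynamic obstacles are assumed to have been moved before the call.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical offsets for the nine discrete actions (stay, 4 axis moves, 4 diagonals).
ACTION_OFFSETS: Dict[int, Cell] = {
    0: (0, 0), 1: (0, -1), 2: (0, 1), 3: (-1, 0), 4: (1, 0),
    5: (-1, -1), 6: (1, -1), 7: (-1, 1), 8: (1, 1),
}
MAX_STEPS = 600  # timeout threshold stated in Table 3


def chebyshev(a: Cell, b: Cell) -> int:
    """Chebyshev (L-infinity) distance between two grid cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def step(
    pos: Cell, action: int, goal: Cell, obstacles: Set[Cell],
    step_count: int, grid_size: int, goal_radius: int = 1,
) -> Tuple[Cell, str]:
    """Apply one action and classify the outcome.

    Returns the new position and one of "collision", "success",
    "timeout" (truncation), or "running".
    """
    dx, dy = ACTION_OFFSETS[action]
    # Assumed: positions are clamped to the square map [0, grid_size - 1].
    x = min(max(pos[0] + dx, 0), grid_size - 1)
    y = min(max(pos[1] + dy, 0), grid_size - 1)
    new_pos = (x, y)
    if new_pos in obstacles:
        return new_pos, "collision"
    if chebyshev(new_pos, goal) <= goal_radius:
        return new_pos, "success"
    if step_count + 1 >= MAX_STEPS:
        return new_pos, "timeout"
    return new_pos, "running"
```

Keeping "timeout" distinct from "collision" and "success" mirrors the table's distinction between an episode that terminates and one that is truncated, which matters when bootstrapping the value target.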
Table 4.
Reward coefficient specification.
| Term | Formula/Description | Value |
|---|---|---|
| Base terms (shared by all methods) |
| Constant per-step penalty | |
| (PBRS) | | , |
| (APF, APF-DQN only) | | |
| | |
| | |
| Return to | |
| | |
| Success | | |
| Collision | Occupies obstacle cell | |
| DWA shaping (DWA-D3QN only) |
| Schedule | | |
| Global multiplier | | |
| Heading | Fast: , Safe: | |
| Clearance | Fast: , Safe: | |
| Velocity | Fast: , Safe: | |
| Post-processing (all methods) |
| Scaling | | |
| Clipping | | |
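
Table 4 lists a potential-based reward shaping (PBRS) term and a scaling and clipping post-processing stage. The sketch below shows the standard PBRS form, F = γΦ(s') − Φ(s), followed by scale-then-clip. The potential function (negative Euclidean distance to the goal), the order of scaling and clipping, and all names are assumptions; the coefficient values are not recoverable from the table, so they are left as required arguments.

```python
import math
from typing import Tuple

Cell = Tuple[int, int]


def potential(pos: Cell, goal: Cell) -> float:
    """Hypothetical potential: negative Euclidean distance to the goal."""
    return -math.hypot(goal[0] - pos[0], goal[1] - pos[1])


def pbrs_term(pos: Cell, next_pos: Cell, goal: Cell, gamma: float = 0.99) -> float:
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_pos, goal) - potential(pos, goal)


def postprocess(reward: float, scale: float, clip: float) -> float:
    """Scale the summed reward, then clip it to [-clip, clip] (assumed order)."""
    scaled = reward * scale
    return max(-clip, min(clip, scaled))
```

For example, `pbrs_term((0, 0), (1, 0), (3, 0))` is positive because the move reduces the distance to the goal, while the reverse move yields a negative term.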
Table 5.
Hyperparameter settings for the DWA-D3QN algorithm.
| Parameter Name | Parameter Value |
|---|---|
| Learning rate | |
| Optimizer | Adam |
| Reward discount factor | 0.99 |
| Batch size | 256 |
| Experience replay buffer size | 120,000 |
| Priority coefficient (PER) | 0.6 |
| Importance sampling initial coefficient (PER) | 0.5 |
| Exploration attenuation | , |
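
To illustrate how the two PER coefficients in Table 5 are typically used, the sketch below computes proportional sampling probabilities with priority exponent 0.6 and importance-sampling weights with initial exponent 0.5. This follows the usual proportional PER formulation and is not taken from the paper's implementation; the function names are hypothetical, priorities are assumed to be strictly positive, and the annealing schedule of the importance-sampling exponent is not shown.

```python
from typing import List, Sequence

ALPHA = 0.6  # priority coefficient (Table 5)
BETA0 = 0.5  # initial importance-sampling coefficient (Table 5)


def sampling_probabilities(priorities: Sequence[float], alpha: float = ALPHA) -> List[float]:
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs: Sequence[float], beta: float = BETA0) -> List[float]:
    """w_i = (N * P(i))^(-beta), normalized by the largest weight so w_i <= 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]
```

Transitions with larger priority are sampled more often and receive smaller weights, which partially corrects the bias introduced by non-uniform sampling.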
Table 6.
Hyperparameter search space and selected values.
| Parameter | Search Range | Selected | Criterion |
|---|---|---|---|
| Learning rate | | | Convergence stability |
| Batch size | | 256 | Convergence stability |
| PER | | 0.6 | Efficiency–stability trade-off |
| PER | | 0.5 | Efficiency–stability trade-off |
Table 7.
Performance comparison of D3QN (sparse reward) and DWA-D3QN under simple and complex difficulties. Results over 15 seeds. (Bold values indicate the best performance among compared methods).
| Difficulty | Method | Success (%) | Collision (%) | Smoothness | Mean Steps | Min Clearance |
|---|---|---|---|---|---|---|
| Simple | D3QN (sparse) | 76.5 | 21.3 | 0.555 | 69.15 | 0.853 |
| Simple | DWA-D3QN | 93.7 | 6.3 | 0.726 | 27.82 | 1.179 |
| Complex | D3QN (sparse) | 79.7 | 19.5 | 0.500 | 45.14 | 0.816 |
| Complex | DWA-D3QN | 94.1 | 5.9 | 0.674 | 23.93 | 0.981 |
Table 8.
Comparison of ablation experiment results under complex difficulty. Results: mean ± SD over 15 seeds.
| Method | Success (%) | Collision (%) | Smoothness | Mean Steps | Min Clearance |
|---|---|---|---|---|---|
| Heading only | 84.0 ± 16.6 | 15.5 ± 16.5 | 0.627 ± 0.085 | 29.20 ± 13.62 | 0.862 ± 0.152 |
| Clearance only | 88.5 ± 7.5 | 11.3 ± 7.1 | 0.615 ± 0.076 | 29.36 ± 9.94 | 0.905 ± 0.109 |
| Velocity only | 83.3 ± 10.1 | 16.7 ± 10.1 | 0.619 ± 0.066 | 25.68 ± 7.12 | 0.841 ± 0.108 |
| DWA dense only | 62.7 ± 34.7 | 32.0 ± 31.3 | 0.573 ± 0.121 | 89.76 ± 90.82 | 0.685 ± 0.313 |
| APF-Euclidean | 80.3 ± 17.1 | 19.7 ± 17.1 | 0.519 ± 0.078 | 42.05 ± 31.63 | 0.814 ± 0.186 |
| Full DWA-D3QN | 94.1 ± 3.4 | 5.9 ± 3.4 | 0.674 ± 0.068 | 23.93 ± 2.26 | 0.981 ± 0.120 |
Table 9.
Reward weight sensitivity analysis under complex difficulty. Results: mean ± SD over 15 seeds.
| Configuration (heading:clearance:velocity weights) | Success (%) | Collision (%) | Smoothness | Mean Steps |
|---|---|---|---|---|
| Default (1:1:1) | 94.1 ± 3.4 | 5.9 ± 3.4 | 0.674 ± 0.068 | 23.93 ± 2.26 |
| Heading-dominant (2:1:1) | 95.6 ± 2.7 | 4.4 ± 2.7 | 0.723 ± 0.045 | 22.47 ± 3.27 |
| Velocity-suppressed (1:1:0) | 97.2 ± 3.0 | 2.8 ± 3.0 | 0.713 ± 0.073 | 22.93 ± 2.83 |
Table 10.
Hyperparameter robustness analysis under complex difficulty. Results: mean ± SD over 5 seeds.
| Parameter Setting | Success (%) | Collision (%) | Smoothness | Mean Steps |
|---|---|---|---|---|
| Default (, , ) | 94.1 ± 3.4 | 5.9 ± 3.4 | 0.674 ± 0.068 | 23.93 ± 2.26 |
| Learning rate | 96.5 ± 2.4 | 3.5 ± 2.4 | 0.702 ± 0.081 | 23.41 ± 3.25 |
| Batch size 128 | 96.5 ± 3.0 | 3.5 ± 3.0 | 0.715 ± 0.067 | 23.97 ± 5.15 |
| | 93.5 ± 4.5 | 6.5 ± 4.5 | 0.673 ± 0.062 | 23.12 ± 2.20 |
Table 11.
Computational efficiency of DWA-D3QN.
| Metric | Value |
|---|---|
| Training time (200k steps) | ≈9.7 min |
| Training time (10 configurations) | ≈1.5 h |
| Inference latency (mean) | 0.63 ms/step |
| Inference latency (P99) | 1.46 ms/step |
| Policy network parameters | 53,386 |
| Target network parameters | 53,386 |
Table 12.
Paired deployment-time latency under complex difficulty (env_decision protocol, 3000 control steps, 20 maps).
| Method | Device | Mean (ms/Step) | P99 (ms/Step) | Notes |
|---|---|---|---|---|
| Classical DWA (discrete grid) | CPU | 0.54 | 1.15 | 9-action heuristic scoring |
| DWA-D3QN (proposed) | CPU | 0.47 | 1.29 | Greedy DuelingNet forward |
| DWA-D3QN (proposed) | GPU | 0.77 | 2.47 | GPU deployment |
| Classical DWA (continuous) | CPU | 92.4 | 130.0 | 200 × 20-step rollout |
Table 13.
Continuous DWA latency sensitivity to velocity sampling resolution.
| Configuration | Candidates × Steps | Mean (ms) | Slowdown vs. D3QN |
|---|---|---|---|
| Light | 50 × 10 | 10.2 | 21.8× |
| Medium | 100 × 20 | 44.6 | 94.9× |
| Default | 200 × 20 | 84.7 | 180.2× |
| Heavy | 400 × 20 | 169.2 | 360.1× |
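
The slowdown column can be approximately reproduced by dividing each continuous-DWA mean latency by the DWA-D3QN CPU mean of 0.47 ms/step from Table 12. Using that value as the baseline is an assumption, since the table does not state which D3QN figure was used. The names below are illustrative.

```python
# Assumed baseline: DWA-D3QN mean CPU latency from Table 12 (ms/step).
D3QN_CPU_MEAN_MS = 0.47

# Mean continuous-DWA latencies from Table 13 (ms).
CONTINUOUS_DWA_MEAN_MS = {
    "Light": 10.2,
    "Medium": 44.6,
    "Default": 84.7,
    "Heavy": 169.2,
}


def slowdown(dwa_ms: float, baseline_ms: float = D3QN_CPU_MEAN_MS) -> float:
    """Ratio of classical continuous-DWA latency to the learned policy's latency."""
    return dwa_ms / baseline_ms


ratios = {name: slowdown(ms) for name, ms in CONTINUOUS_DWA_MEAN_MS.items()}
```

Recomputing from the rounded table entries gives roughly 21.7, 94.9, 180.2, and 360.0, so the small differences from the reported 21.8× and 360.1× are presumably due to rounding of the published means.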
Table 14.
Performance comparison under complex difficulty. DRL: mean ± SD over 15 seeds; A*/DWA: mean ± SE over 120 maps. (Bold values indicate the best performance among compared methods).
| Method | Success (%) | Collision (%) | Smoothness | Mean Steps | Min Clearance |
|---|---|---|---|---|---|
| Value-based DRL |
| DQN | 82.3 ± 21.9 | 17.7 ± 21.9 | 0.534 ± 0.082 | 29.52 ± 13.52 | 0.830 ± 0.225 |
| DDQN | 89.5 ± 5.1 | 10.5 ± 5.1 | 0.559 ± 0.081 | 23.79 ± 3.01 | 0.906 ± 0.073 |
| Dueling DQN | 81.9 ± 21.2 | 18.1 ± 21.2 | 0.533 ± 0.070 | 29.87 ± 16.39 | 0.826 ± 0.219 |
| Heuristic-guided DRL |
| APF-DQN | 80.3 ± 17.1 | 19.7 ± 17.1 | 0.519 ± 0.078 | 42.05 ± 31.63 | 0.814 ± 0.186 |
| D3QN (PBRS) | 85.7 ± 17.0 | 14.0 ± 16.0 | 0.513 ± 0.057 | 32.90 ± 24.95 | 0.867 ± 0.168 |
| Classical planning |
| A* | 76.7 ± 3.9 | 23.3 ± 3.9 | 0.811 ± 0.013 | 130.19 ± 18.70 | 0.767 ± 0.039 |
| DWA | 93.3 ± 2.3 | 6.7 ± 2.3 | 0.811 ± 0.008 | 53.14 ± 10.95 | 0.933 ± 0.023 |
| DWA-D3QN |
| DWA-D3QN | 94.1 ± 3.4 | 5.9 ± 3.4 | 0.674 ± 0.068 | 23.93 ± 2.26 | 0.981 ± 0.120 |
Table 15.
Quantitative ROS/Gazebo validation results over 50 independent trials.
| Metric | Value |
|---|---|
| Trials | 50 |
| Success rate | 96.0% |
| Collision rate | 4.0% |
| Average cross-track error | 0.094 m |
| Max cross-track error | 0.27 m |
| RMS cross-track error | 0.108 m |
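
The three cross-track statistics in Table 15 can be computed from a sequence of per-sample lateral deviations as sketched below. How the deviations were sampled and whether they were signed is not stated, so the input format and the function name are assumptions.

```python
import math
from typing import Sequence, Tuple


def cross_track_stats(errors_m: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean absolute, maximum absolute, RMS) cross-track error in meters.

    `errors_m` holds signed lateral deviations from the reference path.
    """
    abs_err = [abs(e) for e in errors_m]
    n = len(abs_err)
    mean_err = sum(abs_err) / n
    max_err = max(abs_err)
    rms_err = math.sqrt(sum(e * e for e in abs_err) / n)
    return mean_err, max_err, rms_err
```

The RMS value is never smaller than the mean absolute value, which is consistent with the reported 0.108 m RMS against a 0.094 m average.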
Table 16.
Comparison with policy gradient and actor–critic methods under complex difficulty. Results: mean ± SD over 15 seeds. PPO trained for 400k steps with random warmup of 15k steps and entropy annealing from 0.08 to 0.01.
| Method | Success (%) | Collision (%) | Smoothness | Mean Steps | Min Clearance |
|---|---|---|---|---|---|
| PPO † | 54.8 ± 28.2 | 41.3 ± 21.5 | 0.628 ± 0.077 | 64.7 ± 104.3 | 0.593 ± 0.223 |
| SAC | 53.9 ± 31.2 | 26.3 ± 13.7 | 0.432 ± 0.038 | 240.8 ± 145.9 | 0.737 ± 0.137 |
| TD3 | 56.1 ± 37.2 | 32.4 ± 31.0 | 0.562 ± 0.156 | 151.9 ± 138.7 | 0.676 ± 0.310 |
| D3QN (PBRS) | 85.7 ± 17.0 | 14.0 ± 16.0 | 0.513 ± 0.057 | 32.90 ± 24.95 | 0.867 ± 0.168 |
| DWA-D3QN | 94.1 ± 3.4 | 5.9 ± 3.4 | 0.674 ± 0.068 | 23.93 ± 2.26 | 0.981 ± 0.120 |