Article
Peer-Review Record

Dual-Priority Delayed Deep Double Q-Network (DPD3QN): A Dueling Double Deep Q-Network with Dual-Priority Experience Replay for Autonomous Driving Behavior Decision-Making

Algorithms 2025, 18(5), 291; https://doi.org/10.3390/a18050291
by Shuai Li 1, Peicheng Shi 1,*, Aixi Yang 2, Heng Qi 3 and Xinlong Dong 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 3 April 2025 / Revised: 15 May 2025 / Accepted: 16 May 2025 / Published: 19 May 2025
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

General Comment:
The paper proposes DPD3QN, a behaviour-decision algorithm for autonomous vehicles that combines (i) a duelling-DDQN and (ii) a new “dual-priority” experience-replay scheme that mixes TD-error-based PER with a handcrafted, feature-based segment-priority sampler. Experiments show faster convergence and higher success rates than vanilla DQN and DDQN.

However, the novelty of this manuscript is minimal. Duelling + DDQN (D3QN) and TD-error PER are established; the extra “segment priority” is a manual weighting heuristic (distance, speed ranges, hard-coded weights). A formal analysis is needed to show this component’s benefit independently of PER and D3QN.

In the case study, only DQN and DDQN are compared. Please add more advanced baselines reviewed in this paper, such as D3QN and models with DPER, and report the same metrics.


Specific comment:
1. The lack of line numbers makes it difficult to pinpoint the location where I would like to leave a comment.
2. Abstract: grammar issue "Compared with the currently popular DQN and DDQN algorithms, the success rates of this algorithm in the challenging."
3. Page 2, these sentences read strange: However, the experience replay of algorithms such as DQN, DDQN, and Dueling DQN is uniform random sampling. However, the importance of experience samples varies.
4. Page 6: "Here, denotes the parameters..." Something is missing before 'denotes'.
5. Page 6: At the beginning of Section 3, 'To address the issues mentioned above...'. I do not find issues mentioned anywhere close to this statement.
6. Page 6: At the beginning of Section, "Dueling Double Deep Q Network (DDQN)". Is it D3QN?
7. Page 7, first line, 'a hierarchical structure' could be described in detail.
8. Figure 2 is not clear.
9. Page 7, there is a reference error at the bottom.
10. Equation 19: p_i is not explained.
11. Equations 21 and 22: the functions g and h are not explained.
12. Equation 23, are the alpha and beta the same as in equations 19 and 20?

Author Response

Comments 1: The lack of line numbers makes it difficult to pinpoint the location where I would like to leave a comment.

Response 1: Thank you for pointing this out. I agree with this comment. Therefore, I have added line numbers.

 

Comments 2: Abstract: grammar issue "Compared with the currently popular DQN and DDQN algorithms, the success rates of this algorithm in the challenging."

Response 2: Agree. I have corrected the grammar in the mentioned section. The revision is in the 1st page of the manuscript, abstract section, lines 23 and 24. “Compared with the currently popular DQN and DDQN algorithms, this algorithm achieves higher success rates in challenging scenarios.”

 

Comments 3: Page 2, these sentences read strange: However, the experience replay of algorithms such as DQN, DDQN, and Dueling DQN is uniform random sampling. However, the importance of experience samples varies.

Response 3: Agree. I have corrected the grammar in the mentioned section. The revision is in the 2nd page of the manuscript, fourth paragraph, lines 75 to 77. “Although algorithms such as DQN, DDQN, and Dueling DQN use uniform random sampling for experience replay, the importance of experience samples can vary significantly.”

 

Comments 4: Page 6: "Here, denotes the parameters..." Something is missing before 'denotes'.

Response 4: Agree. I have corrected the grammar in the mentioned section. The revision is in the 6th page of the manuscript, fifth paragraph, lines 249 to 250. “Here, θ denotes the parameters responsible for processing features in the input layer,”

 

Comments 5: Page 6: At the beginning of Section 3, 'To address the issues mentioned above...'. I do not find issues mentioned anywhere close to this statement.

Response 5: Agree. I have corrected the ambiguity in the reference of this sentence. The revision is in the 7th page of the manuscript, second paragraph, lines 271 to 272. “To enhance decision-making efficiency and address the limitations of uniform experience replay in existing algorithms.”

 

Comments 6: Page 6: At the beginning of Section, "Dueling Double Deep Q Network (DDQN)". Is it D3QN?

Response 6: Agree. I have corrected this abbreviation error. The revision is in the 7th page of the manuscript, second paragraph, lines 272 to 274. “This section proposes a Dueling Double Deep Q Network (D3QN) integrated with a dual-priority experience replay mechanism (illustrated in Fig. 2),”

 

Comments 7: Page 7, first line, 'a hierarchical structure' could be described in detail.

Response 7: Agree. I have provided a more detailed explanation of this part. The revision is in the 7th page of the manuscript, second paragraph, lines 276 to 280. “In this structure, high-level modules determine general driving intentions—such as lane keeping, lane changing, or overtaking—while low-level modules handle specific motion control tasks like acceleration, deceleration, and steering. This layered design improves modularity and enables more realistic and flexible traffic behavior modeling.”
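As an illustrative aside (not code from the manuscript), the following minimal Python sketch shows one way such a layered structure could be organized, with a high-level policy choosing a driving intention and a low-level controller turning it into a motion command; all names here are hypothetical.

from enum import Enum

class Intention(Enum):          # high-level module output
    LANE_KEEP = 0
    LANE_CHANGE_LEFT = 1
    LANE_CHANGE_RIGHT = 2
    OVERTAKE = 3

class MotionCommand(Enum):      # low-level module output
    ACCELERATE = 0
    DECELERATE = 1
    STEER_LEFT = 2
    STEER_RIGHT = 3

def decide(state, high_level_policy, low_level_controller):
    # High-level module: choose a general driving intention.
    intention = high_level_policy(state)
    # Low-level module: translate the intention into a concrete motion command.
    command = low_level_controller(state, intention)
    return intention, command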

 

Comments 8:  Figure 2 is not clear.

Response 8: Agree. I have replaced Figure 2 with a clearer version and enlarged it. The figure was updated on the 7th page of the manuscript, after the second paragraph, line 287.

 

Comments 9:  Page 7, there is a reference error at the bottom.

Response 9: Agree. I have modified the citation format and placed the referenced text before Figure 2. The referenced passage has been placed on page 8, paragraph 1, lines 296 to 299.

 

Comments 10: Equation 19: p_i is not explained.

Response 10: Agree. I have added an explanation of the parameter p_i in the formula. The revision is in the 9th page of the manuscript, fifth paragraph, lines 362 to 365. “which represents the priority of the i-th transition—typically defined based on the magnitude of its temporal-difference (TD) error, such as p_i = |δ_i| + ε, where δ_i is the TD error and ε is a small positive constant to ensure a non-zero sampling probability.”
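For readers unfamiliar with prioritized replay, the short Python sketch below illustrates the generic TD-error priority described in this revision and the standard PER-style conversion to sampling probabilities. It is a simplified illustration rather than the paper's exact Equation (19), which additionally includes a bias term b, and per_alpha below is the usual PER exponent, not the α of Equation (23).

import numpy as np

def td_error_priority(td_errors, eps=0.01):
    # p_i = |delta_i| + eps: larger TD errors get higher priority;
    # eps keeps every transition's sampling probability strictly positive.
    return np.abs(td_errors) + eps

def sampling_probabilities(priorities, per_alpha=0.6):
    # Standard PER form: P(i) = p_i^alpha / sum_k p_k^alpha.
    scaled = np.power(priorities, per_alpha)
    return scaled / scaled.sum()

# Example: three stored transitions with TD errors 0.5, 2.0 and 0.1
p = td_error_priority(np.array([0.5, 2.0, 0.1]))
print(sampling_probabilities(p))  # the largest-error transition is sampled most often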

 

Comments 11: Equations 21 and 22: the functions g and h are not explained.

Response 11: Agree. I have added explanations of the functions g and h in the formulas. The revisions are on the 10th and 11th pages of the manuscript, sixth and first paragraphs, lines 409 to 413 and 420 to 422. “The function g() is a mapping function that combines the number of samples and the variability in position to determine the importance of the position segment. It typically increases with one of these quantities and may decrease or be normalized by the other, thereby balancing data abundance against scene dynamics.”

 “The function h() is similar to g(), serving as a mapping from the speed variability and the segment size to a scalar importance value, emphasizing segments with abrupt speed changes or rare events.”
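Since g() and h() are described only qualitatively, the following is a speculative sketch of one possible instantiation, with hypothetical signatures and a tunable constant kappa that does not appear in the manuscript; the authoritative definitions are those given in Equations (21) and (22).

import numpy as np

def g(n_samples, pos_variability, kappa=1.0):
    # Hypothetical position-segment importance: grows with positional
    # variability (scene dynamics) and is normalized by the sample count,
    # so rare but dynamic segments are not drowned out by abundant, static ones.
    return (1.0 + kappa * pos_variability) / np.log1p(1.0 + n_samples)

def h(speed_variability, n_samples, kappa=1.0):
    # Hypothetical speed-segment importance: emphasizes segments with
    # abrupt speed changes or rare events (high variability, few samples).
    return (1.0 + kappa * speed_variability) / np.log1p(1.0 + n_samples)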

 

Comments 12: Equation 23, are the alpha and beta the same as in equations 19 and 20?

Response 12: Thank you for your insightful comment regarding the roles of α and β in Equation (23). We would like to clarify that the α and β in Equation (23) are not the same as those used in Equations (19) and (20). In this context, α and β serve as hyperparameters that control the influence of two different priority components: the TD error priority and the segment-based priority (derived from position and velocity information). To avoid confusion, we have revised the corresponding explanation in the manuscript for clarity on the 11th page of the manuscript, 4th paragraph, lines 433 and 436 to 438. “respectively, α and β are hyperparameters that control the influence of their respective weights. Based on the dual priority, we then perform weighted sampling to ensure that the model pays more attention to important samples.”
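For illustration only, a minimal sketch of such a weighted combination, assuming an additive form with made-up default values for α and β (the precise form and values are those of Equation (23) in the manuscript):

def dual_priority(td_priority, segment_priority, alpha=0.7, beta=0.3):
    # Hypothetical weighted combination of the TD-error priority and the
    # segment-based priority; alpha + beta is not required to equal 1,
    # as clarified later in this record.
    return alpha * td_priority + beta * segment_priority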

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

In this paper, the authors propose a novel DPD3QN, i.e., a dueling double deep Q-network based on dual-priority experience replay. The approach combines an experience replay mechanism with a TD-error priority and a segmented sampling priority. This significantly improves the learning frequency and quality of key states by assigning higher priority based on TD error and adjusting state importance through segmented sampling. In this way, the dual priority effectively addresses low sample utilization and makes more effective use of sparse rewards. This is the main novelty proposed by the authors and the main scientific contribution. When it comes to the paper:

-the abstract is adequate and concise;

-the chapters have a logical and correct order;

-the proposed approach is well explained and demonstrated; it is also the main strength of the paper.

However, there is a space for further improvement:

-The title should be better connected with the paper, i.e., the area of application should be mentioned in this particular case;

-At the end of the Intro section, a structure of the paper should be provided, with a detailed explanation regarding chapters in the paper;

-Related works should be improved by adding more relevant and up-to-date research;

-All abbreviations should be explained before appearing in the text;

-Simulation Results Analysis should be extended, and further discussion is needed;

-Conclusions should be better connected with obtained results, also, limitations should be discussed in a more detailed manner. Besides, directions for future research should be provided.  

Author Response

Comments 1: Title should be better connected with the paper, i.e. area of application should be mentioned, in this particular case.

Response 1: Thank you to the reviewers for your valuable feedback. I have revised the title to "DPD3QN: A Dueling Double Deep Q Network with Dual-Priority Experience Replay for Autonomous Driving Behavior Decision-Making".

 

Comments 2: At the end of the Intro section, a structure of the paper should be provided, with a detailed explanation regarding chapters in the paper.

Response 2: Thank you to the reviewers for your valuable feedback. I have provided a detailed explanation of the chapter contents in the concluding section of the Introduction. The added content is located in the fourth paragraph of page 3, lines 113 to 118. “The remainder of this paper is organized as follows. Section 2 introduces related work in the field of deep reinforcement learning and autonomous driving. Section 3 presents the proposed DPD3QN method, including the network structure and dual-priority experience replay mechanism. Section 4 describes the simulation environment and experimental setup. Section 5 provides the simulation results and comparative performance analysis. Finally, Section 6 concludes the study and outlines future research directions.”

 

Comments 3: Related works should be improved by adding more relevant and up-to-date research.

Response 3: Thank you to the reviewers for your valuable feedback. The Related Work section primarily focuses on methods relevant to this study. To strengthen the discussion, several recent research findings have been incorporated into the Introduction and the methodological comparison section of Chapter 3. The additions are located in the second paragraph on page 2 (lines 61–62) and the third paragraph on page 10 (lines 386–387), respectively: “Noisy Deep Q-Network (Noisy DQN)” and “Deep Deterministic Policy Gradient (DDPG)”.

 

Comments 4: All abbreviations should be explained before appearing in the text.

Response 4: Thank you to the reviewers for your valuable feedback. All abbreviations in the manuscript have been thoroughly checked and are properly defined upon their first appearance.

 

Comments 5: Simulation Results Analysis should be extended, and further discussion is needed.

Response 5: Thank you to the reviewers for your valuable feedback. In the experimental simulation section of Chapter 4, the experimental setup has been enriched, with additions located on page 14, first paragraph (lines 538–541). “The simulation environment is shown in Fig. 5. The experimental setup adopts a dual-agent system, consisting of a self-driving vehicle and several environment vehicles that mimic human driving behavior. All vehicles operate based on the previously mentioned motion model and environmental configuration, aiming to study the interactive dynamics under standardized highway conditions.” Additionally, in the adaptability testing part of the experimental results in Chapter 5, a three-lane highway simulation figure has been incorporated, added on page 19, first paragraph (lines 708–709).

 

Comments 6: Conclusions should be better connected with obtained results, also, limitations should be discussed in a more detailed manner. Besides, directions for future research should be provided.

Response 6: Thank you to the reviewers for your valuable feedback. The additional content in the conclusion section is located on page 21, paragraph 8, lines 791 to 799. “The realism of the simulation environment is limited, and the generalization capability to real-world scenarios still needs to be validated. There remain gaps compared to real road environments in terms of perception errors, behavioral uncertainty, and the complexity of traffic rules. Future research could incorporate higher-fidelity simulation platforms (such as CARLA or LGSVL) or real-world driving data for further validation. The modeling capability for multi-agent interaction needs to be improved. The current method primarily focuses on single-vehicle decision-making tasks and does not fully account for the dynamic impact of complex games and collaborative behaviors among multiple vehicles.”

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

I find this article to be in a more or less publishable condition, such that the introduction, methodology, results, and conclusion are very well presented. Hence, the article as a whole makes a nice contribution to science. Nevertheless, I notice that there is no discussion part included in the article highlighting such issues as theoretical implications, practical implications, limitations, and recommendations for future research. 

Minor point: On p. 7 there is a paragraph starting with Chinese characters. Since this article addresses an international audience, I suggest that a translation into English is made, or else that another solution to the problem is found.

Author Response

Comments 1: I find this article to be in a more or less publishable condition, such that the introduction, methodology, results, and conclusion are very well presented. Hence, the article as a whole makes a nice contribution to science. Nevertheless, I notice that there is no discussion part included in the article highlighting such issues as theoretical implications, practical implications, limitations, and recommendations for future research. 

Minor point: On p. 7 there is a paragraph starting with Chinese characters. Since this article addresses an international audience, I suggest that a translation into English is made, or else that another solution to the problem is found.

Response 1: We sincerely thank the reviewer for the positive feedback and constructive suggestions. We have added content to the conclusion section addressing the practical implications, research limitations, and recommendations for future work. The realism of the simulation environment is limited, and the generalization capability to real-world scenarios still needs to be validated. There remain gaps compared to real road environments in terms of perception errors, behavioral uncertainty, and the complexity of traffic rules. Future research could incorporate higher-fidelity simulation platforms (such as CARLA or LGSVL) or real-world driving data for further validation. The modeling capability for multi-agent interaction needs to be improved. The current method primarily focuses on single-vehicle decision-making tasks and does not fully account for the dynamic impact of complex games and collaborative behaviors among multiple vehicles. The revision is in the 21st page of the manuscript, eighth paragraph, lines 791 to 799. In addition, we have thoroughly reviewed the characters on page 7 and corrected the Chinese characters accordingly.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

This paper proposes DPD3QN, which integrates a dueling double deep Q-network (D3QN) with a dual-priority experience replay mechanism, and evaluates it in a simulated four-lane highway environment using OpenAI Gym's Highway-env. The authors report faster convergence, higher episode rewards, longer driving distances, and improved success rates compared to DQN and DDQN. However, standard deviations or confidence intervals are not shown in Figures 6-9.

  • Novelty and Ablation. The combination of dueling, double-Q, and prioritized replay is sensible, but it is not clear which component contributes most. An ablation study with existing variants of prioritized replay would be very helpful.
  • Hyperparameter Sensitivity. The paper introduces several weighting factors (α, β, ε) for the dual-priority calculation but does not justify their chosen values or report sensitivity analyses. A brief sweep showing performance versus α and β would help readers understand robustness.
  • Clarity of Dual-Priority Formula. Equation 23 combines TD error priority and segment weights with coefficients α, β.  Explain whether α + β = 1 or if priorities can exceed unity. A schematic or small example calculation would aid understanding.
  • Equation (19) redefines the sampling probability with a bias term b; please specify the value of b used in experiments.

  • Please make the code repository available for reproducibility.

 

Author Response

Comments 1: Novelty and Ablation. The combination of dueling, double-Q, and prioritized replay is sensible, but it is not clear which component contributes most. An ablation study with existing variants of prioritized replay would be very helpful.

Response 1: Thank you for this valuable suggestion. We fully agree that ablation studies can provide deeper insight into the contribution of each component in composite algorithms like DPD3QN. However, in this study, our primary objective was to evaluate the overall effectiveness of the proposed integrated approach in complex highway scenarios, rather than to isolate individual components. As such, we did not conduct separate ablation experiments at this stage. That being said, we recognize the importance of this direction and plan to include a comprehensive ablation study and comparison with alternative prioritized replay mechanisms (e.g., rank-based or stochastic prioritization) as part of our future work. This extension will help further validate the role and impact of each module within the DPD3QN.

 

Comments 2: Hyperparameter Sensitivity. The paper introduces several weighting factors (α, β, ε) for the dual-priority calculation but does not justify their chosen values or report sensitivity analyses. A brief sweep showing performance versus α and β would help readers understand robustness.

Response 2: Thank you for this valuable suggestion. I have provided explanations for the selection of the hyperparameters used in the dual-priority calculation and conducted a sensitivity analysis. The corresponding revisions can be found on pages 11 and 12 of the manuscript, specifically in paragraphs 7 and 8, and paragraph 1, covering lines 455 to 467. “Moreover, since both the TD error and the segment weights are not inherently bounded, the combined priority value may theoretically exceed 1. To address this, we apply a normalization process during the sampling stage by dividing each priority by the total priority sum, resulting in the normalized sampling probability. This ensures that each sample is selected according to its relative importance, regardless of the absolute value. At the same time, it prevents high-priority samples from dominating the training process, thereby reducing the risk of overfitting or biased learning.

This approach not only maintains training efficiency but also enhances the model’s ability to learn from diverse types of scenarios with increased robustness. By adjusting α and β, researchers and developers can fine-tune the training process based on specific task demands and environmental complexities. In future versions, we also plan to include a schematic illustration and a numerical example to further improve the transparency and interpretability of the dual-priority formula.”
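A brief sketch of the normalization and weighted sampling step described above, assuming NumPy and hypothetical variable names (not the authors' implementation):

import numpy as np

def sample_batch(combined_priorities, batch_size, rng=None):
    # Normalize the (possibly > 1) combined priorities into a probability
    # distribution and draw a batch of transition indices by weighted sampling.
    rng = rng or np.random.default_rng()
    probs = np.asarray(combined_priorities, dtype=float)
    probs = probs / probs.sum()
    return rng.choice(len(probs), size=batch_size, p=probs, replace=True)

# Example: sample_batch([1.89, 0.60, 0.40], batch_size=2)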

 

Comments 3: Clarity of Dual-Priority Formula. Equation 23 combines TD error priority and segment weights with coefficients α, β.  Explain whether α + β = 1 or if priorities can exceed unity. A schematic or small example calculation would aid understanding.

Response 3: Thank you for this valuable suggestion. The formula does not necessarily require that α+β=1, and the combined priority may also exceed 1. As recommended, a schematic explanation and a simple calculation example have been added to enhance the clarity and interpretability of the formula. The added explanatory content is located on page 11, paragraph six, lines 445 to 454 of the manuscript. The calculation example has been added after the first paragraph on page 12, covering lines 467 to 472. “It is important to note that α+β is not strictly constrained to equal 1. This design provides greater flexibility, allowing the relative importance of TD error and segment-based priority to be dynamically adjusted during different training stages. For example, in the early stages of training, when the model has not yet learned an effective policy and the TD error tends to fluctuate significantly, a higher α can be used to emphasize learning from large-value errors and accelerate policy correction. In contrast, during the later stages—when the policy becomes more stable—β can be increased to direct the model's attention to rare but critical driving scenarios, such as high-speed lane changes or emergency braking. This helps improve the generalization and real-world adaptability of the learned policy.”
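For concreteness, a small hypothetical calculation, assuming the additive combination sketched earlier and numbers invented purely for illustration, shows how the combined priority can exceed 1 when α + β > 1 and how normalization restores a valid sampling distribution:

# Hypothetical numbers, not taken from the manuscript:
alpha, beta = 0.8, 0.5                               # alpha + beta = 1.3, deliberately not equal to 1
td_priority, segment_priority = 1.8, 0.9
P = alpha * td_priority + beta * segment_priority    # 1.44 + 0.45 = 1.89 > 1
# With two other stored samples of priority 0.60 and 0.40, normalization gives
P_norm = P / (P + 0.60 + 0.40)                       # 1.89 / 2.89 ≈ 0.654
print(P, P_norm)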

 

Comments 4: Equation (19) redefines the sampling probability with a bias term b; please specify the value of b used in experiments.

Response 4: Thank you for this valuable suggestion. The corresponding revisions are located on page 9, paragraph 4, lines 355 to 356 of the manuscript. “In our experiments, the value of the bias term b is set to 0.01, which helps avoid zero probability and ensures sufficient sample diversity.”

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors did not respond to my general comments, which question the novelty of the study and the lack of advanced baselines for comparison of the case study results.

There is still a reference error in the revised manuscript (e.g., page 8).

Author Response

Comments 1: The authors did not respond to my general comments, which question the novelty of the study and the lack of advanced baselines for comparison of the case study results.

Response 1: Thank you very much for your thoughtful feedback. We fully understand the concern regarding the level of novelty. While it is true that D3QN and TD-error PER are well-established methods, our contribution focuses on the integration of segment-based priority into the experience replay framework, tailored for autonomous highway scenarios.

This segment priority component, although heuristic in design, introduces task-relevant semantic weighting (e.g., spatial position and velocity categories) that enhances learning in sparse but critical situations. As our current work aims to demonstrate the practical effectiveness of this dual-priority structure in a combined framework, we believe that a full ablation study or formal separation of component-wise contributions is beyond the intended scope of this paper. Nonetheless, we appreciate the suggestion and consider it a valuable direction for future research.

 

In designing our experimental comparisons, we focused on demonstrating incremental improvements over standard baselines (DQN and DDQN), as these are the most widely used and well-understood reference points in the field. While more advanced configurations such as D3QN or DPER are relevant, our primary objective was to validate the effectiveness of the proposed DPD3QN method in contrast to foundational models.

We believe that the current comparisons are sufficient to support our claims, and including additional variants may dilute the focus of the study. However, we fully acknowledge the value of broader baseline comparisons and will consider them in follow-up studies.

 

Comments 2: There is still a reference error in the revised manuscript (e.g., page 8).

Response 2: Thank you very much for your thoughtful feedback. I have corrected the citation error on page 8. Please refer to the first line of the first paragraph on page 8 for the specific revision.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for improving your paper

Author Response

Comments 1: Thank you for improving your paper

Response 1: Thank you for your review and suggestions.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

Thanks for the modification. Please provide the source code for the paper for reproducibility. Please do what you promised.

Author Response

Comments 1: Thanks for the modification. Please provide the source code for the paper for reproducibility. Please do what you promised.

Response 1: Thank you for your review and suggestions. All the data used in the study are already included in the manuscript, and the corresponding option has been selected during the submission process in the system.

Author Response File: Author Response.docx
