YOLO-DER: A Dynamic Enhancement Routing Framework for Adverse Weather Vehicle Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. The term "Adverse Weather Vehicle Detection" in the keywords is inconsistent with "Object Detection" in the title. Please standardize the terminology for consistency.
2. It is recommended to add the performance improvement margins of the proposed model compared with baseline models in the abstract.
3. In the "Experimental Settings and Results" section, the rationale for selecting YOLO12 as the baseline model instead of other alternative models is not clearly elaborated.
4. Lines 255–256 only mention foggy and low-light scenarios as special weather conditions, which is inconsistent with the previously stated "adverse weather conditions." Please align the scope of adverse weather types for consistency.
5. It is unclear why the ablation experiment results on the ExDark dataset are not included in Table 5. Please supplement the relevant data or explain the reason for the omission.
Author Response
Reviewer#1, Concern #1: The term "Adverse Weather Vehicle Detection" in the keywords is inconsistent with "Object Detection" in the title. Please standardize the terminology for consistency.
Reply: We sincerely thank the reviewer for this helpful comment. As the overall focus of our work is indeed on vehicle detection, we have revised the title to use the term “vehicle detection” accordingly. In addition, we have carefully checked the manuscript and replaced the occurrences of “object detection” in the abstract with “vehicle detection” to ensure consistent terminology throughout the paper.
Reviewer#1, Concern #2: It is recommended to add the performance improvement margins of the proposed model compared with baseline models in the abstract.
Reply: We thank the reviewer for the helpful suggestion. In the revised manuscript, we have updated the abstract to explicitly report the performance improvement margins of YOLO-DER over the baseline models. In line 15, we add: “Extensive experiments on BDD100K, Foggy Cityscapes, and ExDark demonstrate the superior performance of YOLO-DER, yielding mAP50 scores of 80.8%, 57.9%, and 85.6%, which translate into absolute gains of +3.8%, +2.3%, and +2.9% over YOLOv12 on the respective datasets.”
Reviewer#1, Concern #3: In the "Experimental Settings and Results" section, the rationale for selecting YOLO12 as the baseline model instead of other alternative models is not clearly elaborated.
Reply: We thank the reviewer for this insightful comment. We have now clarified the rationale for choosing YOLOv12 as our baseline. Specifically, YOLOv12 offers an excellent balance between detection accuracy and real-time performance, which is essential for robust vehicle detection under adverse weather conditions. Moreover, its modular CSPDarknet-style backbone allows seamless integration of our Dynamic Enhancement Routing (DER) and EnhanceNet modules at the feature level without altering the detection head. These properties make YOLOv12 a suitable and fair baseline for evaluating the effectiveness and efficiency of the proposed degradation-aware enhancement mechanism. The corresponding clarification has been added to Section 4.2 of the revised manuscript.
Reviewer#1, Concern #4: Lines 255–256 only mention foggy and low-light scenarios as special weather conditions, which is inconsistent with the previously stated "adverse weather conditions." Please align the scope of adverse weather types for consistency.
Reply: Thank you for the insightful comment. We have revised Lines 255–256 to explicitly state “foggy, rainy, and low-illumination conditions,” ensuring that the description of experimental scenarios is consistent with the definition of “adverse weather conditions” used in the rest of the paper.
In Lines 255–256 of the revised manuscript, we have revised the original sentence to: "The experiments are structured to verify three aspects: (1) detection performance improvement across diverse adverse-weather scenarios, including foggy, rainy, and low-illumination conditions; (2) the contribution of each enhancement component (DER and EnhanceNet);"
Reviewer#1, Concern #5: It is unclear why the ablation experiment results on the ExDark dataset are not included in Table 5. Please supplement the relevant data or explain the reason for the omission.
Reply: Thank you for pointing out this issue. The ablation results on the ExDark dataset are indeed included in our manuscript, but they are presented in Figure 5 rather than Table 5. Our intention was to improve the diversity and readability of the experimental presentation:
- Table 5 summarizes the ablation results on the BDD100K and Foggy Cityscapes datasets, providing a clear side-by-side numerical comparison.
- Figure 5 presents the ExDark ablation results using a bar chart, which more intuitively highlights the performance gains under different components.
In the revised manuscript, we have explicitly indicated the corresponding datasets in the captions of both Table 5 and Figure 5, ensuring that readers can easily identify which dataset each ablation result refers to and avoiding any potential confusion.
Reviewer 2 Report
Comments and Suggestions for Authors
- The paper addresses an important and timely problem of robust object detection under adverse weather conditions. The motivation is clearly stated in the introduction (pp. 1–3), which discusses the fog, rain, and low-light challenges that degrade detection.
- While the proposed Dynamic Enhancement Routing (DER) and EnhanceNet modules are novel combinations, many underlying ideas (dynamic routing, feature-level adaptation, differentiable enhancement) have been explored in earlier works such as GDIP, IA-YOLO, and D-YOLO (related work, pp. 3–5). The authors should clearly state what is fundamentally new beyond integrating two enhancement modules into YOLOv12.
- The methodology section (pp. 5–8) is well-structured, with clear equations, modular diagrams, and explicit formulations for global feature compression, enhancement branches, gating, and loss design. This enhances reproducibility and makes the paper stronger. The ablation study (Table 5, Figure 5, pp. 10–11) shows benefits from DER (+0.3 mAP) and EnhanceNet (+1.1 mAP), with the combination yielding larger gains (+3.8 mAP). This supports the stated synergy.
- All evaluations only consider the Car class (Tables 1–4). While the authors justify the choice, using only a single class limits the generalization claims. Real-world deployment requires multi-class detection (pedestrians, riders, trucks, signs). This is an important limitation that should be discussed more explicitly.
- YOLOv12 is a relatively new and evolving architecture (referenced via arXiv 2025). Without specifying implementation details (hyperparameters, exact variant, anchor settings), readers may struggle to replicate results. I would recommend including all hyperparameters.
- The paper states an additional overhead of only 3 ms (p. 10), but no FPS measurements, hardware variance, or model size comparisons are included. Real-time validity must be demonstrated with detailed benchmarks (FPS across input resolutions, batch sizes, GPU/CPU configurations).
- Figures 4–5 provide some qualitative examples (p. 11), but the illustrations are small and do not include challenging corner cases (dense fog, heavy rain, extreme glare). Adding failure cases would help demonstrate robustness and limitations.
- EnhanceNet includes multiple branches (tone, contrast, detail, white-balance, identity) controlled by MLP projections (p. 7). Although the paper states parameters increase by <10%, no empirical parameter count or FLOPs table is provided. A more detailed efficiency analysis is needed.
- Fog severity and brightness severity are computed using handcrafted formulas (Eq. 10–11). Although this is a good step, these metrics might not correlate strongly with real perceptual degradation. The authors could validate them (e.g., correlation analysis with human-rated degradation levels).
- Writing quality is strong, but some sections feel long and repetitive. For instance, parts of the related work (pp. 3–5) are overly detailed.
Author Response
Reviewer#2, Concern #2: While the proposed Dynamic Enhancement Routing (DER) and EnhanceNet modules are novel combinations, many underlying ideas (dynamic routing, feature-level adaptation, differentiable enhancement) have been explored in earlier works such as GDIP, IA-YOLO, and D-YOLO (related work, pp. 3–5). The authors should clearly state what is fundamentally new beyond integrating two enhancement modules into YOLOv12.
Reply: We sincerely thank the reviewer for this insightful comment. To clearly address the concern regarding novelty, we provide our response from three complementary perspectives: (1) differences from existing related works, (2) the specific characteristics of our proposed framework, and (3) a brief summary.
- Differences from Existing Works
While we acknowledge that prior methods such as GDIP, IA-YOLO, and D-YOLO have explored adaptive or differentiable enhancement strategies, our design differs in several practical aspects:
- Degradation descriptor usage: Existing works typically apply enhancement directly based on learned parameters or gated operations. In contrast, our method explicitly extracts a compact degradation descriptor using global feature statistics, which guides subsequent routing decisions.
- Soft routing mechanism: Prior methods generally activate enhancement branches through fixed or binary gating. Our Dynamic Enhancement Routing module instead uses a soft, continuous weighting across multiple enhancement paths, resulting in smoother and more flexible behavior.
- Multi-level placement: Most existing dynamic-enhancement frameworks operate at a single stage of the network. Our modules are inserted at multiple layers of the backbone, enabling stage-wise adaptation to different degrees of degradation.
These distinctions are not claims of conceptual breakthroughs but reflect meaningful architectural differences that lead to the observed performance gains.
- Distinctive Features of Our Framework
The primary aim of our work is to build a practical and unified enhancement–detection system rather than to propose entirely new enhancement operations. Our framework exhibits several characteristics that, to our knowledge, are not simultaneously present in prior detectors:
- Complementary modules by design: DER emphasizes global degradation tendencies, while EnhanceNet focuses on local structure refinement. Their roles are intentionally complementary rather than simple stacking.
- Quantitative conditioning: EnhanceNet uses measured fog and brightness levels as auxiliary cues. This helps the model adjust enhancement strength more consistently across diverse weather conditions.
- Unified end-to-end workflow: Both components operate directly in the feature space of YOLOv12 and are optimized together with the detection objective, ensuring that enhancement is aligned with downstream semantic requirements.
These characteristics collectively contribute to more reliable performance in foggy, rainy, and low-light scenes.
- Summary
In summary, although our method builds upon ideas present in earlier literature, the way degradation descriptors, soft routing, and multi-branch enhancement are combined within the YOLOv12 feature extraction pipeline provides a unified and practical design tailored for adverse-weather detection. We will revise the manuscript to more clearly articulate these distinctions to avoid any misunderstanding.
Reviewer#2, Concern #3: The methodology section (pp. 5–8) is well-structured, with clear equations, modular diagrams, and explicit formulations for global feature compression, enhancement branches, gating, and loss design. This enhances reproducibility and makes the paper stronger. The ablation study (Table 5, Figure 5, pp. 10–11) shows benefits from DER (+0.3 mAP) and EnhanceNet (+1.1 mAP), with the combination yielding larger gains (+3.8 mAP). This supports the stated synergy.
Reply: We sincerely thank the reviewer for the positive and encouraging comments. We are glad that the methodological clarity and ablation analysis are recognized, and we appreciate the reviewer’s acknowledgment of our efforts to enhance reproducibility.
Reviewer#2, Concern #4: All evaluations only consider the Car class (Tables 1–4). While the authors justify the choice, using only a single class limits the generalization claims. Real-world deployment requires multi-class detection (pedestrians, riders, trucks, signs). This is an important limitation that should be discussed more explicitly.
Reply: We thank the reviewer for raising this important point. Our focus on the Car class is primarily due to the strong class imbalance across adverse-weather subsets in BDD100K, Foggy Cityscapes, and ExDark, which makes multi-class evaluation less reliable for assessing the core contribution of degradation-aware enhancement. We agree that this is a limitation, and we will state it more explicitly. At the same time, the proposed modules are class-agnostic and can be directly applied to multi-class detection, which we plan to explore in future work.
Reviewer#2, Concern #5: YOLOv12 is a relatively new and evolving architecture (referenced via arXiv 2025). Without specifying implementation details (hyperparameters, exact variant, anchor settings), readers may struggle to replicate results. I would recommend including all hyperparameters.
Reply: We sincerely appreciate the reviewer’s concern regarding reproducibility, especially given that YOLOv12 is a relatively new architecture. To address this, we have updated the Implementation Details section to explicitly include all key hyperparameters used in our experiments. In particular, we now describe the warm-up configuration (3 epochs with warm-up momentum 0.8 and a bias learning rate of 0.0), YOLOv12’s default loss weights (box/class/DFL gains of 7.5/0.5/1.5), the optimization settings (SGD with momentum 0.937 and weight decay 5×10⁻⁴), the cosine-annealing learning-rate schedule, gradient clipping, mixed-precision training, and initialization strategies. We also clarify that we use the standard YOLOv12 backbone and anchor-free detection head without modification. With these details now added to the main text, readers should be able to fully replicate our training pipeline. We thank the reviewer again for the helpful suggestion.
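For readers who wish to replicate this setup directly, the configuration above can be expressed as a single training call. The following is a minimal sketch assuming the Ultralytics release of YOLOv12; the dataset YAML path is a placeholder, not part of our repository.
```python
from ultralytics import YOLO

# Sketch of the training configuration described above (Ultralytics API assumed).
model = YOLO("yolo12s.pt")  # standard YOLOv12-S backbone, anchor-free head

model.train(
    data="bdd100k.yaml",        # placeholder dataset config
    optimizer="SGD",
    momentum=0.937,             # SGD momentum
    weight_decay=5e-4,          # weight decay
    cos_lr=True,                # cosine-annealing learning-rate schedule
    warmup_epochs=3,            # 3 warm-up epochs
    warmup_momentum=0.8,        # warm-up momentum
    warmup_bias_lr=0.0,         # bias learning rate during warm-up
    box=7.5, cls=0.5, dfl=1.5,  # YOLOv12 default box/class/DFL loss gains
    amp=True,                   # mixed-precision training
)
```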
Reviewer#2, Concern #6: The paper states an additional overhead of only 3 ms (p. 10), but no FPS measurements, hardware variance, or model size comparisons are included. Real-time validity must be demonstrated with detailed benchmarks (FPS across input resolutions, batch sizes, GPU/CPU configurations).
Reply: We thank the reviewer for highlighting the need for explicit real-time evaluation. In response, we have added a detailed latency/FPS benchmark on an NVIDIA RTX-4090 (FP16, batch size = 1) across three input resolutions (448, 512, 640), as shown in Table 5 of the revised manuscript. The results confirm that our DER and EnhanceNet modules introduce only ~3 ms of additional latency at 448×448, consistent with our original statement, while the full system still runs above 100 FPS. These measurements provide a clear and reproducible validation of the real-time feasibility of our method.
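For transparency, the timing protocol behind these numbers can be summarized in a short, generic PyTorch harness. This is an illustrative sketch: the warm-up and iteration counts are representative choices, and the commented model-loading line is a placeholder.
```python
import time
import torch

def benchmark(model, size, device="cuda", warmup=50, iters=200):
    """Mean latency (ms) and throughput (FPS) at FP16, batch size 1."""
    model = model.half().to(device).eval()
    x = torch.randn(1, 3, size, size, device=device, dtype=torch.half)
    with torch.no_grad():
        for _ in range(warmup):        # warm-up to stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()       # wait for all queued CUDA kernels
    ms = (time.perf_counter() - t0) / iters * 1e3
    return ms, 1e3 / ms

# Placeholder usage over the resolutions reported in Table 5:
# for s in (448, 512, 640):
#     print(s, benchmark(yolo_der_model, s))
```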
Reviewer#2, Concern #7: Figures 4–5 provide some qualitative examples (p. 11), but the illustrations are small and do not include challenging corner cases (dense fog, heavy rain, extreme glare). Adding failure cases would help demonstrate robustness and limitations.
Reply: Additional challenging corner cases have been incorporated to address the reviewer’s concern. As shown in Figure 4, the baseline detector frequently fails to capture small or low-contrast objects, while YOLO-DER recovers them more reliably under foggy and nighttime conditions due to improved noise suppression and preservation of structural cues. Furthermore, Figure 5 includes newly added dense-fog and heavy-rain examples from BDD100K, ExDark, and Foggy Cityscapes. These visualizations demonstrate that the proposed method maintains stable detection performance under extremely low illumination and adverse weather, producing more accurate localization and fewer false alarms compared with the baselines. Although slight confidence degradation remains for distant or heavily occluded targets, these cases clearly reveal the boundary conditions of the method.
Reviewer#2, Concern #8: EnhanceNet includes multiple branches (tone, contrast, detail, white-balance, identity) controlled by MLP projections (p. 7). Although the paper states parameters increase by <10%, no empirical parameter count or FLOPs table is provided. A more detailed efficiency analysis is needed.
Reply: We thank the reviewer for this constructive suggestion. In the revised manuscript, we have expanded the efficiency analysis by adding parameter counts and FLOPs to the inference benchmark (Table~5). As shown in the table, incorporating DER and EnhanceNet increases the model size from 7.8M to 8.5M parameters (+8.9%) and FLOPs from 18.9G to 20.5G (+8.5%) at 448×448 resolution, while maintaining real-time performance (102 FPS). Similar trends hold across higher resolutions. These results empirically verify that the additional enhancement branches introduce only minimal computational overhead, consistent with our <10% statement.
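These figures can be reproduced with a standard profiler; below is a small sketch assuming the thop package (any FLOPs counter would serve equally well).
```python
import torch
from thop import profile  # assumption: the thop package is installed

def efficiency(model, size=448):
    """Parameter count (M) and multiply-accumulate count (G) at a given input size."""
    params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, size, size)
    macs, _ = profile(model, inputs=(x,), verbose=False)
    # Note: profilers report MACs; whether GFLOPs equals GMACs or 2x GMACs is a
    # reporting convention, so the convention should be stated alongside the table.
    return params, macs / 1e9

# e.g. baseline vs. DER + EnhanceNet at 448x448: 7.8M/18.9G vs. 8.5M/20.5G
```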
Reviewer#2, Concern #9: Fog severity and brightness severity are computed using handcrafted formulas (Eq. 10–11). Although this is a good step, these metrics might not correlate strongly with real perceptual degradation. The authors could validate them (e.g., correlation analysis with human-rated degradation levels).
Reply: We thank the reviewer for this valuable suggestion. To validate whether our fog-severity and brightness-severity metrics meaningfully reflect perceptual degradation, we conducted a small human-rating study as recommended. Fifty images were randomly sampled from all datasets and scored on a 1–5 degradation scale by three independent annotators. As reported in Table 7 of the revised manuscript, the fog-severity metric achieves strong correlations of 0.84 (Pearson) and 0.82 (Spearman), while the brightness-severity metric reaches 0.81 and 0.80, respectively. These results demonstrate a high level of perceptual consistency and confirm that the proposed handcrafted metrics provide reliable supervisory cues for our enhancement framework.
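The correlation analysis itself uses standard SciPy routines. The sketch below illustrates the computation on synthetic placeholder data; in the actual study, the arrays hold the Eq. 10 scores and the mean of the three annotators' 1–5 ratings for the 50 sampled images.
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic placeholders: substitute the Eq. 10 fog scores and the
# averaged human ratings for the 50 sampled images.
rng = np.random.default_rng(0)
fog_metric = rng.uniform(0.0, 1.0, 50)
human_fog = 1 + 4 * fog_metric + rng.normal(0, 0.4, 50)  # 1-5 rating scale

r_p, _ = pearsonr(fog_metric, human_fog)    # linear (Pearson) correlation
r_s, _ = spearmanr(fog_metric, human_fog)   # rank (Spearman) correlation
print(f"Pearson r = {r_p:.2f}, Spearman rho = {r_s:.2f}")
```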
Reviewer#2, Concern #10: Writing quality is strong, but some sections feel long and repetitive. For instance, parts of the related work (pp. 3–5) are overly detailed.
Reply: We thank the reviewer for the helpful observation. In the revised manuscript, we have streamlined the related work section by removing redundant descriptions and tightening overly detailed parts, particularly in pp. 3–5. The content is now more concise and better focused on the works most relevant to our contributions. We appreciate the reviewer’s suggestion, which has improved the clarity and readability of the paper.
Reviewer 3 Report
Comments and Suggestions for Authors
For publication, the following comments should be addressed carefully:
- The paper cites methods such as IA-YOLO, GDIP, TogetherNet, D-YOLO, and DA-RAW, but the unique technical contribution of YOLO-DER – especially how DER and EnhanceNet together go beyond these approaches – should be explained more explicitly in the Introduction and Related Work.
- Since YOLOv12 is still relatively new, please briefly summarize its main characteristics and clearly state which variant (e.g., n/s/m/l) and backbone size you use. This will help readers reproduce the experiments and ensure a fair comparison with YOLOv8, YOLOv7, etc.
- The authors select YOLOv12, but there are many other YOLO variants. YOLOv10 combined with transformer architectures is also a strong option. The paper “Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body-worn cameras with benchmark datasets” applies ViT and Axial Transformer within a YOLOv10 framework and reports reliable detection performance. Since there is no universal method that achieves the best accuracy on all datasets and tasks, please cite and discuss this paper to justify your choice of YOLO version and architecture.
- There are a few minor language issues, for example “we presents YOLO-DER” should be “we present YOLO-DER,” and there is inconsistent use of hyphens (e.g., “low illumination” vs. “low-illumination”) and spacing around references. A careful proofreading or language edit would improve overall readability.
- Entropy and “entropy-aware / entropy regulation” are mentioned in the Related Work and strongly emphasized in the Conclusion. It would be helpful to connect this perspective more clearly in the Methods section, for example by adding a short paragraph explaining how DER and EnhanceNet implicitly modulate entropy in feature space and how this relates to equations (1)–(11).
- The performance of YOLO models is strongly influenced by hyperparameter optimization, but the current manuscript does not describe or analyze this aspect. The authors should acknowledge this as a limitation and discuss related work: “Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras,” where hyperparameter tuning plays an important role.
- In Sections 4.4 and 4.5, the manuscript reads as if generative AI was used extensively. These descriptions should be rewritten.
Author Response
Reviewer#3, Concern #1: The paper cites methods such as IA-YOLO, GDIP, TogetherNet, D-YOLO, and DA-RAW, but the unique technical contribution of YOLO-DER – especially how DER and EnhanceNet together go beyond these approaches – should be explained more explicitly in the Introduction and Related Work.
Reply: We thank the reviewer for highlighting the need to more clearly distinguish the technical contributions of YOLO-DER from existing methods such as IA-YOLO, GDIP, TogetherNet, D-YOLO, and DA-RAW. In the revised manuscript, we have refined both the Introduction and Related Work sections to explicitly articulate the novelty of our design. In particular, we emphasize three aspects:
(1) Degradation-aware conditioning: Unlike prior works that rely on fixed or gated enhancement operations, YOLO-DER extracts a compact degradation descriptor from global feature statistics, which provides a continuous, data-driven signal for routing decisions.
(2) Soft multi-branch routing + complementary enhancement: Our Dynamic Enhancement Routing (DER) performs soft, continuous weighting across multiple enhancement paths, while EnhanceNet focuses on localized refinement guided by quantified fog and brightness levels. These modules are designed to be complementary rather than simple sequential stacking (a simplified sketch of the soft routing is given at the end of this reply).
(3) Stage-wise integration into YOLOv12: Whereas earlier adaptive enhancement methods typically operate at a single processing stage, our modules are inserted at multiple backbone levels and are jointly optimized with the detection objective, enabling fine-grained, feature-level adaptation across diverse adverse-weather conditions.
We have incorporated these clarifications into the revised Introduction and Related Work to more explicitly highlight how YOLO-DER differs from and goes beyond existing dynamic-enhancement and differentiable preprocessing approaches.
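To make the soft-routing idea in point (2) concrete, the following is a deliberately simplified PyTorch sketch of the gating mechanism; the layer sizes and the placeholder branches are illustrative and do not reproduce the exact implementation.
```python
import torch
import torch.nn as nn

class SoftEnhancementRouter(nn.Module):
    """Simplified DER sketch: a global degradation descriptor yields
    continuous softmax weights over several enhancement branches."""
    def __init__(self, channels, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)    # e.g. dehaze / brighten / identity paths
        self.gate = nn.Sequential(                 # descriptor -> routing weights
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, len(branches)),
        )

    def forward(self, x):
        desc = x.mean(dim=(2, 3))                  # global feature statistics as descriptor
        w = torch.softmax(self.gate(desc), dim=1)  # soft, continuous weighting
        outs = torch.stack([b(x) for b in self.branches], dim=1)
        return (w[:, :, None, None, None] * outs).sum(dim=1)

# Usage with three placeholder branches on a 64-channel feature map:
router = SoftEnhancementRouter(64, [nn.Conv2d(64, 64, 3, padding=1) for _ in range(3)])
y = router(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```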
Reviewer#3, Concern #2: Since YOLOv12 is still relatively new, please briefly summarize its main characteristics and clearly state which variant (e.g., n/s/m/l) and backbone size you use. This will help readers reproduce the experiments and ensure a fair comparison with YOLOv8, YOLOv7, etc.
Reply: We thank the reviewer for this helpful suggestion. In the revised manuscript, we have added a concise summary of the core characteristics of YOLOv12 (e.g., hybrid-scale decoupled head, lightweight FPN design, and improved context aggregation) to provide readers with sufficient background. We also clearly specify that our implementation is based on the YOLOv12-S variant with its default backbone configuration and hyperparameters released in the official repository. These details have been included in the Experimental Setup section to ensure full reproducibility and fair comparison with YOLOv8, YOLOv7, and other detectors.
Reviewer#3, Concern #3: The authors select YOLOv12, but there are many other YOLO variants. YOLOv10 combined with transformer architectures is also a strong option. The paper “Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body-worn cameras with benchmark datasets” applies ViT and Axial Transformer within a YOLOv10 framework and reports reliable detection performance. Since there is no universal method that achieves the best accuracy on all datasets and tasks, please cite and discuss this paper to justify your choice of YOLO version and architecture.
Reply: We thank the reviewer for this insightful comment and for pointing out the recent work on YOLOv10 combined with transformer architectures. Following your suggestion, we have now cited and discussed the paper “Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body-worn cameras with benchmark datasets” in the related work section. In the revised manuscript, we clarify that this YOLOv10+transformer framework achieves reliable performance for non-PPE detection in construction-site surveillance scenarios by leveraging ViT and axial transformer modules.
At the same time, we explicitly justify our choice of YOLOv12 as the base detector for adverse-weather vehicle detection. We adopt YOLOv12 because it provides a favorable accuracy–speed trade-off on our traffic datasets and a clean one-stage architecture that is easy to integrate with the proposed Dynamic Enhancement Routing and enhancement subnetworks.
Reviewer#3, Concern #4: There are a few minor language issues, for example “we presents YOLO-DER” should be “we present YOLO-DER,” and there is inconsistent use of hyphens (e.g., “low illumination” vs. “low-illumination”) and spacing around references. A careful proofreading or language edit would improve overall readability.
Reply: We sincerely thank the reviewer for carefully pointing out the minor language issues and formatting inconsistencies. In the revised manuscript, we have conducted a thorough proofreading and language polishing. Specifically, we corrected the grammatical error “we presents YOLO-DER” to “we present YOLO-DER” and checked for similar subject–verb agreement issues throughout the paper. We also standardized the wording and hyphenation of illumination-related terms by consistently using expressions such as “low-light conditions/scenes/images,” and we made the use of hyphens in other compound adjectives (e.g., adverse-weather vehicle detection) consistent. In addition, we unified the spacing and formatting around references according to the journal style. We believe these revisions improve the overall clarity and readability of the manuscript.
Reviewer#3, Concern #5: Entropy and “entropy-aware / entropy regulation” are mentioned in the Related Work and strongly emphasized in the Conclusion. It would be helpful to connect this perspective more clearly in the Methods section, for example by adding a short paragraph explaining how DER and EnhanceNet implicitly modulate entropy in feature space and how this relates to equations (1)–(11).
Reply: We appreciate the reviewer’s insightful comment. In the revised manuscript, we have added a short explanatory paragraph in the Methods section to more clearly connect the entropy perspective with the proposed DER and EnhanceNet modules. Specifically, we clarify that:
(1) the degradation descriptor extracted in DER reflects the global entropy distribution of input features (Eqs. 1–4), enabling soft routing toward enhancement paths that reduce uncertainty;
(2) EnhanceNet adjusts local feature statistics using degradation-conditioned cues (Eqs. 10–11), which implicitly suppress high-entropy regions caused by haze or low-light noise; and
(3) both modules jointly regulate feature entropy by enhancing structural consistency and improving contrast during forward propagation.
This additional explanation bridges the conceptual motivation in the Related Work and Conclusion with the operational behavior of the modules in the Methods section.
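As a toy illustration of this entropy view (our own illustrative computation, not part of the training pipeline), the spatial entropy of a feature map can be estimated as follows; higher values indicate activation mass spread by haze or low-light noise, which the enhancement branches are expected to reduce.
```python
import torch

def feature_entropy(f, eps=1e-8):
    """Shannon entropy of the spatial activation distribution.
    f: (B, C, H, W); per-channel magnitudes are normalized into a
    probability distribution over spatial positions."""
    p = f.abs().flatten(2)                           # (B, C, H*W), magnitude as mass
    p = p / (p.sum(dim=2, keepdim=True) + eps)
    return -(p * (p + eps).log()).sum(dim=2).mean()  # mean over batch and channels

print(feature_entropy(torch.rand(2, 64, 32, 32)))
```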
Reviewer#3, Concern #6: The performance of YOLO models is strongly influenced by hyperparameter optimization, but the current manuscript does not describe or analyze this aspect. The authors should acknowledge this as a limitation and discuss related work: “Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras,” where hyperparameter tuning plays an important role.
Reply: We thank the reviewer for pointing out this important aspect. We agree that the performance of YOLO-based detectors is strongly influenced by hyperparameter choices, and that our current manuscript does not provide a detailed analysis of this factor. In this work, our main focus is on the design of a vehicle detection framework rather than on exhaustive hyperparameter optimization. Therefore, we largely follow the recommended hyperparameters from the official YOLOv12 implementation and apply the same training configuration to both the baseline YOLOv12 and the proposed YOLO-DER variants, in order to isolate the effect of the Dynamic Enhancement Routing (DER) and EnhanceNet modules. We do not perform an extensive dataset-specific hyperparameter search.
We fully agree with the reviewer that this is a limitation. In the revised manuscript, we have made this more explicit. Specifically, in the Conclusions section we now discuss the lack of systematic hyperparameter optimization as an important limitation.
In line 388, we add: “Future work. A limitation of our work is that we do not perform systematic hyperparameter optimization beyond using a common YOLOv12-based configuration across all experiments. Prior studies, such as the YOLOv10-based safety helmet monitoring framework ~\citep{ref37-DA}, have shown that careful hyperparameter tuning can further improve performance; we leave this as an interesting direction for future research.”
Reviewer#3, Concern #7: In Sections 4.4 and 4.5, the manuscript reads as if generative AI was used extensively. These descriptions should be rewritten.
Reply: We thank the reviewer for raising this concern. We would like to clarify that all content in Sections 4.4 and 4.5 was written manually, based on our own analysis of the quantitative results and visual comparisons. We understand that some sentences may appear overly descriptive or stylistically uniform, which may have caused unintended confusion.
To address this concern, we have thoroughly revised both sections to improve their clarity, conciseness, and technical tone. The revised version now:
- Removes narrative or subjective expressions, replacing them with precise statements directly supported by numerical results.
- Emphasizes data-driven reasoning, avoiding high-level or interpretive wording that may appear stylistic.
- Improves structural clarity, ensuring each observation corresponds explicitly to the tables and figures.
- Reduces redundancy, making the analyses more compact and strictly focused on empirical evidence.
We appreciate the reviewer’s suggestion, as it helped us improve the rigor and clarity of the manuscript. The revised version should resolve any concerns about the writing style.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The author has addressed all the issues, and I have no more questions.
Author Response
Dear Reviewer,
Thank you very much for your valuable comments and constructive suggestions during the two rounds of review. We sincerely appreciate the time and effort you have dedicated to evaluating our work.
We are pleased to know that all concerns raised earlier have now been fully addressed, and we are grateful for your positive assessment of the revised manuscript. Your feedback has significantly improved the clarity and quality of our work.
If the current version meets your expectations, we would be deeply honored to receive your recommendation for acceptance. We truly appreciate your professional guidance throughout the review process.
Thank you again for your time, insight, and kind consideration.
Reviewer 3 Report
Comments and Suggestions for Authors
A citation inconsistency was found.
“Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras” and “Automated Non-PPE Detection on Construction Sites Using YOLOv10 and Transformer Architectures for Surveillance and Body Worn Cameras with Benchmark Datasets” are different papers. The latter is correctly cited in the reference list, but the former appears in the main text (Lines 348–352) and is missing from the references. Please carefully check and correct the in-text citation and reference list accordingly.
Author Response
Reviewer#3, Concern #1: “Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and body-worn cameras” and “Automated Non-PPE Detection on Construction Sites Using YOLOv10 and Transformer Architectures for Surveillance and Body Worn Cameras with Benchmark Datasets” are different papers. The latter is correctly cited in the reference list, but the former appears in the main text (Lines 348–352) and is missing from the references. Please carefully check and correct the in-text citation and reference list accordingly.
Reply: Thank you very much for pointing out the inconsistency between the in-text citation and the reference list. We apologize for the oversight.
In the revised manuscript, we have carefully corrected the in-text citation on Lines 348–352. The paper is now consistently referred to as “Automated Non-PPE Detection on Construction Sites Using YOLOv10 and Transformer Architectures for Surveillance and Body Worn Cameras with Benchmark Datasets,” which matches the title in the reference list.
We have also re-checked all in-text citations and the reference list to ensure consistency and accuracy throughout the manuscript.
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
No comments.