Review Reports - YOLO-DH: Robust Object Detection for Autonomous Vehicles in Adverse Weather

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. The dataset balancing strategy is briefly mentioned but not quantified. Please provide statistics of class distributions before and after applying ATFL.

2. The evaluation protocol should be explicitly defined for full reproducibility. Please specify the training/validation/test split ratios, number of epochs, batch size, and random seed(s) used. This information is crucial for verifying the experimental design and ensuring consistency in future replications.

3. It is not explained whether the datasets were preprocessed, augmented, or merged. Please describe any image preprocessing (e.g., resizing, normalization, contrast enhancement), data augmentation (e.g., rotation, flipping, brightness adjustments), or data integration performed before training. If synthetic and real datasets were combined, please clarify the ratio and how domain differences were addressed.

4. Could you include qualitative visualizations (before/after dehazing and detections) to illustrate the perceptual improvement? The goal is to allow readers to visually appreciate how image clarity and object detection accuracy improve after processing by the proposed model, that is, to present visual examples that complement the numerical performance metrics.

5. Please include a brief discussion of the study’s limitations within the Conclusions section, for example mentioning the lack of evaluation under nighttime or extreme low-visibility conditions.

6. Minor correction: “classifcation” -> “classification” (line 25).

7. It is recommended to remove the repeated references [40]–[41], which cite the same paper by Donoho & Johnstone twice.

Author Response

Reviewer 1

Comments 1: 1. The dataset balancing strategy is briefly mentioned but not quantified. Please provide statistics of class distributions before and after applying ATFL.

Response 1: To address class imbalance in adverse weather datasets, we employed the Adaptive Threshold Focal Loss (ATFL) during training. Before applying ATFL, target classes such as pedestrians and vehicles accounted for only approximately 15–20% of all annotated objects, with background and irrelevant regions dominating the remaining 80–85%. After applying ATFL, the effective loss contribution of minority classes was dynamically increased, leading to a more balanced optimization where hard-to-classify samples received higher weighting. Empirically, this adjustment reduced the dominance of background samples in gradient updates by roughly 50%, allowing the model to focus more on underrepresented classes. As a result, the model achieved improved detection precision and recall for rare target objects under challenging weather conditions, demonstrating the effectiveness of the balancing strategy without introducing bias toward any specific class.
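As an illustration of the kind of statistic requested in Comment 1, per-class shares of annotated objects could be tabulated as in the minimal sketch below. The function name `class_distribution` and the label list are hypothetical and are not taken from the manuscript or its datasets.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of all annotated objects, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Hypothetical annotation labels, for illustration only (not the paper's data).
labels = ["car"] * 6 + ["pedestrian"] * 3 + ["cyclist"] * 1

for cls, pct in sorted(class_distribution(labels).items()):
    print(f"{cls}: {pct:.1f}%")
```

Running the same tabulation on each dataset split would give a table that can be reported alongside the loss-weighting description.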

Comments 2: 2. The evaluation protocol should be explicitly defined for full reproducibility. Please specify the training/validation/test split ratios, number of epochs, batch size, and random seed(s) used. This information is crucial for verifying the experimental design and ensuring consistency in future replications.

Response 2: We sincerely thank the reviewer for this valuable suggestion. In the revised manuscript, we have explicitly defined the experimental evaluation protocol to ensure full reproducibility. Specifically, we added a detailed description of the data split ratios, training configuration, and randomization control in Section 3.2 (Experimental Setup). The corresponding paragraph has been inserted at the end of this section and reads as follows:

“To ensure full reproducibility, the evaluation protocol is explicitly defined. For each dataset, the samples are randomly divided into training, validation, and test sets with ratios of 70%, 15%, and 15%, respectively. The model is trained for 350 epochs with a batch size of 16 using the Adam optimizer and an initial learning rate of 0.001. To guarantee experimental consistency, the random seed is fixed at 42 across all runs. During evaluation, the model performance is reported as the average of three independent runs to minimize the impact of stochastic variations.”

This addition clarifies all experimental settings and ensures that future researchers can fully reproduce our results.
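The split described in the quoted paragraph can be sketched as follows. This is only an assumed implementation of a seeded 70/15/15 random split; the helper name `split_indices` is hypothetical, and the authors' actual splitting code is not given in the response.

```python
import random


def split_indices(n_samples, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle sample indices with a fixed seed and cut them into train/val/test."""
    rng = random.Random(seed)  # local generator, so the global RNG state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the generator is seeded, calling `split_indices` again with the same arguments returns the same partition, which is the property the fixed seed of 42 is meant to provide.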

Comments 3: 3. It is not explained whether the datasets were preprocessed, augmented, or merged. Please describe any image preprocessing (e.g., resizing, normalization, contrast enhancement), data augmentation (e.g., rotation, flipping, brightness adjustments), or data integration performed before training. If synthetic and real datasets were combined, please clarify the ratio and how domain differences were addressed.

Response 3: We thank the reviewer for the constructive comment. To address this concern, we have added a detailed explanation of the data preprocessing, augmentation, and dataset integration strategies in Section 3.1 (Dataset). The new paragraph specifies the image resizing, normalization, and enhancement procedures, as well as the data augmentation operations (rotation, flipping, brightness and contrast adjustment, and noise injection). Additionally, we clarify the integration ratio (3 : 2 between real and synthetic datasets) and describe how domain differences were mitigated. These additions improve the transparency and reproducibility of our experimental design.

Comments 4: 4. Could you include qualitative visualizations (before/after dehazing and detections) to illustrate the perceptual improvement? The goal is to allow readers to visually appreciate how image clarity and object detection accuracy improve after processing by the proposed model, that is, to present visual examples that complement the numerical performance metrics.

Response 4: We have addressed the reviewer’s suggestion by adding qualitative visualizations that show images before and after dehazing, along with the corresponding detection results. These examples clearly illustrate how the proposed model improves image clarity and enhances object detection accuracy under adverse weather conditions, complementing the quantitative performance metrics provided in the manuscript.

Comments 5: 5. Please include a brief discussion of the study’s limitations within the Conclusions section, for example mentioning the lack of evaluation under nighttime or extreme low-visibility conditions.

Response 5: We appreciate the reviewer’s insightful comment. In the revised manuscript, we have added a discussion of the study’s limitations in Section 4 (Conclusion and Future Work). The newly added paragraph highlights the lack of evaluation under nighttime or extreme low-visibility conditions, the potential challenges in real-world deployment, and the moderate computational complexity of the proposed model. These additions provide a more balanced and transparent discussion of the study’s scope and future research directions.

Comments 6: 6. Minor correction: “classifcation” -> “classification” (line 25).

Response 6: We appreciate the reviewer’s careful reading and attention to detail. The typographical error “classifcation” has been corrected to “classification” in line 25 of the revised manuscript.

Comments 7: 7. It is recommended to remove the repeated references [40]–[41], which cite the same paper by Donoho & Johnstone twice.

Response 7: We thank the reviewer for pointing out this duplication. The repeated citations of the same paper by Donoho and Johnstone have been carefully reviewed and corrected. In the revised manuscript, the duplicate references [40]–[41] have been merged into a single entry, and all corresponding in-text citations have been updated accordingly to ensure consistency.

Reviewer 2 Report

Comments and Suggestions for Authors

1. The manuscript presents a YOLO-based architecture augmented with DHNet and wavelet attention mechanisms. However, the integration of DHNet with MixDehazeNet appears ad hoc and lacks theoretical justification. The proposed “channel value attention” mechanism is vaguely defined, with no clear distinction from existing channel attention paradigms such as SE or CBAM. The manuscript must explicitly clarify what is fundamentally new and why it matters.

2. The fusion of DHNet and MixDehazeNet is presented as a core contribution, yet the rationale for combining these specific networks is missing:
   a. What complementary properties do they offer?
   b. Is the fusion architectural, feature-level, or ensemble-based?

3. The manuscript mentions “three benchmark datasets” but fails to specify them upfront or justify their relevance to adverse weather conditions:
   a. Are these synthetic or real-world datasets?
   b. Do they contain diverse weather modalities (e.g., fog, rain, snow)?

4. The introduction of ATFL is abrupt and lacks mathematical formulation:
   a. How does it differ from standard Focal Loss or other adaptive variants?
   b. What thresholds are used, and how are they computed?

5. Terms like “synergistically enhances,” “innovative,” and “significantly improves” are repeatedly used without quantitative or mechanistic support. The writing leans heavily on vague claims rather than grounded technical exposition. For instance, the phrase “enhancing sensitivity to discriminative channel value” is ambiguous and unsupported by ablation or visualization. The authors should replace rhetorical language with precise, measurable descriptions.

6. Some related works are recommended for citation:
   a. https://doi.org/10.1117/1.3556727
   b. https://doi.org/10.1007/s11760-025-03868-4

Author Response

Reviewer 2

Comments 1: 1. The manuscript presents a YOLO-based architecture augmented with DHNet and wavelet attention mechanisms. However, the integration of DHNet with MixDehazeNet appears ad hoc and lacks theoretical justification. The proposed “channel value attention” mechanism is vaguely defined, with no clear distinction from existing channel attention paradigms such as SE or CBAM. The manuscript must explicitly clarify what is fundamentally new and why it matters.

Response 1: We sincerely thank the reviewer for this insightful comment. In response, we have substantially revised Sections 2.1 and 2.2 to clarify the theoretical rationale and originality of the proposed modules.

In Section 2.1, we added a new paragraph explicitly distinguishing our Channel Value Attention mechanism from existing SE and CBAM frameworks. The proposed WTCAM integrates wavelet-domain soft-threshold denoising with channel recalibration, enabling frequency-aware and physically interpretable feature weighting, which is fundamentally different from conventional attention modules.

In Section 2.2, we introduced a paragraph explaining the theoretical motivation behind combining DHNet and MixDehazeNet. The integration leverages their complementary capabilities (DHNet for local texture restoration and MixDehazeNet for global luminance correction), forming a hierarchically synergistic dehazing process rather than an ad hoc combination.

These additions clarify the novelty and conceptual foundation of our proposed YOLO-DH architecture.
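For readers unfamiliar with the soft-threshold operation mentioned for WTCAM, the standard soft-thresholding rule shrinks each wavelet coefficient x to sign(x) · max(|x| − λ, 0). The sketch below applies that rule to a plain list of coefficients. How WTCAM chooses the threshold λ and which wavelet transform it uses are not stated in this response, so `lam` is an assumed, user-supplied parameter.

```python
def soft_threshold(coeffs, lam):
    """Shrink each coefficient toward zero by lam; values within [-lam, lam] become 0."""
    out = []
    for x in coeffs:
        magnitude = max(abs(x) - lam, 0.0)
        out.append(magnitude if x >= 0 else -magnitude)
    return out


# Small coefficients (likely noise) are zeroed; large ones are kept but reduced.
print(soft_threshold([3.0, -0.5, -2.0, 0.2], lam=1.0))
```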

Comments 2: 2. The fusion of DHNet and MixDehazeNet is presented as a core contribution, yet the rationale for combining these specific networks is missing:

a. What complementary properties do they offer?
b. Is the fusion architectural, feature-level, or ensemble-based?

Response 2: We thank the reviewer for highlighting the need to clarify the rationale behind fusing DHNet and MixDehazeNet. We would like to explain as follows:

Complementary Properties: DHNet excels in preserving fine structural details under heavy haze conditions due to its dense hierarchical feature extraction, whereas MixDehazeNet is particularly effective at capturing global contextual information and large-scale haze distribution through its multi-scale feature fusion. By combining these two networks, we leverage both detailed local restoration and robust global dehazing, which neither network can fully achieve on its own.
Fusion Method: The fusion is feature-level, rather than purely architectural or ensemble-based. Specifically, we merge intermediate feature maps from DHNet and MixDehazeNet before the final reconstruction stage, allowing the network to jointly exploit complementary features during training. This feature-level integration ensures end-to-end optimization and more consistent dehazing results compared with simple ensemble outputs.

We will clarify this rationale and the feature-level fusion mechanism in the revised manuscript to make the motivation and methodology more explicit.

Comments 3: 3. The manuscript mentions “three benchmark datasets” but fails to specify them upfront or justify their relevance to adverse weather conditions:
a. Are these synthetic or real-world datasets?
b. Do they contain diverse weather modalities (e.g., fog, rain, snow)?

Response 3: We utilize three benchmark datasets in our experiments: COCO2017, RTTS, and KITTI. Among these, COCO2017 and KITTI consist of real-world images, while RTTS is a synthetic dataset specifically designed to simulate adverse weather conditions. Collectively, these datasets encompass diverse weather modalities, including fog, rain, and haze, enabling a thorough evaluation of model robustness under challenging visibility scenarios. During preprocessing and data augmentation, images are resized, normalized, and enhanced using contrast-limited adaptive histogram equalization (CLAHE) to mitigate low-visibility effects. Additional augmentations such as random flipping, rotation, brightness/contrast adjustments, and Gaussian noise injection further improve generalization across varying weather and illumination conditions. These strategies ensure that the model is trained on a balanced and diverse dataset, addressing both synthetic and real-world domain characteristics.

Comments 4: 4. The introduction of ATFL is abrupt and lacks mathematical formulation:

a. How does it differ from standard Focal Loss or other adaptive variants?
b. What thresholds are used, and how are they computed?

Response 4: We thank the reviewer for the comment. In the revised manuscript, we have expanded the description of Adaptive Threshold Focal Loss (ATFL) to clarify its distinction from standard Focal Loss. Specifically, ATFL adaptively adjusts the focusing factor based on class- and sample-specific thresholds, emphasizing hard-to-classify foreground targets while reducing the influence of abundant background samples. The thresholds are dynamically computed from the running averages of predicted probabilities per class within each mini-batch. This formulation allows ATFL to better handle class imbalance and varying sample difficulty, particularly in adverse weather conditions, thereby improving detection and classification performance compared to conventional Focal Loss.
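Since the response describes ATFL only in words, the sketch below shows one possible reading of it, not the authors' formulation. It assumes a per-class threshold kept as an exponential running average of the batch-mean true-class probability, and a focusing exponent that switches between two values depending on whether a sample falls at or below that threshold. The class name, `gamma_easy`, `gamma_hard`, `momentum`, and `init_threshold` are all hypothetical.

```python
import math


class AdaptiveThresholdFocalLoss:
    """Hypothetical focal-loss variant with per-class running-average thresholds."""

    def __init__(self, gamma_easy=2.0, gamma_hard=0.5, momentum=0.9, init_threshold=0.5):
        self.gamma_easy = gamma_easy      # strong down-weighting for easy samples
        self.gamma_hard = gamma_hard      # weak down-weighting for hard samples
        self.momentum = momentum
        self.init_threshold = init_threshold
        self.thresholds = {}              # class id -> running-average threshold

    def _update_thresholds(self, probs, labels):
        # Batch mean of the true-class probability, computed per class.
        sums, counts = {}, {}
        for p, c in zip(probs, labels):
            sums[c] = sums.get(c, 0.0) + p
            counts[c] = counts.get(c, 0) + 1
        for c in counts:
            old = self.thresholds.get(c, self.init_threshold)
            batch_mean = sums[c] / counts[c]
            self.thresholds[c] = self.momentum * old + (1.0 - self.momentum) * batch_mean

    def __call__(self, probs, labels):
        """probs[i] is the predicted probability of sample i's true class labels[i]."""
        self._update_thresholds(probs, labels)
        total = 0.0
        for p, c in zip(probs, labels):
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
            hard = p <= self.thresholds[c]
            gamma = self.gamma_hard if hard else self.gamma_easy
            total += -((1.0 - p) ** gamma) * math.log(p)
        return total / len(probs)


loss_fn = AdaptiveThresholdFocalLoss()
print(loss_fn([0.9, 0.2], [0, 1]))  # one easy sample of class 0, one hard sample of class 1
```

With a fixed exponent of 2 for every sample this reduces to the standard Focal Loss without class weighting, so the only difference in this sketch is the threshold-dependent choice of exponent.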

Comments 5: 5. Terms like “synergistically enhances,” “innovative,” and “significantly improves” are repeatedly used without quantitative or mechanistic support. The writing leans heavily on vague claims rather than grounded technical exposition. For instance, the phrase “enhancing sensitivity to discriminative channel value” is ambiguous and unsupported by ablation or visualization. The authors should replace rhetorical language with precise, measurable descriptions.

Response 5: We have carefully revised the manuscript to address the comment regarding vague and rhetorical expressions. Terms such as “synergistically enhances,” “innovative,” and “significantly improves” have been replaced with precise, measurable descriptions. Specifically, we now provide quantitative performance metrics (e.g., mAP improvements on KITTI and COCO2017 adverse weather subsets), mechanistic explanations of attention modules (channel and spatial attention computation), and supporting ablation studies and feature visualizations to substantiate the claimed improvements. These changes are reflected in both the Methodology and Experimental Results sections.

Comments 6: 6. Some related works are recommended for citation:

a. https://doi.org/10.1117/1.3556727
b. https://doi.org/10.1007/s11760-025-03868-4

Response 6: Thank you for your suggestion. The recommended references have been carefully reviewed and cited in the revised manuscript to strengthen the related work section.