MFA-YOLO: Multi-Scale Fusion and Attention-Based Object Detection for Autonomous Driving in Extreme Weather
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents MFA-YOLO, an improved object detection model based on YOLOv11, designed for autonomous driving in extreme weather conditions. The study addresses challenges such as detecting small, occluded, or background-blended objects under adverse conditions like fog, rain, and snow. Experiments on EWD, GAD, and UA-DETRAC datasets demonstrate superior performance in precision, recall, and mean average precision (mAP) compared to existing YOLO versions and other detection models. However, there are some major concerns that need to be addressed.
Major Concerns:
1. Limited Discussion on Real-Time Performance
• The paper emphasizes accuracy improvements but lacks a thorough computational efficiency analysis (e.g., FPS, latency, and memory usage).
• Practical deployment feasibility on edge devices remains unclear.
2. Baseline Comparisons
• The comparison is focused on YOLO versions, but no results are shown for transformer-based detection models like DETR or DINO.
• It would be beneficial to include a lightweight model comparison, such as MobileNet-based YOLO variants, to demonstrate efficiency.
3. Generalization Beyond Extreme Weather
• The model is designed for extreme weather, but there is no ablation study evaluating its generalization to normal driving conditions.
• Does the method introduce performance trade-offs in regular environments?
4. Loss Function Justification
• While BE-ATFL is introduced, no comparison is made with other focal loss variants (e.g., Generalized Focal Loss, Asymmetric Loss).
• It’s unclear if BE-ATFL provides significant gains beyond existing solutions.
5. Unclear Hyperparameter Choices and Baseline settings
• Details like batch size tuning, optimizer settings, and augmentation strategies could be further elaborated.
• Are you using pre-trained weights for the baseline models or training those models using the same data as the proposed method? If the baselines are not trained with the same data, then it would not be a fair comparison. If you are training the baseline models yourself, how are the hyperparameters configured for the baselines?
Author Response
We have prepared our responses in two forms: the text in this box and the attached PDF.
Responses to reviewers
Responses to reviewer 1:
Dear Reviewer 1:
First of all, we sincerely thank you for your careful review of the paper and for the comprehensive comments you provided to improve its quality. You have summarized our work accurately and offered many valuable, constructive comments that will help us further improve the paper.
We have carefully considered each of your concerns and revised the manuscript accordingly (the modified sections are highlighted in blue text, and line numbers are enabled in the manuscript so that you can locate them quickly and accurately). The specific responses are as follows.
Comment 1:
- Limited Discussion on Real-Time Performance
- The paper emphasizes accuracy improvements but lacks a thorough computational efficiency analysis (e.g., FPS, latency, and memory usage).
- Practical deployment feasibility on edge devices remains unclear.
Response 1:
Thank you very much for your valuable comment! We fully understand your concerns about the real-time performance and computational efficiency of the algorithm. In response, we have included additional comparative experiments on these aspects, from line 518 on page 15 to line 541 on page 16 of the manuscript (highlighted in blue text). These experiments assess performance using indicators such as latency, FPS, memory usage, and FLOPs. The results have been summarized in Table 4. Our findings demonstrate that MFA-YOLO exhibits low latency, making it suitable for deployment on edge devices.
Comment 2:
- Baseline Comparisons
- The comparison is focused on YOLO versions, but no results are shown for transformer-based detection models like DETR or DINO.
- It would be beneficial to include a lightweight model comparison, such as MobileNet-based YOLO variants, to demonstrate efficiency.
Response 2:
We appreciate your thoughtful consideration. In response, we have added a performance comparison with the Transformer-based RT-DETR algorithm, which is now included in Tables 1, 2, and 3 (highlighted in blue text on pages 13 and 14). We also expanded on this comparison in the subsequent new experiments. Additionally, we have cited the literature on the DETR algorithm in the introduction (lines 43 to 45 on page 2, highlighted in blue text) to further enrich our paper.
We completely agree with your suggestion that comparing our model with a lightweight, MobileNet-based model would help demonstrate its efficiency. To address this, we have added the YOLO-MN3 model to Table 4 (highlighted in blue text on page 16) in the comparison of computational efficiency and real-time performance. YOLO-MN3 is a variant of YOLOv11 in which the original backbone network is replaced with MobileNetV3.
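To illustrate what such a backbone swap involves, a minimal PyTorch sketch is shown below. It is only an assumption-laden illustration: the class name `MobileNetV3Backbone`, the stage split points, and the use of torchvision are our choices for this sketch, not the authors' actual YOLO-MN3 implementation, which keeps YOLOv11's neck and detection head inside the YOLO framework.

```python
# Illustrative sketch only -- not the authors' actual YOLO-MN3 code.
# It shows the general idea of exposing MobileNetV3 as a multi-scale
# backbone that a detection neck/head (e.g., YOLOv11's) could consume.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small


class MobileNetV3Backbone(nn.Module):
    """Wraps torchvision's MobileNetV3-Small so it emits three feature maps
    (roughly strides 8, 16, and 32) for an FPN-style neck."""

    def __init__(self):
        super().__init__()
        feats = mobilenet_v3_small(weights=None).features
        # Split the feature extractor where the spatial resolution halves,
        # giving P3/P4/P5-like outputs.
        self.stage1 = feats[:4]    # ~stride 8
        self.stage2 = feats[4:9]   # ~stride 16
        self.stage3 = feats[9:]    # ~stride 32

    def forward(self, x):
        p3 = self.stage1(x)
        p4 = self.stage2(p3)
        p5 = self.stage3(p4)
        return p3, p4, p5


if __name__ == "__main__":
    backbone = MobileNetV3Backbone()
    feature_maps = backbone(torch.randn(1, 3, 640, 640))
    print([f.shape for f in feature_maps])  # three maps at decreasing resolution
```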
Comment 3:
- Generalization Beyond Extreme Weather
- The model is designed for extreme weather, but there is no ablation study evaluating its generalization to normal driving conditions.
- Does the method introduce performance trade-offs in regular environments?
Response 3:
We sincerely apologize for any misunderstanding caused by our unclear expression. To clarify, the three datasets we used correspond to different driving scenarios: the EWD dataset covers autonomous driving in extreme weather, the GAD dataset covers autonomous driving in normal weather, and the UA-DETRAC dataset focuses on vehicle recognition under overhead traffic monitoring. We have added further explanation on lines 429 to 433 on page 12 of the paper (highlighted in blue text) to provide better clarity.
The results obtained from the experiments on these three datasets are summarized in Tables 1, 2, and 3, respectively. The outstanding performance of our algorithm across these three distinct scenarios demonstrates the effectiveness of the improvements we made and confirms that they do not lead to performance degradation under normal conditions. This is discussed in lines 488 to 500 on page 14 of the paper.
Comment 4:
- Loss Function Justification
- While BE-ATFL is introduced, no comparison is made with other focal loss variants (e.g., Generalized Focal Loss, Asymmetric Loss).
- It’s unclear if BE-ATFL provides significant gains beyond existing solutions.
Response 4:
Thank you very much for your valuable comments! We have revised the paper based on your suggestions, which significantly strengthens the case for the proposed BE-ATFL loss function.
Specifically, we added ablation experiments on the BE-ATFL loss function from line 574 on page 17 to line 582 on page 18 (highlighted in blue text), with the new experiments summarized in Table 7. We also included comparisons with the Generalized Focal Loss and Asymmetric Loss variants, as you recommended. The results of these ablation experiments demonstrate that the proposed improvement is effective. Your feedback has been instrumental in strengthening this work, and we would like to express our sincerest thanks once again!
Comment 5:
- Unclear Hyperparameter Choices and Baseline settings
- Details like batch size tuning, optimizer settings, and augmentation strategies could be further elaborated.
- Are you using pre-trained weights for the baseline models or training those models using the same data as the proposed method? If the baselines are not trained with the same data, then it would not be a fair comparison. If you are training the baseline models yourself, how are the hyperparameters configured for the baselines?
Response 5:
Thank you very much for your valuable comments! They have helped us provide important details for the reader, and we apologize for the previous lack of clarity in our expression. To address this, we have added a detailed explanation in blue text on page 12, lines 434 to 461, outlining the hyperparameters and data augmentation strategies used in our experiments. To ensure fairness, all algorithms were trained with the same training hyperparameters and the same dataset, and none of them used pre-trained weights.
Thank you very much for your review and suggestions on our work; such feedback is very important for improving the quality of the manuscript. We believe the paper will be much stronger after the revisions made in response to your comments!
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
1. The paper primarily benchmarks against CNN-based object detectors but does not compare against transformer-based methods like DETR, Deformable DETR, or Swin Transformer. Given the increasing adoption of vision transformers in autonomous driving, evaluating MFA-YOLO against these architectures would strengthen its positioning. Additionally, discussing whether self-attention mechanisms could be integrated into MFA-YOLO would provide a forward-looking perspective.
2. The study focuses primarily on detection accuracy (mAP, precision, recall) but does not evaluate the computational efficiency of MFA-YOLO. Given that autonomous driving applications require real-time inference, the paper should report:
- Inference speed (FPS) on edge devices and GPUs
- Memory footprint and FLOPs comparison with baselines
- Training time and model convergence trends
3. The paper lacks critical details on dataset preprocessing, including augmentation techniques, data cleaning procedures, and potential biases. Since adverse weather conditions can significantly impact sensor reliability, it is essential to clarify:
- How illumination, motion blur, or occlusions were handled.
- If synthetic data augmentation (e.g., GAN-based or style transfer methods) was considered.
- How class imbalances in small-object detection were managed.
4. The evaluation focuses on three datasets (EWD, GAD, UA-DETRAC) but does not explore generalization under:
- Domain shifts (e.g., cross-dataset evaluation).
- Real-world deployment scenarios (e.g., sensor noise, adversarial conditions).
- Weather-specific analysis (e.g., varying fog density, snowfall intensity).
Author Response
We have prepared two responses: text in the box and PDF.
Responses to reviewers
Responses to reviewer 2:
Dear Reviewer 2:
First of all, we sincerely thank you for your careful review of the paper and for your comprehensive comments, which are very valuable and constructive and will help improve the quality of the paper.
We are deeply grateful for your recognition of the innovation and experimental completeness of this work. We have carefully considered each of your concerns and revised the manuscript accordingly (the modified sections are highlighted in blue text, and line numbers are enabled in the manuscript so that you can locate them quickly and accurately). The specific responses are as follows.
Comment 1:
1. The paper primarily benchmarks against CNN-based object detectors but does not compare against transformer-based methods like DETR, Deformable DETR, or Swin Transformer. Given the increasing adoption of vision transformers in autonomous driving, evaluating MFA-YOLO against these architectures would strengthen its positioning. Additionally, discussing whether self-attention mechanisms could be integrated into MFA-YOLO would provide a forward-looking perspective.
Response 1:
Thank you very much for your valuable comment! We greatly appreciate your thorough consideration of the comparison. In response, we have added a comparison with the Transformer-based RT-DETR algorithm in our experiments, which is summarized in Tables 1, 2, and 3 (highlighted in blue text on pages 13 and 14). Additionally, we have included further comparisons with the RT-DETR algorithm in the subsequent experiments. We have also cited relevant literature on the DETR algorithm in the introduction (page 2, lines 43 to 45, highlighted in blue text) to enrich the content of our paper.
Furthermore, we apologize for any misunderstanding caused by our unclear expression. In fact, the paper does include a module based on the self-attention mechanism, namely the HAFM module introduced on page 10. To clarify, we have added the mathematical formula for this attention, shown as Formula 9 on page 10 (highlighted in blue text).
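For context only, the canonical scaled dot-product attention is reproduced below as background; Formula 9 in the manuscript describes the attention used in the HAFM module and may differ in its details.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```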
Comment 2:
- The study focuses primarily on detection accuracy (mAP, precision, recall) but does not evaluate the computational efficiency of MFA-YOLO. Given that autonomous driving applications require real-time inference, the paper should report:
- Inference speed (FPS) on edge devices and GPUs
- Memory footprint and FLOPs comparison with baselines
- Training time and model convergence trends
Response 2:
Thank you very much for your valuable comment! We fully understand your concerns regarding the algorithm's computational efficiency and real-time performance. In response, we have added new experiments to this section, from line 518 on page 15 to line 541 on page 16 (highlighted in blue text). The comparison in these new experiments covers the key indicators you raised, such as FPS, memory usage, and FLOPs. Additionally, we report the algorithm's training time and convergence trend from line 583 to line 599 on page 18 of the paper.
Comment 3:
3. The paper lacks critical details on dataset preprocessing, including augmentation techniques, data cleaning procedures, and potential biases. Since adverse weather conditions can significantly impact sensor reliability, it is essential to clarify:
- How illumination, motion blur, or occlusions were handled.
- If synthetic data augmentation (e.g., GAN-based or style transfer methods) was considered.
- How class imbalances in small-object detection were managed.
Response 3:
First of all, we apologize for not describing the data preprocessing techniques clearly in the paper; the improvements made based on your comments will help readers better understand these details. We have now added a detailed description of the data preprocessing techniques on page 12, lines 443 to 461 (highlighted in blue text), explaining how we handle issues such as illumination and motion blur. We hope these additions meet your expectations.
We also apologize for not explaining more clearly how we manage class imbalance. In fact, the BE-ATFL loss function we proposed is designed to address this issue, and we regret any confusion caused by our insufficient description. To clarify, we have added a more comprehensive explanation of the BE-ATFL loss function in lines 400 to 405 on page 11, and we have included new ablation experiments on BE-ATFL in lines 574 to 582 on page 17 (highlighted in blue text) to demonstrate the effectiveness of our improvements. We hope the changes are now clearer and more satisfactory. Thank you again for pointing out these areas for improvement.
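As background only (this is the canonical focal loss, quoted here because BE-ATFL is discussed as a focal-loss-style remedy for class imbalance; it is not the BE-ATFL formulation, which is defined in the manuscript):

```latex
\mathrm{FL}(p_t) = -\,\alpha_t \,(1 - p_t)^{\gamma} \log(p_t)
```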
Additionally, we truly appreciate your suggestion to use GAN-based synthetic data augmentation; it is a highly constructive and valuable idea. We have noted it as a potential direction for future research in lines 642 to 647 on page 20 of the paper (highlighted in blue text), as we believe it presents a great opportunity for innovation in future work. Once again, we would like to express our deepest gratitude.
Comment 4:
4. The evaluation focuses on three datasets (EWD, GAD, UA-DETRAC) but does not explore generalization under:
- Domain shifts (e.g., cross-dataset evaluation).
- Real-world deployment scenarios (e.g., sensor noise, adversarial conditions).
- Weather-specific analysis (e.g., varying fog density, snowfall intensity).
Response 4:
Thank you very much for this valuable and constructive comment! Your consideration is extremely thoughtful: not only did you consider varying fog density under specific weather conditions, but you also reflected on the broader applicability of our algorithm to other fields. We fully agree that such additions significantly strengthen the evidence for our algorithm's generalization ability and robustness.
In response, we added experiments in other fields, specifically industrial defect detection, from lines 605 to 625 on page 19 of the paper (highlighted in blue text). Our algorithm was applied to rail defect detection, and the results are summarized in Table 8. As shown in the table, our algorithm performs well in this new field, demonstrating its strong generalization capability.
Additionally, we conducted experiments on specific weather conditions, in particular varying fog density, from line 542 on page 16 to line 560 on page 17 (highlighted in blue text). In simulations with different fog levels, our algorithm still performs effectively, demonstrating its robustness under challenging conditions.
We also recognize the importance of considering additional real-world factors, such as sensor noise, adversarial conditions, and snowfall intensity, in future work. We believe this is an important and innovative direction and have highlighted it as a potential future research area in lines 642 to 647 on page 20 of the paper.
Thank you very much for your review and suggestions on our work; such feedback is very important for improving the quality of the manuscript. We believe that revising the paper according to your comments has greatly improved our work!
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you for the revision and response. The reviewer’s concerns have been addressed.
One typo in Table 4: “Faster-CNN” -> “Faster R-CNN.”
Author Response
Responses to reviewer 1:
Dear Reviewer 1:
We sincerely appreciate you pointing out this mistake in our paper, and we apologize for the incorrect naming of the algorithm. The necessary correction has been made, as shown in Table 4 on page 16 (highlighted in blue text).
We would like to sincerely thank you once again for your valuable and constructive feedback on our work! Your comments have been extremely helpful in enhancing the quality of our paper, and we believe the revisions based on your suggestions have greatly improved it!
Best regards
The Authors