Article
Peer-Review Record

Survey on Image-Based Vehicle Detection Methods

World Electr. Veh. J. 2025, 16(6), 303; https://doi.org/10.3390/wevj16060303
by Mortda A. A. Adam * and Jules R. Tapamo
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 4: Anonymous
Reviewer 5:
Reviewer 6:
Reviewer 7:
Submission received: 6 April 2025 / Revised: 23 May 2025 / Accepted: 25 May 2025 / Published: 29 May 2025
(This article belongs to the Special Issue Vehicle Safe Motion in Mixed Vehicle Technologies Environment)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper provides a comprehensive overview of vehicle detection methods, but some sections could benefit from clearer transitions and subheadings to improve readability.

1. The abstract could better highlight the paper’s unique contribution (e.g., systematic comparison of classical vs. deep learning methods).

2. In the Deep Learning Methods section: The YOLO series discussion (Table 4) is thorough, but a timeline graphic should be added to show the architectural evolution (e.g., YOLOv1 to v10).

3. In Table 8: dataset size (number of images/videos) and resolution should be included to help readers assess suitability for their needs.

Author Response

Response to Reviewer 1 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #1, Comment # 1: “The abstract could better highlight the paper’s unique contribution (e.g., systematic comparison of classical vs. deep learning methods).”
Author response: Thank you for the helpful suggestion. We agree that the original abstract did not clearly emphasize the unique contribution of the paper. Therefore, we have revised the abstract to explicitly highlight the systematic comparison between classical machine learning methods and modern deep learning-based techniques for vehicle detection. The new version also outlines the categorization of deep learning approaches into one-stage and two-stage detectors and summarizes the paper’s broader contributions to datasets, evaluation metrics, and future directions.
Reviewer #1, Comment #2: “In the Deep Learning Methods Section: The YOLO series discussion (Table 4) is thorough, but a timeline graphic should be added to show architectural evolution (e.g., YOLOv1 to v10).”
Author Response: Thank you for your insightful suggestion. We agree that a visual representation of the YOLO series' evolution would enhance the reader’s understanding of its architectural progression. Accordingly, we have added Figure 6 to the manuscript, titled "YOLO Versions Timeline", which illustrates the development of YOLO-based models from YOLOv1 to YOLOv12. This figure includes both official versions and major community-developed variants such as Scaled-YOLOv4, YOLOX, PP-YOLO, and YOLO-NAS. It complements the tabular summary presented in Table 4, providing a chronological and visual overview of the major milestones in YOLO's evolution. The new figure is located on Page 10 and is referenced within the discussion of YOLO series algorithms.
Reviewer #1, Comment #3: “In Table 8: dataset size (number of images/videos) and resolution should be included to help readers assess suitability for their needs.”
Author Response: Thank you for this helpful suggestion. To improve the utility of Table 8 for readers, we have added two new columns: Number of Images/Videos and Typical Image Resolution. These additions provide a clearer understanding of the dataset scale and visual quality, which are critical factors in evaluating dataset suitability for different vehicle detection models. The updated table appears as Table 8 on Page 14 of the revised manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  1. The resolution of the sub-figures in Figure 1 is relatively low. It is recommended to enhance the resolution.
  2. The first paragraph of the Introduction is overly lengthy and logically disjointed. Additionally, apart from the Introduction, none of the other sections are numbered. It is advised to restructure the manuscript to improve readability.
  3. A significant number of formulas lack numbering, leading to a chaotic format. A comprehensive format optimization throughout the manuscript is recommended.
  4. What specific metric does "accuracy" in Table 3 represent? Does it refer to precision, recall, or another evaluation indicator? Please clarify the data source.
  5. The text in Figure 4 and Figure 5 is too small. It is suggested to optimize the font size for better legibility.
  6. The authors review one-stage detectors for vehicle detection but exclude key methods such as CenterNet and FCOS, and lack technical details on architectural modifications (e.g., activation functions, feature fusion strategies). It is recommended to supplement comparative analysis of the omitted algorithms and clarify the principles of key technical innovations to deepen the review’s depth.
  7. The authors’ summary of two-stage algorithms is insufficient. It is advised to add an analysis and synthesis of algorithms like Mask R-CNN and Cascade R-CNN.
  8. This manuscript lacks demonstration and analysis of the detection performance of various algorithms in real-world scenarios, as well as an in-depth discussion on the applicable contexts of different methods and their future development trends.

Author Response

Response to Reviewer 2 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #2, Comment #1: “The resolution of the sub-figures in Figure 1 is relatively low. It is recommended to enhance the resolution.”
Author Response: Thank you for pointing this out. We acknowledge that the resolution of the original Figure 1 was suboptimal. In response, we have replaced Figure 1 with another higher-resolution figure to improve visual clarity and ensure that all elements, particularly text and system components, are easily readable.
Reviewer #2, Comment #2: “The first paragraph of the Introduction is overly lengthy and logically disjointed. Additionally, apart from the Introduction, none of the other sections are numbered. It is advised to restructure the manuscript to improve readability.”
Author Response: Thank you for your constructive feedback. We have carefully revised the Introduction section to improve its clarity, structure, and logical flow. Specifically, the originally lengthy and dense paragraph has been divided into five thematically focused paragraphs. These now follow a clearer narrative that transitions from real-world motivation to technical challenges, classical and deep learning methods, and finally to the paper’s contributions. In addition, we have ensured that all main sections and subsections throughout the manuscript are now properly numbered to improve organization, readability, and compliance with MDPI formatting guidelines.
Reviewer #2, Comment 3: “A significant number of formulas lack numbering, leading to a chaotic format. A comprehensive format optimization throughout the manuscript is recommended.”
Author Response: Thank you for this important observation. We have carefully reviewed all mathematical expressions in the manuscript and addressed the formatting inconsistencies. Specifically, all major equations—including those related to Haar-like features, HOG, LBP, SIFT, SVM, and AdaBoost—have now been properly numbered using the equation environment. All key formulas are also clearly structured and formatted for consistent citation within the manuscript. These improvements enhance clarity and readability, particularly in the Vehicle Detection Methods section.
Reviewer #2, Comment #4: “What specific metric does 'accuracy' in Table 3 represent? Does it refer to precision, recall, or another evaluation indicator? Please clarify the data source.”
Author Response: Thank you for your observation. To address this concern and improve clarity, we have revised Table 3 by changing the column header from “Accuracy” to “Performance.” This aligns with the terminology used in other tables within the manuscript. We have also added a clarification to the table caption, noting that “Performance” refers to the primary evaluation metric reported in each referenced study (e.g., accuracy, precision, or mAP). These updates ensure consistent terminology and eliminate potential ambiguity regarding the source and type of performance measurement.
Reviewer #2, Comment #5: “The text in Figure 4 and Figure 5 is too small. It is suggested to optimize the font size for better legibility.”
Author Response: Thank you for your feedback. We have addressed this issue by optimizing the display size of Figure 4 and Figure 5 to ensure all text elements are clearly visible and legible. The updated figures now provide improved readability in both digital and print formats and appear as Figure 4 and Figure 5 on Pages 5 and 6 of the revised manuscript.
Reviewer #2, Comment #7: “The discussion on two-stage detection methods should be expanded, and advanced models such as Mask R-CNN and Cascade R-CNN should be incorporated.”
Author Response: Thank you for this helpful recommendation. In response, we have expanded the section on two-stage object detection methods to provide a more detailed overview of the R-CNN family. Specifically, we have included descriptions of Mask R-CNN and Cascade R-CNN, highlighting their architectural differences and relevance to vehicle detection tasks. We also added appropriate citations and clarified their benefits in scenarios involving dense scenes or high-precision requirements. These updates are presented in the revised Two-Stage Methods subsection and are supported by updated tables and a new figure that illustrates the architecture.
Reviewer #2, Comment #8: “This manuscript lacks demonstration and analysis of the detection performance of various algorithms in real-world scenarios, as well as an in-depth discussion on the applicable contexts of different methods and their future development trends.”
Author Response: Thank you for highlighting this important point. To address this, we have expanded the Application Areas section by adding a new paragraph that discusses the real-world performance trade-offs of different detection models. We contrast the use of one-stage and two-stage architectures in various application settings and highlight the influence of environmental factors such as occlusion, lighting, and hardware constraints. Additionally, we outline future trends, including edge deployment, multimodal sensor fusion, and transformer-based approaches. These additions help contextualize the reviewed methods and provide a forward-looking perspective aligned with emerging technological directions.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The main question of a paper is real situation and future on vehicle detection methods

Introduction. This paper provides an overview of vehicle detection methods over the past two decades. The authors describe in detail both classical methods based on traditional machine learning algorithms and modern methods using deep learning. The paper also discusses performance evaluation metrics and the main technical challenges faced by researchers in this field.
The topic is relevant and suitable for the journal.
The paper presents new data for the scientific field:
1. Performance evaluation metrics are combined
- Precision: Defined as the proportion of correct positive predictions among all positive predictions. Formula: \( P = \frac{TP}{TP + FP} \).
- Recall: Defined as the proportion of correct positive predictions among all actually positive objects. Formula: \( R = \frac{TP}{TP + FN} \).
- Intersection over Union (IoU): Measures the overlap between the ground-truth label and the predicted label. Formula: \( \text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \). (A short worked example of these metrics is given after this list.)
2. Challenges and future research directions are identified
- Data and annotation quality: High-quality annotations are necessary to train accurate vehicle detection models. Errors in annotations can reduce the reliability of detection.
- Complex road scenes: Dense road conditions with frequent occlusions, scale variations, and overlapping objects pose challenges. Multi-level detection methods and context-aware features are investigated.
- Environmental conditions: Lighting changes, shadows, and adverse weather conditions (rain, fog) significantly reduce image quality and detection accuracy. Robust preprocessing and domain adaptation methods are needed.
- Computational efficiency: Achieving real-time detection with limited computational resources, especially on edge computing devices or drones, remains a challenging task. Lightweight models and compression methods are explored.
- Generalization and adaptability: Models often face difficulties when applied to unknown environments or datasets. Transfer learning, data augmentation, and continuous learning methods are proposed.
- Integration with V2X systems: Future research should explore the integration of detection models with vehicle-to-everything (V2X) communication systems to improve situational awareness and cooperative detection in connected environments.
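To make these metrics concrete, the following minimal Python sketch (an illustration written for this record, not code from the manuscript) computes IoU for axis-aligned boxes and derives precision and recall from a greedy matching of predictions to ground truth; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are illustrative assumptions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    """Precision and recall from greedy one-to-one matching at a given IoU threshold."""
    matched_gt, tp = set(), 0
    for pb in pred_boxes:
        best_j, best_iou = -1, 0.0
        for j, gb in enumerate(gt_boxes):
            if j not in matched_gt and iou(pb, gb) > best_iou:
                best_j, best_iou = j, iou(pb, gb)
        if best_iou >= iou_thr:
            tp += 1
            matched_gt.add(best_j)
    fp = len(pred_boxes) - tp   # unmatched predictions
    fn = len(gt_boxes) - tp     # missed ground-truth vehicles
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# One predicted vehicle box against one ground-truth box
print(precision_recall([(10, 10, 110, 60)], [(20, 15, 120, 65)], iou_thr=0.5))
```

Raising iou_thr toward 0.95 makes the matching stricter, which is why metrics averaged over a range of thresholds (e.g., mAP@0.5:0.95) are lower than those computed at a single loose threshold (mAP@0.5).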
Conclusion
The paper provides a comprehensive overview of vehicle detection methods, ranging from classical algorithms to state-of-the-art deep learning models. The authors describe in detail the performance evaluation metrics and the main technical challenges faced by researchers. The work is a valuable resource for vehicle detection practitioners, providing extensive insights and directions for future research.
Evaluation
The article is written clearly and structured, with a detailed description of the methods and metrics. The authors successfully highlight key challenges and future research directions, making the work useful for both novice and experienced researchers. However, for a more complete understanding, it would be useful to add examples of specific studies and their results, as well as a more detailed comparison of the different methods.

Author Response

Response to Reviewer 3 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #3, Comment #1: “It would be useful to add examples of specific studies and their results, as well as a more detailed comparison of the different methods.”
Author Response: Thank you for the encouraging feedback and helpful suggestions. In response, we have updated the manuscript by adding specific examples of vehicle detection studies, including the datasets used, performance metrics such as mAP and FPS, and notable strengths of the models. A comparative summary table has also been included to facilitate a clearer understanding of the trade-offs between different detection methods. These additions enrich the review and provide practical context for the discussed techniques.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

  This paper provides a thorough review of vehicle detection methods for applications in road surveillance, intelligent transportation, and autonomous driving. It addresses challenges like varying vehicle shapes, scene complexity, and occlusion, which affect real-time detection. The review covers both classical machine learning and deep learning approaches, highlighting the advantages of deep learning with large datasets. The paper discusses one-stage and two-stage detection frameworks, noting trade-offs in speed and accuracy. Despite advancements, real-time detection remains a challenge, and the review concludes with open research issues for future exploration. The paper is well-structured and logically organized, presenting a clear approach. However, there are areas that require improvement.

  1. Each section of the manuscript should be numbered to provide readers with a clearer understanding of the overall structure of the paper.
  2. The methodology section on vehicle detection would benefit from the inclusion of a detailed logical framework diagram. The current description in Figure 1 is overly simplistic and needs further elaboration.
  3. The authors' review of the methodology is not comprehensive enough. Some newer methods on vehicle detection, such as the Transformer architecture and Generative Adversarial Networks (GANs), should also be discussed in more detail.
  4. In the background section, the need for vehicle detection should be introduced not only in the context of the emergence of intelligent transportation systems but also by highlighting the rapid development of autonomous driving technologies. The related work can refer to “Enhancing High-Speed Cruising Performance of Autonomous Vehicles Through Integrated Deep Reinforcement Learning Framework, IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 1, pp. 835-848, Jan. 2025”.

Author Response

Response to Reviewer 4 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #4, Comment #1: Each section of the manuscript should be numbered to provide readers with a clearer understanding of the overall structure of the paper.
Author Response: We appreciate the reviewer’s valuable suggestion. To improve the structural clarity and enhance readability, we have incorporated section numbering throughout the manuscript. All major sections and subsections are now clearly numbered (e.g., 1. Introduction, 2. Classical Methods, 3. Deep Learning Methods, etc.), enabling easier navigation and a more coherent presentation of the content.
Reviewer #4, Comment #2: ” The methodology section on vehicle detection would benefit from the inclusion of a detailed logical framework diagram. The current description in Figure 1 is overly simplistic and needs further elaboration.”
Author Response: Thank you for your constructive suggestion. In response, we have revised and enhanced Figure 1 to present a more detailed logical framework for image-based vehicle detection. The updated figure, now titled “Overview of the Image-Based Vehicle Detection Process”, illustrates the entire pipeline from data acquisition and preprocessing (e.g., frame extraction and image enhancement) to detection stages using either classical or deep learning-based methods. The structure highlights the interaction between core components such as image normalization, feature extraction, model inference, and output interpretation. This enhanced diagram improves conceptual clarity and provides a comprehensive visual summary of the detection methodology adopted in modern vehicle detection systems. The figure caption and related in-text explanations have also been revised to reflect this change (see Section 1 and Section 2 of the revised manuscript).
Reviewer #4, Comment #3: “The authors' review of the methodology is not comprehensive enough. Some newer methods on vehicle detection, such as the Transformer architecture and Generative Adversarial Networks (GANs), should also be discussed in more detail.”
Author Response: We appreciate this valuable recommendation. In the revised manuscript, we have expanded the methodology review by incorporating two new subsubsections that focus on Transformer-based and GAN-based vehicle detection methods. Specifically:

Transformer-Based Methods: We introduce a dedicated subsection discussing the role of self-attention mechanisms in object detection, highlighting key models such as DETR, Deformable DETR, RT-DETR, Swin Transformers, and Co-DETR. We also include recent advancements like PLC-Fusion for multimodal vehicle detection. These models are discussed in the context of autonomous driving and surveillance scenarios, and relevant citations have been added to support the discussion.

GAN-Based Methods: We have also added a subsection summarizing recent works that utilize GANs for vehicle detection. This includes applications such as image enhancement in poor lighting, domain adaptation (e.g., day-to-night style transfer), and synthetic data generation for training robust detectors. The section cites relevant studies that demonstrate how GAN-based techniques improve detection performance in complex and low-data environments.
These additions aim to provide a more comprehensive and up-to-date overview of the state of the art in vehicle detection. The new content can be found in Section 3 under “Advanced Deep Learning-Based Methods” in the revised manuscript.
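As an illustrative complement to the new Transformer-based subsection (a generic sketch written for this response, not an implementation from any cited paper), the snippet below applies a COCO-pretrained DETR checkpoint from the Hugging Face transformers library to a traffic image and keeps only vehicle-related classes; the image path and the 0.7 confidence threshold are assumptions.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# COCO-pretrained DETR; the COCO label set includes car, truck, bus, and motorcycle
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("traffic_scene.jpg").convert("RGB")  # hypothetical input frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw set predictions into boxes/labels/scores above a chosen threshold
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7
)[0]

vehicle_classes = {"car", "truck", "bus", "motorcycle"}
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    name = model.config.id2label[label.item()]
    if name in vehicle_classes:
        print(f"{name}: {score:.2f} at {[round(v) for v in box.tolist()]}")
```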
Reviewer #4, Comment #4: “In the background section, the need for vehicle detection should be introduced not only in the context of the emergence of intelligent transportation systems but also by highlighting the rapid development of autonomous driving technologies. The related work can refer to “Enhancing High-Speed Cruising Performance of Autonomous Vehicles Through Integrated Deep Reinforcement Learning Framework, IEEE Transactions on Intelligent Transportation Systems, vol. 26, no. 1, pp. 835-848, Jan. 2025”.”
Author Response: We thank the reviewer for the valuable suggestion. In the revised manuscript, we have updated the introduction to explicitly emphasize the role of vehicle detection in the context of the rapid advancement of autonomous driving technologies, beyond its traditional use in intelligent transportation systems (ITS). We have also cited the recommended reference—“Enhancing High-Speed Cruising Performance of Autonomous Vehicles Through Integrated Deep Reinforcement Learning Framework”—to highlight how advanced perception systems, such as vehicle detection models, integrate with decision-making modules in autonomous vehicles to enable efficient and safe high-speed navigation. This modification strengthens the motivation for vehicle detection as a critical component in both perception and control pipelines in autonomous driving systems.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors
  1. The article mentions that single-stage detectors like the YOLO series excel in real-time performance, but complex conditions in actual traffic scenarios (e.g., occlusion, lighting variations) may impact detection speed. How can model architectures or algorithms (e.g., lightweight design, hardware acceleration) be further optimized to maintain high-frame-rate real-time detection under extreme conditions?

  2. The article notes that two-stage detectors (e.g., Faster R-CNN) achieve high accuracy but are slower, while single-stage detectors (e.g., YOLOv10) are more suitable for real-time scenarios. Could future research explore hybrid architectures (e.g., combining two-stage region proposals with single-stage fast regression) or novel attention mechanisms (e.g., Transformer) to balance both accuracy and speed?

  3. The integration of vehicle detection in V2X (vehicle-to-everything) and autonomous driving is mentioned but not discussed in depth. How can detection models be combined with communication technologies (e.g., 5G) to enable multi-vehicle cooperative perception or edge-cloud collaborative computing? Are there any practical case studies or technical bottlenecks?

  4. The article does not cover detection methods involving emerging sensors (e.g., millimeter-wave radar, LiDAR) fused with vision. Could multi-modal fusion become a key direction for improving detection robustness in complex scenarios? What is the current progress in this area?

Author Response

Response to Reviewer 5 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #5, Comment #1: “The article mentions that single-stage detectors like the YOLO series excel in real-time performance, but complex conditions in actual traffic scenarios (e.g., occlusion, lighting variations) may impact detection speed. How can model architectures or algorithms (e.g., lightweight design, hardware acceleration) be further optimized to maintain high-frame-rate real-time detection under extreme conditions?”
Author Response: We appreciate the reviewer’s insightful comment. In response, we have added a new subsection titled “Model Optimization for Extreme Real-World Conditions” to the Challenges and Future Research section. This addition discusses several recent advancements in model architecture design, multi-feature fusion, and hardware-aware optimization strategies that have been proposed to sustain high-frame-rate real-time performance under challenging conditions such as occlusion and poor lighting. Specifically, we now include discussions on the integration of lightweight backbones like MobileNet and GhostNet, hybrid architectures combining YOLOv4 and EfficientDet, multi-scale attention modules such as those in SYGNet and GS-YoloNet, and hardware-accelerated inference using TensorRT-optimized models. We also reference a comparative analysis highlighting how MobileNet outperforms heavier models like ConvNext in edge deployment scenarios. These enhancements ensure both robustness and computational efficiency in real-time vehicle detection systems.
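To give a concrete flavour of the hardware-aware optimization mentioned above, the sketch below exports a lightweight YOLO checkpoint to a half-precision TensorRT engine using the Ultralytics API and runs it on a video stream; the model variant, image size, and file names are illustrative assumptions rather than the configurations evaluated in the cited works.

```python
from ultralytics import YOLO

# Small backbone chosen for edge deployment (illustrative choice)
model = YOLO("yolov8n.pt")

# Export to an FP16 TensorRT engine for GPU-accelerated inference (requires TensorRT installed)
model.export(format="engine", half=True, imgsz=640)

# Run the optimized engine frame by frame on a traffic video
engine = YOLO("yolov8n.engine")
for result in engine.predict(source="traffic.mp4", stream=True):
    print(f"{len(result.boxes)} detections, {result.speed['inference']:.1f} ms inference")
```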
Reviewer #5, Comment #2: “The article notes that two-stage detectors (e.g., Faster R-CNN) achieve high accuracy but are slower, while single-stage detectors (e.g., YOLOv10) are more suitable for real-time scenarios. Could future research explore hybrid architectures (e.g., combining two-stage region proposals with single-stage fast regression) or novel attention mechanisms (e.g., Transformer) to balance both accuracy and speed?”
Author Response: Thank you for your valuable comment. We fully agree with your insight. To address this point, we have expanded the introduction and future directions to include recent advancements in hybrid object detection architectures and Transformer-based models. For example, Soviany et al. (ECCV Workshops 2018) proposed a framework that balances the accuracy-speed trade-off by integrating single-stage and two-stage detector features. Similarly, RT-DETR (Real-Time Detection Transformer) and its variants have demonstrated that attention-based end-to-end models can rival classical detectors in both speed and precision by leveraging sparse query mechanisms. Additionally, models combining YOLOv4 with EfficientDet have shown improvements in real-time detection performance for autonomous vehicles.
Reviewer #5, Comment #3: “The integration of vehicle detection in V2X (vehicle-to-everything) and autonomous driving is mentioned but not discussed in depth. How can detection models be combined with communication technologies (e.g., 5G) to enable multi-vehicle cooperative perception or edge-cloud collaborative computing? Are there any practical case studies or technical bottlenecks?”
Author Response: We thank the reviewer for raising this important point. In response, we have added a new subsection titled “Integration with V2X and Edge-Cloud Collaborative Systems” under the deep learning methods section. This addition discusses how real-time vehicle detection models can be integrated with V2X communication technologies, particularly 5G, to enable cooperative perception across multiple vehicles. We explain how such systems benefit from fast data exchange, improved situational awareness, and enhanced detection accuracy through shared sensing. The subsection also elaborates on how edge–cloud collaborative computing can support low-latency detection pipelines by handling time-critical inference on edge nodes while leveraging cloud resources for global optimization. We further discuss practical deployment scenarios where these architectures have been evaluated and highlight common technical challenges such as handover stability, sensor fusion inconsistencies, bandwidth limitations, and the computational trade-offs between edge and cloud processing.
Reviewer #5, Comment #4: “The article does not cover detection methods involving emerging sensors (e.g., millimeter-wave radar, LiDAR) fused with vision. Could multi-modal fusion become a key direction for improving detection robustness in complex scenarios? What is the current progress in this area?”
Author Response: We thank the reviewer for this valuable suggestion. To address this point, we have added a new subsection titled “Emerging Multi-Modal Vehicle Detection Techniques” under the section discussing vehicle detection methods. This subsection introduces recent progress in fusing visual information with data from complementary sensors such as LiDAR and millimeter-wave radar. We discuss how multi-modal fusion techniques enhance robustness in challenging conditions like occlusion, low light, or inclement weather. The added content also outlines fusion strategies (e.g., early, deep, and late fusion) and highlights key challenges in practical deployment, such as sensor calibration, real-time constraints, and synchronization. These additions provide a broader view of current research directions and help enrich the manuscript’s treatment of detection robustness in intelligent transportation scenarios.
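The late-fusion strategy mentioned above can be illustrated with a few lines of plain Python (a toy sketch written for this response; real systems first project LiDAR/radar detections into the image plane and use calibrated score fusion): camera and LiDAR detections are matched by IoU, agreeing detections have their confidences combined, and unmatched detections from either sensor are kept.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def late_fusion(cam_dets, lidar_dets, iou_thr=0.5):
    """Merge camera and LiDAR detections; each detection is (box, score)."""
    fused, used = [], set()
    for cbox, cscore in cam_dets:
        best_j, best_iou = -1, 0.0
        for j, (lbox, _) in enumerate(lidar_dets):
            if j not in used and iou(cbox, lbox) > best_iou:
                best_j, best_iou = j, iou(cbox, lbox)
        if best_iou >= iou_thr:
            used.add(best_j)
            lscore = lidar_dets[best_j][1]
            fused.append((cbox, 1 - (1 - cscore) * (1 - lscore)))  # agreement raises confidence
        else:
            fused.append((cbox, cscore))                           # camera-only detection
    fused += [d for j, d in enumerate(lidar_dets) if j not in used]  # LiDAR-only detections
    return fused

print(late_fusion([((10, 10, 60, 40), 0.6)],
                  [((12, 12, 62, 42), 0.7), ((100, 80, 150, 120), 0.8)]))
```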

Author Response File: Author Response.pdf

Reviewer 6 Report

Comments and Suggestions for Authors

This is an interesting manuscript which reviewed the methods used for vehicle detection and feature extraction techniques, available datasets, and evaluation techniques, including classical machine-learning and deep-learning techniques. It is a good supplement to the existing research. However, the manuscript could better emphasize the comprehensiveness of literature and address the limitations of existing methods. Some questions should be addressed before publication.

 

1. As a review paper, many existing research papers are not referenced in the manuscript, especially from the latest three years.

 

2. The manuscript dedicates significant space to classical feature extraction methods like Haar-like features and HOG, which are largely outdated for modern vehicle detection. While the historical context is valuable, limited critical analysis of their practical limitations weakens the technical depth, lacking comparisons with contemporary deep learning alternatives in efficiency or accuracy.

3. Though YOLO variants are cited, the manuscript does not quantitatively benchmark their computational efficiency against each other or non-YOLO architectures.

4. Environmental challenges (e.g., fog, rain) are listed, but mitigation strategies like synthetic data augmentation (e.g., GTA5-to-real adaptation) or weather-invariant feature learning are not explored.

5. Modern architectures use attention to handle occlusions, but the manuscript treats this as a generic "context-aware" feature without dissecting specific mechanisms (e.g., transformer-based vs. spatial attention). Technical novelty in recent works is insufficiently highlighted.

6. The resolution and quality should be improved; many figures are too simple, such as Figures 1, 2, 4, and 5.

7. The section number should be indicated explicitly.

8. Ensure figure/table captions fully describe methods.

Comments on the Quality of English Language

Fine.

Author Response

Response to Reviewer 6 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #6, Comment #1: “As a review paper, many existing research papers are not referenced in the manuscript, especially from the latest three years.”
Author Response: We thank the reviewer for this important observation. In response, we carefully revised the manuscript and incorporated numerous additional references, with a particular focus on recent publications from the past three years. These updates enhance the comprehensiveness and relevance of the review, especially regarding emerging topics such as transformer-based detection, V2X integration, edge-cloud collaboration, and multi-modal sensor fusion. We believe these additions strengthen the manuscript’s coverage of state-of-the-art methods and align it better with current research trends.
Reviewer #6, Comment #2: “The manuscript dedicates significant space to classical feature extraction methods like Haar-like features and HOG, which are largely outdated for modern vehicle detection. While the historical context is valuable, limited critical analysis of their practical limitations weakens the technical depth, lacking comparisons with contemporary deep learning alternatives in efficiency or accuracy.”
Author Response: Thank you for the insightful comment. We agree that classical methods such as Haar-like features and HOG are outdated compared to current deep learning approaches. In the revised manuscript, we retained the discussion of classical methods to provide historical context but significantly updated the section to critically analyze their limitations, such as poor adaptability to occlusion, lighting variation, and scale changes. Furthermore, we explicitly compared these methods with contemporary deep learning techniques, emphasizing the superior accuracy, robustness, and real-time performance of CNN-based models. These revisions clarify why classical methods are less suited for modern vehicle detection and strengthen the technical depth of the manuscript.
Reviewer #6, Comment #3: “ Though YOLO variants are cited, the manuscript does not quantitatively benchmark their computational efficiency against each other or non-YOLO architectures.”
Author Response: Thank you for your observation. We have revised the discussion in the “Summary of Deep Learning-Based Detection Methods” subsection to clarify the performance trade-offs between YOLO variants and non-YOLO architectures. While the survey does not provide detailed benchmarking results, the revised text emphasizes the efficiency of YOLO models for real-time and edge deployments, and the suitability of architectures like Faster R-CNN and RetinaNet in accuracy-critical applications. These updates aim to better contextualize model selection based on deployment needs.
Reviewer #6, Comment #4:” Environmental challenges (e.g., fog, rain) are listed, but mitigation strategies like synthetic data augmentation (e.g., GTA5-to-real adaptation) or weather-invariant feature learning are not explored.”
Author Response: We appreciate this insightful suggestion. In the revised manuscript, we have expanded the discussion of environmental conditions in the “Challenges and Future Research” section. Specifically, we now mention synthetic data generation, domain adaptation, and weather-invariant feature extraction as promising approaches for enhancing detection robustness under adverse weather conditions. The updated text also reflects recent developments in feature enhancement and sensor fusion techniques that address such challenges.
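As a small, concrete illustration of the synthetic-weather augmentation mentioned above (a sketch assuming the albumentations library; transform parameters, file paths, and the single example box are made up for demonstration):

```python
import cv2
import albumentations as A

# Weather-style augmentations; bounding boxes are kept in sync with the image
weather_aug = A.Compose(
    [
        A.RandomRain(p=0.3),
        A.RandomFog(p=0.3),
        A.RandomBrightnessContrast(p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = cv2.imread("train/clear_day_000123.jpg")           # hypothetical clear-weather frame
augmented = weather_aug(image=image,
                        bboxes=[(34, 120, 210, 260)],      # one vehicle box (x1, y1, x2, y2)
                        class_labels=["car"])
cv2.imwrite("train_aug/adverse_000123.jpg", augmented["image"])
```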
Reviewer #6, Comment #5: “Modern architectures use attention to handle occlusions, but the manuscript treats this as a generic "context-aware" feature without dissecting specific mechanisms (e.g., transformer-based vs. spatial attention). Technical novelty in recent works is insufficiently highlighted.”
Author Response: We thank the reviewer for highlighting this important point. In the revised manuscript, we have clarified the different categories of attention mechanisms used to mitigate occlusions in vehicle detection. Specifically, we now distinguish between:
•Transformer-based attention, which captures global context and long-range dependencies (e.g., ViT-based and RT-DETR models), and
•Spatial/channel attention mechanisms integrated into CNNs, which improve focus on object-specific regions in cluttered scenes.
This refinement is reflected in the updated “Challenges and Future Research” section. We believe this enhancement addresses the concern and adds clarity to the discussion of recent technical trends in attention-based vehicle detection.
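To make the second category concrete, the snippet below sketches a standard squeeze-and-excitation style channel-attention block of the kind often inserted into CNN detector backbones; it is a generic PyTorch illustration, not an implementation taken from any specific work reviewed in the manuscript.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention for CNN feature maps."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial average per channel
        self.fc = nn.Sequential(                       # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weight channels, emphasizing object cues

# Re-weight a 256-channel feature map from a detector backbone
feat = torch.randn(1, 256, 40, 40)
print(ChannelAttention(256)(feat).shape)               # torch.Size([1, 256, 40, 40])
```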
Reviewer #6, Comment #6: “The resolution and quality should be improved; many figures are too simple, such as Figures 1, 2, 4, and 5.”
Author Response: We appreciate the reviewer’s feedback on the quality and resolution of the figures. In the revised manuscript, we have taken the following steps to address this issue:
•Figures 1, 2, 4, and 5 have been replaced or redrawn using higher-resolution vector-based images (PDF or SVG) to ensure visual clarity in both digital and printed formats.
•We also refined the design elements (e.g., labels, lines, arrows) to enhance readability and presentation.
•The updated figures are now more informative and align with the technical content discussed in the manuscript.
Reviewer #6, Comment #7: “ The section number should be indicated explicitly”
Author Response: We appreciate the reviewer’s suggestion regarding section numbering. In the revised manuscript, we have explicitly added hierarchical section numbers throughout the text (e.g., 1. Introduction, 2. Detection Methods, 3. Datasets, etc.) to improve readability and structural clarity. This change helps readers easily navigate the document and aligns with MDPI formatting standards.
Reviewer #6, Comment #8: “Ensure figure/table captions fully describe methods.”
Author Response: Thank you for pointing this out. We have revised all figures and table captions to ensure they are self-contained and clearly describe the illustrated methods or presented results. Captions now include key information such as model names, evaluation conditions, and performance metrics where applicable. These improvements enhance the comprehensibility of visual content without requiring readers to refer back to the main text.

Author Response File: Author Response.pdf

Reviewer 7 Report

Comments and Suggestions for Authors

This is a review paper on image-based vehicle detection methods, covering both classic methods (handcrafted features + classification) and end-to-end deep-learning methods.

In my opinion this is a borderline paper, with some strong points, but also weak points.
I feel like the authors have tried to cover too large a field.


Strong points:

- Good basic structure: classic methods (features + classification algorithms), deep learning methods (one-stage, two-stage)
- A significant number of references
- Summarizing tables
- Covers a wide range: algorithms (classic and deep), applications, datasets
- Clear presentation and structure


Weak points:

- No discussion or sub-groupings of papers, besides the major categorizations: classic vs deep, one vs two stages. One major goal of surveys is to group the papers into various sub-classes and taxonomies, according to different criteria, to highlight similarities, different approaches, common research directions etc, providing new insights to the readers. This is missing here. For example, the distinction between appearance-based features vs motion-based features is mentioned in (214), but is not reflected anywhere else. Different possible sub-tasks are mentioned in Table 3 (vehicle detection vs recognition, static vs moving objects, etc), but this distinction is not discussed on its own. Instead, most of the discussion is like a list-of-abstracts (e.g. 243-287, 367-393). I think that splitting the papers into more sub-groups, and discussing these criteria, would be very helpful. As it is now, the lists of papers may be informative, but are not really helpful.

- Pretty much any pre-trained YOLO algorithm available on the Internet, e.g. pre-trained on COCO, includes a vehicle category. They should be somehow included in the review, or at least mentioned as a baseline accuracy, besides the specialized papers.

- There is a major focus on pre-v5 YOLO versions, but little focus on later models. Table 5 includes only up to YOLOv4.

- Challenges and Future Research should be more solid, and should be somewhat supported by other parts of the paper. They seem to come out of the blue. For example, labeling quality is the first one, but nothing is mentioned about labeling quality in the Datasets section. "Complex traffic scenes" is pretty vague (some examples would be helpful). Computational efficiency: there is little discussion of this in the paper.

- Some figures are not really informative (e.g., Fig. 4 and Fig. 1 are too basic)

- SIFT description seems incomplete; this is just the DoG filters. Either describe all the steps briefly, or don't describe it, but now it seems cut off abruptly.

- Table 1: LBP has "high computational cost", but in the LBP description it was mentioned as "highly efficient" and "well-suited for real-time ..." (rows 146-147). Also, at row 172 it is mentioned that classic methods generally have a "significant computational burden".

- Motion-based features are mentioned in 215-220 in passing, but never discussed (which motion-based features exactly, and how do they fit with the others like HOG, SIFT, etc.)

- Table 5: What is the role of k-means++ e.g. for [59]? It is unclear.

- IoU: should mention that the IoU threshold influences the later results like precision and recall values (e.g., 50%, 95%).

- Eq. 1: clarification needed: is the vector v that is normalized composed of the pixels, or of the gradients (magnitudes, angles, or both)?


Typos and minor remarks, in text order:

- Introduction is one big wall-of-text, hard to navigate
- row 79: "they", not "it"
- 99: Haar-like features Figure 3 is used
- 104: order of references seems wrong, [116] is appearing here too soon
- 127: abbreviation of LBP should appear on first usage
- 136: repetition: "The LBP can be calculated as : The Local Binary Pattern (LBP) is calculated using"
- 178: SVM maximizing the gap between "the two sets of samples" isn't  quite accurate, since the samples are fixed. Maybe formulate better as maximizing the margin of the decision boundary, or something like this.
- 186: "Where: represents". Also, better explain the above equation: what I represents and where the support vectors are involved.
- 208: what is epsilon_t?
- Table 2: LSVM is never discussed; what does it mean? (It is only listed in 184 among others.)
- 214: "These methods either [ 48 ] appearance-based"
- 246: "which used"
- 247: Figure 5 not 3
- 325: "there are numerous algorithms use"
- 344: "sele active"
- 351: "Faster R-CNN [ 24] It was"
- 426: FP = False Positives

Author Response

Response to Reviewer 7 Comments
We would like to sincerely thank you for the thoughtful comments and constructive suggestions. Your feedback has been invaluable in helping us refine and improve the clarity and quality of our work. We appreciate the time and effort you have taken to review our manuscript, and we believe that the suggested revisions will further enhance the impact and rigor of the study.
Reviewer #7, Comment #1: “No discussion or sub-groupings of papers, besides the major categorizations: classic vs deep, one vs two stages. One major goal of surveys is to group the papers into various sub-classes and taxonomies, according to different criteria, to highlight similarities, different approaches, common research directions etc, providing new insights to the readers. This is missing here. For example, the distinction between appearance-based features vs motion-based features is mentioned in (214) but is not reflected anywhere else. Different possible sub-tasks are mentioned in Table 3 (vehicle detection vs recognition, static vs moving objects, etc), but this distinction is not discussed on its own. Instead, most of the discussion is like a list of abstracts (e.g. 243-287, 367-393). I think that splitting the papers into more sub-groups and discussing these criteria would be very helpful. As it is now, the lists of papers may be informative, but are not really helpful.”
Author Response: Thank you for your valuable feedback highlighting the need for deeper sub-grouping and taxonomy in our literature discussion. We agree that one of the central goals of a survey paper is not only to categorize works but also to extract trends, insights, and distinctions that guide future research. In response, we have made the following major revisions to address this concern:
1. Enhanced Taxonomical Structure: We introduced detailed sub-groupings within both classical and deep learning-based detection methods. For classical methods, we explicitly classified works into:
a. Appearance-based methods (e.g., HOG, LBP, Haar),
b. Motion-based methods (e.g., background subtraction, optical flow),
c. Hybrid/Heuristic methods (e.g., pixel-level heuristics, shadow removal).
This structure is now reflected both in the "Summary of Classical Vehicle Detection Methods" subsection and in Table 2, which provides a clear mapping of each approach to its technique and references.
2. Sub-Task-Oriented Discussion: To address the distinction between tasks such as vehicle detection, vehicle recognition, static vs. moving object detection, we have added discussion points and categorized relevant works accordingly in both the text and updated tables (e.g., Table 1 and Table 4). For example, we explicitly note whether a method performs only detection, includes classification/recognition, or applies to motion tracking.
3. Refined Deep Learning Methodology Subsections: Within deep learning methods, we extended the structure to include:
a. One-stage detectors (YOLO, SSD, RetinaNet)
b. Two-stage detectors (Faster R-CNN, Mask R-CNN)
c. Anchor-free detectors (CenterNet, FCOS)
d. Transformer-based approaches
e. GAN-enhanced detection
f. Multi-modal and V2X-integrated systems
Each group is now discussed with a focus on common characteristics, research goals, strengths, and limitations, rather than solely summarizing individual contributions. We also introduced Table 6 to compare these methods based on task suitability and architectural strengths.
4. Reduction of List-of-Abstracts Style: We carefully revised the previously cited sections (e.g., lines 243–287, 367–393) to replace the "list-of-abstracts" format with grouped thematic discussions, drawing comparisons, highlighting shared challenges, and noting methodological differences across models.
We believe these revisions significantly improve the paper’s utility for readers by offering clearer research directions, comparative analysis, and structured insight into the evolution of vehicle detection techniques. The newly added tables and structural clarity also enhance the paper’s pedagogical value.
Reviewer #7, Comment #2: “Pretty much any pre-trained YOLO algorithm available on the Internet, e.g. pre-trained on COCO, includes a vehicle category. They should be somehow included in the review, or at least mentioned as a baseline accuracy, besides the specialized papers.”
Author Response: We thank the reviewer for this helpful observation. We acknowledge that many YOLO models, particularly YOLOv3 through YOLOv8, are widely used in pre-trained form (e.g., on the COCO dataset), and these models include vehicle-related classes such as cars and trucks. To clarify this point, we have added a brief note in the YOLO discussion acknowledging the use of these pre-trained models as baselines in vehicle detection tasks. This addition aims to improve the completeness of our discussion without altering the structure of the manuscript.
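For readers who want such a baseline quickly, the sketch below (an illustrative configuration, not an experiment from the manuscript) loads a COCO-pretrained YOLOv5 checkpoint via torch.hub and restricts predictions to the COCO vehicle classes; the image path and confidence threshold are assumptions.

```python
import torch

# COCO-pretrained YOLOv5s already covers vehicle classes, so it can serve as a no-training baseline
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.classes = [2, 3, 5, 7]   # COCO indices: car, motorcycle, bus, truck
model.conf = 0.4               # confidence threshold (illustrative)

results = model("highway_frame.jpg")    # hypothetical test image
results.print()                         # per-class counts and inference time
detections = results.pandas().xyxy[0]   # boxes, scores, and class names as a DataFrame
print(detections[["name", "confidence"]])
```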
Reviewer #7, Comment #3: “There is a major focus on pre-v5 YOLO versions, but little focus on later models. Table 5 includes only up to YOLOv4.”
Author Response: Thank you for pointing this out. While our earlier draft emphasized YOLO versions up to v4, we have now updated the manuscript to reflect the evolution of the YOLO series up to YOLOv10. Specifically, the discussion now includes key advancements in versions v5 through v10, as well as a summary table and timeline (Table 3 and Figure 7) to provide a broader perspective on architectural improvements. This update aims to improve the manuscript’s relevance to current research while maintaining its structure.
Reviewer #7, Comment #4: “Challenges and Future Research should be more solid and should be somewhat supported by other parts of the paper. They seem to come out of the blue. For example, labeling quality is the first one, but nothing is mentioned about labeling quality in the Datasets section. "Complex traffic scenes" is pretty vague (some examples would be helpful). Computational efficiency: there is little discussion of this in the paper.”
Author Response: We appreciate your feedback regarding the Challenges and Future Research section. To address your concern, we have revised this section to better align with the preceding discussions in the paper. For instance, the point on labeling quality is now explicitly tied to the dataset characteristics discussed earlier. We have clarified the term "complex traffic scenes" by incorporating examples such as occlusions, varying lighting, and multi-scale vehicle appearances, which are referenced in both classical and deep learning method evaluations. Additionally, we have expanded on computational efficiency challenges by referencing discussions on lightweight models and inference performance in Sections 3.1 and 3.2. These refinements ensure the challenges are better grounded in the body of the paper and contribute to a more coherent narrative.
Reviewer #7, Comment #5: “Some figures are not really informative (e.g., Fig. 4 and Fig. 1 are too basic).”
Author Response: Thank you for your observation. We agree that Figures 1 and 4 were initially too basic in conveying technical value. We have revised these figures to enhance their clarity and informativeness. Specifically, we updated the architectural diagrams to better reflect key processing steps and components relevant to vehicle detection workflows, such as feature extraction, attention modules, and prediction layers. These improvements aim to provide the reader with a more meaningful visual understanding of the methods described in the text.
Reviewer #7, Comment #6: “SIFT description seems incomplete; this is just the DoG filters. Either describe all the steps briefly, or don't describe it, but now it seems cut off abruptly.”
Author Response: We appreciate the reviewer’s observation. The previous description of SIFT focused only on the Difference of Gaussians (DoG) component and did not adequately represent the full pipeline. To address this, we have updated the section to briefly summarize all key stages of the SIFT algorithm, including scale-space extrema detection, key point localization, orientation assignment, and descriptor generation. This provides a more complete and coherent explanation aligned with the level of detail presented for other classical methods.


Reviewer #7, Comment #7: “Table 1: LBP has "high computational cost", but in the LBP description it was mentioned as "highly efficient" and "well-suited for real-time ..." (rows 146-147). Also, at row 172 it is mentioned that classic methods generally have a "significant computational burden".”
Author Response: We clarified the inconsistency regarding Local Binary Patterns (LBP). The text now explains that LBP is computationally efficient and well-suited for real-time detection in well-lit and uniform backgrounds. However, its performance deteriorates in complex or cluttered scenes due to texture sensitivity, which can result in increased processing overhead when additional post-processing is required. This dual behavior is now consistently explained and reflected in both the description and Table 1.
Reviewer #7, Comment #8: “Motion-based features are mentioned in 215-220 in passing, but never discussed (which motion-based features exactly, and how do they fit with the others like HOG, SIFT, etc.).”
Author Response: We appreciate the reviewer’s insightful comment. We expanded the discussion of motion-based vehicle detection methods in Section 2.1.3. The revised text now includes detailed explanations of common techniques such as optical flow analysis, background subtraction, frame differencing, and Hidden Markov Models (HMMs). We also contrast these methods with appearance-based techniques like HOG and SIFT, highlighting their use in low-light or occlusion-heavy environments.
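For illustration, the short OpenCV sketch below shows the background-subtraction flavour of motion-based detection described in Section 2.1.3; the video file name, blob-area threshold, and morphology settings are assumptions chosen for the example.

```python
import cv2

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical roadside video
subtractor = cv2.createBackgroundSubtractorMOG2(history=300, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                          # foreground = moving pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # suppress small noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:                        # keep vehicle-sized blobs only
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("motion-based vehicle candidates", frame)
    if cv2.waitKey(1) == 27:                                # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```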
Reviewer #7, Comment #9: “Table 5: What is the role of k-means++, e.g., for [59]? It is unclear.”
Author Response: We thank the reviewer for pointing this out. In the revised manuscript, we clarified that k-means++ and Rk-means++ are used during the anchor box generation stage in YOLO-based models. Specifically, they cluster the dimensions (width and height) of ground-truth bounding boxes in the training dataset to produce optimal prior anchor boxes. This improves localization accuracy, especially in multi-scale and imbalanced vehicle detection scenarios. For example, in [86] and [87], k-means++ enhances anchor design by capturing vehicle size variability, while [99] uses Rk-means++ to better adapt to scale diversity and improve detection robustness. A clarifying note was added to the corresponding paragraph and table caption.
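The role of k-means++ in this stage can be shown with a few lines of scikit-learn (a generic toy sketch; YOLO implementations typically replace the Euclidean distance with a 1 - IoU distance, and the box dimensions below are synthetic examples).

```python
import numpy as np
from sklearn.cluster import KMeans

# Widths and heights of ground-truth vehicle boxes in pixels (synthetic example data)
wh = np.array([[42, 30], [120, 80], [60, 45], [300, 180], [90, 55],
               [35, 28], [200, 130], [150, 95], [70, 40]])

# k-means++ seeding spreads the initial centroids; three anchors for this toy example
# (YOLOv3-style heads typically cluster nine)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(wh)
anchors = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_[:, 0])]
print("prior anchor boxes (w, h):\n", anchors.round(1))
```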
Reviewer #7, Comment #10: “IoU: should mention that the IoU threshold influences the later results like precision and recall values (e.g., 50%, 95%).”
Author Response: We appreciate the reviewer’s insightful comment. We have added an explanation to the Evaluation Metrics section, noting that the Intersection over Union (IoU) threshold directly impacts evaluation scores such as precision and recall. For instance, mAP@0.5 and mAP@0.5:0.95 are calculated using varying IoU thresholds, affecting the sensitivity and strictness of detection success criteria.
Reviewer #7, Comment #11: “Eq. 1: clarification needed: is the vector v that is normalized composed of the pixels, or of the gradients (magnitudes, angles, or both)?”
Author Response: We appreciate the reviewer’s insightful comment. Equation (4) has been clarified to explicitly state that the vector ‘v’ refers to the histogram of oriented gradients (HOG) feature vector. The L2 normalization applied to 'v' ensures invariance to illumination and contrast changes, which enhances the robustness of the extracted features.
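For reference, the normalization in question is the standard HOG block normalization \( v \leftarrow \frac{v}{\sqrt{\lVert v \rVert_2^2 + \epsilon^2}} \), where \( v \) is the concatenated gradient-orientation histogram of a block and \( \epsilon \) is a small constant that prevents division by zero (stated here as an illustration of the standard HOG formulation).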
Reviewer #7, Comment 12: “Typos and minor remarks, in text order: - Introduction is one big wall-of-text, hard to navigate - row 79: "they", not "it" - 99: Haar-like features Figure 3 is used - 104: order of references seems wrong, [116] is appearing here too soon - 127: abbreviation of LBP should appear on first usage - 136: repetition: "The LBP can be calculated as : The Local Binary Pattern (LBP) 136 is calculated using" - 178: SVM maximizing the gap between "the two sets of samples" isn't quite accurate, since the samples are fixed. Maybe formulate better as maximizing the margin of the decision boundary, or something like this. - 186: "Where: represents" . Also better explain the above equation, who is I, where are the support vectors involved. - 208: who is epsilon_t -Table 2: LSVM is never discussed, what does it mean (it is only listed in 184 among others) - 214: "These methods either [ 48 ] appearance-based" - 246: "which used" - 247: Figure 5 not 3 - 325: "there are numerous algorithms use" - 344: "sele active" - 351: "Faster R-CNN [ 24] It was" - 426: FP = False Positives”
Author Response: We appreciate the reviewer’s insightful comment. All identified typographical and formatting issues have been thoroughly corrected. Specific corrections include:
•Breaking down the “wall-of-text” Introduction into logical thematic paragraphs.
•Fixing reference sequencing and early citation of later-numbered references.
•Introducing all abbreviations (e.g., LBP) at first mention with full definitions.
•Removing redundant or circular phrases (e.g., lines 136 and 178).
•Clarifying the SVM description to refer to margin maximization and explaining the role of support vectors.
•Expanding descriptions in formulas to define all symbols (e.g., α_t, D_t, I).
•Adding the previously undefined term LSVM in Table 2 and explaining its usage.

Author Response File: Author Response.pdf

Round 2

Reviewer 4 Report

Comments and Suggestions for Authors

    The author has addressed my concerns well. I would recommend the publication of this paper. However, the author is advised to provide more detailed descriptions to better highlight the core innovation of this work. Furthermore, the rapid development of autonomous driving technologies can refer to the recent work “A MAS-Based Hierarchical Architecture for the Cooperation Control of Connected and Automated Vehicles, IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 1559-1573, Feb. 2023”.

Author Response

Response to Reviewer 4 Comments

Reviewer #4, Comment #1: The author has addressed my concerns well. I would recommend the publication of this paper. However, the author is advised to provide more detailed descriptions to better highlight the core innovation of this work. Furthermore, the rapid development of autonomous driving technologies can refer to the recent work ‘A MAS-Based Hierarchical Architecture for the Cooperation Control of Connected and Automated Vehicles,’ IEEE Trans. Veh. Technol., vol. 72, no. 2, pp. 1559–1573, Feb. 2023.

Author response: We sincerely thank the reviewer for their positive feedback and recommendation for publication.

  • In response to your valuable suggestion, we have revised the Introduction to clearly highlight the core contributions of our work. Specifically, we emphasize how this survey differs from prior reviews by organizing vehicle detection methods according to their suitability for real-time applications, edge deployment, and integration within autonomous driving systems. We also highlight emerging directions, such as Transformer-based attention models and cooperative V2X-enabled perception frameworks.

  • Additionally, we have incorporated and discussed the referenced paper by Zhu et al. (2023) in both the Introduction and Section 6 (Challenges and Future Research Directions) to underscore the importance of MAS-based cooperation control in connected and automated vehicles. In particular, we highlight how reliable and timely vehicle detection is foundational for enabling shared situational awareness in multi-agent systems. The citation was added to support the discussion on V2X integration and cooperative perception architectures.

Author Response File: Author Response.pdf

Reviewer 6 Report

Comments and Suggestions for Authors

1. The sub-captions of A, B, C, D, E, and F in Figure 4 should be added.

2. The YOLO versions should be made consistent between Table 3 and Figure 7.

Comments on the Quality of English Language

Could be improved.

Author Response

Response to Reviewer 6 Comments

Reviewer #6, Comment #1: “The sub-captions of A, B, C, D, E, and F in Figure 4 should be added.”

Author response: Thank you for this helpful observation. We have updated Figure 4 to visually include sub-captions (A–F) beneath each Haar-like pattern. In addition, we revised the caption to describe each feature type (e.g., vertical edge, line, checkerboard), enhancing the figure’s interpretability and aligning it with the textual discussion.

Reviewer #6, Comment #2: “The YOLO versions should be made consistent between Table 3 and Figure 7.”

Author response: We appreciate your attention to consistency across figures and tables. In response, we carefully reviewed and updated Table 4 and Figure 7 to ensure alignment regarding YOLO version naming, chronological order, and architectural details. Specifically, we verified that all YOLO versions listed in Table 4 (from YOLOv1 through YOLOv12) correspond precisely to those shown in the timeline of Figure 7, including their respective backbones and detection heads. This update ensures coherence throughout the manuscript and improves the clarity of the YOLO model progression for the reader.

 

Author Response File: Author Response.pdf
