Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents a novel hybrid framework, Hybrid ViT-RetinaNet, for fine-grained vehicle damage classification, integrating Vision Transformers (ViTs) with RetinaNet and employing explainable ensemble learning techniques. However, there are some comments that should be considered:
- While the model achieves 22 FPS, the computational overhead of ViTs may limit deployment on resource-constrained devices. The authors should discuss potential optimizations (e.g., model pruning, quantization) or lightweight alternatives (e.g., MobileViT) for edge deployment.
- The dataset includes 40% COCO images, which may not fully represent real-world damage scenarios. The authors should clarify how COCO images were selected/adapted and whether they introduce bias. Expanding the dataset to include more vehicle types (e.g., trucks, buses) would improve generalizability.
- The paper focuses on coarse-grained damage categories. Future work could explore hierarchical or fine-grained damage classification (e.g., scratch severity, dent depth) to enhance practical utility.
- While robustness to synthetic noise is demonstrated, real-world testing with dynamic conditions (e.g., varying lighting, occlusions) would further validate the model's applicability.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we present our responses to the comments together with a summary of the corresponding changes, which are shown in red font in the revised manuscript. All page numbers mentioned below refer to the new revised manuscript, where the changes are highlighted in yellow. Following the reviewer’s suggestions, we have done the following:
- Comment: “While the model achieves 22 FPS, the computational overhead of ViTs may limit deployment on resource-constrained devices. The authors should discuss potential optimizations (e.g., model pruning, quantization) or lightweight alternatives (e.g., MobileViT) for edge deployment.”
Response: In response to the reviewer’s suggestion, we have integrated two key optimizations to enhance deployment efficiency on resource-constrained devices: (1) quantization, which reduces the model size by approximately 51.5% (from 1218.1 MB to 591.39 MB) and cuts the parameter count nearly in half while maintaining high accuracy, enabling faster inference with lower memory and power usage; and (2) MobileViT, a lightweight transformer designed for edge deployment, which achieved the highest FPS (48) among all tested models. While MobileViT offers impressive speed, our proposed DINOv2-based approach maintains superior accuracy and robustness, making it suitable for applications requiring both performance and efficiency. These additions and their evaluations are detailed in Section 4.4 and supported by Table 4 (Impact of quantization on model deployment: comparison between full-precision and quantized models) and Table 2 (Performance Comparison of the Applied Models). (pages 11-15)
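To make the quantization step concrete, the snippet below is a minimal sketch of post-training dynamic quantization in PyTorch, using a plain torchvision ViT backbone as a stand-in for our hybrid model; the layer selection and the size-measurement helper are illustrative assumptions rather than the exact procedure reported in the manuscript.

```python
import os
import torch
import torchvision

# Stand-in backbone: a plain ViT-B/16 from torchvision (not the paper's hybrid model).
model = torchvision.models.vit_b_16(weights=None).eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8,
# while activations remain in floating point, so no calibration data is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module, path: str = "tmp_weights.pt") -> float:
    """Serialize the state_dict and report its on-disk size in megabytes."""
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"full precision: {size_mb(model):.1f} MB")
print(f"quantized:      {size_mb(quantized):.1f} MB")
```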
- Comment: “The dataset includes 40% COCO images, which may not fully represent real-world damage scenarios. The authors should clarify how COCO images were selected/adapted and whether they introduce bias. Expanding the dataset to include more vehicle types (e.g., trucks, buses) would improve generalizability.”
Response: The 40% COCO-derived images included in our dataset were selectively sourced through platforms such as Roboflow and extensively augmented (e.g., cropping, blurring, brightness shifts, geometric distortion, and the addition of Gaussian and salt-and-pepper noise) to simulate diverse visual conditions rather than to depict actual vehicle damage. These images were intended to reflect realistic real-world scenes and to enrich the dataset with non-damage contexts and challenging backgrounds, thereby improving model robustness and reducing false positives. Selection was based on visual similarity to real vehicle environments and occlusion patterns, so as to minimize dataset bias. To ensure they did not skew the model’s learning, we performed ablation studies and/or cross-validation, which confirmed that their inclusion enhanced generalization without degrading performance. We acknowledge the reviewer’s point and agree that expanding the dataset to include more varied vehicle types (e.g., buses, trucks) would further improve generalizability, which we identify as a priority for future work.
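For illustration only, the sketch below shows one way such an augmentation pipeline could be assembled with torchvision; the transform choices and parameter values are assumptions for demonstration rather than the exact settings used in our experiments, and detection training would additionally require box-aware geometric transforms.

```python
import torch
from torchvision import transforms

class SaltAndPepperNoise:
    """Flip a random fraction of pixels to pure black (pepper) or white (salt)."""
    def __init__(self, amount: float = 0.02):
        self.amount = amount

    def __call__(self, img: torch.Tensor) -> torch.Tensor:  # img: CxHxW in [0, 1]
        mask = torch.rand(img.shape[-2:])
        img = img.clone()
        img[..., mask < self.amount / 2] = 0.0
        img[..., mask > 1.0 - self.amount / 2] = 1.0
        return img

augment = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.7, 1.0)),          # cropping / geometric change
    transforms.ColorJitter(brightness=0.4, contrast=0.3),         # brightness shifts
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),     # blurring
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.03 * torch.randn_like(x)),  # Gaussian noise
    SaltAndPepperNoise(amount=0.02),                              # salt-and-pepper noise
    transforms.Lambda(lambda x: x.clamp(0.0, 1.0)),
])
```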
- Comment: “The paper focuses on coarse-grained damage categories. Future work could explore hierarchical or fine-grained damage classification (e.g., scratch severity, dent depth) to enhance practical utility.”
Response: We appreciate the reviewer’s insightful suggestion. In response, we have acknowledged this limitation in the revised Limitations and Future Research Suggestions section. (pages 22-23)
- Comment: “While robustness to synthetic noise is demonstrated, real-world testing with dynamic conditions (e.g., varying lighting, occlusions) would further validate the model's applicability.”
Response: We appreciate the reviewer’s observation and have addressed this in the revised manuscript by incorporating a new qualitative analysis (Figure 12: Damage detection results under dynamic and challenging conditions.), showcasing the model’s performance under diverse real-world conditions, including varying lighting, occlusions, background clutter, weather-related distortions, and motion blur. The examples include complex scenes such as partial damage under reflection, debris-filled accident zones, adverse weather impact, and control cases. (page 19)
Reviewer 2 Report
Comments and Suggestions for Authors
- The abstract should explicitly explain why a robust and interpretable deep learning framework for vehicle damage classification is important. Additionally, it should clarify why existing methods cannot effectively address the problem, highlighting their limitations and motivating the structural contributions of the proposed method.
- The current literature review in the introduction is not comprehensive enough. Deep learning methods should be more systematically categorized, especially since the manuscript focuses on interpretability. It is recommended to divide related works into self-supervised and semi-supervised methods. For example, recent advances in semi-supervised monitoring, such as “Research on multimodal techniques for arc detection in railway systems with limited data”, and vibration signal processing method “CFFsBD: A Candidate Fault Frequencies-based Blind Deconvolution for Rolling Element Bearings Fault Feature Enhancement”, have achieved promising results. Comparing the proposed approach to these representative methods will improve the completeness of the literature review. In addition, methods that combine physical information and deep learning: “Surrogate modeling of pantograph-catenary system interactions using physics-informed neural networks”, have proven effective for interpretability, and discussion of these should be included.
- Moreover, the introduction should better reflect the overall motivation for the work. Specifically, the necessity of transfer learning for robust and interpretable vehicle damage classification should be clarified, as well as the limitations of existing approaches. The structural design—i.e., introducing a Vision Transformer (ViT)-based detection head on top of RetinaNet to enhance the representation of complex damage patterns—should also be explicitly linked to robustness and interpretability. Finally, the main contributions should be clearly and concisely summarized, preferably no more than three key points.
- Since the code has not been released, the methodology section should include a pseudocode for the overall algorithm. In Figure 4, the input and output variables for the proposed modules should be clearly indicated, along with their data flow. Furthermore, the quality of Figures 2, 3, and 4 could be improved for better readability and aesthetics. At the end of the methodology section, the overall loss function optimization strategy should be described, including details on focal loss, IoU loss, etc.
- It is recommended to compare the results with more state-of-the-art (SOTA) methods from 2024 and 2025, and on more diverse datasets under different conditions, to better highlight the strengths of the proposed algorithm.
- Overall, this is a well-executed application-oriented paper. However, the motivation and novelty are not sufficiently clear in the current manuscript. I strongly recommend the authors clarify these aspects for a stronger impact. Therefore, I suggest the manuscript requires major revision.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we will present our responses to the comments together with a summary of the corresponding changes in the revised manuscript in Red Fonts. All page numbers mentioned below are referred in the new revised manuscript in Yellow Highlights. Following what the reviewer suggested, we have done the followings:
- Comment: “The abstract should explicitly explain why a robust and interpretable deep learning framework for vehicle damage classification is important. Additionally, it should clarify why existing methods cannot effectively address the problem, highlighting their limitations and motivating the structural contributions of the proposed method.”
Response: In response to the reviewer’s suggestion, we added: “Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection approaches are time-consuming, prone to human error, and lead to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability.” (page 1)
- Comment: “The current literature review in the introduction is not comprehensive enough. Deep learning methods should be more systematically categorized, especially since the manuscript focuses on interpretability. It is recommended to divide related works into self-supervised and semi-supervised methods. For example, recent advances in semi-supervised monitoring, such as “Research on multimodal techniques for arc detection in railway systems with limited data”, and vibration signal processing method “CFFsBD: A Candidate Fault Frequencies-based Blind Deconvolution for Rolling Element Bearings Fault Feature Enhancement”, have achieved promising results. Comparing the proposed approach to these representative methods will improve the completeness of the literature review. In addition, methods that combine physical information and deep learning: “Surrogate modeling of pantograph-catenary system interactions using physics-informed neural networks”, have proven effective for interpretability, and discussion of these should be included.”
Response: We appreciate the reviewer’s suggestion and have revised the Related Works section to more systematically categorize deep learning approaches into Section 2.2 (Supervised Deep Learning Methods) and Section 2.3 (Self-Supervised and Semi-Supervised Methods). We also expanded the discussion on interpretability by clearly outlining the role of XAI techniques in vehicle damage assessment, aligning it more closely with the focus of our work. The suggested articles have been cited as follows: “Efficient semi-supervised deep learning models have been applied in the railway domain, e.g., arc identification [23], bearing fault detection [24], and pantograph-catenary interaction modeling [25].” (page 4)
- Comment: “Moreover, the introduction should better reflect the overall motivation for the work. Specifically, the necessity of transfer learning for robust and interpretable vehicle damage classification should be clarified, as well as the limitations of existing approaches. The structural design—i.e., introducing a Vision Transformer (ViT)-based detection head on top of RetinaNet to enhance the representation of complex damage patterns—should also be explicitly linked to robustness and interpretability. Finally, the main contributions should be clearly and concisely summarized, preferably no more than three key points.”
Response: In response, we revised the Introduction to clearly articulate the motivation for using transfer learning to address data scarcity and improve generalization, as well as the need for interpretability in high-stakes applications like insurance and diagnostics. We have also explicitly emphasized the architectural contribution, a ViT-based detection head integrated with RetinaNet, and linked it directly to improvements in robustness and feature interpretability under complex and noisy conditions. Finally, we summarized the main contributions into three focused points to improve clarity and alignment with the paper’s core innovations. (pages 1-3)
- Comment: “Since the code has not been released, the methodology section should include a pseudocode for the overall algorithm. In Figure 4, the input and output variables for the proposed modules should be clearly indicated, along with their data flow. Furthermore, the quality of Figures 2, 3, and 4 could be improved for better readability and aesthetics. At the end of the methodology section, the overall loss function optimization strategy should be described, including details on focal loss, IoU loss, etc.”
Response: A link to the code and dataset has been added in the Abstract: “The employed vehicle damage dataset and implementation code can be found at: https://github.com/MdFahimShahoriar/finegrained-damage-classify-xai.” Figures 2, 3, and 4 have been updated. The overall loss-function optimization strategy is now described at the end of the methodology section, including details on the focal loss, IoU loss, etc. (pages 1 and 8)
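For reference, the following is a minimal sketch of the binary focal loss in the standard RetinaNet formulation; the alpha and gamma values are the commonly used defaults rather than necessarily the ones tuned in our experiments, and the box-regression term (e.g., smooth L1 or an IoU-based loss) is added on top of this classification term.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights well-classified (easy) anchors so that
    training focuses on hard, misclassified ones. `targets` holds 0/1 values
    with the same shape as `logits`."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1.0 - p) * (1.0 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```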
- Comment: “It is recommended to compare the results with more state-of-the-art (SOTA) methods from 2024 and 2025, and on more diverse datasets under different conditions, to better highlight the strengths of the proposed algorithm.”
Response: We appreciate the reviewer’s suggestion. In response, we have included a comparison with the recently proposed DINOv2 model, which represents a 2024 state-of-the-art vision transformer. Results for DINOv2 have been added to Table 2 (Performance Comparison of the Applied Models), Table 5 (Comparison of Model Size, Parameter Count, and Training Time per Epoch for Object Detection Models), and Figure 5 (Training and Validation Performance of DINOv2 Over Epochs). (pages 11 and 16)
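As an illustration of how a DINOv2 backbone can be used, the sketch below loads a pretrained ViT-B/14 via the public facebookresearch/dinov2 torch.hub entry point and attaches a placeholder linear head; the head and input size are assumptions for demonstration, not the detection architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

# Pretrained DINOv2 ViT-B/14 backbone from torch.hub (downloads weights on first use).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

# Hypothetical binary head on top of the CLS-token embedding.
head = nn.Linear(backbone.embed_dim, 2)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)   # spatial size must be a multiple of the 14-px patch
    feats = backbone(x)               # shape: (1, embed_dim) CLS features
logits = head(feats)                  # shape: (1, 2)
```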
- Comment: “Overall, this is a well-executed application-oriented paper. However, the motivation and novelty are not sufficiently clear in the current manuscript. I strongly recommend the authors clarify these aspects for a stronger impact. Therefore, I suggest the manuscript requires major revision.”
Response: We sincerely thank the reviewer for the encouraging remarks. In response to the comment, we have revised the Abstract and Introduction to clearly articulate the motivation behind this work, namely the need for robust, scalable, and interpretable vehicle damage classification systems in real-world, high-stakes applications. We have also clarified the novelty of our approach by emphasizing the integration of a ViT-based detection head within the RetinaNet framework, supported by transfer learning, XAI techniques, and ensemble-based refinement, which collectively address limitations in existing CNN-based and transformer-only methods. (pages 1-3)
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents a vehicle damage classification system integrating Vision Transformers (ViTs) with a RetinaNet core and ensemble detection strategies. The proposed model is pre-trained and tuned with focal loss and aggressive augmentation techniques to improve generalization under real-world damage variability. The proposed system applies a Weighted Box Fusion (WBF) strategy to improve the detection results from multiple models. To ensure interpretability and clarity, the authors use multiple explanatory techniques: Grad-CAM, Grad-CAM++, and SHAP. A custom multi-class vehicle damage dataset of 4,500 images was built, consisting of about 60% selected images covering different damage types and 40% images from the COCO dataset to support model generalization. Comparative evaluations show that the Hybrid ViT-RetinaNet achieves the highest performance among the models presented. Tests under various conditions, including real visual noise and artificial noise, confirm the model's potential.
Before this article can be accepted for publication, the authors need to clarify a few issues:
In Section 3.1, the authors list the classes present in the dataset, such as bumper dents, panel scratches, frontal crashes, and rear crashes. On the other hand, the results presented in the form of a matrix or images include the labels: damaged, undamaged, background, or car damaged, car not damaged. What labels were used during learning?
In Section 3.2, the authors should include the structures of all models that are subsequently trained.
In the article, it would be useful to add model tests using pictures of an undamaged car or cars.
This would confirm the claim of robustness of the model proposed by the authors.
All drawings from Figure 2 onward have captions that are too small and difficult to read.
Figures 7, 8, and 9 should have a single description, with appropriate annotation, for the subfigures. The description should contain information about the model type.
Figures 10, 11 and 12 should have one description, with appropriate annotation, for the subfigures.
Figures 13-17 should have one description, with appropriate annotation, for the subfigures.
Figures 18-23 should have one description, with appropriate annotation, for the subfigures.
Figures 24-28 should have one description, with appropriate annotation, for the subfigures.
Figures 29-33 should have one description, with appropriate annotation, for the subfigures.
Figures 34-38 should have one description, with appropriate annotation, for the subfigures.
Figures 39-43 should have one description, with appropriate annotation, for the subfigures.
Figures 44-48 should have one description, with appropriate annotation, for the subfigures.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we present our responses to the comments together with a summary of the corresponding changes, which are shown in red font in the revised manuscript. All page numbers mentioned below refer to the new revised manuscript, where the changes are highlighted in yellow. Following the reviewer’s suggestions, we have done the following:
- Comment: “In Section 3.1, the authors list the classes present in the dataset, such as bumper dents, panel scratches, frontal crashes and rear crashes. On the other hand, the results presented in the form of a matrix or images include the labels: damaged, undamaged, background or car damaged, car not damaged. What labels were used during learning?”
Response: We appreciate the reviewer’s observation. To clarify, our model was trained using a binary classification scheme with only two primary labels: “Damaged” and “Not Damaged.” The references to specific types of damage, e.g., bumper dents, panel scratches, frontal crashes, and rear crashes, were included in Section 3.1 merely as examples of visual damage patterns present in the “Damaged” category; they were not used as separate class labels during training. Although the model is trained for binary classification, it is designed with an object detection head that enables it to localize and detect a wide range of damage types. These subtypes (e.g., bumper dents, panel scratches) are implicitly learned by the model through visual features and bounding box annotations, allowing it to detect diverse and overlapping damage patterns in real-world scenes.
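As a small, purely illustrative sketch of this labeling scheme (the subtype names and helper function are hypothetical, not taken from our annotation files), fine-grained annotation names can be collapsed into the two training labels as follows:

```python
# Hypothetical mapping: any fine-grained damage subtype collapses to class 1
# ("Damaged"); everything else, including undamaged vehicles, maps to class 0.
DAMAGE_SUBTYPES = {"bumper_dent", "panel_scratch", "frontal_crash", "rear_crash"}

def to_binary_label(annotation_name: str) -> int:
    """Return 1 for any damage subtype, 0 for 'Not Damaged'."""
    return 1 if annotation_name in DAMAGE_SUBTYPES else 0

assert to_binary_label("panel_scratch") == 1
assert to_binary_label("not_damaged") == 0
```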
- Comment: “Section 3.2 The authors should include the structures of all models that are then trained.”
Response: We thank the reviewer for this insightful comment. To address it, we have provided a public link to our code and dataset in the Abstract for full reproducibility: “The employed vehicle damage dataset and implementation code can be found at: https://github.com/MdFahimShahoriar/finegrained-damage-classify-xai.” (page 1)
- Comment: “In the article, it would be useful to add model tests using pictures of an undamaged car or cars.”
Response: Thanks. Added in Figure 13(f). (page 19)
- Comment: “This would confirm the claim of robustness of the model proposed by the authors.”
Response: We appreciate the reviewer’s suggestion. To substantiate the robustness of our proposed Hybrid ViT-RetinaNet model, we have included extensive evaluations under challenging real-world and synthetically generated noisy conditions (see Section 4.6, Inference Under Visual and Artificial Noise, and Section 4.7, Damage Detection Under Dynamic Conditions). The model demonstrates consistent detection performance in the presence of fire, deformation, motion blur, rain, low illumination, and other noise, thereby validating its resilience and applicability in dynamic and high-stakes environments. (pages 17-20)
- Comment: “All drawings from Figure 2 onward have too small captions that are difficult to read. Figures 7, 8 and 9 should have a single description with appropriate annotation, for the subfigures. The description should contain information about the model type. Figures 10, 11 and 12 should have one description, with appropriate annotation, for the subfigures. Figures 13-17 should have one description with appropriate annotation, for the subfigures. Figures 18-23 should have one description with appropriate annotation, for the subfigures. Figures 24-28 should have one description with appropriate annotation, for the subfigures. Figures 29-33 should have one description with appropriate annotation, for the subfigures. Figures 34-38 should have one description with appropriate annotation, for the subfigures. Figures 39-43 should have one description with appropriate annotation, for the subfigures. Figures 44-48 should have one description with appropriate annotation, for the subfigures.”
Response: Thanks. All the figures are updated.
Reviewer 4 Report
Comments and Suggestions for Authors
Notes:
1. When numbering figures, they cannot have the same type of numbering as the main figure (see figures 7-48);
2. The positioning of tables and figures in the text should be done at the end of the paragraph in which the reference is placed, not in the middle of sentences, before the references, or far from them;
3. Figures 1-6 should be enlarged to make the text in them more readable;
4. The references to Figures 7, 8, 10, 11, 13-16, 18-22, 24-27, 29-32, 34-37, 39-42, and 44-47 should be revised. Subfigures should be referenced by an index (e.g., a), b), c), etc.) of the corresponding figure. Note also the inconsistency in the references to the figures ("..Figure 9 presents...", "...In Subfigure 7,...", "...rain (Figure 23d)...", "...dust (Figure 23e)...").
5. The data presented in table 3 are redundant with the paragraph between lines 469-476.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we present our responses to the comments together with a summary of the corresponding changes, which are shown in red font in the revised manuscript. All page numbers mentioned below refer to the new revised manuscript, where the changes are highlighted in yellow. Following the reviewer’s suggestions, we have done the following:
- Comment: “When numbering figures, they cannot have the same type of numbering as the main figure (see figures 7-48).”
Response: Thanks. Corrected.
- Comment: “The positioning of tables and figures in the text should be done at the end of the paragraph in which this reference is placed. Not in the middle of sentences, before the references or far from them.”
Response: Thanks. Corrected.
- Comment: “Figures 1-6 should be enlarged to make the text in them more readable.”
Response: Thanks. Corrected.
- Comment: “The references to figures 7, 8, 10, 11, 13-16, 18-22, 24-27, 29-32, 34-37, 39-42, 44-47, should be revised. They should be as an index (e.g.: a), b), c), etc.) of the corresponding figures. To point out the inconsistency in the references to the figures ( "..Figure 9 presents...", "...In Subfigure 7,...", "...rain (Figure 23d)...", "...dust (Figure 23e)...").”
Response: Thanks. Corrected.
- Comment: “The data presented in table 3 are redundant with the paragraph between lines 469-476.”
Response: Thanks. The description of Table 3 has been updated and shortened as follows: “Table 3 compares hybrid pipelines using Faster R-CNN and RetinaNet with the proposed end-to-end architecture. While both hybrids achieved high accuracy (97% and 95%) and strong mAP scores (83.5% and 85.1%), their staged processing introduces latency and slower inference speeds (13 FPS and 16 FPS) due to model-to-model communication overhead.” (page 14)
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript was well revised according to the reviewers’ comments and essentially meets the requirements for journal publication. However, the reviewer has one small remaining suggestion: the clarity of some figures in the manuscript is insufficient and should be improved.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we present our responses to the comments together with a summary of the corresponding changes, which are highlighted in yellow in the revised manuscript. All page numbers mentioned below refer to the new revised manuscript, where the changes are shown in red font. Following the reviewer’s suggestions, we have done the following:
- Comment: “The manuscript was well revised according to the reviewers’ comments and essentially meets the requirements for journal publication. However, the reviewer has one small remaining suggestion: the clarity of some figures in the manuscript is insufficient and should be improved.”
Response: Thanks. Following the reviewer’s suggestion, the figures have been updated.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors addressed all of the comments presented in the review. However, there is an error in the article related to the descriptions of Figures 10 and 11: the descriptions do not correspond to the photos presented.
Author Response
Response for Reviewers
We would like to thank the reviewers for their comments. In the following, we present our responses to the comments together with a summary of the corresponding changes, which are highlighted in yellow in the revised manuscript. All page numbers mentioned below refer to the new revised manuscript, where the changes are shown in red font. Following the reviewer’s suggestions, we have done the following:
- Comment: “The authors addressed all of the comments presented in the review. However, there is an error in the article related to the descriptions of Figures 10 and 11: the descriptions do not correspond to the photos presented.”
Response: Thanks. Descriptions of Figures 10 and 11 have been corrected.

