Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification

Saha, Ananya; Pavel, Mahir Afser; Titu, Md Fahim Shahoriar; Apurba, Afifa Zain; Khan, Riasat

doi:10.3390/vehicles7030089

Open AccessArticle

Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification

by

Ananya Saha

,

Mahir Afser Pavel

,

Md Fahim Shahoriar Titu

,

Afifa Zain Apurba

and

Riasat Khan

^*

Electrical and Computer Engineering, North South University, Dhaka 1229, Bangladesh

^*

Author to whom correspondence should be addressed.

Vehicles 2025, 7(3), 89; https://doi.org/10.3390/vehicles7030089

Submission received: 2 July 2025 / Revised: 31 July 2025 / Accepted: 17 August 2025 / Published: 25 August 2025

Download

Browse Figures

Review Reports Versions Notes

Abstract

Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection approaches are not time-effective, prone to human error, and lead to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability. This study presents a robust and interpretable deep learning framework for vehicle damage classification, integrating Vision Transformers (ViTs) and ensemble detection strategies. The proposed architecture employs a RetinaNet backbone with a ViT-enhanced detection head, implemented in PyTorch using the Detectron2 object detection technique. It is pretrained on COCO weights and fine-tuned through focal loss and aggressive augmentation techniques to improve generalization under real-world damage variability. The proposed system applies the Weighted Box Fusion (WBF) ensemble strategy to refine detection outputs from multiple models, offering improved spatial precision. To ensure interpretability and transparency, we adopt numerous explainability techniques—Grad-CAM, Grad-CAM++, and SHAP—offering semantic and visual insights into model decisions. A custom vehicle damage dataset with 4500 images has been built, consisting of approximately 60% curated images collected through targeted web scraping and crawling covering various damage types (such as bumper dents, panel scratches, and frontal impacts), along with 40% COCO dataset images to support model generalization. Comparative evaluations show that Hybrid ViT-RetinaNet achieves superior performance with an F1-score of 84.6%, mAP of 87.2%, and 22 FPS inference speed. In an ablation analysis, WBF, augmentation, transfer learning, and focal loss significantly improve performance, with focal loss increasing F1 by 6.3% for underrepresented classes and COCO pretraining boosting mAP by 8.7%. Additional architectural comparisons demonstrate that our full hybrid configuration not only maintains competitive accuracy but also achieves up to 150 FPS, making it well suited for real-time use cases. Robustness tests under challenging conditions, including real-world visual disturbances (smoke, fire, motion blur, varying lighting, and occlusions) and artificial noise (Gaussian; salt-and-pepper), confirm the model’s generalization ability. This work contributes a scalable, explainable, and high-performance solution for real-world vehicle damage diagnostics.

Keywords:

vehicle damage classification; vision transformer; RetinaNet; Detectron2; explainable AI; ensemble modeling; weighted box fusion; insurance automation

1. Introduction

The rapid expansion of the automotive industry and the global surge in vehicle ownership have intensified competition for limited road space. Forecasts predict that by 2040 the number of vehicles worldwide will exceed 1.6 billion [1]. This exponential growth has precipitated an urgent need for vehicle damage assessment systems that are not only efficient and accurate but also scalable to accommodate the increasing volume and complexity of vehicular incidents. Traditional methods, which primarily rely on manual inspections conducted by human experts, are time-consuming and susceptible to inconsistencies and human error [2]. Manual damage assessment is inaccurate; hence, considerable losses in insurance claims for automotive and delays in repair workflows can occur [3].

In recent years, computer vision and deep learning techniques have been used to build automated object detection and classification capabilities, setting the stage for intelligent vehicle damage assessment [4]. Specifically, CNNs such as EfficientNet, ResNet, or Mask R-CNN emerged as exceedingly successful methods for extracting fine-grained visual features for damage identification [5]. However, their limitations on generalization often stem from the requirement for big-scale annotated datasets and computationally heavy training procedures [6]. To overcome these disadvantages, Vision Transformers (ViTs) [5], Detection Transformers (DETRs) [4], and Open-Vocabulary Object Detection techniques such as OWL-ViT [7] have recently been proposed. Since these models operate with self-attention mechanisms capable of capturing long-range spatial dependencies within images, damage classification becomes more accurate and robust across changing environments and structural conditions. To overcome the limitations of training from scratch and to improve generalization on limited, domain-specific data, this study adopts a transfer learning strategy using pretrained Vision Transformer backbones fine-tuned on vehicle damage datasets.

Explainable AI (XAI) integration has become key to establishing transparency and trust in automated assessment systems. Various XAI methods, e.g., Gradient-weighted Class Activation Mapping (Grad-CAM) [8], SHapley Additive exPlanations (SHAPs) [9], and Local Interpretable Model-agnostic Explanations (LIMEs) [10] operate as interpreters to explain visual content and quantify the reason behind every decision of the model. Such clarity is essential for insurance and vehicle servicing stakeholders to endorse any AI-generated decision through the claim and repair processes.

Automated vehicle damage classification extends beyond insurance and repairs use cases and is vital to road infrastructure management, mobility safety, and operational efficiency. Early identification of damage enables proactive maintenance and risk mitigation, potentially averting accidents that pose threats to life and property. State-of-the-art sensor technologies, such as vehicle-mounted cameras, roadside imaging systems, and LiDAR-equipped inspection vehicles, further facilitate real-time image acquisition for deep learning-based analysis [3]. These systems can support continuous monitoring and predictive maintenance strategies when integrated with AI-driven classification frameworks.

Alongside the classification task, the applicability of vehicle damage detection systems extends across a spectrum of critical domains, including the enhancement of autonomous vehicle safety, optimization of traffic flow dynamics, preservation of transportation infrastructure, and facilitation of environmental surveillance. These intelligent systems empower rapid, data-driven decision-making in high-stakes, risk-sensitive environments by enabling real-time identification of structural anomalies and damage patterns. Although real-world deployment poses challenges related to adaptability and scalability across diverse vehicle types and environmental settings, deep learning frameworks exhibit the flexibility required to address these variations effectively.

This research hypothesizes that the integration of Vision Transformers (ViTs) with CNN-based detection frameworks, explainable AI (XAI) techniques, and ensemble fusion strategies can substantially improve the accuracy, robustness, and interpretability of automated vehicle damage classification systems, illustrated in Figure 1. By leveraging the fine-grained feature extraction capabilities of ViTs, the structural detection efficiency of Detectron2, and visual attribution techniques such as Grad-CAM, Grad-CAM++, and SHAP, the system can reliably classify diverse vehicle damage types under real-world variability and noise. Furthermore, incorporating adaptive fusion mechanisms enhances spatial precision and mitigates detection ambiguity. The proposed framework is designed to be scalable and interpretable across various settings, offering a practical solution for insurance automation, roadside diagnostics, and real-time safety monitoring. A conceptual overview of the proposed system architecture is illustrated in Figure 1 for visibility.

The novelty of this study is the application of a vehicle damage analysis, which leverages advanced deep learning architectures and explainable AI to solve real-world diagnostic challenges in the automotive industry. The primary contributions of this work are as follows:

A hybrid RetinaNet–ViT architecture designed for robust damage classification, combining CNN-based localization with transformer-based global reasoning, enhanced through transfer learning and focal loss.
Built-in interpretability using integrated Grad-CAM, Grad-CAM++, and SHAP visualizations to support explainable predictions for transparent decision-making in real-world applications.
Extensive validation under visual distortions, including artificial noise and real-world environmental challenges, demonstrating model robustness and generalization.

In summary, this work introduces a robust and interpretable framework, an ensembling method of the RetinaNet–ViT architecture with advanced transfer learning, focal loss, and rich augmentations. It incorporates explainable AI methods for visual justification, enhances localization through Weighted Box Fusion, and demonstrates strong resilience under both synthetic noise and real-world environmental distortions. The remainder of this article is organized as follows: Section 2 reviews the related literature on vehicle damage detection. Section 3 outlines the proposed methodology. Section 4 presents experimental results and evaluations. Section 6 concludes with insights and directions for future work.

2. Related Works

Automated vehicle damage detection has rapidly advanced, with increasing focus on accuracy, efficiency, and interpretability. This section categorizes existing methods into traditional techniques, supervised deep learning, self-supervised and semi-supervised approaches, and explainable AI, while also highlighting their limitations in real-world deployment scenarios.

2.1. Traditional Approaches

Early vehicle damage assessments relied on manual inspection by human experts [2], which were subjective, time-consuming, and error-prone. Heuristic methods such as edge detection, histogram analysis, and template matching were later adopted [4] but lacked robustness under varying lighting conditions, occlusions, or diverse vehicle types [5,11]. Hand-crafted features like SIFT and HOG also fell short in generalizing across damage types [12].

2.2. Supervised Deep Learning Methods

CNN-based models such as EfficientNet, ResNet, and Mask R-CNN significantly improved visual feature extraction for damage classification tasks [5,13]. Despite their success, these approaches typically require large annotated datasets and struggle with generalization under complex environments [6]. To reduce training burden and improve robustness, ensemble strategies [14,15], dataset-specific tuning [16,17], and multi-task models with pretrained backbones [18] have been explored.

2.3. Self-Supervised and Semi-Supervised Methods

Transformers such as DETR and OWL-ViT use self-attention mechanisms to model long-range dependencies, improving performance on occluded and complex scenes [4,7]. Vision–language models like CLIP enable zero-shot classification, making them promising for damage detection tasks with limited labeled data [19]. Some studies adopt transfer learning and fine-tuning strategies as a semi-supervised approach, allowing adaptation from general datasets to vehicle-specific contexts [4,20]. Lightweight variants, like MobileViT, aim to reduce computational overhead while retaining transformer benefits [20], and hybrid CNN–Transformer models like Swin Transformer provide a strong balance between local and global feature learning [21]. Efficient semi-supervised deep learning models are applied in the railway domain, i.e., arc identification [22], bearing fault detection [23], pantograph–catenary relations determination [24], etc.

2.4. Explainable AI Techniques

Interpretability is crucial for real-world damage detection systems, particularly in high-stakes domains like insurance or legal liability. Techniques such as Grad-CAM, SHAP, and Grad-CAM++ have been widely used to highlight regions of interest and offer post hoc explanations of model predictions [8,9,10]. These tools enhance user trust and transparency [25,26], though generating real-time explanations for high-resolution images remains an ongoing challenge [27].

Table 1 presents a comprehensive summary of key limitations in existing vehicle damage assessment approaches and the corresponding solutions proposed in this work. Among these challenges are subjectivity in manual assessments, generalization failures in rule-based methods, the reliance of CNNs on large labeled datasets, lack of interpretability, and the high computational burden of transformer models. This study cleanses the above problems using fine-tuned Vision Transformers (ViTs), efficient hybrid architectures, and data augmentation strategies. Alongside these strategies, Explainable AI (XAI) techniques, such as Grad-CAM and SHAP, are employed to ensure interpretability, together with the use of lightweight transformer models and transfer learning, to guarantee scalability and real-time viability. Furthermore, the system is rigorously evaluated under a variety of challenging visual conditions—including Gaussian noise, salt-and-pepper distortion, motion blur, rain, and fire occlusions—to assess its robustness and generalization in real-world accident scenarios where image quality is often compromised.

3. Methodology

This section elaborates on the vehicle damage classification system that is developed in this work. This methodology includes research design, dataset generation, model selection, training, explainability, and deployment. This system provides genuinely interpretable solutions to assess vehicle damage in real-world environments using an advanced deep learning architecture along with explainable AI methodologies.

Figure 2 illustrates the end-to-end workflow of the proposed vehicle damage assessment system, integrating data acquisition, preprocessing, model training, and interpretability techniques. The pipeline begins with image collection from public repositories and custom datasets, followed by normalization, bounding box adjustments, and dataset splitting. A sequential neural network is trained through a process involving data augmentation, hyperparameter tuning, and model selection. For model explainability, the system incorporates Grad-CAM, Grad-CAM++, and SHAP using the Detectron2 framework, enabling transparent and trustworthy predictions.

The process begins with leveraging pretrained models on the COCO dataset, which are then fine-tuned for the custom vehicle damage dataset. Optimization strategies address common challenges, including class imbalance and overfitting. Focal Loss has been used to deal with class imbalance by increasing focus on examples that are hard to classify and reducing focus on easy examples. AdamW optimizes the algorithm to converge faster while maintaining more stability through weight decay. In addition, a dynamic learning rate schedule is imposed depending on Cosine Annealing to let the training converge more efficiently throughout the process. Regularization techniques, dropout regularization, and further data augmentations are implemented to curb overfitting and foster generalization.

Following the training, the models are evaluated for performance concerning well-established metrics such as accuracy, precision, recall, F1-score, and mAP. All these help the understanding of how well the model can identify and localize the different types of vehicle damage. Next, the interpretability methods are integrated to make the model’s decision-making process transparent. Finally, Grad-CAM, Grad-CAM++, and SHAP are leveraged to interpret decisions by pointing to image regions that contributed the most to the models’ outputs, thus making the system more transparent and enhancing trust in the system’s results.

3.1. Dataset and Preprocessing

The dataset used in this research is a custom vehicle damage database from COCO [29] for general object detection and a carefully curated dataset for 4500 vehicle damage images collected from public sources, including several types, such as bumper dents, panel scratches, frontal impacts, and rear-end collisions [30]. About 60% are custom vehicle damage images, while the rest are from COCO for larger generalization. Figure 3 presents the entire clean preprocessing pipeline.

Image normalization meant scaling pixel values to consider a uniform range [0, 1] for all the possible input distributions of the dataset.

All the bounding box annotations for the object detection models were refined and normalized during the preprocessing to localize the damage in the vehicle for each image precisely. To ensure a fair and comprehensive evaluation of the model, the dataset was split into training (70%), validation (20%), and test (10%) subsets. Data augmentation processes were performed on the training subset to increase the degree of robustness in the models. Through this preprocessing pipeline, data enrichment was achieved while also laying the groundwork for future model training, where enhanced features would facilitate improved performance in classification tasks.

3.2. Model Selection and Architecture

The proposed vehicle damage classification system is developed using various architectures such as ViT (Detectron2), DINOv2, EfficientNet, ResNet, and Faster-RCNN. All the models are implemented in the PyTorch framework. The global architecture of the model is illustrated in Figure 4.

The input image

I \in R^{H \times W \times 3}

is first passed through a feature extractor. ResNet-based models, e.g., EfficientNet and Faster R-CNN, use convolutional layers for feature extraction. For the ViT model, the image has been divided into patches, and those patches have been passed through the transformer encoder to capture long-range dependencies [31]. After extracting features, they are flattened again and routed through a classification head of fully connected layers. A dropout layer is utilized to prevent overfitting, while the final class probabilities are generated using a softmax activation function:

{\hat{y}}_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{K} e^{z_{j}}}, i = 1, 2, \dots, K

(1)

where K denotes the number of output classes and

z_{i}

is the logit score for class i. Focal Loss function is used during training to handle class imbalance in the dataset, expressed as:

L_{focal} = - α_{t} {(1 - {\hat{y}}_{t})}^{γ} log ({\hat{y}}_{t})

(2)

where

{\hat{y}}_{t}

is the predicted probability for true classes t,

α_{t}

is a weighting factor, and

γ

is a focusing parameter that adjusts the importance of examples.

To counter class balance and make the detection more accurate, both Focal Loss and Intersection over Union (IoU) Loss are used in the model.

Focal Loss is derived from the cross-entropy loss and improves it in reducing the loss contribution from well-classified examples and focusing learning on hard misclassified examples. It is used for a big class-imbalance scenario.

FL (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} log (p_{t})

(3)

Here,

p_{t}

represents the model-estimated probability for the true class,

α_{t}

the weighting factor to balance class importance, and

γ

the focusing parameter that controls the down-weighting of easy examples.

IoU Loss, in contrast, is supposed to maximize the overlap between the predicted regions and the ground truth, fitting best for segmentation and bounding box regression tasks. It is defined as follows:

IoU = \frac{| Prediction \cap Ground Truth |}{| Prediction \cup Ground Truth |}

(4)

The corresponding IoU Loss can be found as follows:

IoU Loss = 1 - IoU

(5)

This loss aims to punish those predictions with very little overlap and favor those with a greater degree of spatial alignment. When both Focal Loss and IoU Loss operate simultaneously, the model can, therefore, perform well against the imbalance in classes while continuing to excel in localization.

This architecture enables CNN and transformer-based models to learn fine-grained features, ensuring accurate vehicle damage classification effectively.

3.3. Training and Optimization Strategies

To optimize the performance of our model and ensure robust generalization, we employed a range of advanced transfer learning techniques. One of the key strategies was leveraging transfer learning, where the initial model weights were initialized using pretrained parameters from the COCO dataset [29]. This significantly accelerated the convergence process and provided a rich foundation of general visual features, which are beneficial for tackling diverse tasks across various domains.

Addressing the challenge of class imbalance in real-world datasets, we applied Focal Loss to adjust the model’s loss function. This approach enhanced the contribution of difficult-to-classify examples, enabling them to play a more prominent role in the model’s decision-making process. Complementing this, we used the conventional Cross-Entropy Loss as the primary classification objective, which ensures the generation of well-calibrated probabilistic outputs. For optimization, we adopted the AdamW optimizer [32], known for its ability to decouple weight decay from gradient-based updates. This facilitates a more effective learning dynamic and helps mitigate overfitting. Additionally, we incorporated the Cosine Annealing Learning Rate Schedule [33], which adjusts the learning rate periodically. This strategy facilitates the exploration of wider and flatter minima in the loss landscape, which are often associated with better generalization performance.

Dropout layers were utilized to further enhance regularization and prevent overfitting. This technique reduces co-adaptation among neurons and discourages the model from memorizing redundant features. Additionally, to improve the model’s ability to generalize across unseen data distributions, we implemented a variety of data augmentation techniques. These included random cropping, flipping, scaling, and color jittering, all of which simulate real-world variations in data. The entire training pipeline—from model design and setup to optimization—was meticulously crafted to ensure the model’s robustness, generalization, and overall performance.

Model Compilation and Training

The trained models use an AdamW optimizer and a Focal Loss function for classification, defined in (2), to manage class imbalance of multiple damage types, such as bumper dents, panel scratches, frontal impacts, and rear-end collisions. Training involved employing early stopping and model checkpointing techniques to prevent overfitting. Upon completion, the trained models were saved in both JSON and HDF5 formats for future use and deployment.

3.4. Explainability and Interpretability (XAI Integration)

To guarantee transparency, accountability, and trustworthiness within deep learning algorithms, particularly in safety-critical domains such as medical imaging, we incorporate various state-of-the-art Explainable AI (XAI) techniques. These methods explain the internal working mechanism of the models and their decision boundaries, thus moving them from the category of a “black box” to a more interpretable scheme that can be understood and validated by human experts. In particular, we use a mix of visual-based and feature attribution-based XAI tools that explain the model predictions at both the global and local levels.

Grad-CAM (Gradient-weighted Class Activation Mapping): With Grad-CAM, it is a modality that depicts areas of localization by texture highlighting input images that are most effective for a decision by prediction from the model [8]. In particular, Grad-CAM produces a map of the important parts of a deep learning model for an input image under the influence of its decision. Furthermore, it uses any target concept type (e.g., class label) gradients in the final convolution layer of a CNN to produce a heatmap that overlays salient regions. This step is most useful in understanding spatial importance and determining whether the model focuses on critical damage regions, such as dents, scratches, or structural deformations in vehicle images.
Grad-CAM++: Grad-CAM++ extends Grad-CAM since it adds a refined scheme for calculating the importance weights of the feature maps [34]. Higher-order derivatives are used for this computation. The advantages of this method are improved spatial localization of discriminative regions, especially with multiple instances of the same class in one image, as well as in diffuse boundaries of object images. This work proves especially useful in detecting subtle changes in anatomy from several imaging modalities or layers due to high model interpretability.
SHAP (SHapley Additive exPlanations): SHAP is a model-agnostic, game-theoretic approach to explain the output of an AI model by assigning each feature an importance value for a particular prediction [9]. Based on the concept of Shapley values from cooperative game theory, SHAP ensures consistency and local accuracy in feature attribution. Concerning our work here, SHAP would be applied towards tabular feature representations together with intermediate embeddings extracted from the neural network to quantify the contribution of each feature, anatomical, radiomic, or modality-specific-positive or negative-net result, on the way to the final classification decision. This process promotes deeper insights into the reasoning behind the model and the discovery of biases or confounding factors in learned representations.

4. Results and Discussion

This section provides a complete analysis of the proposed vehicle damage classification system, including the experimentation setup, performance metrics, results, and analysis of the applied deep learning models. Integrating Explainable Artificial Intelligence (XAI) techniques’ interpretation to the model’s decision-making process is demonstrated.

4.1. Experimental Setup

All the experiments were performed in a high-performance computing system with an Intel Core i-12900 K processor, NVIDIA RTX 3090 GPU, and 64 GB DDR5 DRAM. The models were implemented using the PyTorch and TensorFlow frameworks. The dataset was split into a training set (70%) and validation and testing sets (20% and 10%, respectively). The training employed a batch size of thirty-two, an initial learning rate of 0.0001, and a maximum of fifty epochs with early stopping enabled. This experiment was set up according to the standard procedure:

Data pre-processing and augmenting (cropping, rotation, brightness adjustment, and noise addition).
Transfer learning from pretrained COCO models.
Training with various optimization methods (AdamW, cosine annealing scheduling).
Metrics for object detection and classification evaluation.
Model interpretability analysis through XAI techniques.

4.2. Performance Metrics

For evaluating the effectiveness of a certain model, the following metrics will be used:

Mean Average Precision (mAP) for which the IoU thresholds are 0.50:0.95 [29], measuring localization accuracy in object detection models.
Precision, recall, and F1-score for estimating classification performance [6].
The model’s inference speed in terms of frames per second (FPS), meaning how fast the model is processing images [31].
Confusion matrix analysis to help analyze misclassification patterns and error analysis.
Receiver Operating Characteristic (ROC) Curve for assessing how the two rates of performance trade between true positive and false-positive rates.

Impact of Model Type: Classification vs. Detection Frameworks

This section presents a detailed comparative analysis of the selected models regarding their performance in a vehicle damage classification task. The models are assessed and evaluated for their performance based on several considerations: classification accuracy, mean Average Precision (mAP), precision, recall, F1-score, and inference speed (FPS). The evaluation results are tabulated in Table 2.

Among all the proposed models applied in this work, the Vision Transformer (ViT) integrated with Detectron2 emerges as the top performer, achieving a mean Average Precision (mAP) of 87.2% and an F1-score of 84.6%. These results demonstrate that the damage detection framework can perform the task more effectively than others and highlight the effectiveness of the ViT architecture in acquiring globally relevant information. This performance underlines that the ViT model achieves a balanced relationship between precision and recall and provides high accuracy values for detecting different damage categories. ResNet and EfficientNet, in contrast, exert strong but comparatively nuanced performance metrics. EfficientNet records a higher precision of 81.7%, with a corresponding trade-off in recall (77.2%), which means EfficientNet is better at identifying a few select damage types but may not be as attentive as required to detecting more subtle forms of damage. On the other hand, ResNet has a more balanced precision and recall value of 83.9% and 80.1%, respectively. ResNet can detect a wide range of damage types relatively, thus becoming invaluable in applications where sensitivity and specificity matter. Although Faster R-CNN enjoys the highest classification accuracy (98%), relatively lower values for mAP and F1-score may be observed. A lower mAP of 76.5% and F1-score of 75.2% could mean that, in contrast to Faster R-CNN’s performance in classifying specific damage types, it may have difficulty with the localization and detection of all instances of a particular damage. This implies that the Faster R-CNN performs best when high classification accuracy is paramount, but may become less handy when damage detection becomes comprehensive. This study focuses on achieving accurate offline analysis. However, while inference time was still measured, the model ViT (Detectron2) was processed at 22 FPS (Table 2), making it suitable for near real-time applications on high-performance computing hardware. Lighter models, such as EfficientNet (15 FPS) or ResNet (18 FPS), may prove more feasible for constrained hardware.

The DINOv2 model achieved superior performance with a test accuracy of 93.46%, surpassing MobileViT’s 90.51%, and demonstrated higher precision (88.5%), recall (87.6%), IoU (84.7%), and F1-score (86.1%) with only 20 misclassifications. In contrast, MobileViT had slightly lower metrics across the board and a higher misclassification count of 48, indicating DINOv2’s stronger overall reliability. Figure 5 illustrates the learning dynamics, accuracy and loss trends of the applied DINOv2 model.

Figure 6 presents a comparative analysis of model performance using confusion matrices, highlighting the contrast between classification-based and detection-based evaluation approaches. In Figure 6a, the classification model demonstrates a balanced ability to distinguish between the two primary categories: “Car Damage” and “Car Not Damaged.” The true positive rates are notably high, with 90 correctly classified damaged instances and 92 correctly classified non-damaged cases. However, the matrix also reveals a moderate level of misclassification—10 damaged vehicles are incorrectly labeled as undamaged, and 8 undamaged vehicles are predicted as damaged. These findings suggest that the classification model maintains a good trade-off between sensitivity and specificity, though further fine-tuning could enhance its reliability in borderline cases.

In contrast, Figure 6b provides a more granular view of the hybrid RetinaNet–ViT model’s performance by visualizing predictions across multiple localized categories through a detection confusion matrix. The ensemble detection model excels at identifying background regions (82 accurate predictions) and moderately performs in detecting “Damaged-part 2” (22 correct identifications). However, it exhibits clear limitations in identifying “Damaged-part 1,” with zero true positives in that class. This disparity highlights the challenge of fine-grained localization, which requires the model to learn subtle spatial cues and contextual variations.

Comparatively, the classification matrix reflects a broader binary decision-making capability that is useful for initial screening tasks. On the other hand, the detection matrix emphasizes spatial and categorical precision, which is crucial for detailed damage assessment. This comparison identifies that the auxiliaries, making classification and detection of insult assessment, complement each other. Hence, classification offers reliable binary classification for damage presence, while detection leads to actual localization and actual categorization grades required for real-world orchestration in insurance, surveillance, and automotive safety.

4.3. Ablation Studies

This section performs a series of ablation analyses that assess the contributions of various design decisions under the proposed vehicle damage detection framework. The following paragraphs analyze in detail some core elements of the framework that yield pertinent insights into how each design decision affects the performance of the entire system in terms of functioning, effectiveness, and practical applicability.

4.3.1. Impact of Weighted Box Fusion on Ensemble Predictions

To further enhance the robustness and spatial precision of the damage detection pipeline, we incorporated Weighted Box Fusion (WBF) into the ensemble detection framework of the hybrid RetinaNet–ViT model. Unlike Non-Maximum Suppression (NMS), which retains the highest confidence score of the bounding box while discarding others with significant overlap, WBF aggregates predictions from multiple detectors by averaging their bounding box coordinates and confidence scores based on Intersection over Union (IoU). This fusion mechanism enables more accurate localization and consensus-based decision-making, particularly in complex or ambiguous visual scenes.

Figure 7 presents a comparative view of detection performance between the standard ensemble Figure 7b and the WBF-enhanced ensemble Figure 7a, illustrated using confusion matrices. Notably, the WBF-based ensemble achieves higher precision in identifying the “Damaged-part 2” class (23 correct predictions vs. 22). It also reduces false positives in the “No Damage” category (0 vs. 3). These improvements suggest that WBF successfully refines spatial hypotheses, particularly by resolving bounding box overlaps in uncertain or low-confidence regions.

4.3.2. Effect of Architectural Design: Hybrid vs. Standalone Models

The effect of the hybrid versus standalone design is investigated on the performance of models with important indicators of accuracy, mAP@0.50, precision, recall, F1-score, and FPS at the forefront. Among different architectural paradigms, hybrid ones, integrating some neural components or pipelines into one architecture, remain at the center of attraction due to their ability to harness the capabilities of individual subsystems. Standalone systems, on the contrary, tend to follow more straightforward architectural paths but at the cost of suffering the inability to perform well in high-variance environments.

Table 3 compares hybrid pipelines using Faster R-CNN and RetinaNet with the proposed end-to-end architecture. While both hybrids achieved high accuracy (97% and 95%) and strong mAP scores (83.5% and 85.1%), their staged processing introduces latency and slower inference speeds—13 FPS and 16 FPS—due to model-to-model communication overhead.

4.4. Model Optimization and Deployment Efficiency

Quantization has been applied in this work as a critical optimization strategy for deploying vehicle damage classification and detection models, enabling substantial reductions in both computational and memory overhead while maintaining operational flexibility. As reported in Table 4, the hybrid model compresses from 1218.1 MB to 591.39 MB (a reduction of approximately 51.5%), and parameters shrink from 159.6 million to 77.6 million through precision transformation (FP32 → INT8) and mixed-precision arithmetic. This compression significantly improves hardware-level efficiency by reducing floating-point operations (FLOPs), enhancing cache coherence, and minimizing DRAM access latency, which directly translates into faster inference speeds.

Unlike pruning, which may irreversibly remove critical weights, quantization preserves the model’s representational capacity, retaining 77.15 million trainable parameters post-quantization. This process enables downstream fine-tuning or incremental retraining on domain-specific vehicle datasets (e.g., different damage types such as scratches, dents, and shattered glass) with negligible accuracy loss. The hybrid architecture, combining CNN-based local damage feature extraction with global context modeling via transformer modules, substantially benefits from quantization, as the reduced precision accelerates both convolutional and attention operations without compromising the fidelity of multi-scale feature fusion.

Quantization offers a decisive advantage for edge deployment scenarios, e.g., real-time accident assessment, automated insurance claim verification, or on-board vehicle damage detection systems. Cutting memory bandwidth requirements allows the model to achieve higher frames per second (FPS) on resource-constrained devices like the NVIDIA Jetson series or mobile NPUs while simultaneously lowering power consumption. When coupled with knowledge distillation, patch–spatial attention fusion, and multi-branch hybrid models, quantization further improves inference throughput by streamlining redundant feature computations, all while maintaining high mAP, precision, and F1-scores critical for robust damage detection pipelines.

4.4.1. Impact of Data Augmentation

To assess the individual contribution of image augmentation, we performed experiments with and without data augmentation during the evaluation. The model trained with augmentation achieved an mAP of 87.2%, compared with 80.1% for the model without augmentation, showing a gain of 7.1%. Thus, these results prove that augmentation is important in improving the model’s ability to adapt to real-world variations.

4.4.2. Impact of XAI Integration

Integration of XAI techniques and an analysis of their effect (including Grad-CAM, Grad-CAM++, and SHAP) on the model training and interpretability were performed. Although the XAI component increased training time by 10–20%, illustrated in Table 5, it did not damage the classification performance (mAP = 87.2%). SHAP offered the finest level of attribute of feature importance, increasing stakeholder trust by communicating the extent to which each image region contributed to the prediction.

4.4.3. Impact of Transfer Learning

The effect of transfer learning on training with COCO-pretrained weights and from scratch has been evaluated. Training with COCO pretrained Visual Transformer (Detectron2) for 50 epochs gave an mAP of 87.2%, whereas training from scratch yielded 78.5%, showing that an improvement of 8.7% was achieved due to pretrained weights. This demonstrates the importance of using general visual features learned from COCO to converge faster and better.

4.4.4. Impact of Focal Loss

The Focal Loss with parameters set to

γ = 2

,

α = 0.25

was evaluated against a regular categorical cross-entropy loss for the problem of imbalanced classes. The Focal Loss increased the F1-score of under-represented classes (such as structural deformations) by 6.3% (from 78.1% to 84.4%), showing its power in dealing with skewed class distributions.

4.4.5. Computational Efficiency and Model Complexity

In real-world applications—particularly those involving real-time decision-making, such as vehicle damage assessment—computational efficiency and model complexity are critical design considerations. Factors such as model size, number of parameters, and training time per epoch directly influence deployment feasibility, hardware requirements, and responsiveness of the system. Table 5 provides a comparative overview of different collaborative filtering (CF) and object detection models in terms of their computational profiles.

As seen in Table 5, traditional CF models such as user-based and item-based approaches are lightweight, both in terms of memory footprint (under 2 MB) and computational cost (training in under 4 s per epoch). The corresponding parameter count remains under 12,000, making them suitable for low-power devices or applications with strict latency constraints. However, their simplicity may come at the cost of reduced performance or generalization capability for more nuanced tasks. The proposed hybrid Hybrid ViT-RetinaNet model moderately increases complexity—nearly doubling both the parameter count and model size compared to traditional CF methods—but offers a richer feature representation and improved classification performance, as shown in earlier sections. Yet it remains computationally efficient, requiring just 5 s per training epoch. In contrast, object detection models, especially those based on Detectron2 (e.g., Faster R-CNN), exhibit significantly higher model complexity. With over 41 million parameters and a model size exceeding 170 MB, these models demand more memory and processing power. Their training time per epoch ranges from 4 to 7 min, depending on whether explainability components are included.

Incorporating explainability frameworks such as Class Activation Maps (CAMs) or SHAP further increases both the model size and computational burden. While CAM integration results in only a marginal increase in model size (to 172 MB) and training time (to 5 min), SHAP-based models require additional computations for feature attribution, extending the epoch time to approximately 6–7 min. These additions, however, enhance model transparency—crucial in domains, i.e., automotive damage inspection, where interpretability and trust are essential.

4.5. Model Evaluation and Prediction Results

Multiple damage categories (e.g., bumper dents, panel scratches, frontal impacts, and rear-end collisions) and Not Damaged are conducted on separate test samples, namely vehicle images never seen before, so the Hybrid ViT-RetinaNet model is generalizable and robust. The final class has been set with a confidence threshold of 0.5. Thus, if the probability predicted by the model is more than 0.5, the image gets classed as Damaged, and anything less classifies it as Not Damaged. Figure 8 illustrates sample predictions generated by the ensemble model on unseen test images.

4.6. Inference Under Visual and Artificial Noise

In this experiment, we evaluate the robustness of the proposed Hybrid ViT-RetinaNet vehicle damage detection model under noisy conditions, simulating real-world accident scenarios. As shown in Figure 9, the model accurately localized and classified damaged parts even in images affected by severe environmental disturbances such as fire, smoke, deformation, and lighting variations. These results suggest the model’s strong generalization capability and reliability in emergency or post-accident documentation scenarios, which are often affected by noise and chaos.

Overall, the results substantiate the model’s capacity to operate under non-ideal conditions. Taken together, the results demonstrate the suitability of the model to work under imperfect conditions and perform near-real-time damage assessment on high-end computers, with opportunities to optimize it further to run in resource-constrained settings. These findings support the case for further integration of noise-aware learning and domain-adaptive strategies in future iterations of vehicle damage detection frameworks.

To rigorously evaluate the robustness of the proposed damage detection framework, we introduced artificial pixel-level noise to simulate degraded real-world visual conditions. Specifically, we applied Gaussian noise and salt-and-pepper noise to images from the original accident scenarios, demonstrated in Figure 10. These synthetic perturbations emulate challenges such as sensor interference, low-light capture noise, and transmission artifacts commonly found in field data.

Gaussian noise proceeded with zero mean and a variance of 25, while salt-and-pepper noise cooperated with a 2% corruption rate for both salt-and-pepper pixels. The following augmentation functions were employed:

add_gaussian_noise (image, mean = 0, var = 25)
add_salt_and_pepper_noise (image, salt_prob = 0.02, pepper_prob = 0.02)

As shown in Figure 10, the Hybrid ViT-RetinaNet model continues to perform reliably across a wide range of synthetically corrupted inputs. Despite the added distortions, key damage patterns, such as burnt areas, broken wheels, and structural deformation, are correctly localized and classified.

4.7. Damage Detection Under Dynamic Conditions

Figure 11 illustrates six diverse situations to show the varied nature and intense challenge faced in the detection of vehicular damage. In Figure 11a, several overlapping bounding boxes label the vehicle part as “Damaged-part,” each with a severity rating ranging from 35% to 41%. This means that it managed to detect more than one damaged area under reflection or partial occlusion. Figure 11b shows the chaos of a train derailment, with red bounding boxes highlighting the severely damaged areas amidst background clutter and several people, hence proving the efficacy of the model even in complicated outdoor scenarios. One of the accident sites in Figure 11c is clogged with debris, yet the system estimates damage severities correctly for multiple zones from 38% to 42%. In Figure 11d, the same model points to damage on a bus amid what seems like noise or adverse weather conditions—damage estimations for the detected areas are in the 36% to 37% range. Figure 11e is visually noisy and distorted, possibly with regard to motion blur or compression. The model, however, is able to label multiple damaged regions with overlapping severity labels ranging from 35%-to-41%. Finally, Figure 11f depicts a grayscale image of a vehicle with no visible bounding boxes, maybe an undamaged or a control case. Altogether, this example shows that the system can keep its detection accuracy in scenarios of varying lighting conditions, occlusions, cluttered backgrounds, or noisy inputs.

4.8. Explainable AI (XAI) Evaluation

While machine learning can interpret models, a better understanding of such learned models will achieve greater accessibility and applicability in high-stakes areas, including autonomous driving and insurance analytics. That is to say, understanding the decision of any model will be critical in making it trusted, safe, and accountable. Such techniques provide spatial and quantitative explanations that can demystify the behavior of deep convolutional neural networks within visual damage classification through explainable artificial intelligence methods. This section provides a comparison of the three widely used XAI techniques, Grad-CAM, Grad-CAM++, and SHAP, when it comes to their ability to explain model predictions in samples from the Damaged class. To be able to interpret the model and have a transparent approach to decisions, we incorporate Explainable AI (XAI) techniques within our damage classification pipeline:

Figure 12, Figure 13, Figure 14 and Figure 15 provide a well-rounded visual description of the decision-making process for classifying the Damaged class for vehicles. Figure 12 shows original images as input from various types of vehicle damage to lay the groundwork for subsequent analyses of interpretability. Each image shows entirely different types of damage: bumper deformed (Figure 12a), panel dents (Figure 12b), a frontal impact (Figure 12c), and a rear-end impact (Figure 12d). Each also has bounding boxes that localize the affected regions for the types of damage recognized by the model.

Grad-CAM Grad-CAM highlights certain regions in the image that offer a darker context about the model prediction through accessing first-order gradients from the last convolutional layers. Grad-CAM visualization is built upon that, which shows heatmaps of the damaged images indicating the regions being targeted for prediction in the model, as presented in Figure 13. These types of heatmaps would ultimately serve for rough localization of important regions, such as bumper (Figure 13a), vehicle panel (Figure 13b), frontal (Figure 13c), and rear-end areas (Figure 13d), and total scores overall patterns the model tries to form towards the damage. The drawback is that while broad area coverage is good, Grad-CAM often fails to identify minute details.

Grad-CAM++ is an advanced extension of the original Gradient-weighted Class Activation Mapping (Grad-CAM) technique, designed to provide more accurate and fine-grained visual explanations for convolutional neural network predictions [34]. Unlike Grad-CAM, which computes a coarse heatmap based on the gradients flowing into the final convolutional layer, Grad-CAM++ utilizes a weighted combination of the pixel-wise gradients of the output category concerning the feature maps. This technique makes it particularly useful for tasks, e.g., vehicle damage classification, where precise damage localization is critical for interoperability. To overcome this limitation, Figure 14 provides Grad-CAM++ visualizations, enhancing localization’s precision. Grad-CAM++ generates more concentrated and sharper heat maps, enabling the model to identify minor and subtle damage cues better. For instance, more evident attention is paid to localized bumper damage (Figure 14a), precise dents on the panel (Figure 14b), and refined areas of frontal and rear-end impacts (Figure 14c,d). This improved resolution helps more accurately interpret the visual cues the model deems most relevant for each damage type.

SHAP (SHapley Additive exPlanations) estimates feature contributions based on cooperative game theory, providing class-agnostic, theoretically grounded insight into feature importance [9]. Figure 15 displays SHAP (SHapley Additive exPlanations) feature attribution plots, which quantitatively reveal how different input features contributed to the model’s final decision. Each subfigure corresponds to a specific damage type—bumper deformation, panel damage, frontal impact, and rear-end impact—showing whether individual features positively (pink) or negatively (blue) influenced the classification. Unlike the visual attention maps, SHAP offers a feature-level breakdown, providing transparency into how specific inputs affect the prediction score. Altogether, these figures offer visual and numerical insights, reinforcing the interpretability and trustworthiness of the model in damage classification tasks.

These visualizations altogether create a comprehensive framework for understanding the decision-making of the given damage classification model. The original images in Figure 12 provide a baseline visual context, while Grad-CAM in Figure 13 and, specifically, Grad-CAM++ in Figure 14 build insight into the spatial relevance of the regions of interest.

5. Discussion

This study presents an integrated architecture that combines convolutional and transformer-based deep learning models for vehicle damage classification. The models were developed and evaluated using a PyTorch-based pipeline. Experimental results show that transformer-based models—particularly the Vision Transformer (ViT) implemented with Detectron2—consistently outperform conventional CNN architectures across key metrics such as mean Average Precision (mAP), F1-score, and prediction reliability. The enhanced generalization across various damage types can be attributed to the use of transfer learning from COCO-pretrained models, as well as the incorporation of sophisticated data augmentation techniques. The incorporation of Weighted Box Fusion (WBF) further refines detection accuracy by reducing false positives, particularly in visually ambiguous or overlapping damage categories. The proposed Hybrid ViT-RetinaNet technique achieves a favorable balance between performance and inference speed, reaching up to 150 FPS while maintaining competitive mAP values. Ablation studies validate that transfer learning (+8.7% mAP), augmentation (+7.1% mAP), and Focal Loss (+6.3% F1 for minority classes) each play significant roles in enhancing the system’s effectiveness. The use of Focal Loss also proved beneficial in improving the detection of under-represented damage categories by addressing class imbalance during training. To ensure interpretability and trust in model decisions, we employed multiple Explainable AI (XAI) techniques—Grad-CAM, Grad-CAM++, and SHAP—which provided spatial heatmaps and feature importance attributions. Explainable approaches give a classifier explanation employing visualization and feature identification. Since they do the actual explaining, the interpretability of the system is thus improved. Interpretability, for instance, is needed when an algorithm decides in favor of a case, and this decision needs to be upheld under an insurance investigation. The system supports classification at the image level but does not currently handle fine-grained localization or hierarchical labeling (e.g., distinguishing minor scratches from structural damage). Also, though ViT affords better accuracy, its computationally expensive nature is less suitable for direct deployment in resource-constrained devices unless optimized through some avenues, such as model compression or pruning.

5.1. Limitations and Future Research Suggestions

The proposed combined convolutional and transformer-based deep learning framework demonstrates strong performance in vehicle damage analysis under varied visual and environmental conditions. This study has a few limitations. First, the vehicle damage classes adopted in this study are coarse-grained, primarily encompassing broad damage categories. Future extension of this work is to utilize more detailed damage types. Moreover, the ViT transformer-based backbone, while effective in capturing global spatial dependencies, incurs substantial computational overhead, which can impede real-time deployment, particularly on edge devices with limited resources. This trade-off between accuracy and efficiency highlights the need for future research into optimized lightweight transformer variants or efficient hybrid CNN-transformer models. The system’s reliance on fully supervised learning also presents scalability bottlenecks due to the high cost and effort associated with expert-driven annotation. The employed dataset contains images from only passenger vehicles, which limits the model’s transferability to other vehicle classes, such as commercial trucks, buses, and bicycles.

Despite demonstrating robustness to artificial noise and common visual degradations (e.g., Gaussian noise; salt-and-pepper perturbations) in Figure 9 and Figure 10, the system’s performance under real-world environmental complexities such as intense shadowing, reflective surfaces, heavy rainfall, or fire-induced occlusion has not been exhaustively explored. Future datasets and evaluation protocols should incorporate a broader range of such perturbations to consider operational conditions more realistically. To address these limitations holistically, future research should explore the integration of multi-modal sensing, such as LiDAR, thermal imaging, or stereo vision, to supplement RGB data and enhance perception under visually ambiguous conditions.

5.2. Comparative Analysis with Prior Research

Table 6 presents a comparative analysis of the proposed system with related works on automatic automobile vehicle damage classification. Prior works primarily utilized simple CNNs or object detection frameworks, i.e., YOLO (You Only Look Once) and Mask R-CNN. However, these approaches often fall short in terms of granularity in class definitions, interpretability, or computational efficiency.

While this research highlights the strengths of the proposed vehicle damage classification system, it also acknowledges the current limitations, particularly with respect to granularity and computational efficiency. The integration of deep learning with explainable AI techniques has significantly improved classification accuracy while also providing valuable transparency into model decision-making. Future work could address these limitations by exploring fine-grained classification, self-supervised learning, and blockchain integration to enhance scalability, robustness, and interpretability. Overall, this research contributes to the development of a scalable, interpretable, and high-performing vehicle damage classification framework, with promising potential for future advancements in automated inspection systems and insurance analytics.

6. Conclusions

This research presents a comprehensive and interpretable deep learning framework for fine-grained vehicle damage classification by synergistically integrating Vision Transformers (ViTs) with CNN-based object detection backbones (RetinaNet), enhanced through Weighted Box Fusion (WBF) and supplemented with robust Explainable AI (XAI) techniques. The central proposal aimed to overcome the inherent limitations of traditional manual inspections and the constrained generalization capacity of earlier CNN-based models by introducing a scalable, accurate, and interpretable solution suitable for real-world deployment in insurance, automotive safety, and autonomous inspection systems.

To realize this vision, we developed a modular and end-to-end pipeline using the Detectron2 and PyTorch frameworks. The architecture leverages the ViT encoders for efficient transfer learning, employs Focal Loss to address the class imbalance, and incorporates aggressive augmentation strategies to bolster generalization in diverse environmental settings. Furthermore, the integration of Grad-CAM, Grad-CAM++, and SHAP provides multi-level semantic and visual justifications for model predictions, thereby enhancing transparency and trustworthiness.

Our empirical findings demonstrate that the proposed ViT-enhanced RetinaNet detector significantly outperforms traditional CNN counterparts. Specifically, it achieves a mean Average Precision (mAP) of 87.2%, an F1-score of 84.6%, and real-time inference capabilities at 22 FPS. These results are not only statistically robust but also practically significant. The model maintains its detection fidelity under complex noise conditions (e.g., Gaussian, salt-and-pepper, fire occlusion, and motion blur), affirming its robustness in real-world post-accident scenarios. Moreover, the application of WBF enhances spatial coherence in ensemble predictions, while the explainability modules facilitate transparent decision pathways for stakeholders such as insurance adjusters and vehicle inspection engineers.

Author Contributions

Conceptualization, A.S., M.A.P. and M.F.S.T.; methodology, A.S., M.A.P., M.F.S.T. and A.Z.A.; software, A.S., M.A.P., M.F.S.T. and A.Z.A.; validation, A.S., M.A.P., M.F.S.T. and A.Z.A.; formal analysis, A.S., M.A.P., M.F.S.T. and A.Z.A.; investigation, A.S., M.A.P., M.F.S.T. and A.Z.A.; resources, A.S., M.A.P., M.F.S.T. and A.Z.A.; data curation, A.S., M.A.P., M.F.S.T. and A.Z.A.; writing—original draft preparation, A.S., M.A.P., M.F.S.T. and A.Z.A.; writing—review and editing, R.K.; visualization, A.S., M.A.P., M.F.S.T. and A.Z.A.; supervision, R.K.; project administration, R.K.; funding acquisition, R.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by North South University, Dhaka, Bangladesh grant number CTRG-22-SEPS-03.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The employed vehicle damage dataset and implementation code can be found at: https://github.com/MdFahimShahoriar/finegrained-damage-classify-xai (accessed on 16 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Veloso, F.; Kumar, R. The Automotive Supply Chain: Global Trends and Asian Perspectives. Asian Development Bank Economics Working Paper Series; Asian Development Bank (ADB): Metro Manila, Philippines, 2002. [Google Scholar]
Denton, T. Advanced Automotive Fault Diagnosis: Automotive Technology: Vehicle Maintenance and Repair; Routledge: Oxfordshire, UK, 2020. [Google Scholar]
Banerjee, D. Robust Car Damage Identification Through CNN and SVM Techniques. In Proceedings of the International Conference on Technological Advancements in Computational Sciences, Tashkent, Uzbekistan, 13–15 November 2024; pp. 101–107. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
Nguyen, D.K.; Assran, M.; Jain, U.; Oswald, M.R.; Snoek, C.G.; Chen, X. An image is worth more than 16 × 16 patches: Exploring transformers on individual pixels. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 728–755. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
Amanabadi, J.; Taghikhany, T.; Alinia, M.M. Enhancing structural damage detection: A comprehensive study on feature engineering and hyperparameters for pattern recognition algorithms. J. Vib. Control 2025. [Google Scholar] [CrossRef]
Botezatu, A.P.; Burlacu, A.; Orhei, C. A review of deep learning advancements in road analysis for autonomous driving. Appl. Sci. 2024, 14, 4705. [Google Scholar] [CrossRef]
Zhong, C. A Fast Multi-modal Facial Recognition Algorithm Based on Deep Residual Networks. Int. J. Netw. Secur. 2025, 27, 368–377. [Google Scholar]
Fouad, M.M.; Malawany, K.; Osman, A.G.; Amer, H.M.; Abdulkhalek, A.M.; Eldin, A.B. Automated vehicle inspection model using a deep learning approach. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 13971–13979. [Google Scholar] [CrossRef]
Pérez-Zarate, S.A.; Corzo-García, D.; Pro-Martín, J.L.; Álvarez García, J.A.; Martínez-del Amor, M.A.; Fernández-Cabrera, D. Automated Car Damage Assessment Using Computer Vision: Insurance Company Use Case. Appl. Sci. 2024, 14, 9560. [Google Scholar] [CrossRef]
Lee, D.; Lee, J.; Park, E. Automated vehicle damage classification using the three-quarter view car damage dataset and deep learning approaches. Heliyon 2024, 10, e34016. [Google Scholar] [CrossRef]
Hoang, V.D.; Huynh, N.T.; Tran, N.; Le, K.; Le, T.M.C.; Selamat, A.; Nguyen, H.D. Powering AI-driven car damage identification based on VeHIDE dataset. J. Inf. Telecommun. 2025, 9, 24–43. [Google Scholar] [CrossRef]
Qaddour, J.; Siddiqa, S.A. Automatic damaged vehicle estimator using enhanced deep learning algorithm. Intell. Syst. Appl. 2023, 18, 200192. [Google Scholar] [CrossRef]
Dai, K.; Shao, J.; Gong, B.; Jing, L.; Chen, Y. CLIP-FSSC: A transferable visual model for fish and shrimp species classification based on natural language supervision. Aquac. Eng. 2024, 107, 102460. [Google Scholar] [CrossRef]
Castrillo, J.; Valle, R.; Baumela, L. Efficiency Evaluation of Mobile Vision Transformers. In Proceedings of the International Conference on Information Technology & Systems, Temuco, Chile, 24–26 January 2024; pp. 3–11. [Google Scholar]
Xin, J.; Tao, G.; Tang, Q.; Zou, F.; Xiang, C. Structural damage identification method based on Swin Transformer and continuous wavelet transform. Intell. Robot. 2024, 4, 200–215. [Google Scholar] [CrossRef]
Yan, J.; Cheng, Y.; Zhang, F.; Li, M.; Zhou, N.; Jin, B.; Wang, H.; Yang, H.; Zhang, W. Research on multimodal techniques for arc detection in railway systems with limited data. Struct. Health Monit. 2025. [Google Scholar] [CrossRef]
Cheng, Y.; Zhou, N.; Wang, Z.; Chen, B.; Zhang, W. CFFsBD: A Candidate Fault Frequencies-Based Blind Deconvolution for Rolling Element Bearings Fault Feature Enhancement. IEEE Trans. Instrum. Meas. 2023, 72, 3238032. [Google Scholar] [CrossRef]
Cheng, Y.; Yan, J.; Zhang, F.; Li, M.; Zhou, N.; Shi, C.; Jin, B.; Zhang, W. Surrogate modeling of pantograph-catenary system interactions. Mech. Syst. Signal Process. 2025, 224, 112134. [Google Scholar] [CrossRef]
Chen, V.; Yang, M.; Cui, W.; Kim, J.S.; Talwalkar, A.; Ma, J. Applying interpretable machine learning in computational biology—Pitfalls, recommendations and opportunities for new developments. Nat. Methods 2024, 21, 1454–1461. [Google Scholar] [CrossRef] [PubMed]
Yeo, W.J.; Van Der Heever, W.; Mao, R.; Cambria, E.; Satapathy, R.; Mengaldo, G. A comprehensive review on financial explainable AI. Artif. Intell. Rev. 2025, 58, 189. [Google Scholar] [CrossRef]
Mahmoudi, S.A.; Gloesener, M.; Benkedadra, M.; Lerat, J.S. Edge AI System for Real-Time and Explainable Forest Fire Detection Using Compressed Deep Learning Models. Proc. Copyr. 2025, 3, 847–854. [Google Scholar]
Saai, K.; Vijayakumar, V. Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications. arXiv 2024, arXiv:2501.00042. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Wang, X.; Li, W.; Wu, Z. CarDD: A new dataset for vision-based car damage detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7202–7214. [Google Scholar] [CrossRef]
Midigudla, R.S.; Dichpally, T.; Vallabhaneni, U.; Wutla, Y.; Sundaram, D.M.; Jayachandran, S. A comparative analysis of deep learning models for waste segregation: YOLOv8, EfficientDet, and Detectron 2. Multimed. Tools Appl. 2025, 1–24. [Google Scholar] [CrossRef]
Outmezguine, N.J.; Levi, N. Decoupled Weight Decay for Any p Norm. arXiv 2024, arXiv:2404.10824. [Google Scholar] [CrossRef]
Roy, S.; Park, C.; Fahrezi, A.; Etemad, A. A bag of tricks for few-shot class-incremental learning. Trans. Mach. Learn. Res. 2024. Available online: https://openreview.net/pdf?id=DiyYf1Kcdt (accessed on 1 July 2025).
Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar]
Jalili, A.; Tabrizchi, H.; Babaali, B. EfficientNet-based vehicle damage insurance verification. In Proceedings of the International Symposium on Artificial Intelligence and Signal Processing, Babol, Iran, 21–22 February 2024; pp. 1–6. [Google Scholar] [CrossRef]
Kannan, I.R.; Balasubramanian, Y.; Subramanian, S.P.; Kandhasamy, M.; Ramesh, S. CDA-Net: Computer Vision based Automatic Car Damage Analysis. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, New York, NY, USA, 13–15 December 2024. [Google Scholar] [CrossRef]
Hasan, M.J.; Nalwan, A.; Ong, K.L.; Jahani, H.; Boo, Y.L.; Nguyen, K.C.; Hasan, M. GroundingCarDD: Text-Guided Multimodal Phrase Grounding for Car Damage Detection. IEEE Access 2024, 12, 179464–179477. [Google Scholar] [CrossRef]

Figure 1. Feature-guided damage classification via object detection and transformer embeddings.

Figure 2. Workflow for the proposed automated vehicle damage detection.

Figure 3. Preprocessing and augmentation pipeline for the employed vehicle damage dataset.

Figure 4. Parallel model architecture for vehicle damage classification.

Figure 5. Training and validation performance of DINOv2 over epochs.

Figure 6. Comparative visualization of hybrid RetinaNet–ViT model’s performance: (a) classification accuracy via confusion matrix, and (b) detection accuracy using detection confusion matrix.

Figure 7. Comparative analysis of ensemble detection performance of the hybrid model: (a) after applying Weighted Box Fusion (WBF), and (b) standard ensemble predictions.

Figure 8. Representative predictions of the trained model on unseen vehicle images. All images are identified as Damaged correctly, showcasing the model’s ability to detect a wide range of damage types.

Figure 9. Inference results on noisy images of damaged vehicles. The model demonstrates robust performance across challenging conditions, including fire, deformation, dust, rain, and poor lighting, ensuring reliable detection in real-world accident scenarios.

Figure 10. Inference results on artificially corrupted images of damaged vehicles. The model shows resilience to simulated Gaussian and salt-and-pepper noise, sustaining high detection quality in challenging visual conditions.

Figure 11. Damage detection results under dynamic and challenging conditions.

Figure 12. Original vehicle images from the Damaged class, depicting damage types: bumper deformation (a), panel dents (b), frontal impact (c), and rear-end damage (d).

Figure 13. Grad-CAM visualizations for the Damaged class images. The heatmaps highlight the critical regions used by the model to make its predictions.

Figure 14. Grad-CAM++ visualizations with enhanced localization. The heatmaps produced by Grad-CAM++ provide more precise localization compared to Grad-CAM, helping in distinguishing smaller damage features.

Figure 15. SHAP feature attribution plots for each damage type in the Damaged class. The plots show the contribution of each input feature toward the final model decision.

Table 1. Addressed limitations in existing methods and corresponding solutions proposed in this work.

Reference	Limitation	Addressed in This Work
[2]	Subjectivity in manual assessment	Automated deep learning-based classification
[4]	Poor generalization of rule-based methods	Transformer-based approach with self-attention
[5]	CNNs require large datasets	Fine-tuning with limited labeled data
[8]	Lack of model interpretability	Integration of XAI techniques for transparency
[20]	High computational overhead of transformers	Efficient transformer-based model adaptation
[12]	Feature engineering limitations in traditional methods	Deep learning-based feature extraction
[21]	CNNs struggle with long-range dependencies	Swin Transformer integration with CNN backbones
[25]	Limited user trust in AI-based assessments	XAI-driven explanation techniques for trust building
[4]	Transformers require large-scale data for training	Hybrid models leveraging CNN pretraining
[11]	Poor performance under varying lighting conditions	Data augmentation and adaptive normalization
[19]	Zero-shot classification challenges in damage detection	Vision–language models for flexible classification
[27]	Real-time XAI challenges	Optimized XAI algorithms for efficiency
[28]	Heavy transformer models unsuitable for edge devices	Lightweight transformer architectures
[14]	Class imbalance issue	Implemented Focal Loss
[16]	Variations in the ensemble’s weight values due to continuous addition of data	Used a stable ensemble strategy
[15]	Challenges of camera movements and environmental conditions	Conducted inference on noisy images

Table 2. Performance comparison of the applied models.

Model	Accuracy	mAP@0.50	Precision	Recall	F1-Score	FPS
Faster R-CNN	98%	76.5%	78.1%	72.4%	75.2%	12
EfficientNet	86%	82.3%	81.7%	77.2%	79.4%	15
ResNet	91%	85.4%	83.9%	80.1%	82.0%	18
ViT (Detectron2)	96%	87.2%	86.5%	82.9%	84.6%	22
MobileViT	90.51%	83.0%	82.8%	80.0%	81.4%	48
DINOv2	93.46%	88.5%	87.6%	84.7%	86.1%	20

Table 3. Effect of architectural design on performance: comparison between hybrid and standalone models (Compact).

Model	Accuracy	mAP@0.50	Precision	Recall	F1-Score	FPS
Faster R-CNN (Hybrid CF pipeline)	97%	83.5%	84.0%	80.5%	82.2%	13
RetinaNet (Hybrid CF pipeline)	95%	85.1%	85.6%	81.8%	83.6%	16
Hybrid CF (Proposed full system)	93%	84.1%	85.3%	81.7%	83.4%	150

Table 4. Impacts on quantization for model deployment: Comparison between full precision and quantized models.

Method	Size (MB)	Parameters	Trainable	Non-Trainable	Train Acc. (%)	Test Acc. (%)
Hybrid (FP32)	1218.1	159,622,096	158,974,544	647,552	98.31	93.15
Quantized (INT8)	591.39	77,597,648	77,152,848	444,800	93.45	86.20

Table 5. Comparison of model size, parameter count, and training time per epoch for object detection models.

Model Name	Model Size (MB)	# Parameters	Training Time per Epoch
Faster R-CNN (Detectron2)	∼170	∼41 million	∼4 min
EfficientNet	∼50	∼5.3 million	∼2 min
ResNet	∼90	∼25 million	∼3 min
ViT (Detectron2)	∼200	∼86 million	∼5 min
MobileViT	∼45	∼5.6 million	∼2 min
DINOv2	∼340	∼1B (frozen)	∼6–7 min
ViT + CAM (Explainable)	∼202	∼86 million (+CAM)	∼6 min
ViT + SHAP (Explainable)	∼204	∼86 million (+SHAP)	∼8 min

Table 6. Comparison of proposed vehicle damage classification system with prior research.

Reference	Dataset	Method	Performance
[30]	Custom Vehicle Dataset	Improved Mask R-CNN (ResNet-101)	mAP: 0.85
[14]	Scraped using a web scraper	Custom model using pretrained architectures	Accuracy and F1-score: 85.5%
[35]	Ripik.AI	EfficientNet	Accuracy: 90.52%, F1-score: 90.85%
[36]	Car make and model (CMM)	YOLO_v5	Accuracy: 90.45%
[15]	TartesiaDS	YOLO_v8	Accuracy: 91%
[37]	CarDD	BERT-Swin Transformer	mAP@0.50: 80.0%, Recall: 86.7%
[17]	VeHIDE Dataset	Mark-RCNN	F1-score: 83.2%
[18]	Subset of ImageNet	Inception-ResNetV2	Accuracy: 91.36%, F1-score: 91%
[16]	Three-Quarter View Car Damage Dataset	Ensemble of pretrained models	Accuracy: 91.36%, F1-score: 91.34%
This Work	COCO + Custom Dataset	Hybrid ViT-RetinaNet (Detectron2-based)	Accuracy: 93%, mAP@0.50: 84.1%, F1-Score: 83.4%, FPS: 150

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Saha, A.; Pavel, M.A.; Titu, M.F.S.; Apurba, A.Z.; Khan, R. Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification. Vehicles 2025, 7, 89. https://doi.org/10.3390/vehicles7030089

AMA Style

Saha A, Pavel MA, Titu MFS, Apurba AZ, Khan R. Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification. Vehicles. 2025; 7(3):89. https://doi.org/10.3390/vehicles7030089

Chicago/Turabian Style

Saha, Ananya, Mahir Afser Pavel, Md Fahim Shahoriar Titu, Afifa Zain Apurba, and Riasat Khan. 2025. "Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification" Vehicles 7, no. 3: 89. https://doi.org/10.3390/vehicles7030089

APA Style

Saha, A., Pavel, M. A., Titu, M. F. S., Apurba, A. Z., & Khan, R. (2025). Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification. Vehicles, 7(3), 89. https://doi.org/10.3390/vehicles7030089

Article Menu

Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification

Abstract

1. Introduction

2. Related Works

2.1. Traditional Approaches

2.2. Supervised Deep Learning Methods

2.3. Self-Supervised and Semi-Supervised Methods

2.4. Explainable AI Techniques

3. Methodology

3.1. Dataset and Preprocessing

3.2. Model Selection and Architecture

3.3. Training and Optimization Strategies

Model Compilation and Training

3.4. Explainability and Interpretability (XAI Integration)

4. Results and Discussion

4.1. Experimental Setup

4.2. Performance Metrics

Impact of Model Type: Classification vs. Detection Frameworks

4.3. Ablation Studies

4.3.1. Impact of Weighted Box Fusion on Ensemble Predictions

4.3.2. Effect of Architectural Design: Hybrid vs. Standalone Models

4.4. Model Optimization and Deployment Efficiency

4.4.1. Impact of Data Augmentation

4.4.2. Impact of XAI Integration

4.4.3. Impact of Transfer Learning

4.4.4. Impact of Focal Loss

4.4.5. Computational Efficiency and Model Complexity

4.5. Model Evaluation and Prediction Results

4.6. Inference Under Visual and Artificial Noise

4.7. Damage Detection Under Dynamic Conditions

4.8. Explainable AI (XAI) Evaluation

5. Discussion

5.1. Limitations and Future Research Suggestions

5.2. Comparative Analysis with Prior Research

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI