Article

Interpretable Deep Learning for Diabetic Retinopathy: A Comparative Study of CNN, ViT, and Hybrid Architectures

School of Computing, Communication and Business, Hochschule für Technik und Wirtschaft, University of Applied Sciences for Engineering and Economics, 10318 Berlin, Germany
* Author to whom correspondence should be addressed.
Computers 2025, 14(5), 187; https://doi.org/10.3390/computers14050187
Submission received: 13 April 2025 / Revised: 25 April 2025 / Accepted: 8 May 2025 / Published: 12 May 2025

Abstract

Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, requiring early detection for effective treatment. Deep learning models have been widely used for automated DR classification, with Convolutional Neural Networks (CNNs) being the most established approach. Recently, Vision Transformers (ViTs) have shown promise, but a direct comparison of their performance and interpretability remains limited. Additionally, hybrid models that combine CNN and transformer-based architectures have not been extensively studied. This work systematically evaluates CNNs (ResNet-50), ViTs (Vision Transformer and SwinV2-Tiny), and hybrid models (Convolutional Vision Transformer, LeViT-256, and CvT-13) on DR classification using publicly available retinal image datasets. The models are assessed based on classification accuracy and interpretability, applying Grad-CAM and Attention-Rollout to analyze decision-making patterns. Results indicate that hybrid models outperform both standalone CNNs and ViTs, achieving a better balance between local feature extraction and global context awareness. The best-performing model (CvT-13) achieved a Quadratic Weighted Kappa (QWK) score of 0.84 and an AUC of 0.93 on the test set. Interpretability analysis shows that CNNs focus on fine-grained lesion details, while ViTs exhibit broader but less localized attention. These findings provide valuable insights for optimizing deep learning models in medical imaging, supporting the development of clinically viable AI-driven DR screening systems.

1. Introduction

Diabetic retinopathy (DR) is a complication of diabetes and one of the most common causes of vision loss and blindness worldwide [1,2]. According to Thomas et al. [2], the global prevalence of DR in people with diabetes was 27% between 2015 and 2019. Early detection through regular screening greatly improves patient outcomes: a study from the United Kingdom shows that between 2007 and 2015, screening helped to almost halve the incidence (per 100,000 population) of severe visual impairment and vision loss among diabetes patients in Wales [3]. However, the manual evaluation of high-resolution fundus images is time intensive and heavily reliant on specialist expertise [4]. As the global prevalence of diabetes continues to rise [5], there is an urgent need for accurate, scalable, and automated methods that can assist clinicians in identifying the onset and severity of DR.
In addition to molecular approaches, where some studies have employed machine learning to uncover immune-related biomarkers for diabetic retinopathy (DR) [6,7], AI-driven analysis of image-based data has shown considerable promise in facilitating the early diagnosis of DR.
Convolutional Neural Network (CNN)-based models have been the standard for image classification due to their ability to capture spatial hierarchies and local patterns in images [8]. They have shown consistent performance in medical imaging tasks where feature localization is crucial. However, one of the most significant advancements in deep learning in recent years is the Transformer model, which revolutionized Natural Language Processing (NLP) by relying entirely on the attention mechanism [9]. Originally developed for NLP, its potential was soon recognized in Computer Vision (CV). The Vision Transformer (ViT) applies the Transformer architecture to image recognition by dividing an image into patches, analogous to words in text processing. These patches are linearly projected into embedding vectors and processed through a standard Transformer framework [10]. Hybrid architectures combine the strengths of CNNs (local feature extraction) with ViTs (global context awareness), offering a promising approach for medical image classification.
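For readers less familiar with this tokenization step, the following is a minimal PyTorch sketch of ViT patch embedding; the 224 × 224 input, 16 × 16 patch size, and 384-dimensional embedding are illustrative defaults, not values taken from the models evaluated here.

```python
import torch
import torch.nn as nn

# A 224x224 RGB image cut into 16x16 patches yields 14*14 = 196 patch tokens.
image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim = 16, 384                # illustrative "small" ViT width

# Non-overlapping patch extraction plus linear projection, implemented as a
# strided convolution (the standard trick for ViT patch embeddings).
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(image)                    # (1, 384, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 196, 384): one embedding per patch

# A learnable [CLS] token and positional embeddings are added before the encoder.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1) + 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) + pos_embed
print(tokens.shape)                            # torch.Size([1, 197, 384])
```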
In DR severity classification, recent ViT-related research has pursued four main directions: (1) verifying that ViT models match or exceed CNN performance [8,11,12,13,14,15]; (2) evaluating diverse ViT architectures against CNN benchmarks [8,15]; (3) creating DR-specific ViT variants through original designs or adaptations [16,17,18,19,20]; and (4) enhancing the interpretability of ViT-based DR predictions [21].
Despite advancements in automated DR detection, a comprehensive comparison of CNNs, ViTs, and hybrid models, along with an analysis of their interpretability, remains largely unexplored. In this study, we focus on systematically comparing three deep learning paradigms regarding their performance and interpretability in the classification of DR severity from fundus images. Using two large publicly available DR datasets, we evaluate not only the diagnostic accuracy but also the degree to which each model type localizes clinically relevant retinal features. Specifically, this study addresses the following research questions:
  • How do CNN, ViT, and hybrid models compare in their ability to classify the severity levels of diabetic retinopathy across multiple classes as defined by the ICDR scale?
  • What do visualization techniques reveal about the decision-making processes of CNN, ViT, and hybrid models, and how can these insights help explain differences in their predictions and behavior?
Although Vision Transformers (ViTs) have demonstrated impressive performance in classifying DR severity, much of the existing research has focused primarily on performance optimization and improving evaluation metrics, while giving insufficient attention to the interpretability of model predictions. Furthermore, the Grad-CAM technique [22] is often used, despite being originally developed to interpret the convolutional structures of CNNs [8]. In contrast, the Attention-Rollout method, which is specifically tailored to the attention mechanism of ViTs, would have been a more appropriate choice for explaining the model’s decision-making process [23]. The only study that directly compared the interpretability of CNN models with that of the original ViT was limited in scope and did not include the numerous existing ViT variants [21], even though other research suggests that these variants achieve better performance on this task.
The rest of this paper is organized as follows: Section 2 gives an overview of related work, Section 3 presents the dataset and preprocessing steps, Section 4 outlines the model architectures and training setup, Section 5 presents the evaluation results, Section 6 discusses the results in the context of existing research, and Section 7 concludes with a summary and an outlook for future work.

2. Related Work

Wu et al. [11] evaluated the suitability of pure attention mechanisms for DR severity classification and investigated whether transformers could replace conventional CNNs. The EyePACS and APTOS-2019 datasets were used, split in an 8:2 ratio into training and test sets. The models tested included ViT-Base and ViT-Large, and the results suggested that Vision Transformers are a promising approach for this task.
Adak et al. [8] assessed transformer-based models for DR severity detection using the APTOS-2019 dataset. Their study employed several architectures, including ViT, BEiT, CaiT, and DeiT. Experimental findings demonstrated that ensemble-based ViT models achieved excellent performance.
Chetoui et al. [12] proposed a federated learning approach based on Vision Transformers for DR detection. Their method distributed model training across four institutions and leveraged transformer architecture alongside federated learning techniques. The datasets used included APTOS, Messidor-1, Messidor-2, IDRiD, and EyePACS. Tested models included ViT-Base-Patch32 and DenseNet-121. Results indicated that federated learning offers significant advantages in terms of data security, accessibility, and privacy.
Sun et al. [16] introduced a unified deep model that combined DR severity grading with lesion detection through a novel Lesion-Aware Transformer (LAT). Their experiments, conducted on the Messidor-1, Messidor-2, and EyePACS datasets, showed that LAT is an effective solution for both classification and lesion detection.
Kumar et al. [15] compared several architectures—ViT, CNN, and MLP—for DR detection using the APTOS-2019 dataset. Their evaluation focused on metrics such as convergence time, accuracy, and model size. Tested models included EfficientNet, ResNet, Swin Transformer, ViT, and MLP-Mixer. The transformer-based models outperformed the others in terms of accuracy and demonstrated comparable convergence times.
To reduce parameter count while maintaining strong classification performance, Bala et al. [18] introduced CTNet, a lightweight hybrid model that combines CNN and ViT. CTNet was tested on the APTOS-2019 and IDRiD datasets, showing high performance in DR image classification with low resource consumption.
Wang et al. [19] developed a transformer-based architecture combining hyperbolic embeddings with a spatial priority module to improve the accuracy and speed of lesion segmentation in DR. Their results indicated superior segmentation performance, particularly for small lesions.
Band et al. [24] introduced the RETINA benchmark to evaluate model reliability in safety-critical settings. Using Bayesian Deep Learning (BDL), they assessed Bayesian Neural Networks (BNNs) on the EyePACS and APTOS-2019 datasets. Their findings showed that BDL enhances both the accuracy and reliability of DR detection.
Lee et al. [25] enhanced DR severity classification using modified VGG-16 and ResNet-50 models with dropout regularization. CLAHE preprocessing and data augmentation were used to address data imbalance. Their models achieved classification accuracies of 94.03% and 97.21%, respectively, with sensitivities over 70% and specificities exceeding 90%.
Halder et al. [26] applied ViT models to four subsets of the MedMNISTv2 dataset. Their results showed that ViT outperforms existing benchmark methods, achieving higher diagnostic accuracy across all subsets.
Philippi et al. [27] evaluated a Swin-UNETR-based hybrid method for the automatic segmentation of retinal disease and tested it on a private dataset. Their approach significantly improved segmentation accuracy and reliability.
He et al. [28] proposed an interpretable Swin-Poly-Transformer for OCT image classification. The architecture combined the Swin Transformer’s multi-scale feature extraction with PolyLoss and Score-CAM for interpretability. Experimental results demonstrated superior accuracy and ROC-AUC compared to both CNN and standard ViT models.
Goh et al. [29] developed five Convolutional Neural Networks (VGG19, ResNet50, InceptionV3, DenseNet201, and EfficientNetV2S) and four Vision Transformer (ViT) models (VAN_small, Cross-ViT_small, ViT_small, and SWIN_tiny) using retinal images from the Kaggle dataset to detect referable diabetic retinopathy (DR), defined as moderate or worse DR. Model performance was compared across the Kaggle internal test set and two external datasets: the SEED study and Messidor-1. The SWIN transformer significantly outperformed all CNN models.
Touati et al. [30] presented the Diabetic Retinopathy Compact Convolutional Transformer (DRCCT) model, which integrates convolutional and transformer-based approaches to improve the classification of retinal images across five stages of diabetic retinopathy. Hidri et al. [31] proposed a robust and optimal deep ConvNet for automatically discriminating between healthy, moderate, and severe DR and achieved significant improvement at the moderate class level.
Asia et al. [32] compared CNN architectures such as ResNet-101, ResNet-50, and VGGNet-16. When evaluated on internal datasets from Xiangya No. 2 Hospital Ophthalmology (XHO) in China, the ResNet-101 model showed high performance in both training and testing phases, with minimal loss. Additionally, the ResNet-101 model was applied to the HRF, STARE, DIARETDB0, and XHO databases, consistently achieving high accuracy across all of them.
Yang et al. [17] used a Vision Transformer pre-trained with Masked Autoencoders (ViT-MAE) for binary diabetic retinopathy classification. Pretrained on over 100,000 high-resolution retinal images, the model achieved 93.42% accuracy and outperformed ImageNet-pretrained ViTs. This result challenges the common practice of relying on ImageNet-pretrained weights for medical image analysis and highlights the benefit of domain-specific pretraining.
Akhtar et al. [33] introduced RSG-Net, a CNN-based model designed for both binary and four-class DR classification. The study emphasizes robust preprocessing—including cropping, Gaussian blur-based denoising, histogram equalization, and resizing—to enhance image quality, alongside extensive data augmentation to address class imbalance. RSG-Net outperformed several state-of-the-art models, achieving over 99% accuracy in both tasks, though this exceptional performance may be influenced by the small dataset size (1200 images) and heavy use of augmentation.
Xue et al. [34] introduced VMamba-m, an enhanced Vision Mamba model that incorporates both local and channel (SE) attention mechanisms along with focal loss to address class imbalance in DR classification. VMamba-m achieves higher accuracy and computational efficiency than standard Vision Transformers (ViTs) and CNNs.

3. Dataset and Preprocessing

3.1. Training Corpus

For this study, we used two of the largest publicly available datasets for the classification of diabetic retinopathy (DR): EyePACS [35] and APTOS-2019 [36]. These datasets consist of images of the retinal fundus labeled with severity levels of diabetic retinopathy according to the International Clinical Diabetic Retinopathy (ICDR) classification. Table 1 shows the severity levels along with their corresponding symptoms. For a visual reference of the diabetic retinopathy severity grades defined by the ICDR scale, see [37].
The EyePACS dataset is one of the most widely used datasets for diabetic retinopathy classification. It contains 88,702 high-resolution retinal images, each graded by professional ophthalmologists. Most of the images in EyePACS have a resolution of 1024 × 1024 pixels or higher. The APTOS-2019 dataset, released by the Asia Pacific Tele-Ophthalmology Society (APTOS) as part of a Kaggle competition, contains 3662 retinal images. The images have varying resolutions and are each labeled with DR severity classes using the same ICDR classification system. This dataset is smaller but introduces additional variability in imaging conditions, which enhances model generalization. Both datasets consist of RGB color images. Figure 1 illustrates the distribution of classes in both datasets, highlighting the class imbalance.
The two datasets were merged into a single training corpus. Given the class imbalance, where most images belong to Class R0 or R1, a balanced dataset was created by selecting 1000 images per class, resulting in a total of 5000 images.
In medical imaging, ensuring model sensitivity to less frequent but clinically significant classes is very important. Data balancing helps mitigate the risk of poor performance on severe cases, even if it does not reflect real-world distributions. The decision to use 1000 images per class was constrained by the availability of class R4 images, which total approximately 1000 across both datasets.
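As an illustration, the per-class subsampling described above can be implemented in a few lines of pandas; the CSV file names and the `level` column are hypothetical stand-ins for the merged label tables, not the authors' actual code.

```python
import pandas as pd

# Hypothetical merged label table: one row per image, ICDR grade (0-4) in "level".
labels = pd.concat(
    [pd.read_csv("eyepacs_trainLabels.csv"), pd.read_csv("aptos2019_train.csv")],
    ignore_index=True,
)

# Draw 1000 images per class (or all available images if a class has fewer).
balanced = (
    labels.groupby("level", group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), 1000), random_state=42))
          .reset_index(drop=True)
)
print(balanced["level"].value_counts().sort_index())   # roughly 1000 images per grade
```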
Despite some variations in image quality between datasets, these differences are relatively minor and can be further reduced through data preprocessing and augmentation. Additionally, both datasets use a standardized annotation format, ensuring a seamless integration process.

3.2. Data Preprocessing

To eliminate redundant information, black borders are removed. This is achieved by first converting images to grayscale, then generating a mask to detect pixels exceeding a predefined brightness threshold. This allows the cropping of black borders. The effects of this procedure can be observed in the differences between Figure 2a,b.
To enhance image clarity and improve feature visibility, an Unsharp Mask Technique is employed [38]. This method subtracts a blurred version of the image from the original, followed by a brightness adjustment, effectively accentuating edges and fine details.
However, a side effect of this approach is the excessive enhancement of the circular border, as it represents the most prominent edge in the image. Since the border is an inherent characteristic of retinal images and is irrelevant for DR severity classification, it is undesirable for models to focus on this feature. To tackle this issue, a circular mask centered on the image is applied, blacking out the peripheral 15% of the image and thereby removing the bright circular border. The processed image obtained after applying this mask is shown in Figure 2d.
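The three preprocessing steps can be sketched with OpenCV as follows; the brightness threshold, Gaussian sigma, and enhancement weights are assumptions in the spirit of the description above (and of Graham's competition report [38]), not the exact values used in this study.

```python
import cv2
import numpy as np

def preprocess(path, threshold=10, out_size=224):
    img = cv2.imread(path)

    # 1. Remove black borders: keep the bounding box of pixels brighter than the threshold.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    rows, cols = np.where(gray > threshold)
    img = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    img = cv2.resize(img, (out_size, out_size))

    # 2. Unsharp masking: subtract a blurred copy and re-center brightness,
    #    which accentuates vessels, exudates, and other fine structures.
    blur = cv2.GaussianBlur(img, (0, 0), sigmaX=out_size / 30)
    img = cv2.addWeighted(img, 4, blur, -4, 128)

    # 3. Black out the peripheral 15% with a circular mask so the bright rim
    #    of the fundus does not dominate the enhanced edges.
    circle = np.zeros((out_size, out_size), dtype=np.uint8)
    cv2.circle(circle, (out_size // 2, out_size // 2), int(out_size / 2 * 0.85), 255, -1)
    return cv2.bitwise_and(img, img, mask=circle)
```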

3.3. Data Augmentation

To further increase feature diversity within the data and improve the model’s generalization ability, several data augmentation techniques are applied to the preprocessed retinal images.
  • Random scaling by ±10%: This slight variation in image size simulates differences in real-world image capture conditions.
  • Random rotation between 0° and 360°: Rotating images by a random angle ensures that the model remains robust to variations in fundus orientation.
  • Random distortion by ±20%: This simulates perspective distortions that may occur during image acquisition by shifting the image corners randomly along the x- or y-axis by up to ±20% of the image width or height.
After augmentation, the image is cropped again to retain only the smallest central rectangle. This process can result in images similar to the example in Figure 2e.
To mitigate potential disruptions to model training caused by black regions in image corners, the mean pixel value of all non-black pixels is computed and used to fill these areas. This adjustment smooths edge transitions as illustrated in Figure 2f.
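A possible implementation of this augmentation and corner-filling step is sketched below with torchvision and NumPy; the specific transform classes and the crop size are our assumptions, chosen to mirror the ranges listed above rather than to reproduce the authors' pipeline exactly.

```python
import numpy as np
import torchvision.transforms as T
from PIL import Image

# Rotation, scaling, and perspective distortion with the ranges listed above;
# the central crop approximates "retaining the smallest central rectangle".
augment = T.Compose([
    T.RandomAffine(degrees=360, scale=(0.9, 1.1)),
    T.RandomPerspective(distortion_scale=0.2, p=1.0),
    T.CenterCrop(200),
    T.Resize(224),
])

def fill_black_corners(pil_img, threshold=10):
    """Replace near-black pixels with the mean of all non-black pixels (cf. Figure 2f)."""
    arr = np.asarray(pil_img).astype(np.float32)
    black = arr.max(axis=-1) < threshold
    if black.any() and (~black).any():
        arr[black] = arr[~black].mean(axis=0)    # per-channel mean of retinal pixels
    return Image.fromarray(arr.astype(np.uint8))

augmented = fill_black_corners(augment(Image.open("fundus.png").convert("RGB")))
```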
The augmentation techniques selected in this study are designed to capture the variability commonly encountered in real-world fundus imaging. Random scaling, rotation, and distortion are employed to mimic differences in camera positioning and eye alignment that occur during image acquisition, thereby enhancing model robustness to spatial variability. Cropping and masking are used to focus the model on the retinal region of interest and reduce the influence of irrelevant background features, which helps prevent overfitting to non-retinal structures such as the bright circular border often present in fundus photographs. Indeed, Yang et al. [17] standardized all images to a uniform size and applied random horizontal flips along with rotations in the range of −180° to +180°. Hidri et al. [31] augmented the original images by rotating them at fixed angles of 120°, 72°, and 45°. Touati et al. [30] applied various data augmentation techniques, including rotation, resizing, flipping, cropping, shifting, and noise injection. Asia et al. [32] reported that the input images underwent resizing, augmentation via rotation and noise addition, and normalization.

4. Models and Training

4.1. Model Selection

The study employs a selection of deep learning models from three primary categories: Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and hybrid models that integrate convolutional and transformer-based architectures.
For each architecture type, specific representative models are selected. The selection prioritizes leading models with high ratings on platforms like Papers with Code, Hugging Face, and Kaggle. Additionally, a focus is placed on DR-specific models that have not been sufficiently explored in existing research on DR severity classification. Pretrained models on large-scale datasets such as ImageNet are preferred to ensure robust feature extraction and efficient training. Table 2 provides an overview of the selected models, categorizing them by architecture type and detailing their core structural components along with the explainability methods used to interpret their predictions.

4.2. Implementation and Training Setup

The selected versions of these models (see Table 3) are standardized in terms of parameter count and floating-point operations (FLOPs). Preference is given to models pre-trained on ImageNet-1k with an image resolution of 224 × 224 pixels, though some variants trained at 256 × 256 are also considered due to resource constraints.
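As an example of this setup, the following sketch loads one of the pretrained checkpoints listed in Table 3 from Hugging Face and replaces its ImageNet head with a five-class ICDR classifier; the checkpoint name follows reference [51], while the exact loading code used in this study is not published.

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "microsoft/cvt-13"                      # CvT-13 pretrained on ImageNet-1k
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=5,                     # five ICDR severity grades (R0-R4)
    ignore_mismatched_sizes=True,     # re-initialize the 1000-class ImageNet head
)
```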
The models are fine-tuned using the hyperparameters reported in Table 4. The experiments are conducted using the hardware and software environment detailed in Table 5. The dataset is split into three class-balanced subsets, each containing the five classes in equal proportions (a sketch of this split is shown after the list):
  • Training Set: 70% of the data
  • Validation Set: 15% of the data
  • Test Set: 15% of the data
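A minimal sketch of such a class-stratified 70/15/15 split with scikit-learn, assuming the balanced DataFrame from Section 3.1 (the `balanced` variable and `level` column are hypothetical names):

```python
from sklearn.model_selection import train_test_split

# 70/15/15 split, stratified so each subset keeps the five grades at 20% each.
train_df, rest_df = train_test_split(
    balanced, test_size=0.30, stratify=balanced["level"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["level"], random_state=42)
print(len(train_df), len(val_df), len(test_df))     # e.g., 3500 / 750 / 750 images
```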
The training progress for all models, including the loss and QWK metrics over epochs, is visualized in Appendix A (Figure A1 and Figure A2).

4.3. Evaluation Metrics

To assess model performance, this study employs Quadratic Weighted Kappa (QWK), a metric used for classification tasks [52,53]. QWK extends Cohen’s Kappa by assigning different penalties to classification errors based on their severity, making it particularly suitable for ordinal classification problems such as diabetic retinopathy severity grading.
QWK is computed from the confusion matrix for multi-class classification. For a binary setting, the confusion matrix consists of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). In multi-class classification, the confusion matrix extends to observed frequencies $O_{ij}$, where $O_{ij}$ denotes the number of instances with true class $i$ and predicted class $j$.
The QWK calculation involves the following:
  • Observed Frequencies ($O_{ij}$): Values extracted directly from the confusion matrix.
  • Expected Frequencies ($E_{ij}$): Computed based on marginal distributions, representing expected values under random classification.
  • Weight Matrix ($W_{ij}$): Defines the penalty for misclassification, increasing quadratically as the difference between the actual and predicted class increases.
QWK is computed as follows:
$$\mathrm{QWK} = 1 - \frac{\sum_{i,j} W_{ij}\, O_{ij}}{\sum_{i,j} W_{ij}\, E_{ij}}$$
A QWK score of 1 indicates perfect agreement between predictions and ground truth, 0 represents random classification, and negative values suggest systematic misclassification.
Alongside QWK, standard classification metrics are used to provide a comprehensive evaluation of model performance: Accuracy, Precision, F1-Score, and ROC-AUC.
For multi-class classification, the results for each class should be averaged. There are two common approaches: macro-averaging and micro-averaging. In our case, due to the balanced class distribution, both methods yield the same results.
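For reference, QWK and the companion metrics can be computed with scikit-learn as sketched below; this mirrors the definitions above but is not necessarily the evaluation code used in this study. The quadratic weighting corresponds to $W_{ij} = (i-j)^2/(N-1)^2$ for $N$ classes.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: ICDR grades (0-4); y_prob: softmax scores, shape (n_samples, 5)."""
    return {
        "QWK": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
        "ROC-AUC": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```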

4.4. Explainability Methods

To interpret the decision-making process of the models, two explainability methods are applied: Grad-CAM (Gradient-weighted Class Activation Mapping) and Attention-Rollout. Grad-CAM is a visualization technique designed for CNN-based models: it highlights the most influential regions in an input image that contribute to a classification decision by using gradients from the final convolutional layer to generate a heatmap overlay on the input image. Attention-Rollout is used for Transformer-based models; it visualizes how attention weights are distributed across different parts of the input image, illustrating which regions influence the final prediction. A minimal Attention-Rollout sketch is given at the end of this subsection.
The choice of explainability method depends on the model architecture (see Table 2):
  • ResNet-50 and EfficientNet-B0: Well suited for Grad-CAM, as these models rely on convolutional layers that produce spatial feature maps.
  • ViT-Small and DINO-v2: Compatible with Attention-Rollout, as they use multi-head self-attention (MHSA) layers of uniform structure, making attention visualization feasible.
  • CvT-13: Since CvT-13 incorporates convolutional layers close to the output, Grad-CAM is applicable.
Some models are not compatible with either technique due to their architectural differences:
  • SwinV2-Tiny: Although it includes a convolutional patch embedding layer, its hierarchical multi-head self-attention (MHSA) layers vary in structure, making Attention-Rollout unsuitable. Additionally, its convolutional layers are located far from the output, reducing the effectiveness of Grad-CAM.
  • LeViT-256: While it incorporates elements of both CNNs and Transformers, it lacks uniform MHSA layers for Attention-Rollout and does not have convolutional layers positioned effectively for Grad-CAM.
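To make the Attention-Rollout computation concrete, the sketch below follows the formulation of Abnar and Zuidema [23]; it assumes the per-layer attention maps are available (for example via `output_attentions=True` in Hugging Face ViT models), and the 0.5 residual weighting is the common implementation choice rather than a detail reported in this paper.

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors shaped (batch, heads, tokens, tokens)."""
    n_tokens = attentions[0].size(-1)
    rollout = torch.eye(n_tokens)
    for layer_attention in attentions:
        att = layer_attention.mean(dim=1)[0]              # average heads, take first image
        att = 0.5 * att + 0.5 * torch.eye(n_tokens)       # model the residual (skip) connection
        att = att / att.sum(dim=-1, keepdim=True)         # re-normalize rows
        rollout = att @ rollout                           # accumulate attention across layers
    patch_relevance = rollout[0, 1:]                      # [CLS]-to-patch relevance (token 0 is [CLS])
    side = int(patch_relevance.numel() ** 0.5)
    return patch_relevance.reshape(side, side)            # e.g., a 14x14 map for 16x16 patches
```

The resulting map is upsampled to the input resolution and overlaid on the fundus image, analogous to a Grad-CAM heatmap.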

5. Results

5.1. Overall Test Performance

Table 6 presents the performance of the models on the test dataset, based on the metrics specified in the previous section. Because Quadratic Weighted Kappa (QWK) serves as the primary metric in this study, we place particular emphasis on its values. Of the two CNN models, EfficientNet-B0 achieves the higher performance, with a QWK of 0.84. In the ViT group, SwinV2-Tiny exhibits the best result at 0.80. Among the hybrid architectures, CvT-13 attains the strongest outcome, also reaching a QWK of 0.84. By contrast, DINOv2-Small and ViT-Small (both in the ViT group) show the weakest performance overall.
The CNN and hybrid models yield very similar results across the various evaluation metrics. While SwinV2-Tiny in the ViT group stays slightly behind, its performance still remains close to that of the CNN and hybrid groups. It is also noteworthy that all models display a positive correlation between QWK and other metrics, suggesting that a higher QWK value generally aligns with improved performance in other measures. Thus, QWK appears to serve as a robust and comprehensive indicator of overall predictive capability.

5.2. Classwise Performance

Table 7, Table 8, Table 9, Table 10 and Table 11 show the performance of various models across individual classes. To obtain these results, each model is tested on a dataset containing only examples from one class, enabling a detailed examination of each model’s strengths and weaknesses regarding specific class characteristics. Because these class-specific test sets contain only a single label each (no class variation), it is not possible to calculate QWK or ROC-AUC values.
Overall, the models perform better in Classes 0 and 2, whereas Classes 1, 3, and 4 exhibit greater variance and more pronounced deficiencies. This suggests that certain classes may have more distinct or recognizable features, making them easier for the models to learn and discriminate.
A general assessment shows that some models excel across multiple classes, while others display more apparent weaknesses. LeViT-256 and CvT-13 rank among the top performers, demonstrating excellent results in Classes 0, 1, 2, 3, and 4, with high accuracy and F1-scores. ResNet-50 falls into the mid-range category: although it achieves strong performance in Classes 0, 2, and 3, its results in Classes 1 and 4 are comparatively weaker.
Models such as EfficientNet-B0, DINOv2-Small, ViT-Small, and SwinV2-Tiny show less robust outcomes. For instance, EfficientNet-B0 struggles especially in Classes 1 and 3, yielding low accuracy and F1-scores. DINOv2-Small also displays weak performance in Classes 1, 3, and 4. While ViT-Small attains acceptable results in Classes 0 and 2, it fares less well in Classes 1 and 4. SwinV2-Tiny delivers very strong results in Class 0 but only moderate performance elsewhere.
Examining these findings from a broader perspective reveals interesting patterns across model groups. The ViT-based models—ViT-Small, DINOv2-Small, and SwinV2-Tiny—generally lag behind the other groups and show notable weaknesses in several classes. Among the CNN architectures, ResNet-50 achieves relatively consistent performance, whereas EfficientNet-B0’s results vary more widely and include some notably low scores. By contrast, hybrid models that combine CNN and Transformer components clearly exhibit the strongest overall performance, most likely due to their ability to leverage the benefits of both approaches.

5.3. Visual Analysis

To better understand the decision-making processes of the evaluated models, we apply Grad-CAM and Attention-Rollout techniques. Table 12 summarizes visualization outcomes for representative examples across all DR severity levels (0–4), illustrating how each model captures relevant retinal features associated with different stages of diabetic retinopathy.
In each severity class, we select a random fundus image from the test dataset. We then apply data augmentation (including corner-filling, which replaces black image borders with averaged pixel values) before feeding these images into our models. This augmentation step ensures that none of the models overemphasize the circular edges of fundus images, thus making the classification decisions less sensitive to border artifacts. Note that minor variations in scaling appear in the resulting visualizations, stemming from differences in preprocessing among the models.
Grad-CAM is applied to ResNet-50, EfficientNet-B0, and CvT-13 to generate heatmaps indicating the regions that contributed most to the model’s classification decision.
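A minimal hook-based Grad-CAM sketch for a CNN classifier is shown below, following Selvaraju et al. [22]; the `target_layer` argument (e.g., the last bottleneck block of a torchvision ResNet-50) and the plain-logits model interface are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """image: tensor of shape (1, 3, H, W); returns a heatmap of shape (H, W) in [0, 1]."""
    store = {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

    model.eval()
    logits = model(image)                  # assumes the model returns raw logits
    model.zero_grad()
    logits[0, class_idx].backward()        # gradient of the target class score
    h1.remove(); h2.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)           # pooled gradients per channel
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8))[0, 0]

# Example: heat = grad_cam(resnet50, resnet50.layer4[-1], img_tensor, predicted_class)
```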
Overall, the CNNs tend to focus on localized lesions such as exudates, hemorrhages, and microaneurysms, particularly in the more severe DR cases. In mild or moderate DR images, Grad-CAM highlights fewer discrete areas but still emphasizes subtle lesion patterns near the macular region. This behavior aligns well with the inherent capacity of CNNs to capture local features. However, we observe some variability in the heatmaps’ focus when lesions are relatively small: in some instances, CNNs spread attention to broader areas, possibly reflecting uncertainty about the primary lesion sites.
For ViT (Vision Transformer) and hybrid models (e.g., Convolutional-ViT), we employ the Attention-Rollout method, which aggregates multi-head self-attention matrices across all transformer layers. Compared to the CNN heatmaps, these transformer-based models often exhibit a more “global” view of the image. In severe DR cases with prominent lesions, the attention is strongly drawn to multiple clusters of hemorrhages and large exudates scattered throughout the fundus. However, in certain instances—especially mild DR images—transformer-based models occasionally direct substantial attention to non-lesion areas, such as the circular fundus boundary, reflecting the high-level, context-driven nature of self-attention. This can be advantageous for capturing long-range dependencies (e.g., multiple small lesions spread across different retinal regions) but may dilute focus on a single small lesion patch.
In summary, both ViT-Small and DINOv2-Small exhibit certain weaknesses in detecting local features due to their architectures. Their self-attention mechanism grants a global field of view, allowing the models to consider all parts of an image simultaneously. However, this often leads to a relatively dispersed focus, meaning they can overlook critical local features. In contrast, CNN-based models such as ResNet-50 and EfficientNet-B0 rely on convolutional operations that are especially effective at capturing and amplifying local features, thereby enabling the more precise detection of specific DR indicators. This difference may also explain why SwinV2-Tiny, with its unique shifted-window self-attention mechanism designed to incorporate more localized attention, performs better among the ViT-based models.
At the same time, an overemphasis on highly localized details can lead to overfitting as seen in certain training phases for both ResNet and EfficientNet. Hybrid architectures like CvT-13 combine the advantages of convolutional and transformer mechanisms, offering a more balanced performance by capturing both local and global features. This balance is likely the reason CvT-13 demonstrates greater robustness and reliability in diabetic retinopathy classification tasks.

6. Discussion

6.1. Main Findings

In this study, the performance of various models for classifying the severity levels of diabetic retinopathy (DR) was examined and compared. CNN-based, newer ViT-based, and hybrid models were analyzed. The investigation covered not only the performance evaluation of the models but also a detailed visual analysis of their decision-making processes using Grad-CAM and Attention-Rollout.
The CNN models, ResNet-50 and EfficientNet-B0, demonstrated strong performance in classifying DR severity levels but showed a greater tendency to overfit. ViT models, particularly ViT-Small and DINOv2-Small, exhibited weaknesses in capturing local features, whereas SwinV2-Tiny stood out among the ViTs. Hybrid models such as CvT-13 and LeViT-256 successfully combined the strengths of both CNNs and ViTs, achieving more balanced and robust results.
Through Grad-CAM and Attention-Rollout, it was observed that CNN models effectively captured local features such as exudates and blood vessels. In contrast, ViT models—due to their global perspective—often overlooked important local details. Hybrid models like CvT-13 proved to be more robust, as they were able to integrate both local and global information effectively.

6.2. Comparison with Related Work

Although the best-performing model in this study achieved an accuracy of 72.93% and a QWK score of 0.841—seemingly falling short of results reported in other works (see Table 13)—a deliberately simplified approach was adopted to reduce infrastructure and implementation demands, lower barriers in resource-limited environments, and promote clinical applicability. For instance, Adak et al. [8] employed an ensemble of transformers (EiT) to classify DR on the APTOS-2019 dataset, achieving an accuracy of 94.63%. Sun et al. [16] developed a Lesion-Aware Transformer (LAT) for DR classification and reported Quadratic Weighted Kappa (QWK) scores of 0.893 (validation) and 0.884 (test) on the EyePACS dataset. Wu et al. [11] used adapted Vision Transformer (ViT) models with specific preprocessing techniques and achieved 91.4% accuracy. Chetoui et al. [12] applied federated learning with ViT and CNN models on the APTOS-2019 and EyePACS datasets, achieving 95% accuracy.
Notably, while the performance of CNN models in this study slightly lags behind that reported by other researchers, the performance of ViT models (with the exception of SwinV2-Tiny) is considerably lower than that in comparable studies. Beyond the factors already mentioned, several reasons may explain this gap.
  • First, class imbalance in the datasets used by other researchers may have affected performance. In particular, the APTOS-2019 dataset exhibits a highly imbalanced class distribution, which could lead to overfitting. Although other studies attempted to address this issue through data augmentation, the effectiveness of these strategies remains questionable. For instance, after augmentation in the study by Wu et al. [11], class distribution remained skewed: class 0 (no DR) 24.56%, class 1 (mild DR) 23.20%, class 2 (moderate DR) 25.18%, class 3 (severe DR) 14.94%, and class 4 (proliferative DR) 12.12%. The proportions for classes 3 and 4 remained significantly lower. In this research, the models generally performed worse on classes 3 and 4, indicating a need for more data in these categories.
  • Second, the overall dataset size may have been too small to adequately train the more complex ViT models.
  • Third, fine-tuning on a small training set with limited epochs may not have been sufficient to adapt the model effectively. Pretrained models learn general features from large, diverse datasets. However, fine-tuning on smaller, domain-specific datasets might not successfully align those general features with the task-specific patterns. This could explain the suboptimal adaptation observed in this study.
  • Fourth, overly aggressive augmentation, especially random rotations of 0–360° and intensity distortions of ±20%, may have introduced excessive complexity and impeded the learning process. Indeed, Kumar et al. [15] tested Transformer, CNN, and MLP models, achieving 86.4% accuracy with the Swin Transformer, notably without employing augmentation techniques to improve performance. Bala et al.’s CTNet [18], which combines residual connections, CNN, and ViT components for DR classification, reported an AUC of 0.987 and a Kappa score of 0.972 on the APTOS-2019 dataset, also without using rotation- or zoom-based augmentation.
Despite the fact that CNN and ViT models in this study may not outperform those in related research due to factors such as dataset selection and the data augmentation techniques applied, this study highlights the substantial potential of hybrid models for DR severity classification. This represents the core innovation and primary contribution of the present work.
Although CNN and ViT models have been extensively studied for this task, there is a lack of research on hybrid models that combine the strengths of both architectures. The findings of this study demonstrate that hybrid models can achieve superior overall performance compared to standalone CNN or ViT models.
The cross-class evaluation shows that the hybrid models delivered impressive performance. For example, the hybrid model CvT-13 achieved the best result with a QWK score of 0.841. Similarly, LeViT-256 also achieved a high QWK score. The class-specific performance analysis further illustrates that hybrid models perform remarkably well across multiple classes. LeViT-256 achieved high accuracy and F1-scores in classes 0, 1, 2, 3, and 4, making it one of the top-performing models. Likewise, CvT-13 achieved very strong results in classes 0, 1, 3, and 4.
What is particularly noteworthy is that hybrid models were able to effectively overcome the common overfitting problems often observed with CNNs and the underfitting issues seen in ViTs—despite being trained on relatively small, class-balanced datasets. It is reasonable to expect that, when trained on larger datasets with more advanced augmentation and training methods, hybrid models could outperform even the most advanced standalone CNN or ViT models reported in the literature.

6.3. Limitations and Future Work

The main limitations of this study include the relatively small size of the dataset and the class balancing. The use of larger and more diverse datasets could improve the generalizability of the models and increase their robustness to variations in imaging conditions and quality. According to the original ViT publication [10], ViT models show their strengths especially when trained on large datasets.
Our balanced dataset (1000 samples per class) does not reflect the class distribution typically found in real-world clinical data, which is often very unbalanced. Such skewed distributions pose a major challenge for modeling, as many learning algorithms tend to favor the majority class—possibly at the expense of the performance of the minority class. Future work could explore further established techniques such as sampling strategies or cost-sensitive learning alongside newer approaches that utilize the capabilities of neural networks to mitigate the imbalance between classes [54].
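As a brief illustration of the cost-sensitive and sampling-based options mentioned above, the PyTorch sketch below uses inverse-frequency class weights; the per-grade counts and label vector are hypothetical, not statistics from the datasets used here.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-grade image counts for a skewed training set.
class_counts = torch.tensor([20000.0, 2000.0, 4000.0, 700.0, 600.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Cost-sensitive learning: errors on rare grades contribute more to the loss.
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Sampling strategy: oversample rare grades so batches are roughly class-balanced.
train_labels = torch.tensor([0, 0, 2, 4, 1])   # toy label vector, one entry per training image
sampler = WeightedRandomSampler(
    weights=class_weights[train_labels], num_samples=len(train_labels), replacement=True)
```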
All experiments were performed with the EyePACS and APTOS-2019 datasets commonly used in the field as shown in Table 13 but lacked external validation with independent datasets. Future work will evaluate performance on different, independent datasets to address concerns about domain shift and assess clinical applicability.
Another possible limitation concerns the aggressive augmentation techniques, in particular random rotations of 0–360° and intensity distortions of ±20%. Such transformations may have introduced excessive complexity and hindered the learning process. It might be advisable to use a more conservative augmentation strategy [18]. Finally, the interpretability of some models (e.g., SwinV2-Tiny, LeViT-256) could not be fully evaluated due to limitations in the available visualization tools. Future research could investigate the integration of Markov Random Fields (MRFs) [55] into the first layers of CNN architectures as a means to improve performance. MRFs can capture both local and global image features by modeling pixel intensities and their spatial dependencies.
Although CvT-13 and EfficientNet showed differences in performance, we did not statistically analyze them in the current study. In our future studies, we plan to use the DeLong test to assess statistically significant differences in the model AUCs and apply the Bonferroni correction for multiple comparisons as demonstrated by [29]. We are also considering the McNemar test to compare the performance of the models. The McNemar test evaluates differences in proportions between two paired binary outcomes. It is applied to a 2 × 2 contingency table that summarizes mutually exclusive outcomes from two tests conducted on the same subjects [56]. When evaluating models by cross-validation or multiple runs, we would still consider the paired t-test or, if normality assumptions are violated, the Wilcoxon signed-rank test to compare performance measures as outlined in [31].
Due to computational constraints and following Goh et al. [29], training was limited to 20 epochs and stopped early to avoid overfitting. Touati et al. [30] trained for over 100 epochs to fully leverage the augmented dataset. However, extensive regularization techniques were required to prevent overfitting. Future work will analyze the learning curves and implement longer training schedules.

7. Conclusions

Our principal conclusions indicate that hybrid networks integrating both convolutional and attention-based mechanisms exhibit consistently high classification performance. Furthermore, visual analysis reveals that CNNs excel at capturing local features, while ViTs provide a global perspective, which can sometimes lead to less focused decision-making. These insights offer valuable guidance for the further development and optimization of DR detection models for use in medical practice.

Author Contributions

Conceptualization, W.Z. and T.E.; methodology, W.Z., T.E. and V.B.; software, W.Z.; validation, W.Z., T.E. and V.B.; formal analysis, W.Z.; investigation, W.Z.; resources, W.Z.; data curation, W.Z.; writing—original draft preparation, W.Z.; writing—review and editing, T.E. and V.B.; visualization, W.Z.; supervision, T.E.; project administration, T.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study were obtained from two publicly available sources. The EyePACS Diabetic Retinopathy dataset is accessible via the Kaggle competition portal at https://kaggle.com/competitions/diabetic-retinopathy-detection (accessed on 7 May 2025). The APTOS-2019 Blindness Detection dataset is available at https://kaggle.com/competitions/aptos2019-blindness-detection (accessed on 7 May 2025). All analyses and results reported in this manuscript are based on these publicly available datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DR: Diabetic Retinopathy
NPDR: Non-Proliferative Diabetic Retinopathy
APTOS: Asia Pacific Tele-Ophthalmology Society
CNN: Convolutional Neural Network
ViT: Vision Transformer
CvT: Convolutional Vision Transformer
ResNet: Residual Network
LeViT: Light Vision Transformer
QWK: Quadratic Weighted Kappa
Grad-CAM: Gradient-weighted Class Activation Mapping
FLOPs: Floating Point Operations
GPU: Graphics Processing Unit
CPU: Central Processing Unit

Appendix A

Appendix A.1. Training and Validation Loss

Figure A1 presents the training and validation loss curves for the selected models during the 20 epochs of fine-tuning. It is worth noting that the y-axis of the loss curve for the ViT-Small model is scaled up to four, while for the other models, it is limited to two. This adjustment is necessary due to the particularly high losses observed in the ViT-Small model.
The first two graphs illustrate the fine-tuning process of ResNet-50 and EfficientNet-B0. Both models exhibit an initial rapid decline in training loss, indicating that they quickly learn relevant features in the early training epochs. As training progresses, the loss continues to decrease and eventually stabilizes near zero, suggesting that the models fit the training data well with minimal residual training error. However, their validation losses behave differently. Initially, both models experience fluctuations in validation loss, followed by a gradual increase, indicating that although they improve on the training data, their performance on the validation data deteriorates. This suggests overfitting and poor generalization. Comparing both models, EfficientNet-B0 shows a more consistent and pronounced increase in validation loss, whereas ResNet-50 exhibits less fluctuation and a less consistent rise, implying that EfficientNet-B0 may be more prone to overfitting than ResNet-50.
Figure A1. Training and validation loss during fine-tuning: (a) ResNet-50, (b) EfficientNet-B0, (c) ViT-Small, (d) DINOv2-Small, (e) SwinV2-Tiny, (f) LeViT-256, and (g) CvT-13.

Appendix A.2. Validation Performance

Figure A2 presents the Quadratic Weighted Kappa (QWK) scores of the selected models on the validation dataset over the 20 training epochs, grouped by model type.
Figure A2. Quadratic Weighted Kappa during training: (a) CNN models; (b) ViT models; (c) Hybrid models.
Among the CNN-based models, both ResNet-50 and EfficientNet-B0 maintain consistently high QWK values with minimal fluctuations. Within the ViT group, DINOv2-Small shows stable but low values, while SwinV2-Tiny improves significantly over time, reaching high QWK values. ViT-Small maintains moderate but stable QWK scores. Among the hybrid models, CvT-13 initially shows slight improvements before stabilizing at consistently high QWK values, comparable to the CNN models. LeViT-256 exhibits greater fluctuations but eventually achieves high QWK values, slightly below those of CvT-13.

References

  1. Leasher, J.L.; Bourne, R.R.; Flaxman, S.R.; Jonas, J.B.; Keeffe, J.; Naidoo, K.; Pesudovs, K.; Price, H.; White, R.A.; Wong, T.Y.; et al. Global Estimates on the Number of People Blind or Visually Impaired by Diabetic Retinopathy: A Meta-analysis From 1990 to 2010. Diabetes Care 2016, 39, 1643–1649. [Google Scholar] [CrossRef] [PubMed]
  2. Thomas, R.L.; Halim, S.; Gurudas, S.; Sivaprasad, S.; Owens, D.R. IDF Diabetes Atlas: A review of studies utilising retinal photography on the global prevalence of diabetes related retinopathy between 2015 and 2018. Diabetes Res. Clin. Pract. 2019, 157, 107840. [Google Scholar] [CrossRef]
  3. Thomas, R.L.; Luzio, S.D.; North, R.V.; Banerjee, S.; Zekite, A.; Bunce, C.; Owens, D.R. Retrospective analysis of newly recorded certifications of visual impairment due to diabetic retinopathy in Wales during 2007–2015. BMJ Open 2017, 7, e015024. [Google Scholar] [CrossRef] [PubMed]
  4. Laurik-Feuerstein, K.L.; Sapahia, R.; Cabrera DeBuc, D.; Somfai, G.M. The assessment of fundus image quality labeling reliability among graders with different backgrounds. PLoS ONE 2022, 17, e0271156. [Google Scholar] [CrossRef]
  5. Wong, T.Y.; Tan, T.E. The Diabetic Retinopathy “Pandemic” and Evolving Global Strategies: The 2023 Friedenwald Lecture. Investig. Opthalmology Vis. Sci. 2023, 64, 47. [Google Scholar] [CrossRef]
  6. Yagin, F.H.; Yasar, S.; Gormez, Y.; Yagin, B.; Pinar, A.; Alkhateeb, A.; Ardigò, L.P. Explainable Artificial Intelligence Paves the Way in Precision Diagnostics and Biomarker Discovery for the Subclass of Diabetic Retinopathy in Type 2 Diabetics. Metabolites 2023, 13, 1204. [Google Scholar] [CrossRef]
  7. Tao, Y.; Xiong, M.; Peng, Y.; Yao, L.; Zhu, H.; Zhou, Q.; Ouyang, J. Machine learning-based identification and validation of immune-related biomarkers for early diagnosis and targeted therapy in diabetic retinopathy. Gene 2025, 934, 149015. [Google Scholar] [CrossRef]
  8. Adak, C.; Karkera, T.; Chattopadhyay, S.; Saqib, M. Detecting Severity of Diabetic Retinopathy from Fundus Images using Ensembled Transformers. arXiv 2023, arXiv:2301.00973. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  11. Wu, J.; Hu, R.; Xiao, Z.; Chen, J.; Liu, J. Vision Transformer-based recognition of diabetic retinopathy grade. Med. Phys. 2021, 48, 7850–7863. [Google Scholar] [CrossRef] [PubMed]
  12. Chetoui, M.; Akhloufi, M.A. Federated Learning for Diabetic Retinopathy Detection Using Vision Transformers. BioMedInformatics 2023, 3, 948–961. [Google Scholar] [CrossRef]
  13. Mohan, N.J.; Murugan, R.; Goel, T.; Roy, P. ViT-DR: Vision Transformers in Diabetic Retinopathy Grading Using Fundus Images. In Proceedings of the 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), Hyderabad, India, 16–18 September 2022; pp. 167–172. [Google Scholar] [CrossRef]
  14. Nazih, W.; Aseeri, A.; Youssef Atallah, O.; El-Sappagh, S. Vision Transformer Model for Predicting the Severity of Diabetic Retinopathy in Fundus Photography-Based Retina Images. IEEE Access 2023, 11, 117546–117561. [Google Scholar] [CrossRef]
  15. Kumar, N.S.; Ramaswamy Karthikeyan, B. Diabetic Retinopathy Detection using CNN, Transformer and MLP based Architectures. In Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Hualien City, Taiwan, 16–19 November 2021; pp. 1–2. [Google Scholar] [CrossRef]
  16. Sun, R.; Li, Y.; Zhang, T.; Mao, Z.; Wu, F.; Zhang, Y. Lesion-Aware Transformers for Diabetic Retinopathy Grading. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10933–10942. [Google Scholar] [CrossRef]
  17. Yang, Y.; Cai, Z.; Qiu, S.; Xu, P. Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image. PLoS ONE 2024, 19, e0299265. [Google Scholar] [CrossRef]
  18. Bala, R.; Sharma, A.; Goel, N. CTNet: Convolutional Transformer Network for Diabetic Retinopathy Classification. Neural Comput. Appl. 2024, 36, 4787–4809. [Google Scholar] [CrossRef]
  19. Wang, Z.; Lu, H.; Yan, H.; Kan, H.; Jin, L. Vison Transformer Adapter-Based Hyperbolic Embeddings for Multi-Lesion Segmentation in Diabetic Retinopathy. Sci. Rep. 2023, 13, 11178. [Google Scholar] [CrossRef]
  20. Zang, F.; Ma, H. CRA-Net: Transformer guided category-relation attention network for diabetic retinopathy grading. Comput. Biol. Med. 2024, 170, 107993. [Google Scholar] [CrossRef]
  21. Playout, C.; Duval, R.; Boucher, M.C.; Cheriet, F. Focused Attention in Transformers for interpretable classification of retinal images. Med. Image Anal. 2022, 82, 102608. [Google Scholar] [CrossRef]
  22. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  23. Abnar, S.; Zuidema, W. Quantifying Attention Flow in Transformers. arXiv 2020, arXiv:2005.00928. [Google Scholar]
  24. Band, N.; Rudner, T.G.J.; Feng, Q.; Filos, A.; Nado, Z.; Dusenberry, M.W.; Jerfel, G.; Tran, D.; Gal, Y. Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks. arXiv 2022, arXiv:2211.12717. [Google Scholar]
  25. Lee, C.H.; Ke, Y.H. Fundus images classification for Diabetic Retinopathy using Deep Learning. In Proceedings of the 13th International Conference on Computer Modeling and Simulation, ICCMS ’21, New York, NY, USA, 25–27 June 2021; pp. 264–270. [Google Scholar] [CrossRef]
  26. Halder, A.; Gharami, S.; Sadhu, P.; Singh, P.K.; Woźniak, M.; Ijaz, M.F. Implementing vision transformer for classifying 2D biomedical images. Sci. Rep. 2024, 14, 12567. [Google Scholar] [CrossRef]
  27. Philippi, D.; Rothaus, K.; Castelli, M. A vision transformer architecture for the automated segmentation of retinal lesions in spectral domain optical coherence tomography images. Sci. Rep. 2023, 13, 517. [Google Scholar] [CrossRef]
  28. He, J.; Wang, J.; Han, Z.; Ma, J.; Wang, C.; Qi, M. An interpretable transformer network for the retinal disease classification using optical coherence tomography. Sci. Rep. 2023, 13, 3637. [Google Scholar] [CrossRef]
  29. Goh, J.H.L.; Ang, E.; Srinivasan, S.; Lei, X.; Loh, J.; Quek, T.C.; Xue, C.; Xu, X.; Liu, Y.; Cheng, C.Y.; et al. Comparative Analysis of Vision Transformers and Conventional Convolutional Neural Networks in Detecting Referable Diabetic Retinopathy. Ophthalmol. Sci. 2024, 4, 100552. [Google Scholar] [CrossRef]
  30. Touati, M.; Touati, R.; Nana, L.; Benzarti, F.; Ben Yahia, S. DRCCT: Enhancing Diabetic Retinopathy Classification with a Compact Convolutional Transformer. Big Data Cogn. Comput. 2025, 9, 9. [Google Scholar] [CrossRef]
  31. Sassi Hidri, M.; Hidri, A.; Alsaif, S.A.; Alahmari, M.; AlShehri, E. Optimal Convolutional Networks for Staging and Detecting of Diabetic Retinopathy. Information 2025, 16, 221. [Google Scholar] [CrossRef]
  32. Asia, A.O.; Zhu, C.Z.; Althubiti, S.A.; Al-Alimi, D.; Xiao, Y.L.; Ouyang, P.B.; Al-Qaness, M.A.A. Detection of Diabetic Retinopathy in Retinal Fundus Images Using CNN Classification Models. Electronics 2022, 11, 2740. [Google Scholar] [CrossRef]
  33. Akhtar, S.; Aftab, S.; Ali, O.; Ahmad, M.; Khan, M.A.; Abbas, S.; Ghazal, T.M. A deep learning based model for diabetic retinopathy grading. Sci. Rep. 2025, 15, 3763. [Google Scholar] [CrossRef]
  34. Xue, J.; Wu, J.; Bian, Y.; Zhang, S.; Du, Q. Classification of Diabetic Retinopathy Based on Efficient Computational Modeling. Appl. Sci. 2024, 14, 11327. [Google Scholar] [CrossRef]
  35. Dugas, E.; Jared, J.; Cukierski, W. Diabetic Retinopathy Detection. 2015. Available online: https://kaggle.com/competitions/diabetic-retinopathy-detection (accessed on 21 May 2024).
  36. Maggie, K.; Dane, S. APTOS 2019 Blindness Detection. 2019. Available online: https://kaggle.com/competitions/aptos2019-blindness-detection (accessed on 21 May 2024).
  37. Cleland, C. Comparing the International Clinical Diabetic Retinopathy (ICDR) severity scale. Community Eye Health 2023, 36, 10. [Google Scholar]
  38. Graham, B. Diabetic Retinopathy Detection Competition Report. 2015. Available online: https://storage.googleapis.com/kaggle-forum-message-attachments/88655/2795/competitionreport.pdf (accessed on 21 May 2024).
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  40. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946. [Google Scholar]
  41. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. [Google Scholar]
  42. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009. [Google Scholar] [CrossRef]
  43. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jegou, H.; Douze, M. LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 11–17 October 2021; pp. 12239–12249. [Google Scholar] [CrossRef]
  44. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar] [CrossRef]
  45. Microsoft. ResNet-50. Available online: https://huggingface.co/microsoft/resnet-50 (accessed on 3 June 2024).
46. Google. EfficientNet-B0. Available online: https://huggingface.co/google/efficientnet-b0 (accessed on 3 June 2024).
  47. WinKawaks. vit-small-patch16-224. 2023. Available online: https://huggingface.co/WinKawaks/vit-small-patch16-224 (accessed on 3 June 2024).
48. Facebook. DINOv2-Small-ImageNet1K-1-Layer. Available online: https://huggingface.co/facebook/dinov2-small-imagenet1k-1-layer (accessed on 3 June 2024).
  49. Microsoft. SwinV2-Tiny-Patch4-Window16-256. Available online: https://huggingface.co/microsoft/swinv2-tiny-patch4-window16-256 (accessed on 3 June 2024).
  50. Facebook. LeViT-256. Available online: https://huggingface.co/facebook/levit-256 (accessed on 3 June 2024).
  51. Microsoft. CvT-13. Available online: https://huggingface.co/microsoft/cvt-13 (accessed on 3 June 2024).
  52. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  53. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213–220. [Google Scholar] [CrossRef] [PubMed]
  54. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
  55. Peng, Y.; Yin, H. Markov Random Field Based Convolutional Neural Networks for Image Classification. In Intelligent Data Engineering and Automated Learning—IDEAL 2017; Yin, H., Gao, Y., Chen, S., Wen, Y., Cai, G., Gu, T., Du, J., Tallón-Ballesteros, A.J., Zhang, M., Eds.; Springer: Cham, Switzerland, 2017; pp. 387–396. [Google Scholar]
  56. Park, C.; Park, S.Y.; Kim, H.J.; Shin, H.J. Statistical Methods for Comparing Predictive Values in Medical Diagnosis. Korean J. Radiol. 2024, 25, 656. [Google Scholar] [CrossRef]
Figure 1. Distribution of image counts and class proportions in the two datasets.
Figure 2. Example of a fundus image during the processing stages. (a) Original Image; (b) Cropped Image; (c) Edge-enhanced image; (d) Border-cropped image; (e) Augmented image with black corners; (f) Augmented image with filled corners.
Table 1. International Clinical Diabetic Retinopathy Severity Scale (ICDR) [37].
| ICDR Severity Scale | Grade | Symptoms |
| --- | --- | --- |
| No diabetic retinopathy | R0 | Normal retina. |
| Mild non-proliferative diabetic retinopathy (NPDR) | R1 | Microaneurysms (small blood vessel bulges) or hemorrhages (bleeding) with or without hard exudates (inflammatory fluid deposits). |
| Moderate NPDR | R2 | Microaneurysms, retinal dot or blot hemorrhages, hard exudates or cotton wool spots (nerve fiber swelling). |
| Severe NPDR | R3 | Multiple intraretinal hemorrhages, definite venous beading (venous pearl formation), and intraretinal microvascular abnormalities. |
| Proliferative diabetic retinopathy (PDR) | R4 | Neovascularization (new blood vessels), and vitreous or pre-retinal hemorrhage. |
Table 2. Comparison of selected model architectures used for diabetic retinopathy classification.
| Model | Type | Architecture | Explainability Method |
| --- | --- | --- | --- |
| ResNet (Residual Network) [39] | CNN | Stacked residual blocks with skip connections | Grad-CAM |
| EfficientNet (Efficient Network) [40] | CNN | Compound scaling + depthwise separable convolutions | Grad-CAM |
| ViT (Standard Vision Transformer) | ViT | Patch embeddings + global self-attention | Attention-Rollout |
| DINO-v2 (Self-Distillation with No Labels v2) [41] | ViT | Self-supervised learning with teacher-student distillation | Attention-Rollout |
| Swin-v2 (Shifted Window Transformer v2) [42] | ViT | Shifted-window self-attention with hierarchical feature learning | Not applicable |
| LeViT (Lightweight Vision Transformer) [43] | Hybrid | Hybrid CNN-Transformer with efficient self-attention | Grad-CAM (Attention-Rollout not applicable) |
| CvT (Convolutional Vision Transformer) [44] | Hybrid | CNN-based tokenization with Transformer backbone | Grad-CAM and Attention-Rollout |
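For the CNN and hybrid rows of Table 2, Grad-CAM can be reproduced with a few lines of plain PyTorch. The sketch below is illustrative only and is not the authors' exact implementation; `model`, the chosen `target_layer` (e.g., the last convolutional stage of ResNet-50), and the preprocessed input tensor are assumptions.

```python
# Hook-based Grad-CAM sketch (illustrative; assumes `model`, `target_layer`,
# and a preprocessed (1, 3, H, W) `image` tensor are defined elsewhere).
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output.detach()        # feature maps of the target layer

    def bwd_hook(module, grad_input, grad_output):
        gradients["value"] = grad_output[0].detach()  # gradients w.r.t. those feature maps

    h_fwd = target_layer.register_forward_hook(fwd_hook)
    h_bwd = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    out = model(image)
    logits = out.logits if hasattr(out, "logits") else out
    logits[0, class_idx].backward()
    h_fwd.remove()
    h_bwd.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # GAP over spatial dims
    cam = F.relu((weights * activations["value"]).sum(dim=1))    # weighted sum + ReLU
    cam = cam / (cam.max() + 1e-8)                               # normalize to [0, 1]
    return cam[0]  # low-resolution heat map; upsample to the input size for overlay
```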
Table 3. Overview of the selected model versions.
| Model | Params (M) | FLOPs (G) | Dataset | Resolution |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| microsoft/resnet-50 [45] | 25.6 | 4.10 | ImageNet-1k | 224 × 224 |
| google/efficientnet-b0 [46] | 5.3 | 0.39 | ImageNet-1k | 224 × 224 |
| ViT | | | | |
| winkawaks/vit-small-patch16-224 [47] | 22.1 | 4.60 | ImageNet-1k | 224 × 224 |
| facebook/dinov2-small-imagenet1k-1-layer [48] | 22.0 | 6.10 | ImageNet-1k | 224 × 224 |
| microsoft/swinv2-tiny-patch4-window16-256 [49] | 28.0 | 6.60 | ImageNet-1k | 256 × 256 |
| Hybrid | | | | |
| facebook/levit-256 [50] | 19.0 | 1.10 | ImageNet-1k | 224 × 224 |
| microsoft/cvt-13 [51] | 20.0 | 4.50 | ImageNet-1k | 224 × 224 |
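All checkpoints in Table 3 are hosted on the Hugging Face Hub, so any of them can be adapted to the five ICDR grades by replacing the ImageNet-1k classification head. The snippet below is a minimal sketch using the generic `transformers` auto classes; it is not taken from the study's code base, and the chosen identifier is only an example.

```python
# Minimal sketch: adapt a Table 3 checkpoint to 5-class DR grading (R0-R4).
# The study's own training pipeline may differ in detail.
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL_ID = "microsoft/cvt-13"  # any identifier from Table 3 can be substituted

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,                  # ICDR grades R0-R4
    ignore_mismatched_sizes=True,  # re-initializes the 1000-class ImageNet head
)
```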
Table 4. Hyperparameters.
| Parameter | Value |
| --- | --- |
| Training Batch Size | 64 |
| Evaluation Batch Size | 64 |
| Number of Epochs | 20 |
| Best Model Metric | QWK |
| Optimizer | AdamW |
| Learning Rate | 1 × 10⁻³ |
| Weight Decay | 1 × 10⁻² |
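As a rough guide, the settings in Table 4 translate directly into a PyTorch optimizer and a scikit-learn model-selection metric. The sketch below assumes a `model` instantiated from Table 3 and integer grade labels; it is not the authors' training script.

```python
# Optimizer and model-selection metric from Table 4 (illustrative sketch;
# `model` is assumed to be any of the networks in Table 3).
import torch
from sklearn.metrics import cohen_kappa_score

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

def quadratic_weighted_kappa(y_true, y_pred):
    """QWK, used to select the best checkpoint across the 20 training epochs."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")
```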
Table 5. Hardware and software specifications.
| Component | Specification |
| --- | --- |
| GPU | NVIDIA RTX 4090 (24 GB) |
| CPU | 12 vCPUs, Intel Xeon Platinum 8352V @ 2.10 GHz |
| PyTorch Version | 2.0.0 |
| Python Version | 3.8 |
| CUDA Version | 11.8 |
| Operating System | Ubuntu 20.04 |
Table 6. Test performance metrics for the selected models.
| Model | QWK | Accuracy | Recall | Precision | F1 | ROC-AUC |
| --- | --- | --- | --- | --- | --- | --- |
| CNN | | | | | | |
| ResNet-50 | 0.81 | 0.73 | 0.73 | 0.74 | 0.72 | 0.91 |
| EfficientNet-B0 | 0.84 | 0.73 | 0.73 | 0.74 | 0.73 | 0.92 |
| ViT | | | | | | |
| ViT-Small | 0.62 | 0.54 | 0.54 | 0.54 | 0.53 | 0.80 |
| DINOv2-Small | 0.51 | 0.51 | 0.51 | 0.50 | 0.49 | 0.80 |
| SwinV2-Tiny | 0.80 | 0.71 | 0.71 | 0.71 | 0.70 | 0.92 |
| Hybrid | | | | | | |
| LeViT-256 | 0.83 | 0.74 | 0.74 | 0.74 | 0.74 | 0.92 |
| CvT-13 | 0.84 | 0.74 | 0.74 | 0.74 | 0.74 | 0.93 |
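For reference, the metrics in Table 6 can be computed with scikit-learn roughly as sketched below. The averaging scheme (macro) and the variable names `y_true`, `y_pred`, and `y_prob` are assumptions for illustration, not details taken from the paper.

```python
# Illustrative computation of the Table 6 metrics (assumes integer labels
# `y_true`, predictions `y_pred`, and per-class probabilities `y_prob` of
# shape (n_samples, 5); macro averaging is an assumption).
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

metrics = {
    "QWK": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    "Accuracy": accuracy_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred, average="macro"),
    "Precision": precision_score(y_true, y_pred, average="macro"),
    "F1": f1_score(y_true, y_pred, average="macro"),
    "ROC-AUC": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
}
```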
Table 7. Model performance for class R0 (No diabetic retinopathy).
| Model | Accuracy | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 | 0.99 | 0.99 | 1.00 | 1.00 |
| EfficientNet-B0 | 0.97 | 0.97 | 1.00 | 0.99 |
| ViT | | | | |
| ViT-Small-Patch16-224 | 0.89 | 0.89 | 1.00 | 0.94 |
| DINOv2-Small | 0.88 | 0.88 | 1.00 | 0.94 |
| SwinV2-Tiny | 1.00 | 1.00 | 1.00 | 1.00 |
| Hybrid | | | | |
| LeViT-256 | 0.99 | 0.99 | 1.00 | 1.00 |
| CvT-13 | 1.00 | 1.00 | 1.00 | 1.00 |
Table 8. Model performance for class R1 (NPDR).
| Model | Accuracy | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 | 0.85 | 0.85 | 1.00 | 0.92 |
| EfficientNet-B0 | 0.21 | 0.21 | 1.00 | 0.34 |
| ViT | | | | |
| ViT-Small-Patch16-224 | 0.50 | 0.50 | 1.00 | 0.67 |
| DINOv2-Small | 0.17 | 0.17 | 1.00 | 0.29 |
| SwinV2-Tiny | 0.77 | 0.77 | 1.00 | 0.87 |
| Hybrid | | | | |
| LeViT-256 | 0.88 | 0.88 | 1.00 | 0.94 |
| CvT-13 | 0.85 | 0.85 | 1.00 | 0.92 |
Table 9. Model performance for class R2 (Moderate NPDR).
| Model | Accuracy | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 | 0.95 | 0.95 | 1.00 | 0.98 |
| EfficientNet-B0 | 0.99 | 0.99 | 1.00 | 0.99 |
| ViT | | | | |
| ViT-Small-Patch16-224 | 0.59 | 0.59 | 1.00 | 0.74 |
| DINOv2-Small | 0.65 | 0.65 | 1.00 | 0.79 |
| SwinV2-Tiny | 0.88 | 0.88 | 1.00 | 0.94 |
| Hybrid | | | | |
| LeViT-256 | 0.94 | 0.94 | 1.00 | 0.97 |
| CvT-13 | 0.94 | 0.94 | 1.00 | 0.97 |
Table 10. Model performance for class R3 (Severe NPDR).
| Model | Accuracy | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 | 0.84 | 0.84 | 1.00 | 0.91 |
| EfficientNet-B0 | 0.13 | 0.13 | 1.00 | 0.24 |
| ViT | | | | |
| ViT-Small-Patch16-224 | 0.49 | 0.49 | 1.00 | 0.66 |
| DINOv2-Small | 0.37 | 0.37 | 1.00 | 0.54 |
| SwinV2-Tiny | 0.46 | 0.46 | 1.00 | 0.63 |
| Hybrid | | | | |
| LeViT-256 | 0.85 | 0.85 | 1.00 | 0.92 |
| CvT-13 | 0.88 | 0.88 | 1.00 | 0.94 |
Table 11. Model performance for class R4 (Proliferative DR).
| Model | Accuracy | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| CNN | | | | |
| ResNet-50 | 0.84 | 0.84 | 1.00 | 0.91 |
| EfficientNet-B0 | 0.50 | 0.50 | 1.00 | 0.67 |
| ViT | | | | |
| ViT-Small-Patch16-224 | 0.22 | 0.22 | 1.00 | 0.36 |
| DINOv2-Small | 0.40 | 0.40 | 1.00 | 0.57 |
| SwinV2-Tiny | 0.55 | 0.55 | 1.00 | 0.71 |
| Hybrid | | | | |
| LeViT-256 | 0.72 | 0.72 | 1.00 | 0.84 |
| CvT-13 | 0.80 | 0.80 | 1.00 | 0.89 |
Table 12. Visualization results of selected models using Grad-CAM and Attention-Rollout.
[Image grid not reproduced: each row shows an augmented fundus image followed by Grad-CAM heat maps for ResNet, EfficientNet, and CvT, and Attention-Rollout maps for ViT and DINOv2.]
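The Attention-Rollout maps in Table 12 aggregate self-attention across all transformer layers. The sketch below is a minimal illustration of the technique; it assumes a `transformers` ViT-style model called with `output_attentions=True` and is not the authors' exact implementation.

```python
# Attention-Rollout sketch (Abnar & Zuidema, 2020). `attentions` is assumed to
# be the tuple of per-layer attention tensors returned by a ViT-style model
# called with output_attentions=True, each of shape (batch, heads, tokens, tokens).
import torch

def attention_rollout(attentions):
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for layer_attention in attentions:
        attn = layer_attention.mean(dim=1)[0]         # average heads, first image in batch
        attn = attn + torch.eye(num_tokens)           # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn @ rollout                      # propagate attention through layers
    # attention flowing from the [CLS] token to each image patch token
    return rollout[0, 1:]
```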
Table 13. Comparison of performance metrics reported in related work.
| Study | Model Type | Datasets | Best Metrics |
| --- | --- | --- | --- |
| Adak et al. [8] | Ensemble of ViTs | APTOS-2019 | 94.63% (Accuracy) |
| Wu et al. [11] | ViT-Base, ViT-Large | EyePACS, APTOS-2019 | 91.4% (Accuracy) |
| Chetoui et al. [12] | ViT, DenseNet (federated) | APTOS-2019, EyePACS, others | 95% (Accuracy) |
| Sun et al. [16] | Lesion-Aware Transformer (LAT) | EyePACS, Messidor-1/2 | 0.893 (Validation QWK) |
| Kumar et al. [15] | Swin Transformer | APTOS-2019 | 86.4% (Accuracy) |
| Bala et al. [18] | CTNet (CNN + ViT hybrid) | APTOS-2019, IDRiD | 0.972 (QWK), 0.987 (AUC) |
| Goh et al. [29] | VGG19, ResNet50, InceptionV3, DenseNet201, EfficientNetV2S, VAN_small, Cross-ViT_small, ViT_small, SWIN_tiny | Kaggle dataset, SEED, Messidor-1 | 0.973 (AUC) |
| Hidri et al. [31] | Xception, Inception-ResNetV2, DenseNet | Kaggle dataset | 0.92 (AUC) |
| Touati et al. [30] | DRCCT (CNN + ViT hybrid) | APTOS-2019 | 95% (Validation Accuracy) |
| Asia et al. [32] | ResNet-101, ResNet-50, VggNet-16 | XHO (HRF, STARE, DIARETDB0, MESSIDOR) | 98.8% (XHO Accuracy), 100% (STARE Accuracy) |
| Akhtar et al. [33] | RSG-Net (CNN) | Messidor-1 | 99.36% (Accuracy, 4 grades), 99.37% (Accuracy, 2 grades) |
| Xue et al. [34] | VMamba-m (ViT) | APTOS-2019 | 94.3% (Accuracy), 0.951 (AUC) |
| Yang et al. [17] | ViT + MAE pretraining | APTOS-2019, EyePACS, Messidor-2, OIA-DDR | 93.42% (Accuracy), 0.985 (AUC) |
| This study | CNN, ViT, Hybrid | EyePACS + APTOS-2019 (balanced) | 0.841 (QWK), 72.93% (Accuracy), 0.93 (AUC) |