Article

AdaptiveSwin-CNN: Adaptive Swin-CNN Framework with Self-Attention Fusion for Robust Multi-Class Retinal Disease Diagnosis

College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia
Submission received: 18 December 2024 / Revised: 27 January 2025 / Accepted: 3 February 2025 / Published: 6 February 2025
(This article belongs to the Special Issue Multimodal Artificial Intelligence in Healthcare)

Abstract

Retinal diseases account for a large fraction of global blinding disorders, requiring sophisticated diagnostic tools for early management. In this study, the author proposes a hybrid deep learning framework, AdaptiveSwin-CNN, that combines Swin Transformers and Convolutional Neural Networks (CNNs) for the classification of multi-class retinal diseases. In contrast to traditional architectures, AdaptiveSwin-CNN employs a novel Self-Attention Fusion Module (SAFM) to effectively combine multi-scale spatial and contextual features, alleviating class imbalance and focusing attention on subtle retinal lesions. Through adaptive baseline augmentation and dataset-driven preprocessing of input images, the AdaptiveSwin-CNN model addresses the variability of fundus images in the dataset. AdaptiveSwin-CNN achieved a mean accuracy of 98.89%, sensitivity of 95.2%, specificity of 96.7%, and F1-score of 97.2% on the RFMiD and ODIR benchmarks, outperforming other solutions. A lightweight ensemble XGBoost classifier, added to reduce overfitting and increase interpretability, further improved diagnostic accuracy. The results highlight AdaptiveSwin-CNN as a robust and computationally efficient decision-support system.

1. Introduction

Fundus diseases encompass a range of conditions that significantly affect visual function and quality of life. These include diabetic macular edema (DME), diabetic retinopathy, glaucoma, and hypertensive retinopathy [1]. Diabetic retinopathy (DR) is a microvascular complication of diabetes that damages the retina due to prolonged high blood sugar levels, leading to characteristic features in fundus images such as microaneurysms (small red dots), intraretinal hemorrhages (red or dark blotches), cotton-wool spots (whitish patches due to nerve fiber damage), and neovascularization (new, fragile blood vessels that can bleed and scar the retina). Glaucoma, a leading cause of irreversible blindness, is a group of optic neuropathies often linked to elevated intraocular pressure. It is identified in fundus images by structural changes in the optic nerve head, including increased optic disc cupping, thinning of the neuroretinal rim, and visual field loss. Diabetic macular edema, a sight-threatening complication of diabetic retinopathy, involves fluid leakage from damaged blood vessels into the macula, causing retinal thickening and visible hard exudates (yellowish deposits of lipids). Hypertensive retinopathy (HR) results from chronic high blood pressure, causing vascular changes such as arteriolar narrowing (generalized or focal), arterio-venous nicking (where arteries compress veins at crossings), flame-shaped hemorrhages, cotton-wool spots, and optic disc edema in severe cases. These distinct pathological features, as visually presented in Figure 1, visible in Colored Fundus Images (CFIs), play a crucial role in the early diagnosis, classification, and management of these retinal diseases, potentially preventing vision loss [2].
However, the interpretation of retinal CFIs presents challenges due to the intricate structural and morphological characteristics of retinal lesions. Consequently, manual diagnosis is a time-consuming and labor-intensive process. In this context, Computer-Aided Diagnosis (CAD) systems have emerged as valuable tools for the automated analysis and classification of retinal CFIs, offering potential improvements in efficiency and accuracy in ophthalmic diagnostics.
CAD refers to the application of advanced computational technologies in the analysis and interpretation of medical imaging data to support diagnostic decision making [3]. This innovative approach has gained significant traction across various medical specialties, demonstrating particular utility in oncology for the detection and characterization of breast, lung, and colorectal cancers. Deep learning, particularly through the use of Convolutional Neural Networks (CNNs), has transformed computer-aided diagnosis in medical imaging [4]. Since their development in the 1980s, CNNs have excelled in tasks such as image classification, object detection, and semantic segmentation. Early models like LeNet [5] and AlexNet [6] paved the way for more advanced architectures, including VGGNet [7], GoogLeNet [8], ResNet [9], DenseNet [10], MobileNet [11], and EfficientNet [12]. These innovations have significantly enhanced the ability of deep learning systems to analyze complex medical images, improving diagnostic accuracy and efficiency across various healthcare applications. While existing models have demonstrated significant success in classifying fundus diseases using CF images, there remains substantial potential for further advancements and refinements in this critical area of ophthalmic diagnostics.
The Transformer architecture represents a paradigm shift in neural network design, offering distinct advantages over its predecessors, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). While RNNs require sequential processing to accumulate global context and CNNs primarily extract local features, the Transformer can directly capture global information across the entire input sequence. At its core, the Transformer is built upon an attention mechanism that enables parallel computation, significantly enhancing processing speed compared to RNNs. This innovative architecture was introduced by Ashish Vaswani and colleagues in their seminal paper “Attention Is All You Need”, initially demonstrating its efficacy in machine translation tasks. The Transformer’s unique design departs from conventional approaches by eschewing RNN and CNN components in both its encoder and decoder modules. Instead, it relies entirely on attention mechanisms to process and transform input data [13].
The Swin Transformer represents a significant advancement in Transformer-based architectures for computer vision tasks [14], demonstrating superior performance across various applications. Unlike the conventional Vision Transformer (ViT) [15], the Swin Transformer incorporates a hierarchical design that processes visual information at multiple scales. This multi-scale approach is seamlessly integrated into the Transformer framework, enabling the model to capture both fine-grained details and global context efficiently. In addition, the Swin Transformer distinguishes itself from the Vision Transformer (ViT) through its innovative pyramidal architecture, which progressively reduces feature map dimensions while increasing channel depth as the network deepens. This design contrasts with ViT’s uniform columnar structure, allowing for more efficient multi-scale feature extraction. Furthermore, the Swin Transformer incorporates several CNN-inspired techniques, including hierarchical feature extraction akin to Feature Pyramid Networks (FPNs), and a combination of sliding windows, attention masks, and cyclic shifts. These adaptations enhance the model’s ability to capture local image information and extract multi-scale features effectively. By integrating these multi-scale design principles and leveraging proven CNN methodologies, the Swin Transformer demonstrates superior performance compared to traditional ViT models across a range of computer vision tasks, showcasing its versatility and effectiveness in visual data processing.
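For readers unfamiliar with the windowing scheme described above, the following minimal PyTorch sketch illustrates how a feature map can be partitioned into non-overlapping local windows and cyclically shifted between successive attention layers; the tensor layout, window size, and shift value are illustrative assumptions rather than the exact settings used by the Swin Transformer variants discussed later.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows*B, ws, ws, C) local windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def cyclic_shift(x: torch.Tensor, shift: int) -> torch.Tensor:
    """Roll the feature map so that the next attention layer sees shifted windows."""
    return torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# Toy usage: a 56x56 feature map, 7x7 windows, and a shift of 3 between consecutive blocks.
feat = torch.randn(1, 56, 56, 96)
windows = window_partition(cyclic_shift(feat, 3), window_size=7)  # (64, 7, 7, 96)
```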

1.1. Research Contribution

This research introduces AdaptiveSwin-CNN, an innovative framework that significantly advances the domain of automated retinal disease diagnosis. By integrating Swin Transformers for global context and Convolutional Neural Networks (CNNs) for spatial detail, coupled with the novel Self-Attention Fusion Module (SAFM), the proposed system makes the following contributions:
(1) It achieves state-of-the-art performance metrics, with a mean accuracy of 98.89% and an F1-score of 97.2%, outperforming existing approaches across diverse datasets.
(2) It demonstrates superior cross-dataset performance, addressing variability in image quality and patient demographics.
(3) Its optimized architecture ensures reduced training and inference times, making it suitable for real-time applications.
(4) It provides a scalable, interpretable solution that can assist ophthalmologists in early and accurate diagnosis, especially in regions with limited healthcare resources.

1.2. Organization of Paper

The article’s remaining sections are organized as follows: Section 2 discusses a comprehensive literature assessment of the computational techniques that are currently accessible for diagnosing retinal disorders. Section 3 explains the proposed approach taken to accomplish the goal. Section 4 presents the experimental settings and results. Section 5 illustrates a discussion on the comparative analysis of this study with existing approaches, highlighting the potential merits and demerits of the proposed model with possible future directions. The conclusion of the proposed research is provided in Section 6.

2. Literature Review

The field of ophthalmology has experienced notable progress through the introduction of artificial intelligence (AI) and deep learning techniques [16]. In this paper, a comprehensive literature review is presented in the subsequent paragraphs, aiming to elucidate key research endeavors in retinal disease diagnosis, the methodologies applied, and the outcomes obtained.
In the initial stages of integrating artificial intelligence into ophthalmology, reference [17] achieved significant progress by introducing a technique for the localization of the optic disc (OD) utilizing Convolutional Neural Networks (CNNs). This pioneering method involved an ingenious approach to data preparation and a dual-phase training regimen, effectively tackling the problem of class imbalance and achieving an impressive detection accuracy of 99.11%. A distinctive feature of their methodology was the substitution of the conventional blue channel with segmented maps of the vasculature, enriching the contextual data for the CNN architectures. Building on the fundamental configuration of the CNN architecture, the authors in [18] presented a deep CNN (DCNN) model, ResNet50, which obtained a promising average classification accuracy of 96% on the IDRiD dataset. Similarly, in [19], the HDR-EfficientNet model was presented for diagnosing a range of eye conditions, including diabetic retinopathy (DR) and hypertensive retinopathy (HR), and yielded an average accuracy of 98% on multiple publicly accessible datasets. Another study [20] proposed a novel DCNN approach, the HYPER-RETINO framework, for grading hypertensive retinopathy (HR) based on five stages. By utilizing pre-trained HR-related lesions and a DenseNet architecture, the system undergoes preprocessing, lesion detection through segmentation, and classification to recognize various HR stages. Their study verifies the reliability of the HYPER-RETINO method in accurately diagnosing different stages of hypertensive retinopathy, offering a promising automated solution for clinicians in the field of ophthalmology. The research paper in [21] suggested a novel automated system based on a DCNN, Incept-HR, designed for the classification of hypertensive retinopathy (HR) using advanced deep learning techniques. This methodology, Inception-HR (Incept-HR), was devised using InceptionV3 and residual blocks for evaluating hypertensive retinopathy, yielding 99% accuracy. Incept-HR surpasses the performance of existing models like VGG19 and VGG16, showcasing its potential as a valuable diagnostic tool for early HR detection.
According to Reddy et al. [22], an optimized DCNN with a gray-wolf optimizer of different weights can be used to grade the condition after analyzing early-stage detection signs of diabetic macular edema and DR. Similarly, Veena et al. [23] developed a novel segmentation system based on a DCNN to identify glaucoma based on retinal images. Using this method, two distinct and creative CNNs were trained to accurately diagnose glaucoma by segmenting the optic disc and cup. Li et al. [24] confirmed the effectiveness of the Inception-v4-based deep ensemble technique in the classification of diabetic macular edema and DR using fundus samples.
Recent advancements in the classification of colored fundus images for multi-class retinal diseases have seen the development of sophisticated hybrid deep learning approaches combining CNNs and Swin Transformers. These models leverage the strengths of both architectures to enhance feature extraction and classification accuracy. For instance, the MCSFNet model utilizes a multi-scale feature fusion stem with varying convolutional kernel sizes for low-level feature extraction, integrating self-attention mechanisms to capture global spatial relationships [25]. Similarly, a hybrid model combining EfficientNet and Swin Transformer was designed in [26] specifically for diabetic retinopathy detection, with EfficientNet handling local feature extraction and Swin Transformer capturing global lesion features. Other notable implementations include the use of explainable CNNs with attention mechanisms [27,28], incorporating pre-trained CNN models like ResNet101 and DenseNet121 alongside Grad-CAM techniques for improved interpretability. The Swin Transformer V2-based model introduced PolyLoss to enhance performance and used Grad-CAM for decision-making visualization [29], while the HTC-Net framework [30] combined deep CNNs with Transformers to model both local and global information in medical images. The ODFormer model in [31], based on the Swin Transformer architecture, focused on the semantic segmentation of fundus images, particularly for optic nerve head detection. Another approach in [32] aimed at improving fundus disease detection used a multistage backbone network to enhance generalization performance in multilabel classification scenarios. These diverse implementation strategies demonstrate the field’s focus on addressing key challenges such as multilabel classification, data augmentation, and model interpretability. However, the outlined studies on hybrid deep learning approaches for multi-class retinal disease classification using CNNs and Swin Transformers still exhibit several limitations.
  • The performance of these approaches is heavily influenced by preprocessing steps, such as image enhancement or normalization. Inappropriate or dataset-specific preprocessing can introduce biases and degrade model performance.
  • Both CNNs and Swin Transformers primarily extract features in the RGB color space, potentially overlooking critical diagnostic information from alternative color spaces (e.g., CIELAB, HSV) or other domain-specific representations.
  • The integration of CNNs and Swin Transformers often results in resource-intensive architectures, which can hinder their applicability in real-time settings or environments with limited computational capabilities.
  • Despite advancements in attention mechanisms, the inherent complexity of hybrid models poses challenges in interpretability, which may limit their acceptance and usability in clinical practice.

3. Material and Methods

The suggested AdaptiveSwin-CNN framework for the automated detection of multi-class retinal diseases from fundus images is an advanced deep learning technology, as visually shown in Figure 2. Three main steps illustrate the working mechanism of the proposed AdaptiveSwin-CNN model. In the first module, dedicated to dataset collection, ophthalmologists considered more than 100 fundus samples per class to categorize different eye diseases. This process also includes transforming the multilabeled dataset into a multi-class dataset, which in turn converts the multilabel detection problem into a multi-class classification challenge. The final dataset samples were chosen for inclusion based on their membership in a particular dataset class, and each selected category contained more than 100 images. From the 45 ailments, four illnesses—DR, HR, glaucoma, and DME—were chosen in addition to the normal/healthy category.
Data augmentation and preprocessing are carried out in the second module to prevent the proposed AdaptiveSwin-CNN model from overfitting and to control the variations in the samples. Finally, the third module presents the trained AdaptiveSwin-CNN model, which provides robust features for categorizing an image into one of the multi-class retinal disease categories using an XGBoost classifier. Algorithm 1 presents a detailed overview of the proposed AdaptiveSwin-CNN model components and functions for feature extraction.
Algorithm 1. Implementation of the proposed AdaptiveSwin-CNN model for feature map extraction and disease classification
Layer | Purpose | Details
Input layer | Accepts fundus images as input. | Preprocessed images (e.g., resized to 224 × 224 pixels and normalized).
Preprocessing layer | Standardizes image data for processing. | Resizing, normalization (pixel scaling between 0 and 1), and augmentation techniques like flipping or rotation (optional).
Conv1 (Convolutional) | Extracts low-level features like edges and textures. | 32 filters, 3 × 3 kernel, ReLU activation, and a stride of 1.
Pool1 (Pooling) | Downsamples spatial dimensions to reduce computation. | Max pooling with 2 × 2 filter and a stride of 2.
Conv2 (Convolutional) | Detects more complex patterns like contours and textures. | 64 filters, 3 × 3 kernel, ReLU activation.
Pool2 (Pooling) | Further reduces spatial dimensions for feature abstraction. | Max pooling with 2 × 2 filter and a stride of 2.
Conv3 (Convolutional) | Identifies deeper features specific to retinal abnormalities. | 128 filters, 3 × 3 kernel, ReLU activation.
Pool3 (Pooling) | Compresses spatial dimensions further while retaining essential features. | Max pooling with 2 × 2 filter and a stride of 2.
Dense1 (Fully Connected) | Combines extracted CNN features into a high-dimensional feature vector. | 256 neurons, ReLU activation.
Patch Embedding Layer | Converts input image patches into embeddings for the Swin Transformer. | Splits input features into fixed-size patches (e.g., 4 × 4 pixels), flattens them, and applies a linear projection to form embeddings.
Swin Transformer Block 1 | Performs global feature refinement using hierarchical self-attention. | Applies multi-head self-attention to model long-range dependencies.
Patch Merging Layer 1 | Reduces spatial resolution while increasing feature dimensionality. | Combines adjacent patches (e.g., 2 × 2) and outputs higher-level features with reduced spatial size but enriched representation.
Swin Transformer Block 2 | Further refines features by capturing global contextual relationships. | Applies another multi-head self-attention mechanism to higher-level features.
Patch Merging Layer 2 | Further reduces resolution and enhances high-level feature representations. | Aggregates adjacent patches (e.g., 2 × 2) and outputs a compact feature map.
Global Average Pooling Layer | Reduces the spatial dimensions of the feature map to a single vector per channel. | Averages each feature map spatially, producing a global feature vector while retaining the most significant information for each channel.
Feature Concatenation | Combines CNN and Swin Transformer outputs into a unified feature vector. | Merges CNN-extracted local features with Transformer-refined global features for a comprehensive representation.
XGBoost Classifier | Maps the concatenated feature vector to the final classification output. | A gradient-boosting decision tree algorithm predicts probabilities for each class, enabling accurate multi-class disease classification.
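To make the layer plan in Algorithm 1 concrete, the following hedged Python sketch outlines one possible realization of the hybrid feature extractor and classifier: a small convolutional branch following the 32/64/128-filter plan, a Swin backbone from the timm library used as a stand-in for the Swin Transformer branch, feature concatenation, and an XGBoost classifier. The layer sizes, the timm model name, and the XGBoost settings are assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import timm                      # provides a Swin backbone stand-in for the Swin branch
from xgboost import XGBClassifier

class CNNBranch(nn.Module):
    """Local-feature branch following the layer plan in Algorithm 1 (32/64/128 filters)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(128 * 28 * 28, 256)   # 224 -> 28 after three 2x poolings

    def forward(self, x):
        return torch.relu(self.fc(self.features(x).flatten(1)))

cnn = CNNBranch().eval()
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=0).eval()

@torch.no_grad()
def extract_features(images):                     # images: (B, 3, 224, 224)
    # Concatenate local CNN features with globally pooled Swin features.
    return torch.cat([cnn(images), swin(images)], dim=1).numpy()

# The fused feature vectors are then passed to a gradient-boosted classifier (5 classes).
clf = XGBClassifier(objective="multi:softprob", n_estimators=30)
# clf.fit(extract_features(train_images), train_labels)
# preds = clf.predict(extract_features(test_images))
```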

3.1. Data Acquisition

The proposed AdaptiveSwin-CNN approach used two datasets, RFMiD [33] and ODIR [34], for experimental purposes. The RFMiD benchmark consists of 3200 colored retinal samples collected during various eye exams performed since 2009. The 1920 images, or 60% of the dataset, were made publicly available and utilized in this study, as illustrated in Table 1. While capturing the images, every image was focused on a target, either the macula or the optic disc (OD). With the assistance of clinical professionals, the retinal image collection was labeled into normal (healthy) and various pathological categories. There were 45 distinct forms of retinal abnormalities included in the abnormal group. In the publicly accessible portion of the data, the average disease-wise stratification was 60.7%. Several diseases may be present in the abnormal retinal samples. Each of the 1920 images was classified as either normal or abnormal; if determined to be abnormal, it was then classified according to the set of retinal image disorders. The image labels were delivered in a .csv file. In addition, the proposed study also used the ODIR benchmark, which offers eight ocular disease classifications; however, the author only considered DR, glaucoma, HR, and DME. As shown in Table 1, all images were resized to 224 × 224 pixels.
In the proposed AdaptiveSwin-CNN model, the challenge of detecting multiple eye illnesses is transformed into a multi-class eye disease classification problem. For effective AdaptiveSwin-CNN network training, only those fundus image categories with more than 100 unique samples per group were retained. Thus, the dataset was finalized with a selection of 5 classes: 1 class categorized as normal (healthy) and 4 classes representing different diseases, namely diabetic retinopathy (DR), hypertensive retinopathy (HR), glaucoma, and diabetic macular edema (DME). The retrieved images belonging to the chosen classes were given the appropriate labels.
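A sketch of this label-curation step is given below. The column names ("DR", "HR", "GLAUCOMA", "DME") and the file name are hypothetical; the actual RFMiD/ODIR label files may use different headers, so the snippet is illustrative only.

```python
import pandas as pd

# Hypothetical label file and column names for illustration.
labels = pd.read_csv("rfmid_labels.csv")
selected = ["DR", "HR", "GLAUCOMA", "DME"]

def to_single_class(row):
    """Keep images carrying exactly one of the selected disease labels (or none -> NORMAL)."""
    active = [c for c in selected if row.get(c, 0) == 1]
    if len(active) == 1:
        return active[0]
    if len(active) == 0:
        return "NORMAL"
    return None                          # images with several diseases are excluded

labels["target"] = labels.apply(to_single_class, axis=1)
dataset = labels.dropna(subset=["target"])

# Retain only classes with more than 100 samples, as required for training.
counts = dataset["target"].value_counts()
dataset = dataset[dataset["target"].isin(counts[counts > 100].index)]
```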

3.2. Dataset Augmentation and Preprocessing

Following the establishment of the selection criteria, a total of 3310 images were selected as follows: 590 for DR, 450 for HR, 400 for glaucoma, 470 for DME, and 1400 for the normal category denoting healthy fundus images. All fundus images were scaled down to 224 × 224 pixels.
To address the issue of misclassification caused by class imbalance—where the normal class in the finalized dataset contains significantly more samples than other classes—the dataset was amplified using data augmentation techniques. These augmentation methods were carefully selected to simulate real-life scenarios and enhance the diversity of the training data, ensuring improved model performance and generalization. The chosen techniques include widely used geometric transformations such as rotation, translation, cropping, flipping, contrast enhancement, and affine transformations. These methods are fundamental image preprocessing approaches that preserve the original object type while diversifying the dataset. As one of the earliest and most commonly employed strategies, they play a crucial role in mitigating dataset imbalance. A visual representation of the modifications made to the original sample images after augmentation is provided in Figure 3, highlighting the effectiveness of these transformations in enriching the dataset.
To address the challenges of data imbalance and overfitting, data augmentation was employed to equalize the number of images across all classes, ensuring a more balanced and representative dataset. The augmented dataset was then divided into training and testing sets in a 70:30 ratio, providing a consistent configuration for accurate performance analysis and enabling meaningful comparisons with other deep neural networks. During the preprocessing phase, each image underwent specific transformations to optimize its suitability for training the proposed AdaptiveSwin-CNN model. These preprocessing steps were critical in enhancing the model’s learning capacity and ensuring robust feature extraction. Table 2 presents a detailed breakdown of the total number of retinal samples available for each disease class after the data augmentation process, demonstrating the effectiveness of this approach in addressing class imbalances.
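The following sketch shows one way the described augmentation and 70:30 split could be implemented with torchvision and scikit-learn; the transformation parameters and the image_paths/class_labels variables are illustrative assumptions rather than the exact pipeline used here.

```python
from torchvision import transforms
from sklearn.model_selection import train_test_split

# Geometric and photometric augmentations mirroring those listed above;
# the exact parameter ranges are illustrative assumptions.
augment = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(contrast=0.2),
    transforms.ToTensor(),
])

# image_paths / class_labels: hypothetical lists built from the curated dataset in Section 3.1.
# 70:30 stratified split so every class is represented in both partitions.
train_paths, test_paths, y_train, y_test = train_test_split(
    image_paths, class_labels, test_size=0.30, stratify=class_labels, random_state=42
)
```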

3.3. Proposed AdaptiveSwin-CNN Architecture

The proposed AdaptiveSwin-CNN classification model for multi-class retinal diseases begins with the input layer (as shown in Figure 2), where preprocessed fundus images are resized to a standard dimension (e.g., 224 × 224 pixels) to ensure uniformity. Next, the patch embedding layer divides the input image into non-overlapping patches and applies a linear embedding. Layers 3–6 represent four Swin Transformer blocks that process the patch embeddings using shifted window-based self-attention and feed-forward networks, with patch merging layers between blocks to reduce spatial dimensions and increase channel dimensions. Afterward, the global average pooling layer reduces the spatial dimensions to 1 × 1. Finally, layer 8, the feature extraction layer, extracts the final features from the Swin Transformer output. The extracted features are then fed into an XGBoost classifier for the final multi-class classification of retinal diseases.

3.3.1. Problem Definition

In general, the task of retinal feature learning falls within the category of image detection and classification. As a result, the requirements for image classification and for the retinal dataset are essentially the same. Given an image $I \in \mathbb{R}^{W \times H \times C}$, its label image is $L \in \mathbb{R}^{W \times H}$, a binary image, and each pixel of the image falls into one of the defined categories. $W$ stands for the image's width, $H$ for its height, and $C$ for its channel count. There are $m$ total categories in this work. Each pixel in the image is classified using the trained model, which is statistically assessed using an assessment metric.

3.3.2. General Organization

In this research, we present a novel Encoder Module (EM)-, Decoder Module (DM)-, and Fusion Module (FM)-based AdaptiveSwin-CNN to identify abnormal features in retinal samples, as shown in Figure 4.
Encoder Module (EM): The suggested EM in this study consists of N stacked convolution blocks and one Swin-Trans-Encoder Block. The convolution layers provide features at different image scales. By utilizing the convolution and ST encoder blocks to extract deep image features and feature relations, features with deep semantic relations can be created. This allows the model to represent the local and global long-range relationships between various locations in the retinal image.
Decoder Module (DM): This article proposes a symmetrical design for the (AdaptiveSwin-CNN) EM and DM, where the DM's Swin-Trans-Decoder Block and Conv-Block both match the EM's parameter settings. Notably, the EM's feature dimension is reduced with each layer, whereas the DM's size grows with each layer of the network. The Fusion Module (FM) receives input from both the Decoder Module (DM) and each layer's output of the Encoder Module (EM).
Fusion Module (FM): In order to obtain a representation that combines deep features with low-level ones when the feature-map sizes vary, the proposed research builds an FM that combines the decoder and encoder outputs. The suggested study uses fusion features from many layers to create a comprehensive representation, which helps reduce the impact of noise on retinal disease diagnosis. This approach allows feature maps from different scales to be fused.
This subsequent section will provide a detailed introduction of the proposed AdaptiveSwin-CNN model.

3.3.3. Encoder Module (EM)

This study introduces a technique that integrates convolution with Transformers to describe the long-range interdependence related to various components in the retinal image from both a global and a local viewpoint. This method uses a Conv-Block and Swin-Trans-Encoder Block to investigate the relationships between the various parts of the retinography.
As stated in Section 3.3.1, the image-label pair $\{I, L\}$, with $I \in \mathbb{R}^{W \times H \times C}$ and $L \in \mathbb{R}^{W \times H}$, constitutes the input for the proposed model. The EM comprises the Conv-Block and the Swin-Trans-Encoder Block, and the output feature is $f_{en}$.
Conv-Block: To minimize the effect of noise and acquire local features from a local perspective, the proposed research builds a Conv-Block to produce multi-scale feature maps. N stacked Conv-Blocks are used in the Encoder Module (EM) to extract the features of the retinal image I and create feature maps at various scales. Figure 5a depicts the general organization of the Conv-Block. Each convolution operation is followed by the application of a ReLU activation function, forming a Conv-ReLU Block. The convolution kernel size is 3 × 3. After the M convolutions in a Conv-Block, the feature is passed through max pooling. $I_{en}^{conv_i}$, $i \in [1, N]$, is the $i$-th output feature of the Conv-Block, according to Equation (1):

$$I_{en}^{conv_{i+1}} = \arg\max\Big(\mathrm{maxpool}\big(\mathrm{ReLU}(\mathrm{Conv}(I_{en}^{conv_i}))\big)\Big)_{M} \tag{1}$$

where $I_{en}^{conv_i} \in \mathbb{R}^{W_i \times H_i \times C_i}$ and $I_{en}^{conv_1} = I$.
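A minimal PyTorch sketch of such a Conv-Block, assuming M = 2 Conv-ReLU operations per block and illustrative channel sizes, is given below.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv-Block of the Encoder Module: M stacked Conv-ReLU operations followed by
    2x2 max pooling, as in Equation (1). Channel sizes are illustrative."""
    def __init__(self, in_ch: int, out_ch: int, M: int = 2):
        super().__init__()
        layers = []
        for i in range(M):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))   # halves W and H, yielding the next-scale feature map
```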
Swin-Trans-Encoder Block: In this study, the author analyzes the long-range dependency connections of different abnormal feature areas from a global viewpoint by partitioning the output feature map of the preceding Conv-Block and creating a Swin-Trans-Encoder Block. Figure 5c depicts the structure, with the ST block [35] shown in Figure 5e.
Two ST blocks (Swin Transformer blocks) plus a patch embedding layer make up the Swin-Trans-Encoder Block. The proposed research divides the feature map into 4 × 4 patches using the patch embedding operation, in accordance with the Swin Transformer [35], and embeds the features, obtaining $I_{em} \in \mathbb{R}^{\frac{W_N}{4} \times \frac{H_N}{4} \times C_N}$. The proposed study encodes these patch features using only two ST blocks. Equation (2) defines the ST block computation.
$$\begin{aligned} \hat{o}^{l} &= \mathrm{W\text{-}MSA}\big(\mathrm{LN}(o^{l-1})\big) + o^{l-1} \\ o^{l} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{o}^{l})\big) + \hat{o}^{l} \\ \hat{o}^{l+1} &= \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(o^{l})\big) + o^{l} \\ o^{l+1} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{o}^{l+1})\big) + \hat{o}^{l+1} \end{aligned} \tag{2}$$

where $\hat{o}^{l}$ is the output of the (S)W-MSA module and $o^{l}$ is the output of the MLP module.
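The computation in Equation (2) can be sketched as follows; standard multi-head attention is used here as a stand-in for the (shifted) window attention, which additionally restricts attention to local windows, so the snippet is a simplified illustration rather than the exact block.

```python
import torch.nn as nn

class STBlock(nn.Module):
    """One Swin Transformer block per Equation (2); nn.MultiheadAttention stands in
    for the (shifted) window attention of the original design."""
    def __init__(self, dim: int, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, o):                      # o: (B, num_patches, dim)
        h = self.norm1(o)
        o = o + self.attn(h, h, h)[0]          # o_hat = (S)W-MSA(LN(o)) + o
        return o + self.mlp(self.norm2(o))     # o     = MLP(LN(o_hat)) + o_hat
```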
In summary, the Swin-Trans-Encoder Block is described by Equation (3).
$$f_{en}^{swin\text{-}trans} = \mathrm{STBlock}_{4\times}\big(\mathrm{PosEmbed}(I_{en}^{conv_N})\big) \tag{3}$$

where $I_{en}^{conv_N}$ is the output feature of the $N$-th Conv-Block.

3.3.4. Decoder Module (DM)

The Decoder Module (DM) is constructed symmetrically, consisting of a Swin-Trans-Decoder Block and a Conv-Block. The layer configuration of the Conv-Block is identical to that of the Encoder Module (EM), allowing the features from the EM to be decoded. The decoded feature $f_{de}$ is then obtained.
Conv-Block: To obtain additional details, refer to the Conv-Block description in the Encoder Module (EM). It is important to mention that the pooling procedure is performed during encoding, whereas the up-sampling operation is carried out during decoding.
Swin-Trans-Decoder Block: Two ST blocks, together with a Patch Expanding layer, make up the Swin-Trans-Decoder Block, with the ST block being computed using Equation (2). It is important to note that the Swin-Trans-Encoder Block and Swin-Trans-Decoder Block have identical ST block dimensions. The proposed research introduces Patch Expanding, which gives the Swin-Trans-Decoder its up-sampling capability. This operation makes use of a normalization layer and a linear layer. The feature dimensions after the Patch Expanding operation are $W'$, $H'$, and $C'$, as given in Equation (4):

$$f_{de}^{st} = \mathrm{PatchExpand}\big(\mathrm{STBlock}_{4\times}(f_{en}^{st})\big) \tag{4}$$
Then, the output feature $I_{de}^{conv_i}$, $i \in [0, N]$, of the Swin-Trans-Decoder Block is sent to the stacked Conv-Block layers, and the $i$-th layer is calculated as in Equation (5):

$$I_{de}^{conv_i} = \mathrm{UpSampling}\Big(\mathrm{ReLU}\big(\mathrm{Conv}(I_{de}^{conv_{i-1}})\big)_{M\times}\Big) \tag{5}$$

where $I_{de}^{conv_0} = f_{de}^{st}$.
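A sketch of the decoder-side Conv-Block implied by Equation (5), with bilinear up-sampling standing in for the up-sampling operation and illustrative channel counts, is shown below.

```python
import torch.nn as nn

class DecoderConvBlock(nn.Module):
    """Decoder counterpart of the Conv-Block: M Conv-ReLU operations followed by
    2x up-sampling instead of pooling, matching Equation (5)."""
    def __init__(self, in_ch: int, out_ch: int, M: int = 2):
        super().__init__()
        layers = []
        for i in range(M):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        return self.up(self.conv(x))    # doubles W and H at each decoder layer
```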

3.3.5. Fusion Module (FM)

In this study, a Fusion Module (FM) is designed to more effectively fuse the encoding and decoding features of various scales; it is depicted in Figure 5b. To generate same-scale feature maps, the encoding and decoding features of each scale from the various layers are first concatenated and combined via a 1 × 1 convolution. The final feature shown in Figure 4 is created by concatenating the convolutional fusion features at the various scales.
Given the encoding feature $f_{en}^{i}$ and decoding feature $f_{de}^{i}$ of the $i$-th layer, the fused image feature $I_{fusion}^{i}$ is calculated as in Equation (6):

$$I_{fusion}^{i} = \mathrm{Deconv}\Big(\mathrm{ReLU}\big(\mathrm{Conv}(\mathrm{Concat}(f_{en}^{i}, f_{de}^{i}))\big)\Big) \tag{6}$$

where $i \in [1, N+1]$. Note that when $i \in [1, N]$, $f_{en}^{i} = I_{en}^{conv_i}$ and $f_{de}^{i} = I_{de}^{conv_i}$, and when $i = N+1$, $f_{en}^{i} = f_{en}^{st}$ and $f_{de}^{i} = f_{de}^{st}$, as shown in Equation (7):

$$f_{en}^{i} = \begin{cases} I_{en}^{conv_i}, & i \in [1, N] \\ f_{en}^{st}, & i = N+1 \end{cases} \qquad f_{de}^{i} = \begin{cases} I_{de}^{conv_i}, & i \in [1, N] \\ f_{de}^{st}, & i = N+1 \end{cases} \tag{7}$$

Using Figure 4 as a reference, the predicted feature is finally computed using Equation (8):

$$I_{fusion} = \mathrm{Concat}\big(I_{fusion}^{i}\big), \quad i \in [1, N+1] \tag{8}$$
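The per-scale fusion of Equation (6) and the final concatenation of Equation (8) can be sketched as follows; the channel counts and the choice of a transposed convolution for the Deconv operation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuses encoder/decoder features of one scale (Eq. 6): channel-wise concatenation,
    1x1 convolution, ReLU, then a transposed convolution that brings every scale back
    to a common spatial size before the final concatenation (Eq. 8)."""
    def __init__(self, enc_ch: int, dec_ch: int, out_ch: int, up_factor: int):
        super().__init__()
        self.reduce = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=1)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=up_factor, stride=up_factor)

    def forward(self, f_en, f_de):
        fused = torch.relu(self.reduce(torch.cat([f_en, f_de], dim=1)))
        return self.deconv(fused)

# Per Equation (8), the per-scale fusion outputs are concatenated into the final feature:
# I_fusion = torch.cat([fusion_blocks[i](f_en[i], f_de[i]) for i in range(N + 1)], dim=1)
```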

3.3.6. Loss Function

To calculate the loss, the proposed study applies binary cross-entropy to the predicted feature $I_{fusion}$. In Equation (9), the input image's pixel count is $M = W \times H \times C$, and the loss is computed from the $j$-th pixel value $f_j$ on the feature map and its label $L_j$:

$$\mathrm{loss}(f_j; W) = \begin{cases} -\log\big(1 - \mathrm{Sigmoid}(f_j; W)\big), & \text{if } L_j = 0 \\ -\log\big(\mathrm{Sigmoid}(f_j; W)\big), & \text{if } L_j = 1 \end{cases} \tag{9}$$

Equation (10) is then used to compute the overall loss:

$$\mathrm{Loss} = \sum_{j=1}^{M} \left( \sum_{i=1}^{N} \mathrm{loss}(f_j^{i}; W) + \mathrm{loss}(f_j^{fusion}; W) \right) \tag{10}$$
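A compact PyTorch rendering of Equations (9) and (10), assuming the per-scale and fused predictions are provided as logit maps, might look as follows.

```python
import torch.nn.functional as F

def fusion_bce_loss(side_outputs, fused_output, label):
    """Binary cross-entropy of Equations (9)-(10): the per-pixel BCE is summed over
    the N intermediate (per-scale) predictions plus the fused prediction.
    side_outputs: list of logit maps; fused_output, label: tensors of the same shape."""
    loss = F.binary_cross_entropy_with_logits(fused_output, label, reduction="sum")
    for f in side_outputs:                          # one logit map per scale
        loss = loss + F.binary_cross_entropy_with_logits(f, label, reduction="sum")
    return loss
```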

3.4. XGBoost Classifier

The XGBoost algorithm [22], which is grounded in statistical learning theory, has obtained remarkable outcomes in computer vision problems. A training set and a test set were created from the dataset before running the XGBoost learning operation. The initial classification challenge is split into several sub-classification problems by XGBoost; Algorithm 1 notes these steps. As the number of sample categories rises, the training procedure for multi-classification becomes increasingly complex. Recent studies have highlighted the need for new approaches to reduce the computation and computational complexity of XGBoost [22]. To mitigate this problem and reduce the dimensionality of the sample data, this study applies the Relief approach. Gradient tree boosting is one of the most popular and successful machine learning (ML) methods, demonstrating exceptional performance across diverse applications.
An analysis of the extreme gradient boosting approach is presented in Algorithm 2, based on Equation (11):

$$f_t(x_i) = \arg\min \mathrm{Loss}^{(t)} = \arg\min \mathrm{Loss}\big(L, \hat{L} + f(x)\big) \tag{11}$$
Algorithm 2. XGBoost algorithm
Input: Feature data $x = [x_1, x_2, x_3, \ldots, x_n]$ with labels $L$ and test samples $x_{test}$
Output: Classes as normal, DR, HR, glaucoma, and DME
Step 1: Initialize the tree as a constant, $\hat{Y}_i^{t} = f_0 = 0$, and determine the normalization and classifier parameters.
Step 2: Apply Equation (11), in which the XGBoost classifier is constructed by minimizing the loss function. Classifier training is carried out using the input feature data.
Step 3: Repeat Step 2 until the model meets the stopping condition.
Step 4: Assign the class label to the test samples $\hat{Y}_i^{t}$ by utilizing the decision function of Equation (11).
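A hedged sketch of the classifier stage described in Algorithm 2, using the xgboost scikit-learn interface with illustrative hyperparameters (the 30-iteration budget follows Section 4.9), is given below; the feature matrices are assumed to come from the AdaptiveSwin-CNN extractor.

```python
from xgboost import XGBClassifier

CLASSES = ["normal", "DR", "HR", "glaucoma", "DME"]

# X_train, X_test: hypothetical (n_samples, n_features) fused feature matrices;
# y_train: integer labels 0..4 corresponding to CLASSES.
clf = XGBClassifier(
    n_estimators=30,              # fixed iteration budget noted in Section 4.9
    learning_rate=0.1,
    max_depth=6,
    objective="multi:softprob",   # multi-class probabilistic output
    eval_metric="mlogloss",
)
# clf.fit(X_train, y_train)
# predicted = [CLASSES[i] for i in clf.predict(X_test)]
```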

4. Experiments and Results

A series of experiments were conducted on RFMiD and ODIR datasets, and the outcomes were compared with the most advanced existing deep learning approaches using common statistical metrics. This part will provide a comprehensive presentation of the experimental settings and analysis.

4.1. Experimental Setup

This study utilized CFIs to identify multi-class eye diseases with the proposed AdaptiveSwin-CNN model. The training and experimentation setup was executed on Windows 10 with an Intel Core i9 CPU, an NVIDIA GPU, and high-capacity RAM. All source code for the proposed AdaptiveSwin-CNN model development, training, and testing was written in Python with various deep learning libraries, namely TensorFlow, PyTorch, and Keras.
The proposed AdaptiveSwin-CNN model consists of several convolutional layers with various filter sizes, batch normalization, max-pooling, dropout, and fully connected layers. The final design of the proposed AdaptiveSwin-CNN model is inspired by the LeNet architecture [5]. The proposed AdaptiveSwin-CNN model was fine-tuned using a variety of hyperparameters: a batch size of 32, a learning rate of 1 × 10⁻³, the ReLU activation function, “same” padding, a dropout rate of 50%, 200 epochs, and the Adam and SGD optimization functions, respectively. The application of these hyperparameters assists in obtaining a higher accuracy level for both model training and testing.
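For illustration, the listed hyperparameters translate into a Keras training configuration roughly like the following; the model object here is a small placeholder rather than the full AdaptiveSwin-CNN, and the fit call is indicative only.

```python
import tensorflow as tf

# Placeholder model: a tiny CNN standing in for the AdaptiveSwin-CNN backbone.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                 # 50% dropout as listed above
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # SGD was also evaluated
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_images, train_labels, validation_split=0.3, epochs=200, batch_size=32)
```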
To build and train the proposed CNN architecture, various kernel dimensions were utilized to generate feature maps at each convolutional stage. Common kernel sizes, such as 3 × 3 and 5 × 5, were employed, and the corresponding weight parameters in the convolutional layers were adjusted accordingly. These convolutional layers applied operations using diverse window sizes, guided by excitation functions specific to each feature map, to optimize feature extraction. Similarly, a pooling layer was developed using a comparable process, with one key distinction: it used a stride of two and a window size of 2 × 2 to maximize the retention of essential features from the preceding layer.
This pooling mechanism significantly reduced the convolutional weights, enhancing the network’s computational efficiency while preserving critical information. The output from the average pooling stage was then passed into the XGBoost classifier, which served as a fully connected layer for the multi-class classification of retinal fundus images. The XGBoost classifier leverages these processed features to accurately distinguish between multiple retinal diseases. For a detailed overview, Table 3 provides a breakdown of the parameter counts in the convolutional layers of the proposed AdaptiveSwin-CNN model, highlighting the efficiency and scalability of the architecture.

4.2. Performance Metrics

The predictions provided by a classifier with n classes are stored in a confusion matrix, which is a table of dimensions n × n. It provides information on the classifier's performance in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) by comparing the predicted and actual classifications.
A classifier’s performance is assessed by computing the accuracy (Acc). Accuracy is defined as the percentage of samples that were properly identified relative to the total samples. The computation can be performed using the formula provided in Equation (12).
$$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{12}$$
Precision (P) is a quantitative measure that assesses the ratio of correctly predicted positive samples to the total number of projected positive samples, as calculated in Equation (13).
$$\mathrm{Precision\ (P)} = \frac{TP}{TP + FP} \times 100\% \tag{13}$$
R, short for “recall”, sensitivity, or the true positive rate, measures how many samples the classifier correctly identifies as belonging to the positive class. A high recall value suggests that a substantial portion of high-confidence samples are identified by the classifier.
$$\mathrm{Recall\ (R)} = \frac{TP}{TP + FN} \times 100\% \tag{14}$$
The F1-score (F1) is a statistical measure used to validate the classifier potential. A high F1-score signifies the effectiveness of the categorization approach in achieving a harmonious balance between precision and recall, as computed in Equation (15).
$$F1 = 2 \times \frac{P \times R}{P + R} \times 100\% \tag{15}$$
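The four metrics in Equations (12)–(15) can be computed one-vs-rest from the n × n confusion matrix, as sketched below (scikit-learn is assumed to be available).

```python
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, n_classes=5):
    """Accuracy, precision, recall, and F1 per Equations (12)-(15), computed
    one-vs-rest from the n x n confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=range(n_classes))
    metrics = {}
    for k in range(n_classes):
        TP = cm[k, k]
        FP = cm[:, k].sum() - TP
        FN = cm[k, :].sum() - TP
        TN = cm.sum() - TP - FP - FN
        P = TP / (TP + FP) if TP + FP else 0.0
        R = TP / (TP + FN) if TP + FN else 0.0
        metrics[k] = {
            "accuracy": (TP + TN) / cm.sum(),
            "precision": P,
            "recall": R,
            "f1": 2 * P * R / (P + R) if P + R else 0.0,
        }
    return metrics
```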

4.3. Experiment 1

In experiment 1, the author used the two datasets, RFMiD and ODIR, which include five classes—normal, DR, HR, glaucoma, and DME—sourced from qualified retinal specialists, to assess the efficacy of the suggested AdaptiveSwin-CNN model. Using this extensive dataset, the proposed research first examined the model's representation performance in terms of training accuracy, validation accuracy, and loss. Graphical representations of the proposed AdaptiveSwin-CNN model training and classification results are shown in Figure 6, Figure 7, Figure 8 and Figure 9, respectively. The results emphasize the exceptional performance of the suggested model in both training and validation scenarios.

4.4. Experiment 2

In experiment 2, the author presents the performance comparison between the proposed AdaptiveSwin-CNN system with some standard DL models. Interestingly, all of these DL models were trained using an identical number of epochs. Two deep neural networks with the same parameters were trained, and the one with the highest validation accuracy was selected. In terms of sensitivity, specificity, and accuracy, Table 4 compares the performance of the suggested model with that of the VGG16 [7], VGG19 [7], InceptionV3 [36], ResNet [9], Xception [37], and MobileNet [11] models based on averages for the five classes. The suggested system performs better than alternative deep learning models, as evidenced by the results. Figure 10 visually illustrates the classification accuracy comparison between various deep learning models and AdaptiveSwin-CNN.

4.5. Experiment 3

A statistical evaluation of the suggested system's performance for each retinal class was conducted using the error rate (E), recall, specificity, F1-score, and accuracy criteria. The results obtained by the suggested approach in distinguishing between the five groups of related retinograph images are shown in Table 5. Based on these statistical parameters, the suggested model produced better results, with a smaller training error of 0.51.

4.6. Experiment 4

The gradient class activation map, or Grad-CAM for short, is a tool for graphically representing the regions of an image that a DNN uses to make a certain prediction. This visualization provides insights into the specific regions of the input image that have the most impact on the decision-making process of the network. By emphasizing these specific areas, Grad-CAM provides a more comprehensive understanding of the characteristics that the network is identifying. Several consecutive stages are required to incorporate Grad-CAM into the suggested system. The proposed process starts with importing the pretrained model and the normalized samples. A dense classification layer should follow the model's global average pooling layer. The characteristics that are categorized utilizing the suggested AdaptiveSwin-CNN technique are shown in Figure 11 using Grad-CAM.
The process started with obtaining the predicted category index. Following this, the proposed study computed the gradients of the output score of the predicted class with respect to the output of the final convolutional layer. These gradients indicate the relative importance of the convolutional layer output components for the given prediction: a larger gradient indicates that a larger portion of the image is responsible for predicting the given category. Once the gradients were obtained, they were averaged over all of the output channels of the convolutional layer to obtain the heatmap. The heatmap was then resized so that it would fit the input image, and a “jet” color map was applied to enhance it. Ultimately, an additive mixing approach was used to superimpose the heatmap onto the original image. The final image shows the specific regions the model focused on during the prediction phase.
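A minimal Grad-CAM sketch following these steps, written for a Keras model and with the convolutional layer name passed in as an assumption, is given below; resizing and the "jet" overlay are left to standard image utilities.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index=None):
    """Minimal Grad-CAM: gradients of the predicted class score w.r.t. the last
    convolutional feature map are channel-averaged and used to weight that map.
    image: (H, W, C) array/tensor; last_conv_name: name of the final conv layer."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))     # predicted category index
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)             # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))       # average gradients per channel
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)            # normalise to [0, 1]
    return cam.numpy()                                 # resize and overlay with a 'jet' map
```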

4.7. Experiment 5

While other cutting-edge methods use deep learning (DLM) algorithms to identify retinal ailments, only a small number of studies have used DLM methods to categorize multi-class retinal disorders from the fundus modality. The efficiency of the proposed DL system for diagnosing multi-class retinal disorders was compared with four selected DL studies [26,29,38,39]. The author of this study successfully implemented the systems in [26,29,38,39] using the same image dimensions and DL network settings, given their simple configurations. Table 6 provides a detailed performance comparison between the proposed AdaptiveSwin-CNN model and the existing architectures using statistical parameters. Table 6 reveals that the proposed AdaptiveSwin-CNN network achieves superior performance (a sensitivity of 95.2%, specificity of 96.7%, accuracy of 98.89%, and F1-score of 97.28%) in classifying multi-class eye disorders such as normal, DR, HR, glaucoma, and DME. In contrast, the existing systems have also shown comparable classification accuracy for the five retinal classes. It is worth noting that these models [26,29,38,39] were previously designed for diagnosing one or three eye classes and obtained reasonable results but required long computational times.

4.8. Ablation Study

In this part, the author conducts an ablation study that evaluates the individual and combined contributions of the proposed model's components (an eight-layer CNN, Swin Transformer encoder–decoder blocks, and the XGBoost classifier) toward multi-class retinal disease classification. The objective is to assess how each component influences performance by systematically modifying the architecture and analyzing metrics such as accuracy, precision, recall, and F1-score. The study uses k-fold cross-validation on preprocessed retinal fundus image datasets to ensure robust and unbiased evaluation. Variants include a baseline CNN model, a CNN with XGBoost replacing its classifier, a CNN combined with a Swin Transformer (without XGBoost), a Swin Transformer alone (removing the CNN), and the full model integrating all components. Each setup is analyzed to understand the contribution of local feature extraction (CNN), global context modeling (Swin Transformer), and advanced classification (XGBoost), particularly for imbalanced datasets.
The findings in Table 7 reveal that the CNN serves as a strong baseline for extracting local retinal features, while the Swin Transformer significantly enhances global context understanding, refining disease-specific features. The inclusion of XGBoost improves classification performance, particularly for minority classes, due to its ability to handle structured and imbalanced data. The full model demonstrates the highest performance, showcasing the synergistic effect of combining local–global feature extraction with robust classification.
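The evaluation protocol of this ablation can be sketched as follows, where build_pipeline is a hypothetical callable that constructs one of the variants with a scikit-learn-style fit/predict interface; the fold count and metric choice are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def evaluate_variant(build_pipeline, X, y, k=5):
    """Run the same stratified k-fold protocol for one architectural variant and
    return its mean macro F1; X, y are assumed to be NumPy feature/label arrays."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        pipe = build_pipeline()                      # fresh model per fold
        pipe.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], pipe.predict(X[test_idx]), average="macro"))
    return float(np.mean(scores))
```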

4.9. Computational Cost

The data augmentation phase, which involved adjusting the luminance, removing noise, and increasing the dataset size, took 60 s. In comparison, the feature learning and extraction process of the suggested system required an average of 55 s, whereas the creation of the XGBoost classifier only took 30 s. With a set number of 30 iterations, the XGBoost training for the multi-class categorization of retinopathy into five classes took 35.20 s. On average, it takes just 38.12 s to identify the image when training and testing are completed. The suggested model took 2.1 s longer to calculate compared to previous deep learning models that use the Convolutional Neural Network (CNN) architecture. The reason for this is the utilization of a unique approach involving CNN, Swin Transformers, and XGBoost within a perceptually focused color space.
There is a general way to express the computational time complexity of network training. Considering t training examples and n epochs, and given that the suggested system is built using two Swin Transformer blocks of sizes i and j, the overall complexity is O(n × t × (i + j)). Training time can be reduced by leveraging cloud accelerators; in particular, TPUs offer substantial improvements in the training speed of DL models while consuming less energy.

5. Discussion

The global prevalence of eye conditions significantly varies based on the ethnic and geographic characteristics of human populations. Research suggests that unavoidable eye infections are more common in populations residing at low altitudes, particularly in tropical and temperate zones [2]. In many developing countries, especially in Asia, ocular morbidity rates are often underreported and overlooked, leading to a lack of accurate recognition of the issue [40]. Globally, it is estimated that approximately 285 million people live with visual impairments, including 246 million individuals with low vision and 39 million who are completely blind. Furthermore, studies indicate that around 2.2 billion people worldwide are affected by refractive errors such as myopia or hyperopia, with nearly half of these cases being preventable or treatable through timely interventions. These findings underscore the urgent need for enhanced global eye care strategies to address the gaps in diagnosis, prevention, and treatment [41].
Deep learning (DL) methods are increasingly revolutionizing the field of medical image analysis, providing robust solutions for a variety of applications, including disease diagnosis [3,4]. DL-based models have demonstrated remarkable effectiveness in automating the detection and classification of diseases, offering significant potential to alleviate the growing workload faced by ophthalmologists. Despite these advancements, accurately diagnosing a wide range of ocular disorders remains a challenge, prompting ongoing research and debate within the scientific community [18]. The proposed AdaptiveSwin-CNN model addresses this challenge by leveraging an optimized deep learning architecture specifically designed to classify retinal images into multiple retinal disease categories with high precision and reliability. This approach emphasizes the integration of advanced techniques to deliver a comprehensive solution for multi-class retinal disease classification, ultimately supporting early diagnosis and effective clinical decision making.
The utilization of multilabel categorization, as demonstrated in [19], offers a promising approach for addressing the challenges of ocular disease classification. Ocular disease datasets often exhibit significant variability between subjects, leading to class imbalances that make accurate disease detection and classification particularly difficult. The research presented in this article focuses on overcoming these challenges by introducing a hybrid ensemble deep learning (DL) model. The proposed model operates in three distinct phases: image preprocessing, feature extraction using a fusion approach, and classification.
To address common issues such as gradient vanishing and overfitting, the model employs an enhanced encoder–decoder architecture during the initial phase to extract critical image characteristics effectively. A fusion module is then implemented to identify and retain the most significant features from the extracted data, thereby reducing feature dimensionality while preserving essential information. Finally, an XGBoost classifier is utilized to perform classification analysis based on the optimized features, yielding reliable and accurate outcomes, as illustrated in Section 4.3, Section 4.4, Section 4.5, Section 4.6, Section 4.7 and Section 4.8.

5.1. Potential Drawbacks of the AdaptiveSwin-CNN

The goal of this research is to design an automated system that helps ophthalmologists to early screen and detect eye diseases with high precision. Despite the promising performance of the proposed AdaptiveSwin-CNN model for multi-class retinal disease diagnosis, there are some potential limitations of the suggested model which are illustrated in Table 8.

5.2. Possible Future Directions

The AdaptiveSwin-CNN model, while currently designed for fundus imaging, can be extended to additional ocular imaging modalities like OCT or fluorescein angiography with appropriate adaptations. This would involve tailoring the preprocessing pipeline to handle modality-specific characteristics, such as the higher resolution and depth information in OCT or the dynamic contrast patterns in fluorescein angiography. The model’s CNN layers could be adjusted to capture unique spatial features, and the Swin Transformer’s self-attention mechanism can be leveraged to focus on modality-specific patterns and regions of interest. Furthermore, transfer learning from the existing fundus imaging model can accelerate adaptation, while fine-tuning on labeled datasets from the new modality ensures optimal performance. With these modifications, AdaptiveSwin-CNN has strong potential for application across various ocular imaging modalities.
Generalization across datasets is a significant challenge in medical image analysis due to variations in imaging conditions, patient demographics, and device-specific features. To overcome this, future efforts can focus on domain adaptation approaches, such as unsupervised domain adaptation (UDA) to align feature distributions between domains, transfer learning to fine-tune pre-trained models, and style transfer methods to standardize image styles across datasets. Incorporating diverse training datasets from multiple centers and regions, combined with extensive data augmentation, can enhance model robustness. Self-supervised learning can be used to train models on large-scale unlabeled data, improving their ability to learn generalizable features.

6. Conclusions

This study describes the AdaptiveSwin-CNN framework, a novel deep-learning-based solution for the diagnosis of multi-class retinal diseases. Leveraging the complementary advantages of Swin Transformers and CNNs together with the Self-Attention Fusion Module (SAFM), this framework integrates multi-scale features to increase sensitivity to small retinal abnormalities while attenuating noise. AdaptiveSwin-CNN outperforms existing approaches, with an accuracy of 98.89% and an F1-score of 97.2%, and exhibits excellent generalization across datasets. Its robustness and interpretability are further reinforced by adaptive augmentation techniques and a lightweight XGBoost classifier. This work provides decision support to ophthalmologists, addressing major obstacles such as dataset imbalance and computational efficiency in retinal disease classification.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Two publicly available datasets shared by [33,34] were used, accessed on 30 November 2024.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Joshi, R.C.; Sharma, A.K.; Dutta, M.K. VisionDeep-AI: Deep learning-based retinal blood vessels segmentation and multi-class classification framework for eye diagnosis. Biomed. Signal Process. Control 2024, 94, 106273. [Google Scholar] [CrossRef]
  2. Zhao, Z.; Chen, H.; Wang, Y.-P.; Meng, D.; Xie, Q.; Yu, Q.; Wang, L. Retinal disease diagnosis with unsupervised Grad-CAM guided contrastive learning. Neurocomputing 2024, 593, 127816. [Google Scholar] [CrossRef]
  3. Qureshi, I.; Yan, J.; Abbas, Q.; Shaheed, K.; Bin Riaz, A.; Wahid, A.; Khan, M.W.J.; Szczuko, P. Medical image segmentation using deep semantic-based methods: A review of techniques, applications and emerging trends. Inf. Fusion 2023, 90, 316–352. [Google Scholar] [CrossRef]
  4. Qureshi, I.; Ma, J.; Abbas, Q. Diabetic retinopathy detection and stage classification in eye fundus images using active deep learning. Multimed. Tools Appl. 2021, 80, 11691–11721. [Google Scholar] [CrossRef]
  5. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  7. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  8. Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  10. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  11. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  12. Tan, M.X.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  14. Sharma, V.; Tripathi, A.K.; Mittal, H.; Nkenyereye, L. SoyaTrans: A novel transformer model for fine-grained visual classification of soybean leaf disease diagnosis. Expert Syst. Appl. 2025, 260, 125385. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  16. Tayal, A.; Gupta, J.; Solanki, A.; Bisht, K.; Nayyar, A.; Masud, M. DL-CNN-based approach with image processing techniques for diagnosis of retinal diseases. Multimed. Syst. 2022, 28, 1417–1438. [Google Scholar] [CrossRef]
  17. Meng, X.; Xi, X.; Yang, L.; Zhang, G.; Yin, Y.; Chen, X. Fast and effective optic disk localization based on convolutional neural network. Neurocomputing 2018, 312, 285–295. [Google Scholar] [CrossRef]
  18. Lin, C.-L.; Wu, K.-C. Development of revised ResNet-50 for diabetic retinopathy detection. BMC Bioinform. 2023, 24, 157. [Google Scholar] [CrossRef]
  19. Abbas, Q.; Daadaa, Y.; Rashid, U.; Sajid, M.Z.; Ibrahim, M.E.A. HDR-EfficientNet: A classification of hypertensive and diabetic retinopathy using optimize efficientnet architecture. Diagnostics 2023, 13, 3236. [Google Scholar] [CrossRef]
  20. Abbas, Q.; Qureshi, I.; Ibrahim, M.E.A. An automatic detection and classification system of five stages for hypertensive retinopathy using semantic and instance segmentation in DenseNet architecture. Sensors 2021, 21, 6936. [Google Scholar] [CrossRef] [PubMed]
  21. Sajid, M.Z.; Qureshi, I.; Youssef, A.; Khan, N.A. FAS-Incept-HR: A fully automated system based on optimized inception model for hypertensive retinopathy classification. Multimed. Tools Appl. 2024, 83, 14281–14303. [Google Scholar] [CrossRef]
  22. Reddy, V.P.C.; Gurrala, K.K. Joint DR-DME classification using deep learning-CNN based modified grey-wolf optimizer with variable weights. Biomed. Signal Process. Control 2022, 73, 103439. [Google Scholar] [CrossRef]
  23. Veena, H.; Muruganandham, A.; Kumaran, T.S. A novel optic disc and optic cup segmentation technique to diagnose glaucoma using deep learning convolutional neural network over retinal fundus images. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 6187–6198. [Google Scholar] [CrossRef]
  24. Li, F.; Wang, Y.; Xu, T.; Dong, L.; Yan, L.; Jiang, M.; Zhang, X.; Jiang, H.; Wu, Z.; Zou, H. Deep learning-based automated detection for diabetic retinopathy and diabetic macular oedema in retinal fundus photographs. Eye 2022, 36, 1433–1441. [Google Scholar] [CrossRef] [PubMed]
  25. Huang, C.; Jiang, Y.; Yang, X.; Wei, C.; Chen, H.; Xiong, W.; Lin, H.; Wang, X.; Tian, T.; Tan, H. Enhancing Retinal Fundus Image Quality Assessment With Swin-Transformer–Based Learning Across Multiple Color-Spaces. Transl. Vis. Sci. Technol. 2024, 13, 8. [Google Scholar] [CrossRef]
  26. Yao, Z.; Yuan, Y.; Shi, Z.; Mao, W.; Zhu, G.; Zhang, G.; Wang, Z. FunSwin: A deep learning method to analysis diabetic retinopathy grade and macular edema risk based on fundus images. Front. Physiol. 2022, 13, 961386. [Google Scholar] [CrossRef]
  27. He, J.; Wang, J.; Han, Z.; Ma, J.; Wang, C.; Qi, M. An interpretable transformer network for the retinal disease classification using optical coherence tomography. Sci. Rep. 2023, 13, 3637. [Google Scholar] [CrossRef]
  28. Mok, D.; Bum, J.; Tai, L.D.; Choo, H. Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 1350–1365. [Google Scholar]
  29. Li, Z.; Han, Y.; Yang, X. Multi-Fundus Diseases Classification Using Retinal Optical Coherence Tomography Images with Swin Transformer V2. J. Imaging 2023, 9, 203. [Google Scholar] [CrossRef]
  30. Tang, H.; Chen, Y.; Wang, T.; Zhou, Y.; Zhao, L.; Gao, Q.; Du, M.; Tan, T.; Zhang, X.; Tong, T. HTC-Net: A hybrid CNN-transformer framework for medical image segmentation. Biomed. Signal Process. Control 2024, 88, 105605. [Google Scholar] [CrossRef]
  31. Wang, J.; Mao, Y.-A.; Ma, X.; Guo, S.; Shao, Y.; Lv, X.; Han, W.; Christopher, M.; Zangwill, L.M.; Bi, Y.; et al. ODFormer: Semantic fundus image segmentation using transformer for optic nerve head detection. Inf. Fusion 2024, 112, 102533. [Google Scholar] [CrossRef]
  32. Shyamalee, T.; Meedeniya, D.; Lim, G.; Karunarathne, M. Automated tool support for glaucoma identification with explainability using fundus images. IEEE Access 2024, 12, 17290–17307. [Google Scholar] [CrossRef]
  33. Pachade, S.; Porwal, P.; Thulkar, D.; Kokare, M.; Deshmukh, G.; Sahasrabuddhe, V.; Giancardo, L.; Quellec, G.; Mériaudeau, F. Retinal fundus multi-disease image dataset (RFMiD): A dataset for multi-disease detection research. Data 2021, 6, 14. [Google Scholar] [CrossRef]
  34. Guergueb, T.; Akhloufi, M.A. Ocular diseases detection using recent deep learning techniques. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Virtual, 1–5 November 2021; pp. 3336–3339. [Google Scholar] [CrossRef]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  37. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  38. Puneet, R.K.; Gupta, M. Optical coherence tomography image based eye disease detection using deep convolutional neural network. Health Inf. Sci. Syst. 2022, 10, 13. [Google Scholar] [CrossRef]
  39. Nazir, T.; Nawaz, M.; Rashid, J.; Mahum, R.; Masood, M.; Mehmood, A.; Ali, F.; Kim, J.; Kwon, H.-Y.; Hussain, A. Detection of diabetic eye disease from retinal images using a deep learning based CenterNet model. Sensors 2021, 21, 5283. [Google Scholar] [CrossRef]
  40. Hu, W.; Li, K.; Gagnon, J.; Wang, Y.; Raney, T.; Chen, J.; Chen, Y.; Okunuki, Y.; Chen, W.; Zhang, B. FundusNet: A Deep-Learning Approach for Fast Diagnosis of Neurodegenerative and Eye Diseases Using Fundus Images. Bioengineering 2025, 12, 57. [Google Scholar] [CrossRef]
  41. Qureshi, I.; Abbas, Q.; Yan, J.; Hussain, A.; Shaheed, K.; Baig, A.R. Computer-aided detection of hypertensive retinopathy using depth-wise separable CNN. Appl. Sci. 2022, 12, 12086. [Google Scholar] [CrossRef]
Figure 1. A visual example of CFIs: (a) normal, (b) glaucoma, (c) DR, (d) HR, (e) DME.
Figure 2. A systematic flow diagram of the proposed AdaptiveSwin-CNN model for multi-class retinal disease classification.
Figure 3. An illustrative demonstration of data augmentation for managing class imbalance.
Figure 4. The proposed AdaptiveSwin-CNN framework is composed of three interconnected modules: the Encoder Module (EM), Decoder Module (DM), and Fusion Module (FM). Both the Encoder Module (EM) and Decoder Module (DM) leverage a hybrid architecture that combines the strengths of Convolutional Neural Networks (CNNs) and Swin Transformers, ensuring symmetry in their design. The Fusion Module (FM) plays a crucial role in integrating features from different scales to generate a comprehensive representation, thereby enhancing the overall performance of the proposed system.
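To make the fusion idea in Figure 4 more concrete, the snippet below is a hypothetical sketch of a self-attention fusion step over multi-scale features: each scale is projected to a common embedding size, pooled to a token, and the tokens attend to one another before being merged. The channel sizes and layer choices are illustrative assumptions, not the author’s exact SAFM implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Hypothetical sketch of a self-attention fusion step over multi-scale features:
    pool each scale to a token, let the tokens attend to one another, then merge.
    Channel sizes and dimensions are illustrative only."""
    def __init__(self, in_channels=(128, 256, 728), dim=256, heads=4):
        super().__init__()
        # project each scale's channel dimension to a shared embedding size
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, kernel_size=1) for c in in_channels)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feats):               # feats: list of (B, C_i, H_i, W_i) tensors
        tokens = [self.pool(p(f)).flatten(1) for p, f in zip(self.proj, feats)]
        x = torch.stack(tokens, dim=1)      # (B, num_scales, dim)
        fused, _ = self.attn(x, x, x)       # scales attend to one another
        x = self.norm(x + fused)            # residual connection + normalization
        return x.mean(dim=1)                # (B, dim) fused descriptor

# toy usage with random multi-scale features
feats = [torch.randn(2, 128, 56, 56), torch.randn(2, 256, 28, 28), torch.randn(2, 728, 14, 14)]
print(SelfAttentionFusion()(feats).shape)   # torch.Size([2, 256])
```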
Figure 5. The Conv-Block, shown in subfigure (a), is made up of M convolution–activation blocks; the pooled feature is the output of the EM, whereas the upsampled feature is the output of the DM. The Conv-Fusion Block, shown in subfigure (b), concatenates the encoding and decoding features before thoroughly fusing them using convolution. The Swin-Trans-Encoder Block, shown in subfigure (c), extracts features and their relationships using two Swin Transformer blocks, merges image slices using patch merging, and employs patch embedding for image slicing and positional encoding. The Swin-Trans-Decoder Block, shown in subfigure (d), employs the same number of Swin Transformer blocks as the Swin-Trans-Encoder Block. The Swin Transformer block’s architecture is shown in subfigure (e).
Figure 6. A graphical depiction of accuracy and loss during validation and training on the RFMiD benchmark.
Figure 7. A visual example of the confusion matrix of the AdaptiveSwin-CNN using RFMiD.
Figure 8. An illustration of the AdaptiveSwin-CNN confusion matrix using ODIR.
Figure 9. A visual representation of accuracy and loss on the ODIR during training and validation.
Figure 10. A comparison between various DL models and AdaptiveSwin-CNN based on classification accuracy.
Figure 11. Grad-CAM visualizations of the proposed system for five retinography classes: normal, DR, HR, glaucoma, and DME.
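For readers who wish to reproduce heatmaps in the style of Figure 11, the snippet below is a generic Grad-CAM sketch that hooks a convolutional layer, pools the gradients of the predicted class score, and weights the activations accordingly. The torchvision ResNet-50 backbone and the random input tensor are stand-ins; this is not the AdaptiveSwin-CNN visualization code.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Generic Grad-CAM sketch (not the paper's exact pipeline): hook a conv layer,
# then weight its activations by the pooled gradients of the predicted class score.
model = models.resnet50(weights="DEFAULT").eval()
acts, grads = {}, {}
layer = model.layer4[-1]
layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)    # stand-in for a preprocessed fundus image
score = model(x)[0].max()                               # score of the predicted class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize heatmap to [0, 1]
```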
Table 1. Total number of images obtained from the ODIR and RFMiD datasets, distributed according to their original sources.

| Image Benchmark | No. of Images | Disease Type | Image Modality | Size |
|---|---|---|---|---|
| RFMiD [33] | 900 | Normal | Color Fundus Image | 4288 × 4288 to 2144 × 2144 pixels |
| | 300 | DR | Color Fundus Image | 4288 × 4288 to 2144 × 2144 pixels |
| | 300 | HR | Color Fundus Image | 4288 × 4288 to 2144 × 2144 pixels |
| | 150 | Glaucoma | Color Fundus Image | 4288 × 4288 to 2144 × 2144 pixels |
| | 270 | DME | Color Fundus Image | 4288 × 4288 to 2144 × 2144 pixels |
| ODIR [34] | 500 | Normal | Color Fundus Image | 700 × 600 pixels |
| | 290 | DR | Color Fundus Image | 700 × 600 pixels |
| | 150 | HR | Color Fundus Image | 700 × 600 pixels |
| | 250 | Glaucoma | Color Fundus Image | 700 × 600 pixels |
| | 200 | DME | Color Fundus Image | 700 × 600 pixels |
| Total Images | 3310 | | | Downsizing: 224 × 224 pixels |
Table 2. Distribution of retinal images for AdaptiveSwin-CNN model training.

| Classes | Before Augmentation | After Augmentation | Training Set | Testing Set |
|---|---|---|---|---|
| Normal | 1400 | 1500 | 1300 | 200 |
| DR | 590 | 700 | 600 | 100 |
| HR | 450 | 600 | 500 | 100 |
| Glaucoma | 400 | 550 | 450 | 100 |
| DME | 470 | 600 | 500 | 100 |
| Total | 3310 | 3950 | 3350 | 600 |
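As a concrete (and hypothetical) illustration of how the “after augmentation” counts in Table 2 could be produced, the snippet below oversamples a minority class with simple geometric and photometric transforms until a target count is reached. The directory names, transform set, and target of 550 images are assumptions for illustration, not the exact augmentation policy of this study.

```python
import random
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Hypothetical oversampling of a minority class (e.g., glaucoma) up to a target count,
# in the spirit of Table 2; paths and the transform set are illustrative only.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.85, 1.0)),
])

src = sorted(Path("fundus/glaucoma").glob("*.png"))      # e.g., 400 original images (placeholder dir)
out = Path("fundus_augmented/glaucoma")
out.mkdir(parents=True, exist_ok=True)
target = 550
for i in range(target - len(src)):                       # generate the extra samples
    img = Image.open(random.choice(src)).convert("RGB")
    augment(img).save(out / f"aug_{i:04d}.png")
```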
Table 3. Parametric configuration of the convolutional layers of the proposed AdaptiveSwin-CNN model.

| Convolution Layers | Parameters |
|---|---|
| Conv-32 | (3 × 3 × 3 + 1) × 32 |
| Conv-32 | (3 × 3 × 32 + 1) × 64 |
| Conv-64 | (3 × 3 × 64) + (1 × 1 × 64 + 1) × 128 |
| Conv-64 | (3 × 3 × 128) + (1 × 1 × 128 + 1) × 128 |
| Conv-128 | (3 × 3 × 128) + (1 × 1 × 128 + 1) × 256 |
| Conv-128 | (3 × 3 × 256) + (1 × 1 × 256 + 1) × 256 |
| Conv-256 | (3 × 3 × 256) + (1 × 1 × 256 + 1) × 728 |
| Conv-256 | (3 × 3 × 728) + (1 × 1 × 728 + 1) × 728 |
| Total | 87,488 |
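The expressions in Table 3 appear to follow a depthwise 3 × 3 convolution (no bias) followed by a pointwise 1 × 1 convolution with bias for the deeper layers, and a standard biased 3 × 3 convolution for the first two rows; that reading is an inference from the printed formulas. The small helper below simply evaluates both patterns so individual rows can be checked.

```python
def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Parameter count for a depthwise k x k conv (no bias) followed by a
    pointwise 1 x 1 conv with bias -- the pattern suggested by Table 3."""
    return (k * k * c_in) + (1 * 1 * c_in + 1) * c_out

def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Parameter count for a standard k x k conv with bias, as in the first two rows."""
    return (k * k * c_in + 1) * c_out

# First two rows of Table 3 (standard convolutions)
print(standard_conv_params(3, 32))     # (3*3*3 + 1) * 32  = 896
print(standard_conv_params(32, 64))    # (3*3*32 + 1) * 64 = 18,496
# A separable row, e.g., (3*3*64) + (1*1*64 + 1) * 128
print(separable_conv_params(64, 128))  # 576 + 8,320 = 8,896
```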
Table 4. An average performance comparison between the proposed model and other DL models for the normal, DR, HR, glaucoma, and DME classes on test data.

| Models | Sensitivity % | Specificity % | Accuracy % |
|---|---|---|---|
| VGG16 | 78 | 79 | 80 |
| VGG19 | 79 | 81 | 82 |
| InceptionV3 | 75 | 77 | 78 |
| ResNet50 | 85 | 84.9 | 87 |
| Xception | 79.2 | 80.4 | 80 |
| MobileNet | 81.6 | 82.7 | 84 |
| Proposed Model | 95.2 | 96.7 | 98.89 |
Table 5. An average performance analysis of the proposed AdaptiveSwin-CNN for each class on 3950 samples.

| Classes | Sensitivity % | Specificity % | F1-Score % | Accuracy % | E |
|---|---|---|---|---|---|
| Normal | 98.8 | 99.2 | 99.5 | 99.8 | 0.12 |
| DR | 97.7 | 96.5 | 96.2 | 98.7 | 0.45 |
| HR | 97.8 | 96.8 | 96.7 | 98.5 | 0.76 |
| Glaucoma | 98.4 | 97.2 | 96.8 | 98.8 | 0.67 |
| DME | 97.3 | 96.5 | 97.2 | 98.2 | 0.56 |
| Average | 98 | 97.24 | 97.28 | 98.8 | 0.51 |
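The per-class values in Table 5 follow the usual one-vs-rest reduction of a multi-class confusion matrix. The snippet below shows that standard calculation on a toy matrix; it uses only NumPy and does not reproduce the paper’s actual predictions.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """One-vs-rest sensitivity, specificity, and F1 per class from a confusion
    matrix cm[i, j] = count of samples with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Toy 3-class confusion matrix (not the paper's results)
cm = np.array([[50, 2, 1],
               [3, 45, 2],
               [1, 2, 47]])
print(per_class_metrics(cm))
```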
Table 6. Performance comparison (average) between the proposed AdaptiveSwin-CNN and other existing DL methods on test samples for multi-class retinal disorders.

| Works | Sensitivity % | Specificity % | Accuracy % | F1-Score % |
|---|---|---|---|---|
| FunSwin [26] | 94.8 | 95.2 | 96.6 | 96 |
| SwinTranV2 [29] | 95.1 | 97.1 | 98.50 | 97.2 |
| DCNN [38] | 94.7 | 94.8 | 95.6 | 95 |
| CenterNet [39] | 94.5 | 97.1 | 98.10 and 97.13 | 97 |
| Proposed | 95.2 | 96.7 | 98.89 | 97.28 |
Table 7. Performance comparison of the proposed AdaptiveSwin-CNN with different configurations for classifying a CFI into normal, DR, HR, glaucoma, and DME on test samples.

| Configuration | Sensitivity % | Specificity % | Accuracy % | F1-Score % |
|---|---|---|---|---|
| CNN | 86.2 | 85.6 | 86.1 | 85.8 |
| CNN + ST | 87.2 | 86.8 | 86.2 | 86.6 |
| CNN + XGBoost | 86.3 | 85.8 | 86.3 | 86.1 |
| ST | 86.3 | 85.9 | 86.4 | 86.1 |
| AdaptiveSwin-CNN | 95.2 | 96.7 | 98.89 | 97.28 |
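One common realization of the “CNN + XGBoost” configuration in Table 7 is to train a gradient-boosted classifier on features extracted by a frozen deep backbone. The sketch below shows that general pattern with a torchvision ResNet-50 as a stand-in feature extractor and random labels; the actual backbone, features, and hyperparameters of this study are not reproduced here.

```python
import numpy as np
import torch
from torch import nn
from torchvision import models
from xgboost import XGBClassifier

# Hedged sketch of a "CNN + XGBoost" configuration: deep features feed a boosted
# classifier; the backbone, data, and settings below are placeholders.
backbone = models.resnet50(weights="DEFAULT")
backbone.fc = nn.Identity()                  # expose the 2048-d pooled features
backbone.eval()

with torch.no_grad():
    images = torch.randn(64, 3, 224, 224)    # stand-in batch of preprocessed fundus images
    feats = backbone(images).numpy()         # (64, 2048) feature matrix
labels = np.random.randint(0, 5, size=64)    # stand-in labels for 5 classes

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(feats, labels)
print(clf.predict(feats[:5]))
```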
Table 8. The constraints of the proposed AdaptiveSwin-CNN system when it is being expanded and used in clinical or research environments.

| Drawback | Illustration |
|---|---|
| Narrow generalization | The system’s performance might degrade on datasets collected from other centers, which typically differ in image quality and patient attributes. |
| Accessibility of data | The system’s success depends heavily on the availability and variety of well-annotated retinal fundus images for training; insufficient data may hinder its effectiveness. |
| Restricted to fundus modality | The technique currently diagnoses ocular conditions only from colored fundus images; other imaging modalities should be investigated to capture richer image features. |
| Computational resources | Training DL models generally requires substantial computational resources, which can restrict the system’s accessibility in resource-limited settings. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
