1. Introduction
Several risk factors for glaucoma have been identified, with elevated intraocular pressure (IOP) being the most important. Loss of peripheral (side) vision is usually the most evident sign of glaucoma, yet it often goes unnoticed until the disease has progressed. For this reason, glaucoma is often referred to as the “silent thief of sight”: it can advance without warning until significant vision loss has occurred. When symptoms do occur, they can include intense eye or forehead pain, eye redness, blurry or diminished vision, halos or rainbows around lights, headaches, nausea, and vomiting.
Although early diagnosis and treatment can help manage the condition, there is currently no cure for glaucoma. Developing automated methods for early detection is crucial [1]. Retinal fundus imaging plays a vital role in assessing the health of the optic nerve, macula, retina, vitreous, and blood vessels. Ophthalmologists use fundus cameras to capture these images, which are instrumental in diagnosing glaucoma and other eye diseases [1].
Diagnosing glaucoma is particularly difficult because it requires a thorough evaluation of the optic nerve and the measurement of elevated IOP [2]. Traditional clinical diagnostic methods face multiple challenges, such as the lack of standardized diagnostic criteria, over-reliance on IOP measurements, subjective interpretation of test results by healthcare providers, and the absence of a definitive diagnostic test, all of which contribute to the risk of misdiagnosis [3].
Deep learning (DL) techniques, particularly convolutional neural networks (CNNs), have made significant strides in automating glaucoma screening, thus reducing the subjectivity associated with manual assessments by clinicians [4]. However, CNNs have inherent limitations stemming from their restricted receptive field, which hinders comprehensive image understanding and increases the likelihood of misclassification, especially when encountering new images with similar visual features but different spatial structures.
Vision Transformers (ViTs) were introduced to overcome these limitations by applying global attention across multiple patches of an image, thereby improving model performance [4]. Despite their benefits, ViTs face challenges related to computational complexity and their use of fixed-scale tokens, which are not well suited to the varying scale and resolution requirements of visual tasks.
Hence, there is a need for a more effective and efficient deep learning model for glaucoma diagnosis. The Swin Transformer, introduced in 2021, addresses these challenges by computing self-attention locally within non-overlapping windows and using hierarchical feature maps, leading to computational complexity that scales linearly with image size [5]. This research investigates the application of SegFormer for precise segmentation and the Swin Transformer for robust classification to develop a reliable and accurate diagnostic tool for glaucoma.
This study advances the field of medical imaging by introducing a robust, AI-driven system for early glaucoma detection. Leveraging cutting-edge deep learning techniques, the research addresses key limitations in traditional diagnostic methods and contributes to more reliable, accessible, and scalable healthcare solutions. The main contributions of the research are:
Developed an automated glaucoma detection system using SegFormer for optic cup segmentation and Swin Transformer for fundus image classification.
Enhanced diagnostic precision while minimizing observer variability found in manual screening processes.
Employed diverse, publicly available datasets to ensure reproducibility and model robustness.
Highlighted the practical application of AI in healthcare, improving early glaucoma screening in remote or underserved regions.
Supported cost-effective screening by potentially reducing the long-term burden of vision loss.
3. Results
3.1. Comparative Analysis of Swin Transformer vs. Proposed Model (Improved Swin Transformer)
The comparative results between the Swin Transformer alone and the combination of SegFormer for segmentation followed by Swin Transformer for classification reveal a significant improvement in performance metrics. This section discusses these results in detail, highlighting the impact of segmentation before classification and its alignment with the research objectives.
A thorough evaluation was conducted on a separate validation set to ensure that the model performs well on unseen data. This evaluation guided the fine-tuning of the model’s parameters, ensuring that it generalizes effectively from the training data to real-world scenarios, and confirmed that the model is robust and reliable for glaucoma detection.
This was achieved by dividing the dataset into distinct subsets: a training set and a separate validation set. The training set was used to optimize the model’s parameters, while the validation set was reserved exclusively for testing the model’s performance on data it had not seen before. By doing so, the model’s ability to generalize to new, unseen data was evaluated.
Performance metrics such as accuracy, precision, recall, and F1-score were calculated on the validation set to quantify the model’s effectiveness and reliability.
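As a brief illustration (not the exact evaluation script used in this study), these metrics can be computed from validation-set labels and model predictions with scikit-learn; the short label vectors below are placeholders, not study data:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Ground-truth validation labels and model predictions (1 = glaucoma, 0 = normal).
# These short vectors are illustrative placeholders only.
y_val  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_val, y_pred))    # (TP + TN) / all cases
print("Precision:", precision_score(y_val, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_val, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_val, y_pred))          # harmonic mean of precision and recall
```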
3.1.1. Model’s Performance Based on Accuracy
Accuracy measures the proportion of correctly classified instances out of the total instances, providing an overall indication of the model’s performance.
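In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), accuracy is defined as
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]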
Table 3 presents the accuracy values obtained: the proposed model achieved an accuracy of 98.9% during training and 97.8% during testing, whereas the Swin Transformer alone achieved 77% (training) and 76.22% (testing).
The significantly higher accuracy of the proposed model demonstrates that the segmentation step before classification greatly enhances the model’s ability to correctly identify both glaucomatous and non-glaucomatous eyes. This high accuracy indicates that the model is highly effective and reliable in real-world applications. The small gap between training and testing accuracy (98.9% vs. 97.8%) suggests that the model generalizes well, maintaining its high level of performance on unseen data. The comparison is plotted in Figure 3.
3.1.2. Model’s Performance Based on Precision
Precision is a crucial metric that assesses the accuracy of positive predictions made by the model, specifically the proportion of true positive predictions among all positive predictions.
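Formally,
\[
\text{Precision} = \frac{TP}{TP + FP}.
\]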
As shown in Table 4, the precision of the proposed model during training was 98.85%, and during testing, it was 97.5%. In comparison, the standalone Swin Transformer model achieved a training precision of 77.66% and a testing precision of 75.55%. High precision indicates a low rate of false positives, which is essential in medical diagnostics to avoid unnecessary treatments and patient anxiety.
The significant improvement in precision for the proposed model indicates that the segmentation of the optic cup using SegFormer allows the classification model to focus more accurately on the relevant features. This precision enhancement suggests that the proposed model is highly reliable and effective in correctly identifying glaucomatous eyes while minimizing false positives. The close values of training and testing precision (98.85% vs. 97.5%) demonstrate that the model has generalized well, maintaining its performance across both training and unseen data.
Figure 4 compares the precision of the proposed model and the standalone Swin Transformer.
3.1.3. Model’s Performance Based on Recall
Recall, also known as sensitivity, measures the model’s ability to correctly identify all actual positive cases.
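Formally,
\[
\text{Recall} = \frac{TP}{TP + FN}.
\]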
The proposed model achieved a higher recall (sensitivity), as tabulated in Table 5: it obtained a recall of 98.99% during training and 98.29% during testing, whereas the Swin Transformer alone had a recall of 77.57% (training) and 77.3% (testing). High recall is particularly important in medical diagnostics to ensure that as many true cases as possible are detected, thus reducing the risk of missing a diagnosis.
The high recall values for the proposed model indicate that the segmentation step significantly enhances the model’s ability to detect true positive cases of glaucoma. This improvement ensures that the model is effective at identifying nearly all actual cases of glaucoma, thereby providing a high level of reliability. The proximity of training and testing recall values (98.99% vs. 98.29%) further demonstrates that the model generalizes well, effectively maintaining its high performance on both the training set and new, unseen data.
Figure 5 compares the recall of the two models in classifying glaucoma fundus images.
3.1.4. Model’s Performance Based on F1-Score
The F1-score is particularly useful for providing a balanced measure of the model’s performance, especially in cases where there is an imbalance between positive and negative classes.
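It is defined as the harmonic mean of precision and recall:
\[
\text{F1} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.
\]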
The F1-score, which combines precision and recall into a single metric, was 98.92% for training and 97.9% for testing in the proposed model. In contrast, the Swin Transformer model achieved an F1-score of 77.4% (training) and 77.39% (testing).
The high F1-score of the proposed model suggests that it effectively balances both precision and recall, ensuring accurate and reliable classification of glaucomatous eyes. This balance is crucial for achieving robust performance in medical diagnostics. The minimal difference between training and testing F1-scores (98.92% vs. 97.9%) indicates that the model generalizes exceptionally well, maintaining consistent performance across both training and new datasets.
Table 6 below shows the detailed F1-score performance analysis.
Figure 6 compares the F1-score of the two models in classifying glaucoma fundus images.
The comparative analysis of the Swin Transformer model against the combined SegFormer and Swin Transformer approach reveals notable differences in performance metrics.
Table 7 summarizes these metrics, including accuracy, precision, recall, and F1-score, for both models.
The comparison table highlights the significant benefits of incorporating segmentation before classification in the context of glaucoma detection. The combined SegFormer and Swin Transformer approach not only improves accuracy, precision, recall, and F1-score but also addresses the limitations of feature discrimination in CNNs. This methodology demonstrates the potential of advanced deep learning models to enhance diagnostic accuracy and reliability, ultimately contributing to better patient outcomes. By isolating the optic cup and focusing the classifier on the most relevant features, the proposed approach achieves superior performance, making it a valuable tool in the early detection and treatment of glaucoma.
Figure 7 below shows the graphical representation of this comparison.
3.2. Model Performance in Comparison with CNN Models
One of the primary goals of this research was to address the limitations of feature discrimination in convolutional neural networks (CNNs) by adopting SegFormer for effective fundus image segmentation. This objective is critical because traditional CNNs, while powerful, often struggle to capture the intricate details necessary for accurate medical image analysis. SegFormer, a semantic segmentation framework, was integrated to overcome these limitations, significantly enhancing the model’s ability to capture and utilize detailed features from retinal images.
The integration of SegFormer has proven to be highly effective in improving feature discrimination. By segmenting the optic cup region in retinal fundus images, SegFormer provides a detailed and focused input for subsequent classification by the Swin Transformer. This segmentation step ensures that the model can distinguish between glaucomatous and non-glaucomatous regions with greater accuracy.
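As a minimal sketch of this two-stage idea (not the exact implementation used in this study), a SegFormer model can first predict a cup/background mask, the fundus image can be restricted to that region, and the masked image can then be classified by a Swin Transformer. The Hugging Face checkpoint names, the binary label map, and the simple masking step below are assumptions made for illustration only:

```python
import numpy as np
import torch
from PIL import Image
from transformers import (AutoImageProcessor, SegformerForSemanticSegmentation,
                          SwinForImageClassification)

# Assumed public checkpoints for illustration; the study's fine-tuned weights differ.
SEG_CKPT = "nvidia/mit-b0"
CLS_CKPT = "microsoft/swin-tiny-patch4-window7-224"

seg_processor = AutoImageProcessor.from_pretrained(SEG_CKPT)
seg_model = SegformerForSemanticSegmentation.from_pretrained(SEG_CKPT, num_labels=2)

cls_processor = AutoImageProcessor.from_pretrained(CLS_CKPT)
cls_model = SwinForImageClassification.from_pretrained(
    CLS_CKPT, num_labels=2, ignore_mismatched_sizes=True)

def classify_fundus(image: Image.Image) -> int:
    """Return 1 for glaucoma and 0 for normal, for a single RGB fundus image."""
    # Stage 1: SegFormer predicts a per-pixel cup (1) / background (0) mask.
    seg_inputs = seg_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        seg_logits = seg_model(**seg_inputs).logits               # (1, 2, h/4, w/4)
    mask = torch.nn.functional.interpolate(
        seg_logits, size=image.size[::-1], mode="bilinear", align_corners=False
    ).argmax(dim=1)[0].numpy().astype(np.uint8)                    # (H, W) of 0/1

    # Stage 2: keep only the segmented region and classify it with Swin.
    masked = Image.fromarray(np.array(image) * mask[..., None])
    cls_inputs = cls_processor(images=masked, return_tensors="pt")
    with torch.no_grad():
        cls_logits = cls_model(**cls_inputs).logits                # (1, 2)
    return int(cls_logits.argmax(dim=-1))
```

In practice, both models would first be fine-tuned on the fundus datasets used in this study; the sketch only illustrates how the segmentation output constrains what the classifier sees.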
The results clearly indicate that this goal has been met. The proposed model, which combines SegFormer for segmentation and Swin Transformer for classification, achieved superior performance metrics compared to state-of-the-art CNN models. For instance, the proposed model’s training accuracy reached 98.9%, while the testing accuracy was 97.8%. These values surpass those of models like VGG-19, which had a training accuracy of 97.73% and testing accuracy of 95.54%, and DenseNet169, with 97.14% training accuracy and 95.45% testing accuracy, as shown in Table 8 below.
4. Discussion
4.1. Efficiency of Swin Transformer in Classifying High-Resolution Images
The Swin Transformer, recognized for its hierarchical architecture and innovative design, excels at classifying high-resolution images with reduced computational complexity. This section details how the Swin Transformer achieves this efficiency and demonstrates the fulfillment of the second research objective, drawing on the performance metrics reported above.
The Swin Transformer employs a hierarchical structure that processes images similarly to traditional CNNs but incorporates self-attention mechanisms for added benefits. The model significantly lowers computational complexity by partitioning the input image into non-overlapping patches and applying self-attention within local windows. By shifting the window partition between successive layers, the model captures cross-window interactions and integrates global context into its representations.
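As an illustration of this window-based computation, the following minimal PyTorch sketch (assuming a feature map in batch, height, width, channel layout, as in the public Swin implementation) shows how a feature map can be split into non-overlapping windows before self-attention is applied within each window:

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows of
    window_size x window_size tokens; attention is then computed per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Example: a 56x56 feature map with 96 channels split into 7x7 windows.
feat = torch.randn(1, 56, 56, 96)
windows = window_partition(feat, window_size=7)
print(windows.shape)  # torch.Size([64, 49, 96]): 64 windows of 49 tokens each
```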
This approach lowers the computational cost of self-attention from quadratic to linear, enhancing scalability for high-resolution images. This efficiency is crucial for processing high-resolution medical images, such as retinal fundus images, which require detailed information for accurate diagnosis. The Swin Transformer processes these images efficiently, making it suitable for real-world medical applications.
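Concretely, for a feature map of h × w patches with C channels and window size M, the original Swin Transformer paper reports the following complexities for global multi-head self-attention (MSA) and window-based self-attention (W-MSA):
\[
\Omega(\text{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad
\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC,
\]
so the quadratic dependence on the number of patches hw is replaced by a term that is linear in hw when the window size M is fixed.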
The performance metrics highlight this efficiency and accuracy. The proposed model, which combines SegFormer for segmentation and Swin Transformer for classification, achieved outstanding results.
In conclusion, the Swin Transformer’s ability to efficiently classify high-resolution images with reduced computational complexity is a key factor in the success of the proposed glaucoma detection methodology. The innovative architecture of the Swin Transformer enhances diagnostic accuracy and ensures practical applicability in clinical settings, establishing it as a valuable tool in medical image analysis. The integration of SegFormer for segmentation further amplifies these benefits, making the combined approach highly effective for glaucoma detection.
4.2. Addressing Feature Discrimination with SegFormer
Adding SegFormer to the pipeline noticeably improved the model’s ability to discriminate between features. By accurately segmenting the optic cup area in retinal fundus images, SegFormer helps the model focus on the most relevant regions. These clearer, more focused features are then passed to the Swin Transformer, allowing it to classify the images more accurately and with a better understanding of the important visual details.
These results confirm that the objective has been met: the combined SegFormer and Swin Transformer model achieved superior performance metrics compared with state-of-the-art CNN models.
The significant improvements across all these metrics validate the effectiveness of integrating SegFormer for segmentation. By providing a detailed and focused input for the Swin Transformer, the proposed model can capture and utilize intricate features from retinal images, leading to more accurate and reliable glaucoma detection. This success confirms that the research objective of addressing feature discrimination limitations in CNNs through the adoption of SegFormer has been successfully achieved.
5. Conclusions
In this study, an early prediction model for glaucoma detection using deep learning techniques was developed. The SegFormer model segmented the fundus images, while the Swin Transformer performed the final classification. The model achieved a training accuracy of 98.9% and a testing accuracy of 97.8%, outperforming other models. It also showed high precision, recall, and F1-score rates, demonstrating its effectiveness in identifying glaucomatous characteristics from retinal fundus images.
The Swin Transformer proved to be highly efficient in classifying high-resolution images with reduced computational complexity. Its ability to learn from various distributions and capture global features significantly enhanced the model’s performance. This combination of SegFormer and Swin Transformer addressed the limitations of traditional CNNs, offering superior accuracy and robustness in glaucoma detection.
One limitation of this study is the reliance on limited, less diverse datasets, which may hinder the model’s generalizability. Expanding to larger, more varied datasets could address data imbalance and improve robustness. Additionally, variability in image quality, resolution, and subjective annotations may affect accuracy. While the Swin Transformer showed promise in glaucoma detection, further validation and fine-tuning are required for broader medical applications and to ensure reliability in real-world clinical settings.