This section presents a critical review of the application of DL to the classification of retinal pathologies. Case studies were drawn from diabetic retinopathy (DR), Age-Related Macular Degeneration (AMD), glaucoma and multiple-retinal-disease applications. Segmentation methods were not part of this review.
5.1. Diabetic Retinopathy Classification
DR is the most important target for automated detection because it remains among the leading sight-threatening diseases in working-age adults. In [
55], the authors proposed IDx-DR X2.1, a DL device based on AlexNet, to detect DR severity. The purpose was to compare the performance of this device against a previously designed non-DL-based method, the Iowa Detection Program (IDP). The authors used five DR levels, including moderate and severe non-proliferative DR (NPDR), proliferative diabetic retinopathy (PDR) and/or macular edema (ME). The DL-based method outperformed the non-DL-based method, missing no cases of severe NPDR or ME, and its specificity was higher than that of the IDP. A strength of this study is its evaluation on a publicly available database, which has a positive bearing on the reproducibility of the method. Its limitation lies in the dataset used, Messidor-2, which contains high-quality images not typical of those obtained in a clinical screening setup; moreover, the dataset contains only one image per eye, limiting the coverage of the retinal area. In [
80], a five-class DR classification model was trained on 70,000 labeled retinal images to detect grades 0 (no DR), 1 (mild DR), 2 (moderate NPDR), 3 (severe NPDR) and 4 (PDR). Each patient was represented by two images, one for each eye. The model was evaluated on 10,000 fundus images from the Kaggle DR detection challenge dataset and outperformed state-of-the-art models. A significant contribution was the inclusion of images from both eyes, which meant a larger area of the retina was covered. In [
81], entropy images were used in place of fundus photos, and the authors demonstrated that feature maps were produced more efficiently. A model proposed in [
82] assists with explainability by incorporating heatmaps that highlight areas of lesion concentration; the heatmaps indicate which pixels in the image were involved in image-level predictions. Apart from DR classification, this model detects lesions as well. The two-task method outperformed both other lesion-detection methods and other heatmap-generating algorithms for ConvNets. Because it does not rely on manual segmentation to detect relevant lesions, the method could be used to discover new biomarkers in image data. The model attempts to address the lack of CNN interpretability, which undermines trust among patients and clinicians. One important feature of this technique is that it detected, with great precision, lesions in blurry images captured by hand-held retinography. This offers hope for DR screening with lower-resolution images taken using cellular phones, making CAD of DR more accessible to poorer communities. One limitation of this method is the inferior database ground truth of one grade per image, which leaves room for grader subjectivity.
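The heatmaps in [82] are a form of pixel-level attribution. The paper's exact algorithm is its own; as a representative sketch, the widely used Grad-CAM recipe weights the last convolutional feature maps by the gradients of the class score. The ResNet-18 backbone, target layer and random input below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in classifier
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

img = torch.randn(1, 3, 224, 224)              # stand-in preprocessed fundus image
model(img)[0].max().backward()                 # backprop the top-class score

# Weight each feature map by its average gradient, combine, ReLU and upsample.
w = grads["v"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * feats["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[2:], mode="bilinear", align_corners=False)
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # values in [0, 1]
```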
Two deep CNNs, the Combined Kernels with Multiple Losses Network (CKMLNet) and VGGNet with extra kernels (VNXK), were developed by [
83]. The two networks are improvements of GoogleNet and VGGNet, respectively. The authors also introduced a color space, LGI, for DR grading via CNNs. The improved networks were evaluated on the Messidor and EyePACS datasets, and the best AUC performances of 0.891 and 0.887 were achieved for the CKMLNet/LGI and VNXK/LGI networks, respectively. These performances compared well with those of the state-of-the-art methods in [
84,
85,
86]. A five-class classification model to detect and grade DR into categories ranging from 0 to 4, with 0 being no DR and 4 being proliferative DR, was proposed in [
87]. The authors used transfer learning on VGG-16 and VGG-19 and evaluated their method on the EyePACS database. The best performance achieved was an accuracy of 0.820, a sensitivity of 0.800 and a specificity of 0.820. Classes 3 and 4 performed poorly owing to class imbalances that did not favor them, and the authors' augmentation approach could have caused this: prior to augmentation, they grouped classes 1 to 4 into a single class labeled 1, with the no-DR class labeled 0, then augmented the new classes and, in the process, failed to correct the limited counts for classes 3 and 4 (a per-class balancing sketch is given below).
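One way to avoid the pitfall just described is to balance each of the five grades individually before any grouping. The following is a hypothetical sketch (the augmentation pipeline and data layout are assumptions, not the authors' code):

```python
import random
from torchvision import transforms

# Illustrative augmentation pipeline for fundus images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])

def balance_by_grade(images_by_grade):
    """images_by_grade: dict mapping DR grade (0-4) to a list of PIL images.
    Oversample every minority grade up to the majority grade's count."""
    target = max(len(imgs) for imgs in images_by_grade.values())
    return {
        grade: imgs + [augment(random.choice(imgs)) for _ in range(target - len(imgs))]
        for grade, imgs in images_by_grade.items()
    }
```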
EfficientNet-B3 was employed as the backbone model by [
88] to develop DR detection models on the APTOS dataset of 38,788 annotated images. The model obtained a Kappa score of 0.935 on the test set, and the authors concluded that their method performed at the level of experts. A major advantage and contribution of this work was the provision of a more structured way of uniformly scaling the three dimensions of the EfficientNet network (width, depth and resolution), an improvement over the arbitrary scaling employed by other authors. The drawbacks of this method are its complexity and the evaluation metric used (Kappa), which departs from the metrics employed by most models (accuracy, sensitivity, specificity) and makes performance comparisons with other models difficult.
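For reference, the structured scaling referred to here is EfficientNet's compound scaling rule: a single coefficient φ scales network depth d, width w and input resolution r jointly under a roughly fixed FLOPS budget, with the constants α, β and γ found by a small grid search:

```latex
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\qquad \text{s.t.}\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```

Solving for α, β and γ once at φ = 1 and then raising them to larger φ values yields the B0 to B7 family, of which the B3 used here is a member.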
Authors in [
89] used the DenseNet-121 model to design a DR detection method and evaluated it on the same database as in [
88], APTOS. Their research achieved good performance with an accuracy of 0.949, a sensitivity of 0.926 and a specificity of 0.971. A weighted Kappa measure of 0.88 was achieved for this model, a performance inferior to the EfficientNet model in [
88] on the same dataset. The authors claimed their method had higher efficacy than some state-of-the-art models, which they did not name; moreover, comparing models evaluated on different datasets may not be justifiable, making the authors' conclusions difficult to accept. Jang et al. in [
90] developed a DR classification system using a CNN model built on the Caffe framework and evaluated it using the Kaggle database, achieving an accuracy of 0.757 on the binary classification problem (DR, no DR). The authors concluded that their model can be used in DR screening programs for large populations. The researchers, however, used only accuracy as their evaluation metric and claimed their model performs comparably with that of Pratt et al. in [
91], who quoted accuracy alongside specificity and sensitivity as evaluation metrics. That claim is unjustifiable because accuracy alone yields misleading outcomes on highly imbalanced datasets like the one used for evaluation (see the toy example after this paragraph). Furthermore, the authors reduced DR classification to a binary problem, a departure from the typical five-class classification problem stipulated in the International DR Disease Severity Scale [
92,
93].
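To make the concern about accuracy concrete, the following toy illustration (synthetic labels, not data from [90]) shows a degenerate classifier scoring 90% accuracy while detecting no DR at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# On a 90/10 imbalanced test set, always predicting "no DR" looks accurate
# even though sensitivity is zero; hence accuracy alone is insufficient.
y_true = np.array([0] * 900 + [1] * 100)   # 1 = DR, 0 = no DR
y_pred = np.zeros_like(y_true)             # degenerate "always healthy" model

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"accuracy    = {accuracy_score(y_true, y_pred):.2f}")  # 0.90
print(f"sensitivity = {tp / (tp + fn):.2f}")                  # 0.00
print(f"specificity = {tn / (tn + fp):.2f}")                  # 1.00
```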
A two-stage deep CNN for lesion detection and DR severity grading was proposed in [94]. This multiclass model detected microaneurysms, hemorrhages and exudates with recall values of 0.7029, 0.8426 and 0.9079, respectively, and a maximum area under the curve (AUC) of 0.9590. The model was evaluated on a re-annotated Kaggle fundus image dataset and obtained a maximum accuracy of 0.973, a specificity of 0.898 and a sensitivity of 0.987. Whilst this model performed fairly well, it was designed to detect a limited number of lesions, and it would be useful to observe its performance on an expanded range of lesions. AttenNet, a multiclass deep attention-based retinal disease classifier using DenseNet-169 as its backbone, was developed by [
95]. It pays attention to critical areas that contain abnormalities, a feature that helps visualize the lesions and may aid interpretation of the model's outcomes. AttenNet achieved a four-class accuracy of 97.4% and a binary-class sensitivity of 100% with a specificity of 100%. The major contributions of this work were its high performance and its attempt at model explainability. Its limitation, though, is its potential computational expense owing to the complexity of DenseNet-169.
Using the Kaggle dataset of 35,126 color fundus images, authors in [
6] proposed a DL ensemble for predicting the five DR classes: normal, mild, moderate, severe and PDR. They used a collection of five CNN architectures: ResNet-50, Inception-V3, Xception, DenseNet-121 and DenseNet-169. The authors claimed that the model detected all DR stages and performed better than state-of-the-art methods on the same Kaggle dataset; yet, evidently, with a sensitivity of 0.515, a specificity of 0.867 and an accuracy of 0.808, this method trails behind a few models, such as the DCNN in [
94] and the CKMLNet in [83], both evaluated on the same dataset. Jiang et al. in [
96] presented an explainable ensemble DL model for DR classification. They integrated several deep learning models (Inception-V3, ResNet-152 and Inception-ResNet-V2) and used the AdaBoost algorithm to minimize the bias of each individual model (a sketch of boosting-style weighted voting follows this paragraph). The work provides weighted class activation maps (CAMs) to explain the results of DR detection; the CAMs illustrate the suspected positions of the lesions. This ensemble performed better than single deep learning models, producing an AUC of 0.946 for the integrated model against an AUC of 0.943 for the best-performing individual model. The AdaBoost algorithm helped the models reach a global minimum. Prior to model development, the images underwent augmentation to increase their diversity. The dataset used is private, which poses accessibility challenges for anyone needing to confirm the results. In [
97], the authors proposed ensemble classification methods combined with vessel segmentation for the detection of diabetic retinopathy. While the paper proposes an innovative and promising method for retinal disease prediction using deep learning techniques, the authors did not provide detail on the datasets used for testing or on the performance metrics used to evaluate the method's effectiveness, making it difficult to compare the proposed methods against others in the literature. The paper provides a comprehensive overview of the method; however, the deep learning models used in the ensemble were not named, making it difficult for readers to understand how the models were combined and how each affected the final performance. A novel method that combines a deep convolutional neural network and vessel segmentation was presented in [98] for the early detection of proliferative diabetic retinopathy. The proposed method achieved an area under the curve (AUC) of 0.969, an accuracy of 94.1%, a specificity of 95.7% and a sensitivity of 92.7% on the MESSIDOR-2 database, meaning it can effectively distinguish between a diseased and a non-diseased retina. The small size of the dataset, the lack of interpretability analysis and the absence of comparisons against other segmentation methods are the main limitations of the proposed method. Its generalizability is therefore questionable, and clinicians may be reluctant to entrust patients' care to a black-box method whose decision-making process remains opaque.
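The integration scheme of [96] is described only at a high level; as a hedged sketch, boosting-style fusion can be illustrated by weighting each trained model's class probabilities with the classical AdaBoost weight derived from its validation error (the weight formula and random probabilities below are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def ensemble_predict(prob_list, val_errors):
    """prob_list: per-model probability arrays, each of shape (n_samples, n_classes);
    val_errors: each model's error rate on a held-out validation set."""
    weights = np.array([np.log((1.0 - e) / e) for e in val_errors])
    weights /= weights.sum()                          # normalize model weights
    stacked = np.stack(prob_list)                     # (n_models, n_samples, n_classes)
    fused = np.tensordot(weights, stacked, axes=1)    # weighted soft voting
    return fused.argmax(axis=1)                       # final class per sample

# Example: three hypothetical models, 5 samples, 5 DR grades
probs = [np.random.dirichlet(np.ones(5), size=5) for _ in range(3)]
print(ensemble_predict(probs, val_errors=[0.10, 0.15, 0.20]))
```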
In [
99], ViT-DR, a vision transformer-based model for DR detection on fundus images, is presented. The model was evaluated on four publicly available datasets: MESSIDOR-2, e-ophtha, APTOS and IDRiD, obtaining AUC scores of 0.956, 0.975, 0.946 and 0.924, respectively. The authors provide a detailed analysis of the model's attention maps, which highlight the areas of the fundus images the model focuses on during classification, giving users an idea of how decisions are made. The model is a promising approach for DR grading using fundus images, but further research is needed to evaluate its generalizability to other tasks and its computational efficiency. A lesion-aware vision transformer network was proposed for DR detection in [
100]. The authors’ approach leverages lesion awareness to improve the model’s performance in detecting and grading diabetic retinopathy. The model was evaluated on the MESSIDOR-2, e-ophtha and APTOS databases, achieving AUC scores of 0.956, 0.977 and 0.947, respectively. The performance of this network was quite comparable to the ViT proposed in [
99], including the provision for model explainability. This model’s effectiveness for the detection of different types of lesions in clinical settings is yet to be established. A vision transformer that incorporates a residual module was presented in [
101] for the classification of DR severity. The model achieved an accuracy of 0.893 on the MESSIDOR-2 dataset and an AUC of 0.981 on the APTOS dataset. The inconsistent reporting of performance, for example, the absence of an AUC score for MESSIDOR-2 and of an accuracy for APTOS, is concerning: it prevents comparisons with other models and leaves the model's performance on these datasets only partially specified. The authors also provided no interpretability analysis, so it remains difficult to appreciate how classification decisions are made. The authors of [102] developed an ensemble of transformer-based models coupled with attention maps for the detection of DR. The model was evaluated on the MESSIDOR-2 and APTOS datasets, achieving an AUC of 0.977 on MESSIDOR-2 and an accuracy of 0.912 on APTOS. A major contribution of this work was the improvement in performance and the inclusion of the attention module to help clinicians better understand the underlying pathology (one common recipe for such attention maps is sketched below). Critical omissions include the lack of performance analysis against other models and of computational efficiency comparisons against CNN-based models, both important considerations for the clinical application of a model.
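Several of the transformer studies above lean on attention maps for explainability. As a hedged sketch (none of the cited papers necessarily uses this exact recipe), attention rollout is one common way to turn per-layer attention matrices into an input-level relevance map: average over heads, add the identity to account for residual connections, renormalize and multiply across layers. The layer, head and token counts below are illustrative:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer arrays of shape (heads, N, N), with token 0
    being the [CLS] token. Returns the CLS token's relevance over patch tokens."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)              # fuse attention heads
        a = a + np.eye(n)                        # residual connection
        a = a / a.sum(axis=-1, keepdims=True)    # renormalize rows
        rollout = a @ rollout                    # accumulate across layers
    return rollout[0, 1:]                        # CLS attention over patches

# Example: 12 layers, 12 heads, 1 CLS + 196 patch tokens (random stand-ins)
attns = [np.random.rand(12, 197, 197) for _ in range(12)]
heatmap = attention_rollout(attns).reshape(14, 14)  # 14x14 patch-grid map
```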
Table 3 is a summary of the DL-based models applied to detect diabetic retinopathy.
5.1.1. Discussion
The studies reviewed in this section have shown that DL techniques outperform traditional methods in diagnosing and classifying DR. For example, in [
55], the authors developed a deep learning device, IDx-DR X2.1, which outperformed the Iowa Detection Program (IDP), a non-deep learning-based method. The model achieved high sensitivity and specificity and did not miss any cases of severe NPDR or macular edema. Similarly, authors in [
80] developed a five-class DR classification model that outperformed state-of-the-art models. The authors also included images from both eyes, allowing more extensive coverage of the retinal area. The MESSIDOR-2 and EyePACS databases were the most commonly used databases in the papers reviewed in this work.
One of the most significant contributions of the reviewed studies is the use of DL models for lesion detection and grading of DR severity. For instance, [
94] developed a two-stage deep CNN for lesion detection and grading DR severity, while [
82] proposed a model that assists with explainability by incorporating heatmaps. These models demonstrated the potential of deep learning techniques in detecting DR lesions and can be useful assistive tools in clinical practice, especially when explainability is embedded in them.
Another advantage of the deep learning models developed in the reviewed studies is their potential to be used in resource-limited settings, such as developing countries. For example, in [
90], authors developed a DR classification system using a CNN model built on the Caffe framework and evaluated it using the Kaggle database. They reported an accuracy of 0.757 on the binary classification problem (DR, no DR), demonstrating the potential of deep learning to provide an accessible tool for DR screening programs covering large populations.
However, the studies reviewed have some limitations. One is the small size of the datasets used in some studies, which may pose generalizability challenges. Another is the lack of interpretability of some deep learning models, which may hinder their acceptance and use in clinical practice. The evaluation metrics used in some studies were also limited, which may further constrain comparisons and the generalizability of the models developed.
5.1.2. Summary
This subsection explored recent advances in the use of DL methods to detect and diagnose diabetic retinopathy (DR). Several studies that classify DR into categories ranging from no DR to proliferative DR were examined, along with the strengths and limitations of each approach. Some of the most promising methods use ensemble models or innovative techniques, such as entropy images or lesion detection.
One of the biggest challenges faced by researchers in this field is the lack of standardized datasets and ground-truth annotations for DR. Many studies use publicly available datasets, which may not be representative of real-world screening situations. Additionally, some studies rely on limited or imbalanced datasets, which may lead to biased results.
Overall, the authors conclude that deep learning methods show great promise for improving DR screening and diagnosis. However, further research is needed to address issues such as dataset bias and lack of interpretability and to determine whether these methods can be applied effectively across different populations and screening settings.
5.2. Age-Related Macular Degeneration Classification
Some recent results on AMD classification using convolutional neural networks are presented in this section. Preliminary work was presented in [103], where transfer learning was applied to fine-tune a DCNN to detect individuals with intermediate-stage AMD (a minimal sketch of this recipe appears after this paragraph). Accuracies of up to 0.950, sensitivities of 0.964 and specificities of 0.956 were attained on the AREDS dataset without hyperparameter fine-tuning; higher performances would probably have been recorded with fine-tuning and a bigger training dataset. The model proposed in [
104] performed binary classification between early-stage and advanced-stage AMD using a deep CNN on the AREDS database. This model was compared with earlier models that combined deep features and transfer learning, and the researchers concluded that applying deep learning-based methods for AMD detection leads to results similar to human experts' performance levels. A deep CNN-based method with transfer learning to assist in identifying persons at risk of AMD was proposed in [79]. This model was evaluated using the AREDS database with 150,000 images. The authors used an enhanced VGG16 architecture employing batch normalization and solved a binary and a four-class problem, achieving accuracies between 83% and 92%. As their main contribution, the authors debunked the belief that transfer learning always outperforms networks trained from scratch: their network, trained from scratch with sufficient images, produced higher accuracies than those obtained using transfer learning. Network depth has a positive bearing on performance, as observed with VGGNet-16 outperforming shallower networks, such as AlexNet, on similar tasks. The work of [
105] involved the development of an AlexNet model for classifying OCT images into healthy, dry AMD, wet AMD and DME types. The method trains the network from scratch without using transfer learning and was evaluated on a four-class problem and two binary class combinations. It performed better than the method presented in [18], which used transfer learning and was evaluated on the same dataset. The advantage of this network is the high number of training images (83,484). What makes these results important is that AlexNet is less computationally expensive than its successors, yet it still achieved some performance improvements. The marginal performance improvement of this method over the model by Kermany et al. in [18], however, may not justify forgoing the computational efficiency afforded by transfer learning.
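As a minimal sketch of the transfer-learning recipe referred to above, the snippet below freezes ImageNet-pretrained VGG16 features and retrains only a new classification head; the two-class setup and the hyperparameters are illustrative assumptions, not the exact configuration of [103] or [104]:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights, freeze the convolutional feature extractor and
# retrain only a new head for the target task (here: AMD vs. no AMD).
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                   # keep pre-trained features fixed

model.classifier[6] = nn.Linear(4096, 2)      # new task-specific output layer
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ... a standard supervised training loop over retinal images follows ...
```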
In [
106], a 14-layer deep CNN was evaluated using blindfold and cross-validation strategies on a private AMD retinal database, resulting in accuracies as high as 95.17%. Three fully connected layers, four max-pooling layers and seven convolutional layers were implemented in this work, and the Adam optimizer was employed for parameter tuning. Matsube et al. in [
107] designed a network with three convolutional layers with ReLU units and max-pooling layers and evaluated it on pre-processed fundus images. The deep CNN fared well against human grading by six ophthalmologists, and the authors deemed their system capable of identifying exudative AMD with high efficacy and useful for AMD screening and telemedicine. An ensemble of several CNN networks was proposed in [108] to classify among 13 different AMD classes on the AREDS database. The model outperformed human graders on the AREDS database and was deemed suitable for AMD classification in other datasets for individuals aged 55 years and above. Authors in [109] sought to analyze the impact of image denoising, resizing and cropping on AMD detection. They observed that a reduction in image size does not lead to a significant reduction in performance, yet results in a substantial reduction in model size, and they concluded that the model's highest accuracies were obtained with original images, without denoising and cropping. AMDOCT-Net fared better than the VGG16 and OCT-Net architectures for comparable model sizes. The image-resizing result is significant: model size is reduced substantially with an insignificant loss of performance. The authors of [
110] proposed a vision transformer network for AMD classification and detection. They evaluated the model on the MESSIDOR and APTOS databases, achieving an accuracy of 0.913 on APTOS and an AUC of 0.963 on MESSIDOR. The major contributions of this work include the model's high performance and the explainability capability inherent in vision transformers; its limitation is that the attention maps may not always align with the underlying pathology, which could lead to incorrect diagnoses. In [111], a vision transformer network was proposed for AMD diagnosis on retinal fundus images and was evaluated on the AREDS dataset. The model achieved an accuracy of 0.994 on the four-class classification task and an AUC of 0.993 on the binary classification task. As a contribution, this work shows that AMD detection assistive tools can be developed using ViTs and achieve performances comparable to state-of-the-art CNN models, with the added advantage of explainability to enhance trust with clinicians and patients alike. The drawback of this model, though, is that it was not evaluated on enough AMD datasets to establish its generalizability.
5.2.1. Discussion
This section reviewed several studies that applied DL methods for the classification of Age-Related Macular Degeneration (AMD). These studies have demonstrated great potential in using DL methods to classify AMD stages and to differentiate between healthy and AMD-affected eyes. Most of the studies reviewed evaluated their models on the AREDS database.
Transfer learning has been applied in many of the studies, for example [
103,
104], to fine-tune pre-trained DL network architectures for the classification of AMD. The results show accuracies of up to 0.950, sensitivities of 0.964 and specificities of 0.956, which compare closely with the performance levels of human experts. It was, however, observed in [
79] that a network trained from scratch with sufficient input images could produce higher accuracies than models fine-tuned from pre-trained networks.
The reviewed studies also observed that network depth impacts model performance, as demonstrated by VGGNet-16 outperforming shallower networks, such as AlexNet, on similar tasks. AlexNet was utilized in [
105] for the classification of OCT images into healthy, dry AMD, wet AMD, and DME types without using transfer learning. The high number of training images (83,484) used in this study contributed to its better performance compared to transfer learning-based methods.
Other studies have investigated the impact of denoising, resizing and cropping images on the accuracy of AMD detection. The study in [109] showed that reducing the image size does not significantly reduce performance, yet yields a substantial reduction in model size and computational expense. They also concluded that the highest accuracies were obtained with original images, without denoising and cropping. In [
110,
111], vision transformers were employed for AMD classification, achieving high accuracy and AUC scores on the MESSIDOR, APTOS and AREDS databases. The major contribution of these papers is the explainability capability inherent in the ViT models, which enhances trust with clinicians and patients alike.
Overall, the papers reviewed show that deep learning-based methods, including both CNNs and ViTs, have the potential to achieve performance levels similar to human experts in AMD classification. However, limitations of the models include a lack of generalizability and the potential for incorrect diagnoses due to attention maps not aligning with the underlying pathology. Additionally, it is important to carefully consider the trade-offs between transfer learning and training from scratch when developing AMD classification models.
5.2.2. Summary
This section discussed recent developments in using deep learning models, specifically CNNs and vision transformers, for Age-Related Macular Degeneration (AMD) classification. Several studies have shown promising results in using these models to classify retinal fundus images for various stages of AMD, with some achieving high levels of accuracy and outperforming human graders. The use of transfer learning and network depth has also been explored, with some studies showing that training networks from scratch with sufficient data can produce higher accuracies compared to using pre-trained models. However, there is still room for improvement, particularly in terms of generalizability to different datasets and addressing potential limitations of the models, such as the alignment of attention maps with underlying pathology in vision transformers.
Table 4 summarizes the main algorithms for AMD detection.
5.3. Glaucoma
An early work in glaucoma detection was presented in [
112]. The authors proposed a CNN employing dropout and data augmentation to improve convergence. The network had six layers: four convolutional layers with decreasing filter sizes and two dense (fully connected) layers (an illustrative sketch of such an architecture follows this paragraph). The model was evaluated on the ORIGA and SCES datasets, achieving an AUC of 0.831 on ORIGA and 0.887 on SCES. Neither the specificity nor the sensitivity of this network was reported, raising doubts about whether the network suffered from overfitting, which is typical with the imbalanced data common in this domain. A pre-trained Inception-V3 architecture was employed in [
23] to predict glaucomatous optic neuropathy (GON). The images were first graded by expert ophthalmologists, and the local space average color subtraction technique was employed to accommodate varying illumination. The authors claimed the model was capable of detecting referable GON with high sensitivity and specificity; false positive and false negative results were caused by the presence of other eye conditions. In [113], the researchers took advantage of domain knowledge and designed a multibranch neural network (MB-NN) with methods to automatically extract important parts of images and obtain domain knowledge features. The model was evaluated on datasets obtained from various hospitals and achieved an accuracy of 0.9151, a sensitivity of 0.9233 and a specificity of 0.9090. In another study, ResNet-50 was used as a base network to implement a deep CNN for the detection of early glaucoma. A proprietary database with 78 images was used to train the model, and three additional public datasets were used to validate it, yielding a validation accuracy of 0.9695. Whilst most methods focus on advanced glaucoma detection, this method targets early detection, the more difficult and important task of detecting subtler changes in the images; however, the few training images made the model more susceptible to overfitting. The DenseNet-201 network in [
114] was developed as a model for the detection of glaucoma. The model was evaluated on the ACRIMA dataset and obtained a maximum accuracy of 0.97, an F1 score of 0.969, an AUC of 0.971, a sensitivity of 0.941 and a specificity of 1.0. This model performed better than the authors' previous work in [115], where they experimented with ResNet-121. An added advantage of the DenseNet network is its ability to manage the vanishing gradient problem; however, DenseNet suffers from computational inefficiency owing to its deep layers and millions of parameters.
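To make the six-layer design of [112] concrete, here is an illustrative PyTorch sketch with four convolutional layers of decreasing filter sizes, two fully connected layers and dropout; the channel counts and kernel sizes are assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

class GlaucomaCNN(nn.Module):
    """Illustrative compact CNN: four conv layers with shrinking kernels,
    dropout for regularization and two dense layers for binary output."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),   # glaucoma vs. healthy
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```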
An attention-based CNN network for glaucoma detection (AG-CNN) was proposed by [
64]. The network was trained on the 11,760-image LAG dataset, and attention maps were used to highlight salient regions for glaucoma. The model performed better than state-of-the-art networks on the same database and also on the public RIM-ONE database, with best performances of 96.2% accuracy, 95.4% sensitivity, 96.7% specificity and an AUC of 0.983. The main contribution and advantage of this paper was the introduction of visualized heatmaps that located small pathological areas better than the other methods, which helps with model explainability. The limitation of the network is that it adds more weight parameters to the model, increasing its computational complexity. The authors of [
116] proposed a deep learning method for glaucoma detection that combines optic disc segmentation and transfer learning. The model, fine-tuned from a pre-trained ResNet-50, was evaluated on two publicly available image databases, DRISHTI-GS1 and RIM-ONE V3, achieving accuracies of 98.7% and 96.1%, respectively. A significant contribution of the authors was an analysis of model interpretability. Whilst good performances were recorded, the small sizes and the limited number of the datasets on which the model was evaluated adversely affect its generalizability. Moreover, comparing this model with other segmentation models in the literature would have been easier had the authors used a wider range of evaluation metrics, such as specificity, sensitivity and F1 score. In the work [117], a vision transformer for glaucoma detection was proposed and evaluated on the ORIGA and RIM-ONE v3 datasets, achieving a sensitivity of 0.941 and a specificity of 0.957 on RIM-ONE v3, and a sensitivity of 0.923 and a specificity of 0.912 on ORIGA. The paper provides a thorough analysis of the model's attention maps, which can help clinicians understand the features underlying the model's decisions. Additionally, the authors compared the performance of their model with state-of-the-art models, allowing readers to judge the strengths and weaknesses of different models. The small size of the evaluation datasets, however, makes it hard to generalize the performance of the approach; additional validation with larger and more diverse datasets is needed. In the work of [
118], the ORIGA dataset was used to evaluate a ViT model for glaucoma classification. An AUC of 0.960 for binary classification and an F1 score of 0.837 for multiclass classification were registered. The authors managed interpretability well by providing a detailed analysis of the model’s attention maps, which help identify important features associated with glaucoma. However, like in [
117,
119,
120], readers will be skeptical about generalizing the performance of the model owing to the small size of the ORIGA, RIGA and RIM-ONE v3 datasets used for evaluation. In the work of Seremer et al. [
121], transfer learning was applied to train and fine-tune the ResNet-50 and GoogleNet networks for early and advanced glaucoma classification, with the models evaluated on the public RIM-ONE dataset. The sensitivity values were very low for both GoogleNet and ResNet, reaching as low as 0.17, while specificities as high as 0.98 were achieved with the GoogleNet architectures for early glaucoma detection. GoogleNet was also reported to have shorter execution times than ResNet. A multistage DL model for glaucoma detection based on a curriculum learning strategy was proposed in [
122]. The model included segmentation of the optic disc and cup, prediction of morphometric features and classification of the disease level (healthy, suspicious and glaucoma). It performed better than state-of-the-art models on the RIM-ONE-v1 and DRISHTI-GS1 datasets, with an accuracy of 89.4% and an AUC of 0.82. The omission of the model's specificity and sensitivity raises questions about possible overfitting owing to imbalanced data. The performances of DL techniques for the detection of glaucoma are summarized in
Table 5.
5.3.1. Discussion
Glaucoma is a leading cause of blindness, and deep learning (DL) techniques have been employed to aid its detection. Several studies have proposed various DL models that employ different architectures, including Inception-V3, ResNet-50, DenseNet-201 and vision transformers, for detecting glaucoma. Attention-based CNN networks, transfer learning and multistage DL models have also been proposed. Most studies focus on detecting advanced glaucoma, but some focus on early detection, which is more challenging. While these models, most of which were evaluated on the RIM-ONE v3 database, achieved high accuracy, sensitivity and specificity on their respective datasets, they have limitations, such as small dataset size, limited diversity and limited evaluation metrics. Thus, additional validation with more diverse and larger datasets is needed to generalize their findings better. Additionally, there is a need for interpretability and model explainability. Overall, the performance of DL techniques for glaucoma detection is promising, and they have the potential to improve the accuracy and efficiency of glaucoma diagnosis.
5.3.2. Summary
Several deep learning models have been proposed for glaucoma detection using various techniques, such as CNNs, attention-based networks, transfer learning and curriculum learning. These models were evaluated on different datasets and achieved good accuracy, sensitivity and specificity measures. However, the small size and limited number of datasets used for evaluation affect their generalizability. The visualized heatmaps introduced in some models aid in locating small pathological areas, while others focus on early detection, a more challenging task. The choice of architecture and evaluation metrics depends on the specific requirements of the detection task.
5.4. Multiple Retinal Disease Detection
This section presents a review of studies that targeted the classification of AMD, DR, glaucoma and other retinal diseases in multiclass or multiclass, multilevel tasks. Using EfficientNet-B3 as the base model, authors in [
67] developed a DL model merged with a mixture loss function for automatic classification between glaucoma, cataract and AMD in a four-class problem including normal. The mixture loss hybridized the focal loss and the correntropy-induced loss to minimize the effects of outliers and class imbalance (a hedged sketch of such a mixture loss follows this paragraph). The 5000-image OIA-ODIR dataset was used for model evaluation, and the FCL-EfficientNet-B3 model outperformed other baseline methods for the detection of the three retinal diseases. The main advantages of the model include reduced computational cost and faster training. EfficientNet scales well, but balancing its three dimensions is difficult, and the model also struggled to correctly classify AMD and glaucoma. An ensemble of three ResNet-152 networks was proposed in [
123] for classifying Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), drusen and normal cases. The ensemble method outperformed a single ResNet-152 network, posting a maximum accuracy of 0.989, a sensitivity of 0.989 and a specificity of 0.996. The authors carried out experiments with datasets of different sizes and concluded that model performance improved with more training data. The model has the drawback of increased computational complexity owing to the large number of layers and parameters in ResNet-152.
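The exact weighting of the two losses in [67] is not reproduced here; as a hedged sketch, a simple convex combination conveys the idea, with the focal term down-weighting easy examples (countering class imbalance) and the correntropy-induced term saturating for large errors (countering outliers). The λ, α, γ and σ values are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=1.0, gamma=2.0):
    # Focal loss: -alpha * (1 - p_t)^gamma * log(p_t), down-weights easy samples.
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

def correntropy_loss(logits, targets, sigma=1.0):
    # Correntropy-induced loss 1 - exp(-e^2 / (2 sigma^2)): bounded, so gross
    # outliers cannot dominate the gradient.
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, probs.size(1)).float()
    e2 = ((probs - onehot) ** 2).sum(dim=1)
    return (1.0 - torch.exp(-e2 / (2.0 * sigma ** 2))).mean()

def mixture_loss(logits, targets, lam=0.5):
    # Convex combination of the two terms (the weighting in [67] may differ).
    return lam * focal_loss(logits, targets) + (1.0 - lam) * correntropy_loss(logits, targets)
```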
Kamran et al. in [
124] proposed an architecture to differentiate between a range of pathologies causing retinal degeneration. The authors claim their model outperforms expert ophthalmologists. In [
125], an ensemble four-class classification model based on the ResNet-50 neural network was presented to automatically detect Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), drusen and normal cases in OCT images. This model, which the authors claim performs better than ophthalmologists with significant clinical experience, attained an accuracy of 0.973, a sensitivity of 0.963 and a specificity of 0.985. Global accuracies of up to 0.95 were attained in [126] with a deep learning classifier of inherited retinal diseases using fundus autofluorescence (FAF) images; the classifier distinguished retinitis pigmentosa, Stargardt disease and normal cases across 389 images. A CNN-based automated multiclass classifier for retinal diseases using spectral-domain OCT images was developed by [3]. The model detected AMD, CNV, DME, drusen and normal cases, correctly detecting AMD with 100% accuracy, CNV with 98.86%, DME with 99.17%, drusen with 98.97% and normal with 99.15%; the overall accuracy achieved was 95.30%. Gour and Khanna (2020) proposed an automated multiclass, multilabel transfer learning-based CNN for the detection of ocular diseases. Leveraging the power of transfer learning, they built two models using four CNN architectures, VGG16, InceptionV3, MobileNet and ResNet, and evaluated the models on the ODIR database to predict the presence or absence of eight ocular diseases. Model 1 passes the left- and right-eye images separately as inputs to the CNN architectures for feature extraction before the features are concatenated; Model 2 concatenates the images first and then extracts features (a sketch of the Model 1 layout is given below). For both models, the architectures were trained for 100 epochs, and the sigmoid activation function was used to predict the probability of each of the eight labels corresponding to the disease categories in the ODIR database: normal (N), diabetes (D), glaucoma (G), cataract (C), AMD (A), hypertension (H), myopia (M) and other diseases (O). The VGG16 architecture with the SGD optimizer on Model 1 outperformed the other architectures, achieving AUC and F1 score values of 84.93 and 85.57, respectively. This work provides a fairly viable solution to the multiclass, multilabel classification problem for the prediction of ocular diseases, but its limitation was the low performance on categories with fewer images, owing to the imbalanced nature of the dataset.
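A hedged sketch of the Model 1 layout follows: each eye is encoded separately, the two feature vectors are concatenated, and a sigmoid head outputs eight independent label probabilities. The shared VGG16 encoder and the pooling choice are simplifying assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoBranchMultiLabel(nn.Module):
    """Two-branch multilabel classifier: per-eye feature extraction, feature
    concatenation, then independent sigmoid probabilities for 8 diseases."""
    def __init__(self, num_labels=8):
        super().__init__()
        self.encoder = models.vgg16(weights=None).features  # 512-channel output
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(2 * 512, num_labels)

    def forward(self, left_eye, right_eye):
        f_l = self.pool(self.encoder(left_eye)).flatten(1)
        f_r = self.pool(self.encoder(right_eye)).flatten(1)
        return torch.sigmoid(self.head(torch.cat([f_l, f_r], dim=1)))
```

Training such a head pairs naturally with a per-label binary cross-entropy loss (e.g., `nn.BCELoss`), since the eight disease labels are not mutually exclusive.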
Table 6 presents a summary of the DL-based methods for the detection of multiple retinal diseases. An Ensemble Label Power-set Pruned datasets Joint Decomposition (ELPPJD) technique was developed in [
127] to solve the multiclass, multilabel classification problem. They transformed the multilabel problem into a multiclass classification problem (a toy sketch of this label power-set transformation follows this paragraph). They adopted 10-fold cross-validation and used average accuracy, precision, recall and F-measure to evaluate the models. The authors developed two variants of the method based on two decomposition strategies, ELPPJD_SB (size balanced) and ELPPJD_LS (label similarity). ELPPJD_LS outperformed not only ELPPJD_SB but also two widely used multilabel classification methods, RAkEL and HOMER, producing an average accuracy of 88.59%, a good result in multiclass classification [
127]. The authors utilized transfer learning and fine-tuning techniques in [
128] to adapt a pre-trained Inception-v3 architecture, combining it with a novel feature attention layer for the prediction of four common retinal diseases: diabetic retinopathy, Age-Related Macular Degeneration, glaucoma and retinal vein occlusion. With the feature attention layer highlighting important regions of the input image, their model, EyeDeep-Net, achieved remarkable accuracies, outperforming state-of-the-art models: 95.4% on the IDRiD dataset and 96.5% on the MESSIDOR dataset for multiclass classification. Whilst this method achieves considerably good accuracies compared to state-of-the-art methods, the datasets used were comparatively small, which may affect the generalizability of the model. Moreover, the authors did not provide a thorough interpretability analysis, which could have helped in understanding the model's decision-making process. A vision transformer was presented in [
129] for the classification of multiple diseases in fundus images. Evaluation on the IDRiD, Messidor-2 and APTOS datasets yielded promising accuracies of 0.9847, 0.9667 and 0.9576, respectively. The authors performed extensive experiments to evaluate their approach and provide a detailed analysis of the model's attention maps to identify the regions of interest for each disease. Although the authors compared their results with those of previous studies on individual diseases, they did not compare their approach with other multidisease classification models, nor did they analyze the computational cost of their model against CNNs, which have dominated computer vision. A novel attention-guided approach to identify the most important regions in retinal images for disease classification was proposed by [
130]. The authors demonstrated that their approach outperforms several state-of-the-art models on two publicly available datasets, achieving macro F1 scores of 0.871 on the MESSIDOR-2 dataset and 0.845 on the EyePACS dataset. The use of attention-guided vision transformers, which can improve the interpretability of the model's predictions and provide insight into the most important regions for disease classification, was a major contribution of this work. However, the authors did not discuss the computational complexity of their model; given the large number of parameters in vision transformer-based models, the cost of training and deploying the model may be a limiting factor in real-world clinical applications. Two deep learning architectures, RetinaNet and ViT, were combined in the work of [
131] for the automated detection of retinal diseases. Their method achieved state-of-the-art performance, scoring a sensitivity of 0.944 and a specificity of 0.966 on the IDRiD dataset and an accuracy of 0.971 on the MESSIDOR-2 dataset. One limitation of this work is the lack of discussion of the model's explainability; given the black-box nature of deep learning models, insights into the most important regions of the retinal images for disease detection would be valuable. An approach for multilabel classification of retinal diseases using a self-attention-based vision transformer was proposed in [132]. The authors demonstrated that their approach outperforms several state-of-the-art models on the Kaggle Diabetic Retinopathy Detection (KDD) dataset, achieving a mean F1 score of 0.865 and an accuracy of 0.897. The self-attention mechanism allows the model to focus on relevant features in the retinal images for disease detection. However, one limitation of this paper is the lack of evaluation on other publicly available datasets, which limits the generalizability of the proposed approach. Additionally, the authors do not provide insights into the most important regions of the retinal images for disease detection, which limits the interpretability of the approach.
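As a toy sketch of the label power-set idea underlying ELPPJD (the pruning and joint-decomposition steps of [127] are omitted), each unique combination of binary disease labels is mapped to a single multiclass label:

```python
def label_powerset(y_multilabel):
    """Map each unique binary-label combination to one class index, turning a
    multilabel problem into a multiclass one."""
    combo_to_class, y_multiclass = {}, []
    for row in y_multilabel:
        combo = tuple(row)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
        y_multiclass.append(combo_to_class[combo])
    return y_multiclass, combo_to_class

# Example: three samples over four disease labels
y = [(1, 0, 0, 1), (0, 1, 0, 0), (1, 0, 0, 1)]
classes, mapping = label_powerset(y)   # classes == [0, 1, 0]
```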
5.4.1. Discussion
The use of deep learning (DL) models for the detection and classification of retinal diseases is a promising area of research, with numerous studies showing significant progress in recent years. However, there are several critical issues that need to be addressed in order to improve the reliability and generalizability of these models.
One of the primary challenges is the lack of diverse and well-annotated datasets. Many studies have reported using relatively small datasets, and the lack of diversity in these datasets can limit the generalizability of the developed models. Moreover, it is important to consider that the prevalence of retinal diseases varies widely across different populations and ethnicities. This can limit the generalizability of models developed using datasets from a specific population or region. Therefore, efforts to collect and annotate large, diverse datasets are critical to ensure the generalizability of these models. The MESSIDOR-2 database was the most frequently used database for evaluating the models.
Another challenge is the interpretability of DL models. It is often difficult to understand how these models arrive at their predictions, which can limit their utility in clinical settings. While some studies have proposed the use of attention mechanisms or visualization techniques to identify important regions in retinal images, more research is needed to develop methods for interpreting the predictions of DL models.
Additionally, DL models require significant computational resources for training and inference, which can limit their scalability and feasibility in clinical settings. Therefore, there is a need for more research on developing efficient DL models that can be trained and deployed on resource-constrained devices.
Finally, it is important to recognize that DL models should not replace expert ophthalmologists. While these models can provide valuable insights and support to clinicians, they should be used as a tool for aiding diagnosis and not as a replacement for clinical expertise.
5.4.2. Summary
This section presents an overview of several studies that have targeted the classification of multiple retinal diseases using deep learning (DL) models. Common approaches used in these studies are pre-trained convolutional neural networks (CNNs), such as ResNet, EfficientNet and ViT, and ensemble methods. The main challenges are class imbalance and the interpretability of DL models. Some studies have proposed the use of mixture loss functions or transfer learning to overcome class imbalance and attention mechanisms or visualization techniques to improve interpretability. The reviewed studies have shown promising results, but larger and more diverse annotated datasets are needed to improve generalizability, and more research is needed on the interpretability and explainability of DL models.