Development of Decision Support Software for Deep Learning-Based Automated Retinal Disease Screening Using Relatively Limited Fundus Photograph Data

Purpose—This study was conducted to develop an automated detection algorithm for screening fundus abnormalities, including age-related macular degeneration (AMD), diabetic retinopathy (DR), epiretinal membrane (ERM), retinal vein occlusion (RVO), and suspected glaucoma, among health screening program participants. Methods—The development dataset consisted of 43,221 retinal fundus photographs (from 25,564 participants; mean age 53.38 ± 10.97 years; 39.0% female) from a health screening program and from patients of the Kangbuk Samsung Hospital Ophthalmology Department between 2006 and 2017. We evaluated our screening algorithm on independent validation datasets. Five separate one-versus-rest (OVR) classification algorithms based on deep convolutional neural networks (CNNs) were trained to detect AMD, ERM, DR, RVO, and suspected glaucoma. The ground truth for both the development and validation datasets was graded at least twice by three ophthalmologists. The area under the receiver operating characteristic curve (AUC), sensitivity, and specificity were calculated for each disease, as well as their macro-averages. Results—For the internal validation dataset, the average sensitivity was 0.9098 (95% confidence interval (CI), 0.8660–0.9536), the average specificity was 0.9079 (95% CI, 0.8576–0.9582), and the overall accuracy was 0.9092 (95% CI, 0.8769–0.9415). For the external validation dataset of 1698 images, the average of the AUCs was 0.9025 (95% CI, 0.8671–0.9379). Conclusions—Our algorithm showed high sensitivity and specificity for detecting major fundus abnormalities. Our study will facilitate expansion of the applications of deep learning-based computer-aided diagnostic decision support tools in actual clinical settings. Further research is needed to improve the generalization of this algorithm.


Introduction
Although deep learning has yielded substantial results in other areas of research, the development of a deep learning-based diabetic retinopathy detection algorithm by Google may have been the first time that most ophthalmologists acknowledged the potential role of deep learning-based applications in the clinical setting [1][2][3]. Since then, a variety of studies have been conducted to develop deep learning-based algorithms to diagnose diseases, analyze images, and assess treatment response [4][5][6][7][8]. The diabetic retinopathy (DR) diagnosis model of Gargeya et al. [2] achieved an area under the receiver operating characteristic curve (AUC) of 0.97, with 94% sensitivity and 98% specificity, and testing on the MESSIDOR 2 and E-Ophtha databases yielded AUCs of 0.94 and 0.95, respectively.
For age-related macular degeneration (AMD), unlike DR, many studies have focused on severity classification or treatment response rather than diagnosis. The AMD grading model of Burlina et al. [7] showed performance comparable with that of humans and achieved promising results in providing detailed AMD severity grading, which normally requires highly trained graders. However, so far, the clinical applications of these deep learning-based algorithms have remained limited for various reasons, including the limited availability of well-annotated clinical data for algorithm development and performance assessment, legal restrictions and regulations, and reimbursement-related challenges. One of the most fundamental problems that ophthalmologists have encountered is that most algorithms are developed for the detection of single disease entities. In most clinical settings, we do not need a screening tool that determines only whether a patient has diabetic retinopathy; instead, we need a diagnostic tool that can reliably determine whether a patient has any abnormal ocular findings and then provide specific diagnoses, including multiple diagnoses for a single patient. This function is most important in situations where fundus screening tools are used for large patient populations. In this context, there is a growing need for a grading tool to assist in the screening of fundus photographs that can yield reliable results, thereby minimizing ophthalmologists' work burden.
Therefore, we developed decision support software for deep learning-based automated retinal disease detection for use in fundus photography screening. Here, we report its performance and discuss its potential role as a screening tool for the general population.

Methods
This study adhered to the tenets of the Declaration of Helsinki, and the study protocol was reviewed and approved by the Institutional Review Board of Kangbuk Samsung Hospital (No. KBSMC 2018-01-040). The requirement for written informed consent was waived because the study used retrospective and anonymized retinal images.
In the Republic of Korea, a regular systemic health screening examination is mandatory (including optional examinations, such as fundus photography) for adults who are ≥40 years of age. In 2015, 76.1% of South Koreans ≥40 years of age received an annual health examination (National Health Screening Statistical Yearbook, National Health Insurance Corporation, 2016) [9]. Our institution conducted 943,844 fundus photography examinations (both eyes were counted as a single examination) for 316,516 participants who were ≥40 years of age during 2006-2017.
In addition to the fundus photographs of health screening program participants, fundus photographs of patients who visited the Department of Ophthalmology of our hospital were also used for training in order to obtain data on diseases with low prevalence among the general population (especially diabetic retinopathy (DR) and retinal vein occlusion (RVO)). A schematic diagram of the inference workflow for our algorithm is shown in Figure 1.
In a general health screening environment, detecting abnormality has much higher priority than detecting normality, which is why our study focuses on screening fundus photos containing abnormalities from a general population; we designed the inference structure of our algorithm accordingly. Although normal (healthy) images were used during training as negative data with respect to the target diagnosis of each classifier, and although we also trained a classifier that predicts whether a photo is normal, neither was included in the main classification pipeline, because predicting a truly healthy photo is less important than predicting a truly abnormal one or the specific diagnosis assigned to each classifier. The proposed algorithm consists of five independent diagnosis detectors, or binary classifiers, that process and analyze a single input image in parallel, as shown in Figure 1. There are two main reasons why we adopted five detectors rather than a single multi-class detector. (1) Considering the relatively limited size of the training dataset, we intended to train each diagnosis detector in a way fully specific to its diagnosis, without limitation, by constructing individual pipelines, each of which includes a preprocessing unit, a low-level to high-level feature extractor, and a classification head. For instance, we empirically observed that diabetic retinopathy and suspected glaucoma are better detected after applying preprocessing such as contrast-limited adaptive histogram equalization (CLAHE) [10]. Parallelism is the architecture that reflects this diversified situation with the greatest ease and flexibility. (2) In a real clinical environment, a fundus photo reviewed by our model can have multiple diagnoses rather than a single diagnosis; in our second validation dataset, for example, the proportion of multiple-diagnosis images is 8.66%. When the number of diagnoses varies, it is very hard to make an appropriate prediction using a single multi-class detector, because the commonly used softmax score is best suited to top-1 prediction and can cause ambiguity for top-k prediction. We might manage to design a multi-class cutoff threshold system without knowing the number of diagnoses in a photo, but the resulting threshold system would be very complex and confusing; this could lead to lower accuracy as well as ambiguity, so it may not be a good architecture. When we trained a single multi-class detector to test feasibility, the average sensitivity was only 79.0% for top-1 prediction and 86.3% for top-2 prediction, even though one false positive was allowed. This result supports our idea that multiple one-versus-rest classifiers work properly and can achieve better prediction accuracy.
Therefore, we trained each modularized diagnosis detector on its own one-versus-rest (OVR) relabeled training dataset. Each diagnosis detector focuses on finding abnormalities related to its diagnosis, and we only need to combine the five results from the detectors using AND logic. This architecture removes complexity that may cause performance degradation and also provides an inference procedure that is easy to explain from a human's perspective, as sketched below.
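To make the combination step concrete, the following is a minimal sketch of the parallel OVR inference described above. The detector objects, the `screen_fundus_photo` function, and the threshold values are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of parallel one-versus-rest (OVR) inference.
# Detector callables and threshold values are illustrative only.

DIAGNOSES = ["AMD", "DR", "ERM", "RVO", "GS"]  # GS: suspected glaucoma

# Hypothetical per-diagnosis cutoffs tuned for high sensitivity.
THRESHOLDS = {"AMD": 0.32, "DR": 0.28, "ERM": 0.35, "RVO": 0.30, "GS": 0.25}

def screen_fundus_photo(image, detectors):
    """Run all five OVR detectors on one image and collect positive calls.

    `detectors` maps a diagnosis name to a callable that applies its own
    preprocessing (e.g., CLAHE for DR and suspected glaucoma) and returns
    a score in [0, 1].
    """
    scores = {dx: detectors[dx](image) for dx in DIAGNOSES}
    positives = [dx for dx in DIAGNOSES if scores[dx] >= THRESHOLDS[dx]]
    # The photo is referred for ophthalmologist review if any detector
    # fires; it is called normal only when all five are negative.
    return {"scores": scores, "diagnoses": positives,
            "refer": len(positives) > 0}
```

Because each detector keeps its own preprocessing and threshold, multiple diagnoses on one image fall out of the pipeline naturally, without any top-k softmax ambiguity.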

Disease Definition
Fundus photographs with clinically insignificant fundus findings (e.g., small drusen outside the major vascular arcade) were graded as normal fundus images for the purpose of optimal screening. Regardless of the extent of the photograph that was obscured, if any identifiable abnormal findings were present in fundus photographs, they were considered gradable. Ungradable photographs could contain a fundus with an obscured posterior pole of the retina, without any abnormal findings in the remaining visible area. These ungradable photographs were excluded from the training dataset and the validation dataset for the classification task.
The selection of the four major retinal diseases and suspected glaucoma was based on a previous study reporting the vitreoretinal diseases prevalent among health screening participants [11].

Age-Related Macular Degeneration
AMD was defined in accordance with the international classification developed by the International Age-Related Maculopathy Epidemiological Study Group [12]. The largest drusen present was used to determine the grade and predominant drusen type. Pigmentary abnormalities included either increased pigmentation or hypopigmentation of the retinal pigment epithelium (RPE), without any visibility of choroidal vessels. Geographic atrophy was defined as any sharply delineated, approximately round or oval area that displayed hypopigmentation, depigmentation, or apparent absence of the RPE; additionally, such areas were required to exhibit greater visibility of choroidal vessels than surrounding areas. Exudative AMD was defined as the presence of any of the following: (1) RPE detachment or serous detachment of the sensory retina; (2) subretinal or sub-RPE neovascular membranes; (3) subretinal hemorrhage; (4) epiretinal, subretinal, intraretinal, or sub-pigment epithelial scarring, glial tissue, or fibrin-like deposits. Early AMD was defined as the presence of soft drusen (≥63 µm) or any drusen (except hard, indistinct drusen) combined with RPE changes near the macula. Late AMD was defined as the presence of signs of exudative AMD or geographic atrophy.

Diabetic Retinopathy
DR was defined in accordance with the international classification of DR severity and diabetic macular edema [13]. Regardless of severity, images with any signs of DR were categorized as having DR. Findings indicative of severe disease, including clinically significant macular edema, were annotated for future comparisons.

Epiretinal Membrane
Both cellophane macular reflex (characterized by an irregular, increased light reflex from the inner retinal surface, without prominent retinal folds) and preretinal macular fibrosis (characterized by retinal folds or traction with thickening and contraction of the membrane) were categorized as ERM. Focal ERM located outside the major vascular arcade was not included in the diagnosis. Cases of secondary ERM were classified as multiple diagnoses.

Retinal Vein Occlusion
Regardless of the involved area or the type of occlusion, fundus photographs with any features of RVO were included. Old RVO with ghost vessels or visible collateral vessels was also included in the diagnosis.

Suspected Glaucoma
The diagnosis of suspected glaucoma was based solely on fundus findings. Participants were considered to have suspected glaucoma on the basis of optic disc findings of a vertical or horizontal cup-to-disc (C/D) ratio of ≥0.7 or a C/D ratio asymmetry of ≥0.2 between the two eyes. Other fundus findings considered in the diagnosis of suspected glaucoma included neuroretinal rim notching, loss, or thinning; disc hemorrhage; and a retinal nerve fiber layer defect. These criteria were based on the International Society of Geographical and Epidemiological Ophthalmology (ISGEO) guidelines, without consideration of any other diseases that can contribute to glaucomatous findings [14,15].

Multiple Diagnoses
In total, 789 fundus photographs had two or more diagnoses. The most common combination was AMD and ERM (n = 417, 52.9%), followed by DR and ERM (n = 99, 12.5%). As our algorithm combined one-versus-rest (OVR) classifiers, multiple diagnoses fit into the workflow naturally. Once each OVR classifier detects its relevant disease, the prediction can be represented in the form of not only the decision, but also a score between 0.0 and 1.0. For screening purposes, it is reasonable to use a high sensitivity threshold for each score so that fundus photographs with any abnormal diagnoses would be further evaluated by ophthalmologists regardless of the diagnosis.

Grading and Annotation Process
To enhance the performance of detailed diagnosis, we developed an annotation tool resembling an Early Treatment Diabetic Retinopathy Study grid, divided into 20 sectors extending from the fovea to the periphery with a maximum diameter of 8000 µm (Figure 2). In addition to the diagnosis, each sector was annotated for specific findings important for diagnosis or differential diagnosis. The graders comprised two retinal specialists and three third- to fourth-year residents. All graders were fully aware of the disease definitions and annotation guidelines. Before the actual grading and annotation process, brief practice and simulation procedures were performed to ensure accurate grading and annotation. At least one retinal specialist was included in the first round of grading, which was performed by two ophthalmologists. Any disagreements regarding diagnoses or annotations were discussed between these two ophthalmologists. Finally, a third ophthalmologist (who was not involved in the first round of grading) made diagnoses or annotations for fundus photographs that could not be clearly determined by the prior two ophthalmologists. Intergrader agreement was assessed using the kappa statistic, which ranged from 0.78 to 0.90 depending on the disease entity; agreement was lowest for AMD and highest for RVO. A total of 43,221 fundus photographs were used during the development phase, including 33,895 photographs for training and 9332 photographs for performance tuning. The demographic characteristics and disease distributions of the training dataset are listed in Table 1. The training set of 33,895 fundus photographs included 18,221 (53.7%) abnormal fundus photographs, and the tuning set of 9332 fundus photographs, including 5405 (57.9%) abnormal fundus photographs, was used to validate and tune performance during the training phase. A separate dataset of 11,707 photographs was used to construct multiple validation datasets via balanced and stratified bootstrapping (uniform sampling with replacement) over all annotated retinal fundus photographs.
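As an illustration of how the agreement statistic used here can be computed, the following is a minimal sketch based on scikit-learn's Cohen's kappa; the label arrays are fabricated placeholders, not our grading data:

```python
# Minimal sketch: intergrader agreement via Cohen's kappa.
# The label arrays below are placeholders, not the study's grading data.
from sklearn.metrics import cohen_kappa_score

# Binary labels (1 = diagnosis present) from two graders for one disease.
grader_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
grader_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(grader_a, grader_b)
print(f"Cohen's kappa: {kappa:.2f}")  # per-disease values of 0.78-0.90 were observed
```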

Algorithm Development
For development of the algorithm, supervised heterogeneous transfer learning was conducted using ResNet-50 [16], pretrained on the ImageNet dataset. In some prior works, other CNNs were used as backbones (for example, Gulshan et al. used Inception-v3). In contrast, we used ResNet-50 with 512 × 512 input pixels. It was difficult to conduct an exact comparison with prior studies because each test dataset differs. However, this customized ResNet resulted in better classification capability for our trained models: ResNet's improved residual connection architecture can boost performance, and the larger input images we allowed help preserve fine discriminative detail. When training an initial binary classifier, we observed that a WideResNet [22]-like CNN performed better than Inception-v3 or VGGNet, not to mention AlexNet, and we also found that ResNet performed slightly better than WideResNet when using larger input images. The details of each diagnosis detector, including preprocessing, are presented in Figure 3.
Each diagnostic OVR classifier (AMD, ERM, DR, RVO, suspected glaucoma) was initialized from the pretrained ResNet-50 to speed up convergence of representation learning, since transfer learning provides good initialization even across different tasks (classes to classify) and different domains. During fine-tuning, the rollback technique, which returns the lower layers of the network to their pretrained weights [23], was repeatedly applied to improve the network's capability to detect the finer features of the retinal fundus. Of particular note, for the AMD classifier, additional excerpts of retinal fundus photographs that were apt to be misclassified as DR or RVO were used for training.
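The following is a minimal PyTorch sketch of the transfer-learning setup described above: an ImageNet-pretrained ResNet-50 with a binary classification head accepting 512 × 512 inputs. The optimizer, learning rate, and batch shown are illustrative assumptions; the exact fine-tuning and rollback schedule is not specified here:

```python
# Minimal sketch: OVR fine-tuning of an ImageNet-pretrained ResNet-50.
# Hyperparameters and the training schedule are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models  # torchvision >= 0.13 for the weights enum

def build_ovr_detector() -> nn.Module:
    # ResNet-50 is fully convolutional up to global average pooling,
    # so it accepts 512 x 512 inputs without architectural changes.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 1)  # binary (OVR) head
    return model

model = build_ovr_detector()
criterion = nn.BCEWithLogitsLoss()  # one-versus-rest binary objective
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

images = torch.randn(4, 3, 512, 512)          # placeholder batch
labels = torch.randint(0, 2, (4, 1)).float()  # 1 = target diagnosis present

loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```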
Unsupervised domain adaptation was applied to each OVR classifier to overcome the performance drop that can be caused by domain shift. A trained DNN model often fails to work properly at other sites because of the distribution discrepancy between the training data and the data from the target domain, mainly caused by a lack of generalizability. This phenomenon is called domain shift, and domain adaptation addresses it by adapting a model trained on one domain (the source) to another domain (the target). Following the maximum classifier discrepancy principle, two well-trained classifiers sharing a deep feature generator unit for each diagnosis were used with a labeled development dataset and an unlabeled validation dataset [24]. Some drop in the area under the receiver operating characteristic curve (AUC) was observed when the model initially trained on the development dataset was tested on the validation dataset. However, a significant improvement in the AUC for each diagnosis was achieved after further training for unsupervised domain adaptation. Two or more classifiers that were candidates for the best prediction model for each diagnosis were used as an ensemble for knowledge transfer via the teacher-student knowledge distillation technique [25]. Overall, 78.4% of the development set was used for training, including fine-tuning of the network, and 21.6% was used for tuning its performance. Fundus images from the same person were always placed in the same dataset (training or tuning). During training and inference, each input photograph was tightly cropped to an empirically optimized 512 × 512 pixel (square) region of interest by applying a circular mask sharing the center of the fundus circle, yielding a clean cut of the circular boundary of the fundus image. Histogram equalization was also applied for the prediction of some diagnoses to maximize the discriminative capability of the relevant classifier.
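As an illustration of the input preparation described above, here is a minimal OpenCV sketch of a center-shared circular mask, a tight square crop, and optional CLAHE. The center/radius heuristic, the CLAHE parameters, and the function name `preprocess_fundus` are assumptions for illustration, not the study's exact pipeline:

```python
# Minimal sketch: circular-mask crop to 512 x 512 plus optional CLAHE.
# The mask heuristic and all parameters are illustrative assumptions.
import cv2
import numpy as np

def preprocess_fundus(bgr: np.ndarray, size: int = 512,
                      use_clahe: bool = False) -> np.ndarray:
    h, w = bgr.shape[:2]
    cx, cy, r = w // 2, h // 2, min(h, w) // 2  # assume centered fundus disc

    # Keep only the circular fundus region that shares the image center.
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.circle(mask, (cx, cy), r, 255, thickness=-1)
    masked = cv2.bitwise_and(bgr, bgr, mask=mask)

    # Tight square crop around the circle, then resize to the model input.
    crop = masked[max(cy - r, 0):cy + r, max(cx - r, 0):cx + r]
    crop = cv2.resize(crop, (size, size))

    if use_clahe:  # used for some diagnoses (e.g., DR, suspected glaucoma)
        lab = cv2.cvtColor(crop, cv2.COLOR_BGR2LAB)
        l, a, b = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        lab = cv2.merge((clahe.apply(l), a, b))
        crop = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    return crop
```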

Evaluation Metric and Statistical Analysis
The model performance of each initial diagnostic classifier was mainly evaluated using the area under the receiver operating characteristic curve (AUROC): sensitivity, specificity, and accuracy vary depending on the cutoff threshold, whereas the AUROC is static and thus summarizes the predictive capacity of the classifier in a single measure. In more detail, the main validation dataset can be regarded as representative of a real-world population because we collected data at the health screening center without any considerable compromise. We only added some images of minority diagnoses to maintain statistical significance; however, as shown in Table 2, the dataset remained largely imbalanced, which is characteristic of a real-world population. Because imbalance between categories can seriously distort the positive predictive value, PPV = TP/(TP + FP), and the negative predictive value, NPV = TN/(TN + FN), these were not used as evaluation metrics. Instead, we used sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), as well as the AUROC, none of which is affected by category imbalance. To obtain 95% confidence intervals (CIs) of each measure for the evaluation of statistical significance, sample bootstrapping was repeated 20 times using a nonparametric approach on the validation dataset, and model evaluation was conducted for each round of sample bootstrapping. When the disease ratio is highly imbalanced, the average of each performance measure over the per-disease classifiers (the macro-average) is the most appropriate summary, as it avoids ignoring the performance of minority disease classifiers; therefore, the macro-average was calculated to quantify overall performance.
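To make the evaluation procedure concrete, the following is a minimal sketch of 20-round nonparametric bootstrapping of AUROC, sensitivity, and specificity with 95% CIs; the resampling details and function name are illustrative assumptions rather than our actual evaluation scripts:

```python
# Minimal sketch: 20-round nonparametric bootstrap of AUROC,
# sensitivity, and specificity with 95% CIs. Illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_metrics(y_true, y_score, threshold, rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs, sens, spec = [], [], []
    for _ in range(rounds):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample w/ replacement
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():
            continue  # skip degenerate resamples containing a single class
        yp = (ys >= threshold).astype(int)
        tp = np.sum((yp == 1) & (yt == 1)); fn = np.sum((yp == 0) & (yt == 1))
        tn = np.sum((yp == 0) & (yt == 0)); fp = np.sum((yp == 1) & (yt == 0))
        aucs.append(roc_auc_score(yt, ys))
        sens.append(tp / (tp + fn))   # sensitivity = TP / (TP + FN)
        spec.append(tn / (tn + fp))   # specificity = TN / (TN + FP)
    ci = lambda xs: (np.percentile(xs, 2.5), np.percentile(xs, 97.5))
    # The macro-average is the plain mean of these per-disease values.
    return {"AUC": (np.mean(aucs), ci(aucs)),
            "sensitivity": (np.mean(sens), ci(sens)),
            "specificity": (np.mean(spec), ci(spec))}
```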

Results
We evaluated the performance of our algorithm on the independent validation dataset, which consisted of single-diagnosis images gathered in 2019 at the Kangbuk Samsung Hospital health examination center.

Overall Outcome: The Macro-Average Performance of the Five OVR Classifiers
The macro-average of the classifiers' AUCs was 0.952 after initial training and improved to 0.964 (95% confidence interval (CI), 0.947-0.982) via domain-adaptive training. After adjusting the cutoff threshold of each classifier, the macro-average sensitivity was 0.910 (95% CI, 0.866-0.954), the macro-average specificity was 0.908 (95% CI, 0.858-0.958), and the macro-average accuracy was 0.909 (95% CI, 0.877-0.942). In more detail, given that sensitivity is more important than specificity in screening, we adjusted the cutoff threshold of each classifier based on the ratio of the change in sensitivity to the change in specificity while moving along the ROC curve, within a specificity range of 0.80 to 0.90. We conducted this procedure automatically using Python scripts; a sketch of the selection procedure is given below. One exception was the suspected glaucoma (GS) detector. As the AUROC for the GS detector was very high (0.9967), its sensitivity was greater than 0.99 even when the reference specificity was set to 0.95. Therefore, we set the upper bound of the specificity range to 0.96 only for the GS detector.
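One plausible reading of this selection rule is sketched below: among ROC operating points whose specificity lies in the target window, favor the point with the largest sensitivity gain per unit of specificity given up. This is an illustrative interpretation, not our exact script; for the GS detector, `spec_hi` would be 0.96 as noted above:

```python
# Minimal sketch: cutoff selection within a specificity window using the
# local sensitivity/specificity trade-off along the ROC curve.
# The selection rule is an illustrative reading of the procedure.
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, y_score, spec_lo=0.80, spec_hi=0.90):
    fpr, tpr, thr = roc_curve(y_true, y_score)
    spec = 1.0 - fpr
    best_i, best_ratio = None, -np.inf
    for i in range(1, len(thr)):
        if not (spec_lo <= spec[i] <= spec_hi):
            continue  # restrict to the target specificity window
        d_sens = tpr[i] - tpr[i - 1]      # sensitivity gained at this step
        d_spec = spec[i - 1] - spec[i]    # specificity given up at this step
        ratio = d_sens / d_spec if d_spec > 0 else 0.0
        if ratio > best_ratio:
            best_ratio, best_i = ratio, i
    if best_i is None:                    # no point in the window: fall back
        best_i = int(np.argmin(np.abs(spec - spec_hi)))
    return thr[best_i], tpr[best_i], spec[best_i]
```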

Individual Outcome: The Performance of the OVR Classifier for Each Specific Disease
The per-diagnosis AUC from the OVR perspective was highest for the detection of ERM (0.982; 95% CI, 0.971-0.992) and the lowest for the detection of AMD (0.943; 95% CI, 0.932-0.954). The per-diagnosis ROC curves and AUCs are presented in Figure 4 and in Table 3, respectively. Each was measured using the relevant OVR classifier on the balanced samples with 20 rounds of bootstrapping after domain-adaptive training.
The per-diagnosis sensitivities are presented in Table 4. Each was measured using the relevant OVR classifier on the balanced samples with 20 rounds of bootstrapping after cutoff threshold tuning with the tuning dataset; sensitivity was defined as TP/(TP + FN), where TP indicates true positives and FN denotes false negatives. The per-diagnosis specificities (sensitivities for the remaining diagnoses) are presented in Table 5, measured in the same way; specificity was defined as TN/(TN + FP), where TN denotes true negatives and FP refers to false positives. When compared with prior distinguished studies, the DR, AMD, and GS classifiers of our algorithm demonstrated comparable or superior performance, as shown in Table 6 (note that each study was measured on its own, different, test dataset). To our knowledge, our ERM and RVO classifiers are the first deep learning-based screening models in the literature.

Additional Analysis: Algorithm Performance for Multiple Diagnosis Photographs or Diseases that Were Not Included in the Training Dataset
For multiple-diagnosis images, the proportion for which the algorithm predicted at least one of the ophthalmologists' original diagnoses was 96.3%. Of all diseases diagnosed by the ophthalmologists, the proportion correctly predicted by the algorithm was 78.8%. The proportion of fundus images for which the algorithm's predictions covered all of the ophthalmologists' diagnoses without a false negative was 61.6%.
The likelihood that the algorithm predicted at least one of the five trained diagnoses on images with other diagnoses (e.g., macular hole, retinal detachment, retinitis pigmentosa) was 95.3%. The proportion of ungradable images for which the algorithm predicted at least one of the five trained diagnoses was 86.9%.

External Evaluation: Fundus Photographs from Youngnam University Hospital
We evaluated the performance of our algorithm on an external independent validation dataset from Youngnam University Hospital, which consisted of not only single-diagnosis images but also multiple-diagnosis images (8.66%). Besides the inclusion of many multiple-diagnosis photographs, the key characteristic of this dataset is that more than 88% of its photographs showed abnormalities, the exact opposite of the first validation dataset, in which more than 88% of the photographs were normal. The demographic characteristics and structure of the second validation dataset are summarized in Table 7. In the second validation dataset, the per-diagnosis AUC from the OVR perspective was highest for the detection of AMD (0.950; 95% CI, 0.937-0.963) and lowest for the detection of ERM (0.844; 95% CI, 0.796-0.892). The per-diagnosis AUCs are presented in Table 8; each was measured using the relevant OVR classifier on the balanced samples with 20 rounds of bootstrapping after domain-adaptive training. For the multiple-diagnosis images, the correct prediction of each diagnosis was counted separately, allowing duplication, when the five disease classifiers were tested on a single image. A promising result is that the detection rate for abnormalities was 1481/1501 (98.67%), where "others" included other rare diseases and ungradable photos.

Interpretability Considerations
Interpretability has been a common challenge for deep learning because deep neural networks (DNNs) are trained to learn features automatically under a data-driven regime. One method to address this challenge is visualization based on Grad-CAM (gradient-weighted class activation mapping) [26], which can highlight the regions of the photo that were most discriminative for the model's prediction; this was also demonstrated in our study, as shown in Figure 5.

Figure 5. Grad-CAM (gradient-weighted class activation map) [26] analysis of the trained model.
The first four image-Grad-CAM pairs (a-d) correspond to well-presented examples. However, the fifth Grad-CAM (e) includes both true-positive and false-positive regions from the perspective of lesions, and in the sixth Grad-CAM (f) the lesions are spread throughout the fundus image, which can rather hinder correct interpretation. Although the fifth and sixth Grad-CAMs might appear to fall into the "fail" category, we cannot firmly conclude this, since our algorithm provides a diagnosis from multiple abnormal findings in one fundus photograph. Whether Grad-CAM highlights areas corresponding only to the final diagnosis when heterogeneous abnormal features are present requires much more investigation, which is far beyond the scope of the current study. Providing confidence or uncertainty estimates is an alternative approach that is actively being researched. None of these methods is yet sufficient to achieve high interpretability; thus, it is appropriate to position these kinds of AI tools as decision support software for the moment.
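For reference, a minimal Grad-CAM sketch for one of the ResNet-50 OVR detectors could look as follows, using PyTorch hooks on the last convolutional block. This is an illustrative implementation of Grad-CAM [26] under the assumption of a fine-tuned binary head, not the visualization code used in the study:

```python
# Minimal sketch: Grad-CAM on a ResNet-50 OVR detector (illustrative).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None)  # assume fine-tuned weights get loaded
model.fc = torch.nn.Linear(model.fc.in_features, 1)  # binary OVR head
model.eval()

feats, grads = {}, {}
def fwd_hook(_, __, output): feats["a"] = output
def bwd_hook(_, grad_in, grad_out): grads["a"] = grad_out[0]

# Hook the last convolutional block (layer4) of ResNet-50.
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 512, 512)  # placeholder input
score = model(image)[0, 0]           # diagnosis logit
model.zero_grad()
score.backward()

# Grad-CAM: weight each feature map by its mean gradient, sum, then ReLU.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                    align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # map to [0, 1]
```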

Discussion
Attempts to develop automated or semi-automated detection algorithms for various retinal diseases are not a novel concept for ophthalmologists [27,28]. However, the application of deep learning methods has enabled these attempts to achieve a higher level of accuracy. In this study, we assessed the performance of our deep learning-based algorithm for detecting major retinal diseases. Unlike other algorithms, we focused on developing a screening tool that detects multiple major fundus abnormalities rather than a single disease entity.