Deep Learning-Based Classification of Inherited Retinal Diseases Using Fundus Autofluorescence

Background. In recent years, deep learning has been increasingly applied to a vast array of ophthalmological diseases. Inherited retinal diseases (IRD) are rare genetic conditions with a distinctive phenotype on fundus autofluorescence imaging (FAF). Our purpose was to automatically classify different IRDs from FAF images using a deep learning algorithm. Methods. In this study, FAF images of patients with retinitis pigmentosa (RP), Best disease (BD), and Stargardt disease (STGD), as well as of a comparable group of healthy controls, were used to train a multilayer deep convolutional neural network (CNN) to differentiate between each type of IRD and normal FAF. The CNN was trained and validated with 389 FAF images. Established augmentation techniques were used, and an Adam optimizer was used for training. The resulting classifiers were then tested on 94 previously unseen FAF images. Results. For the inherited retinal disease classifiers, the global accuracy was 0.95. The precision-recall area under the curve (PRC-AUC) averaged 0.988 for BD, 0.999 for RP, 0.996 for STGD, and 0.989 for healthy controls. Conclusions. This study describes the use of a deep learning-based algorithm to automatically detect and classify inherited retinal diseases on FAF. The resulting classifiers showed excellent performance. With further development, this model may become a diagnostic tool and may provide relevant information for future therapeutic approaches.


Introduction
Inherited retinal diseases (IRDs) encompass a large, clinically and genetically heterogeneous group of diseases that affect around 1 in 3000 people, totaling more than 2 million people worldwide [1]. Considering that IRDs are the most frequent inherited forms of visual impairment in humans, this group of diseases has a profound impact on both patients and society [2,3].
In this context, the advent of noninvasive imaging techniques has allowed for a refined assessment of IRDs. Color fundus photography, fundus autofluorescence (FAF), and high-resolution spectral-domain optical coherence tomography have become fundamental to the diagnosis and follow-up of these diseases.

Development of a Deep Learning Classifier
For this study, the deep learning framework TensorFlow™ (Google Inc., Mountain View, CA, USA) was used. We used ResNet 101 (Microsoft ResNet; Microsoft Research Asia, Beijing, China) to perform the classification task [27]. This deep CNN is widely used for image classification. ResNet 101 has the advantage of introducing residual connections, which allow the network depth to be increased without the usual degradation in training and thus improve classification results [27,28]. Transfer learning from the ImageNet dataset (http://www.image-net.org/) was used to provide base knowledge to the CNN before fine-tuning it. To fit our task, we reduced the number of output neurons in the last fully connected layer to four. Moreover, we fixed the first ResNet 101 block during the training process to keep the low-level features learned on ImageNet and to speed up training. Data augmentation was used to enlarge the original dataset and to reduce overfitting of the final model. This was achieved through a combination of image translation, cropping, and rotation. Moreover, Gaussian noise augmentation was used to mimic low-quality noisy images. The images were normalized using the mean and standard deviation of the ImageNet dataset to match the model initialization. The model was optimized using the Adam optimization algorithm over 5000 iterations [29]. The model was then evaluated with the test set of 94 images. Using integrated gradients, attribution maps were generated, allowing the impact of each pixel on the classification to be assessed and showing which areas the model relies on to perform the classification [30]. The method is summarized in Figure 1.

Figure 1. Illustration of the development of a deep learning classifier: fundus autofluorescence images of Stargardt disease, retinitis pigmentosa, and Best vitelliform macular dystrophy as well as of healthy controls were extracted from the Créteil database. After data preparation, transfer learning from the ImageNet dataset (http://www.image-net.org/) was used. The images were randomly partitioned into three sets: the training set (70% of the images), the validation set (10% of the images), and the test set (20% of the images). Data augmentation was performed on the training set to enlarge the original dataset and to reduce overfitting of the final model. The images were normalized using the mean and standard deviation of the ImageNet dataset to match the model initialization. The model was optimized using the Adam optimization algorithm over 5000 iterations. The model was then evaluated with the test set of 94 images. The output of the model was the metric evaluation of the performance of the model (accuracy, sensitivity, and specificity) and integrated gradient visualization.
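For illustration, the fine-tuning setup described above can be sketched in TensorFlow/Keras as follows. This is a minimal sketch rather than the authors' implementation: the input resolution, learning rate, and the exact layers frozen (approximating the fixing of the first ResNet block) are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 4  # Stargardt, retinitis pigmentosa, Best disease, healthy

# ResNet-101 backbone pretrained on ImageNet, without its original classification head.
# FAF images are assumed to be resized to 224 x 224 and normalized with the
# ImageNet mean and standard deviation before being fed to the network.
base = tf.keras.applications.ResNet101(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))

# Approximate "fixing the first ResNet block": freeze the stem and the conv2 stage
# so that low-level ImageNet features are kept and training is faster.
for layer in base.layers:
    if layer.name.startswith(("conv1", "conv2")):
        layer.trainable = False

# Replace the last fully connected layer with a four-neuron softmax output.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Adam optimizer, as in the paper; the learning rate is an illustrative choice.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
```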
Performance was evaluated by comparing the CNN output to the ground truth, set by the clinical diagnosis of expert readers. Three metrics were used for this purpose: accuracy, sensitivity, and specificity. Confusion matrices as well as receiver operating characteristic (ROC) and precision-recall (PRC) curves, with the corresponding areas under the curve (AUC), were generated. The deep learning model's confidence was assessed using softmax regression on the test set. In addition, multiple kernel density estimates (KDE), a nonparametric probability density estimation, were generated to compare the model's confidence across the four classes.
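As an illustration of this evaluation protocol, per-class sensitivity, specificity, ROC-AUC, and PRC-AUC can be computed in a one-vs-rest fashion, for example with scikit-learn, as sketched below (the variable names and the use of average precision as the PRC-AUC are assumptions, not the authors' code):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score

CLASSES = ["BD", "RP", "STGD", "healthy"]

def per_class_metrics(y_true, y_prob):
    """y_true: integer ground truth labels (n,); y_prob: softmax outputs (n, 4)."""
    y_pred = np.argmax(y_prob, axis=1)
    results = {"overall_accuracy": float(np.mean(y_pred == y_true))}
    for k, name in enumerate(CLASSES):
        truth_k = (y_true == k).astype(int)   # one-vs-rest ground truth for class k
        pred_k = (y_pred == k).astype(int)
        tn, fp, fn, tp = confusion_matrix(truth_k, pred_k, labels=[0, 1]).ravel()
        results[name] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "roc_auc": roc_auc_score(truth_k, y_prob[:, k]),
            "prc_auc": average_precision_score(truth_k, y_prob[:, k]),
        }
    return results
```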

Results
The data used to train, validate, and test the algorithm were composed of 73 FAF images from participants with a normal retina and 410 FAF images from participants with IRDs: 125 FAF images from patients with STGD, 160 FAF images from patients with RP, and 125 FAF images from patients with BD. Of these, 389 FAF images were used for training and validation and the remaining 94 FAF images (23 STGD, 32 RP, 25 BD, and 14 healthy controls) were used for testing. For STGD, the ROC-AUC was 0.998, the PRC-AUC was 0.986, the sensitivity for STGD FAF image classification was 0.96, and the specificity was 1. For RP, the ROC-AUC was 0.999, the PRC-AUC was 0.999, the sensitivity was 1, and the specificity averaged 0.97. For BD, the ROC-AUC was 0.995, the PRC-AUC was 0.988, the sensitivity averaged 0.92, and the specificity averaged 0.97. For healthy controls, the ROC-AUC was 0.998, the PRC-AUC was 0.989, the sensitivity for normal FAF image classification was 0.86, and the specificity averaged 0.99. The overall accuracy for the classification was 0.95.
These results are summarized in Tables 1 and 2 and Figure 2. In order to assess model uncertainty, softmax regression was employed on the test set. The average confidence probability for correctly predicted test images was 0.943 (median 0.993), whereas the average confidence probability for erroneously predicted elements was 0.645 (median 0.595) (Figure 5).

Figure 5. Distribution of softmax probabilities for correct and erroneous predictions on the test set: note that the average confidence probability for correctly predicted test images was 0.943 (median 0.993) (blue box), whereas the average confidence probability for erroneously predicted elements was 0.645 (median 0.595) (green box).
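This comparison of softmax confidence between correct and erroneous predictions can be reproduced with a short helper such as the following sketch (variable names are assumptions; y_prob denotes the softmax outputs on the test set):

```python
import numpy as np

def confidence_by_correctness(y_true, y_prob):
    """Split the maximum softmax probability according to prediction correctness."""
    y_pred = np.argmax(y_prob, axis=1)
    top_prob = np.max(y_prob, axis=1)          # confidence of the predicted class
    correct = top_prob[y_pred == y_true]
    wrong = top_prob[y_pred != y_true]
    return {
        "correct_mean": float(np.mean(correct)),
        "correct_median": float(np.median(correct)),
        "wrong_mean": float(np.mean(wrong)),
        "wrong_median": float(np.median(wrong)),
    }
```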
Moreover, multiple KDE graphs show the highest estimated probability for each of the four classes (Figure 6).

Figure 6. Kernel density estimation graphs: the predicted probabilities for the test FAF images are compared to each ground truth class to which they were originally assigned, leading to four different KDE graphs. A peak towards 1 corresponds to FAF images for which the probability of the predicted class coincides with the tested ground truth class, while a peak towards 0 corresponds to FAF images for which the probability for the tested ground truth class is low and is, therefore, not likely to correspond to the studied ground truth class.
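Per-class KDE curves of this kind can be generated, for instance, with SciPy's Gaussian kernel density estimator, as sketched below (an illustrative sketch, assuming the density is estimated over the softmax probability assigned to each ground truth class):

```python
import numpy as np
from scipy.stats import gaussian_kde

def per_class_kde(y_true, y_prob, n_points=200):
    """For each ground truth class, estimate the density of the softmax probability
    that the model assigned to that class over its test images."""
    grid = np.linspace(0.0, 1.0, n_points)
    curves = {}
    for k in range(y_prob.shape[1]):
        probs_k = y_prob[y_true == k, k]           # probabilities for true class k
        curves[k] = gaussian_kde(probs_k)(grid)    # nonparametric density over [0, 1]
    return grid, curves
```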
The model was then trained separately with only 30 × 30 degree-field-of-view and with only 55 × 55 degree-field-of-view FAF images of the four classes, obtaining a classification accuracy of 0.94 for the 30 × 30 degree-field-of-view FAF images and of 0.94 for the 55 × 55 degree-field-of-view FAF images. The confusion matrices corresponding to these trainings are shown in Table 3.

Discussion
In this study, we demonstrated the feasibility of automated classification of several IRDs using FAF images, employing a convolutional neural network. Our study showed high sensitivity and specificity, with an overall accuracy of 0.95. Fundus autofluorescence imaging, as a metabolic mapping of the retina, provides crucial information for the diagnosis of IRDs and reveals a typical phenotype in each of the IRDs included in this study. Stargardt disease is caused by mutations in the ABCA4 gene. The ABCA4 gene product is an ATP-binding cassette transporter that transports all-trans-retinol produced in a light-exposed photoreceptor outer segment to the extracellular space [31,32]. As a result of the mutation, A2E (N-retinylidene-N-retinylethanolamine) accumulates within the photoreceptor outer segments, which are subsequently phagocytosed by RPE cells. As an in vivo metabolic mapping of the retina, FAF can visualize the high intracellular concentration of A2E as hyperautofluorescent lesions, corresponding to the flecks typically seen in Stargardt disease [12,13]. In late stages of Stargardt disease, hypoautofluorescent atrophy is visible on FAF due to the disappearance of both the RPE and the choriocapillaris [12,13,32].
The most frequent FAF phenotype in retinitis pigmentosa is the hyperautofluorescent parafoveal Robson-Holder ring [12,15], which is not visualized on color fundus photography. The parafoveal hyperautofluorescent ring may be a result of rod system dysfunction, according to the distribution of rod photoreceptors and to the low density of cones outside the foveal area [12]. In the remaining cases, FAF in retinitis pigmentosa displays either abnormal central hyperautofluorescence extending centrifugally from the fovea (18% of cases) or neither pattern (24% of cases) [12,33].
Best vitelliform macular dystrophy is caused by mutations in the BEST1 gene located on chromosome 11q13, which encodes bestrophin-1, a protein localized to the basolateral surface of the RPE. Spaide et al. hypothesized that the central accumulation of a well-demarcated hyperautofluorescent vitelliform deposit is subsequent to the inadequate removal of subretinal fluid, leading to physical separation of the photoreceptors from the RPE and progressively resulting in the accumulation of lipofuscin at the outer side of the neurosensory retina (due to shedding of the outer segment discs that cannot be phagocytosed by the RPE cells) [6]. Several progression stages of BD are reflected by FAF imaging, from initial increased autofluorescence to late-stage atrophy of the photoreceptors [6,12,14]. Fundus autofluorescence is therefore an important tool for the phenotypic characterization of retinal dystrophies.
Furthermore, by using integrated gradient visualization, we were able to ascertain the impact of each pixel on the classification and to visualize the areas that the model relies on to predict one class or the other. Interestingly, the regions of interest for the CNN corresponded to the areas of interest described above, i.e., increased autofluorescence that is either parafoveal, as in the Robson-Holder ring, or focal, as in flecks or vitelliform deposits. When well-circumscribed hypoautofluorescent atrophy was present, integrated gradient visualization demonstrated that this feature was also taken into account. These findings are illustrated in Figure 3. However, no reference databases are available to consistently classify normal and pathological FAF phenotypes.
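For reference, integrated gradients of the kind used here can be approximated in TensorFlow as in the sketch below (not the authors' implementation; the black-image baseline and the number of interpolation steps are illustrative choices):

```python
import tensorflow as tf

def integrated_gradients(model, image, target_class, steps=50):
    """Approximate integrated gradients for a single image of shape (H, W, 3),
    already preprocessed to the scale expected by the model."""
    baseline = tf.zeros_like(image)  # black image used as the reference point
    # Linear interpolation between the baseline and the input image
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps + 1), (-1, 1, 1, 1))
    interpolated = baseline[None, ...] + alphas * (image[None, ...] - baseline[None, ...])
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        predictions = model(interpolated)
        target = predictions[:, target_class]
    grads = tape.gradient(target, interpolated)
    # Trapezoidal approximation of the path integral of the gradients
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)
    return (image - baseline) * avg_grads  # per-pixel attribution map
```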
To date, the use of both deep learning classification for IRDs and FAF imaging for deep learning purposes has been scarce in the literature. Concerning the automated classification of IRDs, there has been one study by Fujinami-Yokokawa et al. [34], using OCT and a commercially available deep learning platform (Inception-v3 CNN) [35]. The authors reported a mean overall test accuracy of 0.909. However, in their study, three different OCT devices were used and the total number of OCT macular images was 178. Moreover, the authors performed four repeated tests with a significant increase in test accuracy, which may suggest overfitting of their model. Recently, Shah et al. demonstrated that it is possible to use deep learning classification models to differentiate between normal OCT images and STGD OCT images and to distinguish the severity of STGD from OCT images. Using a small dataset, the authors employed a pretrained model (VGG19) and a new classification model, obtaining an accuracy of 0.996, a sensitivity of 99.8%, and a specificity of 98.0% for the pretrained model and an accuracy of 0.979, a sensitivity of 97.9%, and a specificity of 98.0% with the new classification model [36]. The high accuracy, sensitivity, and specificity are consistent with our results (Table 1).
FAF imaging has been used in deep learning-based algorithms to automatically detect and classify GA by Treder et al. [21], to detect chorioretinal atrophy by Ometto et al. [26], and to distinguish GA from Stargardt disease by Wang et al. [27]. In the study performed by Treder et al., two classifiers were built to differentiate between GA and healthy-eye FAF images and between GA and a group named other retinal diseases (ORD), with a training and validation set of 200 GA FAF images, 200 healthy-eye FAF images, and 200 FAF images of ORD. The test set consisted of 60 previously unseen FAF images in each case (30 GA, 30 healthy, or 30 ORD). For the GA classifiers, their model achieved training accuracies of 99% and 98% and validation accuracies of 96% and 91%, respectively. Wang et al. used 320 FAF images from normal subjects, 320 FAF images with GA, and 100 with Stargardt disease in the atrophic stage, obtaining a high screening accuracy of 0.98 for GA and 0.95 for atrophic Stargardt disease [27]. These excellent results confirm that automated classification with a deep learning classifier is possible in GA using FAF images.
Considerable efforts continue to be made to develop automated image analysis systems for the precise detection of disease in several medical specialties. In recent years, the use of CNNs has become increasingly popular for feature learning and object classification. Following the ImageNet Large Scale Visual Recognition Challenge, Russakovsky and collaborators demonstrated that the object classification capabilities of CNN architectures can surpass those of humans [37]. While the applications of artificial intelligence have mainly focused on diabetic retinopathy and age-related macular degeneration or glaucoma [19][20][21][22][23], IRDs would make an interesting candidate due to the typical, symmetrical phenotype of these disorders.
Our study has several limitations, one of which is the use of a small dataset. Moreover, our deep learning classifier was trained to distinguish only between three of the more common IRDs, for which sufficient training data were available. In addition, eye-level partitioning of the dataset and the use of a single training/validation/test split are further limitations, owing to sample variability and to the fact that the model might not perform as well when trained on different images. The lack of molecular genetic testing for a part of the included eyes is another limitation. Due to the vast spectrum and the genotypic and phenotypic variability of IRDs, it is difficult to assess how such a classifier would perform in a clinical setting. Moreover, image noise and the presence of interindividual and intraindividual variability in terms of media opacities, lipofuscin content, and genetic expression affect FAF imaging and may become a significant challenge.
Confidence estimation allows model uncertainty to be quantified. This is of the utmost importance when the deep learning model has to make predictions in a clinical setting, possibly on out-of-distribution data (i.e., data differing from the distribution on which the model was trained), resulting in variations in accuracy. Interestingly, in our series, the average confidence probability for correctly predicted FAF images was 0.943 (median 0.993), whereas the average confidence probability for erroneously predicted FAF images was 0.645 (median 0.595). This analysis shows that the deep learning model is more confident when classifying images correctly than when it classifies them incorrectly (Figure 5). Furthermore, the KDE graphs (Figure 6) offer complementary information on the model's confidence in correctly predicting the four classes.
Moreover, our limited dataset made further classification according to disease stage impossible. Nevertheless, given that IRDs are orphan diseases, large datasets would only be available through multi-institutional collaborations. With further development, this model may become a diagnostic tool and may provide relevant information for future therapeutic approaches.