Deep Learning to Distinguish ABCA4-Related Stargardt Disease from PRPH2-Related Pseudo-Stargardt Pattern Dystrophy

(1) Background: Recessive Stargardt disease (STGD1) and multifocal pattern dystrophy simulating Stargardt disease (“pseudo-Stargardt pattern dystrophy”, PSPD) share phenotypic similarities, making the clinical diagnosis difficult. Our aim was to assess whether a deep learning classifier pretrained on fundus autofluorescence (FAF) images can assist in distinguishing ABCA4-related STGD1 from PRPH2/RDS-related PSPD, and to compare its performance with that of retinal specialists. (2) Methods: We trained a convolutional neural network (CNN) using 729 FAF images from patients with normal FAF or with inherited retinal diseases (IRDs). Transfer learning was then used to update the weights of a ResNet50V2 classifying 370 FAF images into STGD1 and PSPD. Retina specialists evaluated the same dataset. The performance of the CNN and that of the retina specialists were compared in terms of accuracy, sensitivity, and precision. (3) Results: The CNN accuracy on the test dataset of 111 images was 0.882. The AUROC was 0.890, the precision was 0.883, and the sensitivity was 0.883. The accuracy for retina experts averaged 0.816, whereas for retina fellows it averaged 0.724. (4) Conclusions: This proof-of-concept study demonstrates that, even with small databases, a pretrained CNN is able to distinguish between STGD1 and PSPD with good accuracy.


Introduction
The adenosine triphosphate-binding cassette, subfamily A, member 4 (ABCA4) gene encodes a membrane-associated protein located in the outer segment (OS) disc membranes of rod and cone photoreceptors [1,2]. Mutations in the ABCA4 gene are a known cause of recessive Stargardt disease (STGD1). STGD1 follows an autosomal recessive pattern of inheritance, with usual disease onset in the second decade of life. Nevertheless, albeit rare, late-onset forms of STGD1 do exist [3,4]. The characteristic fundus features of STGD1 include irregular yellow-white fundus flecks, atrophic macular lesions, and sparing of the peripapillary area by both flecks and atrophy [4,5,6].
Although the association of these findings is considered pathognomonic of STGD1, the phenotype is not necessarily exclusive to STGD1. Multifocal pattern dystrophy simulating STGD1 ("pseudo-Stargardt pattern dystrophy") is an autosomal-dominant inherited retinal disease, caused by a PRPH2/RDS mutation, that may simulate STGD1 [7]. Pseudo-Stargardt pattern dystrophy (PSPD) patients may display yellowish fundus flecks and chorioretinal atrophy [7].
Fundus autofluorescence (FAF) is an in vivo imaging method for the metabolic mapping of lipofuscin accumulation in the retina [8,9,10]. Studies on animal models, as well as histopathological studies, have demonstrated that the formation of retinal pigment epithelium (RPE) cell lipofuscin is augmented and pathogenic in ABCA4-associated disease [5,10,11,12]. Moreover, quantitative autofluorescence (qFAF) studies have shown that high levels of qFAF are a hallmark of ABCA4-positive patients, but qFAF levels were also elevated in eyes with a PRPH2/RDS mutation [13]. Therefore, in both STGD1 and PSPD, FAF reveals a similar phenotype.
Despite the phenotypic overlap between STGD1 and PSPD, their natural histories differ: PSPD patients experience only mild vision loss until the advanced stages of the disease and hence have a better prognosis than STGD1 patients, making a correct diagnosis essential.
Deep learning approaches require large volumes of high-quality training data, which may be difficult to obtain for rare inherited retinal diseases with genetic confirmation. Moreover, the pretrained networks used in the current literature typically undergo transfer learning from the ImageNet dataset, which contains color photographs. Thus, our objective was to assess whether a deep learning classifier, pretrained on various FAF images, can assist in distinguishing ABCA4-related STGD1 from PSPD caused by mutations in PRPH2/RDS, and to compare these findings with the grading of retinal specialists.

Image Database
Patients retrospectively included in this study had (1) a confirmed ABCA4 mutation and associated STGD1 at various stages of progression or (2) a confirmed PRPH2/RDS-associated disease with a PSPD phenotype. Genetic diagnosis was established by high-speed sequencing of genes involved in hereditary diseases on HiSe. Of note, only one pathogenic ABCA4 variant was required for STGD1 patients to be included. This retrospective study was conducted in accordance with the tenets of the Declaration of Helsinki. Written consent was waived due to the retrospective nature of the study.
We used macula-centered fundus autofluorescence retinal images from genetically confirmed eyes with STGD1 or PSPD, in the Department of Ophthalmology of Créteil, France. FAF images had been obtained in the Ophthalmology outpatient clinic of the Department of Ophthalmology in Créteil between April 2007 and April 2020, using Spectralis HRA + OCT (Heidelberg Engineering, Heidelberg, Germany). High-resolution (1536 × 1536 pixels), 30° × 30°, as well as 55° × 55° field-of-view images, centered on the fovea, with a minimum averaging of 30 frames, were extracted. One FAF image per eye per year was extracted for each patient. All images were deidentified, and all personal data (e.g., patient name, birth date, and study date) were removed. FAF images were not cropped but were resized to 224 × 224 pixels to meet the network specifications, with the fovea at the center, and were labeled as either STGD1 or PSPD, according to the specific mutation identified by genetic testing.
A two-class classification system (STGD1 and PSPD) was implemented. The images were then randomly partitioned into three sets using the train_test_split function of scikit-learn: a training set (60% of the images), a validation set (10% of the images), and a test set (30% of the images). The training, validation, and test data were strictly separated to prevent correlations: images of patients used for training the deep learning classifier were not used to test it.
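The patient-level separation described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's actual code (which used scikit-learn's train_test_split); the record structure and the `split_by_patient` helper are hypothetical.

```python
import random

def split_by_patient(images, train=0.6, val=0.1, seed=42):
    """Partition image records into train/val/test sets (60/10/30 by default),
    keeping all images of a given patient in a single set so that the
    classifier is never tested on patients seen during training."""
    patients = sorted({img["patient"] for img in images})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    n_train = round(n * train)
    n_val = round(n * val)
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {name: [img for img in images if img["patient"] in ids]
            for name, ids in groups.items()}

# Hypothetical toy records: 10 patients with three FAF images each.
images = [{"patient": p, "label": "STGD1" if p < 8 else "PSPD"}
          for p in range(10) for _ in range(3)]
splits = split_by_patient(images)
```

Splitting by patient rather than by image is what guarantees the "strict separation" stated above, since several images of the same eye are highly correlated.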

Development of a Deep Learning Classifier
We used ResNet50V2 to perform the classification task. We pretrained the ResNet50V2 model using 729 FAF images from patients with either normal FAF (73 images) or IRDs without genetic confirmation (656 images). The loss function used was cross-entropy. The model was optimized using the Adam optimization algorithm [14]. A reduce-learning-rate-on-plateau schedule was applied to allow the optimizer to find the minimum of the loss surface more efficiently. The model was then evaluated with the test set of 111 images (Table 1). Using integrated gradients, attribution maps were generated, allowing the impact of each pixel on the classification to be assessed and showing which areas the model relies on to perform the classification [15]. The method is summarized in Figure 1. The performance was assessed by comparing the CNN's output to the ground truth, which was set by genetic confirmation of either STGD1 (ABCA4 mutation) or PSPD (PRPH2/RDS mutation).
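The plateau-based schedule works as follows: when the monitored validation loss fails to improve for a set number of epochs, the learning rate is multiplied by a reduction factor. A minimal sketch of this logic is given below; the class and its default values are illustrative (in practice, the Keras ReduceLROnPlateau callback provides this behavior).

```python
class ReduceLROnPlateau:
    """Minimal sketch of a reduce-learning-rate-on-plateau schedule:
    when the monitored validation loss fails to improve for `patience`
    consecutive epochs, the learning rate is multiplied by `factor`,
    never dropping below `min_lr`."""
    def __init__(self, lr=1e-3, factor=0.1, patience=3, min_lr=1e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

# Example: with patience=2, two epochs without improvement trigger a reduction.
sched = ReduceLROnPlateau(lr=1e-3, factor=0.1, patience=2)
for loss in [1.0, 0.9, 0.95, 0.95]:
    lr = sched.step(loss)
```

This lets the optimizer take large steps early on and progressively smaller ones once the validation loss stops improving.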

Figure 1. Various fundus autofluorescence images were extracted from the Créteil database and were used to train ResNet50V2.
The pretrained network was used to classify genetically confirmed STGD1 and PSPD FAF images. The images were randomly partitioned into three sets: the training set (60% of the images), the validation set (10% of the images), and the test set (30% of the images). Data augmentation was performed on the training set to increase the original dataset and to reduce overfitting of the final model. The model was optimized using the Adam Optimization Algorithm. The model was then evaluated with the test set of 111 images. The output of the model was the metric evaluation of the performance of the model (accuracy, sensitivity, specificity, precision, recall, F1-score) and integrated gradient visualization.

Evaluation of Retina Specialists' Performance
Graders evaluated the FAF images to distinguish between STDG1 and PSPD based on typical features. For STGD1, typical features were considered to be the presence of hyperautofluorescent flecks, presence/absence of hypoautofluorescent areas of atrophy, and presence of peripapillary sparing. For PSPD, the typical features were considered to be central hypoautofluorescent lesion with jagged border, butterfly-shaped hyperautofluorescent lesions in the macula, and absence/presence of peripapillary sparing [7,8,13].
The graders were masked to the mutation, to the grades from previous retinal imaging, and to all clinical data. Four graders (2 senior retina specialists: E.S. and O.Z., and 2 retina fellows: D.S. and P.D.) evaluated each FAF image independently. The same metrics (accuracy, sensitivity, specificity) were computed for human grading. The expert grader's performance was compared to the deep learning classifier's performance.

Deep Learning Classifier
In this study, we included 304 FAF images from 80 eyes of 40 patients with genetically confirmed STGD1 (mean age, 47.20 ± 18.73 years) and 66 images from 18 eyes of nine patients with PSPD (mean age, 51.22 ± 12.81 years). Table 2 summarizes their demographic and genetic data.

Using a pretrained classifier on FAF images, we obtained an overall accuracy of 0.882 on the test dataset of 111 images. The AUROC was 0.89. The test loss was 0.413. The precision was 0.883. The recall (sensitivity) was 0.883. The F1-score was 0.884. Of the 91 STGD1 FAF images, 88 were correctly classified. The sensitivity for STGD1 was 0.967, while the specificity was 0.50. The positive predictive value was 0.897 and the negative predictive value was 0.769 for STGD1.
Of the 20 PSPD FAF images, 10 were correctly classified. Conversely, the sensitivity for PSPD was 0.5, with a specificity of 0.967. The positive predictive value was 0.769 and the negative predictive value was 0.897 for PSPD. The results on the training, validation, and test sets are presented in Table 3.
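These per-class figures follow directly from the test-set confusion matrix reported above (88 of 91 STGD1 and 10 of 20 PSPD images correctly classified), taking STGD1 as the positive class:

```python
# Test-set confusion matrix, with STGD1 as the positive class:
tp, fn = 88, 3     # STGD1 images classified as STGD1 / misread as PSPD
tn, fp = 10, 10    # PSPD images classified as PSPD / misread as STGD1

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 98/111 ≈ 0.88
sensitivity = tp / (tp + fn)                    # 88/91  ≈ 0.97 (for STGD1)
specificity = tn / (tn + fp)                    # 10/20  = 0.50
ppv         = tp / (tp + fp)                    # 88/98  ≈ 0.90
npv         = tn / (tn + fn)                    # 10/13  ≈ 0.77
```

Swapping the positive class to PSPD exchanges sensitivity with specificity and PPV with NPV, which is why the PSPD figures mirror the STGD1 ones.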

Evaluation of Retina Specialists' Performance
The retina specialists consisted of two retina experts (E.S. and O.Z.) and two retina fellows (D.S. and P.D.). The diagnostic accuracy for retina experts averaged 0.816, whereas for retina fellows it averaged 0.724. The sensitivity for the detection of STGD1 on FAF imaging was higher than for PSPD detection, both for retina experts and for retina fellows (for retina experts: 0.828 versus 0.777; for retina fellows: 0.828 versus 0.363). Specificity was higher for PSPD than for STGD1, for both retina experts and retina fellows. The intraclass correlation coefficient (ICC) between the four human graders was 0.242. The retina specialists' performance in distinguishing STGD1 from PSPD on FAF images is summarized in Table 4.


Discussion
In this study, we evaluated the performance of a deep learning classifier in distinguishing between STGD1 and PSPD on FAF, and we compared the results with those of retina specialists with varying levels of expertise. The ground truth consisted of STGD1 and PSPD with a genetic molecular diagnosis. The human graders were masked to the mutation, to the grades from previous retinal imaging, and to all clinical data.
The accuracy of the DL model was non-inferior to that of the retina experts (accuracy: 0.88 versus 0.816), as shown in Tables 3 and 4. Regarding the DL model's performance, the training loss in Table 3 indicates how well the model fitted the images in the training dataset, whereas the validation loss indicates how well the model fitted new data from the validation dataset. Differences between training and validation loss may be signs that the model is overfitting or underfitting, but they also depend on model regularization and may reflect the difficulty of the images in the respective datasets, and their proportion. In our model, we used 60% of the data to train the model and only 10% to validate it, which may explain the difference between the training and validation loss. However, when applied to the test dataset containing 30% of the images, the test loss was very close to the training loss, suggesting a rather good generalization capacity of the model.

Moreover, the accuracy of the model was superior to that of the retina fellows (relying solely on FAF images), who had an accuracy of 0.724 (Table 4). Although the retina specialists had a lower overall accuracy, their sensitivity and specificity for each class were more consistent than the same metrics of the DL model. The higher accuracy of the DL model was derived from its higher recall/sensitivity compared to human readers (0.883 versus 0.790 for retina experts and 0.595 for retina fellows). Nevertheless, the class imbalance (304 STGD1 FAF images and 66 PSPD images) was reflected by poor sensitivity results for PSPD by the DL model. Indeed, with only 10 out of 20 PSPD images correctly classified by the model in the test set, despite the model's overall good accuracy, the proportion of PSPD images correctly identified was 50%.
Moreover, a poor specificity was found for STGD1, showing that the proportion of true negatives correctly identified was low and that of false positives high, due to the misclassification of PSPD as STGD1.
There are several reasons for these low values of sensitivity and specificity. STGD1 and PSPD are phenotypically similar on multimodal imaging, including FAF. While the model correctly relies on pixels within the areas of atrophy and flecks, the similarity between the two phenotypes, and the inconsistency of differentiating features (i.e., peripapillary sparing), may mislead the DL model, just as they may the clinician. An important point is that, on 30° × 30° FAF images, the visibility of the optic disc, and hence of peripapillary sparing (when present), was inconsistent, which may further explain the model's performance.
Furthermore, the class imbalance may also have contributed to the difference between the specificity and sensitivity of each class. Moreover, functional testing, such as full-field ERG in PSPD, is often normal [13], but high variability is present even in families carrying the same mutation [7]. In previous studies, while neither FAF nor spectral-domain optical coherence tomography (SD-OCT) was able to distinguish between STGD1 and PRPH2-associated PSPD, qFAF was shown to be lower in patients with PSPD compared to STGD1 [13,16]. As genetic molecular diagnosis is not easily carried out, and as the prognosis of (late-onset) STGD1 differs from that of PSPD, with further improvements and larger datasets such a DL model may assist physicians in differentiating the two diseases.
Given the inherently small databases, and in order to improve the classification accuracy, ResNet50V2 was pretrained with various FAF images to learn how to extract features specific to FAF. Despite our limited database (304 FAF images of STGD1 and 66 images of PSPD) and the class imbalance, the pretrained ResNet50V2 obtained a good accuracy of 0.88. Other approaches could have been implemented to deal with the small sample size. In practice, it takes hundreds of thousands of images to optimize the innumerable parameters of a CNN. We pretrained ResNet50V2 using 729 FAF images from patients with either normal FAF (73 images) or other IRDs without genetic confirmation (656 images); pretraining the CNN with a larger sample size could have led to a better performance. We performed data augmentation only on the training set, within the Keras framework. Although augmentation of the entire input database was previously performed in a study classifying diabetic retinopathy using OCT angiography [17], this approach may leak small variations of the training dataset into the test dataset, leading to an overestimation of the DL model's accuracy, and was therefore not used here.
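The leakage concern can be illustrated with a toy example: if every source image is augmented before the split, variants of the same source can land on both sides of it, whereas augmenting only the training set after the split cannot contaminate the test set. The `augment` stand-in and the numbers below are purely illustrative.

```python
import random

def augment(img_id, n=4):
    # Stand-in for flips/rotations: each variant remembers its source image.
    return [(img_id, k) for k in range(n)]

rng = random.Random(0)
base = list(range(10))                      # 10 hypothetical source images

# Leaky order: augment the whole database, then split 30/10.
pool = [v for i in base for v in augment(i)]
rng.shuffle(pool)
train, test = pool[:30], pool[30:]
leaked = {i for i, _ in train} & {i for i, _ in test}   # sources on both sides

# Safe order (as in this study): split the sources first, then augment
# only the training images; the test set stays free of near-duplicates.
rng.shuffle(base)
train_ids, test_ids = set(base[:7]), set(base[7:])
train_aug = [v for i in train_ids for v in augment(i)]
overlap = {i for i, _ in train_aug} & test_ids          # always empty
```

With these proportions the leaky order is guaranteed to place variants of at least one source in both sets, so the test metrics would partly measure memorization rather than generalization.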
Whereas large-scale, publicly available databases, predominantly consisting of color fundus photography and/or OCT images, exist for age-related macular degeneration and diabetic retinopathy owing to their high incidence and prevalence, this is not the case for IRDs. In IRDs, deep learning applications are developing, but are still at an early stage due to the rarity of these diseases. Our group recently focused on the application of DL to classify IRDs using FAF, obtaining excellent accuracies when distinguishing STGD1, retinitis pigmentosa, BD, and healthy controls [18], as well as when distinguishing chorioretinal atrophy of genetic or degenerative causes [19]. Furthermore, the literature has also recently focused on distinguishing STGD1 from healthy controls, using OCT as the imaging technique for DL classification and obtaining high accuracies despite relatively small datasets [20]. Other groups have focused on predicting causative genes in IRDs from color fundus photography, FAF imaging [21], or SD-OCT [22].
This study has several limitations, the main one being the small dataset. Given that STGD1 and PSPD, while presenting a phenotypic overlap on FAF imaging, result from mutations in different genes (ABCA4 for STGD1 and PRPH2 for PSPD), a genetic molecular diagnosis for all included eyes was essential. To date, no reference databases of FAF imaging are available; such databases would be extremely useful, even more so when dealing with rare, orphan diseases. Therefore, the ultimate and best solution for increasing the accuracy of our model would be multi-institutional collaborations. Another limitation is the fact that we included various stages of STGD1, with various degrees of phenotypic overlap with PRPH2-related PSPD. Finally, while the image size was reduced to fit the CNN requirements, the FAF images were not cropped, leading to variability within the database with regard to the visualization of the optic disc and peripapillary sparing, which may have additionally impacted the results.
Last but not least, it is important to keep in mind that the comparison of a deep learning classifier and retina specialists in distinguishing STGD1 from PSPD relies solely on the features in a single FAF image. In a clinical setting, a physician's diagnosis takes into consideration other clinical parameters, such as the age of the patient, time of onset, family history, and a battery of other imaging and functional tests. The number of graders, making agreement more difficult to reach, as well as the phenotypic similarity between STGD1 and PSPD, may explain the only fair agreement between the different human readers. This highlights the relevance of the CNN in detecting these phenotypically similar diseases, which present diagnostic challenges to the clinician when relying solely on FAF imaging. Larger studies with larger datasets and, ideally, a multimodal approach are needed before implementation in a clinical setting.

Conclusions
Our results show the efficiency of training a CNN with transfer learning, generating a stable classification performance despite the small dataset. Therefore, pretraining the model with the same type of imaging may prove useful when it is difficult to amass image data, such as in cases of rare genetic diseases.
This proof-of-concept study demonstrates that a DL-based distinction between STGD1 and PSPD is possible using FAF images, with a good accuracy, sensitivity, and specificity compared to retinal experts relying solely on FAF images.

Conflicts of Interest:
The authors declare no conflict of interest.