Evaluating a Deep Learning Diabetic Retinopathy Grading System Developed on Mydriatic Retinal Images When Applied to Non-Mydriatic Community Screening

Artificial intelligence has demonstrated a clear capability to automatically grade diabetic retinopathy (DR) on mydriatic retinal images captured by clinical experts on fixed table-top retinal cameras within hospital settings. However, in many low- and middle-income countries, screening for DR revolves around minimally trained field workers using handheld non-mydriatic cameras in community settings. This prospective study evaluated the diagnostic accuracy of a deep learning algorithm developed using mydriatic retinal images by the Singapore Eye Research Institute, commercially available as Zeiss VISUHEALTH-AI DR, on images captured by field workers on a Zeiss Visuscout® 100 non-mydriatic handheld camera from people with diabetes in a house-to-house cross-sectional study across 20 regions in India. A total of 20,489 patient eyes from 11,199 patients were used to evaluate algorithm performance in identifying referable DR, non-referable DR, and gradability. For each category, the algorithm achieved precision values of 29.60 (95% CI 27.40, 31.88), 92.56 (92.13, 92.97), and 58.58 (56.97, 60.19), recall values of 62.69 (59.17, 66.12), 85.65 (85.11, 86.18), and 65.06 (63.40, 66.69), and F-score values of 40.22 (38.25, 42.21), 88.97 (88.62, 89.31), and 61.65 (60.50, 62.80), respectively. Model performance reached 91.22 (90.79, 91.64) sensitivity and 65.06 (63.40, 66.69) specificity at detecting gradability and 72.08 (70.68, 73.46) sensitivity and 85.65 (85.11, 86.18) specificity for the detection of all referable eyes. Algorithm accuracy is dependent on the quality of acquired retinal images, and this is a major limiting step for its global implementation in community non-mydriatic DR screening using handheld cameras. This study highlights the need to develop and train deep learning-based screening tools in such conditions before implementation.


Introduction
There are 463 million people with diabetes in the world; 80% of this population reside in low- and middle-income countries (LMIC), where resources are limited, and 30% present with diabetic retinopathy (DR) [1,2]. Regular screening for DR is recommended to identify vision-threatening DR (VTDR), an avoidable cause of blindness, and treat it promptly [3].
Many high-income countries have established DR screening as a public health programme, recommending yearly screening of people with diabetes [4]. DR screening is conducted at fixed locations, and images of the central retina, acquired after pupil dilation by trained screeners using standardised table-top retinal cameras, are graded for DR by qualified graders. Patients with VTDR, or those whose images are ungradable, are referred to ophthalmic departments for further management. This successful screening model is laborious and cannot be translated to LMIC [5], where opportunistic DR screening is performed by minimally trained field workers in medical camps or public spaces, and pupil dilation is not routinely carried out due to restrictive policies. A major challenge of this strategy is the use of handheld retinal cameras without stabilising platforms through non-mydriatic pupils, which has been reported to drop the proportion of gradable images by about 20% [6]. Moreover, image acquisition by field workers in communities with limited healthcare access can be challenging due to the increased prevalence of undiagnosed co-pathologies, especially cataract [7].
One solution to improve the efficiency of DR screening programmes is to use automated algorithms to grade the retinal images. The advent of deep neural network (DNN) approaches in recent years has produced a wide range of automated systems that are demonstrating benefits across several healthcare disciplines [8], and DR screening is no exception [9][10][11]. However, DNNs are typically trained on retinal images captured through dilated pupils to ensure high diagnostic accuracy. The algorithms are developed to identify referable images based on standard DR severity scales [12][13][14][15][16][17], such as the International Clinical Diabetic Retinopathy (ICDR) severity scale [18,19]. Some of these automated algorithms are already approved by regulators and implemented in a few screening programmes. Based on these reports, many manufacturers have also incorporated these algorithms into their low-cost cameras for instant offline grading of retinal images obtained from population-based screening in LMIC [17].
However, there is a paucity of studies that have evaluated the diagnostic accuracy of automated algorithms for the grading of retinal images captured on handheld cameras through non-mydriatic pupils [20,21]. Moreover, there are no reports of real-world implementation of automated grading in a multicentre DR screening programme in India.
In this study, we evaluated the diagnostic accuracy of the deep learning algorithm developed by the Singapore Eye Research Institute (SERI) on mydriatic retinal images, which is commercially deployed as Zeiss VISUHEALTH-AI DR, for grading retinal images captured by field workers using handheld cameras through non-dilated pupils, against human graders, in a real-world community DR screening programme in India. We report the performance of the automated DR system at predicting three possible outcomes: (1) referable DR, (2) non-referable DR, and (3) ungradable image. In addition, we evaluated the performance of the algorithm in detecting gradability (referable and non-referable) and eyes that require hospital referral, defined as the total of ungradable and referable DR images. Finally, we report the regional variations in outcomes, intergrader agreement, and agreement between human graders and the algorithm.

Study Settings
Anonymised retinal images used in this study were captured as part of the SMART India study, which aimed to increase research capacity and capability to tackle the burden of blindness due to DR in India [22]. In this prospective, cross-sectional, community-based study, door-to-door surveys were conducted, point-of-care non-laboratory tests were performed, and retinal images were obtained using a non-mydriatic handheld fundus camera from people with diabetes in each household at 20 pre-defined sites (Table S1). Each site included both rural and urban areas across India. Field workers were trained to capture a set of at least two gradable retinal photographs from each eye through non-dilated pupils using the Zeiss Visuscout® 100 camera. For each patient, a variable number of macula and optic disc images of each eye were taken to acquire the best possible images. When the acquisition of retinal images was not possible, potentially due to cataract or a small pupil, the same camera was used to take photographs of the anterior segment. Retinal fundus photographs were obtained from subjects with known diabetes or who, on the day of the survey, had a high random blood sugar of 160 mg/dL (8.9 mmol/L) or higher.

Image Grading by Graders (Reference Standard)
Retinal photographs captured by field workers were uploaded to a database for independent grading by primary (on-site) and secondary graders (in a Reading Centre). Each eye was independently graded by each grader, either a trained optometrist or ophthalmologist, and all images per eye were available to the graders. Senior ophthalmologists at each Reading Centre arbitrated discrepancies. Patient eyes were graded as no, mild, moderate, severe, or proliferative DR as per the ICDR severity scale [18,19], or as ungradable. The disease severity graded by the human graders for each patient eye was mapped to one of three outcomes: (1) referable DR: moderate non-proliferative DR or worse, with or without macular oedema (hard exudates/thickening around the fovea); (2) non-referable DR: eyes with no DR or mild DR; and (3) ungradable. The final grade from human graders on the basis of all images per patient eye was used as the reference standard. Graders were masked with respect to the automated algorithm grades.
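The derivation of the reference standard amounts to a simple mapping from an ICDR severity grade to one of the three study outcomes. A minimal sketch of that mapping (the function name and grade labels are illustrative, not taken from the study's grading software):

```python
# Map a human grader's ICDR severity grade to one of the three
# reference-standard outcomes. Grade labels are illustrative placeholders.
def reference_outcome(icdr_grade: str) -> str:
    if icdr_grade == "ungradable":
        return "ungradable"
    if icdr_grade in ("no DR", "mild"):
        return "non-referable"   # no DR or mild DR
    return "referable"           # moderate NPDR or worse
```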

Image Grading by Automated Algorithm (Index Test)
VISUHEALTH-AI DR (software version 1.8) is an automated screening web service that uses deep neural networks to automatically review patients' fundus images for the presence of DR. The screening solution is indicated to categorize single-field, macula-centred, 40-degree non-mydriatic fundus images taken with the VISUSCOUT 100 and delivers three possible outcomes: referable DR, non-referable DR, and ungradable image (with advice for hospital referral). VISUHEALTH-AI DR is a standalone health software product classified as a class IIa device as per Rule 11 of the Medical Device Regulation (EU-MDR) 2017/745 Annex VIII.
Only colour retinal images that were macula-centred with a visible optic nerve head were selected from the pool of captured images to ensure that the performance of the algorithm was evaluated in accordance with the protocols used for its development. Optic disc-centred images and anterior segment photographs were discarded (Figure S1). The algorithm grading was independent of the human grading.

Outcomes
The fully anonymised/deidentified images available for each patient eye were independently analysed. Image-level outputs were processed and tabulated for the three possible outcomes to obtain the eye-level prediction for evaluation against the reference standard. Patient-eye predictions were derived as follows: at least one referable image resulted in a referable patient eye; non-referable images, with or without ungradable images, resulted in a non-referable patient eye; and two or more ungradable images resulted in an ungradable patient eye.
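The aggregation rules above can be sketched as a short function (a minimal illustration, not the study's processing code; the two-or-more-ungradable condition is implicit when every image for an eye is ungradable):

```python
# Aggregate per-image algorithm outputs into one eye-level prediction,
# following the derivation rules described above.
def eye_level_prediction(image_outputs: list[str]) -> str:
    """image_outputs: 'referable' / 'non-referable' / 'ungradable' per image."""
    if "referable" in image_outputs:
        return "referable"        # at least one referable image
    if "non-referable" in image_outputs:
        return "non-referable"    # non-referable images, ± ungradable ones
    return "ungradable"           # only ungradable images remain
```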
Performance was evaluated as a three-class system as well as at two other relevant binary tasks: "gradability" and "hospital-referable". The gradability assessment evaluated device performance at discerning gradable images (referable + non-referable) from ungradable images. The hospital-referable assessment evaluated the model's ability to discern samples that must be sent for further screening, i.e., referable + ungradable, from non-referable samples. Site, age category, and visual acuity covariates were also studied.
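Both binary tasks are simple regroupings of the three-class outcome; a sketch (names are illustrative):

```python
# Derive the two binary task labels from a three-class outcome:
# gradability groups referable + non-referable against ungradable;
# hospital-referable groups referable + ungradable against non-referable.
def binary_labels(outcome: str) -> tuple[bool, bool]:
    gradable = outcome in ("referable", "non-referable")
    hospital_referable = outcome in ("referable", "ungradable")
    return gradable, hospital_referable
```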

Statistical Analysis
A descriptive analysis of the participant demographics by site (20 sites), age category (≤40, 41-60, 61-70, and >70 years), and visual acuity (VA) category (normal: logMAR VA < 0.4; moderate visual impairment (VI): 0.4 ≤ logMAR VA < 1.0; severe VI: 1.0 ≤ logMAR VA < 1.3; blind: logMAR VA ≥ 1.3 [23]) was performed. The robustness of the algorithm at the different tasks was evaluated by comparing the reference standard to the automated prediction. For the main multiclass task, a full metric report was calculated with precision (positive predictive value), recall, and F-score. For the binary tasks of gradability and hospital-referable performance, sensitivity and specificity were calculated. All reports were also stratified by site, age category, and visual acuity category. Interobserver variability was measured with quadratic weighted Kappa scores (see Table S2 for metric definitions). In all cases, exact Clopper-Pearson 95% confidence intervals were calculated.
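The exact Clopper-Pearson interval is obtained from quantiles of the beta distribution; a minimal sketch using SciPy (an illustration of the method, not the study's analysis code):

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion k/n."""
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi
```

For example, a proportion of 5/10 yields an interval of roughly (0.19, 0.81); the exact method guarantees at least nominal coverage even for the small per-site counts studied here, unlike a normal approximation.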

Results
From a pool of 60,633 retinal fundus images, a total of 29,656 images from 11,199 patients and 20,489 patient eyes were eligible for the study (Figure S1). Table 1 shows participant demographics and eye grade distribution at each site. The average age of the participants was 57.7 (11.1) years, with 5365 males (47.9%). Sites 18 and 14 had the highest and lowest average age of 66.7 (12.3) and 53.7 (8.1) years, respectively. The participants were also categorised by age and visual acuity categories (Table S3).

Grader and Model Assessment
As listed in Table 4, the algorithm reported a Kappa value of 0.47 (95% CI 0.44, 0.50) for referable DR. For the same task, primary and secondary graders showed an agreement of 0.60 (0.57, 0.63) Kappa. When final grades (reference standard, after arbitration) were compared to primary and secondary graders, Kappa values were 0.66 (0.64, 0.69) and 0.84 (0.83, 0.86), respectively.
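Quadratic-weighted Kappa values such as these can be computed with scikit-learn; a toy sketch on invented ordinal labels (0 = non-referable, 1 = referable, 2 = ungradable), not the study data:

```python
from sklearn.metrics import cohen_kappa_score

# Toy grades from two graders, coded ordinally:
# 0 = non-referable, 1 = referable, 2 = ungradable.
grader_a = [0, 1, 2, 1, 0, 0]
grader_b = [0, 1, 1, 1, 0, 2]

# Quadratic weighting penalises disagreements that are further apart
# on the coded scale more heavily than adjacent-category disagreements.
kappa = cohen_kappa_score(grader_a, grader_b, weights="quadratic")
```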

Discussion
We evaluated the accuracy of an offline automated screening algorithm to identify referable DR from fundus images of people with diabetes captured by minimally trained field workers using non-mydriatic handheld cameras in a home environment. To our knowledge, this is the first prospective multicentre study on a considerably large dataset of handheld retinal images taken by multiple field workers in a community setting that mirrors the real-life implementation of such programmes in LMIC. We show that the success of an automated AI algorithm is dependent on the quality of the acquired retinal images. Although validation studies to date have shown that most automatic algorithms where mydriatic fundus images were used have high diagnostic accuracy, our study shows that real-world scenarios of DR screening in India where non-mydriatic DR screening is widely practised pose challenges (Figure 1). Most previous algorithms for detecting referable DR have used mydriatic retinal photographs taken as part of in-clinic screening programmes and acquired using table-top retinal cameras, which contributes to significant dataset differences in terms of image quality.
Gulshan et al. reported, for referable DR, a 90.3% sensitivity and 98.1% specificity for the EyePACS-1 dataset. Similarly, Gargeya et al. [14] showed a sensitivity of 93% and a specificity of 87% on the Messidor-2 dataset [24], and a study by Ting et al. [13] reported a sensitivity of 90.5% and a specificity of 91.6% on their primary validation dataset. More recently, Gulshan et al. [17] presented a prospective study on in-clinic non-mydriatic images from two different sites. The automated DR detection was equal to or worse than the manual grading, with 88.9% sensitivity and 92.2% specificity at one site and 92.1% sensitivity and 95.2% specificity at the other site, highlighting the variability in performance across settings. Few studies have evaluated automated systems for referable DR detection in handheld retinal images or in community settings. Rajalakshmi et al. [20] presented a study on 2408 smartphone-based mydriatic fundus photographs taken by hospital-trained staff in a clinic environment and reported a sensitivity of 95.8% and a specificity of 80.2% at detecting any DR. Similarly, Natarajan et al. presented a pilot study on 223 patients where a smartphone-based automated system was used to detect referable DR. The authors reported 100.0% sensitivity and 88.4% specificity [21].
It is important to highlight that our investigation was substantially different from previous studies, and, therefore, comparisons are not straightforward. Grading non-mydriatic retinal images captured by field workers using a handheld camera in the patient's home entails application-specific challenges.
Mydriatic retinal photography, where resources are available, has been widely shown to be the most effective strategy for high-quality DR screening, since it increases image quality and allows for higher sensitivities [25,26]. On the contrary, non-mydriatic retinal imaging increases screening failure rates resulting from media opacity or small pupils [1]. However, DR screening without mydriasis in primary care premises has been proven to be a valid, cost-effective screening method with advantages not only for patient convenience but also for logistic reasons [27]. These advantages facilitate the development of community-based screening programmes in LMIC. Nevertheless, our study results show that taking image acquisition out of controlled, stress-free hospital premises generates new challenges.
In our study, the evaluation of the specific tasks of gradable and hospital-referable eye detection showed higher performance, suggesting that there is more consensus between human graders and the algorithm in identifying gradable images. In particular, hospital-referable was the most separable class. This is an encouraging outcome, given that both referable DR and ungradable patients must be sent to ophthalmic departments for further management, which makes hospital-referable patient detection the most crucial task when using these automatic algorithms in such DR screening programmes. Although this process will ensure safety, increased referral to hospitals will overburden the already stretched ophthalmic services. In addition, we are reliant on patients attending for retinal examination after pupil dilation, and so the overall effectiveness of such DR screening programmes may be compromised.
Comparatively, the study shows that identifying referable DR is the most challenging task. Our study highlights the limitations of the algorithm in diagnosing referable DR in these community screening settings when handheld cameras are used and the pupils are not dilated. This is of significance for policy-makers in LMIC due to restrictions placed on dilating pupils in many of these countries. It should be noted that human graders had access to multiple fields for grading, whereas the algorithm, following its specifications, was only presented with single-field fundus images; this is known to reduce sensitivity and specificity, particularly in non-mydriatic photography [28][29][30].
Algorithm evaluation by site also showed remarkable variations in performance. Different acquisition settings produce photographs of varying quality, which can have a significant impact on referable DR screening and automated prediction algorithms. This is also reflected by the percentage of ungradable images, which exceeded 30% at some sites.
We also investigated the performance of the algorithm by age and visual acuity of the screened individuals. Varying degrees of performance are shown for the different outcomes analysed. Performance differences for age and visual acuity categories are less remarkable in gradability and hospital-referable tasks. The detection of referable DR was less accurate in older individuals, which may be related to the poorer image quality caused by higher prevalence of co-pathology, especially cataract and small pupils. However, performance by visual acuity did not show the same trend, suggesting that variable image quality may be more related to technical challenges faced by the field workers.
We compared automated grading performance with the intergrader agreement among human graders. Agreement between human graders was also only moderate for all outcomes, which is concordant with the variable agreement reported by previous studies [31,32]. A higher agreement was observed between arbitration graders (who had access to all grades) and secondary graders than with primary graders. Agreement between automated and human grading was the lowest, with the gradability task showing the highest Kappa values.
The main limitation of the study is the unbalanced grading setting between human graders and the algorithm. VISUHEALTH-AI DR is designed to categorize single-field, macula-centred fundus photographs. However, field workers captured a set of two or more retinal photographs that, in all cases, included at least one macula-centred and one optic nerve head-centred image. As a result, the algorithm's prediction is hindered by a partial use of the data that was available to human graders. These study settings mean that our results represent a worst-case scenario for the evaluation of algorithm performance and reveal room for improvement to be explored.
In the future, supervised machine learning methods for DR evaluation must demonstrate robustness on retinal image datasets acquired in the various settings of DR screening to enable widespread implementation and to reduce health inequality. Investigation of automated referable DR systems in community settings with non-mydriatic retinal imaging is a key requirement to develop resource-driven screening programmes in LMIC. Although we strongly recommend mydriatic retinal photography captured on fixed cameras, it is not logistically possible to ensure global coverage of DR screening with such methodologies. Several strategies are required to ensure regular DR screening of people with diabetes around the world, especially when one-tenth of the global population is estimated to have diabetes by 2040, with 80% of them living in countries with limited resources. Manufacturers should follow a two-step strategy when they incorporate automated algorithms into retinal cameras: first, automatically qualify the gradability of a retinal image for its eligibility for automated grading, and then apply the automated algorithm only if the image passes the gradability test [33]. To ensure that automated grading can be implemented globally, images from real-life programmes in LMIC, reflecting their specific acquisition conditions, should be used in the development of automated algorithms to allow models to learn their distinct features and leverage that crucial knowledge.
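The two-step strategy described above can be sketched as a simple gate, where `is_gradable` and `grade_dr` stand in for the two models (hypothetical placeholders, not the VISUHEALTH-AI API):

```python
# Hypothetical two-step screening pipeline: a gradability check gates
# entry to the DR grading model; ungradable images skip grading and are
# flagged for referral directly.
def screen_image(image, is_gradable, grade_dr) -> str:
    if not is_gradable(image):
        return "ungradable"     # advise hospital referral for in-person exam
    return grade_dr(image)      # "referable" or "non-referable"
```

This design spends no grading effort on images that cannot be assessed, and makes the quality gate an explicit, separately tunable component.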
In conclusion, although AI may be more efficient in grading large numbers of retinal images, the quality of captured images in real-world community settings determines the success of any AI system used for non-mydriatic DR screening. To be implemented globally, AI systems should leverage that specific knowledge and use images acquired in such conditions in their development process. In this study, we analysed the performance of the Zeiss VISUHEALTH-AI DR algorithm, developed by SERI, under such premises.