The Clinical Influence after Implementation of Convolutional Neural Network-Based Software for Diabetic Retinopathy Detection in the Primary Care Setting

Deep-learning-based software has been developed to assist physicians with diagnosis; however, its clinical application is still under investigation. We integrated deep-learning-based software for diabetic retinopathy (DR) grading into the clinical workflow of an endocrinology department, where endocrinologists grade retinal images, and evaluated the influence of its implementation. A total of 1432 images from 716 patients and 1400 images from 700 patients were collected before and after implementation, respectively. Using the grading by ophthalmologists as the reference standard, the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) for detecting referable DR (RDR) were 0.91 (0.87–0.96), 0.90 (0.87–0.92), and 0.90 (0.87–0.93) at the image level, and 0.91 (0.81–0.97), 0.84 (0.80–0.87), and 0.87 (0.83–0.91) at the patient level. The monthly RDR rate dropped from 55.1% to 43.0% after implementation, and the monthly percentage of grading completed within the allotted time increased from 66.8% to 77.6%. Agreement between the software and individual endocrinologists after implementation varied widely (kappa values of 0.17–0.65). In conclusion, we observed a clinical influence of the deep-learning-based software on graders without retinal subspecialty training; however, validation using images from local datasets is recommended before clinical implementation.


Introduction
Diabetic retinopathy (DR) is the leading cause of blindness among working-age patients with type 2 diabetes [1]. The prevalence of DR is approximately 24-35% among patients with type 2 diabetes worldwide [2][3][4][5][6], and the burden of vision-threatening DR has been increasing owing to the rapid growth of the diabetic population [5][6][7]. Previous studies have shown that early screening and timely treatment can reduce the risk of worsening DR and blindness [8], and international guidelines recommend that screening for DR should be performed at least once every year for patients with type 2 diabetes [9]; however, adherence to this DR screening program has been alarmingly low [2]. The major barrier to annual DR screening is the lack of trained specialists and equipment to handle the rapidly growing population of patients with diabetes [10].
The use of deep-learning algorithms in the field of DR screening has demonstrated promising results [11]. Recent publications have shown that the diagnostic performance of some deep learning algorithms is similar to, or even better than, that of human experts [12,13]. In addition, these algorithms are highly reproducible and could, in theory, reduce the time and human resources required for screening. However, how to translate these advantages into clinical benefits remains a matter of debate [14,15]. Currently, most studies of deep learning algorithms for DR screening have focused on comparing the performance of the algorithms with the diagnoses of local retinal specialists or regional graders [16,17]. The change in the clinical workflow and the impact on regional graders after the implementation of such software have seldom been discussed.
VeriSee™ (Acer Inc., New Taipei City, Taiwan), a deep-learning-based software product for DR grading, has recently been approved by the Taiwan Food and Drug Administration as a smart medical device based on its performance being comparable to that of ophthalmologists [18]. In this study, we deployed VeriSee™ (referred to here as "the software") in an endocrinology department. We compared the diagnostic accuracy for referable DR (RDR) between the regional graders and the software before its implementation and investigated the changes in the clinical workflow and the influence on the regional graders after its implementation.

Setting and Participants
This cross-sectional study was conducted at Taichung Veterans General Hospital between June and October 2019. The pay-for-performance program for diabetes in Taiwan recommends that patients with diabetes receive annual comprehensive screening for diabetic complications, including DR [19]. We included patients with type 2 diabetes who underwent a fundus examination at our hospital during the study period.

Retinal Imaging
The standard protocol was performed in a dark room to ensure physiologic dilation of the pupils. Retinal images were captured in a single-field, 45-degree view by trained technicians using a digital retinal camera (CR-2, Canon Inc., Tokyo, Japan). Images were retaken if the technician considered them to be of poor quality, and only the best-quality image from the repeated captures was uploaded. All images were collected anonymously for analysis. Finally, image quality was judged independently by each grader, and an image was excluded if any one of the graders judged it to be of poor quality. At the patient level, a patient was excluded if the image of either eye was of poor quality.

Reference Grading
Three ophthalmologists independently graded all retinal images according to the International Clinical Diabetic Retinopathy disease severity scale, which classifies DR as no DR, mild nonproliferative DR (NPDR), moderate NPDR, severe NPDR, or proliferative DR (PDR); moderate NPDR, severe NPDR, and PDR are all classified as RDR. The ophthalmologists graded the images independently and were blinded to the output of the software. Disagreements between two ophthalmologists were adjudicated by a senior retinal specialist. Only images graded identically by at least two ophthalmologists were included in the final analysis, and these grades served as the ground truth.
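The two-out-of-three consensus rule can be sketched as follows (a minimal illustration; the function name and the 0-4 grade encoding are ours, not from the study):

```python
def consensus_grade(g1, g2, g3, adjudicate=None):
    """Return the consensus DR grade (0-4) when at least two of three
    ophthalmologists agree; otherwise defer to a senior retinal specialist
    via `adjudicate`, or return None to exclude the image."""
    grades = [g1, g2, g3]
    for grade in grades:
        if grades.count(grade) >= 2:
            return grade
    return adjudicate(g1, g2, g3) if adjudicate else None

# Grades: 0 = no DR, 1 = mild NPDR, 2 = moderate NPDR, 3 = severe NPDR, 4 = PDR
print(consensus_grade(2, 2, 3))  # -> 2 (two of three agree on moderate NPDR)
print(consensus_grade(0, 1, 2))  # -> None (three-way disagreement, no adjudicator)
```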

The Deep Learning Algorithm
The development of the software has been described previously [18]. Briefly, the model is a convolutional neural network based on the Inception V4 architecture: the number of layers, neurons, loss function, and activation function are the same as in Inception V4, and the other hyperparameters were fine-tuned to obtain optimal accuracy [20]. After pretraining with large public retinal image datasets, the deep learning model was fine-tuned with 5649 retinal images. The final model detected RDR with a sensitivity of 89.2%, a specificity of 90.1%, and an area under the receiver operating characteristic curve (AUC) of 0.950 [18].

Clinical Workflow
In our clinical practice, retinal images were graded by five endocrinologists. Approximately 700 patients received fundus examinations per month, and the five endocrinologists were responsible for all of them: each was assigned the retinal images of approximately 140 patients and was requested to complete grading within three days after the examination. In October 2019, the software was integrated into our clinical workflow: the preliminary VeriSee™ report, together with the corresponding retinal images, was automatically uploaded to the report system, where endocrinologists could review it before making the final decision. Aware of the diagnostic accuracy of VeriSee™, endocrinologists could either confirm the grading if they agreed with the preliminary report or, after examining the image, revise it if they disagreed (Supplementary Figures S1 and S2).

Statistical Analysis
The prevalence of DR is approximately 27-35% in Taiwan [21,22]. With a predefined sensitivity of 86%, a type I error of 5%, a power of 80%, and a margin of error of 7%, the minimum sample size was estimated to be 350 patients. To assess the performance of the software and the endocrinologists, the sensitivity, specificity, F1 score, balanced accuracy, and AUC for detecting RDR were calculated using the grading by the ophthalmologists as the ground truth. Unlike the software, which generates a diagnosis of RDR for each image, regional graders diagnose RDR at the patient level; that is, clinicians would consider referring a patient to an ophthalmologist if either of the patient's eyes was diagnosed with RDR. To address the gap between the laboratory and the real world, we evaluated the performance of the software at both the image level and the patient level and compared it with that of the regional graders at the patient level. A 95% confidence interval (CI) was obtained based on the exact binomial distribution.
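As a rough check on the stated minimum of 350 patients, one common approximation for diagnostic-accuracy studies (Buderer's formula, which divides the required case count by the expected prevalence) reproduces that figure; the authors' exact calculation may differ, and this sketch does not model the stated 80% power:

```python
import math

def sample_size_sensitivity(sens, margin, prevalence, z=1.959964):
    """Patients needed so that the sensitivity estimate has the given
    margin of error: n_cases = z^2 * sens*(1-sens) / margin^2, then
    divide by disease prevalence to get the number of screened patients."""
    n_cases = (z ** 2) * sens * (1 - sens) / margin ** 2
    return math.ceil(n_cases / prevalence)

# Parameters from the paper: sensitivity 86%, margin of error 7%, DR prevalence ~27%
print(sample_size_sensitivity(0.86, 0.07, 0.27))  # -> 350
```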
To evaluate the influence of the software on the endocrinologists, the quadratic-weighted kappa coefficient was used to determine the agreement between the software and the endocrinologists before and after implementation. Kappa values of 0.01-0.20 indicate none to slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1.00 almost perfect agreement [23]. The monthly percentage of retinal image grading completed within the allotted time (three days after the examination) and the monthly RDR rate were also compared. All analyses were conducted using R (Version 3.5.3, R Foundation for Statistical Computing, Vienna, Austria).
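For reference, the quadratic-weighted kappa penalizes disagreements by the squared distance between ordinal grades; a minimal self-contained sketch with illustrative data follows (equivalent results can be obtained with scikit-learn's `cohen_kappa_score(..., weights='quadratic')`):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, n_levels=5):
    """Quadratic-weighted Cohen's kappa for ordinal grades 0..n_levels-1
    (e.g., no DR, mild, moderate, severe NPDR, PDR)."""
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint counts
    pa, pb = Counter(a), Counter(b)   # marginal counts per rater
    w = lambda i, j: (i - j) ** 2 / (n_levels - 1) ** 2  # quadratic penalty
    num = sum(w(i, j) * obs[(i, j)]
              for i in range(n_levels) for j in range(n_levels))
    den = sum(w(i, j) * pa[i] * pb[j] / n
              for i in range(n_levels) for j in range(n_levels))
    return 1 - num / den

# Hypothetical grade lists: perfect agreement gives kappa = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # -> 1.0
```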

Patients and Images
In June 2019, before software implementation, 716 patients underwent fundus examinations and 1432 retinal images were collected. After excluding images considered to be of poor quality by the software, the ophthalmologists, or the endocrinologists, a total of 981 (68.5%) images were included for analysis; however, only 468 (65.4%) patients had retinal images of adequate quality for both eyes. In October 2019, after software implementation, there were 700 patients with 1400 retinal images, and 503 (71.9%) patients with both retinal images of adequate quality for grading by the software and the endocrinologists were enrolled for analysis (Figure 1). The rate of adequate image quality as judged by the endocrinologists was 73% ± 9% before VeriSee™ implementation and 78% ± 2% after implementation (p > 0.05).

Sensitivity, Specificity, and AUC at the Image Level
A total of 91 (9.3%) and 175 (17.8%) retinal images were graded as RDR by the ophthalmologists and the software, respectively. At the image level, the sensitivity and specificity of the software for detecting RDR were 0.91 (95% CI: 0.83–0.96) and 0.90 (95% CI: 0.87–0.92), respectively, and the AUC was 0.90 (95% CI: 0.87–0.93; Table 1).
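The patient-level metrics reported in the next subsection follow from the "worse eye" referral rule described in the Methods; the sketch below, with hypothetical grades, shows how per-image grades are aggregated per patient before computing sensitivity and specificity:

```python
def patient_referable(grades):
    """'Worse eye' rule: refer the patient if any image shows RDR
    (grade >= 2: moderate NPDR, severe NPDR, or PDR)."""
    return any(g >= 2 for g in grades)

def sens_spec(preds, truths):
    """Sensitivity and specificity of binary referral calls vs. ground truth."""
    tp = sum(1 for p, t in zip(preds, truths) if p and t)
    tn = sum(1 for p, t in zip(preds, truths) if not p and not t)
    fn = sum(1 for p, t in zip(preds, truths) if not p and t)
    fp = sum(1 for p, t in zip(preds, truths) if p and not t)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical patients, each a (right-eye grade, left-eye grade) pair
software = [patient_referable(g) for g in [(1, 2), (0, 0), (3, 1), (2, 0)]]
truth    = [patient_referable(g) for g in [(1, 2), (0, 0), (3, 1), (0, 0)]]
print(sens_spec(software, truth))  # -> (1.0, 0.5)
```

Note how a single false-positive eye flips the whole patient to "refer", which is one reason patient-level specificity can fall below image-level specificity.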

Sensitivity, Specificity, and AUC at the Patient Level
Of the 716 patients, 468 (65.4%) were considered to be of adequate quality for grading by the software, the ophthalmologists, and the endocrinologists. The numbers of patients graded as RDR by the software, ophthalmologists, and endocrinologists were 117 (25%),


Comparison before and after Implementation of the Software
Before implementation, the monthly RDR rate graded by the endocrinologists was 55.1% (software: 27%; ophthalmologists: 9%; Figure 3). After implementation, the monthly RDR rate decreased to 42.9%. The monthly percentage of grading completed within three days of the examination was 66.8% before implementation and increased to 77.6% after implementation (Table 3).

The characteristics of each endocrinologist and the kappa values before and after implementation are listed in Table 4. The overall kappa values were low before implementation and increased after implementation of the software. However, there was heterogeneity in the improvement of the kappa values, ranging from 0.17 to 0.65 after implementation (Figure 4).


Discussion
In the present study, we evaluated the impact of deploying deep-learning-based software for RDR diagnosis in clinical practice. A difference in the performance of the software was observed between the laboratory and real-world settings; therefore, validation with local datasets, according to local clinical practice, is important before implementation. Although the software was found to have a potential benefit in lessening the workload, the clinical physicians' acceptance of the new technology varied.
Some types of deep-learning-based software have been reported to demonstrate a high level of performance in the laboratory but reduced sensitivity and specificity in real-world practice [12,24]. Ting et al. [12] externally validated their deep learning algorithm, which showed specificities ranging from 73.3% to 92.2% for detecting RDR in ten multiethnic clinical datasets, despite a specificity of 91.6% in the primary validation dataset. Verbraak et al. [24] reported a deep-learning-based device for RDR with a sensitivity of 79.4% and a specificity of 93.8% in the primary care setting, despite the device having a sensitivity of 87.2% and a specificity of 90.7% in the original report [16]. In line with these reports, our study showed a difference in performance between laboratory and real-world practice. In addition, we found a slight drop in specificity at the patient level compared with the image level. Physicians consider referring a patient to an ophthalmologist when one eye shows RDR, even if the other eye is in good condition; this patient-based judgment differs from how the software was trained, which was on individual images. Therefore, validation of the accuracy of the software at both the image and patient levels is warranted before implementation.
The performance of non-retinal specialists in DR grading has been well investigated. The diagnostic sensitivity and specificity of PDR diagnoses made by physicians other than ophthalmologists have been reported as 49% and 84%, respectively, and the rate of correct PDR diagnosis by endocrinologists was only 31% [25]. Clinicians should therefore be alert to the suboptimal sensitivity and specificity of non-retinal specialists in detecting DR [26]. In the present study, the endocrinologists had good sensitivity but relatively low specificity for the diagnosis of RDR. A possible reason for this pattern is that endocrinologists tended to refer a patient when they were uncertain of the diagnosis, to avoid missing patients with RDR. Accordingly, the monthly referral rate for RDR was 55% before the implementation of the software, which was surprisingly higher than the DR prevalence. With the assistance of the software, the monthly RDR rate graded by the endocrinologists was closer to the true prevalence in Taiwan [21,22]. Our study also demonstrated that deep-learning-based software can lessen the workload of non-retinal specialists by decreasing the time spent grading retinal images and increasing the rate of grading completed within the allotted time.
Notably, our results showed heterogeneous acceptance among the endocrinologists of the diagnoses made by the software, as reflected by the kappa values. Despite the evidence of the high performance of the software [18], not every endocrinologist followed the DR grades assessed by the software in the present study. Human-machine interaction is complex, and further studies are needed to determine the important factors that influence clinicians' acceptance.
The strength of this study is that we included endocrinologists, who play an important role in DR screening and are the main graders of retinal images in Taiwan. To the best of our knowledge, this is the first report to evaluate the clinical impact of deep-learning-based software on regional graders. However, the present study has several limitations. First, the sample size was relatively small, and we evaluated the impact of the software shortly after its implementation; a long-term investigation with a larger sample size is needed. Second, macular edema was not investigated, because the software was not developed to detect macular edema. Finally, approximately 30% of the retinal images captured using non-mydriatic fundoscopy were of poor quality, a percentage higher than that found in previous reports [27].