Automatic Screening of the Eyes in a Deep-Learning–Based Ensemble Model Using Actual Eye Checkup Optical Coherence Tomography Images

Eye checkups have become increasingly important for maintaining good vision and quality of life. As the population requiring eye checkups grows, so does the clinical workload of clinicians, and an automatic screening algorithm to reduce that workload is necessary. Machine learning (ML) has recently become one of the chief techniques for automated image recognition and is a helpful tool for identifying ocular diseases. However, the accuracy of ML models is lower in a clinical setting than in the laboratory, because model performance depends on the training dataset. Eye checkups often prioritize speed and minimize image processing, so the data distribution differs from that of the training dataset and, consequently, prediction performance decreases. The aim of this study was to investigate an ML model that screens for retinal diseases from low-quality optical coherence tomography (OCT) images captured during actual eye checkups, to prevent such a dataset shift. An ensemble model combining convolutional neural networks (CNNs) and a random forest model showed high screening performance on the single-shot OCT images captured during actual eye checkups. Our study indicates the strong potential of an ensemble model combining CNN and random forest models for accurately predicting abnormalities during eye checkups.


Introduction
The prevalence of visual impairment is higher among older people, and the significant causes include glaucoma, age-related macular degeneration, and diabetic retinopathy [1,2]. The global average life expectancy has increased, and the risk of visual impairment is expected to increase accordingly [3]. Therefore, eye checkups are essential to maintain good vision and quality of life.
In Japan, eye checkups are performed by local governments at a rate of 16.2% [4], and most eye checkups use fundus photography. The use of optical coherence tomography (OCT) [5] has become widespread globally, as it is more accurate in

OCT Imaging
OCT images from both eyes were obtained using an OCT-HS100 (Canon Co., Ltd., Tokyo, Japan) and RS-3000 Advance (RS-3000; Nidek Co., Ltd., Aichi, Japan). OCT-HS100 and RS-3000 have an auto-eye-tracking feature for the posterior direction, auto-alignment, and an auto-focus system. Thus, the OCT-HS100 and RS-3000 provide multiple OCT images and are suitable for eye checkups.
OCT-HS100 has an A-scan rate of 70,000 scans/s and a superluminescent diode with a lambda max of 855 nm, and creates a cross-sectional (B-scan) image. In this study, the B-scan (OCT) image was captured as a single shot with a horizontal and vertical angle of view of 9 mm, a resolution of 1024 × 1176 pixels, and TIFF compression.
RS-3000 has an A-scan rate of 53,000 scans/s and a superluminescent diode with a lambda max of 880 nm, and creates a B-scan image. In this study, the OCT image was captured as a single shot with a horizontal and vertical angle of view of 9 mm, a resolution of 1024 × 512 pixels, and JPG compression. The OCT images from the RS-3000 were resized to a resolution of 1024 × 1176 pixels and converted to TIFF compression.
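The resize-and-convert step for the RS-3000 images can be sketched as follows; this is a minimal illustration using Pillow with hypothetical file paths, not the authors' actual pipeline, and the (width, height) ordering of the stated resolution is an assumption.

```python
from PIL import Image

def convert_rs3000_image(src_path: str, dst_path: str) -> None:
    """Resize an RS-3000 B-scan (1024 x 512 px JPG) to 1024 x 1176 px
    and save it with TIFF compression, matching the OCT-HS100 output."""
    img = Image.open(src_path)
    img = img.resize((1024, 1176))   # (width, height) ordering is an assumption
    img.save(dst_path, format="TIFF")
```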

Datasets
Labeling of Abnormal and Normal Images
A total of 7703 OCT images were captured over the course of three years. All OCT images were independently reviewed and labeled by two ophthalmologists at each hospital (S.U. and T.I. labeled the OCT images from the OCT-HS100; Y.I. and E.W. labeled the OCT images from the RS-3000). Images with findings on which the ophthalmologists could not agree, and those that did not lead to a diagnosis, were excluded.
Of the OCT images, 655 were classified as having abnormal findings, whereas 6050 were normal. The OCT images of 998 eyes were not used because of difficulties in their interpretation. The OCT images of left eyes were flipped horizontally. The number of normal images was then adjusted to match the number of abnormal images; thus, 655 normal images were extracted randomly. The images were randomly divided into training and test datasets of 1210 and 100 images, respectively (each with an abnormal-to-normal ratio of 1:1).
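The class balancing and splitting described above can be sketched as follows. This is a simplified illustration with hypothetical function and variable names; the study specifies only that the sampling and splitting were random.

```python
import random

def balance_and_split(abnormal, normal, n_test=100, seed=0):
    """Undersample the normal class to the abnormal count, then split into
    training and test sets with a 1:1 abnormal-to-normal ratio."""
    rng = random.Random(seed)
    abnormal = rng.sample(abnormal, len(abnormal))   # shuffled copy
    normal = rng.sample(normal, len(abnormal))       # 655 of 6050 kept in the study
    half = n_test // 2                               # equal classes in the test set
    test = abnormal[:half] + normal[:half]
    train = abnormal[half:] + normal[half:]
    rng.shuffle(train)
    return train, test
```

With 655 images per class and a 100-image test set, this yields the 1210/100 split reported in the paper.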

Experiment 1
In Experiment 1, we compared the screening performances of transfer learning with the convolutional neural network (CNN) models ResNet-152 [16], DenseNet-201 [18], and EfficientNet-B7 [19], and of an ensemble model that used a soft-voting algorithm to average the predictions of the three models.
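Soft voting, as used here, simply averages the per-class probabilities of the member models and takes the class with the highest averaged probability. A minimal sketch (the function name is ours, not the authors'):

```python
import numpy as np

def soft_vote(probs_per_model):
    """Soft voting: average class probabilities across models.
    probs_per_model: list of (n_samples, n_classes) arrays, one per model."""
    avg = np.mean(probs_per_model, axis=0)     # element-wise mean over models
    return avg, avg.argmax(axis=1)             # averaged probs, predicted class
```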

Preprocessing
The ellipsoid zone (EZ), corresponding to the inner/outer segment (IS/OS) junction of the photoreceptors, is the second hyper-reflective band on an OCT image [25,26]. The EZ luminance in OCT images is reduced when ocular diseases impair photoreceptor cells [27,28]; thus, EZ luminance is an indicator for diagnosing retinal disease in OCT images. However, in some single-shot OCT images, the boundaries between the interdigitation zone (IZ), EZ, and external limiting membrane (ELM) are challenging to determine. We therefore applied random (probability = 50%) center cropping to 600 × 600 pixels to zoom in on the IZ, EZ, and ELM. The OCT images were then resized to 512 × 512 pixels. After resizing, data augmentation was applied to the input images as follows: random brightness from 0.8 to 2.0 times, random contrast from 0.8 to 1.5 times, random rotation within 10 degrees, random horizontal and vertical shift within 50 pixels, and random (probability = 50%) horizontal mirroring. Margins created by image processing were padded with blue (red, green, and blue color information of 0, 0, and 255, respectively) to prevent misrecognition (Figure 1).

Figure 1.
Data augmentation for training the CNN model. Original OCT image (A) with a resolution of 1024 × 1176 pixels was center-cropped to 600 × 600 pixels (probability = 50%) and then resized to 512 × 512 pixels. After resizing, the following data augmentations were applied to the images: random brightness from 0.8 to 2.0 times, random contrast from 0.8 to 1.5 times, random rotation within 10 degrees, random horizontal and vertical shift within 50 pixels, and random horizontal mirroring (probability = 50%). (B) If margins were created by image processing, these were padded with blue (red, green, and blue color information of 0, 0, and 255, respectively) to prevent misrecognition. Abbreviations: CNN, convolutional neural network; OCT, optical coherence tomography.

Network
In the training phase for transfer learning, supervised learning was used, in which the network model was given training images. The classification accuracy was measured as the weights of the deep layers were updated based on the optimization function; in this study, we used the Adam optimizer for all CNN models [29]. The layers of the deep neural networks were frozen up to just before the output layer in order to retain the ImageNet weight parameters. We created a fully connected layer as the output layer, which provided two outputs (abnormal or normal eyes) using the softmax function. We defined an abnormal eye as a predicted value ≥ 0.5.
All CNN models were trained for 2000 epochs. The optimizer used an adaptive learning rate: the initial learning rate was 0.02, which was halved at 25%, 50%, 75%, and 90% of the total number of epochs. The training data were divided into three parts and cross-validated. We used Python 3.8.5 for Windows 10 (Microsoft Co., Ltd., Redmond, WA, USA), with the following libraries: Matplotlib 3.3.2, Numpy 1.18.5, OpenCV 3.3.1, Pandas 1.1.3, Pytorch 1.7.0, Torchvision 0.8.1, Scikit-learn 0.23.2, and Seaborn 0.11.0.
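The learning-rate schedule above maps directly onto PyTorch's `MultiStepLR`; a sketch with placeholder parameters (the training loop body is elided):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

epochs = 2000
params = [torch.nn.Parameter(torch.zeros(1))]    # stand-in for model parameters
optimizer = Adam(params, lr=0.02)                # primary learning rate
# halve the learning rate at 25%, 50%, 75%, and 90% of the total epochs
milestones = [int(epochs * f) for f in (0.25, 0.50, 0.75, 0.90)]
scheduler = MultiStepLR(optimizer, milestones=milestones, gamma=0.5)

for epoch in range(epochs):
    # ... train one epoch here ...
    scheduler.step()
```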

Data Visualization
The explanations for the abnormal predictions by the CNN models were visualized using gradient-weighted class activation mapping (Grad-CAM) [30]. Grad-CAM can generate visual explanations from any CNN-based network without requiring architectural changes or retraining. Grad-CAM images were generated using the feature map in the last convolutional layer.
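The Grad-CAM computation (weighting the last convolutional layer's feature maps by the spatially averaged gradients of the target class score) can be sketched as follows. This is a minimal, generic implementation, not the authors' code; in the study, `target_layer` would be the last convolutional layer of each trained model.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Minimal Grad-CAM: ReLU of the gradient-weighted sum of feature maps,
    normalized to [0, 1]."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    score = model(x)[0, class_idx]            # target class score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled grads
    cam = F.relu((w * feats["a"]).sum(dim=1))       # weighted sum over channels
    return cam / (cam.max() + 1e-8)                 # normalize to [0, 1]
```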

Classification of Ocular Disease
To determine the diseases for which the model performed well or poorly, two ophthalmologists (S.U. and T.I.) checked the OCT images in the test data. In cases of multiple ocular diseases, the ocular disease with the most abnormalities was diagnosed.

Statistical Analysis
We used receiver-operating characteristic (ROC) curves and calculated the corresponding area under each curve with 1000 bootstrap resamples to evaluate the screening performance of the CNN and ensemble models. The area under the ROC curve (AUC) was then compared among the models using the Scheffé test.
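The bootstrapped AUC estimate can be sketched as below. This is a standard recipe, not the authors' code; SPSS was used in the study, and the confidence-interval percentiles here are our assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """AUC with a bootstrap distribution (1000 resamples in the study)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # need both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    aucs = np.array(aucs)
    return aucs.mean(), np.percentile(aucs, [2.5, 97.5])  # mean and 95% CI
```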
IBM SPSS Statistics version 26 (IBM Corp., Armonk, NY, USA) was used for statistical analysis, and a p-value of <0.05 was considered significant.

Results
Table 1 shows the abnormal OCT images in the test dataset. In most cases, the CNN models focused on abnormal regions of the retina to predict the disease (Figure 4). The images that the CNN models misjudged did not have remarkable lesions in the inner retinal layers; in such cases, the CNN models focused on the nasal and temporal retinal regions to make decisions (Figure 5). ResNet-152 and DenseNet-201 made incorrect classifications for 5/100 images, EfficientNet-B7 for 4/100 images, and the ensemble model for 2/100 images.

Experiment 2
In Experiment 2, we examined the ability of another ML model to identify anomalies in the thickness of peripheral nasal and temporal retinal regions, which was the weak point of the CNN models created in Experiment 1. Then, we developed an ensemble model combining the CNN models (ResNet-152, DenseNet-201, and EfficientNet-B7) and the ML model and verified the screening accuracy.

Preprocessing
Twenty percent of the total pixel width of all original OCT images (Figure 6A) was removed from each of the right and left edges to avoid the depression of the optic disc (resolution: 615 × 1176 pixels; Figure 6B). The OCT images were divided into five sections (resolution: 123 × 1176 pixels; Figure 6C): the peripheral temporal retina, temporal perimacular area, central macular area, nasal perimacular area, and peripheral nasal retina, defined as segments 1, 2, 3, 4, and 5, respectively. Each section was binarized using the discriminant analysis (Otsu) method (Figure 6D). Morphological closing was applied to these segment images to fill the dark areas related to the inner retinal layer and choroidal vessels. The sum of the retinal and choroidal areas (Figure 6E) was then calculated, and the area (in pixels) of each section was exported to an Excel file (Microsoft Co., Ltd., Redmond, WA, USA).

Network
The random forest algorithm was used in Experiment 2 [31]. The random forest algorithm is an ensemble learning method based on bagging: the input data were sampled randomly with bootstrap and divided into multiple groups, and a decision tree was trained on each group in parallel. Because individual decision trees can overfit, the prediction values of all trees were averaged. The data on the retinal and choroidal areas were divided into 100 groups (i.e., 100 trees), and the maximum depth of the decision trees was set to 5.
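The configuration above maps onto scikit-learn's `RandomForestClassifier`; a sketch with hypothetical synthetic segment-area data (the real features are the five per-segment areas in pixels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 100 bootstrap-sampled trees of maximum depth 5, as described in the text
rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                            bootstrap=True, random_state=0)

# X: one row per image with the five segment areas; y: 0 = normal, 1 = abnormal
# (synthetic data for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(30000, 2000, (200, 5)),
               rng.normal(36000, 2000, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
rf.fit(X, y)
probs = rf.predict_proba(X)[:, 1]    # probabilities averaged across the 100 trees
```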
The ensemble model was then built using a soft-voting algorithm combining the CNN models (ResNet-152, DenseNet-201, and EfficientNet-B7) and the random forest model.

Statistical Analysis
We determined the differences in the retinal and choroidal areas between the abnormal and normal images for each segment using the Mann-Whitney U test with Bonferroni correction [32]. We used ROC curves and calculated the AUC with 1000 bootstrap resamples to estimate the screening performance of the random forest model, the ensemble of CNNs, and the ensemble of CNNs and the random forest model.
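The per-segment comparison with Bonferroni correction can be sketched with SciPy (the function name is ours; SPSS was used in the study):

```python
from scipy.stats import mannwhitneyu

def compare_segments(abnormal_areas, normal_areas, alpha=0.05):
    """Mann-Whitney U test per segment, Bonferroni-corrected for the number of
    segments compared (five in the study). Returns (p-value, significant?)."""
    n = len(abnormal_areas)                    # number of comparisons
    results = []
    for abn, norm in zip(abnormal_areas, normal_areas):
        _, p = mannwhitneyu(abn, norm, alternative="two-sided")
        results.append((p, bool(p < alpha / n)))   # compare against alpha / n
    return results
```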
IBM SPSS Statistics version 26 (IBM Corp., Armonk, NY, USA) was used for statistical analysis, and a p-value of < 0.05 was considered significant.

Results
The sum of the retinal and choroidal areas was significantly thicker in the abnormal eyes than in the normal eyes in segments 3 and 4 (p < 0.001) and was significantly thinner

Discussion
This study investigated the screening performances of ML models using OCT images obtained from actual eye checkups. The CNN models focused on the structural changes in the retina in abnormal eyes, with accuracy from 95% to 96%, and the screening performance did not differ between the models (Figures 2-4). Our finding is consistent with earlier studies that reported a classification accuracy of about 70-95% for single-CNN models, which may also miss some diseases in OCT images [10,11,13-15]. Furthermore, our findings support the earlier report that the latest CNN model is not necessarily better when performing transfer learning with OCT images [20]. We consider that the number of classification categories relates to our findings: we developed CNN models for a two-category classification (abnormal, normal) and did not develop CNN models to classify every disease in this study. Binary classification is the simplest classifier for CNN models. Therefore, we expect the differences between CNN models to be more significant for multi-class classifications, such as those for classifying individual retinal diseases.
RP was predicted as a false negative, suggesting insufficient training on abnormalities in the retinal pigment epithelium and photoreceptor layer, including the interdigitation and ellipsoid zones and the ELM. In cases with no apparent edema in the inner retinal layer, our CNN models tended to predict an abnormality based on the peripheral temporal and nasal retinal shapes (Figure 5). Russakoff et al. [33] described CNN models trained with OCT images of age-related macular degeneration that focused on the temporal and nasal retinas and differentiated between progressors and non-progressors. These findings suggest that CNNs can detect subtle differences in the morphology of the peripheral temporal and nasal retinal regions and can therefore differentiate between abnormal and normal eyes, or between progressors and non-progressors.
The random forest model used the central macular (segment 3) and nasal (segments 4 and 5) areas as bases for determining eye abnormalities, as a significant difference was found between the abnormal and normal retinal areas in these segments (Figures 7 and 8). Of the five diseases that were misjudged by the random forest model (Figure 9A), the retinal and choroidal areas were underestimated because the morphological transformation did not fill in the inner retinal layer with white in ERM (Figure 11A,A′) and macular edema (Figure 11B,B′,C,C′). Furthermore, the OCT images of RP had small retinal and choroidal areas (Figure 11D,D′,E,E′). Most diseases were correctly classified by the random forest model, although cases with an underestimated macular area were misclassified. Therefore, the features of the retinal and choroidal areas extracted by dividing the OCT images into five segments were useful.
The ensemble model had better screening performance than the single-CNN models (Figures 2 and 3). However, the risk of misjudging normal features as abnormal remained; the ensemble combining the CNN and random forest models reduced that risk. The random forest model evaluated disease using the nasal retinal area, thereby compensating for the CNN models that misrecognized the nasal peripheral retinal structures (Figures 5 and 7). Thus, an ensemble model combining CNN models trained with OCT images and a random forest model trained on the retinal area can vastly improve disease prediction during an actual eye or health checkup in which only OCT images are acquired. Furthermore, the ensemble of the CNN and random forest models may be useful to clinicians, given its screening accuracy of 0.999 at 0.025 image/s. In this study, the ensemble model showed high screening performance on the single-shot OCT images captured during actual eye checkups, because the random forest model complements the weaknesses of the CNNs. These findings suggest that our ensemble model can screen for retinal diseases without requiring retakes during actual eye checkups. On the other hand, we are concerned that the screening performance may degrade when our ensemble model is applied to actual in-person eye checkups, because we excluded OCT images in which the ophthalmologists had difficulty determining the disease by reading the images alone. Therefore, the accuracy of our ensemble model during actual eye checkups will need to be confirmed in a future investigation.