Digital Image Processing and Development of Machine Learning Models for the Discrimination of Corneal Pathology: An Experimental Model

Abstract: Machine learning (ML) has an impressive capacity to learn and analyze a large volume of data. This study aimed to train different algorithms to discriminate between healthy and pathologic corneal images by evaluating digitally processed spectral-domain optical coherence tomography (SD-OCT) corneal images. A set of 22 SD-OCT images belonging to a random set of corneal pathologies was compared to 71 healthy corneas (control group). A binary classification method was applied, in which three ML approaches were explored. Once all images were analyzed, representative areas from every digital image were also extracted, processed and analyzed for a statistical feature comparison between healthy and pathologic corneas. The best performance was obtained from the transfer learning-support vector machine (TL-SVM) method (AUC = 0.94, SPE 88%, SEN 100%) and the transfer learning-random forest (TL-RF) method (AUC = 0.92, SPE 84%, SEN 100%), followed by the convolutional neural network (CNN) (AUC = 0.84, SPE 77%, SEN 91%) and random forest (AUC = 0.77, SPE 60%, SEN 95%). The highest diagnostic accuracy in classifying corneal images was achieved with the TL-SVM and TL-RF models. In image classification, the CNN was a strong predictor. This pilot experimental study developed a systematic mechanized system to discern pathologic from healthy corneas using a small sample.


Introduction
In recent years, there has been a significant advancement in computerized corneal digital imaging analysis to develop objective and reproducible machine learning (ML) algorithms for preclinical detection and measurement of various corneal pathologic changes. Standardized quantitative measurement of different corneal structural alterations, such as stromal thinning and edema, inflammatory infiltration, fibrosis and scarring, are crucial for early detection, objective documentation, grading, disease progression, and therapeutic monitoring.
The great diversity of unspecific corneal pathologic lesions, and their significant overlap between disorders, represent a major disadvantage for analysis with spectral-domain optical coherence tomography (SD-OCT) [1]. Unlike other corneal imaging technologies, including corneal topography-tomography and aberrometry that analyze numerical data, the SD-OCT has difficulty providing precise measurement values over varied and unspecific morphologic patterns that could guide clinicians to more objective diagnostic analysis [1,2]. The latter represents a significant challenge for AI developers.
Nonetheless, artificial intelligence (AI) has an immense capacity to learn and analyze a large volume of data and, at the same time, to autocorrect and continue learning to improve its performance.

Materials and Methods
(A) Design.
A prospective, cross-sectional, pilot exploratory cohort study was designed. Two participant groups were formed. The experimental group consisted of patients with various corneal diseases, and a control group of healthy patients with no corneal pathology. All patients read and signed informed consent to participate voluntarily in the study, which was previously approved by the Ethics and Research Committees of our institution (protocol registration No. CONBIOETICA-14-CEI-0003-2019 and 18-CI-14-120058, respectively), and conducted according to the tenets of the Declaration of Helsinki.
(B) Experimental procedures. Figure 1 shows the workflow of the experiments performed in the study. All corneal SD-OCT analyses were performed by the same technician (SIS). SD-OCT is a conventional imaging diagnostic tool in ophthalmology clinics that helps study the microstructural changes of different eye pathologies, including those of the cornea [32]. SD-OCT provided noncontact, in-vivo, cross-sectional, high-resolution corneal images. The Avanti RTVue XR (Optovue®, Fremont, CA, USA) SD-OCT corneal module permitted high-speed acquisition of image frames (1024 axial scans in 0.4 s) with few motion artifacts, reducing background noise. This device worked at a wavelength of 830 nm and a speed of 26,000 A-scans per second [33]. All the experiments were performed on a dataset comprising cross-line morphologic corneal images belonging to a random set of corneal pathologies. This set of images was compared to healthy corneas (control group). The problem was confronted using binary classification methods, for which three approaches were explored:

1. Traditional machine learning, including RF and SVM.
2. Deep learning using an end-to-end CNN.
3. TL using a pre-trained model (e.g., AlexNet) [34].

(C) Segmentation and Feature Extraction.
Cornea Scan Postprocessing: Digital SD-OCT images are usually contaminated with noise inherited from the sensor (Figure 2a). This problem is easily circumvented by applying a 2D median filter. Nevertheless, in order to extract statistical features from these images, one must perform what is known in the imaging domain as image segmentation, which partitions the image into different segments (a.k.a. regions), which are, in our case, three segments (cornea region, background region and region of interest), as shown in Figure 2b-d, respectively. In the figure, the cornea region mask is obtained by applying contrast limited adaptive histogram equalization (CLAHE) followed by fixed thresholding (Figure 2b). The background region is simply the inverse of the mask (Figure 2c). The region of interest (ROI) was obtained using our fast image segmentation method based on Delaunay triangulations [35,36], which is fully automatic and does not require specifying the number of clusters, as is the case with k-means clustering [37] or multilevel image thresholding using Otsu's method [38] (Figure 2d).
Feature Extraction: In image processing and its intersection with ML, feature extraction plays a crucial role in pattern recognition [39]. The process starts by calculating a set of measured data from images intended to be informative and non-redundant, facilitating subsequent machine learning tasks. There is a plethora of features one can extract from images; however, in this study, we resorted to measuring a few simple statistical features, which are: (1) the mean intensity value, (2) the standard deviation of values, (3) the ratio of features ("/" denotes the ratio of two columns, e.g., H/J is the ratio of column H and column J) and (4) the absorbance (Table 1). Absorbance is a transform of the image, calculated as ln(A/Ā), where ln is the natural logarithm, A is the image and Ā is the mean of the background region of A (Figure 2).
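As a concrete illustration, the masking and feature steps above can be sketched in a few lines of Python. The study itself was implemented in MATLAB; in this simplified NumPy sketch a plain fixed threshold stands in for CLAHE, and the Delaunay-based ROI step is omitted:

```python
import numpy as np

def denoise_median(img, k=3):
    """2-D median filter via a sliding window (simplified, edge-padded)."""
    h, w = img.shape
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

def segment_and_features(img, thresh=0.5):
    """Split a [0,1] grayscale scan into cornea/background masks and compute
    the statistical features described in the text."""
    cornea_mask = img >= thresh          # stand-in for CLAHE + fixed thresholding
    background_mask = ~cornea_mask       # background is the inverse of the mask
    a_bar = img[background_mask].mean()  # mean intensity of the background region
    absorbance = np.log(img / a_bar)     # ln(A / A-bar), as defined above
    return {
        "mean": img[cornea_mask].mean(),
        "std": img[cornea_mask].std(),
        "absorbance_mean": absorbance[cornea_mask].mean(),
    }
```

The threshold value and window size are illustrative assumptions; the paper does not report its exact parameters.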

(D) Experimental set-up.
Random forest:
The experimental set-up to execute RF classification of data into healthy or pathologic images comprised two stages. In the training stage, the bag of decision trees was set to 50 growing trees (this number provides the right balance between the area under the curve (AUC), processing time and memory usage), using a training set of 50 control and 15 pathologic images. The RF constructed decision trees based on the predictive characteristics of the features in Table 1. See also Figure 3 for a visualization of three of the fifty trees. Figure 4 depicts how RF optimizes its performance across the growing trees (50 in our case) by performing an out-of-bag (OOB) error calculation. The training phase converged to a model that we used, in the second stage, on a new validation test dataset comprising 21 control and 7 pathologic images.
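The out-of-bag error calculation behind Figure 4 can be illustrated in Python. The study used MATLAB's random forest; in this simplified sketch a toy nearest-centroid learner stands in for the decision trees, since the OOB bookkeeping (train each learner on a bootstrap sample, score it only on the samples it never saw) is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_fit(X, y):
    """Toy base learner: one centroid per class (stand-in for a decision tree)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_predict(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def oob_error(X, y, n_learners=50):
    """Bagging with out-of-bag (OOB) error estimation for binary labels 0/1:
    every learner votes only on the samples left out of its bootstrap bag."""
    n = len(y)
    votes = np.zeros((n, 2))
    for _ in range(n_learners):
        idx = rng.integers(0, n, n)            # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)  # samples never drawn into the bag
        model = centroid_fit(X[idx], y[idx])
        pred = centroid_predict(model, X[oob])
        votes[oob, pred.astype(int)] += 1
    voted = votes.argmax(axis=1)
    scored = votes.sum(axis=1) > 0             # samples with at least one OOB vote
    return np.mean(voted[scored] != y[scored])
```

With 50 learners, almost every sample receives several OOB votes, which is why the OOB curve in Figure 4 stabilizes as trees are added.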
Convolutional Neural Network: CNN is a deep learning method and architecture that is well known for its capabilities in image classification [40,41]. Input image dimensions were fixed to [227 227 3] (to match the AlexNet pre-trained model's input size requirement), and the fully connected layer was set to two classes (healthy/pathologic). The architecture comprised three convolution layers, each with a filter size of 5-by-5; the activation function was the rectified linear unit (ReLU), and the number of epochs was set to eight. Images were fed directly to the CNN classifier with the same training and testing proportions as in section A.
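For readers unfamiliar with the layer type, a single 5-by-5 convolution followed by ReLU, the building block repeated three times in the architecture above, can be written out directly. This is a single-channel, single-filter, stride-1 NumPy illustration, not the study's MATLAB implementation:

```python
import numpy as np

def conv2d_relu(image, kernel):
    """One 'valid' 2-D convolution (cross-correlation form, as used by deep
    learning frameworks) followed by ReLU activation."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the kernel with the 5x5 patch under it
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU: negative responses are zeroed
```

A 227x227 input passed through a 5x5 valid convolution yields a 223x223 response map; stacking three such layers (with learned kernels) produces the feature maps the final two-class layer classifies.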
Transfer Learning: In TL, the statistical model we used for prediction had been pretrained on an enormous dataset of natural images (e.g., millions of samples), and the weights were then used in local learning [41,42]. Features were augmented to yield 4096 features, which were then fed to shallow learning classifiers, namely the SVM and RF, with the same training and testing proportions as in section A.

(E) Statistical metrics analysis.
We categorized the registries into cases with corneal pathology and healthy corneal OCT entries. We used traditional machine learning, including random forest (RF) and support vector machine (SVM), deep learning using the CNN, and transfer learning (TL) using a pre-trained model (e.g., AlexNet). We then applied each algorithm to another image matrix and finally measured the model precision, relative risk, sensitivity, specificity, and negative and positive predictive values of the algorithm. Receiver operating characteristic (ROC) curves were analyzed to determine the optimal cut-off values, sensitivity and specificity. The AUC was used as a measure of accuracy. Accuracy was measured on the test dataset as (correctly predicted class/total testing class) × 100. All measurements were reported as the average of 10 runs on a random selection of samples to eliminate any training/test dataset selection bias. For all models, the data were divided according to the same ratio shown in Table 2. All of the experiments, including the developed algorithms, were implemented using MATLAB ver. 9.5.0.944444 (R2018b) and the Image Processing Toolbox ver. 10.3, running on a 64-bit, 2.60 GHz workstation with Windows 10 and 32 GB of RAM.
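The reported metrics follow directly from each test-set confusion matrix; a minimal helper (a Python sketch of the standard definitions; the study computed these in MATLAB):

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, predictive values and accuracy from a
    2x2 confusion matrix, with 'pathologic' as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (SEN)
        "specificity": tn / (tn + fp),   # true negative rate (SPE)
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }
```

Averaging these values over the 10 random train/test splits gives the figures reported in the Results.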

Results
A total of 93 SD-OCT corneal images were registered in the study: 71 images formed the control group, and 22 pathologic images were included in the experimental group. The latter comprised 14 (63.6%) ectatic corneas and 8 (36.4%) corneas with infectious keratitis. The total analyzed corneas belonged to 55 (59.2%) women and 38 (40.8%) men. The mean age of patients in the experimental group was 38.68 ± 11.74 years, and in the control group, 45.56 ± 20.69 years. We tested each model's accuracy in assigning each entry as pathologic or healthy. The RF method (AUC = 0.77, SPE 60%, SEN 95%) had the lowest precision in the set (86.07% ± 5.44), although the model used only 16 possible features extracted from the data; it was followed by the CNN (AUC = 0.84, SPE 77%, SEN 91%). Figure 5 displays the importance of the statistical features of the images analyzed by RF. The measure represents the increase in prediction error for any given variable if that variable's values are permuted across the out-of-bag observations. This measure is computed for every tree, then averaged over the entire ensemble, and divided by the standard deviation over the entire ensemble. It appears that features 12, 13 and 15 (i.e., corresponding to columns M, N, P in Table 1) are the most important in this classification model.

Figure 5. Feature importance using the random forest method. The plot represents the increase in prediction error for any given variable if the values of that variable are permuted across the out-of-bag observations. Therefore, the higher the induced error, the higher the importance of that variable; out-of-bag (OOB).
Algorithms that used more features (4096) had higher precision. The transfer learning-SVM model yielded the best results (TL-SVM: AUC = 0.94, SPE 88.12%, SEN 100%), followed by the transfer learning-RF model (TL-RF: AUC = 0.92, SPE 84%, SEN 100%). The performance of this classification can be further examined in Figure 6, which shows the best-performing test's confusion matrices from the 10 random tests for each algorithm. Table 3 summarizes the outcomes of the constructed models.
Figure 6. Confusion charts. Performance visualization using confusion matrix charts for the examined classification problem using the four different approaches (best accuracy out of the ten tests); accuracy (ACC).
Table 3. Performance of the classification methods using different algorithms. All metrics are the average over 10 runs. (**) Area under the curve. (-) The convolutional neural network (CNN) extracts its own features automatically from images.

Discussion
AI facilitates the early detection of certain ophthalmic disorders, increasing their diagnostic accuracy. Additionally, it enables the objective evaluation of disease progression and detailed follow-up of therapeutic results, particularly in those disorders most related to images and numerical data from distance, curvature, elevation, volume and thickness measurements of specific ocular structures, like the cornea, iris, lens and retina [30]. It is a powerful tool that permits us to increase the diagnostic sensitivity and specificity of ophthalmic pathology [43].

The analysis of corneal shape, volume, thickness, elevation and refractive power has significantly evolved in recent years through the development of topography, tomography, pachymetry and aberrometry, which yield accurate corneal measurements. Because these devices generate multiple maps and images of the cornea, the amount of numerical data available from each acquisition may be overwhelming. Therefore, machine learning algorithms are being applied for early detection and accurate progression analysis of different corneal conditions, including keratoconus and endothelial health [16,43,44]. Currently, refractive surgery screening is the most fertile field for machine learning development in corneal disease. The reason for this is the increased risk of iatrogenic post-LASIK keratectasia due to risk factors unrecognized in the preoperative evaluation [16].
This study's objective was to test and train different ML algorithms' performance for the morphologic discrimination between pathologic and healthy corneas, to optimize their role in early detection, differentiation and monitoring of disease progression. There are few studies related to this challenging task, and a critical reason for this is the great morphologic variation and overlap of clinical manifestations among corneal disorders [18,19,45]. For example, when discriminating keratoconic corneas from healthy ones, metric parameters like the radius of curvature, elevation over a best-fit sphere, volume and corneal thickness changes are considered for analysis [46]. The same is true for corneal stromal edema, where measurable parameters such as epithelial and stromal thickness and volume are considered for analysis. However, for the discrimination and diagnostic differentiation of other non-measurable and unspecific clinical findings, like scarring, leukomas and inflammatory infiltrates, among other pathologic features of disorders like infectious keratitis, the application of ML algorithms becomes more complex and limited [18].
Artificial neural networks like CNN are strong predictors in image classification, with the advantage of dealing with noisy and missed clinical data, understanding complex data patterns in a way not possible for linear and non-linear calculations. However, this model requires massive clinical datasets (in the order of tens of thousands) for proper training [47], explaining why the CNN model performed poorly with the limited data set used for analysis in the present study [5]. Moreover, the availability of large training datasets is not always feasible, especially in corneal imaging analysis. There are specific difficulties, including the devices' high costs, technical acquisition challenges and methodology differences that prevent building large datasets. Furthermore, when analyzing large datasets, there is a need for high computational power, limiting availability and increasing costs.
On the other hand, our results highlight the benefit of adopting the TL approach. A linear solution (in the original two-dimensional space) is impossible in many ophthalmologic cases, as discussed before; therefore, obtaining a solution in a higher-dimensional feature space is required. An advantage of the TL-RF method is that it can model non-linear class boundaries and report variable importance; at the same time, it may be slow, and it may be difficult to gain insight into its decision rules. The TL-SVM solves the linearity problem with a relatively lower computational cost using the kernel trick, a function used to obtain non-linear variants of a selected algorithm [29]. In the present study, the highest diagnostic accuracy in classifying normal versus pathologic corneal images was achieved with the TL-SVM and the TL-RF models, in that order. Even though transfer learning models (e.g., AlexNet) were trained to extract reliable and unique features from millions of raw images, our study supports the hypothesis that their image descriptors can be extended to medical images.
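The kernel trick can be made concrete with a toy example: XOR-style data that no straight line in the plane separates becomes linearly separable after an explicit feature map. A kernel achieves the same effect implicitly, computing inner products in the lifted space without materializing it; the mapping below is illustrative, not the kernel used in the study:

```python
import numpy as np

# XOR-style data: no line in the plane separates class 0 from class 1.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([0, 0, 1, 1])

def lift(X):
    """Explicit feature map (x1, x2) -> (x1, x2, x1*x2)."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

Z = lift(X)
# In the lifted 3-D space a single threshold on the new coordinate separates
# the classes: x1*x2 > 0 for class 0, x1*x2 < 0 for class 1.
pred = (Z[:, 2] < 0).astype(int)
```

An RBF or polynomial kernel in an SVM performs this kind of lifting implicitly, which is why the TL-SVM can find non-linear boundaries at modest computational cost.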
Indeed, random forest learning models can achieve satisfactory outcomes with small datasets, but with the inconvenience of requiring a manual selection of specific visual features before classification. This condition can result in a set of suboptimal features that limits the algorithms' application [43]. RF generates meta-models, or ensembles of decision trees, and is capable of fitting highly non-linear data given relatively small samples [48]. RF reached this performance with merely 16 features (values), a strong indication that if more statistical features (hypothetically driven by ophthalmology experts) were added, the model might well outperform CNN on this small dataset. Additionally, unlike other deep learning models, RF models can pinpoint individual feature significance, which can help trace diseases back to their causes and trigger scientific discovery or new physiological findings.
Considering potential future research directions, we envision extending this work to classify different corneal disease sub-types; this will eventually aid clinicians in their diagnostic procedures. Furthermore, we aim to develop algorithms for risk prediction of corneal disease, which will help us identify individuals at higher risk of developing a corneal disease (e.g., personalized medicine), hence improving its earlier detection before the disease reaches a devastating stage. However, as mentioned before, the necessity of collecting a substantial amount of data to get more accurate predictions and also to be able to use RF algorithms that are more suitable for image analysis is imperative, but an arduous task [49,50].
Related work: In the last couple of years, several ophthalmology groups have been testing different ML algorithms to facilitate and improve the accuracy of clinical diagnosis of several corneal disorders (Table 4) [51]. Santos et al. reported CorneaNet, a CNN design for fast and robust segmentation of cornea SD-OCT images from healthy and keratoconic corneas with an accuracy of 99.5% and a segmentation time of less than 25 ms [46]. Kamiya et al. used deep learning in color-coded maps obtained by AS-OCT, achieving an accuracy of 0.991 in discriminating between keratoconic and healthy eyes and an accuracy of 0.874 in determining the keratoconus stage [52]. Shi et al. used Scheimpflug camera images and ultra-high-resolution optical coherence tomography (UHR-OCT) imaging data to improve subclinical keratoconus diagnostic capability using CNN. When only the UHR-OCT device was used for analysis, the diagnostic accuracy of subclinical keratoconus was 93%, the sensitivity reached 95.3% and the specificity 94.5%. These parameters were similar when both the Scheimpflug camera and the UHR-OCT were used in combination for analysis [53]. Treder et al. evaluated the performance of a deep learning-based algorithm in the detection of graft detachment in AS-OCT after Descemet membrane endothelial keratoplasty (DMEK) surgery with a sensitivity of 98%, a specificity of 94% and an accuracy of 96% [45]. Zéboulon et al. trained a CNN to classify each pixel in the corneal OCT images as normal or edematous and generate colored heat maps of the result, with an accuracy of 98.7%, a sensitivity of 96.4% and specificity of 100% [18]. Eleiwa et al. assessed the performance of a deep learning algorithm to detect Fuchs' endothelial corneal dystrophy (FECD) and identify early-stage disease without clinically evident edema using OCT images. The final model achieved an AUC = 0.998 ± 0.001 with a specificity of 98% and sensitivity of 99% in discriminating healthy corneas from all FECD [19]. 
Yousefi et al. used AI to help the corneal surgeon identify higher-risk patients for future corneal or endothelial transplants using OCT-based corneal parameters [54]. Once refined and made available through computerized ophthalmic diagnostic equipment, all these research developments will have a significant impact on ophthalmic health care services around the world. Interestingly, none of these previous reports, particularly those involving corneal SD-OCT imaging analysis, used TL-SVM and TL-RF algorithms to analyze corneal pathology.

Conclusions
Applying ML optimal algorithms to different corneal pathologies for early detection, accurate diagnosis and disease progression is challenging. There are economic limitations related to the high cost of equipment, technical acquisition challenges and differences in the methodologic analysis that make it difficult to build large and reliable datasets. Furthermore, without the availability of vast datasets to feed data-hungry ML algorithms, these algorithms would remain limited in their capability to yield reliable results.
In the present experimental pilot study with limited dataset samples, TL-SVM and TL-RF showed better sensitivity and specificity scores and, hence, were more accurate in discriminating between healthy and pathologic corneas. On the contrary, the CNN algorithm showed less reliable results due to the limited training samples. The RF algorithm with handcrafted features had an acceptable performance, although we believe that adding more hypothesis-driven features (beyond the current 16) would further improve its performance. Additionally, RF is still a good option when one wants to explore specific imaging features and their physiological trace; for instance, we observed that features 12, 13 and 15 (i.e., corresponding to columns M, N, P in Table 1) are the most important in the RF classification model.
We believe that this revolutionary technology will mark the beginning of a new trend in image processing and corneal SD-OCT analysis, differing from current tendencies, where diverse anatomic characteristics like reflectivity, shadowing and thickness, among others, are subjectively analyzed. Nevertheless, this study should be taken as a proof of concept for which external validations, especially on large datasets, are sought.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because the study was approved under an informed consent allowing the use of clinical records data for research purposes while protecting patient privacy and prohibiting the sharing of information with third parties, in accordance with the Mexican General Law for the Protection of Personal Data in Possession of Obligated Parties (published in the Official Gazette of the Federation on January 26, 2017).