Accuracy of Machine Learning Assisted Detection of Keratoconus: A Systematic Review and Meta-Analysis

(1) Background: The objective of this review was to synthesize the available data on the use of machine learning in detecting keratoconus (KC), to evaluate its accuracy (as determined by pooled sensitivity and specificity), and to measure the reporting completeness of machine learning models in KC against the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement. (2) Methods: Two independent reviewers searched the electronic databases for all potential articles on machine learning and KC published prior to 2021. The TRIPOD 29-item checklist was used to evaluate the studies' adherence to reporting guidelines, and the adherence rate to each item was computed. We conducted a meta-analysis to determine the pooled sensitivity and specificity of machine learning models for detecting KC. (3) Results: Thirty-five studies were included in this review. Thirty studies evaluated machine learning models for detecting KC eyes from controls and 14 studies evaluated machine learning models for detecting early KC eyes from controls. The pooled sensitivity for detecting KC was 0.970 (95% CI 0.949–0.982), with a pooled specificity of 0.985 (95% CI 0.971–0.993), whereas the pooled sensitivity for detecting early KC was 0.882 (95% CI 0.822–0.923), with a pooled specificity of 0.947 (95% CI 0.914–0.967). Between 3% and 48% of TRIPOD items were adhered to across studies, and the average (median) adherence rate for a single TRIPOD item was 23%. (4) Conclusions: The application of machine learning models has the potential to make the diagnosis and monitoring of KC more efficient, reducing vision loss for patients. This review provides current information on the machine learning models that have been developed for detecting KC and early KC.
Presently, machine learning models perform poorly in identifying early KC from control eyes, and many of these studies did not follow established reporting standards, hindering the clinical translation of these models. We suggest possible approaches for future studies of both KC and early KC models so that machine learning can be used more efficiently and widely in the diagnostic process.


Introduction
Corneal diseases are the second largest cause of blindness worldwide, behind only cataract in overall importance [1]. Keratoconus (KC), one of the most common corneal conditions, is characterized by bilateral, progressive corneal thinning that results in an abnormally steep cornea and decreased vision [2]. The disease primarily affects young adults and children [3]. Globally, KC patients comprise the second largest group of patients requiring corneal transplants, with the associated risks of surgical complications and immune rejection of the graft [4,5].
In the medical field, the recent availability of biomedical data has led to the advent of the big data era [6,7], creating opportunities for more comprehensive data-informed decision making. The challenge for the clinician has expanded beyond data collection to encompass the interpretation of a greater amount of information. Artificial intelligence (AI) is the ability of a machine to learn and display intelligence [8]. Machine learning methods represent a branch of AI where computational algorithms can be used to process and identify patterns in large amounts of data at a scale that is beyond the ability of humans to synthesize. Through advanced pattern mining, innovative detection solutions, referred to as automatic detection models, can be based on these massive amounts of data [9].
Increasingly detailed data about the cornea, generated by corneal topography and tomography systems, are essential for diagnosing KC. They are especially useful for detecting early indications of KC, prior to the development of typical KC clinical characteristics [10]. Corneal tomography provides parameters and images [11], which are laborious to analyze manually. This has been a motivating force in the use of machine learning for KC detection since 1995 [12], with a growing volume of machine learning research conducted on KC detection in the years since.
No existing study summarizes the use of machine learning in KC, identifies limitations, and makes recommendations for future directions. The aim of this systematic review and meta-analysis was to review all currently available literature to determine the accuracy (through pooled sensitivity and specificity) of machine learning in the detection of KC, addressing this knowledge gap.

Literature Search Strategy
A web-based systematic literature search was performed for articles published from inception through 28 February 2021, on PubMed, Web of Science, and MEDLINE (Figure 1). Database searches were supplemented by hand-search and grey literature search techniques to ensure all publications in this field were included. The protocol for this systematic review was registered on PROSPERO (registration number CRD42021237167).

PubMed, Web of Science, and MEDLINE were interrogated using search strings pertaining to keratoconus and machine learning. Key words and designated filters were applied according to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (Figure 1). Exclusion criteria were: publications in which no machine learning algorithms were included and only statistical analysis was undertaken; and non-English publications.
The retrieved articles were imported into EndNote (version X9), which was used to perform the initial screening. Two independent reviewers (K.C. and S.S.) screened the titles and abstracts for potentially qualifying studies, and inconsistencies were settled by consensus. Both reviewers then thoroughly analyzed all potentially qualifying studies against the inclusion/exclusion criteria (Figure 2).

Data Extraction
A customized analysis form was generated in Microsoft Excel. The following items were extracted from each study: authors, publication year, country, study objective, sample size, reference-standard diagnosis methods, corneal imaging systems used to generate data, machine learning methods used, indicators of study quality, and the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). If the numbers of TP, FP, TN, or FN were not presented, these values were derived from the data provided, such as sensitivity and specificity. The meta-analysis omitted studies that lacked data on TP, FP, TN, or FN, or from which these values could not be derived to construct a 2 × 2 contingency table.
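Where a study reported only sensitivity, specificity, and group sizes, the 2 × 2 counts were derived from those values. That derivation amounts to the arithmetic below; a minimal Python sketch (the helper name and the example numbers are illustrative, not taken from any included study):

```python
def reconstruct_2x2(sensitivity, specificity, n_cases, n_controls):
    """Derive TP/FN/FP/TN counts from reported sensitivity and
    specificity plus the number of diseased (case) and healthy
    (control) eyes. Counts are rounded to the nearest integer."""
    tp = round(sensitivity * n_cases)     # cases correctly flagged
    fn = n_cases - tp                     # cases missed
    tn = round(specificity * n_controls)  # controls correctly cleared
    fp = n_controls - tn                  # controls falsely flagged
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

# Example: 100 KC eyes and 200 controls with the pooled KC estimates
print(reconstruct_2x2(0.970, 0.985, 100, 200))
# → {'TP': 97, 'FN': 3, 'FP': 3, 'TN': 197}
```

Rounding is needed because published sensitivity and specificity are themselves rounded, so the recovered counts are approximations of the original 2 × 2 table.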

Reporting Completeness of Machine Learning Studies in KC
We evaluated the reporting completeness of the machine learning research in this study by referring to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis; www.tripod-statement.org (accessed on 7 June 2021)) statement as relevant to model development. This statement contains a 22-item checklist, totaling 31 items when all sub-items are included. The checklist covers the title, abstract, background, methods, results, discussion, supplementary material, and funding information. Two items (5c, i.e., "Give details of treatments received, if relevant", and 11, i.e., "Provide details on how risk groups were created, if done") were omitted since they were irrelevant to the research covered in this review. Each study was therefore evaluated on a total of either 28 or 29 possible items. This total varied because item 14b (i.e., "If nothing on univariable analysis (in methods or results) is reported, score not applicable") may be rated as "not applicable" and was then omitted from the calculation of reporting adherence. If a study presented several models, we extracted data for the model with the highest performance. Each included item received a score of "1" for adherence and "0" for non-adherence. Multiple items (items 1, 2, 3a, 4b, 5a, 6a, 7a, 7b, 9, 10a, 10b, 10d, 13a, 13b, 14a, 15a, 16, 17, 20 and 22) in the TRIPOD analysis were derived from several sub-items (the sub-items for each number can be found at www.tripod-statement.org (accessed on 7 June 2021)); their scores were therefore determined by the combination of several elements rather than a single element. The results of each TRIPOD item for each paper and the level of reporting adherence for each TRIPOD item were documented systematically in a spreadsheet.
For each machine learning study, we calculated the TRIPOD adherence score by dividing the sum of TRIPOD items adhered to by the entire number of applicable TRIPOD items in the study. The average adherence score was calculated using the median value of the adherence score across all studies. For each TRIPOD item, the adherence score was calculated by dividing the number of studies that adhered to the item by the total number of applicable studies for the item. The median value was used to represent the average adherence score for each TRIPOD item.
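The two adherence summaries described above are row-wise (per study) and column-wise (per item) proportions over a binary study-by-item score matrix, each averaged with the median. A minimal Python sketch with a hypothetical score matrix (None marks a not-applicable item, such as 14b; the scores shown are invented, not the review's data):

```python
from statistics import median

# Rows = studies, columns = applicable TRIPOD items;
# 1 = adhered, 0 = not adhered, None = item not applicable.
scores = [
    [1, 0, 1, None],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]

def study_adherence(row):
    """Proportion of applicable items a single study adhered to."""
    applicable = [s for s in row if s is not None]
    return sum(applicable) / len(applicable)

def item_adherence(col):
    """Proportion of applicable studies adhering to a single item."""
    applicable = [row[col] for row in scores if row[col] is not None]
    return sum(applicable) / len(applicable)

per_study = [study_adherence(r) for r in scores]
per_item = [item_adherence(c) for c in range(4)]
print(median(per_study))  # average (median) study adherence
print(median(per_item))   # average (median) item adherence
```

Not-applicable items are excluded from both the numerator and the denominator, matching the variable 28/29-item totals described above.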

Statistical Methods
All analyses were performed with the 'mada' and 'metafor' packages in RStudio Server Pro (RStudio, PBC, Boston, MA, USA; version 1.3.1056-1) for Windows. To measure the overall machine learning performance for KC detection, the sensitivity and specificity values for all presented models were pooled, following the bivariate meta-analysis method of Reitsma et al. [13] using linear mixed-model techniques. The bivariate method retains the two-dimensional nature (sensitivity and specificity) of the original data: pairs of sensitivity and specificity are jointly evaluated, accommodating any correlation between them using a random-effects approach. This was accomplished using the 'mada' package (version 0.5.10). The 95% confidence intervals (CI) of the sensitivity and specificity of the various imaging systems were compared. The Deeks' funnel plot asymmetry test, developed especially for diagnostic test accuracy [14], was used to determine whether there was publication bias across studies [15].
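For intuition, pooling proportions is often done on the logit scale so that the back-transformed estimate stays within [0, 1]. The sketch below uses simple inverse-variance fixed-effect weighting as a deliberate simplification: unlike the bivariate Reitsma model used in this review, it ignores between-study heterogeneity and the correlation between sensitivity and specificity. The TP counts and sample sizes are hypothetical:

```python
import math

def pool_logit(successes, totals):
    """Inverse-variance fixed-effect pooling on the logit scale.
    A simplified stand-in for the bivariate Reitsma model: it ignores
    between-study heterogeneity and the sensitivity-specificity
    correlation that the mixed-model approach accounts for."""
    logits, weights = [], []
    for k, n in zip(successes, totals):
        # 0.5 continuity correction avoids infinite logits at 0 or n
        p = (k + 0.5) / (n + 1.0)
        logits.append(math.log(p / (1 - p)))
        weights.append(1 / (1 / (k + 0.5) + 1 / (n - k + 0.5)))
    pooled = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-pooled))  # back-transform to a proportion

# Hypothetical per-study TP counts and KC sample sizes
print(round(pool_logit([95, 48, 180], [100, 50, 200]), 3))
```

In practice, the 'mada' package's reitsma function fits the full bivariate linear mixed model and provides the basis for the pooled summary points and the HSROC plot reported here.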
A hierarchical summary receiver-operating characteristic (HSROC) curve was fitted. Each individual study was presented as a circle and plotted within the HSROC curve. The summary point was represented by a dot surrounded by a 95% confidence interval (95% CI).

Outcomes Measure
The primary outcome indicator was the diagnostic accuracy of machine learning algorithms through a variety of imaging technologies for the identification of KC, as determined by the pooled sensitivity and specificity values.

Search Collection
In the initial literature search, 532 studies were retrieved, of which 280 duplicates were removed. Following review of the title, abstract, and full text, 35 studies were included in the review. There was an increasing trend of machine learning studies in KC published over time (Figure 3), from the earliest study published in 1995 by Maeda et al. [12] to over 30 studies in 2020.


Search Characteristics
The 35 reviewed articles on machine learning and KC were classified into three categories based on their aims: detecting KC eyes from controls, differentiating early KC from controls, and identifying different KC severities. Each study focused on one or more of these aims: 12 papers focused on KC versus control, 4 on early KC versus control, 1 on KC severity, and 18 publications had multiple aims. No study examined the progression of KC.


Detecting KC from Controls and Meta-Analysis
A total of 26 studies [12, developed machine learning models that were based either directly on captured parameters or on parameters calculated from corneal topography or tomography systems. The number of parameters used in these studies ranged from 5 to 443 (Table 1). The machine learning algorithms explored in these studies included decision trees, discriminant analysis, logistic regression, naive Bayes, neural networks, random forests, and support vector machines. The majority of these studies (n = 23) employed a single machine learning algorithm, while 4 studies [24,29,31,38] compared several algorithms.
Four articles [41-44] directly analyzed images generated by corneal topography or tomography systems, as opposed to image parameters. Convolutional neural networks, a common deep learning-based method, were used in all of these studies.
Meta-analysis was performed on 26 of the 30 studies; 4 studies [37,39,41,43] were excluded owing to inadequate data for quantifying the TP, FP, TN, and FN. Asterisks (*) indicate the studies excluded from the meta-analysis in Table 1. Deeks' funnel plot (Supplementary Figure S1) was used to assess possible publication bias. No evidence of publication bias was apparent (p = 0.91).

Detecting Early KC from Controls and Meta-Analysis
A total of 13 studies [20,21,27-29,31,33,37,39,45-48] evaluated machine learning models for distinguishing early KC from controls using corneal topography parameters. These studies used fewer parameters than those used to identify KC from controls, varying from 11 to 55 parameters (Table 1). Ten studies applied a single algorithm and three compared several algorithms (one of which was conducted by our group [48]). Additionally, one study [44] built its model by combining convolutional neural networks and TMS-derived images.
Ten of these fourteen studies were included in the meta-analysis. As indicated by an asterisk (*) in Table 1, four studies [37,39,44,45] were omitted from the meta-analysis owing to inadequate data to calculate the TP, FP, TN, and FN. Assessment of publication bias with Deeks' funnel plot (Supplementary Figure S2) found no significant effect (p = 0.18).
Overall, the pooled sensitivity was 0.882 (95% CI 0.822-0.923) and the pooled specificity was 0.947 (95% CI 0.914-0.967) for early KC versus control. The most widely used imaging machines for detecting early KC from control eyes were the Pentacam (including Pentacam HR) and the Orbscan (including Orbscan II and Orbscan IIz). Six studies used Pentacam data, two studies used Orbscan data, and an additional two studies used data from either GALILEI or a combination of Pentacam and Corvis. For early KC detection (Table 3), models utilizing Pentacam data achieved a pooled sensitivity of 0.882 (95% CI 0.795-0.935) and a pooled specificity of 0.935 (95% CI 0.874-0.967). When using Orbscan data, a pooled sensitivity of 0.842 (95% CI 0.504-0.965) and a pooled specificity of 0.958 (95% CI 0.821-0.991) were obtained. The pooled diagnostic performance for detecting KC versus controls was superior to that for early KC, in both sensitivity (0.970 (95% CI 0.949-0.982) vs. 0.882 (95% CI 0.822-0.923)) and specificity (0.985 (95% CI 0.971-0.993) vs. 0.947 (95% CI 0.914-0.967)). This difference implies that early KC detection using machine learning algorithms is still in its infancy.
The Pentacam tomography system was the most commonly used corneal imaging device for both the KC and early KC categories. In Figure 4, we compare studies that used the Pentacam data set to diagnose KC and early KC. Diagnostic efficiency for detecting KC was superior to that for early KC, as shown by the higher sensitivity (y-axis) and specificity (x-axis) in the plot.

Detection of Different KC Severities
Along with detecting KC eyes as a distinct category, six of the published studies grouped their KC eyes into clinical stages and used machine learning algorithms to identify each stage separately. These studies classified KC eyes based on a variety of measures. Kamiya et al. [42] classified eyes into Grades 1-4 according to the Amsler-Krumeich (AK) classification scheme, which is mostly focused on keratometry but often incorporates refraction and pachymetry [51]. Bolarin et al. [35] and Velazquez-Blazquez et al. [40] graded eyes into Grades I-IV plus or mild KC using a different classification system named RETICS, based on corrected distance visual acuity (CDVA) [51]. Issarti et al. [33] classified their KC eyes into mild and moderate stages using a self-defined classification (described at the end of Table 4), whereas Maeda et al. [12] did not specify their staging method. Table 4 summarizes these findings. Since no consistent grading system was used for classifying KC severity in these studies (indeed, none is globally established [52]), the findings were not directly comparable. The first attempt at proposing a data-driven KC classification scheme was based on density-based clustering by Yousefi et al. [49], using OCT-based parameters from 3156 eyes. They identified five clusters that could represent five stages of KC, ranging from normal to advanced KC.

Discussion
This is the first comprehensive meta-analysis on KC and machine learning, and it has demonstrated that, using data from a variety of corneal imaging devices, machine learning can reliably distinguish KC eyes from control eyes (pooled sensitivity > 0.90). However, the performance of machine learning models in distinguishing early KC eyes from controls was poorer, with a maximum pooled sensitivity of 0.88. Overall completeness of model reporting was evaluated using the TRIPOD guidelines, and standard reporting compliance was found to be inadequate in all published KC machine learning research undertaken so far.
Machine learning models in KC were developed for a variety of imaging systems. The models themselves are therefore not directly interchangeable, owing to their different input expectations, which may have a detrimental effect on their clinical translation. For example, Smadja et al. developed a machine learning model that had a sensitivity of 93.6% and a specificity of 97.2% for discriminating normal eyes from early KC [21]. This model was constructed using the anterior and posterior asphericity asymmetry indices (AAI and OSI), corneal volume, paracentral mean keratometry, and anterior chamber depth derived from a GALILEI machine. Parameters such as AAI and OSI are not accessible in other corneal tomography imaging systems, such as the Pentacam [53], so the model cannot be utilized in clinics equipped with the Pentacam system. The Pentacam is one of the most frequently used corneal tomographic technologies in clinical practice [11], and our research found that it is also the most frequently utilized source of data in the publications we reviewed [26-28,31,33,37,47,48]. As a consequence, although the model created by Smadja et al. showed a high level of performance for early KC diagnosis, it is not generally applicable.
The machine learning models that used data from the Pentacam demonstrated a higher pooled sensitivity and specificity in detecting KC and early KC from control eyes compared to other imaging machines. This is likely due to the ability of the Pentacam machine to generate a wider spectrum of data than other systems, including data on the front cornea, the back cornea, corneal pachymetry, and other areas of the anterior eye segment [54,55].
A review of the literature on the application of AI to evaluate corneal topography for the diagnosis and early detection of corneal ectasias was recently published [56]. That article summarized significant advances in corneal imaging and the application of AI in KC as viewed by an eye care professional, a biomedical engineer, and a data scientist. It concluded that AI in corneal imaging may improve refractive surgery and the diagnosis of corneal ectasias, but it focused exclusively on corneal imaging modalities and their performance in relation to AI. In contrast, our systematic review included a meta-analysis, providing additional information on individual studies as well as the pooled performance of existing machine learning models for detecting KC and early KC. In addition, we reported on the completeness of reporting of the machine learning models using the TRIPOD checklist. We analyzed each study by its objectives and showed that while substantial research had been undertaken on KC detection, much less had been undertaken on early KC detection, KC severity detection, and detection of progression. Our analysis also evaluated the parameters utilized in prior studies, indicating that no study had analyzed all available parameters despite the extensive use of corneal imaging data. Additionally, we compared and contrasted the corneal imaging systems used. These aspects have not been previously reviewed.

Reporting Completeness of Machine Learning Studies in KC
Only when all model components are completely and transparently reported can a model's potential clinical usefulness be appropriately assessed. The main objective of machine learning models is to help clinicians make medical decisions about individual patients [57]. Users or doctors need information on the clinical setting in which the diagnosis is made (e.g., primary care, secondary care, or the general population), as well as the patients for whom the model is appropriate. Additionally, they need information about which clinical data, referred to as predictors in the model, are necessary for model usage, as well as the definition of the patient outcome to which the model refers. Unfortunately, this systematic review suggests that studies on machine learning and KC often lacked sufficient description. Only 34% specified the target setting [12,19,24,30,31,33,35,39,42,44,47,49], 23% defined all necessary predictors precisely [17,21,23,25,34,39,45,48], 26% exactly defined the patient outcome [21,28,29,39,44,46-49], and none completely described participant characteristics (e.g., basic demographics, clinical features, and available predictors).
To replicate the model-construction process using their own data, machine learning researchers working in KC need studies to specify the approach used to handle any missing data, the full prediction model, and a comprehensive explanation of all model-building procedures. However, only one study (3%) defined the whole model-building process [28], 14% provided the complete prediction model [20,21,26,35,47], and none explicitly explained how missing data were handled.
One factor contributing to the low percentage of adherence is that not all elements are relevant to all machine learning research. For example, it is impractical to mathematically specify the final model in a study using non-regression techniques such as random forest, support vector machine, or neural network. While this is a legitimate rationale, it is important to refer to relevant TRIPOD items and provide as much information about model creation as possible in order to guide the modelling choices and facilitate subsequent validation.
While the majority of the research developed novel models for KC detection, only four studies [19,22,31,37] conducted external validation using other data sources. One common concern is that local data sets used for validation are unlikely to be representative of the target population on a global scale [57]. The machine learning model developed in a cross-ethnic study by Mahmoud et al. [22] performed differently when evaluated on data sets collected in the United States of America and Switzerland. This implies that most of the currently identified models cannot be used in a broad clinical setting, since their performance may vary and will most likely be poorer when applied to other external clinics or nations. We recommend that any model be externally evaluated on a large scale to understand this variation. This is especially important given the relatively small sample sizes reported in each study; international collaboration would therefore be highly advantageous in moving this field along.
The development of KC is frequently manifested not only by corneal changes but also by clinical findings such as vision, refraction, and slit-lamp signs. No study has incorporated all pertinent data [12,. Demographic data, such as age and gender, as well as potential risk factors for KC, such as eye rubbing and family history, may also aid in KC detection. We have limited knowledge of how these critical factors may influence the detection of KC in the machine learning models currently employed. Given the known risk factors for KC [2], it should be possible to start assessing these potential risk factors and including them in future machine learning models to evaluate their importance in diagnosis.
The bulk of reported studies of KC have used data derived from a single corneal topography or tomography imaging device to train their machine learning models. Thus, there is a dearth of information as to what impact combining data from multiple devices would have on machine learning models in the detection of KC. Other forms of data, such as from the corneal epithelial thickness map produced with optical coherence tomography (OCT) and corneal biomechanical measurements, are also increasingly being recognized as crucial in the diagnosis of KC [30], particularly early KC [58]. Integrating data from multiple devices and considering a broader variety of factors may therefore further improve the early detection of KC.
The majority of studies employed a single machine learning method, although there was limited information on how or why the authors chose a particular algorithm from the many potential alternatives. Given that machine learning algorithms are sometimes referred to as 'black boxes', it would be advantageous to evaluate several machine learning algorithms on the same data set and select the best-performing one [59]. This process may help avoid selection bias and may aid in improving early KC detection.
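As a minimal sketch of this evaluation strategy, several candidate classifiers can be scored with cross-validation on identical folds of the same data set before one is selected. The data set here is synthetic (a stand-in for topography-derived features), and the choice of candidate algorithms is illustrative, not drawn from any of the reviewed studies.

```python
# Sketch: compare several candidate classifiers on the same data set with
# 5-fold cross-validation, rather than committing to one algorithm up front.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary-classification data as a stand-in for real corneal features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_rbf": SVC(),
}

# Mean cross-validated accuracy for each candidate, computed on the same folds.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

Reporting all candidate scores, not only the winner, also makes the selection process transparent to readers and reviewers.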
Imbalanced sample sizes were found in about half of the studies analyzed [12,16,20,21,27-31,33,44,46], which could bias the machine learning model and impair its capacity to identify cases. Most studies included fewer KC or early KC eyes than control eyes for model development, which is a concern when evaluating model performance. For example, in the study by Lopes et al. [31], a random forest model was constructed using 2980 stable (control) eyes and 71 ectasia-susceptible eyes, achieving a sensitivity of 80% and a specificity of 96.8%. However, the model's baseline accuracy, defined as the accuracy obtained by labelling every eye as a control without constructing any model, was 97.7%. Although the analysis achieved a high specificity of 96.8%, this could be deemed suboptimal in comparison to that baseline. Instead of relying on accuracy, sensitivity, and specificity alone, the precision value (also known as positive predictive value) may provide a more interpretable evaluation for models with unbalanced sample sizes. This metric has been employed in several studies to evaluate their models [22,28,32,33,36,47,48].
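The effect of class imbalance on these metrics can be illustrated with a short calculation using the approximate figures reported for Lopes et al. [31] above; the derived precision value is our own illustration, not a figure from that study.

```python
# Worked illustration of why accuracy can mislead with imbalanced classes,
# using the approximate figures for Lopes et al. [31]: 2980 control eyes,
# 71 ectasia-susceptible eyes, sensitivity 0.80, specificity 0.968.
n_control, n_case = 2980, 71
sensitivity, specificity = 0.80, 0.968

tp = sensitivity * n_case        # cases correctly flagged
fp = (1 - specificity) * n_control  # controls incorrectly flagged
tn = specificity * n_control

accuracy = (tp + tn) / (n_case + n_control)
# Baseline accuracy: label every eye "control" without building any model.
baseline = n_control / (n_case + n_control)
# Precision (positive predictive value) exposes the imbalance problem:
# many of the flagged eyes are actually false positives.
precision = tp / (tp + fp)

print(round(accuracy, 3), round(baseline, 3), round(precision, 3))
# accuracy ~0.964 sits below the ~0.977 baseline, and precision is only ~0.37
```

In other words, despite the high sensitivity and specificity, fewer than half of the eyes flagged as ectasia-susceptible would actually be cases, which is the kind of information precision makes visible.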
There have been fewer studies on identifying the different stages of KC, and a standardized classification system for KC has yet to be devised. No machine learning study has used longitudinal data on KC progression, even though applications of machine learning have benefited the prediction and detection of progression indicators in AMD, diabetic retinopathy, and glaucoma [60-64]. The combination of machine learning methods and large clinical data sets may assist in the analysis of KC progression.

Approach for Future Studies
Machine learning has been increasingly used in KC over the last three decades, mostly for the identification of KC and early KC. The advantage of machine learning is that it allows consistent and unbiased diagnosis, which is critical when diagnosing patients at an early stage, as early intervention with treatments such as corneal crosslinking (CXL) could delay or slow disease progression, thus preventing the need for a possible corneal transplant.
There is still room to improve the efficiency of machine learning models in detecting early KC. This may be accomplished by allowing the use of all publicly available data, including complete databases from corneal imaging systems, clinical data, genetic data, and other risk factors. There is also a need to maximize the potential of machine learning techniques by optimizing their output at the methodology and data space levels.
Currently, there are no successful examples of machine learning models for KC being used in clinical practice. This may be due to the lack of large patient populations for validating results, the use of various imaging devices, local participant groups that do not capture diverse ethnic backgrounds, and limits on clinicians' overall acceptance of machine learning techniques for diagnosis and their reliability relative to human graders. External model validation on a diverse patient population, as well as the creation of platform-independent models that generalize across several corneal imaging systems, is therefore needed. Finally, research on machine learning in KC should also address additional gaps in the area, such as classifying KC severity and identifying and forecasting KC progression.

Conclusions
We present an up-to-date, comprehensive review on the use of machine learning in KC detection and identify the substantial limitations that need to be overcome to make the diagnostic process more efficient for early keratoconus. In light of our findings on pooled detection performance and low adherence to the TRIPOD checklist, we believe that both improved machine learning model performance in early KC detection and higher-quality machine learning research in KC are sorely needed. Despite various challenges, the future of integrating machine learning technologies into clinical practice is promising with the advent of advanced imaging modalities. Machine learning can be further investigated for broad application to the entire process of KC detection and management. In particular, some open avenues for research include early KC detection, risk factor evaluation, prediction of progression, and clinical management guidance. However, global collaboration is essential to obtain larger data sets and more robust models.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11030478/s1. Figure S1: Assessment of publication risk bias across studies using Deeks' funnel plot asymmetry test in detecting KC from controls. ESS, effective sample size, determined as a function of the number of diseased (n1) and non-diseased (n2) subjects: ESS = (4n1 * n2)/(n1 + n2). The y-axis shows the inverse of the square root of the effective sample size (1/root(ESS)). The x-axis shows the natural logarithm of the diagnostic odds ratio, lnDOR = ln[(TP * TN)/(FP * FN)]. In this figure, the vertical line represents the meta-analysis summary estimate, each black circle represents one study, and symmetric dispersion of the circles to the right and left of the vertical line implies a low probability of publication bias for the included studies. The dotted line indicates the ESS-weighted regression test of funnel plot asymmetry, with p values < 0.05 considered significant. Figure S2: Assessment of publication risk bias across studies using Deeks' funnel plot asymmetry test in detecting early KC from controls; axes, symbols, and significance threshold are as described for Figure S1. Table S1: Results of each TRIPOD item for each paper and the level of reporting adherence for each TRIPOD item.
Author Contributions: Conceptualization, P.N.B. and S.S.; methodology, K.C.; software, K.C.; formal analysis, K.C.; data curation, K.C. and S.S.; writing-original draft preparation, K.C.; writing-review and editing, P.N.B., S.S., K.V. and K.C.; supervision, P.N.B., S.S. and K.V.; funding acquisition, P.N.B. and S.S. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: The corresponding author had full access to all the data in the study; data are available upon reasonable request.