Application and Performance of Artificial Intelligence Technology in Detection, Diagnosis and Prediction of Dental Caries (DC)—A Systematic Review

Advances in science and technology have led to newer applications based on artificial intelligence (AI) that are widely used in the medical sciences. AI technology has been employed in a wide range of applications related to the diagnosis of oral diseases and has demonstrated high precision and accuracy. The aim of this systematic review is to report on the diagnostic accuracy and performance of AI-based models designed for detection, diagnosis, and prediction of dental caries (DC). Major electronic databases (PubMed, Google Scholar, Scopus, Web of Science, Embase, Cochrane, Saudi Digital Library) were searched for relevant articles published from January 2000 until February 2022. A total of 34 articles that met the selection criteria were critically analyzed based on QUADAS-2 guidelines. The certainty of the evidence of the included studies was assessed using the GRADE approach. AI has been widely applied for prediction of DC, for detection and diagnosis of DC, and for classification of DC. These models have demonstrated excellent performance and can be used in clinical practice to enhance diagnostic performance, treatment quality and patient outcome, and can also be applied to identify patients with a higher risk of developing DC.


Introduction
Oral diseases like dental caries (DC) and periodontal diseases pose a major disease burden and are considered non-fatal causes of disability affecting people of all age groups globally [1]. The pain and discomfort that is often associated with DC may eventually compromise the individual's sleep, diet, social well-being and self-esteem which can affect quality of life [2]. According to the global burden of disease, untreated DC are the most prevalent and common factor affecting health [1]. On the global scale, it is estimated that DC are prevalent among 2.3 billion adults with secondary dentition, and among 530 million children with deciduous dentition [3].
In 2015, the global cost of oral diseases was reported to exceed 540 billion dollars, consequently leading to major health and financial burden [4]. Early and accurate detection of DC can enable cost effective preventive measures and more conservative treatment options, reducing the healthcare costs [5]. Traditionally, visual inspection in combination with radiographic assessment is the routine diagnostic approach for DC. However, studies indicated the presence of considerable variability in its reliability and accuracy, affected mainly by the level of dentists' clinical experience. Sensitivity can range between 19-92% for occlusal and 39-94% for proximal DC [6]. Various parameters like shadow, contrast, and brightness in radiographs may have an impact on the diagnosis [7].
The recent advancements in techniques used for the detection and diagnosis of DC resulted in the development of novel methods that aim to overcome the constraints of clinical and radiographic diagnosis. These include ultrasonic detection of caries, laser fluorescence, digital imaging fiber-optic trans-illumination (FOTI), quantitative light-induced fluorescence (QLF), digital subtraction radiography (DSR), tuned aperture computed tomography (TACT), and electrical conductance measurement (ECM) [8,9]. Laser fluorescence has relatively higher sensitivity in diagnosing early DC in comparison with other methods [10]. However, studies have reported on the limitations of these techniques; FOTI demonstrates low sensitivity when used for the diagnosis of proximal DC [11] and ultrasonic devices are only capable of detecting established DC [12]. Caries risk prediction models like the caries-risk assessment tool (CAT), caries management by risk assessment (CAMBRA), and Cariogram are commonly used for predicting DC. Nevertheless, these models lack sufficient evidence to prove their effectiveness. A systematic review reported that the sensitivity and specificity of Cariogram ranged between 41-75% and 65.8-88%, respectively [13].
With the present evolution in the fields of science and technology, newer applications based on artificial intelligence (AI) technology have been widely used in medical sciences. These models have demonstrated excellent performance and high accuracy and sensitivity in performing their intended tasks, including diagnosing eye disorders and breast and skin cancers, and detecting and diagnosing pulmonary nodules [14][15][16][17]. AI models have also been widely applied in the detection, segmentation and classification of coronavirus disease of 2019 (COVID-19) using computerized tomography (CT) medical images, and these models have demonstrated substantial potential in the rapid diagnosis of COVID-19 [18]. With growing interest in these AI-based applications, such models have now been employed in a wide range of applications related to the diagnosis of oral diseases and have demonstrated high precision and accuracy in their performance [19][20][21][22]. Studies have reported on the application and performance of AI models in various disciplines of dentistry, including orthodontics, restorative dentistry and prosthodontics [23][24][25]. However, there are no systematic review articles exclusively reporting on the application of AI models to dental caries. Additionally, detecting DC using AI-based models has been found to be a cost-effective approach, where the AI-based DC detection model demonstrated higher accuracy in detecting DC in comparison with trained examiners, with fewer false negative errors [26].
Hence, the aim of this systematic review is to report on the diagnostic accuracy and performance of AI-based models designed for detection, diagnosis, and prediction of DC.

Search Strategy
This systematic review was executed in compliance with the standards of preferred reporting items for systematic reviews and meta-analysis for diagnostic test accuracy (PRISMA-DTA) [27]. The literature search for this paper was based on the population, intervention, comparison, and outcome (PICO) criteria (Table 1).
Major electronic databases (PubMed, Google Scholar, Scopus, Web of Science, Embase, Cochrane, Saudi Digital Library) were searched for relevant articles published from January 2000 until February 2022. The literature search was based on Medical Subject Headings (MeSH) terms such as dental caries, tooth decay, cavity, diagnosis, detection, prediction, artificial intelligence, machine learning, deep learning (DL), automated system, convolutional neural networks (CNNs), artificial neural networks (ANNs) and deep convolutional neural networks (DCNNs). Combinations of these MeSH terms using the Boolean operators AND/OR were used in the advanced search for articles. Manual searches for additional articles were also carried out in the college library based on the reference lists extracted from the initially selected articles.
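The combination of search terms with Boolean operators can be sketched as follows. This is only an illustration: the review does not publish its exact search strings, so the grouping of terms into three concept blocks and the helper name `build_query` are assumptions, and only a subset of the listed terms is used.

```python
# Hypothetical helper: the exact search strings used in the review are not
# reported, so the grouping below is an assumption made for illustration.
def build_query(groups):
    """OR the terms within each group, then AND the groups together."""
    clauses = []
    for terms in groups:
        # Quote multi-word terms so they are searched as phrases.
        quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Terms taken from the search strategy described above.
condition = ["dental caries", "tooth decay", "cavity"]
task = ["diagnosis", "detection", "prediction"]
technology = ["artificial intelligence", "machine learning", "deep learning"]

query = build_query([condition, task, technology])
print(query)
```

Grouping synonyms with OR widens each concept, while joining the groups with AND restricts results to records that mention the condition, the task and the technology together.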

Research Question
What is the performance of AI-based models designed for detection, diagnosis and prediction of DC?

Population
Patients who underwent investigation for DC

Intervention
AI applications for detection, diagnosis and prediction of DC

Comparison
Expert/Specialist opinions, Reference standards/models

Study Selection
Article selection was conducted in two phases. During the initial phase, articles were selected based on the relevance of the title and abstract to the research question. In this phase, the article search was done by two authors (S.B.K and F.B) independently, and this process generated 448 articles. These articles were further screened to eliminate the duplicates, ultimately leading to the exclusion of 168 articles. The remaining 280 articles were evaluated based on the eligibility criteria.

Eligibility Criteria
In this systematic review, articles were included if they were (a) original research articles reporting on the application of AI-based models in diagnosis, detection and prediction of DC, (b) articles reporting on the data sets used for training/validating and testing of the model, and (c) articles with clear information on quantifiable performance outcome measures. The type of study design did not limit inclusion.
The articles excluded were (a) articles with only abstracts, without full text availability, (b) conference proceedings, commentaries, editorial letters, short communications, review articles and scientific posters uploaded online, and (c) articles published in non-English languages.

Data Extraction
The 280 articles obtained from the initial phase were further evaluated based on these eligibility criteria. Following this, the number of articles decreased to 35. In the final phase, the authors' details and information were concealed and the articles were assigned for further analysis to two authors (M.A and L.A), who were not involved in the initial phase of evaluation. To determine the degree of consistency between these two authors, interrater reliability was assessed. Cohen's kappa showed 84% agreement between these authors. The articles were critically analyzed based on the guidelines of the Quality Assessment of Diagnostic Accuracy Studies tool (QUADAS-2) [28]. This is a revised tool used for assessing the quality of studies that have reported on diagnostic tools. Quality assessment was carried out based on four main domains (patient selection, index test, reference standard, and flow and timing), which were evaluated for risk of bias and applicability concerns [28]. The two authors had contrasting opinions about the inclusion of one article that did not clearly mention the outcome measures. The issue was resolved upon discussion with another author (A.F), and a decision to exclude the article was made. A total of 34 articles were subjected to quantitative synthesis (Figure 1).
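For readers unfamiliar with the interrater statistic, the following is a minimal sketch of how Cohen's kappa is computed from two reviewers' include/exclude decisions. The decision lists are invented for illustration and are not the review's data; the function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative decisions only (NOT taken from the review).
reviewer_1 = ["include"] * 6 + ["exclude"] * 4
reviewer_2 = ["include"] * 5 + ["exclude"] * 5

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))
```

In this invented example the reviewers agree on 9 of 10 articles (90% raw agreement), yet kappa is lower because it discounts the agreement expected by chance, which is why kappa and percentage agreement are usually reported as distinct quantities.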

Results
Thirty-four articles that met the selection criteria were assessed for quantitative data (Table 2). The research trend shows that most of the research on the application of AI to DC was conducted within the last few years, with a gradual increase in this area of research.

With these data, a meta-analysis was not possible because of the heterogeneity between the studies in the software and data sets used to assess the performance of the AI models. Therefore, the data are presented descriptively, grouped by the application for which each AI model was designed.

Study Characteristics
The extracted data mainly included details of the study (details of authors, publication year, type of algorithm architecture used, study objective, number of patients/images/photographs/radiographs for validating and testing, study factor, study modality, comparisons, evaluation accuracy/average accuracy/statistical significance, outcomes and conclusions).

Outcome Measures
The outcome was measured in terms of task performance efficiency. The outcome measures reported were accuracy, sensitivity, specificity, the receiver operating characteristic (ROC) curve, area under the curve (AUC), area under the receiver operating characteristic curve (AUROC), intraclass correlation coefficient (ICC), intersection over union (IoU), precision-recall curve (PRC), statistical significance, F1 score, volumetric Dice similarity coefficient (vDSC), surface Dice similarity coefficient (sDSC), positive predictive value (PPV), negative predictive value (NPV), mean decreased Gini (MDG) and mean decreased accuracy (MDA) coefficients, and the Dice coefficient.
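As a reminder of how several of these measures relate to one another, the sketch below computes the confusion-matrix measures and the two overlap measures from their standard definitions. The counts and pixel sets are invented for illustration, the function names are assumptions, and the code assumes every denominator is non-zero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard measures derived from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def overlap_metrics(predicted, reference):
    """IoU and Dice coefficient for two sets of pixel coordinates."""
    intersection = len(predicted & reference)
    union = len(predicted | reference)
    iou = intersection / union
    dice = 2 * intersection / (len(predicted) + len(reference))
    return iou, dice


# Illustrative counts only (not taken from any included study).
metrics = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)

# Illustrative segmentation masks as sets of (row, column) pixels.
predicted_mask = {(0, 0), (0, 1), (1, 0)}
reference_mask = {(0, 1), (1, 0), (1, 1)}
iou, dice = overlap_metrics(predicted_mask, reference_mask)

print(metrics)
print(iou, dice)
```

Classification studies typically report the first group of measures, whereas segmentation studies report IoU or Dice, which is one reason results across the included studies are not directly comparable.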

Risk of Bias Assessment and Applicability Concerns
The quality of the 34 articles included in this review was assessed using the QUADAS-2 guidelines [28]. This tool was originally produced in 2003 by collaboration between the Centre for Reviews and Dissemination, University of York, and the Academic Medical Centre at the University of Amsterdam. Modified versions have been adopted by the Cochrane Collaboration, NICE and AHRQ. The current version is widely used in systematic reviews to evaluate the risk of bias and applicability of primary diagnostic accuracy studies. QUADAS-2 consists of four key domains: patient selection, index test, reference standard, and flow and timing. The current assessment of risk and applicability based on QUADAS-2 shows that the majority of studies have a low risk of bias and a very small number of studies show a high risk of bias (Supplementary Table S1) (Figure 2).


Assessment of Strength of Evidence
The articles included in this systematic review were assessed for the certainty of the evidence using the grading of recommendations assessment, development and evaluation (GRADE) approach. The certainty of evidence is rated across five domains (risk of bias, inconsistency, indirectness, imprecision, and publication bias) and is ultimately categorized as very low, low, moderate, or high [63] (Table 3).
Table 3. Assessment of Strength of Evidence.

Discussion
Oral diseases like DC and periodontal diseases are some of the major public health issues affecting people of all age groups in developing and developed countries. In most cases, DC remain undiagnosed because of deep fissures and tight interproximal contacts, making them difficult to be detected in the early stages, eventually leading to their detection in the advanced stages. Early detection of DC reduces the disease burden and need for invasive treatment procedures which can ultimately improve treatment outcome [32]. Clinical oral examination using a dental probe/explorer along with radiographs is regarded as the most conventional method in detecting DC. However, studies have also reported on the variations in accuracy and reliability among clinicians using this method, influenced by their clinical experience [7,64,65].
Automated decision support systems based on AI technology are new developments in the field of medical sciences. AI-based models have also been widely applied in the field of dentistry and have demonstrated exceptional performance in tooth detection, tooth numbering, diagnosing and predicting oral cancer, periodontal diseases, and root fractures, orthodontic diagnosis, and detection of jaw lesions, cysts and tumors [19][20][21][22]. Considering the challenges and limitations dentists face in detecting DC during clinical examination, there is a need for developing AI-based automated models that can assist dentists in decision making, increasing the accuracy of DC detection and diagnosis.
Several factors influence the risk of developing DC, such as oral hygiene practices, dietary habits, socio-economic status, utilization of dental care services, and attitude towards oral health [66]. Hence, identifying the factors that determine the risk of developing DC in an individual is essential for its prevention. AI-based models have been widely applied for prediction of DC. Zanella-Calzada et al. [29] reported on an AI-based model for analyzing the dietary and demographic factors that determine DC using data sets, where the model demonstrated an accuracy of 0.69 and AUC values of 0.69 and 0.75. This model showed good accuracy for classifying individuals with and without caries based on dietary and demographic factors. The main advantage of this model is that the data used for training were obtained from subjects from different regions, providing robustness in the results and reducing subject selection bias. Hung et al. [33] proposed an AI-based ML model for diagnostic prediction of root caries. This model demonstrated excellent performance, with an accuracy of 97.1%, a precision of 95.1%, a sensitivity of 99.6%, a specificity of 94.3% and an AUC of 0.997. Although the model demonstrated excellent performance, there were certain limitations related to the data sets used for its development. The data were obtained from a sample of the United States (US) population and would therefore be more representative of US nationals than of patients with different demographic characteristics. Another important limitation was that the authors did not consider some important covariates such as lifestyle and oral hygiene factors.
Ramos-Gomez et al. [40] described an AI-based ML algorithm (Random Forest) for identifying survey items that predict DC. The model demonstrated a mean decreased Gini coefficient (MDG) of 0.84 and a mean decreased accuracy (MDA) of 1.97 for classifying active DC based on parent's age. For predicting DC based on parent's age, the model demonstrated an MDG of 2.97 and an MDA of 4.74. This model can be of potential use for screening children for DC based on survey data. The study had several limitations, including the limited sample size for testing, which was obtained from limited hospital records that are not representative of the general population. In addition, the data regarding children's oral hygiene practices were obtained from their parents, giving rise to social desirability bias. Zaorska et al. [45] reported on an AI model for predicting DC based on chosen polymorphisms. The model demonstrated a sensitivity of 90%, a specificity of 96%, an overall accuracy of 93% (p < 0.0001) and an AUC of 0.970 (p < 0.0001). A prediction accuracy of 90.9-98.4% was achieved by this model. The main strengths of this model were the homogeneous age and gender of the study subjects and the assessment of performance using two different statistical approaches, rendering the results more reliable. However, the sample used for validating the model was limited. Pang et al. [46] reported on an AI-based ML model for caries risk prediction based on environmental and genetic factors. The model demonstrated an AUC of 0.73. This model could accurately identify individuals at high and very high caries risk. However, the sample was confined to only one center and early signs of DC were not detected in this study. In a study by Hur et al. [53], an ML model for predicting DC on second molars associated with impacted third molars was tested. This model demonstrated good accuracy with a ROC of 0.88 to 0.89. However, the authors did not consider major DC contributing factors such as oral hygiene and dietary intake of sugars. Park et al. [57] also reported on an ML model for predicting early childhood caries. The model demonstrated a favorable performance with an AUROC of 0.774-0.785. Limitations of this study included low specificity values and potential bias resulting from the consideration of maternal variables solely. Additionally, the authors did not consider important variables such as feeding practices, sugar intake and usage of fluoride.
Undiagnosed and untreated DC are major public health problems affecting billions of people worldwide. Early detection of DC can significantly reduce the need for invasive treatment and ultimately the cost of care. Hence, diagnostic tools with high accuracy in detecting DC are needed. Lee et al. [30] reported on a DL model for detecting and diagnosing DC on periapical radiographs. The model demonstrated an accuracy of 89.0%, 88.0%, 82.0% and AUC of 0.917 for premolar, molar, and both premolar and molar models respectively. The DL model CapsNet is a recently developed model made of deep layers and is very effective for processing visual factors from posture, speed, hue, and texture [67]. Therefore, models of this nature are enabled with improved and optimized features for the detection and diagnosis of DC [68]. Although the model demonstrated considerable performance, there were limitations related to unconsidered clinical parameters, the limited number of radiographs, and the inclusion of permanent teeth only [30]. Choi et al. [31] reported on an automated model for the detection of proximal DC in periapical radiographs. The proposed model was found to be superior to the system using naïve CNNs. Casalegno et al. [32] reported on a DL model for the automated detection and localization of DC. This model demonstrated promising results with increased speed and accuracy in detecting DC. However, the model had some shortcomings, where physically unrealistic labeling of artifacts took place, especially in underexposed and overexposed areas. Cantu et al. [34] proposed a DL model for detecting DC on bitewing radiographs. The model demonstrated an accuracy of 0.80, a sensitivity of 0.75 and a specificity of 0.83. The accuracy of this model was higher than that of experienced dentists in detecting initial lesions. The main strength of this study was the large number of balanced data sets used in training and testing. These results were similar to the results of another study conducted by Lee et al. [52] on a deep CNNs (U-Net) model for detection of DC on bitewing radiographs, where the model demonstrated a precision of 63.29%, a recall of 65.02%, and an F1-score of 64.14%. However, the limitation of this study was related to the small number of data sets. Geetha et al. [35] reported on an AI-based model for diagnosing DC on digital radiographs. The model showed excellent performance with an accuracy of 97.1%, a false positive (FP) rate of 2.8% and a ROC area of 0.987. However, the model must be improved to enable classification of DC based on lesion depth.
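The precision, recall and F1-score quoted above for the U-Net model of Lee et al. [52] are mutually consistent, since the F1-score is the harmonic mean of precision and recall. A minimal check, using only the figures reported in the text (the function name `f1_score` is an illustrative assumption):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported for Lee et al. [52].
f1 = f1_score(63.29, 65.02)
print(round(f1, 2))  # 64.14, matching the reported F1-score
```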
Another study conducted by Schwendicke et al. [36] also reported on an AI-based model for detecting DC. The performance of this model was comparable with that of trained dentists. The limitation of this study was related to the reliability of the examiners. Duong et al. [37] also reported on an AI-based model for detecting DC using photographs on smart phones. The model demonstrated an accuracy of 92.37%, a sensitivity of 88.1% and a specificity of 96.6%. However, there was no heterogeneity in the data sets and the presence of plaque, debris, stains and shadow may have affected the results. A study conducted by Zhang et al. [60] suggested a CNNs-based model (ConvNet) for detecting DC using oral photographs. The model demonstrated an AUC of 85.65% and a sensitivity of 81.90%. However, the dataset was collected from a single organization, which can limit its applicability. In addition, factors like the presence of plaque and stains could have affected the obtained results. Another study conducted by Kühnisch et al. [61] reported on a CNNs-based model for detection and categorization of DC using oral photographs. In this study, the model demonstrated an excellent accuracy of 92.5%, a sensitivity of 89.6%, and a specificity of 94.3%. This study had taken into consideration the limitations of the previously reported studies and therefore only considered photographs that were free from plaque, calculus and saliva.
A study by Devlin et al. [43] proposed an AI-based model for detecting enamel-only proximal DC using bitewing radiographs. The model demonstrated significant results in comparison with expert dentists. Bayrakdar et al. [44] also reported on AI-based DL models (VGG-16 and U-Net) for automatic caries detection and segmentation on bitewing radiographs. These models demonstrated superior performance in comparison with experienced specialists. However, this study was compromised by the limited data sets obtained from one center. Zheng et al. [47] compared three CNNs models (VGG19, Inception V3, and ResNet18) for diagnosing deep DC. The CNNs model ResNet18 showed good performance in comparison with the other two models and the trained dentists. However, diagnosis of cases was done by a panel of experienced dentists, which is not the gold standard for diagnosing deep DC and pulpitis. Nevertheless, histological testing, which is the gold standard, is not practically feasible in clinical practice [69]. Another study conducted by Moran et al. [49] reported on a CNNs model (Inception) for identifying approximal DC on bitewing radiographs. The model demonstrated an accuracy of 73.3% and showed promising results in comparison with the reference model (ResNet). Mertens et al. [50] reported on a CNNs model for the detection of proximal DC using bitewing radiographs. The model demonstrated a ROC of 0.89 and a sensitivity of 0.81 and showed significant results in comparison with five expert dentists. The main strength of this study was its design as a randomized controlled trial. On the other hand, the main drawback was related to the limited sample of data sets, which were obtained from one center. Another study conducted by Mao et al. [56] used a CNNs-based model for identifying DC on bitewing radiographs. The model demonstrated an accuracy of 90.30% for detecting DC. The AlexNet model showed a high accuracy in comparison with other models. To achieve better accuracy, the authors reduced the size of the photographs used in the training process, which reduced training time and increased the accuracy of the model. A study conducted by Bayraktar et al. [59] described a CNNs-based model (YOLO) for the diagnosis of interproximal caries lesions on bitewing radiographs. The model demonstrated an excellent accuracy of 94.59%, a sensitivity of 72.26%, and a specificity of 98.19%. The main strength of this study was the large number of data sets, which yielded near perfect results. However, the model could not classify the DC lesions according to their location in enamel and/or dentin.
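The gap between the 94.59% accuracy and the 72.26% sensitivity reported for Bayraktar et al. [59] can be understood from the identity accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The sketch below solves this identity for the prevalence. It assumes the three figures were computed on the same test set and unit of analysis, which the text does not state, so the result is a back-of-the-envelope illustration rather than a reported value; the function name `implied_prevalence` is hypothetical.

```python
def implied_prevalence(accuracy, sensitivity, specificity):
    """Share of positive cases implied by the three reported measures.

    Solves accuracy = sensitivity * p + specificity * (1 - p) for p.
    Assumes all three values come from the same test set.
    """
    if sensitivity == specificity:
        raise ValueError("prevalence is not identifiable when the two are equal")
    return (specificity - accuracy) / (specificity - sensitivity)


# Figures reported in the text for Bayraktar et al. [59].
prevalence = implied_prevalence(accuracy=0.9459, sensitivity=0.7226, specificity=0.9819)
print(round(prevalence, 3))
```

Under that assumption, carious surfaces would make up only about 14% of the test data, so overall accuracy is driven mainly by the high specificity on the far more numerous sound surfaces. This is why accuracy alone can look strong even when more than a quarter of lesions are missed.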
Lian et al. [48] also reported on DL models for detecting DC and classifying DC on panoramic radiographs. The models demonstrated Dice coefficient values of 0.663 and an accuracy of 0.986. Their performance was similar to that of expert dentists. The strength of this study was the large data which was meticulously collected and labeled by three expert dentists. Controversial results were additionally revised by a fourth expert dentist. However, the panoramic radiographs used in this study were obtained from one single machine, hence, the performance may vary with panoramic radiographs obtained using equipment from another company. A study conducted by De Araujo Faria et al. [54] reported on an AI-based model for the prediction and detection of radiation-related caries (RRC) on panoramic radiographs. This model demonstrated an excellent detection accuracy of 98.8% and an AUC of 0.9869. For prediction, it showed an accuracy of 99.2% and an AUC of 0.9886. However, the limited sample size may have affected the results, as the patients in that particular center were usually at an advanced DC stage by the time the radiographs were obtained. Zhu et al. [62] also reported on a CNNs-based model (CariesNet) to delineate different degrees of caries on panoramic radiographs. The model demonstrated an excellent performance with a mean dice coefficient of 93.64%, an accuracy of 93.61%, an F1 score of 92.87% and a precision of 94.09%. The large number of datasets used to train and validate the model was a strength of this study.
Huang et al. [68] reported on the AI-based models AlexNet, VGG-16, ResNet-152, Xception, and ResNext-101 for detecting DC on OCT and micro-CT images. ResNet-152 demonstrated the highest accuracy rate of 95.21% and a sensitivity of 98.85% in comparison with the other models. However, the study utilized a manual verification process, where human errors are inevitable.

Conclusions
AI models have been widely explored for prediction, detection and diagnosis of DC. These models have demonstrated excellent performance and can be used in clinical practice for identifying patients with higher DC risk, and can also aid in enhancing diagnostic performance, treatment quality and patient outcome. The results of the predictive models can help in planning preventive dental care and in designing oral hygiene care and dietary plans for patients at high risk of DC. These models can assist dentists as a supportive tool in clinical practice and can also assist non-dental professionals in detecting and diagnosing DC more accurately in schools and rural health centers. Although these models have demonstrated excellent performance, there are certain limitations related to the size and heterogeneity of the data sets reported in most of these articles. Hence, these models need additional training and validation for better performance.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics12051083/s1, Table S1: Assessment of risk of bias domains and applicability concerns.