Machine Learning in Dentistry: A Scoping Review

Machine learning (ML) is being increasingly employed in dental research and application. We aimed to systematically compile studies using ML in dentistry and assess their methodological quality, including the risk of bias and reporting standards. We evaluated studies employing ML in dentistry published from 1 January 2015 to 31 May 2021 on MEDLINE, IEEE Xplore, and arXiv. We assessed publication trends and the distribution of ML tasks (classification, object detection, semantic segmentation, instance segmentation, and generation) in different clinical fields. We appraised the risk of bias and adherence to reporting standards, using the QUADAS-2 and TRIPOD checklists, respectively. Out of 183 identified studies, 168 were included, focusing on various ML tasks and employing a broad range of ML models, input data, data sources, strategies to generate reference tests, and performance metrics. Classification tasks were most common. Forty-two different metrics were used to evaluate model performances, with accuracy, sensitivity, precision, and intersection-over-union being the most common. We observed considerable risk of bias and moderate adherence to reporting standards, which hampers replication of results. A minimum (core) set of outcomes and outcome metrics is necessary to facilitate comparisons across studies.


Introduction
With the advent of the big data era, machine learning (ML) methods like Support Vector Machines (SVM), Naïve Bayesian Classifiers, Decision Trees, Random Forests (RF), K-Nearest Neighbors, and Deep Learning involving Convolutional Neural Networks (CNNs) have been increasingly adopted in fields such as finance, spatial sciences, and speech recognition [1]. Additionally, in medicine and dentistry, ML has been employed for a range of applications, for example, image analysis in dermatology, ophthalmology, or radiology, with accuracy similar to or better than that of experienced clinicians [1,2].
In the field of ML, mathematical models are employed to enable computers to learn inherent structures in data and to use the learned understanding to predict on new, unseen data [3]. For deep learning models, specifically CNNs, different types of model 'architecture' can be used. An ML workflow involves training the model, where a subset of the data is used to learn the underlying statistical patterns in the data, and then testing it on a yet unseen testing data subset. ML models tend to become more accurate when larger training datasets are used [4]. Moreover, basic learning parameters are usually optimized on a separate data subset, referred to as validation data, a process called hyperparameter tuning. Testing the model on the test data involves a wealth of performance metrics.
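To make this workflow concrete, the following is a minimal sketch in Python using scikit-learn; the dataset, the model choice (an RF classifier), and the hyperparameter grid are illustrative assumptions, not taken from any included study.

```python
# Minimal sketch of the train/validation/test workflow described above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder for a clinical dataset

# Hold out a test set that the model never sees during development.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Carve a validation set out of the training data for hyperparameter tuning.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_model, best_acc = None, 0.0
for n_trees in (50, 100, 200):  # hypothetical hyperparameter grid
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_acc:
        best_model, best_acc = model, val_acc

# Report performance once, on the untouched test subset.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```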

Materials and Methods
Ethics approval was not sought because this study was based exclusively on published literature. Screening of titles and abstracts was performed by one reviewer (A.C.). Inclusion or exclusion was decided by two reviewers in consensus (F.S. and A.C.). All papers which were found to be potentially eligible were assessed in full text against the inclusion criteria. We did not limit the inclusion of studies based on the target study population, outcome of interest, or the context in which ML was used. All original studies related to dentistry and ML, without gross reporting fallacies, such as failure to define the type of ML used, failure to minimally describe which dataset was employed for training and testing, or failure to report study findings, were included in this scoping review.

Data Collection, Items, and Pre-Processing
Data extraction was performed jointly by A.C., A.M., and L.T.A.-S. The extracted data were reviewed by L.T.A.-S. Any disagreement was adjudicated by discussion (L.T.A.-S. and J.K.). A pretested Excel spreadsheet was used to record the extracted data. Study characteristics included country, year of publication, aim of study and clinical field, type of input data (covariates or imagery [photographs or radiographs; 2-D or 3-D imagery]), dataset source, size and partitions (training, test, and validation sets), type of model used and, for deep learning, architecture, augmentation strategies employed, the reference test and its definition, comparators (if available, e.g., current standard of care, clinicians, etc.), and performance metrics and their values. In each study, all data items that were compatible with a domain of the extracted data were sought and recorded (e.g., all performance metrics, all models employed). No assumptions were made regarding missing or unclear data.

Quality Assessment
The risk of bias was assessed using the QUADAS-2 tool in four domains [9]. First, risk of bias in data selection was assessed using the parameters of 'inappropriate exclusions', 'case-control design', and 'consecutive or random patient enrollment'. Second, risk of bias in the index test was assessed using the parameters of 'assessment independent of the reference standard' and 'pre-specification of thresholds used'. Third, risk of bias in the reference standard was assessed using the parameters of 'validity of the reference standard' and 'assessment independent of the index test'. Fourth, risk of bias in flow and timing was assessed using the parameters of 'appropriate interval between index test and reference standard', 'use of a reference standard for all patients', 'use of the same reference standard for all patients', and 'inclusion of all patients in the analysis'. Using the same tool, applicability concerns in three domains were also evaluated. First, applicability concerns for data selection were assessed using the parameter of 'mismatch between the included patients and the review question'. Second, applicability concerns for the index test were assessed via the parameter of 'mismatch between the test, its conduct, or its interpretation and the review question'. Last, applicability concerns for the reference standard were assessed via the parameter of 'mismatch between the target condition as defined by the reference standard and the review question'. We note that, alternatively (or even complementarily), the PROBAST tool [10] could have been used for the same assessment.
Adherence to reporting standards was assessed using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) tool, a 22-item checklist that provides reporting standards for prediction model studies [11]. Note that not all included studies were prediction model studies (studies varied widely in their broader approach, as discussed below), but all involved a mathematical model (ML) for a specific task, which is why we assumed that most studies could be expected to adhere to the large majority of the checklist's domains. TRIPOD has been used for similar purposes in other domains [5]. Risk of bias and adherence to reporting standards were independently assessed by one reviewer (L.T.A.-S.).

Data Synthesis
We describe various aspects of the included studies, such as country of origin, type of input data used, source of datasets, type of ML methods used, etc. We had initially attempted to conduct a meta-analysis using the results of the confusion matrices reported by the included studies; however, out of 168 studies, only 16 (10%) presented their confusion matrices in a way that could be used for analysis. Furthermore, these studies differed from each other in terms of their clinical research question/task, type of input data, model architecture, etc.
Instead, a narrative synthesis was performed, displaying which ML tasks (i.e., classification, object detection, semantic segmentation, instance segmentation, and generation) have been studied in the different clinical fields of dentistry, namely restorative dentistry and endodontics, oral medicine, oral radiology, orthodontics, oral surgery and implantology, periodontology, prosthodontics, and others (i.e., non-specific field or general dentistry). We briefly explain the different tasks below:

• In ML, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. An example is to classify a given handwritten character as one of the known characters. Algorithms popularly used for classification in the included studies were logistic regression, k-Nearest Neighbors, Decision Trees, Naïve Bayes, RF, Gradient Boosting, etc.

• In object detection tasks, one attempts to identify and locate objects within an image or video. Specifically, object detection draws bounding boxes around the detected objects, which allows locating said objects. Given the complexity of handling image data, deep learning models based on CNNs, such as Region-based CNN, Fast Region-based CNN, You Only Look Once, and Single Shot MultiBox Detector, are popularly used for this task.

• In image segmentation tasks, one aims to identify the exact outline of a detected object in an image. There are two types of segmentation tasks: semantic segmentation and instance segmentation. Semantic segmentation classifies each pixel in the image into a particular class; it does not differentiate between different instances of the same object. For example, if there are two cats in an image, semantic segmentation gives the same label, for instance 'cat', to all the pixels of both cats. Instance segmentation differs in that it gives a unique label to every instance of a particular object in the image; in the example of an image containing two cats, each cat would receive a distinct label, for instance 'cat1' and 'cat2' (see the sketch after this list). Currently, the most popular models for image segmentation are fully convolutional networks and their variants, like U-Net, DeepLab, PointNet, etc.
• A fifth type of ML task is a generation task, which is not predictive in nature. Such tasks involve the generation of new images from input images, for example, the generation of artifact-free CT images from those containing metal artifacts.
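To illustrate the semantic-versus-instance distinction from the two-cats example above, the following toy sketch (a hypothetical 4 × 6 label map built with NumPy, not drawn from any included study) contrasts the two kinds of masks:

```python
import numpy as np

# Toy 4x6 image containing two "cats"; background pixels = 0.
semantic = np.zeros((4, 6), dtype=int)
semantic[1:3, 0:2] = 1   # cat on the left  -> class label 1 ('cat')
semantic[1:3, 4:6] = 1   # cat on the right -> the same class label 1

instance = np.zeros((4, 6), dtype=int)
instance[1:3, 0:2] = 1   # 'cat1': first instance gets its own id
instance[1:3, 4:6] = 2   # 'cat2': second instance gets a distinct id

print(np.unique(semantic))  # [0 1]    -> one foreground class overall
print(np.unique(instance))  # [0 1 2]  -> one label per object instance
```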
The study protocol was registered after the initial screening stage (PROSPERO registration no. CRD42021288159).

Results
Study Selection and Characteristics
A total of 183 studies were identified and 168 (92%) studies were included (Figure 1). The included studies and their characteristics can be found in Table S1, and the excluded studies, with reasons for exclusion, are listed in Table S2. The included studies originated from a range of countries (Figure S1) and used different kinds of input data, such as 2-D data (radiographs: 42% of studies; photographs or other kinds of images: 16%), 3-D data (radiographic scans: 18%; non-radiographic scans: 4%), non-image data (survey data: 10%; single nucleotide polymorphism sequences: 1%), and combinations of the aforementioned types of data (9%). Further, 97% of studies used data from universities, hospitals, and private practices, whereas 1% of studies each used data from the National Health and Nutrition Examination Survey, the M3BE database, the 2013 Nationwide Readmissions Database of the USA, and the National Institute of Dental and Craniofacial Research dataset.
Additionally, 85% of studies partitioned their total dataset into training and testing data subsets, and 59% of studies also created validation data subsets from the same data source. The median size of the training datasets was 450 data instances (range: 12 to 1,296,000) and that of the test datasets was 126 (range: 1 to 144,000). Nearly half of the studies tested model performance on a hold-out test dataset, while the remaining studies used cross-validation. Cross-validation is a resampling method that uses different portions of the data to test and train a model during each iteration. For example, in a 10-fold cross-validation, the original dataset is randomly partitioned into 10 subsamples, of which nine are used as training data and one as test data. Ten iterations of the following step are carried out: the model is trained on the nine subsamples designated as training data and tested on the one subsample of test data; in each iteration, a different subsample is chosen to serve as the test data, and thus a different combination of subsamples constitutes the training data. Eventually, the final estimate of model performance is the average of these results.
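As a sketch of this procedure (again using scikit-learn and a placeholder dataset as illustrative assumptions), 10-fold cross-validation can be run as follows:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder for a dental dataset

# 10 folds: each iteration trains on 9 subsamples, tests on the held-out one.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Final performance estimate = average over the 10 test folds.
print(scores.round(3), "mean:", scores.mean().round(3))
```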
In addition, 65% of studies augmented their input data, mainly the training data, although a few augmented the testing data, too. Only 20% of studies used an external dataset to validate their model's performance. The reference test (i.e., how the ground truth was defined) was established by professional experts in 73% of studies: one expert in 18% of studies, two experts in 11%, three experts in 10%, four and five experts in 2% each, six experts in 1%, and seven, eight, 12, and 20 experts in 0.5% each; a further 27% of studies used experts to establish the reference test but did not provide details on their exact number. Additionally, 22% of studies used information from their datasets as the reference test (for example, age or diagnoses from medical records), and 1% of studies used a software tool to generate the reference test. The remaining 4% of studies did not provide details on how the reference test was established.
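For illustration of such augmentation, a minimal on-the-fly pipeline for training images might look as follows; the use of torchvision and the specific transform choices here are assumptions for the sketch, not a reconstruction of any included study's pipeline.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Hypothetical augmentation pipeline applied to each training image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror half of the images
    transforms.RandomRotation(degrees=10),                  # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # vary exposure/contrast
    transforms.ToTensor(),                                  # PIL image -> float tensor in [0, 1]
])

# Stand-in for a radiograph: a random 64x64 RGB image.
img = Image.fromarray(np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8))
augmented = augment(img)
print(augmented.shape)  # torch.Size([3, 64, 64])
```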
Of all studies, 70% used deep learning models: CNNs as classifiers (59 studies), CNNs for other tasks (14 studies), Faster R-CNN (seven studies), fully convolutional networks (19 studies), Mask R-CNN (seven studies), 3-D CNNs (three studies), adaptive CNN and pulse-coupled CNN (one study each), and non-convolutional deep neural networks (seven studies) (Table S1). Another 22% of studies used non-deep learning models: perceptrons (four studies), other neural networks (three studies), and other types of models, such as fuzzy classifiers, SVM, RF, etc. (30 studies). In addition, 6% of studies used various combinations of the aforementioned models, and 2% of studies did not provide details of the model architecture employed. Both models using and models not using deep learning were employed in higher proportions by studies in restorative dentistry and endodontics, oral medicine, and the non-specific field or general dentistry (Table S3). Additionally, models not using deep learning were frequently employed by studies in orthodontics and periodontology. Finally, 20% of studies compared their model's performance with that of human comparators.

Risk of Bias and Applicability Concerns
The risk of bias was assessed in four domains, namely data selection, index test, reference standard, and flow and timing. It was found to be high for 54% of the studies regarding data selection and for 58% of the studies regarding the reference standard (Table 1). On the other hand, the risk of bias was low for the majority of studies regarding the index test (77%) and flow and timing (89%). Applicability concerns were found to be high for 53% of the studies regarding data selection but were low for most studies regarding the index test (79%) and reference standard (73%).

Adherence to Reporting Standards
Overall adherence to the TRIPOD reporting checklist was 33.3%, with 18/22 domains having an adherence rate of less than 50% (Figure 2). Reporting adherence was at or above 80% for background and objectives and for the potential clinical use of the model and implications for future research, but below 10% for sample size calculation, handling of missing data, differences between development and validation data, and details on participants. In particular, less than 20% of studies adequately defined their predictors and outcomes (in terms of blinded assessment), reported stratification into risk groups, presented the full prediction model, or provided information on supplementary resources, such as the study protocol, a web calculator, or datasets. Less than 40% of the studies adequately reported their data sources (i.e., study dates), participant eligibility, statistical methods (specifically, details on model refinement), model results (in terms of results from crude models), study limitations, and results with reference to performance in the development data and any other validation data.

Tasks, Metrics, and Findings of the Studies
Based on the nature of the ML task formulated, the 168 included studies could be classified into five major categories of ML tasks: classification (n = 85), object detection (n = 22), semantic segmentation (n = 37), instance segmentation (n = 19), and generation (n = 5). Classification tasks were most commonly used in oral medicine studies (22%), whereas object detection, semantic segmentation, and instance segmentation tasks were each most commonly used in studies in the non-specific field or general dentistry (36%, 38%, and 58%, respectively; Table 2). Generation tasks, though small in number, were most commonly used in oral radiology studies (80%).

A total of 42 different metrics were used by the studies to evaluate model performance, and some of these could be grouped into one class; for example, the various correlation coefficients could be combined. Such grouping (or consolidation) resulted in 26 distinct classes of metrics. Note that most studies reported multiple metrics. Studies on classification tasks commonly reported accuracy, sensitivity, the area under the receiver operating characteristic curve, specificity, and precision, and those on object detection reported sensitivity, precision, and accuracy. Studies on semantic segmentation reported intersection-over-union (IoU) and sensitivity, and those on instance segmentation reported accuracy, sensitivity, and IoU. Lastly, studies using generation tasks commonly reported the peak signal-to-noise ratio, the structural similarity index, and relative error. Table S4 shows the number of studies which used the different metrics, stratified by ML task.
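For clarity on the most frequently reported metrics, the sketch below computes accuracy, sensitivity, precision, and specificity from hypothetical binary confusion-matrix counts, and IoU for a pair of toy segmentation masks; all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical binary confusion-matrix counts.
tp, fp, fn, tn = 80, 10, 5, 105

accuracy    = (tp + tn) / (tp + fp + fn + tn)   # 0.925
sensitivity = tp / (tp + fn)                    # recall: ~0.941
precision   = tp / (tp + fp)                    # ~0.889
specificity = tn / (tn + fp)                    # ~0.913

# Intersection-over-union (IoU) for two toy binary segmentation masks.
pred   = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
target = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
iou = np.logical_and(pred, target).sum() / np.logical_or(pred, target).sum()  # 0.5

print(accuracy, sensitivity, precision, specificity, iou)
```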
After stratifying the studies by ML task and clinical field of dentistry, we attempted to evaluate studies that reported accuracy, mean average precision, or IoU. A formal comparison was inhibited by the large variability at the level of clinical or diagnostic tasks amongst the studies.

Table 2. Number of studies in each field of dentistry, stratified by type of machine learning task (n = 168).

Discussion
ML in dentistry is characterized by the availability of a plethora of clinical tasks, which necessitate the use of a wide range of input data types, ML models, performance metrics, etc. This has given rise to a large body of evidence with limited comparability. The present scoping review synthesized this evidence and allowed us to comprehensively assess this body of work. We will begin by discussing our findings in detail.
First, the included studies aimed at different ML tasks on a wide variety of data. These data then differed once more within specific subtypes (e.g., imagery, with radiographs, scans, and photographs each being sub-classified again and differing in resolution, contrast, etc.). Moreover, data usually stemmed from single centers, representing only a limited population (and limited diversity in terms of data generation strategy or technique), all of which likely adversely impacts the generalizability of results. The data used were almost never made available, except for the few studies employing data from open databases, leading to difficulties in the replication of results. Researchers are urged to comply with journals' data sharing policies and make their data available upon reasonable request. We acknowledge that there may be data sharing and privacy concerns across institutions and countries. Alternatives to centralized learning of ML models that do not require data sharing, like federated learning, may be of relevance, especially for data which are hard to deidentify [178]. Practices of data linkage and triangulation, i.e., using a variety of data sources to create a richer dataset, were almost non-existent, thus limiting options for verifying data integrity and for increasing the learning output of an ML model by leveraging information from multiple data sources on hierarchical structures and correlations.
Second, a wide range of outcome measures was used by the included studies. Outcomes can be measured on different levels, such as the patient, tooth, or surface level, and while this is relevant for any comparison or synthesis across studies, it was not always reported on which level the outcomes were assessed. Another issue was the high number of performance metrics in use, as evident from our results, leading to only a few studies being comparable to each other. Defining an agreed-upon set of outcome metrics for specific subtasks of ML in dentistry (e.g., classification, detection, or segmentation on images), along with standards for the level of outcome assessment, seems warranted. This outcome set should reflect various aspects of performance (e.g., under- and over-detection), consider the impact of prevalence (e.g., predictive values), and attempt to transport not only diagnostic value but also clinical usefulness. For the latter, studies attempting to assess the value of ML in the hands of clinicians against the current standard of care are needed.
Third, the use of reference tests (i.e., how the ground truth was established) warrants discussion. A wide range of strategies to establish reference tests was employed, and in many studies, no details on the definition of the reference test were provided. A few studies using image data relied on only one human annotator as the reference test, a decision which may be criticized given the known wide variability in experts' annotations [2]. Alternative concepts of applying the reference test to training datasets should be employed and compared in order to gauge the impact of different approaches and to validate the one eventually selected. Additionally, testing datasets should be standardized and heterogeneous to ensure class balance and generalizability. One approach is to establish open benchmarking datasets, as attempted by the ITU/WHO Focus Group on Artificial Intelligence for Health [179].
Fourth, the quality of the conduct and reporting of ML studies in dentistry remains problematic. Notably, the specific risks emanating from ML and the underlying data are insufficiently addressed, e.g., biases, data leakage, or overfitting of the model. Furthermore, many studies suffered from unclear validation, or a lack of validation, of their results on external datasets. The evaluation of a model's performance on unseen data is a crucial aspect, as it relates to the generalizability of ML models to data from other sources. Exploration of why some models were not generalizable was even less common, thus preventing identification of the steps required to improve the models. Generally, the majority of studies performed application testing, developed models, and showed that ML can learn and, in many studies, predict. Understanding why this is the case, how it could be improved, what the clinical domain needs, or which safeguards for ML in dentistry are required was seldom addressed. General reporting did not allow full replication, as many details were not presented, and additionally, the presentation of model performance remained, as discussed, insufficient. Researchers need to adhere to the published guidelines on study conduct and reporting [180–182].
In an effort to characterize the emerging pattern in the included studies, we would first like to elaborate on the nature of the clinical tasks employed by the studies. A wide array of research questions was present: from detecting dental artifacts in images to investigating the benefits of transfer learning, and from classifying different dental conditions to aiding decision-making and assessing cost-effectiveness. Thus, there is evidence of a broadening of the avenues where ML could be exploited. As stated earlier, classification tasks were the most common, which may be because diagnosing dental structures or anomalies on images is a vital step towards successful treatment outcomes and prognosis. However, over the years, ML methods have improved their classification performance on images at the cost of increased model complexity and opacity [183]. The inability to explain ML's methods and decisions is one of the contributing factors towards the development of explainable AI, i.e., a set of processes that allows human users to comprehend and trust the results created by ML algorithms. Second, more recent studies tended to employ image segmentation models [2,25,39,48,59,60,73,151].
The presented scoping review has a few salient features. First, it is the most comprehensive overview of ML in dentistry to date, with 168 studies included. Second, and as a limitation, we could not include randomized controlled trials because none were available, and we found the included studies to have a considerable risk of bias, both of which should be considered when interpreting our results. Third, to our knowledge, this study is the first to employ TRIPOD for gauging the reporting quality of studies using ML in dentistry. TRIPOD is a checklist designed to assess prediction models and has not been validated specifically for ML applications [5]. However, previous studies have used it to evaluate ML models, since the quality assessment criteria for clinical prediction tools and ML models are similar [5]. At present, a TRIPOD-ML tool is under construction [5]. Fourth, we included studies published up to May 2021 only, as the systematic critique of the 168 included studies has required considerable time and effort since then. We acknowledge that the inclusion of recently published studies may have strengthened our review. Furthermore, we acknowledge that arXiv, an archiving database, may include studies which did not undergo a formal peer-review process, which may be a limitation of our study. However, studies on arXiv are reviewed by peers in a non-formal process and updated after peer review. Last, clinical usability cannot be inferred from this study, as it was not the focus of this comprehensive review.

Conclusions
In conclusion, we demonstrated that ML has been employed for a large number of tasks in dentistry, building on a wide range of methods and employing highly heterogeneous reporting metrics. As a result, comparisons across studies or benchmarking of the developed ML models are only possible to a limited extent. A minimum (core) set of defined outcomes and outcome metrics would help to overcome this and facilitate comparisons, whenever appropriate. The overall body of evidence showed considerable risk of bias as well as moderate adherence to reporting standards. Researchers are urged to adhere more closely to reporting standards and plan their studies with even greater scientific rigor to reduce any risk of bias. Last, the included studies mainly focused on developing ML models, while presenting their generalizability, robustness, or clinical usefulness was uncommon. Future studies should aim to demonstrate that ML positively impacts the quality and efficiency of healthcare.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm12030937/s1. Search strategy; Figure S1: Geographical trends in number of publications of machine learning methods in dentistry between 1 January 2015 and 31 May 2021; Table S1: Studies included in the scoping review along with their characteristics (n = 168); Table S2: Studies excluded from the scoping review along with the reason for exclusion (n = 15); Table S3: Number of studies in each field of dentistry, stratified by the machine learning model used (n = 168); Table S4: Number of studies using the various performance metrics stratified by type of machine learning task. References [1,184–197] are cited in the Supplementary Materials.

Data Availability Statement: All relevant data are available through the paper and supplementary material. Additional information is available from the authors upon reasonable request.
Conflicts of Interest: F.S. and J.K. are co-founders of the startup dentalXrai GmbH. dentalXrai GmbH had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.