Deep Learning in Diagnosis of Dental Anomalies and Diseases: A Systematic Review

Deep learning and its diagnostic applications in oral and dental health have received significant attention recently. In this review, studies applying deep learning to diagnose anomalies and diseases in dental image material were systematically compiled, and their datasets, methodologies, test processes, explainable artificial intelligence methods, and findings were analyzed. Tests and results of studies involving human-artificial intelligence comparisons are discussed in detail to draw attention to the clinical importance of deep learning. In addition, the review critically evaluates the literature to guide and further develop future studies in this field. An extensive literature search covering 2019 to May 2023 was conducted using the Medline (PubMed) and Google Scholar databases to identify eligible articles, and 101 studies were shortlisted, including applications for diagnosing dental anomalies (n = 22) and diseases (n = 79) using deep learning for classification, object detection, and segmentation tasks. According to the results, the most commonly used task type was classification (n = 51), the most commonly used dental image material was panoramic radiographs (n = 55), and the most frequently used performance metrics were sensitivity/recall/true positive rate (n = 87) and accuracy (n = 69). Dataset sizes ranged from 60 to 12,179 images. Although deep learning algorithms can be designed as custom or at least individualized architectures, standardized architectures such as pre-trained CNNs, Faster R-CNN, YOLO, and U-Net were used in most studies. Few studies used an explainable AI method (n = 22) or applied tests comparing human and artificial intelligence performance (n = 21). Based on the high-performance results reported by the studies, deep learning is promising for better diagnosis and treatment planning in dentistry.
Even so, the safety of these models should be demonstrated using a more reproducible and comparable methodology, including tests that provide information about their clinical applicability, by defining a standard set of tests and performance metrics.


Introduction
Today, although technological developments in oral and dental health have created opportunities for the early diagnosis and treatment of most oral and dental diseases, their global increase cannot be prevented. According to the WHO Global Oral Health Status Report (2022) [1], oral and dental diseases affect approximately 3.5 billion people worldwide. Especially in low- and middle-income countries, services in the field of oral and dental health are inadequate due to the costs of diagnosis and treatment. As a result, the WHO estimates that three out of four people in low- and middle-income countries are affected by oral and dental diseases [1]. The most common oral and dental diseases are dental caries, periodontal diseases, edentulism, oral cancer, dental anomalies, and cleft lip and palate [1]. When efficient diagnosis and treatment are not provided, these diseases can cause complications ranging from mild discomfort to death.
In addition to clinical examination, dental imaging technologies play a critical role in diagnosing oral and dental diseases. Figure 1A gives examples of anomalies and diseases associated with dental imaging techniques. Advances in three-dimensional imaging technologies such as cone-beam computed tomography (CBCT), magnetic resonance imaging, and ultrasound, alongside two-dimensional panoramic and periapical radiographs, have increased diagnostic success rates [2,3]. However, image-based dental diagnosis has limitations. It is not fully objective, as it depends on specialist experience and inter-observer variability. Radiographs have noisy backgrounds and overlapping anatomical structures. Computed tomography has poor resolution compared to radiographs due to scattering from metallic objects. Ultrasonography contains high levels of noise. These limitations make interpreting images difficult and increase the rate of expert oversight and error.
Expert systems, which aim to assist experts in managing images, formerly applied strict rules and methods modeled on how experts think. In recent years, with easier access to data and the development of computers with faster processing power, artificial intelligence (AI) technologies have advanced, and expert systems have evolved into data-driven AI applications. In particular, the growing evidence of the successful performance of deep learning methods in image-based diagnostic tasks where diagnosis is challenging, such as cancer [4] and lung and eye diseases [5,6], has increased interest in medical applications of AI [7,8]. Recent literature reviews have acknowledged the success of expert systems based on deep learning methods that compete with the performance of experts in image-based dental diagnostic tasks, including the research reviewed in this article.
Deep learning is a form of machine learning that uses multilayer artificial neural networks in a wide range of applications, from image, audio, and video processing to natural language processing. Unlike traditional machine learning methods, which learn from hand-crafted rules and features, deep learning automatically extracts and learns features from raw data. In addition to this flexibility, prediction accuracy tends to increase with the size of the dataset. The concept of deep learning was first proposed by Hinton in 2006 as a more efficient version of multilayer artificial neural networks [9]. The CNN architecture, the most commonly used deep learning algorithm, is presented in Figure 1B. Since the emergence of deep learning, it has been proposed for many applications in the field of oral and dental health, such as tooth classification [10], detection [11] and segmentation [12], endodontic treatment and diagnosis [13], periodontal problem tooth detection [14], oral lesion pathology [15,16], forensic medicine applications [17,18], and classification of dental implants [19,20]. Considering the large number of images obtained in the field of oral and dental health, dentists' reliance on computer applications for analyzing these images, and the need to improve decision-making performance in limited time, deep learning applications appear to have excellent potential.
Figure 1. (A) Examples of dental anomalies and diseases on dental imaging techniques: a. Mesiodens on panoramic radiographs [21]; b. Apical lesions on periapical radiographs [22]; c. Temporomandibular joint osteoarthritis on orthopantomograms [23]; d. Missing tooth on cone-beam computed tomography [24]; e. Dental caries on near-infrared light transillumination [25]; f. Dental caries on bitewing radiographs [26]; g. Dental calculus and inflammation on optical color images [27]; h. Gingivitis on intraoral photos [28]. (B) Convolutional neural network architecture.
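As a minimal illustration of the convolutional building blocks sketched in Figure 1B, the snippet below implements one convolution → ReLU → max-pooling stage in plain NumPy. The input array and edge-detecting kernel are invented for demonstration and are not taken from any reviewed study.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling that halves each spatial dimension."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A toy 6x6 "image" passed through one conv -> ReLU -> pool stage.
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[-1.0, -1.0], [1.0, 1.0]])  # horizontal edge detector
features = max_pool2d(relu(conv2d(image, edge_kernel)))
print(features.shape)  # feature map reduced from 6x6 to 2x2
```

In a real CNN, many such kernels are learned from the data by backpropagation and the stages are stacked, which is what allows features to be extracted automatically rather than hand-designed.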
Several reviews of deep learning for oral and dental health have been published recently. These studies have focused on specific research areas such as dental caries [29], dental implants [30,31], forensics [32], endodontics [13], temporomandibular joint disorder [33], periapical radiolucent lesions [14], gingivitis and periodontal disease [34], and dental informatics [35]. Other reviews have addressed deep learning in dentistry [36][37][38][39][40] and dental imaging [41,42]. However, a comprehensive review of deep learning methods used to diagnose dental diseases, including dental anomalies, had yet to be conducted. This study systematically reviews 101 research articles applying deep learning methods to diagnose dental anomalies and diseases.

The essential contributions of this article can be listed as follows:
• This study is the first systematic review of deep learning applied to dental anomalies.
• This study includes 101 shortlisted research articles from Google Scholar and PubMed that apply deep learning methods for diagnosing dental anomalies and diseases.
• This review records variables such as dataset size, dental imaging method, deep learning architecture, performance evaluation criteria, and explainable AI method.
• Unlike other reviews in the literature, studies comparing human and AI performance among the shortlisted articles are discussed in detail, with particular attention to statistical tests.
Regarding the workflow of the current article, Section 2 describes the research methodology, including the research question, information sources, eligibility criteria, search strategy, selection process, data extraction, and analysis processes. Section 3 synthesizes the dataset features of the shortlisted studies and presents findings such as the deep learning methods, performance metrics, and human-AI comparisons. Section 4 discusses these findings with emphasis on managerial implications, academic implications, shortcomings of the literature, and suggested solutions; the problems and proposed solutions for increasing the clinical utility of deep learning, as well as the limitations of the current article, are also covered in this section. Finally, Section 5 outlines potential research directions.

Material and Methods
This systematic review was conducted by referring to the PRISMA 2020 statement [43], an updated guideline for reporting systematic reviews. The review question determining the study's eligibility criteria and search strategy is based on the PICO (problem/population, intervention/indicator, comparison, and outcome) framework in Table 1.

Information Sources and Eligibility Criteria
The systematic literature search was carried out by one reviewer through an extensive investigation of two electronic databases, Medline via PubMed and Google Scholar, for studies published in the last five years (2019-May 2023). Google Scholar is a comprehensive database of scholarly material from academic research, including books, journal articles, conference reports, chapters, and theses. Google Scholar provides free services, with no subscription required, and orders search results by relevance, accounting for publication venue, authors, full-text match, and citation count. Medline is a database containing international publications on clinical medicine and biomedical research; the PubMed database is a freely accessible interface to Medline. The research articles included in this systematic review were selected according to the eligibility criteria below.
Inclusion criteria:
1. Articles published between January 2019 and May 2023.
2. Articles on the diagnosis of dental anomalies or diseases.
3. Articles proposing deep learning methods.
4. Articles built on a reference dataset of dental imaging techniques.
5. Articles written in English.
6. Articles containing detailed information about the dataset, methods, results, and tests applied.
Exclusion criteria:
1. Articles on topics such as healthy tooth detection, tooth labeling/numbering, dental implants, and endodontic treatment.
2. Articles applying other AI methods that do not include deep learning, such as classical machine learning.
3. Review articles and other publication types such as conference papers, abstracts, book chapters, preprints, and non-full-text articles, even if the work is a research article.

Search Strategy and Selection Process
Keywords combining techniques of interest (such as deep learning/CNN), image materials (such as radiographs), and areas of interest (such as dental anomalies/diseases) were used to search for articles. The Medical Subject Headings (MeSH) terms deep learning, CNN, convolutional neural networks, oral, dental, tooth, teeth, anomalies, and diseases were included. These MeSH terms were combined with Boolean operators (AND/OR), and the advanced settings of the databases were used for selections such as publication date range, publication types, and language. The electronic search strategy applied to the databases is given in Table 2.

Table 2. Electronic search strategy applied to the databases.
Database: Google Scholar. Search strategy: all: ("deep learning" OR "CNN" OR "convolutional neural network") AND ("oral" OR "dental" OR "tooth" OR "teeth") AND ("anomalies" OR "diseases").
The articles included in this systematic review were selected in two stages. In the first stage, a reviewer evaluated the articles according to the relevance of their titles and abstracts to our research topic; studies whose titles and abstracts were unrelated to oral and dental health, as well as records that were not full-text articles (such as abstracts), were eliminated. In the second stage, a second reviewer conducted a detailed examination according to the eligibility criteria. During this examination, review articles, articles whose method was not deep learning, and articles that did not focus on the diagnosis of oral/dental anomalies or diseases were excluded.

Data Extraction and Analysis
One reviewer performed the data extraction phase for the included studies. From the included articles, the primary author, publication year, target anomaly/disease, image type, number of images, primary performance metric and outcome value, other measured performance criteria, and explainable AI method were obtained by reviewing the full texts in detail. The shortlist presented in the article was thoroughly reviewed and checked by a second reviewer (a specialist dentist). Separate shortlists were compiled for anomaly and disease studies, and the two subjects were analyzed separately. The included studies were categorized as classification, object detection, and segmentation studies. Data such as the country of origin of the studies, the data division strategy determining the sizes of the training and test sets, and the subfield of dentistry were not extracted.
The distribution of the number of publications by year, task type, anomaly/disease type, and dental imaging technique was visualized and analyzed. Given the heterogeneity of the index and reference tests, performance measures, and outcomes across studies, a meta-analysis was not performed, as the results were largely unsuitable for heterogeneity tests. Instead, for quality assessment, a separate shortlist was created of the included studies that performed human-AI comparison tests. From these studies, data on reference datasets, statistical significance tests, diagnostic performance results, diagnostic time, and the impact of AI on performance were extracted and analyzed. Further analysis, including the clinical significance of deep learning, was performed narratively alongside descriptive statistics.

Results
According to the search results, a total of 1997 records were identified, including 1860 from Google Scholar and 137 from PubMed. After duplicates were removed, 545 records that were not full-text research articles (n = 497) or not related to dental health topics (n = 48) were excluded, and 296 records were screened. According to the screening results, 101 studies that met the eligibility criteria were included in the systematic review. Of the included studies, 22 are on dental anomalies (Table 3), and 79 are on dental diseases (Table 4). Figure 2 presents the search results in detail according to the PRISMA 2020 flowchart.
Figure 3 shows the distribution of publications by year, task performed, anomaly/disease application, and dental imaging technique. Examining the distribution of publications over 2019-May 2023, the highest number belongs to 2022, with 14 anomaly and 23 disease studies. Although 2023 (n = 17) is not yet finished, its publication count is already more than double that of 2019 (n = 7), and the number of publications has increased yearly. The most common task performed in diagnosing dental anomalies/diseases is classification (n = 51). The next most common tasks are object detection in anomaly diagnosis (n = 7) and segmentation in disease diagnosis (n = 19). The most common disease diagnosis studies concern dental caries and plaque (n = 31), followed by periodontal diseases (n = 23) and cysts and tumors (n = 11). Cleft lip and palate (n = 2), temporomandibular joint osteoarthritis (TMJOA), gingivitis, and missing teeth (n = 3) are the least researched topics in the diagnosis of dental disease with deep learning. Apart from these, there are also studies on the diagnosis of inflammation, osteoporosis, and fractures.
Since mesiodens is a type of supernumerary tooth, the most common type of anomaly for which deep learning methods were used for diagnosis is supernumerary teeth (n = 11). Other common types of anomalies examined were impacted teeth (n = 6), hypomineralization (n = 4), and ectopic eruption (n = 2). The three other anomaly applications in Figure 3 are the diagnosis of taurodontism [64], maxillary canine impaction [48], and odontomas [46]. Another study addresses the classification of ten different types of anomalies [47].
The most commonly used deep learning methods for the classification task in dental anomaly diagnosis build on pre-trained CNN models, either fine-tuned (transfer learning) or modified. InceptionResNet, VGG16, AlexNet, InceptionV3, SqueezeNet, ResNet, and DenseNet are the CNN models used as solution methods. In one study [44], hybrid graph cut segmentation was applied to separate the background and anatomy in panoramic radiography images, and the preprocessed images were then classified with a CNN. YOLO (n = 2), Faster R-CNN (n = 2), DetectNet, and EfficientDet-D3 models were used for object detection tasks in dental anomaly diagnosis. In one study [55], the authors designed a new DMLnet model based on the YOLOv5 architecture for automatically diagnosing mesiodens on panoramic radiographs. For the segmentation task, U-Net (n = 3) or modified U-Net (n = 1) architectures were generally used. In one study on the diagnosis of mesiodens [60], combined tasks were performed: segmentation with the DeepLabV3 model and classification with the InceptionResNetV2 model. The methods used for disease diagnosis are more diverse than those for dental anomaly diagnosis, but the most common approach is the same: models designed with pre-trained CNNs. ResNet (n = 5), DenseNet (n = 5), and AlexNet (n = 4) are the most commonly used pre-trained CNNs, with VGG (n = 2), Inception (n = 2), EfficientNet (n = 2), LeNet (n = 2), and MobileNet (n = 1) also used. Another common approach is custom CNN models designed by the authors (n = 6). In addition, hybrid methods combining two different algorithms were used: CNN-LSTM [70], CNN-SVM [71], Siamese Network-DenseNet121 [71], and CNN-fuzzy logic [84]. In one study [74], a Swin Transformer, a transformer architecture recently shown to compete with CNNs, was used.
Faster R-CNN (n = 7), DetectNet (n = 5), YOLO (n = 4), Single-Shot Detector (SSD, n = 2), and Mask R-CNN (n = 1) were used for the object detection task. U-Net (n = 14) and DeepLabV3+ (n = 2) were the most commonly used architectures for the segmentation task in disease diagnosis, as in dental anomaly diagnosis. Two-stage methods combining different tasks are frequently proposed for diagnosing dental diseases. Studies that applied segmentation as image preprocessing in the first stage and classification in the second used U-Net + DenseNet (n = 2), Mask R-CNN + CNN, morphology-based segmentation + modified LeNet, and curvilinear semantic DCNN + InceptionResNetV2. In one study [27], a parallel 1D CNN was used alongside a YOLOv5 classifier as an image preprocessing method. To optimize model weights, methods combining CNNs with different optimization algorithms, such as antlion optimization [83] and pervasive deep gradient [26], have also been proposed. In sixty studies, the ACC metric was used as the primary performance measure; in nine studies, ACC was measured but was not the primary metric, and in thirty-two studies it was not used at all. The highest ACC value, 100%, was obtained in a study using Faster R-CNN to diagnose gingivitis from intraoral photographs [26]. The lowest ACC value, 69%, was obtained in a study using ResNet18 to diagnose dental caries on NILT images [67]. After ACC, the most frequently used performance measure is SEN (SEN = recall = TPR), which was used in eighty-seven studies, in nine of which it was the primary metric. The values obtained in studies using SEN as the primary metric range from 81% to 99% [59,122]. Another frequently used metric is precision (precision = PPV), used as a performance measure in sixty-seven studies, in nine of which it was the primary metric.
The lowest value, a mean average precision (mAP) of 59.09%, was reported in a study recommending Faster R-CNN for detecting missing teeth on panoramic radiographs [119]. The highest precision value, 98.50%, was achieved in a study proposing YOLOv4 for detecting mandibular fractures on panoramic radiographs [117]. Another performance measure used as a primary metric is the area under the ROC curve (AUC). AUC was used in thirty studies, in nine of which it was the primary metric, with results in the 57.10-99.87% range [47,91]. The F score (n = 47) and SPEC (n = 48) are among the other frequently used metrics. In addition, some studies use intersection over union (IoU), negative predictive value (NPV), Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), Matthews correlation coefficient (MCC), false positive rate (FPR), loss, error rate (ER), and classification rate (CR) as performance measures.
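For reference, the classification metrics named above follow directly from a 2x2 confusion matrix, and the segmentation metrics (DSC, IoU) from a pair of binary masks. The sketch below uses hypothetical counts and masks, not values from any reviewed study.

```python
import numpy as np

def binary_metrics(tp, fp, fn, tn):
    """Common diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "ACC":  (tp + tn) / (tp + fp + fn + tn),
        "SEN":  tp / (tp + fn),          # sensitivity = recall = TPR
        "SPEC": tn / (tn + fp),          # specificity
        "PPV":  tp / (tp + fp),          # precision
        "NPV":  tn / (tn + fn),
        "F1":   2 * tp / (2 * tp + fp + fn),
    }

def dice_iou(pred, target):
    """Overlap metrics for binary segmentation masks (DSC and IoU)."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2 * inter / (pred.sum() + target.sum())
    return dice, inter / union

# Hypothetical test set: 80 TP, 10 FP, 20 FN, 90 TN.
m = binary_metrics(80, 10, 20, 90)
print({k: round(v, 3) for k, v in m.items()})

# Hypothetical 2x3 predicted and ground-truth masks.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_iou(pred, target))
```

Reporting several of these metrics side by side, rather than ACC alone, is what allows readers to judge clinical utility across differently balanced datasets.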
Of the 101 studies, 22 mentioned explainable AI. Five of these studies described class activation heat maps (CAM) without detailing their explainable AI method, while the others used gradient-weighted class activation mapping (Grad-CAM).
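The Grad-CAM heat maps mentioned above reduce to a short computation: the gradients of the target class score with respect to the last convolutional layer's feature maps are global-average-pooled into per-channel weights, and the ReLU of the weighted sum of the feature maps gives the heat map. The framework-agnostic sketch below uses synthetic arrays in place of the activations and gradients that a real backward pass (in TensorFlow or PyTorch) would supply.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heat map from the last conv layer.
    feature_maps, gradients: arrays of shape (channels, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pooled gradients
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                         # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                          # normalize to [0, 1] for display
    return cam

# Synthetic stand-ins for real activations and class-score gradients.
rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))      # 8 feature maps of size 7x7
dYdA = rng.random((8, 7, 7))   # gradients of the class score w.r.t. A
heat = grad_cam(A, dYdA)
print(heat.shape)              # one 7x7 map, upsampled onto the input image in practice
```

Overlaying the upsampled heat map on the radiograph lets a clinician check whether the model's decision was driven by the relevant anatomical region.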
In 21 studies, human and AI performances were tested and compared. The reference data, comparative tests, and performance results of these studies are summarized in Table 5. In these studies, the test datasets prepared from the reference dataset were also used in the comparative tests, and test dataset sizes varied between 25 and 800 images [86,87]. In one study [47], a test dataset of 7697 images was used to evaluate the model's performance, and a separate set of 30 images was used to compare the model's performance with human performance. In one study [77], which created no test dataset, validation performance was used for the comparisons. In 14 studies, reference datasets were annotated by physicians experienced in oral and dental health, such as pediatric specialists (PS), general practitioners (GP), oral and maxillofacial radiologists (OMFR), oral and maxillofacial surgeons (OMFS), and endodontic specialists (ES). In 10 studies, more than one specialist was tasked with annotation to ensure the reliability of the reference dataset. In two studies, dentist-trained researchers annotated the reference dataset [74,123]. In two studies, CBCT data were referenced rather than annotated, as images from retrospective databases were already labeled [23,97].
In all but two studies, comparative tests were performed by comparing the performance results of a group of human evaluators on the test data with the performance results of the model. In addition, two studies compared the performance of an AI-unaided group with an AI-aided group, measuring the effect of the AI model on specialists' diagnostic performance [75,124]. Statistical tests such as the Kruskal-Wallis test, t-tests, the Mann-Whitney U test, and kappa statistics, and especially McNemar's χ² test, were used to measure the significance of performance differences between specialists and AI models. Statistical significance was not measured in one study [87]. Of the eighteen studies that calculated a p-value, thirteen reported that the performance difference was significant (p < 0.05), and five reported that it was not. In addition to test performance, test times were measured in seven studies; in only one of these did the AI model take longer than the specialist to provide a diagnosis [74]. The remaining studies compared only diagnostic performance, stating that the diagnostic time of the AI model would be shorter than that of the specialists. Except in six studies, the diagnostic performance of the proposed AI models exceeded that of the human evaluator groups. In four of those six studies, the AI lagged behind the experts by a small margin; in the other two, the performance gap was considerable [21,47].
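McNemar's χ² test, the most common choice above, is well suited to these comparisons because the specialist and the model grade the same test images: the statistic depends only on the two discordant cell counts of the paired results. A minimal stdlib-only implementation with continuity correction, applied to hypothetical disagreement counts, might look like:

```python
import math

def mcnemar(b, c):
    """McNemar's chi-squared test with continuity correction for paired
    diagnostic decisions on the same cases.
    b: cases the specialist got right and the model got wrong.
    c: cases the model got right and the specialist got wrong."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: the model corrected 25 specialist errors but
# introduced 8 new ones on a shared test set.
chi2, p = mcnemar(b=8, c=25)
print(round(chi2, 3), p < 0.05)
```

For small discordant counts (roughly b + c < 25), the exact binomial form of the test is usually preferred over this chi-squared approximation.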

Discussion
This study systematically reviewed the last five years of research on the diagnosis of dental anomalies and diseases using various deep learning methods, mainly CNNs. This evaluation yielded several findings that merit discussion.
Although not as numerous as diagnostic applications in general medicine, the searches in the Google Scholar and PubMed databases returned 1997 records for the last five years, showing that deep learning is experiencing a golden age in diagnostic applications in dentistry. Only 137 of these records were available on PubMed, usually an essential medical and dental research resource. This indicates that a significant part of the identified research comes not from dentistry but from the technical sciences, and has been published in other venues.
Over the years, deep learning has grown in popularity as a research topic in diagnosing dental diseases, as shown by the number of studies shortlisted up to the first half of 2023. However, diagnosing dental anomalies with deep learning has yet to be sufficiently investigated (n = 22); this article is the first systematic review on the subject. Because dental anomalies are rarer than other dental diseases, the resulting scarcity of data has made deep learning research on the topic rare [137,138]. It is also no coincidence that the most common oral and dental disease reported worldwide by the WHO is dental caries, and that the shortlisted studies focus primarily on diagnosing dental caries (n = 31). Owing to the working principle of deep learning algorithms, the qualities of the data used, such as the number of images, image quality, and expert annotation, are far more important than for other AI algorithms. These findings confirm that deep learning has gained more ground in the literature for diagnosing diseases that are common worldwide and for which quality data are easy to obtain. In general, global progress in the health sector draws the boundaries of the field of AI in medicine. Similarly, panoramic radiographs are the most widely used imaging technique in oral and dental health worldwide due to their advantages over the alternatives, and they were used as data in more than half (n = 55) of the shortlisted studies in this review.
Since classification is generally the most appropriate task type for disease diagnosis, classification was the most frequent task type (n = 51) and segmentation the least frequent (n = 24) among the shortlisted studies. Although deep learning algorithms can work with raw data, segmentation and object detection are used as preprocessing tools applied before classification to overcome the difficulties of dental images. Although deep learning algorithms can be designed as custom or at least individualized architectures, standardized architectures such as pre-trained CNN models, Faster R-CNN, YOLO, and U-Net were used in most studies. Given how long these architectures have been established in the literature, this suggests that diagnostic deep learning studies in dentistry lag behind other fields. The use of transformer architectures, a relatively new research direction compared to CNNs, in only one study indicates a possible delay in adopting the latest architectures. Explainable AI methods are used to explain the decision-making processes of models. Visualizing why and how deep learning models, often described as black-box AI models, reach a diagnostic decision is vital to making the model's accuracy, objectivity, and results trustworthy. Of the 101 included studies, only 22 mentioned an explainable AI method (Grad-CAM). Considering the clinical importance of deep learning diagnostic studies, including explainable AI methods is essential for reliability, and developing new and different explainable AI methods is also very important.
Although the high performance of the proposed deep learning algorithms suggests their reliability, metrics appropriate for clinical performance were often not selected, and additional tests were not carried out. Some studies used only ACC as a performance metric [27,65,68,80,123,136]. ACC can be misleading in problems with class imbalance. Likewise, the AUC is only partially informative when over- or under-detection is unimportant. In problems involving such imbalances, additional metrics must be measured. Although PPV (= precision), one of the metrics informative about clinical benefit, was measured in 67 studies, NPV was measured in only 16. Beyond the limitations of the reported metrics, the number of studies applying tests that provide information on clinical utility is very small: only 21 studies compared human and AI performance. In some of these studies, the reference dataset was annotated by a single expert [21,47,133]; in others, a researcher was used as annotator instead of an expert [74,123]. Using more experts to overcome the limitations of a single annotator when creating reference datasets will increase reliability. Some studies reported performance measured on validation data instead of creating a separate test dataset [77]. The validation phase is used to tune the hyperparameters of the deep learning model, and its use in the final testing phase may make the reported results misleading. Only two studies tested how deep learning affects expert performance in human-AI benchmark tests (AI-unaided versus AI-aided group comparison) [75,124]. The other studies simply compared the performance of deep learning algorithms with that of experts, in the hope that deep learning performance would reach or exceed expert performance.
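The accuracy pitfall noted above is easy to make concrete with a hypothetical imbalanced test set: a degenerate model that labels every image as healthy still scores high accuracy while detecting no disease at all.

```python
# Hypothetical test set of 100 images: 5 diseased, 95 healthy.
# A model that labels everything "healthy" misses every diseased case.
tp, fn = 0, 5    # diseased cases: all missed
tn, fp = 95, 0   # healthy cases: all "correct" by default
acc = (tp + tn) / (tp + tn + fp + fn)
sen = tp / (tp + fn)
print(f"ACC = {acc:.0%}, SEN = {sen:.0%}")  # ACC = 95%, SEN = 0%
```

This is why sensitivity, specificity, PPV, and NPV need to be reported alongside accuracy for diagnostic tasks where the diseased class is rare.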
Although almost all studies emphasize the aim of using AI-supported expert system applications as auxiliary tools for experts, it is clear that tests suited to this purpose have yet to be devised. As a result of this inadequacy, it will take many years for research on the clinical applicability and the ethical and legal dimensions of AI algorithms to multiply. As in every multidisciplinary task, cooperation between health institutions, experts, and computer scientists is the most critical factor in remedying this situation. Another vital element of the solution may be defining a standard set of tests and performance criteria for deep learning studies in oral and dental health. An open-access, standardized test dataset created by experts for each dental image type would enable the performance of deep learning algorithms to be reliably evaluated and compared.
This systematic review has some limitations. Since today's databases and publication volumes are very large, only two databases were searched for this review. The selected articles were evaluated in line with the inclusion and exclusion criteria and the boundaries they draw. Conference papers, preprints, and book chapters were excluded in accordance with the eligibility criteria. The fact that some articles are not open access or lack information required by the summary tables we created further limited the included studies. One reviewer performed the data extraction phase, and the shortlist presented in the article was thoroughly reviewed and checked only by a second reviewer (a specialist dentist). A traditional systematic review was conducted; no meta-analysis was performed, and the results were quite broad. Consequently, the study findings were compiled narratively according to a systematization we designed, with the aim of guiding and further developing future studies in this field.

Conclusions
In this systematic review, deep learning diagnosis of dental anomalies and diseases was discussed, and the 101 shortlisted studies were analyzed and evaluated together with their limitations. Deep learning algorithms show promising performance in evaluating visual data for diagnosing dental anomalies and diseases. Applications of deep learning in oral and dental health services can alleviate the workload of oral and dental health professionals by enabling more comprehensive, reliable, and objective image evaluation and disease detection, and can improve access to diagnosis and treatment in developing countries by reducing costs. To realize these advantages, there is a great need to develop the clinical side of deep learning studies in oral and dental health, including the definition of standard test datasets, testing procedures, and performance metrics.