Development, Application, and Performance of Artificial Intelligence in Cephalometric Landmark Identification and Diagnosis: A Systematic Review

This study aimed to analyze the existing literature on how artificial intelligence is being used to support the identification of cephalometric landmarks. The systematic analysis of the literature was carried out by performing an extensive search of the PubMed/MEDLINE, Google Scholar, Cochrane, Scopus, and Science Direct databases. Articles published in the last ten years were selected after applying the inclusion and exclusion criteria. A total of 17 full-text articles were systematically appraised. The Cochrane Handbook for Systematic Reviews of Interventions (CHSRI) and the Newcastle-Ottawa quality assessment scale (NOS) were adopted for quality analysis of the included studies. The artificial intelligence systems in the included studies were mainly based on deep learning convolutional neural networks (CNNs). The majority of the studies proposed that AI-based automatic cephalometric analyses provide clinically acceptable diagnostic performance. They have worked remarkably well, with accuracy and precision similar to those of trained orthodontists. Moreover, they can simplify cephalometric analysis and provide a quick outcome in practice. Therefore, they are of great benefit to orthodontists, who can use these systems to perform tasks more efficiently.


Introduction
Artificial Intelligence (AI) is a term used to refer to computer systems, typically neural networks, that imitate human intelligence. The key concepts of AI are machine learning, representation learning, and deep learning (DL). Machine learning (ML) models include genetic algorithms, artificial neural networks, and fuzzy logic; these models can analyze data to perform various functions [1]. Representation learning and deep learning are subsets of ML: the former requires a computer algorithm that extracts the features needed to classify any data, while the latter consists of artificial neural networks that mimic the neural network of the human brain and can decipher features in a given input, such as a radiograph or an ultrasound [2]. The most widely used class of DL algorithms is the artificial neural network (ANN), of which the convolutional neural network (CNN) is the most popular subclass [3].
AI is becoming more prevalent in medicine and has reduced the need for humans to perform many tasks. Its applications in dentistry have also evolved significantly over the years [3]. AI algorithms support therapeutic decisions by assisting dentists in analyzing medical imaging and planning treatment. For example, they can be useful in identifying teeth and anatomical structures; detecting carious lesions, periapical lesions, and root fractures; and predicting the viability of dental pulp and the success of retreatment procedures [4]. Moreover, AI has proven vital in diagnosing head and neck cancer lesions, which is crucial in dental practice since early detection can greatly improve prognosis [5]. In short, it can be used to perform simple tasks in dental clinics without involving a large number of dentists, while producing accurate and comparable results. It is also widely used in dental laboratories and plays a significant role in dental education [3][4][5].
AI is advancing in the field of orthodontics. It is increasingly being used to interpret cephalometric radiographs and identify landmarks, which helps with the diagnosis and treatment planning of dentoskeletal discrepancies [6]. The most common types of AI architecture in orthodontics are ANNs, CNNs, and regression algorithms [6,7]. In addition, 3D scans and virtual models are beneficial in analyzing craniofacial or dental abnormalities. Aligners can be fabricated from 3D scans, and the scan data can be used to formulate algorithms that help provide and standardize a specific treatment plan for each patient [4,7]. In machine learning-based studies, datasets are split into training and test sub-datasets, where the former is used to train the model and the latter is used to evaluate its performance on unseen data. In dentistry, there are different types of datasets: patient histories, restorative and periodontal charts, results of diagnostic tests, radiographs, and oral images. These datasets can be fed into models to generate outputs such as image interpretation, diagnosis, and predictions of future disease [5,7].
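The train/test split described above can be sketched in a few lines of Python. The file names, split ratio, and random seed below are illustrative assumptions, not values taken from the reviewed studies:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle the dataset and split it into training and test subsets.

    `records` stands in for any dataset mentioned in the text
    (radiographs, charts, diagnostic results); the 80/20 ratio is
    a common convention, not a value from the included studies.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Held-out test images are never seen during training.
    return shuffled[n_test:], shuffled[:n_test]

# Example: 100 hypothetical cephalogram files split 80/20.
cephalograms = [f"ceph_{i:03d}.png" for i in range(100)]
train_set, test_set = train_test_split(cephalograms)
```

The model is then fitted only on `train_set`, and its landmark predictions are scored against expert annotations on `test_set`.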
Cephalometric landmarks are readily recognizable points representing hard or soft tissue anatomical structures. These structures are used as reference points for the identification of various cephalometric angles and measurements [7]. S (sella), Po (porion), Pog (pogonion), Gn (gnathion), Go (gonion), N (nasion), Me (menton), A (the deepest point on the anterior curvature of the maxillary apical base), and B (the deepest point on the anterior curvature of the mandibular apical base) are the most common hard tissue points, whereas P (pronasale), G (glabella), Sn (subnasale), Col (columella), LLA (lower lip anterior), and ULA (upper lip anterior) are common soft tissue points used in cephalometric analysis [8]. Automatic identification of these landmarks has been attempted for a long time; the approaches that have been tried and tested include pixel-intensity analysis, knowledge-based methods, and template matching [5,6]. However, the results were not always satisfactory. In recent years, deep learning algorithms have been widely introduced to detect landmarks automatically and accurately on lateral cephalograms [7]. Recent studies on automatic cephalometric landmark identification using deep learning methods demonstrated improved detection accuracy compared with other machine learning methods [7][8][9]. Monill-González et al. compared the performance of one of these deep learning methods, You-Only-Look-Once (YOLO v3), with human examiners and found promising results; YOLO is known to take a shorter amount of time to identify landmarks. Park compared YOLOv3 and the Single Shot Multibox Detector (SSD) and found YOLOv3 to be the more promising method for automated cephalometric landmark identification [8]. Despite this, only a few studies on the performance of AI in cephalometric analysis have proven beneficial to dentists.
A large number of skeletal and soft tissue landmarks are required to evaluate and predict the outcome of a disease [9]. For a better understanding of the application of these methods in clinical orthodontics, more results of cephalometric analysis need to be obtained. While landmark identification is an essential part of the diagnostic process, image-related errors and expert bias can influence the results. Studies are therefore needed to assess whether AI can achieve results similar to clinicians' in cephalometric landmark detection over repeated detection trials. One might expect improved performance with a substantial amount of learning data, but manually annotating multiple landmarks at that scale would be challenging [10].
This study aimed to provide an overview of the existing literature on the extent to which artificial intelligence is being used to support the identification of cephalometric landmarks. The hypothesis was that AI identifies cephalometric landmarks accurately compared with human examiners and other machine learning methods.

Focused Question
This systematic review was conducted following the PRISMA-DTA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses - Diagnostic Test Accuracy) guidelines [11]. Our intended question was "Does AI play a significant role in measuring cephalometric landmarks accurately as compared to the human examiner?" The question was constructed according to the Participants, Intervention, Comparison, Outcome, and Study design (PICOS) strategy [12].

Intervention
AI techniques (machine learning, deep learning, CNN, ANN, PANN) applied in orthodontics for cephalometric analysis were considered, including modifications made with commercial cephalometric analysis software (V-Ceph version 8).

Comparison
The comparison was made on the basis of automatic algorithm architectures, testing models, lateral cephalometric radiograph analyses, rater opinions, machine-to-orthodontist comparisons, success detection rate (SDR), Single Shot Multibox Detector (SSD) performance, and landmark error (LE) value calculation.
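As a brief illustration of two of these metrics: the landmark error (LE) is conventionally the Euclidean distance between a predicted and an expert-annotated landmark, and the SDR is the fraction of landmarks located within a clinical threshold, commonly 2 mm. The coordinates and threshold in this sketch are hypothetical, not values from the included studies:

```python
import math

def landmark_error(pred, truth):
    """Euclidean distance (in mm) between a predicted and a
    ground-truth landmark position."""
    return math.dist(pred, truth)

def success_detection_rate(preds, truths, threshold_mm=2.0):
    """Fraction of landmarks whose error falls within the threshold;
    2.0 mm is the margin most often cited as clinically acceptable."""
    errors = [landmark_error(p, t) for p, t in zip(preds, truths)]
    within = sum(e <= threshold_mm for e in errors)
    return within / len(errors)

# Hypothetical predicted vs. expert-annotated coordinates (mm).
predicted = [(10.0, 20.0), (35.5, 42.0), (60.0, 15.0), (80.0, 90.0)]
ground_truth = [(10.5, 20.5), (36.0, 41.0), (63.5, 15.0), (80.2, 90.1)]
sdr = success_detection_rate(predicted, ground_truth, threshold_mm=2.0)
# Three of the four errors fall within 2 mm, so sdr is 0.75 here.
```

In practice, SDRs are usually reported at several thresholds (e.g., 2, 2.5, 3, and 4 mm) per landmark.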

Outcome
To assess the agreement between AI and human findings, Bland-Altman plots were used, and outcomes such as sensitivity, specificity, and the intraclass correlation coefficient (ICC) were measured.
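A Bland-Altman analysis summarizes inter-rater agreement by the mean difference (bias) and the 95% limits of agreement. A minimal sketch, using hypothetical paired AI and human angle measurements rather than data from the reviewed studies:

```python
import statistics

def bland_altman(ai_values, human_values):
    """Return the mean difference (bias) and the 95% limits of
    agreement between two raters, as summarized in a Bland-Altman plot."""
    diffs = [a - h for a, h in zip(ai_values, human_values)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    # Limits of agreement: bias +/- 1.96 standard deviations.
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired angle measurements (degrees): AI vs. human examiner.
ai = [82.1, 79.5, 84.0, 77.8, 80.3]
human = [82.0, 79.9, 83.5, 78.0, 80.5]
bias, (lower, upper) = bland_altman(ai, human)
```

Good agreement corresponds to a bias near zero and narrow limits; points falling outside the limits flag systematic disagreement between AI and examiner.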

Study Design Type
For this review, we considered clinical trial-based studies published in English.

Eligibility Criteria
Two examiners (N.J. and N.K.) evaluated the articles according to the following inclusion criteria:
English language articles.
The following were excluded: review articles, letters to editors, gray literature, case reports, incomplete articles showing only an abstract without a definitive comparison between AI and human examiners, articles with no comparison of AI with human examiners, articles on AI not related to orthodontics, articles on AI in orthodontics not related to cephalometry, and articles on non-deep-learning methods (e.g., knowledge- or atlas-based approaches or shallow machine learning).

Search Methodology
An electronic search was carried out in the PubMed/MEDLINE, Google Scholar, Cochrane, Scopus, Science Direct, and other research databases. The Medical Subject Headings (MeSH) and other keywords used were "intelligence, machine", "machine intelligence", "artificial intelligence", "computational intelligence", "classification", "orthodontics", "cephalometry", "learning, deep", "algorithms", "neural networks, computer", and "expert systems". Articles published in the last decade (2010 to 2021) were included. The last search was performed in October 2021. Two well-calibrated reviewers (N.J. and N.K.) performed the search. Disagreements were resolved by consensus, with a third examiner (N.A.) consulted when needed. The titles and abstracts of all primarily retrieved articles were read thoroughly, and irrelevant studies were excluded. The relevant articles were listed and scrutinized for any similar studies that matched our inclusion criteria. We read the full texts of the included studies to obtain appropriate results, and the findings were recorded.

Quality Assessment of Included Studies
The quality assessment was conducted according to the parameters described in the Cochrane Handbook for Systematic Reviews of Interventions [13]. The quality of each study was classified as having a low, medium, or high risk of bias. The same two review authors independently carried out the search to maximize the number of studies retrieved. The reviewers assessed each selected article against the inclusion criteria and conducted an unbiased evaluation; any ambiguity was settled by consultation with a third reviewer (N.A.).
Furthermore, the Newcastle-Ottawa quality assessment scale (NOS) was used for the analysis of the included articles [14]. The analysis was based on three core quality parameters: case and group (definition, selection, and representativeness), comparability (comparison of case and control groups; analysis and control of confounding variables), and exposure (use of a universal assessment method for both control and case groups; dropout rate of patients in the included studies). A star system was implemented for rating the included studies. The quality of each study was classified as having a low, moderate, or high risk of bias.

Search Results
The title search yielded 100 articles from 2010 to 2021, from which we removed 28 duplicates and 36 entries that did not analyze AI. Thirty-six articles were selected for full-text reading, which led to the further exclusion of 19 articles based on the inclusion criteria. A total of 17 full-text articles were included in this systematic review, as shown in Figure 1.

Quality Assessment Outcomes
According to CHSRI, 14 studies mentioned choosing their patients randomly and 2 mentioned blinding their participants or assessors. Eight studies mentioned the withdrawal/dropout of their participants. All 17 studies repeated the measurement of their variables. Likewise, two studies carried out sample size estimation. All included studies reported their outcomes and examiner reliability.
Furthermore, twelve studies were categorized as having a "moderate" risk of bias and five studies as having a "low" risk of bias (Table 2). A study was graded as having a low risk of bias if it yielded six or more "yes" answers to the nine questions, a moderate risk if it yielded three to five "yes" answers, and a high risk if it yielded two "yes" answers or fewer; Y: yes, N: no, UC: unclear.
In accordance with the NOS, the included studies scored in the range of 5 to 9 points, with a mean score of 6.52. Fourteen studies showed a "moderate" risk of bias, two studies a high risk, and one study a low risk of bias (Table 3). A maximum of nine stars could be awarded to a study. A study with the maximum number of stars was rated as having a low risk of bias, a study with six to eight stars as having a moderate risk, and a study with five stars or fewer as having a high risk of bias.
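The NOS grading rule used here can be restated as a small mapping; this is only a sketch of the thresholds described in the text, not software used in the review:

```python
def nos_risk_of_bias(stars):
    """Map a Newcastle-Ottawa star score (0-9) to the risk-of-bias
    categories used in this review: 9 stars -> low risk,
    6-8 stars -> moderate risk, 5 or fewer -> high risk."""
    if not 0 <= stars <= 9:
        raise ValueError("NOS scores range from 0 to 9 stars")
    if stars == 9:
        return "low"
    if stars >= 6:
        return "moderate"
    return "high"
```

Applied to the reported range of 5 to 9 points, this rule reproduces the distribution above: only a nine-star study is graded low risk, while a five-star study is graded high risk.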

Discussion
AI technologies are radically transforming various aspects of dentistry. The use of AI in orthodontics has also grown significantly in the last decade, with improvements in diagnostic accuracy, treatment planning, and prognosis prediction. This systematic review was carried out to evaluate whether AI can detect cephalometric landmarks with an accuracy and precision similar to those of an orthodontist [1,2].
Cephalometric analysis is carried out to identify various landmarks or points on the radiograph that help in establishing various relationships and planes, which in turn aid in establishing the diagnosis and treatment plan. Manual analysis is time-consuming and accompanied by the possibility of significant inter-observer variability. Over the past 40 years, researchers have introduced and suggested various AI methods for cephalometric landmark identification. Initially, these did not seem accurate enough for use. However, with time, newer algorithms were introduced, and increased computational power provided enhanced accuracy, reliability, and efficiency [3,4,10].
Previously, knowledge-based techniques or image-based learning systems were used to automate landmark identification. However, recent studies have focused on deep learning AI systems. In this systematic review, the majority of the included studies created an automated cephalometric analysis using a specialized CNN-based algorithm [15,17-25,27,28,31]. Among these, a few studies demonstrated conflicting results, as certain landmarks and analyses were not accurately identified by the AI system, e.g., the saddle angle, the Mx In-NA line, the Mn In-NB line [15], the lower incisor root tip [21], SN-MeGo [19], and porion, orbitale, PNS, and gonion [21]. This could be because certain landmarks, such as porion, orbitale, and PNS, are hard to detect due to superimposed surrounding anatomical structures. Moreover, as some other landmarks exist bilaterally, they might cause errors in the process of determining the midpoint of those bilateral structures [18]. Kim MJ et al. [22] further added that AI prediction is affected by the expert examiner's identification pattern: if the examiner shows difficulties in some areas, the AI predictions will reflect these difficulties, as the CNN model emulates the human examiner's landmark identification pattern when performing prediction.
Moon et al. [18], Hwang et al. [20], and Hwang et al. [21] compared YOLO v3, a real-time object detection algorithm, with human readings and found that AI can identify cephalometric landmarks as accurately as human examiners. Mario et al. [16] also found results equivalent to those of human experts, although their work was based on a PANN. Similarly, Kim YH et al. [17] found that the deep learning method achieved better results than the examiners for the detection of some cephalometric landmarks, especially those located anatomically on curves. Such landmarks are prone to human identification errors, which stem from factors including the examiners' overall knowledge of the subject and the quality of the cephalometric images; deep learning reduces this human-induced variability. Moreover, according to Moon et al. [18] and Kunz et al. [19], an adequate amount and quality of data are needed to create an accurate and clinically applicable AI. Moon et al. [18] further reported that, if the inter-examiner difference of 1.50 mm between human examiners is taken into consideration, the estimated quantity of learning data needed is at least 2300 datasets. Similar thoughts were shared by Song et al. [24], who added that human skeletal anatomies are so varied that, if sufficient data are not included in the training dataset, the results might be inconsistent. Strikingly, the minimum amount of learning data calculated by Moon et al. [18] far exceeded the learning data (40-1000 images) included in previous studies, which accordingly reported conflicting results [22-25,27-31].
Moreover, the quality of the data also plays an important role. Lee et al. [23] used a public database, and it was observed that, even though the overall landmarks were located within acceptable margins of the ground truth, the detected landmarks and the ground truth did not adequately match. This could be owing to the following reasons: (1) the input images were scaled from 1935 × 2400 pixels down to 64 × 64 pixels, so that fine errors in the scaled images grew rapidly when the images were enlarged back to the original size, and (2) the regression systems were trained without proper use of deep learning-related techniques. Furthermore, certain studies used the datasets presented in the "International Symposium on Biomedical Imaging (ISBI) 2014 and 2015 grand challenges in dental x-ray image" [23,24,27,30,31]. This dataset was somewhat flawed, as it used a smaller sample size with a wide age range (six to 60 years). Moreover, the mean intra-observer variabilities of the two experts were 1.73 and 0.90 mm, respectively, which is very high. Thus, there was a chance of unnecessary bias in the trained model, casting uncertainty on the clinical applicability of this dataset.
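The first of these reasons can be quantified with simple arithmetic: mapping a prediction from the 64 × 64 network input back to the original 1935 × 2400 resolution multiplies any localization error by the down-sampling factor, roughly 30 to 38 depending on the axis. A sketch of that calculation:

```python
# Down-scaling a 1935 x 2400 px cephalogram to a 64 x 64 px network
# input, then mapping a predicted coordinate back to the original
# resolution, multiplies any localization error by the scale factor.
orig_w, orig_h = 1935, 2400
net_size = 64
scale_x = orig_w / net_size   # ~30.2 along the width
scale_y = orig_h / net_size   # 37.5 along the height

# A one-pixel error on the 64 x 64 grid becomes a 30-38 px error
# after upscaling, i.e., several millimetres on a typical cephalogram.
error_px_net = 1.0
error_px_orig_x = error_px_net * scale_x
error_px_orig_y = error_px_net * scale_y
```

This illustrates why even sub-pixel errors at the reduced resolution can exceed the 2 mm margin once projected back onto the full-size radiograph.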
Unlike others, Kunz et al. [19] used high-quality cephalometric radiographs that had been generated on an approved X-ray unit rather than collected from the public domain. The radiographs were not pre-selected, so a vast variety of different skeletal and dental problems were included, which is a prerequisite for reliable AI learning. In addition, only experienced practitioners were asked to perform landmark identification and tracing, which resulted in very high intra-rater and inter-rater reliability. Kunz et al. [19] also showed good agreement between the measurements in Bland-Altman plots.
Lastly, the literature shows that the recent deep learning-based techniques have outperformed the conventional machine learning-based techniques in terms of accurate tracing of cephalometric landmarks. Kim H et al. [26] achieved a maximum accuracy of 96.79%. In addition, Kim YH et al. [17] found that the deep learning method achieved better results than the examiners for the detection of some cephalometric landmarks. With such promising results it would not be wrong to say that, with the continuous development and advancements, AI could shortly exceed manual markings performed by clinical experts, consequently saving labor, time, and effort.
The review had a few shortcomings: some of the included studies suffered from a range of risks of bias, and a few studies utilized similar datasets. Several studies did not clarify how the human annotations used as ground truth for the test datasets were produced. Certain studies employed only a single expert, which could have affected the results because of variations in landmark identification. Additionally, very few studies employed independent datasets. Future studies should consider using wider outcome sets and aim to test deep learning applications across different settings.

Conclusions
The results from the various articles analyzed in this systematic review suggest that artificial intelligence systems are promising and reliable in cephalometric analysis. Despite the limitations, almost all of the studies agreed that AI-based automated algorithms have worked remarkably well, with accuracy and precision similar to those of trained orthodontists. AI can simplify cephalometric analysis and provide a quick outcome in practice, which can save practitioners time and enhance their performance. Additionally, it can be of great benefit as ancillary support for orthodontists and clinicians with less experience.