Comparison of Three Commercially Available, AI-Driven Cephalometric Analysis Tools in Orthodontics

Background: Cephalometric analysis (CA) is an indispensable diagnostic tool in orthodontics for treatment planning and outcome assessment. Manual CA is time-consuming and prone to variability. Methods: This study compared the accuracy and repeatability of CA results among three commercial AI-driven programs: CephX, WebCeph, and AudaxCeph. It involved a retrospective analysis of lateral cephalograms from a single orthodontic center. Automated CA was performed using the AI programs, focusing on common parameters defined by Downs, Ricketts, and Steiner. Repeatability was tested by having each program reanalyze 50 randomly selected cases. Statistical analyses included intraclass correlation coefficients (ICC3) for agreement and the Friedman test for concordance. Results: One hundred twenty-four cephalograms were analyzed. High agreement between the AI systems was noted for most parameters (ICC3 > 0.9). Notable differences were found in the measurements of the angle of convexity and the occlusal plane, where discrepancies suggested different methodologies among the programs. Some analyses presented high variability in the results, indicating errors. Repeatability analysis revealed perfect agreement within each program. Conclusions: AI-driven cephalometric analysis tools demonstrate high potential for reliable and efficient orthodontic assessments, with substantial agreement in repeated analyses. Nevertheless, the observed discrepancies and the high variability in some analyses underscore the need for standardization across AI platforms and for critical evaluation of automated results by clinicians, particularly for parameters with significant treatment implications.


Introduction
Artificial intelligence (AI), a term coined in 1956 by John McCarthy, describes the ability of machines to imitate logical human behavior [1]. Recent advancements in AI technology have led to the incorporation of this technology into many fields of everyday life, including internet search engines (Google), private online assistants (Siri, Alexa), and housekeeping (iRobot). The development of AI has also made its way into the field of medicine, particularly radiology, where medical imaging in 2023 constituted approximately 85% of the FDA-approved AI programs [2]. With its significant role in imaging for treatment planning and outcome monitoring, orthodontics is one of the fields of dentistry where AI tools are being implemented most rapidly.
Since 1931, when Broadbent and Hofrath simultaneously developed a standardized method to obtain lateral cephalometric radiographs, cephalometric analysis (CA) has remained a fundamental tool used in orthodontics [3]. It allows for the precise assessment of the mandible, maxilla, and cranial base in the sagittal and vertical dimensions [4]. It involves the use of X-ray lateral cephalograms of the head and face to obtain precise linear and angular measurements between predefined landmarks. CA allows for the assessment of projected growth directions in children and adolescents, the diagnosis of malocclusion, precise treatment planning, and posttreatment evaluation. In addition to orthodontics, CA is a valuable tool in orthognathic surgery planning, ensuring precise assessment and intervention [5]. Moreover, cephalometry is used to measure changes in the pharynx and other anatomical structures in patients with obstructive sleep apnea, especially after surgical treatment [6]. Despite its high diagnostic value, CA remains a burdensome task: it involves the labor-intensive and time-consuming process of identifying cephalometric landmarks. Currently, time-consuming manual measurements have been replaced with digital CA software, which facilitates quicker measurements and calculations, as well as the automatic presentation of analysis results. The digitalization of CA has been shown to reduce the number of errors resulting from manual measurements made with a ruler and protractor [7].
The integration of AI in dental diagnostics has paved the way for the development of AI-based commercially available programs such as AudaxCeph (Audax, Ljubljana, Slovenia), WebCeph (Assemble Circle, Seoul, Republic of Korea), and CephX (ORCA Dental AI, Las Vegas, NV, USA). AI algorithms in CA utilize deep learning (DL) and convolutional neural networks (CNNs) to automate the identification of anatomical landmarks on radiographs. These algorithms are trained on large datasets of labeled images to learn the patterns and features associated with specific anatomical landmarks. These programs automate the identification of cephalometric points, evaluate landmarks, calculate angles and distances, and generate automated analysis reports with diagnoses. The primary advantage of such software is the ability to automatically perform CA within seconds [13].
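The vendors do not disclose their architectures, but published cephalometric CNNs commonly regress one heatmap per landmark and then decode it to a coordinate. As a minimal, illustrative sketch of that decoding step only (the function name, window size, and the toy Gaussian "prediction" are our own assumptions, not taken from any of the three programs):

```python
import numpy as np

def landmark_from_heatmap(heatmap: np.ndarray, win: int = 5) -> tuple[float, float]:
    """Decode a single landmark (x, y) from a predicted heatmap:
    take the argmax, then refine it with an intensity-weighted
    centroid over a small window around the peak (a common
    sub-pixel post-processing step)."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    r0, r1 = max(row - win, 0), min(row + win + 1, heatmap.shape[0])
    c0, c1 = max(col - win, 0), min(col + win + 1, heatmap.shape[1])
    patch = heatmap[r0:r1, c0:c1]
    rows, cols = np.indices(patch.shape)
    total = patch.sum()
    cy = r0 + (rows * patch).sum() / total
    cx = c0 + (cols * patch).sum() / total
    return float(cx), float(cy)

# Toy "prediction": a Gaussian blob centered at (x=30, y=20) on a
# 64x64 grid, standing in for a CNN's output map for one landmark.
yy, xx = np.mgrid[0:64, 0:64]
heatmap = np.exp(-((xx - 30) ** 2 + (yy - 20) ** 2) / (2 * 3.0 ** 2))

x, y = landmark_from_heatmap(heatmap)
```

In a real system, one such map would be produced per landmark and the decoded pixel coordinates converted to millimeters using the radiograph's calibration.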
Studies conducted to date have demonstrated a high degree of agreement between manual and AI-automated CA performed by the mentioned software [28][29][30][31]. However, to the best of our knowledge, the agreement among the results from automated CA has not been assessed.
The present study aimed to compare the agreement among the results of three randomly selected, commercially available AI tools for automated CA in a single patient cohort and to assess the repeatability of the AI results. Our hypothesis was that AI-driven cephalometric analysis tools demonstrate high accuracy and interchangeability.

Patient Population, Sample Size Calculation
The material of this retrospective study initially consisted of 130 lateral cephalograms obtained from patients aged 12 to 20 years from the patient archives of a single, private orthodontic center. The cephalograms were selected from the initial records of new patients admitted between 2018 and 2023. After initial screening, the images were anonymized. All the cephalograms were performed on the same digital panoramic machine, Hyperion X9-Pro (MyRay, Verona, Italy). The primary indication for lateral cephalograms was orthodontic treatment planning. The selected lateral cephalograms were manually uploaded into the databases of the chosen AI programs without any image modifications (cropping, contrast adjustments, filters, etc.).
The sample size was validated according to the paper by Bonett titled "Sample Size Requirements for Estimating Intraclass Correlations with Desired Precision" [31]. The sample size calculations were conducted using a web-based calculator (https://wnarifin.github.io/ssc/sssnsp.html, accessed on 6 May 2024). The following assumptions were made: ICC was calculated for each of the assessed parameters with a precision of 0.1, a confidence level of 90%, and a number of raters of 3.
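For reference, Bonett's approximation can be computed directly. The sketch below is ours, not the web calculator's code; since the anticipated ICC used for planning is not stated in the text, an illustrative value of 0.8 is assumed, and the result is not claimed to reproduce the n = 104 reported in the Results.

```python
import math
from statistics import NormalDist

def bonett_icc_sample_size(icc: float, k: int, width: float, conf: float) -> int:
    """Approximate number of subjects needed to estimate an ICC with a
    confidence interval of total width `width`, following Bonett (2002):
    n = 8*z^2*(1-icc)^2*(1+(k-1)*icc)^2 / (k*(k-1)*width^2) + 1."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided critical value
    n = 8 * z**2 * (1 - icc)**2 * (1 + (k - 1) * icc)**2 / (k * (k - 1) * width**2) + 1
    return math.ceil(n)

# Illustrative planning values: 3 raters (programs), 90% confidence,
# CI width 0.1, and an assumed anticipated ICC of 0.8.
n = bonett_icc_sample_size(icc=0.8, k=3, width=0.1, conf=0.90)
```

Note that the required n falls sharply as the anticipated ICC rises, which is why the planning value matters.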

Eligibility Criteria
All patients, aged 12-20 years, with lateral cephalograms acquired during treatment planning were consecutively enrolled in this study. Patients aged 12-20 years were selected for this study as this age range represents the common period during which orthodontic treatment is initiated and actively managed. Adolescents and young adults are the primary demographic for orthodontic interventions, making this sample representative of the population typically undergoing cephalometric analysis in clinical practice. The eligibility criteria are listed in Table 1.

Automatic Cephalometric Analysis
The selected lateral cephalograms were manually uploaded into the following databases of the chosen AI programs: CephX, WebCeph, and AudaxCeph. The selection of the programs included in this study was based on their commercial availability, widespread use in clinical practice, and ability to perform fully automated cephalometric analyses. This ensures the relevance and applicability of our findings to practitioners.
The software automatically selected the types of CA and generated automatic reports. For the analysis, the measurements common to all three programs according to Downs, Ricketts, and Steiner were utilized. No manual adjustments to cephalometric landmarks were made, in order to assess the fully automatic process of CA. The analyzed parameters are listed in Table 2.

Repeatability Analysis
Fifty randomly selected subjects were reuploaded as new patients and reanalyzed by all three evaluated platforms. The intraclass correlation coefficient (ICC3) values for repeated CAs were calculated to assess the agreement among the results.

Statistical Analysis
The concordance of measurements of quantitative variables was assessed with ICC type 3 (according to the Shrout and Fleiss classification) [32]. The Friedman test was used to compare three or more repeated measures of quantitative variables. Paired Wilcoxon tests with Bonferroni correction served as post hoc procedures. The paired Wilcoxon test was used to compare two repeated measures of quantitative variables. The significance level for all the statistical tests was set to 0.05. All the analyses were conducted with R software, version 4.3.3.
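As an illustration of the two procedures, the following sketch computes ICC(3,1) from the standard two-way ANOVA decomposition and runs the Friedman test on toy measurements. The angle values and the use of Python are our own for the example (the study itself used R), and the three "programs" here are purely hypothetical columns.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def icc3(ratings: np.ndarray) -> float:
    """ICC(3,1) (two-way mixed effects, consistency, single measurement)
    per Shrout and Fleiss: (MS_R - MS_E) / (MS_R + (k-1)*MS_E),
    where rows are subjects and columns are raters (here: programs)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_total = ((ratings - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Toy SNA-like angles (degrees) for 5 patients from 3 hypothetical programs.
measurements = np.array([
    [80.1, 80.4, 79.9],
    [82.3, 82.6, 82.1],
    [78.5, 78.9, 78.4],
    [84.0, 84.2, 83.8],
    [81.2, 81.5, 81.0],
])

icc = icc3(measurements)                       # near 1: programs rank patients identically
stat, p = friedmanchisquare(*measurements.T)   # compares the three programs' distributions
```

Note how a high ICC3 can coexist with a significant Friedman test: ICC3 ignores constant offsets between raters, while the Friedman test detects them, which is why both were used.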

Population, Sample Size Calculation
After the exclusion of six cephalometric radiographs due to poor image quality (2), presence of artifacts (1), or significant double borders of the mandible (3), 124 patients (59 men, 65 women) aged 12-20 years (mean age of 14.4 years) were included in this study.
The minimum sample size calculation (n = 104) showed that our sample was sufficient for the validity of the results.
Figure 1 shows the cephalogram of a sample patient with superimposed cephalometric points.


The Results from Automated CA
The results of the analyses are presented in Table 3. Most of the analyses revealed similar mean calculation values; however, significant discrepancies were found in some of the analyzed parameters. The largest differences were demonstrated in the measurements of angular values. The greatest discrepancies were observed in the results of the angle convexity analysis, where CephX had a mean value of 176.32°, AudaxCeph had a mean value of 7.18°, and WebCeph had a mean value of 7.99°. Similar discrepancies were evident in the measurements of the occlusal plane angle, with CephX reporting a mean value of 42.8°, AudaxCeph of 6.11°, and WebCeph of 5.86°; the angle of the lower incisor (LI) to the occlusal plane-CephX 69.23°, AudaxCeph 20.31°, and WebCeph 20.62°; and the angle of the LI to the mandibular plane-CephX 87.13°, AudaxCeph 6.98°, and WebCeph 6.87°. The results of the other analyses performed showed some minor discrepancies.

Due to differences in the number of analyses performed by the AI programs, some measurements were performed by only a subset of the programs. A summary of the results of the analyses performed only by CephX and AudaxCeph can be found in Table 4.

Concordance Analysis
The results of the concordance analysis showed good to excellent concordance of the results of the analyses for most of the parameters. However, some of the parameters showed poor and fair concordance. The detailed results of the concordance analysis of all three selected platforms are shown in Table 5. Comparisons of the results of the sample analyses showing excellent and poor concordance are shown in Figures 2 and 3, respectively.
The results of the concordance analysis of the two CA platforms are shown in Table 6.

Repeatability Analysis
The results of the repeated analyses for 50 patients, as performed by each of the three programs, showed perfect agreement; each program returned the same results for all the repeated analyses performed.

Discussion
The present study aimed to compare the variability of the CA results of three commercially available AI-automated CA tools: CephX, WebCeph, and AudaxCeph. Our results demonstrated a high level of agreement among the AI-driven automated systems in CA for most of the parameters evaluated. The repeatability analysis showed perfect agreement within each program, indicating that the automated systems produce consistent results when reanalyzing the same radiographs and demonstrating the determinism of the algorithms. This suggests that AI-driven tools can offer a reliable alternative to traditional methods, with the added benefits of speed and consistency. Notably, significant discrepancies were observed in some angular measurements, such as the angle of convexity and the occlusal plane angle, indicating that different methodologies were adopted by the selected platforms.
AI-driven tools offer significant advantages in CA, potentially leading to improved diagnostic accuracy, consistency in landmark identification, and a reduction in the time required for cephalometric analysis. These tools have the potential to enhance clinical decision-making, streamline workflows, and reduce the risk of human error, ultimately leading to better patient outcomes. However, the results of our study indicate significant variability in the results of several analyses among the programs. These discrepancies could be attributed to differences in algorithms, methods of evaluation, and landmark recognition capabilities across the three platforms. The highest mean differences were found in the facial angle of convexity, defined as angle convexity (CephX), angle of convexity (AudaxCeph), and N-A-Pg (WebCeph). Considering the average results of the analysis, along with their SDs, the only plausible explanation for these differences was a completely different measurement method. The inspection of the CA results of individual patients revealed that CephX indicated entirely different normal ranges than the two other programs. CephX indicated a normal value of 180 ± 5°, while the other programs indicated a value of 0 ± 5°. After eliminating CephX from this analysis, AudaxCeph and WebCeph showed very similar results, with 7.18 ± 4.53 and 7.99 ± 4.51, respectively.
The angle of convexity is usually defined as the angle formed by the intersection of two lines drawn from the most anterior point on the maxilla (Point A) and the most anterior point on the mandible (Point B) to the point on the forehead (nasion). This measurement is used to assess the relationship between the mandible and maxilla and the overall facial profile. A larger angle of convexity usually indicates a more convex facial profile, which can be associated with a protruding upper jaw or a receding lower jaw. A smaller angle indicates a straighter or more concave facial profile. The angle of convexity is a substantial parameter of CA, although its definition varies among authors. Some use the soft tissue glabellar point [33,34], the frontal point (Fr) [35], the NS point [36], or the N′ point at the depression of the nose as cranial reference points [37,38]. Godt et al. have proven that variance in the definition of landmarks used in facial convexity measurement methods can lead to significant discrepancies in the obtained angle values [39]. The differences in facial convexity automated measurement values and the normal ranges between CephX and the two other employed programs clearly show that the adopted methodology was different (Figure 3A). This example clearly shows that CephX indicated entirely different mean values and CI for angle convexity compared to AudaxCeph and WebCeph, suggesting a different measurement approach.
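The two reporting conventions can be made concrete: the same three landmarks yield either the interior angle at Point A (near 180° for a straight profile, as CephX's normal range suggests) or its supplement, the deviation from a straight line (near 0° for a straight profile, as AudaxCeph and WebCeph report). A sketch with entirely hypothetical 2D coordinates:

```python
import math

def angle_at_vertex(p1, vertex, p2):
    """Interior angle (degrees) at `vertex` formed by the rays to p1 and p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_a = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(cos_a))

# Hypothetical landmark positions (not from any real cephalogram):
N = (0.0, 100.0)   # nasion
A = (5.0, 60.0)    # Point A
Pg = (2.0, 0.0)    # pogonion

interior = angle_at_vertex(N, A, Pg)  # ~180 deg when the profile is straight
deviation = 180.0 - interior          # ~0 deg when the profile is straight
```

Both numbers describe the same geometry, so a program reporting 176° and one reporting 4° can be in perfect agreement; the danger lies in comparing the values, or their normal ranges, across conventions.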
Similar discrepancies were found in the results of the occlusal plane calculations. These calculations were defined as the FH to the occlusal plane by both CephX and AudaxCeph and as the cant of the occlusal plane by WebCeph. Again, the mean values shown by WebCeph and AudaxCeph were similar, but the mean results of the CephX calculations were entirely different, yielding results of 5.86 ± 4.05, 6.11 ± 4.01, and 42.8 ± 72.23, respectively. Notably, the results obtained by CephX showed significant variability (SD = 72.23, min = 0.13, max = 179.93) compared to the significantly lower variability of the results from the rest of the analyzed platforms (Figure 3B).
The occlusal plane in CA is an imaginary surface drawn through the incisal edges and occlusal surfaces of the teeth. It represents the mean curvature of the surface drawn rather than the actual plane. The measurement of the occlusal plane typically involves determining its angle relative to other anatomical planes or structures in the head and neck. As shown in a review by Mazurkiewicz et al., the occlusal plane might be evaluated with many different methods and devices; thus, discrepancies might be obtained [40]. The cant of the occlusal plane is defined as the vertical alignment of the teeth when there is a difference between the left and right sides. This involves either an upward or downward rotation of one side over the other in the transverse plane. The different methodologies adopted by CephX explain the discrepancies among the selected programs. However, the variability of the results from the CephX analyses (SD = 72.23, min = 0.13, max = 179.93) indicates significant discrepancies in the obtained results and raises considerable doubts about the reliability of these measurements. The results of the CephX occlusal plane measurements showed values as high as 179.93 in some patients, whereas the indicated average normal value was 9.3 ± 3.8. The same patients, assessed by AudaxCeph and WebCeph, had values of 4.68589 and 4.1, respectively. The differences shown in Figure 3B raise concerns regarding the reliability of CephX in occlusal plane measurements.
Another parameter that showed significant discrepancies was the lower incisor (LI) to the occlusal plane (Figure 3C). The mean results for CephX, AudaxCeph, and WebCeph were 69.23°, 20.31°, and 20.62°, respectively. Similarly, for the LI to the mandibular plane, the mean values were 87.13°, 6.98°, and 6.87°, respectively (Figure 3D). These significant differences in the mean values indicate that the three AI programs may have used different methods for locating the landmarks and performing the measurements. It is worth noting that these parameters are crucial in the diagnosis and treatment planning of many orthodontic cases, including those involving anterior open bite, deep overbite, and skeletal Class III malocclusion [41]. Hence, the discrepancies in these measurements among the different AI programs may lead to different diagnoses and treatment plans. A complete lack of agreement was previously observed between the AI and manual asymmetry rate analyses (ICC type 3 = 0). As of the date of manuscript preparation (April 2024), CephX's automated asymmetry assessment module is no longer available.
A recent study by Yassir et al. [44] evaluated the accuracy and reliability of WebCeph in CA. The authors reported that problems with landmark identification and soft tissue delineation, as well as inconsistency of measurements, are inherent features of the program's automated analyses. Similar to our results, most of the discrepancies were found in the angular measurements. The 2024 study by Silva et al. [45] compared the accuracy of WebCeph and CefBot for landmark identification to that of 10 experienced readers. The authors concluded that CefBot exhibited excellent reliability and was ready for use in clinical practice, while WebCeph produced significant errors in landmark identification. A study on AudaxCeph's tracing reliability by Ristau et al. showed that the program's performance was not significantly different from that of experienced orthodontists [31]. However, some discrepancies in lower incisor apex measurements were found. These problems with lower incisor identification were also found in our study, yielding the high variability shown in the measurements presented in Figure 4C,D. Ribwar and Azeez compared the results of CephX and manual tracings [46]. The authors have shown that, except for several parameters, the results of the automated analysis showed high agreement with the manual method. The article concluded that CephX is adequate for clinical use.
Along with a growing body of evidence on the use of AI-automated CA in experimental studies, recent meta-analyses have been published providing a comprehensive overview of its accuracy and reliability [47][48][49][50][51][52][53]. However, most of the tested AI models were experimental and not available to common users. Furthermore, the results depend on the predefined thresholds. As expected, the accuracy decreases sharply when the threshold is lower than 2 mm [47,49,52]. A systematic review and meta-analysis by Schwendicke et al. [48] assessed the accuracy of deep learning (DL) for cephalometric landmark detection on 2D and 3D radiographs. The meta-analysis, which included 19 studies published between 2017 and 2020, revealed that DL exhibited relatively high accuracy in detecting cephalometric landmarks. However, the body of evidence suffers from a high risk of bias, highlighting the need for further studies to demonstrate the robustness and generalizability of DL for landmark detection. Rauniyar et al. [52] conducted a systematic review to determine the accuracy of identifying cephalometric landmarks using AI and compared the results with those of a manual tracing group. The review concluded that AI showed extremely positive and promising results compared to manual tracing, indicating its potential in CA. A meta-regression conducted by Serafin et al. [53] in a meta-analysis on AI-automated cephalometric landmarking indicated a significant relationship between the mean landmarking error and the year of publication (p-value = 0.012). The authors concluded that the accuracy of the AI algorithms in this task rose significantly in studies published between 2021 and 2023. These results give hope for the further development and refinement of the algorithms and for their application in daily clinical practice.
The AI-driven CA tools used in this study are trained on large datasets of labeled images using deep learning and convolutional neural networks. These algorithms learn to identify anatomical landmarks by recognizing patterns and features from the training data. However, the specific details of these datasets and training processes are proprietary. The programs' developers refrain from providing information on this topic and treat it as a trade secret. The three AI-driven CA tools used in this study, CephX, WebCeph, and AudaxCeph, were selected based on their commercial availability, widespread use in clinical practice, and ability to perform fully automated cephalometric analyses. These programs are among the most widely used AI tools in orthodontics, making this study relevant to practitioners. While other AI-driven tools exist, our access was limited to these three programs. Future studies should aim to include a broader range of AI tools to provide a more comprehensive evaluation.
A reduction in analysis time without compromising accuracy can potentially enhance productivity in orthodontic practices. Moreover, the consistency of AI-driven systems reduces the risk of human error, thus providing more standardized outcomes. However, the variations observed in certain measurements highlight the need for standardization among different AI platforms. This variation could lead to different orthodontic diagnoses and treatment plans, raising concerns about the interchangeability of these systems. It also indicates the necessity of thoroughly familiarizing oneself with the methodology of the platform used before its application. Additionally, our study demonstrated the presence of evidently erroneous calculation results stemming from incorrect cephalometric landmark identification. Practitioners must therefore be aware of the limitations of these tools, especially regarding the discrepancies found in some angular measurements. It is imperative that practitioners critically evaluate the results from AI-driven systems and, if necessary, confirm the findings manually, especially when the measurements have significant implications for treatment decisions [54].
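One practical safeguard is a simple range check before accepting automated output. The sketch below flags values that fall outside configurable plausibility ranges; the ranges and parameter names are illustrative placeholders of our own, depend on each program's measurement convention, and are not clinical norms.

```python
# Hypothetical plausibility ranges (degrees) a clinic might configure
# for one program's conventions; the values are illustrative only.
PLAUSIBLE_RANGES = {
    "occlusal_plane_angle": (-5.0, 25.0),
    "angle_of_convexity": (-15.0, 25.0),
    "li_to_mandibular_plane": (75.0, 110.0),
}

def flag_implausible(report: dict) -> list:
    """Return parameter names whose automated values fall outside the
    configured plausibility range and therefore warrant manual review."""
    flagged = []
    for name, value in report.items():
        lo, hi = PLAUSIBLE_RANGES[name]
        if not (lo <= value <= hi):
            flagged.append(name)
    return flagged

# An automated report resembling the outliers observed in this study:
report = {
    "occlusal_plane_angle": 179.93,   # far outside any clinical range
    "angle_of_convexity": 7.2,
    "li_to_mandibular_plane": 87.1,
}

to_review = flag_implausible(report)
```

Such a check cannot prove a value correct, but it can catch the grossly mislocated landmarks (e.g., occlusal plane angles near 180°) observed here before they enter a treatment plan.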
Despite some discrepancies, this study found high levels of agreement and repeatability in the results obtained from the AI programs for most cephalometric parameters. This suggests the potential of AI to deliver consistent and reliable analyses. The perfect repeatability of results across all evaluated programs underscores the consistency of automated analysis. AI reduces the risk of human error associated with manual CA. This standardization, coupled with high accuracy in landmark detection, can significantly improve the reliability of the assessments. In our view, our study offers valuable data that can guide the further development and refinement of AI algorithms, potentially expanding their use in clinical settings. However, this implies the need for further studies, evaluating different software with broader and more heterogeneous study groups, to provide a comprehensive evaluation of existing and future technologies. Our findings underscore the need for methodological standardization and algorithm improvement, as both factors influenced our results, indicating that current AI tools still require human supervision. Moreover, there is a vast yet insufficiently explored area of cone beam computed tomography (CBCT)-based CA. While some pilot studies have already assessed the accuracy of AI algorithms in this setting, further evaluation is still needed. Future studies should consider including a human comparison group (preferably with multi-reader evaluation) to evaluate the performance of AI algorithms against experienced orthodontists. This would provide a more comprehensive understanding of the advantages and limitations of AI in cephalometry.
The findings of the present study should be interpreted considering several limitations. First, this study was retrospective in nature and relied on archived cephalograms, which might have affected the quality of some images. Moreover, there are geographic and ethnic limitations that may influence the generalizability of the results, especially in more diverse populations. Second, this study included a relatively small sample size, which may limit the generalizability of the findings. The results of the analyses were not compared to any gold standard based on expert readers' consensus; however, this was not the aim of this study. While this study demonstrates the high repeatability of AI-driven cephalometric analyses, the absence of a gold standard limits the ability to determine the most accurate tool. Future research should include manual evaluations by expert orthodontists as a benchmark against which to compare the AI tools, providing a comprehensive assessment of their accuracy. A further limitation of this study is the human involvement in image selection, which implies that the process was not entirely AI-driven. Knowledge about image quality was required to exclude suboptimal radiographs, which could impact the analysis. Last, this study evaluated only three AI programs, and many other commercially available AI-driven automated CA tools were not included in the analysis.
Future studies should focus on evaluating additional AI-driven cephalometric analysis tools to provide a more comprehensive comparison. It would be beneficial to conduct studies involving larger and more diverse patient populations to enhance the generalizability of the findings. Moreover, integrating a human comparison group, including multi-reader evaluations by experienced orthodontists, could offer deeper insights into the performance of AI algorithms.

Conclusions
In conclusion, AI-driven automated CA tools can provide a quick and consistent alternative to traditional manual methods. However, significant discrepancies exist in the measurements of some parameters among different AI programs, which may potentially lead to varying diagnoses. Moreover, some parameters assessed by the selected AI platforms exhibited significant variability, indicating severe inaccuracies in landmark identification. Therefore, clinicians should be aware of these discrepancies, carefully interpreting the results of automated CA in conjunction with clinical findings and assessing the accuracy of landmark identification.

Table 2.
Categorized list of the assessed parameters as mentioned in the analyzed software.

Table 3.
Summary of the mean results of analyses performed by all three selected CA platforms.


Table 4.
Summary of the mean results of analyses performed by two of the selected platforms (CephX and AudaxCeph).

Table 5.
Results of the concordance analysis of all three CA platforms.

Table 6.
Results of the concordance analysis of two of the selected platforms (CephX and AudaxCeph).
