1. Introduction
This review synthesizes evidence across cephalograms, photographs, cone-beam computed tomography (CBCT), digital models, and panoramic radiographs, linking diagnostic metrics to validation tiers and regulatory context, a perspective largely missing from prior orthodontic AI reviews. Orthodontic adoption of AI began in the mid-2010s with the first fully automated convolutional neural network (CNN) for cephalometric landmark detection [
1]. Since 2017, mean radial error (MRE) fell from 1.87 ± 2.04 mm (success detection rate (SDR) within 2 mm: 73% [
2]) to 0.94 ± 0.74 mm on a multicenter model trained on 9870 cephalograms, which also delivered ≈89% skeletal-class accuracy [
3]. Detector architecture still matters: on the same dataset, YOLOv3 attains an MRE of ≈ 1.5 mm, clearly outperforming SSD [
4].
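The two headline landmark metrics used throughout this literature, MRE and SDR within a tolerance, reduce to a few lines of arithmetic. The sketch below uses hypothetical toy coordinates and assumes landmark positions are already calibrated to millimeters:

```python
import math

def mre_and_sdr(pred, truth, threshold_mm=2.0):
    """Mean radial error (mm) and success detection rate within a threshold.

    pred, truth: sequences of (x, y) landmark coordinates in millimeters.
    """
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    mre = sum(errors) / len(errors)
    sdr = sum(e <= threshold_mm for e in errors) / len(errors)
    return mre, sdr

# Hypothetical example: three landmarks, one falling outside the 2 mm window.
pred = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
truth = [(10.5, 10.0), (20.0, 21.0), (33.0, 34.0)]
mre, sdr = mre_and_sdr(pred, truth)  # MRE ≈ 2.17 mm, SDR = 2/3
```

In published work, SDR is typically reported at several tolerances (2, 2.5, 3, and 4 mm); the threshold parameter above makes that explicit.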
The diagnostic scope (
Figure 1) has expanded in tandem. A hybrid random forest combining airway and cephalometric features distinguishes Classes I–III with mid-80% accuracy [
5], and a profile-photograph CNN estimates ANB with mean absolute error (MAE) ≈ 1.5° [
6]. Independent validations of landmark detectors [
7] and a large multicenter oral-pathology CNN confirm cross-site generalizability [
8].
State-of-the-art machine learning techniques such as domain adaptation, cross-domain learning, fused domain-cycling generative adversarial models, multimodal feature-fusion approaches, and collaborative generative models shape the current and future landscape of medical AI. Orthodontic imaging has not yet adopted these methods, but they could support multi-domain consistency, data harmonization, and generative augmentation, helping to handle variation between devices, datasets, and domains. Better generalization across imaging environments remains one of the main challenges in malocclusion diagnosis.
Figure 1 presents a conceptual diagnostic pipeline rather than current routine practice. In reality, landmark detection, cephalometric measurement, and malocclusion classification remain separate tools and are not yet integrated into a fully automated system. The diagram is intended to illustrate a possible AI-enabled workflow, not to claim that end-to-end automation is already feasible or validated. Readers should interpret it as an explanatory framework, not as a depiction of existing clinical workflows.
Despite strong technical results, evidence for orthodontic workflow impact remains limited. Routine clinical adoption lags for four principal reasons: limited data diversity (small, demographically narrow cohorts); scarce prospective multicenter external validation with clinical-outcome endpoints; limited transparency and subgroup/fairness reporting; and incomplete regulatory readiness and deployment practices (unclear decision thresholds and audit pathways) [
9]. Explainability is inconsistently reported. Grad-CAM overlays are feasible and can be clinically informative [
7], yet comparative work shows that different saliency methods highlight different anatomy [
10]. Heatmaps help indicate general anatomical regions but offer limited diagnostic value in orthodontics, where decisions rely on specific cephalometric landmarks rather than qualitative visualizations. Clinicians need outputs such as computed ANB angles, vertical indices, or landmark-derived measurements that can be checked against accepted diagnostic rules. Structured, rule-based explanations at this level allow clinicians to verify how the AI reached its conclusion. Future orthodontic AI systems will require this type of clinically grounded interpretability to be integrated safely into routine practice. Orthodontic workflow evidence remains preliminary, but analogous deployments in other specialties report measurable gains: radiology triage reduced CT reading time by 11 s [11], and a surgical-pathology system cut idle time by 18% [12]. This review, however, focuses on orthodontic diagnosis and regulatory readiness. Regulatory clearances in dental and orthodontic AI currently cover tasks such as cephalometric analysis and orthodontic monitoring; for example, WEBCEPH™ is described as an AI-powered cephalometric and orthognathic platform with KFDA and FDA 510(k) approval, yet to our knowledge no system has secured FDA or CE approval for full, end-to-end malocclusion diagnosis [
13].
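The structured, rule-based explanations described above amount to mapping a computed measurement onto an auditable decision rule. A minimal sketch for ANB-based skeletal classification follows; the cutoffs (Class I roughly 0–4°) are illustrative conventions, not a validated clinical rule, and real systems would calibrate thresholds to their reference population:

```python
def skeletal_class_from_anb(anb_deg, class_i_range=(0.0, 4.0)):
    """Map a computed ANB angle (degrees) to a skeletal class using explicit,
    checkable thresholds. Cutoffs are illustrative, not clinically validated."""
    lo, hi = class_i_range
    if anb_deg > hi:
        return "Class II"
    if anb_deg < lo:
        return "Class III"
    return "Class I"
```

Because the threshold is an explicit parameter rather than a learned weight, a clinician can trace exactly why a given ANB value produced a given label, which is the kind of verifiability the text argues saliency heatmaps cannot provide.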
This review fills a key gap by integrating accuracy, interpretability, efficiency, and fairness across lateral cephalograms, intraoral/extraoral photographs, CBCT scans, 3D digital dental models, and panoramic radiographs, and by mapping these claims to validation tiers and regulator-aligned multicenter validation frameworks. These developments matter directly for clinicians, as AI systems that match expert-level skeletal classification and reduce cephalometric tracing time by ~70–80% could act as reliable co-pilots once validation, fairness, and regulatory gaps are addressed.
2. Materials and Methods
We searched PubMed, Scopus, and Web of Science (January 2019–October 2025) using (“artificial intelligence”) AND (“malocclusion” OR “skeletal class”) AND “cephalometric.” This narrative review surveys AI-based malocclusion diagnosis across lateral cephalograms, intraoral and extraoral photographs, CBCT scans, digital dental models, and panoramic radiographs published between 2019 and 2025. Two reviewers independently screened titles/abstracts and full texts, resolving disagreements by consensus. We included original studies reporting diagnostic performance (accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity) or landmark-localization error for AI-based malocclusion diagnosis; reviews, editorials, case reports, and non-orthodontic applications were excluded. We extracted dataset size, validation design, and metrics; no formal quality appraisal or risk-of-bias assessment was undertaken, consistent with a narrative review. In this review, we focus on studies using multicenter datasets, predefined external-test cohorts, and transparent reporting of subgroup or device-related performance, and consider these to provide higher-quality and more generalizable data. Conversely, findings derived from single-center or small-sample designs, or from studies with internal validation only, are interpreted as preliminary and susceptible to overfitting or limited real-world applicability. These quality distinctions are explicitly noted when summarizing performance ranges and comparing methods. Together, this provides clearer separation between robust evidence and results that should be viewed as exploratory. The final set of eligible studies forms the basis for the narrative synthesis presented in the following sections.
Study descriptions are provided for orientation; readers are referred to the original publications for further detail, consistent with the nature of narrative reviews. After database screening and manual searches, studies were discarded if they did not address malocclusion diagnosis, did not produce an artificial intelligence-guided diagnostic output, or reported only on isolated tasks such as landmark detection, segmentation, or appliance prediction. Manuscripts that lacked sufficient methodological detail or diagnostic performance measures, or that were unsuitable for meaningful comparison, were also excluded. Thus, only research reporting specific information on skeletal-class diagnosis, occlusal classification, or clinically relevant cephalometric diagnostic variables was included in the final set.
Current Capabilities
Current CNN and hybrid models achieve sub-millimeter landmark precision and high discriminative power for skeletal classification (
Table 1). Evidence from a large oral-pathology dataset (over 20,000 images) suggests that such deep architectures can generalize across heterogeneous malocclusion phenotypes [
8]. Together, these findings provide a strong technical basis for future prospective clinical validation studies.
3. Discussion
A limitation of the search strategy is that it focused on concepts directly related to malocclusion diagnosis and skeletal classification. Research built around closely adjacent tasks, e.g., support for extraction decision-making, cephalometric landmarking, or orthodontic measurement prediction, may therefore have been missed if these terms did not appear in the title or abstract. Broadening the search beyond malocclusion diagnosis was not possible within the purpose of this study. The reader should therefore treat this evidence as indicative of AI systems specifically designed for diagnostic classification, rather than as covering all existing, or all technologically possible, orthodontic AI applications.
Beyond pooling accuracies, the clinical meaning of “90–96%” becomes clearer when reported metrics are linked to study design (external versus internal testing), subgroup performance, and clinician-facing explainability. Contemporary AI models perform on par with expert orthodontists; on lateral cephalograms, deep CNNs typically achieve 90–96% accuracy [
14]. A CNN trained on ~1500 images reached 96% accuracy (Cohen’s κ ≈ 0.91), outperforming automated cephalometry (κ ≈ 0.78) [
14]. These headline results were obtained from single-country datasets with predominantly internal validation, so they should be regarded as optimistic upper bounds rather than as expected performance in heterogeneous real-world clinical populations. These models operate end-to-end, identifying malocclusion patterns directly from images without requiring manual landmark input. CNNs also generalize complex patterns. A multicenter South Korean study reported AUC > 0.89 and accuracy > 0.80 when simultaneously classifying anteroposterior skeletal class, vertical pattern, and bite type on external radiographs [
15]. Performance is often highest for pronounced phenotypes. Class III cases (prognathism) yielded the best detection rates, with an external test AUC ~0.99 in one study [
15], indicating that AI is very reliable at identifying severe malocclusions.
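Cohen's κ, the chance-corrected agreement statistic reported alongside these accuracies, can be computed directly from a model-versus-reference confusion matrix. A minimal sketch with a hypothetical 2 × 2 matrix (in practice κ is computed over all skeletal classes):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows: reference labels, columns: model labels)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of cases on the diagonal.
    po = sum(confusion[i][i] for i in range(k)) / n
    # Expected chance agreement from the row and column marginals.
    pe = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical matrix: 90% raw agreement yields kappa = 0.8 after
# correcting for the 50% agreement expected by chance.
kappa = cohens_kappa([[45, 5], [5, 45]])
```

The correction matters here: with imbalanced skeletal-class distributions, raw accuracy can look high even when chance agreement is substantial, which is why studies such as [14] report κ alongside accuracy.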
AI also performs well on photographic data. In one prospective trial, a CNN analyzing intraoral photographs (three views per patient) achieved 93% accuracy (F1 ≈ 0.90) in the Angle molar classification, matching expert orthodontist performance [
16]. Notably, the AI was more precise than the human examiner for molar classes (precision 88.6% vs. 80.4%) [
17], suggesting greater consistency in applying classification criteria. However, clinicians outperformed the AI on continuous measurements (e.g., overjet and overbite), with human errors ~1.3 mm vs. AI ~2.0 mm [
16]. This indicates that 2D CNNs (without depth cues) struggle with sub-millimeter accuracy.
AI also detects specific malocclusion traits. Noëldeke et al. reported >98% accuracy for crossbite detection in intraoral photographs, with all CNNs (ResNet, MobileNet, etc.) achieving recall >90% and the best model (Xception) reaching 100% specificity [
18]. These results show that subtle occlusal discrepancies can be learned from 2D images given adequate data. Machine learning models trained on panoramic radiographs have also diagnosed Class III malocclusions in children, indicating that panoramic imaging can support malocclusion-related prediction tasks despite the lack of large public malocclusion datasets. Together with AI systems that categorize crowding severity and support extraction decisions from intraoral photographs, these findings suggest that crossbite and crowding can both be modeled as trait-level malocclusion patterns in photographic data.
Cephalometric landmark detection has reached clinically acceptable accuracy. Modern detectors locate ~75–85% of landmarks within 2 mm of the true position [
19]. An evaluation of a YOLOv3-based cephalometric model reported an average landmark error of ~1.8 mm and SDR within 2 mm: 75.5% [
19]. Many errors approach the level of human intra-observer variability, supporting automated cephalometry. Indeed, recent AI models achieve cephalometric measurements (angles and distances) comparable to experienced orthodontists [
19]. This implies that AI can effectively reproduce complete diagnostic work-ups (ANB angle, vertical indices, etc.) with minimal clinician input. The current landscape of AI-driven malocclusion diagnosis is further synthesized in
Table 2, which summarizes state-of-the-art performance metrics, explainability approaches, workflow impacts, fairness considerations, and regulatory readiness, together with their key supporting references [
20,
21,
22,
23,
24,
25,
26,
27,
28,
29].
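To illustrate how landmark-derived measurements such as ANB follow mechanically from detected landmarks, the angle can be computed from 2D coordinates as SNA − SNB, each measured at Nasion. This is a deliberately simplified sketch (unsigned planar angles, hypothetical coordinates); clinical systems work on calibrated cephalograms with signed conventions:

```python
import math

def angle_at(vertex, p1, p2):
    """Unsigned angle (degrees) at `vertex` between rays vertex->p1 and vertex->p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def anb(sella, nasion, a_point, b_point):
    """ANB = SNA - SNB, each measured at Nasion (simplified 2D sketch)."""
    sna = angle_at(nasion, sella, a_point)
    snb = angle_at(nasion, sella, b_point)
    return sna - snb
```

Because every intermediate quantity (SNA, SNB) is itself a standard cephalometric value, a pipeline built this way can expose each step for clinician verification rather than emitting only a final class label.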
Ethnic craniofacial differences make multi-ethnic datasets essential for safe orthodontic AI implementation [
23,
24,
25,
26,
27]. A model trained on predominantly Caucasian norms may misclassify skeletal relationships in Asian or African populations, where features such as bimaxillary protrusion represent normal variation rather than pathology. This can lead to systematic overcalling of Class II or III patterns and embed incorrect diagnostic standards into automated systems [
25,
26]. Population-specific data and calibration are therefore necessary to avoid clinically significant errors.
3.1. Algorithms and Techniques
Deep CNNs remain the dominant algorithms due to their success in image recognition. Common architectures (ResNet, VGG, DenseNet, EfficientNet) have all been applied to malocclusion classification or feature detection [
30]. ResNet-50, ResNet-18, MobileNet, Xception, and DenseNet were evaluated for crossbite detection, each achieving high AUC (>0.95) on cross-validation [
18]. Model choice depends on data size and compute resources: lighter models like MobileNet can be competitive (F1 ~0.98) while offering faster inference [
18]. Vision Transformers (ViTs) are emerging alternatives. A transformer model with masked pretraining on unlabeled cephalograms achieved ~90% accuracy on multi-attribute diagnosis (e.g., Class III with open bite) [
31], showing strong generalization. These results suggest that transformers can capture global craniofacial patterns, though they demand large datasets and are early in dental AI research [
32]. Although Vision Transformers have demonstrated strong performance in large-scale medical imaging, their practical value in orthodontics remains limited by the comparatively small size and homogeneity of available datasets [
31,
32]. ViTs typically require extensive pretraining or very large labeled cohorts to outperform convolutional models, conditions that are rarely met in dental AI research [
32]. As a result, current ViT-based orthodontic systems should be viewed as exploratory rather than clinically mature, with their reported generalization reflecting potential rather than proven real-world robustness [
31]. Until larger, diverse orthodontic datasets become widely available, CNNs are likely to remain the more reliable and data-efficient option.
Hybrid strategies are also reported: Ryu et al. (2023) combined CNN-based tooth landmark detection with classification networks to categorize crowding severity and predict extraction need [
33]. Such multi-step pipelines (first measuring arch length discrepancy via landmarks, then decision-modeling extraction) add interpretability and structure to the AI’s reasoning. Interestingly, some classical machine learning methods hold their own. In a skeletal Class III classification study, a Gaussian process regression (GPR) model slightly outperformed deep networks (reportedly highest AUC ~0.98) in classifying Class III vs. normal, highlighting that, with the right features, non-deep algorithms can excel [
34]. Similarly, Paddenberg-Schubert et al. found that no single algorithm was intrinsically superior for skeletal class prediction; simpler linear models achieved ~99% accuracy when fed a few key cephalometric features (e.g., SNB, jaw length ratio) [
35]. Such near-perfect accuracies arise in tightly controlled, well-separated samples without external testing or formal risk-of-bias assessment and are likely to overestimate true diagnostic performance in routine practice.
3.2. Datasets and Open Initiatives
A critical concern in orthodontic AI is the size and diversity of training data. Many early studies used small, homogeneous samples of a few hundred single-center patients, which limits generalizability. More recent work uses larger, more varied datasets: Kim et al. (2022) trained a CNN on ~1334 cephalograms [
14], and Yim et al. (2022) used >2000 cephalograms from 10 hospitals, improving robustness to differing image quality and patient demographics [
15]. Photographic datasets have also expanded, with Bardideh et al. generating 7500 intraoral images from ~948 adults [
16] and Ryu et al. training on ~1300 annotated occlusal photos per arch [
33].
A recent review still identified only 16 unique open dental image datasets, with panoramic X-rays most common and no large public malocclusion datasets as of 2023 [
36]. New comprehensive resources begin to fill this gap: the OMNI dataset provides thousands of intraoral, extraoral, and radiographic images for benchmarking malocclusion models, and Wang et al. (2024) released a 3D dataset of 1060 pre- and post-treatment dental models with segmentation labels across multiple malocclusions [
37]. These open datasets support validation, fair algorithm comparison, and inclusion of diverse populations, as OMNI aggregates multi-ethnic patient images to mitigate bias, yet most current models still train and test on ethnically homogeneous cohorts (for example East Asian or Middle Eastern single-country studies), which raises concerns about performance in other ethnic and age groups.
AI systems for malocclusion diagnosis therefore still rely predominantly on lateral cephalograms and intraoral or extraoral photographs, with panoramic radiographs and 3D dental models only beginning to appear in the literature and remaining comparatively underexplored [
36]. Such data are starting to become available through community collaborations, e.g., orthodontic grand challenges modeled on RSNA’s Machine Learning Challenges, which are public competitions on medical-imaging tasks. Thus, while the size and representativeness of datasets have improved over the last five years, more needs to be done in terms of international data sharing and standardized benchmarks to create broadly generalizable and unbiased AI models.
A critical limitation in the current literature is the risk of data leakage and overfitting, which can substantially inflate reported accuracies, particularly when values approach the 96–99% range. In several published studies, train-test separation relies on image-level rather than patient-level splits, allowing different images from the same individual to appear in both sets. This undermines the validity of “90%+” performance claims, as models may learn patient-specific features rather than clinically meaningful patterns. In addition, some datasets mix sequential radiographs or photographs from the same treatment episode, further increasing the chance of inadvertent overlap. Without strict patient-level separation and transparent documentation of splitting protocols, results may overestimate real-world generalization. Readers should therefore interpret high accuracy values cautiously and prioritize studies that explicitly implement robust leakage-prevention strategies. Future orthodontic AI research will require standardized reporting of dataset partitioning to ensure reliability and reproducibility. The main architectural families used in AI-based malocclusion diagnosis and their comparative strengths and limitations are summarized in
Table 3.
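The patient-level splitting requirement discussed above can be enforced in a few lines: assign patients, not images, to the test set, so that sequential radiographs or photographs from one individual never straddle the partition. A minimal sketch, assuming a hypothetical record format with a `patient_id` key:

```python
import random

def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split image records into train/test so that every image from a given
    patient lands on the same side, preventing patient-level data leakage.

    records: list of dicts, each carrying at least a 'patient_id' key.
    """
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test
```

An image-level `random.shuffle` over the records would, by contrast, routinely place two images of the same patient on opposite sides of the split, which is precisely the leakage mode that inflates the 96–99% accuracies criticized above. Libraries such as scikit-learn offer equivalent group-aware splitters (e.g., `GroupShuffleSplit`).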
3.3. Clinical Validation and Utility
Most clinical evidence for orthodontic AI remains concentrated at early Fryback and Thornbury tiers (technical and diagnostic accuracy), with encouraging but limited advances toward real-world validation. Multicenter external testing has shown generalization across sites (AUC > 0.85) while revealing weaknesses on subtler distinctions, and a prospective head-to-head study found that AI matched expert accuracy and surpassed clinician consistency for occlusion classification. Concordance on extraction decisions is substantial (κ ≈ 0.8) with high AUCs (93–96%), though ground-truth variability persists in borderline cases. To date, one workflow study reached level 5 validation by reducing analysis time without compromising diagnostic accuracy; however, no large-scale randomized trials have evaluated patient outcomes, underscoring the need for prospective multicenter studies [
15,
17,
33,
38,
39].
Validation tiers and endpoints are summarized in
Table 4.
Next steps (evidence gaps): Prospective multicenter trials—e.g., deploying AI for preliminary malocclusion triage in new patients and comparing treatment plans or outcomes with controls—would provide the strongest evidence; until such trials are completed, clinical-benefit claims remain inferential [
15,
38].
3.4. Regulatory and Ethical Considerations
As AI diagnostic tools near clinical use, regulators and clinicians emphasize explainability, validated performance across sites and subgroups, secure lifecycle management, and clear human oversight. Recent U.S. authorizations in dental AI (e.g., cephalometric analysis 510(k), radiographic pathology detection, and a 2024 De Novo for orthodontic monitoring) show a viable pathway but do not constitute approval for end-to-end malocclusion diagnosis; submissions must evidence multicenter external validation with prespecified endpoints, subgroup reporting, safe-update mechanisms, cybersecurity, transparent labeling, and a clinician-in-the-loop model [
13,
33,
38,
39,
40,
41]. Regulatory and deployment requirements are summarized in
Table 5.
Legal liability in AI-assisted diagnostics remains centered on the principle that such systems function as decision support rather than autonomous diagnostic agents [
38,
41]. Under both FDA SaMD and EU MDR frameworks, primary responsibility for diagnostic decisions continues to rest with the clinician, who must override system outputs when inconsistent with clinical findings [
13]. Manufacturers carry liability when errors stem from flawed algorithms, insufficient validation, or unsafe software updates, whereas institutions may be accountable for improper deployment, lack of oversight, or failure to maintain post-market monitoring [
41]. Documenting instances of AI–clinician disagreement and recording the reasoning behind final decisions is increasingly viewed as an essential safeguard. Emerging regulatory guidance also emphasizes that safe-update mechanisms and transparent labeling are prerequisites for assigning responsibility when model performance changes over time [
38,
41].
3.5. Future Directions
Future priorities and actions are summarized in
Table 6.
Orthodontic AI should now shift from small, single-site studies to coordinated, multi-institution efforts that expand diverse datasets (including 3D/CBCT) with subgroup metadata for bias monitoring, while running prospective trials to document workflow and outcome effects in real clinics. Hybrid, rule-aware models can improve transparency for clinicians (e.g., explanations tied to cephalometric features), and fairness must be planned from the outset with subgroup reporting and privacy-preserving collaboration such as federated learning. Regulatory readiness will hinge on multicenter validation, safe-update processes, and institutional oversight, as applications broaden from diagnosis toward treatment planning where AI has already shown high performance in selected decisions (e.g., surgery vs. orthodontics in Class III) [
33,
36,
38,
42,
43].
3.6. Conflicting AI Outputs and Clinical Decision Pathways
If algorithms are trained on heterogeneous data, use different ground-truth labeling standards, or are validated under different protocols, AI systems can produce divergent recommendations [
4,
7,
15]. This should not be read as a failure of these systems, but as a signal that the finding warrants further attention. When an output is an outlier, the first step is to confirm it against a second diagnostic source; for instance, an AI cephalometric analysis should be cross-checked against intraoral photographs or digital dental models. Clinicians should also favor systems that provide interpretable confidence or uncertainty measures, which help flag where predictions are unreliable. Where more than one AI tool is deployed, the tool with the strongest regulatory-grade validation and established generalizability across populations should be used [
33]. The clinician must record both the discrepancy and the rationale behind the final interpretation so that the decision remains traceable, a practice emerging as a requirement under the growing regulation of algorithmic diagnostic devices. When discrepancies remain substantial, the clinician’s expertise takes precedence and the system’s suggestions should be set aside. This guards against deterministic reliance on algorithmic extrapolation and reaffirms the primacy of human judgment. The goal is for AI systems to serve as diagnostic aides that support, but do not usurp, the clinician’s authority to diagnose.
In addition to the specific studies summarized above, several broader reviews and methodological frameworks place these findings in a wider orthodontic- and dental-AI context. Recent scoping and narrative reviews show that AI applications in orthodontics have expanded rapidly, but that study designs remain highly heterogeneous in terms of data types, annotation protocols, and outcome metrics, which complicates direct comparison and synthesis [
44,
45,
46,
47,
48]. Parallel work on trustworthy AI in dentistry and medical imaging emphasizes that reproducibility, fairness, explainability, and lifecycle monitoring need to be evaluated alongside accuracy [
49,
50]. Methodological guidance such as the CONSORT-AI extension and analyses of data leakage in deep-learning classification highlights the importance of prespecified endpoints, transparent reporting, strict patient-level data separation, and external validation to avoid inflated performance estimates [
51,
52]. Related scoping work in adjacent orthodontic AI applications (e.g., clear aligner therapy) further reinforces these general reporting and validation priorities [
53].
4. Conclusions
AI-based tools achieve high technical performance in malocclusion diagnosis, with 90–96% accuracy for skeletal Class I/II/III on cephalograms, SDR within 2 mm of ≈73–76% for landmarking, and ~70–80% reductions in cephalometric tracing time. Despite this, clinical translation remains limited: evidence beyond retrospective or internal testing is sparse, multicenter external studies are few, and current regulatory authorizations cover cephalometric analysis and orthodontic monitoring rather than end-to-end malocclusion classification. Turning these systems into dependable co-pilots for orthodontists will require prospective multicenter trials with prespecified clinical endpoints, systematic subgroup and fairness reporting, clinician-facing explainability, and diverse, openly benchmarked datasets aligned with regulatory expectations for SaMD. Differences in imaging hardware, operator-dependent image acquisition, and demographic variation can increase errors when models are trained on narrow cohorts and may lead to failure modes such as systematic misclassification in marginalized groups or degraded performance when images do not resemble the training set [
15,
23,
24,
25,
26,
27]. Off-label use, lack of active monitoring for model drift, and limited clinical oversight further increase the risk of misuse; safe and fair application in daily clinics depends on recognizing and managing these weaknesses.