1. Introduction
This review synthesizes evidence across cephalograms, photographs, cone-beam computed tomography (CBCT), digital models, and panoramic radiographs, linking diagnostic metrics to validation tiers and regulatory context, a perspective largely missing from prior orthodontic AI reviews. Orthodontic adoption of AI began in the mid-2010s with the first fully automated convolutional neural network (CNN) for cephalometric landmark detection [
1]. Since 2017, mean radial error (MRE) fell from 1.87 ± 2.04 mm (success detection rate (SDR) within 2 mm: 73% [
2]) to 0.94 ± 0.74 mm on a multicenter model trained on 9870 cephalograms, which also delivered ≈89% skeletal-class accuracy [
3]. Detector architecture still matters: on the same dataset, YOLOv3 attains an MRE of ≈ 1.5 mm, clearly outperforming SSD [
4].
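The two headline landmark metrics used throughout this literature, MRE and SDR within a tolerance, reduce to a few lines of arithmetic. The sketch below uses hypothetical toy coordinates and assumes landmark positions are already calibrated to millimeters:

```python
import math

def mre_and_sdr(pred, truth, threshold_mm=2.0):
    """Mean radial error (mm) and success detection rate within a threshold.

    pred, truth: sequences of (x, y) landmark coordinates in millimeters.
    """
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    mre = sum(errors) / len(errors)
    sdr = sum(e <= threshold_mm for e in errors) / len(errors)
    return mre, sdr

# Hypothetical example: three landmarks, one falling outside the 2 mm window.
pred = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
truth = [(10.5, 10.0), (20.0, 21.0), (33.0, 34.0)]
mre, sdr = mre_and_sdr(pred, truth)  # MRE ≈ 2.17 mm, SDR = 2/3
```

In published work, SDR is typically reported at several tolerances (2, 2.5, 3, and 4 mm); the threshold parameter above makes that explicit.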
The diagnostic scope (
Figure 1) has expanded in tandem. A hybrid random forest combining airway and cephalometric features distinguishes Classes I–III with mid-80% accuracy [
5], and a profile-photograph CNN estimates ANB with mean absolute error (MAE) ≈ 1.5° [
6]. Independent validations of landmark detectors [
7] and a large multicenter oral-pathology CNN confirm cross-site generalizability [
8].
State-of-the-art machine learning techniques such as domain adaptation, cross-domain learning, fused domain-cycling generative adversarial models, multimodal feature-fusion approaches, and collaborative generative models shape the current and future landscape of medical AI. Orthodontic imaging has not yet adopted these methods, but they could support multi-domain consistency, data harmonization, and generative augmentation, helping to handle variation between devices, datasets, and domains. Better generalization across imaging environments remains one of the main challenges in malocclusion diagnosis.
Figure 1 presents a conceptual diagnostic pipeline rather than current routine practice. In reality, landmark detection, cephalometric measurement, and malocclusion classification remain separate tools and are not yet integrated into a fully automated system. The diagram is intended to illustrate a possible AI-enabled workflow, not to claim that end-to-end automation is already feasible or validated. Readers should interpret it as an explanatory framework, not as a depiction of existing clinical workflows.
Despite strong technical results, evidence for orthodontic workflow impact remains limited. Routine clinical adoption lags for four principal reasons: limited data diversity (small, demographically narrow cohorts); scarce prospective multicenter external validation with clinical-outcome endpoints; limited transparency and subgroup/fairness reporting; and incomplete regulatory readiness and deployment practices (unclear decision thresholds and audit pathways) [
9]. Explainability is inconsistently reported. Grad-CAM overlays are feasible and can be clinically informative [
7], yet comparative work shows that different saliency methods highlight different anatomy [
10]. Heatmaps help indicate general anatomical regions but offer limited diagnostic value in orthodontics, where decisions rely on specific cephalometric landmarks rather than qualitative visualizations. Clinicians need outputs such as computed ANB angles, vertical indices, or landmark-derived measurements that can be checked against accepted diagnostic rules. Structured, rule-based explanations at this level allow clinicians to verify how the AI reached its conclusion. Future orthodontic AI systems will require this type of clinically grounded interpretability to be integrated safely into routine practice. Orthodontic workflow evidence remains preliminary, but analogous deployments in other specialties report measurable gains: radiology triage reduced CT reading time by 11 s [11], and a surgical-pathology system cut idle time by 18% [12]. This review, however, focuses on orthodontic diagnosis and regulatory readiness. Regulatory clearances in dental and orthodontic AI currently cover tasks such as cephalometric analysis and orthodontic monitoring; for example, WEBCEPH™ is described as an AI-powered cephalometric and orthognathic platform with KFDA and FDA 510(k) approval, yet to our knowledge no system has secured FDA or CE approval for full, end-to-end malocclusion diagnosis [
13].
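The structured, rule-based explanations described above amount to mapping a computed measurement onto an auditable decision rule. A minimal sketch for ANB-based skeletal classification follows; the cutoffs (Class I roughly 0–4°) are illustrative conventions, not a validated clinical rule, and real systems would calibrate thresholds to their reference population:

```python
def skeletal_class_from_anb(anb_deg, class_i_range=(0.0, 4.0)):
    """Map a computed ANB angle (degrees) to a skeletal class using explicit,
    checkable thresholds. Cutoffs are illustrative, not clinically validated."""
    lo, hi = class_i_range
    if anb_deg > hi:
        return "Class II"
    if anb_deg < lo:
        return "Class III"
    return "Class I"
```

Because the threshold is an explicit parameter rather than a learned weight, a clinician can trace exactly why a given ANB value produced a given label, which is the kind of verifiability the text argues saliency heatmaps cannot provide.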
This review fills a key gap by integrating accuracy, interpretability, efficiency, and fairness across lateral cephalograms, intraoral/extraoral photographs, CBCT scans, 3D digital dental models, and panoramic radiographs, and by mapping these claims to validation tiers and regulator-aligned multicenter validation frameworks. These developments matter directly for clinicians, as AI systems that match expert-level skeletal classification and reduce cephalometric tracing time by ~70–80% could act as reliable co-pilots once validation, fairness, and regulatory gaps are addressed.
2. Materials and Methods
We searched PubMed, Scopus, and Web of Science (January 2019–October 2025) using (“artificial intelligence”) AND (“malocclusion” OR “skeletal class”) AND “cephalometric.” This narrative review surveys AI-based malocclusion diagnosis across lateral cephalograms, intraoral and extraoral photographs, CBCT scans, digital dental models, and panoramic radiographs published between 2019 and 2025. Two reviewers independently screened titles/abstracts and full texts, resolving disagreements by consensus. We included original studies reporting diagnostic performance (accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity) or landmark-localization error for AI-based malocclusion diagnosis; reviews, editorials, case reports, and non-orthodontic applications were excluded. We extracted dataset size, validation design, and metrics; no formal quality appraisal or risk-of-bias assessment was undertaken, consistent with a narrative review. In this review, we focus on studies using multicenter datasets, predefined external-test cohorts, and transparent reporting of subgroup or device-related performance, and consider these to provide higher-quality and more generalizable data. Conversely, findings derived from single-center or small-sample designs, or from studies with internal validation only, are interpreted as preliminary and susceptible to overfitting or limited real-world applicability. These quality distinctions are explicitly noted when summarizing performance ranges and comparing methods. Together, this provides clearer separation between robust evidence and results that should be viewed as exploratory. The final set of eligible studies forms the basis for the narrative synthesis presented in the following sections.
Study descriptions are provided for orientation; readers are referred to the original publications for further detail, consistent with the nature of narrative reviews. After database screening and manual searches, studies were discarded if they did not address malocclusion diagnosis, did not produce an artificial intelligence-guided diagnostic output, or reported only on isolated tasks such as landmark detection, segmentation, or appliance prediction. Manuscripts that lacked sufficient methodological detail or diagnostic performance measures, or that were unsuitable for meaningful comparison, were also excluded. Thus, only research reporting specific information on skeletal-class diagnosis, occlusal classification, or clinically relevant cephalometric diagnostic variables was included in the final set.
Current Capabilities
Current CNN and hybrid models achieve sub-millimeter landmark precision and high discriminative power for skeletal classification (
Table 1). Evidence from a large oral-pathology dataset (over 20,000 images) suggests that such deep architectures can generalize across heterogeneous malocclusion phenotypes [
8]. Together, these findings provide a strong technical basis for future prospective clinical validation studies.
3. Discussion
A limitation of the search strategy is that it focused on concepts directly related to malocclusion diagnosis and skeletal classification. Research built around closely adjacent tasks, e.g., support for extraction decision-making, cephalometric landmarking, or orthodontic measurement prediction, may therefore have been missed if these terms did not appear in the title or abstract. Broadening the search beyond malocclusion diagnosis was not possible within the purpose of this study. The reader should therefore treat this evidence as indicative of AI systems specifically designed for diagnostic classification, rather than as covering all existing, or all technologically possible, orthodontic AI applications.
Beyond pooling accuracies, the clinical meaning of “90–96%” becomes clearer when reported metrics are linked to study design (external versus internal testing), subgroup performance, and clinician-facing explainability. Contemporary AI models perform on par with expert orthodontists; on lateral cephalograms, deep CNNs typically achieve 90–96% accuracy [
14]. A CNN trained on ~1500 images reached 96% accuracy (Cohen’s κ ≈ 0.91), outperforming automated cephalometry (κ ≈ 0.78) [
14]. These headline results were obtained from single-country datasets with predominantly internal validation, so they should be regarded as optimistic upper bounds rather than as expected performance in heterogeneous real-world clinical populations. These models operate end-to-end, identifying malocclusion patterns directly from images without requiring manual landmark input. CNNs also generalize complex patterns. A multicenter South Korean study reported AUC > 0.89 and accuracy > 0.80 when simultaneously classifying anteroposterior skeletal class, vertical pattern, and bite type on external radiographs [
15]. Performance is often highest for pronounced phenotypes. Class III cases (prognathism) yielded the best detection rates, with an external test AUC ~0.99 in one study [
15], indicating that AI is very reliable at identifying severe malocclusions.
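Cohen's κ, the chance-corrected agreement statistic reported alongside these accuracies, can be computed directly from a model-versus-reference confusion matrix. A minimal sketch with a hypothetical 2 × 2 matrix (in practice κ is computed over all skeletal classes):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows: reference labels, columns: model labels)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of cases on the diagonal.
    po = sum(confusion[i][i] for i in range(k)) / n
    # Expected chance agreement from the row and column marginals.
    pe = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical matrix: 90% raw agreement yields kappa = 0.8 after
# correcting for the 50% agreement expected by chance.
kappa = cohens_kappa([[45, 5], [5, 45]])
```

The correction matters here: with imbalanced skeletal-class distributions, raw accuracy can look high even when chance agreement is substantial, which is why studies such as [14] report κ alongside accuracy.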
AI also performs well on photographic data. In one prospective trial, a CNN analyzing intraoral photographs (three views per patient) achieved 93% accuracy (F1 ≈ 0.90) in the Angle molar classification, matching expert orthodontist performance [
16]. Notably, the AI was more precise than the human examiner for molar classes (precision 88.6% vs. 80.4%) [
17], suggesting greater consistency in applying classification criteria. However, clinicians outperformed the AI on continuous measurements (e.g., overjet and overbite), with human errors ~1.3 mm vs. AI ~2.0 mm [
16]. This indicates that 2D CNNs (without depth cues) struggle with sub-millimeter accuracy.
AI also detects specific malocclusion traits. Noëldeke et al. reported >98% accuracy for crossbite detection in intraoral photographs, with all CNNs (ResNet, MobileNet, etc.) achieving recall >90% and the best model (Xception) reaching 100% specificity [
18]. These results show that subtle occlusal discrepancies can be learned from 2D images given adequate data. Machine learning models trained on panoramic radiographs have also diagnosed Class III malocclusions in children, indicating that panoramic imaging can support malocclusion-related prediction tasks despite the lack of large public malocclusion datasets. Together with AI systems that categorize crowding severity and support extraction decisions from intraoral photographs, these findings suggest that crossbite and crowding can both be modeled as trait-level malocclusion patterns in photographic data.
Cephalometric landmark detection has reached clinically acceptable accuracy. Modern detectors locate ~75–85% of landmarks within 2 mm of the true position [
19]. An evaluation of a YOLOv3-based cephalometric model reported an average landmark error of ~1.8 mm and SDR within 2 mm: 75.5% [
19]. Many errors approach the level of human intra-observer variability, supporting automated cephalometry. Indeed, recent AI models achieve cephalometric measurements (angles and distances) comparable to experienced orthodontists [
19]. This implies that AI can effectively reproduce complete diagnostic work-ups (ANB angle, vertical indices, etc.) with minimal clinician input. The current landscape of AI-driven malocclusion diagnosis is further synthesized in
Table 2, which summarizes state-of-the-art performance metrics, explainability approaches, workflow impacts, fairness considerations, and regulatory readiness, together with their key supporting references [
20,
21,
22,
23,
24,
25,
26,
27,
28,
29].
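To illustrate how landmark-derived measurements such as ANB follow mechanically from detected landmarks, the angle can be computed from 2D coordinates as SNA − SNB, each measured at Nasion. This is a deliberately simplified sketch (unsigned planar angles, hypothetical coordinates); clinical systems work on calibrated cephalograms with signed conventions:

```python
import math

def angle_at(vertex, p1, p2):
    """Unsigned angle (degrees) at `vertex` between rays vertex->p1 and vertex->p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def anb(sella, nasion, a_point, b_point):
    """ANB = SNA - SNB, each measured at Nasion (simplified 2D sketch)."""
    sna = angle_at(nasion, sella, a_point)
    snb = angle_at(nasion, sella, b_point)
    return sna - snb
```

Because every intermediate quantity (SNA, SNB) is itself a standard cephalometric value, a pipeline built this way can expose each step for clinician verification rather than emitting only a final class label.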
Ethnic craniofacial differences make multi-ethnic datasets essential for safe orthodontic AI implementation [
23,
24,
25,
26,
27]. A model trained on predominantly Caucasian norms may misclassify skeletal relationships in Asian or African populations, where features such as bimaxillary protrusion represent normal variation rather than pathology. This can lead to systematic overcalling of Class II or III patterns and embed incorrect diagnostic standards into automated systems [
25,
26]. Population-specific data and calibration are therefore necessary to avoid clinically significant errors.
3.1. Algorithms and Techniques
Deep CNNs remain the dominant algorithms due to their success in image recognition. Common architectures (ResNet, VGG, DenseNet, EfficientNet) have all been applied to malocclusion classification or feature detection [
30]. ResNet-50, ResNet-18, MobileNet, Xception, and DenseNet were evaluated for crossbite detection, each achieving high AUC (>0.95) on cross-validation [
18]. Model choice depends on data size and compute resources: lighter models like MobileNet can be competitive (F1 ~0.98) while offering faster inference [
18]. Vision Transformers (ViTs) are emerging alternatives. A transformer model with masked pretraining on unlabeled cephalograms achieved ~90% accuracy on multi-attribute diagnosis (e.g., Class III with open bite) [
31], showing strong generalization. These results suggest that transformers can capture global craniofacial patterns, though they demand large datasets and are early in dental AI research [
32]. Although Vision Transformers have demonstrated strong performance in large-scale medical imaging, their practical value in orthodontics remains limited by the comparatively small size and homogeneity of available datasets [
31,
32]. ViTs typically require extensive pretraining or very large labeled cohorts to outperform convolutional models, conditions that are rarely met in dental AI research [
32]. As a result, current ViT-based orthodontic systems should be viewed as exploratory rather than clinically mature, with their reported generalization reflecting potential rather than proven real-world robustness [
31]. Until larger, diverse orthodontic datasets become widely available, CNNs are likely to remain the more reliable and data-efficient option.
Hybrid strategies are also reported: Ryu et al. (2023) combined CNN-based tooth landmark detection with classification networks to categorize crowding severity and predict extraction need [
33]. Such multi-step pipelines (first measuring arch length discrepancy via landmarks, then decision-modeling extraction) add interpretability and structure to the AI’s reasoning. Interestingly, some classical machine learning methods hold their own. In a skeletal Class III classification study, a Gaussian process regression (GPR) model slightly outperformed deep networks (reportedly highest AUC ~0.98) in classifying Class III vs. normal, highlighting that, with the right features, non-deep algorithms can excel [
34]. Similarly, Paddenberg-Schubert et al. found that no single algorithm was intrinsically superior for skeletal class prediction; simpler linear models achieved ~99% accuracy when fed a few key cephalometric features (e.g., SNB, jaw length ratio) [
35]. Such near-perfect accuracies arise in tightly controlled, well-separated samples without external testing or formal risk-of-bias assessment and are likely to overestimate true diagnostic performance in routine practice.
3.2. Datasets and Open Initiatives
A critical concern in orthodontic AI is the size and diversity of training data. Many early studies used small, homogeneous samples of a few hundred single-center patients, which limits generalizability. More recent work uses larger, more varied datasets: Kim et al. (2022) trained a CNN on ~1334 cephalograms [
14], and Yim et al. (2022) used >2000 cephalograms from 10 hospitals, improving robustness to differing image quality and patient demographics [
15]. Photographic datasets have also expanded, with Bardideh et al. generating 7500 intraoral images from ~948 adults [
16] and Ryu et al. training on ~1300 annotated occlusal photos per arch [
33].
A recent review still identified only 16 unique open dental image datasets, with panoramic X-rays most common and no large public malocclusion datasets as of 2023 [
36]. New comprehensive resources begin to fill this gap: the OMNI dataset provides thousands of intraoral, extraoral, and radiographic images for benchmarking malocclusion models, and Wang et al. (2024) released a 3D dataset of 1060 pre- and post-treatment dental models with segmentation labels across multiple malocclusions [
37]. These open datasets support validation, fair algorithm comparison, and inclusion of diverse populations, as OMNI aggregates multi-ethnic patient images to mitigate bias, yet most current models still train and test on ethnically homogeneous cohorts (for example East Asian or Middle Eastern single-country studies), which raises concerns about performance in other ethnic and age groups.
AI systems for malocclusion diagnosis therefore still rely predominantly on lateral cephalograms and intraoral or extraoral photographs, with panoramic radiographs and 3D dental models only beginning to appear in the literature and remaining comparatively underexplored [
36]. Such data are starting to become available through community collaborations, e.g., orthodontic grand challenges modeled on RSNA’s Machine Learning Challenges, which are public competitions on medical-imaging tasks. Thus, while the size and representativeness of datasets have improved over the last five years, more needs to be done in terms of international data sharing and standardized benchmarks to create broadly generalizable and unbiased AI models.
A critical limitation in the current literature is the risk of data leakage and overfitting, which can substantially inflate reported accuracies, particularly when values approach the 96–99% range. In several published studies, train-test separation relies on image-level rather than patient-level splits, allowing different images from the same individual to appear in both sets. This undermines the validity of “90%+” performance claims, as models may learn patient-specific features rather than clinically meaningful patterns. In addition, some datasets mix sequential radiographs or photographs from the same treatment episode, further increasing the chance of inadvertent overlap. Without strict patient-level separation and transparent documentation of splitting protocols, results may overestimate real-world generalization. Readers should therefore interpret high accuracy values cautiously and prioritize studies that explicitly implement robust leakage-prevention strategies. Future orthodontic AI research will require standardized reporting of dataset partitioning to ensure reliability and reproducibility. The main architectural families used in AI-based malocclusion diagnosis and their comparative strengths and limitations are summarized in
Table 3.
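The patient-level splitting requirement discussed above can be enforced in a few lines: assign patients, not images, to the test set, so that sequential radiographs or photographs from one individual never straddle the partition. A minimal sketch, assuming a hypothetical record format with a `patient_id` key:

```python
import random

def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split image records into train/test so that every image from a given
    patient lands on the same side, preventing patient-level data leakage.

    records: list of dicts, each carrying at least a 'patient_id' key.
    """
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test
```

An image-level `random.shuffle` over the records would, by contrast, routinely place two images of the same patient on opposite sides of the split, which is precisely the leakage mode that inflates the 96–99% accuracies criticized above. Libraries such as scikit-learn offer equivalent group-aware splitters (e.g., `GroupShuffleSplit`).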
3.3. Clinical Validation and Utility
Most clinical evidence for orthodontic AI remains concentrated at early Fryback and Thornbury tiers (technical and diagnostic accuracy), with encouraging but limited advances toward real-world validation. Multicenter external testing has shown generalization across sites (AUC > 0.85) while revealing weaknesses on subtler distinctions, and a prospective head-to-head study found that AI matched expert accuracy and surpassed clinician consistency for occlusion classification. Concordance on extraction decisions is substantial (κ ≈ 0.8) with high AUCs (93–96%), though ground-truth variability persists in borderline cases. To date, one workflow study reached level 5 validation by reducing analysis time without compromising diagnostic accuracy; however, no large-scale randomized trials have evaluated patient outcomes, underscoring the need for prospective multicenter studies [
15,
17,
33,
38,
39].
Validation tiers and endpoints are summarized in
Table 4.
Next steps (evidence gaps): Prospective multicenter trials—e.g., deploying AI for preliminary malocclusion triage in new patients and comparing treatment plans or outcomes with controls—would provide the strongest evidence; until such trials are completed, clinical-benefit claims remain inferential [
15,
38].
3.4. Regulatory and Ethical Considerations
As AI diagnostic tools near clinical use, regulators and clinicians emphasize explainability, validated performance across sites and subgroups, secure lifecycle management, and clear human oversight. Recent U.S. authorizations in dental AI (e.g., cephalometric analysis 510(k), radiographic pathology detection, and a 2024 De Novo for orthodontic monitoring) show a viable pathway but do not constitute approval for end-to-end malocclusion diagnosis; submissions must evidence multicenter external validation with prespecified endpoints, subgroup reporting, safe-update mechanisms, cybersecurity, transparent labeling, and a clinician-in-the-loop model [
13,
33,
38,
39,
40,
41]. Regulatory and deployment requirements are summarized in
Table 5.
Legal liability in AI-assisted diagnostics remains centered on the principle that such systems function as decision support rather than autonomous diagnostic agents [
38,
41]. Under both FDA SaMD and EU MDR frameworks, primary responsibility for diagnostic decisions continues to rest with the clinician, who must override system outputs when inconsistent with clinical findings [
13]. Manufacturers carry liability when errors stem from flawed algorithms, insufficient validation, or unsafe software updates, whereas institutions may be accountable for improper deployment, lack of oversight, or failure to maintain post-market monitoring [
41]. Documenting instances of AI–clinician disagreement and recording the reasoning behind final decisions is increasingly viewed as an essential safeguard. Emerging regulatory guidance also emphasizes that safe-update mechanisms and transparent labeling are prerequisites for assigning responsibility when model performance changes over time [
38,
41].
3.5. Future Directions
Future priorities and actions are summarized in
Table 6.
Orthodontic AI should now shift from small, single-site studies to coordinated, multi-institution efforts that expand diverse datasets (including 3D/CBCT) with subgroup metadata for bias monitoring, while running prospective trials to document workflow and outcome effects in real clinics. Hybrid, rule-aware models can improve transparency for clinicians (e.g., explanations tied to cephalometric features), and fairness must be planned from the outset with subgroup reporting and privacy-preserving collaboration such as federated learning. Regulatory readiness will hinge on multicenter validation, safe-update processes, and institutional oversight, as applications broaden from diagnosis toward treatment planning where AI has already shown high performance in selected decisions (e.g., surgery vs. orthodontics in Class III) [
33,
36,
38,
42,
43].
3.6. Conflicting AI Outputs and Clinical Decision Pathways
If algorithms are trained on heterogeneous data, use different ground-truth labeling standards, or are validated under different protocols, AI systems can produce divergent recommendations [
4,
7,
15]. This should not be read as a failure of these systems, but as a signal that the finding warrants further attention. When an output is an outlier, the first step is to confirm it against a second diagnostic source; for instance, an AI cephalometric analysis should be cross-checked against intraoral photographs or digital dental models. Clinicians should also favor systems that provide interpretable confidence or uncertainty measures, which help flag where predictions are unreliable. Where more than one AI tool is deployed, the tool with the strongest regulatory-grade validation and established generalizability across populations should be used [
33]. The clinician must record both the discrepancy and the rationale behind the final interpretation so that the decision remains traceable, a practice emerging as a requirement under the growing regulation of algorithmic diagnostic devices. When discrepancies remain substantial, the clinician’s expertise takes precedence and the system’s suggestions should be set aside. This guards against deterministic reliance on algorithmic extrapolation and reaffirms the primacy of human judgment. The goal is for AI systems to serve as diagnostic aides that support, but do not usurp, the clinician’s authority to diagnose.
In addition to the specific studies summarized above, several broader reviews and methodological frameworks place these findings in a wider orthodontic- and dental-AI context. Recent scoping and narrative reviews show that AI applications in orthodontics have expanded rapidly, but that study designs remain highly heterogeneous in terms of data types, annotation protocols, and outcome metrics, which complicates direct comparison and synthesis [
44,
45,
46,
47,
48]. Parallel work on trustworthy AI in dentistry and medical imaging emphasizes that reproducibility, fairness, explainability, and lifecycle monitoring need to be evaluated alongside accuracy [
49,
50]. Methodological guidance such as the CONSORT-AI extension and analyses of data leakage in deep-learning classification highlights the importance of prespecified endpoints, transparent reporting, strict patient-level data separation, and external validation to avoid inflated performance estimates [
51,
52]. Related scoping work in adjacent orthodontic AI applications (e.g., clear aligner therapy) further reinforces these general reporting and validation priorities [
53].
4. Conclusions
AI-based tools achieve high technical performance in malocclusion diagnosis, with 90–96% accuracy for skeletal Class I/II/III on cephalograms, SDR within 2 mm of ≈73–76% for landmarking, and ~70–80% reductions in cephalometric tracing time. Despite this, clinical translation remains limited: evidence beyond retrospective or internal testing is sparse, multicenter external studies are few, and current regulatory authorizations cover cephalometric analysis and orthodontic monitoring rather than end-to-end malocclusion classification. Turning these systems into dependable co-pilots for orthodontists will require prospective multicenter trials with prespecified clinical endpoints, systematic subgroup and fairness reporting, clinician-facing explainability, and diverse, openly benchmarked datasets aligned with regulatory expectations for SaMD. Differences in imaging hardware, operator-dependent image acquisition, and demographic variation can increase errors when models are trained on narrow cohorts and may lead to failure modes such as systematic misclassification in marginalized groups or degraded performance when images do not resemble the training set [
15,
23,
24,
25,
26,
27]. Off-label use, lack of active monitoring for model drift, and limited clinical oversight further increase the risk of misuse; safe and fair application in daily clinics depends on recognizing and managing these weaknesses.