1. Introduction
Artificial intelligence (AI) and machine learning have rapidly expanded across dentistry, particularly in image-based diagnosis, segmentation, treatment planning, risk prediction, and digital workflow support. This growth is consistent with the broader movement toward data-driven healthcare, where computational models are expected to assist clinicians, reduce repetitive tasks, and improve the consistency of image interpretation. In dentistry, these expectations are especially visible because many clinical decisions depend on radiographs, cone-beam computed tomography (CBCT), intraoral photographs, three-dimensional scans, and other structured imaging outputs.
Broad scoping evidence has shown that dental machine-learning research includes a wide spectrum of tasks, input data types, model architectures, reference standards, and performance metrics, with classification, detection, and segmentation being among the most frequent applications [1,2]. However, this same diversity has also generated an evidence base that is difficult to compare, reproduce, and translate into clinical practice.
Most dental AI studies have been designed to demonstrate that a model can learn from a development dataset and achieve favorable internal performance. Such evidence is important, but it does not establish whether a model will remain accurate when exposed to patients, devices, centers, acquisition protocols, disease distributions, or annotation standards that differ from those used during model training. Generalizability is therefore a central condition for clinical implementation. Earlier mapping of machine-learning studies in dentistry found that external validation was uncommon and that many models relied on single-center datasets, internal train–test partitions, or cross-validation without evaluating performance in genuinely independent data [1]. More recent field-wide evidence has continued to identify limited assessment of bias, outliers, calibration, reproducibility, and data access as persistent barriers to responsible adoption [2]. These limitations are not merely technical details; they affect whether a model can be trusted outside the environment in which it was developed.
The problem is also evident in disease-specific reviews. In caries detection, a systematic review and meta-analysis identified 45 studies using AI platforms on dental radiographs or clinical images and reported high heterogeneity in performance, dataset size, caries definitions, annotation procedures, and reporting quality. Although the pooled diagnostic estimates suggested promising performance, the review also found substantial variation across imaging modalities and reported that none of the included studies validated their model on external data according to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) assessment [3]. This is an important distinction for the present review. The question is not only whether AI can detect caries or segment dental structures with high internal accuracy, but whether those results are transportable across external settings and sufficiently reproducible to support clinical decision-making.
Concerns about interpretability, bias, and generalizability have been addressed more directly in recent systematic evidence. A review focused on explainability, bias, and generalizability included 11 studies and concluded that trustworthiness attributes are relevant to dental AI, but that the included literature remained heterogeneous across dental domains, model types, and outcome definitions [4]. That review is close to the present topic, but its emphasis was on model interpretability and equity rather than on the methodological pathway from internal performance to external clinical translation. The current review is therefore positioned differently. It focuses specifically on studies that go beyond internal validation by empirically evaluating external validation, multicenter or multi-device generalization, cross-dataset reproducibility, privacy-preserving or federated learning, and robustness to data heterogeneity.
This distinction is important because implementation-ready AI requires more than a high area under the curve, Dice similarity coefficient (DSC), or accuracy value obtained in a familiar dataset. Dental imaging data are highly heterogeneous. Differences in radiographic devices, CBCT systems, intraoral scanners, image resolution, acquisition settings, patient age, dentition status, restorations, orthodontic appliances, disease prevalence, and annotation protocols can produce domain shift and reduce performance in external datasets. Recent primary studies illustrate this translational gap. Evidence from dental AI beyond imaging has also shown that performance may decline during cross-national external validation, reinforcing the broader concern that internal discrimination is not necessarily preserved when population structure, disease distribution, or data-collection conditions change [5]. Other externally validated models have tested AI systems on independent dental photographs, clinical records, CBCT scans, periapical radiographs, panoramic radiographs, or intraoral scans, showing that generalization can be evaluated but that its strength depends heavily on the independence, diversity, and quality of the external data [6,7,8,9,10,11].
Privacy and data governance create an additional challenge. Dentistry often depends on identifiable or potentially re-identifiable imaging data, because tooth morphology, restorations, and craniofacial structures may act as individualizing features. Direct pooling of multi-institutional dental images may therefore be restricted by ethical, legal, regulatory, and institutional barriers. Federated learning has been proposed as a privacy-preserving framework that allows collaborative model training without directly sharing raw patient data [12]. In dental imaging, empirical studies comparing federated, centralized, and local learning have shown the relevance of this approach for segmentation tasks, while also revealing new challenges related to heterogeneous data quality, labeling inconsistency, noisy clients, computational demands, and model monitoring [13,14]. Thus, privacy-preserving learning is not only a technical alternative to centralized data pooling; it is a key translational safeguard for scalable dental AI.
Reproducibility is another unresolved barrier. Previous reviews have emphasized that dental AI studies often provide incomplete information on data sources, preprocessing, data partitioning, model specification, calibration, missing data, and external validation, limiting replication and benchmarking [1,2]. Open datasets, public code, transparent reporting of train–validation–test structures, clear reference standards, and cross-dataset evaluations are essential for determining whether a model is robust or merely optimized for a specific dataset. This issue is particularly relevant for segmentation and measurement tasks, where performance can be affected by annotation conventions, unit of analysis, and whether metrics are calculated at the pixel, tooth, image, or patient level. Without reproducible reporting, apparently strong performance may not translate into clinically reliable implementation.
In this review, translational readiness refers to the extent to which an imaging-based dental AI model has been evaluated beyond internal algorithmic performance and provides evidence relevant to clinical transferability. This concept was operationalized across three related levels. Technical validation refers to model performance in development or test datasets, including task-specific metrics such as accuracy, area under the curve, Dice similarity coefficient, or measurement error. Clinical validation refers to evaluation in independent patients, datasets, centers, devices, acquisition protocols, or clinical reference standards that differ from the development setting. Implementation readiness refers to evidence that the model can be reproduced, integrated into clinical workflows, interpreted by users, monitored safely, and used under appropriate regulatory, ethical, and governance conditions. This distinction is important because strong technical performance does not necessarily imply clinical validity or readiness for routine implementation.
Against this background, the present systematic review does not seek to re-estimate the overall diagnostic accuracy of AI in dentistry. Instead, it aims to synthesize the subset of imaging-based dental AI studies that directly address translational readiness through external validation, independent dataset testing, multicenter or multi-device generalizability, cross-dataset reproducibility, privacy-preserving or federated learning, explainability, reporting transparency, and robustness to data heterogeneity.
Before conducting this review, we expected that only a small and methodologically heterogeneous subset of dental AI studies would have progressed beyond internal validation. This expectation was based on previous evidence showing that many dental AI models are developed and tested within limited datasets, with incomplete assessment of external validity, reproducibility, calibration, transparency, and clinical workflow relevance. We conducted this review to clarify whether recent imaging-based dental AI studies have begun to address these translational gaps and to identify which validation designs provide stronger evidence of clinical transferability. The novelty of this review lies in shifting the focus from algorithmic performance alone to the translational pathway from external validation and generalizability to reproducibility, privacy-preserving learning, and implementation readiness.
Evidence was organized through a structured narrative synthesis reported in accordance with the Synthesis Without Meta-analysis (SWiM) framework. Studies were grouped into predefined domains related to clinical transferability and implementation readiness, including external validation, generalizability, reproducibility, privacy-preserving learning, transparency, and clinical workflow relevance. This approach was selected because the review focuses on the translational validation of imaging-based dental AI rather than on a single diagnostic comparison or intervention effect.
The primary objective is to evaluate the extent to which dental AI models have moved beyond internal algorithmic performance toward reproducible, externally validated, and clinically transferable implementation.
2. Materials and Methods
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement [15]. The protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO; CRD420261402398). The search strategy and reporting of information sources were planned to support transparency and reproducibility, in line with PRISMA-S recommendations for systematic review searches [16].
2.1. Eligibility Criteria
Eligibility criteria were defined using an AI-validation review framework rather than a conventional intervention PICO. This decision was made because the objective of the review was not to compare a single AI intervention with a single clinical comparator, but to evaluate whether imaging-based dental AI studies have progressed beyond internal algorithmic performance toward external validation, generalizability, reproducibility, privacy-preserving learning, and clinical implementation readiness.
Population/data source: Eligible studies used human dental, oral, craniofacial, or maxillofacial imaging data. These data included dental radiographs, panoramic radiographs, periapical or bitewing radiographs, cone-beam computed tomography (CBCT) scans, intraoral photographs, dental photographs, intraoral scans, three-dimensional dental models, or other imaging-based oral-health datasets.
Index AI approach: The index approach was an artificial intelligence or machine-learning model applied to an imaging-based dental task. Eligible models included machine learning, deep learning, convolutional neural networks, transformer-based models, segmentation models, radiomics, federated-learning models, multimodal AI, or related computational approaches.
Target task: Eligible tasks included dental diagnosis, detection, classification, segmentation, measurement, radiographic quality control, risk prediction, or decision support. Studies were not required to evaluate the same clinical condition, because the review question focused on translational validation rather than on one diagnostic indication.
Validation setting: To be eligible, studies had to include empirical evidence beyond internal model development. This included at least one of the following: independent external testing, patient-level external validation, multicenter validation, multi-device or multi-acquisition testing, cross-dataset generalizability, federated or privacy-preserving learning, robustness to dataset heterogeneity, or reproducibility assessment. Studies limited only to internal train–validation–test splitting, k-fold cross-validation, or internal hold-out testing were not eligible unless they also included one of these external or translational validation components.
Reference standard or comparator: Reference standards and comparators were recorded according to the design and task of each study. They could include expert or consensus annotation, manual or refined segmentation, CBCT confirmation, longitudinal clinical documentation, internal validation results, centralized or local learning, or clinician assessment. These elements were not treated as a single common comparator because they served different methodological functions across diagnostic, segmentation, measurement, and federated-learning studies.
Outcomes: The primary outcomes were external-validation performance; generalizability across datasets, centers, devices, acquisition protocols, populations, or annotation settings; cross-dataset reproducibility; privacy-preserving or federated-learning performance; internal-to-external performance change when directly comparable; and methodological indicators of translational readiness. Task-specific quantitative metrics, including area under the curve (AUC), accuracy, sensitivity, specificity, precision, recall, F1-score, negative predictive value, Dice similarity coefficient, intersection over union, Hausdorff distance, HD95, average symmetric surface distance, root mean square error, mean absolute error, and other reported performance measures, were extracted according to the task evaluated in each study.
Study design: Eligible designs included primary empirical studies, diagnostic accuracy studies with external validation, model-development studies with independent validation, segmentation validation studies, multicenter or multi-device validation studies, cross-dataset studies, federated-learning studies, and retrospective or prospective validation studies. Studies comparing AI performance with expert clinicians or established reference standards were eligible when they also contributed evidence relevant to external validation, generalizability, reproducibility, or clinical transferability.
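The validation-setting criterion above can be read as a simple screening rule. The following is a minimal sketch, assuming each candidate study has been hand-coded with a set of validation labels; the label names and the function `meets_validation_criterion` are illustrative and are not part of the review protocol.

```python
# Hypothetical labels paraphrasing the validation-setting criterion.
TRANSLATIONAL_COMPONENTS = {
    "independent_external_testing",
    "patient_level_external_validation",
    "multicenter_validation",
    "multi_device_or_multi_acquisition_testing",
    "cross_dataset_generalizability",
    "federated_or_privacy_preserving_learning",
    "robustness_to_dataset_heterogeneity",
    "reproducibility_assessment",
}

# Internal-only designs; on their own they do not satisfy the criterion.
INTERNAL_ONLY = {
    "internal_train_validation_test_split",
    "k_fold_cross_validation",
    "internal_hold_out",
}


def meets_validation_criterion(components):
    """Return True if at least one external or translational component is present."""
    return bool(set(components) & TRANSLATIONAL_COMPONENTS)
```

Under this reading, a study coded only with `k_fold_cross_validation` fails the criterion, whereas the same study with `multicenter_validation` added passes it. The remaining eligibility criteria (data source, task, study design) still require reviewer judgment and are not captured here.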
2.2. Exclusion Criteria
Studies were excluded when they were systematic reviews, scoping reviews, narrative reviews, umbrella reviews, editorials, letters, commentaries, protocols, dissertations, or conference abstracts without sufficient methodological and outcome data. Studies were also excluded when they used animal data only, purely simulated datasets without clinical or human-derived dental data, non-dental medical datasets, or broader medical AI datasets from which dental-specific results could not be separated.
Purely technical algorithm papers were excluded when they did not include empirical validation in dental data. Studies limited to internal train–validation–test partitioning, k-fold cross-validation, or internal hold-out testing were excluded unless they also included an external dataset, independent cohort, multicenter or multi-device evaluation, cross-dataset testing, federated or collaborative learning comparison, or reproducibility/generalizability assessment. Studies assessing non-AI digital tools, conventional statistical models without an AI or machine-learning component, or manual-only diagnostic methods were not eligible.
2.3. Information Sources and Search Strategy
A comprehensive literature search was performed in PubMed/MEDLINE, Scopus, and Embase. Searches were conducted without language restrictions and included articles published up to May 2026, with no restriction on the initial publication date. Additional records were identified through backward reference checking, forward citation tracking, and screening of relevant reviews. Authors were contacted when essential methodological or performance information was unavailable in the published report.
The search strategy combined controlled vocabulary and free-text terms related to AI, machine learning, deep learning, federated learning, dentistry, dental imaging, external validation, generalizability, reproducibility, independent datasets, multicenter testing, privacy-preserving learning, and cross-dataset evaluation. The complete search strategy for each database is provided in Supplementary Table S1.
2.4. Selection Process
All records retrieved from the databases were imported into reference-management software, and duplicates were removed before screening. Two reviewers independently screened titles and abstracts according to the eligibility criteria. Full texts of potentially eligible articles were then retrieved and assessed independently by the same reviewers. Disagreements were resolved through discussion, and a third reviewer was consulted when consensus was not reached. Reasons for exclusion at the full-text stage were recorded and summarized in Supplementary Table S2.
2.5. Data Collection Process
Data extraction was performed independently by two reviewers using a prepiloted extraction form designed for this review. Extracted information was compared between reviewers, and discrepancies were resolved through discussion. When necessary, authors were contacted to clarify missing information related to external validation, dataset independence, model performance, or availability of code, models, or data.
2.6. Data Items
The following variables were extracted from each included study: bibliographic information, country, dental specialty or application, study design, data source, clinical setting, sample size, unit of analysis, imaging modality, AI model type, clinical or technical task, training strategy, validation strategy, external-validation design, reference standard, comparator, internal performance metrics, external performance metrics, uncertainty estimates, availability of code, model, or dataset, explainability methods, privacy-preserving methods, demographic or device heterogeneity, evidence of data leakage prevention, and the main translational limitations reported by the authors.
External validation was categorized according to the most specific level reported in each study: patient-level external validation, independent image-dataset validation, multicenter validation, multi-device or multi-acquisition validation, cross-dataset generalizability testing, or federated/collaborative learning across distributed data sources. For studies reporting both internal and external performance, the internal-to-external performance change was extracted or calculated when the same metric was available in both settings.
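The internal-to-external performance change can be illustrated with a short calculation. This is a minimal sketch, assuming the change is expressed as the external value minus the internal value for the same metric, with a relative change alongside it; the review does not state the exact formula, so both definitions and the function name `performance_change` are assumptions.

```python
from typing import Dict, Optional


def performance_change(
    internal: Optional[float], external: Optional[float]
) -> Optional[Dict[str, Optional[float]]]:
    """Absolute and relative internal-to-external change for one metric.

    Returns None when the metric is missing in either setting, mirroring the
    rule that the change is only derived when both values are available.
    """
    if internal is None or external is None:
        return None
    absolute = external - internal  # negative values indicate a drop
    relative = absolute / internal if internal != 0 else None
    return {"absolute": absolute, "relative": relative}
```

For example, `performance_change(0.90, 0.81)` describes a hypothetical model whose metric falls by about 0.09 in absolute terms, or about 10% relative to its internal value. The input values here are invented for illustration and do not come from any included study.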
2.7. Study Risk of Bias and Reporting Quality Assessment
Risk of bias was assessed according to the methodological design and AI task of each study. For diagnostic accuracy, classification, and detection studies, the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool was used [17]. The assessment considered patient selection, conduct and interpretation of the index AI test, appropriateness of the reference standard, and flow and timing. AI-specific issues were also considered, including independence of the external dataset, possible data leakage, threshold selection, blinding to the reference standard, and applicability of the validation sample to the review question.
For studies based on clinical or epidemiological prediction models, the Prediction model Risk of Bias ASsessment Tool (PROBAST) was used when applicable [18]. The domains included participants, predictors, outcome, and analysis, with attention to overfitting, model calibration, handling of missing data, and adequacy of the validation strategy.
For segmentation, measurement, cross-dataset, and federated-learning studies where QUADAS-2 or PROBAST did not fully capture the relevant sources of bias, an adapted AI-specific methodological appraisal was applied. This appraisal was not intended to generate a separate pooled quality score. Instead, it was used to identify AI-specific threats to validity and implementation relevance that were not fully captured by conventional diagnostic or prediction-model tools. It assessed seven domains: data source and independence (whether the data sources and validation datasets were independent); annotation or reference-standard quality (whether the annotation or reference standard was clearly defined and reproducible); prevention of data leakage (whether leakage prevention was described); external validation and transportability (whether the validation design tested transportability beyond the development setting); statistical analysis and uncertainty reporting (whether statistical uncertainty was reported); reproducibility and transparency (whether code, models, datasets, or preprocessing steps were sufficiently transparent for reproducibility); and clinical applicability (whether the evaluation was clinically applicable to the intended dental workflow). Reporting transparency was additionally examined using CLAIM-derived items, particularly for imaging-based AI studies [19]. Risk-of-bias and reporting-quality assessments were conducted independently by two reviewers and summarized in tabular form.
2.8. Certainty of Evidence Assessment
The certainty of evidence was evaluated using a Grading of Recommendations Assessment, Development and Evaluation (GRADE)-informed framework [20]. Because the included studies differed substantially in task, modality, unit of analysis, validation design, and performance metric, certainty was not assigned to a single global outcome. Instead, certainty was summarized by functional domain. The domains considered were risk of bias, inconsistency, indirectness, imprecision, and publication or reporting bias. Certainty was rated as high, moderate, low, or very low, with downgrading decisions justified narratively for each domain. For domains without quantitative pooling, certainty judgments were based on the consistency of findings, independence of external validation, transparency of reporting, and clinical applicability of the validation setting.
2.9. Data Synthesis and Statistical Analysis
The primary synthesis was planned as a structured narrative synthesis reported according to the Synthesis Without Meta-analysis (SWiM) framework, because the review addressed a translational AI-validation question rather than a single diagnostic or intervention outcome. The review was not designed to estimate whether AI is more accurate than dentists for one diagnostic condition. Instead, it was designed to evaluate whether imaging-based dental AI models had been assessed beyond internal development through distinct but related indicators of clinical transferability, including external validation in data not used for model development, multicenter or multi-device generalizability, cross-dataset reproducibility, privacy-preserving learning, and other implementation-readiness dimensions.
The synthesis followed a reproducible multistep process. First, a study-by-domain matrix was created to map each included study to one or more translational-readiness domains: external validation and independent test cohorts; multicenter, multi-device, or cross-domain generalizability; cross-dataset reproducibility and annotation heterogeneity; privacy-preserving and federated learning; explainability, transparency, and reproducibility; and clinical workflow relevance or implementation readiness. Second, each study was coded for validation type, unit of analysis, image modality, task, reference standard, comparator, internal performance, external performance, and availability of reproducibility resources. Third, external validation was classified as patient-level, dataset-level, center-level, device-level, cross-dataset, or federated/distributed validation. These validation categories were not interpreted as equivalent levels of evidence. For the purposes of synthesis, multicenter, multi-device, cross-dataset, and federated/distributed validation were considered to provide stronger evidence of transportability because they tested model performance under broader sources of domain shift. By contrast, validation in a single independent cohort or image repository was interpreted as evidence of external testing, but not necessarily as evidence of broad clinical generalizability when center diversity, device heterogeneity, acquisition protocols, population characteristics, or annotation harmonization were limited or unclear. Fourth, studies were compared within each functional domain to identify recurring methodological strengths, translational limitations, and possible sources of performance degradation. Fifth, the domain-level findings were integrated into a structured narrative that distinguished evidence of clinical transferability from evidence limited to algorithmic performance in familiar datasets. This synthesis was reported according to the SWiM framework because quantitative pooling was inappropriate [21].
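The first and third steps of this process can be sketched as a small data structure. The following is a minimal illustration, assuming studies are coded by hand; the domain keys, the tier labels, and the identifier `study_A` are hypothetical names introduced for the example and are not taken from the review's extraction form.

```python
from typing import Dict, Iterable, List

# Six translational-readiness domains, abbreviated as illustrative keys.
DOMAINS: List[str] = [
    "external_validation",
    "generalizability",
    "cross_dataset_reproducibility",
    "privacy_preserving_learning",
    "transparency",
    "workflow_relevance",
]

# Validation types treated as testing broader sources of domain shift.
BROADER_SHIFT = {"center-level", "device-level", "cross-dataset", "federated/distributed"}
# Validation types treated as external testing in a single independent source.
SINGLE_SOURCE = {"patient-level", "dataset-level"}


def build_matrix(assignments: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each study to a row of booleans, one per domain."""
    matrix: Dict[str, Dict[str, bool]] = {}
    for study, domains in assignments.items():
        chosen = set(domains)
        unknown = chosen - set(DOMAINS)
        if unknown:
            raise ValueError(f"unknown domain(s): {sorted(unknown)}")
        matrix[study] = {d: d in chosen for d in DOMAINS}
    return matrix


def transportability_tier(validation_types: Iterable[str]) -> str:
    """Coarse tier reflecting that validation types are not equivalent evidence."""
    types = set(validation_types)
    if types & BROADER_SHIFT:
        return "broader domain shift tested"
    if types & SINGLE_SOURCE:
        return "external testing only"
    return "unclassified"


# Hypothetical usage: one study mapped to two domains.
example = build_matrix({"study_A": ["external_validation", "transparency"]})
```

The two-tier split is a simplification: the text also conditions interpretation on center diversity, device heterogeneity, and annotation harmonization, which a lookup of this kind cannot judge.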
Exploratory quantitative synthesis was planned only if at least three studies reported sufficiently comparable metrics within the same task family, validation context, unit of analysis, and outcome metric. Feasibility assessment for quantitative synthesis considered not only the number of available studies but also comparability in clinical task, imaging modality, anatomical target, unit of analysis, reference standard, validation design, and outcome metric. Quantitative pooling was considered inappropriate when these elements differed sufficiently to compromise the interpretability of a pooled estimate. Candidate analyses included pooled AUC for comparable diagnostic or classification studies with external validation, pooled Dice similarity coefficient for comparable segmentation studies with external validation or cross-dataset testing, and pooled internal-to-external performance change when the same metric was reported in both internal and external settings within the same study.
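The numerical part of this feasibility rule can be sketched as a grouping step. This is a minimal sketch, assuming each extracted result is a dictionary with the hypothetical keys `study`, `task_family`, `validation_context`, `unit`, and `metric`; it checks only the three-study threshold, whereas comparability of modality, anatomical target, and reference standard remains a reviewer judgment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

GroupKey = Tuple[str, str, str, str]


def poolable_groups(
    records: Iterable[Mapping[str, str]], min_studies: int = 3
) -> Dict[GroupKey, List[str]]:
    """Return outcome groups that contain at least `min_studies` distinct studies."""
    groups: Dict[GroupKey, set] = defaultdict(set)
    for record in records:
        key = (
            record["task_family"],
            record["validation_context"],
            record["unit"],
            record["metric"],
        )
        groups[key].add(record["study"])  # a set, so each study counts once
    return {
        key: sorted(studies)
        for key, studies in groups.items()
        if len(studies) >= min_studies
    }
```

Groups returned by this function would only be candidates for pooling; a group that passes the count threshold could still be rejected if the underlying tasks or reference standards differ.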
Quantitative pooling was not planned across all included studies because the review was expected to include heterogeneous AI tasks, imaging modalities, reference standards, validation designs, units of analysis, and metric families. A global diagnostic meta-analysis would only be performed if the included studies provided a clinically and statistically coherent set of comparable outcomes. Forest plots were planned only for outcome groups that met the criteria for exploratory quantitative synthesis.
If quantitative synthesis was feasible, random-effects models would be used because clinical and methodological heterogeneity was expected. Restricted maximum likelihood estimation would be preferred for between-study variance, and results would be reported with 95% confidence intervals. Heterogeneity would be evaluated using visual inspection of forest plots, the I² statistic, and tau-squared (τ²) [22]. Studies without adequate uncertainty estimates would be retained in the structured narrative synthesis but would not be included in quantitative pooling unless sufficient data could be derived from available information. Analyses were planned in R 4.3.2 (R Foundation for Statistical Computing, Vienna, Austria) using the metafor package [23].
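To make the heterogeneity quantities concrete, the following sketch computes a random-effects pooled estimate, τ², and I². It is not the planned analysis: the protocol specifies restricted maximum likelihood in metafor, whereas this illustration substitutes the closed-form DerSimonian-Laird estimator because it needs no iterative fitting. It also assumes that effect estimates and their sampling variances are already on a scale suitable for pooling; the function name is hypothetical.

```python
import math
from typing import Dict, Sequence


def random_effects_dl(
    effects: Sequence[float], variances: Sequence[float]
) -> Dict[str, float]:
    """DerSimonian-Laird random-effects pooling (a stand-in for REML)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Between-study variance (tau-squared), truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I-squared as a percentage, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights, pooled estimate, and 95% confidence interval.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
        "q": q,
    }
```

DerSimonian-Laird and REML generally give different τ² values when studies are few or heterogeneous, so results from this sketch should not be expected to match a metafor fit.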
3. Results
3.1. Study Selection
The database search identified 1308 records, including 212 records from PubMed/MEDLINE, 651 from Scopus, and 445 from Embase. After removing 385 duplicate records, 923 unique records were screened by title and abstract. Most records were excluded at this stage because they did not address artificial intelligence models in dentistry or oral healthcare, were not based on human dental or oral imaging data, focused on non-dental medical applications, were reviews, editorials, protocols, or conference abstracts, or reported only internal model development without evidence of external validation, independent dataset testing, multicenter assessment, cross-dataset evaluation, federated learning, or reproducibility analysis.
After title and abstract screening, 24 articles were considered potentially eligible and were assessed in full text. Nine articles were excluded after full-text evaluation. The most frequent reasons were restriction to internal train–validation–test performance, absence of an independent external dataset or cross-setting evaluation, lack of empirical evidence related to generalizability, privacy-preserving learning, or reproducibility, or use of non-imaging data outside the final imaging-based scope of the review. The excluded full-text articles and reasons for exclusion are presented in Supplementary Table S2.
Overall, 15 studies met all inclusion criteria and were included in the systematic review [6,7,8,9,10,11,13,14,24,25,26,27,28,29,30]. These studies formed the evidence base for evaluating external validation, generalizability, cross-dataset reproducibility, privacy-preserving or federated learning, and translational readiness of imaging-based artificial intelligence models in dentistry. The study selection process is summarized in Figure 1.
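The selection counts reported above can be reconciled with a few lines of arithmetic. This sketch uses only the figures stated in the text; the number excluded at title and abstract screening is derived here and is not reported explicitly.

```python
# Counts as reported in the study-selection text.
records_by_database = {"PubMed/MEDLINE": 212, "Scopus": 651, "Embase": 445}
duplicates_removed = 385
full_text_assessed = 24
full_text_excluded = 9

identified = sum(records_by_database.values())
screened = identified - duplicates_removed
# Derived value (not stated in the text): records excluded at screening.
excluded_at_screening = screened - full_text_assessed
included = full_text_assessed - full_text_excluded

print(f"identified={identified}, screened={screened}, "
      f"excluded_at_screening={excluded_at_screening}, included={included}")
```

The totals agree with the reported 1308 identified records, 923 screened records, and 15 included studies.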
After full-text assessment and data extraction, the feasibility of quantitative synthesis was evaluated for the predefined candidate outcome groups described in the protocol. Detection and classification studies most commonly reported AUC-based metrics, whereas segmentation studies primarily reported Dice similarity coefficient (DSC), intersection over union (IoU), or Hausdorff-distance measures. However, no outcome group contained at least three studies that were sufficiently comparable with respect to clinical task, imaging modality, anatomical target, unit of analysis, reference standard, validation setting, and reported performance metric. For example, AUC values were reported across fundamentally different applications, including caries detection, root-number classification, radiographic quality assessment, and palatal radicular groove diagnosis, while DSC values were reported for segmentation tasks involving different anatomical structures, imaging modalities, and validation frameworks. In addition, uncertainty estimates and reporting formats were inconsistent across studies. Therefore, quantitative pooling and forest plots were not considered methodologically appropriate, and the evidence was synthesized through a structured narrative synthesis organized according to predefined translational-readiness domains.
3.2. General Characteristics of the Included Studies
The 15 included studies were published between 2023 and 2026, with most appearing in 2025 or 2026 [6,7,8,9,10,11,13,14,24,25,26,27,28,29,30]. The corpus covered a broad range of imaging-based dental AI applications rather than a single diagnostic task. These included caries detection and classification [10,24,25], caries and periodontal bone loss assessment using commercial clinical decision-support systems [6], gingival inflammation grading [29], root-number detection in maxillary premolars [11], palatal radicular groove diagnosis and classification in CBCT scans [7], radiographic quality control in periapical radiographs [8], tooth-width estimation from standardized occlusal photographs [30], and automated segmentation of teeth, pulp, gingiva, or dental hard tissues across panoramic radiographs, CBCT scans, intraoral scans, and three-dimensional dental models [9,13,14,26,27,28].
The included studies also differed in how they tested model transportability. Some evaluated independent external datasets or patient-level external validation [
6,
10,
11,
24,
25,
30], whereas others assessed multicenter, multi-device, cross-dataset, or federated-learning settings [
7,
8,
9,
13,
14,
26,
27,
28,
29]. This variation was central to the review question: the included evidence did not support a single pooled diagnostic accuracy framework, but it did allow a structured assessment of how dental AI models behave when tested beyond their original development environment. Detailed study characteristics, including dental domain, imaging modality, AI task, sample size, unit of analysis, model type, reference standard, and validation approach, are presented in
Table 1.
3.3. AI Tasks, Imaging Modalities, and Reference Standards
The 15 included studies covered a heterogeneous but clearly image-centered set of artificial intelligence applications in dentistry [
6,
7,
8,
9,
10,
11,
13,
14,
24,
25,
26,
27,
28,
29,
30]. At the level of primary task, five studies focused mainly on segmentation [
9,
13,
14,
26,
28], five on detection, classification, or grading tasks [
6,
8,
11,
24,
29], four combined detection or classification with segmentation outputs [
7,
10,
25,
27], and one addressed quantitative measurement of dental morphology [
30]. This distribution shows that the included literature extends beyond conventional lesion detection and also includes structurally oriented tasks such as tooth segmentation, quality control, and morphometric estimation.
The imaging sources were similarly diverse. Panoramic radiographs were the most frequent modality [
11,
13,
14,
27], followed by CBCT scans [
7,
9,
26]. Other studies used dental photographs [
10,
24], intraoral or three-dimensional scans [
25,
28], periapical radiographs [
8], full-mouth radiographic series linked to clinical follow-up [
6], standard oral RGB images [
29], or standardized occlusal photographs [
30]. Reference standards were predominantly based on expert or consensus annotation and grading procedures [
7,
8,
10,
13,
14,
24,
25,
27,
28,
29], although some studies relied on refined or manual segmentation standards [
9,
26], CBCT-confirmed morphology [
11], longitudinal clinical documentation [
6], or mesiodistal measurements derived from three-dimensional intraoral scans [
30]. Taken together, these distributions highlight the methodological heterogeneity of the included evidence and reinforce the rationale for a structured narrative synthesis rather than a single pooled diagnostic-accuracy framework. This heterogeneity also affected interpretability. Performance estimates could not be interpreted as reflecting a common clinical effect because studies differed not only in numerical metrics but also in diagnostic task, anatomical target, imaging modality, reference standard, validation design, and unit of analysis. Therefore, higher performance in one study did not necessarily indicate greater clinical transferability than lower performance in another study evaluating a more complex task or broader validation setting. A summary of the distribution of AI tasks, imaging modalities, and reference standards is presented in
Figure 2.
3.4. External Validation and Independent Dataset Testing
The included studies did not apply a single model of external validation. Instead, they represented several levels of independence, ranging from independent image datasets and patient-level external cohorts to multicenter, multi-device, cross-dataset, and federated-learning validation frameworks. This distinction was important because a model tested on new images from the same workflow does not face the same translational challenge as a model evaluated across different centers, scanners, annotation protocols, or distributed data silos.
Independent external validation was most explicit in studies that used separate patient cohorts, image repositories, or external test sets not involved in model development [
6,
10,
11,
24,
25,
30]. Other studies evaluated transportability through multicenter or multi-device settings, including CBCT scans from different systems, periapical radiographs from different acquisition technologies, and panoramic radiographs from multiple international centers [
7,
8,
9,
13,
14,
26,
27,
28,
29]. Cross-dataset testing was especially informative when annotation heterogeneity itself became part of the validation problem, as shown in the AKUDENTAL study [
27]. Federated-learning studies contributed a different form of external validity by evaluating whether models trained across distributed sources could improve generalizability without direct data sharing [
13,
14].
These validation frameworks were not interpreted as equivalent forms of translational evidence. Studies incorporating multicenter, multi-device, cross-dataset, or federated/distributed validation were considered to provide stronger evidence of transportability because they evaluated model performance under broader forms of domain shift. In contrast, studies based on a single independent patient cohort, external image repository, or external test set were interpreted as demonstrating external testing but providing more limited evidence of generalizability when diversity of centers, devices, acquisition protocols, populations, or annotation practices was restricted. Consequently, greater interpretive weight was assigned to studies that evaluated model performance across multiple sources of heterogeneity rather than within a single external dataset.
Across studies, the external metrics were task-specific rather than directly interchangeable. Classification and detection studies generally reported accuracy, sensitivity, specificity, area under the curve (AUC), concordance, negative predictive value (NPV), precision, recall, F1-score, or mean average precision (mAP), whereas segmentation and measurement studies used DSC, IoU, mean IoU (mIoU), 95th percentile Hausdorff distance (HD95), mean absolute error (MAE), root mean square error (RMSE), or agreement statistics. Several studies reported some degree of internal-to-external performance change, but the form of this comparison varied substantially. For this reason, the external-validation evidence was synthesized functionally rather than treated as a single diagnostic-accuracy outcome. The external-validation characteristics of the included studies are summarized in
Table 2.
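For readers less familiar with the overlap metrics named above, the following minimal sketch computes the Dice similarity coefficient and intersection over union for two binary masks. The masks are toy values, the function name `dice_and_iou` is hypothetical, and the handling of two empty masks is an assumed convention; none of this reflects the evaluation code of any included study.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + truth_size)
    iou = intersection / union
    return dice, iou


# Toy masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

For the same pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so they rank predictions identically on a single task, but neither value carries the same meaning across different anatomical targets or imaging modalities.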
3.5. Multicenter, Multi-Device, and Cross-Domain Generalizability
Generalizability was evaluated through several validation strategies rather than through a uniform design across studies. Some studies directly tested models across centers, scanners, acquisition systems, or geographically distinct datasets, whereas others assessed generalizability by applying a model to an independent external image set or to a clinical cohort that differed from the development data. This distinction is relevant because transportability in dental AI may be affected not only by patient differences, but also by device type, image acquisition protocol, annotation conventions, dentition status, anatomical complexity, and data-quality variation.
The strongest evidence of multicenter or multi-device generalizability came from studies using CBCT scans from different systems or centers, periapical radiographs acquired with different technologies, panoramic radiographs from multiple institutions, or distributed datasets evaluated under federated-learning conditions [
7,
8,
9,
13,
14,
26,
27,
28,
29]. Other studies contributed more limited but still relevant evidence by testing models on independent image repositories, external patient cohorts, or clinical datasets not used during model development [
6,
10,
11,
24,
25,
30]. Overall, the evidence showed that external performance was often maintained only partially and depended on the degree of similarity between development and validation data. Studies that explicitly examined device heterogeneity, annotation inconsistency, noise, labeling inaccuracy, or dataset shift were particularly informative because they identified why a model may fail or lose performance outside its original training environment. The main generalizability features of the included studies are summarized in
Table 3.
3.6. Cross-Dataset Reproducibility and Annotation Heterogeneity
Reproducibility was uneven across the included studies. Only a minority provided open or partially accessible resources that would allow independent reanalysis, retraining, or benchmarking. The strongest example was the AKUDENTAL study, which released its dataset together with its processing, training, and evaluation code through GitHub and explicitly examined how differences in annotation protocols affected cross-dataset performance [
27]. Other studies provided more limited reproducibility resources, such as code with restricted or request-based data access [
11], data and code through private or institutional repositories upon request [
26], a public web application with data available upon reasonable request [
10], or a repository for the analyzed dataset [
30]. By contrast, several externally validated studies reported no public release of code, trained models, or patient-level datasets, usually because of privacy restrictions, commercial-system constraints, or the use of institutional imaging data (
Table 4).
Annotation heterogeneity was not assessed consistently. In most studies, reference standards were based on expert annotation, consensus diagnosis, refined segmentation, or clinical records, but the reproducibility of those labels across annotators or datasets was rarely the main analytic focus. AKUDENTAL was the clearest exception because it directly showed that cross-dataset performance differences could reflect inconsistent annotation definitions rather than model failure alone [
27]. Rubak et al. [
14] also addressed this issue experimentally by introducing label manipulation and image-noise scenarios in federated, centralized, and local learning. Chen et al. [
29] reported strong agreement among expert annotators for gingival inflammation grading, whereas other studies used consensus or expert adjudication without formal cross-dataset label harmonization. These findings make annotation reproducibility a central translational issue: a model cannot be considered broadly generalizable if the target labels themselves are unstable across datasets, centers, or clinical conventions.
3.7. Privacy-Preserving and Federated Learning
Only two included studies directly evaluated privacy-preserving or federated-learning strategies in dental imaging, and both focused on tooth segmentation in panoramic radiographs [
13,
14]. This limited number was expected because most externally validated dental AI studies still relied on centralized development or independent external testing rather than distributed model training. Nevertheless, these two studies were central to the translational focus of this review because they addressed a problem that conventional external validation cannot solve: how to develop more generalizable dental AI models when raw imaging data cannot be freely pooled across institutions.
The rationale for federated learning is particularly relevant in dentistry. Dental radiographs may contain individualizing anatomical and restorative features, and multi-institutional data sharing can be restricted by privacy, regulatory, and institutional constraints. In this setting, federated learning allows participating centers to train a shared model without exchanging raw patient images; instead, model parameters or learned updates are shared and aggregated centrally [
12]. This privacy-preserving logic was empirically tested by Schneider et al. [
13], who compared federated learning with local and centralized learning using 4177 panoramic radiographs from nine international centers. Federated learning outperformed local learning in most centers and showed better generalizability on pooled multicenter test data, although centralized learning generally remained superior when direct data pooling was allowed [
13]. This finding suggests that federated learning may narrow the gap between isolated local models and centralized models when privacy constraints make direct pooling unrealistic.
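The central aggregation step described above can be sketched as follows. This example assumes sample-size-weighted parameter averaging in the style of the widely used federated averaging (FedAvg) scheme; the aggregation rule actually used in the cited studies is not specified here, and the function name, parameter vectors, and client sizes are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors.

    Only parameters and dataset sizes reach the server; no images are exchanged.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one parameter vector and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    n_params = len(client_params[0])
    if any(len(params) != n_params for params in client_params):
        raise ValueError("all clients must send vectors of the same length")
    return [
        sum(size * params[i] for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical local parameters from three centers after one training round.
local_params = [[0.0, 1.0], [1.0, 3.0], [4.0, 5.0]]
local_sizes = [100, 300, 600]  # hypothetical numbers of radiographs per center
global_params = federated_average(local_params, local_sizes)
print(global_params)
```

Under this weighting, larger centers pull the shared model toward their local solution, which is one reason client-level imbalance can influence federated performance.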
Rubak et al. extended this evidence by examining whether federated learning remained robust when data quality and label quality were deliberately compromised [
14]. Using 2066 panoramic radiographs from six institutions, the authors compared federated, centralized, and local learning under baseline conditions, label manipulation, image-noise manipulation, and faulty-client exclusion. Federated learning matched or exceeded centralized learning and consistently outperformed local learning across several corruption scenarios, while per-client loss monitoring helped identify corrupted sites [
14]. This study was particularly important because it moved beyond the assumption that all participating centers contribute clean and consistently annotated data. In practical multicenter dental AI development, differences in image quality, labeling behavior, and institutional workflow are likely to be common rather than exceptional.
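One simple way to picture per-client loss monitoring is an outlier rule applied to the losses reported by each site. The sketch below assumes a robust rule based on the median and the median absolute deviation; this rule, the threshold, the function name, and the loss values are illustrative assumptions and are not the procedure reported by Rubak et al.

```python
from statistics import median


def flag_anomalous_clients(client_losses, threshold=3.0):
    """Flag clients whose loss lies far above the cross-client median.

    The distance is expressed in units of the median absolute deviation (MAD).
    """
    losses = list(client_losses.values())
    center = median(losses)
    mad = median([abs(loss - center) for loss in losses])
    if mad == 0:
        return []  # all clients report (almost) identical losses
    return sorted(
        client
        for client, loss in client_losses.items()
        if (loss - center) / mad > threshold
    )


# Hypothetical validation losses from six sites; site_6 mimics corrupted labels.
losses = {
    "site_1": 0.21, "site_2": 0.19, "site_3": 0.23,
    "site_4": 0.20, "site_5": 0.22, "site_6": 0.65,
}
print(flag_anomalous_clients(losses))
```

A flagged site would still require manual review, because a high local loss can reflect a genuinely harder case mix rather than faulty data.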
Taken together, these two studies indicate that federated learning is a plausible privacy-preserving route for scalable dental AI development, but not a complete solution to implementation barriers. Its usefulness depends on the quality and balance of local datasets, the reliability of annotations, the ability to monitor anomalous clients, and the technical capacity of participating institutions. The evidence also shows that privacy-preserving learning and generalizability are closely linked: federated learning may improve access to diverse data without direct image sharing, but model performance remains vulnerable to data heterogeneity, labeling inconsistency, and client-level imbalance. Thus, in the current dental AI literature, federated learning should be viewed as an emerging translational strategy rather than a mature deployment standard.
3.8. Explainability, Transparency, and Reporting Reproducibility
Explainability and reporting transparency were inconsistently addressed across the included studies. Most articles clearly described the external-validation design and reported the main model architecture, input modality, reference standard, and performance metrics. However, formal explainability methods were uncommon. Only a small number of studies used visual or interpretable outputs to support model interpretation, such as Grad-CAM or attention-related visualizations, whereas most segmentation, measurement, and federated-learning studies focused on external performance, robustness, or reproducibility rather than on explaining individual model decisions.
Transparency was also uneven. Several studies provided enough detail to understand the validation structure, including whether testing was performed on external centers, independent image datasets, cross-dataset benchmarks, or distributed institutional data [
6,
7,
8,
9,
10,
11,
13,
14,
24,
25,
26,
27,
28,
29,
30]. Nevertheless, open reproducibility was limited. Code, datasets, or trained models were publicly available in only a minority of studies, and in some cases access was restricted because of patient privacy, institutional data governance, proprietary software, or commercial AI systems. This pattern is important because a model may appear externally valid within the published study, but independent reproduction remains difficult when data splits, annotations, source code, trained weights, or external datasets are not accessible.
Table 5 summarizes explainability and reporting transparency using CLAIM-derived elements relevant to imaging-based AI studies, including clarity of external validation, reporting of uncertainty, leakage-prevention information, and availability of code, data, or models.
3.9. Risk of Bias and Methodological Quality
Risk of bias and methodological quality were assessed according to the methodological design and primary AI task of each study. QUADAS-2 was applied to diagnostic, detection, classification, or grading studies, because these studies evaluated an index test against a reference standard and therefore fitted the diagnostic-accuracy structure of patient or image selection, index-test conduct, reference standard, and flow and timing [
17]. PROBAST was retained in the protocol for clinical or epidemiological prediction models [
18], but it was not used as a primary appraisal tool in the final synthesis because the final included corpus was restricted to imaging-based studies rather than questionnaire-based or clinical-risk prediction models. For segmentation, measurement, cross-dataset, and federated-learning studies, an adapted AI-specific methodological appraisal was used because conventional diagnostic-accuracy tools do not fully capture issues such as annotation quality, data leakage prevention, external transportability, code and model availability, and distributed-learning robustness.
Across the included studies, the main methodological strengths were the use of external or independent test data, multicenter or multi-device evaluation, and clinically meaningful reference standards. These features reduced the likelihood that model performance reflected only internal optimization. However, important concerns remained. External datasets were often smaller than development datasets, some external validations relied on independent image repositories with limited patient-level metadata, and open reproducibility resources were inconsistently available. For segmentation and measurement studies, the most relevant concerns were not always those captured by standard diagnostic risk-of-bias domains; rather, they involved annotation consistency, availability of manual or refined reference segmentations, reporting of data partitioning, and whether the model could be independently reproduced.
The risk-of-bias and methodological-quality findings should therefore be interpreted as a domain-specific appraisal rather than as a single uniform score across all included studies. In diagnostic or classification studies, the key concerns related to patient or image selection, independence of the external set, and appropriateness of the reference standard. In segmentation, measurement, cross-dataset, and federated-learning studies, the main concerns related to dataset independence, annotation heterogeneity, incomplete open access to code or trained models, and limited reporting of procedures used to prevent data leakage.
Table 6 summarizes these judgments by study and appraisal domain.
3.10. Structured Synthesis of Translational Readiness
The structured synthesis integrated the 15 included studies into six translational-readiness domains: external validation, generalizability, reproducibility and annotation heterogeneity, federated learning and privacy-preserving training, transparency and explainability, and clinical implementation readiness. This approach was chosen because the included studies were methodologically heterogeneous and could not be meaningfully reduced to a single quantitative endpoint. Instead of treating external performance as a uniform construct, the synthesis focused on the specific ways in which each study contributed to the broader question of whether dental AI models are moving toward responsible clinical use.
External validation was the foundational domain and was represented by all included studies [
6,
7,
8,
9,
10,
11,
13,
14,
24,
25,
26,
27,
28,
29,
30]. This was a defining feature of the final corpus, because studies limited to internal train–validation–test performance had already been excluded. However, the form of external validation varied considerably. Some studies used independent patient cohorts or image repositories [
6,
10,
11,
24,
25,
30], whereas others relied on multicenter, multi-device, or cross-dataset testing [
7,
8,
9,
13,
14,
26,
27,
28,
29]. As a result, the simple presence of an external test set did not imply the same level of translational maturity across studies.
Generalizability was the second most consistently represented domain. It was most explicitly addressed in studies that tested models across different centers, CBCT systems, radiographic acquisition technologies, clinical cohorts, or benchmark datasets [
7,
8,
9,
13,
14,
25,
26,
27,
28,
29,
30]. These studies showed that performance could often be retained to a useful degree outside the development environment, but they also showed that transportability was conditional rather than automatic. Device differences, image-quality variation, labeling conventions, and external cohort characteristics frequently influenced performance, sometimes producing measurable attenuation relative to internal results [
8,
25,
26,
27].
Reproducibility and annotation heterogeneity were much less mature than external validation. The clearest evidence came from the AKUDENTAL study, which explicitly compared performance across multiple panoramic-radiograph datasets and demonstrated that annotation differences could materially affect apparent model performance [
27]. Rubak et al. provided complementary evidence by showing that labeling inaccuracy, image noise, and faulty clients affected segmentation performance under distributed learning conditions [
14]. Other studies contributed more limited reproducibility signals through code sharing, request-based data access, public repositories, or well-defined external datasets [
10,
11,
26,
29,
30]. Overall, this domain remained a major weakness of the literature: many models were externally tested, but far fewer were independently reproducible.
Federated learning and privacy-preserving development were represented only by Schneider et al. and Rubak et al. [
13,
14]. Despite the small number of studies, this domain was important because it addressed a translational challenge that conventional external validation does not solve: how to build more generalizable models when raw imaging data cannot be freely pooled across institutions. Both studies showed that federated learning could outperform isolated local training and approach the performance of centralized learning, although sensitivity to center-level heterogeneity and data-quality problems remained evident [
13,
14].
Transparency and explainability were addressed inconsistently. Fang et al. used gradient-weighted class activation mapping to visualize model attention during radiographic quality assessment [
8], while some other studies improved transparency through open or partially open resources, such as publicly available code, web-accessible models, or dataset repositories [
10,
11,
24,
26,
27,
30]. However, formal explainability methods were uncommon, confidence intervals were inconsistently reported, and many studies did not provide the code, trained weights, or full datasets needed for independent verification. In practice, transparency was often stronger in reporting the validation structure than in enabling genuine computational reproducibility.
Clinical implementation readiness was the least uniformly represented domain and was concentrated in a smaller subset of studies. The strongest examples were those that moved beyond technical validation alone, such as the external clinical validation of commercial decision-support systems for caries and periodontal bone loss [
6], the multicenter PRG-Net study that evaluated dentist performance with and without AI assistance [
7], the quality-control study that explicitly linked model outputs to radiographic workflow [
8], the CBCT segmentation study that examined refinement burden under real-world conditions [
9], and the intraoral-scan caries study that compared AI outputs with practitioner assessment [
25]. Most other studies remained primarily technical, even when externally validated.
Taken together, the structured narrative synthesis showed that the current literature has moved beyond purely internal algorithmic performance, but translational readiness remains uneven across domains. External validation and generalizability are increasingly represented, whereas reproducibility, privacy-preserving scalability, explainability, and implementation readiness remain less consistently developed. In this sense, the strongest pattern across the included studies was not the presence of a single best-performing methodological strategy, but the emergence of a gradual transition from proof-of-concept modeling toward clinically transferable dental AI.
Figure 3 summarizes how the included studies contributed to each translational-readiness domain and illustrates the uneven development of external validation, generalizability, reproducibility, privacy-preserving learning, transparency, and clinical implementation readiness.
3.11. Exploratory Meta-Analysis/Feasibility of Quantitative Synthesis
Exploratory meta-analysis was considered for three candidate outcome families: external-validation AUC in detection or classification studies, external-validation segmentation performance using Dice similarity coefficient, and internal-to-external performance change. After reviewing the extracted data, quantitative pooling was not performed because the studies did not provide a sufficiently comparable set of tasks, units of analysis, reference standards, metrics, and uncertainty estimates.
For classification and detection studies, several articles reported external AUC or related discrimination metrics, but the clinical targets differed substantially. These included palatal radicular groove diagnosis and subtype classification in CBCT scans [
7], technical-error classification and quality grading of periapical radiographs [
8], root-number detection in maxillary premolars using panoramic radiographs with CBCT confirmation [
11], and caries detection or classification in dental photographs [
10,
24]. Although these outcomes were all externally validated, they represented different diseases or technical tasks, different imaging modalities, and different reference standards. A pooled AUC would therefore have combined conceptually distinct diagnostic problems and would not have produced a clinically interpretable summary estimate.
A similar limitation was observed for segmentation outcomes. Dice similarity coefficient was reported or extractable in several segmentation studies, including posterior-tooth segmentation in CBCT scans [
9], tooth segmentation in panoramic radiographs under federated or distributed learning settings [
13,
14], mixed-dentition CBCT segmentation of pulp and dental hard tissues [
26], and multiview segmentation of teeth and gingiva in three-dimensional dental scans [
28]. However, these studies segmented different anatomical structures, used different imaging modalities, evaluated different model families, and applied different validation frameworks. Even when the same metric was reported, the meaning of a Dice coefficient was not directly interchangeable across tooth segmentation, pulp segmentation, gingival segmentation, and federated panoramic-radiograph segmentation.
Internal-to-external performance change was also assessed for feasibility because it directly reflected the translational question of whether model performance was preserved outside the development environment. Several studies reported internal and external results or described external performance attenuation [
8,
10,
11,
24,
25,
26,
27,
30]. However, the available comparisons were expressed using different metrics, including AUC, accuracy, sensitivity, specificity, Dice similarity coefficient, Hausdorff distance, intersection over union, mean absolute error, root mean square error, and mean average precision. In addition, confidence intervals or standard errors were not consistently reported for both internal and external estimates. As a result, a pooled internal-to-external performance drop would have required assumptions that were not supported by the available data.
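The difficulty of pooling internal-to-external change across metrics can be made concrete with a small sketch. It assumes a hypothetical helper, `external_change`, and invented metric values; the direction table encodes only the general fact that overlap and discrimination metrics improve as they increase, whereas distance and error metrics improve as they decrease.

```python
# Direction of each metric: True if larger values indicate better performance.
HIGHER_IS_BETTER = {
    "AUC": True, "accuracy": True, "DSC": True, "IoU": True,
    "HD95": False, "MAE": False, "RMSE": False,
}


def external_change(metric, internal, external):
    """Express internal-to-external change as a direction-aware performance loss."""
    if metric not in HIGHER_IS_BETTER:
        raise ValueError(f"unknown metric: {metric}")
    raw_change = external - internal
    # A positive 'loss' always means worse external performance.
    loss = -raw_change if HIGHER_IS_BETTER[metric] else raw_change
    relative_loss = loss / abs(internal) if internal != 0 else float("nan")
    return {
        "metric": metric,
        "raw_change": raw_change,
        "loss": loss,
        "relative_loss": relative_loss,
    }


# Hypothetical values, not results from the included studies.
for row in (external_change("AUC", 0.95, 0.90), external_change("MAE", 0.30, 0.45)):
    print(row)
```

Even after the direction is harmonized, a loss of 0.05 AUC units and a loss of 0.15 measurement units are not on a common scale, and without standard errors for both estimates they cannot be weighted in a pooled analysis.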
For these reasons, exploratory quantitative synthesis was judged to be inappropriate for the current dataset. The decision not to pool was not based on an insufficient number of included studies, but on the absence of methodologically compatible subgroups that shared the same task, imaging modality, unit of analysis, reference standard, and statistical metric. This heterogeneity also limited direct interpretation across studies, because the same numerical metric could have different clinical meanings depending on whether it was calculated for a diagnostic classification, anatomical segmentation, measurement task, image-level output, tooth-level output, or patient-level decision. The findings were therefore retained in the structured narrative synthesis, where the direction and nature of external performance, generalizability, reproducibility, and implementation readiness could be interpreted without forcing clinically heterogeneous outcomes into a single summary estimate. The feasibility assessment for each candidate quantitative synthesis is summarized in
Table 7.
3.12. Certainty of Evidence
The certainty of evidence was summarized by functional domain using a GRADE-informed approach [
20]. A single global certainty rating was not assigned because the included studies differed substantially in task, imaging modality, unit of analysis, validation structure, reference standard, and performance metric. Instead, certainty judgments were anchored to the specific translational question addressed by each domain: whether the available evidence supports external validation, generalizability, reproducibility, privacy-preserving learning, transparency, or clinical implementation readiness of imaging-based dental AI models.
Overall certainty ranged from very low to moderate across domains. The highest confidence was assigned to the presence of external validation, because all included studies evaluated performance beyond purely internal model development. However, confidence was reduced when external samples were small, based on independent image repositories with limited patient-level metadata, or not accompanied by complete uncertainty estimates. Certainty was lower for reproducibility, transparency, and clinical implementation readiness because open code, trained models, datasets, standardized reporting, and real-world workflow evaluation were inconsistently reported. The domain-level certainty assessment is summarized in
Table 8.
4. Discussion
This systematic review showed that imaging-based dental AI has begun to move beyond internal model performance, but the evidence remains uneven across the translational pathway. All included studies incorporated some form of external validation, independent testing, multicenter assessment, cross-dataset evaluation, or federated-learning design. This represents a meaningful shift from the earlier dental AI literature, where model development frequently relied on internal train–test partitions, single-center datasets, or cross-validation without robust assessment in independent data [
1,
2]. However, the present findings also show that external validation alone does not establish implementation readiness. The strength of translational evidence depended on the level of dataset independence, the diversity of imaging devices and clinical settings, the stability of reference annotations, the availability of reproducibility resources, and the extent to which model outputs were evaluated in clinically meaningful workflows.
Previous reviews have mapped the broad expansion of machine learning in dentistry and have consistently highlighted heterogeneity in tasks, data sources, model architectures, validation strategies, and reporting quality [1, 2]. Disease-specific evidence in caries detection has also suggested promising diagnostic performance but substantial variability across datasets, imaging modalities, annotation procedures, and reporting standards, with limited external validation in earlier studies [3]. More recently, systematic evidence addressing explainability, bias, and generalizability has emphasized that trustworthiness attributes are increasingly recognized in dental AI, although the evidence remains heterogeneous across domains and outcomes [4]. The present review differs from those earlier syntheses by restricting the evidence base to imaging-based dental AI studies that empirically tested at least one dimension of clinical transferability beyond internal development. Therefore, the central question was not whether dental AI can achieve high performance in controlled datasets, but whether these models have been evaluated under conditions that approximate clinical translation.
A key finding was that external validation was present across all included studies, but its meaning varied substantially. Some studies used independent patient cohorts or image datasets, whereas others evaluated models across centers, devices, acquisition protocols, public benchmark datasets, or federated-learning environments. These designs are not interchangeable. A model tested on an independent image repository may demonstrate a useful degree of external performance, but it does not face the same translational challenge as a model evaluated across multiple centers, scanner systems, annotation protocols, or distributed institutional datasets. This distinction is important because the clinical reliability of dental AI depends on its ability to remain stable when image acquisition, patient characteristics, disease spectrum, or annotation practices differ from those used during model development.
Accordingly, the synthesis assigned greater interpretive weight to multicenter, multi-device, cross-dataset, and federated/distributed validation designs than to single-cohort or independent image-repository validation.
Several studies provided stronger evidence of transportability by testing models across multicenter or multi-device conditions. For example, multicenter cone–beam computed tomography validation, periapical radiograph quality assessment across acquisition settings, multi-system cone–beam computed tomography segmentation, and distributed panoramic-radiograph segmentation directly addressed forms of domain shift that are likely to occur in real dental practice [7, 8, 9, 13, 14]. Cross-dataset studies further showed that performance can vary when datasets differ not only in image source but also in annotation structure and labeling conventions [27]. These findings support a central conclusion of this review: generalizability is not a binary property of a model but a context-dependent characteristic shaped by the validation environment.
At the same time, the evidence does not yet support the assumption that externally validated dental AI models are broadly transferable across all clinical settings. External samples were often smaller than development datasets, and some studies relied on image repositories or open-source datasets with limited patient-level or center-level information. In such cases, external testing remains useful, but it may not fully capture the variability of routine dental practice. This is particularly relevant for tasks such as caries detection, segmentation of anatomical structures, root morphology assessment, and gingival inflammation grading, where performance may depend on imaging quality, anatomical complexity, lesion spectrum, and operator-dependent acquisition factors.
The most persistent weakness across the included literature was reproducibility. Many studies reported external validation, but far fewer provided the resources needed for independent verification, such as open code, trained model weights, full datasets, data partitions, preprocessing scripts, or detailed annotation protocols. This distinction matters because a model can be externally validated within a published study and still remain difficult to reproduce, benchmark, or adapt in another clinical environment. In this review, reproducibility was strongest when studies provided public datasets, open code, web-accessible models, or repository-based access, but these features were not consistently available across the corpus.
Annotation heterogeneity emerged as a particularly important issue. Dental AI models often depend on labels produced by experts, consensus groups, manual segmentations, refined automatic segmentations, or clinical documentation. These reference standards are clinically reasonable, but they are not always equivalent across datasets or institutions. The AKUDENTAL study was especially informative because it showed that cross-dataset performance can be affected by differences in annotation definitions and labeling granularity [27]. Similarly, Rubak et al. [14] demonstrated that label inaccuracy, image noise, and faulty clients can alter segmentation performance under distributed learning conditions. These findings suggest that apparent model failure in external datasets may sometimes reflect inconsistencies in the target labels rather than only weaknesses of the algorithm.
The reporting limitations observed in this review are consistent with broader concerns in AI model evaluation. Transparent reporting frameworks such as TRIPOD + AI emphasize that machine-learning models require clear descriptions of data sources, predictors or inputs, outcomes, missing data, validation procedures, performance measures, and reproducibility elements to allow readers to judge model reliability and applicability [31]. Although TRIPOD + AI is focused on prediction models, its underlying principle is directly relevant to imaging-based dental AI: without transparent reporting of how data were selected, partitioned, annotated, processed, and externally tested, performance estimates remain difficult to interpret and reproduce.
Privacy-preserving learning was represented by only two included studies, both focused on tooth segmentation in panoramic radiographs [13, 14]. Despite this small number, this domain is highly relevant to the translational future of dental AI. Dental radiographs and three-dimensional dental records may contain individualizing anatomical and restorative patterns, and the direct pooling of multi-institutional imaging data may be restricted by ethical, legal, regulatory, or institutional constraints. Federated learning addresses this challenge by enabling collaborative model training without direct exchange of raw patient images [12].
The included federated-learning studies suggest that distributed training can improve over isolated local learning and may approach centralized learning under some conditions [13, 14]. However, they also show that privacy-preserving learning is not automatically equivalent to robust clinical generalization. Performance remained sensitive to client-level heterogeneity, label noise, image noise, and data-quality problems. Therefore, federated learning should be interpreted as an emerging translational strategy rather than a mature implementation standard. Its success will require more than technical aggregation of model updates; it will also require governance structures, annotation harmonization, quality-control procedures, monitoring of anomalous clients, and institutional capacity to support secure distributed model development.
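As a minimal illustration of what the technical aggregation of model updates can involve, the sketch below implements sample-size-weighted parameter averaging, commonly known as federated averaging (FedAvg). This is an assumption made for illustration only: the aggregation rules used by the included studies are not described in this section, and the function name `federated_average`, the parameter name `w`, and the client sizes are hypothetical.

```python
from typing import Dict, List


def federated_average(
    client_weights: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its local sample count.

    Only parameter values and sample counts are exchanged; raw images never
    leave the client. All names here are hypothetical.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("Provide one sample count per client.")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("Total sample count must be positive.")

    averaged: Dict[str, List[float]] = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Weighted mean of each parameter entry across clients.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Hypothetical example: two clinics holding 100 and 300 radiographs.
clinic_updates = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(federated_average(clinic_updates, [100, 300]))  # {'w': [2.5, 3.5]}
```

Because each client contributes in proportion to its sample count, a large client with noisy labels or degraded images can dominate the aggregate, which is one reason quality control and monitoring of anomalous clients matter alongside the aggregation step itself.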
Beyond technical validation, successful clinical implementation depends on several translational factors that were only partially addressed in the included literature. Regulatory approval pathways, post-market surveillance requirements, reimbursement mechanisms, integration with existing clinical information systems, workflow compatibility, clinician acceptance, professional liability, and patient trust all influence whether an AI model ultimately reaches routine practice. Even highly accurate and externally validated systems may face substantial barriers if they increase workflow burden, lack interoperability with imaging or electronic health record platforms, generate outputs that clinicians do not trust, or create uncertainty regarding responsibility for AI-assisted decisions. These considerations are particularly relevant in dentistry, where implementation frequently occurs in small private practices with variable digital infrastructure and limited technical support. Therefore, clinical implementation readiness should be viewed as a multidimensional construct that extends beyond external validation and includes regulatory, organizational, economic, and human-factor considerations.
The clinical implication of these findings is that externally validated dental AI should currently be viewed mainly as decision support rather than as autonomous clinical decision-making. The strongest near-term applications are likely to be tasks in which AI can assist clinicians by prioritizing images, identifying suspicious findings, supporting segmentation or measurement, improving quality control, or reducing repetitive workload. However, clinical adoption should remain cautious until models demonstrate reproducibility, workflow compatibility, safety, and usefulness in real clinical environments.
Clinical implementation readiness was less developed than external validation or generalizability. Very few studies reported information relevant to regulatory status, reimbursement considerations, workflow integration, user acceptance, liability frameworks, or patient-centered implementation outcomes. A small subset of studies moved beyond technical validation by evaluating commercial AI decision-support systems, clinician–AI comparisons, reader assistance, workflow-oriented quality control, segmentation refinement burden, or practitioner agreement [6, 7, 8, 9, 25]. These studies are important because they begin to address how AI outputs may interact with clinical judgment, diagnostic workflow, or task efficiency. However, most included studies remained primarily technical, even when externally validated.
The gap between external validation and clinical implementation is substantial. A model may perform well on an external dataset but still fail to improve clinical workflow, decision quality, patient outcomes, efficiency, or safety. Taken together, these findings suggest that the current role of dental AI is more consistent with decision support than autonomous clinical decision-making. Many of the included studies demonstrated the potential of AI to assist image interpretation, quality control, segmentation, or diagnostic assessment, but very few evaluated whether AI could safely replace clinician judgment in real-world settings. Therefore, the most realistic near-term implementation pathway is likely to involve human–AI collaboration, in which AI systems help clinicians manage large volumes of imaging data, identify relevant findings, and improve efficiency, while final clinical decisions remain the responsibility of trained professionals.
An additional challenge is that numerical model performance does not automatically translate into clinical utility. Metrics such as AUC, sensitivity, specificity, negative predictive value, Dice similarity coefficient, or Hausdorff distance should be interpreted in relation to the intended clinical task, the consequences of diagnostic or segmentation errors, and the context in which the model will be used. For example, a negative predictive value that may be useful for ruling out disease in a screening context does not necessarily imply suitability for treatment planning, while a high segmentation Dice similarity coefficient may still be insufficient if clinically relevant anatomical boundaries are inaccurately delineated. Because the included studies addressed highly heterogeneous clinical applications, universally applicable thresholds for clinical acceptability could not be defined. Instead, the findings of this review should be interpreted as evidence of technical and translational performance rather than definitive evidence of clinical utility. The DECIDE-AI reporting guideline is relevant here because it focuses on early-stage clinical evaluation of AI-based decision-support systems, including clinical setting, user interaction, human factors, implementation context, and safety considerations [32]. Very few dental AI studies in this review approached that level of implementation assessment. Similarly, when future studies aim to test AI systems as clinical interventions, reporting should align with CONSORT-AI to ensure adequate description of the AI intervention, human–AI interaction, trial context, error handling, and outcome assessment [33].
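The context dependence of negative predictive value can be made concrete with a short calculation. The sketch below derives negative predictive value from sensitivity, specificity, and disease prevalence using the standard definition. The sensitivity, specificity, and prevalence values are hypothetical and are not drawn from any included study.

```python
def negative_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Return NPV = TN / (TN + FN), expressed per unit of population."""
    for value in (sensitivity, specificity, prevalence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("All inputs must lie between 0 and 1.")
    true_negatives = specificity * (1.0 - prevalence)
    false_negatives = (1.0 - sensitivity) * prevalence
    denominator = true_negatives + false_negatives
    if denominator == 0.0:
        raise ValueError("NPV is undefined when no negative results occur.")
    return true_negatives / denominator


# Same hypothetical model (sensitivity 0.90, specificity 0.90) in two settings.
screening = negative_predictive_value(0.90, 0.90, prevalence=0.05)
referral = negative_predictive_value(0.90, 0.90, prevalence=0.50)
print(round(screening, 3))  # approximately 0.994
print(round(referral, 3))   # approximately 0.9
```

With identical sensitivity and specificity, the proportion of negative results that are truly negative falls from about 99% to 90% as prevalence rises from 5% to 50%, so a value reported in a low-prevalence screening sample cannot be carried over unchanged to a referral or treatment-planning population.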
Therefore, the evidence synthesized here supports a cautious interpretation: within the selected subset of externally tested studies, imaging-based dental AI shows encouraging movement toward more rigorous validation, but this should not be interpreted as broad clinical maturity. The gap between external validation and implementation-ready AI remains substantial because reproducibility, workflow evaluation, regulatory considerations, clinician acceptance, and post-deployment monitoring were inconsistently addressed.
Exploratory quantitative synthesis was considered but not performed because the available studies did not provide methodologically compatible subgroups. This decision should not be interpreted as a weakness of the review; rather, it reflects the structure of the evidence. Pooling external area under the curve values across palatal radicular groove diagnosis, radiographic quality control, root-number detection, and caries classification would have produced a summary estimate without a coherent clinical meaning. Similarly, Dice similarity coefficients from cone–beam computed tomography tooth segmentation, panoramic-radiograph segmentation, mixed-dentition pulp and hard-tissue segmentation, and three-dimensional scan segmentation are not directly interchangeable, even when they share the same metric name.
The same issue applied to internal-to-external performance change. Several studies reported internal and external results, but they used different metrics, units of analysis, imaging modalities, and validation structures. Confidence intervals or standard errors were also not consistently available for paired comparisons. A pooled estimate would therefore have required assumptions that were not supported by the reported data. The structured narrative synthesis was better suited to this evidence base because it allowed the review to preserve clinically meaningful distinctions among validation type, modality, task, reference standard, reproducibility, and implementation relevance.
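To indicate the kind of uncertainty estimate that would make internal-to-external comparisons interpretable within a single study, the sketch below computes a percentile bootstrap interval for the difference between internal and external accuracy. It is an illustrative assumption, not a method used in this review or in the included studies: it assumes case-level correctness indicators are available, that each case comes from a different patient, and that the two test sets are independent. The function name and the example counts are hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_gap(
    internal_correct: List[int],
    external_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for internal minus external accuracy.

    Inputs are lists of 1 (correct) or 0 (incorrect), one entry per case.
    """
    if not internal_correct or not external_correct:
        raise ValueError("Both test sets must contain at least one case.")

    def accuracy(outcomes: List[int]) -> float:
        return sum(outcomes) / len(outcomes)

    rng = random.Random(seed)
    point = accuracy(internal_correct) - accuracy(external_correct)

    gaps = []
    for _ in range(n_resamples):
        # Resample each test set independently, with replacement.
        internal_sample = rng.choices(internal_correct, k=len(internal_correct))
        external_sample = rng.choices(external_correct, k=len(external_correct))
        gaps.append(accuracy(internal_sample) - accuracy(external_sample))

    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Hypothetical data: 90/100 correct internally, 80/100 correct externally.
internal = [1] * 90 + [0] * 10
external = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_accuracy_gap(internal, external)
print(f"gap = {point:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

When several images, teeth, or slices originate from the same patient, the resampling unit would need to be the patient rather than the individual case, otherwise the interval would be too narrow.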
The findings of this review suggest three priority directions for future dental AI research. First, external validation should be strengthened prospectively and should involve patient-level, center-level, device-level, or cross-dataset independence whenever possible, because random image-level splitting or poorly described external datasets may overestimate transportability. Second, reproducibility and interpretability should be improved through uncertainty estimates, explicit leakage-prevention procedures, transparent annotation protocols, public code or model weights when feasible, and clear reporting of the unit of analysis and metric calculation. This is particularly important when multiple images, teeth, surfaces, or slices originate from the same patient, or when performance is calculated at pixel, tooth, image, or patient level. Third, studies claiming clinical relevance should move toward implementation-oriented designs, including reader studies, workflow analyses, prospective silent-mode evaluation, multicenter clinical validation, clinician–AI interaction studies, cost and resource assessments, and post-deployment monitoring. In this sense, the next stage of dental AI should be judged not only by whether models perform well, but by whether they remain reliable, reproducible, privacy-compatible, and useful across real-world clinical environments.
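The leakage-prevention point in the second priority can be illustrated with a patient-level partition, in which all images from a given patient are assigned to the same side of the split. The sketch below is a minimal example under stated assumptions: the function name, the identifiers, and the 20% test fraction are hypothetical, and it does not reproduce the partitioning procedure of any included study.

```python
import random
from typing import List, Sequence, Tuple


def patient_level_split(
    image_ids: Sequence[str],
    patient_ids: Sequence[str],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split images into train and test sets so that no patient appears in both."""
    if len(image_ids) != len(patient_ids):
        raise ValueError("Each image needs exactly one patient identifier.")
    patients = sorted(set(patient_ids))
    if len(patients) < 2:
        raise ValueError("At least two patients are required for a split.")

    rng = random.Random(seed)
    rng.shuffle(patients)
    # Hold out whole patients, never individual images.
    n_test = min(len(patients) - 1, max(1, round(len(patients) * test_fraction)))
    test_patients = set(patients[:n_test])

    train = [img for img, pid in zip(image_ids, patient_ids) if pid not in test_patients]
    test = [img for img, pid in zip(image_ids, patient_ids) if pid in test_patients]
    return train, test


# Hypothetical example: seven radiographs from five patients.
images = ["a1", "a2", "b1", "c1", "c2", "d1", "e1"]
patients = ["A", "A", "B", "C", "C", "D", "E"]
train_images, test_images = patient_level_split(images, patients)
print(train_images, test_images)
```

A random image-level split of the same data could place `a1` in training and `a2` in testing, allowing patient-specific anatomy or restorations to inflate test performance. The same grouping logic extends to center-level or device-level independence by substituting the corresponding identifier.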
This review has several strengths. It focused on a clearly defined translational question and restricted the included corpus to imaging-based dental AI studies that went beyond internal model development. This allowed the synthesis to examine external validation, generalizability, reproducibility, privacy-preserving learning, transparency, and implementation readiness as distinct but related domains. The structured narrative synthesis provided a planned way to interpret methodologically heterogeneous studies without forcing incompatible outcomes into a single pooled estimate. The review also incorporated risk-of-bias, reporting-quality, and GRADE-informed certainty assessments aligned with the specific nature of dental AI validation studies.
Several limitations should also be considered. The included evidence was recent and relatively small, and the studies differed substantially in clinical task, imaging modality, unit of analysis, model architecture, reference standard, and performance metric. Because eligibility required at least one translational-validation dimension beyond internal model development, the included corpus represents a selected and relatively advanced subset of imaging-based dental AI research rather than the full dental AI literature. This design was appropriate for the review question, but it may overestimate the maturity of the broader field if interpreted as representing all dental AI studies, many of which remain limited to internal validation. If internally validated studies without external testing were considered in the denominator, the overall translational readiness of dental AI would likely be lower than suggested by the included evidence. Some judgments on transparency, leakage prevention, and reproducibility depended on what authors reported, and independent verification was not always possible. The review excluded non-imaging dental AI studies from the final corpus to preserve methodological coherence, even though some non-imaging models may provide useful lessons about external validation and transportability. Finally, because quantitative pooling was not appropriate, the conclusions rely on structured narrative synthesis rather than pooled performance estimates.
Overall, the findings support a cautious interpretation of the current evidence and show that translational readiness in dental AI depends on more than external validation alone.