Background: Periapical pathosis on periapical radiographs must be diagnosed accurately for endodontic treatment to succeed, but diagnosis is often confounded by the limitations of 2D imaging and by subjective interpretation. Artificial intelligence (AI) offers a potential solution, but the diagnostic granularity of AI relative to human clinicians in everyday clinical practice has not been adequately explored. The purpose of this study was to evaluate the diagnostic accuracy of ChatGPT-5 in detecting periapical radiographic abnormalities against a three-expert consensus reference standard.
Methods: In this retrospective diagnostic accuracy study, 270 periapical radiographs were read independently by a large language model (ChatGPT-5) and by a consensus of three board-certified oral radiologists. The AI was given a standardized prompt to label radiographic features, including the presence of periapical radiolucency, lesion border, lesion shape, and integrity of the lamina dura. Diagnostic accuracy, agreement (Cohen's κ), and predictors of correct AI classification were assessed against the expert consensus reference standard.
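For illustration, a minimal sketch of what a standardized feature-labeling prompt and structured response might look like is given below; the prompt wording, response schema, and field names are assumptions for exposition, not the study's actual protocol.

```python
import json

# Hypothetical standardized prompt; the study's actual wording is not
# reported in the abstract.
PROMPT = (
    "Examine this periapical radiograph and answer in JSON with the keys: "
    "'radiolucency_present' (true/false), 'border' (well-defined-corticated / "
    "well-defined-noncorticated / ill-defined), 'shape' (round / oval / "
    "irregular), 'lamina_dura_intact' (true/false), 'arch' (maxilla / mandible)."
)

# Illustrative model response for one radiograph (constructed example).
raw_response = """{
  "radiolucency_present": true,
  "border": "well-defined-corticated",
  "shape": "round",
  "lamina_dura_intact": false,
  "arch": "mandible"
}"""

features = json.loads(raw_response)
abnormal = features["radiolucency_present"]  # binary abnormality label
print(f"Abnormal: {abnormal}, border: {features['border']}")
```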
Results: ChatGPT-5 demonstrated high sensitivity (87.5%) but low specificity (12.5%), resulting in an overall diagnostic accuracy of 50.0%. This performance profile reflects a tendency toward over-identification of pathology, with the model classifying 87.5% of radiographs as abnormal compared with 50.0% by expert consensus. Agreement was almost perfect for anatomical localization (arch, κ = 0.857) but poor for binary abnormality detection (κ = 0.000). For morphological descriptors, statistically significant disagreement was observed for lesion border characterization (κ = 0.127; p < 0.001), whereas lesion shape demonstrated only descriptive divergence without reaching statistical significance (κ = 0.359). Root resorption assessment also differed significantly between evaluators (p = 0.046). Regression analysis showed that well-defined corticated borders (OR = 60.25, p < 0.001) and first molar-associated lesions (OR = 32.55, p < 0.001) were significant predictors of correct AI classification.
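To see why κ can be 0.000 despite 50.0% accuracy, consider a confusion matrix consistent with the reported rates; the counts below are constructed for exposition (the abstract does not report the actual 2x2 table, and a balanced sample of 160 is used so all figures come out exact).

```python
# Illustrative 2x2 table matching the reported rates (hypothetical counts).
TP, FN = 70, 10   # expert-abnormal cases: AI flags 70/80 -> sensitivity 87.5%
FP, TN = 70, 10   # expert-normal cases:   AI flags 70/80 -> specificity 12.5%
n = TP + FN + FP + TN

sensitivity = TP / (TP + FN)            # 0.875
specificity = TN / (TN + FP)            # 0.125
accuracy = (TP + TN) / n                # 0.500 (observed agreement, p_o)

# Cohen's kappa: chance-corrected agreement between AI and reference.
ai_pos = (TP + FP) / n                  # AI calls 87.5% abnormal
ref_pos = (TP + FN) / n                 # experts call 50% abnormal
p_e = ai_pos * ref_pos + (1 - ai_pos) * (1 - ref_pos)  # expected agreement = 0.5
kappa = (accuracy - p_e) / (1 - p_e)    # (0.5 - 0.5) / 0.5 = 0.0

print(f"sens={sensitivity:.3f} spec={specificity:.3f} "
      f"acc={accuracy:.3f} kappa={kappa:.3f}")
```

Because the AI flags nearly every radiograph as abnormal against a 50% prevalence reference, its observed agreement exactly equals chance agreement, driving κ to zero even though accuracy is 50%.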
Conclusions: This study demonstrates that although ChatGPT-5 Vision can interpret periapical radiographs with high sensitivity, its limited specificity and inconsistent characterization of morphological features restrict its reliability for independent clinical diagnosis. The system shows a systematic tendency toward over-diagnosis and describes lesions as more structured and better defined than dental experts do. AI may be optimized as a sensitive first-line screening tool, but its findings must be validated by dental professionals to avoid false positives and ensure accurate lesion characterization.