Automated YOLO-Based Cephalometric Landmark Detection for ANB-Based Skeletal Classification: A Retrospective Single-Centre Study
Abstract
1. Introduction
1.1. Why Only Four Landmarks?
1.2. Aims and Hypotheses
2. Materials and Methods
2.1. Ethical Considerations
2.2. Dataset, Demographic Composition and Annotation
2.2.1. Source Dataset
2.2.2. Demographic Composition of the Test Set
2.2.3. Annotation Protocol
2.2.4. Image Acquisition
2.2.5. Data Augmentation and Partitioning
2.3. YOLO Model Configurations
2.3.1. Implementation and Hyperparameters
2.3.2. Bounding-Box Parameter
2.3.3. Models Excluded from the Final Clinical Analysis
2.4. Outcome Metrics
2.4.1. Localisation Accuracy
2.4.2. Inter-Expert Landmark Variability
2.4.3. Clinical Classification
2.5. Statistical Analysis
2.5.1. Inter-Expert Classification Agreement (Sensitivity Analysis)
2.5.2. Multiple Comparisons
3. Results
3.1. Inter-Expert Variability of S, N, A and B
3.2. Effect of Bounding-Box Size on Localisation Accuracy
3.3. Landmark Localisation Accuracy for S, N, A and B
3.4. Angular Measurement Accuracy
3.5. ANB-Based Skeletal Classification Concordance
3.6. Inter-Expert Classification Stability (Sensitivity Analysis)
3.7. Comparison with Human Inter-Expert Variability
4. Discussion
4.1. Diagnostic Concordance Despite Coordinate-Level Errors of 3 mm
4.2. Localisation Accuracy Within or Immediately Above Human Variability
4.3. The Dominant Role of Bounding-Box Size
4.4. Anomalous Performance of Model 5 and Model 1
4.5. Implications for Clinical Deployment
- For patients with AI-predicted ANB values clearly within their diagnostic category (e.g., ANB or ANB ), the present data suggest that AI-derived skeletal classifications are highly likely to be concordant with expert assessment.
- For borderline AI-predicted ANB values (approximately to ), clinician verification should be mandatory.
- A confidence-aware deployment protocol that flags borderline AI predictions for human review, while supporting clinician workflow for clear-cut cases, would preserve diagnostic safety; its operational benefits, however, remain to be demonstrated in a prospective trial.
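The flagging rule described in the three points above can be sketched in a few lines. This is a minimal illustration, not the study's protocol: the class cut-offs (0° and 4°) and the 1° review margin are assumed values chosen for the example, because the thresholds used in this study are not restated in this section; `classify_anb`, `triage` and `Triage` are hypothetical names.

```python
from dataclasses import dataclass

# ASSUMED cut-offs for illustration only; substitute the study's own values.
CLASS_III_BELOW_DEG = 0.0  # assumed: ANB below this -> Class III
CLASS_II_ABOVE_DEG = 4.0   # assumed: ANB above this -> Class II
REVIEW_MARGIN_DEG = 1.0    # assumed half-width of the borderline band


@dataclass(frozen=True)
class Triage:
    skeletal_class: str  # "I", "II" or "III"
    needs_review: bool   # True when the prediction lies near a cut-off


def classify_anb(anb_deg: float) -> str:
    """Map an ANB angle (degrees) to a skeletal class using the assumed cut-offs."""
    if anb_deg < CLASS_III_BELOW_DEG:
        return "III"
    if anb_deg > CLASS_II_ABOVE_DEG:
        return "II"
    return "I"


def triage(anb_deg: float, margin_deg: float = REVIEW_MARGIN_DEG) -> Triage:
    """Classify and flag predictions within margin_deg of either cut-off."""
    distance_to_cutoff = min(
        abs(anb_deg - CLASS_III_BELOW_DEG),
        abs(anb_deg - CLASS_II_ABOVE_DEG),
    )
    return Triage(classify_anb(anb_deg), distance_to_cutoff <= margin_deg)


if __name__ == "__main__":
    for anb in (-3.0, 2.0, 4.5, 7.0):
        print(anb, triage(anb))
```

In such a scheme the review margin is the main design parameter: widening it sends more cases to the clinician, narrowing it relies more heavily on the model near the class boundaries.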
4.6. Comparison with the Existing Literature
4.7. Limitations
- Small, single-centre independent test set. The 130 AI-expert classification pairs reported here derive from only 11 independent patients; the effective number of patient-level observations is therefore much smaller than the number of model–image pairs. The 95% confidence intervals around per-class concordance (Class II: 85.9–98.2%; Class I: 79.7–99.2%) are correspondingly wide. The present findings are best interpreted as preliminary diagnostic concordance, requiring confirmation in a substantially larger, multi-centre, patient-level cohort.
- No external validation. All radiographs originated from a single academic centre; demographic composition of the training set was not systematically controlled. Generalisability to other patient populations, imaging equipment, software platforms and clinical protocols requires multi-centre external validation.
- Confounded hyperparameter grid. The 14-configuration grid is not a balanced factorial design (bounding-box size co-varies with architecture, dataset size and epochs). The dominant effect of box size is qualitatively robust to variations in architecture and data, but a controlled single-architecture ablation that isolates box size at a fixed dataset size is needed to confirm the magnitude of the reported effect.
- No prespecified equivalence margin. The study was designed as a descriptive concordance study, not a formal non-inferiority trial. The exact binomial test against a 90% concordance threshold (Section 3.5) is a single sensitivity check, not a formal equivalence test. The language of equivalence or expert-level performance is therefore avoided throughout.
- No per-prediction confidence estimates. The current YOLO models produce point predictions without calibrated per-prediction uncertainty estimates. Future work incorporating Bayesian deep-learning or ensemble methods [20] would enable real-time flagging of borderline cases and is essential for the confidence-aware workflow proposed in Section 4.5.
- Model anomalies require deeper analysis. The failure of Models 1 and 5 on the test set, despite training/validation performance, requires training/validation-curve analyses and controlled ablation to fully characterise. We have classified both as failed generalisation rather than over-fitting but cannot rule out implementation-specific factors.
- No formal language editing. Final language polishing by a native-English editing service is being arranged for the camera-ready version.
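The two statistical quantities referred to in the limitations above, an exact binomial check against a 90% concordance threshold and a 95% confidence interval for a concordance proportion, can be illustrated with the standard library alone. The sketch below uses a one-sided exact test and a Clopper-Pearson interval; whether the study used these exact variants is an assumption, the function names are hypothetical, and the counts in the usage lines are invented placeholders rather than study data.

```python
from math import comb


def binom_pmf(i: int, n: int, p: float) -> float:
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k): one-sided exact p-value for H1 'true proportion > p'."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def _root_decreasing(f, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Bisection for a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) two-sided confidence interval for k successes in n."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_decreasing(lambda p: alpha / 2 - upper_tail(k, n, p))
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_decreasing(lambda p: lower_tail(k, n, p) - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    k, n = 60, 65  # HYPOTHETICAL counts, not taken from the study
    print("one-sided p-value vs 90%:", upper_tail(k, n, 0.90))
    print("95% Clopper-Pearson CI:", clopper_pearson(k, n))
```

Both calculations treat every classification pair as independent. Because the 130 pairs come from only 11 patients, intervals computed this way are likely to be narrower than a patient-level analysis would give.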
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial intelligence |
| ANB | Angle between A-point, Nasion, and B-point |
| CI | Confidence interval |
| CNN | Convolutional neural network |
| KDE | Kernel density estimate |
| MRE | Mean radial error |
| SD | Standard deviation |
| SDR | Successful detection rate |
| SNA | Angle between Sella, Nasion, and A-point |
| SNB | Angle between Sella, Nasion, and B-point |
| YOLO | You Only Look Once |
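
The angular abbreviations above translate directly into a small computation on the four landmarks S, N, A and B. The sketch below is an illustration under stated assumptions: landmarks are given as 2D coordinates in a common unit, SNA and SNB are taken as the unsigned angles at Nasion, and ANB is obtained with the common convention ANB = SNA − SNB (positive when the A-point lies ahead of the B-point). The function names are hypothetical and the coordinates in the usage lines are invented.

```python
import math

Point = tuple[float, float]


def angle_at(vertex: Point, p: Point, q: Point) -> float:
    """Unsigned angle p-vertex-q in degrees, in the range [0, 180]."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 on (|cross|, dot) is numerically safer than acos of a normalised dot product.
    return math.degrees(math.atan2(abs(cross), dot))


def sna_snb_anb(s: Point, n: Point, a: Point, b: Point) -> tuple[float, float, float]:
    """Return (SNA, SNB, ANB) in degrees, with ANB = SNA - SNB (assumed convention)."""
    sna = angle_at(n, s, a)
    snb = angle_at(n, s, b)
    return sna, snb, sna - snb


if __name__ == "__main__":
    # Invented coordinates for illustration only (image axes: x to the right, y downwards).
    s, n = (100.0, 200.0), (300.0, 180.0)
    a, b = (310.0, 330.0), (300.0, 420.0)
    print(sna_snb_anb(s, n, a, b))
```

Because only angles are computed, the result does not depend on whether the coordinates are in pixels or millimetres, provided the pixel spacing is the same on both axes.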
References
1. Baumrind, S.; Frantz, R.C. The reliability of head film measurements: 1. Landmark identification. Am. J. Orthod. 1971, 60, 111–127.
2. Proffit, W.R.; Fields, H.W.; Sarver, D.M. Contemporary Orthodontics, 6th ed.; Elsevier Health Sciences: Philadelphia, PA, USA, 2018.
3. Miyajima, K.; McNamara, J.A.; Kimura, T.; Murata, S.; Iizuka, T.; Gosa, T. Craniofacial structure of Japanese and European-American adults with normal occlusions and well-balanced faces. Am. J. Orthod. Dentofac. Orthop. 1996, 110, 431–438.
4. Subramanian, A.K.; Chen, Y.; Almalki, A.; Sivamurthy, G.; Kafle, D. Cephalometric analysis in orthodontics using artificial intelligence—A comprehensive review. BioMed Res. Int. 2022, 2022, 1880113.
5. Majstorovic, N.V.; Dimitrijevic, S. Artificial Intelligence in Orthodontics Diagnosis and Treatment. In New Technologies, Development and Application VIII; Springer: Cham, Switzerland, 2025; Volume 1483.
6. Lindner, C.; Cootes, T.F. Fully automatic cephalometric evaluation using random forest regression-voting. In Proceedings of the ISBI 2015 Grand Challenge in Dental X-ray Analysis, New York, NY, USA, 16–19 April 2015.
7. Lindner, C.; Wang, C.W.; Huang, C.T.; Li, C.H.; Chang, S.W.; Cootes, T.F. Fully automatic system for accurate localisation and analysis of cephalometric landmarks in lateral cephalograms. Sci. Rep. 2016, 6, 33581.
8. Zeng, M.; Yan, Z.; Liu, S.; Zhou, Y.; Qiu, L. Cascaded convolutional networks for automatic cephalometric landmark detection. Med. Image Anal. 2021, 68, 101904.
9. Song, Y.; Qiao, X.; Iwamoto, Y.; Chen, Y.-W. Automatic cephalometric landmark detection on X-ray images using a deep-learning method. Appl. Sci. 2020, 10, 2547.
10. Khalid, M.A.; Zulfiqar, K.; Bashir, U.; Shaheen, A.; Iqbal, R.; Rizwan, Z.; Rizwan, G.; Fraz, M.M. CEPHA29: Automatic cephalometric landmark detection challenge 2023. arXiv 2022, arXiv:2212.04621.
11. Laitenberger, F.; Scheuer, H.T.; Scheuer, H.A.; Lilienthal, E.; You, S.; Friedrich, R.E. Cephalometric landmark detection using vision transformers with direct coordinate prediction. J. Cranio-Maxillofac. Surg. 2025, 53, 1518–1529.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
13. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
14. Wang, C.-W.; Huang, C.-T.; Lee, J.-H.; Li, C.-H.; Chang, S.-W.; Siao, M.-J.; Lai, T.-M.; Ibragimov, B.; Vrtovec, T.; Ronneberger, O.; et al. A benchmark for comparison of dental radiography analysis algorithms. Med. Image Anal. 2016, 31, 63–76.
15. Wang, C.-W.; Huang, C.-T.; Hsieh, M.-C.; Li, C.-H.; Chang, S.-W.; Li, W.-C.; Vandaele, R.; Marée, R.; Jodogne, S.; Geurts, P.; et al. Evaluation and comparison of anatomical landmark detection methods for cephalometric X-ray images: A grand challenge. IEEE Trans. Med. Imaging 2015, 34, 1890–1900.
16. Bao, H.; Zhang, K.; Yu, C.; Li, H.; Cao, D.; Shu, H.; Liu, L.; Yan, B. Evaluating the accuracy of automated cephalometric analysis based on artificial intelligence. BMC Oral Health 2023, 23, 191.
17. Kotula, J.; Kotula, K.; Kuc, A.E.; Porowski, R.; Lis, J.; Kawala, B.; Sarul, M. Reliability of cephalometric landmark identification and sagittal discrepancy measurements across skeletal Classes I, II, and III: A comparative study. Dent. J. 2026, in press.
18. Park, J.H.; Hwang, H.W.; Moon, J.H.; Yu, Y.; Kim, H.; Her, S.B.; Srinivasan, G.; Aljanabi, M.N.A.; Donatelli, R.E.; Lee, S.J. Automated identification of cephalometric landmarks: Part 1. Comparisons between the latest deep-learning methods YOLOv3 and SSD. Angle Orthod. 2019, 89, 903–909.
19. Kim, H.; Shim, E.; Park, J.; Kim, Y.-J.; Lee, U.; Kim, Y. Web-based fully automated cephalometric analysis by deep learning. Comput. Methods Programs Biomed. 2020, 194, 105513.
20. Lee, J.-H.; Yu, H.-J.; Kim, M.-J.; Kim, J.-Y.; Choi, J. Automated cephalometric landmark detection with confidence regions using Bayesian convolutional neural networks. BMC Oral Health 2020, 20, 270.
21. Dai, C.; Huang, C.; Xu, M.; Wang, Y. A cephalometric landmark detection method using dual-encoder on X-ray image. J. Biomed. Eng. 2025, 42, 883–891.
22. Hwang, H.-W.; Park, J.-H.; Moon, J.-H.; Yu, Y.; Kim, H.; Her, S.-B.; Srinivasan, G.; Aljanabi, M.N.A.; Donatelli, R.E.; Lee, S.-J. Automated identification of cephalometric landmarks: Part 2—Might it be better than human? Angle Orthod. 2020, 90, 69–76.





| Model | Architecture | Epochs | Train n | Box (px) | MRE ± SD (mm) | SDR@2 mm | SDR@2.5 mm | SDR@4 mm |
|---|---|---|---|---|---|---|---|---|
| Model 1 † | YOLOv11l | 200 | 235 | 40 × 40 | 3.18 ± 1.12 | 8.1% | 22.4% | 81.5% |
| Model 2 * | YOLOv11l | 200 | 1175 | 40 × 40 | 3.10 ± 1.00 | 7.9% | 25.6% | 87.2% |
| Model 3 | YOLOv5xu | 150 | 1175 | 40 × 40 | 3.24 ± 1.08 | 6.5% | 20.1% | 80.6% |
| Model 4 * | YOLOv11l | 200 | 1110 | 40 × 40 | 3.28 ± 1.15 | 6.8% | 21.0% | 83.3% |
| Model 5 † | YOLOv11l | 200 | 1665 | 40 × 40 | 5.87 ± 2.31 | 2.1% | 6.4% | 38.2% |
| Model 6 | YOLOv11l | 200 | 1665 | 150 × 150 | 11.4 ± 4.8 | 0.3% | 0.8% | 1.8% |
| Model 7 | YOLOv11m | 200 | 1665 | 150 × 150 | 10.8 ± 4.5 | 0.4% | 0.9% | 2.4% |
| Model 8 | YOLOv11n | 200 | 1665 | 150 × 150 | 13.7 ± 5.2 | 0.1% | 0.3% | 0.8% |
| Model 9 | YOLOv11s | 200 | 1665 | 150 × 150 | 11.1 ± 4.6 | 0.3% | 0.7% | 1.6% |
| Model 10 | YOLOv11s | 300 | 4255 | 100 × 100 | 8.3 ± 3.6 | 1.2% | 3.5% | 12.6% |
| Model 11 | YOLOv11s | 600 | 4255 | 150 × 150 | 10.2 ± 4.4 | 0.4% | 1.0% | 2.7% |
| Model 12 * | YOLOv11s | 300 | 4255 | 40 × 40 | 3.21 ± 0.98 | 7.6% | 24.1% | 86.2% |
| Model 13 * | YOLOv11l | 300 | 4255 | 40 × 40 | 3.26 ± 1.05 | 7.2% | 23.8% | 81.4% |
| Model 14 * | YOLOv11n | 300 | 4255 | 40 × 40 | 3.28 ± 1.11 | 6.9% | 22.7% | 81.0% |
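
The MRE and SDR columns in the table above follow from per-landmark radial errors. The sketch below shows one way to compute them, assuming that each predicted landmark is taken as the centre of its detected bounding box and that a single pixel spacing applies to the whole image; both are assumptions, and `box_centre`, `radial_errors_mm`, `mre_sd`, `sdr` and `mm_per_px` are hypothetical names. The coordinates and the spacing in the usage lines are invented.

```python
import math
from statistics import mean, stdev

Point = tuple[float, float]


def box_centre(x1: float, y1: float, x2: float, y2: float) -> Point:
    """Centre of an axis-aligned box given by its corner coordinates (pixels)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def radial_errors_mm(pred: list[Point], truth: list[Point], mm_per_px: float) -> list[float]:
    """Euclidean distance between each predicted and reference landmark, in mm."""
    return [
        math.hypot(px - tx, py - ty) * mm_per_px
        for (px, py), (tx, ty) in zip(pred, truth)
    ]


def mre_sd(errors_mm: list[float]) -> tuple[float, float]:
    """Mean radial error and its sample standard deviation (needs >= 2 errors)."""
    return mean(errors_mm), stdev(errors_mm)


def sdr(errors_mm: list[float], threshold_mm: float) -> float:
    """Successful detection rate: percentage of errors at or below the threshold."""
    return 100.0 * sum(e <= threshold_mm for e in errors_mm) / len(errors_mm)


if __name__ == "__main__":
    # Invented example: three 40 x 40 px detections and their reference points.
    pred = [box_centre(80, 80, 120, 120), box_centre(200, 40, 240, 80), box_centre(10, 10, 50, 50)]
    truth = [(103.0, 104.0), (220.0, 60.0), (42.0, 30.0)]
    errors = radial_errors_mm(pred, truth, mm_per_px=0.1)  # assumed pixel spacing
    print("MRE, SD:", mre_sd(errors))
    for t in (2.0, 2.5, 4.0):
        print(f"SDR@{t} mm:", sdr(errors, t))
```

Under the box-centre assumption, a larger box gives the detector more freedom in where the centre falls, which is one plausible reading of the relationship between box size and MRE shown in the table.
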

| Study | Year | Architecture | Dataset (images) | Landmarks (n) | MRE (mm) | SDR@2 mm (%) |
|---|---|---|---|---|---|---|
| Lindner & Cootes [6] | 2015 | Random Forest | 400 | 19 | 1.6–1.7 | ∼74.8 |
| Park et al. [18] | 2019 | Cascaded CNN | 1028 | 19 | 1.46 ± 0.98 | ∼85 |
| Kim et al. [19] | 2020 | Stacked Hourglass | 2075 | 23 | 1.37 ± 1.79 | ∼81 |
| Lee et al. [20] | 2020 | Bayesian YOLOv5 | 1028 | 20 | 2.3 ± 1.1 | ∼72 |
| Dai et al. [21] | 2025 | Dual-Enc. Transformer | 400+ | 19 | — | 89.5–90.7 |
| Present study | 2026 | YOLOv11 variants | 4255 | 4 | 3.10 ± 1.00 | 7.9 † |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kotula, J.; Konarzewski, M.; Polkowski, J.; Kotula, K.; Lis, J.; Porowski, R.; Kuc, A.E.; Kawala, B.; Sarul, M. Automated YOLO-Based Cephalometric Landmark Detection for ANB-Based Skeletal Classification: A Retrospective Single-Centre Study. J. Clin. Med. 2026, 15, 5149. https://doi.org/10.3390/jcm15135149

