The Artificial Intelligence-Assisted Diagnosis of Skeletal Dysplasias in Pediatric Patients: A Comparative Benchmark Study of Large Language Models and a Clinical Expert Group
Abstract
1. Introduction
1.1. Context and Diagnostic Challenge
1.2. Next-Generation Sequencing: Transformative but Not Definitive
1.3. Artificial Intelligence in Medical Diagnostics: A New Frontier
1.4. Study Aim and Diagnostic Benchmarking Framework
2. Materials and Methods
2.1. Study Design and Setting
2.2. Data Curation
2.3. Patient Population and Inclusion Criteria
2.4. Vignette Design and Simulation Protocol
- Demographic and Anthropometric Data: Age at presentation, sex, height, weight, and body mass index (BMI), with percentile values based on WHO or CDC standards. When available, additional metrics such as sitting height and head circumference were included.
- Prenatal and Perinatal History: Key findings such as intrauterine growth restriction (IUGR), oligohydramnios, prenatal suspicion of skeletal anomalies, birth measurements, and perinatal complications.
- Skeletal and Extraskeletal Findings: A structured description of radiographically and clinically observed features, including short stature patterns (rhizomelic, mesomelic, etc.), joint anomalies, and presence of extraskeletal manifestations (e.g., facial dysmorphism, developmental delay, cardiac or renal anomalies).
- Radiological Impressions: Summary of key imaging findings, including any radiographs or skeletal surveys interpreted by pediatric radiologists or geneticists, highlighting patterns suggestive of specific dysplasias.
- Clinical Course: History of fractures, neurodevelopmental milestones, disease progression, or other relevant time-linked features. Vignettes did not describe full longitudinal follow-up but included key clues about disease evolution when available.
- Biochemical and Metabolic Parameters: Abnormal laboratory results considered diagnostically relevant (e.g., alkaline phosphatase, calcium/phosphate disturbances, markers of storage disorders).
- Family History: When available, a brief summary of family history of similar conditions or known genetic diagnoses was included to reflect the clinical reality of hereditary disorders.
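The vignette components listed above map naturally onto a structured record. The following is a minimal sketch of such a record, assuming a plain-text prompt is generated from it; the class name `Vignette`, all field names, and the `to_prompt` layout are hypothetical and are not taken from the study protocol. The example case is invented purely to show the rendering.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vignette:
    """One anonymized case vignette (illustrative field names, not the study's schema)."""
    age_years: float
    sex: str
    height_cm: Optional[float] = None
    weight_kg: Optional[float] = None
    prenatal_perinatal: List[str] = field(default_factory=list)
    skeletal_extraskeletal: List[str] = field(default_factory=list)
    radiological_impressions: List[str] = field(default_factory=list)
    clinical_course: List[str] = field(default_factory=list)
    biochemical_metabolic: List[str] = field(default_factory=list)
    family_history: Optional[str] = None

    def bmi(self) -> Optional[float]:
        """Body mass index in kg/m^2, or None when height or weight is missing."""
        if self.height_cm is None or self.weight_kg is None:
            return None
        return self.weight_kg / (self.height_cm / 100) ** 2

    def to_prompt(self) -> str:
        """Render the vignette as plain text, skipping sections with no data."""
        lines = [f"Age: {self.age_years} years; sex: {self.sex}"]
        value = self.bmi()
        if value is not None:
            lines.append(f"BMI: {value:.1f} kg/m^2")
        sections = [
            ("Prenatal and perinatal history", self.prenatal_perinatal),
            ("Skeletal and extraskeletal findings", self.skeletal_extraskeletal),
            ("Radiological impressions", self.radiological_impressions),
            ("Clinical course", self.clinical_course),
            ("Biochemical and metabolic parameters", self.biochemical_metabolic),
        ]
        for title, items in sections:
            if items:
                lines.append(f"{title}: " + "; ".join(items))
        if self.family_history:
            lines.append(f"Family history: {self.family_history}")
        return "\n".join(lines)


# Invented example case, for illustration only.
example = Vignette(
    age_years=4,
    sex="F",
    height_cm=100,
    weight_kg=16,
    skeletal_extraskeletal=["rhizomelic limb shortening", "macrocephaly"],
)
print(example.to_prompt())
```

Keeping optional sections as empty lists mirrors the "when available" wording of the protocol: missing data simply produces no line in the rendered vignette.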
2.5. Artificial Intelligence Systems Description
2.6. AI-Based Diagnostic Simulation
2.7. Expert Clinical Panel—Human Comparator
2.8. Outcome Measures and Adjudication
- Frequency of correct diagnosis within the top three suggestions;
- Distribution of confidence scores across groups;
- Level of agreement between AI models and human experts.
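The first outcome measure, correctness within the top three suggestions, can be sketched as a top-k accuracy calculation. This is only an illustration: the function names are hypothetical, the toy cases are invented, and exact string matching after normalization is a simplifying assumption that stands in for the adjudication step named in the section heading.

```python
from typing import List, Sequence, Tuple


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def top_k_hit(ranked: Sequence[str], truth: str, k: int = 3) -> bool:
    """True if the confirmed diagnosis appears among the first k suggestions."""
    return normalize(truth) in [normalize(s) for s in ranked[:k]]


def top_k_accuracy(cases: List[Tuple[Sequence[str], str]], k: int = 3) -> float:
    """Fraction of cases whose confirmed diagnosis is within the top k suggestions."""
    hits = sum(top_k_hit(ranked, truth, k) for ranked, truth in cases)
    return hits / len(cases)


# Invented toy cases: (ranked suggestions, confirmed diagnosis).
toy_cases = [
    (["Achondroplasia", "Hypochondroplasia", "Pseudoachondroplasia"], "Achondroplasia"),
    (["Hypophosphatemic rickets", "Osteogenesis imperfecta", "Hypophosphatasia"],
     "Osteogenesis Imperfecta"),
    (["Achondroplasia", "Hypochondroplasia", "Larsen syndrome"],
     "Metaphyseal chondrodysplasia"),
]

print(f"top-1 accuracy: {top_k_accuracy(toy_cases, k=1):.2f}")
print(f"top-3 accuracy: {top_k_accuracy(toy_cases, k=3):.2f}")
```

Setting `k=1` gives the primary-diagnosis accuracy, and `k=3` gives the top-three figure, so both columns of a results table can come from the same function.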
2.9. Statistical Analysis
3. Results
3.1. Cohort Characteristics and Clinical Parameters
3.2. Diagnostic Spectrum and Distribution
3.3. Diagnostic Performance of AI Models
3.4. Additional Statistical Findings
3.5. Performance of Human Experts and Comparison with AI Models
4. Discussion
4.1. General Diagnostic Performance of LLMs Versus Human Expert Group
4.2. Impact of Disorder Prevalence on AI Performance
4.3. Diagnostic Accuracy and Phenotypic Distinctiveness
4.4. Diagnostic Confidence and Interpretability of AI Decisions
4.5. Strengths and Limitations of LLMs in Diagnostically Challenging Cases
4.6. LLM Concordance and the Value of Multi-Model Usage
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
NGS | Next-Generation Sequencing |
WES | Whole-Exome Sequencing |
VUS | Variants of Uncertain Significance |
WGS | Whole-Genome Sequencing |
OI | Osteogenesis Imperfecta |
AI | Artificial Intelligence |
LLMs | Large Language Models |
USMLE | United States Medical Licensing Examination |
Diagnostic Challenge | Explanation/Impact |
---|---|
High Genetic Heterogeneity | >450 disorders with diverse inheritance and mutational spectrum |
Phenotypic Overlap | Many dysplasias share clinical and radiographic features |
Age-Dependent Expression | Some features (e.g., metaphyseal changes) appear later in life |
Radiological Expertise Often Lacking | Interpretation errors common in early infancy |
Limited Access to Clinical Geneticists | Especially in low-resource or regional healthcare settings |
Delayed Molecular Testing and Interpretation | NGS not always available or rapidly interpreted |
Evolving or Atypical Presentations | Non-classic phenotypes often lead to misdiagnosis or delayed diagnosis |
Genetic Diagnosis | Number of Cases (n) |
---|---|
Osteogenesis Imperfecta (Types I, III, IV) | 12 |
Achondroplasia | 6 |
Hypophosphatemic Rickets | 4 |
Mucopolysaccharidosis IVA (Morquio Syndrome) | 4 |
Metaphyseal Chondrodysplasia (Schmid Type) | 2 |
CODAS | 2 |
Larsen Syndrome | 2 |
Others * | 13 |
Total | 45 |
Evaluator | Correct as Primary Diagnosis | Correct as 2nd or 3rd Alternative Diagnosis | Total Top 3 Accuracy |
---|---|---|---|
ChatGPT | 25/45 (55.6%) | 3/45 (6.7%) | 28/45 (62.2%) |
DeepSeek | 26/45 (57.8%) | 3/45 (6.7%) | 29/45 (64.4%) |
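The percentages in this table follow directly from the case counts. As a quick arithmetic check, the sketch below recomputes them from the counts reported above (45 cases per model); the helper names `pct` and `summarize` are hypothetical.

```python
TOTAL_CASES = 45

# (correct as primary diagnosis, correct as 2nd or 3rd alternative), from the table above.
counts = {
    "ChatGPT": (25, 3),
    "DeepSeek": (26, 3),
}


def pct(n: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * n / total, 1)


def summarize(primary: int, alternative: int, total: int = TOTAL_CASES) -> dict:
    """Return primary, alternative, and combined top-3 figures for one evaluator."""
    top3 = primary + alternative
    return {
        "primary_pct": pct(primary, total),
        "alternative_pct": pct(alternative, total),
        "top3": top3,
        "top3_pct": pct(top3, total),
    }


for model, (primary, alternative) in counts.items():
    s = summarize(primary, alternative)
    print(f"{model}: primary {s['primary_pct']}%, alternative {s['alternative_pct']}%, "
          f"top-3 {s['top3']}/{TOTAL_CASES} ({s['top3_pct']}%)")
```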
Skeletal Dysplasia Disorder Rarity Category | ChatGPT | DeepSeek | Expert Panel |
---|---|---|---|
Rare Disorders (n = 21) | 18/21 (85.7%) | 17/21 (81.0%) | 18/21 (85.7%) |
Ultra-Rare Disorders (n = 24) | 10/24 (41.7%) | 12/24 (50.0%) | 19/24 (79.2%) |
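One way to ask whether accuracy differs between rare and ultra-rare disorders is a two-sided Fisher's exact test on each evaluator's 2×2 table of correct and incorrect counts. The sketch below is an illustration built on the counts in the table above; it is not presented as the authors' analysis, and the function name and table layout are assumptions. It uses only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# ((rare correct, rare incorrect), (ultra-rare correct, ultra-rare incorrect))
rarity_table = {
    "ChatGPT": ((18, 3), (10, 14)),
    "DeepSeek": ((17, 4), (12, 12)),
    "Expert Panel": ((18, 3), (19, 5)),
}

for evaluator, ((rc, rw), (uc, uw)) in rarity_table.items():
    p = fisher_exact_two_sided(rc, rw, uc, uw)
    print(f"{evaluator}: rare {rc}/{rc + rw}, ultra-rare {uc}/{uc + uw}, p = {p:.3f}")
```

An exact test is chosen here because the groups are small (21 and 24 cases), where chi-square approximations can be unreliable.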
Clinical Feature | Frequency in Correct Diagnoses (n = 37) | % of Correct Diagnoses |
---|---|---|
Repeated Bone Fractures | 14 | 37.8% |
Biochemical Abnormalities | 13 | 35.1% |
Limb Shortening | 12 | 32.4% |
Macrocephaly | 10 | 27.0% |
Facial Dysmorphism | 8 | 21.6% |
Motor Developmental Delay | 7 | 18.9% |
Vision or Hearing Impairment | 4 | 10.8% |
Clinical Feature | Expert OR (p) | ChatGPT OR (p) | DeepSeek OR (p) |
---|---|---|---|
Repeated Bone Fractures | 17.5 (0.032) | 1.9 (0.40) | 6.6 (0.052) |
Macrocephaly | 4.68 (0.089) | 1.02 (0.97) | 1.87 (0.43) |
Biochemical Abnormalities | 41.7 (0.074) | 3.3 (0.15) | 5.5 (0.079) |
Intellectual Disability | 0.03 (0.022) | 0.70 (0.65) | 0.81 (0.77) |
Vision Impairment | 0.34 (0.40) | 0.52 (0.38) | 0.58 (0.42) |
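For reading the odds ratios in this table, the standard definition for a 2×2 cross-tabulation is given below. It assumes a crude (unadjusted) odds ratio; the table does not state whether the reported values are crude or model-based, and the symbols $a$, $b$, $c$, $d$ are introduced here only for illustration.

$$
\mathrm{OR} = \frac{a/b}{c/d} = \frac{a\,d}{b\,c}
$$

Here $a$ and $b$ are the numbers of correct and incorrect diagnoses among cases with the feature, and $c$ and $d$ are the corresponding numbers among cases without it. An OR above 1 means the feature is associated with higher odds of a correct diagnosis, and an OR below 1 with lower odds.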
Evaluator | Correct Diagnosis (Avg. Score) | Incorrect Diagnosis (Avg. Score) |
---|---|---|
ChatGPT | 4.77 | 4.03 |
DeepSeek | 4.83 | 4.45 |
Expert Panel | 4.88 | 4.63 |
Evaluator Pair | Concordant Diagnoses (n) | Discordant Diagnoses (n) | % Concordance |
---|---|---|---|
ChatGPT vs. DeepSeek | 41 | 4 | 91.1% |
Expert Panel vs. ChatGPT | 28 | 17 | 62.2% |
Expert Panel vs. DeepSeek | 29 | 16 | 64.4% |
Insight | Implication for Practice |
---|---|
LLMs perform well in well-characterized disorders | Use AI to screen for common dysplasias or aid non-specialists |
Lower AI top-3 accuracy in ultra-rare or syndromic conditions | Expert review remains crucial for unusual phenotypes |
High inter-LLM concordance | A single optimized model may be sufficient in many clinical settings |
AI occasionally outperforms experts | Potential as a second opinion or in ambiguous cases |
Diagnostic confidence correlates with correctness | Confidence estimates may assist in prioritizing cases for expert review |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ilić, N.; Marić, N.; Cvetković, D.; Bogosavljević, M.; Bukara-Radujković, G.; Krstić, J.; Paunović, Z.; Begović, N.; Panić Zarić, S.; Todorović, S.; et al. The Artificial Intelligence-Assisted Diagnosis of Skeletal Dysplasias in Pediatric Patients: A Comparative Benchmark Study of Large Language Models and a Clinical Expert Group. Genes 2025, 16, 762. https://doi.org/10.3390/genes16070762