Article

Reliability of Large Language Model-Based Artificial Intelligence in AIS Assessment: Lenke Classification and Fusion-Level Suggestion

1 Department of Orthopedics and Traumatology, Antalya Training and Research Hospital, Antalya 07100, Turkey
2 Department of Orthopedics and Traumatology, Faculty of Medicine, Istanbul University, Istanbul 34093, Turkey
3 Department of Orthopedics and Traumatology, İstinye University Medical Park TEM Hospital, Istanbul 34250, Turkey
* Author to whom correspondence should be addressed.
Diagnostics 2025, 15(24), 3219; https://doi.org/10.3390/diagnostics15243219
Submission received: 15 November 2025 / Revised: 13 December 2025 / Accepted: 16 December 2025 / Published: 16 December 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background: Accurate deformity classification and fusion-level planning are essential in adolescent idiopathic scoliosis (AIS) surgery and are traditionally guided by Cobb angle measurement and the Lenke system. Multimodal large language models (LLMs) (e.g., ChatGPT-4.0, Claude 3.7 Sonnet, Gemini 2.5 Pro, DeepSeek-R1-0528 Chat) are increasingly used for image interpretation despite limited validation for radiographic decision-making. This study evaluated the agreement and reproducibility of contemporary multimodal LLMs for AIS assessment compared with expert spine surgeons. Methods: This single-center retrospective study included 125 AIS patients (94 females, 31 males; mean age 14.8 ± 1.9 years) who underwent posterior instrumentation (2020–2024). Two experienced spine surgeons independently performed Lenke classification (including lumbar and sagittal modifiers) and selected fusion levels (UIV–LIV) on standing AP, lateral, and side-bending radiographs; discrepancies were resolved by consensus to establish the reference standard. The same radiographs were analyzed by four paid multimodal LLMs using standardized zero-shot prompts. Because the LLMs selected end vertebrae inconsistently, LLM-derived Cobb angles lacked a common anatomical reference frame and were excluded from quantitative analysis. Agreement with expert consensus and test–retest reproducibility (repeat analyses one week apart) were assessed using Cohen’s κ. Evaluation times were recorded. Results: Surgeon agreement was high for Lenke classification (92.0%, κ = 0.913) and fusion-level selection (88.8%, κ = 0.879). All LLMs demonstrated chance-level test–retest reproducibility and very low agreement with expert consensus (Lenke: 1.6–10.2%, κ = 0.001–0.036; fusion: 0.8–12.0%, κ = 0.003–0.053). Claude produced missing outputs in 17 Lenke and 29 fusion-level cases. Although LLMs completed assessments far faster than surgeons (seconds vs. ~11–12 min), speed did not translate into clinically acceptable reliability. Conclusions: Current general-purpose multimodal LLMs do not provide reliable Lenke classification or fusion-level planning in AIS. Their poor agreement with expert surgeons and marked internal inconsistency indicate that LLM-generated interpretations should not be used for surgical decision-making or patient self-assessment without task-specific validation.
Keywords: artificial intelligence; multimodal large language models; adolescent idiopathic scoliosis; Lenke classification; deep learning
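
The abstract reports agreement as both a raw percentage and Cohen’s κ. As a rough illustration of how the two relate, the following minimal sketch computes unweighted Cohen’s κ for two sets of categorical labels. It assumes an unweighted κ over whole labels, which the abstract does not specify; the function name `cohen_kappa` and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for illustration only (assumption, not study data).
consensus = ["1A", "1A", "2B", "3C", "5C", "1B", "2A", "6C"]
model_run = ["1A", "3C", "1A", "1A", "1A", "1A", "1A", "1A"]
print(round(cohen_kappa(consensus, model_run), 3))
```

Because κ subtracts the agreement expected by chance, a rater that defaults to one frequent label can match a few cases yet still score near or below zero, which is the sense in which agreement is described as chance-level.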

Share and Cite

MDPI and ACS Style

Aktan, C.; Koşar, A.; Ünal, M.; Korkmaz, M.; Kaya, Ö.; Akgül, T.; Güler, F. Reliability of Large Language Model-Based Artificial Intelligence in AIS Assessment: Lenke Classification and Fusion-Level Suggestion. Diagnostics 2025, 15, 3219. https://doi.org/10.3390/diagnostics15243219

AMA Style

Aktan C, Koşar A, Ünal M, Korkmaz M, Kaya Ö, Akgül T, Güler F. Reliability of Large Language Model-Based Artificial Intelligence in AIS Assessment: Lenke Classification and Fusion-Level Suggestion. Diagnostics. 2025; 15(24):3219. https://doi.org/10.3390/diagnostics15243219

Chicago/Turabian Style

Aktan, Cemil, Akın Koşar, Melih Ünal, Murat Korkmaz, Özcan Kaya, Turgut Akgül, and Ferhat Güler. 2025. "Reliability of Large Language Model-Based Artificial Intelligence in AIS Assessment: Lenke Classification and Fusion-Level Suggestion" Diagnostics 15, no. 24: 3219. https://doi.org/10.3390/diagnostics15243219

APA Style

Aktan, C., Koşar, A., Ünal, M., Korkmaz, M., Kaya, Ö., Akgül, T., & Güler, F. (2025). Reliability of Large Language Model-Based Artificial Intelligence in AIS Assessment: Lenke Classification and Fusion-Level Suggestion. Diagnostics, 15(24), 3219. https://doi.org/10.3390/diagnostics15243219

