This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS)

1 Department of Diagnostic Imaging, National University Hospital, Singapore 119074, Singapore
2 Biostatistics Unit, Yong Loo Lin School of Medicine, Singapore 117597, Singapore
3 AIO Innovation Office, National University Health System, Singapore 119228, Singapore
4 University Spine Centre, University Orthopaedics, Hand and Reconstructive Microsurgery, National University Health System, Singapore 119074, Singapore
* Author to whom correspondence should be addressed.
Cancers 2025, 17(13), 2073; https://doi.org/10.3390/cancers17132073
Submission received: 23 May 2025 / Revised: 16 June 2025 / Accepted: 19 June 2025 / Published: 20 June 2025
(This article belongs to the Section Methods and Technologies Development)

Simple Summary

Spinal tumours can result in spinal instability, and clinicians utilise the Spinal Instability Neoplastic Score (SINS) to determine if surgical intervention is required. However, the calculation of the SINS is time-consuming. Large language models may improve the calculation of the SINS, but their accuracy remains underexplored. The authors aim to evaluate two models—Claude 3.5 and Llama 3.1—against clinician assessments. The authors hope that the findings from this study may lead to the implementation of large language models in clinical practice to streamline workflows and improve consistency in the assessment of spinal metastases.

Abstract

Background: Large language models (LLMs) have emerged as powerful tools in healthcare. In diagnostic radiology, LLMs can assist in the computation of the Spinal Instability Neoplastic Score (SINS), which is a critical tool for assessing spinal metastases. However, the accuracy of LLMs in calculating the SINS based on radiological reports remains underexplored. Objective: This study evaluates the accuracy of two institutional privacy-preserving LLMs (Claude 3.5 and Llama 3.1) in computing the SINS from radiology reports and electronic medical records, comparing their performance against clinician readers. Methods: A retrospective analysis was conducted on 124 radiology reports from patients with spinal metastases. Three expert readers established a reference standard for the SINS calculation. Two orthopaedic surgery residents and two LLMs (Claude 3.5 and Llama 3.1) independently calculated the SINS. The intraclass correlation coefficient (ICC) was used to measure inter-rater agreement for the total SINS, while Gwet’s Kappa was used to measure inter-rater agreement for the individual SINS components. Results: Both LLMs and clinicians demonstrated almost perfect agreement with the reference standard for the total SINS. Between the two LLMs, Claude 3.5 (ICC = 0.984) outperformed Llama 3.1 (ICC = 0.829). Claude 3.5 was also comparable to the clinician readers, whose ICCs were 0.926 and 0.986, and exhibited near-perfect agreement across all individual SINS components (Gwet’s Kappa range: 0.919–0.990). Conclusions: Claude 3.5 demonstrated high accuracy in calculating the SINS and may serve as a valuable adjunct in clinical workflows, potentially reducing clinician workload while maintaining diagnostic reliability. However, variations in LLM performance highlight the need for further validation and optimisation before clinical integration.
Keywords: large language models; radiology; artificial intelligence; spine metastases
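
The abstract reports agreement on the total SINS as an intraclass correlation coefficient but does not state which ICC form was used. As a rough illustration only, the sketch below assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc_2_1` and the example scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per report and one column per rater.
    Assumption: this ICC form is chosen for illustration; the abstract
    does not say which form the authors used.
    """
    n = len(ratings)       # number of reports (subjects)
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least 2 reports and 2 raters")
    if any(len(row) != k for row in ratings):
        raise ValueError("every report needs one score per rater")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-report mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denom


# Hypothetical total SINS values: column 0 = reference, column 1 = one rater.
example_scores = [[7, 7], [12, 11], [4, 4], [15, 15], [9, 10], [13, 13]]
print(round(icc_2_1(example_scores), 3))
```

Because this form measures absolute agreement, a rater who is consistently one point above the reference is penalised, which suits a score whose value feeds directly into a clinical decision.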

APA Style

Chan, L. Y. T., Chan, D. Z. M., Tan, Y. L., Yap, Q. V., Ong, W., Lee, A., Ge, S., Leow, W. N., Makmur, A., Ting, Y., Teo, E. C., Jiong Hao, T., Kumar, N., & Hallinan, J. T. P. D. (2025). Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS). Cancers, 17(13), 2073. https://doi.org/10.3390/cancers17132073
