Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS)
Simple Summary
Abstract
1. Introduction
2. Methods
2.1. Study Design
- Extract the back pain history from the imaging request form and/or the latest clinical EMR entry.
- Determine lesion location, radiographic spinal alignment, vertebral body collapse, and posterior spinal element involvement from the MRI report.
- Assess the bone lesion quality based on the CT imaging report.
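
The three steps above amount to a fixed mapping from each SINS component to the document that supplies it. A minimal Python sketch of that mapping follows; the key names (`request_form_or_emr`, `mri_report`, `ct_report`) and the helper functions are hypothetical labels chosen for illustration, not identifiers from the study.

```python
# Hypothetical labels: which source document supplies each SINS component,
# following the three extraction steps listed above.
COMPONENT_SOURCE = {
    "pain": "request_form_or_emr",
    "location": "mri_report",
    "spinal_alignment": "mri_report",
    "vertebral_body_collapse": "mri_report",
    "posterior_element_involvement": "mri_report",
    "bone_lesion_quality": "ct_report",
}


def components_by_source(mapping=COMPONENT_SOURCE):
    """Group SINS components under the document they are read from."""
    grouped = {}
    for component, source in mapping.items():
        grouped.setdefault(source, []).append(component)
    return grouped


def unscorable_components(available_text):
    """List components whose source document is missing or empty.

    `available_text` maps a source label to its raw text (or None).
    """
    return sorted(
        component
        for component, source in COMPONENT_SOURCE.items()
        if not available_text.get(source)
    )
```

For example, a case with an MRI report and a request form but no CT report would return `["bone_lesion_quality"]` from `unscorable_components`, flagging that a total score cannot yet be formed.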
2.2. Statistical Analysis
3. Results
3.1. Reference Standard
3.2. Overall Accuracy and Agreement for Total SINS
3.3. Subgroup Analysis for Individual SINS Components
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Category | Score | Details |
---|---|---|
Location | 3 | Junctional (Occiput–C2; C7–T2; T11–L1; L5–S1) |
Location | 2 | Mobile spine (C3–C6; L2–L4) |
Location | 1 | Semirigid (T3–T10) |
Location | 0 | Rigid (S2–S5) |
Pain | 3 | Mechanical pain |
Pain | 1 | Occasional pain but not mechanical |
Pain | 0 | Pain-free |
Bone Lesion Quality | 2 | Lytic |
Bone Lesion Quality | 1 | Mixed lytic/blastic |
Bone Lesion Quality | 0 | Blastic |
Spinal Alignment | 4 | Subluxation/translation |
Spinal Alignment | 2 | De novo deformity (kyphosis/scoliosis) |
Spinal Alignment | 0 | Normal alignment |
Vertebral Body Collapse | 3 | >50% collapse of vertebral body height |
Vertebral Body Collapse | 2 | <50% collapse of vertebral body height |
Vertebral Body Collapse | 1 | No collapse with >50% of vertebral body involvement |
Vertebral Body Collapse | 0 | None of the above |
Posterior Element Involvement | 3 | Bilateral |
Posterior Element Involvement | 1 | Unilateral |
Posterior Element Involvement | 0 | None of the above |
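Interpretation of Total Score | 0–6 | Stable |
Interpretation of Total Score | 7–12 | Indeterminate |
Interpretation of Total Score | 13–18 | Unstable |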
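
The scoring scheme reduces to a lookup per component followed by a sum. The sketch below is a minimal illustration, not the authors' implementation: the finding labels (`"junctional"`, `"over_50_percent"`, and so on) are hypothetical keys, and it assumes the published SINS weighting in which subluxation/translation scores 4, so that the maximum total is 18 and matches the 0–18 interpretation bands (0–6 stable, 7–12 indeterminate, 13–18 unstable).

```python
# Points per finding for each SINS component. The keys are hypothetical
# labels; subluxation/translation is assumed to score 4 (maximum total 18).
SINS_SCORES = {
    "location": {"junctional": 3, "mobile": 2, "semirigid": 1, "rigid": 0},
    "pain": {"mechanical": 3, "occasional_non_mechanical": 1, "pain_free": 0},
    "bone_lesion_quality": {"lytic": 2, "mixed": 1, "blastic": 0},
    "spinal_alignment": {
        "subluxation_translation": 4,
        "de_novo_deformity": 2,
        "normal": 0,
    },
    "vertebral_body_collapse": {
        "over_50_percent": 3,
        "under_50_percent": 2,
        "no_collapse_over_50_percent_involved": 1,
        "none": 0,
    },
    "posterior_element_involvement": {"bilateral": 3, "unilateral": 1, "none": 0},
}


def total_sins(findings):
    """Sum the six component scores for one case."""
    missing = set(SINS_SCORES) - set(findings)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(SINS_SCORES[name][findings[name]] for name in SINS_SCORES)


def stability_category(total):
    """Map a total score to its interpretation band."""
    if not 0 <= total <= 18:
        raise ValueError("total SINS must lie between 0 and 18")
    if total <= 6:
        return "stable"
    if total <= 12:
        return "indeterminate"
    return "unstable"


example = {
    "location": "junctional",
    "pain": "mechanical",
    "bone_lesion_quality": "lytic",
    "spinal_alignment": "normal",
    "vertebral_body_collapse": "under_50_percent",
    "posterior_element_involvement": "unilateral",
}
print(total_sins(example), stability_category(total_sins(example)))
```

The example case sums to 3 + 3 + 2 + 0 + 2 + 1 = 11, which falls in the indeterminate band.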
Demographics and Oncological Characteristics | Dataset (n = 96 patients; 124 MRI spine examinations); n/N (%) unless stated |
---|---|
Mean age, years | 62 (SD ± 10; range 32–86) |
Sex | |
Female | 53/96 (55.2) |
Male | 43/96 (44.8) |
Diagnosis | |
Known Cancer | 78/96 (81.2) |
New Diagnosis | 18/96 (18.8) |
Cancer Type | |
Lung | 37/96 (38.5) |
Breast | 16/96 (16.7) |
Gastrointestinal | 10/96 (10.4) |
Prostate | 6/96 (6.3) |
Gynaecologic | 6/96 (6.3) |
Renal | 5/96 (5.2) |
Myeloma/Plasmacytoma | 5/96 (5.2) |
Liver | 4/96 (4.2) |
Others | 7/96 (7.3) |
Measure | Total Score | Location | Pain | Bone Lesion Quality | Radiographic Spinal Alignment | Vertebral Body Collapse | Posterior Spinal Element Involvement | Highest Vertebral Level |
---|---|---|---|---|---|---|---|---|
Statistic | ICC 1 | Gwet’s Kappa | Gwet’s Kappa | Gwet’s Kappa | Gwet’s Kappa | Gwet’s Kappa | Gwet’s Kappa | Gwet’s Kappa |
Inter-rater Agreement * | 0.998 (0.997, 0.999) | 1.000 (1.000, 1.000) | 0.987 (0.968, 1.000) | 0.976 (0.949, 1.000) | 0.975 (0.949, 0.999) | 0.972 (0.945, 0.999) | 0.993 (0.979, 1.000) | 0.955 (0.925, 0.986) |
p-value | <0.001 | 0 | 0.0095 | 0.0139 | 0.0127 | 0.0138 | 0.007 | 0.0154 |
SINS Range | Number of Cases (n) | Percentage (%) |
---|---|---|
0–6 (Stable) | 20 | 16.1 |
7–12 (Indeterminate) | 66 | 53.2 |
13–18 (Unstable) | 38 | 30.6 |
Total | 124 | 100 |
Comparison | ICC | p-Value | Overall Rating (Correct/Incorrect) | Percentage of Correct Ratings (%) |
---|---|---|---|---|
Reference vs. Clinician 1 | 0.926 (0.896, 0.948) | <0.001 | 110/14 | 88.7 |
Reference vs. Clinician 2 | 0.986 (0.980, 0.990) | <0.001 | 122/2 | 98.4 |
Reference vs. Claude 3.5 | 0.984 (0.978, 0.989) | <0.001 | 122/2 | 98.4 |
Reference vs. Llama 3.1 | 0.829 (0.764, 0.877) | <0.001 | 93/31 | 75.0 |
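
The percentage column follows directly from the correct/incorrect counts, each of which sums to the 124 MRI spines in the dataset. A short Python sketch of that arithmetic is given below; the dictionary and function names are hypothetical, and the counts are copied from the table.

```python
# (correct, incorrect) counts against the reference standard, from the table.
OVERALL_RATINGS = {
    "Clinician 1": (110, 14),
    "Clinician 2": (122, 2),
    "Claude 3.5": (122, 2),
    "Llama 3.1": (93, 31),
}


def percent_correct(correct: int, incorrect: int) -> float:
    """Percentage of correct ratings, rounded to one decimal place."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no ratings to summarise")
    return round(100 * correct / total, 1)


if __name__ == "__main__":
    for rater, (correct, incorrect) in OVERALL_RATINGS.items():
        print(f"{rater}: {percent_correct(correct, incorrect)}%")
```

Running it reproduces the reported 88.7, 98.4, 98.4, and 75.0.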
Subcomponent | Reference vs. Clinician 1, Gwet’s Kappa | Reference vs. Clinician 1, p-Value | Reference vs. Clinician 2, Gwet’s Kappa | Reference vs. Clinician 2, p-Value | Reference vs. Claude 3.5, Gwet’s Kappa | Reference vs. Claude 3.5, p-Value | Reference vs. Llama 3.1, Gwet’s Kappa | Reference vs. Llama 3.1, p-Value |
---|---|---|---|---|---|---|---|---|
Location | 0.899 (0.839, 0.960) | <0.001 | 0.910 (0.852, 0.968) | <0.001 | 0.919 (0.864, 0.975) | <0.001 | 0.455 (0.341, 0.568) | <0.001 |
Pain | 0.702 (0.604, 0.800) | <0.001 | 0.965 (0.926, 1.000) | <0.001 | 0.954 (0.909, 0.999) | <0.001 | 0.404 (0.286, 0.522) | <0.001 |
Bone lesion quality | 0.884 (0.814, 0.954) | <0.001 | 0.930 (0.875, 0.986) | <0.001 | 0.930 (0.875, 0.985) | <0.001 | 0.360 (0.244, 0.476) | <0.001 |
Radiographic spinal alignment | 0.924 (0.871, 0.976) | <0.001 | 0.979 (0.950, 1.000) | <0.001 | 0.990 (0.969, 1.000) | <0.001 | 0.744 (0.653, 0.834) | <0.001 |
Vertebral body collapse | 0.763 (0.672, 0.853) | <0.001 | 0.927 (0.874, 0.981) | <0.001 | 0.927 (0.873, 0.981) | <0.001 | 0.518 (0.399, 0.637) | <0.001 |
Posterior spinal element involvement | 0.906 (0.845, 0.967) | <0.001 | 0.969 (0.933, 1.000) | <0.001 | 0.979 (0.950, 1.000) | <0.001 | 0.688 (0.587, 0.790) | <0.001 |
Highest vertebral level | 0.832 (0.764, 0.901) | <0.001 | 0.874 (0.814, 0.935) | <0.001 | 0.950 (0.910, 0.990) | <0.001 | 0.874 (0.813, 0.935) | <0.001 |
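
For readers unfamiliar with Gwet’s Kappa, the sketch below computes the unweighted two-rater form (Gwet’s AC1) from paired categorical ratings. It is an illustration under assumptions: the source does not state whether an unweighted or weighted variant was used, nor how the intervals and p-values were obtained, so this code covers the point estimate only and the function name is hypothetical.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Unweighted Gwet's AC1 for two raters scoring the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k), with pi_k the mean of the
    two raters' marginal proportions for category k and q the number of
    categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if categories is None:
        # Assumption: fall back to the categories actually observed.
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = 0.0
    for k in categories:
        pi_k = (counts_a[k] + counts_b[k]) / (2 * n)
        chance += pi_k * (1 - pi_k)
    chance /= q - 1

    return (observed - chance) / (1 - chance)
```

Passing the full category list explicitly (for example `categories=[0, 1, 2, 3]` for location scores) matters when some score is never observed, because the number of categories `q` enters the chance-agreement term.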
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chan, L.Y.T.; Chan, D.Z.M.; Tan, Y.L.; Yap, Q.V.; Ong, W.; Lee, A.; Ge, S.; Leow, W.N.; Makmur, A.; Ting, Y.; et al. Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS). Cancers 2025, 17, 2073. https://doi.org/10.3390/cancers17132073
APA Style: Chan, L. Y. T., Chan, D. Z. M., Tan, Y. L., Yap, Q. V., Ong, W., Lee, A., Ge, S., Leow, W. N., Makmur, A., Ting, Y., Teo, E. C., Jiong Hao, T., Kumar, N., & Hallinan, J. T. P. D. (2025). Evaluating the Accuracy of Privacy-Preserving Large Language Models in Calculating the Spinal Instability Neoplastic Score (SINS). Cancers, 17(13), 2073. https://doi.org/10.3390/cancers17132073