1. Introduction
The Spinal Instability Neoplastic Score (SINS) is a tool for assessing patients with spinal tumors, guiding decisions about the need for surgical intervention. SINS calculation requires analysis of multiple radiological and clinical components, including tumor location, spinal alignment, bone lesion quality, degree of pain, vertebral body collapse, and posterior element involvement [1]. However, the complexity, interobserver variability, and time constraints have all presented barriers to SINS implementation in clinical practice [2,3].
Artificial intelligence (AI) tools such as large language models (LLMs) and other natural language processing (NLP) methods are being studied as potential clinical support tools [4]. In imaging reports, LLMs have demonstrated promise in automated calculation of Coronary Artery Disease Reporting and Data System (CAD-RADS) scores, but still showed limitations in processing complex data and unstandardized reports [5]. In that study, ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced were used to automatically calculate CAD-RADS scores from structured radiologist reports as per the reporting guidelines. LLMs have been used to help standardize reporting and facilitate data extraction [6,7,8]. LLMs have also demonstrated higher performance than physician groups with and without AI assistance, showing the potential for autonomous AI in certain contexts [9,10].
AI has the potential to streamline workflows for the calculation of clinical scores like SINS, while also improving accuracy and efficiency. However, careful development of these AI tools is needed to minimize the limitations of AI alone. Clinicians can play a key role in shaping these tools to ensure that they are safe for clinical deployment [11].
Our study assessed the accuracy and efficiency of LLM-predicted and LLM-assisted SINS calculation by trainee doctors, compared against a reference standard of evaluations by musculoskeletal radiologists and an orthopedic spine surgeon. The main outcomes were agreement with the reference standard (measured by intraclass correlation coefficients [ICCs] for the total score) and the time required for SINS calculation. By evaluating the performance of our institutional privacy-preserving LLM (PP-LLM) in calculating SINS in this setting, we aimed to determine its feasibility and potential as a clinical decision-support tool.
2. Materials and Methods
2.1. Study Methodology
This retrospective observational study was granted a waiver by the Domain-Specific Review Board (DSRB) due to its low-risk design, as only de-identified data were analyzed. All analyses were conducted using an institutional deployment of a privacy-preserving large language model (Claude 3.5, Anthropic, San Francisco, CA, USA), ensuring secure handling of patient information.
Claude 3.5 was selected because it was available in a privacy-preserving institutional deployment and had demonstrated strong performance in clinical reasoning tasks at the time of study initiation. In this configuration, all inputs and outputs were processed within the hospital’s secure servers, and no data were transmitted externally, thereby ensuring compliance with institutional governance and national Personal Data Protection Act (PDPA) requirements. This contrasts with commercial deployments of models such as ChatGPT and Google Gemini. This implementation distinguishes our model as ‘privacy-preserving’ not by the model’s inherent properties, but by its deployment architecture.
Patients with spinal metastases who underwent MRI between January 2020 and December 2022 at the National University Hospital, Singapore, were randomly selected for inclusion. Patients with prior spinal surgery or instrumentation, or without complete MRI, CT, or EMR entries (Epic Systems Corporation, Verona, WI, USA), were excluded. All selected patients were 18 years or older. The SINS, outlined in Table 1, consists of six components. In this study, the MRI report was used to document location, alignment, vertebral body collapse, and posterior element involvement, while pain was extracted from the EMR entries and bone lesion quality was evaluated using the CT report.
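To make the scoring arithmetic concrete, the sketch below tallies the six components and maps the total to the standard stability categories (stable 0–6, indeterminate 7–12, potentially unstable 13–18). This is a minimal illustration based on the published SINS weights; the function and dictionary names are ours and not part of any study software.

```python
# Minimal sketch of the SINS arithmetic using the published component
# weights; names and category labels here are illustrative.

SINS_POINTS = {
    "location": {"junctional": 3, "mobile": 2, "semi_rigid": 1, "rigid": 0},
    "pain": {"mechanical": 3, "occasional_non_mechanical": 1, "pain_free": 0},
    "bone_lesion": {"lytic": 2, "mixed": 1, "blastic": 0},
    "alignment": {"subluxation_translation": 4, "de_novo_deformity": 2, "normal": 0},
    "collapse": {">50%": 3, "<50%": 2, "no_collapse_>50%_body": 1, "none": 0},
    "posterior_elements": {"bilateral": 3, "unilateral": 1, "none": 0},
}

def total_sins(findings: dict) -> tuple[int, str]:
    """Sum the six component scores and map the total to a stability category."""
    total = sum(SINS_POINTS[component][grade] for component, grade in findings.items())
    if total <= 6:
        category = "stable"
    elif total <= 12:
        category = "indeterminate"
    else:
        category = "potentially unstable"  # 13-18
    return total, category

# Junctional lytic lesion, mechanical pain, normal alignment, <50% collapse,
# unilateral posterior element involvement -> (11, 'indeterminate')
print(total_sins({
    "location": "junctional", "pain": "mechanical", "bone_lesion": "lytic",
    "alignment": "normal", "collapse": "<50%", "posterior_elements": "unilateral",
}))
```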
2.2. Study Design and Sample
The MRI and CT reports were generated prospectively by experienced, board-certified radiologists specializing in spine oncology, who regularly collaborate in multidisciplinary meetings with spine surgeons, oncologists, and radiotherapists. Only the main body of each radiology report was used for this study, and any concluding statements or partial SINS scoring provided by the reporting radiologists (based solely on MRI or CT criteria without clinical data) were excluded to ensure a consistent, unbiased assessment.
MRI reports followed a standardized format detailing the vertebral level(s) involved, presence of posterior element involvement (unilateral or bilateral), spinal deformity, and degree of pathological vertebral fractures. CT reports emphasized bone lesion characteristics. To evaluate pain history, the radiology request forms and EMR clinical entries (Epic Systems Corporation, Verona, WI, USA) at the time of the MRI and CT requests were retrieved. These entries classified pain as mechanical, occasional, or absent, thereby completing the clinical data required for SINS calculation.
Three experienced readers, comprising two musculoskeletal radiologists (AA and BB, with 3 and 12 years of experience, respectively) and an orthopedic spine surgeon (OSS, with 7 years of experience), used the same MRI, CT, and EMR data that were made available to the LLM to establish a reference standard SINS. First, each reader independently evaluated all cases to assess interobserver agreement. Any discrepancies were then resolved through discussion, and a consensus SINS was established as the reference standard for subsequent analyses.
A flowchart of the study design is provided in Figure 1. Eight clinicians were recruited to calculate SINS, as they were all involved in the management of patients with spinal metastases and surgical referrals (AC, Orthopedics, 2 years of experience; AK, IM/Oncology, 3 years; GL, Orthopedics, 2 years; LX, Orthopedics, 2 years; MY, IM/Oncology, 3 years; SL, Orthopedics, 3 years; YL, Orthopedics, 2 years; JT, Orthopedics, 1 year). They were divided into two groups of four, with Group A (AC, GL, MY, and SL) calculating SINS with AI assistance for a randomly assigned subset of cases, while Group B (AK, LX, YL, and JT) did so for the remaining cases.
Clinicians were recruited from the orthopedics and oncology trainee pool at our institution. All were directly involved in the management of metastatic spine disease and surgical referrals. Recruitment was voluntary, and participants represented a spectrum of postgraduate experience (1–3 years).
The 76 datasets, each representing a unique visit per patient at which an MRI spine, a CT spine, and a clinical consultation were performed, were randomly shuffled and then evenly distributed between the two groups; each group calculated the SINS without AI assistance for the cases it had not been assigned for AI-assisted scoring. This approach was chosen to ensure a balanced distribution of cases and mitigate potential biases, and it minimized any bias from familiarization, as each case contained distinct clinical features. Although a washout period could have been implemented to allow every reader to review all cases both with and without LLM assistance, familiarization with characteristic clinical information would still have been likely, potentially affecting the accuracy of the assessment, which justified the alternating design.
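As a minimal illustration of this allocation, assuming 76 case identifiers and a fixed seed of our own choosing (neither is specified by the study protocol):

```python
# Illustrative case allocation: shuffle the 76 datasets and split evenly;
# the seed and variable names are assumptions, not study parameters.
import random

cases = list(range(1, 77))       # 76 MRI/CT/EMR datasets
random.seed(0)                   # fixed seed for reproducibility (assumed)
random.shuffle(cases)

group_a_with_llm = cases[:38]    # Group A scores these with LLM assistance
group_b_with_llm = cases[38:]    # Group B scores these with LLM assistance
# Each group scores the other half without assistance, so every case is
# assessed under both conditions without the same reader seeing it twice.
```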
The PP-LLM was tasked with calculating the SINS using a chain-of-thought prompt and a few-shot technique. In few-shot learning, the model is provided with a small number of worked examples that demonstrate the expected reasoning process, allowing the LLM to generalize the task without requiring extensive training data. The prompt instructed the LLM to calculate the SINS per spinal level based on the provided clinical history, MRI report, and CT report, as follows:
Extract pain history from the imaging request forms and/or the most recent EMR clinical entry.
Identify lesion location, spinal alignment, vertebral body collapse, and posterior element involvement from the MRI report.
Determine bone lesion quality based on the CT report.
The LLM was asked to calculate the highest SINS per case using this stepwise approach, and 10 cases were presented as examples to guide the LLM’s analysis. The model was explicitly instructed not to fabricate facts during this process, and the temperature was set to 0 (reducing the stochasticity of the outputs).
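A minimal sketch of this setup using the Anthropic Python SDK is shown below; the model string, system prompt wording, and placeholder report variables are our assumptions, since the exact institutional prompt and gateway configuration are not reproduced here.

```python
# Sketch of the chain-of-thought, few-shot prompting setup; the prompt
# wording, model string, and placeholders are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # would point at the institutional PP-LLM gateway

SYSTEM = (
    "Calculate the Spinal Instability Neoplastic Score (SINS) per spinal level. "
    "Step 1: extract pain history from the imaging request form and/or the most "
    "recent EMR clinical entry. Step 2: identify lesion location, spinal "
    "alignment, vertebral body collapse, and posterior element involvement from "
    "the MRI report. Step 3: determine bone lesion quality from the CT report. "
    "Report the highest SINS for the case. Do not invent findings that are not "
    "present in the provided text."
)

few_shot: list[dict] = []  # the study's 10 worked example cases would go here

emr, mri, ct = "...", "...", "..."  # de-identified report texts (placeholders)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed institutional Claude 3.5 build
    max_tokens=1024,
    temperature=0,                       # deterministic output, as in the study
    system=SYSTEM,
    messages=few_shot + [{
        "role": "user",
        "content": f"History:\n{emr}\n\nMRI report:\n{mri}\n\nCT report:\n{ct}",
    }],
)
print(response.content[0].text)
```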
Time for SINS assessment was measured in seconds for each case under supervision by the study team, with and without LLM assistance, to evaluate the impact on time efficiency.
2.3. Statistical Analysis
The primary outcomes were:
Agreement between LLM-assisted SINS and the reference standard.
Agreement between clinician SINS (without LLM assistance) and the reference standard.
Agreement between LLM-predicted SINS (no human intervention) and the reference standard.
The time taken for SINS calculation across these conditions.
Inter-rater agreement for the total SINS was quantified with the single-measures intraclass correlation coefficient (ICC; two-way random-effects model, absolute agreement). ICC values were interpreted as poor (<0.50), moderate (0.50–0.75), good (0.75–0.90), and excellent (>0.90) [12].
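For reference, this statistic can be computed with, for example, the pingouin package, where ICC2 corresponds to the single-measures, two-way random-effects, absolute-agreement model; the toy data below are ours, not study measurements.

```python
# Sketch of the agreement statistic: single-measures ICC, two-way random
# effects, absolute agreement (reported as ICC2 by pingouin). Toy data only.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "case":  [1, 1, 2, 2, 3, 3, 4, 4],       # one row per case x rater
    "rater": ["clinician", "reference"] * 4,
    "sins":  [9, 9, 13, 12, 5, 5, 15, 14],   # illustrative totals
})

icc = pg.intraclass_corr(data=scores, targets="case",
                         raters="rater", ratings="sins")
print(icc.loc[icc["Type"] == "ICC2", ["Type", "ICC", "CI95%"]])
```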
Agreement for each individual SINS component was assessed with Gwet’s kappa (AC1) to avoid the κ-paradox under skewed category distributions. AC1 values were graded using the widely adopted Landis–Koch scale: slight (<0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (>0.80) agreement [13].
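A self-contained sketch of the AC1 computation for two raters over nominal categories follows, written from the published formula rather than from any study code; the example also illustrates why AC1 is preferred when one category dominates.

```python
# Gwet's AC1 for two raters and nominal categories, from the published
# formula: AC1 = (pa - pe) / (1 - pe), with pe = sum_q pi_q(1 - pi_q) / (Q - 1),
# where pi_q is the average marginal proportion of category q across raters.
from collections import Counter

def gwet_ac1(rater1: list, rater2: list) -> float:
    n = len(rater1)
    categories = sorted(set(rater1) | set(rater2))
    q = len(categories)
    pa = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    marginals = Counter(rater1) + Counter(rater2)
    pe = sum((marginals[c] / (2 * n)) * (1 - marginals[c] / (2 * n))
             for c in categories) / (q - 1)                # chance agreement
    return (pa - pe) / (1 - pe)

# Skewed toy distribution (mostly "lytic"): AC1 comes to ~0.944 here, higher
# than Cohen's kappa (~0.78) on the same data, illustrating AC1's robustness
# to skewed marginal distributions.
r1 = ["lytic"] * 18 + ["mixed", "blastic"]
r2 = ["lytic"] * 17 + ["mixed", "mixed", "blastic"]
print(round(gwet_ac1(r1, r2), 3))
```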
For time efficiency, differences in scoring time between the conditions were analyzed using the Mann–Whitney U test due to non-normality of the time data. Statistical significance was set at p < 0.05. Missing data were excluded pairwise in the analysis, and descriptive statistics (e.g., medians and interquartile ranges [IQRs]) were provided for time measurements.
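As an illustration, the timing comparison reduces to a two-sample rank test; the arrays below are toy values, not the study’s measurements.

```python
# Sketch of the timing analysis: Mann-Whitney U on per-case scoring times
# (seconds); values below are illustrative, not study data.
import numpy as np
from scipy.stats import mannwhitneyu

t_llm_assisted = np.array([45, 51, 60, 58, 72, 46, 80, 63])
t_unassisted   = np.array([140, 90, 83, 124, 58, 110, 95, 77])

u_stat, p_value = mannwhitneyu(t_llm_assisted, t_unassisted,
                               alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3f}; medians "
      f"{np.median(t_llm_assisted):.0f} s vs {np.median(t_unassisted):.0f} s")
```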
3. Results
3.1. Demographics
A total of 60 patients with spinal metastases were included in the analysis (Table 2). A total of 80 spinal MRI studies were performed for the 60 patients, with 4 of the 80 studies excluded due to incomplete MRI, CT, or EMR data, leaving 76 studies for analysis. The mean patient age was 63 years (SD ± 9, range 41–83), and just over half of the patients were female (32/60, 53.3%). Most patients were of Chinese ethnicity (47/60, 78.3%), and 95% (57/60) had a pre-operative ECOG score of 0–2. Lung tumors were the most common primary (38.3%), followed by breast (20.0%) and prostate (11.7%). Skeletal Oncology Research Group (SORG) classifications showed a predominance of rapid-growth tumors (40.0%). Regarding neurological function, nearly half of the cohort presented with Frankel grade B or C, highlighting the substantial disease burden.
3.2. Reference Standard and SINS Distribution
Three expert readers independently assessed the Spinal Instability Neoplastic Score (SINS) as outlined in the methods (Table 3). Interobserver agreement for the total SINS was excellent, with an intraclass correlation coefficient (ICC) of 0.999 (95% CI: 0.999–1.000). Agreement across individual SINS categories was almost perfect for all, with the lowest agreement observed in the bone lesion quality category (Gwet’s Kappa: 0.975, 95% CI: 0.939–1.000). Perfect agreement (Gwet’s Kappa: 1.000) was achieved for both location and posterior element involvement. Consensus gradings were assigned for all SINS components.
Table 4 presents the distribution of reference standard SINS in the study cohort. More than half of the included studies (52.6%) were categorized as indeterminate (SINS 7–12), while 30.3% were classified as unstable (SINS 13–18), and 17.1% were considered stable (SINS 0–6).
3.3. LLM-Predicted, LLM-Assisted, and Non-LLM-Assisted SINS Agreement with the Reference Standard
LLM-predicted total scores demonstrated excellent agreement with the reference standard (ICC = 0.990, 95% CI: 0.984–0.993). Clinicians assisted by the LLM achieved an even higher level of agreement (ICC = 0.993, 95% CI: 0.991–0.994). In contrast, clinicians who assessed cases without LLM assistance had a lower, though still excellent, ICC of 0.968 (95% CI: 0.960–0.975), a statistically significant reduction compared to both the LLM-predicted and LLM-assisted approaches (both p < 0.001) (Table 5 and Figure 2).
LLM-predicted scores showed high agreement across most subcategories. In particular, spinal alignment achieved perfect agreement (Gwet’s Kappa = 1.000), while posterior element involvement also demonstrated almost perfect agreement (Gwet’s Kappa = 0.969, 95%CI: 0.925–1.000). Pain scoring showed a similarly high level of agreement (Gwet’s Kappa = 0.962, 95% CI: 0.910–1.000). However, location (Gwet’s Kappa = 0.917, 95%CI: 0.845–0.990) and bone lesion quality (Gwet’s Kappa = 0.867, 95%CI: 0.772–0.963) exhibited comparatively lower agreement, indicating that these parameters may be more challenging for the LLM to interpret.
LLM-assisted scores were significantly superior to non-LLM-assisted scores in the assessment of location (Gwet’s Kappa = 0.958, 95% CI: 0.931–0.986 vs. 0.897, 95% CI: 0.859–0.937, p = 0.005), pain (Gwet’s Kappa = 0.991, 95% CI: 0.978–1.000 vs. 0.942, 95% CI: 0.912–0.972, p < 0.001), bone lesion quality (Gwet’s Kappa = 0.948, 95% CI: 0.917–0.978 vs. 0.901, 95% CI: 0.862–0.939, p = 0.024), spinal alignment (Gwet’s Kappa = 0.996, 95% CI: 0.987–1.000 vs. 0.965, 95% CI: 0.942–0.988, p = 0.011), and vertebral body collapse (Gwet’s Kappa = 0.920, 95% CI: 0.884–0.955 vs. 0.856, 95% CI: 0.810–0.903, p < 0.001). However, no statistically significant difference (p > 0.05) was observed in the assessment of posterior element involvement (Gwet’s Kappa = 0.957, 95% CI: 0.932–0.983 vs. 0.950, 95% CI: 0.922–0.977).
There were no significant differences between LLM-predicted and LLM-assisted scores across subcategories except for bone lesion quality (Gwet’s Kappa = 0.948, 95% CI: 0.917–0.978 vs. 0.867, 95% CI: 0.772–0.963, p = 0.04).
On average, LLM-assisted clinicians outperformed those who assessed cases without assistance. These findings highlight that while human scoring remains reliable across most domains, LLM input enhances agreement with the reference standard in key areas, particularly pain assessment and vertebral body collapse.
3.4. Time Efficiency With and Without LLM Assistance for SINS Calculation
In terms of efficiency, most readers completed SINS scoring faster with LLM assistance, as shown in Table 6 and Figure 3. For instance, TWPJ experienced the greatest reduction in median scoring time, from 140 s to just 45 s, while G also saw a substantial drop, from 89.5 s to 51 s. Across all readers, the LLM-assisted method yielded a median scoring time of 60.0 s (IQR: 46.0–80.0 s), notably lower than the non-LLM-assisted median of 83.0 s (IQR: 58.0–124.0 s), a statistically significant difference (p < 0.001). Furthermore, when the LLM generated its own predicted scores without human involvement, it required only about 5 s.
Table 7 shows the main aggregated results for SINS agreement and time savings using LLM assistance.
4. Discussion
This study assessed the feasibility and effectiveness of an institutional privacy-preserving large language model (LLM) for assisting clinicians with the calculation of the Spinal Instability Neoplastic Score (SINS). We found that both the LLM-predicted and LLM-assisted SINS calculations demonstrated excellent agreement with the reference standard (intraclass correlation coefficients [ICCs] = 0.990 and 0.993, respectively). These values were notably higher than those for the non-LLM-assisted scores (ICC = 0.968). Although the LLM-assisted approach yielded a slightly higher ICC than the LLM-predicted approach, the difference was not statistically significant (p > 0.05), highlighting that both methods were highly accurate. Additionally, LLM-assisted scoring showed a significant reduction in calculation time, shortening the median time required by 23 s compared to the non-LLM-assisted approach (p < 0.001). Importantly, the fully automated LLM-predicted approach was the fastest by far (approximately 5 s), providing the greatest potential for time savings in clinical workflows. Although the ICC improvements were statistically significant, the absolute differences appear small. In aggregate, however, even modest gains in reliability may enhance confidence in multicenter trials, reduce interobserver variability in research, and support decision-making consistency in high-volume clinical workflows.
Subgroup analyses highlighted how the LLM benefits individual components of the SINS. Vertebral body collapse showed the greatest improvement in agreement with the reference standard, moving from non-LLM (Gwet’s Kappa = 0.856) to LLM-assisted scoring (Gwet’s Kappa = 0.920), and location and pain assessments also improved substantially (from 0.897 to 0.958 and from 0.942 to 0.991, respectively). Interestingly, although LLM-predicted scores were similar or superior to human-only assessments (e.g., alignment, Gwet’s Kappa = 1.000; posterior element involvement, Gwet’s Kappa = 0.969), they did not consistently outperform LLM-assisted scoring for more nuanced categories such as vertebral body collapse or pain. For example, vertebral body collapse requires distinguishing between thresholds of <50% versus >50% height loss, or cases with no collapse but extensive vertebral involvement. These cut-offs are clinically meaningful yet sometimes difficult to determine, especially when radiology reports use terms like “mild” or “significant” compression without quantitative detail. Similarly, pain scoring depends on whether symptoms are mechanical, which requires clear documentation of positional or loading-related exacerbation. In practice, clinical notes often provide limited or ambiguous descriptions, leading to variability in interpretation. Both categories therefore demand more interpretative judgment, which contributes to lower consistency across raters and algorithms compared with other SINS domains. This suggests that while the LLM excels in structured, objective measures, clinician oversight may remain important for more complex or subjective elements.
Targeted enhancements in training data and algorithmic design, potentially through advanced vision-language models that combine text with MRI and CT imaging, may be needed to improve LLM performance in these areas. Further development of vision-language models may also help improve the efficiency of automated triage, whereby a draft SINS calculation could be generated automatically at the point of scan completion and be ready for radiologist review at the time of reporting. These models could even be combined with existing algorithms for epidural spinal cord compression across both cross-sectional modalities [14,15].
These results align with prior research suggesting that AI-driven tools can enhance decision-making in clinical practice, particularly when evaluating structured tasks such as severity scoring and risk assessment. For example, McDuff et al. [16] similarly reported improved diagnostic accuracy with LLM assistance for diagnostic reasoning. In our analysis, the synergy between human oversight and automated suggestions appeared valuable, as clinicians could validate or refine the LLM recommendations, which mitigated potential AI-driven errors. This human-in-the-loop model is consistent with earlier work indicating that AI–human collaboration often yields superior outcomes compared to standalone AI systems [17,18]. The strong performance of purely LLM-predicted SINS further underscores the capacity for automated triage, in which high-risk or unstable cases could be flagged immediately after radiologic interpretation for expedited review by spine oncology teams. Allowing fully autonomous SINS calculation could lead to the greatest productivity gains, enabling earlier identification and referral of cases to spine oncologists. However, the safety of such an approach relies on adherence to local clinical protocols, which must ensure appropriate oversight and verification by healthcare professionals [11].
However, results have been mixed across other studies. For example, a study by Goh and colleagues [19] found no significant benefit of LLM assistance in diagnostic reasoning, although their study also showed that the LLM alone outperformed unaided clinicians. These variations may reflect differences in LLM deployment strategies and prompt optimization, as our study employed carefully designed prompts to maximize output quality. Similar dynamics have been reported in the computer vision domain: a study by Agarwal and colleagues [9] demonstrated that radiologists underutilized AI input when interpreting chest radiographs, resulting in lower performance with AI assistance compared to AI alone. These findings echo the argument by Rajpurkar and Topol [10] that AI may outperform doctors in specific tasks, challenging the assumption that human–AI collaboration naturally yields superior outcomes.
AI has shown promise when deployed independently for high-volume screening tasks, such as identifying normal chest radiographs [20] and assisting mammography screening [21], where AI improved cancer detection while reducing workload. These findings highlight the potential of task-specific AI deployment, where fully autonomous AI handles routine evaluations and human oversight is reserved for complex cases. However, not all AI triage applications have demonstrated clear benefits. For example, a study by Savage and colleagues [22] reported that an AI triage system for intracranial hemorrhage (ICH) detection did not improve diagnostic performance or reporting turnaround times, emphasizing that the effectiveness of AI integration is highly dependent on the clinical setting and workflow design. These mixed outcomes reinforce the importance of rigorous, task-specific evaluation before clinical implementation.
Our study contributes to this evolving landscape by showing that LLM-assisted and LLM-predicted SINS scoring both outperform conventional methods, with LLM-predicted scoring offering the fastest performance and LLM-assisted scoring providing the highest accuracy, especially for subjective elements. Recent examination-based studies further support the potential of LLMs to meet or exceed trainee-level performance across various medical domains [23]. For example, Watari et al. [24] showed that GPT-4 matched the average score of Japanese medical residents on a nationwide in-training examination, while Zheng et al. [25] demonstrated that a domain-specific ophthalmology LLM achieved performance comparable to senior residents on clinical vignette questions.
Despite these promising outcomes, several limitations must be acknowledged. First, the single-center retrospective design may limit generalizability to other clinical settings or patient populations. Structured reporting formats used at our institution may differ from those in other settings, and variability in pain documentation across electronic medical record systems could also affect reproducibility. Our single-center cohort was predominantly composed of Chinese patients with lung cancer primaries, which may limit external applicability; multicenter validation will be necessary to confirm the robustness of these findings across diverse populations and reporting environments. Second, we did not evaluate how the use of LLM assistance ultimately affects downstream decisions (e.g., surgical planning or referral to radiotherapy), leaving the real-world impact on patient outcomes uncertain. Third, the performance of LLMs in SINS calculation may also be influenced by model size and architecture; more powerful models such as GPT-4o or Gemini 1.5 could potentially achieve superior accuracy, and future studies should examine how scaling influences performance, efficiency, and consistency across different clinical settings. Fourth, the clinicians in the study may have been familiar with local reporting formats, which may have introduced systematic bias, and results may not fully generalize to clinicians from other institutions or specialties. Last, our dataset excluded cases with incomplete imaging or EMR data, and we did not perform a separate analysis of incomplete or ambiguous reports. While this decision ensured consistency in benchmarking, it limits insight into how LLMs perform under real-world conditions, where reports are often partial or incomplete. Future studies should specifically evaluate this subgroup to better reflect practical clinical workflows.
Future work can examine optimal methods for integrating these AI tools into routine clinical practice. For example, automated SINS scoring could be embedded into structured report templates, enabling real-time feedback to clinicians. LLMs could also be used to triage cases by flagging those likely to be unstable, allowing prioritization in busy reporting environments. A human-in-the-loop model, integrated with PACS or EMR systems, may represent a safe and efficient pathway to adoption. Robust governance and accountability frameworks will be essential for clinical deployment. Multicenter prospective studies should validate our findings across broader settings, assess long-term effects on patient outcomes, and explore cost-effectiveness, resource utilization (e.g., initial radiotherapy vs. surgery), and time-to-treatment metrics. Continued investigation of practical implementation and real-world benefits remains essential [26]. Further AI development must involve all stakeholders to overcome the multiple challenges facing AI deployment and research [27,28]. Spine oncologists, in particular, need to steer this transformation to preserve clinical effectiveness and ensure a positive impact on patient care [29,30].