Search Results (1,058)

Search Parameters:
Keywords = any language

20 pages, 13884 KB  
Article
Prototype-Guided Zero-Shot Medical Image Segmentation with Large Vision-Language Models
by Huong Pham and Samuel Cheng
Appl. Sci. 2025, 15(21), 11441; https://doi.org/10.3390/app152111441 (registering DOI) - 26 Oct 2025
Abstract
Building on advances in promptable segmentation models, this work introduces a framework that integrates Large Vision-Language Model (LVLM) bounding box priors with prototype-based region of interest (ROI) selection to improve zero-shot medical image segmentation. Unlike prior methods such as SaLIP, which often misidentify regions due to reliance on text–image CLIP similarity, the proposed approach leverages visual prototypes to mitigate language bias and enhance ROI ranking, resulting in more accurate segmentation. Bounding box estimation is further strengthened through systematic prompt engineering to optimize LVLM performance across diverse datasets and imaging modalities. Evaluation was conducted on three publicly available benchmark datasets—CC359 (brain MRI), HC18 (fetal head ultrasound), and CXRMAL (chest X-ray)—without any task-specific fine-tuning. The proposed method achieved substantial improvements over prior approaches. On CC359, it reached a Dice score of 0.95 ± 0.06 and a mean Intersection-over-Union (mIoU) of 0.91 ± 0.10. On HC18, it attained a Dice score of 0.82 ± 0.20 and mIoU of 0.74 ± 0.22. On CXRMAL, the model achieved a Dice score of 0.90 ± 0.08 and mIoU of 0.83 ± 0.12. These standard deviations reflect variability across test images within each dataset, indicating the robustness of the proposed zero-shot framework. These results demonstrate that integrating LVLM-derived bounding box priors with prototype-based selection substantially advances zero-shot medical image segmentation. Full article
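The entry above reports Dice and mean Intersection-over-Union (mIoU) scores. As a point of reference only (this helper is not from the paper's code), the two metrics can be computed for binary masks like this:

```python
def dice_iou(pred, target):
    """Dice and IoU for binary masks given as flat 0/1 sequences.

    Illustrative sketch: Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|.
    Empty masks are scored as a perfect match by convention.
    """
    inter = sum(p and t for p, t in zip(pred, target))  # |A ∩ B|
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter                       # |A ∪ B|
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

Averaging `iou` over a test set gives the mIoU figures quoted above; the reported ± values are standard deviations of the per-image scores.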

17 pages, 402 KB  
Article
Training a Team of Language Models as Options to Build an SQL-Based Memory
by Seokhan Lee and Hanseok Ko
Appl. Sci. 2025, 15(21), 11399; https://doi.org/10.3390/app152111399 (registering DOI) - 24 Oct 2025
Viewed by 91
Abstract
Despite the rapid progress in the capabilities of large language models, they still lack a reliable and efficient method of storing and retrieving new information conveyed over the course of their interaction with users upon deployment. In this paper, we use reinforcement learning methods to train a team of smaller language models, which we frame as options, on reward-respecting subtasks, to learn to use SQL commands to store and retrieve relevant information to and from an external SQL database. In particular, we train a storage language model on a subtask for distinguishing between user and assistant in the dialogue history, to learn to store any relevant facts that may be required to answer future user queries. We then train a retrieval language model on a subtask for querying a sufficient number of fields, to learn to retrieve information from the SQL database that could be useful in answering the current user query. We find that training our models on their respective subtasks results in much higher performance than training them to directly optimize the reward signal and that the resulting team of language models is able to achieve performance on memory tasks comparable to existing methods that rely on language models orders of magnitude larger in size. In particular, we were able to achieve a 36% gain in accuracy over a prompt engineering baseline and a 13% gain over a strong baseline that uses the much larger GPT-3.5 Turbo on the MSC-Self-Instruct dataset. Full article
(This article belongs to the Topic Challenges and Solutions in Large Language Models)

15 pages, 2174 KB  
Article
BoxingPro: An IoT-LLM Framework for Automated Boxing Coaching via Wearable Sensor Data Fusion
by Man Zhu, Pengfei Huang, Xiaolong Xu, Houpeng He and Lijie Zhang
Electronics 2025, 14(21), 4155; https://doi.org/10.3390/electronics14214155 - 23 Oct 2025
Viewed by 218
Abstract
The convergence of Internet of Things (IoT) and Artificial Intelligence (AI) has enabled personalized sports coaching, yet a significant gap remains: translating low-level sensor data into high-level, contextualized feedback. Large Language Models (LLMs) excel at reasoning and instruction but lack a native understanding of physical kinematics. This paper introduces BoxingPro, a novel framework that bridges this semantic gap by fusing wearable sensor data with LLMs for automated boxing coaching. Our core contribution is a dedicated translation methodology that converts multi-modal time-series data (IMU) and visual data (video) into structured linguistic prompts, enabling off-the-shelf LLMs to perform sophisticated biomechanical reasoning without extensive retraining. Our evaluation with professional boxers showed that the generated feedback achieved an average expert rating of over 4.0/5.0 on key criteria like biomechanical correctness and actionability. This work establishes a new paradigm for integrating sensor-based systems with LLMs, with potential applications extending far beyond boxing to any domain requiring physical skill assessment. Full article
(This article belongs to the Special Issue Techniques and Applications in Prompt Engineering and Generative AI)

22 pages, 698 KB  
Review
Oral Health Impact Profile (OHIP) as a Tool for the Assessment of the Oral Health-Related Quality of Life—A Scoping Review
by Łukasz Wojszko, Karolina Banaszek, Oliwia Gagacka and Joanna Bagińska
Dent. J. 2025, 13(11), 490; https://doi.org/10.3390/dj13110490 - 23 Oct 2025
Viewed by 197
Abstract
Background/Objectives: The Oral Health Impact Profile (OHIP) is the most widely used tool for OHRQoL assessment. The measure has several versions, but there is no comprehensive summary of available Oral Health Impact Profile variants. The purpose of this scoping review is to identify and summarize Oral Health Impact Profile versions for the adult population available in the literature. Methods: PubMed, Scopus, and Web of Science databases were searched on 25–28 May 2025 to find papers presenting the Oral Health Impact Profile versions’ development process. Records written in English without any time restrictions were included. The Joanna Briggs Institute framework for scoping reviews was applied. The PRISMA-ScR approach was followed. Results: In total, 11 generic OHIP scales (OHIP versions not targeted at any specific condition) and 16 condition-specific OHIP scales were found. The analysis revealed wide variation in the number of items (from 5 to 49), recall period (from one week to one year), rating scale (4-0; 5-0; 5-1; 6-1; 1, 0, and −1), dimensionality of scale (7, 4, or 3 dimensions, 2–6 factors, or unidimensional), and validation process. Conclusions: Differences in OHIP features have to be taken into account when comparing results from different studies. Given the availability of various tools, the creation of new versions of the OHIP should be considered with caution. Researchers should carefully select the appropriate OHIP version for their purposes, as the process of adapting the tool to a new language and culture is time-consuming and expensive. Full article
(This article belongs to the Special Issue Oral Health-Related Quality of Life and Its Determinants)

17 pages, 2618 KB  
Article
Optimizer-Aware Fine-Tuning of Whisper Small with Low-Rank Adaption: An Empirical Study of Adam and AdamW
by Hadia Arshad, Tahir Abdullah, Mariam Rehman, Afzaal Hussain, Faria Kanwal and Mehwish Parveen
Information 2025, 16(11), 928; https://doi.org/10.3390/info16110928 - 22 Oct 2025
Viewed by 157
Abstract
Whisper is a transformer-based multilingual model that has demonstrated state-of-the-art performance across numerous languages. However, fine-tuning it efficiently remains challenging under limited computational resources. To address this issue, an experiment was performed on librispeech-train-clean-100 for training purposes. The test-clean set was utilized to evaluate its performance. To enhance efficiency and meet computational constraints, a parameter-efficient fine-tuning technique, Low-Rank Adaptation, was employed to add a limited number of trainable parameters to the frozen layers of the model. The results showed that Low-Rank Adaptation attained excellent Automatic Speech Recognition results while using fewer computational resources, showing its effectiveness for resource-saving adaptation. The research work emphasizes the promise of Low-Rank Adaptation as a lightweight and scalable fine-tuning strategy for large transformer-based speech models. The baseline Whisper Small model achieved a word error rate of 16.7% without any parameter-efficient adaptation. In contrast, the Low-Rank Adaptation fine-tuned model achieved a lower word error rate of 6.08%, demonstrating the adaptability of the proposed parameter-efficient approach. Full article
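To give a sense of why Low-Rank Adaptation is parameter-efficient, here is a toy sketch of the core idea: the frozen weight W is augmented with a trainable low-rank product B·A. The class, dimensions, rank, and initialization below are illustrative assumptions, not the authors' Whisper implementation:

```python
import math
import random

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update, y = Wx + (alpha/r) * B(Ax).

    Toy illustration of LoRA: only A (r x d_in) and B (d_out x r) are trained;
    B starts at zero, so the adapted layer initially matches the frozen base.
    """
    def __init__(self, d_in, d_out, r=8, alpha=16):
        self.W = [[0.0] * d_in for _ in range(d_out)]          # frozen base weight
        self.A = [[random.gauss(0, 1 / math.sqrt(r)) for _ in range(d_in)]
                  for _ in range(r)]                           # trainable down-projection
        self.B = [[0.0] * r for _ in range(d_out)]             # trainable up-projection, init 0
        self.scale = alpha / r

    def forward(self, x):
        # With B = 0 this reduces to the frozen layer's output W x.
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        bax = [sum(b * h for b, h in zip(row, ax)) for row in self.B]
        wx = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        return [w + self.scale * b for w, b in zip(wx, bax)]

    def trainable_params(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)

    def frozen_params(self):
        return sum(len(row) for row in self.W)
```

For a hypothetical 768×768 projection with rank r = 8, the adapter trains 2 × 8 × 768 = 12,288 parameters against 589,824 frozen ones, roughly 2% — which is the kind of saving that makes fine-tuning feasible on modest hardware.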

29 pages, 549 KB  
Article
Catch Me If You Can: Rogue AI Detection and Correction at Scale
by Fatemeh Stodt, Jan Stodt, Mohammed Alshawki, Javad Salimi Sratakhti and Christoph Reich
Electronics 2025, 14(20), 4122; https://doi.org/10.3390/electronics14204122 - 21 Oct 2025
Viewed by 225
Abstract
Modern AI systems can strategically misreport information when incentives diverge from truthfulness, posing risks for oversight and deployment. Prior studies often examine this behavior within a single paradigm; systematic, cross-architecture evidence under a unified protocol has been limited. We introduce the Strategy Elicitation Battery (SEB), a standardized probe suite for measuring deceptive reporting across large language models (LLMs), reinforcement-learning agents, vision-only classifiers, multimodal encoders, state-space models, and diffusion models. SEB uses Bayesian inference tasks with persona-controlled instructions, schema-constrained outputs, deterministic decoding where supported, and a probe mix (near-threshold, repeats, neutralized, cross-checks). Estimates use clustered bootstrap intervals, and significance is assessed with a logistic regression by architecture; a mixed-effects analysis is planned once the per-round agent/episode traces are exported. On the latest pre-correction runs, SEB shows a consistent cross-architecture pattern in deception rates: ViT 80.0%, CLIP 15.0%, Mamba 10.0%, RL agents 10.0%, Stable Diffusion 10.0%, and LLMs 5.0% (20 scenarios/architecture). A logistic regression on per-scenario flags finds a significant overall architecture effect (likelihood-ratio test vs. intercept-only: χ²(5) = 41.56, p = 7.22×10⁻⁸). Holm-adjusted contrasts indicate ViT is significantly higher than all other architectures in this snapshot; the remaining pairs are not significant. Post-correction acceptance decisions are evaluated separately using residual deception and override rates under SEB-Correct. Latency varies by architecture (sub-second to minutes), enabling pre-deployment screening broadly and real-time auditing for low-latency classes.
Results indicate that SEB-Detect deception flags are not confined to any one paradigm, that distinct architectures can converge to similar levels under a common interface, and that reporting interfaces and incentive framing are central levers for mitigation. We operationalize “deception” as reward-sensitive misreport flags, and we separate detection from intervention via a correction wrapper (SEB-Correct), supporting principled acceptance decisions for deployment. Full article

19 pages, 448 KB  
Article
From Policy to Practice: Challenges and Opportunities in Bilingual Preschool Education in Georgia (Sakartvelo)
by Gulnara Bibileishvili
Educ. Sci. 2025, 15(10), 1340; https://doi.org/10.3390/educsci15101340 - 9 Oct 2025
Viewed by 401
Abstract
In Georgia (Sakartvelo), a program promoting bilingual education in preschool institutions was formally adopted in 2020. It aligns with the objectives of the 2021–2030 State Strategy for Civic Equality and Integration Plan, which envisions a comprehensive reform of bilingual education across Georgia’s regions. Any reform requires research and evaluation to measure how effectively it is being implemented and whether the intended outcomes have been achieved. The bilingual education initiative pursues a dual objective: to preserve the native languages of minority communities while ensuring effective acquisition of the state language. This dual mandate is intrinsically linked to state language policy and constitutes a sensitive issue for local communities, parents, and preschool administrators, thereby necessitating a careful and nuanced approach. The present study analyzed the readiness of the social environment to support the implementation of bilingual education programs at the preschool level in the regions of Georgia in which ethnic minorities live side by side. Research was carried out in two ethnically diverse regions—Kvemo Kartli and Samtskhe–Javakheti. The author conducted individual and group interviews, and the elicited data were analyzed with the help of content and thematic analyses. This study examines key attributes of the ongoing preschool reform to identify factors that facilitate the effective implementation of early bilingual education initiatives. The findings highlight both commonalities and regional variations in parental attitudes toward the bilingual education reform. Full article
(This article belongs to the Special Issue Innovation and Design in Multilingual Education)

24 pages, 323 KB  
Article
Data-Leakage-Aware Preoperative Prediction of Postoperative Complications from Structured Data and Preoperative Clinical Notes
by Anastasia Amanatidis, Kyle Egan, Kusuma Nio and Milan Toma
Surgeries 2025, 6(4), 87; https://doi.org/10.3390/surgeries6040087 - 9 Oct 2025
Viewed by 333
Abstract
Background/Objectives: Machine learning has been suggested as a way to improve how we predict anesthesia-related complications after surgery. However, many studies report overly optimistic results due to issues like data leakage and not fully using information from clinical notes. This study provides a transparent comparison of different machine learning models using both structured data and preoperative notes, with a focus on avoiding data leakage and involving clinicians throughout. We show how high reported metrics in the literature can result from methodological pitfalls and may not be clinically meaningful. Methods: We used a dataset containing both structured patient and surgery information and preoperative clinical notes. To avoid data leakage, we excluded any variables that could directly reveal the outcome. The data was cleaned and processed, and information from clinical notes was summarized into features suitable for modeling. We tested a range of machine learning methods, including simple, tree-based, and modern language-based models. Models were evaluated using a standard split of the data and cross-validation, and we addressed class imbalance with sampling techniques. Results: All models showed only modest ability to distinguish between patients with and without complications. The best performance was achieved by a simple model using both structured and summarized text features, with an area under the curve of 0.644 and accuracy of 60%. Other models, including those using advanced language techniques, performed similarly or slightly worse. Adding information from clinical notes gave small improvements, but no single type of data dominated. Overall, the results did not reach the high levels reported in some previous studies. Conclusions: In this analysis, machine learning models using both structured and unstructured preoperative data achieved only modest predictive performance for postoperative complications. 
These findings highlight the importance of transparent methodology and clinical oversight to avoid data leakage and inflated results. Future progress will require better control of data leakage, richer data sources, and external validation to develop clinically useful prediction tools. Full article

13 pages, 705 KB  
Protocol
The Silent Cognitive Burden of Chronic Pain: Protocol for an AI-Enhanced Living Dose–Response Bayesian Meta-Analysis
by Kevin Pacheco-Barrios, Rafaela Machado Filardi, Edward Yoon, Luis Fernando Gonzalez-Gonzalez, Joao Victor Ribeiro, Joao Pedro Perin, Paulo S. de Melo, Marianna Leite, Luisa Silva and Alba Navarro-Flores
J. Clin. Med. 2025, 14(19), 7030; https://doi.org/10.3390/jcm14197030 - 4 Oct 2025
Viewed by 478
Abstract
Background: Chronic pain affects nearly one in five adults worldwide and is increasingly recognized not only as a disease but as a potential risk factor for neurocognitive decline and dementia. While some evidence supports this association, existing systematic reviews are static and rapidly outdated, and none have leveraged advanced methods for continuous updating and robust uncertainty modeling. Objective: This protocol describes a living systematic review with dose–response Bayesian meta-analysis, enhanced by artificial intelligence (AI) tools, to synthesize and maintain up-to-date evidence on the prospective association between any type of chronic pain and subsequent cognitive decline. Methods: We will systematically search PubMed, Embase, Web of Science, and preprint servers for prospective cohort studies evaluating chronic pain as an exposure and cognitive decline as an outcome. Screening will be semi-automated using natural language processing models (ASReview), with human oversight for quality control. Bayesian hierarchical meta-analysis will estimate pooled effect sizes and accommodate between-study heterogeneity. Meta-regression will explore study-level moderators such as pain type, severity, and cognitive domain assessed. If data permit, a dose–response meta-analysis will be conducted. Living updates will occur biannually using AI-enhanced workflows, with results transparently disseminated through preprints and peer-reviewed updates. Results: This is a protocol; results will be disseminated in future reports. Conclusions: This living Bayesian systematic review aims to provide continuously updated, methodologically rigorous evidence on the link between chronic pain and cognitive decline. The approach integrates innovative AI tools and advanced meta-analytic methods, offering a template for future living evidence syntheses in neurology and pain research. Full article
(This article belongs to the Section Anesthesiology)

15 pages, 1245 KB  
Article
Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues
by Daniel Grabowski, Kamila Łuczaj and Khalid Saeed
Sensors 2025, 25(19), 6086; https://doi.org/10.3390/s25196086 - 2 Oct 2025
Viewed by 481
Abstract
Advances in multimodal artificial intelligence enable new sensor-inspired approaches to lie detection by combining behavioral perception with generative reasoning. This study presents a deception detection framework that integrates deep video and audio processing with large language models guided by chain-of-thought (CoT) prompting. We interpret neural architectures such as ViViT (for video) and HuBERT (for speech) as digital behavioral sensors that extract implicit emotional and cognitive cues, including micro-expressions, vocal stress, and timing irregularities. We further incorporate a GPT-5-based prompt-level fusion approach for video–language–emotion alignment and zero-shot inference. This method jointly processes visual frames, textual transcripts, and emotion recognition outputs, enabling the system to generate interpretable deception hypotheses without any task-specific fine-tuning. Facial expressions are treated as high-resolution affective signals captured via visual sensors, while audio encodes prosodic markers of stress. Our experimental setup is based on the DOLOS dataset, which provides high-quality multimodal recordings of deceptive and truthful behavior. We also evaluate a continual learning setup that transfers emotional understanding to deception classification. Results indicate that multimodal fusion and CoT-based reasoning increase classification accuracy and interpretability. The proposed system bridges the gap between raw behavioral data and semantic inference, laying a foundation for AI-driven lie detection with interpretable sensor analogues. Full article
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)

22 pages, 2016 KB  
Review
Human-Centred Design (HCD) in Enhancing Dementia Care Through Assistive Technologies: A Scoping Review
by Fanke Peng, Kate Little and Lin Liu
Digital 2025, 5(4), 51; https://doi.org/10.3390/digital5040051 - 2 Oct 2025
Viewed by 604
Abstract
Background: Dementia is a progressive neurodegenerative condition that impairs cognitive functions such as memory, language comprehension, and problem-solving. Assistive technologies can provide vital support at various stages of dementia, significantly improving the quality of life by aiding daily activities and care. However, for these technologies to be effective and widely adopted, a human-centred design (HCD) approach is essential to both their development and evaluation. Objectives: This scoping review aims to explore how HCD principles have been applied in the design of assistive technologies for people with dementia and to identify the extent and nature of their involvement in the design process. Eligibility Criteria: Studies published between 2017 and 2025 were included if they applied HCD methods in the design of assistive technologies for individuals at any stage of dementia. Priority was given to studies that directly involved people with dementia in the design or evaluation process. Sources of Evidence: A systematic search was conducted across the Web of Science, JSTOR, Scopus, and ProQuest databases. Charting Methods: Articles were screened in two stages: title/abstract screening (n = 350) and full-text review (n = 89). Data from eligible studies (n = 49) were extracted and thematically analysed to identify design approaches, types of technologies, and user involvement. Results: The 49 included studies covered a variety of assistive technologies, such as robotic systems, augmented and virtual reality tools, mobile applications, and Internet of Things (IoT) devices. A wide range of HCD approaches were employed, with varying degrees of user involvement. Conclusions: HCD plays a critical role in enhancing the development and effectiveness of assistive technologies for dementia care.
The review underscores the importance of involving people with dementia and their carers in the design process to ensure that solutions are practical, meaningful, and capable of improving quality of life. However, several key gaps remain. There is no standardised HCD framework for healthcare, stakeholder involvement is often inconsistent, and evidence on real-world impact is limited. Addressing these gaps is crucial to advancing the field and delivering scalable, sustainable innovations. Full article

10 pages, 294 KB  
Article
Performance Differences Between Spanish AzBio and Latin American HINT: Implications for Test Selection
by Chrisanda Marie Sanchez, Jennifer Coto, Sandra Velandia, Ivette Cejas and Meredith A. Holcomb
Audiol. Res. 2025, 15(5), 129; https://doi.org/10.3390/audiolres15050129 - 2 Oct 2025
Viewed by 224
Abstract
Background/Objectives: Spanish-speaking patients face persistent barriers in accessing equitable audiological care, particularly when standardized language-appropriate tools are lacking. Two Spanish-language sentence recognition tests, the Spanish AzBio Sentence (SAzB) and the Latin American Hearing in Noise Test (LAH), are commonly used to evaluate speech perception in adults with hearing loss. However, performance differences between these measures may influence referral decisions for hearing intervention, such as cochlear implantation. This study compared test performance under varying noise and spatial conditions to guide appropriate test selection and reduce the risk of misclassification that may contribute to healthcare disparities. Methods: Twenty-one bilingual Spanish/English speaking adults with normal bilateral hearing completed speech perception testing using both the SAzB and LAH. Testing was conducted under two spatial configurations: (1) speech and noise presented from the front (0° azimuth) and (2) speech to the simulated poorer ear and noise to the better ear (90°/270° azimuth). Conditions included quiet and three signal-to-noise ratios (+10, +5, and 0 dB). Analyses included paired t-tests and one-way ANOVAs. Results: Participants scored significantly higher on the LAH than on the SAzB across all SNR conditions and configurations, with ceiling effects observed for the LAH. SAzB scores varied by language dominance, while LAH scores did not. No other differences were observed based on any further demographic information. Conclusions: The SAzB provides a more challenging and informative assessment of speech perception in noise. Relying on easier tests like the LAH may obscure real-world difficulties and delay appropriate referrals for hearing loss intervention, including cochlear implant evaluation. Selecting the most appropriate test is critical to avoiding under-referral and ensuring Spanish-speaking patients receive equitable and accurate care. 
Full article
(This article belongs to the Section Speech and Language)

17 pages, 3363 KB  
Article
Social-LLM: Modeling User Behavior at Scale Using Language Models and Social Network Data
by Julie Jiang and Emilio Ferrara
Sci 2025, 7(4), 138; https://doi.org/10.3390/sci7040138 - 2 Oct 2025
Viewed by 681
Abstract
The proliferation of social network data has unlocked unprecedented opportunities for extensive, data-driven exploration of human behavior. The structural intricacies of social networks offer insights into various computational social science issues, particularly concerning social influence and information diffusion. However, modeling large-scale social network data comes with computational challenges. Though large language models make it easier than ever to model textual content, even advanced network representation methods struggle with scalability and efficient deployment to out-of-sample users. In response, we introduce a novel approach tailored for modeling social network data in user-detection tasks. This innovative method integrates localized social network interactions with the capabilities of large language models. Operating under the premise of social network homophily, which posits that socially connected users share similarities, our approach is designed with scalability and inductive capabilities in mind, avoiding the need for full-graph training. We conduct a thorough evaluation of our method across seven real-world social network datasets, spanning a diverse range of topics and detection tasks, showcasing its applicability to advance research in computational social science. Full article
(This article belongs to the Topic Social Computing and Social Network Analysis)

3 pages, 726 KB  
Interesting Images
Unilateral Vocal Cord Paralysis Diagnosed with Dynamic Digital Radiography
by Michaela Cellina
Diagnostics 2025, 15(19), 2502; https://doi.org/10.3390/diagnostics15192502 - 1 Oct 2025
Viewed by 433
Abstract
Flexible laryngoscopy (FL) is the standard diagnostic tool for vocal cord paralysis (VCP), but it involves patient discomfort, and its interpretation is subjective and operator-dependent. Dynamic digital radiography (DDR) is a novel imaging technique that acquires high-resolution sequential radiographs at a low radiation dose. While DDR has been widely applied in chest and diaphragmatic imaging, its use for laryngeal motion analysis has been poorly investigated. We present the case of a 50-year-old male referred for Computed Tomography (CT) of the neck and chest for suspected vocal cord paralysis. The referring physician did not specify the side of the suspected paralysis. Due to a language barrier and the absence of prior documentation, a detailed history could not be obtained. To assess vocal cord motion, we performed, for the first time in our Institution, a DDR study of the neck. During phonation maneuvers, DDR demonstrated fixation of the left vocal cord in an adducted paramedian position. CT confirmed this finding and did not highlight any further anomaly. This case demonstrates the feasibility of DDR as a low-cost, low-dose, non-invasive technique for functional evaluation of the larynx and may represent a valuable complementary imaging tool in laryngeal functional assessment.
(This article belongs to the Section Medical Imaging and Theranostics)

15 pages, 812 KB  
Article
Large Language Model (LLM)-Predicted and LLM-Assisted Calculation of the Spinal Instability Neoplastic Score (SINS) Improves Clinician Accuracy and Efficiency
by Matthew Ding Zhou Chan, Calvin Kai En Tjio, Tammy Li Yi Chan, Yi Liang Tan, Alynna Xu Ying Chua, Sammy Khin Yee Loh, Gabriel Zi Hui Leow, Ming Ying Gan, Xinyi Lim, Amanda Kexin Choo, Yu Liu, Jonathan Wen Po Tan, Ee Chin Teo, Qai Ven Yap, Ting Yonghan, Andrew Makmur, Naresh Kumar, Jiong Hao Tan and James Thomas Patrick Decourcy Hallinan
Cancers 2025, 17(19), 3198; https://doi.org/10.3390/cancers17193198 - 30 Sep 2025
Viewed by 367
Abstract
Background: The Spinal Instability Neoplastic Score (SINS) guides treatment for patients with spinal tumors, but issues arise with complexity, interobserver variability, and time demands. Large language models (LLMs) may help overcome these limitations. Objectives: This study evaluates the accuracy and efficiency of a privacy-preserving LLM (PP-LLM) for SINS calculation, with and without clinician involvement, to assess its feasibility as a clinical decision-support tool. Methods: This retrospective observational study was granted a Domain-Specific Review Board waiver owing to minimal risk. Patients from 2020 to 2022 were included. A PP-LLM was employed to maintain secure handling of patient data. A consensus SINS reference standard was established by musculoskeletal radiologists and an orthopedic surgeon. Eight orthopedic and oncology trainees were divided into two groups to calculate SINS, with and without PP-LLM assistance. LLM-predicted scores were also generated independently of any human input. Results: The main outcomes were agreement with the reference standard (measured by intraclass correlation coefficients [ICCs]) and time required for SINS calculation. The LLM-assisted method achieved excellent agreement (ICC = 0.993, 95%CI = 0.991–0.994), closely followed by the LLM-predicted approach (ICC = 0.990, 95%CI = 0.984–0.993). Clinicians working without LLM support showed a significantly lower ICC compared to both LLM methods (0.968, 95%CI = 0.960–0.975) (both p < 0.001). The LLM alone produced scores in approximately 5 s, while the median scoring time for LLM-assisted clinicians was 60.0 s (IQR = 46.0–80.0), notably shorter than the 83.0 s (IQR = 58.0–124.0) required without LLM assistance. Conclusions: An LLM-based approach, whether used autonomously or in conjunction with clinical expertise, enhances both accuracy and efficiency in SINS calculation. Adopting this technology may streamline oncologic workflows and facilitate more timely interventions for patients with spinal metastases.
