Search Results (2,565)

Search Parameters:
Keywords = LLM-105

20 pages, 1516 KB  
Article
Unlikely Storyteller: Leveraging Narrative-Based Communication in LLM-Generated Medical Advice
by Fan Wang, Ningshen Wang, Weiming Xu and Peng Zhang
Healthcare 2026, 14(8), 1015; https://doi.org/10.3390/healthcare14081015 (registering DOI) - 13 Apr 2026
Abstract
Background/Objectives: Time-constrained consultations in high-volume settings can crowd out patient-centered communication, while AI-generated advice may face algorithm aversion when it lacks a humanistic dimension. This study examined whether a brief narrative-based prompt could improve coded patient-facing communication features in an LLM relative to both clinicians and an unprompted model on authentic patient queries. Methods: We conducted a three-condition comparative evaluation using a stratified sample of 1000 de-identified MedDialog-CN consultations (2016–2020). For each consultation, the same patient query was used to generate (i) a zero-shot GPT-o3-mini response and (ii) a narrative-prompted GPT-o3-mini response; the original physician reply served as the human baseline. Responses were annotated with a pre-specified schema operationalizing four communication dimensions—Storytelling, Empathy, Personalization, and Clarity—with expert adjudication. Frequency-based indicators were summarized as mean events per consultation, and binary indicators as proportions; secondary checks captured unwarranted certainty and risk-relevant language. Results: Narrative prompting shifted coded patient-facing communication from sparse and selectively deployed (clinicians and zero-shot AI) to more routine and standardized. Across the reported communication measures, the prompted model showed the most favorable overall pattern, with higher narrative-device use, empathic support, contextual tailoring, and terminology explanation, alongside more frequent consideration of patient preferences and markedly higher rates of emotion–symptom linkage and the presence of a patient-centered narrative framework. Conclusions: Narrative prompting may offer a lightweight and potentially scalable strategy for improving patient-facing communication in Chinese asynchronous, text-based online consultations. An important next step is calibration: humanistic cues should be delivered selectively and safely so that responses remain credible, locally feasible, and cognitively manageable.
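The narrative prompt itself is not reproduced in the abstract; the sketch below only illustrates how the zero-shot and narrative-prompted conditions could be constructed for the same patient query. The instruction wording is a hypothetical placeholder, and the actual model call is omitted.

```python
# Sketch of the two prompting conditions compared in the study.
# The exact narrative prompt wording is not given in the abstract;
# the system instructions below are hypothetical placeholders.

def build_messages(patient_query: str, narrative: bool) -> list[dict]:
    """Return a chat-style message list for a zero-shot or narrative-prompted run."""
    if narrative:
        system = (
            "You are a physician answering an online consultation. "
            "Frame your advice as a short, patient-centered narrative: "  # hypothetical wording
            "acknowledge the patient's feelings, relate symptoms to their daily life, "
            "and explain any medical terms in plain language."
        )
    else:
        system = "You are a physician answering an online consultation."  # zero-shot baseline
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": patient_query},
    ]

if __name__ == "__main__":
    query = "I have had chest tightness at night for two weeks. Should I be worried?"
    for condition in (False, True):
        msgs = build_messages(query, narrative=condition)
        # Both conditions would be sent to the same model (GPT-o3-mini in the paper);
        # the actual API call is omitted here.
        print("narrative" if condition else "zero-shot", msgs[0]["content"], sep="\n", end="\n\n")
```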
(This article belongs to the Special Issue Artificial Intelligence in Healthcare: Opportunities and Challenges)

22 pages, 3734 KB  
Article
CLEAR: A Cognitive LLM-Empowered Adaptive Restoration Framework for Robust Ship Detection in Complex Maritime Scenarios
by Min Li, Xinyu Zhao and Yunfeng Wan
Remote Sens. 2026, 18(8), 1142; https://doi.org/10.3390/rs18081142 (registering DOI) - 12 Apr 2026
Abstract
Ship detection in remote sensing imagery serves as a cornerstone of modern maritime surveillance. Existing visible light detectors suffer from severe performance degradation in adverse environmental conditions (e.g., fog, low light) due to domain gaps. Traditional global enhancement methods often lack adaptability, leading to “negative transfer”, where artifacts are introduced into clean images or the enhancement is mismatched with the degradation type. To address these challenges, we propose the CLEAR (Cognitive Large Language Model (LLM)-Empowered Adaptive Restoration) framework. Inspired by the dual-process theory of cognition, we introduce a dynamic switching mechanism between fast perception and deep reasoning. Rather than processing all images indiscriminately, CLEAR uses a hybrid gating mechanism to efficiently filter nominal samples, triggering a Vision–Language Model (VLM) only when necessary to diagnose degradation and dispatch targeted restoration operators. Extensive experiments on the constructed HRSC-Robust dataset demonstrate that CLEAR achieves an overall mean Average Precision (mAP) at 0.5 Intersection-over-Union (IoU) of 86.92%, outperforming the baseline by 7.74%. Notably, it establishes a “fail-safe” mechanism for optical degradations. By adaptively resolving fog and low light, it effectively mitigates detector blindness, exemplified by a doubled Recall rate (52.52%) in dark scenarios. Furthermore, a confidence-based sparse triggering strategy ensures operational efficiency, maintaining a throughput of ~11.8 FPS in nominal conditions. This work validates the potential of VLMs for interpretable and robust remote sensing tasks.
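The hybrid gating idea described in the abstract (fast detection on nominal images, VLM-based diagnosis and restoration only on low-confidence ones) can be sketched roughly as follows. The detector, VLM diagnosis function, restoration operators, and confidence threshold are all stand-ins, not the paper's implementation.

```python
# Minimal sketch of the confidence-gated "fast perception vs. deep reasoning" loop
# described for CLEAR. Detector, VLM diagnosis, and restoration operators are
# stand-in callables; thresholds and operator names are assumptions.

from typing import Callable

RESTORATION_OPERATORS: dict[str, Callable] = {
    "fog": lambda img: img,        # placeholder dehazing operator
    "low_light": lambda img: img,  # placeholder low-light enhancement
    "none": lambda img: img,       # pass-through for clean images
}

def detect_ships(image, detector) -> tuple[list, float]:
    """Run the fast detector; return boxes and a mean confidence score."""
    boxes, scores = detector(image)
    mean_conf = sum(scores) / len(scores) if scores else 0.0
    return boxes, mean_conf

def clear_pipeline(image, detector, vlm_diagnose, conf_threshold: float = 0.5):
    """Sparse triggering: only low-confidence images reach the VLM."""
    boxes, conf = detect_ships(image, detector)
    if conf >= conf_threshold:
        return boxes  # nominal sample, no restoration needed
    degradation = vlm_diagnose(image)          # e.g. returns "fog" or "low_light"
    restore = RESTORATION_OPERATORS.get(degradation, RESTORATION_OPERATORS["none"])
    boxes, _ = detect_ships(restore(image), detector)
    return boxes

if __name__ == "__main__":
    dummy_detector = lambda img: ([(10, 10, 40, 30)], [0.35])  # one low-confidence box
    dummy_vlm = lambda img: "fog"                              # pretend diagnosis
    print(clear_pipeline("image.png", dummy_detector, dummy_vlm))
```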

20 pages, 5504 KB  
Article
A Large Language Model for Traffic Flow Prediction Based on Stationary Wavelet Transform and Graph Convolutional Networks
by Xin Wang, Gang Liu, Jing He, Xiangbing Zhou and Zhiyong Luo
ISPRS Int. J. Geo-Inf. 2026, 15(4), 166; https://doi.org/10.3390/ijgi15040166 (registering DOI) - 11 Apr 2026
Abstract
With the rapid development of Intelligent Transportation Systems (ITSs), traffic prediction, a crucial component of ITSs, has garnered growing scholarly attention. The application of deep learning to traffic prediction has emerged as a prominent research direction, especially amid the rapid advancement of pretrained large language models (LLMs), which offer substantial benefits in time-series analysis through cross-modal knowledge transfer. In response to this advancement, this study introduces an innovative model for traffic flow prediction, designated WGLLM. To capture the spatiotemporal characteristics inherent in traffic flow data, the model incorporates a sequence embedding layer constructed on the stationary wavelet transform (SWT) and long short-term memory (LSTM), in conjunction with a spatial embedding layer founded on graph convolutional networks (GCNs). Additionally, a fully connected layer is utilized to integrate the embeddings into the LLM for comprehensive global dependency analysis. To verify the effectiveness of the proposed approach, experiments were carried out on two real traffic flow datasets. The experimental results demonstrate that WGLLM achieves superior predictive performance compared to multiple mainstream baseline models, with a significant improvement in prediction accuracy.
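A minimal sketch of the two embedding layers named in the abstract, under assumed dimensions: SWT coefficients feed an LSTM for the temporal embedding, and one symmetric-normalized graph-convolution step provides the spatial embedding. The wavelet choice, decomposition level, and layer sizes are assumptions, and the downstream LLM backbone that consumes the fused embeddings is omitted.

```python
# Sketch of WGLLM's embedding layers as described in the abstract: a temporal
# embedding from a stationary wavelet transform (SWT) followed by an LSTM, and a
# spatial embedding from a single graph-convolution step. Dimensions and the
# wavelet are assumptions; the LLM backbone is omitted.

import numpy as np
import pywt
import torch
import torch.nn as nn

def swt_features(series: np.ndarray, wavelet: str = "db1", level: int = 2) -> np.ndarray:
    """Stack SWT approximation/detail coefficients as channels (length must be divisible by 2**level)."""
    coeffs = pywt.swt(series, wavelet, level=level)                  # list of (cA, cD) pairs
    return np.stack([c for pair in coeffs for c in pair], axis=-1)   # (T, 2*level)

class TemporalEmbedding(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)

    def forward(self, x):                      # x: (batch, T, channels)
        _, (h, _) = self.lstm(x)
        return h[-1]                           # (batch, hidden)

def gcn_embedding(node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """One symmetric-normalized graph convolution step over the road-sensor graph."""
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = torch.diag(a_hat.sum(1).pow(-0.5))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_feats

if __name__ == "__main__":
    flow = np.random.rand(64).astype(np.float32)                       # one sensor's flow series
    feats = torch.tensor(swt_features(flow), dtype=torch.float32).unsqueeze(0)  # (1, 64, 4)
    temporal = TemporalEmbedding(in_channels=feats.shape[-1])(feats)
    adj = torch.ones(3, 3)                                             # toy 3-sensor graph
    spatial = gcn_embedding(torch.rand(3, 8), adj)
    print(temporal.shape, spatial.shape)       # fused downstream by a fully connected layer
```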

18 pages, 439 KB  
Article
Understanding and Predicting Tourist Behavior Through Large Language Models
by Anna Dalla Vecchia, Simone Mattioli, Sara Migliorini and Elisa Quintarelli
Big Data Cogn. Comput. 2026, 10(4), 117; https://doi.org/10.3390/bdcc10040117 (registering DOI) - 11 Apr 2026
Abstract
Understanding and predicting how tourists move through a city is a challenging task, as it involves a complex interplay of spatial, temporal, and social factors. Traditional recommender systems often rely on structured data in trying to capture the nature of the problem. However, recent advances in Large Language Models (LLMs) open new possibilities for reasoning over richer, text-based representations of user context, even without a dedicated pre-training phase. In this study, we investigate the potential of LLMs to interpret and predict tourist movements in a real-world application scenario involving tourist visits to Verona, a municipality in Northern Italy, between 2014 and 2023. We propose an incremental prompt engineering approach that gradually enriches the model input, from spatial features alone to richer behavioral information, including visit histories, time information, and user cluster patterns. The approach is evaluated using six open-source models, enabling us to compare their accuracy and efficiency across various levels of contextual enrichment. The results provide a first insight into the abilities of LLMs to incorporate spatio-temporal contextual factors, thus improving predictions while maintaining computational efficiency. The analysis of the model-generated explanations completes the picture by adding an interpretability dimension that most existing next-PoI prediction solutions lack. Overall, the study demonstrates the potential of LLMs to integrate multiple contextual dimensions in tourism mobility, highlighting the possibility of a more text-oriented, adaptive, and explainable T-RS.
(This article belongs to the Section Large Language Models and Embodied Intelligence)

15 pages, 392 KB  
Article
Random Forest Predicts Human Ratings of Creative Stories Using Very Small Training Samples
by Baptiste Barbot and Thomas Calogero Kiekens
Behav. Sci. 2026, 16(4), 576; https://doi.org/10.3390/bs16040576 (registering DOI) - 11 Apr 2026
Abstract
The Consensual Assessment Technique (CAT) is a gold standard of creativity assessment which provides valid product-based creativity scores that are contextually grounded (stemming from raters with unique expertise, culturally and historically situated). However, its implementation is often demanding (raters’ burden, complex rating designs). This study investigates whether machine learning can effectively simulate expert-panel judgments of creativity using minimal training data. Using a dataset of 411 short stories, we compared the performance of Random Forest (RF), Gradient Boosted Trees, and Decision Tree models, based on story length and Divergent Semantic Integration, to predict expert CAT ratings by (1) identifying the optimal algorithm and (2) determining the minimum training sample size required for reliable prediction. Results indicate that RF consistently outperformed the other algorithms, achieving high correlations with CAT scores (r = 0.80) using as few as 25 training stories. Furthermore, RF demonstrated superior accuracy and lower reliance on story length compared to LLM-based scoring models. These findings provide a robust proof of concept for using simulated expert panels as a scalable alternative to (decontextualized) automated assessment methods, while reducing human raters’ burden and the logistical constraints of complex rating designs. Extending this work to different contexts, creativity tasks, and domains is necessary to gauge its generalizability.
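A small sketch of the reported setup, on synthetic placeholder data: a Random Forest fit on only 25 stories, using story length and DSI as predictors, evaluated by the Pearson correlation between predicted and held-out ratings. The toy rating-generating rule below is invented solely to make the snippet runnable; the real study used 411 human-rated short stories.

```python
# Sketch of the core result: a Random Forest trained on only 25 rated stories,
# with story length and Divergent Semantic Integration (DSI) as predictors of
# expert CAT ratings. All data here are synthetic placeholders.

import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 411
length = rng.normal(300, 80, n)               # story length (words), synthetic
dsi = rng.normal(0.8, 0.05, n)                # DSI score, synthetic
cat = 0.2 * (length - 300) / 80 + 8 * (dsi - 0.8) + rng.normal(0, 0.3, n)  # toy CAT rating

X = np.column_stack([length, dsi])
train, test = slice(0, 25), slice(25, None)   # 25 training stories, rest held out

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X[train], cat[train])
r, _ = pearsonr(rf.predict(X[test]), cat[test])
print(f"Pearson r with held-out ratings: {r:.2f}")   # the paper reports r = 0.80 on real data
```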
(This article belongs to the Section Cognition)

22 pages, 1449 KB  
Article
On the Vulnerability of Citation Metrics in the Era of Generative Artificial Intelligence
by Kay Smarsly
Publications 2026, 14(2), 23; https://doi.org/10.3390/publications14020023 (registering DOI) - 11 Apr 2026
Abstract
Large language model (LLM) chatbots, as a widely used form of generative artificial intelligence, have reduced the marginal cost of producing publication-style manuscripts and have expanded feasible routes for manipulating citation metrics within the publishing ecosystem. Citation-based indicators (e.g., the h-index, the i10-index, and total citation counts) remain embedded in research evaluation and are sensitive to indexing practices of bibliographic databases, with Google Scholar providing broad coverage combined with comparatively limited curation. In this study, a systematic literature review is conducted to synthesize reported mechanisms of citation-metric manipulation and to examine limitations of citation-metric use, including evidence reported in civil engineering. A Google Scholar proof-of-concept case study examines whether the indexing of LLM-assisted, non-peer-reviewed documents with concentrated references to a target author is associated with changes in author-level citation metrics under platform-specific conditions. After indexing, a stepwise increase in author-level metrics is observed, demonstrating the feasibility of citation-metric manipulation under the platform-specific conditions. Finally, this paper discusses the implications for research integrity and citation manipulation in the era of generative artificial intelligence. It also presents recommendations for researchers, academic institutions and evaluation committees, publishers and editors, bibliographic database providers, and funding institutions and policymakers.
(This article belongs to the Special Issue AI in Academic Metrics and Impact Analysis)

6 pages, 450 KB  
Proceeding Paper
Class Entity Identification Based on Large Language Models: A Choice Between Classification and Generation
by Eric Jui-Lin Lu and Cheng-Hao Yang
Eng. Proc. 2026, 134(1), 42; https://doi.org/10.3390/engproc2026134042 (registering DOI) - 10 Apr 2026
Abstract
Large language models (LLMs) have been widely applied to knowledge graph question answering (KGQA) systems. Recent Text-to-SPARQL studies have demonstrated that generation performance can achieve an F1 score exceeding 90%. Further error analysis has categorized common errors into entity translation errors, entity position errors, and resource description framework (RDF) triple-count errors, with the latter accounting for 24% of all errors. Notably, nearly 90% of RDF triple-count errors occur when the triples involve class entities. Previous research has shown that incorporating prompts can effectively enhance model performance. Building on these findings, we predict whether a question contains a class entity and how many RDF triples the corresponding query requires, supplying this precise task-related information to the LLM through prompt design to reduce RDF triple-count errors. Since both prediction tasks are classification-oriented, two implementation paradigms were established and compared: traditional classification architectures and generative modeling. For the classification-based architectures, we employed Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized BERT Approach (RoBERTa) to obtain question embeddings for classification. For the generative approach, we adopted the Instruction-Tuned Text-to-Text Transfer Transformer (Flan-T5). Experimental results show that the generative model slightly outperforms the conventional classification architectures, indicating that generative approaches can achieve higher prediction accuracy and provide more reliable information without the need for additional complex encoder designs, thereby improving the overall quality of Text-to-SPARQL generation.

31 pages, 3673 KB  
Article
Unveiling Systemic Risks in Sustainable Safety Management: Integrating BERTopic, LLM, and SNA for Accident Text Mining
by Lanjing Wang, Rui Huang, Yige Chen, Yunxiang Yang, Jing Zhan and Haiyuan Gong
Sustainability 2026, 18(8), 3787; https://doi.org/10.3390/su18083787 - 10 Apr 2026
Abstract
To unveil the underlying risk structures in complex industrial systems, this paper proposes a hybrid analytical framework that integrates BERTopic modeling, a large language model (LLM), and social network analysis (SNA). This framework aims to extract systemic safety intelligence from unstructured accident reports. It first employs BERTopic to identify latent causal topics based on 745 Chinese accident investigation reports and utilizes DeepSeek-V3.1 (LLM) for semantic refinement and causal mapping of these topics. Subsequently, a semantic network of causal keywords based on positive pointwise mutual information (PPMI) is constructed, and its topological structure is analyzed using SNA methods. The study identifies and analyzes five major risk communities: confined spaces, fire, mining, construction, and road traffic. It reveals that accident causation exhibits the small-world characteristics of multi-factor coupling and non-linearity, with core risk nodes concentrated in systemic inducements such as organizational management and compliance deficiencies. The results demonstrate that this framework effectively identifies the latent systemic risk patterns embedded within the texts, providing methodological support for developing sustainable safety management mechanisms based on design for safety.
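The PPMI network step can be illustrated with a toy example: count keyword co-occurrences within reports, keep only edges with positive pointwise mutual information, and read off centrality as a proxy for core risk nodes. The keyword lists and report contents below are invented; only the construction mirrors the described pipeline.

```python
# Sketch of the PPMI-based semantic network: compute positive pointwise mutual
# information between causal keywords co-occurring in the same report, then
# analyze the weighted graph with standard SNA measures. Keywords are illustrative;
# the study extracted them from 745 accident investigation reports.

import math
from collections import Counter
from itertools import combinations
import networkx as nx

reports = [
    ["confined_space", "ventilation_failure", "toxic_gas"],
    ["confined_space", "toxic_gas", "missing_permit"],
    ["fire", "missing_permit", "ventilation_failure"],
]

word_count = Counter(w for r in reports for w in set(r))
pair_count = Counter(frozenset(p) for r in reports for p in combinations(set(r), 2))
n_reports = len(reports)

G = nx.Graph()
for pair, c in pair_count.items():
    a, b = tuple(pair)
    pmi = math.log((c / n_reports) / ((word_count[a] / n_reports) * (word_count[b] / n_reports)))
    if pmi > 0:                                   # keep only positive PMI (PPMI)
        G.add_edge(a, b, weight=pmi)

centrality = nx.degree_centrality(G)              # core risk nodes have high centrality
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```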
(This article belongs to the Special Issue Achieving Sustainability in Safety Management and Design for Safety)
39 pages, 5852 KB  
Article
SAPIENT: A Multi-Agent Framework for Corporate Reputation Intelligence Through Sentinel Monitoring and LLM-Based Synthetic Population Simulation
by Alper Ozpinar and Saha Baygul Ozpinar
Systems 2026, 14(4), 425; https://doi.org/10.3390/systems14040425 - 10 Apr 2026
Abstract
Corporate reputation teams rely on media monitoring and qualitative research, both limited in speed and coverage when digital narratives form rapidly. This paper proposes SAPIENT (Sentinel-Augmented Population Intelligence for Emerging Narrative Tracking), a multi-agent system that links a sentinel layer over public text streams with a simulation layer that runs moderated, repeatable in silico focus-group sessions. The sentinel layer ingests social media, news, and forum text to produce a compact signal state (topics, sentiment, anomaly scores, risk labels), which conditions the simulation layer through an orchestrator. Persona agents and a moderator follow an Agentic Focus Group (AFG) protocol with repeated runs, variance reporting, and human review gates. We describe four sustainability communication scenarios: greenwashing backlash prediction, greenhushing risk assessment, campaign pre-testing, and crisis communication simulation. Nine experiments span 280 AFG runs across 20 conditions, three LLM backends (Claude Sonnet 4, GPT-4o, and Gemini 2.5 Flash), and a preregistered pilot human validation study with 54 participants. Signal conditioning improved simulation specificity (p=0.012). Cross-lingual sessions revealed a sentiment asymmetry between English and Turkish (p=0.001) with preserved persona rank ordering (r=0.81, p=0.015). Cross-model comparison showed consistent persona differentiation across all three backends (Pearson r>0.92, p<0.002 for all pairs). Sentiment was robust to prompt paraphrasing (p=0.061, n.s.), though credibility was sensitive to prompt wording (p<0.001). All significant results from Experiments 1–8 survived Benjamini–Hochberg correction. A preregistered pilot with 54 human participants on Prolific replicated the predicted credibility ranking across framing variants (p=0.004) but not the sentiment ranking, identifying a specific calibration target for future work.
(This article belongs to the Section Artificial Intelligence and Digital Systems Engineering)
29 pages, 2439 KB  
Review
Agentic and LLM-Based Multimodal Anomaly Detection: Architectures, Challenges, and Prospects
by Mohammed Ayalew Belay, Amirshayan Haghipour, Adil Rasheed and Pierluigi Salvo Rossi
Sensors 2026, 26(8), 2330; https://doi.org/10.3390/s26082330 - 9 Apr 2026
Abstract
Anomaly detection is crucial in maintaining the safety, reliability, and optimal performance of complex systems across diverse domains, such as industrial manufacturing, cybersecurity, and autonomous systems. While conventional methods typically handle single data modalities, multimodal detection has recently seen increasing application in dynamic real-world environments. This paper presents a comprehensive review of recent research at the intersection of agentic artificial intelligence and large language model (LLM)-based multimodal anomaly detection. We systematically analyze and categorize existing studies based on the agent architecture, reasoning capabilities, tool integration, and modality scope. The main contribution of this work is a novel taxonomy that unifies agentic and multimodal anomaly detection methods, alongside benchmark datasets, evaluation methods, key challenges, and mitigation strategies. Furthermore, we identify major open issues, including data alignment, scalability, reliability, explainability, and evaluation standardization. Finally, we outline future research directions, with a particular emphasis on trustworthy autonomous agents, efficient multimodal fusion, human-in-the-loop systems, and real-world deployment in safety-critical applications.
(This article belongs to the Special Issue Intelligent Sensors for Security and Attack Detection)
27 pages, 3278 KB  
Article
Multimodal PPG-Based Arrhythmia Detection Using a CLIP-Initialized Multi-Task U-Net and LLM-Assisted Reporting
by Youngho Huh, Minhwan Noh, Dongwoo Ji, Yuna Oh and Sukkyu Sun
Sensors 2026, 26(8), 2316; https://doi.org/10.3390/s26082316 - 9 Apr 2026
Abstract
Photoplethysmography (PPG) has emerged as an attractive modality for non-invasive cardiovascular monitoring due to its low cost, unobtrusive nature, and ubiquity in consumer wearable devices. Despite its potential, existing PPG-based arrhythmia detection systems remain limited in scope: (i) most target only atrial fibrillation, (ii) temporal localization of abnormal segments is rarely provided, and (iii) deep learning models lack explainability, hindering adoption in clinical workflows. We present a comprehensive and fully integrated framework for multi-class arrhythmia detection, segmentation, and explainability based on PPG waveforms, Heart Rate Variability (HRV), and structured clinical metadata. The proposed system introduces a CLIP-style contrastive learning module aligning PPG waveforms with clinical variables and rhythm-state textual descriptions using BioBERT; a multitask U-Net architecture performing 4-class classification and 1D segmentation; a Retrieval-Augmented Generation (RAG) pipeline leveraging Gemini Flash large language models to produce guideline-grounded diagnostic reports; and a real-time Streamlit-based web platform supporting inference, visualization, and database storage. The system significantly improves classification accuracy (from 86.27% to 91.19%) and segmentation Dice (from 0.5815 to 0.7167). These results demonstrate the feasibility of a robust, multimodal, and explainable PPG-based arrhythmia monitoring system for real-world applications.
(This article belongs to the Section Wearables)

31 pages, 380 KB  
Article
Hybrid Approach to Patient Review Classification at Scale: From Expert Annotations to Production-Ready Machine Learning Models for Sustainable Healthcare
by Irina Evgenievna Kalabikhina, Anton Vasilyevich Kolotusha and Vadim Sergeevich Moshkin
Big Data Cogn. Comput. 2026, 10(4), 114; https://doi.org/10.3390/bdcc10040114 - 9 Apr 2026
Abstract
Patients leave millions of medical reviews annually, providing critical data for quality management. However, manual processing is infeasible, and existing systems fail to distinguish medical from organizational problems—a distinction essential for complaint routing. The consequences of misrouting are significant: clinical issues may go unaddressed when medical complaints reach administrative staff, while systemic service problems remain unresolved when organizational complaints reach medical directors. We developed a hybrid approach combining expert annotation with Large Language Models (LLMs). Fifteen prompt iterations on 1500 reviews with expert validation (modified Cohen’s kappa (κ_mod), which weights errors hierarchically, reached 0.745) preceded the LLM annotation of 15,000 mixed-sentiment and positive reviews. These were combined with 7417 expert-annotated negative reviews to form a corpus of 22,417 reviews. Eight architectures, ranging from Logistic Regression to a BERT + TF-IDF + LightGBM ensemble, were compared using both standard metrics and domain-specific practical metrics tailored to complaint routing. The best model, scaled to 4.3 million Russian-language reviews from the Prodoctorov.ru platform, achieved 92.9% Practical Accuracy—the proportion of reviews classified without critical medical–organizational misclassification errors (M ↔ O)—compared to 68.0% standard accuracy, which treats all errors equally. Critical errors were reduced to 1.4%, yielding 144,000 more correctly processed complaints than traditional methods (TF-IDF + Logistic Regression). Analysis of the scaled data revealed the following: 46.1% M (medical), 21.0% O (organizational), and 32.9% C (combined) reviews; medical ratings were highest (4.75 vs. 4.59 for organizational, p < 0.001); combined reviews were longest (802 characters); zero-star reviews comprised 3.8% of feedback, with organizational complaints dominating (38.2%) among extreme negatives; and average ratings rose by 1.24 points over 14 years. This hybrid approach yields expert-comparable corpora, automates 93% of feedback processing, ensures correct complaint routing, and contributes to healthcare sustainability by reducing administrative burden, accelerating resolution, and enabling data-driven quality management without proportional increases in human resources. All analyses were conducted on Russian-language patient reviews.
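The abstract defines Practical Accuracy as the share of reviews classified without a critical medical/organizational (M ↔ O) misclassification. A minimal sketch of that metric, on invented toy labels:

```python
# Sketch of the paper's "Practical Accuracy" idea: a review counts as correctly
# processed unless it suffers a critical medical <-> organizational (M <-> O)
# misclassification. The labels below are placeholders, not study data.

def practical_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Share of reviews without a critical M<->O routing error (C confusions are non-critical)."""
    critical = {("M", "O"), ("O", "M")}
    ok = sum((t, p) not in critical for t, p in zip(y_true, y_pred))
    return ok / len(y_true)

if __name__ == "__main__":
    y_true = ["M", "O", "C", "M", "O", "C"]
    y_pred = ["M", "C", "M", "O", "O", "C"]   # one critical error: M predicted as O
    print(f"standard accuracy:  {sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true):.2f}")
    print(f"practical accuracy: {practical_accuracy(y_true, y_pred):.2f}")
```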
25 pages, 1844 KB  
Article
Retrieval-Augmented Large Language Model-Based Framework for Hierarchical Classification of Public Feedback on Transportation Infrastructure
by Milan Knezevic, Trevor Neece, Marko Vukojevic, Lev Khazanovich and Aleksandar Stevanovic
Appl. Sci. 2026, 16(8), 3663; https://doi.org/10.3390/app16083663 - 9 Apr 2026
Abstract
Transportation agencies receive large volumes of free-form public comments describing infrastructure conditions, safety concerns, and service issues. These comments are often processed manually for downstream operational actions, which is time-consuming, inconsistent across reviewers, and difficult to scale, thereby limiting their value for operational decision-making. This study presents a machine learning and Large Language Model (LLM) framework for automated triage of free-form public comments, assigning each report to a three-level hierarchical taxonomy consisting of Category, Subcategory, and Final Decision. The proposed framework uses agency historical data together with retrieval-based evidence, where semantically similar past comments are provided to the LLM as contextual support to better align predictions with agency-specific labeling practices. The framework was evaluated using TF-IDF with Logistic Regression, TF-IDF with Linear SVM, embedding-based kNN with cosine similarity, few-shot LLM prompting, and retrieval-based LLM prompting. Results show that retrieval-based prompting achieved the best overall performance, with the highest accuracy at both the Category and Subcategory levels. At the Final Decision level, retrieval-based prompting slightly outperformed kNN, while few-shot prompting performed worse. Error analysis showed that many misclassifications were semantically plausible alternatives, reflecting the overlap across infrastructure-related complaint categories. Allowing a second candidate label further improved performance. Latency analysis also indicated that the framework can process more than 2000 comments in under 30 min, supporting faster and more consistent agency workflows.
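The retrieval-based prompting step can be sketched as follows: retrieve the k most similar labeled historical comments and place them in the prompt as contextual evidence before the new comment. A TF-IDF retriever stands in for the embedding-based retrieval described in the abstract, and the comments and taxonomy labels are invented.

```python
# Sketch of retrieval-based prompting: semantically similar past comments and
# their agency-assigned labels are retrieved and inserted into the prompt.
# A TF-IDF retriever is a stand-in; the history and labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    ("Large pothole on the ramp to the interstate", "Roadway / Pothole / Forward to maintenance"),
    ("Signal at 5th and Main stays red too long", "Traffic signal / Timing / Forward to operations"),
    ("Guardrail damaged after last week's crash", "Roadside / Guardrail / Forward to maintenance"),
]

def build_prompt(new_comment: str, k: int = 2) -> str:
    texts = [c for c, _ in history]
    vec = TfidfVectorizer().fit(texts + [new_comment])
    sims = cosine_similarity(vec.transform([new_comment]), vec.transform(texts))[0]
    top = sims.argsort()[::-1][:k]                       # indices of the k most similar comments
    evidence = "\n".join(f'- "{history[i][0]}" -> {history[i][1]}' for i in top)
    return (
        "Classify the comment into Category / Subcategory / Final Decision.\n"
        f"Similar past comments and their labels:\n{evidence}\n"
        f'Comment: "{new_comment}"\nAnswer:'
    )

print(build_prompt("Deep pothole near the bridge on Route 30"))   # prompt is then sent to the LLM
```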
(This article belongs to the Special Issue Intelligent Transportation and Mobility Analytics)

7 pages, 707 KB  
Proceeding Paper
Enhancing Text-to-SPARQL Generation via In-Context Learning with Example Selection Strategies
by Eric Jui-Lin Lu and Zi-Ting Su
Eng. Proc. 2026, 134(1), 36; https://doi.org/10.3390/engproc2026134036 - 9 Apr 2026
Abstract
Large language models demonstrate strong in-context learning (ICL) capabilities, allowing them to perform diverse tasks without fine-tuning. In knowledge graph question answering (KGQA), natural language questions are translated into SPARQL queries. Existing ICL approaches mainly rely on semantic similarity, often neglecting structural features. To address this limitation, we developed a structure-aware example selection strategy that integrates both semantic and structural patterns by abstracting Resource Description Framework (RDF) triples. We compare four strategies: (1) fully random, (2) semantic similarity, (3) same-type random, and (4) same-type semantic similarity. Experiments on LC-QuAD 1.0 using FLAN-T5 show that in non-fine-tuned settings, structure-aware semantic selection achieves the best results, highlighting the importance of structural congruence, while after fine-tuning, differences between strategies converge but diversity and semantic relevance remain beneficial. These findings demonstrate the critical role of example quality in ICL and provide empirical insights for KGQA design.
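A toy sketch of the same-type (structure-aware) semantic selection idea: abstract each candidate to a structural signature, here the query form and triple count, keep only candidates matching the target signature, and rank them by similarity to the new question. The signature definition, the difflib similarity stand-in, and the examples are simplifications, not the paper's implementation.

```python
# Sketch of "same-type semantic similarity" example selection: candidates are
# abstracted to a structural signature (query form, number of RDF triple
# patterns) and, within the matching type, ranked by similarity to the question.
# Examples and the similarity measure are placeholders.

from difflib import SequenceMatcher

TRAIN = [
    {"q": "Who wrote The Hobbit?",                                  "form": "SELECT", "n_triples": 1},
    {"q": "How many films did Kubrick direct?",                     "form": "COUNT",  "n_triples": 1},
    {"q": "Which rivers flow through cities founded before 1200?",  "form": "SELECT", "n_triples": 2},
    {"q": "Who painted works exhibited in the Louvre?",             "form": "SELECT", "n_triples": 2},
]

def select_examples(question: str, form: str, n_triples: int, k: int = 2) -> list[dict]:
    """Keep same-signature candidates, then rank by similarity to the question."""
    same_type = [ex for ex in TRAIN if ex["form"] == form and ex["n_triples"] == n_triples]
    same_type.sort(key=lambda ex: SequenceMatcher(None, question, ex["q"]).ratio(), reverse=True)
    return same_type[:k]

# Selected examples would be prepended to the prompt before asking the LLM
# to generate the SPARQL query for the new question.
print(select_examples("Which lakes lie in countries bordering France?", "SELECT", 2))
```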

15 pages, 719 KB  
Article
Efficacy of Large Language Models for Screening of Systematic Reviews on Periprosthetic Joint Infection
by Woojin Shin, Jaeyoung Hong, Sunwoo Lee, Seongchan Park, Hyoungtae Kim and Suenghwan Jo
J. Clin. Med. 2026, 15(8), 2830; https://doi.org/10.3390/jcm15082830 - 8 Apr 2026
Abstract
Background: Periprosthetic joint infection (PJI) remains a devastating complication following arthroplasty. Systematic reviews of PJI provide essential evidence to inform clinical practice; however, the screening process remains labor-intensive. Recent advancements in large language models (LLMs) offer potential for automating literature screening, though evaluation of current generation models is needed. Methods: This validation study evaluated GPT-5, GPT-5 Pro, and Gemini 2.5 Pro in replicating the title/abstract and full-text screening stages of a published systematic review on intraosseous versus intravenous antibiotic prophylaxis in total joint arthroplasty. Title/abstract screening was performed on 165 articles, followed by a full-text eligibility assessment of 26 articles. Accuracy, sensitivity, specificity, and Cohen’s kappa (κ) were calculated against human screening decisions as the gold standard. Results: In title/abstract screening, GPT-5 Pro achieved the highest accuracy (92.1–92.7%) and specificity (98.6–99.3%), while GPT-5 demonstrated the highest sensitivity (84.6–96.1%). In full-text screening, Gemini 2.5 Pro showed the most consistent performance across repeated evaluations (κ = 0.839 in both trials), whereas GPT-5 Pro exhibited marked intra-model variability (κ = 0.399 to 0.920). Conclusions: Current-generation LLMs achieve near-human accuracy in systematic review screening for PJI research, though substantial intra-model variability underscores the continued need for human oversight in systematic review workflows.
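The agreement metrics reported here (accuracy, sensitivity, specificity, Cohen's kappa against human decisions as the gold standard) can be computed as in the short sketch below, shown on invented include/exclude vectors rather than the study's data.

```python
# Sketch of the screening-agreement metrics, computed against human decisions
# as the gold standard. The include/exclude vectors are toy placeholders.

from sklearn.metrics import cohen_kappa_score, confusion_matrix

human = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = include, 0 = exclude (gold standard)
llm   = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]   # one missed inclusion and one false inclusion

tn, fp, fn, tp = confusion_matrix(human, llm).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)              # share of truly relevant articles retained
specificity = tn / (tn + fp)              # share of irrelevant articles correctly excluded
kappa       = cohen_kappa_score(human, llm)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} kappa={kappa:.2f}")
```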
(This article belongs to the Section Orthopedics)
