Article

Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models

by Daniel-Corneliu Leucuța, Andrada Elena Urda-Cîmpean *, Dan Istrate and Tudor Drugan
Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania
* Author to whom correspondence should be addressed.
Diagnostics 2025, 15(12), 1451; https://doi.org/10.3390/diagnostics15121451
Submission received: 19 May 2025 / Revised: 2 June 2025 / Accepted: 4 June 2025 / Published: 6 June 2025
(This article belongs to the Special Issue A New Era in Diagnosis: From Biomarkers to Artificial Intelligence)

Abstract

Background/Objectives: Diagnostic accuracy studies are essential for evaluating the performance of medical tests. The risk of bias (RoB) of these studies is commonly assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool. This study aimed to assess the capabilities and reasoning accuracy of large language models (LLMs) in evaluating the RoB of diagnostic accuracy studies, using QUADAS 2, compared to human experts. Methods: Four LLMs were used for the AI assessment: the ChatGPT 4o, X.AI Grok 3, Gemini 2.0 Flash, and DeepSeek V3 models. Ten recent open-access diagnostic accuracy studies were selected. Each article was independently assessed by the human experts and by the LLMs using QUADAS 2. Results: Across the 110 signaling-question assessments (11 questions for each of the 10 articles) performed by each of the four AI models, the mean percentage of correct assessments across all models was 72.95%. The most accurate model was Grok 3, followed by ChatGPT 4o, DeepSeek V3, and Gemini 2.0 Flash, with accuracies ranging from 74.45% down to 67.27%. When analyzed by domain, the most accurate responses were for "flow and timing", followed by "index test", and then, with similar accuracy, "patient selection" and "reference standard". An extensive list of reasoning errors was documented. Conclusions: This study demonstrates that LLMs can achieve a moderate level of accuracy in evaluating the RoB of diagnostic accuracy studies. However, they are not yet a substitute for expert clinical and methodological judgment. LLMs may serve as complementary tools in systematic reviews, with compulsory human supervision.
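As a purely illustrative sketch (not the authors' code), the accuracy figures reported above can in principle be reproduced by scoring each model's QUADAS 2 signaling-question answers against the expert consensus; all function names, variable names, and data shapes below are hypothetical assumptions.

# Illustrative sketch with hypothetical data structures: scoring one model's
# QUADAS 2 signaling-question answers ("yes" / "no" / "unclear") against the
# human expert consensus, over 110 assessments (11 questions x 10 articles).

QUESTIONS_PER_ARTICLE = 11
ARTICLES = 10

def model_accuracy(model_answers, expert_answers):
    """Fraction of the 110 signaling questions where the model matches the experts."""
    assert len(model_answers) == len(expert_answers) == QUESTIONS_PER_ARTICLE * ARTICLES
    correct = sum(m == e for m, e in zip(model_answers, expert_answers))
    return correct / len(expert_answers)

def mean_accuracy(models, expert_answers):
    """Per-model accuracies and their mean across models (cf. the 72.95% reported above)."""
    per_model = {name: model_accuracy(answers, expert_answers)
                 for name, answers in models.items()}
    return sum(per_model.values()) / len(per_model), per_model

In this sketch, models would map each model name (e.g. "Grok 3") to its flat list of 110 answers in the same article and question order as the expert list; per-domain accuracies would follow the same pattern restricted to the signaling questions of each QUADAS 2 domain.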
Keywords: diagnostic accuracy; large language models; artificial intelligence; risk of bias; evidence-based medicine

Share and Cite

MDPI and ACS Style

Leucuța, D.-C.; Urda-Cîmpean, A.E.; Istrate, D.; Drugan, T. Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models. Diagnostics 2025, 15, 1451. https://doi.org/10.3390/diagnostics15121451

AMA Style

Leucuța D-C, Urda-Cîmpean AE, Istrate D, Drugan T. Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models. Diagnostics. 2025; 15(12):1451. https://doi.org/10.3390/diagnostics15121451

Chicago/Turabian Style

Leucuța, Daniel-Corneliu, Andrada Elena Urda-Cîmpean, Dan Istrate, and Tudor Drugan. 2025. "Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models" Diagnostics 15, no. 12: 1451. https://doi.org/10.3390/diagnostics15121451

APA Style

Leucuța, D.-C., Urda-Cîmpean, A. E., Istrate, D., & Drugan, T. (2025). Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models. Diagnostics, 15(12), 1451. https://doi.org/10.3390/diagnostics15121451

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
