Patient-Facing Radiology Communication with LLMs: Calibration Deficit and the Metadata Paradox

Shin, Cheong; Park, Jung Hyun; Kim, Sungjun; Lee, Young Han; Lee, Hong-Seon

doi:10.3390/healthcare14111490

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

Open Access Article

by Cheong Shin ¹, Jung Hyun Park ¹,²,³,*, Sungjun Kim ¹,³,⁴,⁵, Young Han Lee ⁵,⁶ and Hong-Seon Lee ⁴,*

¹ Department of Integrative Medicine, The Graduate School, College of Medicine, Yonsei University, Seoul 03722, Republic of Korea

² Department of Rehabilitation Medicine, Gangnam Severance Hospital, Rehabilitation Institute of Neuromuscular Disease, College of Medicine, Yonsei University, 211, Eonju-ro, Gangnam-gu, Seoul 03722, Republic of Korea

³ Department of Medical Device Engineering and Management, The Graduate School, College of Medicine, Yonsei University, Seoul 03722, Republic of Korea

⁴ Department of Radiology, Gangnam Severance Hospital, College of Medicine, Yonsei University, Seoul 03722, Republic of Korea

⁵ Institute for Innovation in Digital Healthcare, Yonsei University, Seoul 03722, Republic of Korea

⁶ Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Severance Hospital, College of Medicine, Yonsei University, Seoul 03722, Republic of Korea

* Authors to whom correspondence should be addressed.

Healthcare 2026, 14(11), 1490; https://doi.org/10.3390/healthcare14111490

Submission received: 14 March 2026 / Revised: 4 May 2026 / Accepted: 13 May 2026 / Published: 27 May 2026

(This article belongs to the Special Issue Enhancing Communication in Clinical Practice for Better Care)

Abstract

Background/Objectives: Patients increasingly access radiology reports via online portals and frequently seek clarification. While large language models (LLMs) may facilitate this communication, their clinical safety and reliability in this context remain largely uncharacterized. This study evaluated three properties of LLMs answering simulated patient questions derived from radiology reports: performance heterogeneity (the disparity between factual synthesis and interpretive reasoning), the Metadata Paradox (performance degradation triggered by demographic priors), and calibration characteristics. Methods: In this retrospective study, 2000 simulated inquiries were generated from 200 MIMIC-IV radiology reports using an expert-refined 10-category taxonomy and grouped into factual tasks (e.g., terminology/anatomy) and interpretive tasks (e.g., diagnostic confidence/finding detail). Three LLMs (GPT-4o mini, Grok (v4-0709), and Claude 3.5 Sonnet) generated 12,000 answers (with and without metadata). Quality was scored on a 1–3 scale by Gemini 2.5 Flash, validated by three independent board-certified radiologists, and finalized through four-specialist consensus adjudication (n = 1200). Performance and self-confidence calibration were assessed using Generalized Estimating Equations. Results: The LLM judge showed an overall agreement rate of 90.5% with the adjudicated ground truth. Grok and Claude 3.5 Sonnet significantly outperformed GPT-4o mini (p < 0.001); specifically, GPT-4o mini was associated with 2.8-fold higher odds of failure compared with Grok (adjusted OR 2.83; 95% CI: 2.28–3.49; p < 0.001) and an absolute risk difference (ARD) of 8.4 percentage points. Accuracy approached its ceiling in factual tasks (Terminology: 98.1%) but was significantly lower in interpretive tasks (Diagnostic Confidence: 82.3%, p < 0.001). Metadata inclusion triggered the ‘Metadata Paradox,’ significantly increasing the odds of failure (OR 1.11; p = 0.044). A substantial calibration deficit (defined as the disconnect between self-confidence and accuracy) was observed; notably, the majority of safety-critical errors (Score 1: clinically significant misinformation; n = 131) were assigned high self-confidence (8/10; GPT-4o mini: 93.8%, Grok: 100%, Claude 3.5 Sonnet: 61.5%). Conclusions: Although LLMs accurately address factual queries, their consistent calibration deficit in safety-critical errors and susceptibility to stochastic stereotyping highlight the necessity of independent verification frameworks.
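
For readers unfamiliar with the summary statistics quoted in the abstract, the following is a minimal illustrative sketch, not the authors' analysis code. It computes a crude (unadjusted) odds ratio, an absolute risk difference, and the share of Score-1 errors that carry high self-confidence. The study itself reports adjusted estimates from Generalized Estimating Equations, which this sketch does not reproduce. The `Answer` record, the function names, the 0-10 confidence scale, the "at or above 8" threshold, and all counts below are hypothetical assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    """Hypothetical record for one scored LLM answer (not the study's schema)."""
    model: str
    score: int       # judge score on the 1-3 scale; 1 = clinically significant misinformation
    confidence: int  # self-reported confidence, assumed here to lie on a 0-10 scale


def odds_ratio(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Crude odds ratio of failure for group A versus group B (no covariate adjustment)."""
    odds_a = fail_a / (n_a - fail_a)
    odds_b = fail_b / (n_b - fail_b)
    return odds_a / odds_b


def absolute_risk_difference(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Difference in failure proportions (A minus B), as a fraction rather than a percentage."""
    return fail_a / n_a - fail_b / n_b


def high_confidence_error_share(answers: List[Answer], threshold: int = 8) -> Optional[float]:
    """Share of Score-1 answers whose confidence is at or above the (assumed) threshold."""
    errors = [a for a in answers if a.score == 1]
    if not errors:
        return None  # undefined when there are no Score-1 answers
    return sum(a.confidence >= threshold for a in errors) / len(errors)


# Hypothetical toy counts, NOT data from the study.
toy_or = odds_ratio(fail_a=20, n_a=100, fail_b=10, n_b=100)
toy_ard = absolute_risk_difference(fail_a=20, n_a=100, fail_b=10, n_b=100)

toy_answers = [
    Answer("model_x", score=1, confidence=9),
    Answer("model_x", score=1, confidence=5),
    Answer("model_x", score=3, confidence=9),
    Answer("model_x", score=1, confidence=8),
]
toy_share = high_confidence_error_share(toy_answers)
```

With these toy counts the failure proportions are 20% and 10%, so the risk ratio is 2.0 while the odds ratio is 2.25. This is why an odds ratio is best read as a ratio of odds rather than as a fold change in risk, and why reporting the absolute risk difference alongside it is informative.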

Keywords: large language models (LLMs); radiology report; question taxonomy; patient-centered care; confidence calibration; clinical AI
