Systematic Review

Quantifying Readability in Chatbot-Generated Medical Texts Using Classical Linguistic Indices: A Review

1 Department of Gerontology and Public Health, National Institute of Geriatrics, Rheumatology and Rehabilitation, Spartańska 1 Street, 02-637 Warsaw, Poland
2 Department of Ultrasound, Institute of Fundamental Technological Research, Polish Academy of Sciences, Pawińskiego 5B Street, 02-106 Warsaw, Poland
3 Department of Nephrology, Hypertension and Family Medicine, Medical University of Lodz, Ul. Zeromskiego 113, 90-549 Lodz, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1423; https://doi.org/10.3390/app16031423
Submission received: 20 November 2025 / Revised: 5 January 2026 / Accepted: 27 January 2026 / Published: 30 January 2026

Abstract

The rapid development of large language models (LLMs), including ChatGPT, Gemini, and Copilot, has led to their increasing use in health communication and patient education. However, their growing popularity raises important concerns about whether the language they generate aligns with recommended readability standards and patient health literacy levels. This review synthesizes evidence on the readability of medical information generated by chatbots using established linguistic readability indices. A comprehensive search of PubMed, Scopus, Web of Science, and Cochrane Library identified 4209 records, from which 140 studies met the eligibility criteria. Across the included publications, 21 chatbots and 14 readability scales were examined, with the Flesch–Kincaid Grade Level and Flesch Reading Ease being the most frequently applied metrics. The results demonstrated substantial variability in readability across chatbot models; however, most texts corresponded to a secondary or early tertiary reading level, exceeding the commonly recommended 8th-grade level for patient-facing materials. ChatGPT-4, Gemini, and Copilot exhibited more consistent readability patterns, whereas ChatGPT-3.5 and Perplexity produced more linguistically complex content. Notably, DeepSeek-V3 and DeepSeek-R1 generated the most accessible responses. The findings suggest that, despite technological advances, AI-generated medical content remains insufficiently readable for general audiences, posing a potential barrier to equitable health communication. These results underscore the need for readability-aware AI design, standardized evaluation frameworks, and future research integrating quantitative readability metrics with patient-level comprehension outcomes.

1. Background

In recent years, there has been a marked increase in research on the readability of medical texts and educational materials intended for patients. Numerous studies have demonstrated that the comprehensibility of health information is a key factor in determining the effectiveness of communication between healthcare professionals and patients [1,2,3,4]. A high level of linguistic complexity may limit patients’ ability to interpret recommendations, make informed health decisions, and adhere to prescribed therapies [5,6,7].
In the scientific literature, quantitative readability indices such as the Flesch Reading Ease, Flesch–Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman–Liau Index, and Automated Readability Index are commonly used to objectively assess the linguistic complexity of a text [8,9]. Numerous studies employing these measures have shown that health-related materials directed toward patients often exceed the recommended reading level, typically corresponding to primary or secondary education, thereby reducing their comprehensibility and practical usefulness [10,11,12,13].
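The two indices used most often in the studies reviewed here can be computed directly from simple text counts. A minimal sketch using the published formula constants (in practice, syllable counts would come from a dictionary or a heuristic syllable counter, which is not shown here):

```python
def flesch_reading_ease(total_words: int, total_sentences: int,
                        total_syllables: int) -> float:
    # Flesch Reading Ease: higher scores (roughly 0-100) mean easier text.
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def flesch_kincaid_grade(total_words: int, total_sentences: int,
                         total_syllables: int) -> float:
    # Flesch-Kincaid Grade Level: approximate U.S. school grade required.
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Example: a 100-word passage with 5 sentences and 150 syllables.
fre = flesch_reading_ease(100, 5, 150)    # 59.635, "fairly difficult"
fkgl = flesch_kincaid_grade(100, 5, 150)  # 9.91, about a 10th-grade level
```

Note that the two scales run in opposite directions: a lower grade-level score and a higher Reading Ease score both indicate easier text.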
At the same time, with the rapid development of natural language processing (NLP) technologies and the widespread adoption of large language models (LLMs) such as ChatGPT, Gemini, and Copilot, a new line of research has emerged focusing on the readability and comprehensibility of medical responses generated by chatbots [14,15,16]. Preliminary analyses indicate that although AI-generated texts often demonstrate linguistic accuracy and logical coherence, their readability and alignment with patients’ health literacy levels vary substantially [17,18,19]. In some cases, chatbots produce overly technical or specialized messages, which may limit their educational value and potentially lead to misinterpretation or incomplete understanding of health information [20,21].
The review was guided by the following research question: To what extent do chatbot-generated medical texts comply with recommended readability standards for patient-facing health communication when evaluated using classical linguistic readability indices? This question focuses on publicly accessible chatbot outputs intended for patient education and health communication. In light of the growing body of research on chatbot readability, this review further examines whether and how such systems generate responses to medical, preventive, or educational inquiries posed by health professionals.
Recent developments in generative AI have also reshaped the conceptual understanding of how language models interact with users’ health literacy needs. Earlier works in health communication emphasized structural barriers, such as excessive medical terminology, syntactic density, and low plain-language compliance, in printed materials [22,23,24,25]. However, generative LLMs introduce new challenges related to style transfer, prompt sensitivity, and the composition of training data [26,27,28]. Because these models are trained on large biomedical corpora, scientific preprints, and clinician-oriented resources, they tend to internalize formal and information-dense linguistic patterns. This training bias partly explains why chatbot-generated texts remain difficult for lay audiences despite their apparent fluency and coherence.
Furthermore, LLMs exhibit substantial variability in linguistic register depending on prompting strategy, system parameters, and model architecture [29]. This variability raises important methodological questions for evaluating AI-driven patient communication, including the reproducibility of readability scores, the impact of system updates, and the degree to which model fine-tuning shapes the balance between comprehensiveness and accessibility.
Together, these factors highlight the need to view readability not only as an attribute of a finished text but as an emergent property of algorithmic systems that continuously adapt during interaction. Understanding this dynamic context is essential for developing robust evaluation frameworks and for designing future AI systems capable of aligning linguistic complexity with patient literacy demands.
In recent years, generative artificial intelligence in healthcare has evolved from general-purpose large language models toward domain-specific architectures designed for clinical and patient-facing applications. In particular, retrieval-augmented generation (RAG) systems, which integrate language models with external clinical knowledge bases, have become increasingly prominent in healthcare settings. Recent reviews indicate that RAG-based approaches improve factual grounding, transparency, and domain reliability in patient education and clinical decision support tasks compared to standalone LLMs [30,31,32,33].
In parallel, multimodal generative AI systems combining text with medical imaging, laboratory data, and electronic health records are rapidly expanding across clinical domains. These developments suggest that contemporary evaluations of chatbot-generated medical texts should be interpreted within a broader ecosystem of healthcare-oriented generative AI, in which readability interacts with knowledge grounding, modality integration, and clinical specialization. This review therefore aims to synthesize the existing evidence and to inform the development of guidelines for designing and evaluating AI-based tools for patient communication and health education.

2. Materials and Methods

Database searches were conducted between 1 July and 30 September 2025, covering all records available up to the final search date. This study is designed as a comprehensive literature review informed by PRISMA 2020 principles (PRISMA 2020 checklist: EQUATOR Network) rather than a full PRISMA-compliant systematic review. PRISMA guidelines were used as a framework to enhance transparency in study identification, screening, and reporting; however, given the heterogeneity of study designs, chatbot models, prompts, and readability metrics, several elements required for a formal systematic review, such as quantitative synthesis and standardized risk-of-bias assessment, were not applicable [34]. A complete overview of the PRISMA checklist items and their implementation in this review is provided in Table S1 in the Supplementary Materials. The review therefore aims to provide a broad, structured synthesis of the current evidence rather than a statistically pooled evaluation. Each database was selected for its relevance to clinical, social, and technical research. To address potential limitations in database coverage, the search strategy was expanded to include studies that assessed text readability using established readability indices and analyzed publicly accessible chatbots available to general internet users.
Four medical databases were systematically searched: PubMed, Cochrane Library, Scopus, and Web of Science. The search strategy included the following keywords: chatbot [Title/Abstract] AND readability [Title/Abstract], chatbot [Title/Abstract] AND Flesch–Kincaid Grade Level [Title/Abstract], chatbot [Title/Abstract] AND Flesch Reading Ease [Title/Abstract], chatbot [Title/Abstract] AND Gunning Fog Index [Title/Abstract], chatbot [Title/Abstract] AND Simple Measure of Gobbledygook [Title/Abstract], chatbot [Title/Abstract] AND Coleman–Liau Index [Title/Abstract], and chatbot [Title/Abstract] AND Automated Readability Index [Title/Abstract]. No additional filters or limits were applied. A total of 4209 records were initially retrieved. After applying predefined inclusion and exclusion criteria, including language, document type, and chatbot accessibility, 140 articles were included in the review [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174]. Figure 1 presents the PRISMA flow diagram illustrating the process of study identification, screening, eligibility assessment, and inclusion in the final review. Studies were included if they met the following criteria: (1) peer-reviewed original research articles; (2) assessment of readability of chatbot-generated medical or health-related text using at least one established quantitative readability index; (3) analysis of publicly accessible chatbot systems; and (4) focus on patient-facing or health education content. Only English-language publications were included. 
Studies were excluded if they evaluated proprietary or non-public chatbot systems, focused solely on technical performance without readability assessment, or did not report quantitative readability outcomes. Data extraction was performed independently by two reviewers using a predefined extraction framework. Extracted variables included chatbot model, medical domain, text type, readability indices applied, and reported outcomes. Any discrepancies were resolved through structured discussion until consensus was reached; no third reviewer was involved. Interrater agreement between the two independent reviewers was assessed using Cohen’s kappa coefficient; the obtained level of agreement was substantial (κ = 0.78), indicating high consistency in the screening and eligibility assessments.
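The agreement statistic reported above can be reproduced from raw screening decisions. A minimal sketch with hypothetical include/exclude labels (the actual screening data are not published with the review, so the decisions below are illustrative only):

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # Cohen's kappa: chance-corrected agreement between two raters.
    n = len(rater_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled records independently.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n)
              for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical decisions for eight records screened by both reviewers.
a = ["in", "in", "in", "out", "out", "out", "in", "out"]
b = ["in", "in", "out", "out", "out", "out", "in", "out"]
kappa = cohens_kappa(a, b)  # 0.75 for this toy example
```

Values between 0.61 and 0.80 are conventionally interpreted as "substantial" agreement, which is the band the review's reported κ = 0.78 falls into.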
In addition to classical database searching, methodological attention was given to heterogeneity in prompt design, since prompting is increasingly recognized as a significant determinant of model output structure, tone, and complexity. The included studies displayed wide variation in whether prompts were phrased as open-ended questions, clinically oriented scenarios, or direct instructions to simplify language. Because readability scores are sensitive to such differences, prompt variability was treated as an important contextual factor; however, insufficient reporting of prompt formulations in the source studies precluded retrospective operationalization or stratified analysis. A meta-analysis was not conducted due to substantial heterogeneity in study designs, chatbot models, and readability outcomes, making quantitative pooling methodologically inappropriate. Across all analyzed studies, 21 chatbots and 14 readability indices were used. For consistency, chatbot nomenclature was standardized throughout the manuscript. Model names referring to the same underlying architecture were consolidated (e.g., GPT-4 and GPT-4o), and Google Bard was treated as Gemini in studies published after the official rebranding.
Descriptive statistics were used to summarize the readability scores for each chatbot and readability index. For every chatbot–scale pair, the mean (M) and standard deviation (SD) were calculated and reported as Mean ± SD. Calculations were performed using Python version 3.9 (Python Software Foundation, Wilmington, DE, USA). Although prompt variability is increasingly recognized as a key determinant of LLM output structure and linguistic complexity, most studies included in this review addressed prompting only descriptively. Few investigations systematically categorized prompts by intent, framing, or explicit readability constraints, thereby limiting reproducibility and cross-study comparability.
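The per-pair summary described above reduces to a mean and a sample standard deviation for each chatbot–scale combination. A minimal sketch using only the standard library (the pair key and values below are illustrative, not figures from the review):

```python
from statistics import mean, stdev

def summarize_pairs(scores: dict) -> dict:
    # scores maps (chatbot, index) -> list of values reported across studies.
    # Pairs with fewer than two values cannot yield a sample SD and are skipped.
    return {
        pair: (round(mean(vals), 2), round(stdev(vals), 2))
        for pair, vals in scores.items()
        if len(vals) >= 2
    }

# Illustrative input: three FKGL values reported for one model.
summary = summarize_pairs({("ChatGPT-4", "FKGL"): [11.2, 13.0, 12.4]})
# -> {("ChatGPT-4", "FKGL"): (12.2, 0.92)}, i.e. 12.2 ± 0.92
```

Note that `stdev` computes the sample (n − 1) standard deviation; studies reporting the population SD instead would need `statistics.pstdev`.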
Future research would benefit from operationalizing prompt variability through standardized prompt taxonomies, for example, by distinguishing informational, instructional, reassurance-oriented, and simplification-focused prompts. Such an approach would enable clearer attribution of observed readability differences to model architecture versus prompting strategy and support longitudinal comparisons across model updates.
A formal risk-of-bias or quality assessment was not conducted due to substantial heterogeneity in study designs, chatbot architectures, prompting strategies, medical domains, and readability metrics. This heterogeneity also precluded quantitative synthesis and meta-analysis, as pooling results across fundamentally different models and outcome measures would not yield meaningful summary estimates. Instead, findings were synthesized qualitatively through comparative analysis.

3. Results

The comparative analysis revealed notable variability in the readability of chatbot-generated texts across models and readability indices. Overall, most chatbots produced content that would require at least a secondary or early tertiary education level to be fully comprehensible, suggesting that the linguistic complexity of current large language models (LLMs) remains relatively high for lay audiences. The review included studies published between 2023 and 2025, encompassing the most recent phase of research on the readability of AI-generated medical content and reflecting the rapid evolution of large language models used in healthcare communication. Supplementary Table S2 provides detailed characteristics of all included studies, including publication year, country, chatbot model, language of output, text type, readability indices used, and primary outcomes.
Most studies originated in the United States (n = 60; 42.8% of all publications) and Turkey (n = 34; 24.3% of all publications). Table 1 presents the complete distribution of countries from which the studies were derived, whereas Figure 2 illustrates their geographical dispersion on a world map.
A total of 21 chatbots were used across the publications included in the review. The most frequently used model was ChatGPT-4 (94 occurrences). The next most frequently used chatbot was ChatGPT-3.5 (83 occurrences). Table 2 presents all chatbots used in the publications included in the review.
The most frequently used readability measure across the analyzed publications was the Flesch–Kincaid Grade Level (used 117 times), followed by the Flesch Reading Ease Score (used 94 times). Table 3 presents all readability indices and the frequency of their use across the included studies.
The most frequently addressed topics in the chatbot queries were Patient Education/Health Communication (18 occurrences), followed by Oncology/Cancer (15 occurrences) and Otolaryngology (13 occurrences). Figure 3 presents a tabular distribution of all medical fields covered in the analyzed publications.

Readability Patterns Across Medical Specialities

Distinct readability patterns emerged across different medical domains. Topics such as oncology, cardiology, neurology, and orthopaedics exhibited consistently higher grade-level scores across multiple chatbot models. These fields are characterized by dense terminology, abstract pathophysiological concepts, and complex treatment algorithms, all of which tend to increase syntactic complexity and average sentence length. In contrast, domains such as patient education, public health, and maternal care yielded comparatively lower readability scores. These topics typically rely on more narrative, instruction-based language that is easier for LLMs to simplify.
Notably, oncology-related responses demonstrated some of the highest complexity values in the dataset. This may reflect both the inherent difficulty of the domain and LLMs’ tendency to adopt cautious, legally conservative phrasing when discussing high-risk clinical conditions. Similarly, cardiology questions frequently elicited long, multi-clause sentences with numerous modifiers, suggesting that models may emphasize completeness over accessibility when addressing conditions perceived as clinically severe.
These speciality-level differences underscore the importance of contextualizing readability within the content domain, as the same model can yield dramatically different linguistic structures across clinical topics. Models producing lower grade-level estimates on one scale tended to score similarly across the others, reinforcing the robustness of the observed ranking patterns.
As shown in Table 4, readability scores vary considerably across chatbot models and readability indices, with most results exceeding the recommended 8th-grade level for patient-facing materials. Among the most frequently analyzed models, ChatGPT-4, Google Gemini, and Microsoft Copilot demonstrated the most balanced readability profiles. Their texts generally fell within the “difficult” category of the Flesch Reading Ease scale and corresponded to approximately college-level reading difficulty. These models showed relatively low variation across scales, indicating a consistent language structure and stable readability performance.
ChatGPT-3.5 and Perplexity, in contrast, generated content characterized by higher linguistic complexity, with longer sentences and more specialized vocabulary. Both models consistently scored higher on grade-level indices, implying that the information they produced would be challenging for audiences with average health literacy. Within the GPT family, the transition from version 3.5 to 4 was accompanied by a measurable improvement in readability, suggesting refinements in language coherence and sentence simplification in the newer model.
Models such as Claude and Meta AI showed intermediate readability, with scores fluctuating between moderate and difficult across the scales. This variability likely reflects the heterogeneity of available prompts and text domains used in the analyzed studies.
DeepSeek-V3 and DeepSeek-R1 were the only models to produce outputs classified as readable or moderately easy, with text difficulty levels approximating those recommended for patient information materials. Their consistently lower grade-level scores suggest that these models may prioritize shorter sentences and simpler word choice, making them more accessible to a general audience.
Smaller or domain-specific chatbots, such as DocsGPT, PiAI, ChatSpot, Vello, and Open Evidence, were represented in fewer studies and across fewer readability indices. While their readability estimates varied widely, these systems tended to exhibit higher linguistic variability and less consistent results, likely due to narrower training data and differing use cases.
Chatbots that achieved higher grade-level scores on indices such as the Flesch–Kincaid Grade Level, Gunning Fog, or Linsear Write generally exhibited lower values on the Flesch Reading Ease scale. This alignment indicates that the indices captured similar dimensions of linguistic complexity, providing a coherent overall picture of relative readability across chatbot-generated texts. Figure 4 presents a heatmap demonstrating the comparative distribution of 14 readability metrics across 21 AI chatbots. Missing data reflects incomplete reporting in source studies. Darker colours indicate lower values, while yellow-green shades indicate higher values.
Lower values on grade-level indices (e.g., Flesch–Kincaid Grade Level, Gunning Fog Index) indicate greater readability. In contrast, higher values on the Flesch Reading Ease scale correspond to easier-to-read text. Scale directionality is explicitly indicated to facilitate interpretation by readers unfamiliar with readability metrics.
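The directionality note above can be made concrete with the interpretation bands commonly cited for the Flesch Reading Ease scale. Band labels vary slightly between sources; the cut-offs below are the ones most often reproduced:

```python
def fre_band(score: float) -> str:
    # Commonly cited Flesch Reading Ease interpretation bands;
    # higher scores correspond to easier text.
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "very difficult"

# A score of 45 falls in the "difficult" band, typical of the chatbot
# outputs summarized in this review.
```

On the grade-level indices the mapping is the reverse: a Flesch–Kincaid Grade Level of 8 or below is the threshold usually recommended for patient-facing materials.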
We also summarized the citation impact of all included publications. Figure 5 presents the 20 most cited articles in the dataset, while the complete citation ranking of all 140 studies is provided in the Supplementary Materials Table S3.

4. Discussion

This review provides a comprehensive synthesis of existing studies assessing the readability of chatbot-generated medical texts using classical linguistic indices. In this review, readability is treated as a patient-facing communication dimension of AI-generated medical content, evaluated under the assumption of baseline informational correctness and considered complementary to, rather than a replacement for, accuracy and clinical validity assessments [175,176,177]. Readability should not be viewed as a purely linguistic attribute, as excessive textual complexity in healthcare contexts may directly compromise patient safety, healthcare reliability, and decision-making accuracy. AI-generated medical information that is difficult to read may increase the risk of misinterpreting treatment instructions, overlooking contraindications, or misunderstanding probabilistic risk information [178].
Significantly, readability interacts with known failure modes of generative AI systems, including hallucinations, overgeneralization, and omission of uncertainty markers. Linguistically dense or overly formal responses may obscure hedging statements and limitations, potentially fostering unwarranted trust in incorrect or incomplete information [179]. From this perspective, readability constitutes a core dimension of responsible AI deployment in healthcare, alongside accuracy, transparency, and domain alignment.
Recent domain-specific reviews further reinforce the importance of contextual grounding and clinical specialization in generative AI for healthcare. A comprehensive review of retrieval-augmented generation in healthcare suggests that grounding model outputs in curated clinical sources not only improves factual accuracy but may also constrain response scope, thereby indirectly enhancing communicative clarity [180,181,182]. Similarly, longitudinal analyses of generative AI applications in health care illustrate how domain-specific fine-tuning and guideline integration shape both informational quality and accessibility [183,184]. These findings suggest that readability assessments should explicitly account for whether standalone LLMs or augmented architectures generate chatbot responses, as this distinction may systematically influence linguistic complexity and clinical appropriateness.
While a growing number of publications have explored factual accuracy, empathy, or the reliability of AI-driven health information, the fundamental issue of linguistic accessibility has remained largely underexamined. By consolidating findings from 140 studies across 21 chatbot models, this review provides a comprehensive overview of the readability of chatbot-generated medical texts using classical linguistic indices.
Earlier research on online health communication—long before the advent of generative AI—consistently showed that most patient education materials were written at a level too advanced for the general population, typically above the 8th-grade level recommended by the American Medical Association and the U.S. Department of Health and Human Services [185,186]. Studies on web-based patient portals and hospital websites confirmed similar patterns, revealing that even materials intended for public education often demand college-level literacy.
Recent studies investigating chatbot-generated content, though limited in number and scope, have echoed these concerns. For example, it was reported that ChatGPT and Bard produced health information with Flesch–Kincaid Grade Levels of 12–14, substantially above recommended thresholds [187,188]. Similarly, another study found that ChatGPT’s answers regarding cardiovascular health were syntactically correct but lexically dense, often employing specialized terminology [189,190,191,192]. The present review confirms and extends these observations by aggregating evidence across multiple models and domains, demonstrating that the issue of excessive linguistic complexity is systemic rather than model-specific.
However, some studies have suggested that newer model iterations, such as GPT-4, tend to produce slightly simpler, more structured responses than earlier versions, such as GPT-3.5 [193,194]. This review provides converging evidence for this trend, indicating incremental but insufficient progress toward readability improvement. These findings collectively suggest that advances in model architecture alone do not guarantee improved accessibility for end users without deliberate optimization for readability.
This variability is not merely linguistic but reflects system-level technical choices that shape the generated text. Readability in LLM-generated medical text should therefore be interpreted as a downstream outcome shaped by these design decisions rather than as an inherent property of a model label. Variation in readability across studies may reflect differences in decoding strategies (e.g., temperature, sampling constraints, output length limits), prompt and instruction design (e.g., explicit simplification constraints, disclaimer requirements), and alignment objectives [195]. In particular, safety-optimized alignment procedures (including RLHF) can promote conservative phrasing, hedging, and extensive disclaimers, which may increase sentence length and syntactic complexity. Conversely, instruction tuning that prioritizes clarity and user comprehension may yield more concise, accessible outputs. Retrieval-augmented generation further complicates interpretation: while retrieval can improve factual grounding, it may also introduce domain-specific terminology and longer guideline-like responses that inflate grade-level estimates [196]. These mechanisms imply that readability comparisons between chatbots are not causally interpretable without standardized reporting of technical parameters and interaction settings.
To enable technically meaningful interpretation and cross-study comparability, future readability evaluations should routinely report a minimal set of system-level indicators: (i) model identifier, version, and date of access; (ii) complete prompt templates and instruction constraints (including system prompts where available); (iii) decoding parameters and output-length settings; (iv) retrieval/tool-use configuration (if applicable); (v) interaction design (single-turn vs. multi-turn, context length, memory settings); and (vi) post-processing or safety filtering applied to responses [197,198,199,200]. The limited reporting of such parameters in most of the existing literature constitutes a significant methodological barrier to linking readability outcomes to specific LLM techniques and to establishing reproducible benchmarks [201].
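The six reporting items (i)–(vi) proposed above could be captured in a simple structured record. The sketch below is a hypothetical schema, not an established standard; all field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReadabilityEvaluationRecord:
    # (i) model identifier, version, and date of access
    model_id: str
    access_date: str
    # (ii) complete prompt template and instruction constraints
    prompt_template: str
    system_prompt: Optional[str] = None
    # (iii) decoding parameters and output-length settings
    decoding: dict = field(default_factory=dict)
    # (iv) retrieval/tool-use configuration, if applicable
    retrieval_config: Optional[str] = None
    # (v) interaction design (single-turn vs. multi-turn, context length)
    interaction: str = "single-turn"
    # (vi) post-processing or safety filtering applied to responses
    post_processing: Optional[str] = None

record = ReadabilityEvaluationRecord(
    model_id="gpt-4 (2025-01 snapshot)",
    access_date="2025-01-15",
    prompt_template="Explain the condition at an 8th-grade reading level.",
    decoding={"temperature": 0.7, "max_tokens": 512},
)
```

Requiring such a record alongside every readability score would let later reviews stratify results by decoding settings or prompting strategy instead of treating each chatbot label as a black box.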
The observed variability in readability across models likely stems from architectural and training differences. GPT-3.5 and Perplexity frequently produced longer and more syntactically intricate sentences, consistent with their tendency to generate verbose, detail-heavy responses. GPT-4 and Gemini, although more consistent, still align with formal scientific prose because their training corpora heavily represent academic texts. In contrast, DeepSeek-V3 and DeepSeek-R1, models intentionally optimized for brevity, generated markedly shorter sentences and simpler vocabulary. This suggests that model alignment strategies and fine-tuning objectives play a decisive role in shaping linguistic accessibility.
It should also be acknowledged that many chatbots evaluated in the included studies may rely on shared large language model backends, common APIs, or similar corporate infrastructures, despite being presented as distinct systems. Consequently, observed differences in readability across chatbot labels may reflect variations in prompting strategies, interface design, or response formatting rather than fundamental differences in underlying model technologies [202,203].
An additional factor is the influence of reinforcement learning from human feedback (RLHF), which may inadvertently increase linguistic complexity by promoting cautious, formal, and legally conservative phrasing. Systems optimized primarily for safety or factual correctness may therefore produce verbose outputs (e.g., hedging or extensive disclaimers), whereas models fine-tuned with objectives emphasizing instructional clarity tend to generate more patient-friendly text. These findings support the need for fine-tuning pipelines that explicitly include readability as a core performance metric [204]. The persistence of high reading difficulty in chatbot-generated health communication underscores a broader challenge: technological sophistication does not automatically translate into information that patients can understand.
To address this, future chatbot design should incorporate mechanisms to monitor and adapt readability, including real-time complexity assessment and model optimization strategies that prioritize clarity over verbosity. Moreover, interdisciplinary collaboration between computer scientists, linguists, and health communication experts will be essential to ensure that AI systems are optimized not only for accuracy but also for comprehension and inclusivity.
Traditional readability metrics, while valuable, measure only the surface structure of text, such as sentence length, syllable count, and syntactic density. They do not capture semantic transparency, contextual coherence, or pragmatic appropriateness, all of which shape actual understanding. Several recent works have emphasized that comprehension depends on both linguistic and cognitive accessibility, including familiarity with medical terminology and the perceived credibility of the source [204,205,206,207].
Several high-impact studies within the dataset provide significant insights into how LLMs handle medical communication. For example, studies evaluating oncology- and cardiology-related materials demonstrated that even state-of-the-art models struggled to reach recommended reading levels, often producing content equivalent to college-level difficulty [208]. Research on low back pain, cataract surgery, and thyroid disorders found that although LLMs offer coherent, structurally organized explanations, they often introduce specialized terminology without simplifying or contextualizing it for lay readers [209,210,211,212,213,214]. Notably, several investigations comparing AI-generated content with expert-written materials revealed that AI models can surpass clinicians in structural clarity yet still fall short in accessibility, underscoring the dissociation between linguistic fluency and genuine readability [215,216,217,218].
These landmark studies collectively suggest that readability challenges are systemic across models and domains rather than isolated shortcomings. Their conclusions emphasize the need for computational approaches that extend beyond classical metrics toward more holistic, patient-centred evaluation frameworks.
The geographic concentration of readability research in English-speaking or high-income countries limits the generalizability of findings. Chatbots operating in languages with complex morphology (e.g., Polish, Turkish, or Korean) may exhibit different readability patterns due to linguistic structure and translation effects. Expanding this line of inquiry to multilingual and multicultural contexts is therefore crucial to understanding global variations and their equity implications.
At the policy level, the findings highlight the need for evidence-based standards for AI-generated health communication, analogous to readability guidelines for printed materials. Institutions such as the WHO or national health agencies could issue frameworks defining acceptable linguistic thresholds for AI-based public health tools, ensuring that emerging technologies align with accessibility principles.
Given the multiple factors influencing chatbot-generated responses, including model architecture, prompting strategies, knowledge base design, and regional context, the statistical results summarized in this review should be interpreted descriptively rather than causally. Their reliability lies in the consistency of observed readability patterns across multiple studies and indices, not in precise attribution to specific technological components.

5. Future Directions

5.1. Dynamic, Readability-Aware Text Generation

Future LLMs should incorporate real-time control mechanisms that allow users or healthcare providers to specify a target readability range (e.g., FKGL 6–8). Such systems could include built-in constraints on sentence length, lexical complexity, and structural density, enabling models to adapt dynamically to each patient’s literacy level. Integrating these features into user-facing interfaces would substantially improve the accessibility of AI-driven health communication.
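One illustrative form such a control mechanism could take is a regenerate-until-readable loop: score each draft, and re-prompt with an explicit simplification instruction whenever the estimated grade level exceeds the target. The sketch below is purely a sketch under stated assumptions — `generate` is a hypothetical callable standing in for any LLM API, and the crude vowel-group FKGL estimate stands in for a production readability checker.

```python
import re

def fkgl(text: str) -> float:
    """Rough Flesch-Kincaid Grade Level using a vowel-group
    syllable heuristic (illustrative, not dictionary-accurate)."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words
    )
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def generate_within_grade(prompt, generate, max_grade=8.0, max_tries=3):
    """Re-prompt a model until its draft meets the target grade level.

    `generate` is a hypothetical wrapper around an LLM call; a real
    system would also constrain sentence length and lexical rarity
    directly rather than relying on re-prompting alone.
    """
    text = generate(prompt)
    for _ in range(max_tries):
        if fkgl(text) <= max_grade:
            break
        text = generate(
            f"{prompt}\nRewrite the answer at or below a US grade "
            f"{max_grade:.0f} reading level, using short sentences "
            f"and common words."
        )
    return text
```

A stub generator that returns a progressively simplified draft is enough to exercise the loop; in deployment, the same check could run invisibly before any response reaches the patient.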

5.2. Beyond Surface Metrics: Hybrid Readability Models

Classical readability indices capture syntactic and lexical features but fail to assess semantic transparency or conceptual load. Combining traditional metrics with embedding-based semantic measures, such as contextual coherence or terminology familiarity, would create more comprehensive tools for evaluating patient comprehension. Future research should explore hybrid frameworks that combine rule-based and machine-learning indicators to capture the multifactorial nature of readability.
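A hybrid indicator of the kind proposed here might, for example, blend a surface proxy with a terminology-familiarity term. The sketch below is deliberately simplified: a tiny fixed common-word set stands in for the frequency corpora or embedding-based familiarity measures envisioned above, and the equal 0.5 weights are arbitrary assumptions chosen only for illustration.

```python
import re

# Tiny stand-in for a lay-vocabulary list; a real hybrid model would use
# word-frequency corpora or embedding-based familiarity scores (assumption).
COMMON_WORDS = {
    "the", "a", "and", "of", "to", "in", "is", "your", "you", "blood",
    "pressure", "heart", "doctor", "take", "pills", "day", "high",
    "needs", "them", "each", "see", "often", "can", "help", "with",
}

def jargon_rate(text: str) -> float:
    """Fraction of words outside the lay vocabulary (0 = all familiar)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return sum(w not in COMMON_WORDS for w in words) / max(len(words), 1)

def avg_word_length(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / max(len(words), 1)

def hybrid_difficulty(text: str) -> float:
    """Blend a surface proxy (scaled word length) with a semantic proxy
    (jargon rate); the weights are arbitrary for illustration."""
    return 0.5 * (avg_word_length(text) / 10.0) + 0.5 * jargon_rate(text)
```

Even this toy combination separates lay phrasing from clinical phrasing more informatively than word length alone, since the jargon term penalizes unfamiliar vocabulary that a short-word metric could miss.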

5.3. Cross-Linguistic and Cross-Cultural Readability Evaluation

Most studies included in this review focused on English-language outputs, limiting the generalizability of findings. Languages with complex morphology, such as Polish, Turkish, Korean, or Finnish, may exhibit different readability patterns due to inflectional structure and word length. Expanding research to multilingual contexts is crucial for ensuring equitable access to AI-generated health information and for identifying cultural and linguistic factors that modulate readability.

5.4. User-Based Comprehension Studies

A critical next step involves shifting from purely text-based metrics to patient-centred comprehension research. Randomized controlled studies assessing users’ understanding, recall, and decision-making accuracy after reading AI-generated texts would provide more actionable insights into real-world usability. Combining these behavioural outcomes with readability indices would help validate whether improvements in linguistic complexity translate into meaningful gains in patient comprehension.
An additional limitation of the current literature concerns the limited stratification of readability outcomes by use-case category. Chatbot-generated medical texts serve heterogeneous functions, including general health education, preventive counselling, disease-specific self-management, and post-discharge instructions. These use cases differ substantially in their tolerance for ambiguity, acceptable linguistic complexity, and clinical risk.
Aggregating readability scores across heterogeneous use cases may therefore obscure clinically meaningful differences and limit interpretability. Future evaluations should classify chatbot outputs into functional use-case categories to better align readability assessments with real-world healthcare applications.

6. Limitations and Strengths

Several limitations should be acknowledged. First, the review was restricted to studies that reported quantitative readability metrics and analyzed publicly available chatbot models. As a result, it may have excluded unpublished or domain-specific evaluations, particularly those conducted within clinical settings or using proprietary systems. Second, the included studies varied in methodology, prompt design, and thematic focus, limiting direct comparability and precluding meta-analytic synthesis. Some discrepancies in readability scores may therefore reflect differences in prompt structure rather than actual model variation. Third, the findings should be interpreted with the understanding that classical readability indices capture surface linguistic features rather than semantic or cognitive comprehension. Fourth, the geographical and linguistic concentration of existing research (predominantly in English and in high-income countries) limits the generalizability of conclusions to other languages and health systems.
This review also has several strengths. First, to our knowledge, it is the first synthesis of studies assessing the readability of chatbot-generated medical texts across a wide range of models, indices, and medical domains. By including 140 publications and systematically analyzing 21 chatbots and 14 readability measures, the review offers a broad overview of how AI communicates health information to lay audiences. Second, the inclusion of multiple readability indices and cross-model comparisons enhances methodological robustness and interpretive depth. The convergence of findings across different indices (e.g., Flesch–Kincaid, SMOG, and Gunning Fog) strengthens the validity of observed trends and supports the reliability of the overall conclusions. Third, the study offers a clear conceptual framework for future investigations by linking linguistic readability with broader issues of health literacy, digital equity, and responsible AI design. This interdisciplinary perspective situates the findings not only within computational linguistics but also within public health and communication research, making the results relevant for both technical and health policy audiences.

7. Conclusions

This review is, to our knowledge, among the first to systematically synthesize evidence on the readability of chatbot-generated medical content. Despite advances in AI language models, most outputs remain too complex for typical patient audiences, revealing a persistent communication gap. Readability should therefore be treated as a key quality criterion in the design and evaluation of health chatbots. Our findings also point to an emerging risk that general-purpose AI models may unintentionally widen the health communication gap unless readability-aware safety controls become standard in clinical and public-facing AI systems. Based on the reviewed evidence, future evaluations of AI-generated medical content should routinely report a minimum core set of readability indices and explicitly document the prompting strategies used. In addition, the implementation of generative AI tools in healthcare should incorporate standardized readability assessment and user-based comprehension testing as routine components of validation. At the policy level, public health agencies may consider developing guidelines and standards for readability in AI-generated patient communication.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16031423/s1, Table S1: Completed PRISMA 2020 Checklist; Table S2: Characteristics of all included studies; Table S3: Complete citation ranking of all 140 studies.

Author Contributions

Conceptualization, J.B. and R.O.; methodology, J.B.; software, K.W.; validation, R.O. and J.R.; formal analysis, K.W.; investigation, J.B.; resources, J.B.; data curation, J.B.; writing—original draft preparation, J.B., R.O. and K.W.; writing—review and editing, J.B., R.O. and J.R.; visualization, K.W.; supervision, R.O. and J.R.; project administration, J.B.; funding acquisition, none. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fitzpatrick, P.J. Improving health literacy using the power of digital communications to achieve better health outcomes for patients and practitioners. Front. Digit. Health 2023, 5, 1264780. [Google Scholar] [CrossRef] [PubMed]
  2. Sharkiya, S.H. Quality communication can improve patient-centred health outcomes among older patients: A rapid review. BMC Health Serv. Res. 2023, 23, 886. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, X.; Hay, J.L.; Waters, E.A.; Kiviniemi, M.T.; Biddle, C.; Schofield, E.; Li, Y.; Kaphingst, K.; Orom, H. Health Literacy and Use and Trust in Health Information. J. Health Commun. 2018, 23, 724–734. [Google Scholar] [CrossRef] [PubMed]
  4. Kwame, A.; Petrucka, P.M. A literature-based study of patient-centered care and communication in nurse-patient interactions: Barriers, facilitators, and the way forward. BMC Nurs. 2021, 20, 158. [Google Scholar] [CrossRef]
  5. Al Shamsi, H.; Almutairi, A.G.; Al Mashrafi, S.; Al Kalbani, T. Implications of Language Barriers for Healthcare: A Systematic Review. Oman Med. J. 2020, 35, e122. [Google Scholar] [CrossRef]
  6. Coughlin, S.S.; Vernon, M.; Hatzigeorgiou, C.; George, V. Health Literacy, Social Determinants of Health, and Disease Prevention and Control. J. Environ. Health Sci. 2020, 6, 3061. [Google Scholar]
  7. Pandey, M.; Maina, R.G.; Amoyaw, J.; Li, Y.; Kamrul, R.; Michaels, C.R.; Maroof, R. Impacts of English language proficiency on healthcare access, use, and outcomes among immigrants: A qualitative study. BMC Health Serv. Res. 2021, 21, 741. [Google Scholar] [CrossRef]
  8. Yeung, A.W.K.; Goto, T.K.; Leung, W.K. Readability of the 100 Most-Cited Neuroimaging Papers Assessed by Common Readability Formulae. Front. Hum. Neurosci. 2018, 12, 308. [Google Scholar] [CrossRef]
  9. Nash, E.; Bickerstaff, M.; Chetwynd, A.J.; Hawcutt, D.B.; Oni, L. The readability of parent information leaflets in paediatric studies. Pediatr. Res. 2023, 94, 1166–1171. [Google Scholar] [CrossRef]
  10. Brega, A.G.; Freedman, M.A.; LeBlanc, W.G.; Barnard, J.; Mabachi, N.M.; Cifuentes, M.; Albright, K.; Weiss, B.D.; Brach, C.; West, D.R. Using the Health Literacy Universal Precautions Toolkit to Improve the Quality of Patient Materials. J. Health Commun. 2015, 20, 69–76. [Google Scholar] [CrossRef]
  11. Rooney, M.K.; Santiago, G.; Perni, S.; Horowitz, D.P.; McCall, A.R.; Einstein, A.J.; Jagsi, R.; Golden, D.W. Readability of Patient Education Materials from High-Impact Medical Journals: A 20-Year Analysis. J. Patient Exp. 2021, 8, 2374373521998847. [Google Scholar] [CrossRef]
  12. Eltorai, A.E.; Ghanian, S.; Adams, C.A., Jr.; Born, C.T.; Daniels, A.H. Readability of patient education materials on the american association for surgery of trauma website. Arch. Trauma. Res. 2014, 3, e18161. [Google Scholar] [CrossRef]
  13. Badarudeen, S.; Sabharwal, S. Assessing readability of patient education materials: Current role in orthopaedics. Clin. Orthop. Relat. Res. 2010, 468, 2572–2580. [Google Scholar] [CrossRef]
  14. Geantă, M.; Bădescu, D.; Chirca, N.; Nechita, O.C.; Radu, C.G.; Rascu, Ș.; Rădăvoi, D.; Sima, C.; Toma, C.; Jinga, V. The Emerging Role of Large Language Models in Improving Prostate Cancer Literacy. Bioengineering 2024, 11, 654. [Google Scholar] [CrossRef] [PubMed]
  15. Demir, G.; Sevri, M.; Hacıosmanoğlu, C.D.; Büyüktaşkın, D.; Özaslan, A. Comparative Evaluation of Large Language Models in Addressing Autism-Related Information Queries: Insights from ChatGPT, Gemini, and Copilot. Gazi Med. J. 2025, 36, 407–416. [Google Scholar] [CrossRef]
  16. Bolgova, O.; Ganguly, P.; Mavrych, V. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot. Anat. Sci. Educ. 2025, 18, 718–726. [Google Scholar] [CrossRef] [PubMed]
  17. Swisher, A.R.; Wu, A.W.; Liu, G.C.; Lee, M.K.; Carle, T.R.; Tang, D.M. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT’s Large Language Model. Otolaryngol. Head Neck Surg. 2024, 171, 1751–1757. [Google Scholar] [CrossRef]
  18. Nasra, M.; Jaffri, R.; Pavlin-Premrl, D.; Kok, H.K.; Khabaza, A.; Barras, C.; Slater, L.A.; Yazdabadi, A.; Moore, J.; Russell, J.; et al. Can artificial intelligence improve patient educational material readability? A systematic review and narrative synthesis. Intern. Med. J. 2025, 55, 20–34. [Google Scholar] [CrossRef]
  19. Kirchner, G.J.; Kim, R.Y.; Weddle, J.B.; Bible, J.E. Can Artificial Intelligence Improve the Readability of Patient Education Materials? Clin. Orthop. Relat. Res. 2023, 481, 2260–2267. [Google Scholar] [CrossRef]
  20. Mokmin, N.A.M.; Ibrahim, N.A. The evaluation of chatbot as a tool for health literacy education among undergraduate students. Educ. Inf. Technol. 2021, 26, 6033–6049. [Google Scholar] [CrossRef]
  21. Sezer, B.; Aydoğdu, T. Performance of Advanced Artificial Intelligence Models in Traumatic Dental Injuries in Primary Dentition: A Comparative Evaluation of ChatGPT-4 Omni, DeepSeek, Gemini Advanced, and Claude 3.7 in Terms of Accuracy, Completeness, Response Time, and Readability. Appl. Sci. 2025, 15, 7778. [Google Scholar] [CrossRef]
  22. Tilton, A.K.; Caplan, B.E.; Cole, B.J. Generative AI in consumer health: Leveraging large language models for health literacy and clinical safety with a digital health framework. Front. Digit. Health 2025, 7, 1616488. [Google Scholar] [CrossRef]
  23. Randell, R.L.; Wilson, H.P.; Ragavan, M.I.; Collins, A.B.; Vail, J.; Ramirez, S.; Amodei, J.; Mickievicz, E.; Krieger, M.S.; Macon, E.C.; et al. Communicating Health Research with Plain Language. Inq. J. Health Care Organ. Provis. Financ. 2025, 62, 469580251357755. [Google Scholar] [CrossRef] [PubMed]
  24. Giguère, A.; Zomahoun, H.T.V.; Carmichael, P.H.; Uwizeye, C.B.; Légaré, F.; Grimshaw, J.M.; Gagnon, M.P.; Auguste, D.U.; Massougbodji, J. Printed educational materials: Effects on professional practice and healthcare outcomes. Cochrane Database Syst. Rev. 2020, 8, CD004398. [Google Scholar] [CrossRef] [PubMed]
  25. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776. [Google Scholar] [CrossRef]
  26. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  27. Reddy, S. Generative AI in healthcare: An implementation science informed translational path on application, integration and governance. Implement. Sci. 2024, 19, 27. [Google Scholar] [CrossRef]
  28. Warde, F.; Papadakos, J.; Papadakos, T.; Rodin, D.; Salhia, M.; Giuliani, M. Plain language communication as a priority competency for medical professionals in a globalized world. Can. Med. Educ. J. 2018, 9, e52–e59. [Google Scholar] [CrossRef]
  29. Delgado-Chaves, F.M.; Jennings, M.J.; Atalaia, A.; Wolff, J.; Horvath, R.; Mamdouh, Z.M.; Baumbach, J.; Baumbach, L. Transforming literature screening: The emerging role of large language models in systematic reviews. Proc. Natl. Acad. Sci. USA 2025, 122, e2411962122. [Google Scholar] [CrossRef]
  30. Yang, S.; Jing, M.; Wang, S.; Huang, Z.; Wang, J.; Kou, J.; Shi, M.; Xia, Z.; Wei, Q.; Xing, W.; et al. Building trustworthy large language model-driven generative recommender system for healthcare decision support: A scoping review of corpus sources, customization techniques, and evaluation frameworks. Artif. Intell. Med. 2026, 171, 103310. [Google Scholar] [CrossRef]
  31. Ozmen, B.B.; Singh, N.; Shah, K.; Berber, I.; Singh, D.; Pinsky, E.; Schulz, S.A.; Bishop, S.N.; Bernard, S.; Djohan, R.S.; et al. MicroRAG: Development of a Novel Artificial Intelligence Retrieval-Augmented Generation Model for Microsurgery Clinical Decision Support. Microsurgery 2025, 45, e70138. [Google Scholar] [CrossRef]
  32. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef]
  33. Busch, F.; Kaibel, L.; Nguyen, H.; Lemke, T.; Ziegelmayer, S.; Graf, M.; Marka, A.W.; Endrös, L.; Prucker, P.; Spitzl, D.; et al. Evaluation of a Retrieval-Augmented Generation-Powered Chatbot for Pre-CT Informed Consent: A Prospective Comparative Study. J. Imaging Inform. Med. 2025, 38, 4312–4323. [Google Scholar] [CrossRef]
  34. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
  35. Yurdakurban, E.; Topsakal, K.G.; Duran, G.S. A comparative analysis of AI-based chatbots: Assessing data quality in orthognathic surgery related patient information. J. Stomatol. Oral Maxillofac. Surg. 2024, 125, 101757. [Google Scholar] [CrossRef] [PubMed]
  36. Camargo, E.S.; Quadras, I.C.C.; Garanhani, R.R.; de Araujo, C.M.; Stuginski-Barbosa, J. A Comparative Analysis of Three Large Language Models on Bruxism Knowledge. J. Oral Rehabil. 2025, 52, 896–903. [Google Scholar] [CrossRef] [PubMed]
  37. Deveci, C.D.; Baker, J.J.; Sikander, B.; Rosenberg, J. A comparison of cover letters written by ChatGPT-4 or humans. Dan. Med. J. 2023, 70, A06230412. [Google Scholar]
  38. Kring, T.; Prasad, S.; Dadi, S.; Sokhn, E.; Franzmann, E. A comparison of quality and readability of Artificial Intelligence chatbots in triage for head and neck cancer. Am. J. Otolaryngol. 2025, 46, 104710. [Google Scholar] [CrossRef]
  39. Yun, J.Y.; Kim, D.J.; Lee, N.; Kim, E.K. A comprehensive evaluation of ChatGPT consultation quality for augmentation mammoplasty: A comparative analysis between plastic surgeons and laypersons. Int. J. Med. Inform. 2023, 179, 105219. [Google Scholar] [CrossRef]
  40. Carlson, J.A.; Cheng, R.Z.; Lange, A.; Nagalakshmi, N.; Rabets, J.; Shah, T.; Sindhwani, P. Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware. Cureus 2024, 16, e67996. [Google Scholar] [CrossRef]
  41. Halawani, A.; Mitchell, A.; Saffarzadeh, M.; Wong, V.; Chew, B.H.; Forbes, C.M. Accuracy and Readability of Kidney Stone Patient Information Materials Generated by a Large Language Model Compared to Official Urologic Organizations. Urology 2024, 186, 107–113. [Google Scholar] [CrossRef] [PubMed]
  42. Yau, J.Y.; Saadat, S.; Hsu, E.; Murphy, L.S.; Roh, J.S.; Suchard, J.; Tapia, A.; Wiechmann, W.; Langdorf, M.I. Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study. J. Med. Internet Res. 2024, 26, e60291. [Google Scholar] [CrossRef]
  43. Yıldız, H.A.; Söğütdelen, E. AI Chatbots as Sources of STD Information: A Study on Reliability and Readability. J. Med. Syst. 2025, 49, 43. [Google Scholar] [CrossRef] [PubMed]
  44. Stephan, D.; Bertsch, A.; Burwinkel, M.; Vinayahalingam, S.; Al-Nawas, B.; Kämmerer, P.W.; Thiem, D.G. AI in Dental Radiology-Improving the Efficiency of Reporting with ChatGPT: Comparative Study. J. Med. Internet Res. 2024, 26, e60684. [Google Scholar] [CrossRef] [PubMed]
  45. Hand, C.; Bohn, C.; Tannir, S.; Ulrich, M.; Saniei, S.; Girod-Hoffman, M.; Lu, Y.; Forsythe, B. American Academy of Orthopaedic Surgeons OrthoInfo provides more readable information regarding rotator cuff injury than ChatGPT. J. ISAKOS 2025, 12, 100841. [Google Scholar] [CrossRef]
  46. Bohn, C.; Hand, C.; Tannir, S.; Ulrich, M.; Saniei, S.; Girod-Hoffman, M.; Lu, Y.; Krych, A.; Forsythe, B. American academy of Orthopedic Surgeons’ OrthoInfo provides more readable information regarding meniscus injury than ChatGPT-4 while information accuracy is comparable. J. ISAKOS 2025, 11, 100843. [Google Scholar] [CrossRef]
  47. Ichhpujani, P.; Parmar, U.P.S.; Kumar, S. Appropriateness and readability of Google Bard and ChatGPT-3.5 generated responses for surgical treatment of glaucoma. Rom. J. Ophthalmol. 2024, 68, 243–248. [Google Scholar] [CrossRef]
  48. Azzopardi, M.; Ng, B.; Logeswaran, A.; Loizou, C.; Cheong, R.C.T.; Gireesh, P.; Ting, D.S.J.; Chong, Y.J. Artificial intelligence chatbots as sources of patient education material for cataract surgery: ChatGPT-4 versus Google Bard. BMJ Open Ophthalmol. 2024, 9, e001824. [Google Scholar] [CrossRef]
  49. Gondode, P.G.; Singh, R.; Mehta, S.; Singh, S.; Kumar, S.; Nayak, S.S. Artificial intelligence chatbots versus traditional medical resources for patient education on “Labor Epidurals”: An evaluation of accuracy, emotional tone, and readability. Int. J. Obstet. Anesth. 2025, 61, 104302. [Google Scholar] [CrossRef]
  50. Pradhan, F.; Fiedler, A.; Samson, K.; Olivera-Martinez, M.; Manatsathit, W.; Peeraphatdit, T. Artificial intelligence compared with human-derived patient educational materials on cirrhosis. Hepatol. Commun. 2024, 8, e0367. [Google Scholar] [CrossRef]
  51. Ayad, O.; Yassa, A.; Patel, A.M.; Vengsarkar, V.A.; Ayad, S.; Ayad, S.; Mikhael, M. Artificial intelligence in patient care: Evaluating artificial intelligence’s accuracy and accessibility in addressing blepharoplasty concerns. Int. Ophthalmol. 2025, 45, 244. [Google Scholar] [CrossRef] [PubMed]
  52. Erden, Y.; Temel, M.H.; Bağcıer, F. Artificial intelligence insights into osteoporosis: Assessing ChatGPT’s information quality and readability. Arch. Osteoporos. 2024, 19, 17. [Google Scholar] [CrossRef] [PubMed]
  53. Shin, D.; Park, H.; Shaffrey, I.; Yacoubian, V.; Taka, T.M.; Dye, J.; Danisa, O. Artificial intelligence versus clinical judgement: How accurately do generative models reflect CNS guidelines for chiari malformation? Clin. Neurol. Neurosurg. 2025, 248, 108662. [Google Scholar] [CrossRef] [PubMed]
  54. Andrikyan, W.; Sametinger, S.M.; Kosfeld, F.; Jung-Poppe, L.; Fromm, M.F.; Maas, R.; Nicolaus, H.F. Artificial intelligence-powered chatbots in search engines: A cross-sectional study on the quality and risks of drug information for patients. BMJ Qual. Saf. 2025, 34, 100–109. [Google Scholar] [CrossRef]
  55. De Rouck, R.; Wille, E.; Gilbert, A.; Vermeersch, N. Assessing artificial intelligence-generated patient discharge information for the emergency department: A pilot study. Int. J. Emerg. Med. 2025, 18, 85. [Google Scholar] [CrossRef]
  56. Mondal, H.; Gupta, G.; Sarangi, P.K.; Sharma, S.; Choudhary, P.K.; Juhi, A.; Kumari, A.; Mondal, S. Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries. Cureus 2025, 17, e80976. [Google Scholar] [CrossRef]
  57. Xu, Q.; Wang, J.; Chen, X.; Wang, J.; Li, H.; Wang, Z.; Li, W.; Gao, J.; Chen, C.; Gao, Y. Assessing the Efficacy of ChatGPT Prompting Strategies in Enhancing Thyroid Cancer Patient Education: A Prospective Study. J. Med. Syst. 2025, 49, 11. [Google Scholar] [CrossRef]
  58. Scaff, S.P.S.; Reis, F.J.J.; Ferreira, G.E.; Jacob, M.F.; Saragiotto, B.T. Assessing the performance of AI chatbots in answering patients’ common questions about low back pain. Ann. Rheum. Dis. 2025, 84, 143–149. [Google Scholar] [CrossRef]
  59. Dharia, S.N.; Traversone, J.; Wortman, R.; Mulligan, M. Assessing the quality and readability of ChatGPT responses to frequently asked questions about trigger finger release. J. Plast. Reconstr. Aesthet. Surg. 2025, 105, 170–172. [Google Scholar] [CrossRef]
  60. Stephenson-Moe, C.A.; Behers, B.J.; Gibons, R.M.; Behers, B.M.; Jesus Herrera, L.; Anneaud, D.; Rosario, M.A.; Wojtas, C.N.; Bhambrah, S.; Hamad, K.M. Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study. Medicine 2025, 104, e42135. [Google Scholar] [CrossRef]
  61. Grilo, A.; Marques, C.; Corte-Real, M.; Carolino, E.; Caetano, M. Assessing the Quality and Reliability of ChatGPT’s Responses to Radiotherapy-Related Patient Queries: Comparative Study with GPT-3.5 and GPT-4. JMIR Cancer 2025, 11, e63677. [Google Scholar] [CrossRef]
  62. Gezer, M.C.; Armangil, M. Assessing the quality of ChatGPT’s responses to commonly asked questions about trigger finger treatment. Turk. J. Trauma Emerg. Surg. Ulus. Travma Acil Cerrahi Derg. 2025, 31, 389–393. [Google Scholar] [CrossRef] [PubMed]
  63. Keating, M.; Bollard, S.M.; Potter, S. Assessing the Quality, Readability, and Acceptability of AI-Generated Information in Plastic and Aesthetic Surgery. Cureus 2024, 16, e73874. [Google Scholar] [CrossRef] [PubMed]
  64. Ozduran, E.; Hancı, V.; Erkin, Y.; Özbek, İ.C.; Abdulkerimov, V. Assessing the readability, quality and reliability of responses produced by ChatGPT, Gemini, and Perplexity regarding most frequently asked keywords about low back pain. PeerJ 2025, 13, e18847. [Google Scholar] [CrossRef] [PubMed]
  65. Ömür Arça, D.; Erdemir, İ.; Kara, F.; Shermatov, N.; Odacioğlu, M.; İbişoğlu, E.; Hanci, F.B.; Sağiroğlu, G.; Hanci, V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine 2024, 103, e38352. [Google Scholar] [CrossRef]
  66. Olszewski, R.; Watros, K.; Mańczak, M.; Owoc, J.; Jeziorski, K.; Brzeziński, J. Assessing the response quality and readability of chatbots in cardiovascular health, oncology, and psoriasis: A comparative study. Int. J. Med. Inform. 2024, 190, 105562. [Google Scholar] [CrossRef]
  67. Saeedi, S.; Bakhtiar, M. Assessing the response quality and readability of ChatGPT in stuttering. J. Fluen. Disord. 2025, 85, 106149. [Google Scholar] [CrossRef]
  68. Khabaz, K.; Newman-Hung, N.J.; Kallini, J.R.; Kendal, J.; Christ, A.B.; Bernthal, N.M.; Wessel, L.E. Assessment of Artificial Intelligence Chatbot Responses to Common Patient Questions on Bone Sarcoma. J. Surg. Oncol. 2025, 131, 719–724. [Google Scholar] [CrossRef]
  69. Pan, A.; Musheyev, D.; Bockelman, D.; Loeb, S.; Kabarriti, A.E. Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer. JAMA Oncol. 2023, 9, 1437–1440. [Google Scholar] [CrossRef]
  70. Topdağı, B.; Kavaz, T. Assessment of information quality in contemporary artificial intelligence systems for digital smile design: A comparative analysis. J. Prosthet. Dent. 2025, 134, 1279.E1–1279.E8. [Google Scholar] [CrossRef]
  71. Hancı, V.; Ergün, B.; Gül, Ş.; Uzun, Ö.; Erdemir, İ.; Hancı, F.B. Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care. Medicine 2024, 103, e39305. [Google Scholar] [CrossRef] [PubMed]
  72. Cao, H.; Hao, C.; Zhang, T.; Zheng, X.; Gao, Z.; Wu, J.; Gan, L.; Liu, Y.; Zeng, X.; Wang, W. Battle of the artificial intelligence: A comprehensive comparative analysis of DeepSeek and ChatGPT for urinary incontinence-related questions. Front. Public Health 2025, 13, 1605908. [Google Scholar] [CrossRef] [PubMed]
  73. Özer Aslan, İ.; Aslan, M.T. Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy. Healthcare 2025, 13, 1756. [Google Scholar] [CrossRef] [PubMed]
  74. Rouhi, A.D.; Ghanem, Y.K.; Yolchieva, L.; Saleh, Z.; Joshi, H.; Moccia, M.C.; Suarez-Pierre, A.; Han, J.J. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study. Cardiol. Ther. 2024, 13, 137–147. [Google Scholar] [CrossRef]
  75. Dursun, D.; Bilici Geçer, R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med. Inform. Decis. Mak. 2024, 24, 211. [Google Scholar] [CrossRef]
  76. Lack, B.T.; Mouhawasse, E.; Childers, J.T.; Jackson, G.R.; Daji, S.V.; Yerke-Hansen, P.; Familiari, F.; Knapik, D.M.; Sabesan, V.J. Can ChatGPT answer patient questions regarding reverse shoulder arthroplasty? J. ISAKOS 2024, 9, 100323. [Google Scholar] [CrossRef]
  77. Hones, K.; Krisanda, E.; Chim, H. Caution Regarding ChatGPT’s Appropriateness and Reliability Regarding Surgery for Wrist Arthritis. Hand 2025, 20, 910–916. [Google Scholar] [CrossRef]
  78. Dias, R.; Castan, A.; Gotoff, K.; Kadkoy, Y.; Ippolito, J.; Beebe, K.; Benevenia, J. ChatGPT 3.5 Better Improves Comprehensibility of English, than Spanish, Generated Responses to Osteosarcoma Questions. J. Surg. Oncol. 2025, 131, 1692–1695. [Google Scholar] [CrossRef]
  79. Nian, P.P.; Umesh, A.; Jones, R.H.; Adhiyaman, A.; Williams, C.J.; Goodbody, C.M.; Heyer, J.H.; Doyle, S.M. ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines. J. Pediatr. Orthop. Soc. N. Am. 2024, 10, 100135. [Google Scholar] [CrossRef]
  80. Siu, A.H.Y.; Gibson, D.P.; Chiu, C.; Kwok, A.; Irwin, M.; Christie, A.; Koh, C.E.; Keshava, A.; Reece, M.; Suen, M.; et al. ChatGPT as a patient education tool in colorectal cancer-An in-depth assessment of efficacy, quality and readability. Color. Dis. 2025, 27, e17267. [Google Scholar] [CrossRef]
  81. Deng, J.; Li, L.; Oosterhof, J.J.; Malliaras, P.; Silbernagel, K.G.; Breda, S.J.; Eygendaal, D.; Oei, E.H.; de Vos, R.J. ChatGPT is a comprehensive education tool for patients with patellar tendinopathy, but it currently lacks accuracy and readability. Musculoskelet. Sci. Pract. 2025, 76, 103275. [Google Scholar] [CrossRef]
  82. Mathes, S.; Seurig, S.; Bluhme, F.; Beyer, K.; Heizmann, F.; Wagner, M.; Neugärtner, I.; Biedermann, T.; Darsow, U. ChatGPT Performance on 120 Interdisciplinary Allergology Questions-Systematic Evaluation with Clinical Error Impact Assessment for Critical Erroneous AI-Guided Chatbot Advice. J. Allergy Clin. Immunol. Pract. 2025, 13, 1350–1357.e4. [Google Scholar] [CrossRef] [PubMed]
  83. AlShehri, Y.; McConkey, M.; Lodhia, P. ChatGPT Provides Satisfactory but Occasionally Inaccurate Answers to Common Patient Hip Arthroscopy Questions. Arthroscopy 2025, 41, 1337–1347. [Google Scholar] [CrossRef] [PubMed]
  84. Ho, R.A.; Shaari, A.L.; Cowan, P.T.; Yan, K. ChatGPT Responses to Frequently Asked Questions on Ménière’s Disease: A Comparison to Clinical Practice Guideline Answers. OTO Open 2024, 8, e163. [Google Scholar] [CrossRef] [PubMed]
  85. Shen, S.A.; Perez-Heydrich, C.A.; Xie, D.X.; Nellis, J.C. ChatGPT vs. web search for patient questions: What does ChatGPT do better? Eur. Arch. Otorhinolaryngol. 2024, 281, 3219–3225. [Google Scholar] [CrossRef]
  86. Sikander, B.; Baker, J.J.; Deveci, C.D.; Lund, L.; Rosenberg, J. ChatGPT-4 and Human Researchers Are Equal in Writing Scientific Introduction Sections: A Blinded, Randomized, Non-inferiority Controlled Study. Cureus 2023, 15, e49019. [Google Scholar] [CrossRef]
  87. Browne, R.; Gull, K.; Hurley, C.M.; Sugrue, R.M.; O’Sullivan, J.B. ChatGPT-4 Can Help Hand Surgeons Communicate Better with Patients. J. Hand Surg. Glob. Online 2024, 6, 436–438. [Google Scholar] [CrossRef]
  88. Akyol Onder, E.N.; Ensari, E.; Ertan, P. ChatGPT-4o’s performance on pediatric vesicoureteral reflux. J. Pediatr. Urol. 2025, 21, 504–509. [Google Scholar] [CrossRef]
  89. Najafali, D.; Galbraith, L.G.; Camacho, J.M.; Stoffel, V.; Herzog, I.; Moss, C.; Taiberg, S.L.; Knoedler, L. Class in Session: Analysis of GPT-4-created Plastic Surgery In-service Examination Questions. Plast. Reconstr. Surg. Glob. Open 2024, 12, e6185. [Google Scholar] [CrossRef]
  90. Bahçeci, T.; Elmaağaç, B.; Ceyhan, E. Comparative analysis of the effectiveness of Microsoft Copilot artificial intelligence chatbot and Google Search in answering patient inquiries about infertility: Evaluating readability, understandability, and actionability. Int. J. Impot. Res. 2025, 37, 1002–1007. [Google Scholar] [CrossRef]
  91. Maron, C.M.; Emile, S.H.; Horesh, N.; Freund, M.R.; Pellino, G.; Wexner, S.D. Comparing answers of ChatGPT and Google Gemini to common questions on benign anal conditions. Tech. Coloproctol. 2025, 29, 57. [Google Scholar] [CrossRef]
  92. Du, K.; Li, A.; Zuo, Q.H.; Zhang, C.Y.; Guo, R.; Chen, P.; Du, W.S.; Li, S.M. Comparing Artificial Intelligence-Generated and Clinician-Created Personalized Self-Management Guidance for Patients with Knee Osteoarthritis: Blinded Observational Study. J. Med. Internet Res. 2025, 27, e67830. [Google Scholar] [CrossRef]
  93. Gondode, P.; Duggal, S.; Garg, N.; Sethupathy, S.; Asai, O.; Lohakare, P. Comparing patient education tools for chronic pain medications: Artificial intelligence chatbot versus traditional patient information leaflets. Indian J. Anaesth. 2024, 68, 631–636. [Google Scholar] [CrossRef] [PubMed]
  94. Shanmugam, S.K.; Browning, D.J. Comparison of Large Language Models in Diagnosis and Management of Challenging Clinical Cases. Clin. Ophthalmol. 2024, 18, 3239–3247. [Google Scholar] [CrossRef] [PubMed]
  95. Roy, J.M.; Atallah, E.; Piper, K.; Majmundar, S.; Mouchtouris, N.; Self, D.M.; Kaul, A.; Sizdahkhani, S.; Musmar, B.; Tjoumakaris, S.I.; et al. Comparison of quality, empathy and readability of physician responses versus chatbot responses to common cerebrovascular neurosurgical questions on a social media platform. Clin. Neurol. Neurosurg. 2025, 255, 108986. [Google Scholar] [CrossRef] [PubMed]
  96. Zaleski, A.L.; Berkowsky, R.; Craig, K.J.T.; Pescatello, L.S. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med. Educ. 2024, 10, e51308. [Google Scholar] [CrossRef]
  97. Singh, S.; Errampalli, E.; Errampalli, N.; Miran, M.S. Enhancing Patient Education on Cardiovascular Rehabilitation with Large Language Models. Mo. Med. 2025, 122, 67–71. [Google Scholar]
  98. Abreu, A.A.; Murimwa, G.Z.; Farah, E.; Stewart, J.W.; Zhang, L.; Rodriguez, J.; Sweetenham, J.; Zeh, H.J.; Wang, S.C.; Polanco, P.M. Enhancing Readability of Online Patient-Facing Content: The Role of AI Chatbots in Improving Cancer Information Accessibility. J. Natl. Compr. Canc. Netw. 2024, 22, e237334. [Google Scholar] [CrossRef]
  99. Mondal, H.; Tiu, D.N.; Mondal, S.; Dutta, R.; Naskar, A.; Podder, I. Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots. J. Midlife Health 2025, 16, 45–50. [Google Scholar] [CrossRef]
  100. Zhan, Y.; Chen, X.; Ye, F.; Wu, Z.; Usman, M.; Yuan, Z.; Wu, H.; Huang, J.; Yu, H. Evaluating AI Chatbot Responses to Postkidney Transplant Inquiries. Transplant. Proc. 2025, 57, 394–405. [Google Scholar] [CrossRef]
  101. Kayra, M.V.; Anil, H.; Ozdogan, I.; Baradia, S.M.A.; Toksoz, S. Evaluating AI chatbots in penis enhancement information: A comparative analysis of readability, reliability and quality. Int. J. Impot. Res. 2025, 37, 558–563. [Google Scholar] [CrossRef]
  102. Kacer, E.O. Evaluating AI-based breastfeeding chatbots: Quality, readability, and reliability analysis. PLoS ONE 2025, 20, e0319782. [Google Scholar] [CrossRef] [PubMed]
  103. Zhou, M.; Pan, Y.; Zhang, Y.; Song, X.; Zhou, Y. Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int. J. Med. Inform. 2025, 198, 105871. [Google Scholar] [CrossRef]
  104. Helvacioglu-Yigit, D.; Demirturk, H.; Ali, K.; Tamimi, D.; Koenig, L.; Almashraqi, A. Evaluating artificial intelligence chatbots for patient education in oral and maxillofacial radiology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2025, 139, 750–759. [Google Scholar] [CrossRef] [PubMed]
  105. Dincer, H.A.; Dogu, D. Evaluating Artificial Intelligence in Patient Education: DeepSeek-V3 Versus ChatGPT-4o in Answering Common Questions on Laparoscopic Cholecystectomy. ANZ J. Surg. 2025, 95, 2322–2328. [Google Scholar] [CrossRef] [PubMed]
  106. Sina, E.M.; Campbell, D.J.; Duffy, A.; Mandloi, S.; Benedict, P.; Farquhar, D.; Unsal, A.; Nyquist, G. Evaluating ChatGPT as a Patient Education Tool for COVID-19-Induced Olfactory Dysfunction. OTO Open 2024, 8, e70011. [Google Scholar] [CrossRef]
  107. Lee, T.J.; Campbell, D.J.; Rao, A.K.; Hossain, A.; Elkattawy, O.; Radfar, N.; Lee, P.; Gardin, J.M. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024, 16, e61680. [Google Scholar] [CrossRef]
  108. Campbell, D.J.; Estephan, L.E.; Mastrolonardo, E.V.; Amin, D.R.; Huntley, C.T.; Boon, M.S. Evaluating ChatGPT responses on obstructive sleep apnea for patient education. J. Clin. Sleep Med. 2023, 19, 1989–1995. [Google Scholar] [CrossRef]
  109. Pandey, V.K.; Munshi, A.; Mohanti, B.K.; Bansal, K.; Rastogi, K. Evaluating ChatGPT to test its robustness as an interactive information database of radiation oncology and to assess its responses to common queries from radiotherapy patients: A single institution investigation. Cancer Radiother. 2024, 28, 258–264. [Google Scholar] [CrossRef]
  110. Sahin, S.; Erkmen, B.; Duymaz, Y.K.; Bayram, F.; Tekin, A.M.; Topsakal, V. Evaluating ChatGPT-4’s performance as a digital health advisor for otosclerosis surgery. Front. Surg. 2024, 11, 1373843. [Google Scholar] [CrossRef]
  111. Alapati, R.; Campbell, D.; Molin, N.; Creighton, E.; Wei, Z.; Boon, M.; Huntley, C. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J. Clin. Sleep Med. 2024, 20, 583–594. [Google Scholar] [CrossRef]
  112. Fazilat, A.Z.; Brenac, C.; Kawamoto-Duran, D.; Berry, C.E.; Alyono, J.; Chang, M.T.; Liu, D.T.; Patel, Z.M.; Tringali, S.; Wan, D.C.; et al. Evaluating the quality and readability of ChatGPT-generated patient-facing medical information in rhinology. Eur. Arch. Otorhinolaryngol. 2025, 282, 1911–1920. [Google Scholar] [CrossRef]
  113. Giammanco, P.A.; Collins, C.E.; Zimmerman, J.; Kricfalusi, M.; Rice, R.C.; Trumbo, M.; Carlson, B.A.; Rajfer, R.A.; Schneiderman, B.A.; Elsissy, J.G. Evaluating the Quality and Readability of Information Provided by Generative Artificial Intelligence Chatbots on Clavicle Fracture Treatment Options. Cureus 2025, 17, e77200. [Google Scholar] [CrossRef] [PubMed]
  114. Singavarapu, J.; Khemlani, A.; Jacobs, M.; Berglas, E.; Lazar, J.; Kabarriti, A. Evaluating the Quality of Cardiovascular Disease Information from AI Chatbots: A Comparative Study. Cureus 2025, 17, e88085. [Google Scholar] [CrossRef] [PubMed]
  115. Kara, M.; Ozduran, E.; Kara, M.M.; Özbek, İ.C.; Hancı, V. Evaluating the readability, quality, and reliability of responses generated by ChatGPT, Gemini, and Perplexity on the most commonly asked questions about Ankylosing spondylitis. PLoS ONE 2025, 20, e0326351. [Google Scholar] [CrossRef] [PubMed]
  116. Karaagac, M.; Carkit, S. Evaluation of AI-Based Chatbots in Liver Cancer Information Dissemination: A Comparative Analysis of GPT, DeepSeek, Copilot, and Gemini. Oncology 2025, 1–10. [Google Scholar] [CrossRef]
  117. Spina, A.; Andalib, S.; Flores, D.; Vermani, R.; Halaseh, F.F.; Nelson, A.M. Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study. JMIR AI 2024, 3, e54371. [Google Scholar] [CrossRef]
  118. Şahin, M.F.; Keleş, A.; Özcan, R.; Doğan, Ç.; Topkaç, E.C.; Akgül, M.; Yazıci, C.M. Evaluation of information accuracy and clarity: ChatGPT responses to the most frequently asked questions about premature ejaculation. Sex. Med. 2024, 12, qfae036. [Google Scholar] [CrossRef]
  119. Öztürk, Z.; Bal, C.; Çelikkaya, B.N. Evaluation of Information Provided by ChatGPT Versions on Traumatic Dental Injuries for Dental Students and Professionals. Dent. Traumatol. 2025, 41, 427–436. [Google Scholar] [CrossRef]
  120. Casciato, D.; Mateen, S.; Cooperman, S.; Pesavento, D.; Brandao, R.A. Evaluation of Online AI-Generated Foot and Ankle Surgery Information. J. Foot Ankle Surg. 2024, 63, 680–683. [Google Scholar] [CrossRef]
  121. Davis, R.J.; Ayo-Ajibola, O.; Lin, M.E.; Swanson, M.S.; Chambers, T.N.; Kwon, D.I.; Kokot, N.C. Evaluation of Oropharyngeal Cancer Information from Revolutionary Artificial Intelligence Chatbot. Laryngoscope 2024, 134, 2252–2257. [Google Scholar] [CrossRef]
  122. Meyer, M.K.R.; Kandathil, C.K.; Davis, S.J.; Durairaj, K.K.; Patel, P.N.; Pepper, J.P.; Spataro, E.A.; Most, S.P. Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy. Aesthetic Plast. Surg. 2025, 49, 1868–1873. [Google Scholar] [CrossRef] [PubMed]
  123. Gupta, A.; Basha, A.; Sontam, T.R.; Hlavinka, W.J.; Croen, B.J.; Abdou, C.; Abdullah, M.; Hamilton, R. Evolution of patient education materials from large-language artificial intelligence models on complex regional pain syndrome: Are patients learning? Bayl. Univ. Med. Cent. Proc. 2025, 38, 221–226. [Google Scholar] [CrossRef] [PubMed]
  124. Kılınç, D.D.; Mansız, D. Examination of the reliability and readability of Chatbot Generative Pretrained Transformer’s (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version. Am. J. Orthod. Dentofacial Orthop. 2024, 165, 546–555. [Google Scholar] [CrossRef] [PubMed]
  125. Canillas Del Rey, F.; Canillas Arias, M. Exploring the potential of Artificial Intelligence in Traumatology: Conversational answers to specific questions. Rev. Esp. Cir. Ortop. Traumatol. 2025, 69, 38–46 (In English and Spanish). [Google Scholar] [CrossRef]
  126. Park, K.U.; Lipsitz, S.; Dominici, L.S.; Lynce, F.; Minami, C.A.; Nakhlis, F.; Waks, A.G.; Warren, L.E.; Eidman, N.; Frazier, J.; et al. Generative artificial intelligence as a source of breast cancer information for patients: Proceed with caution. Cancer 2025, 131, e35521. [Google Scholar] [CrossRef]
  127. Zaretsky, J.; Kim, J.M.; Baskharoun, S.; Zhao, Y.; Austrian, J.; Aphinyanaphongs, Y.; Gupta, R.; Blecker, S.B.; Feldman, J. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7, e240357. [Google Scholar] [CrossRef]
  128. Lee, Y.; Shin, T.; Tessier, L.; Javidan, A.; Jung, J.; Hong, D.; Strong, A.T.; McKechnie, T.; Malone, S.; ASMBS Artificial Intelligence and Digital Surgery Task Force; et al. Harnessing artificial intelligence in bariatric surgery: Comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg. Obes. Relat. Dis. 2024, 20, 603–608. [Google Scholar] [CrossRef]
  129. Asfuroğlu, Z.M.; Yağar, H.; Gümüşoğlu, E. High accuracy but limited readability of large language model-generated responses to frequently asked questions about Kienböck’s disease. BMC Musculoskelet. Disord. 2024, 25, 879. [Google Scholar] [CrossRef]
  130. Gül, Ş.; Erdemir, İ.; Hanci, V.; Aydoğmuş, E.; Erkoç, Y.S. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and Perplexity responses. Medicine 2024, 103, e38009. [Google Scholar] [CrossRef]
  131. Ulusoy, I.; Yılmaz, M.; Kıvrak, A. How Efficient Is ChatGPT in Accessing Accurate and Quality Health-Related Information? Cureus 2023, 15, e46662. [Google Scholar] [CrossRef] [PubMed]
  132. Akkan, H.; Seyyar, G.K. Improving readability in AI-generated medical information on fragility fractures: The role of prompt wording on ChatGPT’s responses. Osteoporos. Int. 2025, 36, 403–410. [Google Scholar] [CrossRef] [PubMed]
  133. Tan, C.W.; Chan, J.C.Y.; Chan, J.J.I.; Nagarajan, S.; Sng, B.L. Information about labor epidural analgesia: An updated evaluation on the readability, accuracy, and quality of ChatGPT responses incorporating patient preferences and complex clinical scenarios. Int. J. Obstet. Anesth. 2025, 63, 104688. [Google Scholar] [CrossRef] [PubMed]
  134. Xie, Y.; Seth, I.; Hunter-Smith, D.J.; Rozen, W.M.; Seifman, M.A. Investigating the impact of innovative AI chatbot on post-pandemic medical education and clinical assistance: A comprehensive analysis. ANZ J. Surg. 2024, 94, 68–77. [Google Scholar] [CrossRef]
  135. Cao, J.J.; Kwon, D.H.; Ghaziani, T.T.; Kwo, P.; Tse, G.; Kesselman, A.; Kamaya, A.; Tse, J.R. Large language models’ responses to liver cancer surveillance, diagnosis, and management questions: Accuracy, reliability, readability. Abdom. Radiol. 2024, 49, 4286–4294. [Google Scholar] [CrossRef]
  136. Singh, S.P.; Jamal, A.; Qureshi, F.; Zaidi, R.; Qureshi, F. Leveraging Generative Artificial Intelligence Models in Patient Education on Inferior Vena Cava Filters. Clin. Pract. 2024, 14, 1507–1514. [Google Scholar] [CrossRef]
  137. Andreadis, K.; Newman, D.R.; Twan, C.; Shunk, A.; Mann, D.M.; Stevens, E.R. Mixed methods assessment of the influence of demographics on medical advice of ChatGPT. J. Am. Med. Inform. Assoc. 2024, 31, 2002–2009. [Google Scholar] [CrossRef]
  138. Shukla, I.Y.; Sun, M.Z. Online and ChatGPT-generated patient education materials regarding brain tumor prognosis fail to meet readability standards. J. Clin. Neurosci. 2025, 138, 111410. [Google Scholar] [CrossRef]
  139. Hunter, N.; Allen, D.; Xiao, D.; Cox, M.; Jain, K. Patient education resources for oral mucositis: A google search and ChatGPT analysis. Eur. Arch. Otorhinolaryngol. 2025, 282, 1609–1618. [Google Scholar] [CrossRef]
  140. Yalla, G.R.; Hyman, N.; Hock, L.E.; Zhang, Q.; Shukla, A.G.; Kolomeyer, N.N. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted from Patient Brochures. Cureus 2024, 16, e56766. [Google Scholar] [CrossRef]
  141. Alasker, A.; Alsalamah, S.; Alshathri, N.; Almansour, N.; Alsalamah, F.; Alghafees, M.; AlKhamees, M.; Alsaikhan, B. Performance of large language models (LLMs) in providing prostate cancer information. BMC Urol. 2024, 24, 177. [Google Scholar] [CrossRef]
  142. Chen, D.; Parsa, R.; Hope, A.; Hannon, B.; Mak, E.; Eng, L.; Liu, F.F.; Fallah-Rad, N.; Heesters, A.M.; Raman, S. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions from Social Media. JAMA Oncol. 2024, 10, 956–960. [Google Scholar] [CrossRef] [PubMed]
  143. Zhang, J.; Sun, Y.; Rong, Y.; Li, H.; Jiang, B.; Zhao, C.; Liu, H. Potential of AI Chatbots in Online Hair Transplantation Consultations: A Multi-metric Assessment of Three Models. Aesthetic Plast. Surg. 2025, 49, 6155–6161. [Google Scholar] [CrossRef] [PubMed]
  144. Bragazzi, N.L.; Buchinger, M.; Atwan, H.; Tuma, R.; Chirico, F.; Szarpak, L.; Farah, R.; Khamisy-Farah, R. Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists’ Knowledge on COVID-19’s Impacts in Pregnancy: Cross-Sectional Pilot Study. JMIR Form. Res. 2025, 9, e56126. [Google Scholar] [CrossRef] [PubMed]
  145. Warren, C.J.; Edmonds, V.S.; Payne, N.G.; Voletti, S.; Wu, S.Y.; Colquitt, J.; Sadeghi-Nejad, H.; Punjani, N. Prompt matters: Evaluation of large language model chatbot responses related to Peyronie’s disease. Sex. Med. 2024, 12, qfae055. [Google Scholar] [CrossRef]
  146. Warren, C.J.; Payne, N.G.; Edmonds, V.S.; Voleti, S.S.; Choudry, M.M.; Punjani, N.; Abdul-Muhsin, H.M.; Humphreys, M.R. Quality of Chatbot Information Related to Benign Prostatic Hyperplasia. Prostate 2025, 85, 175–180. [Google Scholar] [CrossRef]
  147. Stapleton, P.; Santucci, J.; Cundy, T.P.; Sathianathen, N. Quality of Information on Wilms Tumor from Artificial Intelligence Chatbots: What Are Your Patients and Their Families Reading? Urology 2025, 198, 130–134. [Google Scholar] [CrossRef]
  148. Boscolo-Rizzo, P.; Marcuzzo, A.V.; Lazzarin, C.; Giudici, F.; Polesel, J.; Stellin, M.; Pettorelli, A.; Spinato, G.; Ottaviano, G.; Ferrari, M.; et al. Quality of Information Provided by Artificial Intelligence Chatbots Surrounding the Reconstructive Surgery for Head and Neck Cancer: A Comparative Analysis Between ChatGPT4 and Claude2. Clin. Otolaryngol. 2025, 50, 330–335. [Google Scholar] [CrossRef]
  149. Aydın, F.O.; Aksoy, B.K.; Ceylan, A.; Akbaş, Y.B.; Ermiş, S.; Kepez Yıldız, B.; Yıldırım, Y. Readability and Appropriateness of Responses Generated by ChatGPT 3.5, ChatGPT 4.0, Gemini, and Microsoft Copilot for FAQs in Refractive Surgery. Turk. J. Ophthalmol. 2024, 54, 313–317. [Google Scholar] [CrossRef]
  150. Musheyev, D.; Pan, A.; Gross, P.; Kamyab, D.; Kaplinsky, P.; Spivak, M.; Bragg, M.A.; Loeb, S.; Kabarriti, A.E. Readability and Information Quality in Cancer Information from a Free vs Paid Chatbot. JAMA Netw. Open 2024, 7, e2422275. [Google Scholar] [CrossRef]
  151. Alsabawi, Y.; Quesada, P.R.; Rouse, D.T. Readability of custom chatbot vs. GPT-4 responses to otolaryngology-related patient questions. Am. J. Otolaryngol. 2025, 46, 104717. [Google Scholar] [CrossRef]
  152. Gawey, L.; Dagenet, C.B.; Tran, K.A.; Park, S.; Hsiao, J.L.; Shi, V. Readability of Information Generated by ChatGPT for Hidradenitis Suppurativa. JMIR Dermatol. 2024, 7, e55204. [Google Scholar] [CrossRef] [PubMed]
  153. Büker, M.; Mercan, G. Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment. Int. J. Med. Inform. 2025, 201, 105948. [Google Scholar] [CrossRef] [PubMed]
  154. Ozduran, E.; Akkoc, I.; Büyükçoban, S.; Erkin, Y.; Hanci, V. Readability, reliability and quality of responses generated by ChatGPT, Gemini, and Perplexity for the most frequently asked questions about pain. Medicine 2025, 104, e41780. [Google Scholar] [CrossRef] [PubMed]
  155. Alamleh, S.; Mavedatnia, D.; Francis, G.; Le, T.; Davies, J.; Lin, V.; Lee, J.J.W. Readability, Reliability, and Quality Analysis of Internet-Based Patient Education Materials and Large Language Models on Meniere’s Disease. J. Otolaryngol. Head Neck Surg. 2025, 54, 19160216251360651. [Google Scholar] [CrossRef]
  156. Şan, H.; Bayrakcı, Ö.; Çağdaş, B.; Serdengeçti, M.; Alagöz, E. Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients. Rev. Esp. Med. Nucl. Imagen. Mol. Engl. Ed. 2024, 43, 500021. [Google Scholar] [CrossRef]
  157. Aydinbelge-Dizdar, N.; Dizdar, K. Evaluation of the reliability and readability of chatbot responses as a patient information resource for the most common PET-CT examinations. Rev. Esp. Med. Nucl. Imagen. Mol. Engl. Ed. 2025, 44, 500065. [Google Scholar] [CrossRef]
  158. Şahin, M.F.; Ateş, H.; Keleş, A.; Özcan, R.; Doğan, Ç.; Akgül, M.; Yazıcı, C.M. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J. Med. Syst. 2024, 48, 38. [Google Scholar] [CrossRef]
  159. Yassa, A.; Ayad, O.; Cohen, D.A.; Patel, A.M.; Vengsarkar, V.A.; Hegazin, M.S.; Filimonov, A.; Hsueh, W.D.; Eloy, J.A. Search for medical information for chronic rhinosinusitis through an artificial intelligence ChatBot. Laryngoscope Investig. Otolaryngol. 2024, 9, e70009. [Google Scholar] [CrossRef]
  160. Shin, D.; Tang, T.; Carson, J.; Isaac, R.; Dinh, C.; Im, D.; Fay, A.; Isaac, A.; Cho, S.; Brandt, Z.; et al. Subthalamic nucleus or globus pallidus internus deep brain stimulation for the treatment of parkinson’s disease: An artificial intelligence approach. J. Clin. Neurosci. 2025, 138, 111393. [Google Scholar] [CrossRef]
  161. Anıl, H.; Kayra, M.V. The digital dialogue on premature ejaculation: Evaluating the efficacy of artificial intelligence-driven responses. Int. Urol. Nephrol. 2025, 57, 2829–2836. [Google Scholar] [CrossRef]
  162. Liu, X.; Shi, S.; Zhang, X.; Gao, Q.; Wang, W. The role of ChatGPT-4o in differential diagnosis and management of vertigo-related disorders. Sci. Rep. 2025, 15, 18688. [Google Scholar] [CrossRef] [PubMed]
  163. Taka, T.M.; Collins, C.E.; Miner, A.; Overfield, I.; Shin, D.; Seo, L.; Danisa, O. The role of generative artificial intelligence in deciding fusion treatment of lumbar degeneration: A comparative analysis and narrative review. Eur. Spine J. 2025, 34, 3901–3910. [Google Scholar] [CrossRef] [PubMed]
  164. Arzu, U.; Gencer, B. To Self-Treat or Not to Self-Treat: Evaluating the Diagnostic, Advisory and Referral Effectiveness of ChatGPT Responses to the Most Common Musculoskeletal Disorders. Diagnostics 2025, 15, 1834. [Google Scholar] [CrossRef] [PubMed]
  165. Ayo-Ajibola, O.; Davis, R.J.; Lin, M.E.; Vukkadala, N.; O’Dell, K.; Swanson, M.S.; Johns, M.M., 3rd; Shuman, E.A. TrachGPT: Appraisal of tracheostomy care recommendations from an artificial intelligent Chatbot. Laryngoscope Investig. Otolaryngol. 2024, 9, e1300. [Google Scholar] [CrossRef]
  166. Kerkütlüoğlu, M.; Kaya, E.; Gökmen, R. Trustworthiness, Value, Danger, and Readability of ChatGPT-Generated Responses to Health Questions Related to Pulmonary Arterial Hypertension. Cureus 2024, 16, e71472. [Google Scholar] [CrossRef]
  167. Lee, T.J.; Campbell, D.J.; Patel, S.; Hossain, A.; Radfar, N.; Siddiqui, E.; Gardin, J.M. Unlocking Health Literacy: The Ultimate Guide to Hypertension Education from ChatGPT Versus Google Gemini. Cureus 2024, 16, e59898. [Google Scholar] [CrossRef]
  168. Covington, E.W.; Watts Alexander, C.S.; Sewell, J.; Hutchison, A.M.; Kay, J.; Tocco, L.; Hyte, M. Unlocking the future of patient education: ChatGPT vs. LexiComp® as sources of patient education materials. J. Am. Pharm. Assoc. 2025, 65, 102119. [Google Scholar] [CrossRef]
  169. Steimetz, E.; Minkowitz, J.; Gabutan, E.C.; Ngichabe, J.; Attia, H.; Hershkop, M.; Ozay, F.; Hanna, M.G.; Gupta, R. Use of Artificial Intelligence Chatbots in Interpretation of Pathology Reports. JAMA Netw. Open 2024, 7, e2412767. [Google Scholar] [CrossRef]
  170. Patel, T.A.; Michaelson, G.; Morton, Z.; Harris, A.; Smith, B.; Bourguillon, R.; Wu, E.; Eguia, A.; Maxwell, J.H. Use of ChatGPT for patient education involving HPV-associated oropharyngeal cancer. Am. J. Otolaryngol. 2025, 46, 104642. [Google Scholar] [CrossRef]
  171. Burns, C.; Bakaj, A.; Berishaj, A.; Hristidis, V.; Deak, P.; Equils, O. Use of Generative AI for Improving Health Literacy in Reproductive Health: Case Study. JMIR Form. Res. 2024, 8, e59434. [Google Scholar] [CrossRef] [PubMed]
  172. ELSenbawy, O.M.; Patel, K.B.; Wannakuwatte, R.A.; Thota, A.N. Use of generative large language models for patient education on common surgical conditions: A comparative analysis between ChatGPT and Google Gemini. Updates Surg. 2025, 1–7. [Google Scholar] [CrossRef]
  173. Šuto Pavičić, J.; Marušić, A.; Buljan, I. Using ChatGPT to Improve the Presentation of Plain Language Summaries of Cochrane Systematic Reviews About Oncology Interventions: Cross-Sectional Study. JMIR Cancer 2025, 11, e63347. [Google Scholar] [CrossRef] [PubMed]
  174. Tran, Q.L.; Huynh, P.P.; Le, B.; Jiang, N. Utilization of Artificial Intelligence in the Creation of Patient Information on Laryngology Topics. Laryngoscope 2025, 135, 1295–1300. [Google Scholar] [CrossRef] [PubMed]
  175. Sönmezoğlu, H.İ.; Güner Sönmezoğlu, B.; Temel, M.H.; Çakir, B. Comprehensibility and readability of selected artificial intelligence chatbots in providing uveitis-related information. Medicine 2025, 104, e45135. [Google Scholar] [CrossRef]
  176. Baur, D.; Ansorg, J.; Heyde, C.E.; Voelker, A. Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study. JMIR AI 2025, 4, e75262. [Google Scholar] [CrossRef]
  177. Prabha, S.; Gomez-Cabello, C.A.; Haider, S.A.; Genovese, A.; Trabilsy, M.; Wood, N.G.; Bagaria, S.; Tao, C.; Forte, A.J. Enhancing Clinical Decision Support with Adaptive Iterative Self-Query Retrieval for Retrieval-Augmented Large Language Models. Bioengineering 2025, 12, 895. [Google Scholar] [CrossRef]
  178. Cross, J.L.; Choma, M.A.; Onofrey, J.A. Bias in medical AI: Implications for clinical decision-making. PLoS Digit. Health 2024, 3, e0000651. [Google Scholar] [CrossRef]
  179. Alli, S.R.; Hossain, S.Q.; Das, S.; Upshur, R. The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education. JMIR Med. Educ. 2024, 10, e51446. [Google Scholar] [CrossRef]
  180. Gomez-Cabello, C.A.; Prabha, S.; Haider, S.A.; Genovese, A.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Forte, A.J. Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support. Bioengineering 2025, 12, 1194. [Google Scholar] [CrossRef]
  181. Abo El-Enen, M.; Saad, S.; Nazmy, T. A survey on retrieval-augmentation generation (RAG) models for healthcare applications. Neural Comput. Appl. 2025, 37, 28191–28267. [Google Scholar] [CrossRef]
  182. Wada, A.; Tanaka, Y.; Nishizawa, M.; Yamamoto, A.; Akashi, T.; Hagiwara, A.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; et al. Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation. npj Digit. Med. 2025, 8, 395. [Google Scholar] [CrossRef] [PubMed]
  183. Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631. [Google Scholar] [CrossRef] [PubMed]
  184. Weiss, B.D. Health Literacy and Patient Safety: Help Patients Understand. Manual for Clinicians, 2nd ed.; American Medical Association Foundation and American Medical Association: Chicago, IL, USA, 2007. [Google Scholar]
  185. US Department of Health and Human Services; Office of Disease Prevention and Health Promotion. National Action Plan to Improve Health Literacy; US Department of Health and Human Services: Washington, DC, USA, 2010.
  186. DeTemple, D.E.; Meine, T.C. Comparison of the readability of ChatGPT and Bard in medical communication: A meta-analysis. BMC Med. Inform. Decis. Mak. 2025, 25, 325. [Google Scholar] [CrossRef] [PubMed]
  187. Moons, P.; Van Bulck, L. Using ChatGPT and Google Bard to improve the readability of written patient information: A proof of concept. Eur. J. Cardiovasc. Nurs. 2024, 23, 122–126. [Google Scholar] [CrossRef]
  188. Andrew, A. Accuracy of ChatGPT in answering cardiology board-style questions. J. Educ. Eval. Health Prof. 2025, 22, 9. [Google Scholar] [CrossRef]
  189. Uchmanowicz, I.; Jędrzejczyk, M.; Vellone, E.; Janczak, S.; Mirkowski, K.; Uchmanowicz, B.M.; Czapla, M. ChatGPT in cardiovascular medicine: Revolution, hype, or helper? Front. Public Health 2025, 13, 1622561. [Google Scholar] [CrossRef]
  190. Harskamp, R.E.; De Clercq, L. Performance of ChatGPT as an AI-assisted decision support tool in medicine: A proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2). Acta Cardiol. 2024, 79, 358–366. [Google Scholar] [CrossRef]
  191. Lautrup, A.D.; Hyrup, T.; Schneider-Kamp, A.; Dahl, M.; Lindholt, J.S.; Schneider-Kamp, P. Heart-to-heart with ChatGPT: The impact of patients consulting AI for cardiovascular health advice. Open Heart 2023, 10, e002455. [Google Scholar] [CrossRef]
  192. Meyer, A.; Riese, J.; Streichert, T. Comparison of the Performance of GPT-3.5 and GPT-4 with That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med. Educ. 2024, 10, e50965. [Google Scholar] [CrossRef]
  193. Lahat, A.; Sharif, K.; Zoabi, N.; Shneor Patt, Y.; Sharif, Y.; Fisher, L.; Shani, U.; Arow, M.; Levin, R.; Klang, E. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J. Med. Internet Res. 2024, 26, e54571. [Google Scholar] [CrossRef]
  194. Bolliger, L.S.; Haller, P.; Cretton, I.C.R.; Reich, D.R.; Kew, T.; Jäger, L.A. EMTeC: A corpus of eye movements on machine-generated texts. Behav. Res. Methods 2025, 57, 189. [Google Scholar] [CrossRef]
  195. James, A.; Trovati, M.; Bolton, S. Retrieval-Augmented Generation to Generate Knowledge Assets and Creation of Action Drivers. Appl. Sci. 2025, 15, 6247. [Google Scholar] [CrossRef]
  196. Nastoska, A.; Jancheska, B.; Rizinski, M.; Trajanov, D. Evaluating Trustworthiness in AI: Risks, Metrics, and Applications Across Industries. Electronics 2025, 14, 2717.
  197. Novelo, R.; Silva, R.R.; Bernardino, J. A Literature Review of Personalized Large Language Models for Email Generation and Automation. Future Internet 2025, 17, 536.
  198. Di Martino, F.; Delmastro, F. Explainable AI for clinical and remote health applications: A survey on tabular and time series data. Artif. Intell. Rev. 2023, 56, 5261–5315.
  199. Wagner, N.; Kraus, M.; Minker, W.; Griol, D.; Callejas, Z. A Survey on Multi-User Conversational Interfaces. Appl. Sci. 2025, 15, 7267.
  200. Lai, X.; Lai, Y.; Chen, J.; Huang, S.; Gao, Q.; Huang, C. Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review. J. Med. Internet Res. 2025, 27, e79217.
  201. Lv, X.; Zhang, X.; Li, Y.; Ding, X.; Lai, H.; Shi, J. Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content. J. Med. Internet Res. 2024, 26, e55847.
  202. Singh, S.U.; Namin, A.S. A survey on chatbots and large language models: Testing and evaluation techniques. Nat. Lang. Process. J. 2025, 10, 100128.
  203. Dahlgren Lindström, A.; Methnani, L.; Krause, L.; Ericson, P.; de Rituerto de Troya, Í.M.; Coelho Mollo, D.; Dobbe, R. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback. Ethics Inf. Technol. 2025, 27, 28.
  204. Shao, Y.; Yang, X.; Chen, Q.; Guo, H.; Duan, X.; Xu, X.; Yue, J.; Zhang, Z.; Zhao, S.; Zhang, S. Determinants of digital health literacy among older adult patients with chronic diseases: A qualitative study. Front. Public Health 2025, 13, 1568043.
  205. Zolfaghari, Z.; Karimian, Z.; Zarifsanaiey, N.; Farahmandi, A.Y. Navigating challenges in medical English learning: Leveraging technology and gamification for interactive education—A qualitative study. BMC Med. Educ. 2025, 25, 1045.
  206. Khojasteh, L.; Kafipour, R.; Pakdel, F.; Mukundan, J. Empowering medical students with AI writing co-pilots: Design and validation of AI self-assessment toolkit. BMC Med. Educ. 2025, 25, 159.
  207. Ahmed, A.; Leroy, G.; Kauchak, D.; Barai, P.; Harber, P.; Rains, S. Parallel Corpus Analysis of Text and Audio Comprehension to Evaluate Readability Formula Effectiveness: Quantitative Analysis. J. Med. Internet Res. 2025, 27, e69772.
  208. Joseph, S.; Bhardwaj, A.; Skariah, J.; Aggarwal, I.; Shah, V.; Harris, R.A. Effects of education level on natural language processing in cardiovascular health communication. Front. Public Health 2025, 13, 1688173.
  209. Gao, Y.; Xu, Q.; Zhang, O.; Wang, H.; Wang, Y.; Wang, J.; Chen, X. Large language models: Unlocking new potential in patient education for thyroid eye disease. Endocrine 2025, 90, 689–698.
  210. Zhang, Z.; Zhang, H.; Pan, Z.; Bi, Z.; Wan, Y.; Song, X.; Fan, X. Evaluating Large Language Models in Ophthalmology: Systematic Review. J. Med. Internet Res. 2025, 27, e76947.
  211. Zhang, J.; Song, X.; Tian, B.; Tian, M.; Zhang, Z.; Wang, J.; Fan, T. Large language models in the management of chronic ocular diseases: A scoping review. Front. Cell Dev. Biol. 2025, 13, 1608988.
  212. Betzler, B.K.; Chen, H.; Cheng, C.Y.; Lee, C.S.; Ning, G.; Song, S.J.; Lee, A.Y.; Kawasaki, R.; van Wijngaarden, P.; Grzybowski, A.; et al. Large language models and their impact in ophthalmology. Lancet Digit. Health 2023, 5, e917–e924.
  213. Bacco, L.; Russo, F.; Ambrosio, L.; D’Antoni, F.; Vollero, L.; Vadalà, G.; Dell’Orletta, F.; Merone, M.; Papalia, R.; Denaro, V. Natural language processing in low back pain and spine diseases: A systematic review. Front. Surg. 2022, 9, 957085.
  214. Shah, R.; Schwab, J.H. Large Language Models in Spine Surgery: A Promising Technology. HSS J. 2025, 21, 15563316251340696.
  215. Croxford, E.; Gao, Y.; First, E.; Pellegrino, N.; Schnier, M.; Caskey, J.; Oguss, M.; Wills, G.; Chen, G.; Dligach, D.; et al. Evaluating clinical AI summaries with large language models as judges. npj Digit. Med. 2025, 8, 640.
  216. Alshammari, A.F.; Madfa, A.A.; Anazi, B.A.; Alenezi, Y.E.; Alkurdi, K.A. Comparison of accuracy and consistency of AI language models when answering standardised dental MCQs. BMC Med. Educ. 2025, 25, 1507.
  217. Martos, M.; Fields, B.; Finlayson, S.G.; Hartell, N.; Kim, T.; Larimer, E.; Lau, J.J.; Lin, Y.H.; Salaguinto, T.; Tran, N.; et al. Accuracy of Artificial Intelligence vs Professionally Translated Discharge Instructions. JAMA Netw. Open 2025, 8, e2532312.
  218. Lee, C.; Britto, S.; Diwan, K. Evaluating the Impact of Artificial Intelligence (AI) on Clinical Documentation Efficiency and Accuracy Across Clinical Settings: A Scoping Review. Cureus 2024, 16, e73994.
Figure 1. PRISMA Flowchart for this Review.
Figure 2. Global distribution of the studies included in the review.
Figure 3. Medical fields covered by chatbot readability studies, grouped by topic category.
Figure 4. Comparative Heatmap of 14 Readability Metrics Across 21 AI Chatbots.
Figure 5. Top 20 Most Cited Publications Included in the Review [36,38,47,51,69,71,74,96,98,103,108,110,121,124,127,128,131,134,142,169].
Table 1. Geographical distribution of studies included in the review (n = 140).
| Country | Count |
| --- | --- |
| USA | 60 |
| Turkey | 34 |
| China | 6 |
| India | 6 |
| Australia | 5 |
| Canada | 5 |
| Germany | 3 |
| Denmark | 2 |
| Ireland | 2 |
| Italy | 2 |
| Belgium | 1 |
| Brazil | 1 |
| Croatia | 1 |
| Egypt | 1 |
| Netherlands | 1 |
| Poland | 1 |
| Saudi Arabia | 1 |
| Singapore | 1 |
| South Korea | 1 |
| Spain | 1 |
| United Kingdom | 1 |
Table 2. Chatbots analyzed across the included publications and their frequency of occurrence.
| Chatbot | Count |
| --- | --- |
| ChatGPT-4 / GPT-4o | 94 |
| ChatGPT-3.5 | 83 |
| Google Bard / Gemini | 52 |
| Microsoft Copilot / Microsoft Copilot Pro / Bing AI | 39 |
| Perplexity AI / Perplexity Pro | 26 |
| Claude 2.0 / Claude 3.5 / Claude Sonnet | 12 |
| Meta AI Assistant | 4 |
| ChatSonic 1.0.2 | 3 |
| DeepSeek-V3 | 2 |
| DocsGPT 0.15.0 | 2 |
| DeepSeek-R1 | 2 |
| Open Evidence 2.0 | 1 |
| ChatSpot Alpha | 1 |
| DeepSeek-R1 | 1 |
| Ernie Bot 4.0 | 1 |
| LLaMA 3.1 | 1 |
| Llama 3.1 Large | 1 |
| MediSearch Version 1.5.10 | 1 |
| Pi AI 1.0.53 | 1 |
| Vello | 1 |
| Vello Pro | 1 |
Table 3. Readability indices used in the included studies and frequency of their application.
| Readability Scale | Count |
| --- | --- |
| Flesch–Kincaid Grade Level | 117 |
| Flesch Reading Ease Score | 95 |
| Gunning Fog Index | 41 |
| Simple Measure of Gobbledygook | 39 |
| Coleman–Liau Index | 22 |
| Automated Readability Index | 14 |
| FORCAST | 4 |
| Dale–Chall Readability | 3 |
| Fry Readability Graph | 2 |
| Fry Readability Score | 2 |
| Läsbarhetsindex | 2 |
| Linsear Write | 2 |
| Raygor Readability Estimate | 2 |
| Lix Readability Index | 1 |
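The two indices applied most often in the included studies, the Flesch–Kincaid Grade Level and the Flesch Reading Ease score, are simple closed-form functions of mean sentence length and mean syllables per word. A minimal Python sketch (using a crude vowel-group syllable heuristic; production tools typically rely on dictionary-based syllabification) illustrates how such scores are computed:

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count vowel groups, drop a silent final 'e'.
    Dictionary-based syllabifiers (e.g. CMUdict) are more accurate."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

On these standard formulas, an FRE above 60 is generally regarded as plain language, while an FKGL near 8 matches the 8th-grade target for patient-facing materials discussed in this review.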
Table 4. Readability scores of medical texts generated by chatbots.
| Chatbot | Flesch Reading Ease | Flesch–Kincaid Grade Level | Gunning Fog Index | SMOG Index | Coleman–Liau Index | Automated Readability Index | Linsear Write | Dale–Chall Score | FORCAST | Fry Graph | Fry Readability Score | Läsbarhetsindex | Lix Readability Index | Raygor Estimate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-4 | 37.55 ± 17.76 | 13.85 ± 8.10 | 14.49 ± 3.60 | 12.94 ± 2.74 | 14.61 ± 2.91 | 11.67 ± 2.38 | 9.61 ± 2.33 | 9.90 | 12.60 ± 0.42 | 13.55 ± 0.64 | 9.50 ± 0.71 | 36.49 ± 38.91 | 72.00 | 13.80 ± 0.28 |
| ChatGPT-3.5 | 35.16 ± 13.59 | 15.45 ± 8.78 | 15.57 ± 3.26 | 13.11 ± 1.92 | 15.43 ± 2.16 | 14.06 ± 1.62 | 13.95 ± 1.81 | 10.25 ± 0.35 | 12.48 ± 0.12 | | | | | |
| Microsoft Copilot | 35.66 ± 12.01 | 13.66 ± 8.02 | 14.57 ± 2.94 | 13.64 ± 2.87 | 14.25 ± 2.38 | 11.95 ± 2.20 | 11.90 ± 1.27 | 10.30 | 12.30 | | | | | |
| Google Gemini | 39.61 ± 14.73 | 13.14 ± 8.31 | 14.29 ± 4.13 | 12.65 ± 2.41 | 13.33 ± 2.66 | 11.23 ± 2.45 | 11.71 ± 2.39 | 11.60 | 11.21 ± 1.41 | | | | | |
| Perplexity | 31.31 ± 11.27 | 19.62 ± 13.51 | 16.58 ± 2.63 | 14.02 ± 2.40 | 14.68 ± 2.07 | 14.06 ± 3.14 | 14.76 ± 5.11 | | | | | | | |
| Meta AI | 28.38 ± 21.83 | 11.97 ± 1.79 | 11.60 | 12.40 | 19.10 | 13.50 | | 13.80 | | | | | | |
| Claude | 40.11 ± 21.18 | 11.22 ± 2.87 | 10.31 | | 10.31 | | | | | | | | | |
| Pi AI | 16.30 | 15.90 | 20.00 | | 11.90 | | | | | | | | | |
| DeepSeek-V3 | 53.35 ± 7.00 | 8.45 ± 0.35 | | 16.40 | 15.10 | | | | | | | | | |
| ChatSpot | 23.10 | 15.00 | 18.20 | | 11.30 | | | | | | | | | |
| DeepSeek | 76.43 | | 12.26 | 15.40 | | | | | | | | | | |
| DocsGPT | 72.00 | 9.75 ± 5.73 | | 12.10 | | | | | | | | | | |
| Llama 3.1 Large | 20.10 | 24.10 | | | | | | | | | | | | |
| Llama 3.1 | 23.70 | 34.20 | | | | | | | | | | | | |
| Ernie Bot 4.0 | 37.50 | 12.90 | | | | | | | | | | | | |
| DeepSeek-R1 | 61.40 | 7.20 | | | | | | | | | | | | |
| MediSearch | | 18.30 | | | | | | | | | | | | |
| ChatSonic | | 21.65 ± 16.77 | | | | | | | | | | | | |
| Open Evidence | | 17.09 ± 0.56 | | | | | | | | | | | | |
| Vello | | 29.00 | | | | | | | | | | | | |
| Vello Pro | | 17.40 | | | | | | | | | | | | |
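Most of the mean Flesch Reading Ease scores reported for the major chatbots fall below 50, which places them in the "difficult" band of the conventional Flesch interpretation scale (the band labels below follow that standard convention, not a scheme defined by this review). A short helper makes the mapping explicit:

```python
def fre_band(score):
    """Map a Flesch Reading Ease score to its conventional difficulty band."""
    bands = [
        (90.0, "very easy"),
        (80.0, "easy"),
        (70.0, "fairly easy"),
        (60.0, "plain English"),
        (50.0, "fairly difficult"),
        (30.0, "difficult"),
    ]
    for cutoff, label in bands:
        if score >= cutoff:
            return label
    return "very difficult"
```

Applied to the mean scores above, ChatGPT-4 (37.55) maps to "difficult" while DeepSeek (76.43) maps to "fairly easy", mirroring the finding that the DeepSeek models produced the most accessible text.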
Olszewski, R.; Brzeziński, J.; Watros, K.; Rysz, J. Quantifying Readability in Chatbot-Generated Medical Texts Using Classical Linguistic Indices: A Review. Appl. Sci. 2026, 16, 1423. https://doi.org/10.3390/app16031423