Abstract
The rapid development of large language models (LLMs), including ChatGPT, Gemini, and Copilot, has led to their increasing use in health communication and patient education. However, their growing popularity raises important concerns about whether the language they generate aligns with recommended readability standards and patient health literacy levels. This review synthesizes evidence on the readability of chatbot-generated medical information, as assessed with established linguistic readability indices. A comprehensive search of PubMed, Scopus, Web of Science, and the Cochrane Library identified 4209 records, of which 140 studies met the eligibility criteria. Across the included publications, 21 chatbots and 14 readability scales were examined, with the Flesch–Kincaid Grade Level and Flesch Reading Ease being the most frequently applied metrics. The results demonstrated substantial variability in readability across chatbot models; however, most texts corresponded to an upper-secondary or early tertiary reading level, exceeding the commonly recommended 8th-grade level for patient-facing materials. ChatGPT-4, Gemini, and Copilot exhibited more consistent readability patterns, whereas ChatGPT-3.5 and Perplexity produced more linguistically complex content. Notably, DeepSeek-V3 and DeepSeek-R1 generated the most accessible responses. The findings suggest that, despite technological advances, AI-generated medical content remains insufficiently readable for general audiences, posing a potential barrier to equitable health communication. These results underscore the need for readability-aware AI design, standardized evaluation frameworks, and future research integrating quantitative readability metrics with patient-level comprehension outcomes.
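
For context, the two metrics most frequently applied across the included studies, Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL), are conventionally computed from sentence, word, and syllable counts as shown below. These are the standard formulations from the readability literature; the exact implementations (e.g., syllable-counting rules) may differ between the tools used by individual studies.

\[
\text{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

\[
\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\]

Higher FRE values indicate easier text (scores of roughly 60 to 70 correspond to plain language), whereas FKGL maps onto U.S. school grade levels, so the commonly recommended 8th-grade threshold for patient-facing materials corresponds to an FKGL of about 8 or lower.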