Systematic Review

Quantifying Readability in Chatbot-Generated Medical Texts Using Classical Linguistic Indices: A Review

1 Department of Gerontology and Public Health, National Institute of Geriatrics, Rheumatology and Rehabilitation, Spartańska 1 Street, 02-637 Warsaw, Poland
2 Department of Ultrasound, Institute of Fundamental Technological Research, Polish Academy of Sciences, Pawińskiego 5B Street, 02-106 Warsaw, Poland
3 Department of Nephrology, Hypertension and Family Medicine, Medical University of Lodz, Ul. Zeromskiego 113, 90-549 Lodz, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1423; https://doi.org/10.3390/app16031423
Submission received: 20 November 2025 / Revised: 5 January 2026 / Accepted: 27 January 2026 / Published: 30 January 2026

Abstract

The rapid development of large language models (LLMs), including ChatGPT, Gemini, and Copilot, has led to their increasing use in health communication and patient education. However, their growing popularity raises important concerns about whether the language they generate aligns with recommended readability standards and patient health literacy levels. This review synthesizes evidence on the readability of medical information generated by chatbots using established linguistic readability indices. A comprehensive search of PubMed, Scopus, Web of Science, and Cochrane Library identified 4209 records, from which 140 studies met the eligibility criteria. Across the included publications, 21 chatbots and 14 readability scales were examined, with the Flesch–Kincaid Grade Level and Flesch Reading Ease being the most frequently applied metrics. The results demonstrated substantial variability in readability across chatbot models; however, most texts corresponded to a secondary or early tertiary reading level, exceeding the commonly recommended 8th-grade level for patient-facing materials. ChatGPT-4, Gemini, and Copilot exhibited more consistent readability patterns, whereas ChatGPT-3.5 and Perplexity produced more linguistically complex content. Notably, DeepSeek-V3 and DeepSeek-R1 generated the most accessible responses. The findings suggest that, despite technological advances, AI-generated medical content remains insufficiently readable for general audiences, posing a potential barrier to equitable health communication. These results underscore the need for readability-aware AI design, standardized evaluation frameworks, and future research integrating quantitative readability metrics with patient-level comprehension outcomes.

1. Background

In recent years, there has been a marked increase in research on the readability of medical texts and educational materials intended for patients. Numerous studies have demonstrated that the comprehensibility of health information is a key factor in determining the effectiveness of communication between healthcare professionals and patients [1,2,3,4]. A high level of linguistic complexity may limit patients’ ability to interpret recommendations, make informed health decisions, and adhere to prescribed therapies [5,6,7].
In the scientific literature, quantitative readability indices such as the Flesch Reading Ease, Flesch–Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman–Liau Index, and Automated Readability Index are commonly used to objectively assess the linguistic complexity of a text [8,9]. Numerous studies employing these measures have shown that health-related materials directed toward patients often exceed the recommended reading level, typically corresponding to primary or secondary education, thereby reducing their comprehensibility and practical usefulness [10,11,12,13].
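The two indices used most often in the studies reviewed here can be computed directly from simple text counts. A minimal sketch using the published formula constants (in practice, syllable counts would come from a dictionary or a heuristic syllable counter, which is not shown here):

```python
def flesch_reading_ease(total_words: int, total_sentences: int,
                        total_syllables: int) -> float:
    # Flesch Reading Ease: higher scores (roughly 0-100) mean easier text.
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def flesch_kincaid_grade(total_words: int, total_sentences: int,
                         total_syllables: int) -> float:
    # Flesch-Kincaid Grade Level: approximate U.S. school grade required.
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Example: a 100-word passage with 5 sentences and 150 syllables.
fre = flesch_reading_ease(100, 5, 150)    # 59.635, "fairly difficult"
fkgl = flesch_kincaid_grade(100, 5, 150)  # 9.91, about a 10th-grade level
```

Note that the two scales run in opposite directions: a lower grade-level score and a higher Reading Ease score both indicate easier text.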
At the same time, with the rapid development of natural language processing (NLP) technologies and the widespread adoption of large language models (LLMs) such as ChatGPT, Gemini, and Copilot, a new line of research has emerged focusing on the readability and comprehensibility of medical responses generated by chatbots [14,15,16]. Preliminary analyses indicate that although AI-generated texts often demonstrate linguistic accuracy and logical coherence, their readability and alignment with patients’ health literacy levels vary substantially [17,18,19]. In some cases, chatbots produce overly technical or specialized messages, which may limit their educational value and potentially lead to misinterpretation or incomplete understanding of health information [20,21].
The review was guided by the following research question: To what extent do chatbot-generated medical texts comply with recommended readability standards for patient-facing health communication when evaluated using classical linguistic readability indices? This question focuses on publicly accessible chatbot outputs intended for patient education and health communication. In light of the growing body of research on chatbot readability, this review further examines whether and how such systems generate responses to medical, preventive, or educational inquiries posed by health professionals.
Recent developments in generative AI have also reshaped the conceptual understanding of how language models interact with users’ health literacy needs. Earlier works in health communication emphasized structural barriers, such as excessive medical terminology, syntactic density, and low plain-language compliance, in printed materials [22,23,24,25]. However, generative LLMs introduce new challenges related to style transfer, prompt sensitivity, and the composition of training data [26,27,28]. Because these models are trained on large biomedical corpora, scientific preprints, and clinician-oriented resources, they tend to internalize formal and information-dense linguistic patterns. This training bias partly explains why chatbot-generated texts remain difficult for lay audiences despite their apparent fluency and coherence.
Furthermore, LLMs exhibit substantial variability in linguistic register depending on prompting strategy, system parameters, and model architecture [29]. This variability raises important methodological questions for evaluating AI-driven patient communication, including the reproducibility of readability scores, the impact of system updates, and the degree to which model fine-tuning shapes the balance between comprehensiveness and accessibility.
Together, these factors highlight the need to view readability not only as an attribute of a finished text but as an emergent property of algorithmic systems that continuously adapt during interaction. Understanding this dynamic context is essential for developing robust evaluation frameworks and for designing future AI systems capable of aligning linguistic complexity with patient literacy demands.
In recent years, generative artificial intelligence in healthcare has evolved from general-purpose large language models toward domain-specific architectures designed for clinical and patient-facing applications. In particular, retrieval-augmented generation (RAG) systems, which integrate language models with external clinical knowledge bases, have become increasingly prominent in healthcare settings. Recent reviews indicate that RAG-based approaches improve factual grounding, transparency, and domain reliability in patient education and clinical decision support tasks compared to standalone LLMs [30,31,32,33].
In parallel, multimodal generative AI systems combining text with medical imaging, laboratory data, and electronic health records are rapidly expanding across clinical domains. These developments suggest that contemporary evaluations of chatbot-generated medical texts should be interpreted within a broader ecosystem of healthcare-oriented generative AI, in which readability interacts with knowledge grounding, modality integration, and clinical specialization. This review therefore aims to synthesize the existing evidence and to inform the development of guidelines for designing and evaluating AI-based tools for patient communication and health education.

2. Materials and Methods

Database searches were conducted between 1 July and 30 September 2025, covering all records available up to the final search date. This study is designed as a comprehensive literature review informed by PRISMA 2020 principles (PRISMA 2020 checklist: EQUATOR Network) rather than a full PRISMA-compliant systematic review. PRISMA guidelines were used as a framework to enhance transparency in study identification, screening, and reporting; however, given the heterogeneity of study designs, chatbot models, prompts, and readability metrics, several elements required for a formal systematic review, such as quantitative synthesis and standardized risk-of-bias assessment, were not applicable [34]. A complete overview of the PRISMA checklist items and their implementation in this review is provided in Table S1 in the Supplementary Materials. The review therefore aims to provide a broad, structured synthesis of the current evidence rather than a statistically pooled evaluation. Each database was selected for its relevance to clinical, social, and technical research. To address potential limitations in database coverage, the search strategy was expanded to include studies that assessed text readability using established readability indices and analyzed publicly accessible chatbots available to general internet users.
Four medical databases were systematically searched: PubMed, Cochrane Library, Scopus, and Web of Science. The search strategy included the following keywords: chatbot [Title/Abstract] AND readability [Title/Abstract], chatbot [Title/Abstract] AND Flesch–Kincaid Grade Level [Title/Abstract], chatbot [Title/Abstract] AND Flesch Reading Ease [Title/Abstract], chatbot [Title/Abstract] AND Gunning Fog Index [Title/Abstract], chatbot [Title/Abstract] AND Simple Measure of Gobbledygook [Title/Abstract], chatbot [Title/Abstract] AND Coleman–Liau Index [Title/Abstract], and chatbot [Title/Abstract] AND Automated Readability Index [Title/Abstract]. No additional filters or limits were applied. A total of 4209 records were initially retrieved. After applying predefined inclusion and exclusion criteria, including language, document type, and chatbot accessibility, 140 articles were included in the review [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174]. Figure 1 presents the PRISMA flow diagram illustrating the process of study identification, screening, eligibility assessment, and inclusion in the final review. Studies were included if they met the following criteria: (1) peer-reviewed original research articles; (2) assessment of readability of chatbot-generated medical or health-related text using at least one established quantitative readability index; (3) analysis of publicly accessible chatbot systems; and (4) focus on patient-facing or health education content. Only English-language publications were included. 
Studies were excluded if they evaluated proprietary or non-public chatbot systems, focused solely on technical performance without readability assessment, or did not report quantitative readability outcomes. Data extraction was performed independently by two reviewers using a predefined extraction framework. Extracted variables included chatbot model, medical domain, text type, readability indices applied, and reported outcomes. Any discrepancies were resolved through structured discussion until consensus was reached; no third reviewer was involved. Interrater agreement between the two independent reviewers was assessed using Cohen’s kappa coefficient; the obtained level of agreement was substantial (κ = 0.78), indicating high consistency in the screening and eligibility assessments.
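The agreement statistic reported above can be reproduced from raw screening decisions. A minimal sketch with hypothetical include/exclude labels (the actual screening data are not published with the review, so the decisions below are illustrative only):

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # Cohen's kappa: chance-corrected agreement between two raters.
    n = len(rater_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled records independently.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n)
              for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical decisions for eight records screened by both reviewers.
a = ["in", "in", "in", "out", "out", "out", "in", "out"]
b = ["in", "in", "out", "out", "out", "out", "in", "out"]
kappa = cohens_kappa(a, b)  # 0.75 for this toy example
```

Values between 0.61 and 0.80 are conventionally interpreted as "substantial" agreement, which is the band the review's reported κ = 0.78 falls into.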
In addition to classical database searching, methodological attention was given to heterogeneity in prompt design, since prompting is increasingly recognized as a significant determinant of model output structure, tone, and complexity. The included studies displayed wide variation in whether prompts were phrased as open-ended questions, clinically oriented scenarios, or direct instructions to simplify language. Because readability scores are sensitive to such differences, prompt variability was treated as an important contextual factor; however, insufficient reporting of prompt formulations in the source studies precluded retrospective operationalization or stratified analysis. A meta-analysis was not conducted due to substantial heterogeneity in study designs, chatbot models, and readability outcomes, making quantitative pooling methodologically inappropriate. Across all analyzed studies, 21 chatbots and 14 readability indices were used. For consistency, chatbot nomenclature was standardized throughout the manuscript. Model names referring to the same underlying architecture were consolidated (e.g., GPT-4 and GPT-4o), and Google Bard was treated as Gemini in studies published after the official rebranding.
Descriptive statistics were used to summarize the readability scores for each chatbot and readability index. For every chatbot–scale pair, the mean (M) and standard deviation (SD) were calculated and reported as Mean ± SD. Calculations were performed using Python version 3.9 (Python Software Foundation, Wilmington, DE, USA). Although prompt variability is increasingly recognized as a key determinant of LLM output structure and linguistic complexity, most studies included in this review addressed prompting only descriptively. Few investigations systematically categorized prompts by intent, framing, or explicit readability constraints, thereby limiting reproducibility and cross-study comparability.
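The per-pair summary described above reduces to a mean and a sample standard deviation for each chatbot–scale combination. A minimal sketch using only the standard library (the pair key and values below are illustrative, not figures from the review):

```python
from statistics import mean, stdev

def summarize_pairs(scores: dict) -> dict:
    # scores maps (chatbot, index) -> list of values reported across studies.
    # Pairs with fewer than two values cannot yield a sample SD and are skipped.
    return {
        pair: (round(mean(vals), 2), round(stdev(vals), 2))
        for pair, vals in scores.items()
        if len(vals) >= 2
    }

# Illustrative input: three FKGL values reported for one model.
summary = summarize_pairs({("ChatGPT-4", "FKGL"): [11.2, 13.0, 12.4]})
# -> {("ChatGPT-4", "FKGL"): (12.2, 0.92)}, i.e. 12.2 ± 0.92
```

Note that `stdev` computes the sample (n − 1) standard deviation; studies reporting the population SD instead would need `statistics.pstdev`.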
Future research would benefit from operationalizing prompt variability through standardized prompt taxonomies, for example, by distinguishing informational, instructional, reassurance-oriented, and simplification-focused prompts. Such an approach would enable clearer attribution of observed readability differences to model architecture versus prompting strategy and support longitudinal comparisons across model updates.
A formal risk-of-bias or quality assessment was not conducted due to substantial heterogeneity in study designs, chatbot architectures, prompting strategies, medical domains, and readability metrics. This heterogeneity also precluded quantitative synthesis and meta-analysis, as pooling results across fundamentally different models and outcome measures would not yield meaningful summary estimates. Instead, findings were synthesized qualitatively through comparative analysis.

3. Results

The comparative analysis revealed notable variability in the readability of chatbot-generated texts across models and readability indices. Overall, most chatbots produced content that would require at least a secondary or early tertiary education level to be fully comprehensible, suggesting that the linguistic complexity of current large language models (LLMs) remains relatively high for lay audiences. The review included studies published between 2023 and 2025, encompassing the most recent phase of research on the readability of AI-generated medical content and reflecting the rapid evolution of large language models used in healthcare communication. Supplementary Table S2 provides detailed characteristics of all included studies, including publication year, country, chatbot model, language of output, text type, readability indices used, and primary outcomes.
Most studies originated in the United States (n = 60; 42.8% of all publications) and Turkey (n = 34; 24.3% of all publications). Table 1 presents the complete distribution of countries from which the studies were derived, whereas Figure 2 illustrates their geographical dispersion on a world map.
A total of 21 chatbots were used across the publications included in the review. The most frequently used model was ChatGPT-4 (94 occurrences). The next most frequently used chatbot was ChatGPT-3.5 (83 occurrences). Table 2 presents all chatbots used in the publications included in the review.
The most frequently used readability measure across the analyzed publications was the Flesch–Kincaid Grade Level (used 117 times), followed by the Flesch Reading Ease Score (used 94 times). Table 3 presents all readability indices and the frequency of their use across the included studies.
The most frequently addressed topics in the chatbot queries were Patient Education/Health Communication (18 occurrences), followed by Oncology/Cancer (15 occurrences) and Otolaryngology (13 occurrences). Figure 3 presents a tabular distribution of all medical fields covered in the analyzed publications.

Readability Patterns Across Medical Specialities

Distinct readability patterns emerged across different medical domains. Topics such as oncology, cardiology, neurology, and orthopaedics exhibited consistently higher grade-level scores across multiple chatbot models. These fields are characterized by dense terminology, abstract pathophysiological concepts, and complex treatment algorithms, all of which tend to increase syntactic complexity and average sentence length. In contrast, domains such as patient education, public health, and maternal care yielded comparatively lower readability scores. These topics typically rely on more narrative, instruction-based language that is easier for LLMs to simplify.
Notably, oncology-related responses demonstrated some of the highest complexity values in the dataset. This may reflect both the inherent difficulty of the domain and LLMs’ tendency to adopt cautious, legally conservative phrasing when discussing high-risk clinical conditions. Similarly, cardiology questions frequently elicited long, multi-clause sentences with numerous modifiers, suggesting that models may emphasize completeness over accessibility when addressing conditions perceived as clinically severe.
These speciality-level differences underscore the importance of contextualizing readability within the content domain, as the same model can yield dramatically different linguistic structures across clinical topics. Models producing lower grade-level estimates on one scale tended to score similarly across the others, reinforcing the robustness of the observed ranking patterns.
As shown in Table 4, readability scores vary considerably across chatbot models and readability indices, with most results exceeding the recommended 8th-grade level for patient-facing materials. Among the most frequently analyzed models, ChatGPT-4, Google Gemini, and Microsoft Copilot demonstrated the most balanced readability profiles. Their texts generally fell within the “difficult” category of the Flesch Reading Ease scale and corresponded to approximately college-level reading difficulty. These models showed relatively low variation across scales, indicating a consistent language structure and stable readability performance.
ChatGPT-3.5 and Perplexity, in contrast, generated content characterized by higher linguistic complexity, with longer sentences and more specialized vocabulary. Both models consistently scored higher on grade-level indices, implying that the information they produced would be challenging for audiences with average health literacy. Within the GPT family, the transition from version 3.5 to 4 was accompanied by a measurable improvement in readability, suggesting refinements in language coherence and sentence simplification in the newer model.
Models such as Claude and Meta AI showed intermediate readability, with scores fluctuating between moderate and difficult across the scales. This variability likely reflects the heterogeneity of available prompts and text domains used in the analyzed studies.
DeepSeek-V3 and DeepSeek-R1 were the only models to produce outputs classified as readable or moderately easy, with text difficulty levels approximating those recommended for patient information materials. Their consistently lower grade-level scores suggest that these models may prioritize shorter sentences and simpler word choice, making them more accessible to a general audience.
Smaller or domain-specific chatbots, such as DocsGPT, PiAI, ChatSpot, Vello, and Open Evidence, were represented in fewer studies and across fewer readability indices. While their readability estimates varied widely, these systems tended to exhibit higher linguistic variability and less consistent results, likely due to narrower training data and differing use cases.
Chatbots that achieved higher grade-level scores on indices such as the Flesch–Kincaid Grade Level, Gunning Fog, or Linsear Write generally exhibited lower values on the Flesch Reading Ease scale. This alignment indicates that the indices captured similar dimensions of linguistic complexity, providing a coherent overall picture of relative readability across chatbot-generated texts. Figure 4 presents a heatmap demonstrating the comparative distribution of 14 readability metrics across 21 AI chatbots. Missing data reflects incomplete reporting in source studies. Darker colours indicate lower values, while yellow-green shades indicate higher values.
Lower values on grade-level indices (e.g., Flesch–Kincaid Grade Level, Gunning Fog Index) indicate greater readability. In contrast, higher values on the Flesch Reading Ease scale correspond to easier-to-read text. Scale directionality is explicitly indicated to facilitate interpretation by readers unfamiliar with readability metrics.
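The directionality note above can be made concrete with the interpretation bands commonly cited for the Flesch Reading Ease scale. Band labels vary slightly between sources; the cut-offs below are the ones most often reproduced:

```python
def fre_band(score: float) -> str:
    # Commonly cited Flesch Reading Ease interpretation bands;
    # higher scores correspond to easier text.
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "very difficult"

# A score of 45 falls in the "difficult" band, typical of the chatbot
# outputs summarized in this review.
```

On the grade-level indices the mapping is the reverse: a Flesch–Kincaid Grade Level of 8 or below is the threshold usually recommended for patient-facing materials.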
We also summarized the citation impact of all included publications. Figure 5 presents the 20 most cited articles in the dataset, while the complete citation ranking of all 140 studies is provided in the Supplementary Materials Table S3.

4. Discussion

This review provides a comprehensive synthesis of existing studies assessing the readability of chatbot-generated medical texts using classical linguistic indices. In this review, readability is treated as a patient-facing communication dimension of AI-generated medical content, evaluated under the assumption of baseline informational correctness and considered complementary to, rather than a replacement for, accuracy and clinical validity assessments [175,176,177]. Readability should not be viewed as a purely linguistic attribute, as excessive textual complexity in healthcare contexts may directly compromise patient safety, healthcare reliability, and decision-making accuracy. AI-generated medical information that is difficult to read may increase the risk of misinterpreting treatment instructions, overlooking contraindications, or misunderstanding probabilistic risk information [178].
Significantly, readability interacts with known failure modes of generative AI systems, including hallucinations, overgeneralization, and omission of uncertainty markers. Linguistically dense or overly formal responses may obscure hedging statements and limitations, potentially fostering unwarranted trust in incorrect or incomplete information [179]. From this perspective, readability constitutes a core dimension of responsible AI deployment in healthcare, alongside accuracy, transparency, and domain alignment.
Recent domain-specific reviews further reinforce the importance of contextual grounding and clinical specialization in generative AI for healthcare. A comprehensive review of retrieval-augmented generation in healthcare suggests that grounding model outputs in curated clinical sources not only improves factual accuracy but may also constrain response scope, thereby indirectly enhancing communicative clarity [180,181,182]. Similarly, longitudinal analyses of generative AI applications in health care illustrate how domain-specific fine-tuning and guideline integration shape both informational quality and accessibility [183,184]. These findings suggest that readability assessments should explicitly account for whether standalone LLMs or augmented architectures generate chatbot responses, as this distinction may systematically influence linguistic complexity and clinical appropriateness.
While a growing number of publications have explored factual accuracy, empathy, or the reliability of AI-driven health information, the fundamental issue of linguistic accessibility has remained largely underexamined. By consolidating findings from 140 studies across 21 chatbot models, this review provides a comprehensive overview of the readability of chatbot-generated medical texts using classical linguistic indices.
Earlier research on online health communication—long before the advent of generative AI—consistently showed that most patient education materials were written at a level too advanced for the general population, typically above the 8th-grade level recommended by the American Medical Association and the U.S. Department of Health and Human Services [185,186]. Studies on web-based patient portals and hospital websites confirmed similar patterns, revealing that even materials intended for public education often demand college-level literacy.
Recent studies investigating chatbot-generated content, though limited in number and scope, have echoed these concerns. For example, it was reported that ChatGPT and Bard produced health information with Flesch–Kincaid Grade Levels of 12–14, substantially above recommended thresholds [187,188]. Similarly, another study found that ChatGPT’s answers regarding cardiovascular health were syntactically correct but lexically dense, often employing specialized terminology [189,190,191,192]. The present review confirms and extends these observations by aggregating evidence across multiple models and domains, demonstrating that the issue of excessive linguistic complexity is systemic rather than model-specific.
However, some studies have suggested that newer model iterations, such as GPT-4, tend to produce slightly simpler, more structured responses than earlier versions, such as GPT-3.5 [193,194]. This review provides converging evidence for this trend, indicating incremental but insufficient progress toward readability improvement. These findings collectively suggest that advances in model architecture alone do not guarantee improved accessibility for end users without deliberate optimization for readability.
This variability is not merely linguistic but reflects system-level technical choices that shape the generated text. Readability in LLM-generated medical text should therefore be interpreted as a downstream outcome shaped by these design decisions rather than as an inherent property of a model label. Variation in readability across studies may reflect differences in decoding strategies (e.g., temperature, sampling constraints, output length limits), prompt and instruction design (e.g., explicit simplification constraints, disclaimer requirements), and alignment objectives [195]. In particular, safety-optimized alignment procedures (including RLHF) can promote conservative phrasing, hedging, and extensive disclaimers, which may increase sentence length and syntactic complexity. Conversely, instruction tuning that prioritizes clarity and user comprehension may yield more concise, accessible outputs. Retrieval-augmented generation further complicates interpretation: while retrieval can improve factual grounding, it may also introduce domain-specific terminology and longer guideline-like responses that inflate grade-level estimates [196]. These mechanisms imply that readability comparisons between chatbots are not causally interpretable without standardized reporting of technical parameters and interaction settings.
To enable technically meaningful interpretation and cross-study comparability, future readability evaluations should routinely report a minimal set of system-level indicators: (i) model identifier, version, and date of access; (ii) complete prompt templates and instruction constraints (including system prompts where available); (iii) decoding parameters and output-length settings; (iv) retrieval/tool-use configuration (if applicable); (v) interaction design (single-turn vs. multi-turn, context length, memory settings); and (vi) post-processing or safety filtering applied to responses [197,198,199,200]. The limited reporting of such parameters in most of the existing literature constitutes a significant methodological barrier to linking readability outcomes to specific LLM techniques and to establishing reproducible benchmarks [201].
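The six reporting items (i)–(vi) proposed above could be captured in a simple structured record. The sketch below is a hypothetical schema, not an established standard; all field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReadabilityEvaluationRecord:
    # (i) model identifier, version, and date of access
    model_id: str
    access_date: str
    # (ii) complete prompt template and instruction constraints
    prompt_template: str
    system_prompt: Optional[str] = None
    # (iii) decoding parameters and output-length settings
    decoding: dict = field(default_factory=dict)
    # (iv) retrieval/tool-use configuration, if applicable
    retrieval_config: Optional[str] = None
    # (v) interaction design (single-turn vs. multi-turn, context length)
    interaction: str = "single-turn"
    # (vi) post-processing or safety filtering applied to responses
    post_processing: Optional[str] = None

record = ReadabilityEvaluationRecord(
    model_id="gpt-4 (2025-01 snapshot)",
    access_date="2025-01-15",
    prompt_template="Explain the condition at an 8th-grade reading level.",
    decoding={"temperature": 0.7, "max_tokens": 512},
)
```

Requiring such a record alongside every readability score would let later reviews stratify results by decoding settings or prompting strategy instead of treating each chatbot label as a black box.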
The observed variability in readability across models likely stems from architectural and training differences. GPT-3.5 and Perplexity frequently produced longer and more syntactically intricate sentences, consistent with their tendency to generate verbose, detail-heavy responses. GPT-4 and Gemini, although more consistent, still align with formal scientific prose because their training corpora heavily represent academic texts. In contrast, DeepSeek-V3 and DeepSeek-R1, models intentionally optimized for brevity, generated markedly shorter sentences and simpler vocabulary. This suggests that model alignment strategies and fine-tuning objectives play a decisive role in shaping linguistic accessibility.
It should also be acknowledged that many chatbots evaluated in the included studies may rely on shared large language model backends, common APIs, or similar corporate infrastructures, despite being presented as distinct systems. Consequently, observed differences in readability across chatbot labels may reflect variations in prompting strategies, interface design, or response formatting rather than fundamental differences in underlying model technologies [202,203].
An additional factor is the influence of reinforcement learning from human feedback (RLHF), which may inadvertently increase linguistic complexity by promoting cautious, formal, and legally conservative phrasing. Systems optimized primarily for safety or factual correctness may therefore produce verbose outputs (e.g., hedging or extensive disclaimers), whereas models fine-tuned with objectives emphasizing instructional clarity tend to generate more patient-friendly text. These findings support the need for fine-tuning pipelines that explicitly include readability as a core performance metric [204]. The persistence of high reading difficulty in chatbot-generated health communication underscores a broader challenge: technological sophistication does not automatically translate into information that patients can understand.
To address this, future chatbot design should incorporate mechanisms to monitor and adapt readability, including real-time complexity assessment and model optimization strategies that prioritize clarity over verbosity. Moreover, interdisciplinary collaboration between computer scientists, linguists, and health communication experts will be essential to ensure that AI systems are optimized not only for accuracy but also for comprehension and inclusivity.
Traditional readability metrics, while valuable, measure only the surface structure of text, such as sentence length, syllable count, and syntactic density. They do not capture semantic transparency, contextual coherence, or pragmatic appropriateness, all of which shape actual understanding. Several recent works have emphasized that comprehension depends on both linguistic and cognitive accessibility, including familiarity with medical terminology and the perceived credibility of the source [204,205,206,207].
Several high-impact studies within the dataset provide significant insights into how LLMs handle medical communication. For example, studies evaluating oncology- and cardiology-related materials demonstrated that even state-of-the-art models struggled to reach recommended reading levels, often producing content equivalent to college-level difficulty [208]. Research on low back pain, cataract surgery, and thyroid disorders found that although LLMs offer coherent, structurally organized explanations, they often introduce specialized terminology without simplifying or contextualizing it for lay readers [209,210,211,212,213,214]. Notably, several investigations comparing AI-generated content with expert-written materials revealed that AI models can surpass clinicians in structural clarity yet still fall short in accessibility, underscoring the dissociation between linguistic fluency and genuine readability [215,216,217,218].
These landmark studies collectively suggest that readability challenges are systemic across models and domains rather than isolated shortcomings. Their conclusions emphasize the need for computational approaches that extend beyond classical metrics toward more holistic, patient-centred evaluation frameworks.
The geographic concentration of readability research in English-speaking or high-income countries limits the generalizability of findings. Chatbots operating in languages with complex morphology (e.g., Polish, Turkish, or Korean) may exhibit different readability patterns due to linguistic structure and translation effects. Expanding this line of inquiry to multilingual and multicultural contexts is therefore crucial to understanding global variations and their equity implications.
At the policy level, the findings highlight the need for evidence-based standards for AI-generated health communication, analogous to readability guidelines for printed materials. Institutions such as the WHO or national health agencies could issue frameworks defining acceptable linguistic thresholds for AI-based public health tools, ensuring that emerging technologies align with accessibility principles.
Given the multiple factors influencing chatbot-generated responses, including model architecture, prompting strategies, knowledge base design, and regional context, the statistical results summarized in this review should be interpreted descriptively rather than causally. Their reliability lies in the consistency of observed readability patterns across multiple studies and indices, not in precise attribution to specific technological components.

5. Future Directions

5.1. Dynamic, Readability-Aware Text Generation

Future LLMs should incorporate real-time control mechanisms that allow users or healthcare providers to specify a target readability range (e.g., FKGL 6–8). Such systems could include built-in constraints on sentence length, lexical complexity, and structural density, enabling models to adapt dynamically to each patient’s literacy level. Integrating these features into user-facing interfaces would substantially improve the accessibility of AI-driven health communication.
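One illustrative form such a control mechanism could take is a regenerate-until-readable loop: score each draft, and re-prompt with an explicit simplification instruction whenever the estimated grade level exceeds the target. The sketch below is purely a sketch under stated assumptions — `generate` is a hypothetical callable standing in for any LLM API, and the crude vowel-group FKGL estimate stands in for a production readability checker.

```python
import re

def fkgl(text: str) -> float:
    """Rough Flesch-Kincaid Grade Level using a vowel-group
    syllable heuristic (illustrative, not dictionary-accurate)."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words
    )
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def generate_within_grade(prompt, generate, max_grade=8.0, max_tries=3):
    """Re-prompt a model until its draft meets the target grade level.

    `generate` is a hypothetical wrapper around an LLM call; a real
    system would also constrain sentence length and lexical rarity
    directly rather than relying on re-prompting alone.
    """
    text = generate(prompt)
    for _ in range(max_tries):
        if fkgl(text) <= max_grade:
            break
        text = generate(
            f"{prompt}\nRewrite the answer at or below a US grade "
            f"{max_grade:.0f} reading level, using short sentences "
            f"and common words."
        )
    return text
```

A stub generator that returns a progressively simplified draft is enough to exercise the loop; in deployment, the same check could run invisibly before any response reaches the patient.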

5.2. Beyond Surface Metrics: Hybrid Readability Models

Classical readability indices capture syntactic and lexical features but fail to assess semantic transparency or conceptual load. Combining traditional metrics with embedding-based semantic measures, such as contextual coherence or terminology familiarity, would create more comprehensive tools for evaluating patient comprehension. Future research should explore hybrid frameworks that combine rule-based and machine-learning indicators to capture the multifactorial nature of readability.
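A hybrid indicator of the kind proposed here might, for example, blend a surface proxy with a terminology-familiarity term. The sketch below is deliberately simplified: a tiny fixed common-word set stands in for the frequency corpora or embedding-based familiarity measures envisioned above, and the equal 0.5 weights are arbitrary assumptions chosen only for illustration.

```python
import re

# Tiny stand-in for a lay-vocabulary list; a real hybrid model would use
# word-frequency corpora or embedding-based familiarity scores (assumption).
COMMON_WORDS = {
    "the", "a", "and", "of", "to", "in", "is", "your", "you", "blood",
    "pressure", "heart", "doctor", "take", "pills", "day", "high",
    "needs", "them", "each", "see", "often", "can", "help", "with",
}

def jargon_rate(text: str) -> float:
    """Fraction of words outside the lay vocabulary (0 = all familiar)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return sum(w not in COMMON_WORDS for w in words) / max(len(words), 1)

def avg_word_length(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / max(len(words), 1)

def hybrid_difficulty(text: str) -> float:
    """Blend a surface proxy (scaled word length) with a semantic proxy
    (jargon rate); the weights are arbitrary for illustration."""
    return 0.5 * (avg_word_length(text) / 10.0) + 0.5 * jargon_rate(text)
```

Even this toy combination separates lay phrasing from clinical phrasing more informatively than word length alone, since the jargon term penalizes unfamiliar vocabulary that a short-word metric could miss.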

5.3. Cross-Linguistic and Cross-Cultural Readability Evaluation

Most studies included in this review focused on English-language outputs, limiting the generalizability of findings. Languages with complex morphology, such as Polish, Turkish, Korean, or Finnish, may exhibit different readability patterns due to inflectional structure and word length. Expanding research to multilingual contexts is crucial for ensuring equitable access to AI-generated health information and for identifying cultural and linguistic factors that modulate readability.

5.4. User-Based Comprehension Studies

A critical next step involves shifting from purely text-based metrics to patient-centred comprehension research. Randomized controlled studies assessing users’ understanding, recall, and decision-making accuracy after reading AI-generated texts would provide more actionable insights into real-world usability. Combining these behavioural outcomes with readability indices would help validate whether improvements in linguistic complexity translate into meaningful gains in patient comprehension.
An additional limitation of the current literature concerns the limited stratification of readability outcomes by use-case category. Chatbot-generated medical texts serve heterogeneous functions, including general health education, preventive counselling, disease-specific self-management, and post-discharge instructions. These use cases differ substantially in their tolerance for ambiguity, acceptable linguistic complexity, and clinical risk.
Aggregating readability scores across heterogeneous use cases may therefore obscure clinically meaningful differences and limit interpretability. Future evaluations should classify chatbot outputs into functional use-case categories to better align readability assessments with real-world healthcare applications.

6. Limitations and Strengths

Several limitations should be acknowledged. First, the review was restricted to studies that reported quantitative readability metrics and analyzed publicly available chatbot models. As a result, it may have excluded unpublished or domain-specific evaluations, particularly those conducted within clinical settings or using proprietary systems. Second, the included studies varied in methodology, prompt design, and thematic focus, limiting direct comparability and precluding meta-analytic synthesis. Some discrepancies in readability scores may therefore reflect differences in prompt structure rather than actual model variation. Third, the findings should be interpreted with the understanding that classical readability indices capture surface linguistic features rather than semantic or cognitive comprehension. Fourth, the geographical and linguistic concentration of existing research (predominantly in English and in high-income countries) limits the generalizability of conclusions to other languages and health systems.
This review also has several strengths. First, to our knowledge, it is the first synthesis of studies assessing the readability of chatbot-generated medical texts across a wide range of models, indices, and medical domains. By including 140 publications and systematically analyzing 21 chatbots and 14 readability measures, the review offers a broad overview of how AI communicates health information to lay audiences. Second, the inclusion of multiple readability indices and cross-model comparisons enhances methodological robustness and interpretive depth. The convergence of findings across different indices (e.g., Flesch–Kincaid, SMOG, and Gunning Fog) strengthens the validity of observed trends and supports the reliability of the overall conclusions. Third, the study offers a clear conceptual framework for future investigations by linking linguistic readability with broader issues of health literacy, digital equity, and responsible AI design. This interdisciplinary perspective situates the findings not only within computational linguistics but also within public health and communication research, making the results relevant for both technical and health policy audiences.

7. Conclusions

This review is, to our knowledge, among the first to systematically synthesize evidence on the readability of chatbot-generated medical content. Despite advances in AI language models, most outputs remain too complex for typical patient audiences, revealing a persistent communication gap. Readability should therefore be treated as a key quality criterion in the design and evaluation of health chatbots. Our findings also point to an emerging risk that general-purpose AI models may unintentionally widen the health communication gap unless readability-aware safety controls become standard in clinical and public-facing AI systems. Based on the reviewed evidence, future evaluations of AI-generated medical content should routinely report a minimum core set of readability indices and explicitly document the prompting strategies used. In addition, the implementation of generative AI tools in healthcare should incorporate standardized readability assessment and user-based comprehension testing as routine components of validation. At the policy level, public health agencies may consider developing guidelines and standards for readability in AI-generated patient communication.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16031423/s1, Table S1: Completed PRISMA 2020 Checklist; Table S2: Characteristics of all included studies; Table S3: Complete citation ranking of all 140 studies.

Author Contributions

Conceptualization, J.B. and R.O.; methodology, J.B.; software, K.W.; validation, R.O. and J.R.; formal analysis, K.W.; investigation, J.B.; resources, J.B.; data curation, J.B.; writing—original draft preparation, J.B., R.O. and K.W.; writing—review and editing, J.B., R.O. and J.R.; visualization, K.W.; supervision, R.O. and J.R.; project administration, J.B.; funding acquisition, none. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fitzpatrick, P.J. Improving health literacy using the power of digital communications to achieve better health outcomes for patients and practitioners. Front. Digit. Health 2023, 5, 1264780. [Google Scholar] [CrossRef] [PubMed]
  2. Sharkiya, S.H. Quality communication can improve patient-centred health outcomes among older patients: A rapid review. BMC Health Serv. Res. 2023, 23, 886. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, X.; Hay, J.L.; Waters, E.A.; Kiviniemi, M.T.; Biddle, C.; Schofield, E.; Li, Y.; Kaphingst, K.; Orom, H. Health Literacy and Use and Trust in Health Information. J. Health Commun. 2018, 23, 724–734. [Google Scholar] [CrossRef] [PubMed]
  4. Kwame, A.; Petrucka, P.M. A literature-based study of patient-centered care and communication in nurse-patient interactions: Barriers, facilitators, and the way forward. BMC Nurs. 2021, 20, 158. [Google Scholar] [CrossRef]
  5. Al Shamsi, H.; Almutairi, A.G.; Al Mashrafi, S.; Al Kalbani, T. Implications of Language Barriers for Healthcare: A Systematic Review. Oman Med. J. 2020, 35, e122. [Google Scholar] [CrossRef]
  6. Coughlin, S.S.; Vernon, M.; Hatzigeorgiou, C.; George, V. Health Literacy, Social Determinants of Health, and Disease Prevention and Control. J. Environ. Health Sci. 2020, 6, 3061. [Google Scholar]
  7. Pandey, M.; Maina, R.G.; Amoyaw, J.; Li, Y.; Kamrul, R.; Michaels, C.R.; Maroof, R. Impacts of English language proficiency on healthcare access, use, and outcomes among immigrants: A qualitative study. BMC Health Serv. Res. 2021, 21, 741. [Google Scholar] [CrossRef]
  8. Yeung, A.W.K.; Goto, T.K.; Leung, W.K. Readability of the 100 Most-Cited Neuroimaging Papers Assessed by Common Readability Formulae. Front. Hum. Neurosci. 2018, 12, 308. [Google Scholar] [CrossRef]
  9. Nash, E.; Bickerstaff, M.; Chetwynd, A.J.; Hawcutt, D.B.; Oni, L. The readability of parent information leaflets in paediatric studies. Pediatr. Res. 2023, 94, 1166–1171. [Google Scholar] [CrossRef]
  10. Brega, A.G.; Freedman, M.A.; LeBlanc, W.G.; Barnard, J.; Mabachi, N.M.; Cifuentes, M.; Albright, K.; Weiss, B.D.; Brach, C.; West, D.R. Using the Health Literacy Universal Precautions Toolkit to Improve the Quality of Patient Materials. J. Health Commun. 2015, 20, 69–76. [Google Scholar] [CrossRef]
  11. Rooney, M.K.; Santiago, G.; Perni, S.; Horowitz, D.P.; McCall, A.R.; Einstein, A.J.; Jagsi, R.; Golden, D.W. Readability of Patient Education Materials from High-Impact Medical Journals: A 20-Year Analysis. J. Patient Exp. 2021, 8, 2374373521998847. [Google Scholar] [CrossRef]
  12. Eltorai, A.E.; Ghanian, S.; Adams, C.A., Jr.; Born, C.T.; Daniels, A.H. Readability of patient education materials on the american association for surgery of trauma website. Arch. Trauma. Res. 2014, 3, e18161. [Google Scholar] [CrossRef]
  13. Badarudeen, S.; Sabharwal, S. Assessing readability of patient education materials: Current role in orthopaedics. Clin. Orthop. Relat. Res. 2010, 468, 2572–2580. [Google Scholar] [CrossRef]
  14. Geantă, M.; Bădescu, D.; Chirca, N.; Nechita, O.C.; Radu, C.G.; Rascu, Ș.; Rădăvoi, D.; Sima, C.; Toma, C.; Jinga, V. The Emerging Role of Large Language Models in Improving Prostate Cancer Literacy. Bioengineering 2024, 11, 654. [Google Scholar] [CrossRef] [PubMed]
  15. Demir, G.; Sevri, M.; Hacıosmanoğlu, C.D.; Büyüktaşkın, D.; Özaslan, A. Comparative Evaluation of Large Language Models in Addressing Autism-Related Information Queries: Insights from ChatGPT, Gemini, and Copilot. Gazi Med. J. 2025, 36, 407–416. [Google Scholar] [CrossRef]
  16. Bolgova, O.; Ganguly, P.; Mavrych, V. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot. Anat. Sci. Educ. 2025, 18, 718–726. [Google Scholar] [CrossRef] [PubMed]
  17. Swisher, A.R.; Wu, A.W.; Liu, G.C.; Lee, M.K.; Carle, T.R.; Tang, D.M. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT’s Large Language Model. Otolaryngol. Head Neck Surg. 2024, 171, 1751–1757. [Google Scholar] [CrossRef]
  18. Nasra, M.; Jaffri, R.; Pavlin-Premrl, D.; Kok, H.K.; Khabaza, A.; Barras, C.; Slater, L.A.; Yazdabadi, A.; Moore, J.; Russell, J.; et al. Can artificial intelligence improve patient educational material readability? A systematic review and narrative synthesis. Intern. Med. J. 2025, 55, 20–34. [Google Scholar] [CrossRef]
  19. Kirchner, G.J.; Kim, R.Y.; Weddle, J.B.; Bible, J.E. Can Artificial Intelligence Improve the Readability of Patient Education Materials? Clin. Orthop. Relat. Res. 2023, 481, 2260–2267. [Google Scholar] [CrossRef]
  20. Mokmin, N.A.M.; Ibrahim, N.A. The evaluation of chatbot as a tool for health literacy education among undergraduate students. Educ. Inf. Technol. 2021, 26, 6033–6049. [Google Scholar] [CrossRef]
  21. Sezer, B.; Aydoğdu, T. Performance of Advanced Artificial Intelligence Models in Traumatic Dental Injuries in Primary Dentition: A Comparative Evaluation of ChatGPT-4 Omni, DeepSeek, Gemini Advanced, and Claude 3.7 in Terms of Accuracy, Completeness, Response Time, and Readability. Appl. Sci. 2025, 15, 7778. [Google Scholar] [CrossRef]
  22. Tilton, A.K.; Caplan, B.E.; Cole, B.J. Generative AI in consumer health: Leveraging large language models for health literacy and clinical safety with a digital health framework. Front. Digit. Health 2025, 7, 1616488. [Google Scholar] [CrossRef]
  23. Randell, R.L.; Wilson, H.P.; Ragavan, M.I.; Collins, A.B.; Vail, J.; Ramirez, S.; Amodei, J.; Mickievicz, E.; Krieger, M.S.; Macon, E.C.; et al. Communicating Health Research with Plain Language. Inq. J. Health Care Organ. Provis. Financ. 2025, 62, 469580251357755. [Google Scholar] [CrossRef] [PubMed]
  24. Giguère, A.; Zomahoun, H.T.V.; Carmichael, P.H.; Uwizeye, C.B.; Légaré, F.; Grimshaw, J.M.; Gagnon, M.P.; Auguste, D.U.; Massougbodji, J. Printed educational materials: Effects on professional practice and healthcare outcomes. Cochrane Database Syst. Rev. 2020, 8, CD004398. [Google Scholar] [CrossRef] [PubMed]
  25. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776. [Google Scholar] [CrossRef]
  26. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  27. Reddy, S. Generative AI in healthcare: An implementation science informed translational path on application, integration and governance. Implement. Sci. 2024, 19, 27. [Google Scholar] [CrossRef]
  28. Warde, F.; Papadakos, J.; Papadakos, T.; Rodin, D.; Salhia, M.; Giuliani, M. Plain language communication as a priority competency for medical professionals in a globalized world. Can. Med. Educ. J. 2018, 9, e52–e59. [Google Scholar] [CrossRef]
  29. Delgado-Chaves, F.M.; Jennings, M.J.; Atalaia, A.; Wolff, J.; Horvath, R.; Mamdouh, Z.M.; Baumbach, J.; Baumbach, L. Transforming literature screening: The emerging role of large language models in systematic reviews. Proc. Natl. Acad. Sci. USA 2025, 122, e2411962122. [Google Scholar] [CrossRef]
  30. Yang, S.; Jing, M.; Wang, S.; Huang, Z.; Wang, J.; Kou, J.; Shi, M.; Xia, Z.; Wei, Q.; Xing, W.; et al. Building trustworthy large language model-driven generative recommender system for healthcare decision support: A scoping review of corpus sources, customization techniques, and evaluation frameworks. Artif. Intell. Med. 2026, 171, 103310. [Google Scholar] [CrossRef]
  31. Ozmen, B.B.; Singh, N.; Shah, K.; Berber, I.; Singh, D.; Pinsky, E.; Schulz, S.A.; Bishop, S.N.; Bernard, S.; Djohan, R.S.; et al. MicroRAG: Development of a Novel Artificial Intelligence Retrieval-Augmented Generation Model for Microsurgery Clinical Decision Support. Microsurgery 2025, 45, e70138. [Google Scholar] [CrossRef]
  32. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef]
  33. Busch, F.; Kaibel, L.; Nguyen, H.; Lemke, T.; Ziegelmayer, S.; Graf, M.; Marka, A.W.; Endrös, L.; Prucker, P.; Spitzl, D.; et al. Evaluation of a Retrieval-Augmented Generation-Powered Chatbot for Pre-CT Informed Consent: A Prospective Comparative Study. J. Imaging Inform. Med. 2025, 38, 4312–4323. [Google Scholar] [CrossRef]
  34. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
  35. Yurdakurban, E.; Topsakal, K.G.; Duran, G.S. A comparative analysis of AI-based chatbots: Assessing data quality in orthognathic surgery related patient information. J. Stomatol. Oral Maxillofac. Surg. 2024, 125, 101757. [Google Scholar] [CrossRef] [PubMed]
  36. Camargo, E.S.; Quadras, I.C.C.; Garanhani, R.R.; de Araujo, C.M.; Stuginski-Barbosa, J. A Comparative Analysis of Three Large Language Models on Bruxism Knowledge. J. Oral Rehabil. 2025, 52, 896–903. [Google Scholar] [CrossRef] [PubMed]
  37. Deveci, C.D.; Baker, J.J.; Sikander, B.; Rosenberg, J. A comparison of cover letters written by ChatGPT-4 or humans. Dan. Med. J. 2023, 70, A06230412. [Google Scholar]
  38. Kring, T.; Prasad, S.; Dadi, S.; Sokhn, E.; Franzmann, E. A comparison of quality and readability of Artificial Intelligence chatbots in triage for head and neck cancer. Am. J. Otolaryngol. 2025, 46, 104710. [Google Scholar] [CrossRef]
  39. Yun, J.Y.; Kim, D.J.; Lee, N.; Kim, E.K. A comprehensive evaluation of ChatGPT consultation quality for augmentation mammoplasty: A comparative analysis between plastic surgeons and laypersons. Int. J. Med. Inform. 2023, 179, 105219. [Google Scholar] [CrossRef]
  40. Carlson, J.A.; Cheng, R.Z.; Lange, A.; Nagalakshmi, N.; Rabets, J.; Shah, T.; Sindhwani, P. Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware. Cureus 2024, 16, e67996. [Google Scholar] [CrossRef]
  41. Halawani, A.; Mitchell, A.; Saffarzadeh, M.; Wong, V.; Chew, B.H.; Forbes, C.M. Accuracy and Readability of Kidney Stone Patient Information Materials Generated by a Large Language Model Compared to Official Urologic Organizations. Urology 2024, 186, 107–113. [Google Scholar] [CrossRef] [PubMed]
  42. Yau, J.Y.; Saadat, S.; Hsu, E.; Murphy, L.S.; Roh, J.S.; Suchard, J.; Tapia, A.; Wiechmann, W.; Langdorf, M.I. Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study. J. Med. Internet Res. 2024, 26, e60291. [Google Scholar] [CrossRef]
  43. Yıldız, H.A.; Söğütdelen, E. AI Chatbots as Sources of STD Information: A Study on Reliability and Readability. J. Med. Syst. 2025, 49, 43. [Google Scholar] [CrossRef] [PubMed]
  44. Stephan, D.; Bertsch, A.; Burwinkel, M.; Vinayahalingam, S.; Al-Nawas, B.; Kämmerer, P.W.; Thiem, D.G. AI in Dental Radiology-Improving the Efficiency of Reporting with ChatGPT: Comparative Study. J. Med. Internet Res. 2024, 26, e60684. [Google Scholar] [CrossRef] [PubMed]
  45. Hand, C.; Bohn, C.; Tannir, S.; Ulrich, M.; Saniei, S.; Girod-Hoffman, M.; Lu, Y.; Forsythe, B. American Academy of Orthopaedic Surgeons OrthoInfo provides more readable information regarding rotator cuff injury than ChatGPT. J. ISAKOS 2025, 12, 100841. [Google Scholar] [CrossRef]
  46. Bohn, C.; Hand, C.; Tannir, S.; Ulrich, M.; Saniei, S.; Girod-Hoffman, M.; Lu, Y.; Krych, A.; Forsythe, B. American academy of Orthopedic Surgeons’ OrthoInfo provides more readable information regarding meniscus injury than ChatGPT-4 while information accuracy is comparable. J. ISAKOS 2025, 11, 100843. [Google Scholar] [CrossRef]
  47. Ichhpujani, P.; Parmar, U.P.S.; Kumar, S. Appropriateness and readability of Google Bard and ChatGPT-3.5 generated responses for surgical treatment of glaucoma. Rom. J. Ophthalmol. 2024, 68, 243–248. [Google Scholar] [CrossRef]
  48. Azzopardi, M.; Ng, B.; Logeswaran, A.; Loizou, C.; Cheong, R.C.T.; Gireesh, P.; Ting, D.S.J.; Chong, Y.J. Artificial intelligence chatbots as sources of patient education material for cataract surgery: ChatGPT-4 versus Google Bard. BMJ Open Ophthalmol. 2024, 9, e001824. [Google Scholar] [CrossRef]
  49. Gondode, P.G.; Singh, R.; Mehta, S.; Singh, S.; Kumar, S.; Nayak, S.S. Artificial intelligence chatbots versus traditional medical resources for patient education on “Labor Epidurals”: An evaluation of accuracy, emotional tone, and readability. Int. J. Obstet. Anesth. 2025, 61, 104302. [Google Scholar] [CrossRef]
  50. Pradhan, F.; Fiedler, A.; Samson, K.; Olivera-Martinez, M.; Manatsathit, W.; Peeraphatdit, T. Artificial intelligence compared with human-derived patient educational materials on cirrhosis. Hepatol. Commun. 2024, 8, e0367. [Google Scholar] [CrossRef]
  51. Ayad, O.; Yassa, A.; Patel, A.M.; Vengsarkar, V.A.; Ayad, S.; Ayad, S.; Mikhael, M. Artificial intelligence in patient care: Evaluating artificial intelligence’s accuracy and accessibility in addressing blepharoplasty concerns. Int. Ophthalmol. 2025, 45, 244. [Google Scholar] [CrossRef] [PubMed]
  52. Erden, Y.; Temel, M.H.; Bağcıer, F. Artificial intelligence insights into osteoporosis: Assessing ChatGPT’s information quality and readability. Arch. Osteoporos. 2024, 19, 17. [Google Scholar] [CrossRef] [PubMed]
  53. Shin, D.; Park, H.; Shaffrey, I.; Yacoubian, V.; Taka, T.M.; Dye, J.; Danisa, O. Artificial intelligence versus clinical judgement: How accurately do generative models reflect CNS guidelines for chiari malformation? Clin. Neurol. Neurosurg. 2025, 248, 108662. [Google Scholar] [CrossRef] [PubMed]
  54. Andrikyan, W.; Sametinger, S.M.; Kosfeld, F.; Jung-Poppe, L.; Fromm, M.F.; Maas, R.; Nicolaus, H.F. Artificial intelligence-powered chatbots in search engines: A cross-sectional study on the quality and risks of drug information for patients. BMJ Qual. Saf. 2025, 34, 100–109. [Google Scholar] [CrossRef]
  55. De Rouck, R.; Wille, E.; Gilbert, A.; Vermeersch, N. Assessing artificial intelligence-generated patient discharge information for the emergency department: A pilot study. Int. J. Emerg. Med. 2025, 18, 85. [Google Scholar] [CrossRef]
  56. Mondal, H.; Gupta, G.; Sarangi, P.K.; Sharma, S.; Choudhary, P.K.; Juhi, A.; Kumari, A.; Mondal, S. Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries. Cureus 2025, 17, e80976. [Google Scholar] [CrossRef]
  57. Xu, Q.; Wang, J.; Chen, X.; Wang, J.; Li, H.; Wang, Z.; Li, W.; Gao, J.; Chen, C.; Gao, Y. Assessing the Efficacy of ChatGPT Prompting Strategies in Enhancing Thyroid Cancer Patient Education: A Prospective Study. J. Med. Syst. 2025, 49, 11. [Google Scholar] [CrossRef]
  58. Scaff, S.P.S.; Reis, F.J.J.; Ferreira, G.E.; Jacob, M.F.; Saragiotto, B.T. Assessing the performance of AI chatbots in answering patients’ common questions about low back pain. Ann. Rheum. Dis. 2025, 84, 143–149. [Google Scholar] [CrossRef]
  59. Dharia, S.N.; Traversone, J.; Wortman, R.; Mulligan, M. Assessing the quality and readability of ChatGPT responses to frequently asked questions about trigger finger release. J. Plast. Reconstr. Aesthet. Surg. 2025, 105, 170–172. [Google Scholar] [CrossRef]
  60. Stephenson-Moe, C.A.; Behers, B.J.; Gibons, R.M.; Behers, B.M.; Jesus Herrera, L.; Anneaud, D.; Rosario, M.A.; Wojtas, C.N.; Bhambrah, S.; Hamad, K.M. Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study. Medicine 2025, 104, e42135. [Google Scholar] [CrossRef]
  61. Grilo, A.; Marques, C.; Corte-Real, M.; Carolino, E.; Caetano, M. Assessing the Quality and Reliability of ChatGPT’s Responses to Radiotherapy-Related Patient Queries: Comparative Study with GPT-3.5 and GPT-4. JMIR Cancer 2025, 11, e63677. [Google Scholar] [CrossRef]
  62. Gezer, M.C.; Armangil, M. Assessing the quality of ChatGPT’s responses to commonly asked questions about trigger finger treatment. Turk. J. Trauma Emerg. Surg. Ulus. Travma Acil Cerrahi Derg. 2025, 31, 389–393. [Google Scholar] [CrossRef] [PubMed]
  63. Keating, M.; Bollard, S.M.; Potter, S. Assessing the Quality, Readability, and Acceptability of AI-Generated Information in Plastic and Aesthetic Surgery. Cureus 2024, 16, e73874. [Google Scholar] [CrossRef] [PubMed]
  64. Ozduran, E.; Hancı, V.; Erkin, Y.; Özbek, İ.C.; Abdulkerimov, V. Assessing the readability, quality and reliability of responses produced by ChatGPT, Gemini, and Perplexity regarding most frequently asked keywords about low back pain. PeerJ 2025, 13, e18847. [Google Scholar] [CrossRef] [PubMed]
  65. Ömür Arça, D.; Erdemir, İ.; Kara, F.; Shermatov, N.; Odacioğlu, M.; İbişoğlu, E.; Hanci, F.B.; Sağiroğlu, G.; Hanci, V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine 2024, 103, e38352. [Google Scholar] [CrossRef]
  66. Olszewski, R.; Watros, K.; Mańczak, M.; Owoc, J.; Jeziorski, K.; Brzeziński, J. Assessing the response quality and readability of chatbots in cardiovascular health, oncology, and psoriasis: A comparative study. Int. J. Med. Inform. 2024, 190, 105562. [Google Scholar] [CrossRef]
  67. Saeedi, S.; Bakhtiar, M. Assessing the response quality and readability of ChatGPT in stuttering. J. Fluen. Disord. 2025, 85, 106149. [Google Scholar] [CrossRef]
  68. Khabaz, K.; Newman-Hung, N.J.; Kallini, J.R.; Kendal, J.; Christ, A.B.; Bernthal, N.M.; Wessel, L.E. Assessment of Artificial Intelligence Chatbot Responses to Common Patient Questions on Bone Sarcoma. J. Surg. Oncol. 2025, 131, 719–724. [Google Scholar] [CrossRef]
  69. Pan, A.; Musheyev, D.; Bockelman, D.; Loeb, S.; Kabarriti, A.E. Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer. JAMA Oncol. 2023, 9, 1437–1440. [Google Scholar] [CrossRef]
  70. Topdağı, B.; Kavaz, T. Assessment of information quality in contemporary artificial intelligence systems for digital smile design: A comparative analysis. J. Prosthet. Dent. 2025, 134, 1279.E1–1279.E8. [Google Scholar] [CrossRef]
  71. Hancı, V.; Ergün, B.; Gül, Ş.; Uzun, Ö.; Erdemir, İ.; Hancı, F.B. Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care. Medicine 2024, 103, e39305. [Google Scholar] [CrossRef] [PubMed]
  72. Cao, H.; Hao, C.; Zhang, T.; Zheng, X.; Gao, Z.; Wu, J.; Gan, L.; Liu, Y.; Zeng, X.; Wang, W. Battle of the artificial intelligence: A comprehensive comparative analysis of DeepSeek and ChatGPT for urinary incontinence-related questions. Front. Public Health 2025, 13, 1605908. [Google Scholar] [CrossRef] [PubMed]
  73. Özer Aslan, İ.; Aslan, M.T. Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy. Healthcare 2025, 13, 1756. [Google Scholar] [CrossRef] [PubMed]
  74. Rouhi, A.D.; Ghanem, Y.K.; Yolchieva, L.; Saleh, Z.; Joshi, H.; Moccia, M.C.; Suarez-Pierre, A.; Han, J.J. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study. Cardiol. Ther. 2024, 13, 137–147. [Google Scholar] [CrossRef]
  75. Dursun, D.; Bilici Geçer, R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med. Inform. Decis. Mak. 2024, 24, 211. [Google Scholar] [CrossRef]
  76. Lack, B.T.; Mouhawasse, E.; Childers, J.T.; Jackson, G.R.; Daji, S.V.; Yerke-Hansen, P.; Familiari, F.; Knapik, D.M.; Sabesan, V.J. Can ChatGPT answer patient questions regarding reverse shoulder arthroplasty? J. ISAKOS 2024, 9, 100323. [Google Scholar] [CrossRef]
  77. Hones, K.; Krisanda, E.; Chim, H. Caution Regarding ChatGPT’s Appropriateness and Reliability Regarding Surgery for Wrist Arthritis. Hand 2025, 20, 910–916. [Google Scholar] [CrossRef]
  78. Dias, R.; Castan, A.; Gotoff, K.; Kadkoy, Y.; Ippolito, J.; Beebe, K.; Benevenia, J. ChatGPT 3.5 Better Improves Comprehensibility of English, than Spanish, Generated Responses to Osteosarcoma Questions. J. Surg. Oncol. 2025, 131, 1692–1695. [Google Scholar] [CrossRef]
  79. Nian, P.P.; Umesh, A.; Jones, R.H.; Adhiyaman, A.; Williams, C.J.; Goodbody, C.M.; Heyer, J.H.; Doyle, S.M. ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines. J. Pediatr. Orthop. Soc. N. Am. 2024, 10, 100135. [Google Scholar] [CrossRef]
  80. Siu, A.H.Y.; Gibson, D.P.; Chiu, C.; Kwok, A.; Irwin, M.; Christie, A.; Koh, C.E.; Keshava, A.; Reece, M.; Suen, M.; et al. ChatGPT as a patient education tool in colorectal cancer-An in-depth assessment of efficacy, quality and readability. Color. Dis. 2025, 27, e17267. [Google Scholar] [CrossRef]
  81. Deng, J.; Li, L.; Oosterhof, J.J.; Malliaras, P.; Silbernagel, K.G.; Breda, S.J.; Eygendaal, D.; Oei, E.H.; de Vos, R.J. ChatGPT is a comprehensive education tool for patients with patellar tendinopathy, but it currently lacks accuracy and readability. Musculoskelet. Sci. Pract. 2025, 76, 103275. [Google Scholar] [CrossRef]
  82. Mathes, S.; Seurig, S.; Bluhme, F.; Beyer, K.; Heizmann, F.; Wagner, M.; Neugärtner, I.; Biedermann, T.; Darsow, U. ChatGPT Performance on 120 Interdisciplinary Allergology Questions-Systematic Evaluation with Clinical Error Impact Assessment for Critical Erroneous AI-Guided Chatbot Advice. J. Allergy Clin. Immunol. Pract. 2025, 13, 1350–1357.e4. [Google Scholar] [CrossRef] [PubMed]
  83. AlShehri, Y.; McConkey, M.; Lodhia, P. ChatGPT Provides Satisfactory but Occasionally Inaccurate Answers to Common Patient Hip Arthroscopy Questions. Arthroscopy 2025, 41, 1337–1347. [Google Scholar] [CrossRef] [PubMed]
  84. Ho, R.A.; Shaari, A.L.; Cowan, P.T.; Yan, K. ChatGPT Responses to Frequently Asked Questions on Ménière’s Disease: A Comparison to Clinical Practice Guideline Answers. OTO Open 2024, 8, e163. [Google Scholar] [CrossRef] [PubMed]
  85. Shen, S.A.; Perez-Heydrich, C.A.; Xie, D.X.; Nellis, J.C. ChatGPT vs. web search for patient questions: What does ChatGPT do better? Eur. Arch. Otorhinolaryngol. 2024, 281, 3219–3225. [Google Scholar] [CrossRef]
  86. Sikander, B.; Baker, J.J.; Deveci, C.D.; Lund, L.; Rosenberg, J. ChatGPT-4 and Human Researchers Are Equal in Writing Scientific Introduction Sections: A Blinded, Randomized, Non-inferiority Controlled Study. Cureus 2023, 15, e49019. [Google Scholar] [CrossRef]
  87. Browne, R.; Gull, K.; Hurley, C.M.; Sugrue, R.M.; O’Sullivan, J.B. ChatGPT-4 Can Help Hand Surgeons Communicate Better with Patients. J. Hand Surg. Glob. Online 2024, 6, 436–438. [Google Scholar] [CrossRef]
  88. Akyol Onder, E.N.; Ensari, E.; Ertan, P. ChatGPT-4o’s performance on pediatric vesicoureteral reflux. J. Pediatr. Urol. 2025, 21, 504–509. [Google Scholar] [CrossRef]
  89. Najafali, D.; Galbraith, L.G.; Camacho, J.M.; Stoffel, V.; Herzog, I.; Moss, C.; Taiberg, S.L.; Knoedler, L. Class in Session: Analysis of GPT-4-created Plastic Surgery In-service Examination Questions. Plast. Reconstr. Surg. Glob. Open 2024, 12, e6185. [Google Scholar] [CrossRef]
  90. Bahçeci, T.; Elmaağaç, B.; Ceyhan, E. Comparative analysis of the effectiveness of Microsoft Copilot artificial intelligence chatbot and Google Search in answering patient inquiries about infertility: Evaluating readability, understandability, and actionability. Int. J. Impot. Res. 2025, 37, 1002–1007. [Google Scholar] [CrossRef]
  91. Maron, C.M.; Emile, S.H.; Horesh, N.; Freund, M.R.; Pellino, G.; Wexner, S.D. Comparing answers of ChatGPT and Google Gemini to common questions on benign anal conditions. Tech. Coloproctol. 2025, 29, 57. [Google Scholar] [CrossRef]
  92. Du, K.; Li, A.; Zuo, Q.H.; Zhang, C.Y.; Guo, R.; Chen, P.; Du, W.S.; Li, S.M. Comparing Artificial Intelligence-Generated and Clinician-Created Personalized Self-Management Guidance for Patients with Knee Osteoarthritis: Blinded Observational Study. J. Med. Internet Res. 2025, 27, e67830. [Google Scholar] [CrossRef]
  93. Gondode, P.; Duggal, S.; Garg, N.; Sethupathy, S.; Asai, O.; Lohakare, P. Comparing patient education tools for chronic pain medications: Artificial intelligence chatbot versus traditional patient information leaflets. Indian J. Anaesth. 2024, 68, 631–636. [Google Scholar] [CrossRef] [PubMed]
  94. Shanmugam, S.K.; Browning, D.J. Comparison of Large Language Models in Diagnosis and Management of Challenging Clinical Cases. Clin. Ophthalmol. 2024, 18, 3239–3247. [Google Scholar] [CrossRef] [PubMed]
  95. Roy, J.M.; Atallah, E.; Piper, K.; Majmundar, S.; Mouchtouris, N.; Self, D.M.; Kaul, A.; Sizdahkhani, S.; Musmar, B.; Tjoumakaris, S.I.; et al. Comparison of quality, empathy and readability of physician responses versus chatbot responses to common cerebrovascular neurosurgical questions on a social media platform. Clin. Neurol. Neurosurg. 2025, 255, 108986. [Google Scholar] [CrossRef] [PubMed]
  96. Zaleski, A.L.; Berkowsky, R.; Craig, K.J.T.; Pescatello, L.S. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med. Educ. 2024, 10, e51308. [Google Scholar] [CrossRef]
  97. Singh, S.; Errampalli, E.; Errampalli, N.; Miran, M.S. Enhancing Patient Education on Cardiovascular Rehabilitation with Large Language Models. Mo. Med. 2025, 122, 67–71. [Google Scholar]
  98. Abreu, A.A.; Murimwa, G.Z.; Farah, E.; Stewart, J.W.; Zhang, L.; Rodriguez, J.; Sweetenham, J.; Zeh, H.J.; Wang, S.C.; Polanco, P.M. Enhancing Readability of Online Patient-Facing Content: The Role of AI Chatbots in Improving Cancer Information Accessibility. J. Natl. Compr. Canc. Netw. 2024, 22, e237334. [Google Scholar] [CrossRef]
  99. Mondal, H.; Tiu, D.N.; Mondal, S.; Dutta, R.; Naskar, A.; Podder, I. Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots. J. Midlife Health 2025, 16, 45–50. [Google Scholar] [CrossRef]
  100. Zhan, Y.; Chen, X.; Ye, F.; Wu, Z.; Usman, M.; Yuan, Z.; Wu, H.; Huang, J.; Yu, H. Evaluating AI Chatbot Responses to Postkidney Transplant Inquiries. Transplant. Proc. 2025, 57, 394–405. [Google Scholar] [CrossRef]
  101. Kayra, M.V.; Anil, H.; Ozdogan, I.; Baradia, S.M.A.; Toksoz, S. Evaluating AI chatbots in penis enhancement information: A comparative analysis of readability, reliability and quality. Int. J. Impot. Res. 2025, 37, 558–563. [Google Scholar] [CrossRef]
  102. Kacer, E.O. Evaluating AI-based breastfeeding chatbots: Quality, readability, and reliability analysis. PLoS ONE 2025, 20, e0319782. [Google Scholar] [CrossRef] [PubMed]
  103. Zhou, M.; Pan, Y.; Zhang, Y.; Song, X.; Zhou, Y. Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int. J. Med. Inform. 2025, 198, 105871. [Google Scholar] [CrossRef]
  104. Helvacioglu-Yigit, D.; Demirturk, H.; Ali, K.; Tamimi, D.; Koenig, L.; Almashraqi, A. Evaluating artificial intelligence chatbots for patient education in oral and maxillofacial radiology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2025, 139, 750–759. [Google Scholar] [CrossRef] [PubMed]
  105. Dincer, H.A.; Dogu, D. Evaluating Artificial Intelligence in Patient Education: DeepSeek-V3 Versus ChatGPT-4o in Answering Common Questions on Laparoscopic Cholecystectomy. ANZ J. Surg. 2025, 95, 2322–2328. [Google Scholar] [CrossRef] [PubMed]
  106. Sina, E.M.; Campbell, D.J.; Duffy, A.; Mandloi, S.; Benedict, P.; Farquhar, D.; Unsal, A.; Nyquist, G. Evaluating ChatGPT as a Patient Education Tool for COVID-19-Induced Olfactory Dysfunction. OTO Open 2024, 8, e70011. [Google Scholar] [CrossRef]
  107. Lee, T.J.; Campbell, D.J.; Rao, A.K.; Hossain, A.; Elkattawy, O.; Radfar, N.; Lee, P.; Gardin, J.M. Evaluating ChatGPT Responses on Atrial Fibrillation for Patient Education. Cureus 2024, 16, e61680. [Google Scholar] [CrossRef]
  108. Campbell, D.J.; Estephan, L.E.; Mastrolonardo, E.V.; Amin, D.R.; Huntley, C.T.; Boon, M.S. Evaluating ChatGPT responses on obstructive sleep apnea for patient education. J. Clin. Sleep Med. 2023, 19, 1989–1995. [Google Scholar] [CrossRef]
  109. Pandey, V.K.; Munshi, A.; Mohanti, B.K.; Bansal, K.; Rastogi, K. Evaluating ChatGPT to test its robustness as an interactive information database of radiation oncology and to assess its responses to common queries from radiotherapy patients: A single institution investigation. Cancer Radiother. 2024, 28, 258–264. [Google Scholar] [CrossRef]
  110. Sahin, S.; Erkmen, B.; Duymaz, Y.K.; Bayram, F.; Tekin, A.M.; Topsakal, V. Evaluating ChatGPT-4’s performance as a digital health advisor for otosclerosis surgery. Front. Surg. 2024, 11, 1373843. [Google Scholar] [CrossRef]
  111. Alapati, R.; Campbell, D.; Molin, N.; Creighton, E.; Wei, Z.; Boon, M.; Huntley, C. Evaluating insomnia queries from an artificial intelligence chatbot for patient education. J. Clin. Sleep Med. 2024, 20, 583–594. [Google Scholar] [CrossRef]
  112. Fazilat, A.Z.; Brenac, C.; Kawamoto-Duran, D.; Berry, C.E.; Alyono, J.; Chang, M.T.; Liu, D.T.; Patel, Z.M.; Tringali, S.; Wan, D.C.; et al. Evaluating the quality and readability of ChatGPT-generated patient-facing medical information in rhinology. Eur. Arch. Otorhinolaryngol. 2025, 282, 1911–1920. [Google Scholar] [CrossRef]
  113. Giammanco, P.A.; Collins, C.E.; Zimmerman, J.; Kricfalusi, M.; Rice, R.C.; Trumbo, M.; Carlson, B.A.; Rajfer, R.A.; Schneiderman, B.A.; Elsissy, J.G. Evaluating the Quality and Readability of Information Provided by Generative Artificial Intelligence Chatbots on Clavicle Fracture Treatment Options. Cureus 2025, 17, e77200. [Google Scholar] [CrossRef] [PubMed]
  114. Singavarapu, J.; Khemlani, A.; Jacobs, M.; Berglas, E.; Lazar, J.; Kabarriti, A. Evaluating the Quality of Cardiovascular Disease Information from AI Chatbots: A Comparative Study. Cureus 2025, 17, e88085. [Google Scholar] [CrossRef] [PubMed]
  115. Kara, M.; Ozduran, E.; Kara, M.M.; Özbek, İ.C.; Hancı, V. Evaluating the readability, quality, and reliability of responses generated by ChatGPT, Gemini, and Perplexity on the most commonly asked questions about Ankylosing spondylitis. PLoS ONE 2025, 20, e0326351. [Google Scholar] [CrossRef] [PubMed]
  116. Karaagac, M.; Carkit, S. Evaluation of AI-Based Chatbots in Liver Cancer Information Dissemination: A Comparative Analysis of GPT, DeepSeek, Copilot, and Gemini. Oncology 2025, 1–10. [Google Scholar] [CrossRef]
  117. Spina, A.; Andalib, S.; Flores, D.; Vermani, R.; Halaseh, F.F.; Nelson, A.M. Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study. JMIR AI 2024, 3, e54371. [Google Scholar] [CrossRef]
  118. Şahin, M.F.; Keleş, A.; Özcan, R.; Doğan, Ç.; Topkaç, E.C.; Akgül, M.; Yazıci, C.M. Evaluation of information accuracy and clarity: ChatGPT responses to the most frequently asked questions about premature ejaculation. Sex. Med. 2024, 12, qfae036. [Google Scholar] [CrossRef]
  119. Öztürk, Z.; Bal, C.; Çelikkaya, B.N. Evaluation of Information Provided by ChatGPT Versions on Traumatic Dental Injuries for Dental Students and Professionals. Dent. Traumatol. 2025, 41, 427–436. [Google Scholar] [CrossRef]
  120. Casciato, D.; Mateen, S.; Cooperman, S.; Pesavento, D.; Brandao, R.A. Evaluation of Online AI-Generated Foot and Ankle Surgery Information. J. Foot Ankle Surg. 2024, 63, 680–683. [Google Scholar] [CrossRef]
  121. Davis, R.J.; Ayo-Ajibola, O.; Lin, M.E.; Swanson, M.S.; Chambers, T.N.; Kwon, D.I.; Kokot, N.C. Evaluation of Oropharyngeal Cancer Information from Revolutionary Artificial Intelligence Chatbot. Laryngoscope 2024, 134, 2252–2257. [Google Scholar] [CrossRef]
  122. Meyer, M.K.R.; Kandathil, C.K.; Davis, S.J.; Durairaj, K.K.; Patel, P.N.; Pepper, J.P.; Spataro, E.A.; Most, S.P. Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy. Aesthetic Plast. Surg. 2025, 49, 1868–1873. [Google Scholar] [CrossRef] [PubMed]
  123. Gupta, A.; Basha, A.; Sontam, T.R.; Hlavinka, W.J.; Croen, B.J.; Abdou, C.; Abdullah, M.; Hamilton, R. Evolution of patient education materials from large-language artificial intelligence models on complex regional pain syndrome: Are patients learning? Bayl. Univ. Med. Cent. Proc. 2025, 38, 221–226. [Google Scholar] [CrossRef] [PubMed]
  124. Kılınç, D.D.; Mansız, D. Examination of the reliability and readability of Chatbot Generative Pretrained Transformer’s (ChatGPT) responses to questions about orthodontics and the evolution of these responses in an updated version. Am. J. Orthod. Dentofacial Orthop. 2024, 165, 546–555. [Google Scholar] [CrossRef] [PubMed]
  125. Canillas Del Rey, F.; Canillas Arias, M. Exploring the potential of Artificial Intelligence in Traumatology: Conversational answers to specific questions. Rev. Esp. Cir. Ortop. Traumatol. 2025, 69, 38–46 (In English and Spanish). [Google Scholar] [CrossRef]
  126. Park, K.U.; Lipsitz, S.; Dominici, L.S.; Lynce, F.; Minami, C.A.; Nakhlis, F.; Waks, A.G.; Warren, L.E.; Eidman, N.; Frazier, J.; et al. Generative artificial intelligence as a source of breast cancer information for patients: Proceed with caution. Cancer 2025, 131, e35521. [Google Scholar] [CrossRef]
  127. Zaretsky, J.; Kim, J.M.; Baskharoun, S.; Zhao, Y.; Austrian, J.; Aphinyanaphongs, Y.; Gupta, R.; Blecker, S.B.; Feldman, J. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7, e240357. [Google Scholar] [CrossRef]
  128. Lee, Y.; Shin, T.; Tessier, L.; Javidan, A.; Jung, J.; Hong, D.; Strong, A.T.; McKechnie, T.; Malone, S.; ASMBS Artificial Intelligence and Digital Surgery Task Force; et al. Harnessing artificial intelligence in bariatric surgery: Comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg. Obes. Relat. Dis. 2024, 20, 603–608. [Google Scholar] [CrossRef]
  129. Asfuroğlu, Z.M.; Yağar, H.; Gümüşoğlu, E. High accuracy but limited readability of large language model-generated responses to frequently asked questions about Kienböck’s disease. BMC Musculoskelet. Disord. 2024, 25, 879. [Google Scholar] [CrossRef]
  130. Gül, Ş.; Erdemir, İ.; Hanci, V.; Aydoğmuş, E.; Erkoç, Y.S. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and Perplexity responses. Medicine 2024, 103, e38009. [Google Scholar] [CrossRef]
  131. Ulusoy, I.; Yılmaz, M.; Kıvrak, A. How Efficient Is ChatGPT in Accessing Accurate and Quality Health-Related Information? Cureus 2023, 15, e46662. [Google Scholar] [CrossRef] [PubMed]
  132. Akkan, H.; Seyyar, G.K. Improving readability in AI-generated medical information on fragility fractures: The role of prompt wording on ChatGPT’s responses. Osteoporos. Int. 2025, 36, 403–410. [Google Scholar] [CrossRef] [PubMed]
  133. Tan, C.W.; Chan, J.C.Y.; Chan, J.J.I.; Nagarajan, S.; Sng, B.L. Information about labor epidural analgesia: An updated evaluation on the readability, accuracy, and quality of ChatGPT responses incorporating patient preferences and complex clinical scenarios. Int. J. Obstet. Anesth. 2025, 63, 104688. [Google Scholar] [CrossRef] [PubMed]
  134. Xie, Y.; Seth, I.; Hunter-Smith, D.J.; Rozen, W.M.; Seifman, M.A. Investigating the impact of innovative AI chatbot on post-pandemic medical education and clinical assistance: A comprehensive analysis. ANZ J. Surg. 2024, 94, 68–77. [Google Scholar] [CrossRef]
  135. Cao, J.J.; Kwon, D.H.; Ghaziani, T.T.; Kwo, P.; Tse, G.; Kesselman, A.; Kamaya, A.; Tse, J.R. Large language models’ responses to liver cancer surveillance, diagnosis, and management questions: Accuracy, reliability, readability. Abdom. Radiol. 2024, 49, 4286–4294. [Google Scholar] [CrossRef]
  136. Singh, S.P.; Jamal, A.; Qureshi, F.; Zaidi, R.; Qureshi, F. Leveraging Generative Artificial Intelligence Models in Patient Education on Inferior Vena Cava Filters. Clin. Pract. 2024, 14, 1507–1514. [Google Scholar] [CrossRef]
  137. Andreadis, K.; Newman, D.R.; Twan, C.; Shunk, A.; Mann, D.M.; Stevens, E.R. Mixed methods assessment of the influence of demographics on medical advice of ChatGPT. J. Am. Med. Inform. Assoc. 2024, 31, 2002–2009. [Google Scholar] [CrossRef]
  138. Shukla, I.Y.; Sun, M.Z. Online and ChatGPT-generated patient education materials regarding brain tumor prognosis fail to meet readability standards. J. Clin. Neurosci. 2025, 138, 111410. [Google Scholar] [CrossRef]
  139. Hunter, N.; Allen, D.; Xiao, D.; Cox, M.; Jain, K. Patient education resources for oral mucositis: A google search and ChatGPT analysis. Eur. Arch. Otorhinolaryngol. 2025, 282, 1609–1618. [Google Scholar] [CrossRef]
  140. Yalla, G.R.; Hyman, N.; Hock, L.E.; Zhang, Q.; Shukla, A.G.; Kolomeyer, N.N. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted from Patient Brochures. Cureus 2024, 16, e56766. [Google Scholar] [CrossRef]
  141. Alasker, A.; Alsalamah, S.; Alshathri, N.; Almansour, N.; Alsalamah, F.; Alghafees, M.; AlKhamees, M.; Alsaikhan, B. Performance of large language models (LLMs) in providing prostate cancer information. BMC Urol. 2024, 24, 177. [Google Scholar] [CrossRef]
  142. Chen, D.; Parsa, R.; Hope, A.; Hannon, B.; Mak, E.; Eng, L.; Liu, F.F.; Fallah-Rad, N.; Heesters, A.M.; Raman, S. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions from Social Media. JAMA Oncol. 2024, 10, 956–960. [Google Scholar] [CrossRef] [PubMed]
  143. Zhang, J.; Sun, Y.; Rong, Y.; Li, H.; Jiang, B.; Zhao, C.; Liu, H. Potential of AI Chatbots in Online Hair Transplantation Consultations: A Multi-metric Assessment of Three Models. Aesthetic Plast. Surg. 2025, 49, 6155–6161. [Google Scholar] [CrossRef] [PubMed]
  144. Bragazzi, N.L.; Buchinger, M.; Atwan, H.; Tuma, R.; Chirico, F.; Szarpak, L.; Farah, R.; Khamisy-Farah, R. Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists’ Knowledge on COVID-19’s Impacts in Pregnancy: Cross-Sectional Pilot Study. JMIR Form. Res. 2025, 9, e56126. [Google Scholar] [CrossRef] [PubMed]
  145. Warren, C.J.; Edmonds, V.S.; Payne, N.G.; Voletti, S.; Wu, S.Y.; Colquitt, J.; Sadeghi-Nejad, H.; Punjani, N. Prompt matters: Evaluation of large language model chatbot responses related to Peyronie’s disease. Sex. Med. 2024, 12, qfae055. [Google Scholar] [CrossRef]
  146. Warren, C.J.; Payne, N.G.; Edmonds, V.S.; Voleti, S.S.; Choudry, M.M.; Punjani, N.; Abdul-Muhsin, H.M.; Humphreys, M.R. Quality of Chatbot Information Related to Benign Prostatic Hyperplasia. Prostate 2025, 85, 175–180. [Google Scholar] [CrossRef]
  147. Stapleton, P.; Santucci, J.; Cundy, T.P.; Sathianathen, N. Quality of Information on Wilms Tumor from Artificial Intelligence Chatbots: What Are Your Patients and Their Families Reading? Urology 2025, 198, 130–134. [Google Scholar] [CrossRef]
  148. Boscolo-Rizzo, P.; Marcuzzo, A.V.; Lazzarin, C.; Giudici, F.; Polesel, J.; Stellin, M.; Pettorelli, A.; Spinato, G.; Ottaviano, G.; Ferrari, M.; et al. Quality of Information Provided by Artificial Intelligence Chatbots Surrounding the Reconstructive Surgery for Head and Neck Cancer: A Comparative Analysis Between ChatGPT4 and Claude2. Clin. Otolaryngol. 2025, 50, 330–335. [Google Scholar] [CrossRef]
  149. Aydın, F.O.; Aksoy, B.K.; Ceylan, A.; Akbaş, Y.B.; Ermiş, S.; Kepez Yıldız, B.; Yıldırım, Y. Readability and Appropriateness of Responses Generated by ChatGPT 3.5, ChatGPT 4.0, Gemini, and Microsoft Copilot for FAQs in Refractive Surgery. Turk. J. Ophthalmol. 2024, 54, 313–317. [Google Scholar] [CrossRef]
  150. Musheyev, D.; Pan, A.; Gross, P.; Kamyab, D.; Kaplinsky, P.; Spivak, M.; Bragg, M.A.; Loeb, S.; Kabarriti, A.E. Readability and Information Quality in Cancer Information from a Free vs Paid Chatbot. JAMA Netw. Open 2024, 7, e2422275. [Google Scholar] [CrossRef]
  151. Alsabawi, Y.; Quesada, P.R.; Rouse, D.T. Readability of custom chatbot vs. GPT-4 responses to otolaryngology-related patient questions. Am. J. Otolaryngol. 2025, 46, 104717. [Google Scholar] [CrossRef]
  152. Gawey, L.; Dagenet, C.B.; Tran, K.A.; Park, S.; Hsiao, J.L.; Shi, V. Readability of Information Generated by ChatGPT for Hidradenitis Suppurativa. JMIR Dermatol. 2024, 7, e55204. [Google Scholar] [CrossRef] [PubMed]
  153. Büker, M.; Mercan, G. Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment. Int. J. Med. Inform. 2025, 201, 105948. [Google Scholar] [CrossRef] [PubMed]
  154. Ozduran, E.; Akkoc, I.; Büyükçoban, S.; Erkin, Y.; Hanci, V. Readability, reliability and quality of responses generated by ChatGPT, Gemini, and Perplexity for the most frequently asked questions about pain. Medicine 2025, 104, e41780. [Google Scholar] [CrossRef] [PubMed]
  155. Alamleh, S.; Mavedatnia, D.; Francis, G.; Le, T.; Davies, J.; Lin, V.; Lee, J.J.W. Readability, Reliability, and Quality Analysis of Internet-Based Patient Education Materials and Large Language Models on Meniere’s Disease. J. Otolaryngol. Head Neck Surg. 2025, 54, 19160216251360651. [Google Scholar] [CrossRef]
  156. Şan, H.; Bayrakcı, Ö.; Çağdaş, B.; Serdengeçti, M.; Alagöz, E. Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients. Rev. Esp. Med. Nucl. Imagen. Mol. Engl. Ed. 2024, 43, 500021. [Google Scholar] [CrossRef]
  157. Aydinbelge-Dizdar, N.; Dizdar, K. Evaluation of the reliability and readability of chatbot responses as a patient information resource for the most common PET-CT examinations. Rev. Esp. Med. Nucl. Imagen. Mol. Engl. Ed. 2025, 44, 500065. [Google Scholar] [CrossRef]
  158. Şahin, M.F.; Ateş, H.; Keleş, A.; Özcan, R.; Doğan, Ç.; Akgül, M.; Yazıcı, C.M. Responses of Five Different Artificial Intelligence Chatbots to the Top Searched Queries About Erectile Dysfunction: A Comparative Analysis. J. Med. Syst. 2024, 48, 38. [Google Scholar] [CrossRef]
  159. Yassa, A.; Ayad, O.; Cohen, D.A.; Patel, A.M.; Vengsarkar, V.A.; Hegazin, M.S.; Filimonov, A.; Hsueh, W.D.; Eloy, J.A. Search for medical information for chronic rhinosinusitis through an artificial intelligence ChatBot. Laryngoscope Investig. Otolaryngol. 2024, 9, e70009. [Google Scholar] [CrossRef]
  160. Shin, D.; Tang, T.; Carson, J.; Isaac, R.; Dinh, C.; Im, D.; Fay, A.; Isaac, A.; Cho, S.; Brandt, Z.; et al. Subthalamic nucleus or globus pallidus internus deep brain stimulation for the treatment of parkinson’s disease: An artificial intelligence approach. J. Clin. Neurosci. 2025, 138, 111393. [Google Scholar] [CrossRef]
  161. Anıl, H.; Kayra, M.V. The digital dialogue on premature ejaculation: Evaluating the efficacy of artificial intelligence-driven responses. Int. Urol. Nephrol. 2025, 57, 2829–2836. [Google Scholar] [CrossRef]
  162. Liu, X.; Shi, S.; Zhang, X.; Gao, Q.; Wang, W. The role of ChatGPT-4o in differential diagnosis and management of vertigo-related disorders. Sci. Rep. 2025, 15, 18688. [Google Scholar] [CrossRef] [PubMed]
  163. Taka, T.M.; Collins, C.E.; Miner, A.; Overfield, I.; Shin, D.; Seo, L.; Danisa, O. The role of generative artificial intelligence in deciding fusion treatment of lumbar degeneration: A comparative analysis and narrative review. Eur. Spine J. 2025, 34, 3901–3910. [Google Scholar] [CrossRef] [PubMed]
  164. Arzu, U.; Gencer, B. To Self-Treat or Not to Self-Treat: Evaluating the Diagnostic, Advisory and Referral Effectiveness of ChatGPT Responses to the Most Common Musculoskeletal Disorders. Diagnostics 2025, 15, 1834. [Google Scholar] [CrossRef] [PubMed]
  165. Ayo-Ajibola, O.; Davis, R.J.; Lin, M.E.; Vukkadala, N.; O’Dell, K.; Swanson, M.S.; Johns, M.M., 3rd; Shuman, E.A. TrachGPT: Appraisal of tracheostomy care recommendations from an artificial intelligent Chatbot. Laryngoscope Investig. Otolaryngol. 2024, 9, e1300. [Google Scholar] [CrossRef]
  166. Kerkütlüoğlu, M.; Kaya, E.; Gökmen, R. Trustworthiness, Value, Danger, and Readability of ChatGPT-Generated Responses to Health Questions Related to Pulmonary Arterial Hypertension. Cureus 2024, 16, e71472. [Google Scholar] [CrossRef]
  167. Lee, T.J.; Campbell, D.J.; Patel, S.; Hossain, A.; Radfar, N.; Siddiqui, E.; Gardin, J.M. Unlocking Health Literacy: The Ultimate Guide to Hypertension Education from ChatGPT Versus Google Gemini. Cureus 2024, 16, e59898. [Google Scholar] [CrossRef]
  168. Covington, E.W.; Watts Alexander, C.S.; Sewell, J.; Hutchison, A.M.; Kay, J.; Tocco, L.; Hyte, M. Unlocking the future of patient education: ChatGPT vs. LexiComp® as sources of patient education materials. J. Am. Pharm. Assoc. 2025, 65, 102119. [Google Scholar] [CrossRef]
  169. Steimetz, E.; Minkowitz, J.; Gabutan, E.C.; Ngichabe, J.; Attia, H.; Hershkop, M.; Ozay, F.; Hanna, M.G.; Gupta, R. Use of Artificial Intelligence Chatbots in Interpretation of Pathology Reports. JAMA Netw. Open 2024, 7, e2412767. [Google Scholar] [CrossRef]
  170. Patel, T.A.; Michaelson, G.; Morton, Z.; Harris, A.; Smith, B.; Bourguillon, R.; Wu, E.; Eguia, A.; Maxwell, J.H. Use of ChatGPT for patient education involving HPV-associated oropharyngeal cancer. Am. J. Otolaryngol. 2025, 46, 104642. [Google Scholar] [CrossRef]
  171. Burns, C.; Bakaj, A.; Berishaj, A.; Hristidis, V.; Deak, P.; Equils, O. Use of Generative AI for Improving Health Literacy in Reproductive Health: Case Study. JMIR Form. Res. 2024, 8, e59434. [Google Scholar] [CrossRef] [PubMed]
  172. ELSenbawy, O.M.; Patel, K.B.; Wannakuwatte, R.A.; Thota, A.N. Use of generative large language models for patient education on common surgical conditions: A comparative analysis between ChatGPT and Google Gemini. Updates Surg. 2025, 1–7. [Google Scholar] [CrossRef]
  173. Šuto Pavičić, J.; Marušić, A.; Buljan, I. Using ChatGPT to Improve the Presentation of Plain Language Summaries of Cochrane Systematic Reviews About Oncology Interventions: Cross-Sectional Study. JMIR Cancer 2025, 11, e63347. [Google Scholar] [CrossRef] [PubMed]
  174. Tran, Q.L.; Huynh, P.P.; Le, B.; Jiang, N. Utilization of Artificial Intelligence in the Creation of Patient Information on Laryngology Topics. Laryngoscope 2025, 135, 1295–1300. [Google Scholar] [CrossRef] [PubMed]
  175. Sönmezoğlu, H.İ.; Güner Sönmezoğlu, B.; Temel, M.H.; Çakir, B. Comprehensibility and readability of selected artificial intelligence chatbots in providing uveitis-related information. Medicine 2025, 104, e45135. [Google Scholar] [CrossRef]
  176. Baur, D.; Ansorg, J.; Heyde, C.E.; Voelker, A. Development and Evaluation of a Retrieval-Augmented Generation Chatbot for Orthopedic and Trauma Surgery Patient Education: Mixed-Methods Study. JMIR AI 2025, 4, e75262. [Google Scholar] [CrossRef]
  177. Prabha, S.; Gomez-Cabello, C.A.; Haider, S.A.; Genovese, A.; Trabilsy, M.; Wood, N.G.; Bagaria, S.; Tao, C.; Forte, A.J. Enhancing Clinical Decision Support with Adaptive Iterative Self-Query Retrieval for Retrieval-Augmented Large Language Models. Bioengineering 2025, 12, 895. [Google Scholar] [CrossRef]
  178. Cross, J.L.; Choma, M.A.; Onofrey, J.A. Bias in medical AI: Implications for clinical decision-making. PLoS Digit. Health 2024, 3, e0000651. [Google Scholar] [CrossRef]
  179. Alli, S.R.; Hossain, S.Q.; Das, S.; Upshur, R. The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education. JMIR Med. Educ. 2024, 10, e51446. [Google Scholar] [CrossRef]
  180. Gomez-Cabello, C.A.; Prabha, S.; Haider, S.A.; Genovese, A.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Forte, A.J. Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support. Bioengineering 2025, 12, 1194. [Google Scholar] [CrossRef]
  181. Abo El-Enen, M.; Saad, S.; Nazmy, T. A survey on retrieval-augmentation generation (RAG) models for healthcare applications. Neural Comput. Appl. 2025, 37, 28191–28267. [Google Scholar] [CrossRef]
  182. Wada, A.; Tanaka, Y.; Nishizawa, M.; Yamamoto, A.; Akashi, T.; Hagiwara, A.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; et al. Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation. npj Digit. Med. 2025, 8, 395. [Google Scholar] [CrossRef] [PubMed]
  183. Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631. [Google Scholar] [CrossRef] [PubMed]
  184. Weiss, B.D. Health Literacy and Patient Safety: Help Patients Understand. Manual for Clinicians, 2nd ed.; American Medical Association Foundation and American Medical Association: Chicago, IL, USA, 2007. [Google Scholar]
  185. US Department of Health and Human Services; Office of Disease Prevention and Health Promotion. National Action Plan to Improve Health Literacy; US Department of Health and Human Services: Washington, DC, USA, 2010.
  186. DeTemple, D.E.; Meine, T.C. Comparison of the readability of ChatGPT and Bard in medical communication: A meta-analysis. BMC Med. Inform. Decis. Mak. 2025, 25, 325. [Google Scholar] [CrossRef] [PubMed]
  187. Moons, P.; Van Bulck, L. Using ChatGPT and Google Bard to improve the readability of written patient information: A proof of concept. Eur. J. Cardiovasc. Nurs. 2024, 23, 122–126. [Google Scholar] [CrossRef]
  188. Andrew, A. Accuracy of ChatGPT in answering cardiology board-style questions. J. Educ. Eval. Health Prof. 2025, 22, 9. [Google Scholar] [CrossRef]
  189. Uchmanowicz, I.; Jędrzejczyk, M.; Vellone, E.; Janczak, S.; Mirkowski, K.; Uchmanowicz, B.M.; Czapla, M. ChatGPT in cardiovascular medicine: Revolution, hype, or helper? Front. Public Health 2025, 13, 1622561. [Google Scholar] [CrossRef]
  190. Harskamp, R.E.; De Clercq, L. Performance of ChatGPT as an AI-assisted decision support tool in medicine: A proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2). Acta Cardiol. 2024, 79, 358–366. [Google Scholar] [CrossRef]
  191. Lautrup, A.D.; Hyrup, T.; Schneider-Kamp, A.; Dahl, M.; Lindholt, J.S.; Schneider-Kamp, P. Heart-to-heart with ChatGPT: The impact of patients consulting AI for cardiovascular health advice. Open Heart 2023, 10, e002455. [Google Scholar] [CrossRef]
  192. Meyer, A.; Riese, J.; Streichert, T. Comparison of the Performance of GPT-3.5 and GPT-4 with That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med. Educ. 2024, 10, e50965. [Google Scholar] [CrossRef]
  193. Lahat, A.; Sharif, K.; Zoabi, N.; Shneor Patt, Y.; Sharif, Y.; Fisher, L.; Shani, U.; Arow, M.; Levin, R.; Klang, E. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J. Med. Internet Res. 2024, 26, e54571. [Google Scholar] [CrossRef]
  194. Bolliger, L.S.; Haller, P.; Cretton, I.C.R.; Reich, D.R.; Kew, T.; Jäger, L.A. EMTeC: A corpus of eye movements on machine-generated texts. Behav. Res. Methods 2025, 57, 189. [Google Scholar] [CrossRef]
  195. James, A.; Trovati, M.; Bolton, S. Retrieval-Augmented Generation to Generate Knowledge Assets and Creation of Action Drivers. Appl. Sci. 2025, 15, 6247. [Google Scholar] [CrossRef]
  196. Nastoska, A.; Jancheska, B.; Rizinski, M.; Trajanov, D. Evaluating Trustworthiness in AI: Risks, Metrics, and Applications Across Industries. Electronics 2025, 14, 2717.
  197. Novelo, R.; Silva, R.R.; Bernardino, J. A Literature Review of Personalized Large Language Models for Email Generation and Automation. Future Internet 2025, 17, 536.
  198. Di Martino, F.; Delmastro, F. Explainable AI for clinical and remote health applications: A survey on tabular and time series data. Artif. Intell. Rev. 2023, 56, 5261–5315.
  199. Wagner, N.; Kraus, M.; Minker, W.; Griol, D.; Callejas, Z. A Survey on Multi-User Conversational Interfaces. Appl. Sci. 2025, 15, 7267.
  200. Lai, X.; Lai, Y.; Chen, J.; Huang, S.; Gao, Q.; Huang, C. Evaluation Strategies for Large Language Model-Based Models in Exercise and Health Coaching: Scoping Review. J. Med. Internet Res. 2025, 27, e79217.
  201. Lv, X.; Zhang, X.; Li, Y.; Ding, X.; Lai, H.; Shi, J. Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content. J. Med. Internet Res. 2024, 26, e55847.
  202. Singh, S.U.; Namin, A.S. A survey on chatbots and large language models: Testing and evaluation techniques. Nat. Lang. Process. J. 2025, 10, 100128.
  203. Dahlgren Lindström, A.; Methnani, L.; Krause, L.; Ericson, P.; de Rituerto de Troya, Í.M.; Coelho Mollo, D.; Dobbe, R. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback. Ethics Inf. Technol. 2025, 27, 28.
  204. Shao, Y.; Yang, X.; Chen, Q.; Guo, H.; Duan, X.; Xu, X.; Yue, J.; Zhang, Z.; Zhao, S.; Zhang, S. Determinants of digital health literacy among older adult patients with chronic diseases: A qualitative study. Front. Public Health 2025, 13, 1568043.
  205. Zolfaghari, Z.; Karimian, Z.; Zarifsanaiey, N.; Farahmandi, A.Y. Navigating challenges in medical English learning: Leveraging technology and gamification for interactive education—A qualitative study. BMC Med. Educ. 2025, 25, 1045.
  206. Khojasteh, L.; Kafipour, R.; Pakdel, F.; Mukundan, J. Empowering medical students with AI writing co-pilots: Design and validation of AI self-assessment toolkit. BMC Med. Educ. 2025, 25, 159.
  207. Ahmed, A.; Leroy, G.; Kauchak, D.; Barai, P.; Harber, P.; Rains, S. Parallel Corpus Analysis of Text and Audio Comprehension to Evaluate Readability Formula Effectiveness: Quantitative Analysis. J. Med. Internet Res. 2025, 27, e69772.
  208. Joseph, S.; Bhardwaj, A.; Skariah, J.; Aggarwal, I.; Shah, V.; Harris, R.A. Effects of education level on natural language processing in cardiovascular health communication. Front. Public Health 2025, 13, 1688173.
  209. Gao, Y.; Xu, Q.; Zhang, O.; Wang, H.; Wang, Y.; Wang, J.; Chen, X. Large language models: Unlocking new potential in patient education for thyroid eye disease. Endocrine 2025, 90, 689–698.
  210. Zhang, Z.; Zhang, H.; Pan, Z.; Bi, Z.; Wan, Y.; Song, X.; Fan, X. Evaluating Large Language Models in Ophthalmology: Systematic Review. J. Med. Internet Res. 2025, 27, e76947.
  211. Zhang, J.; Song, X.; Tian, B.; Tian, M.; Zhang, Z.; Wang, J.; Fan, T. Large language models in the management of chronic ocular diseases: A scoping review. Front. Cell Dev. Biol. 2025, 13, 1608988.
  212. Betzler, B.K.; Chen, H.; Cheng, C.Y.; Lee, C.S.; Ning, G.; Song, S.J.; Lee, A.Y.; Kawasaki, R.; van Wijngaarden, P.; Grzybowski, A.; et al. Large language models and their impact in ophthalmology. Lancet Digit. Health 2023, 5, e917–e924.
  213. Bacco, L.; Russo, F.; Ambrosio, L.; D’Antoni, F.; Vollero, L.; Vadalà, G.; Dell’Orletta, F.; Merone, M.; Papalia, R.; Denaro, V. Natural language processing in low back pain and spine diseases: A systematic review. Front. Surg. 2022, 9, 957085.
  214. Shah, R.; Schwab, J.H. Large Language Models in Spine Surgery: A Promising Technology. HSS J. 2025, 21, 15563316251340696.
  215. Croxford, E.; Gao, Y.; First, E.; Pellegrino, N.; Schnier, M.; Caskey, J.; Oguss, M.; Wills, G.; Chen, G.; Dligach, D.; et al. Evaluating clinical AI summaries with large language models as judges. npj Digit. Med. 2025, 8, 640.
  216. Alshammari, A.F.; Madfa, A.A.; Anazi, B.A.; Alenezi, Y.E.; Alkurdi, K.A. Comparison of accuracy and consistency of AI language models when answering standardised dental MCQs. BMC Med. Educ. 2025, 25, 1507.
  217. Martos, M.; Fields, B.; Finlayson, S.G.; Hartell, N.; Kim, T.; Larimer, E.; Lau, J.J.; Lin, Y.H.; Salaguinto, T.; Tran, N.; et al. Accuracy of Artificial Intelligence vs Professionally Translated Discharge Instructions. JAMA Netw. Open 2025, 8, e2532312.
  218. Lee, C.; Britto, S.; Diwan, K. Evaluating the Impact of Artificial Intelligence (AI) on Clinical Documentation Efficiency and Accuracy Across Clinical Settings: A Scoping Review. Cureus 2024, 16, e73994.
Figure 1. PRISMA Flowchart for this Review.
Figure 2. Global distribution of the studies included in the review.
Figure 3. Medical fields covered by chatbot readability studies, grouped by topic category.
Figure 4. Comparative Heatmap of 14 Readability Metrics Across 21 AI Chatbots.
Figure 5. Top 20 Most Cited Publications Included in the Review [36,38,47,51,69,71,74,96,98,103,108,110,121,124,127,128,131,134,142,169].
Table 1. Geographical distribution of studies included in the review (n = 140).
| Country | Count |
| --- | --- |
| USA | 60 |
| Turkey | 34 |
| China | 6 |
| India | 6 |
| Australia | 5 |
| Canada | 5 |
| Germany | 3 |
| Denmark | 2 |
| Ireland | 2 |
| Italy | 2 |
| Belgium | 1 |
| Brazil | 1 |
| Croatia | 1 |
| Egypt | 1 |
| Netherlands | 1 |
| Poland | 1 |
| Saudi Arabia | 1 |
| Singapore | 1 |
| South Korea | 1 |
| Spain | 1 |
| United Kingdom | 1 |
Table 2. Chatbots analyzed across the included publications and their frequency of occurrence.
| Chatbot | Count |
| --- | --- |
| ChatGPT-4 / GPT-4o | 94 |
| ChatGPT-3.5 | 83 |
| Google Bard / Gemini | 52 |
| Microsoft Copilot / Microsoft Copilot Pro / Bing AI | 39 |
| Perplexity AI / Perplexity Pro | 26 |
| Claude 2.0 / Claude 3.5 / Claude Sonnet | 12 |
| Meta AI Assistant | 4 |
| ChatSonic 1.0.2 | 3 |
| DeepSeek-V3 | 2 |
| DocsGPT 0.15.0 | 2 |
| DeepSeek-R1 | 2 |
| Open Evidence 2.0 | 1 |
| ChatSpot Alpha | 1 |
| DeepSeek-R1 | 1 |
| Ernie Bot 4.0 | 1 |
| LLaMA 3.1 | 1 |
| Llama 3.1 Large | 1 |
| MediSearch Version 1.5.10 | 1 |
| Pi AI 1.0.53 | 1 |
| Vello | 1 |
| Vello Pro | 1 |
Table 3. Readability indices used in the included studies and frequency of their application.
| Readability Scale | Count |
| --- | --- |
| Flesch–Kincaid Grade Level | 117 |
| Flesch Reading Ease Score | 95 |
| Gunning Fog Index | 41 |
| Simple Measure of Gobbledygook | 39 |
| Coleman–Liau Index | 22 |
| Automated Readability Index | 14 |
| FORCAST | 4 |
| Dale–Chall Readability | 3 |
| Fry Readability Graph | 2 |
| Fry Readability Score | 2 |
| Läsbarhetsindex | 2 |
| Linsear Write | 2 |
| Raygor Readability Estimate | 2 |
| Lix Readability Index | 1 |
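The two indices applied most often in the included studies, the Flesch–Kincaid Grade Level and the Flesch Reading Ease score, are simple closed-form functions of mean sentence length and mean syllables per word. A minimal Python sketch (using a crude vowel-group syllable heuristic; production tools typically rely on dictionary-based syllabification) illustrates how such scores are computed:

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count vowel groups, drop a silent final 'e'.
    Dictionary-based syllabifiers (e.g. CMUdict) are more accurate."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

On these standard formulas, an FRE above 60 is generally regarded as plain language, while an FKGL near 8 matches the 8th-grade target for patient-facing materials discussed in this review.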
Table 4. Readability scores of medical texts generated by chatbots.
| Chatbot | Flesch Reading Ease | Flesch–Kincaid Grade Level | Gunning Fog Index | SMOG Index | Coleman–Liau Index | Automated Readability Index | Linsear Write | Dale–Chall Score | FORCAST | Fry Graph | Fry Readability Score | Läsbarhetsindex | Lix Readability Index | Raygor Estimate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-4 | 37.55 ± 17.76 | 13.85 ± 8.10 | 14.49 ± 3.60 | 12.94 ± 2.74 | 14.61 ± 2.91 | 11.67 ± 2.38 | 9.61 ± 2.33 | 9.90 | 12.60 ± 0.42 | 13.55 ± 0.64 | 9.50 ± 0.71 | 36.49 ± 38.91 | 72.00 | 13.80 ± 0.28 |
| ChatGPT-3.5 | 35.16 ± 13.59 | 15.45 ± 8.78 | 15.57 ± 3.26 | 13.11 ± 1.92 | 15.43 ± 2.16 | 14.06 ± 1.62 | 13.95 ± 1.81 | 10.25 ± 0.35 | 12.48 ± 0.12 | | | | | |
| Microsoft Copilot | 35.66 ± 12.01 | 13.66 ± 8.02 | 14.57 ± 2.94 | 13.64 ± 2.87 | 14.25 ± 2.38 | 11.95 ± 2.20 | 11.90 ± 1.27 | 10.30 | 12.30 | | | | | |
| Google Gemini | 39.61 ± 14.73 | 13.14 ± 8.31 | 14.29 ± 4.13 | 12.65 ± 2.41 | 13.33 ± 2.66 | 11.23 ± 2.45 | 11.71 ± 2.39 | 11.60 | 11.21 ± 1.41 | | | | | |
| Perplexity | 31.31 ± 11.27 | 19.62 ± 13.51 | 16.58 ± 2.63 | 14.02 ± 2.40 | 14.68 ± 2.07 | 14.06 ± 3.14 | 14.76 ± 5.11 | | | | | | | |
| Meta AI | 28.38 ± 21.83 | 11.97 ± 1.79 | 11.60 | 12.40 | 19.10 | 13.50 | | 13.80 | | | | | | |
| Claude | 40.11 ± 21.18 | 11.22 ± 2.87 | 10.31 | | 10.31 | | | | | | | | | |
| Pi AI | 16.30 | 15.90 | 20.00 | | 11.90 | | | | | | | | | |
| DeepSeek-V3 | 53.35 ± 7.00 | 8.45 ± 0.35 | | 16.40 | 15.10 | | | | | | | | | |
| ChatSpot | 23.10 | 15.00 | 18.20 | | 11.30 | | | | | | | | | |
| DeepSeek | 76.43 | | 12.26 | 15.40 | | | | | | | | | | |
| DocsGPT | 72.00 | 9.75 ± 5.73 | | 12.10 | | | | | | | | | | |
| Llama 3.1 Large | 20.10 | 24.10 | | | | | | | | | | | | |
| Llama 3.1 | 23.70 | 34.20 | | | | | | | | | | | | |
| Ernie Bot 4.0 | 37.50 | 12.90 | | | | | | | | | | | | |
| DeepSeek-R1 | 61.40 | 7.20 | | | | | | | | | | | | |
| MediSearch | | 18.30 | | | | | | | | | | | | |
| ChatSonic | | 21.65 ± 16.77 | | | | | | | | | | | | |
| Open Evidence | | 17.09 ± 0.56 | | | | | | | | | | | | |
| Vello | | 29.00 | | | | | | | | | | | | |
| Vello Pro | | 17.40 | | | | | | | | | | | | |
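Most of the mean Flesch Reading Ease scores reported for the major chatbots fall below 50, which places them in the "difficult" band of the conventional Flesch interpretation scale (the band labels below follow that standard convention, not a scheme defined by this review). A short helper makes the mapping explicit:

```python
def fre_band(score):
    """Map a Flesch Reading Ease score to its conventional difficulty band."""
    bands = [
        (90.0, "very easy"),
        (80.0, "easy"),
        (70.0, "fairly easy"),
        (60.0, "plain English"),
        (50.0, "fairly difficult"),
        (30.0, "difficult"),
    ]
    for cutoff, label in bands:
        if score >= cutoff:
            return label
    return "very difficult"
```

Applied to the mean scores above, ChatGPT-4 (37.55) maps to "difficult" while DeepSeek (76.43) maps to "fairly easy", mirroring the finding that the DeepSeek models produced the most accessible text.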
Olszewski, R.; Brzeziński, J.; Watros, K.; Rysz, J. Quantifying Readability in Chatbot-Generated Medical Texts Using Classical Linguistic Indices: A Review. Appl. Sci. 2026, 16, 1423. https://doi.org/10.3390/app16031423