1. Introduction
The Internet has become an indispensable source of health-related information for patients worldwide, profoundly influencing their understanding of diseases and treatment decisions [1]. Although online access to medical content offers convenience and broad reach, the quality, reliability, and readability of this information remain highly variable [2]. This issue is especially concerning for conditions such as sarcoidosis, a heterogeneous and often misunderstood multisystem granulomatous disease, in which inaccurate information may contribute to diagnostic delays, increased anxiety, and inappropriate health-seeking behaviors.
In recent years, generative large language models (LLMs), such as ChatGPT and Gemini, have emerged as accessible tools capable of generating human-like responses to a wide array of health-related queries. Their conversational nature and real-time availability have positioned them as promising adjuncts for patient education, especially in contexts where professional consultation is delayed or unavailable. However, despite their growing popularity, concerns persist regarding clinical accuracy, health literacy adaptation, and practical actionability of their outputs.
Sarcoidosis poses a unique challenge to both clinicians and patients. Its protean manifestations across multiple organs often require detailed and nuanced explanations, particularly regarding pulmonary involvement, diagnostic ambiguity, and long-term immunosuppressive treatment. Previous studies have documented the poor quality and limited reliability of sarcoidosis-related content available on public platforms such as YouTube [3]. However, the performance of artificial intelligence (AI) chatbots in delivering trustworthy and patient-relevant information about sarcoidosis remains underexplored [4].
To address this gap, this study evaluated the quality, readability, and actionability of AI chatbot-generated responses to commonly searched sarcoidosis-related queries. By employing standardized metrics and expert assessments, we aimed to determine whether these tools could meaningfully support patient understanding in complex disease contexts.
3. Results
3.1. Inter-Rater Reliability Analysis
The inter-rater reliability for each of the four sarcoidosis-related queries was assessed using intraclass correlation coefficients (ICCs) across three evaluation domains: DISCERN, PEMAT-P, and WRR. All ICC values exceeded the commonly accepted threshold of 0.80, indicating excellent agreement among the four pulmonology experts. The highest consistency was observed for the first query, with ICCs of 0.91 (DISCERN), 0.89 (PEMAT-P), and 0.93 (WRR). Even the lowest value, 0.80 for PEMAT-P in the fourth query, still reflected strong agreement.
For the DISCERN tool, which assesses the reliability and quality of medical information, ICC values ranged from 0.82 (Q4) to 0.88 (Q2), demonstrating stable consensus in rating informational content. PEMAT-P, which evaluates understandability and actionability, showed ICCs between 0.80 (Q4) and 0.89 (Q1), again reflecting strong evaluator alignment. The highest ICC values were observed in the WRR domain, which integrates multiple aspects of quality and relevance; ICCs for WRR ranged from 0.82 (Q4) to 0.93 (Q1), indicating near-perfect agreement for foundational clinical content.
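To make the agreement analysis concrete, the following is a minimal sketch of how ICCs of this kind can be computed when ratings are stored in long format. The library choice (pingouin), the two-way random-effects form (ICC2k), and all column names and example values are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of an ICC computation for panel-based ratings, assuming
# scores are stored in long format (one row per rater x response).
import pandas as pd
import pingouin as pg

# Hypothetical example: four raters scoring the same four chatbot responses on DISCERN.
scores = pd.DataFrame({
    "response": ["r1", "r2", "r3", "r4"] * 4,
    "rater":    ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4,
    "discern":  [68, 55, 42, 61,  70, 53, 40, 63,  66, 57, 44, 60,  69, 54, 41, 62],
})

icc = pg.intraclass_corr(data=scores, targets="response",
                         raters="rater", ratings="discern")
# ICC2k (average of k raters, two-way random effects, absolute agreement) is one
# common choice for rater panels; values above 0.80 are conventionally read as excellent.
print(icc.set_index("Type").loc["ICC2k", ["ICC", "CI95%"]])
```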
3.2. Individual Chatbot Performance
Substantial differences were observed among the 11 AI chatbots across all evaluated domains, including reliability, understandability, and actionability. ChatGPT-4o Deep Research achieved the strongest overall performance, with a mean DISCERN score of 70.06, a PEMAT-P score of 77.75, and a WRR score of 70.78. These findings reflect a high standard of medical reliability accompanied by content that is both actionable and sufficiently understandable to support patient education. Perplexity Research demonstrated similarly robust performance, with mean scores of 68.75 (DISCERN), 76.38 (PEMAT-P), and 82.03 (WRR), ranking first in the WRR domain. These results underscore the utility of retrieval-augmented LLMs for producing comprehensive, structured, and contextually appropriate health information.
Grok3 Deep Search also performed well across all domains, with a mean DISCERN score of 63.56, a PEMAT-P score of 82.25, and a WRR score of 70.63. Notably, its PEMAT-P score was among the highest, indicating that its responses were not only informative but also presented in a format that facilitated patient comprehension and actionability.
In contrast, general-purpose models such as Copilot and ChatGPT-4 showed lower overall performance. Copilot had the lowest DISCERN (35.00), PEMAT-P (32.50), and WRR (6.32) scores, reflecting substantial deficiencies in the clarity, reliability, and usefulness of its responses. ChatGPT-4 demonstrated moderate performance, with an average DISCERN score of 48.75 and PEMAT-P and WRR scores of 44.25 and 8.94, respectively, indicating some improvement over Copilot but falling short of models enhanced with research or search capabilities.
ChatGPT-4o, DeepSeek, and Gemini2 Flash occupied an intermediate performance tier. ChatGPT-4o displayed modest reliability (mean DISCERN = 57.19), with mean PEMAT-P and WRR scores of 51.75 and 6.27, respectively. DeepSeek and Gemini2 Flash showed comparable trends, with balanced but unremarkable performance across domains. Although some of these models generated more readable content, they were often deficient in structural clarity, depth, or action-oriented guidance (Table 2).
Importantly, DeepSeek Deep Think, while not leading in any single metric, maintained high consistency across all domains, with mean scores of 64.50 (DISCERN), 69.25 (PEMAT-P), and 60.70 (WRR), positioning it as one of the most balanced performers.
Figure 2 illustrates chatbot performance in terms of readability, reliability, actionability, overall quality, and response length for each query.
3.3. Comparative Analysis Based on Chatbot Categories
To examine the effect of chatbot architecture on performance, the models were grouped into general-purpose conversational agents (ChatGPT-4, ChatGPT-4o, Gemini2 Flash, Copilot, and Grok3) and search-augmented or research-enhanced tools (ChatGPT-4o Deep Research, DeepSeek, DeepSeek Deep Think, Grok3 Deep Search, Perplexity, and Perplexity Research).
The search-augmented models outperformed their general-purpose counterparts across all key domains, achieving higher mean scores for DISCERN (66.3 vs. 45.9), PEMAT-P (69.4 vs. 48.1), and WRR (66.3 vs. 8.8), indicating superior reliability, actionability, and structural clarity.
However, general-purpose chatbots demonstrated better readability: their mean Flesch Reading Ease score was higher (35.6 vs. 13.7) and their Flesch–Kincaid Grade Level was lower (11.9 vs. 16.9), suggesting more accessible language. Word counts also differed substantially, with search-augmented models producing far longer responses (mean = 1601.6 words) than general-purpose models (mean = 221.8 words) (Table 2).
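For reference, the two readability indices reported here follow the standard published Flesch formulas. The sketch below computes both from raw text; the crude vowel-group syllable heuristic is an illustrative assumption and not the tool actually used in this study.

```python
# Illustrative computation of Flesch Reading Ease and Flesch-Kincaid Grade Level.
import re

def count_syllables(word: str) -> int:
    # Approximate: count groups of consecutive vowels, minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

ease, grade = readability("Sarcoidosis is an inflammatory disease. "
                          "It can affect the lungs and other organs.")
print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid Grade Level: {grade:.1f}")
```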
4. Discussion
This study presents a systematic evaluation of 11 AI chatbots in terms of their capacity to provide reliable, understandable, and actionable responses to four frequently searched questions about sarcoidosis. Using validated tools, namely DISCERN, PEMAT-P, and the Flesch–Kincaid readability metrics, together with a composite WRR, the quality and usability of chatbot-generated content were assessed across multiple domains. The findings revealed substantial variation in chatbot performance. Models incorporating retrieval-augmented or research-enhanced architectures, such as ChatGPT-4o Deep Research, Perplexity Research, and Grok3 Deep Search, achieved the highest median scores across DISCERN, PEMAT-P, and WRR. These chatbots produced detailed, structured, and evidence-consistent responses, indicating their potential value in patient education for complex conditions such as sarcoidosis. Conversely, the general-purpose conversational agents, ChatGPT-4, ChatGPT-4o, Copilot, and Grok3, exhibited lower reliability and limited actionability. Although these models often generated more readable content, as reflected by higher Flesch Reading Ease scores and lower Grade Level estimates, their outputs were typically less comprehensive and lacked clear guidance for patients. Copilot consistently underperformed across all dimensions (Table 3).
An important strength of this study was the excellent inter-rater reliability observed across all evaluated domains. The high ICC values (>0.80) obtained for the DISCERN, PEMAT-P, and WRR scoring demonstrated consistent agreement among the four blinded pulmonologists, despite the subjective nature of the qualitative assessments. This high level of inter-rater agreement supports the reliability of the evaluation framework and provides further assurance regarding the consistency of the findings. Particularly noteworthy is the high agreement, even for more nuanced and complex chatbot responses, indicating that subtle quality differences were reliably detected by the domain experts.
A key finding of this study is the inverse relationship between content quality and readability. Retrieval-augmented models delivered responses with superior structure and medical reliability, but at advanced reading levels (mean FKGL > 16), potentially limiting accessibility for individuals with lower health literacy. Conversely, the more readable outputs of general-purpose models often lacked informational depth or actionable content. This trade-off highlights a critical barrier in chatbot-mediated health communication: achieving both accuracy and accessibility remains challenging, especially for rare and complex diseases such as sarcoidosis.
These observations are consistent with prior studies assessing the performance and limitations of large language models in healthcare communication. Multiple investigations have demonstrated that AI-generated content, particularly that produced by ChatGPT, tends to score highly in reliability and structural coherence yet often falls short in accessibility because of elevated reading complexity. For instance, ChatGPT-generated responses to appendicitis-related queries were found to be more reliable than those of competing models but were consistently written at a higher reading level, reducing their comprehensibility to the general public [11]. A similar pattern was reported in the context of Achilles tendon rupture, where GPT-4-based content achieved strong DISCERN scores yet remained linguistically dense and better suited to college-educated readers [6].
These studies reinforce the notion that high-quality informational content generated by LLMs frequently exceeds the average health literacy of the target patient population. Comparable findings have emerged in specialty domains such as spine surgery: in an evaluation of GPT-4 responses to common patient questions in that field, the model again surpassed traditional online sources in reliability, yet readability scores indicated a substantial barrier to understanding for lay users. The mean Flesch–Kincaid Grade Level of the GPT-4 responses approached 13, implying that a typical high school graduate would likely struggle to fully comprehend the text without additional support [12].
This persistent trade-off between informational depth and linguistic simplicity remains a critical challenge in AI-driven patient education. Recent studies suggest that such limitations can be mitigated through prompt optimization. When GPT-4 was explicitly instructed to produce layperson-friendly summaries, it demonstrated a marked improvement in readability while maintaining moderate scientific fidelity. This implies that model behavior can be partially steered through carefully constructed prompts, offering a pathway toward more accessible AI-generated health content without compromising accuracy or coherence.
While improving readability is essential, the actionability of chatbot responses, that is, their capacity to offer practical, patient-oriented recommendations, remains another under-addressed area. Our observation that even high-performing chatbots often fall short of delivering clear, actionable guidance aligns with previous systematic reviews in adjacent domains. For example, while many health communication tools are designed to be understandable and well structured, relatively few provide patients with clear, personalized actions that can be readily implemented in real-life decision making [10].
Similarly, an analysis of online resources for dementia caregivers found that most materials lacked user-centered co-design and were written at reading levels exceeding what would be considered accessible, thus diminishing their real-world applicability despite being technically accurate [13].
These findings collectively suggest that understandability alone does not guarantee clinical usefulness; content must also be actionable and tailored to users' needs. It should also be noted that tools such as DISCERN and PEMAT-P, although validated for traditional health education materials, may not fully accommodate the evolving structural and interactive nature of AI-generated responses; their application to chatbot outputs should therefore be interpreted cautiously.
Together, these studies illustrate a broader structural limitation of the current generation of AI chatbots: the difficulty of simultaneously optimizing reliability, readability, and actionability. Most models still require a trade-off, and prioritizing one dimension tends to come at the expense of another. In sarcoidosis, a condition that requires nuanced explanations of organ-specific manifestations, diagnostic uncertainty, and long-term management, such imbalances may be particularly consequential. The complexity inherent to sarcoidosis magnifies the importance of delivering content that is not only factually correct and linguistically accessible but also organized in a way that supports informed decision making and shared care planning.
These findings underscore the need for further innovation in AI chatbot design. One promising avenue involves tailoring model outputs through advanced prompt engineering that dynamically adapts responses to user literacy levels and preferences. Additionally, the development of domain-specific, guideline-informed AI models, particularly for rare or complex diseases, may be necessary to ensure both accuracy and contextual relevance. Such specialized models could integrate structured medical knowledge bases and clinical pathways to produce responses that are not only trustworthy but also actionable and appropriate for varying user profiles.
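As one purely illustrative way to operationalize such prompt constraints, the sketch below issues a readability- and actionability-targeted system prompt through the OpenAI Python SDK; the model name, prompt wording, and target grade level are assumptions for illustration, not the configuration evaluated in this study.

```python
# Illustrative sketch of prompt engineering that constrains reading level and
# requires actionable guidance. Assumes the OpenAI Python SDK (v1.x) and an
# OPENAI_API_KEY in the environment; all choices here are hypothetical.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a patient education assistant. Answer at roughly a 6th-grade "
    "reading level (short sentences, common words), and end with 2-3 concrete "
    "actions the patient can take. Do not omit medically important caveats."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is pulmonary sarcoidosis and how is it treated?"},
    ],
)
print(response.choices[0].message.content)
```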
The unique clinical characteristics of sarcoidosis pose additional challenges for AI-powered chatbot systems. Owing to its multisystemic nature, educational content must encompass a broad range of organ-specific manifestations, including pulmonary, dermatologic, ocular, cardiac, and neurological involvement. However, current AI models often struggle to address organ-specific presentations and their respective clinical priorities in a sufficiently integrated and balanced manner. Furthermore, the clinical management of sarcoidosis frequently involves “watchful waiting” strategies, individualized treatment plans, and long-term follow-up decisions. In this setting, AI-generated prognostic guidance tends to rely on generalized patterns rather than evidence-based guidelines, which may leave patients with either excessive uncertainty or a false sense of certainty.
In diseases such as sarcoidosis, in which the distinction between active inflammation and fibrotic progression is clinically critical, AI-generated responses that fail to adequately differentiate between these two states may result in misleading guidance. Additionally, because AI models are often trained on datasets lacking a sufficient volume of high-quality, expert-informed, and up-to-date content, they may exhibit significant informational gaps and reduced source reliability. Responses pertaining to nonsteroidal treatment options, immunosuppressive agents, biologics, or imaging strategies used in follow-up are often outdated or inconsistent with current clinical guidelines. Of particular note is the near-complete absence of any reference to antifibrotic therapies in AI-generated outputs, despite their increasing relevance in advanced pulmonary sarcoidosis.
These limitations underscore the need for AI-based health communication tools developed for sarcoidosis to be tailored not only for factual medical accuracy but also for contextual awareness, multidisciplinary scope, and sensitivity to patient-specific needs. Purpose-built, domain-specific AI models integrated with structured clinical knowledge bases and current guidelines may provide a more robust and clinically meaningful framework for supporting patients with rare and complex diseases, such as sarcoidosis.
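The following is a minimal, hypothetical sketch of that idea: retrieving curated guideline excerpts relevant to a patient question and constraining the model's answer to them. The snippet store, keyword-overlap scoring, and prompt wording are illustrative assumptions only; a production system would rely on vetted guideline databases and semantic retrieval.

```python
# Sketch of a guideline-grounded prompting step: pick the most relevant
# curated excerpts for a question and build a prompt restricted to them.
from collections import Counter

# Hypothetical curated knowledge base (in practice: vetted guideline excerpts).
GUIDELINE_SNIPPETS = [
    "Asymptomatic, non-progressive pulmonary sarcoidosis is often managed with observation.",
    "Oral corticosteroids are the usual first-line therapy for symptomatic pulmonary disease.",
    "Antifibrotic therapy may be considered in progressive fibrotic pulmonary sarcoidosis.",
    "Cardiac and neurological involvement warrant prompt specialist referral.",
]

def score(question: str, snippet: str) -> int:
    # Naive keyword-overlap score; a real system would use embeddings.
    q = Counter(question.lower().split())
    s = Counter(snippet.lower().split())
    return sum((q & s).values())

def build_grounded_prompt(question: str, top_k: int = 2) -> str:
    ranked = sorted(GUIDELINE_SNIPPETS, key=lambda s: score(question, s), reverse=True)
    context = "\n".join(f"- {s}" for s in ranked[:top_k])
    return (
        "Answer the patient's question using ONLY the guideline excerpts below, "
        "in plain language, and state when the excerpts do not cover the question.\n"
        f"Guideline excerpts:\n{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("Do I need treatment for pulmonary sarcoidosis?"))
```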
This study has several limitations. First, chatbot responses were evaluated only in English, limiting generalizability to non-English-speaking populations. Second, only four fixed queries were used, which may not capture the full variability of user inputs. Third, the assessments were based on expert reviews rather than real patient interactions, potentially overlooking user experience factors. Fourth, while the WRR provides a useful composite measure, it lacks external validation and includes subjective components. Moreover, although DISCERN and PEMAT-P are validated instruments, they were originally developed to evaluate traditional patient education materials and may not fully accommodate the unique structural and generative features of AI-produced content; this potential misalignment should be considered when interpreting the results. Finally, the limited availability of high-quality, structured sarcoidosis content on the open web may restrict the representativeness of the data on which chatbots are trained. Unlike more prevalent diseases, sarcoidosis is underrepresented in digital health sources, which may have influenced the quality of AI-generated responses; future studies should characterize the origin, quantity, and classification of online sarcoidosis content to better delineate these informational gaps.
Future studies should explore user-centered evaluations involving diverse patient populations with varying health literacy levels. Prompt optimization strategies should be tested to enhance actionability and personalization. Additionally, developing domain-specific, guideline-aligned AI models may improve the reliability and relevance of chatbot responses, particularly for complex or rare conditions, such as sarcoidosis. Longitudinal assessments and real-world implementation studies are warranted to evaluate their clinical impact.