Assessing AI-Generated Autism Information for Healthcare Use: A Cross-Linguistic and Cross-Geographic Evaluation of ChatGPT, Gemini, and Copilot
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper is a well-structured, methodologically sound study that advances the emerging literature on the reliability and usability of AI-generated health information. The paper’s cross-linguistic and cross-geographic design introduces a novel comparative dimension, which is commendable.
Introduction (Theoretical Framing):
Although the introduction is thorough, it tends to be more descriptive. The paper would gain from a stronger theoretical framework that connects AI communication reliability to health literacy, digital trust, or information asymmetry theories. The authors could incorporate models such as eHealth Literacy (Norman & Skinner, 2006) or Cognitive Load Theory to strengthen their discussion on readability and comprehension.
Methodological Transparency:
Although the coding reliability and statistical methods are well reported, specific examples of queries and interrater reliability coefficients (ICC/Kappa values) should be fully included in the results section or appendix for replication.
Interpretive Depth:
The discussion, while thorough, sometimes feels more summative than interpretive. It could more clearly connect the empirical results to larger debates about AI bias, linguistic dominance, and cultural inclusivity.
Ethical and Policy Dimension:
The authors could briefly address the ethical implications of AI misinformation in healthcare—such as accountability, consent, and the “delegated trust problem”—to highlight its practical significance.
Author Response
This paper is a well-structured, methodologically sound study that advances the emerging literature on the reliability and usability of AI-generated health information. The paper’s cross-linguistic and cross-geographic design introduces a novel comparative dimension, which is commendable.
We thank the reviewer for their positive and encouraging feedback. We appreciate the recognition of the study’s methodological rigor and its contribution to advancing understanding of AI-generated health information across linguistic and geographic contexts. Your thoughtful comments helped us further strengthen the clarity and framing of our work.
Introduction (Theoretical Framing): Although the introduction is thorough, it tends to be more descriptive. The paper would gain from a stronger theoretical framework that connects AI communication reliability to health literacy, digital trust, or information asymmetry theories. The authors could incorporate models such as eHealth Literacy (Norman & Skinner, 2006) or Cognitive Load Theory to strengthen their discussion on readability and comprehension.
We have added the following information on page 2 (see highlighted text in yellow).
“From a theoretical standpoint, this challenge can be framed through the lens of eHealth Literacy (Norman & Skinner, 2006), which refers to an individual’s ability to seek, understand, and evaluate health information from electronic sources and apply this knowledge to make informed decisions. Caregivers with limited eHealth literacy may struggle to interpret or verify online information, making them more vulnerable to misinformation or overly technical content (Diviani et al., 2015). In addition, Cognitive Load Theory (Sweller, 1988; Paas et al., 2003) suggests that when information is presented in overly complex or lengthy formats, it can exceed the reader’s working memory capacity, thereby reducing comprehension and recall. Thus, readability is not only a stylistic concern but a cognitive and accessibility issue that directly influences health decision-making and digital trust (McCormack et al., 2013; Sillence et al., 2007).”
The following references have been added to the reference list:
Diviani, N., van den Putte, B., Giani, S., & van Weert, J. C. M. (2015). Low health literacy and evaluation of online health information: A systematic review of the literature. Journal of Medical Internet Research, 17(5), e112. https://doi.org/10.2196/jmir.4018
McCormack, L., Haun, J., Sørensen, K., & Valerio, M. (2013). Recommendations for advancing health literacy measurement. Journal of Health Communication, 18(Suppl 1), 9–14. https://doi.org/10.1080/10810730.2013.829892
Norman, C. D., & Skinner, H. A. (2006). eHealth literacy: Essential skills for consumer health in a networked world. Journal of Medical Internet Research, 8(2), e9. https://doi.org/10.2196/jmir.8.2.e9
Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1), 1–4. https://doi.org/10.1207/S15326985EP3801_1
Sillence, E., Briggs, P., Harris, P. R., & Fishwick, L. (2007). How do patients evaluate and make use of online health information? Social Science & Medicine, 64(9), 1853–1862. https://doi.org/10.1016/j.socscimed.2007.01.012
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4
Methodological Transparency: Although the coding reliability and statistical methods are well reported, specific examples of queries and interrater reliability coefficients (ICC/Kappa values) should be fully included in the results section or appendix for replication.
In response to the reviewer’s comment, we have added the following information at the end of the Interrater Reliability section on page 7 (highlighted in yellow for ease of reference).
“Specifically, ICC values indicated high consistency for accuracy (.86) and references (.99), with perfect agreement for readability (1.00). Weighted Cohen’s Kappa values showed substantial agreement for actionability (.78) and language use (.74). These coefficients demonstrate that the coding process was highly stable and reproducible across raters and dimensions.”
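For readers who wish to reproduce coefficients of this kind, the sketch below illustrates one way such values could be computed in Python. It is a minimal illustration only, not the study’s analysis code: the data frame, the column names (response_id, rater, accuracy), and the example ratings are hypothetical.

import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format ratings: one row per (response, rater) pair.
ratings = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater": ["R1", "R2"] * 4,
    "accuracy": [3, 3, 2, 2, 4, 3, 1, 1],
})

# Intraclass correlation coefficients for a continuous/ordinal dimension such as accuracy.
icc = pg.intraclass_corr(data=ratings, targets="response_id", raters="rater", ratings="accuracy")
print(icc[["Type", "ICC"]])

# Weighted Cohen's kappa for categorical dimensions such as actionability or language use,
# computed from the two raters' code vectors (linear weights shown here).
rater1 = [2, 1, 3, 2, 1]
rater2 = [2, 1, 2, 2, 1]
print(cohen_kappa_score(rater1, rater2, weights="linear"))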
Interpretive Depth: The discussion, while thorough, sometimes feels more summative than interpretive. It could more clearly connect the empirical results to larger debates about AI bias, linguistic dominance, and cultural inclusivity.
In response to this feedback, we have incorporated the following new sections into the manuscript. Please refer to the highlighted text on pages 18–20 for these additions.
Page 18: “….From a broader perspective, this limitation also relates to the ongoing debate about algorithmic transparency and epistemic trust in AI communication (Ferrario, 2024; Raji et al., 2021). When users cannot see or verify information sources, they must rely on the model’s perceived authority, a dynamic that can reinforce information asymmetry and limit informed decision-making in health contexts.”
Page 18: “…These differences reflect the phenomenon of linguistic dominance in large language model training, where English-language data are disproportionately represented, leading to structural biases that favor Western-centric discourses and limit global inclusivity (Bender et al., 2021; Kreps et al., 2022). As a result, AI tools risk reinforcing existing disparities in knowledge accessibility, particularly for families seeking autism information in underrepresented languages such as Turkish.”
Page 19: “…Beyond technical accuracy, this issue illustrates how AI bias can intersect with linguistic inequity: models trained primarily on Global North datasets may unintentionally encode cultural assumptions, rhetorical styles, and healthcare frameworks that are less applicable or even misleading in non-Western contexts (Birhane, 2021). Addressing such biases requires intentional diversification of training data and evaluation benchmarks that represent the cultural and linguistic diversity of real-world users.”
Page 19: “…This finding also connects to the broader conversation on cultural inclusivity and representational fairness in AI. When models reproduce dominant medicalized narratives, they perpetuate historical power imbalances in how disability is discussed and understood (Lewis, 2025). Ensuring inclusive AI systems requires both linguistic sensitivity and participatory approaches that involve neurodivergent individuals in the design and evaluation of AI training data and outputs.”
Pages 19-20: “…Moreover, the inconsistent referencing patterns observed across tools may reflect deeper structural limitations in how AI systems represent knowledge provenance, raising epistemological questions about what counts as “trusted” information in algorithmic contexts (Jacobs & Wallach, 2021). Future work should therefore focus on designing models that not only generate accurate and culturally sensitive content but also provide transparent, traceable sources that support user trust and critical evaluation.”
The following references have been added to the reference list:
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT’21), 610–623. https://doi.org/10.1145/3442188.3445922
Birhane, A. (2021). Algorithmic injustice: A relational ethics approach. Patterns, 2(2), 100205. https://doi.org/10.1016/j.patter.2021.100205
Ferrario, A. (2024). Justifying our credences in the trustworthiness of AI systems: A reliabilistic approach. Science and Engineering Ethics, 30(6), 55. https://doi.org/10.1007/s11948-024-00522-z
Jacobs, A. Z., & Wallach, H. (2021). Measurement and fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT’21), 375–385. https://doi.org/10.1145/3442188.3445901
Kreps, S., McCain, R. M., & Brundage, M. (2022). All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. Journal of Experimental Political Science, 9(1), 104–117. https://doi.org/10.1017/XPS.2020.37
Lewis, A. A. (2025). Unpacking cultural bias in AI language learning tools: An analysis of impacts and strategies for inclusion in diverse educational settings. International Journal of Research and Innovation in Social Science, 9(1), 1878–1892. https://doi.org/10.47772/IJRISS
Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. arXiv preprint arXiv:2111.15366.
Ethical and Policy Dimension: The authors could briefly address the ethical implications of AI misinformation in healthcare—such as accountability, consent, and the “delegated trust problem”—to highlight its practical significance.
To address this feedback, the subsection “Implications for Practice” has been renamed “Implications for Practice and Policy,” and the following paragraph and references have been added to the paper on page 21 (see highlighted text in yellow):
“Beyond clinical application, these findings also carry important ethical and policy implications. As AI-generated health information becomes more common, issues of accountability, informed consent, and user trust become increasingly relevant. When misinformation or biased content is produced by AI systems, it remains unclear who bears responsibility: the developer, the healthcare provider recommending the tool, or the user interpreting it (Jobin et al., 2019; Morley et al., 2020). This “delegated trust problem” highlights the need for clearer guidance and oversight within healthcare systems to ensure that AI tools used for patient education meet basic standards of accuracy, privacy, and fairness. Rather than calling for broad regulation, incremental policy actions such as developing professional guidelines for AI use in health communication, promoting transparency about data sources, and incorporating AI literacy training into clinical practice may help safeguard patient autonomy and foster public trust in the responsible use of AI-generated information.”
The following references have been added to the reference list:
Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/s42256-019-0088-2
Morley, J., Floridi, L., Kinsey, L., & Elhalal, A. (2020). From what to how: An initial review of publicly available AI ethics tools, methods and research to translate principles into practices. Science and Engineering Ethics, 26(4), 2141–2168. https://doi.org/10.1007/s11948-019-00165-5
Reviewer 2 Report
Comments and Suggestions for Authors
“The manuscript offers timely and valuable insights into how AI tools can support autism-related information access. However, the statement in the Conclusion section that this work constitutes “one of the most comprehensive evaluations” would benefit from closer alignment with the methodological scope presented. At present, the study’s comprehensiveness is bounded by several practical constraints that, while understandable, limit the breadth implied by that claim.
A) The evaluation focused on only three LLMs (ChatGPT, Gemini, and Copilot), excluding emerging systems such as DeepSeek and Llama that increasingly shape the public AI landscape.
B) (a major limitation) Interactions were conducted in a single-turn format. Although this single-turn approach ensures methodological consistency and facilitates fair model comparison, it diverges from typical user behavior in naturalistic settings, where individuals often engage in multi-turn conversations to refine their questions or seek clarifications. Multi-turn interactions allow LLMs to leverage conversational context and adjust their reasoning dynamically, often leading to more accurate, relevant, and personalized outputs. Consequently, results derived from single-turn exchanges may underestimate real-world model performance or obscure important differences in conversational robustness across systems. Moreover, the exclusive use of single-turn prompts may inadvertently flatten performance differences among the evaluated tools. Because ChatGPT and Gemini are designed for sustained dialogue and contextual reasoning, their strengths typically emerge through multi-turn engagement where users refine and build upon prior exchanges. Copilot, on the other hand, is optimized for discrete, instruction-style tasks and may perform comparatively well in isolated queries. As such, the single-turn design simplifies comparison but potentially biases results toward models better suited for one-shot responses, limiting the generalizability of conclusions about real-world effectiveness.
C) The linguistic coverage, restricted to English and Turkish, provides an important but partial view of cross-cultural generalizability.
Given these boundaries, the authors could strengthen the paper in one of two ways:
Option 1 – Expand the empirical scope: Extend the analysis to incorporate additional models representative of current AI diversity, include multi-turn conversational prompts that capture naturalistic user behavior, and, if feasible, explore additional language contexts or basic outcome-oriented measures. Such extensions would substantively justify the “comprehensive evaluation” descriptor and position the work as a broader benchmark study.
Option 2 – Refine the framing: If expanding the dataset or design is not feasible within the current revision cycle, the manuscript could instead recalibrate its language to emphasize depth within defined parameters rather than breadth. For example, the phrase might be revised to indicate that this is “a detailed comparative analysis of leading AI models within specific linguistic and interactional settings.” Clarifying that the study’s comprehensiveness refers to methodological rigor and comparative depth, rather than coverage of all models, languages, or interaction types, would preserve the study’s strength while accurately reflecting its scope.
Either approach would enhance the credibility and interpretive balance of the paper, ensuring that readers clearly understand the study’s contributions and limitations within the evolving landscape of AI-assisted autism research.”
Author Response
“The manuscript offers timely and valuable insights into how AI tools can support autism-related information access. However, the statement in the Conclusion section that this work constitutes “one of the most comprehensive evaluations” would benefit from closer alignment with the methodological scope presented. At present, the study’s comprehensiveness is bounded by several practical constraints that, while understandable, limit the breadth implied by that claim.
We appreciate the reviewer’s thoughtful comment and agree that clarity regarding the scope of the study is important. We have revised the conclusion to state that “this study provides a detailed comparative evaluation of how AI-driven language models perform in generating autism-related information” to better reflect the methodological boundaries of our work. At the same time, we respectfully note that this study remains more comprehensive than most existing investigations in this emerging field. Prior studies assessing AI-generated autism information (e.g., McFayden et al., 2024; Hassona & Alqaisi, 2024) have typically focused on a single AI model (usually ChatGPT), a single language, and one national context. In contrast, our study systematically compared three widely used AI tools (ChatGPT, Gemini, and Copilot) across three countries (USA, England, and Türkiye) and two languages (English and Turkish), while analyzing multiple dimensions of content quality including accuracy, readability, actionability, reference reliability, and language framing. This multi-country, bilingual, and multi-criteria design extends the methodological scope of prior literature and provides a richer cross-cultural and linguistic understanding of AI-generated autism information. Thus, while we recognize that the study’s coverage is not global, we believe that the revised phrasing “a comprehensive evaluation” remains appropriate, as it accurately conveys the study’s comparative breadth and methodological depth relative to previous research.
A) The evaluation focused on only three LLMs (ChatGPT, Gemini, and Copilot), excluding emerging systems such as DeepSeek and Llama that increasingly shape the public AI landscape.
We appreciate the reviewer’s observation. At the time of data collection (May 2025), our selection focused on LLMs that were widely available, actively used by the public, and capable of generating responses in both English and Turkish through established interfaces. We also acknowledged this scope in the Limitations section, noting that the inclusion of all existing or emerging AI tools across every language and region would be methodologically unfeasible within a single study. Nevertheless, we agree that newer systems such as DeepSeek and Llama represent important future directions, and subsequent research could extend this framework to these and other models as they become more stable and publicly adopted. This is noted under the Limitations and Recommendation for Future Research section on page 21 (see purple highlight).
“ … Future research should extend this work by including other models such as DeepSeek and Llama, which are increasingly shaping the AI landscape and may offer distinct linguistic or contextual advantages across healthcare domains.”
B) (a major limitation) Interactions were conducted in a single-turn format. Although this single-turn approach ensures methodological consistency and facilitates fair model comparison, it diverges from typical user behavior in naturalistic settings, where individuals often engage in multi-turn conversations to refine their questions or seek clarifications. Multi-turn interactions allow LLMs to leverage conversational context and adjust their reasoning dynamically, often leading to more accurate, relevant, and personalized outputs. Consequently, results derived from single-turn exchanges may underestimate real-world model performance or obscure important differences in conversational robustness across systems. Moreover, the exclusive use of single-turn prompts may inadvertently flatten performance differences among the evaluated tools. Because ChatGPT and Gemini are designed for sustained dialogue and contextual reasoning, their strengths typically emerge through multi-turn engagement where users refine and build upon prior exchanges. Copilot, on the other hand, is optimized for discrete, instruction-style tasks and may perform comparatively well in isolated queries. As such, the single-turn design simplifies comparison but potentially biases results toward models better suited for one-shot responses, limiting the generalizability of conclusions about real-world effectiveness.
We appreciate this insightful observation and agree that multi-turn interactions can provide a more naturalistic simulation of how users engage with AI systems. As acknowledged in the Limitations section, our study intentionally employed a single-turn design to ensure methodological consistency, comparability, and replicability across models, languages, and countries. While multi-turn dialogues may yield richer or more adaptive responses, it would not have been feasible to standardize follow-up prompts across 44 questions, three AI systems, two languages, and three national contexts. Moreover, multi-turn interactions inherently introduce user-specific variability such as phrasing, sequencing, or clarification style that could confound systematic cross-model comparison. Therefore, focusing on the initial responses to standardized caregiver-relevant questions allowed us to isolate and evaluate each model’s baseline informational quality under controlled and replicable conditions. We agree that future research should build on this work by incorporating multi-turn designs to examine how conversational context influences accuracy, personalization, and user trust. However, for the present study, the single-turn format was essential to maintain methodological rigor and ensure fair comparison across all evaluated dimensions. We expanded on the second point listed under the Limitations and Recommendation for Future Research section to help address this concern. Please see revised text below or on page 21 (purple highlights):
“Second, all queries were entered as single-turn prompts. Although this approach ensured methodological consistency and allowed for direct, standardized comparison across tools, languages, and countries, it does not fully capture how users typically engage in multi-turn, interactive conversations. Multi-turn exchanges can influence the quality and personalization of AI responses, but including them in this study would have introduced user-driven variability that could not be systematically controlled. Therefore, the single-turn design was intentionally chosen to evaluate the baseline informational quality of each model under comparable and replicable conditions. Future studies should build on this work by examining how LLMs perform in dynamic, real-world dialogue scenarios where conversational context and iterative questioning may further shape response quality and accuracy.”
C) The linguistic coverage, restricted to English and Turkish, provides an important but partial view of cross-cultural generalizability.
We appreciate this observation. While the study focused on English and Turkish, these languages were strategically selected to balance global relevance and linguistic diversity. English serves as the dominant language of AI training corpora and international healthcare communication, providing a meaningful benchmark for evaluating global performance. Turkish, in contrast, represents a morphologically rich, non-Indo-European language with different syntactic and cultural structures, allowing us to test model robustness beyond English-centric contexts. Furthermore, by assessing English-language outputs across three countries (the USA, England, and Türkiye), our design inherently expanded cross-cultural generalizability beyond a single linguistic or national frame. We also wanted to note that it is not feasible to include all languages and regions within one study, and we have acknowledged this as a limitation while highlighting directions for future cross-linguistic research.
Given these boundaries, the authors could strengthen the paper in one of two ways:
Option 1 – Expand the empirical scope: Extend the analysis to incorporate additional models representative of current AI diversity, include multi-turn conversational prompts that capture naturalistic user behavior, and, if feasible, explore additional language contexts or basic outcome-oriented measures. Such extensions would substantively justify the “comprehensive evaluation” descriptor and position the work as a broader benchmark study.
Option 2 – Refine the framing: If expanding the dataset or design is not feasible within the current revision cycle, the manuscript could instead recalibrate its language to emphasize depth within defined parameters rather than breadth. For example, the phrase might be revised to indicate that this is “a detailed comparative analysis of leading AI models within specific linguistic and interactional settings.” Clarifying that the study’s comprehensiveness refers to methodological rigor and comparative depth, rather than coverage of all models, languages, or interaction types, would preserve the study’s strength while accurately reflecting its scope.
Either approach would enhance the credibility and interpretive balance of the paper, ensuring that readers clearly understand the study’s contributions and limitations within the evolving landscape of AI-assisted autism research.”
We appreciate this constructive suggestion. Given the current dataset and study design, expanding the empirical scope (Option 1) is not feasible within this revision cycle. However, we fully agree that clarifying the framing will improve interpretive precision. Accordingly, we have revised the language throughout the manuscript, particularly in the Abstract and Conclusion, to reflect that this study represents a “detailed comparative analysis of leading AI models within defined linguistic and geographic contexts” rather than a fully comprehensive global evaluation. Please review purple highlighted text throughout the paper.
Reviewer 3 Report
Comments and Suggestions for Authors
Review Comments
Assessing AI-Generated Autism Information for Healthcare Use: A Cross-Linguistic and Cross-Geographic Evaluation of ChatGPT, Gemini, and Copilot
Manuscript ID: healthcare-3914028
- Authors are focused on assessing the content generated by AI tools based on different languages and geographical locations. Anyhow, their assessment is limited to only a set of countries.
- Why did the authors consider only three countries in the content assessment, and how can they assess the genuineness of the autism healthcare information and generalize it to healthcare?
- Although the topic is relevant to the field, the question is how far it is acceptable to the healthcare sector. If the clinicians/doctors are relying heavily on the content generated by AI, it could lead to severe consequences for patients and a decline in medical standards. Assessing the content is acceptable, but implementing it in the health sector is a concern to be discussed.
- The assessment of the Autism content generated by different AI tools is the point to be considered. Instead of using the content directly, conducting a preliminary assessment drives the people towards the verification of the genuineness of the content.
- The proposed methodology should be presented using a block diagram or flowchart to explain the methods implemented clearly. This improves the readability of the work.
- The conclusions section needs revision. The content presented does not address the research questions posed. The Conclusions should explain the achievements of the work done.
- The references included are related to the proposed research work.
- The term ‘References’ used in the manuscript and Table 5 may create ambiguity with the actual References section. It is recommended to revise it with appropriate terminology.
Author Response
Authors are focused on assessing the content generated by AI tools based on different languages and geographical locations. Anyhow, their assessment is limited to only a set of countries. Why did the authors consider only three countries in the content assessment, and how can they assess the genuineness of the autism healthcare information and generalize it to healthcare?
To address this feedback, we have added the following information at the end of the Discussion section, just before the Implications for Practice and Policy subsection (see green highlight on page 21).
“Importantly, although this study examined only three countries and two languages, this multi-country, bilingual design offers broader generalizability than most prior investigations, which have been confined to single-language or single-context analyses (McFayden et al., 2024; Sallam et al., 2024). By assessing English-language outputs across three distinct national contexts and comparing them with Turkish-language data, this study provides a more comprehensive picture of how AI-generated autism information performs across diverse healthcare and cultural systems. These findings thus extend the scope of current research beyond national boundaries, highlighting both shared strengths and language-specific gaps in AI-mediated health communication.”
Although the topic is relevant to the field, the question is how far it is acceptable to the healthcare sector. If the clinicians/doctors are relying heavily on the content generated by AI, it could lead to severe consequences for patients and a decline in medical standards. Assessing the content is acceptable, but implementing it in the health sector is a concern to be discussed.
The point about the ethical and professional acceptability of using AI in healthcare (i.e., distinguishing between using AI-generated information and relying on it for medical decision-making) was already discussed as part of the ethical and policy implications at the end of the Implications for Practice and Policy section. We have added additional information to clarify this point under the same section (see green highlight on page 21).
“At the same time, it is essential to recognize that AI-generated information should be viewed as a supplementary resource rather than a clinical decision-making tool. Overreliance on AI outputs by healthcare providers or families could risk misinformation, misinterpretation, and erosion of medical standards. Therefore, implementation of AI-assisted content in healthcare should occur only under human supervision and within established professional and ethical boundaries (Jobin et al., 2019; Morley et al., 2020). Clear institutional policies and clinician training on appropriate AI use can help ensure that these technologies enhance, rather than replace, clinical judgment and patient-centered care.”
The proposed methodology should be presented using a block diagram or flowchart to explain the methods implemented clearly. This improves the readability of the work.
A flowchart has been created and added as Appendix B on page 25 (see the chart on this page).
The conclusions section needs revision. The content presented does not address the research questions posed. The Conclusions should explain the achievements of the work done.
To address this feedback, we have revised the Conclusion section as noted below. Please also see the green highlighted text on page 22.
“This study provides a comprehensive evaluation of how AI-driven language models perform in generating autism-related information across different languages, countries, and content domains. The findings showed that ChatGPT consistently produced the most accurate responses across all locations and languages, followed by Gemini and Copilot, with only minor variability by setting. These results reinforce the potential of GPT-based systems as factually reliable sources of educational content for families and healthcare professionals [RQ1: Accuracy × Location/Language]. Across all models, the readability of responses remained above the recommended sixth- to eighth-grade level for health materials. While Copilot generated the most accessible text and Gemini followed closely, ChatGPT produced more complex responses, highlighting that readability continues to pose a challenge in AI-generated health communication [RQ2: Readability differences across tools/languages].
Gemini, however, provided the most actionable and user-oriented responses, frequently suggesting practical steps or strategies that caregivers could apply—an advantage particularly evident in Turkish-language outputs [RQ3: Actionability]. In terms of language framing, all three models relied heavily on medicalized terminology, with limited use of neurodiversity-affirming or strengths-based language. This indicates a persistent gap between AI-generated content and current best practices in inclusive communication [RQ4: Language framing (medicalized vs. neurodiversity-affirming)]. When considering reference generation, Gemini again stood out for consistently including credible sources, while Copilot showed mixed performance and ChatGPT omitted references unless explicitly requested [RQ5: References; frequency, credibility, functionality].
Together, these findings demonstrate that while AI tools have clear strengths in accuracy, accessibility, and practical guidance, they also exhibit notable weaknesses in inclusivity, readability, and source transparency. These results highlight the importance of healthcare professionals guiding families in how to use AI responsibly and critically. As AI technologies continue to evolve, their integration into health communication must be accompanied by ongoing evaluation, professional oversight, and ethical safeguards to ensure that digital tools complement, rather than replace, human-centered care for autistic individuals and their families.”
The term ‘References’ used in the manuscript and Table 5 may create ambiguity with the actual References section. It is recommended to revise it with appropriate terminology.
We appreciate the reviewer’s suggestion. In this study, the term “References” specifically refers to the external source links or citations generated by the AI systems (e.g., .gov, .org, .edu, .com) rather than the bibliographic references listed at the end of the manuscript. This distinction is clearly explained in the Methods section under the subheading “References,” where we describe the coding procedures, domain categories, and hyperlink verification process. To avoid redundancy and maintain consistency with this defined variable, we have retained the term “References” throughout the manuscript and tables.
Reviewer 4 Report
Comments and Suggestions for Authors
I liked the study. Can the authors only answer the following questions?
1. What are the age ranges of the collected data?
2. It would be good to include information about the versions of Chat GPT, Gemini, and Copilot used.
Author Response
I liked the study. Can the authors only answer the following questions? What are the age ranges of the collected data?
To clarify the age-related concern, we have added the following text on page 5 (see blue highlighted text).
“…These questions were designed to reflect the experiences and information needs of caregivers of children and adolescents with autism, approximately ages 2–18 years…”
It would be good to include information about the versions of Chat GPT, Gemini, and Copilot used.
The version of each AI tool is specified on page 4 under the Methods → Search Using the AI Platforms section. We did not repeat this information throughout the paper to avoid redundancy. Please see blue highlighted text.
Round 2
Reviewer 2 Report
Comments and Suggestions for AuthorsAll the comments were addressed. So, it looks good to go for the publication.
Reviewer 3 Report
Comments and Suggestions for AuthorsDear Authors,
Thank you for addressing all my concerns.
