Evaluation of Arabic-Language AI Chatbot Responses to Migraine-Related Questions: A Comparative Cross-Sectional Study
Abstract
1. Introduction
1.1. Background
1.2. Emergence of AI Chatbots
1.3. Rationale
2. Materials and Methods
2.1. Study Design and Question Selection
2.2. Entry Procedure
2.3. Evaluation of Responses
2.4. Statistical Analysis
3. Results
4. Discussion
4.1. Summary
4.2. Comparison with Existing Literature
4.3. Clinical Implications
4.4. Limitations and Future Directions
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| FAQ | Frequently Asked Question |
| GQS | Global Quality Scale |
| ICC | Intraclass Correlation Coefficient |
| ICHD-3 | International Classification of Headache Disorders, 3rd edition |
| IHS | International Headache Society |
| IQR | Interquartile Range |
| IRB | Institutional Review Board |
| LLM | Large Language Model |
| mDISCERN | Modified DISCERN |
| SD | Standard Deviation |
| SERPs | Search Engine Results Pages |
| SPSS | Statistical Package for the Social Sciences |



| Metric | Rater 1 | Rater 2 | Rater 3 | Rater 4 | ICC | p-Value |
|---|---|---|---|---|---|---|
| mDISCERN Score | 32.39 (3.66) | 31.97 (3.07) | 32.46 (3.53) | 31.60 (2.37) | 0.831 | <0.001 |
| GQS Score | 4.72 (0.59) | 4.74 (0.58) | 4.75 (0.58) | 4.68 (0.62) | 0.816 | <0.001 |
| Accuracy Score | 4.75 (0.54) | 4.74 (0.58) | 4.68 (0.57) | 4.71 (0.56) | 0.799 | <0.001 |

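
The ICC column summarizes agreement among the four raters. The ICC form used is not stated here, so the following is only a minimal sketch of one common choice: the two-way random-effects, absolute-agreement ICC for single and averaged ratings, often written ICC(2,1) and ICC(2,k). The function name `icc_two_way_random` is hypothetical, and the ratings matrix is a widely reproduced textbook example (six subjects, four judges), not data from this study.

```python
from statistics import mean


def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a subjects-by-raters matrix.

    Assumption: two-way random effects, absolute agreement. The ICC model
    actually used in the study is not stated in the source.
    """
    n = len(ratings)        # subjects (rows), e.g. chatbot responses
    k = len(ratings[0])     # raters (columns)
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(col) for col in zip(*ratings)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Illustrative textbook ratings (6 subjects x 4 judges), NOT study data.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc_two_way_random(example)
print(f"ICC(2,1) = {icc_single:.2f}, ICC(2,k) = {icc_average:.2f}")
# prints: ICC(2,1) = 0.29, ICC(2,k) = 0.62
```

The averaged-ratings form is always at least as high as the single-rating form, so a reported ICC can only be interpreted once the model and unit are known.
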
| mDISCERN Component Questions | ChatGPT Mean (SD) | DeepSeek Mean (SD) | Gemini Mean (SD) | Grok Mean (SD) | Kruskal–Wallis H | p-Value |
|---|---|---|---|---|---|---|
| Q1: Are the aims clear? | 4.82 (0.37) | 5.00 (0.00) | 4.92 (0.35) | 4.97 (0.08) | 2.318 | 0.088 |
| Q2: Does it achieve its aims? | 4.84 (0.44) | 4.96 (0.12) | 4.94 (0.11) | 4.96 (0.09) | 1.753 | 0.625 |
| Q3: Is it relevant? | 4.86 (0.40) | 4.96 (0.09) | 4.96 (0.09) | 4.96 (0.09) | 1.923 | 0.589 |
| Q4: Are the sources of information used to compile the publication clearly identified? | 1.63 (0.42) | 1.27 (0.76) | 1.27 (0.68) | 3.77 (1.28) | 55.055 | <0.001 |
| Q5: Is it clear when the information used or reported in the publication was produced? | 4.36 (0.76) | 4.78 (0.17) | 4.67 (0.68) | 4.73 (0.16) | 12.357 | <0.001 |
| Q6: Is it balanced and unbiased? | 4.88 (0.19) | 4.75 (0.13) | 4.91 (0.19) | 4.81 (0.13) | 21.471 | <0.001 |
| Q7: Does it provide details of additional sources of support and information? | 2.35 (0.80) | 4.08 (0.68) | 2.50 (0.88) | 3.50 (0.59) | 50.251 | <0.001 |
| Q8: Does it refer to areas of uncertainty? | 2.09 (1.26) | 4.30 (0.54) | 2.06 (1.57) | 2.56 (1.39) | 27.587 | <0.001 |
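
With four chatbots, the Kruskal–Wallis H statistic has 3 degrees of freedom, and its asymptotic p-value is the upper tail of a chi-square distribution. The sketch below shows that conversion under the chi-square approximation. The software behind the table may have used a different or exact procedure, so individual tabulated p-values need not match. `chi2_sf_df3` is a hypothetical helper name, and its closed form holds only for 3 degrees of freedom.

```python
import math


def chi2_sf_df3(h):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if h <= 0:
        return 1.0
    # Closed form valid for df = 3 only:
    # SF(h) = erfc(sqrt(h/2)) + sqrt(2h/pi) * exp(-h/2)
    return math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)


# H statistics for Q2 and Q3 as tabulated above.
for label, h in [("Q2", 1.753), ("Q3", 1.923)]:
    print(f"{label}: H = {h:.3f}, asymptotic p = {chi2_sf_df3(h):.3f}")
```

For these two rows the result should land close to the tabulated 0.625 and 0.589.
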

| Metric | ChatGPT Mean (SD), Median (IQR) | DeepSeek Mean (SD), Median (IQR) | Gemini Mean (SD), Median (IQR) | Grok Mean (SD), Median (IQR) | p-Value |
|---|---|---|---|---|---|
| mDISCERN | 29.83 (1.87), 30.25 (28.88–31.50) | 34.07 (1.31), 34.00 (33.25–34.75) | 30.23 (2.39), 30.00 (28.50–31.38) | 34.29 (2.59), 34.25 (32.63–35.75) | <0.001 |
| GQS | 4.47 (0.66), 4.75 (4.25–5.00) | 4.95 (0.13), 5.00 (5.00–5.00) | 4.61 (0.67), 5.00 (4.50–5.00) | 4.86 (0.23), 5.00 (4.75–5.00) | <0.001 |
| Accuracy | 4.51 (0.61), 5.00 (4.00–5.00) | 4.83 (0.32), 5.00 (4.75–5.00) | 4.71 (0.62), 5.00 (4.75–5.00) | 4.83 (0.31), 5.00 (4.63–5.00) | 0.072 |
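
The mDISCERN totals appear to be the sum of the eight component items, so each chatbot's mean total should be reproducible from the component means in the preceding table, up to rounding. A small sketch of that consistency check follows, with values copied from the two tables and illustrative variable names.

```python
# Component means (Q1..Q8) per chatbot, copied from the component table.
component_means = {
    "ChatGPT":  [4.82, 4.84, 4.86, 1.63, 4.36, 4.88, 2.35, 2.09],
    "DeepSeek": [5.00, 4.96, 4.96, 1.27, 4.78, 4.75, 4.08, 4.30],
    "Gemini":   [4.92, 4.94, 4.96, 1.27, 4.67, 4.91, 2.50, 2.06],
    "Grok":     [4.97, 4.96, 4.96, 3.77, 4.73, 4.81, 3.50, 2.56],
}
# Reported mean mDISCERN totals from the summary table.
reported_totals = {"ChatGPT": 29.83, "DeepSeek": 34.07, "Gemini": 30.23, "Grok": 34.29}

differences = {}
for bot, items in component_means.items():
    total = round(sum(items), 2)
    differences[bot] = round(total - reported_totals[bot], 2)
    print(f"{bot}: summed = {total:.2f}, reported = {reported_totals[bot]:.2f}, "
          f"difference = {differences[bot]:+.2f}")
```

Differences of a few hundredths are what rounding of the individual component means would be expected to produce.
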

| Comparison | mDISCERN Z | mDISCERN r | mDISCERN p | GQS Z | GQS r | GQS p |
|---|---|---|---|---|---|---|
| ChatGPT vs. Gemini | −0.507 | −0.051 | 1.000 | −1.607 | −0.161 | 0.649 |
| ChatGPT vs. Grok | −5.373 | −0.537 | <0.001 | −2.589 | −0.259 | 0.058 |
| ChatGPT vs. DeepSeek | −5.524 | −0.552 | <0.001 | −3.893 | −0.389 | <0.001 |
| Gemini vs. Grok | −4.865 | −0.487 | <0.001 | −0.982 | −0.098 | 1.000 |
| Gemini vs. DeepSeek | 5.017 | 0.502 | <0.001 | 2.286 | 0.229 | 0.133 |
| Grok vs. DeepSeek | 0.151 | 0.015 | 1.000 | 1.304 | 0.130 | 1.000 |

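
The effect sizes in the table are consistent with the common conversion r = Z/√N. The value N = 100 used below is not stated in this section; it is an assumption inferred from the tabulated Z-to-r ratios. A minimal sketch, with `effect_size_r` as a hypothetical helper name:

```python
import math

N_ASSUMED = 100  # assumption inferred from the Z/r ratios, not stated in the source


def effect_size_r(z, n=N_ASSUMED):
    """Convert a standardized test statistic Z to the effect size r = Z / sqrt(n)."""
    return z / math.sqrt(n)


# (comparison, Z, reported r) for the mDISCERN columns of the table above.
mdiscern_rows = [
    ("ChatGPT/Gemini", -0.507, -0.051),
    ("ChatGPT/Grok", -5.373, -0.537),
    ("ChatGPT/DeepSeek", -5.524, -0.552),
    ("Gemini/Grok", -4.865, -0.487),
    ("Gemini/DeepSeek", 5.017, 0.502),
    ("Grok/DeepSeek", 0.151, 0.015),
]
for name, z, reported in mdiscern_rows:
    print(f"{name}: computed r = {effect_size_r(z):.4f}, reported r = {reported:.3f}")
```

By the usual rule of thumb, |r| near 0.5 indicates a large effect, which matches the comparisons where the mDISCERN difference is significant.
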
| Metric | mDISCERN | GQS | Accuracy |
|---|---|---|---|
| mDISCERN | 1 | 0.499 (0.335–0.633) ** | 0.412 (0.235–0.563) ** |
| GQS | 0.499 (0.335–0.633) ** | 1 | 0.769 (0.674–0.839) ** |
| Accuracy | 0.412 (0.235–0.563) ** | 0.769 (0.674–0.839) ** | 1 |
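
The bracketed intervals are consistent with 95% confidence intervals obtained from the Fisher z-transformation. The sketch below assumes n = 100 paired observations and a standard error of 1/√(n − 3). Neither is stated in this section, and the software that produced the table may use a slightly different variance formula. `fisher_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)                             # map r onto the z scale
    se = 1 / math.sqrt(n - 3)                     # assumed standard error
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


N_ASSUMED = 100  # assumption, not stated in the source
pairs = [("mDISCERN/GQS", 0.499), ("mDISCERN/Accuracy", 0.412), ("GQS/Accuracy", 0.769)]
for pair, r in pairs:
    low, high = fisher_ci(r, N_ASSUMED)
    print(f"{pair}: r = {r:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

Because the interval is built on the z scale and transformed back, it is asymmetric around r, which is why the tabulated bounds are not equidistant from the point estimates.
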

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Aljaafari, D.; Aljumah, H.K.; Alzuwayr, M.A.; Al-Essa, Y.M.; Alradhi, H.A.; Alabdali, M.M.; Almuslim, N.; Alqarni, M.A.; Alesefir, W. Evaluation of Arabic-Language AI Chatbot Responses to Migraine-Related Questions: A Comparative Cross-Sectional Study. J. Clin. Med. 2026, 15, 3908. https://doi.org/10.3390/jcm15103908

