Evaluating the Competence of AI Chatbots in Answering Patient-Oriented Frequently Asked Questions on Orthognathic Surgery
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design
2.2. Question Selection
2.3. Chatbot Evaluation
2.4. Evaluation Criteria
- Global Quality Score (GQS): Three oral and maxillofacial surgeons (E.Y.Ç., D.K., and M.N.), blinded to the identity of the chatbots, independently assessed each response on a 5-point Likert scale (1 = poor, 5 = excellent). The evaluation focused on the quality, accuracy, and comprehensiveness of the information provided. The tool was a modified version of the Global Quality Score [15].
- Clinical Appropriateness: Responses were evaluated to determine whether each chatbot provided safe and medically responsible guidance. Each response was assessed with a binary (yes/no) question: “Does the chatbot appropriately recommend that the patient seek further evaluation and management from a qualified healthcare professional?” [4].
- Readability and Accessibility: To evaluate the readability and accessibility of the chatbot responses generated in Turkish, the Ateşman Readability Formula was applied. This formula, designed specifically for the Turkish language, is based on the average sentence length and the average number of syllables per word and yields a score between 1 and 100; higher scores indicate text that is easier to read and understand, while lower scores indicate greater complexity [16,17]. All chatbot-generated answers were processed with a custom Python (v3.10.8; Python Software Foundation, USA) script that automated the calculation of the Ateşman score for each response (a minimal sketch of this calculation follows this list). This quantitative assessment indicated how accessible and patient-friendly each chatbot’s language was, particularly in the context of health communication.
- Empathy Evaluation: Evaluators rated the chatbot responses for empathy and bedside manner using a five-point scale: 1 = not empathetic, 2 = slightly empathetic, 3 = moderately empathetic, 4 = empathetic, and 5 = very empathetic. Higher scores reflected a greater degree of empathy and a more patient-centered communication style.
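The custom script referenced in the readability criterion above is not reproduced in the article. Purely as an illustration, a minimal sketch of the Ateşman calculation, assuming the formula’s published constants (198.825, 40.175, 2.610) and the Turkish convention that a word contains exactly as many syllables as vowels, might look like this:

```python
import re

# Turkish syllable count equals vowel count (a, e, ı, i, o, ö, u, ü).
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def count_syllables(word: str) -> int:
    return sum(1 for ch in word if ch in TURKISH_VOWELS)

def atesman_score(text: str) -> float:
    """Ateşman (1997) readability score: higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 198.825 - 40.175 * syllables_per_word - 2.610 * words_per_sentence

# Example: a short post-operative instruction in Turkish.
print(round(atesman_score("Ameliyat sonrası iyileşme genellikle altı hafta sürer."), 2))
```

On Ateşman’s conventional difficulty bands (90–100 very easy, 70–89 easy, 50–69 medium, 30–49 difficult, 1–29 very difficult), the mean scores reported in the Results place GPT-4 in the medium band and the other two chatbots in the difficult band.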
2.5. Statistical Analysis
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial intelligence |
| FAQs | Frequently asked questions |
| GQS | Global quality score |
References
1. Kaur, R.; Soni, S.; Prashar, A. Orthognathic surgery: General considerations. Int. J. Health Sci. 2021, 5, 352–357.
2. Sun, H.; Shang, H.T.; He, L.S.; Ding, M.C.; Su, Z.P.; Shi, Y.L. Assessing the quality of life in patients with dentofacial deformities before and after orthognathic surgery. J. Oral Maxillofac. Surg. 2018, 76, 2192–2201.
3. Kurnik, N.; Preston, K.; Tolson, H.; Takeuchi, L.; Garrison, C.; Beals, P.; Beals, S.P.; Singh, D.J.; Sitzman, T.J. Jaw surgery workshop: Patient preparation for orthognathic surgery. Cleft Palate Craniofacial J. 2024, 61, 1559–1562.
4. Rokhshad, R.; Khoury, Z.H.; Mohammad-Rahimi, H.; Motie, P.; Price, J.B.; Tavares, T.; Jessri, M.; Bavarian, R.; Sciubba, J.J.; Sultan, A.S. Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2025, 139, 719–728.
5. Chen, S.; Kann, B.H.; Foote, M.B.; Aerts, H.J.; Savova, G.K.; Mak, R.H.; Bitterman, D.S. Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol. 2023, 9, 1459–1462.
6. Weizenbaum, J. ELIZA—A computer program for the study of natural language communication between man and machine. Commun. ACM 1983, 26, 23–28.
7. Kataoka, Y.; Takemura, T.; Sasajima, M.; Katoh, N. Development and early feasibility of chatbots for educating patients with lung cancer and their caregivers in Japan: Mixed methods study. JMIR Cancer 2021, 7, e26911.
8. Rokhshad, R.; Zhang, P.; Mohammad-Rahimi, H.; Pitchika, V.; Entezari, N.; Schwendicke, F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study. J. Dent. 2024, 144, 104938.
9. Aggarwal, A.; Tam, C.C.; Wu, D.; Li, X.; Qiao, S. Artificial intelligence–based chatbots for promoting health behavioral changes: Systematic review. J. Med. Internet Res. 2023, 25, e40789.
10. Helvacioglu-Yigit, D.; Demirturk, H.; Ali, K.; Tamimi, D.; Koenig, L.; Almashraqi, A. Evaluating artificial intelligence chatbots for patient education in oral and maxillofacial radiology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2025, 19, 750–759.
11. Gomez-Cabello, C.A.; Borna, S.; Pressman, S.M.; Haider, S.A.; Sehgal, A.; Leibovich, B.C.; Forte, A.J. Artificial intelligence in postoperative care: Assessing large language models for patient recommendations in plastic surgery. Healthcare 2024, 12, 1083.
12. Onder, C.E.; Koc, G.; Gokbulut, P.; Taskaldiran, I.; Kuskonmaz, S.M. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci. Rep. 2024, 14, 243.
13. Lombardo, R.; Gallo, G.; Stira, J.; Turchi, B.; Santoro, G.; Riolo, S.; Romagnoli, M.; Cicione, A.; Tema, G.; Pastore, A.; et al. Quality of information and appropriateness of Open AI outputs for prostate cancer. Prostate Cancer Prostatic Dis. 2025, 28, 229–231.
14. Yurdakurban, E.; Topsakal, K.G.; Duran, G.S. A comparative analysis of AI-based chatbots: Assessing data quality in orthognathic surgery related patient information. J. Stomatol. Oral Maxillofac. Surg. 2024, 125, 101757.
15. Mohammad-Rahimi, H.; Khoury, Z.H.; Alamdari, M.I.; Rokhshad, R.; Motie, P.; Parsa, A.; Tavares, T.; Sciubba, J.J.; Price, J.B.; Sultan, A.S. Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2024, 137, 508–514.
16. Ateşman, E. Measuring readability in Turkish. AU Tömer Lang. J. 1997, 58, 71–74.
17. Duymaz, Y.K.; Erkmen, B.; Şahin, Ş.; Tekin, A.M. Evaluation of the readability of Turkish online resources related to laryngeal cancer. Eur. J. Ther. 2023, 29, 168–172.
18. Cobb, R.J.; Scotton, W.J. The reliability of online information regarding orthognathic surgery available to the public. Oral Surg. 2013, 6, 56–60.
19. Engelmann, J.; Fischer, C.; Nkenke, E. Quality assessment of patient information on orthognathic surgery on the internet. J. Cranio-Maxillofac. Surg. 2020, 48, 661–665.
20. Ley, P.; Florio, T. The use of readability formulas in health care. Psychol. Health Med. 1996, 1, 7–28.
21. Abuqayyas, S.; Yurosko, C.; Ali, A.; Rymer, C.; Stoller, J.K. Bedside manner 2020: An inventory of best practices. South. Med. J. 2021, 114, 156–160.
22. Sridharan, K.; Sivaramakrishnan, G. Investigating the capabilities of advanced large language models in generating patient instructions and patient educational material. Eur. J. Hosp. Pharm. 2024, online ahead of print.
23. Imran, M.; Almusharraf, N. Google Gemini as a next generation AI educational tool: A review of emerging educational technology. Smart Learn. Environ. 2024, 11, 22.
24. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 2023, 9, e45312.
25. Manish, S. Constitutional AI: An expanded overview of Anthropic’s alignment approach. Inf. Horiz. Am. J. Libr. Inf. Sci. Innov. 2023, 1, 36–39.
| Criterion | ICC (95% CI)/Fleiss’ Kappa | p |
|---|---|---|
| Global Quality Score | 0.338 (0.178–0.502) | <0.001 x |
| Clinical Appropriateness | 0.931 | <0.001 y |
| Empathy Evaluation | 0.641 (0.512–0.752) | <0.001 x |

x = intraclass correlation coefficient (ICC); y = Fleiss’ kappa.
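The article does not publish its analysis code. A minimal sketch of how the agreement statistics above could be computed, assuming a long-format ratings table and the third-party `pingouin` and `statsmodels` packages (all variable and column names here are hypothetical, not the authors’ actual pipeline):

```python
import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical long-format ratings: one row per (question, rater) pair.
df = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater": ["A", "B", "C"] * 4,
    "gqs": [4, 5, 4, 3, 3, 4, 5, 5, 4, 2, 3, 3],          # 5-point Likert
    "appropriate": [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0],  # binary yes/no
})

# Intraclass correlation for the ordinal scales (GQS, empathy).
icc = pg.intraclass_corr(data=df, targets="question",
                         raters="rater", ratings="gqs")
print(icc[["Type", "ICC", "CI95%", "pval"]])

# Fleiss' kappa for the binary clinical-appropriateness judgments:
# aggregate_raters turns a subjects-by-raters matrix into category counts.
wide = df.pivot(index="question", columns="rater", values="appropriate")
counts, _ = aggregate_raters(wide.to_numpy())
print(fleiss_kappa(counts, method="fleiss"))
```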
| Criterion | Gemini 2.5 Pro | GPT-4 | Claude Sonnet 4 | Test Statistic | p |
|---|---|---|---|---|---|
| Global Quality Score | 4.5 (3–5) a | 4 (3–5) b | 4 (3–4) b | 16.638 | <0.001 x |
| Empathy Evaluation | 5 (4–5) a | 3.5 (3–5) b | 4 (2–5) b | 27.949 | <0.001 x |
| Ateşman Scores | 45.77 ± 7.51 b | 63.01 ± 9.97 a | 44.69 ± 13.42 b | 18.871 | <0.001 y |

Values are median (min–max) for the ordinal scales and mean ± SD for the Ateşman scores.
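The omnibus tests behind the x and y markers are specified in the Statistical Analysis section, which is not part of this extract. As a hedged illustration only, assuming x denotes a Kruskal-Wallis H test (consistent with the median (min–max) presentation) and y a one-way ANOVA (consistent with the mean ± SD presentation):

```python
from scipy import stats

# Hypothetical per-question scores for each chatbot.
gemini = [5, 4, 5, 4, 5, 3, 4, 5, 5, 4]
gpt4 = [4, 3, 4, 4, 3, 4, 5, 3, 4, 4]
claude = [4, 4, 3, 4, 4, 3, 4, 4, 3, 4]

# Kruskal-Wallis H test for ordinal ratings (GQS, empathy).
h_stat, p_value = stats.kruskal(gemini, gpt4, claude)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")

# One-way ANOVA for continuous scores (Ateşman readability).
f_stat, p_value = stats.f_oneway(gemini, gpt4, claude)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

The superscript letters a and b conventionally mark homogeneous subsets from pairwise post-hoc comparisons (for example, Dunn’s test with a multiplicity correction after a significant Kruskal-Wallis result); groups sharing a letter do not differ significantly.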
| Response, n (%) | Gemini 2.5 Pro | GPT-4 | Claude Sonnet 4 | Total | Test Statistic | p |
|---|---|---|---|---|---|---|
| Clinical Appropriateness-1 | | | | | | |
| No | 1 (5) a | 16 (80) b | 7 (35) a | 24 (40) | 25.054 | <0.001 x |
| Yes | 19 (95) a | 4 (20) b | 13 (65) a | 36 (60) | | |
| Clinical Appropriateness-2 | | | | | | |
| No | 1 (5) a | 17 (85) b | 7 (35) a | 25 (41.7) | 28.651 | <0.001 x |
| Yes | 19 (95) a | 3 (15) b | 13 (65) a | 35 (58.3) | | |
| Clinical Appropriateness-3 | | | | | | |
| No | 1 (5) a | 17 (85) b | 6 (30) a | 24 (40) | 29.334 | <0.001 x |
| Yes | 19 (95) a | 3 (15) b | 14 (70) a | 36 (60) | | |
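A natural analysis for each of the 2 × 3 yes/no distributions above is a chi-square test of independence. A minimal sketch using the Clinical Appropriateness-1 counts; note that the reported statistics may come from a different variant (for example, a likelihood-ratio or exact test), so values need not match exactly:

```python
from scipy.stats import chi2_contingency

# Rows: No / Yes; columns: Gemini 2.5 Pro, GPT-4, Claude Sonnet 4
# (counts from the Clinical Appropriateness-1 block above).
observed = [[1, 16, 7],
            [19, 4, 13]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.2e}")
```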
Correlations with the Global Quality Score:

| Chatbot | Variable | r | p |
|---|---|---|---|
| Gemini 2.5 Pro | Empathy Evaluation | −0.014 | 0.954 |
| | Ateşman Scores | 0.255 | 0.277 |
| GPT-4 | Empathy Evaluation | 0.454 | 0.044 |
| | Ateşman Scores | 0.058 | 0.810 |
| Claude Sonnet 4 | Empathy Evaluation | −0.140 | 0.556 |
| | Ateşman Scores | 0.150 | 0.527 |
| Overall | Empathy Evaluation | 0.443 | <0.001 |
| | Ateşman Scores | −0.394 | 0.002 |
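Given the ordinal GQS data, a rank-based coefficient such as Spearman’s rho is the natural choice for r (an assumption; the article’s Statistical Analysis section specifies the actual coefficient). A sketch with hypothetical paired per-question values:

```python
from scipy.stats import spearmanr

# Hypothetical paired per-question values for one chatbot.
gqs = [5, 4, 4, 5, 3, 4, 5, 4, 3, 5]
empathy = [5, 4, 3, 5, 3, 4, 4, 4, 3, 5]

rho, p = spearmanr(gqs, empathy)
print(f"r = {rho:.3f}, p = {p:.3f}")
```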