Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design
2.2. Question Development
2.3. AI Response Collection
2.4. Expert Evaluation
2.5. Readability Analysis
2.6. Statistical Analysis
2.7. Model Versioning and Access Dates
2.8. Rating Scales
2.9. Inter-Rater Agreement Reporting
3. Results
3.1. Inter-Rater Reliability
3.2. Reliability Scores
- Diagnosis/Symptoms: Gemini 1.5 Flash scored significantly higher than both ChatGPT-4o (p = 0.045) and DeepSeek-V3 (p = 0.003).
- Treatment/Intervention: DeepSeek-V3 scored significantly lower than both Gemini 1.5 Flash and ChatGPT-4o (p < 0.005), with no significant difference between the latter two (p = 0.072).
- Lifestyle/Activity: No significant differences were observed (p = 0.057).
3.3. Usefulness Scores
- Gemini 1.5 Flash and ChatGPT-4o both scored significantly higher than DeepSeek-V3 in the diagnosis and treatment domains (all pairwise p ≤ 0.024).
- In the lifestyle/activity domain, no significant differences were observed (p-adj > 0.05).
3.4. Global Quality Scale (GQS) Scores
- Gemini 1.5 Flash achieved the highest GQS in diagnosis/symptoms (4.53 ± 0.26), significantly outperforming both ChatGPT-4o (p = 0.003) and DeepSeek-V3 (p = 0.003).
- In treatment/intervention, Gemini 1.5 Flash scored significantly higher than DeepSeek-V3 (p = 0.018), while no significant difference was observed between ChatGPT-4o and Gemini 1.5 Flash.
- No significant differences were detected in lifestyle/activity (p = 0.050).
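
The effect sizes (η²) tabulated alongside these comparisons can be cross-checked from the reported test statistics. The sketch below assumes that the test statistic is a Kruskal-Wallis H, that η² was computed with the rank-based formula η² = (H − k + 1)/(n − k), and that each question-model pair counts as one observation (so n is three times the number of questions in a domain). None of these assumptions is stated in the text shown here; the function name `eta_squared_h` is illustrative. Under these assumptions the formula reproduces the tabulated reliability values.

```python
def eta_squared_h(h: float, k: int, n: int) -> float:
    """Rank-based eta squared for a Kruskal-Wallis H statistic.

    h: H statistic, k: number of groups, n: total number of observations.
    """
    if n <= k:
        raise ValueError("n must exceed the number of groups k")
    return (h - k + 1) / (n - k)


K_MODELS = 3  # ChatGPT-4o, Gemini 1.5 Flash, DeepSeek-V3

# (H statistic, number of questions in the domain), reliability criterion,
# taken from the results table; n = K_MODELS * questions is an assumption.
reliability = {
    "Diagnosis & Symptoms": (12.47, 7),
    "Treatment & Intervention": (18.98, 10),
    "Overall": (36.95, 20),
}

effect_sizes = {
    domain: round(eta_squared_h(h, K_MODELS, K_MODELS * n_questions), 3)
    for domain, (h, n_questions) in reliability.items()
}
print(effect_sizes)
```

The computed values agree with the tabulated η² of 0.582, 0.629, and 0.613 for these three rows, which supports (but does not prove) the assumed formula.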
3.5. Readability
- Flesch Reading Ease (FRE): DeepSeek-V3 had the highest score (45.9 ± 7.6), indicating the easiest text for patients to read, followed by ChatGPT-4o (37.0 ± 7.3) and Gemini 1.5 Flash (30.7 ± 7.3).
- Grade-Level and Complexity Indices (FKGL, GFI, CLI, SMOG, ARI): Gemini 1.5 Flash consistently produced the most complex text (FKGL ≈ 11.6) and DeepSeek-V3 the simplest (FKGL ≈ 6.5), with ChatGPT-4o in between (FKGL ≈ 8.9).
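
For orientation, the two Flesch indices above are simple functions of average sentence length and average syllables per word. The following is a minimal sketch using the standard published formulas; the syllable counter is a crude vowel-group heuristic introduced here for illustration, and it is not the tool used in the study, so scores from this sketch will not match the reported values exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


sample = "A rotator cuff tear is a tear in the shoulder tendons. It can cause pain."
w, s, syl = text_counts(sample)
print(round(flesch_reading_ease(w, s, syl), 1), round(flesch_kincaid_grade(w, s, syl), 1))
```

Both formulas move in opposite directions as sentences and words get longer: FRE falls while FKGL rises, which is why the model with the highest FRE above also has the lowest grade-level scores.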
4. Discussion
4.1. Reliability and Usefulness of AI-Generated Information
4.2. Quality Assessment
4.3. Readability Considerations
4.4. Clinical and Digital Health Implications
4.5. Limitations and Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| GQS | Global Quality Scale |
| ICC | Intraclass correlation coefficient |
| 95% CI | 95% confidence interval |
| X̄ | Mean |
| M | Median |
| IQR | Interquartile range |
| SD | Standard deviation |


| Category | No. | Question |
|---|---|---|
| Diagnosis and Symptoms | 1 | What is a rotator cuff tear? |
| Diagnosis and Symptoms | 2 | What are the symptoms of a shoulder tendon tear? |
| Diagnosis and Symptoms | 3 | How can I tell if I have a tear in my shoulder?/How is it detected? |
| Diagnosis and Symptoms | 4 | Is MRI necessary to detect a shoulder tear? Can it clearly show the tear? |
| Diagnosis and Symptoms | 5 | What is a partial tear in the shoulder, and how does it differ from a complete tear? |
| Diagnosis and Symptoms | 6 | If I have shoulder pain but can still move my arm, could it still be a tear? |
| Diagnosis and Symptoms | 7 | How can a shoulder tendon tear be distinguished from a muscle strain? |
| Treatment and Interventions | 8 | Can a rotator cuff tear heal without surgery?/Can it be treated non-surgically? |
| Treatment and Interventions | 9 | Is physical therapy sufficient for treating a rotator cuff tear? |
| Treatment and Interventions | 10 | When is surgery necessary for shoulder tendon tears? Does every tear require an operation? |
| Treatment and Interventions | 11 | How is rotator cuff surgery performed? |
| Treatment and Interventions | 12 | What is the difference between arthroscopy and open surgery for shoulder tendon tears? |
| Treatment and Interventions | 13 | Is there pain after shoulder tendon tear surgery? |
| Treatment and Interventions | 14 | Is postoperative physical therapy necessary, and what does the process involve? |
| Treatment and Interventions | 15 | What happens if a shoulder tendon tear is left untreated? Can it cause permanent damage? |
| Treatment and Interventions | 16 | Are PRP or stem cell therapies effective for shoulder tendon tears? |
| Treatment and Interventions | 17 | What are the risks of shoulder tendon tear surgery? |
| Lifestyle and Activity | 18 | When can I use my arm after rotator cuff surgery? |
| Lifestyle and Activity | 19 | When can I return to sports?/Can I play sports again? |
| Lifestyle and Activity | 20 | Can a shoulder tendon tear reoccur, and how can it be prevented? |

| Model | Criterion | ICC (95% CI) | Interpretation |
|---|---|---|---|
| ChatGPT-4o | Reliability | 0.900 (0.758–0.959) | Excellent agreement |
| ChatGPT-4o | Usefulness | 0.848 (0.691–0.934) | Good–excellent |
| ChatGPT-4o | GQS | 0.726 (0.469–0.878) | Good |
| Gemini 1.5 Flash | Reliability | 0.897 (0.767–0.957) | Excellent agreement |
| Gemini 1.5 Flash | Usefulness | 0.776 (0.559–0.901) | Good |
| Gemini 1.5 Flash | GQS | 0.856 (0.710–0.937) | Excellent agreement |
| DeepSeek-V3 | Reliability | 0.755 (0.401–0.902) | Good |
| DeepSeek-V3 | Usefulness | 0.822 (0.472–0.935) | Good–excellent |
| DeepSeek-V3 | GQS | 0.743 (0.503–0.886) | Good |

| Criterion | Domain | Model | Mean ± SD | Median (IQR) | η² | Test Statistic | p-Value | Significant Pairwise Differences |
|---|---|---|---|---|---|---|---|---|
| Reliability | Diagnosis & Symptoms | ChatGPT-4o | 4.39 ± 0.80 | 4.25 (1.0) | 0.582 | 12.47 | 0.002 | Gemini > ChatGPT (p = 0.045); Gemini > DeepSeek (p = 0.003) |
| Reliability | Diagnosis & Symptoms | Gemini 1.5 Flash | 5.42 ± 0.59 | 5.25 (0.0) | | | | |
| Reliability | Diagnosis & Symptoms | DeepSeek-V3 | 3.75 ± 0.50 | 3.75 (0.5) | | | | |
| Reliability | Treatment & Intervention | ChatGPT-4o | 4.82 ± 0.58 | 4.75 (0.94) | 0.629 | 18.98 | 0.001 | Gemini > DeepSeek (p = 0.003); ChatGPT > DeepSeek (p = 0.003) |
| Reliability | Treatment & Intervention | Gemini 1.5 Flash | 5.57 ± 0.72 | 5.75 (0.94) | | | | |
| Reliability | Treatment & Intervention | DeepSeek-V3 | 3.82 ± 0.44 | 3.75 (0.44) | | | | |
| Reliability | Lifestyle/Daily Management | ChatGPT-4o | 4.25 ± 0.43 | 4.0 (0.75) | 0.624 | 5.74 | 0.057 | ns (no significant difference) |
| Reliability | Lifestyle/Daily Management | Gemini 1.5 Flash | 6.58 ± 0.28 | 6.75 (0.5) | | | | |
| Reliability | Lifestyle/Daily Management | DeepSeek-V3 | 4.33 ± 0.14 | 4.25 (0.25) | | | | |
| Reliability | Overall | ChatGPT-4o | 4.58 ± 0.67 | 4.75 (0.88) | 0.613 | 36.95 | 0.001 | Gemini > ChatGPT (p = 0.003); Gemini > DeepSeek (p = 0.003); ChatGPT > DeepSeek (p = 0.003) |
| Reliability | Overall | Gemini 1.5 Flash | 5.67 ± 0.72 | 5.50 (1.0) | | | | |
| Reliability | Overall | DeepSeek-V3 | 3.87 ± 0.46 | 3.75 (0.5) | | | | |
| Usefulness | Diagnosis & Symptoms | ChatGPT-4o | 5.39 ± 0.71 | 5.75 (1.0) | 0.583 | 12.50 | 0.002 | Gemini > DeepSeek (p = 0.024); ChatGPT > DeepSeek (p = 0.003) |
| Usefulness | Diagnosis & Symptoms | Gemini 1.5 Flash | 5.82 ± 0.31 | 5.75 (0.25) | | | | |
| Usefulness | Diagnosis & Symptoms | DeepSeek-V3 | 4.14 ± 0.78 | 4.75 (1.5) | | | | |
| Usefulness | Treatment & Intervention | ChatGPT-4o | 5.35 ± 0.37 | 5.25 (0.56) | 0.679 | 20.33 | 0.001 | ChatGPT > DeepSeek (p = 0.003); Gemini > DeepSeek (p = 0.003) |
| Usefulness | Treatment & Intervention | Gemini 1.5 Flash | 5.60 ± 0.45 | 5.75 (0.63) | | | | |
| Usefulness | Treatment & Intervention | DeepSeek-V3 | 3.90 ± 0.31 | 3.75 (0.25) | | | | |
| Usefulness | Lifestyle/Daily Management | ChatGPT-4o | 5.66 ± 0.38 | 5.75 (0.75) | 0.712 | 6.27 | 0.044 | ns (no significant difference) |
| Usefulness | Lifestyle/Daily Management | Gemini 1.5 Flash | 6.16 ± 0.52 | 6.0 (1.0) | | | | |
| Usefulness | Lifestyle/Daily Management | DeepSeek-V3 | 4.25 ± 0.0 | 4.25 (0.0) | | | | |
| Usefulness | Overall | ChatGPT-4o | 5.41 ± 0.50 | 5.25 (0.50) | 0.658 | 39.50 | 0.001 | Gemini > DeepSeek (p = 0.003); ChatGPT > DeepSeek (p = 0.003) |
| Usefulness | Overall | Gemini 1.5 Flash | 5.76 ± 0.44 | 5.75 (0.25) | | | | |
| Usefulness | Overall | DeepSeek-V3 | 4.03 ± 0.51 | 3.87 (0.88) | | | | |
| GQS | Diagnosis & Symptoms | ChatGPT-4o | 3.64 ± 0.37 | 3.75 (0.75) | 0.865 | 17.58 | 0.001 | Gemini > ChatGPT (p = 0.003); Gemini > DeepSeek (p = 0.003); ChatGPT > DeepSeek (p = 0.009) |
| GQS | Diagnosis & Symptoms | Gemini 1.5 Flash | 4.53 ± 0.26 | 4.75 (0.5) | | | | |
| GQS | Diagnosis & Symptoms | DeepSeek-V3 | 2.78 ± 0.36 | 3.0 (0.25) | | | | |
| GQS | Treatment & Intervention | ChatGPT-4o | 3.55 ± 0.38 | 3.75 (0.63) | 0.314 | 10.49 | 0.005 | Gemini > DeepSeek (p = 0.018) |
| GQS | Treatment & Intervention | Gemini 1.5 Flash | 4.15 ± 0.73 | 4.37 (1.19) | | | | |
| GQS | Treatment & Intervention | DeepSeek-V3 | 3.12 ± 0.29 | 3.12 (0.31) | | | | |
| GQS | Lifestyle/Daily Management | ChatGPT-4o | 3.66 ± 0.38 | 3.75 (0.75) | 0.668 | 6.01 | 0.050 | ns (no significant difference) |
| GQS | Lifestyle/Daily Management | Gemini 1.5 Flash | 4.66 ± 0.38 | 4.75 (0.75) | | | | |
| GQS | Lifestyle/Daily Management | DeepSeek-V3 | 3.41 ± 0.28 | 3.25 (0.5) | | | | |
| GQS | Overall | ChatGPT-4o | 3.60 ± 0.36 | 3.75 (0.69) | 0.555 | 33.62 | 0.001 | Gemini > ChatGPT (p = 0.003); Gemini > DeepSeek (p = 0.003); ChatGPT > DeepSeek (p = 0.003) |
| GQS | Overall | Gemini 1.5 Flash | 4.36 ± 0.58 | 4.62 (0.5) | | | | |
| GQS | Overall | DeepSeek-V3 | 3.05 ± 0.37 | 3.0 (0.44) | | | | |

| Metric | Model | Mean ± SD | Median (IQR) | Interpretation | p-Value (Pairwise) |
|---|---|---|---|---|---|
| FRE | ChatGPT-4o | 37.0 ± 7.3 | 37.0 (10.3) | Fairly difficult | DeepSeek > ChatGPT (p = 0.003); DeepSeek > Gemini (p = 0.003); ChatGPT > Gemini (p = 0.021) |
| FRE | Gemini 1.5 Flash | 30.7 ± 7.3 | 31.0 (6.0) | Difficult | |
| FRE | DeepSeek-V3 | 45.9 ± 7.7 | 47.5 (9.0) | Closer to plain language | |
| FKGL | ChatGPT-4o | 8.9 ± 1.4 | 8.9 (2.0) | Above target (9th grade) | All comparisons p = 0.001 |
| FKGL | Gemini 1.5 Flash | 11.6 ± 1.8 | 11.5 (2.3) | High school level | |
| FKGL | DeepSeek-V3 | 6.5 ± 1.3 | 6.4 (2.0) | Within target (6th grade) | |
| GFI | ChatGPT-4o | 11.1 ± 1.7 | 11.2 (3.0) | Difficult | Gemini > ChatGPT (p = 0.031); ChatGPT > DeepSeek (p = 0.015); Gemini > DeepSeek (p = 0.001) |
| GFI | Gemini 1.5 Flash | 12.5 ± 1.5 | 12.7 (1.5) | Difficult | |
| GFI | DeepSeek-V3 | 9.6 ± 1.7 | 9.5 (2.7) | Easier | |
| CLI | ChatGPT-4o | 12.1 ± 1.0 | 12.1 (1.1) | High school level | Gemini > ChatGPT (p = 0.028); DeepSeek < both (p = 0.001) |
| CLI | Gemini 1.5 Flash | 13.1 ± 1.2 | 13.1 (1.6) | High school/college | |
| CLI | DeepSeek-V3 | 9.6 ± 1.2 | 9.3 (1.3) | Closer to target | |
| SMOG | ChatGPT-4o | 8.7 ± 1.4 | 8.2 (2.0) | Slightly above target | All comparisons p = 0.003 |
| SMOG | Gemini 1.5 Flash | 11.1 ± 1.7 | 11.2 (1.9) | Difficult | |
| SMOG | DeepSeek-V3 | 7.1 ± 0.9 | 7.0 (1.3) | Within target | |
| ARI | ChatGPT-4o | 12.9 ± 1.5 | 12.8 (2.1) | Difficult | All comparisons p = 0.003 |
| ARI | Gemini 1.5 Flash | 17.5 ± 2.6 | 17.9 (2.8) | Very difficult | |
| ARI | DeepSeek-V3 | 9.1 ± 2.9 | 9.6 (4.8) | Easier | |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Koluman, A.C.; Çiftçi, M.U.; Çiftçi, E.A.; Çakmur, B.B.; Ziroğlu, N. Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears. Healthcare 2025, 13, 2670. https://doi.org/10.3390/healthcare13212670

