AI Chatbots in Pediatric Orthopedics: How Accurate Are Their Answers to Parents’ Questions on Bowlegs and Knock Knees?
Abstract
1. Introduction
- Assessing the accuracy of chatbot responses to parental questions concerning pediatric knee deformities.
- Offering an open, replicable evaluation framework and benchmarking dataset for future work on health-oriented chatbots.
- Evaluating the responses against the clinical standards and expertise of pediatric orthopedic practitioners to assess their reliability. The evaluation also covers clinical accuracy, clarity, and completeness, because ambiguous or imprecise answers could delay essential medical care.
2. Materials and Methods
2.1. Study Design
2.2. Data Collection
2.2.1. Chatbot Selection
- Frequency of use among the Saudi population, according to a report from the Saudi Center for Public Opinion Polling (SCOP) [10].
- Popularity and user ratings in app stores or online platforms.
- Timely access and availability during the period of the study.
2.2.2. Question Formulation
- Description of the condition.
- Normal developmental variations.
- Etiological factors and risk factors.
- Diagnostic procedures and tests used.
- Treatment and its effectiveness.
- Complications.
- Timing of medical intervention.
2.2.3. Creating Responses
2.3. Evaluating Responses
2.3.1. Appraisal of the Accuracy
- Accuracy: Accuracy is especially vital in pediatric orthopedics, a specialist discipline in which early intervention is central. Inaccurate chatbot advice risks delaying care by misinforming parents about the proper steps to take and, in some cases, by overcomplicating intervention strategies that could otherwise be straightforward [2,22].
- Comprehensiveness: Comprehensiveness means answering the question thoroughly by covering its components, such as the causes, their treatment, and the timing of care seeking. A response that is not comprehensive may omit information that the user considers crucial, which reduces the educational value of chatbot interactions [2,24].
- Risk of Misleading Information: Information that sounds accurate but is false has become an increasing problem with the advent of AI models. Multiple studies have documented the hazards of conversational agents giving dangerous advice, both contextualized and non-contextualized [22,24,25].
2.3.2. Readability Assessments Using a Flesch–Kincaid Readability Test
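The two formulas usually grouped under the Flesch–Kincaid name are the Flesch Reading Ease score, 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), and the Flesch–Kincaid Grade Level, 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The following is a minimal sketch of both, not the calculator used in the study: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic, so its scores may differ from those of online tools.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (heuristic assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final silent 'e' as non-syllabic ("bone"), but keep "-le" ("ankle").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """Higher scores indicate text that is easier to read."""
    n_sent, n_words, n_syll = text_stats(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate US school grade needed to understand the text."""
    n_sent, n_words, n_syll = text_stats(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any chatbot response in the study.
    sample = "Most bowlegs get better as a child grows. See a doctor if one leg looks worse."
    print(round(flesch_reading_ease(sample), 1), round(flesch_kincaid_grade(sample), 1))
```

Both formulas depend only on average sentence length and average syllables per word, which is why short sentences and plain vocabulary raise the Reading Ease score and lower the grade level.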
2.4. Statistical Analysis
2.5. Ethical Considerations
3. Results
3.1. Inter-Rater Reliability
3.1.1. Bowlegs Responses
3.1.2. Knock Knees Responses
3.2. Flesch–Kincaid Readability Test Results
3.3. Comparison of Different Chatbots Responses
4. Discussion
Limitations and Future Directions
5. Conclusions
Supplementary Materials
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Panch, T.; Szolovits, P.; Atun, R. Artificial intelligence, machine learning and health systems. J. Glob. Health 2018, 8, 020303.
- Bibault, J.E.; Chaix, B.; Guillemassé, A.; Cousin, S.; Escande, A.; Perrin, M.; Pienkowski, A.; Delamon, G.; Nectoux, P.; Brouard, B. A Chatbot Versus Physicians to Provide Information for Patients With Breast Cancer: Blind, Randomized Controlled Noninferiority Trial. J. Med. Internet Res. 2019, 21, e15787.
- Khoo, K.; Bolt, P.; Babl, F.E.; Jury, S.; Goldman, R.D. Health information seeking by parents in the Internet age. J. Paediatr. Child. Health 2008, 44, 419–423.
- Laymouna, M.; Ma, Y.; Lessard, D.; Schuster, T.; Engler, K.; Lebouché, B. Roles, users, benefits, and limitations of chatbots in health care: Rapid review. J. Med. Internet Res. 2024, 26, e56930.
- Scherl, S.A. Common lower extremity problems in children. Pediatr. Rev. 2004, 25, 52–62.
- Staheli, L.T. Fundamentals of Pediatric Orthopedics, 5th ed.; Wolters Kluwer: Philadelphia, PA, USA, 2016.
- Bendig, E.; Erb, B.; Schulze-Thuesing, L.; Baumeister, H. The Next Generation: Chatbots in Clinical Psychology and Psychotherapy to Foster Mental Health–A Scoping Review. Verhaltenstherapie 2019, 32, 64–76.
- Cheng, Y.; Xie, C.; Wang, Y.; Jiang, H. Chatbots and Health: Mental Health. In The International Encyclopedia of Health Communication; Wiley: Hoboken, NJ, USA; pp. 1–6.
- Shiferaw, M.W.; Zheng, T.; Winter, A.; Mike, L.A.; Chan, L.-N. Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions. BMC Med. Inform. Decis. Mak. 2024, 24, 404.
- Saudi Center for Opinion Polling. AI Usage Trends in Saudi Arabia: Public Perception and Adoption; Saudi Center for Opinion Polling: Riyadh, Saudi Arabia, 2025.
- Bowed Legs (Genu Varum Blount’s Disease)-OrthoInfo-AAOS. Available online: https://orthoinfo.aaos.org/en/diseases--conditions/bowed-legs-blounts-disease/ (accessed on 1 February 2025).
- Bow Legs (Genu Varum) (for Parents)|Nemours KidsHealth. Available online: https://kidshealth.org/en/parents/bow-legs.html (accessed on 1 February 2025).
- OrthoKids-Bowed Legs & Knock Knees. Available online: https://orthokids.org/conditions/bowed-legs-knock-knees/ (accessed on 1 February 2025).
- Genu Varum-Bowlegs in Children: What Physicians Need to Know|Children’s Hospital Los Angeles. Available online: https://www.chla.org/blog/experts/peds-practice-tips/genu-varum-bowlegs-children-what-physicians-need-know?/ (accessed on 1 February 2025).
- Knock Knees (Genu Valgum) (for Parents)|Nemours KidsHealth. Available online: https://kidshealth.org/en/parents/knock-knees.html (accessed on 1 February 2025).
- Knock Knees|Boston Children’s Hospital. Available online: https://www.childrenshospital.org/conditions/knock-knees/ (accessed on 1 February 2025).
- Bowlegs|Boston Children’s Hospital. Available online: https://www.childrenshospital.org/conditions/bowlegs (accessed on 1 February 2025).
- Your Child’s Knocked Knees: Everything You Need to Know. Available online: https://www.jeremyburnhammd.com/knock-knees-knocked-knees-valgus/ (accessed on 1 February 2025).
- Bow Legged (Genu Varum): What Is It, Causes & Treatment. Available online: https://my.clevelandclinic.org/health/diseases/22049-bow-legged?utm_source=chatgpt.com (accessed on 1 February 2025).
- Bow legs and knock knees in children and young people | NHS inform. Available online: https://www.nhsinform.scot/illnesses-and-conditions/muscle-bone-and-joints/children-and-young-peoples-muscle-bone-and-joints/bow-legs-and-knock-knees-in-children-and-young-people/ (accessed on 1 February 2025).
- Knock knees-NHS. Available online: https://www.nhs.uk/conditions/knock-knees/ (accessed on 1 February 2025).
- Bickmore, T.W.; Trinh, H.; Olafsson, S.; O’Leary, T.K.; Asadi, R.; Rickles, N.M.; Cruz, R. Patient and consumer safety risks when using conversational assistants for medical information: An observational study of Siri, Alexa, and Google Assistant. J. Med. Internet Res. 2018, 20, e11510.
- Miner, A.S.; Milstein, A.; Schueller, S.; Hegde, R.; Mangurian, C.; Linos, E. Smartphone-based conversational agents and responses to questions about mental health, interpersonal violence, and physical health. JAMA Intern. Med. 2016, 176, 619–625.
- Vaidyam, A.N.; Wisniewski, H.; Halamka, J.D.; Kashavan, M.S.; Torous, J.B. Chatbots and conversational agents in mental health: A review of the psychiatric landscape. Can. J. Psychiatry 2019, 64, 456–464.
- Shaw, J.; Rudzicz, F.; Jamieson, T.; Goldfarb, A. Artificial intelligence and the implementation challenge. J. Med. Internet Res. 2019, 21, e13659.
- Flesch, R. A new readability yardstick. J. Appl. Psychol. 1948, 32, 221.
- Kincaid, J.P.; Fishburne, R.P., Jr.; Rogers, R.L.; Chissom, B.S. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Branch Rep. 1975, 8–75. Available online: https://stars.library.ucf.edu/istlibrary/56/?utm_sourc (accessed on 15 March 2025).
- Badarudeen, S.; Sabharwal, S. Assessing readability of patient education materials: Current role in orthopaedics. Clin. Orthop. Relat. Res. 2010, 468, 2572–2580.
- Wang, L.-W.; Miller, M.J.; Schmitt, M.R.; Wen, F.K. Assessing readability formula differences with written health information materials: Application, results, and recommendations. Res. Soc. Adm. Pharm. 2013, 9, 503–516.
- Flesch-Kincaid Readability Test and Calculator. Available online: https://hemingwayapp.com/articles/readability/flesch-kincaid-readability-test (accessed on 15 March 2025).
- Flesch Kincaid Calculator-Flesch Reading Ease Calculator. Available online: https://charactercalculator.com/flesch-reading-ease/ (accessed on 15 March 2025).
- Pirkle, S.; Yang, J.; Blumberg, T.J. Do ChatGPT and Gemini Provide Appropriate Recommendations for Pediatric Orthopaedic Conditions? J. Pediatr. Orthop. 2025, 45, e66–e71.
- MacIntyre, M.R.; Cockerill, R.G.; Mirza, O.F.; Appel, J.M. Ethical considerations for the use of artificial intelligence in medical decision-making capacity assessments. Psychiatry Res. 2023, 328, 115466.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Cao, Z.; Ma, Z.; Chen, M. An Evaluation System for Large Language Models based on Open-Ended Questions. In Proceedings of the 2024 IEEE 11th International Conference on Cyber Security and Cloud Computing (CSCloud), Shanghai, China, 28–30 June 2024; pp. 65–72.
- Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.-T.; Jin, A.; Bos, T.; Baker, L.; Du, Y. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239.
- Khaleel, I.; Wimmer, B.C.; Peterson, G.M.; Zaidi, S.T.R.; Roehrer, E.; Cummings, E.; Lee, K. Health information overload among health consumers: A scoping review. Patient Educ. Couns. 2020, 103, 15–32.
- Nadarzynski, T.; Bayley, J.; Llewellyn, C.; Kidsley, S.; Graham, C.A. Acceptability of artificial intelligence (AI)-enabled chatbots, video consultations and live webchats as online platforms for sexual health advice. BMJ Sex. Reprod. Health 2020, 46, 210–217.
- Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57.
- Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 2023, 9, e45312.
Topic | Chatbot | Flesch Reading Ease Score | Grade Level |
---|---|---|---|
Knock knees responses | ChatGPT | 61 | 8th grade |
Knock knees responses | Gemini | 51 | 10–12th grade |
Knock knees responses | Copilot | 51 | 10–12th grade |
Bow legs responses | ChatGPT | 56 | 10–12th grade |
Bow legs responses | Gemini | 50 | 10–12th grade |
Bow legs responses | Copilot | 48 | College students |
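The grade levels in the table above follow from the scores through a band lookup. The sketch below uses the conventional Flesch Reading Ease bands; the band labels, the handling of exact boundaries, and the name `grade_band` are assumptions, and the conventional band for 60–70 is "8th–9th grade", whereas the table reports "8th grade" for a score of 61.

```python
# Conventional Flesch Reading Ease bands as (lower bound, label), highest first.
# Treating each lower bound as inclusive is an assumption of this sketch.
FLESCH_BANDS = [
    (90.0, "5th grade"),
    (80.0, "6th grade"),
    (70.0, "7th grade"),
    (60.0, "8th–9th grade"),
    (50.0, "10–12th grade"),
    (30.0, "College"),
    (0.0, "College graduate"),
]


def grade_band(score: float) -> str:
    """Map a Flesch Reading Ease score to its conventional reading level."""
    for lower_bound, label in FLESCH_BANDS:
        if score >= lower_bound:
            return label
    # Very dense text can score below zero; keep it in the hardest band.
    return "College graduate"


if __name__ == "__main__":
    # Scores reported in the table above.
    for chatbot, score in [("ChatGPT", 61), ("Gemini", 51), ("Copilot", 48)]:
        print(chatbot, score, grade_band(score))
```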
Dimension | Chatbot | Average | Median | Range |
---|---|---|---|---|
Accuracy | ChatGPT | 4.87 | 5 | 0.67 |
Accuracy | Copilot | 4.74 | 4.67 | 1.33 |
Accuracy | Gemini | 4.57 | 5 | 1.67 |
Clarity | ChatGPT | 4.68 | 4.67 | 1.67 |
Clarity | Copilot | 4.70 | 4.83 | 1.33 |
Clarity | Gemini | 4.70 | 4.83 | 1.33 |
Comprehensiveness | ChatGPT | 4.84 | 5 | 1 |
Comprehensiveness | Copilot | 4.83 | 4 | 1.67 |
Comprehensiveness | Gemini | 4.44 | 5 | 1.33 |
Risk of Misleading Information | ChatGPT | 4.76 | 4.67 | 1.33 |
Risk of Misleading Information | Copilot | 4.77 | 5 | 1 |
Risk of Misleading Information | Gemini | 4.81 | 4.67 | 1 |
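Values such as 4.67 and 1.33 in the table above are multiples of one third, which would be consistent with each response's score being the mean of three raters on a 5-point scale. That reading is an assumption, as are the function name and the ratings used here. Under it, the summary columns could be computed as follows:

```python
from statistics import mean, median


def summarize(ratings_per_question: list[tuple[int, ...]]) -> dict[str, float]:
    """Average the raters for each question, then summarize across questions."""
    per_question = [mean(ratings) for ratings in ratings_per_question]
    return {
        "average": round(mean(per_question), 2),
        "median": round(median(per_question), 2),
        "range": round(max(per_question) - min(per_question), 2),
    }


if __name__ == "__main__":
    # Hypothetical ratings (three raters, five questions); not study data.
    hypothetical = [(5, 5, 5), (5, 4, 5), (4, 5, 5), (5, 5, 4), (4, 4, 5)]
    print(summarize(hypothetical))
```

The range here is the difference between the highest and lowest per-question mean, which matches the single-number form in which the table reports it.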
Variable | χ² (df = 2) | p-Value |
---|---|---|
Accuracy | 7.810 | 0.020 |
Clarity | 0.528 | 0.768 |
Comprehensiveness | 12.021 | 0.002 |
Risk of Misleading Information | 0.929 | 0.628 |
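The p-values in the table above can be checked directly from the reported statistics. For a chi-square distribution with 2 degrees of freedom, the upper-tail probability has the closed form exp(−χ²/2), so no statistics library is needed. The sketch below only re-derives the p-values; it does not reproduce the underlying test, and the function name is hypothetical.

```python
import math


def chi2_upper_tail_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# (chi-square statistic, reported p-value) pairs taken from the table above.
REPORTED = {
    "Accuracy": (7.810, 0.020),
    "Clarity": (0.528, 0.768),
    "Comprehensiveness": (12.021, 0.002),
    "Risk of Misleading Information": (0.929, 0.628),
}

if __name__ == "__main__":
    for variable, (statistic, p_reported) in REPORTED.items():
        p_computed = chi2_upper_tail_df2(statistic)
        print(f"{variable}: computed p = {p_computed:.3f}, reported p = {p_reported:.3f}")
```

With three chatbots compared, 2 degrees of freedom is what an omnibus rank-based test such as Kruskal–Wallis would give, although the table itself does not name the test.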
Variable | Comparison | Test Statistic | p-Value | Effect Size (r) |
---|---|---|---|---|
Accuracy | Copilot vs. Gemini | −5.800 | 0.286 | |
Accuracy | Copilot vs. ChatGPT | −9.650 | 0.017 | 0.507 |
Accuracy | Gemini vs. ChatGPT | −3.850 | 0.804 | |
Clarity | Copilot vs. Gemini | 0.95 | 1 | |
Clarity | Copilot vs. ChatGPT | 2.65 | 1 | |
Clarity | Gemini vs. ChatGPT | 1.7 | 1 | |
Comprehensiveness | Copilot vs. Gemini | −10.950 | 0.007 | 0.540 |
Comprehensiveness | Copilot vs. ChatGPT | −10.650 | 0.009 | 0.556 |
Comprehensiveness | Gemini vs. ChatGPT | 0.3 | 1 | |
Risk of Misleading Information | Copilot vs. Gemini | 3.15 | 1 | |
Risk of Misleading Information | Copilot vs. ChatGPT | 3 | 1 | |
Risk of Misleading Information | Gemini vs. ChatGPT | −0.150 | 1 |
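Several p-values in the table above equal exactly 1, which is what a multiplicity correction capped at 1 (for example Bonferroni over the three pairwise comparisons) would produce. The table does not state the correction or how the effect size was derived, so the sketch below rests on two assumptions: a Bonferroni adjustment, and the common convention r = |z| / √N for rank-based tests, where z is the standardized test statistic and N the number of observations in the comparison. All inputs are hypothetical.

```python
import math


def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def effect_size_r(z: float, n: int) -> float:
    """Effect size r = |z| / sqrt(N) for a standardized test statistic z."""
    return abs(z) / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical unadjusted p-values for three pairwise comparisons.
    print(bonferroni([0.5, 0.003, 0.2]))
    # Hypothetical standardized statistic and sample size.
    print(effect_size_r(-2.0, 16))
```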
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kamal, A.H. AI Chatbots in Pediatric Orthopedics: How Accurate Are Their Answers to Parents’ Questions on Bowlegs and Knock Knees? Healthcare 2025, 13, 1271. https://doi.org/10.3390/healthcare13111271