Article

An Assessment of the Performance of Different Chatbots on Shoulder and Elbow Questions

by Mohamad Y. Fares, Tarishi Parmar, Peter Boufadel, Mohammad Daher, Jonathan Berg, Austin Witt, Brian W. Hill, John G. Horneff, Adam Z. Khan and Joseph A. Abboud
1 Division of Shoulder and Elbow Surgery, Rothman Orthopaedic Institute, Philadelphia, PA 19107, USA
2 Penn State College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
3 Department of Orthopedic Surgery, The Warren Alpert Medical School, Brown University, Providence, RI 02912, USA
4 Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
5 Baylor University Medical Center, Dallas, TX 75246, USA
6 Palm Beach Orthopaedic Institute, West Palm Beach, FL 33401, USA
7 Division of Shoulder and Elbow Surgery, Department of Orthopaedics, University of Pennsylvania, Philadelphia, PA 19104, USA
8 Southern Permanente Medical Group, Pasadena, CA 91188, USA
* Author to whom correspondence should be addressed.
J. Clin. Med. 2025, 14(7), 2289; https://doi.org/10.3390/jcm14072289
Submission received: 10 February 2025 / Revised: 10 March 2025 / Accepted: 14 March 2025 / Published: 27 March 2025
(This article belongs to the Section Orthopedics)

Abstract

Background/Objectives: The utility of artificial intelligence (AI) in medical education has recently garnered significant interest, with several studies exploring its applications across various educational domains; however, its role in orthopedic education, particularly in shoulder and elbow surgery, remains scarcely studied. This study aims to evaluate the performance of multiple AI models in answering shoulder- and elbow-related questions from the AAOS ResStudy question bank. Methods: A total of 50 shoulder- and elbow-related questions from the AAOS ResStudy question bank were selected for the study. Questions were categorized according to anatomical location, topic, concept, and difficulty. Each question, along with the possible multiple-choice answers, was provided to each chatbot. The performance of each chatbot was recorded and analyzed to identify significant differences between the chatbots’ performances across various categories. Results: The overall average performance of all chatbots was 60.4%. There were significant differences in the performances of different chatbots (p = 0.034): GPT-4o performed best, answering 74% of the questions correctly. AAOS members outperformed all chatbots, with an average accuracy of 79.4%. There were no significant differences in performance between shoulder and elbow questions (p = 0.931). Topic-wise, chatbots did worse on questions relating to “Adhesive Capsulitis” than those relating to “Instability” (p = 0.013), “Nerve Injuries” (p = 0.002), and “Arthroplasty” (p = 0.028). Concept-wise, the best performance was seen in “Diagnosis” (71.4%), but there were no significant differences in scores between different chatbots. Difficulty analysis revealed that chatbots performed significantly better on easy questions (68.5%) compared to moderate (45.4%; p = 0.04) and hard questions (40.0%; p = 0.012). Conclusions: AI chatbots show promise as supplementary tools in medical education and clinical decision-making, but their limitations necessitate cautious and complementary use alongside expert human judgment.

1. Introduction

ChatGPT is an artificial intelligence (AI) large language model (LLM) developed by OpenAI that was initially released in November 2022 [1]. It functions as a chatbot to simulate natural conversations by synthesizing user input and generating responses to specific words, phrases, questions, or other prompts [2]. By January 2023, ChatGPT became the fastest growing consumer application to reach 100 million users, and its widespread implementation continues to expand [3]. Since then, multiple iterations of chatbot AI programs have been developed, including Google’s Gemini software (formerly known as Bard) and Microsoft’s CoPilot AI, which were released in March and November 2023, respectively [4,5].
The use of AI within the scientific community has garnered much attention recently, with numerous studies exploring the risks and benefits associated with its educational and clinical use [6,7,8,9,10,11]. A direct way to assess these AI models is by administering medical licensing exams or certified question banks. Kung et al. evaluated the performance of ChatGPT on the United States Medical Licensing Examination (USMLE) Step 1, Step 2CK, and Step 3, with reported accuracy rates > 50%, which led to near-passing or passing scores on the exams [12]. A similar study by Skalidis et al. administered the European Exam in Core Cardiology to ChatGPT, which scored an average total accuracy of 58.8% across all question sources [13]. Hofmann et al. showed that the newer model of ChatGPT, GPT-4, correctly answered 63.4% of Orthopaedic In-Training Examination (OITE) questions [14]. In April 2024, Vaishya et al. compared the performance of multiple chatbots, including ChatGPT-3.5, ChatGPT-4, and Gemini, on an orthopaedic qualifying examination [15]. The results showed that ChatGPT-4 answered 65/120 (54.2%) questions correctly while ChatGPT-3.5 only answered 54/120 (45%) correctly, demonstrating a significant improvement between generations [15]. Impressively, Gemini answered all 120 questions correctly [15]. These data suggest continuously improving performance of AI software relative to previous models, which may eventually guide clinical applications.
This study aims to evaluate the performance of multiple AI models (ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and CoPilot) in answering shoulder- and elbow-related questions from the American Academy of Orthopaedic Surgeons (AAOS) question bank, ResStudy. By doing so, we aim to provide insight into the knowledge of modern chatbots regarding different shoulder and elbow pathologies and evaluate their role as a potential learning tool for shoulder and elbow medical professionals.

2. Materials and Methods

The ResStudy question bank from the AAOS website was accessed in order to obtain shoulder- and elbow-related questions [16]. Only questions without figures or images were included in our study, because the majority of chatbots do not allow pictures or figures to be incorporated into their prompts. A total of 50 questions were chosen and included in our study. The questions were categorized according to whether they pertained to the shoulder or the elbow, according to explored topics (instability, adhesive capsulitis, arthroplasty/arthritis, trauma, rotator cuff, and nerve injuries), and according to explored concepts (diagnosis, management, prognosis, and anatomy). Correct answers were recorded, as well as the percentage of AAOS members who answered each question correctly. This percentage was then used to categorize the questions according to difficulty: easy (more than 75% of AAOS members answered correctly), moderate (between 50 and 75% of AAOS members answered correctly), and difficult (less than 50% of AAOS members answered correctly).
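As a minimal illustration of the difficulty rule described above (a sketch only; the function name is ours, and the handling of questions falling exactly at the 50% or 75% boundaries is our assumption, since the text does not specify it):

```python
def categorize_difficulty(pct_aaos_correct: float) -> str:
    """Label a question by the share of AAOS members who answered it correctly."""
    if pct_aaos_correct > 75:
        return "easy"          # more than 75% of members answered correctly
    elif pct_aaos_correct >= 50:
        return "moderate"      # between 50% and 75% (boundary handling assumed)
    else:
        return "difficult"     # fewer than 50% answered correctly

# Example: a question answered correctly by 62% of AAOS members
print(categorize_difficulty(62))  # -> "moderate"
```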
Each question, along with the possible multiple-choice answers, was provided to each of the following chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Gemini, and CoPilot. The response of every chatbot was recorded, and the rates of correct answers were calculated. Descriptive statistics were used to report on the performance of every chatbot for the different categories (anatomic location, topic, and difficulty). A chi-squared analysis was conducted to test for significant differences between the performances of the different chatbots in answering the questions included in our study. One-way analysis of variance (ANOVA) testing was used to compare the individual performance of the chatbots on different questions according to the level of difficulty, topics, and concepts tested. An independent t-test was used to compare the scores of the different chatbots on shoulder versus elbow questions. A p-value of less than 0.05 was considered significant. The Statistical Package for the Social Sciences (IBM SPSS, 2017) was used to conduct all statistical analyses in this study.
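The analyses above were run in IBM SPSS; the following is a rough, non-authoritative sketch of the same three tests using open-source tools (scipy), with per-chatbot accuracies taken from Tables 2, 3, and 5 and rounded. The exact grouping of observations is our assumption, so the resulting p-values need not reproduce those reported in this paper.

```python
from scipy.stats import chi2_contingency, f_oneway, ttest_ind

# Chi-squared test: correct vs. incorrect counts per chatbot (Table 2, n = 50 each).
correct = [29, 25, 25, 35, 37]             # ChatGPT 3.5, Gemini, CoPilot, GPT 4, GPT 4o
observed = [[c, 50 - c] for c in correct]  # rows: chatbots; columns: correct, incorrect
chi2, p_chi, dof, _ = chi2_contingency(observed)

# One-way ANOVA: per-chatbot accuracy grouped by question difficulty (Table 5, rounded).
easy     = [0.64, 0.58, 0.52, 0.82, 0.88]
moderate = [0.54, 0.31, 0.46, 0.46, 0.50]
hard     = [0.25, 0.50, 0.50, 0.50, 0.25]
f_stat, p_anova = f_oneway(easy, moderate, hard)

# Independent t-test: per-chatbot accuracy on shoulder vs. elbow questions (Table 3, rounded).
shoulder = [0.50, 0.54, 0.50, 0.71, 0.79]
elbow    = [0.68, 0.45, 0.50, 0.68, 0.68]
t_stat, p_t = ttest_ind(shoulder, elbow)

print(f"chi-squared p = {p_chi:.3f}, ANOVA p = {p_anova:.3f}, t-test p = {p_t:.3f}")
```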
Our study is a cross-sectional study that does not involve patients, and as such, an institutional review board was not required.

3. Results

3.1. Characteristics of Included Questions

A total of 22 questions pertaining to the elbow and 28 questions pertaining to the shoulders were included in our study (n = 50). Instability was the most explored topic overall with 16 questions (32.0%), followed by arthritis/arthroplasty at 14 (28.0%). Management was the most commonly explored concept with 17 questions (34.0%), followed by anatomy at 16 questions (32.0%). For the shoulders, the most commonly explored topic was arthroplasty (n = 10, 35.7%) and the most commonly explored concepts were management and anatomy (n = 8, 28.6% each). For the elbow, the most commonly explored topic was instability (n = 10, 45.4%), and the most commonly explored concept was management (n = 9, 41.0%). The majority of the questions were determined to be of easy difficulty (n = 33, 66.0%), 13 questions (26.0%) were determined to be of moderate difficulty, and 4 questions (8.0%) were determined to be difficult questions. Table 1 shows the distribution of the included questions in our study according to relevant anatomic location, topic, concept, and difficulty.

3.2. Overall Performance

The average performance of all chatbots was 60.4%, corresponding to roughly 30 of the 50 questions answered correctly. GPT 4o performed the best out of all the chatbots, providing the correct answer for 37 out of 50 questions (74.0%). This was followed by GPT 4 with 35 correct answers (70.0%), and ChatGPT 3.5 with 29 correct answers (58.0%). Gemini and CoPilot had the lowest scores with 25 correct answers each (50.0%) (Table 2). A statistically significant difference was observed when conducting chi-squared analysis comparing the performance of the different chatbots (p = 0.034). That being said, AAOS members performed better than all of the chatbots; on average, 79.4% of AAOS members chose the correct answer for each question.
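As a quick arithmetic check of the pooled figures quoted above (a sketch; the variable names are ours), the per-chatbot counts from Table 2 reproduce both the individual percentages and the 60.4% overall average:

```python
# Correct-answer counts per chatbot out of 50 questions (Table 2).
counts = {"ChatGPT 3.5": 29, "GPT 4": 35, "GPT 4o": 37, "Gemini": 25, "CoPilot": 25}
n_questions = 50

per_bot = {bot: c / n_questions for bot, c in counts.items()}  # e.g. GPT 4o -> 0.74
overall = sum(counts.values()) / (n_questions * len(counts))   # 151 / 250
print(per_bot, f"overall = {overall:.1%}")                     # overall = 60.4%
```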

3.3. Performance According to Shoulder Versus Elbow Questions

When exploring the shoulder questions separately, ChatGPT 4o performed the best, correctly answering 22 out of 28 questions (78.5%), followed by GPT 4, which answered 20 questions correctly (71.4%). CoPilot and ChatGPT 3.5 performed the worst with only 14 correct answers each (50.0%), while Gemini answered 15 questions correctly (53.5%). On the other hand, 80.6% of the AAOS members chose the correct answer for the questions pertaining to the shoulder (Table 3). When exploring elbow questions separately, ChatGPT 4o, GPT 4, and ChatGPT 3.5 performed equally well, each with 15 out of 22 correct answers (68.2%). CoPilot had 11 correct answers (50.0%), and Gemini performed the worst with 10 out of 22 correct answers (45.4%). When assessing the performance of AAOS members, it was observed that 77.9% answered the elbow questions correctly (Table 3). There were no significant differences in the performance of the different chatbots on shoulder questions compared to elbow questions (p = 0.931; MD: 0.702).

3.4. Performance According to Topic

The mean percentage of correct answers for the different chatbots was 71.3% for “Instability”, 80.0% for “Nerve Injuries”, 67.1% for “Arthroplasty”, 48.0% for “Rotator Cuff Tears”, 46.0% for “Trauma”, and 26.6% for “Adhesive Capsulitis” (Table 4).
For “Instability”, ChatGPT 4o performed the best with 14 out of 16 correct answers (87.5%), whereas Gemini and CoPilot performed equally worst with 9 out of 16 correct answers (56.3%). For “Nerve Injuries”, ChatGPT 3.5, GPT 4, and ChatGPT 4o answered 2 out of 2 questions correctly (100%), whereas Gemini and CoPilot answered only 1 question correctly (50.0%). For “Arthroplasty”, GPT 4 performed the best with 12 out of 14 correct answers (85.7%), whereas CoPilot performed the worst with 6 out of 14 correct answers (42.9%). For “Rotator Cuff Tears”, GPT 4 performed the best with 4 out of 5 correct answers (80.0%), whereas Gemini performed the worst with 1 out of 5 correct answers (20.0%). For “Trauma”, CoPilot and ChatGPT 4o performed equally best with 6 out of 10 correct answers (60.0%), whereas GPT 4 performed the worst with 3 correct answers (30.0%). Finally, for adhesive capsulitis, all chatbots had one out of three questions answered correctly (33.3%), except for Gemini, which had no correct answers (0%) (Table 4). When comparing the performance of the different chatbots according to topic, it was seen that the chatbots performed significantly worse on “Adhesive Capsulitis” when compared with “Instability” (p = 0.013), “Nerve Injuries” (p = 0.002), and “Arthroplasty” (p = 0.028).

3.5. Performance According to Concept

The mean percentage of correct answers for the different chatbots was 71.4% for “Diagnosis”, 57.7% for “Management”, 56.3% for “Anatomy”, and 64.0% for “Prognosis” (Table 4).
For “Diagnosis” questions, GPT 4 performed the best, answering seven out of seven questions correctly (100%), whereas CoPilot performed the worst with only two questions (28.5%) answered correctly. For “Management”, GPT 4o performed the best, answering 12 out of 17 questions (70.6%) correctly, whereas Gemini performed the worst with 8 (47.1%) answered correctly. Similarly, for “Anatomy”, GPT 4o performed the best with 13 out of 16 questions (81.3%) answered correctly, whereas Gemini performed the worst with only 6 questions (37.5%) answered correctly. Finally, for “Prognosis”, GPT 4 performed the best with 9 out of 10 questions (90.0%) answered correctly, whereas CoPilot performed the worst with only 5 questions (50.0%) answered correctly (Table 4). There were no significant differences between the performance of the different chatbots on the different tested concepts in our study (p = 0.535).

3.6. Performance According to Difficulty

The mean percentage of correct answers for the different chatbots was 68.5% for easy questions, 45.4% for moderate questions, and 40.0% for hard questions (Table 5).
For easy questions, GPT 4o performed the best, with 29 out of 33 questions (87.9%) answered correctly, whereas CoPilot performed the worst with 17 questions (51.5%) answered correctly. For moderate questions, ChatGPT 3.5 performed the best with 7 out of 13 questions (53.9%) answered correctly, whereas Gemini performed the worst with only 4 questions (30.8%) answered correctly. Finally, for hard questions, Gemini, CoPilot, and GPT 4 answered two out of four questions (50.0%) correctly, whereas ChatGPT 3.5 and GPT 4o answered one question (25.0%) correctly (Table 5). As expected, the performances of the different chatbots on easy questions were significantly better than their performances on moderate questions (p = 0.04) and hard questions (p = 0.012). There were no significant differences, however, between the performance of the chatbots on moderate questions and hard questions (p = 0.794).

4. Discussion

Our study showed that chatbots can answer around 60.4% of shoulder and elbow questions correctly, with the highest chatbot performance remaining lower than that of the general AAOS member population. Overall, GPT 4o performed the best out of the different chatbots, whereas Gemini and CoPilot had the lowest scores. There were no differences when comparing chatbot performance between shoulder and elbow questions. Chatbots performed generally well on topics like “Instability” and “Nerve Injuries”, but poorly on topics like “Adhesive Capsulitis”. Moreover, the chatbots showed relatively good performance on questions centered on diagnosis. Finally, the chatbots performed better on easy questions than on questions of moderate or hard difficulty.
Considering the popularity and hype garnered by chatbots all over the world, their overall performance on shoulder and elbow questions in our study was underwhelming. While Gemini and CoPilot scored the lowest, answering 50% of the questions correctly, GPT 4o performed the best, answering 74% correctly. Yet, on average, 79.4% of the general population of AAOS members provided the correct answer, thereby outperforming the most advanced chatbot in our study, GPT 4o. Several studies in the literature have explored the performance of different chatbots on orthopaedic-related examinations and questions [16,17,18,19,20]. Similarly to our study, human examinees have performed better than chatbots on self-assessment examination questions from the American Society for Surgery of the Hand [21]. Another study explored the performance of GPT 4 on the Orthopaedic In-Training Examination and found that the chatbot performed at the level of a third-year orthopaedic resident [19]. Moreover, one study compared the performance of ChatGPT 3.5, GPT 4, and Google Bard (now known as Gemini) on orthopaedic postgraduate exam questions [15]. Interestingly, that study showed that Bard (Gemini) achieved superior results compared with ChatGPT 3.5 and GPT 4, whereas our study showed the opposite [15]. This highlights the variability of chatbot performance according to question topic, delivery, and scoring.
When assessing chatbot performance according to different topics and concepts, no differences in performance were found between shoulder and elbow questions. However, there was a particular limitation when answering questions related to “Adhesive Capsulitis”. In addition, while there were no significant differences in performance between the different concepts explored in our study, the chatbots did poorly on questions related to “Management” and “Anatomy”. Chatbots are designed to employ natural language processing techniques to draw on large online databases of currently available literature and generate appropriate responses to different queries [22]. Hence, the performance of the different chatbots in our study is a reflection of the quality of the currently available literature, the accuracy and efficiency of the processing models utilized by the chatbots, and the clarity of the questions the chatbots are asked [22]. By categorizing the questions according to topic and concept and analyzing chatbot performance within each category, we were able to identify areas of particular weakness. According to our study, it is possible that information targeting adhesive capsulitis, as well as information dealing with shoulder and elbow “anatomy” and “management”, is not of high quality and reliability, thereby explaining the poor performance exhibited by the chatbots in our study. Perhaps focusing on publishing well-written, peer-reviewed articles on these topics could help improve the knowledge of the chatbots and enhance their ability to respond to related questions.
When assessing performance according to the perceived difficulty of the questions, the chatbots, on average, performed better on easy questions than on questions of moderate or hard difficulty. This is interesting, as it shows that the performance trends and patterns of the chatbots emulate those of human examinees. What is considered difficult for AAOS members was also difficult for the different chatbots. This is expected, as the chatbots draw on human-generated literature to formulate their responses [22]. However, it also implies that the methods by which the chatbots process the literature and perceive questions are similar to those of humans. Continuous training and updates can help improve the efficiency and knowledge of the different chatbots, leading to improved and enhanced performance. In this setting, GPT 4o performed better than GPT 4, which, in turn, performed better than ChatGPT 3.5. This reflects the aforementioned efforts to continuously improve and enhance chatbot accuracy, reliability, and performance.
To our knowledge, this is the first study to assess the performance of different chatbots in answering shoulder- and elbow-related questions. That being said, several limitations exist. One is the exclusion of questions involving figures or images, which are critical components of medical diagnostics and education but were not compatible with the text-based input capabilities of the chatbots. Further, this study evaluated the performance of these chatbots at a single point in time. However, these AI models are continually evolving, with developers frequently releasing new versions, which may lead to different results over time.

5. Conclusions

While chatbots like ChatGPT-3.5, ChatGPT-4, GPT-4o, Gemini, and CoPilot demonstrated considerable potential in answering shoulder- and elbow-related questions from the AAOS ResStudy question bank, their performance still lagged behind that of AAOS members. Specifically, GPT-4o outperformed the other chatbots with a correct answer rate of 74%, yet it still fell short of the 79.4% average accuracy achieved by AAOS members. Chatbots excelled in areas such as “Instability” and “Nerve Injuries”, but struggled significantly with “Adhesive Capsulitis”. Furthermore, the difficulty level of the questions impacted chatbot performance, mirroring human trends with better results on easier questions. These findings underscore the continuous improvement in AI capabilities and highlight the potential of chatbots as supplementary tools in medical education and diagnostics. However, they also indicate that medical students and professionals should exercise caution when relying on these bots for accurate answers to shoulder- and elbow-related questions. The results also emphasize the need for ongoing refinement and targeted enhancement, especially in areas where chatbot performance is currently deficient.

Author Contributions

Conceptualization, M.Y.F. and J.A.A.; Formal analysis, M.Y.F. and T.P.; Methodology, M.Y.F., T.P., M.D. and P.B.; Supervision, J.G.H., B.W.H., A.Z.K. and J.A.A.; Validation, M.Y.F., T.P., P.B., M.D., J.B., A.W., B.W.H., J.G.H., A.Z.K. and J.A.A.; Writing—original draft, M.Y.F., T.P., J.B., A.W., P.B. and M.D.; Writing—review & editing, M.Y.F., B.W.H., J.G.H., A.Z.K. and J.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted according to the guidelines of the Declaration of Helsinki, and no approval was needed since the data were publicly available and de-identified, and no patients were involved.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author (data are not publicly available due to privacy or ethical restrictions).

Conflicts of Interest

J.A.A. would like to disclose: Royalties or License, Consulting Fees, or Travel and Lodging from a company or supplier; Disclosures; OSTEOCENTRIC TECHNOLOGIES, ENOVIS, Insight Medical Systems, Integra LifeSciences Corporation, Trice Medical, Flexion Therapeutics Inc., Smith+Nephew Inc., ZIMMER-BIOMET, STRYKER, GLOBUS MEDICAL, INC. Stocks in: SHOULDER JAM, AEVUMED, OBERD, OTS MEDICAL, ORTHOBULLETS, ATREON, RESTORE 3D. Research support from a company or supplier as a PI; Disclosures; ENOVIS, ARTHREX. Royalties, financial, or material support from publishers; Disclosures; WOLTERS KLUWER, SLACK ORTHOPAEDICS, ELSEVIER. Board member/committee appointments for a society; Disclosures; AMERICAN SHOULDER AND ELBOW SOCIETY, MID ATLANTIC SHOULDER AND ELBOW SOCIETY, SHOULDER360, Pacira Pharmaceuticals.

References

1. OpenAI. Models GPT-3; OpenAI: San Francisco, CA, USA, 2023.
2. Salvagno, M.; Taccone, F.S.; Gerli, A.G. Can artificial intelligence help for scientific writing? Crit. Care 2023, 27, 75.
3. Eysenbach, G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation with ChatGPT and a Call for Papers. JMIR Med. Educ. 2023, 9, e46885.
4. CoPilot AI. Homepage. Available online: https://teams.copilotai.com/ (accessed on 1 June 2024).
5. Saeidnia, H.R. Welcome to the Gemini era: Google DeepMind and the information industry. Library Hi Tech News 2023; ahead of print.
6. Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare 2023, 11, 887.
7. Mondal, H.; Mondal, S. ChatGPT in academic writing: Maximizing its benefits and minimizing the risks. Indian J. Ophthalmol. 2023, 71, 3600–3606.
8. Cohen, I.G. What Should ChatGPT Mean for Bioethics? Am. J. Bioeth. 2023, 23, 8–16.
9. Kasthuri, V.S.; Glueck, J.; Pham, H.; Daher, M.; Balmaceno-Criss, M.; McDonald, C.L.; Diebo, B.G.; Daniels, A.H. Assessing the Accuracy and Reliability of AI-Generated Responses to Patient Questions Regarding Spine Surgery. JBJS 2021, 10, 2106.
10. Chalhoub, R.; Mouawad, A.; Aoun, M.; Daher, M.; El-Sett, P.; Kreichati, G.; Kharrat, K.; Sebaaly, A. Will ChatGPT be able to replace a spine surgeon in the clinical setting? World Neurosurg. 2024, 185, e648–e652.
11. Liu, J.; Wang, C.; Liu, S. Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023, 25, e48568.
12. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
13. Skalidis, I.; Cagnina, A.; Luangphiphat, W.; Mahendiran, T.; Muller, O.; Abbe, E.; Fournier, S. ChatGPT takes on the European Exam in Core Cardiology: An artificial intelligence success story? Eur. Heart J. Digit. Health 2023, 4, 279–281.
14. Hofmann, H.L.; Guerra, G.A.; Le, J.L.; Wong, A.M.; Hofmann, G.H.; Mayfield, C.K.; Petrigliano, F.A.; Liu, J.N. The Rapid Development of Artificial Intelligence: GPT-4’s Performance on Orthopedic Surgery Board Questions. Orthopedics 2024, 47, e85–e89.
15. Vaishya, R.; Iyengar, K.P.; Patralekh, M.K.; Botchu, R.; Shirodkar, K.; Jain, V.K.; Vaish, A.; Scarlat, M.M. Effectiveness of AI-powered Chatbots in responding to orthopaedic postgraduate exam questions—An observational study. Int. Orthop. 2024, 48, 1963–1969.
16. Magruder, M.L.; Rodriguez, A.N.; Wong, J.C.; Erez, O.; Piuzzi, N.S.; Scuderi, G.R.; Slover, J.D.; Oh, J.H.; Schwarzkopf, R.; Chen, A.F.; et al. Assessing Ability for ChatGPT to Answer Total Knee Arthroplasty-Related Questions. J. Arthroplast. 2024, 39, 2022–2027.
17. Fahy, S.; Oehme, S.; Milinkovic, D.; Jung, T.; Bartek, B. Assessment of Quality and Readability of Information Provided by ChatGPT in Relation to Anterior Cruciate Ligament Injury. J. Pers. Med. 2024, 14, 104.
18. Lum, Z.C. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT. Clin. Orthop. Relat. Res. 2023, 481, 1623–1630.
19. Ghanem, D.; Covarrubias, O.; Raad, M.; LaPorte, D.; Shafiq, B. ChatGPT Performs at the Level of a Third-Year Orthopaedic Surgery Resident on the Orthopaedic In-Training Examination. JB JS Open Access 2023, 8, e23.
20. Posner, K.M.; Bakus, C.; Basralian, G.; Chester, G.; Zeiman, M.; O’Malley, G.R.; Klein, G.R. Evaluating ChatGPT’s Capabilities on Orthopedic Training Examinations: An Analysis of New Image Processing Features. Cureus 2024, 16, e55945.
21. Arango, S.D.; Flynn, J.C.; Zeitlin, J.; Lorenzana, D.J.; Miller, A.J.; Wilson, M.S.; Strohl, A.B.; Weiss, L.E.; Weir, T.B. The Performance of ChatGPT on the American Society for Surgery of the Hand Self-Assessment Examination. Cureus 2024, 16, e58950.
22. Aleedy, M.; Shaiba, H.; Bezbradica, M. Generating and analyzing chatbot responses using natural language processing. Int. J. Adv. Comput. Sci. Appl. 2019, 10.
Table 1. Distribution of Questions (n = 50) according to relevant anatomic location, topic, concept, and difficulty.

Anatomic Location      n    %
Shoulder               28   56
Elbow                  22   44

Topic                  n    %
Instability            16   32
Neuro                  2    4
Arthroplasty           14   28
Rotator Cuff Tear      5    10
Trauma                 10   20
Adhesive Capsulitis    3    6

Concept                n    %
Diagnosis              7    14
Management             17   34
Anatomy                16   32
Prognosis              10   20

Difficulty             n    %
Easy                   33   66
Moderate               13   26
Hard                   4    8
Table 2. Performance of different chatbots on shoulder and elbow questions.

Chatbot       Correct Answers (out of 50)   % Correct
GPT 4o        37                            74.0
GPT 4         35                            70.0
ChatGPT 3.5   29                            58.0
Gemini        25                            50.0
CoPilot       25                            50.0
Table 3. Comparison of chatbot performance between shoulder-related and elbow-related questions.

                                         ChatGPT 3.5   Gemini   CoPilot   GPT 4   GPT 4o   AAOS Members
Elbow     Number of correct answers      15            10       11        15      15       N/A
          Percentage of correct answers  68.18         45.45    50.00     68.18   68.18    80.60
Shoulder  Number of correct answers      14            15       14        20      22       N/A
          Percentage of correct answers  50.00         53.50    50.00     71.40   78.50    77.90
Table 4. Performance of different chatbots on shoulder and elbow questions according to topic and concept (% correct).

                         Total Questions   ChatGPT 3.5   Gemini   CoPilot   GPT 4   GPT 4o   Overall Mean
Topic
  Instability            16                75.00         56.25    56.25     81.25   87.50    71.25
  Neuro                  2                 100           50       50        100     100      80
  Arthroplasty           14                57            71.4     42.9      85.7    78.6     67
  Rotator Cuff Tear      5                 40            20       40        80      60       48
  Trauma                 10                40            40       60        30      60       46
  Adhesive Capsulitis    3                 33.3          0        33.3      33.3    33.3     26.7
Concept
  Diagnosis              7                 71.43         71.43    28.57     100     85.71    71.43
  Management             17                58.82         47.06    52.94     58.82   70.59    57.65
  Anatomy                16                50.00         37.50    56.25     56.25   81.25    56.25
  Prognosis              10                60            60       50        90      60       64
Table 5. Performance of different chatbots on shoulder and elbow questions according to difficulty (% correct).

Difficulty   Total Questions   ChatGPT 3.5   Gemini   CoPilot   GPT 4   GPT 4o   Overall Mean
Easy         33                63.64         57.58    51.52     81.82   87.88    68.48
Moderate     13                53.85         30.77    46.15     46.15   50.00    45.38
Hard         4                 25.00         50.00    50.00     50.00   25.00    40.00
