Article

To Self-Treat or Not to Self-Treat: Evaluating the Diagnostic, Advisory and Referral Effectiveness of ChatGPT Responses to the Most Common Musculoskeletal Disorders

Department of Orthopedics and Traumatology, Marmara University Pendik Training and Research Hospital, 34890 Istanbul, Turkey
* Author to whom correspondence should be addressed.
Diagnostics 2025, 15(14), 1834; https://doi.org/10.3390/diagnostics15141834
Submission received: 2 June 2025 / Revised: 5 July 2025 / Accepted: 19 July 2025 / Published: 21 July 2025

Abstract

Background/Objectives: The increased accessibility of information has resulted in a rise in patients trying to self-diagnose and opting for self-medication, either as a primary treatment or as a supplement to medical care. Our objective was to evaluate the reliability, comprehensibility, and readability of the responses provided by ChatGPT 4.0 when queried about the most prevalent orthopaedic problems, thus ascertaining the occurrence of misguidance and the necessity for an audit of the disseminated information. Methods: ChatGPT 4.0 was presented with 26 open-ended questions. The responses were evaluated by two observers using a Likert scale in the categories of diagnosis, recommendation, and referral. The scores were subjected to subgroup analysis according to the area of interest (AoI) and anatomical region. The readability and comprehensibility of the chatbot’s responses were analyzed using the Flesch–Kincaid Reading Ease Score (FRES) and Flesch–Kincaid Grade Level (FKGL). Results: The majority of the responses were rated as either ‘adequate’ or ‘excellent’. However, in the diagnosis category, a significant difference was found between AoIs (p = 0.007), attributable to the trauma-related questions. No significant difference was identified in any other category. The mean FKGL score was 7.8 ± 1.267, and the mean FRES was 52.68 ± 8.6. The average estimated reading level required to understand the text was “high school”. Conclusions: ChatGPT 4.0 facilitates the self-diagnosis and self-treatment tendencies of patients with musculoskeletal disorders. However, it is imperative for patients to have a robust understanding of the limitations of chatbot-generated advice, particularly in trauma-related conditions.

1. Introduction

The significance of herbal medicines and conventional treatment protocols such as RICE (rest, ice, compression, elevation) in the management of musculoskeletal disorders has resulted in an increase in patients’ attempts at self-diagnosis and self-treatment [1,2,3,4,5]. The advent of the internet and social media platforms, coupled with the increasing accessibility of artificial intelligence, has led to a notable enhancement in the ease with which information can be accessed on a daily basis [6,7,8]. The increased accessibility of information has resulted in a rise in patients trying to self-diagnose, conducting preliminary research on their diseases, and opting for self-medication, either as a primary treatment or as a supplement to medical care [5,6,9]. Furthermore, the global experience of the 2020 pandemic has contributed to a heightened awareness of the importance of self-care and a shift in attitudes towards medical visits [10,11,12]. The prevailing concern regarding infection in healthcare facilities, compounded by the imperative to self-isolate for extended durations, has potentially resulted in a significant escalation in the tendency to self-diagnose and self-medicate. A study conducted in 2024 found that 38.8% of individuals with orthopaedic conditions currently conduct preliminary research before consulting a physician, and 24.5% try to treat themselves with herbal therapies [5].
ChatGPT, launched in 2022, is built on a generative pre-trained transformer (GPT) architecture. It is a large language model, commonly referred to as a chatbot, that is widely used for information retrieval in healthcare, including orthopaedics. Numerous publications in the extant literature address the utilization of chatbots in orthopaedics, with a focus on various facets of these systems, ranging from their role in academic writing processes to their capacity to address common inquiries on specialized subjects [8,13,14,15]. Nevertheless, there remain numerous unanswered questions in the literature concerning the consistency, scope, guidance capabilities, and capacity of these large language models to generate patient-based recommendations [16]. To our knowledge, a number of studies have examined whether patients are guided correctly when they use AI-powered chatbots to investigate and attempt to self-treat musculoskeletal problems. In 2023, Gwak et al. stated in their study that ChatGPT can provide generally useful medical information and treatment options to patients unfamiliar with shoulder impingement syndrome. However, they also noted the possibility that the information may contain biased or inappropriate content [17]. In 2024, Ah-Yan et al. reported that AI-assisted chatbots for self-management of low back pain provided reliable medical advice and helped patients to self-manage [18]. In contrast, in April 2025, Tabanlı and Demirkıran emphasized the limitations of ChatGPT in addressing psychosocial concerns in low back pain and reported the need for clinician supervision [19]. In 2025, Safran and Yildirim investigated the efficacy of AI-assisted chatbots in musculoskeletal rehabilitation. They concluded that the use of AI-assisted chatbots should remain complementary due to the observed limitations of ChatGPT in consistency, completeness and patient-specific clinical assessment ability [20].
As is evident, there is a lack of consensus in the literature on the self-treatment of musculoskeletal problems with AI-assisted chatbots, and the fact that the majority of studies focus on a single specific disorder is a significant limitation. The objective of this study is to evaluate the effectiveness, reliability, readability, and comprehensibility of the responses provided by an AI-powered chatbot when questioned about the most prevalent orthopaedic problems. Consequently, this study aims to ascertain the occurrence of misguidance and the necessity for an audit of the disseminated information.

2. Materials and Methods

ChatGPT 4.0 (ChatGPT Version 4.0; OpenAI, 2024) was presented with 26 open-ended questions, designed to cover the most common pathologies that result in referral to orthopaedic outpatient clinics. The questions were prepared in English by two board-certified orthopaedic surgeons who were actively involved in primary patient care, had participated in survey studies previously, and have native English proficiency. In the preparation of the questions, the most common reasons for referral to orthopaedic outpatient clinics were thoroughly considered, and the content was designed to span a range of levels, from general to specific. The questions were meticulously designed to mirror the standardized user experience and crafted to be asked by the patients themselves (see Table 1 for details). The questions were presented in a single trial to evaluate the immediate response performance of ChatGPT 4.0, and the answers were recorded. The authors state that artificial intelligence was used only at this point in the study process and that no support was received in any other process of the preparation of the manuscript. No ethics committee approval was obtained for this study, in which no patient or personal data was used.
The responses were evaluated at two-month intervals, with intra-rater reliability being assessed (Table 2). The objective of these two distinct evaluations was to demonstrate the consistency and concordance of the authors’ scoring. In the evaluation conducted two months later, the questions were not asked again to ChatGPT 4.0; instead, the answers recorded in the first questioning were re-evaluated. This was a precautionary measure to ensure consistency in the evaluation process and to prevent discrepancies in the authors’ scores arising from any updates to the AI system during the two-month period. The evaluation of the responses was conducted employing Likert scaling. In accordance with this methodology, each author assigned each response a rating from 1 to 5, with 1 representing “flawed”, 2 representing “inadequate”, 3 representing “acceptable”, 4 representing “adequate”, and 5 representing “excellent”. The evaluations were conducted in three distinct categories for each question: “diagnosis” (diagnostic accuracy of the response, whether it can identify the targeted diagnosis), “recommendation” (whether self-treatment was recommended and the effectiveness of the self-treatment) and “referral” (whether the chatbot directed the patient to a physician and the accuracy of the referral time). The responses generated by ChatGPT 4.0 were then subjected to a detailed analysis, which involved the categorization of these responses into distinct subgroups. This categorization was based on the specific areas of interest of orthopaedics, namely pediatric, sport, degenerative, infective, and trauma, as well as anatomical regions, namely upper extremity, spine, hip, knee, and foot & ankle. The “degenerative” subheading was planned to cover deformities secondary to degeneration, and the “trauma” subheading was planned to cover complications after trauma surgery.
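The article does not describe how the ratings were tabulated; purely as an illustration (the field names and values below are hypothetical, not taken from the study's data sheet), the per-response ratings and the AoI/anatomy subgroup summaries could be organized as follows:

import statistics

# Hypothetical record layout for the two raters' Likert scores per response.
ratings = [
    {"question": 1,  "aoi": "Ped", "anat": "Knee", "diagnosis": 5, "recommendation": 4, "referral": 5},
    {"question": 23, "aoi": "Tra", "anat": "F&A",  "diagnosis": 4, "recommendation": 5, "referral": 5},
]

def summarize(records, group_key, category):
    """Mean ± SD of one Likert category, grouped by AoI or anatomical region."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r[category])
    return {g: (statistics.mean(v), statistics.stdev(v) if len(v) > 1 else 0.0)
            for g, v in groups.items()}

print(summarize(ratings, "aoi", "diagnosis"))  # e.g. {'Ped': (5, 0.0), 'Tra': (4, 0.0)}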
In addition to the scope of ChatGPT’s responses to patient inquiries, another salient point pertains to the readability and the comprehensibility of the provided answers. As mentioned before, the questions asked of ChatGPT 4.0 were designed to emulate those asked by patients. The readability and the comprehensibility of the responses were evaluated employing the Flesch–Kincaid Reading Ease Score (FRES) and Flesch–Kincaid Grade Level (FKGL). The FRES ranges from 0 to 100, with higher values indicating greater ease of reading. The FRES has been demonstrated to serve as a reliable predictor of the level of education required to comprehend the text, otherwise known as the “estimated reading level”. The FKGL, in contrast, expresses readability as a school grade level, with higher values indicating greater complexity. That is to say, a text with a higher FRES and a lower FKGL score is easier to understand, even for individuals with lower levels of education, and is not complex [21,22,23,24]. Each response to each question was evaluated independently using these scales, and the readability of the response and the level of education required to comprehend the text were recorded.
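The article does not state which tool was used to compute these scores; for reference, the standard Flesch formulas can be sketched in a few lines of Python. Note that the syllable counter below is only a crude vowel-group heuristic, not the rule set applied by dedicated readability software:

import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per group of consecutive vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) using the standard Flesch-Kincaid formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl

print(flesch_scores("Rest the ankle and apply ice. See a doctor if the swelling persists."))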
Statistical analyses were conducted utilizing IBM® SPSS® Statistics Version 26. Mean and standard deviation were employed as descriptive statistics. Cohen’s Kappa analysis was used to assess intra-rater reliability, Kruskal–Wallis analysis was used to ascertain statistical differences between multiple groups and Mann–Whitney U analysis was used for post hoc evaluations. p values less than 0.05 were considered significant. The level of agreement was interpreted using kappa values.
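The analyses above were run in SPSS; an equivalent open-source workflow, shown here with purely illustrative Likert ratings (scipy and scikit-learn are assumptions on our part, not the software used in the study), would be:

from scipy.stats import kruskal, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Intra-rater reliability: one author's scores for the same responses at the
# first and second evaluation (illustrative values only).
first_pass  = [5, 4, 5, 3, 5, 4, 5, 5]
second_pass = [5, 4, 5, 4, 5, 4, 5, 5]
kappa = cohen_kappa_score(first_pass, second_pass)

# Diagnosis scores compared across areas of interest (Kruskal-Wallis), with
# pairwise Mann-Whitney U tests as post hoc comparisons when significant.
trauma = [3, 4, 4, 4]
sport  = [5, 5, 5, 5, 5, 4, 5]
degen  = [5, 5, 5, 5, 5, 5]
h_stat, p_overall = kruskal(trauma, sport, degen)
if p_overall < 0.05:
    u_stat, p_trauma_vs_sport = mannwhitneyu(trauma, sport, alternative="two-sided")

print(f"kappa={kappa:.3f}, Kruskal-Wallis p={p_overall:.3f}")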

3. Results

In this study, ChatGPT 4.0 was found to provide consistently high-quality responses across all categories and evaluation criteria, with the majority of the answers being rated as ‘adequate’ or ‘excellent’ (see Table 3). However, the findings of both authors indicate a significant difference in the diagnostic aspect of the responses according to the area of interest (p = 0.007 and p = 0.001, respectively) (Table 3). Subsequent post hoc analyses revealed that this difference was attributable to the trauma-related questions (Table 4). Subgroup analyses according to anatomical region revealed no significant differences in the diagnosis, recommendation, and referral aspects of the responses (Table 3).
The mean FKGL score was calculated as 7.8 ± 1.267 (range: 5.8–10.9), and the mean FRES was calculated as 52.68 ± 8.6 (range: 30.2–65). The average estimated reading level required to understand the text was rated as “high school”. In the subgroup analyses according to area of interest and anatomical region, no significant difference was observed between the groups in terms of readability and comprehensibility of the text (p > 0.05), although the average estimated reading level required to fully understand the answers was at the “college” level for the infection-related and knee-related questions (Table 5).

4. Discussion

The relentless progression of technology, resulting in the affordability and accessibility of information, has led to a growing interest in self-diagnosis and self-treatment, driven by people’s desire to solve their own problems and their fear of hospitals and infection [6,7,8,9,10,11,12]. The most significant contribution of our study to the existing literature is the evaluation of the responses of ChatGPT 4.0, one of the latest advances in this technology, to the musculoskeletal disorders most frequently encountered in referrals to orthopaedic outpatient clinics. Moreover, the evaluation of the responses focused not only on their diagnostic efficiency but also on the validity of the self-treatment suggestions and the timing of referral to a physician. The most striking finding of our study was that although the majority of the responses were found to be “adequate” or “excellent”, the diagnostic effectiveness of the responses to trauma-related questions was weaker (p < 0.05). Furthermore, the average educational level required to read and comprehend the responses was found to be “high school”. However, while not reaching statistical significance, a “college” level of education was required to comprehend the answers to the infection-related and knee-related questions.
The literature contains a number of contradictory opinions regarding the reliability of ChatGPT in healthcare. A significant number of studies conducted in 2023 and 2024 investigated the utilization of ChatGPT in the fields of sports medicine, pediatric orthopaedics and hip and knee arthroplasties, yielding favorable outcomes [15,25,26,27,28,29,30,31]. Conversely, Wright et al. documented a satisfactory response rate of only 59.2% for ChatGPT in inquiries concerning total knee and hip arthroplasties in 2024 [32]. Johns et al. investigated the reliability of ChatGPT regarding anterior cruciate ligament reconstruction in 2024 and stated that most of the responses were “outdated” [33]. In 2025, Schwartzman et al. examined the sources of information used by ChatGPT and reported many repetitions and invalid references [16]. In the present study, it was observed that the diagnostic efficiency (4.77 ± 0.59 and 4.88 ± 0.33, respectively), recommendation capability (4.54 ± 0.86 and 4.62 ± 0.8, respectively), and referral accuracy (4.88 ± 0.43 and 4.96 ± 0.2, respectively) of the responses of ChatGPT 4.0 were considered to be highly satisfactory. Nevertheless, the reliability of ChatGPT’s diagnostic capability in trauma-related questions remains a subject of debate, according to the subgroup analysis conducted within the area of interest (Table 3 and Table 4). Both authors rated the diagnostic efficiency of the chatbot’s responses regarding trauma complications as insufficient (3.75 ± 0.96 and 4.25 ± 0.5, respectively). While the majority of responses were evaluated as “adequate” or “excellent”, the observed differences in diagnosis for trauma-related questions raise concerns about the accuracy of AI-powered chatbots’ knowledge of musculoskeletal health. A first hypothesis for this finding is the sudden, unplanned nature of trauma. The utilization of an AI-powered chatbot, such as ChatGPT 4.0, enables access to pre-existing knowledge found in the extant literature. However, it is important to note that the management of trauma and associated complications does not necessarily follow a textbook approach but is rather patient-based and often involves a combined approach. It is evident that ChatGPT’s capacity to formulate tailored, context-appropriate recommendations is limited, thus necessitating the supervision of a clinician. In the context of online research aimed at self-diagnosis, consultation with expert opinion is imperative, particularly in cases pertaining to trauma-related questions. Nevertheless, the high ratings for self-treatment and referral can be regarded as encouraging indicators that patients are receiving appropriate guidance.
A fundamental aspect of patients undertaking their own research is the importance of comprehensible information. While a substantial amount of evidence-based information is made available in the literature, on physicians’ own websites or on physician education platforms, it is not always possible for patients to properly understand this information, which includes technical terms. Consequently, patients often resort to alternative sources, such as social media or AI-powered chatbots. However, the extent to which these platforms use plain language and fewer technical terms is another topic that requires further research. There have been several reports in the literature on the readability and comprehensibility of ChatGPT responses. Gül et al. reported that ChatGPT responses on subdural hematomas were complex, difficult to read and required a university degree to understand [24]. Reyhan et al. found ChatGPT responses to questions about keratoconus to be difficult to read and understand [34]. Hancı et al., in their study published in 2024, reported that the readability of ChatGPT responses on palliative care was not sufficient [23]. The present study also set out to evaluate the readability and comprehensibility of the answers provided by ChatGPT 4.0 and to analyze the average level of education required to understand these answers. The mean FKGL score in this study was found to be 7.8 ± 1.267 (range: 5.8–10.9), with a mean FRES of 52.68 ± 8.6 (range: 30.2–65). These results indicate that the responses fell within the accessible range for “high school” graduates. This suggests that ChatGPT 4.0 has the potential to address a broad demographic of patients in terms of musculoskeletal health, though potentially at the expense of nuance and depth in complex cases. However, it is crucial to emphasize that, despite the language being simplified, crucial clinical details that a healthcare professional would provide may be unintentionally excluded. The contrast between the results of our study and the literature may be related to the content evaluated. Most of the readability analyses in the literature focus on specific conditions (keratoconus, subdural hematoma), whereas our study focused on the most common musculoskeletal disorders. This may have resulted in fewer technical terms and more understandable responses from the chatbot. Indeed, readability declined for the knee-related and infection-related questions, where the condition under investigation is more specific.
The significance of artificial intelligence (AI) in healthcare is progressively growing in all areas, including patient education and guidance. In the context of musculoskeletal disorders, where self-diagnosis and self-treatment are prevalent, AI-supported chatbots emerge as a prominent solution. These chatbots offer instantaneous accessibility and rapid response capabilities, rendering them a valuable resource. In this study, we observed that ChatGPT 4.0 provides unbiased, comprehensible, and exhaustive answers in the domains of diagnosis, recommendation, and referral for common musculoskeletal disorders, with no erroneous content and generally satisfactory answers. These answers can be used to support patients’ self-diagnosis and self-treatment tendencies. Conversely, it is imperative to acknowledge that AI and AI-assisted chatbots are subject to perpetual development and refinement. The contents of our work, and of all AI-related works, may become outdated as the system evolves and updates. Subsequent studies may offer additional insights into this matter. Nevertheless, we are confident in the validity and reliability of the results of our study, at least for the present moment. However, it is also important to acknowledge the limitations of this study. Firstly, although a readability analysis was conducted, variables such as education level and occupation may have affected individual reading skills. The most significant shortcoming of the study is that it was a one-way assessment and readability analysis that did not investigate what patients actually understood from the responses. Furthermore, the analysis was restricted to the 26 most prevalent problems referred to orthopaedic outpatient clinics. While this approach provides a general overview, it would be beneficial to employ more comprehensive and diverse questioning and to analyze the responses in order to evaluate the limitations of the chatbot, which was found to be diagnostically inadequate for trauma-related issues. Additionally, the selection of questions, crafted to encompass a wide demographic of orthopaedic patients, may have contributed to ChatGPT’s favorable performance. As previously documented, ChatGPT is inadequate for use in specific patient-based situations. A detailed assessment of a specific area rather than a general approach may yield different results. Moreover, while the study constructed patient scenarios based on the authors’ real patient-based experiences, the absence of real patient data that would make the findings more robust and clinically meaningful is a significant limitation.

5. Conclusions

ChatGPT 4.0 is a noteworthy platform that facilitates the self-diagnosis and self-treatment tendencies of patients with musculoskeletal disorders who wish to conduct their own research and access satisfactory answers. However, it is imperative for patients to have a robust understanding of the limitations of chatbot-generated advice, particularly in cases involving musculoskeletal conditions that necessitate the expertise of a professional, such as trauma-related conditions. The findings of this study emphasize the necessity for ongoing enhancement of AI-driven healthcare tools, with a particular emphasis on the improvement of diagnostic accuracy and the assurance of the comprehensibility of medical content.

Author Contributions

Conceptualization, U.A. and B.G.; methodology, U.A. and B.G.; software, U.A.; validation, U.A.; formal analysis, B.G.; investigation, U.A. and B.G.; resources, U.A. and B.G.; data curation, B.G.; writing—original draft preparation, B.G.; writing—review and editing, U.A.; visualization, U.A. and B.G.; supervision, U.A.; project administration, B.G.; funding acquisition, U.A. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study since the research was limited to evaluating an AI system using hypothetical scenarios; it did not include human participants or animal subjects. As no personal data protection or ethical concerns were raised, no ethics committee approval or institutional approval was obtained.

Informed Consent Statement

Patient consent was waived since the research was limited to evaluating an AI system using hypothetical scenarios, and it did not include human participants or animal subjects. Written informed consent has not been obtained from the patient(s) to publish this paper since no patients are involved in the study.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AoI: Area of Interest
ChatGPT: Chat Generative Pre-trained Transformer
Deg: Degenerative
F&A: Foot and Ankle
FKGL: Flesch–Kincaid Grade Level
FRES: Flesch–Kincaid Reading Ease Score
Inf: Infective
Ped: Pediatric
PRP: Platelet-Rich Plasma
SPSS: Statistical Package for the Social Sciences
Tra: Trauma
Up: Upper Extremity

References

  1. Rispler, D.T.; Sara, J. The impact of complementary and alternative treatment modalities on the care of orthopaedic patients. J. Am. Acad. Orthop. Surg. 2011, 19, 634–643. [Google Scholar] [CrossRef]
  2. Line, S.; Nguyen, E.T.; Marsh, L.; Fry, C. Problems With Medium-Sized Joints: Ankle Conditions. FP Essent. 2023, 535, 25–36. [Google Scholar] [PubMed]
  3. Zhou, L.; Sun, K.; Chen, Y.; Chen, G.L.; Deng, D.J.; Jiao, G.L.; Li, Z.Z. Efficacy of Shangbai ointment in alleviating pain in patients with acute ankle joint lateral collateral ligament injury: A randomized controlled trial. J. South. Med. Univ. 2017, 37, 398–401. (In Chinese) [Google Scholar] [CrossRef]
  4. Gogate, N.; Satpute, K.; Hall, T. The effectiveness of mobilization with movement on pain, balance and function following acute and sub-acute inversion ankle sprain—A randomized, placebo controlled trial. Phys. Ther. Sport 2021, 48, 91–100. [Google Scholar] [CrossRef] [PubMed]
  5. Gencer, B.; Doğan, Ö.; Çulcu, A.; Ülgen, N.K.; Çamoğlu, C.; Arslan, M.M.; Mert, O.; Yiğit, A.; Yeni, T.B.; Hanege, F.; et al. Internet and social media preferences of orthopaedic patients vary according to factors such as age and education levels. Health Inf. Libr. J. 2024, 41, 84–97. [Google Scholar] [CrossRef]
  6. Duymus, T.M.; Karadeniz, H.; Çaçan, M.A.; Kömür, B.; Demirtaş, A.; Zehir, S.; Azboy, İ. Internet and social media usage of orthopaedic patients: A questionnaire-based survey. World J. Orthop. 2017, 8, 178–186. [Google Scholar] [CrossRef]
  7. Kaplan, B. Revisiting Health Information Technology Ethical, Legal, And Social Issues And Evaluation: Telehealth/Telemedicine And Covid-19. Int. J. Med. Inform. 2020, 143, 104239. [Google Scholar] [CrossRef]
  8. Morya, V.K.; Lee, H.W.; Shahid, H.; Magar, A.G.; Lee, J.H.; Kim, J.H.; Jun, L.; Noh, K.C. Application of ChatGPT for Orthopedic Surgeries and Patient Care. Clin. Orthop. Surg. 2024, 16, 347–356. [Google Scholar] [CrossRef]
  9. Curry, E.; Li, X.; Nguyen, J.; Matzkin, E. Prevalence of internet and social media usage in orthopedic surgery. Orthop. Rev. 2014, 6, 5483. [Google Scholar] [CrossRef]
  10. Yavuz, İ.A.; Kahve, Y.; Aydin, T.; Gencer, B.; Bingöl, O.; Yıldırım, A.Ö. Comparison of the first and second waves of the COVID-19 pandemic with a normal period in terms of orthopaedic trauma: Data from a level 1 trauma centre. Acta Orthop. Traumatol. Turc. 2021, 55, 391–395. [Google Scholar] [CrossRef]
  11. Gencer, B.; Doğan, Ö. Consequences of the COVID-19 pandemic on fracture distribution: Epidemiological data from a tertiary trauma center in Turkey. J. Exp. Clin. Med. 2022, 39, 128–133. [Google Scholar] [CrossRef]
  12. Gencer, B.; Çulcu, A.; Doğan, Ö. COVID-19 exposure and health status of orthopedic residents: A survey study. J. Exp. Clin. Med. 2022, 39, 337–341. [Google Scholar] [CrossRef]
  13. Salvagno, M.; Taccone, F.S.; Gerli, A.G. Can artificial intelligence help for scientific writing? Crit. Care 2023, 27, 75, Erratum in Crit. Care. 2023, 27, 99. [Google Scholar] [CrossRef] [PubMed]
  14. Giorgino, R.; Alessandri-Bonetti, M.; Luca, A.; Migliorini, F.; Rossi, N.; Peretti, G.M.; Mangiavini, L. ChatGPT in orthopedics: A narrative review exploring the potential of artificial intelligence in orthopedic practice. Front. Surg. 2023, 10, 1284015. [Google Scholar] [CrossRef]
  15. Hu, X.; Niemann, M.; Kienzle, A.; Braun, K.; Back, D.A.; Gwinner, C.; Renz, N.; Stoeckle, U.; Trampuz, A.; Meller, S. Evaluating ChatGPT responses to frequently asked patient questions regarding periprosthetic joint infection after total hip and knee arthroplasty. Digit. Health 2024, 10, 20552076241272620. [Google Scholar] [CrossRef]
  16. Schwartzman, J.D.; Shaath, M.K.; Kerr, M.S.; Green, C.C.; Haidukewych, G.J. ChatGPT is an Unreliable Source of Peer-Reviewed Information for Common Total Knee and Hip Arthroplasty Patient Questions. Adv. Orthop. 2025, 2025, 5534704. [Google Scholar] [CrossRef]
  17. Gwak, G.T.; Hwang, U.J.; Jung, S.H.; Kim, J.H. Search for Medical Information and Treatment Options for Musculoskeletal Disorders through an Artificial Intelligence Chatbot: Focusing on Shoulder Impingement Syndrome. J. Musculoskelet. Sci. Technol. 2023, 7, 8–16. [Google Scholar] [CrossRef]
  18. Ah-Yan, C.; Boissonnault, È.; Boudier-Revéret, M.; Mares, C. Impact of artificial intelligence in managing musculoskeletal pathologies in physiatry: A qualitative observational study evaluating the potential use of ChatGPT versus Copilot for patient information and clinical advice on low back pain. J. Yeungnam Med. Sci. 2025, 42, 11. [Google Scholar] [CrossRef]
  19. Tabanli, A.; Demirkiran, N.D. Comparing ChatGPT 3.5 and 4.0 in Low Back Pain Patient Education: Addressing Strengths, Limitations, and Psychosocial Challenges. World Neurosurg. 2025, 196, 123755. [Google Scholar] [CrossRef]
  20. Safran, E.; Yildirim, S. A cross-sectional study on ChatGPT’s alignment with clinical practice guidelines in musculoskeletal rehabilitation. BMC Musculoskelet. Disord. 2025, 26, 411. [Google Scholar] [CrossRef]
  21. Kilkenny, C.J.; Davey, M.S.; O’Sullivan, D.; Medlar, C.; O’ Driscoll, C.; O’Daly, B. Evaluation of the quality of information provided by ChatGPT on pelvic and acetabular surgery. J. Orthop. Rep. 2025, 4, 100561. [Google Scholar] [CrossRef]
  22. Friedman, D.B.; Hoffman-Goetz, L. A systematic review of readability and comprehension instruments used for print and web-based cancer information. Health Educ. Behav. 2006, 33, 352–373. [Google Scholar] [CrossRef]
  23. Hancı, V.; Ergün, B.; Gül, Ş.; Uzun, Ö.; Erdemir, İ.; Hancı, F.B. Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care. Medicine 2024, 103, e39305. [Google Scholar] [CrossRef] [PubMed]
  24. Gül, Ş.; Erdemir, İ.; Hanci, V.; Aydoğmuş, E.; Erkoç, Y.S. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and perplexity responses. Medicine 2024, 103, e38009. [Google Scholar] [CrossRef] [PubMed]
  25. Magruder, M.L.; Rodriguez, A.N.; Wong, J.C.J.; Erez, O.; Piuzzi, N.S.; Scuderi, G.R.; Slover, J.D.; Oh, J.H.; Schwarzkopf, R.; Chen, A.F.; et al. Assessing Ability for ChatGPT to Answer Total Knee Arthroplasty-Related Questions. J. Arthroplast. 2024, 39, 2022–2027. [Google Scholar] [CrossRef] [PubMed]
  26. Mika, A.P.; Martin, J.R.; Engstrom, S.M.; Polkowski, G.G.; Wilson, J.M. Assessing ChatGPT Responses to Common Patient Questions Regarding Total Hip Arthroplasty. J. Bone Jt. Surg. Am. 2023, 105, 1519–1526. [Google Scholar] [CrossRef]
  27. Giorgino, R.; Alessandri-Bonetti, M.; Del Re, M.; Verdoni, F.; Peretti, G.M.; Mangiavini, L. Google Bard and ChatGPT in Orthopedics: Which Is the Better Doctor in Sports Medicine and Pediatric Orthopedics? The Role of AI in Patient Education. Diagnostics 2024, 14, 1253. [Google Scholar] [CrossRef]
  28. Kunze, K.N.; Varady, N.H.; Mazzucco, M.; Lu, A.Z.; Chahla, J.; Martin, R.K.; Ranawat, A.S.; Pearle, A.D.; Williams, R.J., 3rd. The Large Language Model ChatGPT-4 Exhibits Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting With Various Causes of Knee Pain. Arthroscopy 2025, 41, 1438–1447.e14. [Google Scholar] [CrossRef]
  29. Shrestha, N.; Shen, Z.; Zaidat, B.; Duey, A.H.; Tang, J.E.; Ahmed, W.; Hoang, T.; Restrepo Mejia, M.; Rajjoub, R.; Markowitz, J.S.; et al. Performance of ChatGPT on NASS Clinical Guidelines for the Diagnosis and Treatment of Low Back Pain: A Comparison Study. Spine 2024, 49, 640–651. [Google Scholar] [CrossRef]
  30. Adelstein, J.M.; Sinkler, M.A.; Li, L.T.; Mistovich, R.J. ChatGPT Responses to Common Questions About Slipped Capital Femoral Epiphysis: A Reliable Resource for Parents? J. Pediatr. Orthop. 2024, 44, 353–357. [Google Scholar] [CrossRef]
  31. Wrenn, S.P.; Mika, A.P.; Ponce, R.B.; Mitchell, P.M. Evaluating ChatGPT’s Ability to Answer Common Patient Questions Regarding Hip Fracture. J. Am. Acad. Orthop. Surg. 2024, 32, 656–659. [Google Scholar] [CrossRef]
  32. Wright, B.M.; Bodnar, M.S.; Moore, A.D.; Maseda, M.C.; Kucharik, M.P.; Diaz, C.C.; Schmidt, C.M.; Mir, H.R. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt. Open. 2024, 5, 139–146. [Google Scholar] [CrossRef]
  33. Johns, W.L.; Martinazzi, B.J.; Miltenberg, B.; Nam, H.H.; Hammoud, S. ChatGPT Provides Unsatisfactory Responses to Frequently Asked Questions Regarding Anterior Cruciate Ligament Reconstruction. Arthroscopy 2024, 40, 2067–2079.e1. [Google Scholar] [CrossRef]
  34. Reyhan, A.H.; Mutaf, Ç.; Uzun, İ.; Yüksekyayla, F. A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity. J. Clin. Med. 2024, 13, 6512. [Google Scholar] [CrossRef]
Table 1. Questions asked to ChatGPT 4.0 that reflect the most common orthopaedic problems.

AoI | Anat | Question
Ped | Knee | My 12-year-old son has been getting more and more pain in the front of his knee when he’s been playing sports and in the evenings. He’s also got a bit of a swelling going on. Any ideas what I should do?
Ped | F&A | My 3-year-old daughter’s feet are pressing inwards. What should I do?
Ped | Hip | My grandson was just born. I’ve heard about something called a hip ultrasound. Do you know what this is? And when should I have it?
Ped | Spine | My 17-year-old daughter has a curve in her back. What should I do?
Ped | Hip | My 6-year-old child walks with a limp. Do you have any advice?
Sport | Up | My 48-year-old mum has really bad pain in her right shoulder when she gets plates from the top shelves and washes her hair in the shower. Any ideas what she can do?
Sport | Up | I’m a 45-year-old housewife and the outer side of my elbow really hurts when I squeeze a cloth or open a jar lid at home. What should I do?
Sport | Knee | My mate at work told me that his knee swelled up after he hurt it playing football a month ago. He said it felt like it was rotating and he felt a gap in it. What should he do?
Sport | F&A | During the same match, another friend heard a loud pop behind his ankle, which we all heard. But the man can walk. Any ideas what it is?
Sport | Knee | I’m 35 and had meniscus surgery last week. Do you think I should get physical therapy?
Sport | Hip | My 33-year-old cousin played football when he was younger. He’s been in pain in his left groin for the last three years. Any ideas?
Sport | Spine | I am 25 years old and my back hurts. What should I do?
Deg | Spine | I am 75 years old and my back hurts. What should I do?
Deg | F&A | I’m a 55-year-old woman and I’ve had pain in my heel every morning for the last month. It eases up during the day, but kicks in again in the morning. Any ideas what I should do?
Deg | Knee | My dad is 65 and only has pain in his knees when he’s going up and down stairs, but he’s fine when he’s walking on flat surfaces. His knee doesn’t lock, but he does have a bit of pain when going up and down stairs. Any ideas what I should do?
Deg | Knee | I’m 75, and my knees hurt a lot when I bend over and stand up. I can’t even walk 100 m Any ideas what I can do?
Deg | F&A | My 48-year-old wife has a deformity of the big toe. Do you know what we’re supposed to do?
Deg | Spine | I’m a 40-year-old office worker and I’ve had numbness in my left hand and neck for about a month. Any ideas what I should do?
Inf | Knee | I’m 26 and got back from Thailand last week. My right knee’s all swollen, warm and really painful. Any ideas what I should do?
Inf | Knee | My 60-year-old mum had PRP last week, and now the knee where they did it is swollen and she can’t move it much. Do you know what is it?
Inf | Hip | My 6-year-old daughter had the flu last week, and this week she’s got a bit of pain in her hip and is limping a bit. Wat I should do?
Inf | F&A | My dad, who’s 75, has had a black, stinky discharge on his foot for about three months. He also can’t feel some parts of his foot. Any ideas what we should do?
Tra | F&A | I broke my tibia bone and had to have a nail put in it, and it’s been six months but I’m still in pain. Do you know what I can do?
Tra | F&A | I broke my tibia bone and had a nail put in it during surgery. It’s been 10 months and I’m still in pain. Any ideas what I can do?
Tra | Up | My 80-year-old grandmother took a tumble at home a week ago, and her left wrist is still all swollen and sore. What we can do?
Tra | Hip | I’m 80 and had surgery after a hip fracture six months ago. I start limping when I get tired while walking. Any idea why?
AoI: Area of interest, Anat: Anatomic Region, Ped: Pediatric, Deg: Degenerative, Inf: Infective, Tra: Traumatic, F&A: Foot and ankle, Up: Upper extremity.
Table 2. Analysis of intra-rater reliability.

Author | Category | First Evaluation | Second Evaluation | Kappa | Level of Agreement
First Author | Diagnosis | 4.77 ± 0.587 | 4.81 ± 0.491 | 0.859 | Strong
First Author | Recommendation | 4.54 ± 0.859 | 4.58 ± 0.809 | 0.912 | Almost Perfect
First Author | Referral | 4.88 ± 0.431 | 4.88 ± 0.431 | 1.000 | Perfect
Second Author | Diagnosis | 4.88 ± 0.326 | 4.85 ± 0.464 | 0.816 | Strong
Second Author | Recommendation | 4.62 ± 0.804 | 4.65 ± 0.797 | 0.894 | Strong
Second Author | Referral | 4.96 ± 0.196 | 4.96 ± 0.196 | 1.000 | Perfect
Mean ± standard deviation and minimum–maximum range values were used as descriptive statistics.
Table 3. Comparison of ChatGPT 4.0 response scores by categories.

|                     | First Author      |                   |                   | Second Author     |                   |                   |
|                     | Diag              | Recom             | Referral          | Diag              | Recom             | Referral          |
| Overall Score       | 4.77 ± 0.59 (3–5) | 4.54 ± 0.86 (2–5) | 4.88 ± 0.43 (3–5) | 4.88 ± 0.33 (4–5) | 4.62 ± 0.8 (2–5)  | 4.96 ± 0.2 (4–5)  |
| AoI: Ped (n = 5)    | 4.8 ± 0.45 (4–5)  | 4.2 ± 0.84 (3–5)  | 5 (5)             | 5 (5)             | 4.4 ± 0.89 (3–5)  | 5 (5)             |
| AoI: Sport (n = 7)  | 5 (5)             | 4.71 ± 0.77 (3–5) | 4.57 ± 0.79 (3–5) | 5 (5)             | 4.86 ± 0.38 (4–5) | 5 (5)             |
| AoI: Deg (n = 6)    | 5 (5)             | 4 ± 1.27 (2–5)    | 5 (5)             | 5 (5)             | 4 ± 1.27 (2–5)    | 4.83 ± 0.41 (4–5) |
| AoI: Inf (n = 4)    | 5 (5)             | 5 (5)             | 5 (5)             | 5 (5)             | 5 (5)             | 5 (5)             |
| AoI: Tra (n = 4)    | 3.75 ± 0.96 (3–5) | 5 (5)             | 5 (5)             | 4.25 ± 0.5 (4–5)  | 5 (5)             | 5 (5)             |
| p (AoI)             | 0.007             | 0.136             | 0.227             | 0.001             | 0.189             | 0.504             |
| Anat: Up (n = 3)    | 5 (5)             | 4.33 ± 1.16 (3–5) | 4.33 ± 1.16 (3–5) | 5 (5)             | 4.67 ± 0.58 (4–5) | 5 (5)             |
| Anat: Spine (n = 4) | 5 (5)             | 4.75 ± 0.5 (4–5)  | 5 (5)             | 5 (5)             | 5 (5)             | 5 (5)             |
| Anat: Hip (n = 5)   | 4.6 ± 0.89 (3–5)  | 4.8 ± 0.45 (4–5)  | 5 (5)             | 4.8 ± 0.48 (4–5)  | 4.6 ± 0.89 (3–5)  | 5 (5)             |
| Anat: Knee (n = 7)  | 5 (5)             | 4.43 ± 1.13 (2–5) | 5 (5)             | 5 (5)             | 4.43 ± 1.13 (2–5) | 4.86 ± 0.38 (4–5) |
| Anat: F&A (n = 7)   | 4.43 ± 0.79 (3–5) | 4.43 ± 0.98 (3–5) | 5 (5)             | 4.71 ± 0.49 (4–5) | 4.57 ± 0.79 (3–5) | 5 (5)             |
| p (Anat)            | 0.189             | 0.974             | 0.260             | 0.405             | 0.834             | 0.607             |
Diag: Diagnostic, Recom: Recommendation, AoI: Area of interest, Anat: Anatomic Region, Ped: Pediatric, Deg: Degenerative, Inf: Infective, Tra: Traumatic, F&A: Foot and ankle, Up: Upper extremity, p: Statistical significance value. Mean ± standard deviation and minimum–maximum range values were used as descriptive statistics. Kruskal–Wallis Test was applied for statistical comparison of multiple subgroups. Italicized and underlined values denote statistically significant differences (p < 0.05).
Table 4. Post hoc analysis of the diagnostic evaluation of the answers of ChatGPT 4.0.

First Author
|              | Pediatric | Sport | Degenerative | Infective | Trauma |
| Pediatric    | N/A       | 0.237 | 0.273        | 0.371     | 0.078  |
| Sport        | 0.237     | N/A   | 1.000        | 1.000     | 0.011  |
| Degenerative | 0.273     | 1.000 | N/A          | 1.000     | 0.018  |
| Infective    | 0.371     | 1.000 | 1.000        | N/A       | 0.046  |
| Trauma       | 0.078     | 0.011 | 0.018        | 0.046     | N/A    |

Second Author
|              | Pediatric | Sport | Degenerative | Infective | Trauma |
| Pediatric    | N/A       | 1.000 | 1.000        | 1.000     | 0.025  |
| Sport        | 1.000     | N/A   | 1.000        | 1.000     | 0.010  |
| Degenerative | 1.000     | 1.000 | N/A          | 1.000     | 0.016  |
| Infective    | 1.000     | 1.000 | 1.000        | N/A       | 0.040  |
| Trauma       | 0.025     | 0.010 | 0.016        | 0.040     | N/A    |
N/A: non-applicable. Mann–Whitney U Test was used for post hoc analysis of the differences found to be significant in the Kruskal–Wallis Test. The values given in the table are p values. Italicized and underlined values denote statistically significant differences (p < 0.05).
Table 5. Readability analysis of the responses of ChatGPT 4.0.

|               | Flesch–Kincaid Grade Level | Flesch Reading Ease Score | Reading Grade Level |
| Overall Score | 7.8 ± 1.267 (5.8–10.9)     | 52.68 ± 8.6 (30.2–65)     | High School         |
| AoI: Ped      | 8.26 ± 1.22 (7–10)         | 50.96 ± 5.75 (43.2–56.4)  | High School         |
| AoI: Sport    | 7.93 ± 1.81 (5.8–10.9)     | 51 ± 12.84 (30.2–65)      | High School         |
| AoI: Deg      | 7.35 ± 0.9 (5.9–8.6)       | 56.4 ± 6.11 (46.9–64.6)   | High School         |
| AoI: Inf      | 8.23 ± 0.73 (7.2–8.9)      | 49.13 ± 4.49 (46.1–55.8)  | College             |
| AoI: Tra      | 7.28 ± 1.25 (6.1–8.5)      | 55.73 ± 9.56 (47–64.2)    | High School         |
| p (AoI)       | 0.610                      | 0.596                     | N/A                 |
| Anat: Up      | 7.3 ± 1.25 (6.3–8.7)       | 55.73 ± 8.59 (46.7–63.8)  | High School         |
| Anat: Spine   | 6.83 ± 1.26 (5.8–8.5)      | 58.2 ± 8.5 (47–65)        | High School         |
| Anat: Hip     | 7.96 ± 1.87 (6.1–10.9)     | 51.76 ± 12.82 (30.2–64.2) | High School         |
| Anat: Knee    | 8.51 ± 0.95 (7.4–10)       | 48.63 ± 6.27 (39.6–55.3)  | College             |
| Anat: F&A     | 7.76 ± 0.87 (6.3–8.6)      | 52.91 ± 7.32 (46.9–63.8)  | High School         |
| p (Anat)      | 0.202                      | 0.223                     | N/A                 |
AoI: Area of interest, Anat: Anatomic Region, Ped: Pediatric, Deg: Degenerative, Inf: Infective, Tra: Traumatic, F&A: Foot and ankle, Up: Upper extremity, p: Statistical significance value, N/A: Non applicable. Mean ± standard deviation and minimum–maximum range values were used as descriptive statistics. Kruskal–Wallis Test was applied for statistical comparison of multiple subgroups.