Comparison of ChatGPT-5 and DeepSeek V3 for Artificial Intelligence-Assisted Patient Education in Foot and Ankle Disorders

Karagoz, Bekir; Bayrak, Hünkar Cagdas; Keçeci, Tolga

doi:10.3390/japma116030033

Open Access Article

by Bekir Karagoz ¹,*, Hünkar Cagdas Bayrak ² and Tolga Keçeci ³

¹ Department of Orthopedics and Traumatology, Eskisehir City Hospital, 26080 Eskisehir, Turkey
² Department of Orthopedics and Traumatology, Cekirge State Hospital, 16090 Bursa, Turkey
³ Department of Orthopedics and Traumatology, Ordu University Training and Research Hospital, 52200 Ordu, Turkey
* Author to whom correspondence should be addressed.

J. Am. Podiatr. Med. Assoc. 2026, 116(3), 33; https://doi.org/10.3390/japma116030033

Submission received: 11 April 2026 / Revised: 9 May 2026 / Accepted: 13 May 2026 / Published: 21 May 2026

Abstract

Background: This study aims to compare the quality, reliability, and readability of information provided by two artificial intelligence-based language models, ChatGPT-5 and DeepSeek V3, regarding foot and ankle disorders. Methods: The quality, reliability, and readability of the texts generated by both AI models were analyzed using DISCERN, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), the Global Quality Score (GQS), and the CLEAR scoring system. DISCERN was used to assess information reliability, PEMAT-P to evaluate understandability and actionability, GQS to assess overall quality, and CLEAR to evaluate content quality and accuracy. The same standardized question was posed to both models for each of 35 foot and ankle disorders, and the generated texts were evaluated by two independent orthopedic specialists who were blinded to the source model. Readability analysis was performed using word count, the Flesch–Kincaid Grade Level (FKGL; required reading level), and the Flesch Reading Ease (FRE; ease of readability) scoring systems. Results: ChatGPT-5 scored significantly higher than DeepSeek V3 in DISCERN, PEMAT-P, GQS, and CLEAR evaluations (p < 0.05), indicating that ChatGPT-5 provides more reliable, comprehensive, and higher-quality information. DeepSeek V3 demonstrated better readability, producing simpler and more understandable content, as reflected in its lower FKGL score and higher FRE score. Conclusions: While ChatGPT-5 delivers more detailed and reliable health information, DeepSeek V3 offers simpler and more readable texts. Both models have distinct advantages for patient education. Future research should assess the impact of AI-generated health information on patient decision-making and its clinical application potential.

Keywords: artificial intelligence; ChatGPT-5; DeepSeek V3; patient education; information reliability; readability

1. Introduction

Over the past decade, the increase in internet access has led to a rise in patients seeking health information online [1,2]. Studies in the field of orthopedics and traumatology have shown that 60% of patients turn to the internet for information [3]. While the growing volume of online information has contributed to improved health literacy, it also brings certain potential drawbacks. The most significant of these drawbacks are the spread of misinformation and variability in the reliability of sources [4,5,6].

Recent advancements in artificial intelligence (AI) have led to innovations in various fields [7]. AI chatbots are designed to facilitate conversational communication and potentially replace human interaction [8]. The rapid and widespread implementation of AI presents a more appealing alternative to traditional web search engines [9]. The latest developments in AI models have demonstrated their potential for providing patients with the information they need about disorders and treatments [8,10]. Additionally, these AI models hold promise as tools for patient education and engagement [11]. In this context, multiple AI models have been developed and continue to be improved. One of the most notable is the Chat Generative Pre-trained Transformer (ChatGPT), developed and launched by OpenAI in November 2022. Particularly, the ChatGPT-5 version, released in April 2025, is a large-scale, multimodal model capable of understanding visual inputs. Besides ChatGPT, several other AI models, including Gemini, Llama, Mistral, and Gemma, have been introduced. Apart from these AI platforms, DeepSeek V3 has gained significant popularity, particularly in early 2025. DeepSeek V3 was launched in China in December 2024. Being open-source, free of charge, and without query limits, it distinguishes itself from other AI platforms and has had a considerable global impact. These characteristics may facilitate scalable implementation and improve access to patient education resources, particularly in low-resource and underserved settings. However, despite its rapid rise in user numbers and worldwide influence, its effectiveness in patient education remains largely unknown.

This study aims to evaluate the quality and readability of the information provided by AI-powered knowledge sources, ChatGPT-5 and DeepSeek V3, regarding foot and ankle disorders. Although various studies in the literature have explored the capacity of AI-based models to generate health information, the performance of DeepSeek V3 in this field has not yet been sufficiently examined. Foot and ankle disorders were specifically included in this study because injuries in this region have the highest incidence among sports-related traumas and hold significant importance in both clinical management and patient education [12,13]. Therefore, assessing the information provided by AI models in this field is considered a critical necessity for health communication and patient guidance. In this context, given its broad accessibility and rapidly increasing popularity, this study hypothesizes that DeepSeek V3 provides high-quality information at a level suitable for patient education on foot and ankle disorders.

2. Materials and Methods

2.1. Study Design

Since this study does not involve human subjects or any in vivo procedures, it was exempt from ethical committee approval. No patient data or personal information was used during the study. A total of 35 foot and ankle disorders were selected from www.OrthoInfo.org, the patient education website of the American Academy of Orthopaedic Surgeons (AAOS) and a reputable online orthopedic resource [14]. These conditions were selected based on their clinical relevance, frequency in orthopedic practice, and availability as standardized patient education topics on the AAOS OrthoInfo platform. The selected disorders are shown in Figure 1. Between 1 May and 31 May 2025, ChatGPT-5 (OpenAI, San Francisco, CA, USA; April 2025 update) and DeepSeek V3 (DeepSeek AI, Hangzhou, China; December 2024 update) were accessed using Google Chrome (Google LLC, Mountain View, CA, USA; version 92.0.4515.159, 64-bit). To prevent the models from being influenced by previous browsing history, all cookies and history were cleared before each session. Additionally, new user accounts were created for both platforms to ensure that responses were independent of algorithmic biases and remained consistent. To minimize potential biases arising from memory retention, the chat history was cleared before each query. These procedures were implemented to minimize personalization bias, session-dependent variability, and potential memory effects of the models, thereby ensuring that each response was generated under standardized and comparable conditions. Default system settings were used for both models, as user-level control over parameters such as temperature or response variability was not available or not standardized across platforms. A conversation was initiated by asking both ChatGPT-5 and DeepSeek V3 the same standardized question: “Can you provide a high-quality informative text about XXX and its surgery?” In this query, XXX was replaced with each of the 35 selected foot and ankle disorders (for example: “Can you provide a high-quality informative text about hallux rigidus and its surgery?”). A total of 35 texts were generated by each model and saved separately.
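The prompt construction described above can be sketched in a few lines of Python. This is only an illustration: the template string is quoted from the text, while the function name `build_prompts` and the second disorder in the example list are assumptions introduced here (the full list of 35 disorders is given in Figure 1 and is not reproduced).

```python
# Template quoted from the Methods; the XXX placeholder becomes {disorder}.
PROMPT_TEMPLATE = (
    "Can you provide a high-quality informative text about "
    "{disorder} and its surgery?"
)


def build_prompts(disorders):
    """Return one standardized prompt per disorder, preserving input order."""
    return [PROMPT_TEMPLATE.format(disorder=name) for name in disorders]


# "hallux rigidus" is the example given in the text; the second entry is a
# hypothetical placeholder, not taken from the study's list.
example_disorders = ["hallux rigidus", "example disorder"]
prompts = build_prompts(example_disorders)
print(prompts[0])
```

Keeping the wording in a single template means every query differs only in the disorder name, which is the property the standardized design relies on.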

The generated texts were independently evaluated by two orthopedic specialists with 7 and 10 years of experience in orthopedic surgery, respectively. A blinded evaluation method was applied to ensure the assessors remained unbiased while reviewing the AI-generated texts. The evaluators were not informed about which text belonged to which AI model, and each text was randomly numbered. During the evaluation, the assessors focused solely on the quality of the content, without any reference to the source of the information. The average scores given by the evaluators for each parameter were calculated and used in the final analysis. To assess the consistency between the evaluators, Intraclass Correlation Coefficient (ICC) analysis was performed [15].
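One way the blinding and score-averaging steps could be implemented is sketched below. The article does not describe the tooling actually used, so the function names, the fixed random seed, and the data layout are all assumptions made for illustration.

```python
import random
from statistics import mean


def blind_texts(texts_by_model, seed=0):
    """Shuffle all texts and assign neutral numeric IDs.

    Returns (blinded, key): `blinded` maps ID -> text and is what the raters
    see; `key` maps ID -> model name and is withheld until scoring is done.
    """
    items = [
        (model, text)
        for model, texts in texts_by_model.items()
        for text in texts
    ]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(items)
    blinded = {i: text for i, (_, text) in enumerate(items, start=1)}
    key = {i: model for i, (model, _) in enumerate(items, start=1)}
    return blinded, key


def average_scores(rater_a, rater_b):
    """Average the two raters' scores for each blinded text ID."""
    return {tid: mean([rater_a[tid], rater_b[tid]]) for tid in rater_a}
```

Separating the answer key from the rater-facing mapping keeps the model identity out of the scoring sheets until the averaged scores are ready for analysis.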

2.2. Reliability and Quality Assessment

The DISCERN scoring system was used to evaluate the reliability of the responses provided by the AI models [16]. Developed by experts at Oxford University, the DISCERN scoring system consists of 16 items, each rated on a scale from 1 to 5. The total DISCERN score ranges from 16 to 80. According to the evaluation system, scores are classified as follows: 16–26: “Very poor”, 27–38: “Poor”, 39–50: “Average”, 51–62: “Good”, 63–80: “Excellent”. As indicated by this scoring system, a higher DISCERN score reflects a more balanced presentation of content and greater information reliability [17].
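The DISCERN bands above translate directly into a small lookup. The sketch below is illustrative rather than the authors' procedure; in particular, treating each band's lower limit as the cut-off for non-integer averages (so that 62.5 stays "Good") is an assumption, since the text defines the bands only for whole-number totals.

```python
def discern_total(item_scores):
    """Sum 16 DISCERN item scores, each rated from 1 to 5."""
    if len(item_scores) != 16:
        raise ValueError("DISCERN has 16 items")
    if any(not 1 <= score <= 5 for score in item_scores):
        raise ValueError("each item must be rated from 1 to 5")
    return sum(item_scores)


def classify_discern(total):
    """Map a total (or rater-averaged total) to the band named in the text.

    Assumption: a band is reached only when its lower limit is met.
    """
    if not 16 <= total <= 80:
        raise ValueError("DISCERN totals range from 16 to 80")
    if total >= 63:
        return "Excellent"
    if total >= 51:
        return "Good"
    if total >= 39:
        return "Average"
    if total >= 27:
        return "Poor"
    return "Very poor"
```

For example, `classify_discern(discern_total([4] * 16))` evaluates a text rated 4 on every item (total 64) as "Excellent".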

To assess information quality, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) and the Global Quality Score (GQS) were used [18,19]. PEMAT-P, developed by the Agency for Healthcare Research and Quality, consists of 26 items. Each item is scored as 1 point (“agree”) if the feature is consistently present throughout the material or 0 points (“disagree”) if it could be improved. Items marked as “not applicable” were excluded from the calculation. Higher scores indicate superior information quality. GQS is a commonly used tool for evaluating overall information quality. This scoring system uses a scale from 1 to 5. One point represents the lowest score, indicating poor quality and incomplete information. Five points represent the highest score, indicating excellent quality and highly comprehensive information.
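As a rough sketch of the PEMAT-P arithmetic, the score can be computed as the share of applicable items rated "agree". Expressing the result as a percentage follows the usual PEMAT convention and is an assumption here, since the text above does not state the final scale; the function name and the use of `None` for "not applicable" are likewise illustrative.

```python
def pemat_score(ratings):
    """Percentage of applicable items rated 'agree'.

    `ratings` holds 1 (agree), 0 (disagree), or None (not applicable);
    not-applicable items are excluded from the denominator.
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("at least one applicable item is required")
    if any(r not in (0, 1) for r in applicable):
        raise ValueError("applicable items must be rated 0 or 1")
    return 100 * sum(applicable) / len(applicable)
```

The same function can be applied separately to the understandability items and the actionability items to obtain the two domain scores.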

2.3. Readability Assessment

The readability of the texts was evaluated considering the average patient or caregiver’s ability to comprehend medical information. Therefore, the texts were analyzed based on word count, Flesch Reading Ease (FRE), and Flesch–Kincaid Grade Level (FKGL) scoring systems [20,21,22]. FRE scores indicate how easily a text can be understood, ranging from 0 to 100, with higher scores reflecting easier readability. FKGL scores estimate the educational level required to understand a text, where higher values indicate more complex content.
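For reference, both indices are simple functions of average sentence length and average syllables per word. The sketch below uses the standard published coefficients; the word, sentence, and syllable counts are taken as inputs because the counting tool used in the study is not stated, and syllable counts in particular vary between tools.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for illustration only (not data from the study).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

Because both formulas share the same two ratios, shorter sentences and shorter words raise FRE and lower FKGL together, which is why the two scores usually move in opposite directions.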

2.4. Content Quality Assessment

To assess the quality of health-related content, the CLEAR tool was used [23]. This scoring system consists of five criteria: Completeness of content, Lack of false information in the content, Evidence supporting the content, Appropriateness of the content, and Relevance. Each criterion is rated on a 5-point scale, ranging from poor to excellent. The total CLEAR score ranges from 5 to 25 and is classified into three categories: 5–11: “Poor”, 12–18: “Average”, 19–25: “Very good” content.
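The CLEAR total and its three categories can be expressed as follows. This is a minimal sketch based only on the ranges given above; the function names are illustrative, and handling non-integer (rater-averaged) totals by requiring each band's lower limit is an assumption.

```python
CLEAR_CRITERIA = (
    "completeness",
    "lack_of_false_information",
    "evidence",
    "appropriateness",
    "relevance",
)


def clear_total(scores):
    """Sum the five CLEAR criteria, each rated from 1 to 5."""
    missing = [name for name in CLEAR_CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    values = [scores[name] for name in CLEAR_CRITERIA]
    if any(not 1 <= value <= 5 for value in values):
        raise ValueError("each criterion must be rated from 1 to 5")
    return sum(values)


def classify_clear(total):
    """Map a CLEAR total (5-25) to the category named in the text."""
    if not 5 <= total <= 25:
        raise ValueError("CLEAR totals range from 5 to 25")
    if total >= 19:
        return "Very good"
    if total >= 12:
        return "Average"
    return "Poor"
```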

2.5. Statistical Analysis

All statistical analyses were performed using IBM SPSS 25.0 software (IBM Corp., Armonk, NY, USA). After assessing the normality of the data using the Kolmogorov–Smirnov test, the independent samples t-test was used for comparisons of normally distributed data, while the Mann–Whitney U test was applied to data that were not normally distributed. Statistical significance was set at p < 0.05. Additionally, agreement between the two observers was evaluated using the intraclass correlation coefficient (ICC). ICC values were interpreted as follows: values below 0.5 indicated low reliability, values between 0.5 and 0.75 represented moderate reliability, values between 0.75 and 0.9 indicated good reliability, and values above 0.9 demonstrated excellent reliability [15].
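The ICC interpretation thresholds can be written as a short helper. The sketch below is illustrative; assigning a value that falls exactly on a boundary (0.5, 0.75, or 0.9) to the band that starts or ends there is an assumption, as the text does not specify it. The example values are the ICCs reported in Section 3.1.

```python
def interpret_icc(icc):
    """Return the reliability label for an ICC value, per the thresholds above."""
    if icc < 0.5:
        return "low"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Point estimates as reported in Section 3.1.
reported_icc = {
    "DISCERN": 0.747,
    "PEMAT-P": 0.939,
    "CLEAR": 0.819,
    "FKGL": 0.866,
    "FRE": 0.77,
    "GQS": 0.729,
}
labels = {name: interpret_icc(value) for name, value in reported_icc.items()}
```

Note that this labels only the point estimate; the confidence intervals reported alongside each ICC can span more than one band.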

3. Results

3.1. Interobserver Agreement Assessment

Interobserver reliability was analyzed for each scoring system, with the following results: DISCERN (ICC 0.747, 95% CI 0.545–0.858), PEMAT-P (ICC 0.939, 95% CI 0.884–0.967), CLEAR (ICC 0.819, 95% CI 0.680–0.897), FKGL (ICC 0.866, 95% CI 0.765–0.924), FRE (ICC 0.77, 95% CI 0.473–0.887), and GQS (ICC 0.729, 95% CI 0.189–0.882). According to the thresholds given in Section 2.5, agreement was moderate for DISCERN and GQS, good for CLEAR, FKGL, and FRE, and excellent for PEMAT-P (Figure 2).

3.2. Quality and Reliability Assessment

ChatGPT-5 scored statistically significantly higher than DeepSeek V3 in terms of reliability and information quality across the DISCERN 1–8, DISCERN 9–15, and DISCERN 16 evaluations (p = 0.04, p = 0.032, p = 0.013, respectively) (Table 1). The average DISCERN total score for texts provided by DeepSeek V3 was 62 ± 6.7, indicating that these texts were generally of good quality. In contrast, the average DISCERN total score for texts generated by ChatGPT-5 was 66.7 ± 2.3, categorizing them as excellent quality. Statistical analysis confirmed that the difference between the two AI models in terms of DISCERN scores was significant (p = 0.044). Despite these differences, both models demonstrated performance within “good” to “excellent” interpretative categories according to DISCERN classification, indicating overall high-quality outputs.

Regarding PEMAT-P scores, ChatGPT-5 achieved statistically significantly higher scores than DeepSeek V3 in understandability, actionability, and overall score averages (p = 0.045, p = 0.043, p = 0.015, respectively) (Table 1).

For GQS assessment, the average score for ChatGPT-5 was 4.46 ± 0.36, while for DeepSeek V3, it was 4.13 ± 0.21. Both AI models received high scores, indicating that they provided useful and high-quality information for patients. However, statistical analysis showed that ChatGPT-5 had a significantly higher score than DeepSeek V3 (p = 0.01).

3.3. Readability Assessment

Analyses conducted to assess readability revealed that the texts provided by DeepSeek V3 were easier to read compared to those generated by ChatGPT-5 (Table 2). The texts produced by ChatGPT-5 had an average length of 623 ± 73 words, whereas those generated by DeepSeek V3 had an average length of 565 ± 84 words. This difference between the two groups was found to be statistically significant (p = 0.021). In the FKGL scoring, DeepSeek V3 obtained a significantly lower score than ChatGPT-5, indicating that its texts were simpler and easier to understand (p = 0.022). Additionally, there was a notable difference between the two groups in FRE scoring (Table 2). The average score of DeepSeek V3 was significantly higher compared to that of ChatGPT-5, confirming its greater readability (p = 0.038).

3.4. Content Quality Assessment

The content evaluation conducted using the CLEAR scoring system indicated that the total scores of both AI models fell within the “very good” category (Table 3). Although both models were classified in this category, ChatGPT-5 achieved a significantly higher total CLEAR score than DeepSeek V3 (p = 0.001) (Table 3). For the “lack of false information” domain, ChatGPT-5 consistently achieved the maximum possible score (5) across all evaluations, indicating no variability in this parameter. Furthermore, ChatGPT-5 achieved statistically significantly higher scores than DeepSeek V3 across all CLEAR subcategories, namely completeness, lack of false information, evidence, appropriateness, and relevance (p = 0.022, p = 0.01, p = 0.043, p = 0.001, p = 0.001, respectively) (Table 3).

4. Discussion

This study evaluated the quality and readability of information provided by ChatGPT-5 and DeepSeek V3 on foot and ankle disorders. Both AI models produced information rated “good” to “excellent” on DISCERN and “very good” on CLEAR. However, ChatGPT-5 scored significantly higher in the quality and reliability assessments, and its texts also achieved higher content quality scores than those provided by DeepSeek V3. Regarding readability, DeepSeek V3 demonstrated better readability scores than ChatGPT-5, indicating that its content was easier to understand.

As a result of the widespread use of the internet and the acceleration of access through computers and smartphones, obtaining information on any subject has become significantly easier [24]. In recent years, the emergence of AI models and increased accessibility to these platforms have revolutionized information retrieval. Among these AI models, ChatGPT, launched by OpenAI in 2022, is undoubtedly the most prominent. Over time, this model has learned and understood human language, allowing it to engage in conversations with users. The latest version, ChatGPT-5, was released in April 2025, with its most significant feature being its human-level performance in academic and professional evaluations [17]. The introduction of ChatGPT-5 represents a major advancement in AI models and a significant milestone in deep learning research [17]. In addition to ChatGPT, various AI models have emerged in the market. One of the most impactful is DeepSeek V3, an AI model released in China in December 2024, which has gained global recognition since its launch. DeepSeek V3 stands out from other AI models due to its open-source nature, free accessibility, and high performance on low-cost hardware [25]. Thanks to these features, it has rapidly gained a large user base, making its capabilities increasingly comparable to those of ChatGPT-5.

One of the most common uses of AI platforms is assisting individuals in gathering information about disorders and treatments [5,26,27,28]. Many studies agree that this trend, which has increased with internet usage, will continue to grow with advancements in AI platforms [29,30]. However, numerous previous studies have reported questionable reliability regarding online health information [5,10,11], raising concerns that similar issues may also apply to AI-generated content. Therefore, assessing the quality and reliability of information produced by AI models is of critical importance. Among AI models, ChatGPT has the most extensive literature data on this subject [26,27,31,32]. Several studies suggest that ChatGPT has the potential to serve as a reliable source due to the quality and trustworthiness of the information it provides [7,9,13,31,32,33]. However, some studies argue that the possibility of missing recent research and key references may pose a potential risk to the reliability of its information [11]. In contrast, DeepSeek V3, being a relatively new AI platform, has very limited literature data available on this topic [25]. In our study, we aimed to evaluate the quality and reliability of the information provided by DeepSeek V3 by comparing it with ChatGPT-5, the latest version of ChatGPT, which has the most extensive literature on this topic. The DISCERN, PEMAT-P, and GQS scoring assessments, conducted to determine the quality and reliability of the generated texts, revealed that ChatGPT-5 outperformed DeepSeek V3 in both aspects. The presence of a statistically significant score difference in all three evaluations suggests that ChatGPT-5 provides more informative and comprehensible content, making it more effective in guiding patients regarding their health concerns.

Providing the most accurate and comprehensive information while avoiding deficiencies or excesses is crucial. In the healthcare field, high-quality and comprehensive information helps individuals without medical expertise make more informed decisions and enhance their communication with healthcare professionals. On the other hand, incorrect or incomplete information may lead individuals to misdiagnose themselves, delay seeking medical consultation, and ultimately cause irreversible harm [17,23]. One of the tools used to evaluate the content of AI platforms is the CLEAR scoring system [23]. Several studies in the literature have employed this scoring method to compare AI platforms [34,35,36]. One such study, conducted by Incerti Parenti et al., compared ChatGPT-3.5 and the Google search engine [17]. The findings showed that ChatGPT-3.5 scored significantly higher than Google search. The study also revealed that Google search results contained substantial gaps in areas such as definitions, disease significance, and etiology. In our study, we also used CLEAR scoring for content evaluation, where ChatGPT-5 achieved statistically significantly higher scores. The scores for completeness, accuracy (absence of misinformation), evidence-based content, appropriateness, and relevance were all classified within the “very good” range. While ChatGPT-5 scored statistically higher, DeepSeek V3 also achieved notably high scores, placing it within the “very good” category. These results indicate that, although ChatGPT-5 provided statistically superior content compared to DeepSeek V3, the information provided by DeepSeek V3 was also of very high quality, and this should not be overlooked.

Readability is just as important as the quality and reliability of information provided by AI models. While many websites present inadequate and biased content, increasing the potential for misinformation, recent studies investigating ChatGPT have found that it generally does not contain incorrect or harmful information and provides appropriate responses to most questions [26,27,31,32,33,34,35,36,37]. The most commonly used parameters for assessing readability are word count, FRE, and FKGL scores [21,22]. Numerous studies have applied these scoring methods [10,17,31,32]. In the study by Incerti Parenti et al., Google Search provided responses with higher readability scores compared to ChatGPT-3.5 [17]. Similarly, in our study, DeepSeek V3 demonstrated statistically significantly better readability scores than ChatGPT-5. This finding suggests that DeepSeek V3 generates content using simpler language compared to ChatGPT-5. Patients often seek quick and concise information. Considering this, our results indicate that ChatGPT-5’s texts, with their higher word count and more detailed content, may overwhelm some users. However, despite its lower readability scores, patients may still perceive ChatGPT-5’s content as higher in quality and reliability. While those looking for in-depth and comprehensive information may benefit from ChatGPT-5, individuals seeking quick and digestible information might find DeepSeek V3 more useful. Despite these differences, DeepSeek V3 offers practical advantages, including open-source availability, cost-free access, and ease of deployment, which may facilitate broader use in clinical and educational settings.

This study has several limitations. First, the inclusion of a limited number of disorders may restrict the generalizability of the findings, and the variability in how AI platforms present information across a broader range of conditions was not assessed. Second, the study evaluated only initial responses and did not consider follow-up interactions, which may not fully reflect real-world patient–AI engagement. Third, the analysis was conducted exclusively in English; AI model performance may vary across languages, particularly those with lower representation in training datasets, such as Turkish, which may limit the generalizability of these findings across different linguistic contexts. Fourth, only a single standardized prompt structure was used, and alternative prompt formulations may generate substantially different responses and therefore influence the observed performance outcomes. Future research should address these limitations by incorporating more diverse conditions, evaluating iterative interactions, and exploring multilingual and prompt-variation effects on AI-generated health information.

From a clinical perspective, these findings suggest that AI-based platforms may serve as complementary tools in patient education. Clinicians may utilize models with higher information quality, such as ChatGPT-5, to provide detailed and comprehensive explanations of diseases and treatment options. In contrast, models with higher readability, such as DeepSeek V3, may be more suitable for delivering concise and easily understandable information to patients. Integrating these AI tools into routine clinical workflows may enhance patient understanding, improve physician–patient communication, and support shared decision-making.

5. Conclusions

This study compared the quality, reliability, and readability of information provided by ChatGPT-5 and DeepSeek V3 regarding foot and ankle disorders. Both AI models generated high-quality patient education materials; however, ChatGPT-5 demonstrated superior performance in terms of content quality, reliability, and comprehensiveness, while DeepSeek V3 achieved higher readability scores, indicating simpler and more accessible content. These findings highlight the importance of critically evaluating AI-based health information sources. While ChatGPT-5 may be more suitable for delivering detailed and reliable information, DeepSeek V3 may better serve users seeking concise and easily understandable content. Future research should explore multimodal model capabilities, prompt-engineering strategies, and the impact of AI-generated information on patient decision-making. These findings may also guide future prospective and multilingual research aimed at optimizing the use of AI-based platforms in patient education across diverse clinical and linguistic settings.

Author Contributions

Conceptualization, B.K. and H.C.B.; methodology, B.K.; software, B.K.; validation, B.K., H.C.B. and T.K.; formal analysis, B.K.; investigation, B.K. and H.C.B.; resources, B.K.; data curation, B.K.; writing—original draft preparation, B.K.; writing—review and editing, H.C.B. and T.K.; visualization, B.K.; supervision, T.K.; project administration, B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical approval was not required for this study, as it did not involve human participants or personal data.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Greenberg-Worisek, A.J.; Kurani, S.; Rutten, L.J.F.; Blake, K.D.; Moser, R.P.; Hesse, B.W. Correction: Tracking Healthy People 2020 Internet, Broadband, and Mobile Device Access Goals: An Update Using Data From the Health Information National Trends Survey. J. Med. Internet Res. 2022, 24, e39712.
2. Bartolucci, M.L.; Parenti, S.I.; Bortolotti, F.; Gorini, T.; Alessandri-Bonetti, G. Awareness and sources of knowledge about obstructive sleep apnea: A cross-sectional survey study. Healthcare 2023, 11, 3052.
3. Fraval, A.; Chong, Y.M.; Holcdorf, D.; Plunkett, V.; Tran, P. Internet use by orthopaedic outpatients—Current trends and practices. Australas. Med. J. 2012, 5, 633–638.
4. Daraz, L.; Morrow, A.S.; Ponce, O.J.; Farah, W.; Katabi, A.; Majzoub, A.; Seisa, M.O.; Benkhadra, R.; Alsawas, M.; Larry, P.; et al. Readability of online health information: A meta-narrative systematic review. Am. J. Med. Qual. 2018, 33, 487–492.
5. Bayrak, H.C.; Karagöz, B.; Bayrak, Ö. Comparative evaluation of large language model-based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity. Acta Orthop. Traumatol. Turc. 2025, 59, 415–420.
6. Schwarz, I.; Houck, D.A.; Belk, J.W.; Hop, J.; Bravman, J.T.; McCarty, E.C. The quality and content of internet-based information on orthopaedic sports medicine requires improvement: A systematic review. Arthrosc. Sports Med. Rehabil. 2021, 3, e1547–e1555.
7. Lower, K.; Seth, I.; Lim, B.; Seth, N. ChatGPT-4: Transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J. Orthop. 2023, 57, 1527–1534.
8. Chow, J.C.; Wong, V.; Sanders, L.; Li, K. Developing an AI-assisted educational chatbot for radiotherapy using the IBM Watson Assistant platform. Healthcare 2023, 11, 2417.
9. Stokel-Walker, C.; Van Noorden, R. What ChatGPT and generative AI mean for science. Nature 2023, 614, 214–216.
10. Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887.
11. Seth, I.; Rodwell, A.; Tso, R.; Valles, J.; Bulloch, G.; Seth, S. A conversation with an open artificial intelligence platform on osteoarthritis of the hip and treatment. J. Orthop. Sports Med. 2023, 5, 112–120.
12. Rydberg, E.M.; Wennergren, D.; Stigevall, C.; Ekelund, J.; Möller, M. Epidemiology of more than 50,000 ankle fractures in the Swedish Fracture Register during a period of 10 years. J. Orthop. Surg. Res. 2023, 18, 79.
13. Wang, D.; He, Y.; Ma, Y.; Wu, H.; Ni, G. The era of artificial intelligence: Talking about the potential application value of ChatGPT/GPT-4 in foot and ankle surgery. J. Foot Ankle Surg. 2024, 63, 1–3.
14. Eltorai, A.E.; Sharma, P.; Wang, J.; Daniels, A.H. Most American Academy of Orthopaedic Surgeons’ online patient education material exceeds average patient reading level. Clin. Orthop. Relat. Res. 2015, 473, 1181–1186.
15. Koo, T.K.; Li, M.Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 2016, 15, 155–163.
16. Charnock, D.; Shepperd, S.; Needham, G.; Gann, R. DISCERN: An instrument for judging the quality of written consumer health information on treatment choices. J. Epidemiol. Community Health 1999, 53, 105–111.
17. Incerti Parenti, S.; Gamberini, S.; Fiordelli, A.; Bortolotti, F.; Laffranchi, L.; Alessandri-Bonetti, G. Online information on mandibular advancement device for the treatment of obstructive sleep apnea: A content, quality and readability analysis. J. Oral Rehabil. 2023, 50, 210–216.
18. Shoemaker, S.J.; Wolf, M.S.; Brach, C. Development of the Patient Education Materials Assessment Tool (PEMAT): A new measure of understandability and actionability for print and audiovisual patient information. Patient Educ. Couns. 2014, 96, 395–403.
19. Mangan, M.S.; Cakir, A.; Yurttaser Ocak, S.; Tekcan, H.; Balci, S.; Ozcelik Kose, A. Analysis of the quality, reliability, and popularity of information on strabismus on YouTube. Strabismus 2020, 28, 175–180.
20. Kher, A.; Johnson, S.; Griffith, R. Readability assessment of online patient education material on congestive heart failure. Adv. Prev. Med. 2017, 2017, 9780317.
21. Kincaid, J.P.; Fishburne, R.P.; Rogers, R.L.; Chissom, B.S. Derivation of New Readability Formulas for Navy Enlisted Personnel; Technical Report Research Branch Report; US Naval Air Station: Memphis, TN, USA, 1975; pp. 8–75.
22. Flesch, R. A new readability yardstick. J. Appl. Psychol. 1948, 32, 221–233.
23. Sallam, M.; Barakat, M.; Sallam, M. Pilot testing of a tool to standardize the assessment of the quality of health information generated by artificial intelligence-based models. Cureus 2023, 15, e49373.
24. Gutmann, J.; Kühbeck, F.; Berberat, P.O.; Fischer, M.R.; Engelhardt, S.; Sarikas, A. Use of learning media by undergraduate medical students in pharmacology: A prospective cohort study. PLoS ONE 2015, 10, e0122624.
25. Peng, Y.; Malin, B.A.; Rousseau, J.F.; Wang, Y.; Xu, Z.; Xu, X.; Weng, C.; Bian, J. From GPT to DeepSeek: Significant gaps remain in realizing AI in healthcare. J. Biomed. Inform. 2025, 163, 104791.
26. He, Y.; Tang, H.; Wang, W.; Gu, S.; Ni, G.; Wu, H. Will ChatGPT/GPT-4 be a lighthouse to guide spinal surgeons? Ann. Biomed. Eng. 2023, 51, 1362–1365.
27. Cheng, K.; Li, Z.; Guo, Q.; Sun, Z.; Wu, H.; Li, C. Emergency surgery in the era of artificial intelligence: ChatGPT could be the doctor’s right-hand man. Int. J. Surg. 2023, 109, 1816–1818.
28. Kirchner, G.J.; Kim, R.Y.; Weddle, J.B.; Bible, J.E. Can artificial intelligence improve the readability of patient education materials? Clin. Orthop. Relat. Res. 2023, 481, 2260.
29. Kıvrak, A.; Ulusoy, İ. How high is the quality of the videos about children’s elbow fractures on YouTube? J. Orthop. Surg. Res. 2023, 18, 166.
30. Yuce, A.; Oto, O.; Vural, A.; Misir, A. YouTube provides low-quality videos about talus osteochondral lesions and their arthroscopic treatment. Foot Ankle Surg. 2023, 29, 441–445.
Khan, R.A.; Jawaid, M.; Khan, A.R.; Sajjad, M. ChatGPT—Reshaping medical education and clinical management. Pak. J. Med. Sci. 2023, 39, 605–607. [Google Scholar] [CrossRef] [PubMed]
van de Ridder, J.M.; Shoja, M.M.; Rajput, V. Finding the place of ChatGPT in medical education. Acad. Med. 2023, 98, 867. [Google Scholar] [CrossRef] [PubMed]
Campbell, D.J.; Estephan, L.E.; Mastrolonardo, E.V.; Amin, D.R.; Huntley, C.T.; Boon, M.S. Evaluating ChatGPT responses on obstructive sleep apnea for patient education. J. Clin. Sleep Med. 2023, 19, 1989–1995. [Google Scholar] [CrossRef] [PubMed]
Yilmaz Muluk, S.; Olcucu, N. The role of artificial intelligence in the primary prevention of common musculoskeletal diseases. Cureus 2024, 16, e65372. [Google Scholar] [CrossRef] [PubMed]
Sallam, M.; Al-Mahzoum, K.; Almutawaa, R.A.; Alhashash, J.A.; Dashti, R.A.; AlSafy, D.R.; Almutairi, R.A.; Barakat, M. The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: A comparative analysis of English and Arabic responses. BMC Res. Notes 2024, 17, 247. [Google Scholar] [CrossRef]
Keçeci, T.; Karagöz, B. Can large language models follow guidelines? A comparative study of ChatGPT-4o and DeepSeek AI in clavicle fracture management based on AAOS recommendations. BMC Med. Inform. Decis. Mak. 2025, 25, 350. [Google Scholar] [CrossRef]
Tao, B.K.; Handzic, A.; Hua, N.J.; Vosoughi, A.R.; Margolin, E.A.; Micieli, J.A. Utility of ChatGPT for automated creation of patient education handouts: An application in neuro-ophthalmology. J. Neuroophthalmol. 2024, 44, 119–124. [Google Scholar] [CrossRef]

Figure 1. List of foot and ankle disorders included in the study, selected from the American Academy of Orthopaedic Surgeons (AAOS) OrthoInfo platform.

Figure 2. Interobserver reliability of the scoring results. ICC: Intraclass Correlation Coefficient; PEMAT-P: Patient Education Materials Assessment Tool for Printable Materials; GQS: Global Quality Score; FKGL: Flesch–Kincaid Grade Level; FRE: Flesch Reading Ease. ICC values were interpreted as follows: <0.5 low, 0.5–0.75 moderate, 0.75–0.9 good, and >0.9 excellent reliability.
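
The interpretation bands in the Figure 2 caption can be expressed as a small lookup. The following is a minimal sketch rather than the authors' procedure: the function name `interpret_icc` is hypothetical, and because the caption lists 0.5, 0.75, and 0.9 as shared endpoints, the assignment of those exact boundary values (0.5 to moderate, 0.75 to good, 0.9 to good) is an assumption made here.

```python
def interpret_icc(icc: float) -> str:
    """Return the reliability label for an ICC value.

    Bands follow the Figure 2 caption: <0.5 low, 0.5-0.75 moderate,
    0.75-0.9 good, >0.9 excellent. Handling of the exact boundary
    values is an assumption of this sketch, not stated in the paper.
    """
    if icc > 1.0:
        raise ValueError("ICC cannot exceed 1.0")
    if icc < 0.5:
        return "low"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


if __name__ == "__main__":
    # Illustrative values only; these are not results from the study.
    for value in (0.42, 0.60, 0.80, 0.95):
        print(f"ICC = {value:.2f}: {interpret_icc(value)} reliability")
```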

Table 1. DISCERN, PEMAT-P, and GQS scores.

	ChatGPT-5	DeepSeek V3	p Value
DISCERN 1–8	31.3 ± 2.3	28.8 ± 4.3	0.04
DISCERN 9–15	30.9 ± 2.3	29.2 ± 3.1	0.032
DISCERN 16	4.5 ± 0.5	3.9 ± 0.9	0.013
DISCERN Total	66.7 ± 2.3	62 ± 6.7	0.044
PEMAT-P understandability	82.7 ± 6.8	78.9 ± 6.2	0.045
PEMAT-P actionability	81.6 ± 6.2	78.5 ± 5.5	0.043
PEMAT-P Total	82.1 ± 5.3	78.7 ± 3.2	0.015
GQS	4.46 ± 0.36	4.13 ± 0.21	0.01

PEMAT-P: Patient Education Materials Assessment Tool for Printable Materials, GQS: Global Quality Score.

Table 2. Readability scores.

	ChatGPT-5	DeepSeek V3	p Value
Word count	623 ± 73	565 ± 84	0.021
FKGL	10.1 ± 1.5	8.5 ± 1.1	0.022
FRE	42.9 ± 9.6	54.5 ± 9.2	0.038

FKGL: Flesch–Kincaid Grade Level, FRE: Flesch Reading Ease.
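
For readers unfamiliar with the two readability indices in Table 2, the standard published formulas (Flesch 1948; Kincaid et al. 1975) can be sketched as below. This is an illustrative sketch only: the paper does not state which tool computed its scores, the function names are hypothetical, and the word, sentence, and syllable counts are taken as inputs because automated syllable counting is heuristic and varies between tools.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a single response; not data from the study.
    words, sentences, syllables = 600, 40, 960
    fre = flesch_reading_ease(words, sentences, syllables)
    fkgl = flesch_kincaid_grade(words, sentences, syllables)
    print(f"FRE  = {fre:.2f}")
    print(f"FKGL = {fkgl:.2f}")
```

Both formulas depend on the same two ratios, words per sentence and syllables per word, but with opposite sign conventions. This is why the model with the lower FKGL in Table 2 also has the higher FRE.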

Table 3. CLEAR scores.

	ChatGPT-5	DeepSeek V3	p Value
Completeness	4.56 ± 0.5	4.24 ± 0.43	0.022
Lack of false information	5	4.76 ± 0.5	0.01
Evidence	3.76 ± 0.43	3.48 ± 0.5	0.043
Appropriateness	4.52 ± 0.5	4	0.001
Relevance	5	4.52 ± 0.5	0.001
CLEAR Total	22.8 ± 1.3	21 ± 0.9	0.001


© 2026 by the authors. Published by MDPI on behalf of the American Podiatric Medical Association (APMA). Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Share and Cite

MDPI and ACS Style

Karagoz, B.; Bayrak, H.C.; Keçeci, T. Comparison of ChatGPT-5 and DeepSeek V3 for Artificial Intelligence-Assisted Patient Education in Foot and Ankle Disorders. J. Am. Podiatr. Med. Assoc. 2026, 116, 33. https://doi.org/10.3390/japma116030033

