Assessing the Efficacy of Ortho GPT: A Comparative Study with Medical Students and General LLMs on Orthopedic Examination Questions
Abstract
1. Introduction
- Does a domain-specific LLM (Ortho GPT) outperform generic and broadly biomedical LLMs in accuracy and contextual reasoning in orthopedics?
- How does its performance correlate with human difficulty ratings and established exam benchmarks?
- What evidence supports the use of such a model as an educational aid for orthopedic training?
2. Materials and Methods
2.1. Study Design
2.2. Question Dataset
2.3. Models Evaluated
- ChatGPT 3.5 (OpenAI, 2023)
- ChatGPT 4o (OpenAI, 2024)
- Perplexity AI (Perplexity Inc., 2024)
- DeepSeek-R1 (DeepSeek AI, 2025)
- Llama 3.3-70B (Meta AI, 2024)
- Ortho GPT, a domain-specific model fine-tuned on orthopedic literature, clinical guidelines, and question banks.
2.4. Procedure
Evaluation Metrics
- The McNemar test was used for pairwise comparisons between models on a question-by-question basis. As a non-parametric test for paired nominal data, it is ideal for determining whether one model significantly outperforms another on the same set of binary outcomes—in this case, correct versus incorrect responses. This approach accounts for the dependent structure of matched items and avoids the limitations of unpaired statistical comparisons.
- To assess the similarity of response patterns, we calculated the Jaccard similarity index, which measures the degree of overlap in correctly answered items between two models. This index complements accuracy scores by capturing whether models are solving the same questions correctly, thereby providing insight into shared or divergent strengths. A high Jaccard value implies similar item-level performance, while a lower value suggests unique or complementary capabilities.
- Finally, the point-biserial correlation coefficient was computed to assess the relationship between model correctness (binary: correct vs. incorrect) and item difficulty, defined as the proportion of students who answered each question correctly. This analysis was applied to determine whether model performance was systematically associated with human-perceived item difficulty.
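The paired McNemar comparison described above can be sketched as follows. This is a minimal illustration using the exact binomial formulation over discordant pairs, not the study's actual analysis code; the function name and inputs are ours.

```python
from math import comb

def mcnemar_exact(model_a_correct, model_b_correct):
    """Exact McNemar test on paired binary outcomes (per-item correct/incorrect).

    Only discordant pairs matter: items one model answered correctly and the
    other did not. Under the null, discordances split evenly, so the two-sided
    exact p-value comes from a Binomial(b + c, 0.5) distribution.
    """
    b = sum(1 for a, bb in zip(model_a_correct, model_b_correct) if a and not bb)
    c = sum(1 for a, bb in zip(model_a_correct, model_b_correct) if not a and bb)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the models are indistinguishable
    k = min(b, c)
    # double the smaller binomial tail, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

Because the test conditions on discordant pairs only, items both models answer identically contribute nothing, which is exactly how it respects the matched-item design.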
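The Jaccard overlap between two models' correctly answered item sets can be sketched as below; a hedged illustration under the definition given above (intersection over union of correct-item sets), with a hypothetical function name.

```python
def jaccard_correct_overlap(model_a_correct, model_b_correct):
    """Jaccard index over the sets of item indices each model answered correctly.

    1.0 means the models solve exactly the same items; values near 0 indicate
    largely complementary strengths.
    """
    a = {i for i, ok in enumerate(model_a_correct) if ok}
    b = {i for i, ok in enumerate(model_b_correct) if ok}
    if not a and not b:
        return 1.0  # both sets empty: trivially identical response patterns
    return len(a & b) / len(a | b)
```

For example, two models that are each 50% accurate but on disjoint items would score 0.0 despite identical accuracy, which is why this index complements raw accuracy.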
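The point-biserial correlation reduces to a Pearson correlation in which one variable is binary, so it can be sketched directly from the definition (a minimal stdlib-only illustration, not the study's analysis code; it assumes neither variable is constant).

```python
from math import sqrt
from statistics import mean

def point_biserial(correct, difficulty):
    """Point-biserial correlation between binary model correctness (0/1) and a
    continuous item-difficulty score (here: fraction of students answering
    the item correctly). Computed as a plain Pearson correlation.
    """
    mx, my = mean(correct), mean(difficulty)
    cov = sum((x - mx) * (y - my) for x, y in zip(correct, difficulty))
    vx = sum((x - mx) ** 2 for x in correct)   # sum of squares, binary variable
    vy = sum((y - my) ** 2 for y in difficulty)
    return cov / sqrt(vx * vy)
```

A value near zero, as reported for all models in the Results, indicates that a model's chance of answering correctly is essentially unrelated to how hard students found the item.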
2.5. Statistical Analysis
2.6. Ethical Considerations
3. Results
3.1. Overall Accuracy
3.2. Item-Level Analyses
3.3. Comparative Accuracy Between LLMs: McNemar Test
3.4. Effect Sizes for McNemar Comparisons: Cohen’s g
3.5. Similarity of Response Patterns
3.6. Interpretation
4. Discussion
4.1. Error Analysis
4.2. Limitations
4.3. Ethical Considerations, Curricular Integration and Safety Considerations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| LLM | Large Language Model |
| MCQ | Multiple-Choice Question |
| USMLE | United States Medical Licensing Examination |
References



| Model | Point-Biserial r | p-Value |
|---|---|---|
| ChatGPT 3.5 | −0.064 | 0.365 |
| Perplexity | 0.027 | 0.705 |
| Ortho GPT | −0.004 | 0.959 |
| ChatGPT 4o | −0.064 | 0.365 |
| DeepSeek-R1 | 0.032 | 0.653 |
| Llama 3.3-70B | −0.053 | 0.452 |
| Comparison | p-Value | Sig |
|---|---|---|
| Ortho GPT vs. DeepSeek | 2.33 × 10⁻³⁵ | *** |
| ChatGPT vs. DeepSeek | 9.63 × 10⁻³⁵ | *** |
| ChatGPT 4o vs. DeepSeek | 9.63 × 10⁻³⁵ | *** |
| Ortho GPT vs. Llama3 | 1.11 × 10⁻³² | *** |
| ChatGPT vs. Llama3 | 4.93 × 10⁻³² | *** |
| ChatGPT 4o vs. Llama3 | 4.93 × 10⁻³² | *** |
| Perplexity vs. DeepSeek | 1.97 × 10⁻³¹ | *** |
| Perplexity vs. Llama3 | 1.01 × 10⁻²⁸ | *** |
| Perplexity vs. Ortho GPT | 4.01 × 10⁻⁵ | *** |
| ChatGPT vs. Perplexity | 6.14 × 10⁻² | n.s. |
| Perplexity vs. ChatGPT 4o | 6.14 × 10⁻² | n.s. |
| ChatGPT vs. Ortho GPT | 6.54 × 10⁻² | n.s. |
| DeepSeek vs. Llama3 | 3.21 × 10⁻¹ | n.s. |
| ChatGPT vs. ChatGPT 4o | 1.00 × 10⁰ | n.s. |
| Comparison | p-Value | Sig | Cohen’s g | Interpretation |
|---|---|---|---|---|
| Ortho GPT vs. DeepSeek | 2.33 × 10⁻³⁵ | *** | 10.910 | Large effect |
| ChatGPT vs. DeepSeek | 9.63 × 10⁻³⁵ | *** | 10.677 | Large effect |
| Perplexity vs. Ortho GPT | 4.01 × 10⁻⁵ | *** | 4.025 | Large effect |
| ChatGPT vs. Perplexity | 6.14 × 10⁻² | n.s. | 2.043 | Large effect |
| ChatGPT vs. Ortho GPT | 6.54 × 10⁻² | n.s. | 2.111 | Large effect |
| ChatGPT vs. ChatGPT 4o | 1.00 × 10⁰ | n.s. | NaN | No effect |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Pohlmann, P.F.; Glienke, M.; Sandkamp, R.; Gratzke, C.; Schmal, H.; Schoeb, D.S.; Fuchs, A. Assessing the Efficacy of Ortho GPT: A Comparative Study with Medical Students and General LLMs on Orthopedic Examination Questions. Bioengineering 2025, 12, 1290. https://doi.org/10.3390/bioengineering12121290

