Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations
Abstract
1. Introduction
2. Materials and Methods
2.1. Open-Ended Question Development Process
2.2. Large Language Models Used and Response Collection Process
2.3. Evaluation Process of Responses
2.4. Statistical Analysis
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
axSpA | Axial spondyloarthropathy |
EULAR | European Alliance of Associations for Rheumatology (formerly European League Against Rheumatism) |
ASAS | Assessment of SpondyloArthritis international Society |
ChatGPT | Chat Generative Pre-trained Transformer |
LLM | Large language model |
LCS | Longest common subsequence |
USE | Universal Sentence Encoder |
FKRE | Flesch–Kincaid Reading Ease |
FKGL | Flesch–Kincaid Grade Level |
csDMARDs | Conventional disease-modifying antirheumatic drugs |
bDMARDs | Biologic disease-modifying antirheumatic drugs |
1. General Management and Monitoring Strategies (goal setting, individualization, disease monitoring, education, etc.) |
Q1 Individualization |
Q2 Disease follow-up and monitoring frequency |
Q3 Targeted therapy |
Q4 Patient education and lifestyle recommendations |
2. Pharmacological Treatment (First Step and Advanced Step) (NSAIDs, analgesics, glucocorticoids, conventional, and biological DMARDs) |
Q5 NSAID use and continuous use |
Q6 Paracetamol and opioid-like analgesics |
Q7 Glucocorticoid injections and systemic use |
Q8 csDMARDs (including sulfasalazine) |
Q9 High disease activity: TNFi, IL-17i, JAKi |
Q10 Selection in case of concomitant uveitis, IBD or psoriasis |
3. Treatment Response and Adaptation Strategies (no response, change in treatment, remission, secondary failure) |
Q11 What should be done if there is no response to treatment? |
Q12 First biologic/tsDMARD failure |
Q13 Approach in case of permanent remission |
4. Surgery and Accompanying Conditions (hip prosthesis, spinal fracture, deformity, osteotomy, etc.) |
Q14 Hip and spine surgery |
Q15 Non-inflammatory conditions such as spinal fracture |
Q1 | What factors should be considered when individualizing the treatment of patients with axial spondyloarthritis? |
Q2 | How often should disease monitoring of patients with axial spondyloarthritis be, and what should it include? |
Q3 | How should treatment be guided in patients with axSpA, and which key factors should be considered when setting and acting on a treatment target? |
Q4 | What lifestyle-related non-pharmacological strategies should be emphasized in the management of axSpA, and what should patient education include to support these? |
Q5 | Which treatment should be preferred first for managing pain and stiffness in axSpA, and how should the choice between continuous and on-demand use be made? |
Q6 | Which treatment options can be considered for residual pain in axSpA when first-line therapies have failed, are contraindicated, or not tolerated? |
Q7 | How should glucocorticoids be used in the management of axSpA, considering both local inflammation and purely axial disease? |
Q8 | What is the role of csDMARDs in the treatment of axSpA, and how should their use be guided by clinical presentation and current evidence? |
Q9 | What treatment options should be considered in axSpA patients with persistently high disease activity despite conventional therapy, and how should eligibility and treatment choice be guided? |
Q10 | How should the presence of extramusculoskeletal manifestations guide biologic treatment choice in patients with axSpA? |
Q11 | What should be assessed in axSpA patients who show no response to treatment, and why is re-evaluation important before changing therapy? |
Q12 | What are the recommended treatment options after failure of the first b/tsDMARD in patients with axSpA, and what factors should be considered when switching therapies? |
Q13 | What treatment approach should be considered for axSpA patients receiving bDMARDs who achieve sustained remission, and what are the key principles of tapering? |
Q14 | What surgical interventions can be considered in patients with axSpA involving the hip or spine, and what clinical findings and healthcare settings should guide these decisions? |
Q15 | Which pathologies should be considered and which tests should be performed when a sudden-onset, non-inflammatory spinal pain occurs in a patient with axSpA? |
Reliability Score | Usefulness Score |
---|---|
1 Completely unsafe: none of the information provided can be verified from medical sources or contains inaccurate and incomplete information. | 1 Not useful at all: Unintelligible language, contradictory information and missing important information. Not useful for patients. |
2 Very unsafe: most of the information cannot be verified from medical sources or is partially correct but contains important incorrect or incomplete information. | 2 Very little useful: Partly clear language is used. Some important information is missing or incorrect. Limited possible use for patients. |
3 Relatively reliable: the majority of the information provided is verified from medical scientific sources but there is some important incorrect or incomplete information. | 3 Relatively useful: Clear language is used. Most important information is mentioned, but some important information is incomplete or incorrect. Useful for patients. |
4 Reliable: most of the information provided is verified from medical scientific sources but there is some minor inaccurate or incomplete information. | 4 Partly useful: Clear language is used. Some important information is missing or incorrect, but most important information is addressed. Somewhat useful for patients. |
5 Relatively very reliable: most of the information provided is verified from medical scientific sources and there is very little incorrect or incomplete information. | 5 Moderately useful: Clear language is used and most important information is covered, but some important information is still incomplete or incorrect. Useful for patients. |
6 Very reliable: most of the information provided is verified from medical scientific sources and there is almost no inaccurate or incomplete information. | 6 Very useful: Clear language is used. All important information is mentioned, but some unimportant information or details are also mentioned. Very useful for patients. |
7 Absolutely reliable: All of the information provided is verified from medical scientific sources and there is no inaccurate or incomplete information, or missing information. | 7 Extremely useful: Clear language is used and all important information is mentioned. Extremely useful to patients, additional information and resources are also provided. |
Metric | ChatGPT-3.5 | ChatGPT-4o | Gemini 2.0 |
---|---|---|---|
ROUGE-L F1 | 13.9 (10.7–18.6) | 14.4 (10.8–21.2) | 13.3 (8.3–17.1) |
FKGL | 17.8 (14.6–19) | 17.1 (15–20.3) | 15.7 (13.6–17.9) |
FKRE | 6.74 (−0.1–20.48) | 4.17 (−2.94–16.73) | 15.31 (1.13–27.01) |
Reliability | 4 (3–6) | 6 (5–7) | 6 (4–7) |
Usefulness | 5 (4–7) | 7 (5–7) | 7 (4–7) |
Semantic Similarity | 66.19 (50.83–76.61) | 68.89 (53.64–79.57) | 68.85 (57.38–80.7) |
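The ROUGE-L F1 values in the table above are built on the longest common subsequence (LCS) between a model response and the reference text. The following is a minimal sketch of that calculation, not the authors' implementation: the whitespace tokenizer, the lowercasing step, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for illustration, and a published ROUGE package may tokenize or stem differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 as a fraction in [0, 1].

    Assumption: simple lowercase whitespace tokenization.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the study.
    score = rouge_l_f1("the cat sat on the mat", "the cat lay on a mat")
    print(f"ROUGE-L F1 = {score:.3f}")
```

The function returns a fraction; multiplying by 100 would put it on the same scale as the table, assuming the table reports percentages.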
Metric | 1 vs. 2 | 1 vs. 3 | 2 vs. 3 |
---|---|---|---|
ROUGE-L F1 | 0.989 * | 0.025 * | 0.074 * |
FKGL | 0.637 * | <0.001 * | 0.003 * |
FKRE | 0.375 * | <0.001 * | <0.001 * |
Reliability | <0.001 * | 0.001 µ | 0.486 * |
Usefulness | 0.002 µ | <0.001 * | 0.386 µ |
Semantic Similarity | 0.038 * | 0.064 µ | 0.593 * |
Metric | Model | Statistic | Semantic Similarity | FKGL | FKRE | Reliability | Usefulness |
---|---|---|---|---|---|---|---|
ROUGE-L F1 µ | ChatGPT-3.5 | r | 0.089 | −0.210 | 0.198 | −0.089 | −0.131 |
ROUGE-L F1 µ | ChatGPT-3.5 | p | 0.751 | 0.452 | 0.478 | 0.753 | 0.641 |
ROUGE-L F1 µ | ChatGPT-4o | r | 0.211 | −0.004 | 0.055 | 0.46 | 0.078 |
ROUGE-L F1 µ | ChatGPT-4o | p | 0.450 | 0.987 | 0.847 | 0.085 | 0.781 |
ROUGE-L F1 µ | Gemini 2.0 | r | 0.071 | −0.047 | 0.221 | −0.261 | −0.051 |
ROUGE-L F1 µ | Gemini 2.0 | p | 0.800 | 0.869 | 0.428 | 0.348 | 0.858 |
Semantic Similarity µ | ChatGPT-3.5 | r | — | 0.068 | 0.029 | 0.277 | 0.406 |
Semantic Similarity µ | ChatGPT-3.5 | p | — | 0.810 | 0.919 | 0.318 | 0.133 |
Semantic Similarity µ | ChatGPT-4o | r | — | −0.272 | 0.164 | −0.162 | 0.095 |
Semantic Similarity µ | ChatGPT-4o | p | — | 0.327 | 0.558 | 0.564 | 0.737 |
Semantic Similarity µ | Gemini 2.0 | r | — | 0.03 | 0.02 | 0.395 | 0.328 |
Semantic Similarity µ | Gemini 2.0 | p | — | 0.914 | 0.940 | 0.146 | 0.232 |
FKGL µ | ChatGPT-3.5 | r | — | — | −0.730 | −0.387 | −0.395 |
FKGL µ | ChatGPT-3.5 | p | — | — | 0.002 * | 0.154 | 0.145 |
FKGL µ | ChatGPT-4o | r | — | — | −0.764 | 0.311 | −0.12 |
FKGL µ | ChatGPT-4o | p | — | — | 0.001 | 0.259 | 0.671 |
FKGL µ | Gemini 2.0 | r | — | — | −0.838 | 0.459 | −0.111 |
FKGL µ | Gemini 2.0 | p | — | — | <0.001 * | 0.085 | 0.694 |
FKRE µ | ChatGPT-3.5 | r | — | — | — | 0.505 | 0.504 |
FKRE µ | ChatGPT-3.5 | p | — | — | — | 0.055 | 0.055 |
FKRE µ | ChatGPT-4o | r | — | — | — | −0.514 | −0.256 |
FKRE µ | ChatGPT-4o | p | — | — | — | 0.05 | 0.357 |
FKRE µ | Gemini 2.0 | r | — | — | — | −0.372 | 0.15 |
FKRE µ | Gemini 2.0 | p | — | — | — | 0.172 | 0.594 |
Reliability µ | ChatGPT-3.5 | r | — | — | — | — | 0.870 |
Reliability µ | ChatGPT-3.5 | p | — | — | — | — | 0.001 * |
Reliability µ | ChatGPT-4o | r | — | — | — | — | 0.546 |
Reliability µ | ChatGPT-4o | p | — | — | — | — | 0.035 * |
Reliability µ | Gemini 2.0 | r | — | — | — | — | 0.714 |
Reliability µ | Gemini 2.0 | p | — | — | — | — | 0.003 * |
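The strong negative FKGL and FKRE correlations in the table above (r from −0.730 to −0.838) follow from how the two scores are defined. As a reference, the standard published formulas are given below; that the study's readability tool applies exactly these coefficients is an assumption. The symbols W, S, and Y (total words, sentences, and syllables) are introduced here for illustration.

```latex
\begin{align}
\mathrm{FKRE} &= 206.835 - 1.015\,\frac{W}{S} - 84.6\,\frac{Y}{W} \\
\mathrm{FKGL} &= 0.39\,\frac{W}{S} + 11.8\,\frac{Y}{W} - 15.59
\end{align}
```

Both scores depend on the same two ratios, words per sentence and syllables per word, with opposite signs, so longer sentences and longer words raise FKGL and lower FKRE together.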
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Usen, A.; Kuculmez, O. Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations. Diagnostics 2025, 15, 1455. https://doi.org/10.3390/diagnostics15121455