ChatGPT Performance Deteriorated in Patients with Comorbidities When Providing Cardiological Therapeutic Consultations
Abstract
1. Introduction
- Assess the consistency of AI-generated recommendations, both within a single model version and between different versions.
- Determine the clinical validity and safety of these recommendations by measuring physician approval rates and systematically identifying any potentially contraindicated or inappropriate suggestions.
2. Method
2.1. Study Design
2.2. Data Analysis
2.3. Performance Metrics
2.4. Statistical Analysis
3. Results
3.1. Baseline Model Characteristics and Rater Reliability
3.2. Model Performance and Scenario-Specific Analysis
3.3. Model Consistency Analysis
4. Discussion
4.1. Principal Findings
4.2. Comparison with the Prior Literature
4.3. Interpretation of Findings
4.4. Strengths and Limitations
- Snapshot in Time: Our evaluation is a snapshot of models from early 2024. Given the rapid evolution of LLMs, the specific performance metrics reported may not be generalizable to newer versions [33]. However, we believe our findings on the fundamental challenges, such as response inconsistency and the risk of contraindicated advice, remain highly relevant as benchmarks against which future models can be measured.
- Limited Model Scope: This study focused exclusively on two versions of ChatGPT. While this was a deliberate choice to reflect real-world usage, a direct comparison with other contemporary models (e.g., Google’s Gemini, Anthropic’s Claude) was beyond the scope of this work and is an important area for future research.
- Simplified Prompt Design: Although we established clear, guideline-based definitions for the comorbidities, our prompts deliberately omitted granular clinical details (e.g., specific eGFR values) in order to simulate quick, real-world clinical queries. We acknowledge the methodological trade-off this creates: the simplified prompts allowed an effective evaluation of the models’ ‘out-of-the-box’ performance in realistic scenarios, but the absence of patient-specific data may have constrained the models’ ability to provide more tailored recommendations and could have contributed to some of the observed inaccuracies [34].
- Geographical and Sample Constraints: The study was conducted in Taiwan with a panel of five cardiologists. Although the medical practices and pharmaceuticals used are largely aligned with international standards, regional variations could limit the global generalizability of our specific findings.
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mendis, S.; Graham, I.; Narula, J. Addressing the Global Burden of Cardiovascular Diseases; Need for Scalable and Sustainable Frameworks. Glob. Heart 2022, 17, 48.
- Preiksaitis, C.; Rose, C. Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review. JMIR Med. Educ. 2023, 9, e48785.
- Lahat, A.; Sharif, K.; Zoabi, N.; Shneor Patt, Y.; Sharif, Y.; Fisher, L.; Shani, U.; Arow, M.; Levin, R.; Klang, E. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J. Med. Internet Res. 2024, 26, e54571.
- Chlorogiannis, D.D.; Apostolos, A.; Chlorogiannis, A.; Palaiodimos, L.; Giannakoulas, G.; Pargaonkar, S.; Xesfingi, S.; Kokkinidis, D.G. The Role of ChatGPT in the Advancement of Diagnosis, Management, and Prognosis of Cardiovascular and Cerebrovascular Disease. Healthcare 2023, 11, 2906.
- Rizwan, A.; Sadiq, T. The Use of AI in Diagnosing Diseases and Providing Management Plans: A Consultation on Cardiovascular Disorders with ChatGPT. Cureus 2023, 15, e43106.
- Sarraju, A.; Bruemmer, D.; Van Iterson, E.; Cho, L.; Rodriguez, F.; Laffin, L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained from a Popular Online Chat-Based Artificial Intelligence Model. JAMA 2023, 329, 842–844.
- Shan, G.; Chen, X.; Wang, C.; Liu, L.; Gu, Y.; Jiang, H.; Shi, T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med. Inform. 2025, 13, e64963.
- Zhang, K.; Meng, X.; Yan, X.; Ji, J.; Liu, J.; Xu, H.; Zhang, H.; Liu, D.; Wang, J.; Wang, X.; et al. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J. Med. Internet Res. 2025, 27, e59069.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
- Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312.
- Zaretsky, J.; Kim, J.M.; Baskharoun, S.; Zhao, Y.; Austrian, J.; Aphinyanaphongs, Y.; Gupta, R.; Blecker, S.B.; Feldman, J. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7, e240357.
- Cheng, H.Y. ChatGPT’s Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study. JMIR Form. Res. 2025, 9, e63494.
- Rosner, B.A. Fundamentals of Biostatistics, 6th ed.; Thomson-Brooks/Cole: Belmont, CA, USA, 2006.
- Iqbal, U.; Lee, L.T.J.; Rahmanti, A.R.; Celi, L.A.; Li, Y.C.J. Can large language models provide secondary reliable opinion on treatment options for dermatological diseases? J. Am. Med. Inform. Assoc. 2024, 31, 1341–1347.
- Li, C.; Zhao, Y.; Bai, Y.; Zhao, B.; Tola, Y.O.; Chan, C.W.; Zhang, M.; Fu, X. Unveiling the Potential of Large Language Models in Transforming Chronic Disease Management: Mixed Methods Systematic Review. J. Med. Internet Res. 2025, 27, e70535.
- Joglar, J.A.; Chung, M.K.; Armbruster, A.L.; Benjamin, E.J.; Chyou, J.Y.; Cronin, E.M.; Deswal, A.; Eckhardt, L.L.; Goldberger, Z.D.; Gopinathannair, R.; et al. 2023 ACC/AHA/ACCP/HRS Guideline for the Diagnosis and Management of Atrial Fibrillation: A Report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. J. Am. Coll. Cardiol. 2024, 83, 109–279.
- Kim, D.-G.; Kim, S.H.; Park, S.Y.; Han, B.G.; Kim, J.S.; Yang, J.W.; Park, Y.J.; Lee, J.Y. Anticoagulation in patients with end-stage kidney disease and atrial fibrillation: A national population-based study. Clin. Kidney J. 2024, 17, sfae029.
- McDonagh, T.A.; Metra, M.; Adamo, M.; Gardner, R.S.; Baumbach, A.; Böhm, M.; Burri, H.; Butler, J.; Čelutkienė, J.; Chioncel, O.; et al. 2023 Focused Update of the 2021 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure: Developed by the task force for the diagnosis and treatment of acute and chronic heart failure of the European Society of Cardiology (ESC) with the special contribution of the Heart Failure Association (HFA) of the ESC. Eur. J. Heart Fail. 2024, 26, 5–17.
- Miao, J.; Thongprayoon, C.; Cheungpasitporn, W. Assessing the Accuracy of ChatGPT on Core Questions in Glomerular Disease. Kidney Int. Rep. 2023, 8, 1657–1659.
- Yeo, Y.H.; Samaan, J.S.; Ng, W.H.; Ting, P.S.; Trivedi, H.; Vipani, A.; Ayoub, W.; Yang, J.D.; Liran, O.; Spiegel, B.; et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin. Mol. Hepatol. 2023, 29, 721–732.
- Pugliese, N.; Wong, V.W.-S.; Schattenberg, J.M.; Romero-Gomez, M.; Sebastiani, G.; Aghemo, A.; Castera, L.; Hassan, C.; Manousou, P.; Miele, L.; et al. Accuracy, Reliability, and Comprehensibility of ChatGPT-Generated Medical Responses for Patients with Nonalcoholic Fatty Liver Disease. Clin. Gastroenterol. Hepatol. 2024, 22, 886–889.
- Sheikh, M.S.; Barreto, E.F.; Miao, J.; Thongprayoon, C.; Gregoire, J.R.; Dreesman, B.; Erickson, S.B.; Craici, I.M.; Cheungpasitporn, W. Evaluating ChatGPT’s efficacy in assessing the safety of non-prescription medications and supplements in patients with kidney disease. Digit. Health 2024, 10, 20552076241248082.
- Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nat. Med. 2022, 28, 31–38.
- National Academy of Medicine. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. In The Learning Health System Series; Whicher, D., Ahmed, M., Israni, S.T., Matheny, M., Eds.; National Academies Press: Washington, DC, USA, 2022.
- Jung, K.H. Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc. Inform. Res. 2025, 31, 114–124.
- Mirakhori, F.; Niazi, S.K. Harnessing the AI/ML in Drug and Biological Products Discovery and Development: The Regulatory Perspective. Pharmaceuticals 2025, 18, 47.
- Chen, Y.; Esmaeilzadeh, P. Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges. J. Med. Internet Res. 2024, 26, e53008.
- Lee, P.; Bubeck, S.; Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N. Engl. J. Med. 2023, 388, 1233–1239.
- Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177.
- Barth, J.; de Boer, W.E.L.; Busse, J.W.; Hoving, J.L.; Kedzia, S.; Couban, R.; Fischer, K.; von Allmen, D.Y.; Spanjer, J.; Kunz, R. Inter-rater agreement in evaluation of disability: Systematic review of reproducibility studies. BMJ 2017, 356, j14.
- Lugo, V.M.; Torres, M.; Garmendia, O.; Suarez-Giron, M.; Ruiz, C.; Carmona, C.; Chiner, E.; Tarraubella, N.; Dalmases, M.; Pedro, A.M.; et al. Intra- and Inter-Physician Agreement in Therapeutic Decision for Sleep Apnea Syndrome. Arch. Bronconeumol. 2020, 56, 18–22.
- Sarvari, P.; Al-Fagih, Z.; Ghuwel, A.; Al-Fagih, O. A systematic evaluation of the performance of GPT-4 and PaLM2 to diagnose comorbidities in MIMIC-IV patients. Health Care Sci. 2024, 3, 3–18.
- Avram, R.; Dwivedi, G.; Kaul, P.; Manlhiot, C.; Tsang, W. Artificial Intelligence in Cardiovascular Medicine: From Clinical Care, Education, and Research Applications to Foundational Models—A Perspective. Can. J. Cardiol. 2024, 40, 1769–1773.
- Bhattaru, A.; Yanamala, N.; Sengupta, P.P. Revolutionizing Cardiology with Words: Unveiling the Impact of Large Language Models in Medical Science Writing. Can. J. Cardiol. 2024, 40, 1950–1958.
| Comorbidity | Hypertension (GPT 3.5) | Hypertension (GPT 4.0) | Atrial Fibrillation (GPT 3.5) | Atrial Fibrillation (GPT 4.0) | Old Myocardial Infarction (GPT 3.5) | Old Myocardial Infarction (GPT 4.0) | Congestive Heart Failure (GPT 3.5) | Congestive Heart Failure (GPT 4.0) | Hypercholesterolemia (GPT 3.5) | Hypercholesterolemia (GPT 4.0) |
|---|---|---|---|---|---|---|---|---|---|---|
| Diabetes | 10 | 10 | 10 | 22 | 21 | 14 | 21 | 14 | 11 | 10 |
| CKD | 12 | 25 | 10 | 12 | 11 | 18 | 10 | 19 | 11 | 10 |
| ESRD | 11 | 13 | 10 | 11 | 10 | 10 | 10 | 20 | 16 | 10 |
| COPD | 14 | 15 | 11 | 10 | 14 | 13 | 16 | 14 | 12 | 10 |
| Asthma | 12 | 13 | 11 | 18 | 13 | 14 | 13 | 17 | 11 | 10 |
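The table above reports one value per comorbidity–condition scenario for each model version. Purely as an illustration of how such paired per-scenario values could be compared between GPT 3.5 and GPT 4.0, the minimal Python sketch below pairs the cells and applies a Wilcoxon signed-rank test; this test choice is ours and is not necessarily the analysis used in the study.

```python
# Illustrative sketch only: pair the GPT 3.5 and GPT 4.0 values from the table
# above (one pair per comorbidity x cardiac-condition scenario) and run a
# Wilcoxon signed-rank test. The test choice is ours, not the paper's.
from scipy.stats import wilcoxon

# Values read row by row (diabetes, CKD, ESRD, COPD, asthma), columns in order:
# hypertension, atrial fibrillation, old MI, CHF, hypercholesterolemia.
gpt35 = [10, 10, 21, 21, 11,
         12, 10, 11, 10, 11,
         11, 10, 10, 10, 16,
         14, 11, 14, 16, 12,
         12, 11, 13, 13, 11]
gpt40 = [10, 22, 14, 14, 10,
         25, 12, 18, 19, 10,
         13, 11, 10, 20, 10,
         15, 10, 13, 14, 10,
         13, 18, 14, 17, 10]

stat, p = wilcoxon(gpt35, gpt40)  # zero-difference pairs are dropped by default
print(f"Wilcoxon signed-rank: statistic = {stat:.1f}, p = {p:.3f}")
```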
| Version of the Model | “Low Priority” (n) | “High Priority” (n) | Total (N) | Approval Rate (%) |
|---|---|---|---|---|
| ChatGPT 3.5 | 404 | 2301 | 2705 | 85.06 |
| ChatGPT 4.0 | 380 | 2521 | 2901 | 86.90 |
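The approval rate in the last column is the proportion of “high priority” responses among all responses for each version (for example, 2301/2705 ≈ 85.06% for ChatGPT 3.5). The minimal Python sketch below recomputes these rates from the counts and, purely as an illustration, compares the two versions with a chi-square test on the 2 × 2 table; this test is our own choice and not necessarily the one used in the study.

```python
# Minimal sketch: recompute the approval rates reported above and run an
# illustrative chi-square comparison between versions (not the paper's method).
from scipy.stats import chi2_contingency

counts = {
    "ChatGPT 3.5": {"high": 2301, "low": 404},
    "ChatGPT 4.0": {"high": 2521, "low": 380},
}

for version, c in counts.items():
    total = c["high"] + c["low"]
    print(f"{version}: approval rate = {100 * c['high'] / total:.2f}% ({c['high']}/{total})")

# 2 x 2 contingency table: rows = model version, columns = high/low priority
table = [[c["high"], c["low"]] for c in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```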
| Low-Priority Category | ChatGPT 3.5 (n = 2705), N | ChatGPT 3.5, % | ChatGPT 4.0 (n = 2901), N | ChatGPT 4.0, % |
|---|---|---|---|---|
| Low Priority: Maybe Useful | 218 | 53.96 | 225 | 59.21 |
| Low Priority: Not Useful | 156 | 38.61 | 102 | 26.84 |
| Contraindicated | 30 | 7.43 | 53 | 13.95 |
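The percentages in this table are computed within each version’s pool of low-priority responses (404 for ChatGPT 3.5 and 380 for ChatGPT 4.0); for example, contraindicated suggestions account for 30/404 ≈ 7.43% of ChatGPT 3.5’s low-priority responses versus 53/380 ≈ 13.95% for ChatGPT 4.0. The sketch below reproduces these figures and, as an illustration only, tests whether the category distribution differs between versions; the test choice is ours, not the study’s.

```python
# Minimal sketch: reproduce the within-category percentages and compare the
# distribution of low-priority categories across versions (illustrative test).
from scipy.stats import chi2_contingency

categories = ["Maybe Useful", "Not Useful", "Contraindicated"]
low_priority = {
    "ChatGPT 3.5": [218, 156, 30],   # totals 404
    "ChatGPT 4.0": [225, 102, 53],   # totals 380
}

for version, values in low_priority.items():
    total = sum(values)
    for name, n in zip(categories, values):
        print(f"{version} - {name}: {100 * n / total:.2f}% ({n}/{total})")

# 3 x 2 contingency table: rows = category, columns = model version
table = list(zip(low_priority["ChatGPT 3.5"], low_priority["ChatGPT 4.0"]))
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```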