AI in the Hot Seat: Head-to-Head Comparison of Large Language Models and Cardiologists in Emergency Scenarios
Abstract
1. Introduction
2. Methods
2.1. Overview of Large Language Models in Medical Practice
2.2. Study Design and Participants
2.3. Physician Comparator Group and Scope of Assessment
2.4. Clinical Scenarios
2.5. LLM Prompting and Response Processing
- ChatGPT (OpenAI, GPT-4o, version released in 2024);
- Claude (Anthropic, Claude 3 Opus);
- Gemini (Google, Gemini Advanced/Ultra 1.0);
- Llama (Meta, Llama 3 70B);
- Qwen (Alibaba, Qwen 2 72B);
- Bing Copilot (Microsoft; GPT-4–class large language model);
- DeepSeek (DeepSeek-V2).
2.6. Response Standardization
2.7. Evaluation and Blinding
2.8. Prompting Strategy
1. Persona declaration (persona-based prompting): The prompt begins with “I am an interventional cardiologist,” establishing the expert identity and prompting the model to reason with specialized interventional terminology, procedural priorities, and complication-management strategies.
2. Clinical scenario with no prior examples (zero-shot prompting): The prompt includes no example cases or predefined model outputs.
3. Stepwise structured reasoning (chain-of-thought prompting): Prompts were designed to elicit stepwise clinical reasoning without directive or hierarchy-implying language.
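Taken together, the three components above amount to a fixed prompt template. A minimal sketch in Python (the `build_prompt` helper and the scenario text are illustrative assumptions, not the study's actual materials):

```python
def build_prompt(scenario: str) -> str:
    """Assemble the three-part prompt: persona declaration,
    zero-shot clinical scenario, and a chain-of-thought cue."""
    persona = "I am an interventional cardiologist."  # persona-based prompting
    # Chain-of-thought cue, phrased without directive or hierarchy-implying language
    cot_cue = (
        "Please reason through the case step by step before "
        "stating your final management plan."
    )
    # Zero-shot: the scenario is presented with no example cases or model outputs
    return f"{persona}\n\n{scenario}\n\n{cot_cue}"

# Illustrative scenario text (not a study vignette)
prompt = build_prompt("A 62-year-old presents with acute chest pain and ST elevation...")
```

The same template would be sent verbatim to each model, so performance differences reflect the models rather than the prompt wording.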
2.9. Statistical Analysis
3. Results
3.1. Group Performance
3.2. Statistical Comparisons
3.3. Interpretation
4. Discussion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References


| Model/Physician | Mean ± SD | SE | 95% CI (Lower–Upper) |
|---|---|---|---|
| ChatGPT | 87.4 ± 13.0 | 2.40 | 82.5–92.3 |
| Claude | 80.8 ± 13.6 | 2.49 | 75.7–85.9 |
| DeepSeek | 78.7 ± 15.7 | 2.87 | 72.9–84.6 |
| Llama | 73.7 ± 17.4 | 3.17 | 67.2–80.2 |
| Qwen | 66.2 ± 17.9 | 3.26 | 59.6–72.9 |
| Bing | 64.3 ± 16.8 | 3.07 | 58.0–70.6 |
| Gemini | 59.0 ± 16.6 | 3.02 | 52.8–65.2 |
| DR.1 | 78.9 ± 16.7 | 3.05 | 72.6–85.1 |
| DR.2 | 68.4 ± 16.7 | 3.04 | 62.1–74.6 |
| DR.3 | 81.0 ± 15.0 | 2.73 | 75.4–86.6 |
| DR.4 | 76.7 ± 16.3 | 2.97 | 70.6–82.7 |
| DR.5 | 98.5 ± 10.8 | 1.97 | 94.4–102.0 |
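The SE column is consistent with each score being averaged over about 30 scenarios (SE = SD/√n) with t-based 95% CIs; this is an inference from the rounded table values, not the authors' stated procedure. A quick check against the DeepSeek row, assuming n = 30:

```python
import math

# DeepSeek row from the table; n = 30 is an assumption inferred from SE = SD/sqrt(n)
sd, mean, n = 15.7, 78.7, 30
t_crit = 2.045  # two-sided 97.5th percentile of the t distribution with 29 df

se = sd / math.sqrt(n)                       # standard error of the mean (~2.87)
lo = mean - t_crit * se                      # lower 95% CI bound
hi = mean + t_crit * se                      # upper 95% CI bound
```

The result matches the reported SE of 2.87 and CI of 72.9–84.6 to within rounding of the published summary statistics.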

| Comparison | Difference in Mean Score | 95% CI (Lower to Upper) | p-Value |
|---|---|---|---|
| ChatGPT vs. physicians | 6.69 | 0.01 to 13.36 | <0.05 |
| Claude vs. physicians | 0.15 | −6.71 to 7.02 | 1.000 |
| DeepSeek vs. physicians | −1.95 | −8.94 to 5.04 | 0.906 |
| Llama vs. physicians | −6.98 | −13.46 to −0.50 | 0.036 |
| Qwen vs. physicians | −14.45 | −20.52 to −8.37 | <0.001 |
| Bing vs. physicians | −16.35 | −22.49 to −10.20 | <0.001 |
| Gemini vs. physicians | −21.65 | −27.93 to −15.37 | <0.001 |
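The difference column is consistent with each model's mean score minus the pooled mean of the five physicians (80.7). A sketch of that arithmetic, assuming simple pooling of the DR.1–DR.5 means (the CI and p-value machinery, e.g., whichever post hoc test was actually used, is omitted):

```python
# Physician mean scores (DR.1–DR.5) from the performance table
physicians = [78.9, 68.4, 81.0, 76.7, 98.5]
pooled = sum(physicians) / len(physicians)  # pooled physician mean: 80.7

# Two model means shown for illustration
models = {"ChatGPT": 87.4, "Gemini": 59.0}
diffs = {name: round(m - pooled, 2) for name, m in models.items()}
```

The computed differences (ChatGPT +6.7, Gemini −21.7) agree with the reported 6.69 and −21.65 to within rounding of the published means.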
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Cicek, V.; Zhao, L.; Tur, Y.; Oz, A.; Kilic, S.; Durak, G.; Saylik, F.; Hayiroglu, M.I.; Cinar, T.; Bagci, U. AI in the Hot Seat: Head-to-Head Comparison of Large Language Models and Cardiologists in Emergency Scenarios. Med. Sci. 2026, 14, 33. https://doi.org/10.3390/medsci14010033

