Artificial Intelligence for Biomedical Diagnostics: Diagnostic Accuracy and Reliability of Multimodal Large Language Models in Electrocardiogram Interpretation
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Objectives
2.2. Dataset and Case Selection
2.3. Models Under Evaluation
2.4. Prompting Strategy and Inference Protocol
2.5. Outcome Definitions
2.6. Statistical Analysis
3. Results
3.1. Overview of Diagnostic Accuracy
3.2. Task-Level Accuracy
3.3. Heart Rate Estimation
3.4. Inter-Run Reliability
3.5. Accuracy–Consistency Dissociation
4. Discussion
4.1. Task-Dependent Performance Hierarchy
4.2. Accuracy–Consistency Dissociation
4.3. The Kappa Paradox in MLLM Evaluation
4.4. Comparison with Prior Literature
- Second, accuracy is highly sensitive to task framing, prompt structure, and the availability of clinical context, which limits the external validity of any single accuracy estimate, including those reported here [28].
- Third, headline accuracy values can mask substantial inter-run variability, which supports the central methodological argument of the present study that single-run benchmarking is insufficient for diagnostic evaluation of MLLMs and that reliability should be reported alongside accuracy [24].
| Study | N Unique ECG Images | Models Evaluated | Runs | Reported Accuracy (and κ Where Available) |
|---|---|---|---|---|
| Present study | 13 | ChatGPT-5.3, Gemini 3.1 Pro, Claude Opus 4.6, Grok 4.1, ERNIE 5.0 | 5 | Overall categorical accuracy 52.3–64.9% across models; mean Fleiss κ 0.03–0.73 across models |
| Günay 2024 [27] | 40 | GPT-4, GPT-4o, Gemini Advanced | 12 | Median ~51% (GPT-4) to ~67.5% (GPT-4o); Fleiss κ 0.27–0.51 |
| Günay & Yiğit 2026 [24] | 40 | GPT-4o, DeepSeek R1 | 13 | Median correct answers 22/40 (DeepSeek) vs. 27/40 (GPT-4o); Fleiss κ 0.71/0.47 |
| Zhu 2024 [8] | 62 | GPT-4V | 3 attempts per item | MCQ-based; 53.2% strict (3/3 correct) vs. 83.9% lenient (≥1/3 correct) |
| Engelstein 2025 [11] | 80 | GPT-4o, Gemini 2.0 Flash | Mixed (3 prompt formats; 5 runs of best setup) | Zero-shot binary 53–63%; few-shot binary 83%; 6-class 41% |
| Çamkıran 2025 [10] | 107 | GPT-ECGReader, GPT-ECGAnalyzer, GPT-ECGInterpreter | 1 | 57.9–62.6% across the three GPT variants |
| Bocz 2025 [9] | 130 | GPT-4, Gemini | 1 | 25.6% (Gemini) to 31.2% (GPT-4) |
| Zeljković 2025 [28] | 150 | GPT-4 | 1 (×2 conditions) | 19% (without context) vs. 45% (with clinical context) |
| Seki 2025 [31] | 928 | GPT-4V, Gemini Pro Vision | 1 | No single accuracy metric reported; qualitative evaluation only |
| Lee 2025 [26] | 928 | GPT-4o, Gemini 2.5 Pro | 1 | 29.6% (Gemini 2.5 Pro) to 66.0% (GPT-4o) |
| Yang 2025 [29] | n.a. (signal-based; PTB-XL) | ECG-LM (purpose-built multimodal LLM) | n.a. | Domain-specific model; metrics not directly comparable to general-purpose MLLMs |
| Liu 2026 [30] | ECGBench (multi-dataset) | GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Claude 3.5 Sonnet, and open-source MLLMs (LLaVA-OneVision, DeepSeek-VL, MiniCPM-V) | 1 | Relative performance reported; domain-tuned PULSE outperforms general-purpose MLLMs by 21–33% in average accuracy |
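
The Zhu 2024 row illustrates how strongly the scoring rule affects a headline figure when each item is attempted several times: strict scoring credits an item only if every attempt is correct, whereas lenient scoring credits it if at least one attempt is correct. A minimal Python sketch of the two rules follows. The function name and the toy data are illustrative assumptions and are not taken from any of the cited studies.

```python
from typing import Sequence


def strict_and_lenient_accuracy(
    attempts: Sequence[Sequence[bool]],
) -> tuple[float, float]:
    """Return (strict, lenient) accuracy.

    attempts[i][j] is True when attempt j on item i was correct.
    Strict: every attempt on the item is correct.
    Lenient: at least one attempt on the item is correct.
    """
    if not attempts:
        raise ValueError("at least one item is required")
    n_items = len(attempts)
    strict = sum(all(item) for item in attempts) / n_items
    lenient = sum(any(item) for item in attempts) / n_items
    return strict, lenient


# Hypothetical toy data: 4 items, 3 attempts each (not study data).
toy_attempts = [
    [True, True, True],     # correct under both rules
    [True, False, True],    # lenient only
    [False, False, False],  # incorrect under both rules
    [False, True, False],   # lenient only
]

strict_acc, lenient_acc = strict_and_lenient_accuracy(toy_attempts)
print(f"strict={strict_acc:.2f}, lenient={lenient_acc:.2f}")  # strict=0.25, lenient=0.75
```

The same responses give 25% or 75% depending on the rule, which is why the number of runs and the aggregation rule need to be read alongside any accuracy value in the table.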
4.5. Clinical Implications and Future Directions
4.6. Limitations
- First, the dataset comprised 13 unique ECGs, yielding 325 model responses and 2275 task-level assessments through the multi-run design. While the total number of responses reflects repeated measurements, the number of independent ECG cases is N = 13. The repeated-measures approach was a deliberate methodological choice: assessing inter-run reliability inherently requires multiple observations per case, which was prioritized over maximizing the number of unique ECGs within a single-run design. The dataset does not represent the full clinical ECG spectrum; specific pathologies such as ST-elevation myocardial infarction, atrioventricular block, or bundle branch block were underrepresented or absent. Accordingly, all between-model comparisons should be interpreted as exploratory and descriptive. A formal post hoc power analysis was not performed, as the study does not test a pre-specified hypothesis, and the repeated-measures structure does not align with standard power calculations based on independent observations. Future studies should employ larger datasets with systematic inclusion of defined pathological entities to permit more granular performance analysis. The statistical implications of the limited number of independent cases are addressed quantitatively in Section 2.6, where the precision of all reported accuracy estimates is communicated through 95% Wilson confidence intervals and where the rationale for the descriptive analytical framework is set out in detail.
- Second, all models were accessed through their official web interfaces using default settings. This approach reflects real-world usage conditions but precludes control over inference parameters such as temperature, which may influence output variability. API-based evaluations with fixed temperature settings could yield different consistency profiles.
- Third, only zero-shot prompting was employed. Few-shot or chain-of-thought prompting strategies might improve performance, and in the absence of such comparisons the findings apply only to the specific prompting paradigm used.
- Fourth, all prompts were in English. Model performance may differ for prompts in other languages, and language-specific effects on diagnostic accuracy have been documented for text-based medical queries.
- Fifth, model versions evolve rapidly, and the results reported here are specific to the versions evaluated at the time of data collection (March 2026). Given the continuous updates to these systems, performance characteristics may change over time, underscoring the need for longitudinal re-evaluation.
- Sixth, this study assessed only whether model outputs matched ground truth categories, without evaluating the quality of clinical reasoning or the ability to integrate ECG findings with patient context. Models that arrive at correct answers through flawed reasoning may be less reliable in practice than accuracy metrics suggest. Prior work on MLLM ECG interpretation has highlighted discrepancies between predicted labels and underlying explanations, with models often producing inaccurate descriptions of image features even when the final classification is correct. Consistent with this, qualitative analyses report that the majority of diagnostic explanations are partially or fully incorrect [26,31].
- Seventh, the ECGs were interpreted in isolation, without accompanying clinical information such as patient age, symptoms, or medication history. In clinical practice, ECG interpretation is always performed in context, and the absence of this information may disadvantage models capable of contextual integration. The magnitude of this effect has been quantified in a recent study: when GPT-4 was provided with clinical context for the same set of ECGs, interpretation accuracy increased from 19% to 45%, with the largest improvement observed in acute coronary syndrome cases (10% vs. 70%) [28].
- Eighth, the constrained response format, while enabling standardized comparison across models, may not fully capture model capabilities for open-ended ECG interpretation. Models may identify abnormalities that do not fit the predefined categories, and this potential was not assessed.
- Ninth, ground truth labels were based on clinically pre-interpreted ECGs and subsequently harmonized by the study authors. While this approach reflects pragmatic study design and is consistent with prior work, residual subjectivity cannot be excluded, particularly given the known variability in human ECG interpretation.
- Tenth, the study did not include a concurrent comparison with human experts interpreting the same ECGs under identical conditions. A direct human–AI comparison on the same dataset would provide important contextualization of model performance and is recommended for future work.
- Eleventh, this study evaluated MLLMs in their default configuration without retrieval-augmented generation or access to external medical knowledge bases. The absence of such mechanisms represents both a limitation of the current evaluation and a motivation for future work.
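
The response counts and the Wilson intervals mentioned in the first limitation can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and the per-cell denominator of 65 (13 ECGs × 5 runs) is an inference from the study design rather than a value stated in this section. The Wilson formula treats those 65 responses as independent, which is exactly the simplification the first limitation cautions about.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Design figures taken from the text: 13 ECGs, 5 models, 5 runs, 7 tasks.
n_ecgs, n_models, n_runs, n_tasks = 13, 5, 5, 7
n_responses = n_ecgs * n_models * n_runs   # 325 model responses
n_assessments = n_responses * n_tasks      # 2275 task-level assessments
n_per_cell = n_ecgs * n_runs               # 65 responses per model and task (assumed)

# Example: 54 correct out of 65 is 83.1% accuracy.
low, high = wilson_interval(54, n_per_cell)
print(n_responses, n_assessments)                # 325 2275
print(f"{100 * low:.1f}-{100 * high:.1f}")       # 72.2-90.3
```

An interval of roughly 72% to 90% around an 83.1% point estimate shows how wide the uncertainty remains even before accounting for the fact that only 13 cases are independent.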
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial intelligence |
| API | Application programming interface |
| bpm | Beats per minute |
| CNN | Convolutional neural network |
| ECG | Electrocardiogram |
| κ | Kappa coefficient |
| LAD | Left axis deviation |
| LoRA | Low-rank adaptation |
| MAE | Mean absolute error |
| MLLM | Multimodal large language model |
| QLoRA | Quantized low-rank adaptation |
| RAG | Retrieval-augmented generation |
| SD | Standard deviation |
References
- Schläpfer, J.; Wellens, H.J. Computer-Interpreted Electrocardiograms: Benefits and Limitations. J. Am. Coll. Cardiol. 2017, 70, 1183–1192.
- Salerno, S.M.; Alguire, P.C.; Waxman, H.S. Competency in interpretation of 12-lead electrocardiograms: A summary and appraisal of published evidence. Ann. Intern. Med. 2003, 138, 751–760.
- Cook, D.A.; Oh, S.Y.; Pusic, M.V. Accuracy of Physicians’ Electrocardiogram Interpretations: A Systematic Review and Meta-analysis. JAMA Intern. Med. 2020, 180, 1461–1471.
- Salerno, S.M.; Alguire, P.C.; Waxman, H.S.; American College of Physicians. Training and competency evaluation for interpretation of 12-lead electrocardiograms: Recommendations from the American College of Physicians. Ann. Intern. Med. 2003, 138, 747–750.
- Siontis, K.C.; Noseworthy, P.A.; Attia, Z.I.; Friedman, P.A. Artificial intelligence-enhanced electrocardiography in cardiovascular disease management. Nat. Rev. Cardiol. 2021, 18, 465–478.
- Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nat. Med. 2022, 28, 31–38.
- Mallick, K.; Singh, N.; Hajiarbabi, M. Cardi-GPT: An Expert ECG-Record Processing Chatbot. arXiv 2025, arXiv:2510.24737.
- Zhu, L.; Mou, W.; Wu, K.; Lai, Y.; Lin, A.; Yang, T.; Zhang, J.; Luo, P. Multimodal ChatGPT-4V for Electrocardiogram Interpretation: Promise and Limitations. J. Med. Internet Res. 2024, 26, e54607.
- Bocz, B.; Janosi, K.F.; Ferencz, A.B.; Debreceni, D.; Torma, D.; Kupo, P. Evaluation of AI-based ECG analysis accuracy using ChatGPT-4 and Gemini AI models. EP Eur. 2025, 27, euaf085.032.
- Çamkıran, V.; Tunç, H.; Achmar, B.; Ürker, T.S.; Kutlu, İ.; Torun, A. Artificial intelligence (ChatGPT) ready to evaluate ECG in real life? Not yet! Digit. Health 2025, 11, 20552076251325279.
- Engelstein, H.; Ramon-Gonen, R.; Sabbag, A.; Klang, E.; Sudri, K.; Cohen-Shelly, M.; Barbash, I. Effectiveness of the GPT-4o Model in Interpreting Electrocardiogram Images for Cardiac Diagnostics: Diagnostic Accuracy Study. JMIR AI 2025, 4, e74426.
- Shyr, C.; Ren, B.; Hsu, C.-Y.; Tinker, R.J.; Cassini, T.A.; Hamid, R.; Wright, A.; Bastarache, L.; Peterson, J.F.; Malin, B.A.; et al. A statistical framework for evaluating the repeatability and reproducibility of large language models. medRxiv, 2025; in press.
- Alvarado Gonzalez, M.A.; Bruno Hernandez, M.; Peñaloza Perez, M.A.; Lopez Orozco, B.; Cruz Soto, J.T.; Malagon, S. Do Repetitions Matter? Strengthening Reliability in LLM Evaluations. arXiv 2025, arXiv:2509.24086.
- Anggriani, D.; Mustamin, S.B.; Atnang, M.; Nuzry, K.A.P. High consistency, limited accuracy: Evaluating large language models for binary medical diagnosis. medRxiv, 2025; in press.
- Blackwell, R.E.; Barry, J.; Cohn, A.G. Towards reproducible LLM evaluation: Quantifying uncertainty in LLM benchmark scores. arXiv 2025, arXiv:2410.03492.
- Rautaharju, P.M.; Surawicz, B.; Gettes, L.S.; Bailey, J.J.; Childers, R.; Deal, B.J.; Gorgels, A.; Hancock, E.W.; Josephson, M.; Kligfield, P.; et al. AHA/ACCF/HRS recommendations for the standardization and interpretation of the electrocardiogram: Part IV: The ST segment, T and U waves, and the QT interval: A scientific statement from the American Heart Association Electrocardiography and Arrhythmias Committee, Council on Clinical Cardiology; the American College of Cardiology Foundation; and the Heart Rhythm Society. Endorsed by the International Society for Computerized Electrocardiology. J. Am. Coll. Cardiol. 2009, 53, 982–991.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378–382.
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.
- Feinstein, A.R.; Cicchetti, D.V. High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 1990, 43, 543–549.
- Python Software Foundation. Python Language Reference; Python Software Foundation: Beaverton, OR, USA, 2023; Available online: https://www.python.org (accessed on 30 January 2026).
- Schwartz, P.J.; Crotti, L. Long QT Syndrome. N. Engl. J. Med. 2025, 393, 2023–2034.
- Viskin, S. Long QT syndromes and torsade de pointes. Lancet 1999, 354, 1625–1633.
- Günay, S.; Öztürk, A.; Karahan, A.T.; Barindik, M.; Komut, S.; Yiğit, Y. Comparing DeepSeek and GPT-4o in ECG interpretation: Is AI improving over time? Heart Lung 2026, 75, 366–371.
- Gwet, K.L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters, 5th ed.; AgreeStat Analytics: Gaithersburg, MD, USA, 2021.
- Lee, H.; Yoo, S.; Kim, J.; Cho, Y.; Suh, D.; Lee, K. Comparative Diagnostic Performance of a Multimodal Large Language Model Versus a Dedicated Electrocardiogram AI in Detecting Myocardial Infarction from Electrocardiogram Images: Comparative Study. JMIR AI 2025, 4, e75910.
- Günay, S.; Öztürk, A.; Yiğit, Y. The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: A comparison with cardiologists and emergency medicine specialists. Am. J. Emerg. Med. 2024, 84, 68–73.
- Zeljkovic, I.; Novak, A.; Lisicic, A.; Jordan, A.; Serman, A.; Jurin, I.; Pavlovic, N.; Manola, S. Beyond Text: The Impact of Clinical Context on GPT-4’s 12-Lead Electrocardiogram Interpretation Accuracy. Can. J. Cardiol. 2025, 41, 1406–1414.
- Yang, K.; Hong, M.; Zhang, J.; Luo, Y.; Zhao, S.; Zhang, O.; Yu, X.; Zhou, J.; Yang, L.; Zhang, P.; et al. ECG-LM: Understanding Electrocardiogram with a Large Language Model. Health Data Sci. 2025, 5, 0221.
- Liu, R.; Bai, Y.; Yue, X.; Zhang, P. Teaching multimodal LLMs to comprehend 12-lead electrocardiographic images. NPJ Digit. Med. 2026; in press.
- Seki, T.; Kawazoe, Y.; Ito, H.; Akagi, Y.; Takiguchi, T.; Ohe, K. Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation. Front. Cardiovasc. Med. 2025, 12, 1458289.
- Topol, E.J. High-Performance Medicine: The Convergence of Human and Artificial Intelligence. Nat. Med. 2019, 25, 44–56.
- Takahashi, S.; Sakaguchi, Y.; Kouno, N.; Takasawa, K.; Ishizu, K.; Akagi, Y.; Aoyama, R.; Teraya, N.; Bolatkan, A.; Shinkai, N.; et al. Comparison of Vision Transformers and Convolutional Neural Networks in Medical Image Analysis: A Systematic Review. J. Med. Syst. 2024, 48, 84.
- AlSaad, R.; Abd-Alrazaq, A.; Boughorbel, S.; Ahmed, A.; Renault, M.A.; Damseh, R.; Sheikh, J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J. Med. Internet Res. 2024, 26, e59505.
- Ahrens, L.; Haverkamp, W.; Strodthoff, N. ECG-LLM—Training and Evaluation of Domain-Specific Large Language Models for Electrocardiography. arXiv 2025, arXiv:2510.18339.
- Neha, F.; Bhati, D.; Shukla, D.K. Generative AI Models (2018–2024): Advancements and Applications in Kidney Care. BioMedInformatics 2025, 5, 18.
- Neha, F.; Bhati, D.; Shukla, D.K. Retrieval-Augmented Generation (RAG) in Healthcare: A Comprehensive Review. AI 2025, 6, 226.
- Cipollone, P.; Pierucci, N.; Matteucci, A.; Palombi, M.; Laviola, D.; Bruti, R.; Vinciullo, S.; Bernardi, M.; Spadafora, L.; Cersosimo, A.; et al. Artificial Intelligence in Cardiac Electrophysiology: A Comprehensive Review. J. Pers. Med. 2025, 15, 532.


| Task | Type | Categories | Ground Truth Distribution | Baseline (%) |
|---|---|---|---|---|
| Heart rate | Continuous | Numeric (bpm) | Range: 50–160; Mean: 104.6 bpm | n/a |
| Rhythm | Binary | Regular, irregular | Regular: 11; Irregular: 2 | 84.6 |
| Electrical axis | 3-class | Normal, LAD, right axis deviation | Normal: 10; LAD: 2; Right axis dev.: 1 | 76.9 |
| PR/P-wave | 3-class | Normal, not visible, polymorphic | Normal: 10; Not visible: 2; Poly.: 1 | 76.9 |
| QRS duration | Binary | Narrow, wide | Narrow: 11; Wide: 2 | 84.6 |
| ST/T-wave | 5-class | Normal, ST elevation, ST depression, T-wave inversion, not assessable | Normal: 8; T-wave inv.: 2; ST elev.: 1; ST dep.: 1; N/A: 1 | 61.5 |
| QTc interval * | 4-class | Normal, prolonged, J wave visible, not assessable | Normal: 9; Prolonged: 2; J wave: 1; N/A: 1 | 69.2 |

| Model | Rhythm | Axis | PR/P-Wave | QRS | ST/T-Wave | QTc | Overall |
|---|---|---|---|---|---|---|---|
| ChatGPT-5.3 | 83.1 (72.2–90.3) | 73.8 (62.0–83.0) | 72.3 (60.4–81.7) | 90.8 (81.3–95.7) | 27.7 (18.3–39.6) | 38.5 (27.6–50.6) | 64.4 |
| Gemini 3.1 Pro | 83.1 (72.2–90.3) | 72.3 (60.4–81.7) | 63.1 (50.9–73.8) | 87.7 (77.5–93.6) | 41.5 (30.4–53.7) | 40.0 (29.0–52.1) | 64.6 |
| Claude Opus 4.6 | 67.7 (55.6–77.8) | 58.5 (46.3–69.6) | 60.0 (47.9–71.0) | 66.2 (54.0–76.5) | 20.0 (12.1–31.3) | 41.5 (30.4–53.7) | 52.3 |
| Grok 4.1 | 66.2 (54.0–76.5) | 61.5 (49.4–72.4) | 55.4 (43.3–66.8) | 78.5 (67.0–86.7) | 20.0 (12.1–31.3) | 67.7 (55.6–77.8) | 58.2 |
| ERNIE 5.0 | 80.0 (68.7–87.9) | 72.3 (60.4–81.7) | 61.5 (49.4–72.4) | 78.5 (67.0–86.7) | 36.9 (26.2–49.1) | 60.0 (47.9–71.0) | 64.9 |
| Baseline | 84.6 | 76.9 | 76.9 | 84.6 | 61.5 | 69.2 | 75.6 |
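
The Baseline row is numerically consistent with a majority-class rate, that is, the accuracy obtained by always answering with the most frequent ground-truth category for a task. This reading is an inference from the reported ground truth distributions and is not a definition quoted from the Methods. A minimal sketch with hypothetical names:

```python
# Ground-truth category counts per task, copied from the task table (13 ECGs each).
ground_truth_counts = {
    "Rhythm": {"Regular": 11, "Irregular": 2},
    "Axis": {"Normal": 10, "LAD": 2, "Right axis deviation": 1},
    "PR/P-wave": {"Normal": 10, "Not visible": 2, "Polymorphic": 1},
    "QRS": {"Narrow": 11, "Wide": 2},
    "ST/T-wave": {
        "Normal": 8,
        "T-wave inversion": 2,
        "ST elevation": 1,
        "ST depression": 1,
        "Not assessable": 1,
    },
    "QTc": {"Normal": 9, "Prolonged": 2, "J wave visible": 1, "Not assessable": 1},
}


def majority_baseline(counts: dict[str, int]) -> float:
    """Accuracy (%) of always predicting the most frequent category."""
    return 100 * max(counts.values()) / sum(counts.values())


baselines = {task: majority_baseline(c) for task, c in ground_truth_counts.items()}
overall_baseline = sum(baselines.values()) / len(baselines)

for task, value in baselines.items():
    print(f"{task}: {value:.1f}")
print(f"Overall: {overall_baseline:.1f}")  # Overall: 75.6
```

Under this reading, a model that falls below the baseline for a task performs worse than a constant answer would, which is the comparison the Baseline row makes possible.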

| Model | MAE (bpm) | SD (bpm) | 95% CI |
|---|---|---|---|
| ChatGPT-5.3 | 14.8 | 2.3 | 11.9–17.6 |
| Gemini 3.1 Pro | 17.5 | 3.5 | 13.9–21.0 |
| Claude Opus 4.6 | 19.2 | 3.9 | 15.7–22.7 |
| Grok 4.1 | 46.7 | 6.7 | 40.4–53.0 |
| ERNIE 5.0 | 31.7 | 8.8 | 25.2–38.2 |

| Model | Rhythm | Axis | PR/P-Wave | QRS | ST/T-Wave | QTc | Mean κ |
|---|---|---|---|---|---|---|---|
| ChatGPT-5.3 | 0.126 | 0.512 | 0.803 | 0.816 | 0.625 | 0.750 | 0.605 |
| Gemini 3.1 Pro | 0.484 | 0.328 | 0.815 | 0.586 | 0.377 | 0.487 | 0.513 |
| Claude Opus 4.6 | 0.938 | 0.548 | 0.739 | 0.840 | 0.645 | 0.653 | 0.727 |
| Grok 4.1 | 0.131 | 0.366 | 0.233 | 0.232 | 0.178 | 0.050 | 0.198 |
| ERNIE 5.0 | 0.200 | 0.044 | −0.078 | 0.068 | 0.062 | −0.116 | 0.030 |
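
Near-zero and negative κ values such as those in the ERNIE 5.0 row can arise even when repeated runs agree most of the time, which is the kappa paradox described by Feinstein and Cicchetti [20]. The sketch below implements the standard Fleiss κ formula [18] in plain Python, treating each of the five runs as a rater. The function name and the toy ratings are illustrative assumptions, not study data.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> tuple[float, float]:
    """Return (observed agreement, Fleiss kappa).

    ratings[i] holds the category assigned to subject i by each rater;
    every subject must have the same number of ratings (at least 2).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    p_observed = agreement_sum / n_subjects
    total = n_subjects * n_raters
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        # Only one category was ever used: kappa is undefined.
        return p_observed, float("nan")
    return p_observed, (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 13 ECGs, 5 runs; a single run deviates on one ECG.
toy_runs = [["Regular"] * 5 for _ in range(12)] + [["Regular"] * 4 + ["Irregular"]]
agreement, kappa = fleiss_kappa(toy_runs)
print(f"observed agreement={agreement:.3f}, kappa={kappa:.3f}")
```

In this toy case the runs agree on about 97% of rater pairs, yet κ is slightly negative, because one category dominates and the expected chance agreement is almost as high as the observed agreement. With skewed category distributions such as those in this dataset, κ is therefore best read together with raw agreement.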
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Stelling, H.; Kraus, A.; Grieb, G.; Breidung, D.; Güler, I. Artificial Intelligence for Biomedical Diagnostics: Diagnostic Accuracy and Reliability of Multimodal Large Language Models in Electrocardiogram Interpretation. Life 2026, 16, 681. https://doi.org/10.3390/life16040681

