Evaluating Large Language Models in Cardiology: A Comparative Study of ChatGPT, Claude, and Gemini
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors presented a very original paper on AI tools, aiming to validate their relevance in clinical practice. The study is well conducted and the methodology is sound. The results are of clinical interest.
- Of the 70 predefined clinical questions, 40 were randomly assigned to simulate patient inquiries and the remaining 30 to simulate physician inquiries; these questions should be made available in an annex.
- The discussion is too long and should be summarized.
- A dedicated discussion section would be expected on the evolution of treatments over time, given the rapid changes in scientific data.
Author Response
We sincerely thank Reviewer 1 for the positive evaluation and constructive feedback that helped improve the clarity and completeness of our manuscript. Please find below our point-by-point responses and the corresponding changes.
- Availability of the 70 predefined clinical questions as an annex
Response: Thank you for this suggestion. All 70 predefined clinical prompts, with their assignment to the patient or physician scenario, are included in Supplementary Table S1 (Excel file). A statement referring to this Supplementary Table has been added to the Methods section.
- The discussion is too long and should be summarized
Response: Thank you for your feedback. The Discussion section has been completely restructured, incorporating themes suggested by both reviewers. Condensing the content reduced the text from approximately 1,800 words to the current 1,200.
- Add a discussion on the evolution of cardiology treatments and the rapid changes in scientific data
Response: We agree that the rapidly evolving nature of cardiology guidelines and therapies is crucial. We have added a dedicated paragraph in the Discussion section addressing how frequent updates to clinical evidence and treatment recommendations present unique challenges for maintaining LLM relevance and safety over time.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
I have reviewed your manuscript evaluating large language models in cardiology applications with great interest. Your work addresses an important gap in understanding AI performance in cardiology. The methodology is generally sound and the clinical focus is highly relevant. Below are my detailed comments.
- Please state the exact model names and versions used to avoid misinterpretation, as these models are updated rapidly. Specify the exact model versions (e.g., GPT-4 vs. GPT-3.5).
- As noted in the point above, your study period (September-December 2024) coincided with major LLM updates, and by publication time, the evaluated models may have significantly different capabilities. This fundamentally challenges the current relevance of your findings. If re-evaluation isn't feasible, add a substantial discussion section addressing how rapid AI evolution affects study interpretation. Consider repositioning the study as establishing a benchmarking methodology rather than definitive model rankings. Discuss implications for future evaluation frameworks in rapidly evolving AI landscapes. This temporal issue affects not just the current findings but also raises questions about how to conduct meaningful LLM evaluations in rapidly evolving technological landscapes.
- The exclusion of other prominent models limits comprehensiveness but is understandable. Using only free-tier access may not reflect the optimal performance available to healthcare institutions. Justify the model selection criteria more thoroughly and discuss how free vs. premium model access might affect clinical implementation.
- The evaluation relies entirely on expert opinion, without validation against established clinical guidelines or objective medical standards. Include examples of responses that scored differently to illustrate the evaluation criteria. Discuss how subjective evaluation might introduce bias despite good inter-rater reliability. If feasible, cross-reference a subset of responses against relevant clinical guidelines (ESC, ACC/AHA).
- Provide a technical explanation of the Friedman-derived calculations and include effect size interpretations in a clinical context (what does a 0.3-point difference mean in practice?). Multiple testing across the four criteria could benefit from family-wise error correction. Consider adding these points.
- Acknowledge limitations of single-turn evaluation. Justify word limit choice or test different length constraints. Discuss implications for multi-turn clinical conversations.
- Expand on practical implementation implications beyond performance rankings. Address potential bias in training data affecting cardiology-specific performance.
- Include more discussion on ethical considerations, liability issues, and the need for human oversight in clinical AI applications.
This is valuable research addressing an important clinical question with generally sound methodology. The blinded expert evaluation approach is exemplary, and the statistical analysis is appropriate. However, the rapid pace of AI development presents unique challenges that need more explicit acknowledgment and methodological adaptation.
Your work contributes meaningfully to the literature on AI in healthcare, but requires revisions to maintain relevance and impact in this rapidly evolving field.
I look forward to seeing a revised version.
Sincerely
Author Response
We thank Reviewer 2 for the thorough and insightful review. Your detailed comments have substantially strengthened the rigor and transparency of our manuscript. We address each point in detail below:
- Specify exact model versions used
Response: The exact versions of the evaluated models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Flash-002) are now specified in the Methods section.
- Temporal validity and methodological implications due to rapid LLM evolution
Response: We have expanded the Discussion and Limitations section to explicitly recognize the temporal nature of our findings. The revised manuscript positions the study as a replicable benchmarking methodology, with recommendations for periodic reassessment as the models are updated.
- Model selection criteria (free-tier only) and implications for clinical use
Response: The rationale for selecting freely accessible public model versions is clarified in the Limitations section. We acknowledge that premium and institutional models may perform differently, and future research should address this gap.
- Subjective expert ratings, divergence, and lack of systematic guideline mapping
Response: We thank the reviewer for this valuable suggestion. In response, we prepared a supplementary file (“Example_of_Discordant_Evalutation.docx”) that presents three representative examples of LLM responses that received divergent ratings among reviewers. For each example, we report the original prompt, the model response, the individual reviewer ratings for all four criteria, and a brief interpretation of the sources of divergence. We also compared the response content with current guideline recommendations. This addition clarifies how subjective assessment can introduce variability despite good inter-rater reliability, and provides transparency regarding our evaluation criteria. Supplementary Material S2 is referenced in the revised Results and Discussion.
- Technical explanation of Friedman/Kendall's W and clinical interpretation of effect size
Response: The Methods section now provides a concise explanation of Kendall's W calculation (for inter-rater agreement) and our use of the Friedman test, including their clinical interpretation.
For all post-hoc pairwise comparisons between models, we used Dunn's test with Bonferroni correction to adjust for multiple testing and reduce type I error risk.
Additionally, we now discuss the practical significance of observed score differences (e.g., a 0.3–0.5 point change on a 5-point scale) in the Results and Discussion sections, helping readers understand their potential clinical relevance.
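For illustration, the minimal Python sketch below mirrors the pipeline described above: a Friedman test across the three models, Kendall's W derived from the Friedman statistic as an effect size, and Bonferroni-adjusted post-hoc pairwise comparisons. It is not our analysis code: the ratings are simulated, and the scikit-posthocs Dunn call treats each model's scores as an independent group, which simplifies the repeated-measures design used in the study.

```python
# Illustrative sketch only (simulated data, not the study's actual ratings).
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # pip install scikit-posthocs

rng = np.random.default_rng(42)
n_prompts, n_models = 70, 3
# Hypothetical 5-point expert ratings: rows = prompts, columns = models
scores = np.clip(np.round(rng.normal(loc=[4.3, 4.0, 3.8],
                                     scale=0.6,
                                     size=(n_prompts, n_models))), 1, 5)

# Friedman test on the three related samples (same prompts rated for each model)
chi2, p = friedmanchisquare(*(scores[:, j] for j in range(n_models)))

# Kendall's W derived from the Friedman statistic: W = chi2 / (n * (k - 1)),
# ranging from 0 (no effect/agreement) to 1 (perfect agreement)
w = chi2 / (n_prompts * (n_models - 1))
print(f"Friedman chi2 = {chi2:.2f}, p = {p:.4f}, Kendall's W = {w:.3f}")

# Pairwise post-hoc comparisons with Bonferroni correction (simplified here:
# each model's column is passed as a separate group)
dunn_p = sp.posthoc_dunn([scores[:, j] for j in range(n_models)],
                         p_adjust="bonferroni")
print(dunn_p)
```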
- Limitations of single-turn evaluation and word limit
Response: The rationale for single-turn, word-limited prompts is now clearly stated, and the limitations of this design, particularly regarding multi-turn clinical interactions, are discussed.
- Practical implications, bias in training data, and cardiology-specific performance
Response: The Discussion now addresses the influence of model training data composition and potential biases, with an emphasis on the need for domain-adapted LLMs and greater transparency from developers.
- Ethical considerations, liability, and human oversight
Response: We further emphasized the need for robust human oversight, transparent accountability, and clear assignment of responsibility in any clinical deployment of LLMs.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
I have reviewed your manuscript on the comparative evaluation of ChatGPT, Claude, and Gemini in cardiology applications with great interest. This study is timely, methodologically sound, and provides useful insights into LLM performance in cardiology, with potential implications for AI-assisted patient education and clinical decision support.
To strengthen the manuscript, please consider the following revisions:
1. There are some discrepancies between the numbers reported in the tables and in the text. Revise Tables 4 and 5 to correct apparent mismatches in means, Δ values, and p-values (e.g., the accuracy means in Table 4 suggest that post-diagnostic scores are higher). Align the textual interpretations accordingly and provide or reference the raw data (e.g., the GitHub repository's CSVs such as stat_analysis_diagnostic_phase.csv or stat_analysis_user_type.csv) for verification.
2. Evaluation Scale and Rubric Details: If possible, add the detailed scoring rubric to the Methods section or supplements to improve transparency.
3. Subgroup Analysis Interpretation: If possible, report effect sizes, and discuss implications.
4. Generalizability and Model Evolution: Expand Limitations to address version-specific results and suggest re-benchmarking; note potential impacts from training data cutoffs.
5. Literature and Citations: Expand the citation list.
6. Ethical Discussion Section: Expand the section on ethical considerations.
This study provides valuable empirical evidence for LLM performance in cardiology with rigorous methodology. However, the rapidly evolving nature of the field and limitations in capturing real-world clinical complexity suggest results should be interpreted cautiously. The work establishes a solid foundation for ongoing evaluation but requires broader validation and deeper exploration of clinical utility.
I would be glad to review a revised version.
Sincerely
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
I have reviewed your revised manuscript with great interest. You have effectively addressed prior concerns. Congratulations.
Sincerely