Deep Learning for Myocardial Infarction Detection Using Electrocardiogram Images: A Systematic Review
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper conducts a systematic review, following PRISMA guidelines, that addresses this gap by analyzing and synthesizing research efforts on deep learning models trained on ECG images for myocardial infarction detection. The paper is very well written and very well organized. I have no points to add since the authors address all the issues, and I suggest accepting this paper.
Author Response
Comment 1.1. This paper conducts a systematic review, following PRISMA guidelines, that addresses this gap by analyzing and synthesizing research efforts on deep learning models trained on ECG images for myocardial infarction detection. The paper is very well written and very well organized. I have no points to add since the authors address all the issues, and I suggest accepting this paper.
Authors’ response. Thank you for your comments.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presents a systematic review of deep learning approaches for myocardial infarction (MI) detection using ECG images, conducted in accordance with PRISMA guidelines. The research questions are clearly defined, and the results are well structured. The paper selection process is transparently described, supporting reproducibility. However, several issues should be more thoroughly addressed before the manuscript can be considered for publication.
1> Although PRISMA is a well-established framework for systematic reviews, a survey of deep learning applications should more explicitly focus on key technical and methodological questions, such as:
- What are the main challenges in applying deep learning to MI detection?
- What technical strategies have been proposed to address these challenges?
- What is the correlation among the reviewed papers?
2> In Figure 6, the term “framework” requires a clearer definition. The figure appears to summarize network backbones; however, some listed methods (e.g., YOLO and Fast R-CNN) are detection frameworks rather than backbones. Similarly, BiGRU and BiLSTM are typically network modules, while models such as NASNet or SpinalNet are often derived from or built upon well-known backbones. As a result, aggregating these methods into a single statistical figure may be conceptually misleading and should be reconsidered or clarified.
3> Table 3 reports performance comparisons among different methods, but the corresponding training and testing datasets are not specified. Since variations in dataset selection and experimental settings can significantly affect performance, the conclusions drawn from this table may lack robustness without this contextual information.
4> The manuscript would benefit from a clearer discussion of the relative advantages and limitations of using ECG images compared with ECG signals for MI detection.
5> Given that class imbalance is a common issue in ECG datasets, the paper should more explicitly discuss how the reviewed studies address imbalanced data and evaluate the effectiveness of these strategies.
Minor comment: Some repetition can be found in the manuscript. The authors are encouraged to carefully revise the text to remove duplicated content and improve conciseness.
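As an illustration of comment 5, one common way to counter class imbalance is to weight each class inversely to its frequency during training. The sketch below is only an example of such a strategy: the report does not name it, and the function name `inverse_frequency_weights`, the labels, and the 80/20 split are assumptions made for illustration. The formula matches the widely used "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    Uses the "balanced" heuristic: n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution (not taken from any reviewed study).
labels = ["normal"] * 80 + ["mi"] * 20
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, errors on the rarer class contribute more to a weighted loss than errors on the majority class. Whether this improves results in any particular study would still need to be checked with metrics suited to imbalanced data.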
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This review examines deep learning approaches for myocardial infarction detection using ECG images. Following PRISMA guidelines, the authors analyzed 47 papers from 361 initial records across 6 databases, characterizing architectures, datasets, evaluation practices, and proposing future research directions including uncertainty quantification, explainability, and expert involvement.
1. Figures 6 and 7 overlap substantially.
2. Line 313: "Almost 15%" is stated when the value is exactly 14.89% (7/47); be precise in systematic reviews.
3. In Table 4, the dominance of the Khan dataset (30% of dataset uses) deserves critical examination. Is this because it's high-quality or simply convenient? What are its limitations?
4. Why exclude "image registration" but not other image processing terms? The exclusion list (Table 1) is given without justification.
5. Figure 3 (article screening flowchart) lacks the granular exclusion reasons recommended by PRISMA 2020.
6. Section 2.2 (search strategy) is overly detailed for the main text; consider moving it to supplementary materials.
7. Lines 83-84 state "no time constraints", but the publications span 2018-2025. This is an emergent finding, not a design choice.
8. Table 3: Add columns for evaluation protocol (CV vs. holdout) and averaging method.
9. Figures 10-11 show class distributions, but there is no discussion of the clinical validity of these groupings (e.g., is "abnormal heartbeat" clinically meaningful alongside MI?).
10. Line 209: How was "structured research methodology" assessed? This seems subjective without operationalization.
11. Section 3.3.6 reports median performance >0.95 but doesn't adequately compare these results, given that (a) 74.47% work with imbalanced data while 65.71% don't use balancing methods, (b) only 8.51% report AUPRC (the appropriate metric for imbalanced data), and (c) different datasets/protocols make comparisons invalid. The acknowledgment in one sentence (lines 345-347) is insufficient.
12. Only 4/47 studies (8.5%) involved domain experts (Section 4.3), yet this critical finding doesn't appear in the abstract or receive sufficient emphasis in the conclusions. For medical AI, this isn't just a limitation; it's a fundamental problem.
13. Only 2/47 studies (4.3%) provide source code, and ~40% don't fully describe their models (Fig 12). This should be prominently featured as a major finding, not just mentioned in Section 4.4.
14. Khan et al. datasets dominate (20/54 total dataset uses), but there is no critical examination of why this concentration exists, of its quality or limitations, or of whether this creates an echo chamber effect or geographic/demographic biases in the available datasets. Lines 354-362 simply list statistics.
15. Section 4.5 acknowledges ViTs are "more computationally expensive" and show "performance similar to that of CNNs, or only marginally better at best" (L513-515), yet still recommends continued exploration. The evidence presented actually argues against ViT investigation. Either provide a stronger rationale or demote this priority.
16. Despite collecting quantitative metrics from 47 studies (Table 3), no statistical analysis is performed. Figure 14 shows distributions without tests for significance or heterogeneity assessment.
17. The contradiction between 74.47% having imbalanced data, 65.71% not using balancing methods, yet achieving >0.95 accuracy strongly suggests problematic evaluation practices (data leakage, inappropriate metrics). This deserves dedicated critical discussion.
18. Were negative results underrepresented? Assessing this is standard for systematic reviews, but it is not addressed.
19. Section 4.3 mentions 4 studies involve experts but doesn't analyze how they were involved or what this revealed about model limitations. [77] found expert-comparable performance; this is significant and deserves more discussion.
20. Figure 9 shows ~50% use transfer learning, mostly fine-tuning, but there is no analysis of what they transfer from (ImageNet? Medical images?) or whether domain-specific pretraining improves results.
21. Figure 18 shows a wide variety of techniques, but there is no discussion of whether this heterogeneity makes comparison impossible or whether certain approaches correlate with better performance.
22. 72% use all 12 leads (Fig 15), but there is no critical discussion of whether this is necessary or whether single-lead models might suffice for screening applications.
23. In Section 3.3.6 (Performance results), the claim "should not be used for exact comparisons" is insufficient. Add specific discussion of the risk of data leakage (same patients in train/test), the impact of dataset difficulty differences, and selection bias (studies may only report best results).
24. In Section 3.3.9 (Class imbalance), the finding that 65.71% don't use balancing methods despite 74.47% having imbalanced data needs prominent critical commentary; this likely explains inflated accuracy metrics and should inform interpretation of Table 3.
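To illustrate the concern raised in points 11, 17, and 24, the following minimal sketch shows how a model that never predicts the minority class can still report high accuracy on imbalanced data. The 95/5 class split and the function name `binary_metrics` are assumptions chosen for illustration; they are not taken from the reviewed studies.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced test set: 95 normal (0) and 5 MI (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.95 precision=0.00 recall=0.00
```

In this toy case the accuracy is 0.95 although no MI sample is detected, which is why class-sensitive metrics such as recall or AUPRC are more informative when the data are imbalanced.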
Questions:
1. How did you handle the 14 studies using the Khan 2021 dataset? Are they truly independent evaluations or variations on similar approaches?
2. What constitutes "macro" performance in Table 3? Did you standardize across different reporting styles?
3. For the 4 studies involving experts, what did expert evaluation reveal about model failures?
4. The median of 4 metrics reported seems low. Is there a correlation between fewer metrics and higher reported accuracy?
5. Why didn't explainability techniques (used by <15%) factor into quality criteria given the clinical context?
It is recommended to:
1. Strengthen the critical analysis by adding a dedicated subsection examining why reported performances may be inflated (imbalanced data + inappropriate metrics + lack of standardization).
2. Elevate the clinical validation gap by featuring the 8.5% expert involvement rate in the abstract and conclusions as a major field limitation.
3. Reformulate or remove the ViT recommendation in Section 4.5, as it lacks compelling justification: either provide stronger evidence or acknowledge that this is a low priority.
4. Add statistical analysis, even basic descriptive statistics comparing Q1 journals vs. others, or the correlation between methodology quality and reported performance.
5. Expand the dataset analysis with a critical examination of Khan dataset dominance and a discussion of geographic/demographic representation.
6. Integrate uncertainty quantification throughout, as Section 4.1's conformal prediction discussion should inform the evaluation of existing work.
7. Acknowledge the limitations: no meta-analysis, potential publication bias, and a single-application domain focus.
Conclusion: This is methodologically sound work that makes valuable contributions through comprehensive coverage and identification of important gaps (uncertainty quantification, explainability, expert involvement). However, it needs to be substantially more critical about the field's limitations. The authors have done good data collection; now they should leverage it for incisive analysis of why reported performances likely overstate real-world applicability and what structural changes the field needs.
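The data leakage risk named in point 23 (same patients in train/test) can be made concrete with a patient-level split. The following is a minimal sketch under stated assumptions: the function name `split_by_patient`, the record layout of (patient ID, file name) pairs, and the 20% test fraction are hypothetical and do not describe the protocol of any reviewed study.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, sample) pairs so no patient is in both train and test."""
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    # Hold out whole patients, not individual images.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


# Hypothetical data: 10 patients with 3 ECG images each.
records = [(f"patient_{i}", f"ecg_{i}_{j}.png") for i in range(10) for j in range(3)]
train, test = split_by_patient(records)

train_patients = {pid for pid, _ in train}
test_patients = {pid for pid, _ in test}
# The two patient sets should be disjoint.
print(len(train), len(test), train_patients.isdisjoint(test_patients))
```

Splitting at the image level instead could place images from the same patient on both sides of the split, which is one way reported performance may be inflated.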
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have adequately addressed the reviewer's comments. The current version is suitable for publication.
Reviewer 3 Report
Comments and Suggestions for Authors
The new version has been improved with most of the issues solved and remarks considered.

