Article
Peer-Review Record

Clinical Applicability of Machine Learning Models for Binary and Multi-Class Electrocardiogram Classification

by Daniel Nasef 1, Demarcus Nasef 1, Kennette James Basco 2,3, Alana Singh 3, Christina Hartnett 4, Michael Ruane 4, Jason Tagliarino 4, Michael Nizich 3,* and Milan Toma 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 4 February 2025 / Revised: 6 March 2025 / Accepted: 12 March 2025 / Published: 14 March 2025
(This article belongs to the Section Medical & Healthcare AI)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper applied different machine learning methods to ECG classification. The authors divided the original multi-class modeling problem into separate binary classification tasks and compared the performance of multiple algorithms (e.g., tree-based models, neural networks). Feature importance from a CNN model is also provided to show the influence of different factors on the model decision. The manuscript is overall easy to follow. However, there are several major issues that hinder it from reaching the required level of publication.

  1. The paper does not provide any information about the data, while jumping directly into discussions of models and results. Without seeing any introduction and analysis of the data patterns, I was not convinced by the findings and the soundness of the selected methods/algorithms.
  2.  The authors used CNN as one candidate model.  However, there is no evidence in the paper justifying why CNN is applicable -- by just looking at the feature importance I cannot understand how convolutional layers work for this topic.
  3. There are many duplications in the modeling strategy and metrics. For example, accuracy and misclassification rate are nothing but one minus the other -- no extra information can be obtained from such redundant measures. Also, it is confusing why the authors used LGBM in experiment 1 but GBC in experiment 2.
  4. It is not clear how the classification decision is made -- are the authors taking p=0.5 (or 1/class_numbers) as the cut-off score? Wouldn't the default cut-off give suboptimal results? Without all this information, I was not convinced by the provided metric values and comparisons.
  5. Why are the results/metrics from RF missing in some experiments?

Author Response

This paper applied different machine learning methods to ECG classification. The authors divided the original multi-class modeling problem into separate binary classification tasks and compared the performance of multiple algorithms (e.g., tree-based models, neural networks). Feature importance from a CNN model is also provided to show the influence of different factors on the model decision. The manuscript is overall easy to follow. However, there are several major issues that hinder it from reaching the required level of publication.

  1. The paper does not provide any information about the data, while jumping directly into discussions of models and results. Without seeing any introduction and analysis of the data patterns, I was not convinced by the findings and the soundness of the selected methods/algorithms.

Thank you for your comments. An explanation of the data is indeed essential context for the methods and results that follow. We have applied your suggestion and included the added text below:

“All ECGs were labeled by clinicians as “normal,” “abnormal,” or “borderline.” “Normal” and “abnormal” refer to ECG findings that either conform to or deviate from a standard ECG, with abnormalities warranting further evaluation. “Borderline” indicates an ECG that requires additional assessment to differentiate benign variations from pathology… Patient data included sex, weight, history of diabetes, and history of smoking. From ECGs, the data extracted included ventricular rate, atrial rate, PR interval, QRS duration, QT interval, QTc calculation, P-axis, R-axis, and T-axis. These patients were classified as “normal,” “borderline,” or “abnormal” based on the ECG data and patient data.”

  2. The authors used CNN as one candidate model. However, there is no evidence in the paper justifying why CNN is applicable -- by just looking at the feature importance I cannot understand how convolutional layers work for this topic.

Thank you for your comment and for pointing this out! We included CNNs as an additional model architecture alongside the machine learning and DNN models. We did not include them because we expected them to perform better; rather, we wanted to compare the performance of basic model architectures to determine which is most suitable for screening ECGs while maintaining generalizability. We have added a section clarifying this, shown below:

“We set out with the goal to determine the most appropriate techniques and machine learning architecture to screen for abnormal ECG studies while maintaining an appropriate level of generalizability and interpretability – issues common with current models [16-18].”

  3. There are many duplications in the modeling strategy and metrics. For example, accuracy and misclassification rate are nothing but one minus the other -- no extra information can be obtained from such redundant measures.

Thank you for raising this important point. We acknowledge that accuracy and misclassification rate (1 − accuracy) are mathematically complementary. As healthcare professionals, we chose to report both metrics to accommodate the diverse preferences of our audience. For instance, while accuracy is a standard measure of overall correctness, the misclassification rate emphasizes error frequency, which can be more intuitive when discussing clinical decision risks. Clinicians, who are key stakeholders, often evaluate models based on "error rates" to contextualize false positives/negatives in patient care pathways. Similarly, machine learning practitioners may prioritize accuracy for benchmarking against the established literature.
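To make the arithmetic relationship explicit, a minimal scikit-learn illustration (the label arrays below are hypothetical placeholders, not study data) would be:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for a binary screening task
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)   # proportion of correct classifications
misclassification_rate = 1.0 - accuracy     # proportion of incorrect classifications
print(f"accuracy = {accuracy:.3f}, misclassification rate = {misclassification_rate:.3f}")
```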

  3. (continued) Also, it is confusing why the authors used LGBM in experiment 1 but GBC in experiment 2.

Thank you for your comment. In both binary problems, several machine learning models from the PyCaret library were trained and tested. Of these models, the best-performing one in each task was selected for further analysis. We added this to the Methods section to make it clearer: “Across both binary classification tasks, multiple machine learning models (including LightGBM, GBC, and others) were trained and evaluated using PyCaret’s framework, with the best-performing model from each task selected for final comparison and analysis to ensure optimal performance.”
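For illustration, a hedged sketch of that PyCaret selection workflow (the DataFrame, file path, and column name are hypothetical placeholders, not the study's actual code) might look like this:

```python
import pandas as pd
from pycaret.classification import setup, compare_models, pull

# Hypothetical DataFrame of ECG-derived features plus a binary target column
ecg_df = pd.read_csv("ecg_features.csv")  # placeholder path

# Initialize the PyCaret experiment; cross-validation is handled internally
setup(data=ecg_df, target="label", fold=5, session_id=42, verbose=False)

# Train and cross-validate several candidate models, returning the top performer
best_model = compare_models(include=["lightgbm", "gbc", "rf", "lr"])
print(pull())  # leaderboard of per-model metrics (accuracy, AUC, recall, etc.)
```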

  4. It is not clear how the classification decision is made -- are the authors taking p=0.5 (or 1/class_numbers) as the cut-off score? Wouldn't the default cut-off give suboptimal results? Without all this information, I was not convinced by the provided metric values and comparisons.

Thank you for raising this critical point. Our ML models distinguish ECG patterns through a combination of domain-specific feature engineering and hierarchical learning architectures. For tree-based models (e.g., GBC, RF), feature importance analysis (Figure 11, Section 3.2) identified key predictors such as ventricular rate, QRS duration, and P-R interval. These features are clinically validated markers: for instance, prolonged QRS durations correlate with ventricular conduction abnormalities, while P-R interval deviations indicate atrioventricular block risks. Models learn decision boundaries by iteratively splitting data based on these features, prioritizing parameters with strong discriminatory power.

CNNs, trained on structured ECG-derived features, learn hierarchical representations by capturing local temporal dependencies (e.g., subtle ST-segment variations or T-wave alternans) even when raw waveforms are not directly input. For example, the CNN’s first convolutional layer might emphasize QRS complex detection, while deeper layers integrate contextual relationships between intervals like QT and RR. This mimics manual ECG parsing but automates pattern recognition at scale.

The hierarchical binary framework (Section 2.1) further reduces ambiguity: separating "Normal" vs. "Non-Normal" cases first minimizes overlap between ambiguous "Borderline" and definitive classes. Feature selection (e.g., atrial rate for rhythm vs. QTC for repolarization) is task-specific, allowing models to specialize in distinct diagnostic contexts.

We clarified this methodology in Section 2.2.1 (CNN architecture) and Section 3.2 (Feature Importance), emphasizing how clinical expertise guided feature selection to disentangle overlapping patterns. This approach ensures models leverage both data-driven insights and domain knowledge, enhancing reliability in borderline cases.

  5. Why are the results/metrics from RF missing in some experiments?

The missing RF results in some experiments are  due to the model being evaluated only for specific binary classification tasks, while other tree-based models like LightGBM and GBC were prioritized in multi-class settings. Additionally, some results were excluded if they did not provide meaningful insights, or if computational constraints favored more efficient models.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a comparative analysis of supervised approaches for the hierarchical classification of ECGs. The methods compared are standard neural architectures and machine learning methods. This reviewer notes substantial shortcomings that do not allow him to suggest acceptance of the work. 

Below are more pointed comments that may help the authors to prepare a significantly improved version for presentation in a scientific journal:
Abstract
- It is not clear what is meant by abnormal, borderline, and normal. The authors need to make it clear to the reader what the classification problem is.
- In line 9 the authors say that the CNN is generalisable. With respect to which different conditions? It is not clear how the validation was conducted (leave-one-subject-out? hold-out?).
Introduction
- The study of the literature on this topic should be expanded. It is not clear what the innovation of this contribution is if one does not have a clear picture of the relevant literature on ECG signal classification. In which contexts, for which pathologies? What techniques are currently in use? What performance is to be beaten? What are the current limitations in the literature that this contribution aims to overcome?
- It should be better explained what is meant by the labels ‘abnormal’, ‘borderline’ and ‘normal’.
Materials and methods
- This reviewer would provide an overview of the ECG signal in order to make the reader, even if not an expert in biological signals, understand the size and morphology of the ECG signal. I would at least provide an image showing the morphological characteristics of a signal belonging to the 3 labels. Furthermore, the authors should specify the window size used for signal analysis.
- It is not clear what data are used for the proposed validation. Is a public dataset used? Was the data collected by the authors? This is not explained.
- If you do not know how much data there is and its distribution, you cannot interpret the results to assess whether a data imbalance may have affected the classification performance.
Results
- The tables are not very readable. They do not provide a comprehensive evaluation of all metrics of the approaches. Furthermore, as it is not made clear on which dataset the evaluation is carried out and what type of validation is followed, it is not possible to assess the veracity of the results.
- Again, not having clarified which dataset is the reference dataset makes it impossible to understand this data. Are we talking about a dataset that does not include the raw ECG signal, but rather a dataset that includes the features shown in Fig. 11?

Author Response

The paper presents a comparative analysis of supervised approaches for the hierarchical classification of ECGs. The methods compared are standard neural architectures and machine learning methods. This reviewer notes substantial shortcomings that do not allow him to suggest acceptance of the work.

Below are more pointed comments that may help the authors to prepare a significantly improved version for presentation in a scientific journal:

Abstract

  • It is not clear what is meant by abnormal, borderline, and normal. The authors need to make it clear to the reader what the classification problem is.

Thank you for your comment; this clarification makes the classification problem and the goals of our paper much clearer. We have revised the abstract to define each class and have added the revision below:

“In this study, “normal” and “abnormal” refer to ECG findings that either align with or deviate from a standard ECG, with deviations warranting further evaluation. “Borderline” indicates an ECG requiring additional assessment to distinguish benign variations from pathology.”

  • In line 9 the authors say that the CNN is generalisable. With respect to which different conditions? It is not clear how the validation was conducted (leave-one-subject-out? hold-out?).

Thank you for your comment, we see how this does not clearly convey the results of the CNN model. We have edited this segment to clarify the benefits of its generalizability, and we have pasted this edit below:

“Results showed that CNNs achieved the best balance between generalization and performance, effectively adapting to unseen data and variations without overfitting. They exhibit strong convergence and robust feature importance.”

Introduction

  • The study of the literature on this topic should be expanded. It is not clear what the innovation of this contribution is if one does not have a clear picture of the relevant literature on ECG signal classification. In which contexts, for which pathologies? What techniques are currently in use? What performance is to be beaten? What are the current limitations in the literature that this contribution aims to overcome?

Thank you for your comment. We should clarify the reason for our research and the goals we have set out to complete. Although we mentioned the current state of machine learning in ECGs and the issues with current models, we could definitely be clearer with our goals in this project. We have made the corresponding revisions below:

“This study aims to evaluate the clinical applicability of various machine learning models for both binary and multi-class classification of ECG signals. We set out with the goal to determine the most appropriate techniques and machine learning architecture to screen for abnormal ECG studies while maintaining an appropriate level of generalizability and interpretability – issues common with current models [16-18].”

  • It should be better explained what is meant by the labels ‘abnormal’, ‘borderline’ and ‘normal’.

Thank you for your comment, we agree that this oversight makes the premise of the paper hard to understand and the results difficult to interpret. We have added an explanation of the labels used to provide some clarity:

“All ECGs were labeled by clinicians as “normal,” “abnormal,” or “borderline.” “Normal” and “abnormal” refer to ECG findings that either conform to or deviate from a standard ECG, with abnormalities warranting further evaluation. “Borderline” indicates an ECG that requires additional assessment to differentiate benign variations from pathology. The first binary task differentiates between "Abnormal" and "Non-Abnormal" signals, where "Non-Abnormal” encompasses the "Borderline" and "Normal" classes. The second binary task distinguishes "Normal" from "Non-Normal" signals, where "Non-Normal" includes "Abnormal" and "Borderline" classes.”

Materials and methods

  • This reviewer would provide an overview of the ECG signal in order to make the reader, even if not an expert in biological signals, understand the size and morphology of the ECG signal. I would at least provide an image showing the morphological characteristics of a signal belonging to the 3 labels. Furthermore, the authors should specify the window size used for signal analysis.

Thank you for your comments, we agree that a further explanation of the data used would clarify the use and purpose of the models trained. We have added additional information clarifying the variables used to determine the classification of each patient:

“Patient data included sex, weight, history of diabetes, and history of smoking. From ECGs, the data extracted included ventricular rate, atrial rate, PR interval, QRS duration, QT interval, QTc calculation, P-axis, R-axis, and T-axis. These patients were classified as “normal,” “borderline,” or “abnormal” based on the ECG data and patient data.”

  • It is not clear what data are used for the proposed validation. Is a public dataset used? Was the data collected by the authors? This is not explained.

Thank you for your comment. We agree that this was an oversight and that it is important to state the source of our dataset. We have added this information to the Methods section, as quoted below:

“The dataset used was sourced by the authors.” (Specifically, our co-authors from the Catholic Health Service of Long Island, 245 Old Country Rd, Melville, NY).

  • If you do not know how much data there is and its distribution, you cannot interpret the results to assess whether a data imbalance may have affected the classification performance.

Thank you for your comments. Class imbalance can have a significant effect on model training and model performance, and thus we should have clarified the classes. We added this information to help in the interpretation of the results presented later:

“…Borderline, and Normal. The dataset was distributed into 2248 “Abnormal”, 1451 “Normal”, and 651 “Borderline” patients. To address inherent challenges such as inter-class overlap and imbalanced data distribution, the classification problem is reformulated into two binary tasks as follows. Binary Problem 1: The Borderline and Normal classes are combined into a single Non-Abnormal class, which is contrasted with the Abnormal class. In this problem, there were 2248 “Abnormal” and 2102 “Non-Abnormal” patients. Binary Problem 2: The Abnormal and Borderline classes are merged into a single Non-Normal class, which is contrasted with the Normal class. In this problem, there were 2899 “Non-Normal” and 1451 “Normal” patients.”
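For illustration, the two reformulations described in this passage amount to a simple label remapping; the sketch below uses a hypothetical pandas Series seeded with the reported class counts (it is not the study's actual preprocessing code):

```python
import pandas as pd

# Hypothetical Series of clinician-assigned labels with the reported class counts
labels = pd.Series(["Abnormal"] * 2248 + ["Normal"] * 1451 + ["Borderline"] * 651)

# Binary Problem 1: Abnormal vs. Non-Abnormal (Normal + Borderline combined)
binary1 = labels.map(lambda c: "Abnormal" if c == "Abnormal" else "Non-Abnormal")

# Binary Problem 2: Normal vs. Non-Normal (Abnormal + Borderline combined)
binary2 = labels.map(lambda c: "Normal" if c == "Normal" else "Non-Normal")

print(binary1.value_counts())  # Abnormal: 2248, Non-Abnormal: 2102
print(binary2.value_counts())  # Non-Normal: 2899, Normal: 1451
```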

Results

  • The tables are not very readable. They do not provide a comprehensive evaluation of all metrics of the approaches. Furthermore, as it is not made clear on which dataset the evaluation is carried out and what type of validation is followed, it is not possible to assess the veracity of the results.

Thank you for noticing this issue; we had not previously caught the lack of clarification in the results. Especially with our data stratification and binary problems, it is imperative that we clearly label which dataset/problem each result came from. We added the clarifications to the figure captions:

Figure 2. Comparison of Normalized Confusion Matrices from (a) CNN, (b) DNN, and (c) LightGBM models. These matrices depict the results when models were trained to differentiate all three classes: abnormal, normal, and borderline.

Figure 3. Results from Binary Problem 1. Comparison of Normalized Confusion Matrices from (a) CNN, (b) DNN, and (c) GBC models, when the ‘Borderline’ and ‘Normal’ classifications are combined into the classification ‘Non-Abnormal’.

Figure 4. Results from Binary Problem 2. Comparison of Normalized Confusion Matrices from (a) CNN, (b) DNN, and (c) RF models, when the ‘Abnormal’ and ‘Borderline’ classifications are combined into the classification ‘Non-Normal’.
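For reference, normalized confusion matrices of the kind shown in these figures can be generated with scikit-learn; the labels and predictions below are hypothetical placeholders for a held-out test set from Binary Problem 2:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical held-out labels and model predictions
y_test = ["Normal", "Non-Normal", "Normal", "Non-Normal", "Non-Normal", "Normal"]
y_pred = ["Normal", "Non-Normal", "Non-Normal", "Non-Normal", "Non-Normal", "Normal"]

# normalize="true" scales each row to sum to 1 (per-class recall),
# matching the normalized matrices shown in the figures
cm = confusion_matrix(y_test, y_pred, labels=["Non-Normal", "Normal"], normalize="true")
ConfusionMatrixDisplay(cm, display_labels=["Non-Normal", "Normal"]).plot(cmap="Blues")
plt.show()
```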

  • Again, not having clarified which dataset is the reference dataset makes it impossible to understand this data. Are we talking about a dataset that does not include the raw ECG signal, but rather a dataset that includes the features shown in Fig. 11?

Thank you for your comment; we agree that this clarification was important. We have addressed these concerns in our responses to the previous comments above, making sure to state which dataset/problem each result came from. We also included an explanation of the ECG-derived data used for classification and training.

Reviewer 3 Report

Comments and Suggestions for Authors

The work done in the manuscript is appreciable. However, some important tasks are missing, which I believe the authors must address.

  • Many terms are abbreviated multiple times; please abbreviate all terms at their first appearance.
  • Authors are suggested to compare the used models with an advanced deep model and an advanced machine learning technique.
  • A 5-fold cross-validation is used. The standard is 10-fold CV.
  • These days, explainable AI is mandatory to know the contributions of the features during the model's classification/prediction. So, the authors are suggested to implement SHAP or LIME.
  • I appreciate the convergence analysis conducted in the manuscript. I expect the authors to conduct an ablation study for convergence analysis by varying the amount of training and validation data, e.g., 30%/70%, 40%/60%, and 50%/50% splits.

Author Response

The work done in the manuscript is appreciable. However, some important tasks are missing, which I believe the authors must address.

  • Many terms are abbreviated multiple times; please abbreviate all terms at their first appearance.

Thank you for your comment, these items have been corrected!

  • Authors are suggested to compare the used models with an advanced deep model and an advanced machine learning technique.

Thank you for your suggestion! While we recognize that such comparisons could offer additional insights, our study is primarily focused on evaluating the selected models at the intended depth, so as not to overwhelm the reader or dilute the focus with too many models and, consequently, too much data.

  • A 5-fold cross-validation is used. The standard is 10-fold CV.

Thank you for your observation. While 10-fold cross-validation (CV) is widely used, there is no universally mandated "standard"—both 5- and 10-fold CV are valid and context-dependent choices supported by literature.

“In practice, 5-fold CV is often chosen because it provides an excellent trade-off between bias and variance in the model evaluation process. It is less computationally intensive than higher fold numbers, such as 10-fold CV, while still providing reliable estimates of model performance,” Allgaier J, Pryss R. Cross-Validation Visualized: A Narrative Guide to Advanced Methods. Machine Learning and Knowledge Extraction. 2024; 6(2):1378-1388. https://doi.org/10.3390/make6020065

Other studies show that 5-fold CV often provides comparable reliability while being more computationally efficient, particularly with larger datasets. Our choice of 5-fold CV aligns with best practices for balancing robustness and resource constraints, given the dataset size and repeated model comparisons. Prior work in ECG classification similarly adopts 5-fold validation, reinforcing its methodological validity. Thus, our approach ensured rigorous generalization assessment without compromising computational feasibility, and reported metrics remain statistically robust.
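As a concrete illustration of the stratified 5-fold protocol (using synthetic stand-in data rather than the study's ECG dataset, so all values below are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the ECG feature matrix and binary labels
X, y = make_classification(n_samples=4350, n_features=9, weights=[0.52], random_state=0)

# Stratified folds preserve the class ratio in every train/validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv, scoring="accuracy")
print(f"fold accuracies: {np.round(scores, 3)}, mean = {scores.mean():.3f}")
```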

  • These days, explainable AI is mandatory to know the contributions of the features during the model's classification/prediction. So, the authors are suggested to implement SHAP or LIME.

PyCaret, which we used for model selection, provides intrinsic feature importance metrics through permutation scoring and tree-based splits (e.g., Gini importance), as detailed in Section 3.2 and Figure 11. While PyCaret allows limited SHAP integration post-training, we focused on clinical applicability metrics and convergence analysis to align with the study’s objectives.

We acknowledge the value of XAI tools and plan to rigorously apply SHAP/LIME in future work assessing decision boundaries in borderline ECG cases. For this study, domain-specific feature rankings and permutation importance ensured alignment with clinical criteria (e.g., QRS duration, ventricular rate) while maintaining methodological consistency. We have clarified these limitations in Section 4 (revised manuscript), emphasizing XAI as a priority for model refinement in clinical deployments. We added, “While tree-based models provided intrinsic feature importance rankings, a more granular exploration of model interpretability (e.g., SHAP, LIME) was beyond this study’s scope. Future work will prioritize XAI techniques to quantify localized feature contributions, particularly for borderline ECG patterns and hybrid architectures.”
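As a hedged sketch of the permutation-scoring idea mentioned above (synthetic data and illustrative feature names, not the study's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["ventricular_rate", "atrial_rate", "pr_interval", "qrs_duration",
                 "qt_interval", "qtc", "p_axis", "r_axis", "t_axis"]

# Synthetic stand-in data shaped like the ECG-derived feature table
X, y = make_classification(n_samples=1000, n_features=len(feature_names), random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: drop in validation accuracy when each feature is shuffled
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.4f}")
```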

  • I appreciate the convergence analysis conducted in the manuscript. I expect the authors to conduct an ablation study for convergence analysis by varying the amount of training and validation data, e.g., 30%/70%, 40%/60%, and 50%/50% splits.

Thank you for your thoughtful feedback and for recognizing the convergence analysis conducted in our manuscript. We appreciate your suggestion to conduct an ablation study by varying the proportions of training and validation data, as this could provide additional insights into the model’s convergence behavior.

While our current study is focused on evaluating model performance under a fixed data split, we see the value of such an analysis in understanding the stability and generalizability of the model. Given the scope of this work, we plan to explore this approach in the future, where we can systematically investigate the impact of different data distributions on convergence.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed my comments. I don't have other questions for the new version. Thanks.

Author Response

Knowing that you have no further questions for the new version is encouraging, and we are grateful for your support. Thank you once again for your contribution and insights!

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed some comments, but the key concerns from the first review remain unresolved. The methodological rigor is still insufficient, and the validity of the results is not convincingly demonstrated. The responses provided do not adequately clarify the scientific soundness of the work. I recommend substantial revisions before resubmission.

Author Response

Thank you for your feedback and for recognizing the potential of our work. We have implemented substantial revisions to address your concerns about methodological rigor, result validity, and scientific soundness. Below, we detail how each concern has been resolved in the revised manuscript. They are highlighted in green in the revised manuscript.


Added to Section 2.2.1 (CNN):
Implementation Details:
"The CNN architecture employed 32 filters (kernel size=3, ReLU activation) in the first convolutional layer, followed by max-pooling (pool size=2) and batch normalization to regularize activations. The second convolutional layer utilized 64 filters (kernel size=3) with identical pooling and normalization. A dropout rate of 0.5 was applied after pooling layers to mitigate overfitting. For optimization, Adam (default settings: learning rate=0.001, β₁=0.9, β₂=0.999) minimized binary cross-entropy loss. Input data was partitioned into stratified training (70%), validation (15%), and test (15%) sets using scikit-learn’s StratifiedKFold, preserving class distributions."

Added to Section 2.2.2 (DNN):
Architecture Optimization:
"The DNN comprised two hidden layers (128 and 64 units, ReLU activation) with batch normalization applied post-activation to stabilize training. Adam optimization maintained default hyperparameters (learning rate=0.001, β₁=0.9, β₂=0.999) with early stopping (patience=10 epochs, min delta=0.001) to halt training if validation loss plateaued. Dropout (rate=0.3) between dense layers reduced co-adaptation of neurons."

Added to Section 2.2.3 (Tree-Based Models):
PyCaret Configuration:
"For Gradient Boosting Classifier (GBC), key hyperparameters included n_estimators=100 (default), learning_rate=0.1, and max_depth=3. Random Forest (RF) used n_estimators=100 and min_samples_split=2. Class weights were automatically adjusted during training to account for imbalanced distributions."

Additionally, we added citations to recent 2025 review and research papers on this topic and explained how the focus of our article is on the clinical applicability of such models, as stressed by the recent FDA guidance on their development. While many studies report strong classification metrics for ECG analysis, critical gaps remain in the validation protocols needed to ensure clinical applicability. Claims of exceptional performance metrics often lack essential methodological rigor; e.g., many fail to demonstrate proper training dynamics through learning curves or to validate classifier reliability across operational thresholds via metrics such as receiver operating characteristic (ROC) curves. These omissions make it difficult to distinguish genuine model generalizability from dataset-specific overfitting, particularly given the inherent challenges of class imbalance and subtle diagnostic boundaries in ECG interpretation. The absence of convergence analyses in reported training processes poses a particular concern. Learning curve comparisons between training and validation performance provide vital insight into whether models achieve stable, generalizable pattern recognition or merely memorize training artifacts. As such, reported accuracy rates in ECG classification prove clinically meaningless if convergence analysis reveals divergence between the training and validation trajectories: a hallmark of overfitting. After all, we are from a medical college; medical students conducted this study and wrote this manuscript. Hence, we are attempting to focus on the clinical applicability of everything we do.

We are grateful for your expertise and constructive critique, which have guided critical improvements to the manuscript. Your insights helped refine our roadmap for future validation studies. We hope these revisions reinforce machine learning's promise as a clinically viable tool while transparently addressing its limitations. Thank you again for your time and guidance.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have responded to my comments.

Author Response

We are delighted to hear that you feel we have responded to your comments effectively. Your insights have played an important role in refining our work, and we appreciate your constructive feedback.

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

This reviewer thinks the paper is not ready for publication anyway.
