Article

Interpretable Deep Learning Models for Arrhythmia Classification Based on ECG Signals Using PTB-X Dataset

1 Electronics and Communication Department, College of Engineering and Computer Science, Mustaqbal University, Buraydah 51411, Saudi Arabia
2 Department of Computer Science, College of Computer Science and Engineering, Taibah University, Yanbu 966144, Saudi Arabia
3 Computer Science Department, Faculty of Science, University of Tanta, Tanta 31527, Gharbia, Egypt
4 Information Technology Department, Faculty of Computers and Information, Menoufia University, Shibin El Kom 6131567, Egypt
5 Faculty of Medicine Kasr Al-Ainy, Cairo University, Cairo 11562, Egypt
6 Computer Science Department, College of Science, Northern Border University, Arar 91431, Saudi Arabia
7 Research Institute of Sciences and Engineering, University of Sharjah, Sharjah 27272, United Arab Emirates
8 Department of Embedded Network Systems Technology, Faculty of Artificial Intelligence, Kafrelsheikh University, Kafr El-Sheikh 33516, Egypt
* Authors to whom correspondence should be addressed.
Diagnostics 2025, 15(15), 1950; https://doi.org/10.3390/diagnostics15151950
Submission received: 22 June 2025 / Revised: 26 July 2025 / Accepted: 31 July 2025 / Published: 4 August 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background/Objectives: Automatic classification of ECG signal arrhythmias plays a vital role in early cardiovascular diagnostics by enabling prompt detection of life-threatening conditions. Manual ECG interpretation is labor-intensive and susceptible to errors, highlighting the demand for automated, scalable approaches. Deep learning (DL) methods are effective in ECG analysis due to their ability to learn complex patterns from raw signals. Methods: This study introduces two models: a custom convolutional neural network (CNN) with a dual-branch architecture for processing ECG signals and demographic data (e.g., age, gender), and a modified VGG16 model adapted for multi-branch input. Using the PTB-XL dataset, a widely adopted large-scale ECG database with over 20,000 recordings, the models were evaluated on binary, multiclass, and subclass classification tasks across 2, 5, 10, and 15 disease categories. Advanced preprocessing techniques, combined with demographic features, significantly enhanced performance. Results: The CNN model achieved up to 97.78% accuracy in binary classification and 79.7% in multiclass tasks, outperforming the VGG16 model (97.38% and 76.53%, respectively) and state-of-the-art benchmarks like CNN-LSTM and CNN entropy features. This study also emphasizes interpretability, providing lead-specific insights into ECG contributions to promote clinical transparency. Conclusions: These results confirm the models’ potential for accurate, explainable arrhythmia detection and their applicability in real-world healthcare diagnostics.

1. Introduction

The electrocardiogram (ECG) is a widely adopted, non-invasive diagnostic method for recording the heart’s electrical activity, offering crucial information about its rhythm and overall function. It plays a pivotal role in detecting a broad spectrum of cardiac abnormalities, particularly arrhythmias: irregular heart rhythms that range from harmless to life-threatening and include conditions such as atrial fibrillation and ventricular tachycardia [1]. Timely detection and precise classification of these arrhythmias are vital for reducing the risk of severe outcomes such as stroke, heart failure, or sudden cardiac arrest. To support clinicians and improve diagnostic precision, automated ECG interpretation systems have been introduced. In this context, the PTB-XL dataset, featuring over 20,000 labeled ECG recordings that represent a wide range of cardiac conditions, has emerged as a standard reference for assessing the performance of machine learning algorithms in arrhythmia classification [2].
Conventional ECG analysis techniques often depend on manual feature extraction and rule-based algorithms. Although these methods laid the groundwork for early developments in automated arrhythmia detection, they are constrained by their reliance on time-consuming procedures and their limited ability to capture the intricate and non-linear patterns present in ECG signals [3]. Traditional machine learning models, such as support vector machines (SVMs) and random forests, have been employed to mitigate some of these challenges [4]. However, they still require handcrafted features, which depend heavily on expert knowledge. In contrast, deep learning (DL) models offer a more advanced and efficient solution by automatically learning hierarchical representations directly from raw ECG data, bypassing the need for extensive preprocessing or manual intervention. Architectures like CNNs and recurrent neural networks (RNNs) have demonstrated superior performance in arrhythmia detection tasks, making them highly suitable for deployment in practical clinical settings [5].
Applying deep learning to ECG signal classification using the PTB-XL dataset presents several critical challenges. A major issue is class imbalance, where clinically important but infrequent arrhythmias are underrepresented, leading to biased model predictions that favor more common conditions. Additionally, ECG recordings are frequently contaminated with noise and artifacts, such as baseline drift and powerline interference, which can obscure relevant diagnostic features and negatively affect classification performance. The high dimensionality of multi-lead ECG signals further necessitates the use of complex architectures and demands significant computational resources. Generalizability also remains a concern, as models trained on PTB-XL may not perform consistently across different clinical environments due to variations in patient demographics, data acquisition protocols, and recording equipment. Finally, the “black-box” nature of deep learning models poses interpretability challenges, which may hinder clinician trust and limit their adoption in healthcare settings [2,6].
The critical importance of arrhythmia classification, coupled with the advantages and challenges of deep learning techniques, highlights the need for continued research in this domain [7]. Addressing these challenges and leveraging modern DL architectures can pave the way for robust, interpretable, and accurate arrhythmia detection systems, ultimately improving patient care and enabling seamless integration into clinical workflows [8].
The primary contributions of this study are as follows:
  • A custom convolutional neural network-based deep learning model is proposed, specifically tailored for ECG arrhythmia classification. The model supports various diagnostic tasks, including binary classification (e.g., normal vs. specific conditions), and multiclass classification into 5, 10, or 15 clinically meaningful categories based on the PTB-XL dataset, allowing for both broad and detailed arrhythmia detection.
  • To enhance contextual accuracy, the model integrates demographic attributes, namely age and gender, which contribute to a deeper understanding of arrhythmia patterns and lead to improved detection, particularly for less-represented classes.
  • This research utilizes the PTB-XL dataset, a large and heterogeneous collection of ECG recordings, to ensure comprehensive training and validation. This choice supports the model’s generalizability across a broad range of cardiac conditions.
  • An extensive experimental evaluation is conducted to assess the influence of incorporating demographic data. This study compares classification performance across binary and multiclass tasks, both with and without demographic features, demonstrating the added value of contextual inputs.
  • In addition, the model offers interpretability by examining the influence of individual ECG leads in arrhythmia detection. This lead-specific analysis provides clinical insight into which signal channels most significantly contribute to the diagnostic process.
The remainder of this paper is organized as follows: Section 2 reviews the most significant and recent published works related to ECG-based arrhythmia classification. Section 3 describes the methodology employed in this study, including data processing, model architecture, and training procedures. Section 4 presents the experimental results in detail and highlights the advantages of the proposed approach compared with state-of-the-art methods. Section 5 concludes the paper by summarizing the main findings and outlining potential directions for future research.

2. Literature Review

The application of electrocardiogram (ECG) signals for arrhythmia classification has emerged as a prominent area of research due to its vital importance in the early diagnosis and prevention of cardiovascular diseases. Traditional methods, largely dependent on manual feature engineering and classical statistical models, are now being superseded by deep learning (DL) techniques, which facilitate automated feature extraction and significantly improve classification performance. Among the most influential resources in this domain is the PTB-XL dataset, a large-scale, richly annotated collection of ECG recordings that has become central to the development and benchmarking of DL-based arrhythmia detection systems. This section highlights recent advancements in the field, with particular focus on the challenges associated with leveraging deep learning models on the PTB-XL dataset.
A wide range of studies have employed various computational approaches for arrhythmia detection and classification, including traditional machine learning algorithms [9,10,11,12,13,14,15,16,17,18,19,20], transfer learning techniques [6,21,22,23,24], deep learning models [2,25,26,27,28,29,30,31,32], and hybrid frameworks that combine multiple methodologies [12,17,33,34,35,36,37,38,39,40]. Additionally, other innovative strategies have been explored in this context [41,42,43,44]. These efforts increasingly utilize large-scale ECG datasets such as PTB-XL and PTB, which offer diverse and clinically rich data for model development.
Recent progress in deep learning has significantly transformed ECG analysis, particularly through the use of large-scale datasets such as PTB-XL. A landmark study by Strodthoff et al. [2] provided a comprehensive benchmarking of deep learning architectures on the PTB-XL dataset, achieving macro-AUC scores of 0.93 for diagnostic categories and 0.96 for rhythm classifications using ResNet- and Inception-based networks. Their work also highlighted the advantages of transfer learning in improving ECG classification performance. Similarly, Jin et al. [5] proposed SJTU-ECGNet, a knowledge-integrated deep learning model trained on a large Chinese ECG dataset, which reported a mean accuracy of 93.74%, a macro-F1 score of 83.51%, and an AUC-ROC of 0.977.
To address challenges such as class imbalance and computational efficiency, a multi-receptive field convolutional neural network (MRF-CNN) was developed, achieving an F1 score of 0.72 and an AUC of 0.93 in superclass classification on PTB-XL [23]. In a different approach, Raymond Ao et al. explored image-based ECG classification by applying transfer learning with VGG16 on visual representations of 12-lead ECGs. Their model achieved an AUROC of 1.000 and a sensitivity of 0.992 for atrial fibrillation, demonstrating the potential of image-based techniques where raw signal data may not be available [25].
Hybrid deep learning models have also gained traction. Butt et al. [27] proposed a CNN-LSTM combination alongside a transformer model enhanced by wavelet transforms, achieving a remarkable 99.93% accuracy in detecting myocardial infarction. In another comparative study, Smigiel et al. [32] developed a CNN framework utilizing QRS complex segmentation along with entropy-based inputs, yielding an AUC of 96.3% for binary classification using PTB-XL data.
Transfer learning has been widely investigated in the context of ECG classification. In a comparative evaluation of six deep learning models, including ResNet1d18, it was shown that fine-tuning pretrained networks consistently outperformed models trained from scratch, with ResNet1d18 achieving an F1 score of 0.877 [43]. Jing et al. [45] introduced the beat-level fusion network (BLF-Net), which incorporates attention mechanisms to enhance feature learning. Their model achieved macro-AUC scores of 0.941 for subclass classification and 0.969 for rhythm detection, underscoring its potential for clinical application. Building upon this line of research, the PTB-XL+ dataset was introduced, offering feature-based annotations derived from both commercial and open-source algorithms. This enriched dataset supported improved model development and achieved macro-AUC scores as high as 0.889, providing a valuable foundation for advancing diagnostic tools [46].
Hambarde et al. [47] proposed the WOLF-DNN model, a hybrid deep neural network optimized using a wolf-inspired metaheuristic for arrhythmia detection from single-lead ECG signals. The model achieved an F1-score of 96.55% and sensitivity of 95.51%, demonstrating strong performance on imbalanced data. However, its generalizability remains limited due to the absence of testing on diverse, real-world clinical datasets. Further contributions by Strodthoff et al. [48] included the extraction of ECG features from commercial systems (Marquette 12SL and the University of Glasgow ECG program) and an open-source alternative (ECGDeli). Benchmark results indicated strong discriminative capability, with macro-AUCs of 0.889 for Uni-G, 0.871 for 12SL, and 0.879 for ECGDeli, reinforcing the effectiveness of feature-based approaches in ECG analysis.
Krasteva et al. [49] applied transfer learning using 18 pretrained ImageNet models on ECHOView images from Holter ECG recordings to detect atrial fibrillation. Their best fine-tuned models, such as EfficientNetV2B1, achieved up to 97.6% accuracy in binary classification. While the study demonstrates strong performance and model interpretability through GradCAM, it focuses solely on atrial fibrillation and relies on image-based input, limiting generalization to other arrhythmias or raw ECG data. Wickramasinghe and Athif [50] proposed an interpretable CNN model for multi-label classification of 26 cardiac abnormalities using reduced-lead ECGs. Trained on the PhysioNet 2021 dataset, their dual-branch model (time and frequency) achieved an F1 score of 0.553 on the hidden test set, ranking second for 12-lead, fifth for 6-lead, and third for 2-lead configurations. While SHAP improved interpretability, performance was limited by label imbalance and reduced-lead data constraints. Xiong et al. [4] proposed a 1D CNN model for classifying ECGs into normal, AF, other, and noise categories using the PhysioNet 2017 dataset. Their model achieved an F1 score of 0.82, outperforming RNN and spectrogram-based CNNs. The approach was limited to single-lead ECGs and showed lower performance on minority classes.
Recent systematic reviews have synthesized these advancements, providing an in-depth examination of current methodologies and their practical applications in ECG-based arrhythmia classification [51,52]. An overview of key studies utilizing this dataset is summarized in Table 1.

3. Methodology

The proposed hybrid deep learning framework for classifying ECG signals into normal and arrhythmic conditions is illustrated in Figure 1. The model integrates two distinct data streams: raw 12-lead ECG signals sampled at 500 Hz, and demographic information comprising patient age and sex. Prior to model training, the data undergoes a comprehensive preprocessing phase that involves filtering incomplete or noisy records, excluding underrepresented subclasses to ensure class balance, and splitting the dataset into training, validation, and testing sets.
To improve generalization and reduce overfitting, data augmentation techniques are applied exclusively to the ECG input branch. These augmentations simulate realistic physiological variations and include adding Gaussian noise with a mean of 0 and a standard deviation of 0.01, amplifying the signal by multiplying with a gain factor k = 1 + x where x ranges between 0.001 and 0.01, and attenuating the signal by applying a gain of k = 1 − x with the same range. These operations are designed to reflect natural fluctuations in ECG signals due to patient variability or sensor conditions.
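The three augmentations described above can be sketched in a few lines. This is an illustrative sketch only, not the authors’ code: the function names, the leads-by-samples list layout, the uniform draw of x, and the choice of one gain per recording (rather than per lead) are all assumptions.

```python
import random

def add_gaussian_noise(signal, std=0.01, rng=random):
    """Add zero-mean Gaussian noise (std 0.01 by default) to every sample."""
    return [[v + rng.gauss(0.0, std) for v in lead] for lead in signal]

def scale_amplitude(signal, amplify=True, rng=random):
    """Scale a recording by k = 1 + x (amplify) or k = 1 - x (attenuate).

    x is drawn uniformly from [0.001, 0.01]; one gain for the whole
    recording is an assumption, the text does not say how x is sampled.
    """
    x = rng.uniform(0.001, 0.01)
    k = 1.0 + x if amplify else 1.0 - x
    return [[k * v for v in lead] for lead in signal], k

rng = random.Random(0)                            # fixed seed for repeatability
ecg = [[0.0, 0.5, 1.0, 0.5] for _ in range(12)]   # toy 12-lead recording
noisy = add_gaussian_noise(ecg, rng=rng)
louder, k_up = scale_amplitude(ecg, amplify=True, rng=rng)
quieter, k_down = scale_amplitude(ecg, amplify=False, rng=rng)
```

Because the gain stays within one percent of unity, the augmented copies keep the morphology of the original beat while still differing sample by sample.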
The architecture is composed of two primary branches. The ECG branch is based on a convolutional neural network (CNN) and processes the augmented ECG signals through a stack of eight convolutional layers with 64 to 512 filters and 3 × 1 kernel sizes. Feature maps are reduced via max pooling and global average pooling. A dense layer with 512 neurons and a dropout rate of 0.5 is used for regularization. In parallel, the demographic branch processes the age and sex inputs through fully connected layers with 100, 64, 32, and 16 neurons, followed by global average pooling and a dropout of 0.4.
Following independent feature extraction, the outputs of both branches are concatenated to form a unified feature vector. This combined representation is passed through two additional dense layers (Dense3_1 and Dense3_2) with interleaved dropout layers (Dropout3_1 and Dropout3_2), both with a rate of 0.2, to prevent overfitting and enhance feature abstraction. The final vector is then passed to a classification layer with a softmax activation function to predict class probabilities for normal rhythm and various arrhythmic conditions (e.g., Disease 1 to Disease N).
The inclusion of demographic variables in the model architecture is motivated by clinical evidence linking age and gender to distinct arrhythmogenic patterns and ECG morphologies. Numerous cardiac conditions, such as myocardial infarctions (AMI, IMI), conduction blocks (e.g., LAFB/LPFB, CLBBB, CRBBB), hypertrophy (LVH), and ischemic abnormalities (ISCA, ISCI, ISC), demonstrate age- and gender-dependent variations in incidence, pathophysiology, and ECG presentation. For example, atrial fibrillation, conduction delays, and bundle branch blocks are significantly more prevalent in elderly patients due to age-related degenerative changes in the conduction system, such as fibrosis and calcification of the His–Purkinje network, which impair electrical propagation [55]. Gender differences also critically influence arrhythmia susceptibility and ECG morphology: women exhibit higher rates of long QT syndrome and non-specific ST-T changes, likely due to estrogen-mediated effects on ventricular repolarization [56], whereas men show greater prevalence of left ventricular hypertrophy (attributed to testosterone-driven myocardial remodeling) and Brugada-type patterns linked to sodium channel mutations [57,58]. By incorporating demographic features into the learning framework, the model is better equipped to contextualize ECG features and improve classification accuracy across diverse patient profiles.

3.1. Dataset Description

The PTB-XL dataset is a large-scale, publicly available resource designed to advance research in automated cardiac diagnostics. Developed and released by the Physikalisch-Technische Bundesanstalt (PTB), the dataset contains over 20,000 annotated 12-lead electrocardiogram (ECG) recordings from 18,885 patients, with a nearly balanced gender distribution (52% male, 48% female). As one of the most comprehensive ECG datasets available, PTB-XL offers greater diversity and volume than many existing alternatives. Each 10 s recording captures a wide range of cardiac conditions, annotated with 71 diagnostic labels systematically grouped into five hierarchical categories: diagnostic, form, rhythm, clinical, and additional statements.
Annotations were meticulously curated by two expert cardiologists and organized into superclasses and subclasses, providing a robust foundation for both supervised and unsupervised machine learning tasks. The dataset includes patients ranging in age from 18 to 95 years and spans both healthy individuals and those with various cardiac abnormalities, thereby enhancing its suitability for developing accurate and generalizable arrhythmia detection models.
Recordings are available at two sampling frequencies—100 Hz and 500 Hz—allowing researchers to adapt preprocessing based on their computational needs. Additionally, the dataset includes rich metadata such as patient age, gender, and recording conditions, which supports stratified analysis and the development of auxiliary learning tasks. In accordance with the recommendations of the PTB-XL dataset creators, we followed their predefined data split strategy:
  • Folds 1–8 were used for training.
  • Fold 9 was used for validation.
  • Fold 10 was used for testing.
This split ensures consistency with prior studies and preserves the integrity of the evaluation [59]. Table 2 summarizes the dataset division.
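As a minimal sketch of this split, the function below partitions metadata rows by the `strat_fold` field that PTB-XL provides in its metadata file. The toy records and the function name are hypothetical; a real pipeline would read the rows from the dataset’s CSV instead.

```python
def split_by_fold(records, val_fold=9, test_fold=10):
    """Partition metadata rows into train/val/test using the strat_fold field."""
    split = {"train": [], "val": [], "test": []}
    for rec in records:
        fold = int(rec["strat_fold"])
        if fold == test_fold:
            split["test"].append(rec)
        elif fold == val_fold:
            split["val"].append(rec)
        else:                      # folds 1-8
            split["train"].append(rec)
    return split

# Toy metadata: 20 records cycling through folds 1..10.
records = [{"ecg_id": i, "strat_fold": (i % 10) + 1} for i in range(20)]
split = split_by_fold(records)
print({name: len(rows) for name, rows in split.items()})
# prints {'train': 16, 'val': 2, 'test': 2}
```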

3.2. Experimental Setup

The dataset employed in this study is organized using a predefined fold structure, enabling a standardized division into training, validation, and testing subsets for consistent evaluation of ECG classification algorithms. Specifically, the data is partitioned into 10 folds, with folds 1 through 8 allocated for training, fold 9 for validation, and fold 10 reserved for testing. This structured approach is critical for developing robust deep learning models while minimizing the risk of overfitting. Model training is guided by the cross-entropy loss function and optimized using the Adam optimizer with a learning rate of 0.001. To improve training performance and generalization, techniques such as early stopping and adaptive learning rate reduction are applied via callback functions. The experimental framework integrates both physiological signal data and demographic information, supporting accurate and reliable multiclass disease classification. All experiments were conducted on the Kaggle platform utilizing an NVIDIA TESLA P100 GPU accelerator.
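The interaction between early stopping and learning rate reduction can be illustrated by replaying a sequence of validation losses. The sketch below is a simplified stand-in for framework callbacks, not the configuration used in the study: the patience values, the reduction factor of 0.5, and the monitored quantity are assumptions, since the text only gives the initial learning rate of 0.001.

```python
def simulate_callbacks(val_losses, lr=1e-3, lr_patience=2, stop_patience=4,
                       factor=0.5, min_delta=0.0):
    """Replay validation losses; return (stop_epoch, final_lr, best_epoch)."""
    best = float("inf")
    best_epoch = -1
    since_best = 0      # epochs since the last improvement (early stopping)
    since_lr_cut = 0    # stalled epochs since the last learning rate cut
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
            since_best = since_lr_cut = 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor            # reduce the learning rate on a plateau
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr, best_epoch   # stop training early
    return len(val_losses) - 1, lr, best_epoch

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.50]
stop_epoch, final_lr, best_epoch = simulate_callbacks(losses)
print(stop_epoch, final_lr, best_epoch)   # prints 6 0.00025 2
```

In this toy run the loss stalls after epoch 2, the learning rate is halved twice, and training stops at epoch 6, so the later improvement to 0.50 is never reached. That trade-off is why the patience values matter in practice.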

3.3. Diagnostic Classes

This research focuses on developing an automated arrhythmia classification system based on ECG signals, capable of performing binary, multiclass, and subclass classification tasks. The PTB-XL dataset, used in this study, includes diagnostic labels categorized into five superclasses and 23 subclasses. Table 3 outlines the distribution of these superclasses and subclasses, while Figure 2 provides a visual representation of their relationships [46]. To evaluate the model’s performance comprehensively, we implemented multiple classification scenarios, including binary classification, 5-class (superclass), 10-class (subclass), and 15-class (subclass) tasks. These classification scenarios are described as follows:
A. Binary classification:
This scenario involves three cases:
Case 1: The model is designed to differentiate between normal ECG signals and one selected condition from the broader set of major disease classes, which includes CD, HYP, MI, and STTC.
Case 2: The model is employed to distinguish between normal ECG signals and one specific condition from a more detailed set of subclasses, such as STTC, AMI, IMI, LAFB/LPFB, LVH, IRBBB, CLBBB, ISCA, and CRBBB.
Case 3: The model is utilized to classify ECG signals as either normal or abnormal, with the abnormal category indicating the presence of a single underlying disease.
B. Multiclass classification:
The class distributions for the binary, five-, ten-, and fifteen-class classification tasks, along with the number of records per class, are illustrated in Figure 3.
  • Five-Class Supercategory Classification: In this multiclass classification task, the model categorizes ECG signals into one of five broad superclasses: NORM, MI, CD, STTC, and HYP.
  • Ten-Class Subcategory Classification: In this configuration, the model classifies ECG signals into one of ten specific subclasses: NORM, STTC, AMI, IMI, LAFB/LPFB, LVH, IRBBB, CLBBB, ISCA, and CRBBB. These ten subclasses represent the most frequently occurring categories in the dataset, with all other less-represented subclasses being excluded.
  • Fifteen-Class Subcategory Classification: This scenario expands the classification to fifteen subclasses, including: NORM, STTC, AMI, IMI, LAFB/LPFB, LVH, IRBBB, CLBBB, NST_, ISCA, CRBBB, IVCD, ISC_, _AVB, and ISCI. As with the ten-class configuration, only the top fifteen subclasses based on the number of records are considered.
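Selecting the most frequent subclasses for the ten- and fifteen-class tasks amounts to a frequency count followed by a filter. The sketch below assumes each record carries a single subclass label under a hypothetical `subclass` key; the counts are toy values, not the real PTB-XL distribution.

```python
from collections import Counter

def top_k_subclasses(labels, k):
    """Return the k most frequent subclass labels, most frequent first."""
    return [name for name, _ in Counter(labels).most_common(k)]

def filter_records(records, keep):
    """Keep only the records whose subclass is in the retained set."""
    keep = set(keep)
    return [r for r in records if r["subclass"] in keep]

# Toy label counts: NORM 6, STTC 4, AMI 3, RVH 1.
records = ([{"subclass": "NORM"}] * 6 + [{"subclass": "STTC"}] * 4 +
           [{"subclass": "AMI"}] * 3 + [{"subclass": "RVH"}] * 1)
keep = top_k_subclasses([r["subclass"] for r in records], k=3)
kept = filter_records(records, keep)
print(keep, len(kept))   # prints ['NORM', 'STTC', 'AMI'] 13
```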

3.4. Data Augmentation

Data augmentation is crucial for improving the performance of deep learning (DL) models in classification tasks. For small datasets, it artificially expands the training data by generating diverse and realistic variations (e.g., rotations, flips, scaling), which helps to prevent overfitting and enhance generalization. For large datasets, it still plays a significant role by introducing variability that the original recordings may not fully capture, allowing the model to learn more robust and invariant features and ultimately improving accuracy and resilience to real-world data variations. In both cases, augmentation contributes to building more reliable and adaptable classification models. Table 4 shows the data augmentation techniques used in this study and their descriptions. Figure 4 presents the class distribution of each category after data augmentation for the training samples.

3.5. The Proposed Models

3.5.1. The Proposed CNN Model Architecture

The custom CNN architecture developed in this study adopts a dual-branch structure to independently process ECG signals and demographic data, as illustrated in Figure 5. The first branch is dedicated to handling raw ECG inputs and is composed of five sequential convolutional layers. These layers progressively increase in depth with filter sizes of 32, 64, 128, 256, and 512, respectively. Each convolutional layer utilizes a kernel size of 3 × 1, optimized for temporal feature extraction along the signal axis. Following each convolutional operation, a max pooling layer with a pool size of 2 × 1 is applied to downsample the feature maps and retain the most salient features. To promote training stability and accelerate convergence, each max pooling layer is immediately followed by a batch normalization layer. After the convolutional layers, the network transitions to two fully connected (dense) layers with 100 and 32 neurons, respectively. A dropout layer with a rate of 0.4 is employed to reduce overfitting by randomly deactivating neurons during training.
The second branch of the model processes demographic inputs (patient age and gender) through a series of four dense layers with 100, 64, 32, and 16 neurons. This branch also incorporates a dropout rate of 0.4, ensuring consistency in regularization across both branches. The output layer consists of a dense layer with a softmax activation function corresponding to the number of output classes (i.e., five for the ECG superclass classification). The architecture is designed to allow for effective feature learning from both physiological and contextual information. A full summary of the CNN model’s structural parameters is provided in Table 5.
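To make the downsampling in the ECG branch concrete, the sketch below traces the tensor shape through the five convolution and pooling stages. It assumes a 10 s, 12-lead input at 500 Hz (5000 samples), stride-1 convolutions with 'same' padding, and non-overlapping pooling; the padding and stride settings are not stated in the text, so the exact lengths are illustrative.

```python
def trace_ecg_branch(n_samples=5000, n_leads=12,
                     filters=(32, 64, 128, 256, 512), pool=2):
    """Return the (length, channels) shape after each conv + pool stage."""
    shapes = [(n_samples, n_leads)]
    length = n_samples
    for n_filters in filters:
        # A 'same'-padded 3 x 1 convolution keeps the length and sets the
        # channel count; the 2 x 1 max pooling then floor-divides the length.
        # Batch normalization leaves the shape unchanged.
        length //= pool
        shapes.append((length, n_filters))
    return shapes

shapes = trace_ecg_branch()
for stage, shape in enumerate(shapes):
    print(stage, shape)
# prints:
# 0 (5000, 12)
# 1 (2500, 32)
# 2 (1250, 64)
# 3 (625, 128)
# 4 (312, 256)
# 5 (156, 512)
```

Under these assumptions the five stages shrink the time axis by a factor of about 32 while the channel depth grows from 12 leads to 512 feature maps.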

3.5.2. VGG Model Architecture

The VGG-based model is an adaptation of the well-established VGG architecture, modified to support a multi-branch input structure for joint processing of ECG signals and demographic information. As illustrated in Figure 6, the ECG branch consists of eight convolutional layers, with filter sizes progressively increasing as follows: 64, 64, 128, 128, 256, 256, 512, and 512. Each convolutional layer employs a 3 × 1 kernel size, preserving temporal resolution while extracting relevant features. Following each convolutional block, a 2 × 1 max pooling operation is applied to downsample the spatial dimensions and facilitate hierarchical feature learning. This branch concludes with two fully connected layers, each containing 512 neurons, and incorporates a dropout rate of 0.5 to reduce overfitting and enhance generalization.
The demographic branch follows the same dense layer configuration used in the custom CNN model. It includes four fully connected layers with 100, 64, 32, and 16 neurons, respectively, along with a dropout rate of 0.4 for regularization. After parallel processing, the outputs from the ECG and demographic branches are concatenated and passed through a shared dense path composed of three layers with 10 neurons for each of the first two dense layers, and a final output layer, where the output dimension corresponds to the number of target classes (2, 5, 10, or 15, depending on the classification task). A dropout rate of 0.2 is applied in this final segment to ensure robustness.
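A quick parameter count shows how small the demographic branch and the fused head are compared with the convolutional stack. The sketch below assumes two scalar demographic inputs (age and sex), a 512-wide ECG feature vector at the concatenation point, and plain dense layers with biases; these widths are read from the description above, but the exact tensor sizes in the authors’ implementation may differ.

```python
def dense_params(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Demographic branch: 2 inputs -> 100 -> 64 -> 32 -> 16 neurons.
demo_branch = dense_params([2, 100, 64, 32, 16])

# Fused head for the five-class task: (512 + 16) -> 10 -> 10 -> 5.
fused_width = 512 + 16
head_5_class = dense_params([fused_width, 10, 10, 5])

print(demo_branch, head_5_class)   # prints 9372 5455
```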
The ReLU activation function is used throughout the model, except for the final classification layer, which uses softmax to output class probabilities. The model is trained using a categorical cross-entropy loss function and the Adam optimizer, with training hyperparameters and callbacks consistent with those used in the custom CNN model. Training is conducted over a maximum of 60 epochs, using a batch size of 16. A detailed summary of the model architecture and parameters is provided in Table 6.

3.6. Model Interpretability Using the SHAP Method

To improve the transparency and clinical relevance of the 12-lead CNN model, we employed SHAP (SHapley Additive exPlanations), a widely used model-agnostic interpretability framework [4]. SHAP quantifies the contribution of each input feature, in this case each ECG lead, to the model’s output. This allows us to assess how much each lead influences the model’s predictions for different cardiac conditions. For each arrhythmia class, SHAP values were computed across all test samples, and the absolute values were averaged to obtain a relative importance score for each lead. This approach provides an interpretable measure of the lead-specific contributions to the model’s decision-making process, aligning with clinical expectations and aiding in the trust and validation of AI predictions [50]. The relative importance $I_l$ of lead $l$ is calculated as:
$$I_l = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{SHAP}_i^{l} \right|$$
where $N$ is the number of test samples, and $\mathrm{SHAP}_i^{l}$ is the SHAP value for lead $l$ in sample $i$.
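The averaging of absolute SHAP values per lead can be sketched as follows. The input is assumed to already hold one SHAP value per lead and test sample; how attributions over the time axis are collapsed to a single value per lead is not specified in the text, and the final normalization to shares that sum to one is an additional assumption. The numbers are toy values.

```python
def lead_importance(shap_values):
    """Mean absolute SHAP value per lead.

    shap_values[i][l] is the SHAP value of lead l for test sample i.
    """
    n_samples = len(shap_values)
    n_leads = len(shap_values[0])
    return [sum(abs(row[l]) for row in shap_values) / n_samples
            for l in range(n_leads)]

def normalize(scores):
    """Rescale the scores so that they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]

# Toy values for 3 samples and 3 leads (a real run would use 12 leads).
shap_values = [[0.2, -0.1, 0.0],
               [-0.4, 0.1, 0.1],
               [0.0, -0.1, 0.2]]
importance = lead_importance(shap_values)   # approximately [0.2, 0.1, 0.1]
relative = normalize(importance)            # approximately [0.5, 0.25, 0.25]
```

Taking absolute values before averaging prevents positive and negative attributions from cancelling, so a lead that pushes predictions strongly in either direction still ranks as important.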

3.7. Performance Evaluation Metrics

To evaluate the performance of machine learning models in ECG arrhythmia classification, a set of key performance metrics is employed, including precision, recall, accuracy, and the confusion matrix. These metrics offer a comprehensive understanding of the model’s effectiveness, especially in the context of imbalanced datasets, where relying solely on accuracy can be misleading [59,60].
Accuracy reflects the overall correctness of the model by measuring the proportion of correct predictions, for both normal and arrhythmic cases, relative to the total number of predictions. It is mathematically defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In this context, TP refers to true positive instances where the model correctly identifies the presence of a disease. TN denotes true negatives, indicating correctly classified normal instances. FP, or false positives, occur when normal instances are incorrectly classified as diseased, while FN, or false negatives, represent diseased cases mistakenly labeled as normal. Although accuracy provides a general assessment of model performance, it can be misleading in cases of class imbalance, where the majority class disproportionately influences the metric.
To address this limitation, precision, also known as the positive predictive value, is used to evaluate the correctness of positive predictions. It quantifies the proportion of true positives among all instances that the model predicted as positive and is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
This metric is particularly important in contexts where false positives carry serious consequences, as in medical diagnostics, where misclassifying a healthy individual as diseased can lead to unnecessary tests, treatments, and patient anxiety. In such cases, maintaining a high precision is essential to ensure that positive predictions are as reliable as possible.
On the other hand, recall, also referred to as sensitivity or the true positive rate, measures the model’s ability to correctly identify actual positive cases. It represents the proportion of true positives captured out of all real instances of the condition and is especially important when the cost of missing a positive case (i.e., a false negative) is high. Mathematically, recall is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
High recall is particularly crucial in situations where overlooking true positive cases (i.e., false negatives) can have severe consequences, such as the failure to detect life-threatening arrhythmias. In these scenarios, prioritizing sensitivity ensures that most relevant cases are correctly identified, even if it comes at the cost of a higher false positive rate.
To further analyze model performance, the confusion matrix offers a comprehensive summary of predictions across all classes. It organizes the results into four key components: true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). This matrix provides valuable insight into how the model performs for each class, highlighting both its strengths and areas requiring improvement. Table 7 presents the structure and components of a confusion matrix, detailing how predictions are distributed across correct and incorrect classifications.
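As a worked illustration of the three definitions above, the following sketch computes accuracy, precision, and recall from the four confusion-matrix counts. The function name `binary_metrics`, the convention of returning 0.0 when a denominator is zero, and the toy counts are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    # Returning 0.0 for an undefined ratio is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
    }


# Toy imbalanced case: 90 normal and 10 diseased recordings (made-up counts).
# The model finds only half of the diseased cases yet scores 95% accuracy.
toy = binary_metrics(tp=5, tn=90, fp=0, fn=5)
```

In this toy case accuracy is 0.95 and precision is 1.0, while recall is only 0.5, which is the situation the text warns about when accuracy is read in isolation on an imbalanced dataset.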

4. Experimental Results and Discussion

This section provides a detailed evaluation of the proposed deep learning frameworks for ECG signal arrhythmia classification, utilizing the PTB-XL dataset. The experiments assess model performance across various classification tasks, including binary classification, multiclass classification involving 5 superclasses, and subclass classification with 10 and 15 categories. The study emphasizes the integration of patient demographic data, such as age and gender, alongside raw ECG signals to enhance diagnostic accuracy, particularly for underrepresented arrhythmia classes. Comparative analyses with existing state-of-the-art methods highlight the superiority of the proposed approach.
The results of the study on ECG signal arrhythmia classification using deep learning techniques are categorized into various classification tasks, including binary classification and multiclass classification, with and without incorporating patient data. These findings highlight the impact of patient data inclusion on model performance, measured through metrics such as precision, recall, and accuracy.

4.1. Experimental Results Using the CNN Model

The training and validation performance of the CNN model was evaluated across all classification tasks, including binary and multiclass classifications. As shown in Figure 7, training and validation accuracies exhibited a consistent upward trend during model training, reflecting the effective learning of features from the PTB-XL dataset. For binary classification, the model achieved high accuracies across all subclasses, indicating robust generalization. Similarly, in multiclass classification, the training accuracy progressively improved while validation accuracy remained closely aligned, highlighting the model’s ability to handle diverse arrhythmia classes. Training and validation losses, monitored throughout the learning process, decreased steadily until reaching a specific epoch corresponding to the minimum loss value. Beyond this point, the losses began to increase, indicating the onset of overfitting. To address this, the parameter values corresponding to the epoch with the minimum loss were selected, ensuring the model’s optimal performance before overfitting occurred. These trends were consistent across experiments with and without patient data.
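The model-selection rule described above (keep the parameters from the epoch with the lowest loss, before overfitting begins) can be sketched as follows. The sketch assumes that the monitored quantity is the validation loss and that one checkpoint is saved per epoch; the names `best_epoch` and `select_checkpoint` and the toy loss curve are hypothetical.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def best_epoch(val_losses: Sequence[float]) -> int:
    """Return the 0-based index of the epoch with the lowest validation loss.

    Ties are resolved in favor of the earliest epoch.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    return min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])


def select_checkpoint(checkpoints: Sequence[T], val_losses: Sequence[float]) -> T:
    """Pick the saved parameters that belong to the best epoch."""
    if len(checkpoints) != len(val_losses):
        raise ValueError("one checkpoint per epoch is expected")
    return checkpoints[best_epoch(val_losses)]


# Toy loss curve (illustrative numbers, not results from the paper): the loss
# falls until epoch index 2 and then rises as overfitting sets in.
toy_val_losses = [0.90, 0.60, 0.45, 0.50, 0.62]
toy_checkpoints = ["epoch0", "epoch1", "epoch2", "epoch3", "epoch4"]
chosen = select_checkpoint(toy_checkpoints, toy_val_losses)
```

Most deep learning frameworks offer an equivalent built-in mechanism (checkpointing on the best monitored value), so in practice this logic rarely needs to be written by hand.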
The confusion matrices generated for each classification category for the test set provide detailed insights into the performance of the proposed CNN model. For binary classification tasks, the matrices illustrate a high number of true positive (TP) and true negative (TN) predictions, reflecting the model’s ability to accurately distinguish between normal and abnormal ECG signals, as shown in Figure 8. Misclassifications, represented by false positives (FPs) and false negatives (FNs), were minimal, particularly when patient data was incorporated, underscoring the value of demographic information in improving classification accuracy.

4.1.1. Binary Classification Test Results

The accuracy results for binary classification on the test set, presented in Table 8, Table 9, Table 10 and Table 11, demonstrate strong performance of the CNN model across different binary classification scenarios. In Table 8 and Figure 9, distinguishing between normal and individual superclasses yielded accuracy ranging between 91.95% and 97.52%, with an average of 94.99% without patient data, which improved to 95.42% with demographic information. Table 9 and Figure 10 highlight subclass classification, where average accuracy reached 97.56%, and Table 10 shows further enhancement to 97.78% when patient data was included. Table 11 covers binary classification of normal vs. abnormal ECG signals, reporting 88.21% accuracy without patient data and 89.34% with it. Overall, the CNN model displayed excellent discriminative ability, especially when enhanced with demographic inputs, showcasing its robustness in binary ECG arrhythmia classification.

4.1.2. Multiclass Classification Results

Multiclass classification results on the test set using the CNN model, both with and without patient demographic data, are presented in Table 12. Without patient data, the CNN model achieved an accuracy of 79.49% for five superclasses, 78.35% for 10 subclasses, and 73.70% for 15 subclasses. When patient data was incorporated, the accuracy changed to 79.55%, 79.60%, and 72.61%, respectively. Integrating demographic information (age and gender) therefore improved accuracy slightly for the 5- and 10-class tasks but not for the 15-class task, where accuracy fell by about one percentage point. The lower accuracy in the 15-class case underscores the increased complexity and class imbalance challenges associated with fine-grained classification.
In multiclass and subclass classifications, the confusion matrices revealed a strong diagonal dominance, indicating accurate predictions across most arrhythmia categories. However, some overlap was observed in subclasses with similar ECG characteristics, such as STTC and CD. These findings highlight areas for improvement, particularly in reducing false negatives for underrepresented classes. Overall, the confusion matrices validate the model’s high precision and recall across all categories, demonstrating its reliability in clinical and diagnostic applications.
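For the multiclass case, precision and recall are read off the confusion matrix one class at a time (one-vs-rest). The sketch below shows one way to do this; it assumes that rows hold true classes and columns hold predicted classes, and the function name, the zero-denominator convention, and the 3-class toy matrix are illustrative only and do not reproduce any matrix from the paper.

```python
from typing import Dict, Sequence, Tuple


def per_class_precision_recall(
    cm: Sequence[Sequence[int]], labels: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} from a square confusion matrix.

    cm[t][p] counts samples of true class t predicted as class p
    (row = true, column = predicted; this orientation is an assumption).
    """
    k = len(labels)
    if len(cm) != k or any(len(row) != k for row in cm):
        raise ValueError("cm must be a k x k matrix matching labels")
    results = {}
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # missed cases of class c
        fp = sum(cm[r][c] for r in range(k)) - tp  # other classes called c
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[labels[c]] = (precision, recall)
    return results


# Toy 3-class matrix with made-up counts; CD is the smallest class here.
toy_labels = ["NORM", "STTC", "CD"]
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 2],
]
toy_scores = per_class_precision_recall(toy_cm, toy_labels)
```

Off-diagonal entries such as `toy_cm[2][1]` (CD recordings predicted as STTC) are the false negatives that lower the recall of the smaller class, which is the kind of overlap discussed above.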

4.2. Experimental Results Using the VGG16 Model

The binary classification accuracy results using the VGG16 model on the test set, as presented in Table 13, Table 14 and Table 15, show strong performance across different diagnostic scenarios. For binary classification between “normal” and one disease from the remaining four superclasses (MI, STTC, CD, and HYP) with patient data, the model achieved accuracy values ranging from 92.02% to 97.52%, with an average of 94.54%. When distinguishing between “normal” and individual subclasses (such as AMI, CLBBB, and CRBBB), accuracy further improved, reaching as high as 99.94%, with an average accuracy of 97.38%. However, in the more generalized binary classification task of “normal” vs. “abnormal” ECG signals, the model’s accuracy dropped to 87.29%. These results highlight VGG16’s effectiveness in fine-grained binary classification, though its performance is relatively modest in broader binary classification tasks.
Table 16 presents the multiclass classification accuracy results on the test set using the VGG16 model across increasing classification complexity (i.e., 5 superclasses, 10 subclasses, and 15 subclasses). The model achieved an accuracy of 76.53% for classifying ECG signals into 5 superclasses, slightly improving to 76.85% for 10 subclasses, which suggests a relatively stable performance despite the added complexity. However, the accuracy declined to 69.29% for 15 subclasses, highlighting the challenges of data imbalance and increased class overlap as the classification granularity intensifies. These results reflect the VGG16 model’s moderate capability in handling multiclass ECG classification, with performance gradually decreasing as the number of classes increases.

4.3. Performance Evaluation of CNN Against VGG16

This section presents a comparative analysis of the proposed CNN model and the VGG16 architecture across various binary and multiclass ECG classification tasks. Figure 11, Figure 12, Figure 13 and Figure 14 show the classification accuracy comparisons between the custom CNN model and the VGG16 model. Overall, the CNN model outperforms VGG16 in most binary tasks and in all multiclass tasks. In binary classification, as demonstrated in Figure 11, CNN shows higher accuracy than VGG16 for all five superclasses. For the 10-subclass binary classification shown in Figure 12, CNN leads in most cases, though VGG16 slightly outperforms it in specific subclasses like STTC, IMI, and CLBBB. In the binary classification of normal vs. abnormal ECG signals (Figure 13), CNN again demonstrates superior accuracy. Finally, in multiclass classification, covering 5 superclasses, 10 subclasses, and 15 subclasses, the CNN model outperforms VGG16 in all scenarios, as shown in Figure 14. These results highlight CNN’s greater robustness and generalization ability in handling both simple and complex classification tasks.

4.4. Performance Comparison with Recently Published Works

The proposed models are evaluated against some recently published works. In Table 17, the proposed models, CNN and VGG16, are compared with recently published works [27,31,32,44,54,61] for binary and multiclass classifications. The proposed CNN model, incorporating patient data, achieved the best results in binary classification (89.34%, 95.42%, and 97.78%), outperforming the others. In the case of multiclass classification, CNN achieved accuracies of 79.55%, 79.60%, and 72.61% for 5, 10, and 15 classes, respectively, which were higher than VGG16’s 76.53%, 76.85%, and 69.29%. These results underscore the superior performance of the proposed models, especially when patient demographic data is included, demonstrating their robustness in complex classification tasks.

4.5. Augmentation-Based Model Results

This section presents the performance of the CNN model when trained with augmented ECG data. To assess the impact of augmentation techniques such as noise addition, amplification, and attenuation on classification accuracy and robustness, we evaluate both binary and multiclass classification tasks. Figure 15 shows the training and validation accuracies for all classes as well as training and validation losses using the CNN model. In addition, Figure 16 presents confusion matrices for different ECG arrhythmia classification categories, offering visual insight into the classification accuracy of the CNN model across binary and multiclass scenarios.
Based on the results presented in Table 18, Table 19, Table 20 and Table 21, the CNN model trained with augmented ECG data maintained strong performance across all binary and multiclass classification tasks. In the binary classification of normal versus one disease from the remaining four superclasses, the model achieved an average accuracy of 94.28% with particularly high precision and recall values across all classes, as shown in Table 18. Also, in the binary classification involving 10 subclasses, the model showed exceptional accuracy, averaging 97.68%, with some disease classes such as CLBBB and CRBBB achieving a perfect classification of 100% accuracy, as presented in Table 19. In addition, Table 20 shows the results for the binary classification of normal versus abnormal signals, where the model maintained a solid accuracy of 88.93%. Meanwhile, in multiclass classification tasks, the model achieved accuracies of 76.89%, 78.89%, and 71.47% for 5, 10, and 15 classes, respectively, as reported in Table 21. These findings indicate that the CNN model retains high accuracy when trained on augmented data, most notably in the binary tasks involving individual arrhythmia subclasses.
The accuracy results of the CNN model show the effect of data augmentation on ECG arrhythmia classification performance. Without data augmentation, the model achieved an average accuracy of 95.42% for binary classification (normal vs. superclasses) and 97.78% for normal vs. subclass scenarios. Upon applying data augmentation techniques such as noise addition, amplification, and attenuation, the model maintained high performance, with average accuracies of 94.28% and 97.68% for the same classification tasks, respectively. In the binary classification of normal vs. abnormal ECGs, the accuracy slightly decreased from 89.34% to 88.93% when applying data augmentation. Table 22 presents the performance comparison of CNN models on ECG signal classification, with and without data augmentation. The results show that accuracy with data augmentation was slightly below accuracy without it. Although data augmentation is widely used to improve model generalization, the modest decrease in accuracy observed here suggests that simple additive noise, amplification, and attenuation may not be sufficiently representative of real-world ECG variations. Given the high fidelity of the PTB-XL dataset and the CNN’s strong baseline performance, these augmentations likely introduced distortions that interfered with subtle diagnostic features rather than enhancing the model’s robustness. Future work could explore domain-specific augmentations (e.g., lead dropout simulation, heart rate variability, or realistic arrhythmia patterns) to better mimic clinical conditions and improve performance.
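To make the three augmentation types named above concrete, the following plain-Python sketch applies noise addition, amplification, and attenuation to a multi-lead signal. It is an illustration under stated assumptions, not the authors' pipeline: the noise is assumed to be additive zero-mean Gaussian, and the function names and the default values of `sigma`, `gain_up`, and `gain_down` are hypothetical because the section does not report them.

```python
import random
from typing import List

Signal = List[List[float]]  # one inner list of samples per ECG lead


def add_gaussian_noise(signal: Signal, sigma: float, rng: random.Random) -> Signal:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [[x + rng.gauss(0.0, sigma) for x in lead] for lead in signal]


def scale_amplitude(signal: Signal, gain: float) -> Signal:
    """Amplify (gain > 1) or attenuate (0 < gain < 1) every lead."""
    return [[gain * x for x in lead] for lead in signal]


def augment(
    signal: Signal,
    rng: random.Random,
    sigma: float = 0.01,     # hypothetical noise level
    gain_up: float = 1.2,    # hypothetical amplification factor
    gain_down: float = 0.8,  # hypothetical attenuation factor
) -> List[Signal]:
    """Return noisy, amplified, and attenuated copies of one recording."""
    return [
        add_gaussian_noise(signal, sigma, rng),
        scale_amplitude(signal, gain_up),
        scale_amplitude(signal, gain_down),
    ]


# Toy 2-lead, 3-sample signal (made-up values); the input is left unchanged.
toy_signal = [[0.0, 1.0, -1.0], [0.5, 0.25, 0.0]]
noisy, louder, quieter = augment(toy_signal, random.Random(0))
```

Because all three transforms act uniformly on the waveform, they change its amplitude statistics but not its morphology, which is consistent with the observation above that such generic augmentations may not reflect real-world ECG variation.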

4.6. Explainable ECG Channel Contributions for Arrhythmia Detection

Figure 17 illustrates the CNN model’s lead-specific weight distribution, offering a clear view into how it prioritizes ECG channels for arrhythmia detection. This visualization enhances transparency and reflects alignment with clinical standards. For instance, leads V2, II, V3, and V4 were emphasized in detecting myocardial infarction (MI), mirroring the clinical focus on precordial and inferior leads. In cases of ST-T changes (STTC), leads V4, V5, and aVR stood out, with aVR’s prominence corroborating its link to left main coronary artery disease [61]. For conduction disorders, leads V1–V3 were dominant, supporting right bundle branch block (RBBB) diagnosis [63], while the underrepresentation of lateral leads in left bundle branch block (LBBB) detection likely reflects dataset imbalance [46]. Hypertrophy detection aligned with voltage-based criteria (e.g., Sokolow–Lyon index) through leads V5 and V6 [64]. These patterns are consistent with traditional workflows while offering novel risk stratification insights. The model’s emphasis on aVR, a lead often overlooked clinically, suggests the potential for the early identification of high-risk ischemia [65]. Prioritizing leads like V2–V4 for MI detection could streamline ECG analysis in resource-limited or emergency settings.
By mapping AI-derived lead importance to clinical indicators [66], the framework in the figure bridges AI diagnostics with medical reasoning. Clinicians gain tools to audit decisions, fostering trust in automation, while adaptive applications (e.g., wearables with condition-specific lead sensitivity) become feasible [67]. This integration of explainable AI enhances diagnostic accuracy, workflow efficiency, and data-driven care delivery.

5. Conclusions

This study proposed two explainable deep learning frameworks, CNN and VGG16 models, for ECG signal arrhythmia classification using the PTB-XL dataset, demonstrating their effectiveness across binary and multiclass classification tasks. By incorporating patient demographic data such as age and gender, the models achieved notable improvements in diagnostic accuracy, particularly in distinguishing complex or underrepresented arrhythmia subclasses. A key contribution of this work lies in its integration of explainable AI, which provided transparent insights into ECG channel importance across different cardiac conditions, reinforcing clinical relevance and trust in automated systems. The use of explainability not only enhanced the interpretability of predictions but also aligned model focus with established diagnostic practices, facilitating potential real-world adoption in healthcare. These findings affirm the critical role of explainable deep learning in developing accurate, reliable, and clinically meaningful ECG analysis systems and underline the importance of continued research in this domain. Future work will explore transformer-based models to better capture temporal dependencies in ECG signals. Integrating handcrafted features with deep learning may enhance classification accuracy and robustness. Cross-dataset evaluation is also important to assess generalizability across different clinical settings. Additionally, improving model interpretability through attention mechanisms or relevance-based methods will support clinical trust and real-world deployment.

Author Contributions

Conceptualization, M.A.A.; methodology, A.E.M.A. and A.A.; software, A.I.S.; validation, A.E.M.A., M.A.A. and A.I.S.; formal analysis, E.-S.A., A.A. and A.I.S.; investigation, A.A. and E.M.A.; resources, M.A.A.; data curation, M.A.A. and A.I.S.; writing—original draft preparation, A.E.M.A., E.-S.A., A.A. and M.A.A.; writing—review and editing, A.E.M.A., E.-S.A., A.A., M.A.A., E.M.A. and A.I.S.; visualization, A.E.M.A.; supervision, E.-S.A. and E.M.A.; project administration, M.A.A.; funding acquisition, E.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA for funding this research work through the project number “NBU-FFR-2025-159-02”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this work is publicly available at: https://physionet.org/content/ptb-xl/1.0.3/records100/21000/ (accessed on 21 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest related to this study.

References

  1. Surawicz, B.; Knilans, T.K. Chou’s Electrocardiography in Clinical Practice; Elsevier: Amsterdam, The Netherlands, 2008.
  2. Strodthoff, N.; Wagner, P.; Schaeffter, T.; Samek, W. Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL. IEEE J. Biomed. Health Inform. 2021, 25, 1519–1528.
  3. Clifford, G.D.; Liu, C.; Moody, B.; Lehman, L.H.; Silva, I.; Li, Q.; Johnson, A.E.; Mark, R.G. AF Classification from a Short Single Lead ECG Recording: The Physionet Computing in Cardiology Challenge 2017. In Proceedings of the 2017 Computing in Cardiology (CinC), Rennes, France, 24–27 September 2017.
  4. Warrick, P.A.; Lostanlen, V.; Eickenberg, M.; Homsi, M.N.; Rodríguez, A.C.; Anden, J. Arrhythmia Classification of Reduced-Lead Electrocardiograms by Scattering-Recurrent Networks. In Proceedings of the 2021 Computing in Cardiology (CinC), Brno, Czech Republic, 13–15 September 2021; IEEE: New York, NY, USA; pp. 1–4.
  5. Jin, Y.; Li, Z.; Wang, M.; Liu, J.; Tian, Y.; Liu, Y.; Wei, X.; Zhao, L.; Liu, C. Cardiologist-level interpretable knowledge-fused deep neural network for automatic arrhythmia diagnosis. Commun. Med. 2024, 4, 31.
  6. Weimann, K.; Conrad, T.O.F. Transfer learning for ECG classification. Sci. Rep. 2021, 11, 5251.
  7. Hannun, A.Y.; Rajpurkar, P.; Haghpanahi, M.; Tison, G.H.; Bourn, C.; Turakhia, M.P.; Ng, A.Y. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat. Med. 2019, 25, 65–69.
  8. Hu, R.; Chen, J.; Zhou, L. A transformer-based deep neural network for arrhythmia detection using continuous ECG signals. Comput. Biol. Med. 2022, 144, 105325.
  9. Aziz, S.; Ahmed, S.; Alouini, M.-S. ECG-based machine-learning algorithms for heartbeat classification. Sci. Rep. 2021, 11, 18738.
  10. Figueroa-Gil, L.E.; López-Cons, I.V.; Orjuela-Cañón, A.D. Machine Learning Techniques for Classifying Cardiac Arrhythmias. In Proceedings of the XLVII Mexican Conference on Biomedical Engineering, Hermosillo, Mexico, 7–9 November 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 27–39.
  11. Pandey, S.K.; Janghel, R.R.; Vani, V. Patient Specific Machine Learning Models for ECG Signal Classification. Procedia Comput. Sci. 2020, 167, 2181–2190.
  12. Zabihi, F.; Safara, F.; Ahadzadeh, B. An electrocardiogram signal classification using a hybrid machine learning and deep learning approach. Healthc. Anal. 2024, 6, 100366.
  13. Jambukia, S.H.; Dabhi, V.K.; Prajapati, H.B. Classification of ECG signals using machine learning techniques: A survey. In Proceedings of the 2015 International Conference on Advances in Computer Engineering and Applications, Ghaziabad, India, 19–20 March 2015; IEEE: New York, NY, USA; pp. 714–721.
  14. Baghdadi, N.A.; Abdelaliem, S.M.F.; Malki, A.; Gad, I.; Ewis, A.; Atlam, E. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis. J. Big Data 2023, 10, 144.
  15. Sraitih, M.; Jabrane, Y.; El Hassani, A.H. An Automated System for ECG Arrhythmia Detection Using Machine Learning Techniques. J. Clin. Med. 2021, 10, 5450.
  16. Malik, M.; Dua, T.; Snigdha. Biomedical Signal Processing: ECG Signal Analysis Using Machine Learning in MATLAB. In Recent Advances in Metrology; Springer: Berlin/Heidelberg, Germany, 2023; pp. 121–127.
  17. Hassaballah, M.; Wazery, Y.M.; Ibrahim, I.E.; Farag, A. ECG Heartbeat Classification Using Machine Learning and Metaheuristic Optimization for Smart Healthcare Systems. Bioengineering 2023, 10, 429.
  18. Ben-Moshe, N.; Brimer, S.B.; Tsutsui, K.; Suleiman, M.; Sörnmo, L.; Behar, J.A. Machine learning for ranking f-wave extraction methods in single-lead ECGs. Biomed. Signal Process. Control 2025, 99, 106817.
  19. Alfaras, M.; Soriano, M.C.; Ortín, S. A Fast Machine Learning Model for ECG-Based Heartbeat Classification and Arrhythmia Detection. Front. Phys. 2019, 7, 103.
  20. Hemakom, A.; Atiwiwat, D.; Israsena, P. ECG and EEG based machine learning models for the classification of mental workload and stress levels for women in different menstrual phases, men, and mixed sexes. Biomed. Signal Process. Control 2024, 95, 106379.
  21. Pal, A.; Srivastva, R.; Singh, Y.N. CardioNet: An Efficient ECG Arrhythmia Classification System Using Transfer Learning. Big Data Res. 2021, 26, 100271.
  22. Chorney, W.; Wang, H. Towards federated transfer learning in electrocardiogram signal analysis. Comput. Biol. Med. 2024, 170, 107984.
  23. Nguyen, C.V.; Do, C.D. Transfer Learning in ECG Diagnosis: Is It Effective? Available online: http://arxiv.org/abs/2402.02021 (accessed on 21 June 2025).
  24. Ahmad, M.; Ahmed, A.; Hashim, H.; Farsi, M.; Mahmoud, N. Enhancing Heart Disease Diagnosis Using ECG Signal Reconstruction and Deep Transfer Learning Classification with Optional SVM Integration. Diagnostics 2025, 15, 1501.
  25. Ao, R.; He, G. Image Based Deep Learning in 12-Lead ECG Diagnosis. medRxiv 2022.
  26. Bontinck, L.; Fonteyn, K.; Dhaene, T.; Deschrijver, D. ECGencode: Compact and computationally efficient deep learning feature encoder for ECG signals. Expert. Syst. Appl. 2024, 255, 124775.
  27. Butt, F.S.; Wagner, M.F.; Schäfer, J.; Ullate, D.G. Toward Automated Feature Extraction for Deep Learning Classification of Electrocardiogram Signals. IEEE Access 2022, 10, 118601–118616.
  28. Aarthy, S.T.; Iqbal, J.L.M. A novel deep learning approach for early detection of cardiovascular diseases from ECG signals. Med. Eng. Phys. 2024, 125, 104111.
  29. Akalın, F.; Çavdaroğlu, P.D.; Orhan, M.F. Arrhythmia detection with transfer learning architecture integrating the developed optimization algorithm and regularization method. BMC Biomed Eng. 2025, 7, 8.
  30. Ansari, Y.; Mourad, O.; Qaraqe, K.; Serpedin, E. Deep learning for ECG Arrhythmia detection and classification: An overview of progress for period 2017–2023. Front. Physiol. 2023, 14, 1246746.
  31. Narotamo, H.; Dias, M.; Santos, R.; Carreiro, A.V.; Gamboa, H.; Silveira, M. Deep learning for ECG classification: A comparative study of 1D and 2D representations and multimodal fusion approaches. Biomed. Signal Process. Control 2024, 93, 106141.
  32. Śmigiel, S.; Pałczyński, K.; Ledziński, D. Deep learning techniques in the classification of ecg signals using r-peak detection based on the ptb-xl dataset. Sensors 2021, 21, 8174.
  33. Kuila, S.; Dhanda, N.; Joardar, S. ECG signal classification to detect heart arrhythmia using ELM and CNN. Multimed. Tools Appl. 2023, 82, 29857–29881.
  34. Alamatsaz, N.; Tabatabaei, L.S.; Yazdchi, M.; Payan, H.; Alamatsaz, N.; Nasimi, F. A Lightweight Hybrid CNN-LSTM Model for ECG-Based Arrhythmia Detection. Available online: http://arxiv.org/abs/2209.00988 (accessed on 21 June 2025).
  35. Swaroop, P.; Badolia, N.; Ranjan, R.; Kumar, M. Arrhythmia Classification Using Hybrid CNN-LSTM Model. In Proceedings of the 2024 First International Conference on Electronics, Communication and Signal Processing (ICECSP), New Delhi, India, 8–10 August 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
  36. Selvam, I.J.; Madhavan, M.; Kumarasamy, S.K. Detection and classification of electrocardiography using hybrid deep learning models. Hell. J. Cardiol. 2024, 81, 75–84.
  37. Varalakshmi, P.; Sankaran, A.P. An improved hybrid AI model for prediction of arrhythmia using ECG signals. Biomed. Signal Process. Control 2023, 80, 104248.
  38. Shah, H.A.; Saeed, F.; Diyan, M.; Almujally, N.A.; Kang, J. ECG-TransCovNet: A hybrid transformer model for accurate arrhythmia detection using Electrocardiogram signals. In CAAI Transactions on Intelligence Technology; Wiley: Hoboken, NJ, USA, 2024; pp. 1–14.
  39. Talukder, M.A.; Khalid, M.; Kazi, M.; Jahan Muna, N.; Nur-e-Alam, M.; Halder, S.; Sultana, N. A hybrid cardiovascular arrhythmia disease detection using ConvNeXt-X models on electrocardiogram signals. Sci. Rep. 2024, 14, 30366.
  40. Bai, X.; Dong, X.; Li, Y.; Liu, R.; Zhang, H. A hybrid deep learning network for automatic diagnosis of cardiac arrhythmia based on 12-lead ECG. Sci. Rep. 2024, 14, 24441.
  41. Jing, J.; Zhang, J.; Liu, A.; Gao, M.; Qian, R.; Chen, X. ECG-Based Multiclass Arrhythmia Classification Using Beat-Level Fusion Network. J. Healthc. Eng 2023, 2023, 1755121.
  42. Li, Q.; Liu, Y.; Zhang, Z.; Liu, J.; Yuan, Y.; Wang, K.; He, R. Learning with incomplete labels of multisource datasets for ECG classification. Pattern Recognit. 2024, 150, 110321.
  43. Nguyen, C.V.; Duong, H.M.; Do, C.D. MELEP: A Novel Predictive Measure of Transferability in Multi-Label ECG Diagnosis. J. Heal. Inform. Res. 2024, 8, 506–522.
  44. Pałczyński, K.; Śmigiel, S.; Ledziński, D.; Bujnowski, S. Study of the Few-Shot Learning for ECG Classification Based on the PTB-XL Dataset. Sensors 2022, 22, 904.
  45. Bhanja, N. Design and Comparison of Deep Learning Model for ECG Classification Using PTB-XL Dataset. Available online: https://www.researchgate.net/publication/374061560 (accessed on 21 June 2025).
  46. Wagner, P.; Strodthoff, N.; Bousseljot, R.-D.; Kreiseler, D.; Lunze, F.I.; Samek, W.; Schaeffter, T. PTB-XL, a large publicly available electrocardiography dataset. Sci Data 2020, 7, 154.
  47. Hambarde, S.; Paithane, A.; Lambhate, P.; Hambarde, A.S.; Kalyankar, P.A. Smart Arrhythmia Detection Using Single Lead ECG Signal and Hybridized Deep Neural Network Model. Web Intell. 2025, 23, 155–171.
  48. Strodthoff, N.; Mehari, T.; Nagel, C.; Aston, P.J.; Sundar, A.; Graff, C.; Kanters, J.K.; Haverkamp, W.; Dössel, O.; Loewe, A.; et al. PTB-XL+, a comprehensive electrocardiographic feature dataset. Sci. Data 2023, 10, 279.
  49. Krasteva, V.; Stoyanov, T.; Naydenov, S.; Schmid, R.; Jekova, I. Detection of Atrial Fibrillation in Holter ECG Recordings by ECHOView Images: A Deep Transfer Learning Study. Diagnostics 2025, 15, 865.
  50. Wickramasinghe, N.L.; Athif, M. Multi-label classification of reduced-lead ECGs using an interpretable deep convolutional neural network. Physiol. Meas. 2022, 43, 064002.
  51. Gour, A.; Gupta, M.; Wadhvani, R.; Shukla, S. ECG Based Heart Disease Classification: Advancement and Review of Techniques. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2024; pp. 1634–1648.
  52. Safdar, M.F.; Nowak, R.M.; Pałka, P. Pre-Processing Techniques and Artificial Intelligence Algorithms for Electrocardiogram (ECG) Signals Analysis: A Comprehensive Review; Elsevier: Amsterdam, The Netherlands, 2024; Volume 1.
  53. Feyisa, D.W.; Debelee, T.G.; Ayano, Y.M.; Kebede, S.R.; Assore, T.F. Lightweight Multireceptive Field CNN for 12-Lead ECG Signal Classification. Comput. Intell. Neurosci. 2022, 2022, 8413294.
  54. Śmigiel, S.; Pałczyński, K.; Ledziński, D. ECG Signal Classification Using Deep Learning Techniques Based on the PTB-XL Dataset. Entropy 2021, 23, 1121.
  55. Kusumoto, F.M.; Schoenfeld, M.H.; Barrett, C.; Edgerton, J.R.; Ellenbogen, K.A.; Gold, M.R.; Goldschlager, N.F.; Hamilton, R.M.; Joglar, J.A.; Kim, R.J.; et al. 2018 ACC/AHA/HRS Guideline on the Evaluation and Management of Patients with Bradycardia and Cardiac Conduction Delay: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society. Circulation 2019, 140, E382–E482.
  56. Lehmann, M.H.; Timothy, K.W.; Frankovich, D.; Fromm, B.S.; Keating, M.; Locati, E.H.; Taggart, R.; A Towbin, J.; Moss, A.J.; Schwartz, P.J.; et al. Age-Gender Influence on the Rate-Corrected QT Interval and the QT-Heart Rate Relation in Families with Genotypically Characterized Long QT Syndrome. J. Am. Coll. Cardiol. 1997, 29, 93–99.
  57. Okin, P.M.; Roman, M.J.; Devereux, R.B.; Kligfield, P. Gender Differences and the Electrocardiogram in Left Ventricular Hypertrophy. Hypertension 1995, 25, 242–249.
  58. Antzelevitch, C.; Patocskai, B. Brugada Syndrome: Clinical, Genetic, Molecular, Cellular, and Ionic Aspects. Curr. Probl. Cardiol. 2016, 41, 7–57.
  59. Powers, D.M.W. Evaluation: From precision, recall, and F-measure to ROC, informedness, markedness, and correlation. arXiv 2011, arXiv:2010.16061.
  60. Rashed, A.E.E.; Bahgat, W.M.; Ahmed, A.; Farrag, T.A.; Atwa, A.E.M. Efficient machine learning models across multiple datasets for autism spectrum disorder diagnoses. Biomed. Signal Process Control 2025, 100, 106949.
  61. Misumida, N.; Kobayashi, A.; Fox, J.T.; Hanon, S.; Schweitzer, P.; Kanei, Y. Predictive Value of ST-Segment Elevation in Lead aVR for Left Main and/or Three-Vessel Disease in Non-ST-Segment Elevation Myocardial Infarction. Ann. Noninvasive Electrocardiol. 2016, 21, 91–97.
  62. Razin, V.; Krasnov, A.; Karchkov, D.; Moskalenko, V.; Rodionov, D.; Zolotykh, N.; Smirnov, L.; Osipov, G. Solving the Problem of Diagnosing a Disease by ECG on the PTB-XL Dataset Using Deep Learning. In Advances in Neural Computation, Machine Learning, and Cognitive Research VII; Springer: Berlin/Heidelberg, Germany, 2023; pp. 13–21.
  63. Surawicz, B.; Childers, R.; Deal, B.J.; Gettes, L.S. AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram. Circulation 2009, 119, e235–e240.
  64. Hancock, E.W.; Deal, B.J.; Mirvis, D.M.; Okin, P.; Kligfield, P.; Gettes, L.S. AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram. Circulation 2009, 119, e251–e261. [Google Scholar] [CrossRef]
  65. Weintraub, R.G.; Alexander, P.M.A. Outcomes in Pediatric Dilated Cardiomyopathy. J. Am. Coll. Cardiol. 2017, 70, 2674–2676. [Google Scholar] [CrossRef]
  66. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  67. Su, H.; Di Lallo, A.; Murphy, R.R.; Taylor, R.H.; Garibaldi, B.T.; Krieger, A. Physical human–robot interaction for clinical care in infectious environments. Nat. Mach. Intell. 2021, 3, 184–186. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed deep learning framework for arrhythmia classification.
Figure 2. A summary of the PTB-XL dataset showing the diagnostic superclasses and subclasses.
Figure 3. Class distribution before data augmentation for the training samples: (a) binary (normal and abnormal), (b) 5 superclasses, (c) 10 subclasses, and (d) 15 subclasses.
Figure 4. Class distribution after data augmentation for the training samples: (a) binary (normal and abnormal), (b) 5 superclasses, (c) 10 subclasses, and (d) 15 subclasses.
Figure 5. The proposed CNN model architecture.
Figure 6. The proposed VGG model architecture.
Figure 7. CNN training/validation accuracies and losses for all classes.
Figure 8. Confusion matrix for each category using the CNN model for the test set.
Figure 9. The classification results for the binary classification (normal vs. one of the remaining four superclasses) with and without patient data using CNN for the test set.
Figure 10. The classification results for the binary classification (normal vs. one of the remaining 9 subclasses) with and without patient data using CNN for the test set.
Figure 11. Performance comparison of CNN against VGG16 in the 5-superclass binary classification for the test set.
Figure 12. Performance comparison of CNN against VGG16 in the 10-subclass binary classification for the test set.
Figure 13. Performance comparison of CNN against VGG16 in the normal and abnormal binary classification.
Figure 14. Performance comparison of CNN against VGG16 in the multiclass classification.
Figure 15. CNN training and validation accuracies for all classes as well as training and validation losses.
Figure 16. CNN confusion matrices for each category for the test set.
Figure 17. Explainable ECG channel weights for feature detection in binary classification (normal vs. one disease) based on the CNN model.
Table 1. Summary of research studies utilizing the PTB-XL dataset for arrhythmia detection and classification.
| Ref. | Dataset | Models Applied | Key Results | DOI/Link |
| --- | --- | --- | --- | --- |
| [2] | PTB-XL, ICBEB2018 | ResNet, Inception, Transfer Learning, xresnet1d101 | Demonstrated PTB-XL as a benchmark for ECG analysis; transfer learning showed promising results for small datasets. | https://doi.org/10.1109/JBHI.2020.3022989 |
| [5] | Chinese ECG Benchmark | Knowledge-Fused DNN | Achieved higher performance than cardiologists on arrhythmia classification in remote settings. | https://doi.org/10.1038/s43856-024-00464-4 |
| [23] | PTB-XL, CPSC2018, Georgia, Ribeiro | Transfer Learning (Various CNNs, RNNs) | Transfer learning was effective on small datasets; fine-tuning did not consistently outperform training from scratch on larger datasets. | https://doi.org/10.48550/arXiv.2402.02021 |
| [25] | PTB-XL, CPSC2018, Shaoxing, Tongji | Convolutional Neural Networks (CNNs) | Excellent AUROC, AUPRC on test data from PTB-XL; lower performance on unseen datasets. | https://doi.org/10.1101/2022.11.21.22282586 |
| [27] | PTB-XL | CNN-LSTM, Attention Transformer | Achieved 91.07% accuracy (MI detection). | https://doi.org/10.1109/ACCESS.2022.3220670 |
| [31] | PTB-XL | 1D CNN, LSTM, GRU, Multimodal Fusion, Attention-based models | GRU achieved 79.67% sensitivity and 81.04% specificity. The 1D representation outperformed 2D representations. | https://doi.org/10.1016/j.bspc.2024.106141 |
| [32] | PTB-XL | CNN, QRS complex extraction, Entropy-Based Features | Improved performance by adding entropy-based features to raw signals. | https://doi.org/10.3390/s21248174 |
| [41] | PTB-XL | Beat-Level Fusion Network (BLF-Net) | Outperformed state-of-the-art methods in multiclass arrhythmia classification. | https://doi.org/10.1155/2023/1755121 |
| [43] | PTB-XL, various smaller datasets | MELEP, CNNs, RNNs | MELEP effectively predicted transferability; strong correlation with fine-tuning performance (0.6+ correlation). | https://doi.org/10.1007/s41666-024-00168-3 |
| [44] | PTB-XL | Few-Shot Learning (FSL), CNN | FSL achieved 93.2% accuracy for 2-class classification, outperforming softmax-based models. | https://doi.org/10.3390/s22030904 |
| [45] | PTB-XL | AlexNet, LeNet | AlexNet performed better with high classification accuracy for cardiac conditions. | https://www.researchgate.net/publication/374061560 (accessed on 21 June 2025) |
| [48] | PTB-XL | ECGDeli, Marquette 12SL | Introduced ECGDeli, Marquette 12SL feature sets for enhanced ECG interpretation. | https://doi.org/10.1038/s41597-023-02153-8 |
| [53] | PTB-XL | Multi-Receptive Field CNN (MRF-CNN) | Achieved 0.72 F1 score, 0.93 AUC for 5 superclasses on PTB-XL dataset. | https://doi.org/10.1155/2022/8413294 |
| [54] | PTB-XL | CNN, SincNet, Entropy-Based Features | Best performance with convolutional network + entropy features. | https://doi.org/10.3390/e23091121 |
Table 2. PTB-XL dataset splitting into training, validation, and testing.
| Subset | Folds | Size |
| --- | --- | --- |
| Train | 1–8 | 15,237 |
| Validation | 9 | 1,886 |
| Test | 10 | 1,903 |
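The split in Table 2 assigns folds 1–8 to training, fold 9 to validation, and fold 10 to testing. The following is a minimal sketch of that assignment, assuming each record carries a fold number from 1 to 10 (PTB-XL provides one in its `strat_fold` metadata column). The toy records and the helper name `split_by_fold` are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict

# Fold-to-subset assignment taken from Table 2.
TRAIN_FOLDS = set(range(1, 9))   # folds 1-8
VAL_FOLD = 9
TEST_FOLD = 10


def split_by_fold(records, fold_key="strat_fold"):
    """Group records into train/validation/test by their fold number."""
    subsets = defaultdict(list)
    for rec in records:
        fold = rec[fold_key]
        if fold in TRAIN_FOLDS:
            subsets["train"].append(rec)
        elif fold == VAL_FOLD:
            subsets["validation"].append(rec)
        elif fold == TEST_FOLD:
            subsets["test"].append(rec)
        else:
            raise ValueError(f"unexpected fold number: {fold}")
    return dict(subsets)


# Toy metadata: 20 hypothetical records, two per fold.
toy_records = [{"ecg_id": i, "strat_fold": (i % 10) + 1} for i in range(20)]
splits = split_by_fold(toy_records)
print({name: len(items) for name, items in splits.items()})
```

Splitting on a fixed fold number keeps the validation and test sets identical across experiments, which makes the results of different models directly comparable.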
Table 3. The superclasses and the subclasses in the PTB-XL dataset.
| Superclass | Description | Subclass | Description |
| --- | --- | --- | --- |
| NORM | Normal ECG | NORM | Normal ECG |
| CD | Conduction Disturbance | LAFB/LPFB | Left anterior/Left posterior fascicular block |
| | | IRBBB | Incomplete right bundle branch block |
| | | ILBBB | Incomplete left bundle branch block |
| | | CLBBB | Complete left bundle branch block |
| | | CRBBB | Complete right bundle branch block |
| | | _AVB | AV block |
| | | IVCB | Non-specific intraventricular conduction disturbance (block) |
| | | WPW | Wolff–Parkinson–White syndrome |
| HYP | Hypertrophy | LVH | Left ventricular hypertrophy |
| | | RVH | Right ventricular hypertrophy |
| | | LAO/LAE | Left atrial overload/enlargement |
| | | RAO/RAE | Right atrial overload/enlargement |
| | | SEHYP | Septal hypertrophy |
| MI | Myocardial Infarction | AMI | Anterior myocardial infarction |
| | | IMI | Inferior myocardial infarction |
| | | LMI | Lateral myocardial infarction |
| | | PMI | Posterior myocardial infarction |
| STTC | ST/T change | ISCA | Ischemic in anterior leads |
| | | ISCI | Ischemic in inferior leads |
| | | ISC_ | Non-specific ischemic |
| | | STTC | ST-T changes |
| | | NST_ | Non-specific ST changes |
Table 4. Data augmentation techniques and the corresponding description.
| Technique | Description |
| --- | --- |
| Adding noise | Add Gaussian noise with a mean of 0 and a standard deviation of 0.01. |
| Amplification | Multiply by a gain k = 1 + x, x ∈ [0.001, 0.01]. |
| Attenuation | Multiply by a gain k = 1 − x, x ∈ [0.001, 0.01]. |
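The three augmentation techniques in Table 4 can be sketched with NumPy as follows. The noise parameters and the gain range come from the table. The uniform sampling of x, the single gain per recording, the fixed seed, and the synthetic sine-wave signal are assumptions made for illustration only, since the source does not specify them.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed is an assumption, for reproducibility


def add_noise(signal, std=0.01):
    """Add zero-mean Gaussian noise with the standard deviation from Table 4."""
    return signal + rng.normal(loc=0.0, scale=std, size=signal.shape)


def amplify(signal, low=0.001, high=0.01):
    """Multiply by a gain k = 1 + x, with x drawn uniformly from [low, high)."""
    x = rng.uniform(low, high)
    return signal * (1.0 + x)


def attenuate(signal, low=0.001, high=0.01):
    """Multiply by a gain k = 1 - x, with x drawn uniformly from [low, high)."""
    x = rng.uniform(low, high)
    return signal * (1.0 - x)


# Toy 12-lead signal: 1000 samples of a sine wave per lead (hypothetical data).
t = np.linspace(0.0, 10.0, 1000)
ecg = np.tile(np.sin(2 * np.pi * 1.2 * t)[:, None], (1, 12))

noisy = add_noise(ecg)
louder = amplify(ecg)
quieter = attenuate(ecg)
print(noisy.shape, louder.shape, quieter.shape)
```

All three transforms change the amplitude by at most about 1%, so the augmented copies stay morphologically close to the original recording.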
Table 5. The CNN model parameters.
| Parameter | Value |
| --- | --- |
| Convolutional layers for Branch 1 (ECG signals) | 32, 64, 128, 256, 512 |
| Convolutional layer size for Branch 1 | 3 × 1 |
| Max pooling layer size | 2 × 1 |
| Dense layers for Branch 1 | 100, 32 |
| Dense layers for Branch 2 (demographic data) | 100, 64, 32, 16 |
| Dropout rate for Branch 1 | 0.4 |
| Dropout rate for Branch 2 | 0.4 |
| Dense layers for concatenated branch | 10, 10, num_classes (5, 10, 15) |
| Dropout rate for concatenated branch | 0.2 |
| Convolutional layer activation function | ReLU |
| Output activation function | Softmax |
| Loss function | Cross-entropy |
| Optimizer | Adam (learning rate = 0.001) |
| Callbacks | EarlyStopping, Reduce LearningRate, Save CheckPoint |
| Max_Epochs | 60 |
| Batch size | 16 |
Table 6. Architectural and training parameters of the proposed VGG-based model.
| Parameter | Value |
| --- | --- |
| Convolutional layers for Branch 1 (ECG signals) | 64, 64, 128, 128, 256, 256, 512, 512 |
| Convolutional layer size for Branch 1 | 3 × 1 |
| Max pooling layer size | 2 × 1 |
| Dense layers for Branch 1 | 512, 512 |
| Dense layers for Branch 2 (demographic data) | 100, 64, 32, 16 |
| Dropout rate for Branch 1 | 0.5 |
| Dropout rate for Branch 2 | 0.4 |
| Dense layers for concatenated branch | 10, 10, num_classes (5, 10, 15) |
| Dropout rate for concatenated branch | 0.2 |
| Convolutional layer activation function | ReLU |
| Output activation function | Softmax |
| Loss function | Cross-entropy |
| Optimizer | Adam (learning rate = 0.001) |
| Callbacks | EarlyStopping, Reduce LearningRate, Save CheckPoint |
| Max_Epochs | 60 |
| Batch size | 16 |
Table 7. Components of a binary confusion matrix showing TP, TN, FP, and FN.
| Actual/Predicted | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
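The four counts in Table 7 are enough to compute the metrics reported in the result tables. The sketch below uses the standard textbook definitions of accuracy, precision, recall, and F1 for the positive class. The function name `binary_metrics` and the example counts are hypothetical, and the averaging scheme used for the paper's reported values may differ.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the paper's results.
m = binary_metrics(tp=80, tn=90, fp=10, fn=20)
print(m)
```

The guards against zero denominators matter for rare classes, where a model may predict no positives at all on a small test subset.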
Table 8. Binary classification (normal vs. one disease from the remaining four superclasses) with and without patient data using the CNN model for the test set.
| | Without Patient Data | | | | With Patient Data | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Metric | NORM and MI | NORM and STTC | NORM and CD | NORM and HYP | NORM and MI | NORM and STTC | NORM and CD | NORM and HYP |
| Precision | 0.9652 | 0.9397 | 0.9195 | 0.9752 | 0.9745 | 0.9410 | 0.9244 | 0.9770 |
| Recall | 0.9652 | 0.9397 | 0.9195 | 0.9752 | 0.9745 | 0.9410 | 0.9244 | 0.9770 |
| F1_score | 0.9279 | 0.9114 | 0.8690 | 0.8765 | 0.9321 | 0.9132 | 0.8794 | 0.8995 |
| Accuracy | 0.9652 | 0.9397 | 0.9195 | 0.9752 | 0.9745 | 0.9410 | 0.9244 | 0.9770 |
| Training time (s) | 260 | 252 | 223 | 179 | 289 | 338 | 309 | 306 |

Average accuracy: 0.9499 without patient data; 0.9542 with patient data.
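The average accuracies in Table 8 are the arithmetic means of the four per-task accuracies in each group. A short check, using the values copied from the table (the helper `mean` is only illustrative):

```python
# Accuracy values copied from Table 8 (order: MI, STTC, CD, HYP).
without_patient = [0.9652, 0.9397, 0.9195, 0.9752]
with_patient = [0.9745, 0.9410, 0.9244, 0.9770]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


avg_without = mean(without_patient)
avg_with = mean(with_patient)
print(f"{avg_without:.4f} {avg_with:.4f}")
```

Both means match the reported averages to four decimal places, so adding patient data corresponds to a gain of roughly 0.4 percentage points in this setting.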
Table 9. Binary classification (normal vs. one disease from the remaining 9 subclasses) results without patient data using the CNN model for the test set.
| Metric | NORM and STTC | NORM and AMI | NORM and IMI | NORM and LAFB/LPFB | NORM and LVH | NORM and IRBBB | NORM and CLBBB | NORM and ISCA | NORM and CRBBB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.9243 | 0.9838 | 0.9678 | 0.9687 | 0.9855 | 0.9627 | 0.9965 | 0.9925 | 0.9987 |
| Recall | 0.9243 | 0.9838 | 0.9678 | 0.9687 | 0.9855 | 0.9627 | 0.9965 | 0.9925 | 0.9987 |
| F1_score | 0.8456 | 0.9641 | 0.8832 | 0.8990 | 0.9440 | 0.8435 | 0.9538 | 0.9050 | 0.9907 |
| Accuracy | 0.9243 | 0.9838 | 0.9678 | 0.9687 | 0.9855 | 0.9627 | 0.9965 | 0.9925 | 0.9987 |
| Training time (s) | 242 | 227 | 197 | 198 | 176 | 181 | 206 | 184 | 241 |

Average accuracy: 0.9756
Table 10. Binary classification (normal vs. one disease from the remaining 9 subclasses) results with patient data using the CNN model for the test set.
| Metric | NORM and STTC | NORM and AMI | NORM and IMI | NORM and LAFB/LPFB | NORM and LVH | NORM and IRBBB | NORM and CLBBB | NORM and ISCA | NORM and CRBBB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.9255 | 0.9850 | 0.9747 | 0.9744 | 0.9797 | 0.9708 | 0.9986 | 0.9932 | 0.9986 |
| Recall | 0.9255 | 0.9850 | 0.9747 | 0.9744 | 0.9797 | 0.9708 | 0.9986 | 0.9932 | 0.9986 |
| F1_score | 0.8609 | 0.9647 | 0.9092 | 0.9168 | 0.9227 | 0.8621 | 0.9739 | 0.9122 | 0.9739 |
| Accuracy | 0.9255 | 0.9850 | 0.9747 | 0.9744 | 0.9797 | 0.9708 | 0.9986 | 0.9932 | 0.9986 |
| Training time (s) | 311 | 283 | 224 | 289 | 219 | 276 | 312 | 236 | 310 |

Average accuracy: 0.9778
Table 11. Binary classification (normal vs. abnormal) results without/with patient data using the CNN model for the test set.
| Metric | Without Patient Data | With Patient Data |
| --- | --- | --- |
| Precision | 0.8821 | 0.8934 |
| Recall | 0.8821 | 0.8934 |
| F1_score | 0.8808 | 0.8929 |
| Accuracy | 0.8821 | 0.8934 |
| Training time (s) | 325 | 495 |
Table 12. Multiclass classification results with/without patient data using the CNN model for the test set.
| | Without Patient Data | | | With Patient Data | | |
| --- | --- | --- | --- | --- | --- | --- |
| Metric | 5 Superclasses | 10 Subclasses | 15 Subclasses | 5 Superclasses | 10 Subclasses | 15 Subclasses |
| Precision | 0.7986 | 0.8145 | 0.7913 | 0.8270 | 0.8341 | 0.8131 |
| Recall | 0.7895 | 0.7613 | 0.7141 | 0.7659 | 0.7523 | 0.6613 |
| F1_score | 0.6804 | 0.6469 | 0.4272 | 0.6757 | 0.6261 | 0.3889 |
| Accuracy | 0.7949 | 0.7835 | 0.7370 | 0.7955 | 0.7960 | 0.7261 |
| Training time (s) | 340 | 304 | 338 | 666 | 626 | 571 |
Table 13. Binary classification (normal versus one disease from the remaining 4 superclasses) using the VGG16 model for the test set.
| Metric (with patient data) | NORM and MI | NORM and STTC | NORM and CD | NORM and HYP |
| --- | --- | --- | --- | --- |
| Precision | 0.9564 | 0.9202 | 0.9297 | 0.9752 |
| Recall | 0.9564 | 0.9202 | 0.9297 | 0.9752 |
| F1_score | 0.9306 | 0.8771 | 0.8916 | 0.8823 |
| Accuracy | 0.9564 | 0.9202 | 0.9297 | 0.9752 |
| Training time (s) | 398 | 459 | 460 | 446 |

Average accuracy: 0.9454
Table 14. Binary classification (normal versus one disease from the remaining 9 subclasses) using the VGG16 model for the test set.
| Metric | NORM and STTC | NORM and AMI | NORM and IMI | NORM and LAFB/LPFB | NORM and LVH | NORM and IRBBB | NORM and CLBBB | NORM and ISCA | NORM and CRBBB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.9390 | 0.9805 | 0.9731 | 0.9695 | 0.9787 | 0.9634 | 0.9974 | 0.9631 | 0.9994 |
| Recall | 0.9390 | 0.9805 | 0.9731 | 0.9695 | 0.9787 | 0.9634 | 0.9974 | 0.9631 | 0.9994 |
| F1_score | 0.8724 | 0.9533 | 0.9014 | 0.8952 | 0.9136 | 0.8677 | 0.9650 | 0.8952 | 0.9964 |
| Accuracy | 0.9390 | 0.9805 | 0.9731 | 0.9695 | 0.9787 | 0.9634 | 0.9974 | 0.9631 | 0.9994 |
| Training time (s) | 408 | 394 | 608 | 371 | 484 | 486 | 314 | 517 | 785 |

Average accuracy: 0.9738
Table 15. Binary classification (normal and abnormal) results using the VGG16 model for the test set.
| Precision | Recall | F1_score | Accuracy | Training time (s) |
| --- | --- | --- | --- | --- |
| 0.8729 | 0.8729 | 0.8701 | 0.8729 | 557 |
Table 16. Multiclass classification results using the VGG16 model for the test set.
| Metric | 5 Superclasses | 10 Subclasses | 15 Subclasses |
| --- | --- | --- | --- |
| Precision | 0.7685 | 0.7884 | 0.7179 |
| Recall | 0.7628 | 0.7515 | 0.6855 |
| F1_score | 0.7656 | 0.7695 | 0.7013 |
| Accuracy | 0.7653 | 0.7685 | 0.6929 |
| Training time (s) | 1024 | 1005 | 1001 |
Table 17. The proposed models’ performance against some recently published works.
Ref.YearModelAverage Accuracy
BinaryMulticlass
Normal and AbnormalNormal vs. One SuperclassNormal vs. One Subclass5
Classes
10 Classes15 Classes
[27]2022CNN-LSTM 90.94% 74.33%
[31]2024GRU 80.69%
[32]2021QRS entropy+ Raw signal89.8% 75.8%
[44]2022FSL + XGBoost88.9% 75.2%
[54]2021CNN + Entropy features.89.2% 76.5%
[62]2023CNN 70.11%
Proposed2025VGG (Incorporating patient data)87.29%94.54%97.38%76.53%76.85%69.29%
Proposed2025CNN (Incorporating patient data)89.34%95.42%97.78%79.55%79.60%72.61%
Table 18. Binary classification (normal vs. one disease from the remaining 4 superclasses) for the test set using the CNN model trained on the augmented ECG dataset.
| Metric | NORM and MI | NORM and STTC | NORM and CD | NORM and HYP |
| --- | --- | --- | --- | --- |
| Precision | 0.9499 | 0.9246 | 0.9251 | 0.9714 |
| Recall | 0.9499 | 0.9246 | 0.9251 | 0.9714 |
| F1_score | 0.9175 | 0.8977 | 0.8874 | 0.8846 |
| Accuracy | 0.9499 | 0.9246 | 0.9251 | 0.9714 |
| Training time (s) | 294 | 235 | 257 | 274 |

Average accuracy: 0.9428
Table 19. Binary classification (normal vs. one disease from the 9 subclasses) for the test set using the CNN model trained on the augmented ECG dataset.
| Metric | NORM and STTC | NORM and AMI | NORM and IMI | NORM and LAFB/LPFB | NORM and LVH | NORM and IRBBB | NORM and CLBBB | NORM and ISCA | NORM and CRBBB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.9267 | 0.9883 | 0.9725 | 0.9711 | 0.9763 | 0.9641 | 1.0000 | 0.9926 | 1.0000 |
| Recall | 0.9267 | 0.9883 | 0.9725 | 0.9711 | 0.9763 | 0.9641 | 1.0000 | 0.9926 | 1.0000 |
| F1_score | 0.8669 | 0.9728 | 0.9035 | 0.8972 | 0.9159 | 0.8693 | 0.9839 | 0.9060 | 1.0000 |
| Accuracy | 0.9267 | 0.9883 | 0.9725 | 0.9711 | 0.9763 | 0.9641 | 1.0000 | 0.9926 | 1.0000 |
| Training time (s) | 277 | 273 | 278 | 294 | 266 | 297 | 381 | 276 | 288 |

Average accuracy: 0.9768
Table 20. Binary classification (normal vs. abnormal) for the test set using the CNN model trained on the augmented ECG dataset.
| Precision | Recall | F1_score | Accuracy | Training time (s) |
| --- | --- | --- | --- | --- |
| 0.8893 | 0.8893 | 0.8820 | 0.8893 | 322 |
Table 21. Multiclass classification results for the test set using the CNN model trained on the augmented ECG dataset.
| Metric | 5 Superclasses | 10 Subclasses | 15 Subclasses |
| --- | --- | --- | --- |
| Precision | 0.7732 | 0.7906 | 0.7220 |
| Recall | 0.7666 | 0.7867 | 0.7144 |
| F1_score | 0.6460 | 0.6499 | 0.4370 |
| Accuracy | 0.7689 | 0.7889 | 0.7147 |
| Training time (s) | 631 | 1294 | 1868 |
Table 22. CNN accuracy results with and without data augmentation for the test set.
| | Binary Classification | | | Multiclass Classification | | |
| --- | --- | --- | --- | --- | --- | --- |
| Dataset | Binary (Normal vs. Abnormal) | Normal vs. One of 5 Superclasses | Normal vs. One of 10 Subclasses | 5 Classes | 10 Classes | 15 Classes |
| PTB-XL dataset without augmentation | 0.8934 | 0.9542 | 0.9778 | 0.7955 | 0.7960 | 0.7261 |
| PTB-XL dataset with augmentation | 0.8893 | 0.9428 | 0.9768 | 0.7689 | 0.7889 | 0.7147 |
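To make the comparison in Table 22 easier to read, the accuracy change caused by augmentation can be expressed in percentage points per task. The sketch below only restates the table's values; the task labels are shortened, hypothetical names.

```python
# Accuracy values copied from Table 22; task labels are shortened for display.
tasks = ["normal vs. abnormal", "normal vs. one superclass",
         "normal vs. one subclass", "5 classes", "10 classes", "15 classes"]
without_aug = [0.8934, 0.9542, 0.9778, 0.7955, 0.7960, 0.7261]
with_aug = [0.8893, 0.9428, 0.9768, 0.7689, 0.7889, 0.7147]

# Positive values mean the model trained WITHOUT augmentation scored higher.
deltas = {
    task: round((plain - aug) * 100, 2)
    for task, plain, aug in zip(tasks, without_aug, with_aug)
}

for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f} percentage points")
```

Every difference is positive, so in these experiments the model trained without augmentation was at least as accurate on the test set in all six tasks, with the largest gap in the 5-class setting.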
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Atwa, A.E.M.; Atlam, E.-S.; Ahmed, A.; Atwa, M.A.; Abdelrahim, E.M.; Siam, A.I. Interpretable Deep Learning Models for Arrhythmia Classification Based on ECG Signals Using PTB-X Dataset. Diagnostics 2025, 15, 1950. https://doi.org/10.3390/diagnostics15151950