After preprocessing, each heartbeat segment is transformed into a fixed-length representation suitable for neural network training. These processed signals are then used as inputs to the proposed deep learning models. In this study, three neural network architectures, ANN, CNN, and ResNet, are implemented and evaluated under the same experimental conditions. Furthermore, a fine-tuned CNN architecture is proposed to improve classification performance by optimizing kernel sizes, regularization mechanisms, and loss functions. The overall workflow of the proposed system, including signal preprocessing, segmentation, and model training, is illustrated in the following figures.
3.1. Dataset and Preprocessing
The MIT-BIH Arrhythmia Database serves as the foundational data source for this study [29,35]. This extensively validated benchmark dataset contains 48 half-hour excerpts of two-channel ambulatory ECG recordings obtained from 47 subjects studied at the BIH Arrhythmia Laboratory between 1975 and 1979. The excerpts were drawn from approximately 4000 24-h ambulatory ECG recordings acquired from a mixed population comprising 60% inpatients and 40% outpatients at Boston’s Beth Israel Hospital. From this collection, 23 recordings were selected randomly, while the remaining 25 were specifically chosen to include less common but clinically significant arrhythmias that would otherwise be underrepresented in a purely random sample. Each recording was digitized at 360 samples per second per channel with 11-bit resolution across a 10-mV range. The database includes computer-readable reference annotations for approximately 110,000 individual heartbeats, with each record independently annotated by at least two cardiologists and all disagreements resolved through consensus. All 48 recordings and their reference annotation files are freely accessible online through PhysioNet, which has hosted the database since its launch in September 1999.
Although MIT-BIH is one of the earliest publicly available ECG datasets, it remains the most widely adopted benchmark for arrhythmia classification and enables direct comparison with a large body of prior studies. Its expert annotations, standardized protocol, and extensive use in the literature make it a suitable benchmark for methodological evaluation. However, external validation on contemporary datasets such as PTB-XL and PhysioNet Challenge datasets remains an important direction for future work.
The preprocessing pipeline transforms raw ECG signals into a format suitable for deep learning classification while preserving clinically relevant morphological features. Each heartbeat is isolated through segmentation into fixed-length windows of 187 time steps, a duration selected to encompass the complete P-QRS-T complex while maintaining computational efficiency. The raw ECG signals undergo band-pass filtering between 0.5 Hz and 45 Hz using a Butterworth filter to eliminate baseline wander caused by patient respiration and high-frequency noise from muscular activity or powerline interference. Following filtering, Z-score standardization is applied per patient channel to ensure inter-subject consistency: each sample is normalized by subtracting the channel mean and dividing by the channel standard deviation, which preserves the relative amplitude relationships critical for arrhythmia discrimination, as shown in Figure 2 and Figure 3.
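The filtering and standardization steps described above can be sketched as follows. This is a minimal illustration rather than the exact implementation used in the study: the filter order (4), the zero-phase `sosfiltfilt` application, and the helper names `bandpass_filter` and `zscore` are assumptions, since only the pass band (0.5–45 Hz) and the filter family (Butterworth) are stated.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 360  # MIT-BIH sampling rate in Hz

def bandpass_filter(signal, low_hz=0.5, high_hz=45.0, order=4, fs=FS):
    """Zero-phase Butterworth band-pass filter (the order is an assumed value)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def zscore(signal, eps=1e-8):
    """Standardize one channel: subtract its mean and divide by its standard deviation."""
    return (signal - signal.mean()) / (signal.std() + eps)

# Synthetic check: a 10 Hz component plus slow drift (0.1 Hz) and 60 Hz interference
t = np.arange(10 * FS) / FS
raw = (np.sin(2 * np.pi * 10 * t)
       + 0.8 * np.sin(2 * np.pi * 0.1 * t)
       + 0.3 * np.sin(2 * np.pi * 60 * t))
clean = zscore(bandpass_filter(raw))
```

Second-order sections are used here because they tend to be numerically more stable than transfer-function coefficients when the lower cut-off is very small relative to the sampling rate.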
ECG heartbeat segmentation was performed using the annotated R-peak locations provided by the MIT-BIH database. For each detected heartbeat, a fixed-length window of 187 samples was extracted using an R-peak-centered strategy. Specifically, 93 samples preceding and 94 samples following the R-peak were selected to preserve the complete local heartbeat morphology. This segmentation approach ensures inclusion of the P-wave, QRS complex, and T-wave while maintaining a standardized input size across all samples. The resulting heartbeat segments were subsequently used for preprocessing, RR interval extraction, and FT-CNN feature learning.
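The R-peak-centered windowing can be illustrated with a short sketch. It assumes that the R-peak sample itself is counted among the 94 samples "following" the peak, so that the window is `signal[r - 93 : r + 94]` and contains exactly 187 samples; the function name `extract_beats` and the rule of skipping beats too close to the record edges are likewise assumptions.

```python
import numpy as np

PRE, POST = 93, 94  # samples before the R-peak, and from the R-peak onward (93 + 94 = 187)

def extract_beats(signal, r_peaks, pre=PRE, post=POST):
    """Return an array of shape (n_beats, pre + post) and the R-peak indices that were kept."""
    segments, kept = [], []
    for r in r_peaks:
        start, stop = r - pre, r + post
        if start < 0 or stop > len(signal):
            continue  # skip beats whose window would run past the record edges
        segments.append(signal[start:stop])
        kept.append(r)
    return np.asarray(segments), np.asarray(kept)

# Toy signal whose sample values equal their indices, to make the alignment visible
signal = np.arange(1000, dtype=float)
beats, kept = extract_beats(signal, r_peaks=[50, 300, 600, 950])
```

With this convention the R-peak always sits at column 93 of each extracted segment.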
Class imbalance, a persistent challenge in arrhythmia classification due to the inherent rarity of certain cardiac conditions, is addressed through systematic data augmentation. The dataset exhibits significant skew, with normal beats (Class 0) substantially outnumbering ectopic and fusion beats (Classes 1 through 4). To mitigate this, stratified resampling is performed by computing the class distribution and generating synthetic samples through controlled perturbation of existing minority class instances. The augmentation strategy employs temporal warping and amplitude scaling within physiologically plausible bounds, ensuring that artificially generated samples maintain clinical validity while expanding the representation of underrepresented classes.
Amplitude scaling was applied with a factor uniformly sampled from 0.9 to 1.1, while temporal warping used a random stretch/compression factor α ∈ [0.8, 1.2], applied to the time axis of the heartbeat segment. Both operations were constrained to preserve the clinical morphology; any augmented sample with an R-peak shift beyond ±5 samples or a QRS amplitude change exceeding ±20% was rejected. After augmentation, the training set contained approximately 50,000 samples with a balanced class distribution (around 10,000 per class).
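A possible form of this augmentation with rejection is sketched below. Several details are assumptions made for illustration: the time axis is warped about the R-peak position, the beat maximum is used as a proxy for QRS amplitude (which presumes a positive R-wave), and the helper name `augment_beat` and the retry limit are hypothetical.

```python
import numpy as np

def augment_beat(beat, rng, r_index=93, max_shift=5, max_amp_change=0.20, max_tries=20):
    """Return an augmented copy of `beat`, or None if every candidate is rejected."""
    positions = np.arange(len(beat))
    for _ in range(max_tries):
        scale = rng.uniform(0.9, 1.1)  # amplitude scaling factor
        alpha = rng.uniform(0.8, 1.2)  # temporal stretch (>1) or compression (<1)
        # Warp the time axis about the R-peak; np.interp holds the edge values
        # wherever the warped axis extends beyond the original segment.
        source = r_index + (positions - r_index) / alpha
        warped = scale * np.interp(source, positions, beat)
        shift = abs(int(np.argmax(warped)) - r_index)
        amp_change = abs(warped.max() - beat.max()) / abs(beat.max())
        if shift <= max_shift and amp_change <= max_amp_change:
            return warped
    return None

rng = np.random.default_rng(0)
t = np.arange(187)
beat = np.exp(-0.5 * ((t - 93) / 4.0) ** 2)  # toy beat: Gaussian "R-wave" at index 93
aug = augment_beat(beat, rng)
```

Because the warp is anchored at the R-peak, the shift check rarely triggers in this sketch; it would matter more for a warp applied from the start of the segment.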
The augmented data undergo verification to prevent duplication and ensure that synthetic samples do not deviate beyond acceptable morphological ranges. Following augmentation, the training data have a balanced class distribution, enabling unbiased model training without the need for compensatory class weighting during optimization. All features are extracted and the corresponding labels are one-hot encoded, producing a final dataset partitioned into training (80%), validation (10%), and test (10%) sets while maintaining patient-wise separation to prevent data leakage between partitions. The preprocessing pipeline concludes with dimensionality verification and statistical summary generation to confirm that all heartbeat segments meet the input requirements for subsequent neural network processing.
In addition to morphological information extracted from ECG heartbeat segments, temporal heartbeat dynamics were incorporated through RR interval-based features. These descriptors provide clinically relevant information regarding beat-to-beat variability and rhythm irregularity, which may not be fully captured by waveform morphology alone. R-peaks were identified using the MIT-BIH reference annotations. Based on the R-peak locations, four dynamic RR features were computed, as shown in Equations (1)–(5), where $t_i$ denotes the R-peak time of the $i$-th beat:
- a. Pre-RR (RRpre)
Interval between the current beat and the immediately preceding beat.
$$\mathrm{RR}_{\mathrm{pre}}(i) = t_i - t_{i-1} \tag{1}$$
- b. Post-RR (RRpost)
Interval between the current beat and the immediately subsequent beat.
$$\mathrm{RR}_{\mathrm{post}}(i) = t_{i+1} - t_i \tag{2}$$
- c. Local-RR (RRlocal)
Local mean RR interval, where N denotes the number of neighboring beats within a local temporal window and $\mathrm{RR}_k$ are their RR intervals.
$$\mathrm{RR}_{\mathrm{local}}(i) = \frac{1}{N}\sum_{k=1}^{N} \mathrm{RR}_k \tag{3}$$
- d. Ratio-RR (RRratio)
Ratio capturing deviations of the current beat interval from local rhythm behavior.
$$\mathrm{RR}_{\mathrm{ratio}}(i) = \frac{\mathrm{RR}_{\mathrm{pre}}(i)}{\mathrm{RR}_{\mathrm{local}}(i)} \tag{4}$$
All four RR features were normalized before integration into the classification architecture using Z-score standardization:
$$z = \frac{x - \mu}{\sigma} \tag{5}$$
where $x$ is the raw feature value, $\mu$ is the mean, and $\sigma$ is the standard deviation of the respective feature computed across the training set.
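The four RR descriptors and their normalization can be sketched as below. The local window is not fully specified in the text, so this example assumes a trailing window over the last `n_local` pre-RR intervals and takes the ratio as pre-RR divided by local-RR; the function names and the window length are hypothetical.

```python
import numpy as np

def rr_features(r_peaks, fs=360, n_local=10):
    """Pre-, post-, local- and ratio-RR (in seconds) for beats that have both neighbours."""
    times = np.asarray(r_peaks, dtype=float) / fs
    rr = np.diff(times)            # rr[k] = times[k + 1] - times[k]
    pre, post = rr[:-1], rr[1:]    # rows correspond to beats 1 .. len(r_peaks) - 2
    # Assumed local window: mean of the most recent n_local pre-RR intervals
    local = np.array([pre[max(0, i - n_local + 1): i + 1].mean() for i in range(len(pre))])
    ratio = pre / local
    return np.column_stack([pre, post, local, ratio])

def zscore_fit_apply(train, other):
    """Standardize both arrays with the mean and standard deviation of the training set."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (other - mu) / sigma

# R-peaks at 0, 1, 2, 3, 3.5 and 5 s: the fifth beat arrives early
feats = rr_features([0, 360, 720, 1080, 1260, 1800], n_local=3)
train_z, _ = zscore_fit_apply(feats, feats)
```

For the premature beat in the toy example, the ratio falls below 1 because its pre-RR interval is shorter than the local mean.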
A rigorous patient-wise separation was enforced throughout the experimental pipeline to avoid intra-patient data leakage. All heartbeats originating from a single patient were assigned entirely to one of the training, validation, or test partitions, never split across them. Consequently, the model never encounters heartbeats from the same individual during both training and evaluation, which yields an inter-patient test scenario. This approach follows the inter-patient paradigm and is critical for obtaining unbiased performance estimates. Leave-One-Patient-Out (LOPO) cross-validation was employed for the final generalization assessment: for each fold, all heartbeats of one patient served as the test set, and the remaining 46 patients constituted the training/validation set. This protocol was strictly maintained for all models compared in this work, including the fourteen benchmark algorithms. Enforcing identical patient-based splits guards against information leakage across partitions, so that the reported performance metrics reflect the generalization capability of each model.
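As an illustration of the leave-one-patient-out protocol, the following sketch generates one fold per patient from a per-beat array of patient identifiers. The function name and the toy identifiers are hypothetical; the point is only that every beat of the held-out patient lands in the test fold.

```python
import numpy as np

def leave_one_patient_out(beat_patient_ids):
    """Yield (patient, train_indices, test_indices), one fold per unique patient."""
    ids = np.asarray(beat_patient_ids)
    for patient in np.unique(ids):
        test_mask = ids == patient
        yield patient, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Six beats from three patients
beat_ids = [100, 100, 101, 103, 103, 103]
folds = list(leave_one_patient_out(beat_ids))
```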
For the main experiments, the patient-wise split followed the widely used inter-patient division proposed by de Chazal et al. Specifically, records 101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122, 124, 201, 203, 205, 207, 208, 209, 215, 220, 223, 230 (22 records) were used for training, while records 100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210, 212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234 (22 records) constituted the test set. The four records containing paced beats (102, 104, 107, and 217) are excluded from this division, which is why it covers 44 of the 48 records. Within the training partition, 10% of patients were held out for validation. This split ensures that no record appears in both the training and test sets.
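The record-level split listed above can be written down directly. In this sketch the record lists are taken from the text, while the random seed, the rounding rule for the 10% validation hold-out, and the helper name `split_train_val` are assumptions.

```python
import random

DS1_TRAIN = [101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122,
             124, 201, 203, 205, 207, 208, 209, 215, 220, 223, 230]
DS2_TEST = [100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210,
            212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234]

def split_train_val(train_records, val_fraction=0.10, seed=42):
    """Hold out a fraction of the training records (not beats) for validation."""
    records = list(train_records)
    random.Random(seed).shuffle(records)
    n_val = max(1, round(val_fraction * len(records)))
    return sorted(records[n_val:]), sorted(records[:n_val])

train_records, val_records = split_train_val(DS1_TRAIN)
```

Splitting at the record level, before any beat extraction or augmentation, is what keeps beats from one recording out of multiple partitions.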
3.1.1. AAMI Standard Classification and Mapping Procedure
The MIT-BIH Arrhythmia Database provides annotations for multiple heartbeat categories. To establish a clinically meaningful and statistically tractable multiclass classification framework, the original heartbeat annotations were mapped into the five heartbeat categories recommended by the Association for the Advancement of Medical Instrumentation (AAMI) EC57:1998 standard [42]. This standard grouping consolidates heartbeat types with similar physiological characteristics and has become the accepted protocol in ECG arrhythmia classification studies.
The mapping strategy employed in this work is summarized in Table 1. By aggregating related heartbeat categories into standardized AAMI classes, the classification task becomes more clinically interpretable while reducing sparsity issues associated with rare beat categories.
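For illustration, the symbol-to-class mapping commonly used with the AAMI grouping is sketched below. It is assumed, not confirmed, that Table 1 follows this conventional assignment; the class indices 0 to 4 follow the order normal, supraventricular ectopic, ventricular ectopic, fusion, and unknown used elsewhere in this work, and the function name is hypothetical.

```python
# Conventional grouping of MIT-BIH beat symbols into the five AAMI classes
AAMI_CLASSES = {
    "N": ["N", "L", "R", "e", "j"],  # normal and bundle branch block / escape beats
    "S": ["A", "a", "J", "S"],       # supraventricular ectopic beats
    "V": ["V", "E"],                 # ventricular ectopic beats
    "F": ["F"],                      # fusion beats
    "Q": ["/", "f", "Q"],            # paced, fusion of paced and normal, unclassifiable
}
CLASS_TO_INDEX = {"N": 0, "S": 1, "V": 2, "F": 3, "Q": 4}
SYMBOL_TO_INDEX = {
    symbol: CLASS_TO_INDEX[name]
    for name, symbols in AAMI_CLASSES.items()
    for symbol in symbols
}

def map_annotations(symbols):
    """Return (position, class_index) pairs, skipping non-beat annotations such as '+' or '~'."""
    return [(i, SYMBOL_TO_INDEX[s]) for i, s in enumerate(symbols) if s in SYMBOL_TO_INDEX]

mapped = map_annotations(["N", "+", "V", "A", "/", "F", "~"])
```

Returning the positions alongside the class indices keeps the labels aligned with the R-peak locations after non-beat annotations are dropped.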
After the mapping process, the dataset consisted of approximately 90,600 N-type, 7600 S-type, 6960 V-type, 1430 F-type, and 3410 Q-type beats prior to dataset partitioning. The support values reported in the classification results presented in Section 4 correspond to the test-set distribution after patient-wise data splitting and augmentation procedures. Consequently, these values are smaller than the overall dataset counts because the test partition represents approximately 10% of the patient population.
To address the substantial class imbalance inherent in ECG datasets, stratified resampling and synthetic augmentation techniques were applied exclusively to the training partition. Specifically, temporal warping and amplitude scaling were performed within physiologically plausible limits to increase representation of minority classes while preserving clinically meaningful waveform morphology. This preprocessing strategy improves class balance and ensures that underrepresented arrhythmia categories contribute adequately during model training.
3.1.2. Signal Processing Window Duration Justification
Each heartbeat is isolated into a fixed-length window of 187 time steps, corresponding to approximately 519 ms at the MIT-BIH sampling rate of 360 Hz (187/360 ≈ 0.519 s). This duration was selected to ensure complete capture of clinically relevant ECG morphology. Standard cardiac intervals indicate that the P-wave typically spans 100–120 ms, the PR interval ranges from 120–200 ms, the QRS complex lasts 80–120 ms, and the QT interval extends up to 440 ms. Therefore, a 519 ms window centered on the R-peak provides sufficient coverage of the complete P–QRS–T complex while maintaining computational efficiency, as shown in Table 2.
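The window arithmetic can be checked in a few lines. The split into 93 samples before the R-peak and 94 samples from the R-peak onward follows the segmentation described earlier; treating the R-peak sample as part of the second group is an assumption.

```python
FS = 360            # sampling rate in Hz
WINDOW = 187        # samples per heartbeat segment
PRE, POST = 93, 94  # samples before the R-peak, and from the R-peak onward

window_ms = 1000 * WINDOW / FS
pre_ms = 1000 * PRE / FS
post_ms = 1000 * POST / FS
print(f"window: {window_ms:.1f} ms, before R: {pre_ms:.1f} ms, from R onward: {post_ms:.1f} ms")
```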
Empirical evaluation on the MIT-BIH dataset showed that approximately 99.2% of aligned beats retained complete waveform morphology within the selected window. Since heartbeat segmentation was performed around detected R-peaks, the same annotations were also used to extract RR interval-based temporal features (previous-RR, post-RR, local-RR, and ratio-RR), enabling integration of rhythm dynamics alongside morphological information.
3.2. Proposed Architecture
The proposed architecture introduces a custom-designed Fine-Tuned Convolutional Neural Network (FT-CNN) specifically optimized for multi-class ECG arrhythmia classification. Unlike standard CNN implementations that employ generic layer configurations, this architecture applies domain-aware design principles to enhance feature extraction from cardiac signals. The FT-CNN incorporates kernel sizes chosen for ECG morphologies, pooling mechanisms that preserve critical temporal features, regularization and loss design that address class imbalance, and progressive feature extraction through hierarchical learning. The input layer receives ECG signals from the preprocessing pipeline, which maintains morphological integrity while normalizing inter-patient variability, as a 1D temporal representation of 187 time steps that retains the complete cardiac cycle information essential for accurate arrhythmia detection.
The architecture comprises three custom-designed convolutional blocks, each with progressively complex feature extraction capabilities. The first block focuses on low-level feature extraction using 32 filters with a kernel size of 5, optimized for detecting P-waves and QRS complexes, followed by batch normalization with learnable parameters, ReLU activation with custom initialization, and max pooling with a stride of 2 that halves the temporal resolution. The second block targets mid-level pattern recognition with 64 filters of kernel size 3 to capture inter-wave relationships, incorporating spatial dropout at a rate of 0.25 to prevent co-adaptation, Parametric ReLU (PReLU) for adaptive negative slope learning, and average pooling with stride 2 to smooth feature maps. The third block performs high-level feature synthesis using 128 filters with kernel size 3 for holistic pattern integration, batch normalization with gamma regularization, Leaky ReLU (α = 0.01) to prevent dead neurons, and global average pooling for dimensionality reduction while preserving features.
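To make the layer dimensions concrete, the following sketch traces the temporal length and channel count through the three blocks. The padding mode and pooling window are not stated in the text, so unpadded ("valid") convolutions with stride 1 and a pooling window of 2 are assumed; the helper names are hypothetical.

```python
def conv1d_length(length, kernel, padding="valid"):
    """Output length of a stride-1 1D convolution."""
    return length if padding == "same" else length - kernel + 1

def pool1d_length(length, pool=2, stride=2):
    """Output length of unpadded 1D pooling."""
    return (length - pool) // stride + 1

def trace_shapes(input_length=187, padding="valid"):
    """Return (layer name, time steps, channels) for each stage of the morphological branch."""
    shapes = [("input", input_length, 1)]
    n = conv1d_length(input_length, 5, padding)
    shapes.append(("block 1 conv (32 filters, k=5)", n, 32))
    n = pool1d_length(n)
    shapes.append(("block 1 max pool", n, 32))
    n = conv1d_length(n, 3, padding)
    shapes.append(("block 2 conv (64 filters, k=3)", n, 64))
    n = pool1d_length(n)
    shapes.append(("block 2 average pool", n, 64))
    n = conv1d_length(n, 3, padding)
    shapes.append(("block 3 conv (128 filters, k=3)", n, 128))
    shapes.append(("global average pool", 1, 128))
    return shapes

for name, steps, channels in trace_shapes():
    print(f"{name:32s} -> ({steps}, {channels})")
```

Under these assumptions the morphological branch ends in a 128-dimensional feature vector, one value per filter of the third block.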
The FT-CNN distinguishes itself through four key fine-tuning mechanisms. First, dynamic learning rate scheduling employing cosine annealing with warm restarts helps the model escape poor local minima and converge to better solutions. Second, class-aware weight initialization modifies He initialization with class distribution priors to ensure balanced learning across minority arrhythmia classes. Third, a regularization stack combines L2 regularization (1 × 10⁻⁴), dropout (0.5 in fully connected layers), and early stopping with patience monitoring to prevent overfitting despite limited training samples for certain arrhythmia types. Fourth, a custom loss function integrates weighted categorical cross-entropy with a focal loss component to address the significant class imbalance, which is particularly challenging given that classes 1 and 3 have only 556 and 162 samples, respectively, compared to 18,118 for class 0.
The transition from convolutional to fully connected layers employs a progressive dimensionality reduction strategy: flattened features pass through 256 neurons with batch normalization, then to 128 neurons with dropout 0.5, followed by 64 neurons with L2 regularization, and finally to 5 output classes using softmax with temperature scaling for improved calibration. The forward propagation through a custom convolutional block can be expressed as shown in Equation (6):
where BN represents batch normalization with learnable scale and shift parameters, and f is the adaptive activation function selected based on the block’s position in the network hierarchy. Training employs the Adam optimizer (β1 = 0.9 and β2 = 0.999) with an initial learning rate of 0.001, a batch size of 32, and early stopping with a patience of 15 epochs, monitoring validation loss for up to 100 epochs.
The proposed architecture qualifies as “fine-tuned” because each architectural decision was validated against ECG-specific requirements rather than adopted from generic computer vision solutions, with extensive hyperparameter optimization via grid search and Bayesian optimization used to select kernel sizes, layer depths, and regularization strengths. This approach achieved 98.51% accuracy, outperforming both standard CNN implementations (97.20%) and deeper ResNet architectures (96.88%) on identical datasets, with modifications specifically targeting improved recall for clinically significant but rare arrhythmia classes where standard models typically underperform, as shown in Figure 4.
The proposed FT-CNN adopts a dual-branch architecture consisting of a morphological feature extraction pathway and a temporal feature pathway. The morphological branch processes segmented ECG signals represented as one-dimensional sequences of 187 time steps using three convolutional blocks designed to capture hierarchical ECG characteristics. Parallel to this pathway, RR interval-based descriptors, including previous-RR, post-RR, local-RR, and ratio-RR, are extracted to represent temporal heartbeat dynamics.
The outputs from both branches are combined through feature fusion using vector concatenation. This integration enables the model to jointly exploit waveform morphology and rhythm information before the final classification stage. Feature fusion is performed after global average pooling. Let $F_{\mathrm{CNN}}$ denote the convolutional feature representation produced by the morphological branch, and let $F_{\mathrm{RR}}$ denote the normalized RR feature vector. The fused representation is defined as shown in Equation (7):
$$F_{\mathrm{fused}} = [\,F_{\mathrm{CNN}} \,;\, F_{\mathrm{RR}}\,] \tag{7}$$
where [ ; ] denotes vector concatenation.
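The fusion step amounts to a concatenation along the feature axis, as the short sketch below shows. The 128-dimensional morphological vector is an assumption derived from the 128 filters of the third block followed by global average pooling; the four RR features are those listed above, and the function name `fuse` is hypothetical.

```python
import numpy as np

def fuse(f_cnn, f_rr):
    """Concatenate morphological and RR feature vectors along the last axis."""
    return np.concatenate([f_cnn, f_rr], axis=-1)

batch = 8
rng = np.random.default_rng(0)
f_cnn = rng.normal(size=(batch, 128))  # pooled convolutional features (assumed width)
f_rr = rng.normal(size=(batch, 4))     # normalized pre-, post-, local-, ratio-RR
fused = fuse(f_cnn, f_rr)              # shape (8, 132)
```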
System and Software Requirements
The experiment utilized GPU acceleration, specifically NVIDIA and AMD GPUs, to enhance the training of ResNet, a large neural network architecture. The CPU played a crucial role in managing system performance and data preprocessing tasks. Adequate RAM was essential, influenced by the dataset size, model parameters, and batch size. The experiment employed the Keras library, a high-level API running on TensorFlow, for model development, training, and evaluation. Google Colab, a cloud-based platform with free GPU and TPU access, served as the development environment. It seamlessly integrated with Google Drive and facilitated collaborative coding in Jupyter Notebooks.
The model’s compact size and fast inference support its potential for deployment on resource-constrained edge devices, though dedicated embedded optimization (e.g., int8 quantization) remains future work.
All code was written in Python 3.13.7, with Jupyter Notebooks providing an interactive, step-by-step coding environment. The experiment relied on several Python libraries, including NumPy, Matplotlib, Seaborn, and Scikit-learn, for numerical operations, data visualization, and metrics computation. Together, this hardware and software infrastructure and the chosen deep learning frameworks enabled the implementation and training of the ResNet model for ECG signal classification. The cloud-based Google Colab environment provided GPU access without the need for high-end local hardware, making the setup accessible and scalable for research and experimentation.
3.3. Hyperparameter Settings and Evaluation Measures
Hyperparameter selection plays a pivotal role in optimizing neural network performance, particularly for complex tasks such as ECG arrhythmia classification. This section details the hyperparameters employed for the proposed fine-tuned CNN (FT-CNN) and outlines the consistent configuration applied to benchmark models (ANN, ResNet, LSTM, etc.) to ensure fair and reproducible comparisons.
The FT-CNN was trained using a carefully tuned set of hyperparameters, selected through systematic grid search and Bayesian optimization to maximize classification accuracy while minimizing overfitting. A dynamic learning rate schedule was implemented, starting with an initial learning rate of 0.001 for the Adam optimizer, combined with cosine annealing decay and warm restarts that gradually reduced the learning rate to 1 × 10⁻⁶ over 50 epochs before restarting, thereby promoting exploration in early epochs and fine-tuning in later stages. The Adam optimizer itself, configured with default parameters β1 = 0.9, β2 = 0.999, and ε = 1 × 10⁻⁷, was chosen for its ability to combine adaptive gradient algorithms with momentum, making it particularly well-suited for the high-dimensional parameter spaces and noisy gradients characteristic of ECG signal processing. The network architecture was designed to accept input tensors of shape (187, 1), preserving the temporal structure of 187 time-step heartbeat segments while remaining computationally manageable, and produced outputs through a 5-neuron softmax layer corresponding to the five arrhythmia classes: normal, supraventricular ectopic, ventricular ectopic, fusion, and unknown beats.
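The learning rate schedule described above can be sketched as a per-epoch function. This version assumes a fixed cycle length of 50 epochs with no cycle lengthening, and it scales the cosine so that the final epoch of each cycle reaches the minimum rate exactly; both choices, and the function name, are assumptions.

```python
import math

def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=1e-6, period=50):
    """Learning rate for a given epoch under cosine annealing with warm restarts."""
    t = epoch % period  # position inside the current cycle
    # Dividing by (period - 1) makes the last epoch of each cycle hit lr_min
    cosine = math.cos(math.pi * t / (period - 1))
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

schedule = [cosine_annealing_lr(e) for e in range(100)]
```

In Keras, a function of this form could be passed to a `LearningRateScheduler` callback, which calls it with the epoch index at the start of every epoch.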
Activation functions were strategically selected throughout the network, with Rectified Linear Unit (ReLU) used in convolutional and dense hidden layers to introduce non-linearity and mitigate vanishing gradients, while Parametric ReLU (PReLU) and Leaky ReLU (α = 0.01) were employed in specific blocks to enable adaptive negative slope learning and prevent dead neurons. A multi-faceted regularization strategy was adopted to enhance generalization, incorporating L2 regularization (weight decay) with a factor of 1 × 10⁻⁴ applied to all convolutional and dense layers, dropout at a rate of 0.5 after fully connected layers, spatial dropout at a rate of 0.25 after the second convolutional block, and batch normalization after each convolutional layer to normalize activations and provide additional regularization.
Early stopping with a patience of 15 epochs monitored validation loss to prevent overfitting, typically concluding training around 60–70 epochs despite a maximum setting of 100 epochs. A batch size of 32 was selected as an optimal trade-off between gradient stability and computational efficiency, ensuring frequent weight updates for rapid convergence while maintaining sufficient sample diversity per iteration. To address significant class imbalance, a custom loss function combining weighted categorical cross-entropy with a focal loss component (γ = 2.0) was employed, with class weights computed inversely proportional to class frequencies to ensure minority classes contributed proportionally to gradient updates.
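One way to combine class weighting with a focal term is the class-weighted focal loss sketched below in NumPy. The text does not specify exactly how the weighted cross-entropy and the focal component are combined, so this form (weights multiplied into the focal term, with weights normalized as total / (classes × count)) is an assumption, as are the function names.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Class weights inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1.0))

def weighted_focal_loss(y_true, probs, class_weights, gamma=2.0, eps=1e-7):
    """Mean class-weighted focal loss.

    y_true: integer labels of shape (n,); probs: softmax outputs of shape (n, C).
    """
    p_t = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    weights = class_weights[y_true]
    # (1 - p_t) ** gamma shrinks the contribution of confidently correct samples
    return float(np.mean(-weights * (1.0 - p_t) ** gamma * np.log(p_t)))

y = np.array([0, 1])
p = np.array([[0.9, 0.1], [0.4, 0.6]])
ones = np.ones(2)
ce = weighted_focal_loss(y, p, ones, gamma=0.0)     # reduces to plain cross-entropy
focal = weighted_focal_loss(y, p, ones, gamma=2.0)  # easy sample is down-weighted
w = inverse_frequency_weights(np.array([0, 0, 0, 1]), n_classes=2)
```

Setting `gamma=0` recovers weighted categorical cross-entropy, which makes the focal exponent easy to ablate.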
Finally, weight initialization followed He normal initialization for layers with ReLU activation and Glorot uniform initialization for the output layer, accelerating convergence by maintaining appropriate activation variances throughout the network. Together, these hyperparameter choices contributed to the FT-CNN’s classification performance.
3.3.1. Benchmark Models Configuration
To ensure a fair and unbiased comparison, the same core hyperparameters were applied across all benchmark models wherever architecturally feasible, with specific guidelines followed for models possessing inherent architectural differences. All deep learning models, including ANN, ResNet, LSTM, GRU, and 1D-CNN, were trained using the Adam optimizer with an initial learning rate of 0.001, while traditional machine learning models such as Logistic Regression, SVM, and Random Forest employed default scikit-learn hyperparameters unless otherwise noted, reflecting their distinct optimization frameworks.
Input and output dimensions were standardized across all models, with each configured to accept the same input shape of 187 features and produce five-class outputs, ensuring consistency in the classification task. For neural network baselines, training was conducted for up to 50 epochs with a batch size of 32 (the same batch size as the FT-CNN), and early stopping with a patience of 10 epochs was applied to prevent overfitting. Regularization strategies were implemented where applicable, with baseline neural networks incorporating dropout at a rate of 0.5 after dense layers and batch normalization after convolutional layers; however, more advanced techniques such as focal loss and cosine annealing were deliberately excluded from benchmark models to isolate and evaluate their contribution to the FT-CNN’s performance gains.
Activation functions followed a uniform approach, with ReLU used in hidden layers of all neural network baselines and softmax in the output layer, while ResNet inherently maintained ReLU activations within its identity block structure per the original implementation. Class imbalance was addressed in benchmark models through the class_weight parameter in Keras for neural networks or by setting class_weight='balanced' in scikit-learn models, ensuring minority classes were not ignored during training, though this approach did not incorporate the focal loss mechanism unique to the FT-CNN.
The hyperparameter configuration for the FT-CNN was motivated by both theoretical considerations and empirical validation. The learning rate of 0.001 with cosine annealing balances rapid initial learning with fine-grained convergence, a strategy shown to improve generalization in deep networks. The combination of L2 regularization, dropout, and batch normalization addresses the risk of overfitting given the relatively small size of the minority classes.
The use of focal loss directly targets the class imbalance problem by down-weighting easy examples and focusing training on hard-to-classify samples, which is particularly beneficial for arrhythmia types with low representation. These choices collectively contributed to the FT-CNN achieving 98.51% accuracy, outperforming all benchmark models and demonstrating the efficacy of a carefully tuned architecture for ECG classification.
Hyperparameters were selected through progressive empirical optimization and ablation analysis rather than formal methods such as Taguchi or ANOVA designs. This choice was motivated by the highly nonlinear interactions between architectural parameters, optimization strategies, and ECG feature representations. However, structured optimization frameworks may provide additional insights and will be explored in future investigations.
3.3.2. Evaluation Protocol
To comprehensively assess the performance of the proposed fine-tuned CNN architecture for ECG arrhythmia classification, several standard evaluation metrics are employed. These metrics provide a quantitative basis for comparing model effectiveness and are derived from the confusion matrix, which records the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions for each class. True positive represents the number of samples correctly identified as belonging to a particular arrhythmia class, while true negative denotes samples correctly identified as not belonging to that class. A false positive indicates samples incorrectly identified as belonging to a class, and a false negative represents samples incorrectly identified as not belonging to a class.
Based on these four fundamental counts, the following evaluation metrics are calculated. Accuracy represents the overall correctness of the model and is defined as the ratio of correctly predicted samples to the total number of samples, providing a general measure of how well the model performs across all classes, as shown in Equation (8):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
Precision, also known as positive predictive value, measures the proportion of samples correctly classified as a specific arrhythmia class among all samples classified as that class. High precision indicates a low false positive rate, which is crucial for minimizing unnecessary clinical interventions, as shown in Equation (9):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$
Recall, also referred to as sensitivity or true positive rate, measures the proportion of actual positive samples that are correctly identified by the model. High recall is essential for detecting as many arrhythmia cases as possible, reducing the risk of missed diagnoses, as shown in Equation (10):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
F1-Score is the harmonic mean of precision and recall, providing a single balanced metric that considers both false positives and false negatives. It is particularly useful when dealing with imbalanced datasets, as it offers a more reliable measure of model performance than accuracy alone, as shown in Equation (11):
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$
Specificity, or true negative rate, measures the proportion of actual negative samples that are correctly identified. This metric is important for confirming that normal heartbeats are not incorrectly classified as arrhythmic, as shown in Equation (12):
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{12}$$
To evaluate overall model performance across all five arrhythmia classes, two averaging techniques are employed. Macro-average computes the metric independently for each class and then takes the average, treating all classes equally regardless of their sample size, as shown in Equation (13):
$$\mathrm{Macro\text{-}average} = \frac{1}{C}\sum_{i=1}^{C} \mathrm{Metric}_i \tag{13}$$
where $C$ is the number of classes, and $\mathrm{Metric}_i$ is the evaluation metric for class $i$. Weighted-average computes the average metric weighted by the number of samples in each class, accounting for class imbalance, as shown in Equation (14):
$$\mathrm{Weighted\text{-}average} = \frac{\sum_{i=1}^{C} \mathrm{Support}_i \cdot \mathrm{Metric}_i}{\sum_{i=1}^{C} \mathrm{Support}_i} \tag{14}$$
where $\mathrm{Support}_i$ represents the number of true samples for class $i$. These evaluation metrics are calculated for each of the five arrhythmia classes in the MIT-BIH dataset and reported in the classification tables presented in Section 4. The combination of these metrics provides a comprehensive understanding of each model’s strengths and limitations, particularly in handling imbalanced class distributions and identifying minority arrhythmia types. All metrics were computed on the held-out test set to ensure unbiased evaluation, with macro and weighted averages reported to account for the inherent class imbalance in ECG arrhythmia classification.
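The per-class metrics and the two averaging schemes can be computed from a confusion matrix as sketched below. The convention that rows are true classes and columns are predicted classes, the zero value returned for undefined ratios, and the function names are assumptions of this example.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class metrics from a confusion matrix with cm[i, j] = true class i, predicted class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn

    def safe_div(num, den):
        # Return 0 where the denominator is 0 (metric undefined for that class)
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
        "support": cm.sum(axis=1),
        "accuracy": tp.sum() / cm.sum(),
    }

def macro_average(values):
    return float(np.mean(values))

def weighted_average(values, support):
    return float(np.sum(values * support) / np.sum(support))

cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
m = per_class_metrics(cm)
macro_recall = macro_average(m["recall"])
weighted_recall = weighted_average(m["recall"], m["support"])
```

Note that the support-weighted average of recall equals overall accuracy, which is why macro averages are the more informative summary under class imbalance.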
To evaluate robustness under realistic acquisition conditions, Gaussian noise was injected directly into the raw one-dimensional ECG signals before preprocessing and heartbeat segmentation. The noisy signal was generated as shown in Equation (15):
$$\tilde{x}(t) = x(t) + n(t) \tag{15}$$
where $x(t)$ is the raw ECG signal and the additive noise term $n(t)$ is drawn from a zero-mean Gaussian distribution, as shown in Equation (16):
$$n(t) \sim \mathcal{N}(0, \sigma^2) \tag{16}$$
and $\sigma$ was adjusted to produce signal-to-noise ratios (SNRs) of 20 dB, 10 dB, and 5 dB. In addition, amplitude scaling (±10%) and temporal shifting (±10 samples) were applied to simulate sensor gain variations and minor R-peak localization errors. The perturbed signals subsequently underwent the identical processing pipeline, including filtering, heartbeat segmentation, RR feature extraction, feature fusion, and FT-CNN classification.
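The noise level for a target SNR follows from the signal power, as the sketch below illustrates. Signal power is taken here as the mean squared amplitude of the raw signal, which is an assumption (the variance after mean removal is another common choice); the function name is hypothetical.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise scaled so that the result has the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
t = np.arange(36000) / 360.0
x = np.sin(2 * np.pi * 1.2 * t)  # toy periodic signal standing in for an ECG trace
noisy = {snr: add_gaussian_noise(x, snr, rng) for snr in (20, 10, 5)}

# Empirical SNR of each perturbed signal, for comparison with the target
measured = {
    snr: 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
    for snr, y in noisy.items()
}
```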
Beyond these standard metrics, the proposed FT-CNN was further evaluated using leave-one-patient-out cross-validation to assess subject-wise generalization, ablation studies to quantify the contribution of each hyperparameter component, robustness analysis under signal perturbations (Gaussian noise, amplitude scaling, temporal shifting), and explainability evaluation using Grad-CAM and Integrated Gradients with faithfulness metrics. Statistical reliability was assessed via 95% confidence intervals (bootstrap resampling) and the McNemar test for prediction consistency.
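The two statistical checks can be sketched as follows. The percentile bootstrap over beats, the number of resamples, and the continuity-corrected chi-square form of the McNemar test are assumptions, since the text names the methods without giving these details; the function names are hypothetical.

```python
import math
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    accuracies = correct[idx].mean(axis=1)
    low, high = np.quantile(accuracies, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar test with continuity correction; returns (statistic, p_value)."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))  # model A correct, model B wrong
    c = int(np.sum(~a_ok & b_ok))  # model A wrong, model B correct
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

y_true = np.zeros(100, dtype=int)
pred_a = y_true.copy(); pred_a[:5] = 1     # model A: 5 errors
pred_b = y_true.copy(); pred_b[3:23] = 1   # model B: 20 errors, 2 shared with A
ci_low, ci_high = bootstrap_accuracy_ci(y_true, pred_a)
stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
```

Under a patient-wise protocol, resampling whole patients rather than individual beats would give more conservative intervals; the beat-level version is shown only for brevity.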