1. Introduction
The electrocardiogram (ECG) is one of the most widely used non-invasive diagnostic tools for monitoring heart activity and detecting cardiac abnormalities, including arrhythmias. Early and accurate classification of normal and abnormal heartbeats is essential for timely medical intervention and effective treatment. However, analyzing ECG signals presents significant challenges due to their complex temporal dynamics, variations in signal morphology across individuals, and the presence of noise and artifacts. Traditional methods for automated ECG classification typically rely on either manual feature extraction or deep learning approaches based on signal reconstruction. Both techniques have limitations that hinder their ability to generalize and their accuracy in real-world applications [
1].
Feature extraction-based methods attempt to capture relevant characteristics from ECG signals using statistical, time-domain, and frequency-domain techniques. However, the extracted features introduce bias, limit the model’s ability to adapt to new datasets and unseen abnormalities, and often require significant computational resources, making real-time processing difficult [
2]. On the other hand, deep learning methods that reconstruct ECG signals and detect anomalies based on reconstruction errors struggle to generalize, especially when encountering rare or underrepresented abnormal patterns in training data. Additionally, their high computational complexity and latency further hinder their suitability for real-time applications. These limitations highlight the need for a more efficient, effective, and generalizable approach to ECG classification that can operate reliably in real-time scenarios [
3].
To overcome these limitations, recent research in signal-based deep learning has increasingly adopted end-to-end architectures that learn discriminative representations directly from raw input signals, bypassing the dependency on handcrafted features or reconstruction losses. Such architectures can automatically discover both spatial and temporal dependencies, improving robustness and generalisation across diverse datasets and noise conditions. This paradigm has also shown success in related domains of signal interpretation [
4,
5]. These studies demonstrate the versatility of deep neural models in learning complex signal representations and capturing long-term dependencies, reinforcing the motivation for an end-to-end approach in ECG anomaly detection.
In this study, we propose an end-to-end deep learning model, ECG-CBA, which integrates convolutional neural networks (CNNs), Bidirectional long short-term memory networks (Bi-LSTMs), and a multi-head Attention mechanism. Unlike conventional approaches, ECG-CBA directly learns discriminative features from raw ECG signals rather than relying on manual feature extraction or signal reconstruction. The CNN component captures local spatial patterns in ECG waveforms, while the Bi-LSTM module models the sequential dependencies of heartbeats, enabling the recognition of temporal variations. The attention mechanism further enhances the model’s ability to focus on critical ECG segments, improve the detection of anomalies, and boost classification accuracy.
The proposed model is evaluated on the ECG5000 and MIT-BIH Arrhythmia datasets for binary classification of normal and abnormal heartbeats. Experimental results demonstrate that ECG-CBA achieves high classification performance, with accuracies of 99.60% and 98.80% on the ECG5000 and MIT-BIH datasets, respectively. Compared to traditional deep learning approaches, ECG-CBA improves sensitivity, specificity, and overall classification accuracy, making it a robust solution for ECG-based anomaly detection.
The key contributions of this research are summarized as follows:
We propose ECG-CBA, an end-to-end framework that learns discriminative representations directly from raw ECG signals, eliminating the need for hand-crafted feature extraction or signal reconstruction. This approach enhances generalization to unseen data and improves robustness against noise and inter-patient variability, making it highly suitable for real-world clinical applications.
The architecture integrates convolutional neural networks (CNNs) for spatial feature extraction, bidirectional long short-term memory (Bi-LSTM) networks for modeling temporal dependencies, and a multi-head attention mechanism to emphasize clinically relevant ECG segments.
Comprehensive experiments conducted on two benchmark datasets, ECG5000 and MIT-BIH Arrhythmia, demonstrate that ECG-CBA consistently outperforms existing approaches across multiple performance metrics, including accuracy, sensitivity, specificity, and F1-score.
The rest of the paper is organised as follows:
Section 2 presents the relevant related works on ECG anomaly detection.
Section 3 details the Background study of CNN, Bi-LSTM and Attention models.
Section 4 presents the design and development of the proposed ECG-CBA model.
Section 5 describes the implementation and training framework of the ECG-CBA model.
Section 6 outlines the performance evaluation methods, while
Section 7 presents the experimental results of the ECG-CBA model, compared against related ECG classification methods as benchmarks. Finally,
Section 8 provides the conclusions of this study and suggests directions for future research.
2. Related Work
Anomalies in electrocardiogram (ECG) signals are critical indicators of various cardiovascular diseases, which remain a leading cause of morbidity and mortality worldwide [
6]. The advent of deep learning has revolutionized the field of medical diagnostics, particularly in the interpretation of ECG signals, enabling more accurate and efficient detection of arrhythmias and other cardiac anomalies [
7]. This literature review focuses on applying deep learning models for detecting and classifying anomalies in ECG signals, explicitly utilizing the ECG5000 and MIT-BIH Arrhythmia Databases. These datasets are widely recognized for their comprehensive representation of various ECG patterns, making them suitable for training and validating deep learning models.
The ECG5000 dataset, introduced by [
8], comprises 5000 ECG recordings from five different classes of heartbeats, including Normal, Atrial Fibrillation, and others. Recent studies have employed this dataset to leverage the capabilities of deep learning architectures [
9]. The dataset’s structure enables a balanced representation of both normal and abnormal ECG signals, facilitating the development of robust models that can generalize across different classes. The dataset’s size and diversity make it a suitable candidate for training deep learning models that require large amounts of labeled data to achieve high accuracy.
The MIT-BIH Arrhythmia Database is another pivotal resource in ECG research, consisting of 48 half-hour ECG recordings from 47 subjects, annotated with 11 different types of arrhythmias [
10]. Recent studies have highlighted the unique challenges posed by this dataset, such as class imbalance and the need for precise temporal segmentation of ECG signals [
11]. The annotations provided in this dataset offer a rich ground for training deep learning models, enabling researchers to explore various architectures and preprocessing techniques to improve classification performance.
Recent advancements in deep learning have led to the development of various architectures tailored for ECG signal analysis. Convolutional neural networks (CNNs) have emerged as a popular choice due to their ability to automatically learn spatial hierarchies in data [
12]. For instance, the authors of that study proposed a hybrid CNN-LSTM model that combines the strengths of CNNs in feature extraction with Long Short-Term Memory (LSTM) networks for temporal sequence modeling, achieving state-of-the-art performance on the ECG5000 and MIT-BIH datasets.
Additionally, attention mechanisms have been integrated into deep learning models to enhance their performance further. A Transformer-based model that incorporates self-attention layers was introduced by Ref. [
13], enabling the model to focus on critical segments of the ECG signal while disregarding noise. Their findings demonstrated significant improvements in classification accuracy compared to traditional CNN approaches.
Liu et al. [
5] introduced a new method for long-term temperature compensation in structural health monitoring using ultrasonic guided waves. This study emphasizes the versatility and growing importance of deep learning for extracting discriminative representations and modeling long-term dependencies across diverse signal domains.
Roy et al. [
14] proposed a novel approach for detecting anomalies in ECG signals using a deep LSTM autoencoder. The method employs an encoder to learn a lower-dimensional representation of ECG sequences and a decoder to reconstruct the original ECG signal. The model is trained on normal ECG signals, and anomaly detection is performed by analyzing the reconstruction loss of test ECG signals. The authors determine a reconstruction loss threshold using both manual and Kapur’s automated thresholding procedures. When applied to the ECG5000 dataset, the proposed model achieved an accuracy of over 98%.
Qin et al. [
15] introduced a novel one-class classification GAN (ECG-ADGAN) for ECG anomaly detection. The method incorporates a Bi-directional Long-Short Term Memory (Bi-LSTM) layer into a GAN architecture. It employs a mini-batch discrimination training strategy in the discriminator to synthesize ECG signals. The model was trained to generate samples that match the distribution of normal signals, enabling the reliable detection of anomalies, even those not well-represented in the training data. Experiments on the MIT-BIH arrhythmia database demonstrated the method’s effectiveness, achieving an accuracy of 95.5% and an AUC of 95.9%, outperforming state-of-the-art semi-supervised learning algorithms.
Pereira et al. [
16] proposed an unsupervised approach for learning representations of ECG sequences and detecting anomalies. They trained a variational autoencoder model with Bi-LSTM encoders and decoders for representation learning. Then, they introduced new unsupervised methods for anomaly detection in the latent space. The clustering step focused on defining the two clusters, which include the normal and anomalous heartbeats. Such a technique relies on the fact that normal heartbeats are the majority, and the anomalous ones are in a latent space different from the normal ones. Their model was regularized using a sparsity penalty.
Additionally, Dutta et al. [
17] presented MED-NET, a novel approach to ECG anomaly detection using LSTM autoencoders. The model uses a stacked LSTM architecture and autoencoders to represent temporal attributes in a latent matrix. The model processes and extracts around 140 features from a dataset of 5000 samples. The LSTM network is structured by combining an Encoder–Decoder LSTM, allowing the model to accept variable-length input sequences and predict or output variable-length output sequences. The recreated time series-based ECG data output by the final layer is compared with the original ECG time series-based input data to calculate the reconstruction loss.
Roy et al. [
14] proposed ECG-NET, an LSTM-based autoencoder for detecting anomalous ECG signals. A key contribution of this method is that it only requires normal ECG signals to train the model. This approach addressed the challenges of data imbalance and the limited availability of annotated anomalous ECG signals. The method also incorporated an automated reconstruction loss threshold selection approach during testing, where if the reconstruction loss value is above a certain threshold, the signal is considered an anomaly; otherwise, it is considered normal.
Shaik Munawar et al. [
18] proposed a Multi-Task Group Bi-directional LSTM method to improve the performance of arrhythmia classification. The MTGBi-LSTM model learns unique features in a shared representation that help overcome overfitting problems and increase the model’s learning rate. The global and intra LSTM method selects the relevant feature and easily escapes from local optima. The multi-task learning technique learns two ECG signals in a shared representation for effective learning.
Gutiérrez-Gnecchi et al. [
19] presented a DSP-based method for real-time arrhythmia classification designed for online ambulatory operation. The algorithm classifies eight heartbeat conditions using a wavelet transform based on quadratic wavelets to identify individual ECG waves and obtain a fiducial marker array. Classification is conducted using a Probabilistic Neural Network and tested with ECG records from the PhysioNet repository.
The study by Ref. [
20] proposed classifying arrhythmias into five categories using LSTM with a Luong Attention Mechanism so that the model learns the critical part of the ECG signal at each time step. In the proposed model, the authors used a Continuous Wavelet Transform (CWT) to remove low- and high-frequency noise from the signals. Then, a set of features was extracted using statistical measures (such as skewness and kurtosis) and time–frequency domain features.
Ramaiah et al. [
21] addressed the critical issue of ECG ventricular fibrillation (VF) type signal detection. VF is a life-threatening cardiac arrhythmia with high mortality. The authors proposed a novel Long Short-Term Memory (LSTM) classifier optimized with Improved Penguin Optimization (IPEO) to mitigate overfitting. The method utilized the ECG recordings from the MIT-BIH and CPSC 2018 datasets. The proposed model also employed Fuzzy C-Means and Enhanced Fuzzy Rough Set methods for effective feature selection, extracting informative features, and clustering membership degrees.
The study by Ref. [
22] proposed a novel one-dimensional convolutional neural network (1D-CNN) architecture for the automated classification of arrhythmias. Leveraging the MIT-BIH dataset, incorporating both real and noise-attenuated ECG signals, the authors aimed to develop a robust and efficient diagnostic tool. Such noise reflects the inherent randomness of arrhythmic events, which can lead to potential misdiagnosis.
Mohammed [
23] addressed the limitations of cloud-based machine learning (ML) for electrocardiogram (ECG) arrhythmia detection, focusing on the emerging field of edge inference to meet real-time, privacy, and availability demands. The author illuminated the computational challenge of deploying modern ML algorithms on resource-constrained edge devices. The challenge was addressed by the proposed model, which is a compact convolutional neural network (CNN) classifier enhanced with matched filter (MF) theory. The model’s minimal size (15 KB) and rapid inference time position it as a superior, low-complexity alternative for real-time ECG monitoring on edge devices, potentially benefiting a large population of patients with cardiovascular disease.
In Ref. [
24], a novel deep learning approach was proposed utilizing a Hybrid Residual Network (Hybrid ResNet). The method enhanced the feature extraction and classification accuracy through an architecture that integrated standard convolution, depthwise separable convolution, and residual connections. Their model ensured high-quality input because the ECG signals undergo preprocessing, including baseline drift removal, denoising via discrete wavelet transform, and heartbeat segmentation. Furthermore, the study tackles the class imbalance issue in the MIT-BIH Arrhythmia Database by employing the Synthetic Minority Oversampling Technique (SMOTE).
In Ref. [
25], the study proposed a hybrid deep learning-based technique that transforms 1D ECG data into 2D Scalogram images for automated noise reduction and feature extraction. The core contribution is the Residual attention-based 2D-CNN-LSTM-CNN (RACLC) model, which combines 2D convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) systems to capture both morphological and temporal information from ECG signals. Integrating an attention block enhances the model’s ability to focus on critical details within the ECG signal, thereby improving classification efficiency.
In Ref. [
26], the research proposed a 1D convolutional neural network (CNN) with residual connections. Using the residual connections enabled the model to capture the most critical feature in the input signal.
Murat et al. [
27] analyzed literature reports that use deep learning on arrhythmia ECG data. The authors utilized ECG data from five classes, comprising 100,022 beats, obtained from the MIT-BIH arrhythmia database, to evaluate deep learning techniques. They found that classifying raw ECG signals with deep learning-based systems without using any manual feature extraction is a significant advantage. However, some studies have shown that using certain temporal features (i.e., RR interval) and raw signals improves model performance. They also noted that the imbalance of ECG datasets is a significant problem that can give misleading information concerning model performance. The authors concluded that efficient hybrid models can provide more distinctive features from ECG signals.
The study by Ref. [
28] proposed AttentivECGRU, a novel arrhythmia detection method that employs a Gated Recurrent Unit (GRU)-based autoencoder with an attention mechanism. The model, trained solely on normal electrocardiogram (ECG) signals, learns to reconstruct these signals and effectively identifies abnormalities by analyzing reconstruction errors. In their study, a fuzzy logic-based threshold selection process is applied to differentiate between normal and arrhythmic ECGs, addressing the overlap in reconstruction loss distributions.
3. Background Study
This section describes the fundamental building blocks of the ECG-CBA model for ECG anomaly detection: convolutional neural networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Multi-Head Attention. Each technique contributes to the precise and interpretable detection of anomalies in ECG signals. The extraction of spatial features is performed using a CNN module, while temporal dependencies are captured by utilizing a Bi-LSTM. Additionally, the Attention Mechanism helps highlight key portions of the sequence. The ECG-CBA model is described in detail in the subsequent subsections.
3.1. CNN
CNNs are a class of deep learning architectures that are particularly effective for processing data with a grid-like topology, such as time series signals or images. In the context of ECG signals, CNNs extract spatial features such as wave peaks, troughs, and transitions between cardiac cycles [
29,
30]. As shown in
Figure 1, a typical convolutional neural network (CNN) architecture consists of several key stages, including convolution, pooling, flattening, and fully connected layers for classification.
The primary operation in CNNs is the convolution, where a kernel slides across the input signal to produce a feature map. Mathematically, the 1D convolution operation can be expressed as follows:

$$y[n] = \sum_{k=0}^{K-1} x[n+k]\, w[k]$$

where x is the input signal, w is the kernel (filter) of size K, and y is the output feature map. The 1D convolution in ECG processing applies filters over temporal sequences to detect localized patterns such as QRS complexes (Q, R, and S waves representing ventricular depolarization).
Following convolution, a non-linear activation function such as ReLU is typically applied:

$$a[n] = \max(0,\, y[n])$$

This enhances model expressiveness while introducing sparsity. Max-pooling layers are also commonly used to reduce the dimensionality of the feature maps:

$$p[n] = \max_{0 \le j < P} a[nP + j]$$

where P is the pooling window size. Pooling reduces computational cost and controls overfitting by retaining the most salient features.
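The three operations above can be illustrated with a minimal NumPy sketch; the signal values and kernel below are arbitrary toy numbers, not taken from the datasets used in this paper:

```python
import numpy as np

def conv1d_valid(x, w):
    """1D convolution (valid padding): y[n] = sum_k x[n+k] * w[k]."""
    K = len(w)
    return np.array([np.dot(x[n:n + K], w) for n in range(len(x) - K + 1)])

def relu(y):
    """Element-wise ReLU activation: a[n] = max(0, y[n])."""
    return np.maximum(0.0, y)

def max_pool1d(a, P):
    """Non-overlapping max-pooling with window size P."""
    trimmed = a[: (len(a) // P) * P]
    return trimmed.reshape(-1, P).max(axis=1)

# Toy ECG-like segment and a simple edge-detecting kernel (illustrative values only)
x = np.array([0.0, 0.1, 0.2, 1.5, 0.3, 0.1, 0.0, -0.2, 0.0, 0.1])
w = np.array([-1.0, 0.0, 1.0])

features = max_pool1d(relu(conv1d_valid(x, w)), P=2)
print(features)
```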
In ECG anomaly detection, CNNs can automatically learn discriminative features from raw input, eliminating the need for handcrafted feature engineering. These learned features are then forwarded to sequence models, such as Bi-LSTM, for temporal analysis.
3.2. Bi-LSTM
Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) designed to overcome the vanishing gradient problem and capture long-range dependencies in sequential data. The Bi-LSTM processes input data in both forward and backward directions, enhancing the ability to capture temporal relationships across the entire sequence [
31,
32,
33].
Figure 2 illustrates the structure of a Bi-LSTM network, where inputs are processed in both forward and backward directions to capture comprehensive temporal dependencies.
An LSTM unit updates its internal states through three gates, the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$, which control the flow of information into and out of the memory cell $c_t$. The operations at each time step t are defined as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where
$x_t$ is the input vector at time step t,
$h_{t-1}$ is the hidden state from the previous time step,
$W_{\{i,f,o,c\}}$ and $U_{\{i,f,o,c\}}$ are the weight matrices for input and recurrent connections, respectively,
$b_{\{i,f,o,c\}}$ are the bias terms,
$\sigma$ is the sigmoid activation function,
$\tanh$ is the hyperbolic tangent function,
⊙ denotes element-wise multiplication,
$\tilde{c}_t$ is the candidate cell state,
$c_t$ is the updated cell state, and
$h_t$ is the output hidden state.
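For completeness, a single LSTM time step corresponding to the equations above can be written in NumPy as follows; the weights are randomly initialized and purely illustrative, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update; W, U, b hold the input, forget, output, and candidate parameters."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])          # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])          # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])          # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                              # updated cell state
    h_t = o_t * np.tanh(c_t)                                        # output hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 1, 4  # single-channel ECG sample, 4 hidden units (toy sizes)
W = {g: rng.normal(size=(d_hid, d_in)) for g in "ifoc"}
U = {g: rng.normal(size=(d_hid, d_hid)) for g in "ifoc"}
b = {g: np.zeros(d_hid) for g in "ifoc"}

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in np.array([[0.1], [1.2], [0.3]]):   # three consecutive ECG amplitudes
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```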
In a Bi-LSTM network, two separate LSTM layers are applied to the input sequence: one processes the sequence forward, and the other processes it backward. This structure allows the model to capture information from past and future time steps at each point in the sequence. The final hidden state at each time step t is formed by concatenating the forward and backward hidden states:

$$h_t = [\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t]$$

where
$\overrightarrow{h}_t$ is the hidden state from the forward LSTM pass at time step t,
$\overleftarrow{h}_t$ is the hidden state from the backward LSTM pass at time step t,
$h_t$ is the concatenated output representing both temporal directions, and
$[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation.
This bidirectional representation enables the model to learn more comprehensive temporal dependencies, which is especially beneficial in time-series tasks, such as ECG anomaly detection. This bidirectional context also allows the model to understand dependencies that span past and future time steps. This is critical in ECG analysis since some abnormalities can only be detected when considering prior and subsequent waveform segments. The Bi-LSTM thus captures richer temporal features than unidirectional RNNs or LSTMs.
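In Keras, the bidirectional wrapper performs this forward/backward concatenation automatically; a minimal sketch (layer sizes chosen for illustration) is:

```python
import tensorflow as tf

# A Bi-LSTM over a 140-timestep, single-channel ECG sequence; units are illustrative.
inputs = tf.keras.Input(shape=(140, 1))
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True)  # 64 units per direction
)(inputs)

# Forward and backward hidden states are concatenated, so the feature size doubles to 128.
print(bilstm.shape)  # (None, 140, 128)
```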
3.3. Multi-Head Attention
Attention mechanisms enhance deep learning models by allowing them to focus on relevant parts of the input data [
34]. Multi-head attention extends this concept by employing multiple attention heads, each learning different relationships and representations [
35,
36].
Figure 3 presents the architecture of a transformer model block, highlighting the role of Multi-Head Attention, which is applied multiple times in both the encoder and decoder to capture contextual dependencies across input and output sequences effectively.
Given queries Q, keys K, and values V, the scaled dot-product attention is computed as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the key vectors. This operation produces a weighted sum of values, allowing the model to attend to essential segments.
In Multi-Head Attention, this process is performed h times in parallel, using different learned projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^{O}$$

where each head is defined as follows:

$$\text{head}_i = \text{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$
The multiple attention heads allow the model to capture diverse patterns and dependencies across different parts of the ECG signal.
This is crucial in ECG anomaly detection, as it allows the model to focus on segments that exhibit subtle deviations from normal rhythms. For instance, rare arrhythmic episodes may only affect small waveform regions; attention mechanisms can prioritize such anomalies without being distracted by irrelevant patterns. The Multi-Head Attention mechanism enhances interpretability and performance by enabling the model to allocate focus across the input sequence dynamically. This is particularly beneficial in medical time-series data, such as ECG signals.
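A minimal NumPy version of the scaled dot-product attention defined above, with toy dimensions rather than the model's actual sizes, illustrates the computation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

rng = np.random.default_rng(1)
T, d_k = 5, 8                        # 5 timesteps, key dimension 8 (toy sizes)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
context = scaled_dot_product_attention(Q, K, V)
print(context.shape)                 # (5, 8)
```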
4. Proposed ECG-CBA Model
This section presents the proposed ECG-CBA model. We first describe its network architecture, then explain how its design enhances ECG anomaly detection. Finally, we provide implementation details to ensure reproducibility.
Figure 4 illustrates the ECG anomaly detection process, which is further explained in the following subsections.
4.1. Data Preprocessing
Preprocessing the ECG signal is crucial to ensure the data are clean, structured, and suitable for training machine learning or deep learning models. ECG signals often contain noise, artifacts, and variations in amplitude, which can hinder the performance of predictive models if not addressed. The preprocessing pipeline typically involves several key steps, including data segmentation, normalization, label encoding, and splitting into training and testing sets. These steps ensure that the data are formatted correctly and optimized for practical model training, which allows accurate analysis and classification of normal and abnormal ECG patterns. It is essential to highlight that the datasets used in this study are already segmented, with each ECG signal divided into 140 time intervals for the ECG5000 dataset and 188 time intervals for the MIT-BIH dataset. The data are further processed to ensure compatibility with deep learning models and enhance the overall quality of the analysis. The following sections provide a detailed overview of the preprocessing steps applied to the datasets used.
4.1.1. Data Segmentation
The preprocessing of the ECG5000 dataset begins with data segmentation, which ensures that the ECG signals are structured in a format suitable for deep learning models. Each record in the dataset consists of 140 time steps representing an ECG signal, with the last column indicating whether the signal is normal or anomalous. To prepare the data, the ECG features are first separated from their corresponding labels. Since deep learning models like CNNs and Bi-LSTMs require fixed-length sequences, the data is reshaped into a three-dimensional format: (samples, time steps, channels), where the channel dimension is set to 1, as the ECG5000 dataset contains single-lead signals. Next, normalization is applied to standardize the data, ensuring that all values have a zero mean and unit variance, which improves the stability of model training and prevents certain features from dominating others. This step is crucial as ECG signals can have varying amplitude ranges depending on the recording conditions.
4.1.2. Data Splitting
Following segmentation, data splitting is performed to create separate training and testing sets. Typically, the dataset is divided using an 80–20%, 90–10%, or 10–90% split, ensuring that both splits contain a representative distribution of normal and abnormal samples. Since the ECG5000 and MIT-BIH datasets support binary classification, the class labels are converted into a Boolean format, where normal signals are mapped to one and abnormal signals are mapped to zero. Furthermore, to align with TensorFlow's input expectations for the CNN layer, the data are further reshaped into the format (batch_size, time_steps, features), maintaining the time series structure of the ECG data while making it compatible with deep learning architectures. This structured preprocessing approach ensures that the ECG signals are appropriately formatted for training an ECG-CBA model, optimizing both feature extraction and sequence learning to enhance the detection of ECG anomalies.
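The preprocessing steps described above can be sketched as follows; the file name, the loading call, and the assumption that the normal class carries label 1 are illustrative and not taken from the paper:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical loading: each row = 140 ECG values followed by a class label (ECG5000 layout).
data = np.loadtxt("ecg5000.csv", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# Binary labels: normal beats -> 1 (True), all other classes -> 0 (False).
y = (y == 1).astype(bool)

# Standardize each feature to zero mean and unit variance, then add a channel dimension.
X = StandardScaler().fit_transform(X)
X = X.reshape((X.shape[0], X.shape[1], 1))   # (samples, time_steps, channels)

# 80-20 train-test split with a representative class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```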
4.2. Proposed ECG-CBA Architecture
As illustrated in
Figure 5, the proposed model architecture consists of two main components: an encoder and a decoder. The encoder employs CNN layers to extract local features from the ECG signals, followed by a Bi-LSTM layer to capture sequential patterns. An attention mechanism is then applied to focus on the most relevant parts of the data, enhancing the model’s ability to detect anomalies.
The decoder flattens the encoded features and passes them through a dense layer with a sigmoid activation function to produce binary classification outputs. The model is compiled using the Adam optimizer and binary cross-entropy loss, with accuracy as the evaluation metric. Early stopping is implemented to prevent overfitting by monitoring validation loss and restoring the best weights.
4.2.1. ECG Encoder Blocks
CNN layers: The CNN plays a crucial role in extracting meaningful features from raw ECG time-series signals. CNN effectively captures local patterns such as peaks, waveforms, and abrupt changes that are essential for identifying anomalies in ECG data. In the proposed model, CNN serves as the first stage of processing, preparing the data for deeper sequential learning using Bi-LSTM and attention mechanisms. The Conv1D layers apply convolutional filters that slide over the ECG signal, detecting key features like P-waves, QRS complexes, and T-waves, ensuring that relevant characteristics are automatically learned. Following this, MaxPooling1D layers reduce the dimensionality of the extracted features, thereby lowering computational complexity while preserving critical information. This step also prevents overfitting by eliminating unnecessary variations in the data. Finally, the high-level representations generated by CNN are passed to the Bi-LSTM layer, allowing the model to capture long-term dependencies in ECG signals. By first extracting spatial dependencies through CNN, the model ensures that the LSTM can focus on learning sequential relationships. CNN serves as an automated feature extractor, transforming raw ECG signals into meaningful representations that improve the efficiency and accuracy of anomaly detection in the system.
Bi-LSTM layers: The Bi-LSTM layer plays a critical role in capturing the sequential dependencies and temporal patterns within ECG signals. Unlike traditional LSTMs, which process data in a single direction (forward or backward), Bi-LSTM processes the input in both directions, enabling the model to learn patterns from past and future time steps simultaneously. This is particularly important in ECG anomaly detection, where abnormalities may depend on both preceding and succeeding cardiac cycles. The Bi-LSTM layer follows the CNN feature extraction stage to ensure that the input passed to the LSTM network is already enriched with relevant spatial features. The CNN extracts local features, such as QRS complexes and wave morphologies, while the Bi-LSTM focuses on learning the temporal relationships between these features. This allows the model to recognize irregularities in heart rhythms that may not be apparent in isolated segments of the ECG signal. By leveraging Bi-LSTM, the model effectively detects anomalies by understanding both short-term fluctuations and long-term dependencies in ECG patterns. This enhances the robustness of the anomaly detection system and improves its ability to differentiate between normal and abnormal heartbeats. In addition, the return_sequences = True parameter in the Bi-LSTM layer ensures that the output maintains its sequential structure, which allows subsequent layers, such as the attention mechanism, to refine the learned representations further.
Multi-Head Attention layers: The Multi-Head Attention layer plays a vital role in enhancing the ECG-CBA model's ability to focus on essential features within the ECG signal. Unlike traditional sequential models that process data linearly, the attention mechanism enables the model to selectively attend to different parts of the ECG signal simultaneously, capturing both local and long-range dependencies effectively. The use of the attention layer enables the model to recognize subtle variations in ECG signals that may indicate arrhythmias or other abnormalities.
Normalization layers: Finally, a normalization layer is added to stabilize training and enhance feature consistency, ensuring that the attended information is well-integrated before final classification.
4.2.2. ECG Decoder Blocks
The decoder block in the ECG anomaly detection model is responsible for transforming the encoded feature representations into a final classification decision. It consists of a Flatten layer, which converts the multi-dimensional feature representations from the encoder into a 1D vector, making it suitable for classification. This is followed by a Dense layer with a sigmoid activation function, which outputs a probability score between 0 and 1, determining whether the ECG signal is normal (0) or anomalous (1). The decoder receives the processed feature representations from the encoder. By flattening these high-dimensional features and applying a dense layer with sigmoid activation, the decoder enables binary classification. It plays a crucial role in translating learned feature representations into meaningful predictions, allowing the model to distinguish between normal and abnormal heartbeats effectively.
4.3. Classification
The model is trained on the training data and evaluated on the test set, with performance metrics such as accuracy, precision, recall, and F1-score calculated to assess its effectiveness. The results demonstrate the model’s ability to classify ECG signals as normal or abnormal, providing a robust solution for ECG-based anomaly detection. This approach leverages the strengths of CNNs, Bi-LSTMs, and attention mechanisms to achieve high accuracy and interpretability in detecting cardiac abnormalities.
In binary classification problems, models often output a probability score between 0 and 1, which must be converted into a class label using a decision threshold. A standard default threshold is 0.5, but this may not be optimal for anomaly detection. To find the optimal classification threshold, we used Youden’s J statistic [
37], which identifies the threshold that maximizes the difference between the True Positive Rate (TPR) and the False Positive Rate (FPR):

$$J(T) = \text{TPR}(T) - \text{FPR}(T)$$

The threshold that maximizes J is chosen as the optimal threshold:

$$T^{*} = \arg\max_{T} J(T)$$

where $T^{*}$ is the optimal threshold.
The ROC-based threshold selection method improves model performance by dynamically choosing the threshold that maximizes sensitivity while minimizing false alarms. This method is particularly advantageous in applications such as anomaly detection, medical diagnosis, and fraud detection, where the costs of False Positives and False Negatives are critical. In heartbeat anomaly detection, a lower threshold helps catch all possible cases, even at the cost of some false alarms.
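A common way to implement this selection with scikit-learn is sketched below, assuming the predicted probabilities `y_prob` and true labels `y_test` from an earlier evaluation step are available:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_prob):
    """Return the decision threshold that maximizes J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j_scores = tpr - fpr
    return thresholds[np.argmax(j_scores)]

# Example usage with model outputs (y_prob = model.predict(X_test).ravel()):
# best_t = youden_threshold(y_test, y_prob)
# y_pred = (y_prob >= best_t)
```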
The pseudocode of the proposed ECG-CBA is outlined in Algorithm 1.
Algorithm 1 ECG-CBA: ECG Classification using CNN, Bi-LSTM, and Attention.
1: Input: Training set (normal and anomaly ECG samples), Validation set (normal and anomaly ECG samples), Test set (normal and anomaly ECG samples)
2: Output: Predicted test ECG as normal or anomaly
3: Hyper-parameters: Activation: Sigmoid, batch_size = 32, Learning rate: 0.001, Optimizer: Adam, Loss: Binary Cross-Entropy (BCE), EarlyStopping(patience = 5)
4: Training Phase
5: for each ECG sample in the training set do
6:     Pass the sample through CNN layers to capture local patterns
7:     Process features through Bi-LSTM layers to learn temporal dependencies
8:     Apply Multi-Head Attention to highlight important features
9:     Normalize the pattern representation of the ECG signal
10:    Pass the encoded representation through a fully connected output layer (sigmoid activation)
11:    Compute and minimize the binary cross-entropy loss
12: end for
13: Validate the model using validation ECG samples and compute validation loss
14: Testing Phase
15: for each test ECG sample (normal or anomaly) do
16:    Pass the test sample through the trained ECG-CBA model
17:    Compute the classification probability P
18:    Determine the threshold T using manual tuning or adaptive thresholding (Youden's J method)
19:    if P ≥ T then
20:        Predict ECG as Normal
21:    else
22:        Predict ECG as Abnormal
23:    end if
24: end for
5. Implementation and Training Framework of the ECG-CBA Model
This section describes the implementation details and training configuration of the proposed ECG-CBA model to ensure experimental reproducibility.
5.1. Model Implementation
Each ECG sample serves as the input to the model, represented as a one-dimensional sequence of timesteps with a single feature channel. Prior to training, the data are reshaped to (samples, timesteps, 1) for CNN compatibility and normalized to the range [0, 1].
The encoder network combines convolutional, recurrent, and attention-based components to extract hierarchical temporal features from the ECG signals. Its structure is summarized as follows:
Conv1D Layer 1: 32 filters, kernel size = 3, stride = 1, activation = ReLU, padding = “same”. Output shape: (140, 32) for the 140-timestep ECG5000 input.
MaxPooling1D Layer 1: Pool size = 2. Output shape: (70, 32).
Conv1D Layer 2: 64 filters, kernel size = 3, stride = 1, activation = ReLU, padding = “same”. Output shape: (70, 64).
MaxPooling1D Layer 2: Pool size = 2. Output shape: (35, 64).
Bidirectional LSTM: 64 units per direction, return_sequences = True. Output shape: (35, 128).
Multi-Head Self-Attention: Four heads, key dimension = 64. Output shape: (35, 128).
Layer Normalization: Normalizes attention outputs.
The Bi-LSTM layer produces an output tensor of shape (35, 128), corresponding to 35 timesteps with 128 features obtained by concatenating the forward and backward LSTM outputs (each with 64 units). This tensor is used as the input to the Multi-Head Self-Attention mechanism. The same tensor provides the Query, Key, and Value representations, each projected through learned weight matrices that map the 128-dimensional features to the 64-dimensional key space of each head. The attention layer employs four heads with a key dimension of 64.
Within each head, pairwise similarity scores between the query and key vectors are computed and normalized using the softmax function to derive attention weights. These weights scale the Value representations to generate context-aware features. The four head outputs, each of dimension 64, are concatenated into a (35, 256) tensor and linearly transformed back to (35, 128) using an output projection. Layer Normalization follows this step to stabilize model convergence and maintain consistent feature distributions. This self-attention process enables the model to capture long-range temporal dependencies across ECG sequences, complementing the localized and sequential representations learned by the CNN and Bi-LSTM layers.
The decoder module consists of a flattening layer followed by a single dense neuron with a sigmoid activation function. This configuration outputs a probability value between 0 and 1, indicating the likelihood of an anomalous ECG segment.
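Putting the encoder and decoder together, the described architecture can be expressed in Keras roughly as follows; this is a sketch consistent with the layer settings listed above, and details such as the function and model names are assumptions:

```python
import tensorflow as tf

def build_ecg_cba(timesteps=140):
    """ECG-CBA encoder-decoder sketch: CNN -> Bi-LSTM -> Multi-Head Attention -> Dense."""
    inputs = tf.keras.Input(shape=(timesteps, 1))

    # Encoder: convolutional feature extraction
    x = tf.keras.layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling1D(2)(x)
    x = tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling1D(2)(x)

    # Encoder: bidirectional temporal modeling (output feature size 128 = 2 x 64)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    )(x)

    # Encoder: multi-head self-attention (Q = K = V = Bi-LSTM output) + normalization
    attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = tf.keras.layers.LayerNormalization()(attn)

    # Decoder: flatten and binary classification head
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs, name="ECG_CBA")

model = build_ecg_cba(timesteps=140)
model.summary()
```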
5.2. Training Configuration
As shown in
Table 1, the model was trained using the Adam optimizer with an initial learning rate of 0.001; the remaining Adam hyperparameters (β₁, β₂, and ε) are listed in Table 1. A fixed batch size of 32 and a maximum of 50 epochs were adopted. Early stopping monitored validation loss with a patience of five epochs and the minimum delta given in Table 1, ensuring training termination when performance plateaued. The binary cross-entropy loss function was employed, and accuracy served as the primary performance metric. Model checkpoints were automatically saved at the lowest validation loss to prevent overfitting.
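A sketch of the corresponding compilation and training setup is given below; the callback arguments mirror the values above, while the checkpoint file name and the validation split are assumptions:

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Stop when validation loss stops improving and restore the best weights.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True
    ),
    # Keep the checkpoint with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(
        "ecg_cba_best.keras", monitor="val_loss", save_best_only=True
    ),
]

history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    epochs=50,
    batch_size=32,
    callbacks=callbacks,
)
```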
Post-training evaluation included accuracy, precision, recall, F1-score, and AUC metrics. All experiments were conducted using TensorFlow 2.15 and the Keras 3 API to ensure replicable and transparent results.
7. Experiments
7.1. Experiments Setup
The experiments were conducted using Google Colab for training on the ECG5000 dataset and Kaggle's cloud-based environment for training on the MIT-BIH dataset. Google Colab was selected for ECG5000 due to its availability of a Tesla T4 GPU (16 GB VRAM, NVIDIA, Santa Clara, CA, USA), which was sufficient for handling the relatively smaller dataset. For the larger MIT-BIH dataset, Kaggle's NVIDIA A100 GPU environment was utilized due to its higher computational capacity, allowing for efficient training on a more extensive dataset.
7.2. Experimental Results
The proposed ECG-CBA model was trained with the same architecture for both the ECG5000 and MIT-BIH datasets. The architecture consisted of two Conv1D layers, where the first had 32 filters with a ReLU activation function, followed by MaxPooling with a pool size of two. The second Conv1D layer had 64 filters, also using ReLU activation and MaxPooling with a pool size of two. After convolutional feature extraction, a Bi-LSTM layer with 64 units was applied. Additionally, a Multi-Head Attention mechanism with four heads and a key dimension of 64 was integrated to enhance important sequential patterns. Finally, a Layer Normalization step was used to stabilize training and improve model convergence. The proposed model was trained for 200 epochs using an early stopping mechanism (patience = 5) with a learning rate of 0.001 and a binary cross-entropy loss function.
Table 3 presents the classification performance of the proposed ECG-CBA on the ECG5000 and MIT-BIH datasets, using the following metrics: accuracy, precision, recall, and F1 score. The ECG5000 model demonstrates exceptional performance, achieving near-perfect scores across all metrics, with 99.6% accuracy, 99.31% precision, 100% recall, and 99.65% F1 score. The perfect recall score of 1.0 indicates that the model correctly identifies all positive cases without false negatives, while the high precision shows minimal False Positives. In addition, the MIT-BIH dataset yields slightly lower but still strong results, with 98.80% accuracy, 98.55% precision, 96.90% recall, and 97.72% F1 score. Both datasets show balanced precision and recall, as evidenced by F1 scores closely aligned with their accuracy, indicating robust model performance in both cases.
Figure 9 and
Figure 10 clearly show that both the training and validation accuracy continuously increase over the epochs and eventually stabilize.
We evaluate the performance of our proposed model using the ROC curve, which illustrates how well the classification model performs across different thresholds. As shown in
Figure 11 and
Figure 12, the ROC curve of the ECG-CBA model exhibits a steep ascent toward the top-left corner, indicating a strong classification capability. Additionally, the Area Under the Curve (AUC-ROC) is very close to 1.0, indicating near-perfect discrimination between normal and abnormal ECG signal cases.
The confusion matrix results demonstrate the strong performance of the proposed ECG-CBA model in classifying ECG5000 test samples into Normal and Abnormal rhythm categories.
As shown in
Figure 13 and
Figure 14, the confusion matrices demonstrate a strong classification performance of the ECG-CBA model in distinguishing between normal and abnormal cases on both ECG5000 and MIT-BIH datasets. For normal instances, it correctly classified 436 (ECG5000) and 1717 (MIT-BIH) cases, with only four (ECG5000) and 34 (MIT-BIH) False Positives (misclassified as abnormal). For abnormal instances, it accurately identified 558 (ECG5000) and 1509 (MIT-BIH) cases, with just two (ECG5000) and 49 (MIT-BIH) False Negatives (misclassified as normal). This reflects robust classification accuracy across both datasets, with slightly higher precision on the ECG5000 dataset. This indicates a high level of accuracy with minimal misclassification errors. The low False Positive Rate suggests strong specificity, ensuring normal cases are rarely misidentified as abnormal. Similarly, the low false negative rate highlights the model’s excellent sensitivity, ensuring that abnormal cases are accurately detected. The balance between precision and recall confirms the model’s reliability in detecting ECG anomalies, making it highly effective for real-world applications in medical diagnostics.
7.3. Sensitivity Analysis
To achieve optimal performance of the ECG-CBA model, a series of experiments was conducted to analyze parameter sensitivity and fine-tune the model by adjusting key hyperparameters. These included the Threshold selection, the number of neurons per convolutional layer, the direction of the LSTM (forward vs. bidirectional), the number of multi-head attention mechanisms, and the training-to-testing data split percentage.
7.3.1. Threshold Selection
In
Table 4, the analysis of different threshold values on the ECG5000 and MIT-BIH datasets reveals significant variations in classification performance. The thresholds of 0.3 and 0.5 were manually selected, while the 0.7 threshold was determined using Youden’s method, optimizing the balance between sensitivity and specificity. At 0.3, recall is the highest (100% for ECG5000, 97.93% for MIT-BIH), but precision is lower due to a higher number of false positives. Increasing the threshold to 0.5 improves precision (98.97% for ECG5000, 97.79% for MIT-BIH) while maintaining high accuracy. At 0.7, accuracy reaches its peak (99.60% for ECG5000, 98.80% for MIT-BIH), with better precision but slightly lower recall. These findings indicate that lower thresholds are more suitable for scenarios where missing positive cases are critical, such as detecting cardiac anomalies. In comparison, higher thresholds are beneficial in reducing false alarms and ensuring more reliable positive classifications.
7.3.2. Training and Testing Splits
Table 5 presents the performance comparison of the ECG-CBA model on the ECG5000 and MIT-BIH datasets using different train–test splits. The results demonstrate that the ECG-CBA model maintains the highest performance across different train–test splits. While the training data is considerably reduced, the accuracy, precision, recall, and F1 score remain high, particularly for the ECG5000 dataset. Even with only 10% of the data used for training, the model still achieves an accuracy of 0.9859 for ECG5000 and 0.9283 for MIT-BIH. This indicates the robustness of the model, as it generalizes well despite limited training data.
7.3.3. CNN Layers
Table 6 compares the performance of different CONV1D configurations on the ECG5000 and MIT-BIH datasets. The results indicate that increasing the number of filters generally improves performance up to a certain point. The CONV1D 32, 64 configuration achieves the highest accuracy (0.9960 for ECG5000 and 0.9880 for MIT-BIH), demonstrating that this setting effectively captures features from the ECG signals. The CONV1D 16, 32 configuration performs slightly lower, while the CONV1D 64, 128 configuration results in a slight drop in accuracy and F1 score, likely due to overfitting or an increased complexity that does not generalize as well. These findings suggest that a balanced approach, rather than simply increasing or decreasing filters, is essential for optimal performance.
7.3.4. Bi-LSTM Layer
Table 7 compares the performance of Forward LSTM and Bi-LSTM on the ECG5000 and MIT-BIH datasets. The results show that Bi-LSTM outperforms Forward LSTM across all metrics for both datasets. Specifically, Bi-LSTM achieves higher accuracy (0.9960 vs. 0.9860 for ECG5000 and 0.9880 vs. 0.9646 for MIT-BIH), indicating its ability to better capture temporal dependencies in ECG signals. The improvement in recall (1.0000 vs. 0.9862 for ECG5000 and 0.9690 vs. 0.9346 for MIT-BIH) suggests that Bi-LSTM is more effective in correctly identifying abnormal heartbeats. These results confirm that considering both past and future contexts, as Bi-LSTM does, enhances the model’s performance compared to a unidirectional LSTM.
7.4. Ablation Experiments
Table 8 presents a comprehensive performance comparison of various model architectures utilizing CNN, Bi-LSTM, and Attention mechanisms on the ECG5000 and MIT-BIH datasets. The performance comparison of different model configurations highlights the effectiveness of integrating CNN, Bi-LSTM, and Attention mechanisms for ECG anomaly detection. The best-performing configuration, which includes all three components, achieves the highest accuracy (0.9960 for ECG5000 and 0.9880 for MIT-BIH), demonstrating that CNN extracts spatial features, Bi-LSTM captures long-term dependencies, and the Attention mechanism enhances feature weighting. Removing the Attention layer results in a noticeable performance drop, particularly in recall, suggesting that Attention enhances the model’s ability to detect anomalies in the minority class. Similarly, excluding Bi-LSTM while retaining CNN and Attention results in a decrease in accuracy and F1 score, confirming that Bi-LSTM plays a crucial role in learning sequential dependencies. Models relying solely on CNN or Bi-LSTM exhibit significantly lower performance, showing that these components complement each other—CNN excels at feature extraction, while Bi-LSTM models temporal dependencies. Additionally, the results indicate that ECG5000 outperforms MIT-BIH across all configurations, likely due to differences in class distribution between the datasets. While CNN + Bi-LSTM remains a viable trade-off for lightweight models, the combination of CNN, Bi-LSTM, and Attention provides the most robust performance, achieving optimal accuracy and recall. This comprehensive hybrid model proves to be the most effective for ECG anomaly detection across different datasets and train–test splits.
7.5. Comparison Performance with Related Works
In this section, we have included in the comparison only models that were tested on the entire set of features of the datasets rather than on a subset of features, since the selection of only a sample of features could introduce bias in accuracy and lead to a feature selection process that is not suitable for real-time applications. This ensures a fair evaluation and highlights the superiority of direct feature extraction over signal reconstruction, making the proposed ECG-CBA model more accurate, robust, and efficient for real-time ECG classification.
In
Table 9, we compare ECG classification models based on two primary approaches: signal reconstruction-based models and feature extraction-based models. Signal reconstruction-based models, such as ECG-NET [
14] (98.36% on ECG5000), AttentivECGRU [
28] (99.14% on ECG5000), and Qin et al. [
15] (95.50% on MIT-BIH), rely on reconstructing ECG signals and identifying anomalies based on reconstruction errors. While these models achieve high accuracy, they are sensitive to noise and struggle with detecting unseen abnormal patterns. In contrast, feature extraction-based models, including Farag et al. [
23] (98.18% on MIT-BIH), and Pham et al. [
26] (98.28% on MIT-BIH), directly learn discriminative features from ECG signals, improving robustness and adaptability.
Notably, almost all existing models have been tested on a single dataset, making it difficult to assess their generalizability across different ECG sources. To the best of our knowledge, no prior model has been evaluated on both the ECG5000 and MIT-BIH datasets. In contrast, the proposed ECG-CBA model has been tested on both datasets, demonstrating superior adaptability and robustness. As shown, ECG-CBA outperforms other models, achieving 99.60% on ECG5000 and 98.80% on MIT-BIH. These results confirm that ECG-CBA provides a highly accurate, reliable, and generalizable solution for real-time ECG anomaly detection.