1. Introduction
Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, accounting for nearly 17.9 million deaths each year according to the World Health Organization. Arrhythmias, representing abnormal heart rhythms, are among the most prevalent manifestations of CVDs and significantly increase the risk of stroke [1], sudden cardiac arrest [2], and heart failure [3]. The electrocardiogram (ECG) is a non-invasive, low-cost, and widely adopted diagnostic tool that records the heart’s electrical activity and plays a crucial role in detecting arrhythmias [4], myocardial ischemia [5], and conduction abnormalities [6]. Consequently, accurate and timely arrhythmia diagnosis from ECG signals is critical for improving clinical outcomes.
However, ECG interpretation remains a complex task due to the signal’s non-linear nature [7]. ECG waveforms contain morphological components such as P waves, QRS complexes, and T waves that occur across multiple temporal scales [8]. Clinicians identify arrhythmias mainly through inspection of beat-to-beat timing irregularities, especially variations in R–R intervals. Classical signal-processing techniques such as spectral analysis, autoregressive modeling, and time–frequency representations have also been employed to detect rhythm-related abnormalities and uncover physiological oscillations that may not be visible in the raw waveform [9]. These methods offer valuable insights but may not fully capture the complex morphological deformations and multi-scale temporal dynamics that arise during arrhythmic episodes [9]. Furthermore, ECG morphology is influenced by physiological variations, electrode placement, and noise sources such as baseline drift and electromyographic interference [10,11,12]. During arrhythmic episodes, substantial morphological and temporal variability further complicates automated feature extraction and classification [13]. These challenges make automated ECG arrhythmia detection both technically demanding and clinically important.
Early approaches relied on handcrafted feature extraction combined with traditional machine learning algorithms such as support vector machines (SVMs), k-nearest neighbors (k-NN), and hidden Markov models (HMMs) [14,15,16]. While these methods achieved moderate success in specific tasks, their dependence on expert-designed features limits scalability, generalization, and adaptability to diverse patient populations. In contrast, deep learning has transformed ECG analysis by enabling end-to-end feature learning directly from raw signals. Convolutional neural networks (CNNs) effectively capture local morphological patterns of ECG beats, while recurrent neural networks (RNNs) and long short-term memory (LSTM) architectures model sequential dependencies [17,18,19]. More recently, Transformer architectures, initially developed for natural language processing, have demonstrated strong potential for modeling long-range dependencies in biomedical signals [20,21,22]. Nevertheless, CNNs often struggle to encode global context, whereas Transformers may overlook fine-grained morphological details that are crucial for precise arrhythmia classification [23,24].
To address these limitations, hybrid CNN–Transformer architectures have gained increasing attention. By integrating CNNs’ local feature extraction capabilities with Transformers’ global contextual modeling, hybrid frameworks provide complementary advantages. However, most existing methods employ serial or cascaded designs, where the Transformer is stacked on top of CNN features [25,26,27]. Such designs may fuse information at a coarse level and fail to adaptively balance local and global contributions. Consequently, their feature representations can be suboptimal for complex arrhythmia patterns requiring both detailed morphology and long-range temporal understanding.
In this study, we propose a Hybrid CNN–Transformer Network with Gated Fusion (HCTG-Net) for automatic ECG arrhythmia diagnosis. HCTG-Net adopts a dual-branch architecture consisting of (1) a residual CNN branch for efficient extraction of local morphological features, and (2) a Transformer branch with positional encoding for capturing long-range temporal dependencies. A learnable gated fusion module dynamically aligns and balances the outputs of both branches on a per-dimension basis, ensuring adaptive integration of local and global representations.
The main contributions of this study are summarized as follows:
We propose HCTG-Net, a hybrid deep learning architecture that jointly models fine-grained local morphological features and long-range temporal dependencies from ECG signals.
We develop a gated fusion mechanism that adaptively aligns and balances CNN and Transformer features at a per-dimension level, enhancing representational complementarity.
We conduct comprehensive experiments on benchmark arrhythmia datasets, demonstrating that HCTG-Net achieves superior accuracy, robustness, and generalization compared to existing methods.
The remainder of this paper is structured as follows. Section 2 describes the dataset, preprocessing procedures, and the proposed HCTG-Net architecture. Section 3 presents the experimental setup, performance evaluation, and comparative results. Section 4 provides a discussion of the findings and their implications. Finally, Section 5 concludes the paper and outlines directions for future research.
2. Methods
The proposed HCTG-Net framework for automatic ECG arrhythmia diagnosis consists of three main stages: dataset acquisition, preprocessing of ECG signals, and classification using a hybrid CNN–Transformer gated fusion network. The overall workflow is summarized in Figure 1.
2.1. Dataset
This study used the MIT-BIH Arrhythmia Database, one of the most widely adopted and influential benchmarks in arrhythmia research [28]. The database contains 48 half-hour two-channel ECG recordings sampled at 360 Hz and collected from 47 subjects: 25 men aged between 32 and 89 years and 22 women aged between 23 and 89 years. Each record includes two simultaneously acquired ECG signals. In most recordings, the upper signal corresponds to a modified limb lead II (MLII) collected from chest-mounted electrodes, while the lower signal typically corresponds to a modified precordial lead such as V1, occasionally V2, V4, or V5. This lead configuration was standard practice in the BIH Arrhythmia Laboratory to ensure robust detection of arrhythmic events. All experiments in this study used the MLII signal as the input. Each record also includes expert annotations of beat locations and arrhythmia types, resulting in approximately 110,000 labeled heartbeats. Furthermore, the recordings contain realistic noise and artifacts such as baseline drift, muscle interference, and electrode motion. These noisy segments closely resemble real-world ECG signals encountered in clinical practice and wearable monitoring, making MIT-BIH an appropriate dataset for assessing the practical applicability of deep learning models. Detailed information about the MIT-BIH database is provided on PhysioNet at https://physionet.org/content/mitdb/1.0.0/ (accessed on 16 August 2025).
To ensure comparability with prior work, beat annotations were merged into five categories according to the AAMI standard [29]. Specifically, category N includes normal beats, left and right bundle branch blocks, atrial escape beats, and nodal escape beats, which share stable and characteristic morphology. Category S includes atrial premature beats, aberrant atrial premature beats, nodal premature beats, and supraventricular premature beats, typically presenting subtle variations in atrial activation and timing irregularities. Category V includes premature ventricular contractions and ventricular escape beats, which exhibit prominent morphological deviations in the QRS complex. Category F represents fusion beats formed by the overlap of normal and ventricular depolarization. Category Q contains unclassified or unknown beats that cannot be reliably assigned to the other groups due to distortion, noise, or atypical morphology.
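The grouping above can be written as a small lookup table. The sketch below is illustrative only: it assumes the standard PhysioNet beat-annotation symbols, and placing the paced-beat symbols `/` and `f` under Q follows a common convention that the text above does not state. `AAMI_GROUPS` and `to_aami` are hypothetical names.

```python
# MIT-BIH beat symbols grouped into the five AAMI classes described in the text.
# Assumption: '/' (paced) and 'f' (fusion of paced and normal) are placed in Q,
# following common practice; the text only mentions unclassified beats for Q.
AAMI_GROUPS = {
    "N": ["N", "L", "R", "e", "j"],  # normal, LBBB, RBBB, atrial escape, nodal escape
    "S": ["A", "a", "J", "S"],       # atrial, aberrant atrial, nodal, supraventricular premature
    "V": ["V", "E"],                 # premature ventricular contraction, ventricular escape
    "F": ["F"],                      # fusion of ventricular and normal
    "Q": ["Q", "/", "f"],            # unclassifiable (plus paced symbols, see assumption)
}

# Invert the grouping so each symbol maps directly to its class.
SYMBOL_TO_AAMI = {sym: cls for cls, syms in AAMI_GROUPS.items() for sym in syms}


def to_aami(symbols):
    """Map beat symbols to AAMI classes, silently dropping non-beat annotations."""
    return [SYMBOL_TO_AAMI[s] for s in symbols if s in SYMBOL_TO_AAMI]


# Example: '+' is a rhythm-change marker, not a beat, so it is dropped.
labels = to_aami(["N", "L", "V", "+", "A", "F", "Q"])
```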
2.2. Data Preprocessing
A standardized pipeline was designed to improve signal quality and produce inputs suitable for robust heartbeat classification. The procedure included denoising, downsampling, segmentation, normalization, and class balancing.
First, baseline wander and high-frequency noise were attenuated while preserving key morphological characteristics such as P waves, QRS complexes, and T waves. The signals were then downsampled from 360 Hz to 125 Hz to reduce computational cost without sacrificing diagnostically relevant information [30]. This step retained essential temporal and morphological content while producing a compact representation for deep models.
Heartbeat segmentation was performed using the annotated R peaks from the database. For each R peak, a fixed-length window of 188 samples was extracted, consisting of 90 samples before the R wave and 98 samples after it. At 125 Hz, this window spans roughly 1.5 s and covers the complete P–QRS–T complex of a cardiac cycle. To mitigate inter-subject amplitude variability, each segmented heartbeat underwent z-score normalization [31]:
$$x_{\mathrm{norm}} = \frac{x - \mu}{\sigma + \epsilon} \tag{1}$$
where $x$ represents the signal segment, $\mu$ and $\sigma$ are its mean and standard deviation, respectively, and $\epsilon$ is a small constant added to prevent division by zero. This normalization procedure standardized the dynamic range of the signals, removing amplitude offset across subjects. It also ensures that signals with different magnitudes do not bias the model, stabilizes gradient updates, and improves training convergence.
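The segmentation and normalization steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the value of the small constant, the edge handling (peaks too close to the record boundaries are skipped), and the toy signal are all assumptions.

```python
import numpy as np

PRE, POST = 90, 98   # samples before / after the R peak (188 in total, from the text)
EPS = 1e-8           # assumed value for the small stabilizing constant


def segment_beats(signal, r_peaks, pre=PRE, post=POST):
    """Cut one fixed-length window per R peak; skip peaks too close to the edges."""
    beats = [signal[r - pre:r + post] for r in r_peaks
             if r - pre >= 0 and r + post <= len(signal)]
    return np.asarray(beats, dtype=np.float64).reshape(-1, pre + post)


def zscore(beats, eps=EPS):
    """Per-beat z-score normalization: (x - mean) / (std + eps)."""
    mu = beats.mean(axis=1, keepdims=True)
    sigma = beats.std(axis=1, keepdims=True)
    return (beats - mu) / (sigma + eps)


# Toy example with a random "signal"; real use would pass the downsampled ECG
# and the R-peak sample indices rescaled to the 125 Hz time base.
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
r_peaks = [50, 200, 500, 950]      # the first and last peaks are too close to the edges
beats = zscore(segment_beats(signal, r_peaks))
```

Normalizing each beat separately, rather than each record, is what removes amplitude differences between subjects, at the cost of discarding absolute amplitude as a feature.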
Finally, to address the significant class imbalance present in the MIT-BIH dataset, the Synthetic Minority Over-sampling Technique (SMOTE) was employed [32]. Rather than simply replicating minority-class waveforms, SMOTE generates synthetic samples for minority classes by interpolating between existing samples in the feature space. This approach increases data diversity and reduces the overfitting caused by repeated instances. The resampling process was applied only to the training set to preserve the integrity of the test data and to ensure an unbiased performance evaluation.
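The interpolation idea behind SMOTE can be illustrated with a short NumPy sketch. It is a simplified stand-in, not the implementation used in the study: `smote_sketch` is a hypothetical name, the neighbour count and seed are assumed, and a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating each chosen sample
    toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)                      # cannot have more neighbours than other samples
    # Pairwise squared Euclidean distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)           # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                     # anchor samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]     # one neighbour per anchor
    lam = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[nb] - X_min[base])


# Toy minority class: 10 beats of 188 samples each (random values for illustration).
X_min = np.random.default_rng(1).normal(size=(10, 188))
synthetic = smote_sketch(X_min, n_new=25)
```

Because every synthetic beat lies on the line segment between two real beats, the split into training and test data has to happen before resampling, as the text notes; otherwise interpolated copies of test beats could leak into training.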
2.3. ECG Signal Classification with HCTG-Net
The proposed HCTG-Net integrates a CNN branch and a Transformer branch, followed by a gated fusion module that adaptively combines the two representations. This design allows the model to jointly capture local morphological features and long-range dependencies in ECG signals, which are both crucial for robust arrhythmia detection.
Figure 2 summarizes the architecture. Specifically, Figure 2a presents the overall network structure of HCTG-Net, comprising the CNN branch, the Transformer branch, and the gated fusion module. Figure 2b details the Transformer branch, which models long-range temporal dependencies. Figure 2c shows the CNN branch, which emphasizes local morphological patterns. Figure 2d depicts the gated fusion module, which dynamically balances and integrates the local and global representations.
The CNN branch is designed to emphasize local waveform features such as QRS complexes and P/T-wave variations [33]. It begins with a stem convolutional layer consisting of 64 filters of size 7, followed by three residual blocks. Each residual block contains two convolutional layers with batch normalization and ReLU activation. Skip connections are included to stabilize gradient propagation and mitigate the vanishing gradient problem. The operation of a residual block can be expressed as:
$$y = \mathcal{F}(x) + \mathcal{S}(x) \tag{2}$$
where $x$ is the block input, $\mathcal{F}(x)$ denotes the stacked convolutional transformations, and $\mathcal{S}(x)$ is either an identity mapping or a 1 × 1 convolution depending on dimensional alignment. The first block maintains the feature resolution with 64 channels, the second block halves the temporal resolution while increasing the depth to 128, and the third block expands the channel depth to 256. Finally, an adaptive average pooling layer produces a fixed-length representation $f_{\mathrm{cnn}} \in \mathbb{R}^{256}$.
The Transformer branch focuses on learning global dependencies across the entire heartbeat sequence [34]. Each signal segment is first projected into an embedding space of dimension $d_{\mathrm{model}}$. Since Transformers do not have an inherent notion of temporal order, sinusoidal positional encoding is added to the embedded sequence to explicitly represent the time position of each sample within the 188-point heartbeat window. Equation (3) defines this sinusoidal positional encoding:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{3}$$
The embedded sequence is processed by two stacked Transformer blocks, each comprising a multi-head self-attention (MHSA) mechanism, a position-wise feed-forward network (FFN), and residual connections with layer normalization. The MHSA mechanism is defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{4}$$
where queries $Q$, keys $K$, and values $V$ are derived from linear projections of the input sequence, and $d_k$ is the per-head key dimension. This mechanism enables the model to jointly capture relationships between distant signal positions. After Transformer processing, an adaptive average pooling layer aggregates the temporal information into a fixed-length vector $f_{\mathrm{trans}} \in \mathbb{R}^{d_{\mathrm{model}}}$.
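The two ingredients of the Transformer branch described above, sinusoidal positional encoding and scaled dot-product self-attention, can be sketched in NumPy as follows. This is a single-head illustration under stated assumptions: the embedding width of 64 is hypothetical (it only needs to be even here), the projection matrices are random rather than trained, and the function names are not from the paper.

```python
import numpy as np


def sinusoidal_pe(seq_len, d_model):
    """Standard sinusoidal positional encoding; d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]         # even indices 0, 2, 4, ...
    angle = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dimensions
    pe[:, 1::2] = np.cos(angle)                       # odd dimensions
    return pe


def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights


SEQ_LEN, D_MODEL = 188, 64                            # 188 from the text; 64 is assumed
rng = np.random.default_rng(0)
pe = sinusoidal_pe(SEQ_LEN, D_MODEL)
x = rng.normal(size=(SEQ_LEN, D_MODEL)) + pe          # stand-in for the embedded beat
Wq, Wk, Wv = (rng.normal(scale=D_MODEL ** -0.5, size=(D_MODEL, D_MODEL)) for _ in range(3))
out, weights = attention(x @ Wq, x @ Wk, x @ Wv)
pooled = out.mean(axis=0)                             # average pooling over time
```

Each row of `weights` shows how strongly one sample position attends to every other position in the window, which is the mechanism that lets distant parts of the beat, such as the P wave and the T wave, influence each other directly.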
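To integrate the complementary CNN and Transformer features, both vectors are projected into a shared latent space of dimension H = 256:
$$h_{\mathrm{cnn}} = W_c f_{\mathrm{cnn}} + b_c, \qquad h_{\mathrm{trans}} = W_t f_{\mathrm{trans}} + b_t \tag{5}$$
A gating network then generates an adaptive weight vector [35]:
$$g = \sigma\big(\mathrm{MLP}([h_{\mathrm{cnn}};\, h_{\mathrm{trans}}])\big) \tag{6}$$
where $\sigma$ denotes the sigmoid function and $[\cdot\,;\cdot]$ represents vector concatenation. The MLP projects the concatenated features from dimension 2H to H through two fully connected layers. The fused representation is computed as:
$$f_{\mathrm{fused}} = g \odot h_{\mathrm{cnn}} + (1 - g) \odot h_{\mathrm{trans}} \tag{7}$$
where $\odot$ denotes element-wise multiplication. This mechanism allows the network to dynamically emphasize either local morphology or global temporal context depending on the characteristics of the input signal.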
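A forward pass through the gating step can be sketched in NumPy. Several details here are assumptions rather than the paper's specification: the convex blend `g * h_cnn + (1 - g) * h_trans`, the ReLU between the two MLP layers, the hidden width of H, the Transformer feature width of 128, and the random weights standing in for trained parameters.

```python
import numpy as np

H = 256  # shared latent dimension stated in the text


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def linear(rng, n_in, n_out):
    """Random (weight, bias) pair standing in for a trained linear layer."""
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)


def gated_fusion(f_cnn, f_trans, p):
    """Project both branches to H dims, compute a per-dimension gate, and blend."""
    h_cnn = f_cnn @ p["cnn"][0] + p["cnn"][1]
    h_trans = f_trans @ p["trans"][0] + p["trans"][1]
    z = np.concatenate([h_cnn, h_trans], axis=-1)            # dimension 2H
    hidden = np.maximum(z @ p["fc1"][0] + p["fc1"][1], 0.0)  # 2H -> H, ReLU assumed
    gate = sigmoid(hidden @ p["fc2"][0] + p["fc2"][1])       # H -> H, values in (0, 1)
    fused = gate * h_cnn + (1.0 - gate) * h_trans            # assumed convex blend
    return fused, gate


rng = np.random.default_rng(0)
D_CNN, D_TRANS = 256, 128            # 256 from the CNN branch; 128 is hypothetical
params = {
    "cnn": linear(rng, D_CNN, H),
    "trans": linear(rng, D_TRANS, H),
    "fc1": linear(rng, 2 * H, H),
    "fc2": linear(rng, H, H),
}
f_cnn = rng.normal(size=(4, D_CNN))      # batch of 4 pooled CNN features
f_trans = rng.normal(size=(4, D_TRANS))  # batch of 4 pooled Transformer features
fused, gate = gated_fusion(f_cnn, f_trans, params)
```

A gate value near 1 in a given dimension means that dimension of the fused vector is taken almost entirely from the CNN projection, and a value near 0 means it comes from the Transformer projection; inspecting `gate` per class is one way to check which branch the model relies on.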
The fused feature vector is passed through a fully connected layer with 128 hidden units, followed by ReLU activation and dropout regularization [36]. The final layer applies a softmax function to obtain class probabilities for the five arrhythmia categories:
$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{5} \exp(z_j)} \tag{8}$$
where $z_k$ represents the logit produced by the final linear layer for class $k$, and $p_k$ denotes the corresponding normalized probability. Equation (8) formalizes the probabilistic decision-making stage of HCTG-Net. The softmax operator transforms the model’s learned representation into a probability distribution across the five heartbeat categories. This process provides several important signal-interpretation benefits. First, it enables multi-class classification by assigning each heartbeat to the class with the highest probability. Second, the probability values reflect the model’s confidence in its decision, allowing uncertainty-aware interpretation of ambiguous or noisy ECG beats. Third, the softmax formulation is compatible with cross-entropy loss, which ensures stable and efficient training of the network.
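As a worked illustration of the softmax stage, the sketch below converts five logits into class probabilities and picks the most likely class. The logit values are made up for the example, and subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math

CLASSES = ["N", "S", "V", "F", "Q"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5, -1.0, -2.0]          # hypothetical outputs of the final layer
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]  # class with the highest probability
```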
The configuration of HCTG-Net is summarized in
Table 1. While the table summarizes the structural configuration of each module in HCTG-Net, it is also important to highlight the specific role and contribution of these modules to ECG arrhythmia classification. The CNN branch is responsible for extracting local morphological features such as P-wave shape, QRS duration, and T-wave polarity, which are crucial for distinguishing normal, ventricular, and supraventricular beats. The use of residual blocks further stabilizes deep feature learning and preserves diagnostic waveform characteristics. The Transformer branch complements this by modeling long-range temporal dependencies within each heartbeat, enabling the network to capture contextual relationships and subtle conduction abnormalities that may not be identifiable from local morphology alone. The gated fusion module represents the key novelty of HCTG-Net: it adaptively balances and integrates the CNN-derived morphological features with the Transformer-derived global context on a per-dimension basis. This dynamic weighting mechanism allows the model to emphasize morphology-dominant cues for ventricular beats and context-dominant cues for supraventricular patterns, leading to more discriminative fused representations.
In summary, HCTG-Net combines a CNN branch for precise morphological analysis, a Transformer branch for comprehensive temporal modeling, and a gated fusion mechanism for adaptive feature integration. This hybrid design yields a unified representation that is highly effective for arrhythmia classification.
3. Results
3.1. Experimental Setup
All experiments were conducted on a personal computer equipped with an NVIDIA RTX 3080Ti GPU, an Intel Core i7 processor, and 64 GB of RAM. The implementation was carried out in Python 3.10 using the PyTorch 2.6.0 deep learning framework [
37]. Training was performed with the Adam optimizer, an initial learning rate of
, and weight decay of
. A cosine annealing learning rate scheduler with
epochs and a minimum learning rate of
was employed to improve convergence. The batch size was set to 256, and gradient clipping was applied with a maximum norm of 1.0 to stabilize training [
38].
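The learning-rate schedule and gradient clipping described above can be sketched in plain Python. The numeric values in the example (initial rate, minimum rate, and schedule length) are placeholders chosen for illustration, not the paper's settings; only the clipping norm of 1.0 comes from the text. In PyTorch the corresponding utilities are `torch.optim.lr_scheduler.CosineAnnealingLR` and `torch.nn.utils.clip_grad_norm_`.

```python
import math


def cosine_lr(epoch, lr_init, lr_min, t_max):
    """Cosine annealing: decay from lr_init at epoch 0 to lr_min at epoch t_max."""
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


# Placeholder hyperparameters (assumptions, not taken from the paper).
LR_INIT, LR_MIN, T_MAX = 1e-3, 1e-6, 50
schedule = [cosine_lr(e, LR_INIT, LR_MIN, T_MAX) for e in range(T_MAX + 1)]

clipped = clip_by_global_norm([3.0, 4.0])     # norm 5.0 is scaled down to about 1.0
unchanged = clip_by_global_norm([0.3, 0.4])   # norm 0.5 is left as is
```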
After constructing the complete set of segmented heartbeat samples, the resulting beat-level dataset was randomly divided into training, validation, and testing subsets. For model development, 80% of the training data was used for training, while the remaining 20% served as the validation set for hyperparameter tuning and early stopping. The final test set was held out and used exclusively for the evaluation of model performance.
3.2. Evaluation Metrics
The classification performance of HCTG-Net was evaluated using multiple quantitative metrics to ensure a comprehensive assessment. The selected metrics included accuracy, precision, recall, and the F1-score [39,40]. These metrics together provided a balanced view of both overall and class-wise performance.
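Accuracy measured the overall proportion of correctly classified samples and was computed as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Precision quantified the proportion of correctly predicted positive samples among all predicted positives:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$
Recall represented the proportion of actual positive samples correctly identified by the model:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$
The F1-score was calculated as the harmonic mean of precision and recall, providing a single measure that balances both sensitivity and precision:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$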
To clearly highlight the characteristics and practical differences among the four evaluation metrics used in this study, we provide a comparative summary in Table 2.
The combination of accuracy, precision, recall, and F1-score offered a comprehensive evaluation of the model’s performance. Accuracy reflected the overall classification capability, precision measured how reliable the positive predictions for each class were, and recall measured the model’s sensitivity to the true instances of each class. The F1-score captured the trade-off between these two measures, ensuring a fair and reliable assessment of arrhythmia classification performance on the MIT-BIH dataset.
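For a multi-class problem, these metrics are typically computed per class in a one-vs-rest fashion from the confusion matrix and then macro-averaged. The sketch below shows that computation on a made-up 3-class matrix; the numbers are illustrative and unrelated to the reported results, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j.
    Returns overall accuracy and a list of (precision, recall, f1) per class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true class k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp    # predicted k, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, per_class


# Toy confusion matrix (rows: true class, columns: predicted class).
cm = [[50, 2, 1],
      [3, 40, 2],
      [0, 5, 45]]
accuracy, per_class = classification_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in per_class) / len(per_class)
```

Macro-averaging weights every class equally, so a weak minority class lowers the macro F1-score even when overall accuracy stays high.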
3.3. Overall Performance of HCTG-Net
The proposed HCTG-Net was comprehensively evaluated on the MIT-BIH Arrhythmia Database according to the AAMI heartbeat classification standard.
Table 3 summarizes the classification results across the five major heartbeat categories, reporting precision, recall, and F1-score for each class, together with the overall average metrics. The detailed performance distribution is also illustrated in
Figure 3.
The results in
Table 3 demonstrate that HCTG-Net achieves consistently high performance across all heartbeat classes. The N and Q classes exhibit near-perfect precision, recall, and F1-score, indicating that the model is highly effective at recognizing both common and miscellaneous waveform patterns. The V class also shows excellent performance, with an F1-score of 0.9839. This result highlights the model’s ability to capture the distinctive morphological characteristics of ventricular arrhythmias, which are clinically critical to detect accurately.
Performance for the S and F categories is slightly lower, with F1-scores of 0.9338 and 0.9441, respectively. These reductions are expected due to the inherent class imbalance in the MIT-BIH dataset and the subtle morphological similarities between these arrhythmias and normal beats. The S class in particular is known to exhibit diverse and less pronounced waveform deviations, making it a challenging category for automated classification. Nevertheless, HCTG-Net still maintains strong performance for these minority classes, demonstrating its ability to generalize across both dominant and less frequent arrhythmia types.
Aggregated results further confirm the strong classification capability of the proposed model. The macro-average precision of 0.9774 and macro-average F1-score of 0.9711 reflect balanced performance across all classes, independent of class frequency. The weighted averages, which account for class distribution, are even higher due to the excellent performance on dominant categories. The overall accuracy of 0.9946 demonstrates that the hybrid CNN–Transformer architecture and gated fusion module effectively capture both local morphological details and long-range temporal dependencies in ECG signals.
A representative example of the input data analyzed by HCTG-Net is shown in
Figure 4, which displays three 10-s segments of the original MIT-BIH ECG recording. Each heartbeat in the segment is annotated with its corresponding AAMI class label. This visualization highlights the morphological variability among beats.
To further analyze the training stability and convergence behavior of HCTG-Net, the training and validation loss curves across epochs are illustrated in Figure 5. As shown, the training loss decreases rapidly during the initial phase and gradually plateaus after convergence, indicating effective learning and stable optimization. The smooth convergence trend demonstrates that the proposed HCTG-Net enables efficient gradient propagation and stable feature learning during training.
In addition to accuracy and convergence performance, computational efficiency was also evaluated to assess the practicality of HCTG-Net in real-time clinical applications. During inference, the model processed 10,000 heartbeat samples in 1.507 s, corresponding to an average processing time of approximately 0.15 ms per sample. This low latency demonstrates that HCTG-Net is capable of near real-time arrhythmia classification, making it suitable for continuous ECG monitoring scenarios such as Holter analysis, bedside cardiac surveillance, and wearable health devices.
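The latency figure follows directly from the reported totals, as the short calculation below shows. The heart rate used to estimate real-time headroom is a hypothetical example value, and the estimate covers only the classification step on the reported hardware, not preprocessing or data transfer.

```python
N_BEATS = 10_000      # heartbeat samples processed (from the text)
TOTAL_S = 1.507       # total inference time in seconds (from the text)

ms_per_beat = TOTAL_S / N_BEATS * 1000.0      # about 0.15 ms per beat
beats_per_second = N_BEATS / TOTAL_S          # classification throughput

# Hypothetical worst case: a fast rhythm of 180 beats per minute.
HEART_RATE_BPM = 180
incoming_beats_per_second = HEART_RATE_BPM / 60.0
headroom = beats_per_second / incoming_beats_per_second   # how many times faster than needed
```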
3.4. Confusion Matrix Analysis
To further evaluate the classification capability of HCTG-Net, a confusion matrix was generated on the MIT-BIH test set, as shown in
Figure 6. The confusion matrix provides a detailed visualization of the model’s ability to accurately identify each heartbeat type and the nature of the misclassification patterns.
The confusion matrix shows that the majority class N achieved the highest recognition performance, with the vast majority of samples correctly classified. Only a very small number of N beats were misclassified as S, V, F, or Q, indicating that the model effectively captures the typical P–QRS–T morphology and rhythm characteristics of normal ECG patterns. The strong diagonal dominance for this class confirms its clear separability and reflects the abundance of normal beats available during training.
The Q class shows a recall of 0.997, with minor confusion distributed across N, V, and F. Because this category includes a heterogeneous set of beats that do not fit into other classes, occasional overlap is consistent with its diverse waveform characteristics.
The V class also exhibits excellent separability. This result suggests that the CNN branch of HCTG-Net successfully extracts discriminative morphological features, such as widened QRS complexes or abnormal ventricular depolarization patterns, which are characteristic of V beats. The high recall for this class is clinically important because ventricular arrhythmias often carry greater diagnostic risk [
41].
The F class obtains a recall of 0.937, with limited confusion primarily involving N and V. Since fusion beats combine features of both normal and ventricular depolarizations, slight ambiguity is unavoidable. The model’s ability to classify most fusion beats correctly demonstrates that the gated fusion mechanism effectively integrates local and global features to distinguish their hybrid morphology. The S class achieved a recall of 0.919, with occasional misclassification as N, likely due to morphological similarities between supraventricular and normal beats.
Overall, the confusion matrix confirms the robustness of HCTG-Net in accurately distinguishing both majority and minority heartbeat categories.
While the confusion matrix demonstrates the overall robustness of HCTG-Net, it is also essential to consider the clinical implications of its residual misclassifications. Compared with conventional handcrafted feature-based methods [42,43,44], which rely on fixed morphological descriptors, the proposed model reduces the likelihood of clinically critical errors, such as misclassifying V beats as N beats. These high-risk confusions are particularly concerning, as they may lead to missed diagnoses of ventricular arrhythmias. The deep hierarchical representations learned by HCTG-Net effectively suppress such errors, highlighting its enhanced sensitivity to subtle morphological and temporal variations in ECG waveforms.
Furthermore, the architecture of HCTG-Net inherently supports probabilistic confidence estimation for each prediction, providing a practical mechanism for clinical decision support. Predictions with low confidence can be automatically flagged as uncertain cases, prompting cardiologists to conduct additional diagnostic evaluations—such as repeat Holter monitoring, exercise ECG, dobutamine stress ECG, or a short 12-lead ECG—to confirm the diagnosis. This feedback mechanism transforms the model from a passive classifier into an active assistant, thereby enhancing diagnostic reliability and helping clinicians prioritize cases that require closer examination. Integrating model confidence into the clinical workflow could further improve patient safety and increase the trustworthiness of AI-assisted arrhythmia screening in real-world applications.
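The flagging mechanism described above could be as simple as thresholding the highest class probability. The sketch below is one possible form, not the authors' procedure: the `triage` helper and the 0.90 threshold are hypothetical, and in practice the threshold would need to be chosen on validation data.

```python
def triage(probs, classes=("N", "S", "V", "F", "Q"), threshold=0.90):
    """Return the predicted class and flag the beat for review when the
    highest class probability falls below the (assumed) threshold."""
    confidence = max(probs)
    label = classes[probs.index(confidence)]
    return {"label": label, "confidence": confidence, "needs_review": confidence < threshold}


confident = triage([0.97, 0.01, 0.01, 0.005, 0.005])   # clear normal beat
ambiguous = triage([0.50, 0.45, 0.03, 0.01, 0.01])     # N versus S is uncertain
```

A maximum-probability rule is the simplest choice; alternatives such as the margin between the top two classes would single out the same N versus S ambiguity in the second example.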
3.5. Visualization of Feature Representation
To gain deeper insights into the feature learning capability of HCTG-Net, the high-dimensional representations from the final fused feature space were visualized using t-distributed Stochastic Neighbor Embedding (t-SNE) [
45]. This nonlinear dimensionality reduction technique projects the learned 256-dimensional embeddings onto a two-dimensional space while preserving local neighborhood relationships, thereby illustrating the clustering behavior of different heartbeat classes.
As shown in
Figure 7, samples belonging to the same category form compact clusters, with distinct boundaries observed between most classes. Normal beats N and fusion beats F form clear and well-isolated clusters, indicating that HCTG-Net effectively captures their discriminative morphological characteristics. Supraventricular arrhythmias S and the Q category exhibit more scattered distributions, reflecting the greater intra-class variability and morphological diversity of these beats. A small degree of overlap is observed between V and Q, suggesting partial similarity in certain waveform patterns or temporal features.
These findings confirm that HCTG-Net successfully learns discriminative and semantically meaningful feature representations by leveraging the complementary strengths of its CNN and Transformer branches.
3.6. Comparative Study with Previous Methods
To further validate the effectiveness of the proposed HCTG-Net, we compared its performance with several existing methods previously reported in the literature on the MIT-BIH Arrhythmia Database. The compared studies are those of Curtin et al. [
44], Wang et al. [
46], Sharma et al. [
47], Kumar et al. [
48], Zhuang et al. [
49], Wu et al. [
50], Irsyad et al. [
51], Tchinda et al. [
52], Choudhury et al. [
53], and Bahrami et al. [
54].
Table 4 summarizes the comparative results in terms of accuracy, precision, recall, and F1-score.
From the results in
Table 4, it is evident that the proposed HCTG-Net achieves the highest overall accuracy (0.9946) and F1-score (0.9711) among all competing methods. Although its precision (0.9774) is slightly lower than that of Kumar et al., Zhuang et al. and Sharma et al., HCTG-Net demonstrates superior balance across all metrics, achieving both high recall and excellent generalization capability. This balance indicates that the model not only identifies arrhythmias accurately but also maintains strong robustness against class imbalance and morphological variability in ECG waveforms.
It is also worth noting that the earlier study by Curtin et al. [44], which represents a typical handcrafted feature-based approach, achieved an overall accuracy of 0.9430 and F1-score of 0.9653. Such methods rely heavily on manually designed morphological descriptors, statistical parameters, and expert-defined thresholds. While these approaches are effective in capturing certain QRS complex characteristics, they are inherently limited in modeling complex waveform morphologies and long-range temporal dependencies. In contrast, HCTG-Net leverages deep hierarchical feature extraction and gated fusion between CNN and Transformer branches, enabling it to learn both local morphological and global temporal patterns directly from raw ECG data without manual feature engineering. This capability allows HCTG-Net to achieve markedly higher accuracy and robustness, confirming its superiority over traditional handcrafted algorithms in handling diverse and noisy ECG signals.
3.7. Ablation Study
In this section, an ablation study is carried out to investigate the influence of the major architectural components of the proposed HCTG-Net on arrhythmia classification performance. The analysis focuses on three core modules: the CNN branch, responsible for extracting local morphological features from ECG signals; the Transformer branch, which captures long-range temporal dependencies within each heartbeat window; and the gated fusion mechanism, which adaptively integrates features from both branches. To systematically examine the contribution of each component, several simplified variants of HCTG-Net were evaluated. These include a CNN-only version that focuses solely on local feature learning, a Transformer-only version emphasizing global temporal context, a hybrid CNN–Transformer model without the gating mechanism, and the full proposed HCTG-Net with all components enabled. All experiments were conducted on the MIT-BIH Arrhythmia Database under identical preprocessing, training, and testing conditions. The results of the ablation analysis are summarized in
Table 5.
From Table 5, it can be observed that the CNN-only model is superior to the Transformer-only model in terms of accuracy but falls short in recall, demonstrating that convolutional layers are adept at capturing localized ECG patterns such as QRS complexes but are less effective in modeling the temporal dependencies required for detecting complex arrhythmias. The Transformer-only configuration shows an improvement in recall, indicating its capability to model long-range dependencies, though it slightly sacrifices precision due to the absence of locally extracted morphological information.
When both the CNN and Transformer branches are combined without the gated fusion mechanism, performance improves significantly, verifying that local and global features are complementary in ECG classification. However, without adaptive weighting, the feature combination may be redundant or uneven, leading to suboptimal learning of discriminative representations.
The full HCTG-Net, which integrates CNN, Transformer, and the gated fusion mechanism, achieves the highest overall accuracy and F1-score. The learnable gating structure dynamically adjusts the contribution of each branch according to the input characteristics, allowing the model to emphasize local features during sharp waveform transitions and global dependencies when temporal variations dominate. This adaptability ensures balanced performance across all metrics and enhances model robustness.
4. Discussion
The results obtained on the MIT-BIH Arrhythmia Database demonstrate that the proposed HCTG-Net achieves both high overall performance and balanced behavior across all heartbeat categories. As summarized in Table 4, HCTG-Net attains the highest accuracy, recall, and F1-score among all compared methods. This suggests that the model is not only capable of correctly classifying the dominant N beats, but is also effective at detecting less frequent yet clinically important arrhythmias such as S and F beats. In addition, the very low inference latency indicates that the method is suitable for real-time monitoring scenarios, which is essential for practical deployment in ambulatory and bedside environments.
From a modeling perspective, the main strength of HCTG-Net lies in its ability to integrate complementary local and global information within a single unified architecture. The CNN branch specializes in extracting fine-grained morphological features such as QRS width, P-wave and T-wave shapes, and subtle repolarization changes. These local patterns are critical for distinguishing, for example, ventricular beats from normal beats, where large and sharp changes in waveform morphology are present. In parallel, the Transformer branch focuses on modeling long-range temporal dependencies and rhythm context, enabling the network to recognize arrhythmias that manifest primarily through timing irregularities or more subtle morphological variations over time. By learning these two types of information independently and in parallel, HCTG-Net constructs a richer representation than methods that rely solely on CNNs or solely on sequence models.
A central novelty of the proposed network is the learnable gated fusion mechanism that combines the outputs of the CNN and Transformer branches. Instead of concatenating features or applying a fixed fusion rule, HCTG-Net employs a gating network that produces per-dimension importance weights for morphology-dominated versus context-dominated information. This design allows the model to dynamically emphasize the most informative branch for each feature dimension and for each input heartbeat. For example, the gate can favor CNN-derived features when the QRS complex exhibits significant distortion, as is typical in ventricular arrhythmias, and shift toward Transformer-derived features when discriminative information lies in the temporal pattern of successive beats, as in many supraventricular arrhythmias. The ablation study confirms that this adaptive fusion yields substantial gains over simple combinations of CNN and Transformer branches, indicating that performance improvements are driven by more effective feature integration rather than merely increased model capacity.
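The gating idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the exact formulation used in HCTG-Net: it assumes a sigmoid gate computed from the concatenated branch features and a convex per-dimension combination. The function name `gated_fusion`, the feature sizes, and the random stand-in features are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_cnn, f_trans, w_gate, b_gate):
    """Fuse two (batch, d) feature matrices with a per-dimension gate.

    Assumed form: g = sigmoid([f_cnn; f_trans] @ w_gate + b_gate),
    fused = g * f_cnn + (1 - g) * f_trans.
    """
    concat = np.concatenate([f_cnn, f_trans], axis=1)  # (batch, 2d)
    gate = sigmoid(concat @ w_gate + b_gate)           # (batch, d), values in (0, 1)
    fused = gate * f_cnn + (1.0 - gate) * f_trans
    return fused, gate

rng = np.random.default_rng(0)
batch, d = 4, 8                                  # illustrative sizes only
f_cnn = rng.normal(size=(batch, d))              # stand-in for projected CNN features
f_trans = rng.normal(size=(batch, d))            # stand-in for projected Transformer features
w_gate = rng.normal(scale=0.1, size=(2 * d, d))  # random here; learned in practice
b_gate = np.zeros(d)

fused, gate = gated_fusion(f_cnn, f_trans, w_gate, b_gate)
```

Because the gate depends on the input features, each heartbeat receives its own set of per-dimension weights. A gate value near 1 passes the CNN feature through, and a value near 0 passes the Transformer feature, so every fused value lies between the two branch values for that dimension.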
Several additional design choices contribute to the high accuracy and F1-score achieved by HCTG-Net. First, the use of residual connections in the CNN branch helps preserve the continuity of clinically relevant waveform patterns across layers, reducing the risk that subtle morphological details will be attenuated by deep transformations. Second, both branches are projected into a shared latent space before fusion, which improves feature-scale alignment and stabilizes optimization. Third, the preprocessing pipeline, which comprises denoising, heartbeat segmentation, and z-score normalization, reduces inter-subject amplitude variability and improves temporal alignment of P–QRS–T complexes. These steps reduce intra-class variance and facilitate more discriminative feature learning. Finally, the use of SMOTE and class-balancing strategies during training mitigates the effect of class imbalance and thereby improves the reported performance metrics.
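The effect of z-score normalization on amplitude variability can be shown with a short sketch. It assumes that normalization is applied per heartbeat segment, which is one plausible reading of the pipeline described above; the function name `zscore_beats` and the toy beats are hypothetical.

```python
import numpy as np

def zscore_beats(beats, eps=1e-8):
    """Normalize each heartbeat segment (row) to zero mean and unit variance."""
    beats = np.asarray(beats, dtype=float)
    mean = beats.mean(axis=1, keepdims=True)
    std = beats.std(axis=1, keepdims=True)
    return (beats - mean) / (std + eps)  # eps guards against flat segments

# Two toy "beats" with the same shape but a 5x amplitude difference and an offset,
# mimicking inter-subject gain and baseline differences.
template = np.array([0.0, 0.1, 1.0, -0.3, 0.2, 0.0])
beats = np.stack([template, 5.0 * template + 2.0])

normalized = zscore_beats(beats)
```

After normalization the two rows are numerically almost identical, so a downstream classifier sees the same morphology regardless of the original gain or offset.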
CNNs were popularized in biomedical signal processing for handling very large and high-dimensional data such as EEG and invasive electrophysiological recordings. Despite the comparatively smaller data volume per ECG record, they remain highly effective for ECG analysis. ECG waveforms exhibit strong local structure: the P wave, QRS complex, and T wave occur within compact temporal windows and carry most of the diagnostic information. One-dimensional CNNs are well matched to this structure, as they can efficiently learn filters that detect these characteristic motifs, their relative timings, and subtle shape variations. The ablation experiment shows that even a CNN-only variant of HCTG-Net achieves a high accuracy of 0.9697, underscoring the effectiveness of convolutional feature extraction for ECG. Within the full HCTG-Net, the CNN branch thus serves as a robust backbone for morphology-aware representation learning, while the Transformer branch and gated fusion further refine these features with temporal context and adaptive weighting.
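The way a one-dimensional filter responds to a compact waveform motif can be sketched as follows. The signal and the kernel are hand-made toy values chosen for illustration; in a trained CNN the kernels are learned from data, and nothing here reproduces the filters of HCTG-Net.

```python
import numpy as np

# Toy signal: flat baseline with one sharp QRS-like spike centered at index 12.
signal = np.zeros(30)
signal[11:14] = [-0.2, 1.0, -0.3]

# A hand-set 3-tap kernel that responds strongly to a sharp upward peak.
kernel = np.array([-0.5, 1.0, -0.5])

# Sliding cross-correlation, the operation a 1D convolutional layer applies
# (here without bias or nonlinearity); mode="same" keeps the output length.
response = np.correlate(signal, kernel, mode="same")
peak_index = int(np.argmax(response))
```

The response is largest at the location of the spike, which is the sense in which a convolutional filter acts as a detector for a local motif and reports where it occurs in the beat.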
Despite these strengths, several limitations of the proposed HCTG-Net should be acknowledged. First, the incorporation of Transformer blocks and a gated fusion network increases architectural complexity and computational overhead compared to lightweight CNN-only models. Although the current implementation achieves low inference latency on modern GPU hardware, deployment in extremely resource-constrained embedded systems may require further model compression, pruning, or quantization. Second, the relatively high model capacity increases the risk of overfitting, especially when applied to smaller or less diverse datasets. This risk is partly mitigated by the use of regularization techniques and data-balancing strategies, but careful tuning of training protocols remains necessary when adapting HCTG-Net to new datasets. Third, like many deep learning models, HCTG-Net remains a black-box predictor from the perspective of clinicians. Additional interpretability tools would therefore be valuable to enhance trust and facilitate clinical adoption.
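As a rough illustration of the quantization option mentioned above, the sketch below applies symmetric per-tensor int8 quantization to a random weight matrix. It is a generic textbook scheme, not a procedure evaluated in this work; the function names and the stand-in weights are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8 codes."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Stand-in for one layer's float32 weights (sizes are illustrative).
weights = rng.normal(scale=0.05, size=(64, 32)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.max(np.abs(weights - restored)))
```

Storing `q` takes one quarter of the memory of the float32 weights, and the per-weight reconstruction error is bounded by roughly half the quantization step `scale`. Whether classification accuracy survives such compression would need to be verified empirically for HCTG-Net.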
5. Conclusions
In this study, we proposed HCTG-Net, a novel Hybrid CNN–Transformer Network with Gated Fusion for automatic arrhythmia detection from ECG signals. The model was designed to integrate the complementary advantages of convolutional and self-attention mechanisms within a unified architecture. Specifically, the CNN branch effectively captures localized morphological features such as QRS complex and P–T wave patterns, while the Transformer branch models long-range temporal dependencies and contextual relationships across consecutive heartbeats. The proposed gated fusion mechanism adaptively balances the contribution of these two branches, allowing the network to dynamically emphasize either local or global information depending on the signal characteristics.
Extensive experiments conducted on the MIT-BIH Arrhythmia Database demonstrated that HCTG-Net achieves superior performance compared with existing methods in terms of accuracy, recall, and F1-score. The visualization of the learned feature space further confirmed that the model generates compact and well-separated clusters for different heartbeat categories, indicating strong discriminative feature learning. Ablation studies validated the importance of each module—CNN, Transformer, and gated fusion—revealing that their synergistic combination leads to substantial gains in both classification accuracy and model robustness.
Overall, the proposed HCTG-Net provides an efficient and accurate framework for ECG-based arrhythmia detection. Its hybrid and adaptive design enhances the generalization ability across diverse cardiac patterns and can be readily extended to other biomedical signal analysis tasks. Future work will focus on optimizing the model for real-time deployment in wearable and mobile healthcare systems, integrating multi-lead ECG data, exploring domain adaptation techniques to improve cross-dataset robustness, and adding interpretability tools to support clinical use.
Author Contributions
Conceptualization, N.X. and Z.W. (Zibo Wei); methodology, N.X. and Z.W. (Zibo Wei); software, Z.W. (Zibo Wei); validation, X.W. and Y.W.; formal analysis, Z.W. (Zibo Wei); investigation, N.X. and X.W.; resources, Y.W. and Z.W. (Zhaohui Wang); data curation, N.X.; writing—original draft preparation, N.X.; writing—review and editing, Y.W. and Z.W. (Zhaohui Wang); visualization, Z.W. (Zibo Wei); supervision, Z.W. (Zhaohui Wang); project administration, Y.W. and Z.W. (Zhaohui Wang); funding acquisition, Z.W. (Zhaohui Wang). All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 82070400 and No. 82270367).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original data and codes of this work can be requested from the corresponding author.
Acknowledgments
We sincerely thank all collaborators for their support and helpful discussions. We also appreciate the insightful comments from the reviewers, which significantly improved the quality of this manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Daniele, O.; Caravaglios, G.; Fierro, B.; Natalè, E. Stroke and cardiac arrhythmias. J. Stroke Cerebrovasc. Dis. 2002, 11, 28–33.
- Huikuri, H.V.; Castellanos, A.; Myerburg, R.J. Sudden death due to cardiac arrhythmias. N. Engl. J. Med. 2001, 345, 1473–1482.
- Masarone, D.; Limongelli, G.; Rubino, M.; Valente, F.; Vastarella, R.; Ammendola, E.; Gravino, R.; Verrengia, M.; Salerno, G.; Pacileo, G. Management of arrhythmias in heart failure. J. Cardiovasc. Dev. Dis. 2017, 4, 3.
- Tao, R.; Wang, L.; Xiong, Y.; Zeng, Y.R. IM-ECG: An interpretable framework for arrhythmia detection using multi-lead ECG. Expert Syst. Appl. 2024, 237, 121497.
- Sun, Q.; Wang, L.; Li, J.; Liang, C.; Yang, J.; Chen, Y.; Wang, C. Multi-phase ECG dynamic features for detecting myocardial ischemia and identifying its etiology using deterministic learning. Biomed. Signal Process. Control 2024, 88, 105498.
- Meine, T.J.; Nettles, R.E.; Anderson, D.J.; Cabell, C.H.; Corey, G.R.; Sexton, D.J.; Wang, A. Cardiac conduction abnormalities in endocarditis defined by the Duke criteria. Am. Heart J. 2001, 142, 280–285.
- Kumar Behera, S.; Bhattacharya, D.; Aithal, N.; Sinha, N. Non-linear dynamics in ECG: A novel approach for robust classification of cardiovascular disorders. npj Cardiovasc. Health 2025, 2, 5.
- Al-Ani, M.S. ECG waveform classification based on P-QRS-T wave recognition. UHD J. Sci. Technol. 2018, 2, 7–14.
- Pradhan, B.K.; Neelappu, B.C.; Sivaraman, J.; Kim, D.; Pal, K. A Review on the Applications of Time-Frequency Methods in ECG Analysis. J. Healthc. Eng. 2023, 2023, 3145483.
- Chatterjee, S.; Thakur, R.S.; Yadav, R.N.; Gupta, L.; Raghuvanshi, D.K. Review of noise removal techniques in ECG signals. IET Signal Process. 2020, 14, 569–590.
- Zhang, J.; Guo, Y.; Dong, X.; Wang, T.; Wang, J.; Ma, X.; Wang, H. Opportunities and challenges of noise interference suppression algorithms for dynamic ECG signals in wearable devices: A review. Measurement 2025, 250, 117067.
- Mathe, S.E.; Penjarla, N.K.; Vappangi, S.; Kondaveeti, H.K. Advancements in Noise Reduction Techniques in ECG Signals: A Review. In Proceedings of the 2024 IEEE 3rd World Conference on Applied Intelligence and Computing (AIC), Gwalior, India, 27–28 July 2024; pp. 27–33.
- Michelson, E.L.; Morganroth, J. Spontaneous variability of complex ventricular arrhythmias detected by long-term electrocardiographic recording. Circulation 1980, 61, 690–695.
- Patchamatla, P.S.S.; Balsem, Z.A.; Parthiban, K.; AC, R.; S, L.B. Improved Support Vector Machine for ECG Signal Classification in Implantable Biomedical Devices. In Proceedings of the 2025 3rd International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 21–22 February 2025; pp. 1–7.
- Saini, I.; Singh, D.; Khosla, A. QRS detection using K-Nearest Neighbor algorithm (KNN) and evaluation on standard ECG databases. J. Adv. Res. 2013, 4, 331–344.
- Song, X.h.; Han, J.y.; Lang, H.; Mao, Y. A sliding window voting strategy based on hidden Markov model for morphology detection of QRS complex. Comput. Eng. Sci. 2024, 46, 272–281.
- Lilda, S.D.; Jayaparvathy, R. Enhancing cardiovascular disease classification in ECG spectrograms by using multi-branch CNN. Comput. Biol. Med. 2025, 186, 109737.
- Pantuprecharat, P.; Pawarangkoon, P.; Adirek, S. Recurrent Neural Network-based Model for Electrocardiogram Classification. Eng. Trans. Res. Publ. Mahanakorn Univ. Technol. 2025, 28, 19–25.
- Alamatsaz, N.; Tabatabaei, L.; Yazdchi, M.; Payan, H.; Alamatsaz, N.; Nasimi, F. A lightweight hybrid CNN-LSTM explainable model for ECG-based arrhythmia detection. Biomed. Signal Process. Control 2024, 90, 105884.
- Madan, S.; Lentzen, M.; Brandt, J.; Rueckert, D.; Hofmann-Apitius, M.; Fröhlich, H. Transformer models in biomedicine. BMC Med. Inform. Decis. Mak. 2024, 24, 214.
- Pan, S.; Liu, X.; Xie, N.; Chong, Y. EG-TransUNet: A transformer-based U-Net with enhanced and guided models for biomedical image segmentation. BMC Bioinform. 2023, 24, 85.
- Zhou, Y.; Li, L.; Lu, L.; Xu, M. nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 20852–20862.
- Apostol, A.; Nutu, M. Arrhythmia classification from 12-lead ECG signals using convolutional and transformer-based deep learning models. arXiv 2025, arXiv:2502.17887.
- Byeon, H.; Alsaadi, M.; Vijay, R.; Assudani, P.J.; Dutta, A.K.; Bansal, M.; Singh, P.P.; Soni, M.; Bhatt, M.W. Explainable multi-view transformer framework with mutual learning for precision breast cancer pathology image classification. Front. Oncol. 2025, 15, 1626785.
- Krishna, G.V.; Avula, S.K.; Raju, V.K.; Lakshmi, T.; Tumuluru, P.; Balaji, T.; Jaya, N. Enhanced ECG signal classification using hybrid CNN-Transformer models with tuning techniques and genetic algorithm optimization. J. Theor. Appl. Inf. Technol. 2024, 102, 1234–1245.
- Naidji, M.R.; Elberrichi, Z. A novel hybrid vision transformer CNN for COVID-19 detection from ECG images. Computers 2024, 13, 109.
- Alghieth, M. DeepECG-Net: A hybrid transformer-based deep learning model for real-time ECG anomaly detection. Sci. Rep. 2025, 15, 20714.
- Moody, G.B.; Mark, R.G. The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biol. Mag. 2001, 20, 45–50.
- Kachuee, M.; Fazeli, S.; Sarrafzadeh, M. ECG heartbeat classification: A deep transferable representation. In Proceedings of the 2018 IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; pp. 443–444.
- Marin, D.; He, Z.; Vajda, P.; Chatterjee, P.; Tsai, S.; Yang, F.; Boykov, Y. Efficient segmentation: Learning downsampling near semantic boundaries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2131–2141.
- Fei, N.; Gao, Y.; Lu, Z.; Xiang, T. Z-score normalization, hubness, and few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 142–151.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and balance: A simple gated network for salient object detection. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 35–51.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
- Dalianis, H. Evaluation metrics and evaluation. In Clinical Text Mining: Secondary Use of Electronic Patient Records; Springer: Cham, Switzerland, 2018; pp. 45–53.
- Luo, X.; Yang, L.; Cai, H.; Tang, R.; Chen, Y.; Li, W. Multi-classification of arrhythmias using a HCRNet on imbalanced ECG datasets. Comput. Methods Programs Biomed. 2021, 208, 106258.
- Bhaskaran, A.; De Silva, K.; Kumar, S. Contemporary updates on ventricular arrhythmias: From mechanisms to management. Intern. Med. J. 2023, 53, 892–906.
- Hu, Y.H.; Palreddy, S.; Tompkins, W.J. A patient-adaptable ECG beat classifier using a mixture of experts approach. IEEE Trans. Biomed. Eng. 1997, 44, 891–900.
- Hayn, D.; Jammerbund, B.; Schreier, G. QRS detection based ECG quality assessment. Physiol. Meas. 2012, 33, 1449–1461.
- Curtin, A.E.; Burns, K.V.; Bank, A.J.; Netoff, T.I. QRS complex detection and measurement algorithms for multichannel ECGs in cardiac resynchronization therapy patients. IEEE J. Transl. Eng. Health Med. 2018, 6, 1900211.
- Maaten, L.v.d.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Wang, H.; Shi, H.; Lin, K.; Qin, C.; Zhao, L.; Huang, Y.; Liu, C. A high-precision arrhythmia classification method based on dual fully connected neural network. Biomed. Signal Process. Control 2020, 58, 101874.
- Sharma, P.; Dinkar, S.K.; Gupta, D. A novel hybrid deep learning method with cuckoo search algorithm for classification of arrhythmia disease using ECG signals. Neural Comput. Appl. 2021, 33, 13123–13143.
- Kumar, S.; Mallik, A.; Kumar, A.; Del Ser, J.; Yang, G. Fuzz-ClustNet: Coupled fuzzy clustering and deep neural networks for Arrhythmia detection from ECG signals. Comput. Biol. Med. 2023, 153, 106511.
- Zhuang, J.; Sun, J.; Yuan, G. Arrhythmia diagnosis of young martial arts athletes based on deep learning for smart medical care. Neural Comput. Appl. 2023, 35, 14641–14652.
- Wu, W.; Huang, Y.; Wu, X. SRT: Improved transformer-based model for classification of 2D heartbeat images. Biomed. Signal Process. Control 2024, 88, 105017.
- Irsyad, A.; Widagdo, P.P.; Wardhana, R. Deep learning Methods for ECG-Based Heart Disease Detection. J. Electron. Electromed. Eng. Med. Inform. 2024, 6, 467–477.
- Saha Tchinda, B.; Tchiotsop, D. A lightweight 1D convolutional neural network model for arrhythmia diagnosis from electrocardiogram signal. Phys. Eng. Sci. Med. 2025, 48, 577–589.
- Choudhury, M.; Tanvir, M.; Yousuf, M.A.; Islam, N.; Uddin, M.Z. Explainable AI-driven scalogram analysis and optimized transfer learning for sleep apnea detection with single-lead electrocardiograms. Comput. Biol. Med. 2025, 187, 109769.
- Bahrami, R.; Fotouhi, A.M. A novel efficient hybrid deep learning framework for ECG-based heartbeat arrhythmia classification. Neural Comput. Appl. 2025, 37, 16409–16425.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).