6.1. Machine Learning Classifiers
The family of traditional machine learning classifiers exhibited heterogeneous performance levels across the evaluated metrics, reflecting the trade-off between model complexity, generalisation capability, and predictive power.
Table 2 summarises the detailed results obtained for each classifier on the balanced dataset.
Linear and probabilistic models such as Logistic Regression and Gaussian Naive Bayes reported the lowest accuracies (0.53–0.60). In particular, GaussianNB strongly favoured AF detection (recall 0.88), but at the cost of a very high false-positive rate, with many NSR segments labelled as AF, yielding poor overall balance. Logistic Regression showed slightly more stable behaviour but still failed to capture the nonlinear dependencies inherent in ECG morphology.
Kernel- and distance-based methods such as the Support Vector Classifier (SVC) and K-Nearest Neighbours (KNN) improved moderately, reaching accuracies of around 0.66–0.68. However, their precision–recall trade-off remained suboptimal, with KNN particularly sensitive to noisy or overlapping class boundaries.
Tree-based ensembles proved more robust. Random Forest and Extra Trees achieved the best overall results, with accuracies above 0.80 and balanced precision and recall across both classes. Gradient Boosting performed worse (0.72 accuracy), showing higher sensitivity to NSR but lower recall for AF. Boosted frameworks such as XGBoost and CatBoost achieved stable results close to 0.79 accuracy, outperforming plain Gradient Boosting while remaining slightly below Random Forest and Extra Trees. These results highlight the strength of ensemble learning in capturing nonlinear patterns in ECG signals.
Finally, the Multi-Layer Perceptron (MLPClassifier) achieved competitive results with 0.74 accuracy, outperforming most linear and kernel-based approaches but falling short of ensemble-based methods.
Overall, machine learning classifiers provided computationally efficient baselines with reasonable discrimination power. However, even the strongest models (Extra Trees and Random Forest) did not surpass 81% accuracy, revealing the limitations of relying exclusively on handcrafted features to capture the complex spatio-temporal dynamics of AF.
6.2. Deep Learning Models
In contrast, deep learning approaches consistently outperformed traditional methods.
Table 3 summarises the results obtained in the seven experiments conducted to evaluate the impact of segment length, network depth and the introduction of recurrent components (BiLSTM) in the classification of ECG segments (AF vs. NSR).
Table 3 presents one row for each experiment performed. The columns report, in order: the phase to which the experiment belongs; the experiment number; the length of the ECG segment; whether a BiLSTM block was used and, if so, its number of units; the number of convolutional blocks and their sizes; the number of pooling blocks and their sizes; the number of epochs scheduled for training; the number of epochs actually performed; and the patience set for early stopping. The last four columns report the validation metrics previously described (Accuracy, Precision, Recall and F1-Score).
The results highlight a clear four-phase evolution, where each architectural and parametric change contributed to progressively improving the model’s performance.
In the first phase (exp1–4), we used segments with original length (≈166 k samples) and a relatively simple CNN architecture, consisting of two Conv1D blocks (32 and 64 filters) followed by MaxPooling (pool size 2), GlobalAveragePooling, and a fully connected layer of 32 neurons. The main hyperparameters were gradually varied, increasing the maximum number of epochs (from 20 to 50 to 100) and the early stopping patience (from 5 to 10). Performance progressively improved from an accuracy of 0.76 in exp1 to 0.90 in exp4, indicating that a greater number of epochs allowed for more stable convergence, albeit with very long training times and an increasing risk of overfitting. In this case, the model tended to learn noisy local patterns, unable to effectively exploit the long-term temporal relationships present in the ECG signal.
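The interaction between the maximum number of epochs and the early stopping patience, which explains why the epochs actually performed can differ from the epochs scheduled in Table 3, can be illustrated with a minimal sketch. It assumes that early stopping monitors the validation loss (the monitored quantity is not stated here), and the function name and the loss values are hypothetical, chosen only for illustration.

```python
def run_with_early_stopping(val_losses, max_epochs, patience):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses.

    Assumption: training stops once the validation loss has not improved
    for `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted, stop before max_epochs
    return epochs_run, best_epoch


# Hypothetical validation-loss curve that plateaus after epoch 3.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61]
print(run_with_early_stopping(losses, max_epochs=20, patience=5))   # (8, 3)
print(run_with_early_stopping(losses, max_epochs=20, patience=10))  # (9, 3)
```

With the smaller patience, training halts five epochs after the last improvement; with the larger one, the run continues until the loss history (or the epoch budget) is exhausted, which is consistent with longer training times when both limits are raised.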
The initial experiments (exp1–exp4) can be interpreted as convolution-only baselines. Increasing the number of training epochs and network depth progressively improved performance, reaching up to 0.90 accuracy. However, despite these improvements, CNN-only architectures remained limited in their ability to capture long-range temporal dependencies, which are critical for modelling rhythm irregularity in long-term ECG signals.
In phase two (exp5), to address the training time issue, the segment length was reduced to approximately 83 k samples while keeping the architecture unchanged. This choice resulted in a significant reduction in time per epoch and greater training stability, while still maintaining good performance (see the accuracy reported for exp5 in Table 3). This indicates that shorter segments still contain sufficient discriminative information for AF/NSR classification and may be preferable when computational resources are limited.
In phase three (exp6), to overcome the intrinsic limitations of CNNs alone, which capture predominantly local patterns, a 32-unit Bidirectional LSTM block was introduced after four convolutional blocks (32-64-64-128 filters). This allowed the network to also model long-term temporal dependencies, resulting in a significant improvement in metrics (see the accuracy reported for exp6 in Table 3). The synergistic effect between local feature extraction (Conv1D) and sequential modelling (BiLSTM) represented the first real qualitative leap.
Finally, in phase four (exp7), the exp6 architecture was optimised by increasing the pooling aggressiveness in the first two blocks (pool size 4 instead of 2) to reduce dimensionality and computation time, and increasing the number of epochs to 20. This produced the best overall result, with an accuracy of 0.965 and virtually identical precision, recall, and F1 scores, indicating a well-balanced and generalizable model. The stability of the validation curves and the low loss also show that the dropout setting (0.2/0.3) is sufficient to prevent overfitting, even with a more complex model.
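To make the effect of the more aggressive pooling concrete, the following sketch estimates how many time steps reach the BiLSTM block under each pooling configuration. It is an illustration under stated assumptions, not the exact pipeline: it assumes one max-pooling layer after each of the four convolutional blocks, convolutions that preserve the sequence length, pooling with stride equal to the pool size, and a segment of exactly 83,000 samples (the text gives only ≈83 k). The function name is hypothetical.

```python
def timesteps_after_pooling(n_samples, pool_sizes):
    """Sequence length left after a stack of non-overlapping max-pooling layers.

    Assumption: each pooling layer uses stride == pool size and discards
    any incomplete trailing window (floor division).
    """
    length = n_samples
    for pool in pool_sizes:
        length //= pool
    return length


segment = 83_000  # assumed exact value for the ~83 k-sample segments

# exp6-style pooling (assumed pool size 2 in all four blocks)
print(timesteps_after_pooling(segment, (2, 2, 2, 2)))  # 5187
# exp7-style pooling (pool size 4 in the first two blocks)
print(timesteps_after_pooling(segment, (4, 4, 2, 2)))  # 1296
```

Under these assumptions, the recurrent block processes roughly a quarter of the time steps, which is in line with the reduction in computation time described above.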
The Transformer-based model is reported in Table 3 separately from the phased CNN-based experiments, as it represents an alternative architectural paradigm rather than an incremental modification of the convolutional pipeline. Despite its compact size (approximately 65 k parameters), the CNN + Transformer architecture achieved competitive performance (F1-score = 0.87), confirming the effectiveness of attention mechanisms in modelling long-range temporal dependencies in long-term ECG signals. However, its performance remained below that of the proposed CNN + BiLSTM architecture, suggesting that recurrent inductive biases are particularly well-suited for capturing rhythm irregularity in paroxysmal atrial fibrillation.
Table 4 provides the detailed classification report; the results show very high overall performance and balance between the two classes.
Specifically, for the NSR (0) class, the model achieves a precision of 0.97 and a recall of 0.96, while for the AF (1) class, it achieves virtually symmetric values (precision 0.96, recall 0.97). The F1 score, which balances precision and recall, is 0.96–0.97 for both classes, confirming balanced and generalisable behaviour. The overall accuracy is 0.97 on a total of 1886 test samples.
The confusion matrix (Table 5) confirms this result: out of 929 NSR samples, 888 are correctly classified and only 41 are false positives (classified as AF); out of 957 AF samples, 932 are correctly classified and 25 are false negatives (classified as NSR). In both cases, the number of errors is very low and fairly evenly distributed between the two classes, indicating no marked residual bias.
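The per-class metrics quoted above follow directly from the four confusion-matrix counts. The short sketch below recomputes them from the values given in the text; the function name is hypothetical and the code is only a consistency check, not the evaluation script used in the experiments.

```python
def binary_report(tn, fp, fn, tp):
    """Per-class (precision, recall, F1) and accuracy from a 2x2 confusion matrix.

    Convention: NSR is the negative class (0), AF is the positive class (1).
    """
    def prf(correct, predicted_total, actual_total):
        precision = correct / predicted_total
        recall = correct / actual_total
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    return {
        "NSR": prf(tn, tn + fn, tn + fp),
        "AF": prf(tp, tp + fp, tp + fn),
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }


# Counts taken from the confusion matrix described in the text.
report = binary_report(tn=888, fp=41, fn=25, tp=932)
for label in ("NSR", "AF"):
    p, r, f = report[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={report['accuracy']:.3f}")
# NSR: precision=0.97 recall=0.96 f1=0.96
# AF: precision=0.96 recall=0.97 f1=0.97
# accuracy=0.965
```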
These results significantly outperformed all feature-based classifiers, underscoring the ability of deep learning to uncover discriminative spatio-temporal representations that handcrafted features could not capture.
Figure 5 shows the accuracy and loss comparison during training and validation for the first (exp1) and last (exp7) deep learning experiments. In the case of exp1, based on a simple CNN architecture with two convolutional blocks and long segments (≈166 k samples), the validation accuracy increases slowly and tends to stabilise after just a few epochs at values around 0.75, while the validation loss remains high (≈0.53) and shows fluctuations, indicating unstable convergence and limited predictive capacity. In contrast, in exp7, which uses a deeper network with four convolutional blocks and a BiLSTM layer trained on halved segments (≈83 k samples), we observe a rapid increase in validation accuracy to values close to 0.97 and a decrease in validation loss to approximately 0.08. Furthermore, the training and validation curves are very close, indicating no overfitting and excellent generalisation.
This comparison highlights how the introduction of recurrent components (BiLSTM), the increase in convolutional depth, and the reduction in segment length enabled faster and more stable convergence, dramatically improving the model’s overall performance.
6.5. Comparison Between Deep Learning Architectures
A comparative analysis of the evaluated deep learning architectures highlights the impact of different inductive biases on long-term ECG modeling. Convolution-only networks provide a strong baseline by capturing local morphological patterns, but their performance saturates when long-range temporal dependencies become dominant.
The proposed CNN + BiLSTM model comprises 115,457 trainable parameters, corresponding to an approximate model size of 451 KB using 32-bit floating-point representation. While larger than the CNN + Transformer baseline (approximately 65 k parameters), the model remains compact compared to state-of-the-art deep architectures and is compatible with edge-oriented and real-time deployment scenarios.
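The quoted model size can be reproduced from the parameter count with a one-line calculation. The sketch below assumes 4 bytes per parameter (32-bit floats), counts weights only (no optimiser state or file-format overhead), and interprets "KB" as 1024 bytes; the function name is hypothetical.

```python
def model_size_kib(n_parameters, bytes_per_parameter=4):
    """Approximate size of the weights in units of 1024 bytes."""
    return n_parameters * bytes_per_parameter / 1024


# CNN + BiLSTM: parameter count taken from the text.
print(f"{model_size_kib(115_457):.1f}")  # 451.0
# CNN + Transformer: ~65 k parameters is approximate, so this is only indicative.
print(f"{model_size_kib(65_000):.1f}")   # 253.9
```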
The CNN + Transformer architecture improves upon convolution-only models, achieving an accuracy of 0.87 while maintaining a very compact footprint (approximately 65 k parameters). This confirms that self-attention mechanisms are effective in modeling global temporal relationships in long ECG segments and represent an efficient alternative for resource-constrained deployments.
Nevertheless, the proposed CNN + BiLSTM architecture consistently outperformed the Transformer-based model, reaching an accuracy of 0.97. This suggests that recurrent inductive biases remain particularly effective for modeling rhythm irregularity and sequential dynamics in paroxysmal atrial fibrillation, especially when combined with convolutional feature extraction. Overall, the results indicate a trade-off between computational efficiency and peak performance, with recurrent models offering superior accuracy and attention-based models providing a favourable efficiency–performance balance.