Comparative Analysis of Machine Learning and Deep Learning Models for Atrial Fibrillation Detection from Long-Term ECG

Aversano, Lerina; Mancino, Ilaria; Marengo, Agostino; Verdone, Chiara

doi:10.3390/app16052390

¹ Department of Agricultural Science, Food, Natural Resources and Engineering, University of Foggia, 71100 Foggia, Italy

² Department of Engineering, University of Sannio, 82100 Benevento, Italy

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(5), 2390; https://doi.org/10.3390/app16052390

Submission received: 20 January 2026 / Revised: 11 February 2026 / Accepted: 11 February 2026 / Published: 28 February 2026

(This article belongs to the Special Issue Artificial Intelligence Innovations for Smart and Sustainable Healthcare)

Abstract

Atrial fibrillation is the most prevalent sustained cardiac arrhythmia and a major risk factor for stroke, heart failure, and premature mortality. Automatic detection remains challenging due to the variability of electrocardiogram (ECG) morphology, noise, and the paroxysmal nature of atrial fibrillation events. This study proposes a comprehensive framework that integrates optimised segmentation, feature extraction, and advanced deep learning architectures to improve detection accuracy. A coalescence window is introduced to dynamically cluster arrhythmic episodes, aligning computational analysis with clinical event distributions. Multiple classifiers are investigated, ranging from traditional machine learning models to state-of-the-art deep neural networks, including Temporal Convolutional Networks (TCNs), Convolutional Neural Networks (CNNs), and Bidirectional Long Short-Term Memory (BiLSTM). Experimental evaluation on a balanced dataset of ECG signals demonstrates the superior performance of deep learning models, with the best architecture achieving high accuracy and F1-score, significantly outperforming traditional approaches. Furthermore, the proposed pipeline is designed to be modular and resource-aware, supporting potential deployment in real-time and edge computing environments. These results highlight the feasibility of scalable atrial fibrillation monitoring systems that bridge algorithmic innovation with clinical applicability, ultimately contributing to earlier diagnosis and improved patient management.

Keywords: atrial fibrillation; deep learning; electrocardiography; edge computing; machine learning; signal processing; wearable devices

1. Introduction

Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia, affecting millions of individuals worldwide and contributing significantly to stroke, heart failure, and increased mortality. Among its clinical manifestations, paroxysmal AF is particularly challenging: episodes are transient, often asymptomatic, and difficult to capture with standard diagnostic tools such as manual electrocardiogram (ECG) interpretation or Holter monitoring. This underscores the need for automated and scalable detection systems that can identify not only persistent but also short-lived AF events.

Recent advances in artificial intelligence have transformed AF detection by leveraging large-scale ECG datasets and deep learning architectures. Convolutional and recurrent neural networks have achieved state-of-the-art performance in recognizing arrhythmic patterns [1,2,3], while hybrid models combining multiple feature representations further improve classification accuracy [4,5]. Despite these advances, clinical adoption faces persistent challenges. Many approaches function as black-box classifiers, raising concerns regarding interpretability and clinical trust [6,7,8]. Moreover, computationally intensive models may not be suitable for real-time use in wearable or edge devices, where efficiency and energy consumption are critical constraints [9,10,11].

Recent studies have also explored alternative sensing modalities and novel signal representations for atrial fibrillation detection, extending beyond traditional ECG-based approaches. Liu et al. [12] proposed AcousAF, an acoustic sensing system capable of detecting atrial fibrillation using smartphone microphones, demonstrating the feasibility of contactless and low-cost AF monitoring in mobile settings. Similarly, Pachori et al. [13] investigated AF detection from photoplethysmography (PPG) signals using variational mode decomposition, highlighting the potential of wearable optical sensors for arrhythmia screening. More recently, Zhao et al. [14] introduced a contactless mmWave-based system for arrhythmia detection, showing that non-invasive radar sensing can achieve competitive performance under controlled conditions.

While these approaches broaden the sensing landscape for AF detection, ECG remains the clinical gold standard due to its direct representation of cardiac electrophysiology and established diagnostic validity. Consequently, deep learning models operating on long-term ECG recordings continue to play a central role in clinically relevant AF detection systems, particularly when robustness, interpretability, and deployment in real-world healthcare scenarios are required.

In parallel, research has explored the integration of wearable and edge-aware AF detection systems, enabling continuous monitoring beyond clinical settings [15,16,17]. Emerging multimodal strategies that combine ECG with other physiological signals such as EEG demonstrate the potential of data fusion to improve reliability and robustness [18,19,20]. Additionally, ECG-based biometrics have gained traction as a promising modality for human authentication, opening opportunities for dual-use systems that ensure both health monitoring and user identification [21,22].

To address these limitations, this paper proposes a novel AF detection framework that integrates deep learning architectures with edge-aware deployment. A central contribution is the introduction of the coalescence window, a temporal grouping mechanism that dynamically clusters paroxysmal AF episodes based on inter-event distributions. In addition, we incorporate dataset balancing strategies to mitigate class imbalance, a critical factor for reliable learning from heterogeneous ECG recordings. The framework also enables direct comparison of traditional classifiers with state-of-the-art deep learning models, including temporal convolutional networks (TCNs), convolutional neural networks (CNNs), and bidirectional long short-term memory (BiLSTM), trained and validated on raw ECG signals. By combining robustness and feasibility for wearable deployment, the proposed approach aims to bridge the gap between algorithmic innovation and clinical applicability, ultimately contributing to more effective and personalized AF monitoring.

2. Related Works

Research on Atrial Fibrillation detection has rapidly evolved over the past decade, driven by its clinical importance as the most common sustained arrhythmia and the increasing availability of long-duration ECG recordings. Early efforts were dominated by signal processing and statistical approaches, where morphological features such as P-wave variability, RR interval irregularity, and spectral components were manually extracted and classified using conventional machine learning algorithms. While such methods provided initial insights, their reliance on handcrafted features often limited generalizability across heterogeneous datasets and clinical scenarios.

The transition toward deep learning has marked a turning point in AF detection. Convolutional Neural Networks have been among the most widely adopted architectures, effectively capturing spatial and frequency-domain patterns from raw or transformed ECG signals. Studies have demonstrated the advantages of spectrogram-based representations for CNN input, which allow the model to exploit both temporal and spectral information simultaneously [1,2]. Extensions to hybrid models that combine CNNs with recurrent layers, such as LSTMs or BiLSTMs, have further enhanced performance by incorporating long-term temporal dependencies [3,5]. These combinations are particularly relevant for AF, where arrhythmic episodes often unfold over variable durations. More recently, Temporal Convolutional Networks have been introduced as an efficient alternative to recurrent models, offering competitive accuracy while reducing training and inference costs [22]. Other innovations, including randomized attention mechanisms and dual-path systems, have optimized feature extraction and improved robustness against noise and variability in ECG data [23].

Another important research direction has focused on wearable and edge-based AF detection systems. With the proliferation of smart devices and the demand for continuous monitoring beyond clinical environments, efficient deployment of algorithms on low-power hardware has become critical. Systems enabling real-time ECG acquisition and wireless transmission have already demonstrated high accuracy in ambulatory monitoring [15], while circadian-aware wearables and resource-efficient CNNs have been proposed to optimize detection in daily life conditions [9,24].

In summary, current research in AF detection is shaped by two converging trends: the design of advanced deep learning architectures capable of end-to-end predictive modeling, and the pursuit of efficient and wearable-ready solutions grounded in clinical usability.

3. Background

This section provides an overview of the fundamental machine learning (ML) and deep learning (DL) technologies that underpin the development of modern artificial intelligence systems. The focus here is on describing these techniques in general, independently of their application, in order to highlight their theoretical principles, architectures, and training mechanisms.

3.1. Machine Learning

Machine learning (ML) encompasses a broad class of algorithms designed to learn predictive models from data, without the need to explicitly encode domain-specific rules. Given a dataset of input–output pairs, ML systems attempt to learn a mapping function $f : X \to Y$ that generalizes beyond the training data. This paradigm has proved essential in scenarios where the complexity of the patterns to be modelled precludes manual rule design.

Traditional ML algorithms often rely on the use of handcrafted features, which are engineered by experts to capture meaningful properties of the data. The success of the model is thus directly linked to the quality and relevance of these features. Among the most widely used families of ML algorithms are:

Linear models, such as Logistic Regression, which establish decision boundaries in the form of linear hyperplanes;
Probabilistic models, such as Gaussian Naïve Bayes, which make predictions based on Bayes’ theorem under strong independence assumptions;
Kernel- and distance-based models, including Support Vector Machines (SVMs) and k-Nearest Neighbors (KNN);
Tree-based models, such as Decision Trees, Random Forests, Gradient Boosting, XGBoost, and CatBoost;
Shallow neural networks, such as Multi-Layer Perceptrons (MLPs), which can approximate nonlinear mappings but typically require careful feature engineering due to their limited depth.

A major advantage of traditional ML is its computational efficiency and suitability for deployment in resource-constrained environments. However, the reliance on handcrafted features often limits the ability of these methods to capture complex patterns in raw data, motivating the transition to deep learning.

3.2. Deep Learning

Deep learning (DL) represents a paradigm shift in artificial intelligence. Rather than relying on manually designed features, DL architectures automatically learn hierarchical representations directly from raw data. This is achieved through deep neural networks composed of multiple layers of nonlinear transformations, where each successive layer extracts increasingly abstract features.

3.2.1. Convolutional Neural Networks

Convolutional Neural Networks are a class of deep neural networks originally designed for image analysis but now widely applied to diverse data modalities, including one-dimensional signals such as speech, sensor readings, and electrocardiograms. Their key innovation lies in the convolutional layer, where learnable filters (also called kernels) are applied across local regions of the input, producing feature maps that capture spatially or temporally localized patterns.

Formally, the convolution operation for a one-dimensional input x with kernel w of size k is expressed as:

s(t) = (x * w)(t) = \sum_{i = 0}^{k - 1} w_{i} \cdot x_{t + i},

where $s(t)$ represents the activation at position t. During training, both the kernel weights w and bias terms are learned via backpropagation, allowing the filters to adapt to task-specific features.
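
As a minimal sketch of the sum above, the following pure-Python function evaluates s(t) at every position where the kernel fits entirely inside the input (the "valid" positions). The function name, the optional `bias` argument, and the toy values are illustrative assumptions, not part of the original study.

```python
def conv1d_valid(x, w, bias=0.0):
    """Compute s(t) = sum_{i=0}^{k-1} w[i] * x[t + i] for every valid t."""
    k = len(w)
    if k == 0 or k > len(x):
        raise ValueError("kernel must be non-empty and no longer than the input")
    return [
        sum(w[i] * x[t + i] for i in range(k)) + bias
        for t in range(len(x) - k + 1)
    ]


# Toy signal and a difference-like kernel (illustrative values only).
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
feature_map = conv1d_valid(signal, kernel)  # one activation per valid position
```

An input of length n and a kernel of size k yield n - k + 1 activations, which is why convolutional layers without padding shorten the sequence slightly.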

CNNs are organized in hierarchical blocks, each performing a sequence of transformations:

Convolutional layers: apply multiple filters to detect local patterns such as edges, peaks, or oscillations. In the case of biomedical signals, these filters often capture recurring motifs like QRS complexes or periodic oscillatory patterns.
Activation functions: nonlinear mappings, most commonly the Rectified Linear Unit (ReLU), are applied to the convolution outputs, introducing nonlinearity and improving representational capacity.
Pooling layers: reduce the dimensionality of feature maps by downsampling, typically via max pooling or average pooling. This introduces invariance to small translations and reduces computational cost, while preserving the most salient features.
Normalization layers: such as batch normalization, which stabilize training by normalizing intermediate activations, improving both convergence speed and generalization.
Fully connected layers: at the final stage, feature maps are flattened and connected to dense layers that integrate local features into global representations. The output layer typically uses a sigmoid (for binary classification) or softmax (for multi-class problems) activation.

One of the strengths of CNNs is their ability to build hierarchical feature representations: lower layers capture low-level features (short-term patterns, sharp transitions), while deeper layers combine them into high-level abstractions (complex shapes, rhythm irregularities). This property is particularly advantageous for structured data where both local details and global structures are relevant.

Another important aspect is weight sharing. Unlike fully connected networks, where each parameter is unique to a connection, CNN filters are applied across the entire input domain. This drastically reduces the number of parameters, improving efficiency and mitigating overfitting, especially when training data is limited.

In addition, CNNs benefit from translation equivariance: a pattern recognized in one part of the signal can also be detected elsewhere with the same filter, and pooling adds a degree of invariance to small shifts. For one-dimensional biomedical signals such as ECG, this property is crucial because pathological patterns (e.g., fibrillatory waves) may appear at arbitrary positions in the input segment.

Over time, several architectural refinements have been introduced, including:

Deeper CNNs with many stacked layers (e.g., VGG, ResNet) to capture complex hierarchical features;
Dilated convolutions, which expand the receptive field without increasing the number of parameters, useful for modeling long temporal dependencies;
1D CNNs, specifically tailored to sequential data, which process signals along the temporal axis instead of spatial grids.

Thanks to these properties, CNNs have become a cornerstone in deep learning for pattern recognition, offering a powerful compromise between expressiveness, computational efficiency, and scalability. Their ability to directly learn task-specific filters makes them particularly effective in domains where manual feature extraction is challenging or suboptimal.

3.2.2. Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) extend traditional feed-forward architectures by introducing feedback connections that allow information to persist across time steps. Given a sequence of inputs $\{x_{1}, x_{2}, \dots, x_{T}\}$, an RNN updates its hidden state $h_{t}$ at each time step according to

h_{t} = \sigma (W_{h} h_{t - 1} + W_{x} x_{t} + b),

where $W_{h}$ and $W_{x}$ are weight matrices, b is a bias term, and $\sigma$ is a nonlinear activation (e.g., tanh).
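
The recurrence can be illustrated with a short NumPy sketch that applies the update rule over a toy sequence, using tanh as the activation. The dimensions, random weights, and the name `rnn_forward` are assumptions chosen only for illustration.

```python
import numpy as np


def rnn_forward(xs, W_h, W_x, b, h0=None):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence."""
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # new state depends on the old one
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
hidden, features, steps = 4, 1, 6          # hypothetical sizes
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, features))
b = np.zeros(hidden)
xs = rng.normal(size=(steps, features))
states = rnn_forward(xs, W_h, W_x, b)      # shape: (steps, hidden)
```

The loop makes the sequential dependency explicit: state t cannot be computed before state t - 1, which is the property that prevents parallel processing across time steps.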

This recurrent formulation makes RNNs suitable for tasks where order and temporal context matter, such as speech recognition, natural language processing, and physiological time series. At each step, the hidden state acts as a compressed memory of all previous inputs.

However, training RNNs on long sequences is challenging due to the problem of vanishing and exploding gradients. As the error signal is backpropagated through many time steps (Backpropagation Through Time, BPTT), gradients may diminish to near zero or grow uncontrollably, preventing the effective learning of long-term dependencies. Consequently, vanilla RNNs tend to capture only short-term patterns, limiting their performance in complex sequential modeling tasks.
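
A deliberately simplified illustration of this effect is the scalar linear recurrence h_t = w * h_{t-1}, for which the derivative of the last state with respect to the first is w raised to the number of steps. The sketch below is an assumption-laden toy (no nonlinearity, a single weight), not a model of any network in this study.

```python
def gradient_scale(w, steps):
    """Return |d h_T / d h_0| for the scalar linear recurrence h_t = w * h_{t-1}."""
    scale = 1.0
    for _ in range(steps):
        scale *= abs(w)  # one chain-rule factor per time step
    return scale


vanishing = gradient_scale(0.5, 50)  # shrinks towards zero
exploding = gradient_scale(1.5, 50)  # grows without bound
```

With |w| below 1 the factor collapses after a few dozen steps, and with |w| above 1 it blows up, mirroring the two failure modes described above.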

Despite these limitations, RNNs introduced the key idea of modeling temporal dynamics within neural networks, paving the way for more advanced recurrent architectures.

3.2.3. Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks were developed to overcome the vanishing gradient problem inherent in RNNs. An LSTM introduces a memory cell $c_{t}$ that explicitly maintains information over time, together with gating mechanisms that regulate the flow of information.

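The three gates, the candidate cell state, and the state updates of an LSTM are defined as

f_{t} = \sigma (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) \quad \text{(forget gate)},

i_{t} = \sigma (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) \quad \text{(input gate)},

o_{t} = \sigma (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) \quad \text{(output gate)},

\tilde{c}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c}) \quad \text{(candidate cell state)},

c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t}, \quad h_{t} = o_{t} \odot \tanh (c_{t}),

where $\sigma$ denotes the logistic sigmoid, $\odot$ the element-wise product, and $[h_{t - 1}, x_{t}]$ the concatenation of the previous hidden state with the current input.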

Through this design:

The forget gate discards irrelevant information;
The input gate adds new information to the memory cell;
The output gate controls what is exposed to the next hidden state.
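
The gate equations can be read as a single update step. The NumPy sketch below follows them term by term; the dictionary layout for the weights, the sizes, and the random initialisation are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update; W and b are dicts keyed by 'f', 'i', 'o', 'c'."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # keep part of the old memory, add new
    h_t = o_t * np.tanh(c_t)                 # expose a gated view of the memory
    return h_t, c_t


rng = np.random.default_rng(0)
hidden, features = 3, 2                      # hypothetical sizes
W = {k: rng.normal(scale=0.3, size=(hidden, hidden + features)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the cell state is updated additively rather than through repeated matrix multiplication, gradients can flow through `c_t` over many steps with less attenuation than in a vanilla RNN.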

LSTMs thus maintain a stable memory over long sequences, enabling the modeling of complex temporal dependencies. They have been widely adopted in language modeling, speech recognition, machine translation, and biomedical time series forecasting. While computationally more demanding than vanilla RNNs, their ability to capture long-term context makes them a standard baseline for sequence learning.

3.2.4. Bidirectional Long Short-Term Memory (BiLSTM)

Bidirectional LSTMs (BiLSTMs) extend LSTMs by processing input sequences in both forward and backward directions. This design allows the network to exploit not only past information but also future context, producing richer sequence representations.

Formally, a BiLSTM computes two hidden states for each time step:

\overrightarrow{h}_{t} = \mathrm{LSTM}_{f} (x_{t}, \overrightarrow{h}_{t - 1}), \quad \overleftarrow{h}_{t} = \mathrm{LSTM}_{b} (x_{t}, \overleftarrow{h}_{t + 1}),

and concatenates them into

h_{t} = [\overrightarrow{h}_{t}; \overleftarrow{h}_{t}] .
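
The two passes and the concatenation can be sketched independently of the cell type. In the example below a running sum stands in for the LSTM cell so that the forward and backward states are easy to inspect by hand; the helper names and the stand-in cell are assumptions for illustration.

```python
def run_direction(xs, step, h0):
    """Run a recurrent step(x, h) -> h over xs and return every hidden state."""
    h, states = h0, []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(xs, step_fwd, step_bwd, h0=0.0):
    """Pair the forward and backward states position by position."""
    forward = run_direction(xs, step_fwd, h0)
    # Process the reversed sequence, then flip back to the original time order.
    backward = run_direction(xs[::-1], step_bwd, h0)[::-1]
    return list(zip(forward, backward))


def running_sum(x, h):
    """Stand-in cell (not an LSTM): accumulates the inputs seen so far."""
    return h + x


states = bidirectional([1.0, 2.0, 3.0], running_sum, running_sum)
```

At each position the first element summarises the inputs up to that step and the second summarises the inputs from that step to the end, which is the sense in which a bidirectional model sees both past and future context.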

The main benefits of BiLSTMs include:

Comprehensive context modeling: by integrating past and future states, BiLSTMs better capture dependencies spanning both directions;
Noise robustness: bidirectional context smooths over irregularities in sequential data;
Superior performance in classification: many sequential tasks (speech recognition, named entity recognition, ECG analysis) benefit from having the entire sequence before decision-making.

The main drawback of BiLSTMs is their increased computational cost and memory footprint, as they require approximately twice the parameters of standard LSTMs. Nonetheless, in many applications, the performance gains justify the added complexity.

3.2.5. Temporal Convolutional Networks (TCNs)

Temporal Convolutional Networks (TCNs) provide an alternative to recurrent architectures for sequence modeling. Instead of sequential processing, TCNs employ causal one-dimensional convolutions, ensuring that predictions at time t depend only on present and past inputs. By stacking layers of dilated convolutions, the receptive field grows exponentially with depth, allowing TCNs to capture long-term dependencies efficiently.

Key components of TCNs include:

Causal convolutions, which preserve the temporal ordering of input data;
Dilated convolutions, which insert gaps between kernel elements to expand the receptive field without increasing parameters;
Residual connections, often employed to stabilize training in deep TCNs and facilitate gradient flow.
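
A minimal sketch of the first two components, assuming one convolution per layer and implicit zero-padding on the left, is given below. The receptive-field formula 1 + (k - 1) * (sum of dilations) holds under that assumption; residual blocks that stack two convolutions per level would see a wider field. Function names and toy values are illustrative.

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - i * dilation], with zeros before the start."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            j = t - i * dilation
            if j >= 0:  # implicit left zero-padding keeps the filter causal
                acc += w_i * x[j]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Input steps seen by one output, assuming one convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


impulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv1d(impulse, [1.0, 1.0], dilation=2)
field = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

The impulse response is zero before the impulse, confirming causality, and doubling the dilation at each of four layers covers 16 input steps with only two weights per layer.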

Compared to RNNs and LSTMs, TCNs offer several advantages:

Parallel computation across time steps, improving training efficiency;
More stable gradients, avoiding vanishing/exploding issues;
Competitive or superior accuracy in tasks such as speech synthesis, machine translation, and time series forecasting.

A limitation of TCNs is their higher memory usage for very long sequences, as intermediate convolutions must be stored. Nonetheless, they have emerged as a strong alternative to recurrent models, combining the benefits of convolutional architectures with the ability to model long-range dependencies.

3.3. Training and Regularization

Both ML and DL models rely on optimization algorithms to fit parameters to data. Gradient-based methods iteratively adjust weights to minimize a loss function, such as cross-entropy for classification tasks. To avoid overfitting, regularization strategies are employed:

Dropout, which randomly disables a fraction of neurons during training;
Early stopping, which halts training when validation performance no longer improves;
Weight regularization, such as $L_{1}$ or $L_{2}$ penalties;
Data augmentation, which increases the diversity of the training set by applying transformations to the input data.
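
Of these strategies, early stopping is the simplest to express in a few lines. The sketch below scans a list of validation losses and reports when training would stop and which epoch held the best value; the `patience` and `min_delta` parameters and the loss values are hypothetical and are not settings reported in this study.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) using 0-based epoch indices."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop; weights from best_epoch are kept
    return len(val_losses) - 1, best_epoch


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]  # illustrative values
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

In this toy run training halts at epoch 5 and restores epoch 2, so the later improvement at epoch 6 is never reached: a small patience trades training time against the risk of stopping too early.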

Performance evaluation is typically conducted through metrics including accuracy, precision, recall, F1-score, and confusion matrices, which collectively provide insight into the trade-off between sensitivity and specificity.
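
For a binary task these metrics all derive from the four confusion-matrix counts. The following sketch, with label 1 assumed to denote the positive class and toy predictions chosen for illustration, makes the relationships explicit.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived scores, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "specificity": specificity, "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # illustrative labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

Recall corresponds to sensitivity, so reporting it alongside specificity exposes the trade-off that accuracy alone can hide.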

4. Approach

This section describes the proposed methodology for atrial fibrillation (AF) detection that integrates traditional machine learning with handcrafted features and modern deep learning architectures operating directly on raw signals. Figure 1 provides a high-level overview of the experimental pipeline.

4.1. Preprocessing

Preprocessing represents a fundamental step of the proposed pipeline, ensuring that atrial fibrillation (AF) episodes are consistently detected and that both feature-based and deep learning approaches operate on standardised input data. The preprocessing workflow is articulated into several stages, as detailed below.

4.1.1. Coalescence Window Analysis

Paroxysmal AF episodes often appear in clusters rather than as isolated events. To account for this, we introduce a coalescence window (CW), a temporal threshold used to merge adjacent AF episodes into clinically meaningful clusters. Two episodes are considered part of the same cluster if the temporal distance between the end of the first and the onset of the second is less than the CW. This grouping strategy reduces fragmentation caused by minor interruptions and avoids over-segmentation due to annotation artefacts.
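
The merging rule can be sketched as a single pass over episodes sorted by onset. The representation of an episode as an (onset, end) pair in seconds, the function name, and the toy values are assumptions for illustration, not the authors' implementation.

```python
def coalesce_episodes(episodes, window):
    """Merge (onset, end) episodes whose gap to the previous cluster is < window."""
    clusters = []
    for onset, end in sorted(episodes):
        if clusters and onset - clusters[-1][1] < window:
            # Gap is shorter than the window: extend the current cluster.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([onset, end])
    return [tuple(c) for c in clusters]


# Four toy episodes (seconds); the gaps are 15 s, 440 s, and 5 s.
episodes = [(0.0, 30.0), (45.0, 60.0), (500.0, 520.0), (525.0, 600.0)]
clusters = coalesce_episodes(episodes, window=60.0)
```

With a 60 s window the four episodes collapse into two clusters, whereas a 10 s window would merge only the last pair, showing how the window size controls fragmentation.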

The CW is determined by analysing the distribution of inter-event intervals and applying the KneeLocator algorithm to identify an elbow point in the curve of grouped episodes. This strategy reduces over-fragmentation caused by short interruptions and improves the representativeness of training samples. In addition, by aligning computational segmentation with clinical event distribution, CW analysis ensures that models are trained on realistic patterns.

The main objective is to identify the CW that minimizes two critical issues when segmenting annotated ECG data:

Event collisions, in which distinct AF episodes are erroneously merged due to an excessively large window;
Truncations, especially for episodes located near the end of recordings, which can produce incomplete or empty segments.

To select the CW, we performed a sensitivity analysis of how the window size affects the number of grouped episodes and located the elbow point of the resulting curve with the KneeLocator algorithm (Figure 2). This analysis supports a preliminary enrichment step in which AF/NSR labels are added to the dataset at the segment level through signal segmentation: the chosen window is the one that best limits data collisions and truncations.

The distribution of inter-event times (i.e., the time intervals between consecutive AF episodes) was first computed (Figure 3). As expected, a large concentration of short intervals was observed, highlighting the need for temporal grouping.

To select the optimal window, we evaluated the number of grouped AF episodes obtained by progressively increasing the coalescence window size. This produced a monotonically decreasing curve: as the window grows, more temporally adjacent episodes are merged into single clusters.

The curve (shown in Figure 2) was constructed by computing the number of episode clusters for each tested window size. The KneeLocator algorithm was then applied to detect the elbow point—i.e., the point beyond which further increases in the window yield diminishing returns in terms of episode grouping. This elbow represents the best trade-off between excessive fragmentation and over-grouping of events.
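
The sweep and the elbow search can be sketched as follows. Note that the elbow is located here with a simple stand-in heuristic (the point farthest from the chord joining the endpoints of the normalised curve), which is not the KneeLocator algorithm used in the study; the episode list and candidate windows are toy assumptions.

```python
def count_clusters(episodes, window):
    """Number of clusters after merging episodes separated by less than window."""
    count, last_end = 0, None
    for onset, end in sorted(episodes):
        if last_end is None or onset - last_end >= window:
            count += 1
            last_end = end
        else:
            last_end = max(last_end, end)
    return count


def elbow_index(xs, ys):
    """Index of the point farthest from the chord joining the curve's endpoints."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    span_x, span_y = (x1 - x0) or 1.0, (y1 - y0) or 1.0
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        # After normalisation the chord is the diagonal, so the gap is |nx - ny|.
        nx, ny = (x - x0) / span_x, (y - y0) / span_y
        d = abs(nx - ny)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Toy episodes (seconds) with gaps of 5, 5, 8, 200, 5, and 400 s.
episodes = [(0, 10), (15, 25), (30, 40), (48, 58), (258, 268), (273, 283), (683, 693)]
windows = [1, 6, 10, 100, 250, 500]
counts = [count_clusters(episodes, w) for w in windows]
elbow_window = windows[elbow_index(windows, counts)]
```

In this toy example the cluster count drops quickly while the window absorbs the short gaps and then flattens, and the heuristic picks the window at which that flattening begins.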

4.1.2. Segmentation

Since long ECG traces cannot be directly used for classification, the recordings are divided into fixed-length segments of L = 166,755 samples. This target length is derived from the optimal coalescence window, which was identified at approximately 1667.56 s (about 27.8 min). Each segment is then labelled as AF or NSR based on the expert-provided annotations in the dataset.

A segment was labeled as AF if at least one time instant within the segment overlapped with an AF annotation; otherwise, the segment was labeled as NSR. This conservative labeling strategy prioritizes sensitivity to paroxysmal AF episodes and reflects clinical screening requirements.
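
The labeling rule amounts to an interval-overlap test per segment. The sketch below assumes annotations are given as (start, stop) sample indices with an exclusive stop, and it drops an incomplete segment at the end of the recording; both choices are illustrative assumptions rather than details stated in the text.

```python
def segment_and_label(n_samples, segment_len, af_intervals):
    """Split [0, n_samples) into full segments and label AF on any overlap.

    af_intervals holds (start, stop) sample indices with stop exclusive.
    """
    segments = []
    for start in range(0, n_samples - segment_len + 1, segment_len):
        stop = start + segment_len
        # Two half-open intervals overlap when each starts before the other ends.
        is_af = any(a_start < stop and a_stop > start
                    for a_start, a_stop in af_intervals)
        segments.append((start, stop, "AF" if is_af else "NSR"))
    return segments


# Toy recording: one short AF episode straddling the first segment boundary.
segments = segment_and_label(n_samples=1000, segment_len=300,
                             af_intervals=[(290, 310)])
labels = [label for _, _, label in segments]
```

A 20-sample episode that crosses a boundary marks both neighbouring segments as AF, which illustrates why the rule favours sensitivity over specificity.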

This segmentation process transforms the long, continuous signals into a structured dataset of labelled instances, suitable for both machine learning and deep learning pipelines. By adopting fixed-length windows, we ensure comparability across samples and compatibility with batch-based training procedures.

The optimal coalescence window (CW), identified through knee-point analysis, was equal to 1667.56 s (≈27.8 min) and was used exclusively to aggregate temporally adjacent AF episodes into unified events, preventing artificial fragmentation of clinically related arrhythmic intervals. For signal extraction and model training, each recording was segmented into fixed-length windows of L = 83,377 samples, corresponding to 416.9 s (≈6.9 min) at a sampling rate of 200 Hz (i.e., one quarter of the optimal CW, and half of the 166,755-sample length given above). This choice represents a trade-off between temporal resolution and computational feasibility, while preserving sufficient contextual information for AF detection. Each segment was labelled as AF or NSR based on overlap with expert-provided annotations.
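
The relationship between the quoted sample count, duration, and coalescence window can be checked with a few lines of arithmetic. All three constants are taken from the text; only the variable names are introduced here.

```python
SAMPLING_RATE_HZ = 200          # stated in the text
COALESCENCE_WINDOW_S = 1667.56  # stated in the text (approximate value)
SEGMENT_SAMPLES = 83_377        # stated in the text

segment_seconds = SEGMENT_SAMPLES / SAMPLING_RATE_HZ
segment_minutes = segment_seconds / 60
fraction_of_cw = segment_seconds / COALESCENCE_WINDOW_S
```

The segment lasts about 416.9 s, or roughly 6.9 min, and the ratio to the coalescence window is one quarter to within rounding of the reported window length.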

4.1.3. Feature Extraction

In the first branch of the pipeline, each ECG segment undergoes handcrafted feature extraction. Features are computed in multiple domains:

Time domain, including statistical descriptors (mean, variance, skewness, kurtosis), entropy measures, and Hjorth parameters (activity, mobility, complexity);
Frequency domain, obtained via Fast Fourier Transform (FFT) to extract spectral entropy, dominant frequencies, and band-limited power;
Wavelet domain, using multi-resolution decomposition (e.g., Daubechies-4 wavelets) to capture transient and nonstationary patterns characteristic of AF.
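
As one example from the time-domain group, the Hjorth parameters can be computed from the variances of a signal and its first two discrete differences. The sketch below uses a synthetic 5 Hz sine sampled at 200 Hz as a stand-in for an ECG segment; the test signal and function name are assumptions for illustration.

```python
import numpy as np


def hjorth_parameters(x):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)    # first difference, a discrete derivative
    ddx = np.diff(dx)  # second difference
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity


fs = 200                              # sampling rate in Hz
t = np.arange(2000) / fs              # 10 s of samples
sine = np.sin(2 * np.pi * 5 * t)      # synthetic stand-in for a segment
activity, mobility, complexity = hjorth_parameters(sine)
```

A pure sine has a complexity close to 1, so values well above 1 on real segments indicate a waveform whose shape departs from a single oscillation.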

The extracted features are standardized using z-score normalization and balanced via random undersampling to counteract class imbalance. Segments with more than 10% missing samples were excluded, while minor gaps were linearly interpolated. The resulting dataset provided a consistent numerical representation of ECG morphology and dynamics, suitable for machine learning classifiers. This branch emphasizes computational efficiency, providing models that can be deployed in clinical or edge environments with limited resources.
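
The standardisation and balancing steps can be sketched with NumPy as follows. The synthetic feature matrix, the 20/80 class split, and the fixed seed are assumptions; in practice the scaling statistics would normally be estimated on the training split only.

```python
import numpy as np


def zscore(features):
    """Standardise each column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (features - mean) / std


def random_undersample(features, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return features[keep], labels[keep]


rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y = np.array([1] * 20 + [0] * 80)                  # imbalanced toy labels
X_bal, y_bal = random_undersample(zscore(X), y)
```

Undersampling discards majority-class rows, so it trades some training data for a balanced class prior.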

4.2. Machine Learning Pipeline

Once standardized, the feature vectors extracted from the segmented ECG signals were used to train a set of supervised machine learning classifiers. These were chosen to represent diverse algorithmic families and to serve as computationally efficient baselines, supporting direct comparison with deep learning models and offering viable options for resource-constrained deployment:

Tree-based models: Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, XGBoost, and CatBoost;
Linear and probabilistic models: Logistic Regression and Gaussian Naïve Bayes;
Distance- and kernel-based models: k-Nearest Neighbors (KNN) and Support Vector Classifier (SVC);
Neural model: Multi-Layer Perceptron (MLPClassifier).

All classifiers were trained on the same feature set to ensure a fair comparison. This branch focuses on resource-efficient models that can be deployed in clinical and edge environments. The outcomes also provide a baseline for evaluating the effectiveness of deep learning methods.

4.3. Deep Learning Pipeline

In parallel to the feature-based approach, a deep learning pipeline was designed to directly exploit raw ECG signals, thus avoiding the need for handcrafted feature extraction. Each standardised segment was reshaped into (batch size, target length, 1) and processed end-to-end within a single neural architecture.
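
The reshaping step can be sketched with NumPy. The per-segment z-score used below is an assumed form of standardisation (the text does not specify it), and the toy target length of 1000 samples is chosen only to keep the example small.

```python
import numpy as np


def to_model_input(segments):
    """Standardise each segment and add a trailing channel axis."""
    segments = np.asarray(segments, dtype=np.float32)
    mean = segments.mean(axis=1, keepdims=True)
    std = segments.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for flat segments
    return ((segments - mean) / std)[..., np.newaxis]


raw = np.random.default_rng(0).normal(size=(8, 1000))  # 8 toy segments
batch = to_model_input(raw)                            # (batch size, target length, 1)
```

The trailing axis of size 1 is the channel dimension expected by one-dimensional convolutional layers for a single-lead signal.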

The model was structured as a hybrid design combining convolutional layers (CNN) with a bidirectional Long Short-Term Memory (BiLSTM) block, as shown in Figure 4. The convolutional layers act as local feature extractors, capturing morphological characteristics and short-term temporal patterns directly from the ECG waveforms. The extracted representations are then passed to the BiLSTM units, which enable the network to capture sequential dependencies in both the forward and backward directions, effectively modelling the recurrent nature of atrial fibrillation episodes. Dense layers with dropout provide the final classification into AF or NSR. Variants of the architecture include CNN-only and CNN + TCN hybrids, enabling systematic evaluation of robustness. This end-to-end approach leverages the full expressive power of deep learning, capturing patterns beyond the reach of handcrafted features.

The training process was carefully regularised to ensure generalisation. Dropout layers were introduced throughout the network to reduce overfitting, while early stopping and model checkpointing were employed to retain the best-performing weights. The model was implemented in Keras with the TensorFlow backend and trained using the binary cross-entropy loss function optimised via Adam. Performance was evaluated through standard classification metrics, including accuracy, precision, recall, F1-score, and confusion matrices.

This architecture proved highly effective in learning discriminative representations directly from raw ECG signals, uncovering subtle temporal dependencies that handcrafted features could not fully capture. By contrast with the machine learning branch, the deep learning pipeline demonstrated a stronger capacity for generalisation, highlighting the potential of end-to-end strategies for reliable atrial fibrillation detection.

In addition to convolutional and recurrent architectures, a lightweight Transformer-based model was included as a representative attention-based baseline. The architecture follows a hybrid design in which a shallow convolutional front-end performs local feature extraction and temporal downsampling, reducing sequence length and computational cost. The resulting feature maps are then processed by a compact Transformer encoder composed of a single multi-head self-attention layer, enabling the modeling of long-range temporal dependencies within each ECG segment.

To ensure a fair comparison with the recurrent models and to preserve computational efficiency, the Transformer configuration was intentionally kept shallow, using a limited number of attention heads and a single encoder block. The Transformer output is aggregated through global average pooling and passed to a fully connected classification head with sigmoid activation. This design allows for the evaluation of attention mechanisms in long-term ECG analysis while remaining compatible with edge-oriented deployment constraints.

5. Experimental Settings

The experiments were organised into four phases that retrace the steps leading to the best-performing model. The first phase comprises the first four experiments, all based on the same simple model: two convolutional blocks (Conv1D, MaxPooling1D, and Dropout layers), followed by a GlobalAveragePooling1D layer and two fully connected layers interspersed with Dropout.

In the second phase, the segment length was halved, both to let the model focus on shorter windows and thus better identify AF events, and to check whether the shortened signal still contains sufficient discriminative information.

In the third phase, a Bidirectional LSTM block was added after four convolutional blocks so that the network could also capture long-term temporal dependencies in the ECG signal.

The fourth phase was dedicated to optimising the network hyperparameters, in particular by applying more aggressive pooling.

5.1. Dataset

IRIDIA-AF is a large publicly available long-term electrocardiogram (ECG) monitoring dataset developed for the study of paroxysmal atrial fibrillation. It is described in a research article published in Scientific Data [25], which focuses on classifying AF episodes. The dataset includes 167 recordings (records) from 152 patients, some of whom underwent multiple Holter recording sessions on different dates. These signals, often collected over 24 to 72 h through Holter monitors or similar devices, provide a realistic view of paroxysmal AF dynamics, which may not be captured during short-term monitoring. Metadata describing recording sessions, sampling rate, and patient information are stored alongside the signals to enable downstream segmentation and annotation. Each record contains ECG data sampled at 200 Hz, split into 24 h files, with AF episodes manually annotated by clinical experts.

The dataset is structured as follows:

A metadata file describing the 167 recordings, each associated with a patient. Since some patients underwent multiple acquisition sessions, the number of recordings exceeds the number of patients. The file includes detailed information for each record: acquisition date and time, calculated duration, number of files, number of samples, and effective recording time in seconds. The effective recording time may differ from the calculated duration due to interruptions or overlaps in the data.
An ECG annotation CSV file, which provides the temporal details of each AF episode and periods of normal sinus rhythm (NSR), including start and end timestamps, file indices, QRS complex indices, episode duration, and duration of NSR before onset.
An RR interval annotation CSV file, which contains the start and end indices of RR cycles (the intervals between consecutive heartbeats) and the corresponding file indices.
A folder of HDF5 files for each record, containing:
- The raw ECG signals, stored as a two-dimensional array [samples × 2], where each row represents one sample across two ECG leads (i.e., two electrode positions);
- The RR interval data, expressed in milliseconds, which are useful for heart rate variability analysis.

The recordings vary in length, ranging from a minimum duration of 71,635 s (∼20 h) to a maximum of 345,599 s (∼96 h). Accordingly, each record is composed of a variable number of HDF5 files, depending on its total duration. The dataset provides annotated long-term Holter ECG recordings, where AF events have been manually labelled by expert cardiologists. These annotations were used as ground truth to segment the recordings and extract labelled AF and NSR episodes. The IRIDIA-AF dataset has been widely used in studies focusing on atrial fibrillation detection and is particularly suitable for evaluating automated analysis techniques on long-term ECG recordings.

5.2. Data Splitting

To ensure a balanced class distribution and avoid data leakage, the data splitting was performed at the level of signal segments, maintaining the same distribution of AF and NSR classes across subsets. Since the dataset was initially imbalanced, with a higher prevalence of NSR segments, we applied random undersampling to the majority class (NSR) in order to match the number of AF instances. This procedure reduced class imbalance and prevented bias during model training, while ensuring that the balance was preserved consistently across the training, validation, and test sets. For machine learning classifiers, a stratified split was applied by assigning 30% of the available data to the test set, while the remaining 70% was used for training. In the case of deep learning models, the dataset was first split into 70% for training and 30% for testing. Subsequently, the testing portion was further divided into 70% test and 30% validation, thus obtaining three distinct subsets: training, validation, and test. This strategy ensured that validation and test sets remained independent from the training data while providing a dedicated validation subset for model tuning. To ensure comparability across experiments and reduce variability due to different data partitions, the splitting procedure was performed only once. The same subsets were then consistently used for all models under investigation, allowing for a fair comparison between machine learning and deep learning approaches.
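The undersampling and two-stage splitting procedure for the deep learning models can be sketched as follows. This is a minimal standard-library illustration, not the original code: the function name `balanced_split`, the seed, and the toy label counts are assumptions. Applying 70/30 twice yields overall proportions of 70% training, 21% test, and 9% validation.

```python
import random


def balanced_split(labels, seed=0):
    """Undersample the majority class, then split each class 70/30 and 70/30 again."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}  # 0 = NSR, 1 = AF
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    n = min(len(by_class[0]), len(by_class[1]))  # size of the minority class
    train, val, test = [], [], []
    for y in (0, 1):
        idx = rng.sample(by_class[y], n)  # random undersampling + shuffle
        n_train = n * 7 // 10             # 70% of the class for training
        held_out = idx[n_train:]          # remaining 30%
        n_test = len(held_out) * 7 // 10  # 70% of the held-out part for testing
        train += idx[:n_train]
        test += held_out[:n_test]
        val += held_out[n_test:]          # 30% of the held-out part for validation
    return train, val, test


# Toy, imbalanced label vector (hypothetical counts): 3000 NSR vs. 1000 AF segments.
labels = [0] * 3000 + [1] * 1000
train, val, test = balanced_split(labels)
print(len(train), len(val), len(test))  # 1400 180 420
```

Because the split is performed per class, every subset keeps the same 1:1 ratio of AF to NSR segments.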

In addition to this fixed split, a grouped cross-validation protocol at the recording level was performed for the best-performing deep learning model, as detailed in Section 6.3.

5.3. Hyperparameters

For machine learning classifiers, the default hyperparameter configurations provided by the respective libraries were adopted. This choice was motivated by the objective of establishing a solid baseline comparison across different algorithmic families (tree-based, linear, probabilistic, kernel-based, and shallow neural models) without performing extensive hyperparameter tuning, which would have risked overfitting to the specific dataset.

In contrast, for deep learning models a set of key hyperparameters was explicitly defined and optimized to ensure stable convergence and robust generalization.

Table 1 summarizes the main hyperparameters considered in the proposed CNN + BiLSTM architecture.

For deep learning, the training process uses binary cross-entropy loss optimized with Adam, and incorporates callbacks such as early stopping and model checkpointing to prevent overfitting. Dropout layers and max pooling further enhance generalization.

The proposed methodology integrates two complementary paradigms. On one side, feature-based machine learning classifiers offer transparency, stability, and lower computational cost, which are valuable for clinical acceptance and edge deployment. On the other side, deep learning models achieve state-of-the-art accuracy by learning complex spatio-temporal representations directly from the data.

For the Transformer-based model, a single encoder layer with four attention heads was employed. This configuration was selected to balance expressiveness and computational efficiency and to allow for a fair comparison with the CNN and CNN + BiLSTM architectures.

5.4. Validation

Model performance is assessed using accuracy, precision, recall, F1-score, and the confusion matrix, which allows us to quantify both correctly and incorrectly classified segments. The confusion matrix distinguishes true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), thus enabling the computation of the following measures:

Accuracy (A): measures the proportion of correctly classified instances over the total number of predictions:

$A = \frac{TP + TN}{TP + FP + TN + FN}.$

Precision (P): quantifies the fraction of true positives among all instances predicted as positive:

$P = \frac{TP}{TP + FP}.$

Recall (R): also referred to as sensitivity, evaluates the proportion of correctly identified positive cases over the total number of actual positive cases:

$R = \frac{TP}{TP + FN}.$

F1-Score (F1): computed as the harmonic mean of precision and recall, providing a balanced measure between the two:

$F1 = 2 \cdot \frac{P \cdot R}{P + R}.$

Since all experiments were conducted on a balanced dataset obtained through random undersampling, accuracy, precision, recall, and F1-score are numerically consistent and comparable. In particular, under balanced class distributions, accuracy closely reflects the F1-score, which is therefore emphasized as the primary metric in the discussion of deep learning results.

In addition to these metrics, for deep learning models we also monitored the loss function, which quantifies the discrepancy between the predicted and the true labels. In this study, binary cross-entropy was adopted as the probabilistic loss function, optimized with Adam. Learning curves of accuracy and loss across epochs were analyzed to assess generalization performance and detect potential signs of overfitting or underfitting.

6. Results

6.1. Machine Learning Classifiers

The family of traditional machine learning classifiers exhibited heterogeneous performance levels across the evaluated metrics, reflecting the trade-off between model complexity, generalisation capability, and predictive power. Table 2 summarises the detailed results obtained for each classifier on the balanced dataset.

Linear and probabilistic models such as Logistic Regression and Gaussian Naive Bayes reported the lowest accuracies (0.53–0.60). In particular, GaussianNB strongly favoured AF detection (recall 0.88), but at the cost of a very large number of NSR segments misclassified as AF, yielding poor overall balance. Logistic Regression showed slightly more stable behaviour but still failed to capture the nonlinear dependencies inherent in ECG morphology.

Kernel- and distance-based methods like Support Vector Classifier (SVC) and K-Nearest Neighbours (KNN) improved moderately, reaching accuracies around 0.66–0.68. However, their precision–recall trade-off remained suboptimal, with KNN particularly sensitive to noisy or overlapping class boundaries.

Tree-based ensembles proved more robust. Random Forest and Extra Trees achieved the best overall results, with accuracies above 0.80 and balanced precision and recall across both classes. Gradient Boosting performed worse (0.72 accuracy), showing higher sensitivity to NSR but lower recall for AF. Boosted frameworks such as XGBoost and CatBoost achieved stable results close to 0.79 accuracy, consistently outperforming simpler ensembles. These results highlight the strength of ensemble learning in capturing nonlinear patterns in ECG signals.

Finally, the Multi-Layer Perceptron (MLPClassifier) achieved competitive results with 0.74 accuracy, outperforming most linear and kernel-based approaches but falling short of ensemble-based methods.

Overall, machine learning classifiers provided computationally efficient baselines with reasonable discrimination power. However, even the strongest models (Extra Trees and Random Forest) did not surpass 81% accuracy, revealing the limitations of relying exclusively on handcrafted features to capture the complex spatio-temporal dynamics of AF.

6.2. Deep Learning Models

In contrast, deep learning approaches consistently outperformed traditional methods.

Table 3 summarises the results obtained in the seven experiments conducted to evaluate the impact of segment length, network depth and the introduction of recurrent components (BiLSTM) in the classification of ECG segments (AF vs. NSR).

Table 3 contains one row per experiment. Its columns report, in order: the phase to which the experiment belongs, the experiment number, the length of the ECG segment, whether a BiLSTM block was used and, if so, its number of units, the number and sizes of the convolutional blocks, the number and sizes of the pooling blocks, the maximum number of training epochs, the number of epochs actually run, and the early-stopping patience. The last four columns report the validation metrics described previously (Accuracy, Precision, Recall, and F1-Score).

The results highlight a clear four-phase evolution, where each architectural and parametric change contributed to progressively improving the model’s performance.

In the first phase (exp1–4), we used segments with original length (≈166 k samples) and a relatively simple CNN architecture, consisting of two Conv1D blocks (32 and 64 filters) followed by MaxPooling (pool size 2), GlobalAveragePooling, and a fully connected layer of 32 neurons. The main hyperparameters were gradually varied, increasing the maximum number of epochs (from 20 to 50 to 100) and the early stopping patience (from 5 to 10). Performance progressively improved from an accuracy of 0.76 in exp1 to 0.90 in exp4, indicating that a greater number of epochs allowed for more stable convergence, albeit with very long training times and an increasing risk of overfitting. In this case, the model tended to learn noisy local patterns, unable to effectively exploit the long-term temporal relationships present in the ECG signal.

The initial experiments (exp1–exp4) can be interpreted as convolution-only baselines. Increasing the number of training epochs and the early-stopping patience progressively improved performance, reaching up to 0.90 accuracy. However, despite these improvements, CNN-only architectures remained limited in their ability to capture long-range temporal dependencies, which are critical for modelling rhythm irregularity in long-term ECG signals.

In phase two (exp5), to address the training time issue, the segment length was reduced to approximately 83 k samples while keeping the architecture unchanged. This choice resulted in a significant reduction in time per epoch and greater training stability, while still maintaining good performance (accuracy ≈ 0.85). This indicates that shorter segments still contain sufficient discriminative information for AF/NSR classification and may be preferable when computational resources are limited.

In phase three (exp6), to overcome the intrinsic limitations of CNNs alone, which capture predominantly local patterns, a 32-unit Bidirectional LSTM block was introduced after four convolutional blocks (32-64-64-128 filters). This allowed the network to also model long-term temporal dependencies, resulting in a significant improvement in metrics (accuracy ≈ 0.88). The synergistic effect between local feature extraction (Conv1D) and sequential modelling (BiLSTM) represented the first real qualitative leap.

Finally, in phase four (exp7), the exp6 architecture was optimised by increasing the pooling aggressiveness in the first two blocks (pool size 4 instead of 2) to reduce dimensionality and computation time, and increasing the number of epochs to 20. This produced the best overall result, with an accuracy of 0.965 and virtually identical precision, recall, and F1 scores, indicating a well-balanced and generalizable model. The stability of the validation curves and the low loss also show that the dropout setting (0.2/0.3) is sufficient to prevent overfitting, even with a more complex model.
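The effect of the more aggressive pooling on the sequence seen by the BiLSTM can be estimated with a short calculation. The sketch below rests on several assumptions not stated in the text: the segment is taken as exactly 83,000 samples, each of the four convolutional blocks ends with one pooling layer, convolutions preserve length, the last two blocks keep pool size 2, and pooling uses a stride equal to the pool size without padding (so the length is floor-divided).

```python
def length_after_pooling(n_samples, pool_sizes):
    """Sequence length after successive non-overlapping pooling layers."""
    length = n_samples
    for p in pool_sizes:
        length //= p  # stride = pool size, no padding -> floor division
    return length


segment = 83_000  # "approximately 83 k samples", rounded for illustration

exp6_len = length_after_pooling(segment, [2, 2, 2, 2])  # pool size 2 in every block
exp7_len = length_after_pooling(segment, [4, 4, 2, 2])  # pool size 4 in the first two blocks
print(exp6_len, exp7_len)  # 5187 1296
```

Under these assumptions the recurrent block processes roughly a quarter as many time steps in exp7 as in exp6, which is consistent with the reported reduction in dimensionality and computation time.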

The Transformer-based model is reported separately in Table 3 from the phased CNN-based experiments, as it represents an alternative architectural paradigm rather than an incremental modification of the convolutional pipeline. Despite its compact size (approximately 65k parameters), the CNN + Transformer architecture achieved competitive performance (F1-score = 0.87), confirming the effectiveness of attention mechanisms in modeling long-range temporal dependencies in long-term ECG signals. However, its performance remained below that of the proposed CNN + BiLSTM architecture, suggesting that recurrent inductive biases are particularly well-suited for capturing rhythm irregularity in paroxysmal atrial fibrillation.

Table 4 provides the detailed classification report; the results show very high overall performance and balance between the two classes.

Specifically, for the NSR (0) class, the model achieves a precision of 0.97 and a recall of 0.96, while for the AF (1) class, it achieves virtually symmetric values (precision 0.96, recall 0.97). The F1 score, which balances precision and recall, is 0.96–0.97 for both classes, confirming balanced and generalizable behaviour. The overall accuracy is 0.97 out of a total of 1886 tested samples.

The confusion matrix (Table 5) confirms this result: out of 929 NSR samples, 888 are correctly classified and only 41 are false positives (classified as AF); out of 957 AF samples, 932 are correct and 25 are false negatives. In both cases, the number of errors is very low and of comparable size across the two classes (41 vs. 25), indicating no marked bias towards either class.
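As a consistency check, the metrics defined in Section 5.4 can be recomputed from the confusion-matrix counts quoted above, taking AF as the positive class. The function name `report` is illustrative only.

```python
def report(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (AF) class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts from the confusion matrix: 888 NSR correct, 41 NSR -> AF,
# 25 AF -> NSR, 932 AF correct.
tn, fp, fn, tp = 888, 41, 25, 932
acc, p, r, f1 = report(tn, fp, fn, tp)
print(f"{acc:.3f} {p:.3f} {r:.3f} {f1:.3f}")  # 0.965 0.958 0.974 0.966
```

The recomputed values agree with the reported accuracy of 0.965 and with the AF-class precision (0.96) and recall (0.97) after rounding to two decimals.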

These results significantly outperformed all feature-based classifiers, underscoring the ability of deep learning to uncover discriminative spatio-temporal representations that handcrafted features could not capture.

Figure 5 shows the accuracy and loss comparison during training and validation for the first (exp1) and last (exp7) deep learning experiments. In the case of exp1, based on a simple CNN architecture with two convolutional blocks and long segments (≈166 k samples), the validation accuracy increases slowly and tends to stabilise after just a few epochs at values around 0.75, while the validation loss remains high (≈0.53) and shows fluctuations, indicating unstable convergence and limited predictive capacity. In contrast, in exp7, which uses a deeper network with four convolutional blocks and a BiLSTM layer trained on halved segments (≈83 k samples), we observe a rapid increase in validation accuracy to values close to 0.97 and a decrease in validation loss to approximately 0.08. Furthermore, the training and validation curves are very close, indicating no overfitting and excellent generalisation.

This comparison highlights how the introduction of recurrent components (BiLSTM), the increase in convolutional depth, and the reduction in segment length enabled faster and more stable convergence, dramatically improving the model’s overall performance.

6.3. Results Under Grouped Cross-Validation

To address potential information leakage arising from segment-level splitting, we evaluated the proposed CNN + BiLSTM model using a grouped cross-validation protocol at the recording level. A 5-fold GroupKFold strategy was employed, ensuring that all segments originating from the same recording were assigned to the same fold. In Table 6, performance is reported as mean ± standard deviation across folds.
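The key property of this protocol, namely that no recording contributes segments to both the training and the test side of a fold, can be illustrated with a simplified standard-library stand-in. This is not the scikit-learn GroupKFold implementation used in the study; the greedy balancing rule, the function name `grouped_folds`, and the recording identifiers are assumptions made for illustration.

```python
from collections import Counter


def grouped_folds(groups, n_folds=5):
    """Assign every recording to exactly one fold, balancing segment counts greedily."""
    sizes = Counter(groups)
    fold_of = {}
    load = [0] * n_folds
    # Largest recordings first, each placed in the currently lightest fold.
    for g, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        k = load.index(min(load))
        fold_of[g] = k
        load[k] += size
    return [fold_of[g] for g in groups]


# 100 toy segments drawn from 12 hypothetical recordings.
groups = ["rec_%02d" % (i % 12) for i in range(100)]
folds = grouped_folds(groups)

for k in range(5):
    test_recs = {g for g, f in zip(groups, folds) if f == k}
    train_recs = {g for g, f in zip(groups, folds) if f != k}
    print(k, len(test_recs), "test recordings; overlap:", len(test_recs & train_recs))
```

Every fold reports an overlap of zero, which is the leakage guarantee that segment-level splitting cannot provide.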

As expected, grouped cross-validation yields slightly lower performance compared to segment-wise evaluation, due to the increased difficulty of generalizing across unseen recordings. Nevertheless, the proposed model maintains high and stable performance across folds, with limited variance, confirming its robustness and generalization capability in realistic long-term ECG monitoring scenarios.

6.4. Comparison Between ML and DL Approaches

Table 7 summarises the best-performing ML and DL models. While XGBoost and CatBoost reached 76% accuracy with balanced precision–recall trade-offs, the CNN + BiLSTM achieved 97% accuracy and near-perfect F1-scores.

This performance gap highlights a fundamental distinction: ML classifiers depend on feature engineering, which limits their ability to model the irregular temporal dynamics of AF. Conversely, DL architectures leverage end-to-end learning to directly extract robust features from raw ECG, yielding superior generalisation.

Even under grouped cross-validation, deep learning models consistently outperform feature-based machine learning approaches, although the performance gap is reduced compared to segment-wise evaluation.

In conclusion, ML remains valuable for low-resource deployments, but DL provides the most effective and clinically relevant solution for AF detection.

6.5. Comparison Between Deep Learning Architectures

A comparative analysis of the evaluated deep learning architectures highlights the impact of different inductive biases on long-term ECG modeling. Convolution-only networks provide a strong baseline by capturing local morphological patterns, but their performance saturates when long-range temporal dependencies become dominant.

The proposed CNN + BiLSTM model comprises 115,457 trainable parameters, corresponding to an approximate model size of 451 KB using 32-bit floating-point representation. While larger than the CNN + Transformer baseline (65 k parameters), the model remains compact compared to state-of-the-art deep architectures and is compatible with edge-oriented and real-time deployment scenarios.
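The quoted model size follows directly from the parameter count. The short calculation below assumes 4 bytes per parameter (32-bit floats) and 1024 bytes per kilobyte, and it ignores any file-format overhead; the Transformer figure uses the rounded count of 65,000 parameters and is therefore only indicative.

```python
BYTES_PER_FLOAT32 = 4


def model_size_kib(n_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate weight storage in KiB (1 KiB = 1024 bytes)."""
    return n_params * bytes_per_param / 1024


cnn_bilstm_kib = model_size_kib(115_457)   # parameter count reported in the text
transformer_kib = model_size_kib(65_000)   # "approximately 65 k parameters"
print(f"{cnn_bilstm_kib:.0f} KiB, {transformer_kib:.0f} KiB")  # 451 KiB, 254 KiB
```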

The CNN + Transformer architecture improves upon convolution-only models, achieving an accuracy of 0.87 while maintaining a very compact footprint (approximately 65 k parameters). This confirms that self-attention mechanisms are effective in modeling global temporal relationships in long ECG segments and represent an efficient alternative for resource-constrained deployments.

Nevertheless, the proposed CNN + BiLSTM architecture consistently outperformed the Transformer-based model, reaching an accuracy of 0.97. This suggests that recurrent inductive biases remain particularly effective for modeling rhythm irregularity and sequential dynamics in paroxysmal atrial fibrillation, especially when combined with convolutional feature extraction. Overall, the results indicate a trade-off between computational efficiency and peak performance, with recurrent models offering superior accuracy and attention-based models providing a favourable efficiency–performance balance.

7. Conclusions

This work investigated the problem of atrial fibrillation (AF) detection from long-term ECG recordings by comparing two methodological paradigms: traditional machine learning (ML) classifiers based on handcrafted features, and deep learning (DL) approaches trained directly on raw signals. The ML models, and in particular XGBoost, provided solid baselines, but their ability to capture the complex temporal and morphological patterns of ECG was inherently limited.

On the other hand, the proposed deep learning pipeline, which combines convolutional layers with bidirectional LSTM units, proved markedly more effective. By directly exploiting raw ECG segments, the model achieved superior performance across all metrics, culminating in an accuracy of 97% in the best configuration. The analysis of the classification report and confusion matrix confirmed that the DL model not only minimized false positives but also improved sensitivity to AF episodes, highlighting its robustness and reliability in clinical scenarios.

Overall, the comparative analysis demonstrates the advantage of end-to-end learning strategies over handcrafted feature extraction. While ML classifiers remain useful for rapid prototyping, deep learning architectures are better suited to capture subtle and long-range dependencies in ECG signals.

This study presents some limitations that should be acknowledged. First, the number of available subjects is relatively limited, as the IRIDIA-AF dataset includes a restricted cohort of patients. This limitation is inherent to the use of long-term Holter ECG recordings, whose acquisition and manual clinical annotation are resource-intensive. Nevertheless, each record provides several hours of continuous ECG monitoring, enabling the extraction of a large number of temporally independent segments and multiple AF and non-AF episodes per subject. Second, the dataset does not include detailed clinical information regarding the overall cardiac condition of the patients, nor annotations of arrhythmias other than atrial fibrillation. Consequently, segments labeled as non-AF may still contain other rhythm abnormalities or cardiac conditions that are not explicitly annotated. This aspect may introduce latent variability within the non-AF class and limits the possibility of disentangling AF-specific patterns from other coexisting cardiac rhythms. Although a balanced dataset was adopted to ensure fair model comparison, real-world AF monitoring is inherently characterized by strong class imbalance. In practical deployments, strategies such as cost-sensitive learning, focal loss, or post hoc threshold adjustment could be employed to preserve sensitivity to rare AF events without increasing false positives.

Future research will extend this framework to larger and more heterogeneous datasets, exploring transfer learning and self-supervised pretraining in order to further enhance generalization. Moreover, integrating model explainability tools will be essential to bridge the gap between predictive accuracy and clinical trust, ultimately supporting more effective decision-making in the management of atrial fibrillation.

Author Contributions

Software, I.M. and C.V.; Writing—original draft, I.M.; Writing—review & editing, C.V.; Supervision, L.A. and A.M. All authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-funded by the Italian Complementary National Plan PNC-I.1 “Research initiatives for innovative technologies and pathways in the health and welfare sector” D.D. 931 of 6 June 2022, “DARE-DigitAl lifelong pRevEntion” initiative, code PNC0000002, CUP: B53C22006450001.

Data Availability Statement

The database is available on Zenodo (https://zenodo.org/record/8405941, accessed on 10 February 2026).

Conflicts of Interest

The authors declare no conflict of interest.

References

Król-Józaga, B. Atrial fibrillation detection using convolutional neural networks on 2-dimensional representation of ECG signal. Biomed. Signal Process. Control 2022, 74, 103470.
Rahul, J.; Sharma, L.D. Artificial intelligence-based approach for atrial fibrillation detection using normalised and short-duration time-frequency ECG. Biomed. Signal Process. Control 2022, 71, 103270.
Han, J.; Shi, L.; Zhang, X. A novel deep neural network for detection of Atrial Fibrillation using spatial and frequency features. J. Biomed. Inform. 2021, 112, 103648.
Oh, S.L.; Ng, E.Y.K.; Tan, R.S.; Acharya, U.R. Automated diagnosis of arrhythmia using combination of CNN and LSTM techniques with variable length heart beats. Comput. Biol. Med. 2018, 102, 278–287.
Zihlif, M.; Himeur, Y.; Bensaali, F. Detecting atrial fibrillation by deep convolutional neural networks. Comput. Biol. Med. 2018, 93, 84–92.
Wu, C.; Zhang, L. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. IEEE J. Biomed. Health Inform. 2020, 25, 1519–1528.
Hong, S.; Zhou, Y.; Shang, J.; Xiao, C.; Sun, J. Opportunities and challenges of deep learning methods for electrocardiogram data: A systematic review. Comput. Biol. Med. 2020, 122, 103801.
Kachuee, M.; Fazeli, S.; Sarrafzadeh, M. ECG heartbeat classification: A deep transferable representation. In Proceedings of the 2018 IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; pp. 443–444.
Tison, G.; Sanchez, J.; Ballinger, B.; Singh, A.; Olgin, J.; Pletcher, M.; Vittinghoff, E.; Lee, E.; Fan, S.; Gladstone, R.; et al. Passive detection of atrial fibrillation using a commercially available smartwatch. JAMA Cardiol. 2018, 3, 409–416.
Chen, J.; Zheng, Y.; Liang, Y.; Zhan, Z.; Jiang, M.; Zhang, X.; da Silva, D.S.; Wu, W.; de Albuquerque, V.H.C. Edge2Analysis: A Novel AIoT Platform for Atrial Fibrillation Recognition and Detection. IEEE J. Biomed. Health Inform. 2022, 26, 5772–5782.
Rahman, M.; Morshed, B.I. A Smart Wearable for Real-Time Cardiac Disease Detection Using Beat-by-Beat ECG Signal Analysis with an Edge Computing AI Classifier. In Proceedings of the 2024 IEEE 20th International Conference on Body Sensor Networks (BSN), Chicago, IL, USA, 15–17 October 2024.
Liu, X.; Liu, H.; Li, J.; Yang, Z.; Huang, Y.; Zhang, J. AcousAF: Acoustic Sensing-Based Atrial Fibrillation Detection System for Mobile Phones. In Proceedings of the Companion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing, Melbourne, Australia, 5–9 October 2024; pp. 377–383.
Pachori, D.; Tripathy, R.K.; Jain, T.K. Detection of atrial fibrillation from PPG sensor data using variational mode decomposition. IEEE Sensors Lett. 2024, 8, 6001904.
Zhao, L.; Lyu, R.; Lin, Q.; Zhou, A.; Zhang, H.; Ma, H.; Wang, J.; Shao, C.; Tang, Y. mmArrhythmia: Contactless arrhythmia detection via mmwave sensing. In ACM Interactive, Mobile, Wearable Ubiquitous Technologies; Association for Computing Machinery: New York, NY, USA, 2024; Volume 8, pp. 1–25.
Lin, C.T.; Chang, K.C.; Lin, C.L.; Chiang, C.C.; Lu, S.W.; Chang, S.S.; Lin, B.S.; Liang, H.Y.; Chen, R.J.; Lee, Y.T.; et al. An Intelligent Telecardiology System Using a Wearable and Wireless ECG to Detect Atrial Fibrillation. IEEE Trans. Inf. Technol. Biomed. 2010, 14, 726–733.
Sabbadini, R.; Riccio, M.; Maresca, L.; Irace, A.; Breglio, G. Atrial Fibrillation Detection by Means of Edge Computing on Wearable Device: A Feasibility Assessment. In Proceedings of the 2022 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Messina, Italy, 22–24 June 2022.
De Giovanni, E.; Valdés, A.A.; Peón-Quirós, M.; Aminifar, A.; Atienza, D. Real-Time Personalized Atrial Fibrillation Prediction on Multi-Core Wearable Sensors. IEEE Trans. Emerg. Top. Comput. 2022, 9, 1654–1666.
Bahador, N.; Jokelainen, J.; Mustola, S.; Kortelainen, J. Multimodal spatio-temporal-spectral fusion for deep learning applications in physiological time series processing: A Case Study in Monitoring the Depth of Anesthesia. Inf. Fusion 2021, 73, 125–143.
Petroni, A.; Cuomo, F.; Scarano, G.; Francia, P.; Colonnese, S. Atrial Fibrillation Detection by Multi-Lead ECG Processing at the Edge. In Proceedings of the 2021 IEEE Globecom Workshops (GC Wkshps), Madrid, Spain, 7–11 December 2021.
Musso, F.; Brinkmeyer, J.; Mobascher, A.; Warbrick, T.; Winterer, G. Spontaneous Brain Activity and EEG Microstates: A Novel EEG-fMRI Analysis Approach to explore resting-state networks. NeuroImage 2010, 52, 1149–1161.
Uwaechia, A.N.; Ramli, D.A. A Comprehensive Survey on ECG Signals as New Biometric Modality for Human Authentication: Recent Advances and Future Challenges. IEEE Access 2021, 9, 97760–97802.
Xiong, Z.; Stiles, M.; Zhao, J. Robust ECG signal classification for detection of atrial fibrillation using a novel neural network. In Proceedings of the 2017 Computing in Cardiology (CinC), Rennes, France, 24–27 September 2017; pp. 91–94.
Sun, L.; Li, H.; Muhammad, G. Randomized Attention and Dual-Path System for Electrocardiogram Identity Recognition. Eng. Appl. Artif. Intell. 2024, 132, 107883.
Bouhenguel, R.; Mahgoub, I. A Risk and Incidence Based Atrial Fibrillation Detection Scheme for Wearable Healthcare Computing Devices. In Proceedings of the 2012 6th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops, San Diego, CA, USA, 21–24 May 2012.
Gilon, C.; Grégoire, J.M.; Mathieu, M.; Carlier, S.; Bersini, H. IRIDIA-AF, a large paroxysmal atrial fibrillation long-term electrocardiogram monitoring database. Sci. Data 2023, 10, 714.

Figure 1. Overview of the Proposed Approach.

Figure 2. Determination of the optimal coalescence window using the KneeLocator algorithm. The curve shows the number of AF episode clusters obtained as a function of the coalescence window size. As the window increases, temporally adjacent episodes are progressively merged, reducing the number of distinct groups. The green dashed line marks the elbow point (CW ≈ 1667.56 s), which represents the optimal trade-off between over-fragmentation and over-grouping.
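
The curve in Figure 2 can be illustrated with a minimal sketch. It assumes that AF episodes are given as sorted, non-overlapping (onset, offset) pairs in seconds and that two episodes are merged when the gap between them does not exceed the coalescence window. The episode times, the candidate windows, and the helper names `count_clusters` and `elbow_index` are hypothetical. The elbow is located here with a simple maximum-distance-from-chord rule, used as a stand-in for KneeLocator rather than a reproduction of it.

```python
from typing import List, Sequence, Tuple


def count_clusters(episodes: Sequence[Tuple[float, float]], window_s: float) -> int:
    """Count episode groups when gaps of at most window_s seconds are merged.

    episodes: (onset, offset) times in seconds, assumed sorted and non-overlapping.
    """
    if not episodes:
        return 0
    clusters = 1
    prev_end = episodes[0][1]
    for start, end in episodes[1:]:
        if start - prev_end > window_s:
            clusters += 1  # gap exceeds the window, so a new group begins
        prev_end = max(prev_end, end)
    return clusters


def elbow_index(xs: Sequence[float], ys: Sequence[float]) -> int:
    """Index of the point lying farthest below the chord joining the curve's endpoints.

    Both axes are rescaled to [0, 1]; the curve is assumed to be decreasing.
    """
    x_span = xs[-1] - xs[0]
    y_span = ys[0] - ys[-1]
    if x_span == 0 or y_span == 0:
        return 0
    best_i, best_gap = 0, float("-inf")
    for i, (x, y) in enumerate(zip(xs, ys)):
        xn = (x - xs[0]) / x_span   # 0 at the first window, 1 at the last
        yn = (y - ys[-1]) / y_span  # 1 at the first window, 0 at the last
        gap = (1.0 - xn) - yn       # vertical distance below the chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


# Hypothetical episodes (seconds) and candidate windows, for illustration only.
episodes = [(0, 60), (100, 160), (200, 260), (5000, 5060), (5100, 5200), (20000, 20100)]
windows: List[float] = [0, 50, 1000, 5000, 15000]
counts = [count_clusters(episodes, w) for w in windows]
best_window = windows[elbow_index(windows, counts)]
print(counts, best_window)
```

On this toy input the cluster count falls from 6 to 1 as the window grows, and the elbow rule selects the 50 s window, the point after which widening the window merges few additional episodes.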

Figure 3. Distribution of inter-event intervals between AF episodes across all recordings. A high concentration of short intervals is observed, suggesting frequent AF recurrences within short time spans.

Figure 4. Architecture of the deep learning model: CNN layers extract local features, BiLSTM captures temporal dependencies, and dense layers perform final classification.

Figure 5. Representative learning curves (accuracy and loss vs. epochs) for the comparison between exp1 and exp7.

Table 1. Considered hyperparameters and values for deep learning models.

Hyperparameter	Description	Values
filters	Dimensionality of the output space, i.e., number of output filters	32, 64, 128
kernel_size	Length of the 1D convolution window	5
pool_size	Size of the max pooling window	2, 4
LSTM units	Number of hidden units in the BiLSTM layer	32
dropout_rate	Probability of dropping a unit during training	0.2, 0.3
batch_size	Number of samples per gradient update	8
n_epochs	Number of epochs to train the model	20
optimizer	Optimization algorithm used to minimize loss	Adam
loss	Function used to evaluate predictions	Binary cross-entropy
last_activation	Activation function of the last layer	Sigmoid
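
As an illustration of the size of the search space in Table 1, the following sketch expands the listed values into individual configurations. It assumes, purely for illustration, that every combination is a candidate; the experiments in this work are organised in phases, so this is not a description of the search actually performed. The dictionary keys and the helper `expand_grid` are hypothetical names.

```python
from itertools import product
from typing import Dict, List

# Values copied from Table 1; fixed hyperparameters are single-element lists.
search_space: Dict[str, list] = {
    "filters": [32, 64, 128],
    "kernel_size": [5],
    "pool_size": [2, 4],
    "lstm_units": [32],
    "dropout_rate": [0.2, 0.3],
    "batch_size": [8],
    "n_epochs": [20],
}


def expand_grid(space: Dict[str, list]) -> List[dict]:
    """Return one dict per combination of the listed hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


configs = expand_grid(search_space)
print(len(configs))  # 3 filters x 2 pool sizes x 2 dropout rates = 12
```

Only three hyperparameters take more than one value, so the full grid contains 12 configurations.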

Table 2. Performance of machine learning classifiers on AF/NSR classification task.

Classifier	Accuracy	Precision	Recall	F1-Score
Decision Tree	0.7000	0.6999	0.6999	0.6999
Random Forest	0.8035	0.8042	0.8037	0.8034
Extra Trees	0.8106	0.8118	0.8108	0.8105
Gradient Boosting	0.7254	0.7279	0.7258	0.7249
XGBoost	0.7895	0.7903	0.7897	0.7894
CatBoost	0.7934	0.7944	0.7936	0.7933
AdaBoost	0.6662	0.6673	0.6665	0.6659
KNN	0.6625	0.6625	0.6626	0.6625
GaussianNB	0.5358	0.5647	0.5334	0.4705
Logistic Regression	0.5957	0.5985	0.5963	0.5936
SVC	0.6763	0.6802	0.6768	0.6749
MLPClassifier	0.7369	0.7369	0.7369	0.7369

Table 3. Results obtained across the four phases of the DL pipeline experiments. The highest value of each metric is achieved by exp7.

Phase	Experiment	Segment Length	BiLSTM	Conv1D	Pool-Size	Epochs	Epochs Performed	Patience	Accuracy	Precision	Recall	F1-Score
1	exp1	original	X	2 (32-64)	(2-2)	20	20	5	0.7678	0.7702	0.7679	0.7673
1	exp2	original	X	2 (32-64)	(2-2)	50	39	5	0.8334	0.8346	0.8334	0.8333
1	exp3	original	X	2 (32-64)	(2-2)	100	59	5	0.8496	0.8506	0.8495	0.8494
1	exp4	original	X	2 (32-64)	(2-2)	100	100	10	0.9011	0.9011	0.9011	0.9011
2	exp5	halved	X	2 (32-64)	(2-2)	150	62	10	0.8505	0.8507	0.8507	0.8505
3	exp6	halved	✓ (32)	4 (32-64-64-128)	(2-2-2-2)	10	10	10	0.8759	0.8766	0.8763	0.8759
4	exp7	halved	✓ (32)	4 (32-64-64-128)	(4-4-2-2)	20	20	10	0.965	0.9652	0.9649	0.965
-	CNN + Transformer	halved	X	(32-64)	(4-4)	20	20	10	0.8696	0.8695	0.8696	0.8696
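
The gap between the "Epochs" and "Epochs Performed" columns of Table 3 is consistent with patience-based early stopping. The sketch below shows the mechanism under the assumption that the monitored quantity is the validation loss; the table reports only the patience values, so the monitored metric, the loss values, and the function name `early_stopping_epoch` are hypothetical.

```python
from typing import Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int,
                         min_delta: float = 0.0) -> int:
    """Return how many epochs run before patience-based early stopping halts training.

    Training stops after `patience` consecutive epochs without an improvement
    of more than `min_delta` over the best validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # the epoch budget ran out first


# Hypothetical validation losses, for illustration only.
losses = [0.9, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.6]
print(early_stopping_epoch(losses, patience=5))   # stops at epoch 8
print(early_stopping_epoch(losses, patience=10))  # runs all 9 epochs
```

With a patience of 5 the run halts before reaching the later improvement at epoch 9, whereas a patience of 10 lets it continue. This mirrors the pattern in Table 3, where exp3 stops at 59 of 100 epochs with patience 5 and exp4 completes all 100 epochs with patience 10.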

Table 4. Classification report of the best-performing deep learning model.

Class	Precision	Recall	F1-Score	Support
NSR (0)	0.97	0.96	0.96	929
AF (1)	0.96	0.97	0.97	957
Accuracy			0.97	1886
Macro Avg	0.97	0.96	0.96	1886
Weighted Avg	0.97	0.97	0.96	1886

Table 5. Confusion matrix of the best-performing deep learning model.

	Predicted NSR	Predicted AF
True NSR	888	41
True AF	25	932
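
The metrics reported for the best-performing model can be re-derived from the counts in Table 5. The following sketch uses only those four counts; the variable and function names are illustrative, and macro averaging (the unweighted mean of the two per-class values) is assumed for the aggregate figures.

```python
# Counts from Table 5, with AF as the positive class.
tn, fp = 888, 41   # true NSR: predicted NSR, predicted AF
fn, tp = 25, 932   # true AF:  predicted NSR, predicted AF


def prf(hit: int, false_alarm: int, miss: int):
    """Precision, recall and F1 for one class from its hit/false-alarm/miss counts."""
    precision = hit / (hit + false_alarm)
    recall = hit / (hit + miss)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
af_p, af_r, af_f1 = prf(tp, fp, fn)     # AF as the class of interest
nsr_p, nsr_r, nsr_f1 = prf(tn, fn, fp)  # NSR as the class of interest

macro_precision = (af_p + nsr_p) / 2
macro_recall = (af_r + nsr_r) / 2
macro_f1 = (af_f1 + nsr_f1) / 2

print(f"accuracy={accuracy:.4f}")
print(f"AF : P={af_p:.2f} R={af_r:.2f} F1={af_f1:.2f}")
print(f"NSR: P={nsr_p:.2f} R={nsr_r:.2f} F1={nsr_f1:.2f}")
print(f"macro: P={macro_precision:.4f} R={macro_recall:.4f} F1={macro_f1:.4f}")
```

The recomputed accuracy (0.9650) and macro-averaged precision, recall and F1 (0.9652, 0.9649, 0.9650) agree with the exp7 row of Table 3, and the per-class values round to those in Table 4.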

Table 6. Results of the best-performing model under grouped cross-validation (mean ± standard deviation).

Model	Accuracy	Precision	Recall	F1-Score
CNN + BiLSTM (CV)	0.922 ± 0.046	0.923 ± 0.047	0.922 ± 0.047	0.921 ± 0.046

Table 7. Comparison between the best-performing machine learning and deep learning models.

Approach	Accuracy	Prec.	Rec.	F1
ML (XGBoost)	0.76	0.76	0.76	0.76
DL (CNN + BiLSTM)	0.965	0.9652	0.9649	0.965

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

