Explainable Combined Spatial Representations for ECG Arrhythmia Classification

Onică, Iulia; Ciocoiu, Iulian B.

doi:10.3390/make8050114

by Iulia Onică and Iulian B. Ciocoiu *

Faculty of Electronics, Telecommunications and Information Technology, Gheorghe Asachi Technical University of Iasi, Bd. Carol I, No. 11A, 700506 Iasi, Romania

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(5), 114; https://doi.org/10.3390/make8050114

Submission received: 9 March 2026 / Revised: 22 April 2026 / Accepted: 23 April 2026 / Published: 25 April 2026

(This article belongs to the Section Data)

Abstract

The paper addresses ECG arrhythmia classification using a novel input fusion strategy that combines spatial representations of ECG time series recordings. Four distinct time series-to-image transformations are considered, namely classical spectrograms, Gramian Angular Field (GAF), Recurrence Plot (RP), and the S-Transform (ST). Classification of combined 2 × 2 images generated from single-lead ECG recordings is performed using both custom and ResNet-50 deep learning architectures. Finally, several distinct explainability algorithms are used to identify the regions in the input images that most influence the classification decisions. Experiments performed on the MIT-BIH and Chapman–Shaoxing arrhythmia datasets revealed performance comparable to more sophisticated approaches in terms of accuracy (99%), F1-score (98.6%), and AUC (0.999) values.

Keywords: spatial representation; data fusion; deep learning; explainability analysis

1. Introduction

According to recent studies, ischemic heart disease and stroke consistently represented the leading causes of worldwide deaths in the interval 1990–2023 [1]. While during the pandemic period COVID-19 temporarily ranked first, it rapidly dropped far lower on the leading causes list, with ischemic heart disease, stroke, chronic obstructive pulmonary disease (COPD), lower respiratory infections, and neonatal disorders placed in the top positions. Nevertheless, the age-standardized mortality rates for both ischemic heart disease and stroke showed a significant decrease, from 161.4 deaths per 100,000 population in 1990 to 99.8 deaths per 100,000 in 2023 [1].

Cardiac arrhythmias represent one of the most prevalent and clinically consequential manifestations of cardiovascular diseases, with a huge impact on therapeutic decisions and mortality reduction. The electrocardiogram (ECG) serves as the primary non-invasive diagnostic tool for detecting and classifying arrhythmias, aiming to identify subtle morphological features that may differentiate them. ECG classification faces a combination of signal-processing, physiological, and computational challenges that collectively hinder the development of robust, generalizable, and clinically deployable systems. There are numerous difficulties to be faced, including: (a) the non-stationary nature of ECG signals implies a significant temporal variability and complicates the extraction of stable, discriminative features across diverse patient cohorts; (b) there exists a pronounced inter- and intra-subject heterogeneity in terms of the waveform morphology, amplitude, and duration that may manifest for the same arrhythmia type, requiring large and representative datasets; (c) the necessity of modeling the complex spatio-temporal dependencies that exist between successive ECG segments and/or multiple recording leads; (d) even state-of-the-art classifiers may yield limited performances under realistic noise conditions.

Traditional manual interpretation of ECG signals by cardiologists is time-consuming, subject to inter-observer variability, and prone to human error, particularly in high-volume clinical settings [2]. Alternatively, machine learning and deep learning models offer a potentially automated, accurate, and scalable ECG signal analysis solution that rivals human expert interpretation, prompting significant research interest and clinical translation efforts [3,4,5]. Convolutional neural networks (CNNs) dominate the field, with recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and hybrid architectures gaining prominence for temporal modeling. Such models have achieved remarkable performance metrics, with reported accuracies exceeding 99% on benchmark datasets [4].

However, the clinical adoption of AI-based ECG classification systems faces substantial barriers. The black-box nature of deep learning models raises concerns about trustworthiness and interpretability, which are critical in high-stakes medical decision-making. General explainable AI (XAI) techniques—particularly SHAP, LIME, and Grad-CAM [6]—have been used to enhance clinical trust, along with ECG-specific adaptations such as K-GradCam and B-LIME [2,7] that address temporal dependencies and perturbation instability. Additionally, challenges related to data imbalance, limited generalization across diverse patient populations, computational demands, and the need for robust external validation continue to constrain real-world deployment [8].

One interesting line of research follows a two-step procedure: the ECG time series is first transformed into one or more 2D image representations, and these images are then classified using deep learning models. The approach has been successfully used for time-series analysis in anomaly detection and biometrics applications [9]. For ECG recordings, the method may first apply single- or multi-beat segmentation to the (preprocessed) signals from one lead (typically the modified lead II, MLII) or several leads. The conversion method may be chosen from a rather wide range of options, four of which are described in Section 3.

To improve classification performance, various input- and/or output-fusion modalities have been proposed. Typically, one may choose between (a) combining the outcomes of distinct classification models that use a specific spatial representation of a given time series and (b) combining the outputs of a given model that uses different image types as input. The present paper proposes a novel approach that merges four distinct 2D representations of single-lead ECG recordings into 2 × 2 global images for use as input to several deep learning models. The rationale behind the proposal is two-fold: (a) various time series-to-image transformations may reveal distinct features of the ECG waveforms, hence their combination would be more informative than the individual ones; (b) explainable AI algorithms applied on the combined images may identify specific regions of interest that are arrhythmia class-specific, while additionally enabling a hierarchical ordering of the considered spatial representations according to their importance revealed by the XAI analysis.

The following sections describe the chosen time series-to-image transformations, illustrate the individual and merged arrhythmia-class-specific representations, and present classification performance metrics based on extensive experimental results on two widely used ECG arrhythmia benchmark datasets. An explainability analysis using five distinct methods and a comparative analysis against existing solutions are included in the discussion section, and ideas for further study are finally outlined.

2. Background and Related Work

Convolutional neural networks (CNNs) have emerged as the dominant architecture for ECG arrhythmia classification, leveraging their proven ability to automatically learn discriminative features from raw ECG signals, reducing the need for manual feature engineering and domain-specific preprocessing [4,10]. Both 1D and 2D CNN models excel at identifying morphological patterns in ECG signals, such as the QRS complex, P-wave, and T-wave, which are critical for arrhythmia diagnosis.

Multiple studies have demonstrated the effectiveness of 1D/2D CNNs specifically adapted to ECG signals. Various custom architectures were built around 1D/2D-convolutional layers, batch normalization, ReLU activation, max pooling, and dense layers with softmax activation, achieving over 99% accuracy [11,12]. Modified versions of established CNN architectures have also been adopted for ECG analysis. For example, Atwa et al. introduced a custom CNN with a dual-branch architecture for ECG signals and demographic data, alongside a modified VGG16 model adapted for multi-branch input, achieving up to 97.78% accuracy in binary classification and 79.7% in multiclass tasks [13]. These adaptations demonstrate the flexibility of CNN architectures in accommodating both signal data and auxiliary clinical information.

While CNNs excel at extracting spatial and morphological features, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are employed to capture temporal dependencies and beat-to-beat patterns in ECG signals. These architectures are particularly valuable for modeling the sequential nature of cardiac rhythms and detecting arrhythmias that manifest over extended time windows. Most solutions are built around hybrid CNN + (bidirectional) LSTM models, yielding up to 99.6% accuracy [10].

The deployment of deep learning models on resource-constrained edge devices, wearables, and embedded systems has driven the development of lightweight architectures optimized for memory footprint, computational efficiency, and energy consumption [14,15]. These models aim to enable real-time arrhythmia monitoring in ambulatory and point-of-care settings without sacrificing diagnostic accuracy, addressing a critical gap between high-performing but computationally intensive models and the practical requirements of continuous, real-time monitoring in clinical and consumer health applications. As such, Baig et al. proposed two novel lightweight 1D CNNs, ArrhythmiNet V1 and V2, inspired by MobileNet’s depthwise separable convolutional design [14]. These models achieved classification accuracies of 99% (V1) and 98% (V2) on the MIT-BIH Arrhythmia Dataset while maintaining very low memory footprints. Silva et al.’s systematic review identified multiple optimization strategies for embedded feasibility, including pruning techniques (L1-norm pruning to remove low-impact weights), quantization techniques (reducing bit-width from 32-bit floating-point to 8-bit integers), and lightweight network architectures such as MobileNet, SqueezeNet, and TinyML [15]. Specific implementations demonstrated remarkable efficiency: Ribeiro et al. achieved 99.6% accuracy with 7.65 ms inference time, 5.85 mJ energy consumption, and a 93 KB model size, making it suitable for low-power edge devices [15]. Mao et al. achieved 97.36% accuracy with on-chip learning, 0.3 µJ/beat energy consumption, and 8 KB model size, ideal for wearable devices [15].

FPGA acceleration and Spiking Neural Networks (SNNs) with Channel-Wise Attentional Mechanism (CAM) represent additional strategies for achieving speed and energy efficiency in embedded platforms [15]. Xing et al. achieved 92.07% accuracy with inter-patient validation, 1.37 ms runtime, and 346.33 µJ/beat energy consumption, demonstrating suitability for embedded platforms [15].

Beyond the dominant CNN, RNN, and hybrid architectures, a series of emerging approaches is being explored to address specific limitations and leverage novel computational paradigms. Attention mechanisms, distillation, graph neural networks, capsule networks, and transformer architectures represent promising directions for future research [16,17]. Attention-based networks have been highlighted for improving detection accuracy by enabling models to focus on salient temporal regions of ECG signals, further enabling interpretability of the results [18,19]. Graph neural networks offer potential for modeling complex relationships between ECG leads and temporal segments, while capsule networks aim to preserve spatial hierarchies and part–whole relationships in signal patterns. Transformer architectures, which have revolutionized natural language processing and computer vision, are beginning to be explored for ECG analysis [4]. Many hybrid models first use a CNN to extract local morphological features, then feed those features as tokens into a transformer encoder to capture long-range dependencies, yielding very high accuracy, often surpassing 99% [20,21]. Models targeting long-term monitoring adopt beat-wise tokenization or patient-adaptive tokens to limit sequence length and focus attention. The self-attention mechanism may offer advantages in capturing long-range dependencies and multi-scale temporal patterns in ECG signals.

While these emerging approaches show promise, they remain less prevalent in the current literature than CNNs and RNNs, and their clinical validation and practical deployment are still in the early stages.

3. Spatial Representations of ECG Signals

Time series-to-image transformations encode temporal dependencies into spatial, geometric, or frequency-based alternatives that facilitate the analysis, visualization, and learning of relevant characteristics. Such transformations may preserve global characteristics, emphasize local temporal variations, or capture similarity and recurrence relationships within the data. Among the various methods proposed in the literature, we selected four options that have proven effective for ECG-based biometrics [9], namely: classical spectrograms, Gramian Angular Field (GAF), Recurrence Plot (RP), and the S-Transform (ST). The description of those methods below closely follows [9], including examples for both single- and multi-beat ECG recordings from two benchmark arrhythmia datasets. Finally, we merge the individual 2D representations into a 2 × 2 global image for further classification by a CNN model.

3.1. Time-Frequency Spectrograms

Classical time-frequency spectrograms have been generated from MLII single-lead ECG recordings using 25 ms sliding segments with 12 ms overlap and a periodic Hanning window defined over 256 points.

3.2. Gramian Angular Field

To implement the Gramian Angular Field (GAF) transformation [22], we first rescale the time series to the [−1, 1] interval and then express the resulting signal in polar coordinates, using the arccosine of each rescaled sample as the angle and encoding time as the distance from the origin (N is a scaling constant):

\begin{array}{l} \tilde{x}_{i} = \frac{(x_{i} - \max(x)) + (x_{i} - \min(x))}{\max(x) - \min(x)}, \quad i = 1 \dots n \\ \begin{cases} φ_{i} = \arccos(\tilde{x}_{i}), & -1 \leq \tilde{x}_{i} \leq 1 \\ r_{i} = \frac{t_{i}}{N}, & t_{i} = 1 \dots n \end{cases} \end{array}

(1)

We finally define the Gramian Angular Field as follows:

\begin{array}{l} GAF = \begin{bmatrix} \cos(φ_{1} + φ_{1}) & \cos(φ_{1} + φ_{2}) & \dots & \cos(φ_{1} + φ_{n}) \\ \cos(φ_{2} + φ_{1}) & \cos(φ_{2} + φ_{2}) & \dots & \cos(φ_{2} + φ_{n}) \\ \vdots & \vdots & \ddots & \vdots \\ \cos(φ_{n} + φ_{1}) & \cos(φ_{n} + φ_{2}) & \dots & \cos(φ_{n} + φ_{n}) \end{bmatrix} \\ \quad = \tilde{X}^{T} \cdot \tilde{X} - \left( \sqrt{I - \tilde{X}^{2}} \right)^{T} \cdot \sqrt{I - \tilde{X}^{2}} \end{array}

(2)

The trigonometric sum of the phase values reflects the temporal correlation within various time intervals of the given time series. The scaling parameter N is set equal to the ECG sample length.
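As an illustration only, Equations (1) and (2) can be sketched in plain Python as below. The function name `gramian_angular_field` and the toy input are assumptions made for this example, not the implementation used in the experiments; a vectorized array library would normally be preferred for full-length beats.

```python
import math

def gramian_angular_field(x):
    """Return the summation-type GAF of a 1D sequence as a list of rows."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled to [-1, 1]")
    # Eq. (1): rescale to [-1, 1]; the clamp only guards against rounding error.
    scaled = [max(-1.0, min(1.0, ((v - hi) + (v - lo)) / (hi - lo))) for v in x]
    # Angular encoding: phi_i = arccos(x_tilde_i).
    phi = [math.acos(v) for v in scaled]
    # Eq. (2): entry (i, j) is cos(phi_i + phi_j).
    return [[math.cos(a + b) for b in phi] for a in phi]

# Toy samples (hypothetical values, not a real ECG beat).
beat = [0.0, 0.1, 0.9, -0.4, 0.05, 0.2, 0.0]
gaf = gramian_angular_field(beat)
```

The radius r = t/N of Equation (1) does not enter the matrix itself; time is encoded implicitly by the row and column positions.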

3.3. Recurrence Plots

Recurrence Plots (RPs) [23] are rooted in chaos theory as an efficient tool for visual inspection of time series. The method aims at efficiently exploring the high-dimensional phase space associated with a dynamical nonlinear system through a bidimensional representation of its recurrences. As such, we may construct the following RP matrix:

RP_{i, j} = θ(ε - \| s_{i} - s_{j} \|), \quad s(\cdot) \in \mathbb{R}^{m}, \quad i, j = 1 \dots N

(3)

where m is the embedding dimension, N is the total number of states, ε is a real-valued threshold, and θ(.) is the step function (the state vector s(.) is obtained by subsampling the given time series using a proper delay). The RP matrix may reveal trajectories that return close to previous states, exposing both global characteristics (homogeneity, periodicity) and local patterns such as isolated dots and diagonal, horizontal, or vertical lines [9]. The parameter ε defines the radius of the neighborhood around a given phase space vector used to count recurrence points [23] and is set to 0.05 based on the maximum phase space diameter of the normalized time series. The embedding dimension is m = 11, as in [9].
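A minimal sketch of Equation (3) follows, assuming a standard time-delay embedding and the Euclidean norm. The default values m = 11 and ε = 0.05 come from the text; the delay of one sample and the function name `recurrence_plot` are assumptions, since the source does not state the delay.

```python
import math

def recurrence_plot(x, m=11, delay=1, eps=0.05):
    """Binary recurrence matrix of a 1D sequence (list of rows of 0/1)."""
    # Time-delay embedding: state i = (x[i], x[i+delay], ..., x[i+(m-1)*delay]).
    n_states = len(x) - (m - 1) * delay
    if n_states <= 0:
        raise ValueError("series too short for this embedding")
    states = [[x[i + k * delay] for k in range(m)] for i in range(n_states)]
    # Eq. (3): 1 where two states lie within eps of each other, 0 otherwise.
    return [[1 if math.dist(si, sj) <= eps else 0 for sj in states]
            for si in states]

# Toy periodic series (period 4), embedded with a small dimension for readability.
series = [0.0, 1.0, 0.0, -1.0] * 4
rp = recurrence_plot(series, m=2, delay=1, eps=0.05)
```

For a periodic input such as this one, the recurrences appear as diagonal lines spaced one period apart, which is the kind of global pattern the paragraph above refers to.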

3.4. The S-Transform

The Short-Time Fourier Transform (STFT) has been widely used as a time-frequency tool for revealing the temporal dynamics of a signal’s spectral components, which the classical Fourier Transform (FT) cannot capture. Using a fixed width for the analysis window limits the efficiency of the method, especially for non-stationary data; hence, the Continuous Wavelet Transform (CWT) has been proposed as a solution to compensate for this limitation. The CWT introduces a frequency-dependent resolution of the time-frequency space, better identifying significant changes in the signal under study at both low and high frequencies. Since the standard CWT is unable to yield precise phase information that is critical for biomedical signal processing, the S-Transform (ST) has been introduced as a phase-corrected version of CWT as [24]:

\begin{array}{l} ST(t, f) = CWT(t, σ) \, e^{-j 2π f t} = \int_{-\infty}^{+\infty} x(τ) \, ψ(t - τ, σ) \, e^{-j 2π f τ} \, dτ \\ \quad = \int_{-\infty}^{+\infty} x(τ) \, \frac{|f|}{\sqrt{2π}} \, e^{-\frac{(t - τ)^{2} f^{2}}{2}} \, e^{-j 2π f τ} \, dτ \end{array}

(4)

by choosing a particular mother wavelet function ψ(t, σ) = exp(−t²/(2σ²)) / (σ√(2π)) with a frequency-dependent width σ(f) = 1/|f|. Improved flexibility in resolution adjustment in the time-frequency plane can be achieved by using more complex dependencies of the σ parameter on frequency. ST is defined in a time-frequency plane, in contrast to the standard CWT, which uses a time-scale pair of axes, making the frequency information more interpretable. Additionally, ST provides both absolutely referenced phase information and frequency-independent amplitude response [24]. Detailed comparative analysis of ST against FT, STFT, and CWT has been presented in the literature [9].
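To make Equation (4) concrete, the sketch below evaluates a direct discretization of the integral, with time in samples and frequency in cycles per sample. It is a slow reference form written for clarity, under the assumption that the integral is replaced by a plain sum over the available samples; practical implementations typically use an FFT-based formulation instead. The function name `s_transform` is hypothetical.

```python
import cmath
import math

def s_transform(x, max_bin=None):
    """Direct-sum S-Transform; rows[k - 1][t] approximates ST(t, k / len(x))."""
    n = len(x)
    max_bin = n // 2 if max_bin is None else max_bin
    rows = []
    for k in range(1, max_bin + 1):
        f = k / n  # analysis frequency in cycles per sample
        row = []
        for t in range(n):
            acc = 0j
            for tau in range(n):
                # Gaussian window of width 1/|f|, centered at t (Eq. 4).
                window = (abs(f) / math.sqrt(2 * math.pi)
                          * math.exp(-((t - tau) ** 2) * f * f / 2))
                acc += x[tau] * window * cmath.exp(-2j * math.pi * f * tau)
            row.append(acc)
        rows.append(row)
    return rows

# Toy input: a pure tone at bin 8 of a 64-sample window.
tone = [math.cos(2 * math.pi * 8 * t / 64) for t in range(64)]
spectrum = s_transform(tone)
```

Because the window has unit area, a unit-amplitude cosine produces a magnitude close to 0.5 at its own frequency, independent of that frequency, which illustrates the frequency-independent amplitude response mentioned above.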

Examples of the time series-to-image transformations defined above are shown in Figure 1 and Figure 2, corresponding to single- and multiple-beat segments selected from the MIT-BIH [25] and Chapman–Shaoxing [26] arrhythmia datasets, respectively. Each type of representation is normalized to the [−1, 1] range and resized to 227 × 227 pixels.

The result of merging the individual representations into a global 2 × 2 image is illustrated in Figure 3. Once again, the combined result is resized to 227 × 227 pixels to meet the input requirements of the CNN models.
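The merging step can be sketched as follows. The quadrant order (spectrogram, GAF, RP, ST from top-left to bottom-right) and the nearest-neighbor resizing are assumptions made for this example; the source specifies only that four representations are tiled into a 2 × 2 image and resized to 227 × 227 pixels. The helper names are hypothetical.

```python
def tile_2x2(top_left, top_right, bottom_left, bottom_right):
    """Concatenate four equally sized 2D lists into one 2 x 2 mosaic."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

def resize_nearest(img, height, width):
    """Nearest-neighbor resize of a 2D list to (height, width)."""
    src_h, src_w = len(img), len(img[0])
    return [[img[r * src_h // height][c * src_w // width] for c in range(width)]
            for r in range(height)]

def constant_tile(value, size=227):
    """Stand-in for one 227 x 227 representation (placeholder data only)."""
    return [[value] * size for _ in range(size)]

# Placeholder tiles: 0 = spectrogram, 1 = GAF, 2 = RP, 3 = ST (assumed order).
mosaic = tile_2x2(constant_tile(0), constant_tile(1),
                  constant_tile(2), constant_tile(3))
combined = resize_nearest(mosaic, 227, 227)
```

Each representation ends up occupying roughly a quarter of the final image, so its effective resolution is about half of the 227 × 227 single-representation case along each axis.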

4. The Proposed Approach

The block diagram of the proposed approach is illustrated in Figure 4. The merged spatial representations serve as the input to various CNN classification models, and the outcome is additionally interpreted through an XAI analysis. The dimensions of the input images should be set to accommodate the constraints imposed by the CNN models under study. When multiple classifiers are used, output fusion may also be performed by combining the individual decisions of the models (e.g., by weighting the response of each model in terms of class-oriented probabilities according to its accuracy). XAI algorithms are then used to identify regions in the input images that primarily influence the classification decision. When combined image representations are used as input, such tools may enable hierarchical ordering of the various time series-to-image transformations, revealing the most informative class-specific 2D mappings.
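One possible way to turn a saliency map into such a hierarchical ordering is to sum the saliency mass falling in each quadrant of the combined image. The sketch below is an assumption about how this could be done, not the procedure reported in the paper; the quadrant-to-representation mapping and the function name `quadrant_importance` are hypothetical, and the saliency values are assumed to be non-negative.

```python
def quadrant_importance(saliency, names=("spectrogram", "GAF", "RP", "ST")):
    """Rank the four quadrants of a saliency map by their share of total mass."""
    h, w = len(saliency), len(saliency[0])
    mh, mw = h // 2, w // 2
    # (row_start, row_end, col_start, col_end) for TL, TR, BL, BR quadrants.
    boxes = [(0, mh, 0, mw), (0, mh, mw, w), (mh, h, 0, mw), (mh, h, mw, w)]
    total = sum(map(sum, saliency)) or 1.0  # avoid division by zero
    shares = {}
    for name, (r0, r1, c0, c1) in zip(names, boxes):
        mass = sum(sum(row[c0:c1]) for row in saliency[r0:r1])
        shares[name] = mass / total
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)

# Toy 4 x 4 saliency map with most mass in the top-right quadrant.
toy_map = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
ranking = quadrant_importance(toy_map)
```

With an odd image size such as 227 pixels, the four quadrants differ by one row or column, so the shares are only approximately comparable.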

Three custom CNN models and a ResNet-50 model have been used, as indicated in Figure 5. Model 1 is a simple, classical series-type architecture that includes standard 2D convolution, batch normalization, ReLU, pooling, and fully connected layers. Model 2 is inspired by [27] and includes residual-type connections and 1D convolutions before the fully connected classification module. Model 3 is built around standard 2D layers but adds residual connections. All models incorporate dropout layers. We used an ImageNet-pretrained ResNet-50 model, freezing the first 121 layers, adjusting the number of neurons in the output layer, and fine-tuning the updated architecture on each ECG arrhythmia dataset.

5. Experimental Protocol and Results

Extensive computer simulations have been performed using two benchmark ECG data sources, namely the MIT-BIH [25] and Chapman–Shaoxing [26] arrhythmia datasets, respectively. The next sections describe the characteristics of the recordings, the resampling/augmentation methods used to compensate for class imbalance, and the main performance metrics. The efficiency of both input and output fusion strategies is compared with that of using individual 2D representations and single models. Five explainability algorithms have been considered to assess the relative importance of the representations and identify class-specific regions of interest.

5.1. The ECG Arrhythmia Datasets

The MIT-BIH Arrhythmia Dataset [25] comprises 48 half-hour ambulatory ECG recordings collected from 47 adult subjects at Boston’s Beth Israel Hospital, monitoring two modified limb leads at a 360 Hz sampling frequency. The dataset provides expert beat-level annotations, classified into 24 subcategories, further grouped into five types according to the Association for the Advancement of Medical Instrumentation (AAMI) standard: normal (N), supraventricular ectopic beat (SVEB), ventricular ectopic beat (VEB), fusion beat (F), and unknown beat (Q). Typical preprocessing steps include single-beat segmentation, followed by beat-length and amplitude normalization, outlier removal, and R peak alignment.

Since the original data contains outliers (segments whose lengths deviate significantly from the average R-R time intervals) and the segmentation procedures may yield mixed samples from consecutive heartbeats, several new versions of the dataset have been proposed in the literature to improve the quality of the original recordings. We used the dataset introduced in [28] (based on lead II of the recordings), which corrects some of the drawbacks of previous solutions. The method starts by computing the R-R intervals and eliminating segments whose lengths fall outside an interval defined by the 25th and 75th percentiles of the R-R segment distribution. The cleaned set is split into 10-second windows, and the mean heartbeat size is computed as the average of the R-R intervals within the current window. A new set of heartbeats, each with the mean length and centered around its own R peak, is finally generated, followed by zero-padding to yield a common global heartbeat size. The data distribution over the five classes is indicated in Table 1.

The Chapman–Shaoxing (SC) [26] arrhythmia dataset contains approximately 10,000 10 s-long 12-lead ECG signal recordings collected in a hospital facility between 1980 and 2018, featuring 11 common rhythms and a sampling rate of 500 Hz. The volume, duration, and population diversity established it as a benchmark dataset widely used for research purposes. The data are grouped into seven arrhythmia types: sinus bradycardia (SB), sinus rhythm (SR), atrial fibrillation (AFIB), sinus tachycardia (ST), atrial flutter (AF), sinus irregularity (SI), and supraventricular tachycardia (SVT). Many papers report classification performances on only four major types (SB, SR, AFIB, and GSVT), obtained by merging some of the original rhythms [4]. This option brings the additional advantage of an almost balanced sample distribution over the classes, as indicated in Table 2.

5.2. Data Preprocessing

The cleaned version of the MIT-BIH Arrhythmia Dataset described in [28] has been generated following a two-step procedure, by first eliminating the outlier segments whose lengths exhibited significantly different durations as compared to the average R-R intervals, and further constructing (zero-padded, common size) R-centered segments, as mentioned in the previous section. The resulting sequences are down-sampled from 360 Hz to 120 Hz, and finally z-score normalized according to z = (x − μ)/σ, where μ and σ are the mean and standard deviation of vector x.
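The normalization step can be sketched in a few lines. Whether the population or the sample standard deviation is used is not stated in the source; the population form is assumed here, and the function name `z_score` is hypothetical.

```python
import statistics

def z_score(x):
    """Return (x - mean) / std for a 1D sequence (population std assumed)."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    if sigma == 0:
        raise ValueError("a constant segment has zero standard deviation")
    return [(v - mu) / sigma for v in x]

# Toy segment (hypothetical values).
normalized = z_score([1.0, 2.0, 3.0, 4.0])
```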

A series of denoising techniques has been employed to increase the quality of the recordings from the Chapman–Shaoxing arrhythmia dataset. These include Butterworth low-pass filtering, a local polynomial regression smoother (LOESS) to remove baseline wandering, and the non-local means (NLM) technique for reducing residual noise [26]. According to [26], the Butterworth filter has a 50 Hz pass-band and a 60 Hz stop-band, and the smoother used the weighted least squares method. As with the MIT-BIH dataset, z-score normalization has also been applied, without altering the original sampling rate.

One key point concerns class-imbalance mitigation strategies to consider for the MIT-BIH dataset. Both down-sampling solutions to limit the number of samples in the normal (N) class and augmentation techniques to increase the number of exemplars in the other four classes have been considered. Down-sampling options include random resampling (prone to bias), clustering procedures (followed by retaining only the resulting cluster centroids or the original samples closest to those), and more sophisticated methods rooted in information theory [29,30]. We have considered the algorithm introduced in [30], given the comparative performance analysis against the other options.

Following current practice, we employed the (safe-level version of the) SMOTE algorithm [31] as an augmentation technique for the under-represented categories, namely supraventricular ectopic beat (SVEB), ventricular ectopic beat (VEB), and fusion beat (F). The resulting per-class data distribution is given in Table 1, showing a balanced distribution and the contribution of the generated data to the overall dataset dimensionality.
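For orientation, the interpolation at the core of SMOTE is sketched below. Note that this is the basic SMOTE step, not the safe-level variant used in the paper, which additionally weights the interpolation by how many same-class neighbors surround each point. The function name `smote`, the neighbor count k, and the toy data are assumptions made for this example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (Euclidean distance).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Toy 2D minority class (hypothetical feature vectors).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_samples = smote(minority, n_new=10, k=2, seed=42)
```

Every synthetic sample lies on a line segment between two existing minority samples, so the augmented class never extends beyond the region spanned by the original data.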

5.3. Experimental Results

Extensive experiments have been performed along three scenario setups. To start with, we considered the four distinct 2D representations applied as the input to each of the CNN models described in Section 4. Secondly, we used the combined 2 × 2 images and the same individual models. Finally, we evaluated the efficiency of input/output fusion strategies by: (a) combining the outputs of the various CNN models given a specific 2D representation; (b) combining the outputs of a given CNN model when the various 2D representations are used as input. The final result is computed as the weighted sum of the probabilities generated by the individual models on the validation set, while the weights are statically assigned and defined as the ratio between the individual models’ accuracy on the training set and their overall sum:

\begin{array}{l} out_{k} = \arg\max_{C} \left\{ \sum_{i = 1}^{4} w_{i} \cdot prob_{i, k}^{C} \right\} \\ w_{i} = \frac{Acc_{i}}{\sum_{j} Acc_{j}} \end{array}

(5)

In the equation above, out_k represents the predicted label of sample k, C is the category index, w_i is the weight of a given model, and prob^C_{i,k} is the probability that sample k belongs to category C, as computed by model i.
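The fusion rule of Equation (5) can be sketched as below: each model's class probabilities are weighted by its normalized accuracy, and the class with the largest fused probability is returned as the label. The function name `fuse_predictions` and the numeric values are hypothetical and serve only to illustrate the arithmetic.

```python
def fuse_predictions(probs, accuracies):
    """Accuracy-weighted fusion of per-model class probabilities for one sample.

    probs[i][c] is the probability of class c according to model i;
    accuracies[i] is the accuracy used to weight model i.
    """
    total = sum(accuracies)
    weights = [a / total for a in accuracies]  # w_i = Acc_i / sum_j Acc_j
    n_classes = len(probs[0])
    fused = [sum(w * p[c] for w, p in zip(weights, probs))
             for c in range(n_classes)]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused

# Four models, two classes (illustrative numbers, not reported results).
model_probs = [[0.60, 0.40], [0.20, 0.80], [0.55, 0.45], [0.30, 0.70]]
model_acc = [0.99, 0.97, 0.98, 0.96]
label, fused = fuse_predictions(model_probs, model_acc)
```

Since the weights sum to one, the fused scores remain a valid probability distribution. When the accuracies are as close to each other as in this toy case, the rule behaves almost like a plain average of the model outputs.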

In Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, we present comparative classification performances for each of the previously described scenarios. The performance metrics include accuracy (Acc), precision (P), recall (R), F1-score, Area Under the Curve (AUC), and Cohen’s Kappa coefficient [32] (a measure of the agreement between true and predicted single-classifier outputs, corrected for random chance agreement). All experiments used 10-fold cross-validation, and performance metrics are reported as mean values ± standard deviations.
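For reference, Cohen's Kappa compares the observed agreement p_o with the agreement p_e expected by chance, as κ = (p_o − p_e)/(1 − p_e). A minimal sketch, with a hypothetical function name and toy labels:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa between true and predicted labels."""
    n = len(y_true)
    # Observed agreement: fraction of samples where the labels coincide.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal label frequencies, summed over classes.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when only one class is present")
    return (observed - expected) / (1 - expected)

# Toy labels (hypothetical): five of six predictions agree.
truth = ["N", "N", "N", "V", "V", "F"]
guess = ["N", "N", "V", "V", "V", "F"]
kappa = cohen_kappa(truth, guess)
```

In this toy case the raw agreement is 5/6 ≈ 0.83, while the chance-corrected value is 17/23 ≈ 0.74, which shows why Kappa is reported alongside accuracy.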

The training setup was similar for all experiments: input images with 227 × 227 pixels, Adam optimizer, 0.0005 initial learning rate, 0.0005 L2 regularization parameter, batch size 64, dropout for the fully connected layers with 50% probability, focal loss, and best model parameters selected according to minimal loss on the validation set.
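The focal loss mentioned above down-weights well-classified samples through the factor (1 − p_t)^γ applied to the cross-entropy term, where p_t is the probability assigned to the true class. The sketch below shows the per-sample form without class weighting; the focusing parameter γ is not reported in the text, so the default γ = 2 is an assumption, as is the function name.

```python
import math

def focal_loss(probs, target, gamma=2.0, eps=1e-12):
    """Per-sample focal loss: -(1 - p_t)^gamma * log(p_t)."""
    # Clamp to avoid log(0) for degenerate predictions.
    p_t = min(max(probs[target], eps), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy sample contributes far less than a hard one (illustrative values).
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.6, 0.4], target=0)
```

Setting γ = 0 recovers the standard cross-entropy loss, so γ controls how strongly training concentrates on hard or minority-class samples.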

5.4. Explainability Analysis

Automated medical decision-making raises critical challenges given the black-box nature of current deep learning models. Revealing the underlying causes of specific outcomes and performance metrics could only contribute to increasing the trustworthiness and broader acceptance of various AI-based solutions, towards a synergistic cooperation between artificial systems and human experts.

In ECG signal analysis, several general explainable AI (XAI) techniques, including SHAP, LIME, Grad-CAM, and variants of these [6,13], have been used to enhance clinical trust, along with ECG-specific adaptations such as K-GradCam and B-LIME [2,7]. We considered five distinct XAI methods, namely standard grad-CAM, grad-CAM++ [33], score-CAM [34], LIME, and occlusion-based saliency maps to identify class-specific regions of interest in the spatial representations under study. The standard grad-CAM algorithm identifies significant regions in the input image by computing a linearly weighted combination of activation maps of the last convolutional layer before a global pooling layer. Grad-CAM++ and score-CAM represent improved versions of this method by refining the definition of the weights in the above linear combination based on the gradient of class confidence (grad-CAM++) and channel-wise increase in confidence score (score-CAM) with respect to the activation maps. Occlusion maps mask input image patches and evaluate the subsequent effect on the classification performances.
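As an illustration of the occlusion approach, the sketch below masks one patch at a time and records the resulting drop in a class score. The scoring function stands in for a trained classifier; `occlusion_map`, the patch size, and the fill value are assumptions made for this example, not the settings used in the experiments.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Saliency map: drop in score_fn when each patch of the image is masked."""
    h, w = len(image), len(image[0])
    baseline = score_fn(image)
    heat = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, patch):
        for c0 in range(0, w, patch):
            rows = range(r0, min(r0 + patch, h))
            cols = range(c0, min(c0 + patch, w))
            masked = [row[:] for row in image]  # copy, then occlude one patch
            for r in rows:
                for c in cols:
                    masked[r][c] = fill
            drop = baseline - score_fn(masked)
            for r in rows:
                for c in cols:
                    heat[r][c] = drop
    return heat

# Toy "classifier" whose score depends only on the top-left 2 x 2 corner.
def toy_score(img):
    return img[0][0] + img[0][1] + img[1][0] + img[1][1]

toy_image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(toy_image, toy_score, patch=2)
```

Unlike the gradient-based CAM variants, this approach needs only forward passes, at the cost of one model evaluation per patch.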

Figure 6 and Figure 7 show significant differences among the various classes in both datasets, while additional analysis indicated the robustness of the outcomes across different human subjects within the same arrhythmia category.

6. Discussion and Conclusions

Classification metrics reported in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 reveal top performance of 99% for accuracy, precision, and recall, and 0.999 for AUC, respectively, on the MIT-BIH Arrhythmia Dataset. For the Chapman–Shaoxing dataset, the performance is slightly lower, with 95% top accuracy and an AUC of 0.991. When separate 2D representations are used as input (Table 3 and Table 6), performance is comparable, with GAF and ST yielding marginally better results. Using combined 2 × 2 spatial representations yields a mild improvement in performance on the MIT-BIH dataset, whereas no clear positive effect is observed on the Chapman–Shaoxing data (Table 4 and Table 7). In all scenarios, the best results were obtained by following an output fusion strategy, namely combining the outcomes of the four classification models for each input image type. More specifically, the final result is computed as the weighted sum of the probabilities generated by the individual models, with weights defined as the ratio of each model’s accuracy to the sum of all models’ accuracies. The results presented in Table 5 and Table 8 refer to an alternative input fusion strategy, in which, for each model, we combined predictions using distinct features. For the MIT-BIH dataset, performance is comparable to the combined 2 × 2 input scenario, while for the Chapman–Shaoxing dataset, accuracy increases from 93.5% to 95%, with a mild improvement in AUC from 0.988 to 0.991.

The explainability analysis performed with the 2 × 2 combined 2D representations as input indicates that the spectrogram is the least significant option, while the other three methods are relevant, showing class-specific relative importance. It is worth noting that merging the 2D transformations as proposed may introduce artificial (high-frequency) patterns along the boundaries between representations, although the XAI analysis described in Section 5.4 and Figure 6 and Figure 7 shows that those boundaries are not indicated as significant by any of the used methods.

The explainability analysis presented in Figure 6 and Figure 7 reveals that the significant regions are rather irregular, hence it is quite difficult to establish an accurate correspondence between those regions and specific ECG segments. The problem is further complicated by the non-invertibility of the mappings that generate the GAF and RP images. Nevertheless, to facilitate visual inspection of the results, we provide in Figure 8 the correspondence between the various segments of a typical ECG sample and the associated regions of the 2D representations.

Moreover, Figure 9 shows the relevant regions that influence the classification decision for the four categories available in the Chapman–Shaoxing (multi-beat) recordings. The results correspond to the grad-CAM algorithm applied to image representations generated by the ST method, since this time-frequency transformation enables better temporal localization of the segments of interest.

Simulations were performed on a Windows 10 workstation with an Intel Core i9 CPU (3.50 GHz), 128 GB of RAM, and an NVIDIA GeForce RTX 3090 (24 GB) graphics card. The total computational cost of generating the FFT, GAF, RP, and ST representations, including the preprocessing steps and the generation of resized color images, is 25.5, 26.6, 27.5, and 27 ms, respectively. The inference time lies in the 2.8–3.3 ms range across all four models, enabling real-time operation. The models comprise 452,500, 4.6 million, 2.4 million, and 25.5 million parameters for Models 1, 2, 3, and ResNet-50, respectively.

A valid comparative analysis against previously reported performances is not easy to conduct, since the various approaches described in the literature differ widely in the number of samples per arrhythmia category, preprocessing steps, the number of ECG leads, and the specific strategies used to tackle class imbalance. Classification results presented in Table 9 and Table 10 show favorable performance of the proposed approach against existing solutions, although many of the better performers use more sophisticated models (e.g., transformers, hybrid architectures), multi-lead recordings, or employ augmentation techniques that yield many more synthetically generated samples than the original ones.

For the MIT-BIH dataset, the results reported in Table 3, Table 4 and Table 5 refer to an intra-patient study, since the data were split record-wise. An inter-patient analysis is equally important and will be considered in the future along the lines described in references [49,50]. Further work will consider enhancing the proposed combined spatial representations with (possibly multi-head) attention mechanisms. A fusion approach combining features generated by both 1D and 2D CNN models is also worth studying. Extending the set of time series-to-image representations may also prove effective, although merging too many individual 2D transformations into a single global, fixed-size image may obscure important details. Finally, applying the various spatial representations jointly as distinct input channels is a valid avenue to be explored.
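The last option, feeding the representations as distinct input channels rather than as a 2 × 2 mosaic, could be sketched as follows. This is only an illustration of the idea under the assumption that each representation is available as a single-channel map of identical size; the function name `stack_as_channels` and the random placeholder maps are hypothetical.

```python
import numpy as np

def stack_as_channels(representations):
    """Stack equally sized 2D maps into one H x W x C array (one channel each)."""
    maps = [np.asarray(r, dtype=np.float32) for r in representations]
    if any(m.ndim != 2 for m in maps) or len({m.shape for m in maps}) != 1:
        raise ValueError("expected 2D maps of identical shape")
    return np.stack(maps, axis=-1)

# Hypothetical 64 x 64 maps standing in for the spectrogram, GAF, RP, and ST images.
rng = np.random.default_rng(0)
spec, gaf, rp, st = (rng.random((64, 64)) for _ in range(4))
x = stack_as_channels([spec, gaf, rp, st])
print(x.shape)  # (64, 64, 4)
```

Compared with the mosaic, the stacked tensor holds the same number of values but has no internal seams, at the cost of requiring a network whose first layer accepts four input channels.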

Author Contributions

Conceptualization, I.B.C.; methodology, I.O.; software and validation, I.B.C. and I.O.; writing (original draft preparation), I.B.C. and I.O.; writing (review and editing), I.B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in the MIT-BIH Arrhythmia Database at https://physionet.org/content/mitdb/1.0.0/ (accessed on 4 August 2025) and in the Chapman–Shaoxing dataset at https://figshare.com/collections/ChapmanECG/4560497/2 (accessed on 8 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Naghavi, M.; Kyu, H.H.; Aalipour, M.A.; Aalruz, H.; Ababneh, H.S.; Abafita, B.J.; Abaraogu, U.O.; Abbafati, C.; Abbasi, M.; Abbaspour, F.; et al. Global burden of 292 causes of death in 204 countries and territories and 660 subnational locations, 1990–2023: A systematic analysis for the Global Burden of Disease Study 2023. Lancet 2025, 406, 1811–1872.
2. Singh, P.; Sharma, A. Interpretation and classification of arrhythmia using deep convolutional network. IEEE Trans. Instr. Meas. 2022, 71, 2518512.
3. Talukder, M.A.; Talaat, A.S.; Muna, N.J.; Alazab, A.; Kazi, M.; Das, U.K. An explainable deep learning framework for trustworthy arrhythmia detection from ECG signals. Sci. Rep. 2025, 15, 39496.
4. Ansari, Y.; Mourad, O.; Qaraqe, K.; Serpedin, E. Deep learning for ECG Arrhythmia detection and classification: An overview of progress for period 2017–2023. Front. Physiol. 2023, 14, 1246746.
5. Sun, H.; Luo, D.; Niu, X.; Zeng, X.; Zheng, B.; Liu, H.; Pan, J. Classification algorithms in automatic diagnosis of ECG arrhythmias: A review. IEEE Access 2024, 12, 191921–191935.
6. Manimaran, G.; Peimankar, A.; Puthusserypady, S.; Momeni, M.; Asyari, R.A.I.; Jahan, M.S.; Moll, J.; Will, U.K.; Ebrahimi, A. Explainable deep learning based techniques for ECG-Based heart disease classification: A systematic literature review and future direction. Comp. Biol. Med. 2025, 199, 111324.
7. Abdullah, T.A.A.; Zahid, M.S.M.; Ali, W.; Hassan, S.U. B-LIME: An improvement of LIME for interpretable deep learning classification of cardiac arrhythmia from ECG signals. Processes 2023, 11, 595.
8. Oporto, E.; Mauricio, D.; Maculan, N.; Uribe, G. Challenges in the classification of cardiac arrhythmias and ischemia using end-to-end deep learning and the electrocardiogram: A systematic review. Diagnostics 2026, 16, 161.
9. Ciocoiu, I.B. Comparative analysis of bag-of-words models for ECG-based biometrics. IET Biom. 2017, 6, 495–502.
10. Apandi, Z.F.M.; Aziz, N.S.; Othman, W.R.W.; Mustapha, N.; Ikeura, R.; Rofar, N.A.N.A. Heartbeat classification for arrhythmia detection in ambulatory monitoring: A comprehensive systematic review. Biomed. Sig. Process. Control 2026, 116, 109496.
11. Ahmed, A.A.; Ali, W.; Abdullah, T.A.A.; Malebary, S.J. Classifying cardiac arrhythmia from ECG signal using 1D CNN deep learning model. Mathematics 2023, 11, 562.
12. Khan, F.; Yu, X.; Yuan, Z.; ur Rehman, A. ECG classification using 1-D convolutional deep residual neural network. PLoS ONE 2023, 18, e0284791.
13. Atwa, A.E.M.; Atlam, E.S.; Ahmed, A.; Atwa, M.A.; Abdelrahim, E.M.; Siam, A.I. Interpretable deep learning models for arrhythmia classification based on ECG signals using PTB-X dataset. Diagnostics 2025, 15, 1950.
14. Baig, Z.; Nasir, S.; Khan, R.A.; Haque, M.Z.U. ArrhythmiaVision: Resource-conscious deep learning models with visual explanations for ECG arrhythmia classification. arXiv 2025, arXiv:2505.03787.
15. Silva, G.; Silva, P.; Moreira, G.; Freitas, V.; Gertrudes, J.; Luz, E. A systematic review of ECG arrhythmia classification: Adherence to standards, fair evaluation, and embedded feasibility. arXiv 2025, arXiv:2503.07276.
16. Imane, A.; Abdelmajid, B.; Hilal, D.L.; Frederic, E.D.Y.; Hanaa, O.H. Artificial intelligence for atrial fibrillation detection: A systematic review of recent advances in ECG-based deep learning models. In Proceedings of the 5th International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Fez, Morocco, 15–16 May 2025; pp. 1–6.
17. Sepahvand, M.; Abdali-Mohammadi, F. A novel method for reducing arrhythmia classification from 12-lead ECG signals to single-lead ECG with minimal loss of accuracy through teacher-student knowledge distillation. Inf. Sci. 2022, 593, 64–77.
18. Shah, A.; Singh, D.; Mohamed, H.G.; Bharany, S.; Rehman, A.U.; Hussen, S. Electrocardiogram analysis for cardiac arrhythmia classification and prediction through self attention based auto encoder. Sci. Rep. 2025, 15, 9230.
19. Guhdar, M.; Mohammed, A.O.; Mstafa, R.J. Advanced deep learning framework for ECG arrhythmia classification using 1D-CNN with attention mechanism. Knowl.-Based Syst. 2025, 315, 113301.
20. Islam, M.R.; Qaraqe, M.K.; Qaraqe, K.A.; Serpedin, E. CAT-Net: Convolution, attention, and transformer based network for single-lead ECG arrhythmia classification. Biomed. Signal Process. Control 2024, 23, 106211.
21. Shah, H.A.; Saeed, F.; Diyan, M.; Almujally, N.; Kang, J.-M. ECG-TransCovNet: A hybrid transformer model for accurate arrhythmia detection using electrocardiogram signals. CAAI Trans. Intell. Technol. 2024, 1–14.
22. Wang, Z.; Oates, T. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 40–46.
23. Eckmann, J.; Kamphorst, S.O.; Ruelle, D. Recurrence plots of dynamical systems. Europhys. Lett. 1987, 4, 973–977.
24. Stockwell, R.G.; Mansinha, L.; Lowe, R.P. Localization of the complex spectrum: The S-transform. IEEE Trans. Sig. Process. 1996, 44, 998–1001.
25. Moody, G.; Mark, R. MIT-BIH Arrhythmia Database. PhysioNet: MIT Laboratory for Computational Physiology, 2005. Available online: https://physionet.org/content/mitdb/1.0.0/mitdbdir/#files-panel (accessed on 4 August 2025).
26. Zheng, J.; Zhang, J.; Danioko, S.; Yao, H.; Guo, H.; Rakovski, C. A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. Sci. Data 2020, 7, 48.
27. Zeng, W.; Shan, L.; Yuan, C.; Du, S. Advancing cardiac diagnostics: Exceptional accuracy in abnormal ECG signal classification with cascading deep learning and explainability analysis. Appl. Soft Comp. J. 2024, 165, 112056.
28. Benmessaoud, A.S.; Medjani, F.; Bousseloub, Y.; Bouaita, K.; Benrahem, D.; Kezai, T. High quality ECG dataset based on MIT-BIH recordings for improved heartbeats classification. In Proceedings of the IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Berlin, Germany, 23–25 July 2023; pp. 1–4.
29. Yen, S.-J.; Lee, Y.-S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727.
30. Hoyos-Osorio, J.; Alvarez-Meza, A.; Daza-Santacoloma, G.; Orozco-Gutierrez, A.; Castellanos-Dominguez, G. Relevant information undersampling to support imbalanced data classification. Neurocomputing 2021, 436, 136–146.
31. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482.
32. Cohen, J. A coefficient of agreement for nominal scales. Educat. Psych. Meas. 1960, 20, 37–46.
33. Chattopadhyay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. arXiv 2017, arXiv:1710.11063.
34. Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-weighted visual explanations for Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 111–119.
35. Anand, R.; Lakshmi, S.V.; Pandey, D.; Pandey, B.K. An enhanced ResNet-50 deep learning model for arrhythmia detection using electrocardiogram biomedical indicators. Evol. Syst. 2024, 15, 83–97.
36. Mallikarjunamallu, K.; Syed, K. Arrhythmia classification for non-experts using infinite impulse response (IIR)-filter-based machine learning and deep learning models of the electrocardiogram. PeerJ Comput. Sci. 2024, 10, e1774.
37. Lamba, S.; Kumar, S.; Diwakar, M. Fadlec: Feature extraction and arrhythmia classification using deep learning from electrocardiograph signals. Disc. Artif. Intell. 2025, 5, 82.
38. Eleyan, A.; Alboghbaish, E. Electrocardiogram signals classification using deep-learning-based incorporated convolutional neural network and long short-term memory framework. Computers 2024, 13, 55.
39. Chen, Z.; Yang, D.; Cui, T.; Li, D.; Liu, H.; Yang, Y.; Zhang, S.; Yang, S.; Ren, T.L. A novel imbalanced dataset mitigation method and ECG classification model based on combined 1D_CBAM-autoencoder and lightweight CNN model. Biomed. Signal Process. Control 2024, 87, 105437.
40. Bayani, A.; Kargar, M. LDCNN: A new arrhythmia detection technique with ECG signals using a linear deep convolutional neural network. Physiol. Rep. 2024, 12, e16182.
41. Dong, X.; Si, W. Heartbeat dynamics: A novel efficient interpretable feature for arrhythmias classification. IEEE Access 2023, 1, 87071–87086.
42. Zhou, F.; Fang, D. Multimodal ECG heartbeat classification method based on a convolutional neural network embedded with FCA. Sci. Rep. 2024, 14, 8804.
43. Di Paolo, I.F.; Castro, A.R.G. Intra- and interpatient ECG heartbeat classification based on multimodal convolutional neural networks with an adaptive attention mechanism. Appl. Sci. 2024, 14, 9307.
44. Yildirim, O.; Talo, M.; Ciaccio, E.J.; Tan, R.S.; Acharya, U.R. Accurate deep neural network model to detect cardiac arrhythmia on more than 10,000 individual subject ECG records. Comput. Methods Programs Biomed. 2020, 197, 105740.
45. Meqdad, M.N.; Abdali-Mohammadi, F.; Kadry, S. A new 12-lead ECG signals fusion method using evolutionary CNN trees for arrhythmia detection. Mathematics 2022, 10, 1911.
46. Yoon, T.; Kang, D. Bimodal CNN for cardiovascular disease classification by co-training ECG grayscale images and scalograms. Sci. Rep. 2023, 13, 2937.
47. An, X.; Shi, S.; Wang, Q.; Yu, Y.; Liu, Q. Research on a lightweight arrhythmia classification model based on knowledge distillation for wearable single-lead ECG monitoring systems. Sensors 2024, 24, 7896.
48. Hassan, A.A.; Abdali-Mohammadi, F. Automatic extraction of medical latent variables from ECG signals utilizing a mutual information-based technique and capsular neural networks for arrhythmia detection. Comp. Mat. Contin. 2024, 81, 971–983.
49. Dias, F.M.; Monteiro, H.L.M.; Cabral, T.W.; Naji, R.; Kuehni, M.; Luz, E.J.S. Arrhythmia classification from single-lead ECG signals using the inter-patient paradigm. Comput. Methods Programs Biomed. 2021, 202, 105948.
50. Chazal, P.d.; O’dwyer, M.; Reilly, R.B. Automatic classification of heartbeats using ECG morphology and heartbeat interval features. IEEE Trans. Biomed. Eng. 2004, 51, 1196–1206.

Figure 1. Image representations of MIT-BIH dataset ECG recordings (from top to bottom: single-beat ECG segments (1.25 s, z-score normalized), spectrograms, GAF, RP, and ST, respectively). From left to right columns: normal beat (class N), supraventricular ectopic beat (class S), ventricular ectopic beat (class V), fusion beat (class F), unclassified beat (class Q).

Figure 2. Image representations of Chapman–Shaoxing dataset ECG recordings (from top to bottom: multiple-beat ECG segments (10 s, z-score normalized), spectrograms, GAF, RP, and ST, respectively). From left to right columns: atrial fibrillation (AFIB), supraventricular tachycardia (SVT), sinus bradycardia (SB), sinus rhythm (SR).

Figure 3. Combined 2D representation of ECG recordings using ST, GAF, RP, and spectrogram images, respectively: (a) MIT-BIH; (b) Chapman–Shaoxing datasets.

Figure 4. Block diagram of the proposed approach.

Figure 5. Custom CNN architectures.

Figure 6. Explainability analysis of MIT-BIH recordings. From top to bottom: normal beat (class N), supraventricular ectopic beat (class S), ventricular ectopic beat (class V), fusion beat (class F), unclassified beat (class Q). Columns from left to right: ECG signal (1.25 s, z-score normalized); combined 2D representations; grad-CAM; grad-CAM++; score-CAM; LIME; occlusion maps.

Figure 7. Explainability analysis of Chapman–Shaoxing recordings. From top to bottom: atrial fibrillation (AFIB), supraventricular tachycardia (SVT), sinus bradycardia (SB), sinus rhythm (SR). Columns from left to right: ECG signal (10 s, z-score normalized); combined 2D representations; grad-CAM; grad-CAM++; score-CAM; LIME; occlusion maps.

Figure 8. 2D representations of the P, QRS, and T segments (first three rows) and of a complete typical ECG sample (last row). Columns from left to right: ECG sample, spectrogram, GAF, RP, ST.

Figure 9. Relevant segments for the classification decision identified by the grad-CAM algorithm on samples extracted from the Chapman–Shaoxing dataset.

Table 1. Sample distribution per class for the original and augmented MIT-BIH datasets.

	N	SVEB	VEB	F	Q
Original	89,554	2560	7229	803	8043
Augmented	8000	8000	8000	8000	8000

Table 2. Samples distribution per class for the Chapman–Shaoxing dataset.

	SB	SR	AFIB	GSVT
Num. Samples	3888	2222	2218	2260
Merged Categories	-	SR, SI	AFIB, AF	SVT, AT, SAAWR, ST, AVNRT, AVRT

Table 3. Classification performances for the MIT-BIH dataset using distinct 2D representations.

Representation	Model	Acc	P	R	F1-score	Kappa	AUC
Spectrogram	Model 1	95.0 (±1.1)	95.0 (±1.0)	95.0 (±1.1)	95.0 (±1.1)	84.4 (±3.5)	99.5 (±0.2)
Spectrogram	Model 2	95.9 (±1.0)	95.9 (±0.9)	95.9 (±1.0)	95.9 (±1.0)	87.2 (±3.0)	99.6 (±0.1)
Spectrogram	Model 3	95.0 (±1.2)	95.1 (±1.1)	95.0 (±1.2)	95.0 (±1.2)	84.5 (±3.8)	99.5 (±0.2)
Spectrogram	ResNet-50	97.5 (±0.9)	97.5 (±0.9)	97.5 (±0.9)	97.5 (±0.9)	92.1 (±2.8)	99.8 (±0.1)
Spectrogram	Combined models	96.6 (±0.9)	96.8 (±0.9)	96.8 (±0.9)	96.8 (±0.9)	89.6 (±2.9)	99.7 (±0.2)
GAF	Model 1	96.8 (±0.9)	96.8 (±0.9)	96.8 (±0.9)	96.8 (±0.9)	89.8 (±2.9)	99.7 (±0.2)
GAF	Model 2	97.1 (±1.0)	97.1 (±1.0)	97.1 (±1.0)	97.1 (±1.0)	91.0 (±3.1)	99.7 (±0.2)
GAF	Model 3	97.0 (±1.0)	97.0 (±1.0)	97.0 (±1.0)	97.0 (±1.0)	90.7 (±3.3)	99.8 (±0.2)
GAF	ResNet-50	97.7 (±1.0)	97.7 (±1.0)	97.7 (±1.0)	97.7 (±1.0)	92.9 (±3.1)	99.9 (±0.1)
GAF	Combined models	98.0 (±1.5)	98.0 (±1.4)	97.9 (±1.5)	97.8 (±1.5)	93.8 (±4.7)	99.9 (±0.1)
RP	Model 1	97.6 (±1.4)	97.7 (±1.3)	97.6 (±1.4)	97.6 (±1.4)	92.6 (±4.3)	99.8 (±0.2)
RP	Model 2	97.6 (±1.5)	97.7 (±1.4)	97.6 (±1.5)	97.6 (±1.5)	92.6 (±4.7)	99.8 (±0.2)
RP	Model 3	97.8 (±1.2)	97.8 (±1.1)	97.8 (±1.2)	97.8 (±1.2)	93.2 (±3.7)	99.8 (±0.1)
RP	ResNet-50	96.6 (±1.1)	96.7 (±1.1)	96.7 (±1.1)	96.7 (±1.1)	89.4 (±3.6)	99.7 (±0.2)
RP	Combined models	97.5 (±1.9)	97.5 (±1.8)	97.5 (±1.9)	97.5 (±1.9)	92.2 (±5.9)	99.9 (±0.2)
ST	Model 1	97.1 (±1.6)	97.1 (±1.5)	97.1 (±1.6)	97.1 (±1.6)	91.0 (±4.9)	99.8 (±0.2)
ST	Model 2	97.5 (±1.6)	97.6 (±1.5)	97.5 (±1.6)	97.5 (±1.6)	92.3 (±5.1)	99.8 (±0.2)
ST	Model 3	97.2 (±1.6)	97.3 (±1.5)	97.2 (±1.6)	97.2 (±1.6)	91.4 (±5.0)	99.8 (±0.2)
ST	ResNet-50	96.1 (±1.8)	96.1 (±1.7)	96.0 (±1.7)	96.0 (±1.8)	87.9 (±5.5)	99.6 (±0.2)
ST	Combined models	97.8 (±1.5)	97.9 (±1.4)	97.8 (±1.5)	97.8 (±1.5)	93.3 (±4.6)	99.7 (±0.3)

Table 4. Classification performances for the MIT-BIH dataset using combined 2D representations (2 × 2 images generated by merging the four distinct 2D representations).

Model	Acc	P	R	F1-score	Kappa	AUC
Model 1	98.5 (±1.2)	98.5 (±1.1)	98.5 (±1.2)	98.5 (±1.2)	95.2 (±3.7)	99.9 (±0.1)
Model 2	98.5 (±0.8)	98.5 (±0.8)	98.5 (±0.8)	98.5 (±0.8)	95.4 (±2.5)	99.9 (±0.1)
Model 3	98.5 (±1.0)	98.5 (±1.0)	98.5 (±1.0)	98.5 (±1.0)	95.3 (±3.3)	99.9 (±0.1)
ResNet-50	98.3 (±0.9)	98.3 (±0.9)	98.3 (±0.9)	98.3 (±0.9)	94.8 (±2.9)	99.9 (±0.1)
Combined models	99.0 (±0.8)	99.0 (±0.8)	99.0 (±0.8)	99.0 (±0.8)	96.9 (±2.5)	99.9 (±0.1)

Table 5. Classification performances for the MIT-BIH dataset using input fusion (for each model, combine predictions given distinct features as input).

Model	Acc	P	R	F1-score	Kappa	AUC
Model 1	98.9 (±0.8)	98.9 (±0.8)	98.9 (±0.8)	98.9 (±0.8)	96.5 (±2.5)	99.9 (±0.1)
Model 2	98.8 (±0.8)	98.8 (±0.7)	98.8 (±0.8)	98.8 (±0.8)	96.2 (±2.4)	99.9 (±0.1)
Model 3	98.7 (±1.0)	98.8 (±1.1)	98.7 (±1.1)	98.7 (±1.1)	94.3 (±3.8)	99.9 (±0.1)
ResNet-50	98.8 (±0.7)	98.8 (±0.7)	98.8 (±0.7)	98.6 (±0.7)	94.4 (±2.2)	99.9 (±0.1)

Table 6. Classification performances for the Chapman–Shaoxing dataset using distinct 2D representations.

Representation	Model	Acc	P	R	F1-score	Kappa	AUC
Spectrogram	Model 1	89.5 (±3.9)	88.8 (±3.9)	87.9 (±4.6)	87.9 (±4.8)	72.1 (±10.6)	97.2 (±2.1)
Spectrogram	Model 2	87.3 (±3.5)	86.2 (±3.4)	85.5 (±4.0)	85.5 (±4.1)	66.2 (±9.3)	96.3 (±1.8)
Spectrogram	Model 3	89.2 (±3.5)	88.3 (±3.5)	87.6 (±4.1)	87.6 (±4.3)	71.3 (±9.5)	97 (±2.1)
Spectrogram	ResNet-50	90.6 (±4.0)	90.2 (±3.9)	89.1 (±4.6)	89.2 (±4.8)	75 (±10.6)	97.2 (±2.4)
Spectrogram	Combined models	91.5 (±4.2)	91.2 (±4.0)	90.2 (±4.9)	90.2 (±5.0)	77.4 (±11.1)	98.0 (±1.9)
GAF	Model 1	92.6 (±3.7)	92.2 (±3.6)	91.6 (±4.3)	91.6 (±4.4)	80.3 (±10)	98.4 (±1.5)
GAF	Model 2	90.9 (±4.0)	90.5 (±3.2)	89.7 (±4.7)	89.7 (±4.7)	75.9 (±10.7)	98 (±1.7)
GAF	Model 3	91.5 (±3.9)	91 (±3.7)	90.3 (±4.6)	90.3 (±4.8)	77.5 (±10.6)	97.8 (±2.0)
GAF	ResNet-50	90.2 (±4.3)	89.7 (±3.9)	88.9 (±5.2)	88.8 (±5.3)	73.8 (±11.6)	97.5 (±2.2)
GAF	Combined models	93.5 (±4.2)	93.3 (±3.8)	92.5 (±4.9)	92.5 (±5.0)	82.6 (±11.2)	98.8 (±1.5)
RP	Model 1	92.4 (±3.7)	92 (±3.6)	91.3 (±4.3)	91.4 (±4.4)	79.9 (±9.8)	98.3 (±1.6)
RP	Model 2	91.6 (±3.9)	91.2 (±3.5)	90.5 (±4.6)	90.3 (±4.7)	77.7 (±10.4)	98.1 (±1.5)
RP	Model 3	92 (±3.7)	91.4 (±3.7)	91 (±4.3)	91 (±4.4)	78.8 (±10)	98 (±1.6)
RP	ResNet-50	90 (±4.8)	89.8 (±4.0)	88.5 (±5.7)	88.5 (±6.0)	73.2 (±13)	97.1 (±2.8)
RP	Combined models	93.6 (±4.2)	93.6 (±3.6)	92.7 (±4.9)	92.7 (±5.0)	83 (±11.1)	98.8 (±1.3)
ST	Model 1	93 (±4.8)	92.7 (±4.8)	91.9 (±5.6)	91.8 (±5.8)	81.3 (±13.8)	98.3 (±1.9)
ST	Model 2	91.5 (±4.2)	91.1 (±3.9)	90.2 (±5.1)	90.2 (±5.2)	77.5 (±11.3)	98 (±1.8)
ST	Model 3	92.5 (±3.8)	92.1 (±3.6)	91.4 (±4.5)	91.4 (±4.6)	80.1 (±10.1)	98 (±1.9)
ST	ResNet-50	90.2 (±4.0)	89.5 (±4.1)	88.8 (±4.7)	88.7 (±5.0)	73.9 (±10.8)	97.4 (±2.1)
ST	Combined models	93.8 (±4.2)	93.7 (±3.9)	92.7 (±4.9)	92.7 (±5.1)	83.4 (±11.1)	98.6 (±1.6)

Table 7. Classification performances for the Chapman–Shaoxing dataset using combined 2D representations (2 × 2 images generated by merging the four distinct 2D representations).

Model	Acc	P	R	F1-score	Kappa	AUC
Model 1	92 (±2.8)	92.1 (±2.7)	92 (±2.8)	92 (±2.8)	78.6 (±6.4)	98.3 (±0.9)
Model 2	90.8 (±2.9)	91.1 (±2.7)	90.8 (±2.9)	90.8 (±3.0)	75.5 (±7.8)	98 (±1.2)
Model 3	92 (±3.2)	92.2 (±3.1)	92 (±3.2)	92 (±3.4)	78.7 (±8.7)	98.1 (±1.1)
ResNet-50	90.6 (±2.9)	91 (±2.7)	90.6 (±2.9)	90.5 (±3.1)	75 (±7.8)	97.7 (±1.4)
Combined models	93.5 (±3.1)	93.7 (±2.9)	93.5 (±3.1)	93.4 (±3.2)	82.7 (±8.3)	98.8 (±1.0)

Table 8. Classification performances for the Chapman–Shaoxing dataset using input fusion (for each model, combine predictions given distinct features as input).

Model	Acc	P	R	F1-score	Kappa	AUC
Model 1	94.9 (±4.0)	95 (±3.7)	94.1 (±4.7)	94.2 (±4.7)	86.5 (±10.6)	99.2 (±1.3)
Model 2	94.3 (±4.1)	94.3 (±3.7)	93.3 (±4.9)	93.3 (±5.0)	84.6 (±1.11)	99 (±1.3)
Model 3	95 (±4.0)	95 (±3.7)	94.2 (±4.6)	94.2 (±4.8)	86.7 (±10.6)	99.1 (±1.3)
ResNet-50	94 (±4.7)	94.2 (±4.2)	93 (±5.5)	93 (±5.7)	84 (±12.5)	98.8 (±1.85)

Table 9. Comparative performance analysis for the MIT-BIH dataset.

Reference	Features/Classifier	Key Performances	Remarks
Anand R. et al. [35]	Improved ResNet-18 (ECG biomarkers)	Accuracy: 98.14%, Recall: 97%, F1: 0.97	Weighted loss function, up/down-resampling
Mallikarjunamallu K. et al. [36]	DenseNet-121, hyperparameters tuning	Accuracy: 99.97%	Class imbalance compensated using SMOTE
Lamba et al. [37]	Ant Colony Optimization (ACo) + Bi-LSTM/CNN	Accuracy: 98.9% (ACoBi-LSTM), 99.1% (Aco-CNN)	Class imbalance compensated using SMOTE
Eleyan et al. [38]	FFT + CNN + LSTM	Accuracy: 97.6%
Chen Z. et al. [39]	1D Convolutional Block Attention Module + CNN	Accuracy: 98.2%, Recall: 95%, F1: 0.96	Class imbalance compensated using amplitude scaling, weighted loss
Bayani A. et al. [40]	Linear deep CNN	Accuracy: 99.38%	5 classes, 5000 samples/class after resampling
X. Dong, W. Si [41]	Heartbeat Dynamics feature + interpretable ML model (SVM, kNN)	Accuracy: 99.4%, Precision: 99.1%, Recall: 98.8%, F1-score: 0.99	5 classes, 2 leads, balanced classes using resampling/SMOTE
Zhou F. et al. [42]	2D representations + Multi-branch CNN	Accuracy: 99.6%, Recall: 98.9%	Balanced classes using SMOTE
Di Paolo F. [43]	2D representations + CNN with attention	Accuracy: 98.4%, Precision: 94.5%, Recall: 80.3%, F1: 0.82	2 leads, balanced classes using SMOTE
Present paper	2D representations + CNN	Accuracy: 98.6%, F1: 98.6%, AUC: 0.999	ResNet-50 with input fusion, combined 2D representations

Table 10. Comparative performance analysis for the Chapman–Shaoxing dataset.

Reference	Features/Classifier	Key Performances	Remarks
Yildirim et al. [44]	DenseNet-121, hyperparameters tuning	Accuracy: 96.13%	12 leads
Meqdad et al. [45]	Evolutionary CNN trees	Accuracy: 97.6%	12 leads
Yoon et al. [46]	Scalogram + bimodal CNN	Accuracy: 95.7%, AUC: 0.994
Zheng J. et al. [26]	230 ECG biomarkers + gradient boosting tree	Accuracy: 98.2%, F1: 96%	80/20% train/test sets
Sepahvand M. [17]	Feature distillation	Accuracy: 96.5%	12 leads
An X. et al. [47]	Knowledge distillation	Accuracy: 96.3%	wearable devices
Hassan et al. [48]	Capsule NN	Accuracy: 97.4%, F1: 96.6%	12 leads
Present paper	2D representations + CNN	Accuracy: 95%, F1: 94.2%, AUC: 0.991	Model 3 with input fusion

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
