Article

MST-DGCN: Multi-Scale Temporal–Dynamic Graph Convolutional with Orthogonal Gate for Imbalanced Multi-Label ECG Arrhythmia Classification

1 School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 Department of Clinical Engineering, School of Medicine, The Second Affiliated Hospital, Zhejiang University, Hangzhou 310009, China
3 School of Electrical and Information Engineering, North Minzu University, North Wenchang Road, Yinchuan 750021, China
4 Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing 100029, China
5 School of Applied Sciences, Macao Polytechnic Institute, Macao SAR, China
* Authors to whom correspondence should be addressed.
AI 2025, 6(9), 219; https://doi.org/10.3390/ai6090219
Submission received: 17 July 2025 / Revised: 27 August 2025 / Accepted: 4 September 2025 / Published: 8 September 2025

Abstract

Multi-label arrhythmia classification from 12-lead ECG signals is a challenging task, involving spatiotemporal feature extraction, feature fusion, and class imbalance. To address these issues, a multi-scale temporal–dynamic graph convolutional method with orthogonal gates, termed MST-DGCN, is proposed for ECG arrhythmia classification. In this method, a temporal–dynamic graph convolution with dynamic adjacency matrices is used to learn spatiotemporal patterns jointly, and an orthogonal gated fusion mechanism is used to eliminate redundancy, strengthening the complementarity and independence of features by dynamically adjusting their significance. Moreover, a multi-instance learning strategy is proposed to alleviate class imbalance by raising the proportion of minority arrhythmia samples through adaptive label allocation. Validated on the St Petersburg INCART dataset under stringent inter-patient settings, the proposed MST-DGCN method achieves the best classification performance, with an F1-score of 73.66% (a 6.2% relative improvement over prior baseline methods) and concurrent gains in AUC (70.92%) and mAP (85.24%), while maintaining computational efficiency.

1. Introduction

Cardiovascular diseases (CVDs) persist as the leading global cause of mortality, responsible for 17.9 million deaths annually, with arrhythmias implicated in 47% of sudden cardiac fatalities [1]. Given the large patient population, even small improvements in predictive performance could translate into a substantial number of additional patients being correctly identified and potentially saved. The 12-lead electrocardiogram (ECG), a gold-standard diagnostic tool, provides multidimensional electrophysiological signatures essential for arrhythmia detection. However, manual interpretation remains constrained by inter-observer variability and operational inefficiency, underscoring the urgent need for automated, high-precision diagnostic systems.
Deep learning approaches have demonstrated significant potential in ECG analysis, leveraging their capacity to decode intricate spatiotemporal patterns. Convolutional Neural Networks (CNNs) perform well at local feature extraction [2], while Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures effectively model temporal dependencies [3]. Recent advances in Transformer models [4], with their self-attention mechanism, further enhance long-range temporal relationship modeling. Complementing these, graph-based learning frameworks have emerged as powerful tools for multi-label time series classification, particularly in capturing lead interdependencies for cardiac diagnostics [5]. Despite these advances, critical challenges persist in ECG-based arrhythmia classification.
A fundamental limitation lies in the inadequate integration of spatiotemporal features across ECG leads. Conventional deep learning frameworks often compartmentalize the temporal and spatial dimensions, predominantly focusing on temporal correlations in single-lead data while failing to incorporate the critical inter-lead spatial dependencies that fundamentally inform both physiological interpretation and clinical diagnostic precision [6,7]. Critically, ECG signals exhibit intrinsic multi-scale properties, where morphological details operate at millisecond resolutions while rhythm patterns span seconds to minutes, necessitating unified modeling across temporal scales to capture pathological signatures. This cross-scale interdependency remains inadequately addressed, as traditional feature fusion strategies often introduce information redundancy or loss when integrating heterogeneous resolutions [8], thereby limiting their ability to exploit the synergistic potential of multi-scale data. Furthermore, severe class imbalance is a prevalent issue in arrhythmia datasets [9,10,11]: transient pathological events (e.g., Classes A, F, S, V) are statistically overwhelmed by normal sinus rhythms (Class N). This disparity is exemplified by the St Petersburg INCART dataset, where abnormal classes collectively constitute merely 19% of samples, and critical categories such as F and A represent less than 3%. This inherent disparity skews model training toward majority-class recognition at the expense of pathological anomaly detection [12,13].
To address these challenges, we propose the Multi-Scale Temporal–Dynamic Graph Convolutional Network (MST-DGCN), a unified framework that systematically resolves three key limitations: (1) a Temporal–Dynamic Graph Convolutional Network (T-DGCN) is proposed to extract diverse and informative spatiotemporal patterns; (2) orthogonal gated fusion (OGF) is introduced to eliminate feature redundancy while preserving complementary multi-scale representations; (3) a multiple instance learning (MIL) strategy is implemented to mitigate extreme class imbalance through instance-level label reassignment.
The key contributions of this work can be outlined as follows:
  • MST-DGCN Framework: Multi-Scale Temporal–Dynamic Graph Convolutional Network (MST-DGCN), integrating dynamic graph learning, temporal modeling, and multi-scale fusion, is proposed for multi-label arrhythmia classification.
  • T-DGCN for Spatiotemporal Feature Extraction: This study proposes a novel hybrid architecture that innovatively integrates two complementary components—a Dynamic Graph Convolutional Network (DGCN) for adaptive modeling of inter-lead electrophysiological correlations and a Gated Temporal Convolutional Network (GTCN) for capturing spatiotemporal dependencies, which can effectively overcome the inherent limitation of spatial feature oversight in conventional ECG analytical paradigms.
  • OGF for Feature Integration: An orthogonal constraint is proposed to eliminate feature redundancy, while simultaneously implementing adaptive gating mechanisms to dynamically recalibrate the contribution weights of complementary multi-scale representations, so as to optimize classification efficacy in cardiac arrhythmia detection.
The remainder of this paper is organized as follows: Section 2 critically reviews advances in deep learning methodologies and feature fusion strategies for ECG arrhythmia classification. Section 3 elaborates on the architectural design of the proposed MST-DGCN framework, including mathematical formulations and implementation specifics. Section 4 delineates the dataset utilized, the experimental setup, and the model evaluation methods employed in the study. Section 5 conducts rigorous comparative analyses of classification performance, supported by ablation studies quantifying component-wise contributions. Finally, Section 6 concludes with a summary of the principal contributions of this paper and outlines potential directions for future research.

2. Related Work

2.1. Deep Learning for ECG Classification

With the advance of deep learning, such methods have been widely applied to ECG arrhythmia classification. Acharya et al. [14] introduced a 9-layer CNN for automated detection of five arrhythmia types, demonstrating superior normal/abnormal heartbeat discrimination through hierarchical extraction of local morphological features. Subsequent efforts focused on temporal dynamics: Cheng et al. [15] proposed a hybrid framework integrating a 24-layer deep CNN (DCNN) with a bidirectional LSTM (BiLSTM), where the DCNN extracts spatiotemporal features and the BiLSTM captures bidirectional contextual dependencies, significantly improving long-term temporal pattern recognition. To address computational constraints in clinical deployment, Mantravadi et al. [16] developed a lightweight 1D-CNN suitable for smart wearable devices, significantly reducing the number of parameters without sacrificing diagnostic accuracy. Recent innovations extend to attention-based architectures: Wang et al. [17] proposed a multi-resolution network that incorporates an attention mechanism into the residual module to effectively denoise different types of ECG signals. El-Ghaish et al. [4] further advanced this paradigm through a bidirectional Transformer (BiTrans) architecture that jointly analyzes forward and backward temporal contexts, enabling detection of subtle electrophysiological variations undetectable by conventional methods. Despite these advances, existing approaches predominantly focus on temporal feature extraction while neglecting the inherent spatial correlations among 12-lead ECG signals. This oversight impedes comprehensive modeling of arrhythmia patterns requiring multi-lead coordination, a gap our framework explicitly addresses through dynamic inter-lead relationship learning.

2.2. Feature Fusion Strategies

Effective fusion of multi-scale spatiotemporal ECG features faces dual challenges: redundancy mitigation and modality imbalance [18]. Traditional fusion paradigms often include three distinct frameworks. The first one is Early Fusion, which combines raw features during initial processing stages. Ranipa et al. [19] enhanced this approach through multimodal attention mechanisms to amplify diagnostically critical features, achieving a 3.2% F1-score improvement over naive concatenation. However, such methods risk information overload when processing high-dimensional 12-lead ECG data. The second one is Late Fusion, aggregating predictions from independently trained models. Zhang et al. [20] fused time-domain and frequency-domain classifier outputs via decision-level complementarity, attaining 99.3% F1-score for five-class arrhythmia detection on the Yaseen dataset. While robust, this strategy sacrifices fine-grained feature interactions essential for rare arrhythmia identification. The last one is Hybrid Fusion, which combines early and late-stage features. Maithani et al. [21] sequentially fused time-frequency features and classifier outputs, achieving the best performance using a modified score-level fusion procedure. Despite flexibility, such methods lack dynamic adaptation to heterogeneous ECG morphologies.
Recent attention-based mechanisms [22] dynamically weight features but exhibit modality bias, exacerbating performance degradation on minority classes. Orthogonality constraints offer a promising solution for feature fusion. Zhang et al. [23] demonstrated that enforcing orthogonal representation minimizes redundancy while enhancing feature space diversity. Chang et al. [24] further validated that orthogonality-driven fusion outperforms conventional methods in multi-scale data integration, significantly improving classification robustness. Building on these insights, we propose orthogonal gated fusion (OGF), which fuses orthogonalization with learnable gating mechanisms. OGF dynamically suppresses redundant features while preserving complementary multi-scale patterns, effectively addressing overfitting risks and information loss in imbalanced ECG classification scenarios.

3. Methods

3.1. Overall Framework

This section provides an overview of the MST-DGCN framework (Figure 1), which consists of three main stages: data instantiation, feature extraction, and feature fusion. In the data instantiation phase, the 12-lead ECG signals are preprocessed to segment heartbeats and construct multi-instance data. This process addresses class imbalance by grouping consecutive heartbeats while preserving temporal dynamics. During feature extraction, the T-DGCN and ResNet modules are employed in parallel to extract instance-level features, while statistical features are generated using the tsfresh Python package (version 0.20.2) as auxiliary information. It is worth noting that the tsfresh package performs statistical feature extraction directly on the raw ECG signals, providing additional global context for the feature representation.
We propose a dual-stage feature fusion framework that hierarchically combines single-scale and multi-scale representations through two sequential stages. Stage 1, global–local fusion, employs a Squeeze-and-Excitation (SE) module to fuse global and local features within a specific scale, enhancing the discriminative power of the feature representation. Stage 2, orthogonal gated fusion (OGF), processes multi-scale features through an orthogonal gating mechanism, which eliminates feature redundancy while enhancing complementary information across scales. The pipeline concludes with a multi-layer perceptron (MLP) implementing robust multi-label ECG classification through probabilistic category prediction.
The subsequent sections provide comprehensive descriptions of each module.

3.2. Multiple Instance Learning (MIL)

To alleviate the class imbalance problem in ECG datasets, this work adopts a multiple instance learning (MIL) strategy. As illustrated in Figure 2, the data instantiation process employs an R-wave-centered multi-scale windowing strategy, where continuous ECG signals are segmented into windows of 180, 90, and 45 sampling points centered at R-peaks, corresponding to the three scales depicted in the figure. Sixty consecutive cardiac beats are aggregated into new instances with multi-label annotations. Each instance may contain multiple abnormal categories (e.g., class A, class F). If all heartbeats within an instance belong to class N, the instance label is assigned as class N, as shown in Figure 3. Conversely, if any heartbeat is classified as abnormal, the instance label includes the corresponding abnormal class. It should be noted that when multiple symbols (e.g., F and V) appear simultaneously, this simply indicates the presence of two abnormal categories and does not imply medical nomenclature or any order relationship between them. Under the MIL paradigm, only the overall label of the instance is available, while the specific labels of individual heartbeats within the instance remain unknown. Consequently, feature extraction requires jointly mapping heartbeat-level features to instance-level feature representations for instance label prediction.
The MIL problem is formulated as follows. The heartbeats in instance X_i can be expressed as X_i = \{ x_{i,j}^k \mid j = 1, 2, \ldots, n;\ k \in K \}, where x_{i,j}^k denotes the j-th heartbeat of the i-th instance at scale k; n is the number of heartbeats in the instance; and k indexes the time series data of different scales. In this paper, n = 60 and K = 3. The instance label Y_i is the reallocated label of all heartbeats in the instance. Therefore, the dataset can be represented as D = \{ (X_i, Y_i) \}_{i=1}^{N}, where N denotes the total number of instances.
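To make the instantiation procedure concrete, the sketch below groups beats into multi-label instances under the rules above. It is a minimal illustration, assuming R-peak indices and per-beat annotation symbols are already available (as in the INCART annotations); centering each window symmetrically on the R-peak is our assumption, since the exact alignment is not specified.

```python
import numpy as np

def build_instances(signal, r_peaks, beat_labels,
                    beats_per_instance=60, scales=(180, 90, 45)):
    """Group consecutive heartbeats into multi-label MIL instances.

    signal:      (12, T) array holding one 12-lead ECG record
    r_peaks:     sample indices of the detected R-peaks
    beat_labels: per-beat annotation symbols ('N', 'V', 'A', ...)
    Returns a list of (beats_by_scale, labels) pairs, where
    beats_by_scale[k] has shape (60, 12, scales[k]).
    """
    instances = []
    for s in range(0, len(r_peaks) - beats_per_instance + 1, beats_per_instance):
        beats_by_scale = []
        for w in scales:
            half = w // 2
            windows = []
            for r in r_peaks[s:s + beats_per_instance]:
                seg = signal[:, max(r - half, 0):r + half]
                if seg.shape[1] < w:  # zero-pad beats near record boundaries
                    seg = np.pad(seg, ((0, 0), (0, w - seg.shape[1])))
                windows.append(seg)
            beats_by_scale.append(np.stack(windows))
        labels = set(beat_labels[s:s + beats_per_instance])
        if labels != {'N'}:          # any abnormal beat -> abnormal instance
            labels.discard('N')      # keep only the abnormal categories
        instances.append((beats_by_scale, sorted(labels)))
    return instances
```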

3.3. Feature Extraction Method

Feature extraction serves as the cornerstone for transforming raw ECG signals into discriminative representations. This section proposes a hybrid feature extractor integrating dynamic instance representations (capturing spatiotemporal dependencies across multiple leads) and statistical descriptors (quantifying morphological signatures), resolving the critical gap in conventional single-modality approaches that neglect dynamic inter-lead dependencies.

3.3.1. Instance Feature Extraction

The instance feature extraction module proposed in this paper consists of two parallel networks, T-DGCN and ResNet, designed to address the limitations of standalone feature extraction methods. T-DGCN integrates a Gated Temporal Convolutional Network (GTCN) and a Dynamic Graph Convolutional Network (DGCN) in series, leveraging the hierarchical nature of spatiotemporal features. Specifically, GTCN captures temporal dependencies in ECG signals, while DGCN models spatial topological relationships among leads. This serial connection enables the effective integration of temporal and spatial features, as demonstrated in the subsequent experiments (Section 5), where the serial design outperforms parallel configurations in capturing these complex dependencies.
To further enrich the extracted features, the proposed method incorporates a multi-scale processing strategy, in which the instance feature extraction process is applied in parallel to three different scales of input data. To illustrate the method, we take one scale as an example and describe its processing in detail; all scales perform the same operations in parallel.
(a)
Gated Temporal Convolutional Network (GTCN)
To effectively capture temporal dependencies in ECG signals, this study adopts a causal dilated convolution-based Temporal Convolutional Network (TCN) design, as shown in Figure 4a. By introducing a dilation factor d, intervals are created between the convolution kernel weights, significantly expanding the receptive field. This enables the network to achieve an exponential receptive field with fewer layers, making it well-suited for processing long-term dependencies. The causal property ensures that the output at time T depends only on current and past inputs, maintaining the temporal order of the data. The output of the causal dilated convolution can be expressed as:
h_t^l = \mathrm{TCN}^l\left(h_t^{l-1}\right) = \sum_{i=0}^{m-1} w_i \cdot h_{t - i \cdot d}^{l-1}    (1)
where m is the kernel size, d is the dilation factor, and w_i denotes the convolution kernel weights. By stacking multiple layers of dilated convolutions, the model is able to capture dependencies over extended temporal spans. The output of an L-layer TCN at time step t is defined as follows:
h_t^L = \mathrm{TCN}^L\left(\mathrm{TCN}^{L-1}\left(\cdots \mathrm{TCN}^1(x_t)\right)\right)    (2)
To further enhance information flow and preserve critical local patterns, a gating mechanism is introduced, inspired by its success in recurrent and graph convolutional networks. The Gated Temporal Convolution (GTCN) is defined as:
Z_t^{temp} = \mathrm{GTCN}\left(h_t^L\right) = \sigma\left(h_t^L\right) \odot \mu\left(h_t^L\right)    (3)
where σ(·) and μ(·) denote the sigmoid and tanh activation functions, respectively, while ⊙ denotes the Hadamard product. The gating mechanism dynamically screens temporal features to retain discriminative patterns. The output of this stage, denoted Z_t^temp, subsequently serves as the input to the dynamic graph convolutional network for further feature extraction.
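A minimal PyTorch sketch of the gated temporal convolution described by Equations (1)–(3) follows; the kernel size, layer count, and intermediate ReLU activations are our assumptions rather than values stated in the paper.

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """Causal dilated temporal convolutions with a sigmoid/tanh gate,
    following Equations (1)-(3)."""
    def __init__(self, channels, kernel_size=3, num_layers=4):
        super().__init__()
        layers = []
        for l in range(num_layers):
            d = 2 ** l                       # exponentially growing dilation
            layers += [
                # pad on the left only, so the output at time t never sees t+1
                nn.ConstantPad1d(((kernel_size - 1) * d, 0), 0.0),
                nn.Conv1d(channels, channels, kernel_size, dilation=d),
                nn.ReLU(),
            ]
        self.tcn = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.tcn(x)                      # h_t^L from stacked TCN layers
        return torch.sigmoid(h) * torch.tanh(h)   # Equation (3) gate
```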
(b)
Dynamic Graph Convolutional Network (DGCN)
For 12-lead ECG spatial feature extraction, as shown in Figure 4b, we use a Dynamic Graph Convolutional Network (DGCN) to capture the spatial relationships among leads. In the graph representation, each heartbeat corresponds to a time step, and each of the 12 leads is represented as a node. Unlike traditional graph convolutional networks, where the adjacency matrix is predefined and fixed during training, our DGCN employs a learnable adjacency matrix Ã_t that evolves over time. This dynamic adaptation enables richer spatial feature representations by modeling context-dependent relationships between leads. Each element of the adjacency matrix represents the topological association strength between a pair of nodes. An explicit edge connection is established only when the association strength exceeds a predefined threshold. It is worth noting that the adjacency matrix is treated as a trainable parameter during backpropagation, so that all of its elements, including those corresponding to potential associations below the threshold, receive gradient updates. The adjacency matrix Ã_t is computed as:
\tilde{A}_t = \mathrm{softmax}\left(\psi\left(Z_t^{temp} W_a\right) \, \psi\left(Z_t^{temp} W_a\right)^{\top} + B\right)    (4)
where W_a and B are the learnable weight matrix and bias matrix, respectively. The LeakyReLU activation function (denoted ψ) enhances non-linear mapping capabilities, while the softmax normalization ensures that each row of the adjacency matrix Ã_t represents adaptive relationships between ECG leads at time step t.
Using the adjacency matrix, DGCN transforms the node feature matrix and adjacency matrix through graph convolution operations. The spatial relationships between leads are updated at each layer as:
Z_t^{spatio} = \mathrm{DGCN}\left(Z_t^{temp}\right) = g_t^L, \qquad g_t^l = \sigma\left(\tilde{A}_t \cdot g_t^{l-1} \cdot W^l\right), \quad l \in [1, L]    (5)
where g_t^l represents the node feature matrix of the l-th layer, Ã_t is the dynamic adjacency matrix computed by Equation (4), and W^l is the trainable weight matrix for layer l. The initial input g_t^0 corresponds to Z_t^temp from Equation (3), which contains temporal features. Through L stacked layers, the DGCN progressively refines spatial relationships between ECG leads, with the final output Z_t^spatio serving as the high-order spatiotemporal feature.
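The dynamic adjacency construction and layer update of Equations (4) and (5) can be sketched as follows; the feature dimension, the two-layer depth, and the transposed product used to obtain a 12 × 12 matrix are assumptions of this reconstruction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGCN(nn.Module):
    """Graph convolution over the 12 leads with an input-dependent,
    learnable adjacency matrix, following Equations (4) and (5)."""
    def __init__(self, feat_dim, num_leads=12, num_layers=2):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, feat_dim, bias=False)
        self.B = nn.Parameter(torch.zeros(num_leads, num_leads))  # bias matrix
        self.gcn = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(num_layers))

    def forward(self, z):                    # z: (batch, num_leads, feat_dim)
        e = F.leaky_relu(self.W_a(z))        # psi(Z W_a)
        adj = torch.softmax(e @ e.transpose(1, 2) + self.B, dim=-1)
        g = z                                # g_t^0 = Z_t^temp
        for W in self.gcn:
            g = torch.sigmoid(adj @ W(g))    # Equation (5) layer update
        return g                             # Z_t^spatio = g_t^L
```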
(c)
ResNet
To resolve feature conflicts between temporal waveform variations and inter-lead spatial relationships under noisy conditions or atypical pathological manifestations, we introduce a ResNet architecture (Figure 4c). This network comprises three residual blocks with 10 convolutional layers and a fully connected layer, extracting deep abstract features through skip connections. Crucially, when T-DGCN-derived features exhibit contradictions due to noise or atypical pathology, ResNet’s deep features—selectively amplified by the Squeeze-and-Excitation mechanism—reinforce consistent diagnostic information to resolve conflicts. The synergistic operation not only strengthens feature extraction capabilities but also generates more discriminative representations, thereby effectively supporting downstream tasks such as multi-label ECG classification.
Overall, the entire instance feature extraction process can be summarized as:
F_i^k = \alpha_k \cdot \mathrm{DGCN}\left(\mathrm{GTCN}\left(X_i^k\right)\right) + \left(1 - \alpha_k\right) \cdot \mathrm{ResNet}\left(X_i^k\right)    (6)
where α_k is a learnable weight parameter that balances the contributions of the two modules, and X_i^k denotes the input data of the i-th instance at scale k. In this study, we use three scales (k = 1, 2, 3) to process the ECG signals, where each scale corresponds to a different time window size, allowing the model to capture both fine-grained and coarse-grained features.
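A sketch of the per-scale branch implied by Equation (6) is given below; squashing the learnable weight through a sigmoid to keep it in (0, 1), and the requirement that both branches emit identically shaped features, are our assumptions.

```python
import torch
import torch.nn as nn

class ScaleBranch(nn.Module):
    """Blend the T-DGCN path and the ResNet path of one scale with a
    learnable weight alpha_k, as in Equation (6)."""
    def __init__(self, tdgcn: nn.Module, resnet: nn.Module):
        super().__init__()
        self.tdgcn, self.resnet = tdgcn, resnet
        self.alpha = nn.Parameter(torch.tensor(0.0))   # learnable balance

    def forward(self, x):
        a = torch.sigmoid(self.alpha)   # squash to (0, 1); our assumption
        # both branches must emit identically shaped instance features
        return a * self.tdgcn(x) + (1 - a) * self.resnet(x)
```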

3.3.2. Statistical Feature Extraction

To characterize ECG signal properties, we leverage the tsfresh Python package [25] to compute physiologically meaningful statistical features. For instance, kurtosis and skewness quantify waveform sharpness and symmetry, respectively, aiding in distinguishing pathological morphologies. Variance captures signal variability critical for detecting arrhythmic fluctuations. The mean and standard deviation characterize baseline amplitude distribution and signal stability. Together, these features effectively capture essential signal patterns and trends. For each heartbeat sample, the statistical features are extracted into a vector s_i. To reduce dimensionality while preserving the most informative components, PCA is applied outside the network, using the covariance matrix of the training data, to compress s_i to 500 dimensions. A subsequent linear transformation reduces the feature dimension to 128, creating a compact yet informative embedding suitable for downstream tasks. The process is represented as l_i = W_s \cdot \mathrm{PCA}(s_i) + b_s, where W_s and b_s are the parameters of the linear layer.
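The statistical pipeline might look as follows; the tsfresh feature set (EfficientFCParameters), the NaN handling, and the grouping of one channel per beat into series are assumptions of this sketch, and the PCA must be fitted on the training split only, as the text notes.

```python
import pandas as pd
from sklearn.decomposition import PCA
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

def statistical_embeddings(beats, n_components=500):
    """beats: list of 1-D numpy arrays (one channel of one heartbeat).
    Returns the PCA-compressed statistical feature matrix (500 dims);
    a linear layer inside the network then maps it to 128 dimensions."""
    long_df = pd.concat(
        pd.DataFrame({"id": i, "time": range(len(b)), "value": b})
        for i, b in enumerate(beats))
    feats = extract_features(long_df, column_id="id", column_sort="time",
                             default_fc_parameters=EfficientFCParameters())
    feats = feats.fillna(0.0)             # simple NaN handling for this sketch
    pca = PCA(n_components=n_components)  # fit on the training split only
    return pca.fit_transform(feats.values)
```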
So far, we have obtained two feature embeddings: the statistical feature l_i and the instance feature F_i^k. Next, we elaborate on the feature fusion scheme.

3.4. Feature Fusion Method

To mitigate the interference of redundant features with fusion performance, this work proposes a dual-stage fusion module. Stage 1, global–local feature fusion, combines statistical features l_i, which capture global signal patterns and morphology, with instance features F_i^k, which encode local spatiotemporal dependencies across ECG leads. Concatenation maximizes the retention of complementary information from heterogeneous sources, followed by a Squeeze-and-Excitation (SE) module for adaptive channel weighting. This achieves preliminary adaptive fusion at a specific scale and provides a foundation for multi-scale fusion. Stage 2 employs an orthogonalization strategy to eliminate inter-scale redundancy while enhancing feature independence and complementarity. Concurrently, a parametric gating mechanism adaptively recalibrates cross-scale feature contributions, thereby optimizing the fused representation to achieve robust multi-label ECG classification with enhanced discriminative power.

3.4.1. Global–Local Fusion at Specific Scale

Inspired by SE-ECGNet [26], this work employs a squeeze layer to enhance the integration of global statistical and local instance features. First, the instance feature F_i^k undergoes adaptive average pooling (AAP) to generate a global instance feature vector F_g^{i,k}. This is then concatenated with the reduced-dimensional statistical feature vector l_i. The concatenated feature vector is processed through two linear layers with ReLU activation to produce the scale-specific fused feature g_i^k. The process is defined as:
F_g^{i,k} = \mathrm{AAP}\left(F_i^k\right)    (7)

g_i^k = \mathrm{SE}\left(\left[F_g^{i,k}, l_i\right]\right) = W_1 \cdot \mathrm{ReLU}\left(W_2 \cdot \left[F_g^{i,k}, l_i\right] + b_2\right) + b_1    (8)
where W_1 and W_2 are learnable weight matrices, b_1 and b_2 are biases, [·,·] denotes the vector concatenation operation, and k is the scale index.
This fused feature g_i^k serves as an intermediate representation, which is further processed in subsequent stages to produce the final comprehensive fused feature.
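Equations (7) and (8) correspond to a small module of roughly the following form; the hidden and output dimensions are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class GlobalLocalSE(nn.Module):
    """Stage-1 fusion of Equations (7) and (8): pool the instance feature,
    concatenate the statistical embedding, then apply two linear layers."""
    def __init__(self, inst_dim, stat_dim=128, hidden=256, out_dim=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # AAP over the time axis
        self.fc = nn.Sequential(
            nn.Linear(inst_dim + stat_dim, hidden),  # W2 (.) + b2
            nn.ReLU(),
            nn.Linear(hidden, out_dim),              # W1 (.) + b1
        )

    def forward(self, F_ik, l_i):
        # F_ik: (batch, inst_dim, time); l_i: (batch, stat_dim)
        g = self.pool(F_ik).squeeze(-1)              # global instance vector
        return self.fc(torch.cat([g, l_i], dim=-1))  # fused feature g_i^k
```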

3.4.2. Orthogonal Gated Multi-Scale Fusion

To address feature redundancy across scales and ensure that each scale contributes effectively, this paper proposes an orthogonal gated multi-scale fusion module, as shown in Figure 5. The fused features g_i^k (k = 1, 2, 3) from the stage-1 global–local fusion are orthogonalized using the iterative Gram-Schmidt method. The orthogonalization process starts by selecting an arbitrary scale g_i^n as the reference basis. Subsequently, the projections of the remaining two scales g_i^m onto this reference basis are computed using Equation (9). This operation effectively eliminates redundant information components associated with the reference basis from these scales. Through cyclic alternation of the reference basis and repetition of the projection procedure, mutual orthogonality is systematically enforced across all scales. This iterative refinement culminates in a non-redundant representation within the multi-scale feature space, ensuring maximal informational independence among all scale components.
\tilde{g}_i^m = g_i^m - \mathrm{Proj}_{g_i^n}\left(g_i^m\right) = g_i^m - \frac{g_i^m \cdot g_i^n}{g_i^n \cdot g_i^n} \, g_i^n    (9)
where Proj_{g_i^n}(g_i^m) represents the projection of the fused feature g_i^m onto the reference feature g_i^n.
A gating mechanism then dynamically adjusts the importance of features across scales. For the orthogonal vector g̃_i^k of the k-th scale, a gate unit controls which features are sent and which are received. Two gating operations are involved. At the delivering end (red dotted box in Figure 5), the gating value G_k sends the information in the k-th scale that exceeds the threshold to the other two scales, where it is combined by addition. At the receiving end (blue dashed box in Figure 5), the complementary gating value 1 − G_k filters the information arriving from the sending ends, retaining only the information absent from the k-th scale to avoid redundancy.
Three parts of information are combined by addition: the orthogonal information of g̃_i^k itself (weighted by G_k); the useful information from the other two scales (filtered by 1 − G_k); and the independent contribution of g̃_i^k after its redundancy is removed. The non-redundant feature vector ĝ_i^k of the k-th scale is computed as follows:
\hat{g}_i^k = G_k \cdot \tilde{g}_i^k + \left(1 - G_k\right) \cdot \sum_{j=1,\, j \neq k}^{3} G_j \cdot \tilde{g}_i^j + \tilde{g}_i^k    (10)
Finally, the fused feature F = [ĝ_i^1, ĝ_i^2, ĝ_i^3] is obtained by concatenating the non-redundant feature vectors of the three scales. The feature vector F is fed into an MLP to perform multi-label classification. This enables the model to output predicted probabilities for all target categories in the multi-label ECG classification task, thereby completing the end-to-end learning process.
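A compact sketch of the OGF stage (Equations (9) and (10)) follows; it uses a fixed-order Gram-Schmidt sweep rather than the cyclic alternation described above, and it models each gate G_k as a learned scalar passed through a sigmoid, both of which are simplifying assumptions.

```python
import torch
import torch.nn as nn

class OrthogonalGatedFusion(nn.Module):
    """Stage-2 fusion of Equations (9) and (10): Gram-Schmidt
    orthogonalization across the three scales, then gated recombination."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(3))     # one gate per scale

    @staticmethod
    def project(u, v):                               # Proj_v(u), Equation (9)
        return (u * v).sum(-1, keepdim=True) / (v * v).sum(-1, keepdim=True) * v

    def forward(self, g1, g2, g3):                   # each: (batch, dim)
        t1 = g1                                      # fixed-order sweep
        t2 = g2 - self.project(g2, t1)
        t3 = g3 - self.project(g3, t1) - self.project(g3, t2)
        tilde = [t1, t2, t3]
        G = torch.sigmoid(self.gate)                 # gating values in (0, 1)
        fused = []
        for k in range(3):                           # Equation (10) per scale
            recv = sum(G[j] * tilde[j] for j in range(3) if j != k)
            fused.append(G[k] * tilde[k] + (1 - G[k]) * recv + tilde[k])
        return torch.cat(fused, dim=-1)              # final feature F
```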

4. Experiments

4.1. Dataset

Due to two methodological constraints, this study employed the St. Petersburg INCART database as the primary experimental dataset and did not perform cross-dataset validation: (1) Lead Requirement: The proposed approach necessitates 12-lead ECG data for dynamic graph construction, whereas widely used benchmark datasets (e.g., MIT-BIH) offer only single-lead recordings; (2) Annotation Granularity: The algorithm requires beat-level annotations, whereas mainstream datasets (e.g., PTB-XL, CPSC) offer record-level diagnostic labels.
The INCART dataset comprises 75 annotated clinical ECG records collected from 32 patients (17 male and 15 female, aged 18–80 years; mean age: 58) using Holter monitors, with a sampling rate of 257 Hz and a duration of ≈30 min per record. These records encompass diverse arrhythmia types with precise beat-level labeling. To ensure rigorous evaluation and prevent data leakage, we employed an inter-patient hold-out evaluation protocol: patients were divided at the subject level, with 24 patient records assigned to the training set and 8 patient records to the test set. This design ensures the test data came exclusively from unseen patients, providing a reliable assessment of the model’s generalization.
Unlike conventional beat-wise segmentation, this study groups 60 consecutive heartbeats into a single instance and assigns multi-label annotations through MIL aggregation. If any abnormal beat is present within the group, the entire instance is labeled as abnormal. Since normal beats tend to occur consecutively whereas minority abnormal beats appear sporadically, the MIL-based grouping rarely results in instances dominated by abnormal beats. This strategy not only preserves the data’s clinical plausibility but also effectively mitigates the class imbalance issue. For example, in the INCART dataset, only 17,528 out of 92,074 beats belong to minority classes, accounting for merely 19.04%. After MIL packaging, however, the proportion of minority instances increases to 70.76%, substantially amplifying the weight of rare categories and enabling the model to better focus on abnormal arrhythmias. The final classification distribution is detailed in Table 1.

4.2. Evaluation Metrics

We evaluated ECG classification performance on the St. Petersburg INCART dataset using the F1-score, area under the curve (AUC), Recall, and mean average precision (mAP). Given the multi-label nature of the dataset, mAP replaces accuracy as the primary metric, with the critical stipulation that an instance is considered correctly classified only when all of its associated labels are predicted exactly; any misclassification or missing label constitutes an error. Notably, Recall quantifies the proportion of correctly identified positive samples, a particularly crucial metric for disease detection, where false negatives carry significant clinical risk.
F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (11)

AUC = \int_0^1 \frac{TP}{TP + FN} \, d\left(\frac{FP}{FP + TN}\right)    (12)

Recall = \frac{TP}{TP + FN}    (13)

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i, \qquad AP = \sum_j \left(R_j - R_{j-1}\right) \cdot \frac{TP_j}{TP_j + FP_j}    (14)
In Equation (14), N represents the total number of classes in the classification task, AP_i denotes the average precision for the i-th class, and R_j refers to the Recall at the j-th threshold.
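With scikit-learn, the four metrics can be computed roughly as below; macro averaging over classes and the 0.5 decision threshold are our assumptions, since the paper does not state them.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """y_true: (N, C) binary multi-label matrix; y_prob: (N, C) scores."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "F1":     f1_score(y_true, y_pred, average="macro", zero_division=0),
        "AUC":    roc_auc_score(y_true, y_prob, average="macro"),
        "Recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "mAP":    average_precision_score(y_true, y_prob, average="macro"),
    }
```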

4.3. Implementation Details

The experiments are based on the PyTorch framework (version 2.3.1) and performed on a single RTX 3090 GPU. During training, the model is optimized with the adaptive moment estimation (Adam) optimizer, with the learning rate set to 1 × 10−3. During initial training, we observed that the model converges after about 120 epochs, so we set the maximum number of training epochs to 150 to ensure that the model is fully trained. In addition, to ensure the fairness and comparability of the experiments, the batch size was set to 64 in all experiments.
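The stated settings translate into a training loop of roughly the following shape; the model interface, the data loader, and the binary cross-entropy objective are assumptions (the paper does not name its loss function).

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, num_epochs=150, lr=1e-3,
          device="cuda"):
    """Settings from Section 4.3: Adam, lr 1e-3, up to 150 epochs; the
    loader is assumed to yield batches of 64."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()      # assumed multi-label objective
    for _ in range(num_epochs):
        for x_scales, stat_feats, labels in train_loader:
            x_scales = [x.to(device) for x in x_scales]   # three scales
            stat_feats, labels = stat_feats.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(x_scales, stat_feats)          # (batch, classes)
            loss = criterion(logits, labels.float())
            loss.backward()
            optimizer.step()
```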

5. Experimental Result and Discussion

5.1. Comparison with Previous Methods

As demonstrated in Table 2, our method achieves state-of-the-art performance across all metrics in the inter-patient paradigm, underscoring its superiority in multi-label ECG classification. To provide a classical baseline, we further evaluated traditional machine learning approaches using enhanced statistical and frequency-domain ECG features. Random Forest and linear SVM classifiers were trained with PCA for dimensionality reduction. While these methods achieved moderate AUC scores, their F1 and Recall remained substantially lower than those of MST-DGCN, indicating that while machine learning can reasonably distinguish normal and abnormal classes, it performs poorly in differentiating among minority abnormal classes, further underscoring the challenges of multi-label classification.
For comparison with other deep learning frameworks: MIC [34], despite leveraging attention mechanisms for feature aggregation, attains limited performance (F1: 57.41, Recall: 56.85) due to its reliance solely on ResNet for feature extraction, which fails to capture the spatiotemporal interactions critical for arrhythmia detection. TransMIL [35] employs self-attention for instance relationship modeling, achieving moderate results (F1: 66.60, mAP: 79.97), yet its patch-level attention restricts global temporal pattern recognition. While CMM [36] and MAMIL [37] integrate multi-scale features and attention fusion, their focus on temporal modeling neglects spatial dependencies between ECG leads, resulting in suboptimal performance. Notably, MAMIL [37] introduces additional complexity and limits generalization by fusing ECG signals with spectrogram images, compared with our purely electrophysiological approach.
Importantly, multi-label classification is inherently challenging: a prediction is considered correct only if all abnormal rhythm labels are accurately identified, meaning that both false negatives and false positives are penalized. In this context, the 2–4% improvement achieved by our method over prior approaches is clinically meaningful. These above comparative analyses substantiate the efficacy of our design innovations, particularly in spatiotemporal modeling and feature fusion optimization.
To assess the model's capability in handling rare and severe arrhythmias, Table 3 presents class-wise performance metrics (F1, AUC, and Recall) and compares our method with CMM [36]. Owing to the patient-specific distribution of certain minority classes (e.g., Q and R), these categories were nearly absent from the test set under the inter-patient split, rendering their evaluation statistically unreliable; thus, they are not reported. Despite the inherent challenges posed by extremely rare classes, our model consistently achieved superior performance on class V and showed remarkable improvement on class A, clearly demonstrating that the proposed method not only improves the recognition of minority arrhythmias but also achieves overall gains driven by advances in both normal and abnormal classes.

5.2. Ablation Experiments

To evaluate the contribution of each module to the model’s overall performance, we conducted multiple ablation experiments in inter-patient mode and analyzed the roles of key components. The experimental results are presented in Table 4, Table 5 and Table 6.

5.2.1. Ablation of MIL and Statistical Feature

After removal of the MIL module, all evaluation metrics inevitably decreased; in particular, mAP dropped by 8.72%. These results (Table 4) show that the MIL module plays a crucial role in alleviating class imbalance and improving minority-class recognition: MIL improves the model's ability to predict minority classes by focusing more on key instances. The experimental results also showed that the F1-score and AUC of the model decreased by 1.88% and 2.41%, respectively, after the removal of statistical features, indicating that the role of statistical features in extracting global patterns and enhancing the understanding of complex ECG sequences cannot be ignored. Compared with relying solely on the raw ECG signals, statistical features provide additional global information to the model, especially in the analysis of ECG data over long periods of time.

5.2.2. Ablation of Multi-Scale

The results (Table 5) revealed a consistent performance decline whenever a specific scale of data was removed. Notably, data with larger window sizes had a more pronounced impact on the overall effectiveness of the model, highlighting the critical role of larger-scale data in capturing broader temporal patterns, while smaller-scale data complements this by focusing on finer details. Moreover, the complementary nature of multi-scale information was evident: combining data from different scales enhanced feature diversity and robustness, enabling the model to capture a wider range of ECG features. These results validate the necessity of the multi-scale strategy for a comprehensive representation of ECG signals, ultimately improving both the accuracy and robustness of the model.

5.2.3. Ablation of Feature Extraction Modules

The results (Table 6) indicate that neither GTCN nor DGCN alone can achieve optimal performance, as removing either of them leads to a significant decline in the F1, AUC, and mAP scores. This validates the importance of combining temporal and spatial information, achieved through the integration of GTCN and DGCN in the T-DGCN module, to enhance the model's spatiotemporal feature representation. Further analysis demonstrates the dominant role of spatial modeling via DGCN: removing DGCN causes a drastic F1-score decline (73.66% → 62.39%), significantly exceeding the 1.92% drop from removing GTCN. This aligns with the electrophysiological principle that arrhythmias such as ventricular fibrillation manifest as coordinated spatial anomalies across leads. DGCN's dynamic adjacency matrix adaptively captures such lead-wise interactions, whereas GTCN primarily augments temporal resolution for local waveform delineation.

5.3. Serial vs. Parallel Connections of GTCN and DGCN

The serial architecture (GTCN→DGCN) outperforms parallel connections by hierarchically modeling ECG spatiotemporal dependencies (Figure 6). This aligns with the electrophysiological hierarchy of the ECG: temporal anomalies precede spatially coordinated manifestations. GTCN first extracts high-semantic temporal patterns, which serve as node features for DGCN to dynamically capture lead-wise spatial correlations. This design avoids the parallel structure's inherent conflicts, where DGCN struggles to jointly optimize raw signal processing and spatial adjacency learning, resulting in redundant features and suboptimal convergence. Notably, the parallel configuration achieves marginally higher Recall (74.18 vs. 73.92), because its independent temporal-spatial pathways preserve more potential positives. However, this comes at the cost of increased false positives, reflected in markedly lower F1 and mAP scores. In contrast, the serial structure's staged filtering suppresses noise-induced false detections while maintaining true positives through hierarchical abstraction, achieving a superior precision-recall balance.

5.4. Comparison of Fusion Methods

Our orthogonal gated fusion (OGF) method demonstrates superior performance across all metrics compared with existing fusion strategies (Table 7). Concatenation achieves moderate performance (F1: 69.07) but fails to mitigate feature redundancy, as overlapping spatiotemporal patterns inflate feature dimensions, degrading model discriminability. Traditional attention reduces redundancy via global pooling but inadequately models multi-scale interactions, resulting in a suboptimal AUC (69.02). EMA [38] employs grouped pooling for computational efficiency, yet its fragmented feature decomposition sacrifices global temporal coherence, limiting Recall (70.01). EAA [39] dynamically weights multi-scale features through query-key projections but lacks orthogonality constraints, allowing redundant features to dominate the fusion outputs, as evidenced by its lower AUC (62.58). In contrast, OGF uniquely integrates orthogonalization and adaptive gating to resolve these limitations: orthogonalization effectively removes overlapping features while preserving complementary patterns, and the gating mechanism dynamically adjusts the importance of features across scales, improving Recall (73.92) without sacrificing precision (F1: +2.81 over EAA). The experimental results demonstrate the effectiveness of our feature fusion method.

5.5. Lead Importance and Multi-Scale Connectivity Patterns

To improve the interpretability of MST-DGCN, we conducted a lead importance analysis based on the learned adjacency matrices. Each adjacency matrix captures pairwise dependencies between ECG leads at a specific time scale, revealing which lead relationships the network leverages for prediction. As shown in Figure 7, distinct scale-specific connectivity patterns emerge: the model consistently emphasizes the (V1, aVR) pair at Scale-1, (V1, V2) and (V2, V6) at Scale-2, and (V4, III) at Scale-3.
The emphasis on (V1, aVR) is physiologically meaningful: V1 is highly sensitive to atrial activity and right ventricular conduction [40], while aVR provides diagnostic information for inferior myocardial infarction, right ventricular hypertrophy, and arrhythmias [41]. The pairs (V1, V2) and (V2, V6) reflect septal and lateral ventricular depolarization, respectively, which are critical for distinguishing anterior/septal myocardial infarction and characteristic left bundle branch block patterns [42,43]. The (V4, III) pair aligns with clinical knowledge, as V3–V4 detect anterior ischemia, while inferior leads (II, III, aVF) identify inferior wall infarctions. Joint attention to anterior (V4) and inferior (III) leads corresponds with prior reports on multi-territory infarction diagnosis and improves the sensitivity of automated myocardial infarction detection [44,45].
In summary, these findings suggest that MST-DGCN does not operate as a mere “black box,” but captures physiologically plausible inter-lead dependencies consistent with established electrocardiographic knowledge. This interpretability strengthens clinical trust and provides additional insight into the model’s decision-making process.

6. Conclusions

This study proposes MST-DGCN, a multi-scale spatiotemporal framework designed to address three critical challenges in ECG multi-label arrhythmia classification, namely modeling inter-lead spatial dependencies, suppressing redundant multi-scale features, and mitigating class-imbalance-induced bias. Two methodological advancements drive its success. First, the T-DGCN module integrates gated temporal convolution (GTCN) and dynamic graph convolution (DGCN) within a hierarchical architecture. This design facilitates the joint learning of cardiac rhythm evolution (temporal) and lead topological interactions (spatial). Second, the orthogonal gated fusion (OGF) strategy is adopted to eliminate redundancy among multi-scale features while adaptively recalibrating cross-scale feature contributions, thereby optimizing fusion representation. Experimental results demonstrate that MST-DGCN achieves competitive performance under inter-patient evaluation protocols on the INCART dataset. Considering resource constraints and real-time requirements in clinical applications, future work may explore lightweight model variants and hardware–acceleration optimizations to enable efficient deployment on embedded devices.

Author Contributions

Methodology, software, writing—original draft preparation, J.C.; supervision, project administration, funding acquisition, writing—review and editing, M.J.; writing—review and editing, validation, X.H.; formal analysis, investigation, Y.L.; formal analysis, resources, investigation, J.Z.; conceptualization, formal analysis, J.L.; investigation, resources, Y.W.; visualization, writing—review and editing, W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Key Research and Development Program of China (2023YFE0205600), the National Natural Science Foundation of China (62272415), the Key Research and Development Program of Zhejiang Province (2023C01041), the Key Research and Development Program of Ningxia Province (2023BEG02065), and the Science and Technology Development Fund, Macau SAR, China (No. 0122/2023/AMJ).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Patient consent was not required because this study used only publicly available ECG datasets (St Petersburg INCART 12-lead Arrhythmia Database).

Data Availability Statement

The St Petersburg INCART 12-lead Arrhythmia Database is used in this work and is publicly available at: https://physionet.org/content/incartdb/1.0.0/ (accessed on 6 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest; the research was carried out in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MST-DGCN   Multi-Scale Temporal–Dynamic Graph Convolutional Network
T-DGCN     Temporal–Dynamic Graph Convolutional Network
OGF        Orthogonal gated fusion
MIL        Multiple instance learning
TCN        Temporal Convolutional Network
GTCN       Gated Temporal Convolutional Network

References

  1. Nasef, D.; Nasef, D.; Basco, K.J.; Singh, A.; Hartnett, C.; Ruane, M.; Tagliarino, J.; Nizich, M.; Toma, M. Clinical Applicability of Machine Learning Models for Binary and Multi-Class Electrocardiogram Classification. AI 2025, 6, 59. [Google Scholar] [CrossRef]
  2. Jiang, M.; Bian, F.; Zhang, J.; Huang, T.; Xia, L.; Chu, Y.; Wang, Z.; Jiang, J. Myocardial infarction detection method based on the continuous T-wave area feature and multi-lead-fusion deep features. Physiol. Meas. 2024, 45, 055017. [Google Scholar] [CrossRef]
  3. Din, S.; Qaraqe, M.; Mourad, O.; Qaraqe, K.; Serpedin, E. ECG-based cardiac arrhythmias detection through ensemble learning and fusion of deep spatial–temporal and long-range dependency features. Artif. Intell. Med. 2024, 150, 102818. [Google Scholar] [CrossRef] [PubMed]
  4. El-Ghaish, H.; Eldele, E. ECGTransForm: Empowering adaptive ECG arrhythmia classification framework with bidirectional transformer. Biomed. Signal Process. Control 2024, 89, 105714. [Google Scholar] [CrossRef]
  5. Sun, L.; Li, C.; Ren, Y.; Zhang, Y. A Multitask Dynamic Graph Attention Autoencoder for Imbalanced Multilabel Time Series Classification. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11829–11842. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, H.; Liu, W.; Chang, S.; Wang, H.; He, J.; Huang, Q. ST-ReGE: A novel spatial-temporal residual graph convolutional network for CVD. IEEE J. Biomed. Health Inform. 2023, 28, 216–227. [Google Scholar] [CrossRef]
  7. Tao, R.; Wang, L.; Xiong, Y.; Zeng, Y.R. IM-ECG: An interpretable framework for arrhythmia detection using multi-lead ECG. Expert Syst. Appl. 2024, 237, 121497. [Google Scholar] [CrossRef]
  8. Braman, N.; Gordon, J.W.; Goossens, E.T.; Willis, C.; Stumpe, M.C.; Venkataraman, J. Deep orthogonal fusion: Multimodal prognostic biomarker discovery integrating radiology, pathology, genomic, and clinical data. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, Proceedings of the 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part V 24; Springer International Publishing: Cham, Switzerland, 2021. [Google Scholar]
  9. Zhu, Y.; Jiang, M.; He, X.; Li, Y.; Li, J.; Mao, J.; Ke, W. MMDN: Arrhythmia detection using multi-scale multi-view dual-branch fusion network. Biomed. Signal Process. Control 2024, 96, 106468. [Google Scholar] [CrossRef]
  10. Khalaf, A.J.; Mohammed, S.J. Verification and comparison of MIT-BIH arrhythmia database based on number of beats. Int. J. Electr. Comput. Eng. 2021, 11, 4950. [Google Scholar] [CrossRef]
  11. Wagner, P.; Strodthoff, N.; Bousseljot, R.D.; Kreiseler, D.; Lunze, F.I.; Samek, W.; Schaeffter, T. PTB-XL, a large publicly available electrocardiography dataset. Sci. Data 2020, 7, 1–15. [Google Scholar] [CrossRef]
  12. Ge, Z.; Jiang, X.; Tong, Z.; Feng, P.; Zhou, B.; Xu, M.; Wang, Z.; Pang, Y. Multi-label correlation guided feature fusion network for abnormal ECG diagnosis. Knowl.-Based Syst. 2021, 233, 107508. [Google Scholar] [CrossRef]
  13. Ansari, Y.; Mourad, O.; Qaraqe, K.; Serpedin, E. Deep learning for ECG Arrhythmia detection and classification: An overview of progress for period 2017–2023. Front. Physiol. 2023, 14, 1246746. [Google Scholar] [CrossRef]
  14. Acharya, U.R.; Oh, S.L.; Hagiwara, Y.; Tan, J.H.; Adam, M.; Gertych, A.; San Tan, R. A deep convolutional neural network model to classify heartbeats. Comput. Biol. Med. 2017, 89, 389–396. [Google Scholar] [CrossRef]
  15. Cheng, J.; Zou, Q.; Zhao, Y. ECG signal classification based on deep CNN and BiLSTM. BMC Med. Inform. Decis. Mak. 2021, 21, 365. [Google Scholar] [CrossRef]
  16. Mantravadi, A.; Saini, S.; Mittal, S.; Shah, S.; Devi, S.; Singhal, R. CLINet: A novel deep learning network for ECG signal classification. J. Electrocardiol. 2024, 83, 41–48. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, Z.; Xie, H.; Liu, Y.; Zhu, H.; Zhang, H.; Wang, Z.; Pan, Y. MrSeNet: Electrocardiogram signal denoising based on multi-resolution residual attention network. J. Electrocardiol. 2025, 89, 153858. [Google Scholar] [CrossRef] [PubMed]
  18. Wei, L.; Li, Y. A multi-scale CNN-Transformer parallel network for 12-lead ECG signal classification. Signal Image Video Process. 2025, 19, 611. [Google Scholar] [CrossRef]
  19. Ranipa, K.; Zhu, W.-P.; Swamy, M.N.S. A novel feature-level fusion scheme with multimodal attention CNN for heart sound classification. Comput. Methods Programs Biomed. 2024, 248, 108122. [Google Scholar] [CrossRef] [PubMed]
  20. Zhang, H.; Zhang, P.; Wang, Z.; Chao, L.; Chen, Y.; Li, Q. Multi-feature decision fusion network for heart sound abnormality detection and classification. IEEE J. Biomed. Health Inform. 2023, 28, 1386–1397. [Google Scholar] [CrossRef]
  21. Maithani, A.; Verma, G. Hybrid model with improved score level fusion for heart disease classification. Multimed. Tools Appl. 2024, 83, 54951–54987. [Google Scholar] [CrossRef]
  22. Huang, Y.; Liu, W.; Yin, Z.; Hu, S.; Wang, M.; Cai, W. ECG classification based on guided attention mechanism. Comput. Methods Programs Biomed. 2024, 257, 108454. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Ding, K.; Wang, X.; Liu, Y.; Shi, B. Multimodal Emotion Reasoning Based on Multidimensional Orthogonal Fusion. In Proceedings of the 2024 3rd International Conference on Image Processing and Media Computing (ICIPMC), Hefei, China, 17–19 May 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  24. Chang, P.-C.; Chen, Y.-S.; Lee, C.-H. IIOF: Intra-and Inter-feature orthogonal fusion of local and global features for music emotion recognition. Pattern Recognit. 2024, 148, 110200. [Google Scholar] [CrossRef]
  25. Petelin, G.; Cenikj, G.; Eftimov, T. Towards understanding the importance of time-series features in automated algorithm performance prediction. Expert Syst. Appl. 2023, 213, 119023. [Google Scholar] [CrossRef]
  26. Zhu, H.; Wang, L.; Shen, N.; Wu, Y.; Feng, S.; Xu, Y.; Chen, C.; Chen, W. MS-HNN: Multi-scale hierarchical neural network with squeeze and excitation block for neonatal sleep staging using a single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 2195–2204. [Google Scholar] [CrossRef] [PubMed]
  27. Wang, Z.; Yan, W.; Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  28. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  29. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Brookline, MA, USA, 2019. [Google Scholar]
  30. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  32. Xu, Z.; Liu, H. ECG heartbeat classification using convolutional neural networks. IEEE Access 2020, 8, 8614–8619. [Google Scholar] [CrossRef]
  33. Kuznetsov, V.V.; Moskalenko, V.A.; Zolotykh, N.Y. Electrocardiogram generation and feature extraction using a variational autoencoder. arXiv 2020, arXiv:2002.00254. [Google Scholar] [CrossRef]
  34. Feng, Y.; Vigmond, E. Deep multi-label multi-instance classification on 12-lead ECG. In Proceedings of the 2020 Computing in Cardiology, Rimini, Italy, 13–16 September 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
  35. Shao, Z.; Bian, H.; Chen, Y.; Wang, Y.; Zhang, J.; Ji, X. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Adv. Neural Inf. Process. Syst. 2021, 34, 2136–2147. [Google Scholar]
  36. Chen, L.; Lian, C.; Zeng, Z.; Xu, B.; Su, Y. Cross-modal multiscale multi-instance learning for long-term ECG classification. Inf. Sci. 2023, 643, 119230. [Google Scholar] [CrossRef]
  37. Han, H.; Lian, C.; Zeng, Z.; Xu, B.; Zang, J.; Xue, C. Multimodal multi-instance learning for long-term ECG classification. Knowl.-Based Syst. 2023, 270, 110555. [Google Scholar] [CrossRef]
  38. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  39. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]
  40. Strauss, D.G.; Selvester, R.H.; Wagner, G.S. Defining left bundle branch block in the era of cardiac resynchronization therapy. Am. J. Cardiol. 2011, 107, 927–934. [Google Scholar] [CrossRef]
  41. Wang, A.; Singh, V.; Duan, Y.; Su, X.; Su, H.; Zhang, M.; Cao, Y. Prognostic implications of ST-segment elevation in lead aVR in patients with acute coronary syndrome: A meta-analysis. Ann. Noninvasive Electrocardiol. 2021, 26, e12811. [Google Scholar] [CrossRef] [PubMed]
  42. Meyers, H.P.; Bracey, A.; Lee, D.; Lichtenheld, A.; Li, W.J.; Singer, D.D.; Rollins, Z.; Kane, J.A.; Dodd, K.W.; Meyers, K.E.; et al. Ischemic ST-segment depression maximal in V1–V4 (versus V5–V6) of any amplitude is specific for occlusion myocardial infarction (versus nonocclusive ischemia). J. Am. Heart Assoc. 2021, 10, e022866. [Google Scholar] [CrossRef] [PubMed]
  43. Yamamoto, T.; Nambu, Y.; Bo, R.; Morichi, S.; Yanagiya, M.; Matsuo, M.; Awano, H. Electrocardiographic R wave amplitude in V6 lead as a predictive marker of cardiac dysfunction in Duchenne muscular dystrophy. J. Cardiol. 2023, 82, 363–370. [Google Scholar] [CrossRef] [PubMed]
  44. Xiong, P.; Lee, S.M.Y.; Chan, G. Deep learning for detecting and locating myocardial infarction by electrocardiogram: A literature review. Front. Cardiovasc. Med. 2022, 9, 860032. [Google Scholar] [CrossRef]
  45. Katsushika, S.; Kodera, S.; Sawano, S.; Shinohara, H.; Setoguchi, N.; Tanabe, K.; Higashikuni, Y.; Takeda, N.; Fujiu, K.; Daimon, M.; et al. An explainable artificial intelligence-enabled electrocardiogram analysis model for the classification of reduced left ventricular function. Eur. Heart J.-Digit. Health 2023, 4, 254–264. [Google Scholar] [CrossRef]
Figure 1. Overview of MST-DGCN.
Figure 2. The instantiation process of MIL.
Figure 3. Multi-label reassignment in MIL.
Figure 4. Architecture of the spatiotemporal feature extraction module.
Figure 5. Orthogonal gated multi-scale feature fusion module.
Figure 6. Serial vs. parallel connections of GTCN and DGCN.
Figure 7. Multi-scale adjacency matrices learned by MST-DGCN across training stages, showing scale-specific lead-to-lead connectivity patterns.
Table 1. Distribution of train and test sets in the St. Petersburg INCART Arrhythmia Dataset after multi-instance learning (MIL) processing (inter-patient splitting).

Label | Train | Test | Total
N | 480 | 344 | 824
V | 960 | 508 | 1468
A | / | 2 | 2
F | 7 | 7 | 14
Q | 210 | 1 | 211
R | 41 | / | 41
V, Q | 109 | 5 | 114
V, F | 58 | 60 | 118
V, R | 12 | / | 12
Q, F | 3 | 1 | 4
V, A | 2 | 2 | 4
Q, R | 1 | / | 1
V, Q, F | 4 | 1 | 5
Total | 1887 | 931 | 2818
Table 2. Comparison with previous work on the St. Petersburg INCART Arrhythmia Dataset (best in bold).

Method | F1 | AUC | Recall | mAP
Linear SVM | 15.88 | 49.91 | 18.79 | 23.42
Random Forest | 17.27 | 54.45 | 20.57 | 21.91
FCN_wang [27] | 57.95 | 55.15 | 57.50 | 73.19
ResNet1d_wang [27] | 57.90 | 59.21 | 58.33 | 74.29
ResNet1d50 [28] | 58.94 | 56.38 | 60.98 | 75.10
ResNet1d101 [28] | 52.49 | 62.87 | 55.30 | 73.00
EfficientNet [29] | 54.47 | 56.52 | 54.53 | 71.71
InceptionTime [30] | 57.30 | 58.34 | 58.66 | 73.67
ViT [31] | 63.10 | 65.79 | 61.68 | 77.43
CNN_xu [32] | 58.55 | 66.28 | 59.24 | 75.31
VAE [33] | 59.60 | 57.55 | 61.62 | 75.50
MIC [34] | 57.41 | 66.31 | 56.85 | 76.47
TransMIL [35] | 66.60 | 67.73 | 66.67 | 79.97
CMM [36] | 67.31 | 68.33 | 66.39 | 79.30
MAMIL [37] | 69.42 | 69.74 | 70.57 | 82.58
MST-DGCN (Ours) | 73.66 (↑4.24) | 70.92 (↑1.18) | 73.92 (↑3.35) | 85.24 (↑2.66)
Table 3. Classification performance metrics for minority arrhythmia classes.

Class | F1 (CMM [36]) | F1 (Ours) | AUC (CMM [36]) | AUC (Ours) | Recall (CMM [36]) | Recall (Ours)
N | 70.12 | 74.54 | 89.33 | 93.14 | 68.25 | 72.40
V | 79.30 | 82.54 | 85.24 | 87.39 | 81.11 | 86.46
A | 24.68 | 60.79 | 76.37 | 88.03 | 14.60 | 53.14
F | 0.00 | 20.00 | 67.97 | 75.12 | 0.00 | 12.86
Table 4. Ablation of MIL and the statistical modality on the St. Petersburg INCART Arrhythmia Dataset (best in bold).

Method | F1 | AUC | Recall | mAP
w/o MIL | 60.96 | 60.38 | 63.15 | 76.52
w/o statistical modality | 71.78 | 68.51 | 72.47 | 83.44
Ours | 73.66 | 70.92 | 73.92 | 85.24
Table 5. Ablation of the multi-scale strategy on the St. Petersburg INCART Arrhythmia Dataset (best in bold).

Method | F1 | AUC | Recall | mAP
w/o 180-sample-point scale | 64.34 | 58.45 | 63.21 | 79.42
w/o 90-sample-point scale | 66.09 | 65.27 | 65.66 | 80.85
w/o 45-sample-point scale | 67.88 | 64.34 | 67.17 | 81.91
Ours | 73.66 | 70.92 | 73.92 | 85.24
Table 6. Ablation of feature extraction modules on the St. Petersburg INCART Arrhythmia Dataset (best in bold).

Configuration | F1 | AUC | Recall | mAP
w/o DGCN (GTCN + ResNet) | 62.39 | 61.36 | 63.21 | 78.15
ResNet only | 64.16 | 64.36 | 64.38 | 79.64
w/o GTCN (DGCN + ResNet) | 71.74 | 70.54 | 71.35 | 84.72
w/o ResNet (GTCN + DGCN) | 70.39 | 67.75 | 72.41 | 83.36
GTCN + DGCN + ResNet (Ours) | 73.66 | 70.92 | 73.92 | 85.24
Table 7. Comparison of feature fusion methods in inter-patient mode on the St. Petersburg INCART Arrhythmia Dataset (best in bold).

Method | F1 | AUC | Recall | mAP
Concat | 69.07 | 65.50 | 70.18 | 82.46
Traditional Attention | 65.85 | 69.02 | 67.50 | 81.08
EMA [38] | 70.31 | 63.33 | 70.01 | 83.28
EAA [39] | 70.85 | 62.58 | 72.08 | 85.19
OGF (Ours) | 73.66 | 70.92 | 73.92 | 85.24