Deep Learning-Based ECG Arrhythmia Classification: A Systematic Review

Abstract: Deep learning (DL) has been introduced in automatic heart-abnormality classification using ECG signals, while its application in practical medical procedures is limited. A systematic review is performed from the perspectives of the ECG database, preprocessing, DL methodology, evaluation paradigm, performance metric, and code availability to identify research trends, challenges, and opportunities for DL-based ECG arrhythmia classification. Specifically, 368 studies meeting the eligibility criteria are included. A total of 223 (61%) studies use the MIT-BIH Arrhythmia Database to design DL models. A total of 138 (38%) studies considered removing noise or artifacts in ECG signals, and 102 (28%) studies performed data augmentation to extend the minority arrhythmia categories. Convolutional neural networks are the dominant models (58.7%, 216) used in the reviewed studies, while a growing number of studies have integrated multiple DL structures in recent years. A total of 319 (86.7%) and 38 (10.3%) studies explicitly mention their evaluation paradigms, i.e., intra- and inter-patient paradigms, respectively, where notable performance degradation is observed in the inter-patient paradigm. Compared to the overall accuracy, the average F1 score, sensitivity, and precision are significantly lower in the selected studies. To implement DL-based ECG classification in real clinical scenarios, leveraging diverse ECG databases, designing advanced denoising and data augmentation techniques, integrating novel DL models, and deeper investigation of the inter-patient paradigm could be future research opportunities.


Introduction
Cardiovascular diseases (CVDs) are common chronic diseases that pose major threats to human health [1]. The electrocardiogram (ECG) is a noninvasive technique that records the fluctuation of the heart's bioelectric activity. The cyclical contractions and relaxations of the heart can be tracked by an ECG machine through electrodes placed on the patient's skin surface. Normal ECG signals consist of different types of waves, including the P wave, the QRS complex, and the T wave. The statistical and morphological characteristics of those ECG waves are important health indicators that could reveal symptoms of heart-related health issues [2]. For example, the absence of P waves and an irregular ventricular rate in ECG signals could relate to atrial fibrillation (AF) [3]. In daily medical routine, to identify heart abnormalities and provide effective treatment for those issues, cardiologists usually perform ECG screening for patients, which requires significant human effort and expensive medical procedures. Due to population aging, the number of patients with cardiovascular diseases is expected to increase sharply, which calls for efficient, accurate, and low-cost automatic ECG diagnosis [4]. In this review, we focus on the classification of heart arrhythmias, i.e., irregular heartbeats, which is a common medical procedure to identify CVDs.
Deep learning (DL) has shown remarkable success in medical diagnosis and has been exploited for automatic heart abnormality classification with ECG signals in recent years. DL models, which consist of multiple neural layers, learn the mapping from ECG features to their corresponding medical categories. The inference capability of the DL model is optimized by a training process with training datasets [5], where the neuron weights are adjusted to minimize the mismatch between the inferred and the ground-truth categories of the training data. Compared to traditional machine learning-based classification methods such as clustering and the support vector machine (SVM), DL-based ECG classification could better map the characteristics of ECG signals to their corresponding categories thanks to its powerful multi-level abstraction capability for feature extraction [6]. In this work, the studies that consider DL-based arrhythmia classification with ECG signals are reviewed. The diagnosis of each arrhythmia type is a distinct clinical problem for cardiologists in practice. However, from the perspective of the classification task with DL, the classification methods for arrhythmia categories share an identical context and goal, i.e., establishing an accurate mapping from ECG characteristics to the corresponding categories. Hence, this survey focuses on the current research status, challenges, and research opportunities for deep learning-based arrhythmia classification overall.
According to [7], clinical trials of artificial intelligence-enhanced ECG (AI-ECG) diagnosis have been conducted at the Mayo Clinic for the detection of various cardiovascular diseases, which has demonstrated the potential benefit of AI-ECG. However, the authors conclude that the implementation of AI-ECG diagnosis is still in its infancy. Hence, although DL techniques have proven their effectiveness for ECG classification in the research community, their application in practical clinical processes has been limited due to challenges from the perspectives of both DL techniques and ECG data. For example, in the inter-patient paradigm, DL models need to infer arrhythmia types based on ECG signals from patients who are not included in the training process, which is more challenging than the intra-patient paradigm, where the models can encounter the same patients during both the training and inference stages. Hence, as DL techniques rely significantly on the distribution of data in the feature space while ECG signals vary considerably from person to person, models trained on particular ECG datasets may not be applied reliably in practice. As many existing reviews [5,6,8] concentrate mainly on DL algorithms, we consider various factors across the whole DL workflow for ECG arrhythmia classification. Specifically, our major contributions are as follows.

• We perform a systematic review of DL-based arrhythmia classification with ECG signals from the perspectives of ECG database, preprocessing, DL methodology, evaluation paradigm, and performance metrics in the complete DL workflow, as well as the code availability of the reviewed studies;

• The trend of techniques in each perspective in recent years is analyzed to summarize the historical road map and illustrate possible future research directions;

• We present the detailed performance gap between ECG arrhythmia classification under the intra- and inter-patient paradigms.
To the best of our knowledge, there is no systematic review on the comparison of DL-based ECG arrhythmia classification under different evaluation paradigms, i.e., the intra-patient paradigm vs. the inter-patient paradigm. Most existing works consider the intra-patient paradigm, while investigation of the inter-patient paradigm is limited but more desirable in clinical applications; a thorough comparison between the two paradigms could shed light on future research opportunities.

Search Strategy
This systematic review of DL-based ECG arrhythmia diagnosis is based on a literature search of four major scholarly databases, i.e., Google Scholar, PubMed, Scopus, and the Digital Bibliography and Library Project, focusing on studies published until December 2022. As many studies do not explicitly mention their classification tasks, we first implement a coarse search to include more candidate studies and avoid overlooking studies on arrhythmia classification. Hence, the search keywords for the literature search are set as (Deep learning OR deep neural network OR convolutional neural network OR CNN OR recurrent neural network OR RNN OR LSTM) AND (ECG OR electrocardiogram).
The detailed paper search and refinement process is shown in Figure 1. A total of 3910 studies are obtained in the initial identification step. After removing the duplicates, 2265 unique studies remain. We then perform the refining process to extract studies that are more relevant to arrhythmia classification with DL. After the initial identification step, the obtained studies go through further screening and eligibility evaluation according to the inclusion and exclusion criteria. Specifically, 1694 studies are excluded by screening their titles and abstracts, and 203 papers are also removed based on the full-text assessment. The inclusion criterion requires that the studies be published in English and leverage DL to classify arrhythmia with ECG signals. Studies dealing with other tasks, such as emotion detection and drug and alcohol assessment, are removed. Studies that do not have the full text available are also excluded. Hence, a total of 368 studies were selected to be included in this review. The whole process is completed by two independent reviewers (QX and QXZ) in order and rechecked by PYL to ensure fair results of the paper search and refinement.
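The counts reported for the selection process can be checked with simple arithmetic. The short sketch below only restates the numbers given in the text; the variable names are illustrative and not taken from the review.

```python
# Study counts as reported in the text for each stage of the selection process.
identified = 3910                  # records returned by the initial keyword search
unique_after_dedup = 2265          # records left after removing duplicates
excluded_by_title_abstract = 1694  # removed when screening titles and abstracts
excluded_by_full_text = 203        # removed after the full-text assessment

duplicates_removed = identified - unique_after_dedup
after_screening = unique_after_dedup - excluded_by_title_abstract
included = after_screening - excluded_by_full_text

print(f"duplicates removed: {duplicates_removed}")   # 1645
print(f"left after screening: {after_screening}")    # 571
print(f"included in the review: {included}")         # 368
```

The final value matches the 368 studies stated in the text, so the reported exclusion counts are mutually consistent.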

Data Extraction
Table 1 summarizes the data items that are further extracted from the 368 selected studies. This review focuses on diverse aspects, including the general information, ECG database, preprocessing, DL methodology, evaluation paradigm, performance metric, and code availability. A detailed description of the extracted information from those aspects is as follows:
A. General Information: An overview of the origin of the selected studies, i.e., the conference proceedings or journals in which they are published and their publication years, is provided;

Table 1. Data items extracted from the selected studies.

General Information
Origin: Journal/conference where the articles were published.
Publication year: Years in which the selected studies were published.

Publication information: Source, release year, and whether the database is publicly available or not.

Evaluation Paradigm: Whether training and testing datasets contain ECG data from the same patients or not.

Performance Metrics: Metrics to evaluate the classification performance, e.g., F1, Sp.

Code Availability: Whether the code is shared online or not.

Results
The selected 368 studies consist of 290 journal papers and 78 conference papers focusing on DL-based arrhythmia classification, where 347 (94%) studies were published after the year 2017. Specifically, the number of published works in 2022 is nearly five times that in 2017 (increasing from 21 works in 2017 to 99 works in 2022), which indicates that the research interest in DL-based ECG arrhythmia classification has been growing significantly in recent years. The top three journals where the selected studies are published are Computers in Biology and Medicine (22 studies), Biomedical Signal Processing and Control (18 studies), and IEEE Access (18 studies). We provide the detailed information of the selected papers in Supplementary Table S1.

Database
DL models require a large amount of ECG signals as training data to learn the relation between ECG characteristics and the corresponding types of arrhythmias. However, ECG data are considered highly private and sensitive health information, which in general is difficult to collect from a large group of patients to form a comprehensive database. For ease of access and for the sake of fairly comparing the DL methods developed in existing works, the majority of selected studies (89%, 326 out of 368) have established and evaluated their DL models based on ECG datasets from open-source or publicly available databases such as the MIT-BIH Arrhythmia Database (MITDB) [9] and the MIT-BIH Atrial Fibrillation Database (AFDB) [10]. Table 2 presents the ECG datasets used in the selected works, including their publication information, signal information, demographic information, and the number of selected works that use them for arrhythmia classification. As can be seen in Table 2, MITDB is the most popular, as about 61% (223 out of 368) of works use it for arrhythmia classification. Other popular databases used by more than ten selected works are the AFDB, PTB [11], PTB-XL [12], NSRDB, and INCART databases. In addition, most datasets contain 12-lead ECG signals, where ten electrodes are placed at different locations on the human body, such as V1 at the fourth intercostal space on the right of the sternum and RA (right arm) anywhere between the right shoulder and right elbow [13]. This results in 12-channel ECG signals, where the signals of the aVR (augmented vector right), aVL (augmented vector left), and aVF (augmented vector foot) channels are obtained from combinations of ECG signal measurements from other electrodes. Multichannel ECG signals could capture additional heart status information based on a greater number of simultaneous measurements. Furthermore, the sampling rates of ECG signals range from 128 Hz to 1000 Hz, and about half of the ECG databases (8/17) have a sampling rate of 250 Hz, as can be seen in Table 2. Based on the signal duration, the ECG signals can be categorized into long- and short-term measurements, from 10 s to 2 h. Most databases provide gender and age information, and the numbers of females and males tested are generally balanced. Some datasets, such as MITDB, contain both normal and abnormal ECG signals, while most ECG signals in other datasets, such as the INCART database, are from patients having ventricular ectopic beats. Compared to the widely used MIT-BIH series of databases collected in the USA, such as MITDB, NSRDB, and AFDB, more recent ECG databases such as PTB/PTB-XL and Chapman, collected in Germany and China, respectively, have emerged in the research community and are considered by an increasing number of studies for DL-based ECG arrhythmia classification.
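To make the statement about the augmented leads concrete, the sketch below derives leads III, aVR, aVL, and aVF from the two limb leads I and II. The relations used are the standard textbook (Einthoven/Goldberger) formulas rather than anything stated in the review, and the function name and sample values are purely illustrative.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Compute leads III, aVR, aVL, and aVF sample by sample from leads I and II.

    Uses the standard relations (an assumption, not taken from the review):
    III = II - I, aVR = -(I + II) / 2, aVL = I - II / 2, aVF = II - I / 2.
    """
    pairs = list(zip(lead_i, lead_ii))
    return {
        "III": [ii - i for i, ii in pairs],
        "aVR": [-(i + ii) / 2 for i, ii in pairs],
        "aVL": [i - ii / 2 for i, ii in pairs],
        "aVF": [ii - i / 2 for i, ii in pairs],
    }


# Hypothetical amplitude samples (in mV) for leads I and II.
lead_i = [0.0, 0.5, 1.0]
lead_ii = [0.0, 1.0, 1.5]
derived = derive_limb_leads(lead_i, lead_ii)
print(derived["aVF"])  # [0.0, 0.75, 1.0]
```

Because these four leads are linear combinations of leads I and II, they add no independent measurements, which is one reason some studies feed only a subset of the 12 channels to a model.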
Figure 2 shows the trend of major databases used by selected works each year from 2017 to 2022. One can observe that, every year, MITDB is still the dominant database in the research community. The proportion of studies that consider PTB/PTB-XL has been increasing in recent years. The diversity of databases has improved, as the number of databases used by more than ten studies increases from 6 in 2017 to 9 in 2022. Besides ECG datasets obtained specifically for arrhythmia classification, noisy but normal ECG records can be added to the training dataset to improve the robustness of the DL models. For example, the MIT-BIH Noise Stress Test Database, which is collected from physically active volunteers to mimic ambulatory ECG, acts as a category of noisy ECG signals [15-17]. In this way, the real situation in clinical practice can be emulated.
Among the selected works, 165 (45%) studies consider more than one ECG dataset by combining multiple ECG databases. For example, [18] exploits five public ECG datasets, i.e., AFDB, MITDB, NSRDB, the 2017 PhysioNet/CinC Challenge Database, and the first China Physiological Signal Challenge 2018 Database (CPSC2018), where AFDB is used for training and evaluation while the other four datasets are used to test the generalization performance of the proposed DL model. This mechanism of training and testing DL models with ECG signals from two non-overlapping groups of patients is a typical case of inter-patient diagnosis. However, as those datasets have different attributes, such as categories and numbers of channels, a smaller number of classification categories is often considered, such as atrial fibrillation (AF)/non-AF, or AF, Normal, Premature Atrial Contractions (PAC), Premature Ventricular Contractions (PVC), Ventricular Fibrillation (VF), and Noise [19,20]. In [21], ECG signals from MITDB, MIT-BIH AFDB, CUDB, and MIT-BIH VFDB are fused to form one dataset, where the training and testing datasets are obtained by randomly selecting ECG data from the combined dataset. Hence, intra-patient diagnosis is performed, where the DL model may be trained and tested on ECG information from the same patient. By mixing multiple ECG datasets, the issue of imbalance in data categories can also be alleviated [22]. Regardless of inter- or intra-patient diagnosis, there has been a clear trend over the last few years that an increasing number of studies exploit combined ECG datasets for DL-based ECG arrhythmia analysis [18,21,23-26].
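The difference between the two evaluation paradigms comes down to how records are partitioned. The following minimal sketch, assuming a toy dataset of hypothetical patient IDs and beat indices, contrasts a random beat-level split (intra-patient) with a split that holds out whole patients (inter-patient). The function names and the 20% test fraction are illustrative choices, not settings from any reviewed study.

```python
import random


def intra_patient_split(records, test_fraction=0.2, seed=0):
    """Shuffle individual beats, so one patient can appear in both sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def inter_patient_split(records, test_fraction=0.2, seed=0):
    """Hold out whole patients, so no patient appears in both sets."""
    rng = random.Random(seed)
    patients = sorted({r["patient"] for r in records})
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r["patient"] not in test_patients]
    test = [r for r in records if r["patient"] in test_patients]
    return train, test


def shared_patients(train, test):
    """Patients whose beats occur in both the training and the test set."""
    return {r["patient"] for r in train} & {r["patient"] for r in test}


# Toy dataset: 10 hypothetical patients with 20 beats each.
records = [{"patient": p, "beat": b} for p in range(10) for b in range(20)]
intra_train, intra_test = intra_patient_split(records)
inter_train, inter_test = inter_patient_split(records)

print("patients shared (intra):", len(shared_patients(intra_train, intra_test)))
print("patients shared (inter):", len(shared_patients(inter_train, inter_test)))
```

With the patient-level split, the overlap is empty by construction, which is the property that makes inter-patient results a better proxy for performance on unseen patients.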

Preprocessing
Before inputting ECG signals into DL models, a preprocessing step is often applied to those signals, which could improve the learning efficiency and reduce the computational complexity of DL models [27]. In this review, the preprocessing step is reviewed from two aspects, i.e., denoising [28] and data augmentation [29]. The two deal with noisy ECG signals and imbalanced datasets, respectively, which are common cases in real clinical scenarios.

Denoising
ECG signals are prone to contamination by background noise and bioelectrical interference, such as power-line noise and muscle movement. The denoising step cleans the ECG signals so that noise does not overwhelm fine details in the signals, which helps DL models focus on the ECG features [30]. Only about 38% of the selected works (138 out of 368) specifically mentioned their denoising methods, and those methods can be mainly categorized into three types, i.e., traditional filter-based denoising methods (44.9%, 62 out of 138), wavelet-based denoising methods (38.4%, 53 out of 138), and hybrid denoising methods (16.7%, 23 out of 138). The traditional denoising filters, such as lowpass, bandpass, and notch filters, assume that the noise and the useful signals lie in different frequency bands. Other denoising filters include smoothing filters, such as the median filter and the Savitzky-Golay (S-G) filter [23,31-33], and adaptive filters [34,35]. The discrete wavelet transform (DWT) projects ECG signals onto the time-frequency domain based on wavelet basis functions [36]. To remove the noise, the wavelet coefficients at high-frequency bands can simply be set to zero, or a thresholding process can be applied to set the small wavelet coefficients to zero [19-22], based on the assumption that the useful ECG signal is similar to the selected wavelet basis function. A combination of different types of denoising methods can also be applied for noise removal, e.g., [37,38] combine DWT, median filters, or S-G filters for denoising. However, this type of method induces higher processing latency.
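Two of the denoising families described above can be illustrated in a few lines. The sketch below is a simplified illustration using only the Python standard library: a sliding median filter as an example of a smoothing filter, and a single-level Haar wavelet transform with hard thresholding of the detail coefficients as an example of wavelet-based denoising. The Haar wavelet, the single decomposition level, and the threshold values are assumptions made for brevity; the reviewed studies typically use other wavelets and several levels.

```python
import math
import statistics


def median_filter(signal, width=3):
    """Replace each sample with the median of a sliding window (odd width)."""
    half = width // 2
    return [statistics.median(signal[max(0, k - half): k + half + 1])
            for k in range(len(signal))]


def haar_dwt(signal):
    """Single-level Haar transform of an even-length signal."""
    s = math.sqrt(2.0)
    approx = [(signal[k] + signal[k + 1]) / s for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) / s for k in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def wavelet_denoise(signal, threshold):
    """Zero the high-frequency (detail) coefficients whose magnitude is at most threshold."""
    approx, detail = haar_dwt(signal)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_idwt(approx, detail)


# An isolated spike is removed by the median filter.
spiky = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
smoothed = median_filter(spiky, width=3)

# Small sample-to-sample fluctuations are averaged out by thresholding.
noisy = [1.0, 1.1, 2.0, 2.05]
denoised = wavelet_denoise(noisy, threshold=0.5)
unchanged = wavelet_denoise(noisy, threshold=0.0)
print(smoothed)
print(denoised)
```

A median filter of this kind would also flatten a genuine narrow QRS peak if the window were wide enough, which is one reason the choice of filter width or wavelet threshold matters in practice.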
The frequency counts of the three types of methods in each year are presented in Figure 3a. The number of works considering denoising has been increasing in recent years. The traditional filter-based methods are more popular than the other two types of denoising methods because they are effective yet easier to implement. In addition, an increasing number of works have considered wavelet-based methods for ECG signal denoising in recent years.

Data Augmentation
ECG datasets are often imbalanced, with far fewer samples in abnormal categories than in normal categories, as abnormal signals are more difficult to obtain. DL models trained with an imbalanced ECG dataset will naturally pay more attention to the majority categories and overlook the minority categories, leading to biased learning. In this survey, we focus on data augmentation technologies [39], which take effect during the data preprocessing step to gain more training samples. Of the selected studies, 102 (28%) explicitly claim the use of data augmentation techniques in their work. The augmentation techniques can be categorized into two types, i.e., perturbation-based methods (64%, 65 out of 102) and synthetic-based methods (36%, 37 out of 102). Specifically, for perturbation-based methods, extra data samples are added to the ECG dataset by adjusting or perturbing the original samples from the same dataset, such as scaling and shifting ECG waveforms [40] or adding artificial noise to existing ECG signals [41]. The perturbation of data samples essentially acquires new data samples from the neighborhood of the corresponding original data samples in the feature space. Hence, the new data samples could be highly correlated with the original samples from which they are derived. On the other hand, synthetic-based methods generate synthetic ECG data either based on the linear combination of real data samples or by constructing ECG signals that imitate real ECG features. The synthetic minority oversampling technique (SMOTE) and its variants, such as SMOTENN [42], Borderline SMOTE [43], and SVM-SMOTE [42-45], are often used to extend the minority categories. More recently, DL techniques have also been used for synthetic data generation, e.g., the convolutional neural style transfer network [46], the generative adversarial network (GAN) [47], and an ACGAN consisting of a variational auto-encoder model [14].
Figure 3b shows the frequency counts of the two types of augmentation methods each year. One can see that the synthetic-based strategies have drawn more attention in recent years, as the number of works considering this type of data augmentation method increased from 1 in 2017 to 17 in 2022.
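The two augmentation families can be sketched as follows. This is a minimal illustration, not the procedure of any reviewed study: `perturb` scales, circularly shifts, and adds Gaussian noise to an existing beat, while `smote_like` follows the core idea of SMOTE by interpolating between a minority sample and one of its nearest minority neighbours. All parameter values and the toy beats are hypothetical.

```python
import random


def perturb(beat, rng, scale_range=(0.9, 1.1), max_shift=2, noise_std=0.01):
    """Perturbation-based augmentation: scale, circularly shift, and add noise."""
    n = len(beat)
    scale = rng.uniform(*scale_range)
    shift = rng.randint(-max_shift, max_shift)
    return [scale * beat[(k - shift) % n] + rng.gauss(0.0, noise_std)
            for k in range(n)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority, rng, k=2):
    """Synthetic sample on the line segment between a minority beat and a near neighbour."""
    base_idx = rng.randrange(len(minority))
    base = minority[base_idx]
    others = [m for j, m in enumerate(minority) if j != base_idx]
    neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [b + lam * (o - b) for b, o in zip(base, neighbour)]


rng = random.Random(0)
# Three hypothetical minority-class beats, each with five samples.
minority_beats = [
    [0.0, 0.2, 1.0, 0.2, 0.0],
    [0.0, 0.3, 0.9, 0.1, 0.0],
    [0.1, 0.2, 1.1, 0.3, 0.0],
]
perturbed = perturb(minority_beats[0], rng)
synthetic = smote_like(minority_beats, rng)
print(perturbed)
print(synthetic)
```

The perturbed beat stays close to its source sample, whereas the interpolated beat lies between two real samples, which mirrors the distinction drawn in the text between sampling a neighbourhood and combining real samples.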

Model
The design of DL models is crucial to the pipeline of DL-based ECG arrhythmia classification. DL models have multi-level or multi-layer structures, and each level or layer can be regarded as a feature extractor that can learn how to better summarize signal characteristics [48]. Based on the intrinsic property of the major feature extractor within the neural networks, the DL classification models considered in the selected studies can mainly be categorized into the following types: convolutional neural networks (CNNs), recurrent neural networks (RNNs), including the long short-term memory (LSTM) and bidirectional LSTM (BiLSTM), transformer, "hybrid", which refers to combinations of different DL models, and "others", corresponding to less popular models such as restricted Boltzmann machines and deep-belief networks. The detailed analysis of those DL models for ECG arrhythmia classification is as follows.
• CNN
CNN is a DL model widely used in image classification, signal analysis, and natural language processing [48]. Each layer of a CNN usually contains a convolutional filter followed by pooling operations to extract both local and global features [49]. Depending on the number of filtering directions of the convolutional filters in the spatial domain, CNNs can be further categorized into 1D CNNs and 2D CNNs. Specifically, the filters in 1D and 2D CNNs move along one and two filtering directions, i.e., feature dimensions, respectively. In general, a 1D CNN is suitable for raw or denoised ECG signals, which have only one feature dimension. For instance, in [50], an adaptive 1D CNN is proposed for ECG classification and anomaly detection at any sampling rate of ECG signals to avoid hand-crafted feature extraction. In [51], a lightweight 1D CNN considering channel shuffle over the group and depth-wise convolutions is designed, where 2-s ECG signal segments are considered as model input [37]. In [38], a 1D CNN is leveraged to classify 2, 5, and 20 types of heart diseases, where few-shot learning is considered to deal with the small size of the dataset. On the other hand, a 2D CNN mainly takes image-like input, such as the spectrogram and scalogram of ECG signals. In [52], the 2D scalogram is obtained by transforming 1D ECG signals of 500 samples to the wavelet domain using the continuous wavelet transform. The 2D scalogram is then regarded as a 3-channel color image with a size of 227 × 227 in the spatial domain. A classic 2D CNN, i.e., AlexNet [53], is used to classify the ECG signals. In [54], the plot of 1D ECG recordings is directly transformed to 2D gray-level images with a size of 15 × 15, which are then fed as input to the 2D CNN. In [55], a multi-lead CNN takes multi-lead ECG as the matrix input, where sub-2D convolution and lead asymmetrical pooling are exploited to extract multi-scale features. Due to its simpler operation compared to 2D convolution, a 1D CNN often contains fewer learnable parameters and has higher computation speed, making it suitable for real-time ECG classification and often easier to deploy in hardware.

• RNN/LSTM/BiLSTM
Taking into account the temporal correlation of feature sequences, the RNN is a type of DL structure that treats the input as a time series. As ECG signals are time series in nature, the temporal correlation within the signals could potentially better reveal the sign of their categories. For typical RNNs, the information in the hidden layer at the current moment depends not only on the current input but also on the information at the previous time instance [56]. In this way, the RNN is more sensitive to the temporal features of the input sequence and is advantageous in capturing hidden temporal information in ECG features [56]. Furthermore, the improved RNN, i.e., the long short-term memory (LSTM), has gained higher popularity than the conventional RNN because of its higher capability to analyze time series. Specifically, the LSTM has three gate structures to control the output, input, and forget information flow in stored memory cells [57]. Compared to the RNN, the LSTM can deal with longer signal sequences as it selectively acquires useful information from historical inputs. In [58], a 6-layer LSTM is developed to automatically identify PVC beats based on ECG sequences. Furthermore, the bidirectional LSTM (BiLSTM) is a special type of LSTM consisting of two LSTMs that go through the input sequence along the temporal direction forward and backward, respectively [32]. Hence, it could capture both the causal and noncausal time dependency information of signals to pursue potentially better classification performance. In [59], the BiLSTM model is used for ECG classification based on ECG wave statistics extracted along the temporal dimension, including the RR interval, QR interval, ST segment starting point, and amplitudes of the Q and R waves. In [60], a 2D BiLSTM is used for AF detection based on the spectrogram of ECG signals, where the input features are the frequency components at each time instance. In [61], a BiLSTM taking the sequence of RR intervals as input is proposed for AF detection. To summarize, the input sequences for RNNs can be raw ECG sequences, time-varying wave statistics, and time-frequency representations of ECG.

• Transformer
The attention mechanism has gained popularity in recent DL research as it is capable of learning how to assign higher weights to significant features [62]. The transformer is an encoder-decoder structure that consists of only attention mechanisms and fully connected layers [63]. It was originally designed for natural language processing (NLP) but has been extended to other applications since it can achieve better performance than RNN/LSTM [64]. In [65], the encoder part of the transformer is used for heartbeat classification with ECG signals, where the heartbeat sequences are considered as input. In addition, RR intervals are concatenated with the features extracted by the attention module for final classification. In [66], a random forest model is first used to select 22 important features, such as the RR interval median and P-wave correlation coefficients. Then the encoder of the transformer is exploited to extract features directly from ECG signals. The combination of the hand-crafted features and the features automatically extracted by the transformer is used for ECG classification. A waveform transformer is proposed in [67], in which the input ECG segments are first projected to a 1D vector through a multi-layer perceptron. Then the embedded segments, together with positional embedding and learnable class embedding, are taken as the input for the transformer encoder. The features extracted by the transformer are combined with 22 static features for the final ECG classification. The transformer was developed in 2017, and its application to ECG signals is still at an early stage; however, more results with the transformer are expected in the future.

• Hybrid DL model
Many selected studies consider integrating multiple DL models into one DL network for ECG arrhythmia classification. For example, [68] combines the CNN and the RNN to form an encoder-decoder structure for heartbeat classification. The CNN is used for feature extraction, and RNNs are used to translate the extracted features to their corresponding categories. More examples of the combination of CNN with LSTM and BiLSTM can be seen in [69-71], where CNNs are stacked in front of LSTM/BiLSTM modules for feature extraction. In [63], a 1D CNN is first used to extract the features from ECG sequences. Then the CNN features are added to the positional encoding to serve as the input for a transformer that finally detects the ECG arrhythmia. In [72], a 1D CNN is exploited for local attention embedding, and the encoder of the transformer is used for further feature extraction. In [23], shallow-domain knowledge-injection attention is first used to extract the ECG signal features. Then the attention outputs from the original and smoothed ECG data are regarded as the multivariate input for the 2D classification CNN. More works that combine CNNs with transformers can be found in [35,73]. CNNs are also combined with attention mechanisms in [74-76]. Among the selected studies, 82 take advantage of hybrid models, which assemble different types of DL models to classify ECG arrhythmia. The top three hybrid models are CNN+LSTM (24 studies), CNN+BiLSTM (15 studies), and CNN+RNN (8 studies). In most hybrid models, the CNNs serve as feature extractors, followed by other models that perform further feature extraction.
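The most common hybrid arrangement described above, a CNN front end followed by an LSTM, can be sketched in plain Python. The example is a deliberately tiny, untrained illustration: one 1D convolution with ReLU and max pooling extracts local features, and a scalar LSTM cell with forget, input, and output gates processes the pooled feature sequence. The kernel, the gate weights, and the toy beat are arbitrary assumed values, and real models use many channels, hidden units, and learned parameters.

```python
import math


def conv1d_relu(signal, kernel):
    """Valid 1D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(signal[t + j] * kernel[j] for j in range(k)))
            for t in range(len(signal) - k + 1)]


def max_pool1d(features, size=2):
    """Non-overlapping max pooling."""
    return [max(features[t:t + size])
            for t in range(0, len(features) - size + 1, size)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, weights):
    """One step of a scalar LSTM cell; weights[name] = (input w, recurrent w, bias)."""
    def pre(name):
        w_x, w_h, b = weights[name]
        return w_x * x + w_h * h + b

    forget_gate = sigmoid(pre("forget"))
    input_gate = sigmoid(pre("input"))
    output_gate = sigmoid(pre("output"))
    candidate = math.tanh(pre("candidate"))
    c = forget_gate * c + input_gate * candidate  # update the memory cell
    h = output_gate * math.tanh(c)                # expose part of the memory
    return h, c


def cnn_lstm_score(signal, kernel, weights):
    """CNN feature extractor followed by an LSTM; returns a score in (0, 1)."""
    features = max_pool1d(conv1d_relu(signal, kernel))
    h, c = 0.0, 0.0
    for x in features:
        h, c = lstm_step(x, h, c, weights)
    return sigmoid(h)


# Toy beat with a sharp peak; kernel and LSTM weights are arbitrary, untrained values.
beat = [0.0, 0.0, 0.1, 0.2, 1.0, 0.2, 0.1, 0.0, 0.0, 0.0]
kernel = [-1.0, 0.0, 1.0]
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (1.0, 0.2, 0.0),
    "output": (0.8, 0.1, 0.0),
    "candidate": (1.5, 0.3, 0.0),
}
pooled = max_pool1d(conv1d_relu(beat, kernel))
score = cnn_lstm_score(beat, kernel, weights)
print(len(pooled), score)
```

The convolution and pooling shorten the sequence before it reaches the recurrent part, which is one practical motivation for placing the CNN in front of the LSTM.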
Figure 4a presents the proportions of different DL models used in the selected studies. Overall, the CNN (58.7%, 216 out of 368), RNN/LSTM/BiLSTM (9%, 33 out of 368), and hybrid (22.3%, 82 out of 368) models are the most popular DL models for arrhythmia classification. Each year, more selected works consider CNN models than any other type of model, but the number of works considering hybrid models has been increasing in recent years.

Optimizer
How the learnable weights of a DL model are optimized through backpropagation is another important control knob for classification performance. Figure 4b shows the yearly trend of the optimization techniques mentioned in the selected studies. Half of the studies (50%, 184 out of 368) did not explicitly report their optimization method. Among the remaining 184 studies, the three most frequently used optimizers are adaptive moment estimation (Adam) (66.8%, 123 out of 184), stochastic gradient descent (SGD)/SGD with momentum (SGDM) (12%, 22 out of 184), and root mean square propagation (RMSProp) (3.8%, 7 out of 184).
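For readers less familiar with these optimizers, the update rules of SGDM and Adam can be sketched on a single scalar weight. This is an illustrative sketch only: the toy loss (w - 3)^2, the learning rates, and the step count are assumptions, and the default hyperparameters shown are the commonly quoted ones rather than values reported by the reviewed studies.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # t is the 1-based step index
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

# Minimize the toy loss with Adam.
w_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    w_adam, m, v = adam_step(w_adam, grad(w_adam), m, v, t, lr=0.1)

# Minimize the same loss with SGD with momentum.
w_sgdm, vel = 0.0, 0.0
for _ in range(1000):
    w_sgdm, vel = sgd_momentum_step(w_sgdm, grad(w_sgdm), vel, lr=0.01)
```

Note that the first Adam step has magnitude close to the learning rate regardless of the gradient scale, which is one reason Adam is often less sensitive to the choice of learning rate than plain SGD.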

Classification Categories
Out of the 368 selected studies, 118 (32%) categorized ECG signals into five classes. This large proportion stems from the fact that most studies use MITDB as their ECG database, where ECG signals are grouped into five essential classes (N: normal beat; S: supraventricular ectopic beat; V: ventricular ectopic beat; F: fusion beat; Q: unknown beat) following the Association for the Advancement of Medical Instrumentation (AAMI) standards [77]. Some studies [68,78] follow the AAMI standards but report classification performance only for the N, S, V, and F categories, which account for the major categories in the ECG dataset. Binary classification (19%, 73 out of 368) is mostly used to identify one specific abnormality, such as AF [79] or left ventricular dysfunction [80]. The conclusions of many studies suggest that accurate multi-class arrhythmia classification is more challenging [19,20].
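The five-class grouping can be expressed as a lookup from MIT-BIH beat annotation symbols to AAMI classes. The sketch below uses the symbol grouping commonly quoted in the literature; it is not reproduced from this review and should be checked against [77] before use. The helper name `to_aami` and the `drop_q` flag are hypothetical.

```python
# Commonly quoted grouping of MIT-BIH beat symbols into AAMI classes
# (assumption: verify against the AAMI standard cited as [77]).
AAMI_CLASSES = {
    "N": ["N", "L", "R", "e", "j"],  # normal and bundle branch block beats
    "S": ["A", "a", "J", "S"],       # supraventricular ectopic beats
    "V": ["V", "E"],                 # ventricular ectopic beats
    "F": ["F"],                      # fusion beats
    "Q": ["/", "f", "Q"],            # paced, fusion of paced, unclassifiable
}
SYMBOL_TO_AAMI = {sym: cls for cls, syms in AAMI_CLASSES.items() for sym in syms}

def to_aami(symbols, drop_q=False):
    """Map beat symbols to AAMI classes, skipping non-beat annotations."""
    labels = [SYMBOL_TO_AAMI[s] for s in symbols if s in SYMBOL_TO_AAMI]
    if drop_q:
        # Some studies evaluate only N, S, V, and F.
        labels = [c for c in labels if c != "Q"]
    return labels
```

For example, `to_aami(["N", "L", "V", "A", "/", "+"])` keeps the five beat symbols and skips the rhythm-change marker `+`, and passing `drop_q=True` reproduces the four-class setting used by some studies.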

Evaluation Paradigm
The generalization performance of DL models is a crucial aspect of model evaluation. It refers to the capability of a classification model to infer the categories of new, previously unseen data. For ECG classification, two evaluation paradigms have been investigated to assess the classification capability of DL models, i.e., the intra- and inter-patient paradigms, as depicted in Figure 5a. Specifically, in the inter-patient paradigm, a model trained on ECG signals from one group of patients is evaluated on a different group of patients that does not overlap with the training group. In the intra-patient paradigm, the DL model may be trained and evaluated on ECG signals from the same patients. Among all the selected studies, 27 focus on the inter-patient paradigm, while a far larger number (319) consider the intra-patient paradigm. In addition, 11 studies consider both paradigms, while a few studies do not describe their paradigm explicitly. As can be seen in Figure 5b, the proportion of the selected studies considering the inter-patient paradigm has been increasing in recent years, as it is more desirable for practical clinical applications.
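The difference between the two paradigms comes down to how beats are assigned to the training and test sets. The following sketch contrasts a beat-level shuffle (intra-patient) with a patient-level hold-out (inter-patient). The data layout (a list of dicts with a `patient` key), the function names, and the toy patients A to D are assumptions made for illustration.

```python
import random

def intra_patient_split(beats, test_fraction=0.2, seed=0):
    """Shuffle all beats together; a patient can appear on both sides."""
    rng = random.Random(seed)
    shuffled = beats[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def inter_patient_split(beats, test_patients):
    """Hold out whole patients; no patient contributes to both sides."""
    test_patients = set(test_patients)
    train = [b for b in beats if b["patient"] not in test_patients]
    test = [b for b in beats if b["patient"] in test_patients]
    return train, test

# Toy dataset: 4 hypothetical patients with 50 beats each.
beats = [{"patient": p, "beat_id": i} for p in ("A", "B", "C", "D") for i in range(50)]

intra_train, intra_test = intra_patient_split(beats)
inter_train, inter_test = inter_patient_split(beats, test_patients=["D"])
```

With the beat-level shuffle, the test set necessarily shares patients with the training set, so patient-specific morphology seen in training can inflate the test scores; the patient-level hold-out removes that leakage.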
Detailed information about the selected studies that consider the inter-patient paradigm is presented in Table 3. It summarizes the specific ECG data used for training/validation and testing, the deep learning algorithm, the classification categories, and the classification performance. The MITDS1/DS2 method (82%, 31 of 38) is the most popular evaluation method for the inter-patient paradigm. Specifically, the ECG data in MITDB is divided into two groups, i.e., DS1 and DS2, each of which includes 22 records. The details of how to obtain the standard DS1 and DS2 are given in [77]. Please note that the MITDS1/DS2 method is modified in some studies [63], where different recordings are included in DS1 and DS2. Additionally, some works train a model with ECG data from one database and test it with a different ECG database. For example, the model in [81] is trained on AFDB and then tested on MITDB. The number of classification categories in those selected works varies from 2 to 9. As the motivations, tasks, datasets, and classification methods of the reviewed studies all differ, it is neither fair nor straightforward to compare classification performance across them. Nevertheless, as a rough indication, we compare the average performance metrics of all the selected studies in the inter-patient paradigm, regardless of the number of categories to be classified. The averaged Acc, F1 score, Sen, Ppv, and Spe are 92.62%, 79.48%, 79.25%, 71.74%, and 95.26%, respectively. Although these statistics are biased by the way they are calculated, they suggest that the classification performance can still be improved, as the F1 score, Sen, and Ppv are significantly lower on average than Acc and Spe.
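For reference, the DS1/DS2 record assignment that is commonly reported in the literature for MITDB can be written down as two lists. These lists are not taken from this review; they are an assumption based on the widely used split and should be confirmed against [77]. The four records listed as paced are the ones usually excluded under the AAMI recommendation.

```python
# Commonly reported DS1/DS2 split of MITDB records (assumption: verify with [77]).
DS1 = [101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122,
       124, 201, 203, 205, 207, 208, 209, 215, 220, 223, 230]   # training
DS2 = [100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210,
       212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234]   # testing
PACED = [102, 104, 107, 217]  # records with paced beats, usually excluded

def split_name(record):
    """Return which subset a MITDB record number belongs to."""
    if record in DS1:
        return "DS1"
    if record in DS2:
        return "DS2"
    if record in PACED:
        return "excluded"
    raise ValueError(f"unknown MITDB record: {record}")
```

Keeping the split as explicit record lists, rather than a random partition, is what makes results from different studies using this method comparable.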
Furthermore, Table 4 presents similar information about the 11 studies that investigate both the inter- and intra-patient paradigms. The averaged classification accuracies of these 11 studies are 98.39% and 90.15% in the intra- and inter-patient paradigms, respectively. As shown in Figure 5c, the averaged values of F1 score, Sen, Ppv, and Spe for the 11 studies in the intra-patient paradigm (inter-patient paradigm) are 95.52% (83.89%), 93.51% (78.16%), 92.78% (62.82%), and 99.19% (93.86%), respectively. The differences in the F1 score, Sen, and Ppv between the two paradigms are 11.63, 15.35, and 29.96 percentage points, which are much larger than the 8.24-point difference in accuracy. Therefore, the inter-patient paradigm is a more challenging scenario, which calls for more research attention and effort.
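The gaps quoted for the 11 studies follow directly from the averaged values, and the same arithmetic gives the intra-patient Ppv implied by the reported 29.96-point Ppv gap. The short check below uses only numbers stated in the text; the dictionary layout and variable names are illustrative.

```python
# Averaged metrics (%) of the 11 studies that report both paradigms.
intra = {"Acc": 98.39, "F1": 95.52, "Sen": 93.51, "Spe": 99.19}
inter = {"Acc": 90.15, "F1": 83.89, "Sen": 78.16, "Ppv": 62.82, "Spe": 93.86}

# Intra-minus-inter gap in percentage points for each metric reported in both.
gaps = {k: round(intra[k] - inter[k], 2) for k in intra}

# Intra-patient Ppv implied by the reported Ppv gap of 29.96 points.
implied_intra_ppv = round(inter["Ppv"] + 29.96, 2)
```

The accuracy gap is the second smallest of the five, after specificity, which is why accuracy alone understates how much harder the inter-patient setting is.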

Performance Metrics
To evaluate the classification performance of DL models, the commonly used performance metrics are overall accuracy (Acc), sensitivity (Sen), positive predictivity (Ppv), false positive rate (FPR), and F1 score. Sen and Ppv correspond to recall and precision, respectively. Acc (82.1%, 302), Sen (72%, 265), and F1 score (59.8%, 220) are the three most used metrics for evaluating DL models for arrhythmia classification.
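These metrics can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical, and the code assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive/negative counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)   # overall accuracy
    sen = tp / (tp + fn)                    # sensitivity (recall)
    ppv = tp / (tp + fp)                    # positive predictivity (precision)
    spe = tn / (tn + fp)                    # specificity
    fpr = fp / (fp + tn)                    # false positive rate = 1 - Spe
    f1 = 2 * ppv * sen / (ppv + sen)        # harmonic mean of Ppv and Sen
    return {"Acc": acc, "Sen": sen, "Ppv": ppv, "Spe": spe, "FPR": fpr, "F1": f1}

# Hypothetical imbalanced test set: 100 abnormal beats, 900 normal beats.
m = binary_metrics(tp=60, fp=40, fn=40, tn=860)
```

In this example the accuracy is 0.92 even though Sen, Ppv, and the F1 score are all 0.6, because the dominant normal class drives Acc. This is the same pattern seen in the averaged results of the reviewed studies.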
Regardless of the evaluation paradigm and other conditions, such as the task and the number of categories to be classified, we simply calculate the average of those metrics and find that the classification accuracy of the DL models in the selected studies is already above 95%, while other metrics such as F1 score, sensitivity, positive predictivity, and specificity are relatively lower, within the range of 80-95%. Furthermore, interesting comparisons between DL models and professional cardiologists are performed in [110,111]. These comparisons indicate that DL models are very competitive with cardiologists, which shows their great potential for clinical ECG classification.

Code Availability
Sharing the code of DL models online allows other researchers to reproduce the results of existing works. However, only around 6% (20 out of 368) of the studies provide code information directly in their papers, and most of them share the code through the GitHub platform. To help researchers access the relevant code conveniently, Table 5 lists detailed information about the studies whose code is publicly available. A few studies mention that the code is available upon request, e.g., [85,112], but these are not listed in the table.

Discussion
In this section, the findings from the selected papers in the field of DL-based ECG arrhythmia classification are summarized from the perspectives of the ECG database, preprocessing, DL methodology, evaluation paradigm, and performance metric. In addition, future challenges and possible directions are also discussed accordingly.

ECG Database
The quality of data plays a vital role in achieving high classification performance [37]. DL techniques depend heavily on the training data, from which they learn the relationship between data characteristics and the corresponding categories. As ECG signals are considered private medical information, they are generally difficult to collect from a broad range of patients of different genders and ages. In addition, the measurement conditions should be kept uniform for every patient, and the annotation of ECG signal samples should be precise, all of which requires standard facilities and significant annotation effort from cardiologists [5]. Hence, at the current stage, publicly available ECG databases are the main data resources for DL-based ECG arrhythmia classification and support the progress of the research. However, the symptoms of arrhythmia in ECG signals vary from person to person. Exploiting diverse ECG databases to expose DL models to more data samples could significantly improve their inference performance in practical clinical applications.
According to this review, MITDB is the most popular of the ECG databases for DL-based arrhythmia classification, although it was collected about 40 years ago [9]. MITDB effectively acts as the data baseline for comparing newly designed DL methods with well-established models. Nowadays, as the complexity of DL models increases in pursuit of higher classification performance, training inevitably consumes more ECG data. Hence, a growing number of works combine datasets from multiple public ECG databases [16,123]. As another example, in the PhysioNet Computing in Cardiology Challenge 2020, seven public databases were provided to participants. However, the differences in the number of leads, signal duration, measurement conditions, and patient demographic distribution among those ECG databases should not simply be ignored.
Another issue with the existing arrhythmia-related ECG databases is that the data categories are significantly imbalanced. The amount of ECG data in the normal category is often dominant in those databases [10]. Although various methods, such as data augmentation [124] and focal loss [92], are used to address the issue of imbalanced datasets, collecting more data in the abnormal categories is the ideal way to resolve the issue entirely. However, collecting such data requires patients who have exactly the diseases to be classified, which is often difficult in reality. Hence, the imbalance of ECG datasets is one of the challenges that researchers should expect to confront in the long term.
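As a brief illustration of how a loss function can counter class imbalance, the sketch below implements the widely used form of focal loss for a single sample, FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The default gamma and alpha values are common illustrative choices and are not taken from [92].

```python
import math

def cross_entropy(p_true):
    """Standard cross-entropy for the predicted probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma to down-weight easy samples."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Ratio of focal loss to cross-entropy for an easy and a hard sample.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # well-classified sample
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # misclassified sample
```

With gamma = 2, a well-classified sample (p_t = 0.9) keeps only about 1% of its cross-entropy loss, while a poorly classified one (p_t = 0.1) keeps about 81%. As a result, the abundant, easily classified normal beats contribute much less to the gradient than hard minority beats.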

Preprocessing
Real clinical ECG signals often contain diverse noise and interference. However, many existing works do not consider ECG signal denoising, which could raise risks when they are deployed in real clinical scenarios. Thorough studies of the impact of ECG denoising on DL-based ECG classification methods in clinical applications are needed. Furthermore, the existing denoising methods assume that the characteristics of noise and interference differ from those of the useful signals in a predetermined domain, such as the frequency domain or the wavelet domain [96]. In general, DWT-based denoising methods retain the details of higher-frequency signal components better than traditional filtering-based denoising methods, which rely on the discrete Fourier transform [36]. However, DWT-based denoising methods are offline algorithms that cannot be applied to real-time ECG signal denoising [125]. Advanced denoising algorithms designed specifically around ECG characteristics remain to be explored.
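To show what DWT-based denoising involves, the following is a minimal single-level sketch: decompose the signal, shrink the detail (high-frequency) coefficients with soft thresholding, and reconstruct. The Haar wavelet, the single decomposition level, and the threshold value are simplifying assumptions; the cited works may use other wavelets, more levels, and data-driven thresholds.

```python
import math

def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; small ones become zero."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def denoise(x, thr):
    """Threshold only the detail coefficients, then reconstruct the signal."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, soft_threshold(detail, thr))

# Hypothetical noisy segment (even length).
noisy = [0.02, -0.01, 0.03, 1.0, 0.9, 0.01, -0.02, 0.0]
cleaned = denoise(noisy, thr=0.05)
```

Note that the sketch operates on a complete buffer of samples, which reflects the offline character of this family of methods mentioned above.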
As ECG datasets are often imbalanced, data augmentation could be a necessary step to further improve the performance of DL models. Synthetic-based methods have gained popularity in recent years. In particular, GAN-style models learn to generate synthetic ECG signals based on the characteristics of real ECG signals [126]. However, this type of method is still a data-driven process that relies heavily on the quality of the ECG data. The work in [46] proposes a DL model that jointly considers synthetic ECG signals generated by a mathematical model and real clinical ECG data. It could be a better solution for imbalanced data, as the newly generated data samples are not simply linear combinations of other data samples but contain theoretical a priori knowledge from cardiologists.
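For contrast with model-based generation, the "linear combination" style of augmentation can be sketched as SMOTE-like interpolation between two minority-class beats. This is the baseline the paragraph argues against, not the method of [46] or [126]; the function names, the fixed seed, and the toy beats are hypothetical.

```python
import random

def interpolate_beats(beat_a, beat_b, lam):
    """Point on the line segment between two equal-length beats (0 <= lam <= 1)."""
    return [a + lam * (b - a) for a, b in zip(beat_a, beat_b)]

def oversample_minority(beats, n_new, seed=0):
    """Create n_new synthetic beats by interpolating random pairs of real beats."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(beats, 2)   # two distinct minority-class beats
        synthetic.append(interpolate_beats(a, b, rng.random()))
    return synthetic

# Two hypothetical minority-class beats with two samples each.
minority = [[0.0, 1.0], [1.0, 3.0]]
new_beats = oversample_minority(minority, n_new=5)
```

Every synthetic beat produced this way lies between existing beats, so it cannot introduce morphology that is absent from the original minority samples. That limitation is what motivates generators that encode physiological prior knowledge.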

DL Methodology
This review clearly shows that CNNs are the most popular DL models for ECG classification thanks to their excellent capability for feature extraction [8]. As ECG signals are time series in nature, RNNs are another popular type of DL model that has been widely adopted. The transformer is a relatively new type of DL model that emerged with the attention mechanism and has been used in some recent works. In addition, the review shows that more studies are leveraging hybrid DL models for arrhythmia classification. Specifically, CNNs often serve as feature extractors right after the input layer of the hybrid model [28], and other DL structures, such as RNNs and transformers, are exploited to extract further refined features. The reported results show that, in most cases, the hybrid model achieves better classification performance but incurs higher computational complexity [26,84,127–130]. However, as most selected studies consider traditional DL models such as CNNs and RNNs, the investigation of novel DL models or structures for arrhythmia classification with ECG signals is still limited. With the emergence of novel DL models such as ViT [131] and MLP-Mixer [132], adaptations of those models are expected to be introduced for ECG classification in pursuit of better performance. In addition, most selected works focus on improving classification performance as much as possible, while the interpretability of DL models is generally not discussed. Interpretable DL models [45] are highly desirable for making ECG classification results trusted in real clinical scenarios, and they could further help cardiologists relate heart abnormalities to possible hidden features of ECG signals, such as the ECG phenotyping discussed in [7].
In most of the reviewed studies, the DL models are exploited under the supervised learning framework. How to leverage DL models in other artificial intelligence frameworks, such as active learning [133] and reinforcement learning [134], to improve the accuracy of ECG diagnosis could be one future research direction. In addition, how to systematically optimize the DL model structure, such as the size of the convolutional kernels, and the hyperparameters, such as the minibatch size and learning rate, could be another crucial control knob for ECG classification.
The number of categories considered in the reviewed studies generally ranges from 2 to 9. As the number of categories to be classified increases, learning the mapping for arrhythmia classification becomes more challenging [20]. In MITDB, the total number of categories is 26. However, the amount of ECG data in some categories is very limited. Hence, achieving accurate classification with a larger number of categories requires research effort on improving the learning capability of the DL process, so as to overcome challenges such as the high complexity of the mapping relationship and the lack of data in minority categories.

Evaluation Paradigm
Based on how the training and testing datasets are organized, ECG classification can be categorized into two paradigms, i.e., the inter- and intra-patient paradigms [109]. Most of the selected studies consider the intra-patient paradigm, while more recent works consider the inter-patient paradigm, as it is highly desirable in clinical applications [86]. However, under the inter-patient paradigm, we found that the values of Acc and Spe are more than 10 percentage points higher than those of F1 score, Sen, and Ppv. In addition, we compared the classification performance of the same models in the inter- and intra-patient paradigms using the existing works that consider both paradigms. The comparison shows that the F1 score, Sen, and Ppv achieved in the inter-patient paradigm are roughly 12 to 30 percentage points lower than those achieved in the intra-patient paradigm. Hence, further research effort is required to close the performance gap between the inter- and intra-patient paradigms, which brings research opportunities.

Performance Metrics
A wide variety of performance metrics are used for comparison. The most common metrics are overall Acc, Sen, Ppv, FPR, and F1 score. In general, the classification accuracy of the proposed DL models in most of the selected studies is already above 95%, while other metrics such as Sen, Ppv, Spe, and F1 score are relatively lower, in the range of 80-95%, which also calls for research effort.

Conclusions
DL techniques have been extensively investigated for arrhythmia diagnosis with ECG signals and exhibit great potential for clinical applications. However, this survey shows that some essential aspects of the DL pipeline require further research effort before it can be reliably applied to clinical ECG arrhythmia classification. Specifically, leveraging diverse ECG databases for training and testing, designing advanced denoising and data augmentation techniques, developing novel integrated DL models, and investigating the inter-patient paradigm more deeply could be future research directions and opportunities to ensure trusted DL-based arrhythmia classification and to promote its application in real clinical scenarios.

Figure 1. Paper search and refinement process.


Figure 2. Trend of different ECG databases used by selected studies.

Being the most popular ECG database, MITDB contains 48 two-lead ECG recordings, each about 30 min long and sampled at 360 Hz. The ECG signals are collected from 47 patients, as recordings 201 and 202 come from the same patient. The age of the patients ranges from 23 to 89 years. However, MITDB is an imbalanced dataset in which most ECG recordings are normal and abnormal recordings are far fewer. As abnormal signals are more difficult to collect, most ECG databases encounter the issue of data imbalance, which could introduce learning bias in DL-based classification frameworks [14]. Besides ECG datasets obtained specifically for arrhythmia classification, noisy but normal ECG records can be added to the training dataset to improve the robustness of the DL models. For example, the MIT-BIH Noise Stress Test Database, which is collected from physically active volunteers to mimic ambulatory ECG, serves as a category of noisy ECG signals [15][16][17]. In this way, the real situation in clinical practice can be emulated. Among the selected works, 165 (45%) studies consider more than one ECG dataset by combining multiple ECG databases. For example, [18] exploits five public ECG datasets, i.e., AFDB, MITDB, NSRDB, the 2017 PhysioNet/CinC Challenge Database, and the first China Physiological Signal Challenge 2018 Database (CPSC2018).


Figure 3. Distribution of preprocessing methods used by selected studies in each year. (a) Trend of different types of denoising methods; (b) Trend of different types of data augmentation methods.


Figure 4. Distribution of architectures used by selected studies in each year. (a) Trend of different DL models used by the selected studies; (b) Trend of optimization techniques.


Figure 5. Comparison of the intra- and inter-patient paradigms. (a) An illustration of the intra- and inter-patient paradigms; (b) Trend of the intra- and inter-patient paradigms; (c) Comparison of classification performance achieved in the inter- and intra-patient paradigms.


Table 1. Extracted information from papers.

Table 2. Popular ECG databases used by the selected studies.

Table 3. The inter-patient paradigm studies in the selected articles.

Table 4. Comparative summary table of inter/intra-patient paradigms studies.

Table 5. Detailed information on the studies with publicly available code.