Automatic Classification System of Arrhythmias Using 12-Lead ECGs with a Deep Neural Network Based on an Attention Mechanism

Nowadays, a series of social problems caused by cardiovascular diseases is becoming increasingly serious. Accurate and efficient classification of arrhythmias from an electrocardiogram is of positive significance for improving people's health worldwide. In this paper, a new neural network structure based on the most common 12-lead electrocardiograms, built primarily from Inception and GRU (Gated Recurrent Unit) blocks, was proposed to classify nine arrhythmias. Moreover, a new attention mechanism was added to the model, which exploits the symmetry of the data, i.e., the correlation among leads. The average F1 score obtained on three different test sets was over 0.886 and the highest was 0.919. The accuracy, sensitivity, and specificity obtained on the PhysioNet public database were 0.928, 0.901, and 0.984, respectively. Overall, this deep neural network performed well in the multi-label classification of 12-lead ECG signals and showed better stability than other methods when tested on larger numbers of samples.


Introduction
The electrocardiogram (ECG), as a technique for recording the changes in electrical activity generated by each cardiac cycle, has made outstanding contributions to clinical medicine, especially in the diagnosis of arrhythmia and myocardial infarction [1,2]. It is difficult for doctors to make efficient and accurate diagnoses when faced with tens of thousands of ECG records from different individuals. In addition, the originally collected ECG signals contain considerable noise interference, and subtle potential deviations at particular nodes also cause great trouble for cardiologists. With the rapid development of computer-aided diagnosis technology, most commercial ECG machines now have a built-in automatic arrhythmia diagnosis algorithm, but its high misdiagnosis rate is unacceptable [3]. In recent decades, researchers have tried to incorporate medical theory into automated computer analysis of electrocardiograms for the purpose of accurate diagnosis.
As the most commonly used auxiliary diagnostic method for heart disease, the ECG contains abundant heartbeat information and clinical features. Classification of arrhythmias based on ECG signals is of great significance for the effective diagnosis, treatment, and early warning of various cardiovascular diseases. Most classical ECG classification methods are based on single-lead signals.

Related Work
The mainstream methods of using ECG signals to identify arrhythmia can be summarized into two categories: traditional signal processing techniques and classifiers constructed by neural networks. On the basis of filtering denoising and R point detection, the former method manually selects features and calculates statistical indicators, which has relatively low efficiency but strong interpretability; the latter method has similarities with neural network-based image recognition technology. Feature screening and parameter tuning are all done automatically by the network, which has relatively high efficiency but poor interpretability. The arrhythmia classification model based on machine learning can be seen as a transition from traditional methods to neural network methods.
Our work aims to establish an automatic classification system for arrhythmia from more heartbeats, more signal leads, and more arrhythmia categories, all while trying to explain the working mechanism of the proposed model from the perspective of a heat map.

Data
Deep learning models typically have a large number of trainable parameters, so having enough data is a basic prerequisite. In this research, 14,000 12-lead ECG records from 3569 patients were used, which differ in noise level, sequence length, and amplitude range. Before designing the structure of the neural network, it is necessary to preprocess the data with some simple mathematical methods.

Data Description
In this work, half of the ECGs, jointly tagged by several cardiologists, were provided by Shanxi Bethune Hospital, comprising a total of 7000 12-lead ECG records with a sampling rate of 500 Hz. They were divided into three parts: 5850 records as the training set, 650 records as the validation set, and 500 records as testing set 1. Each record may have multiple different arrhythmia labels, so a record may be counted repeatedly in the respective statistics of different classes of signals. To make the results more convincing, we used 6500 ECG records collected offline from multiple channels by our teams as testing set 2 and 500 ECG records from the PhysioNet public database as testing set 3 [20,21]. The specific number of ECGs corresponding to the nine types of arrhythmias contained in each data set is shown in Table 1. Sample segments of ECGs for the nine categories are shown in Figure 1.

Data Processing
Clipping gives every ECG signal the same length. The number of signal sampling points per record ranges from 4500 to 50,000. Considering that the characteristic wave groups of PVC, PAC, and so on do not appear in every heartbeat, simply cutting the signal into one or a few heartbeats would lose key information. After an overall analysis of the data, the signal length was cut to 8192 sampling points, covering roughly 16 cardiac beats. ECG signals of insufficient length were zero-padded at the beginning. In this way, information integrity was preserved to the greatest extent while unnecessary computational overhead was reduced: compared with the waveform irregularity caused by truncating a single heartbeat at an inappropriate position, a multi-heartbeat segment has information redundancy, so the positions of the starting and ending points are no longer a restrictive factor. Compared with an entire ECG record lasting several hours, the information contained in 16 heartbeats is enough to reflect its essence. During training, each iteration used random clipping so that the training data differed slightly each time, which served as a form of data augmentation.
Normalization converts data of different orders of magnitude into a uniform measure. Using the population mean and population standard deviation, Z-score standardization shrinks the signal amplitude to a smaller range according to a unified standard. In the derivative process of gradient descent, small initial values greatly accelerate convergence. Its calculation formula is defined as:

z = (x − μ) / σ

where μ is the population mean and σ is the population standard deviation.
Filtering is usually realized by the coordination of several bandpass filters with different cut-off frequencies. In traditional methods, filtering can effectively remove electromyographic signals, suppress power-frequency interference, and remove baseline drift.
However, the deep CNN can automatically suppress noise interference during feature extraction, and the data quality of the dataset used here is relatively high, so the effect of filtering on the final performance of the model is not obvious. In practice, one can decide whether to add a filtering step according to the specific experimental environment, or apply filtering only to signals with serious noise interference or particularly high quality requirements.
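As a concrete illustration, the clipping/padding and Z-score steps described above can be sketched in NumPy as follows. This is a minimal sketch: the function names and the random-crop bookkeeping are our own, not taken from the paper's code.

```python
import numpy as np

TARGET_LEN = 8192  # ~16 heartbeats at 500 Hz, per the paper

def fix_length(sig, target_len=TARGET_LEN, rng=None):
    """Crop or zero-pad a (12, L) ECG record to a fixed length.

    Long records are cropped (randomly during training, acting as mild
    data augmentation); short records are zero-padded at the beginning,
    as described in the paper.
    """
    n_leads, length = sig.shape
    if length >= target_len:
        # Random crop during training, deterministic crop otherwise.
        start = rng.integers(0, length - target_len + 1) if rng is not None else 0
        return sig[:, start:start + target_len]
    pad = np.zeros((n_leads, target_len - length), dtype=sig.dtype)
    return np.concatenate([pad, sig], axis=1)  # zeros at the beginning

def z_score(sig, mean, std):
    """Z-score standardization with population statistics: z = (x - mu) / sigma."""
    return (sig - mean) / std
```

During training one would pass a `numpy.random.Generator` so that each epoch sees a slightly different crop of each long record.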

Material and Methods
The whole experiment was completed in the Spyder IDE (Python 3.6). The neural network was implemented under the PyTorch (torch 1.2.0+cu92) framework, and the hardware platform comprised two GTX 1080 Ti graphics cards, an Intel Core i7-8700 3.2 GHz processor, and 32 GB of RAM. In addition, NumPy, pandas, matplotlib, and other third-party libraries were used for data analysis and visualization.
CNNs and RNNs are two classical kinds of networks. The former can effectively process high-dimensional data by sharing parameters and selecting features automatically; the latter has a prominent advantage on time-series problems by virtue of its memory mechanism. Inspired by GoogLeNet [22], a similar deep one-dimensional convolutional network was designed to compress information and obtain feature maps. A double-layer gated recurrent unit (GRU) structure after the convolutional network detects the sequence characteristics of the P wave, QRS complex, T wave, and other wave groups. The attention block is inserted at intervals in a cross-layer connection so that the inter-lead correlation is also fully considered. The network structure and forward propagation of the data are shown in Figure 2. The specific parameters of each layer of the network are given in Table 2. The input is a 12 × 8192 matrix. The basic components of the network include one-dimensional convolutional layers, one-dimensional pooling layers, Inception blocks, and GRUs. A 1 × 9 output prediction vector is obtained by the softmax function.

Inception Module
Inception, the core architecture of GoogLeNet, extends the number of branches to four, improving overall network performance by increasing the depth and width of the network. Figure 2a presents one of the most classical Inception structures. A convolution kernel of a different size is adopted in each branch, so receptive fields of different sizes capture features at different scales. The addition of 1 × 1 convolution kernels effectively relieves the parameter and computation bottleneck caused by large convolution kernels. Another significant advantage of Inception is that each convolution module is actually composed of a convolution layer, a BN layer, and a ReLU activation layer. The introduction of batch normalization [23] effectively alleviates the vanishing-gradient problem in deep networks and greatly accelerates convergence.
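The multi-branch idea can be illustrated with a toy NumPy sketch. This is purely illustrative: it uses random weights, three branches instead of four, and omits the 1 × 1 bottleneck and BN layers; it only shows how parallel kernels of different sizes yield multi-scale features concatenated along the channel axis.

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, L), w: (C_out, C_in, k), k odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    L = x.shape[1]
    out = np.zeros((c_out, L))
    for o in range(c_out):
        for i in range(c_in):
            for t in range(L):
                out[o, t] += np.dot(xp[i, t:t + k], w[o, i])
    return out

def inception_block(x, kernels=(1, 3, 5), c_branch=4, rng=None):
    """Toy 1-D Inception block: parallel branches with different kernel sizes
    (different receptive fields), concatenated along the channel axis.
    Weights are random here purely for illustration."""
    rng = rng or np.random.default_rng(0)
    branches = []
    for k in kernels:
        w = rng.standard_normal((c_branch, x.shape[0], k)) * 0.1
        branches.append(np.maximum(conv1d_same(x, w), 0))  # conv + ReLU
    return np.concatenate(branches, axis=0)  # channels: len(kernels) * c_branch
```

The output has `len(kernels) * c_branch` channels at the same temporal length, which is the property the network's stacked Inception blocks rely on.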

Attention Module
Combining the merits of residual attention [24] and dot-product attention [25], we designed a new attention module that acts on the max-pooling and average-pooling layers immediately after each Inception block. As shown in Figure 3, the initial 12-lead ECG signal yields the outline value of the max-pooling layer through repeated convolutions on the trunk with an increasing number of channels. On the branch, the signal is down-sampled to the same length as the pooling layer while the number of channels is kept at 12, and then yields the detail value of the max-pooling layer through the attention mechanism. Finally, the final value of the pooling layer is obtained by summing the branch value and the trunk value. The operation can be expressed as:

Q_new = Q_old + softmax(Q_old K^T) K

where Q_new is the feature map obtained after the input matrix A flows through the attention module, Q_old is the output matrix of the pooling layer on the trunk, and K is the matrix obtained directly by down-sampling on the branch. Therefore, softmax(Q_old K^T) can be interpreted as the correlation coefficient matrix between the initial signal leads and the channels of the intermediate pooling layer. The feature map calculated by the attention mechanism thus contains both lead-correlation information and wave-shape information.
Figure 3. Attention mechanism action process.
The output of Inception tends to increase significantly in the number of channels, so adding consideration of the initial 12 channels in the pooling layers helps pass the inter-lead correlation characteristics on layer by layer.
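The trunk/branch fusion described above can be sketched in a few lines of NumPy, assuming the dot-product form Q_new = Q_old + softmax(Q_old K^T) K; the choice of softmax axis (normalizing over the 12 leads) is our assumption, not stated in the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(q_old, k):
    """Residual dot-product attention fusion of trunk and branch.

    q_old: (C, L) trunk pooling-layer output.
    k:     (12, L) down-sampled branch signal (channels kept at 12).
    softmax(q_old @ k.T) is the (C, 12) lead/channel correlation matrix;
    normalizing over the lead axis is an assumption of this sketch.
    """
    corr = softmax(q_old @ k.T, axis=-1)  # (C, 12), rows sum to 1
    return q_old + corr @ k               # sum of trunk value and branch detail value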

GRU Module
GRU and LSTM are two good variants of RNN [26,27]. LSTM is used in most of the literature for the classification of ECG signals. However, in this experiment, after the CNN, we tried to add two layers of unidirectional GRU, unidirectional LSTM, bidirectional GRU, and bidirectional LSTM, respectively. The study found that GRU is better than LSTM, and unidirectional GRU is better than bidirectional GRU. The main difference between GRU and LSTM is that GRU uses one single gating unit to control both the decision of the forgetting factor and the decision of the update status. The update formula is as follows: where u stands for update gate and r stands for reset gate. σ can be any kind of activation function. b, U, and W represent bias, input weights, and recurrent weights of the update gate in GRU cell, respectively. Inputting the update results of the hidden layer cells of the previous GRU layer to the hidden layer cells of the latter GRU layer can form a double-layer GRU, and inputting the last moment update results of the double-layer GRU into the full connection layer can complete the classification through the built-in Softmax function. The GRU structure used in this paper is shown in Figure 2b.
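A single step of the standard GRU cell can be written out directly from these equations. This NumPy sketch uses our own parameter naming; the paper's actual implementation would use PyTorch's built-in GRU layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, p):
    """One GRU step following the update-gate/reset-gate equations in the text.

    p is a dict of parameters: for each gate g in {'u', 'r', 'h'} it holds
    a bias b_g, input weights U_g, and recurrent weights W_g.
    """
    u = sigmoid(p['b_u'] + p['U_u'] @ x + p['W_u'] @ h_prev)          # update gate
    r = sigmoid(p['b_r'] + p['U_r'] @ x + p['W_r'] @ h_prev)          # reset gate
    h_cand = np.tanh(p['b_h'] + p['U_h'] @ x + p['W_h'] @ (r * h_prev))  # candidate state
    return u * h_prev + (1.0 - u) * h_cand                            # interpolated update
```

Stacking two such layers (the hidden states of the first layer become the inputs of the second) gives the double-layer GRU used in the paper.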

Other Important Components
In the experiment, the Adam optimizer [28] was selected to update the parameters, with an initial learning rate of 0.0001. Ten-fold cross-validation was adopted to train the model, and the parameters with excellent prediction were selected to participate in the ensemble. Another key point is that a modified binary cross-entropy loss was used as the loss function in this experiment rather than the commonly used MSE loss or cross-entropy loss, which can be expressed as:

Loss(x, y) = −(1/N) Σ_i w_i [ y_i log(x_i) + (1 − y_i) log(1 − x_i) ]

where x is the network output vector mapped by the sigmoid function, namely the predicted values; y is the corresponding label vector, namely the true values; and w represents the weights of the loss calculation.
k_i can be interpreted as the weight of category i, introduced to eliminate the impact of the data imbalance between classes.
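A minimal NumPy sketch of this class-weighted binary cross-entropy follows. The exact weighting scheme used in the paper is not fully specified, so this shows only the plain per-class weighted form.

```python
import numpy as np

def weighted_bce(x, y, k, eps=1e-7):
    """Class-weighted binary cross-entropy for a multi-label output.

    x: predicted probabilities after the sigmoid, shape (n_classes,)
    y: multi-hot label vector, same shape.
    k: per-class weights k_i that counter the class imbalance
       (how k_i is chosen is not specified in the paper).
    """
    x = np.clip(x, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(k * (y * np.log(x) + (1.0 - y) * np.log(1.0 - x)))
```

In PyTorch the same effect is commonly achieved by passing per-class weights to `BCEWithLogitsLoss`.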

Results and Discussion
In some of the literature, accuracy, sensitivity, and specificity are used as the final evaluation indexes of a model. In this experiment, there was a serious data imbalance, which to some extent reflects the overall distribution of the nine categories of arrhythmias. At first, we tried to eliminate the adverse effects by increasing or decreasing the amounts of various data to construct a balanced data set, but the results were not satisfactory. After that, we chose instead to adapt the model and the loss function to fit the unbalanced data set. In this case, it was more reasonable to select the precision rate, recall rate, and F1 score as evaluation indexes.
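These per-class evaluation indexes can be computed directly from multi-label confusion counts, as in the following NumPy sketch (the thresholding that turns sigmoid outputs into binary predictions is assumed to have happened upstream):

```python
import numpy as np

def prf1(y_true, y_pred):
    """Per-class precision, recall, and F1 for multi-label predictions.

    y_true, y_pred: (n_samples, n_classes) binary matrices.
    Returns three arrays of length n_classes.
    """
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0).astype(float)
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0).astype(float)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0).astype(float)
    precision = tp / np.maximum(tp + fp, 1.0)  # 0 when no positive predictions
    recall = tp / np.maximum(tp + fn, 1.0)     # 0 when no positive labels
    f1 = np.where(precision + recall > 0,
                  2 * precision * recall / np.maximum(precision + recall, 1e-12),
                  0.0)
    return precision, recall, f1
```

Averaging the per-class F1 values gives the average F1 score reported on the validation and testing sets.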

Results Analysis
The loss curve and F1 curve measured on the validation set are given in Figure 4. After 80 iterations, the loss with the attention mechanism stabilizes at about 0.06, which is 0.02 lower than without it. Meanwhile, the average F1 score of the network with the attention mechanism stabilizes above 0.90, while that of the network without it stabilizes above 0.84. This suggests that the attention mechanism is very effective for learning 12-lead ECG signals, improving the average F1 score by six percentage points over the model without it. On the basis of the attention mechanism, we added two groups of contrast experiments to show systematically that the initial signal length and the choice of RNN variant are important factors affecting model performance; this aspect was insufficiently studied in most previous literature. Tables 3 and 4, respectively, give the results of these two groups of contrast experiments on the validation set. As can be seen in Table 3, as the input signal length decreases from 8192, the F1 scores generally show a downward trend because less and less cardiac-beat information is retained. When the input signal length increases beyond 8192, the original information cannot be highlighted effectively because of excessive zero filling, and the F1 score also declines slightly. The results in Table 4 show that GRU outperforms LSTM and that a unidirectional network outperforms a bidirectional one, which is contrary to the usual comparison of RNN variants in text processing. We conjecture that in a deep structure combined with a CNN, the signal length becomes shorter while the number of channels increases significantly, which favors the unidirectional GRU.
Table 3. Influence of input signals of different lengths on F1 scores.
The 4096-length and 8192-length signals each have advantages and disadvantages in individual F1 scores, but the average F1 score shows that a length of 8192 is better. The main disadvantage of selecting too long a signal is the large amount of computation; the main disadvantage of selecting too short a signal is serious information loss, with the F1 score dropping significantly.

Model Evaluation
The confusion matrix drawn from the best classification result on the validation set is shown in Figure 5. Most of the abnormally distributed samples are explicable. For example, the large number of false-positive samples of the normal type arises because all empty prediction results below the defined threshold are artificially classified as normal. TWC is confused with the other eight categories to varying degrees, which is partly caused by the elimination of small fluctuations during convolution and by the serious drift of the initial sampling points. In addition, the negative effects of false-positive and false-negative samples are amplified because there are few ECG records of ER in the data set. From a morphological perspective, the model cannot effectively learn the J-point elevation feature, which also restricts the prediction score for the ER type.
From the perspective of the confusion matrix, the learning effect of the model is outstanding. To gain an objective and comprehensive understanding of the model, we calculated the precision rate, recall rate, specificity, and F1 score of all categories on the validation set and the three testing sets, which are summarized in Table 5. The highest average F1 score was 0.919 and the lowest was 0.886. It can be clearly seen that AF, FDAVB, CRBBB, PVC, and PAC show stable performance across the four data sets and good generalization ability. When the number of testing samples increases to 6500 (testing set 2), the inter-class confusion between normal and TWC is exacerbated. There are over-fitting risks because of the extremely limited numbers of LAFB and ER records in the training set; the F1 scores of these two categories therefore also decline on a more diverse testing set.

Comparison with Other Approaches
Numerous algorithms have been proposed for the classification of arrhythmias based on ECG signals, so we selected several representative machine learning and deep learning methods for comparison. Table 6 shows the comparison with previous work. In terms of accuracy, sensitivity, and specificity, our model performs well, with specificity being the highest score. Compared with the method in [19], the model in this paper is slightly lower in accuracy and sensitivity. We believe that the use of Faster R-CNN can accurately locate a single heartbeat, whereas the network proposed here, which takes multiple heartbeats as simultaneous input, is weaker at discriminating the positional relationships of the wave groups. Figure 6 shows examples of classification errors. Convolution and pooling operations shorten the distance between wave groups and reduce their amplitude, which can eventually lead to misjudgment. Based on considerations of clinical interpretability, we used Grad-CAM to obtain the summary of key regions shown in Figure 7, which reflects that the classifier's points of attention basically coincide with those of professional physicians.

Future Work
In view of some problems existing in the model, future research will mainly cover the following three aspects. Firstly, improving the data processing methods and optimizing the network structure, with the hope of achieving similar results after reducing the network size and the volume of trainable parameters. Secondly, combining neural networks with machine learning methods such as PCA, RF, and K-NN to build a new model that accurately identifies the ER class; the biggest difficulty is detecting the subtle changes between the QRS complex and the T wave and extracting these morphological features effectively. Thirdly, changing the ensemble strategy: when multiple labels are attached to an ECG record, effective logical evaluation can eliminate impossible label combinations, and null prediction results can be returned to the model for a second, finer division. In addition, ECG records are often accompanied by personal information such as the patient's gender and age, which can be fully utilized to assist judgment in the ensemble process.

Conclusions
The biggest innovation of this research is a deep neural network model with an attention mechanism that can directly classify 12-lead ECG records. The importance of the initial signal length and of the RNN variant selection was also illustrated by contrast experiments. Across multiple testing sets, the F1 score remained at 0.886 or above and reached up to 0.919. The F1 scores of AF, CRBBB, and PVC all reached above 0.96 with stable performance, which is of considerable clinical application value. ER and TWC recognition still has shortcomings, but some constructive ideas for improvement have been proposed. The heat curve mentioned at the end of the article has long-term significance for automatic arrhythmia classification systems to be widely recognized by clinicians. At the same time, the fixed signal length, fixed number of leads, and fixed number of categories limit the model's flexibility, which is its main limitation.

Conflicts of Interest:
The authors declare no conflict of interest.