An Automatic Diagnosis of Arrhythmias Using a Combination of CNN and LSTM Technology

Abstract: Electrocardiogram (ECG) signal evaluation is routinely used in clinics as a significant diagnostic method for detecting arrhythmia. However, it is very labor intensive to evaluate ECG signals manually, due to their small amplitude. Using automated detection and classification methods in the clinic can assist doctors in making accurate and expeditious diagnoses of diseases. In this study, we developed a classification method for arrhythmia based on the combination of a convolutional neural network (CNN) and long short-term memory (LSTM), which was then used to diagnose eight types of ECG signals, including normal sinus rhythm. The ECG data of the experiment were derived from the MIT-BIH arrhythmia database. The experimental method consisted of two main parts: one-dimensional signals were converted into two-dimensional grayscale images to serve as the model input, and the combined model then performed detection and classification on these images. The advantage of this method is that it does not require feature extraction or noise filtering of the ECG signal. The experimental results showed that the implemented method achieved high classification performance, with accuracy, specificity, and sensitivity of 99.01%, 99.57%, and 97.67%, respectively. Our proposed model can assist doctors in accurately detecting arrhythmia during routine ECG screening.


Introduction
Electrocardiography provides abundant health and pathology information about the heart and is the main method of diagnosing heart disease [1]. Arrhythmia is an extremely common heart disease and is mainly diagnosed by doctors. However, misdiagnosis and missed diagnosis often occur in clinical practice due to differences in doctors' experience and the randomness of arrhythmia events. At present, automatic detection and identification of arrhythmia events are urgently needed, as they can help doctors detect arrhythmia events earlier.
Traditionally, the study of arrhythmia diagnosis has mainly focused on the noise filtering of electrocardiogram (ECG) signals [2][3][4], signal segmentation [5][6][7], and manual feature extraction [8][9][10][11]. Osowski et al. [9] proposed a machine learning method that uses higher-order statistics (HOS) and Hermite functions to extract features, and a support vector machine (SVM) to classify heart diseases. De Chazal et al. [2] used morphological features and weighted linear discriminant analysis (LDA) combined with a packaging feature selection function to screen for heart disease. It is well known that the morphological approach is sensitive to ECG signal noise, which limits the robustness of the model's classification performance [12]. Thanks to the development of deep learning technology, many feature extraction tasks can be completed by convolutional computation. This approach is superior to the morphological approach and places low requirements on signal quality in classification [13]. Kiranyaz et al. [14] introduced a one-dimensional convolutional neural network (1-D CNN) to identify and classify ventricular ectopic beats and premature ventricular contractions, and achieved good results. Yildirim et al. [15] proposed a deeper 1-D CNN classifier that was able to classify even more categories of heart disease and improve the classification performance. Although there are many studies on ECG arrhythmia classification, several limitations remain: (1) ECG signal information is lost during feature extraction or noise filtering, (2) only a limited number of ECG arrhythmia types are classified, and (3) the performance of the classification methods in practice is relatively poor.
Based on the abovementioned problems, this paper proposes a model that takes two-dimensional grayscale images as input and combines a deep 2-D CNN with long short-term memory (LSTM). Some ECG signal information may be lost through steps such as noise filtering, but this can be avoided by converting the one-dimensional ECG signal into a two-dimensional ECG image [16]. In most current studies, the data used are relatively limited. Many studies need to preprocess one-dimensional ECG signals very carefully because these signals are sensitive, and the preprocessing has a large impact on the final accuracy. Converting one-dimensional ECG signals into two-dimensional ECG images yields more data, and the data are readily usable. There is no need for very precise separation of individual beats when performing the conversion. Even if parts of adjacent beats are included, the convolution layers of the model can ignore this small amount of noise. Using two-dimensional ECG images does not require noise filtering or manual feature extraction: because the convolution and pooling layers of the model automatically ignore noise when computing the feature maps, the model avoids sensitivity to noise and the resulting loss of accuracy. Some researchers [17] tend to use images instead of one-dimensional signals as input data in studies of other, similar disease diagnosis tasks. Using two-dimensional ECG images for detection and classification also resembles the way cardiologists diagnose arrhythmia, namely by observing images. If one-dimensional ECG signals are applied to instruments such as ECG monitors, problems such as sampling rate and noise will inevitably occur, so two-dimensional ECG images can be further applied to ECG monitoring robots that assist cardiac experts in diagnosing arrhythmia. In addition, it is difficult to apply the data augmentation methods used in previous studies because of the characteristics of the one-dimensional ECG signal. Augmenting the ECG data enlarges the training set, which can effectively improve the classification accuracy. Therefore, in this study, we used different cropping methods to augment the two-dimensional ECG images, so that the 2-D CNN model learns each ECG image from different viewpoints. The automatic extraction of ECG beat features using a 2-D CNN addresses the problem that current hand-designed waveform features are not sufficiently robust to patient-to-patient differences in heartbeats.
In addition to the 2-D CNN, the model uses LSTM, which is a type of recurrent neural network (RNN). The states of the cells in the LSTM interact with one another, and the temporal dynamics of the data are represented through the internal feedback state, which mitigates the long-term dependency problem. The LSTM cells can also selectively store useful information and feed it back [18]. Combining the features of the 2-D CNN and LSTM models greatly improves the classification performance.

Materials and Methods
In this study, the datasets and annotations used were from the MIT-BIH arrhythmia database. The database contains a total of 48 half-hour ECG recordings obtained from 47 subjects using two leads [19]. Each recording was sampled at 360 Hz and comes with a set of beat markers placed at the R peaks. These records were independently annotated by multiple cardiologists. ECG signals were converted into ECG images as input data through data processing. In this paper, the lead II signals were used in the experiments. Following the Association for the Advancement of Medical Instrumentation standard and the annotations provided by the MIT-BIH arrhythmia database, we selected "N" for normal sinus rhythm (NOR), "L" for left bundle branch block (LBBB), "R" for right bundle branch block (RBBB), "A" for atrial premature beat (APB), "V" for premature ventricular contraction (PVC), "/" for paced beat (PAB), "E" for ventricular escape beat (VEB), and "!" for ventricular flutter wave (VFW) for classification. Other types of arrhythmia were excluded in this paper, such as nodal escape beat, start of ventricular flutter, and other beats that cannot be classified. These have been ignored by most ECG arrhythmia studies because they have relatively little research significance. The overall procedures are shown in Figure 1.
Electronics 2020, 9, x FOR PEER REVIEW

Data Preprocessing
In this study, the input data of the model were two-dimensional images. Most previous works have used one-dimensional ECG signals as the input data for the models, which requires noise filtering and feature extraction during the data preprocessing stage. Because of the time series characteristics of one-dimensional signals, some ECG signal information may be lost during noise filtering and feature extraction, which affects the integrity of the data and may also affect the accuracy of the final classification results. Therefore, in the data preprocessing stage, we converted one-dimensional ECG signals into two-dimensional ECG images as classification data, which preserves the integrity of the original ECG data to the greatest extent. We converted each ECG beat into a separate 192 × 128 grayscale image. The R wave peak markers already present in the database were used to locate and divide the individual ECG beats. For each beat, the 92 samples following the R peak of the previous beat and the 92 samples preceding the R peak of the next beat were excluded, and the remaining segment was cropped into a single ECG image. This was accomplished using Equation (1):

$$T\big(R_{\mathrm{peak}}(n-1) + 92\big) \le T(n) \le T\big(R_{\mathrm{peak}}(n+1) - 92\big) \qquad (1)$$

where $T(n)$ denotes the time span of the $n$-th beat and $R_{\mathrm{peak}}(n)$ is the position of its R peak. Finally, a total of 107,620 ECG images were obtained after conversion, and each was labeled with its category. The conversion significantly increased the amount of data, which also provided more data for subsequent model learning and training. Table 1 describes the information recorded by all ECG signals.
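The segmentation rule can be illustrated with a minimal NumPy sketch. It assumes that the R-peak positions are available as sample indices and that the 92-sample margin is applied exactly as in Equation (1); the function name `segment_beat` and the synthetic signal are hypothetical and are not taken from the original implementation. Rendering the segment as a 192 × 128 grayscale image is not covered here.

```python
import numpy as np

def segment_beat(signal, r_peaks, n, margin=92):
    """Return the samples of beat n, bounded by the neighbouring R peaks.

    The window runs from R_peak(n-1) + margin to R_peak(n+1) - margin
    (inclusive), which is one reading of the rule described in the text.
    """
    if n <= 0 or n >= len(r_peaks) - 1:
        raise ValueError("beat n needs a neighbour on both sides")
    start = r_peaks[n - 1] + margin
    stop = r_peaks[n + 1] - margin
    if start >= stop:
        raise ValueError("neighbouring R peaks are too close for this margin")
    return signal[start:stop + 1]

# Synthetic stand-in for a 360 Hz record: sample i simply has value i.
signal = np.arange(2000, dtype=float)
r_peaks = np.array([100, 400, 700, 1000, 1300])  # hypothetical R-peak indices
beat = segment_beat(signal, r_peaks, n=2)
```

In this toy example the segment for the beat at sample 700 spans samples 492 to 908, so the R peak of the beat lies inside the window while the neighbouring R peaks are excluded.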

Data Augmentation
Because most beats in the database are of the normal rhythm type, the amount of data obtained for each disease type is imbalanced. Data augmentation can increase the amount of data in classes with few samples and effectively reduce overfitting [20]. Most previous ECG arrhythmia studies were not able to manually add augmented data to the training set because ECG signal information could be lost. The reason is that feedforward neural network (FFNN) [21] and SVM [22] classifiers assume that every ECG signal possesses the same classification worth. In most studies with a large amount of data, an ECG signal segmentation method is used to divide a one-dimensional ECG signal into multiple segments to expand the amount of data. However, since the input data of the model in this study were ECG images, image augmentation does not modify the original data but does increase the amount of data. This method draws on ideas from image processing and augments the converted two-dimensional ECG images. Starting from the converted original ECG image, processing is applied in a defined manner, which increases the number of samples while leaving the label unchanged. It preserves the original qualities of the data as far as possible while reducing the data imbalance.
In this study, nine different cropping methods were used to augment the beats of the seven ECG arrhythmia types other than the NOR class. Image cropping was performed on a specified area of the target image. The cropping of the left top image is one example. The reference coordinates of the left top image were (0, 0). With a crop size of 96, the points (0, 0), (0, 96), (96, 0), and (96, 96) were used as the four vertex coordinates of the left top image, which yielded a 96 × 96 left top crop of the target image. The other eight images were cropped similarly. Their reference coordinates were (64, 0) for the center top image, (96, 0) for the right top image, (0, 16) for the left center image, (64, 16) for the center image, (96, 16) for the right center image, (0, 32) for the left bottom image, (64, 32) for the center bottom image, and (96, 32) for the right bottom image. By using this cropping method, all the augmented images could be obtained. Finally, each augmented image was resized to 192 × 128 to ensure the uniformity of all sample data. This greatly increased the amount of data for the arrhythmia categories with relatively few samples. The added images retained the information contained in the original ECG image and are of equal reference value. The data augmentation was performed inside the model, which reduced the time spent moving images in memory and thereby increased the learning speed of the model.
The experimental data used in subsequent experiments were divided into training, validation, and test sets in proportions of 60%, 20%, and 20%, respectively. All experimental data were randomly shuffled and then divided into the sets according to these proportions. There are 107,620 two-dimensional ECG images in this paper, of which 64,572 were assigned to the training set. A total of 581,148 two-dimensional ECG images were used for model training after data augmentation. The original PAB image and the nine cropped grayscale images are shown in Figure 2.
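The nine-crop augmentation can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the image is stored as an array of shape (128, 192), i.e., 128 rows by 192 columns; the reference coordinates are read as (x, y) offsets of the top-left corner of each crop; and a nearest-neighbour resize stands in for the interpolation method, which the text does not specify. The names `nine_crops` and `resize_nearest` are hypothetical.

```python
import numpy as np

CROP = 96
# (x, y) reference coordinates of the nine crops, as listed in the text.
OFFSETS = [(x, y) for y in (0, 16, 32) for x in (0, 64, 96)]

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (a stand-in for the unspecified interpolation)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols]

def nine_crops(img):
    """Yield nine 96 x 96 crops, each resized back to the input size."""
    h, w = img.shape
    for x, y in OFFSETS:
        crop = img[y:y + CROP, x:x + CROP]
        yield resize_nearest(crop, h, w)

# Random grayscale image used only to exercise the code.
image = np.random.default_rng(0).integers(0, 256, size=(128, 192), dtype=np.uint8)
augmented = list(nine_crops(image))
```

The reported training-set size after augmentation is consistent with nine images per training sample: 64,572 × 9 = 581,148.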

CNN-LSTM Model
Deep learning [23][24][25] is a new technology that has become mainstream in the field of machine learning and pattern recognition. In this study, a new method for automatically detecting eight different types of ECG signal arrhythmias was developed. It uses a cross-learning model based on deep learning. The overall structure of the model is implemented by combining CNN and LSTM. CNN is suitable for processing spatial or locally correlated data, while LSTM is good at capturing the characteristics of time series data.
Layers 1-9 of the model are convolutional layers coupled with max-pooling layers, and layer 10 is the LSTM layer. The end of the network uses a fully connected layer for predicting the output. The convolutional layers extract the spatial feature maps well, and the subsequent LSTM layer helps the model capture the temporal dynamics present in these features [26]. In the combination of CNN and LSTM, the output shape after the last pooling layer of the model is (None, 16, 16, 256). We reshape this output with the reshape method, so that the input size of the LSTM layer is (256, 256). After the LSTM has analyzed the temporal characteristics, the model finally classifies the ECG signals through a fully connected layer. The training of the model can be improved by choosing the optimizer and learning rate appropriately, so we used the Adam optimizer with a learning rate of 0.001. Figure 3 shows the proposed network model. A detailed overview of the structure is given in Table 2.
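The reshape between the convolutional part and the LSTM layer can be illustrated with plain NumPy. The sketch assumes a row-major flattening of the 16 × 16 spatial grid into 256 steps with 256 channels each; the text gives only the shapes, so which axis the LSTM treats as time is an assumption, and the batch size of 4 is arbitrary.

```python
import numpy as np

batch = 4  # illustrative batch size; the model itself uses None (any size)
rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((batch, 16, 16, 256)).astype(np.float32)

# Flatten the 16 x 16 spatial grid into 256 steps of 256 channels each.
sequence = feature_maps.reshape(batch, 16 * 16, 256)
```

With this layout, step `16 * i + j` of the sequence holds the 256-channel feature vector at spatial position `(i, j)`, so no feature values are changed by the reshape.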

VGGNet Model
Many pretrained models, such as VGGNet [27] and GoogleNet, offer possible solutions to the problem. In this study, we compared the proposed model with the well-known VGGNet model and with other ECG arrhythmia classification studies. The VGGNet model is a deep convolutional neural network composed of multiple convolution blocks. The model can extract deep ECG features well through its convolution and pooling layers, and it generates feature maps from the extracted features for learning and training. For the VGGNet model, we also used the Adam optimizer with a learning rate of 0.001. A detailed overview of the structure is given in Table 3.

Model Architecture and Details
The first part of the proposed model is a 2-D CNN structure, which is a combination of three convolution blocks with a stride of 1. Each convolution block consists of two 2-D convolutional layers and one max-pooling layer, and uses the exponential linear unit (ELU) activation function. A batch normalization layer normalizes the activation output of each layer. In each convolution operation, the kernel slides over the input and the element-wise products are summed to produce the entries of the feature map. After the two-dimensional convolution, the feature map is passed to the two-dimensional max-pooling layer, whose filter has a stride of two; the maximum value of each specified area in the feature map is extracted to form a new feature map. This continuously deepens the network. The size of the feature map is gradually reduced from layer to layer, which speeds up the training of the model.
Then, the feature map is passed to the LSTM layer in the latter part of the model to extract temporal information. After convolution and pooling, the extracted features are arranged into sequential components, and the LSTM's recurrent chain structure performs the time series prediction. LSTM differs from the traditional RNN, whose repeating module is a single neural network layer: it consists of multiple cell states and gated modules. LSTM repeatedly combines these units so that information can be carried through the network persistently and largely unchanged. The modules of this structure interact to counter vanishing gradients and to avoid long-term dependency problems. After the LSTM layer, a feature vector containing both representational and time-dependent features is fed to the fully connected softmax layer with eight output neurons. Finally, arrhythmia prediction is given by the eight class outputs of this fully connected layer.

Activation Function
Activation functions are necessary to improve the approximation ability between the layers of the network and to enhance the expressiveness of neural networks. In current related research, nonlinear activation functions, including leaky rectified linear units (LReLU), ELU, and rectified linear units (ReLU), are widely used in CNN models. Most researchers use ReLU as the activation function of the model, but when the gradient passing through a ReLU neuron is too large, the neuron can become inactive after the network parameters are updated [28]. The ELU activation function was used in the experiments in this study, as it demonstrated better classification of ECG arrhythmia. ELU is shown in Equation (2):

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \qquad (2)$$

where $\alpha$ is a positive constant.
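As a small illustration of the ELU definition, the following NumPy sketch evaluates the function element-wise. The default `alpha=1.0` is an assumption; the value used in the experiments is not stated in the text.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # whose result is discarded by np.where anyway.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

values = elu(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

Unlike ReLU, the negative branch saturates smoothly towards `-alpha` instead of being clamped to zero, so the unit keeps a nonzero gradient for negative inputs.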

Batch Normalization
In deep learning, as the number of layers grows, even a slight change in the parameters of one layer has an amplified effect on the distribution of the inputs to the following layers. This phenomenon is called internal covariate shift. To accelerate the convergence of the model during training and to avoid exploding gradients, we added batch normalization layers to the network model. Normalizing each batch after every feature transformation in the network keeps the outputs for different batches within a certain range, thereby accelerating the convergence of the parameters [29]. Batch normalization is typically applied after the convolutional layer and before the activation function. In the experiments in this study, however, the ELU function was placed before the batch normalization layer, and this achieved significant results. Therefore, there was an ELU function before the batch normalization layer in each convolution block. Each convolution block was followed by a two-dimensional max-pooling layer. Batch normalization was calculated as

$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu}{\sqrt{\sigma^{2} + \varepsilon}}$$

where $\hat{x}^{(i)}$ is the standardized output for input $x^{(i)}$; $\mu$ and $\sigma^{2}$ represent the mean and variance of the same batch, respectively; and $\varepsilon$ is a constant, with the value 0.001.
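The normalization step can be sketched in NumPy as follows. This is a simplified illustration that computes the batch statistics per feature and uses ε = 0.001 as stated above; the learnable scale and shift parameters that batch normalization layers normally include are omitted, and the function name `batch_normalize` is hypothetical.

```python
import numpy as np

def batch_normalize(x, eps=0.001):
    """Standardize each feature (column) using the statistics of the batch."""
    mu = x.mean(axis=0)    # per-feature batch mean
    var = x.var(axis=0)    # per-feature batch variance
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = 3.0 * rng.standard_normal((32, 8)) + 5.0  # 32 samples, 8 features
normalized = batch_normalize(batch)
```

After normalization each feature has a mean of approximately zero and a variance slightly below one, because ε is added to the variance in the denominator.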

Dropout Regularization
Overfitting is a very important problem encountered during model training [30]. Therefore, dropout regularization was used here to avoid overfitting during model training. At the same time, we also conducted comparison experiments with models that did not use dropout regularization. Dropout regularization probabilistically discards some of the nodes in a layer to reduce the dependencies between layers. When a neuron is dropped, its connection weights are excluded, which greatly improves the generalization capacity of the model. A model without dropout regularization includes all of the weights in the learning process during training, so the dependency between the layers of the model is greatly increased, which causes overfitting. In the experiments using dropout regularization, it was placed before the last fully connected layer of the model. The dropout rate was 0.5.
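A minimal NumPy sketch of dropout with a rate of 0.5 is given below. It assumes the "inverted" variant, in which the retained activations are rescaled by 1 / (1 − rate) during training so that no rescaling is needed at inference time; the text does not state which variant was used, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(x, rate=0.5, rng=None, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return x  # inference: activations pass through unchanged
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= rate  # True for units that are kept
    return x * keep / (1.0 - rate)      # rescale to preserve the expected value

activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.5, rng=np.random.default_rng(0))
```

With a rate of 0.5, roughly half of the activations are set to zero and the rest are doubled, so the mean activation stays close to its original value.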

Results
The experimental data in this study came from the international standard ECG database MIT-BIH, which has accurate and comprehensive expert annotation and is widely used in current ECG research [19]. In the experiment, the experimental data were divided into 60%, 20%, and 20% for the training, validation, and test sets, respectively. Of these, 21,524 samples were used for testing. The number of training epochs was 100. The batch size was 32, and each epoch covered all input data. Two-dimensional ECG images were cropped to 96 × 96 ECG grayscale images as required. Finally, each augmented image was resized to 192 × 128. All experiments were based on the deep learning framework TensorFlow. The working environment for training the network consisted of two NVIDIA GeForce RTX 2080 Ti GPUs with 64 GB of RAM. The entire training process took 16 h.
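The 60/20/20 split can be reproduced in outline with the following sketch. It assumes a simple random shuffle of sample indices without stratification by class, which the text does not specify; the seed and the function name `split_indices` are hypothetical.

```python
import numpy as np

def split_indices(n_samples, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and split them into train/validation/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(107_620)
```

For the 107,620 images this gives 64,572 training samples and 21,524 samples each for validation and testing, matching the figures reported in the text.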
We compared two different experimental schemes and conducted experimental verification based on the presence or absence of dropout regularization. In Experiment A, we did not use dropout regularization, and the weights of the model during training were all involved in the learning process. In Experiment B, we added dropout regularization with a dropout rate of 0.5. That way, 50% of the information was discarded during training and 50% of the information was retained for learning. The comparison of the results of the two experimental schemes is shown in Figure 4. From the experimental results, we can see that the network using dropout regularization always remained in a very stable state, and its accuracy gradually increased in this stable state, finally reaching the highest point. The network that did not use dropout regularization appeared to overfit, gradually stabilized after about 60 epochs, and showed very high accuracy.
The accuracy and loss curves for training and validation are shown in Figure 5. Both the training and validation curves of the model evolved steadily and stabilized at approximately 100 epochs. The classification performance of the model was evaluated using the following metrics: accuracy (Acc), specificity (Spec), and sensitivity (Sen). The model combining CNN and LSTM achieved 99.01% accuracy, 97.67% sensitivity, and 99.57% specificity after experimental verification. Sensitivity is the ratio of the normal ECG data correctly detected by the system to all normal data. Specificity is the ratio of the abnormal ECG data correctly detected to all abnormal data. Accuracy is the proportion of all data that is classified correctly. The three metrics (Acc, Spec, and Sen) are defined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Spec} = \frac{TN}{TN + FP}, \qquad \mathrm{Sen} = \frac{TP}{TP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The classification results of the two experiments are given in Table 4. The model without dropout regularization showed high classification results due to overfitting, and obtained 99.87% Acc, 99.78% Spec, and 98.95% Sen.
All three values were higher than those of the model using dropout regularization. Table 5 describes the confusion matrix for the classification results of the trained model. It can be seen that the model performed better on the classification of PAB, LBBB, and VEB types, and the performance of the classification of APB types was average. This may have been caused by the small morphological differences of the waveforms during the learning process.
In a comparison experiment on the same dataset, the VGGNet model obtained 98.67% accuracy, 96.93% sensitivity, and 99.52% specificity. The accuracy and loss curves of the VGGNet model for training and validation are shown in Figure 6. It can be seen from Figure 6 that the training accuracy and loss of the VGGNet model tend to stabilize after 20 epochs. The entire training process of the VGGNet model took 27 h. Although the parameters of the internal convolution layers are reduced in the VGGNet model, the actual internal parameter space is relatively large. Most of the parameters come from the first fully connected layer, which consumes more computing resources. Therefore, training VGGNet models always takes longer. Table 6 describes the confusion matrix for the classification results of the trained VGGNet model. Comparing Tables 5 and 6 shows that the CNN-LSTM model predicts PVC and RBBB types better than the VGGNet model. In the CNN-LSTM model, 2.1% of the subdivided categories were incorrectly classified into other categories, while in the VGGNet model, 3.5% of the subdivided categories were incorrectly classified into other categories. Both models performed better in the classification of PAB, LBBB, and VEB types. The comparison results of the two models are shown in Table 7. The two models differed in their numbers of convolutional and pooling layers and in whether or not the LSTM layer was used. It can be seen from the results (Table 5) that the proposed model performed better than VGGNet.
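The accuracy, sensitivity, and specificity values reported above can be derived from a confusion matrix such as those in Tables 5 and 6. The sketch below assumes a one-vs-rest computation per class, with rows as true labels and columns as predicted labels; how the per-class values were aggregated into the single reported figures is not stated in the text. The 3 × 3 matrix is made-up toy data, not taken from the paper, and the function name `per_class_metrics` is hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest accuracy, sensitivity, and specificity for every class.

    cm[i, j] counts samples of true class i that were predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class i, predicted as something else
    fp = cm.sum(axis=0) - tp   # predicted class i, actually something else
    tn = total - tp - fn - fp
    acc = (tp + tn) / total
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    return acc, sen, spec

# Toy confusion matrix for three classes (illustrative values only).
cm = np.array([[50, 2, 0],
               [3, 40, 1],
               [0, 4, 30]])
acc, sen, spec = per_class_metrics(cm)
```

For the first class of the toy matrix, TP = 50, FN = 2, FP = 3, and TN = 75, so the sensitivity is 50/52 and the specificity is 75/78.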

Discussion
With the continuous development of machine learning in recent years, the MIT-BIH arrhythmia database has been used by an increasing number of researchers in ECG research. Table 8 summarizes studies on the automatic detection of ECG arrhythmias. Compared with other related studies, the method of combining 2-D CNN and LSTM proposed in this paper was highly accurate. Most machine learning methods suffer from adaptability problems. Through experimental verification, we were able to compare in more depth the effect of using dropout regularization in the model. Without dropout regularization, the training of a model is prone to overfitting, which seriously affects the model's classification ability. Applying dropout with a probability of 50% after the final batch normalization of the fully connected layer yielded good classification performance and greatly improved the generalization of the model.
Most classification work requires noise removal and manual feature extraction from ECG signals, which inevitably leads to the loss of some beats from the ECG data. At the same time, most studies have a limited data volume because of the different ECG signal segmentation methods they use. In this study, by converting one-dimensional ECG signals into two-dimensional ECG images, we avoided losing part of the data to preprocessing. Moreover, data augmentation increases the amount of data in relatively small categories, which further balances the different types of data and improves the classification performance of the model.
It can be seen from Table 8 that the number of arrhythmia classes considered in each study differed, and the amount of data used varied. Osowski et al. [9] preprocessed the data using the HOS cumulants and Hermite coefficients of the QRS complex in ECG signals and combined the minimum mean square error method with SVM, obtaining 98.71% accuracy. Martis et al. [31] also used HOS to preprocess the signals. They used 34,989 ECG signal data points and a least-squares SVM to classify five arrhythmia types; the highest average accuracy obtained was 93.48%. Plawiak et al. [32] augmented the characteristics of ECG signals using the power spectral density. They used ECG signal data to compare different machine learning models and finally used a support vector machine model to obtain the best classification of 17 arrhythmia classes, with 98.85% accuracy. Guerra et al. [33] also used SVM for classification, but instead of a single SVM they used multiple SVMs to achieve automatic classification. Their classification accuracy reached 94.50%. All of the studies summarized above use traditional machine learning methods. In data processing, the ECG signals need to be filtered and features need to be extracted by means such as HOS. The classifiers used are likewise traditional machine learning classifiers.
In recent years, deep learning has also developed rapidly. Compared with traditional machine learning, deep learning has produced more notable results. Deep learning models such as CNN and LSTM are used in ECG arrhythmia classification by more and more researchers. Acharya et al. [34] constructed a nine-layer 1-D CNN model to automatically identify five different categories of heartbeats in ECG signals. The input data of the model were one-dimensional ECG signals. They filtered the high-frequency noise of the signals and then detected and classified both the noisy and the noise-free ECG signals with the model, which greatly improved the generalization ability of the model. The accuracy of the model for classifying original ECG signals was 94.03%. However, the ECG signals used for classification had a high degree of imbalance, and the classification accuracy also decreased after noise filtering. Shu et al. [18] proposed a diagnostic model that combines 1-D CNN and LSTM. The input data of the model were also one-dimensional ECG signals. In the data processing stage, the ECG data were segmented into many segments of different lengths by locating the waveforms, and all segments were then standardized to a uniform length. The model was able to classify ECG signals of different lengths into five categories and achieved an accuracy of 98.10%. Jun et al. [16] proposed a 2-D CNN model. The input data of the model were two-dimensional ECG data. The model used multiple convolution processing units to extract deep ECG features and classified the extracted features. The proposed model achieved an accuracy of 99.05%.
Although a single 2-D CNN model can learn the spatial characteristics of ECG data very well, its learning efficiency is not high enough, and the training accuracy converges slowly. In the model proposed here, an LSTM layer is added after the 2-D CNN to learn time-series-related features from the convolutional features, which are decomposed into a feature sequence. In this way, the temporal characteristics of the data can be better analyzed and further classified. Such training can improve the efficiency of the model and, at the same time, yield a higher classification accuracy. Yildirim et al. [35] proposed a bidirectional LSTM (Bi-LSTM) model with wavelet sequences to analyze and classify ECG signal sequences as time series. Bi-LSTM adds more available information, including historical and new data, through bidirectional network propagation, which allows the information in the data to be used more fully. In addition, the ECG data needed to be segmented at different scales to obtain 7326 ECG data segments, which were then used as data for the model. In the end, the proposed model achieved an accuracy of 99.25%.
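The hand-off from the 2-D CNN to the LSTM can be sketched as follows. This is a minimal illustration, assuming that the columns of the final convolutional feature map (the horizontal axis of the ECG image) are treated as time steps. The text does not state exactly how the feature map is decomposed, so this ordering and the helper name `feature_map_to_sequence` are assumptions, and the toy values are made up.

```python
def feature_map_to_sequence(feature_map):
    """Turn an H x W x C feature map into W time steps of H*C features each.

    feature_map[row][col] is a list of C channel values.
    Assumption: the image width corresponds to time, so each column
    becomes one LSTM time step.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    sequence = []
    for col in range(width):
        step = []
        for row in range(height):
            # Concatenate the channel vectors of every row in this column.
            step.extend(feature_map[row][col])
        sequence.append(step)
    return sequence


# Toy feature map with H=2, W=3, C=2; each value encodes (row, col, channel).
fmap = [[[r * 100 + c * 10 + k for k in range(2)] for c in range(3)]
        for r in range(2)]
seq = feature_map_to_sequence(fmap)
print(len(seq), "time steps of", len(seq[0]), "features")
```

The resulting list of time steps is the shape of input a recurrent layer expects: one feature vector per step, ordered along the time axis.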
According to the summary of the abovementioned machine learning and deep learning methods, the CNN-LSTM model proposed here achieved a classification accuracy that is higher than that of most related studies and comparable to the best of them, and it also showed advantages in data processing. The quality of the data used often has a great impact on the final results of a model, and using ECG images for classification is a novel way of preserving that quality. Therefore, the proposed model can be applied in clinics to help cardiologists objectively diagnose ECG heartbeat signals, or it can be used in new smart monitor applications.

Conclusions
Detection and identification of arrhythmias is an integral part of the early diagnosis of cardiovascular disease. This paper presented an effective arrhythmia classification method that combines 2-D CNN and LSTM and uses ECG images as the input data for the model. One-dimensional signals obtained from the MIT-BIH arrhythmia database were converted into 192 × 128 grayscale images. A total of 107,620 ECG images were obtained by processing the data acquired from the database. The method achieved an accuracy of 99.01%, a specificity of 99.57%, and a sensitivity of 97.67%. The classification results showed that arrhythmia detection using a combination of ECG image data and CNN-LSTM can help doctors better diagnose cardiovascular disease and can considerably reduce their workloads. In the future, this auxiliary diagnostic method could be used in connection with medical robots or medical monitors for diagnostic treatment.

Figure 1. Overall procedures processed in ECG arrhythmia classification.

Electronics 2020, 9 ,
There are 107,620 two-dimensional ECG images in this paper. Among them, 64,572 were assigned to the training set. A total of 581,148 two-dimensional ECG images were used for model training after data augmentation. The original PAB image and the nine cropped grayscale images are shown in Figure 2.
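The figure of 581,148 training images is consistent with nine cropped images per training sample (9 × 64,572). The sketch below shows one way such nine crops could be laid out on a 192 × 128 image. The 144 × 96 crop size and the left/center/right by top/middle/bottom layout are assumptions made for illustration, as is the helper name `nine_crop_boxes`; the crop geometry actually used is not stated in this passage.

```python
def nine_crop_boxes(width, height, crop_w, crop_h):
    """Return nine (left, top, right, bottom) boxes on a 3 x 3 grid of offsets."""
    xs = [0, (width - crop_w) // 2, width - crop_w]    # left, center, right
    ys = [0, (height - crop_h) // 2, height - crop_h]  # top, middle, bottom
    return [(x, y, x + crop_w, y + crop_h) for y in ys for x in xs]


# Image size from the paper; crop size is a hypothetical choice.
boxes = nine_crop_boxes(192, 128, 144, 96)

train_images = 64_572                      # training images before augmentation
augmented_total = train_images * len(boxes)
print(len(boxes), "crops per image,", augmented_total, "training images")
```

Each box can then be cut from the original image and resized back to the network input size, so that every training image contributes nine slightly shifted views.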

Figure 3. An illustration of the proposed CNN-LSTM architecture.

Figure 4. Accuracies of the two training models. Experiment A: without dropout regularization. Experiment B: with dropout regularization.

TP indicates that normal ECG data are classified into the normal category; TN means that abnormal data are classified into abnormal categories (both TP and TN indicate correct classification); FP indicates that abnormal ECG data are classified into the normal category; and FN means that normal data are classified into abnormal categories (both FP and FN indicate classification errors). The three metrics reflect the overall classification ability of the system. The larger the value, the better the classification performance. We also compared the evaluation metrics obtained from the two experiments, with and without dropout regularization. The comparison results of the two experimental schemes are shown in Table 4.
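The three metrics can be computed from a confusion matrix as in the following sketch. It uses the standard definitions Acc = (TP + TN) / (TP + TN + FP + FN), Sen = TP / (TP + FN), and Spec = TN / (TN + FP), and treats one class as positive and all others as negative. The function name `one_vs_rest_metrics` and the toy counts are illustrative assumptions, not results from this study.

```python
def one_vs_rest_metrics(confusion, positive):
    """Compute (accuracy, sensitivity, specificity) for one class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive][j] for j in range(n) if j != positive)
    fp = sum(confusion[i][positive] for i in range(n) if i != positive)
    tn = sum(confusion[i][j]
             for i in range(n) for j in range(n)
             if i != positive and j != positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Made-up two-class example: class 0 is taken as the positive class.
toy = [[90, 10],
       [5, 95]]
acc, sen, spec = one_vs_rest_metrics(toy, positive=0)
print(f"Acc={acc:.3f}, Sen={sen:.3f}, Spec={spec:.3f}")
```

For a multi-class problem such as the eight ECG classes, the same function can be called once per class and the results averaged.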

Figure 5. Accuracy and loss of the CNN-LSTM training model.

Figure 6. Accuracy and loss of the VGGNet training model.

Table 1. A summary table of ECG signal description from the MIT-BIH arrhythmia database.

Table 2. Detailed overview of the proposed CNN-LSTM model.
[27] pretrained models, such as VGGNet [27], GoogleNet, and so forth, could provide us with many solutions to the problem. In this study, we compared the proposed model with the well-known VGGNet model and other ECG arrhythmia classification studies. The VGGNet model is a deep convolutional neural network model composed of multiple convolution blocks. The model can

Table 4. Average classification performances of the two experiments.

Table 5. Confusion matrix of the proposed CNN-LSTM model.

Table 6. Confusion matrix of the VGGNet model.

Table 7. Comparison of the proposed model with VGGNet.
