Deep Learning with a Recurrent Network Structure in the Sequence Modeling of Imbalanced Data for ECG-Rhythm Classifier

Abstract: The interpretation of Myocardial Infarction (MI) via electrocardiogram (ECG) signals is a challenging task. The morphology of ECG signals varies significantly between patients and under different physical conditions. Several learning algorithms have been studied to interpret MI. However, the drawback of conventional machine learning is its reliance on heuristic features with shallow feature-learning architectures. To overcome this problem, a deep learning approach is used to learn features automatically, without conventional handcrafted features. This paper presents sequence modeling based on deep learning with recurrent networks for ECG-rhythm signal classification. A recurrent network architecture, the Recurrent Neural Network (RNN), is proposed to automatically interpret MI via ECG signals. The performance of the proposed method is compared with that of other recurrent network classifiers, namely Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The objective is to obtain the best sequence model for ECG signal processing. This paper also aims to study a proper data partitioning ratio for the training and testing sets of imbalanced data. The large imbalanced data are obtained from the MI and healthy control classes of the PhysioNet PTB Diagnostic ECG Database, which provides 15-lead ECG signals. According to the comparison, the LSTM architecture performs better than the standard RNN and GRU architectures with identical hyper-parameters, with a sensitivity, specificity, precision, F1-score, BACC, and MCC of 98.49%, 97.97%, 95.67%, 96.32%, 97.56%, and 95.32%, respectively. Deep learning with the LSTM technique is therefore a promising method for classifying sequential data that implements time steps in the ECG signal.


Introduction
Electrocardiogram (ECG) is a key component of the clinical diagnosis and management of inpatients and outpatients, and can provide important information about cardiac diseases [1]. Some cardiac diseases can be recognized only through an ECG signal, as presented in [2][3][4][5][6]. An ECG records the electrical signals related to heart activity and produces a voltage chart of the cardiac rate; it is a cardiological test that has been in use for the past 100 years [7]. At a normal rate, an ECG signal has three different waveforms in each cardiac cycle: the P wave, the QRS complex, and the T wave [8]. In other cases, the ECG changes in the T waveform, the ST interval length, and the ST elevation. Such morphological changes indicate a cardiac abnormality, i.e., Ischemic Heart Disease (IHD) [9]. IHD is the single largest contributor to the disease burden in developing countries [10]. The two leading manifestations of IHD are angina and Acute Myocardial Infarction (MI) [10]. Angina is characteristically caused by atherosclerosis leading to stenosis of one or more coronary arteries. MI occurs when the oxygen supply to the cardiac muscle tissue is insufficient; if cardiac muscle activity increases, oxygen demand also increases [11]. MI is the most dangerous form of IHD, with the highest mortality rate [10].
MI is usually diagnosed by changes in the ECG together with an increase in serum enzymes, such as creatine phosphokinase and troponin T or I [10]. The ECG remains the most reliable tool for interpreting MI [12][13][14], despite the emergence of expensive and sophisticated alternatives [7]. However, interpreting MI from ECG morphology is a challenging task because of its significant variation between patients and under different physical conditions [15,16]. To prevent the misinterpretation of MI, a study that automatically exploits the sequential nature of ECG signals is necessary. A sequential model consists of sequences of ordered events, with or without a concrete notion of time. The algorithms usually used for sequential models are deep learning techniques [17]. Deep learning algorithms that use a sequential model to interpret MI from ECG signals have been presented in References [12,14]. These studies combine Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architectures to interpret MI using only one lead (Lead I) or several leads (I, II, V1, V2, V3, V4, V5, V6). For most deep learning practitioners, sequence modeling is synonymous with recurrent networks, which maintain a vector of hidden activations that is propagated through time [17].
Basic recurrent network architectures are notoriously difficult to train because the norm of the gradient can increase sharply during training (exploding gradient) or, conversely, long-term components can go exponentially fast to norm zero (vanishing gradient) [18]. More elaborate architectures are commonly used instead, such as the LSTM [19] and GRU [20,21]. Other architectural innovations and training techniques for recurrent networks have been introduced and continue to be actively explored [22][23][24][25]. Unfortunately, none of these studies suggests which recurrent network is most suitable for classification.
In the present paper, three sequence model classifiers are used to classify MI and healthy control from 15-lead ECG signals. The recurrent network algorithms, namely the Recurrent Neural Network (RNN), LSTM, and GRU, are compared for automatically interpreting MI via ECG signals. The objective is to obtain the optimum sequence model for ECG signal recordings. The performance of the recurrent network classifiers is evaluated with a set of performance metrics. This study also analyzes classifier performance on imbalanced data, in which the sample sizes of the two classes, MI and normal cardiac condition in healthy control patients, are unevenly distributed [26]. In such situations, the classification method tends to be biased towards the majority class. Therefore, this paper uses the balanced accuracy (BACC) and Matthews Correlation Coefficient (MCC) metrics to produce a better analysis of the imbalanced MI data [26]. In some studies, the choice of leads is an important factor in determining the performance of classifiers [12,13]. The sequence model classifier can be used with all 15 ECG leads instead of only one or several leads.

Materials and Methods
This paper proposes an ECG processing method that calculates appropriate features from 15-lead raw ECG data. The method consists of window-sized segmentation, classification by sequence modeling, and evaluation of classifier performance based on performance metrics, as presented in Figure 1.

ECG Raw Data
The sequential ECG data are obtained from the open-access PhysioNet database PTB Diagnostic ECG, provided by the National Metrology Institute of Germany [27]. The PTB Diagnostic ECG database contains 549 records from 290 patients (209 males and 81 females). Each patient is associated with one to five ECG records. Each ECG record includes 15 signals measured simultaneously: the 12 conventional leads (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, and V6) along with the 3 Frank leads (vx, vy, vz) in the .xyz file. The database contains ECG signals that represent normal heart conditions and nine heart abnormalities, one of which is MI. However, this study uses only the diagnostic classes of healthy control and myocardial infarction from [27]. These two classes offer a considerable amount of data for further study: 80 ECG records of healthy control and 368 ECG records of MI.

ECG Segmentation
The initial stage of ECG signal pre-processing is segmentation into windows of equal size. This segmentation is needed because the length of the PTB Diagnostic ECG signals varies between records. The length of the ECG signal ranges from 480,000 to 1,800,180 samples (480-1800 s) for MI and from 1,455,000 to 1,800,180 samples (1455-1800 s) for healthy control. Each signal has been digitized at 1000 samples per second. Each window consists of 4 s of data samples, which includes at least three heart beats at a normal heart rate. In total, 12,359 windows of 4 s each were segmented (see Figure 2). The number of sequences for the MI and healthy control classes is 10,144 and 2,215, respectively.
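The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes non-overlapping windows, drops any trailing partial window, and stores a record as one list of samples per lead. The function name `segment_record` and the toy record are hypothetical.

```python
from typing import List

FS_HZ = 1000   # sampling rate stated in the text (samples per second)
WINDOW_S = 4   # window length stated in the text (seconds)


def segment_record(record: List[List[float]], fs: int = FS_HZ,
                   window_s: int = WINDOW_S) -> List[List[List[float]]]:
    """Split a multi-lead record into fixed-size windows.

    record: one list of samples per lead.
    Returns a list of windows, each indexed as [lead][sample].
    Assumption: windows do not overlap and a trailing partial window is dropped.
    """
    win = fs * window_s                             # 4000 samples per window
    n_samples = min(len(lead) for lead in record)   # guard against ragged leads
    n_windows = n_samples // win
    return [[lead[k * win:(k + 1) * win] for lead in record]
            for k in range(n_windows)]


# Toy 15-lead record of 10 s: two full 4 s windows fit, the last 2 s are dropped.
toy = [[0.0] * 10_000 for _ in range(15)]
windows = segment_record(toy)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 2 15 4000
```

Each window produced this way has 15 leads of 4000 samples, which matches the window size described above.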

Recurrent Neural Network
A Recurrent Neural Network (RNN) is a type of artificial neural network architecture with recurrent connections for processing input data [12]. RNN is categorized as a deep learning technique because features are calculated automatically, without predetermining appropriate features [28]. An RNN has a "memory", namely the state $s_t$, which captures information about all input elements $x_t$ to produce the output $\hat{y}_t$ [29]. The original RNN, also known as vanilla RNN, has forward and backward pass processes similar to those of other artificial neural networks. The only difference is in the backpropagation process, which is defined through time and called backpropagation through time (BPTT) [30].
The model uses three weight matrices (Figure 3): the weights between the input and hidden layers, $w_{hx} \in \mathbb{R}^{h \times x}$; the weights between two hidden layers, $w_{hh} \in \mathbb{R}^{h \times h}$; and the weights between the hidden and output layers, $w_{yh} \in \mathbb{R}^{y \times h}$. In addition, a bias vector $b_h \in \mathbb{R}^{h}$ is added to the hidden layer and a bias vector $b_y \in \mathbb{R}^{y}$ is added to the output layer. The RNN model is represented by Equations (1) to (3).
RNN is trained sequentially with supervised learning. At time step $t$, the error is the difference between the prediction and the actual value, $(\hat{y}_t - y_t)$, and the overall loss is the sum of the losses over the time steps from $t$ to $T$. Theoretically, the original RNN can handle long-term input dependencies, but in practice training such networks leads to vanishing or exploding gradients, a problem that worsens as the number of time steps in the input sequence increases [18]. Suppose the ECG data have a total error over all time steps $T$, as in Equation (5). By applying the chain rule, Equation (5) can be expanded into Equation (6), which contains the derivative of the hidden state that stores memory at time $t$ with respect to the hidden state at a previous time $k$. This term involves the Jacobian matrices between the current time $t$ and the earlier time $k$, as in Equation (7). Each Jacobian matrix in Equation (7) is given by $W^{T}\,\mathrm{diag}[f'(h_{t-1})]$, whose eigendecomposition produces the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, where $|\lambda_1| > |\lambda_2| > \ldots > |\lambda_n|$, with corresponding eigenvectors $\nu_1, \nu_2, \ldots, \nu_n$. If the largest eigenvalue satisfies $|\lambda_1| < 1$, the gradient vanishes; conversely, if $|\lambda_1| > 1$, the gradient explodes. To overcome the vanishing and exploding gradient problems of the standard RNN, LSTM and GRU can be used [18].
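The eigenvalue condition can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a scalar hidden state, a tanh activation, and the common recurrence $h_t = \tanh(w_{hx} x_t + w_{hh} h_{t-1} + b_h)$. The function names `rnn_forward` and `state_gradient` are hypothetical.

```python
import math


def rnn_forward(xs, w_hx, w_hh, b_h=0.0):
    """Scalar vanilla RNN (assumed form): h_t = tanh(w_hx*x_t + w_hh*h_{t-1} + b_h), h_0 = 0."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_hx * x + w_hh * h + b_h)
        hs.append(h)          # hs[i] holds h_{i+1}
    return hs


def state_gradient(hs, w_hh, k, t):
    """d h_t / d h_k as a product of one-step Jacobians, for 1 <= k < t.

    Each factor is w_hh * tanh'(pre-activation) = w_hh * (1 - h_i^2).
    """
    grad = 1.0
    for i in range(k + 1, t + 1):
        grad *= w_hh * (1.0 - hs[i - 1] ** 2)
    return grad


T = 50
xs = [0.0] * T   # zero input keeps h_t = 0, so tanh' = 1 and each Jacobian equals w_hh
for w_hh in (0.5, 1.0, 1.5):
    hs = rnn_forward(xs, w_hx=1.0, w_hh=w_hh)
    print(w_hh, state_gradient(hs, w_hh, k=1, t=T))
```

With zero input the one-step Jacobian equals `w_hh`, so the gradient across 49 steps is `w_hh` raised to the power 49: it shrinks towards zero for 0.5, stays at 1 for 1.0, and grows rapidly for 1.5.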

Long Short-Term Memory
The gating mechanism controls the amount of information from the previous time step that contributes to the current output. With this gating mechanism, LSTM overcomes the vanishing or exploding gradients of the standard RNN, which has no gates [29]. The LSTM gate mechanism implements three components: (1) input, (2) forget, and (3) output gates [29]. The input to the LSTM layer must be a 3-dimensional array (samples, time steps, features). The samples are the number of windows in the training or testing set, the time steps are the 15 ECG leads, and the features are the 4000 samples (4 s) in each window.
In the LSTM algorithm, the process consists of a forward and a backward pass, as shown in Figure 4. The forward pass is calculated for an input $x$ of length $T$ by starting at $t = 1$ and recursively applying the update equations while incrementing $t$. The subscripts $i$, $f$, and $o$ refer to the input, forget, and output gates of the block, respectively. The subscript $c$ refers to one of the $C$ memory cells. At time $t$, the LSTM receives a new input in the form of the vector $x_t$ (including the bias) and the output vector $h_{t-1}$ of the previous time step ($\otimes$ denotes the element-wise Hadamard product).
Weights from cell $c$ to the input, forget, and output gates are denoted $w_i$, $w_f$, and $w_o$, respectively. The cell input is given by
$$a_t = \tanh(W_c x_t + U_c h_{t-1}) \qquad (8)$$
Ignoring the non-linearity, the expression in Equation (8) is a linear function of $x_t$ and $h_{t-1}$. Then, the memory cell value is updated by combining $a_t$ and the contents of the previous cell $c_{t-1}$. The combination is weighted by the magnitudes of the input gate $i_t$ and the forget gate $f_t$. Finally, the LSTM cell calculates the output value by passing the updated cell value through a non-linearity. The backward pass starts from $t = T$ and recursively calculates the unit derivatives at each time step. As in the standard RNN, all states and activations are initialized to zero at $t = 0$, and all $\delta = 0$ at $t = T + 1$.
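The forward step described above can be sketched in plain Python. This is a hedged illustration of the standard LSTM formulation, not a transcription of the paper's equations: the sigmoid gate activations, the update $c_t = i_t \otimes a_t + f_t \otimes c_{t-1}$, and the output $h_t = o_t \otimes \tanh(c_t)$ are assumptions taken from the textbook form, and the names `lstm_step`, `affine`, and `params` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector) -> Vector:
    """Return W x + U h; the bias is assumed to be folded into x, as in the text."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix]]) -> Tuple[Vector, Vector]:
    """One forward step; params maps "c", "i", "f", "o" to (W, U) weight pairs."""
    def unit(name, activation):
        W, U = params[name]
        return [activation(z) for z in affine(W, x, U, h_prev)]

    a = unit("c", math.tanh)   # cell input, the form of Equation (8)
    i = unit("i", sigmoid)     # input gate (assumed sigmoid)
    f = unit("f", sigmoid)     # forget gate (assumed sigmoid)
    o = unit("o", sigmoid)     # output gate (assumed sigmoid)
    # Element-wise (Hadamard) combination of new input and previous cell content.
    c = [i_k * a_k + f_k * c_k for i_k, a_k, f_k, c_k in zip(i, a, f, c_prev)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Toy example: 2 input features, 1 hidden unit, all weights zero.
zero = ([[0.0, 0.0]], [[0.0]])
params = {name: zero for name in ("c", "i", "f", "o")}
h, c = lstm_step([1.0, -1.0], [0.0], [1.0], params)
print(h, c)
```

With all weights at zero every gate evaluates to 0.5 and the cell input to 0, so the new cell value is half the previous one. This makes the role of the forget gate easy to see in isolation.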

Gated Recurrent Unit
The Gated Recurrent Unit (GRU) architecture consists of two gates: a reset gate and an update gate [31]. Essentially, these are two vectors that decide what information should be passed to the output. The GRU algorithm is described by the flowchart presented in Figure 5.
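As a rough companion to the flowchart, one GRU step can be sketched as follows. The formulation is an assumption: it uses the common convention $h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$, while some references swap the roles of $z_t$ and $1 - z_t$. Biases are omitted for brevity, and the names `gru_step` and `params` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def gru_step(x: Vector, h_prev: Vector,
             params: Dict[str, Tuple[Matrix, Matrix]]) -> Vector:
    """One forward step; params maps "z", "r", "h" to (W, U) weight pairs."""
    def pre(name: str, h: Vector) -> Vector:
        W, U = params[name]
        return [a + b for a, b in zip(matvec(W, x), matvec(U, h))]

    z = [sigmoid(v) for v in pre("z", h_prev)]         # update gate
    r = [sigmoid(v) for v in pre("r", h_prev)]         # reset gate
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] # reset applied to old state
    h_tilde = [math.tanh(v) for v in pre("h", gated)]  # candidate state
    # Assumed convention: z close to 0 keeps the old state, z close to 1 replaces it.
    return [(1.0 - z_k) * h_k + z_k * ht_k
            for z_k, h_k, ht_k in zip(z, h_prev, h_tilde)]


# Toy example: 1 input feature, 1 hidden unit, all weights zero.
zero = ([[0.0]], [[0.0]])
params = {name: zero for name in ("z", "r", "h")}
print(gru_step([1.0], [1.0], params))
```

Compared with the LSTM, the GRU has no separate memory cell: the update gate alone decides how much of the previous hidden state is carried forward.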

The hidden-state models of vanilla RNN, LSTM, and GRU are given by the equations listed in Table 1 (columns: Classifier, Hidden State Model). They differ in how the parameters of each hidden state are calculated.

Evaluation Performance
The learning process of the neural networks is validated to determine whether the conceptual model for interpreting MI via ECG signals is an accurate representation of the real system being modeled. The binary classification between the MI and normal heart classes is evaluated with a confusion matrix, which contains information about the actual classes and the predictions made by the classification system. The data in the classification process are divided into two classes, positive (P) and negative (N). The classification produces four outcomes: two correct outcomes, true positive (TP) and true negative (TN), and two incorrect outcomes, false positive (FP) and false negative (FN) [32] (Table 2). For the overall testing results on the imbalanced data, this study uses Balanced Accuracy (BACC) in Equation (15) and Matthews Correlation Coefficient (MCC) in Equation (16).
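The metrics named in this section can be computed from the four confusion-matrix counts as in the sketch below. The formulas are the standard definitions and are assumed to match the paper's Equations (15) and (16); the function name `binary_metrics` and the example counts are hypothetical and are not results from the study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary metrics; assumes both classes occur (tp + fn > 0, tn + fp > 0)."""
    sensitivity = tp / (tp + fn)                  # true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    bacc = (sensitivity + specificity) / 2        # mean of the per-class recalls
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "bacc": bacc, "mcc": mcc}


# Hypothetical imbalanced example: 1000 positives, 200 negatives.
for name, value in binary_metrics(tp=950, fp=50, tn=150, fn=50).items():
    print(f"{name}: {value:.4f}")
```

In this made-up example plain accuracy would be about 92%, while BACC is lower because it weights the minority class equally. That is the reason for preferring BACC and MCC on imbalanced data.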

Results and Discussion
Three main sequence models, i.e., vanilla RNN, LSTM, and GRU, are compared for classifying MI and healthy control. The same hyper-parameters are used for all sequence model classifiers. All models were trained for 100 epochs with the Adam optimization method and a learning rate of 0.0001, in Jupyter Notebook on an NVIDIA GeForce RTX 2080 GPU. The average time per epoch for the most complex classifier was 13 s. The number of feed-forward neural networks (FFNN) in a unit differs between the sequence model classifiers: vanilla RNN, LSTM, and GRU have one, four, and three FFNN, respectively. In LSTM and GRU, the additional FFNN are introduced by the gates. Knowing the number of FFNN before and after quantization of the sequence model is useful, because quantization can reduce the size of the model file and even the time needed for model inference.
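The one, four, and three FFNN per unit translate directly into trainable parameter counts. The sketch below uses the textbook count of one weight matrix over the concatenated input and hidden state plus one bias vector per FFNN. The number of hidden units is not stated in the text, so `n_units=64` is purely an assumption, and library implementations may add extra bias terms that change the exact totals.

```python
# Number of feed-forward sub-networks per recurrent unit, as stated in the text.
FFNN_PER_UNIT = {"rnn": 1, "gru": 3, "lstm": 4}


def recurrent_params(kind: str, n_features: int, n_units: int) -> int:
    """Textbook parameter count of one recurrent layer.

    Each FFNN has n_units * (n_features + n_units) weights and n_units biases.
    """
    per_ffnn = n_units * (n_features + n_units) + n_units
    return FFNN_PER_UNIT[kind] * per_ffnn


N_FEATURES = 4000   # samples per 4 s window, from the text
N_UNITS = 64        # hypothetical; the paper does not report this value
for kind in ("rnn", "gru", "lstm"):
    print(kind, recurrent_params(kind, N_FEATURES, N_UNITS))
```

Under these assumptions the LSTM layer has four times and the GRU layer three times as many parameters as the vanilla RNN layer, which is consistent with the LSTM being the most expensive classifier per epoch.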
Furthermore, five data partitioning ratios for the training and testing sets are compared for the sequence modeling classifiers: 90%:10%, 80%:20%, 70%:30%, 60%:40%, and 50%:50% (Table 3). The 12,359 sequences are randomly separated with automatic data splitting (shuffled sampling), so that data used for training are never used for testing and vice versa. We trained all the sequence model classifiers to obtain an optimum model. The healthy control and MI classes are heavily imbalanced, with an imbalance ratio of 4.57. We first considered partitions in which the training set is larger than the testing set, and then the partition with equal ratio. Overall, among the partitions with a training set larger than the testing set, 90%:10% gives the best result across the sequence model classifiers, with average sensitivity, specificity, precision, and F1-score of 90.45%, 94.66%, 93.37%, and 91.79%, respectively. With a larger training set, the algorithms can better learn the patterns in the data and identify specific examples. This can reduce the computational time of validation in the long term because it helps to prevent excessive overfitting.
To evaluate classifier performance on the imbalanced data, Table 3 also reports BACC and MCC for the binary classification. BACC is the average of the proportions of correctly classified samples, calculated for each class individually. If the two classes are balanced, BACC is not needed and the 'regular' accuracy metric is sufficient. MCC takes values in the interval [−1, 1], where 1 indicates complete agreement, −1 complete disagreement, and 0 a prediction uncorrelated with the label [26]. A coefficient of +1 represents a perfect prediction, because MCC takes into account the balance ratios of the TP, TN, FP, and FN categories. For the best result among all data partitioning ratios, the average BACC and MCC are 94.81% and 92.98%, respectively.
For all data partitioning ratios in Table 3, the vanilla (standard) RNN does not learn properly. This is due to the vanishing or exploding gradient: the large increase in the norm of the gradient during training, or the opposite behavior, in which long-term components go exponentially fast to norm zero. LSTM and GRU are used to overcome these problems and show better results than vanilla RNN. The best sequence model classifier is LSTM with a 90%:10% training and testing split, with sensitivity, specificity, precision, F1-score, BACC, and MCC of 98.49%, 97.97%, 95.67%, 96.32%, 97.56%, and 95.32%, respectively (see Figures 6 and 7). With our proposed sequence model, specifically the LSTM, the MI class can be detected properly.
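The shuffled partitioning into disjoint training and testing sets can be sketched as follows. This is an illustration under assumptions: the paper does not report a random seed, a rounding rule, or whether the split was stratified by class, so the seed, the use of `round`, and the plain (non-stratified) shuffle are all hypothetical choices, as is the function name `shuffled_split`.

```python
import random
from typing import List, Tuple


def shuffled_split(n_items: int, train_fraction: float,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n_items-1 and cut them into disjoint train/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # seed is an assumption
    n_train = round(n_items * train_fraction)     # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]


N_SEQUENCES = 12_359   # total number of 4 s windows reported in the text
for fraction in (0.9, 0.8, 0.7, 0.6, 0.5):
    train_idx, test_idx = shuffled_split(N_SEQUENCES, fraction)
    print(f"{fraction:.0%} train: {len(train_idx)} sequences, test: {len(test_idx)}")
```

Because the two index lists come from a single shuffled list, no sequence can appear in both sets, which is the property required above.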

Conclusions
The characteristic of deep learning is that it automates the feature learning process without hand-crafted features. Recurrent network classifiers are deep learning models used here for the binary classification of sequential data. These classifiers differ in the number of parameters used in the training process. The shared weights in a recurrent network are an advantage because there are far fewer parameters to train. The standard recurrent network suffers from the vanishing or exploding gradient problem. The gating mechanisms in LSTM and GRU control the information passed between time steps to minimize this problem. With the few ECG pre-processing stages used in our study, a simple LSTM network gave better classification performance on the training and testing sets than the standard RNN and GRU. This is because the LSTM is able to store more information about the patterns in the data than the standard RNN and GRU. Through its forget gate, LSTM learns and selects which data need to be stored or discarded, which improves its performance over the other methods compared. The LSTM structure with a 90%:10% training and testing split achieves a sensitivity, specificity, precision, and F1-score of 98.49%, 97.97%, 95.67%, and 96.32%, respectively. Furthermore, for evaluating binary classification on imbalanced data, MCC and BACC have a closed form and are well suited to building the optimal classifier. However, the performance in the initial stage was unsatisfactory because of the lack of ECG signal processing before classification by the sequence modeling classifier. Our LSTM model suggests the presence of crucial information in the 15-lead ECG for predicting the future clinical course, especially for detecting chest discomfort in real time.

Figure 2. The ECG segmentation into windows of 4 s each.

Figure 3. The forward and backward pass in a standard recurrent neural network (RNN) [18].

Figure 6. The plot of accuracy of the LSTM architecture with 90% of training and 10% of testing set.

Figure 7.

Table 1. The hidden state in the sequence modeling classifiers.

Table 2. The diagnostic test.

Table 3. The results of the sequence model classifier performance.