An Automated ECG Beat Classification System Using Deep Neural Networks with an Unsupervised Feature Extraction Technique

Abstract: This paper proposes an automated classification system based on a Deep Learning (DL) technique for Cardiac Disease (CD) monitoring and detection. The proposed DL architecture combines Deep Auto-Encoders (DAEs), used for unsupervised feature learning, with Deep Neural Networks (DNNs), used as the classifier. The objective is to improve on previous machine learning techniques, which consist of several separate data processing steps such as feature extraction and feature selection or feature reduction. Those techniques also required human intervention and expertise to determine robust features, and their labeling and data processing steps were time-consuming. In contrast, DL embeds feature extraction and feature selection in the DAEs pre-training and DNNs fine-tuning processes, working directly from raw data. Hence, the DAEs are able to extract high-level features not only from the training data but also from unseen data. The proposed model uses 10 classes of imbalanced data from ECG signals. Because these abnormalities relate to the cardiac region, they are usually considered for an early diagnosis of CD. To validate the result, the proposed model is compared with shallow models and other DL approaches. The proposed method achieved promising performance, with 99.73% accuracy, 91.20% sensitivity, 93.60% precision, 99.80% specificity, and a 91.80% F1-Score. Moreover, both the Receiver Operating Characteristic (ROC) curve and the Precision-Recall (PR) curve derived from the confusion matrix showed that the developed model is a good classifier. The developed model, based on unsupervised feature extraction and a deep neural network, is ready to be applied to a large population before it is installed for clinical use.


Introduction
Artificial Intelligence (AI) techniques have been widely used to improve the quality of patient life and care through the early diagnosis of disease. Such techniques have become popular because they can not only reduce cost and mortality but also provide good predictions that facilitate precise treatment [1][2][3]. Cardiac Disease (CD) monitoring and detection, in particular, is a difficult task that requires identifying patterns and interactions among variables using various techniques [2]. Despite significant efforts, it has recently been shown that conventional methods based on a simple score do not perform well, and the results obtained by such methods remain unsatisfactory to date. Computer-based methods are a potential solution that allows cardiologists to observe CD in long-term recordings with better diagnostic results. However, an optimal CD prediction must be assessed with metrics such as sensitivity, specificity, and accuracy. These metrics help the cardiologist to predict outcomes accurately and effectively. They cannot be evaluated using only a simple score (i.e., probability and statistical methods) or traditional CD risk factors (i.e., diabetes, hypertension, and smoking). Currently, Machine Learning (ML) algorithms can overcome these drawbacks by learning automatically, which helps to build a recommendation system. The results can help a medical doctor or cardiologist make more accurate and sensitive predictions [2][3][4].
The learning process is an important stage of ML algorithms for producing accurate diagnoses and predictions. It can generally be categorized into three types: supervised, unsupervised, and reinforcement learning. In supervised learning, algorithms use a dataset labeled by experts. The algorithms develop a model to predict or classify future events or to determine which variables are most relevant to the outcome. ML with supervised learning produces excellent results in classification and regression problems [5]. However, it requires a large amount of data to be labeled by humans, which is time-consuming [5]. In contrast, unsupervised learning seeks to identify novel disease mechanisms from hidden patterns present in the data without feedback from humans. Unfortunately, it tends to produce some bias, because the initial cluster patterns are not validated against other groups of data [6]. Reinforcement learning is described here as a combination of supervised and unsupervised learning. However, to obtain high accuracy, reinforcement learning needs a trial-and-error phase, which is very time-consuming and can also produce untrustworthy results [3,5]. Hence, an improvement in the learning process that yields high performance with trustworthy values is desirable.
There are three phases in CD prediction with an ML algorithm: training, validation, and testing. Unfortunately, with too little training data, CD prediction can be biased and inaccurate in the testing phase [6]. To improve the diagnosis results, more data are required for training the model [2,6]. Deep learning (DL) is an ML approach that has been successfully implemented in a diverse range of biomedical fields with large datasets, as presented in references [7,8]. This approach has become a promising field and has proliferated in recent years [9,10]. The DL architecture mimics the operation of the human brain, using multiple layers of artificial neurons to generate automated predictions [5,9,10].
The DL architecture with an unsupervised learning approach has been widely implemented. It has been used to explore novel factors in score systems or add hidden risk factors to existing models [11], classify novel genotypes and phenotypes from heterogeneous cardiac diseases [2], detect lymph node metastases from breast cancer [12], detect cardiomyopathy [13], and predict the risk of bleeding and stroke in order to provide the optimal dose and duration of anticoagulant therapy and to identify additional stroke risk factors [14]. In the diagnosis of cardiac disease, the implementation of DL produces good results [1,5,10]. The algorithms provide a very in-depth analysis for real-time cardiac imaging with better spatial and temporal resolution. This potentially improves the quality of health care and reduces costs [15][16][17][18][19]. Such algorithms can be trained using an unsupervised learning approach with unlimited memory [9,20,21], and they are also suitable for noisy data [5,15,16].
Unfortunately, there are some challenges in using DL: (i) nonlinear training in deep learning involves a large number of parameters and layers, and if it is not handled properly, it can cause the model to overfit, so that prediction performance is poor; (ii) the analysis requires a graphics processing unit to accelerate the computation; (iii) setting up the parameters of a deep learning architecture is time-consuming; and (iv) multiple layers may increase the training time without providing any improvement in precision and accuracy. Hence, selecting the most suitable DL architecture and its parameters is a necessary area of study to help produce the best results in the diagnosis and prediction of cardiac disease.

Deep Learning
Deep Neural Networks (DNNs) trained with the back-propagation algorithm have limitations in specific applications, as reported in references [9] and [10], because plain back-propagation is not well suited to deeper networks. In addition, there are two main drawbacks in terms of the learning process: (1) the DNNs model tends to fall into local minima because of random weight initialization at the start of training; and (2) most data are unlabeled, whereas DNNs with random initialization require labeled data [9]. These problems were addressed by the groundbreaking work of Geoffrey Hinton and his colleagues in 2006 [22]. Their contribution, called unsupervised greedy layer-wise pre-training, is an effective solution for overcoming the problems of traditional back-propagation [20,21]. The Auto-Encoder (AE) is one method that can learn generic features using greedy layer-wise training. The process is fast and provides outstanding results on deep network architectures for classification and prediction problems [22]. Moreover, a deep AE model can be constructed to improve performance. A deep AE model sets the training target equal to the input data; the network is then trained with back-propagation by feeding in the input and calculating the error between the reconstructed input and the original one.

Deep Auto Encoder
Feature extraction is an important phase in the learning process for obtaining appropriate and robust features. If feature quality is low, performance and generalization may be poor even with powerful classification algorithms. Powerful feature extraction techniques include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). However, these methods cannot extract features directly from the network structure, and they usually require a trial-and-error process, which is time-consuming [23,24]. With an AE, features can be extracted from the raw input data automatically. Automatically extracted features improve the performance of predictive models while reducing the complexity of the function design task. An AE consists of an encoder and a decoder. The encoder works as in DNNs, transforming the input vector into a hidden-layer representation with a weight matrix and a bias/offset vector. The decoder, in turn, maps the hidden-layer representation to the reconstructed input, which is regarded as the output.
Dimensionality reduction with an AE is performed by extracting low-dimensional features from high-dimensional spectral envelopes in a non-linear and unsupervised way. Let x be the input vector, y the hidden representation vector, and z the reconstruction vector. The reconstruction and update flow can be represented by the encoder and decoder processes. The hidden-layer weight matrix and bias vector are W and b, respectively, while those of the output layer are W′ and b′. The function σ is an activation function, and η is the learning rate. A deterministic mapping called the encoder is y = f_θ(x) = σ(Wx + b), and a reverse mapping called the decoder is z = g_θ′(y) = σ(W′y + b′). The weights are usually tied, with W′ = W^T. The parameters of the AE are trained by minimizing an objective function [25][26][27] in which L is a cost function; here, the mean squared error (MSE) is used. For each ECG signal input x, a hidden representation y is computed for the kth feature map.
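The display equations that this paragraph refers to do not appear in the text as given. As a hedged reconstruction only, a standard tied-weight auto-encoder formulation that is consistent with the symbols defined above (x, y, z, W, b, W′, b′, σ) could be written as follows. The sample index i, the sample count n, the 1/n normalization, and the per-feature-map superscript k are notational assumptions, and the original equation numbering is not recovered.

```latex
\begin{align*}
y &= f_{\theta}(x) = \sigma(Wx + b), \qquad \theta = \{W, b\} \\
z &= g_{\theta'}(y) = \sigma(W'y + b'), \qquad \theta' = \{W', b'\}, \quad W' = W^{T} \\
\theta^{*}, \theta'^{*} &= \arg\min_{\theta,\,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\bigl(x^{(i)}, z^{(i)}\bigr) \\
L(x, z) &= \lVert x - z \rVert^{2} \\
y^{k} &= \sigma\bigl(W^{k} x + b^{k}\bigr)
\end{align*}
```

The learning rate η would enter through the gradient-descent update of W and b on this objective; the exact update rule used in the study is not stated in this section.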

Proposed Deep Learning Structure
The proposed DL architecture is divided into two phases, Deep Auto-Encoders (DAEs) pre-training and DNNs fine-tuning, as presented in Figure 1. A fully connected layer is added on top of the encoder part of the DAEs. Figure 1 gives a conceptual depiction of the classifier architecture built on the pre-trained DAEs model. In some cases, the DAEs structure can be stacked during pre-training to improve the quality of signal reconstruction and to obtain effective initial parameters for fine-tuning via unsupervised learning. However, if the pre-training already produces good data reconstruction, the encoder output can be used directly as input to the classifier, and the weights can be fine-tuned using back-propagation. Both phases require substantial computational resources and a large amount of data for learning.
The classifier is trained with input x, and its output is annotated with a label. The Softmax function is used as the activation function of the classifier's output layer, so the output of each unit can be treated as the probability of the corresponding label [25][26][27]. Let N be the number of units in the output layer, x the input, and $x_i$ the output of unit i. Then the output p(i) of unit i is defined by Equation (5):
$$p(i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \qquad (5)$$
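The Softmax output layer described above can be illustrated with a short NumPy sketch. This is not the authors' code: the function name `softmax`, the max-subtraction used for numerical stability, and the ten example activations are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    """Map output-layer activations x_1..x_N to probabilities p(1)..p(N)."""
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical activations of a 10-unit output layer (one unit per beat class).
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, -0.5, 1.5, 0.3, -2.0])
p = softmax(logits)
predicted_class = int(np.argmax(p))  # index of the most probable class
```

Because the outputs sum to one, the largest entry of `p` can be read directly as the predicted beat class.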

Cross entropy is used as the loss function of the classifier, $L_f$, as follows:
$$L_f = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log p_{ij} \qquad (6)$$
where n is the sample size, m is the number of classes, $p_{ij}$ is the classifier output for class j of the i-th sample, and $y_{ij}$ is the annotated label for class j of the i-th sample.
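A minimal NumPy sketch of this loss is given below, assuming one-hot labels and averaging over the n samples. The function name `cross_entropy`, the clipping constant `eps`, and the small example arrays are hypothetical and only illustrate the computation.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean categorical cross-entropy.

    p: (n, m) array of predicted class probabilities.
    y: (n, m) array of one-hot annotated labels.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return float(-np.sum(y * np.log(p)) / p.shape[0])

# Two samples, three classes (illustrative values only).
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([[1, 0, 0],
              [0, 1, 0]])
loss = cross_entropy(p, y)
```

Only the probability assigned to the annotated class contributes to each sample's term, so the loss falls toward zero as those probabilities approach one.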

Data Preparation
The raw ECG datasets used in the present study are available from the MIT-BIH arrhythmia database (https://physionet.org/physiobank/database/mitdb/) [28,29]. All of the ECG beat data are annotated at the R-peak locations, and there are up to 16 different types of arrhythmias. Under the AAMI standard [29], the database contains 22 types of beats within 5 groups of arrhythmias. However, only 10 types are used in this study [10,23,30]: normal (N), atrial premature contraction (A), premature ventricular contraction (V), right bundle branch block (R), left bundle branch block (L), paced (P), ventricular flutter wave (!), fusion of ventricular and normal (F), fusion of paced and normal (f), and nodal escape (j). The data sets for these 10 types of ECG beats are presented in Table 1.

Segmentation and Reconstruction
Automatic beat segmentation of the ECG using minimal heuristic a priori information is an essential problem in the clinical diagnosis of heart disease. The various segments of the ECG have different physiological meanings, and the presence, timing, and duration of each segment all have diagnostic and bio-physical importance. In this study, features are extracted from a single segment, which usually contains only one beat. A raw sample of the ECG rhythm signal in the normal condition is illustrated in Figure 2a. The normal rhythm is segmented into single beats by determining the P wave, QRS complex, and T wave, which are directly related to the R-peak location. The segmentation of an ECG signal is presented in Figure 2b. In general, the rate of the ECG rhythm is between 60 and 80 beats per minute [15]. Segmentation starts by finding the R-peak position; once it is detected, a segment of about 0.7 s is sampled for one beat. The segment is divided into two intervals: t1, 0.25 s before the R-peak, and t2, 0.45 s after the R-peak (see Figure 2b). The 0.7-s segment contains about 252 nodes (samples), divided into 90 nodes for t1 and 162 nodes for t2. The result of the sampling process is illustrated in Figure 2c.
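The windowing step can be sketched as follows. The sketch assumes that R-peak sample indices are already available (for example, from the database annotations) and that the sampling rate is 360 Hz, which is what 252 samples in 0.7 s implies. The function name `segment_beats` and the synthetic signal are hypothetical.

```python
import numpy as np

FS = 360          # Hz; assumed, consistent with 252 samples per 0.7 s
N_BEFORE = 90     # samples in t1 (0.25 s before the R-peak)
N_AFTER = 162     # samples in t2 (0.45 s after the R-peak)

def segment_beats(signal, r_peaks):
    """Cut one 252-sample window around each R-peak index.

    Peaks too close to either end of the record are skipped.
    """
    signal = np.asarray(signal, dtype=float)
    beats = []
    for r in r_peaks:
        start, stop = r - N_BEFORE, r + N_AFTER
        if start >= 0 and stop <= len(signal):
            beats.append(signal[start:stop])
    if not beats:
        return np.empty((0, N_BEFORE + N_AFTER))
    return np.stack(beats)

# Synthetic 10-second record with made-up R-peak positions.
rng = np.random.default_rng(0)
ecg = rng.normal(size=10 * FS)
r_peaks = [50, 400, 760, 1120, 3500]
beats = segment_beats(ecg, r_peaks)  # first and last peaks are too near the edges
```

Each row of `beats` is one fixed-length beat with the R-peak at column 90, which gives every beat the same 252-feature layout before feature learning.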
After the segmentation process, the DAEs are used to extract features from each beat automatically. As shown in Figure 2, the DAEs work in two processes: compressing and reconstructing. In compression, the encoder decreases the dimensionality of the input down to the layer with the fewest neurons, called the latent space. In reconstruction, the decoder then tries to reconstruct the input from this low-dimensional representation. In this way, the latent space forms a bottleneck, which forces the DAEs to learn an effective compression of the data. In this study, four different topologies are considered. Figure 3 shows a normal beat and a right bundle branch block (RBBB) beat as samples of the ECG signal after segmentation and reconstruction. Figure 3a and 3c depict the initial ECG signal and the output of the DAEs. The raw data contain some high- and low-frequency noise, which is cancelled by the DAEs after reconstruction, as presented in Figure 3b and 3d.
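The compress-then-reconstruct flow can be illustrated with an untrained forward pass in NumPy. Only the input width (252) and the latent width (32) come from the text; the intermediate layer sizes 128 and 64, the ReLU activations, the linear output layer, and the random weights are assumptions for illustration, not the topologies evaluated in the study.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_layers(sizes, rng):
    """Create one (W, b) pair for each consecutive pair of layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        layers.append((W, np.zeros(n_out)))
    return layers

def forward(x, layers, last_linear=False):
    """Apply the layers in order; optionally leave the last one linear."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if not (last_linear and i == len(layers) - 1):
            x = relu(x)
    return x

rng = np.random.default_rng(0)
enc_sizes = [252, 128, 64, 32]               # 128 and 64 are hypothetical
encoder = init_layers(enc_sizes, rng)
decoder = init_layers(enc_sizes[::-1], rng)  # mirrored: 32 -> 64 -> 128 -> 252

beats = rng.normal(size=(8, 252))            # stand-in for 8 segmented beats
latent = forward(beats, encoder)             # compression to the bottleneck
recon = forward(latent, decoder, last_linear=True)  # reconstruction
mse = float(np.mean((beats - recon) ** 2))   # error that pre-training would minimize
```

In pre-training, the weights would be adjusted to reduce `mse`; afterwards only the encoder is kept, and its 32-dimensional output feeds the classifier.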

Classifier Structure
The proposed DL architecture was constructed by combining the encoding layers of the pre-trained DAEs with the fine-tuned DNNs part, which adds the fully connected layer. The classifier uses ReLU as the activation function, cross-entropy as the loss function, and Adam as the optimization method, with the learning rate initially set to 0.1 and gradually decreased to 0.0001. Adam optimization allows the use of adaptive learning rates for each parameter. Among the values tested, a learning rate of about 0.001 produced a good loss value. Figure 1 shows the details of the proposed structure for the fully connected classifier. We ran 10 experiments to select the best deep learning model structure (see Table 2). All parameters were tuned to obtain the smallest loss value while avoiding overfitting, together with high accuracy, sensitivity, specificity, precision, and F1-Score. In addition, the deep structure must have a short processing time and small memory usage. Table 2 shows that DL performance improved when using the ReLU activation function rather than the Sigmoid activation function. Moreover, ReLU is simpler to compute than sigmoidal functions, and it has been reported to work better than sigmoidal functions [31]. Based on all performance results, model 8, with three hidden layers in pre-training and three hidden layers in fine-tuning, was selected as the best model. Table 3 shows the structure of model 8, the proposed DL structure used in the present study.
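To make the per-parameter adaptive learning rate of Adam concrete, the following NumPy sketch applies the standard Adam update to a toy quadratic. It is not the training code of the study: the function name `adam_minimize`, the toy objective, the step count, and the default values of `beta1`, `beta2`, and `eps` are assumptions; only the learning rate of 0.001 is taken from the text.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=20000):
    """Run the standard Adam update for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # running mean of gradients
    v = np.zeros_like(theta)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Each parameter gets its own effective step size lr / sqrt(v_hat).
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([1.0, -2.0])
theta = adam_minimize(lambda th: 2.0 * (th - target), np.zeros(2), lr=0.001)
```

The division by the square root of `v_hat` is what scales the step separately for each parameter, which is the property the paragraph above refers to.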


Results
Figure 4a-c illustrates the data distribution among the classes. The initial ECG data distribution, which consists of 252 features representing a raw ECG signal, was unstructured and imbalanced, so the classes can influence one another. To minimize the complexity of the data by reducing its dimensionality, PCA and DAEs are used and compared. PCA is one of the main linear dimensionality reduction techniques for extracting effective features from high-dimensional data, while DAEs are a nonlinear technique. Nonlinear techniques have an advantage over linear techniques for real-world problems because real-world data, such as medical data, are nonlinear in nature. Previous research has observed that nonlinear techniques perform better than linear techniques on artificial tasks and can cope with poor-quality natural datasets [25,27].
In this study, the DAEs reduced the features from 252 to 32, while PCA with a cumulative energy value of 0.99 also lowered the initial features to 32 for the 10 (ten) classes. Figure 4b,c shows a significant difference between the DAEs and PCA in terms of the data distribution plots. PCA assumes that the data lie on or near a linear subspace of the high-dimensional space. DAEs, however, do not rely on this linearity assumption, and as a result they can capture more complex embeddings of the data in the high-dimensional space. Therefore, the data distribution from PCA remains unstructured and centered at a certain point, which makes it challenging to discriminate the data patterns of each class. The DAEs, on the other hand, produce a better data distribution: the feature distribution becomes clearer, so that the minority classes are not polarized toward the larger classes.
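Selecting the number of principal components by a cumulative energy threshold of 0.99 can be sketched with NumPy's SVD. The function name `pca_by_energy` and the synthetic low-rank data are assumptions for illustration; on the real 252-feature beats the text reports that this criterion yields 32 components.

```python
import numpy as np

def pca_by_energy(X, energy=0.99):
    """Project X onto the fewest principal components whose cumulative
    explained-variance ratio reaches the given energy threshold."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                            # proportional to component variance
    cum = np.cumsum(var) / np.sum(var)      # cumulative energy
    k = min(int(np.searchsorted(cum, energy)) + 1, len(s))
    components = Vt[:k]
    return Xc @ components.T, components, cum[:k]

# Synthetic stand-in: 500 "beats" with 252 features driven by 3 latent factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 252))
X += 1e-3 * rng.normal(size=X.shape)        # small measurement noise
Z, components, cum = pca_by_energy(X, energy=0.99)
k = Z.shape[1]                              # number of retained components
```

The projection is linear by construction, which is the limitation the paragraph contrasts with the nonlinear mapping learned by the DAEs.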
Moreover, the DAEs were able to eliminate spikes from the original signal without losing information in the ECG signal and were also able to learn nonlinear feature representations, as presented in Figure 4. The colors in Figure 4a-c show the 10 (ten) arrhythmia classes: N, A, V, R, L, P, !, F, f, and j.
To obtain optimum classification performance, various DL structures were examined in the learning process. Ten structures were defined and validated prior to the selection of the best model. All classifiers were evaluated in two processes: training and testing. The processing time was also investigated, because the structure must be embeddable in hardware as an ECG interpretation module for further application. The processing time results for all 10 DL structures are presented in Table 2.
Table 2 reports the training time for feature learning and for the classifier, as well as the testing time. There is a trade-off between model complexity and processing time: it is worth noting that the processing time increases with a more complex DL structure, whereas a smaller architecture is more feasible for application in a real-time system. Hence, we selected the DAEs architecture with 5 layers combined with DNNs with 3 hidden layers (model 8). It performs well in terms of accuracy, sensitivity, specificity, precision, and F1-score compared with the other structures. As presented in Table 2, 10 structures of DNNs with a DAEs feature learning model were validated and selected for further analysis.
These structures were obtained by changing the number of neurons, the number of hidden layers, the activation functions, and the cost functions of the DAEs and DNNs structures. The training process uses 80% of the data, and the testing process uses the remaining 20%. The validation results for all models are presented in Tables 4 and 5; Table 4 shows the results of the various DL structures in the training process. Model 8 in Table 3 produces the best performance metrics for all classes in both the training and testing processes (see Tables 4 and 5). The training process yields 99.9% accuracy, 94.1% sensitivity, 99.9% specificity, 97.4% precision, and a 95.7% F1-Score, while the testing process yields 99.7% accuracy, 91.2% sensitivity, 99.6% specificity, 93.6% precision, and a 91.8% F1-Score. The confusion matrix was used to analyze the model's predictions for each class during training and testing. Tables 6 and 7 present the predicted classes for model 8 only, as the best model. According to Table 6, less than 5% of the ECG heartbeats are misclassified in the training and testing processes. These misclassifications are attributed to the imbalanced data, so the F1-Score is needed because it gives a trustworthy description of the model's performance. For our best model, the F1-Scores were 95.70% and 91.80% for training and testing, respectively (see Table 7).
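The five reported metrics can all be derived from a multi-class confusion matrix by treating each class one-vs-rest. The sketch below assumes that rows hold the true classes and columns the predicted classes, and that class-wise values are macro-averaged; whether the study averaged in exactly this way is not stated, and the 3-class matrix is made up for illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class i predicted as another class
    fp = cm.sum(axis=0) - tp   # other classes predicted as class i
    tn = total - tp - fn - fp
    with np.errstate(divide="ignore", invalid="ignore"):
        sensitivity = tp / (tp + fn)
        precision = tp / (tp + fp)
        metrics = {
            "accuracy": (tp + tn) / total,
            "sensitivity": sensitivity,
            "specificity": tn / (tn + fp),
            "precision": precision,
            "f1": 2 * precision * sensitivity / (precision + sensitivity),
        }
    return metrics

# Hypothetical imbalanced 3-class confusion matrix.
cm = [[50, 2, 0],
      [3, 40, 1],
      [0, 1, 3]]
m = per_class_metrics(cm)
macro = {name: float(np.nanmean(values)) for name, values in m.items()}
```

With imbalanced classes, the one-vs-rest accuracy and specificity are dominated by the many true negatives, which is why they can be close to 100% while sensitivity and the F1-Score are noticeably lower.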
To validate the proposed DL structure, the accuracy and loss curves are presented in Figures 5 and 6. Figure 5a,b shows that the errors on the training and testing data decreased as the number of epochs increased. Both curves have good shapes because the DAEs are able to extract high-level features not only from the training data but also from unseen data. In addition, the DAEs reconstruction result shows the effect of noise cancelation while maintaining the overall shape of the signal (see Figure 4).

For a metric scheme, the values of sensitivity and specificity change with the cutoff value. For every cutoff value, a point can be plotted at the coordinate (1 - Specificity, Sensitivity). The curve that connects all these points is called the Receiver Operating Characteristic (ROC) curve. An ROC curve that lies far above the diagonal, especially toward the upper-left corner, indicates predictions that are much better than random guessing. However, if the dataset is highly imbalanced, the shape of the ROC curve is biased and may be misleading [32,33]. The MIT-BIH arrhythmia dataset is imbalanced because the number of normal beats is much higher than the number of abnormal ones. Therefore, in this paper, the Precision-Recall (P-R) curve is used to overcome the limitation of the ROC curve. A larger area under the curve (AUC) in the P-R curve indicates better performance. Every value in the confusion matrix is related to the ROC and P-R curves. Both curves indicate the same performance in this study, and all the values show good classifier performance for the proposed model (see Figure 6).
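How ROC and P-R points arise from sweeping the cutoff can be sketched for a single class treated one-vs-rest. The function names and the four-sample example are hypothetical; the sketch assumes that both positive and negative samples are present.

```python
import numpy as np

def roc_pr_points(y_true, scores):
    """Return (FPR, TPR, precision) at every distinct cutoff, highest cutoff first."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    fpr, tpr, prec = [], [], []
    for cutoff in np.unique(scores)[::-1]:
        pred = scores >= cutoff
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        tpr.append(tp / n_pos)        # sensitivity (recall)
        fpr.append(fp / n_neg)        # 1 - specificity
        prec.append(tp / (tp + fp))   # precision
    return np.array(fpr), np.array(tpr), np.array(prec)

def trapezoid_area(x, y):
    """Area under a piecewise-linear curve with non-decreasing x."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, prec = roc_pr_points(y_true, scores)
# Prepend the (0, 0) corner so the ROC curve starts at the origin.
roc_auc = trapezoid_area(np.r_[0.0, fpr], np.r_[0.0, tpr])
```

The arrays `tpr` and `prec` give the recall and precision coordinates of the P-R curve at the same cutoffs; unlike the false positive rate, precision is sensitive to the ratio of positive to negative samples, which is why the P-R curve is preferred for imbalanced data.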

Discussion
The nonlinear processing in a stack of multiple layers is well suited to capturing highly varied functions with a concise set of parameters. Through unsupervised pre-training, DL places deeper networks in a region of parameter space that helps to avoid local minima. Whether large data sets are available or only a small amount of data, DL techniques achieve excellent performance, and often the best. Because only a few studies have used DL for arrhythmia classification, several experiments were designed to benchmark our proposed model.
We have selected several ML methods to benchmark our model, and the results are compared in terms of accuracy, sensitivity, specificity, precision, and F1-score (see Tables 8 and 9) [34][35][36]. From Table 8, shallow architectures such as SVM and DNNs produce good performance; among all results, a DNNs model with PCA and DWT produces high performance, with an accuracy of about 99.76%, a precision of about 98.20%, a sensitivity of about 91.80%, a specificity of about 99.78%, and an F1-measure of about 97.80%. However, feature extraction in shallow architectures is difficult; it must be performed separately and manually. DL, on the other hand, takes a feature learning approach: it learns the features directly from the data through the structure of the network. Such an approach can serve as both dimensionality reduction and noise cancellation. It reduces the processing time and gives a simpler procedure for a large and imbalanced data set. The results show that the DL performance (DL with DAEs) is lower than that of the SVM and DNNs counterparts in terms of precision (about 93.60%) and F1-measure (about 91.80%). However, the advantage of DL is that it does not require a human to identify and compute the critical features. DL learns discriminatory features that best predict the outcomes. This means that less human effort is required to train DL systems (because no feature engineering or computation is required), and it may also lead to the discovery of important new features that were not anticipated. In addition, owing to its ability to learn critical features directly from data, even with highly imbalanced data, the learned features are still robust enough to discriminate the class labels.
Moreover, in Table 9, our proposed architecture is compared with other DL approaches. Many previous studies report only three metrics, omitting the F1-score and precision, even though medical data, especially arrhythmia data, are imbalanced. Therefore, the complete set of metrics must be calculated to measure the prediction. The comparison shows that our proposed DL model based on DAEs and DNNs leads other DL approaches such as CNN, DBN, and RNN. The proposed DL model produces an accuracy of about 99.73%, a sensitivity of about 91.20%, a specificity of about 99.80%, a precision of about 93.60%, and an F1-score of about 91.80%.
In addition, to analyze memory usage and processing time, a comparison between the DL architecture and the shallow architecture was conducted. Figures 7 and 8 illustrate that the shallow architecture with an SVM classifier has a shorter processing time and lower memory usage. However, in the shallow architecture, feature extraction and feature reduction were manually engineered without considering the low-level features of the data. This stage requires more time and effort to investigate, and only the best value is selected to obtain excellent performance. In this study, a comparison of three classifiers, i.e., DNNs, SVM, and DL, is presented. According to the experimental results, the SVM classifier with PCA and DWT produces the best result, as presented in Table 8. Unfortunately, when a large data set is used, the SVM performance decreases, because the SVM classifier is computationally costly on extensive data sets. The present experiment was mostly a small data set problem. Therefore, the DL classifier is promising for the future, because the amount of data will increase over time.
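As an illustration of how the five reported metrics relate to a multi-class confusion matrix, the sketch below derives them per class in a one-vs-rest fashion and then averages them over the classes. The macro-averaging step is an assumption, since the averaging scheme is not stated here, and the function names and the 3-class matrix are made up for the example.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics; cm[i][j] = beats of true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k beats that were missed
        fp = sum(cm[i][k] for i in range(n)) - tp  # other beats labelled as k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results.append({
            "accuracy": (tp + tn) / total,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
            "f1": f1,
        })
    return results


def macro_average(results):
    """Unweighted mean of each metric over all classes."""
    return {key: sum(r[key] for r in results) / len(results)
            for key in results[0]}


# Made-up, imbalanced 3-class confusion matrix (rows: true, columns: predicted).
cm = [[90, 5, 5],
      [2, 8, 0],
      [1, 0, 9]]
results = per_class_metrics(cm)
macro = macro_average(results)
```

With an imbalanced matrix such as this one, the minority classes keep a high specificity and accuracy while their precision is visibly lower. This is why reporting precision and F1-score alongside the other three metrics matters for arrhythmia data.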


Conclusions
A deep learning approach is presented in this study to automatically learn and classify 10 classes of ECG heartbeats, which is important for the diagnosis of cardiac arrhythmia. The DL architecture expresses its full potential when dealing with a highly variable function that requires a large number of labeled samples. However, training deep architectures is a challenging task. The shallow architecture has provided a good result; unfortunately, shallow architectures are not as efficient as deep architectures. At the same time, adding more layers does not necessarily lead to better solutions, and choosing the correct dimensions of a deep architecture is not an easy task. In this research, unsupervised learning for DAEs pre-training is combined with a supervised fine-tuned deep neural structure. Several models were designed to obtain the best classifier by changing the number of neurons, the number of hidden layers, the learning rate, the optimizer, the activation function, and the loss function. The best values were selected based on validation data in several cases of ECG signals. In addition, the classes are highly imbalanced; the number of samples differs greatly between classes. Our proposed architecture, the combination of DAEs and DNNs, gives better performance than the other selected DL approaches. The proposed model achieves 99.73% accuracy, 91.20% sensitivity, 99.80% specificity, 93.60% precision, and a 91.80% F1-score on the testing data. In the future, to overcome the limitations of cardiologists and to develop broader ECG devices, we hope that this technology will be incorporated into inexpensive ECG devices as diagnostic tools everywhere.
used as the loss function of the classifier f_L as follows,
[252-32-252], [252-128-64-32-64-128-252], [252-128-64-32-16-32-64-128-252], and reference [252-128-64-32-16-8-16-32-64-128-252]. All of these can be interpreted as performing a lossy compression whose ratio is 8:1. The candidate activation functions are ReLU and Sigmoid. Each DAE is optimized via Adam with α = 0.0001. In this study, the learning rate is varied from 0.1 to 0.0001, and the value giving the smallest loss is selected. The loss function is the MSE without any regularization terms, and training runs for 100 epochs. Finally, before being fed to the DAEs, all data are scaled to [−1, 1]. This step is a min-max normalization in which the minimum and the maximum are the minimum and maximum values in the training set. Figure 3 shows a normal beat and a right bundle branch block (RBBB) beat used as samples of the ECG signal after segmentation and reconstruction. Figure 3a,c depict the initial ECG signal and the output of the DAEs. The raw data contain some high- and low-frequency noise, which is cancelled by the DAEs after reconstruction, as presented in Figure 3b,d.
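The scaling step and the reconstruction loss described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: it assumes a single global minimum and maximum taken over the whole training set (the text does not say whether the scaling is per sample position), and the function names and toy values are hypothetical.

```python
def fit_min_max(train_beats):
    """Return the global minimum and maximum sample value of the training set."""
    lo = min(min(beat) for beat in train_beats)
    hi = max(max(beat) for beat in train_beats)
    return lo, hi


def scale_to_pm1(beat, lo, hi):
    """Min-max normalization that maps [lo, hi] linearly onto [-1, 1]."""
    span = hi - lo
    if span == 0:
        return [0.0] * len(beat)  # degenerate case: constant training data
    return [2.0 * (x - lo) / span - 1.0 for x in beat]


def mse(signal, reconstruction):
    """Mean squared error between an input beat and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(signal, reconstruction)) / len(signal)


# Toy, made-up "beats" with 3 samples each (the DAE input has 252 samples).
train = [[0.0, 5.0, 10.0], [2.0, 4.0, 6.0]]
lo, hi = fit_min_max(train)  # statistics come from the training set only
scaled_train = [scale_to_pm1(b, lo, hi) for b in train]
scaled_test = scale_to_pm1([12.0, 5.0, -1.0], lo, hi)

# Ratio between the 252-sample input and a 32-unit code layer.
input_size, code_size = 252, 32
ratio = input_size / code_size
```

Because the minimum and maximum come from the training set only, unseen beats can fall slightly outside [−1, 1], as the test beat does here. The ratio 252/32 is about 7.9, i.e., roughly 8:1 when the 32-unit layer is taken as the code.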

Figure 3. Sample of Auto Encoder reconstruction results.

Figure 4. Data distribution plots in three conditions; raw data, PCA, and DAEs.

Figure 5. Training and testing evaluation.
(a) P-R Curve (b) ROC curve

Table 1. The classes of arrhythmia and the number of samples.

Table 2. Processing time of 10 DL structures.

Table 3. The proposed structure of deep learning of model 8.

Table 4. Training performances based on 10 models.

Table 5. Testing performances based on 10 models.

Table 6. Confusion matrix for the training process.

Table 7. Confusion matrix for testing process.

Table 8. Selected method for the benchmarking proposed method.

Table 9. Selected literatures for benchmarking our proposed model.