ECG Classification Using Wavelet Packet Entropy and Random Forests

Abstract: The electrocardiogram (ECG) is one of the most important techniques for heart disease diagnosis. Many traditional methodologies of feature extraction and classification have been widely applied to ECG analysis. However, the effectiveness and efficiency of such methodologies remain to be improved, and much existing research did not consider the separation of training and testing samples from the same set of patients (the so-called inter-patient scheme). To cope with these issues, in this paper, we propose a method to classify ECG signals using wavelet packet entropy (WPE) and random forests (RF), following the Association for the Advancement of Medical Instrumentation (AAMI) recommendations and the inter-patient scheme. Specifically, we first decompose the ECG signals by wavelet packet decomposition (WPD), then calculate entropy from the decomposed coefficients as representative features, and finally use RF to build an ECG classification model. To the best of our knowledge, this is the first time that WPE and RF have been used to classify ECG following the AAMI recommendations and the inter-patient scheme. Extensive experiments are conducted on the publicly available MIT-BIH Arrhythmia database, and the influence on performance of the mother wavelet and decomposition level for WPD, the type of entropy and the number of base learners in RF is also discussed. The experimental results are superior to those of several state-of-the-art competing methods, showing that WPE and RF are promising for ECG classification.


Introduction
The electrocardiogram (ECG) records the tiny electrical activity produced by the heart over a period of time by placing electrodes on a patient's body, and it has become the most widely used non-invasive technique for heart disease diagnosis in clinics. Due to the high mortality rate of heart diseases, ECG classification has drawn much research attention over the past decades.
Typically, the classification of ECG signals has four phases: preprocessing, segmentation, feature extraction and classification. The preprocessing phase is mainly aimed at detecting and attenuating frequencies of the ECG signal related to artifacts, and it also usually performs signal normalization and enhancement. After preprocessing, segmentation divides the signal into smaller segments, which can better express the electrical activity of the heart [1]. Nowadays, researchers can get good results from preprocessing and segmentation with popular techniques and tools [2]. Therefore, most of the literature focuses on the last two phases.
Feature extraction plays an important role in pattern classification, especially in signal or image classification. Features can be extracted from the raw data or the transformed domain of segmented ECG signals. The simplest method of feature extraction is to sample points at some frequency from an ECG signal curve [3]. However, such a method has two drawbacks: (1) the number of extracted features is so large that the efficiency of classifiers is affected; and (2) the extracted features usually cannot reflect the intrinsic characteristics of the signals. Features can also be extracted from the raw signals using morphological and/or statistical methods. For example, the time between the R peaks of two heartbeats, known as the RR interval, is one of the most commonly used features. The authors of [1,4] used four features from the RR interval: the RR interval between the current heartbeat and its predecessor, the RR interval between the current heartbeat and its successor, the average of all RR intervals in a full record and the average of the RR intervals of the current heartbeat's several neighbors. Independent component analysis (ICA) is another statistical method to extract ECG features. Yu et al. used ICA-based features and the RR interval to compose the feature vector. To get the ICA-based features, the authors randomly selected two sample segments and then whitened and arranged the segments into a data matrix. After that, the independent components (ICs) were calculated from the data matrix, the original ECG signals were projected onto the bases, and the features were calculated [5]. Later, the authors further proposed a novel IC arrangement strategy to improve the effectiveness and efficiency of ECG classification [6]. Afkhami et al. used morphological and statistical features to train an ECG classifier [7].
Other major feature extraction methods extract features from a transformed domain. Discrete cosine transform (DCT), continuous wavelet transform (CWT) and discrete wavelet transform (DWT) are commonly used transform methods. DCT expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies, which gives it the ability to compress signals. Khorrami et al. extracted DCT coefficients as features for ECG classification [8]. In that paper, the authors also applied CWT and DWT to extract features for ECG classification, and compared the classification performance among DCT, CWT and DWT. Owing to their significant effectiveness in extracting discriminative features for ECG classification, wavelets have been widely studied. Song et al. used the wavelet transform to extract 17 original input features from preprocessed signals and then reduced these to four by linear discriminant analysis (LDA); the performance with the reduced features was better than that with features reduced by principal component analysis (PCA) and even with the original features [9]. Yu and Chen used two-level DWT to decompose the signals into components in different sub-bands, and then selected three sets of statistical features of the decomposed signals, along with the alternating current (AC) power and the instantaneous RR interval of the original signals, as features [10]. Ye et al. analysed ECG signals using morphological features (extracted by DWT and ICA) and dynamic features (RR intervals); the dimensionality of the morphological features was reduced to 26 by PCA before classification [11]. Since the coefficients of DWT at different levels have different discrimination power, it is important to select those that best represent the ECG signals for classification. Daamounche et al. proposed a novel algorithm, based on a particle swarm optimization framework, for generating the wavelet that best represents the ECG beats in terms of discrimination ability [12]. Wavelet packet decomposition (WPD) is an extension of DWT. Whereas DWT decomposes only the approximations, WPD decomposes both the approximations and the details of the signal, and hence it keeps the important information in the higher-frequency components. WPD has also been applied to ECG classification. For example, the authors of [13] applied WPD to classify sleep apnea types. In another piece of literature, the authors proposed a feature extraction method based on the wavelet packet of an R wave window, along with a strategy to select nodes in the packet tree [14].
As for the classifiers of ECG, in theory, any multi-class classifier can be used for ECG classification. In practice, the most commonly used classifiers include the support vector machine (SVM) [15], artificial neural network (ANN), K-nearest neighbours (KNN) and decision tree (DT) [2].
SVM is one of the most popular ECG classifiers. The authors of [16] used a multiclass SVM with error-correcting output codes to build an ECG classifier based on features calculated from the wavelet coefficients. Osowski et al. [17] presented a new approach for ECG classification by combining SVM with features extracted by two preprocessing methods, and the results on recognizing 13 heart rhythm types showed that the proposed method was reliable and advantageous. Mohammadzadeh et al. used SVM and the generalized discriminant analysis (GDA) feature reduction scheme to classify cardiac arrhythmia from the heart rate variability (HRV) signal [18]. Some variations of SVM have also been applied to ECG classification, such as least squares SVM [19,20,21], hierarchical SVM [22], weighted SVM [23] and SVM combined with particle swarm optimization (PSO) [24]. The multi-layer perceptron (MLP) and the probabilistic neural network (PNN) are the most popular ECG classifiers associated with ANN. The authors of [25] used sequential forward floating search to get a feature subset, and then MLP was applied for classification. The experimental results showed that the proposed methods exceeded some previous work under the same constraints. Luz et al. compared MLP with some other classifiers on different feature sets [1]. Alickovic and Subasi used autoregressive modeling to extract features from signals de-noised by multiscale PCA [26], and several classifiers including MLP were used to train models. Yu et al. used PNN to build classifiers on the RR-interval features combined with the features extracted by wavelets [10] and ICA [5,6], respectively. Wang and Chiang et al. pointed out that the integration of PNN with the proposed PCA and LDA feature reduction can achieve satisfactory results [27]. Some other researchers investigated the performance of fuzzy NN [28] and combined NN [29] for ECG classification [2]. Owing to their simplicity, KNN and DT have also been widely applied to ECG classification. Besides the above-mentioned classifiers, some scholars also use linear discriminants [4,30,31], extreme learning machines [32], optimum-path forests [1], active learning [33], and so on to build classification models.
Although much work exists on ECG classification, it is still necessary to explore this field further. For one reason, the performance needs to be improved for modern heart disease diagnosis. For another, some existing research used samples from the same patients to construct the training and testing sets (the intra-patient scheme), which is not reasonable for practical situations: in a realistic scenario, the training samples should come from some patients and the testing samples from other patients (the inter-patient scheme) [2]. In addition, the types of cardiac arrhythmias and the evaluation methods in existing research differ greatly, making it hard to reproduce and compare the experiments. This can be resolved by introducing the Association for the Advancement of Medical Instrumentation (AAMI) recommendations [34].
Entropies from WPD have been demonstrated to have a powerful ability to represent the intrinsic characteristics of electroencephalogram (EEG) signals [35,36]. As a robust classifier, random forests (RF) have been applied in many areas, e.g., remote sensing [37], microarray data [38] and Alzheimer's disease [39]. To improve the performance of ECG classification and make the results comparable for other scholars, in this paper, we propose a new method for ECG classification using entropy on WPD and RF, following the AAMI recommendations [34] and the inter-patient scheme. The main contributions of this work are four-fold: (1) we built an ECG classification expert system with entropy on the coefficients of WPD as features and RF as the classifier; (2) we followed the AAMI recommendations and the inter-patient scheme, which makes the proposed method reproducible and more practical; (3) the experimental results on the publicly accessible MIT-BIH Arrhythmia database [40] show that the proposed method is promising for ECG classification; and (4) the type of entropy, the mother wavelet and decomposition level for WPD and the number of trees in RF were discussed, and suggestions on these settings were given.
Note that the proposed method is different from the previous work [41,42]. Firstly, we used WPD and entropy instead of DWT coefficients to extract features. Secondly, we adopted the inter-patient scheme instead of the intra-patient scheme to conduct the experiments. To the best of our knowledge, this is the first time that WPE and RF have been used to classify ECG following the AAMI recommendations and the inter-patient scheme.
The remainder of this paper is organized as follows. Section 2 describes the materials and methods used, including the database, WPD, entropy, feature extraction based on WPD and entropy, and the RF classifier. Experimental results are reported in Section 3. We discuss the proposed method in Section 4. Finally, Section 5 concludes this paper.

Overview
In this study, we used the well-known MIT-BIH Arrhythmia database to evaluate the proposed method. After preprocessing and segmentation, we decomposed the ECG data using WPD and then calculated the entropy of each terminal node in the wavelet packet tree as the features. The database was split into a training set and a testing set following the AAMI recommendations and the inter-patient scheme. The classification model was built on the training set by RF, and the performance of the model was finally evaluated on the testing set. The flowchart of the proposed framework is given in Figure 1. Since this study focuses mainly on feature extraction and classification, in the preprocessing phase we simply applied a finite impulse response (FIR) filter with 12 taps and -3 dB attenuation at 35 Hz to the data, and in the segmentation phase we took the 71 points preceding and the 71 points succeeding the R peak (143 points in total, including the R point) to compose one segmented sample.
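The preprocessing and segmentation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper specifies only "12 taps, -3 dB at 35 Hz" for the FIR filter at the 360 Hz sampling rate, so the exact filter design here (a `firwin` low-pass) is an assumption.

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 360          # MIT-BIH sampling rate (Hz)
HALF_WIN = 71     # points before/after the R peak -> 143-point segments

def preprocess(ecg):
    """Filter the raw signal with a 12-tap FIR low-pass (cutoff ~35 Hz).
    Hypothetical design; the paper gives only taps and -3 dB point."""
    taps = firwin(numtaps=12, cutoff=35, fs=FS)
    return lfilter(taps, 1.0, ecg)

def segment(ecg, r_peaks):
    """Cut one 143-point window around each R peak (edge beats skipped)."""
    segs = []
    for r in r_peaks:
        if r - HALF_WIN >= 0 and r + HALF_WIN < len(ecg):
            segs.append(ecg[r - HALF_WIN : r + HALF_WIN + 1])
    return np.asarray(segs)
```

R-peak locations are taken from the database annotations here; peak detection itself is outside the scope of the paper.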

Database and AAMI Recommendations
In this study, the MIT-BIH Arrhythmia database was employed as the data source for the experiments [40]; it is freely available at [43]. The database contains 48 thirty-minute recordings, sampled at 360 Hz, from 47 different subjects (25 men aged 32 to 89 and 22 women aged 23 to 89). Eighteen types of heartbeats were labeled in total. The recordings numbered 201 and 202 are from the same subject. Among the recordings, the 4 with paced beats are excluded according to the AAMI recommendations [34]. The recommendations divide the 18 types of heartbeats into five groups: normal beat (N), supraventricular ectopic beat (S), ventricular ectopic beat (V), fusion of a V and an N (F), and unknown beat type (Q), as detailed in Table 1. The AAMI recommendations also define the evaluation measures for ECG classification, which will be discussed in Section 3.2. However, the separation of the training set and the testing set is not defined by the recommendations. It was demonstrated in [44] that the use of heartbeat samples from the same patient for both training and testing biases the evaluation process. In such a scheme (the intra-patient scheme), many models can achieve classification accuracy close to 100% in testing because the particularities of the patient's heartbeat are learned in the training phase [2]. To avoid this in this work, following [44], we adopted the inter-patient scheme to divide the recordings into the training set (DS1) and the testing set (DS2), each of which has 22 recordings, as shown in Table 2. The number of heartbeats of each type in each set can be seen in Table 3.
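For reference, the inter-patient split can be expressed directly as two lists of record numbers. The lists below follow the widely cited division of de Chazal et al. [44]; Table 2 of this paper remains authoritative, so treat these lists as a reproduction to verify against it.

```python
# Inter-patient split (DS1 for training, DS2 for testing), as given by
# de Chazal et al. [44]; the four paced-beat records are already excluded.
DS1 = [101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122, 124,
       201, 203, 205, 207, 208, 209, 215, 220, 223, 230]
DS2 = [100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210, 212,
       213, 214, 219, 221, 222, 228, 231, 232, 233, 234]

assert len(DS1) == len(DS2) == 22
assert not set(DS1) & set(DS2)   # no record appears in both sets
```

The key property is the last assertion: no patient contributes beats to both the training and the testing set.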

Feature Extraction
In machine learning, signal processing and pattern recognition, feature extraction is one of the crucial steps; it is aimed at extracting informative and non-redundant values (features) from raw signals. The extracted features facilitate the subsequent learning and generalization steps and, in some cases, lead to better human interpretation. This step is also related to dimensionality reduction. The features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data [45,46].

Wavelet Packet Decomposition
DWT is one of the popular methods in the wavelet transform family, which offers a more powerful time-frequency transformation than traditional ones such as the discrete cosine transform (DCT) and discrete Fourier transform (DFT). DWT is a linear operator that decomposes the initial signal into two components: detail coefficients (DCs, which capture the high-frequency, low-scale information in the original signal) and approximation coefficients (ACs, which carry the low-frequency, high-scale information of the original signal); the DCs then remain unchanged while the ACs are decomposed into new DCs and ACs. This process repeats until the desired decomposition level is reached.
DWT has been widely used in ECG signal processing, especially for denoising, compression and classification. For classification, DWT is usually used to extract features. Owing to its great time and frequency localization ability, DWT can reveal the local characteristics of the input ECG signal. In addition, the multi-level decomposition of an ECG signal into different scales by DWT generates multi-scale features, each of which represents particular characteristics of the signal. Since DWT decomposes only the ACs at each level, it is hard to extract distinctive information from the DCs. WPD is an extension of DWT; the main difference between them is that WPD decomposes not only the ACs but also the DCs. Therefore, WPD has the same frequency bandwidths at each resolution while DWT does not. This property means that WPD neither adds nor loses information relative to the original signal, so the features from WPD have more discrimination power than those from DWT. A three-level WPD tree for a normal heartbeat is shown in Figure 2. From this figure, we can see that the tree is a typical binary tree, where each node has DCs (right sub-node) and ACs (left sub-node). Therefore, features can be extracted from both DCs and ACs at different levels to obtain more information. Mathematically, for the original signal x(t), the WPD can be recursively defined as Equation (1):

d_{i+1,2j}(k) = Σ_m g(m - 2k) d_{i,j}(m),   d_{i+1,2j+1}(k) = Σ_m h(m - 2k) d_{i,j}(m),   d_{0,0} = x(t),   (1)

where h(k) and g(k) are the high-pass and low-pass filters, respectively, and d_{i,j} denotes the reconstructed signal (coefficients) of the WPD at the i-th level for the j-th node.
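The three-level decomposition described above can be reproduced with PyWavelets, which implements this recursion. A minimal sketch on one (synthetic) 143-point heartbeat segment, using the db4 mother wavelet adopted later in the paper:

```python
import numpy as np
import pywt

segment = np.random.randn(143)          # one 143-point heartbeat (synthetic)
wp = pywt.WaveletPacket(data=segment, wavelet='db4',
                        mode='symmetric', maxlevel=3)

# Terminal nodes of a full 3-level tree: 2**3 = 8 frequency sub-bands,
# ordered from lowest to highest frequency.
leaves = [node.path for node in wp.get_level(3, order='freq')]
coeffs = [wp[path].data for path in leaves]
assert len(coeffs) == 8
```

Each path string (e.g. 'aaa', 'aad') records the sequence of approximation (a) and detail (d) filtering steps that produced the node, mirroring the binary tree in Figure 2.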

Wavelet Packet Entropy
Although the coefficients from DWT or WPD can reveal the local characteristics of an ECG signal, the number of such coefficients is usually so large that it is hard to use them directly as features for classification. Therefore, higher-level features can be derived from these coefficients for better classification. Entropy is a tool to measure the uncertainty of the information content of a given system, and it is widely applied in signal processing, information theory, pattern recognition, and so on. Shannon entropy (SE), log energy entropy (LEE), Renyi entropy (RE) and Tsallis entropy (TE) are some typical types of entropy. In this paper, entropy is used to extract features from the wavelet packet tree (WPT), and it is computed based on energy.
The information of the k-th coefficient of the j-th node at the i-th level can be measured by the wavelet energy, which is defined as Equation (2):

E_{i,j,k} = |d_{i,j}(k)|^2.   (2)

Then, the total energy of the j-th node at the i-th level can be calculated by Equation (3):

E_{i,j} = Σ_{k=1}^{N} E_{i,j,k},   (3)

where N is the number of coefficients in the node. The probability of the k-th coefficient within its node can be calculated by Equation (4):

p_{i,j,k} = E_{i,j,k} / E_{i,j},   (4)

where the sum of p_{i,j,k} over k equals 1. SE is a measure of the uncertainty associated with random variables in information theory, and it can be calculated from the probability distribution of energy as Equation (5) [47]:

SE_{i,j} = -Σ_k p_{i,j,k} log p_{i,j,k}.   (5)

Similarly, LEE is defined as Equation (6):

LEE_{i,j} = Σ_k log(p_{i,j,k}^2).   (6)

Based on the previous equations, RE of order q (q ≥ 0 and q ≠ 1) can be defined as Equation (7) [48]:

RE_{i,j} = (1 / (1 - q)) log(Σ_k p_{i,j,k}^q).   (7)

TE is another type of entropy, defined at various values of q as Equation (8) [49]:

TE_{i,j} = (1 / (q - 1)) (1 - Σ_k p_{i,j,k}^q).   (8)

Both RE and TE are extensions of SE. The parameter q in RE and TE needs to be optimized in practical applications; both RE and TE tend to SE as q → 1.
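Equations (2)-(8) translate directly into code. A sketch (function and variable names are mine, not the paper's); the original equations in this section were reconstructed from standard definitions, so verify against the published versions:

```python
import numpy as np

def energy_probabilities(d):
    """Equations (2)-(4): p_k = |d_k|^2 / sum_k |d_k|^2 for one node."""
    e = np.asarray(d, dtype=float) ** 2
    return e / e.sum()

def shannon_entropy(p):
    p = p[p > 0]                                    # 0 * log 0 := 0
    return -np.sum(p * np.log(p))                   # Equation (5)

def log_energy_entropy(p):
    p = p[p > 0]
    return np.sum(np.log(p ** 2))                   # Equation (6)

def renyi_entropy(p, q):
    assert q >= 0 and q != 1
    return np.log(np.sum(p ** q)) / (1.0 - q)       # Equation (7)

def tsallis_entropy(p, q):
    assert q != 1
    return (1.0 - np.sum(p ** q)) / (q - 1.0)       # Equation (8)

p = energy_probabilities([0.5, -1.0, 2.0, 0.25])
assert abs(p.sum() - 1.0) < 1e-12
```

As the text notes, RE and TE converge to SE as q approaches 1, which is easy to check numerically with q = 1.0001.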

The Procedure of Feature Extraction
Given the ECG signals, the features based on WPE are obtained by performing the following steps: (1) decompose each segmented heartbeat by WPD with the chosen mother wavelet and decomposition level; (2) calculate the entropy of the coefficients of each terminal node in the wavelet packet tree; and (3) concatenate these entropies into a feature vector. In practical applications, some other important features may be added to the feature vector for better classification results. In this study, two RR-intervals (the interval between the current R peak and its predecessor R peak, and the interval between the current R peak and its successor) are appended to the feature vector.
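The whole procedure for one heartbeat can be sketched end to end. This assumes the settings used in the experiments (six-level db4 WPD, Shannon entropy): 2^6 = 64 terminal-node entropies plus the two RR-intervals give the 66-dimensional feature vector reported in the Discussion. A minimal illustration, not the authors' implementation:

```python
import numpy as np
import pywt

def shannon_entropy(d):
    """Shannon entropy of the energy distribution of one node's coefficients."""
    e = np.asarray(d, dtype=float) ** 2
    p = e / e.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def heartbeat_features(segment, rr_prev, rr_next):
    """Steps (1)-(3) plus the two appended RR-interval features."""
    wp = pywt.WaveletPacket(segment, wavelet='db4', maxlevel=6)
    nodes = wp.get_level(6, order='freq')          # 2**6 = 64 leaves
    entropies = [shannon_entropy(n.data) for n in nodes]
    return np.array(entropies + [rr_prev, rr_next])

feats = heartbeat_features(np.random.randn(143), 0.8, 0.82)
assert feats.shape == (66,)
```

Stacking `heartbeat_features` over all beats of DS1 and DS2 yields the training and testing matrices fed to the classifier.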

Random Forests
Random forests (RF), first proposed by Breiman [50], is one of the excellent ensemble machine learning techniques widely used in classification. The main idea of RF is to build many classification trees, each based on randomly selected features from randomly selected samples with a bagging strategy, and then to let the trees vote on a given input vector to obtain a class label. RF is constructed from many base learners, and each base learner is an independent binary tree built by recursive partitioning. To build a binary tree, a bootstrap sample with N objects is first drawn from the training data (the remaining objects in the training data are called out-of-bag objects), and then the tree is fitted to the bootstrap sample by recursively selecting a feature subset and splitting each terminal node into two child nodes on the best selected feature. The Gini index is used to decide which feature is best. The built tree is then validated on the out-of-bag objects. RF has many advantages and has shown superior performance in classification since it was proposed. The number of base learners is usually the only parameter that needs to be set.
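Training such a classifier is a few lines in any standard library. A sketch with scikit-learn (the paper's experiments were run in Matlab, so this is an equivalent, not the original setup), using the 400 trees adopted later and randomly generated stand-in features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 66))      # hypothetical WPE + RR features
y_train = rng.integers(0, 4, size=500)    # hypothetical class labels
X_test = rng.normal(size=(100, 66))

# Bagging, per-split random feature subsets and Gini splitting are the
# defaults; oob_score=True validates each tree on its out-of-bag objects.
rf = RandomForestClassifier(n_estimators=400, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(rf.oob_score_)    # out-of-bag estimate of generalization accuracy
```

The out-of-bag score gives a free validation estimate without a held-out set, which is one of the practical advantages alluded to above.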

Experimental Settings
In order to evaluate the performance of the proposed work, we compared it with some state-of-the-art methods of feature extraction and classification. The competing methods of feature extraction include statistical features, ICA and DWT, while the competing classifiers include KNN, DT, PNN and SVM. All of the experiments were conducted in Matlab 8.6 (Mathworks, Natick, MA, USA) on 64-bit Windows 7 (Microsoft, Redmond, WA, USA) with 32 GB of memory and a 3.4 GHz Intel i7 CPU. For NB, DT and PNN, we used the default parameters in the Matlab toolbox. We used k = 5 for KNN. As for SVM, we used 10-fold cross validation and grid search on the training set to optimize the parameter γ of the radial basis function (RBF) kernel and the penalty factor C, with the LIBSVM 3.21 software package [51], which is available at [52].
To demonstrate the performance of the proposed work, we performed the experiments following the AAMI recommendations and the inter-patient scheme. We adopted six-level db4 wavelet packet decomposition and SE to extract features, and 400 sub-trees in RF for the comparison. Here, we chose SE to extract features because of its simple form, its lack of parameters and its extensive use in signal processing. These parameters are discussed in detail in Section 4.
Although normalization is not necessary for RF, to set the stage for a fair comparison, we applied Min-Max normalization (as shown in Equation (9)) to the data input to all classifiers:

x_new = (x - x_min) / (x_max - x_min),   (9)

where x_min and x_max are the minimal and maximal values of one variable, respectively, and x_new and x are the normalized and original values, respectively. It is clear that the normalization maps the original values to the range [0, 1].
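Equation (9) can be applied column-wise as below. Fitting the minima and maxima on the training set only (and reusing them for the test set) is my assumption for an inter-patient setting; the paper states only the formula:

```python
import numpy as np

def min_max_fit_transform(X_train, X_test):
    """Min-Max normalization of Equation (9), fitted on the training set."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    scale = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant cols
    return (X_train - x_min) / scale, (X_test - x_min) / scale

X_tr = np.array([[0.0, 2.0], [1.0, 4.0]])
X_te = np.array([[0.5, 3.0]])
A, B = min_max_fit_transform(X_tr, X_te)
```

Note that test-set values outside the training range will fall outside [0, 1]; this is the expected behavior of the formula.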

Performance Measures
To evaluate the performance of the classification, four standard statistical indices were used following the AAMI recommendations: sensitivity (SE, a.k.a. recall), positive predictivity (+P, a.k.a. precision), false positive rate (FPR) and accuracy (ACC), all derived from the true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP). In ECG classification, TP and FN stand for the number of heartbeats of a given class classified correctly and incorrectly, respectively, while TN stands for the number of heartbeats not belonging to a given class that are classified as not belonging to it, and FP stands for the number of heartbeats incorrectly classified as belonging to a given class [1].
SE is defined as the proportion of beats of one class that are correctly classified among all beats truly belonging to that class (including the missed beats): SE = TP/(TP + FN). +P is defined as the ratio of correctly classified beats of one class to the total beats classified as that class: +P = TP/(TP + FP). For a given class, FPR is defined as the ratio of beats incorrectly classified as that class to the total beats not belonging to that class: FPR = FP/(FP + TN). ACC is defined as the percentage of correctly classified beats over the total number of beats. These four evaluation measures are recommended by the AAMI. The five classes of heartbeat types are very imbalanced in the MIT-BIH Arrhythmia database, so ACC can be strongly distorted by the majority class (type N). Therefore, the first three measures are also widely used for performance evaluation of ECG classification. An ideal ECG classifier should achieve high SE, high +P and high ACC as well as low FPR.
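The per-class measures above can all be read off a confusion matrix in a one-vs-rest fashion. A sketch (helper name is mine):

```python
import numpy as np

def aami_measures(cm):
    """Per-class SE, +P, FPR and overall ACC from a confusion matrix,
    where cm[i, j] = number of beats of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # true class i, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # predicted class i, true elsewhere
    tn = cm.sum() - tp - fn - fp
    se = tp / (tp + fn)               # sensitivity (recall)
    pp = tp / (tp + fp)               # positive predictivity (precision)
    fpr = fp / (fp + tn)              # false positive rate
    acc = tp.sum() / cm.sum()         # overall accuracy
    return se, pp, fpr, acc

cm = [[90, 10], [5, 45]]              # toy two-class example
se, pp, fpr, acc = aami_measures(cm)
assert abs(acc - 135 / 150) < 1e-12
```

For the five AAMI classes, `cm` would be the 5 x 5 confusion matrix on DS2, and the per-class vectors correspond to the columns reported in Tables 4-6.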

Experimental Results
Firstly, we compared the proposed method with some previous work; the results are shown in Table 4. It can be seen from this table that the proposed method achieves the highest ACC of 94.61%, followed by Song's 94.22% and Ye's 92.17%, while Yu's method (ICA + RR and PNN) achieves the lowest ACC of 88.87%. All of the methods have good discrimination power on N and V. The proposed method can recognize only some of S and F, while the methods with SVM recognize none of them. Regarding Q, no method recognizes any sample correctly because there are only seven such samples out of 49,690 in DS2. In practical diagnosis, a low FPR for the N class and a high SE for all other classes are required. The proposed method achieves the lowest FPR of 3.92% for N among all methods, and at the same time it achieves the highest SE on N, S and F. It is worth noting that, although the methods with SVM achieve similar but still slightly lower ACCs compared with the proposed method, they totally fail at classifying S, F and Q, because SVM tends to misclassify a new unknown sample into the classes with many samples, which are N and V in this experiment. In other words, SVM labels each heartbeat as either N or V, and none of the heartbeats from S, F and Q are classified correctly. The FPR on N of Song's method is 13.66%, which shows that it tends to classify non-N heartbeats as N. This is highly risky for heart-disease diagnosis. Yu's method has an even worse result in terms of FPR on N.
The results show that the proposed method outperforms the compared methods. Secondly, to further validate the feature extraction performance of the proposed method, we compared RF with several classical classifiers using the fixed features of WPE and RR. The results are shown in Table 5. Once again, RF achieves the highest ACC (94.61%) and the highest SEs on V and F (94.67% and 94.20%, respectively). It also achieves the lowest FPR (0.71%) and the second-lowest FPR (3.92%) on V and N, respectively. Note that, although the ACCs of KNN and DT are not very high, they both correctly recognized some samples in F and Q, which is hard for the other classifiers, especially PNN and SVM. The WPE + RR features also improve the performance of PNN, especially in terms of ACC and the FPR of class N; the latter decreases significantly from 92.55% to 8.73%. This experiment demonstrates that the extracted features are discriminative for ECG classification, and RF is the best classifier when compared with the competing ones. Finally, Table 6 shows the results obtained by RF with different features. Here, WPD refers to using the coefficients of wavelet packet decomposition directly as the input features. RR-intervals have been proven to be very discriminative for ECG classification. We use three feature combinations related to WPD (i.e., WPE, WPE + RR and WPD + RR) to validate the performance of RF. With these features, RF achieves high ACCs, ranging from 91.95% to 94.61%, improving on the performance of Yu's and Ye's methods. Except for Yu's method not being able to recognize any samples in F, all of the remaining methods can recognize some samples in N, S, F and V, which shows that RF improves the recognition power. Note that RF also significantly improves the FPR of N. Although WPD + RR and WPE have significant discrimination power, they are not as good as WPE + RR. The experimental results confirm that RR can improve the classification performance, as shown in [1]. The results also show that RF is a promising classifier for ECG classification and that the WPE + RR features have the most discrimination power when compared with the other features. All of the above experiments were conducted on the MIT-BIH database with well-known feature extraction methods and classifiers, following the AAMI recommendations and the inter-patient scheme. It can be seen from the experimental results that the proposed method is superior to the competing ones, not only in feature extraction but also in classification.

Discussion
Given the advantages of WPE and RF, some important issues warrant further investigation, including how to choose the mother wavelet and the decomposition level for WPD, and the influence of the type of entropy and of the number of base learners in RF. Moreover, we evaluated the efficiency of the proposed method.
To choose the best mother wavelet and the best decomposition level for WPD, we conducted experiments on six Daubechies wavelets (db1 (haar), db2, db4, db6, db8 and db10), three Coiflets wavelets (coif1, coif3 and coif5), one discrete Meyer wavelet (dmey) and four Biorthogonal wavelets (bior1.1, bior2.4, bior4.4 and bior6.8), with decomposition levels ranging from 2 to 8 at an interval of 2. Here, we used Shannon entropy to extract features and 400 base learners in RF. The representative results for each wavelet family are reported in Table 7. The db4 wavelet from the Daubechies family with six-level decomposition achieves the top ACC of 94.61%, slightly better than eight-level coif1 and six-level bior4.4 decomposition. On average, the wavelets from the Daubechies family are the best for ECG classification. As for the decomposition level, a very small level cannot express the signal well, while a very large one results in high-dimensional data with many coefficients close to zero. For all the mother wavelets, a decomposition level of 6 usually achieves satisfactory results.
The type of entropy is another issue in the proposed method. We adopted SE, LEE, RE and TE to extract features from the WPT. For both RE and TE, we used real values ranging from 0.1 to 5.1 at an interval of 0.2 as the parameter q. A few typical results are shown in Table 8; for simplicity, results that are very close to their neighboring ones in terms of q are omitted. The results of LEE are slightly worse than those of the others. When q varies from 0.1 to 5.1, the performance of RE is relatively stable, with ACC varying in a narrow range of 94.28%-94.70%, indicating that RE is not sensitive to the parameter q. The performance of TE is similar to that of RE but with a wider ACC range: the ACC of TE reaches a minimum of 93.8% at q = 0.1 and a maximum of 94.64% at q = 3.5. For RE and TE, the experimental results also vary slightly with q in terms of SE, +P and FPR, except for SE on classes S and F.
The number of base learners (subtrees) in RF is also discussed here. Fixing SE and six-level db4 decomposition, we used several different numbers of base learners to validate the performance of RF. The results are shown in Table 9. When the number is small, e.g., 10, the performance is low. The performance increases with the number of base learners to some extent; however, once the number reaches a certain value, e.g., 400, the performance peaks, and further increases bring no improvement. We can see from Table 9 that 400 is a good choice for the proposed method.
We also analyzed the efficiency in terms of training and testing times with the WPE + RR features and different classifiers. For RF, we used 20, 100 and 400 base learners, respectively; for SVM, the time for optimizing the parameters was excluded. The results are shown in Table 10. Since RF builds many trees in training and also votes over these trees in testing, it takes more time than DT and KNN, and its training and testing times increase linearly with the number of base learners. As a lazy classifier, KNN consumes the least time in training, followed by PNN and DT. The training time of PNN is only a little longer than the time needed to read the data owing to its "one-step" training; for DT, which only needs to build a single tree, the training time is also small. The testing times of RF, DT and SVM are far less than their training times; however, for KNN and PNN, the testing time is almost 50 times and 130 times the training time, respectively. In practice, the testing time plays a more important role than the training time because training is usually completed with off-line data; therefore, the time consumed by RF is acceptable.
In this approach, we calculated the entropy from the coefficients of the sixth-level nodes of the WPD to extract features, and the total dimension of the feature vector is 66, including the two RR intervals. This dimension is relatively large compared with much existing research; in future work, we may apply PCA or best basis selection to reduce the dimension and to improve performance. Moreover, since DS1 and DS2 contain only eight and seven samples of class Q, respectively, the amount of this class is not representative, and none of the mentioned models can classify the samples in Q correctly; class Q should therefore be excluded or fused into another class in further research.
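The entropy measures compared above follow their standard definitions. As an illustrative sketch (the function name and the default q are ours; LEE is omitted because its definition varies between implementations), SE, RE and TE of an energy probability distribution can be computed as:

```python
import math

def entropies(p, q=2.0):
    """Shannon (SE), Renyi (RE) and Tsallis (TE) entropies of an energy
    probability distribution p; q is the order parameter for RE and TE
    (q != 1)."""
    p = [x for x in p if x > 0]          # drop zeros to avoid log(0)
    se = -sum(x * math.log(x) for x in p)
    pq = sum(x ** q for x in p)
    re = math.log(pq) / (1.0 - q)        # Renyi entropy
    te = (1.0 - pq) / (q - 1.0)          # Tsallis entropy
    return se, re, te
```

For a uniform distribution, SE and RE coincide at log(n); as q approaches 1, both RE and TE converge to SE, which is consistent with the stable behavior of RE over q observed in Table 8.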

Conclusions
This work focused on feature extraction and classification in ECG signal analysis. In the feature extraction phase, we used the entropy of the coefficients of the terminal nodes of the WPD (such entropy is called WPE) together with two RR intervals as features, while in the classification phase, we utilized RF as the classifier. All of the analyses were conducted on the publicly available MIT-BIH database following the AAMI recommendations and the inter-patient scheme, making the experiments easy to reproduce and compare. To the best of our knowledge, this is the first time that WPE and RF have been applied to ECG classification following the AAMI recommendations and the inter-patient scheme. From the extensive experimental results, it can be concluded that: (1) the discrimination ability of WPE + RR is more powerful than that of ICA + RR, DWT + RR or WPD + RR for feature extraction from ECG signals; (2) RF is a stronger classifier than some state-of-the-art classifiers, such as KNN, DT, PNN and SVM, for ECG classification; and (3) the combination of WPE + RR and RF significantly improves ECG classification with great effectiveness and comparable efficiency, indicating that the proposed method is promising for ECG classification.
In the future, this work could be extended in two aspects: (1) studying feature selection for the WPT to reduce the dimension and/or to further improve the discrimination power of WPE; and (2) applying the proposed method to other biosignals, such as the electromyogram (EMG) and electroencephalogram (EEG).

Figure 1. Flowchart for the proposed method.

Figure 2. Wavelet packet decomposition (WPD) tree for a normal heartbeat with three-level wavelet packet decomposition.
(a) Select a mother wavelet function W and the decomposition level L; (b) Decompose the original signals according to the specified W and L; (c) Calculate the energy of each coefficient in each node in the last level L (terminal node); (d) Calculate the energy probability distribution of each coefficient in each terminal node; (e) Calculate entropy of each terminal node; (f) Concatenate entropies of all terminal nodes to compose a feature vector.
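The steps above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it uses the Haar (db1) filter and a three-level decomposition rather than the db4 wavelet and six levels adopted in the paper (in practice a wavelet library such as PyWavelets would supply the filter banks), and the function names are ours:

```python
import math

def haar_wpd(signal, level):
    """Full wavelet packet decomposition with the Haar (db1) filter.
    Returns the coefficient lists of the 2**level terminal nodes."""
    nodes = [list(signal)]
    for _ in range(level):
        nxt = []
        for x in nodes:
            if len(x) % 2:                                    # pad odd-length nodes
                x = x + [0.0]
            approx = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
            detail = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
            nxt += [approx, detail]
        nodes = nxt
    return nodes

def wpe_features(signal, level=3):
    """Steps (c)-(f): Shannon entropy of the energy probability
    distribution of each terminal node, concatenated into one vector."""
    feats = []
    for coeffs in haar_wpd(signal, level):
        energies = [c * c for c in coeffs]
        total = sum(energies)
        if total == 0.0:                                      # silent node
            feats.append(0.0)
            continue
        probs = [e / total for e in energies if e > 0]
        feats.append(-sum(p * math.log(p) for p in probs))
    return feats
```

A level-L decomposition yields 2^L terminal nodes and hence a 2^L-dimensional entropy vector; with L = 6 this gives the 64 entropies that, together with the two RR intervals, form the 66-dimensional feature vector used in the experiments.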
(a) It achieves higher accuracy than other state-of-the-art methods; (b) It is efficient on large-scale data; (c) It does not overfit easily; (d) It can handle both numeric and categorical variables; (e) It can be applied to classification, regression and clustering; (f) It can easily be applied to multi-class problems; (g) It has methods for balancing errors on data sets with unbalanced class populations; (h) It has fewer parameters than some other state-of-the-art classifiers; (i) Normalization of variables is not necessary.
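As a concrete illustration of using RF as the classifier, the following is a sketch with scikit-learn; the synthetic data here merely stands in for the real 66-dimensional WPE + RR feature vectors and AAMI class labels, and the 400 base learners match the setting found best in Table 9:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for WPE + RR feature vectors (66-dimensional)
# with two classes; real work would use MIT-BIH heartbeats.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 66))
y = (X[:, 0] > 0).astype(int)

# Random forest with 400 base learners, as in the experiments.
clf = RandomForestClassifier(n_estimators=400, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)          # training-set accuracy of the ensemble
```

In a faithful reproduction, `fit` would be called on DS1 and `score` (or `predict`) on DS2, keeping the two patient sets disjoint as the inter-patient scheme requires.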

Table 1. Mapping of original heartbeat types in the MIT-BIH Arrhythmia database to AAMI groups.

Table 3. Amount of each heartbeat type in DS1 and DS2.

Table 4. Results (in %) obtained by the proposed method and some previous work following the AAMI recommendations and the inter-patient scheme.

Table 5. Results (in %) obtained by wavelet packet entropy (WPE) and different classifiers following the AAMI recommendations and the inter-patient scheme.

Table 6. Results (in %) obtained by random forests (RF) and different features following the AAMI recommendations and the inter-patient scheme.

Table 7. Performance (in %) of mother wavelets and decomposition levels.

Table 9. Performance (in %) of different numbers of base learners.

Table 10. Training and testing time (in seconds).