DE-PNN: Differential Evolution-Based Feature Optimization with Probabilistic Neural Network for Imbalanced Arrhythmia Classification

In this research, a heartbeat classification method is presented based on evolutionary feature optimization using differential evolution (DE) and classification using a probabilistic neural network (PNN) to discriminate between normal and arrhythmic heartbeats. The proposed method follows four steps: (1) preprocessing, (2) heartbeat segmentation, (3) DE feature optimization, and (4) PNN classification. In this method, we have employed direct signal amplitude points constituting the heartbeat acquired from the ECG holter device with no secondary feature extraction step usually used in case of hand-crafted, frequency transformation or other features. The heartbeat types include normal, left bundle branch block, right bundle branch block, premature ventricular contraction, atrial premature, ventricular escape, ventricular flutter and paced beat. Using ECG records from the MIT-BIH, heartbeats are identified to start at 250 ms before and end at 450 ms after the respective R-peak positions. In the next step, the DE method is applied to reduce and optimize the direct heartbeat features. Although complex and highly computational ECG heartbeat classification algorithms have been proposed in the literature, they failed to achieve high performance in detecting some minority heartbeat categories, especially for imbalanced datasets. To overcome this challenge, we propose an optimization step for the deep CNN model using a novel classification metric called the Matthews correlation coefficient (MCC). This function focuses on arrhythmia (minority) heartbeat classes by increasing their importance. Maximum MCC is used as a fitness function to identify the optimum combination of features for the uncorrelated and non-uniformly distributed eight beat class samples. The proposed DE-PNN scheme can provide better classification accuracy considering 8 classes with only 36 features optimized from a 253 element feature set implying an 85.77% reduction in direct amplitude features. Our proposed method achieved overall 99.33% accuracy, 94.56% F1, 93.84% sensitivity, and 99.21% specificity.


Introduction
An electrocardiogram (ECG) represents electrical activity of the heart in a graphical manner. It is a non-invasive and commonly-used tool by clinicians and cardiology specialists to monitor the function of the heart and diagnose both critical and non-critical heart diseases. The ECG signal is defined by a standard PQRST sequence of waves as shown in Figure 1. The P wave indicates atrial depolarization. The QRS complex consists of a Q wave, R wave and S wave and represents ventricular depolarization. The T wave comes after the QRS complex and indicates ventricular repolarization. Each of these entities, i.e., P wave, QRS complex and T wave (possibly a U wave) have a unique pattern in terms of duration, amplitude and consecutive inter-beat correlation. A deviation in this normal pattern signifies an abnormal event. The diseased state in the case of cardiovascular monitoring is called an arrhythmia. The occurrence of an arrhythmic event is rare but critical and life threatening leading to a sudden cardiac arrest or sudden cardiac death incident. Recently, cardiovascular health monitoring has shifted from traditional in-clinic ECG machines [1,2] to portable and wearable ECG devices [3][4][5] that accumulate 24 h single-lead patient ECG data for long-term and continuous monitoring scenario. Identifying deviating patterns from the normal heartbeats in this large accumulated data is a tiresome and tedious job for clinicians and suffers from inter-and intra-observer variation error. This problem has led to the evolution in the development of computer-aided diagnostic methods for cardiovascular disease pathology indication for early referral to cardiac specialists and initiation of proper and timely medical attention.
Recent developments in the use of wearable ECG devices and on devices based on the Internet of Medical Things (IoMT) have led to an explosion of routinely collected individual ECG data. The use of feature engineering and computational intelligence methods to turn these ever-growing ECG monitoring data into clinical benefits seems as if it should be an obvious path to take. Computer-aided ECG arrhythmia classification systems that use intelligent techniques for the development of smart healthcare monitoring platforms are popular nowadays. A computer-aided early referral arrhythmia classification system [6][7][8] usually involves a feature extraction process in which a set of features is calculated for each individual heartbeat (the type of features used might be hand-crafted, statistical, morphological or spectral, etc.) and classifier construction to learn the features and classify incoming heartbeats. Using all the features calculated in the feature extraction step and a multi-layered classifier not only introduces heavy computational cost but also affects classifier performance due to the presence of redundant/corrupted features. The latest systems deploy a feature reduction/optimization step before classification to remove all unnecessary features. This also allows the use of a single layered or a computationally less intensive learning algorithm for classification. In the latest competitive research, novel features and various classifiers have been utilized for ECG beat classification tasks. Sayantan et al. [9] feature representation of ECG is learnt using the Gaussian-Bernoulli deep belief network followed by a linear support vector machine (SVM) training in the consecutive phase. Elhaj et al. [10] investigated principal components of discrete wavelet transform coefficients and higher order statistics. Afkhami et al. [11] used parameters of Gaussian mixture modeling together with skewness, kurtosis and 5th moment and applied an ensemble of decision trees to classify the heartbeats using a class-oriented scheme. Liu et al. [12] improved the dictionary learning algorithm for vector quantization of ECG. Shen et al. [13] used wavelet transform-based coefficients, signal amplitude and interval parameters. A new classifier, which integrates k-means clustering, one-against-one SVMs, and a modified majority voting mechanism, is proposed to further improve the recognition rate for extremely similar classes. Qin et al. [14] developed wavelet multi-resolution analysis to extract time-frequency domain features and applied one-versus-one support vector machine to characterize six types of ECG beats.
Zhai [15] and Acharaya et al. [16] used a CNN classifier. Oh et al. [17] used CNN and LSTM in combination to propose a refined classification method and generated synthetic data to overcome imbalance problem with accuracies of 94.03% and 93.47% with and without noise removal, respectively.
Recently, researchers have presented different feature reduction methods to reduce the input dimensions of ECG signals for neural classifiers. To name a few of the latest, Zhang et al. [18] extracted statistical features applying a combined method of frequency analysis and Shannon entropy and used information gain criteria to select 10 highly effective features to obtain a good classification on five types of heartbeats. Yildrim et al. [19] implemented a convolutional auto-encoder-based nonlinear compression structure to reduce the feature size of arrhythmic beats. Tuncer et al. [20] applied the neighborhood component analysis feature reduction technique to obtain 64, 128 and 256 features from a 3072 feature vector size. Wang et al. [21] proposed an effective ECG arrhythmia classification scheme consisting of a feature reduction method combining principal component analysis with linear discriminant analysis. Alonso-Atienza et al. [22] used a filter-type feature selection procedure which was proposed to analyze the relevance of the computed parameters. Chen and Yu [23] applied nonlinear correlation-based filters, calculated feature-feature correlation to remove redundant features prior to the feature selection process based on feature-class correlation. Asl et al. [24] proposed the feature reduction scheme based on generalized discriminant analysis. Haseena et al. [25,26] used a fuzzy C-mean (FCM) clustered probabilistic neural network (PNN) for the discrimination of eight types of ECG beats. The performance has been compared with FCM clustered multi layered feed forward network trained with the back propagation algorithm. Important parameters are extracted from each ECG beat and feature reduction has been carried out using FCM clustering. Polato et al. [27] used principal component analysis. Genetic algorithms have also been applied recently for the optimization of ECG heartbeat features [28][29][30][31] and proved to be advantageous in improving the time-cost value in heartbeat classification methods.
Previously proposed automated cardiovascular disease diagnosis systems have mostly followed the design objective of achieving high performance by maximizing accuracy, F1-score, sensitivity and precision measures. A major limitation in the case of general and particularly cardiovascular disease diagnosis is a highly unbalanced ratio or frequency of occurrence of normal to abnormal events. Furthermore, existing multi-class learning approaches mainly focus on exploiting label correlations to facilitate the learning process. However, an intrinsic characteristic of multi-class learning, i.e., class-imbalance [32] has not been well studied [33][34][35]. The Matthews correlation coefficient (MCC) was first used by B.W. Matthews for the performance assessment of protein secondary structure prediction [36]. Since then, it has become a widely used performance measure in biomedical research. MCC and Area Under ROC Curve (AUC) have been chosen as the elective metric in the US FDA-led initiative MAQC-II that aims to reach a consensus on the best practices for the development and validation of predictive models for personalized medicine [37].
This research models a metaheuristic search algorithm Differential Evolution (DE) [38] which is a very robust and highly effective heuristic algorithm. Differential evolution has also been used in many applications in many fields. For example, surface and Beizer curve optimization [39], electronic circuitry [40], lithology [41], optimizing solar cells [42] and many others. Although not directly related, these papers should be cited to show the wide range of uses of the differential evolution algorithm. The current work implements DE to optimize direct ECG heartbeat amplitude features to maximize MCC for eight arrhythmia beat classes having imbalanced and uncorrelated class distributions. The algorithm is tuned to find a minimized optimum combination of features that performs better as compared to all features. The motivation here is to remove noisy or redundant signal points, specifically for the task of classification. Classification using PNN is performed with optimum and all features to show the difference. The proposed method is simply depicted in Figure 2. Using PNN for classifying abnormal heartbeats with reduced direct heartbeat amplitude points diminishes the computation of a secondary feature extraction step, produces higher classification performance due to removal of unnecessary features and is faster due to the optimized minimum number of features and less complex PNN learning algorithm. The rest of this paper is organized as follows. In Section 2, the clinical data, cardiac cycle identification and normalization, DE feature reduction and the PNN classification for arrhythmia identification are described in detail. Section 3 includes the performance evaluation measures and data division for training and testing. Results are presented in Section 4. A detailed discussion on the achieved results plus some future possibilities are presented in Section 5.

Clinical Data
ECG data for this study belongs to "MIT−BIH arrhythmia database" developed in 1987 and are available as open source on Physionet (https://physionet.org, accessed on 15 December 2020) [43,44]. The database consists of 48 two-channel ambulatory ECG records, each of approximately 30 min duration digitized at a sampling rate of 360 Hz acquired from 47 subjects out of which 25 subjects were men aged 32 to 89 years, and 22 were women aged 23 to 89 years (2 records came from the same subject). Each record has simultaneous recordings from 2 leads, MLII and V5. For the purpose of testing a wearable ECG sensing scenario that mostly uses a single lead for acquisition [45], this work uses ECG signal from only the MLII lead. Each record is supported by an annotation file providing the R-peak positions and corresponding beat labels (Lb). Hence, for this research, 107,800 heartbeats are used having corresponding labels for 8 classes, i.e., normal (NORM), left bundle branch block (LBBB), right bundle branch block (RBBB), premature ventricular contraction (PVC), atrial premature contraction (PAC), ventricular escape (VESC), ventricular flutter wave (VFLT) and paced (PACE) beat. The selected 8 classes include less frequent but clinically significant arrhythmic beats too to prove the validity of the proposed algorithm. An sample of all beat patterns is shown in Figure 3 as an example.

Proposed Methodology
The proposed methodology as graphically shown in Figure 2 and in detail in Figure 4 is explained in four steps; (1) preprocessing, (2) cardiac cycle identification and normalization, (3) feature optimization, and (4) disease-based classification as follows:

Preprocessing
In the preprocessing stage, power and low-frequency components are removed from the raw ECG signal by using a 6th-order bidirectional Butterworth band-pass filter with lower and upper cut-off frequencies of 0.5 and 40 Hz, respectively. The baseline is computed as a cubic spline interpolation of fiducial points placed 90 milliseconds before R-peak positions as an approximation for baseline PR-segment and subtracted from the bandpassfiltered signal.

Cardiac Cycle Identification and Normalization
Using the R-peak positions provided with each record, a heartbeat sample is identified as having an onset of 250 ms before each R-peak position to 450 milliseconds after each R-peak position. This definition makes each heartbeat consist of 253 sampling points and ensures that the important characteristic points of ECG such as P, Q, R, S, and T waves are included [46] as shown in Figure 5. The signal amplitude biases in the waveforms of the ECG beat samples are inconsistent due to instrumental and human errors. Hence, we utilize the Z-score method to reduce the above-mentioned differences in each ECG beat sample. Through the Z-score method, the mean value of each ECG sample is first subtracted from each ECG sample to eliminate the offset effect and then divided by its standard deviation [21]. This procedure results in a normalized ECG beat sample with zero mean and unity standard deviation. Figure 3 shows samples for all 8 ECG beat classes used in this research.

Feature Optimization
The mathematical model followed for feature optimization using DE to find the minimum number of features that result in maximum classification performance is explained as follows.

Population Initiation
An initial population matrix P is generated as in Equation (1) to represent the possible solution/optimization space consisting of n p number of binary row vectors p called population individuals each of length n f (number of features in heartbeat samples in this case 253 as mentioned in Section 2.2.2).
. . . . p n p−1,1 p n p−1,2 . . . p n p−1 ,n f p n p,1 p n p,2 . . . p n p ,n f where, p i,j represents bit value at j th feature position in i th population individual. Here, j = 1 to n f and i = 1 to n p . 1's and 0's in each population individual represent the selected and non-selected features, respectively. p i,j for p 1 to p n p−1 are generated setting probability equal to 0.8 for a bit being 1. The last row population individual p n p is set to p all and is defined as a population individual representing an 'All-feature' set in the optimization space. This tunes the DE optimization process to find a final subset of optimized and reduced features that achieves even better fitness than the all feature set and is mathematically represented in Equation (2).
The number of individuals n p is chosen as 50 so that it is large enough to avoid stagnancy and small enough to avoid excessive computing time [47,48].

Fitness Evaluation
The fitness function, fit in this case, is modeled as the k-category MCC [36,49] mathematically expressed as Equation (3) considering one versus rest strategy taking all 8 classes one by one as positive (P) and the rest of 7 classes as negative class (N). All feature subsets represented by p in P are selected from the dataset and individually trained using PNN as explained in Section 2.2.4 and fit is calculated on the testing subset.
Here, TP = number of samples for which positive class was correctly identified, TN = number of samples for which negative class was correctly identified, FP = number of samples for which positive class was wrongly identified and FN = number of samples for which negative class was wrongly identified and k denotes the number of classes and k = 8 for the current problem. Hence, FP and FN represent misclassifications or error made by the classification algorithm. Mean calculated over MCC individually for 8 classes is modeled as fit. A maximization of fit is carried out to find the optimum combination of features. Maximization of the defined fitness function is carried out using maximum 200 generations.
Crossover Randomly selecting two different individuals p i1 and p i2 from P, a 1-point crossover is performed where, i1, i2 are randomly generated index values between 1 and n f with crossover probability (CR = 0.8). The population individual v i obtained after the crossover operation is called an offspring. Similarly, an offspring vector is created corresponding to every row in P to create a trial population matrix V.

Mutation
A bit-flip is performed with mutation probability (MR = 0.2) for all v i 's in V. Hence, currently there exists a parent population P G and an offspring population V G+1 (after crossover and mutation) both of size n p x n f .

Selection
The fitness function fit for each individual in the V is calculated using Equation (4). Applying the current-to-best strategy, if v i shows a higher fit value than the corresponding p i , then p i in the P is replaced with v i . Otherwise, the p i retains its position. This comparison and replacement process is repeated for every (p i , v i ) pair an evolved version of P is obtained at the end of the generation. This process evolves and accumulates better individuals until the maximum number of generations, i.e., 200 is reached. After looping through all generations every individual in the P is replaced with the best possible candidate, i.e., having the highest fit value. p sel with the best fit in the end P is selected as the optimum feature subset with 1's representing the selected features out of total n f .

Termination
The process terminates if the maximum number of given generations 200 is reached or fit becomes stagnant for a consecutive 20 generations. For every new generation, a new V is generated using the updated P. Hence, crossover and mutation occur in every generation. The default control parameters are summarized in Table 1.

Disease-Based Classification
Training and testing subsets composed of optimized subset of features p sel obtained in the last step are now extracted from complete training and testing subsets and can now be used to classify unseen beats using PNN [50]. The PNN consists of an input layer, a pattern layer, a summation layer, and a output layer. This architecture is illustrated in Figure 4 (Step 4). The neurons of the input layer convey the input features a = [a 1 , a 2 , . . . a j , . . . , a n s ] T to the neurons of the pattern layer directly, where n s represents the number of optimized features in p sel and n s <= n f .
In the pattern layer of PNN, Gaussian function is used to calculate the output of the neuron a ko as in Equation (5) using the input vector a transferred down from the input layer: where, a ki is the vector of neurons, σ defines the standard deviation also called spread for the Gaussian function and n s is the size/dimension of the pattern vector a. a − a ki is the Euclidean distance between a and a ki . The neurons in the summation layer calculate the maximum likelihood of the pattern vector a being categorized into class k by averaging the output of all neurons in the pattern layer that belong to the same class as mentioned in Equation (6).
where n k is the total number of the samples in class k. The neuron in the decision layer applies Bayes's decision rule to determine the class belongingness of the pattern a by Equation (7).
where   p sel = p best with maximum 'fit' value as in Equation (3)

Performance Evaluation
Out of the 108,700 beat samples, 50% were selected as the training subset and the remaining 50% as the testing subset. Table 2 summarizes the details of the available beat samples from each class. All the available class samples in the MIT-BIH database are used in the current test to keep the arrhythmia instance ratio as close to real as possible.
All the definitions mentioned below follow a one-versus-rest strategy [51]. Each classification measure is calculated for each of the eight classes (taking one class as positive and all the rest as negative) and then averaged to represent the mean classification measure. The PNN classification was performed for All features set (as the exact solution) and Optimized features subset obtained after DE. Hence, all measures are reported for both All features and Optimized features cases to present a comparison between classification improvement and feature reduction achieved using the proposed method. Here, TP, TN, FP, and FN follow the same definition as mentioned in 'Fitness Evaluation' part.  Table 3 shows a comparison of the proposed DE-PNN algorithm with the selected 'All feature' standard. The confusion matrices for both are reported in Table 4. The optimized features which result in the maximum MCC are plotted in Figure 6. The average number of generations by which the optimization is achieved was 78 ± 12 (10 trials). After an average 78 generations, the fitness value becomes stagnant meaning the fitness function has achieved its maximum value and is no longer improving.    Table 5. To analyze the efficacy of optimized features in distinguishing between simulated cardiac conditions, the receiver operating characteristic (ROC) is plotted using a one-versusall class strategy and area under the curve (AUC) is calculated. By analogy, the higher the AUC, the better the capability of recognition of the particular class by the classification algorithm. Figure 7 shows the ROCs and AUCs of every class in the case of optimized and all features. The AUC for all arrhythmia classes except paced beat has increased with maximum AUC improvement for VFLT (10%) which is the rarest class in the currently used dataset and secondarily PVC (4%), both representing critical pathological conditions. Overall, the recognition for all classes has improved or stayed consistent with 85.77% reduction in number of features.

Discussion
The proposed method presents an accurate and computationally efficient arrhythmia classification method using direct ECG amplitude signal features. More than 100,000 ECG heartbeats are obtained with eight types of ECG beats including one normal and seven arrhythmic beat types. Feature optimization is performed by modeling optimization input as binary vectors representing different feature combinations using DE. An optimized feature subset is obtained which is then used with a simple PNN classifier. The proposed method achieved 85.77% reduction in directly acquired features with comparable classification performance. Figure 6 shows the optimized and selected 36 out of 253 (total amplitude feature points). The higher classification performance achieved could be due to better beat definition (250 ms before and 450 ms after the R-peak positions) as compared to [52] which arbitrarily used 200 samples around the R-peak. Our definition makes sure the inclusion of important physiological characteristics necessary to distinguish between the currently classified arrhythmia types which are most ventricular types. Furthermore, on the algorithm design level, adding an all feature combination to the solution space pushes the optimization process to find a solution better than the All features scenario.
Moreover, we compared the classification performance of the proposed DE-PNN scheme for ECG arrhythmia classification with those of other schemes simultaneously utilizing different feature reduction methods and neural classifiers presented in the literature as summarized in Table 6. Jun et al. [53] used the same direct ECG amplitude features as used in this work and presented a comparison between 2D-CNN, AlexNet, and VGGNet models. All three of the models were deployed using TensorFlow [54] which is a deep learning Python library proposed by Google especially for GPGPUs and yet used two Intel Xeon E5 CPUs and two NVIDIA K20m GPUs to reduce the learning time. All tested classifiers had complex architectures implying extremely high computational cost with no feature optimization/reduction function which is not suitable for continuous monitoring using wearable sensing modality. Yildrim et al. [19], Tuncer et al. [20], and Elhaj et al. [10] used wavelet features with multiple different combination of features to perform arrhythmia classification adding feature computation layer in the processing algorithms performing optimizations focused on classifier parameters rather than feature engineering. DE-PNN aimed at searching for the optimum feature combination that provides maximum recognition capability for arrhythmic heartbeats removing redundant and selecting highly discriminating features. Overall, the achieved ECG arrhythmia classification result indicates that the detection of arrhythmia using 14.23% (85.77% reduced) features of a complete ECG heartbeat can be an effective approach to help general physicians and cardiology specialists to diagnose critical cardiovascular diseases in continuous and long-term, online or offline monitoring scenarios particularly well-suited for a wearable sensing setting. For future work, the current algorithm may be extended to recognize 16 classes (1 normal and 15 arrhythmic) for which the annotations are available with the MIT-BIH dataset. A future DE optimization might focus on a multi-objective approach to maximize arrhythmia recognition whilst minimizing percentage signal distortion (accuracy and compression being the two objective functions) to make the ECG signal reproducible for clinical analysis.