A Metaheuristic Optimization Approach for Parameter Estimation in Arrhythmia Classification from Unbalanced Data

The electrocardiogram records the heart's electrical activity and generates a significant amount of data. Analyzing these data helps us detect diseases and disorders via heart bio-signal abnormality classification. In unbalanced-data contexts, where the classes are not equally represented, the optimization and configuration of classification models are highly complex, which is reflected in the use of computational resources. Moreover, the performance of electrocardiogram classification depends on the approach and on parameter estimation to generate a model with high accuracy, sensitivity, and precision. Previous works have proposed hybrid approaches, and only a few implemented parameter optimization; instead, they generally applied empirical tuning of parameters at the data level or the algorithm level. Hence, a scheme including sensitivity metrics at a higher precision and accuracy scale deserves special attention. In this article, a metaheuristic optimization approach for parameter estimation in arrhythmia classification from unbalanced data is presented. We selected an unbalanced subset of the MIT-BIH Arrhythmia Database to classify eight types of arrhythmia. It is important to highlight that we combined undersampling based on a clustering method (data level) and a feature selection method (algorithmic level) to tackle the unbalanced-class problem. To explore parameter estimation and improve the classification of our model, we compared two metaheuristic approaches based on differential evolution and particle swarm optimization. The final results showed an accuracy of 99.95%, an F1 score of 99.88%, a sensitivity of 99.87%, a precision of 99.89%, and a specificity of 99.99%, which are high even in the presence of unbalanced data.


Introduction
Cardiovascular diseases are a set of heart and blood vessel disorders. According to the World Health Organization, 17.9 million people die each year from cardiovascular disease [1]. For this reason, the detection of heart bio-signal abnormalities (i.e., arrhythmias) via electrocardiograms is vital. Arrhythmias are disturbances or irregularities of the cardiac electrical impulses [2].

The P-wave
The P-wave is usually positive in most leads (but it is always negative in aVR); see Figure 1. It can sometimes have a negative deflection in leads III and V1, or be biphasic in these leads and in lead aVL. Figure 1 shows the ECG waves, segments, and intervals [4].
Unbalanced data refers to a classification problem where the number of observations per class is not equally distributed: the dataset has many observations for one or more classes (the majority classes) and far fewer for one or more other classes (the minority classes). Many datasets used for arrhythmia-beat classification suffer from unbalanced data. The imbalance stems from the nature of arrhythmias, since they are cardiac abnormalities that do not appear regularly. An ECG signal is captured from a patient suspected of having arrhythmia with a Holter device, which monitors the patient's cardiac signals for several hours. During this time, the arrhythmia will probably appear for only a few seconds, and for the remaining time the signal behaves normally. So we have arrhythmia signals that last seconds versus normal heartbeat signals that last minutes or even hours; hence, when an ECG dataset is created, it is not naturally balanced. In other words, a balanced dataset is not going to be found. After the signals are captured and the dataset is created, this imbalance can be reduced mainly by two techniques, oversampling and undersampling, each with advantages and disadvantages. In our case, we chose undersampling, because oversampling implies a need to validate synthetic data. Since we are developing health-oriented applications, this validation is not a trivial problem: it requires medical personnel specialized in cardiac signals, such as a cardiologist, to validate large amounts of data.
For example, in the MIT-BIH Arrhythmia Database, the instances of normal beats (74,758) and left bundle branch block beats (8072), the majority classes, far surpass the numbers of arrhythmia beats such as aberrant atrial premature beats (150) and ventricular escape beats (106), the minority classes. Notably, this leads to a multi-class classification problem with unequally distributed data; see Table 2 in Section 3.2. Generally, this problem involves a trade-off between recall (the percentage of truly positive instances classified as such) and precision (the percentage of positive classifications that are truly positive). In a detection context, it is usually more costly to miss a positive instance than to label a negative instance falsely. Since we want to detect the arrhythmia cases, we prefer the F1 score as the main performance metric of the classifier model because it combines precision and recall. The unbalanced-dataset problem affects the performance of most classification techniques [5], and its solution requires research in several areas; in our case, we propose a computational method using artificial intelligence. Three types of methods are used to tackle unbalanced datasets [5]:
• Data-level methods: These are called external methods. They consist of adding samples of the minority class (oversampling) or, alternatively, removing samples from the majority class (undersampling) [6].
• Algorithmic-level methods: This solution is an internal approach whose goal is to design or enhance the classification algorithm. Examples are ensemble methods [6][7][8], feature selection methods [9][10][11], and algorithm improvement methods [12][13][14][15].
• Hybrid methods: Combinations of data-level and algorithmic-level methods [6,13,14,15].
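The precision/recall trade-off behind the choice of F1 can be made concrete with a small sketch; the function name is ours and the sample values are only illustrative:

```python
# F1 score as the harmonic mean of precision and recall.
# The harmonic mean is dominated by the smaller of the two values,
# so a classifier that sacrifices either metric is penalized.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision with poor recall (many missed arrhythmias)
# still yields a low F1, which is why F1 suits detection tasks.
print(f1_score(1.0, 0.10))   # ~0.18
print(f1_score(0.90, 0.90))  # 0.90
```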
In [13], a method to tackle the use of unbalanced data for classification using oversampling with the synthetic minority oversampling technique (SMOTE) was proposed, creating synthetic samples of the minority class. However, its disadvantage for ECG classification is the need for a cardiologist to validate the data created artificially. In [6], an undersampling method known as the random under-sampling technique (RUS) was proposed; in this case, some samples are removed from the majority class by a random selection. The weakness of this technique is the possibility of eliminating informative instances from the majority class.
We designed the following questions to serve as a roadmap for the research presented in this paper:
• How can one design a hybrid parameter optimization method that obtains high-performance (F1 score and accuracy) arrhythmia heartbeat classification in the presence of unbalanced classes?
• Which classification technique and which metaheuristic approach allow us to obtain high-performance (F1 score and accuracy) arrhythmia heartbeat classification in cases of unbalanced classes?
The goal of this study was to design a hybrid parameter optimization method based on metaheuristic approaches for arrhythmia classification using unbalanced electrocardiogram signal data. Our method consists of four principal stages:
1. Data acquisition: We selected an unbalanced subset of the MIT-BIH database to classify eight types of beats: four majority classes (normal beat, left bundle branch block beat, right bundle branch block beat, and premature ventricular contraction) and four minority classes (aberrant atrial premature beat, ventricular flutter wave beat, nodal escape beat, and ventricular escape beat).
2. ECG signal preprocessing: The goals of this stage are to reduce noise and prepare the ECG signals so that relevant information can be obtained in the feature extraction stage, because the quality of the signals influences the performance of the classifier. We applied amplitude normalization to reduce variations among the patients' ECG signals and a cluster-based instance selection on the majority classes to reduce the number of samples.
3. Feature extraction: From the preprocessed signals, we extracted statistical features to obtain relevant information that allows the classifier to identify the label of each heartbeat.
4. Metaheuristic: In this stage, the classifier's parameter optimization is carried out by combining the data and algorithmic levels using a metaheuristic approach:
• Data level: The metaheuristic performs undersampling over the majority classes, selecting instances based on clustering and creating a new subset that merges the selected instances of the majority classes with all the instances of the minority classes. This process avoids losing relevant information at the algorithmic level. Besides, we implemented a feature selection method to reduce the high data dimensionality and improve classifier performance.
• Algorithmic level: We use an artificial neural network as the classifier.
In this case, the metaheuristic selects the number of neurons in the hidden layer, and it also selects the features from the new subset. The performance measures considered are accuracy, sensitivity (recall), specificity, precision, and F1 score.
Our hypothesis is that a hybrid optimization method that optimizes the classifier parameters by combining the data and algorithmic levels, based on a metaheuristic approach, can improve the performance of the arrhythmia heartbeat classification method up to 98% when the classes in a database are unbalanced. To test this hypothesis, we compare two metaheuristic approaches, particle swarm optimization (PSO) [16] and differential evolution (DE) [17], evaluating the parameter optimization performance of the classifier model. Although both metaheuristics are population-based, PSO optimizes the model using information about the performance of the best individual in the population as well as the performance of each individual. In contrast, DE, in its DE/rand/1/bin version, optimizes using only the performance of each individual in the population. The paper is organized as follows: Section 2 reviews work related to this research. Section 3 presents the database used and its characteristics. Section 4 describes the metaheuristic approach applied in our study and the experimental design. Section 5 presents the obtained results, and Section 6 discusses them. Finally, Section 7 details the conclusions of this work.

Related Work
In cardiology, machine learning provides tools that extend the capabilities of physicians, such as improving the identification and interpretation of disorders of heart functionality. Among cardiovascular disorders, cardiac arrhythmias are the most common, and as a result, their precise classification has been of great interest in biomedical studies.
There are several approaches to arrhythmia classification that aim to improve abnormality identification using electrocardiogram signals. The differences among studies are the techniques used in preprocessing, the methods employed for feature extraction and feature selection, and the proper selection of the classification algorithm. For example, Chen et al. [18] proposed an approach with a combination of dynamic and projected features derived from a random projection matrix. They used a support vector machine to classify with two different approaches. The first, called the "class-based approach," is used to classify fifteen classes, with 98.46% accuracy. The second, named the "subject-based approach," merges the fifteen classes into five classes, according to the ANSI/AAMI EC57:1998 standard, resulting in 93.1% accuracy.
Elhaj et al. [9] developed an approach for classifying five types of arrhythmias. They extracted two types of features: linear features from the coefficients of a signal decomposition based on the discrete wavelet transform, and non-linear features using higher-order statistical analysis. The authors applied two classification techniques, a support vector machine with a radial basis function and a feed-forward neural network with forty neurons in the hidden layer, obtaining 98.91% and 98.90% accuracy, respectively.
Hassanien et al. [10] introduced an approach to classify two types of heartbeats using a modified Pan-Tompkins algorithm (MPTA) as a feature extractor and a support vector machine (SVM) as the classification algorithm. The authors used a metaheuristic called elephant herding optimization (EHO) to select a subset of features that improves the performance of the classifier. The metaheuristic is based on the herding behavior of elephants and is used in global optimization problems. The results were 93.31% accuracy, 45.49% sensitivity, 46.45% precision, 45.48% F-measure, and 45.48% specificity.
Ashtiyani et al. [11] proposed an approach to classify three types of heartbeats using heart rate variability (HRV) and discrete wavelet transform (DWT) to extract features and to reduce the high dimensionality. The authors used a genetic algorithm to select the best features and implemented a SVM for classification. Their results were 97.14% accuracy, 97.54% sensitivity, 96.9% specificity, and 97.64% precision.
Rajesh and Dhuli [13] proposed an approach to classify five groups of heartbeats, applying a nonlinear feature extraction scheme to tackle the problem of unbalanced groups. They used two approaches: (1) data level, where the authors used a resampling method called "distribution-based data sampling"; and (2) algorithmic level, where they employed an AdaBoost classifier. Their results were 99.1% accuracy, 97.9% sensitivity, and 99.4% specificity. Sarvan and Ozkurt [12] addressed the problem of unbalanced classes using the algorithmic-level approach, implementing two strategies to classify five arrhythmia types. The first strategy tackles unbalanced data using a convolutional neural network (CNN) with an increased number of epochs to improve the classifier's ability to learn. In the second strategy, the authors used a data ensemble including the same number of samples from each class. To create the data ensemble, the majority classes were divided into clusters of the same size as the minority classes. In both strategies, they implemented a CNN architecture with nine layers: three convolutional layers, three max-pooling layers, and three fully connected layers. Their results were 93.72% accuracy, 26.85% sensitivity, and 99.6% specificity.
Jiang et al. [14] introduced a multi-module neural network system (MMNNS) focused on solving the imbalance problem in ECG databases. The system has four modules: preprocessing, imbalance problem processing, feature extraction, and classification. The authors addressed the problem using a combination of the data-level and algorithmic-level approaches. At the data level, they applied an oversampling algorithm called borderline-SMOTE (BLSM), and at the algorithmic level, they used a module that extracts and selects features considering the bias toward the majority class. Moreover, the authors employed a CNN adopting two stages of training, the first stage with balanced data and the second stage with unbalanced data. Their system obtained 98.4% accuracy.
Gao et al. [15] introduced an approach to classify eight types of heartbeats in the presence of class imbalance, tackling the problem with an algorithmic-level approach. The authors used an LSTM (long short-term memory) recurrent network with a focal loss function to optimize the network weights, focusing on heartbeats that are challenging to classify. The results were 99.26% accuracy, 99.26% sensitivity, 99.14% specificity, and a 99.27% F1 score.
An interesting work was developed by Hajeb-Mohammadalipour et al. [19]. The authors used the MIT-BIH Arrhythmia Database, where two types of classes are included: (1) beats and (2) rhythms. The authors designed a multi-sequential approach with four stages for discriminating among various malignant arrhythmias of the rhythm type. Their innovation was to integrate a method with VF, AF, PVC, and sinus arrhythmia episode detection capabilities, using features derived from time-frequency spectral analysis and nonlinear dynamics to discriminate VF from non-VF arrhythmias. Regarding the classification process, in the first stage, to identify one type of rhythm class, they used eight-second segments, which included several beats that were not classified independently; they only identified abnormalities in segments of several seconds. After that, the authors classified two other types of rhythm classes, but as a result of the first stage. In contrast, our work identifies signal segments of 0.69 s, equivalent to a single beat, and classifies which of eight arrhythmia types is present.
As described above, the problem of unbalanced datasets affects the performance of the classifier; metrics such as precision and sensitivity help to observe the impact of this issue. For example, some previously mentioned works have low values of these metrics [10,12]. Improving both metrics is necessary to avoid false positives and false negatives in arrhythmia diagnosis: if the patient is not correctly diagnosed, he or she may not receive treatment, or may be harmed by a recommendation of the wrong medication. The contribution of our work is the performance improvement of the classifier in the case of unbalanced datasets, which could help in creating systems that support physicians in the diagnosis of arrhythmias and in patient health care.

Multiclass Classification Methods
Recall that supervised multiclass classification algorithms aim to assign a class label to each input example. To carry out this task, we can use two main approaches, one-versus-all (OVA) and all-versus-all (AVA), each described as follows [20]: One-versus-all (OVA): The simplest approach is to reduce the problem of classifying among K classes to K binary problems, where each problem discriminates a given class from the other K − 1 classes [21]. This approach requires N = K binary classifiers, where the kth classifier is trained with positive examples belonging to class k and negative examples belonging to the other K − 1 classes. When testing an unknown example, the classifier producing the maximum output is considered the winner, and its class label is assigned to the example.
The OVA approach was used by Hajeb-Mohammadalipour et al. [19]. Their approach takes one class as positive and the rest as negative, and trains the classifier. Thus, for data having n classes, it trains n classifiers. Then, in the scoring phase, all n classifiers predict the probability of their particular class, and the class with the highest probability is selected.
All-Versus-All (AVA): In the AVA approach, each class is compared to each other class [22,23]. A binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes. This requires building K(K − 1)/2 binary classifiers. When testing a new example, a vote is performed among the classifiers, and the class with the maximum number of votes wins.
In our case, we consider the AVA approach with binary pairs of classes and train each classifier on the subset of data containing those two classes. Thus, we train a total of K(K − 1)/2 classifiers. During the classification phase, each classifier predicts one class (in contrast to OVA, where each classifier predicts a probability), and the class predicted most often is the answer.
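The classifier counts for the two decomposition schemes can be sketched as follows (the function names are ours):

```python
# Number of binary classifiers each decomposition scheme trains
# for a K-class problem: OVA needs K, AVA needs K*(K-1)/2.
def ova_classifiers(k: int) -> int:
    return k

def ava_classifiers(k: int) -> int:
    return k * (k - 1) // 2

# For the eight heartbeat classes considered in this study:
print(ova_classifiers(8))  # 8
print(ava_classifiers(8))  # 28
```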
In this sense, we can describe some differences between these two approaches:
• OVA has a shorter training stage than AVA and hence is faster overall, so OVA is usually preferred.
• Every single classifier in AVA uses a subset of the data, so a single classifier is faster to train in AVA.
• AVA is less predisposed to imbalance in a dataset (dominance of particular classes).
Two main inconsistencies can arise when using these two approaches:
• What happens if two or more classes receive equal votes in the AVA approach?
• What happens if the probabilities are relatively close or equal in the OVA approach?
As we can see, both approaches have advantages and disadvantages. Besides, we perceive that an OVA approach is as accurate as any other based on the assumption that the classes are "independent." Thus, the classes do not belong to a natural hierarchy, and we do not necessarily expect examples from class "A" to be closer to those in class "B" than those in class "C." In the latter situation, especially when a few examples were available, we might suspect that an algorithm that exploited the relationships between classes could offer superior performance, leading to an interesting open question. Hence, we explore the AVA approach due to the assumption that arrhythmia classes are closer, as we showed in Figure 12.

Materials
For the purposes of this work, we used the ECG data gathered in the MIT-BIH Arrhythmia Database in order to carry out a comparison with the works described in the state of the art. The database is described as follows: MIT-BIH Arrhythmia Database [24]: This database was downloaded from the PhysioNet platform, and it contains 48 half-hour excerpts of two-channel ambulatory ECG recordings obtained from 47 subjects: 25 men aged 32-89 and 22 women aged 23-89. The sampling frequency of each recording is 360 Hz, with 11-bit resolution over a 10 mV range. The database has seventeen types of heartbeats and fourteen types of rhythms. From the database, we selected the most typical elements (four majority classes) and the least typical elements (four minority classes) to represent an unbalanced behavior. In Table 1, the classes are the annotations or acronyms used in the database to represent the arrhythmia names, and the samples are from the modified limb lead II (MLII). Table 1 also shows the imbalanced nature of the classes, because the numbers of samples differ: the sizes of the majority classes (N, L, R, and V) are several times greater than those of the minority classes (!, j, a, and E). Therefore, the selected subset shown in Table 1 helps us find ways to obtain better performance with unbalanced ECG data. We use the imbalance ratio (IR) to measure the class imbalance, applying Equation (1), where N− is the number of instances in the majority class and N+ is the number of instances in the minority class.
There is no rule about the value of IR that establishes a dataset as unbalanced, but Lango [25] states that an IR higher than 1.5 is considered unbalanced. Tsai et al. [26] mention that an IR higher than ten makes it challenging to build an effective classifier, and an IR higher than 100 can be so drastic that it may deteriorate classification performance. Table 2 shows the IR values between the classes selected from the MIT-BIH Arrhythmia Database. The IR values display a severe imbalance between the N class and the minority classes; besides, there is a substantial imbalance between the other majority classes and the minority classes. In this work, we focus on decreasing the IR between the selected classes by carrying out undersampling over the majority classes; see Tables 7 and 8 in Section 5.
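Equation (1) can be sketched as a one-line function; the function name is ours, and the hard-coded class counts are the MIT-BIH figures quoted in the Introduction:

```python
# Imbalance ratio (Equation (1)): IR = N_minus / N_plus, where N_minus
# is the majority-class count and N_plus is the minority-class count.
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    return n_majority / n_minority

counts = {"N": 74758, "L": 8072, "a": 150, "E": 106}  # MIT-BIH beat counts
print(imbalance_ratio(counts["N"], counts["E"]))  # ~705: severe imbalance
print(imbalance_ratio(counts["L"], counts["a"]))  # ~54: substantial imbalance
```

Both values far exceed the thresholds of 1.5 and 100 cited above, confirming the severe imbalance of the selected subset.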

Metaheuristic Approach
The metaheuristic approach for heartbeat classification with unbalanced data shown in Figure 2 has four stages. The first, data acquisition, uses data from the MIT-BIH Arrhythmia Database, analyzing the classes to generate an imbalance context of four majority classes and four minority classes. The imbalance ratio of the new subset of eight classes is calculated as explained in Section 3.2. The second, the ECG signal preprocessing stage, has four steps (see Section 4.1). First, we filter the records to remove noise from the signals. Then, amplitude normalization is applied to each signal to decrease the variation among patients' ECG signals. After that, the filtered signals are segmented into heartbeats. Finally, the clustering method is applied to the majority classes, dividing them into smaller groups. Feature extraction is the third stage, where statistical features are extracted per heartbeat from all the selected signals; see Section 4.3. In the final stage, metaheuristic optimization, a hybrid approach to tackle the imbalance problem is applied; see Section 4.4. The metaheuristic aims to select several parameters to improve the performance of the classifier. At the data level, it selects the size of the SOM feature maps and the number (percentage) of instances of each majority-class cluster. At the algorithmic level, the metaheuristic selects the number of neurons in the hidden layer of the artificial neural network and the features used to train and test the classifier. These parameters are represented in a vector, and a fitness function evaluates the performance obtained by each vector. In the next sections, we explain every stage in more detail.

Preprocessing
Electrocardiogram signals contain several types of noise. Hence, the preprocessing stage aims to reduce the contamination caused by muscle noise, power line interference, motion artifacts, and baseline drift, among other sources. Preprocessing is essential because the condition of the ECG signal influences classification performance. Figure 3 shows the major steps in the preprocessing stage.

Denoising
We select denoising as the first step of preprocessing to remove noise and to avoid edge effects at the ends of segments of the signal [27]. It is necessary to filter the signal to remove power line interference at 50 or 60 Hz, baseline wander (baseline drift) around 0.5 Hz, electromyogram noise at frequencies higher than 50 Hz, and low- and high-frequency noise components, all of which interfere with the signal analysis [28]. The principal components of electrocardiogram signals are the P wave, the QRS complex, and the T wave, whose frequencies range from 5 to 40 Hz [29].
We compare two independent filtering methods in two different scenarios; see Section 4.5.2. The first filtering method is a combination of low-pass and high-pass filtering. The second method is based on wavelets. The aim of the comparison is to observe the filtering process impact on the performance of the classifier model using combinations among filtering methods and metaheuristics approaches. We select the low-pass and high-pass filtering method because these types of filters have presented high performance at removing noise in electrocardiogram signals, such as muscle noise, baseline wander, power line interference, electromyography noise, and electrosurgical noise [30,31].
We used the wavelet filtering method because it has proven to be a useful tool for analyzing non-stationary signals such as the ECG; therefore, we implemented a denoising process based on the discrete wavelet transform (DWT), as other authors did [9,32].
Here we describe the steps of both methods:
1. Low- and high-pass filtering method: Initially, we apply a high-pass filter with a cutoff frequency of 1 Hz to remove low-frequency noise, for example, baseline wander. Then, we use a sixth-order low-pass Butterworth filter with a cutoff frequency of 30 Hz to remove high-frequency noise.
2. Wavelet filtering method: We apply an interval-dependent denoising method to the ECG signals to remove noise such as power line interference, electromyogram noise, and high-frequency noise components. First, we use the Daubechies D6 (db6) wavelet basis with nine levels of decomposition. Then, we reconstruct the decomposed signal and apply a new nine-level decomposition of the reconstructed signal using the db6 wavelet basis, removing the ninth-level approximation sub-band to eliminate baseline wander noise. Finally, we reconstruct the signal using the detail components of sub-bands one to nine to obtain a denoised and smoothed ECG signal.
Figure 4 shows the same signal before and after the filtering process. The raw signal in Panel A shows two types of noise: power line interference and baseline wander. Panel B shows the signal after the low-pass and high-pass filtering process; the noise of the raw signal was removed, but we can observe slight changes in amplitude: in the raw signal, the Q wave is deeper than the S wave, but in the filtered signal it is the opposite. Panel C shows the signal after the wavelet filtering process; the power line interference and baseline wander were removed, and the signal is smoother than the signals in Panels A and B.
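The first (low-/high-pass) method can be sketched with SciPy as below. The text does not state the order of the high-pass filter or whether zero-phase filtering was used, so the second-order high-pass and the use of `filtfilt` are our assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 360  # MIT-BIH sampling frequency (Hz)

def denoise(ecg, fs=FS):
    # High-pass at 1 Hz removes low-frequency noise such as baseline
    # wander (the second-order choice here is an assumption).
    b_hp, a_hp = butter(2, 1.0 / (fs / 2), btype="highpass")
    x = filtfilt(b_hp, a_hp, ecg)
    # Sixth-order low-pass Butterworth at 30 Hz removes high-frequency
    # noise such as 50/60 Hz power line interference.
    b_lp, a_lp = butter(6, 30.0 / (fs / 2), btype="lowpass")
    return filtfilt(b_lp, a_lp, x)
```

Zero-phase filtering (`filtfilt`) avoids shifting the QRS reference point used later for segmentation, at the cost of doubling the effective filter order.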

Normalization
Amplitude normalization has been used as a step of the preprocessing stage by authors such as Zadeh et al. [33], to eliminate the effect of amplitude changes. In contrast, Donoso et al. [34] mention that amplitude normalization is optional but helps to compare the signals of different patients visually. Thomas et al. [35] applied a normalization process to reduce the amplitude variance from file to file. In this work, we normalize the ECG signals into the range from −1 to 1 to reduce variations among patients' signals, using Equation (2), proposed by Kutlu et al. [36], where Y(s) is the normalized data matrix and X(s) is the sth sample vector. Figure 5 shows signals before (Panel A) and after (Panel B) normalization; both are visually similar, but their amplitude scales are different.
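A sketch of this step follows. Since Equation (2) is not reproduced in the text, the standard min-max mapping into [−1, 1] below is our assumption:

```python
import numpy as np

def normalize_amplitude(x):
    # Min-max mapping of one signal into [-1, 1]. The exact form of
    # Equation (2) is not reproduced here; this is a common variant.
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
```

After this step, signals from different patients share a common amplitude scale while keeping their original shape.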

Segmentation
Sixty bpm (beats per minute) is a normal heart rate for a healthy adult [30]. There is no rule about the number of sampling points in each segment, but Chen et al. [18] recommended 300 sampling points and Elhaj et al. [9] suggested 200. We chose 250 sampling points, resulting in a segment of 0.69 seconds that includes the P wave, QRS complex, and T wave. To carry out the segmentation, we first locate the heartbeat annotation. Then, using the QRS complex location as a reference point, we select 99 sampling points to its left and 151 sampling points to its right in the ECG signal; Figure 6 shows an example.
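The windowing step can be sketched as follows; the function name is ours, and treating the QRS reference sample as part of the right-hand 151 points is an assumption about the exact split:

```python
import numpy as np

def segment_heartbeat(signal, qrs_index, left=99, right=151):
    # Fixed 250-sample window (~0.69 s at 360 Hz): 99 samples before
    # the QRS reference point and 151 from it onward.
    if qrs_index - left < 0 or qrs_index + right > len(signal):
        return None  # beat too close to a record boundary to segment
    return signal[qrs_index - left : qrs_index + right]
```

At 360 Hz, 250 samples span 250/360 ≈ 0.694 s, matching the 0.69 s segment length stated above.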

Clustering
In our work, this task is a crucial step because it carries out the undersampling. There are several undersampling methods, such as those in [6,26,37,38,39,40]. In our study, we focus on preserving informative samples when the size of the majority classes is reduced based on clustering, as in [26,38].
A clustering process organizes the signal segments into groups (clusters) with similar features based on a degree of similarity. We use self-organizing maps because this neural network preserves the topological properties of the input space. Through the mapping, the relative distances among the points are preserved, allowing the data to be clustered without knowing the class memberships of the input data.

Self Organizing Map
A self-organizing map (SOM) is a neural network algorithm based on unsupervised learning. Its objective is to find the optimal positions of the cluster centers of the data through an iterative learning process. A SOM process has four principal stages: initialization, competition, cooperation, and adaptation. In initialization, the weight vectors are initialized with random values. In competition, for each instance, the neurons compute the value of a discriminant function, and the neuron with the smallest value wins. In cooperation, the winning neuron determines a neighborhood of excited neurons. In adaptation, the response of the winning neuron is enhanced by reducing the discriminant function values of the excited neurons through weight adjustments [41]. Figure 7 shows a SOM structure, where X_n is the feature vector, W_i is the weight vector of a neuron, and the group of neurons is a feature map.
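The four stages can be sketched as a minimal NumPy SOM. The linear decay schedules for the learning rate and neighborhood width are our assumptions, not the paper's exact settings:

```python
import numpy as np

def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Initialization: random weight vector for each neuron of the grid.
    weights = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    sigma0 = sigma0 if sigma0 is not None else max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = t / n_steps
            lr = lr0 * (1.0 - frac)              # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-9  # shrinking neighborhood
            # Competition: the neuron whose weight vector is closest
            # (smallest discriminant value) wins.
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # Cooperation: Gaussian neighborhood around the winner
            # measured on the 2-D grid, not in feature space.
            d_grid = np.linalg.norm(grid - grid[bmu], axis=1)
            h = np.exp(-(d_grid ** 2) / (2.0 * sigma ** 2))
            # Adaptation: pull the winner and its neighbors toward x.
            weights += lr * h[:, None] * (x - weights)
            t += 1
    return weights
```

After training, each neuron's weight vector serves as a cluster center, which is exactly how the clustering process in the next subsection uses the map.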

Clustering Process
Clusters are used in the fourth stage of the methodology for the instance selection that undersamples the majority classes. Figure 8 shows a diagram of the clustering process using SOM. In the first step, we select the size of a two-dimensional feature map, choosing the number of rows and columns to form a grid, where each node of the grid is a neuron. The total number of neurons is equal to the number of rows multiplied by the number of columns. We test feature map sizes with the same number of rows and columns, from two to ten: 2 × 2, 3 × 3, ..., 10 × 10. This process is carried out to observe the performance of the classifier under different configurations. Then the network is trained, and once training has finished, the number of obtained clusters is equal to or smaller than the number of neurons. In SOM, the weight vector associated with each neuron is the center of each cluster. Hence, for each cluster we compute the Euclidean distance from its instances to its center. This allows us to identify the instances farthest from the center and consider them as "noisy instances". Finally, the instances of each cluster are sorted in ascending order of distance. The undersampling is then carried out by the metaheuristic, which selects a percentage of the instances closest to each cluster's center.
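The selection of the closest instances per cluster can be sketched as below. The function name and the toy data are illustrative; the keep fraction is the percentage chosen by the metaheuristic.

```python
import numpy as np

def undersample_cluster(instances, center, keep_fraction):
    """Keep only the fraction of instances closest to the cluster center.

    Instances farthest from the center are treated as "noisy" and dropped.
    """
    dists = np.linalg.norm(instances - center, axis=1)
    order = np.argsort(dists)                       # ascending distance
    n_keep = max(1, int(round(keep_fraction * len(instances))))
    return instances[order[:n_keep]]

# Toy cluster: 100 points around the origin, keep the closest 30%
rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (100, 2))
kept = undersample_cluster(pts, center=np.zeros(2), keep_fraction=0.3)
print(kept.shape)  # (30, 2)
```

All instances of the minority classes bypass this step and are kept intact, as described later in the methodology.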

Feature Extraction
The feature extraction process is essential for classifying ECG signals because it gathers the information needed to differentiate among arrhythmia heartbeats. Statistical features are extracted because they allow us to analyze the type of distribution and the level of complexity that ECG signals exhibit, which discriminates variations among heartbeats [30]. Several authors report the use of statistical features [7,9,43-45].
The first step in feature extraction is to divide the heartbeat into subsegments. Each heartbeat has 250 samples. We recognize that from 25 subsegments onward, the dimensionality begins to increase considerably. Thus, the two most viable options are 5 or 10 subsegments. In order to choose between these two options, some tests were carried out; the results are shown in Table 3. We observed that 10 subsegments (50 features) demonstrate a higher performance regarding the F1 score mean and a smaller standard deviation than 5 subsegments (25 features). Hence, Figure 9 shows the segment division and the order of each feature. For each subsegment, the extracted statistical attributes are skewness, kurtosis, energy, mean, and standard deviation. At the end of the process, the number of features is fifty.
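The five statistics per subsegment can be sketched as below. This is an illustrative implementation with plain numpy (using the excess-kurtosis convention, which may differ from the authors' exact definitions).

```python
import numpy as np

def extract_features(beat, n_subsegments=10):
    """Split a 250-sample beat into equal subsegments and compute five
    statistics per subsegment: skewness, kurtosis, energy, mean, and
    standard deviation. With 10 subsegments this yields 50 features."""
    features = []
    for seg in np.split(np.asarray(beat, dtype=float), n_subsegments):
        mu, sd = seg.mean(), seg.std()
        z = (seg - mu) / sd if sd > 0 else np.zeros_like(seg)
        skewness = np.mean(z ** 3)
        kurtosis = np.mean(z ** 4) - 3.0   # excess-kurtosis convention
        energy = np.sum(seg ** 2)
        features += [skewness, kurtosis, energy, mu, sd]
    return np.array(features)

beat = np.sin(np.linspace(0, 2 * np.pi, 250))
print(extract_features(beat).shape)  # (50,)
```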

Metaheuristic Optimization
In this stage, we propose a hybrid optimization method that optimizes the classifier's parameters by combining the data and algorithmic levels through a metaheuristic approach to tackle the unbalanced class problem. We compare two metaheuristic approaches, particle swarm optimization (PSO) [16] and differential evolution (DE) [17], under similar conditions to search for the combination of parameters that improves the performance of the classifier.

Particle Swarm Optimization
Particle swarm optimization (PSO) is a metaheuristic for optimization based on the movement behavior of bird flocks and fish schools. In the standard PSO algorithm, a swarm is formed by particles that change position; each particle is a potential solution to the problem, and the particles change position according to three principles: (1) maintaining inertia, (2) changing conditions according to the particle's own optimal position, and (3) changing conditions according to the swarm's optimal position [16].
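The three principles appear as the three terms of the velocity update below. This is a generic PSO sketch on a toy objective, using the typical parameter values reported later for the experiments (w decaying from 0.9 to 0.4, c1 = c2 = 2); it is not the authors' implementation.

```python
import numpy as np

def pso_minimize(f, lo, hi, n=20, iters=100, w_max=0.9, w_min=0.4,
                 c1=2.0, c2=2.0, seed=0):
    """Minimal PSO sketch: inertia, personal-best pull, global-best pull."""
    rng = np.random.default_rng(seed)
    dim = len(lo)
    x = rng.uniform(lo, hi, (n, dim))            # particle positions
    v = np.zeros((n, dim))                       # particle velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_val)]              # swarm's best position
    for it in range(iters):
        w = w_max - (w_max - w_min) * it / iters  # exploration -> exploitation
        r1, r2 = rng.random((2, n, dim))
        # (1) inertia + (2) personal best + (3) swarm best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[np.argmin(pbest_val)]
    return g, pbest_val.min()

# Toy objective: sphere function, minimum 0 at the origin
best, val = pso_minimize(lambda p: float(np.sum(p ** 2)),
                         np.array([-5.0, -5.0]), np.array([5.0, 5.0]))
```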

Differential Evolution
Differential evolution (DE) is a stochastic population-based algorithm. It optimizes real-valued functions through the iterative execution of four steps: initialization, mutation, crossover, and selection [17]. Each step is described as follows:
• Initialization: A population of parameter vectors is generated randomly between a lower bound and an upper bound.
• Mutation: In each generation or iteration, the individuals of the current population become target vectors. For each target vector, the algorithm randomly selects three vectors from the population, all different from the current target vector; a donor vector is then created by adding the weighted difference of two of these vectors to the third.
• Crossover: After generating the donor vector, the recombination (crossover) operation is performed to enhance the potential diversity of the population. Elements of the current target vector and elements of the donor vector are randomly selected and merged to create a trial vector.
• Selection: Finally, the performances of the current target vector and the trial vector are computed with a fitness function; the values obtained are compared, and the vector with the highest function value continues to the next generation. The other is removed.
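The four steps can be sketched as a DE/rand/1/bin loop, the strategy used later in the experiments, with the typical control values F = 0.5 and Cr = 0.9. This is a generic sketch on a toy objective, not the authors' implementation.

```python
import numpy as np

def de_maximize(f, lo, hi, n=20, iters=100, F=0.5, Cr=0.9, seed=0):
    """DE/rand/1/bin sketch: initialization, mutation, crossover, selection."""
    rng = np.random.default_rng(seed)
    dim = len(lo)
    pop = rng.uniform(lo, hi, (n, dim))            # initialization
    fit = np.array([f(p) for p in pop])
    for _ in range(iters):
        for i in range(n):
            # Mutation: three random vectors, all different from target i
            a, b, c = pop[rng.choice([j for j in range(n) if j != i],
                                     3, replace=False)]
            donor = np.clip(a + F * (b - c), lo, hi)
            # Binomial crossover: mix donor and target into a trial vector
            mask = rng.random(dim) < Cr
            mask[rng.integers(dim)] = True         # keep at least one donor gene
            trial = np.where(mask, donor, pop[i])
            # Selection: the better of target and trial survives
            ft = f(trial)
            if ft > fit[i]:
                pop[i], fit[i] = trial, ft
    return pop[np.argmax(fit)], fit.max()

# Maximize a concave toy function with peak value 1 at (0, 0)
best, val = de_maximize(lambda p: float(1.0 - np.sum(p ** 2)),
                        np.array([-5.0, -5.0]), np.array([5.0, 5.0]))
```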

Artificial Neural Networks
An artificial neural network (ANN) is a mathematical model inspired by a biological neural network [46]. The model includes artificial neurons interconnected; each interconnection is associated with adjustable weights. The neurons are organized in layers, one input layer, one output layer, and one or several hidden layers. The ANNs try to solve linear and non-linear classification problems with different network structures and the types of learning (supervised or unsupervised).

Metaheuristics Optimization for Classification Model
We compare the performances of two metaheuristics, particle swarm optimization (PSO) [16] and differential evolution (DE) [17], to search for the combination of instances, features, and number of neurons that yields the highest classification performance. In both metaheuristics, each combination is represented as a vector of real numbers. Table 4 shows the vector structure. The first element selects the self-organizing map (SOM) size, from 2 to 10. Elements 2 to 5 select the percentage of instances for the undersampling. The lower bound of elements 2, 4, and 5 is 0.01, and for element 3 it is 0.001; the upper bound of all four elements is 0.999. Element 6 selects the number of neurons, from 10 to 500. Finally, elements 7 to 56 select the features: a value of 1 means the feature is selected, and a value of 0 means it is not. Elements 1, 6, and 7-56 require integer numbers, so their real values are rounded. Algorithm 1 shows the classification process; it starts when the metaheuristic evaluates a representation vector (see Table 4) through the objective function. In the next list, we explain the steps of the objective function:
1. First, in line 1, we load the clusters according to parameter one of the representation vector (SOM feature map size). Then, from lines 3 to 10, the sizes of the majority classes are reduced (undersampling) by selecting instances from the loaded clusters. The number of instances to select from each cluster is given by elements 2 to 5 of the vector.
2. Once the instance selection is complete, in line 11, a new subset is formed by merging the selected instances of the majority classes with all instances of the minority classes.
3. In line 12, a new vector is obtained from elements 7-56 of the representation vector. From lines 13 to 15, we iterate over the elements of this vector: a feature is selected if the element's value is 1 and removed if it is 0. Hence, the new subset has fewer features.
4. We use k-fold cross-validation. In line 16, we set k = 10. Line 17 divides the new subset into ten folds. From lines 18 to 23, we iterate over the folds, taking each fold as the test dataset and the remaining folds as the training dataset. In each training of the artificial neural network, the number of neurons in the hidden layer is the parameter selected by the metaheuristic in the sixth element of the vector.
5. Finally, in line 24, we compute the average performance over the ten folds. Line 25 returns the F1 score, indicating the performance of the model.
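Decoding the 56-element representation vector described above can be sketched as follows. The function name and the random candidate are illustrative; the bounds are those stated in the text.

```python
import numpy as np

def decode_vector(v):
    """Decode a 56-element PSO/DE representation vector.

    Element 1: SOM feature-map size (2-10); elements 2-5: undersampling
    fractions for the four majority classes; element 6: hidden neurons
    (10-500); elements 7-56: binary feature-selection mask (rounded)."""
    return {
        "som_size": int(round(v[0])),
        "keep_fractions": v[1:5].tolist(),
        "n_neurons": int(round(v[5])),
        "feature_mask": np.round(v[6:]).astype(int),
    }

# Hypothetical candidate solution drawn within the stated bounds
rng = np.random.default_rng(0)
v = np.concatenate([[rng.uniform(2, 10)],
                    rng.uniform(0.01, 0.999, 4),
                    [rng.uniform(10, 500)],
                    rng.uniform(0, 1, 50)])
params = decode_vector(v)
print(len(params["feature_mask"]))  # 50
```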
In the metaheuristics, each evaluation is a call to the objective function. It returns the F1 score, and the metaheuristics try to maximize the value of this metric. When the number of calls to the objective function exceeds 2000, the metaheuristics stop the exploration.
(Algorithm 1 listing omitted; its final lines 22-25 train the artificial neural network on K−1 folds, evaluate its performance on the held-out fold k_i, compute the average performance over the K folds, and return the F1 score value.)

Performance Measurements
We use five metrics to evaluate the classification performance: accuracy (Acc), sensitivity or recall (Sen), specificity (Spe), precision (Pre), and F1 score (F1), where TP is true positives, FN is false negatives, TN is true negatives, and FP is false positives. The equations for each metric are defined as follows:
Accuracy = ((TP + TN) / (TP + TN + FP + FN)) · 100 (3)
Sensitivity = (TP / (TP + FN)) · 100 (4)
Specificity = (TN / (TN + FP)) · 100 (5)
Precision = (TP / (TP + FP)) · 100 (6)
F1 Score = 2 · (Precision · Sensitivity) / (Precision + Sensitivity) (7)
The metrics for each class are calculated with the above equations using a macro-averaged approach [47]. The overall metric is computed as the mean of the metric results over all classes. We choose accuracy as a measure of the overall performance of the classifier model. Sensitivity is related to the rate of false negatives, precision is related to the rate of false positives, and specificity determines whether a prediction does not belong to a class. The F1 score is the harmonic mean of precision and sensitivity, taking both metrics into account to estimate the performance of the classifier model under unbalanced class problems.
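The macro-averaged computation can be sketched directly from a confusion matrix. This is an illustrative helper on a toy two-class matrix, not the paper's data.

```python
import numpy as np

def macro_metrics(conf):
    """Per-class TP/FP/FN from a confusion matrix (rows: true labels,
    columns: predictions), then macro-averaged sensitivity, precision,
    and F1 score in percent."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp       # column sums minus the diagonal
    fn = conf.sum(axis=1) - tp       # row sums minus the diagonal
    sen = tp / (tp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    return {"sensitivity": sen.mean() * 100,
            "precision": pre.mean() * 100,
            "f1": f1.mean() * 100}

# Toy 2-class confusion matrix
m = macro_metrics([[9, 1],
                   [2, 8]])
print(round(m["sensitivity"], 2))  # 85.0
```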

Experiments Configuration
We perform four experiments to measure the optimization of the classification process. The experiments have two main goals: (1) to test two configurations for each metaheuristic (PSO and DE), changing the type of filtering method (HL and wavelets), and (2) to evaluate their overall performance; see Table 5. In the four experiments, the bounds are the same; that is, a particle (PSO) or a parameter vector (DE) has the same structure; see Table 4. The map size indicates the configuration of the feature map in SOM. The metaheuristic can select among eight configurations, and the number of clusters increases with larger map sizes. The percentage of instances to select is the same for the majority classes, except for the lower bound of class N, which is lower than for the other three classes because its number of samples is greater. The metaheuristic can select among 490 options for the number of neurons in the hidden layer to improve the performance of the classifier. Figure 9 shows the fifty features that the metaheuristic can select according to two options: when the value is 1, the feature is selected, and it is rejected for a value of 0. Moreover, we repeat the four experiments ten times to observe the convergence of the algorithms and the differences in each repetition. Table 5. Experimental configuration. Abbreviations: HL: combination of high-pass and low-pass filtering; PSO: particle swarm optimization; n: population; w_max: maximum inertia weight; w_min: minimum inertia weight; c_1: acceleration constant 1; c_2: acceleration constant 2; DE: differential evolution; Cr: crossover rate; F: mutation factor. In experiments 1 and 3, the metaheuristic was PSO. In both experiments, parameter w was used to control the velocity of the particles, related to the balance of exploration and exploitation: w_max favors the global search ability (exploration), and w_min favors the local search ability (exploitation).
The acceleration constants in PSO are the stochastic acceleration weights of the particles: c_1 toward the personal best of each particle, and c_2 toward the global best of the population. We use typical values for the parameters: w_max of 0.9, w_min of 0.4, and c_1 and c_2 equal to 2.
In experiments 2 and 4, the metaheuristic was DE. We use the strategy DE/rand/1/bin, where "rand" means that the parameter vectors selected for the mutation are chosen at random, "1" is the number of solution couples, and "bin" indicates that the recombination is binomial. The mutation factor F is a control parameter of the mutation stage; it manages the balance between exploitation and exploration. In the recombination stage of DE, the crossover rate Cr is a control parameter of the binomial crossover, where higher values imply a greater influence of the mutant vector on the trial vector. The typical values used in the two experiments were F = 0.5 and Cr = 0.9.

Results
In the diagnosis of a patient, it is essential to avoid false positives (FP) and false negatives (FN), because an incorrect diagnosis implies that the patient may suffer an injury through the recommendation of the wrong medication. FP and FN are therefore crucial in our work, and we use the F1 score as the main performance metric of the classifier model; it is a suitable metric when there are unbalanced classes. Table 6 shows the results of the four experiments over ten tests, and Figure 11 shows the mean and the standard deviation of the F1 score in each experiment. Experiment 2 obtained the best results of the four experiments, with the highest overall mean of 99.77%. The configuration of this experiment combines the low-pass and high-pass filtering method with the differential evolution (DE) algorithm. Its standard deviation of 0.073% is the lowest, reflecting a low dispersion of the results. Experiment 2 is thus more precise and repeatable, given the low standard deviation, and more accurate than the other three experiments, given the high mean. In the next list, we explain the observations for the other experiments:
• Experiment 1: The mean is 99.67%, the second-highest performance of the four experiments and 0.10% lower than that of experiment 2. The disadvantage is a standard deviation 0.132% higher than that of experiment 2; for this reason, the first experiment is less repeatable than the second.
• Experiment 3: Its mean of 99.50% is the lowest of the four experiments. Moreover, it is less repeatable than experiment 2 because its standard deviation of 0.204% is the second-highest of the four experiments.
• Experiment 4: Its mean of 99.61% is the second-lowest performance of the four experiments; its advantage is a standard deviation of 0.079%, only 0.006% higher than that of experiment 2.
In this work, the best performance was obtained with the combination of low- and high-pass filtering. This method was applied in experiments 1 and 2, whose F1 scores were 99.67% and 99.77%, respectively, against the wavelet filtering method in experiments 3 and 4, with means of 99.50% and 99.61%. Therefore, the selection of the filtering method is fundamental in the preprocessing stage of the methodology because of its impact on the classifier's performance.
We compared the metaheuristics (PSO and DE) under similar conditions, observing that DE had a higher performance than PSO. In experiments 1 and 2, the combined filtering (low- and high-pass) is used; in the second experiment, DE had a mean of 99.77%, higher than the mean of 99.67% obtained by PSO in the first experiment. The wavelet filtering method is applied in experiments 3 and 4; DE again had a higher performance, with a mean of 99.61% in the fourth experiment versus 99.50% in the third experiment (PSO). Experiments 2 and 4, with DE, had less variability, with standard deviations of 0.073% and 0.079%, respectively, against the higher values of PSO in experiments 1 and 3 (0.205% and 0.204%). This behavior occurred because PSO tended to get trapped in local maxima. The maximum values are shown in bold. Table 7 shows the best solutions of the four experiments. From the five computed metrics, we observe that experiments 1 and 2 obtained the best performances, with F1 score values of 99.93% and 99.88%, respectively. In the resulting models of the four trials, the parameters selected by the metaheuristics were different for each model. The first parameter chosen by the metaheuristic is the size of the SOM feature map (see Section 4.2.2), and for each configuration the number of clusters was different. For example, when the map size is 2 × 2, the maximum number of clusters is four; this means that the majority class is divided into four smaller groups. We observe that the metaheuristic tends to select smaller map sizes. In the experiments, the map sizes of the best solutions are 2 × 2 and 4 × 4; this means that the best performances are obtained when the majority classes are divided into a low number of clusters. Figure 11. Means and standard deviations of ten tests on the four experiments.
The total number of instances used in each experiment is shown in the row called "Size of the new subset". The size of the majority classes (L, N, V, and R) has been reduced as a consequence of the undersampling; see Table 7. All the instances of the minority classes were conserved in all the experiments. We observed in Table 7 that the final size reduction of the new subsets was higher than 90% in the four experiments in comparison with the original set, which had 98,165 instances. For example, for class R, the number of instances contained in the original dataset was 7255; nevertheless, experiment 1 used only 2167, the second 1802, the third 1887, and the fourth 2463. The original instances for each arrhythmia are described in Table 1, and the reduction of the number of instances for all the majority classes is shown in Table 7. Thus, we can conclude that applying clustering-based undersampling helps improve classifier performance. The parameters selected for "number of neurons" and "features selected" were different in the four experiments. Therefore, a metaheuristic approach is vital to improving the performance of the artificial neural network, because it optimizes the parameters according to the subset of the selected instances. Figure 9 presents the selected features shown in Table 7; notice that the numbers of features selected by PSO in experiments 1 and 3 (23 and 27) are lower than the numbers of features selected by DE in experiments 2 and 4 (44 and 43).
Figure 12 shows the scatter plots of the original set and the new subsets for each experiment. These plots were obtained by applying principal component analysis (PCA) to each of the beats to reduce the number of dimensions from 250 to two; from the components resulting from the analysis of each heartbeat, the first two were selected to display the plot in two dimensions. With these plots, we are able to observe the impact generated by the undersampling. Plot A displays the full dataset describing the eight arrhythmia types. As we can observe, the multiclass classification problem is not trivial due to the overlapping of the majority classes over the minority classes; for example, the N and V classes. We obtain plots B-E by applying the undersampling, clustering, and feature extraction steps. In the plots of experiments 1-3 (Panels B, C, and D), we observe a slightly similar distribution of the instances because the three experiments implemented the same cluster configuration (see Table 7, "SOM map size"). In Panel E, we obtain a different distribution of the instances; this is a consequence of the higher number of clusters used, so a more significant overlap is observed between them. A critical remark is that we still have unbalanced classes, but their decreased overlapping allows us to build better frontiers between the classes to make them separable.

Analysis of the Classification Model
In Table 8, we describe the imbalance ratio (IR) to measure the degree of imbalance between the arrhythmia classes. Recall that an IR value of one means that the classes are balanced, i.e., they have the same number of instances, whereas an IR value different from 1 denotes a degree of inequality. In particular, Table 8 compares the IRs of the original set with those of the subset used in experiment 2, which represents our best performance.
We observed a decrease in the IR values of the subset of experiment 2, meaning that the imbalance between the majority and minority classes decreased. Since no IR value is equal to 1, it can be inferred that the classes are not balanced. Nevertheless, the imbalance reduction is significant, as it enables the classifier to achieve better overall performance and to improve its learning of the minority classes without the need to generate synthetic data. Besides, we show in Table 7 that the proposed approach obtains a high performance despite the fact that the new subset still has a certain degree of imbalance. Table 8. IR values comparison between the full dataset and the dataset of the best solution in experiment 2. Table 9 shows the confusion matrix used to evaluate the performance of the best solution in experiment 2. From the confusion matrix, the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are extracted to calculate the values of the metrics. In Section 4.5.1, we show the equations used to compute the performance of each class and explain how to calculate the overall metrics.
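The IR computation can be sketched as below. The class counts here are hypothetical round numbers for illustration only, not the paper's exact figures (apart from the 7255 instances of class R mentioned above).

```python
def imbalance_ratio(counts):
    """Imbalance ratio of each class against the largest class.

    An IR of 1 means balanced classes; larger values mean stronger imbalance.
    """
    majority = max(counts.values())
    return {cls: majority / n for cls, n in counts.items()}

# Hypothetical counts before and after clustering-based undersampling
before = {"N": 70000, "R": 7255, "E": 106}
after = {"N": 2500, "R": 1802, "E": 106}
print(round(imbalance_ratio(before)["E"], 1))  # 660.4
print(round(imbalance_ratio(after)["E"], 1))   # 23.6
```

Since all instances of the minority classes are kept, the undersampling lowers the IR without generating synthetic data.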

Majority Classes
The classifier's performance is measured using the five metrics defined in Section 4.5.1. First, it is necessary to calculate TP, TN, FP, and FN. Tables 10 and 11 show how to calculate these values for the classes "!" and "R". For class "!", the TP are the instances located at the intersection of the row and the column of that class; looking at the confusion matrix in Table 9, the value of TP is 470. To calculate FP, we add all the values of the cells in the column of class "!" in Table 9 and subtract the value of TP, obtaining an FP of 2. FN is obtained by adding all the values of the row of class "!" in Table 9 and subtracting the value of TP, which gives an FN of 0. Finally, to calculate TN, we add all the values of the cells of the confusion matrix and subtract TP, FP, and FN, obtaining a TN of 5221 for class "!". Applying the same procedure to class "R", we observe that it has a TP value of 1802; subtracting this value from the sum of the cells in its column gives an FP equal to 0. Subtracting the TP value from the sum of the cells in its row gives an FN of 1. Finally, subtracting TP, FP, and FN from the sum of all the cells of the confusion matrix gives a TN of 3890. Table 12 shows the TP, FP, FN, and TN values for the eight arrhythmia classes. We use these values and the equations presented in Section 4.5.1 to calculate the F1 score, accuracy, sensitivity, specificity, and precision for each arrhythmia class. For example, to calculate the sensitivity of class "E", we have TP = 106 and FN = 1; using these values and Equation (4), we obtain a value of 99.07%. The values presented in Table 13 are the individual measures for each arrhythmia of the best solution presented in experiment 2, obtained with the above procedure and the equations of Section 4.5.1.
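The per-class procedure described above can be sketched as below. Since the full Table 9 matrix is not reproduced here, the example uses a toy three-class matrix rather than the paper's values.

```python
import numpy as np

def class_counts(conf, idx):
    """TP, FP, FN, TN for one class of a confusion matrix
    (rows: true labels, columns: predictions), following the
    procedure described for classes "!" and "R"."""
    conf = np.asarray(conf)
    tp = conf[idx, idx]                 # row/column intersection
    fp = conf[:, idx].sum() - tp        # column sum minus TP
    fn = conf[idx, :].sum() - tp        # row sum minus TP
    tn = conf.sum() - tp - fp - fn      # everything else
    return int(tp), int(fp), int(fn), int(tn)

# Toy 3-class matrix (not the paper's Table 9)
conf = [[50, 2, 1],
        [3, 40, 0],
        [0, 1, 30]]
tp, fp, fn, tn = class_counts(conf, 0)
print(tp, fp, fn, tn)  # 50 3 3 71
```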
The best values are obtained by classes "a" (aberrant atrial premature beat), "N" (normal beat), and "V" (premature ventricular contraction beat). The other classes have performances between 99% and 100% in all metrics. It is essential to remark that the specificity and sensitivity range between 99.07% and 100% for all the classes. These results indicate that the classifier's margin of error for each individual arrhythmia is less than 1%.

Parametric Analysis of the Classification Model
In order to prove that the system reaches at least 98% of the F1 score, we carried out a parametric distribution analysis based on a Student's t-test. We propose the following hypotheses: H_0: The proposed approach obtains an F1 score lower than or equal to 98% in the classification of eight classes of arrhythmias (N, L, R, V, !, j, a, E).
H_1: The proposed approach obtains an F1 score higher than 98% in the classification of eight classes of arrhythmias (N, L, R, V, !, j, a, E).
Using the experiment 2 results of Table 6, we have the following sample data: (a) mean x̄ = 99.77, (b) sample size n = 10, and (c) standard deviation S = 0.073. Considering the degrees of freedom (df = 9) and the significance level (α = 0.05) for the parametric analysis, the critical value of the Student's t-statistic is t(α = 0.05, df = 9) = 1.833. Using Equation (8), we obtain a Student t-value of 76.11. Figure 13 shows the one-tailed test result for this value; we observe that the Student t-value (76.11) is greater than the critical t-statistic (1.833). For this reason, the null hypothesis (H_0) is rejected and the alternative hypothesis (H_1) is accepted, giving evidence that the performance of our approach reaches F1 score values greater than 98% in most tests.
t = (x̄ − μ₀) / (S / √n) = (99.77 − 98) / (0.073 / √10) = 76.11 (8)
Figure 13. Student's t-statistic for the proposed approach in experiment 2. Table 14 shows the performance of our study versus works from the state of the art. From our analysis, we select experiment 2 as the best solution. The four metrics used to compare our best solution with other works are accuracy, sensitivity, precision, and F1 score; Section 4.5.1 explains the process to compute the metrics and the objective of each measure.
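The one-sample, one-tailed t-test can be reproduced directly from the sample statistics. Note that with the rounded inputs (mean 99.77, S = 0.073) the statistic evaluates to about 76.7, close to the reported 76.11; the small gap presumably comes from rounding of the sample statistics.

```python
import math

# One-sample, one-tailed t-test sketch for H0: mean F1 <= 98
# (sample statistics from the ten repetitions of experiment 2)
mean, mu0, s, n = 99.77, 98.0, 0.073, 10
t = (mean - mu0) / (s / math.sqrt(n))
t_critical = 1.833          # Student's t, df = 9, alpha = 0.05, one-tailed
reject_h0 = t > t_critical  # True: mean F1 is significantly above 98%
```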

Discussion
Hassanien et al. [10] and Sarvan and Ozkurt [12] reported problems with the unbalanced dataset, obtaining a low performance in sensitivity, with values of 45.49% and 26.85%, respectively. Moreover, Hassanien et al. [10] obtained an F1 score of 45.48%; the low value of this metric shows that their classifier model had problems learning the minority classes. Ashtiyani et al. [11] used only three classes to train their model, fewer than in our work.
In the work of Rajesh and Dhuli [13], the dataset had a lower imbalance compared with our study. We computed the imbalance ratios (IR) of the classes and observed that the maximum value in their dataset was 50.77, against a maximum IR of 705.3 in our work. In Section 3.2, we point out that a high IR value is related to a high imbalance, which can deteriorate the performance of the classifier. Jian et al. [14] showed a high accuracy value (98.4%) in their study; still, they did not report the values of other metrics such as sensitivity or F1 score. This is a disadvantage because the performance of the classifier model on the minority classes is not visible.
Gao et al. [15] classified a similar number of arrhythmias to our work, and their dataset was unbalanced too; the classes used by the authors were normal, left bundle branch block, right bundle branch block, atrial premature contraction, nodal escape beat, aberrant atrial premature beat, nodal (junctional) premature beat, and atrial escape beat. They utilized hold-out validation with 90% for training and 10% for testing and implemented an LSTM recurrent network, obtaining a performance above 90%. Nevertheless, our work exceeds theirs by 0.69% in accuracy, 0.61% in sensitivity, 0.59% in precision, and 0.61% in F1 score. This observation is important because, by avoiding false positives and false negatives in the diagnosis, the risk of injuring the patient decreases. Hold-out validation divides a dataset into two subsets, one to train the classifier and the other to test the performance of the model; the disadvantage of the method is that the performance is more variable due to the small size of the testing data. Therefore, we use a more robust method, k-fold cross-validation. It divides a dataset into k subsets or folds; the testing set is one fold and the training set is the remaining folds, and the process stops when each fold has been used as the testing set. This method is a better indicator of the quality of the model because it allows us to observe the behavior of the model with unseen data, and there is a tendency toward less variation in the results because the whole dataset is used for training and testing. Additionally, it is important to remark that only two authors used a metaheuristic to optimize the feature selection stage of the classification process: Hassanien et al. [10] used a metaheuristic based on the behavior of elephants, and Ashtiyani et al. [11] implemented a genetic algorithm. The other authors mentioned in Table 14 optimized the classification process using parameter tuning. In this sense, Gao et al.
[15] obtained the best performance among them. Comparing the works in Table 14, we conclude that the use of a metaheuristic allows us to carry out an extensive exploration of parameter combinations to improve the performance of the classifier model. This exploration would have been complicated using only parameter tuning (experience). Hence, our approach surpasses the results obtained in each metric by the works reported in the state of the art.
Finally, we have also carried out some experiments based on the recommendations of the Association for the Advancement of Medical Instrumentation (AAMI), which suggest using five classes according to the standard ANSI/AAMI EC57:1998 [30]. The fifteen heartbeat classes of the MIT-BIH Arrhythmia Database are grouped into five superclasses: normal (N), supraventricular ectopic beat (SVEB), ventricular ectopic beat (VEB), fusion beat (F), and unknown beat (Q). These experiments were considered to show the usefulness of the presented metaheuristic approaches in other contexts. Table 15 shows that the results obtained by our approach (differential evolution with high- and low-pass filtering, 99.19% accuracy, and 98.79% sensitivity) represent a better performance compared to Rajesh and Dhuli [13] and Sarvan and Ozkurt [12]. There is a variety of methods to extract features, such as statistical, morphological, QRS complex, and wavelet features [30]. Under different circumstances and contexts, their choice affects the results when identifying cardiac abnormalities. Among this variety, each author can select a different window (signal range) to extract features in the time domain, the frequency domain, or the time-frequency domain. The choice depends directly on the author's experience and the algorithms used for classification. Each of these methods presents advantages and disadvantages when used to classify arrhythmias. Therefore, determining whether extraction should be performed over segments of eight seconds, over dynamic time intervals, or per beat, as in our case, remains an open problem today.
To address this open problem, the proposals must be compared under the same conditions using a standard that allows a fairer comparison among them. One such standard is ANSI/AAMI EC57, since it allows a fair comparison among different methods [48]. Most studies do not present this type of comparison, which makes it challenging to answer this question across the state-of-the-art review in a cross-sectional way. Besides, authors do not consider the same number of arrhythmia classes: [10,14] use two, [11] three, [12,13] five, and [15] and our work eight. In our work, we have included comparisons using the ANSI/AAMI standard against Sarvan and Ozkurt [12] and Rajesh and Dhuli [13], showing better results.
On the other hand, when a fair comparison cannot be made, authors can only claim that their work presents competitive results, as in the work of Hajeb-Mohammadalipour et al. [19] or in ours. Nevertheless, the datasets and the reported results cannot be compared directly because of the lack of standardization such as that provided by ANSI/AAMI.
As a remark, we consider that our proposal contributes positively to arrhythmia classification from unbalanced data because the classification is carried out by beat analysis with competitive and balanced results: we obtain accuracy, sensitivity, specificity, precision, and F1 score with less than 1% error, and some metrics with 0% error, as shown in the individual and overall analyses of the proposal.

Conclusions
In electrocardiogram classification with unbalanced data, the optimization of the classifier's parameters plays a key role because performance depends on it. Hence, we presented a metaheuristic optimization approach that combines undersampling based on clustering (data level) with a feature selection method (algorithmic level). We selected an unbalanced subset to classify seven types of arrhythmia and the normal beat: left bundle branch block beat, right bundle branch block beat, premature ventricular contraction beat, ventricular flutter wave beat, nodal (junctional) escape beat, aberrated atrial premature beat, ventricular escape beat, and normal beat. We compared two metaheuristic approaches, differential evolution and particle swarm optimization, each applied with two filtering methods. The highest accuracy was obtained by combining high-pass and low-pass filtering with differential evolution to optimize an artificial neural network. The metrics obtained showed an accuracy of 99.95%, an F1 score of 99.88%, a sensitivity of 99.87%, a specificity of 99.99%, and a precision of 99.89%. Hence, our results surpass the metrics reported in the state of the art. Our proposed models can assist expert clinicians and cardiologists in accurately detecting arrhythmias after a routine ECG test, saving time and decreasing the number of misdiagnoses. Furthermore, with the proposed models deployed alongside low-cost ECG devices, we aim for widespread use of the ECG as a diagnostic tool in remote places where access to a cardiologist is not possible. In future work, we will focus on testing our approach with other unbalanced databases and on reducing the computational cost of the metaheuristic search. Additionally, we are interested in extending our work by introducing other classes of arrhythmia beats and rhythm-type arrhythmias.
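The parameter estimation described above can be sketched with a minimal DE/rand/1/bin loop. The toy objective below stands in for the artificial neural network's validation error over its hyperparameter space; the function names, bounds, and control settings (population size, F, CR) are illustrative assumptions, not the exact configuration used in our experiments.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, F=0.8, CR=0.9,
                           generations=100, seed=0):
    """Minimal DE/rand/1/bin minimizer over box-constrained parameters."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = []
            j_rand = rng.randrange(dim)  # force at least one mutated dimension
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    # Mutation: base vector plus scaled difference vector
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip back into the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            s = objective(trial)
            if s <= scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]

# Toy stand-in for the ANN validation error, with its minimum at (3.0, 0.5);
# in practice the objective would train and validate the network per candidate.
def toy_error(params):
    x, y = params
    return (x - 3.0) ** 2 + (y - 0.5) ** 2

best_params, best_err = differential_evolution(toy_error, [(0.0, 10.0), (0.0, 1.0)])
```

The greedy selection step is what makes the exploration systematic: a candidate parameter vector replaces its parent only when it lowers the validation error, which is the behavior that manual tuning by experience cannot guarantee.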

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: