Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains relevant. Much work has been performed to deal with the class imbalance problem, with the random sampling methods (over- and under-sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE)


Introduction
The huge amount of data continuously generated by digital applications is an important challenge for the machine learning field [1]. This phenomenon is characterized not only by the volume of information, but also by the speed of transference and the variety of data; i.e., the big data characteristics [2,3]. Concerning the volume of information, a dataset belongs to the big data scale when it is difficult to process with traditional analytical systems [4].
In order to efficiently seize the large amount of information from big data applications, deep learning techniques have become an attractive alternative. Ensemble classifiers have also been combined with methods to deal with the class imbalance problem. In [42], an ensemble of support vector machines is used, where the maximum margin is adopted to guide the ensemble learning procedure for multi-class imbalance classification. In [14], a hybrid optimal ensemble classifier framework that combines under-sampling and cost-sensitive methods is proposed.
The combination of deep learning and ensemble classifiers has also been explored. For example, in [43], a software bug prediction method is presented. It has two stages, deep learning (auto-encoder) and ensemble learning, and it was shown to be able to deal with the class imbalance and over-fitting problems. In big data scenarios, the conjunction of ensembles and clustering methods has been proposed, as demonstrated by [44], where ensembles were built with datasets preprocessed by clustering and sampling methods. Moreover, the use of clustering and sampling techniques to deal with the class imbalance problem has increased [45][46][47]. In the machine learning context, extensive and comprehensive reviews of ensembles and the class imbalance problem have been performed [30,38,39].
It has been observed that big data class imbalance approaches are commonly addressed by adapting traditional techniques, mainly sampling methods [21,44]. However, recent studies show that some conclusions from machine learning are not applicable to the big data context; for example, in machine learning it is common for SMOTE to perform better than ROS [24], but in big data some results do not show this trend [1,48]. In addition, only a few works have addressed the class imbalance in big data by using "intelligent" or heuristic sampling techniques [17,49]. Thus, more studies are needed in order to test methods that traditionally present good performances in machine learning (such as the random and heuristic sampling algorithms) at the big data scale.
The potential of traditional sampling methods on deep learning neural networks in the big data context is studied in this work. The main contributions of this research are: (a) This paper focuses on dealing with multi-class imbalance problems, which have hardly been investigated [50][51][52] and are critical issues in the field of data classification [45,53]. (b) It addresses one of the most popular deep learning methods (Artificial Neural Networks, ANN) with details on particular aspects of the classifier, such as answering the question: is it pertinent to use sampling methods that work in the feature space when ANN classifiers set the decision boundary in the hidden space? (c) Results are presented that show the effectiveness of applying editing methods to the neural network output in order to improve the deep neural network classification performance.

Deep Learning Multi-Layer Perceptron
The Multi-Layer Perceptron (MLP) constitutes the most conventional Artificial Neural Network (ANN) architecture [54]. It is formed by a set of simple elements called computational nodes or neurons, where each node sends a signal that depends on its activation state [55]. Before arriving at the receiver neuron, the output is modified by multiplying it by the corresponding weight of the link. The signals (net) are accumulated in the receiver node, and the neurons are grouped into layers. The MLP is commonly based on three layers: input, output and one hidden layer [7]. The MLP is turned into a deep neural network by incorporating two or more hidden layers within its architecture, making it a Deep Learning MLP (DL-MLP). This allows it to reduce the number of nodes per layer and use fewer parameters, but it also leads to a more complex optimization problem [8]; however, due to the availability of more efficient frameworks, such as Apache-Spark and TensorFlow, this disadvantage is less restrictive than before.
The MLP is a practical vehicle with which to perform a nonlinear input-output mapping of a general nature [56]. Given a set of input vectors x and a set of desired responses d, the learning system must find the parameters linking these specifications. The j-th output of an ANN, y_j = f(x, w), depends on a set of parameters w (weights), which can be modified to minimize the discrepancy between the system output y and the desired response d. The aim of an MLP is to represent the behavior of f(x, w), in a region R of the input space, by means of a linear combination of functions ϕ_j(x), as follows: the net input of the i-th neuron on the l-th hidden layer is s_i^l = Σ_h w_hi^l ϕ_h^(l−1), where w_hi^l is the weight connecting the h-th node of the (l−1)-th hidden layer with the i-th neuron of the l-th hidden layer, and ϕ_h^(l−1) is the output of the h-th node on the (l−1)-th hidden layer. If l = 1, then s_i = Σ_h w_hi x_h; i.e., the net input of the i-th neuron on the first hidden layer is computed directly from the input features x_h. L is the total number of hidden layers.
Traditionally, the MLP has been trained with the back-propagation algorithm (which is based on stochastic gradient descent) with its weights randomly initialized [54,57]. However, in the latest versions of DL-MLPs, the hidden layers are pretrained by an unsupervised algorithm and the weights are then optimized by the back-propagation algorithm [7]. The MLP uses sigmoid activation functions, such as the hyperbolic tangent or the logistic function. In contrast, the DL-MLP commonly includes the Rectified Linear Unit (ReLU), f(z) = max(0, z), because it typically learns much faster in networks with many layers, allowing the training of a DL-MLP without unsupervised pretraining.
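To make the hidden-space transformation concrete, the forward pass of a DL-MLP with ReLU hidden layers can be sketched in a few lines of NumPy (an illustrative sketch with made-up layer sizes, not the implementation used in this work):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def mlp_forward(x, weights):
    """Forward pass through a DL-MLP: each hidden layer computes the net
    input s = phi @ W from the previous layer's outputs and applies ReLU;
    the last matrix in `weights` produces the output-layer scores."""
    phi = x
    for W in weights[:-1]:           # hidden layers (the "hidden space")
        phi = relu(phi @ W)
    return phi @ weights[-1]         # output layer (pre-softmax scores)

rng = np.random.default_rng(42)
# Two hidden layers (illustrative sizes): 4 input features, 3 classes.
weights = [rng.normal(size=(4, 8)),
           rng.normal(size=(8, 8)),
           rng.normal(size=(8, 3))]
scores = mlp_forward(rng.normal(size=(5, 4)), weights)
print(scores.shape)  # (5, 3): one score vector per sample
```

Each intermediate `phi` is exactly the transformed data in which, as discussed later, the network's decision boundary is actually defined.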
There are three variants of gradient descent that differ in how much data is used to compute the gradient of the objective function [58]: (a) batch gradient descent calculates the gradient of the cost function with respect to the parameters for the entire training dataset, (b) stochastic gradient descent performs a parameter update for each training example and (c) mini-batch gradient descent takes the best of the previous two and performs the update for each mini-batch of a given number of training examples. The most common gradient descent optimization algorithms are: (a) Adagrad, which adapts the learning rate of the parameters, making bigger updates for infrequent parameters and smaller ones for frequent parameters; (b) Adadelta, an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate by restricting the accumulation of past gradients to a fixed-size window instead of accumulating all previous gradients; and (c) Adam, which calculates adaptive learning rates for each parameter and stores an exponentially decaying average of past gradients [59]. Other important algorithms are AdaMax, Nadam and RMSprop [58].
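Variant (c) above, mini-batch gradient descent, can be sketched on a toy linear model (an illustrative sketch; the learning rate, batch size and toy data are assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy regression problem: y = 2*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.05, size=256)

w, b, eta, batch = 0.0, 0.0, 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch):      # one update per mini-batch
        sel = idx[start:start + batch]
        err = (w * X[sel, 0] + b) - y[sel]
        # Gradients of the mean squared error w.r.t. w and b.
        w -= eta * np.mean(err * X[sel, 0])
        b -= eta * np.mean(err)
print(round(w, 1), round(b, 1))  # close to the true 2.0 and 1.0
```

Batch gradient descent corresponds to `batch = len(X)` and stochastic gradient descent to `batch = 1`; mini-batch sits between the two, which is why it is the usual choice for deep networks.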
In general terms, the training methods intend to minimize the error between f(x) and the network mapping f̂(x, w) in order to calculate the optimized weight values w_j, such that |f(x) − f̂(x, w)| < ε, where ε becomes arbitrarily small; the function f̂(x, w) is then called an approximation of f(x).

Sampling Class Imbalance Approaches
Over- and under-sampling strategies are very popular and effective approaches to deal with the class imbalance problem [21,25,33,50]. To compensate for the class imbalance by biasing the discrimination process, the ROS algorithm randomly replicates samples from the minority classes, while the RUS technique randomly eliminates samples from the majority classes, until a relative class balance is achieved [23,60]. Nevertheless, both strategies have drawbacks: ROS can lead to over-training or over-fitting, and RUS can eliminate data that could be important for defining the decision boundary.

Over-Sampling Methods
Random Over-Sampling (ROS) compensates for the class imbalance by randomly duplicating samples from the minority class until a relative class balance is achieved [60].
Synthetic Minority Over-sampling Technique (SMOTE) produces artificial samples from the minority class by interpolating existing instances that lie close to each other. The k intra-class nearest neighbors of each minority sample are found, and synthetic samples are produced in the direction of some or all of those nearest neighbors [61].
Adaptive Synthetic Over-Sampling (ADASYN) is considered an extension of SMOTE, characterized by the creation of more samples in the vicinity of the boundary between the two classes than inside the minority class [62].
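The interpolation step at the heart of SMOTE can be sketched as follows (a simplified illustration, not the imblearn implementation used later in this work: each synthetic sample lies on the segment between a minority sample and one of its k intra-class nearest neighbors):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Simplified SMOTE-style interpolation over the minority class
    X_min: generate n_new synthetic samples, each on the line segment
    between a random minority sample and one of its k nearest
    intra-class neighbors."""
    if rng is None:
        rng = np.random.default_rng(42)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # k nearest, excluding itself
        j = rng.choice(neighbors)
        gap = rng.uniform()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like(X_min, n_new=4, k=3)
print(synthetic.shape)  # (4, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, it never falls outside the region spanned by the minority class, which is what distinguishes SMOTE from the exact duplication performed by ROS.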

Under-Sampling Methods
Random Under-Sampling (RUS) randomly eliminates samples from the majority class to compensate for the class imbalance [60].
Edited Nearest Neighbor (ENN) uses the k-NN (k > 1) classifier to estimate the class label of every sample in the dataset, and eliminates those samples whose class label disagrees with the class of the majority of their k neighbors [27].
Tomek Links (TL) are pairs of samples a and b from different classes such that no sample c exists with d(a, c) < d(a, b) or d(b, c) < d(a, b), d being the distance between two samples [26]. When two samples a and b form a TL, both samples are removed [64].
One Side Selection (OSS) is an under-sampling method that uses TL to reduce the majority class. When two samples form a TL, OSS removes only the sample from the majority class, unlike the TL method, which removes both samples [19].
The Condensed Nearest Neighbor rule (CNN) is a technique that seeks to reduce the size of the training dataset as much as possible; i.e., it is not a cleaning strategy as TL and ENN are. In CNN, every member of X (the original dataset) must be closer to a member of S (the pruned dataset) of the same class than to any member of S from a different class [28].
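The ENN rule described above can be sketched in a few lines (an illustrative sketch, not the imblearn implementation: a sample is dropped when the majority vote of its k nearest neighbors disagrees with its own label):

```python
import numpy as np

def enn_edit(X, y, k=3):
    """Simplified Edited Nearest Neighbor: remove every sample whose
    class label disagrees with the majority class of its k nearest
    neighbors (the sample itself is excluded from the vote)."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # exclude the sample itself
        labels, counts = np.unique(y[neighbors], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:
            keep.append(i)
    return X[keep], y[keep]

# A mislabeled point (label 1 deep inside class-0 territory) is removed.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.05, 0.05],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1, 1])
X_clean, y_clean = enn_edit(X, y, k=3)
print(len(X_clean))  # 6: the point at (0.05, 0.05) is dropped
```

TL and OSS follow the same neighborhood logic but remove samples pairwise (or only from the majority class), while CNN works in the opposite direction, growing a minimal consistent subset instead of cleaning the full set.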

Hybrid Sampling Class Imbalance Strategies
The aim of hybrid methods is first to deal with the class imbalance problem and then to clean the training dataset, in order to reduce the noise or the overlap region. Hybrid methods generally employ SMOTE to compensate for the class imbalance, because this method reduces the chances of over-training or over-fitting [1]. They use methods based on the nearest neighbor rule to reduce overlap or noise in the dataset [25].
Recently, hybrid methods have increased in popularity in the machine learning community, as can be observed in several references [23,25,66], which have studied SMOTE+TL, SMOTE+CNN and SMOTE+ENN, among other methods, to deal with the class imbalance problem and to eliminate samples in the overlapping region, in order to improve the classifier performance.
The SMOTE+ENN technique consists of applying SMOTE and then the ENN rule [64]. SMOTE+TL combines SMOTE and TL [64], and SMOTE+CNN performs SMOTE and then CNN [65]. Figure 1a describes the operation of these hybrid approaches, which have been widely applied to deal with the class imbalance problem [22,25,29,33,50,63-65,67].
The effectiveness of an additional hybrid approach is studied in this work, in which the training dataset is first cleaned or reduced and the number of samples per class is balanced afterwards. In the first stage, TL, OSS, ENN and CNN are used to clean the training dataset; in the second stage, SMOTE is applied to balance the class distribution.
OSS+SMOTE is the combination of OSS and SMOTE, in that order. TL+SMOTE first applies TL, and then SMOTE. In CNN+SMOTE, CNN is performed first, and then SMOTE. Figure 1b shows the operation of these hybrid methods.

SMOTE+ENN*
Artificial Neural Networks (ANN) are computational models that have become a popular tool in classification and regression tasks [57]. Nevertheless, it is well known that their maximum performance in supervised classification depends on the quality of the training dataset [68]. Typical methods to improve the training dataset quality are based on the nearest neighbor rule (for example, ENN, CNN, TL or OSS), which works in the feature space of the dataset; in other words, their decision region is defined in the feature space [69]. ANN, in contrast, set the decision boundary in the hidden space or transformed data (see Equations (1)-(3)) [55], even in the deep learning case [70]. Thus, in this research, we consider the most appropriate scheme to be seeking the noisy or overlapping samples in the ANN hidden space (transformed data) or in the ANN output; for simplicity, we chose the latter to propose the SMOTE+ENN* approach. SMOTE+ENN* is presented in Algorithm 1, and works as follows: first, the dataset is processed with SMOTE (Step 2); then, it is used to train the neural network (Step 5); after that, the neural network output is processed with the ENN method (Step 7), i.e., the samples of the training dataset whose neural network outputs are considered noise are deleted (Step 8). Finally, the neural network is trained again with the resultant dataset (Step 9). The source code of the proposed strategy is available online (https://github.com/ccastore/SMOTE-ENN_out.git).
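The steps above can be sketched as follows (an illustrative scikit-learn sketch, not the authors' implementation from the repository linked above; the SMOTE step is assumed to have been applied beforehand, and a k-NN vote in the output space stands in for the ENN rule, with the small deviation that each query sample is among its own neighbors):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

def smote_enn_star(X, y, k=3, epochs=250, seed=42):
    """Sketch of SMOTE+ENN*: (1) train the network on the (already
    SMOTE-balanced) data, (2) edit the dataset with an ENN-like rule
    applied in the network's *output* space, (3) retrain the network
    on the edited dataset."""
    net = MLPClassifier(hidden_layer_sizes=(16, 16), solver="adam",
                        max_iter=epochs, random_state=seed).fit(X, y)
    out = net.predict_proba(X)            # network output for each sample
    # ENN-like editing: drop samples whose k-NN vote in output space
    # disagrees with their label.
    knn = KNeighborsClassifier(n_neighbors=k).fit(out, y)
    keep = knn.predict(out) == y
    net = MLPClassifier(hidden_layer_sizes=(16, 16), solver="adam",
                        max_iter=epochs, random_state=seed).fit(X[keep], y[keep])
    return net, keep

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (40, 2)), rng.normal(3.0, 0.5, (40, 2))])
y = np.repeat([0, 1], 40)
model, keep = smote_enn_star(X, y)
print(int(keep.sum()), "of", len(y), "samples kept")
```

The key design choice is that the editing distance is measured between output vectors rather than raw features, so the samples removed are those the trained network itself maps near the wrong class.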

Database Description
The image datasets used in this work were obtained from GIC (Grupo de Inteligencia Computacional, for its acronym in Spanish) (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). The datasets correspond to six hyper-spectral images with multiple bands of light ranging from the ultraviolet to the infrared spectrum; the number of bands depends on the sensors used to capture the images. In Table 1, the number of bands corresponds to the number of features. Every dataset was previously tagged in order to generate a distribution of the pixels in accordance with the tagged class, the number of classes and the imbalance index.
The Indian and Salinas datasets were captured by the AVIRIS sensor in north-western Indiana and Salinas Valley, California, respectively; both image datasets have 17 classes corresponding to crops, grass, stone, buildings and other types. In the north of Italy, the ROSIS sensor acquired the Pavia and PaviaU images, with 102 and 103 spectral bands, whose classes include water, trees and shadows, among others. The Botswana dataset was gathered in 2004 by the EO-1 sensor over the Okavango Delta, Botswana, where 15 classes were identified representing types of swamps and woodlands. Finally, the KSC dataset (Kennedy Space Center, Florida) has 14 classes representing the various land types. The hold-out method [68] was used to randomly split each dataset: training (X) 70% and testing (TS) 30%. Table 1 summarizes the main characteristics and details of the benchmarking datasets, where "Major" and "Minor" correspond to the number of samples of the largest (majority) and smallest (minority) classes, respectively; "IR" is the class imbalance ratio between both classes (Minor/Major); and "Distribution" shows the number of samples in each class.
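The 70/30 hold-out partition can be reproduced with a simple random split (an illustrative sketch; the helper name and seed are assumptions, not taken from the paper):

```python
import numpy as np

def hold_out_split(X, y, train_frac=0.7, seed=42):
    """Random hold-out split: train_frac of the samples for training,
    the rest for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    tr, ts = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[ts], y[ts]

X = np.arange(200).reshape(100, 2)   # 100 toy samples, 2 features
y = np.arange(100) % 3               # 3 toy classes
X_tr, y_tr, X_ts, y_ts = hold_out_split(X, y)
print(len(X_tr), len(X_ts))  # 70 30
```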

Parameter Specification for the Algorithms Used in the Experimentation
The sampling methods used in this work are available as part of the imblearn library (https://pypi.org/project/imblearn/). It includes the over-sampling techniques SMOTE, ROS and ADASYN, and the under-sampling methods RUS, TL, ENN and OSS. All of them are implemented in the Python programming language and can be applied to multi-class datasets. These methods were applied with the default set-up (k = 5), and the random state (seed) was set to 42. In the case of the hybrid methods, the same library was used following the steps depicted in Figure 1, where the over-sampling or under-sampling method was performed on the original dataset, the resulting dataset was processed with the other sampling method, and the result was then used for training the neural network.
Multiple configurations of hidden layers and neurons (or nodes) were employed for each neural network. Both the number of hidden layers and the number of nodes were obtained by a trial-and-error strategy for every dataset. Table 2 shows the neural networks' free parameters and the specifications employed in the experimentation stage. The learning rate (η) was set to 0.001, and the stopping criterion was established as 500 epochs (250 epochs for the proposed SMOTE+ENN* method, Algorithm 1) or an MSE value lower than 0.001. The training of the neural network was performed by means of the Adam method, a computationally efficient and simple algorithm for the optimization of stochastic objective functions based on the gradient descent principle [59]. The training process was performed ten times in the experimental stage, and the results correspond to the averages of those runs. Table 2. Neural networks' parameters and specifications used in the experimentation stage. The first column shows the dataset name. The following columns show the number of hidden layers and their parameters (number of hidden neurons and activation function used in the respective layer). Output layer indicates the number of output neurons and their activation function. Epoch is the number of epochs employed in the training. Size of batch is the portion of samples used by the deep neural network in each epoch.

Classifier Performance and Tests of Statistical Significance
The geometric mean (g-mean) is one of the most widely used measures to evaluate the performance of classifiers and is a well recognized tool for evaluation in the class imbalance context; it is defined as follows [17]: g-mean = (∏_{j=1}^{J} acc_j)^{1/J}, where acc_j is the accuracy of class j and J is the number of classes in the dataset. In order to strengthen the analysis of the results, non-parametric statistical tests were additionally applied. The Friedman test is a non-parametric method whose first step is to rank the algorithms for each dataset separately; the best performing algorithm receives rank 1, the second best rank 2, etc. In the case of ties, average ranks are assigned. The Friedman test uses the average rankings to calculate the Friedman statistic, which can be computed as χ²_F = (12N / (K(K+1))) [Σ_j R_j² − K(K+1)²/4], where K denotes the number of methods, N the number of datasets and R_j the average rank of method j over all datasets. On the other hand, Iman and Davenport [71] demonstrated that χ²_F has a conservative behavior. They proposed a better statistic, Equation (7), distributed according to the F-distribution with K − 1 and (K − 1)(N − 1) Degrees of Freedom: F_F = ((N − 1) χ²_F) / (N(K − 1) − χ²_F). The Friedman and Iman-Davenport tests were used in this investigation, with a confidence level of γ = 0.05, using the KEEL software [72]. Table 3 exhibits the g-mean values for each method and dataset. Friedman ranks were added in order to simplify the understanding of the results (see Section 4.3). Table 3 confirms previous machine learning works, in the sense that other problems, such as class overlapping, small disjuncts, lack of density or noisy data, among others, increase the effect of class imbalance on the classifier performance. For example, see the Salinas dataset, which is a highly imbalanced dataset whose g-mean value (without a preprocessing method) is relatively high (0.8848); i.e., it is affected by the class imbalance, but this problem does not induce a poor classification performance (as occurs in other datasets).
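As a worked example, the g-mean multiplies the per-class accuracies and takes the J-th root (a small illustrative sketch, not the KEEL implementation used by the authors):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Multi-class geometric mean: the J-th root of the product of the
    per-class accuracies (recalls)."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(accs) ** (1.0 / len(classes)))

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])
# Per-class accuracies: 3/4, 2/2, 1/2 -> g-mean = (0.75 * 1.0 * 0.5)^(1/3)
print(round(g_mean(y_true, y_pred), 4))  # 0.7211
```

Because it is a product, a single badly classified class drags the g-mean toward zero, which is exactly why it is preferred over plain accuracy under class imbalance.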

Experimental Results and Discussion
These results highlight the need to study heuristic methods that improve the quality of the training dataset, such as ENN, or strategies such as TL or CNN (see Section 3), in order to reduce the negative effect of the class imbalance on the classifier performance.
For the rest of the datasets, the classifier performance is clearly impacted by the class imbalance problem, as can be observed in Table 3. When the class distribution was balanced, a substantial performance improvement was reached (see the ROS and SMOTE results). Table 3. DL-MLP performance using the g-mean (Equation (5)) as the measure. The values presented correspond to averages over ten different initializations of the neural network weights.

In accordance with the Friedman ranks, it can be noted that ROS presents better results than SMOTE. In machine learning scenarios, SMOTE traditionally overcomes ROS's performance [24], but in the big data and deep learning context, the results of this work show that this expectation was not fulfilled. This behavior has been observed in other works related to big data and deep learning [1,48]. Nevertheless, the performances of ROS and SMOTE were practically equivalent, and for the KSC dataset the SMOTE g-mean even outperforms ROS, although there is a clear overall trend toward better results with ROS than with SMOTE.
RUS improves on the classification performance of the MLP trained with the original dataset; notwithstanding, it only improves on the g-mean results of TL and ENN; i.e., RUS is an effective technique to deal with the class imbalance problem, but it is not the best strategy. RUS eliminates a considerable number of samples and keeps acceptable g-mean values in Salinas, Pavia and PaviaU. These results show that only a few samples are necessary to obtain a good classifier performance; thus, this gives relevance to the study of heuristic data sampling.
On the other hand, Table 3 shows that in most datasets, the ENN and TL strategies present the worst results. In the Indian, Pavia, Botswana and KSC datasets, improving only the quality of the data is not enough to increase the classification performance, but when these methods are combined with an over-sampling method, the classification performance improves. However, Table 3 does not show that the g-means of SMOTE+ENN and SMOTE+TL overcome the g-mean of SMOTE. Analyzing the SMOTE+CNN and SMOTE+OSS methods, a similar classification performance behavior was observed.
CNN, ENN, TL and OSS are effective methods with which to improve the training dataset quality in the nearest neighbor rule context; with neural networks, only ENN combined with SMOTE achieves positive results, but it does not generate better results than SMOTE. In addition, CNN+SMOTE, OSS+SMOTE and ENN+SMOTE show N/A values (i.e., does not apply), which refer to situations where these editing methods eliminate too many samples from a particular class. SMOTE is not applicable in such cases because it needs at least six samples per class; since some classes are left with fewer than six samples by those methods, it is proposed not to consider these values.
CNN is a method to keep a consistent subset of the training data, such that all samples in the consistent subset are successfully classified by the nearest neighbor rule [28]. Therefore, methods that are successful with the nearest neighbor rule are not necessarily effective in a neural network context. ENN, TL, OSS and CNN are methods based on the nearest neighbor rule [73], whose decision boundaries are defined locally in the feature space. However, the decision boundary of MLP neural networks is defined in the hidden space, not in the feature space [8]. Thus, it is assumed that applying these editing methods in the feature space, i.e., on the training data, is not the best option, because they help to determine the decision boundary for the nearest neighbor rule, not for the neural network classifier. For this reason, in order to remove samples that make it difficult to set the neural network decision boundary, ENN is applied to the neural network output in this work.
The proposed SMOTE+ENN* method (Algorithm 1) consists of training the neural network during I_init = 250 epochs and then applying ENN to the neural network output. In other words, it removes from the training data the samples suspected of being outliers or noise, which correspond to those identified by ENN in the neural network output space. Following that, the neural network is trained again (I_end = 250 epochs) with the resultant dataset. The results presented in Table 3 show a trend toward obtaining the best results when the neural network output is edited. SMOTE+ENN* presents the best average rank (2.5); i.e., it is better than the ranks of ROS (3.7) and SMOTE (4.4). These results evidence the appropriateness of sampling the neural network output instead of preprocessing the input feature space.
Finally, the Friedman and Iman-Davenport non-parametric statistical tests were applied to the five best sampling methods (SMOTE+ENN*, ROS, SMOTE, SMOTE+ENN and SMOTE+TL) in order to know whether a significant difference exists among these methods.
The results report that, considering the performance reduction distributed according to the chi-square distribution with four Degrees of Freedom, the Friedman statistic was 5.73 and the p-value computed by the Friedman test was 0.219. Considering the performance reduction according to the F-distribution with 13 and 143 Degrees of Freedom, the Iman-Davenport statistic was 1.57 and the corresponding p-value was 0.221. Thus, the null hypothesis is accepted, given that the Friedman and Iman-Davenport tests indicate that there are no significant differences in the results; in other words, the performances of these five methods are statistically equivalent. Notwithstanding, the results show evidence of the importance of applying the sampling methods to the neural network output.

Conclusions and Future Work
In this work, the potential of heuristic sampling strategies to deal with the multi-class imbalance problem with a DL-MLP neural network in big data scenarios was studied. Big data multi-class imbalanced datasets obtained from hyperspectral remote sensing images were used for the experimentation. The results presented in this work show that ROS and SMOTE are very competitive strategies to deal with this problem, with ROS showing a better performance than SMOTE.
Hybrid methods such as SMOTE+TL, SMOTE+ENN, SMOTE+OSS and SMOTE+CNN, although presenting evidence of being effective techniques for facing the class imbalance problem, do not improve on SMOTE's performance. Cleaning and sampling reduction strategies (TL, ENN, OSS and CNN) do not show positive results on their own with a DL-MLP neural network in big data scenarios.
RUS gives evidence that with few samples, favorable results can be obtained, although they are not better than (or competitive with) the results of ROS and SMOTE. However, the RUS results strengthen the hypothesis that the DL-MLP neural network can be successfully trained with a small subset.
In this study, it was found that cleaning strategies should be applied to the ANN output instead of the input feature space in order to obtain the best classification results. Thus, SMOTE+ENN* is proposed, which consists of training the neural network, applying ENN to the neural network's output and removing from the training dataset those samples whose neural network outputs are suspected of being outliers or noise; i.e., those identified by ENN. The results presented in this work show a trend toward better results when the neural network's output is edited than when the input feature space is preprocessed.
Typically, most of the research reported in the literature has focused on the two-class imbalance problem using classical classifiers. Thus, the study of the relevance and capabilities of heuristic sampling methods to deal with the big data multi-class imbalance problem in the deep learning context is especially interesting.
Future work should address, in the deep learning context, the study of reduction and cleaning strategies to improve the DL-MLP neural network's performance in big data scenarios.