Imbalanced Learning Based on Data-Partition and SMOTE

Classifying data with an imbalanced class distribution is difficult for most conventional classification learning methods, which assume a relatively balanced class distribution. This paper proposes a novel classification method based on data-partition and SMOTE for imbalanced learning. The proposed method differs from conventional ones in both the learning and prediction stages. In the learning stage, the proposed method uses the following three steps to learn a class-imbalance oriented model: (1) partitioning the majority class into several clusters using data partition methods such as K-Means, (2) constructing a novel training set using SMOTE on each data set obtained by merging each cluster with the minority class, and (3) learning a classification model on each training set using conventional classification learning methods such as decision tree, SVM and neural network. In this way, a classifier repository consisting of several classification models is constructed. In the prediction stage, for a given example to be classified, the proposed method uses the partition model constructed in the learning stage to select a model from the classifier repository to predict the example. Comprehensive experiments on KEEL data sets show that the proposed method outperforms some other existing methods on the evaluation measures of recall, g-mean, f-measure and AUC.


Introduction
Class imbalance is an important research topic in machine learning and pattern recognition [1]. For two-class problems, imbalanced data is characterized by one class (the minority or positive class) being much smaller than the other (the majority or negative class) [2]. In many practical applications, correctly classifying the minority class examples is more important than correctly classifying the majority class examples. For example, in fraud detection, only a few cases are fraudulent, and correctly identifying them is very valuable [3]. Although classification methods such as k-nearest neighbor (KNN), decision tree, support vector machine (SVM) and back-propagation neural network have been widely used in many real-world applications to guide decision-making, classifying imbalanced data still challenges these conventional classification models. The reasons include: (1) when facing imbalanced data, standard approaches often provide good coverage of the majority class examples while distorting the minority class examples [4]; (2) a learning process guided by measures such as global accuracy tends to overlook the minority class examples [5]; (3) minority class examples may be mistaken for noise and vice versa, since both are rare in the data space [6]; and (4) minority class examples may overlap with other classes, which often causes them to be misclassified into the majority class [7].
Many approaches have been proposed to handle class-imbalanced problems. These approaches can be mainly categorized into two levels: data level approaches and algorithm level approaches. Approaches at the data level try to rebalance the data distribution by sampling the data space such that conventional learning methods can capture the characteristics of the minority class [8][9][10][11][12][13][14][15][16][17].
Approaches at the algorithm level try to improve the generalization ability of existing algorithms on imbalanced data by adjusting the learning process of the algorithms. These approaches therefore require a clear comprehension of the algorithm itself and its application domain, including why the algorithm performs poorly when the data distribution is imbalanced. Existing methods at this level include enhancing the discriminatory ability of classifiers using kernel transformation [18] and converting the learning objective to functions which punish errors on the minority class more severely [19].
This paper proposes a novel classification method based on data-partition and an oversampling technique, namely SMOTE [10], for imbalanced learning. In the learning stage, the proposed method first pre-processes the training set using partition methods and SMOTE: it uses partition methods such as K-Means [20] to partition the majority class into several clusters, and creates a new training set by using SMOTE to oversample each data set obtained by merging the minority class with each cluster. Then, the proposed method learns a classification model on each novel training set, and therefore a repository of classification models is constructed. In the prediction stage, the method uses the partition model obtained in the learning stage to assign an unlabeled example to a cluster, and the corresponding classification model is selected. The idea of partitioning the majority set is inspired by the intuition that the new training set obtained through the operations of partition, merging and over-sampling is more balanced and separable. For example, a model (such as SVM) learned on the data set shown in the left sub-figure of Figure 1 intuitively tends to predict the positive class samples to be negative ones. After partitioning the negative class set into several clusters, the model learned from the novel data set obtained by merging the positive class and each cluster (the example set in the solid ellipse of the right sub-figure of Figure 1) is more likely to correctly classify the three positive class examples, as the novel data set is separable. The right sub-figure of Figure 1 also shows that the new training set obtained after the over-sampling operation (the hollow cross represents an example generated by SMOTE) can further improve the performance of conventional methods on imbalanced data. The main contributions of this work are as follows:
• proposing a data-partition based method to enhance separability between the majority class and the minority class, thus improving the performance of conventional methods on class-imbalanced data;
• combining an oversampling technique, namely SMOTE, with a data-partition method by oversampling the partitioned data to obtain a more balanced data set, thus further enhancing the performance of traditional methods on imbalanced problems;
• conducting extensive experiments, the results of which show that the proposed method significantly outperforms other state-of-the-art methods on the measures of recall, g-mean, f-measure and AUC.
The rest of the paper is organized as follows. Section 2 presents the related work, Section 3 describes the proposed method for the class-imbalanced problem, Section 4 discusses parameter learning and Section 5 reports the experimental results. Finally, Section 6 concludes the paper.

Characteristics of Imbalanced Data
A data set that exhibits an unequal distribution between its classes is considered imbalanced (skewed). Theoretical and empirical results indicate that, besides the skewed data distribution, many factors influence classifier performance in identifying the minority class examples [21]. These factors include imbalanced class distribution, small sample size, class overlapping and within-class subconcepts.

• Imbalanced Class Distribution: The imbalance degree of a data set is often denoted by the ratio of the size of the majority class to that of the minority class. The studies carried out by Weiss and Provost [22] showed that a model constructed on a relatively balanced distribution usually obtains better classification performance. However, it is difficult to state explicitly at what imbalance degree the class distribution would deteriorate the classification performance, because other factors, including class overlapping and within-class subconcepts, also affect the performance.

• Small Sample Size: For data with a given imbalance degree, the data size determines the performance of a classification model. If the data size is small, the limited examples of the minority class cannot cover the inherent regularities. The studies carried out by Japkowicz and Stephen [23] indicated that, given a large enough data set, the imbalanced class distribution may not affect the classification performance. However, in practice, collecting sufficient data for class-imbalanced data sets is challenging [24].

Sampling Technique
Sampling techniques, including under-sampling, over-sampling and clustering-based sampling, are among the most popular methods for handling imbalanced data [26,27]. Under-sampling approaches aim to balance the data distribution by eliminating majority class examples. Random under-sampling, a commonly used sampling method, randomly discards majority class examples until a relatively balanced distribution is reached. The problem with random under-sampling is that useful examples may be eliminated, leading to worse classification performance [28]. To overcome this drawback, many methods have been proposed to retain the useful information present in the majority class. Examples include the condensed nearest neighbor (CNN) rule [29] and Tomek Links [30], where CNN tries to remove redundant examples of the majority class and Tomek Links discards borderline and noisy examples.
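Random under-sampling as described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `random_undersample` and the `ratio` parameter are hypothetical, and "relatively balanced" is interpreted here as a target majority-to-minority ratio of 1.

```python
import random

def random_undersample(majority, minority, ratio=1.0, seed=0):
    """Randomly keep about ratio * len(minority) majority examples.

    `ratio` is a hypothetical knob: 1.0 means both classes end up equal in size.
    """
    rng = random.Random(seed)
    target = int(ratio * len(minority))
    if len(majority) <= target:
        # Nothing to discard: the majority class is already small enough.
        return list(majority)
    # Sampling without replacement discards the remaining majority examples.
    return rng.sample(majority, target)

majority = [(float(i), 0.0) for i in range(100)]
minority = [(0.5, 1.0), (1.5, 1.0), (2.5, 1.0)]
kept = random_undersample(majority, minority)
```

Because the discarded examples are chosen blindly, informative majority examples can be lost, which is the drawback noted in [28].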
Over-sampling approaches rebalance the data distribution by creating a new, larger minority class, and random over-sampling is one of the most widely used over-sampling methods. In random over-sampling, minority class examples are randomly duplicated such that the class distribution becomes more balanced. Although the over-sampling technique creates a more balanced distribution, it suffers from the drawback of overfitting [28]. The synthetic minority over-sampling technique (SMOTE) [10] was proposed to overcome this drawback; it synthetically generates new minority class examples along the line between two selected minority class samples. Borderline over-sampling [31] only oversamples the borderline minority class samples, since the borderline region is more crucial for establishing the decision boundary. The majority weighted minority oversampling technique (MWMOTE) [14] identifies the hard-to-learn informative minority samples, assigns weights according to their distances from the nearest majority class samples, and generates synthetic samples from the weighted minority class samples. Some other over-sampling methods such as Borderline-SMOTE [32] and Safe-level-SMOTE [33] were also proposed to handle overfitting.
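The interpolation idea behind SMOTE can be illustrated with a short pure-Python sketch. It is a simplification and not the reference implementation of [10]: the base example is drawn at random for every synthetic point, and the names `smote`, `n_new` and `k` are chosen here for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority example and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: all other minority examples, nearest first.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=3, seed=1)
```

Since every synthetic point is a convex combination of two minority examples, it stays inside the region spanned by the minority class instead of being an exact duplicate.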
In recent years, many clustering-based sampling approaches have been proposed for handling the class-imbalanced problem. Sobhani [12] divided the majority class into several clusters and selected at least one sample from each cluster to form a subset of the majority class with a size equal to that of the minority class. Yen [13] partitioned the whole training set into several clusters, selected majority class samples from each cluster according to the ratio of the size of the majority class to that of the minority class, and combined the selected majority class samples with the minority class samples to obtain a new training set. Prachuabsupakij [34] proposed a method that partitions the training set into two clusters, uses over-sampling and under-sampling to resample each cluster, learns a random forest on each resampled cluster, and obtains the prediction for an example by combining the results from both clusters through a majority vote.
Our method differs from the above methods in both the learning and prediction stages. In the learning stage, our method first partitions the majority class into several clusters, directly combines the minority class with each cluster to form a new cluster, applies SMOTE to oversample each new cluster, and learns a conventional classification model on each new cluster. Therefore, a classifier repository with a size equal to the partition number is constructed. In the prediction stage, the partition model learned in the learning stage is used to select the corresponding classification model for prediction, and thus our method is a special single model instead of an ensemble one.

Clustering
The process of grouping a set of examples into classes of similar characteristics is called clustering. A cluster is a collection of examples that are similar to each other and dissimilar to the examples of other clusters [35]. Many clustering algorithms have been proposed, and partition-based clustering and hierarchical clustering are commonly used methods.
Partition-based clustering constructs k partitions of the examples, where each partition represents a cluster. These clusters should fulfill the following requirements: (1) each cluster must contain at least one example, and (2) each example must belong to exactly one cluster. K-Means is widely used in many practical applications because of its simplicity and adaptability [36]. In the K-Means algorithm, a center is the average of all points in a partition. Other commonly used partition-based clustering methods include ISODATA [37] and PAM [38].
Hierarchical clustering (HC) algorithms organize data into a hierarchical structure according to the proximity matrix [39]. HC methods can be mainly categorized into two groups: agglomerative methods and divisive methods. Agglomerative methods start with N clusters, each composed of only one example, and then iteratively merge the two closest clusters until the pre-determined number of clusters is reached. In contrast, divisive methods start with the entire data set as one cluster and iteratively divide the most appropriate cluster, repeating the dividing process until a pre-specified criterion is reached.

Evaluation Measures
Evaluation measures play a key role both in assessing the performance of a classification model and in guiding its modeling process [25]. Traditional evaluation measures such as accuracy focus more on the majority class examples, ignoring the minority class examples that are relevant to the user preference [28]. For example, in a problem where the minority class contains 1% of all the examples, a naive approach predicting all examples to be the majority class would achieve an accuracy of 99%. Though 99% accuracy appears good on the whole data set, all of the minority class examples are misclassified.
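Table 1 presents the confusion matrix of the predictions of a classifier on a two-class problem, where TP and FN denote the numbers of minority (positive) class examples that are classified correctly and incorrectly, and TN and FP denote the numbers of majority (negative) class examples that are classified correctly and incorrectly. Based on Table 1, many measures including recall, precision, f-measure and g-mean have been designed for the class-imbalanced problem. Recall and precision are defined as:

$$\text{recall} = \frac{TP}{TP + FN}, \tag{1}$$

$$\text{precision} = \frac{TP}{TP + FP}. \tag{2}$$

F-measure is a harmonic mean between recall and precision. When the relative importance between recall and precision is set to 1, f-measure becomes f1-measure, formally

$$\text{f1-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}. \tag{3}$$

Like f-measure, g-mean is another metric considering both the minority class and the majority class. Specifically, g-mean measures the balanced performance of a classifier using the geometric mean of the recall of the minority class and that of the majority class. Formally, g-mean is as follows:

$$\text{g-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}. \tag{4}$$

In addition, AUC is a commonly used measure to evaluate model performance. According to [28], AUC can be estimated by

$$\text{AUC} = \frac{1 + TP_{rate} - FP_{rate}}{2}, \tag{5}$$

where $TP_{rate} = TP/(TP + FN)$ and $FP_{rate} = FP/(FP + TN)$. In this paper, we apply recall, g-mean, f-measure, AUC and precision as candidate measures to evaluate the generalization ability of the proposed method.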
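The measures discussed in this section can be computed directly from the four confusion-matrix counts. The following Python sketch assumes the usual notation (`tp`, `fn`, `fp`, `tn` are names chosen here, with the minority class as the positive class) and assumes the common single-operating-point AUC estimate (1 + TP rate − FP rate)/2; the function name `imbalance_metrics` is hypothetical.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Return accuracy, recall, precision, F1, g-mean and a one-point AUC estimate."""
    recall = tp / (tp + fn) if tp + fn else 0.0          # minority-class recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0     # majority-class recall
    f1 = (2 * recall * precision / (recall + precision)
          if recall + precision else 0.0)
    g_mean = math.sqrt(recall * specificity)
    fp_rate = 1.0 - specificity
    auc = (1.0 + recall - fp_rate) / 2.0                 # assumed one-point estimate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "g_mean": g_mean, "auc": auc}

# The 1%-minority example: 1000 examples, everything predicted as majority.
naive = imbalance_metrics(tp=0, fn=10, fp=0, tn=990)
# accuracy is 0.99, while recall, F1 and g-mean are all 0.0
```

The naive all-majority classifier illustrates why accuracy alone is misleading: it scores 99% accuracy, but every measure that involves minority-class recall collapses to zero and the AUC estimate is 0.5.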

Imbalanced Learning Based on Data-Partition and SMOTE
In this paper, a novel imbalanced learning method based on data-partition and SMOTE is proposed, which is a data level approach for class-imbalanced problems. Figure 2 illustrates the learning and prediction stages of the proposed method. Different from other existing methods, in the learning stage, the method partitions the majority class $D_{maj}$ into m clusters $\{C_i | i = 1, 2, \ldots, m\}$ using a partition algorithm (e.g., K-Means) with $\cup C_i = D_{maj}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, constructs a data set by merging the minority class $D_{min}$ with each cluster $C_i$ (namely $C_{mer,i} = D_{min} \cup C_i$), oversamples each set $C_{mer,i}$ to obtain a new training set $C_{new,i}$, and learns a model using conventional learning methods such as SVM on each novel training set $C_{new,i}$. In this manner, the proposed method learns a classifier repository consisting of m models. In the prediction stage, a model is selected from the repository using the partition model obtained in the learning stage to predict the class label of example x. The algorithm details of the proposed method are provided in Algorithm 1.

Prediction stage:
Input: x-example to be classified; P-partition algorithm (e.g., K-Means); M-the classifier repository.
Output: class label y to which example x belongs
10. γ = P(x); //get the label of the cluster to which x belongs
11. y = M_γ(x); //predict x's class using the corresponding model
12. return y.

In the learning stage, the algorithm first preprocesses the training data set: it separates the training set into two sets, namely the majority class set $D_{maj}$ and the minority class set $D_{min}$ (line 1), and partitions the majority class set $D_{maj}$ into m clusters $C = \{C_i | i = 1, 2, \ldots, m\}$ (line 2) using the partition method. Then, the algorithm iteratively learns base models on the training sets obtained by (1) merging each cluster $C_i$ with the minority class $D_{min}$ to form $C_{mer,i}$, and (2) over-sampling the merged cluster $C_{mer,i}$ to get the new training set $C_{new,i}$ (lines 3-7). In the prediction stage, the algorithm uses the partition model obtained in the learning stage to assign an unclassified example x to a cluster with label γ (line 10). Then, the corresponding model $M_γ$ is employed to predict the class label of the example (line 11).
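The two stages described above can be sketched as a small Python skeleton. It is not the authors' implementation: the function names (`learn_repository`, `predict`, `nearest`, `learn_1nn`) are hypothetical, the cluster centroids are assumed to have been produced already by a partition method such as K-Means, the `oversample` argument is a stand-in for SMOTE, and a toy 1-nearest-neighbour learner replaces the conventional classifiers.

```python
import math

def nearest(x, centroids):
    """Index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))

def learn_repository(d_maj, d_min, centroids, oversample, learn):
    """Learning stage: build one model per majority-class cluster."""
    clusters = [[] for _ in centroids]
    for x in d_maj:                      # partition the majority class
        clusters[nearest(x, centroids)].append(x)
    models = []
    for c_i in clusters:
        synthetic = oversample(d_min)    # stand-in for SMOTE on the merged set
        X = c_i + d_min + synthetic      # merged and oversampled training set
        y = [0] * len(c_i) + [1] * (len(d_min) + len(synthetic))
        models.append(learn(X, y))
    return models

def predict(x, centroids, models):
    """Prediction stage: use the model of the cluster that x falls into."""
    gamma = nearest(x, centroids)
    return models[gamma](x)

def learn_1nn(X, y):
    """Toy base learner: a 1-nearest-neighbour classifier returned as a closure."""
    def model(x):
        i = min(range(len(X)), key=lambda i: math.dist(x, X[i]))
        return y[i]
    return model

d_maj = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
d_min = [(1.0, 0.5), (9.0, 10.5)]
centroids = [(0.0, 0.5), (10.0, 10.5)]   # assumed output of a partition method
# Duplicating the minority class is used here only as a placeholder for SMOTE.
models = learn_repository(d_maj, d_min, centroids, lambda d: list(d), learn_1nn)
label = predict((1.0, 0.6), centroids, models)
```

Only one model is consulted per prediction, which is why the method behaves as a single selected model and not as a voting ensemble.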
Intuitively, different partition methods (lines 2 and 10) may lead to diverse performance of the learned models. We have compared the results obtained by different clustering algorithms, such as K-Means, hierarchical clustering and random clustering, and found that there were few differences among them. Thus, this paper uses K-Means as the representative method to partition the majority class due to its simplicity and adaptability [39]. Given a data set $D = \{x_1, x_2, \ldots, x_n\}$, K-Means learns the cluster set $C = \{C_1, C_2, \ldots, C_m\}$ by minimizing the squared error E of the clusters, where $\cup C_i = D$ and $C_i \cap C_j = \emptyset$ for $i \neq j$ (in the proposed method, $D = D_{maj}$), and E is formally defined as

$$E = \sum_{j=1}^{m} \sum_{x_i \in C_j} \| x_i - d_j \|^2, \tag{6}$$

where $d_j$ is the centroid of cluster $C_j$, defined by

$$d_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i, \tag{7}$$

where $|C_j|$ is the size of cluster $C_j$. Intuitively, Equation (6) reflects the tightness of the examples within a cluster around the corresponding centroid, and the smaller the value of Equation (6), the more similar the examples within a cluster. K-Means approximately optimizes Equation (6) iteratively: (1) randomly selecting m examples from D as the initial centroids $\{d_1, d_2, \ldots, d_m\}$ of the clusters; (2) for each $x_i \in D$, calculating the distance between $x_i$ and each centroid $d_j$, namely $dis(x_i, d_j) = \| x_i - d_j \|$, and merging $x_i$ into the cluster with the nearest centroid; (3) calculating the new centroid $d'_j$ of each cluster $C_j$ using Equation (7) and updating the centroid $d_j$ to be $d'_j$ if $d'_j \neq d_j$; (4) repeating steps 2 and 3 until all the centroids remain unchanged.
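The four K-Means steps can be written out as a short pure-Python sketch. It is an illustration and not the implementation used in the experiments: the names `k_means`, `centroid` and `squared_error` are chosen here, a `max_iter` cap is added as a safeguard, and an empty cluster simply keeps its previous centroid, which is one of several possible conventions.

```python
import math
import random

def centroid(cluster):
    """Mean of the points in a cluster (the centroid definition)."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def squared_error(clusters, centroids):
    """Sum of squared distances of all points to their cluster centroid."""
    return sum(math.dist(x, d) ** 2
               for c, d in zip(clusters, centroids) for x in c)

def k_means(data, m, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(data, m)              # step 1: random initial centroids
    clusters = [[] for _ in range(m)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(m)]
        for x in data:                           # step 2: assign to nearest centroid
            j = min(range(m), key=lambda j: math.dist(x, centroids[j]))
            clusters[j].append(x)
        # step 3: recompute centroids (an empty cluster keeps its old centroid)
        updated = [centroid(c) if c else d for c, d in zip(clusters, centroids)]
        if updated == centroids:                 # step 4: stop when nothing changes
            break
        centroids = updated
    return clusters, centroids

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters, centroids = k_means(data, m=2)
error = squared_error(clusters, centroids)
```

On this toy data the two well-separated groups are recovered, and `error` is the value of the squared-error objective at convergence.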
The proposed method applies K-Means to partition the majority class (refer to lines 2 and 10 of Algorithm 1), and uses the Euclidean distance to calculate the distance between an example and a centroid, formally

$$dis(x_i, d_j) = \| x_i - d_j \|_2 = \sqrt{(x_i - d_j)^{\top} (x_i - d_j)}. \tag{8}$$

In the prediction stage, the proposed method first assigns an unlabeled example $x_i$ to the cluster $C_γ$ (refer to line 10 in Algorithm 1) using the Euclidean distance (Equation (8)) and then predicts the label of the example using the corresponding classification model $M_γ$ (refer to line 11 in Algorithm 1), where

$$γ = \arg\min_{j \in \{1, 2, \ldots, m\}} dis(x_i, d_j),$$

and $dis(x_i, d_j)$ is calculated by Equation (8). The corresponding experimental results are shown in Section 5.
Let t, N and l be the number of K-Means iterations, the size of the original data set and the number of examples generated by SMOTE, respectively. Denote O(Learn(D)) as the running time for learning a given classifier on the data set D, which is determined by the corresponding classifier model. The running time of learning the proposed model is dominated by line 2 and the loop from line 3 to line 9. Line 2 uses K-Means to partition the majority set, which can be done in $O(tm|D_{maj}|) < O(tmN)$. Within the loop (lines 4 to 8), the algorithm oversamples the merged data $C_{mer,i}$, consuming $O(l|D_{min}| \log |D_{min}|) < O(lN \log N)$, and learns a given classifier with $O(Learn(C_{new,i})) < O(Learn(D))$; thus the running time of the loop from line 3 to line 9 is less than $O(mlN \log N + m \cdot Learn(D))$. Therefore, the running time of learning a model using the proposed method is less than $O(m(tN + lN \log N + Learn(D)))$. Note that $l, t, m \ll N$. In addition, each classifier $M_i$ is learned on a balanced data set $C_{new,i}$ whose size is much smaller than that of D, and our experiments show that the running time of learning m classifiers on the balanced data sets is comparable to (or even much less than) that of learning one model on the whole data set, especially for SVM. When predicting an example x, the proposed method needs additional time compared with other methods to determine the cluster to which x belongs, which takes O(m). Overall, the proposed method is therefore an efficient learning approach.

Parameter Setting
An open question in Algorithm 1 is how to set the cluster number m used by the partition method for partitioning the majority class. In this section, K-Means was selected as the candidate partition method, and six conventional classification methods, namely k-nearest neighbor (KNN), C4.5, logistic regression (LR), support vector machine (SVM), neural network (NN) and naive Bayes (NB), were selected as the basic classifiers in order to evaluate the impact of parameter m on the performance of the proposed method. In addition, AUC was selected as the candidate measure.
Figure 3 reports the impact of parameter m on the performance of the six learning methods, namely KNN, C4.5, LR, SVM, NN and NB, in terms of AUC, where each row shows the performance of a given classifier on the four data sets and each column shows the results of the six classifiers on a given data set. For each sub-figure, the horizontal axis is the value of m, growing from 1 to 20 with step 1, and the vertical axis is the average AUC over 10-fold cross validation. For example, the sub-figure in the second row (corresponding to algorithm C4.5) and the third column (corresponding to data set flare-F) shows the AUC curve of C4.5 on data set flare-F. Here, four data sets, namely car-good, ecoli-0-3-4-7_vs_5-6, flare-F and zoo-3, are selected as candidate data sets to evaluate the impact of m. See Section 5.1 for more details about the four data sets. The results are obtained as follows: given the value of m, 10-fold cross validation is conducted on each data set and 10 results are obtained. Then, the final result is calculated by averaging the 10 results, and is drawn as a point in the figure. Figure 3 shows that partitioning can effectively improve the performance of the models on most data sets (m = 1 corresponds to the original model). In addition, we observed from Figure 3 that the classifiers achieve acceptable AUC performance if $m = \lceil N_{maj}/(N_{min} \times 2) \rceil$, where $N_{maj}$ and $N_{min}$ are the sizes of the majority class and the minority class, respectively. Take classifier KNN on data set car-good as an example: the AUC is rather good at $m = \lceil N_{maj}/(N_{min} \times 2) \rceil = 13$. Similar results can be observed in the other sub-figures of Figure 3. This result is not in accordance with the intuitive expectation of $m = N_{maj}/N_{min}$, since the partition method cannot guarantee that the sizes of the clusters are similar to each other [53]. In fact, our experiments show that the size of some clusters may be much smaller than that of the minority class, which leads to a new imbalance and thus may reduce the model performance. Therefore, we take $m = \lceil N_{maj}/(N_{min} \times 2) \rceil$ in the following experiments.
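The rule for the partition number can be written as a one-line helper. The sketch below assumes that the result is rounded up to an integer and clipped to at least 1 (both are assumptions made here so that m is always a valid cluster count); the function name `partition_number` and the class sizes in the usage line are hypothetical.

```python
import math

def partition_number(n_maj, n_min):
    """Number of majority-class clusters: N_maj / (2 * N_min), rounded up."""
    # Rounding up and the lower bound of 1 are assumptions of this sketch.
    return max(1, math.ceil(n_maj / (2 * n_min)))

m = partition_number(n_maj=1659, n_min=69)   # 1659 / 138 is about 12.02, so m == 13
```

With this choice an average cluster holds about twice as many majority examples as there are minority examples, which leaves room for the oversampling step to close the remaining gap.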

Experimental Setup
This paper focuses only on binary classification. Seventeen binary class-imbalanced data sets, the details of which are shown in Table 2, were randomly selected from the KEEL repository [54], where Instances, Attributes and IR are the size of the data set, the number of attributes and the imbalance ratio, respectively. The imbalance ratio is defined as the ratio of the size of the majority class to that of the minority class. To evaluate the performance of the proposed method on class-imbalanced problems, a ten-fold cross-validation strategy was applied: each data set is divided into ten folds of similar size, nine folds are used to train a model, and the remaining one is used for testing the model [55]. On each data set, we repeated the ten-fold cross-validation ten times and, therefore, 100 models were actually constructed.
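The evaluation protocol (ten repetitions of ten-fold cross-validation, giving 100 train/test splits per data set) can be sketched as an index generator. This is a plain shuffled split written for illustration; whether the original experiments stratified the folds by class is not stated, so no stratification is assumed, and the name `repeated_k_fold` is hypothetical.

```python
import random

def repeated_k_fold(n, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Striding gives k folds whose sizes differ by at most one.
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_k_fold(n=25))
```

Each of the 100 splits trains one model on nine folds and tests it on the remaining fold; the reported score for a data set is then an average over these runs.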
Thirty methods were designed for the experiment, and, based on a basic classifier used by the proposed method (refer to line 6 of Algorithm 1), the methods are grouped into six categories.

KNN Based Methods:
• K-nearest neighbor (KNN) [56,57] searches for the k nearest neighbors of a given example and predicts the class of the example using a majority vote of its neighbors. k was set to 3 in the experiments.
• MWMOTE-KNN (MWMO-KNN) first oversamples the training set using MWMOTE [14], and, based on that, a KNN model is learned.
C4.5 Based Methods:
• C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan [58]. The authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date" [59]. Here, a pruned C4.5 algorithm was used.

• Data-partition-C4.5 (DP-C4.5) is similar to DP-KNN with the exception that C4.5 was used to learn basic classifiers instead of KNN, and the partition number m was set to $\lceil N_{maj}/(N_{min} \times 2) \rceil$.
• SMOTE-C4.5 (S-C4.5) is similar to S-KNN with the exception that C4.5 instead of KNN was used to train basic models.
Logistic Regression Based Methods:
• Logistic regression (LR) [60] is a regression model where the dependent variable is categorical. In this paper, a binary logistic regression model was used to predict the class label of an example based on the example's features.
Support Vector Machine Based Methods:
• Support vector machine (SVM) [61] constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification. In the SVM model, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class.

• Data-partition-SVM (DP-SVM) is similar to DP-KNN with the exception that SVM classification models instead of KNN were learned.
• SMOTE-SVM (S-SVM) is similar to S-KNN with the exception that SVM classification models instead of KNN were learned.
• Data-partition-SMOTE-SVM (DPS-SVM) is similar to DPS-KNN with the exception that SVM models instead of KNN were learned.
• MWMOTE-SVM (MWMO-SVM) first oversamples the training set using MWMOTE [14], and, based on that, an SVM model is learned.
Neural Network Based Methods:
• Neural Network (NN) [62] with one hidden layer was learned, and the number of hidden units was set to the mean of the numbers of input and output units.
In addition, the parameters of SMOTE used by the above methods, namely S-KNN, DPS-KNN, S-C4.5, DPS-C4.5, S-LR, DPS-LR, S-SVM, DPS-SVM, S-NN, DPS-NN, S-NB and DPS-NB, were set as follows: the number of nearest neighbors was set to 5 and the number of generated examples equals the size of the minority class.

Experimental Results
This section reports the experimental results of the 30 classification methods, which are grouped into six categories according to the basic learner used by the proposed method, as introduced in Section 5.1. The corresponding experimental results are shown in five tables, where Tables 3-6 report the summary results of the methods on the measures of recall, g-mean, f-measure and AUC, respectively. Table 7 reports the number of data sets on which the proposed method outperforms the compared methods. In Tables 3-6, the values in parentheses are the ranks of the methods on the corresponding measures. On each data set, the results of the best performing methods within their categories are shown in bold. The last column of each table shows the average performance of each method on the 17 data sets, and the best average performance is shown in bold. In these tables, the experimental results are separated by blank lines according to the algorithm categories, and the first method (the proposed method) of each category is compared with the other four methods. For simplicity, this subsection names the proposed method DPS (data-partition and SMOTE based method), and treats DPS compared with another method m as DPS-m vs. m. For example, DPS compared with KNN means DPS-KNN vs. KNN. The ranks of the methods in these tables are calculated as follows [55,64]: on a given data set, the best performing method gets the rank of 1.0, the second best gets 2.0, and so on. In case of ties, average ranks are assigned. Table 7 provides the number of data sets on which the DPS methods outperform their compared methods on the different measures. For example, the value "15" in the second row and the second column of Table 7 indicates that DPS (DPS-KNN) outperforms KNN on 15 out of 17 data sets on the measure of recall.
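The ranking rule with averaged ties can be made concrete with a small Python helper. The function name `average_ranks` is chosen here for illustration, and a higher score is assumed to be better, as is the case for recall, g-mean, f-measure and AUC.

```python
def average_ranks(scores):
    """Rank methods on one data set: best score gets 1.0, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of methods tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2   # mean of positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

ranks = average_ranks([0.9, 0.8, 0.9, 0.7])   # [1.5, 3.0, 1.5, 4.0]
```

In the usage line, the two methods tied for first place share the mean of ranks 1 and 2, so both receive 1.5 and the next method receives rank 3.0.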
Table 6 reports the experimental results and ranks of the compared methods on AUC. Similar to the results above, DPS wins on most of the 17 data sets compared to the other methods, and the number of data sets on which DPS outperforms the compared methods on AUC is also shown in Table 7. Table 6 shows that the ranks of the DPS methods on most of the data sets are also much smaller than those of the other methods, and DPS-KNN ranks first with an average rank of 6.9, followed by DPS-NN (9.1), S-KNN (9.5), DP-NN (10.
Combining Tables 3-7, we conclude that the proposed method improves conventional methods on the class-imbalanced problem, and is also superior to other existing class-imbalance oriented methods in terms of recall, g-mean, f-measure and AUC.

Conclusions
This paper proposes a classification method based on data-partitioning and SMOTE for the class-imbalanced problem. In the learning stage, the proposed method partitions the majority class into several clusters, merges each cluster with the minority class to form several new sets, and oversamples the new sets to obtain relatively more balanced training sets. Then, a classification model is learned from each new training set, and thus a repository of models is constructed. In the prediction stage, a model is selected from this repository using the partition model learned in the learning stage to predict the class of an example. Experimental results show that the proposed method significantly enhances the performance of conventional methods on class-imbalanced problems and, compared to some other existing classification methods for tackling class-imbalanced problems, the proposed method also shows superiority in terms of recall, g-mean, f-measure and AUC.
Although this paper limited the analyses to two-class imbalanced problems and only used K-Means and SMOTE as the partitioning and sampling techniques, respectively, the general ideas derived in this paper are valid for broader multi-class imbalanced problems. Future work will involve further testing the performance of the proposed method on multi-class problems and automating the selection of the parameters of K-Means. In addition, many partition methods and sampling techniques can be used within the proposed method, and it will therefore also be important to study the impact of other partitioning methods and sampling techniques on the proposed method.

Figure 1. Characteristics of data. (Left): original data; (Center): after partition; (Right): after merging and over-sampling. The bars represent the majority class samples, the crosses represent the minority class samples, and the hollow cross represents the sample generated by oversampling.

Figure 2. Diagram of the proposed method.

Figure 3. Impact of parameter m on the performance of models.

• Data-partition-KNN (DP-KNN) partitions the majority class into m clusters using K-Means and learns a KNN model on each set obtained by merging each cluster with the minority class. For prediction, DP-KNN selects a KNN model according to the learned K-Means model to predict the example class. Here, we set k = 3 (for KNN) and $m = \lceil N_{maj}/(N_{min} \times 2) \rceil$, where $N_{maj}$ and $N_{min}$ represent the size of the majority class and that of the minority class (refer to Section 4).
• SMOTE-KNN (S-KNN) oversamples the training set using SMOTE to obtain a relatively balanced class distribution, on which a KNN model is learned. Similar to KNN and DP-KNN, k was set to 3.
• Data-partition-SMOTE-KNN (DPS-KNN) is similar to DP-KNN with the exception that DPS-KNN used both K-Means and SMOTE to preprocess the training set, and we set k = 3 and $m = \lceil N_{maj}/(N_{min} \times 2) \rceil$.

• Data-partition-NN (DP-NN) is similar to DP-KNN, except that the NN model was used instead of KNN to train basic classifiers.
• SMOTE-NN (S-NN) is similar to S-KNN, except that the NN model was used instead of KNN to train basic classifiers.
• Data-partition-SMOTE-NN (DPS-NN) is similar to DPS-KNN, except that the NN model was used instead of KNN to train basic classifiers.
• MWMOTE-NN (MWMO-NN) first oversamples the training set using MWMOTE [14], and, based on that, an NN model is learned.
Naive Bayes Based Methods:
• Naive Bayes (NB) [63] is a simple probabilistic classifier with naive independence assumptions between the features.
• Data-partition-NB (DP-NB) is similar to DP-KNN with the exception that the learning model was NB instead of KNN.
• SMOTE-NB (S-NB) is similar to S-KNN with the exception that the learning model was NB instead of KNN.
• Data-partition-SMOTE-NB (DPS-NB) is similar to DPS-KNN with the exception that the learning model was NB instead of KNN.
• MWMOTE-NB (MWMO-NB) first oversamples the training set using MWMOTE [14], and, based on that, an NB model is learned.

Table 1. Confusion matrix of biclass problem.

Algorithm 1. Imbalanced learning algorithm based on data-partition and SMOTE.

Table 2. Summary description of imbalanced data sets.

Table 3. Experimental results of models on Recall.

Table 4. Experimental results of models on g-mean.

Table 5. Experimental results of models on f-measure.

Table 6. Experimental results of models on AUC.

Table 7. Number of data sets on which DPS methods outperform the compared methods.