A New Oversampling Method Based on the Classification Contribution Degree

Abstract: Data imbalance is a thorny issue in machine learning. SMOTE is a famous oversampling method for imbalanced learning. However, it has some disadvantages, such as sample overlapping, noise interference, and blindness of neighbor selection. In order to address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of the original samples on the class boundary and avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.


Introduction
Currently, imbalanced learning for classification problems has attracted more and more attention in machine learning research [1,2]. In imbalanced datasets, the positive (minority) samples play the most important role, for example in credit scoring in the financial sector [3], fault detection in mechanical maintenance [4], abnormal behavior detection in crowds [5], cancer detection [6], and other medical fields [7]. However, traditional classifiers and learning algorithms attach too much importance to the negative (majority) samples [8]. To address this problem, many works have proposed approaches at two levels: the data level and the algorithm level.
At the algorithm level, ensemble learning has been a hot topic recently. Ensemble learning approaches such as random forest [9], XGBoost [10], and AdaBoost [11] build several weak classifiers and then integrate them into a strong classifier through a voting or averaging operation. This can effectively compensate for the drawbacks of a single classifier on imbalanced datasets and improve classification precision. There are also many works that use deep learning models to process imbalanced datasets, such as CNNs [12] and DBNs [13]. However, training deep networks is generally time-consuming.
The data level mainly includes oversampling, undersampling, and hybrid sampling methods. The core idea is to strengthen the importance of the positive samples or reduce the impact of the negative samples, so that the positive class carries the same weight as the negative class during classifier training.
Oversampling has been widely used because it retains the original information of the dataset and is easy to implement [14]. For example, CIR [15] synthesizes new positive samples around the centroid of all attributes of the positive samples to create symmetry between the defect and non-defect records in imbalanced datasets, and FL-NIDS [16] overcomes the imbalanced data problem and was evaluated on three benchmark datasets.

Our proposed method extracts the category and distribution information contained in the positive class P from two directions: the macro distribution characteristics of P and the micro location of each positive sample in the feature space.
The positive samples are scattered in space, and the distance between any two positive samples varies over a large range. Therefore, we first cluster P into K clusters with the k-means algorithm, as shown in Figure 1b, where K = 3. Denote the number of samples in cluster $k$ as $n_k$. In general, different clusters contain different numbers of samples, which indicates that the spatial distribution of the samples has density characteristics. For every cluster, we define a new concept, the cluster ratio $R_k$:

$$R_k = \frac{n_k}{|P|}, \qquad (1)$$

where $|P|$ is the number of samples in P. For example, the cluster ratios of the three clusters shown in Figure 1b are 0.333, 0.083, and 0.583, respectively. $R_k$ quantitatively reflects the macro distribution characteristics of P.
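The cluster-ratio step can be sketched as follows; the toy data, the value of K, and the use of scikit-learn's KMeans are illustrative assumptions, not choices prescribed above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_ratios(P, K, seed=0):
    """Cluster P into K clusters and return the per-sample cluster
    ratio R_k = n_k / |P| together with the cluster labels."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(P)
    counts = np.bincount(labels, minlength=K)   # n_k for every cluster
    R = counts / len(P)                         # R_k = n_k / |P|
    return R[labels], labels

# Toy positive class: one dense blob, one medium blob, and a loner
rng = np.random.default_rng(0)
P = np.vstack([rng.normal(0.0, 0.1, (8, 2)),
               rng.normal(3.0, 0.1, (3, 2)),
               rng.normal(6.0, 0.1, (1, 2))])
R_i, labels = cluster_ratios(P, K=3)
# cluster ratios here are 8/12, 3/12, and 1/12
```

Samples in a dense cluster receive a large cluster ratio, while a loner receives a small one.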
For each positive sample, its location generally should be considered relative to other samples, both positive and negative. Therefore, nearest neighbors are naturally involved. For every sample $x_i$ of P, we calculate the distance $d_i^N$ between $x_i$ and its nearest negative neighbor. The hypersphere with center $x_i$ and radius $d_i^N$ is called the Type-N safe neighborhood of $x_i$, shown as the blue disk in Figure 1c. Similarly, we calculate the distance $d_i^P$ between $x_i$ and its nearest positive neighbor. The hypersphere with center $x_i$ and radius $d_i^P$ is called the Type-P safe neighborhood of $x_i$, shown as the yellow disk in Figure 1c.
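A minimal sketch of the two safe-neighborhood radii, assuming Euclidean distance and NumPy arrays P and N holding the positive and negative samples:

```python
import numpy as np

def safe_radii(P, N):
    """For every positive sample x_i, return d_i^N (distance to its
    nearest negative neighbor) and d_i^P (distance to its nearest
    positive neighbor other than itself)."""
    d_PN = np.linalg.norm(P[:, None, :] - N[None, :, :], axis=2)
    d_PP = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(d_PP, np.inf)      # exclude x_i itself
    return d_PN.min(axis=1), d_PP.min(axis=1)

P = np.array([[0.0, 0.0], [1.0, 0.0]])
N = np.array([[0.0, 3.0], [5.0, 0.0]])
dN, dP = safe_radii(P, N)
# dN = [3.0, sqrt(10)], dP = [1.0, 1.0]
```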
If $x_i$ is easily classified correctly, such as $x_1$ in Figure 1c, the union of its Type-N and Type-P safe neighborhoods contains several positive samples, but only one negative sample. If $x_i$ is a noisy point, i.e., it is located in the interior of the negative class N, such as $x_0$ in Figure 1c, the union of its two types of safe neighborhoods contains many negative samples, but only one positive sample. If $x_i$ is located near the class boundary, such as $x_2$ in Figure 1c, the union of its two types of safe neighborhoods contains similar numbers of positive and negative samples.

Based on this characteristic, we present another definition for every $x_i \in P$, named the safety degree:

$$S_i = \frac{|a_i - b_i|}{a_i + b_i}, \qquad (2)$$

where $a_i$ and $b_i$ are the numbers of positive and negative samples, respectively, contained in the union of the two types of safe neighborhoods of $x_i$. According to the above analysis, $S_i$ is large if $x_i$ is far away from the class boundary; otherwise, it is close to zero.

Now, we define the classification contribution degree $F_i$ with the cluster ratio $R_k$ and the safety degree $S_i$ for every $x_i \in P$ as follows:

$$F_i = \frac{R_i}{S_i + A}, \qquad (3)$$

where $A$ is a correction coefficient used to prevent the denominator from being 0, and $R_i$ equals the cluster ratio $R_k$ of the cluster $k$ to which $x_i$ belongs. The classification contribution degree is directly proportional to the cluster ratio $R_i$ and inversely proportional to the safety degree $S_i$. It rests on the point that not only easily misclassified samples, but also samples belonging to a cluster containing a large number of elements, should play a major role in determining the class boundary. Therefore, the classification contribution degree quantitatively measures the degree to which a sample is a boundary sample. At the same time, it can also identify noisy points, such as $x_0$ in Figure 1c: its $R_0$ is close to zero from (1) and its $S_0$ is large from (2), so its classification contribution degree $F_0$ is almost zero.
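Putting these pieces together, the safety degree and the classification contribution degree can be sketched as below. The counting convention (closed balls, with $x_i$ itself excluded from $a_i$) and the uniform cluster ratios in the demo are assumptions for illustration:

```python
import numpy as np

def contribution_degrees(P, N, R, A=1.0):
    """Safety degree S_i = |a_i - b_i| / (a_i + b_i) and
    classification contribution degree F_i = R_i / (S_i + A)."""
    dist_N = np.linalg.norm(P[:, None, :] - N[None, :, :], axis=2)
    dist_P = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(dist_P, np.inf)
    # the union of the two safe neighborhoods is the ball of radius
    # max(d_i^N, d_i^P) around x_i
    r = np.maximum(dist_N.min(axis=1), dist_P.min(axis=1))
    a = (dist_P <= r[:, None]).sum(axis=1)   # positives in the union
    b = (dist_N <= r[:, None]).sum(axis=1)   # negatives in the union
    S = np.abs(a - b) / (a + b)              # safety degree
    return R / (S + A)                       # contribution degree

# Three clustered positives plus one boundary point near the negatives
P = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 0.0]])
N = np.array([[3.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
F = contribution_degrees(P, N, R=np.ones(len(P)))  # uniform R for simplicity
# the boundary point (2, 0) gets the largest contribution degree
```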

Oversampling Based on the Classification Contribution Degree
By normalizing the classification contribution degrees,

$$\hat{F}_i = \frac{F_i}{\sum_{j=1}^{|P|} F_j}, \qquad (4)$$

we compress them within 0-1. With $\hat{F}_i$, we can determine a suitable number of synthetic samples for each positive sample:

$$T_i = \left\lfloor \hat{F}_i \times (|N| - |P|) \right\rfloor, \qquad (5)$$

where $|N|$ is the number of negative samples, so that the total number of synthetic samples approximately balances the two classes, and $\lfloor \cdot \rfloor$ denotes rounding down. If $\hat{F}_i$ is too small, no new samples are generated from $x_i$, as for the noisy point $x_0$ shown in Figure 1c.
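The normalization and rounding step can be sketched as follows; taking G = |N| - |P| as the total number of synthetic samples is the balancing assumption stated above:

```python
import numpy as np

def sampling_counts(F, n_neg, n_pos):
    """Turn contribution degrees into per-sample sampling counts T_i."""
    F_hat = F / F.sum()                     # normalize into 0-1
    G = n_neg - n_pos                       # samples needed for balance
    return np.floor(F_hat * G).astype(int)  # T_i, rounded down

F = np.array([0.667, 0.667, 0.667, 1.0, 0.001])  # last entry: a noisy point
T = sampling_counts(F, n_neg=50, n_pos=5)
# the noisy point receives zero synthetic samples after rounding down
```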
The nearest-neighbor SMOTE is repeated $T_i$ times to generate $T_i$ new positive samples:

$$x_{\text{new}} = x_i + \delta \times (x_j - x_i), \qquad (6)$$

where $x_{\text{new}}$ is the newly synthesized sample, $x_j$ is the nearest positive sample to $x_i$, and $\delta$ is a random number in $(0, 1)$. The newly generated samples are shown as the black points in Figure 1d. In particular, for the noisy point $x_0$ shown in Figure 1c, the union of its safe neighborhoods contains many negative samples but only one positive sample, so its safety degree is large, leading to a small classification contribution degree. Its number of sampling times becomes 0 after rounding down in (5), so no samples are generated from it, which avoids noise interference. In contrast, many new samples are generated from the easily misclassified positive samples near the class boundary to balance the whole dataset. The flowchart of imbalanced learning with OS-CCD is shown in Figure 2.
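The interpolation step described above can be sketched as follows; the nearest positive neighbor is found by brute force, which is an illustrative simplification adequate for small datasets:

```python
import numpy as np

def oversample(P, T, seed=0):
    """Generate T[i] synthetic samples from each positive sample P[i]
    via x_new = x_i + delta * (x_j - x_i), with x_j the nearest
    positive neighbor of x_i and delta drawn uniformly from (0, 1)."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)            # index of x_j for every x_i
    new = []
    for i, t in enumerate(T):
        for _ in range(t):
            delta = rng.random()
            new.append(P[i] + delta * (P[nearest[i]] - P[i]))
    return np.array(new)

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = oversample(P, T=[2, 1, 0])
# every synthetic point lies on the segment between x_i and its neighbor
```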

Results and Discussion
In this section, OS-CCD is compared with six commonly used oversampling methods on twelve benchmark datasets in terms of accuracy, F1-score, AUC, and ROC. We report the average values of ten independent runs of fivefold cross-validation on the twelve datasets.
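This evaluation protocol (ten independent runs of fivefold cross-validation, averaged) might look as follows; the classifier and the synthetic dataset are placeholders, not the actual experimental setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder imbalanced dataset (roughly 9:1 class ratio)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

scores = []
for run in range(10):                        # ten independent runs
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
    scores.extend(cross_val_score(LogisticRegression(max_iter=1000),
                                  X, y, cv=cv))
mean_acc = np.mean(scores)   # value reported for one (method, dataset) pair
```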

Datasets Description and Experimental Evaluation
In this paper, twelve benchmark datasets were collected from the KEEL Tool [29] to verify the effectiveness of the proposed oversampling method. A detailed description of these datasets is given in Table 1. Their imbalance ratios (IR) vary between 5.14 and 68.1, and their sizes vary from 197 to 1484. To evaluate our method, several evaluation metrics are employed. Accuracy (Acc) [30] is the proportion of correctly classified samples in the whole dataset:

$$\text{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is the number of actual positive samples correctly identified as positive, and FN, FP, and TN are defined analogously. Accuracy has low credibility when dealing with imbalanced data, so other metrics are also employed. The F1-score [31] is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$

The ROC curve reflects the relationship between the true positive rate and the false positive rate, and the AUC [30] is the area under the ROC curve, used to evaluate model performance.
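For concreteness, the three metrics computed with scikit-learn on a small illustrative prediction (the labels and scores below are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 0])        # 3 positives, 5 negatives
y_score = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05])
y_pred  = (y_score >= 0.5).astype(int)               # threshold at 0.5

acc = accuracy_score(y_true, y_pred)   # (TP + TN) / (TP + TN + FP + FN)
f1  = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
```

Here TP = 2, TN = 4, FP = 1, FN = 1, so the accuracy is 0.75 even though one of three positives is missed, which illustrates why accuracy alone is unreliable on imbalanced data.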

Experimental Results
To visualize the sampling results of OS-CCD, we use principal component analysis (PCA) [38] to reduce the dimension of the data and then draw 84 scatter plots for the twelve datasets and seven sampling methods, as shown in Figure 3, where red and blue dots represent the original negative and positive samples, respectively, and black dots represent the newly generated samples.
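A sketch of this visualization step; the Gaussian blobs stand in for an original dataset and for a method's synthetic output, and fitting PCA on the original samples only is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, (40, 5))    # original negative samples
X_pos = rng.normal(2.0, 1.0, (10, 5))    # original positive samples
X_syn = rng.normal(2.0, 0.5, (15, 5))    # stands in for synthetic samples

# Fit the projection on the original data, then project all three groups
pca = PCA(n_components=2).fit(np.vstack([X_neg, X_pos]))
Z_neg, Z_pos, Z_syn = (pca.transform(X) for X in (X_neg, X_pos, X_syn))
# Z_* are the 2-D coordinates used for the red / blue / black scatter plots
```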
As can be seen from Figure 3, the sampling results produced by the seven methods differ on most datasets. On the new-thyroid1 dataset, SMOTE, borderline-SMOTE, NRAS, and Gaussian-SMOTE generate many new positive samples from three isolated points of the positive class, whereas OS-CCD generates only a small number of samples from them. The same happens on the datasets with IDs 2, 4, 5, 7, and 11. This indicates that OS-CCD generates only a few samples in the low-density areas of the positive class.
On most datasets, the new samples generated by OS-CCD follow the distribution characteristics of the original positive class. However, SMOTE, borderline-SMOTE, and Gaussian-SMOTE generate some samples that overlap with the negative samples, and random oversampling simply replicates every positive sample without regard to other factors.

The remainder of this section reports the experimental results. The correction coefficient A was set to one in all experiments, and the cluster number K was set to 3, 4, 4, 4, 3, 4, 12, 3, 4, 2, 6, and 3 for the twelve datasets, respectively, according to the visualization in Figure 3.

Table 3 shows the test classification accuracy of OS-CCD compared with the other methods; the best values are in bold. OS-CCD outperformed almost all the other oversampling methods on the twelve datasets when the combined classifier was SVM or MLP. When the combined classifier was LR, there were only three cases where the accuracy produced by OS-CCD was not the highest, and on those three datasets it was close to first place. When the combined classifier was DT, there were only three cases where OS-CCD outperformed the other methods, which suggests that the combination of DT and OS-CCD is not a good match.

Table 4 shows the test F1-scores of OS-CCD compared with the other methods. Similar to the accuracy results, SVM and MLP obtained the highest performance on almost all datasets balanced by OS-CCD, and the values that were not first were not far off: the F1-scores of LR with OS-CCD on the new-thyroid1 and new-thyroid2 datasets were less than one percent below the best score. The F1-score of DT with OS-CCD was the highest on only four datasets. The standard deviations of the accuracy and F1-score are also reported in Tables 3 and 4, respectively.
As shown in the tables, the standard deviations of OS-CCD were relatively low, which reflects the stability of our method.

Figure 4 shows the AUC of the 28 approaches (seven oversampling methods combined with four classifiers) on the twelve datasets. The mean and the standard deviation of the AUC of OS-CCD were better than those of the other oversampling methods combined with SVM and MLP on most datasets, except flare-F and yeast4. The reason may be that the positive samples of flare-F and yeast4 are so fragmented, as shown in Figure 3, that it is very difficult to determine a suitable K for the k-means algorithm. When the combined classifier was DT, OS-CCD could not achieve the best value, especially on the last two datasets. The reason may be that these two datasets contain too many isolated noisy samples, and OS-CCD could not extract the features of the positive class as GS did.

To evaluate the generalization ability of OS-CCD, the ROC curves of MLP and LR with four oversampling methods are plotted in Figure 5 on the ecoli3 dataset, which has a moderate data size and imbalance ratio. The black diagonal line represents random selection. For both MLP and LR, the optimal thresholds combined with OS-CCD are the closest to the upper left corner, which indicates the strong generalization ability of OS-CCD.

Conclusions
In this paper, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. We first cluster the positive samples into K clusters with the k-means algorithm and obtain the cluster ratio for each positive sample. Secondly, we compute the safety degree based on two types of safe neighborhoods for each positive sample. Then, we define the classification contribution degree to determine the number of synthetic samples generated by SMOTE from each positive sample. OS-CCD effectively avoids oversampling from noisy points and strengthens the boundary information by highlighting the spatial distribution characteristics of the original positive class. The high performance of OS-CCD is substantiated in terms of accuracy, F1-score, AUC, and ROC on twelve commonly used datasets. Further investigations may include generalizing the classification contribution degree to all samples and extending the results to ensemble classifiers.