Article

A New Oversampling Method Based on the Classification Contribution Degree

1 DUT-BSU Joint Institute, Dalian University of Technology, Dalian 116024, China
2 School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(2), 194; https://doi.org/10.3390/sym13020194
Submission received: 5 January 2021 / Revised: 17 January 2021 / Accepted: 24 January 2021 / Published: 26 January 2021
(This article belongs to the Section Computer)

Abstract

Data imbalance is a thorny issue in machine learning. SMOTE is a well-known oversampling method for imbalanced learning. However, it has some disadvantages, such as sample overlapping, noise interference, and blindness in neighbor selection. To address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of the original samples on the class boundary and avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.

1. Introduction

Currently, imbalanced learning for classification problems has attracted more and more attention in machine learning research [1,2]. In imbalanced datasets, the positive (minority) samples are of great importance in applications such as credit scoring in the financial sector [3], fault detection in mechanical maintenance [4], abnormal behavior detection in a crowd [5], cancer detection [6], and other medical fields [7]. However, traditional classifiers or learning algorithms attach too much importance to the negative (majority) samples [8]. To address this problem, many studies have proposed approaches at two levels: the data level and the algorithm level.
At the algorithm level, ensemble learning has been a hot topic recently. Ensemble learning approaches such as random forest [9], XGBoost [10], and AdaBoost [11] build several weak classifiers and then integrate them into a strong classifier through a voting or averaging operation. This can effectively compensate for the drawback of a single classifier on imbalanced datasets and improve classification precision. Many studies also use deep learning models such as CNNs [12] and DBNs [13] to process imbalanced datasets. However, training deep networks is generally time-consuming.
The data level mainly includes oversampling, undersampling, and hybrid sampling methods. The core idea is to strengthen the importance of the positive samples or reduce the impact of negative samples, so that the positive class has the same importance as the negative class in the classifier training.
Oversampling has been widely used because it retains the original information of the dataset and is easy to operate [14]. For example, CIR [15] synthesizes new positive samples based on the centroid of all attributes of the positive samples to create symmetry between the defect and non-defect records in imbalanced datasets. FL-NIDS [16] overcomes the imbalanced data problem and is applied to three benchmark intrusion detection datasets that suffer from imbalanced distributions. OS-ELM [17] uses an oversampling technique when identifying axle box bearing faults of high-speed electric multiple units.
The synthetic minority over-sampling technique (SMOTE) [18] is a well-known oversampling method. However, SMOTE has three disadvantages: (1) it oversamples uninformative samples [19]; (2) it oversamples noisy samples; and (3) it is difficult to determine the number of nearest neighbors, and the selection of the nearest neighbor used to synthesize each new sample is largely blind.
To address these three problems, many strategies have been presented in the literature. Borderline-SMOTE [20] focuses on samples within a selected area to strengthen the class boundary information. RCSMOTE [19] controls the range of synthetic instances. K-means SMOTE [21] combines a clustering algorithm with SMOTE to deal with overlapping samples of different classes. G-SMOTE [22] generates synthetic samples in a given geometric region. LN-SMOTE [23] considers the local neighborhood information of samples. DBSMOTE [24] applies the density-based DBSCAN algorithm before sampling. Maldonado et al. [25] proposed an alternative SMOTE strategy for high-dimensional datasets. Gaussian-SMOTE [26] combines a Gaussian distribution with SMOTE based on the probability density of the dataset. These algorithms improve SMOTE to different extents, but they neither extract nor make full use of the distribution information of the samples, nor do they modify the random choice of the nearest neighbors.
In a binary classification problem, samples that are far away from the class boundary are easily classified. They, especially the noisy samples, generally contribute little to the sampling. Noisy samples produced by statistical errors or other causes often occur in datasets and may convey misleading information. Therefore, noisy samples can be regarded as useless information and should be ignored. From this point of view, we focus on strengthening the information of positive samples near the decision boundary.
In addition, samples often appear in clusters, and the distance between two clusters is much larger than the distance between two samples within one cluster. From this point of view, we can use clustering methods to extract the density characteristics of the distribution of positive samples as macro information.
Overall, the boundary information can be regarded as inter-class information, and the distribution features of clusters within one class can be regarded as intra-class information. Motivated by the above discussion, we present a new oversampling method, OS-CCD, based on the concept of the classification contribution degree. The classification contribution degree not only combines the inter-class information in the dataset with the intra-class information, but also ignores the useless information.
A key problem when using SMOTE is determining how many synthetic samples to generate for each sample, and there has been much research on this issue. ADASYN [27] determines the number of samples to synthesize for each positive sample from the ratio of negative samples among its K nearest neighbors. As a new technique, the classification contribution degree determines the exact number of synthetic samples for every positive sample.
Moreover, SMOTE randomly selects one of the nearest neighbors of a positive sample for synthesizing a new sample [28]. Guided by the classification contribution degree, we instead select only the nearest neighbor rather than a random one. This highlights the spatial distribution characteristics of the positive class.
The remainder of this paper is organized as follows. Section 2 presents the details of OS-CCD including the definition of the cluster ratio, safe neighborhood, safety degree, and the classification contribution degree. Section 3 demonstrates the experimental process and results. Section 4 concludes the work of this paper.

2. Methodology

2.1. Safe Neighborhood and Classification Contribution Degree

Consider an imbalanced dataset containing two classes, the positive class P and the negative class N, as shown in Figure 1a. To generate new positive samples, we try to extract the category and distribution information contained in P from two directions: the macro distribution characteristics of P and the micro location of each positive sample in the feature space.
The positive samples are scattered in space, and the distance between any two positive samples varies over a large range. Therefore, we first cluster P into K clusters with the k-means algorithm, as shown in Figure 1b, where $K = 3$. Denote the number of samples in cluster k as $n_k$. Generally, different clusters contain different numbers of samples, which indicates that the distribution of samples in space has density characteristics.
For every cluster, we define a new concept, the cluster ratio $R_k$, i.e.,

$$R_k = \frac{n_k}{|P|}, \quad k = 1, 2, \ldots, K, \qquad (1)$$

where $|P|$ is the number of samples in P. For example, the cluster ratios of the three clusters shown in Figure 1b are 0.333, 0.083, and 0.583, respectively. $R_k$ can quantitatively reflect the macro distribution characteristics of P.
For each positive sample, its location should generally be considered relative to other samples, both positive and negative. Therefore, nearest neighbors are naturally involved. For every sample $x_i$ of P, we calculate the distance $d_i^N$ between $x_i$ and its nearest negative neighbor. The hypersphere centered at $x_i$ with radius $d_i^N$ is called the Type-N safe neighborhood of $x_i$, shown as the blue disk in Figure 1c. Similarly, we calculate the distance $d_i^P$ between $x_i$ and its nearest positive neighbor. The hypersphere centered at $x_i$ with radius $d_i^P$ is called the Type-P safe neighborhood of $x_i$, shown as the yellow disk in Figure 1c.
If $x_i$ is easily classified correctly, such as $x_1$ in Figure 1c, the union of its Type-N and Type-P safe neighborhoods contains some positive samples but only one negative sample; if $x_i$ is a noisy point located in the interior of N, such as $x_0$ in Figure 1c, the union of its two types of safe neighborhoods contains many negative samples but only one positive sample; if $x_i$ is located near the class boundary, such as $x_2$ in Figure 1c, the union of its two types of safe neighborhoods contains similar numbers of positive and negative samples. Based on this characteristic, we present another definition for every $x_i \in P$, named the safety degree,

$$S_i = (a_i - b_i)^2, \qquad (2)$$

where $a_i$ and $b_i$ are the numbers of positive and negative samples, respectively, contained in the union of the two types of safe neighborhoods of $x_i$. According to the above analysis, $S_i$ is large if $x_i$ is far away from the class boundary; otherwise, it is close to zero.
Now, we define the classification contribution degree $F_i$ with the cluster ratio $R_k$ and the safety degree $S_i$ for every $x_i \in P$ as follows:

$$F_i = \frac{R_i}{S_i + A}, \qquad (3)$$

where A is a correction coefficient used to prevent the denominator from being 0, and $R_i$ equals the cluster ratio $R_k$ of the cluster k to which $x_i$ belongs.
The classification contribution degree is directly proportional to the cluster ratio $R_i$ and inversely proportional to the safety degree $S_i$. It is based on the point that not only easily misclassified samples but also samples belonging to a cluster containing a large number of elements should play a major role in determining the class boundary. Therefore, the classification contribution degree is a quantitative measure of the degree to which a sample is a boundary sample. At the same time, it can also identify noisy points, such as $x_0$ shown in Figure 1c: its $R_0$ is close to zero from (1) and its $S_0$ is large from (2), so its classification contribution degree $F_0$ is almost zero.
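To make the construction above concrete, the following minimal sketch computes the cluster ratio, safety degree, and classification contribution degree for every positive sample. It assumes Euclidean distances, NumPy arrays, and scikit-learn's KMeans and pairwise_distances; the function and variable names are ours for illustration and are not taken from the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def classification_contribution_degree(X_pos, X_neg, K=3, A=1.0):
    """Classification contribution degree F_i of every positive sample (Equations (1)-(3))."""
    # Cluster ratio R_i: share of positive samples in the k-means cluster that x_i belongs to (Equation (1)).
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X_pos)
    R = np.bincount(labels, minlength=K)[labels] / len(X_pos)

    # Radii of the Type-P and Type-N safe neighborhoods of each x_i.
    d_pos = pairwise_distances(X_pos, X_pos)
    np.fill_diagonal(d_pos, np.inf)                         # exclude x_i itself
    d_neg = pairwise_distances(X_pos, X_neg)
    r = np.maximum(d_pos.min(axis=1), d_neg.min(axis=1))    # union of the two concentric hyperspheres

    # Safety degree S_i = (a_i - b_i)^2 over the union of the safe neighborhoods (Equation (2)).
    a = (d_pos <= r[:, None]).sum(axis=1)                   # positive samples inside the union
    b = (d_neg <= r[:, None]).sum(axis=1)                   # negative samples inside the union
    S = (a - b) ** 2

    return R / (S + A)                                      # Equation (3)
```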

2.2. Oversampling Based on the Classification Contribution Degree

By normalizing the classification contribution degrees as follows:

$$F_i \leftarrow \frac{F_i}{\sum_{j=1}^{|P|} F_j}, \quad i = 1, 2, \ldots, |P|, \qquad (4)$$

we compress the classification contribution degrees into the range 0–1. With $F_i$, we can determine a suitable number of synthetic samples for each positive sample as follows:

$$T_i = \left\lfloor F_i \cdot (|N| - |P|) \right\rfloor. \qquad (5)$$

The sign $\lfloor \cdot \rfloor$ denotes rounding down. If $F_i$ is too small, no new samples are generated from $x_i$, as for the noisy point $x_0$ shown in Figure 1c.
The nearest neighbor SMOTE is repeated $T_i$ times to generate $T_i$ new positive samples as follows:

$$x_{\mathrm{new}} = x_i + (x_j - x_i)\,\delta, \qquad (6)$$

where $x_{\mathrm{new}}$ is the newly synthesized sample, $x_j$ is the nearest positive sample to $x_i$, and $\delta$ is a random number in $(0, 1)$.
The newly generated samples are shown as the black points in Figure 1d. In particular, for the noisy point $x_0$ shown in Figure 1c, the nearest positive sample is much farther away than the surrounding negative samples, so its Type-P safe neighborhood contains many negative samples and its safety degree is large, leading to a small classification contribution degree. The number of sampling times becomes 0 after rounding down in (5), so no samples are generated from it, which avoids noise interference. For the easily misclassified positive samples, by contrast, many new samples are generated to balance the whole dataset. The flowchart of imbalanced learning with OS-CCD is shown in Figure 2.
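A minimal sketch of the oversampling step itself is given below, under the same assumptions as the previous sketch and reusing the classification_contribution_degree function defined there; os_ccd_oversample is an illustrative name of ours, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def os_ccd_oversample(X_pos, X_neg, K=3, A=1.0, random_state=0):
    """Generate T_i synthetic samples from every positive sample (Equations (4)-(6))."""
    rng = np.random.default_rng(random_state)

    F = classification_contribution_degree(X_pos, X_neg, K=K, A=A)
    F = F / F.sum()                                            # normalization (Equation (4))
    T = np.floor(F * (len(X_neg) - len(X_pos))).astype(int)    # samples per x_i (Equation (5))

    # Nearest positive neighbor x_j of every x_i (deterministic, not a random neighbor).
    d_pos = pairwise_distances(X_pos, X_pos)
    np.fill_diagonal(d_pos, np.inf)
    nearest = d_pos.argmin(axis=1)

    synthetic = []
    for i, t in enumerate(T):                                  # noisy points get t = 0
        x_i, x_j = X_pos[i], X_pos[nearest[i]]
        for _ in range(t):
            synthetic.append(x_i + (x_j - x_i) * rng.random()) # interpolation (Equation (6))
    X_syn = np.array(synthetic).reshape(-1, X_pos.shape[1])

    # Rebalanced training set: original positives + synthetic positives + negatives.
    X = np.vstack([X_pos, X_syn, X_neg])
    y = np.hstack([np.ones(len(X_pos) + len(X_syn)), np.zeros(len(X_neg))])
    return X, y
```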

3. Results and Discussion

In this section, OS-CCD is compared with six commonly used oversampling methods on twelve benchmark datasets in terms of accuracy, F1-score, AUC, and ROC. The reported values are averages over ten independent runs of fivefold cross-validation on the twelve datasets.

3.1. Datasets Description and Experimental Evaluation

In this paper, twelve benchmark datasets were collected from the KEEL tool [29] to verify the effectiveness of the proposed oversampling method. A detailed description of these datasets is given in Table 1. The imbalance ratio (IR) varies between 5.14 and 68.1, and the dataset sizes vary from 197 to 1484.
To evaluate our method, several evaluation metrics are employed. Accuracy (Acc) [30] is the proportion of correctly classified samples in the whole dataset:
$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}, \qquad (7)$$
where TP is the number of actual positive samples correctly identified as positive, and FN, FP, and TN are understood similarly. Accuracy has low credibility when dealing with imbalanced data, so other metrics are also employed. The F1-score [31] is the harmonic mean of precision and recall, used to evaluate machine learning models on rebalanced data tasks:

$$F_1\text{-}\mathrm{score} = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}. \qquad (8)$$

The ROC curve reflects the relationship between the true and false positive rates, and the AUC [30] is the area under the ROC curve, used to evaluate model performance.
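As a sketch of how these metrics can be computed in practice, the helper below assumes a fitted scikit-learn-style binary classifier with the positive class labeled 1; the function name evaluate is ours, not part of the paper.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

def evaluate(clf, X_test, y_test):
    """Acc, F1-score, AUC, and ROC points for a fitted binary classifier (positive class = 1)."""
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    acc = accuracy_score(y_test, y_pred)        # Equation (7)
    f1 = f1_score(y_test, y_pred)               # Equation (8)
    auc = roc_auc_score(y_test, y_score)        # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_score)    # points of the ROC curve
    return acc, f1, auc, (fpr, tpr)
```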

3.2. Experimental Method

To test the effectiveness and robustness of OS-CCD, six commonly used classical oversampling methods are taken as baselines: random oversampling (RO) [32], SMOTE [18], borderline-SMOTE (BS) [20], k-means SMOTE (KS) [21], NRAS [33], and Gaussian-SMOTE (GS) [34]. Including OS-CCD, the seven oversampling methods are combined with four classifiers: support vector machine (SVM) [35], logistic regression (LR) [36], decision tree (DT) [36], and the multi-layer perceptron (MLP) neural network [37]. The resulting 28 approaches are listed in Table 2.
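The following sketch illustrates this evaluation protocol for one oversampler-classifier pair (here OS-CCD with SVM), assuming scikit-learn, labels in {0, 1} with the positive class as 1, and reusing the os_ccd_oversample and evaluate sketches above; oversampling is applied to the training folds only, and the classifier settings shown are library defaults rather than the ones used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def run_protocol(X, y, n_runs=10, n_splits=5):
    """Ten independent runs of fivefold CV; oversampling is applied to the training folds only."""
    results = []
    for run in range(n_runs):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):
            X_tr, y_tr = X[train_idx], y[train_idx]
            # Balance the training fold with OS-CCD (positive class labeled 1).
            X_bal, y_bal = os_ccd_oversample(X_tr[y_tr == 1], X_tr[y_tr == 0])
            clf = SVC(probability=True).fit(X_bal, y_bal)   # e.g., the SVM-based approach
            acc, f1, auc, _ = evaluate(clf, X[test_idx], y[test_idx])
            results.append((acc, f1, auc))
    return np.mean(results, axis=0), np.std(results, axis=0)
```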

3.3. Experimental Results

To visualize the sampling results of OS-CCD, we use principal component analysis (PCA) [38] to reduce the dimension of the data and then draw 84 scatter plots for the twelve datasets and seven sampling methods, as shown in Figure 3, where red and blue dots represent the original negative and positive samples, respectively, and black dots represent the newly generated samples.
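A sketch of this visualization step is given below, assuming NumPy, Matplotlib, and scikit-learn's PCA; fitting PCA on the original samples only and then projecting the synthetic samples onto the same two components is our assumption about how such plots are typically produced, not a description of the authors' exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_oversampling_result(X_pos, X_neg, X_syn, title):
    """Project samples to 2D with PCA and draw one scatter plot in the style of Figure 3."""
    pca = PCA(n_components=2).fit(np.vstack([X_pos, X_neg]))   # fit on the original data only
    fig, ax = plt.subplots(figsize=(4, 4))
    for data, color, label in [(X_neg, "red", "negative"),
                               (X_pos, "blue", "positive"),
                               (X_syn, "black", "synthetic")]:
        Z = pca.transform(data)
        ax.scatter(Z[:, 0], Z[:, 1], c=color, s=8, label=label)
    ax.set_title(title)
    ax.legend()
    return fig
```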
As can be seen from Figure 3, the sampling results produced by the seven methods differ on most datasets. On the new-thyroid1 dataset, SMOTE, borderline-SMOTE, NRAS, and Gaussian-SMOTE generate many new positive samples from three isolated points of the positive class, whereas OS-CCD generates only a small number of samples from them. The same happens on the datasets with IDs 2, 4, 5, 7, and 11. This indicates that OS-CCD oversamples only a small number of samples in the low-density areas of the positive class.
On most datasets, the new samples generated by OS-CCD follow the distribution characteristics of the original positive class. In contrast, borderline-SMOTE, SMOTE, and Gaussian-SMOTE generate some samples that overlap with the negative samples, and random oversampling simply replicates every positive sample without regard to other factors.
The remainder of the paper reports the experimental results. The correction coefficient A was set to one in all experiments, and the cluster number K was set to 3, 4, 4, 4, 3, 4, 12, 3, 4, 2, 6, and 3 for the twelve datasets, respectively, according to the visualization in Figure 3.
Table 3 shows the test classification accuracy of OS-CCD compared with the other methods. The best values are in bold. OS-CCD outperformed almost all the other oversampling methods on the twelve datasets when the combined classifier was SVM or MLP. When the combined classifier was LR, there were only three cases where OS-CCD did not produce the highest accuracy, and on those three datasets its accuracy was close to first place. When the combined classifier was DT, OS-CCD outperformed the other methods in only three cases, which suggests that DT does not combine well with OS-CCD.
Table 4 shows the test F1-scores of OS-CCD compared with the other methods. Similar to the accuracy results, SVM and MLP obtained the highest performance on almost all datasets balanced by OS-CCD. Although some values were not the highest, they were close; the F1-scores of LR with OS-CCD on the new-thyroid1 and new-thyroid2 datasets were less than one percentage point below the best scores. The F1-score of DT with OS-CCD was the highest on only four datasets.
At the same time, the standard deviations of the accuracy and F1-score are also reported in Table 3 and Table 4, respectively. As shown in the tables, the standard deviation of OS-CCD was relatively low, which reflects the stability of our method.
Figure 4 shows the AUC of the 28 approaches on the 12 datasets. The mean and the standard deviation of the AUC of OS-CCD were better than those of the other oversampling methods combined with SVM and MLP on most datasets, except flare-F and yeast4. The reason may be that the positive samples of flare-F and yeast4 are so fragmented, as shown in Figure 3, that it is very difficult to determine a suitable K for the k-means algorithm. When the combined classifier was DT, OS-CCD could not achieve the best value, especially on the last two datasets. The reason may be that these two datasets contain too many isolated noisy samples, and OS-CCD could not extract the features of the positive class as well as GS did.
To evaluate the generalization ability of OS-CCD, the ROC curves of MLP and LR with the four oversampling methods are plotted in Figure 5 on the ecoli3 dataset, which has a moderate data size and imbalance ratio. The black diagonal line represents random selection. For both MLP and LR, the optimal thresholds obtained with OS-CCD are the closest to the upper left corner. This indicates the strong generalization ability of OS-CCD.

4. Conclusions

In this paper, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. We first cluster the positive samples into K clusters with the k-means algorithm and obtain the cluster ratio for each positive sample. Secondly, we compute the safety degree based on two types of safe neighborhoods for each positive sample. Then, we define the classification contribution degree to determine the number of synthetic samples generated by SMOTE from each positive sample. OS-CCD effectively avoids oversampling from noisy points and strengthens the boundary information by highlighting the spatial distribution characteristics of the original positive class. The high performance of OS-CCD is substantiated in terms of accuracy, F1-score, AUC, and ROC on twelve commonly used datasets. Further investigations may include generalizing the classification contribution degree to all samples and extending the results to ensemble classifiers.

Author Contributions

Code, writing, original draft, translation, visualization, and formal analysis, Z.J.; discussion of the results, T.P.; supervision and resources, J.Y.; checking the code of this paper, C.Z. All authors read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0100300 and in part by the National Natural Science Foundation of China under Grant 11201051.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found in a publicly accessible repository that does not issue DOIs: [http://keel.es].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kovács, G. Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing 2019, 366, 352–354. [Google Scholar] [CrossRef]
  2. Kovács, G. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl. Soft Comput. 2019, 83, 105662. [Google Scholar] [CrossRef]
  3. Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring datasets. Expert Syst. Appl. 2012, 39, 3446–3453. [Google Scholar] [CrossRef] [Green Version]
  4. Samanta, B.; Al-Balushi, K.R.; Al-Araimi, S.A. Artificial neural networks and support vector machines with genetic algorithm for bearing fault detection. Eng. Appl. Artif. Intell. 2003, 16, 657–665. [Google Scholar] [CrossRef]
  5. Xie, S.; Zhang, X.; Cai, J. Video crowd detection and abnormal behavior model detection based on machine learning method. Neural. Comput. Appl. 2019, 31, 175–184. [Google Scholar] [CrossRef]
  6. Kalwa, U.; Legner, C.; Kong, T.; Pandey, S. Skin cancer diagnostics with an all-inclusive smartphone application. Symmetry 2019, 11, 790. [Google Scholar] [CrossRef] [Green Version]
  7. Le, T.; Baik, S.W. A robust framework for self-care problem identification for children with disability. Symmetry 2019, 11, 89. [Google Scholar] [CrossRef] [Green Version]
  8. Kang, Q.; Fan, Q.W.; Zurada, J.M. Deterministic convergence analysis via smoothing group Lasso regularization and adaptive momentum for Sigma-Pi Sigma neural network. Inf. Sci. 2021, 553, 66–82. [Google Scholar] [CrossRef]
  9. Díaz-Uriarte, R.; De Andres, S.A. Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7, 3. [Google Scholar] [CrossRef] [Green Version]
  10. Wang, C.; Deng, C.; Wang, S. Imbalance-XGBoost: Leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost. Pattern Recognit. Lett. 2020, 136, 190–197. [Google Scholar] [CrossRef]
  11. Thanathamathee, P.; Lursinsap, C. Handling imbalanced datasets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques. Pattern Recognit. Lett. 2013, 34, 1339–1347. [Google Scholar] [CrossRef]
  12. Kvamme, H.; Sellereite, N.; Aas, K.; Sjursen, S. Predicting mortgage default using convolutional neural networks. Expert Syst. Appl. 2018, 102, 207–217. [Google Scholar] [CrossRef] [Green Version]
  13. Yu, L.; Zhou, R.; Tang, L.; Chen, R. A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data. Appl. Soft Comput. 2018, 69, 192–202. [Google Scholar] [CrossRef]
  14. Elreedy, D.; Atiya, A.F. A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf. Sci. 2019, 505, 32–64. [Google Scholar] [CrossRef]
  15. Bejjanki, K.K.; Gyani, J.; Gugulothu, N. Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance. Symmetry 2020, 12, 407. [Google Scholar] [CrossRef] [Green Version]
  16. Mulyanto, M.; Faisal, M.; Prakosa, S.W.; Leu, J.-S. Effectiveness of Focal Loss for Minority Classification in Network Intrusion Detection Systems. Symmetry 2021, 13, 4. [Google Scholar] [CrossRef]
  17. Hao, W.; Liu, F. Imbalanced Data Fault Diagnosis Based on an Evolutionary Online Sequential Extreme Learning Machine. Symmetry 2020, 12, 1204. [Google Scholar] [CrossRef]
  18. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  19. Soltanzadeh, P.; Hashemzadeh, M. RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf. Sci. 2020, 542, 92–111. [Google Scholar] [CrossRef]
  20. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced datasets learning. In Proceedings of the International Conference on Intelligent Computing (ICIC), Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  21. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef] [Green Version]
  22. Douzas, G.; Bacao, F. Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Inf. Sci. 2019, 501, 118–135. [Google Scholar] [CrossRef]
  23. Maciejewski, T.; Stefanowski, J. Local neighbourhood extension of SMOTE for mining imbalanced data. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Paris, France, 11–15 April 2011; pp. 104–111. [Google Scholar]
  24. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. DBSMOTE: Density-based synthetic minority over-sampling technique. Appl. Intell. 2012, 36, 664–684. [Google Scholar] [CrossRef]
  25. Maldonado, S.; López, J.; Vairetti, C. An alternative SMOTE oversampling strategy for high-dimensional datasets. Appl. Soft Comput. 2019, 76, 380–389. [Google Scholar] [CrossRef]
  26. Pan, T.; Zhao, J.; Wu, W.; Yang, J. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf. Sci. 2020, 512, 1214–1233. [Google Scholar] [CrossRef]
  27. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  28. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  29. Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  30. Huang, J.; Ling, C.X. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310. [Google Scholar] [CrossRef] [Green Version]
  31. Al-Azani, S.; El-Alfy, E.S.M. Using Word Embedding and Ensemble Learning for Highly Imbalanced Data Sentiment Analysis in Short Arabic Text. In Proceedings of the International Conference on Ambient Systems, Networks and Technologies and International Conference on Sustainable Energy Information Technology (ANT/SEIT), Madeira, Portugal, 16–19 May 2017; pp. 359–366. [Google Scholar]
  32. Liu, A.; Ghosh, J.; Martin, C.E. Generative Oversampling for Mining Imbalanced Datasets. In Proceedings of the International Conference on Data Mining (DMIN), Las Vegas, NV, USA, 25–28 June 2007; pp. 66–72. [Google Scholar]
  33. Rivera, W.A. Noise reduction a priori synthetic over-sampling for class imbalanced datasets. Inf. Sci. 2017, 408, 146–161. [Google Scholar] [CrossRef]
  34. Lee, H.; Kim, J.; Kim, S. Gaussian-Based SMOTE Algorithm for Solving Skewed Class Distributions. Int. J. Fuzzy Log. Intell. Syst. 2017, 17, 229–234. [Google Scholar] [CrossRef]
  35. Kang, Q.; Shi, L.; Zhou, M.; Wang, X.; Wu, Q.; Wei, Z. A distance-based weighted undersampling scheme for support vector machines and its application to imbalanced classification. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4152–4165. [Google Scholar] [CrossRef]
  36. Nie, G.; Rowe, W.; Zhang, L.; Tian, Y.; Shi, Y. Credit card churn forecasting by logistic regression and decision tree. Expert Syst. Appl. 2011, 38, 15273–15285. [Google Scholar] [CrossRef]
  37. Oh, S.H. Error back-propagation algorithm for classification of imbalanced data. Neurocomputing 2011, 74, 1058–1061. [Google Scholar] [CrossRef]
  38. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. [Google Scholar] [CrossRef]
Figure 1. Diagram of oversampling with OS-classification contribution degree (CCD). (a) The original dataset in the first panel. (b) Positive samples are clustered in the second panel. (c) Type-N and Type-P safe neighborhoods in the third panel. (d) Synthesized samples by OS-CCD in the fourth panel.
Figure 2. Flowchart of imbalanced learning with OS-CCD.
Figure 3. The oversampling results of the seven methods on the twelve datasets.
Figure 4. Histogram of the AUC of the 28 approaches on the 12 datasets.
Figure 5. The ROC on the ecoli3 dataset.
Table 1. Details of the twelve datasets for comparison. IR, imbalance ratio.

ID | Dataset | Positive | Negative | Attributes | IR
1 | new-thyroid1 | 35 | 180 | 5 | 5.14
2 | new-thyroid2 | 35 | 180 | 5 | 5.14
3 | ecoli3 | 35 | 301 | 7 | 8.6
4 | yeast-1_vs_7 | 30 | 429 | 7 | 14.3
5 | glass4 | 13 | 201 | 9 | 15.46
6 | glass5 | 9 | 205 | 9 | 22.78
7 | flare-F | 43 | 1023 | 11 | 23.79
8 | ecoli-0-1-3-7-vs-2-6 | 7 | 190 | 7 | 27.14
9 | yeast4 | 51 | 1433 | 8 | 28.1
10 | abalone-3-vs-11 | 15 | 487 | 8 | 32.47
11 | winequality-red-8vs6 | 18 | 638 | 11 | 35.44
12 | winequality-red-3vs5 | 10 | 681 | 11 | 68.1
Table 2. The combination of the 28 approaches of four classifiers and seven oversampling methods.

SVM group:
RS: Random oversampling + SVM
SS: SMOTE + SVM
BS: Borderline-SMOTE + SVM
KMS: k-means-SMOTE + SVM
NS: NRAS + SVM
GS: Gaussian-SMOTE + SVM
OS: OS-CCD + SVM

LR group:
RL: Random oversampling + LR
SL: SMOTE + LR
BL: Borderline-SMOTE + LR
KML: k-means-SMOTE + LR
NL: NRAS + LR
GL: Gaussian-SMOTE + LR
OL: OS-CCD + LR

DT group:
RD: Random oversampling + DT
SD: SMOTE + DT
BD: Borderline-SMOTE + DT
KMD: k-means-SMOTE + DT
ND: NRAS + DT
GD: Gaussian-SMOTE + DT
OD: OS-CCD + DT

MLP group:
RM: Random oversampling + MLP
SM: SMOTE + MLP
BM: Borderline-SMOTE + MLP
KMM: k-means-SMOTE + MLP
NM: NRAS + MLP
GM: Gaussian-SMOTE + MLP
OM: OS-CCD + MLP
Table 3. The classification test Acc (%) of 28 approaches on 12 datasets.

ID | Classifier | RO | SMOTE | BS | KS | NRAS | GS | OS-CCD
1 | SVM | 0.9586 ± 0.01 | 0.9540 ± 0.01 | 0.9107 ± 0.02 | 0.9256 ± 0.01 | 0.9563 ± 0.02 | 0.9349 ± 0.08 | 0.9619 ± 0.01
1 | LR | 0.9884 ± 0.00 | 0.9870 ± 0.00 | 0.9856 ± 0.00 | 0.9874 ± 0.00 | 0.9823 ± 0.00 | 0.9660 ± 0.01 | 0.9856 ± 0.00
1 | DT | 0.9693 ± 0.01 | 0.9730 ± 0.01 | 0.9740 ± 0.01 | 0.9702 ± 0.01 | 0.9577 ± 0.02 | 0.9381 ± 0.01 | 0.9805 ± 0.00
1 | MLP | 0.9888 ± 0.00 | 0.9647 ± 0.06 | 0.9670 ± 0.01 | 0.9940 ± 0.00 | 0.9833 ± 0.00 | 0.9879 ± 0.00 | 0.9953 ± 0.00
2 | SVM | 0.9544 ± 0.02 | 0.9502 ± 0.01 | 0.9079 ± 0.02 | 0.9256 ± 0.01 | 0.9693 ± 0.02 | 0.9716 ± 0.00 | 0.9865 ± 0.00
2 | LR | 0.9921 ± 0.01 | 0.9898 ± 0.00 | 0.9893 ± 0.00 | 0.9893 ± 0.00 | 0.9795 ± 0.00 | 0.9670 ± 0.01 | 0.9907 ± 0.00
2 | DT | 0.9651 ± 0.01 | 0.9712 ± 0.01 | 0.9735 ± 0.01 | 0.9674 ± 0.01 | 0.9619 ± 0.01 | 0.9493 ± 0.01 | 0.9819 ± 0.01
2 | MLP | 0.9791 ± 0.01 | 0.9819 ± 0.01 | 0.9740 ± 0.01 | 0.9944 ± 0.00 | 0.9795 ± 0.00 | 0.9865 ± 0.00 | 0.9953 ± 0.00
3 | SVM | 0.8836 ± 0.01 | 0.8962 ± 0.01 | 0.8944 ± 0.01 | 0.9015 ± 0.01 | 0.8506 ± 0.00 | 0.8229 ± 0.00 | 0.9155 ± 0.01
3 | LR | 0.8516 ± 0.01 | 0.8614 ± 0.00 | 0.8397 ± 0.00 | 0.8569 ± 0.01 | 0.8515 ± 0.00 | 0.8345 ± 0.01 | 0.8697 ± 0.01
3 | DT | 0.9045 ± 0.01 | 0.8888 ± 0.01 | 0.8982 ± 0.01 | 0.8980 ± 0.01 | 0.8822 ± 0.01 | 0.9108 ± 0.01 | 0.9016 ± 0.01
3 | MLP | 0.8703 ± 0.01 | 0.8786 ± 0.00 | 0.8617 ± 0.01 | 0.8768 ± 0.02 | 0.8747 ± 0.01 | 0.8646 ± 0.00 | 0.9158 ± 0.00
4 | SVM | 0.7865 ± 0.01 | 0.7834 ± 0.01 | 0.8342 ± 0.01 | 0.8312 ± 0.03 | 0.7273 ± 0.01 | 0.8020 ± 0.01 | 0.7983 ± 0.03
4 | LR | 0.7575 ± 0.01 | 0.7580 ± 0.01 | 0.7787 ± 0.01 | 0.7715 ± 0.04 | 0.7380 ± 0.01 | 0.7608 ± 0.01 | 0.8290 ± 0.02
4 | DT | 0.9026 ± 0.02 | 0.8830 ± 0.01 | 0.8922 ± 0.01 | 0.8996 ± 0.01 | 0.8460 ± 0.01 | 0.9041 ± 0.01 | 0.8523 ± 0.03
4 | MLP | 0.7900 ± 0.01 | 0.8002 ± 0.01 | 0.8272 ± 0.01 | 0.7867 ± 0.02 | 0.7528 ± 0.01 | 0.7360 ± 0.02 | 0.8314 ± 0.02
5 | SVM | 0.8495 ± 0.01 | 0.8430 ± 0.01 | 0.8458 ± 0.02 | 0.7636 ± 0.06 | 0.9047 ± 0.01 | 0.9108 ± 0.00 | 0.9350 ± 0.01
5 | LR | 0.9243 ± 0.00 | 0.9238 ± 0.01 | 0.9257 ± 0.01 | 0.9257 ± 0.00 | 0.9000 ± 0.01 | 0.9160 ± 0.01 | 0.9262 ± 0.01
5 | DT | 0.9575 ± 0.01 | 0.9613 ± 0.01 | 0.9631 ± 0.01 | 0.9575 ± 0.01 | 0.9261 ± 0.01 | 0.9181 ± 0.01 | 0.9575 ± 0.01
5 | MLP | 0.8991 ± 0.01 | 0.9010 ± 0.01 | 0.8987 ± 0.01 | 0.9178 ± 0.01 | 0.8693 ± 0.01 | 0.8945 ± 0.01 | 0.9565 ± 0.01
6 | SVM | 0.8184 ± 0.01 | 0.8020 ± 0.01 | 0.7937 ± 0.01 | 0.6788 ± 0.08 | 0.8916 ± 0.01 | 0.9042 ± 0.01 | 0.7963 ± 0.01
6 | LR | 0.9039 ± 0.01 | 0.9071 ± 0.01 | 0.8978 ± 0.01 | 0.8791 ± 0.01 | 0.8705 ± 0.01 | 0.8344 ± 0.01 | 0.9337 ± 0.01
6 | DT | 0.9818 ± 0.01 | 0.9785 ± 0.00 | 0.9771 ± 0.01 | 0.9739 ± 0.01 | 0.9305 ± 0.02 | 0.9389 ± 0.02 | 0.9799 ± 0.01
6 | MLP | 0.9029 ± 0.01 | 0.9057 ± 0.01 | 0.8968 ± 0.01 | 0.9132 ± 0.02 | 0.8780 ± 0.01 | 0.8154 ± 0.02 | 0.9865 ± 0.00
7 | SVM | 0.8417 ± 0.01 | 0.8457 ± 0.01 | 0.8584 ± 0.01 | 0.8596 ± 0.01 | 0.8102 ± 0.01 | 0.8178 ± 0.00 | 0.9250 ± 0.01
7 | LR | 0.8400 ± 0.00 | 0.8280 ± 0.00 | 0.8369 ± 0.01 | 0.8492 ± 0.01 | 0.8305 ± 0.00 | 0.8255 ± 0.00 | 0.9015 ± 0.01
7 | DT | 0.8932 ± 0.01 | 0.9050 ± 0.00 | 0.9171 ± 0.00 | 0.9057 ± 0.01 | 0.8913 ± 0.01 | 0.9461 ± 0.00 | 0.9406 ± 0.00
7 | MLP | 0.8801 ± 0.01 | 0.8899 ± 0.01 | 0.9084 ± 0.01 | 0.8806 ± 0.01 | 0.8572 ± 0.01 | 0.8502 ± 0.01 | 0.9313 ± 0.00
8 | SVM | 0.9490 ± 0.02 | 0.9764 ± 0.01 | 0.9918 ± 0.00 | 0.9907 ± 0.00 | 0.9149 ± 0.02 | 0.9656 ± 0.01 | 0.9918 ± 0.00
8 | LR | 0.9488 ± 0.02 | 0.9532 ± 0.01 | 0.9896 ± 0.00 | 0.9863 ± 0.00 | 0.9151 ± 0.01 | 0.9560 ± 0.01 | 0.9896 ± 0.00
8 | DT | 0.9742 ± 0.01 | 0.9742 ± 0.01 | 0.9784 ± 0.00 | 0.9786 ± 0.00 | 0.9630 ± 0.01 | 0.9825 ± 0.00 | 0.9784 ± 0.00
8 | MLP | 0.9775 ± 0.01 | 0.9827 ± 0.01 | 0.9899 ± 0.00 | 0.9899 ± 0.00 | 0.9497 ± 0.01 | 0.9505 ± 0.01 | 0.9912 ± 0.00
9 | SVM | 0.8664 ± 0.00 | 0.8775 ± 0.01 | 0.9158 ± 0.00 | 0.8685 ± 0.01 | 0.8557 ± 0.00 | 0.8562 ± 0.00 | 0.9348 ± 0.00
9 | LR | 0.8550 ± 0.00 | 0.8560 ± 0.00 | 0.8922 ± 0.00 | 0.8161 ± 0.01 | 0.8483 ± 0.00 | 0.8472 ± 0.00 | 0.9013 ± 0.01
9 | DT | 0.9549 ± 0.00 | 0.9375 ± 0.00 | 0.9424 ± 0.00 | 0.9398 ± 0.01 | 0.9268 ± 0.01 | 0.9474 ± 0.00 | 0.9459 ± 0.00
9 | MLP | 0.8842 ± 0.01 | 0.8862 ± 0.01 | 0.9139 ± 0.01 | 0.8642 ± 0.01 | 0.8381 ± 0.01 | 0.8575 ± 0.00 | 0.9282 ± 0.00
10 | SVM | 0.9906 ± 0.00 | 0.9914 ± 0.00 | 0.9942 ± 0.00 | 0.9906 ± 0.00 | 0.9956 ± 0.00 | 0.9811 ± 0.00 | 0.9954 ± 0.00
10 | LR | 0.9759 ± 0.00 | 0.9763 ± 0.00 | 0.9783 ± 0.00 | 0.9823 ± 0.00 | 0.9930 ± 0.00 | 0.9675 ± 0.00 | 1.0000 ± 0.00
10 | DT | 0.9990 ± 0.00 | 0.9994 ± 0.00 | 0.9992 ± 0.00 | 0.9990 ± 0.00 | 0.9990 ± 0.00 | 0.9934 ± 0.00 | 0.9996 ± 0.00
10 | MLP | 0.9930 ± 0.00 | 0.9938 ± 0.00 | 0.9918 ± 0.00 | 0.9934 ± 0.00 | 0.9944 ± 0.00 | 0.9588 ± 0.00 | 0.9946 ± 0.00
11 | SVM | 0.6635 ± 0.01 | 0.6547 ± 0.01 | 0.7880 ± 0.00 | 0.7176 ± 0.03 | 0.8265 ± 0.01 | 0.8081 ± 0.01 | 0.8944 ± 0.04
11 | LR | 0.8030 ± 0.00 | 0.8122 ± 0.00 | 0.8799 ± 0.01 | 0.8532 ± 0.01 | 0.7669 ± 0.01 | 0.7759 ± 0.00 | 0.8976 ± 0.01
11 | DT | 0.9570 ± 0.00 | 0.9459 ± 0.01 | 0.9534 ± 0.00 | 0.9524 ± 0.01 | 0.9110 ± 0.01 | 0.8738 ± 0.02 | 0.9540 ± 0.01
11 | MLP | 0.8596 ± 0.01 | 0.8634 ± 0.01 | 0.9364 ± 0.01 | 0.8947 ± 0.01 | 0.7959 ± 0.03 | 0.8850 ± 0.01 | 0.9453 ± 0.00
12 | SVM | 0.6833 ± 0.03 | 0.5773 ± 0.02 | 0.9042 ± 0.01 | 0.8185 ± 0.04 | 0.8774 ± 0.01 | 0.8237 ± 0.01 | 0.9632 ± 0.01
12 | LR | 0.8577 ± 0.01 | 0.8674 ± 0.01 | 0.9562 ± 0.01 | 0.9226 ± 0.01 | 0.7970 ± 0.01 | 0.8348 ± 0.01 | 0.9514 ± 0.01
12 | DT | 0.9715 ± 0.00 | 0.9528 ± 0.01 | 0.9667 ± 0.00 | 0.9669 ± 0.00 | 0.9449 ± 0.01 | 0.9211 ± 0.01 | 0.9695 ± 0.00
12 | MLP | 0.9431 ± 0.01 | 0.9109 ± 0.01 | 0.9695 ± 0.00 | 0.9473 ± 0.01 | 0.8765 ± 0.01 | 0.9375 ± 0.01 | 0.9699 ± 0.00
Table 4. The classification test F1-score (%) of 28 approaches on 12 datasets.

ID | Classifier | RO | SMOTE | BS | KS | NRAS | GS | OS-CCD
1 | SVM | 0.8594 ± 0.02 | 0.8439 ± 0.02 | 0.7949 ± 0.04 | 0.7211 ± 0.03 | 0.8440 ± 0.09 | 0.8624 ± 0.13 | 0.8999 ± 0.02
1 | LR | 0.9656 ± 0.01 | 0.9608 ± 0.01 | 0.9584 ± 0.01 | 0.9606 ± 0.01 | 0.9515 ± 0.01 | 0.9101 ± 0.02 | 0.9593 ± 0.01
1 | DT | 0.9016 ± 0.02 | 0.9143 ± 0.02 | 0.9201 ± 0.02 | 0.9068 ± 0.02 | 0.8794 ± 0.05 | 0.8353 ± 0.02 | 0.9392 ± 0.01
1 | MLP | 0.9685 ± 0.01 | 0.9350 ± 0.06 | 0.9010 ± 0.05 | 0.9818 ± 0.01 | 0.9541 ± 0.01 | 0.9662 ± 0.01 | 0.9867 ± 0.00
2 | SVM | 0.8422 ± 0.06 | 0.8310 ± 0.03 | 0.7827 ± 0.05 | 0.7287 ± 0.04 | 0.9025 ± 0.08 | 0.9237 ± 0.01 | 0.9599 ± 0.01
2 | LR | 0.9767 ± 0.05 | 0.9690 ± 0.01 | 0.9690 ± 0.01 | 0.9664 ± 0.01 | 0.9445 ± 0.01 | 0.9139 ± 0.01 | 0.9729 ± 0.01
2 | DT | 0.8890 ± 0.05 | 0.9090 ± 0.02 | 0.9200 ± 0.03 | 0.8966 ± 0.02 | 0.8923 ± 0.02 | 0.8620 ± 0.03 | 0.9467 ± 0.02
2 | MLP | 0.9418 ± 0.04 | 0.9498 ± 0.01 | 0.9306 ± 0.02 | 0.9834 ± 0.01 | 0.9444 ± 0.01 | 0.9621 ± 0.01 | 0.9867 ± 0.00
3 | SVM | 0.6179 ± 0.01 | 0.6395 ± 0.02 | 0.6363 ± 0.02 | 0.6481 ± 0.02 | 0.5691 ± 0.01 | 0.5315 ± 0.01 | 0.6856 ± 0.03
3 | LR | 0.5725 ± 0.01 | 0.5900 ± 0.01 | 0.5622 ± 0.01 | 0.5874 ± 0.02 | 0.5727 ± 0.01 | 0.5507 ± 0.01 | 0.6060 ± 0.01
3 | DT | 0.5147 ± 0.04 | 0.5122 ± 0.04 | 0.5339 ± 0.05 | 0.5206 ± 0.05 | 0.5232 ± 0.05 | 0.5832 ± 0.03 | 0.5373 ± 0.04
3 | MLP | 0.5943 ± 0.01 | 0.6093 ± 0.01 | 0.5900 ± 0.02 | 0.6129 ± 0.02 | 0.6040 ± 0.02 | 0.5981 ± 0.01 | 0.6870 ± 0.01
4 | SVM | 0.2835 ± 0.02 | 0.2815 ± 0.01 | 0.3269 ± 0.04 | 0.2893 ± 0.05 | 0.2714 ± 0.02 | 0.3078 ± 0.02 | 0.3126 ± 0.05
4 | LR | 0.2727 ± 0.02 | 0.2779 ± 0.01 | 0.2853 ± 0.02 | 0.2498 ± 0.03 | 0.2785 ± 0.01 | 0.2861 ± 0.01 | 0.3392 ± 0.05
4 | DT | 0.2743 ± 0.08 | 0.2869 ± 0.04 | 0.2783 ± 0.06 | 0.3212 ± 0.05 | 0.2633 ± 0.03 | 0.3256 ± 0.06 | 0.2256 ± 0.04
4 | MLP | 0.3013 ± 0.01 | 0.3087 ± 0.01 | 0.3147 ± 0.02 | 0.2573 ± 0.04 | 0.2838 ± 0.02 | 0.2556 ± 0.01 | 0.3154 ± 0.04
5 | SVM | 0.4206 ± 0.03 | 0.4075 ± 0.04 | 0.4097 ± 0.04 | 0.2900 ± 0.05 | 0.5532 ± 0.02 | 0.5665 ± 0.03 | 0.6274 ± 0.06
5 | LR | 0.5968 ± 0.03 | 0.5952 ± 0.03 | 0.5957 ± 0.04 | 0.5972 ± 0.02 | 0.5384 ± 0.04 | 0.5872 ± 0.03 | 0.6032 ± 0.03
5 | DT | 0.6493 ± 0.06 | 0.6858 ± 0.07 | 0.6816 ± 0.11 | 0.6391 ± 0.08 | 0.5576 ± 0.08 | 0.6160 ± 0.04 | 0.6621 ± 0.07
5 | MLP | 0.5287 ± 0.03 | 0.5336 ± 0.04 | 0.4819 ± 0.03 | 0.5637 ± 0.04 | 0.4812 ± 0.06 | 0.5307 ± 0.03 | 0.7198 ± 0.04
6 | SVM | 0.1903 ± 0.04 | 0.1872 ± 0.04 | 0.1926 ± 0.03 | 0.2009 ± 0.03 | 0.4501 ± 0.04 | 0.4847 ± 0.02 | 0.2041 ± 0.02
6 | LR | 0.4891 ± 0.05 | 0.5014 ± 0.04 | 0.4734 ± 0.03 | 0.3753 ± 0.05 | 0.4086 ± 0.02 | 0.3522 ± 0.03 | 0.5694 ± 0.07
6 | DT | 0.6808 ± 0.13 | 0.6748 ± 0.10 | 0.6414 ± 0.11 | 0.6709 ± 0.12 | 0.4947 ± 0.11 | 0.5245 ± 0.10 | 0.7048 ± 0.12
6 | MLP | 0.4966 ± 0.04 | 0.5042 ± 0.03 | 0.4840 ± 0.06 | 0.5064 ± 0.08 | 0.4210 ± 0.02 | 0.3055 ± 0.03 | 0.8734 ± 0.04
7 | SVM | 0.2821 ± 0.01 | 0.2491 ± 0.02 | 0.2514 ± 0.01 | 0.2676 ± 0.02 | 0.2673 ± 0.01 | 0.2712 ± 0.01 | 0.3262 ± 0.02
7 | LR | 0.2942 ± 0.01 | 0.2655 ± 0.01 | 0.2790 ± 0.01 | 0.2875 ± 0.02 | 0.2894 ± 0.00 | 0.2804 ± 0.01 | 0.3266 ± 0.02
7 | DT | 0.1894 ± 0.03 | 0.2088 ± 0.02 | 0.2088 ± 0.02 | 0.2251 ± 0.02 | 0.1951 ± 0.02 | 0.1842 ± 0.05 | 0.1184 ± 0.03
7 | MLP | 0.2659 ± 0.02 | 0.2455 ± 0.03 | 0.2660 ± 0.02 | 0.2688 ± 0.02 | 0.2883 ± 0.02 | 0.2943 ± 0.01 | 0.3012 ± 0.03
8 | SVM | 0.3562 ± 0.11 | 0.5282 ± 0.11 | 0.7233 ± 0.07 | 0.7027 ± 0.08 | 0.1981 ± 0.07 | 0.4116 ± 0.10 | 0.7233 ± 0.07
8 | LR | 0.3286 ± 0.09 | 0.3621 ± 0.07 | 0.6880 ± 0.06 | 0.6427 ± 0.07 | 0.2273 ± 0.04 | 0.3596 ± 0.07 | 0.6880 ± 0.06
8 | DT | 0.2904 ± 0.10 | 0.3580 ± 0.15 | 0.4901 ± 0.11 | 0.5154 ± 0.12 | 0.3658 ± 0.09 | 0.5670 ± 0.07 | 0.5305 ± 0.10
8 | MLP | 0.5448 ± 0.12 | 0.5817 ± 0.12 | 0.7000 ± 0.07 | 0.6967 ± 0.07 | 0.3389 ± 0.10 | 0.3259 ± 0.05 | 0.7133 ± 0.06
9 | SVM | 0.2848 ± 0.01 | 0.2967 ± 0.02 | 0.3584 ± 0.01 | 0.1684 ± 0.03 | 0.2839 ± 0.01 | 0.2797 ± 0.01 | 0.3806 ± 0.03
9 | LR | 0.2738 ± 0.01 | 0.2671 ± 0.01 | 0.3205 ± 0.01 | 0.1477 ± 0.02 | 0.2681 ± 0.01 | 0.2687 ± 0.01 | 0.3330 ± 0.01
9 | DT | 0.3107 ± 0.05 | 0.2799 ± 0.03 | 0.3173 ± 0.04 | 0.2298 ± 0.03 | 0.3123 ± 0.04 | 0.3049 ± 0.04 | 0.2958 ± 0.03
9 | MLP | 0.2977 ± 0.02 | 0.2830 ± 0.01 | 0.3364 ± 0.02 | 0.1828 ± 0.03 | 0.2513 ± 0.01 | 0.2829 ± 0.01 | 0.3630 ± 0.01
10 | SVM | 0.8790 ± 0.01 | 0.8881 ± 0.02 | 0.9014 ± 0.04 | 0.8328 ± 0.04 | 0.9033 ± 0.05 | 0.7823 ± 0.02 | 0.9171 ± 0.05
10 | LR | 0.7330 ± 0.02 | 0.7357 ± 0.02 | 0.7200 ± 0.05 | 0.7871 ± 0.03 | 0.8460 ± 0.03 | 0.6725 ± 0.01 | 1.0000 ± 0.00
10 | DT | 0.9846 ± 0.02 | 0.9903 ± 0.02 | 0.9874 ± 0.02 | 0.9836 ± 0.02 | 0.9857 ± 0.01 | 0.9119 ± 0.03 | 0.9931 ± 0.01
10 | MLP | 0.9069 ± 0.011 | 0.9176 ± 0.02 | 0.8812 ± 0.04 | 0.9139 ± 0.02 | 0.8931 ± 0.05 | 0.6155 ± 0.01 | 0.9288 ± 0.02
11 | SVM | 0.0781 ± 0.01 | 0.0791 ± 0.01 | 0.1164 ± 0.00 | 0.0919 ± 0.01 | 0.0870 ± 0.01 | 0.0673 ± 0.02 | 0.1909 ± 0.03
11 | LR | 0.1547 ± 0.01 | 0.1510 ± 0.02 | 0.1776 ± 0.03 | 0.1686 ± 0.01 | 0.1497 ± 0.02 | 0.1705 ± 0.01 | 0.1981 ± 0.02
11 | DT | 0.1844 ± 0.04 | 0.2255 ± 0.04 | 0.1833 ± 0.06 | 0.2573 ± 0.06 | 0.1981 ± 0.05 | 0.2006 ± 0.04 | 0.2313 ± 0.02
11 | MLP | 0.1316 ± 0.03 | 0.1471 ± 0.02 | 0.2049 ± 0.03 | 0.1637 ± 0.03 | 0.1268 ± 0.02 | 0.0526 ± 0.02 | 0.2174 ± 0.03
12 | SVM | 0.0527 ± 0.01 | 0.0376 ± 0.00 | 0.0988 ± 0.02 | 0.0786 ± 0.01 | 0.0538 ± 0.02 | 0.020 ± 0.01 | 0.2136 ± 0.07
12 | LR | 0.0987 ± 0.01 | 0.0977 ± 0.01 | 0.1365 ± 0.06 | 0.1195 ± 0.03 | 0.0826 ± 0.02 | 0.1222 ± 0.01 | 0.1692 ± 0.04
12 | DT | 0.0130 ± 0.03 | 0.0377 ± 0.04 | 0.0274 ± 0.04 | 0.0248 ± 0.05 | 0.0617 ± 0.05 | 0.1428 ± 0.05 | 0.0607 ± 0.07
12 | MLP | 0.1196 ± 0.03 | 0.0774 ± 0.03 | 0.0918 ± 0.06 | 0.1295 ± 0.07 | 0.0689 ± 0.02 | 0.0394 ± 0.02 | 0.1982 ± 0.06
