Abstract
Many real-world classification problems, such as fraud detection, intrusion detection, churn prediction, and anomaly detection, suffer from the problem of imbalanced datasets. Therefore, in all such classification tasks, we need to balance the imbalanced datasets before building classifiers for prediction purposes. Several data-balancing techniques (DBT) have been discussed in the literature to address this issue. However, little work has been conducted to assess the performance of DBT. Therefore, in this research paper we empirically assess the performance of the data-preprocessing-level data-balancing techniques, namely: Under Sampling (US), Over Sampling (OS), Hybrid Sampling (HS), Random Over Sampling Examples (ROSE), Synthetic Minority Over Sampling Technique (SMOTE), and Clustering-Based Under Sampling (CBUS). We used six different classifiers and twenty-five different datasets with varying levels of imbalance ratio (IR) to assess the performance of DBT. The experimental results indicate that DBT helps to improve the performance of the classifiers. However, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS. It was also observed that the performance of DBT was not consistent across varying levels of IR in the dataset and across different classifiers.
1. Introduction
Classification is a supervised machine learning (ML) technique used to predict the class label of unseen data by building a classifier using historical data. Classification algorithms usually work with the assumption that the dataset used to build the classifier is balanced. However, many real-world datasets are highly imbalanced. An imbalanced dataset refers to a dataset where one class outnumbers the other classes with respect to the target class variable. For example, consider a dataset that contains 1000 transactions, out of which 990 are nonfraudulent and only 10 are fraudulent. This is a good example of a highly imbalanced dataset. Many such examples of imbalanced datasets in classification tasks are discussed in the literature, including software product defect detection [1], survival prediction of hepatocellular carcinoma patients [2], customer churn prediction [3], predicting freshmen student attrition [4], insurance fraud detection [5], and intrusion and crime detection [6]. When we build a classifier using a highly imbalanced dataset, the classifier is usually biased towards the majority class. This means that the classifier will be better at correctly predicting majority class cases than minority class cases. However, in real life we expect a classifier to be unbiased and equally good at correctly predicting both minority and majority cases. Therefore, balancing imbalanced datasets is one of the most important activities, because it helps to reduce bias in the model prediction and thereby enhances the classifier’s performance.
To address the problem of imbalanced datasets in the classification task, several solutions have been proposed in the literature [7,8,9]. These solutions are broadly divided into several categories, namely data-preprocessing-level solutions, cost-sensitive learning methods, algorithm-level solutions, and ensemble methods. The data-preprocessing-level solutions are based on resampling of the original data. Resampling is performed before building the classifier. Therefore, resampling techniques are easy to implement and are independent of the classifier. Cost-sensitive learning approaches take into account the significance of misclassification of majority and minority class instances. Algorithm-level solutions either suggest a new algorithm or modify existing algorithms. Algorithm-level solutions are dependent on algorithms and require a detailed understanding of the algorithm for implementation. Therefore, algorithm-level solutions are less popular compared to resampling techniques. Ensemble solutions combine ensemble (bagging and boosting) models with resampling techniques or a cost-sensitive approach [7,8].
Though several solutions have been proposed in the literature to deal with the imbalanced dataset problem in classification tasks, there is a lack of research assessing the performance of DBT [7]. As a large number of solutions have been proposed, it is difficult to assess the performance of all of them. Therefore, we limited the scope of this study to assessing the performance of resampling techniques used to balance the imbalanced dataset at the data-preprocessing level. The reason for choosing resampling techniques is that they are very widely used in the literature to deal with imbalanced dataset problems in classification tasks.
The objectives of this study are: (1) to assess performance of DBT used to balance the imbalanced dataset; (2) to assess whether performance of DBT is independent of the level of imbalance ratio in the dataset; (3) to assess whether performance of DBT is independent of the classifier; (4) to assess whether DBT help to improve the performance of the classifiers.
2. Background and Related Work
Machine learning algorithms in classification tasks work with the assumption that the data are balanced with respect to the target class variable. However, most classification problems suffer from imbalanced datasets. Therefore, dealing with imbalanced datasets is considered one of the most important activities in classification tasks. In order to deal with this problem, several solutions have been proposed in the literature [7,8,9].
The data-preprocessing-level solutions deal with imbalanced datasets by resampling of data [9,10]. Resampling is performed by OS minority cases or US majority cases, or by combining US and OS strategies. Using resampling techniques, we can balance the dataset at any desired level of imbalance ratio (IR). It is not necessary that the number of majority and minority cases is exactly the same. Resampling techniques are broadly divided into three categories, namely US, OS, and HS [9].
Under Sampling (US): In this method, the dataset is balanced by deleting majority class instances [10,11]. The instances are selected randomly and deleted from the dataset until the dataset is balanced. The weakness of this method is that we might lose some potentially useful information required for the learning process when we remove the instances from the majority class data.
Over Sampling (OS): In this method, the dataset is balanced by randomly replicating minority class instances [12]. This method suffers from duplication of information due to the oversampling of minority class instances, which might lead to overfitting of the model. However, unlike the US method, it does not lose any important information.
Hybrid Sampling (HS): In this method, the dataset is balanced by combining the OS and US approaches [13,14].
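To make the three resampling strategies above concrete, the following minimal base-R sketch balances a binary dataset by random under sampling, random over sampling, or a simple hybrid of the two. The data frame `df`, its target column `cls`, and the choice of the hybrid target size are illustrative assumptions, not part of the original study.

```r
# Minimal sketch of random under-, over-, and hybrid sampling (base R only).
# Assumptions (not from the paper): a data frame 'df' with a binary factor
# target column 'cls'; the hybrid target size is the mean of the class sizes.
balance_random <- function(df, cls = "cls",
                           method = c("under", "over", "hybrid")) {
  method   <- match.arg(method)
  counts   <- table(df[[cls]])
  minority <- names(which.min(counts))
  min_idx  <- which(df[[cls]] == minority)
  maj_idx  <- which(df[[cls]] != minority)

  keep <- switch(method,
    # US: randomly keep only as many majority rows as there are minority rows
    under  = c(min_idx, sample(maj_idx, length(min_idx))),
    # OS: randomly replicate minority rows until both classes are equal in size
    over   = c(maj_idx, sample(min_idx, length(maj_idx), replace = TRUE)),
    # HS: under-sample the majority and over-sample the minority
    #     toward a common target size
    hybrid = {
      target <- round(mean(c(length(min_idx), length(maj_idx))))
      c(sample(maj_idx, target), sample(min_idx, target, replace = TRUE))
    })
  df[keep, , drop = FALSE]
}

# Example (hypothetical): balanced_train <- balance_random(train, "cls", "under")
```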
Random Over Sampling Examples (ROSE): This method is based on smoothed bootstrap-based techniques [15].
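Since ROSE [15] is distributed as an R package and the experiments in this paper were run in R, a usage sketch is given below; the data frame `train` and target column `cls` are placeholder names, not objects from the original study.

```r
# Sketch using the ROSE package [15]; assumes it is installed and that 'train'
# is a data frame with a binary factor target 'cls' (placeholder names).
library(ROSE)

# Smoothed-bootstrap generation of a synthetic, roughly balanced training set
rose_train <- ROSE(cls ~ ., data = train, seed = 1)$data

# The same package also provides random under-, over-, and hybrid sampling
us_train <- ovun.sample(cls ~ ., data = train, method = "under", seed = 1)$data
os_train <- ovun.sample(cls ~ ., data = train, method = "over",  seed = 1)$data
hs_train <- ovun.sample(cls ~ ., data = train, method = "both",  seed = 1)$data

table(rose_train$cls)  # inspect the resulting class distribution
```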
Synthetic Minority Over Sampling Technique (SMOTE): This method is an OS technique. Instead of replicating minority class instances, new instances are generated synthetically [16]. The synthetic data are generated as follows: first, a minority class instance is selected at random and its k-nearest neighbors are found; new instances are then generated by interpolation between the selected minority class instance and its nearest neighbors. Several variants of SMOTE have also been proposed in the literature, such as SMOTEBoost [17], MSMOTE [18], and MWMOTE [19].
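The interpolation step described above can be sketched in a few lines of base R. This is an illustrative toy version only, assuming a numeric matrix `x_min` that holds the minority class rows; it omits the bookkeeping of a full SMOTE implementation [16].

```r
# Toy sketch of the SMOTE interpolation idea [16] (not a full implementation).
# Assumption: 'x_min' is a numeric matrix containing only minority class rows.
smote_points <- function(x_min, n_new = 100, k = 5) {
  d <- as.matrix(dist(x_min))                 # pairwise distances among minority rows
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                      dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(x_min), 1)             # randomly selected minority instance
    nn  <- order(d[a, ])[2:(k + 1)]           # its k nearest minority neighbours
    b   <- sample(nn, 1)                      # one neighbour chosen at random
    gap <- runif(1)                           # interpolation factor in [0, 1]
    synthetic[i, ] <- x_min[a, ] + gap * (x_min[b, ] - x_min[a, ])
  }
  synthetic
}
```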
One-sided Selection Method (OSS): This method falls under the US techniques category. In this method, borderline and redundant majority class instances are removed [12].
Clustering-Based Under Sampling (CBUS): This method is a US strategy. In this method, US is achieved by creating clusters of majority class instances [20]. The number of clusters should be equal to the number of minority class instances. There are two clustering strategies: in the first, the cluster center represents the cluster, and in the second, the nearest neighbor of the cluster center represents the cluster.
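A compact sketch of the second clustering strategy (keeping the majority instance nearest to each cluster center) is shown below, using base-R k-means. The data frame `df`, the target column `cls`, and the assumption of purely numeric features are illustrative and not the authors' exact procedure [20].

```r
# Sketch of clustering-based under sampling (CBUS) [20], strategy 2:
# cluster the majority class into as many clusters as there are minority
# instances and keep the majority row nearest to each cluster center.
# Assumptions: 'df' has a binary factor target 'cls' and numeric features.
cbus_nearest <- function(df, cls = "cls") {
  counts   <- table(df[[cls]])
  minority <- names(which.min(counts))
  min_rows <- df[df[[cls]] == minority, ]
  maj_rows <- df[df[[cls]] != minority, ]
  maj_x    <- as.matrix(maj_rows[, setdiff(names(df), cls)])

  k  <- nrow(min_rows)                        # one cluster per minority instance
  km <- kmeans(maj_x, centers = k, iter.max = 100)

  # Strategy 1 would keep km$centers themselves as synthetic representatives;
  # here we keep the real majority instance closest to each center instead.
  keep <- sapply(seq_len(k), function(j) {
    members <- which(km$cluster == j)
    sq_dist <- colSums((t(maj_x[members, , drop = FALSE]) - km$centers[j, ])^2)
    members[which.min(sq_dist)]
  })
  rbind(min_rows, maj_rows[keep, ])
}
```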
Clustering-Based Over Sampling and Under Sampling (CBOUS): This method is an extension of the CBUS. In this method, data balancing is achieved by combining US and OS approaches by creating clusters of majority and minority class instances [21].
The cost-sensitive methods are based on the assumption that the cost of misclassification of the minority class instances is higher than the cost of misclassification of the majority class instances. Cost-sensitive learning can be incorporated at the data-preprocessing level or at the algorithm level. Cost-sensitive methods are difficult to implement compared to the resampling technique, as detailed knowledge of the algorithm is required if it is to be incorporated into an algorithm. Several cost-sensitive solutions have been discussed in the literature [22,23,24,25,26].
The algorithm-level solutions deal with imbalanced dataset by proposing a new algorithm or modifying an existing algorithm. Some examples of algorithms modified for dealing with imbalanced dataset are discussed in existing literature [27,28,29,30].
A large number of ensemble solutions have been proposed in the literature to deal with the imbalanced dataset problem in the classification task [31,32,33,34]. In this approach, bagging and boosting algorithms are combined with resampling techniques or cost-sensitive learning methods.
Susan and Kumar reviewed state-of-the-art data-balancing strategies used prior to the learning phase. The study discussed the strengths and weaknesses of the techniques and also reported intelligent sampling procedures based on their performance, popularity, and ease of implementation [35]. Halimu and Kasem proposed a data-preprocessing sampling technique named Split Balancing (sBal) for ensemble methods. The proposed method creates multiple balanced bins, and multiple base learners are then induced on these bins. It was found that the sBal method improves classification performance considerably compared to existing ensemble methods [36]. Tolba et al. used SMOTE, NearMiss, cost-sensitive learning, k-Means SMOTE, TextGan, LoRAS, SDS, Clustering-Based Under Sampling, and a VL strategy to balance the imbalanced dataset for automatic detection of online harassment [37]. Tao et al. proposed a novel SVDD boundary-based weighted over sampling approach for dealing with imbalanced and overlapped dataset classification issues [38]. Islam et al. proposed a K-nearest neighbor over sampling approach (KNNOR) for augmentation and for generating synthetic data points for the minority class [39].
A few papers discuss the performance of resampling techniques. The following are some observations based on these papers. The US technique works well when the number of minority class instances is large (in the hundreds). The OS technique works well when the number of minority class instances is small. When the data size is very large, a combination of SMOTE and US techniques works well. The study conducted by Lopez et al. [40] compared preprocessing techniques against cost-sensitive learning and found no differences among the data-preprocessing techniques; both preprocessing and cost-sensitive learning were found to be good and equivalent approaches. The study conducted by Thammasiri et al. [4] tested three DBT, namely US, OS, and SMOTE, using four different classifiers, and found that SVM combined with SMOTE gives better results. The study conducted by Burez et al. [41] found that US leads to improved prediction accuracy.
As the focus of this study was to assess the performance of resampling techniques, we assessed the following most-used resampling techniques: (i) Under Sampling (US); (ii) Over Sampling (OS); (iii) Hybrid Sampling (HS); (iv) Random Over Sampling Examples (ROSE); (v) Synthetic Minority Over Sampling Technique (SMOTE); and (vi) Clustering-Based Under Sampling (CBUS).
3. Experimental Setup and Datasets
We used six different classifiers, namely the Decision Tree (DT, C4.5), k-Nearest Neighbor (kNN), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM), to assess the performance of DBT instead of relying on a single classifier. Using six different classifiers also helps us to understand whether the performance of DBT varies across classifiers or remains the same. In this study, we used 25 different small datasets with varying levels of IR. All datasets were downloaded from the KEEL dataset repository [42]. Information about the datasets is given in Table 1. More details about the datasets are available at https://sci2s.ugr.es/keel/imbalanced.php (accessed on 3 November 2021). The last column in Table 1 is the Imbalance Ratio (IR), which indicates the ratio of the number of majority class instances to the number of minority class instances.
Table 1.
Dataset Information.
We built a total of 1050 classifiers (25 datasets × 7 balancing strategies, i.e., the six DBT plus the None strategy, × 6 classifiers) using the open source ‘R’ software. To build each classification model and assess its performance, the following process was used: (i) divide the dataset into training and test sets, with the training set containing 80% of the data and the test set containing 20%; (ii) apply the DBT to the training set; (iii) build the classification model using the balanced training set; (iv) test the performance of the classification model on the test set. The performance of the classifier was measured using the area under the ROC curve (AUC value) [43]. To train the classifiers, we used the default hyperparameter settings of the caret package in ‘R’. No specific hyperparameter tuning was performed, as the objective of this study was not to improve the performance of the classifiers but to assess the performance of DBT. More details about the caret package are given by Kuhn et al. [44].
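The four-step process above can be sketched in R as follows. This is a hedged sketch only: the data frame `dataset`, its target column `cls`, the choice of random under sampling as the balancing step, and random forest as the learner are placeholders, and the caret, ROSE, and pROC packages are assumed to be installed.

```r
# Hedged sketch of steps (i)-(iv); 'dataset' and 'cls' are placeholder names.
library(caret)   # model training with default hyperparameters
library(ROSE)    # data balancing applied to the training set only
library(pROC)    # AUC on the held-out test set

set.seed(42)

# (i) 80/20 stratified train/test split
in_train  <- createDataPartition(dataset$cls, p = 0.8, list = FALSE)
train_set <- dataset[in_train, ]
test_set  <- dataset[-in_train, ]

# (ii) balance only the training set (random under sampling shown here)
train_bal <- ovun.sample(cls ~ ., data = train_set,
                         method = "under", seed = 1)$data

# (iii) build the classifier (random forest shown as an example)
fit <- train(cls ~ ., data = train_bal, method = "rf")

# (iv) evaluate on the untouched test set using the AUC
probs <- predict(fit, newdata = test_set, type = "prob")[, 2]
auc(roc(test_set$cls, probs))
```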
We used the Friedman test to compare the performance of the DBT, based on their average ranks over the datasets, separately for each classifier [45,46]. The Friedman test helped us to understand whether there was a significant difference in the performance of the DBT for a given classifier [46]. To report the differences in the performance of DBT, we applied the post hoc Nemenyi test [46,47], which tells us which DBT differed significantly with respect to their performance in the classification task. We used Kendall’s test statistic [48] to test the agreement on the rankings of DBT, based on their performance in the classification task, across varying levels of IR in the dataset. Kendall’s test has previously been used in a similar way to assess the performance of data imputation methods [49]. If the value of Kendall’s ‘w’ is 1, there is complete agreement over the ranking, and if it is 0, there is no agreement over the ranking.
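A hedged sketch of this statistical analysis for one classifier is given below, assuming a hypothetical 25 × 7 matrix `auc_mat` of AUC values (rows = datasets, columns = None, US, OS, HS, ROSE, SMOTE, CBUS). The Friedman test is available in base R; a Nemenyi post hoc test is available, for example, in the PMCMRplus package, and Kendall's W can be derived directly from the Friedman statistic.

```r
# Hedged sketch; 'auc_mat' (25 datasets x 7 strategies) is a placeholder matrix.
ft <- friedman.test(auc_mat)   # blocks = datasets (rows), groups = strategies (columns)
print(ft)

# Kendall's coefficient of concordance W follows from the Friedman statistic:
# W = chi^2_F / (N * (k - 1)), with N datasets and k balancing strategies.
N <- nrow(auc_mat); k <- ncol(auc_mat)
W <- unname(ft$statistic) / (N * (k - 1))
W

# Post hoc pairwise comparison of strategies (assumes PMCMRplus is installed):
# PMCMRplus::frdAllPairsNemenyiTest(auc_mat)
```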
4. Results and Discussion
4.1. Performance of DBT
Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 report the performances of the six classifiers, namely DT, kNN, LR, NB, RF, and SVM, for the different data-balancing strategies, namely None, US, OS, HS, ROSE, SMOTE, and CBUS. The performance of the classifier is measured using the area under the receiver operating characteristics curve, i.e., the AUC value. The first column in each table indicates the name of the imbalanced dataset. The second column indicates the performance of the classifier without balancing the imbalanced dataset (None strategy). Columns 3 to 8 indicate the performance of the classifier for the US, OS, HS, ROSE, SMOTE, and CBUS strategies. The mean rank of the balancing strategies based on their performance over the 25 datasets, along with the Friedman test statistics, is reported in the last two rows of each table. Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 report the post hoc analysis using the Nemenyi multiple comparison test. The Nemenyi statistics are used to understand which DBT differ significantly in performance.
Table 2.
Performance of DBT for DT classifier.
Table 3.
Performance of DBT for kNN classifier.
Table 4.
Performance of DBT for LR.
Table 5.
Performance of DBT for NB classifier.
Table 6.
Performance of DBT for RF.
Table 7.
Performance of DBT for SVM.
Table 8.
Post hoc analysis using Nemenyi multiple comparison test for DT.
Table 9.
Post hoc analysis using Nemenyi multiple comparison test for kNN.
Table 10.
Post hoc analysis using Nemenyi multiple comparison test for LR.
Table 11.
Post hoc analysis using Nemenyi multiple comparison test for NB classifier.
Table 12.
Post hoc analysis using Nemenyi multiple comparison test for RF.
Table 13.
Post hoc analysis using Nemenyi multiple comparison test for SVM.
The Friedman test statistics show that for all the classifiers, the ‘p’ value is less than 0.05. So, we can say that, statistically, there is a significant difference in the performance of DBT. As the difference in the performance of DBT is significant, the Nemenyi test was then applied to find which DBT differ significantly in performance. The following are our observations based on the Friedman statistics and Nemenyi post hoc analysis:
- For the DT classifier, the performance of the None strategy was poor and significantly different than US, OS, HS, SMOTE, and CBUS. However, no difference in the performance was observed between None and ROSE strategies. Further, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS.
- For the kNN classifier, the performance of the None strategy was poor and significantly different than OS, HS, ROSE, SMOTE, and CBUS. However, no difference in the performance was observed between the None and US strategy. Further, no significant difference was observed in the performance of OS, HS, ROSE, and CBUS. Significant difference was observed in the performance of US and OS strategies.
- For the LR classifier, the performance of the None strategy was found to be poor and significantly different than US, OS, HS, and SMOTE. However, no significant difference in the performance was observed between None, ROSE, and CBUS. It was also observed that there was no difference in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
- For the NB classifier, the performance of the None strategy was found to be poor and significantly different to US, SMOTE, and CBUS. However, no difference in the performance was observed between None, OS, HS, and ROSE. Further, no significant difference was observed in the performance of US, SMOTE, and CBUS.
- For the RF classifier, it was found that the performance of the None Strategy was poor and significantly different to the US and CBUS strategies. However, no difference was observed in the performance of the None, OS, HS, ROSE, and SMOTE. Further, no significant difference was observed in the performance of US, HS, SMOTE, and CBUS.
- For SVM, the performance of the None strategy was poor and significantly different than US, OS, HS, SMOTE, and CBUS. However, no significant difference was observed in the performance of None and ROSE. Further, no significant difference was observed in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
Therefore, from all the observations above, we can infer that: (i) the performance of the None and ROSE strategies is poor and significantly different than the others; (ii) no significant difference was observed in the performance of the US, OS, HS, SMOTE, and CBUS strategies. Dealing with imbalanced datasets is a very common problem in classification tasks, and which DBT is more suitable to enhance the performance of the classifier is the most common question that needs to be answered. In this section, we have attempted to answer this question by applying data-preprocessing-level DBT to 25 different datasets using six different classifiers. From the results of the experiment and statistical analysis, we can infer the following: (i) balancing the imbalanced dataset certainly helps to improve the performance of the classifier; (ii) for the DT classifier, the CBUS and US techniques give a better performance; (iii) for logistic regression, the SMOTE and OS techniques give a better performance; (iv) for the Naïve Bayes classifier, US and SMOTE give a better performance; (v) for random forest, the CBUS and US techniques give a better performance; (vi) for the support vector machine, HS and CBUS give better results; (vii) for the kNN classifier, OS and SMOTE give better results. However, it is important to note that every time we apply DBT to an imbalanced dataset, there is no guarantee that the same data will be generated or removed in order to balance it. Therefore, model performance and results could also vary slightly.
4.2. Performance of DBT across the Classifier
In this section, we assess whether the performance of DBT is consistent across classifiers or varies from classifier to classifier. To do this assessment, we used Kendall’s ‘w’ statistic. When ‘w’ is 1, there is complete agreement over the ranks; when ‘w’ is 0, there is complete disagreement over the ranks. The ranks of DBT and the results of Kendall’s statistics are shown in Table 14. The results show that there is agreement over the ranking of the data-balancing techniques; however, the concordance coefficient (w) is 0.562, which indicates that the agreement is only partial. It is observed from Table 14 that the rankings are consistent only for the None and ROSE techniques, whose performance was poor, whereas there was no consistency in the rankings of the US, OS, HS, SMOTE, and CBUS techniques, although their performance was better than that of ROSE and None.
Table 14.
Ranks of DBT for different classifiers.
In this section of the paper, we have attempted to answer the following question: Is performance of DBT consistent across classifiers? The results show that the performance of DBT is not consistent across different classifiers.
4.3. Performance of DBT for Varying Levels of IR in the Dataset
Table 15, Table 16, Table 17, Table 18, Table 19 and Table 20 show the ranks of DBT for the six classifiers for varying levels of IR in the dataset. In order to assess the performance of DBT for varying levels of IR, we used Kendall’s test statistics. The rows in the tables indicate the data-balancing strategy and the columns indicate the range of IR in the dataset. The values in each cell indicate the rank of the DBT for varying levels of IR in the dataset for a given classifier. The last row in each table shows the results of Kendall’s test statistics.
Table 15.
Ranks of DBT for DT for varying levels of IR.
Table 16.
Ranks of DBT for kNN for varying levels of IR.
Table 17.
Ranks of DBT for LR for varying levels of IR.
Table 18.
Ranks of DBT for NB classifier for varying levels of IR.
Table 19.
Ranks of DBT for RF for varying levels of IR.
Table 20.
Ranks of DBT for SVM classifier for varying levels of IR.
From the results of Kendall’s statistics, we can infer that:
- For the DT classifier, there is no agreement over the rankings of DBT as the “p” value is greater than 0.05. This means that for the DT classifier, the performance of the data balancing techniques was not consistent for varying imbalance-ratio percentages.
- For the kNN classifier, there is agreement over the rankings of the data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.593, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that the performance of None seemed consistent, whereas the performance of other DBT was different for varying imbalance-ratio percentages.
- For the LR classifier, there is no agreement over the rankings of data-balancing techniques as the “p” value is greater than 0.05. This means that for the LR classifier, the performance of the data-balancing techniques was not consistent for varying imbalance-ratio percentages.
- For the NB classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.686, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that performance of the None was consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
- For the RF classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.539, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that performance of the None, US, CBUS seemed consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
- For the SVM classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.564, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that only the performance of the None was consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
In this section of the paper, we have attempted to answer the following question: Is the performance of the DBT consistent for varying levels of IR in the dataset? The results of the experiment show that for all the classifiers, the performance of the None and ROSE strategy was poor and consistent for varying levels of IR in the dataset. However, performances of the other DBT were not consistent for varying levels of IR in the dataset.
5. Conclusions and Recommendation for Further Work
In this research paper, we have assessed the performance of six different DBT. The assessment was performed using six different classifiers and 25 different datasets with different levels of IR. The performance of the DBT was assessed through the performance of the classifiers, which was measured using the area under the ROC curve. The experimental results show that: (i) for all six classifiers, the performance of the None and ROSE strategies was poor and significantly different than the others, and there was no significant difference in the performance of the US, OS, HS, SMOTE, and CBUS techniques; (ii) the performance of None and ROSE was poor and consistent across the classifiers, whereas there was no consistency in the performance of the US, OS, HS, SMOTE, and CBUS techniques, although their performance was better than that of the ROSE and None strategies; (iii) there was no agreement over the ranks of DBT for varying levels of IR in the dataset, except for the None and ROSE strategies; (iv) DBT helps to improve the performance of the classifiers, although the performance of ROSE was not significantly different than the None strategy. Thus, from the experimental results, we may infer that DBT helps to improve the performance of the classifier in classification tasks. Further, the performance of the DBT was found to depend on the classification algorithm and on the level of IR in the dataset. These inferences are drawn based on our experimental results.
As stated earlier in the introduction section, we assessed the performance of only data-preprocessing-level data-balancing techniques. However, there is a need to assess the performance of advanced DBT such as algorithm-level solutions, cost-based learning, and ensemble methods.
Author Contributions
Conceptualization, A.J. and S.M.M.; methodology, A.J. and S.M.M.; software, A.J. and S.M.M.; validation, A.J. and S.M.M.; formal analysis, A.J. and S.M.M.; investigation, A.J., H.E., F.K.K. and S.M.M.; resources, A.J., H.E., F.K.K. and S.M.M.; data curation, A.J. and S.M.M.; writing—original draft preparation, A.J., H.E., F.K.K. and S.M.M.; writing—review and editing, A.J., H.E., F.K.K. and S.M.M.; visualization, A.J., H.E., F.K.K. and S.M.M.; supervision, H.E., F.K.K. and S.M.M.; project administration, H.E. and F.K.K.; funding acquisition, H.E. and F.K.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request.
Acknowledgments
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Conflicts of Interest
The authors declare that they have no conflict of interest to report regarding the present study.
References
- Siers, M.J.; Islam, M.Z. Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 2015, 51, 62–71.
- Santos, M.S.; Abreu, P.H.; Laencina, P.J.G.; Simão, A.; Carvalho, A. A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J. Biomed. Inform. 2015, 58, 49–59.
- Zhu, B.; Baesens, B.; Broucke, S.K.L.M. An empirical comparison of techniques for the class imbalance problem in churn prediction. Inf. Sci. 2017, 408, 84–99.
- Thammasiri, D.; Delen, D.; Meesad, P.; Kasap, N. A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition. Expert Syst. Appl. 2014, 41, 321–330.
- Hassan, A.K.I.; Abraham, A. Modeling insurance fraud detection using imbalanced data classification. In Proceedings of the 7th World Congress on Nature and Biologically Inspired Computing (NaBIC2015), Pietermaritzburg, South Africa, 18 November 2015; pp. 117–127.
- Hajian, S.; Ferrer, J.D.; Balleste, A.M. Discrimination prevention in data mining for intrusion and crime detection. In Proceedings of the IEEE Symposium on Computational Intelligence in Cyber Security (CICS), Paris, France, 11–15 April 2011; pp. 1–8.
- Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2012, 42, 463–484.
- Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239.
- Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng. 2006, 30, 1–12.
- Kotsiantis, S.; Pintelas, P. Mixture of Expert Agents for Handling Imbalanced Data Sets. Ann. Math. Comput. TeleInformatics 2003, 1, 46–55.
- Tahir, M.A.; Kittler, J.; Mikolajczyk, K.; Yan, F. A multiple expert approach to the class imbalance problem using inverse random under sampling. In Proceedings of the International Workshop on Multiple Classifier Systems, Reykjavik, Iceland, 10–12 June 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 82–91.
- Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One sided selection. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 8 July 1997; pp. 179–186.
- Cateni, S.; Colla, V.; Vannucci, M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 2014, 135, 32–41.
- Yeh, C.W.; Li, D.C.; Lin, L.S.; Tsai, T.I. A Learning Approach with Under and Over-Sampling for Imbalanced Data Sets. In Proceedings of the 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), Kumamoto, Japan, 10–14 July 2016; pp. 725–729.
- Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79–89.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Cavtat-Dubrovnik, Dubrovnik, Croatia, 22–26 September 2003; pp. 107–119.
- Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the Second International Workshop on Computer Science and Engineering, Qingdao, China, 28–30 October 2009; pp. 13–17.
- Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425.
- Lin, W.; Tsai, C.; Hu, Y.; Jhang, J. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26.
- Jadhav, A. Clustering Based Data Preprocessing Technique to Deal with Imbalanced Dataset Problem in Classification Task. In Proceedings of the IEEE Punecon, Pune, India, 30 November–2 December 2018; pp. 1–7.
- Fan, W.; Stolfo, S.J.; Zhang, J.; Chan, P.K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, USA, 27–30 June 1999; pp. 99–105.
- Zhou, Z.; Liu, X. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Trans. Knowl. Data Eng. 2006, 18, 63–77.
- Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 155–164.
- López, V.; Río, S.D.; Benítez, J.M.; Herrera, F. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 2015, 258, 5–38.
- Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378.
- Chen, Z.Y.; Shu, P.; Sun, M. A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. Eur. J. Oper. Res. 2012, 223, 461–472.
- Zhang, Y.; Fu, P.; Liu, W.; Chen, G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput. Appl. 2014, 25, 927–935.
- Kim, S.; Kim, H.; Namkoong, Y. Ordinal Classification of Imbalanced Data with Application in Emergency and Disaster Information Service. IEEE Intell. Syst. 2016, 31, 50–56.
- Godoy, M.D.P.; Fernández, A.; Rivera, A.J.; Jesus, M.J.D. Analysis of an evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets. Pattern Recognit. Lett. 2010, 31, 2375–2388.
- Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197.
- Wang, S.; Yao, X. Diversity analysis on imbalanced data sets by using ensemble models. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN, USA, 30 March–2 April 2009; pp. 324–331.
- Barandela, R.; Valdovinos, R.M.; Sánchez, J.S. New applications of ensembles of classifiers. Pattern Anal. Appl. 2003, 6, 245–256.
- Liao, J.J.; Shih, C.H.; Chen, T.F.; Hsu, M.F. An ensemble-based model for two-class imbalanced financial problem. Econ. Model. 2014, 37, 175–183.
- Susan, S.; Kumar, A. The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art. Eng. Rep. 2021, 3, e12298.
- Halimu, C.; Kasem, A. Split balancing (sBal)—A data preprocessing sampling technique for ensemble methods for binary classification in imbalanced datasets. In Computational Science and Technology; Springer: Singapore, 2021; pp. 241–257.
- Tolba, M.; Ouadfel, S.; Meshoul, S. Hybrid ensemble approaches to online harassment detection in highly imbalanced data. Expert Syst. Appl. 2021, 175, 114751.
- Tao, X.; Zheng, Y.; Chen, W.; Zhang, X.; Qi, L.; Fan, Z.; Huang, S. SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning. Inf. Sci. 2022, 588, 13–51.
- Islam, A.; Belhaouari, S.B.; Rehman, A.U.; Bensmail, H. KNNOR: An oversampling technique for imbalanced datasets. Appl. Soft Comput. 2022, 115, 108288.
- López, V.; Fernández, A.; Torres, J.G.M.; Herrera, F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst. Appl. 2012, 39, 6585–6608.
- Burez, J.; Poel, V.D. Handling class imbalance in customer churn prediction. Expert Syst. Appl. 2009, 36, 4626–4636.
- Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 2011, 17, 255–287.
- Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159.
- Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Kenke, B.; R Core Team. Classification and Regression Training. 2022. Available online: https://cran.r-project.org/web/packages/caret/caret.pdf (accessed on 3 November 2021).
- Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
- Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453.
- Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, University of Princeton, Princeton, NJ, USA, 1963.
- Kendall, M.G.; Smith, B.B. The Problem of m Rankings. Ann. Math. Stat. 1939, 10, 275–287.
- Jadhav, A.; Pramod, D.; Ramanathan, K. Comparison of performance of data imputation methods for numeric dataset. Appl. Artif. Intell. 2019, 33, 913–933.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).