Abstract
Many real-world classification problems, such as fraud detection, intrusion detection, churn prediction, and anomaly detection, suffer from the problem of imbalanced datasets. Therefore, in all such classification tasks, we need to balance the imbalanced datasets before building classifiers for prediction purposes. Several data-balancing techniques (DBT) have been discussed in the literature to address this issue. However, little work has been conducted to assess the performance of DBT. Therefore, in this research paper we empirically assess the performance of the data-preprocessing-level data-balancing techniques, namely: Under Sampling (US), Over Sampling (OS), Hybrid Sampling (HS), Random Over Sampling Examples (ROSE), Synthetic Minority Over Sampling Technique (SMOTE), and Clustering-Based Under Sampling (CBUS). We used six different classifiers and twenty-five different datasets with varying levels of imbalance ratio (IR) to assess the performance of DBT. The experimental results indicate that DBT helps to improve the performance of the classifiers. However, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS. It was also observed that the performance of DBT was not consistent across varying levels of IR in the dataset and across different classifiers.
1. Introduction
Classification is a supervised machine learning (ML) technique used to predict the class label of unseen data by building a classifier using historical data. Classification algorithms usually work with the assumption that the dataset used to build the classifier is balanced. However, many real-world datasets are highly imbalanced. An imbalanced dataset refers to a dataset where one class outnumbers the other classes with respect to the target class variable. For example, consider a dataset that contains 1000 transactions, out of which 990 are nonfraudulent and only 10 are fraudulent. This is a good example of a highly imbalanced dataset. Many such examples of imbalanced datasets in classification tasks are discussed in the literature, including software product defect detection [1], survival prediction of hepatocellular carcinoma patients [2], customer churn prediction [3], predicting freshmen student attrition [4], insurance fraud detection [5], and intrusion and crime detection [6]. When we build a classifier using a highly imbalanced dataset, the classifier is usually biased towards the majority class. This means that the classifier will be better at correctly predicting majority class cases than minority class cases. However, in real life we expect a classifier to be unbiased and equally good at correctly predicting both minority and majority cases. Therefore, balancing imbalanced datasets is one of the most important activities, because it helps to reduce bias in the model prediction and thereby enhances the classifier’s performance.
To address the problem of imbalanced datasets in the classification task, several solutions have been proposed in the literature [7,8,9]. These solutions are broadly divided into several categories, namely data-preprocessing-level solutions, cost-sensitive learning methods, algorithm-level solutions, and ensemble methods. The data-preprocessing-level solutions are based on resampling of the original data. Resampling is performed before building the classifier. Therefore, resampling techniques are easy to implement and are independent of the classifier. Cost-sensitive learning approaches take into account the significance of misclassification of majority and minority class instances. Algorithm-level solutions either suggest a new algorithm or modify existing algorithms. Algorithm-level solutions are dependent on algorithms and require a detailed understanding of the algorithm for implementation. Therefore, algorithm-level solutions are less popular compared to resampling techniques. Ensemble solutions combine ensemble (bagging and boosting) models with resampling techniques or a cost-sensitive approach [7,8].
Though several solutions have been proposed in the literature to deal with the imbalanced dataset problem in classification tasks, there is a lack of research assessing the performance of DBT [7]. As a large number of solutions have been proposed, it is difficult to assess the performance of all of them. Therefore, we limited the scope of this study to assessing the performance of resampling techniques used to balance the imbalanced dataset at the data-preprocessing level. The reason for choosing resampling techniques is that they are very widely used in the literature to deal with imbalanced dataset problems in classification tasks.
The objectives of this study are: (1) to assess performance of DBT used to balance the imbalanced dataset; (2) to assess whether performance of DBT is independent of the level of imbalance ratio in the dataset; (3) to assess whether performance of DBT is independent of the classifier; (4) to assess whether DBT help to improve the performance of the classifiers.
2. Background and Related Work
Machine learning algorithms in classification tasks work with the assumption that the data are balanced with respect to the target class variable. However, most classification problems suffer from imbalanced datasets. Therefore, dealing with imbalanced datasets is considered one of the most important activities in classification tasks. In order to deal with this problem, several solutions have been proposed in the literature [7,8,9].
The data-preprocessing-level solutions deal with imbalanced datasets by resampling of data [9,10]. Resampling is performed by OS minority cases or US majority cases, or by combining US and OS strategies. Using resampling techniques, we can balance the dataset at any desired level of imbalance ratio (IR). It is not necessary that the number of majority and minority cases is exactly the same. Resampling techniques are broadly divided into three categories, namely US, OS, and HS [9].
Under Sampling (US): In this method, the dataset is balanced by deleting majority class instances [10,11]. The instances are selected randomly and deleted from the dataset until the dataset is balanced. The weakness of this method is that we might lose some potentially useful information required for the learning process when we remove the instances from the majority class data.
Over Sampling (OS): In this method, the dataset is balanced by randomly replicating minority class instances [12]. This method suffers from duplication of information due to the oversampling of minority class instances, which might lead to overfitting of the model. However, unlike the US method, it does not lose any important information.
Hybrid Sampling (HS): In this method, the dataset is balanced by combining the OS and US approaches [13,14].
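To make the three resampling strategies above concrete, the following minimal base-R sketch balances a binary dataset by random under sampling, random over sampling, or a simple hybrid of the two. The data frame `df`, its target column `cls`, and the choice of the hybrid target size are illustrative assumptions, not part of the original study.

```r
# Minimal sketch of random under-, over-, and hybrid sampling (base R only).
# Assumptions (not from the paper): a data frame 'df' with a binary factor
# target column 'cls'; the hybrid target size is the mean of the class sizes.
balance_random <- function(df, cls = "cls",
                           method = c("under", "over", "hybrid")) {
  method   <- match.arg(method)
  counts   <- table(df[[cls]])
  minority <- names(which.min(counts))
  min_idx  <- which(df[[cls]] == minority)
  maj_idx  <- which(df[[cls]] != minority)

  keep <- switch(method,
    # US: randomly keep only as many majority rows as there are minority rows
    under  = c(min_idx, sample(maj_idx, length(min_idx))),
    # OS: randomly replicate minority rows until both classes are equal in size
    over   = c(maj_idx, sample(min_idx, length(maj_idx), replace = TRUE)),
    # HS: under-sample the majority and over-sample the minority
    #     toward a common target size
    hybrid = {
      target <- round(mean(c(length(min_idx), length(maj_idx))))
      c(sample(maj_idx, target), sample(min_idx, target, replace = TRUE))
    })
  df[keep, , drop = FALSE]
}

# Example (hypothetical): balanced_train <- balance_random(train, "cls", "under")
```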
Random Over Sampling Examples (ROSE): This method is based on smoothed bootstrap-based techniques [15].
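Since ROSE [15] is distributed as an R package and the experiments in this paper were run in R, a usage sketch is given below; the data frame `train` and target column `cls` are placeholder names, not objects from the original study.

```r
# Sketch using the ROSE package [15]; assumes it is installed and that 'train'
# is a data frame with a binary factor target 'cls' (placeholder names).
library(ROSE)

# Smoothed-bootstrap generation of a synthetic, roughly balanced training set
rose_train <- ROSE(cls ~ ., data = train, seed = 1)$data

# The same package also provides random under-, over-, and hybrid sampling
us_train <- ovun.sample(cls ~ ., data = train, method = "under", seed = 1)$data
os_train <- ovun.sample(cls ~ ., data = train, method = "over",  seed = 1)$data
hs_train <- ovun.sample(cls ~ ., data = train, method = "both",  seed = 1)$data

table(rose_train$cls)  # inspect the resulting class distribution
```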
Synthetic Minority Over Sampling Technique (SMOTE): This method is an OS technique. Instead of replicating minority class instances, new instances are generated synthetically [16]. The synthetic data are generated as follows: first, a minority class instance is selected at random and its k-nearest neighbors are found; new instances are then generated by interpolation between the selected minority class instance and its nearest neighbors. Several variants of SMOTE have also been proposed in the literature, such as SMOTEBoost [17], MSMOTE [18], and MWMOTE [19].
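The interpolation step described above can be sketched in a few lines of base R. This is an illustrative toy version only, assuming a numeric matrix `x_min` that holds the minority class rows; it omits the bookkeeping of a full SMOTE implementation [16].

```r
# Toy sketch of the SMOTE interpolation idea [16] (not a full implementation).
# Assumption: 'x_min' is a numeric matrix containing only minority class rows.
smote_points <- function(x_min, n_new = 100, k = 5) {
  d <- as.matrix(dist(x_min))                 # pairwise distances among minority rows
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                      dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(x_min), 1)             # randomly selected minority instance
    nn  <- order(d[a, ])[2:(k + 1)]           # its k nearest minority neighbours
    b   <- sample(nn, 1)                      # one neighbour chosen at random
    gap <- runif(1)                           # interpolation factor in [0, 1]
    synthetic[i, ] <- x_min[a, ] + gap * (x_min[b, ] - x_min[a, ])
  }
  synthetic
}
```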
One-sided Selection Method (OSS): This method falls under the US techniques category. In this method, borderline and redundant majority class instances are removed [12].
Clustering-Based Under Sampling (CBUS): This method is a US strategy. In this method, US is achieved by creating clusters of majority class instances [20]. The number of clusters should be equal to the number of minority class instances. There are two clustering strategies: in the first, the cluster center represents the cluster, and in the second, the nearest neighbor of the cluster center represents the cluster.
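A compact sketch of the second clustering strategy (keeping the majority instance nearest to each cluster center) is shown below, using base-R k-means. The data frame `df`, the target column `cls`, and the assumption of purely numeric features are illustrative and not the authors' exact procedure [20].

```r
# Sketch of clustering-based under sampling (CBUS) [20], strategy 2:
# cluster the majority class into as many clusters as there are minority
# instances and keep the majority row nearest to each cluster center.
# Assumptions: 'df' has a binary factor target 'cls' and numeric features.
cbus_nearest <- function(df, cls = "cls") {
  counts   <- table(df[[cls]])
  minority <- names(which.min(counts))
  min_rows <- df[df[[cls]] == minority, ]
  maj_rows <- df[df[[cls]] != minority, ]
  maj_x    <- as.matrix(maj_rows[, setdiff(names(df), cls)])

  k  <- nrow(min_rows)                        # one cluster per minority instance
  km <- kmeans(maj_x, centers = k, iter.max = 100)

  # Strategy 1 would keep km$centers themselves as synthetic representatives;
  # here we keep the real majority instance closest to each center instead.
  keep <- sapply(seq_len(k), function(j) {
    members <- which(km$cluster == j)
    sq_dist <- colSums((t(maj_x[members, , drop = FALSE]) - km$centers[j, ])^2)
    members[which.min(sq_dist)]
  })
  rbind(min_rows, maj_rows[keep, ])
}
```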
Clustering-Based Over Sampling and Under Sampling (CBOUS): This method is an extension of the CBUS. In this method, data balancing is achieved by combining US and OS approaches by creating clusters of majority and minority class instances [21].
The cost-sensitive methods are based on the assumption that the cost of misclassification of the minority class instances is higher than the cost of misclassification of the majority class instances. Cost-sensitive learning can be incorporated at the data-preprocessing level or at the algorithm level. Cost-sensitive methods are difficult to implement compared to the resampling technique, as detailed knowledge of the algorithm is required if it is to be incorporated into an algorithm. Several cost-sensitive solutions have been discussed in the literature [22,23,24,25,26].
The algorithm-level solutions deal with imbalanced dataset by proposing a new algorithm or modifying an existing algorithm. Some examples of algorithms modified for dealing with imbalanced dataset are discussed in existing literature [27,28,29,30].
A large number of ensemble solutions have been proposed in the literature to deal with the imbalanced dataset problem in the classification task [31,32,33,34]. In this approach, bagging and boosting algorithms are combined with resampling techniques or cost-sensitive learning methods.
Susan and Kumar reviewed state-of-the-art data-balancing strategies used prior to the learning phase. The study discussed the strengths and weaknesses of the techniques and also reported intelligent sampling procedures based on their performance, popularity, and ease of implementation [35]. Halimu and Kasem proposed a data-preprocessing sampling technique named Split Balancing (sBal) for ensemble methods. The proposed method creates multiple balanced bins, and multiple base learners are then induced on these bins. It was found that the sBal method improves classification performance considerably compared to existing ensemble methods [36]. Tolba et al. used SMOTE, NearMiss, cost-sensitive learning, k-Means SMOTE, TextGan, LoRAS, SDS, Clustering-Based Under Sampling, and a VL strategy to balance the imbalanced dataset for automatic detection of online harassment [37]. Tao et al. proposed a novel SVDD boundary-based weighted over sampling approach for dealing with imbalanced and overlapped dataset classification issues [38]. Islam et al. proposed a K-nearest neighbor over sampling approach (KNNOR) for augmentation and for generating synthetic data points for the minority class [39].
A few papers discuss the performance of resampling techniques. The following are some observations based on these papers. The US technique works well when the number of minority class instances is large (in the hundreds). The OS technique works well when the number of minority class instances is small. When the data size is very large, a combination of SMOTE and US techniques works well. The study conducted by Lopez et al. [40] compared preprocessing techniques against cost-sensitive learning and found no differences among the data-preprocessing techniques; both preprocessing and cost-sensitive learning were found to be good and equivalent approaches. The study conducted by Thammasiri et al. [4] tested three DBT, namely US, OS, and SMOTE, using four different classifiers, and found that SVM combined with SMOTE gives better results. The study conducted by Burez et al. [41] found that US leads to improved prediction accuracy.
As the focus of this study was to assess the performance of resampling techniques, we assessed the following most-used resampling techniques: (i) Under Sampling (US); (ii) Over Sampling (OS); (iii) Hybrid Sampling (HS); (iv) Random Over Sampling Examples (ROSE); (v) Synthetic Minority Over Sampling Technique (SMOTE); and (vi) Clustering-Based Under Sampling (CBUS).
3. Experimental Setup and Datasets
We used six different classifiers, namely the Decision Tree (DT, C4.5), k-Nearest Neighbor (kNN), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM), to assess the performance of DBT instead of relying on a single classifier. Using six different classifiers also helps us to understand whether the performance of DBT varies across classifiers or remains the same. In this study, we used 25 different small datasets with varying levels of IR. All datasets were downloaded from the KEEL dataset repository [42]. Information about the datasets is given in Table 1. More details about the datasets are available at https://sci2s.ugr.es/keel/imbalanced.php (accessed on 3 November 2021). The last column in Table 1 is the Imbalance Ratio (IR), which indicates the ratio of the number of majority class instances to the number of minority class instances.
Table 1.
Dataset Information.
We built a total of 1050 classifiers (25 datasets × 7 balancing strategies, i.e., the six DBT plus the None strategy, × 6 classifiers) using the open source ‘R’ software. To build each classification model and assess its performance, the following process was used: (i) divide the dataset into training and test sets, with the training set containing 80% of the data and the test set containing 20%; (ii) apply the DBT to the training set; (iii) build the classification model using the balanced training set; (iv) test the performance of the classification model on the test set. The performance of the classifier was measured using the area under the ROC curve (AUC value) [43]. To train the classifiers, we used the default hyperparameter settings of the caret package in ‘R’. No specific hyperparameter tuning was performed, as the objective of this study was not to improve the performance of the classifiers but to assess the performance of DBT. More details about the caret package are given by Kuhn et al. [44].
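The four-step process above can be sketched in R as follows. This is a hedged sketch only: the data frame `dataset`, its target column `cls`, the choice of random under sampling as the balancing step, and random forest as the learner are placeholders, and the caret, ROSE, and pROC packages are assumed to be installed.

```r
# Hedged sketch of steps (i)-(iv); 'dataset' and 'cls' are placeholder names.
library(caret)   # model training with default hyperparameters
library(ROSE)    # data balancing applied to the training set only
library(pROC)    # AUC on the held-out test set

set.seed(42)

# (i) 80/20 stratified train/test split
in_train  <- createDataPartition(dataset$cls, p = 0.8, list = FALSE)
train_set <- dataset[in_train, ]
test_set  <- dataset[-in_train, ]

# (ii) balance only the training set (random under sampling shown here)
train_bal <- ovun.sample(cls ~ ., data = train_set,
                         method = "under", seed = 1)$data

# (iii) build the classifier (random forest shown as an example)
fit <- train(cls ~ ., data = train_bal, method = "rf")

# (iv) evaluate on the untouched test set using the AUC
probs <- predict(fit, newdata = test_set, type = "prob")[, 2]
auc(roc(test_set$cls, probs))
```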
We used the Friedman test to compare the performance of the DBT, based on their average ranks over the datasets, separately for each classifier [45,46]. The Friedman test helped us to understand whether there was a significant difference in the performance of the DBT for a given classifier [46]. To report the differences in the performance of DBT, we applied the post hoc Nemenyi test [46,47], which tells us which DBT differed significantly with respect to their performance in the classification task. We used Kendall’s test statistic [48] to test the agreement on the rankings of DBT, based on their performance in the classification task, across varying levels of IR in the dataset. Kendall’s test has previously been used in a similar way to assess the performance of data imputation methods [49]. If the value of Kendall’s ‘w’ is 1, there is complete agreement over the ranking, and if it is 0, there is no agreement over the ranking.
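A hedged sketch of this statistical analysis for one classifier is given below, assuming a hypothetical 25 × 7 matrix `auc_mat` of AUC values (rows = datasets, columns = None, US, OS, HS, ROSE, SMOTE, CBUS). The Friedman test is available in base R; a Nemenyi post hoc test is available, for example, in the PMCMRplus package, and Kendall's W can be derived directly from the Friedman statistic.

```r
# Hedged sketch; 'auc_mat' (25 datasets x 7 strategies) is a placeholder matrix.
ft <- friedman.test(auc_mat)   # blocks = datasets (rows), groups = strategies (columns)
print(ft)

# Kendall's coefficient of concordance W follows from the Friedman statistic:
# W = chi^2_F / (N * (k - 1)), with N datasets and k balancing strategies.
N <- nrow(auc_mat); k <- ncol(auc_mat)
W <- unname(ft$statistic) / (N * (k - 1))
W

# Post hoc pairwise comparison of strategies (assumes PMCMRplus is installed):
# PMCMRplus::frdAllPairsNemenyiTest(auc_mat)
```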
4. Results and Discussion
4.1. Performance of DBT
Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 report the performances of the six classifiers, namely DT, kNN, LR, NB, RF, and SVM, for the different data-balancing strategies, namely None, US, OS, HS, ROSE, SMOTE, and CBUS. The performance of the classifier is measured using the area under the receiver operating characteristics curve, i.e., the AUC value. The first column in each table indicates the name of the imbalanced dataset. The second column indicates the performance of the classifier without balancing the imbalanced dataset (None strategy). Columns 3 to 8 indicate the performance of the classifier for the US, OS, HS, ROSE, SMOTE, and CBUS strategies. The mean rank of the balancing strategies based on their performance over the 25 datasets, along with the Friedman test statistics, is reported in the last two rows of each table. Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 report the post hoc analysis using the Nemenyi multiple comparison test. The Nemenyi statistics are used to understand which DBT differ significantly in performance.
Table 2.
Performance of DBT for DT classifier.
Table 3.
Performance of DBT for kNN classifier.
Table 4.
Performance of DBT for LR.
Table 5.
Performance of DBT for NB classifier.
Table 6.
Performance of DBT for RF.
Table 7.
Performance of DBT for SVM.
Table 8.
Post hoc analysis using Nemenyi multiple comparison test for DT.
Table 9.
Post hoc analysis using Nemenyi multiple comparison test for kNN.
Table 10.
Post hoc analysis using Nemenyi multiple comparison test for LR.
Table 11.
Post hoc analysis using Nemenyi multiple comparison test for NB classifier.
Table 12.
Post hoc analysis using Nemenyi multiple comparison test for RF.
Table 13.
Post hoc analysis using Nemenyi multiple comparison test for SVM.
The Friedman test statistics show that for all the classifiers, the ‘p’ value is less than 0.05. So, we can say that, statistically, there is a significant difference in the performance of DBT. As the difference in the performance of DBT is significant, the Nemenyi test was then applied to find which DBT differ significantly in performance. The following are our observations based on the Friedman statistics and Nemenyi post hoc analysis:
- For the DT classifier, the performance of the None strategy was poor and significantly different than US, OS, HS, SMOTE, and CBUS. However, no difference in the performance was observed between None and ROSE strategies. Further, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS.
- For the kNN classifier, the performance of the None strategy was poor and significantly different than OS, HS, ROSE, SMOTE, and CBUS. However, no difference in the performance was observed between the None and US strategy. Further, no significant difference was observed in the performance of OS, HS, ROSE, and CBUS. Significant difference was observed in the performance of US and OS strategies.
- For the LR classifier, the performance of the None strategy was found to be poor and significantly different than US, OS, HS, and SMOTE. However, no significant difference in the performance was observed between None, ROSE, and CBUS. It was also observed that there was no difference in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
- For the NB classifier, the performance of the None strategy was found to be poor and significantly different to US, SMOTE, and CBUS. However, no difference in the performance was observed between None, OS, HS, and ROSE. Further, no significant difference was observed in the performance of US, SMOTE, and CBUS.
- For the RF classifier, it was found that the performance of the None Strategy was poor and significantly different to the US and CBUS strategies. However, no difference was observed in the performance of the None, OS, HS, ROSE, and SMOTE. Further, no significant difference was observed in the performance of US, HS, SMOTE, and CBUS.
- For SVM, the performance of the None strategy was poor and significantly different than US, OS, HS, SMOTE, and CBUS. However, no significant difference was observed in the performance of None and ROSE. Further, no significant difference was observed in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
Therefore, from all the observations above, we can infer that: (i) the performance of the None and ROSE strategies is poor and significantly different than the others; (ii) no significant difference was observed in the performance of the US, OS, HS, SMOTE, and CBUS strategies. Dealing with imbalanced datasets is a very common problem in classification tasks, and which DBT is more suitable to enhance the performance of the classifier is the most common question that needs to be answered. In this section, we have attempted to answer this question by applying data-preprocessing-level DBT to 25 different datasets using six different classifiers. From the results of the experiment and statistical analysis, we can infer the following: (i) balancing the imbalanced dataset certainly helps to improve the performance of the classifier; (ii) for the DT classifier, the CBUS and US techniques give a better performance; (iii) for logistic regression, the SMOTE and OS techniques give a better performance; (iv) for the Naïve Bayes classifier, US and SMOTE give a better performance; (v) for random forest, the CBUS and US techniques give a better performance; (vi) for the support vector machine, HS and CBUS give better results; (vii) for the kNN classifier, OS and SMOTE give better results. However, it is important to note that every time we apply DBT to an imbalanced dataset, there is no guarantee that the same data will be generated or removed in order to balance it. Therefore, model performance and results could also vary slightly.
4.2. Performance of DBT across the Classifier
In this section, we assess whether the performance of DBT is consistent across classifiers or varies from classifier to classifier. To do this assessment, we used Kendall’s ‘w’ statistic. When ‘w’ is 1, there is complete agreement over the ranks; when ‘w’ is 0, there is complete disagreement over the ranks. The ranks of DBT and the results of Kendall’s statistics are shown in Table 14. The results show that there is agreement over the ranking of the data-balancing techniques; however, the concordance coefficient (w) is 0.562, which indicates that the agreement is only partial. It is observed from Table 14 that the rankings are consistent only for the None and ROSE techniques, whose performance was poor, whereas there was no consistency in the rankings of the US, OS, HS, SMOTE, and CBUS techniques, although their performance was better than that of ROSE and None.
Table 14.
Ranks of DBT for different classifiers.
In this section of the paper, we have attempted to answer the following question: Is performance of DBT consistent across classifiers? The results show that the performance of DBT is not consistent across different classifiers.
4.3. Performance of DBT for Varying Levels of IR in the Dataset
Table 15, Table 16, Table 17, Table 18, Table 19 and Table 20 show the ranks of DBT for the six classifiers for varying levels of IR in the dataset. In order to assess the performance of DBT for varying levels of IR, we used Kendall’s test statistics. The rows in the tables indicate the data-balancing strategy and the columns indicate the range of IR in the dataset. The values in each cell indicate the rank of the DBT for varying levels of IR in the dataset for a given classifier. The last row in each table shows the results of Kendall’s test statistics.
Table 15.
Ranks of DBT for DT for varying levels of IR.
Table 16.
Ranks of DBT for kNN for varying levels of IR.
Table 17.
Ranks of DBT for LR for varying levels of IR.
Table 18.
Ranks of DBT for NB classifier for varying levels of IR.
Table 19.
Ranks of DBT for RF for varying levels of IR.
Table 20.
Ranks of DBT for SVM classifier for varying levels of IR.
From the results of Kendall’s statistics, we can infer that:
- For the DT classifier, there is no agreement over the rankings of DBT as the “p” value is greater than 0.05. This means that for the DT classifier, the performance of the data balancing techniques was not consistent for varying imbalance-ratio percentages.
- For the kNN classifier, there is agreement over the rankings of the data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.593, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that the performance of None seemed consistent, whereas the performance of other DBT was different for varying imbalance-ratio percentages.
- For the LR classifier, there is no agreement over the rankings of data-balancing techniques as the “p” value is greater than 0.05. This means that for the LR classifier, the performance of the data-balancing techniques was not consistent for varying imbalance-ratio percentages.
- For the NB classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.686, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that performance of the None was consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
- For the RF classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.539, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that performance of the None, US, CBUS seemed consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
- For the SVM classifier, there is agreement over the rankings of data-balancing techniques as the “p” value is less than 0.05. However, the concordance value (w) is 0.564, which indicates that there is partial agreement over the rankings. From the ranks of DBT, it is observed that only the performance of the None was consistent, whereas the performance of other data-balancing techniques was different for varying imbalance-ratio percentages.
In this section of the paper, we have attempted to answer the following question: Is the performance of the DBT consistent for varying levels of IR in the dataset? The results of the experiment show that for all the classifiers, the performance of the None and ROSE strategy was poor and consistent for varying levels of IR in the dataset. However, performances of the other DBT were not consistent for varying levels of IR in the dataset.
5. Conclusions and Recommendation for Further Work
In this research paper, we have assessed the performance of six different DBT. The assessment was performed using six different classifiers and 25 different datasets with different levels of IR. The performance of the DBT was assessed through the performance of the classifiers, which was measured using the area under the ROC curve. The experimental results show that: (i) for all six classifiers, the performance of the None and ROSE strategies was poor and significantly different than the others, and there was no significant difference in the performance of the US, OS, HS, SMOTE, and CBUS techniques; (ii) the performance of None and ROSE was poor and consistent across the classifiers, whereas there was no consistency in the performance of the US, OS, HS, SMOTE, and CBUS techniques, although their performance was better than that of the ROSE and None strategies; (iii) there was no agreement over the ranks of DBT for varying levels of IR in the dataset, except for the None and ROSE strategies; (iv) DBT helps to improve the performance of the classifiers, although the performance of ROSE was not significantly different than the None strategy. Thus, from the experimental results, we may infer that DBT helps to improve the performance of the classifier in classification tasks. Further, the performance of the DBT was found to depend on the classification algorithm and on the level of IR in the dataset. These inferences are drawn based on our experimental results.
As stated earlier in the introduction section, we assessed the performance of only data-preprocessing-level data-balancing techniques. However, there is a need to assess the performance of advanced DBT such as algorithm-level solutions, cost-based learning, and ensemble methods.
Author Contributions
Conceptualization, A.J. and S.M.M.; methodology, A.J. and S.M.M.; software, A.J. and S.M.M.; validation, A.J. and S.M.M.; formal analysis, A.J. and S.M.M.; investigation, A.J., H.E., F.K.K. and S.M.M.; resources, A.J., H.E., F.K.K. and S.M.M.; data curation, A.J. and S.M.M.; writing—original draft preparation, A.J., H.E., F.K.K. and S.M.M.; writing—review and editing, A.J., H.E., F.K.K. and S.M.M.; visualization, A.J., H.E., F.K.K. and S.M.M.; supervision, H.E., F.K.K. and S.M.M.; project administration, H.E. and F.K.K.; funding acquisition, H.E. and F.K.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request.
Acknowledgments
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Conflicts of Interest
The authors declare that they have no conflict of interest to report regarding the present study.
References
- Siers, M.J.; Islam, M.Z. Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf. Syst. 2015, 51, 62–71.
- Santos, M.S.; Abreu, P.H.; Laencina, P.J.G.; Simão, A.; Carvalho, A. A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J. Biomed. Inform. 2015, 58, 49–59.
- Zhu, B.; Baesens, B.; Broucke, S.K.L.M. An empirical comparison of techniques for the class imbalance problem in churn prediction. Inf. Sci. 2017, 408, 84–99.
- Thammasiri, D.; Delen, D.; Meesad, P.; Kasap, N. A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition. Expert Syst. Appl. 2014, 41, 321–330.
- Hassan, A.K.I.; Abraham, A. Modeling insurance fraud detection using imbalanced data classification. In Proceedings of the 7th World Congress on Nature and Biologically Inspired Computing (NaBIC2015), Pietermaritzburg, South Africa, 18 November 2015; pp. 117–127.
- Hajian, S.; Ferrer, J.D.; Balleste, A.M. Discrimination prevention in data mining for intrusion and crime detection. In Proceedings of the IEEE Symposium on Computational Intelligence in Cyber Security (CICS), Paris, France, 11–15 April 2011; pp. 1–8.
- Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2012, 42, 463–484.
- Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239.
- Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng. 2006, 30, 1–12.
- Kotsiantis, S.; Pintelas, P. Mixture of Expert Agents for Handling Imbalanced Data Sets. Ann. Math. Comput. TeleInformatics 2003, 1, 46–55.
- Tahir, M.A.; Kittler, J.; Mikolajczyk, K.; Yan, F. A multiple expert approach to the class imbalance problem using inverse random under sampling. In Proceedings of the International Workshop on Multiple Classifier Systems, Reykjavik, Iceland, 10–12 June 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 82–91.
- Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One sided selection. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 8 July 1997; pp. 179–186.
- Cateni, S.; Colla, V.; Vannucci, M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 2014, 135, 32–41.
- Yeh, C.W.; Li, D.C.; Lin, L.S.; Tsai, T.I. A Learning Approach with Under and Over-Sampling for Imbalanced Data Sets. In Proceedings of the 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), Kumamoto, Japan, 10–14 July 2016; pp. 725–729.
- Lunardon, N.; Menardi, G.; Torelli, N. ROSE: A Package for Binary Imbalanced Learning. R J. 2014, 6, 79–89.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Cavtat-Dubrovnik, Dubrovnik, Croatia, 22–26 September 2003; pp. 107–119.
- Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the Second International Workshop on Computer Science and Engineering, Qingdao, China, 28–30 October 2009; pp. 13–17.
- Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE—Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425.
- Lin, W.; Tsai, C.; Hu, Y.; Jhang, J. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26.
- Jadhav, A. Clustering Based Data Preprocessing Technique to Deal with Imbalanced Dataset Problem in Classification Task. In Proceedings of the IEEE Punecon, Pune, India, 30 November–2 December 2018; pp. 1–7.
- Fan, W.; Stolfo, S.J.; Zhang, J.; Chan, P.K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, San Francisco, CA, USA, 27–30 June 1999; pp. 99–105.
- Zhou, Z.; Liu, X. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Trans. Knowl. Data Eng. 2006, 18, 63–77.
- Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 155–164.
- López, V.; Río, S.D.; Benítez, J.M.; Herrera, F. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 2015, 258, 5–38.
- Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378.
- Chen, Z.Y.; Shu, P.; Sun, M. A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. Eur. J. Oper. Res. 2012, 223, 461–472.
- Zhang, Y.; Fu, P.; Liu, W.; Chen, G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput. Appl. 2014, 25, 927–935.
- Kim, S.; Kim, H.; Namkoong, Y. Ordinal Classification of Imbalanced Data with Application in Emergency and Disaster Information Service. IEEE Intell. Syst. 2016, 31, 50–56.
- Godoy, M.D.P.; Fernández, A.; Rivera, A.J.; Jesus, M.J.D. Analysis of an evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets. Pattern Recognit. Lett. 2010, 31, 2375–2388.
- Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197.
- Wang, S.; Yao, X. Diversity analysis on imbalanced data sets by using ensemble models. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN, USA, 30 March–2 April 2009; pp. 324–331.
- Barandela, R.; Valdovinos, R.M.; Sánchez, J.S. New applications of ensembles of classifiers. Pattern Anal. Appl. 2003, 6, 245–256.
- Liao, J.J.; Shih, C.H.; Chen, T.F.; Hsu, M.F. An ensemble-based model for two-class imbalanced financial problem. Econ. Model. 2014, 37, 175–183.
- Susan, S.; Kumar, A. The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art. Eng. Rep. 2021, 3, e12298.
- Halimu, C.; Kasem, A. Split balancing (sBal)—A data preprocessing sampling technique for ensemble methods for binary classification in imbalanced datasets. In Computational Science and Technology; Springer: Singapore, 2021; pp. 241–257.
- Tolba, M.; Ouadfel, S.; Meshoul, S. Hybrid ensemble approaches to online harassment detection in highly imbalanced data. Expert Syst. Appl. 2021, 175, 114751.
- Tao, X.; Zheng, Y.; Chen, W.; Zhang, X.; Qi, L.; Fan, Z.; Huang, S. SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning. Inf. Sci. 2022, 588, 13–51.
- Islam, A.; Belhaouari, S.B.; Rehman, A.U.; Bensmail, H. KNNOR: An oversampling technique for imbalanced datasets. Appl. Soft Comput. 2022, 115, 108288.
- López, V.; Fernández, A.; Torres, J.G.M.; Herrera, F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst. Appl. 2012, 39, 6585–6608.
- Burez, J.; Poel, V.D. Handling class imbalance in customer churn prediction. Expert Syst. Appl. 2009, 36, 4626–4636.
- Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 2011, 17, 255–287.
- Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159.
- Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Kenke, B.; R Core Team. Classification and Regression Training. 2022. Available online: https://cran.r-project.org/web/packages/caret/caret.pdf (accessed on 3 November 2021).
- Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
- Brown, I.; Mues, C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012, 39, 3446–3453.
- Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, University of Princeton, Princeton, NJ, USA, 1963.
- Kendall, M.G.; Smith, B.B. The Problem of m Rankings. Ann. Math. Stat. 1939, 10, 275–287.
- Jadhav, A.; Pramod, D.; Ramanathan, K. Comparison of performance of data imputation methods for numeric dataset. Appl. Artif. Intell. 2019, 33, 913–933.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).