An Empirical Assessment of Performance of Data Balancing Techniques in Classification Task

Abstract: Many real-world classification problems such as fraud detection, intrusion detection, churn prediction, and anomaly detection suffer from the problem of imbalanced datasets. Therefore, in all such classification tasks, we need to balance the imbalanced datasets before building classifiers for prediction purposes. Several data-balancing techniques (DBT) have been discussed in the literature to address this issue. However, little work has been conducted to assess the performance of DBT. Therefore, in this research paper we empirically assess the performance of the data-preprocessing-level data-balancing techniques, namely: Under Sampling (US), Over Sampling (OS), Hybrid Sampling (HS), Random Over Sampling Examples (ROSE), Synthetic Minority Over Sampling Technique (SMOTE), and Clustering-Based Under Sampling (CBUS). We used six different classifiers and twenty-five different datasets, with varying levels of imbalance ratio (IR), to assess the performance of DBT. The experimental results indicate that DBT help to improve the performance of the classifiers. However, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS. It was also observed that the performance of DBT was not consistent across varying levels of IR in the dataset and across different classifiers.


Introduction
Classification is a supervised machine learning (ML) technique used to predict the class label of unseen data by building a classifier from historical data. Classification algorithms usually work with the assumption that the dataset used to build the classifier is balanced. However, many datasets are highly imbalanced. An imbalanced dataset is one in which, with respect to the target class variable, one class greatly outnumbers the others. For example, consider a dataset that contains 1000 transactions, out of which 990 are nonfraudulent and only 10 are fraudulent. This is a good example of a highly imbalanced dataset. Many such examples of imbalanced datasets in classification tasks are discussed in the literature, including software product defect detection [1], survival prediction of hepatocellular carcinoma patients [2], customer churn prediction [3], predicting freshmen student attrition [4], insurance fraud detection [5], and intrusion and crime detection [6]. When we build a classifier using a highly imbalanced dataset, the classifier is usually biased towards the majority class: its performance will be better at correctly predicting majority class cases than minority class cases. In real life, however, we expect a classifier to be unbiased and equally good at correctly predicting both minority and majority cases. Therefore, balancing imbalanced datasets is one of the most important activities, because it helps to reduce bias in model prediction and thereby enhances the classifier's performance.
To address the problem of imbalanced datasets in the classification task, several solutions have been proposed in the literature [7][8][9]. These solutions are broadly divided into several categories, namely data-preprocessing-level solutions, cost-sensitive learning methods, algorithm-level solutions, and ensemble methods. The data-preprocessing-level solutions are based on resampling of the original data. Resampling is performed before building the classifier. Therefore, resampling techniques are easy to implement and are independent of the classifier. Cost-sensitive learning approaches take into account the significance of misclassification of majority and minority class instances. Algorithm-level solutions either suggest a new algorithm or modify existing algorithms. Algorithm-level solutions are dependent on algorithms and require a detailed understanding of the algorithm for implementation. Therefore, algorithm-level solutions are less popular compared to resampling techniques. Ensemble solutions combine ensemble (bagging and boosting) models with resampling techniques or a cost-sensitive approach [7,8].
Though several solutions have been proposed in the literature to deal with the imbalanced dataset problem in classification tasks, there is a lack of research assessing the performance of DBT [7]. Since a large number of solutions have been proposed, it is difficult to assess the performance of all of them. Therefore, we limited the scope of this study to assessing the performance of the resampling techniques used to balance imbalanced datasets at the data-preprocessing level. We chose resampling techniques because they are the most widely used in the literature for dealing with imbalanced dataset problems in classification tasks.
The objectives of this study are: (1) to assess performance of DBT used to balance the imbalanced dataset; (2) to assess whether performance of DBT is independent of the level of imbalance ratio in the dataset; (3) to assess whether performance of DBT is independent of the classifier; (4) to assess whether DBT help to improve the performance of the classifiers.
The rest of the paper is organized as follows. Section 2 describes the theoretical background and related work. In Section 3, we discuss the experimental setup. The results of the experiment are analyzed and discussed in Section 4. Finally, the paper is concluded in Section 5.

Background and Related Work
Machine learning algorithms in classification tasks work with the assumption that the distribution of the data with respect to the target class variable is equal. However, most classification problems suffer from an imbalanced dataset. Therefore, dealing with imbalanced datasets is considered one of the most important activities in classification tasks. In order to deal with this problem, several solutions have been proposed in the literature [7][8][9]. The data-preprocessing-level solutions deal with imbalanced datasets by resampling the data [9,10]. Resampling is performed by over sampling (OS) minority cases, under sampling (US) majority cases, or by combining the US and OS strategies. Using resampling techniques, we can balance the dataset to any desired level of imbalance ratio (IR); it is not necessary that the numbers of majority and minority cases be exactly the same. Resampling techniques are broadly divided into three categories, namely US, OS, and hybrid sampling (HS) [9].
Under Sampling (US): In this method, the dataset is balanced by deleting majority class instances [10,11]. The instances are selected randomly and deleted from the dataset until the dataset is balanced. The weakness of this method is that we might lose some potentially useful information required for the learning process when we remove the instances from the majority class data.
Over Sampling (OS): In this method, the dataset is balanced by randomly over sampling (replicating) minority class instances [12]. This method suffers from duplication of information due to the over sampling of minority class instances, which might lead to overfitting of the model. However, unlike the US method, it does not lose any important information.
Hybrid Sampling (HS): In this method, the dataset is balanced by combining the OS and US approaches [13,14].
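The three random resampling strategies above can be sketched in a few lines. The following is an illustrative Python sketch for a binary dataset (the study itself used R packages; the function name `random_balance` and its interface are our own, not from any of the cited tools):

```python
import random
from collections import Counter

def random_balance(X, y, strategy="under", seed=0):
    """Balance a binary dataset by random under-, over-, or hybrid sampling.

    Illustrative sketch only: real toolkits offer many options beyond this.
    """
    rng = random.Random(seed)
    (maj, n_maj), (mino, n_min) = Counter(y).most_common(2)
    maj_idx = [i for i, lab in enumerate(y) if lab == maj]
    min_idx = [i for i, lab in enumerate(y) if lab == mino]

    if strategy == "under":          # US: drop majority rows down to n_min
        keep = rng.sample(maj_idx, n_min) + min_idx
    elif strategy == "over":         # OS: replicate minority rows up to n_maj
        keep = maj_idx + [rng.choice(min_idx) for _ in range(n_maj)]
    else:                            # HS: meet in the middle
        target = (n_maj + n_min) // 2
        keep = rng.sample(maj_idx, target) + \
               [rng.choice(min_idx) for _ in range(target)]
    return [X[i] for i in keep], [y[i] for i in keep]
```

Note that "under" discards majority information, "over" duplicates minority rows, and the hybrid variant under samples the majority and over samples the minority toward a common target size.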
Random Over Sampling Examples (ROSE): This method is based on smoothed bootstrap-based techniques [15].
Synthetic Minority Over Sampling Technique (SMOTE): This method is an OS technique. Instead of replicating minority class instances, new instances are generated synthetically [16]. Synthetic data are generated as follows: first, a minority class instance is selected at random and its k-nearest neighbors are found; then, new instances are generated synthetically by interpolating between the selected minority class instance and its nearest neighbors. Several variants of the SMOTE method, such as SMOTEBoost [17], MSMOTE [18], and MWMOTE [19], have also been discussed in the literature.
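The interpolation step at the heart of SMOTE can be sketched as follows. This is a minimal illustrative Python version of the idea from [16], not the implementation used in the experiments; library versions handle edge cases (duplicate points, sampling strategy) and use faster neighbor search:

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic minority points: pick a random minority
    instance, find its k nearest neighbours, and interpolate between the
    instance and a randomly chosen neighbour."""
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sqdist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()           # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (nbi - xi)
                          for xi, nbi in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on the segment between a real minority instance and one of its neighbors, SMOTE densifies the minority region rather than duplicating existing rows.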
One-sided Selection Method (OSS): This method falls under the US techniques category. In this method, borderline and redundant majority class instances are removed [12].
Clustering-Based Under Sampling (CBUS): This method is a US strategy in which under sampling is achieved by creating clusters of majority class instances [20]. The number of clusters should be equal to the number of minority class instances. There are two clustering strategies: in the first, the cluster center represents the cluster, and in the second, the nearest neighbor of the cluster center represents the cluster.
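The first CBUS strategy (cluster center as representative) can be sketched with a toy k-means over the majority class. This is our own illustrative code, assuming a plain k-means with k set to the number of minority instances; it is not the implementation from [20]:

```python
import random

def cbus_centers(majority, n_clusters, iters=10, seed=0):
    """Cluster the majority class into n_clusters groups with a minimal
    k-means and return the cluster centres as the retained majority
    samples (first CBUS strategy)."""
    rng = random.Random(seed)
    centers = rng.sample(majority, n_clusters)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in majority:           # assign each point to its nearest centre
            j = min(range(len(centers)),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centers[c])))
            groups[j].append(p)
        # recompute each centre as the mean of its group (keep old if empty)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g
                   else centers[j]
                   for j, g in enumerate(groups)]
    return centers
```

The second strategy would replace each returned center with its nearest actual majority instance, so that only real data points are kept.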
Clustering-Based Over Sampling and Under Sampling (CBOUS): This method is an extension of the CBUS. In this method, data balancing is achieved by combining US and OS approaches by creating clusters of majority and minority class instances [21].
The cost-sensitive methods are based on the assumption that the cost of misclassification of the minority class instances is higher than the cost of misclassification of the majority class instances. Cost-sensitive learning can be incorporated at the data-preprocessing level or at the algorithm level. Cost-sensitive methods are difficult to implement compared to the resampling technique, as detailed knowledge of the algorithm is required if it is to be incorporated into an algorithm. Several cost-sensitive solutions have been discussed in the literature [22][23][24][25][26].
The algorithm-level solutions deal with imbalanced dataset by proposing a new algorithm or modifying an existing algorithm. Some examples of algorithms modified for dealing with imbalanced dataset are discussed in existing literature [27][28][29][30].
A large number of ensemble solutions have been proposed in the literature to deal with the imbalanced dataset problem in the classification task [31][32][33][34]. In this approach, bagging and boosting algorithms are combined with resampling techniques or cost-sensitive learning methods.
Susan and Kumar reviewed state-of-the-art data-balancing strategies used prior to the learning phase. The study discussed the strengths and weaknesses of the techniques and also reported intelligent sampling procedures based on their performance, popularity, and ease of implementation [35]. Halimu and Kasem proposed a data-preprocessing sampling technique named Split Balancing (sBal) for ensemble methods. The proposed method creates multiple balanced bins, and multiple base learners are then induced on these balanced bins. It was found that the sBal method improves classification performance considerably compared to existing ensemble methods [36]. Tolba et al. used SMOTE, NearMiss, cost-sensitive learning, k-Means SMOTE, TextGan, LoRAS, SDS, Clustering-Based Under Sampling, and a VL strategy to balance the imbalanced dataset for automatic detection of online harassment [37]. Tao et al. proposed a novel SVDD boundary-based weighted over sampling approach for dealing with imbalanced and overlapped dataset classification issues [38]. Islam et al. proposed a k-nearest neighbor over sampling approach (KNNOR) for augmentation and for generating synthetic data points for the minority class [39].
A few papers discuss the performance of resampling techniques; the following are some observations based on them. The US technique works well when the number of minority class instances is large (in the hundreds). The OS technique works well when the number of minority class instances is small. When the data size is very large, a combination of the SMOTE and US techniques works well. The study conducted by Lopez et al. [40] compared preprocessing techniques against cost-sensitive learning and found no differences among the data-preprocessing techniques; both preprocessing and cost-sensitive learning were found to be good and equivalent approaches. The study conducted by Thammasiri et al. [4] tested three DBT, namely US, OS, and SMOTE, using four different classifiers, and found that SVM combined with SMOTE gives better results. The study conducted by Burez et al. [41] found that US leads to improved prediction accuracy.
As the focus of this study was to assess the performance of resampling techniques, we assessed the following most-used resampling techniques: US, OS, HS, ROSE, SMOTE, and CBUS.

Experimental Setup and Datasets
We used six different classifiers, namely the Decision Tree (C4.5), k-Nearest Neighbor (kNN), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and Support Vector Machine (SVM), to assess the performance of DBT, rather than relying on a single classifier. Considering six different classifiers also helps us to understand whether the performance of DBT varies across classifiers or remains the same. In this study, we used 25 different small datasets with varying levels of IR. All datasets were downloaded from the KEEL dataset repository [42]. Information about the datasets is given in Table 1. More details about the datasets are available at (https://sci2s.ugr.es/keel/imbalanced.php (accessed on 3 November 2021)). The last column in Table 1 is the Imbalance Ratio (IR), which indicates the proportion of majority class instances to minority class instances.
We built a total of 1050 (25 datasets times 7 balancing strategies, including None, times 6 classifiers) classifiers using the open-source 'R' software. To build each classification model and assess its performance, the following process was used: (i) divide the dataset into training and test sets, with the training set containing 80% of the data and the test set 20%; (ii) apply the DBT to the training set; (iii) build the classification model using the balanced training set; (iv) test the performance of the classification model using the test set. The performance of the classifier was measured using the area under the ROC curve (AUC value) [43]. To train the classifiers, we used the default hyperparameter settings of the caret package in the 'R' tool. No specific hyperparameter tuning was performed, as the objective of this study was not to improve the performance of the classifiers but to assess the performance of the DBT. More details about the caret package are given by Max Kuhn et al. [44].
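The evaluation step (iv) relies on the AUC, which can be computed directly from its rank-sum (Mann-Whitney) interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. This illustrative Python function stands in for the R packages actually used in the study:

```python
def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive class).

    Counts, over all positive/negative pairs, the fraction in which the
    positive instance is scored higher (ties count half).
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to perfect ranking, which is why AUC is a more informative measure than accuracy on imbalanced datasets.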
We used Friedman's test statistics to compare the performances of the DBT, based on their average ranks in the classification task for each dataset [45,46]. The Friedman test statistics helped us to understand whether there was a significant difference in the performance of the DBT [46]. To report the differences in the performance of DBT, we applied the post hoc Nemenyi test [46,47], which tells us which DBT differ significantly with respect to their performance in the classification task. We used Kendall's test statistics [48] to test the agreement on the rankings of the DBT, based on their performance in the classification task, for varying levels of IR in the dataset. Kendall's test has previously been used to assess the performance of imputation methods [49]. If the value of Kendall's 'w' is 1, there is complete agreement over the ranking, and if it is 0, there is no agreement over the ranking.
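For tie-free rankings, Kendall's coefficient of concordance W has a closed form, sketched below as a simple illustrative Python function (the study itself computed it in R):

```python
def kendalls_w(rank_matrix):
    """Kendall's W for m judges ranking k items without ties.

    rank_matrix[i][j] is the rank (1..k) that judge i gives item j.
    W = 12 * S / (m^2 * (k^3 - k)), where S is the sum of squared
    deviations of the per-item rank sums from their mean.
    W = 1 means complete agreement; W = 0 means no agreement.
    """
    m, k = len(rank_matrix), len(rank_matrix[0])
    col_sums = [sum(row[j] for row in rank_matrix) for j in range(k)]
    mean = sum(col_sums) / k                 # equals m * (k + 1) / 2
    s = sum((c - mean) ** 2 for c in col_sums)
    return 12 * s / (m ** 2 * (k ** 3 - k))
```

Here each "judge" is a classifier (or an IR band) and each "item" is a DBT, so W measures how consistently the techniques are ranked across judges.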

Performance of DBT
Tables 2-7 report the performances of the six classifiers, namely DT, kNN, LR, NB, RF, and SVM, for the different data-balancing strategies, namely None, US, OS, HS, ROSE, SMOTE, and CBUS. The performance of the classifier is measured using the area under the receiver operating characteristic curve, i.e., the AUC value. The first column in each table indicates the name of the imbalanced dataset. The second column indicates the performance of the classifier without balancing the imbalanced dataset (the None strategy). Columns 3 to 8 indicate the performance of the classifier for the US, OS, HS, ROSE, SMOTE, and CBUS strategies. The mean rank of the strategies based on their performance over the 25 datasets, along with the Friedman test statistics, is reported in the last two rows of each table. Tables 8-13 report the post hoc analysis using the Nemenyi multiple comparison test. The Nemenyi statistics are used to identify which DBT performances are significantly different. The Friedman test statistics show that for all the classifiers, the 'p' value is less than 0.05, so we can say that, statistically, there is a significant difference in the performance of the DBT. As this difference is significant, the Nemenyi test was then applied to find which DBT differ significantly in performance. The following are our observations based on the Friedman statistics and the Nemenyi post hoc analysis:

1. For the DT classifier, the performance of the None strategy was poor and significantly different than that of US, OS, HS, SMOTE, and CBUS. However, no difference in performance was observed between the None and ROSE strategies. Further, no significant difference was observed in the performance of US, OS, HS, SMOTE, and CBUS.
2. For the kNN classifier, the performance of the None strategy was poor and significantly different than that of OS, HS, ROSE, SMOTE, and CBUS. However, no difference in performance was observed between the None and US strategies. Further, no significant difference was observed in the performance of OS, HS, ROSE, and CBUS. A significant difference was observed in the performance of the US and OS strategies.
3. For the LR classifier, the performance of the None strategy was found to be poor and significantly different than that of US, OS, HS, and SMOTE. However, no significant difference in performance was observed between None, ROSE, and CBUS. It was also observed that there was no difference in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
4. For the NB classifier, the performance of the None strategy was found to be poor and significantly different than that of US, SMOTE, and CBUS. However, no difference in performance was observed between None, OS, HS, and ROSE. Further, no significant difference was observed in the performance of US, SMOTE, and CBUS.
5. For the RF classifier, it was found that the performance of the None strategy was poor and significantly different than that of the US and CBUS strategies. However, no difference was observed in the performance of None, OS, HS, ROSE, and SMOTE. Further, no significant difference was observed in the performance of US, HS, SMOTE, and CBUS.
6. For the SVM classifier, the performance of the None strategy was poor and significantly different than that of US, OS, HS, SMOTE, and CBUS. However, no significant difference was observed in the performance of None and ROSE. Further, no significant difference was observed in the performance of US, OS, HS, ROSE, SMOTE, and CBUS.
From all the observations above, we can infer that: (i) the performance of the None and ROSE strategies was poor and significantly different than that of the others; (ii) no significant difference was observed in the performance of the US, OS, HS, SMOTE, and CBUS strategies. Dealing with imbalanced datasets is a very common problem in classification tasks, and which DBT is most suitable for enhancing the performance of the classifier is the question that most often needs to be answered. In this section, we have attempted to answer this question by applying data-preprocessing-level DBT to 25 different datasets using six different classifiers. From the results of the experiment and the statistical analysis, we can infer the following: (i) balancing the imbalanced dataset certainly helps to improve the performance of the classifier; (ii) for the DT classifier, the CBUS and US techniques give a better performance; (iii) for logistic regression, the SMOTE and OS techniques give a better performance; (iv) for the Naïve Bayes classifier, US and SMOTE give a better performance; (v) for random forest, the CBUS and US techniques give a better performance; (vi) for the support vector machine, HS and CBUS give better results; (vii) for the kNN classifier, OS and SMOTE give better results. However, it is important to note that each time we apply DBT to an imbalanced dataset, there is no guarantee that the same data will be generated or removed in order to balance it. Therefore, model performance and results may also vary slightly.

Performance of DBT across the Classifier
In this section, we assess whether the performance of DBT is consistent across the classifiers or varies from classifier to classifier. To do this assessment, we used Kendall's 'w' statistics: when 'w' is 1, there is complete agreement over the ranks, and when 'w' is 0, there is no agreement over the ranks. The ranks of the DBT and the results of Kendall's statistics are shown in Table 14. The results show that there is agreement over the ranking of the data-balancing techniques; however, the concordance coefficient value (w) is 0.562, which indicates that the agreement is only partial. It is observed from Table 14 that the ranking is consistent only for the None and ROSE techniques, whose performance was poor. There was no consistency in the ranking of the US, OS, HS, SMOTE, and CBUS techniques, but their performance was better than that of ROSE and None. In this section of the paper, we have attempted to answer the following question: Is the performance of DBT consistent across classifiers? The results show that it is not.

Performance of DBT for Varying Levels of IR in the Dataset
Tables 15-20 show the ranks of DBT for the six classifiers for varying levels of IR in the dataset. In order to assess the performance of DBT for varying levels of IR, we used Kendall's test statistics. The rows in the tables indicate the DB strategy and the columns indicate the range of IR in the dataset. The values in each cell indicate the rank of the DBT at a given level of IR for a given classifier. The last row in each table shows the results of Kendall's test statistics.

1. For the DT classifier, there is no agreement over the rankings of DBT, as the "p" value is greater than 0.05. This means that for the DT classifier, the performance of the data-balancing techniques was not consistent for varying imbalance-ratio percentages.
2. For the kNN classifier, there is agreement over the rankings of the data-balancing techniques, as the "p" value is less than 0.05. However, the concordance value (w) is 0.593, which indicates that there is only partial agreement over the rankings. From the ranks of DBT, it is observed that the performance of None seemed consistent, whereas the performance of the other DBT differed for varying imbalance-ratio percentages.
3. For the LR classifier, there is no agreement over the rankings of the data-balancing techniques, as the "p" value is greater than 0.05. This means that for the LR classifier, the performance of the data-balancing techniques was not consistent for varying imbalance-ratio percentages.
4. For the NB classifier, there is agreement over the rankings of the data-balancing techniques, as the "p" value is less than 0.05. However, the concordance value (w) is 0.686, which indicates that there is only partial agreement over the rankings. From the ranks of DBT, it is observed that the performance of None was consistent, whereas the performance of the other data-balancing techniques differed for varying imbalance-ratio percentages.
5. For the RF classifier, there is agreement over the rankings of the data-balancing techniques, as the "p" value is less than 0.05. However, the concordance value (w) is 0.539, which indicates that there is only partial agreement over the rankings. From the ranks of DBT, it is observed that the performance of None, US, and CBUS seemed consistent, whereas the performance of the other data-balancing techniques differed for varying imbalance-ratio percentages.
6. For the SVM classifier, there is agreement over the rankings of the data-balancing techniques, as the "p" value is less than 0.05. However, the concordance value (w) is 0.564, which indicates that there is only partial agreement over the rankings. From the ranks of DBT, it is observed that only the performance of None was consistent, whereas the performance of the other data-balancing techniques differed for varying imbalance-ratio percentages.
In this section of the paper, we have attempted to answer the following question: Is the performance of the DBT consistent for varying levels of IR in the dataset? The results of the experiment show that for all the classifiers, the performance of the None and ROSE strategy was poor and consistent for varying levels of IR in the dataset. However, performances of the other DBT were not consistent for varying levels of IR in the dataset.

Conclusions and Recommendation for Further Work
In this research paper, we have assessed the performance of six different DBT. The assessment was performed using six different classifiers and 25 datasets with different levels of IR. The performance of the DBT was assessed through the performance of the classifiers, measured using the area under the ROC curve. The experimental results show that: (i) for all six classifiers, the performance of the None and ROSE strategies was poor and significantly different than that of the others, and there was no significant difference in the performance of the US, OS, HS, SMOTE, and CBUS techniques; (ii) the performance of None and ROSE was poor and consistent across the classifiers, whereas there was no consistency in the performance of the US, OS, HS, SMOTE, and CBUS techniques, although their performance was better than that of the ROSE and None strategies; (iii) there was no agreement over the ranks of the DBT for varying levels of IR in the dataset, except for the None and ROSE strategies; (iv) DBT help to improve the performance of the classifiers; however, the performance of ROSE was not significantly different than that of the None strategy. Thus, from the experimental results, we may infer that DBT help to improve the performance of the classifier in classification tasks, but that the performance of the DBT is not independent of the classification algorithm or the level of IR in the dataset. These inferences are drawn based on our experimental results.
As stated earlier in the introduction section, we assessed the performance of only the data-preprocessing-level data-balancing techniques. However, there is a need to assess the performance of more advanced DBT, such as algorithm-level solutions, cost-sensitive learning, and ensemble methods.

Data Availability Statement:
The data presented in this study are available on request.