Addressing Class Overlap under Imbalanced Distribution: An Improved Method and Two Metrics

Abstract: Class imbalance, as a phenomenon of asymmetry, has an adverse effect on the performance of most machine learning algorithms, and class overlap is another important factor that affects their classification performance. This paper deals with the two factors simultaneously, addressing the class overlap under imbalanced distribution. A theoretical analysis is firstly conducted on the existing class overlap metrics. Then, an improved method and the corresponding metrics to evaluate the class overlap under imbalanced distributions are proposed based on the theoretical analysis. A well-known collection of imbalanced datasets is used to compare the performance of different metrics, and the performance is evaluated based on the Pearson correlation coefficient and the ξ correlation coefficient. The experimental results demonstrate that the proposed class overlap metrics outperform the other compared metrics for the imbalanced datasets, and the Pearson correlation coefficient with the AUC metric of eight algorithms can be improved by 34.7488% on average.


Introduction
Machine learning has been widely applied to solve problems in various fields. One of the common and important challenges in solving these problems by machine learning is classification under imbalanced distribution [1]. The imbalance is encountered in a large number of applications where the concerned samples are rare, such as disease diagnosis, financial fraud detection, network intrusion detection, and so on [2]. The data distributions in these fields are asymmetric in that the number of concerned positive samples is smaller than that of negative samples. Most standard classification algorithms are designed based on the concept of symmetry, a relatively balanced class distribution, or equal costs of misclassification [3]. The classification performance of these algorithms is degraded to some extent when handling the imbalance problem. Hence, building symmetry in machine learning for data under asymmetric distribution is an important research topic [4]. In [5], a novel class imbalance reduction algorithm is proposed to build symmetry by considering the distribution properties of the dataset to improve the performance of software defect prediction. In addition, many other methods have been proposed to handle the imbalance problem, which can be found in [2,6].
Besides the imbalance, class overlap is also an important factor that affects the performance of classification [7]. The research of Liu et al. [8] demonstrated that a sample is often misclassified if it lies in a class overlapping boundary. Oh [9] proposed the R value based on the ratio of overlapping areas to the whole dataset, and the experimental results show that the R value is strongly correlated with the classification accuracy. In addition, Denil [10] gave a systematic analysis of imbalance and overlap, which shows that the overlap problem has a greater influence on classification performance than the imbalance in isolation, and that the classification performance decreases significantly when overlap and imbalance both exist. To deal with the classification of datasets with class overlap and imbalance, some research works have also been conducted [7,11].
The classification methods for datasets with class overlap and imbalance are important, but so are the quantitative estimation methods of the class overlap level for imbalanced datasets [9]. Such estimation contributes to understanding the characteristics of the datasets and thus helps in designing suitable methods for better classification performance. Klomsae et al. [12] adopted the R value to indicate the classification performance on a dataset and proposed a string grammar fuzzy-possibilistic C-medians algorithm to handle the overlapping data problem. In addition, some methods based on the R value have been proposed to conduct feature selection [13,14], feature construction [15], and data sampling [16] for achieving better classification performance. Later, Borsos et al. [17] analyzed the problem of the R value for estimating the overlap level of imbalanced datasets and extended the R value to the R_aug value for imbalanced datasets. The experimental results demonstrate that the R_aug value has a stronger correlation with the classification performance, and it can also achieve better performance in algorithm selection. In addition, some feature selection research works have also been conducted based on the R_aug value [18,19].
The R_aug value has achieved great performance for addressing the class overlap under imbalanced distribution. However, the experimental results in [17] show that the absolute values of the Pearson correlation coefficients of the R_aug with the classification performances are lower than 0.7, and the correlation coefficients vary across different algorithms. Therefore, both the correlation with the classification performance and the generalization ability across different classification algorithms need to be improved. For this purpose, a theoretical analysis of the existing class overlap metrics is firstly conducted, and then an improved method is proposed to measure the class overlap for imbalanced datasets in this paper. Based on the proposed method, the R and R_aug are extended to ImR and ImR_aug for better estimating the class overlap level of imbalanced datasets. Comparison experiments are conducted on a well-known collection of imbalanced datasets, and eight commonly used classification algorithms are adopted to obtain the classification performance. The performances of the different overlap metrics are evaluated based on the Pearson correlation coefficient and the ξ correlation coefficient with the classification performance. The experimental results demonstrate the excellent performance of the proposed metrics, which indicates the superiority of the proposed method.
The contributions of this paper can be summarized as follows:
• A theoretical analysis of the existing class overlap measure, the R value, is presented.
• A novel method, along with two metrics for estimating the class overlap of imbalanced datasets, is proposed based on the theoretical analysis.
• The proposed two class overlap metrics are verified to have higher correlations with the classification performance on imbalanced datasets.
The rest of the paper is organized as follows. The existing overlap metrics, the R and R aug values, are introduced in Section 2. Section 3 presents a theoretical analysis on the R value. Then, an improved method and two corresponding overlap metrics for imbalanced datasets are proposed based on the theoretical analysis. In addition, Section 4 describes the information about the experiments, such as experiment setup, adopted datasets, and performance evaluation. The experimental results and discussions are given in Section 5. Finally, the conclusions are drawn in Section 6.

The Existing Overlap Metrics
To estimate the level of class overlap, there are two metrics: the R value and the R_aug value, which are introduced in this section. Before introducing the two metrics, some notations are presented as follows:
• N: the number of samples
• n: the number of classes
• C_l: the set of samples belonging to class l, l ∈ [1, n]
• U: the set of all samples, U = C_1 ∪ C_2 ∪ ... ∪ C_n

The R Value
The original R value is proposed by Oh [9] based on the assumption that a sample from class C_l is overlapped with other samples if the number of samples that are among its k nearest neighbors and belong to a class other than C_l is at least θ + 1. The R values of class C_l and of the whole dataset are defined as Equations (1) and (2), respectively:

R(C_l) = (1 / |C_l|) · Σ_{x_i ∈ C_l} I( |{ x_j ∈ kNN(x_i) : x_j ∉ C_l }| ≥ θ + 1 )   (1)

R = (1 / N) · Σ_{l=1}^{n} |C_l| · R(C_l)   (2)

where kNN(x_i) denotes the k nearest neighbors of x_i and I(·) is the indicator function. According to the definition, the R value can be considered as the ratio of samples in the overlapping area, and its range is [0, 1]. In addition, two parameters, k and θ, need to be predefined to calculate the R value. According to [9], the R value is strongly correlated with the accuracy of the Support Vector Machine (SVM), Artificial Neural Network (ANN), and K Nearest Neighbor (KNN) algorithms when k = 7 and θ = 3. With this parameter setting, a sample is considered to be in the overlapping area if at least four samples among its seven nearest neighbors belong to another class.
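As a concrete illustration, the R value described above can be sketched in Python using scikit-learn's nearest-neighbor search; the function name and the per-class return value are our own choices for this sketch, not part of [9]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_value(X, y, k=7, theta=3):
    """Sketch of the R value of Oh [9]: the fraction of samples that have
    at least theta + 1 samples of another class among their k nearest
    neighbors."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # k + 1 neighbors are queried because each sample is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # count the neighbors (self excluded) whose class differs from the query sample
    n_other = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)
    in_overlap = n_other >= theta + 1
    per_class = {c: in_overlap[y == c].mean() for c in np.unique(y)}
    return in_overlap.mean(), per_class  # R of the whole dataset, R per class
```

With k = 7 and θ = 3, two well-separated classes give R = 0, while heavily mixed classes push R toward 1.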

The R_aug Value
In [9], the results also show that the R value is most strongly correlated with the classification accuracy of the majority class. In addition, Borsos et al. [17] evaluated the R value on some imbalanced datasets with different imbalance ratios, and the experimental results showed that the R value is almost constant while the classification performance decreases with a larger imbalance ratio (IR). Therefore, they conducted an analysis of the R value in a simple case with only two classes, the majority class and the minority class, denoted as C_N and C_P, respectively. Then, the R value of the whole dataset can be calculated as Equation (3):

R = (|C_N| · R(C_N) + |C_P| · R(C_P)) / (|C_N| + |C_P|)   (3)

By introducing the imbalance ratio IR = |C_N| / |C_P| into Equation (3), the equation can be simplified as Equation (4):

R = (IR · R(C_N) + R(C_P)) / (IR + 1)   (4)

As the weight of the R value of the majority class is IR, the R value of the whole dataset will be dominated by the R value of the majority class when the dataset has a large IR. Actually, a sample in the majority class has a low probability of being recognized in the overlap area, while a sample in the minority class has a high probability. Therefore, the weight of the R value of the majority class should be smaller than that of the minority class. Based on this analysis, the augmented R value defined as Equation (5) is proposed by Borsos et al. [17]:

R_aug = (R(C_N) + IR · R(C_P)) / (1 + IR)   (5)

It can be seen that the R_aug value is dominated by the R value of the minority class for datasets with large IR, and it is equal to the R value for balanced datasets with IR = 1. The experimental results in [17] demonstrated that it achieves a stronger correlation with the classification performance evaluated by the area under the Receiver Operating Characteristic (ROC) curve.
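Under these definitions, the R_aug value for a binary dataset can be sketched as follows; the neighbor counting is the same as for the R value, and the function name is ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_aug(X, y, minority_class, k=7, theta=3):
    """Sketch of the augmented R value of Borsos et al. [17] for a binary
    dataset: the minority-class R value is weighted by the imbalance ratio."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    in_overlap = (y[idx[:, 1:]] != y[:, None]).sum(axis=1) >= theta + 1
    minority = (y == minority_class)
    r_min = in_overlap[minority].mean()       # R(C_P)
    r_maj = in_overlap[~minority].mean()      # R(C_N)
    ir = (~minority).sum() / minority.sum()   # IR = |C_N| / |C_P|
    return (r_maj + ir * r_min) / (1 + ir)    # minority R weighted by IR
```

For IR = 1 this reduces to the plain average of the two per-class R values, which for equally sized classes is the R value itself.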

The Proposed Method and Corresponding Overlap Metrics
In this section, the theoretical analysis of the R value is firstly conducted. Then, an improved method to recognize the overlap area for imbalanced datasets is proposed and the corresponding overlap metrics are introduced.

Theoretical Analysis of the R Value
Consider a dataset with N samples X = {x_1, ..., x_i, ..., x_N}, and denote by r_i the number of samples in the same class as the sample x_i. In the ideal non-overlapping data distribution, all samples are distributed so well that the r_i nearest samples of the sample x_i, along with x_i itself, are all in the same class, while in the real data distribution there may be some samples among the r_i nearest samples of x_i that are in a different class from that of x_i.
For simplicity, only consider the k_i nearest samples (k_i < r_i) of the sample x_i in the real data distribution. As shown in Figure 1, let P_i denote the set of the r_i nearest samples of x_i in the ideal data distribution, and let Q_i denote the set of the k_i nearest samples of x_i in the real data distribution. Any sample except x_i can be represented by x^TP, x^FP, x^FN, or x^TN. The meanings of these four kinds of samples are as follows:
• x^TP: a sample that is in both P_i and Q_i
• x^TN: a sample that is in neither P_i nor Q_i
• x^FN: a sample that is in P_i but not in Q_i
• x^FP: a sample that is not in P_i but in Q_i
Actually, the contribution of x_i to the R value can be determined based on the probability distribution. In the following, the contribution is analyzed based on the distance between the real data distribution and the ideal data distribution from the perspective of probability. To measure the distance between distributions, the Kullback-Leibler (K-L) divergence is commonly used. To conduct the calculation of the K-L divergence, denote p_j|i and q_j|i as the conditional probabilities of sample x_j given x_i for the ideal data distribution and the real data distribution, respectively. The conditional probability of the samples in P_i and Q_i can be defined to be an equal non-zero probability, and the conditional probability of the other samples can be defined to be a near-zero probability δ according to [20]. Then, the detailed conditional probabilities p_j|i and q_j|i are shown in Equations (6) and (7):

p_j|i = 1/r_i if x_j ∈ P_i, and p_j|i = δ otherwise   (6)

q_j|i = 1/k_i if x_j ∈ Q_i, and q_j|i = δ otherwise   (7)

Based on the conditional probabilities, the distance, i.e., the K-L divergence D(q_j|i, p_j|i), can be defined as Equation (8) shows:

D(q_j|i, p_j|i) = Σ_{j ≠ i} q_j|i · log(q_j|i / p_j|i)   (8)

Due to the near-zero value of δ, the distance can be simplified to Equation (9).
D(q_j|i, p_j|i) ≈ (N_i^TP / k_i) · log(r_i / k_i) + (N_i^FP / k_i) · log(1 / (k_i · δ))   (9)

where N_i^TP and N_i^FP denote the numbers of x^TP and x^FP samples for x_i. According to Equation (9), N_i^TP and N_i^FP dominate the distance, which is consistent with the definition of the R value. Equation (9) can be further simplified to Equation (10) because of the near-zero value of δ:

D(q_j|i, p_j|i) ≈ (N_i^TP / k_i) · log(r_i / k_i) + (N_i^FP / k_i) · log(1 / δ)   (10)

It can be seen that the distance is actually dominated by the ratio of N_i^FP to k_i (k_i = N_i^FP + N_i^TP), as log(1/δ) is very large. Therefore, a reasonable threshold for judging whether x_i has a contribution to the R value is N_i^FP / k_i > 0.5; with k = 7, N_i^FP should be at least 4, which is consistent with the implementation of the R value in the experiments of [9]. In addition, the sample x_i is correctly classified by the KNN algorithm if N_i^TP > N_i^FP among its k nearest neighbors, which is exactly the opposite of the overlap condition in the definition of the R value. Therefore, the R value can be strongly and negatively correlated with the accuracy of the KNN algorithm.
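The simplification of the K-L divergence can be made explicit. Under the equal-probability assignments above (probability 1/r_i on P_i, 1/k_i on Q_i, and a near-zero δ elsewhere), each of the four kinds of samples contributes one term; this expansion is our own working of the step described in the text:

```latex
\begin{aligned}
D(q_{j|i}, p_{j|i})
  &= \sum_{j \neq i} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}} \\
  &= \underbrace{\frac{N^{TP}_i}{k_i} \log\frac{r_i}{k_i}}_{x^{TP}:\; q=1/k_i,\; p=1/r_i}
   + \underbrace{\frac{N^{FP}_i}{k_i} \log\frac{1}{k_i \delta}}_{x^{FP}:\; q=1/k_i,\; p=\delta}
   + \underbrace{N^{FN}_i \, \delta \log(r_i \delta)}_{x^{FN}:\; q=\delta,\; p=1/r_i}
   + \underbrace{N^{TN}_i \, \delta \log 1}_{x^{TN}:\; q=\delta,\; p=\delta} \\
  &\xrightarrow{\;\delta \to 0\;}
     \frac{N^{TP}_i}{k_i} \log\frac{r_i}{k_i}
   + \frac{N^{FP}_i}{k_i} \log\frac{1}{k_i \delta} .
\end{aligned}
```

The x^FN and x^TN terms vanish with δ, and since log(1/δ) grows without bound, the x^FP term dominates, which is why the overlap decision reduces to thresholding N_i^FP / k_i at 0.5.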

The Proposed Method
From the above theoretical analysis, it can be seen that the class overlap is actually determined not only by N_i^FP but also by N_i^TP, while the contribution of N_i^TP is ignored by the R value. As Equation (10) shows, it can be omitted because the coefficient log(r_i / k_i) is constant for balanced datasets when the same k_i is used for all samples. However, when the same k_i is adopted for an imbalanced dataset, the coefficient varies across classes due to the different r_i. To make the coefficients of N_i^TP equal for different classes in an imbalanced dataset, the condition shown in Equation (11) should be satisfied:

log(|C_N| / k_N) = log(|C_P| / k_P)   (11)

Then, Equation (12) can be obtained:

k_P = k_N · |C_P| / |C_N| = k_N / IR   (12)

Besides, the same result is obtained if the deduction is conducted based on the Hellinger distance, which can be found in Appendix A. This indicates that the adopted value of k should be in proportion to the number of samples in the class and that a smaller value of k should be adopted for the minority class. According to Equation (12), if k is used for the majority class, ⌈k/IR⌉ should be used for the minority class, as the neighborhood size must be a positive integer. Based on this analysis, an improved method to calculate the overlap of different classes is proposed as Equation (13) shows:

k_l = k if C_l = C_N, and k_l = ⌈k / IR⌉ if C_l = C_P   (13)

where C_N denotes the majority class. In this way, the samples in the minority class will not be considered to be in an overlap area too easily. An intuitive demonstration of how the proposed method works is presented in Figure 2. For the sample x_i^N in the majority class, both k = 3 and k = 5 are suitable to decide whether x_i^N is in the class overlap region, while the sample x_i^P in the minority class will be recognized to be in the overlap region when k = 5. When k = ⌈5/2⌉ = 3, x_i^P can be correctly recognized to be in the non-overlap region.
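The class-dependent neighborhood size can be sketched as a small helper; the function name and the guard max(1, ⌈k/IR⌉) are our choices for illustration:

```python
import numpy as np

def class_dependent_k(y, k=7):
    """Sketch of the proposed rule: keep k for the majority class and shrink
    it to ceil(k / IR) for each smaller class, so that the coefficient
    log(r_i / k_i) in the distance is (approximately) equal across classes."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    ir = counts.max() / counts                    # per-class ratio vs. the majority
    ks = np.maximum(1, np.ceil(k / ir)).astype(int)
    return dict(zip(classes.tolist(), ks.tolist()))
```

For a binary dataset with 10 majority and 2 minority samples and k = 7, this yields k = 7 for the majority class and ⌈7/5⌉ = 2 for the minority class.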

The Proposed Metrics
According to the proposed method, two improved overlap metrics based on the R and R_aug for binary imbalanced datasets can be introduced. The two metrics, denoted as ImR and ImR_aug, are defined as Equations (14) and (15):

ImR = (IR · R_k(C_N) + R_⌈k/IR⌉(C_P)) / (1 + IR)   (14)

ImR_aug = (R_k(C_N) + IR · R_⌈k/IR⌉(C_P)) / (1 + IR)   (15)

where R_k(C_l) denotes the R value of class C_l computed with the class-dependent neighborhood size given by Equation (13). According to the definition of the two metrics, they are both equal to the original R value when the dataset is balanced (IR = 1). In addition, the experimental results in [9,17] demonstrate that the R and R_aug are strongly correlated with the accuracy and the area under the ROC curve (AUC), respectively. Therefore, it is expected that ImR is more strongly correlated with the accuracy and ImR_aug is more strongly correlated with the AUC on imbalanced datasets.
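A sketch combining the pieces: per-class R values are computed with the class-dependent neighborhood sizes (k for the majority class, ⌈k/IR⌉ for the minority class, with θ_l = ⌊k_l/2⌋ in each class) and then combined with the R-style and R_aug-style weightings. The function name and this exact combination rule are our reading of the definitions, not a reference implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def im_r_metrics(X, y, minority_class, k=7):
    """Sketch of ImR and ImR_aug for a binary dataset: per-class R values
    with class-dependent k, combined with R-style and R_aug-style weights."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    minority = (y == minority_class)
    ir = (~minority).sum() / minority.sum()        # IR = |C_N| / |C_P|
    per_class_r = {}
    for is_min, k_l in ((False, k), (True, int(np.ceil(k / ir)))):
        mask = minority == is_min
        nn = NearestNeighbors(n_neighbors=k_l + 1).fit(X)
        _, idx = nn.kneighbors(X[mask])            # self sits in column 0
        n_other = (y[idx[:, 1:]] != y[mask][:, None]).sum(axis=1)
        per_class_r[is_min] = (n_other >= k_l // 2 + 1).mean()
    im_r = (ir * per_class_r[False] + per_class_r[True]) / (1 + ir)
    im_r_aug = (per_class_r[False] + ir * per_class_r[True]) / (1 + ir)
    return im_r, im_r_aug
```

For IR = 1 the two class-wise k values coincide and both metrics reduce to the original R value, as stated above.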

Experiment Design
In this section, the experiment setup is firstly introduced. Then, the datasets adopted in the experiments are presented. Finally, the evaluation metric for the comparison of different class overlap metrics is described.

Experiment Setup
The experiments are conducted to prove the effectiveness of the proposed method and the two metrics for addressing the class overlap of imbalanced datasets. To evaluate the effectiveness, not only the correlations of the different overlap metrics with the classification performance but also the time consumption of the overlap metrics and of the classification modeling are compared. The Pearson correlation coefficient and the ξ correlation coefficient are adopted to obtain the correlation results. The Pearson correlation coefficient can only capture linear correlations, while the ξ correlation coefficient can deal with both linear and nonlinear correlations.
According to the investigation of Guo et al. in [2], the AUC and accuracy are the most frequently used metrics for evaluating classification performance. The AUC is obtained based on the ROC curve, which consists of a series of (false positive rate, true positive rate) points [21]; the points are generated by varying the threshold on the prediction probability of the classifier. As the AUC is robust to imbalanced datasets [22], it is recognized as an objective metric and widely utilized to evaluate the classification performance for imbalance problems. The accuracy is defined as the ratio of the number of correctly predicted samples to the number of samples in the whole dataset. Although the accuracy has been proved to be biased toward the majority class, it is still frequently used in the research on imbalance learning, as it is the most general and intuitive metric [2]. Based on the analysis in Section 3, the proposed metrics ImR and ImR_aug are expected to be strongly correlated with the accuracy and the AUC, respectively. Therefore, both metrics are adopted to evaluate the classification performance for a better evaluation of the proposed overlap metrics.
Moreover, for the comparison of the generalization ability, eight commonly used algorithms are adopted to obtain the classification performance, and the performances are obtained based on 5-fold cross validation. The eight classification algorithms are k-nearest neighbor (KNN), Naive Bayes (NB), Support Vector Machine with a linear kernel (SVM-L), Support Vector Machine with a radial basis function kernel (SVM-R), Decision Tree (DT), Multilayer Perceptron (MLP), Random Forest (RF), and Adaptive Boosting (AdaB). All methods in the experiments are implemented in Python based on packages such as scikit-learn [23]. The parameters of the eight classification algorithms are set to their defaults in the experiments. In addition, the same parameter setting, k = 7 and θ = ⌊k/2⌋ = 3, is adopted for all overlap metrics.
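The setup above can be sketched as follows; the dataset here is a synthetic stand-in generated with make_classification (not one of the KEEL datasets), and default hyperparameters are kept except where needed for convergence:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Hypothetical imbalanced dataset (about 9:1) standing in for a KEEL dataset.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "SVM-L": SVC(kernel="linear"),
    "SVM-R": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AdaB": AdaBoostClassifier(random_state=0),
}
# Mean 5-fold AUC per algorithm, as used to correlate with the overlap metrics.
aucs = {name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        for name, clf in classifiers.items()}
```

Replacing `scoring="roc_auc"` with `scoring="accuracy"` gives the accuracy-based performance used in the second half of the comparison.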

Datasets
In the experiments, a relatively well-known collection of 66 datasets for imbalanced classification is utilized. This collection can be obtained from the KEEL repository and has been adopted in [17,24]. The descriptions of these datasets are shown in Table 1, where #Inst. and #Attrs indicate the number of samples and attributes, respectively, and IR means the imbalance ratio. The imbalance ratios of the datasets are in a very wide range. The minimum imbalance ratio is 1.82, while the maximum imbalance ratio is 128.87.

Evaluation of Correlation
The Pearson correlation coefficient [25] is defined to measure the strength of the linear relationship between two variables in statistics. It is shown in Equation (16), where X and Y are two variables, X̄ and Ȳ are their mean values, cov(·,·) is the covariance, and σ is the standard deviation:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_{i=1}^{n} (X_i − X̄)² · Σ_{i=1}^{n} (Y_i − Ȳ)² )   (16)

The Pearson correlation coefficient is widely utilized to calculate the linear correlation of two variables. It was used to compare the performance of R and R_aug in [17], and it is also used in the experiments of this paper. The range of the Pearson correlation coefficient ρ is [−1, 1], and the larger the absolute value of ρ, the stronger the correlation between the two variables: ρ = 1 indicates a perfect positive linear correlation, while ρ = −1 means a perfect negative one. When ρ = 0, no linear correlation can be found between the two variables.
To further verify the linear correlation between the proposed metrics and the classification performance, the p-values of the Pearson correlation results are also compared. A smaller p-value indicates stronger support for the result of the Pearson correlation coefficient. Generally, a p-value smaller than 0.05 means that the result of the linear correlation is solid, and the result is significantly solid when the p-value is smaller than 0.01.
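As an illustration, both the coefficient and its p-value come directly from scipy.stats.pearsonr; the overlap and AUC values below are synthetic, constructed only to show how ρ and the p-value are read:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic example: 66 hypothetical datasets where higher overlap lowers AUC.
rng = np.random.default_rng(0)
overlap = rng.uniform(0.0, 1.0, 66)
auc = 1.0 - 0.8 * overlap + rng.normal(0.0, 0.05, 66)

rho, p_value = pearsonr(overlap, auc)
# A strongly negative rho with p_value < 0.01 marks the negative linear
# correlation as significantly solid in the sense used above.
```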
Besides the Pearson correlation coefficient, the ξ correlation coefficient is also used to evaluate the relationships of the different overlap metrics with the classification performance. The ξ correlation coefficient, which was proposed in [26], can measure not only linear correlation but also nonlinear correlation. To calculate the ξ correlation coefficient of a pair of variables (X, Y), the data should be rearranged as (X_1, Y_1), ..., (X_n, Y_n) such that X_1 ≤ ··· ≤ X_n. Let h_i be the number of j such that Y_j ≤ Y_i and l_i be the number of j such that Y_j ≥ Y_i; then, the ξ correlation coefficient is defined as Equation (17) shows. It is in the range of [0, 1], where ξ(X, Y) = 0 indicates that X and Y are independent and ξ(X, Y) = 1 indicates that Y is a measurable function of X:

ξ(X, Y) = 1 − ( n · Σ_{i=1}^{n−1} |h_{i+1} − h_i| ) / ( 2 · Σ_{i=1}^{n} l_i · (n − l_i) )   (17)
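A direct implementation of Equation (17), ties allowed, can be sketched as follows; the function name is ours:

```python
import numpy as np

def xi_corr(x, y):
    """Sketch of the xi correlation coefficient [26]: sort the pairs by x,
    rank the y values, and measure how much consecutive ranks jump; small
    jumps mean y is close to a (measurable) function of x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    y = y[np.argsort(x, kind="stable")]           # rearrange so x is non-decreasing
    h = np.array([(y <= yi).sum() for yi in y])   # h_i = #{j : y_j <= y_i}
    l = np.array([(y >= yi).sum() for yi in y])   # l_i = #{j : y_j >= y_i}
    return 1.0 - n * np.abs(np.diff(h)).sum() / (2.0 * (l * (n - l)).sum())
```

For a strictly monotone relationship without ties this evaluates to 1 − 3/(n + 1), approaching 1 as n grows, while for independent variables it concentrates around 0.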

Results and Discussion
To verify the effectiveness of the proposed method to measure class overlap, the correlation results of the different metrics with both the AUC and the accuracy are compared. As the AUC metric is more objective than the accuracy for imbalance learning, the comparison of the correlations of the different overlap metrics with the AUC of the different algorithms is conducted first. Then, the correlation results of the different metrics with the accuracy are compared. Finally, the overall results are summarized.

Figure 3 shows the correlation results of the different overlap metrics with the AUCs of the classification algorithms. The results demonstrate that the ImR_aug metric achieves the best performance among these metrics for all classification algorithms. The ImR metric also obtains better performance than the original R metric based on the AUC of the classification algorithms. In addition, both the original R and R_aug values have low correlation coefficients with the AUC of the NB algorithm, which shows that the generalization ability of the existing overlap metrics is not good. However, the correlation coefficients of the R and R_aug with the NB algorithm are both largely improved by the proposed method, so the ImR and ImR_aug have better generalization abilities for these algorithms than the R and R_aug, respectively. As expected, the R_aug and ImR_aug achieve much stronger correlations with the AUC of the different algorithms, which is consistent with the result in [17]. In the following, the detailed correlation results of the R_aug and ImR_aug with the different classification algorithms are compared. The detailed correlation coefficients along with the p-values of the R_aug and ImR_aug with the AUC of the different algorithms are presented in Table 2.
It can be seen that the correlation coefficients of the R_aug with the AUC of the DT, MLP, RF, and AdaB algorithms are all lower than 0.7, while the correlation coefficients with the AUC of these algorithms are all improved to more than 0.8 by ImR_aug. On average, the ImR_aug achieves a 34.7488% improvement over the R_aug. An illustration of the correlation coefficient improvements of ImR_aug over R_aug is shown in Figure 4; the correlation coefficient between R_aug and the AUC of the NB algorithm is largely improved by the proposed ImR_aug. In addition, all the p-values of ImR_aug are far less than 0.01, which indicates that ImR_aug does have linear correlations with the AUC of these algorithms, and the p-values of ImR_aug are also much less than those of R_aug. Therefore, the ImR_aug has a much better performance than the R_aug for estimating the level of class overlap under imbalanced distribution. Moreover, ImR_aug not only achieves a better mean value of the correlations with the AUCs of all classification algorithms but also a smaller standard deviation, which demonstrates that ImR_aug also has a better generalization ability to these algorithms.

Figure 5 demonstrates the ξ correlation coefficients of the AUCs of the different algorithms with the R_aug and ImR_aug. It can be seen that the result of the ξ correlation coefficient is similar to the result of the Pearson correlation coefficient. The ξ correlation coefficient of the AUC of the RF algorithm with the ImR_aug is the highest, and the correlation coefficient of the NB algorithm with R_aug is largely improved by ImR_aug. In addition, the ξ correlation coefficients of the AUCs of the different algorithms with the ImR_aug are all higher than those with the R_aug. Therefore, the comparison of the ξ correlation coefficients also demonstrates the superiority of ImR_aug. Figure 6 shows the correlation results of R and ImR with the accuracy of the classification algorithms.
It can be seen that the ImR achieves better performance than the original R for more classification algorithms, and the ImR_aug measure also obtains better performance than the R_aug measure based on the accuracy of the classification algorithms. The detailed correlation coefficients and p-values of the R and ImR with the accuracy of the different algorithms are shown in Table 3. As expected, the ImR is strongly correlated with the accuracy of the KNN algorithm: its Pearson correlation coefficient with the accuracy of the KNN algorithm is more than 0.9. The ImR also achieves high correlation coefficients with the accuracy of the SVM-R, DT, MLP, RF, and AdaB algorithms. Although the correlation coefficients of the R and ImR with the accuracy of the NB algorithm are both very low, the correlation coefficient is greatly improved by the ImR. On average, the ImR achieves an 8.0898% improvement of the Pearson correlation coefficient over the R. In addition, all the p-values of ImR except for the NB algorithm are far less than 0.01, which indicates that ImR does have linear correlations with the accuracy of most algorithms, and the p-values of ImR, except for the SVM-L and MLP algorithms, are also much less than those of R. Therefore, the ImR has a much better performance than the R for estimating the level of class overlap under imbalanced distribution. Moreover, ImR not only achieves a better mean value of the correlation coefficients with the accuracy of all classification algorithms but also a smaller standard deviation, which demonstrates that ImR also has a better generalization ability to these algorithms. Figure 7 presents the ξ correlation coefficients of the accuracies of the different algorithms with the R and ImR.
It can be seen that the result of the ξ correlation coefficient is also similar to the result of the Pearson correlation coefficient. The ξ correlation coefficient of the accuracy of the NB algorithm with the R is much smaller than those of the accuracies of the other algorithms with the R, and the coefficient is largely improved by the ImR. The ξ correlation coefficient of the accuracy of the SVM-R algorithm with the ImR is the highest. Meanwhile, the ξ correlation coefficients of the accuracies of the different algorithms with the ImR are all higher than those with the R. Therefore, the comparison of the ξ correlation coefficients also shows that the ImR has a better performance than the R.

The Comparison of Time Consumption
The average time consumption of the different overlap metrics and of the 5-fold cross validation of the different classification algorithms is shown in Figure 8. It can be seen that the MLP algorithm is the most time-consuming due to backpropagation, and the RF and AdaB algorithms are also very time-consuming because of ensemble learning. The overlap metrics have similar time consumption, and the time of the KNN algorithm is approximately four times that of an overlap metric. The main reason is that the k-nearest-neighbor search is the most time-consuming step for both the overlap metrics and the KNN algorithm: the search is conducted once over n samples for a dataset when computing an overlap metric, while it is conducted five times over 4n/5 samples for the 5-fold cross validation of the KNN algorithm. Therefore, the result is consistent with the analysis, and the proposed overlap metrics are also superior in terms of time consumption.
To sum up, the metrics ImR and ImR_aug, built on the proposed method, achieve better performance than the original R and R_aug, respectively, for estimating the level of class overlap of imbalanced datasets. Therefore, it can be concluded that the proposed method and metrics are superior for addressing the class overlap under imbalanced distribution.

Conclusions
In this paper, a theoretical analysis is conducted on the existing class overlap metrics, and an improved method to address the class overlap under imbalanced distribution is proposed based on the analysis. Then, the corresponding metrics for estimating the class overlap of imbalanced datasets are introduced. A well-known collection of imbalanced datasets is used to compare the Pearson correlation coefficients and the ξ correlation coefficients of the different overlap metrics with the classification performance. The experimental results demonstrate that the proposed class overlap metrics outperform the other compared metrics for the imbalanced datasets: the Pearson correlation coefficients with the AUC metric and the accuracy metric can be improved by 34.7488% and 8.0898% on average, respectively. Therefore, the proposed method and metrics can estimate the class overlap under imbalanced distribution much better.
In the future, the proposed metrics can be applied to feature selection and feature construction. In addition, they can also be used as meta-features in meta-learning for algorithm selection and parameter optimization.
To make the distance fair to different classes in an imbalanced dataset, the coefficients of N_i^TP for the different classes are set equal, as shown in Equation (A3). It can then be seen that the result is consistent with the result obtained by the K-L divergence.