An Undersampling Method Approaching the Ideal Classification Boundary for Imbalance Problems

Abstract: Data imbalance is a common problem in most practical classification applications of machine learning, and it may lead to classification results that are biased towards the majority class if not dealt with properly. An effective means of addressing this problem is undersampling in the borderline area; however, it is difficult to find the area that fits the classification boundary. In this paper, we present a novel undersampling framework in which the samples of the majority class are first clustered and the boundary area is then segmented according to the clusters obtained; random sampling in the borderline area of these segments yields a sample set whose shape better fits the classification boundary. In addition, we hypothesize that there exists an optimal number of classifiers to be integrated in the ensemble learning method that combines the multiple classifiers obtained via sampling. After this hypothesis passes testing, we apply the improved ensemble scheme to the newly developed method. The experimental results show that the proposed method works well.


Introduction
In the era of big data, classification has become an increasingly important learning task in various data mining and analysis applications, including bankruptcy prediction [1], stroke prediction [2], fraud detection [3], and fault diagnosis [4]. Generally speaking, the classification algorithms employed in machine learning are designed to deal with balanced data classification problems. In reality, however, the data collected in practical applications are usually imbalanced, i.e., the number of samples representing different classes is unequal. For example, in a binary classification problem, there may be 10 samples in one class and 500 in the other class. When the two types of data overlap one another, the classification algorithm may treat these 10 samples as noise data. It is therefore difficult to predict the minority-class samples when using classification algorithms, as the algorithm tends to learn the characteristics of the majority-class data; this results in the minority-class samples being misclassified as majority-class samples. For example, in practical applications such as cancer prediction, there are far more normal cells than cancer cells. However, there would be serious consequences if the cancer cells were to be misclassified as normal cells.
It is well known that imbalanced problems can be addressed via two main approaches: one is oversampling [5,6], whereby data sets are balanced via the random generation of samples for the minority class, and the other is undersampling, whereby data sets are balanced by performing partial sampling in the majority class, thereby reducing the number of samples selected [7][8][9]. Although both approaches have yielded significant research results, we focus on the undersampling method in this paper.
Great success has been achieved by conducting random sampling and removing samples from the majority class to maintain the balance of data sets [7,9]. Experiments conducted by [10], [11], and [12] proved that not all data have a significant effect on the classification model, and that the contribution of some data is even negligible. Researchers have screened data based on the borderline [13], sensitivity [14], and distribution of data overlap [7] to obtain a new, balanced data set; these methods, unfortunately, may compromise the original distribution characteristics of the data set. Therefore, the cluster-based undersampling technique emerged; for example, [8] obtained good results by partitioning the majority class into clusters. It is clear that, regardless of the data sampling method used, sampling ultimately affects the generation of the classifier and hence the shape of the classification boundary. Therefore, our goal is to find an undersampling method that makes the distribution of samples conform to that at the ideal classification boundary.
The main contributions of this paper are as follows: (1) Identifying a sampling area that is more consistent with the ideal classification boundary shape via the performance of segmentation after clustering and space compression; (2) Ensuring that the distribution of the sampled data coincides with the sample distribution at the ideal classification boundary by optimizing the number of classifiers obtained and performing undersampling during ensemble learning; (3) Demonstrating that the proposed method has a clear effect via the performance of experiments using 20 data sets.
In the remainder of this paper, Section 2 discusses related works, Section 3 gives a detailed description of our proposed model, and Section 4 presents extensive experimental results in order to justify the effectiveness of our method. The conclusions and future research directions are outlined in Section 5.

Related Work
In this paper, we propose a method that first removes noise, then performs clustering-based undersampling, and finally uses ensemble learning to complete the classification task. Correspondingly, we introduce relevant work from the following three aspects.

Noise Filtering in Undersampling
Data imbalance is caused by a significant difference in the number of objects between different classes in a data set. Undersampling, which is able to solve the problem of imbalance, discards certain data in the majority class and forms a new data set with balanced majority-class data and minority-class data [9,15]. In addition, noise will affect the classification result. For example, when there is noise in the majority class and the undersampling method incorporates this noise into the balanced new data set, the accuracy of the classification will be greatly reduced. Although the ensemble learning method can be used alongside multiple sampling in order to weaken this effect, better classification results can be obtained if the noise is eliminated at the very beginning. Therefore, many researchers have introduced noise filtering into the process of undersampling.
Van et al. [10] proposed the use of a threshold adjustment strategy in order to filter data noise. Sáez et al. [11] removed data noise by using an iterative integrated noise filter based on the SMOTE method. Kang et al. [12] addressed the incorporation of a kNN filter into the undersampling method in order to exclude noise samples in the minority class. Yan et al. [5] used the semantic relationship between the attributes of the problem itself to aid in the identification of noise. Indeed, kNN is a good preprocessing method for imbalance classification as long as the noise does not interfere significantly with the results.
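To make the kNN-filtering idea concrete, the sketch below drops samples whose nearest neighbours mostly disagree with their own label. This is a generic kNN filter for illustration only, not the exact scheme of Kang et al. [12]; the neighbour count `k` and the `agree_ratio` threshold are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_noise_filter(X, y, k=5, agree_ratio=0.5):
    """Drop samples whose k nearest neighbours mostly disagree with
    their own label (a generic kNN filter, not the method of [12])."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the sample itself
    neigh_labels = y[idx[:, 1:]]         # labels of the k true neighbours
    agree = (neigh_labels == y[:, None]).mean(axis=1)
    keep = agree >= agree_ratio
    return X[keep], y[keep]
```

A sample embedded deep inside the opposite class is removed, while borderline samples with mixed neighbourhoods survive as long as at least half their neighbours share their label.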

Cluster-Based Sampling Methods
In contrast to random sampling, the cluster-based undersampling method can preserve the distribution characteristics of the data set by dividing the majority class into groups of similar samples through cluster analysis; as a result, the data characteristics of the majority class are maintained as long as the samples are drawn from these different groups to form a new, balanced data set.
Yen et al. [16] used Kmeans to cluster the majority class and then randomly sampled representative objects from each cluster in proportion to its size. Lin et al. [17] set the number of clusters in the majority class to the number of samples in the minority class; they then used the center points of these clusters together with the minority class to form a new data set. Tsai et al. [18] proposed a clustering-based combination method named CBIS that first uses the AP algorithm to perform clustering in the majority class and then uses sample selection techniques to obtain a new data set; it then achieves the final results via ensemble learning. Herein, the number of clusters is obtained adaptively. Li et al. [19] combined hierarchical-clustering-based undersampling with random forests: the proposed method clusters the majority-class samples using a hierarchical clustering algorithm, undersamples each cluster to balance the data, and then constructs a random forest. Compared to random undersampling combined with random forest (RF), it improves the prediction accuracy by 16% and the F-value by 17%. Jang and Kim [20] proposed a new method that enables the self-organizing mapping of boundary regions to be described using a normal distribution, and that addresses high-dimensional imbalance problems; it has been shown to perform well on two high-dimensional data sets used in industrial fault detection. Farshidvard et al. [8] performed clustering in the majority class such that there were no minority samples in the convex hull of each cluster; they believe that this approach preserves the data distribution in the feature space, and it achieves good results. Devi et al. [21] presented an algorithm that uses an adaptive clustering method combined with AdaBoost to eliminate the majority-class samples with minor contributions, thereby aiding the classification process. Guzmán-Ponce et al. [22] adopted a two-stage method that aims to overcome the problem of imbalance by combining DBSCAN and a graph-based process to filter noisy objects in the majority class. Tahfim and Chen [23] used the k-prototypes clustering algorithm to partition majority-class samples and perform initial undersampling, followed by resampling using ADASYN, NearMiss-2, and SMOTETomek; this method has achieved good results when applied to imbalanced large-truck collision data.
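The proportional, cluster-based undersampling idea surveyed above (e.g., Yen et al. [16]) can be sketched as follows. This is a simplified illustration, not any author's exact implementation; the cluster count and the quota-rounding rule are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Cluster the majority class, then draw from each cluster in
    proportion to its size (simplified sketch of the idea in [16])."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # per-cluster quota proportional to cluster size (at least 1)
        quota = max(1, round(n_keep * len(members) / len(X_maj)))
        chosen.extend(rng.choice(members, size=min(quota, len(members)),
                                 replace=False))
    return X_maj[np.array(chosen)]
```

Because every cluster contributes at least one sample, the reduced set retains the coarse distribution of the majority class rather than collapsing onto its densest region.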
The majority of the above-mentioned studies present methods that aim to accurately obtain the number of clusters in the majority class. This is, however, a paradox, because if the data of the majority class can be clearly distinguished from each other, then it is more reasonable to divide the majority class into several classes so that a majority class no longer exists. In addition, these methods mainly focus on preserving the data structure of the majority class, and ignore the fact that the boundary points have a greater impact on the construction of the classification model.

Ensemble Learning in Undersampling
The use of ensemble learning can improve the results of classification algorithms, especially with regard to undersampling. Because the undersampling method only selects part of the samples from the majority class, the information of the unselected samples is lost. Due to its ability to better utilize sample information, the undersampling classification method with ensemble learning has gained popularity [24], whereby the bagging and boosting capabilities of ensemble learning are used in undersampling; for example, the SMOTEBagging ensemble classifier can be utilized [25]. The ensemble of α-Trees framework (EAT) uses underbagging technology to achieve good results when applied to imbalanced classification problems [26]. In terms of boosting, the technologies commonly used include RUSBoost [27], SMOTEBoost [28], CSBBoost [29], AdaBoost, and AsymBoost [30]. In addition, Yang et al. [31] employed progressive density-based weighted ensemble learning, and Ren et al. [32] designed a weighted integration scheme obtained by using classifiers trained on the original imbalanced data set for ensemble learning.
In these ensemble learning methods, multiple rounds of learning allow more majority-class samples to contribute to the constructed model; however, none of them properly consider whether the number of classifiers combined in ensemble learning can itself be optimized.

Proposed Method
Since multi-class classification problems can be converted into two-class classification problems in order to obtain solutions, this paper studies the two-class imbalanced undersampling problem. Suppose that there is an imbalanced data set D that has two categories: the majority class D_b and the minority class D_s. We use clustering to maintain the distribution characteristics of D_b and then perform sampling to solve the problem of imbalance.

Influence of the Boundary on the Classification Results and the Idea of Our Method
When faced with a classification problem, we design a thought experiment in which there is an indefinite number of samples and sufficient samples at the boundary of different classes. These samples constitute the classification boundary. Because the samples near the classification boundary contribute more to the establishment of the classification model than those far away, for the problem of imbalanced classification, we usually select samples that are near to the boundary when undersampling is performed in the majority class.
The method commonly used to perform undersampling near the boundary is shown in Figure 1. A rough linear classification is performed to obtain f(x_1). Then, f(x_2), which is close to the boundary, is obtained via the parallel translation of f(x_1). Finally, boundary samples are selected for the construction of a classification model by extracting the samples between these two functions. According to the sample distribution shown in Figure 1, the information in the B area is lost. The ideal approach is to first obtain the classification f(x_3), then obtain f(x_4) through parallel translation, and finally sample in the area enclosed by the two functions (Figure 2). However, f(x_3) is generally difficult to obtain, and it is not easy to judge whether it is consistent with the distribution of samples in the majority class. To deal with this situation, we adopt the scheme shown in Figure 3. Firstly, a rough linear classification f(x_1) is obtained. Then, clustering is performed in the majority class to obtain three segments whose boundaries are parallel to f(x_1): f(x_1a), f(x_1b), and f(x_1c). Finally, sampling is performed in the area enclosed by f(x_1a), f(x_1b), f(x_1c), and f(x_1). In this way, a sample set that fits the classification boundary is obtained for the construction of the final classification model.

A Cluster-Based Sampling Area Fitting the Classification Boundary Morphology and Its Undersampling
The method proposed in this paper requires performing a rough division via a linear separating hyperplane prior to executing the other steps. It is unclear, however, whether any arbitrary linear method will work. Figure 4 shows that, with an f(x_1)-based division, the samples drawn in areas A, B, and C will most likely fall to their left, whereas an f(x_2)-based division samples these areas more uniformly. In other words, different linear division methods yield different sampling results. Therefore, it is necessary to find a more reasonable linear division method in order to determine rough boundaries. We suggest that the mean error can be used to roughly determine the appropriate linear division f(x_i) and thereby obtain the initial linear separating hyperplane.
Herein, n is the number of the candidate linear models f(x_i), and y^(k) is the label of sample k of D.
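One plausible implementation of this mean-error-based selection is sketched below: fit several candidate linear models and keep the one whose predictions deviate least, on average, from the true labels. The candidate set and the exact error definition are assumptions, since the paper's formula is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC

def pick_rough_separator(X, y, seed=0):
    """Choose an initial linear division f(x_i) by mean error:
    (1/|D|) * sum_k |f(x^(k)) - y^(k)|.  Candidate models are an
    assumption (the paper mentions SVM, linear and logistic methods)."""
    candidates = [LinearSVC(random_state=seed, max_iter=5000),
                  LogisticRegression(max_iter=5000),
                  RidgeClassifier()]
    best, best_err = None, np.inf
    for model in candidates:
        model.fit(X, y)
        err = np.mean(np.abs(model.predict(X) - y))  # mean error on D
        if err < best_err:
            best, best_err = model, err
    return best, best_err
```

On linearly separable data all candidates reach zero mean error; on overlapping data the comparison picks the rough division that misclassifies the fewest samples.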
Once the initial partition f(x_i) has been determined, the undersampling is initiated. Firstly, Kmeans clustering is performed in the majority class D_b, and the various clusters C_i are obtained. Then, given the target number of samples N_s (N_s = 1.5 * |D_s|, where D_s is the minority class), the number of samples N_(s,C_i) to be drawn from each cluster C_i is obtained. Herein, L_(i,f(x)) = L_(C_i,max) − L_(C_i,min) is the diameter of the cluster facing the separating hyperplane. As shown in Figure 5, we calculate the distance L_j between each element j in C_i and f(x), where L_(C_i,max) is the maximum perpendicular distance from the samples in C_i to f(x), and L_(C_i,min) is the minimum such distance. If α < 1 is the spatial compression coefficient, then the sampling space in C_i is enclosed by the hyperplanes parallel to f(x_i) at distances L_(C_i,min) and L_(C_i,min) + α * L_(i,f(x)) from it. The samples drawn from this space form a new undersampled majority-class sample set D_bu; this can then form a balanced data set with the minority class D_s for classification.
In order to minimize the distance between the sampled data objects and the boundary, we adopt a weighted method to update the samples in the majority class. That is, the closer a point is to the hyperplane, the greater its weight and the higher the probability that it will be sampled; conversely, the farther a data object is from the hyperplane, the smaller its weight and the smaller its probability of being sampled. Suppose f(x_i) = ωx_i + b; then the weight formula is given in Eq. (4). In summary, the basic principle of our undersampling is as follows. Firstly, various clusters that reflect the distribution of the majority-class samples are obtained via Kmeans clustering. Then, the spatial compression coefficient α is applied in each cluster to obtain sub-sampling spaces that are closer to the boundaries. A greater weight k_j is assigned to samples that are closer to the boundary in order to ensure that these samples are more likely to be selected. Via this approach, a relatively balanced majority-class sample set D_bu that better reflects the shape of the classification boundary can be obtained after undersampling. The sampling method is shown in Algorithm 1.
Algorithm 1: Cluster-based boundary undersampling.
for each cluster C_i do
    for each sample x_j in C_i do
        Calculate the distance L_j from the sample x_j to the hyperplane Ĥ
        Use Eq. (4) to calculate the weight k_j of x_j
    end for
    Obtain the sub-sampling space Ω_(C_i) of C_i according to α
    Randomly sample N_(s,C_i) times without replacement to obtain D_(bu,C_i)
    D_bu = D_bu ∪ D_(bu,C_i)
end for
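A minimal numeric sketch of one iteration of this per-cluster procedure is given below. The stand-in weight `1/(1 + L_j)` is an assumption (Eq. (4) is not reproduced here); the band construction follows the α-compression described above.

```python
import numpy as np

def weighted_cluster_sample(X_c, w, b, alpha=0.8, n_draw=10, seed=0):
    """Sample from one cluster C_i: restrict to the alpha-compressed band
    nearest the hyperplane w.x + b = 0, then draw with probability
    decreasing in distance.  The weight 1/(1+L_j) is a stand-in for the
    paper's Eq. (4)."""
    rng = np.random.default_rng(seed)
    L = np.abs(X_c @ w + b) / np.linalg.norm(w)   # distance to hyperplane
    L_min, L_max = L.min(), L.max()
    band = L <= L_min + alpha * (L_max - L_min)   # compressed sub-space
    idx = np.flatnonzero(band)
    k = 1.0 / (1.0 + L[idx])                      # closer => larger weight
    p = k / k.sum()
    n_draw = min(n_draw, len(idx))
    return X_c[rng.choice(idx, size=n_draw, replace=False, p=p)]
```

With α well below 1, points in the far half of the cluster are never eligible, so every draw comes from the side of the cluster facing the separating hyperplane.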

Undersampling That Enables Close-Fitting Data Distribution at the Ideal Classification Boundary
During undersampling, a small number of samples are extracted from a large number of data objects in the majority class, thus causing some information loss. Therefore, multiple rounds of undersampling are generally used to obtain more classifiers during ensemble learning, which enhances the uniformity of the samples at the classification boundary. The questions considered in this process are thus as follows: How can the number of classifiers m in ensemble learning be determined? Does the sample distribution obtained after this determination conform to the ideal data distribution? Herein, each undersampling pass yields one classifier. Definition 2. ϕ(x, f(x_1)) is the sample distribution of the majority-class sampling, as determined by the classification boundary function. As shown in Figure 3, there is a linear separation function f(x_1) between the majority class and the minority class, and C_i is obtained by clustering in the majority class. At the boundary of the majority class, f(x_2) is parallel to f(x_1). Then, the space formed by f(x_2) and the side of C_i facing f(x_1) is the undersampling space. We call the resulting sample distribution ϕ(x, f(x_1)).
Let t be the number of classifiers in ensemble learning. When m → t, the data distribution of ensemble learning ϕ(x, m) → ϕ(x, f(x_1)). Since f(x_1) is obtained via rough classification, it generally does not coincide with the ideal classification f(x_ideal). As such, when ϕ(x) → ϕ(x, f(x_1)), the classification results obtained according to this distribution ϕ(x) deviate. As is evident from area B in Figure 5, the sample distribution in the sampling space is biased to the left. Under the ideal sampling distribution ϕ(x, f(x_ideal)), where the sampling distribution is mostly uniform, having fewer samples on the left side may actually help. That is, as m takes values from 0 to t, the total sampling distribution at some t_opt in the ensemble learning will approach the ideal state. When m grows beyond this point, ϕ(x, m) will gradually deviate from the ideal distribution and begin to approach ϕ(x, f(x_1)). After analyzing the above situation, we believe that at the initial stage of ensemble learning, the result of the undersampling classification gradually improves; when it reaches m, the optimal value is obtained; after this, it gradually decreases. Based on this, we argue that the sample distribution is uniform with respect to f(x_ideal) when the number of undersampling rounds is taken as m.
In order to further investigate how the number of classifiers built from the sampled data affects ensemble learning, we tested 20 data sets. For each data set, we varied the number of classifiers in ensemble learning from 1 to 2000 and plotted the results; the graph for each data set is very similar to Figure 6. As the number of classifiers increased, the AUC of the classification result exhibited a sawtooth shape, as shown in Figure 6. We therefore speculated that the AUC might exhibit periodicity as the number of classifiers in ensemble learning increases. We then conducted a time series periodicity test on the results of all 20 data sets after performing a Fourier transform and found that they did not show periodicity. In this experiment, however, we found that each data set exhibits a cyclic phenomenon in which the accuracy increases from small to large and then decreases. The sub-picture shown in Figure 6 is an enlargement of the first 30 points obtained from the Yeast5 data set. Based on this behavior, we present the following hypothesis.

Hypothesis 1.
There is a number of classifiers m that obtains the best result within a cycle of ensemble learning under ϕ(x, f(x_1)) distribution sampling.
According to Hypothesis 1, there exists an optimal number of classifiers to be integrated within a cycle, and we apply Algorithm 2 to determine this number. We set a stop threshold θ and record the result of each round of ensemble learning. Initially, the number of classifiers in ensemble learning ranges from 1 to a very large number. We then increase the number of classifiers and obtain their classification results r_i. When θ consecutive r_i values that are less than the current best result r are observed, the optimal number of classifiers m is output.
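The early-stopping search just described can be sketched as follows. The `score_fn(m)` callback, which evaluates an m-classifier ensemble (e.g., by its AUC), is a caller-supplied assumption; the stopping logic mirrors the θ-threshold rule of Algorithm 2.

```python
def find_optimal_m(score_fn, m_max=2000, theta=20):
    """Sketch of Algorithm 2: grow the ensemble one classifier at a time
    and stop after theta consecutive results below the best seen so far.

    score_fn(m) -> evaluation (e.g. AUC) of an m-classifier ensemble.
    """
    best_m, best_r, since_best = 1, float("-inf"), 0
    for m in range(1, m_max + 1):
        r = score_fn(m)
        if r > best_r:                 # new best: record and reset counter
            best_m, best_r, since_best = m, r, 0
        else:
            since_best += 1
            if since_best >= theta:    # theta rounds without improvement
                break
    return best_m, best_r
```

Because the search stops θ rounds after the last improvement, it returns the peak of the first cycle rather than scanning all m_max ensemble sizes.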

Our Method FDUS and Its Time Complexity
As mentioned previously, the noise reduction achieved by kNN during imbalanced classification can improve the results; we therefore utilize it as our preprocessing step. Taking the foregoing description of this section into consideration, we name our method FDUS (Fitting data Distribution with UnderSampling); it is shown in Algorithm 3. The time complexity of assigning weights to the majority-class samples is O(N), and the time complexity of determining the number of classifiers in ensemble learning and outputting the optimal results is m * O(N^2). Since there is no nesting between these processes, the time complexity of the overall sampling algorithm is O(N^2).

Experiments and Results
In this section, we first show the experimental details of the proposed method; these include the data sets, the comparison methods, and the evaluation indices. Then, we verify the aforementioned hypothesis. Finally, we analyze and compare each method based on the experimental results.

Data sets
We conducted experiments on 20 imbalanced data sets obtained from KEEL (https://sci2s.ugr.es/keel/imbalanced.php). Table 1 shows that the imbalance rate (ir = |majority-class samples| / |minority-class samples|) of the data sets ranges over (1.86, 40.5), that the number of examples ranges over (184, 5472), and that the dimension ranges over (4, 19). It is evident that these data sets include both low-dimensional and high-dimensional data, with both low and high imbalance rates; this enables us to perform a comprehensive evaluation of the proposed method. In these experiments, we used 5-fold cross validation and repeated it 20 times; the average of the 20 experimental results is used as the final result.

Benchmark methods. As mentioned in Section 2, we use five of the latest imbalanced undersampling methods and two classic algorithms as our benchmark methods; these are listed below: (1) NB-Tomek [7]: This method accurately identifies and eliminates overlapping instances of imbalanced data sets; therefore, the visibility of minority instances is maximized and the excessive elimination of data is minimized, thereby reducing information loss. (2) USS [14]: This method calculates the sensitivity of majority-class samples. (3) RUS [33]: This method randomly selects the same number of samples from the majority class as exist in the minority class, and then obtains a classifier by using AdaBoost. (4) RBU [34]: This method uses the Gaussian kernel function to calculate the mutual-class potential between majority-class samples and minority-class samples, and achieves a balance between the two classes through the diffusion kernel radius; finally, the naive Bayes classifier is applied to achieve classification.
(5) Centers_NN [17]: This method performs clustering on the majority class of an imbalanced data set, and the nearest neighbors of the cluster centers obtained via undersampling are used for the classification learning of the imbalanced data set. (6) CBIS [18]: This method uses the AP algorithm to cluster the data of the majority class; it then uses the IS3 [35] method to select samples from the majority-class subsets of each cluster and combines them with the minority-class samples to form a training set. Ensemble learning is then used to obtain the classifier. (7) UA-KF [12]: This method performs noise filtering on minority-class data, randomly undersamples majority-class data, and finally uses AdaBoost for ensemble learning.
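The imbalance-ratio definition and the repeated cross-validation protocol used in these experiments can be sketched as follows. The use of stratified splitting and of the repeat index as the shuffle seed are assumptions; the paper only states 5-fold cross validation repeated 20 times.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def imbalance_ratio(y):
    """ir = |majority-class samples| / |minority-class samples|."""
    counts = np.bincount(y)
    return counts.max() / counts.min()

def repeated_cv_indices(y, n_splits=5, n_repeats=20):
    """Yield train/test index pairs for n_splits-fold cross validation
    repeated n_repeats times (stratification is an assumption)."""
    for rep in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                              random_state=rep)
        yield from skf.split(np.zeros(len(y)), y)
```

Averaging a metric over all n_splits * n_repeats test folds gives the final figure reported per data set.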

Parameters and Metrics
In our proposed method, we set k = 15 in the kNN filtering step and use five CART trees as weak classifiers in the AdaBoost step. It is generally considered that the data are imbalanced if the ratio of majority-class samples to minority-class samples exceeds 1.5; therefore, we set the ratio of the sampled data to 1.5 (majority class vs. minority class). In an imbalanced data set, classification accuracy is biased towards the majority class, and accuracy alone is therefore not a reliable measure. In this work, we use the F-measure, AUC, and G-mean, which are commonly used in research on imbalance, to evaluate the experimental performance [36][37][38]. Herein, the AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate. We now present the results of the ensemble learning performed by the method proposed in this paper. The number of classifiers is set to a range of 1 to 2000, and we select the data in the first cycle of the AUC, similar to those presented in Figure 6.
From the selected data we generally remove the first value, and the remaining values are then used to perform a normal distribution hypothesis test. The test results for the 20 data sets used in this paper are recorded in Table 2. Here, stat is a statistic that measures the degree of fit between the sample data and the normal distribution, and P is a probability value; when P is less than the significance level of 0.05, the data are considered not to follow a normal distribution. As shown in Table 2, all p values are greater than 0.05, corresponding to a confidence level of 95%. This shows that all the data have passed the hypothesis test and therefore follow a normal distribution; consequently, they all attain an optimal value within a cycle.
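The three evaluation indices used above can be computed as in the sketch below. The definitions are standard: G-mean = sqrt(TPR * TNR), and the AUC is the area under the ROC curve of FPR vs. TPR; `y_score` denotes a continuous decision score, an assumption about the classifier's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """AUC, F-measure and G-mean for a binary imbalanced task."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)            # True Positive Rate (recall)
    tnr = tn / (tn + fp)            # True Negative Rate
    return {"AUC": roc_auc_score(y_true, y_score),
            "F-measure": f1_score(y_true, y_pred),
            "G-mean": np.sqrt(tpr * tnr)}
```

Unlike plain accuracy, all three indices penalize a classifier that simply predicts the majority class, which is why they are preferred in imbalance research.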

Effect of Different Clustering Methods on Research Results
In order to discuss the impact of the clustering algorithm used for majority-class partitioning on the entire method, we selected two classic methods, DBSCAN and SVC, and compared them with Kmeans. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [39] is a data-clustering algorithm that discovers clusters of arbitrary shapes based on the concepts of density reachability and density connectivity. Support Vector Clustering (SVC) [40] is used to cluster data points in a high-dimensional space. From Table 3, it can be seen that the results of the three methods are very similar, with Kmeans performing slightly better than DBSCAN and SVC. Following the principle of Occam's razor, we chose to use Kmeans in our method.

Effect of Proper Separation of the Hyperplane on Research Results
We suggest that the separating hyperplane chosen will affect the final result of the algorithm. Therefore, our FDUS method selects an optimal H_opt from among the separating hyperplanes H obtained by SVM, linear classification, and logistic classification before proceeding with the other steps. To verify this idea and prove that the optimization of H can assist the algorithm, we applied the H obtained using each of the above three methods directly to the FDUS method. Each experiment was performed 20 times, and the average results were entered in Table 4. It can be seen from Table 4 that the final results achieved by the different separating hyperplanes obtained by SVM, linear classification, and logistic classification each have their own advantages and disadvantages. This shows that the FDUS algorithm is closely related to the initial H. Secondly, it can be seen from Table 4 that, when evaluated using the AUC, the FDUS method based on linear classification achieves the best result on the car-vgood data set; the method based on logistic regression classification achieves the best results on the ecoli1 and yeast5 data sets; and the SVM-based variant shows the best results on the remaining data sets. This confirms that optimizing the separating hyperplane can improve the results of the algorithm.
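The selection of H_opt among the three candidate hyperplanes could be sketched as below, scoring each candidate by cross-validated AUC. The scoring criterion, the fold count, and the use of RidgeClassifier as the "linear classification" stand-in are assumptions; the paper only states that an optimal H is chosen from the SVM, linear, and logistic hyperplanes.

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def select_h_opt(X, y, seed=0):
    """Pick H_opt among SVM, linear and logistic separating hyperplanes,
    scoring each by cross-validated AUC (criterion is an assumption)."""
    models = {"svm": LinearSVC(random_state=seed, max_iter=5000),
              "linear": RidgeClassifier(),
              "logistic": LogisticRegression(max_iter=5000)}
    scores = {}
    for name, model in models.items():
        # out-of-fold decision values avoid scoring on training data
        dec = cross_val_predict(model, X, y, cv=3,
                                method="decision_function")
        scores[name] = roc_auc_score(y, dec)
    return max(scores, key=scores.get), scores
```

Because the winner varies with the data set (as Table 4 shows), selecting per data set rather than fixing one linear method is what gives FDUS its robustness to the initial H.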

Comparison of FDUS with Benchmarks
In this section, the proposed FDUS is compared with seven representative algorithms on 20 numerical data sets. Tables 5-7 present the AUC, F-measure, and G-mean evaluations; each result is the average of 20 runs. We compare and analyze these results from four aspects while examining the performance of FDUS.

Cases with a complex sample distribution. Here, FDUS exhibits significant advantages over NB-Tomek and RBU. In terms of the average AUC, F-measure, and G-mean results, FDUS exceeds NB-Tomek by 3.18%, 8.50%, and 6.58%, respectively, and exceeds RBU by 3.84%, 4.47%, and 4.31%, respectively. Compared with NB-Tomek, FDUS performs better on every data set, and the number of data sets on which FDUS is superior to RBU is 20, 19, and 18 for the AUC, F-measure, and G-mean, respectively. NB-Tomek is mainly designed for the complex situation in which majority-class samples overlap with minority-class samples, while RBU uses Gaussian kernel functions to handle the complex distribution of samples in the majority-class sample space. We argue that FDUS alleviates the overlap problem by undersampling in each partition after clustering the majority class. At the same time, undersampling within the determined majority-class sample space avoids some of the minor errors introduced by NB-Tomek; FDUS is therefore superior to NB-Tomek. As for RBU, some of the sample distribution features captured by its Gaussian kernel functions lie far from the classification boundary, while those captured by FDUS are closer, so FDUS performs better after undersampling.

Cases in which the samples feature sensitivity and noise. We selected USS and UA-KF for comparison. Across every data set, FDUS significantly outperforms UA-KF under the AUC metric, and it outperforms UA-KF on 18 and 16 data sets for the F-measure and G-mean, respectively, demonstrating its superior performance. Averaged over all data sets, FDUS exceeds UA-KF by 5.74% on the AUC, 6.98% on the F-measure, and 7.31% on the G-mean. Against USS, FDUS shows a significant advantage on every data set, exceeding it by 14.32%, 10.21%, and 12.93% on the average AUC, F-measure, and G-mean, respectively. Because USS is sensitive to all samples in the majority class, samples near the boundary are also included in its calculation, which may cause some important samples in the boundary region to be processed incorrectly owing to their high sensitivity. UA-KF, on the other hand, only employs KNN to address noise in the minority-class samples. In contrast, FDUS eliminates the influence of outliers far from the classification boundary through spatial compression throughout the KNN data preprocessing stage, which enables it to handle both noise and sensitivity well.
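The KNN-based preprocessing and denoising step referred to above can be sketched as Wilson-style edited nearest-neighbour cleaning: a sample is dropped when its label disagrees with the majority label of its k nearest neighbours. This is an illustrative stand-in, not the authors' exact preprocessing; the function name and the default k = 3 are assumptions.

```python
import math
from collections import Counter

def knn_denoise(X, y, k=3):
    """Drop samples whose label disagrees with the majority label of
    their k nearest neighbours (Wilson-editing-style cleaning).
    Illustrative sketch only; not the exact FDUS preprocessing."""
    keep = []
    for i, xi in enumerate(X):
        # rank all other samples by Euclidean distance to xi
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbour_labels = [y[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

A single minority point sitting inside a tight majority cluster is removed as noise, while the surrounding majority points survive.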
Ensemble learning and undersampling across the majority class. To compare FDUS with ensemble learning methods that do not partition the majority class, we chose RUS. Regarding the average results of the AUC, F-measure, and G-mean evaluations, FDUS performs exceptionally well, scoring 7.80%, 5.59%, and 6.37% higher than RUS, respectively, and it is dominant on 20, 15, and 16 data sets, respectively. Because random undersampling is performed directly on the majority class without partitioning, the balanced majority class obtained may not match the original sample distribution well, which can lead to a worse score than that obtained with FDUS.
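The contrast drawn here, plain random undersampling versus sampling that respects the majority-class partitions, can be illustrated with a small sketch. `cluster_undersample` is a hypothetical helper, not part of the paper's implementation: it draws from each cluster in proportion to its size, so the reduced majority class keeps the original cluster distribution that plain RUS may distort.

```python
import random

def cluster_undersample(majority_by_cluster, n_target, seed=0):
    """Undersample the majority class cluster by cluster, giving each
    cluster a quota proportional to its share of the total, so the
    reduced set preserves the original cluster distribution.
    Illustrative sketch; rounding may shift the total slightly."""
    rng = random.Random(seed)
    total = sum(len(c) for c in majority_by_cluster)
    sampled = []
    for cluster in majority_by_cluster:
        quota = round(n_target * len(cluster) / total)
        sampled.extend(rng.sample(cluster, min(quota, len(cluster))))
    return sampled
```

With clusters of 80 and 20 samples and a target of 10, the helper keeps 8 and 2 samples, respectively, whereas a single global random draw carries no such guarantee.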
Comparison with cluster-based undersampling methods. We chose Centers_NN and CBIS for a performance comparison. Regarding the average results of the AUC, F-measure, and G-mean evaluations, FDUS exceeds Centers_NN by 4.50%, 4.58%, and 4.60%, respectively, and exceeds CBIS by 3.43%, 2.49%, and 3.23%, respectively. The number of data sets on which FDUS outperforms Centers_NN is 19, 18, and 19, respectively, and the number on which it surpasses CBIS is 20, 16, and 15, respectively. The Centers_NN and CBIS methods use the NN and AP algorithms, respectively, to cluster the majority class and capture the sample distribution characteristics. During undersampling, however, they operate within the entire class sample space. As a result, some samples that are distant from the classification boundary are retained; suppose their number is m. Because these m samples are kept, these methods retain m fewer samples near the boundary than FDUS does. Consequently, FDUS achieves a closer approximation to the ideal classification boundary than the boundaries generated by these methods, and it demonstrates superior performance.

Conclusions
Imbalanced classification methods based on clustering first perform cluster analysis and then carry out random sampling in each cluster to obtain a balanced data set. This traditional approach captures the distribution characteristics of the original data set; however, it also retains data with a low contribution to classification and thus reduces the accuracy of the classification model. To solve this problem, we propose the FDUS method. It first runs a clustering algorithm on the majority class and then performs random sampling in the borderline area close to the minority class within the clusters just obtained. In this way, the classification boundary obtained with FDUS conforms to the original shape of the data set between the two classes. To enhance the consistency between the obtained classification boundary and the ideal shape, we adopted an ensemble learning method that utilizes multiple classifiers to improve the algorithm. Based on the experimental analysis of using different numbers of classifiers for ensemble learning in FDUS, we put forward the hypothesis that there is an optimal number of models to be integrated in ensemble learning, and proved it via the hypothesis test.

Figure 1. Method of sampling near the boundary.

Figure 2. Method of sampling near the nonlinear boundary.

Figure 3. Method of sampling near the linear boundary with partitions.

Figure 4. The influence of different linear classification boundaries.

Figure 5. Diameter of the cluster facing the separating hyperplane.

Algorithm 1. WUC: weighted undersampling of the boundary space based on clustering.
Require: training data set D; majority class D_b; clusters of the majority class C_i, i = 0, 1, ..., N_c; spatial compression coefficient α.
Ensure: majority-class sample set obtained after undersampling, D_bu.
    Obtain the hyperplane Ĥ from D by linear regression
    D_bu = {}
    Obtain the number of samples N_{s,C_i} for each cluster C_i
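A minimal sketch of the borderline-sampling idea in WUC follows, assuming a linear hyperplane w·x + b = 0 and treating α as the fraction of each cluster closest to the boundary that remains eligible for sampling. All names are illustrative assumptions; this is not the authors' exact weighting scheme.

```python
import random

def boundary_undersample(clusters, w, b, alpha, n_keep, seed=0):
    """For each majority-class cluster, rank samples by distance to the
    hyperplane w.x + b = 0, keep only the closest alpha-fraction (the
    borderline slice), and randomly draw up to n_keep samples from it.
    Illustrative sketch of the WUC idea, not the paper's exact method."""
    rng = random.Random(seed)
    kept = []
    for cluster in clusters:
        # magnitude of w.x + b ranks proximity to the boundary
        ranked = sorted(
            cluster,
            key=lambda x: abs(sum(wi * xi for wi, xi in zip(w, x)) + b),
        )
        border = ranked[: max(1, int(alpha * len(cluster)))]
        kept.extend(rng.sample(border, min(n_keep, len(border))))
    return kept
```

Samples far from the hyperplane never enter the eligible slice, which is how the sketch mirrors the spatial compression of outliers away from the boundary.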

Figure 6. AUC value obtained for numbers of classifiers/undersampling iterations in ensemble learning over the range 1-2000 on the Yeast5 data set.

Algorithm 2. DNC: determine the number of classifiers in ensemble learning and output the best result.
Require: imbalanced data set D; stop threshold θ; number of ensemble learning iterations n; evaluation result r = 0; the mode of classification with undersampling, M.
Ensure: optimal number of ensemble learning iterations, m.
    Num = 0, ilabel = 0
    for i = 1 to n do
        r_i = Adaboost(M, i)    // i is the number of ensemble learning iterations
        if r_i > r then
            r = r_i
            ilabel = i
            Num = 0
            Save the result of ilabel
        else
            Num++
        end if
        if Num > θ then
            m = ilabel
            Output the result of m
            break
        end if
    end for
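The early-stopping search in DNC can be sketched as the following function, where `score_fn(i)` stands in for training and evaluating an i-classifier ensemble (Adaboost(M, i) in the algorithm); the scan stops once the score has failed to improve for more than θ consecutive sizes. This is a hedged sketch of the control flow only.

```python
def best_ensemble_size(score_fn, n_max, theta):
    """Scan ensemble sizes 1..n_max, tracking the best score seen, and
    stop once more than `theta` consecutive sizes fail to improve it.
    Returns the best size and its score. Illustrative sketch of DNC."""
    best_score, best_i, stale = float("-inf"), 0, 0
    for i in range(1, n_max + 1):
        s = score_fn(i)
        if s > best_score:
            best_score, best_i, stale = s, i, 0
        else:
            stale += 1
            if stale > theta:
                break
    return best_i, best_score
```

With a score curve that peaks at some interior size, the search stops θ + 1 steps past the peak instead of evaluating all n_max ensembles.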

Require: imbalanced data set D; space compression factor α.
Ensure: classification model f.
    D′ = kNN(D)    // data preprocessing and denoising
    Obtain Ĥ by choosing an appropriate linear classifier
    Use K-means to partition the majority class
    Perform WUC
    Use DNC to output the result

The time complexity of kNN is O(N²). We use a linear SVM, linear regression, and logistic regression to refine the rough linear classification, giving a time complexity of 3 · O(N²) at this stage. The time complexity of K-means is O(n · k · t); since both k and t are deterministic values with k, t ≪ N, it approaches O(N).

Table 1. Data set statistics.

Table 2. Normal distribution test of the data sets.

Table 3. Different ratios of the AUC, F-measure ('F' for short), and G-mean ('G' for short) with different clustering methods.

Table 4. Different ratios of the AUC, F-measure ('F' for short), and G-mean ('G' for short) with different separating hyperplanes.