1. Introduction
Class distribution, i.e., the proportion of instances belonging to each class in a data set, plays a key role in any kind of machine-learning and data-mining research. However, real-world data often suffer from class imbalance. Class imbalance has been reported in a wide variety of real-world domains, such as face recognition [1], text mining [2], software defect prediction [3], and remote sensing [4]. Binary imbalanced data classification problems occur when one class, usually the one that refers to the concept of interest (positive or minority class), is underrepresented in the data set; in other words, negative (majority) instances outnumber positive class instances [5,6,7]. Treating minority class instances as noise can severely reduce classification accuracy. Dealing with multi-class tasks in which classes have different misclassification costs is harder than dealing with two-class ones [8,9,10]. Some traditional classification algorithms, such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and decision trees, which behave well on problems with balanced classes, do not necessarily achieve good performance on class imbalance problems. Consequently, how to classify imbalanced data effectively has emerged as one of the biggest challenges in machine learning.
The objective of imbalance learning can be generally described as obtaining a classifier that provides high accuracy for the minority class without severely jeopardizing the accuracy of the majority class. Typically, there are four families of methods for imbalanced learning [11]: sampling methods [12], cost-sensitive methods [7,13], kernel-based methods [7], and active learning methods [14].
Sampling methods: The objective of these non-heuristic methods is to provide a balanced distribution by considering the representative proportions of class examples. They are carried out before training starts. These methods are presented in detail in Section 2.1.
Cost-sensitive methods: These methods incorporate both data-level transformations (by adding costs to instances) and algorithm-level modifications (by modifying the learning process to accept costs). They generally use a cost matrix to account for the costs associated with misclassifying samples [11]. A cost-sensitive neural network [15] with a threshold-moving technique was proposed to adjust the output threshold toward inexpensive classes, such that high-cost samples are less likely to be misclassified. Three cost-sensitive boosting methods, AdaC1, AdaC2, and AdaC3, were proposed [16], in which cost items are used to weight the updating strategy of the boosting algorithm. The disadvantage of these approaches is the need to define misclassification costs, which are usually not available in the data sets [5].
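As an illustration of the threshold-moving idea (a hedged sketch, not the specific network of [15]), the output decision can be moved toward the inexpensive class by choosing, for each sample, the class with minimum expected cost under an assumed cost matrix:

import numpy as np

# Illustrative cost matrix (an assumption, not taken from the paper):
# cost[i][j] = cost of predicting class j when the true class is i.
cost = np.array([[0.0, 1.0],    # true negative, false positive
                 [5.0, 0.0]])   # false negative (expensive), true positive

def cost_sensitive_predict(proba, cost):
    # proba: (n_samples, n_classes) class membership probabilities from any classifier.
    # expected_cost[:, j] = sum_i proba[:, i] * cost[i, j]
    expected_cost = proba @ cost
    return expected_cost.argmin(axis=1)

# A sample with P(minority) = 0.25 is still assigned to the minority class,
# because missing it is five times more costly under this cost matrix.
print(cost_sensitive_predict(np.array([[0.75, 0.25]]), cost))   # -> [1]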
Kernel-based methods: The principles of kernel-based learning are centered on the theories of statistical learning and Vapnik-Chervonenkis dimensions [17,18]. In kernel-based methods, many works have applied sampling and ensemble techniques to the support vector machine (SVM) framework [19]. Different error costs [20] were suggested for different classes to bias the SVM so that the decision boundary is shifted away from the positive instances, making positive instances more densely distributed.
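The different-error-costs idea can be approximated with the per-class penalty weights available in common SVM implementations; the sketch below uses scikit-learn's class_weight parameter on a synthetic imbalanced data set (the 95%/5% split and the weight values are illustrative assumptions, not taken from [20]):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with an assumed 95%/5% imbalance.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Standard SVM versus an SVM with a larger error cost on the minority class:
# class_weight scales the penalty parameter C per class, which biases the
# decision boundary away from the costlier (positive) instances.
svm_plain = SVC(kernel="rbf").fit(X, y)
svm_costed = SVC(kernel="rbf", class_weight={0: 1, 1: 19}).fit(X, y)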
Active learning methods: Traditional active learning methods have been used to address the imbalanced training data problem, and various approaches to active learning from imbalanced data sets have recently been proposed [14]. Active learning selects informative instances from a random set of training data, thereby reducing the computational cost of dealing with large imbalanced data sets; nevertheless, the selection process itself can still be computationally expensive on very large data sets [14].
Ensemble classifiers are known to increase the accuracy of single classifiers by combining several of them, and they have been successfully applied to imbalanced data sets [21,22,23,24]. Ensemble learning methods have been shown to be more effective than data sampling techniques at enhancing the classification performance of imbalanced data [25]. However, because the standard techniques for constructing ensembles are strongly oriented toward overall accuracy, they still have difficulty sufficiently recognizing the minority class [26]. Hence, ensemble learning algorithms have to be designed specifically to handle the class imbalance problem effectively [5]. The combination of ensemble learning with imbalanced learning techniques (such as the sampling methods presented in Section 2.1) to tackle the class imbalance problem has led to several proposals in the literature, with positive results [5]. Hence, aside from conventional categories such as kernel-based methods, ensemble-based methods can be regarded as a category of their own in imbalanced domains [5]. In addition, the very idea of combining multiple classifiers can reduce the probability of overfitting.
Margins, which were originally applied to explain the success of boosting [27] and to develop the Support Vector Machine (SVM) theory [17], play a crucial role in modern machine learning research. The ensemble margin [27] is a fundamental concept in ensemble learning. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples [27]. A good margin distribution means that most examples have large margins [28]. Moreover, ensemble margin theory is a proven effective way to improve the performance of classification models [21,29]. It can be used to detect the most important instances, which have low margin values, and thus help ensemble classifiers avoid the negative effects of redundant and noisy samples. In machine learning, the ensemble margin has been used in imbalanced data sampling [21], noise removal [30,31,32], instance selection [33], feature selection [34], and classifier design [35,36,37].
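To make the margin notion concrete, the following sketch computes ensemble margins from a matrix of base-classifier votes. It uses one classical pair of definitions, a supervised margin (votes for the true class minus the maximum votes for any other class) and an unsupervised max-margin (difference between the two most voted classes), both normalized by the ensemble size T; the exact equations used later in the paper are not reproduced here, and the function name is ours.

import numpy as np

def ensemble_margins(votes, y=None):
    # votes: (n_samples, T) matrix of class labels predicted by the T base classifiers.
    # If y is given, return the supervised margin; otherwise the unsupervised max-margin.
    n, T = votes.shape
    labels = votes.ravel() if y is None else np.concatenate([votes.ravel(), np.asarray(y)])
    classes = np.unique(labels)
    # counts[i, k]: number of base classifiers voting class k for sample i
    counts = np.stack([(votes == c).sum(axis=1) for c in classes], axis=1)
    if y is None:
        top2 = np.sort(counts, axis=1)[:, -2:]
        return (top2[:, 1] - top2[:, 0]) / T      # most voted minus second most voted
    true_idx = np.searchsorted(classes, y)
    v_true = counts[np.arange(n), true_idx]
    others = counts.copy()
    others[np.arange(n), true_idx] = -1           # mask out the true class
    return (v_true - others.max(axis=1)) / T      # negative when misclassified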
In this paper, we propose a novel ensemble margin based algorithm, which handles imbalanced classification by exploiting low margin examples, which are more informative than high margin samples. This algorithm combines ensemble learning with undersampling, but instead of balancing classes randomly, as UnderBagging [38] does, our method pays attention to constructing higher quality balanced sets for each base classifier. In order to demonstrate the effectiveness of the proposed method in handling class imbalanced data, UnderBagging [38] and SMOTEBagging [8], which are presented in detail in the following section, are used in a comparative analysis. We also compare the performances of different ensemble margin definitions, including the newly proposed margin, in class imbalance learning.
The remaining part of this paper is organized as follows. Section 2 presents an overview of the imbalanced classification domain from the two-class and multi-class perspectives. The ensemble margin definition and the effect of class imbalance on the ensemble margin distribution are presented in Section 3. Section 4 describes the proposed methodology in detail. Section 5 presents the experimental study and Section 6 provides a discussion based on the analysis of the results. Finally, Section 7 presents the concluding remarks.
4. A Novel Bagging Method Based on Ensemble Margin
Compared to binary imbalanced classification problems, multi-class problems increase the data complexity and negatively affect classification performance, whether the data are imbalanced or not. Hence, multi-class imbalance problems cannot simply be solved by rebalancing the number of examples among classes in the pre-processing step. In this section, we propose a new algorithm to handle the class imbalance problem. Several methods proposed in the literature to address the class imbalance problem, as well as their strengths and weaknesses, were presented in the previous section. Ensemble classifiers have been shown to be more effective than data sampling techniques at enhancing the classification performance of imbalanced data. Moreover, the combination of ensemble learning with sampling methods to tackle the class imbalance problem has led to several proposals with positive results in the literature.
In addition, as mentioned in the previous section, boosting based methods are sensitive to noise. On the contrary, bagging techniques are not only robust to noise but also easy to develop. Galar et al. pointed out that bagging ensembles would be powerful when dealing with class imbalance if they are properly combined [5,63]. Consequently, we chose to base our new imbalance ensemble learning method on bagging.
Enhancing the classification of class decision boundary instances is useful to improve classification accuracy. Hence, for a balanced classification problem, focusing on the small margin instances of a global margin ordering should benefit the performance of an ensemble classifier. However, the same scheme is not suited to improving a model built from an imbalanced training set. Although most of the minority class instances have low margin values, selecting useful instances from a global margin ordering still risks losing part of the minority class samples, and may even cause classification performance to deteriorate. Hence, the most appropriate way to improve imbalanced classification is to choose useful instances from each class independently.
4.1. Ensemble Margin Based Data Ordering
Informative instances, such as class decision boundary samples and difficult class instances, play an important role in classification, particularly in imbalanced classification. These instances generally have low ensemble margins. To effectively exploit the relationship between the importance of instances and their margins in imbalance learning, we designed our class imbalance sampling algorithm based on margin ordering.
Let us consider a training set denoted as S = {(x_i, y_i), i = 1, ..., n}, where x_i is a vector of feature values and y_i is the value of its class label. The importance of a training instance x_i can be assessed by an importance evaluation function which relies on an ensemble margin definition and is defined by Equation (5). The lower the margin value (in absolute value), the more informative the instance is and the more important it is for our imbalance sampling scheme.
To solve the previously mentioned problem affecting the margins (both supervised and unsupervised) based on a sum operation, a shift is performed before the data importance calculation. The shifted margin values are obtained by subtracting, from the original margin values, the minimum margin value over the correctly classified samples of the training set. An example illustrating the margin shift procedure is given in Figure 2.
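A minimal sketch of this ordering step, assuming the margins have already been computed (e.g., with the ensemble_margins sketch above) and that a boolean mask flags the correctly classified training samples. Equation (5) is not reproduced here; as a stand-in, samples are ranked by ascending absolute shifted margin, which realizes the stated principle that low margin instances are the most important:

import numpy as np

def shift_margins(margins, correct_mask):
    # Shift sum-based margins: subtract the minimum margin value found
    # among the correctly classified training samples.
    return margins - margins[correct_mask].min()

def importance_order(margins, correct_mask):
    # Indices of the training samples ordered from most to least important,
    # i.e., by increasing absolute shifted margin (placeholder for Equation (5)).
    shifted = shift_margins(margins, correct_mask)
    return np.argsort(np.abs(shifted))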
4.2. A Novel Bagging Method Based on Ensemble Margin
The proposed ensemble margin based imbalance learning method is inspired by SMOTEBagging [8], a major oversampling method which was described in the previous section. It combines undersampling, ensemble and margin concepts. Our method pays more attention to low margin instances. It can overcome the shortcomings of both SMOTEBagging [8] and UnderBagging [38]: it has lower computational complexity than SMOTEBagging and focuses more than UnderBagging on the instances that are important for the classification task.
The proposed method has three main steps:
Computing the ensemble margin values of the training samples via an ensemble classifier.
Constructing balanced training subsets by focusing more on small margin instances.
Training base classifiers on balanced training subsets and constructing a new ensemble with a better capability for imbalance learning.
Denote {(x_i, y_i), i = 1, ..., n} as the training samples. The first step of our method involves a robust ensemble classifier, bagging, which is constructed using the whole training set. The margin value of each training instance is then calculated. In the second phase, we aim to select the most significant training samples for classification to form several new balanced training subsets. Suppose L is the number of classes and n_c is the number of training instances of the c-th class. We sort those classes in descending order according to their number of instances. Therefore, n_L is the training size of class L, which is the smallest, and n_1 is the training size of class 1, which is the largest. The training instances of each class c are sorted in descending order according to the margin based importance evaluation function (Equation (5)) previously introduced. For each class c, the higher the importance value of an instance x_i, the more important this instance is for the classification decision. Then, as in SMOTEBagging [8], a resampling rate a is used to control the amount of instances which should be chosen in each class to construct a balanced data set. All the instances of the smallest class are kept. The detailed steps of our method are shown in Algorithm 1.
The range of a is first set from 10 to 100. For each class c, c ≠ L, L representing the smallest class, n_L instances are bootstrapped from the first a% of the importance-ordered samples of class c to construct subset S_c. All the subsets are balanced. When a% of the size of class c is under n_L, the n_L instances are bootstrapped from the first n_L samples of class c, which is the same as in UnderBagging. Then the n_L smallest class samples are combined with the subsets S_1, ..., S_{L-1} to construct the first balanced data set. In the next phase, the first base classifier is built using the obtained balanced training set.
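The subset construction described above can be sketched as follows, assuming that the indices of each class are already importance ordered (most important first) and that classes are given from the largest to the smallest; the function name and the random generator are our own choices:

import numpy as np

rng = np.random.default_rng(0)

def build_balanced_subset(ordered_idx_per_class, a):
    # ordered_idx_per_class: one index array per class, importance ordered,
    # classes sorted by decreasing size so the last entry is the smallest class L.
    n_L = len(ordered_idx_per_class[-1])
    subset = list(ordered_idx_per_class[-1])             # keep all smallest-class samples
    for idx in ordered_idx_per_class[:-1]:
        # candidate pool: first a% of the importance-ordered samples,
        # but never fewer than n_L samples (UnderBagging-like fallback)
        pool = idx[:max(int(len(idx) * a / 100), n_L)]
        subset.extend(rng.choice(pool, size=n_L, replace=True))   # bootstrap n_L instances
    return np.array(subset)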
Figure 3 presents the flowchart of our method with an ensemble size T and a range of 10-100% for a. The elements in the range of a form an arithmetic progression denoted as A. If we build T classifiers as ensemble members, every 10 classifiers are built with different resampling rates a ranging from 10% to 100%, as in SMOTEBagging. However, while SMOTEBagging uses n_1, the training size of the largest class 1, as a standard for carrying out oversampling (SMOTE) on the other, relatively minority, classes, our method uses n_L, the training size of the smallest class L, as a standard for performing an instance importance based undersampling on the other, relatively majority, classes.
Algorithm 1: A novel ensemble margin based bagging method (MBagging).

Training phase
Inputs: Training set S; number of classes L; n_c, the number of training instances of class c; ensemble creation algorithm (bagging); number of classifiers T; range of resampling rate a.
Output: an ensemble E.
Iterative process:
1. Construct an ensemble classifier H with all the n training data and compute the margin of each training instance x_i.
2. Obtain the weight (importance) of each training instance x_i.
3. Order separately the training instances of each class, according to the instance importance evaluation function, in descending order.
4. For t = 1 to T do
 (a) Keep all the instances of the smallest class L.
 (b) For c = 1 to L - 1: bootstrap n_L instances from the first a% of the importance-ordered samples of class c to form subset S_c. End
 (c) Construct a new balanced data set by combining the smallest class training instances with S_1, ..., S_{L-1}.
 (d) Train a classifier h_t.
 (e) E ← E ∪ {h_t}.
 (f) Change percentage a%.
End
Output: The ensemble E.

Prediction phase
Inputs: an unlabeled instance x. Output: class label y obtained by majority voting of the ensemble E.
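Putting the pieces together, the sketch below mirrors the training and prediction phases of Algorithm 1 under explicit assumptions: it reuses the ensemble_margins, importance_order and build_balanced_subset sketches given earlier, uses scikit-learn bagging and decision trees as one possible choice of initial ensemble and base learner, and cycles the resampling rate over 10%, 20%, ..., 100% as in SMOTEBagging. It is an illustration of the approach, not the authors' reference implementation.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def mbagging_fit(X, y, T=100, rates=tuple(range(10, 101, 10)), seed=0):
    classes, y_enc = np.unique(y, return_inverse=True)       # encode labels as 0..L-1
    # Step 1: initial bagging ensemble on all the data, used only to obtain margins.
    H = BaggingClassifier(n_estimators=50, random_state=seed).fit(X, y_enc)
    votes = np.stack([est.predict(X) for est in H.estimators_], axis=1)
    margins = ensemble_margins(votes, y_enc)
    correct = H.predict(X) == y_enc
    rank = importance_order(margins, correct)                 # global importance ranking
    # Step 2: per-class importance ordering, classes sorted by decreasing size.
    sizes = np.bincount(y_enc)
    by_size = np.argsort(sizes)[::-1]                         # largest class first, smallest last
    ordered = [rank[y_enc[rank] == c] for c in by_size]
    # Step 3: train T base classifiers on balanced, importance-biased subsets.
    ensemble = []
    for t in range(T):
        a = rates[t % len(rates)]                             # resampling rate cycles 10..100%
        idx = build_balanced_subset(ordered, a)
        tree = DecisionTreeClassifier(random_state=seed + t).fit(X[idx], y_enc[idx])
        ensemble.append(tree)
    return ensemble, classes

def mbagging_predict(ensemble, classes, X):
    # Majority vote of the ensemble members, mapped back to the original labels.
    votes = np.stack([m.predict(X) for m in ensemble], axis=1)
    maj = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=len(classes)).argmax(), 1, votes)
    return classes[maj]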
7. Conclusions
Ensembles of classifiers have shown very good properties for addressing the problem of imbalanced classification. They work in combination with baseline solutions for this task, such as data preprocessing applied to the whole ensemble or to each classifier of the ensemble. However, selecting more informative instances should benefit ensemble construction and better handle multi-class imbalanced classification. Our answer to this data selection problem consists of estimating instance importance by means of the ensemble margin. More specifically, instances can be emphasized or not by an ensemble of base classifiers according to their margin values. We consider the lowest margin instances to be the most informative in classification tasks.
In this work, we have proposed a novel margin ordering and undersampling based bagging method for imbalanced classification. To evaluate the effectiveness of our approach, standard bagging as well as two state of the art imbalance learning ensemble methods, UnderBagging and SMOTEBagging, which inspired our method, were used in a comparative analysis. From this study, we have highlighted the superiority of the proposed method in handling the imbalance learning problem compared with bagging, UnderBagging and SMOTEBagging.
The performances of the four margin definitions involved in our algorithm were also compared. The unsupervised margins achieve slightly better performance than the supervised margins. The unsupervised max-margin generally outperforms the other margins in terms of F-measure and minimum accuracy per class. In addition, the effectiveness of the newly proposed margin in addressing the class imbalance problem is demonstrated. As future research, we plan to extend the margin-based ensemble framework to an oversampling scheme, for instance by producing minority class instances through applying the SMOTE procedure to small margin instances.
References