Label Noise Cleaning with an Adaptive Ensemble Method Based on Noise Detection Metric

Real-world datasets are often contaminated with label noise; labeling is not a clear-cut process and reliable methods tend to be expensive or time-consuming. Depending on the learning technique used, such label noise is potentially harmful: it requires a larger training set, makes the trained model more complex and more prone to overfitting, and yields less accurate predictions. This work proposes a cleaning technique called the ensemble method based on the noise detection metric (ENDM). From the corrupted training set, an ensemble classifier is first learned and used to derive four metrics assessing the likelihood for a sample to be mislabeled. For each metric, three thresholds are set to maximize the classification performance on a corrupted validation dataset when using three different classifiers, namely Bagging, AdaBoost and k-nearest neighbor (k-NN). These thresholds are used to identify and then either remove or correct the corrupted samples. The effectiveness of the ENDM is demonstrated on the classification of 15 public datasets. A comparative analysis is conducted concerning the homogeneous-ensembles-based majority vote method and consensus vote method, two popular ensemble-based label noise filters.


Introduction
In machine learning, the prediction accuracy depends not only on the appropriate choice of the learning technique, but also on the quality of the database. Quoting [1], "real-world databases are estimated to contain around five percent of encoding errors, all fields taken together when no specific measures are taken." The noise contaminating databases is mainly of two types: feature noise or label noise, also called class noise (i.e., mislabeled data) [2]. Whether one or the other prevails depends on the application field. Using inaccurate sensors or choosing less invasive measurements may explain why the feature noise is predominant. On the other hand, labeling training instances may be contaminated with data entry errors. It is a costly and rather subjective task, as the meaning of a label could be inadequate [3][4][5][6][7][8][9][10]. As a result, the label noise could be predominant.
Feature noise is generally spread over many features and each feature noise component tends to be statistically independent of the others, and most learned classifiers are robust to such noise. Conversely, label noise can significantly affect the learning performance [11][12][13] and should be taken into account when designing learning algorithms [14]. Noise can increase the number of necessary training instances, the complexity of learned models, the number of nodes in decision trees [1] and the size (number of base classifiers) of an ensemble classifier [5]. Learning from noisy labeled data can

(Case 1) A clean sample is regarded as mislabeled and cleaned. This case harms the classification performance, especially when the size of the training dataset is small.
(Case 2) A mislabeled sample is regarded as clean and retained or unchanged. This makes noisy samples remain in the training dataset and degrades the classification performance.
The ensemble approach is a popular method to filter out mislabeled instances [4,15,23,27,28,29]. It constructs a set of base-level classifiers and then uses their classifications to identify mislabeled instances [11]. The majority filter and consensus filter are two typical noise cleaning methods. A majority filter tags an instance as mislabeled if more than half of the base classifiers do not predict its assigned label. A consensus filter tags an instance as mislabeled if none of the base classifiers predicts its assigned label. Because the consensus criterion is strict, only a small portion of the label noise is removed: most mislabeled instances remain in the filtered training set, and performance is hindered more than with the majority vote filter, which removes a larger portion of the label noise. However, because of the diversity of the ensemble classifier used in the majority filter, samples near the classification boundary have fewer base classifiers predicting their assigned label, so more correctly labeled instances are removed from the training set, which can negatively affect the classifier's performance [1,30].
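The two voting rules described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names (`count_wrong_votes`, `majority_filter`, `consensus_filter`) and the toy predictions are hypothetical, and the base classifiers are represented simply by the labels they predict for each training sample.

```python
from typing import List, Sequence


def count_wrong_votes(predictions: Sequence[Sequence[int]],
                      labels: Sequence[int]) -> List[int]:
    """For each sample i, count the base classifiers whose prediction
    differs from the assigned label. predictions[t][i] is the label
    predicted by base classifier t for sample i."""
    return [sum(1 for preds in predictions if preds[i] != labels[i])
            for i in range(len(labels))]


def majority_filter(predictions: Sequence[Sequence[int]],
                    labels: Sequence[int]) -> List[bool]:
    """Tag a sample when more than half of the base classifiers disagree
    with its assigned label."""
    T = len(predictions)
    return [w > T / 2 for w in count_wrong_votes(predictions, labels)]


def consensus_filter(predictions: Sequence[Sequence[int]],
                     labels: Sequence[int]) -> List[bool]:
    """Tag a sample only when every base classifier disagrees with its
    assigned label."""
    T = len(predictions)
    return [w == T for w in count_wrong_votes(predictions, labels)]


# Toy example (hypothetical data): 3 base classifiers, 4 samples.
predictions = [
    [0, 1, 1, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
]
labels = [0, 1, 1, 0]

majority_flags = majority_filter(predictions, labels)
consensus_flags = consensus_filter(predictions, labels)
print(majority_flags)   # [False, False, True, True]
print(consensus_flags)  # [False, False, False, True]
```

In this toy case the third sample, on which the base classifiers disagree, is tagged by the majority filter but kept by the consensus filter, which matches the observation that the consensus criterion is the more conservative of the two.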
Depending on whether base classifiers are induced using different or similar learning techniques, the ensemble-based noise filtering method is referred to as heterogeneous or homogeneous.
In the heterogeneous method, an ensemble classifier detects mislabeled instances by constructing a set of base-level detectors (classifiers) and then using their classification errors to identify mislabeled instances. Brodley et al. chose three well-known algorithms from the machine learning and statistical pattern recognition communities to form the filters: decision trees, nearest neighbor classifiers, and linear machines. An instance is tagged as mislabeled if α of the T base-level classifiers cannot classify it correctly. In heterogeneous ensembles, the decision borders are varied because the individual classifiers are of different types. The dispersion of class noise may reflect this variability. Hence, this method tends to eliminate instances that lie on the wrong side of the classification boundary, which can negatively affect the classifier's performance.
The homogeneous ensemble vote for noise filtering [15] is an improved version of the aforementioned heterogeneous ensemble-based method. Verbaeten and Assche considered the problem of mislabeled training examples by preprocessing the training set based on some well-known ensemble classification methods (Bagging and boosting) [15] using C4.5 as base classifier [31]. They proposed two approaches: Results show that majority vote filters are still more accurate than consensus filters in the homogeneous ensemble method. In addition, Bagging majority vote filters outperform the boosting filters. The boosting filter tends to incorrectly remove many important correctly labeled instances with large weights. In addition, in a homogeneous ensemble, the decision boundaries of the individual classifiers are similar to each other. Then, noisy examples close to this decision boundary can be detected effectively by the majority voting [32]. In summary, in this paper, the homogeneous ensemble-vote-based method for noise filtering is used as the comparison in the experiment.
Outlier removal boosting (ORBoost) [33] is an example of the ensemble approach used in a different manner: data cleaning is performed while learning, rather than before or after. The only difference with AdaBoost is that the weight assigned to each sample is set to zero when it exceeds a certain threshold. Good performance is observed when the label noise is low.
An ensemble classifier induced on a training set is also a precious source of information on each instance: how many base classifiers did not predict the assigned label? What is the cumulative weight of these base classifiers failing to predict the assigned label? To what extent might the ensemble classifier have been able to predict a different label? These questions have greatly influenced our work. Using a technical definition of an edge, [34,35] exploited the answers to the first two questions to detect mislabeled instances. [23] exploited the answer to the third question to compute, for each instance, a margin with which mislabeled instances are detected. However, a substantial number of mislabeled instances are not removed, perhaps because the possibly correct label of instances from the training set has actually no impact on the margin value and because the number of suspicious instances removed is set according to the performance on a validation dataset using Bagging, which happens to be fairly robust to label noise. Our work is different in that (1) to set the number of removed samples, instead of only Bagging, it adopts three methods (Bagging, AdaBoost and k-NN, with k = 1 rendering it especially noise-sensitive); (2) four noise detection metrics are considered (instead of one); (3) the ENDM is extended to label noise correction. A comparative analysis is also conducted concerning the majority vote filter [4,15].
In this paper, we deal with the label noise issue using an adaptive ensemble method based on a noise detection metric (ENDM). It is called adaptive because no fixed threshold is used to select the suspicious instances, as is the case for majority and consensus filters; rather, there is a counting parameter whose value is selected using a validation set. Our proposed method for noise detection is described in Section 2. In Section 3, we present the results of an empirical evaluation of the proposed method and a comparison with other state-of-the-art noise detection methods. In Section 4, the conclusions are provided, and future work is discussed.

Label Noise Detection Metric
In this work, it is assumed that all instances have an unknown, yet fixed, probability of being mislabeled. This assumption is called the uniform label noise in [1]. To assess whether an instance is more likely to be mislabeled than another, a homogeneous ensemble classifier is trained and four metrics are computed based on the votes of the base classifiers and their number [5]. Once samples are ordered according to one of the metrics, it suffices to use a validation set to fix the exact number of tagged instances and hence to select the suspicious instances [23]. The flowchart of the proposed method is shown in Figure 1.
Let us define some notations:
• ζ is an ensemble model composed of T base classifiers,
• (x, y) is an instance, with x as a feature vector and y as one of the C class labels,
• v(x, c) is the number of classifiers predicting the label c when the feature vector is x,
• Λ(x, y) is a metric assessing the likelihood that a sample (x, y) will be mislabeled (it is used to sort samples),
• 1(P) is equal to one when statement P is true and is equal to zero otherwise,
• λ is a counting parameter.
All four proposed metrics range from 0 to 1, with 0 indicating that the label is very suspicious and 1 indicating that the label is reliable.

Figure 1. Flowchart of the proposed method (recoverable box labels: "Misclassified data"; "Obtain the best value of λ, and eliminate or correct the first λN most likely mislabeled instances").

Supervised Max Operation (SuMax)
A popular ensemble margin function was introduced by Schapire et al. [36] and has been used in data importance evaluation [5]. This ensemble margin is defined as

$$\mathrm{margin}_{\mathrm{SuMax}}(x, y) = \frac{v(x, y) - \max_{c \neq y} v(x, c)}{T} \qquad (1)$$

A positive value of margin_SuMax(x, y) means that this instance is correctly classified by the set of T classifiers when using a majority vote. A negative value indicates misclassification. When there is a class c such that v(x, c) equals T, margin_SuMax(x, y) = 1 or margin_SuMax(x, y) = −1 depending on whether c = y or not. Otherwise, margin_SuMax(x, y) lies in (−1, 1).
The SuMax-based noise detection metric is defined by Equation (2).
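As a small illustration of the supervised max margin discussed in this subsection, the sketch below computes it from the vote counts v(x, c). It assumes the usual form of Schapire's margin, (v(x, y) − max over c ≠ y of v(x, c)) / T, with T taken as the total number of votes; the function name `sumax_margin` and the toy vote counts are hypothetical. The mapping from this margin to the metric of Equation (2) is not reproduced here.

```python
from typing import Sequence


def sumax_margin(votes: Sequence[int], y: int) -> float:
    """Supervised max margin for one sample.

    votes[c] is v(x, c), the number of base classifiers voting for class c;
    y is the assigned label. Requires at least two classes.
    """
    T = sum(votes)  # every base classifier casts exactly one vote
    best_other = max(v for c, v in enumerate(votes) if c != y)
    return (votes[y] - best_other) / T


# Toy vote counts (hypothetical), T = 10 base classifiers, 3 classes.
unanimous_right = sumax_margin([10, 0, 0], y=0)  # all votes on the assigned label
unanimous_wrong = sumax_margin([0, 10, 0], y=0)  # all votes on another label
narrow_win = sumax_margin([4, 3, 3], y=0)        # assigned label barely wins
print(unanimous_right, unanimous_wrong, narrow_win)
```

The first two cases reach the extreme values +1 and −1, while the third gives a small positive margin, which is typical of samples close to the decision boundary.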

Supervised Sum Operation (SuSum)
The margin of a sample can also be obtained as the difference between the fraction of classifiers voting correctly and the fraction voting incorrectly [37]. Unlike the previous definition, in a multiclass context, a negative value of the margin does not necessarily indicate misclassification. This definition bears some resemblance to the definition of an edge in [34]: given an ensemble classifier and an instance, the edge is the sum of the weights associated with the classifiers predicting a wrong label.
The SuSum-based noise detection metric is defined by Equation (3).
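The remark that a negative sum-based margin does not necessarily indicate misclassification can be checked on a toy case. The sketch below assumes the margin is (v(x, y) − sum over c ≠ y of v(x, c)) / T, which is one reading of "the difference between the fraction of classifiers voting correctly and incorrectly"; the function name `susum_margin` and the vote counts are hypothetical, and the exact metric of Equation (3) is not reproduced here.

```python
from typing import Sequence


def susum_margin(votes: Sequence[int], y: int) -> float:
    """Supervised sum margin for one sample.

    votes[c] is v(x, c); y is the assigned label. The margin is the
    fraction of votes for y minus the fraction of votes for all other
    classes.
    """
    T = sum(votes)
    wrong_votes = T - votes[y]
    return (votes[y] - wrong_votes) / T


def majority_vote(votes: Sequence[int]) -> int:
    """Class receiving the most votes (lowest index wins ties)."""
    return max(range(len(votes)), key=lambda c: votes[c])


# Toy three-class case (hypothetical): the assigned label 0 wins the
# majority vote with 4 votes out of 10, yet 6 votes go to other classes.
votes = [4, 3, 3]
margin = susum_margin(votes, y=0)
predicted = majority_vote(votes)
print(margin, predicted)
```

Here the ensemble still predicts the assigned label, but the sum-based margin is negative because the wrong votes, taken together, outnumber the correct ones.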

Unsupervised Max Operation (UnMax)
In [23], the authors proposed a new margin definition that is more robust to label noise. It is an unsupervised version of Schapire's method and is defined in Equation (4), where ζ(x) is the predicted class of the ensemble classifier for sample x: ζ(x) = arg max_c v(x, c).

Unsupervised Sum Operation (UnSum)
In our previous work [38], we proposed a new unsupervised data importance evaluation method. It is an unsupervised version of the SuSum (3) and is defined in Equation (5).
According to the above presentation of the four metrics, the supervised margins rely on the assigned label of each instance, whereas the unsupervised margins depend only on the votes and are therefore robust to the class values. In addition, when compared with the max operation, the sum-based methods tend to assign higher noise weight values to the misclassified instances.
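To make the supervised/unsupervised distinction concrete, the sketch below computes unsupervised max and sum margins from the vote counts alone. It is a sketch under assumptions rather than the definitions of Equations (4) and (5), which are not reproduced in the text: the predicted class ζ(x) simply replaces the assigned label in the supervised margins, ties in the arg max go to the lowest class index, and the names `predicted_class`, `unmax_margin` and `unsum_margin` are hypothetical.

```python
from typing import Sequence


def predicted_class(votes: Sequence[int]) -> int:
    """zeta(x) = arg max_c v(x, c); ties go to the lowest class index
    (an assumption of this sketch)."""
    return max(range(len(votes)), key=lambda c: votes[c])


def unmax_margin(votes: Sequence[int]) -> float:
    """Votes for the predicted class minus votes for the runner-up,
    divided by the total number of votes."""
    T = sum(votes)
    z = predicted_class(votes)
    runner_up = max(v for c, v in enumerate(votes) if c != z)
    return (votes[z] - runner_up) / T


def unsum_margin(votes: Sequence[int]) -> float:
    """Votes for the predicted class minus votes for all other classes,
    divided by the total number of votes."""
    T = sum(votes)
    z = predicted_class(votes)
    return (votes[z] - (T - votes[z])) / T


# Toy vote counts (hypothetical): 10 base classifiers, 3 classes.
votes = [2, 7, 1]
z = predicted_class(votes)
m_max = unmax_margin(votes)
m_sum = unsum_margin(votes)
print(z, m_max, m_sum)
```

Neither function takes the assigned label as an argument, so a corrupted label cannot change the value; this is the sense in which such margins are insensitive to the class values.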

Label Noise Cleaning Method
The proposed label noise cleaning method can be used with any of the four noise detection metrics. The samples tagged as mislabeled are the λN samples having the smallest metric values, with λ itself depending on whether the tagged samples are removed or corrected.

Label Noise Removal with ENDM
The pseudo-code of the ENDM-based noise removal method is presented as Algorithm 1. In the first step, Bagging is used to induce an ensemble classifier composed of pruned trees from the whole training set S. The misclassified samples are collected in a subset S′. Based on the predictions of the base classifiers, the chosen metric is used to sort the samples of S′ according to their metric value. The upper bound λ max is then computed as the ratio of the size of S′ to the size of S.
The second step is an iterative procedure. In this step, λ ranges from 0 to λ max in steps of 1%. At each iteration, a new subset is formed, containing the correctly classified samples of S and the λN-lowest sorted samples. Then, this subset is used to learn a classifier (Bagging, AdaBoost or k-NN with k = 1), and the accuracy of this classifier is measured on the validation set. The finally selected λ-value is the one yielding the highest accuracy on the validation set.
In the last step, a clean training set is defined with the selected λ-value as in the second step, and a noise-sensitive classifier, AdaBoost or k-NN, is induced on this clean training set and tested on the test set.
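The three steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' Algorithm 1: the metric values and the misclassification flags are taken as given (they would come from the Bagging ensemble), the tagged samples are assumed to be the misclassified samples with the smallest metric values, a toy 1-NN on one-dimensional features stands in for the classifier used to select λ, ties keep the smallest λ, and every name (`one_nn_accuracy`, `endm_removal`) and every data value is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (1-D feature, label), kept tiny for illustration


def one_nn_accuracy(train: Sequence[Sample], valid: Sequence[Sample]) -> float:
    """Accuracy on `valid` of a 1-NN classifier built from `train`."""
    hits = 0
    for x, y in valid:
        _, pred = min(train, key=lambda s: abs(s[0] - x))
        hits += int(pred == y)
    return hits / len(valid)


def endm_removal(train: Sequence[Sample], metric: Sequence[float],
                 misclassified: Sequence[bool], valid: Sequence[Sample],
                 evaluate: Callable = one_nn_accuracy,
                 step: float = 0.01) -> Tuple[float, List[Sample]]:
    """Select lambda on the validation set and return (lambda, cleaned set)."""
    n = len(train)
    # Step 1: rank misclassified samples, most suspicious (lowest metric) first.
    ranked = sorted((i for i in range(n) if misclassified[i]),
                    key=lambda i: metric[i])
    lam_max = len(ranked) / n
    # Step 2: try each lambda in [0, lam_max] and score the filtered set.
    best_acc, best_lam, best_set = -1.0, 0.0, list(train)
    for j in range(int(round(lam_max / step)) + 1):
        lam = j * step
        k = min(len(ranked), int(round(lam * n)))
        removed = set(ranked[:k])
        subset = [s for i, s in enumerate(train) if i not in removed]
        acc = evaluate(subset, valid)
        if acc > best_acc:  # strict '>' keeps the smallest lambda on ties
            best_acc, best_lam, best_set = acc, lam, subset
    # Step 3 (training the final classifier on best_set) is left to the caller.
    return best_lam, best_set


# Toy data (hypothetical): class 0 for x < 5, class 1 otherwise;
# the samples at x = 1 and x = 8 carry a wrong label.
train = [(0, 0), (1, 1), (2, 0), (3, 0), (4.6, 0),
         (5.4, 1), (6, 1), (7, 1), (8, 0), (9, 1)]
misclassified = [False, True, False, False, True,
                 False, False, False, True, False]
metric = [0.9, 0.1, 0.8, 0.9, 0.45, 0.6, 0.8, 0.9, 0.2, 0.9]
valid = [(0.8, 0), (1.3, 0), (2.5, 0), (4.8, 0),
         (5.2, 1), (7.6, 1), (8.2, 1), (8.9, 1)]

best_lam, cleaned = endm_removal(train, metric, misclassified, valid, step=0.1)
print(best_lam, len(cleaned))
```

In this toy run the validation accuracy peaks when the two mislabeled samples are removed and drops again when the third candidate, a clean sample near the boundary, is removed as well, which is the behavior the λ-selection step is meant to capture.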

Label Noise Correction with ENDM
The only difference in ENDM-based noise correction is that tagged samples have their label corrected instead of being removed. Its pseudo-code is presented as Algorithm 2 and its description is also divided into three steps.
In the first step, the same technique is used to induce an ensemble classifier (Bagging with pruned trees), and the same misclassified samples are collected in a subset S′ and sorted according to the chosen metric. The value λ max is the ratio that bounds the number of tagged instances.
In the second step, λ ranges from 0 to λ max in steps of 1%. Then, the first λN-highest sorted samples of S′ are tagged. This step is different in that, instead of collecting the nontagged samples in a new subset, the whole training set is considered and the labels of the tagged samples are changed into those predicted by the ensemble classifier. This modified training set is again used to learn a classifier (Bagging, AdaBoost or k-NN with k = 1), and the accuracy of this classifier is measured on the validation set. As the inducing sets have been modified, there is no reason that the selected λ-value yielding the highest accuracy should be the same.
The last step is the same, though the inducing set is different in size and label values: a noise-sensitive classifier, AdaBoost or k-NN, is induced on this modified training set and tested on the test set.

Algorithm 2 (final steps):
    Initiate clean set with S, S_λ := S;
    Modify in S_λ with A the λN-highest samples, ∀i ≤ λN, y_i := A(x_i);
    Train ζ_λ according to B with S_λ;
    Compute accuracy of ζ_λ on V, a_λ := (1/|V|) ∑_{(x,y)∈V} 1(y = ζ_λ(x));
  end for
  Select the optimal λ-value, λ* := arg max_λ a_λ;
  Select the best filtered training set, S* := S_λ*.
  Output: the clean training set S*.
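The relabeling step and the accuracy computation used in the correction modality can be sketched as follows. This is an illustration under assumptions: the ranking of suspicious samples and the ensemble predictions are taken as given, and the names `correct_labels` and `accuracy`, as well as the toy labels, are hypothetical.

```python
from typing import List, Sequence


def correct_labels(labels: Sequence[int], ensemble_pred: Sequence[int],
                   ranked: Sequence[int], k: int) -> List[int]:
    """Return a copy of `labels` in which the first k samples of `ranked`
    (most suspicious first) take the label predicted by the ensemble."""
    corrected = list(labels)
    for i in ranked[:k]:
        corrected[i] = ensemble_pred[i]
    return corrected


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the two label sequences agree, i.e. the
    mean of the indicator 1(y = prediction)."""
    return sum(1 for a, b in zip(y_true, y_pred) if a == b) / len(y_true)


# Toy example (hypothetical): 10 training labels, ensemble predictions,
# and three suspicious samples ranked from most to least suspicious.
labels = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1]
ensemble_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
ranked = [1, 8, 4]

corrected = correct_labels(labels, ensemble_pred, ranked, k=2)
print(corrected)       # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(len(corrected))  # 10
```

Unlike removal, correction keeps the training set at its original size; only the labels of the tagged samples change, so the candidate sets compared on the validation set differ from those of the removal modality.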

Experiment Settings
As in [2,4,5,15,27], and following most of the literature addressing the label noise issue, artificial noise is introduced in the training set and the validation set, not in the test set. In all our experiments, 20% of the training samples and 20% of the validation samples are randomly selected and have their labels randomly modified to another label. For a fairer comparison, we included the validation data in the training data when the validation set was not necessary (e.g., no filtering, majority vote filters and consensus vote filters). As for the Bagging-induced ensemble classifier using the uncleaned training set, in all experiments, it is composed of exactly 200 pruned Classification and Regression Trees (CART) [39] as base classifiers.
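The noise injection protocol described above can be sketched as follows. This is a minimal sketch, assuming that the replacement label is drawn uniformly among the other classes and that the number of corrupted samples is the rounded fraction of the set size; the function name `inject_label_noise`, the seed and the toy labels are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def inject_label_noise(labels: Sequence[int], classes: Sequence[int],
                       rate: float = 0.2,
                       seed: int = 0) -> Tuple[List[int], List[int]]:
    """Flip the labels of a random fraction `rate` of the samples.

    Each selected sample receives a label drawn uniformly among the other
    classes (an assumption of this sketch). Returns the noisy labels and
    the sorted indices of the corrupted samples.
    """
    rng = random.Random(seed)
    n_noisy = int(round(rate * len(labels)))
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, sorted(noisy_idx)


# Toy usage (hypothetical labels): 50 samples, 3 classes, 20% label noise.
clean_labels = [i % 3 for i in range(50)]
noisy_labels, corrupted = inject_label_noise(clean_labels, classes=[0, 1, 2])
print(len(corrupted))
```

The same function would be applied separately to the training labels and to the validation labels, while the test labels are left untouched.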
To exemplify the proposed method, it is applied with noise removal on the Statlog dataset with λ ranging from 0 up to λ max (= 31%) in steps of 1%. Subsets are built by collecting the λN-lowest values of the training set. AdaBoost is induced on each of those subsets. Figure 2 shows two curves. The lower curve is the accuracy of each induced AdaBoost classifier measured on the validation set, as a function of λ. The upper curve is the accuracy of the same classifiers measured on the test set, as a function of λ. As the test set is noise-free, it is no surprise that the accuracy measured on the test set is significantly higher than the accuracy measured on the validation set. Note that both curves have very similar shapes, showing the appropriateness of the proposed way of selecting λ.
A comparative analysis is conducted between the ENDM-based mislabeled data identification method and the homogeneous-ensemble-based majority vote method [15]. Both label noise removal and correction schemes are involved in the comparison.

Datasets
The experimentation is based on 15 public multiclass datasets from the UCI and KEEL machine learning repositories [40]. Each dataset was divided into three parts: training set, validation set and test set. Those datasets are described in Table 1, where Num. refers to the number of examples, Variables to the number of attributes (and their type) and Classes to the number of classes.

Table 1.
Dataset          Num. (training)  Num. (validation)  Num. (test)  Variables  Classes
Abalone          1500             750                1500         8          3
ForestTypes      200              100                200          27         4
Glass            80               40                 80           10         6
Hayes-roth       64               32                 64           3          3
Letter           5000             2500               5000         16         26
Optdigits        1000             500                1000         64         10
Penbased         440              220                440          15         10
Pendigit         2000             1000               2000         16         10
Segment          800              400                800          19         7
Statlog          2000             1000               2000         36         6
Texture          2000             1000               2000         40         11
Vehicle          200              100                200          18         4
Waveform         2000             1000               2000         21         3
Wine             71               35                 72           12         3
Winequalityred   600              300                600          11         6

Tables 2 and 3 show, respectively, the accuracy of AdaBoost and k-NN classifiers induced using three different training sets: the unmodified training set (no filtering), the training set filtered by ENDM with noise removal, and the training set filtered by ENDM with noise correction. On all 15 datasets, using both learning techniques and regardless of the modality chosen (noise removal or noise correction), the ENDM technique performs better than not using any filter, with an average increase of 2.42%. In comparison with not filtering, the ENDM technique yields an increase in accuracy of 10% on dataset Letter with AdaBoost and about 16% with k-NN on dataset Optdigits. The ENDM noise-correction modality, be it with AdaBoost or k-NN, most often performs worse than the noise-removal modality; nonetheless, it remains better than not using any filter.

Table 2. Accuracy of the AdaBoost-classifier induced using 15 different training sets: three filtering techniques (no filter, majority-vote and ENDM) and two modalities (noise removal and noise correction). The best results are marked in bold.

Comparison of ENDM Versus Other Ensemble-Vote-Based Noise Filter
Tables 2 and 3 also show the accuracy of AdaBoost and k-NN classifiers, respectively, induced using different training sets: the training sets filtered by the majority-vote-based and consensus-vote-based methods with noise removal and noise correction, and the training set filtered by ENDM with noise removal and noise correction. As noted previously, the noise-correction modality of the majority-vote-based noise filtering technique most often performs worse than the noise-removal modality when using AdaBoost or k-NN. On 10 of the 15 datasets, using AdaBoost and considering only the noise-removal modality, ENDM is more successful than the other ensemble-vote-based noise filtering techniques. With respect to the majority vote filter, the best increase in accuracy is obtained on dataset Letter: over 9% using AdaBoost and 25% using k-NN. Both tables show that ENDM is safer than the majority vote method. The majority-vote-based and consensus-vote-based methods tend to tag more instances as noisy, so when data removal is carried out, more useful samples are wasted.

Comparing Different Noise Detection Metrics in ENDM
Tables 6 and 7 also show the accuracy of AdaBoost and k-NN-classifiers, respectively, induced using twelve different ENDM-based filtered training sets with the noise-removal modality: three different classifiers (Bagging, AdaBoost, k-NN with k = 1) are used to select λ and four different metrics (SuMax, UnMax, SuSum, UnSum) yield four different ways of ordering samples. The histogram figures of the performances of the four different noise detection metrics in the proposed algorithm on the datasets Letter and Optdigits are shown in Figures 4 and 5.
When comparing columns 1 and 2 or 3 and 4 from both tables, it appears that using supervised metrics (SuMax and SuSum) is most often more successful than using unsupervised metrics (UnMax and UnSum). When comparing columns 3 and 1 from both tables, it appears that SuSum is quite often more successful than SuMax. This finding does not extend to the unsupervised metrics, as UnSum and UnMax yield similar performances. This may be explained by the lack of information available to unsupervised metrics.

Table 6. Accuracy of the AdaBoost classifier trained on ENDM-based filtered training set with the noise-removal modality, four different metrics (SuMax, UnMax, SuSum, UnSum) and three different classifiers applied on the validation dataset (Bagging, AdaBoost, k-NN with k = 1). The values in brackets are the detected noise ratios. The best results are marked in bold.

Tables 8 and 9 show the accuracy of AdaBoost and k-NN, respectively, when both classifiers are combined with the proposed noise correction modality.

When comparing the accuracies yielded using the Bagging-selected λ-value with those yielded using AdaBoost and k-NN, it appears that Bagging is most often not the appropriate tool to select λ. For example, in Table 7, when compared with Bagging, using AdaBoost or k-NN to select λ could increase the accuracy by over 11% on the Vehicle dataset. Indeed, Bagging is known to be more robust to label noise than AdaBoost and k-NN with k = 1, and λ is estimated as the value yielding the best performance when trained on a λ-dependent training set and tested on a validation set. Therefore, it makes sense to use noise-sensitive classifiers instead of noise-robust classifiers.

Comparing Different Classifiers Used for λ-Selection in ENDM
In Table 6, when comparing the AdaBoost-tested accuracies yielded using the AdaBoost-selected λ-value with those yielded using the k-NN-selected λ-value, AdaBoost seems more appropriate to select λ. Now, in Table 7, when comparing the k-NN-tested accuracies yielded using the AdaBoost-selected λ-value with those yielded using the k-NN-selected λ-value, k-NN seems more appropriate to select λ. Hence, when the classifier used on the clean training set is noise-sensitive, it seems sensible to use that same learning technique when selecting λ.

Conclusions
This paper has focused on cleaning training sets contaminated with label noise, a challenging issue in machine learning, especially when a noise-sensitive classifier is desired. In line with the literature, we have proposed a two-stage process called ENDM, where an induced ensemble classifier enables the measurement of the reliability of the label of each training instance, and then the maximization of the accuracy of a second classifier tested on a validation set provides an estimate of the number of samples that should be removed or corrected. When compared on 15 public datasets with no cleaning, with the majority-vote-based filtering method, and with the consensus-vote-based filtering method, ENDM appears to perform significantly better.
In addition, experiments and discussion have provided some insights into how a label's reliability should be measured, whether suspicious samples should be removed or have their labels modified and how the second classifier ought to be chosen.
Future work will investigate more realistic ways of introducing artificial label noise in the datasets. Imbalanced label-noisy datasets will also be considered, as cleaning filters tend to discriminate against minority instances, which are more difficult to classify.

Table 8. Accuracy of the AdaBoost classifier trained on the ENDM-based filtered training set with the noise-correction modality, four different metrics (SuMax, UnMax, SuSum, UnSum) and three different classifiers (Bagging, AdaBoost, and k-NN with k = 1). The values in brackets are the detected noise ratio. The best results are marked in bold.