An Efficient AdaBoost Algorithm with Multiple Thresholds Classification

Adaptive boosting (AdaBoost) is a prominent ensemble learning algorithm that combines weak classifiers into a strong classifier through a weighted majority voting rule. AdaBoost's threshold-based weak classifier searches for the best threshold in one dimension of the data, dividing the examples into two categories, −1 and 1. In some cases, however, this Weak Learning algorithm is not accurate enough, showing poor generalization performance and a tendency to overfit. To address these challenges, we first propose a new Weak Learning algorithm that classifies examples based on multiple thresholds, rather than only one, to improve accuracy. Second, we modify the weight allocation scheme of the Weak Learning algorithm within the AdaBoost framework so that the potential value of other dimensions is exploited during classification, and we provide a theoretical justification of its generality. Finally, comparative experiments between the two algorithms on 18 UCI datasets show that our improved AdaBoost algorithm achieves better generalization on the test sets during the training iterations.


Introduction
The rapid growth of the Internet has led to a dramatic increase in the rate of data generation. Data mining is one of the most important means of extracting value from such large volumes of data, and classification is a fundamental operation in processing this digitized information. Accurate classification saves considerable time and cost in subsequent work such as analysis, forecasting, and fitting.
Ensemble methods are well suited to regression and classification: by combining multiple models into one reliable model, they reduce bias and variance and thus boost prediction accuracy [1,2]. Two common techniques for constructing ensemble classifiers are Boosting [3][4][5][6][7] and Bagging [8][9][10]; Boosting tends to outperform Bagging when there is little noise in the data [11].
In the field of machine learning, Boosting is a classic general-purpose learning algorithm based on the "probably approximately correct" (PAC) learning model proposed by Valiant. Freund and Schapire improved the Boosting algorithm in 1995 and named the result AdaBoost.
The AdaBoost algorithm has since developed into an important classification algorithm in machine learning and has been widely used in face detection [12], image retrieval [13], intrusion detection [14], object recognition [15], feature extraction [16], and other applications.
During training, AdaBoost raises the weights of samples that are misclassified and lowers the weights of samples that are classified correctly, then continues training on the reweighted distribution. The aim is to obtain powerful feedback by reducing the errors of the next learner and ultimately achieve better accuracy [17].
How to design an effective classifier [18] that improves classification accuracy is at the core of this type of algorithm. In recent years, the AdaBoost algorithm has attracted great interest; to address insufficient classification accuracy and long training times, a variety of improved algorithms have been proposed [19,20].
Some scholars have focused on the problem of unbalanced data classification [21,22] to improve the classification accuracy of the algorithm. Juan J. Rodríguez and Jesús Maudes proposed a new reconfigurable Weak Learn algorithm based on decision trees [23]. Utkin and Zhuk analyzed a robust classification algorithm called Robust Imprecise Local Boost (RILBoost), which improves classification performance by focusing on the vector of weights assigned to training examples at each boosting iteration [24]. Another group of scholars designed influential AdaBoost variants under the forward stagewise additive modeling framework based on nonparametric regression, including LogitBoost [25], GradientBoost [26], gradient boosted trees [26], and ProbitBoost [27] for probabilistic regression. In their experiments, the AdaBoost-based approaches provided better detection performance than non-AdaBoost-based methods.
However, it is difficult for an algorithm to combine high classification accuracy with low complexity: strong classifiers are accurate but hard to obtain, while weak classifiers are the opposite. The Boosting framework only requires the Weak Learn algorithm to achieve accuracy better than 0.5, yet even this requirement can be hard to fulfill, and the number of training examples can significantly affect accuracy. Studies have shown that, even under ideal circumstances, AdaBoost will eventually overfit as the number of iterations increases [28]; the sub-classifiers generated in the late iterations contribute very little to the generalization performance of the ensemble, so there is not only a risk of overfitting but also a waste of computing power.
Furthermore, analyses of the diversity of the sub-classifiers generated by AdaBoost [29,30] found that in the early iterations each new sub-classifier focuses on samples that the previous sub-classifiers found difficult, so diversity is high. After several iterations, however, samples that are difficult to classify correctly tend to remain misclassified, and diversity declines sharply as the newly generated sub-classifiers pay more and more attention to the same batch of stubbornly misclassified samples.
Although many scholars have made significant contributions to improving accuracy and generalization, most of this work has focused on modifications to the algorithmic framework; few have focused on tuning the Weak Learn algorithm itself.
In this paper, we address the low accuracy and weak generalization of the AdaBoost algorithm when threshold classification is used as the weak classifier. Our learning model faces two challenges: first, how to mine more exploitable value from all features of the training samples to improve the accuracy of threshold classification; second, how to assign weights to the different weak classifiers during iteration so as to build an ensemble classifier with strong generalization ability. To solve these two challenges, we propose a novel AdaBoost algorithm with Multiple Thresholds Classification as the Weak Learn algorithm, which improves both classification accuracy and generalization ability.
In all, the innovation of this paper is an improved threshold classification method, called the Multiple Thresholds Classification method, which takes the weighted vote of the three most accurate thresholds as the prediction result and serves as the Weak Learning algorithm for AdaBoost. Meanwhile, to make the Multiple Thresholds Classification method fit the AdaBoost framework better, we choose the maximum error rate as the weight parameter when assigning weights to the weak classifiers, so as to obtain more weak classifiers with smaller weights and fully exploit the value of all features of the samples. The result is a predictive classifier with higher accuracy, better generalization, and good convergence across different types of classification problems in different domains.
The rest of this paper is organized as follows. In Section 2, to facilitate the description, we introduce the AdaBoost algorithm and the Threshold Classification algorithm used as its Weak Learn algorithm. In Section 3, we illustrate our AdaBoost algorithm with Multiple Thresholds Classification and the new weight distribution in detail. Our AdaBoost algorithm is tested for accuracy and generalization in Section 4 using 18 datasets from UCI. Conclusions and future work are given in Section 5.

Background
In this section, to facilitate the illustration of our AdaBoost algorithm with Multiple Thresholds Classification, we first introduce the AdaBoost algorithm in Section 2.1 and then the Threshold Classification algorithm in Section 2.2. AdaBoost with Multiple Thresholds Classification is an ensemble learning classifier that uses AdaBoost as the frame algorithm and Threshold Classification as the Weak Learning algorithm.

AdaBoost
AdaBoost is an effective ensemble learning algorithm that can take full advantage of a rough but easily obtainable Weak Learn algorithm. The classical procedure is given in Algorithm 1.

Algorithm 1 AdaBoost.
1. Input: training dataset S = {(X_1, Y_1), ..., (X_m, Y_m)} and ensemble size T.
2. Initialization: ω_1,i = 1/m, i = 1, 2, ..., m.
3. Do for t = 1, 2, ..., T:
(3.1) generate a weak classifier h_t based on the Weak Learn algorithm with the current weight distribution ω_t;
(3.2) calculate the weighted training error ε_t = Σ_{i: h_t(X_i) ≠ Y_i} ω_t,i;
(3.3) set the classifier weight α_t = (1/2) ln((1 − ε_t)/ε_t);
(3.4) update the distribution ω_{t+1},i = ω_t,i · exp(−α_t Y_i h_t(X_i))/Z_t, where Z_t normalizes ω_{t+1} to a distribution.
4. Output: the final hypothesis f(X) = sign(Σ_t α_t h_t(X)).

In the training dataset S, (X_i, Y_i) is the ith example, with X_i ∈ R^d holding the attributes and Y_i ∈ {−1, 1} the label (i = 1, 2, ..., m); the weight distribution over all samples is initially uniform in step 2. AdaBoost then calls the Weak Learn algorithm repeatedly in a series of iterations, as shown in step 3. In the tth iteration, the Weak Learn algorithm trains a classifier h_t, and the distribution ω_t is updated after each iteration according to the prediction results on the training samples. "Easy" samples that h_t classifies correctly receive lower weights, and "hard" samples that are misclassified receive higher weights according to step 3.4.
In the next iteration, the Weak Learn algorithm produces a new classifier h_{t+1} under the new distribution ω_{t+1}. In this way, AdaBoost focuses on the samples with higher weights, which appear harder for the Weak Learn algorithm. This continues for T iterations, and finally AdaBoost linearly combines the component classifiers to form a single final hypothesis f.
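The iterative procedure above can be sketched in a few lines. The following Python sketch is our own illustration, not the paper's MATLAB code; it assumes the standard AdaBoost update rules with a pluggable weak learner:

```python
import numpy as np

def adaboost_fit(X, Y, weak_learn, T=100):
    """Train AdaBoost; weak_learn(X, Y, w) must return a function
    mapping a sample matrix to predictions in {-1, +1}."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                      # step 2: uniform initial weights
    learners, alphas = [], []
    for t in range(T):                           # step 3
        h = weak_learn(X, Y, w)                  # (3.1) weak classifier under w
        pred = h(X)
        eps = np.sum(w[pred != Y])               # (3.2) weighted training error
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)    # (3.3) classifier weight
        w *= np.exp(-alpha * Y * pred)           # (3.4) raise "hard", lower "easy"
        w /= w.sum()                             # normalize to a distribution
        learners.append(h)
        alphas.append(alpha)
    def f(Xq):                                   # final weighted majority vote
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, learners)))
    return f
```

Any weak learner with the `weak_learn(X, Y, w)` interface can be plugged in, which is exactly the slot the threshold classifiers of the next subsections fill.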

Threshold Classification
By increasing the weights of misclassified examples, AdaBoost focuses on the 'hard' examples and trains learners iteratively; the final decision combines a set of diverse classifiers by a weighted majority voting rule. Rather than facing the difficulty of directly designing an excellent algorithm, AdaBoost takes full advantage of an easily available Weak Learn algorithm. For many kinds of training sets, there are obvious relations between labels and certain attributes: most examples with larger values of a certain attribute fall in one category, and those with smaller values fall in the other. Based on this relation, the University of Twente released a program implementing the classical AdaBoost algorithm, AdaBoost with Threshold Classification, in 2010. By finding the best threshold over all attributes as the classification rule, this AdaBoost obtains a rough but easily trainable learner in each iteration. The Threshold Classification algorithm is presented in Algorithm 2.

Algorithm 2 Threshold Classification.
1. Input: training dataset S and weight distribution ω.
2. Do for j = 1, 2, ..., d:
(2.1) sort the examples by the jth attribute;
(2.2) find the threshold τ_j in the jth attribute that minimizes the weighted error ε_j, giving the candidate classifier h_j.
3. Output: compare the weighted errors ε_1, ε_2, ..., ε_d and find the minimum ε = min ε_j (j = 1, 2, ..., d); the threshold classifier is h = h_j with ε_j = ε.
In the training dataset S, (X_i, Y_i) is the ith example, where X_i = (x_i1, x_i2, ..., x_id) describes the ith example and Y_i ∈ {−1, 1} is the label (i = 1, 2, ..., m; j = 1, 2, ..., d). Thus x_ij is the jth attribute of the ith example, and every example is described by d attributes and a label. We find the threshold τ_j as a classifier h_j from x_1j, x_2j, ..., x_mj, the jth attributes of all examples, so that we obtain the lowest weighted error when separating the examples whose jth attribute is larger than τ_j from those whose jth attribute is smaller. When the jth attributes of most positive examples are larger than τ_j and those of most negative examples are smaller than τ_j, h_j predicts 1 above the threshold and −1 below it (and the reverse in the opposite case). The accuracy of the weak classifier h based on Threshold Classification is 1 − ε.
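The threshold search of Algorithm 2 can be sketched as follows. This is our own illustration; the midpoint candidate thresholds and the explicit polarity flag `s` are assumptions, since the text does not spell out those details:

```python
import numpy as np

def threshold_classify(X, Y, w):
    """Best single-attribute threshold (decision stump) under weights w.
    Returns (attribute index j, threshold tau, polarity s, weighted error eps);
    the stump predicts s if x[j] > tau, else -s."""
    m, d = X.shape
    best = (None, None, 1, 1.0)                  # (j, tau, s, eps)
    for j in range(d):
        # candidate thresholds: midpoints between sorted attribute values,
        # plus one below the minimum and one above the maximum
        vals = np.unique(X[:, j])
        cands = np.concatenate(([vals[0] - 1], (vals[:-1] + vals[1:]) / 2,
                                [vals[-1] + 1]))
        for tau in cands:
            for s in (1, -1):                    # try both polarities
                pred = np.where(X[:, j] > tau, s, -s)
                eps = np.sum(w[pred != Y])       # weighted error of this stump
                if eps < best[3]:
                    best = (j, tau, s, eps)
    return best
```

The exhaustive scan over d attributes returns the single stump with the minimum weighted error, matching the output of Algorithm 2.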

AdaBoost with Multiple Thresholds Classification
This section explains the basic idea of our Multiple Thresholds Classification in Section 3.1. Afterwards, we build this Weak Learn algorithm into the AdaBoost frame and briefly describe the resulting algorithm in Section 3.2.

Multiple Thresholds Classification
Instead of building a complicated relation between labels and all the attributes, the threshold-based AdaBoost focuses on the linear relation between the labels and the single most critical attribute. However, this classifier may achieve low accuracy and poor generalization for two reasons: (1) the Weak Learn algorithm searches for only one optimal threshold over all the attributes as the classification rule, wasting the potential value of the other attributes; (2) the classifier still tends to overfit, since in each iteration it concentrates on training examples misclassified by just that one threshold.

If the labels have some degree of linear relation with more than one attribute, such as the relation between gender and both height and weight, AdaBoost with Threshold Classification may achieve not only low accuracy but also poor generalization. For example, if we classify persons by height alone, assigning taller persons to male and shorter ones to female, many short males and tall females will be misclassified, and the same holds for weight alone. To resolve this dilemma, we can classify those who are both tall and heavy as male and the others as female, so that fewer persons are misclassified.

To improve Threshold Classification, we propose Multiple Thresholds Classification, which uses three crucial thresholds as a classifier instead of one. First, we sort the examples by each attribute and choose a threshold from every sorted sequence, as in step 2 of Algorithm 2. According to the errors of these thresholds, we choose the best three, assign them different weights, and integrate the three classifiers by a weighted majority voting rule as the Weak Learn algorithm. It is formally presented in Algorithm 3.
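To make the voting step concrete, here is a minimal Python sketch based on our reading of the method: an accuracy-weighted vote over the three lowest-error stumps. The `(predict_fn, error)` interface is hypothetical, not from the paper:

```python
import numpy as np

def multi_threshold_vote(stumps):
    """`stumps` is a list of (predict_fn, weighted_error) pairs, e.g. one best
    threshold classifier per attribute. Keep the three with the lowest
    weighted errors and combine them by an accuracy-weighted vote."""
    top3 = sorted(stumps, key=lambda se: se[1])[:3]
    weights = [1.0 - eps for _, eps in top3]      # vote weight = accuracy (assumed)
    def h(Xq):
        # weighted sum of the three component predictions in {-1, +1}
        score = sum(a * pred(Xq) for a, (pred, _) in zip(weights, top3))
        return np.where(score >= 0, 1, -1)        # ties broken toward +1
    return h, [eps for _, eps in top3]
```

Because the two weaker stumps can outvote the best one only when they agree, the combined rule can correct some of the best stump's mistakes while wasting less of the information carried by the other attributes.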

Algorithm 3 Multiple Thresholds Classification.
1. Input: training dataset S and weight distribution ω.
2. As in step 2 of Algorithm 2, find a candidate threshold classifier for every attribute; sort their weighted errors and keep the best three classifiers h_1, h_2, h_3 with errors ε_1 ≤ ε_2 ≤ ε_3.
3. Output: the weak classifier h whose prediction is the weighted majority vote of the three rules, each weighted by its accuracy:

h(X) = sign((1 − ε_1)h_1(X) + (1 − ε_2)h_2(X) + (1 − ε_3)h_3(X)).    (1)

The error ε of a classifier means that it classifies an example correctly with probability 1 − ε. Treating the component errors as independent, the weak classifier h with multiple thresholds in Algorithm 3 classifies an example correctly in the following five cases:
1. All three classifiers classify correctly, h_1(X) = Y, h_2(X) = Y, h_3(X) = Y; by Equation (1), h(X) = Y, with probability
(1 − ε_1)(1 − ε_2)(1 − ε_3).    (2)
2. h_1 and h_2 classify correctly but h_3 does not; since ε_1 ≤ ε_2 ≤ ε_3, the correct side carries the larger total weight, with probability
(1 − ε_1)(1 − ε_2)ε_3.    (3)
3. h_1 and h_3 classify correctly but h_2 does not, with probability
(1 − ε_1)ε_2(1 − ε_3).    (4)
4. h_2 and h_3 classify correctly but h_1 does not, and the sum of the accuracies of h_2 and h_3 exceeds that of h_1, with probability
ε_1(1 − ε_2)(1 − ε_3).    (5)
5. h_1 classifies correctly but h_2 and h_3 do not, and the sum of the accuracies of h_2 and h_3 is smaller than that of h_1, with probability
(1 − ε_1)ε_2ε_3.    (6)

According to Equations (2)-(6), the expected accuracy of Multiple Thresholds Classification h(X) is the sum of the probabilities of the applicable cases. (7)
Assume the errors follow uniform distributions, ε_1 ∼ U(0, 1), ε_2 ∼ U(ε_1, 1), and ε_3 ∼ U(ε_2, 1); then the joint density function is

f(ε_1, ε_2, ε_3) = 1/((1 − ε_1)(1 − ε_2)),  0 < ε_1 < ε_2 < ε_3 < 1,    (8)

and the expectation in Equation (7) is obtained by integrating against this density. Note that the accuracy 1 − ε of the threshold classifier in Algorithm 2 equals the accuracy 1 − ε_1 of the best component of the Multiple Thresholds Classifier in Algorithm 3. If ε_1 ∼ U(0, 1), the expectation of ε_1 is 50%, and so is that of 1 − ε_1; the expected accuracy of Multiple Thresholds Classification used in Algorithm 4 is therefore better than that of Threshold Classification, 1 − ε_1.
Algorithm 4 AdaBoost with Multiple Thresholds Classification.

1. Input: training dataset S and ensemble size T, the same as in Algorithm 1; the Weak Learn algorithm is Multiple Thresholds Classification.
2. Initialization: the same as in Algorithm 1.
3. Do for t = 1, 2, ..., T:
(3.1) generate a weak classifier h_t(X) as shown in Algorithm 3;
(3.2) the greater the error ε_t of the learner h_t(X), the smaller the weight of h_t(X) will be, so we prefer to assign a conservative weight to h_t(X) by choosing as ε_t the greater of ε_3 and e from Algorithm 3, where e is the weighted error of the combined classifier h_t(X);
(3.3) compute the classifier weight α_t = (1/2) ln((1 − ε_t)/ε_t) and update the distribution ω_{t+1} as in Algorithm 1.
4. Output: the final hypothesis f(X) = sign(Σ_t α_t h_t(X)).

Multiple Thresholds Classification as the Weak Learn Algorithm
Although a weighted majority voting rule can effectively prevent the final decision from overfitting, the Weak Learn algorithm of AdaBoost with Threshold Classification still tends to overfit when adjacent weak learners are derived from the same attribute. We try to enhance the generalization of the final decision by enhancing that of every Weak Learn algorithm. In AdaBoost with Multiple Thresholds Classification, (1) we choose multiple thresholds as the Weak Learn algorithm to avoid the monopoly of a single attribute and to exploit more of the potential value of the whole training set; (2) since Threshold Classification is not very accurate as the Weak Learn algorithm of AdaBoost, we prefer to have more but 'smaller' weak learners, so we introduce a new scheme that assigns smaller weights to the weak classifiers.
The only determinant of α_t in step (3.3) of Algorithm 4 is the error ε_t, so we can obtain a smaller α_t by choosing a greater ε_t. Four error values appear, ε_1, ε_2, ε_3, and e, and in theory any of these weighted errors could be used in (3.3). Since ε_3 is clearly the greatest of ε_1, ε_2, ε_3, we choose the greater of ε_3 and e as the ε_t in Algorithm 4.
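Under our reading, the modified weight assignment amounts to one line on top of the standard α formula; a minimal sketch (the function name and interface are ours):

```python
import math

def classifier_weight(eps3, e_vote, eps_floor=1e-10):
    """Step (3.3) of Algorithm 4 as we read it: take the larger of eps3 (the
    worst of the three component errors) and e_vote (the weighted error of
    the combined vote), so each weak classifier receives a smaller, more
    cautious alpha than the classical choice of e_vote alone."""
    eps_t = min(max(max(eps3, e_vote), eps_floor), 1 - eps_floor)
    return 0.5 * math.log((1 - eps_t) / eps_t)
```

Because ε_t ≥ e always holds, the resulting α_t is never larger than the classical assignment, which spreads influence over more weak classifiers instead of letting a few dominate the vote.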

Experiment
We compare our AdaBoost based on Multiple Thresholds Classification with classical AdaBoost by conducting experiments on 18 classification problems in this section, so that the effectiveness of the method can be demonstrated. The adopted datasets are popular binary classification datasets from various areas in the UCI repository, with numerical attributes and few incomplete instances.

Data Set Information and Parameter Setting
We implemented AdaBoost with Multiple Thresholds Classification in MATLAB R2014a and adopted the datasets from the UCI machine learning repository listed in Table 1. Only one parameter needs to be set for the AdaBoost algorithms: the maximal number of iterations, which is set to 100; this is acceptable for the available computing power and sufficient for most datasets to converge.

Wilcoxon Signed-Rank Test of the Two Algorithms and Analysis of Experimental Results
Comparisons of the two algorithms on the 18 testing subsets are shown in Table 2. In both Figures 1 and 2, T_Ad abbreviates AdaBoost with Threshold Classification and MT_Ad abbreviates AdaBoost with Multiple Thresholds Classification. From Table 2, AdaBoost based on Multiple Thresholds Classification performs significantly better on most of the 18 datasets, across different numbers of examples and attributes. Figure 1 shows frequency histograms of the accuracies of the two algorithms. Since the accuracies are clearly not normally distributed, the Wilcoxon signed-rank test is used to test for a significant difference between the two algorithms.
The null and alternative hypotheses are set as: H_0: there is no significant difference between the two algorithms; H_1: AdaBoost with Multiple Thresholds Classification is better than AdaBoost with Threshold Classification.
The Wilcoxon signed-rank test gives p(H_0) ≈ 0.0078 ≤ 0.05, so the null hypothesis H_0 is rejected at the 0.05 significance level; in other words, H_1 is accepted. Specifically, AdaBoost based on Multiple Thresholds Classification performs better than AdaBoost based on Threshold Classification according to the Wilcoxon signed-rank test. Figure 2 directly shows how the convergence of the two algorithms evolves. Each panel depicts the changes of the training-set and testing-set errors of the 18 datasets as the iterations grow; the abscissa and ordinate indicate the number of training iterations and the errors of the two classifiers, respectively. For AdaBoost with Threshold Classification, the blue solid line with triangles shows the test-set error and the blue dashed line with crosses the training-set error; for AdaBoost with Multiple Thresholds Classification, the red solid line with triangles shows the test-set error and the red dashed line with crosses the training-set error.
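In practice one would call `scipy.stats.wilcoxon` for this test; for illustration, here is a minimal exact one-sided signed-rank test in pure Python (our own sketch, assuming no zero differences and no tied absolute differences), applied to hypothetical paired accuracies, not the paper's data:

```python
def wilcoxon_one_sided_p(x, y):
    """Exact one-sided Wilcoxon signed-rank p-value for H1: x tends to be
    larger than y. Assumes no zero differences and no tied |differences|."""
    d = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for r, i in enumerate(order, start=1):       # rank |d| from 1..n
        ranks[i] = r
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)  # negative rank sum
    n = len(d)
    # exact null distribution: each rank falls on the negative side w.p. 1/2
    count = sum(1 for mask in range(2 ** n)
                if sum(r + 1 for r in range(n) if (mask >> r) & 1) <= w_minus)
    return count / 2 ** n

# Hypothetical paired test-set accuracies (illustrative only)
t_ad = [0.80, 0.72, 0.85, 0.66, 0.91, 0.58, 0.77, 0.83]
mt_ad = [0.84, 0.75, 0.86, 0.71, 0.89, 0.64, 0.805, 0.875]
p = wilcoxon_one_sided_p(mt_ad, t_ad)            # small p favours MT_Ad
```

The exact enumeration over 2^n sign patterns is only feasible for small n, but with 18 paired datasets it remains cheap, and it avoids the normal approximation entirely.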

Analysis of Experimental Results
From Table 2, it is obvious that the new AdaBoost is better than the classical one on the test subsets of the Breast Cancer Wisconsin (Diagnostic wdbc), Connectionist Bench (Sonar), Cryotherapy, EEG Eye State, Hill-Valley, HTRU2, Liver Disorders, Molecular Biology (Splice), SPECTF Heart, Statlog (Heart), and Wholesale Customers datasets.
From Figure 2, we can see that the blue dashed lines are generally lower than the red ones, meaning that AdaBoost with Threshold Classification is roughly better on training-set accuracy; this does not hold on the testing sets, where the blue solid lines are higher than the red ones for the Cryotherapy, EEG Eye State, HTRU2, Immunotherapy, Ionosphere, Liver Disorders, Seismic-bumps, Sonar, SPECTF, Wholesale Customers, Breast Cancer Wisconsin (Diagnostic wdbc), and Hill-Valley datasets. This indicates that the improved algorithm has better generalization performance. To discuss the settings in which the proposed algorithm performs better, the datasets are sorted in ascending order of the accuracy of AdaBoost with Threshold Classification, and each dataset is labeled "1" when the accuracy of AdaBoost with Multiple Thresholds Classification is higher and "−1" otherwise, as shown in Table 3. The correlation coefficient between rankings and labels is −0.4423 < −0.3, indicating a weak but nevertheless meaningful linear relationship. So, without considering other factors, such as the numerical values of attributes or the areas of the datasets, AdaBoost with Multiple Thresholds Classification seems more suitable for those "hard" datasets on which the accuracy of classical AdaBoost is low. This is because the "hard" datasets contain more "hard" examples, which Multiple Thresholds Classification classifies more accurately than Threshold Classification does.

Conclusions
In this paper, to improve the accuracy and generalization of AdaBoost with Threshold Classification and to avoid overfitting and voting monopoly, we proposed the AdaBoost with Multiple Thresholds Classification algorithm, selecting three crucial thresholds in each iteration to build a new Weak Learn algorithm. More robust decisions were obtained by changing the weighting scheme during the iterations, which positively affected the final ensemble decision of AdaBoost with Multiple Thresholds Classification.
In this study, we directly compared the classical AdaBoost algorithm with AdaBoost with Multiple Thresholds Classification on 18 datasets from UCI. The experimental results clearly show that better performance is achieved by keeping the AdaBoost macro framework while modifying the specific Weak Learning algorithm, although setting the parameters of each algorithm was somewhat tedious. In future work, we will continue to improve the efficiency of AdaBoost with Multiple Thresholds Classification and seek the best way to extend our algorithm to multi-class scenarios.