Sieve: An Ensemble Algorithm Using Global Consensus for Binary Classification

In the field of machine learning, an ensemble approach is often used as an effective means of improving on the accuracy of multiple weak base classifiers. One concern with ensemble algorithms is that they can suffer from the Curse of Conflict, where one classifier’s true prediction is negated by another classifier’s false prediction during the consensus period. Another concern is that the ensemble technique cannot effectively mitigate the problem of Imbalanced Classification, where an ensemble classifier usually presents a similar magnitude of bias toward the same class as its imbalanced base classifiers. We propose an improved ensemble algorithm called “Sieve” that overcomes these shortcomings by establishing the novel concept of Global Consensus. The proposed Sieve ensemble approach was benchmarked against various ensemble classifiers that were trained using different ensemble algorithms with the same base classifiers. The results demonstrate that better accuracy and stability were achieved.


Introduction
In complex machine learning problems involving multiple base classifiers, a consensus result can be achieved using an ensemble approach, which can improve on the accuracy of the intermediate results. This ability to combine intermediate results defines ensemble algorithms, which have recently been the focus of ongoing research [1] across machine learning approaches applied to numerous domains employing classification [2], regression [3] and clustering [4][5][6]. When considering the structure of Multiple Classifier Systems (MCS), the majority of their configurations consist of three procedures: topology selection, base classifier generation and consensus [7]. In most cases, the topology and the base classifier procedures are combined to generate improved and independent base classifiers, whose results are aggregated through an ensemble to produce more accurate results by letting the classifiers compensate for one another. A salient aspect of this approach is that every successful enhanced ensemble algorithm, e.g., [8][9][10][11], has improved one or more of these three procedures. The main emphasis of applying an ensemble is to improve the effectiveness of the consensus to produce a more accurate predicted label.
The objective of the consensus is to determine the label of a sample based on multiple base classifier predictions. The currently available consensus techniques aim to identify the accurate base classifiers and let them dominate the consensus. However, ensembles generated in practice may not improve as expected, due to the following five shortcomings: (1) Curse of Conflict (CoC), (2) uncertainty of accuracy improvement, (3) erratic magnitude of accuracy improvement, (4) Imbalanced Classification (IC) and (5) difficulty of base classifier selection.
These five shortcomings all originate from the CoC, and each one is associated with the previous one(s). The CoC occurs when correct classifier predictions are cancelled by wrong classifier predictions, which can be attributed to the conflict among the nonhomologous predictions [12]. A consequence of the CoC is that the employed consensus technique will perform inconsistently across diverse datasets. This impacts the effectiveness of the consensus and introduces two concerns, as described in (2) and (3); that is, an improvement over the base classifier accuracy cannot be guaranteed, and the magnitude of any improvement is erratic [13]. Because each ensemble result is derived from the base results, the former presents a similar extent of bias towards the same class as the latter. Accordingly, the existing consensus methodologies cannot effectively mitigate the issue of "Imbalanced Classification" (IC) [14]. Moreover, the IC restricts the base classifier results employed in the consensus to those with a minimum accuracy of 50%, in order to reduce the ensemble generation error [15]. As a consequence, each ensemble result is obtained at the cost of maintaining multiple base classifiers, predicting each sample multiple times and executing a complex compensation/consensus procedure, which largely increases the actual time complexity.
In the literature, various intelligent consensus solutions have been proposed [16] to overcome these five shortcomings, but to date, the proposed solutions lack the ability to achieve an ensemble's main goal, "a guaranteed and significant accuracy improvement". In this paper, we propose a novel ensemble algorithm named "Sieve", which uses a new methodology called Global Consensus (GC) to resolve the previously identified concerns with the current ensemble approaches embedded in the traditional consensus methods.

Motivation
Various consensus techniques such as (weighted) voting, bagging, Bayesian formalism, belief theory, Dempster-Shafer's evidence theory [17][18][19] have been developed to mitigate the CoC. As the beginning of the MCS, the (weighted) voting tried to avoid a wrong ensemble result by simply aggregating multiple-base results. Then, the bagging made improvements through generating more independent base classifiers which were trained on varied subsets. Other, more advanced, techniques focused on improving the base classifier's results. For example, the Bayesian, belief and Dempster-Shafer theories have been utilized to quantify the importance of the base classifiers for improving the ensemble accuracy. Moreover, the most recent works aimed at reducing the error rate of the base classifiers. For instance, Rotation-Based SVM [20] generated diverse base results using random feature selection and data transformation techniques, which could simultaneously enhance both the individual's accuracy and diversity within the ensemble. Essentially, the main difference of the existing practices is the adopted criterion for quantifying the importance of the base classifiers. Furthermore, these algorithms focus mainly on improving the accuracy of the ensemble result through reducing the misclassification associated with each sample. These can be identified as Local Consensus (LC), where the consensus emphasis is on the individual sample. The LC approach inherently suffers from a structural deficiency that causes unavoidable CoC, since the base classifier predictions will inevitably be in conflict, producing undesirable results.
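To make the CoC under a sample-scoped (local) consensus concrete, the following minimal sketch applies plain majority voting to a single sample. The three predictions and the helper name majority_vote are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three base classifiers for one sample
# whose true label is "A": classifier 1 is right, classifiers 2 and 3 are wrong.
true_label = "A"
base_predictions = ["A", "B", "B"]

ensemble_label = majority_vote(base_predictions)
print(ensemble_label)                # B
print(ensemble_label == true_label)  # False: the correct prediction was cancelled
```

In this toy case the one correct prediction is outvoted, which is the kind of conflict the text attributes to the LC.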
Although the LC suffers from unsolvable deficiencies, it is undeniable that the consensus approach is an effective methodology to improve the ensemble's accuracy, because it compensates for the weakness or imperfect performance of the base classifiers. For instance, in order to resolve the poor relevance feedback issue that results from using a Support Vector Machine (SVM) classifier in the content-based image retrieval problem, researchers have proposed a comprehensive classifier (ABRS-SVM) which combines an asymmetric Bagging-based SVM (AB-SVM) and a Random Sub-Space SVM (RS-SVM). The ABRS-SVM itself has been a successful research project, and 17 industry patents [21] have been generated from it directly or indirectly.
Therefore, an emphasis on improving the ability of the consensus approach to avoid these deficiencies can render a superior result. It should be noted that although the principle of an ensemble is to obtain a better result by combining the results of multiple base classifiers, the LC is not the only form of consensus. Hence, to avoid the CoC resulting from the sample-scoped consensus, another approach is to classify each sample by only one of the base classifiers. Without a sample-scoped consensus, the accuracy on each sample would be lower. In order to maintain a high accuracy, the most accurate classifiers are identified (by testing on a subset called ranking; please refer to the next section for details) and employed first to label the samples, so that the misclassifications can be minimized. More specifically, each classifier is only responsible for classifying a portion of the samples to avoid the CoC. As a result, the ensemble labels are composed of the labels assigned by each base classifier. Since the consensus in this approach is reached over the full dataset, it is called Global Consensus (GC). The main difference between the two approaches, the LC and the GC, is that in the GC the base classifiers do not work together to label each sample, but instead cooperate to label all the samples. One potential concern is that the GC's accuracy could be reduced because the consensus is not executed on individual samples; in fact, the opposite effect can be achieved, because the accuracy is greatly improved by the reduction or complete elimination of the CoC.

Overall Methodology
The proposed Sieve ensemble algorithm is similar to other ensemble algorithms, that is, it requires the presence of several trained base classifiers to participate in a consensus. The difference is the way that the base classifiers are employed in forming an ordered classifier chain. As shown in Figure 1, the base classifiers are placed in an ordered chain. Starting with the first classifier, each classifier will label an exclusive portion of the samples (i.e., the white circles). This process continues until the last classifier in the chain has labeled its portion. The motivation of employing the base classifiers in this manner is two-fold. (1) As mentioned in the last section, since the CoC is an inevitable problem unless we abandon the sample-scoped consensus, labeling each sample using one base classifier is the only available approach of classification. (2) Furthermore, in order to reduce the error rate on each sample, each base classifier is allowed to proactively/freely select a portion of samples to classify based on its strength. As a result, each sample will be labeled by the base classifier with the highest confidence, hence producing high-level aggregated ensemble results.
It is important to recognize that there are always a few samples that are not labeled by any classifier (i.e., the black circles); these samples will be labeled using a non-classifier procedure. In the GC, the base classifiers compensate for each other by labeling the samples that the other base classifiers leave unlabeled, instead of cancelling out correct predictions with incorrect predictions from other base classifiers, as happens with the LC. In conclusion, the approach taken in Sieve requires the construction of a classifier chain, whose life cycle is divided into three procedures: creation, refinement and employment.

Creation of Classifier Chain
Referring to Figure 2, in order to train the base classifiers and create the classifier chain, the first step is to split the dataset into training, ranking and testing portions. The training portion is used to generate the base classifiers using different algorithms, where each algorithm from 1-K generates a corresponding classifier. The ranking portion is applied to evaluate each base classifier during the formation of the classifier chain. The testing portion of the dataset is used to evaluate the final classifier chain. Since both the latter two portions are used for evaluation, they should be about the same size and not less than 10,000 samples in general. These requirements should also be applied to the training portion. In addition, there is no special criterion in selecting the ranking portion of samples compared with the testing samples. In the case when a dataset consists of training and testing portions, the ranking portion would be extracted from the training portion. It is important to realize that the under-sampling of the ranking portion would result in an inaccurate classifier chain because the ensemble results are associated with the quality of classifier chain.
Once these different subsets of data are established, the process of training, ranking and testing can proceed. Firstly, the base classifiers will be generated by applying the training dataset, then the performance of each classifier is determined using the ranking dataset. This process evaluates and concludes the generalization capabilities of each base classifier. Next, using this information, all the base classifiers are sorted in ascending order according to false-positive (FP) and false-negative (FN), respectively (i.e., the best first). This will lead to the formation of two sorted classifier chains, the FP and FN sorted classifiers. Each chain contains all the base classifiers, but in different order. Finally, the total misclassifications associated with each of the two classifier chains are calculated, and we select the one with the least misclassification.
Assume K is five and there are two target classes, A and B, as in Figure 2. This leads to the training of five base classifiers 1-5, which are separately evaluated on the ranking dataset. Based on the outcomes, two sorted classifier chains are formed, C1: 2-5-1-3-4 for FP and C2: 4-2-1-5-3 for FN, in which better-scoring classifiers (i.e., those with fewer misclassifications) are positioned first in the chain (the order is not significant if the scores are the same). Here, FP represents misclassifying class B as A, and FN represents misclassifying class A as B, so C1 is ordered based on the misclassifications on class B and C2 is ordered based on the misclassifications on class A. Therefore, the misclassifications resulting from the two classifier chains indicate any inherent class bias in the base classifiers. Based on these results, the chain with the lowest misclassification score will be selected as the base classifier chain, as it would result in less misclassification on a specific class. If C1's misclassification score is higher than that of C2, then C1 will be abandoned and the C2 chain will be selected.
The motivation for using the misclassification results as the basis for selecting a classifier chain is that, when the base classifiers of C2 are invoked in the sequence 4-2-1-5-3 to label the samples of class A, the chain will have a better accuracy in predicting a sample as class A. To illustrate the process, we apply a set of samples to the first base classifier, which is no.4 in C2, to make labeling predictions for all of the samples. The samples that were not labeled as class A are then passed to the next base classifier in the chain, no.2, to continue the labeling process. This process continues until the last base classifier, no.3, is employed. At this point, any remaining samples, which most likely belong to class B, are assigned to that class by a non-classifier procedure (refer to subsection "Employment of Classifier Chain") to minimize their possible misclassification. Consequently, the ensemble result is the aggregation of the labels assigned by each base classifier, which will achieve high ensemble accuracy. A comprehensive explanation is presented in the subsection "Employment of Classifier Chain". Two points are important here: (1) the base classifiers are invoked in order to label only a specific class; (2) the ensemble classification is the result of collaboration among all of the base classifiers and the last non-classifier procedure.
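The sorting step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names fp_fn_counts and build_chains are hypothetical, class A is treated as the positive class, and the error counts are invented so that the resulting orders match C1 and C2 in the example. Selecting between the two chains by their total misclassification is not shown.

```python
def fp_fn_counts(y_true, y_pred, positive="A"):
    """Count false positives (other class predicted as `positive`) and
    false negatives (`positive` predicted as the other class)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return fp, fn

def build_chains(scores):
    """`scores` maps classifier id -> (fp, fn) measured on the ranking set.
    Returns the FP-sorted and FN-sorted chains, fewest errors first."""
    by_fp = sorted(scores, key=lambda cid: scores[cid][0])
    by_fn = sorted(scores, key=lambda cid: scores[cid][1])
    return by_fp, by_fn

# Counting errors for one classifier on a tiny hypothetical ranking set.
print(fp_fn_counts(list("AABB"), list("ABAB")))  # (1, 1)

# Hypothetical ranking-set error counts for five base classifiers,
# chosen only to reproduce the chain orders used in the text.
scores = {1: (30, 25), 2: (10, 15), 3: (40, 45), 4: (50, 5), 5: (20, 35)}
c1, c2 = build_chains(scores)
print(c1)  # [2, 5, 1, 3, 4]
print(c2)  # [4, 2, 1, 5, 3]
```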

Refinement of Classifier Chain
The refinement of classifier chain is a process of identifying and eliminating the weak classifiers within the classifier chain. Although it may improve the ensemble accuracy, the main objective is to reduce the time complexity. Essentially, the refinement process is composed of a simplified classifier chain employment (refer to the next subsection) and a classifier elimination procedure.
Referring to Figure 3, assume that 10 samples in the ranking set are applied to the chain C2. Each base classifier in turn makes predictions on the unlabeled samples and assigns some of them to class A. For example, classifier no.4 predicts that, among all 10 samples, samples 3, 4 and 7 belong to class A. Since the accuracy of this decision is higher than 50%, classifier no.4 is retained. The remaining seven unlabeled samples are passed to the next classifier in the chain. From these seven samples, classifier no.2 labels another four samples (i.e., samples 1, 5, 8 and 10) as class A. Since the accuracy of this result is less than 50%, classifier no.2 is eliminated from the chain and the associated labeling result is cancelled. The number of unlabeled samples remains seven. Following this procedure, the base classifiers no.1 and no.3 are retained, as their accuracy is higher than 50%, and the base classifier no.5 is removed due to its lower than 50% accuracy. As a result, the base classifiers no.4, 1 and 3 are the only ones retained, and the length of the refined classifier chain is reduced from five to three, a 40% reduction.
In the process of training each classifier, it is always important to address the overfitting problem. For Sieve, we need to point out that since the ranking process is involved in forming the classifier chain, this combination significantly mitigates the overfitting problem. In general, when using LC, there are two methods of splitting the training data to train the base classifiers, and neither of them can effectively mitigate the overfitting problem. If the base classifiers are trained on the full training dataset, then the low independence among the base classifiers leads to a stronger CoC. Training the base classifiers on varied subsets of the training dataset makes each base classifier lose the patterns in the excluded portion of the training set. It is critical to understand that the base classifiers in the LC-series compete among themselves, as each one tries to distinguish itself during the sample-scoped consensus. Therefore, the generalization ability of the base classifiers is often neutralized in such a competitive environment. In the Sieve, by contrast, the training data are split into two groups, one for training the base classifiers and the other for determining the best combination of classifiers in the chain. Since the relationship among the base classifiers in the GC-series is one of compensation, the formation of the classifier chain is a process of improving the mutual complementarity among them. The generalization abilities of the base classifiers therefore reinforce one another, resulting in a chain with a strong generalization ability. Most importantly, the complementarity is a property that is completely determined by the features of the base classifiers, instead of the training data, so the Sieve presents a stronger generalization ability than the LC-series ensemble algorithms on unseen data.
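The refinement walkthrough above can be sketched in a few lines. The true labels and per-classifier predictions below are hypothetical, chosen only so that the outcome matches the example (classifiers no.2 and no.5 eliminated). The function name refine_chain, retaining a classifier at exactly 50% accuracy, and dropping a classifier that labels no new sample are assumptions of this sketch.

```python
def refine_chain(chain, predicted_a, true_a, n_samples, threshold=0.5):
    """Walk the chain over the ranking set and keep only useful classifiers.

    chain:        classifier ids in invocation order
    predicted_a:  classifier id -> set of sample ids it predicts as class A
    true_a:       set of sample ids that truly belong to class A
    """
    unlabeled = set(range(1, n_samples + 1))
    refined = []
    for cid in chain:
        claimed = predicted_a[cid] & unlabeled  # only unlabeled samples count
        if not claimed:
            continue  # assumption: a classifier that labels nothing is dropped
        accuracy = len(claimed & true_a) / len(claimed)
        if accuracy < threshold:
            continue  # eliminated; its labels are cancelled
        refined.append(cid)
        unlabeled -= claimed
    return refined, unlabeled

# Hypothetical ranking data for the 10-sample example.
true_a = {2, 3, 4, 5, 6, 7, 9}
predicted_a = {
    4: {3, 4, 7},      # 3/3 correct -> retained
    2: {1, 5, 8, 10},  # 1/4 correct -> eliminated
    1: {3, 5, 6},      # claims 5 and 6 among unlabeled, 2/2 -> retained
    5: {1, 8, 9},      # 1/3 correct -> eliminated
    3: {2, 9, 10},     # 2/3 correct -> retained
}
refined, leftover = refine_chain([4, 2, 1, 5, 3], predicted_a, true_a, 10)
print(refined)           # [4, 1, 3]
print(sorted(leftover))  # [1, 8]
```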
At the end of this paper, a special experiment will be presented to highlight the primary reason that enables the Sieve approach to outperform the LC-series ensemble algorithms, emphasizing the significant structural advantages of employing GC.

Employment of Classifier Chain
Employing the refined classifier chain on the testing set is the final procedure. Figure 4 illustrates the workflow of evaluating the refined classifier chain C2 on 10 samples. As a reminder, C2 is ordered by the FN, which means its base classifiers no.4, 1 and 3, as a whole, are biased towards class A, i.e., each base classifier is responsible for labeling a sample as class A if it predicts that the sample belongs to class A. In the figure, base classifier no.4 first makes predictions on all 10 samples and labels three samples (i.e., no.3, 4 and 7) as class A. The remaining seven unlabeled samples are passed to the next base classifier, no.1, which then makes predictions and labels samples no.5 and no.10 as class A. Now the remaining unlabeled samples are reduced to five, which are passed to the last base classifier, no.3. This classifier labels samples no.2, 6 and 9 as class A. The left-over samples no.1 and no.8 are not labeled and will be assigned to class B.
Overall, every sample will be either labeled as class A by one of the base classifiers or assigned to the opposite group, class B. When a base classifier labels a sample, it is expected to do so with high confidence, because the classifier chain inherently has a bias towards class A based on the evaluation results during the ranking process. In addition, high confidence is also associated with the unlabeled sample(s) assigned to class B. This is because almost all the samples that truly belong to class A have already been labeled by the base classifiers, leaving most of the unlabeled samples to class B. Therefore, the Sieve technique maximizes the overall accuracy and reduces the misclassifications through the entire process. The proposed Sieve technique also incorporates the GC method as a new consensus approach, which expands the consensus scope from each individual sample to all the samples in the dataset as a whole. A distinct difference between the LC and the GC is that the former is built on a competitive ensemble idea that is susceptible to the CoC during the consensus, whereas the GC is based on the compensation principle, which forms a robust ensemble classifier set by compensating for the weaknesses among the base classifiers.

Figure: Workflow of a classifier chain with three base classifiers (black squares represent the labeled samples). Note: this is a simulated example to illustrate the employment process; the actual accuracy varies and the chain is longer in practice.
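The employment workflow can be sketched as below. The sets in predicted_a stand in for calls to trained models and are hypothetical values chosen to reproduce the 10-sample walkthrough; the function name employ_chain and the prediction counter are assumptions added for illustration.

```python
def employ_chain(chain, predicted_a, sample_ids, target="A", other="B"):
    """Label samples with a refined chain biased towards `target`.

    chain:        classifier ids in invocation order
    predicted_a:  classifier id -> set of sample ids it would predict as `target`
    Returns (labels, n_predictions).
    """
    labels = {}
    remaining = list(sample_ids)
    n_predictions = 0
    for cid in chain:
        n_predictions += len(remaining)  # only unlabeled samples are predicted
        for s in remaining:
            if s in predicted_a[cid]:
                labels[s] = target
        remaining = [s for s in remaining if s not in labels]
    for s in remaining:  # non-classifier procedure: leftovers get the other class
        labels[s] = other
    return labels, n_predictions

# Hypothetical predictions reproducing the walkthrough in the text.
predicted_a = {4: {3, 4, 7}, 1: {5, 10}, 3: {2, 6, 9}}
labels, n_predictions = employ_chain([4, 1, 3], predicted_a, range(1, 11))
print(sorted(s for s, lab in labels.items() if lab == "A"))  # [2, 3, 4, 5, 6, 7, 9, 10]
print(sorted(s for s, lab in labels.items() if lab == "B"))  # [1, 8]
print(n_predictions)  # 22
```

In this toy run the chain makes 22 predictions (10 + 7 + 5), whereas a sample-scoped consensus over the same three classifiers would make 30.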

Pseudocode
The pseudocode of the Sieve is presented in Algorithm 1; please refer to the previous subsections for details.

Algorithm 1. Output (2): the refined classifier chain and varied performances.

Advantages
In addition to the aforementioned strong generalization ability, the benefit of employing GC is that it overcomes the five shortcomings associated with the LC. (1) Since the consensus is no longer reached on each individual sample, the cause of the CoC (i.e., the conflict among the base predictions) is completely eliminated. (2) As a consequence of the CoC elimination, an accuracy improvement is manifested, since the ensemble accuracy must be at least as high as that of the first-invoked/most-accurate base classifier, even if all subsequent base classifiers misclassify all the unlabeled samples (i.e., the worst case). (3) In addition, the magnitude of the accuracy improvement is significant, since worst-case sample labeling would rarely occur in practice. (4) With respect to the mitigation of the IC, since the GC only utilizes the best partial ability of each base classifier (i.e., it only labels one specific class) and assigns the left-over sample(s) to the opposite class, the bias of each base classifier cannot be fully transferred to/reflected in the Sieve. (5) Because only the best partial ability is utilized in the GC, a classifier qualifies as a base classifier as long as it can achieve a minimum accuracy of 50% toward either class instead of both classes. In that sense, the requirement for being a GC base classifier is lower than for the LC, because achieving a decent accuracy on one class is the prerequisite of achieving decent accuracies on two classes.
Regarding time complexity in terms of employment, it is also significantly reduced after the base classifier chain is refined, because each base classifier in the chain is only responsible for making predictions on the samples that have not been previously labeled. For instance, consider a dataset of 10,000 samples that require labeling and a chain composed of 10 base classifiers. If the LC is employed, then 100,000 (i.e., 10 × 10,000) predictions are required, since each of the 10 base classifiers needs to predict all 10,000 samples to reach a consensus per sample. On the other hand, if the GC is adopted, the number of predictions is reduced to only about 19,980 (i.e., 10,000 × (2 − (0.5)^9)), under the conservative assumption that each base classifier predicts and labels 50% of the remaining samples.
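The prediction counts in this example follow from a geometric series, which the short sketch below reproduces. The function names and the label_rate parameter are hypothetical; the 50% labeling rate is the assumption stated in the text.

```python
def lc_predictions(n_samples, n_classifiers):
    """Every base classifier predicts every sample."""
    return n_samples * n_classifiers

def gc_predictions(n_samples, n_classifiers, label_rate=0.5):
    """Each classifier predicts only the samples still unlabeled and is
    assumed to label a fixed fraction (`label_rate`) of them."""
    total = 0.0
    remaining = float(n_samples)
    for _ in range(n_classifiers):
        total += remaining
        remaining *= 1.0 - label_rate
    return total

print(lc_predictions(10_000, 10))  # 100000
print(gc_predictions(10_000, 10))  # 19980.46875
print(10_000 * (2 - 0.5 ** 9))     # 19980.46875, the closed form used in the text
```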

Objectives and Methodologies
The six experiments carried out in this study can be divided into three groups based on their objectives. The first three experiments focus on the Sieve and LC-series ensemble algorithms employing the same group of base classifiers to compare their accuracy and stability. The objective of the next two experiments is to demonstrate the effectiveness of the classifier chain refinement process through the variations in accuracy and length between the original and refined classifier chains. Finally, the last experiment is performed to reveal that the GC has a significant structural advantage over the LC. It is important to note that the goal is not to achieve the highest accuracy possible, but to show a better "accuracy improvement" compared with the LC-series algorithms when using the same base classifiers. Therefore, to achieve consistency across these evaluations, the base classifiers are trained using the default configurations provided by Scikit-learn [22] and a dataset that has not been preprocessed.
It is important to clarify that stability (i.e., the certainty of accuracy improvement) is a more important metric than accuracy (i.e., the magnitude of accuracy improvement) in the evaluation of an ensemble classifier, because stability is the prerequisite of accuracy improvement. Therefore, only accurate ensemble classifiers that have been validated in extensive experiments can truly reflect the performance of the existing practices. Accordingly, only 38 base classifiers and 20 LC-series ensemble classifiers trained by eight successful ensemble algorithms (i.e., Randomizable Filter [23], Bagging [24], AdaBoost [25], Random Forest [26], Random Sub-space [27], Majority Vote [28], Random Committee [29], and Extra Trees [30]) are involved in the first three experiments. In particular, to maximize the accuracies of the compared ensemble classifiers (e.g., AdaBoost) that can only reach a consensus over a single type of base classifier, the most accurate base classifiers are employed as their base classifiers (refer to Tables A1-A4 in Appendix A). With respect to the evaluation data, the well-known benchmark dataset NSL-KDD [31], which includes two classes (i.e., Benign: B and Malicious: M) and two challenging test sets, "test+" and "test-21", is employed to evaluate all the ensemble classifiers in terms of accuracy, precision, recall, F1-score and the Area Under the Receiver Operating Characteristic curve (AUROC). In addition, a Breast Cancer dataset [32], which is representative of small data, is evaluated using the same approach. Since the refined classifier chains resulting from the big data (i.e., NSL-KDD) are much longer than those from the small data (i.e., Breast Cancer), we will mainly interpret/demonstrate the principles of the Sieve based on the results achieved in the first two experiments. Consequently, we place the results of the third experiment in Tables A3-A6 in Appendix A.
The experiments will show the proposed algorithm can improve the performances on both the small and the big data.
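For reference, the threshold-based metrics named above can be computed from the confusion counts as in the following sketch. Treating Malicious (M) as the positive class is an assumption of this example, the function name binary_metrics is hypothetical, and the six labels are invented; AUROC is omitted because it requires prediction scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive="M"):
    """Accuracy, precision, recall and F1-score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: three malicious and three benign samples.
metrics = binary_metrics(list("MMMBBB"), list("MMBMBB"))
print({name: round(value, 3) for name, value in metrics.items()})
```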

Experiments 1, 2 and 3: Comparison of the Performances
We trained the 38 base classifiers on the dataset "KDDTrain+" (i.e., 125,937 samples). Tables 1 and 2 show the performance of the different ensemble techniques on the datasets test-21 (i.e., 11,850 samples) and test+ (i.e., 22,544 samples), respectively. The rows above the third row from the bottom of the tables give the performance of the eight selected ensemble algorithms. More specifically, since five of the eight ensemble algorithms, denoted with an asterisk, can produce multiple ensemble classifiers using varied base classifiers, we have to create multiple ensemble classifiers using these asterisk-marked algorithms to obtain a comprehensive result. Therefore, the performance of each algorithm marked with an asterisk is averaged over the corresponding multiple ensemble classifiers. For example, we built four AdaBoost ensemble classifiers by respectively employing four different base classifiers (i.e., algorithms), so the AdaBoost row in Tables 1 and 2 reports their averaged performance. The third and second rows from the bottom of Tables 1 and 2 are the averaged performance of the 38 base classifiers and the eight selected LC-series ensemble algorithms, respectively. Observe that the averaged accuracy of the 38 base classifiers is lower than that of the majority vote; this is because the majority vote only ensembles the best four base classifiers. Finally, the last row in the same tables is the performance of the Sieve technique.
On the other hand, since the data patterns of test-21 and test+ differ somewhat, it is inevitable that all the classifiers will generate different results on these two test sets. However, when the same ensemble classifier performs very differently on the two test sets, it indicates that this classifier is more vulnerable/sensitive to pattern variations (i.e., unstable performance). According to the results of the experiments, the difference in accuracy when the LC-series ensemble classifiers were applied to these two test datasets can reach about 20% (i.e., 75.89% − 56.44% = 19.45%), whereas the same metric is only 4.86% (i.e., 90.29% − 85.43%) for the Sieve. This comparison indicates that the Sieve would be about four times more stable on unseen data/diverse patterns (i.e., 19.45/4.86 ≈ 4.0, or 4.12 if the gap is first rounded to 20%), which means that the Sieve has a much broader range of practical applications. In addition, since the difference in the accuracy of the base classifiers for the same two datasets is also about 20% (i.e., 73.95% − 53.68% = 20.27%), it indicates that the erratic accuracy achieved with the LC-series ensemble classifiers is actually derived from the base classifiers. Therefore, the Sieve is able to greatly mitigate the inherent instabilities that are associated with the base classifiers. The final assessment is that the Sieve's advantage in accuracy and stability is attributed to the structural advantage of the GC.
Similar conclusions can also be drawn from the evaluation results on the Breast Cancer dataset; please refer to Tables A3-A5 in Appendix A for details. It is worth mentioning that the LC-series ensemble algorithms are not able to improve on the accuracy of the base classifiers when the number of training samples is limited (i.e., 210), whereas the Sieve can still improve the accuracy even though its number of training samples is only 50% of that of the LC-series (i.e., 105).

* Since five of the eight ensemble algorithms can produce multiple ensemble classifiers using varied base classifiers, we have to create multiple ensemble classifiers using these asterisk-marked algorithms to obtain a comprehensive result.
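The stability comparison is simple arithmetic on the reported accuracies, reproduced in the sketch below. Using the unrounded LC-series gap of 19.45 points gives a ratio of about 4.0, while rounding the gap to 20 points first gives about 4.12. The function name accuracy_gap is hypothetical.

```python
def accuracy_gap(acc_a, acc_b):
    """Absolute difference between two accuracies, in percentage points."""
    return abs(acc_a - acc_b)

lc_gap = accuracy_gap(75.89, 56.44)     # LC-series ensemble classifiers
sieve_gap = accuracy_gap(90.29, 85.43)  # Sieve
base_gap = accuracy_gap(73.95, 53.68)   # base classifiers

print(round(lc_gap, 2), round(sieve_gap, 2), round(base_gap, 2))  # 19.45 4.86 20.27
print(round(lc_gap / sieve_gap, 2))  # 4.0
print(round(20 / sieve_gap, 2))      # 4.12
```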

Experiments 3 and 4: Verifying the Effectiveness of the Classifier Chain Refinement
To verify the effectiveness of the refinement in the classifier chain, experiments 3 and 4 are performed to evaluate the variations in the accuracies and lengths of both the original and refined classifier chains on the two test sets. These two experiments record the accumulated accuracies of the invoked base classifier and form trajectories accordingly. As shown in the Figures 5-8, the thin and thick lines are the trajectories that result from the original and refined classifier chains, separately. More specifically, each point represents the ratio between the number of correct-labeled samples and the number of all the samples in the dataset. Using the following example to demonstrate the method of calculating each accumulated accuracy point, assume there are 10 samples in the dataset; the first and last five samples belong to classes X and Y, respectively. Suppose also that the classifier chain that is biased towards class X includes two base classifiers: c1 and c2. If the c1 labels samples no.3-7 as class X, then all the rest of the samples are labeled as class Y. Then, the number of correct labels is six, which are samples no.3-5 (i.e., class X) and no.8-10 (i.e., class y). Therefore, the first accumulated accuracy point is 0.6 (i.e., 6/10). We have to emphasize that we are calculating every accuracy point by strictly executing the workflow of classifier chain employment. This means that we are simulating the workflow of employing a classifier chain with only one base classifier (i.e., c1) in the previous example. Since there are two base classifiers in the chain, adopting the same method, we need to clear all the labels that were not assigned by the c1 before calculating the second accumulated accuracy point, because the last base classifier c2 was not invoked. Accordingly, there are five samples (i.e., no.1, 2 and 8-10) passed to the c2. If the c2 labels samples no.1, 2, 10 as class X, then all the rest, samples no. 8 and 9, are labeled as class Y. 
Thus, the number of accumulated correct labels is seven: three labels (i.e., no.3-5, class X) result from c1, two labels (i.e., no.1 and 2, class X) result from c2 and the other two labels (i.e., no.8 and 9, class Y) result from the last non-classifier procedure. Consequently, the second accumulated accuracy point is 0.7 (i.e., 7/10). In addition, the accumulated accuracy must be distinguished from the accuracy that is used to eliminate base classifiers. As described in subsection "The Refinement of Classifier Chain," a base classifier is removed from the chain only when its accuracy is lower than 50%. In the previous example, the accuracy of c2 for determining elimination is 0.67 (i.e., 2/3) rather than the accumulated accuracy of 0.7, because c2 correctly labels samples no.1 and 2 and wrongly labels sample no.10. In short, the Sieve uses the accuracy that results directly from a base classifier when considering elimination, and uses the accumulated accuracy to verify the effectiveness of the classifier chain refinement. As the figures show, the accumulated accuracy may increase, decrease or remain unchanged with each newly involved base classifier. Because the number of accumulated correct labels results from the newly involved base classifier, all the previous base classifiers and the last non-classifier procedure, it may increase or decrease to some extent. Some points maintain the accumulated accuracy because the corresponding base classifiers do not label any new sample: all of their target samples have already been labeled by the previous base classifiers, i.e., their target samples completely overlap with those of their predecessors. 
Clearly, all the base classifiers that do not improve the accumulated accuracy should be removed from the classifier chain, and our classifier chain refinement achieves this goal. The figures indicate that the accuracy of each refined classifier chain (i.e., the last point on the thick lines) is roughly the same as that of the original classifier chain (i.e., the last point on the thin lines), which shows that the chain refinement procedure does not reduce accuracy.
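The accumulated-accuracy calculation in the example above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name run_chain and the representation of each base classifier as the set of sample numbers it would label as class X are assumptions made for the illustration.

```python
def run_chain(truth, chain, target="X", other="Y"):
    """Simulate a classifier chain biased towards `target`.

    truth: dict mapping sample number -> true class.
    chain: list of sets; each set holds the samples one base classifier
           would label as `target` (hypothetical representation).
    Returns the accumulated accuracy after each base classifier and the
    accuracy of each base classifier on the samples it newly labels.
    """
    labels = {}                      # labels assigned directly by base classifiers
    accumulated, own = [], []
    for predicted_target in chain:
        # A base classifier only labels samples that are still unlabeled.
        new = [s for s in predicted_target if s not in labels]
        for s in new:
            labels[s] = target
        hits = sum(truth[s] == target for s in new)
        own.append(hits / len(new) if new else None)
        # Simulate ending the chain here: the non-classifier procedure
        # assigns every leftover sample to the opposite class.
        correct = sum(labels.get(s, other) == truth[s] for s in truth)
        accumulated.append(correct / len(truth))
    return accumulated, own


# The 10-sample example: samples 1-5 are class X, samples 6-10 are class Y.
truth = {s: "X" if s <= 5 else "Y" for s in range(1, 11)}
c1 = {3, 4, 5, 6, 7}   # c1 labels samples no.3-7 as class X
c2 = {1, 2, 10}        # c2 labels samples no.1, 2 and 10 as class X

accumulated, own = run_chain(truth, [c1, c2])
print(accumulated)                  # [0.6, 0.7]
print([round(a, 2) for a in own])   # [0.6, 0.67]
```

The first list reproduces the two accumulated accuracy points (0.6 and 0.7), while the second gives the per-classifier accuracies that would be compared with the 50% elimination threshold. A base classifier that labels no new sample gets None in the second list, which corresponds to the flat segments of the trajectories.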
In addition, the experimental data show that the low-ranked base classifiers are more vulnerable to elimination. This phenomenon is caused by two factors: the difference in accuracy and the overlap in target samples between the high- and low-ranked base classifiers. For instance, suppose there are 10 samples (i.e., covering two classes, X and Y) and two base classifiers (i.e., c1 and c3, each with an accuracy of 60% in predicting the class X samples). Let us make the following assumptions for c1 and c3. If we invoke c1 to label the 10 samples, it labels samples no.1-5 as class X and the first three predictions (i.e., no.1-3) are correct. If we invoke c3 to make predictions on the 10 samples, it labels samples no.4-8 as class X and the first three labels (i.e., no.4-6) are correct. Under these assumptions, if we form a classifier chain with the two base classifiers (i.e., with c1 in the first place) to label the 10 samples, the second-placed c3 will label only samples no.6-8 as class X, because samples no.1-5 have already been labeled by c1. Since c3 (i.e., the low-ranked base classifier) makes a correct prediction only on sample no.6, its accuracy is only 33% (i.e., 1/3), and it should be eliminated from the chain according to the chain refinement procedure. Furthermore, if there is another base classifier c2 between c1 and c3, and c2 has labeled sample no.6 as class X before c3 is invoked, then c3 will label only samples no.7 and 8, reducing its accuracy to zero (i.e., making it even more vulnerable to elimination). Although this exact situation may rarely occur in practice, some overlap of this kind is inevitable, because any earlier-invoked base classifier will label at least some samples that its successors could have labeled correctly. 
Essentially, the magnitude of this effect (in terms of elimination) exerted by the high-ranked base classifiers on a low-ranked base classifier is unpredictable, but the general tendency is clear. The example above therefore illustrates a common phenomenon: a low-ranked base classifier is more vulnerable to elimination because of its low accuracy and the degree to which its target samples overlap with those of its predecessors. As a result, only the first several and a few middle base classifiers are retained in the classifier chain. The reduction rates of the two original classifier chains are 73.68% (i.e., (38 − 10)/38) in the first two experiments. Notably, the reduction rate in the third experiment reaches 94.74% (i.e., (38 − 2)/38; please refer to Table A5 in Appendix A). A possible reason for the difference in the reduction rates is that a smaller dataset is more likely to be covered by fewer base classifiers and a larger dataset by more, so the length of the refined classifier chain would be proportional to the data size. However, this is only a trend rather than a definite conclusion, because the reduction rate is determined by the base classifiers and is hence outside the control of the Sieve. Overall, the three experiments verify that the classifier chain refinement can effectively shorten the classifier chain without impacting the accuracy.
Moreover, the percentage of samples labeled by each base classifier/layer and the corresponding accuracy are shown in Table 3, where the two "Labeling Percentage" columns are computed by dividing the number of samples labeled by the current layer by the total number of samples, and the two "Accuracy" columns give the accuracy achieved by the current layer. For example, assume the total number of samples is 10. 
If the first base classifier labels four of the 10 samples and three of the four labeled samples are correct, the labeling percentage and the accuracy of this base classifier are 40% (i.e., 4/10) and 75% (i.e., 3/4), respectively. We can observe that the percentages and accuracies (roughly) decrease as more base classifiers are invoked. However, this is also a trend rather than a definite conclusion, because some data violate this pattern. For instance, the two labeling percentages of base classifier no.3 are higher than those of base classifier no.2, and the two accuracies of base classifier no.8 are higher than those of base classifier no.7. A similar pattern can also be found in the layer-based performance on the Breast Cancer dataset; please refer to Table A6 in Appendix A for details.
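The reduction rate and the two per-layer quantities of Table 3 are simple ratios, which the following short Python sketch makes explicit. The function names are hypothetical, and the 94.74% figure is read here as (38 − 2)/38, i.e., a 38-classifier chain refined down to two base classifiers.

```python
def reduction_rate(original_len, refined_len):
    """Share of base classifiers removed by the chain refinement."""
    return (original_len - refined_len) / original_len


def layer_stats(n_labeled, n_correct, n_total):
    """Labeling percentage and accuracy of a single base classifier/layer."""
    labeling_percentage = n_labeled / n_total   # labeled by this layer / all samples
    accuracy = n_correct / n_labeled            # correct labels / labeled by this layer
    return labeling_percentage, accuracy


print(f"{reduction_rate(38, 10):.2%}")   # 73.68%
print(f"{reduction_rate(38, 2):.2%}")    # 94.74%
print(layer_stats(4, 3, 10))             # (0.4, 0.75)
```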

Revealing the Significant Structural Advantage on the Global Consensus
Essentially, there are only three differences between the Sieve and the LC-series ensemble algorithms: (1) utilizing the capabilities of every base classifier fully (i.e., labeling both classes) or partially (i.e., labeling only one class); (2) employing the base classifiers with or without a certain order (i.e., invoking them based on the ranking or not); (3) labeling each sample via a consensus approach (i.e., combining multiple base predictions) or not. Therefore, the difference in accuracy between these two kinds of algorithms must be linked to one or more of these three points. To identify the actual reason, multiple hybrid ensemble classifiers are created, each of which is a mixture of the Sieve and an LC-series ensemble algorithm. With respect to making predictions, each hybrid classifier performs in a similar way to the Sieve: the base classifiers are invoked in order based on the ranking, and each base classifier adds one vote for a specific class to a sample (instead of labeling it directly), but only when it predicts the sample as that class. From the perspective of labeling, each hybrid classifier performs in exactly the same way as an LC-series ensemble classifier that adopts the majority vote as the consensus approach: a sample is labeled as a specific class if the votes for that class exceed half the number of base classifiers; otherwise, it is labeled as the other class. It should be noted that each hybrid classifier is actually a special Sieve with an LC procedure.
The hybrid ensemble classifiers are evaluated on the datasets test-21 and test+. As Tables 4 and 5 show, the performance of the Sieve and that of the hybrid ensemble classifiers differ considerably. In particular, the accuracy of each hybrid ensemble classifier is even worse than that of the corresponding traditional LC-series ensemble classifier, even though it invokes the base classifiers in order, as the Sieve does. Because the only difference between the hybrid classifiers and the Sieve is the involvement of an LC procedure, it can be concluded that the LC is the only possible cause of the reduction in accuracy. Therefore, the reason that the Sieve greatly outperforms the LC-series ensemble classifiers can be attributed to the structural advantage that comes with the GC.
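The difference between the hybrid labeling rule and direct labeling along the chain can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions rather than the implementation used in the experiments: the function names, the toy data and the representation of each base classifier as the set of samples it predicts as class X are all hypothetical.

```python
def hybrid_labels(samples, chain, target="X", other="Y"):
    """Hybrid rule: every base classifier casts one vote for `target` on each
    sample it predicts as `target`; a sample is labeled `target` only if its
    votes exceed half the number of base classifiers."""
    votes = {s: 0 for s in samples}
    for predicted_target in chain:        # invoked in ranking order
        for s in predicted_target:
            votes[s] += 1
    half = len(chain) / 2
    return {s: target if votes[s] > half else other for s in samples}


def chain_labels(samples, chain, target="X", other="Y"):
    """Direct labeling along the chain: the first base classifier that
    predicts `target` for a sample labels it; leftovers get `other`."""
    labels = {}
    for predicted_target in chain:
        for s in predicted_target:
            labels.setdefault(s, target)
    return {s: labels.get(s, other) for s in samples}


samples = range(1, 7)
chain = [{1, 2, 3}, {2, 3, 4}, {3, 5}]    # toy predictions of three base classifiers

hybrid = hybrid_labels(samples, chain)
direct = chain_labels(samples, chain)
print(sorted(s for s in samples if hybrid[s] == "X"))   # [2, 3]
print(sorted(s for s in samples if direct[s] == "X"))   # [1, 2, 3, 4, 5]
```

In this toy case the per-sample majority vote discards every class X prediction that is backed by only one base classifier, whereas direct labeling keeps it. Whether a kept prediction is correct depends on the base classifier, so the sketch shows only the structural difference, not the accuracy gap reported in Tables 4 and 5.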

Conclusion
We have proposed a new ensemble algorithm, the Sieve, which outperforms existing ones on both big and small data by eliminating the problem of CoC associated with the base classifiers in LC. The algorithm also handles the inherent bias associated with each classifier by constructing a classifier chain. The successful elimination of the CoC problem is attributed to the GC, in which consensus comes from all of the samples as a whole instead of from each individual sample (refer to subsection "Employment of Classifier Chain"). Consequently, the Sieve improves both accuracy (i.e., by 19.51% on average) and stability (refer to subsection "Advantages") by minimizing misclassifications in each procedure. In addition, it can effectively mitigate the IC by applying a non-classifier procedure to the classification process, which calibrates the biased ensemble results by assigning the leftover unlabeled samples to the opposite group. However, since the Sieve has to spend additional time constructing and refining the classifier chain, these operations add to the pre-processing time.
The current version of the Sieve improves only the consensus procedure. The ensemble results could be further improved by other methods in the future. One method is to train the base classifiers on various data subsets to enhance the independence among them, which would benefit the overall accuracy of the entire classifier chain. Another promising method is to replace the non-classifier procedure with the base classifier that has the lowest FN/FP if the refined classifier chain is ranked based on the FP/FN. Consequently, the number of misclassifications in the last layer would be reduced, improving the overall accuracy.
The current version, which works for binary classification, can also be applied to multiclass problems. Assume a dataset with three labels: A, B and C. We can create a binary dataset by relabeling all samples that belong to class B or C as BC. Using this approach, three binary datasets can be created by merging each pair of classes in turn (i.e., AB, BC and AC). Then, the Sieve can be employed to classify the three datasets. Since the features in the datasets are not changed, each sample will receive three labels. Finally, the classification of a sample can be obtained by voting among the three labels. Since the Sieve improves the intermediate results (i.e., the three labels for voting), the final result would also be improved.
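One possible reading of this multiclass scheme is sketched below in Python. The voting rule is not specified in the text, so the sketch assumes that a merged label such as BC counts as one vote for each of its member classes; the function names binary_view and vote are likewise hypothetical.

```python
from collections import Counter

CLASSES = ("A", "B", "C")


def binary_view(labels, single):
    """Relabel a three-class label list into `single` versus the merged
    label of the other two classes (e.g., A versus BC)."""
    merged = "".join(c for c in CLASSES if c != single)
    return [single if y == single else merged for y in labels]


def vote(binary_labels):
    """Combine the three binary labels of one sample.

    Assumption: a merged label gives one vote to each class it contains.
    Ties fall to the class tallied first; a real scheme would need an
    explicit tie-breaking rule.
    """
    tally = Counter()
    for label in binary_labels:
        tally.update(label)       # "BC" adds one vote for B and one for C
    return tally.most_common(1)[0][0]


print(binary_view(["A", "B", "C", "A"], "A"))   # ['A', 'BC', 'BC', 'A']
print(vote(["A", "AC", "AB"]))                  # A
print(vote(["BC", "B", "AB"]))                  # B
```

Under this assumption, a sample whose three binary labels are all correct gives three votes to its true class and one vote to each of the other two classes.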
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflicts of interest.
Due to the limited samples, we employ the cross validation.