Entropy
  • Article
  • Open Access

10 October 2019

Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme

1
Wired Communications Lab, Department of Electrical and Computer Engineering, University of Patras, 26504 Achaia, Greece
2
Educational Software Development Lab, Department of Mathematics, University of Patras, 26504 Achaia, Greece
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Theory and Applications of Information Theoretic Machine Learning

Abstract

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that increases the information contained in a dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize the vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying either active learning or semi-supervised learning methods individually. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. The effective and robust metrics of the entropy and the distribution of probabilities of the unlabeled set are used to select the most suitable unlabeled examples for the augmentation of the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the base approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.

1. Introduction

The most common approach established in machine learning (ML) is supervised learning (SL). Under SL schemes, classifiers are trained using purely labeled data. Regardless of the problem complexity, the performance of such schemes depends directly on the amount and the quality of the labeled data used in the training phase. In a large variety of scientific domains, such as object detection [1], speech recognition [2], web page categorization [3], and computer-aided medical diagnosis [4,5,6], vast pools of unlabeled data are often available. In most cases, however, labeling data can be costly and time-consuming, as human effort and expertise are required to annotate the available data. Many research works [7] focus on techniques that exploit the available unlabeled data, especially for classification problems. The most common learning methods incorporating such techniques are active learning (AL) and semi-supervised learning (SSL) [8]. Both AL and SSL share an iterative learning nature, making them a good fit for constructing more complex combined learning schemes.
The primary goal of this paper is to put forward a new AL and SSL combination algorithm in order to efficiently exploit the plethora of available unlabeled data found in most of the ML datasets and provide an improved classification framework. The general flow of AL and SSL frameworks is presented in Figure 1. Both methods utilize an initial pool of labeled and unlabeled examples with the goal of efficiently augmenting the available knowledge. AL and SSL frameworks, in most cases, operate under an iterative logic aiming to predict the label in the most appropriate unlabeled examples. While the former method annotates the unlabeled instances by interactively querying a human expert based on a variety of querying strategies, the latter attempts to automatically produce the labels of unlabeled examples by exploiting the previously learned knowledge and a wide range of unlabeled instances selection criteria. After the successful augmentation of the initial labeled set, a final model is constructed in both cases with a view to the application on the unknown test cases.
Figure 1. The general frameworks of active learning and semi-supervised learning along with their shared elements.
As both methods share many key characteristics, combining the two learning approaches is a natural next step. The main contribution of the proposed algorithm is the employment of a self-training scheme for the combination of AL and SSL, utilizing the fast and effective metrics of the entropy and the distribution of the prediction probabilities of the available unlabeled data. The large number of experiments carried out also plays a major role in the validation of the proposed algorithm. The proposed method is examined with a number of different individual base learners, and the ensemble learning technique is also explored, as aggregated models tend to produce more accurate predictions and are commonly used in today’s applications [9,10].
Real-world scenarios where AL and SSL combination methods can be applied include natural language processing (NLP) problems, in which many labeled examples are required to effectively train a model and vast amounts of unlabeled data can be mined. Common applications in the NLP field are part-of-speech tagging, named entity recognition, sentiment analysis [11], fraud detection, and spam filtering. In particular, a number of AL [12] and SSL methods, as well as combinations [13] of them, have been proposed in the spam filtering domain. In Figure 2, an application on the Spambase [14] benchmark dataset briefly presents the accuracy improvement of the proposed scheme as the algorithm’s iterations progress. The support vector machines (SVMs) [15] classifier was embedded as the base learner. For comparison, in the same figure, the corresponding SSL part of the algorithm was fed with the same amount of unlabeled data to obtain the semi-supervised accuracy alone.
Figure 2. Progression of accuracies in relation to the number of iterations executed for the proposed combination scheme and its semi-supervised counterpart, utilizing support vector machines (SVMs) as base learner, applied on the Spambase dataset using two different labeled ratios.
The rest of this research work is organized as follows: In Section 2, the related work on similar classification methods is discussed. In Section 3, the proposed method is presented along with the exact algorithm implemented. The efficacy of the combination scheme is evaluated in Section 4, where extensive experimental results can be found. In the same section, the average accuracies of the classifiers applied in the combination scheme are also briefly compared. In Section 5, a modification of the scheme is explored. The research conclusions are presented in Section 6, where a number of areas to be explored as future work are mentioned. Finally, a software implementation of the wrapper algorithm can be found in Appendix A through the accompanying link.

3. Proposed Method

The proposed method constitutes a combination of AL and SSL approaches, in order to leverage the advantages of both techniques. A mixed self-training method is employed utilizing the entropy of unlabeled instances, with the aim to identify the most confusing instances in the case of active round, while in the semi-supervised round the internal learner’s distribution of probabilities for all possible labels per each instance is exploited as a sorting mechanism for the selection of the most confident examples.
Algorithm 1: Combination Scheme
1: LOAD the dataset D and construct the labeled set L and the unlabeled set U
2: INITIALIZE the classifier CLS
3: CALCULATE the labeled ratio R = size(L)/size(L+U)
4: DEFINE the maximum number of iterations MaxIter
5: DEFINE the maximum percentage of unlabeled examples to be added in each iteration T with respect to R
6: SET maxUnlabPerIter = T * R * size(D)
7:
8: SET i = 0
9: WHILE i < MaxIter AND size(Ui) > 0: /* where U0 = U */
10:  Train(CLS) on the current labeled set Li /* where L0 = L */
11:  IF i modulo 2 == 0:
12:   Classify(Ui) using CLS and construct matrix Mpr containing corresponding prediction probabilities along with the predicted labels
13:   SORT Mpr descending according to the prediction probabilities
14:   STORE the top maxUnlabPerIter instances of Mpr in a matrix Mfinal
15:    /* now containing the most confident instances along with their predictions */
16:  ELSE:
17:   Calculate the distribution_of_probabilities(Ui) and return a matrix DistUi
18:   Calculate the entropy(DistUi) for each element and return a matrix EntrUi
19:   SORT EntrUi descending according to their entropies
20:   Label the top maxUnlabPerIter using human expertise
21:   STORE the top maxUnlabPerIter instances along with their labels in a matrix Mfinal
22:    /* now containing the most confusing instances along with their true labels */
23:  END_IF
24:  Augment(Li) by adding Mfinal instances
25:  Clean(Ui) by removing Mfinal instances
26:  SET i = i + 1
27: END_WHILE
28:
29: Train(CLS) using Laugmented (≡ Llast iteration)
30: LOAD the unknown test cases as Testset
31: Classify(Testset) using CLS to produce the final predictions
Uncertainty-based metrics are widely deployed in the AL field, as the literature suggests [34], mainly because they are cheap to compute and effective at selecting the most confusing instances. In the SSL field, on the other hand, research works exist [8,35] that demonstrate the effectiveness of probabilistic iterative schemes. As these types of metrics are similar in nature, they can form a robust combination for the construction of schemes such as the one proposed. Moreover, it is also known [36] that the SSL self-training technique helps to overcome the lack of exploration that occurs during the AL entropy-based training process, which causes the algorithms to get stuck at suboptimal solutions, continuously selecting instances that do not improve the current classifier.
The proposed algorithm can be characterized as a simple yet very effective wrapper algorithm that can utilize a wide range of learners, assuming that they can produce probability distributions for their predictions. A detailed presentation of the algorithm follows in the next paragraphs.
Let D denote the initial training set, consisting of a labeled set of examples L and an unlabeled set of examples U thus defining a labeled ratio R, as in the following equation:
$$\mathrm{LabeledRatio} = \frac{size(L)}{size(L + U)} \tag{1}$$
where the size(X) function returns the size of a set of instances.
Initially, a base learner (CLS) is selected and trained on L. Afterwards, a self-training scheme is employed with the aim of augmenting L using the available unlabeled examples of D. The number of unlabeled examples utilized in each iteration is conservatively selected, taking into account the size of the initial labeled set. A control parameter T sets this number as a percentage of the size of the initial labeled set. The maximum number of unlabeled instances selected in each iteration is calculated as follows:
$$maxUnlabPerIter = T \cdot R \cdot size(D) \tag{2}$$
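The two quantities above can be illustrated with a minimal Python sketch. The function names are hypothetical, and rounding to a whole number of instances with a floor of one is an assumption of this sketch; the text does not say how fractional results are handled.

```python
def labeled_ratio(n_labeled: int, n_unlabeled: int) -> float:
    """R = size(L) / size(L + U)."""
    return n_labeled / (n_labeled + n_unlabeled)


def max_unlab_per_iter(t: float, r: float, n_total: int) -> int:
    """maxUnlabPerIter = T * R * size(D).

    Rounding to the nearest integer and selecting at least one instance
    are assumptions of this sketch.
    """
    return max(1, round(t * r * n_total))


# Example: 1000 training instances, 100 of them labeled, T = 10%.
r = labeled_ratio(100, 900)
budget = max_unlab_per_iter(0.10, r, 1000)
print(r, budget)
```

Since R = size(L)/size(D), the product R * size(D) is simply size(L), so the per-iteration budget is a fraction T of the initial labeled set, which matches the description of T in the text.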
In each iteration (i), one of the two learning approaches is employed successively. The self-training loop terminates in a maximum number of iterations MaxIter or in the case of exhaustion of the pool of unlabeled examples.
Starting with the SSL round, the CLS is applied on the current unlabeled set Ui and a matrix of predictions Mpr is constructed along with the prediction probability for each unlabeled instance. The result is a matrix of size(Ui) × (l + 2) dimensions, where l is the number of features and the two additional columns hold the predicted labels and the corresponding prediction probabilities. The SSL round uses machine labeling in order to balance the expensive human effort and examination process required to label the data. Mpr is sorted in descending order of prediction probability; the top maxUnlabPerIter instances are kept and the rest are discarded. These maxUnlabPerIter instances, along with their predicted labels, are stored in Mfinal.
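As a rough illustration of the SSL round's selection step, the following sketch ranks unlabeled instances by the probability of their predicted class and keeps the top k. It works on plain lists of class probabilities; the function name and the tuple layout are hypothetical and not taken from the authors' implementation.

```python
def select_most_confident(proba_rows, k):
    """Return (index, predicted_label, probability) for the k most confident rows.

    proba_rows holds one list of class probabilities per unlabeled instance.
    """
    scored = []
    for idx, probs in enumerate(proba_rows):
        best_class = max(range(len(probs)), key=probs.__getitem__)
        scored.append((idx, best_class, probs[best_class]))
    # Sort descending by the probability of the predicted class, keep the top k.
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]


probs = [[0.55, 0.45], [0.10, 0.90], [0.80, 0.20], [0.50, 0.50]]
top = select_most_confident(probs, 2)
print(top)
```

In this toy example the second and third instances are selected, since their predicted classes carry the highest probabilities (0.90 and 0.80).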
Following the method flow, an AL round is deployed in every other iteration. In this round, the algorithm constructs a matrix EntrUi containing the entropy estimate of each unlabeled instance. The base learner is applied on Ui and the distribution of probabilities is exported in a matrix DistUi of dimensions size(Ui) × num_classes(D), where the num_classes(X) function returns the number of classes of a dataset. Having produced DistUi, each element j of the entropy estimation matrix is computed using the following formula:
$$Entropy_j = -\sum_{k=1}^{num\_classes(D)} p_k \cdot \log_2 p_k \tag{3}$$
where p_k denotes the probability of class k for instance j, already contained in DistUi.
Subsequently, EntrUi is sorted in descending order, so that the most confusing examples, those with the highest entropy (close to one in a two-class problem, where the base-2 entropy is at most one), are placed at the top of the matrix. The top maxUnlabPerIter instances are kept and the rest are discarded. Human expertise is utilized to label these maxUnlabPerIter instances, and a matrix Mfinal containing the human-labeled instances is constructed with size maxUnlabPerIter × (l + 1), where l + 1 is the number of features, including the class.
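The entropy ranking of the AL round can be sketched as follows. This is a minimal illustration using the standard Shannon entropy in bits; skipping zero-probability terms is a common convention assumed here, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_most_confusing(proba_rows, k):
    """Return the indices of the k instances with the highest entropy."""
    order = sorted(range(len(proba_rows)),
                   key=lambda i: entropy(proba_rows[i]),
                   reverse=True)
    return order[:k]


dist = [[0.55, 0.45], [0.10, 0.90], [0.80, 0.20], [0.50, 0.50]]
query = select_most_confusing(dist, 2)
print(query)
```

The instances whose class probabilities are closest to uniform are queried first; these are the ones that would be passed to the human expert.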
During each iteration, the Mfinal instances are added to the current labeled set Li and removed from the current unlabeled set Ui. The CLS is re-trained at the start of each self-training iteration in order to be utilized again. When the termination criteria are met, the algorithm exits the self-training loop having constructed the augmented labeled set Laugmented (≡ Llast iteration). As a final step, the CLS is trained on the augmented labeled set in order to be applied on the unknown test cases. The exact implementation of the combination scheme is presented in Algorithm 1.
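The whole loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' software package: the classifier is any object exposing hypothetical `fit(X, y)` and `predict_proba(X)` methods with labels 0..C-1, the human expert is simulated by an `oracle` callable, and `NearestMeanClassifier` is a toy one-dimensional learner included only so the example runs.

```python
import math


def top_k(scores, k):
    """Indices of the k largest scores."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]


def combination_self_training(clf, X_lab, y_lab, X_unlab, oracle, max_iter, k):
    """Alternate SSL (even i) and AL (odd i) rounds; k is maxUnlabPerIter."""
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    i = 0
    while i < max_iter and X_unlab:
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        if i % 2 == 0:
            # SSL round: most confident instances, labeled by the classifier.
            chosen = top_k([max(p) for p in proba], k)
            new_labels = [proba[j].index(max(proba[j])) for j in chosen]
        else:
            # AL round: highest-entropy instances, labeled by the expert.
            ent = [-sum(q * math.log2(q) for q in p if q > 0) for p in proba]
            chosen = top_k(ent, k)
            new_labels = [oracle(X_unlab[j]) for j in chosen]
        X_lab += [X_unlab[j] for j in chosen]
        y_lab += new_labels
        picked = set(chosen)
        X_unlab = [x for j, x in enumerate(X_unlab) if j not in picked]
        i += 1
    clf.fit(X_lab, y_lab)  # final model on the augmented labeled set
    return clf, X_lab, y_lab


class NearestMeanClassifier:
    """Toy 1-D two-class learner, used only to exercise the loop."""

    def fit(self, X, y):
        self.means = [sum(x for x, t in zip(X, y) if t == c) /
                      max(1, sum(1 for t in y if t == c)) for c in (0, 1)]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            d0, d1 = abs(x - self.means[0]), abs(x - self.means[1])
            p0 = 0.5 if d0 + d1 == 0 else d1 / (d0 + d1)
            out.append([p0, 1.0 - p0])
        return out


X_unlab = [1.0, 2.0, 4.0, 6.0, 8.0]
model, X_aug, y_aug = combination_self_training(
    NearestMeanClassifier(), [0.0, 10.0], [0, 1], X_unlab,
    oracle=lambda x: int(x > 5), max_iter=4, k=1)
print(X_aug, y_aug)
```

In this toy run the SSL rounds absorb the points far from the decision boundary, while the AL rounds send the points near the boundary to the oracle, which is the division of labor the scheme relies on.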

4. Experimentation and Results

In order to examine the efficacy of the proposed scheme, an exhaustive experimentation procedure was followed. At first, fifty-five (55) benchmark datasets were extracted from the UCI repository [14], related to a wide range of classification problems. To further enhance the variance and complexity of the classification process, all datasets were partitioned and examined according to the resampling procedure of k-fold cross-validation [37]. Following the method’s steps, each subject dataset is shuffled and then divided into k unique data groups. By holding out one of the groups as a test set and utilizing the rest as a train set, k new datasets are generated. The k parameter was set equal to ten, as it is commonly selected by the majority of the literature.
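The resampling procedure described above can be sketched with the standard library alone. This is an illustrative version of shuffled k-fold splitting; the function name, the fixed seed, and the way indices are dealt into folds are assumptions of the sketch.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and yield (train_idx, test_idx) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k disjoint groups of nearly equal size
    for f in range(k):
        test = folds[f]
        train = [j for g in range(k) if g != f for j in folds[g]]
        yield train, test


splits = list(k_fold_indices(25, k=10))
print(len(splits), [len(test) for _, test in splits])
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training k - 1 times.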
The main aim of the experimentation process was to prove the superiority of the combination scheme against the competing methods of the supervised, semi-supervised and active learning using always the same amounts of labeled and unlabeled data under the same base learner model. In more detail, the supervised method is trained only on the initial labeled set while the semi-supervised rival method utilizes also the initial unlabeled set in the same manner that is also exploited in the proposed combination scheme. Moreover, as baseline AL opponent the random sampling [7] process is implemented in a similar way with the rest of the combination self-training procedure, also utilizing the initial unlabeled set.
For this purpose, all training subsets were further divided into two sets, an initial labeled set and an initial unlabeled set, using four different labeled ratios R. As the initial datasets contained all of the instance labels, the human expert labeling process was simulated by storing the original labels of the constructed unlabeled sets separately and retrieving them whenever the algorithm needed to query the human expert. Thus, each original dataset was expanded into forty derived datasets. In detail, the R values were set to 10%, 20%, 30%, and 40%. As regards the proposed algorithm’s parameters, the control parameter T was set equal to 10%, while the MaxIter parameter was empirically set equal to 10. In the case of R = 40%, this limits the number of unlabeled instances selected for augmenting the initial labeled set to at most 40% of the original dataset size (the limit is obtained by multiplying Equation (2) by the MaxIter parameter).
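The 40% limit follows directly from the stated parameter values. As a worked check, multiplying the per-iteration budget maxUnlabPerIter = T * R * size(D) by MaxIter gives, for the largest and smallest labeled ratios used:

```latex
\begin{aligned}
R = 40\%:&\quad MaxIter \cdot T \cdot R \cdot size(D) = 10 \cdot 0.10 \cdot 0.40 \cdot size(D) = 0.40 \cdot size(D) \\
R = 10\%:&\quad MaxIter \cdot T \cdot R \cdot size(D) = 10 \cdot 0.10 \cdot 0.10 \cdot size(D) = 0.10 \cdot size(D)
\end{aligned}
```

With T = 10% and MaxIter = 10, the product MaxIter * T equals one, so the total number of added instances is at most the size of the initial labeled set at any R.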
As a comparison measure, the average classification accuracy over each R was used. In order to draw general conclusions for the efficacy of the combination scheme, a wide range of classification models and meta-techniques were employed, incorporated in each one of the four learning methods. A brief description for each one of the base learners is presented:
BagDT: In this model, the bootstrap aggregating (bagging) [38] meta-algorithm was applied along with the use of the C4.5 decision trees [39] classifier. The bagging technique is often adopted to reduce the variance and overfitting of a base learner and enhance its accuracy stability. The basic idea behind this technique is the generation of multiple training sets by uniformly sampling the original dataset.
5NN: The k-nearest neighbors [40] classifier belongs to the family of lazy learning algorithms. By examining the k closest instances in a defined feature space, it classifies a given test instance by plurality voting on the labels of the k instances.
Logistic: The logistic regression, also commonly referenced as the logit model, is a statistical model that utilizes the logistic function in order to model binary dependent variables, thus fitting very well with categorical targets. In problems where the target variable has more than two values, multinomial logistic regression is applied [41].
LMT: The logistic model tree [42] classification model combines logistic regression with decision trees. The main idea behind the classifier is the use of logistic regression models at the leaves of a classification tree.
LogitBoost: This classifier is a boosting model proposed by Friedman et al. [43]. It is based on the idea that the adaptive boosting [44] method can be thought of as a generalized additive model, so the cost function of logistic regression can be applied.
RF: One of the most robust ML learners is the random forests [45] model, which is capable of tackling regression and classification problems. Its operation is based on the construction of multiple decision trees using random subsamples of the original feature space. The aggregation of the results is achieved via majority voting. Due to its inner architecture, it is known to efficiently handle the overfitting phenomena.
RotF: The rotation forest model constitutes an ensemble [46] classifier proposed by Rodriguez and Kuncheva [47]. Following the flow of this algorithm, the initial feature space is divided in random subspaces. The default feature extraction algorithm applied to create the subspaces is the principal component analysis (PCA) [48], aiming to increase the diversity amongst the base learners.
XGBoost: The extreme gradient boosted trees [49] algorithm is a powerful implementation of gradient boosted decision trees. Under this boosting [50] scheme, a number of trees are built sequentially, each one aiming to reduce the errors produced by the previous trees; that is, each tree is fitted on the gradient of the loss from the previous step. The final decision is produced from the weighted voting of the trees. XGBoost is a very scalable algorithm that has been shown to perform very well on large or sparse datasets, utilizing parallel and distributed execution methods.
Voting (RF, RotF, XGBoost): As a last effort to further explore the potential of more complex classification models in the combination scheme, the construction of an ensemble classifier by majority voting the results of three of the most robust models: RF, RotF, and XGBoost was put forward. As regards the extraction of probabilities, the average of the exported probabilities for the three classifiers was considered as the best option.
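The probability averaging used for the voting ensemble can be illustrated with a short sketch. The three probability vectors below are made-up values standing in for the outputs of RF, RotF, and XGBoost on a single three-class instance; the function name is hypothetical.

```python
def average_probabilities(per_model_proba):
    """Average the class-probability vectors of several models for one instance."""
    n_models = len(per_model_proba)
    n_classes = len(per_model_proba[0])
    return [sum(m[c] for m in per_model_proba) / n_models
            for c in range(n_classes)]


# Hypothetical outputs of the three ensemble members for one instance.
rf, rotf, xgb = [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]
avg = average_probabilities([rf, rotf, xgb])
print(avg)
```

The averaged vector is what the combination scheme would consume in both rounds: its maximum in the SSL round and its entropy in the AL round.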
The experimental results in terms of classification accuracy for each base learner are organized in Table 1, Table 2, Table 3, Table 4 and Table 5 and supplementary material Tables S1–S4, categorized according to the four label ratios (10%, 20%, 30%, 40%) for each learning method. The bold values in the tables indicate the highest accuracy for the corresponding dataset and the subject labeled ratio.
Table 1. Classification accuracies of bagging-decision trees (BagDT) on four different ratios.
Table 2. Classification accuracies of random forests (RF) on four different ratios.
Table 3. Classification accuracies of rotation forest (RotF) on four different ratios.
Table 4. Classification accuracies of XGBoost on four different ratios.
Table 5. Classification accuracies of voting (RF, RotF, XGBoost) on four different ratios.
The superiority of the proposed combination scheme regarding the classification accuracy is prominent. The following important observations are derived from the accuracy tables:
  • The proposed combination method outperforms the other three learning methods (SL, SSL, and AL) at all four labeled ratios and for all nine base learners, in terms of average accuracy. This argument is also validated in Figure 3, where the comparisons are visually assembled and a progressive picture of the performance of the two dominant methods is presented as R increases. The SL method was also included as a baseline performance metric.
    Figure 3. Performance comparison of the proposed combination scheme, in terms of average accuracies over fifty-five datasets and four labeled ratios, against the corresponding methods of active learning (AL) and supervised learning (SL) for the nine base classifiers.
  • The accuracy tables show that the proposed method consistently produces significantly more wins on individual datasets across all the experiments carried out.
Following the accuracy examination, the Friedman aligned ranks test [51] was conducted. In Table 6, Table 7, Table 8, Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14, the results of the statistical tests for each one of the nine base learners, divided into the four labeled ratios used, are presented. These lead to the following observations:
Table 6. Friedman aligned ranking test and Holm’s post hoc test regarding BagDT (a = 0.10).
Table 7. Friedman aligned ranking test and Holm’s post hoc test regarding RF (a = 0.10).
Table 8. Friedman aligned ranking test and Holm’s post hoc test regarding RotF (a = 0.10).
Table 9. Friedman aligned ranking test and Holm’s post hoc test regarding extreme gradient boosted trees (XGBoost) (a = 0.10).
Table 10. Friedman aligned ranking test and Holm’s post hoc test regarding voting (RF, RotF, XGBoost) (a = 0.10).
Table 11. Friedman aligned ranking test and Holm’s post hoc test regarding k-nearest neighbors (5NN) (a = 0.10).
Table 12. Friedman aligned ranking test and Holm’s post hoc test regarding logistic (a = 0.10).
Table 13. Friedman aligned ranking test and Holm’s post hoc test regarding logistic model tree (LMT) (a = 0.10).
Table 14. Friedman aligned ranking test and Holm’s post hoc test regarding LogitBoost (a = 0.10).
  • The non-parametric tests assess the null hypothesis that the means of the results of two or more of the compared methods are the same by calculating the related p-value. This hypothesis can be rejected for all the nine algorithms and for all labeled ratios as all calculated p-values are significantly lower than the significance level of a = 0.10.
  • Moreover, the Friedman rankings confirm that for all nine base learners and regardless of the labeled ratio, the proposed combination scheme ranks first ahead of all other learning methods in coincidence with the accuracy experimental results.
Since the Friedman test null hypothesis was rejected, Holm’s [52] post-hoc statistical test was also applied with an alpha value of 0.10. The aim of Holm’s test is to detect the specific differences between the combination scheme and the other learning methods; the null hypothesis under evaluation is therefore that the mean of the results of the proposed method equals that of each other method (compared in pairs). The post-hoc results are also presented in the corresponding ranking test tables for each one of the base learners. By observing the adjusted p-values of Holm’s tests, it is concluded that:
  • The proposed combination method performs significantly better on 105 of the total 108 compared method variations for the nine base learners over the four labeled ratios.
  • For the Logistic, LMT, and LogitBoost classifiers, the null hypothesis of no significant difference from the AL method is not rejected for one labeled ratio each: 30%, 20%, and 40%, respectively. However, the adjusted p-values exceed the alpha of 0.10 only slightly.
Summarizing the test results, both Friedman Aligned Ranks tests and Holm’s one vs all comparison tests verify the superior performance of the proposed method over a wide range of scenarios and algorithm comparisons.
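For readers unfamiliar with Holm's procedure, the adjustment of p-values can be sketched as below. This is the standard step-down formulation, not code from the paper, and the raw p-values are hypothetical numbers chosen only for illustration.

```python
def holm_adjusted(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep the sequence monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical unadjusted p-values: proposed method vs. SL, SSL, and AL.
raw = [0.001, 0.020, 0.060]
adj = holm_adjusted(raw)
significant = [p < 0.10 for p in adj]
print(adj, significant)
```

A comparison is declared significant when its adjusted p-value falls below the chosen alpha, here 0.10 as in the tables.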
To better observe the individual results regarding the combination schemes and the role of the base learners incorporated, the average accuracies were plotted in Figure 4. The outcome was, as expected, the following: the ensemble voting (RF, RotF, XGBoost) classifier outperforms the remaining models at all labeled ratios. A first indication of why is the improved prediction probabilities derived from averaging the probabilities of the three classifiers, on which the combination scheme relies; this would be a promising starting point for seeking a rigorous explanation of the performance boost. Both selection criteria depend on the distribution of prediction probabilities: the most confusing unlabeled instances are detected through the entropy calculation, and the most confident unlabeled instances through the prediction probabilities themselves. Such behaviors seem to also emerge in other relevant ensemble wrapper algorithms [53].
Figure 4. Average accuracies for the proposed combination scheme regarding different base learners and labeled ratios.

5. Modification

With a view to improving the proposed method, the statistical analysis and ranking results indicate that a slight increase in the performance of the SSL part could have a significant impact on the overall efficiency of the combination scheme.
In this direction, careful observation of the execution of the proposed algorithm revealed the weakness of the SSL prediction probabilities, which, in many cases, leads to the selection of the wrong instances to be labeled. In order to augment the probabilistic information available for the proposed method, as regards the SSL part, a lazy classifier (kNN) was integrated into the instance selection process. Such a development, on the one hand, augments the proposed method with a second view of the labels for the unlabeled set, and on the other hand, does not significantly increase the computational overhead as this family of classifiers does not need training. As a second measure to strengthen the SSL instance selection criteria, the empirical approach of setting a lower limit on the minimum accepted probability for an unlabeled instance was adopted using the formula:
$$probaThreshold = \frac{num\_classes(D) + 1}{2 \cdot num\_classes(D)} \tag{4}$$
where, by utilizing the num_classes(X) function, the dependence on the dataset characteristics is lifted. A more compact representation of the SSL part modifications is given in Algorithm 2, while the abstract flow chart of the improved combination framework is presented in Figure 5.
Figure 5. Graphical abstract of the proposed combination framework after the introduction of the semi-supervised learning (SSL) improvements.
Algorithm 2: SSL modification
10: [Execute Algorithm 1 steps (until Alg. 1 line 10)]
11:  IF i modulo 2 == 0:
12:   SET probaThreshold = [num_classes(D) + 1]/[2 * num_classes(D)]
13:   SET the number of nearest neighbors numNeib
14:   INITIALIZE the NN classifier on Li using numNeib
15:
16:   Classify(Ui) using CLS and construct matrix Mpr containing corresponding prediction probabilities along with the predicted labels
17:   FOR_EACH instance of Mpr:
18:    IF cls_predicted_class(instance) != nn_predicted_class(instance) OR cls_probability(instance) < probaThreshold:
19:     DISCARD instance from Mpr
20:    END_IF
21:   END_FOR_EACH
22:   SORT Mpr descending according to the prediction probabilities
23:   STORE the top maxUnlabPerIter instances of Mpr in a matrix Mfinal
24:   /* now containing the most confident instances along with their predictions */
25: [Continue Algorithm 1 steps (from Alg. 1 line 16)]
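The filtering step of Algorithm 2 can be sketched as follows. This is a simplified illustration: the kNN predictions are passed in as a precomputed list rather than produced by an actual nearest-neighbor classifier, the number of classes is inferred from the width of the probability vectors, and the function names are hypothetical.

```python
def proba_threshold(num_classes):
    """probaThreshold = (num_classes + 1) / (2 * num_classes)."""
    return (num_classes + 1) / (2 * num_classes)


def filter_and_select(cls_proba, nn_labels, k):
    """Keep instances where CLS and kNN agree and CLS is confident enough,
    then return the k most confident as (index, label, probability)."""
    # Assumption: every class has a column in the probability vectors.
    threshold = proba_threshold(len(cls_proba[0]))
    kept = []
    for idx, probs in enumerate(cls_proba):
        label = probs.index(max(probs))
        if label == nn_labels[idx] and probs[label] >= threshold:
            kept.append((idx, label, probs[label]))
    kept.sort(key=lambda row: row[2], reverse=True)
    return kept[:k]


cls_proba = [[0.95, 0.05], [0.70, 0.30], [0.20, 0.80], [0.85, 0.15]]
nn_labels = [0, 0, 1, 1]  # hypothetical kNN predictions for the same instances
selected = filter_and_select(cls_proba, nn_labels, 2)
print(proba_threshold(2), selected)
```

For two classes the threshold is 0.75, and it decreases toward 0.5 as the number of classes grows. In the example, the second instance is dropped for low confidence and the fourth because the two classifiers disagree.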
The improved combination scheme was further tested against the most robust AL frameworks found in the literature. In detail, the query strategies of least confidence (LC), margin sampling (MS) and entropy sampling (ES) were considered to be compared with the modified proposed scheme. The major aspects concerning these strategies [7] follow below.
LC: The objective of this strategy is to identify the least confident unlabeled instances by examining the probability of the most probable label for each unlabeled instance. The strategy continues by selecting the instances having the lowest probable labels and presents them to the human expert to be labeled in order to augment the initial labeled set.
MS: As an improvement of the LC strategy, MS attempts to overcome the drawback of considering only the most probable label by calculating the difference between the probabilities of the most probable and the second most probable label for each unlabeled instance. Afterwards, those calculated differences are sorted and the instances with the lowest differences are selected to be labeled.
ES: This strategy, part of which is also integrated into the AL counterpart of the proposed scheme, computes the entropy measure (similar to Equation (3)) for each unlabeled instance using the distribution of prediction probabilities. The most entropic instances are then selected to be displayed to the human expert in order to enlarge the original labeled set.
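The three query strategies can be compared side by side on a small pool of probability vectors. The sketch below uses the usual textbook definitions of each score; the pool values are made up, and the helper names are hypothetical.

```python
import math


def least_confidence(probs):
    """Higher score means less confidence in the most probable label."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most probable labels; smaller means more ambiguous."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def entropy(probs):
    """Shannon entropy in bits over the full distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


pool = [[0.50, 0.30, 0.20], [0.45, 0.45, 0.10], [0.90, 0.05, 0.05]]
lc_pick = max(range(len(pool)), key=lambda i: least_confidence(pool[i]))
ms_pick = min(range(len(pool)), key=lambda i: margin(pool[i]))
es_pick = max(range(len(pool)), key=lambda i: entropy(pool[i]))
print(lc_pick, ms_pick, es_pick)
```

On this pool, LC and MS both query the second instance, whose top two labels are tied, while ES queries the first, whose probability mass is spread more evenly over all three classes. The strategies coincide for two-class problems, where all three scores are monotone in the probability of the most probable label.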
In Figure 6, ten experiments display the performance comparison in terms of classification accuracy regarding the three AL methods against the modified combination scheme. The experiments are categorized by the five base learner models that were integrated into the methods. In each experiment, a different benchmark dataset was deployed using four different R values equal to 10%, 20%, 30%, and 40%.
Figure 6. Progression of accuracies for the modified combination scheme against the three AL strategies (least confidence (LC), margin sampling (MS), entropy sampling (ES)) for five different base learners on ten different benchmark datasets.
The experimental results confirm the efficiency of the modified combination scheme against the AL methods. The figure shows that in all ten cases the proposed technique performs as well as or better than its rivals. Moreover, the figure suggests that the three AL methods produce closely related accuracy results, as in four of the ten test cases their performance was almost identical. This outcome can be explained by the metrics utilized in these strategies, which are all derived from the prediction probabilities of the base learners.
Closing this section, in the conducted experiments on real-world benchmark datasets, the proposed combination scheme was compared with the SL, the SSL, and the AL methods. The experiments show that the proposed method outperforms the compared methods. Therefore, in the future, it is very important to conduct more insightful theoretical analyses on the effectiveness of the proposed approach and explore other appropriate selection criteria for filtering the informative unlabeled instances, in order to generalize the results with more confidence.

6. Conclusions

In this research work, a new wrapper algorithm was proposed that combines the AL and SSL methods with the aim of efficiently utilizing the available unlabeled data. A plethora of experiments was conducted to evaluate the efficacy of the proposed algorithm on a wide range of benchmark datasets against other learning methods, using a variety of classifiers as base models. In addition, four different labeled ratios were investigated. The proposed algorithm prevails over the other learning methods, as statistically confirmed by the Friedman aligned ranks non-parametric tests and Holm’s post-hoc tests. To further promote the use of the proposed algorithm, a software package was developed; more details about this package can be obtained from the link found in Appendix A.
Regarding the performance boost that was experimentally observed when applying the proposed combination scheme to the numerous datasets, there is strong evidence that the AL method can efficiently improve its performance by utilizing SSL schemes such as the self-training technique. Even in cases where the individual SSL method did not perform well, when it was integrated into the proposed AL and SSL wrapper, the performance of the overall scheme was significantly improved compared to the plain AL method. Moreover, when the majority of the instances used in a learning scheme are automatically labeled, the performance may be unsatisfactory, and in some cases it may even be worse than the SL baseline accuracy. For this reason, a fundamental requirement arises: that of defining a sufficient threshold of human expert intervention in the labeling process in order to successfully combine AL and SSL methods. Such a fine-tuning process is regarded as highly application-specific and challenging to automate. Furthermore, the results indicate that the proposed scheme can be beneficial on datasets with very small initial labeled sets, as the initially learned decision boundaries of such datasets may be inaccurate, and thus unlabeled instances near these boundaries could be falsely classified. This is an issue that the AL part of the proposed scheme could efficiently tackle.
For future work, a number of areas have been identified that are worth exploring, as they seem promising for improving the classification abilities of the proposed algorithm. A first major research area, which is expected to have a high impact on the combination scheme’s performance in terms of accuracy and execution time, would be the investigation of instance selection strategies other than those currently employed. In the AL part of the proposed algorithm, two common alternatives are the least confidence [54] and the margin sampling [55] algorithms, which utilize the unlabeled data from a different perspective. Moreover, query scenarios more complex than the plain pool-based sampling used here, such as query synthesis [56], could also be beneficial. As regards the semi-supervised part, simple techniques such as the integration of weights annotating the instances assessed as informative by the SSL part of the algorithm could further improve the overall accuracy of the combination scheme, as suggested in [35,57].
Another interesting research area would be that of the extreme outlier detection algorithms. The incorporation of such algorithms in the proposed algorithm would have an immediate impact on the quality of the selected candidate unlabeled instances that are used to augment the labeled set in each self-training iteration, thus resulting in more robust inner models. A few of the very well-known techniques that could be directly implemented in the combination scheme are the local outlier factor [58] for detecting anomalous values based on neighboring data or the isolation forest [59], which is a tree-based outlier detector.
Other research areas that could further improve the proposed algorithm include preprocessing algorithms, for instance, PCA for dimensionality reduction and production of more informative features, or other feature selection techniques such as univariate feature selection [60]. As for the integrated base learners, the introduction of online learners like the Hoeffding adaptive tree [61] and Pegasos [62], or of deep learning architectures based on deep neural networks [63] and deep ensembles [64], could make the proposed algorithm suitable for tackling streaming and big data problems.
Finally, by combining schemes from the fields of active regression learning [65,66] and semi-supervised regression [53] along with the proposed classification algorithm, a general combination scheme could be put forward that would be able to handle numeric and categorical targets.

Supplementary Materials

The following are available online at https://www.mdpi.com/1099-4300/21/10/988/s1, Table S1: Classification accuracies of 5 nearest neighbors (5NN) on four different ratios, Table S2: Classification accuracies of logistic regression (logistic) on four different ratios, Table S3: Classification accuracies of logistic model trees (LMT) on four different ratios, Table S4: Classification accuracies of LogitBoost on four different ratios.

Author Contributions

All authors have contributed equally to the final manuscript.

Funding

This research is implemented through the Operational Program Human Resources Development, Education and Lifelong Learning and is co-financed by the European Union (European Social Fund) and Greek national funds.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

The proposed combination algorithm was implemented as a separate Java package for the WEKA [67] software tool. The combination scheme was developed as part of the WEKA tool because WEKA is one of the most well-known tools used in the machine-learning community and includes a large number of base learner models. Moreover, it can be easily deployed without requiring programming experience from the end user. The package can be downloaded using the following link: http://ml.upatras.gr/combine-classification/.

References

  1. Rosenberg, C.; Hebert, M.; Schneiderman, H. Semi-supervised self-training of object detection models. In Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision (WACV 2005), Breckenridge, CO, USA, 5–7 January 2005. [Google Scholar]
  2. Karlos, S.; Fazakis, N.; Karanikola, K.; Kotsiantis, S.; Sgarbas, K. Speech Recognition Combining MFCCs and Image Features. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2016; Volume 9811 LNCS, pp. 651–658. ISBN 9783319439570. [Google Scholar]
  3. Tsukada, M.; Washio, T.; Motoda, H. Automatic Web-Page Classification by Using Machine Learning Methods. In Web Intelligence: Research and Development; Springer: Berlin/Heidelberg, Germany, 2001; pp. 303–313. [Google Scholar]
  4. Fiscon, G.; Weitschek, E.; Cella, E.; Lo Presti, A.; Giovanetti, M.; Babakir-Mina, M.; Ciotti, M.; Ciccozzi, M.; Pierangeli, A.; Bertolazzi, P.; et al. MISSEL: A method to identify a large number of small species-specific genomic subsequences and its application to viruses classification. BioData Min. 2016, 9, 38. [Google Scholar] [CrossRef] [PubMed]
  5. Previtali, F.; Bertolazzi, P.; Felici, G.; Weitschek, E. A novel method and software for automatically classifying Alzheimer’s disease patients by magnetic resonance imaging analysis. Comput. Methods Programs Biomed. 2017, 143, 89–95. [Google Scholar] [CrossRef] [PubMed]
  6. Celli, F.; Cumbo, F.; Weitschek, E. Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers. Big Data Res. 2018, 13, 21–28. [Google Scholar] [CrossRef]
  7. Settles, B. Active Learning Literature Survey; University of Wisconsin-Madison: Madison, WI, USA, 2009; pp. 1–43.
  8. Triguero, I.; García, S.; Herrera, F. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst. 2015, 42, 245–284. [Google Scholar] [CrossRef]
  9. Mousavi, R.; Eftekhari, M.; Rahdari, F. Omni-Ensemble Learning (OEL): Utilizing Over-Bagging, Static and Dynamic Ensemble Selection Approaches for Software Defect Prediction. Int. J. Artif. Intell. Tools 2018, 27, 1850024. [Google Scholar] [CrossRef]
  10. Bologna, G.; Hayashi, Y. A Comparison Study on Rule Extraction from Neural Network Ensembles, Boosted Shallow Trees, and SVMs. Appl. Comput. Intell. Soft Comput. 2018, 2018, 1–20. [Google Scholar] [CrossRef]
  11. Hajmohammadi, M.S.; Ibrahim, R.; Selamat, A.; Fujita, H. Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples. Inf. Sci. 2015, 317, 67–77. [Google Scholar] [CrossRef]
  12. Ahsan, M.N.I.; Nahian, T.; Kafi, A.A.; Hossain, M.I.; Shah, F.M. Review spam detection using active learning. In Proceedings of the IEEE 2016 7th IEEE Annual Information Technology, Electronics and Mobile Communication Conference, Vancouver, BC, Canada, 13–15 October 2016. [Google Scholar]
  13. Xu, J.; Fumera, G.; Roli, F.; Zhou, Z. Training spamassassin with active semi-supervised learning. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09), Mountain View, CA, USA, 16–17 July 2009. [Google Scholar]
  14. Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/citation_policy.html (accessed on 9 October 2019).
  15. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  16. Sourati, J.; Akcakaya, M.; Dy, J.; Leen, T.; Erdogmus, D. Classification Active Learning Based on Mutual Information. Entropy 2016, 18, 51. [Google Scholar] [CrossRef]
  17. Huang, S.J.; Jin, R.; Zhou, Z.H. Active Learning by Querying Informative and Representative Examples. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1936–1949. [Google Scholar] [CrossRef]
  18. Lewis, D.D.; Gale, W.A. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the ACM SIGIR Forum, Dublin, Ireland, 3–6 July 1994. [Google Scholar]
  19. Riccardi, G.; Hakkani-Tür, D. Active learning: Theory and applications to automatic speech recognition. IEEE Trans. Speech Audio Process. 2005, 13, 504–511. [Google Scholar] [CrossRef]
  20. Zhang, Z.; Schuller, B. Active Learning by Sparse Instance Tracking and Classifier Confidence in Acoustic Emotion Recognition. In Proceedings of the Interspeech 2012, Portland, OR, USA, 9–13 September 2012. [Google Scholar]
  21. Roma, G.; Janer, J.; Herrera, P. Active learning of custom sound taxonomies in unstructured audio data. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, Hong Kong, China, 5–8 June 2012. [Google Scholar]
  22. Chen, Y.; Wang, G.; Dong, S. Learning with progressive transductive support vector machine. Pattern Recognit. Lett. 2003, 24, 1845–1855. [Google Scholar] [CrossRef]
  23. Johnson, R.; Zhang, T. Graph-based semi-supervised learning and spectral kernel design. IEEE Trans. Inf. Theory 2008, 54, 275–288. [Google Scholar] [CrossRef]
  24. Anis, A.; El Gamal, A.; Avestimehr, A.S.; Ortega, A. A Sampling Theory Perspective of Graph-Based Semi-Supervised Learning. IEEE Trans. Inf. Theory 2019, 65, 2322–2342. [Google Scholar] [CrossRef]
  25. Culp, M.; Michailidis, G. Graph-based semisupervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 174–179. [Google Scholar] [CrossRef] [PubMed]
  26. Blum, A.; Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998. [Google Scholar]
  27. McCallum, A.K.; Nigam, K. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; pp. 359–367. [Google Scholar]
  28. Tur, G.; Hakkani-Tür, D.; Schapire, R.E. Combining active and semi-supervised learning for spoken language understanding. Speech Commun. 2005, 45, 171–186. [Google Scholar] [CrossRef]
  29. Tomanek, K.; Hahn, U. Semi-supervised active learning for sequence labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, Singapore, 2–7 August 2009. [Google Scholar]
  30. Han, W.; Coutinho, E.; Ruan, H.; Li, H.; Schuller, B.; Yu, X.; Zhu, X. Semi-supervised active learning for sound classification in hybrid learning environments. PLoS ONE 2016, 11, e0162075. [Google Scholar] [CrossRef] [PubMed]
  31. Chai, H.; Liang, Y.; Wang, S.; Shen, H.-W. A novel logistic regression model combining semi-supervised learning and active learning for disease classification. Sci. Rep. 2018, 8, 13009. [Google Scholar] [CrossRef] [PubMed]
  32. Su, H.; Yin, Z.; Huh, S.; Kanade, T.; Zhu, J. Interactive Cell Segmentation Based on Active and Semi-Supervised Learning. IEEE Trans. Med. Imaging 2016, 35, 762–777. [Google Scholar] [CrossRef] [PubMed]
  33. Rhee, P.K.; Erdenee, E.; Kyun, S.D.; Ahmed, M.U.; Jin, S. Active and semi-supervised learning for object detection with imperfect data. Cogn. Syst. Res. 2017, 45, 109–123. [Google Scholar] [CrossRef]
  34. Yang, Y.; Loog, M. Active learning using uncertainty information. In Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016. [Google Scholar]
  35. Fazakis, N.; Karlos, S.; Kotsiantis, S.; Sgarbas, K. Self-trained Rotation Forest for semi-supervised learning. J. Intell. Fuzzy Syst. 2017, 32, 711–722. [Google Scholar] [CrossRef]
  36. Yang, Y.; Loog, M. A benchmark and comparison of active learning for logistic regression. Pattern Recognit. 2018, 83, 401–415. [Google Scholar] [CrossRef]
  37. Stone, M. Cross-validation: A review. Ser. Stat. 1978, 9, 127–139. [Google Scholar] [CrossRef]
  38. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  39. Salzberg, S.L. C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach. Learn. 1994, 16, 235–240. [Google Scholar] [CrossRef]
  40. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-Based Learning Algorithms. Mach. Learn. 1991, 6, 37–66. [Google Scholar] [CrossRef]
  41. Le Cessie, S.; Houwelingen, J.C. Van Ridge Estimators in Logistic Regression. Appl. Stat. 1992, 41, 191–201. [Google Scholar] [CrossRef]
  42. Landwehr, N.; Hall, M.; Frank, E. Logistic model trees. Mach. Learn. 2005, 59, 161–205. [Google Scholar] [CrossRef]
  43. Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  44. Schapire, R.E. A Short Introduction to Boosting. J. Jpn. Soc. Artif. Intell. 1999, 14, 771–780. [Google Scholar]
  45. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  46. Opitz, D.; Maclin, R. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. 1999, 11, 169–198. [Google Scholar] [CrossRef]
  47. Rodriguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630. [Google Scholar] [CrossRef] [PubMed]
  48. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Elsevier: Amsterdam, The Netherlands, 2011. [Google Scholar]
  49. Chen, T.; Guestrin, C. XGBoost: Reliable Large-scale Tree Boosting System. Available online: http://learningsys.org/papers/LearningSys_2015_paper_32.pdf (accessed on 9 October 2019).
  50. Ferreira, A.J.; Figueiredo, M.A.T. Boosting algorithms: A review of methods, theory, and applications. In Ensemble Machine Learning: Methods and Applications; Springer: Boston, MA, USA, 2012. [Google Scholar]
  51. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 69–73. [Google Scholar] [CrossRef]
  52. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70. [Google Scholar]
  53. Fazakis, N.; Karlos, S.; Kotsiantis, S.; Sgarbas, K. A multi-scheme semi-supervised regression approach. Pattern Recognit. Lett. 2019, 125, 758–765. [Google Scholar] [CrossRef]
  54. Culotta, A.; McCallum, A. Reducing labeling effort for structured prediction tasks. In Proceedings of the National Conference on Artificial Intelligence, Pittsburgh, PA, USA, 9–13 July 2005. [Google Scholar]
  55. Scheffer, T.; Decomain, C.; Wrobel, S. Active hidden markov models for information extraction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]
  56. Wang, L.; Hu, X.; Yuan, B.; Lu, J. Active learning via query synthesis and nearest neighbour search. Neurocomputing 2015, 147, 426–434. [Google Scholar] [CrossRef]
  57. Huu, Q.N.; Viet, D.C.; Thuy, Q.D.T.; Quoc, T.N.; Van, C.P. Graph-based semisupervised and manifold learning for image retrieval with SVM-based relevant feedback. J. Intell. Fuzzy Syst. 2019, 37, 711–722. [Google Scholar] [CrossRef]
  58. Wang, W.; Lu, P. An efficient switching median filter based on local outlier factor. IEEE Signal Process. Lett. 2011, 18, 551–554. [Google Scholar] [CrossRef]
  59. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
  60. Tang, J.; Alelyani, S.; Liu, H. Feature selection for classification: A review. In Data Classification: Algorithms and Applications; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  61. Hulten, G.; Spencer, L.; Domingos, P. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and data Mining KDD ’01, San Francisco, CA, USA, 26–29 August 2001. [Google Scholar]
  62. Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Math. Program. 2011, 127, 3–30. [Google Scholar] [CrossRef]
  63. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  64. Amini, M.; Rezaeenour, J.; Hadavandi, E. A Neural Network Ensemble Classifier for Effective Intrusion Detection Using Fuzzy Clustering and Radial Basis Function Networks. Int. J. Artif. Intell. Tools 2016, 25, 1550033. [Google Scholar] [CrossRef]
  65. Elreedy, D.; Atiya, A.F.; Shaheen, S.I. A Novel Active Learning Regression Framework for Balancing the Exploration-Exploitation Trade-Off. Entropy 2019, 21, 651. [Google Scholar] [CrossRef]
  66. Fazakis, N.; Kostopoulos, G.; Karlos, S.; Kotsiantis, S.; Sgarbas, K. An Active Learning Ensemble Method for Regression Tasks. Intell. Data Anal. 2020, 24. [Google Scholar] [CrossRef]
  67. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software. ACM SIGKDD Explor. Newsl. 2009, 11, 10. [Google Scholar] [CrossRef]
