Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme

One of the major aspects affecting the performance of the classification algorithms is the amount of labeled data which is available during the training phase. It is widely accepted that the labeling procedure of vast amounts of data is both expensive and time-consuming since it requires the employment of human expertise. For a wide variety of scientific fields, unlabeled examples are easy to collect but hard to handle in a useful manner, thus improving the contained information for a subject dataset. In this context, a variety of learning methods have been studied in the literature aiming to efficiently utilize the vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by individually applying active learning or semi-supervised learning methods. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. The effective and robust metrics of the entropy and the distribution of probabilities of the unlabeled set, to select the most sufficient unlabeled examples for the augmentation of the initial labeled set, are used. The superiority of the proposed scheme is validated by comparing it against the base approaches of supervised, semi-supervised, and active learning in the wide range of fifty-five benchmark datasets.


Introduction
The most common approach established in machine learning (ML) is supervised learning (SL). Under the SL schemes, classifiers are trained using purely labeled data. In contrast with the problem complexity, the performance of such schemes is directly analogous to the amount and the quality of labeled data which are used at the training phase. In a large variety of scientific domains, such as object detection [1], speech recognition [2], web page categorization [3], and computer-aided medical diagnosis [4][5][6] vast pools of unlabeled data are often available. Though, in most cases labeling data can be costly and time-consuming, as human effort and expertise are required to annotate the available data. Many research works [7] exist focusing on techniques with the aim of exploiting the available unlabeled data especially in favor of classification problems. The most common learning methods incorporating such techniques are active learning (AL) and semi-supervised learning (SSL) [8]. Both AL and SSL share an iterative learning nature, making them a perfect fit for constructing more complex combination learning schemes.
The primary goal of this paper is to put forward a new AL and SSL combination algorithm in order to efficiently exploit the plethora of available unlabeled data found in most of the ML datasets and provide an improved classification framework. The general flow of AL and SSL frameworks is presented in Figure 1. Both methods utilize an initial pool of labeled and unlabeled examples with the goal of efficiently augmenting the available knowledge. AL and SSL frameworks, in most cases, operate under an iterative logic aiming to predict the label in the most appropriate unlabeled examples. While the former method annotates the unlabeled instances by interactively querying a human expert based on a variety of querying strategies, the latter attempts to automatically produce the labels of unlabeled examples by exploiting the previously learned knowledge and a wide range of unlabeled instances selection criteria. After the successful augmentation of the initial labeled set, a final model is constructed in both cases with a view to the application on the unknown test cases. Real-world case scenarios where AL and SSL combination methods can be applied include natural language processing (NLP) problems to which a lot of labeled examples are required to effectively train a model and also vast amounts of unlabeled data can be mined. Common applications on the NLP field are part of speech tagging, named entity recognition, sentiment analysis [11], fraud detection, and spam filtering. Especially, a number of AL [12], SSL, and combinations [13] of them have been proposed in the spam filtering domain. In Figure 2, an application on the Spambase [14] benchmark dataset briefly presents the accuracy improvement for the proposed scheme as the algorithm's iterations progress. With regard to the base algorithm learner, the support vector As both methods share a lot of key characteristics, a major effort is now needed to combine the two learning approaches. The main contribution of the proposed algorithm is the employment of a self-training scheme for the combination of AL and SSL utilizing the fast and effective metrics of the entropy and the distribution of the prediction probabilities of the available unlabeled data. The plethora of experiments carried out, also play a major role in the validation of the proposed algorithm. The proposed method is examined through a number of different individual base learners, where the ensemble learning technique is also explored as the aggregated models tend to produce more accurate predictions and are commonly used in today's applications [9,10].
Real-world case scenarios where AL and SSL combination methods can be applied include natural language processing (NLP) problems to which a lot of labeled examples are required to effectively train a model and also vast amounts of unlabeled data can be mined. Common applications on the NLP field are part of speech tagging, named entity recognition, sentiment analysis [11], fraud detection, and spam filtering. Especially, a number of AL [12], SSL, and combinations [13] of them have been proposed in the spam filtering domain. In Figure 2, an application on the Spambase [14] benchmark dataset briefly presents the accuracy improvement for the proposed scheme as the algorithm's iterations progress. With regard to the base algorithm learner, the support vector machines (SVMs) [15] classifier was embedded. For comparison, in the same figure, the corresponding SSL part of the algorithm was fed with the same amount of unlabeled data to obtain only the semi-supervised accuracy. Real-world case scenarios where AL and SSL combination methods can be applied include natural language processing (NLP) problems to which a lot of labeled examples are required to effectively train a model and also vast amounts of unlabeled data can be mined. Common applications on the NLP field are part of speech tagging, named entity recognition, sentiment analysis [11], fraud detection, and spam filtering. Especially, a number of AL [12], SSL, and combinations [13] of them have been proposed in the spam filtering domain. In Figure 2, an application on the Spambase [14] benchmark dataset briefly presents the accuracy improvement for the proposed scheme as the algorithm's iterations progress. With regard to the base algorithm learner, the support vector machines (SVMs) [15] classifier was embedded. For comparison, in the same figure, the corresponding SSL part of the algorithm was fed with the same amount of unlabeled data to obtain only the semi-supervised accuracy. The rest of this research work is organized as follows: In Section 2, the related work on similar classification methods is discussed. Following in Section 3, the proposed method is presented along with the exact algorithm implemented. An attempt to evaluate the efficacy of the combination scheme is made in Section 4, where extensive experimentation results can be found. Moreover, in this section, the average accuracies of the classifiers applied on the combination scheme are also briefly compared. In Section 5, a modification of the scheme is explored. The research conclusions are conferred in Section 6, where a number of areas to be explored as future work are mentioned. Finally, a software implementation of the wrapper algorithm is found in the Appendix A through the accompanying link.

Related Work
AL can be considered one of the most promising approaches for improving the performance of a prediction model in real-world scenarios where large amount of data exists, but their labeling is costly or infeasible [7]. AL assumes that human experts will be available to provide ground-truth labels for the unlabeled instances. Therefore, the philosophy of AL is to minimize the number of queries with the explicit goal to focus the labeling effort in the most profitable or informative instances, in other words, to minimize the training cost of the model [16]. Finally, these manually annotated samples are merged with the training dataset to get the highest classification accuracy. Several query strategies [7,17] have been proposed to measure the informativeness or the representativeness of the data. Informativeness-based strategies measure the contribution of an The rest of this research work is organized as follows: In Section 2, the related work on similar classification methods is discussed. Following in Section 3, the proposed method is presented along with the exact algorithm implemented. An attempt to evaluate the efficacy of the combination scheme is made in Section 4, where extensive experimentation results can be found. Moreover, in this section, the average accuracies of the classifiers applied on the combination scheme are also briefly compared. In Section 5, a modification of the scheme is explored. The research conclusions are conferred in Section 6, where a number of areas to be explored as future work are mentioned. Finally, a software implementation of the wrapper algorithm is found in the Appendix A through the accompanying link.

Related Work
AL can be considered one of the most promising approaches for improving the performance of a prediction model in real-world scenarios where large amount of data exists, but their labeling is costly or infeasible [7]. AL assumes that human experts will be available to provide ground-truth labels for the unlabeled instances. Therefore, the philosophy of AL is to minimize the number of queries with the explicit goal to focus the labeling effort in the most profitable or informative instances, in other words, to minimize the training cost of the model [16]. Finally, these manually annotated samples are merged with the training dataset to get the highest classification accuracy. Several query strategies [7,17] have been proposed to measure the informativeness or the representativeness of the data. Informativeness-based strategies measure the contribution of an unlabeled instance on the uncertainty reduction of a statistical model, while representativeness-based strategies measure the instance contribution on representing the underlying structure of input patterns. The most commonly used query strategies can be considered certainty-based sampling, query-by-committee, and expected error reduction. In the first type of strategy, a single model is trained and the human expert (annotator) is queried to label the least confidence unlabeled instances based on the pre-trained model. The queryby-committee strategy involves more than one active learner (classification models) to be trained for the classification task. The unlabeled instances about which these models disagree the most are selected for human annotation. The third strategy is a decision-theoretic approach aiming to estimate the potential of the model's generalization error reduction. In other words, a model is trained and used to estimate the expected future error of the unlabeled samples. Then, the instances with the minimal future error (risk) are selected and delivered for manual labeling. The effectiveness of AL and various query strategies has been shown in typical classification tasks, such as text classification [18], speech recognition [19], speech emotion classification [20], audio retrieval [21] to name a few.
In contrast to AL, SSL aims to automatically exploit unlabeled data in addition to labeled data to improve learning performance, without human intervention. In SSL, two basic assumptions about the data distribution are considered. The first assumes that data are inherently clustered, meaning that instances belonging to the same cluster have the same label. The other one assumes that data lie on a manifold, meaning that nearby samples have similar predictions. The idea behind both is that similar data points should have similar outputs and the unlabeled instances can expose similarities between these data points. Many different SSL methods have been designed in machine learning, including mainly transductive support vector machines [22], graph-based methods [23][24][25], co-training [26], self-training [1]. In the self-training scheme, the classification model is used to predict the labels of a portion of the unlabeled instances and, consequently, the most confident ones are added to the initial training dataset repeatedly until convergence. Rather than just relying on a unique model, in co-training [26] ensemble method is employed. For each model, separate feature sets (or views) of the same labeled data are used for training. Then, like self-training, the most confident predictions of each classifier on the unlabeled data are used to iteratively construct additional labeled training data. The co-training paradigm relies on three assumptions about the views, i.e., sufficiency, compatibility, and conditional independence [26]. On the other hand, graph-based methods treat all the samples (both labeled and unlabeled) as connected vertices (nodes) in a graph, aiming to connect these nodes, in other words, to weight these node-to-node pairwise edges by similarities between the corresponding sample pairs. Finally, minimum energy optimization is used to propagate labeling from the labeled to the unlabeled nodes.
Although AL has led to the reduction of the human labeling burden, without sacrificing the model's performance [7], it is still inefficient in some situations, e.g., the acquisition of a large amount of human annotations is impractical or not feasible at all. Thus, SSL comes in handy by minimizing the unlabeled data that will be fed to the human annotator. Specifically, human experts are required to label only those instances with the lowest certainty (as determined by the AL algorithm), while the remaining instances are automatically labeled by a machine annotator (by the SSL algorithm). Indeed, several studies have been proposed that combine AL and SSL under the same methodology. One of the first attempts [27] was in the text classification field, where expectation maximization was employed along with pool-based active learning. Later, Muslea et al. proposed the combination of co-testing and co-training showing improved classification accuracy in Web pages and pictures classification. During co-training, two classifiers are trained separately on two different views, and only the contention points, i.e., the unlabeled instances in which the classifier disagrees the most, were selected for human annotation. Finally, expectation maximization co-training (co-EM) was employed to automatically label instances that showed a low disagreement between the two classifiers. Other studies exploited certainty-based AL with self-training aiming to manual labeling with minimum human cost in spoken language understanding [28], natural language processing [29], sound classification [30], disease classification [31] and cell segmentation [32]. In another study [33], the authors addressed the problem of imbalanced training data in object detection. First, a simple object detection model was trained using a small portion of perfect samples instead of using the entire training dataset, while the imperfect samples were partitioned into several batches. Then, a batch-mode learning of AL and SSL combination was employed by integrating the uncertainty and diversity criteria from the concept of AL and the confidence criterion from that of SSL.

Proposed Method
The proposed method constitutes a combination of AL and SSL approaches, in order to leverage the advantages of both techniques. A mixed self-training method is employed utilizing the entropy of unlabeled instances, with the aim to identify the most confusing instances in the case of active round, while in the semi-supervised round the internal learner's distribution of probabilities for all possible labels per each instance is exploited as a sorting mechanism for the selection of the most confident examples.
Uncertainty-based metrics are widely deployed in the AL field as the literature suggests [34], mainly due to their strong performance in terms of calculation efficiency and effectiveness in the process of selecting the most confusing instances. On the other hand, in the SSL field research works exist [8,35] proving the effectiveness of probabilistic iterative schemes. As the nature of these types of metrics is similar, they can prove to be a robust combination for the construction of schemes such as the proposed. Moreover, it is also known [36] that the SSL self-training technique further helps to overpass the lack of exploration problems that occur during the AL entropy-based training process causing the algorithms to stuck at suboptimal solutions, continuously selecting instances which do not improve the current classifier.
The proposed algorithm can be characterized as a simple yet very effective wrapper algorithm that can utilize a wide range of learners, assuming that they can produce probability distributions for their predictions. A detailed presentation of the algorithm follows in the next paragraphs.
Let D denote the initial training set, consisting of a labeled set of examples L and an unlabeled set of examples U thus defining a labeled ratio R, as in the following equation: where the size(X) function returns the size of a set of instances. Initially, a base learner (CLS) is selected and trained on L. Afterwards, a self-training scheme is employed with the aim to augment the L using the available unlabeled examples of D. The number of unlabeled examples utilized in each iteration is conservatively selected taking in account the size of the initial labeled set, using also a control parameter T, setting the percentage of unlabeled examples related to the size of the initial labeled set. The number of maximum unlabeled instances selected in each iteration is calculated as follows: In each iteration (i), one of the two learning approaches is employed successively. The self-training loop terminates in a maximum number of iterations MaxIter or in the case of exhaustion of the pool of unlabeled examples.
Starting with the SSL round, the CLS is applied on the current unlabeled set U i and a matrix of predictions M pr is constructed along with the prediction probability for each unlabeled instance, resulting in a size(U i ) x (l + 2) dimensions matrix, where l + 2 is the number of features, including the predicted labels and the corresponding prediction probabilities. The SSL round uses machine labeling in order to balance the expensive human effort and examination process required to label the data. The M pr is sorted descendingly utilizing the prediction probabilities while the rest of the maxUnlabPerIter elements are discarded. The maxUnlabPerIter instances along with their predicted labels are stored in M final .
Following the method flow, an AL round is deployed in every other iteration. In this round, the algorithm attempts to construct a matrix containing the entropy estimation of each unlabeled instance EntrU i . The base learner is applied on U i and the distribution of probabilities are exported in matrix DistU i of dimensions size(U i ) x num_classes(D), where the num_classes(X) function returns the number of classes of a dataset. Having produced DistU i , the calculation of entropy estimation matrix is performed using the next formula, to compute each one of its elements (j): where p k denotes the probability of k class for instance j, already contained in DistU i . Subsequently, EntrU i is sorted in descending order, as the most confusing examples, with entropy values near one, should be placed on the top of the matrix. The top maxUnlabPerIter instances are kept in EntrU i with the rest of them being discarded. Human expertise is utilized to label the maxUnlabPerIter instances and a matrix containing the human-labeled instances M final is constructed with the size of maxUnlabPerIter x (l + 1), where l + 1 is the number of features, including the class.
During each iteration, the M final instances are added to the current labeled set L i and removed from the current unlabeled set U i . The CLS is re-trained at the start of each self-training iteration in order to be utilized again. When the termination criteria are met, the algorithm exits the self-training loop having constructed the augmented labeled set L augmented (≡L last iteration ). As a final step, the CLS is trained on the augmented labeled set in order to be applied on the unknown test cases. The exact implementation of the combination scheme is presented in Algorithm 1. Classify(U i ) using CLS and construct matrix M pr containing corresponding prediction probabilities along with the predicted labels 13: SORT M pr descending according to the prediction probabilities 14: STORE the top maxUnlabPerIter instances of M pr in a matrix M final 15: /* now containing the most confident instances along with their predictions */

Experimentation and Results
In order to examine the efficacy of the proposed scheme, an exhaustive experimentation procedure was followed. At first, fifty-five (55) benchmark datasets were extracted from the UCI repository [14], related to a wide range of classification problems. To further enhance the variance and complexity of the classification process, all datasets were partitioned and examined according to the resampling procedure of k-fold cross-validation [37]. Following the method's steps, each subject dataset is shuffled and then divided into k unique data groups. By holding out one of the groups as a test set and utilizing the rest as a train set, k new datasets are generated. The k parameter was set equal to ten, as it is commonly selected by the majority of the literature.
The main aim of the experimentation process was to prove the superiority of the combination scheme against the competing methods of the supervised, semi-supervised and active learning using always the same amounts of labeled and unlabeled data under the same base learner model. In more detail, the supervised method is trained only on the initial labeled set while the semi-supervised rival method utilizes also the initial unlabeled set in the same manner that is also exploited in the proposed combination scheme. Moreover, as baseline AL opponent the random sampling [7] process is implemented in a similar way with the rest of the combination self-training procedure, also utilizing the initial unlabeled set.
For this purpose, all training subsets were further divided into two sets, an initial labeled set and an initial unlabeled set, using four different labeled ratios R. As the initial datasets contained a hundred percent of the instance labels, in order to simulate the human expert labeling process, all the original labels for the constructed unlabeled sets were stored separately in order to be retrieved whenever the algorithm required to query the human expert. Thus, each original dataset was augmented into forty derived datasets. In detail, the R values were set to 10%, 20%, 30%, and 40%. As regards the proposed algorithm's parameters, the control parameter T was set equal to 10%, while the MaxIter parameter was empirically selected equal to 10 in order to impose a maximum of 40%, in relation to the original dataset size, limit (can be calculated using Equation (2) multiplied by the MaxIter parameter of unlabeled instances for selection and augmentation of the initial labeled set in the case of R = 40%.
As a comparison measure, the average classification accuracy over each R was used. In order to draw general conclusions for the efficacy of the combination scheme, a wide range of classification models and meta-techniques were employed, incorporated in each one of the four learning methods. A brief description for each one of the base learners is presented: • BagDT: In this model, the bootstrap aggregating (bagging) [38] meta-algorithm was applied along with the use of the C4.5 decision trees [39] classifier. The bagging technique is often adopted to reduce the variance and overfitting of a base learner and enhance its accuracy stability. The basic idea behind this technique is the generation of multiple training sets by uniformly sampling the original dataset.
• 5NN: The k-nearest neighbors [40] classifier belongs to the family of lazy learning algorithms. By examining the k closest instances in a defined feature space, it classifies a given test instance by plurality voting on the labels of the k instances.
• Logistic: The logistic regression, also commonly referenced as the logit model, is a statistical model that utilizes the logistic function in order to model binary dependent variables, thus fitting very well with categorical targets. In problems where the target variable has more than two values, multinomial logistic regression is applied [41].
• LMT: The logistic model tree [42] classification model combines logistic regression with decision trees. The main idea behind the classifier is the use of linear regression models as leaves of a classification tree.
• LogitBoost: This classifier is a boosting model proposed by Friedman et al. [43]. It is based on the idea that the adaptive boosting [44] method can be thought as a generalized additive model, thus the cost function of logistic regression can be applied.
• RF: One of the most robust ML learners is the random forests [45] model, which is capable of tackling regression and classification problems. Its operation is based on the construction of multiple decision trees using random subsamples of the original feature space. The aggregation of the results is achieved via majority voting. Due to its inner architecture, it is known to efficiently handle the overfitting phenomena.
• RotF: The rotation forest model constitutes an ensemble [46] classifier proposed by Rodriguez and Kuncheva [47]. Following the flow of this algorithm, the initial feature space is divided in random subspaces. The default feature extraction algorithm applied to create the subspaces is the principal component analysis (PCA) [48], aiming to increase the diversity amongst the base learners.
• XGBoost: The extreme gradient boosted trees [49] algorithm, is a powerful implementation of gradient boosted decision trees. Under this boosting [50] scheme, a number of trees are built sequentially with each time the goal to reduce errors produced from the previous tree, thus each tree is fitted on the gradient loss of the previous step. The final decision is produced from the weighted voting of the trees. The XGBoost algorithm is a very scalable algorithm that has shown to perform very well on large datasets or sparse datasets utilizing parallel and distributed execution methods.
• Voting (RF, RotF, XGBoost): As a last effort to further explore the potential of more complex classification models in the combination scheme, the construction of an ensemble classifier by majority voting the results of three of the most robust models: RF, RotF, and XGBoost was put forward. As regards the extraction of probabilities, the average of the exported probabilities for the three classifiers was considered as the best option.
The experimental results in terms of classification accuracy for each base learner are organized in Tables 1-5 and supplementary material Tables S1-S4, categorized according to the four label ratios (10%, 20%, 30%, 40%) for each learning method. The bold values in the tables indicate the highest accuracy for the corresponding dataset and the subject labeled ratio.
The superiority of the proposed combination scheme regarding the classification accuracy is prominent. The following important observations are derived from the accuracy tables: • The proposed combination method outperforms all other four learning methods in all four labeled ratios and for all the nine base learners used as control methods, in terms of average accuracy. This argument is also validated in Figure 3, where the comparisons are visually assembled and a progressive picture of the performance of the two dominant methods is presented as the R increases. The SL method was also included as a baseline performance metric.   Following the accuracy examination, the Friedman aligned ranks test [51] was conducted. In Tables 6-14, the results of the statistical tests for each one of the nine base learners divided into the four labeled ratios used, are presented. These lead to the following assumptions: • The non-parametric tests assess the null hypothesis that the means of the results of two or more of the compared methods are the same by calculating the related p-value. This hypothesis can be rejected for all the nine algorithms and for all labeled ratios as all calculated p-values are significantly lower than the significance level of a = 0.10.

•
Moreover, the Friedman rankings confirm that for all nine base learners and regardless of the labeled ratio, the proposed combination scheme ranks first ahead of all other learning methods in coincidence with the accuracy experimental results.
Since the Friedman test null hypothesis was rejected, the Holm's [52] post-hoc statistical test was also applied with an alpha value of 0.10. The aim of the Holm's test is to detect the specific differences between the combination scheme and the other learning methods, thus the null hypothesis under evaluation is that the mean of the results of the proposed method and against each other group is equal (compared in pairs). The post-hoc results are also presented in the corresponding ranking test tables for each one of the base learners. By observing the adjusted p-values of the Holm's tests, it is concluded that: • The proposed combination method performs significantly better on 105 of the total 108 compared method variations for the nine base learners over the four labeled ratios.

•
The AL methods for the Logistic, the LMT and the LogitBoost classifiers accept the mean significant difference test for one label ratio each, 30%, 20%, 40% accordingly. However, the adjusted p-values show small differences over the alpha of 0.10.
Summarizing the test results, both Friedman Aligned Ranks tests and Holm's one vs all comparison tests verify the superior performance of the proposed method over a wide range of scenarios and algorithm comparisons.
To better observe the individual results regarding the combination schemes and the role of the base learners incorporated, the average accuracies were plotted in Figure 4. The outcome was as expected the following: The ensemble voting (RF, RotF, XGBoost) classifier outperforms the rest models in all labeled ratios. As the first indication of such an outcome, the improved prediction probabilities derived from the averaging of the three classifier probabilities, on which the combination scheme relies, it would be a promising starting point for seeking a robust proof to strictly explain the performance boost. Thus, on the one hand, the most confusing unlabeled instances, through the entropy calculation, and on the other hand, the most confident unlabeled instances, through the distribution of prediction probabilities, are detected using the distribution of prediction probabilities. Such behaviors seem to also emerge in other relevant ensemble wrapper algorithms [53]. • Moreover, the Friedman rankings confirm that for all nine base learners and regardless of the labeled ratio, the proposed combination scheme ranks first ahead of all other learning methods in coincidence with the accuracy experimental results. Since the Friedman test null hypothesis was rejected, the Holm's [52] post-hoc statistical test was also applied with an alpha value of 0.10. The aim of the Holm's test is to detect the specific differences between the combination scheme and the other learning methods, thus the null hypothesis under evaluation is that the mean of the results of the proposed method and against each other group is equal (compared in pairs). The post-hoc results are also presented in the corresponding ranking test tables for each one of the base learners. By observing the adjusted p-values of the Holm's tests, it is concluded that: • The proposed combination method performs significantly better on 105 of the total 108 compared method variations for the nine base learners over the four labeled ratios. • The AL methods for the Logistic, the LMT and the LogitBoost classifiers accept the mean significant difference test for one label ratio each, 30%, 20%, 40% accordingly. However, the adjusted p-values show small differences over the alpha of 0.10. Summarizing the test results, both Friedman Aligned Ranks tests and Holm's one vs all comparison tests verify the superior performance of the proposed method over a wide range of scenarios and algorithm comparisons.
To better observe the individual results regarding the combination schemes and the role of the base learners incorporated, the average accuracies were plotted in Figure 4. The outcome was as expected the following: The ensemble voting (RF, RotF, XGBoost) classifier outperforms the rest models in all labeled ratios. As the first indication of such an outcome, the improved prediction probabilities derived from the averaging of the three classifier probabilities, on which the combination scheme relies, it would be a promising starting point for seeking a robust proof to strictly explain the performance boost. Thus, on the one hand, the most confusing unlabeled instances, through the entropy calculation, and on the other hand, the most confident unlabeled instances, through the distribution of prediction probabilities, are detected using the distribution of prediction probabilities. Such behaviors seem to also emerge in other relevant ensemble wrapper algorithms [53].

Modification
Pointing towards the improvement of the proposed method, it is obvious by the statistical analysis and ranking results that a slight increase in the performance of the SSL part could have a significant impact on the overall efficiency of the combination scheme.
In this direction, careful observation of the execution of the proposed algorithm revealed the weakness of the SSL prediction probabilities, which, in many cases, leads to the selection of the wrong instances to be labeled. In order to augment the probabilistic information available for the proposed method, as regards the SSL part, a lazy classifier (kNN) was integrated into the instance selection process. Such a development, on the one hand, augments the proposed method with a second view of the labels for the unlabeled set, and on the other hand, does not significantly increase the computational overhead as this family of classifiers does not need training. As a second measure to strengthen the SSL instance selection criteria, the empirical approach of setting a lower limit on the minimum accepted probability for an unlabeled instance was adopted using the formula: whereby utilizing the num_classes(X) function the dependence on the dataset characteristics is lifted. A more compact representation of the SSL part modifications is given in Algorithm 2, while the abstract flow chart of the improved combination framework is presented in Figure 5. The improved combination scheme was further tested against the most robust AL frameworks found in the literature. In detail, the query strategies of least confidence (LC), margin sampling (MS) and entropy sampling (ES) were considered to be compared with the modified proposed scheme. The major aspects concerning these strategies [7] follow below.
• LC: The objective of this strategy is to identify the least confident unlabeled instances by examining the probability of the most probable label for each unlabeled instance. The strategy continues by selecting the instances having the lowest probable labels and presents them to the human expert to be labeled in order to augment the initial labeled set.
• MS: As an improvement of the LC strategy, MS attempts to overcome the disadvantageous selection process of only considering the most probable labels by calculating the differences of the most probable and the second most probable label for an unlabeled instance. Afterwards, those calculated differences are sorted and the instances with the lowest differences are selected to be labialized.
• ES: This strategy, part of which is also integrated into the AL counterpart of the proposed scheme, computes the entropy measure (similar to Equation (3)) for each unlabeled instance using the distribution of prediction probabilities. The most entropic instances are then selected to be displayed to the human expert in order to enlarge the original labeled set. In Figure 6, ten experiments display the performance comparison in terms of classification accuracy regarding the three AL methods against the modified combination scheme. The experiments are categorized by the five base learner models that were integrated into the methods. In each experiment, a different benchmark dataset was deployed using four different Rs equal to 10%, 20%, 30%, and 40% accordingly. In Figure 6, ten experiments display the performance comparison in terms of classification accuracy regarding the three AL methods against the modified combination scheme. The experiments are categorized by the five base learner models that were integrated into the methods. In each experiment, a different benchmark dataset was deployed using four different Rs equal to 10%, 20%, 30%, and 40% accordingly. The experimental results confirm the efficiency of the modified combination scheme against the AL methods. It can be extracted from the figure that the proposed technique in all ten cases performs The experimental results confirm the efficiency of the modified combination scheme against the AL methods. It can be extracted from the figure that the proposed technique in all ten cases performs equally or better from its rivals' accuracies. Moreover, the figure suggests that the three AL methods produce closely related accuracy results, as in four of the ten test cases, their performance was almost identical. The previous outcome can be explained by exploring the metrics utilized in these strategies, which are all derived from the prediction probabilities of the base learners.
Closing this section, in the conducted experiments on real-world benchmark datasets, the proposed combination scheme was compared with the SL, the SSL, and the AL methods. The experiments show that the proposed method outperforms the compared methods. Therefore, in the future, it is very important to conduct more insightful theoretical analyses on the effectiveness of the proposed approach and explore other appropriate selection criteria for filtering the informative unlabeled instances, in order to generalize the results with more confidence.

Conclusions
In this research work, a new wrapper algorithm was proposed combining the AL and SSL methods with the aim of efficiently utilizing the available unlabeled data. A plethora of experiments was conducted for evaluating the efficacy of the proposed algorithm in a wide range of benchmark datasets against other learning methods using a variety of classifiers as base models. In addition, four different labeled ratios were investigated. The proposed algorithm prevails over the other learning methods as statistically confirmed by the Friedman aligned ranks non-parametric tests and the Holm's post-hoc tests. To further promote the use of the proposed algorithm, a software package was developed while more details about this package can be obtained from the link found in the Appendix A.
Regarding the performance boost that was experimentally observed while applying the proposed combination scheme on the numerous datasets, there is strong evidence that the vigorous AL method can efficiently improve its performance utilizing SSL schemes such as the self-training technique. Even in cases were the individual SSL method was not performing dexterously; when integrated in the AL and SSL proposed wrapper the performance of the overall scheme was significantly improved compared to the plain AL method. Moreover, in the case that the majority of the instances used in a learning scheme are automatically labeled, the performance may be unsatisfactory, and in some cases, it may even be worse than the SL baseline accuracy. For this reason, a fundamental requirement arises; that of defining a sufficient threshold of human expert intervention on the labeling process to successfully combine AL and SSL methods. Such a fine-tuning process is criticized as highly application-specific and challenging to automate. Furthermore, it can be noticed by the results, that on datasets with very small initial labeled sets, the proposed scheme can be beneficiary as the initially learned decision boundaries of such datasets can be possibly inaccurate, thus unlabeled instances near these boundaries could be falsely classified. This is an implication that the AL part of the proposed scheme could efficiently tackle.
For future work, a number of areas have been identified and are worth exploring as they seem promising in the direction of improving the classification abilities of the proposed algorithm. As a major first research area, that is expected to have a high impact on the combination scheme's performance in terms of accuracy and execution time would be the investigation of different instance selection strategies than those that are currently employed. In the AL part of the proposed algorithm, two common alternatives are the least confidence [54] and the margin sampling [55] algorithms, which utilize the unlabeled data under a different scope. Moreover, more complex query scenarios than the plain pool-based sampling used, like query synthesis [56] could also be beneficial. As regards the semi-supervised part, simple techniques like the integration of weights annotating the instances assessed as informative by the SSL part of the algorithm could further improve the overall accuracy of the combination scheme as suggested in [35,57].
Another interesting research area would be that of the extreme outlier detection algorithms. The incorporation of such algorithms in the proposed algorithm would have an immediate impact on the quality of the selected candidate unlabeled instances that are used to augment the labeled set in each self-training iteration, thus resulting in more robust inner models. A few of the very well-known techniques that could be directly implemented in the combination scheme are the local outlier factor [58] for detecting anomalous values based on neighboring data or the isolation forest [59], which is a tree-based outlier detector.
Other research areas that would bear further improvement to the proposed algorithm include preprocessing algorithms, for instance, PCA for dimensionality reduction and production of more informative features or other feature selection techniques such as univariate feature selection [60]. Speaking of the integrated base learners, the introduction of online learners like the Hoeffding adaptive tree [61] and Pegasos [62] or deep learning architectures based on deep neural networks [63] and deep ensembles [64] could make the proposed algorithm sufficient for tackling streaming and big data problems.
Finally, by combining schemes from the fields of active regression learning [65,66] and semi-supervised regression [53] along with the proposed classification algorithm, a general combination scheme could be put forward that would be able to handle numeric and categorical targets.
Author Contributions: All authors have contributed equally to the final manuscript.
Funding: This research is implemented through the Operational Program Human Resources Development, Education and Lifelong Learning and is co-financed by the European Union (European Social Fund) and Greek national funds.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A
The proposed combination algorithm was implemented as a separate Java package for the WEKA [67] software tool. The decision to develop the combination scheme as a part of the WEKA tool was made since it is one of the most well-known tools used in the machine-learning community, which includes a big number of base learner models. Moreover, it can be easily deployed without requiring programming experience for the end-user. The package can be downloaded using the following link: http://ml.upatras.gr/combine-classification/.