An Auto-Adjustable Semi-Supervised Self-Training Algorithm

Semi-supervised learning algorithms have become a topic of significant research as an alternative to traditional classification methods which exhibit remarkable performance over labeled data but lack the ability to be applied on large amounts of unlabeled data. In this work, we propose a new semi-supervised learning algorithm that dynamically selects the most promising learner for a classification problem from a pool of classifiers based on a self-training philosophy. Our experimental results illustrate that the proposed algorithm outperforms its component semi-supervised learning algorithms in terms of accuracy, leading to more efficient, stable and robust predictive models.


Introduction
In machine learning and data mining, the construction of a classifier can be considered one of the most significant and challenging tasks [1]. Traditional classification algorithms belong to the class of supervised algorithms, which use only labeled data to train the classifier. However, in many real-world classification problems, labeled instances are often difficult, expensive, or time-consuming to obtain, since they require the efforts of empirical research. In contrast, unlabeled data are fairly easy to obtain and require less effort from experienced human annotators.
Semi-supervised learning (SSL) algorithms constitute an appropriate and effective machine learning methodology for extracting knowledge from both labeled and unlabeled data so as to build efficient classifiers [2]. More specifically, they combine the explicit classification information of labeled data with the information hidden in the unlabeled data. The general assumption of this class of algorithms is that data points in a high-density region are likely to belong to the same class and that the decision boundary lies in low-density regions [3]. Hence, these methods have the advantage of reducing the effort of supervision to a minimum, while still preserving competitive recognition performance. Nowadays, these algorithms attract great interest both in theory and in practice and have become a topic of significant research as an alternative to traditional methods of machine learning, since they require less human effort and frequently present higher accuracy [4–10]. The main issue of semi-supervised learning is how to efficiently exploit the hidden information in the unlabeled data. In the literature, several approaches have been proposed, each with a different philosophy regarding the link between the distribution of labeled and unlabeled data [2,11–14].
Self-training constitutes perhaps the most popular and frequently used SSL algorithm due to its simplicity and classification accuracy [4,5,9]. This algorithm wraps around a base learner and uses its own predictions to assign labels to unlabeled data. More specifically, in the self-training process, a classifier is trained with a small number of labeled examples and iteratively enlarges its training set using newly labeled data with its own most confident predictions. However, this methodology can lead to erroneous predictions when noisy examples are classified as the most confident ones and subsequently incorporated into the labeled training set. Li and Zhou [15] tried to address this difficulty and presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Along this line, Tanha et al. [16] studied the classification behaviour of self-training and, based on their numerical experiments, stated that the most important aspect of a successful self-training procedure is the correct estimation of the confidence of the predictions.
Therefore, the success of the self-training algorithm depends on the newly labeled data [2] but, most significantly, on the selection of the base learner. Nevertheless, the selection of the base learner remains an open issue, since deciding which particular learning algorithm to choose for a specific problem is still complicated and challenging. Given a pattern recognition problem, the traditional approach is to evaluate a set of different learners against a representative validation set and select the best one. It is generally recognized that the key to pattern recognition problems does not wholly lie in any particular solution, since no single model exists for all problems [17].
In this work, we propose a new semi-supervised learning algorithm which is based on a self-training philosophy. The proposed algorithm initially uses several independent base learners and, during the training process, dynamically selects the most promising base learner according to a strategy based on the number of the most confident predictions of unlabeled data. Our numerical experiments on several benchmark datasets confirm the efficacy of the proposed methodology. Additionally, we performed several statistical tests in order to illustrate the efficiency of our proposed algorithm.
The remainder of this paper is organized as follows: Section 2 defines the semi-supervised classification problem and the self-training approach. Section 3 presents a detailed description of the proposed algorithm and Section 4 presents the numerical experiments and discusses the obtained results. Finally, Section 5 discusses the conclusions and some further research topics for future work.

A Review of Semi-Supervised Classification Via Self-Labeled Approach
This section provides a definition for the semi-supervised classification problem and a short description of the most popular and frequently used semi-supervised self-labeled algorithms.

Semi-Supervised Classification
In the sequel, we present the definitions and the necessary notations for the semi-supervised classification problem. Let $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, y)$ be an example, where $x_p$ belongs to a class $y$ and a $D$-dimensional space in which $x_{pi}$ is the $i$-th attribute of the $p$-th sample. Suppose $L$ is a labeled set of $N_l$ instances $x_p$ with $y$ known and $U$ is an unlabeled set of $N_u$ instances $x_q$ with $y$ unknown, where $N_l < N_u$. Notice that the set $L \cup U$ constitutes the training set. Moreover, there is a test set $T$ composed of $N_t$ unseen instances $x_t$ which have not been used in the training stage. The aim of semi-supervised classification is to obtain an accurate and robust learned hypothesis using the training set $L \cup U$ and subsequently to evaluate its performance using the test set $T$.
In the literature, a variety of self-labeled methods has been proposed, each following a different methodology for exploiting the information hidden in the unlabeled data. Next, we present a brief description of the most popular and frequently used semi-supervised self-labeled methods.

Semi-Supervised Self-Labeled Methods
Self-training is a wrapper-based semi-supervised approach which constitutes an iterative procedure of self-labeling unlabeled data and is generally considered to be a simple and effective SSL algorithm. According to Ng and Cardie [18], "self-training is a single-view weakly supervised algorithm" which relies on its own predictions on unlabeled data to teach itself. In the self-training framework, an arbitrary classifier is initially trained with a small amount of labeled data, which constitutes its training set, aiming to classify unlabeled points. Subsequently, it iteratively enlarges its labeled training set with its own most confident predictions and is retrained. More specifically, at each iteration, the classifier's training set is gradually augmented with classified unlabeled instances that have achieved a probability value over a defined threshold c; these instances are considered sufficiently reliable to be added to the training set. Notice that the way in which the confidence of predictions is measured depends on the type of base learner used (see [19]).
Clearly, this model does not make any specific assumptions about the input data, but rather accepts that its own predictions tend to be correct. Therefore, since the success of the self-training algorithm depends heavily on the data newly labeled by its own predictions, its weakness is that erroneous initial predictions will probably lead the classifier to generate incorrectly labeled data [2].
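The self-training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' WEKA implementation: `KnnLearner`, `predict_with_confidence` and `self_train` are hypothetical names, the base learner is a toy k-NN whose confidence is the share of neighbours voting for the predicted label, and the default threshold c = 0.9 is chosen only for the example.

```python
from collections import Counter
import math


class KnnLearner:
    """Toy k-NN base learner (hypothetical); confidence = share of neighbours voting for the label."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_with_confidence(self, x):
        order = sorted(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))
        nearest = order[: self.k]
        label, votes = Counter(self.y[i] for i in nearest).most_common(1)[0]
        return label, votes / len(nearest)


def self_train(learner, L_X, L_y, U, c=0.9, max_iter=40):
    """Iteratively move predictions with confidence over c from U into the labeled set."""
    L_X, L_y, U = list(L_X), list(L_y), list(U)
    for _ in range(max_iter):
        learner.fit(L_X, L_y)
        scored = [(x, *learner.predict_with_confidence(x)) for x in U]
        confident = [(x, label) for x, label, conf in scored if conf > c]
        if not confident:  # nothing reliable left to add (or U is empty)
            break
        L_X += [x for x, _ in confident]
        L_y += [label for _, label in confident]
        U = [x for x, _, conf in scored if conf <= c]
    learner.fit(L_X, L_y)  # final retraining on the enlarged labeled set
    return learner, L_X, L_y, U


labeled_X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labeled_y = ["a", "a", "a", "b", "b", "b"]
unlabeled = [(0.5, 0.5), (10.5, 10.5), (1, 1), (9, 9), (5, 5.2)]
model, X_final, y_final, remaining = self_train(KnnLearner(3), labeled_X, labeled_y, unlabeled)
```

In this toy run the four points close to a cluster are absorbed in the first iteration, while the point between the clusters never exceeds the threshold and stays unlabeled. If an early confident prediction were wrong, it would be retained in the labeled set in exactly the same way, which is the weakness noted above.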
Li and Zhou [15] tried to address this difficulty and, as a result, presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Their principal improvement over the classical self-training scheme is a restriction on the acceptance or rejection of the unlabeled examples which the algorithm evaluates as trustworthy. More specifically, a neighboring graph is built in the D-dimensional feature space and all the unlabeled examples that are candidates for being appended to the initial training set are filtered through a hypothesis test. Thus, only the examples that pass this test are finally added to the training set before the end of each iteration.
Co-training is a semi-supervised algorithm which can be regarded as a different variant of the self-training technique [12]. It is based on the strong assumption that the feature space can be divided into two conditionally independent views, with each view being sufficient to train an efficient classifier. In this framework, two learning algorithms are trained separately, one for each view, using the initial labeled dataset, and the most confident predictions of each algorithm on unlabeled data are used to augment the training set of the other through an iterative learning process. Following the same concept, Nigam and Ghani [14] performed an experimental analysis in which they concluded that Co-training outperforms other SSL algorithms when two distinct and independent views exist naturally. Nevertheless, the assumption about the existence of sufficient and redundant views is a luxury hardly met in most real-case scenarios.
Zhou and Goldman [20] have also adopted the idea of ensemble learning and majority voting in the semi-supervised framework. Along this line, Li and Zhou [21] proposed another algorithm, named Co-Forest, in which several Random Trees are trained on bootstrap data from the dataset. The main idea of this algorithm is the assignment of a few unlabeled examples to each Random Tree during the training process. Eventually, the final decision is made by simple majority voting. Notice that the use of Random Tree classifiers on random samples of the collected labeled data is the main reason why the behavior of Co-Forest is efficient and robust even when the number of available labeled examples is reduced.
A rather representative approach which is based on the ensemble philosophy is the Tri-training algorithm. This algorithm constitutes an improved single-view extension of the Co-training algorithm, exploiting unlabeled data without relying on the existence of two views of instances [22]. The Tri-training algorithm can be considered as a bagging ensemble of three classifiers which are trained on data subsets generated through bootstrap sampling from the original labeled training set [23]. Subsequently, in each Tri-training round, if two classifiers agree on the labeling of an unlabeled instance while the third one disagrees, then these two classifiers will label this instance for the third classifier. It is worth noticing that the "majority teach minority" strategy serves as an implicit confidence measurement which avoids the use of complicated, time-consuming approaches for explicitly measuring the predictive confidence, and hence the training process is efficient [4].
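The "majority teach minority" rule described above can be illustrated for a single unlabeled instance. The sketch below is an assumption-laden simplification: `teach_targets` is a hypothetical helper that only decides which classifier, if any, receives a label from the other two, and it omits the bootstrap sampling and the error-rate conditions of the full Tri-training procedure.

```python
def teach_targets(labels):
    """Given the three classifiers' labels for one instance, return
    {index of the dissenting classifier: label proposed by the other two}."""
    if len(labels) != 3:
        raise ValueError("Tri-training uses exactly three classifiers")
    targets = {}
    for i in range(3):
        a, b = (labels[j] for j in range(3) if j != i)
        # The two peers agree with each other while classifier i disagrees.
        if a == b and labels[i] != a:
            targets[i] = a
    return targets


example = teach_targets(["x", "x", "y"])  # classifier 2 is taught label "x"
```

No explicit confidence estimate is computed here: agreement between two classifiers plays that role, which is why the strategy is cheap.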
Kostopoulos et al. [24] and Livieris et al. [25,26], motivated by the previous works, studied the fusion of ensemble as well as semi-supervised learning. More specifically, they presented self-labeled methods by adopting majority voting in the semi-supervised framework.

Auto-Adjustable Self-Training Semi-Supervised Algorithm
In this section, we present the proposed SSL algorithm which is based on the self-training framework. We recall that the two main difficulties in self-training are the decision of which base learner to choose for a specific problem and how to find a set of high-confidence predictions of unlabeled data. Therefore, in order to address these difficulties, we start with an initial pool of classifiers and, during the training process, dynamically select the most promising classifier, relative to the most confident predictions. A high-level description of the proposed semi-supervised algorithm, entitled Auto-Adjustable Self-Training (AAST), is presented in Algorithm 1, which consists of two phases: in the first phase, the most promising classifier is selected from a pool of classifiers based on the number of confident predictions of unlabeled data, whereas in the second phase, the most promising classifier is trained within the self-training framework.
Suppose that $C = (C_1, C_2, \ldots, C_N)$ constitutes a set of N classifiers which can be used as base learners in the self-training framework. Initially, all base learners $C_i \in C$ are trained using the same small amount of labeled data L and then applied to the same unlabeled data U. Subsequently, the labeled set $L_i$ of each classifier $C_i$ is iteratively augmented using its own most confident predictions. More specifically, each classified unlabeled instance that has achieved a probability value over a defined threshold c is considered sufficiently reliable to be added to the classifier's labeled set $L_i$ for the following training phases. It is worth mentioning that the way in which the confidence of predictions is measured depends on the type of base learner used (see [19,27,28] and the references therein). Finally, each classifier is re-trained using its own new enlarged training set.
Algorithm 1 (surviving fragment of the final steps): ... MCP from U and add to L. 23: until some stopping criterion is met or U is empty.
In order to select a base learner from set C, the proposed algorithm is grounded on the following simple idea: the most promising base learner is probably the base learner with the most confident predictions. In other words, the base learner that is able to confidently label as many unlabeled instances as possible, and thus to exploit them, is the most promising classifier.
Every k iterations (which we call a cycle), AAST evaluates the base learners in set C and selects the classifier $C_m$ with the minimum number of most confident predictions as well as the classifier $C_M$ with the maximum number of most confident predictions. Subsequently, the classifier $C_m$ is removed from the set C and the classifier $C_M$ provides its labeled set $L_M$ and its unlabeled set $U_M$ to all the remaining classifiers for the next cycle. In other words, in every cycle (i.e., every k iterations) the algorithm removes the least promising classifier from the set C, in order to reduce the computational cost, and restarts the self-training process using the labeled and unlabeled sets of the most promising classifier $C_M$, relative to the number of most confident predictions of each classifier.
Notice that it follows immediately from the above discussion that after $N_C - 1$ cycles (i.e., $k \cdot (N_C - 1)$ iterations), where $N_C$ is the initial number of base learners, only one classifier, denoted as $C_P$, remains in set C. This classifier constitutes the most promising classifier, relative to the proposed selection strategy. Subsequently, the only remaining classifier $C_P$ continues its training within the semi-supervised framework.
An obvious advantage of the proposed technique is that it exploits the diversity of the errors of the learned models by using different learning algorithms, and the classifier with the most confident predictions is dynamically selected as the most promising one. Nevertheless, the efficacy and computational cost of the proposed algorithm depend on the value of parameter k. As the value of parameter k increases, the base learners exploit the hidden information in the unlabeled data for more iterations before being evaluated; however, the computational cost and time increase significantly.
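The two phases described in this section can be sketched in a few lines of Python. This is a hedged illustration, not Algorithm 1 itself: the names `Knn`, `self_training_step` and `aast` are hypothetical, toy k-NN learners with different k stand in for C4.5, JRip and kNN, and three details are assumptions made here because the text does not fix them: confident predictions are counted per cycle, ties are broken by pool order, and the default c = 0.9 is illustrative. As a worked instance of the cycle count above, a pool of three learners with k = 8 finishes the selection phase after 8 · (3 − 1) = 16 iterations.

```python
from collections import Counter
import math


class Knn:
    """Toy base learner (hypothetical stand-in for C4.5, JRip or kNN)."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y

    def predict(self, x):
        near = sorted(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))[: self.k]
        label, votes = Counter(self.y[i] for i in near).most_common(1)[0]
        return label, votes / len(near)


def self_training_step(learner, state, c):
    """One self-training iteration; returns the new (X, y, U) and the number of confident predictions."""
    X, y, U = state
    learner.fit(X, y)
    scored = [(x, *learner.predict(x)) for x in U]
    moved = [(x, label) for x, label, conf in scored if conf > c]
    new_state = (
        X + [x for x, _ in moved],
        y + [label for _, label in moved],
        [x for x, _, conf in scored if conf <= c],
    )
    return new_state, len(moved)


def aast(pool, X, y, U, c=0.9, k=8, max_iter=40):
    """pool maps a name to a learner; every learner keeps its own labeled/unlabeled sets."""
    states = {name: (list(X), list(y), list(U)) for name in pool}
    counts = dict.fromkeys(pool, 0)  # confident predictions in the current cycle (assumption)
    alive = list(pool)
    for iteration in range(1, max_iter + 1):
        for name in alive:
            states[name], moved = self_training_step(pool[name], states[name], c)
            counts[name] += moved
        if len(alive) > 1 and iteration % k == 0:  # end of a cycle
            alive.remove(min(alive, key=counts.get))  # drop the least promising learner C_m
            best = max(alive, key=counts.get)  # most promising learner C_M
            shared = states[best]
            for name in alive:  # C_M shares L_M and U_M with the remaining learners
                states[name] = tuple(list(part) for part in shared)
                counts[name] = 0
        if len(alive) == 1 and not states[alive[0]][2]:
            break  # a single learner is left and U is empty
    winner = max(alive, key=counts.get)
    X_final, y_final, U_final = states[winner]
    pool[winner].fit(X_final, y_final)
    return winner, pool[winner], U_final


X0 = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
y0 = ["a", "a", "a", "b", "b", "b"]
U0 = [(0.5, 0.5), (10.5, 10.5), (1, 1), (9, 9), (5, 5.2)]
winner, model, remaining = aast({"knn1": Knn(1), "knn3": Knn(3), "knn5": Knn(5)}, X0, y0, U0, k=2, max_iter=10)
```

In this toy run the 5-NN learner never produces a prediction above the threshold (three of its five neighbours always agree at best), so it is the first learner removed. A larger k gives each learner more iterations before it is judged, at the price of running the whole pool for longer.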

Experimental Results
The experiments were based on 40 datasets from the UCI Machine Learning Repository [29] and the KEEL repository [30]. Table 1 presents a brief description of the datasets' structure, i.e., the number of instances (#Instances), number of attributes (#Features) and number of output classes (#Classes). The considered datasets contain between 101 and 19,020 instances, while the number of attributes ranges from 2 to 60 and the number of classes varies between 2 and 11.
Our experimental results were obtained by conducting a three-phase procedure: in the first phase, we evaluate the performance of the proposed algorithm AAST for various values of parameter k in order to study its sensitivity; in the second phase, we compare the performance of AAST with that of the most popular and commonly used self-labeled algorithms; in the third phase, we perform a statistical comparison between all compared semi-supervised self-labeled algorithms. The detailed numerical results can be found at the web site: www.math.upatras.gr/~livieris/Results/AAST.zip.
The implementation code was written in Java, using the WEKA Machine Learning Toolkit [28], and the classification accuracy was evaluated using stratified 10-fold cross-validation, i.e., the data was separated into folds so that each fold had the same distribution of classes as the entire dataset. For each generated fold, a given algorithm is trained with the examples contained in the remaining folds (training partition) and then tested with the current fold. Moreover, the training partition was divided into labeled and unlabeled subsets. Similar to [13,31], in the division process we do not maintain the class proportion in the labeled and unlabeled sets, since the main aim of semi-supervised classification is to exploit unlabeled data for better classification results. Hence, we randomly select the examples that will be marked as labeled instances and remove the class label of the remaining instances. Furthermore, we ensure that every class has at least one representative instance. To study the influence of the amount of labeled data, three different ratios R were used: 10%, 20% and 30%. In summary, this experimental study involves a total of 120 datasets (40 datasets × 3 labeled ratios).
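The division of a training partition into labeled and unlabeled subsets can be sketched as follows. The function name `split_labeled_unlabeled`, the fixed seed and the way the "at least one representative per class" rule is enforced (picking one random instance of every class first, then filling up at random) are assumptions made for illustration; the text does not state how the authors' Java code guarantees it.

```python
import random


def split_labeled_unlabeled(y, ratio, seed=0):
    """Return (labeled indices, unlabeled indices) for the class labels y of a training partition."""
    rng = random.Random(seed)
    # Target size of the labeled subset, never smaller than the number of classes.
    n_labeled = max(round(ratio * len(y)), len(set(y)))
    by_class = {}
    for index, label in enumerate(y):
        by_class.setdefault(label, []).append(index)
    # Assumption: guarantee one representative per class before random filling.
    labeled = {rng.choice(indices) for indices in by_class.values()}
    rest = [i for i in range(len(y)) if i not in labeled]
    rng.shuffle(rest)  # class proportions are deliberately not preserved
    labeled.update(rest[: n_labeled - len(labeled)])
    unlabeled = sorted(set(range(len(y))) - labeled)
    return sorted(labeled), unlabeled


classes = ["a"] * 18 + ["b"] * 2
labeled_idx, unlabeled_idx = split_labeled_unlabeled(classes, ratio=0.10)
```

With 20 training instances and R = 10%, the labeled subset contains two instances, one of each class, and the labels of the other 18 are treated as unknown.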
Furthermore, the proposed algorithm uses three well-known supervised classifiers as base learners, namely C4.5, JRip and kNN. These base learners constitute some of the most effective and widely used data mining algorithms for classification [24,32]. A brief description of these classifiers is given below:

• C4.5 [33] constitutes one of the most effective and efficient classification algorithms for building decision trees. This algorithm induces classification rules in the form of decision trees for a given training set. More specifically, it categorizes instances to a predefined set of classes according to their attribute values from the root of a tree down to a leaf. The accuracy of a leaf corresponds to the percentage of correctly classified instances of the training set.

• JRip [34] is generally considered to be a very effective and fast rule-based algorithm, especially on large samples with noisy data. The algorithm examines the classes in order of increasing size and generates an initial set of rules for a class using incremental reduced-error pruning. It proceeds by treating all the examples of a particular judgement in the training data as a class and determines a set of rules that covers all the members of that class. Subsequently, it proceeds to the next class and iteratively applies the same procedure until all classes have been covered. Moreover, JRip produces error rates competitive with C4.5 with less computational effort.

• kNN [35] constitutes a representative instance-based learning algorithm based on dissimilarities among a set of instances. It belongs to the lazy learning family of methods [35], which do not build a model during the learning process. To classify a new instance, kNN computes its distance to the stored training instances and identifies its k nearest neighbors; the instance is then assigned to the class that has the most members among these neighbors. The main advantages of the kNN classification algorithm are its ease and simplicity of implementation and the fact that it provides good generalization results in classification problems with multiple categories.
The configuration parameters of the proposed algorithm AAST and of the base learners used in the experiments are presented in Table 2. Moreover, similar to Blum and Mitchell [12], we established a limit on the number of iterations (MaxIter = 40) in algorithm AAST. All classification algorithms were evaluated using the performance profiles, based on accuracy, proposed by Dolan and Moré [36]. This metric provides a wealth of information such as solver efficiency, robustness and probability of success in compact form. More specifically, the authors presented a new tool for analyzing the efficiency of algorithms by introducing the notion of a performance profile as a means to evaluate and compare the performance of a set of solvers S on a test set P.
Assuming that there exist $n_s$ solvers and $n_p$ problems, for each solver s and problem p they defined $\alpha_{p,s}$ as the percentage of instances misclassified by solver s on problem p. Requiring a baseline for comparisons, they compared the performance on problem p by solver s with the best performance by any solver on this problem; that is, using the performance ratio.
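For reference, the performance ratio in the standard formulation of Dolan and Moré [36] can be written in the notation of this section as follows. This is a reconstruction from that standard definition, with $\alpha_{p,s}$ taking the place of their computing-time measure, and it assumes that the best solver's $\alpha_{p,s}$ on each problem is nonzero.

```latex
r_{p,s} = \frac{\alpha_{p,s}}{\min\{\alpha_{p,s} : s \in S\}}
```

A ratio of $r_{p,s} = 1$ means that solver s attains the lowest misclassification percentage on problem p; a ratio of 2 means that it misclassifies twice as many instances as the best solver.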
The performance of solver s on any given problem might be of interest, but we would like to obtain an overall assessment of the performance of the solver. Next, they defined the function $\rho_s$.
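In the standard formulation of Dolan and Moré [36], this overall assessment is the fraction of problems on which the performance ratio $r_{p,s}$ of solver s (its $\alpha_{p,s}$ divided by the best $\alpha_{p,s}$ of any solver on problem p) does not exceed a factor $\alpha$. The expression below is a reconstruction from that standard definition, written with the symbols used in this section.

```latex
\rho_s(\alpha) = \frac{1}{n_p}\,\operatorname{size}\{\, p \in P : r_{p,s} \le \alpha \,\}
```

In particular, $\rho_s(1)$ is the fraction of problems on which solver s is (one of) the best.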
Function $\rho_s$ is the (cumulative) distribution function for the performance ratio. The performance profile $\rho_s : \mathbb{R} \to [0, 1]$ for a solver is a non-decreasing, piecewise constant function, continuous from the right at each breakpoint [36]. In other words, the performance profile plots the fraction P of problems for which any given method is within a factor $\alpha$ of the best solver. According to the above rules and discussion, we conclude that the solver whose performance profile plot lies on the top right outperforms the rest of the solvers.
Ultimately, the use of performance profiles eliminates the influence of a small number of problems on the benchmarking process and the sensitivity of results associated with the ranking of solvers [36–38]. It is worth mentioning that the vertical side of a performance profile gives the percentage of the problems that were successfully solved by each method (robustness).
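A performance profile of this kind can be computed directly from a table of error percentages. The sketch below is illustrative only: `performance_profile` is a hypothetical helper, the error values are made up for the example and are not results from this study, and it assumes that the best error on every problem is strictly positive so that the ratio is defined.

```python
def performance_profile(errors, alphas):
    """errors maps a solver name to its error percentages, one per problem (same order for all).
    Returns, per solver, the fraction of problems with performance ratio <= each alpha."""
    solvers = list(errors)
    n_p = len(errors[solvers[0]])
    # Best (lowest) error achieved by any solver on each problem; assumed > 0.
    best = [min(errors[s][p] for s in solvers) for p in range(n_p)]
    ratios = {s: [errors[s][p] / best[p] for p in range(n_p)] for s in solvers}
    return {s: [sum(r <= a for r in ratios[s]) / n_p for a in alphas] for s in solvers}


# Made-up error percentages for two solvers on three problems.
toy_errors = {"A": [10.0, 20.0, 30.0], "B": [20.0, 20.0, 10.0]}
profile = performance_profile(toy_errors, alphas=[1.0, 2.0, 3.0])
```

Here both solvers are best on two of the three problems, so both profiles start at 2/3; solver B reaches 1 at a factor of 2, whereas solver A needs a factor of 3, so B's curve lies above A's.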

Sensitivity of AAST to the Value of Parameter k
In the sequel, we focus on the experimental analysis for the best value of parameter k; hence, we have tested values of k ranging from 3 to 8 in steps of 1. Figure 1 presents the performance profiles for various values of parameter k, relative to the ratio of labeled data used. Clearly, AAST exhibits better classification performance as the value of parameter k increases, revealing its sensitivity. More specifically, using 10% as labeled ratio, AAST with k = 3, 4, 5, 6, 7 and 8 classifies 22.5%, 25%, 25%, 35%, 47.5% and 80% of the test problems with the highest accuracy, respectively. Furthermore, AAST with k = 3, 4, 5, 6, 7 and 8 classifies 20%, 27.5%, 30%, 40%, 52.5% and 80% of the test problems with the highest accuracy, respectively, for the 20% labeled ratio, as well as 17.5%, 17.5%, 22.5%, 35%, 57.5% and 80%, respectively, for the 30% labeled ratio.

Performance Evaluation of AAST
Subsequently, we evaluate the performance of the proposed algorithm AAST against Self-training using C4.5, JRip and kNN as base learners. In the rest of this section, the value of parameter k in algorithm AAST is set to 8, which exhibited the highest classification accuracy.
Figure 2 presents the performance profiles for Self-training and AAST. AAST exhibits the highest probability of being the optimal classifier, since it corresponds to the top curve for all labeled ratios used. More specifically, AAST reports the best performance, classifying 72.5%, 87.5% and 60% of the test problems with the highest accuracy using 10%, 20% and 30% as labeled ratio, respectively, followed by Self-training (kNN) reporting 22.5%, 10% and 25% in the same situations. Finally, in order to demonstrate the classification performance of the proposed algorithm, we compare it with other state-of-the-art self-labeled algorithms such as Co-training [12] and Tri-training [22] using C4.5, JRip and kNN as base learners, Co-Forest [21] and SETRED [15]. Notice that all algorithms were used with the parameters presented in [30].
Figure 3 presents the performance profiles of some state-of-the-art self-labeled algorithms and AAST for each labeled ratio used. Regardless of the labeled ratio, the AAST algorithm achieved the best overall performance, outperforming all self-labeled algorithms. More specifically, AAST classifies 45%, 52.5% and 35% of the test problems with the highest accuracy, using 10%, 20% and 30% as labeled ratio, respectively. In conclusion, it is worth mentioning that the reported performance profiles illustrate that AAST exhibits better performance on average, outperforming classical SSL methods, but this is not necessarily the case for every single dataset.

Statistical and Post-Hoc Analysis
The statistical comparison of multiple algorithms over multiple datasets is fundamental in machine learning and is usually carried out by means of a non-parametric statistical test. Therefore, we use the Friedman Aligned-Ranks (FAR) test [39] in order to conduct a complete performance comparison between all algorithms for all the different labeled ratios. Its application allows us to highlight the existence of significant differences between our proposed algorithm and the classical SSL algorithms and subsequently to evaluate the rejection of the hypothesis that all the classifiers perform equally well for a given level [25,40].
Let $r_i^j$ be the rank of the j-th of k learning algorithms on the i-th of M problems. Under the null hypothesis $H_0$, which states that all the algorithms are equivalent, the Friedman aligned ranks test statistic is defined by:
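In its commonly stated form, the Friedman aligned ranks statistic reads as below. The formula is given here as the standard definition written with the symbols of this section, under the assumption that $\hat{R}_j = \sum_{i=1}^{M} r_i^j$ denotes the rank total of the j-th algorithm and $\hat{R}_i = \sum_{j=1}^{k} r_i^j$ the rank total of the i-th problem; these two symbols are introduced here for illustration.

```latex
F_{AR} = \frac{(k-1)\left[\sum_{j=1}^{k} \hat{R}_j^{2} - \frac{kM^{2}}{4}\,(kM+1)^{2}\right]}
              {\frac{kM(kM+1)(2kM+1)}{6} - \frac{1}{k}\sum_{i=1}^{M} \hat{R}_i^{2}}
```

The statistic is compared with a $\chi^2$ distribution with $k - 1$ degrees of freedom.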

Conclusions and Future Research
In this work, we presented a new SSL algorithm which is based on a self-training philosophy. More specifically, our proposed algorithm automatically selects the best base learner, relative to the number of the most confident predictions of unlabeled data.
The efficiency of the proposed semi-supervised algorithm was evaluated on several benchmark datasets in terms of classification accuracy, using the most frequently used base learners (C4.5, kNN and JRip) and different ratios of labeled data. Our numerical results, as well as the presented statistical analysis, demonstrate that the AAST algorithm outperforms its component SSL algorithms, confirming the effectiveness and robustness of the proposed method. Therefore, the presented methodology seems to lead to more efficient, stable and robust predictive models.
In our future work, we intend to pursue extensive empirical experiments in order to compare the proposed self-labeled method AAST with various methods belonging to other SSL classes, such as generative mixture models [14,41], transductive SVMs [42–44], graph-based methods [45–49], extreme learning methods [50–52] and expectation maximization with generative mixture models [14,53]. Furthermore, since our experimental results are quite encouraging, our next step is the use of other supervised classifiers as base learners, such as neural networks [54], support vector machines [55] or ensemble-based learners [26], aiming to enhance our proposed framework with more sophisticated and theoretically motivated selection criteria for the most promising classifier, in order to study the behavior of AAST at each cycle. Finally, an interesting aspect is the evaluation of the proposed algorithm on real-world datasets from specific scientific fields, such as education and health care, and the exploration of its performance on imbalanced datasets [56,57] using more sophisticated performance metrics such as Sensitivity, Specificity, F-measure, AUC and the ROC curve [58,59].
Algorithm 1. AAST.
Input: L − Set of labeled training instances. U − Set of unlabeled training instances. c − Confidence level. k − Iterations per cycle. C = (C_1, C_2, ..., C_N) − Set of N base learners.
Output: C_P − Trained classifier.

Figure 1. Log10 scaled performance profiles for AAST using various values of parameter k.

Figure 2. Log10 scaled performance profiles for Self-training and AAST.

Figure 3. Log10 scaled performance profiles for some state-of-the-art self-labeled algorithms and AAST.

Table 1. Brief description of datasets.

Table 2. Parameter specification for all the SSL methods employed in the experimentation.
JRip: Number of optimization runs = 2. Number of folds used for reduced-error pruning = 3. Minimum total weight of the instances in a rule = 2.0. Pruning is performed after tree building.
kNN: Number of neighbors = 3. Euclidean distance.