1. Introduction
In machine learning and data mining, the construction of a classifier can be considered one of the most significant and challenging tasks [1]. Traditional classification algorithms belong to the class of supervised algorithms, which use only labeled data to train the classifier. However, in many real-world classification problems, labeled instances are often difficult, expensive, or time-consuming to obtain, since they require the effort of experienced human annotators or costly empirical research. In contrast, unlabeled data are fairly easy to obtain and require much less effort.
Semi-supervised learning (SSL) algorithms constitute an appropriate and effective machine learning methodology for extracting knowledge from both labeled and unlabeled data so as to build efficient classifiers [2]. More analytically, they efficiently combine the explicit classification information of labeled data with the information hidden in the unlabeled data. The general assumption of this class of algorithms is that data points in a high-density region are likely to belong to the same class and that the decision boundary lies in low-density regions [3]. Hence, these methods have the advantage of reducing the effort of supervision to a minimum, while still preserving competitive recognition performance. Nowadays, these algorithms attract great interest both in theory and in practice and have become a topic of significant research as an alternative to traditional methods of machine learning, since they require less human effort and frequently present higher accuracy [4,5,6,7,8,9,10]. The main issue of semi-supervised learning is how to efficiently exploit the hidden information in the unlabeled data. In the literature, several approaches have been proposed, each with a different philosophy regarding the link between the distribution of labeled and unlabeled data [2,11,12,13,14].
Self-training constitutes perhaps the most popular and frequently used SSL algorithm due to its simplicity and classification accuracy [4,5,9]. This algorithm wraps around a base learner and uses its own predictions to assign labels to unlabeled data. More specifically, in the self-training process, a classifier is trained with a small number of labeled examples and iteratively enlarges its training set using its own most confident predictions on newly labeled data. However, this methodology can lead to erroneous predictions when noisy examples are classified as the most confident ones and subsequently incorporated into the labeled training set. Li and Zhou [15] tried to address this difficulty and presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Along this line, Tanha et al. [16] studied the classification behaviour of self-training and, based on their numerical experiments, stated that the most important aspect for the self-training procedure to be successful is to correctly estimate the confidence of the predictions.
Therefore, the success of the self-training algorithm depends on the newly labeled data [2] but, most significantly, on the selection of the base learner. Nevertheless, the selection of the base learner remains an open issue, since deciding which particular learning algorithm to choose for a specific problem is still a complicated and challenging task. Given a pattern recognition problem, the traditional approach is to evaluate a set of different learners against a representative validation set and select the best one. It is generally recognized that the key to pattern recognition problems does not wholly lie in any particular solution, since no single model exists for all problems [17].
In this work, we propose a new semi-supervised learning algorithm which is based on the self-training philosophy. The proposed algorithm initially uses several independent base learners and, during the training process, dynamically selects the most promising base learner according to a strategy based on the number of most confident predictions on the unlabeled data. Our numerical experiments on several benchmark datasets confirm the efficacy of the proposed methodology. Additionally, we performed several statistical tests in order to illustrate the efficiency of the proposed algorithm.
The remainder of this paper is organized as follows: Section 2 defines the semi-supervised classification problem and the self-training approach, Section 3 presents a detailed description of the proposed algorithm, and Section 4 presents the numerical experiments and discusses the obtained results. Finally, Section 5 discusses the conclusions and some further research topics for future work.
2. A Review of Semi-Supervised Classification Via Self-Labeled Approach
This section provides a definition for the semi-supervised classification problem and a short description of the most popular and frequently used semi-supervised self-labeled algorithms.
2.1. Semi-Supervised Classification
In the sequel, we present the definitions and the necessary notation for the semi-supervised classification problem. Let $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, y)$ be an example, where $x_p$ belongs to a class $y$ and lies in a $D$-dimensional space in which $x_{pi}$ is the $i$-th attribute of the $p$-th sample. Suppose $L$ is a labeled set of instances with $y$ known and $U$ is an unlabeled set of instances with $y$ unknown, where $|L| \ll |U|$. Notice that the set $L \cup U$ constitutes the training set. Moreover, there is a test set $T$ composed of unseen instances which have not been used in the training stage. The aim of semi-supervised classification is to obtain an accurate and robust learning hypothesis using the training set $L \cup U$ and subsequently evaluate its performance using the test set $T$.
In the literature, a variety of self-labeled methods have been proposed, each following a different methodology for exploiting the information hidden in the unlabeled data. Next, we present a brief description of the most popular and frequently used semi-supervised self-labeled methods.
2.2. Semi-Supervised Self-Labeled Methods
Self-training is a wrapper-based semi-supervised approach which constitutes an iterative procedure of self-labeling unlabeled data and is generally considered a simple and effective SSL algorithm. According to Ng and Cardie [18], “self-training is a single-view weakly supervised algorithm” which relies on its own predictions on unlabeled data to teach itself. In the self-training framework, an arbitrary classifier is initially trained with a small amount of labeled data, which constitutes its training set, aiming to classify unlabeled points. Subsequently, it iteratively enlarges its labeled training set with its own most confident predictions and is retrained. More specifically, at each iteration, the classifier’s training set is gradually augmented with classified unlabeled instances that have achieved a probability value over a defined threshold c; these instances are considered sufficiently reliable to be added to the training set. Notice that the way in which the confidence of predictions is measured depends on the type of the used base learner (see [19]).
Clearly, this model does not make any specific assumptions about the input data, but rather accepts that its own predictions tend to be correct. Therefore, since the success of the self-training algorithm heavily depends on the newly labeled data obtained from its own predictions, its weakness is that erroneous initial predictions will probably lead the classifier to generate incorrectly labeled data [2].
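For illustration, the following minimal sketch outlines a generic self-training loop on top of the WEKA API (the toolkit also used for the experiments in Section 4). The class name, the helper maxIndex and the concrete threshold and iteration-limit values are illustrative assumptions rather than the authors' implementation; both instance sets are assumed to share the same header with the class index set.

```java
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

public class SelfTrainingSketch {

    /**
     * Generic self-training loop: repeatedly train on the labeled set,
     * move the most confident predictions from U to L, and retrain.
     * The threshold c and maxIter are illustrative parameters.
     */
    public static Classifier selfTrain(Classifier base, Instances labeled,
                                       Instances unlabeled, double c, int maxIter)
            throws Exception {
        for (int iter = 0; iter < maxIter && unlabeled.numInstances() > 0; iter++) {
            base.buildClassifier(labeled);
            boolean addedAny = false;
            // Scan U backwards so removals do not shift the remaining indices.
            for (int i = unlabeled.numInstances() - 1; i >= 0; i--) {
                Instance x = unlabeled.instance(i);
                double[] dist = base.distributionForInstance(x);
                int predicted = maxIndex(dist);
                if (dist[predicted] >= c) {               // sufficiently confident prediction
                    Instance labeledCopy = (Instance) x.copy();
                    labeledCopy.setClassValue(predicted); // label with the model's own prediction
                    labeled.add(labeledCopy);
                    unlabeled.delete(i);
                    addedAny = true;
                }
            }
            if (!addedAny) break; // no confident predictions left
        }
        base.buildClassifier(labeled); // final retraining on the enlarged set
        return base;
    }

    private static int maxIndex(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++)
            if (values[i] > values[best]) best = i;
        return best;
    }
}
```

In use, a base learner such as new J48() would be passed together with the labeled and unlabeled partitions, e.g. selfTrain(new J48(), labeled, unlabeled, 0.9, 40), with the threshold and iteration limit chosen by the user.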
Li and Zhou [15] tried to address this difficulty and, as a result, presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Their principal improvement in relation to the classical self-training scheme is the establishment of a restriction related to the acceptance or rejection of the unlabeled examples which are evaluated as trustworthy by the algorithm. More analytically, a neighboring graph in the D-dimensional feature space is built and all the candidate unlabeled examples to be appended to the initial training set are filtered through a hypothesis test. Thus, any examples having successfully passed that test are finally added to the training set before the end of each iteration.
Co-training is a semi-supervised algorithm which can be regarded as a different variant of the self-training technique [12]. It is based on the strong assumption that the feature space can be divided into two conditionally independent views, with each view being sufficient to train an efficient classifier. In this framework, two learning algorithms are separately trained for each view using the initial labeled dataset, and the most confident predictions of each algorithm on unlabeled data are used to augment the training set of the other through an iterative learning process. Following the same concept, Nigam and Ghani [14] performed an experimental analysis in which they concluded that Co-training outperforms other SSL algorithms when there is a natural existence of two distinct and independent views. Nevertheless, the assumption about the existence of sufficient and redundant views is a luxury hardly met in most real-case scenarios.
Zhou and Goldman [20] have also adopted the idea of ensemble learning and majority voting in the semi-supervised framework. Along this line, Li and Zhou [21] proposed another algorithm, named Co-Forest, in which several Random Trees are trained on bootstrap data from the dataset. The main idea of this algorithm is the assignment of a few unlabeled examples to each Random Tree during the training process. Eventually, the final decision is produced by simple majority voting. Notice that the use of the Random Tree classifier on random samples of the collected labeled data is the main reason why the behavior of Co-Forest is efficient and robust, although the number of available labeled examples is reduced.
A rather representative approach which is based on the ensemble philosophy is the Tri-training algorithm. This algorithm constitutes an improved single-view extension of the Co-training algorithm, exploiting unlabeled data without relying on the existence of two views of instances [22]. The Tri-training algorithm can be considered as a bagging ensemble of three classifiers which are trained on data subsets generated through bootstrap sampling from the original labeled training set [23]. Subsequently, in each Tri-training round, if two classifiers agree on the labeling of an unlabeled instance while the third one disagrees, then these two classifiers will label this instance for the third classifier. It is worth noticing that this “majority teach minority” strategy serves as an implicit confidence measurement, which avoids the use of complicated, time-consuming approaches for explicitly measuring the predictive confidence, and hence the training process is efficient [4].
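A minimal sketch of the “majority teach minority” rule follows, assuming a WEKA-style Classifier interface; the class and method names are hypothetical and only the agreement check is shown.

```java
import weka.classifiers.Classifier;
import weka.core.Instance;

public class TriTrainingRule {

    /**
     * Sketch of the "majority teach minority" step: if the two classifiers other
     * than 'target' agree on the label of an unlabeled instance x, that label is
     * proposed as a newly labeled example for the target classifier.
     * Returns the agreed class index, or -1 if the two teachers disagree.
     */
    public static double proposeLabel(Classifier[] ensemble, int target, Instance x)
            throws Exception {
        int first = (target + 1) % 3;
        int second = (target + 2) % 3;
        double labelA = ensemble[first].classifyInstance(x);
        double labelB = ensemble[second].classifyInstance(x);
        return (labelA == labelB) ? labelA : -1; // agreement -> teach the third classifier
    }
}
```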
Kostopoulos et al. [24] and Livieris et al. [25,26], motivated by the previous works, studied the fusion of ensemble learning and semi-supervised learning. More specifically, they presented self-labeled methods which adopt majority voting in the semi-supervised framework.
3. Auto-Adjustable Self-Training Semi-Supervised Algorithm
In this section, we present the proposed SSL algorithm, which is based on the self-training framework. We recall that two main difficulties in self-training are the decision of which base learner to choose for a specific problem and how to find a set of high-confidence predictions of unlabeled data. Therefore, in order to address these difficulties, we start with an initial pool of classifiers and, during the training process, dynamically select the most promising classifier, relative to its most confident predictions. A high-level description of the proposed semi-supervised algorithm, entitled Auto-Adjustable Self-Training (AAST), is presented in Algorithm 1, which consists of two phases: in the first phase, the most promising classifier is selected from a pool of classifiers based on the number of confident predictions on unlabeled data, whereas in the second phase, the most promising classifier is trained within the self-training framework.
Suppose that $C = \{C_1, C_2, \ldots, C_N\}$ constitutes a set of $N$ classifiers which can be used as base learners in the self-training framework. Initially, all base learners $C_i$ are trained using the same small amount of labeled data $L$ and then applied on the same unlabeled data $U$. Subsequently, the labeled set $L_i$ of each classifier $C_i$ is iteratively and gradually augmented using its own most confident predictions. More specifically, each classified unlabeled instance that has achieved a probability value over a defined threshold $c$ is considered sufficiently reliable to be added to the classifier's labeled set $L_i$ for the following training phases. It is worth mentioning that the way the confidence of predictions is measured depends on the type of the used base learner (see [19,27,28] and the references therein). Finally, each classifier is re-trained using its own new enlarged training set.
Algorithm 1: Auto-Adjustable Self-Training (AAST).
Input:
  L — set of labeled training instances.
  U — set of unlabeled training instances.
  c — confidence level.
  k — iterations per cycle.
  C = {C_1, C_2, ..., C_N} — set of N base learners.
Output:
  C* — trained classifier.
/* Phase I: Classifier selection */
 1: repeat
 2:   for i = 1 to N do
 3:     Set L_i = L and U_i = U.
 4:   end for
 5:   for j = 1 to k do
 6:     for each (classifier C_i in C) do
 7:       Apply C_i on L_i.
 8:       Select the instances with a predicted probability more than threshold c per iteration (MCP_i).
 9:       Remove MCP_i from U_i and add them to L_i.
10:     end for
11:   end for
12:   Select the classifier C_min with the fewest labeled instances.
13:   Remove the classifier C_min from the set C.
14:   Select the classifier C_max with the most labeled instances.
15:   Set L = L_max and U = U_max.
16:   Set N = N − 1.
17: until one classifier remains in set C.
18: Set C* = the only classifier remaining in set C.
/* Phase II: Training of classifier C* */
19: repeat
20:   Apply C* on L.
21:   Select the instances with a predicted probability more than threshold c per iteration (MCP).
22:   Remove MCP from U and add them to L.
23: until some stopping criterion is met or U is empty.
In order to select a base learner from the set C, the proposed algorithm is grounded on the following simple idea: the most promising base learner is probably the one with the most confident predictions. In other words, the most promising classifier is the base learner that is able to confidently label, and thereby exploit, as many unlabeled instances as possible.
Every k iterations (which we call a cycle), AAST evaluates the base learners in set C and selects the classifier C_min with the minimum number of most confident predictions as well as the classifier C_max with the maximum number of most confident predictions. Subsequently, the classifier C_min is removed from the set C and the classifier C_max provides its labeled set L_max and its unlabeled set U_max to all the remaining classifiers for the next cycle. More to the point, in every cycle (i.e., every k iterations) the algorithm removes the least promising classifier from the set C, in order to reduce the computational cost, and restarts the self-training process using the labeled and unlabeled sets of the most promising classifier C_max, relative to the number of most confident predictions of each classifier.
Notice that it is immediately implied from the above discussion that after N − 1 cycles (i.e., k(N − 1) iterations), where N is the initial number of used base learners, only one classifier, denoted as C*, remains in set C. This classifier constitutes the most promising classifier, relative to the proposed selection strategy. Subsequently, the only remaining classifier C* continues its training within the semi-supervised framework.
An obvious advantage of the proposed technique is that it exploits the diversity of the errors of the learned models by using different learning algorithms, and the classifier with the most confident predictions is dynamically selected as the most promising one. Nevertheless, the efficacy and computational cost of the proposed algorithm depend on the value of the parameter k. As the value of the parameter k increases, the base learners exploit the hidden information in the unlabeled data for more iterations before being evaluated; however, the computational cost and time increase significantly.
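To make the selection strategy concrete, the following sketch outlines Phase I of AAST on top of the WEKA API. It re-implements the per-iteration self-labeling step and counts, per cycle, how many instances each learner labeled confidently; all class, method and variable names are illustrative assumptions, and the full algorithm would also carry the winner's labeled and unlabeled sets into Phase II.

```java
import java.util.ArrayList;
import java.util.List;
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

public class AastPhaseOneSketch {

    /** Phase I: run k self-training iterations per remaining learner, drop the
     *  learner with the fewest confident predictions, and continue from the
     *  labeled/unlabeled sets of the learner with the most. */
    public static Classifier selectMostPromising(List<Classifier> learners,
                                                 Instances labeled, Instances unlabeled,
                                                 double c, int k) throws Exception {
        while (learners.size() > 1) {
            List<Instances> labeledSets = new ArrayList<>();
            List<Instances> unlabeledSets = new ArrayList<>();
            int minIdx = -1, maxIdx = -1;
            int minCount = Integer.MAX_VALUE, maxCount = Integer.MIN_VALUE;

            for (int i = 0; i < learners.size(); i++) {
                Instances li = new Instances(labeled);     // each learner starts every cycle
                Instances ui = new Instances(unlabeled);   // from the same L and U
                int before = li.numInstances();
                for (int iter = 0; iter < k; iter++) {
                    runSelfTrainingIteration(learners.get(i), li, ui, c);
                }
                int confident = li.numInstances() - before; // confident predictions this cycle
                labeledSets.add(li);
                unlabeledSets.add(ui);
                if (confident < minCount) { minCount = confident; minIdx = i; }
                if (confident > maxCount) { maxCount = confident; maxIdx = i; }
            }
            // The most promising learner provides L and U for the next cycle;
            // the least promising learner is removed from the pool.
            labeled = labeledSets.get(maxIdx);
            unlabeled = unlabeledSets.get(maxIdx);
            learners.remove(minIdx);
        }
        return learners.get(0); // C*, trained further in Phase II
    }

    private static void runSelfTrainingIteration(Classifier base, Instances li,
                                                 Instances ui, double c) throws Exception {
        base.buildClassifier(li);
        for (int i = ui.numInstances() - 1; i >= 0; i--) {
            Instance x = ui.instance(i);
            double[] dist = base.distributionForInstance(x);
            double predicted = base.classifyInstance(x);
            if (dist[(int) predicted] >= c) {
                Instance copy = (Instance) x.copy();
                copy.setClassValue(predicted);
                li.add(copy);
                ui.delete(i);
            }
        }
    }
}
```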
4. Experimental Results
The experiments were based on 40 datasets from the UCI Machine Learning Repository [29] and the KEEL repository [30]. Table 1 presents a brief description of the datasets’ structure, i.e., the number of instances (#Instances), the number of attributes (#Features) and the number of output classes (#Classes). The considered datasets vary in size (the smallest contains 101 instances), while the number of attributes ranges from 2 to 60 and the number of classes varies between 2 and 11.
Our experimental results were obtained by conducting a three-phase procedure: in the first phase, the performance of the proposed algorithm AAST is evaluated using various values of the parameter k in order to study its sensitivity; in the second phase, the performance of AAST is compared with that of the most popular and commonly used self-labeled algorithms; while in the third phase, a statistical comparison between all compared semi-supervised self-labeled algorithms is performed. The detailed numerical results can be found at: www.math.upatras.gr/~livieris/Results/AAST.zip.
The implementation code was written in Java, using the WEKA Machine Learning Toolkit [28], and the classification accuracy was evaluated using stratified 10-fold cross-validation, i.e., the data were separated into folds so that each fold had the same class distribution as the entire dataset. For each generated fold, a given algorithm is trained with the examples contained in the remaining folds (training partition) and then tested on the current fold. Moreover, the training partition was divided into labeled and unlabeled subsets.
Similar to [13,31], in the division process we do not maintain the class proportions in the labeled and unlabeled sets, since the main aim of semi-supervised classification is to exploit unlabeled data for better classification results. Hence, we use a random selection of examples that are marked as labeled instances, and the class labels of the remaining instances are removed. Furthermore, we ensure that every class has at least one representative labeled instance. To study the influence of the amount of labeled data, three different labeled ratios R were used. In summary, this experimental study involves a total of 120 datasets (40 datasets × 3 labeled ratios).
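A minimal sketch of such a division step is shown below, assuming WEKA Instances with the class index already set; the class and method names are illustrative and the guarantee of one labeled representative per class follows the description above.

```java
import java.util.Random;
import weka.core.Instance;
import weka.core.Instances;

public class LabeledUnlabeledSplitSketch {

    /**
     * Splits a training partition into labeled and unlabeled subsets.
     * A fraction 'ratio' of the instances is randomly selected as labeled;
     * the class values of the remaining instances are removed. At least one
     * labeled representative per class is kept, as described in the text.
     */
    public static Instances[] split(Instances training, double ratio, long seed) {
        Instances shuffled = new Instances(training);
        shuffled.randomize(new Random(seed));

        Instances labeled = new Instances(shuffled, 0);   // empty sets with the same header
        Instances unlabeled = new Instances(shuffled, 0);

        boolean[] classCovered = new boolean[shuffled.numClasses()];
        int numLabeled = (int) Math.round(ratio * shuffled.numInstances());

        for (int i = 0; i < shuffled.numInstances(); i++) {
            Instance x = shuffled.instance(i);
            int cls = (int) x.classValue();
            if (labeled.numInstances() < numLabeled || !classCovered[cls]) {
                labeled.add(x);                           // keep the true label
                classCovered[cls] = true;
            } else {
                Instance hidden = (Instance) x.copy();
                hidden.setClassMissing();                 // remove the class label
                unlabeled.add(hidden);
            }
        }
        return new Instances[] { labeled, unlabeled };
    }
}
```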
Furthermore, the proposed algorithm uses three well-known supervised classifiers as base learners, namely C4.5, JRip and kNN. These base learners constitute some of the most effective and widely used data mining algorithms for classification [24,32]. A brief description of these classifiers is given below:
C4.5 [33] constitutes one of the most effective and efficient classification algorithms for building decision trees. This algorithm induces classification rules in the form of decision trees for a given training set. More analytically, it categorizes instances into a predefined set of classes according to their attribute values, from the root of a tree down to a leaf. The accuracy of a leaf corresponds to the percentage of correctly classified instances of the training set.
JRip [34] is generally considered a very effective and fast rule-based algorithm, especially on large samples with noisy data. The algorithm examines each class in increasing size and generates an initial set of rules for a class using incremental reduced-error pruning. Then, it proceeds by treating all the examples of a particular judgement in the training data as a class and determines a set of rules that covers all the members of that class. Subsequently, it proceeds to the next class and iteratively applies the same procedure until all classes have been covered. What is more, JRip produces error rates competitive with C4.5 with less computational effort.
kNN [35] constitutes a representative instance-based learning algorithm which classifies based on dissimilarities among a set of instances. It belongs to the lazy learning family of methods [35], which do not build a model during the learning process. According to the kNN algorithm, a new instance is classified by computing its distances to all stored instances and considering its k nearest neighbors; the instance is then assigned to the class with the most members among these k neighbors. The main advantages of the kNN classification algorithm are its ease and simplicity of implementation and the fact that it provides good generalization results for instances assigned to multiple categories.
The configuration parameters of the proposed algorithm AAST and of the base learners used in the experiments are presented in Table 2. Moreover, similar to Blum and Mitchell [12], we established a limit on the number of iterations (MaxIter = 40) in algorithm AAST.
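Assuming the WEKA implementations of the three base learners named above, the pool could be assembled as follows; the parameter values shown are WEKA defaults used only for illustration, since the actual configuration is the one listed in Table 2.

```java
import java.util.ArrayList;
import java.util.List;
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;

public class BaseLearnerPool {

    /** Assembles the pool of base learners C = {C4.5, JRip, kNN}.
     *  Default parameters are used here for illustration only;
     *  the experiments use the configuration listed in Table 2. */
    public static List<Classifier> buildPool() {
        List<Classifier> pool = new ArrayList<>();
        pool.add(new J48());    // C4.5 decision tree
        pool.add(new JRip());   // RIPPER rule learner
        pool.add(new IBk(3));   // kNN with an illustrative k = 3
        return pool;
    }
}
```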
All classification algorithms were evaluated using the performance profiles based on accuracy proposed by Dolan and Moré [36]. This metric provides a wealth of information, such as solver efficiency, robustness and probability of success, in compact form. More specifically, the authors presented a new tool for analyzing the efficiency of algorithms by introducing the notion of a performance profile as a means to evaluate and compare the performance of a set of solvers S on a test set P.
Assuming that there exist $n_s$ solvers and $n_p$ problems, for each solver $s$ and problem $p$ they defined $f_{p,s}$ as the percentage of misclassified instances by solver $s$ for problem $p$. Requiring a baseline for comparisons, they compared the performance on problem $p$ by solver $s$ with the best performance by any solver on this problem; that is, using the performance ratio
$$ r_{p,s} = \frac{f_{p,s}}{\min\{f_{p,s} : s \in S\}}. $$
The performance of solver $s$ on any given problem might be of interest, but we would like to obtain an overall assessment of the performance of the solver. Next, they defined
$$ \rho_s(\tau) = \frac{1}{n_p}\,\bigl|\{\, p \in P : r_{p,s} \leq \tau \,\}\bigr|. $$
The function $\rho_s(\tau)$ is the (cumulative) distribution function for the performance ratio. The performance profile $\rho_s$ for a solver is a non-decreasing, piecewise constant function, continuous from the right at each breakpoint [36]. In other words, the performance profile plots the fraction of problems of $P$ for which any given method is within a factor $\tau$ of the best solver. According to the above rules and discussion, we conclude that the solver whose performance profile plot lies on the top right will win over the rest of the solvers.
Ultimately, the use of performance profiles eliminates the influence of a small number of problems on the benchmarking process and the sensitivity of the results associated with the ranking of solvers [36,37,38]. It is worth mentioning that the vertical axis of a performance profile gives the percentage of the problems that were successfully solved by each method (robustness).
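As a concrete illustration of how such profiles can be computed from a matrix of error rates (one row per problem, one column per solver), the following self-contained sketch evaluates the cumulative distribution on a grid of τ values; the class and variable names are illustrative.

```java
public class PerformanceProfileSketch {

    /**
     * errors[p][s] = percentage of misclassified instances of solver s on problem p.
     * Returns profile[s][t] = fraction of problems for which the performance ratio
     * r_{p,s} = errors[p][s] / min_s errors[p][s] does not exceed taus[t].
     * Assumes the best error on each problem is positive.
     */
    public static double[][] profiles(double[][] errors, double[] taus) {
        int numProblems = errors.length;
        int numSolvers = errors[0].length;
        double[][] ratios = new double[numProblems][numSolvers];

        for (int p = 0; p < numProblems; p++) {
            double best = Double.MAX_VALUE;
            for (int s = 0; s < numSolvers; s++) best = Math.min(best, errors[p][s]);
            for (int s = 0; s < numSolvers; s++) ratios[p][s] = errors[p][s] / best;
        }

        double[][] profile = new double[numSolvers][taus.length];
        for (int s = 0; s < numSolvers; s++) {
            for (int t = 0; t < taus.length; t++) {
                int count = 0;
                for (int p = 0; p < numProblems; p++)
                    if (ratios[p][s] <= taus[t]) count++;
                profile[s][t] = (double) count / numProblems;
            }
        }
        return profile;
    }
}
```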
4.1. Sensitivity of AAST to the Value of Parameter k
In the sequel, we focus our interest on the experimental analysis for the best value of the parameter k; hence, we have tested values of k ranging from 3 to 8 in steps of 1.
Figure 1 presents the performance profiles for the various values of the parameter k, relative to the used ratio of labeled data. Clearly, AAST exhibits better classification performance as the value of the parameter k increases, revealing its sensitivity to this parameter. More specifically, for each labeled ratio, AAST with larger values of k classifies a greater percentage of the test problems with the highest accuracy, with k = 8 reporting the best performance; the exact percentages for k = 3 to 8 and for each labeled ratio are those reported in Figure 1.
4.2. Performance Evaluation of AAST
Subsequently, we evaluate the performance of the proposed algorithm AAST against Self-training using C4.5, JRip and kNN as base learners. In the rest of this section, the value of the parameter k in algorithm AAST is set to 8, which exhibited the highest classification accuracy.
Figure 2 presents the performance profiles for Self-training and AAST. Obviously, AAST illustrates the highest probability of being the optimal classifier, since it corresponds to the top curve for all used labeled ratios. More analytically, AAST reports the best performance, classifying the largest percentage of the test problems with the highest accuracy for each of the three labeled ratios, followed by Self-training (kNN); the exact percentages are those reported in Figure 2.
Finally, in order to demonstrate the classification performance of the proposed algorithm, we compare it with other state-of-the-art self-labeled algorithms, such as Co-training [12] and Tri-training [22] using C4.5, JRip and kNN as base learners, Co-Forest [21] and SETRED [15]. Notice that all algorithms were used with the parameters presented in [30].
Figure 3 presents the performance profiles of these state-of-the-art self-labeled algorithms and AAST, regarding the used labeled ratio. Regardless of the ratio of labeled instances, the AAST algorithm managed to achieve the best overall performance, outperforming all the other self-labeled algorithms; it classifies the largest percentage of the test problems with the highest accuracy for every labeled ratio (the exact percentages are reported in Figure 3). Conclusively, it is worth mentioning that the reported performance profiles illustrate that AAST exhibits better performance on average, outperforming classical SSL methods, but this is not necessarily the case on every single dataset.
4.3. Statistical and Post-Hoc Analysis
The statistical comparison of multiple algorithms over multiple datasets is fundamental in machine learning and is usually carried out by means of a non-parametric statistical test. Therefore, we use the Friedman Aligned Ranks (FAR) test [39] in order to conduct a complete performance comparison between all algorithms for all the different labeled ratios. Its application allows us to highlight the existence of significant differences between our proposed algorithm and the classical SSL algorithms and subsequently to evaluate the rejection of the hypothesis that all the classifiers perform equally well for a given significance level [25,40].
Let $r_i^j$ be the rank of the $j$-th of $k$ learning algorithms on the $i$-th of $M$ problems. Under the null hypothesis $H_0$, which states that all the algorithms are equivalent, the Friedman Aligned Ranks test statistic is defined by:
$$ F_{AR} = \frac{(k-1)\left[\displaystyle\sum_{j=1}^{k}\widehat{R}_{.j}^{2} - \frac{kM^{2}}{4}(kM+1)^{2}\right]}{\dfrac{kM(kM+1)(2kM+1)}{6} - \dfrac{1}{k}\displaystyle\sum_{i=1}^{M}\widehat{R}_{i.}^{2}} $$
where $\widehat{R}_{i.}$ is equal to the rank total of the $i$-th dataset and $\widehat{R}_{.j}$ is the rank total of the $j$-th algorithm. The test statistic $F_{AR}$ is compared with the $\chi^{2}$ distribution with $k-1$ degrees of freedom. Please note that, since the test is non-parametric, it does not require the commensurability of the measures across different datasets. In addition, this test does not assume the normality of the sample means and, thus, it is robust to outliers.
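A small sketch of how the statistic can be computed from the rank totals is given below; the class and method names are illustrative, and the aligned ranks themselves are assumed to have been computed beforehand.

```java
public class FriedmanAlignedRanksSketch {

    /**
     * Computes the Friedman Aligned Ranks test statistic from the rank totals,
     * following the formula given above. rowTotals[i] is the rank total of the
     * i-th dataset (M datasets) and colTotals[j] is the rank total of the j-th
     * algorithm (k algorithms).
     */
    public static double farStatistic(double[] rowTotals, double[] colTotals) {
        int m = rowTotals.length;   // number of datasets
        int k = colTotals.length;   // number of algorithms

        double sumColSquares = 0.0;
        for (double rj : colTotals) sumColSquares += rj * rj;

        double sumRowSquares = 0.0;
        for (double ri : rowTotals) sumRowSquares += ri * ri;

        double numerator = (k - 1)
                * (sumColSquares - (k * m * m / 4.0) * Math.pow(k * m + 1, 2));
        double denominator = (k * m * (k * m + 1) * (2.0 * k * m + 1)) / 6.0
                - sumRowSquares / k;
        return numerator / denominator; // compared with chi-square, k - 1 degrees of freedom
    }
}
```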
In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. In other words, the p-value provides information about whether a statistical hypothesis test is significant or not, thus indicating “how significant” the result is, without committing to a particular level of significance. When a p-value is considered in a multiple comparison, it reflects the probability error of a certain comparison; however, it does not take into account the remaining comparisons belonging to the family. One way to address this problem is to report adjusted p-values, which take into account that multiple tests are conducted and can be compared directly with any significance level [40].
To this end, the Finner post-hoc test [39] with a significance level $\alpha$ was applied so as to detect the specific differences between the algorithms. In addition, the Finner test is easy to comprehend, as it usually offers better results than other post-hoc tests, especially when the number of compared algorithms is low [40]. The Finner procedure adjusts the value of $\alpha$ in a step-down manner. Let $p_1 \leq p_2 \leq \cdots \leq p_{k-1}$ be the ordered p-values and $H_1, H_2, \ldots, H_{k-1}$ be the corresponding hypotheses. The Finner procedure rejects $H_1$–$H_{i-1}$ if $i$ is the smallest integer such that $p_i > 1-(1-\alpha)^{(k-1)/i}$, while the adjusted Finner p-value is defined by:
$$ APV_i = \min\left\{1,\; \max\left\{1-(1-p_j)^{(k-1)/j} : 1 \leq j \leq i\right\}\right\} $$
where $p_j$ is the p-value obtained for the $j$-th hypothesis and $1 \leq j \leq i$. It is worth mentioning that the test rejects the hypothesis of equality when the adjusted Finner p-value $APV_i$ is less than $\alpha$.
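The step-down adjustment can be sketched as follows; the class and method names are illustrative and the example p-values in main are arbitrary.

```java
import java.util.Arrays;

public class FinnerAdjustmentSketch {

    /**
     * Computes Finner adjusted p-values for the (k - 1) pairwise comparisons
     * against the control algorithm, following the formula given above.
     * The output preserves the sorted (ascending) order of the p-values.
     */
    public static double[] finnerAdjustedPValues(double[] pValues) {
        int n = pValues.length;          // n = k - 1 hypotheses
        double[] sorted = pValues.clone();
        Arrays.sort(sorted);

        double[] apv = new double[n];
        double runningMax = 0.0;
        for (int i = 0; i < n; i++) {
            // 1 - (1 - p_j)^{(k-1)/j} with j = i + 1 (1-based index)
            double adjusted = 1.0 - Math.pow(1.0 - sorted[i], (double) n / (i + 1));
            runningMax = Math.max(runningMax, adjusted);
            apv[i] = Math.min(1.0, runningMax);
        }
        return apv;
    }

    public static void main(String[] args) {
        // Usage sketch with arbitrary illustrative p-values:
        double[] apv = finnerAdjustedPValues(new double[] {0.001, 0.020, 0.048});
        System.out.println(Arrays.toString(apv));
    }
}
```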
Table 3, Table 4 and Table 5 present the statistical analysis performed by non-parametric multiple comparison procedures for each of the three labeled ratios, respectively. The best (lowest) ranking obtained in each FAR test determines the control algorithm for the post-hoc test. Moreover, the adjusted p-value obtained with Finner’s test (Finner APV) is presented, based on the control algorithm, at the chosen level of significance. Clearly, the proposed algorithm exhibits the best overall performance, outperforming the rest of the self-labeled algorithms, since it reports the highest probability-based ranking and presents statistically better results for all labeled ratios.
5. Conclusions and Future Research
In this work, we presented a new SSL algorithm which is based on a self-training philosophy. More specifically, our proposed algorithm automatically selects the best base learner, relative to the number of the most confident predictions of unlabeled data.
The efficiency of the proposed semi-supervised algorithm was evaluated on several benchmark datasets in terms of classification accuracy, utilizing the most frequently used base learners (C4.5, kNN and JRip) and different ratios of labeled data. Our numerical results, as well as the presented statistical analysis, demonstrate that the AAST algorithm outperforms its component SSL algorithms, confirming the effectiveness and robustness of the proposed method. Therefore, the presented methodology seems to lead to more efficient, stable and robust predictive models.
In our future work, we intend to pursue extensive empirical experiments in order to compare the proposed self-labeled method AAST with methods belonging to other SSL classes, such as generative mixture models [14,41], transductive SVMs [42,43,44], graph-based methods [45,46,47,48,49], extreme learning methods [50,51,52] and expectation maximization with generative mixture models [14,53]. Furthermore, since our experimental results are quite encouraging, our next step is the use of other supervised classifiers as base learners, such as neural networks [54], support vector machines [55] or ensemble-based learners [26], aiming to enhance our proposed framework with more sophisticated and theoretically motivated selection criteria for the most promising classifier, in order to study the behavior of AAST at each cycle. Finally, an interesting aspect is the evaluation of the proposed algorithm in specific scientific fields applying real-world datasets, such as educational, health care, etc., and the exploration of its performance on imbalanced datasets [56,57] using more sophisticated performance metrics, such as Sensitivity, Specificity, F-measure, AUC and the ROC curve [58,59].