A Soft-Voting Ensemble Based Co-Training Scheme Using Static Selection for Binary Classification Problems

Abstract: In recent years, a forward-looking subfield of machine learning has emerged with important applications in a variety of scientific fields. Semi-supervised learning is increasingly being recognized as a burgeoning area embracing a plethora of efficient methods and algorithms seeking to exploit a small pool of labeled examples together with a large pool of unlabeled ones in the most efficient way. Co-training is a representative semi-supervised classification algorithm originally based on the assumption that each example can be described by two distinct feature sets, usually referred to as views. Since such an assumption can hardly be met in real-world problems, several variants of the co-training algorithm have been proposed dealing with the absence or existence of a natural two-view feature split. In this context, a Static Selection Ensemble-based co-training scheme operating under a random feature split strategy is outlined regarding binary classification problems, where the base ensemble learner is a soft-Voting one composed of two participants. Ensemble methods are commonly used to boost the predictive performance of learning models by using a set of different classifiers, while the Static Ensemble Selection approach seeks to find the most suitable structure of ensemble classifier based on a specific criterion through a pool of candidate classifiers. The efficacy of the proposed scheme is verified through several experiments on a plethora of benchmark datasets, as statistically confirmed by the Friedman Aligned Ranks non-parametric test over the behavior of classification accuracy, F1-score, and Area Under Curve metrics.


Introduction
In recent years, machine learning (ML) research has placed much emphasis on learning from both labeled and unlabeled examples, a direction mainly expressed by semi-supervised learning (SSL) [1]. SSL is increasingly being recognized as a burgeoning area embracing a plethora of efficient methods and algorithms seeking to exploit a small pool of labeled examples together with a large pool of unlabeled ones in the most efficient way. Since in most real-world applications there is an abundance of unlabeled examples, while labeled examples are either difficult or expensive to obtain, SSL has emerged as a promising domain with important applications and substantial results in a variety of scientific fields [2,3].
In general, SSL methods are commonly divided into two key tasks: semi-supervised classification (SSC) for discrete-value output variables and semi-supervised regression (SSR) for real-value ones [4]. The classification task, usually referred to as pattern recognition in engineering or discriminant analysis in statistics [5], has been widely studied under the semi-supervised framework for assigning any given example to one of the class labels in a predetermined set. Depending on the number of class labels, classification problems may be either binary (with two class labels) or multi-class (with more than two class labels). For that purpose, a number of semi-supervised algorithms have been developed and successfully implemented, such as self-training [6], co-training [7], and tri-training [8], as well as approaches that are based on semi-supervised support vector machines and transductive learning or SSL graph-based methods, to name just a few [9].
Co-training is a representative multi-view SSC algorithm originally based on the assumption that each example can be described by two distinct feature sets, usually referred to as views [7]. Then, two classification algorithms are trained separately on each view and the most confident predictions of each one on the unlabeled data are used to augment the training set of the other. Let $L_D$ denote a small set of labeled examples and $U_D$ a large set of unlabeled ones. The two separate classifiers $C_1$, $C_2$ are retrained on the enlarged set $L_D$ and the process is repeated for a predefined number of iterations or until a stopping criterion is satisfied, such as the $U_D$ pool becoming empty. However, since such a two-view assumption can hardly be met in real-world problems, several variants of the co-training algorithm have been proposed, dealing with the absence or existence of a natural two-view feature split, with notable results.
In addition to these, ensemble learning, also known as committee-based learning or learning multiple classifier systems, has emerged recently and is considered one of the most adequate solutions for building powerful and accurate classification models [10]. Instead of using one algorithm for building a learning model, ensemble methods are commonly used to construct and combine a set of classifiers, either weak or strong, generally called base learners. The effectiveness of an ensemble method depends fundamentally on the careful selection of both the base learners [11] and the combination method for producing the final hypothesis [12]. Averaging (simple or weighted [13]) and voting (majority, unanimity, plurality, or even weighted votes) are popular and commonly used combination methods [14], depending on the problem which needs to be resolved [10]. Moreover, approaches using committees of base learners in the core of their learning process have also been demonstrated recently, presenting encouraging results [15].
Accurate and robust decisions inside a semi-supervised scheme play a cardinal role, especially in cases where the number of initially labeled instances is quite small and no decision correcting or editing mechanisms have been placed inside the learning kernel. Therefore, strategies trying to build an ensemble optimizing well-defined criteria could be a useful asset of a compact semi-supervised algorithm for selecting the most suitable structure per task. Although this concept has been highly exploited under the supervised mode, only a few works in the related literature apply similar approaches [16,17]. Furthermore, there are some works in this field that apply mechanisms suitably combining the decisions of selected base learners, failing, however, to state the reasons for their choice, apart from some generic properties, such as the combination of one generative and one discriminative approach that favors the diversity of the applied co-training algorithm [18].
In this context, a soft-Voting ensemble-based co-training scheme using a static selection strategy, regarding binary classification problems, is proposed. Since co-training primarily relies on the multi-view assumption, a heuristic scheme is adopted for manually generating the two views. Although this random split may act towards injecting diversity into a multi-view SSL approach, the main asset of the proposed algorithm is the construction of an ensemble learner choosing among five different classifiers, based on a novel objective function that measures the efficacy of any examined pair of classifiers per dataset, operating under a soft-Voting scheme. The efficacy of the proposed mechanism for selecting the ensemble's participants is verified through several experiments against both single-view SSL variants (through the well-known self-training scheme) and the co-training scheme, applying all of the 10 different pairs of algorithms into the same soft-Voting learner and the five individual classifiers, on a plethora of benchmark datasets over five separate labeled ratio values.
The obtained results regarding two well-known classification metrics are statistically confirmed by the applied Friedman Aligned Ranks non-parametric test.
The rest of this paper is organized as follows: the co-training framework is presented in Section 2, together with a review of recent studies concerning both the application of co-training in real-world applications and Ensemble Selection strategies. In Section 3, we describe in detail the proposed co-training scheme operating under a random feature split, using internally a static selection strategy regarding a soft-Voting algorithm. Section 4 includes the experiments carried out along with the relevant results. Finally, in Section 5 we comment on the results, considering some thoughts for future work.

Related Works
This section consists of two different parts, which highlight the two main points related to our proposed work. After having mentioned some of the most important works towards these directions, we can summarize our main contributions in the next section, facilitating the structure of this work.

Co-Training Studies
The first part of this section is dedicated to the co-training scheme and some of the numerous variants that have been demonstrated. Co-training is deemed to be a representative multi-view SSL method established by Blum and Mitchell [7] for binary classification problems and, in particular, for categorizing web pages either as course or as non-course. It is based on the premise that each example can be naturally divided into two separate sets of features, usually referred to as views, which is clearly an assumption of great importance for the implementation of the particular method. Co-training is identified as a "two-view weakly supervised algorithm" [6] since it incorporates the self-training approach to separately teach each one of the two supervised classifiers in the corresponding feature view and boost the classification performance, exploiting the unlabeled examples in the most efficient manner [19]. Moreover, Zhu and Goldberg consider co-training as a wrapper method which is not affected by the two supervised classifiers employed in the relevant procedure, provided that they produce good predictions on unlabeled data [20]. Several modifications have been implemented since then, including mutual-learning and co-EM, which brings together the co-training and Expectation-Maximization (EM) approaches, exploiting mainly simple classifiers like naive Bayes (NB) [7,21].
In addition to the "two view" assumption, the effectiveness of the particular method depends largely on two other key assumptions: the first one is that each view is adequate for classifying the unlabeled data using a small set of labeled examples for training, while the second one is that each view is conditionally independent given the class label. When either of these assumptions is not met, different co-training variants have been proposed with comparable results. In the case where the "two view" assumption is not fulfilled, a random feature partition could take place to facilitate the application of the method, as proposed by Zhu and Goldberg [20]. In such cases, the feature set is partitioned into two subsets of almost equal size, which henceforth form the two feature views, while different classifiers $C_1$, $C_2$ are employed. In addition, the same classifiers may be used under different configuration parameters, thus ensuring the diversity between them [22].
The number of studies that propose co-training as an effective SSL method is rather limited. One of these is presented in [22], where sentiment analysis is the main focus, while in [23] the authors have tackled a health care issue. Although this type of problem is widely popular, and even though any shortcomings associated with the need for a large amount of labeled data can be efficiently handled by other SSL methods, co-training does not seem to have been explored thoroughly enough. In this study, three different sources of text data were examined: news articles, online reviews, and blogs. A number of co-training variants were designed, focusing on the way the split of the feature space takes place and fitting appropriately the specific properties that characterize text data, such as the creation of one view by unigrams and the other by bigrams, or by adopting character-based language models and bag-of-words models, respectively. The produced results demonstrate the effectiveness of the co-training algorithm.
Another task that has been efficiently tackled by using the co-training method is that of drug discovery, where classification methods need to be applied so as to predict the suitability of some molecules considering treatments of diseases and their possibly induced side-effects during the initial steps of tedious experiments [24]. Accurate predictions may save both time and money, since fewer combinations would be investigated and the final results could be acquired much faster. In this work, two different views were available, stemming from chemistry and biology, and had to be mixed to reach the final conclusion. The approaches that were examined may be summed up as follows: (i) access separately each view either with a base classifier or the partial least squares (PLS) regression method [25], (ii) fuse the different views, either by joining the heterogeneous data without any preprocessing or after having applied the PLS method, also used for dimensionality reduction, and (iii) a modification of the co-training method (co-FTF). Ensemble tree-based learners were preferred in this last approach, handling imbalanced datasets appropriately and leading to promising results, while examining two labeled ratio scenarios. In addition, a random forest of predictive clustering trees was incorporated in a self-training scheme for multi-target regression, thus improving the performance of the employed SSL approach [26].
An expansion of the co-training algorithm, which includes an ensemble of tree-based learners as base learner, has been proposed in [27]. Under the assumptions that are presented there, the necessity of two sufficient and redundant views has been eliminated for the proper operation of Co-Forest. Furthermore, the bootstrap method that is exploited during the creation of the included decision trees provides the required diversity and, at the same time, reduces the chance of exporting biased decisions, leading to an efficient operation of the SSL scheme. Adaptive Data Editing based Co-Forest (ADE-Co-Forest) [28] constitutes a variant of the original Co-Forest algorithm, introducing an internal mechanism in order to tackle the mislabeled instances, thus improving the total predictive behavior, since both false negative and false positive error rates are further reduced compared to its ancestor. A boosted co-training algorithm has also been proposed for a real-world task, namely human action recognition, which is based on the mutual information and the consistency between labeled and unlabeled data. Two metrics, named inter-view and intra-view confidence, are introduced and exploited dynamically so as to select the most appropriate subset of the unlabeled pool with the corresponding pseudo-labels [29].
Recently, a quite effective co-training method was introduced in [30] for early prognosis of undergraduate students' performance in the final examinations of a distance learning course, based on attributes which are naturally divided into two separate and independent views. The first one concerns students' characteristics and academic achievements, which are manually filled out by tutors, while the second one refers to attributes tracking students' online activity in the course learning management system, which are automatically recorded by the system. It should be mentioned that semi-supervised multi-view learning has also been successfully applied for gene network reconstruction, combining the interactions predicted by a number of different inference methods [19]. In a similar work, an ensemble-based SSL approach has been proposed for the computational discovery of miRNA regulatory networks from large-scale predictions produced by different algorithms [31].
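As an illustration of such a random feature partition, the following minimal Python sketch splits the feature indices into two disjoint views of almost equal size. It is only a sketch under stated assumptions: the helper names `random_feature_split` and `project`, the fixed seed, and the toy data are hypothetical and are not taken from the cited works.

```python
import random

def random_feature_split(n_features, seed=0):
    """Randomly partition feature indices 0..n_features-1 into two disjoint views."""
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    half = (n_features + 1) // 2  # the first view gets the extra feature when odd
    view_1 = sorted(indices[:half])
    view_2 = sorted(indices[half:])
    return view_1, view_2

def project(rows, view):
    """Keep only the columns of `rows` that belong to `view`."""
    return [[row[i] for i in view] for row in rows]

# Toy data: two instances described by seven features.
X = [[0, 1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15, 16]]
view_1, view_2 = random_feature_split(7, seed=42)
X1, X2 = project(X, view_1), project(X, view_2)
```

Each view can then be passed to its own classifier, while the class labels are shared by both projections.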

Ensemble Selection Strategies
The second part briefly reports some of the most important points related to the Ensemble Selection concept [32,33]. To be more specific, some usual keywords in this field are Multiple Classification Systems (MCSs), Static Ensemble Classifier (SEC), and Dynamic Ensemble Classifier (DEC), as well as classifiers' competence and diversification. These terms are connected by the fact that, when a new ensemble learner is designed, the main ambition is the employment of complementary and diverse participants, following the main asset of MCSs regarding the continuous increase of the predictive rate. The main difference between the remaining two terms is the fact that SEC strategies examine a global solution regarding the total set of unknown instances, while the DES approaches provide a separate solution per test instance using mainly local restrictions. Despite their distinct roles, they can be combined under hybrid mechanisms sharing similar measurement metrics or ML techniques for converging to their decisions [34,35].
Ensemble Selection has been inserted as a new stage into the original chain of constructing an ensemble learner, taking into consideration both the computational needs that arise when we trust ensembles with too many participants and the benefit of discarding less accurate models or models that reduce the internal diversity. This tactic is usually referred to as ensemble pruning or selective ensemble. A taxonomy of these techniques has been proposed in [36], assigning them to four different categories: (i) ranking-based, (ii) clustering-based, (iii) optimization-based, and (iv) others, including the remaining techniques that cannot be strictly categorized to any of the previous three subsets. Another taxonomy was demonstrated in 2014, concerning mainly the actual need of DES in practice and the relation between the inherent complexity of the classification problem, measured by appropriate metrics, and the contribution of the examined Dynamic Selection approaches [16]. Prototype selection techniques have also been examined in the abovementioned framework, acting beneficially towards both reducing computational resources and boosting the classification accuracy [37]. Furthermore, one related work in the field of SSL has been proposed using the competence of selected classifiers that stems from an affinity graph, achieving smoothness of the decisions for neighboring data [17].

The Proposed Co-Training Scheme
Motivated by the above studies, in the present paper we make an attempt to put forward an ensemble-based co-training scheme for binary classification problems, adopting a strategy of choosing the base classifiers of the ensemble from an available pool of candidate classification algorithms per dataset. The most important points concerning our contribution are outlined below:

• We propose a multi-view SSL algorithm that handles efficiently both labeled (L) and unlabeled (U) data in the case of binary output variables.
• Instead of demanding two sufficient and redundant views, a random feature split is applied, thereby increasing the applicability and improving the performance of the finally formed algorithm [38].
• We introduce a simple mechanism that controls the number of unlabeled examples mined per class, in order to avoid overfitting phenomena in cases where imbalanced datasets must be assessed.
• We insert a preprocess stage, where a pool of single learners is mined by a Static Ensemble Selection algorithm to extract a powerful soft-Voting ensemble per different classification problem, seeking to produce a more accurate and robust semi-supervised algorithm operating under small labeled ratio values.
Let the whole dataset ($X$) consist of $n$ instances and $k$ features, apart from the class variable ($Y$) that, in the context of this work, is restricted to be binary. Thus, without loss of generality, we assume that $y_i \in \{0, 1\}$ for each labeled instance $\{l_i, 1 \le i \le n_l\}$, while each unlabeled instance $\{u_i, 1 \le i \le n_u\}$ is characterized by the absence of the corresponding $y_i$ value. The parameters $n_l$ and $n_u$ represent the cardinality of the $L$ and $U$ subsets, respectively. After having removed all missing values, leading to a new cardinality of total instances ($n'$), it is evident that the following equation holds:

$$n' = n_l + n_u \tag{1}$$

Since the datasets may hold both numeric and categorical features, all the features of the latter form are converted into binary ones, increasing the initial number of $k$ features to $k'$ in case $X$ contains at least one of them. Otherwise, since no augmentation of the initial features has been applied, the two quantities coincide: $k \equiv k'$. Under this generic approach, classification algorithms that cannot handle categorical data, or cannot efficiently handle a mix of numerical and categorical data, are not excluded from the proposed process, although the manipulation of heterogeneous features remains an open issue [39]. Afterwards, without introducing any specific assumption about the relationship or the origin of any included feature, the available feature vector $F: \langle f_1, f_2, \ldots, f_{k'} \rangle$ is split into two newly formed subsets $F_1$ and $F_2$, where $F = F_1 \cup F_2$. Hence, two different datasets $X_1$, $X_2$ are generated, respectively, with disjoint feature sets but sharing the same class variable $Y$. Therefore, the final hypothesis space can be summarized as follows: $F_{view}: X_{view} \to [0, 1]$, where $view = 1, 2$.
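The conversion of categorical features into binary ones, which increases the number of features from k to k', can be sketched as follows. The sketch assumes a one-hot (indicator) encoding, which is one common way to binarize categorical features and is not confirmed as the exact encoding used here; the function name `binarize_categorical` and the toy rows are hypothetical.

```python
def binarize_categorical(rows, categorical_columns):
    """One-hot encode the given columns; numeric columns are copied unchanged."""
    # Collect the sorted distinct values (levels) of every categorical column.
    levels = {c: sorted({row[c] for row in rows}) for c in categorical_columns}
    encoded = []
    for row in rows:
        new_row = []
        for c, value in enumerate(row):
            if c in levels:
                # One binary indicator per level of the categorical feature.
                new_row.extend(1 if value == level else 0 for level in levels[c])
            else:
                new_row.append(value)
        encoded.append(new_row)
    return encoded

# k = 2 original features (one numeric, one categorical with three levels).
rows = [[5.1, "red"], [4.9, "blue"], [6.0, "green"]]
encoded = binarize_categorical(rows, categorical_columns=[1])
k_prime = len(encoded[0])  # 1 numeric column + 3 binary indicators
```

When no categorical column is present, the rows are returned unchanged, which corresponds to the case where k and k' coincide.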
The methodology described above enables two choices: either to apply a common learning strategy for both views, such as adopting the same learner, or to tackle each view separately, depending on underlying properties such as the views' cardinalities, independence or correlation assumptions that affect the views' internal structure, or other kinds of relationships that specify the nature of each view, since two distinct tasks have been raised. Following the majority of the existing approaches found in the literature and taking into consideration that a random feature split operates as an agnostic factor regarding the structure of the constructed views, the first approach was adopted in the present study [40].
Under this strategy, and before the common base learner is built per view, a preprocess stage is inserted. This aims to measure the class imbalance of the provided training set and to define the number of instances that have to be mined from each class per iteration ($Mined_{class0}$, $Mined_{class1}$). Due to the SSL concept, the quota of the $L$ and $U$ subsets is defined by a labeled ratio value ($R$). Given this setting, the size of the initial training set ($L^{0}_{view}$) is computed according to the following formula:

$$|L^{0}_{view}| = R \cdot n' \tag{2}$$

The cardinalities of both classes ($C_{max}$, $C_{min}$) are then computed regarding the available $L^{0}_{view}$. The number of instances mined from the less populated class is set equal to 1 ($Mined_{class0}$), while that of the other class is set equal to $C_{max}/C_{min}$ ($Mined_{class1}$). In this way, the provided class distribution of the labeled instances is assumed to be representative of the total problem, which is also defined by the unknown instances that must be assessed. Finally, these two variables are exploited during the learning stage to retrieve a suitable number of unlabeled instances per class during each iteration.
As regards the choice of the base learners, we selected five representative algorithms from different learning families, capturing a wide spectrum of properties, concerning both assets and defects, which should be combined and avoided, respectively, in order to construct an accurate and robust enough ensemble learner per dataset so as to initialize the co-training process [41]. For this purpose, our pool of classifiers ($C$) consists of support vector machines (SVMs) [42], k-nearest-neighbors (kNN) [43], a simple tree inducer (DT) from the family of decision trees [44], naive Bayes (NB) [45], and logistic regression (LR) [46]. In order to keep the computational needs of the exported ensemble low, we restrict the cardinality of classifier participants under our Voting scheme, setting this number equal to 2.
Thus, we had to employ a soft variant of the Voting scheme, which takes into account the class probabilities of each algorithm and combines these decisions through an averaging process, instead of hard voting through on-off decisions [29], where ties would occur too frequently with an even number of base learners. Furthermore, averaging the decisions of the individual participants generally reduces the ensemble's variance and helps to overcome the sensitivity to the input data that is usually detected in more unstable methods. To be more specific, assume a binary classification problem with a set of labels $\Upsilon = \{0, 1\}$ and a feature space $X \in \mathbb{R}^{k}$, such that any probabilistic classifier $F$ is a function $F: X \to \Upsilon$. Then, for each instance $m$, the decision profile of learner $j$ is a pair of class probabilities $P_{j0}$, $P_{j1}$ which sum up to 1. Consequently, a simple soft-Voting classifier without weighting factors, given an instance $x_m$, combines the decisions of all the candidate classification algorithms, searching for the most probable class ($\omega$) as follows:

$$\hat{y}_m = \arg\max_{\omega \in \{0, 1\}} \frac{1}{p} \sum_{j=1}^{p} P_{j\omega}(x_m) \tag{3}$$

The class with the largest average probability is exported as the prevalent one through this pipeline, where $\hat{y}_m \in Y$ and $p$ denotes the number of the combined classifiers.
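The unweighted soft-Voting rule described above, which averages the class probabilities of the participants and exports the class with the largest average, can be illustrated with a short Python sketch. The function name `soft_vote`, the tie-breaking choice, and the example probabilities are assumptions made for illustration only.

```python
def soft_vote(decision_profiles):
    """Average the class-probability pairs of p classifiers and pick the arg max.

    decision_profiles: list of (P_j0, P_j1) tuples, one per classifier.
    Returns (predicted_label, averaged_probabilities).
    """
    p = len(decision_profiles)
    avg = [sum(profile[c] for profile in decision_profiles) / p for c in (0, 1)]
    label = 0 if avg[0] >= avg[1] else 1  # assumption: exact ties go to class 0
    return label, avg

# The two classifiers would tie under hard voting (one votes 0, the other 1),
# but the more confident one decides the soft vote.
label, avg = soft_vote([(0.55, 0.45), (0.10, 0.90)])
```

With two participants, hard voting has no way to resolve such a disagreement, whereas the averaged probabilities still single out one class.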
To describe the preprocess stage that constructs the base learner of the proposed co-training scheme, recall that the ambition of any Static Ensemble Selection strategy is to construct a subset $C^*$, such that $C^* \subset C$ and $|C^*| = 2$, which best satisfies the chosen criteria for obtaining the most desired performance over test instances. In our case, we search for the most compatible pair of learners that maximizes our proposed criterion under an unweighted soft-Voting scheme. Through this, we measure the number of instances for which the decision of the soft-Voting scheme remains correct when the two candidate participants disagree ($q_{corrected}$), normalized by the total number of disagreements based on the label of the examined instances ($q_{disagree}$), as well as the rate of non-common errors ($1 - q_{common\ errors}/v$). To this end, we introduce the objective function of Equation (4), which is defined as a linear combination of the mentioned quantities:

$$Q^{a}_{soft}(i, j) = a \cdot \frac{q_{corrected}}{q_{disagree}} + (1 - a) \cdot \left(1 - \frac{q_{common\ errors}}{v}\right) \tag{4}$$

where $a$ is a parameter that balances the importance of the included terms. The first term rewards the pair of classifiers that managed to act in a complementary way: the more often the confidence of the classifier that correctly guessed the class label overpowers that of the erroneous one, the larger this term becomes. On the other hand, the second term penalizes the pair of classifiers whose common decisions coincide with mislabeling cases, since its value is reduced when such behavior occurs. The parameter $v$ denotes the cardinality of the validation set over which the rest of the quantities are calculated. Giacinto and Roli called this diversity measure "the double-fault measure" [47].
Although an analysis of the selected $a$ value could be of interest for further research, we selected the value of 0.5 for equal importance. Thus, for each examined dataset $D$, which contains both labeled and unlabeled data, we split the labeled set into train and validation sets, in the same manner as the default k-fold cross-validation strategy, applying the previously described Static Ensemble Selection strategy so as to detect the most favorable pair of classifiers for our soft-Voting ensemble learner. In case $q_{disagree} = 0$, $a$ is set equal to 0, holding only the second term.
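To make the selection criterion concrete, the following Python sketch scores every pair of candidate classifiers on a single validation set and returns the best one. It assumes that the objective takes the form a · q_corrected / q_disagree + (1 − a) · (1 − q_common errors / v), which is one reading of the description above, and it uses a single validation split instead of the k-fold procedure. The names `q_soft` and `select_pair`, the classifier labels "A", "B", "C", and the probabilities are hypothetical.

```python
from itertools import combinations

def hard(p1):
    """Hard label derived from the class-1 probability."""
    return 1 if p1 > 0.5 else 0

def q_soft(y_true, proba_i, proba_j, a=0.5):
    """Score a pair of classifiers on a validation set of size v."""
    v = len(y_true)
    q_disagree = q_corrected = q_common_errors = 0
    for y, pi, pj in zip(y_true, proba_i, proba_j):
        hi, hj = hard(pi), hard(pj)
        if hi != hj:
            q_disagree += 1
            if hard((pi + pj) / 2) == y:  # the soft vote of the pair is right
                q_corrected += 1
        elif hi != y:
            q_common_errors += 1  # both members make the same mistake
    if q_disagree == 0:
        return 1 - q_common_errors / v  # a is set to 0: only the second term
    return a * q_corrected / q_disagree + (1 - a) * (1 - q_common_errors / v)

def select_pair(y_true, probas, a=0.5):
    """Return the pair of classifier names with the largest score."""
    return max(
        combinations(sorted(probas), 2),
        key=lambda pair: q_soft(y_true, probas[pair[0]], probas[pair[1]], a),
    )

y_val = [1, 0, 1, 0]
probas = {  # hypothetical class-1 probabilities on the validation set
    "A": [0.9, 0.2, 0.4, 0.6],
    "B": [0.6, 0.4, 0.9, 0.1],
    "C": [0.3, 0.7, 0.2, 0.8],
}
best = select_pair(y_val, probas)
```

In this toy case the pair (A, B) is preferred: its two members never fail on the same instance, and whenever they disagree the averaged probability sides with the correct one.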
Exploiting the exported soft-Voting ensemble learner as the base learner of our co-training variant, each $L^{0}_{view}$ is fitted with $Co(Vote_{soft}(C^*_i, C^*_j)) \equiv Co(Vote^{SEC}_{soft})$, where $C^*_i$ and $C^*_j$ denote the selected learners included in $C^*$, and the corresponding class probabilities for each unlabeled instance per view ($u^{i}_{view}$) are computed per iteration. Next, only the top-class0 and top-class1 instances per class are selected, based on the estimated confidence measure. Subsequently, these instances are removed from the current $U$ subset (since both views share the common unlabeled set, no view index is needed when referring to the $U$ subset). Then, they are added to the training set of the opposite view along with the most prominent class label based on the base learner's decision. Therefore, if the target variable of the $m$-th instance of $U$ is categorized as class0 by the $F_1$ classifier ($x_m: \langle f_1, f_2, \ldots, f_{k'/2} \rangle$ with $prob^{first}_{class0} > 0.5$), then the $L^{iter}_{2}$ subset during the iter-th iteration has to be augmented with the same instance, using the corresponding features of the second view and the estimated class variable ($x_m: \langle f_{k'/2+1}, f_{k'/2+2}, \ldots, f_{k'} \mid class0 \rangle$).
According to this learning scheme, whose main ambition is to teach two different learners of the same classification algorithm through the mutual disagreement concept, each learner injects into the other the information that is retrieved from the supplied view per iteration. A more theoretical analysis of the error bounds that can be achieved through the disagreement-based concept in the case of co-training can be found in [48]. Since our strategy constructs the base learner of co-training through a static ensemble selection mechanism ($Vote^{SEC}_{soft}$), we assume that we provide an accurate enough algorithm whose competence and diversity have both been verified on a validation set, so as to avoid overfitting phenomena or heavy mislabeling behaviors.
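One exchange step of this procedure can be sketched in Python as follows: the per-class mining quota is derived from the labeled class distribution, the most confident unlabeled instances per class are picked from one view's predictions, and they are moved into the labeled set of the opposite view. This is a simplified sketch under stated assumptions: instances are tracked by identifier only, the ratio C_max/C_min is rounded to the nearest integer (the rounding rule is an assumption), and all function names and values are hypothetical.

```python
from collections import Counter

def mined_per_class(labels):
    """Instances to mine per class in each iteration: 1 for the minority class,
    C_max / C_min for the majority class."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes in the labeled set")
    (major, c_max), (minor, c_min) = counts.most_common(2)
    # Assumption: the ratio is rounded to the nearest integer, never below 1.
    return {minor: 1, major: max(1, round(c_max / c_min))}

def pick_confident(proba_class1, quota):
    """Select the most confident unlabeled instances per predicted class.

    proba_class1: dict mapping instance id -> P(class 1) given by one view.
    Returns a list of (instance_id, pseudo_label) pairs.
    """
    picked = []
    for label in (0, 1):
        # Confidence for `label`, restricted to instances predicted as `label`.
        candidates = [
            (p if label == 1 else 1 - p, idx)
            for idx, p in proba_class1.items()
            if (p > 0.5) == (label == 1)
        ]
        candidates.sort(reverse=True)
        picked.extend((idx, label) for _, idx in candidates[: quota[label]])
    return picked

def exchange(unlabeled, labeled_other_view, proba_class1, quota):
    """Move the picked instances from U into the opposite view's labeled set."""
    for idx, label in pick_confident(proba_class1, quota):
        labeled_other_view[idx] = label
        unlabeled.discard(idx)

quota = mined_per_class([0, 0, 0, 0, 1, 1])  # class 0 is twice as frequent
U = {10, 11, 12, 13, 14}
L2 = {}  # pseudo-labels collected for the second view
view1_proba = {10: 0.95, 11: 0.05, 12: 0.60, 13: 0.20, 14: 0.10}
exchange(U, L2, view1_proba, quota)
```

A full iteration would repeat the same step with the roles of the two views swapped and then refit both learners on their enlarged labeled sets.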
To sum up, the pseudo-code of the introduced SEC strategy (SSoftEC) as well as the proposed co-training variant are presented in Algorithms 1 and 2, respectively.

Input:
  L: labeled set
  f: number of folds to split L
  C: pool of classification algorithms exporting class probabilities
  a: value of the balancing parameter
Main Procedure:
  For each i, j ∈ {0, 1, . . ., |C|} with i ≠ j do
    Set iter = 0, Q^a_soft(i, j) = 0
    Split L into f separate folds: L^(1), L^(2)
    Update Q^a_soft(i, j) according to Equation (4)
    iter = iter + 1
Output:
  Return the pair of indices (i, j)* such that (i, j)* = arg max_{i, j} Q^a_soft(i, j).

Mode:
  Pool-based scenario over a provided dataset D = X_(n × k) Y_(n × 1)
  x_i: vector with k features <f_1, f_2, . . .

Experimental Procedure and Results
For the purpose of our study, a number of experiments were carried out on 27 benchmark datasets for binary classification problems from the UCI Machine Learning Repository [49] (Table 1), where the sign # denotes the cardinality of the corresponding quantity. Note that the column entitled # Features in Table 1 counts all the features apart from the class variable. These datasets were partitioned into 10 equal-sized folds using the stratified 10-fold-CV resampling procedure, so that each fold has the same class distribution as the entire dataset [50]. This process was repeated 10 times, until every fold had been used as the testing set, and the results were averaged. Moreover, each fold was divided into two subsets, one labeled (L) and one unlabeled (U), in accordance with a selected labeled ratio value (R), which is defined as follows:

$$R = \frac{|L|}{|L| + |U|}. \quad (5)$$

In order to study the influence of the amount of labeled data in the training set, three different ratios were used: 10%, 20%, and 30%. In general, researchers are mostly interested in the smaller R values (R < 50%), which are consistent with the practical SSL scenario.

The effectiveness of the proposed co-training scheme was compared to several co-training and self-training variants. To verify the superiority of Vote_SEC^soft as base classifier, we built the soft-Voting versions for all pairs drawn from the pool of candidate classifiers (C). Furthermore, the version that exploits the decisions of all the participants of the C pool was implemented, as well as the individual variants without voting. Thus, 16 different supervised classifiers were employed as base learners, all imported from the scikit-learn Python library [51]. The five members of C are:
• The SVMs, using the Radial Basis Function as kernel, a universal learner that tries to separate instances using hyper-planes and the 'kernel trick' [52],
• The k-Nearest Neighbor (kNN) instance-based learner [53] with k equal to 5, a very effective method for classification problems, using the Euclidean metric to determine the distance between two instances,
• A simple Decision Tree (DT) algorithm, a variant of tree induction algorithms with large depth that splits the feature space using the 'gini' criterion [44],
• The NB probabilistic classifier, a simple and quite efficient classification algorithm based on the assumption that features are independent of each other given the class label [54],
• The Logistic Regression (LR), a well-known discriminative algorithm that assumes the log likelihood ratio of class distributions is linear in the provided examples. Its main function supports the binomial case of the target variable and exports posterior probabilities directly. In our implementation, the L2-norm was chosen for the penalization stage [55].
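The pool described above yields 5 individual learners, 10 pairwise soft-Voting ensembles, and one ensemble over all five, i.e., 16 base learners. A minimal sketch of how such a set could be assembled with scikit-learn is given below. The helper names `make_pool` and `make_base_learners`, the Gaussian variant of NB, and all hyper-parameters not stated in the text are assumptions for illustration; the actual configuration is the one listed in Table 2.

```python
from itertools import combinations

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_pool():
    """Return fresh, unfitted instances of the five candidate classifiers (pool C)."""
    return {
        # probability=True lets the SVM expose predict_proba, needed for soft voting
        "SVM": SVC(kernel="rbf", probability=True),
        "kNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        "DT": DecisionTreeClassifier(criterion="gini"),
        "NB": GaussianNB(),  # assumption: the text does not name the NB variant
        "LR": LogisticRegression(),  # the L2 penalty is scikit-learn's default
    }


def make_base_learners():
    """Build the 16 base learners: 5 single, 10 pairwise soft votes, 1 vote over all."""
    learners = make_pool()
    for a, b in combinations(list(make_pool()), 2):
        fresh = make_pool()  # new instances so that ensembles do not share estimators
        learners[f"Vote({a},{b})"] = VotingClassifier(
            estimators=[(a, fresh[a]), (b, fresh[b])], voting="soft"
        )
    learners["Vote(all)"] = VotingClassifier(
        estimators=list(make_pool().items()), voting="soft"
    )
    return learners
```

Each returned object follows the usual `fit` / `predict_proba` interface, so any of them could be plugged into a self-training or co-training wrapper as the base learner.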
For simplicity, we use the following notation in the experiments, while the parameter configuration of all applied classification methods is presented in Table 2:
• the self-training variants, built from the same base learners as the co-training variants below, up to the case where all participants of C are exploited under the Voting scheme,
• Co(learner), which corresponds to the case that learner1 ≡ learner2 ≡ learner, with learner ∈ C,
• Co(Vote(learner_i, learner_j)), where learner_i, learner_j ∈ C with i ≠ j, and the same ensemble Voting learner is used for both views, as in the previous scenario,
• Co(Vote(all)), where all participants of C are exploited under the Voting scheme for each view, and finally,
• Co(Vote_SEC^soft), which coincides with the proposed semi-supervised algorithm.

As mentioned before, 10 different pairs of algorithms can be formed from a pool of five candidate classifiers. In addition, each classifier is also applied individually, and all participants of pool C are combined under the same Voting stage. Thus, 16 self-training variants and 16 co-training variants are examined against the proposed co-training algorithm, which selects its soft-Voting ensemble base learner per task through a static selection strategy. The parameter f was set equal to 10, leading to a 10-fold cross-validation procedure per examined dataset. The following tables report only one of the three labeled ratio scenarios and only the top five algorithms, as determined by the Friedman ranking over all algorithms, along with a smaller statistical comparison restricted to those five. For a deeper analysis, the complete results can be found at http://mL.math.upatras.gr/wp-content/uploads/2019/12/Official_results_co_training_ssoftec_voting.7z. Moreover, Figure 1 provides a pie chart depicting, in percentage terms, how often each pair of classifiers was employed by the proposed strategy as base
learner during all the experiments.

For evaluating the predictive performance of the proposed algorithm, three representative and widely used evaluation measures were computed over the test set: classification accuracy, F1-score, and Area Under the ROC Curve (AUC). Accuracy corresponds to the percentage of correctly classified instances, while F1-score is an appropriate metric for imbalanced datasets and is defined as the harmonic mean of recall (r) and precision (p). In the case of a binary classification problem, they are defined as:

$$\mathrm{Accuracy} = \frac{tp + tn}{n}, \quad (6)$$

$$F_1 = \frac{2 \times tp}{2 \times tp + fp + fn}, \quad (7)$$

where tp, tn, fp, fn, and n correspond to the number of true positive, true negative, false positive, false negative, and total number of instances, respectively. Finally, AUC reflects how well the examined classifier ranks randomly chosen instances and is computed by aggregating the corresponding performance across all possible classification thresholds. The most common way to visualize this metric is to plot TPR against FPR, i.e., Sensitivity against (1 − Specificity), at different classification thresholds, where TPR stands for True Positive Rate and FPR stands for False Positive Rate. Their analytical formulas are provided here:

$$TPR = \frac{tp}{tp + fn}, \quad (8)$$

$$FPR = \frac{fp}{fp + tn}. \quad (9)$$
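As a worked illustration of these measures, the short sketch below computes accuracy, F1-score, TPR, and FPR from confusion-matrix counts, and AUC as the fraction of positive/negative pairs that the classifier ranks correctly. The function names and the toy numbers are hypothetical and are not taken from the experiments.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, F1-score, TPR and FPR from the four confusion-matrix counts."""
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "tpr": tp / (tp + fn),  # sensitivity / recall
        "fpr": fp / (fp + tn),  # 1 - specificity
    }


def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy confusion matrix: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives.
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(m)  # accuracy 0.85, tpr 0.8, fpr 0.1, f1 = 80/95

# Toy scores for the positive class of four instances.
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise AUC computation is quadratic in the number of instances, which is acceptable for illustration; library routines use a sort-based equivalent.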
The experimental results using the 10% labeled ratio are summarized in Tables 3-5, where the best value per dataset is highlighted in bold. Overall, the co-training Vote variants appear to perform better than the corresponding self-training variants. Moreover, among the co-training variants employed, the proposed algorithm takes precedence over the rest on most of the datasets. In addition, we applied a familiar statistical tool to confirm the observed results. Hence, the Friedman Aligned Ranks [56] non-parametric test (significance level α = 0.05) was used to compare all the employed SSL methods (Table 6). According to the calculated results, the algorithms are sorted from the best performer (lowest ranking) to the worst one (highest ranking). The superiority of the Co(Vote_SEC^soft) algorithm is therefore statistically confirmed, while the null hypothesis H_0 (i.e., the means of the results of two or more algorithms are the same) is rejected.

Furthermore, the Nemenyi post-hoc test [57] (α = 0.05), a commonly used non-parametric test for pairwise multiple comparisons, was applied to detect the specific differences between the algorithms. Table 6 includes the computed Critical Difference (CD), which is the same for all the cases of this R-based scenario (CD = 2.27). The difference between the Co(Vote_SEC^soft) algorithm and the majority of the other methods is statistically significant in all examined metrics, thus verifying the predominance of the proposed co-training scheme. The fact that the proposed algorithm also outperforms the Vote(all) variants means that the implemented time-efficient SEC strategy provides a more accurate base learner for the field of SSL. Towards this direction, we visualize the performance of the proposed algorithm against Co(Vote(all)) for the examined metrics and the case of R = 10% via a violin plot, which facilitates the comparison of the distribution of the achieved values per algorithm and also shows some important statistical quantities: median, interquartile range, and 1.5× interquartile range (Figure 2). Therefore, the experiments support the success of the proposed approach on generic binary datasets, especially when the collected labeled instances are severely limited in number.
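To make the ranking step concrete, the sketch below computes Friedman aligned mean ranks for a score matrix (datasets × algorithms) and the usual Nemenyi critical difference, CD = q_α · sqrt(k(k + 1) / (6N)). It is a minimal pure-Python illustration under stated assumptions: the function names are hypothetical, higher scores are taken to be better, q_α must be looked up in a Studentized-range table, and nothing here is claimed to reproduce the CD value reported in Table 6.

```python
import math


def average_ranks(values):
    """Rank values in descending order (1 = largest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_aligned_mean_ranks(scores):
    """scores[d][a] is the metric of algorithm a on dataset d (higher is better)."""
    n_alg = len(scores[0])
    aligned = []
    for row in scores:
        centre = sum(row) / n_alg  # remove the per-dataset location effect
        aligned.extend(v - centre for v in row)
    ranks = average_ranks(aligned)  # all k * N aligned observations ranked together
    n_data = len(scores)
    return [sum(ranks[d * n_alg + a] for d in range(n_data)) / n_data
            for a in range(n_alg)]


def nemenyi_cd(q_alpha, n_alg, n_datasets):
    """Critical difference of the Nemenyi test for k algorithms over N datasets."""
    return q_alpha * math.sqrt(n_alg * (n_alg + 1) / (6 * n_datasets))


toy = [[0.90, 0.80, 0.70],
       [0.85, 0.80, 0.60],
       [0.95, 0.90, 0.70]]
print(friedman_aligned_mean_ranks(toy))  # lowest mean rank = best algorithm
print(nemenyi_cd(q_alpha=2.728, n_alg=5, n_datasets=27))  # q for k = 5, alpha = 0.05
```

Two algorithms would be declared significantly different when their mean ranks differ by more than the critical difference.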

Adoption of more dedicated preprocessing stages oriented towards more specific problems should be applied in order to boost the performance of the SSoftEC strategy and provide the co-training scheme with a more appropriate base learner [58][59][60]. However, in our generic experimental stage, which covers various applications, the proposed algorithm recorded both robust and sufficiently accurate performance, especially in terms of the F1-score metric, which is critical for real problems whose class distribution deviates from the optimal one. It did so in a computationally inexpensive manner, in contrast with DEC strategies that run a new classifier search per test instance. The smoothing of the decisions produced by the proposed soft-Voting ensemble seems to favor the exported decision profile, since a large number of decisions that were initially misclassified by individual predictions were reverted towards the ground-truth label. At the same time, numerous cases where the two participants disagree over the binary label were not affected. This happens because a large correct confidence value combined with a smaller incorrect one remains untouched under such a voting scheme, according to Equation (3).
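The effect described above can be illustrated with a small numeric sketch of two-participant soft voting, assuming that the combination rule averages the class confidences of the two learners and exports the class with the maximum averaged value. The function name and the confidence values are hypothetical.

```python
def soft_vote(proba_a, proba_b):
    """Average two learners' class-confidence vectors and pick the arg-max class."""
    averaged = [(a + b) / 2 for a, b in zip(proba_a, proba_b)]
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, averaged


# Ground truth: class 1. Learner A is confidently right, learner B is mildly wrong.
learner_a = [0.10, 0.90]
learner_b = [0.60, 0.40]
label, averaged = soft_vote(learner_a, learner_b)
print(label, averaged)  # class 1 wins: the large correct confidence outweighs the small incorrect one
```

Under hard voting the same two learners would be tied at one vote each, which is why the confidence-weighted average is the more informative rule for a two-member ensemble.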

Conclusions
In the present study, a soft-Voting ensemble-based co-training scheme through a Static Selection strategy operating under a random feature split, called Co(Vote_SEC^soft), was presented for binary classification problems. The proposed algorithm harnesses the benefits of the SSL approach and of ensemble methods that are built using heterogeneous approaches [61][62][63]. The experimental results on 27 benchmark datasets demonstrate the prevalence of the proposed algorithm in terms of classification accuracy and F1-score, compared to several co-training and self-training variants, under three different labeled ratios. Thus, employing a static selection strategy that tries to find a suitable combination among five provided classifiers to feed an ensemble scheme seems to have favored the final predictive ability, without consuming many computational resources during the preprocessing step per task. This was achieved by using the soft-Voting strategy: for each class, the maximum averaged confidence is exported as the most prominent, taking into consideration the corresponding confidence values of both exploited learners.
Regarding our initial ambition, the co-training scheme seems more favorable than the self-training approach for exploiting unlabeled instances and augmenting the initially collected labeled instances with the most reliable of the former, even under small labeled-ratio conditions. Furthermore, since the source and the structure of our examined datasets vary, applying the proposed method to more well-defined fields or tasks could benefit from more advanced preprocessing stages, such as feature engineering or tuning of the participant learners, while specific criteria could be defined so as to avoid the random split and converge to a more suitable feature split [40,64]. Increasing the cardinality and the diversity of the candidate classifiers should also be examined in future research, especially the case where stronger classifiers are available, since their decision profile might demand a weighted soft or hard Voting scheme. In any case, our experimental stage seems to support our assumption that unlabeled examples may boost the contribution of ensemble learners and their Selection strategies under SSL schemes [65]. Such strategies could also be used for selecting appropriate learners under more sophisticated ensemble structures like Stacking [66].
One interesting point, especially if such a co-training algorithm is to be applied to datasets with more intensely imbalanced classes, is the adoption of either more specialized preprocessing stages, such as the SMOTE algorithm, a well-known oversampling method that generates new instances, or any of its descendants [67], or the embedding of similar methods inside the operation of the base learner(s). Such an approach has been demonstrated recently in [68], where Rotation Forest, a popular ensemble learner also based on DTs, is combined with an under-sampling method so as to tackle this kind of issue. Given both the produced results and the fact that in semi-supervised scenarios the amount of collected data is much more restricted than in the default supervised case, more sophisticated approaches for avoiding imbalanced datasets during the initialization of the base learner could prove really promising for acquiring better learning rates [69]. Another interesting point is the expansion of the proposed scheme to the multi-class semi-supervised classification problem, as in [70], where a new loss function was applied using gradient descent in functional space.
Moreover, appropriate experiments should be made towards recognizing the importance of the included features per view or among all the provided features, in case a random feature split is applied instead of merging all the distinct views into a compact but still heterogeneous view, so as to either propose a more detailed strategy for forming the two separate views or apply feature reduction techniques that may favor the final predictive performance [64]. The "absent levels" problem [39] should also be studied under an SSL scenario, avoiding the construction of views or preprocessed feature sets that may lead to more implicit approaches, especially in real-life situations where the interpretability of the exported predictive model is of high priority for the business side or the field connected with the examined problem [71,72]. Construction of artificially generated data could be a safe strategy for excluding such a conclusion. Otherwise, application to real-world data from totally different domains might lead to biased decisions regarding the applicability to a wider range of datasets. Transfer learning, probably combined with an Active Learning framework that permits blending human knowledge into the learning pipeline, could prove to be a valuable approach for exporting more robust classifiers by enriching the feature vector and/or applying more compatible modifications [73].
Finally, deep neural networks (DNNs) [74] could be employed under a co-training scheme to boost the predictive performance, fed with either raw data or other generic kinds of datasets. More specifically, long short-term memory (LSTM) networks have already proven efficient enough when combined with SSL methods for constructing clinical decision support systems [75]. If DNNs are exploited, new insights into the inserted data could be created, providing either totally new view(s) or augmenting the existing one(s). Thus, several feature engineering approaches should be adopted to enhance the quality of the co-training scheme and possibly reduce the violation of the assumption about independent views.

Figure 1. Pie chart depicting the participation of each combination inside our Static Ensemble strategy.


Figure 2. Violin plots of the proposed algorithm against Co(Vote(all)) approach over the three examined metrics.

Notation recovered from the algorithm listing and the table footnotes:
• view ∈ {1, 2}: index of the two separate feature sets,
• learner_view: the selected learner built on the corresponding view, ∀ view = 1, 2,
• ∼view: the opposite view from the current one,
• k: number of features after having converted each categorical feature into binary,
• n: number of instances after having removed instances with at least one missing value,
• C_j: instance cardinalities of both existing classes, with j ∈ {min, max},
• Mined_c: number of mined instances per class, where c ∈ {class_0, class_1},
• L_MaxIter: the labeled set on which the final learners are trained to predict class labels of test data.

Table 1. Description of datasets used from the UCI repository. #: the cardinality of the corresponding quantity.

Table 2. Configuration of exploited algorithms' parameters.
Bold highlighted means the best value per dataset.

Table 6. Friedman Rankings for all examined algorithms and statistical significance based on the Nemenyi post-hoc test.