Investigation of Combining Logitboost(M5P) under Active Learning Classiﬁcation Tasks

: Active learning is the category of partially supervised algorithms that is di ﬀ erentiated by its strategy to combine both the predictive ability of a base learner and the human knowledge so as to exploit adequately the existence of unlabeled data. Its ambition is to compose powerful learning algorithms which otherwise would be based only on insu ﬃ cient labelled samples. Since the latter kind of information could raise important monetization costs and time obstacles, the human contribution should be seriously restricted compared with the former. For this reason, we investigate the use of the Logitboost wrapper classiﬁer, a popular variant of ensemble algorithms which adopts the technique of boosting along with a regression base learner based on Model trees into 3 di ﬀ erent active learning query strategies. We study its e ﬃ ciency against 10 separate learners under a well-described active learning framework over 91 datasets which have been split to binary and multi-class problems. We also included one typical Logitboost variant with a separate internal regressor for discriminating the beneﬁts of adopting a more accurate regression tree than one-node trees, while we examined the e ﬃ cacy of one hyperparameter of the proposed algorithm. Since the application of the boosting technique may provide overall less biased predictions, we assume that the proposed algorithm, named as Logitboost(M5P), could provide both accurate and robust decisions under active learning scenarios that would be beneﬁcial on real-life weakly supervised classiﬁcation tasks. Its smoother weighting stage over the misclassiﬁed cases during training as well as the accurate behavior of M5P are the main factors that lead towards this performance. Proper statistical comparisons over the metric of classiﬁcation accuracy verify our assumptions, while adoption of M5P instead of weak decision trees was proven to be more competitive for the majority of the examined problems. We present our results through appropriate summarization approaches and explanatory visualizations, commenting our results per case.


Introduction
Without a doubt, the last two decades have been characterized by massive production of data with regards to the fields of Computer Science (CS) and Artificial Intelligence (AI). Several real-life applications contribute to this phenomenon, operating as rich sources of data over all possible kinds: structured, semi-structured or unstructured [1,2]. We distinguish the next fields: social media platforms, economic transactions, medical recordings and Internet of Things (IoT) where Industry 4.0, constitutes a highly affected application coming from the latter field [3][4][5]. Although these applications offer advanced mechanisms for producing the necessary data under automated protocols and/or mechanisms, they still, in their majority, cannot address the data annotation under a similar and still accurate manner. Therefore, the most widely known variant of Machine Learning (ML) algorithms, the supervised ones, do not stand as a proper solution for obtaining informative insights over the collected data, since the restricted number of annotated examples that may be acquired, even by a manual procedure, do not establish a sufficient training subset. Weakly Supervised Learning (WSL) and/or Partially Supervised Learning (PSL) approaches tackle this problem, trying to exploit the existence of the clearly larger amounts of non-annotated examples in order to mine useful information that might hide based on the aforementioned available annotated examples [6,7]. Different approaches of WSL and/or PSL approaches have been recorded in the related literature since their emergence. The common factor is the much smaller size of the labeled subset (L) against the corresponding unlabeled subset (U), while the most important points over which they differentiate are the following [8]: • inductive/transductive approaches, where an explicit learning rule is formatted using the train set during the former one, trying to apply this on a distinct test set, while these two sets are both provided in advance during the latter; • incomplete/inaccurate supervision, where both labeled and unlabeled examples are initially gathered regarding the first category, on contrast with the second one which is distinguished because of the noise that may govern the provided labeled examples, a fact that would cause intense deterioration on learning a specific task; and • active/semi-supervised learning, where there is a straightforward separation of the approaches that need or demand human intervention so as to blend human's knowledge into their total learning kernel for acquiring safer decisions instead of being based solely on a base learner's predictions building a more automated learning chain but with greater risks.
Following the recipes of WSL/PSL approaches, the abundance of collected unlabeled data might act as a valuable source of information, reducing the negative effect of the difficulty on obtaining much labeled data, because of the inherent difficulties that take place on several domains. Ethical issues, scarcity of specific incidents and highly expensive labeling procedures are some of the obstacles that usually prevent us from handling successfully the creation of a large enough L per case [9]. However, based on the assumption that both L and U are produced from the same underlying data distribution P, the connection between the conditional distribution P y x and the marginal distribution P(x) could lead potentially to more accurate learning functions: f : X → Y , where x = {x 1 , x 2 , . . . , x N } stands for a typical example with N features while y symbolizes the label that accompanies any such sample, being either known or unknown in the cases of L and U, respectively. Furthermore, through X, Y, we depict the space of the examples and the labels.
In this work, we aim to propose an accurate and robust batch-based inductive Active Learning (AL) algorithm for pool-based scenario, regarding the manner that the data are initially concerned. The base learner of the proposed AL algorithm is based on the adoption of an ensemble learner into its learning structure so as to cover efficiently the shortage of labeled data (l i L) along with the existence of a human oracle (H oracle ) that may provide us with trustworthy annotations. At the same time, a quite larger pool of unlabeled examples (u i U) is available for mining its context [10]. Since the need for well-established predictions during the labeling stage of u i is one of the most crucial point during AL strategies, the exploitation of ensemble learners seems mandatory in order to capture better the insights of the examined data. Several recent works are also directed towards introducing ensemble learners into AL or other WSL variants, such as Semi-supervised Learning (SSL) [11], Cooperative Learning (CL) [12]-also known as AL + SSL [13,14]-or even Transfer Learning (TL) [15], while the field of supervised ensemble learners is still in bloom [16].
In our case, we prefer the adoption of the Logitboost ensemble learner, a product of the well-known procedure of generating ensemble learners through serially formatting a Generalized Additive Model (GAM), which does not prevent us from adopting additive tree models, as follows: Informatics 2020, 7, 50 3 of 24 According to its strong theory, Logitboost manipulates the distribution of the training dataset based on the errors that occur during their categorization, where M depicts the learning rounds of this iterative procedure [17]. Following the generic concept of boosting, during the training stage, we fit a number of learners f (·) which try to emphasize better on the instances that are misclassified. This is made through the weighting vector w that enables the fitted learner to modify its decisions towards covering the most difficult cases based on their weight factors. After M such rounds, we reach an iteratively boosted learning function, which has ideally transformed to a strong learner by continuously reducing the errors that the initial weak model and its previous variants faced.
Although the variant of AdaBoost is the most popular product of the boosting family algorithms, Logitboost actually constitutes a choice that may reward us compared to some defects that AdaBoost presents [18]. To be more specific, Logitboost uses a smoother weighting function than the default AdaBoost classifier, a fact that allows to address better the examples that are highly misclassified since the direction towards learning the proper mapping function per each learning round is not heavily affected by them. Instead, their importance does not overwhelm the corresponding importance of the examples with smaller misclassified errors, providing more robust confidence scores: p y x . Taking into consideration that the choice of the most informative u i examples are more often than not selected through suitable metrics that depend on p(y i u i ), such a behavior may be proven quite successful in practice [19,20]. Additionally, the convergence of the Logitboost scheme is not violated, as the general boosting procedure guarantees, since the logit-loss function which is described later is asymptotically minimized.
The favoring properties of the Logitboost ensemble learner have also been noticed in the related literature, although their cardinality is restricted. To be more specific, apart from using only univariate regressors inside this scheme, generalized functions could also be applied, increasing thus the total predictive ability but probably disrupting interpretability [21]. Otero and Sanchez proposed the use of descriptive fuzzy learners inside Logitboost, modifying slightly the usual structure of default fuzzy learners and overpassing the behavior of a similar fuzzy-based AdaBoost version [22], while a modification of the internal scoring mechanism based on distance from the decision regions using weak learners under Logitboost scheme was tested in [23]. Naive Bayes (NB) has also been combined appropriately with this scheme, improving its total performance against other popular variants of Bayesian Networks [24], while Logitboost's operation was totally matched into the learning procedure proposed by Leathart et al., introducing Probability Calibration Trees (PCT) in the context of regression task, separating the input task space and fitting local predictors [25].
Logitboost autoregressive networks made use of the same scheme for modeling conditional distributions, offering a procedure that could be parallelized, exploiting the advantages of boosting ensembles for which the hyperparameters are clearly less than that of Neural Networks (NNs) and appeared to converge for several examined cases into same values, at least for the shrinkage factor [26]. More sophisticated multi-class expansions of Logitboost could further improve its applicability as it has been mentioned by the corresponding authors. This direction has been actually studied recently by some works, providing interesting expansions of the default multi-class operation of Logitboost scheme: Adaptive Base class (ABC) [27] and Adaptive One vs. One (AOSO) Logitboost [28].
Moreover, since the pool-based inductive AL strategies are inherently iterative procedures which are based on a few initially provided data, exploiting appropriately at least one H oracle for detecting the most informative u i -further analysis is presented on the next Section-the importance of obtaining accurate predictions is highly considered, but time limitations may occur when much complex learning models are embedded into these strategies. Trying to satisfy this trade-off under the Logitboost wrapper, we propose the use of M5P, a model tree regressor that tackles efficiently high-dimensional data since it builds linear models after having grown its preferred decision tree structure, taking advantage of its widely accepted decent learning performance over various scientific fields [29,30]. On the other hand, with the greedy manner under which Logitboost acts, although it is applied under a number of learning rounds (M), its total complexity does not differentiate heavier than other state-of-the-art classification algorithms [27]. Thus, its integration under AL strategies would not induce prohibitive time response in practice. Furthermore, since we maintain a regression tree as its base learner, both binary and multi-class classification problems can be addressed efficiently without inserting further modifications that would probably raise the computational complexity of the total algorithm.
Consequently, we propose the adoption of Logitboost(M5P) under pool-based AL classification problems, exploiting its favoring properties both for selecting informative unlabeled instances and for evaluating the final learning hypothesis, built into the gradually augmented L, based on the annotations that a powerful human oracle provides us. This combination has been recently examined in the scenario of SSL [31], presenting remarkable performance. For investigating the overall ability of Logitboost(M5P) under AL scenario, we examined 91 different datasets, separated into binary and multi-class under 3 different query strategies against the baseline strategy of Random Sampling (RS), comparing its performance against 10 other well-known learning algorithms as well as the default use of weak Decision Trees (DTs)-to be more particular, one-node trees [32]-into Logitboost, as it is usually met in the literature and related ML packages [33]. A further study tuning one hyperparameter of the proposed algorithm was also made, proving that its learning performance may still improve under suitable preprocess stages which however are not easy to trust under the existence of limited training instances.
More details regarding AL and a description of the proposed framework for examining the efficacy of Logitboost(M5P) in the case of an AL ecosystem are provided in the next two sections, along with the experimental procedure, the results and our comments, following the structure of the current journal. The last section summarizes our contributions and the pros and cons of our proposed combination, based mainly on our results, while future directions are posed.

Materials and Methods
The main reason that we resort to PSL methods is the coexistence of both L and U, while the amount of the latter (size(U)) is much larger than of the former (size(L)): size(U) size(L) [6]. One of the subcategories of PSL algorithms is AL, where Settles has demonstrated a great survey work on this kind of algorithms [10]. Trying not to present many details, we highlight the most important parts of such a learning strategy.
First of all, we employ a probabilistic classifier (f ) acting towards two different directions: searching for the most compatible u i s and evaluating the final model after a predefined number of iterations is reached or until any other set stopping criterion is satisfied. The first part is handled by exploiting a proper sampling Query Strategy (QS) which defines a specific criterion or metric metric use f ulness , so as to measure the informativeness or the utility of all the available u i s. In order to detect the most convenient of them for creating a batch (B) of potentially informative u i s so as to increase the learning ability of the total AL learning procedure, we select the top b highly ranked instances: where B is actually a subset of the applied U during each iteration. The second is resolved through employing one or more human oracles or sources of information, like known crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower [34]. This means that, after having detected the B, we ask the available H oracle to assign the corresponding label based on its knowledge background. Then, we merge the pairs of {B, H oracle (B)} with the initially collected L during the first iteration or the current version of L for next iterations L iter , where iter depicts the current iteration. Then, we refine f and repeat this procedure until a terminating condition is satisfied. In contrast with pure SSL approaches or in general with the wide spectrum of WSL approaches, the terminating condition of the empty U pool is not a realistic one here, since this would Informatics 2020, 7, 50 5 of 24 demand much effort on the side of the human factor. The participation of the latter introduces several trade-off situations that should be considered carefully.
According to the related literature [20], there are 3 general kinds of QS models on the field of pool-based AL and a group of hybrid ones that combine more than one strategy: representativeness-based, and • hybrid ones, where more details are provided in [35]. A quite important research orientation of the related community is the proposal of a new QS, either introducing new metrics which may measure a behavior that seems more favorable for specific tasks [36,37] or trying to capture better the reasoning of some choices made by similar methods [38]. One representative work related to this last category is the work of Vu-Linh Nguyen et al. [39], exploring further Uncertainty Sampling (UncS) QS, discriminating this into epistemic and aleatoric sampling strategies, highlighting their differences and proposing the first variant as more promising.
We actually adopted UncS in our AL framework, which tries to distinguish the u i instances for which the applied learning algorithm being trained on L iter is less confident. For a binary problem, such an instance would induce p(y i = 0 u i ) ≈ p(y i = 1 u i ) ≈ 0.5. This strategy favors time-efficient solutions for the majority of ML algorithms because its time complexity demands a training stage of f over L iter and an evaluation stage of U iter . Since the cardinality of the former is smaller than the latter, especially for low labeled Ratios (R)-where R is defined as the ratio of the initial L's cardinality against the total amount of both L and U-the needed computational resources can be bounded based on the computational complexity of the base learner. Of course, the size of the batches (b) and the number of executed iterations (k) also play important roles.
In order to investigate the efficiency of Logitboost(M5P) under UncS, we employed 3 separate metrics inside this wrapper strategy, comparing them each time with the baseline of RS, where no sophisticated criterion was assessed for selecting the participant u i of each batch but a random pick took place before the corresponding batch was provided to H oracle . This strategy comes with no time costs during the mining of U. This means that any examined QS should outreach this performance for being qualified as a valid one for the concept of AL. The relationship of each of the utilized metric use f ulness (Least Confident: LConf, Smallest Margin: SMar and Ent: Ent) is given here: f SMar (u i ) = arg min where p(y u i ) is the confidence of the base learner on the examined u i , while Equation (5) computes the difference between the two most probable classes y 1 and y 2 of the same u i so as to return the most compatible choice from the available into U.
With regards to the proposed base learner, Logitboost(M5P), more details are given here. Logitboost is an additive logistic regression algorithm that can be seen as a convex optimization problem. An additive model, like simple linear models or regression trees, for solving a binary problem has the function of the following form: where m is the number of classifiers, w m is the constants to be determined and h m is the chosen base functions along with their internal parameters γ m . Assuming now that f additive (x) is the mapping that we need to fit our strong aggregate hypothesis and h m is the separate weak hypotheses, then the two-class boosting algorithm is fit by minimizing the next criterion: where y is the true class label and y· f additive (x) is the voting margin term [40], while ε[·] denotes the expected value.
Adopting the negative binomial log-likelihood does not affect the minimizer of Equation (8) compared with the typical boosting function, enabling at the same time a smoother weighting of examples for which predictions of Logitboost learner are far away from the discriminating decision threshold, constituting a great asset. This procedure takes place using the Newton-like steps, a more complicated optimization process than in the case of exponential loss function of AdaBoost algorithm. However, this fact does not affect the ambition of minimizing the set loss function during the training process.
Regression trees are known for their ability to deal efficiently with large datasets in terms of both features and instances, added to their simplicity and robustness [41]. The M5P regressor is a recreation of M5 algorithm [42], where the portion of the dataset that reaches the leaf is classified by a linear regression model stored in each branch of the tree. For the dataset split, certain attributes are chosen using the standard deviation error (SDR) as a criterion for the best attributes to split the dataset at each node. The chosen attribute is the one with the maximum expectation to error reduction: where Tree i refers to the subset of examples that have the ith result of the potential test and SD(·) refers to the standard deviation of its argument. The stopping criteria is either the number of remaining instances to reach a certain number or a very small change in class value. The successful competition of M5P against other regression trees or other conventional ML learners has been stated in recent literature [30,43,44]. Its exploitation under the wrapper scheme of Logitboost could lead to a robust classifier that operates on the field of AL, both for choosing informative u i instances and for providing remarkable classification performance. Moreover, possible inaccurate predictions that would appear because of either a shortage of l i that covers the total range of the output values or weak indicators originally existing in the feature space of a dataset could be alleviated by the smooth weighting function that Logitboost applies, avoiding the overfitting phenomena that might discard its decisions [45].
The last mechanism that needs to be described before more technical details of the experimental procedure is the increase in L during the iterative process of a typical AL environment. To be more specific, the queried instances (b) are chosen in batches. The size of each batch relies on the size of the initial labeled set of each dataset and the number of predefined iterations (k). This happens because we adopted an augmenting strategy that aims to double the size of the labeled instances at the final (kth) iteration of the experiment. According to this concept, the steady value of the parameter b per dataset is computed as follows: Thus, to execute a complete AL experiment fairly, we adopted a flexible process for which the pseudocode is placed in Algorithm 1. Initially, we start with size(L) collected labeled instances, and before the final evaluation, 2*size(L) instances are gathered, where the additional labeled instances have been assigned with pseudo-labels by H oracle after their selection through the combined interaction of any chosen QS and our selected base learner. During each iteration, a batch B that consists of b instances is extracted from the U subset and is added along with the decisions of the employed H oracle into the current L subset. For the evaluation process, a test set is used to examine the accuracy of the Logitboost(M5P), which is now trained on the augmented labeled set through the process of the AL. The total procedure is as follows: Algorithm 1 Active learning scheme

1:
Mode: 2: Pool-based scenario over a provided dataset D = X n x N Y n x 1 3: x i -i-th instance vector with N features n-number of instances n = size(L) + size(U) 6: B-batch of unlabeled samples that are labeled per iteration 7: Input: 8: L iter (U iter )-(un)labeled instances during the iter-th iteration, L iter ⊂ D, U iter ⊂ D 9: k-number of executed iterations 10: base learner-the selected classifier 11: QS(metric)-the selected Query Strategy along with its embedded metric 12: Preprocess: 13: b-size of batch B computed by Equation (10)

Results
More technical details are revealed in this section to both describe better the volume of our executed experiments trying to better clarify the importance of the Logitboost(M5P) as a robust inductive learner under the AL concept and to favor the reproducibility of the total experimental procedure. Therefore, we firstly describe the basic properties of the examined datasets, providing later the details of the experimental phase regarding mainly the parameters of the compared algorithms as well as the open-source platform that we utilized to execute them. Finally, we share a link with all the produced results and visualizations due to lack of space.

Data
All of the mined datasets come from the well-known University of California-Irvine (UCI) repository [46]. The next two figures represent their formulation, having separated them based on the number of class into binary and multi-class datasets. Thus, in Figure 1, the third column-depicting the number (#) of classes-contains recordings equal to 2 for all datasets, while in Figure 2, this parameter varies from 3 to 28 for our case. The last column depicts the ratio that the class with the most instances (majority class) and the class with the least instances (minority class) capture against the rest ones. In the former case (Table 1), these two values sum up to 100%, while in the latter case, we added the corresponding ratios of the two most minor classes, since some datasets contain some extremely rare classes, and depicting only their contribution to the total instances would be neither convenient nor informative for the reader. Therefore, the last column of all datasets with exactly 3 classes presented in Table 2 sums up also to 100% (contribution of largest class/contribution of the two most minor classes), while for the rest multiclass datasets, these values are not constrained. The fact that the majority of the datasets are imbalanced sets some kind of difficulty over the examined AL algorithms, but this property at the same time is met in the most remarkable challenging real-life problems. Thus, no preprocess stage was applied for balancing the datasets based on the cardinality of their classes. One exception is the "texture" dataset, which contains 11 classes with exactly 500 instances per each class, leading to perfect balance.
Informatics 2020, 7, x 8 of 25 some extremely rare classes, and depicting only their contribution to the total instances would be neither convenient nor informative for the reader. Therefore, the last column of all datasets with exactly 3 classes presented in Table 2 sums up also to 100% (contribution of largest class/contribution of the two most minor classes), while for the rest multiclass datasets, these values are not constrained. The fact that the majority of the datasets are imbalanced sets some kind of difficulty over the examined AL algorithms, but this property at the same time is met in the most remarkable challenging real-life problems. Thus, no preprocess stage was applied for balancing the datasets based on the cardinality of their classes. One exception is the "texture" dataset, which contains 11 classes with exactly 500 instances per each class, leading to perfect balance.   Informatics 2020, 7, x 8 of 25 some extremely rare classes, and depicting only their contribution to the total instances would be neither convenient nor informative for the reader. Therefore, the last column of all datasets with exactly 3 classes presented in Table 2 sums up also to 100% (contribution of largest class/contribution of the two most minor classes), while for the rest multiclass datasets, these values are not constrained. The fact that the majority of the datasets are imbalanced sets some kind of difficulty over the examined AL algorithms, but this property at the same time is met in the most remarkable challenging real-life problems. Thus, no preprocess stage was applied for balancing the datasets based on the cardinality of their classes. One exception is the "texture" dataset, which contains 11 classes with exactly 500 instances per each class, leading to perfect balance.     As it concerns the rest of the information that characterizes the structure of our examined datasets, we mention that their cardinalities range from 57 ("labor") to 67,557 instances ("connect-4"), while their feature space ranges from 2 ("banana") to 90 ("movement-libras"), covering thus a wide spectrum of cases regarding these two properties. Additionally, the 5th column distinguishes and counts accordingly the number of categorical and numerical features. Analogous to their participation, there are datasets for which features are related solely with one category (numerical or nominal) or with both of them, usually called mixed.

Classifiers
In order to investigate properly the distinguishing ability of the proposed combination of Logitboost with the M5P under the AL concept, 7 different classifiers have been selected so as to operate under the same AL framework, as this was combined in Algorithm 1. Some brief details of these classifiers along with their original references are recorded here: • k-Nearest Neighbors [47], the most representative classification algorithm from the family of lazy learners, also referred to as an instance-based algorithm since it does not consume any resources during the training stage. Instead, it computes based on appropriate distance metrics the k-nearest neighbors of each test instance and exports its decision through a simple majority vote about the class of the latter one. Three different variants of this algorithm were included: 1-NN, 3-NN and 5-NN, increasing the value of the k parameter; • Decision Trees (DTs) [48], where J48 and Random tree algorithms from this category were preferred. The first one constitutes a popular implementation of C4.5 generating a pruned variant that exploits Gain Ratio to determine how to split the tree, while the second one considers just a randomly chosen subset of the initial feature space before growing an unpruned decision tree. Logistic Model Trees (LMT) [49] was also employed as a powerful ensemble algorithm. Based on this, a tree structure is suitably grown, but proper logistic regression models are built at its leaves, exploiting in this manner only the most relevant attributes; • JRip [50], a rule-based learner that tries to produce rules so as to capture all the included instances into the provided training set; • Naive Bayes (NB) [51], a simple Bayesian-based method that assumes that all the features inside the original feature space are independent. Although this assumption seldom holds, especially in real-life cases, this generative approach has found great acceptance at the literature; and • AdaBoost (Ada) [18], the most popular boosting algorithm that minimizes exponential loss.
As all the mentioned algorithms as well as the proposed combined Logitboost(M5P) learner were considered, they were mined from the Waikato Environment for Knowledge Analysis (WEKA) environment [33], keeping the default parameters of all their original implementations, enforcing the reproducibility and conducting fair comparisons.

Experiment Details
Considering the evaluation of our results, we applied a 3-fold-cross-validation procedure (3-CV) where the 2 out of three folds were held for the training data and the rest was for testing. Afterwards, the training data were split to L and U under 4 different R values: 5%, 10%, 15% and 20%. After executing 15 iterations, the size of the labeled data compared to the total training data increased to 10%, 20%, 30% and 40%, respectively. This procedure was repeated 3 times per case, executing a 3 × 3-CV evaluation stage. For implementing our pool-based experiments with the UncS and the aforementioned metrics, along with the RS strategy, we used the 'Java Class Library for Active Learning' (JCLAL) library, a Java-based library that interacts directly with the WEKA facilitating the use of the algorithm classifiers in this framework [52]. Furthermore, we conducted the next comparisons (the proposed learner versus 1. simple or 2. ensemble learners) regarding the classification accuracy (Acc%) metric, verified later by a nonparametric statistical comparison process: The second set of comparisons is an investigation between the learning behavior of the proposed combination of M5P under the Logitboost scheme with the default Logitboost, which exploits the Decision Stump algorithm (DStump) [32], a simple one-node tree that discriminates each example using only one feature which acts as the root of this simplified tree. Additionally, the well-known Bagging scheme along with the J48 algorithm has been also examined for consistency reasons, comparing our proposed base learner with another ensemble approach, based again on DTs [53,54] along with the Ada and LMT learning approaches that have found great acceptance in practice.

Figures, Tables and Schemes
Here, we record the results of the smallest R-based scenario (R = 5%) only for the UncS(Ent) QS. We provide two tables for each comparison setting, separating thus on binary and multi-class datasets (Tables 3-6).       Due to the volume of produced results and to maintain a balance between the extension of the presented results and the main body of the rest manuscript, our included tables and figures present our results only for a small portion of our total experiments. The following link contains all results: http://ml.math.upatras.gr/wp-content/uploads/2020/10/MDPI_Informatics_AL_Logitboost_M5P.7z.
In the preceding 4 tables, we highlighted the best value per dataset (row) achieved by all the included algorithms (columns) in bold format to facilitate visualization of these results. However, some implications occurred during our experiments and we have to record them here. Initially, it has to be mentioned that the JRip algorithm did not manage to export decisions for the majority of the datasets under the SMar metric. This behavior is recorded because of JRip's inherent inability to produce rules for all the existing classes, especially when the training data are not sufficient. Thus, we have removed this algorithm from the corresponding files, but it is still included for the other QSs. Furthermore, the LMT algorithm as well as Ada(DStump) did not manage to export predictions for 20 (15 binary and 5 multi-class) datasets, which made us remove all of them from the second set of comparisons: the proposed algorithm versus the ensemble ones.
Continuing with our results, in order to present some better insights into the total experiments, we have gathered the victory frequencies for these two cases into Tables 7 and 8, separated internally depending on the number of the existing classes and summarizing the performance per learner for all the examined R-based scenarios and the distinct query strategies applied. We have ignored the RS case, since this acts as the baseline in the AL concept, but we have taken into consideration this strategy in the statistical comparisons that follow. In the aforementioned case of JRip under UncS(SMar) strategy, we did not record any value since it has been rejected by this kind of experiments.
Considering the statistical comparison of the first part of the results, we have applied the well-known nonparametric Friedman test, which examines if the null hypothesis about the similarity of the participating algorithms as it considers their performance holds [55]. Since both Friedman and Iman-Davenport statistics highly favored the rejection of the null hypothesis, a proper post hoc test was applied to investigate further the statistical importance of the acquired results. In our case, the post hoc of Nemenyi was selected using an alpha level parameter equal to 0.05 [56]. The value of the critical difference (CD) that should be overpassed between the learning behavior of two algorithms for being considered as statistically different is 1.21. The next figure depicts the achieved scores per base learner for both the binary and multi-class datasets, discriminating the performance of the 3 examined metrics under UncS against the RS. A violin plot was chosen.
Regarding the second part of the experiments, a more targeted comparison was made to verify the predictive ability of the proposed variant of Logitboost against its default setup, as has been implemented on WEKA API, which actually uses a one-node tree as a weak learner, while the other 3 ensemble learners have been included. Therefore, we applied a twofold comparison, examining their efficacy on both the binary and multi-label datasets, following the same statistical verification as previously. We have already measured the frequency of the best achieved performance for all examined datasets per ensemble learner, QS and R-based scenario in Table 9, while in Table 10, we present the corresponding Freidman rankings. For acquiring better insight of the relative importance of the learning performances of this kind of comparison, we recorded the corresponding statistical ranking per separate R-based scenario.   Indeed, we can verify that different behaviors are recorded in the largest R-based scenario compared with the rest ones during the binary datasets, a fact that would not have been noticed under an average of the separate rankings, being merged and leading to erroneous conclusions. Despite this fact, the underlying CD value for this set of experiments is equal to 0.568, a score that settles that the proposed approach is significantly superior against the other examined algorithms under the same AL framework in 5 out of the 8 cases.
Meanwhile, the next 3 figures depict the performance of the Logitboost(M5P) against its 4 ensemble opponents for the 9 largest datasets, across both kind of datasets, through compatible error-bar plots that summarize the performance on each evaluation during the 3-fold-CV process as well as the corresponding average value for each iteration.

Discussion
In this section, we briefly discuss the obtained results from both comparisons to summarize the overall results and to perceive better the assets of the proposed AL approach that is based on the combination of Logitboost scheme with the M5P regressor. First, application of Logitboost under AL has not been recorded in the literature in contrast to several other ML models [57]. Consequently, it was reasonable to adopt a common AL framework to make fair comparisons with the selected approaches that are based on state-of-the-art algorithms during both of the examined settings. Thus, no tuning stages were inserted into the learning pipeline. The total experimental procedure was conducted regarding 4 different values of the R (%) parameter, trying to investigate further the behavior of all the examined approaches, assuming that the human oracle did not introduce any noisy decisions at all. Thus, we do not insert noisy decisions during augmentation of the initial labeled set, which shifts the responsibility of selecting informative instances to the learning ability of the base learner per case. Additionally, we are more interested on the lowest R-based scenarios since they constitute more realistic simulations of real-life WSL problems.
From the accuracy score recorded in Table 3 through Table 6, we can see that the proposed combination highly outperformed its rivals on the majority of the 91 datasets. The aggregated number of victories for both sets of comparisons-versus simple and ensemble base learners-have been placed in Tables 7 and 8, where the proposed set managed to capture the best performance in 774 cases out of 1112 (69.6%) against 6 approaches and in 499 out of 866 cases (57.6%) against 4 approaches. Moreover, its defects seem to appear on specified datasets (e.g., "iris" and "post-operative" from binary datasets as well as "heart-h", "saheart" and "newthyroid" from multiclass ones). The structure of these datasets should be examined further, but probably a tuning stage of Logitboost scheme, for which the parameters could be expanded because of the presence of the internal regressor, might lead to better performance against algorithms like kNN or learners that are based on DTs, either individually or under an ensemble fashion.
As it concerns the first experimental scenario, the distribution of the number of victories of the proposed AL approach was similar across the different query strategies and the labeled ratios, denoting its general efficacy against the other approaches. The 3 distinct kNN learners also performed cumulatively 243 victories, constituting a useful proof about their robustness despite the restricted number of labeled examples [58]. On the other hand, the performance of NB as a base learner was disappointing, affected by the aforementioned shortage of numerous initial examples since it did not record any victory.
During the second and, of course, more challenging scenario against ensemble base learners, the proposed algorithm was again more competitive and outperformed the rest in the majority of the grouped experiments. However, LMT-based approaches managed also to score several winning accuracies per dataset, especially in binary problems when larger R-based experiments were conducted. In fact, LMT expands the Logitboost procedure internally into its main learning kernel but, at its final stage, exploits only a subset of the initial feature space so as to build its logistic model. This property seems favorable for the aforementioned case, as the statistical rankings placed in Table 10 prove. Specifically, for binary problems, the proposed algorithm performed significantly better behavior than the LMT-based AL approach only for the case of R = 5%, while for the next two comparisons, no statistical difference was recorded, ranking Logitboost(M5P)-based approaches in first with a slight lead, while in the last scenario, where R = 20%, the LMT significantly outperformed the proposed one. However, in multi-class problems, this behavior was not repeated. This kind of result possibly denotes the existence of noisy features that highly affect binary problems and/or highlights the overfitting phenomena that Logitboost may face when outliers are inserted into its training stage.
Returning to the first set of experiments, a statistical comparison was executed for verifying the results obtained from the examined query strategies against also the baseline of random sampling. Figure 2 captures the performance per district learner, where the proposed base learner recorded a statistically significant behavior against its baseline as well as the rest of the AL approaches, being ranked always as the best across all the conducted scenarios. It is remarkable that the RS(Logitboost(M5P)) approach managed also to outreach the other simple algorithms on average, proving the overall predictive ability of the proposed ensemble learner. This last note highlights also the implications that may occur when ranking the available unlabeled examples, a fact that may deteriorate the total learning behavior since the less informative the selected instances are labeled, the more redundancy that occurs in the gradually augmented training subset.
Discussing again the second experiment setup, a similar procedure was implemented, where besides a comparison with the LMT ensemble learner, useful conclusions were drawn through the conducted comparisons. The replacement of a weak one-node tree with the M5P of model trees under the Logitboost scheme was substantially examined, along with the use of the AdaBoost procedure. This amendment helped us to clarify even better the overall benefits of the proposed boosting approach, since the improvement that was noticed mainly against these two approaches was impressive. Additionally, two separate figures (Figures 3 and 4) were produced to better visualize the discriminative ability of the proposed approach against other ensemble learners, which is clearly formatted by the initial iterations in 8 out of 9 selected datasets, recording also more robust learning behavior judging by its fluctuations along the iterative procedure of the applied AL framework. The instable behavior of AdaBoost(DStump) is also remarkable, showing clearly its untrustworthy behavior compared with the proposed one, as intense fluctuation was recorded.
Finally, we also conducted a study of one hyperparameter of the Logitboost scheme, without tuning further the internal base learner of M5P, in order to verify its optimality regarding at least one parameter of this scheme. This hyperparameter was selected to be the number of iterations that are executed during its training stage, a property that affects both its spent computational resources and the main drawback of boosting procedures: overfitting. We noticed that, more often than not, the default approach with 10 internal iterations did not achieve the best performance. This fact leaves much space for further investigation on the parameters of the Logitboost scheme under AL learning scenarios, which is however not easy to shed light on because of the limited initial data that are in practice provided. We pose the corresponding statistical rankings in Table 10.
impressive. Additionally, two separate figures (Figures 3 and 4) were produced to better visualize the discriminative ability of the proposed approach against other ensemble learners, which is clearly formatted by the initial iterations in 8 out of 9 selected datasets, recording also more robust learning behavior judging by its fluctuations along the iterative procedure of the applied AL framework. The instable behavior of AdaBoost(DStump) is also remarkable, showing clearly its untrustworthy behavior compared with the proposed one, as intense fluctuation was recorded.   Finally, we also conducted a study of one hyperparameter of the Logitboost scheme, without tuning further the internal base learner of M5P, in order to verify its optimality regarding at least one parameter of this scheme. This hyperparameter was selected to be the number of iterations that are executed during its training stage, a property that affects both its spent computational resources and the main drawback of boosting procedures: overfitting. We noticed that, more often than not, the default approach with 10 internal iterations did not achieve the best performance. This fact leaves much space for further investigation on the parameters of the Logitboost scheme under AL learning scenarios, which is however not easy to shed light on because of the limited initial data that are in practice provided. We pose the corresponding statistical rankings in Table 10.

Conclusions
To sum up, in this work, we proposed the use of the Logitboost scheme along with an M5P regressor under a properly designed pool-based AL learning scheme. We assumed that the smooth learning behavior of Logitboost could lead to safer predictions, especially when it selects the most informative unlabeled instances from the corresponding U pool, favoring thus the overall learning rates of an AL algorithm. Its combination with M5P managed to increase the overall accuracy compared with the performance of simpler tree-based models, leading also to superior performance against various ensemble state-of-the-art algorithms evaluated in the same AL framework over 4 R-based scenarios under 3 separate metrics embedded into uncertainty sampling query strategy. The performed statistical comparisons verified the significantly better performance of the proposed batch-based inductive active learning algorithm, recording its better generalization ability through a wide range of experiments.
Our future directions are mainly related to internal investigation of the Logitboost scheme, since its application on both the AL and SSL fields has been proven to be really promising, while at the same time, the related community has not highly benefited by this. Feature selection could be really useful in several real-life cases, since removing noisy or irrelevant variables would further improve the predictive ability of the Logitboost scheme. A similar preprocessing strategy was considered in [59], before creating an ensemble of Logitboost that exploits random forest as a base learner in the field of anomaly detection. The use of metrics of informativeness that are popular in the field of AL could boost the predictive performance of SSL methods and vice versa, as the authors of [60] demonstrated, studying the exploitation of centrality measures that stem from graph-based representation of data for capturing data heterogeneity. Use of AL + SSL based on ensemble learners either with UncS or with more targeted query strategies could boost the overall performance on classification tasks without demanding much effort from human annotators or reducing expenses induced by the corresponding crowdsourcing services [14]. The aspect of applying query strategies that avoid using uncertainty-based directions but prefer guidance by interactions among the decisions of multiple learners seems really promising, either for obtaining decisions through distinct iterations of Logitboost-based classifiers or for blending this powerful classifier into a pool of available classification algorithms [61].
Furthermore, a combination of the proposed base learner or adoption of the related boosting learners [62] with more recently stated query strategies could help us reach competitive performance in more complex tasks that stem from real-life applications [63]. Expansion also towards online AL frameworks should be further investigated by our side [64].