Classification Active Learning Based on Mutual Information

Selecting a subset of samples to label from a large pool of unlabeled data points, such that a sufficiently accurate classifier is obtained using a reasonably small training set, is a challenging yet critical problem. It is challenging because solving it involves cumbersome combinatorial computations, and critical because labeling is an expensive and time-consuming task, so we aim to minimize the number of required labels. While information theoretical objectives, such as mutual information (MI) between the labels, have been successfully used in sequential querying, it is not straightforward to generalize these objectives to batch mode. This is because evaluation and optimization of functions that are trivial in individual querying settings become intractable for many objectives when we are to select multiple queries. In this paper, we develop a framework in which we propose efficient ways of evaluating and maximizing the MI between labels as an objective for batch mode active learning. Our proposed framework reduces the computational complexity from a cost that grows exponentially with the batch size, when no approximation is applied, to a linear cost. The performance of this framework is evaluated using data sets from several fields, showing that the proposed framework leads to efficient active learning for most of the data sets.


Introduction
In supervised learning, there is a teacher in the form of labels associated with a training set of samples to enable learning of models for prediction. However, obtaining expert/human/manual labeling is expensive. Instead of randomly selecting samples for manual annotation (the standard setting), the goal of active learning is to intelligently select samples for annotation so that an accurate classifier can be learned with as few labels as possible. Here, the implicit assumptions are that the labeling costs (in terms of time or financial expense) are the same for all of the queries and also significantly larger than the computational cost of the querying algorithms. The latter assumption leads us towards active learning algorithms that generate smaller batches of queries, even though they might be computationally more expensive than cheap passive learning.
An active learning setting typically starts with an initial model (from a few labeled samples), then more samples are selected for label querying. There are two general strategies for querying: sequential or batch mode. In sequential querying, a single sample is selected for querying at each active learning step, where each step involves labeling the query and retraining the model. Batch mode querying, on the other hand, allows labeling of multiple samples at each active learning step. Most of the classical studies in active learning use sequential querying [1][2][3][4][5][6]. However, querying a batch of samples is often more efficient when the experts can label multiple samples in one step, since they can label more queries at each step without the need to wait for the retraining process. In this paper, we introduce a batch mode active learning algorithm based on mutual information.
Performing active learning in batch mode introduces new challenges. Since we need to select a set of queries, one should also make sure that the samples are non-redundant to maximize the amount of information that they provide. Another related challenge is that selecting a subset of samples based on a given objective function defined over sets can easily lead to intractable or non-deterministic polynomial-time hard (NP-hard) optimization problems.
Recently, different batch mode active learning algorithms have been developed. Besides the querying strategy, active learning methods also differ based on the criterion they optimize for selecting queries. The methods in [7][8][9] select samples that reduce model uncertainty. Holub et al. [8] do this directly using the joint entropy function; Brinker [7] does this indirectly based on the distance of samples to the classifier's boundary; and Chen et al. [9] also do this indirectly based on the volume of version space. Most of these methods need to employ heuristics to introduce diversity among the queries. Choosing a subset of samples that maximizes the expected model change [10] is another type of batch selection strategy which works directly with classifier performance, but is usually tied to a particular classification model and is not general. On the other hand, Azimi et al. [11] develop a framework which constructs a batch mode variant of any given sequential querying policy, such that it performs close to the sequential scenario. Their algorithm is based on the restrictive assumption that any sequential querying outperforms its batch mode counterpart, at the expense of more frequent model updating. There are also studies that select the queries based on the amount of information they carry with respect to the underlying data distribution. For example, Hoi et al. [12] use the Fisher information ratio, which is specifically designed for a logistic regression model; Guo [13] and Li et al. [14] utilize mutual information between the input feature values of the candidate queries and the remaining samples, where they use a Gaussian Process distribution to model the joint probability over the instances.
In this paper, we advance the field by introducing a general framework for batch mode active learning that selects samples based on mutual information (MI) between labels of the candidate query set and the remaining samples given the current model. Our framework is general, because it can be applied with any classifier. Also note that we optimize for MI between the labels; in contrast, the MI in [13] and [14] is based on the input feature values. Our formulation is discriminative; hence, we do not need to model the distribution of the input features. An additional benefit is that our MI-based objective takes redundancy into account naturally.
MI between the labels has been employed in sequential active learning settings [15]; however, to date, we are not aware of any work that optimizes for this objective directly in batch mode. This is due to two main hurdles: (1) the difficulty of calculating the MI between non-singleton subsets of labels and (2) its combinatorial optimization problem. We address hurdle (1) by introducing pessimistic and optimistic versions of estimating MI, both of which can be calculated efficiently in a greedy fashion (Section 2.2), and hurdle (2) via the popular greedy submodular maximization algorithm and a stochastic version of it. We show that MI is submodular but non-monotone, and therefore only the stochastic algorithm has guaranteed tight bounds for maximizing it (Section 2.3). This stochastic optimization algorithm, which has not been exploited in the active learning community before, will be compared in practice against the former algorithm, which has been used widely in batch querying [9,12,16,17]. Our proposed framework reduces the number of required classifier updates from a polynomial of order $k$ (the batch size) in the pool size, when no approximation is applied, to a linear cost (Section 2.4). Additionally, as suggested by Wei et al. [18], we also use uncertainty sampling to downsample the unlabeled data as a pre-processing step to further decrease the complexity (Section 2.5). The performance of our proposed algorithms is evaluated using data sets from several different fields and compared against entropy-based and random querying benchmarks.

Active Learning with Mutual Information
In this section, we introduce our notations and formulate the batch active learning problem with MI as the objective. Then we provide our solutions to the two hurdles mentioned above.

Formulation
Let $X = \{x_i\}_{i=1}^{n+m} \subseteq \mathbb{R}^d$ denote our finite data set, where $x_i$ is the $d$-dimensional feature vector representing the $i$-th sample. Also let $Y = \{y_i\}_{i=1}^{n+m}$ be their respective class labels, where each $y_i$ represents the numerical category varying between 1 and $c$, with $c$ the number of classes. We distinguish indices of the labeled and unlabeled partitions of the data set by $L$ and $U$ (with $|L| = n$ and $|U| = m$), respectively, which are disjoint subsets of $\{1, ..., n + m\}$. Note that the true values of the labels $Y_L$ are observed and denoted by $Y_L^* = \{y_i^*\}_{i \in L}$. The initial classifier is trained based on these observed labels.
Labeling unlabeled samples is costly and time-consuming. Given a limited budget for this task, we wish to select $k \geq 1$ queries from $U$ whose labeling leads to a new classifier with a significantly improved performance. Therefore we need an objective set function $f : 2^U \to \mathbb{R}$ defined over subsets of the unlabeled indices $A \subseteq U$. The goal of batch active learning is then to choose a subset $A$ with a given cardinality $k$ that maximizes the objective given the current model that is trained based on $Y_L^*$. This can be formulated as the following constrained combinatorial optimization:
$$A^* = \arg\max_{A \subseteq U,\ |A| = k} f(A). \tag{1}$$
We aim to choose the queries $A$ whose labels $Y_A$ give the highest amount of information about labels of the remaining samples $Y_{U-A}$. In other words, the goal is to maximize the mutual information (MI) between the random sets $Y_A$ and $Y_{U-A}$ given the observed labels $Y_L^*$ and the features $X$:
$$f_{MI}(A) := I(Y_A; Y_{U-A} \mid X, Y_L^*) = H(Y_A \mid X, Y_L^*) - H(Y_A \mid Y_{U-A}, X, Y_L^*). \tag{2}$$
Let us focus on the first right hand side term $f_H(A) := H(Y_A \mid X, Y_L^*)$, which is the joint entropy function used in active learning methods [8]. Note that maximizing $f_H$ is equivalent to minimizing $H(Y_{U-A} \mid Y_A, X, Y_L^*)$, since $H(Y_U \mid X, Y_L^*) = H(Y_A \mid X, Y_L^*) + H(Y_{U-A} \mid Y_A, X, Y_L^*)$ is a constant with respect to $A$.
This objective is expensive to calculate due to the complexity of computing the joint posterior $P(Y_A \mid X, Y_L^*)$. However, using a point estimation, such as maximum likelihood, to train the classifier's parameter $\theta$, one can say that $\theta$ is deterministically equal to the maximum likelihood estimation (MLE) point estimate $\hat\theta_n$ given $X$ and $Y_L^*$. Then we can rewrite the posterior as
$$P(Y_A \mid X, Y_L^*) = P(Y_A \mid X, \hat\theta_n),$$
which we abbreviate as $P(Y_A \mid \hat\theta_n)$. Since in most discriminative classifiers the labels are assumed to be independent given the parameter, one can write $P(Y_A \mid \hat\theta_n) = \prod_{i \in A} P(y_i \mid \hat\theta_n)$. This simplifies the computation of the joint entropy to the sum of sample entropy contributions:
$$f_H(A) = H(Y_A \mid \hat\theta_n) = \sum_{i \in A} H(y_i \mid \hat\theta_n), \tag{3}$$
which is straightforward to compute having the pmf's $P(y_i \mid \hat\theta_n)$. Equation (3) implies that maximizing $f_H$ can be separated into several individual maximizations, and hence does not take into account the redundancy among the selected queries. Thus, in different related studies heuristics are added to cope with this issue. MI in Equation (2), on the other hand, removes this shortcoming by introducing a second term which conditions over the unobserved random variable $Y_{U-A}$, as well as the observed $Y_L^*$. This conditioning prevents the labels in $Y_A$ from becoming independent, and therefore automatically incorporates the diversity among the queries (see next section for details of evaluating this term).
Unfortunately, maximizing $f_{MI}$ for $k > 1$ is NP-hard (the optimization hurdle). Relaxing combinatorial optimizations into continuous spaces is a common technique to make the computations tractable [13]; however, these methods still involve a final discretization step that often includes using heuristics. In the following sections, we introduce our strategies to overcome the practical hurdles in MI-based active learning algorithms by introducing (1) pessimistic/optimistic approximations of MI; and (2) submodular maximization algorithms that allow us to perform the computations within the discrete domain.
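The separability implied by Equation (3) can be illustrated with a small sketch. It assumes that the class pmf's are available as plain Python lists (the `probs` dictionary below is a hypothetical toy input, not data from our experiments), and it shows that maximizing the entropy objective simply picks the individually most uncertain samples, regardless of how redundant they are.

```python
import math

def entropy(pmf):
    """Shannon entropy (in nats) of one label pmf; 0*log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

def joint_entropy_objective(probs, A):
    """f_H(A) as the sum of per-sample entropies, cf. Equation (3).

    probs[i] is the class pmf P(y_i | theta_hat) of unlabeled sample i
    (hypothetical input: any classifier's predicted probabilities).
    """
    return sum(entropy(probs[i]) for i in A)

# Toy pool: samples 0 and 1 are equally uncertain, sample 2 is confident.
probs = {0: [0.5, 0.5], 1: [0.5, 0.5], 2: [0.99, 0.01]}

# Because f_H is a sum, the best batch of size 2 is just the top-2 ranking,
# even if samples 0 and 1 happen to be near-duplicates of each other.
top2 = sorted(probs, key=lambda i: entropy(probs[i]), reverse=True)[:2]
```

Nothing in this computation looks at the relation between the selected samples, which is the redundancy issue discussed above.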

Evaluating Mutual Information
In this section, we address the hurdle of evaluating MI between non-singleton subsets of labels. This objective, formulated in Equation (2), is also equal to
$$f_{MI}(A) = H(Y_{U-A} \mid X, Y_L^*) - H(Y_{U-A} \mid Y_A, X, Y_L^*) \tag{4}$$
due to MI's symmetry. We prefer this equation, since usually we have $|Y_A| = k \ll |Y_U|$ and thus it leads to a more computationally efficient problem. Note that the first term in the right hand side of Equation (4) can be evaluated similarly to Equation (3). The major difficulty we need to handle in Equation (4) is the computation of the second term, which requires considering all possible label assignments to $Y_A$. To make this computationally tractable, we propose to use a greedy strategy based on two variants: pessimistic and optimistic approximations of MI. To see this we focus on the second term:
$$H(Y_{U-A} \mid Y_A, X, Y_L^*) = \sum_{J \in \{1,...,c\}^{|A|}} P(Y_A = J \mid X, Y_L^*)\, H(Y_{U-A} \mid Y_A = J, X, Y_L^*), \tag{5}$$
where $\{1, ..., c\}^{|A|}$ is the set of all possible class label assignments to the samples in $A$.
For example, if $A$ has three samples ($|A| = 3$) and $c = 2$, then this set would be equal to $\{1, 2\}^3 = \{(1,1,1), (1,1,2), (1,2,1), (1,2,2), (2,1,1), (2,1,2), (2,2,1), (2,2,2)\}$. It is also evident from the example above that the number of possible assignments $J$ to $Y_A$ is $c^k$. Therefore, the number of necessary classifier updates grows exponentially with $|Y_A| = k$. This is computationally very expensive and makes Equation (5) impractical. Alternatively, we can replace the expectation in Equation (5) with a maximization/minimization to get a pessimistic/optimistic approximation of MI. Such a replacement enables us to employ efficient greedy approaches to estimate $f_{MI}$ in a conservative/aggressive manner. The greedy approach that we use here is compatible with the iterative nature of the optimization Algorithms 1 and 2 (described in Section 2.3). In the remainder of this section, we focus on the pessimistic approximation. Similar equations can be derived for the optimistic case. The first step is replacing the weighted summation in Equation (5) by a maximization:
$$f^{pess}_{MI}(A) := H(Y_{U-A} \mid X, Y_L^*) - \max_{J \in \{1,...,c\}^{|A|}} H(Y_{U-A} \mid Y_A = J, X, Y_L^*). \tag{6}$$
Note that $f^{pess}_{MI}(A)$ is always less than or equal to $f_{MI}(A)$. Equation (6) still needs the computation of the conditional entropy for all possible assignments $J$. However, it enables us to use greedy approaches to approximate $f^{pess}_{MI}(A)$ for any candidate query set $A \subseteq U$, as described below. Without loss of generality, suppose that $A$, with size $|A| = k$ ($1 \leq k \leq m$), can be written element-wise as $A = \{u_1, ..., u_k\}$. Define $A_t = \{u_1, ..., u_t\}$ for any $t \leq k$ (hence $A_k = A$). In the first iteration we can evaluate Equation (6) simply for the singleton $f^{pess}_{MI}(\{u_1\})$ and store $\hat{y}_{u_1}$, the assignment to $y_{u_1}$ which maximizes the conditional entropy in Equation (6):
$$f^{pess}_{MI}(\{u_1\}) = H(Y_{U-\{u_1\}} \mid \hat\theta_n) - \max_{j \in \{1,...,c\}} H(Y_{U-\{u_1\}} \mid y_{u_1} = j, X, Y_L^*), \tag{7}$$
where we used Equation (3) to substitute the first term with $H(Y_{U-\{u_1\}} \mid \hat\theta_n)$. Note that the second term in Equation (7) requires retraining the classifier $c$ times, once with the newly added class label $y_{u_1} = j$ for each possible $j \in \{1, ..., c\}$. In practice, the retraining process can be very time-consuming.
Here, instead of retraining the classifier from scratch, we leverage the current estimate of the classifier's parameter vector and take one quasi-Newton step to update this estimate:
$$\tilde\theta_{n+1} = \hat\theta_n - \mathbf{H}_{n+1}^{-1}\, \mathbf{g}_{n+1}, \tag{8}$$
where $\mathbf{g}_{n+1}$ and $\mathbf{H}_{n+1}$ are the gradient vector and Hessian matrix of the log-likelihood function of our classifier given the labels $Y_L^* \cup \{y_{u_1} = j\}$. Then we use the approximation
$$H(Y_{U-\{u_1\}} \mid y_{u_1} = j, X, Y_L^*) \approx H(Y_{U-\{u_1\}} \mid \tilde\theta_{n+1}) = \sum_{i \in U-\{u_1\}} H(y_i \mid \tilde\theta_{n+1}). \tag{9}$$
In the Appendix, we derive the update equation in case a multinomial logistic regression is used as the discriminative classifier. Specifically, we will see that $\mathbf{g}_{n+1}$ and $\mathbf{H}_{n+1}^{-1}$ can be obtained efficiently from $\mathbf{g}_n$ and $\mathbf{H}_n^{-1}$. In a later iteration $t$ ($1 < t \leq k$), $f^{pess}_{MI}(A_t)$ will be approximated from the previous iterations:
$$f^{pess}_{MI}(A_t) \approx H(Y_{U-A_t} \mid \hat\theta_n) - \max_{j \in \{1,...,c\}} H(Y_{U-A_t} \mid y_{u_t} = j, Y_{A_{t-1}} = \hat{Y}_{A_{t-1}}, X, Y_L^*), \tag{10}$$
where $\hat{Y}_{A_{t-1}} = \{\hat{y}_{u_1}, ..., \hat{y}_{u_{t-1}}\}$ are the assignments maximizing the conditional entropy that are stored from the previous iterations, such that the $i$-th element $\hat{y}_{u_i}$ is the assignment stored for $y_{u_i}$. Note that Equation (10) is an approximation of the pessimistic MI defined by Equation (6); however, in order to keep the notation simple we use the same symbol $f^{pess}_{MI}$ for both. Moreover, similar to Equation (7), there are $c$ classifier updates involved in the computation of Equation (10). To complete iteration $t$, we make $\hat{y}_{u_t}$ equal to the assignment to $y_{u_t}$ that maximizes the second term in Equation (10) and add it to $\hat{Y}_{A_{t-1}}$ to form $\hat{Y}_{A_t}$. As in the first iteration, the conditional entropy term in Equation (10) is estimated by using the set of parameters obtained from the quasi-Newton step:
$$\tilde\theta_{n+t} = \tilde\theta_{n+t-1} - \mathbf{H}_{n+t}^{-1}\, \mathbf{g}_{n+t}, \qquad \tilde\theta_n := \hat\theta_n, \tag{11}$$
$$H(Y_{U-A_t} \mid y_{u_t} = j, Y_{A_{t-1}} = \hat{Y}_{A_{t-1}}, X, Y_L^*) \approx \sum_{i \in U-A_t} H(y_i \mid \tilde\theta_{n+t}). \tag{12}$$
Considering Equations (7) and (10) as the greedy steps of approximating $f_{MI}$, we see that the number of necessary classifier updates is $c \cdot k$, since there are $k$ iterations each of which requires $c$ classifier updates. Thus, the computational complexity is reduced from the exponential cost of the exact formulation in Equation (5) to a linear cost in the greedy approximation.
Similar to Equation (10), for the optimistic approximation, we will have:
$$f^{opt}_{MI}(A_t) \approx H(Y_{U-A_t} \mid \hat\theta_n) - \min_{j \in \{1,...,c\}} H(Y_{U-A_t} \mid y_{u_t} = j, Y_{A_{t-1}} = \hat{Y}_{A_{t-1}}, X, Y_L^*), \tag{13}$$
where $\hat{Y}_{A_{t-1}} = \{\hat{y}_{u_1}, ..., \hat{y}_{u_{t-1}}\}$ is the set of class assignments minimizing the conditional entropy that are stored from the previous iterations. Clearly, the reduction of the computational complexity remains the same in the optimistic formulation. Let us emphasize that, from the definitions of $f^{pess}_{MI}$ and $f^{opt}_{MI}$, we always have the following inequality:
$$f^{pess}_{MI}(A) \leq f_{MI}(A) \leq f^{opt}_{MI}(A). \tag{14}$$
The first (or second) inequality turns into an equality if the result of the averaging in the conditional entropy in Equation (5) is equal to the maximization (or minimization) involved in the approximations. This is equivalent to saying that the posterior probability $P(Y_A \mid X, Y_L^*)$ is a degenerate distribution concentrated at the assignment $Y_A = J$ that maximizes (or minimizes) the conditional entropy. Furthermore, if the posterior is a uniform distribution, giving the same posterior probability to all possible assignments $J \in \{1, ..., c\}^{|A|}$, then the averaging, minimization and maximization lead to the same numerical result and therefore we get $f^{pess}_{MI}(A) = f_{MI}(A) = f^{opt}_{MI}(A)$. In theory, the value of MI between any two random variables is non-negative. However, because of the approximations made in computing the pessimistic or optimistic evaluations of MI, it is possible to get negative values depending on the distribution of the data. Therefore, after going through all the elements of $A$ in evaluating $f^{pess}_{MI}$ (or $f^{opt}_{MI}$), we take the maximum between the approximation of $f^{pess}_{MI}(A)$ (or $f^{opt}_{MI}(A)$) and zero to ensure its non-negativity.
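To make the warm-started update concrete, the following is a minimal sketch of a single Newton step on the log-likelihood. For brevity it assumes a binary logistic model with one scalar parameter, $p(y = 1 \mid x) = \sigma(\theta x)$, rather than the multinomial model treated in the Appendix, and the Hessian is recomputed directly instead of being updated from the previous inverse. The function names and the toy numbers are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(theta, xs, ys):
    """One Newton step on the log-likelihood of p(y=1|x) = sigmoid(theta * x).

    g is the gradient and h the (negative) second derivative of the
    log-likelihood, both evaluated at the current estimate theta.
    """
    g = sum((y - sigmoid(theta * x)) * x for x, y in zip(xs, ys))
    h = -sum(sigmoid(theta * x) * (1.0 - sigmoid(theta * x)) * x * x
             for x in xs)
    return theta - g / h

# Hypothetical current estimate and labeled data.
theta_hat = 0.5
xs, ys = [-1.0, 1.0], [0, 1]

# Tentatively add the candidate query x = 2.0 with assumed label j = 1 and
# update the estimate with one step instead of retraining from scratch.
theta_new = newton_step(theta_hat, xs + [2.0], ys + [1])
```

In the greedy evaluation, a step of this kind would be repeated once for every candidate label $j$ of the tentatively added sample.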
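The greedy evaluation described in this section can be sketched as follows. The sketch is model-agnostic: `predict` and `update` are hypothetical hooks standing in for the classifier's predictive pmf and for the one-step parameter update, and the toy distance-weighted voting model at the bottom is only an assumption used to make the example runnable.

```python
import math

def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

def approx_mi(A, U, theta, predict, update, c, mode="pess"):
    """Greedy pessimistic/optimistic estimate of f_MI(A), cf. Eqs. (10), (13).

    A       : ordered list of candidate query indices (u_1, ..., u_k)
    U       : list of all unlabeled indices
    theta   : current model state (plays the role of theta_hat_n)
    predict : predict(theta, i) -> class pmf of sample i       (hypothetical)
    update  : update(theta, u, j) -> model after adding y_u = j (hypothetical)
    """
    if not A:
        return 0.0
    pick = max if mode == "pess" else min
    theta_hat = theta          # kept fixed for the first entropy term
    cond, seen = 0.0, []
    for u in A:
        seen.append(u)
        rest = [i for i in U if i not in seen]
        trials = []
        for j in range(c):     # c classifier updates per element of A
            th = update(theta, u, j)
            trials.append((sum(entropy(predict(th, i)) for i in rest), th))
        # store the worst-case (pess) or best-case (opt) assignment's model
        cond, theta = pick(trials, key=lambda t: t[0])
    rest = [i for i in U if i not in A]
    first = sum(entropy(predict(theta_hat, i)) for i in rest)
    return max(first - cond, 0.0)   # clip at zero, as described above

# --- toy model (assumption): distance-weighted voting on 1-D features ---
X = [0.0, 0.1, 1.0, 1.1]

def predict(theta, i):
    w = [math.exp(-5.0 * abs(X[i] - x)) for x, _ in theta]
    p1 = (0.5 + sum(wi for wi, (_, y) in zip(w, theta) if y == 1)) / (1.0 + sum(w))
    return [1.0 - p1, p1]

def update(theta, u, j):
    return theta + [(X[u], j)]

theta0 = [(-1.0, 0), (2.0, 1)]      # two labeled samples
U = [0, 1, 2, 3]
pess = approx_mi([0], U, theta0, predict, update, c=2, mode="pess")
opt = approx_mi([0], U, theta0, predict, update, c=2, mode="opt")
```

For a singleton set the two variants share the same first term, so the pessimistic value cannot exceed the optimistic one; for larger sets the stored assignments differ between the two variants, and each value is only an approximation of the corresponding bound.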

Randomized vs. Deterministic Submodular Optimizations
In this section, we begin by reviewing the basic definitions regarding submodular set functions, and see that both $f_{MI}$ and $f_H$ satisfy the submodularity condition. We then present two methods for submodular maximization: a deterministic and a randomized approach. The former is applicable to submodular and monotone set functions such as $f_H$. But $f_{MI}$ is not monotone in general, hence we present the randomized approach for this objective.
Definition 1. A set function $f : 2^U \to \mathbb{R}$ is said to be submodular if
$$f(A) + f(B) \geq f(A \cup B) + f(A \cap B), \quad \forall A, B \subseteq U. \tag{15}$$
We call $f$ supermodular if the inequality in Equation (15) is reversed. On many occasions, it is easier to use an equivalent definition, which uses the notion of discrete derivative defined as:
$$\rho_f(A, u) := f(A \cup \{u\}) - f(A), \quad A \subseteq U,\ u \in U - A. \tag{16}$$
Proposition 2. Let $f : 2^U \to \mathbb{R}$ be a set function. $f$ is submodular if and only if we have
$$\rho_f(A, u) \geq \rho_f(B, u), \quad \forall A \subseteq B \subseteq U,\ u \in U - B. \tag{17}$$
This equips us to show the submodularity of joint entropy and MI:
Theorem 3. The set functions $f_H$ and $f_{MI}$, defined in Equations (3) and (2) above, are submodular.
Proof. It is straightforward to check the submodularity of $f_H$ and therefore of the first term of the MI formulation in Equation (2). It remains to show that $g(A) := H(Y_A \mid Y_{U-A}, X, Y_L^*)$, the second term with the opposite sign, is supermodular. Let us first write the discrete derivative of the function $g$:
$$\rho_g(A, u) = H(Y_{A \cup \{u\}} \mid Y_{U-A-\{u\}}, X, Y_L^*) - H(Y_A \mid Y_{U-A}, X, Y_L^*) = H(Y_{U-A} \mid X, Y_L^*) - H(Y_{U-A-\{u\}} \mid X, Y_L^*) = H(y_u \mid Y_{U-A-\{u\}}, X, Y_L^*), \tag{18}$$
which holds for any $u \notin A \subseteq U$. Here, we used that the joint entropy of two sets of random variables $A$ and $B$ can be written as $H(A, B) = H(A) + H(B \mid A)$. Now take any superset $B \supseteq A$ which does not contain $u \in U$. From $B \supseteq A$, we have $U - B - \{u\} \subseteq U - A - \{u\}$, and since conditioning on more variables cannot increase the entropy,
$$\rho_g(A, u) = H(y_u \mid Y_{U-A-\{u\}}, X, Y_L^*) \leq H(y_u \mid Y_{U-B-\{u\}}, X, Y_L^*) = \rho_g(B, u),$$
which implies that $g$ is supermodular.
Although submodular functions can be minimized efficiently, they are NP-hard to maximize [19], and therefore we have to use approximate algorithms. Next, we briefly discuss the classical approximate submodular maximization method widely used in batch querying [9,11,12,16,17]. This greedy approach, which we call deterministic throughout this paper, was first proposed in the seminal work of [20] (shown in Algorithm 1) and its performance is analyzed for monotone set functions as follows:
Theorem 5. Let $f : 2^U \to \mathbb{R}$ be a submodular and nondecreasing set function with $f(\emptyset) = 0$, $A$ be the output of Algorithm 1 and $A^*$ be the optimal solution to the problem in Equation (1). Then we have:
$$f(A) \geq \left(1 - \frac{1}{e}\right) f(A^*). \tag{19}$$
Algorithm 1: The deterministic approach
Inputs: The objective function $f$, the unlabeled indices $U$, the query batch size $k > 0$
Outputs: a subset of unlabeled indices $A \subseteq U$ of size $k$
1: $A_0 \leftarrow \emptyset$
2: $U_0 \leftarrow U$
3: for $t = 1, ..., k$ do
4: $u_t \leftarrow \arg\max_{u \in U_{t-1}} f(A_{t-1} \cup \{u\})$
5: $A_t \leftarrow A_{t-1} \cup \{u_t\}$
6: $U_t \leftarrow U_{t-1} - \{u_t\}$
7: return $A = A_k$
The proof is given by [20] and [21]. Among the assumptions, $f(\emptyset) = 0$ can always be assumed, since maximizing a general set function $f(A)$ is equivalent to maximizing its adjusted version $g(A) := f(A) - f(\emptyset)$, which satisfies $g(\emptyset) = 0$. Nemhauser et al. [22] also showed that Algorithm 1 gives the optimal approximate solution to the problem in (1) for nondecreasing functions such as $f_H$. However, $f_{MI}$ is not monotone in general and therefore Theorem 5 is not applicable. To see the non-monotonicity of $f_{MI}$, it suffices to note that $f_{MI}(\emptyset) = f_{MI}(U) = 0$.
Recently, several algorithms have been proposed for approximate maximization of nonnegative submodular set functions, which are not necessarily monotone. Feige et al. [23] made the first attempt towards this goal by proposing a (2/5)-approximation algorithm and also proving that 1/2 is the optimal approximation factor in this case. Buchbinder et al. [24] achieved this optimal bound in expectation by proposing a randomized iterative algorithm. However, these algorithms are designed for unconstrained maximization problems. Later, Buchbinder et al. [25] devised a (1/e)-approximation randomized algorithm with a cardinality constraint, which is more suitable for batch active learning.
A pseudocode of this approach is shown in Algorithm 2 where, instead of selecting the sample with maximum objective value at each iteration, the best $k$ samples are identified (line 4) and one of them is chosen randomly (line 5). Such a randomized procedure provides a (1/e)-approximation algorithm for maximizing a nonnegative submodular set function such as $f_{MI}$:
Theorem 6. Let $f : 2^U \to \mathbb{R}$ be a submodular nonnegative set function and $A$ be the output of Algorithm 2. Then if $A^*$ is the optimal solution to the problem in (1) we have:
$$\mathbb{E}[f(A)] \geq \frac{1}{e}\, f(A^*). \tag{20}$$
The proof can be found in [25] and our supplementary document. In order to be able to select $k$ samples from $U_{t-1}$ to form $M_t$ for all $t$, it suffices to ensure that the smallest unlabeled set that we sample from, $U_{k-1}$, has enough members, i.e., $k \leq |U_{k-1}| = m - k + 1$. Observe that although the assumptions in Theorem 6 are weaker than those in Theorem 5, the bound shown in Equation (20) is also looser than that in Equation (19). However, interestingly, it is proven that the inequality in Equation (19) will still hold for Algorithm 2 if the monotonicity of $f$ is satisfied (see Theorem 3.1 in [25]). Thus, the randomized Algorithm 2 is expected to perform similarly to Algorithm 1 for monotone functions.
Algorithms 1 and 2 are equivalent for sequential querying ($k = 1$). Also note that in both algorithms, the variable $u_t$ in iteration $t$ is determined by deterministic or stochastic maximization of $f(A_{t-1} \cup \{u\})$. Fortunately, such maximization needs only computations in the form of Equation (10) or Equation (13) when $f = f^{pess}_{MI}$ or $f^{opt}_{MI}$. These computations can be done easily provided that the gradient vector $\mathbf{g}_{n+t-1}$ and inverse-Hessian matrix $\mathbf{H}_{n+t-1}^{-1}$ have been stored from the previously selected subset $A_{t-1}$. The updated gradient and inverse-Hessian that are used to compute $f(A_{t-1} \cup \{u\})$ are different for each specific $u \in U_{t-1}$. We only save those associated with the local maximizer, that is $u_t$, as $\mathbf{g}_{n+t}$ and $\mathbf{H}_{n+t}^{-1}$ to be used in the next iteration.
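The two selection procedures can be sketched in a few lines. This is a simplified illustration rather than a faithful transcription of Algorithms 1 and 2: the objective `f` is any set function given as a Python callable, and the coverage function used in the example is a hypothetical stand-in (it is submodular and monotone, unlike $f_{MI}$).

```python
import random

def greedy_deterministic(f, U, k):
    """Deterministic greedy: add the single best element at every step."""
    A, remaining = [], list(U)
    for _ in range(k):
        u = max(remaining, key=lambda v: f(A + [v]))
        A.append(u)
        remaining.remove(u)
    return A

def greedy_randomized(f, U, k, rng=random):
    """Randomized greedy: find the k best candidates, pick one at random.

    If fewer than k candidates remain, all of them are used; the text
    requires k <= m - k + 1 so that this never happens.
    """
    A, remaining = [], list(U)
    for _ in range(k):
        ranked = sorted(remaining, key=lambda v: f(A + [v]), reverse=True)
        u = rng.choice(ranked[:k])
        A.append(u)
        remaining.remove(u)
    return A

# Toy objective (assumption): number of distinct items covered by the batch.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}

def cover(A):
    return len(set().union(*(sets[i] for i in A))) if A else 0

det = greedy_deterministic(cover, sets, 2)
rnd = greedy_randomized(cover, sets, 2, rng=random.Random(0))
```

With $k = 1$ the candidate list of the randomized variant contains only the single best element, so both functions return the same query, in line with the equivalence noted above.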

Total Complexity Reduction
We measure the complexity of a given querying algorithm in terms of the required number of classifier updates. This makes our analysis general and independent of the updating procedure, which can be done in several possible ways. As we discussed in Section 2.2, we chose to perform a single quasi-Newton step as in Equation (8), but alternatively one can use full training or any other numerical parameter update.
Consider the following optimization problems:
$$\arg\max_{A \subseteq U,\ |A| = k} f_{MI}(A), \tag{21a}$$
$$\operatorname*{greedy\,arg\,max}_{A \subseteq U,\ |A| = k} \tilde{f}_{MI}(A), \tag{21b}$$
where "greedy arg max" denotes the greedy maximization operator that uses Algorithm 1 or 2 to maximize the objective, and $\tilde{f}_{MI}$ is either $f^{pess}_{MI}$ or $f^{opt}_{MI}$. Note that Equation (21a) formulates the global maximization of the exact MI function and Equation (21b) shows the optimization in our framework, that is, a greedy maximization of the pessimistic/optimistic MI approximations. In the following remark, we compare the complexity of solving the two optimizations in Equation (21) in terms of the number of classifier updates required for obtaining the solutions.
Remark 1. For a fixed $k$, the number of necessary classifier updates for solving Equation (21a) grows as a polynomial of order $k$ in $m \cdot c$, that is, as $O((m \cdot c)^k)$, whereas for Equation (21b) it grows linearly in $m \cdot c$.
Proof. As explained in Section 2.2, the number of classifier updates for computing $f_{MI}(A)$ without any approximations is $c^k$. Moreover, in order to find the global maximizer of MI, $f_{MI}$ needs to be evaluated at all subsets of $U$ with size $k$. There are $\binom{m}{k} = O(m^k)$ such subsets (recall that $m = |U|$). Hence, the total number of classifier updates required for global maximization of $f_{MI}$ is of order $O((m \cdot c)^k)$. Now, regarding Equation (21b), recall from Section 2.2 that if $\mathbf{g}_{n+t-1}$ and $\mathbf{H}_{n+t-1}^{-1}$ are stored from the previous iteration, computing $\tilde{f}_{MI}(A_{t-1} \cup \{u_t\})$ needs only $c$ classifier updates. However, unlike the evaluation problem in Section 2.2, in computing line (4) of Algorithms 1 and 2, the next sample to add, that is $u_t$, is not given. In order to obtain $u_t$, $\tilde{f}_{MI}$ is to be evaluated at all the remaining samples in $U_{t-1}$. Since $|U_{t-1}| = m - t + 1$, the number of necessary classifier updates in the $t$-th iteration is $c \cdot (m - t + 1)$. Both algorithms run $k$ iterations, which results in the following total number of classifier updates:
$$\sum_{t=1}^{k} c \cdot (m - t + 1) = c \cdot k \left(m - \frac{k-1}{2}\right) = O(c \cdot m \cdot k).$$
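As a quick sanity check of the two counts in the proof, the sketch below evaluates them for one hypothetical setting (the values of $m$, $c$ and $k$ are illustrative and not taken from our experiments).

```python
from math import comb

def updates_exact(m, c, k):
    """Classifier updates for global maximization of exact MI: C(m, k) * c^k."""
    return comb(m, k) * c ** k

def updates_greedy(m, c, k):
    """Classifier updates for the greedy approximation: sum_t c * (m - t + 1)."""
    return sum(c * (m - t + 1) for t in range(1, k + 1))

m, c, k = 100, 3, 5                      # illustrative values only
n_exact = updates_exact(m, c, k)
n_greedy = updates_greedy(m, c, k)       # 3 * (100 + 99 + 98 + 97 + 96) = 1470
```

For these values the greedy count is 1470, while the exact count exceeds $10^{10}$, which illustrates why the exact formulation is impractical even for small batches.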

Further Speed-Up
The previous section showed that the total complexity of our proposed MI-based querying approaches is linear, in contrast with the exponential cost of the exact formulation, which is not practical.
Even after approximating $f_{MI}$ using the pessimistic or optimistic formulations, MI-based algorithms can be significantly slow for large data sets. In order to further scale up our algorithm, inspired by [18], we first select a subset of the unlabeled samples by choosing only the $\beta$ most uncertain samples (where $\beta \in \mathbb{Z}^+$). We then run our MI-based algorithm over the filtered data. More formally, the input set of unlabeled indices to Algorithms 1 and 2 will be the filtered set $U_f \subseteq U$ that contains the $\beta$ most uncertain samples of $U$. It is evident that for $\beta = |U|$ the filtered data will be equal to the original unlabeled pool, $U_f = U$. From now on, we add the adjective filtered to any querying algorithm that is preceded by reduction of the unlabeled data to the samples with high uncertainty as described above.
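A minimal sketch of this filtering step is given below. It assumes that uncertainty is measured by the entropy of the predicted class pmf, which is one common choice and consistent with the entropy objective used earlier; the function name and the toy probabilities are hypothetical.

```python
import math

def entropy(pmf):
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

def filter_most_uncertain(probs, beta):
    """Return the indices of the beta samples with the highest entropy.

    probs maps each unlabeled index to its predicted class pmf
    (assumption: entropy is used as the uncertainty measure).
    """
    ranked = sorted(probs, key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:beta]

probs = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.6, 0.4], 3: [1.0, 0.0]}
U_f = filter_most_uncertain(probs, beta=2)
```

The MI-based selection would then be run on `U_f` instead of the full pool, so that the number of classifier updates per iteration scales with $\beta$ rather than with $m$.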

Results and Discussion
In this section, we show our experimental results over several data sets from three different fields: medicine (Section 3.1), image processing (Section 3.2) and music harmony analysis (Section 3.3). We ran our MI-based querying algorithms against entropy-based and random active learning benchmarks. In the following sections, we first describe the data sets that have been used, then we explain the experimental settings and present the numerical results.

Cardiotocography
This data set was downloaded from the UCI repository [26] and contains 2126 fetal cardiotocograms, each of which is represented by a 21-dimensional feature vector. The data is categorized into three classes based on the fetal states: normal, suspect and pathological (therefore $c = 3$). All the data samples are first projected into a 15-dimensional principal component analysis (PCA) subspace obtained from the unlabeled pool. The initial labeled data set, chosen randomly at the beginning of each experiment, consists of 75 samples (25 samples per class). We will refer to this data set as "Cardio" in the following sections.

MNIST (Modified National Institute of Standards and Technology)
This is an image database of handwritten digits 1 to 9 [27]. Here, we only use images for digits 1 to 4, hence $c = 4$. The data set, consisting of 20 × 20 images, is already divided into testing/training partitions. In our experiments, these partitions are fixed as given, but each time the initial labeled data sets are randomly chosen from the training partition. The raw 400-dimensional feature vectors are projected into a 10-dimensional PCA subspace constructed based on the training partition. After choosing only images of digits from 1 to 4, the sizes of the testing and training partitions are 4130 and 4159, respectively. We also set the size of our initial labeled data set $L_0$ to 200 (50 samples per class).

Bach Choral Harmony
The other data set that we used for evaluating the performance of the algorithms contains pitch information of time events of 60 chorales by Johann Sebastian Bach [28]. Each event is represented by pitch-wise and meter information and is assigned a chord class label. We selected the events associated with the five most frequent chords in the data set: D-major, G-major, C-major, F-major and A-major (hence $c = 5$), resulting in a set of 2221 samples. Discarding the pitch class of the bass notes and the metric information, we used the binary indicators of the pitch classes corresponding to the 12 equal-tempered notes of the chromatic scale. This leads to a set of 12-dimensional binary feature vectors that are projected into an 8-dimensional PCA subspace obtained based on the training data at each experiment. We will refer to this data set simply as "Bach" in the remaining sections.

Experimental Settings
From the previous sections, we have two methods of evaluating MI and two optimization techniques, leading to four different ways of doing MI-based querying, in all of which we used the filtered pool of unlabeled samples $U_f$ that is obtained with $\beta = 100$. Throughout this section, we distinguish different approaches involved in our experimental settings using the labels listed below: In sequential querying, where deterministic and randomized optimization algorithms are equivalent, we use Pess-MI and Opt-MI to refer to the MI-based objectives without mentioning the optimization type.
In running the querying experiments over all the data sets, we used a linear logistic regression as the core classifier. In case the data under consideration is not already divided into testing/training partitions, we randomly generate such partitions in each experiment with a fixed ratio of 3/7 (testing size to training size). The initial training data set $L_0$ is randomly selected from the training partition and the rest of the training samples are considered as the unlabeled pool $U$ from which the queries are to be selected in each querying iteration. Moreover, in each experiment we first reduce the dimensionality of the data using PCA over the unlabeled pool.
In the experiments, we iteratively select query batches of size $k = 1$, 5, 10 or 20, add the selected queries together with their class labels to the labeled data set and re-calculate the parameters of the classifier, which, in turn, leads to an updated accuracy value based on the testing partition. For each value of $k$, we repeated the experiments 25 times, each time with a different random selection of testing/training partitions and a different initial labeled set $L_0$. Hence, in total, we get 25 accuracy curves for each value of $k$. Ideally, we want an active learning algorithm whose accuracy curve increases as fast as possible, i.e., one that obtains a more accurate classifier while labeling fewer query batches.
In order to present the performance of the listed algorithms, we calculate the average and standard deviation (STD) of the 25 accuracy curves for each algorithm. Furthermore, for pairwise comparison between the MI-based approaches and the benchmarks for a fixed value of $k$, we perform two-sample one-tail T-tests over the accuracy values of the competing algorithms. Note that such a hypothesis test can and should be done over the accuracy values calculated after each querying iteration $t$ separately. Here, the assumption is that the accuracy levels generated from the 25 querying experiments at iteration $t$ are independent from each other. Let $\eta_t$ denote the random variable representing the accuracy of the updated classifier after $t$ runs of an MI-based querying algorithm and $\eta'_t$ be a similar random variable for a competing non-MI-based method. Then, we consider the null and alternative hypotheses to be:
$$H_0 : \mu(\eta_t) \leq \mu(\eta'_t), \qquad H_1 : \mu(\eta_t) > \mu(\eta'_t),$$
where $\mu(\eta_t)$ and $\mu(\eta'_t)$ are the means of the random variables $\eta_t$ and $\eta'_t$. We perform such T-tests for comparing all modes of MI-based objectives (see the list above) versus the entropy-based and random querying algorithms. Rejecting the null hypothesis implies that the new accuracy in the $t$-th iteration of an MI-based querying is not less than or equal to the case when we use a querying objective other than the approximating variants of $f_{MI}$. In other words, a smaller p-value for the T-test described above provides stronger evidence that the accuracy of the updated classifier is larger when using the MI-based approach for querying.
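The per-iteration comparison can be sketched as below. The sketch computes the two-sample t statistic with a pooled variance, which is an assumption since the exact variant of the test is not specified here, and the accuracy values are made-up illustrative numbers rather than experimental results.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(acc_mi, acc_other):
    """Two-sample t statistic (pooled variance) for H1: mean(acc_mi) > mean(acc_other).

    Returns the statistic and the degrees of freedom; a large positive value
    speaks against the null hypothesis H0: mean(acc_mi) <= mean(acc_other).
    """
    n1, n2 = len(acc_mi), len(acc_other)
    sp2 = ((n1 - 1) * variance(acc_mi) + (n2 - 1) * variance(acc_other)) / (n1 + n2 - 2)
    t = (mean(acc_mi) - mean(acc_other)) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2

# Illustrative accuracies at one querying iteration t (hypothetical numbers).
acc_mi = [0.80, 0.82, 0.81, 0.83]
acc_entropy = [0.75, 0.77, 0.76, 0.78]
t_stat, dof = pooled_t_statistic(acc_mi, acc_entropy)
```

The one-tail p-value is the upper tail probability of a Student t distribution with `dof` degrees of freedom at `t_stat`; in practice it could be obtained from a statistics library, for example `scipy.stats.ttest_ind` with `alternative="greater"`, assuming SciPy is available.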

Numerical Results
The numerical results of running sequential active learning with different querying algorithms are shown in Figure 1. As mentioned before, the deterministic and randomized optimization algorithms described in Section 2.3 are equivalent in sequential querying. Therefore, we have only two variants of MI-based querying in Figure 1. This figure shows that for two data sets (MNIST and Bach) the MI-based approaches perform similarly to, or sometimes even worse than, the entropy-based benchmark. This can be explained by noting that the main shortcoming of the entropy objective $f_h$, namely redundancy among the queries, is meaningful only when there are multiple samples in the batch, that is, k > 1. However, the MI-based approaches mostly outperform random querying. Another observation is that the optimistic approximation of MI gave better results both in terms of average accuracy and the p-values of the hypothesis tests. Recall that the optimistic approximation of MI tries to minimize the entropy over the class labels of the remaining samples in each iteration, while the pessimistic approximation maximizes this entropy. Hence, our conjecture for this observation is that the optimistic approach performs more aggressive exploitation than the pessimistic variant, in the sense of choosing the queries from the set of samples that lead to a lower classifier entropy.
For batch mode active learning, the MI-based approaches generally show better performance. That is, their accuracy curves grow more rapidly than those of the benchmarks. When comparing against the entropy-based approach, for the Cardio and Bach data sets we observe that the p-values are small in the early iterations, whereas for the MNIST data set low p-values are mostly seen in the middle or late iterations. Hence, the evidence that the MI-based variants outperform the entropy-based approach is strong in early querying iterations, before the labeled training set becomes large, for the two former data sets, and in later iterations for MNIST. This behavior can also be seen in the plots of the average accuracy.
Regarding the comparison against the random benchmark, the p-value plots show more conspicuously that the MI-based approaches generally outperform random querying with high confidence soon after the early iterations. However, the plots for Bach show that this confidence decreases in late iterations, mainly because the accuracy of random querying grows to the same level as that of the MI-based curves.
Whereas in sequential active learning the optimistic approach did a better job than the pessimistic variant, the difference between these two variants shrinks as the size of the query batch increases (and so does the number of required approximation iterations). Our conjecture is that the accumulation of approximation error brings the performance of the optimistic and pessimistic MI-based querying methods closer to each other, though both remain better than the benchmarks. Additionally, there is no significant difference between using the deterministic or randomized optimization algorithms, which might be due to local monotonicity of the approximations of $f_{MI}$. Also note that although there are large differences between the MI-based variants where their p-values are large, we ignore those parts as uninformative regions, since they merely indicate that there is little evidence that the MI-based approaches outperform the benchmarks.

Conclusions
Active learning based on reducing the model uncertainty is very popular; however, most of the relevant objectives, such as entropy and sample margin, do not take into account the diversity among the queries in batch active learning. When working with probabilistic classifiers, one natural replacement for entropy is the mutual information (MI) between the class labels of the candidate queries and the remaining unlabeled samples. However, hurdles in evaluating and efficiently optimizing this objective have hindered its popularity in the learning community. In this paper we presented a framework for efficient querying based on this objective.
In our framework, we proposed pessimistic and optimistic approximations of MI by replacing the averaging operator inside the conditional entropy with maximizing and minimizing operators, respectively. This enabled us to efficiently estimate these values in a greedy fashion. Furthermore, consistent with the greedy estimation of MI, the optimization is also done in an iterative scheme. The iterative nature of the optimization decreased the computational complexity from $O((m \cdot c)^k)$, when no approximation is applied, to $O(ckm)$. Two different modes of optimization are suggested based on existing algorithms in the submodular maximization literature: one is the classical deterministic greedy approach that has already been used in the learning literature, and the other is a stochastic variant of it that is especially useful when monotonicity is absent in the submodular objective.
We generated experimental results using various real-world data sets in order to evaluate the performance of our MI-based querying approaches against the entropy-based and random querying benchmarks in terms of the accuracy growth rate during the querying iterations. These results show that the MI-based algorithms outperformed the rest in most of the cases, especially when k > 1 (batch active learning). Furthermore, the optimistic variant outperformed the pessimistic one, especially for smaller batch sizes. As we increase the number of queries, and therefore the number of approximating iterations, this difference shrinks.
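The replacement of the averaging operator can be illustrated with a small sketch. It assumes that, for one candidate query, we already have the predictive probability of each possible label and the entropy of the remaining labels that would result from each label; both inputs and the function name are hypothetical and only serve to contrast the three operators.

```python
def conditional_entropy_variants(label_probs, entropies_given_label):
    """Contrast the exact average with its max/min replacements.

    label_probs[j]           : predictive probability of label j for the candidate
    entropies_given_label[j] : entropy of the remaining labels if the candidate
                               had label j (hypothetical precomputed values)
    """
    # Exact conditional entropy: expectation over the candidate's label.
    exact = sum(p * h for p, h in zip(label_probs, entropies_given_label))
    # Pessimistic variant: assume the label that leaves the most uncertainty.
    pessimistic = max(entropies_given_label)
    # Optimistic variant: assume the label that leaves the least uncertainty.
    optimistic = min(entropies_given_label)
    return exact, pessimistic, optimistic
```

Since a weighted average always lies between the minimum and the maximum, the optimistic value never exceeds the exact one and the pessimistic value never falls below it. Because MI subtracts the conditional entropy, the corresponding MI approximations bracket the exact MI in the opposite order.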
Note that $\pi_{ij}$, and equivalently the distributions shown in Equation (24), are likelihood functions when viewed as functions of the parameter vector $\theta$. The objective to optimize in order to find the maximum likelihood estimate (MLE), denoted by $\theta_n$, given an i.
The subscript n in $\theta_n$ emphasizes the sample size of the training data from which the MLE is obtained.

Updating the Gradient Vector
Here, we formulate the gradient vector in terms of the individual log-likelihood functions $\pi_{ij}$ and the feature vectors $x_i$, which readily enables us to derive an update equation for the gradient. The k-th partial gradient (for $1 \le k \le c-1$) of the log-likelihood evaluated at $(x_i, y_i^* = j)$ is: The complete gradient vector of the log-likelihood function $\ell(\theta; L)$, denoted by $g_n$, is obtained by concatenating the partial gradient vectors. It can be written compactly as below: where $\otimes$ denotes the Kronecker product. Equation (29) implies that the gradient is additive, and the update equation after adding a pair $(x_{u_1}, y_{u_1})$ to the training set L is simply equal to Similarly, $g_{n+t}$ in Equation (12) can be obtained by adding a single product to the gradient vector calculated in the previous iteration: . . .
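The additive structure of the gradient can be sketched as follows. This example assumes a standard multinomial logistic model with c − 1 parameter vectors and the last class as the reference class; the exact parameterization of Equation (24) is not reproduced here, so this form is an assumption. Under it, the contribution of one labeled pair (x, y) to the gradient is the Kronecker product of (e_y − π) with x, where e_y is the indicator vector of the label over the first c − 1 classes and π holds the corresponding class probabilities. All function names are illustrative.

```python
import math

def class_probs(theta, x):
    """Class probabilities for c classes; theta holds c-1 weight vectors
    and the last class is the reference class (assumed parameterization)."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in theta] + [0.0]
    top = max(scores)                              # for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gradient(theta, x, y):
    """Gradient of log pi_y at (x, y): the flattened product (e_y - pi) ⊗ x."""
    probs = class_probs(theta, x)
    grad = []
    for k in range(len(theta)):                    # k-th partial gradient
        coeff = (1.0 if y == k else 0.0) - probs[k]
        grad.extend(coeff * xi for xi in x)
    return grad

def add_sample(g, theta, x, y):
    """Additive update: new gradient = old gradient + one Kronecker product."""
    return [a + b for a, b in zip(g, sample_gradient(theta, x, y))]

def log_likelihood(theta, data):
    """Log-likelihood of labeled pairs; used here only to check the gradient."""
    return sum(math.log(class_probs(theta, x)[y]) for x, y in data)
```

Starting from a zero vector and calling `add_sample` once per labeled pair gives the full gradient at a fixed θ, so adding a new query to the training set costs a single product rather than a recomputation over all samples.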

Algorithm 2: The randomized approach
Inputs: The objective function f, the unlabeled indices U, the query batch size k > 0
Outputs: a subset of unlabeled indices A ⊆ U of size k
for t = 1 → k do
    /* Selecting k points with highest f values */
    M_t ← arg max ...
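Since the body of Algorithm 2 is only partially available here, the following is a hedged sketch of a randomized greedy selection in the spirit of the random greedy algorithm from the submodular maximization literature, not a transcription of the authors' algorithm. At each of the k steps it ranks the remaining indices by marginal gain of f, keeps the k best as candidates, and picks one of them uniformly at random. The function name, the use of marginal gains, and the omission of dummy elements are assumptions of this sketch; it also assumes k does not exceed the number of unlabeled indices.

```python
import random

def random_greedy(f, unlabeled, k, rng=None):
    """Select k indices; f maps a list of indices to a real-valued objective."""
    rng = rng or random.Random()
    selected = []
    remaining = list(unlabeled)
    for _ in range(k):
        base = f(selected)
        # Rank remaining indices by the marginal gain of adding each one.
        ranked = sorted(remaining,
                        key=lambda u: f(selected + [u]) - base,
                        reverse=True)
        candidates = ranked[:k]          # the k points with highest gains
        pick = rng.choice(candidates)    # uniform random choice among them
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

With k = 1 the candidate set contains a single index, so the procedure reduces to the deterministic greedy choice, which is consistent with the two optimization modes coinciding in sequential querying.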
The results of batch active learning with different batch sizes are shown in Figures 2 (for k = 5), 3 (for k = 10) and 4 (for k = 20). The figures show the average accuracy curves (first row of each figure), the standard deviation of the accuracy curves (second rows), and the resulting p-values of the hypothesis tests comparing the MI-based approaches with the entropy-based (third rows) or random querying (fourth rows) benchmarks.