Classification Active Learning Based on Mutual Information

Sourati, Jamshid; Akcakaya, Murat; Dy, Jennifer G.; Leen, Todd K.; Erdogmus, Deniz

doi:10.3390/e18020051

by Jamshid Sourati ^1,*, Murat Akcakaya ^2, Jennifer G. Dy ^1, Todd K. Leen ^3,† and Deniz Erdogmus ^1,†

^1 Department of Electrical and Computer Engineering, Northeastern University, 360 Huntington Ave, Boston, MA 02115, USA

^2 Department of Electrical and Computer Engineering, University of Pittsburgh, 3700 O’Hara Street, Pittsburgh, PA 15261, USA

^3 National Science Foundation, 4201 Wilson Boulevard, Arlington, VA 22230, USA

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Entropy 2016, 18(2), 51; https://doi.org/10.3390/e18020051

Submission received: 15 December 2015 / Revised: 28 January 2016 / Accepted: 1 February 2016 / Published: 5 February 2016

(This article belongs to the Special Issue Information Theoretic Learning)

Abstract:

Selecting a subset of samples to label from a large pool of unlabeled data points, such that a sufficiently accurate classifier is obtained using a reasonably small training set, is a challenging yet critical problem. It is challenging because solving it involves cumbersome combinatorial computations, and critical because labeling is an expensive and time-consuming task, so we always aim to minimize the number of required labels. While information-theoretic objectives, such as mutual information (MI) between the labels, have been successfully used in sequential querying, it is not straightforward to generalize these objectives to batch mode. This is because the evaluation and optimization of functions that are trivial in individual querying settings become intractable for many objectives when multiple queries are to be selected. In this paper, we develop a framework in which we propose efficient ways of evaluating and maximizing the MI between labels as an objective for batch mode active learning. Our proposed framework reduces the computational complexity from a polynomial whose order is proportional to the batch size, when no approximation is applied, to a linear cost. The performance of this framework is evaluated using data sets from several fields, showing that the proposed framework leads to efficient active learning for most of the data sets.

Keywords:

active learning; mutual information; submodular maximization; classification

1. Introduction

In supervised learning, there is a teacher in the form of labels associated with a training set of samples, which enables learning of models for prediction. However, obtaining expert/human/manual labeling is expensive. Instead of randomly selecting samples for manual annotation (the standard setting), the goal of active learning is to intelligently select samples for annotation so that an accurate classifier can be learned with as few labels as possible. Here, the implicit assumptions are that the labeling costs (in terms of time or financial expense) are the same for all of the queries and are significantly larger than the computational cost of the querying algorithms. The latter assumption leads us towards active learning algorithms that generate smaller batches of queries, even though they might be computationally more expensive than cheap passive learning.

An active learning setting typically starts with an initial model (from a few labeled samples), then more samples are selected for label querying. There are two general strategies for querying: sequential or batch mode. In sequential querying, a single sample is selected for querying at each active learning step, where each step involves labeling the query and retraining of the model. Batch mode querying, on the other hand, allows labeling of multiple samples at each active learning step. Most of the classical studies in active learning use sequential querying [1,2,3,4,5,6]. However, oftentimes querying a batch of samples is more efficient when the experts can label multiple samples in one step, since they can label more queries at each step without the need to wait for the retraining process. In this paper, we introduce a batch mode active learning algorithm based on mutual information.

Performing active learning in batch mode introduces new challenges. Since we need to select a set of queries, one should also make sure that the samples are non-redundant to maximize the amount of information that they provide. Another related challenge is that selecting a subset of samples based on a given objective function defined over sets can easily lead to intractable or to non-deterministic polynomial-time hard (NP-hard) optimization problems.

Recently, different batch mode active learning algorithms have been developed. Besides the querying strategy, active learning methods also differ in the criterion they optimize for selecting queries. The methods in [7,8,9] select samples that reduce model uncertainty. Holub et al. [8] do this directly using the joint entropy function; Brinker [7] does this indirectly based on the distance of samples to the classifier’s boundary; and Chen et al. [9] also do this indirectly based on the volume of the version space. Most of these methods need to employ heuristics to introduce diversity among the queries. Choosing a subset of samples that maximizes the expected model change [10] is another type of batch selection strategy, which works directly with classifier performance but is usually tied to a particular classification model and is not general. On the other hand, Azimi et al. [11] develop a framework which constructs a batch mode variant of any given sequential querying policy, such that it performs close to the sequential scenario. Their algorithm is based on the restrictive assumption that any sequential querying outperforms its batch mode counterpart, at the expense of more frequent model updating. There are also studies that select the queries based on the amount of information they carry with respect to the underlying data distribution. For example, Hoi et al. [12] use the Fisher information ratio, which is specifically designed for a logistic regression model; Guo [13] and Li et al. [14] utilize mutual information between the input feature values of the candidate queries and the remaining samples, using a Gaussian process distribution to model the joint probability over the instances.

In this paper, we advance the field by introducing a general framework for batch mode active learning that selects samples based on mutual information (MI) between labels of the candidate query set and the remaining samples given the current model. Our framework is general, because it can be applied with any classifier. Also note that we optimize for MI between the labels; in contrast, the MI in [13] and [14] is based on the input feature values. Our formulation is discriminative; hence, we do not need to model the distribution of the input features. An additional benefit is that our MI-based objective takes redundancy into account naturally.

MI between the labels has been employed in sequential active learning settings [15]; however, to date, we are not aware of any work that optimizes for this objective directly in batch mode. This is due to two main hurdles: (1) the difficulty of calculating the MI between non-singleton subsets of labels and (2) the combinatorial optimization problem it poses. We address hurdle (1) by introducing pessimistic and optimistic versions of estimating MI, both of which can be calculated efficiently in a greedy fashion (Section 2.2), and hurdle (2) via the popular greedy submodular maximization algorithm and a stochastic version of it. We show that MI is submodular but non-monotone, and therefore only the stochastic algorithm has guaranteed tight bounds for maximizing it (Section 2.3). This stochastic optimization algorithm, which has not been exploited in the active learning community before, will be compared in practice against the former algorithm, which has been used widely in batch querying [9,12,16,17]. Our proposed framework reduces the computational complexity from a polynomial of order k (the batch size) in the number of unlabeled samples, when no approximation is applied, to a linear cost (Section 2.4). Additionally, as suggested by Wei et al. [18], we also use uncertainty sampling to downsample the unlabeled data as a pre-processing step to further decrease the computational complexity (Section 2.5). We evaluate the performance of our proposed algorithms using data sets from several different fields and compare them against entropy-based and random querying benchmarks.

2. Active Learning with Mutual Information

In this section, we introduce our notation and formulate the batch active learning problem with MI as the objective. We then provide our solutions to the two hurdles mentioned above.

2.1. Formulation

Let

X = {x_{i}}_{i = 1}^{n + m} \subseteq R^{d}

denote our finite data set where

x_{i}

is the d-dimensional feature vector representing the i-th sample. Also let

Y = {y_{i}}_{i = 1}^{n + m}

be their respective class labels where each

y_{i}

represents the numerical category, ranging from 1 to c, with c being the number of classes. We distinguish the indices of the labeled and unlabeled partitions of the data set by

L

and

U

(with

| L | = n

and

| U | = m

), respectively, which are disjoint subsets of

{1, . . ., n + m}

. Note that the true values of the labels

Y_{L}

are observed and denoted by

Y_{L}^{*}

, hence

Y = Y_{U} \cup Y_{L}^{*}

. The initial classifier is trained based on these observed labels.

Labeling unlabeled samples is costly and time-consuming. Given a limited budget for this task, we wish to select

k \geq 1

queries from

U

whose labeling leads to a new classifier with a significantly improved performance. Therefore we need an objective set function

f : 2^{U} \to R

defined over subsets of the unlabeled indices

A \subseteq U

. The goal of batch active learning is then to choose a subset

A

with a given cardinality k that maximizes the objective given the current model that is trained based on

Y_{L}^{*}

. This can be formulated as the following constrained combinatorial optimization:

A^{*} = \underset{A \subseteq U, | A | = k}{arg max} f (A) .

(1)

We aim to choose the queries

A

whose labels

Y_{A}

give the highest amount of information about labels of the remaining samples

Y_{U - A}

. In other words, the goal is to maximize the mutual information (MI) between the random sets

Y_{A}

and

Y_{U - A}

given the observed labels

Y_{L}^{*}

and the features X:

f_{M I} (A) = M I (Y_{A}, Y_{U - A} | X, Y_{L}^{*}) = H (Y_{A} | X, Y_{L}^{*}) - H (Y_{A} | X, Y_{L}^{*}, Y_{U - A}) .

(2)

Let us focus on the first right hand side term

f_{H} (A) : = H (Y_{A} | X, Y_{L}^{*})

, which is the joint entropy function used in active learning methods [8]. Note that maximizing

f_{H}

is equivalent to minimizing

H (Y_{U - A} | X, Y_{L}^{*}, Y_{A})

(the actual objective introduced in [8]), since

H (Y_{U} | X, Y_{L}^{*}) = H (Y_{A} | X, Y_{L}^{*}) + H (Y_{U - A} | X, Y_{L}^{*}, Y_{A})

and

H (Y_{U} | X, Y_{L}^{*})

is a constant with respect to

A

. This objective is expensive to calculate due to the complexity of computing the joint posterior

P (Y_{A} | X, Y_{L}^{*})

. However, when a point estimate, such as maximum likelihood, is used to train the classifier’s parameter θ, one can treat θ as deterministically equal to the maximum likelihood estimate (MLE)

{\hat{θ}}_{n}

given X and

Y_{L}^{*}

. Then we can rewrite the posterior as

P (Y_{A} | X, Y_{L}^{*}) = \int P (Y_{A} | θ) \cdot δ (θ - {\hat{θ}}_{n}) d θ = P (Y_{A} | {\hat{θ}}_{n})

. Since in most discriminative classifiers the labels are assumed to be independent given the parameter, one can write

P (Y_{A} | {\hat{θ}}_{n}) = \prod_{i \in A} P (y_{i} | {\hat{θ}}_{n})

. This simplifies the computation of the joint entropy to the sum of sample entropy contributions:

f_{H} (A) = H (Y_{A} | X, Y_{L}^{*}) = H (Y_{A} | {\hat{θ}}_{n}) = \sum_{j \in A} H (y_{j} | {\hat{θ}}_{n}),

(3)

which is straightforward to compute given the probability mass functions (pmfs)

P (y_{i} | {\hat{θ}}_{n})

. Equation (3) implies that maximizing

f_{H}

can be separated into several individual maximizations, hence does not take into account the redundancy among the selected queries. Thus, in different related studies heuristics are added to cope with this issue. MI in Equation (2), on the other hand, removes this shortcoming by introducing a second term which conditions over the unobserved random variable

Y_{U - A}

, as well as the observed

Y_{L}^{*}

. This conditioning prevents the labels in

Y_{A}

from becoming independent, and therefore automatically incorporates the diversity among the queries (see next section for details of evaluating this term).
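As a small illustration of Equation (3), the following Python sketch computes the joint entropy as a sum of per-sample entropies. The posterior values and function names are hypothetical and chosen only for illustration. Because the objective separates over samples, the best batch of size k is simply the k samples with the largest individual entropies, so two identical (fully redundant) samples can both be selected.

```python
import math

def entropy(pmf):
    """Shannon entropy (in nats) of one class posterior P(y_i | theta_hat_n)."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def joint_entropy(pmfs, A):
    """f_H(A) under conditional independence: a sum of per-sample entropies."""
    return sum(entropy(pmfs[i]) for i in A)

# Hypothetical posteriors for four unlabeled samples (c = 2).
# Samples 0 and 1 are identical, i.e., fully redundant.
pmfs = [[0.5, 0.5], [0.5, 0.5], [0.6, 0.4], [0.9, 0.1]]

# Maximizing f_H for k = 2 reduces to ranking the individual entropies,
# so both redundant samples end up in the batch.
k = 2
top_k = sorted(range(len(pmfs)), key=lambda i: entropy(pmfs[i]), reverse=True)[:k]
print(top_k, joint_entropy(pmfs, top_k))
```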

Unfortunately, maximizing

f_{M I}

for

k > 1

is NP-hard (the optimization hurdle). Relaxing combinatorial optimizations into continuous spaces is a common technique to make the computations tractable [13]; however, these methods still involve a final discretization step that often relies on heuristics. In the following sections, we introduce our strategies to overcome the practical hurdles in MI-based active learning algorithms by introducing (1) pessimistic/optimistic approximations of MI; and (2) submodular maximization algorithms that allow us to perform the computations within the discrete domain.

2.2. Evaluating Mutual Information

In this section, we address the hurdle of evaluating MI between non-singleton subsets of labels. This objective, formulated in Equation (2), is also equal to

f_{M I} (A) = H (Y_{U - A} | X, Y_{L}^{*}) - H (Y_{U - A} | X, Y_{L}^{*}, Y_{A}),

(4)

due to MI’s symmetry. We prefer this equation, since usually we have

| Y_{A} | = k ≪ | Y_{U} |

and thus it leads to a more computationally efficient problem. Note that the first term on the right-hand side of Equation (4) can be evaluated similarly to Equation (3). The major difficulty we need to handle in Equation (4) is the computation of the second term, which requires considering all possible label assignments to

Y_{A}

. To make this computationally tractable, we propose to use a greedy strategy based on two variants: pessimistic and optimistic approximations of MI. To see this we focus on the second term:

H (Y_{U - A} | X, Y_{L}^{*}, Y_{A}) = \sum_{J \in {1, . . ., c}^{| A |}} P (Y_{A} = J | X, Y_{L}^{*}) \cdot H (Y_{U - A} | X, Y_{L}^{*}, Y_{A} = J),

(5)

where

{1, . . ., c}^{| A |}

is the set of all possible class label assignments to the samples in

A

. For example, if

A

has three samples (

| A | = 3

) and

c = 2

, then this set would be equal to

\{(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (2, 2, 2), (1, 2, 2), (2, 1, 2), (1, 1, 2)\}

. For each fixed label assignment J, the classifier should be retrained after adding the new labels

Y_{A} = J

to the training labels

Y_{L}^{*}

in order to compute the conditional entropy

H (Y_{U - A} | X, Y_{L}^{*}, Y_{A} = J)

. It is also evident from the example above that the number of possible assignments J to

Y_{A}

is

c^{k}

. Therefore, the number of necessary classifier updates grows exponentially with

| Y_{A} | = k

. This is computationally very expensive and makes Equation (5) impractical. Alternatively, we can replace the expectation in Equation (5) with a maximization/minimization of the conditional entropy to get a pessimistic/optimistic approximation of MI. Such a replacement enables us to employ efficient greedy approaches to estimate

f_{M I}

in a conservative/aggressive manner. The greedy approach that we use here is compatible with the iterative nature of the optimization algorithms, Algorithms 1 and 2 (described in Section 2.3). In the remainder of this section, we focus on the pessimistic approximation. Similar equations can be derived for the optimistic case. The first step is replacing the weighted summation in Equation (5) by a maximization:

f_{M I}^{p e s s} (A) = H (Y_{U - A} | X, Y_{L}^{*}) - max_{J \in {1, . . ., c}^{| A |}} H (Y_{U - A} | X, Y_{L}^{*}, Y_{A} = J)

(6)

Note that

f_{M I}^{p e s s} (A)

is always less than or equal to

f_{M I}

. Equation (6) still needs the computation of the conditional entropy for all possible assignments J. However, it enables us to use greedy approaches to approximate

f_{M I}^{p e s s} (A)

for any candidate query set

A \subseteq U

, as described below.

Without loss of generality, suppose that

A

, with size

| A | = k

(

1 \leq k \leq m

), can be shown element-wise as

A = {u_{1}, . . ., u_{k}}

. Define

A_{t} = {u_{1}, . . ., u_{t}}

for any

t \leq k

(hence

A_{k} = A

). In the first iteration we can evaluate Equation (6) simply for the singleton

f_{M I}^{p e s s} ({u_{1}})

and store

{\hat{y}}_{u_{1}}

, the assignment to

y_{u_{1}}

which maximizes the conditional entropy in Equation (6):

\begin{aligned} f_{M I}^{p e s s} ({u_{1}}) &= H (Y_{U - {u_{1}}} | X, Y_{L}^{*}) - max_{j \in {1, . . ., c}} H (Y_{U - {u_{1}}} | X, Y_{L}^{*}, y_{u_{1}} = j) \\ &= H (Y_{U - {u_{1}}} | {\hat{θ}}_{n}) - H (Y_{U - {u_{1}}} | X, Y_{L}^{*}, y_{u_{1}} = {\hat{y}}_{u_{1}}), \end{aligned}

(7)

where we used Equation (3) to substitute the first term with

H (Y_{U - {u_{1}}} | {\hat{θ}}_{n})

. Note that the second term in Equation (7) requires retraining the classifier c times, with the newly added class label

y_{u_{1}} = j

for all possible

j \in {1, . . ., c}

. In practice, the retraining process can be very time-consuming. Here, instead of retraining the classifier from scratch, we leverage the current estimate of the classifier’s parameter vector and take one quasi-Newton step to update this estimate:

{\tilde{θ}}_{n + 1} : = {\hat{θ}}_{n} - H_{n + 1}^{- 1} \cdot g_{n + 1},

(8)

where

g_{n + 1}

and

H_{n + 1}

are the gradient vector and Hessian matrix of the log-likelihood function of our classifier given the labels

Y_{L}^{*} \cup {y_{u_{1}} = j}

. Then we use the approximation

H (Y_{U - {u_{1}}} | X, Y_{L}^{*}, y_{u_{1}} = {\hat{y}}_{u_{1}}) = H (Y_{U - {u_{1}}} | X, θ_{n + 1}) \approx H (Y_{U - {u_{1}}} | X, {\tilde{θ}}_{n + 1}) .

(9)

In the Appendix, we derive the update equation for the case where multinomial logistic regression is used as the discriminative classifier. Specifically, we will see that

g_{n + 1}

and

H_{n + 1}^{- 1}

can be obtained efficiently from

g_{n}

and

H_{n}^{- 1}

.
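To make the update in Equation (8) concrete, the following Python sketch applies a single Newton step to a deliberately simple model. It assumes a binary logistic model with one scalar parameter, p(y = 1 | x) = sigmoid(θ x), so the Hessian is a scalar; this is not the multinomial update derived in the Appendix, and it recomputes the gradient and Hessian from scratch rather than updating them incrementally. The data, the starting value and all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(theta, xs, ys):
    """One Newton step on the log-likelihood of p(y = 1 | x) = sigmoid(theta * x)."""
    probs = [sigmoid(theta * x) for x in xs]
    grad = sum((y - p) * x for x, y, p in zip(xs, ys, probs))      # gradient g
    hess = -sum(p * (1.0 - p) * x * x for x, p in zip(xs, probs))  # Hessian H (scalar)
    return theta - grad / hess

# Hypothetical labeled data and current estimate (assumed, not an actual MLE).
xs, ys = [1.0, -1.0, 2.0], [1, 0, 1]
theta_hat = 0.0
x_new = 0.5  # feature of the candidate query u_1

# One cheap update per candidate label j, instead of retraining from scratch.
theta_tilde = {j: newton_step(theta_hat, xs + [x_new], ys + [j]) for j in (0, 1)}
print(theta_tilde)
```

Each entry of `theta_tilde` plays the role of the approximate parameter used in Equation (9) for one candidate label.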

If

k = 1

, we are done. Otherwise, to move from iteration

t - 1

to t (

1 < t \leq k

),

f_{M I}^{p e s s} (A_{t - 1} \cup {u_{t}})

will be approximated from the previous iterations:

f_{M I}^{p e s s} (A_{t}) \approx H (Y_{U - A_{t}} |{\hat{θ}}_{n}) - max_{j \in {1, . . ., c}} H (Y_{U - A_{t}} |X, Y_{L}^{*}, {\hat{Y}}_{A_{t - 1}}, y_{u_{t}} = j),

(10)

where

{\hat{Y}}_{A_{t - 1}} = {{\hat{y}}_{u_{1}}, . . ., {\hat{y}}_{u_{t - 1}}}

are the assignments maximizing the conditional entropy that are stored from the previous iterations, such that the i-th element

{\hat{y}}_{u_{i}}

is the assignment stored for

u_{i} = A_{i} - A_{i - 1} (1 \leq i \leq t)

. Note that Equation (10) is an approximation of the pessimistic MI as defined by Equation (6); however, to keep the notation simple, we use the same symbol

f_{M I}^{p e s s}

for both. Moreover, similar to Equation (7), there are c classifier updates involved in the computation of Equation (10). To complete iteration t, we set

{\hat{y}}_{u_{t}}

equal to the assignment to

y_{u_{t}}

that maximizes the second term in Equation (10) and add it to

{\hat{Y}}_{A_{t - 1}}

to form

{\hat{Y}}_{A_{t}}

.

As in the first iteration, the conditional entropy term in Equation (10) is estimated by using the set of parameters obtained from the quasi-Newton step:

H (Y_{U - A_{t}} |X, Y_{L}^{*}, {\hat{Y}}_{A_{t - 1}}, y_{u_{t}} = j) \approx H (Y_{U - A_{t}} |X, {\tilde{θ}}_{n + t}),

(11)

where

{\tilde{θ}}_{n + t} = {\tilde{θ}}_{n + t - 1} - H_{n + t}^{- 1} \cdot g_{n + t} .

(12)

Considering Equations (7) and (10) as the greedy steps of approximating

f_{M I}

, we see that the number of necessary classifier updates is

c \cdot k

, since there are k iterations, each of which requires retraining the classifier c times. Thus, the computational complexity is reduced from the exponential cost of the exact formulation in Equation (5) to a linear cost in the greedy approximation.
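The greedy steps in Equations (7) and (10) can be sketched in Python as follows. This is only an illustrative outline: the two callables stand in for the entropy computations, each call to `cond_entropy` represents one classifier update, and the toy oracle at the bottom (with its `reduce_by` table) is entirely hypothetical.

```python
def greedy_pessimistic_mi(A, classes, prior_entropy, cond_entropy):
    """Greedy estimate of the pessimistic MI for an ordered query list A.

    prior_entropy: maps A_t to H(Y_{U - A_t} | theta_hat_n)
    cond_entropy : maps {index: label} to the conditional entropy of the
                   remaining labels; one call stands for one classifier update
    """
    assigned = {}  # worst-case labels stored from previous iterations
    value = 0.0
    for t, u in enumerate(A, start=1):
        scores = {j: cond_entropy({**assigned, u: j}) for j in classes}
        worst = max(scores, key=scores.get)  # label maximizing the entropy
        assigned[u] = worst
        value = prior_entropy(A[:t]) - scores[worst]
    return max(value, 0.0), assigned  # clip at zero to keep MI non-negative

# Toy oracle (assumed): m = 4 samples with unit entropy each; fixing label j
# for sample u lowers the remaining entropy by reduce_by[(u, j)].
m = 4
reduce_by = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.1}
calls = []

def prior_entropy(A_t):
    return float(m - len(A_t))

def cond_entropy(labels):
    calls.append(dict(labels))
    return (m - len(labels)) - sum(reduce_by[(u, j)] for u, j in labels.items())

value, labels = greedy_pessimistic_mi([0, 1], [0, 1], prior_entropy, cond_entropy)
print(value, labels, len(calls))
```

In this toy run the number of oracle calls equals c times k, whereas enumerating all joint assignments would need c to the power k calls.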

Similar to Equation (10), for the optimistic approximation, we will have:

f_{M I}^{o p t} (A_{t}) = H (Y_{U - A_{t}} |{\hat{θ}}_{n}) - min_{j \in {1, . . ., c}} H (Y_{U - A_{t}} |X, Y_{L}^{*}, {\hat{Y}}_{A_{t - 1}}, y_{u_{t}} = j),

(13)

where

{\hat{Y}}_{A_{t - 1}} = {{\hat{y}}_{u_{1}}, . . ., {\hat{y}}_{u_{t - 1}}}

is the set of class assignments minimizing the conditional entropy that are stored from the previous iterations. Clearly, the reduction of the computational complexity remains the same in the optimistic formulation.

Let us emphasize that, from the definitions of

f_{M I}^{p e s s}

and

f_{M I}^{o p t}

, we always have the following inequality

f_{M I}^{p e s s} (A) \leq f_{M I} (A) \leq f_{M I}^{o p t} (A), \forall A \subseteq U .

(14)

The first (or second) inequality becomes an equality if the averaging of the conditional entropy in Equation (5) gives the same result as the maximization (or minimization) involved in the approximations. This is equivalent to saying that the posterior probability

P (Y_{A} | X, Y_{L}^{*})

is a degenerate distribution concentrated at the assignment

Y_{A} = J

that maximizes (or minimizes) the conditional entropy. Furthermore, if the posterior is a uniform distribution, giving the same posterior probability to all possible assignments

J \in {1, . . ., c}^{| A |}

, then the averaging, minimization and maximization lead to the same numerical result and therefore we get

f_{M I}^{p e s s} = f_{M I}^{o p t} = f_{M I}

.
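The ordering in Equation (14) can be checked numerically on a toy case. In the Python sketch below, the posterior over assignments and the conditional entropies are made-up numbers, and the function name is hypothetical; the point is only that a weighted average always lies between the minimum and the maximum, and that a degenerate posterior concentrated on the maximizing assignment makes the pessimistic value exact.

```python
def mi_variants(prior_h, posterior, cond_h):
    """Return (pessimistic, exact, optimistic) MI values for one query set.

    posterior: {assignment J: P(Y_A = J | ...)}
    cond_h   : {assignment J: H(Y_{U-A} | ..., Y_A = J)}
    """
    expected = sum(posterior[J] * cond_h[J] for J in posterior)
    exact = prior_h - expected                # Equation (5) plugged into (4)
    pess = prior_h - max(cond_h.values())     # Equation (6)
    opt = prior_h - min(cond_h.values())      # optimistic counterpart
    return pess, exact, opt

# Hypothetical numbers for a single query with c = 2.
cond_h = {(1,): 1.0, (2,): 1.6}
pess, exact, opt = mi_variants(2.0, {(1,): 0.7, (2,): 0.3}, cond_h)
print(pess, exact, opt)

# Degenerate posterior on the entropy-maximizing assignment: pess equals exact.
pess_d, exact_d, opt_d = mi_variants(2.0, {(1,): 0.0, (2,): 1.0}, cond_h)
print(pess_d, exact_d)
```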

In theory, the value of MI between any two random variables is non-negative. However, because of the approximations made in computing the pessimistic or optimistic evaluations of MI, it is possible to get negative values depending on the distribution of the data. Therefore, after going through all the elements of

A

in evaluating

f_{M I}^{p e s s}

(or

f_{M I}^{o p t}

), we take the maximum between the approximations of

f_{M I}^{p e s s} (A)

(or

f_{M I}^{o p t} (A)

) and zero to ensure its non-negativity.

2.3. Randomized vs. Deterministic Submodular Optimizations

In this section, we begin by reviewing the basic definitions regarding submodular set functions, and show that both

f_{M I}

and

f_{H}

satisfy the submodularity condition. We then present two methods for submodular maximization: a deterministic and a randomized approach. The former is applicable to submodular and monotone set functions such as

f_{H}

. But

f_{M I}

is not monotone in general, hence we present the randomized approach for this objective.

Definition 1.

A set function

f : 2^{U} \to R

is said to be submodular if

f (A) + f (B) \geq f (A \cup B) + f (A \cap B), \forall A, B \subseteq U .

(15)

We call f supermodular if the inequality in Equation (15) is reversed. On many occasions, it is easier to use an equivalent definition, which uses the notion of discrete derivative defined as:

ρ_{f} (A, u) : = f (A \cup {u}) - f (A), \forall A \subseteq U, u \in U .

(16)

Proposition 2.

Let

f : 2^{U} \to R

be a set function. f is submodular if and only if we have

ρ_{f} (A, u) \geq ρ_{f} (B, u), \forall A \subseteq B \subseteq U, u \in U - B .

(17)

This equips us to show the submodularity of joint entropy and MI:

Theorem 3.

The set functions

f_{H}

and

f_{M I}

, defined in Equations (3) and (2) above, are submodular.

Proof.

It is straightforward to check the submodularity of

f_{H}

and therefore the first term of MI formulation in Equation (2). It remains to show that

g (A) : = H (Y_{A} | X, Y_{L}^{*}, Y_{U - A})

, the second term with the opposite sign, is supermodular. Let us first write the discrete derivative of the function g:

\begin{aligned} ρ_{g} (A, u) &= g (A \cup {u}) - g (A) \\ &= H (Y_{A \cup {u}} | X, Y_{L}^{*}, Y_{U - A \cup {u}}) - H (Y_{A} | X, Y_{L}^{*}, Y_{U - A}) \\ &= H (y_{u} | X, Y_{L}^{*}, Y_{U - A \cup {u}}), \end{aligned}

(18)

which holds for any

u \notin A \subseteq U

. Here, we used that the joint entropy of two sets of random variables A and B can be written as

H (A, B) = H (A) + H (B | A)

. Now take any superset

B \supseteq A

, which does not contain

u \in U

. From

B \supseteq A

, we have

Y_{U - B \cup {u}} \subseteq Y_{U - A \cup {u}}

and therefore, since conditioning on additional variables cannot increase entropy,

ρ_{g} (A, u) - ρ_{g} (B, u) = H (y_{u} | X, Y_{L}^{*}, Y_{U - A \cup {u}}) - H (y_{u} | X, Y_{L}^{*}, Y_{U - B \cup {u}}) \leq 0

implying supermodularity of g. ☐
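For small ground sets, the condition in Proposition 2 can also be verified by brute force. The Python sketch below does this for the MI between a subset of three binary variables and its complement, under a made-up joint distribution (a simple chain y0, y1, y2); conditioning on X and the observed labels is omitted for brevity, and all names are hypothetical. It also evaluates the function at the empty set and at the full set.

```python
import math
from itertools import combinations, product

# Assumed toy joint pmf: y0 uniform, y1 copies y0 w.p. 0.9, y2 copies y1 w.p. 0.8.
JOINT = {}
for y0, y1, y2 in product((0, 1), repeat=3):
    JOINT[(y0, y1, y2)] = 0.5 * (0.9 if y1 == y0 else 0.1) * (0.8 if y2 == y1 else 0.2)
U = (0, 1, 2)

def entropy(S):
    """Joint entropy (nats) of the variables indexed by S."""
    marginal = {}
    for y, p in JOINT.items():
        key = tuple(y[i] for i in sorted(S))
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)

def f_mi(A):
    """MI between Y_A and Y_{U-A}: H(Y_A) + H(Y_{U-A}) - H(Y_U)."""
    A = frozenset(A)
    return entropy(A) + entropy(set(U) - A) - entropy(U)

def subsets(items):
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def is_submodular(f, ground, tol=1e-9):
    """Check rho_f(A, u) >= rho_f(B, u) for all A ⊆ B ⊆ ground and u outside B."""
    for B in subsets(ground):
        for A in subsets(B):
            for u in set(ground) - B:
                if f(A | {u}) - f(A) < f(B | {u}) - f(B) - tol:
                    return False
    return True

print(is_submodular(f_mi, U), f_mi(()), f_mi({0}), f_mi(U))
```

The check is exponential in the size of the ground set, so it is only a sanity test and not part of the querying procedure.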

Although submodular functions can be minimized efficiently, they are NP-hard to maximize [19], and therefore we have to use approximate algorithms. Next, we briefly discuss the classical approximate submodular maximization method widely used in batch querying [9,11,12,16,17]. This greedy approach, which we call deterministic throughout this paper, was first proposed in the seminal work of [20] (shown in Algorithm 1), and its performance is analyzed for monotone set functions as follows:

Definition 4.

The set function

f : 2^{U} \to R

is said to be monotone (nondecreasing) if for every

A \subseteq B \subseteq U

we have

f (A) \leq f (B)

.

Theorem 5.

Let

f : 2^{U} \to R

be a submodular and nondecreasing set function with

f (⌀) = 0

,

A

be the output of Algorithm 1 and

A^{*}

be the optimal solution to the problem in Equation (1). Then we have:

f (A) \geq [1 - {(\frac{k - 1}{k})}^{k}] f (A^{*}) \geq (1 - \frac{1}{e}) f (A^{*}) .

(19)

Algorithm 1: The deterministic approach

Inputs: The objective function f, the unlabeled indices

U

, the query batch size

k > 0

Outputs: a subset of unlabeled indices

A \subseteq U

of size k

The proof is given by [20] and [21]. Among the assumptions,

f (⌀) = 0

can always be assumed since maximizing a general set function

f (A)

is equivalent to maximizing its adjusted version

g (A) : = f (A) - f (⌀)

which satisfies

g (⌀) = 0

. Nemhauser et al. [22] also showed that Algorithm 1 gives the optimal approximate solution to the problem in (1) for nondecreasing functions such as

f_{H}

. However,

f_{M I}

is not monotone in general and therefore Theorem 5 is not applicable. To see the non-monotonicity of

f_{M I}

, it suffices to note that

f_{M I} (⌀) = f_{M I} (U) = 0

.
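For reference, the standard greedy scheme analyzed in Theorem 5 can be sketched in Python as below. The steps of Algorithm 1 are not reproduced in the text above, so this follows the usual textbook form of greedy submodular maximization and should be read as an assumption about its structure. The coverage objective and the `covers` table are hypothetical stand-ins for a monotone submodular function.

```python
def greedy_max(f, U, k):
    """Deterministic greedy: repeatedly add the element maximizing f(A ∪ {u})."""
    A, remaining = [], set(U)
    for _ in range(k):
        u_best = max(sorted(remaining), key=lambda u: f(A + [u]))
        A.append(u_best)
        remaining.remove(u_best)
    return A

# Toy monotone submodular objective (assumed): number of items covered.
covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}

def coverage(A):
    return len(set().union(*(covers[u] for u in A)))

batch = greedy_max(coverage, covers, k=2)
print(batch, coverage(batch))
```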

Recently several algorithms have been proposed for approximate maximization of nonnegative submodular set functions, which are not necessarily monotone. Feige et al. [23] made the first attempt towards this goal by proposing a

(2 / 5)

-approximation algorithm and also proving that

1 / 2

is the optimal approximation factor in this case. Buchbinder et al. [24] achieved this optimal bound in expectation by proposing a randomized iterative algorithm. However, these algorithms are designed for unconstrained maximization problems. Later, Buchbinder et al. [25] devised a

(1 / e)

-approximation randomized algorithm with cardinality constraint, which is more suitable for batch active learning. A pseudocode of this approach is shown in Algorithm 2 where instead of selecting the sample with maximum objective value at each iteration, the best k samples are identified (line 4) and one of them is chosen randomly (line 5). Such a randomized procedure provides a

(1 / e)

-approximation algorithm for maximizing a nonnegative submodular set function such as

f_{M I}

:

Theorem 6.

Let

f : 2^{U} \to R

be a submodular nonnegative set function and

A

be the output of Algorithm 2. Then if

A^{*}

is the optimal solution to the problem in (1) we have:

E [f (A)] \geq {(1 - \frac{1}{k})}^{k - 1} f (A^{*}) \geq \frac{1}{e} f (A^{*}) .

(20)

The proof can be found in [25] and our supplementary document. In order to be able to select k samples from

U_{t}

to form

M_{t}

for all t, it suffices to ensure that the smallest unlabeled set that we sample from,

U_{k - 1}

, has enough members, i.e.,

k \leq | U_{k - 1} | = | U | - k + 1

hence

k \leq (| U | + 1) / 2

.

Observe that although the assumptions in Theorem 6 are weaker than those in Theorem 5, the bound shown in Equation (20) is also looser than that in Equation (19). Interestingly, however, it has been proven that the inequality in Equation (19) still holds for Algorithm 2 if f is monotone (see Theorem 3.1 in [25]). Thus, the randomized Algorithm 2 is expected to perform similarly to Algorithm 1 for monotone functions.
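The randomized selection described above (identify the k best candidates, then pick one of them uniformly at random) can be sketched in Python as follows. This is a simplified reading of that description and not a faithful reproduction of the algorithm in [25]; the coverage objective, the `covers` table and the function names are hypothetical.

```python
import random

def randomized_greedy(f, U, k, rng=None):
    """At each iteration, pick uniformly among the k candidates with largest gain."""
    rng = rng or random.Random()
    A, remaining = [], set(U)
    for _ in range(k):
        base = f(A)
        ranked = sorted(sorted(remaining), key=lambda u: f(A + [u]) - base, reverse=True)
        M_t = ranked[:k]          # the k best candidates in this iteration
        u_t = rng.choice(M_t)     # uniform random choice among them
        A.append(u_t)
        remaining.remove(u_t)
    return A

# Toy submodular objective (assumed): number of items covered.
covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}

def coverage(A):
    return len(set().union(*(covers[u] for u in A)))

single = randomized_greedy(coverage, covers, k=1, rng=random.Random(0))
batch = randomized_greedy(coverage, covers, k=2, rng=random.Random(0))
print(single, batch)
```

With k = 1 the candidate list has a single element, so the random choice is forced and the selection coincides with the deterministic greedy pick.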

Algorithm 2: The randomized approach

Inputs: The objective function f, the unlabeled indices

U

, the query batch size

k > 0

Outputs: a subset of unlabeled indices

A \subseteq U

of size k

Algorithms 1 and 2 are equivalent for sequential querying (

k = 1

). Also note that in both algorithms, the variable

u_{t}

in iteration t is determined by deterministic or stochastic maximization of

f (A_{t - 1} \cup {u})

. Fortunately, such maximization needs only computations in the form of Equation (10) or Equation (13) when

f = f_{M I}^{p e s s}

or

f_{M I}^{o p t}

. These computations can be done easily provided that the gradient vector

g_{n + t - 1}

and inverse-Hessian matrix

H_{n + t - 1}^{- 1}

have been stored from the previously selected subset

A_{t - 1}

. The updated gradient and inverse-Hessian that are used to compute

f (A_{t - 1} \cup {u})

are different for each specific

u \in U_{t - 1}

. We only save those associated with the local maximizer, that is

u_{t}

, as

g_{n + t}

and

H_{n + t}^{- 1}

to be used in the next iteration.

2.4. Total Complexity Reduction

We measure the complexity of a given querying algorithm in terms of the required number of classifier updates. This makes our analysis general and independent of the updating procedure, which can be done in several possible ways. As discussed in Section 2.2, we chose to perform a single quasi-Newton step, as in Equation (8), but alternatively one can use full retraining or any other numerical parameter update.

Consider the following optimization problems:

\underset{\substack{A \subseteq U \\ | A | = k}}{\arg \max} f_{M I} (A),

(21a)

greedy \underset{\substack{A \subseteq U \\ | A | = k}}{\arg \max} {\tilde{f}}_{M I} (A),

(21b)

where “

greedy \arg \max

” denotes the greedy maximization operator that uses Algorithm 1 or 2 to maximize the objective, and

{\tilde{f}}_{M I}

is either

f_{M I}^{p e s s}

or

f_{M I}^{o p t}

. Note that Equation (21a) formulates the global maximization of the exact MI function and Equation (21b) shows the optimization in our framework, that is a greedy maximization of the pessimistic/optimistic MI approximations. In the following remark, we compare the complexity of solving the two optimizations in Equation (21) in terms of the number of classifier updates required for obtaining the solutions.

Remark 1.

For a fixed k, the number of necessary classifier updates for solving Equation (21a) grows as a polynomial of order k in the number of unlabeled samples m, whereas for Equation (21b) it grows linearly in m.

Proof.

As is explained in Section 2.2, the number of classifier updates for computing

f_{M I} (A)

without any approximations, is

c^{k}

. Moreover, in order to find the global maximizer of MI,

f_{M I}

needs to be evaluated at all subsets of

U

with size k. There are

\binom{m}{k} = O (m^{k})

of such subsets (recall that

m = | U |

). Hence, the total number of classifier updates required for the global maximization of

f_{M I}

is of order

O ({(m \cdot c)}^{k})

.

Now, regarding Equation (21b), recall from Section 2.2 that if

g_{n + t - 1}

and

H_{n + t - 1}^{- 1}

are stored from the previous iteration, computing

{\tilde{f}}_{M I} (A_{t - 1} \cup \{u_{t}\})

needs only c classifier updates. However, unlike the evaluation problem in Section 2.2, when computing line (4) of Algorithms 1 and 2 the next sample to add, that is

u_{t}

, is not given. In order to obtain

u_{t}

,

{\tilde{f}}_{M I}

is to be evaluated at all the remaining samples in

U_{t - 1}

. Since

| U_{t - 1} | = m - t + 1

, the number of necessary classifier updates in the t-th iteration is

c \cdot (m - t + 1)

. Both algorithms run k iterations, which results in the following total number of classifier updates:

c m + c (m - 1) + \ldots + c (m - k + 1) = c k (m - \frac{k - 1}{2}) = O (c k m) .

☐
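The two counts in the proof can be checked numerically. The following is a minimal sketch, not part of the original algorithms: the function names and the values of m, k and c are illustrative assumptions. It counts classifier updates for exhaustive search over all size-k subsets and for the greedy procedure, which scores every remaining sample once per iteration.

```python
from math import comb


def exact_updates(m: int, k: int, c: int) -> int:
    """Exhaustive search: C(m, k) candidate subsets, c**k updates for each."""
    return comb(m, k) * c ** k


def greedy_updates(m: int, k: int, c: int) -> int:
    """Greedy search: iteration t scores the m - t + 1 remaining samples,
    each at a cost of c classifier updates."""
    return sum(c * (m - t + 1) for t in range(1, k + 1))


# Illustrative values only (hypothetical, not taken from the experiments).
m, k, c = 100, 5, 3
print("exact :", exact_updates(m, k, c))
print("greedy:", greedy_updates(m, k, c))
```

The greedy count is an arithmetic series, so it stays within a factor k of c m, while the exhaustive count grows with both the number of subsets and the number of label assignments per subset.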

2.5. Further Speed-Up

The remark above shows that the total complexity of our proposed MI-based querying approaches is linear in the size of the unlabeled pool, in contrast with the exponential cost of the exact formulation, which is not practical.

Even after approximating

f_{M I}

using the pessimistic or optimistic formulations, MI-based algorithms can be significantly slow for large data sets. In order to further scale up our algorithm, inspired by [18], we first select a subset of the unlabeled samples by choosing only the β most uncertain samples (where

β \in Z^{+}

). We then run our MI-based algorithm over the filtered data. More formally, the input set of unlabeled indices to Algorithms 1 and 2 will be

U_{f} = \underset{\begin{matrix} U^{\prime} \subseteq U \\ | U^{\prime} | = β \end{matrix}}{\arg \max} \sum_{u \in U^{\prime}} f_{H} (\{u\}) .

(22)

It is evident that for

β = | U |

the filtered data will be equal to the original unlabeled pool

U_{f} = U

. From now on, we add the adjective filtered to any querying algorithm that is preceded by reduction of the unlabeled data into the samples with high uncertainty as described above.
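Because the objective in Equation (22) is a sum of per-sample terms, its maximizer is simply the β samples with the largest individual scores. The sketch below illustrates this under the assumption that the per-sample score is the Shannon entropy of the classifier's posterior over the c classes; the function names and the toy posteriors are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-posterior vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_most_uncertain(posteriors, beta):
    """Return the indices of the beta samples with the largest entropy.

    posteriors[i] is assumed to hold the predicted class probabilities
    of the i-th unlabeled sample.
    """
    scores = [entropy(p) for p in posteriors]
    ranked = sorted(range(len(posteriors)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:beta])


# Toy posteriors for four unlabeled samples and three classes.
posteriors = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.30, 0.30],
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.70, 0.20, 0.10],
]
print(filter_most_uncertain(posteriors, beta=2))  # → [1, 2]
```

With beta equal to the pool size, every index is returned and the filtering has no effect.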

3. Results and Discussion

In this section, we show our experimental results over several data sets from three different fields: medicine (Section 3.1), image processing (Section 3.2) and music harmony analysis (Section 3.3). We ran our MI-based querying algorithms against entropy-based and random active learning benchmarks. In the following sections, we first describe the data sets that have been used, then explain the experimental settings and present the numerical results.

3.1. Cardiotocography

This data set was downloaded from the UCI repository [26] and contains 2126 fetal cardiotocograms, each of which is represented by a 21-dimensional feature vector. The data is categorized into three classes based on the fetal states: normal, suspect and pathological (therefore

c = 3

). All the data samples are first projected into a 15-dimensional principal component analysis (PCA) subspace obtained from the unlabeled pool. The initial labeled data set, chosen randomly at the beginning of each experiment, consists of 75 samples (25 samples per class). We will refer to this data set as “Cardio” in the following sections.

3.2. MNIST (Modified National Institute of Standards and Technology)

This is an image database of handwritten digits 1 to 9 [27]. Here, we only use images for digits 1 to 4, hence

c = 4

. The data set, consisting of

20 \times 20

images, is already divided into testing and training partitions. In our experiments, these partitions are fixed as given, but each time the initial labeled data set is randomly chosen from the training partition. The raw 400-dimensional feature vectors are projected into a 10-dimensional PCA subspace constructed from the training partition. After keeping only the images of digits 1 to 4, the sizes of the testing and training partitions are 4130 and 4159, respectively. We also set the size of our initial labeled data set

L_{0}

to 200 (50 samples per class).

3.3. Bach Choral Harmony

The other data set that we used for evaluating performance of the algorithms contains pitch information of time events of 60 chorales by Johann Sebastian Bach [28]. Each event is represented by pitch-wise and meter information and is assigned a chord class label. We selected the events associated with the five most frequent chords in the data set: D-major, G-major, C-major, F-major and A-major (hence

c = 5

), resulting in a set of 2221 samples. Discarding the pitch class of the bass notes and the metric information, we used the binary indicators of the pitch classes corresponding to the 12 equal-tempered notes of the chromatic scale. This leads to a set of 12-dimensional binary feature vectors that are projected into an 8-dimensional PCA subspace obtained from the training data in each experiment. We will refer to this data set simply as “Bach” in the remaining sections.

3.4. Experimental Settings

From the previous sections, we have two methods of evaluating MI and two optimization techniques, leading to four different ways of doing MI-based querying, in all of which we used the filtered pool of unlabeled samples

U_{f}

that is obtained with

β = 100

. Throughout this section, we distinguish different approaches involved in our experimental settings using the labels listed below:

Pess-MI-Det: Pessimistic MI $f_{M I}^{p e s s}$ with deterministic optimization (Algorithm 1);
Pess-MI-Rand: Pessimistic MI $f_{M I}^{p e s s}$ with randomized optimization (Algorithm 2);
Opt-MI-Det: Optimistic MI $f_{M I}^{o p t}$ with deterministic optimization (Algorithm 1);
Opt-MI-Rand: Optimistic MI $f_{M I}^{o p t}$ with randomized optimization (Algorithm 2);
entropy: Entropy objective $f_{H}$ with deterministic optimization (Algorithm 1);
random: Random querying.

In sequential querying, where deterministic and randomized optimization algorithms are equivalent, we use Pess-MI and Opt-MI to refer to the MI-based objectives without mentioning the optimization type.

In running the querying experiments over all the data sets, we used linear logistic regression as the core classifier. If the data under consideration is not already divided into testing/training partitions, we randomly generate such partitions in each experiment with a fixed ratio of 3/7 (testing size to training size). The initial training data set

L_{0}

is randomly selected from the training partition and the rest of the training samples are considered as the unlabeled pool

U

from which the queries are to be selected in each querying iteration. Moreover, in each experiment we first reduce the dimensionality of the data using PCA over the unlabeled pool.
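The partitioning described above can be sketched as follows. This is an illustrative outline only: the function names, the seeds and the toy labels are assumptions, the per-class sampling mirrors the "samples per class" description of the initial labeled sets in Sections 3.1 and 3.2, and the PCA step is omitted.

```python
import random


def split_indices(n, test_to_train=3 / 7, seed=0):
    """Randomly split range(n) so that |test| / |train| is about test_to_train."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_to_train / (1 + test_to_train))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)


def initial_labeled_set(train_idx, labels, per_class, seed=0):
    """Draw per_class random samples of every class from the training partition.

    Returns the initial labeled indices and the remaining unlabeled pool.
    Raises ValueError if a class has fewer than per_class training samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    labeled = []
    for members in by_class.values():
        labeled.extend(rng.sample(members, per_class))
    chosen = set(labeled)
    pool = [i for i in train_idx if i not in chosen]
    return labeled, pool


# Toy usage: 100 samples with three classes (hypothetical labels).
labels = [i % 3 for i in range(100)]
train, test = split_indices(len(labels))
labeled, pool = initial_labeled_set(train, labels, per_class=3)
print(len(train), len(test), len(labeled), len(pool))
```

Using a fresh seed for every repetition gives a different partition and a different initial labeled set in each experiment.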

In the experiments, we iteratively select query batches of size

k = 1

, 5, 10 or 20, add the selected queries together with their class labels to the labeled data set and re-calculate the parameters of the classifier, which in turn leads to an updated accuracy value based on the testing partition. For each value of k, we repeated the experiments 25 times, each time with a different random selection of testing/training partitions and a different initial labeled set

L_{0}

. Hence, in total, we get 25 accuracy curves for each value of k. Ideally, we want an active learning algorithm whose accuracy curve increases as fast as possible, i.e., one that obtains a more accurate classifier by labeling fewer query batches.

In order to present the performance of the listed algorithms, we calculate the average and standard deviation (STD) of the 25 accuracy curves for each algorithm. Furthermore, for pairwise comparison between the MI-based approaches and the benchmarks for a fixed value of k, we perform two-sample one-tailed T-tests over the accuracy values of the competing algorithms. Note that such a hypothesis test can and should be done separately over the accuracy values calculated after each querying iteration t. Here, the assumption is that the accuracy levels generated from the 25 querying experiments at iteration t are independent of each other. Let

η_{t}

denote the random variable representing the accuracy of the updated classifier after t iterations of an MI-based querying algorithm and

η_{t}^{'}

be a similar random variable for a competing non-MI-based method. Then, we consider the null and alternative hypotheses to be:

\begin{matrix} H_{0} : μ (η_{t}) & \leq μ (η_{t}^{'}) \\ H_{1} : μ (η_{t}) & > μ (η_{t}^{'}) \end{matrix}

(23)

where

μ (η_{t})

and

μ (η_{t}^{'})

are the means of the random variables

η_{t}

and

η_{t}^{'}

. We perform such a T-test to compare all variants of the MI-based objectives (see the list above) against the entropy-based and random querying algorithms. Rejecting the null hypothesis supports the conclusion that the mean accuracy at the t-th iteration of MI-based querying is higher than the mean accuracy obtained with a querying objective other than the approximating variants of

f_{M I}

. In other words, a smaller p-value in the T-test described above indicates stronger evidence that the mean accuracy of the updated classifier is larger when the MI-based approach is used for querying.
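For a single querying iteration t, the test statistic for the hypotheses in Equation (23) can be sketched as below. The text does not state whether a pooled-variance or an unequal-variance test was used, so the pooled (Student) form here is an assumption, as are the function name and the toy accuracy values.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(acc_mi, acc_other):
    """Two-sample Student t statistic for H1: mean(acc_mi) > mean(acc_other).

    acc_mi and acc_other hold the accuracies reached at one iteration t
    over the repeated experiments. Returns (t, degrees of freedom).
    Both samples need at least two values and a non-zero pooled variance.
    """
    n1, n2 = len(acc_mi), len(acc_other)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(acc_mi) + (n2 - 1) * variance(acc_other)) / dof
    t_stat = (mean(acc_mi) - mean(acc_other)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, dof


# Toy accuracies at one iteration (hypothetical numbers, three runs each).
t_stat, dof = pooled_t_statistic([0.80, 0.82, 0.84], [0.70, 0.72, 0.74])
print(round(t_stat, 3), dof)
```

The one-tailed p-value is the upper-tail probability of a t distribution with the returned degrees of freedom evaluated at the statistic; a statistics library can supply it, for example `scipy.stats.t.sf(t_stat, dof)`. Repeating this for every iteration t yields the p-value curves.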

3.5. Numerical Results

The numerical results of running sequential active learning with different querying algorithms are shown in Figure 1 and the results of batch active learning with different batch sizes are shown in Figure 2 (for

k = 5

), Figure 3 (for

k = 10

) and Figure 4 (for

k = 20

). The figures show the average accuracy curves (first row of each figure), the standard deviation of the accuracy curves (second rows), and the resulting p-values of the hypothesis tests for comparison between the MI-based approaches and entropy-based (third rows) or random querying (fourth rows).

As mentioned before, the deterministic and randomized optimization algorithms described in Section 2.3 are equivalent in sequential querying. Therefore, we have only two variants of MI-based querying in Figure 1. This figure shows that for two data sets (MNIST and Bach) the MI-based approaches perform similarly to, or sometimes even worse than, the entropy-based benchmark. This can be explained by noting that the main shortcoming of using the entropy objective

f_{H}

, that is redundancy among the queries, is meaningful only when we have multiple samples in the batch, that is

k > 1

. However, the MI-based approaches mostly outperform random querying. Another observation is that the optimistic approximation of MI gave better results both in terms of average accuracy and the p-values of the hypothesis tests. Recall that the optimistic approximation of MI tries to minimize the entropy over the class labels of the remaining samples in each iteration, while the pessimistic approximation uses maximization of this entropy. Hence, our conjecture for this observation is that the optimistic approach performs more aggressive exploitation than the pessimistic variant, in the sense of choosing the queries from the set of samples that lead to a lower classifier entropy.

For batch mode active learning, the MI-based approaches generally show better performance. That is, their accuracy curves grow more rapidly than those of the benchmarks. When comparing against the entropy-based approach, for the Cardio and Bach data sets we observe that the p-values are small in the beginning iterations, whereas for the MNIST data set low p-values are mostly seen in the middle or late iterations. Hence, the evidence that the MI-based variants outperform the entropy-based approach is strong in early querying iterations, before the labeled training set becomes large, for the two former data sets, and in later iterations for MNIST. This behavior can also be seen from the plots of the average accuracy.

Regarding the comparison against the random benchmark, the p-value plots show more conspicuously that the MI-based approaches generally outperform random querying with high confidence soon after the early iterations. However, the plots for Bach show that this confidence decreases in late iterations, which is mainly because the accuracy of random querying grows to the same level as the MI-based curves.

Whereas in sequential active learning the optimistic approach did a better job in comparison with the pessimistic variant, the difference between these two variants shrinks as the size of query batch increases (and so does the number of required approximation iterations). Our conjecture is that accumulation of the approximation error makes the performance of optimistic and pessimistic MI-based querying methods closer to each other, though still better than the benchmarks. Additionally, there is no significant difference between using deterministic or randomized optimization algorithms, which might be because of local monotonicity of the approximations

f_{M I}

. Also note that although there are large distinctions between MI-based variants when their p-values are large, we ignore those parts as uninformative regions, since they just imply that the probability of MI-based approaches outperforming the benchmarks is not high.

Figure 1. The experimental results of different querying approaches for sequential active learning (

k = 1