In this section, we introduce our notation and formulate the batch active learning problem with MI as the objective. We then present our solutions to the two hurdles mentioned above.

#### 2.1. Formulation

Let $X={\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n+m}\subseteq {\mathbb{R}}^{d}$ denote our finite data set, where ${\mathbf{x}}_{i}$ is the d-dimensional feature vector representing the i-th sample. Also let $Y={\left\{{y}_{i}\right\}}_{i=1}^{n+m}$ be their respective class labels, where each ${y}_{i}$ represents the numerical category varying from 1 to c, with c the number of classes. We distinguish indices of the labeled and unlabeled partitions of the data set by $\mathcal{L}$ and $\mathcal{U}$ (with $\left|\mathcal{L}\right|=n$ and $\left|\mathcal{U}\right|=m$), respectively, which are disjoint subsets of $\{1,...,n+m\}$. Note that the true values of the labels ${Y}_{\mathcal{L}}$ are observed and denoted by ${Y}_{\mathcal{L}}^{*}$, hence $Y={Y}_{\mathcal{U}}\cup {Y}_{\mathcal{L}}^{*}$. The initial classifier is trained based on these observed labels.

Labeling unlabeled samples is costly and time-consuming. Given a limited budget for this task, we wish to select $k\ge 1$ queries from $\mathcal{U}$ whose labeling leads to a new classifier with significantly improved performance. Therefore we need an objective set function $f:{2}^{\mathcal{U}}\to \mathbb{R}$ defined over subsets of the unlabeled indices $\mathcal{A}\subseteq \mathcal{U}$. The goal of batch active learning is then to choose a subset $\mathcal{A}$ with a given cardinality k that maximizes the objective given the current model that is trained based on ${Y}_{\mathcal{L}}^{*}$. This can be formulated as the following constrained combinatorial optimization:

$${\mathcal{A}}^{*}=\underset{\mathcal{A}\subseteq \mathcal{U},\,\left|\mathcal{A}\right|=k}{\mathrm{arg\,max}}\;f(\mathcal{A}).\qquad(1)$$

We aim to choose the queries $\mathcal{A}$ whose labels ${Y}_{\mathcal{A}}$ give the highest amount of information about the labels of the remaining samples ${Y}_{\mathcal{U}-\mathcal{A}}$. In other words, the goal is to maximize the mutual information (MI) between the random sets ${Y}_{\mathcal{A}}$ and ${Y}_{\mathcal{U}-\mathcal{A}}$ given the observed labels ${Y}_{\mathcal{L}}^{*}$ and the features X:

$${f}_{MI}(\mathcal{A}):=I({Y}_{\mathcal{A}};{Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})=H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})-H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{U}-\mathcal{A}}).\qquad(2)$$

Let us focus on the first right-hand-side term ${f}_{H}(\mathcal{A}):=H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})$, which is the joint entropy function used in active learning methods [8]. Note that maximizing ${f}_{H}$ is equivalent to minimizing $H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}})$ (the actual objective introduced in [8]), since $H({Y}_{\mathcal{U}}|X,{Y}_{\mathcal{L}}^{*})=H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})+H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}})$ and $H({Y}_{\mathcal{U}}|X,{Y}_{\mathcal{L}}^{*})$ is a constant with respect to $\mathcal{A}$. This objective is expensive to calculate due to the complexity of computing the joint posterior $\mathbb{P}({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})$. However, when a point estimation method, such as maximum likelihood, is used to train the classifier's parameter **θ**, one can say that **θ** is deterministically equal to the maximum likelihood estimation (MLE) point estimate ${\widehat{\mathit{\theta}}}_{n}$ given X and ${Y}_{\mathcal{L}}^{*}$. Then we can rewrite the posterior as $\mathbb{P}({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})=\int \mathbb{P}({Y}_{\mathcal{A}}|\mathit{\theta})\cdot \delta (\mathit{\theta}-{\widehat{\mathit{\theta}}}_{n})\,d\mathit{\theta}=\mathbb{P}({Y}_{\mathcal{A}}|{\widehat{\mathit{\theta}}}_{n})$. Since in most discriminative classifiers the labels are assumed to be independent given the parameter, one can write $\mathbb{P}({Y}_{\mathcal{A}}|{\widehat{\mathit{\theta}}}_{n})={\prod}_{i\in \mathcal{A}}\mathbb{P}({y}_{i}|{\widehat{\mathit{\theta}}}_{n})$. This simplifies the computation of the joint entropy to the sum of sample entropy contributions:

$${f}_{H}(\mathcal{A})=\sum_{i\in \mathcal{A}}H({y}_{i}|{\widehat{\mathit{\theta}}}_{n}),\qquad(3)$$

which is straightforward to compute having the pmf's $\mathbb{P}({y}_{i}|{\widehat{\mathit{\theta}}}_{n})$. Equation (3) implies that maximizing ${f}_{H}$ can be separated into several individual maximizations, and hence does not take into account the redundancy among the selected queries. Thus, in related studies heuristics have been added to cope with this issue. MI in Equation (2), on the other hand, removes this shortcoming by introducing a second term which conditions on the unobserved random variables ${Y}_{\mathcal{U}-\mathcal{A}}$, as well as the observed ${Y}_{\mathcal{L}}^{*}$. This conditioning prevents the labels in ${Y}_{\mathcal{A}}$ from becoming independent, and therefore automatically incorporates the diversity among the queries (see the next section for details of evaluating this term).
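As a concrete illustration of Equation (3), the entropy objective reduces to scoring each unlabeled sample independently by the entropy of its predictive pmf and keeping the top k. A minimal sketch, where the array `probs` stands in for the classifier outputs $\mathbb{P}({y}_{i}|{\widehat{\mathit{\theta}}}_{n})$ (an assumption for illustration, not part of the original):

```python
import numpy as np

def sample_entropies(probs):
    """Shannon entropy of each row of an (m, c) array of predictive pmfs."""
    p = np.clip(probs, 1e-12, 1.0)       # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_by_entropy(probs, k):
    """Maximize f_H: pick the k samples with the largest predictive entropy."""
    return np.argsort(-sample_entropies(probs))[:k]
```

Because Equation (3) decomposes over samples, the batch is just the k highest-entropy points; this is exactly the redundancy problem described above: two near-duplicate samples both score high and both get selected.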

Unfortunately, maximizing ${f}_{MI}$ for $k>1$ is NP-hard (the optimization hurdle). Relaxing combinatorial optimizations into continuous spaces is a common technique to make the computations tractable [13]; however, these methods still involve a final discretization step that often relies on heuristics. In the following sections, we introduce our strategies to overcome the practical hurdles in MI-based active learning algorithms by introducing (1) pessimistic/optimistic approximations of MI; and (2) submodular maximization algorithms that allow us to perform the computations within the discrete domain.

#### 2.2. Evaluating Mutual Information

In this section, we address the hurdle of evaluating MI between non-singleton subsets of labels. This objective, formulated in Equation (2), is also equal to

$${f}_{MI}(\mathcal{A})=H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})-H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}})\qquad(4)$$

due to MI's symmetry. We prefer this equation since usually we have $|{Y}_{\mathcal{A}}|=k\ll |{Y}_{\mathcal{U}}|$, and thus it leads to a more computationally efficient problem. Note that the first term on the right-hand side of Equation (4) can be evaluated similarly to Equation (3). The major difficulty we need to handle in Equation (4) is the computation of the second term, which requires considering all possible label assignments to ${Y}_{\mathcal{A}}$. To make this computationally tractable, we propose to use a greedy strategy based on two variants: pessimistic and optimistic approximations of MI. To see this, we focus on the second term:

$$H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}})=\sum_{J\in {\{1,...,c\}}^{\left|\mathcal{A}\right|}}\mathbb{P}({Y}_{\mathcal{A}}=J|X,{Y}_{\mathcal{L}}^{*})\cdot H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}}=J),\qquad(5)$$

where ${\{1,...,c\}}^{\left|\mathcal{A}\right|}$ is the set of all possible class label assignments to the samples in $\mathcal{A}$. For example, if $\mathcal{A}$ has three samples ($\left|\mathcal{A}\right|=3$) and $c=2$, then this set would be equal to $\left\{\{1,1,1\},\{2,1,1\},\{1,2,1\},\{2,2,1\},\{2,2,2\},\{1,2,2\},\{2,1,2\},\{1,1,2\}\right\}$. For each fixed label assignment J, the classifier should be retrained after adding the new labels ${Y}_{\mathcal{A}}=J$ to the training labels ${Y}_{\mathcal{L}}^{*}$ in order to compute the conditional entropy $H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}}=J)$. It is also evident from the example above that the number of possible assignments J to ${Y}_{\mathcal{A}}$ is ${c}^{k}$. Therefore, the number of necessary classifier updates grows exponentially with $|{Y}_{\mathcal{A}}|=k$. This is computationally very expensive and makes Equation (5) impractical. Alternatively, we can replace the expectation in Equation (5) with a minimization/maximization to get a pessimistic/optimistic approximation of MI. Such a replacement enables us to employ efficient greedy approaches to estimate ${f}_{MI}$ in a conservative/aggressive manner. The greedy approach that we use here is compatible with the iterative nature of the optimization Algorithms 1 and 2 (described in Section 2.3). In the remainder of this section, we focus on the pessimistic approximation; similar equations can be derived for the optimistic case. The first step is replacing the weighted summation in Equation (5) by a maximization:

$${f}_{MI}^{pess}(\mathcal{A}):=H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})-\underset{J\in {\{1,...,c\}}^{\left|\mathcal{A}\right|}}{\mathrm{max}}\;H({Y}_{\mathcal{U}-\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{A}}=J).\qquad(6)$$
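The assignment space ${\{1,...,c\}}^{\left|\mathcal{A}\right|}$ that Equation (5) sums over can be enumerated directly; a small sketch that makes the ${c}^{k}$ blow-up concrete:

```python
from itertools import product

def all_assignments(batch_size, c):
    """All label assignments J in {1,...,c}^|A| for a batch of given size."""
    return list(product(range(1, c + 1), repeat=batch_size))

# the 3-sample, 2-class example from the text: 2^3 = 8 assignments
assert len(all_assignments(3, 2)) == 8
# with c = 10 classes and a batch of k = 5, already 100,000 retrainings
assert len(all_assignments(5, 10)) == 10 ** 5
```

Each tuple returned corresponds to one assignment J for which the classifier would have to be retrained in the exact computation.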

Note that ${f}_{MI}^{pess}(\mathcal{A})$ is always less than or equal to ${f}_{MI}$. Equation (6) still needs the computation of the conditional entropy for all possible assignments J. However, it enables us to use greedy approaches to approximate ${f}_{MI}^{pess}(\mathcal{A})$ for any candidate query set $\mathcal{A}\subseteq \mathcal{U}$, as described below.

Without loss of generality, suppose that $\mathcal{A}$, with size $\left|\mathcal{A}\right|=k$ ($1\le k\le m$), can be written element-wise as $\mathcal{A}=\{{u}_{1},...,{u}_{k}\}$. Define ${\mathcal{A}}_{t}=\{{u}_{1},...,{u}_{t}\}$ for any $t\le k$ (hence ${\mathcal{A}}_{k}=\mathcal{A}$). In the first iteration, we can evaluate Equation (6) simply for the singleton, ${f}_{MI}^{pess}(\left\{{u}_{1}\right\})$, and store ${\widehat{y}}_{{u}_{1}}$, the assignment to ${y}_{{u}_{1}}$ which maximizes the conditional entropy in Equation (6):

$${f}_{MI}^{pess}(\left\{{u}_{1}\right\})=H({Y}_{\mathcal{U}-\left\{{u}_{1}\right\}}|{\widehat{\mathit{\theta}}}_{n})-\underset{j\in \{1,...,c\}}{\mathrm{max}}\;H({Y}_{\mathcal{U}-\left\{{u}_{1}\right\}}|X,{Y}_{\mathcal{L}}^{*},{y}_{{u}_{1}}=j),\qquad(7)$$

where we used Equation (3) to substitute the first term with $H({Y}_{\mathcal{U}-\left\{{u}_{1}\right\}}|{\widehat{\mathit{\theta}}}_{n})$. Note that the second term in Equation (7) requires retraining the classifier c times, once with each newly added class label ${y}_{{u}_{1}}=j$ for $j\in \{1,...,c\}$. In practice, the retraining process can be very time-consuming. Here, instead of retraining the classifier from scratch, we leverage the current estimate of the classifier's parameter vector and take one quasi-Newton step to update this estimate:

$${\widehat{\mathit{\theta}}}_{n+1}={\widehat{\mathit{\theta}}}_{n}-{\mathbf{H}}_{n+1}^{-1}{\mathbf{g}}_{n+1},\qquad(8)$$

where ${\mathbf{g}}_{n+1}$ and ${\mathbf{H}}_{n+1}$ are the gradient vector and Hessian matrix of the log-likelihood function of our classifier given the labels ${Y}_{\mathcal{L}}^{*}\cup \{{y}_{{u}_{1}}=j\}$. Then we use the approximation

$$H({Y}_{\mathcal{U}-\left\{{u}_{1}\right\}}|X,{Y}_{\mathcal{L}}^{*},{y}_{{u}_{1}}=j)\approx H({Y}_{\mathcal{U}-\left\{{u}_{1}\right\}}|{\widehat{\mathit{\theta}}}_{n+1}).\qquad(9)$$

In the Appendix, we derive the update equation in case a multinomial logistic regression is used as the discriminative classifier. Specifically, we will see that ${\mathbf{g}}_{n+1}$ and ${\mathbf{H}}_{n+1}^{-1}$ can be obtained efficiently from ${\mathbf{g}}_{n}$ and ${\mathbf{H}}_{n}^{-1}$.
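For concreteness, here is what one such update step can look like for a binary logistic regression. This is a simplification for illustration only: the paper derives the multinomial case in the Appendix, and the data and labels below are placeholders.

```python
import numpy as np

def quasi_newton_step(theta, X, y):
    """One step theta <- theta - H^{-1} g, with g and H the gradient and
    Hessian of the binary logistic log-likelihood (labels y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # current class-1 probabilities
    g = X.T @ (y - p)                      # gradient of the log-likelihood
    H = -(X.T * (p * (1.0 - p))) @ X       # Hessian (negative definite)
    return theta - np.linalg.solve(H, g)
```

In the querying loop, the hypothetical label ${y}_{{u}_{1}}=j$ is appended to the training set, this single step replaces a full retraining, and the resulting parameters feed the entropy approximation in Equation (9).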

If $k=1$, we are done. Otherwise, to move from iteration $t-1$ to t ($1<t\le k$), ${f}_{MI}^{pess}({\mathcal{A}}_{t-1}\cup \left\{{u}_{t}\right\})$ is approximated from the previous iterations:

$${f}_{MI}^{pess}({\mathcal{A}}_{t-1}\cup \left\{{u}_{t}\right\})\approx H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|{\widehat{\mathit{\theta}}}_{n})-\underset{j\in \{1,...,c\}}{\mathrm{max}}\;H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|X,{Y}_{\mathcal{L}}^{*},{\widehat{Y}}_{{\mathcal{A}}_{t-1}},{y}_{{u}_{t}}=j),\qquad(10)$$

where ${\widehat{Y}}_{{\mathcal{A}}_{t-1}}=\{{\widehat{y}}_{{u}_{1}},...,{\widehat{y}}_{{u}_{t-1}}\}$ are the assignments maximizing the conditional entropy that are stored from the previous iterations, such that the i-th element ${\widehat{y}}_{{u}_{i}}$ is the assignment stored for ${u}_{i}={\mathcal{A}}_{i}-{\mathcal{A}}_{i-1}$ ($1\le i\le t-1$). Note that Equation (10) is an approximation of the pessimistic MI as defined by Equation (6); however, in order to keep the notation simple, we use the same symbol ${f}_{MI}^{pess}$ for both. Moreover, similar to Equation (7), there are c classifier updates involved in the computation of Equation (10). To complete iteration t, we set ${\widehat{y}}_{{u}_{t}}$ equal to the assignment to ${y}_{{u}_{t}}$ that maximizes the second term in Equation (10) and add it to ${\widehat{Y}}_{{\mathcal{A}}_{t-1}}$ to form ${\widehat{Y}}_{{\mathcal{A}}_{t}}$.

As in the first iteration, the conditional entropy term in Equation (10) is estimated by using the set of parameters obtained from the quasi-Newton step:

$$H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|X,{Y}_{\mathcal{L}}^{*},{\widehat{Y}}_{{\mathcal{A}}_{t-1}},{y}_{{u}_{t}}=j)\approx H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|{\widehat{\mathit{\theta}}}_{n+t}),\qquad(11)$$

where

$${\widehat{\mathit{\theta}}}_{n+t}={\widehat{\mathit{\theta}}}_{n+t-1}-{\mathbf{H}}_{n+t}^{-1}{\mathbf{g}}_{n+t}.\qquad(12)$$

Considering Equations (7) and (10) as the greedy steps of approximating ${f}_{MI}$, we see that the number of necessary classifier updates is $c\cdot k$, since there are k iterations, each of which requires retraining the classifier c times. Thus, the computational complexity is reduced from the exponential cost of the exact formulation in Equation (5) to a linear cost in the greedy approximation.
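The gap between the two costs is easy to tabulate; a tiny sketch (the class and batch counts are arbitrary examples):

```python
def exact_updates(c, k):
    """Classifier updates to evaluate the second term of Eq. (5) exactly."""
    return c ** k

def greedy_updates(c, k):
    """Updates needed by the greedy pessimistic/optimistic approximation."""
    return c * k

# for c = 10 classes: 10 vs 10 at k = 1, but 100,000 vs 50 at k = 5
for k in (1, 2, 5, 10):
    print(k, exact_updates(10, k), greedy_updates(10, k))
```

The two counts coincide only at $k=1$, where the greedy procedure and the exact evaluation are the same computation.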

Similar to Equation (10), for the optimistic approximation we will have:

$${f}_{MI}^{opt}({\mathcal{A}}_{t-1}\cup \left\{{u}_{t}\right\})\approx H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|{\widehat{\mathit{\theta}}}_{n})-\underset{j\in \{1,...,c\}}{\mathrm{min}}\;H({Y}_{\mathcal{U}-{\mathcal{A}}_{t}}|X,{Y}_{\mathcal{L}}^{*},{\widehat{Y}}_{{\mathcal{A}}_{t-1}},{y}_{{u}_{t}}=j),\qquad(13)$$

where ${\widehat{Y}}_{{\mathcal{A}}_{t-1}}=\{{\widehat{y}}_{{u}_{1}},...,{\widehat{y}}_{{u}_{t-1}}\}$ is the set of class assignments minimizing the conditional entropy that are stored from the previous iterations. Clearly, the reduction of the computational complexity remains the same in the optimistic formulation.

Let us emphasize that, from the definitions of ${f}_{MI}^{pess}$ and ${f}_{MI}^{opt}$, we always have the following inequality:

$${f}_{MI}^{pess}(\mathcal{A})\le {f}_{MI}(\mathcal{A})\le {f}_{MI}^{opt}(\mathcal{A}).\qquad(14)$$

The first (or second) inequality becomes an equality if the result of the averaging in the conditional entropy of Equation (5) is equal to the maximization (or minimization) involved in the approximations. This is equivalent to saying that the posterior probability $\mathbb{P}({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})$ is a degenerate distribution concentrated at the assignment ${Y}_{\mathcal{A}}=J$ that maximizes (or minimizes) the conditional entropy. Furthermore, if the conditional entropy takes the same value for all possible assignments $J\in {\{1,...,c\}}^{\left|\mathcal{A}\right|}$, then the averaging, minimization and maximization lead to the same numerical result and therefore we get ${f}_{MI}^{pess}={f}_{MI}^{opt}={f}_{MI}$.
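The ordering in Equation (14) can be sanity-checked on toy numbers: fix a posterior over the assignments J and a conditional entropy per assignment, then compare the exact average of Equation (5) with the max/min surrogates. All values below are made up purely for illustration:

```python
import numpy as np

post = np.array([0.5, 0.3, 0.2])    # hypothetical P(Y_A = J | X, Y_L*)
cond_H = np.array([1.4, 0.9, 2.0])  # hypothetical H(Y_{U-A} | ..., Y_A = J)
first = 2.5                          # hypothetical H(Y_{U-A} | X, Y_L*)

f_mi = first - post @ cond_H         # exact: posterior-weighted average
f_pess = first - cond_H.max()        # pessimistic: worst-case assignment
f_opt = first - cond_H.min()         # optimistic: best-case assignment
assert f_pess <= f_mi <= f_opt
```

The inequality holds for any choice of `post` and `cond_H`, since a weighted average always lies between the minimum and maximum of the values averaged.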

In theory, the value of MI between any two random variables is non-negative. However, because of the approximations made in computing the pessimistic or optimistic evaluations of MI, it is possible to get negative values depending on the distribution of the data. Therefore, after going through all the elements of $\mathcal{A}$ in evaluating ${f}_{MI}^{pess}$ (or ${f}_{MI}^{opt}$), we take the maximum between the approximations of ${f}_{MI}^{pess}(\mathcal{A})$ (or ${f}_{MI}^{opt}(\mathcal{A})$) and zero to ensure its non-negativity.

#### 2.3. Randomized vs. Deterministic Submodular Optimizations

In this section, we begin by reviewing the basic definitions regarding submodular set functions, and see that both ${f}_{MI}$ and ${f}_{H}$ satisfy the submodularity condition. We then present two methods for submodular maximization: a deterministic and a randomized approach. The former is applicable to submodular and monotone set functions such as ${f}_{H}$. But ${f}_{MI}$ is not monotone in general, hence we present the randomized approach for this objective.

**Definition 1.** A set function $f:{2}^{\mathcal{U}}\to \mathbb{R}$ is said to be submodular if, for every $\mathcal{A},\mathcal{B}\subseteq \mathcal{U}$,

$$f(\mathcal{A}\cap \mathcal{B})+f(\mathcal{A}\cup \mathcal{B})\le f(\mathcal{A})+f(\mathcal{B}).\qquad(15)$$

We call f supermodular if the inequality in Equation (15) is reversed. On many occasions, it is easier to use an equivalent definition, which uses the notion of the discrete derivative defined as:

$${\rho}_{f}(\mathcal{A},u):=f(\mathcal{A}\cup \left\{u\right\})-f(\mathcal{A}).\qquad(16)$$

**Proposition 2.** Let $f:{2}^{\mathcal{U}}\to \mathbb{R}$ be a set function. f is submodular if and only if, for every $\mathcal{A}\subseteq \mathcal{B}\subseteq \mathcal{U}$ and every $u\in \mathcal{U}-\mathcal{B}$, we have

$${\rho}_{f}(\mathcal{A},u)\ge {\rho}_{f}(\mathcal{B},u).\qquad(17)$$

This equips us to show the submodularity of joint entropy and MI:
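Proposition 2 also suggests a direct (if exponential) way to test a set function for submodularity on a small ground set; a brute-force sketch, where the coverage function is an illustrative stand-in, not from the paper:

```python
from itertools import combinations

def rho(f, A, u):
    """Discrete derivative of Eq. (16): rho_f(A, u) = f(A ∪ {u}) - f(A)."""
    return f(A | {u}) - f(A)

def is_submodular(f, U):
    """Check Eq. (17) for all A ⊆ B ⊆ U and every u outside B."""
    subsets = [frozenset(s) for r in range(len(U) + 1)
               for s in combinations(U, r)]
    return all(rho(f, A, u) >= rho(f, B, u) - 1e-12
               for A in subsets for B in subsets if A <= B
               for u in U - B)

# set coverage is a classic submodular function ...
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}
coverage = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
# ... while |A|^2 has increasing marginal gains, hence fails the test
assert is_submodular(coverage, {0, 1, 2})
assert not is_submodular(lambda A: len(A) ** 2, {0, 1, 2})
```

The same checker applied to an entropy-based objective on a toy joint distribution would verify Theorem 3 numerically, at the cost of enumerating all subset pairs.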

**Theorem 3.** The set functions ${f}_{H}$ and ${f}_{MI}$, defined in Equations (3) and (2) above, are submodular.

**Proof.** It is straightforward to check the submodularity of ${f}_{H}$, and therefore of the first term of the MI formulation in Equation (2). It remains to show that $g(\mathcal{A}):=H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{U}-\mathcal{A}})$, the second term with the opposite sign, is supermodular. Let us first write the discrete derivative of the function g:

$${\rho}_{g}(\mathcal{A},u)=g(\mathcal{A}\cup \left\{u\right\})-g(\mathcal{A})=H({y}_{u}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{U}-\mathcal{A}\cup \left\{u\right\}}),\qquad(18)$$

which holds for any $u\notin \mathcal{A}\subseteq \mathcal{U}$. Here, we used the fact that the joint entropy of two sets of random variables A and B can be written as $H(A,B)=H(A)+H(B|A)$. Now take any superset $\mathcal{B}\supseteq \mathcal{A}$ which does not contain $u\in \mathcal{U}$. From $\mathcal{B}\supseteq \mathcal{A}$, we have ${Y}_{\mathcal{U}-\mathcal{B}\cup \left\{u\right\}}\subseteq {Y}_{\mathcal{U}-\mathcal{A}\cup \left\{u\right\}}$, and since conditioning on a larger set of variables cannot increase entropy, ${\rho}_{g}(\mathcal{A},u)-{\rho}_{g}(\mathcal{B},u)=H({y}_{u}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{U}-\mathcal{A}\cup \left\{u\right\}})-H({y}_{u}|X,{Y}_{\mathcal{L}}^{*},{Y}_{\mathcal{U}-\mathcal{B}\cup \left\{u\right\}})\le 0$, implying the supermodularity of g. ☐

Although submodular functions can be minimized efficiently, they are NP-hard to maximize [19], and therefore we have to use approximate algorithms. Next, we briefly discuss the classical approximate submodular maximization method widely used in batch querying [9,11,12,16,17]. This greedy approach, which we call deterministic throughout this paper, was first proposed in the seminal work of [20] (shown in Algorithm 1), and its performance is analyzed for monotone set functions as follows:

**Definition 4.** The set function $f:{2}^{\mathcal{U}}\to \mathbb{R}$ is said to be monotone (nondecreasing) if for every $\mathcal{A}\subseteq \mathcal{B}\subseteq \mathcal{U}$ we have $f(\mathcal{A})\le f(\mathcal{B})$.

**Theorem 5.** Let $f:{2}^{\mathcal{U}}\to \mathbb{R}$ be a submodular and nondecreasing set function with $f(\emptyset)=0$, $\mathcal{A}$ be the output of Algorithm 1 and ${\mathcal{A}}^{*}$ be the optimal solution to the problem in Equation (1). Then we have:

$$f(\mathcal{A})\ge \left(1-\frac{1}{e}\right)f({\mathcal{A}}^{*}).\qquad(19)$$

**Algorithm 1:** The deterministic approach
**Inputs:** The objective function f, the unlabeled indices $\mathcal{U}$, the query batch size $k>0$
**Outputs:** a subset of unlabeled indices $\mathcal{A}\subseteq \mathcal{U}$ of size k
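As described later in Section 2.3, each iteration of the deterministic approach adds the element maximizing $f({\mathcal{A}}_{t-1}\cup \left\{u\right\})$. A minimal sketch of that loop, with a coverage objective standing in for ${f}_{H}$ (the `cover` data is a hypothetical example):

```python
def greedy_max(f, U, k):
    """Deterministic greedy: at each step add the u maximizing f(A ∪ {u})."""
    A = set()
    for _ in range(k):
        u_t = max(U - A, key=lambda u: f(A | {u}))
        A.add(u_t)
    return A

# stand-in submodular objective: how many items a set of indices covers
cover = {0: {"a"}, 1: {"a", "b"}, 2: {"c"}}
f = lambda A: len(set().union(*(cover[i] for i in A))) if A else 0
assert greedy_max(f, {0, 1, 2}, 2) == {1, 2}
```

With a monotone submodular f, Theorem 5 guarantees this loop attains at least a $(1-1/e)$ fraction of the optimum.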

The proof is given by [20] and [21]. Among the assumptions, $f(\emptyset)=0$ can always be assumed, since maximizing a general set function $f(\mathcal{A})$ is equivalent to maximizing its adjusted version $g(\mathcal{A}):=f(\mathcal{A})-f(\emptyset)$, which satisfies $g(\emptyset)=0$. Nemhauser et al. [22] also showed that Algorithm 1 gives the optimal approximate solution to the problem in (1) for nondecreasing functions such as ${f}_{H}$. However, ${f}_{MI}$ is not monotone in general, and therefore Theorem 5 is not applicable. To see the non-monotonicity of ${f}_{MI}$, it suffices to note that ${f}_{MI}(\emptyset)={f}_{MI}(\mathcal{U})=0$, while ${f}_{MI}$ can take positive values on intermediate subsets.

Recently, several algorithms have been proposed for the approximate maximization of nonnegative submodular set functions that are not necessarily monotone. Feige et al. [23] made the first attempt towards this goal by proposing a $(2/5)$-approximation algorithm and also proving that $1/2$ is the optimal approximation factor in this case. Buchbinder et al. [24] achieved this optimal bound in expectation by proposing a randomized iterative algorithm. However, these algorithms are designed for unconstrained maximization problems. Later, Buchbinder et al. [25] devised a $(1/e)$-approximation randomized algorithm with a cardinality constraint, which is more suitable for batch active learning. A pseudocode of this approach is shown in Algorithm 2, where instead of selecting the sample with the maximum objective value at each iteration, the best k samples are identified (line 4) and one of them is chosen randomly (line 5). Such a randomized procedure provides a $(1/e)$-approximation algorithm for maximizing a nonnegative submodular set function such as ${f}_{MI}$:

**Theorem 6.** Let $f:{2}^{\mathcal{U}}\to \mathbb{R}$ be a submodular nonnegative set function and $\mathcal{A}$ be the output of Algorithm 2. Then, if ${\mathcal{A}}^{*}$ is the optimal solution to the problem in (1), we have:

$$\mathbb{E}\left[f(\mathcal{A})\right]\ge \frac{1}{e}\,f({\mathcal{A}}^{*}).\qquad(20)$$

The proof can be found in [25] and our supplementary document. In order to be able to select k samples from ${\mathcal{U}}_{t}$ to form ${\mathcal{M}}_{t}$ for all t, it suffices to ensure that the smallest unlabeled set we sample from, ${\mathcal{U}}_{k-1}$, has enough members, i.e., $k\le |{\mathcal{U}}_{k-1}|=|\mathcal{U}|-k+1$, hence $k\le (|\mathcal{U}|+1)/2$.

Observe that although the assumptions in Theorem 6 are weaker than those in Theorem 5, the bound in Equation (20) is also looser than that in Equation (19). Interestingly, however, it is proven that inequality (19) still holds for Algorithm 2 if the monotonicity of f is satisfied (see Theorem 3.1 in [25]). Thus, the randomized Algorithm 2 is expected to perform similarly to Algorithm 1 for monotone functions.

**Algorithm 2:** The randomized approach
**Inputs:** The objective function f, the unlabeled indices $\mathcal{U}$, the query batch size $k>0$
**Outputs:** a subset of unlabeled indices $\mathcal{A}\subseteq \mathcal{U}$ of size k

Algorithms 1 and 2 are equivalent for sequential querying ($k=1$). Also note that in both algorithms, the variable ${u}_{t}$ in iteration t is determined by deterministic or stochastic maximization of $f({\mathcal{A}}_{t-1}\cup \left\{u\right\})$. Fortunately, such maximization only needs computations in the form of Equation (10) or Equation (13) when $f={f}_{MI}^{pess}$ or ${f}_{MI}^{opt}$. These computations can be done easily provided that the gradient vector ${\mathbf{g}}_{n+t-1}$ and inverse-Hessian matrix ${\mathbf{H}}_{n+t-1}^{-1}$ have been stored from the previously selected subset ${\mathcal{A}}_{t-1}$. The updated gradient and inverse Hessian that are used to compute $f({\mathcal{A}}_{t-1}\cup \left\{u\right\})$ are different for each specific $u\in {\mathcal{U}}_{t-1}$. We only save those associated with the local maximizer ${u}_{t}$, as ${\mathbf{g}}_{n+t}$ and ${\mathbf{H}}_{n+t}^{-1}$, to be used in the next iteration.
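Putting lines 4 and 5 of Algorithm 2 into code, under the reading given above (rank the remaining elements by marginal gain, then draw one of the k best uniformly); the coverage objective below is a placeholder, not part of the original:

```python
import random

def randomized_greedy_max(f, U, k, seed=0):
    """Randomized greedy of [25]: at each iteration, rank the remaining
    elements by marginal gain and draw one of the k best uniformly.
    Requires k <= (|U| + 1) / 2 so that k candidates always remain."""
    assert k <= (len(U) + 1) / 2
    rng = random.Random(seed)
    A = set()
    for _ in range(k):
        ranked = sorted(U - A, key=lambda u: f(A | {u}) - f(A), reverse=True)
        A.add(rng.choice(ranked[:k]))   # lines 4-5: top-k, then uniform pick
    return A
```

For $k=1$ the candidate set has a single element, so the random pick is vacuous and the procedure coincides with Algorithm 1, matching the equivalence noted above.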

#### 2.4. Total Complexity Reduction

We measure the complexity of a given querying algorithm in terms of the required number of classifier updates. This makes our analysis general and independent of the updating procedure, which can be done in several possible ways. As discussed in the last section, we chose to perform a single quasi-Newton step as in Equation (8), but alternatively one can use full training or any other numerical parameter update.

Consider the following optimization problems:

$${\mathcal{A}}^{*}=\underset{\mathcal{A}\subseteq \mathcal{U},\,\left|\mathcal{A}\right|=k}{\mathrm{arg\,max}}\;{f}_{MI}(\mathcal{A}),\qquad(21a)$$

$$\tilde{\mathcal{A}}=\underset{\mathcal{A}\subseteq \mathcal{U},\,\left|\mathcal{A}\right|=k}{\mathrm{greedy}\;\mathrm{arg\,max}}\;{\tilde{f}}_{MI}(\mathcal{A}),\qquad(21b)$$

where "$\mathrm{greedy}\;\mathrm{arg\,max}$" denotes the greedy maximization operator that uses Algorithm 1 or 2 to maximize the objective, and ${\tilde{f}}_{MI}$ is either ${f}_{MI}^{pess}$ or ${f}_{MI}^{opt}$. Note that Equation (21a) formulates the global maximization of the exact MI function, and Equation (21b) shows the optimization in our framework, that is, a greedy maximization of the pessimistic/optimistic MI approximation. In the following remark, we compare the complexity of solving the two optimizations in Equation (21) in terms of the number of classifier updates required for obtaining the solutions.

**Remark 1.** For a fixed k, the number of necessary classifier updates for solving Equation (21a) grows with the pool size m with order k, i.e., as $O({m}^{k})$, whereas for Equation (21b) it grows only linearly in m.

**Proof.** As explained in Section 2.2, the number of classifier updates for computing ${f}_{MI}(\mathcal{A})$ without any approximations is ${c}^{k}$. Moreover, in order to find the global maximizer of MI, ${f}_{MI}$ needs to be evaluated at all subsets of $\mathcal{U}$ of size k. There are $\binom{m}{k}=O\left({m}^{k}\right)$ such subsets (recall that $m=\left|\mathcal{U}\right|$). Hence, the total number of classifier updates required for the global maximization of ${f}_{MI}$ is of order $O\left({(m\cdot c)}^{k}\right)$.

Now, regarding Equation (21b), recall from Section 2.2 that if ${\mathbf{g}}_{n+t-1}$ and ${\mathbf{H}}_{n+t-1}^{-1}$ are stored from the previous iteration, computing ${\tilde{f}}_{MI}({\mathcal{A}}_{t-1}\cup \left\{{u}_{t}\right\})$ needs only c classifier updates. However, unlike the evaluation problem in Section 2.2, in computing line (4) of Algorithms 1 and 2, the next sample to add, ${u}_{t}$, is not given. In order to obtain ${u}_{t}$, ${\tilde{f}}_{MI}$ has to be evaluated at all the remaining samples in ${\mathcal{U}}_{t-1}$. Since $|{\mathcal{U}}_{t-1}|=m-t+1$, the number of necessary classifier updates in the t-th iteration is $c\cdot(m-t+1)$. Both algorithms run k iterations, which results in the following total number of classifier updates:

$$\sum_{t=1}^{k}c\cdot(m-t+1)=c\cdot\left(mk-\frac{k(k-1)}{2}\right)=O(c\cdot m\cdot k).\qquad(22)$$

☐
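The count in this proof can be checked numerically; a quick sketch (the class, pool, and batch sizes are arbitrary):

```python
def total_updates(c, m, k):
    """Sum over k greedy iterations: c * (m - t + 1) updates in iteration t."""
    return sum(c * (m - t + 1) for t in range(1, k + 1))

def closed_form(c, m, k):
    """Closed form of the same sum: c * (m*k - k*(k-1)/2) = O(c * m * k)."""
    return c * (m * k - k * (k - 1) // 2)

assert total_updates(3, 100, 5) == closed_form(3, 100, 5) == 1470
```

Since $k\le m$, the subtracted term is at most $mk/2$, so the total indeed scales as $O(c\cdot m\cdot k)$.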