*Computation* **2019**, *7*(1), 16; https://doi.org/10.3390/computation7010016

Article

Extreme Multiclass Classification Criteria

NYU Tandon School of Engineering, Department of Electrical and Computer Engineering 5 MetroTech Center, Brooklyn, NY 11201, USA

\* Author to whom correspondence should be addressed.

Received: 2 February 2019 / Accepted: 8 March 2019 / Published: 12 March 2019

## Abstract

We analyze the theoretical properties of the recently proposed objective function for efficient online construction and training of multiclass classification trees in the settings where the label space is very large. We show the important properties of this objective and provide a complete proof that maximizing it simultaneously encourages balanced trees and improves the purity of the class distributions at subsequent levels in the tree. We further explore its connection to the three well-known entropy-based decision tree criteria, i.e., Shannon entropy, Gini-entropy and its modified variant, for which efficient optimization strategies are largely unknown in the extreme multiclass setting. We show theoretically that this objective can be viewed as a surrogate function for all of these entropy criteria and that maximizing it indirectly optimizes them as well. We derive boosting guarantees and obtain a closed-form expression for the number of iterations needed to reduce the considered entropy criteria below an arbitrary threshold. The obtained theorem relies on a weak hypothesis assumption that directly depends on the considered objective function. Finally, we prove that optimizing the objective directly reduces the multi-class classification error of the decision tree.

Keywords: multiclass classification; decision trees; boosting

## 1. Introduction

This paper focuses on the multiclass classification setting, where the number of classes is very large. The recent widespread development of data-acquisition web services and devices has helped make large data sets, including large multiclass data sets, commonplace. Straightforward extensions of binary approaches to the multiclass setting, such as the one-against-all approach [1], which computes a score for each class for every data point and returns the class with the maximum score, often do not work under strict computational constraints because their running time typically scales linearly with the number of labels k. On the other hand, the most computationally efficient approaches for multiclass classification achieve $\mathcal{O}(\log k)$ train/test running time [2]. This running time can naturally be achieved by hierarchical classifiers that build a hierarchy over the labels.

This paper considers a hierarchical multiclass decision tree structure in which each node contains a binary classifier h from some hypothesis class $\mathcal{H}$. The classifier sends an example reaching that node to either the left ($h(x)\le 0$) or the right ($h(x)>0$) child node, depending on the sign of $h(x)$ (each node has its own splitting hypothesis). A test example descends from the root to a leaf of the tree, guided by the classifiers on its path, and receives the label with the highest frequency among the training examples that reached that leaf. The tree is constructed and trained in a top-down fashion, where the data in every node are split by maximizing the following objective function, recently introduced in the literature [3] together with LOMtree, an algorithm that optimizes it in an online fashion (we refer the reader to [3] for the algorithm’s details):

$$J(h):=2\sum _{i=1}^{k}\left|{\pi}_{i}P(h(x)>0)-\underbrace{P(h(x)>0,i)}_{P(h(x)>0|i){\pi}_{i}}\right|,$$

where $x\in \mathcal{X}\subseteq {\mathbb{R}}^{d}$ are the data points (each with a label from the set $\{1,2,\dots ,k\}$), ${\pi}_{i}$ denotes the proportion of label i amongst the examples reaching a node, and the probabilities $P(h(x)>0)$ and $P(h(x)>0|i)$ denote the fraction of examples reaching a node for which $h(x)>0$, marginally and conditional on class i respectively. The objective measures the dependence between the split and the class distribution. Note that it satisfies $J(h)\in [0,1]$ and, as implied by its form, maximizing it encourages the fraction of examples of class i going to the right to be substantially different from the background fraction, for each class i. Thus for a balanced split (i.e., $P(h(x)>0)=0.5$), the examples of class i are encouraged to be sent exclusively to the left ($P(h(x)>0|i)=0$) or to the right ($P(h(x)>0|i)=1$), refining the purity of the class distributions at subsequent levels in the tree. The LOMtree algorithm effectively maximizes this objective over hypotheses $h\in \mathcal{H}$ in an online fashion with stochastic gradient descent (SGD) and obtains good-quality multiclass tree predictors with logarithmic train and test running times. Despite that, this objective and its properties (including its relation to the more standard entropy criteria) remain poorly understood. This paper provides an exhaustive analysis of it.
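To make the objective concrete, the following minimal sketch estimates $J(h)$ from the examples reaching a single node. It is an illustration only, not the LOMtree implementation: the function name `objective_J`, the use of NumPy, and the 0-based class labels are assumptions of this sketch (the paper indexes classes from 1 to k).

```python
import numpy as np

def objective_J(scores, labels, k):
    """Empirical J(h) = 2 * sum_i |pi_i * P(h>0) - P(h>0, i)| at one node.

    scores: values h(x) for the examples reaching the node.
    labels: class labels in {0, ..., k-1} (0-based indexing is an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    right = scores > 0                   # examples sent to the right child
    p_right = right.mean()               # P(h(x) > 0)
    total = 0.0
    for i in range(k):
        is_i = labels == i
        pi_i = is_i.mean()               # proportion of class i at the node
        p_joint = (right & is_i).mean()  # P(h(x) > 0, i)
        total += abs(pi_i * p_right - p_joint)
    return 2.0 * total

# Four equally frequent classes.
labels = np.array([0, 1, 2, 3] * 25)
# Classes 0 and 1 go left, classes 2 and 3 go right: pure and balanced.
pure_balanced = np.where(labels >= 2, 1.0, -1.0)
# A split that ignores the class: every class has the same fraction going right.
uninformative = np.where((np.arange(100) // 4) % 2 == 0, 1.0, -1.0)

J_pure = objective_J(pure_balanced, labels, k=4)
J_flat = objective_J(uninformative, labels, k=4)
```

In this toy setting `J_pure` sits at the top of the range $[0,1]$, while `J_flat` is (numerically) zero because the split carries no information about the class.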

Our contributions are the following:

- We provide an extensive theoretical analysis of the properties of the considered objective and prove that maximizing this objective in any tree node simultaneously encourages balanced partition of the data in that node and improves the purity of the class distributions at its children nodes.
- We show a formal relation of this objective to some more standard entropy-based objectives, i.e., Shannon entropy, Gini-entropy and its modified variant, for which online optimization schemes in the context of multiclass classification are largely unknown. In particular we show that i) the improvement in the value of entropy resulting from performing the node split is lower-bounded by an expression that increases with the value of the objective and thus ii) the considered objective can be used as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria.
- We present three boosting theorems for each of the three entropy criteria, which provide the number of iterations needed to reduce each of them below an arbitrary threshold. Their weak hypothesis assumptions rely on the considered objective function.
- We establish the error bound that relates maximizing the objective function with reducing the multi-class classification error.
- Finally, in Appendix A we establish an empirical connection between the multiclass classification error and the entropy criteria and show that Gini-entropy most closely resembles the behavior of the test error in practice.

The main theoretical analysis of this paper is carried out in the boosting framework [4] and relies on the assumption that the objective function can be weakly optimized in the internal nodes of the tree. This weak advantage is amplified in the tree, leading to hierarchies that achieve any desired level of entropy (either Shannon entropy, Gini-entropy or its modified variant). Our work adds new theoretical results to the theory of multiclass boosting. Note that multiclass boosting is still poorly understood from the theoretical perspective [5] (we refer the reader to [5] for a comprehensive review of the theory of multiclass boosting).

The paper is organized as follows: related literature is discussed in Section 2, the theoretical properties of the objective $J(h)$ are shown in Section 3, the main theoretical results are presented in Section 4, and finally the mathematical properties of the entropy criteria and the proofs of the main theoretical results are provided in Section 5. Conclusions (Section 6) end the paper. Appendix A contains basic numerical experiments (Appendix A.1) and additional proofs (Appendix A.2).

## 2. Related Work

The extreme multiclass classification problem has been addressed in the literature in different ways. We discuss them here, putting emphasis on the ones that build hierarchical predictors as these techniques are the most relevant to this paper. Only a few authors [2,3,6,7,8] simultaneously address logarithmic time training and testing. The methods they propose are either hard to apply in practical problems [7] or use fixed tree structures [6,8]. Furthermore, an alternative approach based on using a random tree structure was shown to potentially lead to considerable underperformance [3,9]. At the same time, for massive datasets making multiple passes through the data is computationally costly, which justifies the need for developing online approaches, where the algorithm streams over a potentially infinitely large data set (online approaches are also plausible for non-stationary problems). It is unclear how to optimize standard decision tree objectives, such as Shannon or Gini-entropy, in this setting (an early attempt was recently proposed [2] for Shannon entropy). One prior work [3] introduces an objective function that enjoys certain advantages over entropy criteria; in particular, it can be easily and efficiently optimized online. Its authors, however, present an incomplete theoretical analysis and leave a number of open questions, which this paper aims to address. Algorithms for incremental learning of decision tree classifiers also include some older works [10,11,12], which split a node according to the outcome of a split-test based on the values of selected attributes of the data examples reaching that node. These approaches differ from the one in this paper, where the node split is performed according to the value of a learned (e.g., with SGD) hypothesis computed on the entire vector of attributes of the data examples reaching that node.

Other tree-based approaches include conditional probability trees [13] and clustering methods [9,14,15] ([9] was later improved in [16]), but they allow training time to be linear in the label complexity. The remaining techniques for multiclass classification include sparse output coding [17], variants of error correcting output codes [18], variants of iterative least-squares [19], and a method based on guess-averse loss functions [20].

Finally note that the conditional density estimation problem is also challenging in the large-class settings and in this respect remains parallel to the extreme multiclass classification problem [21]. In the context of conditional density estimation problem, there have also been some works that use tree structured models to accelerate computation of the likelihood and gradients [8,22,23,24]. They typically use heuristics based on using ontologies [8], Huffman coding [24], and various other mechanisms.

## 3. Theoretical Properties of the Objective Function

In this section we describe the objective function introduced in Equation (1) and provide its theoretical properties. The proofs are deferred to the Appendix. We first introduce the definitions of the concept of balancedness and purity of the node split.

**Definition 1** (Purity and balancedness)**.**

The hypothesis $h\in \mathcal{H}$ induces a pure split if $\alpha :={\sum}_{i=1}^{k}{\pi}_{i}\min(P(h(x)>0|i),P(h(x)<0|i))\le \delta $, where $\delta \in [0,0.5)$, and α is called the purity factor.

The hypothesis $h\in \mathcal{H}$ induces a balanced split if $\beta :=P(h(x)>0)\in [c,1-c]$, where $c\in (0,0.5]$, and β is called the balancing factor.

A partition is perfectly pure if $\alpha =0$ (examples of the same class are sent exclusively to the left or to the right). A partition is called perfectly balanced if $\beta =0.5$ (equal number of examples are sent to the left and to the right). The notions of balancedness and purity are conveniently illustrated in Figure 1, where it is shown that the purity criterion helps to refine the choice of the splitting hypothesis from among well-balanced candidates.
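The two factors of Definition 1 can be estimated directly from the examples at a node. The sketch below is illustrative: the function name `purity_and_balance` and the 0-based labels are assumptions, and the fraction sent left is computed as $1-P(h(x)>0|i)$, so that ties at $h(x)=0$ count as left, in line with the routing rule $h(x)\le 0$ used for the tree.

```python
import numpy as np

def purity_and_balance(scores, labels, k):
    """Return (alpha, beta): empirical purity and balancing factors of a split."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    right = scores > 0
    beta = right.mean()                    # balancing factor P(h(x) > 0)
    alpha = 0.0
    for i in range(k):
        is_i = labels == i
        if not is_i.any():                 # class absent at this node
            continue
        pi_i = is_i.mean()
        p_right_i = right[is_i].mean()     # P(h(x) > 0 | i)
        alpha += pi_i * min(p_right_i, 1.0 - p_right_i)
    return alpha, beta

# Two classes with 50 examples each: class 0 sends 10 of 50 to the right,
# class 1 sends all 50 to the right.
labels = np.array([0] * 50 + [1] * 50)
scores = np.concatenate([np.ones(10), -np.ones(40), np.ones(50)])
alpha, beta = purity_and_balance(scores, labels, k=2)
```

Here class 0 is the only source of impurity (a fifth of it leaks to the right), and the split is mildly unbalanced towards the right child.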

Next, we show the first theoretical property of the objective function $J(h)$ that characterizes its behavior at the optimum ($J(h)=1$).

**Lemma 1.**

The hypothesis $h\in \mathcal{H}$ induces a perfectly pure and balanced partition if and only if $J(h)=1$.

For some data sets however there exist no hypotheses producing perfectly pure and balanced splits. We next show that increasing the value of the objective leads to more balanced splits.

**Lemma 2.**

For any hypothesis h and any distribution over data examples the balancing factor β satisfies $\beta \in \left[0.5(1-\sqrt{1-J(h)}),0.5(1+\sqrt{1-J(h)})\right]$.

We refer to the interval to which $\beta $ belongs as the $\beta $-interval. Thus the larger (closer to 1) the value of $J(h)$ is, the narrower the $\beta $-interval is, leading to more balanced splits at the extremes of this interval ($\beta $ closer to $0.5$).

This result combined with the next lemma implies that, at the extremes of the $\beta $ interval, the value of the upper-bound on the purity factor decreases as the value of $J(h)$ increases (since $J(h)$ gets closer to 1 and the balancing factor $\beta $ gets closer to $0.5$ at the extremes of the $\beta $ interval). The recovered splits therefore have better purity ($\alpha $ closer to 0).

**Lemma 3** (Lemma 1 in [3])**.**

For any hypothesis h and any distribution over data examples the purity factor α and the balancing factor β satisfy $\alpha \le \min\left\{(2-J(h))/(4\beta )-\beta ,0.5\right\}$.

Note that the equality condition in Lemma 3 is achieved when $P(h(x)>0|i)=P(h(x)<0|i)=0.5$ for every class i (and thus $\alpha =0.5$, $\beta =0.5$, and $J(h)=0$).

We thus showed that maximizing the objective in Equation (1) in each tree node simultaneously encourages trees that are balanced and whose class distributions become gradually purer when moving from the root to subsequent tree levels. Lemmas 2 and 3 are illustrated in Figure 2.
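Lemmas 2 and 3 can be sanity-checked numerically on randomly generated splits. The sketch below works directly with a class distribution $\pi$ and per-class routing probabilities $P_i=P(h(x)>0|i)$; the function names and the sampling scheme are assumptions of this sketch. One further assumption is made loudly: the Lemma 3 bound is evaluated at $\min(\beta ,1-\beta )$ rather than at $\beta $, i.e., the child receiving the smaller share of the data is treated as the "right" child. Both $\alpha $ and $J(h)$ are unchanged when left and right are swapped, so this only fixes an orientation; the text itself does not state this convention.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_stats(pi, P):
    """Return (J, alpha, beta) for class proportions pi and P_i = P(h>0 | i)."""
    beta = float(pi @ P)
    J = 2.0 * float(pi @ np.abs(P - beta))
    alpha = float(pi @ np.minimum(P, 1.0 - P))
    return J, alpha, beta

def check_lemmas(trials=5000, k=5, tol=1e-9):
    """Check the beta-interval (Lemma 2) and the purity bound (Lemma 3)."""
    for _ in range(trials):
        pi = rng.dirichlet(np.ones(k))     # random class distribution
        P = rng.uniform(size=k)            # random routing probabilities
        J, alpha, beta = split_stats(pi, P)
        half_width = 0.5 * np.sqrt(max(1.0 - J, 0.0))
        if not (0.5 - half_width - tol <= beta <= 0.5 + half_width + tol):
            return False                   # Lemma 2 violated
        b = min(beta, 1.0 - beta)          # orientation assumption (see lead-in)
        if alpha > min((2.0 - J) / (4.0 * b) - b, 0.5) + tol:
            return False                   # Lemma 3 violated
    return True

all_hold = check_lemmas()
```

A random search of this kind cannot prove the lemmas, but it is a cheap way to catch transcription errors in the bounds.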

In the next section we show that the objective $J(h)$ is related to the more standard entropy-based decision tree objectives and that maximizing it leads to the reduction of these criteria. We consider three different entropy criteria in this paper. The theoretical analysis relies on the boosting framework and depends on the weak learning assumption. The three entropy-based criteria lead to three different theoretical statements, in which we bound the number of splits required to reduce the value of the criterion below a given level. The bounds we obtain, and their dependence on the number of classes (k), critically depend on the strong concavity properties of the considered entropy-based objectives.

## 4. Main Theoretical Results

#### 4.1. Notation

We first introduce notation. Let $\mathcal{T}$ denote the tree under consideration. Let ${\pi}_{l,i}$ denote the probability that a randomly chosen data point x drawn from $\mathcal{P}$, where $\mathcal{P}$ is a fixed target distribution over $\mathcal{X}$, has label i given that x reaches node l (note that ${\sum}_{i=1}^{k}{\pi}_{l,i}=1$). Let t denote the number of internal tree nodes, ${\mathcal{L}}_{t}$ the set of all tree leaves at time t, and ${w}_{l}$ the weight of leaf l, defined as the probability that a randomly chosen x drawn from $\mathcal{P}$ reaches leaf l (note that ${\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}=1$). We study a tree construction algorithm that recursively finds the leaf node with the highest weight and splits it into two children. Consider the tree constructed over t steps, where each step takes one leaf node and splits it, so the number of splits equals the number of internal nodes of the tree. The case $t=1$ corresponds to splitting the root, so the tree then consists of the root and its two children (leaves). We measure the quality of the tree at any given time t with three different entropy criteria:

- Shannon entropy ${G}_{t}^{e}$:$${G}_{t}^{e}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}\ln\left(\frac{1}{{\pi}_{l,i}}\right)$$
- Gini-entropy ${G}_{t}^{g}$:$${G}_{t}^{g}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}(1-{\pi}_{l,i})$$
- Modified Gini-entropy ${G}_{t}^{m}$:$${G}_{t}^{m}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})},$$

These criteria are the natural extensions of the criteria used in the context of binary classification [25] to the multiclass classification setting (note that there is more than one way of extending the entropy-based criteria from [25] to the multiclass classification setting, e.g., the modified Gini-entropy could as well be defined as ${G}_{t}^{m}={\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}{\sum}_{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}$, where $\mathcal{C}\in [1,2]$. This and other extensions will be investigated in future works). We will next present the main results of this paper, which will be followed by their proofs. We begin with introducing the weak hypothesis assumption.
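The three criteria are weighted sums over the current leaves and are cheap to evaluate once the leaf weights ${w}_{l}$ and leaf class distributions ${\pi}_{l,i}$ are known. The following sketch assumes these are given as NumPy arrays (one row of `pis` per leaf); the function names are hypothetical, and the value `C = 3.0` used in the example is an arbitrary illustrative choice for the constant $\mathcal{C}$, not a value taken from the text.

```python
import numpy as np

def shannon_entropy(weights, pis):
    """G^e = sum_l w_l sum_i pi_{l,i} ln(1 / pi_{l,i}), with 0 ln(1/0) := 0."""
    w = np.asarray(weights, dtype=float)
    p = np.asarray(pis, dtype=float)
    logs = np.log(np.where(p > 0, p, 1.0))   # ln(1) = 0 handles zero entries
    return float(np.sum(w * np.sum(-p * logs, axis=1)))

def gini_entropy(weights, pis):
    """G^g = sum_l w_l sum_i pi_{l,i} (1 - pi_{l,i})."""
    w = np.asarray(weights, dtype=float)
    p = np.asarray(pis, dtype=float)
    return float(np.sum(w * np.sum(p * (1.0 - p), axis=1)))

def modified_gini_entropy(weights, pis, C):
    """G^m = sum_l w_l sum_i sqrt(pi_{l,i} (C - pi_{l,i}))."""
    w = np.asarray(weights, dtype=float)
    p = np.asarray(pis, dtype=float)
    return float(np.sum(w * np.sum(np.sqrt(p * (C - p)), axis=1)))

k = 4
# A single leaf holding all the mass with a uniform class distribution.
w_uniform, pi_uniform = [1.0], [[1.0 / k] * k]
# Two equally heavy leaves, each containing a single class.
w_pure, pi_pure = [0.5, 0.5], [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

ge_uniform = shannon_entropy(w_uniform, pi_uniform)              # ln k
gg_uniform = gini_entropy(w_uniform, pi_uniform)                 # 1 - 1/k
gm_uniform = modified_gini_entropy(w_uniform, pi_uniform, 3.0)   # sqrt(k*C - 1)
ge_pure = shannon_entropy(w_pure, pi_pure)
gg_pure = gini_entropy(w_pure, pi_pure)
```

The uniform case reproduces the per-leaf maxima $\ln k$, $1-1/k$ and $\sqrt{k\mathcal{C}-1}$, while the pure leaves drive the Shannon and Gini criteria to zero.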

#### 4.2. Theorems

**Definition 2** (Weak Hypothesis Assumption)**.**

Let m denote any internal node of the tree $\mathcal{T}$, and let ${\beta}_{m}=P({h}_{m}(x)>0)$ and ${P}_{m,i}=P({h}_{m}(x)>0|i)$. Furthermore, let $\gamma \in {\mathbb{R}}^{+}$ be such that for all m, $\gamma \in (0,\min({\beta}_{m},1-{\beta}_{m})]$. We say that the weak hypothesis assumption is satisfied when for any distribution $\mathcal{P}$ over $\mathcal{X}$ at each node m of the tree $\mathcal{T}$ there exists a hypothesis ${h}_{m}\in \mathcal{H}$ such that $J({h}_{m})/2={\sum}_{i=1}^{k}{\pi}_{m,i}|{P}_{m,i}-{\beta}_{m}|\ge \gamma $.

The weak hypothesis assumption says that in every node of the tree we are able to recover a hypothesis from $\mathcal{H}$ which corresponds to the value of the objective that is above 0 (thus the corresponding split is “weakly” pure and “weakly” balanced).

Consider next any time t and let n be the heaviest leaf at time t, which is the leaf we split; its weight ${w}_{n}$ is denoted by w for brevity. Similarly, let h denote the regressor at node n (shorthand for ${h}_{n}$). We denote the difference between the contribution of node n to the value of the entropy-based objectives at times t and $t+1$ as

$${\Delta}_{t}^{e}:={G}_{t}^{e}-{G}_{t+1}^{e};\qquad {\Delta}_{t}^{g}:={G}_{t}^{g}-{G}_{t+1}^{g};\qquad {\Delta}_{t}^{m}:={G}_{t}^{m}-{G}_{t+1}^{m}.$$

Then the following lemma holds (the proof is provided in Section 5):

**Lemma 4.**

Under the Weak Hypothesis Assumption, the change in entropies occurring due to the node split can be bounded as

$${\Delta}_{t}^{e}\ge \frac{wJ{(h)}^{2}}{8{(1-\gamma )}^{2}};\qquad {\Delta}_{t}^{g}\ge \frac{wJ{(h)}^{2}}{4k{(1-\gamma )}^{2}};\qquad {\Delta}_{t}^{m}\ge \frac{{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}\cdot \frac{wJ{(h)}^{2}}{4k{(1-\gamma )}^{2}}.$$

Clearly, maximizing the objective $J(h)$ improves the entropy reduction. The considered objective can therefore be viewed as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria, for which efficient online optimization strategies are largely unknown but highly desired in the multiclass classification setting. To be more specific, the standard packages for binary classification trees, such as CART [26] and C4.5 [27], require running a brute force search to find a partition at every node of the tree from a set of all possible partitions that leads to the biggest improvement of the entropy-based criterion of interest [25]. This is prohibitive in case of the multiclass problem. $J(h)$ however can be efficiently optimized with SGD instead.
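The relation between $J(h)$ and the entropy reduction in Lemma 4 can be illustrated numerically for the Shannon and Gini criteria. The sketch below draws random splits, computes the exact reductions from the child class distributions ${\pi}_{i}(1-{P}_{i})/(1-\beta )$ and ${\pi}_{i}{P}_{i}/\beta $, and compares them with the Lemma 4 lower bounds. The choice $\gamma =\min(\beta ,1-\beta ,J(h)/2)$ is an assumption of this sketch: it is simply the largest value compatible with the weak hypothesis assumption for the split at hand. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy (natural log), with 0 ln(1/0) := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def gini(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def reductions_and_bounds(pi, P, w=1.0):
    """Exact entropy reductions of one split and the Lemma 4 lower bounds."""
    k = len(pi)
    beta = float(pi @ P)
    J = 2.0 * float(pi @ np.abs(P - beta))
    pi0 = pi * (1.0 - P) / (1.0 - beta)    # class distribution in the left child
    pi1 = pi * P / beta                    # class distribution in the right child
    d_e = w * (entropy(pi) - (1 - beta) * entropy(pi0) - beta * entropy(pi1))
    d_g = w * (gini(pi) - (1 - beta) * gini(pi0) - beta * gini(pi1))
    gamma = min(beta, 1.0 - beta, J / 2.0)  # assumption: largest admissible gamma
    bound_e = w * J**2 / (8.0 * (1.0 - gamma) ** 2)
    bound_g = w * J**2 / (4.0 * k * (1.0 - gamma) ** 2)
    return d_e, bound_e, d_g, bound_g

def check_lemma4(trials=5000, k=6, tol=1e-12):
    for _ in range(trials):
        pi = rng.dirichlet(np.ones(k))
        P = rng.uniform(size=k)
        d_e, b_e, d_g, b_g = reductions_and_bounds(pi, P)
        if d_e < b_e - tol or d_g < b_g - tol:
            return False
    return True

lemma4_holds = check_lemma4()
```

Since both bounds grow with $J{(h)}^{2}$, a split with a larger objective value is guaranteed a larger minimum entropy reduction.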

We next state the three boosting theoretical results captured in Theorems 1–3. They guarantee that the top-down decision tree algorithm which optimizes $J(h)$ in each node will amplify the weak advantage, captured in the weak learning assumption, to build a tree achieving any desired level of entropy (either Shannon entropy, Gini-entropy or its modified variant).

**Theorem 1.**

Under the Weak Hypothesis Assumption, for any $\alpha \in [0,2\ln k]$, to obtain ${G}_{t}^{e}\le \alpha $ it suffices to make $t\ge {\left(\frac{2\ln k}{\alpha}\right)}^{\frac{4{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}\ln k}$ splits.

**Theorem 2.**

Under the Weak Hypothesis Assumption, for any $\alpha \in [0,2\left(1-\frac{1}{k}\right)]$, to obtain ${G}_{t}^{g}\le \alpha $ it suffices to make $t\ge {\left(\frac{2\left(1-\frac{1}{k}\right)}{\alpha}\right)}^{\frac{2{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}(k-1)}$ splits.

**Theorem 3.**

Under the Weak Hypothesis Assumption, for any $\alpha \in [\sqrt{\mathcal{C}-1},2\sqrt{k\mathcal{C}-1}]$, to obtain ${G}_{t}^{m}\le \alpha $ it suffices to make $t\ge {\left(\frac{2\sqrt{k\mathcal{C}-1}}{\alpha}\right)}^{\frac{2{(1-\gamma )}^{2}{\mathcal{C}}^{3}}{{\gamma}^{2}{(\mathcal{C}-2)}^{2}{\log}_{2}e}k\sqrt{k\mathcal{C}-1}}$ splits.

Finally, we provide the error guarantee in Theorem 4. Let $y(x)$ denote a fixed target function with domain $\mathcal{X}$, which assigns the data point x to its label, and let $\mathcal{P}$ be a fixed target distribution over $\mathcal{X}$. Together y and $\mathcal{P}$ induce a distribution on labeled pairs $(x,y(x))$. Let $t(x)$ be the label assigned to data point x by the tree. We denote by $\epsilon (\mathcal{T})$ the error of tree $\mathcal{T}$, i.e., $\epsilon (\mathcal{T}):={\mathbb{E}}_{x\sim \mathcal{P}}\left[{\sum}_{i=1}^{k}\mathbf{1}[t(x)=i,y(x)\ne i]\right]$.

**Theorem 4.**

Under the Weak Hypothesis Assumption, for any $\alpha \in [0,1]$, to obtain $\epsilon (\mathcal{T})\le \alpha $ it suffices to make $t\ge {\left(\frac{2\ln k\,{\log}_{2}e}{\alpha}\right)}^{\frac{4{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}\ln k}$ splits.

**Remark 1.**

The main theorems show how fast the entropy criteria or the multi-class classification error drop as the tree grows and performs node splits. These statements therefore provide a platform for comparing different entropy criteria and answer two questions: (1) for fixed $\alpha $, $\gamma $, $\mathcal{C}$, and k, which criterion is reduced the most with each split? and (2) can the multi-class error match the convergence speed of the best entropic criterion? The Shannon entropy has the most advantageous dependence on the label complexity, since its bound scales only logarithmically with k, and thus achieves the fastest convergence. The multi-class classification error matches this advantageous convergence rate and also scales favorably (logarithmically) with k. Finally, even though the weak hypothesis assumption requires only a slightly favorable γ, i.e., $\gamma >0$, in practice when constructing the tree one can optimize J in every node of the tree, which effectively pushes γ to be as high as possible. In that case γ becomes a well-behaved constant in the above theorems, ideally equal to $1/2$, and does not negatively affect the split count.
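The comparison made in the remark can be reproduced by evaluating the split counts of Theorems 1, 2 and 4 for concrete values. Because the bounds are astronomically large for realistic k, the sketch below returns their base-10 logarithm instead of the count itself. The function names and the example values ($k=1000$, $\gamma =0.5$, $\alpha =0.1$) are assumptions chosen purely for illustration.

```python
import math

LOG2_E = math.log2(math.e)

def log10_splits_shannon(k, gamma, alpha):
    """log10 of the Theorem 1 split count for Shannon entropy."""
    exponent = 4.0 * (1.0 - gamma) ** 2 / (gamma ** 2 * LOG2_E) * math.log(k)
    return exponent * math.log10(2.0 * math.log(k) / alpha)

def log10_splits_gini(k, gamma, alpha):
    """log10 of the Theorem 2 split count for Gini-entropy."""
    exponent = 2.0 * (1.0 - gamma) ** 2 / (gamma ** 2 * LOG2_E) * (k - 1)
    return exponent * math.log10(2.0 * (1.0 - 1.0 / k) / alpha)

def log10_splits_error(k, gamma, alpha):
    """log10 of the Theorem 4 split count for the classification error."""
    exponent = 4.0 * (1.0 - gamma) ** 2 / (gamma ** 2 * LOG2_E) * math.log(k)
    return exponent * math.log10(2.0 * math.log(k) * LOG2_E / alpha)

k, gamma, alpha = 1000, 0.5, 0.1
shannon_log10 = log10_splits_shannon(k, gamma, alpha)
gini_log10 = log10_splits_gini(k, gamma, alpha)
error_log10 = log10_splits_error(k, gamma, alpha)
```

The exponent of the Shannon and error bounds grows with $\ln k$, whereas the Gini exponent grows with $k-1$, which is why `gini_log10` is much larger than the other two values for the same target level.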

We next discuss in detail the mathematical properties of the entropy-based criteria, which are needed to prove the above theorems.

## 5. Proofs

#### 5.1. Properties of the Entropy-Based Criteria

Each of the presented entropy-based criteria has a number of useful properties that we give next, along with their proofs. We first give bounds on the values of the entropy-based functions. As before, let w be the weight of the heaviest leaf in the tree at time t.

#### 5.1.1. Bounds on the Entropy-Based Criteria

**Lemma 5.**

The Shannon entropy function ${G}_{t}^{e}$ at time t is bounded as $0\le {G}_{t}^{e}\le (t+1)w\ln k$.

**Lemma 6.**

The Gini-entropy function ${G}_{t}^{g}$ at time t is bounded as $0\le {G}_{t}^{g}\le (t+1)w\left(1-1/k\right)$.

**Lemma 7.**

The modified Gini-entropy function ${G}_{t}^{m}$ at time t is bounded as $\sqrt{\mathcal{C}-1}\le {G}_{t}^{m}\le (t+1)w\sqrt{k\mathcal{C}-1}$.

The upper bounds in Lemmas 5–7 are tight: the equalities hold in the special case when ${\pi}_{l,i}=1/k$ for all $i\in \{1,\dots ,k\}$ and $l\in {\mathcal{L}}_{t}$, e.g., when each internal node of the tree produces a perfectly pure and balanced split.

#### 5.1.2. Strong Concavity Properties of the Entropy-Based Criteria

So far we have been focusing on the time step t. Recall that n is the heaviest leaf at time t and its weight ${w}_{n}$ is denoted by w for brevity. Consider splitting this leaf into two children ${n}_{0}$ and ${n}_{1}$. For ease of notation let ${w}_{0}={w}_{{n}_{0}}$ and ${w}_{1}={w}_{{n}_{1}}$, $\beta =P({h}_{n}(x)>0)$ and ${P}_{i}=P({h}_{n}(x)>0|i)$, and furthermore let ${\pi}_{i}$ and h be shorthand for ${\pi}_{n,i}$ and ${h}_{n}$, respectively. Recall that $\beta ={\sum}_{i=1}^{k}{\pi}_{i}{P}_{i}$ and ${\sum}_{i=1}^{k}{\pi}_{i}=1$. Notice that ${w}_{0}=w(1-\beta )$ and ${w}_{1}=w\beta $. Let $\mathit{\pi}$ be the k-element vector with ${i}^{th}$ entry equal to ${\pi}_{i}$. Finally, let ${\tilde{G}}^{e}(\mathit{\pi})={\sum}_{i=1}^{k}{\pi}_{i}\ln\left(\frac{1}{{\pi}_{i}}\right)$, ${\tilde{G}}^{g}(\mathit{\pi})={\sum}_{i=1}^{k}{\pi}_{i}(1-{\pi}_{i})$, and ${\tilde{G}}^{m}(\mathit{\pi})={\sum}_{i=1}^{k}\sqrt{{\pi}_{i}(\mathcal{C}-{\pi}_{i})}$.

Before the split, the contributions of node n to ${G}_{t}^{e}$, ${G}_{t}^{g}$, and ${G}_{t}^{m}$ were $w{\tilde{G}}^{e}(\mathit{\pi})$, $w{\tilde{G}}^{g}(\mathit{\pi})$, and $w{\tilde{G}}^{m}(\mathit{\pi})$, respectively. Note that ${\pi}_{{n}_{0},i}=\frac{{\pi}_{i}(1-{P}_{i})}{1-\beta}$ and ${\pi}_{{n}_{1},i}=\frac{{\pi}_{i}{P}_{i}}{\beta}$ are the probabilities that a randomly chosen x drawn from $\mathcal{P}$ has label i given that x reaches nodes ${n}_{0}$ and ${n}_{1}$, respectively. For brevity, let ${\pi}_{{n}_{0},i}$ and ${\pi}_{{n}_{1},i}$ be denoted as ${\pi}_{0,i}$ and ${\pi}_{1,i}$. Let ${\mathit{\pi}}_{0}$ be the k-element vector with ${i}^{th}$ entry equal to ${\pi}_{0,i}$ and let ${\mathit{\pi}}_{1}$ be the k-element vector with ${i}^{th}$ entry equal to ${\pi}_{1,i}$. Notice that $\mathit{\pi}=(1-\beta ){\mathit{\pi}}_{0}+\beta {\mathit{\pi}}_{1}$.

After the split, the contributions of the same, now internal, node n change to $w((1-\beta ){\tilde{G}}^{e}({\mathit{\pi}}_{0})+\beta {\tilde{G}}^{e}({\mathit{\pi}}_{1}))$, $w((1-\beta ){\tilde{G}}^{g}({\mathit{\pi}}_{0})+\beta {\tilde{G}}^{g}({\mathit{\pi}}_{1}))$, and $w((1-\beta ){\tilde{G}}^{m}({\mathit{\pi}}_{0})+\beta {\tilde{G}}^{m}({\mathit{\pi}}_{1}))$, respectively. We can compute the difference between the contribution of node n to the value of the entropy-based objectives at times t and $t+1$ as

$$\begin{array}{c}\hfill {\Delta}_{t}^{e}={G}_{t}^{e}-{G}_{t+1}^{e}=w\left[{\tilde{G}}^{e}(\mathbf{\pi})-(1-\beta ){\tilde{G}}^{e}({\mathbf{\pi}}_{0})-\beta {\tilde{G}}^{e}({\mathbf{\pi}}_{1})\right],\end{array}$$

$$\begin{array}{c}\hfill {\Delta}_{t}^{g}={G}_{t}^{g}-{G}_{t+1}^{g}=w\left[{\tilde{G}}^{g}(\mathbf{\pi})-(1-\beta ){\tilde{G}}^{g}({\mathbf{\pi}}_{0})-\beta {\tilde{G}}^{g}({\mathbf{\pi}}_{1})\right],\end{array}$$

$$\begin{array}{c}\hfill {\Delta}_{t}^{m}={G}_{t}^{m}-{G}_{t+1}^{m}=w\left[{\tilde{G}}^{m}(\mathbf{\pi})-(1-\beta ){\tilde{G}}^{m}({\mathbf{\pi}}_{0})-\beta {\tilde{G}}^{m}({\mathbf{\pi}}_{1})\right].\end{array}$$

The next three lemmas, Lemmas 8–10, describe the strong concavity properties of the Shannon entropy, Gini-entropy and modified Gini-entropy, which can be used to lower-bound ${\Delta}_{t}^{e}$, ${\Delta}_{t}^{g}$, and ${\Delta}_{t}^{m}$ (Equations (2)–(4) correspond to the gap in Jensen’s inequality applied to a strongly concave function).

**Lemma 8.**

The Shannon entropy function ${\tilde{G}}^{e}$ is strongly concave with respect to ${l}_{1}$-norm with modulus 1, and thus the following holds ${\tilde{G}}^{e}(\mathit{\pi})-(1-\beta ){\tilde{G}}^{e}({\mathit{\pi}}_{0})-\beta {\tilde{G}}^{e}({\mathit{\pi}}_{1})\ge \frac{1}{2}\beta (1-\beta ){\parallel {\mathit{\pi}}_{0}-{\mathit{\pi}}_{1}\parallel}_{1}^{2}$.

**Lemma 9.**

The Gini-entropy function ${\tilde{G}}^{g}$ is strongly concave with respect to ${l}_{2}$-norm with modulus 2, and thus the following holds ${\tilde{G}}^{g}(\mathit{\pi})-(1-\beta ){\tilde{G}}^{g}({\mathit{\pi}}_{0})-\beta {\tilde{G}}^{g}({\mathit{\pi}}_{1})\ge \beta (1-\beta ){\parallel {\mathit{\pi}}_{0}-{\mathit{\pi}}_{1}\parallel}_{2}^{2}$.

**Lemma 10.**

The modified Gini-entropy function ${\tilde{G}}^{m}$ is strongly concave with respect to ${l}_{2}$-norm with modulus $\frac{2{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}$, and thus the following holds ${\tilde{G}}^{m}(\mathit{\pi})-(1-\beta ){\tilde{G}}^{m}({\mathit{\pi}}_{0})-\beta {\tilde{G}}^{m}({\mathit{\pi}}_{1})\ge \frac{{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}\beta (1-\beta ){\parallel {\mathit{\pi}}_{0}-{\mathit{\pi}}_{1}\parallel}_{2}^{2}$.
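The Jensen-gap inequalities of Lemmas 8 and 9 can be checked numerically for arbitrary pairs of distributions and mixing weights. The sketch below is a sanity check under stated assumptions rather than a proof: the function names and the random sampling of ${\mathit{\pi}}_{0}$, ${\mathit{\pi}}_{1}$ and β are choices of this sketch, and Lemma 10 is left out because it depends on the constant $\mathcal{C}$.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    """Shannon entropy (natural log), with 0 ln(1/0) := 0."""
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def gini(p):
    return float(np.sum(p * (1.0 - p)))

def jensen_gap(f, p0, p1, beta):
    """f((1-beta) p0 + beta p1) - (1-beta) f(p0) - beta f(p1)."""
    mix = (1.0 - beta) * p0 + beta * p1
    return f(mix) - (1.0 - beta) * f(p0) - beta * f(p1)

def check_strong_concavity(trials=5000, k=6, tol=1e-12):
    for _ in range(trials):
        p0 = rng.dirichlet(np.ones(k))
        p1 = rng.dirichlet(np.ones(k))
        beta = rng.uniform()
        l1_sq = np.sum(np.abs(p0 - p1)) ** 2
        l2_sq = np.sum((p0 - p1) ** 2)
        # Lemma 8: modulus 1 with respect to the l1-norm.
        if jensen_gap(entropy, p0, p1, beta) < 0.5 * beta * (1 - beta) * l1_sq - tol:
            return False
        # Lemma 9: modulus 2 with respect to the l2-norm.
        if jensen_gap(gini, p0, p1, beta) < beta * (1 - beta) * l2_sq - tol:
            return False
    return True

concavity_holds = check_strong_concavity()
```

For the Gini-entropy the inequality is in fact an equality, because the function is quadratic; the check therefore relies on the small tolerance `tol` to absorb rounding error.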

Figure 3 illustrates different entropy criteria normalized to the $[0,1]$ interval.

#### 5.2. Proof of Lemma 4 and Theorems 1–3

We finally proceed to proving all three boosting theorems, Theorems 1–3. Lemma 4 is a by-product of these proofs.

**Proof.**

For the Shannon entropy it follows from Equation (2), Lemmas 5 and 8 that

$$\begin{aligned}{\Delta}_{t}^{e}& \ge \frac{1}{2}w\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}^{2}\\ & =\frac{1}{2}\frac{w}{\beta (1-\beta )}{\left(\sum _{i=1}^{k}\left|{\pi}_{i}({P}_{i}-\beta )\right|\right)}^{2}\\ & =\frac{wJ{(h)}^{2}}{8\beta (1-\beta )}\\ & \ge \frac{J{(h)}^{2}{G}_{t}^{e}}{8\beta (1-\beta )(t+1)\ln k}\\ & \ge \frac{{\gamma}^{2}{G}_{t}^{e}}{2{(1-\gamma )}^{2}(t+1)\ln k},\end{aligned}$$

where the last inequality comes from the fact that $1-\gamma \ge \beta \ge \gamma $ (see the definition of $\gamma $ in the weak hypothesis assumption) and $J(h)\ge 2\gamma $ (see the weak hypothesis assumption). For the Gini-entropy criterion notice that from Equation (3), Lemmas 6, 9, and A4 it follows that

$$\begin{aligned}{\Delta}_{t}^{g}& \ge w\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{2}^{2}\\ & \ge \frac{1}{k}w\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}^{2}\\ & \ge \frac{{\gamma}^{2}{G}_{t}^{g}}{{(1-\gamma )}^{2}(t+1)(k-1)},\end{aligned}$$

where the last inequality is obtained similarly to the last inequality in Equation (5). And finally for the modified Gini-entropy it follows from Equation (4), Lemmas 7, 10, and A4 that

$$\begin{aligned}{\Delta}_{t}^{m}& \ge w\frac{{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{2}^{2}\\ & \ge \frac{1}{k}w\frac{{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}^{2}\\ & \ge \frac{{\gamma}^{2}{G}_{t}^{m}}{\frac{{\mathcal{C}}^{3}}{{(\mathcal{C}-2)}^{2}}{(1-\gamma )}^{2}(t+1)k\sqrt{k\mathcal{C}-1}},\end{aligned}$$

where the last inequality is obtained as before.
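The step in the Shannon chain that replaces ${\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}$ by an expression in $J(h)$ can be spelled out using the child distributions ${\pi}_{0,i}={\pi}_{i}(1-{P}_{i})/(1-\beta )$ and ${\pi}_{1,i}={\pi}_{i}{P}_{i}/\beta $ from Section 5.1.2, together with $J(h)=2{\sum}_{i=1}^{k}{\pi}_{i}|{P}_{i}-\beta |$:

$${\pi}_{0,i}-{\pi}_{1,i}=\frac{{\pi}_{i}(1-{P}_{i})}{1-\beta}-\frac{{\pi}_{i}{P}_{i}}{\beta}=\frac{{\pi}_{i}(\beta -{P}_{i})}{\beta (1-\beta )},\qquad \text{so}\qquad {\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}=\frac{{\sum}_{i=1}^{k}{\pi}_{i}|{P}_{i}-\beta |}{\beta (1-\beta )}=\frac{J(h)}{2\beta (1-\beta )}.$$

Substituting this into $\frac{1}{2}w\beta (1-\beta ){\parallel {\mathbf{\pi}}_{0}-{\mathbf{\pi}}_{1}\parallel}_{1}^{2}$ gives $\frac{1}{2}w\beta (1-\beta )\cdot \frac{J{(h)}^{2}}{4{\beta}^{2}{(1-\beta )}^{2}}=\frac{wJ{(h)}^{2}}{8\beta (1-\beta )}$. The same identity underlies the Gini and modified Gini chains once the ${l}_{2}$-norm has been lower-bounded by the ${l}_{1}$-norm.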

Clearly the larger the objective $J(h)$ is at time t, the larger the entropy reduction ends up being. Let

$${\eta}^{e}=\frac{2\sqrt{2}\gamma}{(1-\gamma )\sqrt{\ln k}},\qquad {\eta}^{g}=\frac{4\gamma}{(1-\gamma )\sqrt{k-1}},\qquad {\eta}^{m}=\frac{4\gamma}{(1-\gamma )\sqrt{\frac{{\mathcal{C}}^{3}}{{(\mathcal{C}-2)}^{2}}k\sqrt{k\mathcal{C}-1}}}.$$

For simplicity of notation assume ${\Delta}_{t}$ corresponds to either ${\Delta}_{t}^{e}$, or ${\Delta}_{t}^{g}$, or ${\Delta}_{t}^{m}$, and ${G}_{t}$ stands for ${G}_{t}^{e}$, or ${G}_{t}^{g}$, or ${G}_{t}^{m}$. Thus ${\Delta}_{t}>\frac{{\eta}^{2}{G}_{t}}{16(t+1)}$, and we obtain

$${G}_{t+1}\le {G}_{t}-{\Delta}_{t}<{G}_{t}-\frac{{\eta}^{2}{G}_{t}}{16(t+1)}={G}_{t}\left(1-\frac{{\eta}^{2}}{16(t+1)}\right).$$

One can now compute the minimum number of splits required to reduce ${G}_{t}$ below $\alpha $, where $\alpha \in [0,1]$, from this recurrence inequality. Assume ${\log}_{2}(t+1)\in {\mathbb{Z}}^{+}$. Then

$$\begin{aligned}
{G}_{t+1}& \le {G}_{t}\left(1-\frac{{\eta}^{2}}{16(t+1)}\right)\\
& \le {G}_{1}\left(1-\frac{{\eta}^{2}}{16\cdot 2}\right)\left(1-\frac{{\eta}^{2}}{16\cdot 3}\right)\cdots \left(1-\frac{{\eta}^{2}}{16\cdot (t+1)}\right)\\
& ={G}_{1}\left(1-\frac{{\eta}^{2}}{16\cdot 2}\right)\prod _{{t}^{\prime}=3}^{4}\left(1-\frac{{\eta}^{2}}{16\cdot {t}^{\prime}}\right)\cdots \prod _{{t}^{\prime}=({2}^{r}/2)+1}^{{2}^{r}}\left(1-\frac{{\eta}^{2}}{16\cdot {t}^{\prime}}\right)\cdots \prod _{{t}^{\prime}=({2}^{{\log}_{2}(t+1)}/2)+1}^{{2}^{{\log}_{2}(t+1)}}\left(1-\frac{{\eta}^{2}}{16\cdot {t}^{\prime}}\right),
\end{aligned}$$

where $r\in \{2,3,\cdots ,{\log}_{2}(t+1)\}$. Recall that

$$\begin{aligned}
\prod _{{t}^{\prime}=({2}^{r}/2)+1}^{{2}^{r}}\left(1-\frac{{\eta}^{2}}{16\cdot {t}^{\prime}}\right)&\le \prod _{{t}^{\prime}=({2}^{r}/2)+1}^{{2}^{r}}\left(1-\frac{{\eta}^{2}}{16\cdot {2}^{r}}\right)\\
&={\left(1-\frac{{\eta}^{2}}{16\cdot {2}^{r}}\right)}^{{2}^{r}/2}\le {e}^{-{\eta}^{2}/32},
\end{aligned}$$

where the last step follows from Lemma A5. Also note that by the same lemma $\left(1-\frac{{\eta}^{2}}{16\cdot 2}\right)\le {e}^{-{\eta}^{2}/32}$. Thus,

$${G}_{t+1}\le {G}_{1}{e}^{-{\eta}^{2}{\log}_{2}(t+1)/32}.$$

Therefore, to obtain ${G}_{t+1}\le \alpha $ (where the $\alpha $’s are defined in Theorems 1–3) it suffices to make $t+1$ splits such that ${\log}_{2}(t+1)\ge \ln {\left(\frac{{G}_{1}}{\alpha}\right)}^{\frac{32}{{\eta}^{2}}}$. Since ${\log}_{2}(t+1)=\ln (t+1)\cdot {\log}_{2}(e)$, where $e=\exp (1)$, this condition is equivalent to

$$ln(t+1)\ge ln{\left(\frac{{G}_{1}}{\alpha}\right)}^{\frac{32}{{\eta}^{2}{log}_{2}(e)}}\iff t+1\ge {\left(\frac{{G}_{1}}{\alpha}\right)}^{\frac{32}{{\eta}^{2}{log}_{2}(e)}}.$$
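As a numerical illustration, the following minimal sketch evaluates the closed-form bound on $t+1$ and checks that the entropy bound ${G}_{1}{e}^{-{\eta}^{2}{\log}_{2}(t+1)/32}$ reaches $\alpha $ at that point. The function names and the example values ($k=10$, $\gamma =0.4$, $\alpha =0.5$) are assumptions made for illustration only and do not come from the paper.

```python
import math


def eta_shannon(gamma: float, k: int) -> float:
    """eta^e = 2*sqrt(2)*gamma / ((1 - gamma) * sqrt(ln k))."""
    return 2.0 * math.sqrt(2.0) * gamma / ((1.0 - gamma) * math.sqrt(math.log(k)))


def eta_gini(gamma: float, k: int) -> float:
    """eta^g = 4*gamma / ((1 - gamma) * sqrt(k - 1))."""
    return 4.0 * gamma / ((1.0 - gamma) * math.sqrt(k - 1))


def entropy_bound(g1: float, eta: float, t_plus_1: float) -> float:
    """Right-hand side G_1 * exp(-eta^2 * log2(t + 1) / 32)."""
    return g1 * math.exp(-eta ** 2 * math.log2(t_plus_1) / 32.0)


def splits_needed(g1: float, alpha: float, eta: float) -> float:
    """Smallest real t + 1 with t + 1 >= (G_1 / alpha)^(32 / (eta^2 * log2 e))."""
    exponent = 32.0 / (eta ** 2 * math.log2(math.e))
    return (g1 / alpha) ** exponent


if __name__ == "__main__":
    k, gamma, alpha = 10, 0.4, 0.5      # illustrative values only
    g1 = 2.0 * math.log(k)              # worst case G_1^e = 2 ln k
    eta = eta_shannon(gamma, k)
    n = splits_needed(g1, alpha, eta)
    print(f"eta = {eta:.4f}, t + 1 >= {n:.3e}")
    print(f"bound at that t + 1: {entropy_bound(g1, eta, n):.4f}")
```

The exponent $32/({\eta}^{2}{\log}_{2}e)$ grows quickly as $\gamma $ shrinks or $k$ grows, so the resulting numbers are worst-case guarantees rather than typical tree sizes.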

Recall that by Lemmas 5–7 we have, respectively, ${G}_{1}^{e}\le 2\ln k$, ${G}_{1}^{g}\le 2(1-\frac{1}{k})$, and ${G}_{1}^{m}\le 2\sqrt{k\mathcal{C}-1}$. We consider the worst-case setting (giving the largest possible number of splits), thus we assume ${G}_{1}^{e}=2\ln k$, ${G}_{1}^{g}=2(1-\frac{1}{k})$, and ${G}_{1}^{m}=2\sqrt{k\mathcal{C}-1}$. Combining that with Equations (8) and (10) yields the statements of the main theorems. □

#### 5.3. Proof of Theorem 4

We next proceed to directly proving the error bound. Recall that ${\pi}_{l,i}$ is the probability that the data point x corresponds to label i given that x reached l, i.e., ${\pi}_{l,i}=P(y(x)=i\,|\,x\ \mathrm{reached}\ l)$. Let the label assigned to the leaf be the majority label, and thus let us assume that the leaf is assigned to label i if and only if ${\pi}_{l,i}\ge {\pi}_{l,z}$ for all $z\in \{1,2,\cdots ,k\}$, $z\ne i$. Therefore we can write that

$$\begin{aligned}
\epsilon (\mathcal{T})&=P(t(x)\ne y(x))\\
&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}P(t(x)\ne y(x)\,|\,x\ \mathrm{reached}\ l).
\end{aligned}$$

Let ${i}_{l}$ be the majority label in leaf l, thus ${\pi}_{l,{i}_{l}}\ge {\pi}_{l,z}$ for all $z\in \{1,2,\cdots ,k\}$, $z\ne {i}_{l}$. We can continue as follows

$$\begin{aligned}
\epsilon (\mathcal{T})&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}P(y(x)\ne {i}_{l}\,|\,x\ \mathrm{reached}\ l)\\
&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}(1-{\pi}_{l,{i}_{l}})\\
&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}(1-\max ({\pi}_{l,1},{\pi}_{l,2},\cdots ,{\pi}_{l,k})).
\end{aligned}$$

Consider again the Shannon entropy ${G}_{t}^{e}$ of the leaves of tree $\mathcal{T}$, which is defined as

$$\begin{aligned}
{G}_{t}^{e}&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}\ln \frac{1}{{\pi}_{l,i}}\\
&=\frac{1}{{\log}_{2}e}\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}{\log}_{2}\frac{1}{{\pi}_{l,i}}.
\end{aligned}$$

Note that

$$\begin{aligned}
{G}_{t}^{e}&=\frac{1}{{\log}_{2}e}\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}{\log}_{2}\frac{1}{{\pi}_{l,i}}\\
&\ge \frac{1}{{\log}_{2}e}\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{\substack{i=1\\ i\ne {i}_{l}}}^{k}{\pi}_{l,i}{\log}_{2}\frac{1}{{\pi}_{l,i}}\\
&\ge \frac{1}{{\log}_{2}e}\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{\substack{i=1\\ i\ne {i}_{l}}}^{k}{\pi}_{l,i}\\
&=\frac{1}{{\log}_{2}e}\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}(1-\max ({\pi}_{l,1},{\pi}_{l,2},\cdots ,{\pi}_{l,k}))\\
&=\frac{1}{{\log}_{2}e}\epsilon (\mathcal{T}),
\end{aligned}$$

where the last inequality comes from the fact that ${\pi}_{l,i}\le 0.5$ for all $i\in \{1,2,\cdots ,k\}$, $i\ne {i}_{l}$, and thus $\frac{1}{{\pi}_{l,i}}\in [2,+\infty )$ and consequently ${\log}_{2}\frac{1}{{\pi}_{l,i}}\in [1,+\infty )$ for all such i.
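The chain of inequalities above amounts to $\epsilon (\mathcal{T})\le {\log}_{2}(e)\,{G}_{t}^{e}$. A minimal sketch that checks this relation on randomly generated leaves is given below; the helper names and the random leaf generator are assumptions introduced for illustration, not part of the paper or of any library.

```python
import math
import random


def random_distribution(k: int, rng: random.Random) -> list:
    """Random class distribution over k labels (hypothetical helper)."""
    raw = [1.0 - rng.random() for _ in range(k)]  # values in (0, 1]
    total = sum(raw)
    return [r / total for r in raw]


def tree_error(weights, leaves) -> float:
    """epsilon(T) = sum_l w_l * (1 - max_i pi_{l,i}) under majority labelling."""
    return sum(w * (1.0 - max(p)) for w, p in zip(weights, leaves))


def shannon_entropy(weights, leaves) -> float:
    """G_t^e = sum_l w_l * sum_i pi_{l,i} * ln(1 / pi_{l,i})."""
    return sum(
        w * sum(q * math.log(1.0 / q) for q in p if q > 0.0)
        for w, p in zip(weights, leaves)
    )


def random_tree(num_leaves: int, k: int, rng: random.Random):
    """Leaf weights summing to one and one class distribution per leaf."""
    weights = random_distribution(num_leaves, rng)
    leaves = [random_distribution(k, rng) for _ in range(num_leaves)]
    return weights, leaves


if __name__ == "__main__":
    rng = random.Random(0)
    w, leaves = random_tree(num_leaves=8, k=5, rng=rng)
    print(tree_error(w, leaves), math.log2(math.e) * shannon_entropy(w, leaves))
```

Random leaves are far from pure, so the gap between the two quantities in this experiment is large; the bound is tight only when the non-majority labels each carry probability exactly 0.5 or 0.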

## 6. Conclusions

This paper introduces theoretical tools, encapsulated in the boosting framework, that enable the comparison of different multi-class classification objective functions. Multi-class boosting is still poorly understood from the theoretical perspective [5]. We provide an exhaustive theoretical analysis of the objective function underlying the recently proposed LOMtree algorithm for extreme multi-class classification and explore the connection of this objective to entropy-based criteria. We show that optimizing this objective simultaneously optimizes Shannon entropy, Gini-entropy and its modified variant, as well as the multi-class classification error. We expect that the discussed tools can be used to obtain theoretical guarantees in the multi-label [28,29,30] and memory-constrained settings (we will explore this research direction in the future). We also consider extensions to different variants of the multi-class classification problem [31,32] and multi-output learning tasks [33,34]. We thus plan to build a unified theoretical framework for understanding extreme classification trees.

## Author Contributions

A.C. derived the theoretical results and did the empirical evaluation. I.K.J. was working on improving the write-up of the paper and checking mathematical correctness.

## Funding

This research received no external funding.

## Conflicts of Interest

The authors declare no conflict of interest.

## Appendix A. Extreme Multiclass Classification Criteria

#### Appendix A.1. Numerical Experiments

We run the LOMtree algorithm, which is implemented in the open source learning system Vowpal Wabbit [35], on four benchmark multiclass data sets: Mnist (10 classes, downloaded from http://yann.lecun.com/exdb/mnist/), Isolet (26 classes, downloaded from http://www.cs.huji.ac.il/~shais/datasets/ClassificationDatasets.html), Sector (105 classes, downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html), and Aloi (1000 classes, downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html). The data sets were divided into training ($90\%$) and testing ($10\%$), where $10\%$ of the training data set was used as a validation set. The regressors in the tree nodes are linear and were trained by SGD [36] with 20 epochs and the learning rate chosen from the set $\{0.25,0.5,0.75,1,2,4,8\}$. We investigated different swap resistances chosen from the set $\{4,8,16,32,64,128,256\}$. We selected the learning rate and the swap resistance as the one minimizing the validation error, where the number of splits in all experiments was set to 10 k.

Figure A1 shows the Shannon entropy, Gini-entropy, modified Gini-entropy (all normalized to the interval $[0,1]$), and the multiclass classification error computed on the test data set as a function of the number of splits. The behavior of the Shannon entropy and the Gini-entropy matches the theoretical findings. However, the modified Gini-entropy instead drops the fastest with the number of splits, which suggests that in this case tighter bounds could possibly be proved (for the binary case a tighter analysis was shown in [25], but it is highly non-trivial to generalize this analysis to the multiclass classification setting). Furthermore, it can be observed that the behavior of the error closely mimics the behavior of the Gini-entropy. In all cases the Gini-entropy approximates the upper bound on the error well.

**Figure A1.** Functions ${G}_{t}^{e}$, ${G}_{t}^{g}$, and ${G}_{t}^{m}$, and the test error, all normalized to the interval $[0,1]$, versus the number of splits. Figure is recommended to be read in color.

#### Appendix A.2. Additional Proofs

**Proof of Lemma 1.**

The proof that if h induces a maximally pure and balanced partition then $J(h)=1$ was done in [3] (Lemma 2) and is very basic. We focus here on the remaining part of the statement, which is harder to show, and prove that if $J(h)=1$ then h induces a maximally pure and balanced partition.

Without loss of generality assume each ${\pi}_{i}\in (0,1)$. Recall that $\beta =P(h(x)>0)$, and let ${P}_{i}=P(h(x)>0|i)$. Also recall that $\beta ={\sum}_{i=1}^{k}{\pi}_{i}{P}_{i}$. Thus $J(h)=2{\sum}_{i=1}^{k}{\pi}_{i}\left|{\sum}_{j=1}^{k}{\pi}_{j}{P}_{j}-{P}_{i}\right|$. The objective is certainly maximized in the extremes of the interval $[0,1]$, where each ${P}_{i}$ is either 0 or 1 (also note that at maximum, where $J(h)=1$, it cannot be that all ${P}_{i}$’s are 0 or all ${P}_{i}$’s are 1). The function $J(h)$ is differentiable in these extremes ($J(h)$ is non-differentiable only when ${\sum}_{j=1}^{k}{\pi}_{j}{P}_{j}={P}_{i}$, but at considered extremes the left-hand side of this equality is in $(0,1)$, whereas the right-hand side is either 0 or 1). We then write

$$J(h)=2\sum _{i\in \mathcal{P}}{\pi}_{i}\left(\sum _{j=1}^{k}{\pi}_{j}{P}_{j}-{P}_{i}\right)+2\sum _{i\in \mathcal{N}}{\pi}_{i}\left({P}_{i}-\sum _{j=1}^{k}{\pi}_{j}{P}_{j}\right),$$

where $\mathcal{P}=\{i:{\sum}_{j=1}^{k}{\pi}_{j}{P}_{j}\ge {P}_{i}\}$ and $\mathcal{N}=\{i:{\sum}_{j=1}^{k}{\pi}_{j}{P}_{j}<{P}_{i}\}$. Also let ${\mathcal{P}}^{+}=\{i:{\sum}_{j=1}^{k}{\pi}_{j}{P}_{j}>{P}_{i}\}$ (clearly ${\sum}_{i\in {\mathcal{P}}^{+}}{\pi}_{i}\ne 1$ and ${\sum}_{i\in \mathcal{N}}{\pi}_{i}\ne 1$ in the extremes of the interval $[0,1]$ where $J(h)$ is maximized). We then can compute the derivatives of $J(h)$ with respect to ${P}_{r}$, where $r\in \{1,2,\cdots ,k\}$, everywhere where the function is differentiable as follows

$$\frac{\partial J}{\partial {P}_{r}}=\begin{cases}2{\pi}_{r}\left(\sum _{i\in {\mathcal{P}}^{+}}{\pi}_{i}-1\right) & \text{if } r\in {\mathcal{P}}^{+}\\ 2{\pi}_{r}\left(1-\sum _{i\in \mathcal{N}}{\pi}_{i}\right) & \text{if } r\in \mathcal{N}\end{cases},$$

and note that in the extremes of the interval $[0,1]$ where $J(h)$ is maximized $\frac{\partial J}{\partial {P}_{r}}\ne 0$, since ${\sum}_{i\in {\mathcal{P}}^{+}}{\pi}_{i}\ne 1$, ${\sum}_{i\in \mathcal{N}}{\pi}_{i}\ne 1$, and each ${\pi}_{i}\in (0,1)$. Since $J(h)$ is convex, and by the fact that in particular the derivative of $J(h)$ with respect to any ${P}_{r}$ cannot be 0 in the extremes of the interval $[0,1]$ where $J(h)$ is maximized, it follows that $J(h)$ can only be maximized ($J(h)=1$) at the extremes of the $[0,1]$ interval. Thus we already proved that if $J(h)=1$ then h induces a maximally pure partition. We are left with showing that if $J(h)=1$ then h induces also a maximally balanced partition. We prove it by contradiction. Assume $\beta \ne 0.5$. Denote as before ${\mathcal{I}}_{0}=\{i:P(h(x)>0|i)=0\}$ and ${\mathcal{I}}_{1}=\{i:P(h(x)>0|i)=1\}$. Recall $\beta ={\sum}_{i=1}^{k}{\pi}_{i}{P}_{i}={\sum}_{i\in {\mathcal{I}}_{0}}{\pi}_{i}\cdot 0+{\sum}_{i\in {\mathcal{I}}_{1}}{\pi}_{i}\cdot 1={\sum}_{i\in {\mathcal{I}}_{1}}{\pi}_{i}$. Thus,

$$\begin{aligned}
J(h)=1&=2\sum _{i\in {\mathcal{I}}_{0}}{\pi}_{i}\left|\beta \right|+2\sum _{i\in {\mathcal{I}}_{1}}{\pi}_{i}\left|\beta -1\right|\\
&=2\beta \sum _{i\in {\mathcal{I}}_{0}}{\pi}_{i}+2(1-\beta )\sum _{i\in {\mathcal{I}}_{1}}{\pi}_{i}\\
&=2\beta \left(1-\sum _{i\in {\mathcal{I}}_{1}}{\pi}_{i}\right)+2(1-\beta )\sum _{i\in {\mathcal{I}}_{1}}{\pi}_{i}\\
&=2\beta (1-\beta )+2(1-\beta )\beta \\
&=-4{\beta}^{2}+4\beta <1,
\end{aligned}$$

where the last inequality comes from the fact that the quadratic form $-4{\beta}^{2}+4\beta $ is equal to 1 only when $\beta =0.5$, and otherwise it is smaller than 1. Thus we obtain the contradiction which ends the proof. □

**Proof of Lemma 2.**

We use the following notation: $\beta =P(h(x)>0)$, and ${P}_{i}=P(h(x)>0|i)$. Also let $\mathcal{P}=\{i:\beta \ge {P}_{i}\}$ and $\mathcal{N}=\{i:\beta <{P}_{i}\}$. Recall that $\beta ={\sum}_{i\in \{\mathcal{P}\cup \mathcal{N}\}}{\pi}_{i}{P}_{i}$, and ${\sum}_{i\in \{\mathcal{P}\cup \mathcal{N}\}}{\pi}_{i}=1$. We split the proof into two cases.

- Let ${\sum}_{i\in \mathcal{P}}{\pi}_{i}\le 1-\beta $. Then

  $$\begin{aligned}
  J(h)&=2\sum _{i=1}^{k}{\pi}_{i}\left|\beta -{P}_{i}\right|\\
  &=2\sum _{i\in \mathcal{P}}{\pi}_{i}(\beta -{P}_{i})+2\sum _{i\in \mathcal{N}}{\pi}_{i}({P}_{i}-\beta )\\
  &=2\sum _{i\in \mathcal{P}}{\pi}_{i}\beta -2\sum _{i\in \mathcal{P}}{\pi}_{i}{P}_{i}+2\left(\beta -\sum _{i\in \mathcal{P}}{\pi}_{i}{P}_{i}\right)-2\beta \left(1-\sum _{i\in \mathcal{P}}{\pi}_{i}\right)\\
  &=4\beta \sum _{i\in \mathcal{P}}{\pi}_{i}-4\sum _{i\in \mathcal{P}}{\pi}_{i}{P}_{i}\\
  &\le 4\beta \sum _{i\in \mathcal{P}}{\pi}_{i}\le 4\beta (1-\beta ).
  \end{aligned}$$

  Thus $-4{\beta}^{2}+4\beta -J(h)\ge 0$ which, when solved, yields the lemma.
- Let ${\sum}_{i\in \mathcal{P}}{\pi}_{i}\ge 1-\beta $ (thus ${\sum}_{i\in \mathcal{N}}{\pi}_{i}\le \beta $). Note that $J(h)$ can be written as

  $$J(h)=2\sum _{i=1}^{k}{\pi}_{i}\left|P(h(x)\le 0)-P(h(x)\le 0|i)\right|.$$

  Let ${\beta}^{\prime}=P(h(x)\le 0)=1-\beta $ and ${P}_{i}^{\prime}=P(h(x)\le 0|i)=1-{P}_{i}$. Then

  $$\begin{aligned}
  J(h)&=2\sum _{i=1}^{k}{\pi}_{i}\left|{\beta}^{\prime}-{P}_{i}^{\prime}\right|\\
  &=2\sum _{i\in \mathcal{P}}{\pi}_{i}({P}_{i}^{\prime}-{\beta}^{\prime})+2\sum _{i\in \mathcal{N}}{\pi}_{i}({\beta}^{\prime}-{P}_{i}^{\prime})\\
  &=2\left({\beta}^{\prime}-\sum _{i\in \mathcal{N}}{\pi}_{i}{P}_{i}^{\prime}\right)-2{\beta}^{\prime}\left(1-\sum _{i\in \mathcal{N}}{\pi}_{i}\right)+2\sum _{i\in \mathcal{N}}{\pi}_{i}{\beta}^{\prime}-2\sum _{i\in \mathcal{N}}{\pi}_{i}{P}_{i}^{\prime}\\
  &=4{\beta}^{\prime}\sum _{i\in \mathcal{N}}{\pi}_{i}-4\sum _{i\in \mathcal{N}}{\pi}_{i}{P}_{i}^{\prime}\le 4{\beta}^{\prime}\sum _{i\in \mathcal{N}}{\pi}_{i}\\
  &=4(1-\beta )\sum _{i\in \mathcal{N}}{\pi}_{i}\le 4\beta (1-\beta ).
  \end{aligned}$$

  Thus as before we obtain $-4{\beta}^{2}+4\beta -J(h)\ge 0$ which, when solved, yields the lemma. □
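The two proofs above can be illustrated numerically. The sketch below computes $J(h)=2{\sum}_{i}{\pi}_{i}|\beta -{P}_{i}|$ from the class priors ${\pi}_{i}$ and the conditional probabilities ${P}_{i}$, and solves the quadratic inequality $-4{\beta}^{2}+4\beta -J(h)\ge 0$ for $\beta $. The function names are assumptions made for illustration, and the returned interval is simply the solution set of that inequality, since the lemma statement itself is not reproduced in this section.

```python
import math


def objective(pi, P):
    """Return (J(h), beta) with beta = sum_i pi_i * P_i and J = 2 * sum_i pi_i * |beta - P_i|."""
    beta = sum(p * q for p, q in zip(pi, P))
    J = 2.0 * sum(p * abs(beta - q) for p, q in zip(pi, P))
    return J, beta


def balance_interval(J: float):
    """Values of beta satisfying -4*beta^2 + 4*beta - J >= 0."""
    root = math.sqrt(max(0.0, 1.0 - J))
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


if __name__ == "__main__":
    # Pure and balanced split: every class goes entirely to one side, beta = 0.5.
    print(objective([0.25, 0.25, 0.25, 0.25], [1, 1, 0, 0]))
    # Pure but unbalanced split: J drops below 1.
    J, beta = objective([0.7, 0.3], [1, 0])
    print(J, beta, balance_interval(J))
```

In the second example the split is pure, so $J(h)$ equals $4\beta (1-\beta )$ up to rounding, and $\beta $ sits at the edge of the interval.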

**Proof of Lemma 5.**

The lower-bound follows from the fact that the entropy of each leaf ${\sum}_{i=1}^{k}{\pi}_{l,i}\ln \left(\frac{1}{{\pi}_{l,i}}\right)$ is non-negative. We next prove the upper-bound.

$$\begin{aligned}
{G}_{t}^{e}&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}\ln \left(\frac{1}{{\pi}_{l,i}}\right)\\
&\le \sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\ln k\le w\ln k\sum _{l\in {\mathcal{L}}_{t}}1\\
&=(t+1)w\ln k,
\end{aligned}$$

where the first inequality comes from the fact that uniform distribution maximizes the entropy, and the last equality comes from the fact that a tree with t internal nodes has $t+1$ leaves (also recall that w is the weight of the heaviest node in the tree at time t which is what we will also use in the next lemmas). □

Before proceeding to the actual proof of Lemma 6 we first introduce the helpful result captured in Lemma A1 and Corollary A1.

**Lemma A1** (The inequality between Euclidean and arithmetic mean)**.**

Let ${x}_{1},\cdots ,{x}_{k}$ be a set of non-negative numbers. Then Euclidean mean upper-bounds the arithmetic mean as follows $\sqrt{\frac{{\sum}_{i=1}^{k}{x}_{i}^{2}}{k}}\ge \frac{{\sum}_{i=1}^{k}{x}_{i}}{k}$.

**Corollary A1.**

Let $\{{x}_{1},\cdots ,{x}_{k}\}$ be non-negative. Then ${\sum}_{i=1}^{k}{x}_{i}^{2}\ge \frac{1}{k}{\left({\sum}_{i=1}^{k}{x}_{i}\right)}^{2}$.

**Proof.**

By Lemma A1 we have $\sqrt{\frac{{\sum}_{i=1}^{k}{x}_{i}^{2}}{k}}\ge \frac{{\sum}_{i=1}^{k}{x}_{i}}{k}\iff {\sum}_{i=1}^{k}{x}_{i}^{2}\ge \frac{1}{k}{\left({\sum}_{i=1}^{k}{x}_{i}\right)}^{2}$. □

**Proof of Lemma 6.**

The lower-bound is straightforward since all ${\pi}_{l,i}$’s are non-negative. The upper-bound can be shown as follows (the last inequality results from Corollary A1):

$$\begin{aligned}
{G}_{t}^{g}&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}(1-{\pi}_{l,i})\\
&\le w\sum _{l\in {\mathcal{L}}_{t}}\sum _{i=1}^{k}({\pi}_{l,i}-{\pi}_{l,i}^{2})=w\sum _{l\in {\mathcal{L}}_{t}}\left(1-\sum _{i=1}^{k}{\pi}_{l,i}^{2}\right)\\
&\le w\sum _{l\in {\mathcal{L}}_{t}}\left(1-\frac{1}{k}{\left(\sum _{i=1}^{k}{\pi}_{l,i}\right)}^{2}\right)=w\sum _{l\in {\mathcal{L}}_{t}}\left(1-\frac{1}{k}\right)\\
&=(t+1)w\left(1-\frac{1}{k}\right).
\end{aligned}$$

□

**Proof of Lemma 7.**

The lower-bound can be shown as follows. Recall that the function ${\sum}_{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}$ is concave and therefore it is certainly minimized on the extremes of the $[0,1]$ interval, meaning where each ${\pi}_{l,i}$ is either 0 or 1. Let ${I}_{0}=\{i:{\pi}_{l,i}=0\}$ and let ${I}_{1}=\{i:{\pi}_{l,i}=1\}$. Thus ${\sum}_{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}={\sum}_{i\in {I}_{1}}\sqrt{\mathcal{C}-1}\ge \sqrt{\mathcal{C}-1}$. Combining this result with the fact that ${\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}=1$ gives the lower-bound. We next prove the upper-bound. Recall that Lemma A1 implies that $({\sum}_{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})})/k\le \sqrt{({\sum}_{i=1}^{k}{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i}))/k}$, thus

$$\begin{aligned}
{G}_{t}^{m}&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}\\
&\le \sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sqrt{k\sum _{i=1}^{k}{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}\\
&=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sqrt{k\mathcal{C}-{k}^{2}\sum _{i=1}^{k}\frac{1}{k}{\pi}_{l,i}^{2}}.
\end{aligned}$$

By Jensen’s inequality ${\sum}_{i=1}^{k}\frac{1}{k}{\pi}_{l,i}^{2}\ge {({\sum}_{i=1}^{k}\frac{1}{k}{\pi}_{l,i})}^{2}=\frac{1}{{k}^{2}}$. Thus

$${G}_{t}^{m}\le \sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sqrt{k\mathcal{C}-1}\le (t+1)w\sqrt{k\mathcal{C}-1}.$$

□
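The bounds of Lemmas 5–7 can be checked on synthetic leaves. The following minimal sketch computes ${G}_{t}^{e}$, ${G}_{t}^{g}$, and ${G}_{t}^{m}$ from leaf weights and leaf class distributions and compares them with $(t+1)w\ln k$, $(t+1)w(1-\frac{1}{k})$, and $(t+1)w\sqrt{k\mathcal{C}-1}$. The helper names, the random leaf generator, and the default $\mathcal{C}=2$ are assumptions chosen for illustration.

```python
import math
import random


def random_distribution(k: int, rng: random.Random) -> list:
    """Random probability vector of length k (hypothetical helper)."""
    raw = [1.0 - rng.random() for _ in range(k)]  # values in (0, 1]
    total = sum(raw)
    return [r / total for r in raw]


def criteria(weights, leaves, C: float = 2.0):
    """Return (G^e, G^g, G^m) for leaf weights w_l and leaf distributions pi_l."""
    ge = sum(w * sum(q * math.log(1.0 / q) for q in p if q > 0.0)
             for w, p in zip(weights, leaves))
    gg = sum(w * sum(q * (1.0 - q) for q in p) for w, p in zip(weights, leaves))
    gm = sum(w * sum(math.sqrt(q * (C - q)) for q in p)
             for w, p in zip(weights, leaves))
    return ge, gg, gm


def upper_bounds(weights, k: int, C: float = 2.0):
    """Upper bounds (t+1) w ln k, (t+1) w (1 - 1/k), (t+1) w sqrt(kC - 1)."""
    scale = len(weights) * max(weights)  # (t + 1) * w, w = heaviest leaf weight
    return (scale * math.log(k),
            scale * (1.0 - 1.0 / k),
            scale * math.sqrt(k * C - 1.0))


if __name__ == "__main__":
    rng = random.Random(0)
    k, num_leaves = 5, 8
    weights = random_distribution(num_leaves, rng)
    leaves = [random_distribution(k, rng) for _ in range(num_leaves)]
    print(criteria(weights, leaves))
    print(upper_bounds(weights, k))
```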

**Lemma A2** (Lemma 14 in [38])**.**

If the function $\Phi (\mathit{\pi})$ is twice differentiable, then the sufficient condition for strong concavity of Φ is that for all $\mathit{\pi}$, $\mathit{x}$, $\langle {\nabla}^{2}\Phi (\mathit{\pi})\mathit{x},\mathit{x}\rangle \le -\sigma {\parallel \mathit{x}\parallel}^{2}$, where ${\nabla}^{2}\Phi (\mathit{\pi})$ is the Hessian matrix of Φ at $\mathit{\pi}$, and $\sigma >0$ is the strong concavity modulus.

**Proof of Lemma 9.**

Note that $\langle {\nabla}^{2}{\tilde{G}}^{g}(\mathit{\pi})\mathit{x},\mathit{x}\rangle \le -2{\parallel \mathit{x}\parallel}_{2}^{2}$, and apply Lemma A2. □

**Lemma A3** (Remark 2.2.4. in [39])**.**

The sum of strongly concave functions on ${\mathbb{R}}^{n}$ with modulus σ is strongly concave with the same modulus.

**Proof of Lemma 10.**

Consider functions $g({\pi}_{i})=\sqrt{f({\pi}_{i})}$, where $f({\pi}_{i})={\pi}_{i}(\mathcal{C}-{\pi}_{i})$, $\mathcal{C}\ge 2$, and ${\pi}_{i}\in [0,1]$. Also let $h(x)=\sqrt{x}$, where $x\in [0,\frac{{\mathcal{C}}^{2}}{4}]$. It is easy to see, using Lemma A2, that function f is strongly concave with respect to ${l}_{2}$-norm with modulus 2, thus

$$f(\theta {\pi}_{i}^{\prime}+(1-\theta ){\pi}_{i}^{\prime \prime})\ge \theta f({\pi}_{i}^{\prime})+(1-\theta )f({\pi}_{i}^{\prime \prime})+\theta (1-\theta ){\parallel {\pi}_{i}^{\prime}-{\pi}_{i}^{\prime \prime}\parallel}_{2}^{2},\tag{A1}$$

where ${\pi}_{i}^{\prime},{\pi}_{i}^{\prime \prime}\in [0,1]$ and $\theta \in [0,1]$. Also note that h is strongly concave with modulus $\frac{2}{{\mathcal{C}}^{3}}$ in its domain $[0,\frac{{\mathcal{C}}^{2}}{4}]$ (the second derivative of h is ${h}^{\prime \prime}(x)=-\frac{1}{4\sqrt{{x}^{3}}}\le -\frac{2}{{\mathcal{C}}^{3}}$). The strong concavity of h implies that

$$\sqrt{\theta {x}_{1}+(1-\theta ){x}_{2}}\ge \theta \sqrt{{x}_{1}}+(1-\theta )\sqrt{{x}_{2}}+\frac{1}{{\mathcal{C}}^{3}}\theta (1-\theta ){\parallel {x}_{1}-{x}_{2}\parallel}_{2}^{2},$$

where ${x}_{1},{x}_{2}\in [0,\frac{{\mathcal{C}}^{2}}{4}]$. Let ${x}_{1}=f({\pi}_{i}^{\prime})$ and ${x}_{2}=f({\pi}_{i}^{\prime \prime})$. Then we obtain

$$\sqrt{\theta f({\pi}_{i}^{\prime})+(1-\theta )f({\pi}_{i}^{\prime \prime})}\ge \theta \sqrt{f({\pi}_{i}^{\prime})}+(1-\theta )\sqrt{f({\pi}_{i}^{\prime \prime})}+\frac{1}{{\mathcal{C}}^{3}}\theta (1-\theta ){\parallel f({\pi}_{i}^{\prime})-f({\pi}_{i}^{\prime \prime})\parallel}_{2}^{2}.\tag{A2}$$

Note that

$$\begin{aligned}
&\sqrt{f(\theta {\pi}_{i}^{\prime}+(1-\theta ){\pi}_{i}^{\prime \prime})}\\
&\quad \ge \sqrt{f(\theta {\pi}_{i}^{\prime}+(1-\theta ){\pi}_{i}^{\prime \prime})-\theta (1-\theta ){\parallel {\pi}_{i}^{\prime}-{\pi}_{i}^{\prime \prime}\parallel}_{2}^{2}}\\
&\quad \ge \sqrt{\theta f({\pi}_{i}^{\prime})+(1-\theta )f({\pi}_{i}^{\prime \prime})}\\
&\quad \ge \theta \sqrt{f({\pi}_{i}^{\prime})}+(1-\theta )\sqrt{f({\pi}_{i}^{\prime \prime})}+\frac{1}{{\mathcal{C}}^{3}}\theta (1-\theta ){\parallel f({\pi}_{i}^{\prime})-f({\pi}_{i}^{\prime \prime})\parallel}_{2}^{2},
\end{aligned}$$

where the second inequality results from Equation (A1) and the last (third) inequality results from Equation (A2). Finally note that the first derivative of f is ${f}^{\prime}({\pi}_{i})=\mathcal{C}-2{\pi}_{i}\in [\mathcal{C}-2,\mathcal{C}]$. Thus

$$\frac{|f({\pi}_{i}^{\prime})-f({\pi}_{i}^{\prime \prime})|}{|{\pi}_{i}^{\prime}-{\pi}_{i}^{\prime \prime}|}\ge \mathcal{C}-2\iff {\parallel f({\pi}_{i}^{\prime})-f({\pi}_{i}^{\prime \prime})\parallel}^{2}\ge {(\mathcal{C}-2)}^{2}{\parallel {\pi}_{i}^{\prime}-{\pi}_{i}^{\prime \prime}\parallel}^{2},$$

and combining this result with the previous statement yields

$$\sqrt{f(\theta {\pi}_{i}^{\prime}+(1-\theta ){\pi}_{i}^{\prime \prime})}\ge \theta \sqrt{f({\pi}_{i}^{\prime})}+(1-\theta )\sqrt{f({\pi}_{i}^{\prime \prime})}+\frac{{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}\theta (1-\theta ){\parallel {\pi}_{i}^{\prime}-{\pi}_{i}^{\prime \prime}\parallel}^{2},$$

thus $g({\pi}_{i})$ is strongly concave with modulus $\frac{2{(\mathcal{C}-2)}^{2}}{{\mathcal{C}}^{3}}$. By Lemma A3, ${\tilde{G}}^{m}(\mathit{\pi})$ is also strongly concave with the same modulus. □
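The final strong-concavity inequality for $g({\pi}_{i})=\sqrt{{\pi}_{i}(\mathcal{C}-{\pi}_{i})}$ can be spot-checked numerically. The sketch below samples random ${\pi}_{i}^{\prime}$, ${\pi}_{i}^{\prime \prime}$, and $\theta $ in $[0,1]$ and reports the smallest observed gap between the two sides. The function names and the sampling scheme are assumptions for illustration; a non-negative minimum is consistent with the lemma but does not replace the proof.

```python
import math
import random


def g(p: float, C: float) -> float:
    """g(pi_i) = sqrt(pi_i * (C - pi_i)) for pi_i in [0, 1] and C >= 2."""
    return math.sqrt(p * (C - p))


def concavity_gap(a: float, b: float, theta: float, C: float) -> float:
    """Left-hand side minus right-hand side of the strong-concavity inequality."""
    coeff = (C - 2.0) ** 2 / C ** 3
    mix = theta * a + (1.0 - theta) * b
    rhs = (theta * g(a, C) + (1.0 - theta) * g(b, C)
           + coeff * theta * (1.0 - theta) * (a - b) ** 2)
    return g(mix, C) - rhs


def min_gap(C: float, trials: int = 10000, seed: int = 0) -> float:
    """Smallest gap over random draws of (pi', pi'', theta) in [0, 1]^3."""
    rng = random.Random(seed)
    return min(
        concavity_gap(rng.random(), rng.random(), rng.random(), C)
        for _ in range(trials)
    )


if __name__ == "__main__":
    for C in (2.0, 2.5, 4.0):
        print(C, min_gap(C))
```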

The next two lemmas are fundamental and they are used in the proof of Lemma 4 and the boosting theorems. The first one relates ${l}_{1}$-norm and ${l}_{2}$-norm and the second one is a simple property of the exponential function.

**Lemma A4.**

Let $x\in {\mathbb{R}}^{k}$ then ${\parallel x\parallel}_{1}\le \sqrt{k}{\parallel x\parallel}_{2}$.

**Lemma A5.**

For $x\ge 1$ the following holds ${\left(1-\frac{1}{x}\right)}^{x}\le \frac{1}{e}$.
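Both lemmas admit short standard arguments, sketched here for completeness; these are textbook derivations and are not taken from the paper. Lemma A4 is the Cauchy–Schwarz inequality applied to the vectors $(|{x}_{1}|,\cdots ,|{x}_{k}|)$ and $(1,\cdots ,1)$:

$${\parallel x\parallel}_{1}=\sum _{i=1}^{k}|{x}_{i}|\cdot 1\le \sqrt{\sum _{i=1}^{k}{x}_{i}^{2}}\sqrt{\sum _{i=1}^{k}1}=\sqrt{k}{\parallel x\parallel}_{2}.$$

Lemma A5 follows from $\ln (1-u)\le -u$ for $u\in [0,1)$, applied with $u=\frac{1}{x}$ and $x>1$:

$$x\ln \left(1-\frac{1}{x}\right)\le x\cdot \left(-\frac{1}{x}\right)=-1\quad \Longrightarrow \quad {\left(1-\frac{1}{x}\right)}^{x}\le {e}^{-1},$$

and for $x=1$ the left-hand side of the lemma equals 0.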

## References

- Rifkin, R.; Klautau, A. In Defense of One-Vs-All Classification. J. Mach. Learn. Res.
**2004**, 5, 101–141. [Google Scholar] - Daume, H.; Karampatziakis, N.; Langford, J.; Mineiro, P. Logarithmic Time One-Against-Some. arXiv, 2016; arXiv:1606.04988. [Google Scholar]
- Choromanska, A.; Langford, J. Logarithmic Time Online Multiclass prediction. In Neural Information Processing Systems 2015; Neural Information Processing Systems Foundation, Inc.: Vancouver, BC, Canada, 2015. [Google Scholar]
- Schapire, R.E.; Freund, Y. Boosting: Foundations and Algorithms; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
- Mukherjee, I.; Schapire, R.E. A theory of multiclass boosting. J. Mach. Learn. Res.
**2013**, 14, 437–497. [Google Scholar] - Beygelzimer, A.; Langford, J.; Ravikumar, P.D. Error-Correcting Tournaments. In Algorithmic Learning Theory; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
- Takimoto, E.; Maruoka, A. Top-down Decision Tree Learning As Information Based Boosting. Theor. Comput. Sci.
**2003**, 292, 447–464. [Google Scholar] [CrossRef] - Morin, F.; Bengio, Y. Hierarchical probabilistic neural network language model. Aistats
**2005**, 5, 246–252. [Google Scholar] - Bengio, S.; Weston, J.; Grangier, D. Label Embedding Trees for Large Multi-Class Tasks. In Advances in Neural Information Processing Systems 23 (NIPS 2010); NIPS: Vancouver, BC, Canada, 2010. [Google Scholar]
- Utgoff, P.E. Incremental Induction of Decision Trees. Mach. Learn.
**1989**, 4, 161–186. [Google Scholar] [CrossRef][Green Version] - Domingos, P.; Hulten, G. Mining High-speed Data Streams; KDD: Boston, MA, USA, 2000. [Google Scholar]
- Gama, J.; Rocha, R.; Medas, P. Accurate Decision Trees for Mining High-speed Data Streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003. [Google Scholar]
- Beygelzimer, A.; Langford, J.; Lifshits, Y.; Sorkin, G.B.; Strehl, A.L. Conditional Probability Tree Estimation Analysis and Algorithms. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 June 2009. [Google Scholar]
- Madzarov, G.; Gjorgjevikj, D.; Chorbev, I. A Multi-class SVM Classifier Utilizing Binary Decision Tree. Informatica
**2009**, 33, 225–233. [Google Scholar] - Weston, J.; Makadia, A.; Yee, H. Label Partitioning For Sublinear Ranking. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
- Deng, J.; Satheesh, S.; Berg, A.C.; Fei-Fei, L. Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition. In Advances in Neural Information Processing Systems 24 (NIPS 2011); NIPS: Vancouver, BC, Canada, 2011. [Google Scholar]
- Zhao, B.; Xing, E.P. Sparse Output Coding for Large-Scale Visual Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar]
- Hsu, D.; Kakade, S.; Langford, J.; Zhang, T. Multi-Label Prediction via Compressed Sensing. In Advances in Neural Information Processing Systems 22 (NIPS 2009); NIPS: Vancouver, BC, Canada, 2009. [Google Scholar]
- Agarwal, A.; Kakade, S.M.; Karampatziakis, N.; Song, L.; Valiant, G. Least Squares Revisited: Scalable Approaches for Multi-class Prediction. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 21–26 June 2014. [Google Scholar]
- Beijbom, O.; Saberian, M.; Kriegman, D.; Vasconcelos, N. Guess-Averse Loss Functions For Cost-Sensitive Multiclass Boosting. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 21–26 June 2014. [Google Scholar]
- Jernite, Y.; Choromanska, A.; Sontag, D. Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation. arXiv, 2017; arXiv:1610.04658. [Google Scholar]
- Mnih, A.; Hinton, G.E. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21 (NIPS 2008); NIPS: Vancouver, BC, Canada, 2009. [Google Scholar]
- Djuric, N.; Wu, H.; Radosavljevic, V.; Grbovic, M.; Bhamidipati, N. Hierarchical Neural Language Models for Joint Representation of Streaming Documents and their Content. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013); NIPS: Vancouver, BC, Canada, 2013. [Google Scholar]
- Kearns, M.; Mansour, Y. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (STOC ’96), Philadelphia, PA, USA, 22–24 May 1996. Reprinted in J. Comput. Syst. Sci. **1999**, 58, 109–128.
- Breiman, L. Classification and Regression Trees; Routledge: Abingdon, UK, 2017.
- Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014.
- Liu, W.; Tsang, I.W. Making decision trees feasible in ultrahigh feature and label dimensions. J. Mach. Learn. Res. **2017**, 18, 2814–2849.
- Muñoz, E.; Nováček, V.; Vandenbussche, P.Y. Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models. Brief. Bioinform. **2017**, 20, 190–202.
- Charte, F.; Rivera, A.J.; del Jesus, M.J.; Herrera, F. REMEDIAL-HwR: Tackling multilabel imbalance through label decoupling and data resampling hybridization. Neurocomputing **2019**, 326, 110–122.
- Koster, C.H.; Seutter, M.; Beney, J. Multi-classification of patent applications with Winnow. In International Andrei Ershov Memorial Conference on Perspectives of System Informatics; Springer: Berlin/Heidelberg, Germany, 2003; pp. 546–555.
- Liu, W.; Tsang, I.W.; Müller, K.R. An easy-to-hard learning paradigm for multiple classes and multiple labels. J. Mach. Learn. Res. **2017**, 18, 3300–3337.
- Liu, W.; Xu, D.; Tsang, I.W.; Zhang, W. Metric learning for multi-output tasks. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 41, 408–422.
- Petersen, N.C.; Rodrigues, F.; Pereira, F.C. Multi-output bus travel time prediction with convolutional LSTM neural network. Expert Syst. Appl. **2019**, 120, 426–435.
- Langford, J.; Li, L.; Strehl, A. Vowpal Wabbit (Fast Learning). 2007. Available online: http://hunch.net/~vw (accessed on 2 February 2019).
- Bottou, L. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks; Cambridge University Press: New York, NY, USA, 1998.
- Shalev-Shwartz, S. Online Learning and Online Convex Optimization. Found. Trends Mach. Learn. **2012**, 4, 107–194.
- Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. Ph.D. Thesis, The Hebrew University of Jerusalem, Jerusalem, Israel, 2007.
- Zhukovskiy, V. Lyapunov Functions in Differential Games; Stability and Control: Theory, Methods and Applications; Taylor & Francis: London, UK, 2003.

**Figure 1.** **Red partition**: highly balanced but impure split (the partition cuts through the black and green classes). **Green partition**: highly balanced and highly pure split. Figure should be read in color.

**Figure 2.** **Left**: The blue curve shows the upper bound on the balancing factor as a function of $J(h)$, the red curve shows the lower bound on the balancing factor as a function of $J(h)$, and the green intervals mark where the balancing factor lies for different values of $J(h)$. **Right**: The red line shows the upper bound on the purity factor as a function of $J(h)$ when the balancing factor is fixed to $\frac{1}{2}$. Figure should be read in color.

**Figure 3.** Functions ${G}_{*}^{e}({\pi}_{1})={\tilde{G}}^{e}({\pi}_{1})/\ln 2=\left({\pi}_{1}\ln\left(\frac{1}{{\pi}_{1}}\right)+(1-{\pi}_{1})\ln\left(\frac{1}{1-{\pi}_{1}}\right)\right)/\ln 2$, ${G}_{*}^{g}({\pi}_{1})=2{\tilde{G}}^{g}({\pi}_{1})=4{\pi}_{1}(1-{\pi}_{1})$, and ${G}_{*}^{m}({\pi}_{1})=\left({\tilde{G}}^{m}({\pi}_{1})-\sqrt{\mathcal{C}-1}\right)/\left(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1}\right)=\left(\sqrt{{\pi}_{1}(\mathcal{C}-{\pi}_{1})}+\sqrt{(1-{\pi}_{1})(\mathcal{C}-1+{\pi}_{1})}-\sqrt{\mathcal{C}-1}\right)/\left(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1}\right)$, plotted as functions of ${\pi}_{1}$ (the functions ${\tilde{G}}^{e}({\pi}_{1})$, ${\tilde{G}}^{g}({\pi}_{1})$, and ${\tilde{G}}^{m}({\pi}_{1})$ were re-scaled to take values in $[0,1]$). Figure should be read in color.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).