Extreme Multiclass Classification Criteria

We analyze the theoretical properties of the recently proposed objective function for efficient online construction and training of multiclass classification trees in settings where the label space is very large. We establish important properties of this objective and provide a complete proof that maximizing it simultaneously encourages balanced trees and improves the purity of the class distributions at subsequent levels in the tree. We further explore its connection to three well-known entropy-based decision tree criteria, i.e., Shannon entropy, Gini-entropy and its modified variant, for which efficient optimization strategies are largely unknown in the extreme multiclass setting. We show theoretically that this objective can be viewed as a surrogate function for all of these entropy criteria and that maximizing it indirectly optimizes them as well. We derive boosting guarantees and obtain a closed-form expression for the number of iterations needed to reduce the considered entropy criteria below an arbitrary threshold. The obtained theorem relies on a weak hypothesis assumption that directly depends on the considered objective function. Finally, we prove that optimizing the objective directly reduces the multi-class classification error of the decision tree.


Introduction
This paper focuses on the multiclass classification setting where the number of classes is very large. The recent widespread development of data-acquisition web services and devices has made large multiclass data sets commonplace. Straightforward extensions of binary approaches to the multiclass setting, such as the one-against-all approach [1], which for each data point computes a score for every class and returns the class with the maximum score, often fail under strict computational constraints, as their running time typically scales linearly with the number of labels k. On the other hand, the most computationally efficient approaches for multiclass classification achieve O(log k) train/test running time [2]. This running time can naturally be achieved by hierarchical classifiers that build a hierarchy over the labels.
This paper considers a hierarchical multiclass decision tree structure, where each node of the tree contains a binary classifier h from some hypothesis class H that sends an example reaching that node to either the left (h(x) ≤ 0) or the right (h(x) > 0) child node, depending on the sign of h(x) (each node has its own splitting hypothesis). A test example descends from the root to a leaf of the tree guided by the classifiers on its path, and is assigned the label with the highest frequency amongst the training examples that reached that leaf. The tree is constructed and trained in a top-down fashion, where the data split in every node of the tree is obtained by maximizing the following objective function, recently introduced in the literature [3] along with an algorithm, called LOMtree, that optimizes it in an online fashion (we refer the reader to the referenced paper for the algorithm's details):

J(h) = 2 ∑_{i=1}^k π_i |P(h(x) > 0) − P(h(x) > 0 | i)|,    (1)

where x ∈ X ⊆ R^d are the data points (each with a label from the set {1, 2, ..., k}), π_i denotes the proportion of label i amongst the examples reaching a node, and the probabilities P(h(x) > 0) and P(h(x) > 0|i) denote the fraction of examples reaching the node for which h(x) > 0, marginally and conditionally on class i, respectively. The objective measures the dependence between the split and the class distribution. Note that J(h) ∈ [0, 1] and, as its form implies, maximizing it encourages the fraction of examples of class i going to the right to be substantially different from the background fraction, for each class i. Thus for a balanced split (i.e., P(h(x) > 0) = 0.5), the examples of class i are encouraged to be sent exclusively to the left (P(h(x) > 0|i) = 0) or to the right (P(h(x) > 0|i) = 1), refining the purity of the class distributions at subsequent levels of the tree. The LOMtree algorithm effectively maximizes this objective over hypotheses h ∈ H in an online fashion with stochastic gradient descent (SGD) and obtains good-quality multiclass tree predictors with logarithmic train and test running times. Despite that, this objective and its properties (including its relation to the more standard entropy criteria) remain poorly understood. This paper provides an exhaustive analysis of it.
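As a concrete illustration, the objective from [3], J(h) = 2 ∑_i π_i |P(h(x) > 0) − P(h(x) > 0|i)|, can be estimated from the sample reaching a node. The sketch below is an illustrative helper (the function name is ours, not part of [3]):

```python
import numpy as np

def objective_J(labels, h_positive):
    """Empirical J(h) = 2 * sum_i pi_i * |P(h>0) - P(h>0|i)| at one node.

    labels:     array of class labels of the examples reaching the node
    h_positive: boolean array, True where h(x) > 0 (example goes right)
    """
    labels = np.asarray(labels)
    h_positive = np.asarray(h_positive, dtype=bool)
    p_right = h_positive.mean()                 # P(h(x) > 0), the marginal
    J = 0.0
    for c in np.unique(labels):
        mask = labels == c
        pi_c = mask.mean()                      # pi_i, class proportion
        p_right_c = h_positive[mask].mean()     # P(h(x) > 0 | i)
        J += 2.0 * pi_c * abs(p_right - p_right_c)
    return J
```

For a perfectly pure and balanced split (half the data goes right, and every class is sent entirely to one side) this returns 1; for a split independent of the labels it returns 0.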
Our contributions are the following:

•
We provide an extensive theoretical analysis of the properties of the considered objective and prove that maximizing this objective in any tree node simultaneously encourages a balanced partition of the data in that node and improves the purity of the class distributions at its children nodes.

•
We show a formal relation of this objective to more standard entropy-based objectives, i.e., Shannon entropy, Gini-entropy, and its modified variant, for which online optimization schemes in the context of multiclass classification are largely unknown. In particular, we show that i) the improvement in the value of the entropy resulting from a node split is lower-bounded by an expression that increases with the value of the objective, and thus ii) the considered objective can be used as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria.

•
We present three boosting theorems, one for each of the three entropy criteria, which provide the number of iterations needed to reduce each of them below an arbitrary threshold. Their weak hypothesis assumptions rely on the considered objective function.

•
We establish an error bound that relates maximizing the objective function to reducing the multi-class classification error.

•
Finally, in Appendix A we establish an empirical connection between the multiclass classification error and the entropy criteria, and show that the Gini-entropy most closely resembles the behavior of the test error in practice.
The main theoretical analysis of this paper is kept in the boosting framework [4] and relies on the assumption that the objective function can be weakly optimized in the internal nodes of the tree. This weak advantage is amplified in the tree, leading to hierarchies achieving any desired level of entropy (either Shannon entropy, Gini-entropy, or its modified variant). Our work adds new theoretical results to the theory of multiclass boosting. Note that multiclass boosting is poorly understood from the theoretical perspective [5] (we refer the reader to [5] for a comprehensive review of the theory of multiclass boosting).
The paper is organized as follows: related literature is discussed in Section 2, the theoretical properties of the objective J(h) are shown in Section 3, the main theoretical results are presented in Section 4, and finally the mathematical properties of the entropy criteria and the proofs of the main theoretical results are provided in Section 5. Conclusions (Section 6) end the paper. Appendix A contains basic numerical experiments (Appendix A.1) and additional proofs (Appendix A.2).

Related Work
The extreme multiclass classification problem has been addressed in the literature in different ways. We discuss them here, putting emphasis on those that build hierarchical predictors, as these techniques are the most relevant to this paper. Only a few authors [2,3,6-8] simultaneously address logarithmic-time training and testing. The methods they propose are either hard to apply in practical problems [7] or use fixed tree structures [6,8]. Furthermore, an alternative approach based on a random tree structure was shown to potentially lead to considerable underperformance [3,9]. At the same time, for massive datasets making multiple passes through the data is computationally costly, which justifies the need for developing online approaches, where the algorithm streams over a potentially infinitely large data set (online approaches are also plausible for non-stationary problems). It is unclear how to optimize standard decision tree objectives, such as Shannon or Gini-entropy, in this setting (an early attempt for Shannon entropy was recently proposed [2]). One of the prior works to this paper [3] introduces an objective function that enjoys certain advantages over entropy criteria; in particular, it can be easily and efficiently optimized online. The authors, however, present an incomplete theoretical analysis and leave a number of open questions, which this paper aims to address. The algorithms for incremental learning of classification with decision trees also include some older works [10-12], which split a node according to the outcome of a split-test based on the values of selected attributes of the data examples reaching that node. These approaches differ from the one in this paper, where the node split is performed according to the value of a learned (e.g., with SGD) hypothesis computed on the entire attribute vector of the data examples reaching that node.
Other tree-based approaches include conditional probability trees [13] and clustering methods [9,14,15] ([9] was later improved in [16]), but their training time can be linear in the label complexity. The remaining techniques for multiclass classification include sparse output coding [17], variants of error-correcting output codes [18], variants of iterative least-squares [19], and a method based on guess-averse loss functions [20].
Finally, note that the conditional density estimation problem is also challenging in the large-class setting and in this respect remains parallel to the extreme multiclass classification problem [21]. In the context of conditional density estimation, there have also been works that use tree-structured models to accelerate computation of the likelihood and gradients [8,22-24]. They typically use heuristics based on ontologies [8], Huffman coding [24], and various other mechanisms.

Theoretical Properties of the Objective Function
In this section we describe the objective function introduced in Equation (1) and provide its theoretical properties. The proofs are deferred to the Appendix. We first introduce the definitions of balancedness and purity of a node split.

Definition 1 (Purity and balancedness).
The hypothesis h ∈ H induces a pure split if α := ∑_{i=1}^k π_i min(P(h(x) > 0|i), P(h(x) < 0|i)) ≤ δ, where δ ∈ [0, 0.5); α is called the purity factor. The hypothesis h ∈ H induces a balanced split if c ≤ β := P(h(x) > 0) ≤ 1 − c, where c ∈ (0, 0.5]; β is called the balancing factor. A partition is perfectly pure if α = 0 (examples of the same class are sent exclusively to the left or to the right). A partition is perfectly balanced if β = 0.5 (equal numbers of examples are sent to the left and to the right). The notions of balancedness and purity are illustrated in Figure 1, which shows that the purity criterion helps to refine the choice of the splitting hypothesis from among well-balanced candidates.
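Both factors of Definition 1 can be estimated from the sample at a node; the helpers below (illustrative names of our own) compute the purity factor α and the balancing factor β for a candidate split:

```python
import numpy as np

def purity_factor(labels, h_positive):
    """alpha = sum_i pi_i * min(P(h>0|i), P(h<=0|i)); 0 means perfectly pure."""
    labels = np.asarray(labels)
    h_positive = np.asarray(h_positive, dtype=bool)
    alpha = 0.0
    for c in np.unique(labels):
        mask = labels == c
        p = h_positive[mask].mean()         # P(h(x) > 0 | class c)
        alpha += mask.mean() * min(p, 1.0 - p)
    return alpha

def balancing_factor(h_positive):
    """beta = P(h(x) > 0); 0.5 means perfectly balanced."""
    return float(np.asarray(h_positive, dtype=bool).mean())
```

A split that sends each class entirely to one side while keeping half the data on each side yields α = 0 and β = 0.5; a label-independent coin-flip split yields α = 0.5.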
Next, we show the first theoretical property of the objective function J(h) that characterizes its behavior at the optimum (J(h) = 1).

Lemma 1. The hypothesis h ∈ H induces a perfectly pure and balanced partition if and only if J(h) = 1.
For some data sets, however, there exist no hypotheses producing perfectly pure and balanced splits. We next show that increasing the value of the objective leads to more balanced splits.

Lemma 2. For any hypothesis h ∈ H and any distribution over data examples, the balancing factor β satisfies (1 − √(1 − J(h)))/2 ≤ β ≤ (1 + √(1 − J(h)))/2.
We refer to the interval to which β belongs as the β-interval. Thus the larger (closer to 1) the value of J(h), the narrower the β-interval, leading to more balanced splits at the extremes of this interval (β closer to 0.5).
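Concretely, the endpoints of the β-interval come from solving the quadratic condition −4β² + 4β − J(h) ≥ 0 that appears in the proof of Lemma 2; a minimal sketch:

```python
import math

def beta_interval(J):
    """Endpoints of the beta-interval: solving -4*beta**2 + 4*beta - J >= 0
    (cf. the proof of Lemma 2) gives
    beta in [(1 - sqrt(1 - J)) / 2, (1 + sqrt(1 - J)) / 2]."""
    r = math.sqrt(max(0.0, 1.0 - J))
    return (1.0 - r) / 2.0, (1.0 + r) / 2.0
```

For example, `beta_interval(0.0)` gives the vacuous interval (0, 1), while `beta_interval(1.0)` pins β to exactly 0.5, i.e., a perfectly balanced split.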
This result, combined with the next lemma, implies that at the extremes of the β-interval the upper bound on the purity factor decreases as the value of J(h) increases (since J(h) gets closer to 1 and the balancing factor β gets closer to 0.5 at the extremes of the β-interval). The recovered splits therefore have better purity (α closer to 0).
We have thus shown that maximizing the objective in Equation (1) in each tree node simultaneously encourages trees that are balanced and whose class distributions gradually become purer when moving from the root to subsequent tree levels. Lemmas 2 and 3 are illustrated in Figure 2.
In the next section we show that the objective J(h) is related to the more standard entropy-based decision tree objectives and that maximizing it leads to the reduction of these criteria. We consider three different entropy criteria in this paper. The theoretical analysis relies on the boosting framework and depends on the weak learning assumption. The three entropy-based criteria lead to three different theoretical statements, where we bound the number of splits required to reduce the value of the criterion below a given level. The bounds we obtain, and their dependence on the number of classes k, critically depend on the strong concavity properties of the considered entropy-based objectives.

Notation
We first introduce notation. Let T denote the tree under consideration. π_{l,i} denotes the probability that a randomly chosen data point x drawn from P, where P is a fixed target distribution over X, has label i given that x reaches node l (note that ∑_{i=1}^k π_{l,i} = 1); t denotes the number of internal tree nodes; L_t denotes the set of all tree leaves at time t; and w_l is the weight of leaf l, defined as the probability that a randomly chosen x drawn from P reaches leaf l (note that ∑_{l∈L_t} w_l = 1). We study a tree construction algorithm that recursively finds the leaf node with the highest weight and splits it into two children. Consider the tree constructed over t steps, where in each step we take one leaf node and split it (thus the number of splits equals the number of internal nodes of the tree); t = 1 corresponds to splitting the root, so at this step the tree consists of one internal node (the root) and its two children (leaves). We measure the quality of the tree at any given time t with three different entropy criteria:

G^e_t = ∑_{l∈L_t} w_l ∑_{i=1}^k π_{l,i} ln(1/π_{l,i})   (Shannon entropy),
G^g_t = ∑_{l∈L_t} w_l ∑_{i=1}^k π_{l,i}(1 − π_{l,i})   (Gini-entropy),
G^m_t = ∑_{l∈L_t} w_l √(∑_{i=1}^k π_{l,i}(C − π_{l,i}))   (modified Gini-entropy),

where C is a constant such that C > 2.
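The tree construction scheme described above (repeatedly splitting the heaviest leaf) can be sketched with a max-heap. Here `split_leaf` is a hypothetical user-supplied routine standing in for training a node splitter and measuring the traffic fractions of the two children:

```python
import heapq

def grow_tree(split_leaf, root_weight=1.0, num_splits=5):
    """Repeatedly split the heaviest leaf. `split_leaf(leaf_id, w)` must
    return the two child weights (summing to w)."""
    # Python's heapq is a min-heap, so store negated weights for a max-heap.
    heap = [(-root_weight, 0)]
    next_id = 1
    for _ in range(num_splits):
        neg_w, leaf = heapq.heappop(heap)       # heaviest current leaf
        w_left, w_right = split_leaf(leaf, -neg_w)
        heapq.heappush(heap, (-w_left, next_id)); next_id += 1
        heapq.heappush(heap, (-w_right, next_id)); next_id += 1
    return sorted(-w for w, _ in heap)          # leaf weights, ascending
```

After t splits the tree has t + 1 leaves, and the leaf weights always sum to the root weight, matching the accounting used in the analysis.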
These criteria are natural extensions of the criteria used in the context of binary classification [25] to the multiclass classification setting (note that there is more than one way of extending the entropy-based criteria from [25] to the multiclass setting, e.g., the modified Gini-entropy admits alternative definitions; this and other extensions will be investigated in future work). We next present the main results of this paper, followed by their proofs. We begin by introducing the weak hypothesis assumption.
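The three criteria are straightforward to compute from the leaf weights and leaf class distributions. The sketch below assumes the standard multiclass definitions (Shannon entropy ∑_l w_l ∑_i π_{l,i} ln(1/π_{l,i}), Gini-entropy ∑_l w_l ∑_i π_{l,i}(1 − π_{l,i}), and modified Gini-entropy ∑_l w_l √(∑_i π_{l,i}(C − π_{l,i})) with C > 2), consistent with the quantities used in the proofs of Lemmas 5-7:

```python
import numpy as np

def entropy_criteria(leaf_weights, leaf_dists, C=4.0):
    """Return (G_e, G_g, G_m) for a tree given leaf weights w_l and
    per-leaf class distributions pi_l (each summing to 1)."""
    Ge = Gg = Gm = 0.0
    for w, pi in zip(leaf_weights, leaf_dists):
        pi = np.asarray(pi, dtype=float)
        nz = pi[pi > 0]                                   # 0 * ln(1/0) := 0
        Ge += w * float(np.sum(nz * np.log(1.0 / nz)))    # Shannon
        Gg += w * float(np.sum(pi * (1.0 - pi)))          # Gini
        Gm += w * float(np.sqrt(np.sum(pi * (C - pi))))   # modified Gini
    return Ge, Gg, Gm
```

A perfectly pure leaf contributes 0 to both G_e and G_g but, notably, a strictly positive amount √(C − 1) to G_m, which is why the modified criterion requires separate treatment in the analysis.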

Theorems
Definition 2 (Weak Hypothesis Assumption). Let m denote any internal node of the tree T, and let β_m = P(h_m(x) > 0) and P_{m,i} = P(h_m(x) > 0|i). Furthermore, let γ ∈ R_+ be such that for all m, γ ∈ (0, min(β_m, 1 − β_m)]. We say that the weak hypothesis assumption is satisfied when, for any distribution P over X, at each node m of the tree T there exists a hypothesis h_m ∈ H such that J(h_m)/2 ≥ γ. The weak hypothesis assumption says that in every node of the tree we are able to recover a hypothesis from H whose corresponding value of the objective is above 0 (thus the corresponding split is "weakly" pure and "weakly" balanced).
Consider next any time t, let n be the heaviest leaf at time t, which we split, and denote its weight w_n by w for brevity. Similarly, let h denote the regressor at node n (shorthand for h_n). We denote the differences between the contributions of node n to the values of the entropy-based objectives at times t and t + 1 by Δ^e_t, Δ^g_t, and Δ^m_t, respectively. Then the following lemma holds (the proof is provided in Section 5): Lemma 4. Under the Weak Hypothesis Assumption, the change in the entropies occurring due to the node split can be lower-bounded in terms of J(h). Clearly, maximizing the objective J(h) improves the entropy reduction. The considered objective can therefore be viewed as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria, for which efficient online optimization strategies are largely unknown but highly desired in the multiclass classification setting. To be more specific, standard packages for binary classification trees, such as CART [26] and C4.5 [27], run a brute-force search at every node of the tree over the set of all possible partitions to find the one that most improves the entropy-based criterion of interest [25]. This is prohibitive in the case of the extreme multiclass problem. J(h), however, can be efficiently optimized with SGD instead.
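To illustrate how J(h) lends itself to gradient-based optimization, the toy sketch below runs gradient ascent on a smoothed surrogate of J for a linear splitter h(x) = w·x, with P(h(x) > 0) relaxed to the mean of sigmoid(w·x). This is an illustrative stand-in under our own smoothing assumption, not a reproduction of the online LOMtree procedure of [3]:

```python
import numpy as np

def sgd_maximize_J(X, y, steps=500, lr=0.5, seed=0):
    """Gradient ascent on J~(w) = 2 sum_c pi_c |mean(s) - mean_c(s)|,
    where s = sigmoid(X @ w) smooths the indicator h(x) > 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.01, size=d)
    classes = np.unique(y)
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-X @ w))     # smoothed "goes right" scores
        ds = (s * (1 - s))[:, None] * X      # d(sigmoid)/dw, per example
        g_all, s_all = ds.mean(0), s.mean()
        grad = np.zeros(d)
        for c in classes:
            m = y == c
            diff = s_all - s[m].mean()       # marginal minus class mean
            grad += 2 * m.mean() * np.sign(diff) * (g_all - ds[m].mean(0))
        w += lr * grad                       # ascent: we maximize J~
    return w

def empirical_J(X, y, w):
    right = (X @ w) > 0
    p = right.mean()
    return sum(2 * (y == c).mean() * abs(p - right[y == c].mean())
               for c in np.unique(y))
```

On two well-separated classes this reliably drives the (unsmoothed) empirical J close to its maximum of 1, whereas a brute-force partition search over k classes would be exponential.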
We next state the three boosting results, captured in Theorems 1-3. They guarantee that the top-down decision tree algorithm that optimizes J(h) in each node amplifies the weak advantage, captured in the weak hypothesis assumption, to build a tree achieving any desired level of entropy (either Shannon entropy, Gini-entropy, or its modified variant). Theorem 2. Under the Weak Hypothesis Assumption, for any α ∈ [0, splits.
Theorem 3. Under the Weak Hypothesis Assumption, for any α ∈ Finally, we provide the error guarantee in Theorem 4. Denote by y(x) a fixed target function with domain X, which assigns each data point x its label, and let P be a fixed target distribution over X. Together, y and P induce a distribution on labeled pairs (x, y(x)). Let t(x) be the label assigned to data point x by the tree. We denote by ε(T) the error of tree T, i.e., ε(T) Remark 1. The main theorems show how fast the entropy criteria and the multi-class classification error drop as the tree grows and performs node splits. These statements therefore provide a platform for comparing different entropy criteria and answer two questions: 1) for fixed α, γ, C, and k, which criterion is reduced the most with each split? and 2) can the multi-class error match the convergence speed of the best entropy criterion? It can be noted that the Shannon entropy has the most advantageous dependence on the label complexity, since the bound scales only logarithmically with k, and thus achieves the fastest convergence. Simultaneously, the multi-class classification error matches this advantageous convergence rate and also scales favorably (logarithmically) with k. Finally, even though the weak hypothesis assumption requires only a slightly favorable γ, i.e., γ > 0, in practice when constructing the tree one optimizes J in every node, which effectively pushes γ as high as possible. In that case γ becomes a well-behaved constant in the above theorems, ideally equal to 1/2, and does not negatively affect the split count.
We next discuss in detail the mathematical properties of the entropy-based criteria, which are needed to prove the above theorems.

Properties of the Entropy-Based Criteria
Each of the presented entropy-based criteria has a number of useful properties that we give next, along with their proofs. We first give bounds on the values of the entropy-based functions. As before, let w be the weight of the heaviest leaf in the tree at time t.

Bounds on the Entropy-Based Criteria
The upper bounds in Lemmas 5-7 are tight; equality holds in the special case when π_{l,i} = 1/k for all i ∈ {1, ..., k} and l ∈ L_t, e.g., when each internal node of the tree produces a perfectly pure and balanced split.

Strong Concavity Properties of the Entropy-Based Criteria
So far we have been focusing on time step t. Recall that n is the heaviest leaf at time t and that its weight w_n is denoted by w for brevity. Consider splitting this leaf into two children n_0 and n_1. For ease of notation let w_0 = w_{n_0}, w_1 = w_{n_1}, β = P(h_n(x) > 0) and P_i = P(h_n(x) > 0|i), and furthermore let π_i and h be shorthands for π_{n,i} and h_n, respectively. Recall that β = ∑_{i=1}^k π_i P_i and ∑_{i=1}^k π_i = 1. Notice that w_0 = w(1 − β) and w_1 = wβ. Let π be the k-element vector with i-th entry equal to π_i. Before the split, the contribution of node n to G^e_t, G^g_t, and G^m_t was, respectively, w G_e(π), w G_g(π), and w G_m(π). Note that π_{n_0,i} = π_i(1 − P_i)/(1 − β) and π_{n_1,i} = π_i P_i/β are the probabilities that a randomly chosen x drawn from P has label i given that x reaches node n_0 or n_1, respectively. For brevity, denote π_{n_0,i} and π_{n_1,i} by π_{0,i} and π_{1,i}, and let π_0 and π_1 be the k-element vectors with i-th entries π_{0,i} and π_{1,i}, respectively. Notice that π = (1 − β)π_0 + βπ_1. After the split, the contribution of the same, now internal, node n changes to, respectively, w((1 − β)G_e(π_0) + βG_e(π_1)), w((1 − β)G_g(π_0) + βG_g(π_1)), and w((1 − β)G_m(π_0) + βG_m(π_1)). We can accordingly compute the differences Δ^e_t, Δ^g_t, and Δ^m_t between the contributions of node n to the values of the entropy-based objectives at times t and t + 1 (Equations (2)-(4)). The next three lemmas, Lemmas 8-10, describe the strong concavity properties of the Shannon entropy, Gini-entropy, and modified Gini-entropy, which can be used to lower-bound Δ^e_t, Δ^g_t, and Δ^m_t (Equations (2)-(4) correspond to the gap in Jensen's inequality applied to a strongly concave function).

Lemma 8. The Shannon entropy function G_e is strongly concave with respect to the l_1-norm with modulus 1, and thus the following holds: G_e(π) ≥ (1 − β)G_e(π_0) + βG_e(π_1) + (β(1 − β)/2)‖π_0 − π_1‖_1².

Lemma 9. The Gini-entropy function G_g is strongly concave with respect to the l_2-norm with modulus 2, and thus the following holds: G_g(π) ≥ (1 − β)G_g(π_0) + βG_g(π_1) + β(1 − β)‖π_0 − π_1‖_2².

Lemma 10. The modified Gini-entropy function G_m is strongly concave with respect to the l_2-norm with modulus 2(C − 2)²/C³, and thus the following holds: G_m(π) ≥ (1 − β)G_m(π_0) + βG_m(π_1) + ((C − 2)²/C³)β(1 − β)‖π_0 − π_1‖_2².

(Figure: functions G_e(π_1), G_g(π_1), and G_m(π_1), re-scaled to have values in [0, 1], shown as functions of π_1. Figure should be read in color.)
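The strong-concavity statement for the Shannon entropy can be sanity-checked numerically. The sketch below assumes the modulus-1 l_1 bound G_e(π) ≥ (1 − β)G_e(π_0) + βG_e(π_1) + (β(1 − β)/2)‖π_0 − π_1‖_1², i.e., the Jensen gap is at least the stated quadratic term:

```python
import numpy as np

def shannon(p):
    """Shannon entropy sum_i p_i ln(1/p_i), with 0 * ln(1/0) := 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log(1.0 / nz)))

def jensen_gap_holds(p0, p1, beta, tol=1e-12):
    """Check: shannon((1-b) p0 + b p1) - (1-b) shannon(p0) - b shannon(p1)
    >= 0.5 * b * (1-b) * ||p0 - p1||_1 ** 2."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    mix = (1 - beta) * p0 + beta * p1
    gap = shannon(mix) - (1 - beta) * shannon(p0) - beta * shannon(p1)
    bound = 0.5 * beta * (1 - beta) * np.sum(np.abs(p0 - p1)) ** 2
    return gap >= bound - tol
```

For the extreme case p_0 = (1, 0), p_1 = (0, 1), β = 0.5 the gap is ln 2 ≈ 0.693 against a bound of 0.5, so the inequality holds but with limited slack; this tightness is what drives the split-count bounds.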

Proof of Lemma 4 and Theorems 1-3
We finally proceed to proving the three boosting theorems, Theorems 1-3. Lemma 4 is a by-product of these proofs.

Proof.
For the Shannon entropy, it follows from Equation (2) and Lemmas 5 and 8 that where the last inequality comes from the fact that 1 − γ ≥ β ≥ γ (see the definition of γ in the weak hypothesis assumption) and J(h) ≥ 2γ (see the weak hypothesis assumption). For the Gini-entropy criterion, notice that from Equation (3) and Lemmas 6, 9, and A4 it follows that where the last inequality is obtained similarly to the last inequality in Equation (5). Finally, for the modified Gini-entropy it follows from Equation (4) and Lemmas 7, 10, and A4 that where the last inequality is obtained as before.
Clearly, the larger the objective J(h) is at time t, the larger the entropy reduction ends up being. We consider the worst-case setting (giving the largest possible number of splits) and thus we assume Combining that with Equations (8) and (10) yields the statements of the main theorems.

Proof of Theorem 4
We next proceed to directly proving the error bound. Recall that π_{l,i} is the probability that the data point x has label i given that x reached leaf l, i.e., π_{l,i} = P(y(x) = i | x reached l). Let the label assigned to a leaf be the majority label; thus the leaf is assigned label i if and only if π_{l,i} ≥ π_{l,z} for all z ∈ {1, 2, ..., k}, z ≠ i. Therefore we can write ε(T) = P(t(x) ≠ y(x)) = ∑_{l∈L_t} w_l P(t(x) ≠ y(x) | x reached l). Let i_l be the majority label in leaf l, thus π_{l,i_l} ≥ π_{l,z} for all z ≠ i_l. We can continue as follows. Consider again the Shannon entropy G(T) of the leaves of tree T, defined as Note that where the last inequality comes from the fact that π_{l,i} ≤ 0.5 for all i ≠ i_l.
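Since each leaf predicts its majority label, the tree error decomposes over leaves as ε(T) = ∑_l w_l (1 − max_i π_{l,i}); a minimal sketch of this computation:

```python
import numpy as np

def tree_error(leaf_weights, leaf_dists):
    """Multiclass error of a tree whose leaf l predicts its majority label:
    eps(T) = sum_l w_l * (1 - max_i pi_{l,i})."""
    return float(sum(w * (1.0 - float(np.max(pi)))
                     for w, pi in zip(leaf_weights, leaf_dists)))
```

For instance, a tree with two equal-weight leaves, one perfectly pure and one with class distribution (0.6, 0.4), has error 0.5 * 0 + 0.5 * 0.4 = 0.2, and the error vanishes exactly when every leaf is pure, which is also when the Shannon entropy of the leaves is zero.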

Conclusions
This paper aims at introducing theoretical tools, encapsulated in the boosting framework, that enable the comparison of different multi-class classification objective functions. Multi-class boosting is poorly understood from the theoretical perspective [5]. We provide an exhaustive theoretical analysis of the objective function underlying the recently proposed LOMtree algorithm for extreme multi-class classification and explore the connection of this objective to entropy-based criteria. We show that optimizing this objective simultaneously optimizes the Shannon entropy, the Gini-entropy and its modified variant, as well as the multi-class classification error. We expect that the discussed tools can be used to obtain theoretical guarantees in the multi-label [28-30] and memory-constrained settings (we will explore this research direction in the future). We also consider extensions to different variants of the multi-class classification problem [31,32] and multi-output learning tasks [33,34]. We thus plan to build a unified theoretical framework for understanding extreme classification trees.

Appendix A.2. Additional Proofs

Assume β ≠ 0.5. Denote, as before, I_0 = {i : P(h(x) > 0|i) = 0} and I_1 = {i : P(h(x) > 0|i) = 1}, where the last inequality comes from the fact that the quadratic form −4β² + 4β equals 1 only when β = 0.5, and otherwise is smaller than 1. We thus obtain a contradiction, which ends the proof.
Note that J(h) can be written as Thus, as before, we obtain −4β² + 4β − J(h) ≥ 0, which, when solved, yields the lemma.
Proof of Lemma 5. The lower bound follows from the fact that the entropy of each leaf, ∑_{i=1}^k π_{l,i} ln(1/π_{l,i}), is non-negative. We next prove the upper bound.
where the first inequality comes from the fact that the uniform distribution maximizes the entropy, and the last equality comes from the fact that a tree with t internal nodes has t + 1 leaves (recall also that w is the weight of the heaviest node in the tree at time t, which we will also use in the next lemmas).
Before proceeding to the actual proof of Lemma 6, we first introduce helpful results captured in Lemma A1 and Corollary A1.
Lemma A1 (The inequality between the Euclidean and arithmetic means). Let x_1, ..., x_k be non-negative numbers. Then the Euclidean mean upper-bounds the arithmetic mean as follows: (1/k)∑_{i=1}^k x_i ≤ √((1/k)∑_{i=1}^k x_i²).

Proof. By Lemma A1 we have

Proof of Lemma 6. The lower bound is straightforward since all π_{l,i} are non-negative. The upper bound can be shown as follows (the last inequality results from Corollary A1):

Proof of Lemma 7. The lower bound can be shown as follows.
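Lemma A1 is easy to sanity-check numerically; a minimal sketch:

```python
import math

def mean_inequality_holds(xs, tol=1e-12):
    """Arithmetic mean <= Euclidean (quadratic) mean for non-negative numbers,
    with equality when all the numbers are equal."""
    k = len(xs)
    arith = sum(xs) / k
    euclid = math.sqrt(sum(x * x for x in xs) / k)
    return arith <= euclid + tol
```

This is the special case p = 1, q = 2 of the power-mean inequality, and equality holds exactly when all x_i coincide.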
Recall that the function ∑_{i=1}^k π_{l,i}(C − π_{l,i}) is concave and is therefore minimized at the extremes of the [0, 1] interval, i.e., where each π_{l,i} is either 0 or 1. Let I_0 = {i : π_{l,i} = 0} and let Combining this result with the fact that ∑_{l∈L_t} w_l = 1 gives the lower bound. We next prove the upper bound. Recall that Lemma A1 implies that (∑_{i=1}^k Proof of Lemma 8. Lemma 8 is proven in [37] (Example 2.5).
Lemma A3 (Remark 2.2.4. in [39]).The sum of strongly concave functions on R n with modulus σ is strongly concave with the same modulus.
Proof of Lemma 10. Consider the functions g(π_i) = h(f(π_i)), where f(π_i) = π_i(C − π_i) with C ≥ 2, and h(x) = √x on [0, C²/4]. It is easy to see, using Lemma A2, that the function f is strongly concave with respect to the l_2-norm with modulus 2, thus where π_i, π'_i ∈ [0, 1] and θ ∈ [0, 1]. Also note that h is strongly concave with modulus 2/C³ on its domain [0, C²/4] (the second derivative of h is h''(x) = −1/(4√x³) ≤ −2/C³). The strong concavity of h implies that where x_1, x_2 ∈ [0, C²/4]. Let x_1 = f(π_i) and x_2 = f(π'_i). Then we obtain (A2). Note that where the second inequality results from Equation (A1) and the last (third) inequality results from Equation (A2). Finally, note that the first derivative of f is f'(π_i) = C − 2π_i, and combining this result with the previous statement yields that g(π_i) is strongly concave with modulus 2(C − 2)²/C³. By Lemma A3, G_m(π) is also strongly concave with the same modulus.
The next two lemmas are fundamental; they are used in the proof of Lemma 4 and the boosting theorems. The first relates the l_1-norm and l_2-norm, and the second is a simple property of the exponential function.

Figure 1 .
Figure 1. Red partition: a highly balanced but impure split (the partition cuts through the black and green classes). Green partition: a highly balanced and highly pure split. Figure should be read in color.

Figure 2 .
Figure 2. Left: The blue curve captures the behavior of the upper bound on the balancing factor as a function of J(h), the red curve captures the behavior of the lower bound on the balancing factor as a function of J(h), and the green intervals correspond to the intervals where the balancing factor lies for different values of J(h). Right: The red line captures the behavior of the upper bound on the purity factor as a function of J(h) when the balancing factor is fixed to 1/2. Figure should be read in color.

Figure A1 .
Figure A1. Functions G^e_t, G^g_t, and G^m_t, and the test error, all normalized to the interval [0, 1], versus the number of splits. Figure is recommended to be read in color.