#### 4.1. Notation

We first introduce notation. Let $\mathcal{T}$ denote the tree under consideration, and let $\mathcal{P}$ be a fixed target distribution over $\mathcal{X}$. For a node $l$, ${\pi}_{l,i}$ denotes the probability that a randomly chosen data point $x$ drawn from $\mathcal{P}$ has label $i$ given that $x$ reaches node $l$ (note that ${\sum}_{i=1}^{k}{\pi}_{l,i}=1$). Furthermore, $t$ denotes the number of internal tree nodes, ${\mathcal{L}}_{t}$ denotes the set of all tree leaves at time $t$, and ${w}_{l}$ is the weight of leaf $l$, defined as the probability that a randomly chosen $x$ drawn from $\mathcal{P}$ reaches leaf $l$ (note that ${\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}=1$). We study a tree construction algorithm that recursively finds the leaf node with the highest weight and splits it into two children. Consider the tree constructed over $t$ steps, where in each step we take one leaf node and split it; thus the number of splits equals the number of internal nodes of the tree ($t=1$ corresponds to splitting the root, so at this step the tree consists of one internal node (the root) and its two children (leaves)). We measure the quality of the tree at any given time $t$ with three different entropy criteria:

Shannon entropy:

$${G}_{t}^{e}={\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}{\sum}_{i=1}^{k}{\pi}_{l,i}\ln\frac{1}{{\pi}_{l,i}};$$

Gini-entropy:

$${G}_{t}^{g}={\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}{\sum}_{i=1}^{k}{\pi}_{l,i}(1-{\pi}_{l,i});$$

Modified Gini-entropy:

$${G}_{t}^{m}={\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}\sqrt{{\sum}_{i=1}^{k}{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})},$$

where $\mathcal{C}$ is a constant such that $\mathcal{C}>2$.
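As a concrete illustration, all three criteria can be computed directly from the leaf weights ${w}_{l}$ and the conditional label probabilities ${\pi}_{l,i}$. The following minimal sketch (function names are ours, not the paper's) assumes the modified Gini-entropy takes the form ${\sum}_{l}{w}_{l}\sqrt{{\sum}_{i}{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}$:

```python
import math

def shannon_entropy(leaves):
    # leaves: list of (w_l, [pi_{l,1}, ..., pi_{l,k}]) pairs
    return sum(w * sum(p * math.log(1.0 / p) for p in pis if p > 0)
               for w, pis in leaves)

def gini_entropy(leaves):
    return sum(w * sum(p * (1.0 - p) for p in pis) for w, pis in leaves)

def modified_gini_entropy(leaves, C=4.0):
    # requires the constant C > 2
    return sum(w * math.sqrt(sum(p * (C - p) for p in pis))
               for w, pis in leaves)
```

For a single leaf with a uniform label distribution over $k$ classes, `shannon_entropy` attains its maximum $\ln k$, while a pure leaf (one ${\pi}_{l,i}=1$) drives the Shannon and Gini criteria to zero.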

These criteria are natural extensions of the criteria used in the context of binary classification [25] to the multiclass classification setting (note that there is more than one way of extending the entropy-based criteria from [25] to the multiclass setting; e.g., the modified Gini-entropy could as well be defined as ${G}_{t}^{m}={\sum}_{l\in {\mathcal{L}}_{t}}{w}_{l}{\sum}_{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}$, where $\mathcal{C}\in [1,2]$; this and other extensions will be investigated in future work). We will next present the main results of this paper, followed by their proofs. We begin by introducing the weak hypothesis assumption.
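The heaviest-leaf splitting schedule described above can be sketched with a max-heap over leaf weights; `split_leaf` below is a hypothetical callback standing in for the actual node-splitting procedure:

```python
import heapq

def grow_by_heaviest_leaf(t, split_leaf):
    """Perform t splits, always splitting the heaviest current leaf.

    split_leaf(w) is a caller-supplied routine (hypothetical here) that
    splits a leaf of weight w and returns the two child weights, which
    must sum to w. Returns the sorted leaf weights after t splits.
    """
    heap = [(-1.0, 0)]  # max-heap via negated weights; root has weight 1
    next_id = 1
    for _ in range(t):
        neg_w, _leaf = heapq.heappop(heap)      # heaviest leaf
        w_left, w_right = split_leaf(-neg_w)    # split it into two children
        heapq.heappush(heap, (-w_left, next_id))
        heapq.heappush(heap, (-w_right, next_id + 1))
        next_id += 2
    return sorted(-w for w, _ in heap)
```

Each split pops the heaviest leaf and pushes its two children, so after $t$ splits the tree has $t+1$ leaves, matching the identification of $t$ with the number of internal nodes.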

#### 4.2. Theorems

**Definition** **2** (Weak Hypothesis Assumption)**.** Let $m$ denote any internal node of the tree $\mathcal{T}$, and let ${\beta}_{m}=P({h}_{m}(x)>0)$ and ${P}_{m,i}=P({h}_{m}(x)>0|i)$, where ${h}_{m}$ is the hypothesis used at node $m$. Furthermore, let $\gamma \in {\mathbb{R}}^{+}$ be such that for all $m$, $\gamma \in (0,\min({\beta}_{m},1-{\beta}_{m})]$. We say that the weak hypothesis assumption is satisfied when for any distribution $\mathcal{P}$ over $\mathcal{X}$, at each node $m$ of the tree $\mathcal{T}$ there exists a hypothesis ${h}_{m}\in \mathcal{H}$ such that $J({h}_{m})/2={\sum}_{i=1}^{k}{\pi}_{m,i}|{P}_{m,i}-{\beta}_{m}|\ge \gamma$.

The weak hypothesis assumption says that in every node of the tree we are able to recover a hypothesis from $\mathcal{H}$ whose objective value is bounded away from zero by $\gamma$ (thus the corresponding split is “weakly” pure and “weakly” balanced).
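The quantity $J({h}_{m})/2={\sum}_{i=1}^{k}{\pi}_{m,i}|{P}_{m,i}-{\beta}_{m}|$ can be estimated empirically from a sample of points reaching node $m$; a sketch (the helper name `advantage` and its signature are ours, for illustration only):

```python
def advantage(h, xs, ys, k):
    """Empirical J(h)/2 = sum_i pi_i |P_i - beta| at a node.

    xs, ys: data points reaching the node and their labels in {0,...,k-1};
    h: maps a point x to a real value, routing x right when h(x) > 0.
    """
    n = len(xs)
    beta = sum(1 for x in xs if h(x) > 0) / n             # P(h(x) > 0)
    total = 0.0
    for i in range(k):
        idx = [j for j in range(n) if ys[j] == i]
        if not idx:
            continue
        pi_i = len(idx) / n                               # pi_{m,i}
        p_i = sum(1 for j in idx if h(xs[j]) > 0) / len(idx)  # P(h>0 | i)
        total += pi_i * abs(p_i - beta)
    return total
```

A perfectly balanced and pure split (each class routed entirely to one side, half the mass each way) attains the maximum value $1/2$, consistent with the ideal $\gamma =1/2$ discussed in Remark 1 below.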

Consider next any time $t$, and let $n$ be the heaviest leaf at time $t$, i.e., the leaf we split; for brevity, denote its weight ${w}_{n}$ by $w$. Similarly, let $h$ denote the regressor at node $n$ (shorthand for ${h}_{n}$). Consider the difference between the contribution of node $n$ to the value of each entropy-based objective at times $t$ and $t+1$; since node $n$ is the only leaf affected by the split, this difference is exactly the entropy reduction ${G}_{t}-{G}_{t+1}$ for the corresponding criterion. Then the following lemma holds (the proof is provided in Section 5):

**Lemma** **4.** Under the Weak Hypothesis Assumption, the change in each of the entropy criteria occurring due to the node split can be bounded from below in terms of the weight $w$ of the split node, the advantage $\gamma$, and the objective value $J(h)$.

Clearly, maximizing the objective $J(h)$ improves the entropy reduction. The considered objective can therefore be viewed as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria, for which efficient online optimization strategies are largely unknown but highly desired in the multiclass classification setting. To be more specific, the standard packages for binary classification trees, such as CART [26] and C4.5 [27], run a brute-force search at every node of the tree over the set of all possible partitions to find the one that yields the biggest improvement of the entropy-based criterion of interest [25]. This is prohibitively expensive in the multiclass case. $J(h)$, however, can be efficiently optimized with SGD instead.

We next state three boosting-style theoretical results, captured in Theorems 1–3. They guarantee that the top-down decision tree algorithm which optimizes $J(h)$ in each node amplifies the weak advantage, captured in the weak hypothesis assumption, building a tree that achieves any desired level of entropy (Shannon entropy, Gini-entropy, or its modified variant).

**Theorem** **1.** Under the Weak Hypothesis Assumption, for any $\alpha \in [0,2\ln k]$, to obtain ${G}_{t}^{e}\le \alpha$ it suffices to make $t\ge {\left(\frac{2\ln k}{\alpha}\right)}^{\frac{4{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}\ln k}$ splits.

**Theorem** **2.** Under the Weak Hypothesis Assumption, for any $\alpha \in \left[0,2\left(1-\frac{1}{k}\right)\right]$, to obtain ${G}_{t}^{g}\le \alpha$ it suffices to make $t\ge {\left(\frac{2\left(1-\frac{1}{k}\right)}{\alpha}\right)}^{\frac{2{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}(k-1)}$ splits.

**Theorem** **3.** Under the Weak Hypothesis Assumption, for any $\alpha \in [\sqrt{\mathcal{C}-1},2\sqrt{k\mathcal{C}-1}]$, to obtain ${G}_{t}^{m}\le \alpha$ it suffices to make $t\ge {\left(\frac{2\sqrt{k\mathcal{C}-1}}{\alpha}\right)}^{\frac{2{(1-\gamma )}^{2}{\mathcal{C}}^{3}}{{\gamma}^{2}{(\mathcal{C}-2)}^{2}{\log}_{2}e}k\sqrt{k\mathcal{C}-1}}$ splits.
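The sufficient split counts in Theorems 1 and 2 are explicit and can be evaluated numerically; a minimal sketch (function names are ours, and Theorems 3–4 follow the same pattern):

```python
import math

def splits_shannon(alpha, gamma, k):
    """Sufficient number of splits for G_t^e <= alpha (Theorem 1)."""
    exponent = 4 * (1 - gamma)**2 / (gamma**2 * math.log2(math.e)) * math.log(k)
    return math.ceil((2 * math.log(k) / alpha) ** exponent)

def splits_gini(alpha, gamma, k):
    """Sufficient number of splits for G_t^g <= alpha (Theorem 2)."""
    exponent = 2 * (1 - gamma)**2 / (gamma**2 * math.log2(math.e)) * (k - 1)
    return math.ceil((2 * (1 - 1 / k) / alpha) ** exponent)
```

Note how the exponent grows as $\ln k$ for the Shannon entropy but as $k-1$ for the Gini-entropy; this is the label-complexity gap highlighted in Remark 1 below.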

Finally, we provide the error guarantee in Theorem 4. Denote by $y(x)$ a fixed target function with domain $\mathcal{X}$, which assigns each data point $x$ its label, and recall that $\mathcal{P}$ is a fixed target distribution over $\mathcal{X}$. Together, $y$ and $\mathcal{P}$ induce a distribution on labeled pairs $(x,y(x))$. Let $t(x)$ be the label assigned to data point $x$ by the tree. We denote by $\epsilon (\mathcal{T})$ the error of tree $\mathcal{T}$, i.e., $\epsilon (\mathcal{T}):={\mathbb{E}}_{x\sim \mathcal{P}}\left[{\sum}_{i=1}^{k}\mathbb{1}\left[t(x)=i,y(x)\ne i\right]\right]$, which is the probability that the tree mislabels a randomly drawn $x$.
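The empirical counterpart of $\epsilon (\mathcal{T})$ over a finite sample can be computed as follows (`t_fn` and `y_fn` are hypothetical stand-ins for the tree labeling $t(\cdot)$ and the target function $y(\cdot)$):

```python
def tree_error(t_fn, y_fn, xs):
    """Empirical estimate of eps(T) = P_{x~P}(t(x) != y(x)) over a sample xs."""
    return sum(1 for x in xs if t_fn(x) != y_fn(x)) / len(xs)
```

The sum of indicators in the definition of $\epsilon (\mathcal{T})$ collapses to the single event $t(x)\ne y(x)$, which is what the sketch counts.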

**Theorem** **4.** Under the Weak Hypothesis Assumption, for any $\alpha \in [0,1]$, to obtain $\epsilon (\mathcal{T})\le \alpha$ it suffices to make $t\ge {\left(\frac{2\ln k\,{\log}_{2}e}{\alpha}\right)}^{\frac{4{(1-\gamma )}^{2}}{{\gamma}^{2}{\log}_{2}e}\ln k}$ splits.

**Remark** **1.** The main theorems show how fast the entropy criteria or the multiclass classification error drop as the tree grows and performs node splits. These statements therefore provide a platform for comparing different entropy criteria, and they answer two questions: (1) for fixed $\alpha ,\gamma ,\mathcal{C}$, and $k$, which criterion is reduced the most with each split? and (2) can the multiclass error match the convergence speed of the best entropy criterion? It can be noted that the Shannon entropy has the most advantageous dependence on the label complexity, since its bound scales only logarithmically with $k$, and thus it achieves the fastest convergence. Simultaneously, the multiclass classification error matches this advantageous convergence rate and also scales favorably (logarithmically) with $k$. Finally, even though the weak hypothesis assumption requires only a slightly favorable $\gamma$, i.e., $\gamma >0$, in practice when constructing the tree one can optimize $J$ in every node, which effectively pushes $\gamma$ to be as high as possible. In that case $\gamma$ becomes a well-behaved constant in the above theorems, ideally equal to $1/2$, and does not negatively affect the split count.

We next discuss in detail the mathematical properties of the entropy-based criteria, which are important for proving the above theorems.