# Extreme Multiclass Classification Criteria


## Abstract


## 1. Introduction

- We provide an extensive theoretical analysis of the properties of the considered objective and prove that maximizing this objective in any tree node simultaneously encourages a balanced partition of the data in that node and improves the purity of the class distributions at its child nodes.
- We show a formal relation of this objective to more standard entropy-based objectives, namely the Shannon entropy, the Gini-entropy, and its modified variant, for which online optimization schemes in the context of multiclass classification are largely unknown. In particular, we show that (i) the improvement in the value of the entropy resulting from performing a node split is lower-bounded by an expression that increases with the value of the objective, and thus (ii) the considered objective can be used as a surrogate function for indirectly optimizing any of the three considered entropy-based criteria.
- We present three boosting theorems, one for each of the three entropy criteria, which give the number of iterations needed to reduce each criterion below an arbitrary threshold. Their weak hypothesis assumptions rely on the considered objective function.
- We establish an error bound that relates maximizing the objective function to reducing the multiclass classification error.
- Finally, in Appendix A we establish an empirical connection between the multiclass classification error and the entropy criteria and show that the Gini-entropy most closely resembles the behavior of the test error in practice.

## 2. Related Work

## 3. Theoretical Properties of the Objective Function

**Definition 1** (Purity and balancedness)**.**

**Lemma 1.**

**Lemma 2.**

**Lemma 3.** For any hypothesis $h$ and any distribution over data examples, the purity factor $\alpha$ and the balancing factor $\beta$ satisfy $\alpha \le \min\left\{\frac{2-J(h)}{4\beta}-\beta,\; 0.5\right\}$.
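The bound of Lemma 3 is easy to evaluate numerically. Below is a minimal sketch; the function name and interface are illustrative (not from the paper), and the closed form follows the bound as stated in the lemma:

```python
def purity_upper_bound(J, beta):
    """Upper bound on the purity factor alpha from Lemma 3.

    Computes min{(2 - J(h)) / (4*beta) - beta, 0.5} for an objective
    value J in [0, 1] and a balancing factor beta in (0, 0.5].
    Illustrative helper; the name and signature are not from the paper.
    """
    return min((2.0 - J) / (4.0 * beta) - beta, 0.5)

# For a perfectly balanced split (beta = 1/2) the bound reduces to
# (1 - J)/2: it decreases from 0.5 at J = 0 down to 0 at J = 1,
# matching the red curve in Figure 2 (right).
```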

## 4. Main Theoretical Results

#### 4.1. Notation

- Shannon entropy ${G}_{t}^{e}$:$${G}_{t}^{e}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}\ln\left(\frac{1}{{\pi}_{l,i}}\right)$$
- Gini-entropy ${G}_{t}^{g}$:$${G}_{t}^{g}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}{\pi}_{l,i}(1-{\pi}_{l,i})$$
- Modified Gini-entropy ${G}_{t}^{m}$:$${G}_{t}^{m}=\sum _{l\in {\mathcal{L}}_{t}}{w}_{l}\sum _{i=1}^{k}\sqrt{{\pi}_{l,i}(\mathcal{C}-{\pi}_{l,i})}$$
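The three criteria above can be computed directly from the leaf weights $w_l$ and the leaf class distributions $\pi_l$. A minimal sketch in Python; the helper name and the leaf representation are illustrative, not from the paper:

```python
import math

def entropy_criteria(leaves, C=2.0):
    """Compute the three entropy-based criteria over the tree leaves.

    `leaves` is a list of (w_l, pi_l) pairs, where w_l is the probability
    that an example reaches leaf l and pi_l is the class distribution at
    that leaf. C is the constant from the modified Gini-entropy. This
    representation is an illustrative assumption, not from the paper.
    """
    # Shannon entropy: sum_l w_l * sum_i pi_{l,i} * ln(1 / pi_{l,i})
    G_e = sum(w * sum(p * math.log(1.0 / p) for p in pi if p > 0)
              for w, pi in leaves)
    # Gini-entropy: sum_l w_l * sum_i pi_{l,i} * (1 - pi_{l,i})
    G_g = sum(w * sum(p * (1.0 - p) for p in pi) for w, pi in leaves)
    # Modified Gini-entropy: sum_l w_l * sum_i sqrt(pi_{l,i} * (C - pi_{l,i}))
    G_m = sum(w * sum(math.sqrt(p * (C - p)) for p in pi) for w, pi in leaves)
    return G_e, G_g, G_m
```

For two pure leaves of equal weight, the Shannon and Gini criteria vanish, while the modified Gini-entropy stays strictly positive, which is why it must be re-scaled before comparison.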

#### 4.2. Theorems

**Definition 2** (Weak Hypothesis Assumption)**.**

**Lemma 4.**

**Theorem 1.**

**Theorem 2.**

**Theorem 3.**

**Theorem 4.**

**Remark 1.**

## 5. Proofs

#### 5.1. Properties of the Entropy-Based Criteria

#### 5.1.1. Bounds on the Entropy-Based Criteria

**Lemma 5.**

**Lemma 6.**

**Lemma 7.**

#### 5.1.2. Strong Concavity Properties of the Entropy-Based Criteria

**Lemma 8.**

**Lemma 9.**

**Lemma 10.**

#### 5.2. Proof of Lemma 4 and Theorems 1–3

**Proof.**

#### 5.3. Proof of Theorem 4

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Extreme Multiclass Classification Criteria

#### Appendix A.1. Numerical Experiments

**Figure A1.** Functions ${G}_{t}^{e}$, ${G}_{t}^{g}$, and ${G}_{t}^{m}$, and the test error, all normalized to the interval $[0,1]$, versus the number of splits. Best viewed in color.

#### Appendix A.2. Additional Proofs

**Proof of Lemma 1.**

**Proof of Lemma 2.**

- Let ${\sum}_{i\in \mathcal{P}}{\pi}_{i}\le 1-\beta $. Then$$J(h)=2\sum _{i=1}^{k}{\pi}_{i}\left|\beta -{P}_{i}\right|.$$Thus $-4{\beta}^{2}+4\beta -J(h)\ge 0$ which, when solved, yields the lemma.
- Let ${\sum}_{i\in \mathcal{P}}{\pi}_{i}\ge 1-\beta $ (thus ${\sum}_{i\in \mathcal{N}}{\pi}_{i}\le \beta $). Note that $J(h)$ can be written as$$J(h)=2\sum _{i=1}^{k}{\pi}_{i}\left|P(h(x)\le 0)-P(h(x)\le 0\mid i)\right|$$$$J(h)=2\sum _{i=1}^{k}{\pi}_{i}\left|{\beta}^{\prime}-{P}_{i}^{\prime}\right|.$$Thus, as before, we obtain $-4{\beta}^{2}+4\beta -J(h)\ge 0$ which, when solved, yields the lemma. □
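The final step in both cases solves the quadratic inequality $-4\beta^2 + 4\beta - J(h) \ge 0$ for $\beta$, whose roots are $\beta = (1 \pm \sqrt{1 - J(h)})/2$. A short numerical sketch of this step (the helper is illustrative, not from the paper):

```python
import math

def balance_interval(J):
    """Solve -4*b**2 + 4*b - J >= 0 for the balancing factor b.

    The quadratic has roots b = (1 +/- sqrt(1 - J)) / 2, so the
    inequality holds exactly on the closed interval between them,
    matching the green intervals in Figure 2 (left). Illustrative
    helper, not from the paper.
    """
    r = math.sqrt(1.0 - J)
    return (1.0 - r) / 2.0, (1.0 + r) / 2.0

# Spot-check: the quadratic is non-negative at the endpoints and
# negative just outside the interval.
for J in (0.0, 0.3, 0.7, 1.0):
    lo, hi = balance_interval(J)
    q = lambda b: -4 * b * b + 4 * b - J
    assert q(lo) >= -1e-12 and q(hi) >= -1e-12
    if lo > 0.0:
        assert q(lo - 1e-6) < 0
```

As $J(h) \to 1$ the interval shrinks to the single point $\beta = 1/2$: a maximally effective split must be perfectly balanced.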

**Proof of Lemma 5.**

**Lemma A1** (The inequality between the Euclidean and arithmetic mean)**.**

**Corollary A1.**

**Proof.**

**Proof of Lemma 6.**

**Proof of Lemma 7.**

**Lemma A2.**

**Proof of Lemma 9.**

**Lemma A3.**

**Proof of Lemma 10.**

**Lemma A4.**

**Lemma A5.**


**Figure 1.**

**Red partition**: highly balanced split but impure (the partition cuts through the black and green classes).

**Green partition**: highly balanced and highly pure split. Best viewed in color.

**Figure 2.**

**Left**: The blue curve shows the upper bound on the balancing factor as a function of $J(h)$, the red curve shows the lower bound, and the green intervals mark where the balancing factor can lie for different values of $J(h)$.

**Right**: The red curve shows the upper bound on the purity factor as a function of $J(h)$ when the balancing factor is fixed to $\frac{1}{2}$. Best viewed in color.

**Figure 3.** Functions ${G}_{*}^{e}({\pi}_{1})={\tilde{G}}^{e}({\pi}_{1})/\ln 2=\left({\pi}_{1}\ln \frac{1}{{\pi}_{1}}+(1-{\pi}_{1})\ln \frac{1}{1-{\pi}_{1}}\right)/\ln 2$, ${G}_{*}^{g}({\pi}_{1})=2{\tilde{G}}^{g}({\pi}_{1})=4{\pi}_{1}(1-{\pi}_{1})$, and ${G}_{*}^{m}({\pi}_{1})=({\tilde{G}}^{m}({\pi}_{1})-\sqrt{\mathcal{C}-1})/(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1})=(\sqrt{{\pi}_{1}(\mathcal{C}-{\pi}_{1})}+\sqrt{(1-{\pi}_{1})(\mathcal{C}-1+{\pi}_{1})}-\sqrt{\mathcal{C}-1})/(\sqrt{2\mathcal{C}-1}-\sqrt{\mathcal{C}-1})$ as functions of ${\pi}_{1}$ (the functions ${\tilde{G}}^{e}$, ${\tilde{G}}^{g}$, and ${\tilde{G}}^{m}$ were re-scaled to take values in $[0,1]$). Best viewed in color.
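The three re-scaled two-class curves of Figure 3 can be reproduced numerically; all three peak at $\pi_1 = 1/2$ and vanish at the endpoints. A minimal sketch (the helper name is illustrative, and the formulas follow the caption above):

```python
import math

def normalized_criteria(p1, C=2.0):
    """Evaluate the three re-scaled two-class criteria from Figure 3.

    Maps the class-1 probability p1 in (0, 1) to values in [0, 1]:
    Shannon entropy scaled by ln 2, Gini-entropy scaled by a factor
    of 4, and the modified Gini-entropy shifted and scaled by its
    extreme values sqrt(C - 1) and sqrt(2C - 1). Illustrative helper,
    not from the paper.
    """
    p2 = 1.0 - p1
    g_e = (p1 * math.log(1 / p1) + p2 * math.log(1 / p2)) / math.log(2)
    g_g = 4.0 * p1 * p2
    g_m = ((math.sqrt(p1 * (C - p1)) + math.sqrt(p2 * (C - p2))
            - math.sqrt(C - 1)) / (math.sqrt(2 * C - 1) - math.sqrt(C - 1)))
    return g_e, g_g, g_m
```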

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Choromanska, A.; Kumar Jain, I.
Extreme Multiclass Classification Criteria. *Computation* **2019**, *7*, 16.
https://doi.org/10.3390/computation7010016
