# Classification Active Learning Based on Mutual Information


## Abstract


## 1. Introduction

## 2. Active Learning with Mutual Information

#### 2.1. Formulation

Assuming that the posterior distribution of the parameter vector **θ** concentrates at a single point, one can say that **θ** is deterministically equal to the maximum likelihood estimate (MLE) ${\widehat{\mathit{\theta}}}_{n}$ given $X$ and ${Y}_{\mathcal{L}}^{*}$. The posterior can then be rewritten as $\mathbb{P}({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})=\int \mathbb{P}({Y}_{\mathcal{A}}|\mathit{\theta})\cdot \delta (\mathit{\theta}-{\widehat{\mathit{\theta}}}_{n})\,d\mathit{\theta}=\mathbb{P}({Y}_{\mathcal{A}}|{\widehat{\mathit{\theta}}}_{n})$. Since most discriminative classifiers assume the labels are independent given the parameter, one can write $\mathbb{P}({Y}_{\mathcal{A}}|{\widehat{\mathit{\theta}}}_{n})={\prod}_{i\in \mathcal{A}}\mathbb{P}({y}_{i}|{\widehat{\mathit{\theta}}}_{n})$. This simplifies the computation of the joint entropy to the sum of per-sample entropy contributions:

$$H({Y}_{\mathcal{A}}|X,{Y}_{\mathcal{L}}^{*})=\sum_{i\in \mathcal{A}} H({y}_{i}|{\widehat{\mathit{\theta}}}_{n})$$
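As a concrete illustration of this decomposition, a small Python check (with hypothetical predictive distributions standing in for a trained classifier's outputs) confirms that the entropy of the factorized joint distribution equals the sum of the per-sample entropies:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Hypothetical predictive distributions P(y_i | theta_hat) for a batch of
# three candidate samples, each over three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
    [0.9, 0.05, 0.05],
])

# Under conditional independence, the joint entropy of the batch is the
# sum of the per-sample entropies.
joint_entropy = sum(entropy(p) for p in probs)

# Equivalent check: build the full joint distribution via outer products
# and compute its entropy directly (feasible only for tiny batches).
joint = probs[0]
for p in probs[1:]:
    joint = np.outer(joint, p).ravel()
assert abs(entropy(joint) - joint_entropy) < 1e-10
```

The direct computation over the joint distribution grows exponentially in the batch size, which is exactly why the factorized sum matters in practice.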

#### 2.2. Evaluating Mutual Information

#### 2.3. Randomized vs. Deterministic Submodular Optimizations

**Definition 1.**

**Proposition 2.**

**Proof.**

**Definition 4.**

**Theorem 5.**

Algorithm 1: The deterministic approach

Inputs: the objective function $f$, the unlabeled indices $\mathcal{U}$, the query batch size $k>0$

Outputs: a subset of unlabeled indices $\mathcal{A}\subseteq \mathcal{U}$ of size $k$
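The body of Algorithm 1 is not reproduced above. As a minimal sketch, the standard deterministic greedy procedure for cardinality-constrained submodular maximization (which an algorithm with these inputs and outputs would follow) can be written as below; the assumption that `f` accepts a Python set of indices and returns a float is ours:

```python
def greedy_select(f, unlabeled, k):
    """Deterministic greedy maximization of a set function f under a
    cardinality constraint: at each step, add the candidate with the
    largest marginal gain f(A | {i}) - f(A)."""
    A = set()
    candidates = set(unlabeled)
    for _ in range(k):
        base = f(A)
        best, best_gain = None, float("-inf")
        for i in candidates:
            gain = f(A | {i}) - base
            if gain > best_gain:
                best, best_gain = i, gain
        A.add(best)
        candidates.remove(best)
    return A

# Example with a simple (modular) objective: per-sample scores.
scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}
f = lambda A: sum(scores[i] for i in A)
print(sorted(greedy_select(f, scores.keys(), 2)))  # -> [1, 3]
```

For monotone submodular $f$, this greedy rule carries the classical $(1-1/e)$ approximation guarantee of Nemhauser, Wolsey, and Fisher.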

**Theorem 6.**

Algorithm 2: The randomized approach

Inputs: the objective function $f$, the unlabeled indices $\mathcal{U}$, the query batch size $k>0$

Outputs: a subset of unlabeled indices $\mathcal{A}\subseteq \mathcal{U}$ of size $k$
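The body of Algorithm 2 is likewise not reproduced above. A minimal sketch in the spirit of Buchbinder et al.'s random greedy for cardinality constraints follows: at each step, rank the remaining candidates by marginal gain and pick uniformly at random among the top $k$. The signature of `f` (a set of indices in, a float out) is our assumption:

```python
import random

def random_greedy_select(f, unlabeled, k, rng=None):
    """Randomized greedy selection: at each step, compute the marginal
    gain of every remaining candidate and choose uniformly at random
    among the k candidates with the largest gains."""
    rng = rng or random.Random()
    A = set()
    candidates = set(unlabeled)
    for _ in range(k):
        base = f(A)
        ranked = sorted(candidates, key=lambda i: f(A | {i}) - base, reverse=True)
        choice = rng.choice(ranked[:k])
        A.add(choice)
        candidates.remove(choice)
    return A

scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}
f = lambda A: sum(scores[i] for i in A)
picked = random_greedy_select(f, scores.keys(), 2, random.Random(0))
assert len(picked) == 2
```

The randomization is what extends approximation guarantees to non-monotone submodular objectives, at no extra asymptotic cost per step.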

#### 2.4. Total Complexity Reduction

**Remark 1.**

**Proof.**

#### 2.5. Further Speed-Up

## 3. Results and Discussion

#### 3.1. Cardiotocography

#### 3.2. MNIST (Modified National Institute of Standards and Technology)

#### 3.3. Bach Choral Harmony

#### 3.4. Experimental Settings

- Pess-MI-Det: Pessimistic MI ${f}_{MI}^{pess}$ with deterministic optimization (Algorithm 1);
- Pess-MI-Rand: Pessimistic MI ${f}_{MI}^{pess}$ with randomized optimization (Algorithm 2);
- Opt-MI-Det: Optimistic MI ${f}_{MI}^{opt}$ with deterministic optimization (Algorithm 1);
- Opt-MI-Rand: Optimistic MI ${f}_{MI}^{opt}$ with randomized optimization (Algorithm 2);
- entropy: Entropy objective ${f}_{H}$ with deterministic optimization (Algorithm 1);
- random: Random querying.

#### 3.5. Numerical Results

**Figure 1.** The experimental results of different querying approaches for sequential active learning ($k=1$). (**a**–**c**) Average; (**d**–**f**) STD; (**g**–**i**) p-value: MI vs. Entropy; (**j**–**l**) p-value: MI vs. Random.

**Figure 2.** The experimental results of different querying approaches for batch active learning ($k=5$). (**a**–**c**) Average; (**d**–**f**) STD; (**g**–**i**) p-value: MI vs. Entropy; (**j**–**l**) p-value: MI vs. Random.

**Figure 3.** The experimental results of different querying approaches for batch active learning ($k=10$). (**a**–**c**) Average; (**d**–**f**) STD; (**g**–**i**) p-value: MI vs. Entropy; (**j**–**l**) p-value: MI vs. Random.

**Figure 4.** The experimental results of different querying approaches for batch active learning ($k=20$). (**a**–**c**) Average; (**d**–**f**) STD; (**g**–**i**) p-value: MI vs. Entropy; (**j**–**l**) p-value: MI vs. Random.

## 4. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix: Update Equation for Multinomial Logistic Regression

#### Multinomial Logistic Regression as a Discriminative Classifier

The multinomial logistic regression model is parameterized by the vector **θ**. The objective to optimize in order to find the maximum likelihood estimate (MLE), denoted by ${\widehat{\mathit{\theta}}}_{n}$, given an i.i.d. set of training samples $\mathcal{L}=\{({\mathbf{x}}_{1},{y}_{1}^{*}),\ldots,({\mathbf{x}}_{n},{y}_{n}^{*})\}$, is the conditional log-likelihood of the observed labels.
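For reference, under the common softmax parameterization $\mathbb{P}(y=c\mid \mathbf{x},\mathit{\theta})=\exp({\mathit{\theta}}_{c}^{\top}\mathbf{x})/\sum_{{c}^{\prime}}\exp({\mathit{\theta}}_{{c}^{\prime}}^{\top}\mathbf{x})$ (an assumption on our part; the paper's exact parameterization is not reproduced here), the MLE objective takes the form:

```latex
\widehat{\theta}_n
  = \arg\max_{\theta}\;
    \sum_{i=1}^{n} \log \mathbb{P}\!\left(y_i^{*} \mid \mathbf{x}_i, \theta\right)
  = \arg\max_{\theta}\;
    \sum_{i=1}^{n} \left[ \theta_{y_i^{*}}^{\top}\mathbf{x}_i
      - \log \sum_{c} \exp\!\left(\theta_{c}^{\top}\mathbf{x}_i\right) \right]
```

This objective is concave in $\theta$, so gradient- and Hessian-based updates of the kind discussed in this appendix converge to the global maximizer.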

#### Updating the Gradient Vector

#### Updating the Inverse Hessian Matrix


© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sourati, J.; Akcakaya, M.; Dy, J.G.; Leen, T.K.; Erdogmus, D.
Classification Active Learning Based on Mutual Information. *Entropy* **2016**, *18*, 51.
https://doi.org/10.3390/e18020051
