# A Comparison of Variational Bounds for the Information Bottleneck Functional


## Abstract


## 1. Introduction

**Related Work and Scope.** Many variational bounds on mutual information have been proposed [8], and many of them can be applied to the IB functional. Both the VIB and VCEB variational bounds belong to the class of Barber & Agakov bounds, cf. Section 2.1 of [8]. As an alternative example, the authors of [9] bounded the IB functional using the Donsker–Varadhan representation of mutual information. The IB functional has also been used for NN training without resorting to purely variational approaches: for example, the authors of [10] applied the Barber & Agakov bound to replace $I(Y;Z)$ by the standard cross-entropy loss of a trained classifier, but used a non-parametric estimator for $I(X;Z)$. Rather than comparing multiple variational bounds with each other, in this work we focus exclusively on the VIB [7] and VCEB [2] bounds. The structural similarity of these bounds allows a direct comparison and still yields interesting insights that can potentially carry over to other variational approaches.
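For orientation, the Barber & Agakov bound mentioned above lower-bounds $I(Y;Z)$ using any variational classifier $c_{\widehat{Y}|Z}$ in place of the intractable posterior ${p}_{Y|Z}$ (this display paraphrases the standard bound in the notation of Section 1; the paper's own equations may differ):

```latex
I(Y;Z) \;=\; H(Y) - H(Y \mid Z)
       \;\geq\; H(Y) + \mathbb{E}_{p_{XY} e_{Z|X}}\!\left[\log c_{\widehat{Y}|Z}(Y \mid Z)\right]
```

The gap equals $\mathbb{E}_{p_Z}\left[D_{\mathrm{KL}}\left(p_{Y|Z} \,\|\, c_{\widehat{Y}|Z}\right)\right] \geq 0$, so the bound is tight exactly when the variational classifier matches the true posterior.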

**Notation.** We consider a classification task with a feature RV X on ${\mathbb{R}}^{m}$ and a class RV Y on the finite set $\mathcal{Y}$ of classes. We denote the joint distribution of X and Y by ${p}_{XY}$. In this work we are interested in representations Z of the feature RV X. This (typically real-valued) representation Z is obtained by feeding X to a stochastic encoder ${e}_{Z|X}$, and the representation Z can be used to infer the class label by feeding it to a classifier ${c}_{\widehat{Y}|Z}$. Note that this classifier yields a class estimate $\widehat{Y}$ that need not coincide with the class RV Y. Thus, the setup of encoder, representation, and classifier yields the following Markov condition: $Y-X-Z-\widehat{Y}$. We abuse notation and abbreviate the conditional probability (density) ${p}_{W|V=v}(\cdot)$ of a RV W given that another RV V assumes a certain value v as ${p}_{W|V}(\cdot|v)$. For example, the probability density of the representation Z for an input $X=x$ is induced by the encoder ${e}_{Z|X}$ and is given as ${e}_{Z|X}(\cdot|x)$.
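The encoder–classifier pipeline above can be sketched numerically. The following is a minimal illustration, not the paper's architecture: the weight matrices and the Gaussian form of the encoder are hypothetical choices (a diagonal-Gaussian encoder is common in VIB-style implementations), used only to show how a stochastic encoder ${e}_{Z|X}$ produces Z and how a classifier ${c}_{\widehat{Y}|Z}$ maps Z to a distribution over class estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, W_sigma):
    """Hypothetical Gaussian stochastic encoder e_{Z|X}:
    sample z ~ N(mu(x), diag(sigma(x)^2))."""
    mu = W_mu @ x
    sigma = np.exp(W_sigma @ x)          # positive standard deviations
    return mu + sigma * rng.standard_normal(mu.shape)

def classifier(z, W_c):
    """Hypothetical classifier c_{Yhat|Z}: softmax over class logits."""
    logits = W_c @ z
    p = np.exp(logits - logits.max())    # subtract max for numerical stability
    return p / p.sum()                   # distribution over the class estimate

# toy dimensions: m = 4 features, 2-dimensional representation Z, 3 classes
x = rng.standard_normal(4)
W_mu = rng.standard_normal((2, 4))
W_sigma = rng.standard_normal((2, 4))
W_c = rng.standard_normal((3, 2))

z = encoder(x, W_mu, W_sigma)            # representation Z (stochastic)
p_yhat = classifier(z, W_c)              # c_{Yhat|Z}(.|z), a distribution on Y
```

Because the encoder is stochastic, repeated calls with the same x yield different Z, which is exactly what makes $I(X;Z)$ finite and controllable.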

## 2. Variational Bounds on the Information Bottleneck Functional
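For reference, the IB functional of [1], which both VIB and VCEB bound, can be written in the notation above as follows (this is one common parameterization; conventions for the trade-off parameter vary across papers, and the paper's own display may differ):

```latex
\min_{e_{Z|X}} \; I(X;Z) - \beta\, I(Y;Z),
\qquad \beta \geq 0,
```

where the minimization is over stochastic encoders ${e}_{Z|X}$ and $\beta$ trades compression of X against preservation of information about Y.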

## 3. Variational IB and Variational CEB as Optimization Problems

**Assumption 1.**

**Definition 1.**

## 4. Main Results

**Example 1.**

**Example 2.**

**Theorem 1.**

## 5. Discussion

## 6. Proof of Theorem 1

- $\left(a\right)$ follows by writing the KL divergence as an expectation of the logarithm of a ratio;
- $\left(b\right)$ follows by multiplying both numerator and denominator in the first term with ${c}_{\widehat{Y}|Z}^{\mathrm{VCEB}}$;
- $\left(c\right)$ is because of the (potential) suboptimality of ${c}_{\widehat{Y}|Z}^{\mathrm{VCEB}}$ for the VIB cost function;
- $\left(d\right)$ is because $\mathcal{Q}\supseteq \{q_Z:\; q_Z(z)=\sum_{y} b_{Z|Y}(z|y)\, p_Y(y),\; b_{Z|Y}\in \mathcal{B}\}$, thus we may choose ${q}_{Z}={q}_{Z}^{\prime}$ where ${q}_{Z}^{\prime}$ is defined in (17); and because this particular choice may be suboptimal for the VIB cost function;
- $\left(e\right)$ follows by splitting the logarithm;
- $\left(f\right)$ follows by noticing that $\mathbb{E}_{XYZ\sim p_{XY} e_{Z|X}}\left[\log p_Y(Y)\right] = -H(Y)$;
- $\left(g\right)$ follows because ${e}_{Z|X}^{\mathrm{VCEB}}$ may be suboptimal for the VIB cost function.
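Step $(f)$ rests on the elementary identity $\mathbb{E}_{Y\sim p_Y}[\log p_Y(Y)] = -H(Y)$ (the expectation over $XYZ$ reduces to one over Y, since the integrand depends on Y alone). A quick numeric check on a toy class distribution (the numbers are illustrative, not from the paper):

```python
import numpy as np

# toy class distribution p_Y over three classes
p_Y = np.array([0.5, 0.3, 0.2])

# Shannon entropy H(Y) in nats
H_Y = -np.sum(p_Y * np.log(p_Y))

# E_{Y ~ p_Y}[log p_Y(Y)]: expected log-probability under the same distribution
E_log = np.sum(p_Y * np.log(p_Y))

# the identity used in step (f): E[log p_Y(Y)] = -H(Y)
assert np.isclose(E_log, -H_Y)
```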

- $\left(a\right)$ follows by writing the KL divergence as an expectation of the logarithm of a ratio;
- $\left(b\right)$ follows by the assumption that the VCEB problem is constrained to a consistent classifier-backward encoder pair, and from (11);
- $\left(c\right)$ is because of the (potential) suboptimality of ${c}_{\widehat{Y}|Z}^{\mathrm{VIB}}$ for the VCEB cost function;
- $\left(d\right)$ follows by adding and subtracting $H\left(Y\right)$; by choosing ${b}_{Z|Y}^{\mathrm{VIB}}={c}_{\widehat{Y}|Z}^{\mathrm{VIB}}{q}_{Z}^{\mathrm{VIB}}/{p}_{Y}$, which is possible because $\mathcal{B}\supseteq \{b_{Z|Y}:\; b_{Z|Y}(z|y) = c_{\widehat{Y}|Z}(y|z)\, q_Z(z) / p_Y(y),\; q_Z\in\mathcal{Q},\, c_{\widehat{Y}|Z}\in\mathcal{C}\}$; and by the fact that this particular choice may be suboptimal for the VCEB cost function;
- $\left(e\right)$ follows because ${e}_{Z|X}^{\mathrm{VIB}}$ may be suboptimal for the VCEB cost function.
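The choice in step $(d)$, ${b}_{Z|Y} = {c}_{\widehat{Y}|Z}\,{q}_{Z}/{p}_{Y}$, is simply Bayes' rule applied to the joint distribution ${q}_{Z}{c}_{\widehat{Y}|Z}$: whenever ${p}_{Y}$ is the Y-marginal induced by that joint (the consistency invoked in step $(b)$), each ${b}_{Z|Y}(\cdot|y)$ is a valid distribution over z. A toy finite-Z check (all numbers are illustrative):

```python
import numpy as np

q_Z = np.array([0.25, 0.25, 0.5])      # marginal q_Z over 3 values of z
c = np.array([[0.7, 0.3],              # c(y|z): rows indexed by z, columns by y
              [0.2, 0.8],
              [0.5, 0.5]])

p_Y = c.T @ q_Z                        # induced Y-marginal: p_Y(y) = sum_z c(y|z) q(z)
b = (c * q_Z[:, None]) / p_Y[None, :]  # backward encoder b(z|y) via Bayes' rule

# each column b(.|y) is a probability distribution over z
assert np.allclose(b.sum(axis=0), 1.0)
# b(z|y) p_Y(y) recovers the joint c(y|z) q(z)
assert np.allclose(b * p_Y[None, :], c * q_Z[:, None])
```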

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Tishby, N.; Pereira, F.C.; Bialek, W. The Information Bottleneck Method. In Proceedings of the Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
2. Fischer, I. The Conditional Entropy Bottleneck. *Entropy* **2020**, 22, 999.
3. Wyner, A. The common information of two dependent random variables. *IEEE Trans. Inf. Theory* **1975**, 21, 163–179.
4. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 1st ed.; John Wiley & Sons, Inc.: New York, NY, USA, 1991.
5. Amjad, R.A.; Geiger, B.C. Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers. *arXiv* **2019**, arXiv:1906.02576.
6. Fischer, I.; Alemi, A.A. CEB Improves Model Robustness. *Entropy* **2020**, 22, 1081.
7. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
8. Poole, B.; Ozair, S.; van den Oord, A.; Alemi, A.A.; Tucker, G. On Variational Bounds of Mutual Information. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; pp. 5171–5180.
9. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 531–540.
10. Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear Information Bottleneck. *Entropy* **2019**, 21, 1181.
11. Achille, A.; Soatto, S. Information Dropout: Learning Optimal Representations Through Noisy Computation. *IEEE Trans. Pattern Anal. Mach. Intell.* **2018**, 40, 2897–2905.
12. Wieczorek, A.; Roth, V. On the difference between the Information Bottleneck and the Deep Information Bottleneck. *Entropy* **2020**, 22, 131.
13. Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. *IEEE Trans. Pattern Anal. Mach. Intell.* **2020**, 42, 2225–2239.
14. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Geiger, B.C.; Fischer, I.S.
A Comparison of Variational Bounds for the Information Bottleneck Functional. *Entropy* **2020**, *22*, 1229.
https://doi.org/10.3390/e22111229
