Information Bottleneck: Theory and Applications in Deep Learning
- Kunze et al. show that maximizing the evidence lower bound with a factorized Gaussian approximate posterior effectively limits mutual information between the available data and the learned parameters [8]. The effect of this tunable “model capacity” is validated in supervised and unsupervised settings, illustrating intuitive connections with overfitting, the NN architecture, and the dataset size;
- Wu et al. investigate learnability within the IB framework. They show that if the trade-off parameter β in (1) falls below a certain threshold, then a trivial representation T that is independent of X and Y minimizes the IB functional [9]. This threshold depends on the joint distribution of X and Y, and the authors propose an algorithm to estimate it for a given dataset;
- Nguyen and Choi argue that every layer in a feedforward NN should be optimized w.r.t. the IB functional (1) separately, with the trade-off parameter β adapted to the layer index [10]. Proposing a cost function for this multiobjective optimization problem, a computable variational bound, and a greedy optimization procedure, they achieve superior accuracy and adversarial robustness in stochastic binary NNs;
- Kolchinsky et al. propose an NN-based implementation of the IB problem, i.e., the compression scheme and the conditional distribution of Y given T are parameterized by NNs [11]. Acknowledging the issues in [12], these NNs are trained to minimize an upper bound on (1), combining variational and non-parametric approaches for bounding (a generic sketch of such a variational objective follows this list). Their experiments yield a better trade-off between I(X;T) and I(Y;T) and more meaningful latent representations in the bottleneck layer than a corresponding reformulation of [6];
- Tegmark and Wu investigate binary classification from real-valued observations [13]. They show that the observations can be compressed to a discrete representation T in such a way that the Pareto frontier of (1) is swept, essentially characterizing the binary classification problem. The authors further show that the corner points of this Pareto frontier, corresponding to a maximization of I(Y;T) for a given alphabet size of T, can be computed without multiobjective optimization;
- Rodríguez Gálvez et al. discuss the scenario in which the target Y is a deterministic function of X in [14]. In this case, it is known that sweeping the parameter β in (1) is not sufficient to sweep the Pareto frontier of optimal (I(X;T), I(Y;T)) pairs [12]. The authors show that this shortcoming can be removed by optimizing u(I(X;T)) − βI(Y;T) instead, where u is a strictly convex function (see the restated functionals after this list). Furthermore, the authors demonstrate that the particular choice of the strictly convex function u helps to obtain a desired compression level I(X;T) over a wide range of values of β;
- Franzese and Visintin propose using the IB functional as a cost function to train ensembles of decision trees for classification [15]. The authors show that these ensembles perform similarly to bagged trees, while they outperform the naive Bayes and k-nearest neighbor classifiers;
- Jónsson et al. [16] investigate the learning behavior of a high-dimensional VGG-16 convolutional NN in the information plane. Using MINE [17] to estimate I(X;T) and I(Y;T) throughout training, the authors observed a separate compression phase, during which the estimate of I(X;T) decreases, thus aligning with [4]. The authors further show that regularizing NN training via a MINE-based estimate of the compression term yields improved classification performance (a sketch of the underlying MINE bound follows this list);
- Voloshynovskiy et al. propose an IB-based framework for semi-supervised classification, considering variational bounds both with learned and hand-crafted marginal distributions and achieving competitive performance [18]. A close investigation of their cost function yields improved insight into previously proposed approaches to semi-supervised classification;
- Fischer formulates the principle of minimum necessary information and derives from it the conditional entropy bottleneck functional [19]. This functional is mathematically equivalent to the IB functional, but uses the chain rule of mutual information to replace I(X;T) in (1) by I(X;T|Y) (see the identity after this list). This results in different variational bounds, which are shown to yield better classification accuracy, improved robustness to adversarial examples, and stronger out-of-distribution detection than deterministic models or models based on variational approximations of (1), cf. [6];
- Fischer and Alemi provide additional empirical evidence for the claims in [19]. Specifically, they show that optimizing the proposed variational bounds leads to improved robustness against targeted and untargeted projected gradient descent attacks and against common corruptions (cf. [20]) of the ImageNet data [21]. Furthermore, the authors indicate that the conditional entropy bottleneck functional yields improved calibration for both clean and corrupted test data;
- Geiger and Fischer investigate the variational bounds proposed in [6,19]. While the underlying IB and conditional entropy bottleneck functionals are equivalent, the authors show that the variational bounds are not; these bounds are generally unordered, but an ordering can be enforced by restricting the feasible sets appropriately [22]. Their analysis is valid for general optimization and does not rely on the assumption that the variational bounds are implemented using NNs.
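For convenience, the functionals discussed in the list above can be restated. The display below assumes the standard Lagrangian form of the IB problem for equation (1); the exact notation in the body of this editorial may differ. The first line is the IB Lagrangian, the second the convex variant of [14] with a strictly convex function u applied to the compression term, and the third the chain-rule identity (valid for the Markov chain Y − X − T) that underlies the conditional entropy bottleneck of [19]:

\[
\begin{aligned}
\mathcal{L}_{\mathrm{IB}}(T;\beta) &= I(X;T) - \beta\, I(Y;T),\\
\mathcal{L}_{u}(T;\beta) &= u\bigl(I(X;T)\bigr) - \beta\, I(Y;T),\\
I(X;T) &= I(X;T\mid Y) + I(Y;T)
\;\;\Longrightarrow\;\;
\mathcal{L}_{\mathrm{IB}}(T;\beta) = I(X;T\mid Y) - (\beta-1)\, I(Y;T).
\end{aligned}
\]

The last line shows that the IB and conditional entropy bottleneck functionals coincide up to a reparameterization of the trade-off parameter β.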
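To illustrate how variational bounds of the kind discussed in [6,11,19,22] are typically implemented, the following is a minimal PyTorch sketch of a variational IB training objective. It is not the implementation of any specific paper in this issue; the architecture, the Gaussian encoder, the standard-normal marginal r(t), and the value of β are illustrative assumptions.

```python
# Minimal sketch of a variational IB objective: an encoder q(t|x) = N(mu(x), diag(sigma(x)^2))
# is trained to minimize  E[-log q(y|t)] + beta * KL(q(t|x) || r(t)),
# which (up to constants) upper-bounds the IB Lagrangian I(X;T) - beta*I(Y;T).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalIB(nn.Module):
    def __init__(self, x_dim=784, t_dim=32, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, t_dim)       # mean of q(t|x)
        self.log_var = nn.Linear(256, t_dim)  # log-variance of q(t|x)
        self.decoder = nn.Linear(t_dim, n_classes)  # variational decoder q(y|t)

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        t = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(t), mu, log_var

def vib_loss(logits, y, mu, log_var, beta=1e-3):
    # Cross-entropy: variational bound related to -I(Y;T), up to the constant H(Y).
    ce = F.cross_entropy(logits, y)
    # KL(q(t|x) || N(0, I)): upper-bounds I(X;T) when averaged over the data.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1).mean()
    return ce + beta * kl

# Example usage (illustrative):
model = VariationalIB()
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
logits, mu, log_var = model(x)
vib_loss(logits, y, mu, log_var).backward()
```

Here β plays the same trade-off role as in (1): larger values enforce stronger compression of T at the expense of relevant information about Y.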
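For the MINE-based regularization discussed by Jónsson et al. [16], the sketch below shows the Donsker–Varadhan lower bound that MINE [17] maximizes over a trainable "statistics network" f; the resulting estimate of I(X;T) can then be added to the training loss as a compression penalty. The network size and the within-batch shuffling used to emulate samples from the product of marginals are illustrative assumptions, not details taken from [16,17].

```python
# Sketch of the Donsker-Varadhan bound used by MINE:
#   I(X;T) >= E_{p(x,t)}[f(x,t)] - log E_{p(x)p(t)}[exp(f(x,t))].
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Trainable critic f(x, t) for the Donsker-Varadhan bound."""
    def __init__(self, x_dim, t_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + t_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1)).squeeze(1)

def mine_lower_bound(f, x, t):
    # Joint samples are the paired (x_i, t_i); samples from the product of
    # marginals are emulated by shuffling t within the mini-batch.
    t_shuffled = t[torch.randperm(t.size(0))]
    joint_term = f(x, t).mean()
    marginal_term = torch.logsumexp(f(x, t_shuffled), dim=0) - math.log(t.size(0))
    return joint_term - marginal_term  # maximize w.r.t. f to tighten the bound
```

Maximizing this bound w.r.t. f yields the MINE estimate of I(X;T); penalizing this estimate during classifier training regularizes the compression term, in the spirit of [16].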
Funding
Acknowledgments
Conflicts of Interest
References
- Tishby, N.; Pereira, F.C.; Bialek, W. The Information Bottleneck Method. In Proceedings of the Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377. [Google Scholar]
- Zaidi, A.; Estella-Aguerri, I.; Shamai (Shitz), S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Goldfeld, Z.; Polyanskiy, Y. The Information Bottleneck Problem and Its Applications in Machine Learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
- Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
- Geiger, B.C. On Information Plane Analyses of Neural Network Classifiers—A Review. arXiv 2020, arXiv:2003.09671. [Google Scholar]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Achille, A.; Soatto, S. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Kunze, J.; Kirsch, L.; Ritter, H.; Barber, D. Gaussian Mean Field Regularizes by Limiting Learned Information. Entropy 2019, 21, 758. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the Information Bottleneck. Entropy 2019, 21, 924. [Google Scholar] [CrossRef] [Green Version]
- Nguyen, T.T.; Choi, J. Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks. Entropy 2019, 21, 976. [Google Scholar] [CrossRef] [Green Version]
- Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear Information Bottleneck. Entropy 2019, 21, 1181. [Google Scholar] [CrossRef] [Green Version]
- Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Tegmark, M.; Wu, T. Pareto-Optimal Data Compression for Binary Classification Tasks. Entropy 2020, 22, 7. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Rodríguez Gálvez, B.; Thobaben, R.; Skoglund, M. The Convex Information Bottleneck Lagrangian. Entropy 2020, 22, 98. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Franzese, G.; Visintin, M. Probabilistic Ensemble of Deep Information Networks. Entropy 2020, 22, 100. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Jónsson, H.; Cherubini, G.; Eleftheriou, E. Convergence Behavior of DNNs with Mutual-Information-Based Regularization. Entropy 2020, 22, 727. [Google Scholar] [CrossRef] [PubMed]
- Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 531–540. [Google Scholar]
- Voloshynovskiy, S.; Taran, O.; Kondah, M.; Holotyak, T.; Rezende, D. Variational Information Bottleneck for Semi-Supervised Classification. Entropy 2020, 22, 943. [Google Scholar] [CrossRef] [PubMed]
- Fischer, I. The Conditional Entropy Bottleneck. Entropy 2020, 22, 999. [Google Scholar] [CrossRef] [PubMed]
- Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Fischer, I.; Alemi, A.A. CEB Improves Model Robustness. Entropy 2020, 22, 1081. [Google Scholar] [CrossRef] [PubMed]
- Geiger, B.C.; Fischer, I.S. A Comparison of Variational Bounds for the Information Bottleneck Functional. Entropy 2020, 22, 1229. [Google Scholar] [CrossRef] [PubMed]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).