Entropy
  • Article
  • Open Access

22 January 2020

On the Difference between the Information Bottleneck and the Deep Information Bottleneck

A. Wieczorek * and V. Roth
Department of Mathematics and Computer Science, University of Basel, CH-4051 Basel, Switzerland
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue The Information Bottleneck in Deep Learning

Abstract

Combining the information bottleneck model with deep learning by replacing mutual information terms with deep neural nets has proven successful in areas ranging from generative modelling to interpreting deep neural networks. In this paper, we revisit the deep variational information bottleneck and the assumptions needed for its derivation. The two assumed properties of the data X, Y and their latent representation T take the form of two Markov chains T - X - Y and X - T - Y. Requiring both to hold during the optimisation process can be limiting for the set of potential joint distributions P(X, Y, T). We therefore show how to circumvent this limitation by optimising a lower bound for the mutual information between T and Y: I(T; Y), for which only the latter Markov chain has to be satisfied. The mutual information I(T; Y) can be split into two non-negative parts. The first part is the lower bound for I(T; Y), which is optimised in deep variational information bottleneck (DVIB) and cognate models in practice. The second part consists of two terms that measure how much the former requirement T - X - Y is violated. Finally, we propose interpreting the family of information bottleneck models as directed graphical models, and show that in this framework, the original and deep information bottlenecks are special cases of a fundamental IB model.

1. Introduction

Deep latent variable models, such as generative adversarial networks [1] and the variational autoencoder (VAE) [2], have attracted much interest in the last few years. They have been used in many applications and have formed a conceptual basis for a number of extensions. One popular deep latent variable model is the deep variational information bottleneck (DVIB) [3]. Its foundational idea is to apply deep neural networks to the information bottleneck (IB) model [4], which finds a sufficient statistic T of a given variable X while retaining side information about a variable Y.
The original IB model, as well as DVIB, assumes the Markov chain T - X - Y. Additionally, in the latter model, the Markov chain X - T - Y appears by construction. The relationship between the two assumptions and how it influences the set of potential solutions has been neglected so far. In this paper, we clarify this relationship by showing that it is possible to lift the original IB assumption in the context of the deep variational information bottleneck. This can be achieved by optimising a lower bound on the mutual information between T and Y, which follows naturally from the model’s construction. This explains why DVIB can optimise over a set of distributions which is not overly restrictive.
This paper is structured as follows. In Section 2, we describe the information bottleneck and deep variational information bottleneck models, along with their extensions. Section 3 introduces the lower bound on the mutual information that makes it possible to lift the original IB assumption T - X - Y, and provides an interpretation of this bound. It also contains the specification of the IB as a directed graphical model. We provide concluding remarks in Section 4.

3. The Difference between Information Bottleneck Models

In this section, we focus on the difference between the original and deep IB models. First, we examine how the different Markov assumptions lead to different forms that the I(Y; T) term admits. In Section 3.2, we consider both models and show that describing them as directed graphical models makes it possible to elucidate a fundamental property shared by all IB models. We then summarise the comparison in Section 3.3.

3.1. Clarifying the Discrepancy between the Assumptions in IB and DVIB

3.1.1. Motivation

The derivation of the deep variational information bottleneck model described in Section 2.2 uses the Markov assumption T - X - Y (last line of Equation (9), Figure 1a). At the same time, by construction, the model adheres to the data-generating process described by the following structural equations (η_T, η_Y are noise terms independent of X and T, respectively):
$$T = f_T(X, \eta_T), \qquad Y = f_Y(T, \eta_Y). \tag{10}$$
Figure 1. Markov assumptions for the information bottleneck and the deep information bottleneck.
This implies that the Markov chain X - T - Y is satisfied in the model, too (Figure 1b). Requiring that both Markov chains hold in the resulting joint distribution P(X, Y, T) can be overly restrictive (note that no directed acyclic graph with three vertices exists to which such a distribution is faithful). Thus, the question arises of whether the T - X - Y property in DVIB can be lifted. In what follows, we show that it is indeed possible.
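To make the data-generating process of Equation (10) concrete, here is a minimal sketch assuming a hypothetical linear-Gaussian choice of f_T and f_Y (all coefficients and noise scales are purely illustrative). It samples from the two structural equations and checks empirically that X - T - Y holds by construction: once T is accounted for, no dependence between X and Y remains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical linear-Gaussian instances of the structural equations in
# Equation (10); the coefficients and noise scales are purely illustrative.
X = rng.normal(size=n)
eta_T = 0.3 * rng.normal(size=n)        # noise independent of X
eta_Y = 0.3 * rng.normal(size=n)        # noise independent of T
T = 0.8 * X + eta_T                     # T = f_T(X, eta_T): the encoder
Y = 1.5 * T + eta_Y                     # Y = f_Y(T, eta_Y): the decoder

# X - T - Y holds by construction: in this linear-Gaussian case, the partial
# correlation of X and Y given T (correlation of the regression residuals)
# is approximately zero.
res_X = X - np.polyval(np.polyfit(T, X, 1), T)
res_Y = Y - np.polyval(np.polyfit(T, Y, 1), T)
print(np.corrcoef(X, Y)[0, 1])          # clearly non-zero: Y depends on X through T
print(np.corrcoef(res_X, res_Y)[0, 1])  # ~0: no dependence left once T is accounted for
```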
Recall from Section 2.2 that the DVIB model relies on sampling only from the data P(X, Y), the encoder P(T | X), and the decoder P(Y | T). Therefore, for optimising the latent IB, we want to avoid specifying the full conditional P(T | X, Y), since this would require us to explicitly model the joint influence of both X and Y on T (which might be a complex distribution). We now show how to bound I(T; Y) in a way that only involves sampling from the encoder P(T | X) and circumvents modelling P(T | X, Y), without using the T - X - Y assumption.

3.1.2. Bound Derivation

First, consider the mutual information I(T; Y) from the penultimate line of Equation (9) (i.e., without assuming the T - X - Y property):
$$I(T; Y) = \mathbb{E}_{P(X,Y)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T) + H(Y). \tag{11}$$
Now, rewrite Equation (11) using X - T - Y (i.e., X ⊥ Y | T: X and Y are conditionally independent given T):
$$\begin{aligned} I(T; Y) &= \mathbb{E}_{P(X,Y)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T) + H(Y) \\ &= \mathbb{E}_{P(X)} \mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T,X) + H(Y). \end{aligned} \tag{12}$$
Focusing on the term $\mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T,X)$ in Equation (12), we obtain:
$$\begin{aligned}
\mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T,X)
&= \int P(T,Y|X) \log \frac{P(T,Y|X)}{P(T|X)} \, dt\, dy \\
&= \int P(T,Y|X) \log \frac{P(Y|X)\, P(T,Y|X)}{P(Y|X)\, P(T|X)} \, dy\, dt \\
&= D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) + \int P(T,Y|X) \log P(Y|X) \, dt\, dy \\
&= D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) + \int P(Y|X) \log P(Y|X) \, dy \\
&= D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) + \int P(Y|X) P(T|X) \log P(Y|X) \, dt\, dy \\
&= D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) \\
&\quad + \int P(Y|X) P(T|X) \log \left( \frac{P(T|X)\, P(Y|X)}{P(T,Y|X)} \cdot \frac{P(T,Y|X)}{P(T|X)} \right) dt\, dy \\
&= D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) + D_{KL}\!\left( P(Y|X) P(T|X) \,\|\, P(Y,T|X) \right) \\
&\quad + \mathbb{E}_{P(T|X)} \mathbb{E}_{P(Y|X)} \log P(Y|T,X).
\end{aligned} \tag{13}$$
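Equation (13) is an algebraic identity and requires no conditional independence assumption. As a sanity check, the following sketch (discrete variables, numpy; all cardinalities and the distribution are arbitrary and purely illustrative) verifies it numerically for an arbitrary joint distribution over (T, Y), playing the role of P(T, Y | X = x) for a fixed x.

```python
import numpy as np

rng = np.random.default_rng(0)
nT, nY = 3, 4

# An arbitrary joint over (T, Y), standing in for P(T, Y | X = x);
# Equation (13) needs no conditional-independence assumption.
Q = rng.dirichlet(np.ones(nT * nY)).reshape(nT, nY)
Q_T = Q.sum(axis=1, keepdims=True)          # Q(t)
Q_Y = Q.sum(axis=0, keepdims=True)          # Q(y)
Q_YgT = Q / Q_T                             # Q(y | t)

lhs    = np.sum(Q * np.log(Q_YgT))                        # E_{Q(T,Y)} log Q(Y|T)
kl_fwd = np.sum(Q * np.log(Q / (Q_T * Q_Y)))              # D_KL(Q(T,Y) || Q(T)Q(Y))
kl_rev = np.sum(Q_T * Q_Y * np.log((Q_T * Q_Y) / Q))      # D_KL(Q(T)Q(Y) || Q(T,Y))
cross  = np.sum(Q_T * Q_Y * np.log(Q_YgT))                # E_{Q(T)Q(Y)} log Q(Y|T)

print(lhs, kl_fwd + kl_rev + cross)   # the two numbers agree up to floating-point error
```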
Plugging Equation (13) into Equation (12), i.e., averaging over X, and using X ⊥ Y | T again, we arrive at:
$$\begin{aligned}
I(T; Y) &= \mathbb{E}_{P(X)} \mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X,Y)} \log P(Y|T) + H(Y) \\
&= \mathbb{E}_{P(X)} \mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X)} \log P(Y|T) \\
&\quad + \mathbb{E}_{P(X)} D_{KL}\!\left( P(Y,T|X) \,\|\, P(Y|X) P(T|X) \right) \\
&\quad + \mathbb{E}_{P(X)} D_{KL}\!\left( P(Y|X) P(T|X) \,\|\, P(Y,T|X) \right) + H(Y) \\
&= \mathbb{E}_{P(X)} \mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X)} \log P(Y|T) + I(Y; T \,|\, X) + L(Y; T \,|\, X) + H(Y) \\
&\geq \mathbb{E}_{P(X)} \mathbb{E}_{P(Y|X)} \mathbb{E}_{P(T|X)} \log P(Y|T) + H(Y),
\end{aligned} \tag{14}$$
where I(Y; T | X) denotes the conditional mutual information and L(Y; T | X) the conditional lautum information [11].

3.1.3. Interpretation

According to Equation (14), the mutual information I(T; Y) consists of three parts: its lower bound E_{P(X)} E_{P(Y|X)} E_{P(T|X)} log P(Y|T), which is what is actually optimised in DVIB and its extensions; the terms I(Y; T | X) and L(Y; T | X), which are 0 when both Markov assumptions are satisfied; and the entropy term H(Y).
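As a numerical illustration of this decomposition, the sketch below (assuming discrete variables; all cardinalities and distributions are arbitrary and chosen at random) constructs a joint distribution that satisfies X - T - Y and verifies that I(T; Y) equals the lower bound plus I(Y; T | X), L(Y; T | X) and H(Y), so that the inequality of Equation (14) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
nT, nX, nY = 3, 4, 5

# Build a joint P(X, T, Y) satisfying X - T - Y (X and Y conditionally
# independent given T) by sampling P(T), P(X|T) and P(Y|T) at random.
P_T   = rng.dirichlet(np.ones(nT))                  # shape (nT,)
P_XgT = rng.dirichlet(np.ones(nX), size=nT)         # shape (nT, nX)
P_YgT = rng.dirichlet(np.ones(nY), size=nT)         # shape (nT, nY)
P = np.einsum('t,tx,ty->xty', P_T, P_XgT, P_YgT)    # P[x, t, y]

# Marginals and conditionals appearing in Equation (14).
P_X    = P.sum(axis=(1, 2))                         # P(x)
P_Y    = P.sum(axis=(0, 1))                         # P(y)
P_TY   = P.sum(axis=0)                              # P(t, y)
P_TgX  = P.sum(axis=2) / P_X[:, None]               # P(t | x)
P_YgX  = P.sum(axis=1) / P_X[:, None]               # P(y | x)
P_TYgX = P / P_X[:, None, None]                     # P(t, y | x)
P_YgT2 = P_TY / P_TY.sum(axis=1, keepdims=True)     # P(y | t)

# I(T; Y) and H(Y).
I_TY = np.sum(P_TY * np.log(P_TY / (P_TY.sum(axis=1, keepdims=True) *
                                    P_TY.sum(axis=0, keepdims=True))))
H_Y = -np.sum(P_Y * np.log(P_Y))

# Lower-bound term: E_{P(X)} E_{P(Y|X)} E_{P(T|X)} log P(Y|T).
LB = np.einsum('x,xy,xt,ty->', P_X, P_YgX, P_TgX, np.log(P_YgT2))

# Conditional mutual information I(Y;T|X) and conditional lautum term L(Y;T|X).
prod = P_YgX[:, None, :] * P_TgX[:, :, None]        # P(y|x) P(t|x), shape (nX, nT, nY)
I_YT_gX = np.sum(P_X[:, None, None] * P_TYgX * np.log(P_TYgX / prod))
L_YT_gX = np.sum(P_X[:, None, None] * prod * np.log(prod / P_TYgX))

print(f"I(T;Y)                          = {I_TY:.6f}")
print(f"LB + I(Y;T|X) + L(Y;T|X) + H(Y) = {LB + I_YT_gX + L_YT_gX + H_Y:.6f}")
print(f"lower bound LB + H(Y)           = {LB + H_Y:.6f}  (<= I(T;Y))")
```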
Equation (14) shows how to bound the mutual information term I(T; Y) in the IB model (Equation (4)) so that the value of the bound depends only on the data P(X, Y) and the distributions T | X and Y | T, without using the Markov assumption T - X - Y. If we again implement these distributions as deep neural nets P_φ(T | X) and P_θ(Y | T), Equation (14) provides the lower bound which is actually optimised in DVIB (Equation (9)). By training the networks, we find parameters φ and θ in P_φ(T | X) and P_θ(Y | T) such that both I(Y; T | X) and L(Y; T | X) are close to zero. The terms I(Y; T | X) and L(Y; T | X) can thus be interpreted as a measure of how much the original IB assumption T - X - Y is violated during the training of a model that implements X - T - Y by construction.
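The practical consequence is that the bound can be estimated from data samples, the encoder and the decoder alone. Below is a minimal Monte Carlo sketch of the term E_{P(X)} E_{P(Y|X)} E_{P(T|X)} log P(Y|T), assuming a hypothetical Gaussian encoder and decoder; mu_phi, sigma_phi and mu_theta stand in for the neural networks P_φ(T | X) and P_θ(Y | T) and are illustrative functions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder P_phi(T|X) = N(mu_phi(X), sigma_phi(X)^2)
# and the decoder P_theta(Y|T) = N(mu_theta(T), 1); purely illustrative.
mu_phi    = lambda x: 0.8 * x
sigma_phi = lambda x: 0.2 + 0.1 * np.abs(x)
mu_theta  = lambda t: 1.5 * t

def lower_bound_estimate(x, y, n_mc=16):
    """Monte Carlo estimate of E_{P(X)} E_{P(Y|X)} E_{P(T|X)} log P(Y|T):
    it only needs pairs (x, y) from the data and samples t from the encoder,
    never the full conditional P(T|X, Y)."""
    t = mu_phi(x)[None, :] + sigma_phi(x)[None, :] * rng.normal(size=(n_mc, x.size))
    log_p_y_given_t = -0.5 * (y[None, :] - mu_theta(t)) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_p_y_given_t.mean()

# Toy data drawn from some joint P(X, Y).
x = rng.normal(size=1000)
y = 1.2 * x + 0.5 * rng.normal(size=1000)
print(lower_bound_estimate(x, y))
```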
The difference between the original IB and DVIB is that in the former, T - X - Y is used to derive the general form of the solution T, while X - T - Y is approximated as closely as possible by T (as noted in [18]). In the latter, X - T - Y is enforced by construction, and T - X - Y is approximated by optimising the lower bound given by Equation (14). The "distance" to a distribution satisfying both assumptions is measured by the tightness of the bound.

3.2. The Original IB Assumption Revisited

3.2.1. Motivation for Conditional Independence Assumptions in Information Bottleneck Models

In the original formulation of the information bottleneck (Section 2.1 and Equation (4)), given by $\min_{P(T|X)} I(X; T) - \beta I(T; Y)$, one optimises over P(T | X) while disregarding any dependence on Y. This suggests that the defining feature of the IB model is the absence of a direct functional dependence of T on Y. This can be achieved, e.g., by the first structural equation in Equation (10):
$$T = f_T(X, \eta_T).$$
That means that any influence of Y on T must go through X. Note that this is implied by the original IB assumption T - X - Y, but not the other way around. In particular, the model given by X - T - Y can also be parametrised such that there is no direct dependence of T on Y, as in, e.g., Equation (10). This means that DVIB, despite optimising a lower bound on the IB objective, implements the defining feature of IB as well.
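For concreteness, the following sketch (discrete toy distributions; cardinalities and the value of β are illustrative) evaluates the IB functional I(X; T) − β I(T; Y) of Equation (4) for a candidate encoder P(T | X). The joint distribution is built from P(X, Y) and P(T | X) alone, so T has no direct dependence on Y, which is exactly the defining feature discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nY, nT, beta = 4, 3, 2, 5.0

# Toy data distribution P(X, Y) and a candidate encoder P(T | X).
P_XY  = rng.dirichlet(np.ones(nX * nY)).reshape(nX, nY)
P_TgX = rng.dirichlet(np.ones(nT), size=nX)              # shape (nX, nT)

# Under T - X - Y the joint factorises as P(X, Y) P(T | X); T is a
# (stochastic) function of X only, never of Y directly.
P_XYT = P_XY[:, :, None] * P_TgX[:, None, :]             # shape (nX, nY, nT)

def mi(P2):
    """Mutual information of a 2-D joint distribution (natural log)."""
    Pa = P2.sum(axis=1, keepdims=True)
    Pb = P2.sum(axis=0, keepdims=True)
    return float(np.sum(P2 * np.log(P2 / (Pa * Pb))))

I_XT = mi(P_XYT.sum(axis=1))          # I(X; T)
I_TY = mi(P_XYT.sum(axis=0).T)        # I(T; Y)
print(I_XT - beta * I_TY)             # IB objective of Equation (4) for this encoder
```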

3.2.2. Information Bottleneck as a Directed Graphical Model

The above discussion leads to the conclusion that the IB assumptions might also be described by directed graphical models. Such models encode conditional independence relations via d-separation (for the definition and examples of d-separation in directed acyclic graphs, see [20] or [21] (Chapters 1.2.3 and 11.1.2)). In particular, any pair of variables d-separated by Z is conditionally independent given Z. The arrows of the directed acyclic graph (DAG) are assumed to correspond to the data-generating process described by a set of structural equations (as in Equation (10)). Therefore, the following probability factorisation and data-generating process hold for a DAG model:
$$P(X_1, X_2, \ldots, X_n) = \prod_i P(X_i \,|\, \mathrm{pa}(X_i)),$$
$$X_i = f_i(\mathrm{pa}(X_i), U_i),$$
where pa(X_i) stands for the set of direct parents of X_i and the U_i are exogenous noise variables.
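A minimal sketch of this factorisation and the corresponding ancestral sampling, assuming a three-node chain DAG with binary variables (all conditionals are illustrative), follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy DAG X1 -> X2 -> X3 with binary variables, specified through the
# conditionals P(Xi | pa(Xi)); the numbers are illustrative.
P_X1    = np.array([0.3, 0.7])                       # P(X1)
P_X2gX1 = np.array([[0.9, 0.1], [0.2, 0.8]])         # P(X2 | X1)
P_X3gX2 = np.array([[0.6, 0.4], [0.1, 0.9]])         # P(X3 | X2)

# Joint via the factorisation P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X2).
P = np.einsum('a,ab,bc->abc', P_X1, P_X2gX1, P_X3gX2)
assert np.isclose(P.sum(), 1.0)

# The same model as a data-generating process: each Xi is sampled as a
# (noisy) function of its parents, in topological order.
def sample(n):
    x1 = rng.choice(2, size=n, p=P_X1)
    x2 = np.array([rng.choice(2, p=P_X2gX1[v]) for v in x1])
    x3 = np.array([rng.choice(2, p=P_X3gX2[v]) for v in x2])
    return np.stack([x1, x2, x3], axis=1)

print(P.shape, sample(5))
```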
Let us now focus again on the motivation for the T - X - Y assumption in Equation (4). It prevents the model from choosing the degenerate solution T = Y (in which case I(X; T) = const. and I(T; Y) attains its maximal value). Note, however, that while T - X - Y is a sufficient condition for such a solution to be excluded (which justifies the correctness of the original IB), the necessary condition is only that T cannot depend directly on Y. This means that the IB Markov assumption can indeed be reduced to requiring the absence of a direct arrow from Y to T in the underlying DAG. Note that this can be achieved in the undirected X - T - Y model too. One thus wishes to avoid degenerate solutions which impair the bottleneck nature of T: it should contain information about both X and Y, the trade-off between them being steered by β. It is therefore necessary to exclude DAG structures which encode independence of X and T, as well as of Y and T. Such independences are induced by collider structures (two different variables pointing towards a common child) in DAGs, i.e., T → Y ← X and T → X ← Y (they lead to the degenerate solutions I(X; T) = 0 and I(T; Y) = 0, respectively). To sum up, the goal of asserting the conditional independence assumption in Equation (4) is to avoid degenerate solutions which impair the bottleneck nature of the representation T. When modelling the information bottleneck with DAG structures, one therefore has to exclude the arrow Y → T and collider structures. A simple enumeration of the possible DAG models for the information bottleneck results in 10 distinct models, listed in Table 1.
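As a quick numerical confirmation of the collider argument (discrete variables; cardinalities and distributions are illustrative), the sketch below builds the collider T → X ← Y and checks that I(T; Y) = 0, i.e., exactly the degenerate case that has to be excluded.

```python
import numpy as np

rng = np.random.default_rng(0)
nT, nX, nY = 3, 3, 3

# Collider T -> X <- Y: T and Y are independent parents of X.
P_T = rng.dirichlet(np.ones(nT))
P_Y = rng.dirichlet(np.ones(nY))
P_XgTY = rng.dirichlet(np.ones(nX), size=(nT, nY))    # P(X | T, Y), shape (nT, nY, nX)

# Joint P(T, X, Y) = P(T) P(Y) P(X | T, Y), following the DAG factorisation.
P = np.einsum('t,y,tyx->txy', P_T, P_Y, P_XgTY)

# Marginalising out X gives P(T, Y) = P(T) P(Y), hence I(T; Y) = 0.
P_TY = P.sum(axis=1)
I_TY = np.sum(P_TY * np.log(P_TY / (P_TY.sum(axis=1, keepdims=True) *
                                    P_TY.sum(axis=0, keepdims=True))))
print(I_TY)   # ~0 up to floating-point error: the bottleneck carries nothing about Y
```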
Table 1. Directed graphical models of the information bottleneck.
As can be seen, considering the information bottleneck as a directed graphical model makes room for a family of models which fall into three broad categories: those satisfying the undirected Markov assumption T - X - Y, those satisfying X - T - Y, as described in Section 3.1, and those satisfying neither of them (see Table 1). The difference between particular models lies in the necessity to specify and parametrise different conditional distributions, which might lead to situations in which no joint distribution P(X, Y, T) exists (which is likely to be the case in the third category). Focusing on the first two categories, we see that the former corresponds to the standard parametrisations of the information bottleneck and the Gaussian information bottleneck (see Section 2.1). In the latter, we see the deep information bottleneck (Equation (10)) as the first DAG. Note also that the second DAG satisfying the X - T - Y assumption in Table 1 defines the probabilistic CCA model [22]. This is not surprising, since the solutions of CCA and the Gaussian information bottleneck use eigenvectors of the same matrix [5].

3.3. Comparing IB and DVIB Assumptions

The original and deep information bottleneck models differ in the Markov assumptions (see Figure 1) used in the derivation of the respective solutions. As demonstrated in Section 3.1, DVIB optimises a lower bound on the objective function of IB. The tightness of the bound measures to what extent the IB assumption (Figure 1a) is violated. As described in Section 3.2, characterising both models as directed graphical models results in two different DAGs for the IB and DVIB. Both models are summarised in Table 2.
Table 2. Comparison of the information bottleneck and deep variational information bottleneck.

4. Conclusions

In this paper, we showed how to lift the information bottleneck’s Markov assumption T - X - Y in the context of the deep information bottleneck model, in which X - T - Y holds by construction. This result explains why standard implementations of the deep information bottleneck can optimise over a larger set of joint distributions P(X, T, Y) while only specifying the encoder T | X. It is made possible by optimising the lower bound on the mutual information I(T; Y) provided here, rather than the full mutual information. We also provided a description of the information bottleneck as a DAG model and showed that it is possible to identify a fundamental necessary feature of the IB in the language of directed graphical models. This property is satisfied by both the original and deep information bottlenecks.

Author Contributions

Conceptualization, A.W.; Supervision, V.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Swiss National Science Foundation grant number CR32I2159682.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680. [Google Scholar]
  2. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  3. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410. [Google Scholar]
  4. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057. [Google Scholar]
  5. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
  6. Rey, M.; Roth, V.; Fuchs, T. Sparse meta-Gaussian information bottleneck. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21–26 June 2014; pp. 910–918. [Google Scholar]
  7. Wieczorek, A.; Wieser, M.; Murezzan, D.; Roth, V. Learning Sparse Latent Representations with the Deep Copula Information Bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  8. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop, ITW, Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5. [Google Scholar] [CrossRef]
  9. Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
  10. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech.-Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
  11. Palomar, D.P.; Verdú, S. Lautum information. IEEE Trans. Inf. Theory 2008, 54, 964–975. [Google Scholar] [CrossRef]
  12. Alemi, A.A.; Poole, B.; Fischer, I.; Dillon, J.V.; Saurous, R.A.; Murphy, K. Fixing a broken ELBO. arXiv 2017, arXiv:1711.00464. [Google Scholar]
  13. Alemi, A.A.; Fischer, I. GILBO: One metric to measure them all. arXiv 2018, arXiv:1802.04874. [Google Scholar]
  14. Painsky, A.; Tishby, N. Gaussian lower bound for the information bottleneck limit. J. Mach. Learn. Res. 2017, 18, 7908–7936. [Google Scholar]
  15. Hotelling, H. Relations between two sets of variates. In Breakthroughs in Statistics; Springer: Berlin, Germany, 1992; pp. 162–190. [Google Scholar]
  16. Zhao, S.; Song, J.; Ermon, S. Infovae: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5885–5892. [Google Scholar]
  17. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. arXiv 2016, arXiv:1606.03657. [Google Scholar]
  18. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905. [Google Scholar] [CrossRef] [PubMed]
  19. Achille, A.; Soatto, S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018, 19, 1947–1980. [Google Scholar]
  20. Lauritzen, S.L. Graphical Models; Clarendon Press: Oxford, UK, 1996; Volume 17. [Google Scholar]
  21. Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  22. Bach, F.R.; Jordan, M.I. A Probabilistic Interpretation of Canonical Correlation Analysis; University of California: Berkeley, CA, USA, 2005. [Google Scholar]
