On the Difference between the Information Bottleneck and the Deep Information Bottleneck

Combining the information bottleneck model with deep learning by replacing mutual information terms with deep neural nets has proven successful in areas ranging from generative modelling to interpreting deep neural networks. In this paper, we revisit the deep variational information bottleneck and the assumptions needed for its derivation. The two assumed properties of the data, X and Y, and their latent representation T, take the form of two Markov chains T−X−Y and X−T−Y. Requiring both to hold during the optimisation process can be limiting for the set of potential joint distributions P(X,Y,T). We, therefore, show how to circumvent this limitation by optimising a lower bound for the mutual information between T and Y: I(T;Y), for which only the latter Markov chain has to be satisfied. The mutual information I(T;Y) can be split into two non-negative parts. The first part is the lower bound for I(T;Y), which is optimised in deep variational information bottleneck (DVIB) and cognate models in practice. The second part consists of two terms that measure how much the former requirement T−X−Y is violated. Finally, we propose interpreting the family of information bottleneck models as directed graphical models, and show that in this framework, the original and deep information bottlenecks are special cases of a fundamental IB model.


Introduction
Deep latent variable models such as generative adversarial networks (Goodfellow et al., 2014) or the variational autoencoder (VAE) (Kingma and Welling, 2013) have attracted much interest in the last few years. They have been used across many application areas and formed a conceptual basis for a number of extensions. One popular deep latent variable model is the deep variational information bottleneck (DVIB) (Alemi et al., 2016). Its foundational idea is that of applying deep neural networks to the information bottleneck (IB) model (Tishby et al., 2000), which finds a sufficient statistic T of a given variable X while retaining side information about a variable Y.
The original IB model as well as DVIB assume the Markov chain T − X − Y. Additionally, in the latter model, the Markov chain X − T − Y appears by construction.
The relationship between the two assumptions, and how it influences the set of potential solutions, has been neglected so far. In this paper, we clarify this relationship by showing that it is possible to lift the original IB assumption in the context of the deep variational information bottleneck. This can be achieved by optimising a lower bound on the mutual information between T and Y which follows naturally from the model's construction. This explains why DVIB can optimise over a set of distributions which is not overly restrictive.
This paper is structured as follows. In Section 2 we describe the information bottleneck and deep variational information bottleneck models along with their extensions. Section 3 introduces the lower bound on the mutual information which makes it possible to lift the original IB T − X − Y assumption, as well as the interpretation of this bound. It also contains the specification of IB as a directed graphical model. We provide concluding remarks in Section 4.

Related Work on the Deep Information Bottleneck Model
The information bottleneck was originally introduced by Tishby et al. (2000) as a compression technique in which a random variable X is compressed while preserving relevant information about another random variable Y. The problem was originally formulated using only information theoretic concepts. No analytical solution exists for the original formulation; however, the additional assumption that X and Y are jointly Gaussian distributed leads to a special case of the IB, the Gaussian information bottleneck, introduced by Chechik et al. (2005), where the optimal compression is also Gaussian distributed. The Gaussian information bottleneck has been further extended to sparse compression and to meta-Gaussian distributions (multivariate distributions with a Gaussian copula and arbitrary marginal densities) by Rey et al. (2014). The idea of applying deep neural networks to model the information common to X and T as well as Y and T has resulted in the formulation of the deep variational information bottleneck (Alemi et al., 2016). This model has been extended to account for invariance to monotonic transformations of the input variables by Wieczorek et al. (2018).
The information bottleneck method has also recently been applied to the analysis of deep neural networks (Tishby and Zaslavsky, 2015), by quantifying mutual information between the network layers and deriving an information theoretic limit on DNN efficiency. This has led to attempts at explaining the behaviour of deep neural networks with the IB formalism (Shwartz-Ziv and Tishby, 2017; Saxe et al., 2018).
We now proceed to formally define the IB and DVIB models.
Notation. Throughout this paper, we adopt the following notation. Define the Kullback-Leibler divergence between two (discrete or continuous) probability distributions P and Q as

$$D_{KL}\big(P(X)\,\|\,Q(X)\big) = \mathbb{E}_{P(X)}\left[\log \frac{P(X)}{Q(X)}\right].$$

Note that the KL divergence is always non-negative. The mutual information between X and Y is defined as

$$I(X;Y) = D_{KL}\big(P(X,Y)\,\|\,P(X)P(Y)\big). \tag{1}$$

Since the KL divergence is not symmetric, the divergence between the product of the marginals and the joint distribution has also been defined as the lautum information (Palomar and Verdú, 2008):

$$L(X;Y) = D_{KL}\big(P(X)P(Y)\,\|\,P(X,Y)\big). \tag{2}$$

Both quantities have conditional counterparts:

$$I(X;Y|Z) = \mathbb{E}_{P(Z)}\, D_{KL}\big(P(X,Y|Z)\,\|\,P(X|Z)P(Y|Z)\big), \qquad L(X;Y|Z) = \mathbb{E}_{P(Z)}\, D_{KL}\big(P(X|Z)P(Y|Z)\,\|\,P(X,Y|Z)\big). \tag{3}$$

Let $H[X] = -\mathbb{E}_{P(X)}[\log P(X)]$ denote entropy for discrete and differential entropy for continuous X. Analogously, $H[P(X|Y)] = -\mathbb{E}_{P(X,Y)}[\log P(X|Y)]$ denotes conditional entropy for discrete and conditional differential entropy for continuous X and Y.
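For discrete variables, the three quantities defined in the notation paragraph can be sketched in a few lines. The following is a minimal illustration, assuming the joint distribution is given as a 2-D probability table; the function names `kl`, `mutual_information` and `lautum_information` are our own and not part of the original paper.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats for two probability arrays of the same shape."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with P = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(pxy):
    """I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy, px * py)

def lautum_information(pxy):
    """L(X;Y) = D_KL(P(X)P(Y) || P(X,Y)); assumes P(X,Y) > 0 wherever P(X)P(Y) > 0."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(px * py, pxy)

# Illustrative joint table (an assumption, not data from the paper).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
mi = mutual_information(pxy)
li = lautum_information(pxy)
independent = mutual_information(np.outer([0.3, 0.7], [0.2, 0.8]))
```

Both quantities vanish for an independent joint, but they generally differ otherwise, which reflects the asymmetry of the KL divergence.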

Information Bottleneck
Given two random vectors X and Y, the Information Bottleneck method (Tishby et al., 2000) searches for a third random vector T which, while compressing X, preserves information contained in Y. The resulting variational problem is defined as follows:

$$\min_{P(T|X)} I(X;T) - \beta I(T;Y), \tag{4}$$

where β is a parameter defining the trade-off between compression of X and preservation of Y. The solution is the optimal conditional distribution of T|X. No analytical solution exists for the general IB problem defined by Eq. (4); however, for discrete X and Y a numerical approximation of the optimal distribution T can be found with the Blahut-Arimoto algorithm for rate-distortion function calculation (Tishby et al., 2000). Note that the assumed property T − X − Y of the solution is used in the derivation of the model.
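For discrete X and Y, the Blahut-Arimoto-style iteration mentioned above can be sketched as follows. This is a minimal illustration of the self-consistent updates of Tishby et al. (2000), assuming a strictly positive joint table P(X, Y); the function names, the random initialisation, the clipping constant `eps` and the example table are our own choices.

```python
import numpy as np

def mi(pab):
    """Mutual information (nats) of a 2-D joint probability table."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log(pab[mask] / (pa * pb)[mask])))

def iterative_ib(pxy, n_t, beta, n_iter=200, seed=0, eps=1e-12):
    """Alternate the IB updates for P(T|X), P(T) and P(Y|T)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                          # P(X)
    py_x = pxy / px[:, None]                      # P(Y|X), assumed strictly positive
    pt_x = rng.random((len(px), n_t))             # random initial encoder P(T|X)
    pt_x /= pt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = np.clip(px @ pt_x, eps, None)        # P(T)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]   # P(Y|T), uses T - X - Y
        py_t = np.clip(py_t, eps, None)
        # D_KL(P(Y|x) || P(Y|t)) for every pair (x, t)
        log_ratio = np.log(py_x[:, None, :]) - np.log(py_t[None, :, :])
        d = np.sum(py_x[:, None, :] * log_ratio, axis=2)
        logits = np.log(pt)[None, :] - beta * d   # log of P(T) exp(-beta * D_KL)
        logits -= logits.max(axis=1, keepdims=True)
        pt_x = np.exp(logits)
        pt_x /= pt_x.sum(axis=1, keepdims=True)   # normalise over T
    return pt_x

# Illustrative joint table (an assumption, not data from the paper).
pxy = np.array([[0.30, 0.05],
                [0.25, 0.05],
                [0.05, 0.10],
                [0.05, 0.15]])
pt_x = iterative_ib(pxy, n_t=2, beta=5.0)
px = pxy.sum(axis=1)
pty = (pt_x * px[:, None]).T @ (pxy / px[:, None])   # P(T, Y) under T - X - Y
i_xt, i_ty, i_xy = mi(pt_x * px[:, None]), mi(pty), mi(pxy)
```

Because P(T, Y) is obtained by passing X through the encoder, the data processing inequality guarantees that I(T; Y) never exceeds I(X; T) or I(X; Y), whatever encoder the iteration returns.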
Gaussian Information Bottleneck. For Gaussian distributed (X, Y), let the partitioning of the joint covariance matrix be denoted as follows:

$$\Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^\top & \Sigma_Y \end{pmatrix}. \tag{5}$$

The assumption that X and Y are jointly Gaussian distributed leads to the Gaussian information bottleneck (Chechik et al., 2005), where the solution T of Eq. (4) is also Gaussian distributed. T is then a noisy linear projection of X, i.e. $T = AX + \xi$, where $\xi \sim N(0, \Sigma_\xi)$ is independent of X. This means that $T \sim N(0, A\Sigma_X A^\top + \Sigma_\xi)$. The IB optimisation problem defined in Eq. (4) becomes an optimisation problem over the matrix A and the noise covariance matrix $\Sigma_\xi$:

$$\min_{A,\,\Sigma_\xi} I(X;T) - \beta I(T;Y). \tag{6}$$

Recall that for n-dimensional Gaussian distributed random variables, entropy, and hence mutual information, have the following form:

$$H[X] = \frac{1}{2}\log\big((2\pi e)^n\,|\Sigma_X|\big), \qquad I(X;Y) = \frac{1}{2}\log\frac{|\Sigma_X|}{|\Sigma_{X|Y}|}, \tag{7}$$

where $\Sigma_X$ and $\Sigma_{X|Y}$ denote the covariance matrices of X and X|Y, respectively. The Gaussian information bottleneck problem has an analytical solution, given by Chechik et al. (2005): for a fixed β, Eq. (6) is optimised by $\Sigma_\xi = I$ and A having an analytical form depending on $\Sigma_X$ and the eigenvectors and eigenvalues of $\Sigma_{X|Y}\Sigma_X^{-1}$. Here again, the T − X − Y assumption is used in the derivation of the solution.
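The two mutual information terms of the Gaussian IB objective can be evaluated in closed form for any given projection. The sketch below is an illustration under the stated model T = AX + ξ with ξ independent of (X, Y); it evaluates the objective terms for an arbitrary A rather than computing the optimal A of Chechik et al. (2005). The function names and the random example covariance are our own.

```python
import numpy as np

def logdet(m):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(m)[1]

def gaussian_ib_terms(sigma_x, sigma_xy, sigma_y, a, sigma_xi):
    """Return (I(X;T), I(T;Y)) in nats for T = A X + xi, xi ~ N(0, sigma_xi)."""
    # Conditional covariance of X given Y (Schur complement).
    sigma_x_given_y = sigma_x - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)
    sigma_t = a @ sigma_x @ a.T + sigma_xi                    # Cov(T)
    sigma_t_given_y = a @ sigma_x_given_y @ a.T + sigma_xi    # Cov(T | Y)
    i_xt = 0.5 * (logdet(sigma_t) - logdet(sigma_xi))         # Cov(T | X) = sigma_xi
    i_ty = 0.5 * (logdet(sigma_t) - logdet(sigma_t_given_y))
    return i_xt, i_ty

# Illustrative example: X in R^3, Y in R^2, T in R^2 (all values are assumptions).
rng = np.random.default_rng(0)
b = rng.normal(size=(5, 5))
sigma = b @ b.T + 5 * np.eye(5)           # positive definite joint covariance
sigma_x, sigma_xy, sigma_y = sigma[:3, :3], sigma[:3, 3:], sigma[3:, 3:]
a = rng.normal(size=(2, 3))
i_xt, i_ty = gaussian_ib_terms(sigma_x, sigma_xy, sigma_y, a, np.eye(2))
objective = i_xt - 2.0 * i_ty             # Eq. (6) with beta = 2
```

Since T depends on Y only through X in this construction, I(T; Y) cannot exceed I(X; T).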
Sparse Gaussian Information Bottleneck. Sparsity of the compression in the Gaussian IB can be ensured by requiring the projection matrix A to be diagonal, i.e. $A = \mathrm{diag}(a_1, \ldots, a_n)$. It has been shown by Rey et al. (2014) that, since $\log|A\Sigma A^\top + I| = \log|\Sigma A^\top A + I|$ for any positive definite Σ and symmetric A, the sparsity requirement simplifies Eq. (6) to minimisation over diagonal matrices with positive entries:

$$\min_{D = \mathrm{diag}(d_1, \ldots, d_n)} I(X;T) - \beta I(T;Y),$$

with $d_i = a_i^2$ and $\xi \sim N(0, I)$ independent of X.
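The determinant identity used in the sparse formulation is easy to check numerically. The snippet below is a small sanity check on a randomly generated positive definite matrix; the dimension and the sampled values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
b = rng.normal(size=(n, n))
sigma = b @ b.T + n * np.eye(n)                 # positive definite Sigma
a = np.diag(rng.uniform(0.1, 2.0, size=n))      # diagonal projection A
d = a.T @ a                                     # D = diag(a_i^2)

# log|A Sigma A^T + I| versus log|Sigma A^T A + I|
lhs = np.linalg.slogdet(a @ sigma @ a.T + np.eye(n))[1]
rhs = np.linalg.slogdet(sigma @ d + np.eye(n))[1]
```

The two log-determinants agree up to floating point error, so the objective depends on A only through the diagonal matrix D.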

Deep Variational Information Bottleneck
The deep variational information bottleneck (Alemi et al., 2016) is a variational approach to the problem defined in Eq. (4). The main idea is to parametrise the conditionals P(T|X) and P(Y|T) with neural networks so that the two mutual informations in Eq. (4) can be directly recovered from two deep neural nets. To this end, one can express the mutual informations as follows:

$$I(X;T) = D_{KL}\big(P(T|X)P(X)\,\|\,P(T)P(X)\big) = \mathbb{E}_{P(X)}\, D_{KL}\big(P(T|X)\,\|\,P(T)\big), \tag{8}$$

$$\begin{aligned}
I(T;Y) &= D_{KL}\Big(\Big[\int P(T|Y,X)\,P(Y,X)\,dx\Big]\,\Big\|\,P(T)P(Y)\Big) \\
&= \mathbb{E}_{P(X,Y)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T) + H(Y) \\
&= \mathbb{E}_{P(X,Y)}\,\mathbb{E}_{P(T|X)} \log P(Y|T) + H(Y),
\end{aligned} \tag{9}$$

where the last equality in Eq. (9) follows from the Markov assumption T − X − Y in the information bottleneck model: $P(T|X,Y) = P(T|X)$. The conditional Y|T is computed by sampling from the latent representation T as in the variational autoencoder (Kingma and Welling, 2013). Note that this form of the DVIB makes sure that one is only required to sample from the data distribution P(X, Y), the variational decoder $P_\theta(Y|T)$, and the stochastic encoder $P_\phi(T|X)$, implemented as deep neural networks parametrised by θ and φ, respectively. In the latter, T depends only on X because of the T − X − Y assumption.
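To make the sampling scheme concrete, the following sketch evaluates a Monte Carlo estimate of the DVIB objective on a mini-batch. It is an illustration under several assumptions that are not stated in this section: the encoder is a diagonal Gaussian, the marginal P(T) is replaced by a standard normal (as is common in variational implementations), the decoder is a softmax classifier, and plain linear maps (`w_mu`, `w_logvar`, `w_dec`, all hypothetical names) stand in for the neural networks. The sign convention follows Eq. (4), with the constant H(Y) dropped.

```python
import numpy as np

def dvib_loss(x, y, w_mu, w_logvar, w_dec, beta, rng):
    """Single-sample estimate of I(X;T) - beta * E[log P(Y|T)] on a batch."""
    mu = x @ w_mu                      # encoder mean, shape (n, k)
    logvar = x @ w_logvar              # encoder log-variance, shape (n, k)
    # Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)) for each sample.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    # Reparameterised sample T ~ P_phi(T|X); it depends on X only.
    t = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Decoder P_theta(Y|T) as a softmax over class logits.
    logits = t @ w_dec
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_lik = log_probs[np.arange(len(y)), y]
    return float(np.mean(kl) - beta * np.mean(log_lik))

# Illustrative batch: 8 samples, 5 features, 3 classes, 2 latent dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
y = rng.integers(0, 3, size=8)
k = 2
loss = dvib_loss(x, y, 0.1 * rng.normal(size=(5, k)), 0.1 * rng.normal(size=(5, k)),
                 rng.normal(size=(k, 3)), beta=1.0, rng=rng)
# With all weights zero the encoder equals the prior and the decoder is uniform.
zero_loss = dvib_loss(x, y, np.zeros((5, k)), np.zeros((5, k)),
                      np.zeros((k, 3)), beta=1.0, rng=rng)
```

In the all-zero case the KL term vanishes and the decoder assigns probability 1/3 to every class, so the loss reduces to log 3.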
Deep Copula Information Bottleneck. Alemi et al. (2016) argue that the entropy term H(Y) in the last line of Eq. (9) can be omitted, since it is constant. It has however been pointed out (Wieczorek et al., 2018) that the IB solution should be invariant to monotonic transformations of both X and Y, since the problem is defined only in terms of mutual informations, which exhibit such invariance (i.e. I(X; T) = I(f(X); T) for an invertible f). The term remaining in Eq. (9) after leaving out H(Y) does not have this property. Furthermore, problems limiting the DVIB when specifying marginal distributions of T|X and Y|T in Eqs. (8) and (9) have been identified (Wieczorek et al., 2018). These considerations have led to the formulation of the deep copula information bottleneck, where the data are subject to the transformation $\tilde{X} = \Phi^{-1}(\hat{F}(X))$, where Φ and $\hat{F}$ are the Gaussian and empirical cdfs, respectively. This transformation makes the data depend only on their copula and not on the marginals. This has also been shown to result in superior interpretability and disentanglement of the latent space T.
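The copula transformation can be sketched for a single one-dimensional variable as follows. This is a minimal illustration, assuming a rank-based empirical cdf rescaled by n + 1 so that its values stay strictly inside (0, 1); the function name and this particular rescaling are our own choices and may differ from the implementation of Wieczorek et al. (2018).

```python
import numpy as np
from statistics import NormalDist

def gaussian_copula_transform(x):
    """Map a 1-D sample to normal scores via its empirical cdf."""
    x = np.asarray(x, float)
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1     # ranks 1..n (ties broken by position)
    u = ranks / (n + 1)                       # empirical cdf values in (0, 1)
    std_normal = NormalDist()                 # standard Gaussian, mean 0 and sd 1
    return np.array([std_normal.inv_cdf(p) for p in u])

rng = np.random.default_rng(0)
x = rng.normal(size=50)
z = gaussian_copula_transform(x)
z_exp = gaussian_copula_transform(np.exp(x))  # strictly increasing transformation of x
```

Because only the ranks enter the transformation, any strictly increasing map of the input leaves the output unchanged, which is the invariance property discussed above.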

Bounds on Mutual Information in Deep Latent Variable Models
The deep information bottleneck model can be thought of as an extension of the VAE. Indeed, one can incorporate a variational approximation Q(Y|T) of the posterior P(Y|T) to Eq. (9) and, by the non-negativity of the KL divergence between P(Y|T) and Q(Y|T), obtain a variational lower bound on I(T; Y) (Alemi et al., 2016). A number of other bounds and approximations of mutual information have been considered in the literature. Many of them are motivated by obtaining a better representation of the latent space T. Alemi et al. (2017) consider different encoding distributions Q(T|X) and derive a common bound for I(X; T) on the rate-distortion plane. They subsequently extend this bound to the case where it is independent of the sample, which makes it possible to compare VAEs and GANs (Alemi and Fischer, 2018). Painsky and Tishby (2017) use a Gaussian relaxation of the mutual information terms in the information bottleneck to bound them from below. They then proceed to compare the resulting method to Canonical Correlation Analysis (Hotelling, 1992).
Extensions of generative models with an explicit regularisation in the form of a mutual information term have been proposed (Zhao et al., 2019; Chen et al., 2016). In the latter, an explicit lower bound on the mutual information between the latent space T and the generator network is derived.
Similarly, implicit regularisation of generative models in the form of dropout has been shown to be equivalent to the deep information bottleneck model (Achille and Soatto, 2018a,b).The authors also mention that both Markov properties should hold in the IB solution as well as note that T − X − Y is enforced by construction while X − T − Y is only approximated by the optimal joint distribution of X, Y and T .They do not, however, analyse the impact of both Markov assumptions and the relationship between them.

The Difference Between Information Bottleneck Models
In this section, we focus on the difference between the original and deep IB models. First, we examine how the different Markov assumptions lead to different forms that the I(Y; T) term admits. In Section 3.2, we consider both models and show that describing them as directed graphical models makes it possible to elucidate a fundamental property shared by all IB models. We then proceed to summarise the comparison in Section 3.3.

Clarifying the Discrepancy between the Assumptions in IB and DVIB
Motivation. The derivation of the deep variational information bottleneck model described in Section 2.2 uses the Markov assumption T − X − Y (last line of Eq. (9), Fig. 1a). At the same time, by construction, the model adheres to the data generating process described by the following structural equations ($\eta_T$, $\eta_Y$ are noise terms independent of X and T, respectively):

$$T = f_T(X, \eta_T), \qquad Y = f_Y(T, \eta_Y). \tag{10}$$

This implies that the Markov chain X − T − Y is satisfied in the model, too (Fig. 1b). Requiring that both Markov chains hold in the resulting joint distribution P(X, Y, T) can be overly restrictive (note that no DAG with three vertices exists to which such a distribution is faithful). Thus, the question arises whether the T − X − Y property in DVIB can be lifted. In what follows we show that it is indeed possible.
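The tension between the two Markov chains can be illustrated on a small discrete example. The sketch below builds a joint distribution from the generating process X → T → Y with randomly drawn conditionals (an assumption made purely for illustration) and computes the two conditional mutual informations; the helper `cond_mi` is our own.

```python
import numpy as np

def cond_mi(p, a, b, c):
    """I(A;B|C) in nats for a 3-way joint table p; a, b, c are axis indices."""
    p = np.transpose(p, (a, b, c))
    pc = p.sum(axis=(0, 1), keepdims=True)
    pac = p.sum(axis=1, keepdims=True)
    pbc = p.sum(axis=0, keepdims=True)
    mask = p > 0
    ratio = (p * pc)[mask] / (pac * pbc)[mask]
    return float(np.sum(p[mask] * np.log(ratio)))

rng = np.random.default_rng(0)
px = rng.dirichlet(np.ones(3))                  # P(X)
pt_x = rng.dirichlet(np.ones(3), size=3)        # P(T|X), one row per value of X
py_t = rng.dirichlet(np.ones(3), size=3)        # P(Y|T), one row per value of T
# Joint P(x, t, y) = P(x) P(t|x) P(y|t), with axes ordered (X, T, Y).
p = px[:, None, None] * pt_x[:, :, None] * py_t[None, :, :]

i_xy_given_t = cond_mi(p, 0, 2, 1)   # vanishes: X - T - Y holds by construction
i_ty_given_x = cond_mi(p, 1, 2, 0)   # generally positive: T - X - Y is violated
```

For generic conditionals the second quantity is strictly positive, so a model built as in Eq. (10) satisfies X − T − Y but not T − X − Y.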

Figure 1 panels: (a) The original IB assumption, T − X − Y. (b) The DVIB assumption, X − T − Y.

Recall from Section 2.2 that the DVIB model relies on sampling only from the data P(X, Y), the encoder P(T|X) and the decoder P(Y|T). Therefore, for optimising the latent IB, we want to avoid specifying the full conditional P(T|X, Y), since this would require us to explicitly model the joint influence of both X and Y on T (which might be a complicated distribution). We now proceed to show how to bound I(T; Y) in a way that only involves sampling from the encoder P(T|X) and circumvents modelling P(T|X, Y) without using the T − X − Y assumption.
Bound derivation. First, adopt the mutual information I(T; Y) from the penultimate line of Eq. (9) (i.e. without assuming the T − X − Y property):

$$I(T;Y) = \mathbb{E}_{P(X,Y)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T) + H(Y). \tag{11}$$

Now, rewrite Eq. (11) using X − T − Y (i.e. X ⊥⊥ Y | T), so that P(Y|T) = P(Y|T, X):

$$I(T;Y) = \mathbb{E}_{P(X)}\,\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T,X) + H(Y). \tag{12}$$

Focusing on $\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T,X)$ in Eq. (12), we obtain:

$$\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T,X) = D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + D_{KL}\big(P(Y|X)P(T|X)\,\|\,P(Y,T|X)\big) + \mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X)} \log P(Y|T,X). \tag{13}$$

Plugging Eq. (13) into Eq. (12), i.e. averaging over X, and using X ⊥⊥ Y | T again, we arrive at:

$$I(T;Y) = I(Y;T|X) + L(Y;T|X) + \mathbb{E}_{P(X)}\,\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X)} \log P(Y|T) + H(Y). \tag{14}$$

Interpretation. According to Eq. (14), the mutual information I(T; Y) consists of 3 terms: its lower bound $\mathbb{E}_{P(X)}\,\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X)} \log P(Y|T)$, which is actually optimised in DVIB and its extensions; I(Y; T|X) + L(Y; T|X), which are 0 when both Markov assumptions are satisfied; and the entropy term H(Y). Equation (14) shows how to bound the mutual information term I(T; Y) in the IB model (Eq. (4)) so that the value of the bound depends on the data P(X, Y) and marginals T|X, Y|T without using the Markov assumption T − X − Y. If we now again implement the marginal distributions as deep neural nets $P_\phi(T|X)$ and $P_\theta(Y|T)$, Eq. (14) provides the lower bound which is actually optimised in DVIB (Eq. (9)).

The difference between the original IB and DVIB is that in the former, T − X − Y is used to derive the general form of the solution T while X − T − Y is approximated as closely as possible by T (as noted by Achille and Soatto (2018a)). In the latter, X − T − Y is forced by construction, and T − X − Y is approximated by optimising the lower bound given by Eq. (14). The 'distance' to a distribution satisfying both assumptions is measured by the tightness of the bound.
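The decomposition in Eq. (14), read here as I(T; Y) = I(Y; T|X) + L(Y; T|X) + E_{P(X)} E_{P(Y|X)} E_{P(T|X)} log P(Y|T) + H(Y), can be checked numerically on a discrete toy model. The sketch below assumes a joint distribution generated as X → T → Y with randomly drawn, strictly positive conditionals; the variable names and table sizes are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nt, ny = 3, 4, 3
px = rng.dirichlet(np.ones(nx))                  # P(X)
pt_x = rng.dirichlet(np.ones(nt), size=nx)       # encoder P(T|X)
py_t = rng.dirichlet(np.ones(ny), size=nt)       # decoder P(Y|T)
p = px[:, None, None] * pt_x[:, :, None] * py_t[None, :, :]   # axes (x, t, y)

# Left-hand side: I(T;Y), and the entropy H(Y).
pty = p.sum(axis=0)
pt, py = pty.sum(axis=1), pty.sum(axis=0)
i_ty = np.sum(pty * np.log(pty / np.outer(pt, py)))
h_y = -np.sum(py * np.log(py))

# Conditional quantities given X.
py_x = p.sum(axis=1) / px[:, None]               # P(y|x)
pty_x = p / px[:, None, None]                    # P(t, y | x)
prod = pt_x[:, :, None] * py_x[:, None, :]       # P(t|x) P(y|x)
w = px[:, None, None]                            # weights P(x)

i_cond = np.sum(w * pty_x * np.log(pty_x / prod))     # I(Y;T|X)
l_cond = np.sum(w * prod * np.log(prod / pty_x))      # L(Y;T|X)
bound = np.sum(w * prod * np.log(py_t)[None, :, :])   # E_x E_{y|x} E_{t|x} log P(y|t)

rhs = i_cond + l_cond + bound + h_y
```

The gap between I(T; Y) and the optimised part, bound + H(Y), equals I(Y; T|X) + L(Y; T|X), which is non-negative and vanishes only if T − X − Y holds as well.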

The Original IB Assumption Revisited
Motivation for conditional independence assumptions in information bottleneck models. In the original formulation of the information bottleneck (Section 2.1 and Eq. (4)), given by $\min_{P(T|X)} I(X;T) - \beta I(T;Y)$, one optimises over P(T|X) while disregarding any dependence on Y. This suggests that the defining feature of the IB model is the absence of a direct functional dependence of T on Y. This can be achieved e.g. by the first structural equation in Eq. (10):

$$T = f_T(X, \eta_T).$$

It means that any influence of Y on T must go through X. Note that this is implied by the original IB assumption T − X − Y, but not the other way around. In particular, the model given by X − T − Y can also be parametrised such that there is no direct dependence of T on Y, e.g. as in Eq. (10). This means that DVIB, despite optimising a lower bound on the IB, implements the defining feature of IB as well.
Information bottleneck as a directed graphical model. The above discussion leads to the conclusion that the IB assumptions might also be described by directed graphical models. Such models encode conditional independence relations with d-separation (for the definition and examples of d-separation in DAGs, see Lauritzen (1996) or Pearl (2009, Chapters 1.2.3 and 11.1.2)). In particular, any pair of variables d-separated by Z is conditionally independent given Z. The arrows of the DAG are assumed to correspond to the data generating process described by a set of structural equations (as in Eq. (10)). Therefore, the following probability factorisation and data generating process hold for a DAG model:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid pa(X_i)\big), \qquad X_i = f_i\big(pa(X_i), U_i\big),$$

where $pa(X_i)$ stands for the set of direct parents of $X_i$ and $U_i$ are exogenous noise variables.
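As a worked instance of this factorisation, consider two three-vertex DAGs over X, T and Y. The first is the DVIB chain of Eq. (10); the second is one DAG compatible with the original IB assumption. The noise symbols $U_X$, $U_T$, $U_Y$ and the function $f_Y(X, U_Y)$ in the second model are illustrative names introduced here.

```latex
% DVIB (Fig. 1b): X -> T -> Y, so T d-separates X and Y, i.e. X \perp Y | T
P(X, T, Y) = P(X)\, P(T \mid X)\, P(Y \mid T), \qquad
X = U_X,\quad T = f_T(X, U_T),\quad Y = f_Y(T, U_Y).

% A DAG satisfying the original IB assumption (Fig. 1a): T <- X -> Y, i.e. T \perp Y | X
P(X, T, Y) = P(X)\, P(T \mid X)\, P(Y \mid X), \qquad
X = U_X,\quad T = f_T(X, U_T),\quad Y = f_Y(X, U_Y).
```

In both factorisations the conditional for T has X as its only parent, so neither model contains a direct arrow from Y to T.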
Let us now focus again on the motivation for the T − X − Y assumption in Eq. (4). It prevents the model from choosing a degenerate solution of T = Y (in which case I(X; T) = const. and I(T; Y) = ∞). Note, however, that while T − X − Y is a sufficient condition for such a solution to be excluded (which justifies the correctness of the original IB), the necessary condition is that T cannot depend directly on Y. This means that the IB Markov assumption can indeed be reduced to requiring the absence of a direct arrow from Y to T in the underlying DAG. Note that this can be achieved in the undirected X − T − Y model, too. One thus wishes to avoid degenerate solutions which impair the bottleneck nature of T: it should contain information about both X and Y, the trade-off between them being steered by β. It is therefore necessary to exclude DAG structures which encode independence of X and T as well as Y and T. Such independences are achieved by collider structures in DAGs, i.e. T → Y ← X and T → X ← Y (they lead to degenerate solutions of I(X; T) = 0 and I(T; Y) = 0, respectively). To sum up, the goal of asserting the conditional independence assumption in Eq. (4) is to avoid degenerate solutions which impair the bottleneck nature of the representation T. When modelling the information bottleneck with DAG structures, one has to exclude the arrow Y → T as well as collider structures. A simple enumeration of the possible DAG models for the information bottleneck results in 10 distinct models listed in Table 1.
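The degenerate behaviour of a collider can be verified on a small discrete example. The sketch below builds the joint distribution of the collider T → Y ← X from randomly drawn factors (an assumption for illustration only) and evaluates I(X; T).

```python
import numpy as np

rng = np.random.default_rng(1)
px = rng.dirichlet(np.ones(3))                    # P(X)
pt = rng.dirichlet(np.ones(3))                    # P(T), no parents in the collider
py_xt = rng.dirichlet(np.ones(2), size=(3, 3))    # P(Y | X, T), shape (3, 3, 2)
# Collider factorisation P(x, t, y) = P(x) P(t) P(y | x, t), axes (X, T, Y).
p = px[:, None, None] * pt[None, :, None] * py_xt

pxt = p.sum(axis=2)                               # marginal P(X, T)
i_xt = np.sum(pxt * np.log(pxt / np.outer(pxt.sum(axis=1), pxt.sum(axis=0))))
```

Marginalising over Y leaves P(X)P(T), so I(X; T) is zero regardless of how Y depends on its parents; such a T cannot act as a compressed representation of X.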
As can be seen, considering the information bottleneck as a directed graphical model (DAG) makes room for a family of models which fall into 3 broad categories: models satisfying T − X − Y, models satisfying X − T − Y (the two undirected Markov assumptions described in Section 3.1), and models satisfying neither of them (see Table 1). The difference between particular models lies in the necessity to specify different conditional distributions and parametrise them, which might lead to situations in which no joint distribution P(X, Y, T) exists (which is likely to be the case in the third category). Focusing on the first two categories, we see that the former corresponds to the standard parametrisations of the information bottleneck and the Gaussian information bottleneck (see Section 2.1). In the latter, we see the deep information bottleneck (Eq. (10)) as the first DAG. Note also that the second DAG satisfying the X − T − Y assumption in Table 1 defines the probabilistic CCA model (Bach and Jordan, 2005). This is not surprising, since the solutions of CCA and the Gaussian information bottleneck use eigenvectors of the same matrix (Chechik et al., 2005).

The original and deep information bottleneck models differ by using different Markov assumptions (see Fig. 1) in the derivation of the respective solutions. As demonstrated in Section 3.1, DVIB optimises a lower bound on the objective function of IB. The tightness of the bound measures to what extent the IB assumption (Fig. 1a) is violated. As described in Section 3.2, characterising both models as directed graphical models results in two different DAGs for the IB and DVIB. Both models are summarised in Table 2.

Conclusion
In this paper, we show how to lift the information bottleneck Markov assumption T − X − Y in the context of the deep information bottleneck model, in which X − T − Y holds by construction. This result explains why standard implementations of the deep information bottleneck can optimise over a larger set of joint distributions P(X, T, Y) while only specifying the marginal T|X. It is made possible by optimising the lower bound on the mutual information I(T; Y) provided here rather than the full mutual information. We also provide a description of the information bottleneck as a DAG model and show that it is possible to identify a fundamental necessary feature of the IB in the language of directed graphical models. This property is satisfied in both the original and deep information bottleneck.

Figure 1: Markov assumptions for the Information Bottleneck and the Deep Information Bottleneck.
Derivation of Eq. (13):

$$\begin{aligned}
\mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X,Y)} \log P(Y|T,X)
&= \iint P(T,Y|X) \log \frac{P(T,Y|X)}{P(T|X)}\,dt\,dy \\
&= \iint P(T,Y|X) \log \frac{P(Y|X)\,P(T,Y|X)}{P(Y|X)\,P(T|X)}\,dy\,dt \\
&= D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + \iint P(T,Y|X) \log P(Y|X)\,dt\,dy \\
&= D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + \int P(Y|X) \log P(Y|X)\,dy \\
&= D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + \iint P(Y|X)P(T|X) \log P(Y|X)\,dt\,dy \\
&= D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + \iint P(Y|X)P(T|X) \log \frac{P(T|X)\,P(Y|X)\,P(T,Y|X)}{P(T,Y|X)\,P(T|X)}\,dt\,dy \\
&= D_{KL}\big(P(Y,T|X)\,\|\,P(Y|X)P(T|X)\big) + D_{KL}\big(P(Y|X)P(T|X)\,\|\,P(Y,T|X)\big) + \mathbb{E}_{P(Y|X)}\,\mathbb{E}_{P(T|X)} \log P(Y|T,X).
\end{aligned}$$

During optimisation, the parameters φ in $P_\phi(T|X)$ and θ in $P_\theta(Y|T)$ are adjusted such that both I(Y; T|X) and L(Y; T|X) become small. The terms I(Y; T|X) and L(Y; T|X) can thus be interpreted as a measure of how much the original IB assumption T − X − Y is violated during the training of the model that implements X − T − Y by construction.

Table 1: Directed graphical models of the information bottleneck.

Table 2: Comparison of the Information Bottleneck and Deep Variational Information Bottleneck.