Layer-wise Learning of Stochastic Neural Networks with Information Bottleneck

Deep neural networks (DNNs) offer flexible modeling capability for various important machine learning problems. Given the same modeling capability, the success of DNNs is largely determined by how effectively the networks can be learned. Currently, the maximum likelihood estimate (MLE) principle is the de-facto standard for learning DNNs. However, the MLE principle is not explicitly tailored to the hierarchical structure of DNNs. In this work, we propose the Parametric Information Bottleneck (PIB) framework as a fully information-theoretic learning principle for DNNs. Motivated by the Information Bottleneck principle, our framework efficiently induces relevant information under compression constraints into each layer of DNNs via multi-objective learning. Consequently, PIB generalizes the MLE principle in DNNs, empirically exploits the neural representations better than MLE and a partially information-theoretic treatment, and offers better generalization and adversarial robustness on MNIST and CIFAR10.


Introduction
Deep neural networks (DNNs) have demonstrated state-of-the-art performance in several important machine learning tasks including image recognition Krizhevsky et al. (2012), natural language translation Cho et al. (2014); Bahdanau et al. (2014) and game playing Silver et al. (2016). Behind the practical success of DNNs are various revolutionary techniques such as data-specific design of network architecture (e.g., convolutional neural network architecture), regularization techniques (e.g., early stopping, weight decay, dropout Srivastava et al. (2014), and batch normalization Ioffe and Szegedy (2015)), and optimization methods Kingma and Ba (2014). For learning DNNs, the maximum likelihood estimate (MLE) principle (in its various forms such as maximum log-likelihood or Kullback-Leibler divergence) has generally become a de-facto standard. The MLE principle maximizes the likelihood of the model for observing the entire training data. This principle is, however, generic and not specially tailored to hierarchy-structured models like neural networks. In particular, MLE treats the entire neural network as a collective whole without considering an explicit contribution of its hidden layers to model learning. As a result, the information contained within the hidden structure may not be adequately modified to capture the data regularities reflecting a target variable. A reasonable question is therefore whether the MLE principle effectively and sufficiently exploits a neural network's representational power, and whether there is a better alternative.
The Information Bottleneck (IB) principle Tishby et al. (1999) is an alternative principle which extracts relevant information about a target variable Y from an input variable X via a bottleneck variable Z. In detail, the IB framework constructs a bottleneck variable Z = Z(X) that is a compressed version of X but preserves as much relevant information in X about Y as possible. The compression of the representation Z is quantified by I(Z; X), the mutual information of Z and X. The relevance in Z, the amount of information Z contains about Y, is specified by I(Z; Y). An optimal representation Z satisfying a certain compression-relevance trade-off constraint is then determined via minimization of the Lagrangian L_IB[p(z|x)] = I(Z; X) − βI(Z; Y), where β is a positive Lagrange multiplier that controls the trade-off. Although an exact solution to this minimization problem has been found Tishby et al. (1999), the solution is non-parametric, implicit and non-analytical for most practical cases of interest. Thus, it is still not clear how the IB principle can be applied to many conventional DNNs in practice.
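For concreteness, both terms of the Lagrangian above can be evaluated exactly when X, Y and Z are small discrete variables. The following is a minimal numerical sketch (the function names are ours, not from the literature): given the data joint p(x, y) and an encoder p(z|x), it computes L_IB = I(Z; X) − βI(Z; Y).

```python
import numpy as np

def mutual_info(p_joint):
    """I(A;B) in nats from a joint distribution table p_joint[a, b]."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

def ib_lagrangian(p_xy, p_z_given_x, beta):
    """L_IB = I(Z;X) - beta * I(Z;Y) for discrete X, Y, Z.
    p_xy[x, y] is the data joint; p_z_given_x[x, z] is the encoder."""
    p_x = p_xy.sum(axis=1)                 # p(x)
    p_xz = p_x[:, None] * p_z_given_x      # p(x, z)
    p_zy = p_z_given_x.T @ p_xy            # p(z, y), using the chain Z <- X -> Y
    return mutual_info(p_xz) - beta * mutual_info(p_zy)
```

A constant encoder (Z carries no information) yields L_IB = 0, while a copy encoder Z = X pays the full compression cost I(Z; X) = H(X).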
In this work, we present a fully information-theoretic learning of DNNs which relies solely on the information relayed back and forth via each layer to construct the learning process. The main principle we use to construct such a learning process is the IB trade-off between compression and relevance occurring at each layer. We propose to interpret the learning of (multi-layered) neural networks as layer-wise multi-objective information trade-offs, in which each layer attempts to obtain its own information optimality in terms of compression and relevance. We show that simultaneous layer-wise information optimality is unfortunately impossible in the context of DNNs, and therefore propose compromised optimality strategies. In addition, the information quantity involved, mutual information, is highly intractable in DNNs. We therefore propose simple but effective approximations to mutual information using variational methods and existing network architectures.
As a result, we are able to interpret the MLE principle in DNNs as a special case of our framework applied to a specific so-called super layer, while our learning principle generalizes the MLE principle to every layer under compression constraints. In practice, our fully information-theoretic framework PIB, compared to the MLE principle and a partially information-theoretic treatment, offers superior, or at least comparable, performance in terms of generalization and adversarial robustness on MNIST and CIFAR10. We also show empirically that our framework exploits the neural representations better in terms of information.
This work is organized as follows. We first review related literature in Section 2. Section 3 introduces our layer-wise multi-objective Information Bottleneck principle for stochastic neural networks (presented and explained next), followed by variational bounds on mutual information and compromised optimality algorithms. Section 4 presents a case study in which we apply PIB to binary stochastic neural networks. Finally, Section 5 presents the empirical results of our framework on MNIST and CIFAR10, in comparison with MLE and a partially information-theoretic treatment.

Related Work
One can generalize the MLE principle in DNNs with Bayesian treatments of DNNs Neal (1995); MacKay (1992); Dayan et al. (1995), in which the network weights are assigned a prior distribution. This approach equips DNNs with an ability to reason about their uncertainty and takes advantage of the well-studied tools of probability theory. As a result, this idea has achieved interpretability and state-of-the-art performance on many tasks Gal (2016). The remaining challenges are scalability to big data and the approximation of the intractable posterior in the Bayesian treatment. In our work, we take a different perspective on DNNs using Information Theory, and reason about learning in terms of the information contained in the neural representations. In addition, our framework is both interpretable and scalable, with efficient mutual information approximations and gradient-based algorithms. Tishby et al. (1999) originally proposed the IB framework, which offers a principled way of extracting the relevant information in one variable X about another variable Y. An exact solution to the IB problem is given via highly non-linear self-consistent equations and can be found with the iterative Blahut-Arimoto algorithm if the variables are discrete. However, the algorithm is non-parametric and not applicable to DNNs. In practice, the IB problem has been solved efficiently in the following three cases only: (1) X, Y and Z are all discrete Tishby et al. (1999); (2) X, Y and Z are mutually joint Gaussian Chechik et al. (2005); (3) (X, Y, Z) has meta-Gaussian distributions Rey and Roth (2012), where Z is a bottleneck variable. In addition, the IB principle has been proven to be mathematically equivalent to the MLE principle for the multinomial mixture model in the clustering problem when the input distribution X is uniform or the sample size is large Slonim (2003). It is, however, not clear how the IB principle relates to the MLE principle in the context of DNNs.

Figure 1: Each layer Z_l is considered as a bottleneck where compression and relevance are induced. A bottleneck Z_l in DNNs splits the architecture into an encoder p(Z_l|X) and a variational relevance decoder p(Ŷ|Z_l).
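The discrete case mentioned above can be made concrete. Below is a minimal sketch of the iterative (Blahut-Arimoto-style) IB updates, assuming small discrete alphabets and a fixed cardinality for Z; the function name, iteration count and initialization scheme are our illustrative choices, not from the original paper:

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_z, beta, n_iter=200, seed=0):
    """Iterative IB updates for discrete X, Y (the self-consistent equations
    of Tishby et al.); returns the encoder p(z|x) as an (n_x, n_z) array."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    enc = rng.random((n_x, n_z))              # random initial encoder
    enc /= enc.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        p_z = p_x @ enc                       # p(z) = sum_x p(x) p(z|x)
        p_zy = enc.T @ p_xy                   # p(z, y)
        p_y_given_z = p_zy / (p_z[:, None] + eps)
        # KL( p(y|x) || p(y|z) ) for every (x, z) pair
        kl = (p_y_given_x[:, None, :] *
              np.log((p_y_given_x[:, None, :] + eps) /
                     (p_y_given_z[None, :, :] + eps))).sum(axis=2)
        # update: p(z|x) ∝ p(z) exp(-beta * KL)
        enc = p_z[None, :] * np.exp(-beta * kl)
        enc /= enc.sum(axis=1, keepdims=True)
    return enc
```

At β = 0 the update collapses the encoder to the marginal p(z) (pure compression); larger β preserves more relevance.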
Some works have recently attempted to apply the IB principle to learning DNNs. For example, Tishby and Zaslavsky (2015) propose using the mutual information of a hidden layer with the input layer and the output layer to quantify the performance of DNNs. By analyzing these measures with the IB principle, Tishby and Zaslavsky (2015) establish an information-theoretic learning principle for DNNs. In principle, one can optimize a neural network by pushing the network and all its hidden layers to the IB optimal limit, one layer after another. Although the analysis offers a new perspective on optimality in neural networks, it proposes a general analysis of optimality rather than practical optimization criteria. Furthermore, estimating the mutual information between the variables transformed by network layers and the data variables poses several computational challenges which are not addressed in that work: a small change in a multi-layered neural network can greatly modify the entropy of the input variables, and it is difficult to capture such modifications analytically. Strouse and Schwab (2016); Chalk et al. (2016); Alemi et al. (2017) have overcome the intractability of mutual information in DNNs by using variational approximations. This approach, however, considers one single bottleneck and parameterizes the encoder p(z|x; θ) with an entire neural network. These frameworks therefore still treat an entire neural network as a whole rather than explicitly optimizing it layer-wise.

Stochastic Neural Network
Definition 1 (Random Variables in Neural Networks). Consider a neural network with L hidden layers and no feedback or skip connections. We view the input layer X, the output of the l-th hidden layer Z_l, and the network output layer Ŷ as random variables (RVs). Without any feedback or skip connection, Y, X, Z_1, ..., Z_L, and Ŷ form a Markov chain in that order, denoted as:

Y → X → Z_1 → ... → Z_L → Ŷ.    (1)

In addition, we denote by n_l the dimension of Z_l, use the subscript D to refer to the data distribution, write X ⊥ Y (respectively, X ̸⊥ Y) to indicate that X and Y are independent (respectively, dependent), and abuse the integral notation, e.g., ∫f(z)dz, regardless of whether the variable z is real-valued or discrete-valued.
The role of the neural network is therefore reduced to transforming one RV into another via the Markov chain X → Z_l → Z_{l+1} → Ŷ, where Ŷ is used as a surrogate for Y. We call the transition distribution p(Z_l|X)¹ from X to Z_l an encoder, as it encodes the data X into the representation Z_l. For each encoder p(Z_l|X), there is a unique corresponding decoder, namely the relevance decoder, that decodes the relevant information in X about Y from the representation Z_l:

p(y|z_l) = ∫ p_D(x, y) p(z_l|x) / p(z_l) dx.    (2)

Note that our view of RVs in a neural network differs from that in the Bayesian treatment of neural networks Neal (1995); MacKay (1992); Dayan et al. (1995), where the network weights are RVs. In our view, the network weights are simply learnable means for the network to transform RVs. Furthermore, the learning of the neural network depends entirely on these transformations, whose outputs {Z_l : 1 ≤ l ≤ L} reflect their quality for learning.

Definition 2 (Stochastic Neural Networks). A neural network is said to be stochastic if there exists l ∈ {1, ..., L} s.t. the mapping from X to Z_l is stochastic. A neural network is said to be completely stochastic if ∀ 1 ≤ l ≤ L, the mapping from X to Z_l is stochastic.
It follows from the Markov chain in Equation (1) that inference in (either completely or partially) stochastic neural networks can be done as:

p(ŷ|x) = ∫ ∏_{l=0}^{L} p(z_{l+1}|z_l) dz,

where z = (z_1, ..., z_L), Z_0 := X and Z_{L+1} := Ŷ.
In this work, we consider completely stochastic neural networks but refer to them simply as stochastic neural networks. The reason we do not consider any deterministic mapping is that a deterministic mapping can make the estimation of the mutual information I(Z_l; X) as hard as that of H(X), which is unknown in most applications. For example, if Z_1 = f(X) for a deterministic function f, then the mutual information becomes the entropy of a deterministic function of the data variable: I(Z_1; X) = H(Z_1) = H(f(X)).

Multi-Objective Information Bottleneck
In the perspective that a neural network is a data-processing system transforming one RV into another, we propose that each of the neural representations Z_l should ideally be maximally compressed while preserving as much of the relevant information in X about Y as possible. This notion of compression and relevance can be quantified by the mutual information I(Z_l; X) and I(Z_l; Y), respectively. Motivated by the (non-parametric) Information Bottleneck principle Tishby et al. (1999), we propose the Parametric Information Bottleneck (PIB), which interprets the learning of a neural network as the following multi-objective (MO) layer-wise optimizations: for each 1 ≤ l ≤ L,

min_{p(Z_l|X)} I(Z_l; X)  subject to  I(Z_l; Y) ≥ D_l,    (4)

where D_l is some positive number ∀ 1 ≤ l ≤ L. Note that p(Z_{l+1}|X) partly depends on p(Z_l|X).
Equivalently, by introducing Lagrange multipliers β_l > 0 for the constrained relevance, we find optimal encoders p(Z_l|X) for all 1 ≤ l ≤ L by minimizing the functionals:

L_l[p(Z_l|X)] = I(Z_l; X) − β_l I(Z_l; Y), 1 ≤ l ≤ L.    (5)

¹ If the mapping from X to Z_l is deterministic, then p(Z_l|X) is simply a delta function.
An important question to ask here is therefore whether there exists a single solution that simultaneously optimizes the L objective functionals in Equation (5); otherwise, the objective functions are said to be conflicting. Theorem 1 provides an answer to this question.

Theorem 1 (Conflicting Information Optimality). Given Y → X → Z_1 → Z_2, Z_2 ̸⊥ Z_1, β_1 > 0, and β_2 > 0, the functionals L_1 and L_2 defined by Equation (5) are conflicting, i.e., there does not exist a single solution that minimizes L_1 and L_2 simultaneously.
Proof of Theorem 1. We briefly sketch a proof of Theorem 1, which proceeds by contradiction and uses the following two lemmas.

Assume, by contradiction, that there exists a solution that minimizes both L_1 and L_2 simultaneously, i.e., ∃ p*(z_1|x), p*(z_2|z_1) s.t. L_1 has a minimum at p*(z_1|x) and L_2 has a minimum at (p*(z_1|x), p*(z_2|z_1)). We rewrite Equation (12) as Equation (14), where λ_1(x) is defined in Equation (8). Taking the derivative of both sides of Equation (14) w.r.t. p*(z_1|x), and noting that p*(z_1|x) is a critical point of both L′_1 and L′_2, we obtain Equation (15). However, L′_{1,|Z_2} depends on Z_2 when Z_2 ̸⊥ Z_1; thus setting the LHS of Equation (15) to zero forces p*(z_1|x) to depend on Z_2 as well. This contradiction shows that the initial assumption is wrong, implying Theorem 1.
There are two main challenges in the MO optimizations in Equation (5): (a) by Theorem 1, we cannot optimize each of the L objectives in Equation (5) simultaneously, i.e., these objectives need to compromise on their optimality to some degree; (b) the mutual information terms I(Z_l; X) and I(Z_l; Y) are intractable in neural networks. We therefore derive efficient variational bounds on mutual information to deal with (b) and propose two simple compromised strategies to handle (a).

Bounds on Mutual Information
Here we propose simple but efficient variational bounds on compression and relevance for each layer. These bounds can then be efficiently optimized with gradient-based algorithms.

Approximate Relevance
The relevance I(Z_l; Y) is intractable due to the intractable relevance decoder p(y|z_l) in Equation (2). It follows from Jensen's inequality that:

H(Y|Z_l) = −∫ p(y|z_l) p(z_l) log p(y|z_l) dy dz_l ≤ −∫ p(y|z_l) p(z_l) log p_v(y|z_l) dy dz_l =: H̃(Y|Z_l),    (16)

where p_v(y|z_l) is any probability distribution. Note that

I(Z_l; Y) = H(Y) − H(Y|Z_l) ≥ H(Y) − H̃(Y|Z_l),

where H(Y) is a constant which can be ignored in the minimization of L_l. Specifically, in PIB we propose to use the network architecture connecting Z_l to Ŷ to define the variational relevance decoder for layer l, i.e., p_v(y|z_l) = p(ŷ|z_l), where p(ŷ|z_l) is determined by the network architecture:

p_v(y|z_l) := p(ŷ|z_l) = ∫ ∏_{i=l}^{L} p(z_{i+1}|z_i) dz_L ... dz_{l+1}.

For the rest of this work, we refer to H̃(Y|Z_l) with p_v(y|z_l) = p(ŷ|z_l) as the variational conditional relevance (VCR) of the l-th layer. Theorem 2 addresses the relation between VCR and the MLE principle.
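The variational bound above can be checked numerically for small discrete variables. A minimal sketch (function names are ours): the cross-entropy under any variational decoder upper-bounds the true conditional entropy, with equality when the decoder equals the true posterior.

```python
import numpy as np

def cond_entropy(p_zy):
    """True H(Y|Z) in nats from a joint table p_zy[z, y]."""
    p_z = p_zy.sum(axis=1, keepdims=True)
    mask = p_zy > 0
    return float(-(p_zy[mask] * np.log((p_zy / p_z)[mask])).sum())

def variational_cond_entropy(p_zy, q_y_given_z):
    """H~(Y|Z) = -E_{p(z,y)} log q(y|z) for any decoder q[z, y];
    upper-bounds H(Y|Z) by Jensen's inequality."""
    mask = p_zy > 0
    return float(-(p_zy[mask] * np.log(q_y_given_z[mask])).sum())
```

Using the marginal p(y) as the decoder recovers H(Y), the loosest useful bound; the gap to H(Y|Z) is exactly the relevance I(Z; Y).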
Theorem 2 (Information on the extreme layers). The VCR of the lowest-level (so-called super) layer (i.e., l = 0) is the negative log-likelihood (NLL) function of the neural network, i.e.,

H̃(Y|Z_0) = H̃(Y|X) = −E_{p_D(x, y)} [log p(ŷ = y|x)].

Similarly, the VCR of the highest-level layer (i.e., l = L) equals that of the compositional layer Z = (Z_1, Z_2, ..., Z_L), a composite of all hidden layers; in addition, their VCR is an upper bound on the NLL:

H̃(Y|Z_L) = H̃(Y|Z) ≥ H̃(Y|X).

Proof of Theorem 2. Using the definition of VCR, the Markov chain assumption, and Jensen's inequality (details in Appendix).
It follows immediately from Theorem 2 that MLE can be interpreted in terms of VCR (or, more generally, PIB) and vice versa. That is, the MLE principle optimizes the super-level VCR, while PIB explicitly extends this concept to every layer under compression constraints.

Approximate Compression
While p(z_l|z_{l-1}) in DNNs has an analytical form, p(z_l|x) for l > 1 generally does not, as it is a mixture of p(z_l|z_{l-1}) over the intermediate layers. We thus propose to avoid directly estimating I(Z_l; X) by resorting to its upper bound I(Z_l; Z_{l-1}) as a surrogate in the optimization. However, I(Z_l; Z_{l-1}) is still intractable as it involves the intractable distribution p(z_l). We therefore upper-bound I(Z_l; Z_{l-1}) using a mean-field (factorized) variational distribution r(z_l) = ∏_{i=1}^{n_l} r(z_{l,i}):

I(Z_l; Z_{l-1}) ≤ E_{p(z_{l-1})} [D_KL[p(z_l|z_{l-1}) || r(z_l)]].

The factorized variational distribution r(z_l) not only enables feasible computation but also promotes independence among the dimensions of each vector point z_l. This independence-favoring property encourages each dimension to account for a different underlying causal factor of the observed data, thereby manifesting the concept of distributed representation Bengio et al. (2013a). In PIB, the mean-field distribution r(z_l) is made learnable.
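For the binary layers used later in this paper, the mean-field bound reduces to an average, over samples of the previous layer, of factorized Bernoulli KL divergences. A minimal Monte Carlo sketch under that assumption (helper names are illustrative, not from the paper):

```python
import numpy as np

def bernoulli_kl(p, r, eps=1e-12):
    """Element-wise KL( Bern(p) || Bern(r) )."""
    p = np.clip(p, eps, 1 - eps)
    r = np.clip(r, eps, 1 - eps)
    return p * np.log(p / r) + (1 - p) * np.log((1 - p) / (1 - r))

def compression_bound(cond_probs, r):
    """Monte Carlo estimate of E_{z_{l-1}} KL( p(z_l|z_{l-1}) || r(z_l) ),
    an upper bound on I(Z_l; Z_{l-1}) with factorized Bernoulli layers.
    cond_probs[s, i] = p(z_{l,i}=1 | s-th sample of z_{l-1});
    r[i] is the learnable mean-field marginal."""
    return float(bernoulli_kl(cond_probs, r[None, :]).sum(axis=1).mean())
```

The bound is zero exactly when the conditional unit probabilities coincide with r for every sample, i.e., Z_l carries no information about Z_{l-1}.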

Compromised Information Optimality
Since we cannot simultaneously achieve optimality in all layers in the MO optimizations in Equation (5), we propose two simple compromised strategies, namely JointPIB and GreedyPIB. JointPIB (detailed in Algorithm 1) minimizes a weighted sum of layer-wise PIB objectives:

L_joint = Σ_{l=1}^{L} γ_l L_l,

where γ_l > 0. The main idea of JointPIB is to optimize all encoders and variational relevance decoders simultaneously. Even though each layer might not achieve its individual optimality, the joint objective encourages a joint compromise. GreedyPIB, on the other hand, applies PIB progressively in a greedy manner: it seeks the conditional optimality of the current layer given the already-achieved conditional optimality of the previous layers.
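Assembling the per-layer bounds, each layer contributes its compression bound plus β_l times its VCR (the constant H(Y) is dropped), and JointPIB sums these with weights γ_l. A schematic sketch of the objective assembly (names are ours, not the paper's code):

```python
def joint_pib_objective(comp, vcr, gammas, betas):
    """L_joint = sum_l gamma_l * ( comp_l + beta_l * vcr_l ), where comp_l is
    the variational compression bound on I(Z_l; Z_{l-1}) and vcr_l is the
    variational conditional relevance H~(Y|Z_l) of layer l.
    Minimizing comp_l + beta_l * vcr_l is equivalent to minimizing the
    surrogate of I(Z_l;X) - beta_l * I(Z_l;Y) up to the constant H(Y)."""
    return sum(g * (c + b * v)
               for g, c, b, v in zip(gammas, comp, betas, vcr))
```

In GreedyPIB, by contrast, one would minimize comp_l + β_l · vcr_l for layer l alone, holding the already-trained earlier layers fixed.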

An Application to Neural Networks
To analyze our PIB framework, we apply it to a simple network architecture, binary stochastic feed-forward (fully-connected) neural networks (SFNN), and leave more complicated architectures for future work. In a binary SFNN, each unit is Bernoulli with a sigmoid-activated mean: p(z_{l,i} = 1|z_{l-1}) = σ(W_{l-1} z_{l-1} + b_{l-1})_i, where σ(·) is the (element-wise) sigmoid function, W_{l-1} is the weight matrix connecting layer l−1 to layer l, b_{l-1} is a bias vector, and Z_l ∈ {0, 1}^{n_l}.
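A minimal sketch of Monte Carlo inference in such a binary SFNN, following the Markov-chain factorization of Equation (1): each hidden layer is sampled as a Bernoulli vector, and the output distribution is averaged over M samples. The softmax output layer and all names here are our illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sfnn_predict(x, weights, biases, n_samples=32, seed=0):
    """Monte Carlo estimate of p(y_hat | x): average the decoder output over
    n_samples forward passes, sampling each hidden layer as
    Bernoulli(sigmoid(W z + b))."""
    rng = np.random.default_rng(seed)
    out = 0.0
    for _ in range(n_samples):
        z = x
        for W, b in zip(weights[:-1], biases[:-1]):
            p = sigmoid(W @ z + b)
            z = (rng.random(p.shape) < p).astype(float)   # binary sample
        logits = weights[-1] @ z + biases[-1]
        e = np.exp(logits - logits.max())                 # stable softmax
        out = out + e / e.sum()
    return out / n_samples
```

With zero weights and biases every class is equally likely, which gives a quick sanity check on the averaging.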
So far it has not been clear how the gradient is computed in a stochastic neural network at line 10 of Algorithm 1. The sampling operation in stochastic neural networks precludes backpropagation through the computation graph. This becomes even more challenging with binary stochastic neural networks, as gradients w.r.t. discrete-valued variables are not well-defined. Fortunately, there exist approximate gradient estimators which have proved efficient in practice: the REINFORCE estimator Williams (1992); Bengio et al. (2013b), the straight-through estimator Hinton (2016), the generalized EM algorithm Tang and Salakhutdinov (2013), and the Raiko (biased) estimator Raiko et al. (2015). We found that the Raiko gradient estimator works best in our specific setting and therefore deployed it in this application. In the Raiko estimator, a bottleneck particle z_{l,i} ∼ p(z_{l,i}|z_{l-1}) = σ(a_{l,i}), where a_{l,i} is the pre-activation, is decomposed into a deterministic term plus noise, and the gradient is propagated through the deterministic term σ(a_{l,i}) only.
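A sketch of the Raiko estimator as we understand it from Raiko et al. (2015): sample the binary unit in the forward pass, but back-propagate only through the sigmoid mean, treating the sampling noise as fixed (a biased but low-variance estimator). Names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def raiko_layer(a, rng):
    """Forward: sample z ~ Bernoulli(sigmoid(a)).
    Backward (Raiko, biased): write z = sigmoid(a) + epsilon and treat
    epsilon as fixed noise, so the gradient flows only through the
    deterministic mean sigmoid(a)."""
    p = sigmoid(a)
    z = (rng.random(p.shape) < p).astype(float)
    def backward(grad_z):
        return grad_z * p * (1.0 - p)   # d sigmoid/d a; noise term ignored
    return z, backward
```

At a = 0 the backward pass returns grad_z · σ(0)(1 − σ(0)) = 0.25 · grad_z, regardless of which binary value was sampled.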

Experiments
In this section, we evaluate our PIB framework using the GreedyPIB and JointPIB algorithms on MNIST LeCun (1998) and CIFAR10 Krizhevsky (2009) for classification, learning dynamics and robustness against adversarial attacks. The MNIST dataset consists of 28 × 28 pixel greyscale images of handwritten digits 0-9, with 60,000 training and 10,000 test examples. The CIFAR10 dataset consists of 60,000 (50,000 for train and 10,000 for test) 32 × 32 colour images in 10 classes, with 6,000 images per class.

Image classification
To evaluate the generalization capability of the layer-wise informativeness in PIB, we compared JointPIB and GreedyPIB with three other models which used the same network architecture without any explicit regularizer: (1) a standard deterministic neural network (DET), which simply treats each hidden layer as deterministic; (2) a Stochastic Feed-forward Neural Network (SFNN) Raiko et al. (2015), which is a binary stochastic neural network as in PIB but is trained with the MLE principle; (3) the Variational Information Bottleneck (VIB) Alemi et al. (2017), which employs the entire deterministic network as an encoder, adds an extra stochastic layer as an out-of-network bottleneck variable, and is then trained with the IB principle on that single bottleneck layer. The base network architecture in this experiment had two hidden layers with 512 sigmoid-activated neurons per layer.
Following common practice, we used the last 10,000 images of the training set as a validation (holdout) set for tuning hyper-parameters. We then retrained the models from scratch on the full training set with the best validated configuration. We trained each of the five models with the same set of 5 different initializations and report the average results. For the stochastic models (all except DET), we drew M = 32 samples per stochastic layer during both training and inference, and performed inference 10 times at test time to report the mean classification error for MNIST and the mean classification accuracy for CIFAR10. For JointPIB and GreedyPIB, we set γ_l = 1 (in JointPIB only) and β_l = β, ∀ 1 ≤ l ≤ L, and tuned β on a log scale, β ∈ {10^{-i} : 1 ≤ i ≤ 10}. We found β = 10^{-4} worked best for both models. For VIB, we found that β = 10^{-3} and β = 10^{-4} worked best on MNIST and CIFAR10, respectively. We trained the models on MNIST with Adadelta Zeiler (2012) and on CIFAR10 with Adagrad Duchi et al. (2011) (except for VIB, for which we used Adam Kingma and Ba (2014)), as these worked best on the validation set. The results are shown in Table 1. We see that the explicit informativeness induced into each layer outperforms the other models, by JointPIB on MNIST and by GreedyPIB on CIFAR10.

Robustness against adversarial attacks
Neural networks are prone to adversarial attacks which perturb the input pixels by small amounts imperceptible to humans Szegedy et al. (2013); Nguyen et al. (2015). Adversarial attacks generally fall into two categories: untargeted and targeted attacks. An untargeted adversarial attack A maps the target model M and an input image x into an adversarially perturbed image x′, A : (M, x) → x′, and is considered successful if it fools the model, i.e., M(x) ≠ M(x′). A targeted attack, on the other hand, has an additional target label l, A : (M, x, l) → x′, and is considered successful if M(x′) = l. To evaluate PIB's capability to equip a neural network with robustness against adversarial attacks, we attacked the neural networks trained by MLE and PIB, and used the accuracy on adversarially perturbed versions of the test set to rank each model's robustness. We used the L_2 attack method for both targeted and untargeted attacks from Carlini and Wagner (2017), which has been shown to be among the most effective attack algorithms with smaller perturbations. Specifically, we attacked the same four comparative models described in the previous experiment on the first 1,000 samples of the MNIST test set. For the targeted attacks, we targeted each image into the 9 labels other than the true label of the image.
The results are also shown in Table 1. We see that the deterministic model DET is totally fooled by the attacks. It is known that stochasticity in neural networks improves adversarial robustness, which is consistent with our experiment, as SFNN is significantly more adversarially robust than DET. VIB has comparable adversarial robustness to SFNN even though VIB has "less stochasticity" than SFNN (VIB has one stochastic layer while all hidden layers of SFNN are stochastic). This is because VIB compensates for its single stochastic layer with the IB principle. Finally, our JointPIB is shown to be more adversarially robust than the other models.

Learning dynamics
To demonstrate PIB's capability to explicitly alter the layer-wise information, we visualize the compression and relevance of each layer over the course of training of SFNN, JointPIB and GreedyPIB.
To simplify our analysis, we considered a binary decision problem where X is 12 binary inputs making up 2^12 = 4096 equally likely input patterns and Y is a binary variable equally distributed among the 4096 input patterns Shwartz-Ziv and Tishby (2017). The base neural network architecture had 4 hidden layers with widths 10-8-6-4 neurons. Since the network architecture is small, we could precisely compute the (true) compression I_x := I(Z_i; X) and (true) relevance I_y := I(Z_i; Y) over training epochs. We fixed β_l = β = 10^{-4} for both JointPIB and GreedyPIB, trained five differently initialized networks for each comparative model with SGD for up to 20,000 epochs on 80% of the data, and averaged the mutual information. We can observe a common trend in the learning dynamics under both MLE (in the SFNN model) and the PIB framework (in JointPIB and GreedyPIB). Both principles allow the network to gradually encode more information about X and the relevant information about Y into the hidden layers at the beginning, as both I(Z_i; X) and I(Z_i; Y) increase. Notably, compression does occur in SFNN, which is consistent with the result reported in Shwartz-Ziv and Tishby (2017) that a compression phase is an inherent property of DNNs trained with SGD, even though the MLE principle is not explicitly designed for it.
What distinguishes PIB from MLE is the maximum level of relevance at each layer and the number of epochs needed to encode the same level of relevance. Firstly, JointPIB and GreedyPIB at l = 1 need only about 4.68% and 17.95%, respectively, of the training epochs to achieve at least the same level of relevance as in all layers of SFNN at the final epoch. Recall that in GreedyPIB at l = 1, the PIB principle is applied to the first hidden layer only. Secondly, MLE is unable to make the network layers reach the maximum level of relevance enabled by PIB (we also trained SFNN for up to 100,000 epochs and observed that the level of relevance of each layer degrades before ever reaching 0.8 bits). While it is not clear in this experiment what compression in PIB offers in comparison with MLE (as MLE is observed to possess compression capability as well), the compression constraints do help within the PIB framework itself by keeping the layer representation from shifting to the right in the information plane during the encoding of relevant information (e.g., the layer representation at the final epoch gradually shifts to the left, i.e., becomes more compressed, without degrading the relevance over the greedy training from layer 1 to layer 4, as presented in Figures 4-7).

Conclusion
In this work, we introduced the Parametric Information Bottleneck (PIB) as an alternative to the MLE principle for learning DNNs. PIB offers a principled yet efficient way of inducing compression and relevance into each layer of DNNs. While PIB can be considered a fully parameterized version of the non-parametric IB principle for DNNs, it also generalizes the MLE principle to every layer under compression constraints. The competitive performance of PIB on MNIST and CIFAR10 suggests the effectiveness of PIB in exploiting neural representations.

VCR decomposition for a multivariate target variable
VCR decomposition for a multivariate target variable. We prove that the VCR at level l for a multivariate variable y can be decomposed as the sum of the VCRs of each of its vector elements. Indeed, consider y = (y_1, ..., y_n). It follows from the fact that the neurons within a layer are conditionally independent given the previous layer that:

H̃(Y|Z_l) = −E_{(x, y_1, ..., y_n)∼D} E_{z_l|x} [log p(ŷ_1 = y_1, ..., ŷ_n = y_n|z_l)] = −Σ_{i=1}^{n} E_{(x, y_i)∼D} E_{z_l|x} [log p(ŷ_i = y_i|z_l)] = Σ_{i=1}^{n} H̃(Y_i|Z_l),

implying the claim about the decomposability of the VCR for a multivariate target variable.

Derivation for the relevance decoder
Derivation for the relevance decoder.
p(y|z_l) = p(y, z_l) / p(z_l) = ∫ p(x, y, z_l) / p(z_l) dx = ∫ p(x, y) p(z_l|x, y) / p(z_l) dx = ∫ p(x, y) p(z_l|x) / p(z_l) dx (due to Z_l ⊥ Y | X), implying Equation (2) in the main material.

Derivation for a variational upper bound on mutual information
Proof. We give a more detailed derivation of Equation (16) in the main material:

H(Y|Z_l) = −∫ p(z_l) p(y|z_l) log p(y|z_l) dy dz_l
= −∫ p(z_l) p(y|z_l) log p_v(y|z_l) dy dz_l − ∫ p(z_l) D_KL[p(y|z_l) || p_v(y|z_l)] dz_l
≤ −∫ p(z_l) p(y|z_l) log p_v(y|z_l) dy dz_l
= −∫ p(y, z_l) log p_v(y|z_l) dy dz_l
= −∫ p(x, y, z_l) log p_v(y|z_l) dy dz_l dx
= −∫ p_D(x, y) p(z_l|x, y) log p_v(y|z_l) dy dz_l dx
= −∫ p_D(x, y) p(z_l|x) log p_v(y|z_l) dz_l dx dy
= −E_{p_D(x, y)} E_{p(z_l|x)} [log p_v(y|z_l)] =: H̃(Y|Z_l).

MLE as distribution matching
MLE as distribution matching. The MLE principle can be interpreted as matching the model probability function to the empirical data probability function, using the Kullback-Leibler (KL) divergence as a measure of their discrepancy. Indeed, given a set of samples X = {x_1, x_2, ..., x_N} drawn i.i.d. from some underlying data probability function p_D(x), a parametric model p_model(x; θ) attempts to map any data sample x to a real number that estimates the true probability p_D(x). The MLE principle maximizes the likelihood function on the empirical data. This in turn can be interpreted as matching the model probability function p_model with the data probability function p_D by minimizing their KL divergence to find the maximum likelihood (point) estimator for θ:

θ_ML = arg min_θ D_KL[p_D(x) || p_model(x; θ)] = arg max_θ E_{p_D(x)} [log p_model(x; θ)]    (31)
≈ arg max_θ (1/N) Σ_{i=1}^{N} log p_model(x_i; θ),    (32)

where Expression (32) is an empirical estimate of Expression (31) for N data points.
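This equivalence can be verified numerically on a toy Bernoulli model: since the KL divergence and the average log-likelihood differ only by the constant entropy of p_D, the θ maximizing the empirical log-likelihood is exactly the θ minimizing the KL divergence. The data and grid below are our illustrative choices:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 1, 0, 1, 1])   # toy i.i.d. binary samples
p_emp = data.mean()                          # empirical p_D(x = 1)
thetas = np.linspace(0.01, 0.99, 99)         # candidate model parameters

# average log-likelihood E_{p_D}[log p_model(x; theta)]
avg_ll = p_emp * np.log(thetas) + (1 - p_emp) * np.log(1 - thetas)

# KL( p_D || p_model ) = -H(p_D) - avg_ll, computed directly
kl = (p_emp * np.log(p_emp / thetas)
      + (1 - p_emp) * np.log((1 - p_emp) / (1 - thetas)))

best_ll = thetas[np.argmax(avg_ll)]          # MLE over the grid
best_kl = thetas[np.argmin(kl)]              # KL-matching over the grid
```

Both criteria pick the same grid point, the empirical mean of the data, as Expression (31) predicts.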