1. Introduction
The celebrated information bottleneck (IB) functional [1] is a cost function for supervised lossy compression. More specifically, if X is an observation and Y a stochastically related random variable (RV) that we associate with relevance, then the IB problem aims to find an encoder $e_{Z|X}$, i.e., a conditional distribution of Z given X, that minimizes
$$\mathcal{L}_{\mathrm{IB}} := I(X;Z) - \beta I(Y;Z). \tag{1}$$
In (1), $I(X;Z)$ and $I(Y;Z)$ denote the mutual information between observation X and representation Z and between relevant variable Y and representation Z, respectively, and $\beta$ is a Lagrangian parameter. The aim is to obtain a representation Z that is simultaneously compressed (small $I(X;Z)$) and informative about the relevant variable Y (large $I(Y;Z)$), and the parameter $\beta$ trades between these two goals.
Recently, Fischer proposed an equivalent formulation, termed the conditional entropy bottleneck (CEB) [2]. While the IB functional inherently assumes the Markov condition $Y - X - Z$, the CEB is motivated by the principle of Minimum Necessary Information, which lacks this Markov condition and which aims to find a representation Z that compresses a bi-variate dataset $(X, Y)$ while still being useful for a given task. Instantiating the principle of Minimum Necessary Information then induces a Markov condition. For example, the task of finding a representation Z that makes X and Y conditionally independent induces the Markov condition $X - Z - Y$, and the representation optimal w.r.t. the principle of Minimum Necessary Information turns out to be $\arg\inf_{Z:\, X - Z - Y} I(X,Y;Z)$, i.e., it is related to Wyner’s common information [3]. The task relevant in this work, namely estimating Y from a representation Z that is obtained exclusively from X, induces the Markov condition $Y - X - Z$
and the constraint
. A Lagrangian formulation of the constrained optimization problem
, where the infimum is taken over all encoders
that take only
X as input, yields the CEB functional (see Section 2.3 of [2])
$$\mathcal{L}_{\mathrm{CEB}} := I(X;Z|Y) - \gamma I(Y;Z). \tag{2}$$
Due to the chain rule of mutual information [4] (Theorem 2.5.2), (2) is equivalent to (1) for $\gamma = \beta - 1$. Nevertheless, (2) has additional appeal. To see this, note that $I(X;Z|Y)$ captures the information about X contained in the representation Z that is redundant for the task of predicting the class variable Y. In the language of [5], which essentially also proposed (2), $I(X;Z|Y)$ thus quantifies class-conditional compression. Minimizing this class-conditional compression term $I(X;Z|Y)$ is not in conflict with maximizing $I(Y;Z)$, whereas minimizing $I(X;Z)$ is (see Figure 2 in [2] and Section 2 in [5]). At the same time, as stated in [2] (p. 6), $I(X;Z|Y)$ allows us to “measure in absolute terms how much more we could compress our representation at the same predictive performance”, i.e., by how much $I(X;Z)$ could potentially be further reduced without simultaneously reducing $I(Y;Z)$.
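The equivalence of the IB and CEB functionals can be checked numerically on a small discrete example. The following is a minimal sketch, not taken from [2] or [5]: the binary alphabets, the joint distribution `p_xy`, the encoder table `encoder`, and the value of `beta` are all hypothetical choices made for illustration. It assumes the standard forms $I(X;Z) - \beta I(Y;Z)$ and $I(X;Z|Y) - \gamma I(Y;Z)$ with $\gamma = \beta - 1$, and it relies on the Markov condition $Y - X - Z$ holding by construction.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf given as {(a, b): probability}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical toy model with binary X, Y, Z (all numbers are illustrative).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}   # p(x, y)
encoder = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}          # e(z | x)

# The joint p(x, y, z) = p(x, y) e(z | x) satisfies Y - X - Z by construction.
p_xyz = {(x, y, z): p * encoder[x][z]
         for (x, y), p in p_xy.items() for z in (0, 1)}

def pair_marginal(i, j):
    """Marginal pmf of coordinates i and j of p_xyz."""
    out = {}
    for key, p in p_xyz.items():
        out[(key[i], key[j])] = out.get((key[i], key[j]), 0.0) + p
    return out

i_xz = mutual_information(pair_marginal(0, 2))
i_yz = mutual_information(pair_marginal(1, 2))

# I(X;Z|Y) computed directly as sum_y p(y) * I(X;Z | Y=y).
i_xz_given_y = 0.0
for y0 in (0, 1):
    p_y0 = sum(p for (x, y, z), p in p_xyz.items() if y == y0)
    cond = {(x, z): p / p_y0 for (x, y, z), p in p_xyz.items() if y == y0}
    i_xz_given_y += p_y0 * mutual_information(cond)

beta = 3.0            # illustrative Lagrangian parameter
gamma = beta - 1.0    # matching CEB parameter
l_ib = i_xz - beta * i_yz
l_ceb = i_xz_given_y - gamma * i_yz
print(f"L_IB = {l_ib:.6f}, L_CEB = {l_ceb:.6f}")
```

Both values coincide up to floating-point error because $I(X;Z|Y) = I(X;Z) - I(Y;Z)$ whenever $Y - X - Z$ holds; the conditional term is computed directly from the conditional distributions so that the check is not circular.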
Aside from these theoretical considerations that make the CEB functional preferable over the equivalent IB functional, it has been shown that minimizing variational bounds on the former achieves better performance than minimizing variational bounds on the latter [2,6]. More specifically, it was shown that variational CEB (VCEB) achieves higher classification accuracy and better robustness against adversarial attacks than variational IB (VIB) proposed in [7].
The exact underlying reason why VCEB outperforms VIB is currently still being investigated. Comparing these two bounds at
, Fischer suggests that “we may expect VIB to converge to a looser approximation of
”, where the latter equation corresponds to the Minimum Necessary Information point (see Section 2.5.1 of [2]). Furthermore, Fischer and Alemi claim that VCEB “can be thought of as a tighter variational approximation to the IB objective than VIB” (see Section 2.1 of [6]). Nevertheless, the following question remains: Does VCEB outperform VIB because the variational bound of VCEB is tighter, or because VCEB is more amenable to optimization than VIB?
To partly answer this question, we compare the optimization problems corresponding to VCEB and VIB. Rather than focusing on actual (commonly neural network-based) implementations of these problems, we keep an entirely mathematical perspective and discuss the problem of finding minimizers within well-defined feasible sets (see Section 3). Our main result in Section 4 shows that the optimization problems corresponding to VCEB and VIB are indeed ordered if additional constraints are added: If VCEB is constrained to use a consistent classifier-backward encoder pair (see Definition 1 below), then (unconstrained) VIB yields a tighter approximation of the IB functional. In contrast, if VIB is constrained to use a consistent classifier-marginal pair, then (constrained and unconstrained) VCEB yields a tighter approximation. If neither VCEB nor VIB is constrained, then no ordering can be shown between the resulting optimal variational bounds. Taken together, these results indicate that the superiority of VCEB over VIB observed in [2,6] cannot be due to VCEB better approximating the IB functional. Rather, we conclude in Section 5 that the variational bound provided in [2] is either more amenable to optimization, at least when the variational terms in VCEB and VIB are implemented using neural networks (NNs), or a successful cost function for optimization in its own right, i.e., without justification from the IB or Minimum Necessary Information principles.
Related Work and Scope. Many variational bounds for mutual information have been proposed [8], and many of these bounds can be applied to the IB functional. Both the VIB and VCEB variational bounds belong to the class of Barber & Agakov bounds, cf. Section 2.1 of [8]. As an alternative example, the authors of [9] bounded the IB functional using the Donsker–Varadhan representation of mutual information. Aside from that, the IB functional has also been used for NN training without resorting to purely variational approaches. For example, the authors of [10] applied the Barber & Agakov bound to replace $I(Y;Z)$ by the standard cross-entropy loss of a trained classifier, but used a non-parametric estimator for $I(X;Z)$. Rather than comparing multiple variational bounds with each other, in this work we focus exclusively on the VIB [7] and VCEB [2] bounds. The structural similarity of these bounds allows a direct comparison and still yields interesting insights that can potentially carry over to other variational approaches.
We finally want to mention two works that draw conclusions similar to ours. First, Achille and Soatto [11] pointed to the fact that their choice of injecting multiplicative noise into neuron activations is not only a restriction of the feasible set over which the optimization is performed, but can also be interpreted as a means of regularization or as an approach to perform optimization. Thus, the authors claim, there is an intricate connection between regularization (i.e., the cost function), the feasible set, and the method of optimization (see Section 9 of [11]); this claim resonates with our Section 5. Second, Wieczorek and Roth [12] investigate the difference between IB and VIB: While IB implicitly assumes the Markov condition $Y - X - Z$, the variational approach taken in VIB assumes that an estimate of Y is obtained from the representation Z, i.e., $X - Z - Y$. Dropping the former assumption allows one to express the difference between the VIB bound and the IB functional via mutual and lautum information, which, taken together, measure the violation of the condition $Y - X - Z$. The authors thus argue that dropping this condition enables VIB and similar variants to optimize over larger sets of joint distributions of X, Y, and Z. In this work, we take a slightly different approach and argue that the posterior distribution of Y given Z is approximated by a classifier with input Z that responds with a class estimate $\hat{Y}$. Thus, we stick to the Markov condition inherent to IB and extend it by an additional variable, resulting in $Y - X - Z - \hat{Y}$. As a consequence, our variational approach does not assume that $X - Z - Y$ holds, which also leads to a larger set of joint distributions of X, Y, and Z. Finally, while [12] compares the IB functional with the VIB bound, in our work we compare two variational bounds on the IB functional with each other.
Notation. We consider a classification task with a feature RV X on $\mathcal{X}$ and a class RV Y on the finite set $\mathcal{Y}$ of classes. We denote the joint distribution of X and Y by $p_{X,Y}$. In this work we are interested in representations Z of the feature RV X. This (typically real-valued) representation Z is obtained by feeding X to a stochastic encoder $e_{Z|X}$, and the representation Z can be used to infer the class label by feeding it to a classifier $c_{\hat{Y}|Z}$. Note that this classifier yields a class estimate $\hat{Y}$ that need not coincide with the class RV Y. Thus, the setup of encoder, representation, and classifier yields the following Markov condition: $Y - X - Z - \hat{Y}$. We abuse notation and abbreviate the conditional probability (density) of a RV W given that another RV V assumes a certain value v as $p_{W|v} := p_{W|V=v}$. For example, the probability density of the representation Z for an input $X = x$ is induced by the encoder $e_{Z|X}$ and is given as $e_{Z|x}$.
We obtain the encoder, the classifier, and, where applicable, the variational distributions by solving a constrained optimization problem. For example, $\min_{e_{Z|X} \in \mathcal{E}} \mathcal{J}$ minimizes the objective $\mathcal{J}$ over all encoders $e_{Z|X}$ from a given family $\mathcal{E}$. In practice, encoder, classifier, and variational distributions are parameterized by (stochastic) feed-forward NNs. The chosen architecture has a certain influence on the feasible set; e.g., $\mathcal{E}$ may denote the set of encoders that can be parameterized by a NN of a given architecture.
We assume that the reader is familiar with information-theoretic quantities. More specifically, we let $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ denote mutual information and Kullback–Leibler divergence, respectively. The expectation w.r.t. a RV W drawn from a distribution $p_W$ is denoted as $\mathbb{E}_{W \sim p_W}[\cdot]$.
3. Variational IB and Variational CEB as Optimization Problems
While it is known that
and
for all possible
and all choices of
,
,
, and
, it is not obvious how
and
compare during optimization. In other words, we are interested in determining whether there is an ordering between
and
Since we will always compare variational bounds for equivalent parameterization, i.e., compare with , we will drop the arguments and for the sake of readability.
For a fair comparison, we need to ensure that both cost functions are optimized over comparable feasible sets $\mathcal{E}$, $\mathcal{C}$, $\mathcal{B}$, and $\mathcal{Q}$ for the encoder, the classifier, the backward encoder, and the marginal, respectively. We make this explicit in the following assumption.
Assumption 1. The optimizations of VCEB and VIB are performed over equivalent feasible sets. Specifically, the families and from which VCEB and VIB can choose encoder and classifier shall be the same. Depending on the scenario, we may require that the optimization over the marginal is able to choose from the same mixture models as are induced by VCEB, i.e., if is a feasible solution of , then shall also be a feasible solution for ; we thus require that . Depending on the scenario, we may require that every feasible solution for the marginal shall be achievable by selecting feasible backward encoders; we thus require that . If both conditions are fulfilled, then we write that .
We furthermore need the following definition:
Definition 1. In the optimization of , we say that backward encoder and classifier are a consistent pair if holds. In the optimization of , we say that marginal and classifier are a consistent pair if holds.
The restriction to consistent pairs restricts the feasible sets. For example, for VCEB, if
is large enough to contain all classifiers consistent with backward encoders in
, i.e., if
, then the triple minimization
is reduced to the double minimization
Equivalently, one can write the joint triple minimization as a consecutive double minimization and a single minimization, where the inner minimization runs over all backward encoders consistent with the classifier chosen in the outer minimization (where the minimization over an empty set returns infinity):
Similar considerations hold for VIB.
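To make the notion of a consistent classifier-backward encoder pair more tangible, the following minimal sketch works on a discrete toy problem. It rests on an assumption that is not confirmed by the text above: consistency is taken to mean that the classifier coincides with the Bayes posterior induced by the class prior and the backward encoder, i.e., $c(y|z) \propto p_Y(y)\, b(z|y)$. The function names and all numbers are hypothetical.

```python
def bayes_classifier(p_y, backward):
    """Classifier c(y|z) proportional to p(y) * b(z|y); returned as c[z][y]."""
    zs = list(next(iter(backward.values())))
    classifier = {}
    for z in zs:
        evidence = sum(p_y[y] * backward[y][z] for y in p_y)
        classifier[z] = {y: p_y[y] * backward[y][z] / evidence for y in p_y}
    return classifier

def is_consistent(p_y, backward, classifier, tol=1e-9):
    """Assumed consistency check: classifier equals the induced Bayes posterior."""
    reference = bayes_classifier(p_y, backward)
    return all(abs(reference[z][y] - classifier[z][y]) <= tol
               for z in reference for y in reference[z])

# Hypothetical class prior p(y) and backward encoder b(z|y), stored as b[y][z].
p_y = {0: 0.6, 1: 0.4}
backward = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

consistent_c = bayes_classifier(p_y, backward)
uniform_c = {z: {0: 0.5, 1: 0.5} for z in (0, 1)}   # an arbitrary other classifier

print(is_consistent(p_y, backward, consistent_c))
print(is_consistent(p_y, backward, uniform_c))
```

Under this reading, restricting to consistent pairs removes the classifier as a free variable: once the backward encoder is chosen, the classifier is determined, which mirrors the reduction from a triple to a double minimization described above.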
4. Main Results
Our first main result is negative in the sense that it shows that the optimal values of the VCEB and VIB bounds cannot be ordered in general. To this end, consider the following two examples.
Example 1 (VIB < VCEB).
In this example, let , where and are constrained, and let be unconstrained, thus . Suppose further that we have selected a fixed encoder that induces the marginal and conditional distributions and , respectively. With this, we can write
Suppose that is a minimizer of (12a) over and that . By the chain rule of Kullback–Leibler divergence [4] (Th. 2.5.3) and with , we can expand
thus
Suppose that is such that the inequality above is strict. Then,
where the last inequality follows because may not be optimal for the VIB cost function.
Example 2 (VIB > VCEB).
Let , where and are unconstrained, thus with (12) we have
Suppose further that is such that , where . It then follows that

In both of these examples we have ensured that the comparison is fair in the sense of Assumption 1. Aside from showing that VIB and VCEB in general allow no ordering, additional interesting insights can be gleaned from Examples 1 and 2. First, whether VIB or VCEB yields the tighter approximation of the IB and CEB functionals for a fixed encoder depends largely on the feasible sets of the marginal and the backward encoder: Constraints on the feasible set of the marginal cause disadvantages for VIB, while constraints on the feasible set of the backward encoder lead to the VCEB bound becoming looser. Second, for fixed encoders, the tightness of the respective bounds and the question of which bound is tighter do not depend on how well the IB and CEB objectives are met: These objectives are functions only of the encoder , whereas the tightness of the variational bounds depends on , , and . (Of course, the tightness of the respective bounds after the triple optimization in (6) depends also on , as the optimization over and in Example 1 and over in Example 2 interacts with the optimization over in a non-trivial manner.)
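The dependence of the bound tightness on the feasible sets of the marginal and the backward encoder can be illustrated numerically. The sketch below does not reproduce the cost functions compared in this work; it only uses the standard rate bounds $\mathbb{E}_X D(e_{Z|X} \| q_Z) \ge I(X;Z)$ and $\mathbb{E}_{X,Y} D(e_{Z|X} \| b_{Z|Y}) \ge I(X;Z|Y)$ for a fixed encoder and ignores the classifier term. The toy distributions and the two restricted choices `q_restricted` and `b_restricted` are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for pmfs given as dicts."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

# Hypothetical toy model: joint p(x, y) and a fixed encoder e(z | x).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
encoder = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
Z = (0, 1)

p_x = {x: sum(p for (a, _), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in p_xy.items() if b == y) for y in (0, 1)}
# Distributions induced by the encoder: p(z) and p(z | y).
p_z = {z: sum(p_x[x] * encoder[x][z] for x in p_x) for z in Z}
p_z_given_y = {y: {z: sum(p_xy[(x, y)] * encoder[x][z] for x in p_x) / p_y[y]
                   for z in Z} for y in p_y}

def vib_rate(q):
    """E_X D(e(.|X) || q): upper bound on I(X;Z) for any marginal q."""
    return sum(p_x[x] * kl(encoder[x], q) for x in p_x)

def vceb_rate(b):
    """E_{X,Y} D(e(.|X) || b(.|Y)): upper bound on I(X;Z|Y) for any b."""
    return sum(p * kl(encoder[x], b[y]) for (x, y), p in p_xy.items())

i_xz = vib_rate(p_z)                    # bound is tight at q = p_Z
i_xz_given_y = vceb_rate(p_z_given_y)   # bound is tight at b = p_{Z|Y}

# Hypothetical restricted choices: a uniform marginal, and a backward
# encoder that is forced to ignore the class label.
q_restricted = {0: 0.5, 1: 0.5}
b_restricted = {y: p_z for y in p_y}

vib_slack = vib_rate(q_restricted) - i_xz
vceb_slack = vceb_rate(b_restricted) - i_xz_given_y
print(f"VIB rate slack:  {vib_slack:.6f}")
print(f"VCEB rate slack: {vceb_slack:.6f}")
```

In this sketch the slack of the VIB rate term equals $D(p_Z \| q_Z)$ and the slack of the VCEB rate term equals $\mathbb{E}_Y D(p_{Z|Y} \| b_{Z|Y})$. Neither slack depends on how small the IB or CEB objective is for the chosen encoder, only on how well the feasible sets can represent the induced distributions.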
Our second main result, in contrast, shows that the variational bounds can indeed be ordered if additional constraints are introduced. More specifically, if the variational bounds are restricted to consistent pairs as in Definition 1, then the following ordering can be shown. The proof of Theorem 1 is deferred to Section 6.
Theorem 1. If VCEB is constrained to a consistent classifier-backward encoder pair, and if , then
If VIB and VCEB are constrained to a consistent classifier–marginal and classifier-backward encoder pair, respectively, and if , then
A fortiori, (13b) continues to hold if VCEB is not constrained to a consistent classifier-backward encoder pair.
Theorem 1 thus relates the cost functions of VIB and VCEB in certain well-defined scenarios, contingent on the size of the feasible sets and . If the variational approximations are implemented using NNs, then these bounds are thus contingent on the capacity of the NNs trained to represent the backward encoder in case of VCEB and the marginal in the case of VIB. A few clarifying statements are now in order.
First, it is easy to imagine scenarios in which the inequalities are strict. Trivially, this is the case for (
13a) if
and
, and for (
13b) if
and
do not contain a consistent pair. Furthermore, if the set relations in the respective conditions do not hold with equality, the optimization over the strictly larger set of, e.g., marginals in (
13a), may yield strictly smaller values for the cost function
.
Second, the condition that is less restrictive than the condition stated in Assumption 1. This is because every backward encoder that is written as for and satisfies trivially that . Thus, if one accepts Assumption 1 as reasonable for a fair comparison between VCEB and VIB, then one must also accept that the ordering provided in the theorem is mainly a consequence of the restriction to consistent pairs, and not of one of the optimization problems having access to a significantly larger feasible set.
Finally, if
,
, and
are sufficiently large, i.e., if the NNs implementing the classifier, backward encoder, and marginal are sufficiently powerful, then both VCEB and VIB can be assumed to yield equally good approximations of the IB functional. To see this, let
,
, and
denote the marginal and conditional distributions induced by
and note that with (12) we get
and
Large and render the second terms in both equations close to zero for all choices of (see Example 2), while large renders the last terms close to zero (see Example 1). Thus, in this case not only do we have , but we also have that VCEB employs a consistent classifier-backward encoder pair by the fact that and . Thus, one may argue that if the feasible sets are sufficiently large, the restriction to consistent pairs may not lead to significantly looser bounds.
5. Discussion
In this note we have compared the IB and CEB functionals and their respective variational approximations. While IB and CEB are shown to be equivalent, the variational approximations VIB and VCEB yield different results after optimization. Specifically, it was observed that using VCEB as a training objective for stochastic NNs outperforms VIB in terms of accuracy, adversarial robustness, and out-of-distribution detection (see Section 3.1 of [2]). In our analysis we have observed that, although in general there is no ordering between VIB and VCEB (Examples 1 and 2), the optimal values of the cost functions can be ordered if additional restrictions are imposed (Theorem 1). Specifically, if VCEB is constrained to a consistent classifier-backward encoder pair, then its optimal value cannot fall below the optimal value of VIB. If, in contrast, VIB is constrained, then the optimal value of VIB cannot fall below the optimal value of VCEB (constrained or unconstrained). Thus, as expected, adding restrictions weakens the optimization problem w.r.t. the unconstrained counterpart.
These results imply that the superiority of VCEB is not caused by enabling a tighter bound on the IB functional than VIB does. Furthermore, it was shown in Table 1 of [6] that VCEB, constrained to a consistent classifier-backward encoder pair, yields better classification accuracy and robustness against corruptions than the unconstrained VCEB objective. Since the optimal value of the constrained VCEB problem obviously cannot fall below that of the unconstrained one, the achievable tightness of a variational bound on the IB functional appears to be even negatively correlated with generalization performance in this set of experiments. (We note that [6] only reports constrained VCEB results for the largest NN models, and the constrained models perform slightly worse on robustness to adversarial examples than the unconstrained VCEB models of the same size.)
One may hypothesize, though, that VCEB is more amenable to optimization, in the sense that it achieves a tighter bound on the IB functional when encoder, classifier, and variational distributions are implemented and optimized using NNs. However, optimizing VCEB and VIB was shown to yield very similar results in terms of a lower bound on
for several values of
, cf. Figure 4 of [
2], which seems not to support the above hypothesis.
We therefore conclude that the superiority of (constrained) VCEB is not due to it better approximating the IB functional. While the hypothesis that the optimized VCEB functional approximates the optimized IB functional better cannot be ruled out, we now formulate an alternative hypothesis, namely that the VCEB cost function itself instills desirable properties in the encoder that would otherwise not be instilled when relying exclusively on the IB functional, cf. Section 5.4 of [13]. For example, neither IB nor the Minimum Necessary Information principle includes a classifier $c_{\hat{Y}|Z}$ in its formulation. Thus, by the invariance of mutual information under bijections, there may be many encoders $e_{Z|X}$ in the feasible set $\mathcal{E}$ that lead to representations Z equivalent in terms of (1) and (2). Only a few of these representations are useful in the sense that the information about the class Y can be extracted “easily”. The variational approach of using a classifier to approximate $p_{Y|Z}$, however, ensures that, among all encoders $e_{Z|X}$ that are equivalent under the IB principle, one is chosen such that there exists a classifier $c_{\hat{Y}|Z}$ in $\mathcal{C}$ that allows inferring the class variable Y from Z with low entropy: While the IB and Minimum Necessary Information principles ensure that Z is informative about Y, the variational approaches of VIB and VCEB ensure that this information can be accessed in practice. Regarding the observed superiority of VCEB over VIB, one may argue that a variational bound relying on a backward encoder instills properties in the latent representation Z that are preferable over those that are achieved by optimizing a variational bound relying on a marginal only.
In other words, VCEB and VIB are justified as cost functions for NN training even without recourse to the IB and Minimum Necessary Information principles. This is not to say that the concept of compression, inherent in both of these principles, is not useful guidance; whether compression and generalization are causally related is the topic of an ongoing debate to which we do not want to contribute in this work. Rather, we claim that variational approaches may yield desirable properties that go beyond compression and that may be overlooked when too much focus is put on the functionals that are approximated with these variational bounds.
In combination with the variational approach, the selection of feasible sets can also have a profound impact on the properties of the representation Z. A representation Z is called disentangled if its distribution $p_Z$ factorizes. Disentanglement can thus be measured by total correlation, i.e., the Kullback–Leibler divergence between $p_Z$ and the product of its marginals (see Section 5 of [11]). Achille and Soatto have shown that selecting the feasible set $\mathcal{Q}$ of marginals in the optimization of VIB as a family of factorized marginals is equivalent to adding a total correlation term to the IB functional, effectively encouraging disentanglement, cf. Proposition 1 in [11]. Similarly, Amjad and Geiger note that selecting the feasible set $\mathcal{B}$ of backward encoders in the optimization of VCEB as a family of factorized backward encoders encourages class-conditional disentanglement; i.e., it enforces a Naive Bayes structure on the representation Z, cf. Corollary 1 & Section 3.1 of [5]. To understand the implications of these observations, it is important to note that neither disentanglement nor class-conditional disentanglement is encouraged by the IB or CEB functionals. However, by appropriately selecting the feasible sets of VIB or VCEB, disentanglement and class-conditional disentanglement can be achieved. While we leave it to the discretion of the reader to decide whether disentanglement is desirable or not, we believe that it is vital to understand that disentanglement is an achievement of optimizing a variational bound over an appropriately selected feasible set, and not one of the principles based on which these variational approaches are motivated.
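Total correlation as a measure of disentanglement can be computed directly for a discrete representation. The following minimal sketch is only an illustration of the definition used above (the Kullback–Leibler divergence between the joint distribution of the components of Z and the product of its marginals); the function name and the two example distributions over a two-dimensional binary representation are hypothetical.

```python
import math

def total_correlation(joint):
    """D(p_Z || prod_i p_{Z_i}) in nats; joint maps tuples (z_1, ..., z_d) to probabilities."""
    dim = len(next(iter(joint)))
    marginals = [{} for _ in range(dim)]
    for z, p in joint.items():
        for i, z_i in enumerate(z):
            marginals[i][z_i] = marginals[i].get(z_i, 0.0) + p
    return sum(
        p * math.log(p / math.prod(marginals[i][z_i] for i, z_i in enumerate(z)))
        for z, p in joint.items() if p > 0
    )

# Hypothetical factorized representation: p(z1, z2) = p(z1) p(z2)
# with p(z1 = 0) = 0.4 and p(z2 = 0) = 0.3.
factorized = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
# Hypothetical fully entangled representation: z2 is a copy of z1.
entangled = {(0, 0): 0.5, (1, 1): 0.5}

tc_factorized = total_correlation(factorized)
tc_entangled = total_correlation(entangled)
print(f"factorized: {tc_factorized:.6f} nats")
print(f"entangled:  {tc_entangled:.6f} nats")
```

The factorized example has total correlation zero up to floating-point error, whereas copying one binary component into the other yields $\log 2$ nats. Applying the same function to the class-conditional distributions of Z would give a corresponding measure of class-conditional disentanglement.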