Learnability for the Information Bottleneck

The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction for representation learning. The IB objective I(X;Z)−βI(Y;Z) employs a Lagrange multiplier β to tune this trade-off. However, in practice, not only is β chosen empirically without theoretical guidance, there is also a lack of theoretical understanding of the relationship between β, learnability, the intrinsic nature of the dataset, and model capacity. In this paper, we show that if β is improperly chosen, learning cannot happen: the trivial representation P(Z|X)=P(Z) becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable that arises as β is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β. We further show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), and discuss its relation to model capacity. We give practical algorithms to estimate the minimum β for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST, and CIFAR10.


INTRODUCTION
Tishby et al. (2000) introduced the Information Bottleneck (IB) objective function, which learns a representation Z of observed variables (X, Y) that retains as little information about X as possible, but simultaneously captures as much information about Y as possible:
min IB_β(X, Y; Z) = min [I(X; Z) − βI(Y; Z)]    (1)
I(·;·) is the mutual information. The hyperparameter β controls the trade-off between compression and prediction, in the same spirit as Rate-Distortion Theory (Shannon, 1948), but with a learned representation function P(Z|X) that automatically captures some part of the "semantically meaningful" information, where the semantics are determined by the observed relationship between X and Y.
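To make Eq. (1) concrete, the following is a minimal sketch that evaluates the IB objective for discrete variables. The toy joint distribution and the helper names `mutual_information` and `ib_objective` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in nats) of a joint probability table p(a, b)."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0                      # skip zero cells (0 * log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_z_given_x, beta):
    """I(X;Z) - beta * I(Y;Z) for a discrete joint p(x, y) and encoder p(z|x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    enc = np.asarray(p_z_given_x, dtype=float)   # shape (|X|, |Z|), rows sum to 1
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * enc                    # p(x, z) = p(x) p(z|x)
    p_yz = p_xy.T @ enc                          # p(y, z) = sum_x p(x, y) p(z|x)
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

# Hypothetical toy joint distribution (binary X, binary Y with 20% disagreement).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
trivial = np.array([[0.5, 0.5], [0.5, 0.5]])     # p(z|x) = p(z): Z independent of X
identity = np.eye(2)                             # Z = X

for beta in (2.0, 5.0):
    print(beta, ib_objective(p_xy, trivial, beta), ib_objective(p_xy, identity, beta))
```

In this toy case the trivial encoder always scores exactly 0, while the identity encoder only beats it once β is large enough, which is the behavior the rest of the paper characterizes.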
The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables (Chechik et al. (2005)), meta-Gaussians (Rey and Roth (2012)), continuous variables via variational methods (Alemi et al. (2016); Chalk et al. (2016); Fischer (2018)), deterministic scenarios (Strouse and Schwab (2017a); Kolchinsky et al. (2019)), and geometric clustering (Strouse and Schwab (2017b)), and it is used for learning invariant and disentangled representations in deep neural nets (Achille and Soatto (2018a,b)). However, a core issue remains: how should we set a good β? In the original work, the authors recommend sweeping β > 1, which can be prohibitively expensive in practice and also leaves open interesting theoretical questions around the relationship between β, P(Z|X), and the observed data, P(X, Y).
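This work begins to answer some of those questions by characterizing the onset of learning. Specifically:
• We introduce the concept of IB-Learnability and show that, as β is varied, the IB objective will undergo a phase transition from the inability to learn to the ability to learn (Section 3).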
• Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β (Section 4).
• We show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin on the information plane I(X; Z) vs. I(Y ; Z), and discuss its relation to model capacity (Section 5).
• We additionally prove a deep relationship between IB-Learnability, the hypercontractivity coefficient, the contraction coefficient, and the maximum correlation (Section 5).
We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, and demonstrate that it does a good job of approximating both the theoretical predictions and the empirical results (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST (LeCun et al., 1998), and CIFAR10 (Krizhevsky and Hinton, 2009) that the theoretical prediction for IB-Learnability closely matches experiment (Section 7).

A Motivating Example
How can we choose a good β? To gain intuition, consider learning multiple Variational Information Bottleneck (VIB) representations (Alemi et al., 2016) of MNIST (LeCun et al., 1998) at different β. We select the digits 0 and 1 for binary classification, and add class-conditional noise (Angluin and Laird, 1988) to the labels with flip probability 0.2, which simulates a general scenario where the data may be noisy and the dependence of Y on X is not deterministic. The algorithm only sees the corrupted labels. Fig. 1 shows the converged accuracy on the true labels for the VIB models plotted against β.
We see clearly that when β < 3.25, no learning happens, and the accuracy is the same as random guessing. Beginning with β > 3.25, there is a clear phase transition where the accuracy sharply increases, indicating the objective is able to learn a non-trivial representation. This kind of phase transition is typical in our experiments in Section 7. When the noise rate is high, the transition can happen at β ∼ 500; i.e., we need a large "β force" to extract relevant information from X to predict Y. In that case, an improperly chosen β in the unlearnable region will preclude learning a useful representation.

RELATED WORK
The original IB work (Tishby et al., 2000) provides a tabular method for exactly computing the optimal encoder distribution P(Z|X) for a given β and cardinality of the discrete representation, |Z|. Thus, the search for the desired model involves not only sweeping β, but also considering different representation dimensionalities. These restrictions were lifted somewhat by Chechik et al. (2005), which presents the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation Z of (X, Y), assuming that both X and Y are also multivariate Gaussians. They also note the presence of the trivial solution not only when β ≤ 1, but also depending on the eigenspectrum of the observed variables. However, the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in Rey and Roth (2012), which reformulates the objective in terms of the copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions for the joint distribution (X, Y) are assumed to be known, which is unlikely in practice.
Strouse and Schwab (2017a) presents the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, H(Z), rather than the transmission cost, I(X; Z), as in IB. This approach learns hard clusterings with different code entropies that vary with β. In this case, it is clear that a hard clustering with minimal H(Z) will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.
The first amortized IB objective is in the Variational Information Bottleneck (VIB) of Alemi et al. (2016). VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution (P(Y|Z)) and marginal distribution (P(Z)). This approach cleanly permits learning a stochastic encoder, P(Z|X), that is applicable to any x ∈ X, rather than just the particular X seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation.
Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) (Fischer, 2018). CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where I(X; Y) = I(X; Z) = I(Y; Z). The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model is approaching the MNI point by observing that a necessary condition for reaching the MNI point occurs when I(X; Z|Y) = 0.
The CEB objective I(X; Z|Y) − γI(Y; Z) is equivalent to IB at γ = β − 1, since I(X; Z|Y) = I(X; Z) − I(Y; Z) under the Markov chain Z ← X ↔ Y, so our analysis of IB-Learnability applies equally to CEB.
Kolchinsky et al. (2019) presents analytic and empirical results about trivial solutions in the particular setting of Y being a deterministic function of X in the observed sample. However, their use of the term "trivial solution" is distinct from ours. They are referring to the observation that β will demonstrate trivial interpolation between two different but valid solutions on the optimal frontier, rather than demonstrating a non-trivial trade-off between compression and prediction as expected when varying the IB Lagrangian. Our use of "trivial" refers to whether IB is capable of learning at all given a certain dataset and value of β.
Achille and Soatto (2018b) apply the IB Lagrangian to the weights of a neural network, yielding InfoDropout.
In Achille and Soatto (2018a), the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a non-trivial representation. More recently, Achille et al. (2018) repurpose the InfoDropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously-trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning.
Our work is also closely related to the hypercontractivity coefficient (Anantharam et al. (2013); Polyanskiy and Wu (2017)), defined as sup_{Z−X−Y} I(Y; Z)/I(X; Z), which by definition equals the inverse of β_0, our IB-Learnability threshold. In Anantharam et al. (2013), the authors prove that the hypercontractivity coefficient equals the contraction coefficient η_KL(P_{Y|X}, P_X), and Kim et al. (2017) propose a practical algorithm to estimate η_KL(P_{Y|X}, P_X), which provides a measure for potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability are also lower bounds for the hypercontractivity coefficient.

IB-LEARNABILITY
We are given instances of (x, y) ∈ X × Y drawn from a distribution with probability (density) P(X, Y), where unless otherwise stated, both X and Y can be discrete or continuous variables. (X, Y) is our training data, and may be characterized by different types of noise. The nature of this training data and the choice of β will be sufficient to predict the transition from unlearnable to learnable.
We can learn a representation Z of X with conditional probability p(z|x), such that X, Y, Z obey the Markov chain Z ← X ↔ Y. Eq. 1 above gives the IB objective with Lagrange multiplier β, IB_β(X, Y; Z), which is a functional of p(z|x): IB_β(X, Y; Z) = IB_β[p(z|x)]. The IB learning task is to find a conditional probability p(z|x) that minimizes IB_β(X, Y; Z). The larger β, the more the objective favors making a good prediction for Y. Conversely, the smaller β, the more the objective favors learning a concise representation.
How can we select β such that the IB objective learns a useful representation? In practice, the selection of β is done empirically. Indeed, Tishby et al. (2000) recommends "sweeping β". In this paper, we provide theoretical guidance for choosing β by introducing the concept of IB-Learnability and providing a series of IB-learnable conditions.
Definition 1. (X, Y) is IB_β-learnable if there exists a Z given by some p_1(z|x), such that IB_β(X, Y; Z)|_{p_1(z|x)} < IB_β(X, Y; Z)|_{p(z|x)=p(z)}, where p(z|x) = p(z) characterizes the trivial representation where Z = Z_trivial is independent of X.
If (X, Y) is IB_β-learnable, then when IB_β(X, Y; Z) is globally minimized, it will not learn a trivial representation. On the other hand, if (X, Y) is not IB_β-learnable, then when IB_β(X, Y; Z) is globally minimized, it may learn a trivial representation.
Trivial solutions. Definition 1 defines trivial solutions in terms of representations where I(X; Z) = I(Y; Z) = 0. Another type of trivial solution occurs when I(X; Z) > 0 but I(Y; Z) = 0. This type of trivial solution is not directly achievable by the IB objective, as I(X; Z) is minimized, but it can be achieved by construction or by chance. It is possible that starting learning from I(X; Z) > 0, I(Y; Z) = 0 could result in access to non-trivial solutions not available from I(X; Z) = 0. We do not attempt to investigate this type of trivial solution in this work.
Necessary condition for IB-Learnability. From Definition 1, we can see that IB_β-Learnability for any dataset (X, Y) requires β > 1. In fact, from the Markov chain Z ← X ↔ Y, we have I(Y; Z) ≤ I(X; Z) via the data-processing inequality. If β ≤ 1, then since I(X; Z) ≥ 0 and I(Y; Z) ≥ 0, we have that min(I(X; Z) − βI(Y; Z)) = 0, which is attained by the trivial representation, so (X, Y) is not IB_β-learnable.
Due to the reparameterization invariance of mutual information, we have the following theorem for IB_β-Learnability:
Theorem 1. Let X′ = g(X) be an invertible map (if X is a continuous variable, g is additionally required to be continuous). Then (X, Y) and (X′, Y) have the same IB_β-Learnability.
The proof for Theorem 1 is in Appendix B. Theorem 1 implies a favorable property for any condition for IB β -Learnability: the condition should be invariant to invertible mappings of X.We will inspect this invariance in the conditions we derive in the following sections.
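The necessary condition β > 1 from this section can be checked numerically on a small discrete problem. The sketch below draws random encoders p(z|x) for a random joint distribution and confirms that, for β ≤ 1, none of them drives the objective below the trivial value 0. The distribution sizes, the seed, and the helper names are arbitrary assumptions made for illustration.

```python
import numpy as np

def mi(p_ab):
    """Mutual information (nats) of a joint probability table."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log(p_ab[m] / (p_a @ p_b)[m])))

rng = np.random.default_rng(0)
p_xy = rng.dirichlet(np.ones(12)).reshape(4, 3)   # random joint with |X| = 4, |Y| = 3
p_x = p_xy.sum(axis=1)

def min_objective_over_random_encoders(beta, trials=2000):
    """Smallest I(X;Z) - beta * I(Y;Z) found among random encoders with |Z| = 3."""
    values = []
    for _ in range(trials):
        enc = rng.dirichlet(np.ones(3), size=4)   # each row is a random p(z|x)
        i_xz = mi(p_x[:, None] * enc)
        i_yz = mi(p_xy.T @ enc)                   # Z depends on Y only through X
        values.append(i_xz - beta * i_yz)
    return min(values)

# With beta <= 1 the data-processing inequality keeps the objective non-negative.
print(min_objective_over_random_encoders(1.0))
```

Random search cannot prove unlearnability for a given β > 1, so this only illustrates the β ≤ 1 case.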

SUFFICIENT CONDITIONS FOR IB-LEARNABILITY
Given (X, Y), how can we determine whether it is IB_β-learnable? To answer this question, we derive a series of sufficient conditions for IB_β-Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.
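Lemma 2.1. p(z|x) = p(z) is a stationary solution for IB_β(X, Y; Z).
The proof in Appendix F shows that both first-order variations δI(X; Z) and δI(Y; Z) vanish at the trivial representation p(z|x) = p(z), so δIB_β[p(z|x)] = 0 at the trivial representation. To find sufficient conditions for learnability, it therefore suffices to find conditions under which the trivial representation is not a local minimum of IB_β[p(z|x)].
Theorem 3. A sufficient condition for (X, Y) to be IB_β-learnable is that there exists a perturbation function h(z|x) with ∫h(z|x)dz = 0 such that the second-order variation δ²IB_β[p(z|x)] < 0 at the trivial representation p(z|x) = p(z).
Theorem 4. A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent, and
\[ \beta > \inf_{h(x)} \beta_0[h(x)] \tag{2} \]
where the functional β_0[h(x)] is given by
\[ \beta_0[h(x)] = \frac{\dfrac{\mathbb{E}_{x\sim p(x)}[h(x)^2]}{\left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2} - 1}{\mathbb{E}_{y\sim p(y)}\left[\left(\dfrac{\mathbb{E}_{x\sim p(x|y)}[h(x)]}{\mathbb{E}_{x\sim p(x)}[h(x)]}\right)^2\right] - 1} \]
Moreover, (inf_{h(x)} β_0[h(x)])^{−1} is a lower bound of the slope of the Pareto frontier in the information plane I(Y; Z) vs. I(X; Z) at the origin.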
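The proof is given in Appendix G, which also shows that if β > inf_{h(x)} β_0[h(x)] in Theorem 4 is satisfied, we can construct a perturbation function h(z|x) = h*(x)h_2(z), with h*(x) = arg min_{h(x)} β_0[h(x)] and ∫ h_2(z)²/p(z) dz > 0 for some h_2(z), such that h(z|x) satisfies Theorem 3. It also shows that the converse is true: if there exists h(z|x) such that the condition in Theorem 3 is true, then Theorem 4 is satisfied, i.e. β > inf_{h(x)} β_0[h(x)]. Moreover, letting the perturbation function be ε·h(z|x) = ε·h*(x)h_2(z) at the trivial solution, we have
\[ p_\beta(y|x) = p(y) + \epsilon^2 C_z \left(h^*(x) - \bar{h}^*_x\right) \int p(x', y)\left(h^*(x') - \bar{h}^*_x\right) dx' \tag{3} \]
where p_β(y|x) is the estimate of p(y|x) by IB for a certain β, h̄*_x = ∫h*(x)p(x)dx, and C_z = ∫ h_2(z)²/p(z) dz > 0 is a constant. This shows how the p_β(y|x) by IB explicitly depends on h*(x) at the onset of learning. The proof is provided in Appendix H.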
Theorem 4 suggests a method to estimate β 0 : we can parameterize h(x) e.g. by a neural network, with the objective of minimizing β 0 [h(x)].At its minimization, β 0 [h(x)] provides an upper bound for β 0 , and h(x) provides a soft clustering of the examples corresponding to a nontrivial perturbation of p(z|x) at p(z|x) = p(z) that minimizes IB β [p(z|x)].
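For discrete X and Y, the functional β_0[h(x)] can be evaluated directly. The sketch below assumes the ratio form of Theorem 4, namely (E_{p(x)}[h²]/(E_{p(x)}[h])² − 1) divided by (E_{p(y)}[(E_{p(x|y)}[h]/E_{p(x)}[h])²] − 1); the function name `beta0_of_h` and the toy distributions are illustrative assumptions.

```python
import numpy as np

def beta0_of_h(p_xy, h):
    """beta_0[h] for a discrete joint p(x, y) and a function h given as a vector over x.

    Assumes E_{p(x)}[h] != 0 and that E_{p(x|y)}[h] is not constant in y.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    h = np.asarray(h, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mean_h = p_x @ h                         # E_{x ~ p(x)}[h(x)]
    second_moment = p_x @ h**2               # E_{x ~ p(x)}[h(x)^2]
    cond_mean = (p_xy.T @ h) / p_y           # E_{x ~ p(x|y)}[h(x)], one entry per y
    numerator = second_moment / mean_h**2 - 1.0
    denominator = p_y @ (cond_mean / mean_h)**2 - 1.0
    return numerator / denominator

# Y is a deterministic function of X: x in {0, 1} -> y = 0, x = 2 -> y = 1.
deterministic = np.array([[0.3, 0.0],
                          [0.2, 0.0],
                          [0.0, 0.5]])
print(beta0_of_h(deterministic, [1.0, 1.0, 0.0]))   # indicator of the y = 0 inputs

# Binary labels flipped with probability 0.2.
noisy = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(beta0_of_h(noisy, [1.0, 0.0]))
```

Any h gives an upper bound on the threshold, so in practice one would minimize this quantity over a parameterized family of h, for example with gradient descent on a neural network as the text suggests.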
Alternatively, based on the property of β_0[h(x)], we can also use a specific functional form for h(x) in Eq. (2), and obtain a stronger sufficient condition for IB_β-Learnability. But we want to choose h(x) as near to the infimum as possible. To do this, we note the following characteristics for the R.H.S. of Eq. (2):
• We can set h(x) to be nonzero if x ∈ Ω_x for some region Ω_x ⊂ X and 0 otherwise. Then we obtain the following sufficient condition:
\[ \beta > \inf_{h(x),\,\Omega_x \subset \mathcal{X}} \frac{\dfrac{\int_{\Omega_x} p(x)\,h(x)^2\,dx}{\left(\int_{\Omega_x} p(x)\,h(x)\,dx\right)^2} - 1}{\mathbb{E}_{y\sim p(y)}\left[\left(\dfrac{\int_{\Omega_x} p(x|y)\,h(x)\,dx}{\int_{\Omega_x} p(x)\,h(x)\,dx}\right)^2\right] - 1} \tag{4} \]
• The numerator of the R.H.S. of Eq. (4) attains its minimum when h(x) is a constant within Ω_x. This can be proved using the Cauchy-Schwarz inequality: ⟨u, u⟩⟨v, v⟩ ≥ ⟨u, v⟩², setting u(x) = h(x)√p(x), v(x) = √p(x), and defining the inner product as ⟨u, v⟩ = ∫_{Ω_x} u(x)v(x)dx. Therefore, the numerator of the R.H.S. of Eq. (4) ≥ 1/∫_{x∈Ω_x} p(x)dx − 1, and attains equality when u(x)/v(x) = h(x) is constant.
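Based on these observations, we can let h(x) be a nonzero constant inside some region Ω_x ⊂ X and 0 otherwise, so that the infimum over an arbitrary function h(x) simplifies to an infimum over Ω_x ⊂ X, and we obtain a sufficient condition for IB_β-Learnability, which is a key result of this paper:
Theorem 5 (Conspicuous Subset Suff. Cond.). A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent, and
\[ \beta > \inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x) \tag{5} \]
where
\[ \beta_0(\Omega_x) = \frac{\dfrac{1}{p(\Omega_x)} - 1}{\mathbb{E}_{y\sim p(y|\Omega_x)}\left[\dfrac{p(y|\Omega_x)}{p(y)} - 1\right]} \]
and Ω_x denotes the event that x ∈ Ω_x, with probability p(Ω_x). Moreover, (inf_{Ω_x⊂X} β_0(Ω_x))^{−1} gives a lower bound of the slope of the Pareto frontier in the information plane I(Y; Z) vs. I(X; Z) at the origin.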
The proof is given in Appendix I. In the proof we also show that this condition is invariant to invertible mappings of X.
From Eq. (5), we see that three characteristics of the subset Ω_x ⊂ X lead to low β_0: (1) confidence: p(y|Ω_x) is large; (2) typicality and size: the number of elements in Ω_x is large, or the elements in Ω_x are typical, leading to a large probability p(Ω_x); (3) imbalance: p(y) is small for the subset Ω_x, but large for its complement. In summary, β_0 will be determined by the largest confident, typical, and imbalanced subset of examples, or an equilibrium of those characteristics. We term Ω_x at the minimization of β_0(Ω_x) the conspicuous subset.
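For a small discrete X, the conspicuous subset can be found by brute force. The sketch below assumes β_0(Ω_x) = (1/p(Ω_x) − 1) / (E_{y∼p(y|Ω_x)}[p(y|Ω_x)/p(y)] − 1) and enumerates every proper, non-empty subset; the toy joint distribution and the function names are illustrative assumptions, and the exhaustive search is only feasible for a handful of x values.

```python
import itertools
import numpy as np

def beta0_of_subset(p_xy, subset):
    """beta_0(Omega_x) for a discrete joint p(x, y) and a subset of x indices."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_y = p_xy.sum(axis=0)
    idx = list(subset)
    p_omega = p_xy[idx].sum()                          # p(Omega_x)
    p_y_given_omega = p_xy[idx].sum(axis=0) / p_omega  # p(y | Omega_x)
    numerator = 1.0 / p_omega - 1.0
    denominator = np.sum(p_y_given_omega**2 / p_y) - 1.0
    # A zero denominator means Omega_x carries no information about Y.
    return numerator / denominator if denominator > 0 else np.inf

def conspicuous_subset(p_xy):
    """Exhaustive search over proper, non-empty subsets of x (exponential cost)."""
    n_x = np.asarray(p_xy).shape[0]
    best_value, best_subset = np.inf, None
    for size in range(1, n_x):
        for subset in itertools.combinations(range(n_x), size):
            value = beta0_of_subset(p_xy, subset)
            if value < best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# Hypothetical joint: x0 and x3 are confidently labeled, x1 and x2 are ambiguous.
p_xy = np.array([[0.30, 0.00],
                 [0.15, 0.05],
                 [0.10, 0.10],
                 [0.00, 0.30]])
print(conspicuous_subset(p_xy))
```

In this example a confident subset that predicts the rarer class wins, in line with the confidence, typicality, and imbalance characteristics listed above.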
Multiple phase transitions. Based on this characterization of Ω_x, we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a region Ω_x0 that is small but "typical", consists of all elements confidently predicted as y_0 by p(y|x), and where y_0 is the least common class. By construction, this Ω_x0 will dominate the infimum in Eq. (5), resulting in a small value of β_0. However, the remaining X − Ω_x0 effectively form a new dataset, X_1. At exactly β_0, we may have that the current encoder, p_0(z|x), has no mutual information with the remaining classes in X_1; i.e., I(Y_1; Z_0) = 0. In this case, Definition 1 applies to p_0(z|x) with respect to I(X_1; Z_1). We might expect to see that, at β_0, learning will plateau until we get to some β_1 > β_0 that defines the phase transition for X_1. Clearly this process could repeat many times, with each new dataset X_i being distinctly more difficult to learn than X_{i−1}.
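Similarity to information measures. The denominator of β_0(Ω_x) in Eq. (5) is closely related to mutual information. Using the inequality x − 1 ≥ log(x) for x > 0, it becomes:
\[ \mathbb{E}_{y\sim p(y|\Omega_x)}\left[\frac{p(y|\Omega_x)}{p(y)} - 1\right] \ge \mathbb{E}_{y\sim p(y|\Omega_x)}\left[\log\frac{p(y|\Omega_x)}{p(y)}\right] \]
Of course, the right-hand side is also D_KL[p(y|Ω_x)||p(y)], so we know that the denominator of Eq. (5) is non-negative. Incidentally, E_{y∼p(y|Ω_x)}[p(y|Ω_x)/p(y) − 1] is the density of "rational mutual information" (Lin and Tegmark (2016)) at Ω_x.
Similarly, the numerator of β_0(Ω_x) is related to the self-information of Ω_x:
\[ \frac{1}{p(\Omega_x)} - 1 \ge \log\frac{1}{p(\Omega_x)} = -\log p(\Omega_x) \]
so we can estimate the phase transition as:
\[ \beta \simeq \inf_{\Omega_x \subset \mathcal{X}} \frac{-\log p(\Omega_x)}{\mathbb{E}_{y\sim p(y|\Omega_x)}\left[\log\dfrac{p(y|\Omega_x)}{p(y)}\right]} \tag{6} \]
Since Eq. (6) replaces both the numerator and the denominator with lower bounds, it does not give us a bound on β_0.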

Estimating model capacity.
The observation that a model can't distinguish between cluster overlap in the data and its own lack of capacity gives an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve.
On the Pareto frontier in the information plane, at any point that minimizes IB_β(X, Y; Z), the frontier has slope β^{−1} if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope β^{−1} will take its maximum β_0^{−1} at the origin, which implies IB_β-Learnability for β > β_0, so that the threshold for IB_β-Learnability is simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for IB_β-learnability is the inverse of its maximum slope. Indeed, Theorem 4 and Theorem 5 give lower bounds of the slope of the Pareto frontier at the origin.
IB-Learnability, hypercontractivity, and maximum correlation. IB-Learnability and the sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:
\[ \frac{1}{\beta_0} = \xi(X;Y) = \eta_{KL} \ge \sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X;Y) \]
where ρ_m(X; Y) is the maximum correlation (Hirschfeld, 1935; Gebelein, 1941), ξ(X; Y) is the hypercontractivity coefficient, and η_KL is the contraction coefficient. Our proof relies on Anantharam et al. (2013)'s proof that ξ(X; Y) = η_KL. Our work reveals the deep relationship between IB-Learnability and these earlier concepts and provides additional insights about what aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of (X, Y).
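For discrete variables, the maximum correlation can be computed in closed form: a standard result (not derived in this paper) is that it equals the second-largest singular value of the matrix with entries p(x, y)/√(p(x)p(y)). The sketch below uses that fact; the function name and the toy joint distribution are illustrative assumptions.

```python
import numpy as np

def maximal_correlation(p_xy):
    """Hirschfeld-Gebelein-Renyi maximum correlation of a discrete joint p(x, y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    q = p_xy / np.sqrt(np.outer(p_x, p_y))         # normalized joint matrix
    singular_values = np.linalg.svd(q, compute_uv=False)
    return float(singular_values[1])               # the largest singular value is 1

# Binary labels flipped with probability 0.2, uniform marginals.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
rho_m = maximal_correlation(p_xy)
print(rho_m, 1.0 / rho_m**2)
```

Under the relation between the sufficient condition and the squared maximum correlation discussed here, 1/ρ_m² is the value of inf_h β_0[h(x)], which for this symmetric toy case is 1/(1 − 2·0.2)².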

ESTIMATING THE IB-LEARNABILITY CONDITION
Theorem 5 not only reveals the relationship between the learnability threshold for β and the least noisy region of P (Y |X), but also provides a way to practically estimate β 0 , both in the general classification case, and in more structured settings.

Estimation Algorithm
Based on Theorem 5, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound β̃_0 ≥ β_0, as well as to discover the conspicuous subset that determines β_0.
We approximate the probability of each example p(x_i) by its empirical probability, p̂(x_i). E.g., for MNIST, p̂(x_i) = 1/N, where N is the number of examples in the dataset. The algorithm starts by first learning a maximum likelihood model of p_θ(y|x), using e.g. feed-forward neural networks. It then constructs a matrix P_{y|x} and a vector p_y to store the estimated p(y|x) and p(y) for all the examples in the dataset. To find the subset Ω such that β̃_0 is as small as possible, by the previous analysis we want to find a conspicuous subset such that its p(y|x) is large for a certain class j (to make the denominator of Eq. (5) large), and that contains as many elements as possible (to make the numerator small).
We suggest the following heuristics to discover such a conspicuous subset. For each class j, we sort the rows of P_{y|x} according to their probability for the pivot class j in decreasing order, and then perform a search over i_left, i_right for Ω = {i_left, i_left + 1, ..., i_right}. Since β̃_0 is large when Ω contains too few or too many elements, the minimum of β̃_0^(j) for class j will typically be reached with some intermediate-sized subset, and we can use binary search or another discrete search algorithm for the optimization. The algorithm stops when β̃_0^(j) does not improve by tolerance ε. The algorithm then returns β̃_0 as the minimum over all the classes β̃_0^(1), ..., β̃_0^(N), as well as the conspicuous subset that determines this β̃_0.
After estimating β̃_0, we can then use it for learning with IB, either directly, or as an anchor for a region where we can perform a much smaller sweep than we otherwise would have. This may be particularly important for very noisy datasets, where β_0 can be very large.
Algorithm 1: Estimating the upper bound for β_0 and identifying the conspicuous subset.
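The heuristic described above can be sketched as follows. This is a simplified variant under stated assumptions, not the paper's Algorithm 1 verbatim: it takes a precomputed (N, C) matrix of predicted p(y|x_i), assumes uniform p̂(x_i) = 1/N, fixes i_left at the most confident example, and scans every prefix size exhaustively instead of using binary search with a tolerance. The function name `estimate_beta0` is hypothetical.

```python
import numpy as np

def estimate_beta0(p_y_given_x):
    """Heuristic upper-bound estimate of beta_0 and the subset that attains it.

    p_y_given_x: array of shape (N, C) with the model's predicted p(y | x_i).
    Assumes every class has nonzero average predicted probability.
    """
    probs = np.asarray(p_y_given_x, dtype=float)
    n, n_classes = probs.shape
    p_y = probs.mean(axis=0)                       # p(y) under uniform p(x_i) = 1/N
    best_beta, best_subset = np.inf, None
    for j in range(n_classes):                     # pivot class
        order = np.argsort(-probs[:, j])           # most confident for class j first
        cumulative = np.cumsum(probs[order], axis=0)
        for size in range(1, n):                   # Omega = first `size` sorted examples
            p_omega = size / n
            p_y_given_omega = cumulative[size - 1] / size
            denominator = np.sum(p_y_given_omega**2 / p_y) - 1.0
            if denominator <= 0:
                continue                           # subset uninformative about Y
            beta = (1.0 / p_omega - 1.0) / denominator
            if beta < best_beta:
                best_beta, best_subset = beta, order[:size]
    return best_beta, best_subset

# Hypothetical predictions: two balanced classes, each predicted with confidence 0.8.
probs = np.vstack([np.tile([0.8, 0.2], (50, 1)), np.tile([0.2, 0.8], (50, 1))])
beta_est, subset = estimate_beta0(probs)
print(beta_est, len(subset))
```

The exhaustive prefix scan costs O(N·C) per pivot class thanks to the cumulative sums, so the search shortcuts mentioned in the text mainly matter when a full (i_left, i_right) search is performed.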

Special Cases for Estimating β 0
Theorem 5 may still be challenging to estimate, due to the difficulty of making accurate estimates of p(Ω x ) and searching over Ω x ⊂ X .However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.
Class-conditional label noise. Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities and we only observe the corrupted labels. This problem has been studied extensively (Angluin and Laird, 1988; Natarajan et al., 2013; Liu and Tao, 2016; Xiao et al., 2015; Northcutt et al., 2017). If IB is applied to this scenario, how large a β do we need? The following corollary provides a simple formula.
Corollary 5.1. Suppose that the true class labels are y*, and the input space belonging to each y* has no overlap. We only observe the corrupted labels y with class-conditional noise p(y|x, y*) = p(y|y*), and Y is not independent of X. We have that a sufficient condition for IB_β-Learnability is:
\[ \beta > \inf_{y^*} \frac{\dfrac{1}{p(y^*)} - 1}{\sum_{y} \dfrac{p(y|y^*)^2}{p(y)} - 1} \tag{8} \]
We see that under class-conditional noise, the sufficient condition reduces to a discrete formula which only depends on the noise rates p(y|y*) and the true class probability p(y*), which can be accurately estimated via e.g. Northcutt et al. (2017). Additionally, if we know that the noise is class-conditional, but the observed β_0 is greater than the R.H.S. of Eq. (8), we can deduce that there is overlap between the true classes. The proof of Corollary 5.1 is provided in Appendix J.
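The class-conditional noise formula is simple enough to evaluate directly. The sketch below assumes the form obtained from Theorem 5 by taking Ω_x to be the inputs of one true class, i.e. (1/p(y*) − 1) / (Σ_y p(y|y*)²/p(y) − 1) minimized over y*; the function name and the example noise matrix are illustrative assumptions.

```python
import numpy as np

def beta0_class_conditional_noise(noise_matrix, p_true):
    """Threshold estimate under class-conditional label noise.

    noise_matrix[k, j] = p(y = j | y* = k); p_true[k] = p(y* = k).
    Assumes each row of noise_matrix differs from the observed marginal p(y).
    """
    t = np.asarray(noise_matrix, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    p_y = p_true @ t                               # observed label marginal p(y)
    numerators = 1.0 / p_true - 1.0
    denominators = (t**2 / p_y).sum(axis=1) - 1.0  # sum_y p(y|y*)^2 / p(y) - 1
    return float(np.min(numerators / denominators))

# Balanced binary task with symmetric flip probability 0.2.
rho = 0.2
noise = np.array([[1 - rho, rho],
                  [rho, 1 - rho]])
print(beta0_class_conditional_noise(noise, [0.5, 0.5]))
```

For a balanced binary task with symmetric noise rate ρ, this reduces to 1/(1 − 2ρ)², which grows without bound as ρ approaches 0.5.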
Deterministic relationships. Theorem 5 also reveals that β_0 relates closely to whether Y is a deterministic function of X, as shown by Corollary 5.2:
Corollary 5.2. Assume that Y contains at least one value y such that its probability p(y) > 0. If Y is a deterministic function of X and not independent of X, then a sufficient condition for IB_β-Learnability is β > 1.
The assumption in Corollary 5.2 is satisfied by classification and certain regression problems. Combined with the necessary condition β > 1 for any dataset (X, Y) to be IB_β-learnable (Section 3), we have that under the assumption, if Y is a deterministic function of X, then a necessary and sufficient condition for IB_β-learnability is β > 1; i.e., its β_0 is 1. The proof of Corollary 5.2 is provided in Appendix J.
Therefore, in practice, if we find that β 0 > 1, we may infer that Y is not a deterministic function of X.For a classification task, we may infer that either some classes have overlap, or the labels are noisy.However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed β 0 , even when learning deterministic functions.

EXPERIMENTS
To test how the theoretical conditions for IB_β-learnability match with experiment, we apply them to synthetic datasets, MNIST, and CIFAR10.

Synthetic Dataset Experiments
We construct a set of datasets from 2D mixtures of 2 Gaussians as X and the identity of the mixture component as Y. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise, and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep β with exponential steps, and observe I(X; Z) and I(Y; Z). We then compare the empirical β_0 indicated by the onset of above-zero I(X; Z) with predicted values for β_0.
Classification with class-conditional noise. In this experiment, we have a mixture-of-Gaussians distribution with 2 components, each of which is a 2D Gaussian with diagonal covariance matrix Σ = diag(0.25, 0.25). The two components have distance 16 (hence virtually no overlap) and equal mixture weight. For each x, the label y ∈ {0, 1} is the identity of the component it belongs to. We create multiple datasets by randomly flipping the labels y with a certain noise rate ρ = P(y = 0|y* = 1) = P(y = 1|y* = 0). For each dataset, we train VIB models across a range of β, and observe the onset of learning. In the classification setting, estimating β_0 by optimizing Eq. (2) doesn't require any learned estimate of p(y|x), as we can directly use the empirical p(y) and p(x|y) from SGD mini-batches.
This experiment also shows that for datasets where the signal-to-noise ratio is small, β_0 can be very high. Instead of blindly sweeping β, our result can provide guidance for setting β so that learning can happen.
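A dataset of the kind used in this experiment can be generated as in the following sketch. The placement of the two component means along the first axis, the seed, and the function name are assumptions made for illustration; only the covariance, distance, equal weights, and symmetric flip rate come from the description above.

```python
import numpy as np

def make_noisy_mixture(n, distance=16.0, noise_rate=0.2, seed=0):
    """Sample a 2-component 2D Gaussian mixture with symmetrically flipped labels."""
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, 2, size=n)                        # equal mixture weights
    centers = np.array([[-distance / 2, 0.0],
                        [distance / 2, 0.0]])                  # assumed layout of the means
    x = centers[y_true] + rng.normal(scale=0.5, size=(n, 2))   # Sigma = diag(0.25, 0.25)
    flip = rng.random(n) < noise_rate                          # flip each label w.p. rho
    y_observed = np.where(flip, 1 - y_true, y_true)
    return x, y_true, y_observed

x, y_true, y_observed = make_noisy_mixture(5000, noise_rate=0.2)
print(x.shape, np.mean(y_true != y_observed))
```

Training only ever sees `y_observed`, while `y_true` is kept for evaluating accuracy on the clean labels, mirroring the motivating MNIST example.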
Classification with class overlap. In this experiment, we test how different amounts of overlap among classes influence β_0. We use the mixture of Gaussians with two components, each of which is a 2D Gaussian with diagonal covariance matrix Σ = diag(0.25, 0.25). The two components have weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe β_{0,exp}. Since we don't add noise to the labels, if there were no overlap and a deterministic map from X to Y, we would have β_0 = 1 by Corollary 5.2. The more overlap between the two classes, the more uncertain Y is given X. By Eq. (5) we expect β_0 to be larger, which is corroborated in Fig. 4.

MNIST Experiments
We perform binary classification with digits 0 and 1, and as before, add class-conditional noise to the labels with varying noise rates ρ. To explore how the model capacity influences the onset of learning, for each dataset we train two sets of VIB models differing only by the number of neurons in the hidden layers of the encoder: one with n = 512 neurons, the other with n = 128 neurons. As we describe in Section 4, insufficient capacity will result in more uncertainty of Y given X from the point of view of the model, so we expect the observed β_0 for the n = 128 model to be larger. This result is confirmed by the experiment (Fig. 5). Also, in Fig. 5 we plot β_0 given by different estimation methods. We see that the observations (A), (B), (C) and (D) in Section 7.1 still hold.
To see what IB learns at its onset of learning for the full MNIST dataset, we optimize Eq. (2) w.r.t. the full MNIST dataset, and visualize the clustering of digits by h(x). Eq. (2) can be optimized with SGD using any differentiable parameterized mapping h(x): X → R.
Figure 6: Histograms of the full MNIST training and validation sets according to h(X). Note that both are bimodal, and the histograms are indistinguishable. In both cases, h(x) has learned to separate most of the ones into the smaller mode, but difficult ones are in the wide valley between the two modes. See Figure 8 for all of the training images to the left of the red threshold line, as well as the first few images to the right of the threshold.

MNIST Experiments using Equation
In this case, we chose to parameterize h(x) with a PixelCNN++ architecture (van den Oord et al., 2016; Salimans et al., 2017), as PixelCNN++ is a powerful autoregressive model for images that gives a scalar output (normally interpreted as log p(x)). Eq. (2) should generally give two clusters in the output space, as discussed in Section 4. In this setup, smaller values of h(x) correspond to the subset of the data that is easiest to learn. Fig. 6 shows two strongly separated clusters, as well as the threshold we choose to divide them. Fig. 8 shows the first 5,776 MNIST training examples as sorted by our learned h(x), with the examples above the threshold highlighted in red. We can clearly see that our learned h(x) has separated the "easy" one (1) digits from the rest of the MNIST training set.

CIFAR10 Forgetting Experiments
For CIFAR10 (Krizhevsky and Hinton, 2009), we study how forgetting varies with β. In other words, given a VIB model trained at some high β_2, if we anneal it down to some much lower β_1, what I(Y; Z) does the model converge to? Using Alg. 1, we estimated β_0 = 1.0483 on a version of CIFAR10 with 20% label noise, where P_{y|x} is estimated by maximum likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest β with performance above chance was β = 1.048, a very tight match with the estimate from Alg. 1. See Appendix L.2 for details.

CONCLUSION
In this paper, we have presented theoretical results for predicting the onset of learning, and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise. We believe these results will provide theoretical and practical guidance for choosing β in the IB framework for balancing prediction and compression. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.
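Applying this to our functional IB_β[p(z|x)], an immediate result of Theorem 6 is that, if at p(z|x) = p(z) there exists a perturbation ε·h(z|x) such that δ²IB_β[p(z|x)] < 0, then the trivial representation is not a minimum of IB_β[p(z|x)]. Using the definition of IB_β-learnability, we have that (X, Y) is IB_β-learnable.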

E First- and second-order variations of IB_β[p(z|x)]
In this section, we derive the first- and second-order variations of IB_β[p(z|x)], which are needed for proving Lemma 2.1 and Theorem 4.
Lemma 6.1. Using perturbative function h(z|x), we have
Proof. Since IB_β[p(z|x)] = I(X; Z) − βI(Y; Z), let us calculate the first- and second-order variations of I(X; Z) and I(Y; Z) w.r.t. p(z|x), respectively. Throughout this derivation, we use ε · h(z|x) as a perturbative function, for ease of identifying the different orders of variation. We will finally absorb ε into h(z|x).
We have
Expanding F_1[p(z|x) + ε · h(z|x)] to the second order of ε, we have
Collecting the first-order terms of ε, we have
Collecting the second-order terms of ε², we have
Now let us calculate the first- and second-order variations of F_2[p(z|x)] = I(Z; Y). We have
Then expanding F_2[p(z|x) + ε · h(z|x)] to the second order of ε, we have
Collecting the first-order terms of ε, we have
Collecting the second-order terms of ε², we have
Finally, we have
Absorbing ε into h(z|x), we get rid of the ε factor and obtain the final expression in Lemma 6.1.
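As a worked summary of the expansion described above, the second-order variations can be sketched as follows. This is a reconstruction from F_1 = I(X; Z) and F_2 = I(Z; Y) with the perturbation ε · h(z|x), after absorbing ε into h(z|x). The primed variable x' is a second integration variable, and both the factor 1/2 and the grouping of terms are assumptions that may differ from the layout used in Lemma 6.1.

```latex
\begin{align*}
\delta^2 I(X;Z) &= \frac{1}{2}\left[\int dx\,dz\,\frac{p(x)^2}{p(x,z)}\,h(z|x)^2
  - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}\,h(z|x)h(z|x')\right],\\
\delta^2 I(Y;Z) &= \frac{1}{2}\left[\int dx\,dx'\,dy\,dz\,\frac{p(x,y)p(x',y)}{p(y,z)}\,h(z|x)h(z|x')
  - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}\,h(z|x)h(z|x')\right],\\
\delta^2 IB_\beta[p(z|x)] &= \delta^2 I(X;Z) - \beta\,\delta^2 I(Y;Z).
\end{align*}
```

The shared second term comes from the −∫p(z) log p(z) dz part of both mutual informations, which is why it enters δ²IB_β with the coefficient (β − 1).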

F Proof of Lemma 2.1
Proof. Using Lemma 6.1, we have
Let p(z|x) = p(z) (the trivial representation); then log(p(z|x)/p(z)) ≡ 0. Therefore, the two integrals are both 0. Hence, δIB_β[p(z|x)] = 0 at p(z|x) = p(z).
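A small numerical check can make the vanishing first-order term concrete. The sketch below is illustrative only: it uses a hypothetical discrete X with 5 values and Z with 4 values, a random h(z|x) with rows summing to zero, and an assumed closed form for the second-order variation of I(X; Z) at the trivial representation. If the first-order variation vanishes, I(X; Z) at p(z) + ε · h(z|x) should scale as ε².

```python
import math
import random

def mutual_information(p_x, p_z_given_x):
    """I(X;Z) in nats for discrete X, Z; p_z_given_x[i][k] = p(z_k | x_i)."""
    n_x, n_z = len(p_x), len(p_z_given_x[0])
    p_z = [sum(p_x[i] * p_z_given_x[i][k] for i in range(n_x)) for k in range(n_z)]
    return sum(
        p_x[i] * p_z_given_x[i][k] * math.log(p_z_given_x[i][k] / p_z[k])
        for i in range(n_x) for k in range(n_z)
    )

def second_variation_at_trivial(p_x, p_z, h):
    """Assumed form: 1/2 [sum_{x,z} p(x) h^2 / p(z) - sum_z (sum_x p(x) h)^2 / p(z)]."""
    n_x, n_z = len(p_x), len(p_z)
    h_bar = [sum(p_x[i] * h[i][k] for i in range(n_x)) for k in range(n_z)]
    first = sum(p_x[i] * h[i][k] ** 2 / p_z[k] for i in range(n_x) for k in range(n_z))
    second = sum(h_bar[k] ** 2 / p_z[k] for k in range(n_z))
    return 0.5 * (first - second)

# Hypothetical toy distributions (not from the experiments).
random.seed(0)
p_x = [0.1, 0.15, 0.2, 0.25, 0.3]
p_z = [0.1, 0.2, 0.3, 0.4]
h = []
for _ in p_x:
    row = [random.gauss(0.0, 1.0) for _ in p_z]
    mean = sum(row) / len(row)
    h.append([v - mean for v in row])  # enforce sum_z h(z|x) = 0

eps = 1e-4
p_z_given_x = [[p_z[k] + eps * h[i][k] for k in range(len(p_z))] for i in range(len(p_x))]
mi = mutual_information(p_x, p_z_given_x)
quad = eps ** 2 * second_variation_at_trivial(p_x, p_z, h)
print(mi, quad)  # the two values should agree to leading order in eps
```

The check only covers I(X; Z); the same argument applies to I(Y; Z) through p(z|y) = ∫p(x|y)p(z|x)dx.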

G Proof of Theorem 4
Proof. Firstly, from the necessary condition β > 1 in Section 3, any sufficient condition for IB_β-learnability should imply β > 1.
Now using Theorem 3, a sufficient condition for (X, Y) to be IB_β-learnable is that there exists h(z|x) with ∫h(z|x)dz = 0 such that δ²IB_β[p(z|x)] < 0 at p(z|x) = p(z).
At the trivial representation, p(z|x) = p(z) and hence p(x, z) = p(x)p(z). Due to the Markov chain Z ← X ↔ Y, we have p(y, z) = p(y)p(z). Substituting these into δ²IB_β[p(z|x)] in Lemma 6.1, the condition becomes: there exists h(z|x) with ∫h(z|x)dz = 0, such that
Rearranging terms and simplifying, we have
where
Now we prove that the condition that ∃h(z|x) s.t.
Since G[h(z|x)] does not contain integration over z, we can treat the z in G[h(z|x)] as a parameter, and we have that ∃h(x) s.t. G[h(x)] < 0.
Conversely, if there exists a certain function h(x) such that G[h(x)] < 0, we can find some h_2(z) such that ∫h_2(z)dz = 0 and ∫h_2(z)²/p(z) dz > 0, and let h_1(z|x) = h(x)h_2(z). Now we have
In other words, the condition Eq. (10) is equivalent to requiring that there exists an h(x) such that G[h(x)] < 0. Hence, a sufficient condition for IB_β-learnability is that there exists an h(x) such that
When h(x) = C = constant in the entire input space X, Eq. (11) becomes:
which cannot be true. Therefore, h(x) = constant cannot satisfy Eq. (11).
Rearranging terms and simplifying, and noting that (∫dx h(x)p(x))² > 0 due to h(x) ≢ 0 = constant, we have
For the R.H.S. of Eq. (12), let us show that it is greater than 0. Using the Cauchy-Schwarz inequality ⟨u, u⟩⟨v, v⟩ ≥ ⟨u, v⟩², setting u(x) = h(x)√p(x) and v(x) = √p(x), and defining the inner product as ⟨u, v⟩ = ∫u(x)v(x)dx, we have
It attains equality when u(x)/v(x) = h(x) is constant. Since h(x) cannot be constant, the R.H.S. of Eq. (12) is greater than 0.
For the L.H.S. of Eq. (12), due to the necessary condition that β > 0, if
then Eq. (12) cannot hold. The h(x) for which Eq. (12) holds are therefore those that satisfy
We see this constraint contains the requirement that h(x) ≢ constant.
Written in the form of expectations, we have
Since the square function is convex, using Jensen's inequality on the outer expectation on the L.H.S. of Eq. (13), we have
The equality holds iff E_{x∼p(x|y)}[h(x)] is constant w.r.t. y, i.e. Y is independent of X. Therefore, in order for Eq. (13) to hold, we require that Y is not independent of X.
Using Jensen's inequality on the inner expectation on the L.H.S. of Eq. (13), we have
The equality holds when h(x) is a constant. Since we require that h(x) is not a constant, the equality cannot be reached.
Under the constraint that Y is not independent of X, we can divide both sides of Eq. (11), and obtain the condition: there exists an h(x) such that
Written in the form of expectations, we have
We can absorb the constraint Eq. (13) into the above formula, and get
where
which proves the condition of Theorem 4.
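As a compact summary of this argument, the chain of conditions can be sketched as below for the factorized perturbation h(z|x) = h(x)h_2(z). It is reconstructed from the steps described in this proof rather than copied from the displayed equations, so the form of G[h(x)] and the definition of β_0[h(x)] are assumptions whose layout may differ from Eqs. (10)-(14). The last equivalence requires the bracket multiplying β to be positive, which is the constraint discussed for Eq. (13).

```latex
\begin{align*}
G[h(x)] &= \int dx\,p(x)h(x)^2
  - \beta\int dy\,p(y)\left(\int dx\,p(x|y)h(x)\right)^2
  + (\beta-1)\left(\int dx\,p(x)h(x)\right)^2 < 0 \\
\iff\quad
\beta\left(\mathbb{E}_{y\sim p(y)}\!\left[\left(\mathbb{E}_{x\sim p(x|y)}[h(x)]\right)^2\right]
  - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2\right)
  &> \mathbb{E}_{x\sim p(x)}[h(x)^2] - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2 \\
\iff\quad
\beta &> \beta_0[h(x)] :=
\frac{\mathbb{E}_{x\sim p(x)}[h(x)^2] - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2}
     {\mathbb{E}_{y\sim p(y)}\!\left[\left(\mathbb{E}_{x\sim p(x|y)}[h(x)]\right)^2\right]
       - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2}.
\end{align*}
```

Read this way, β_0[h(x)] is the variance of h(x) divided by the variance over y of its class-conditional mean. With h(x) = C, G reduces to C² − βC² + (β − 1)C² = 0, matching the observation that a constant h(x) cannot satisfy the strict inequality.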
Furthermore, from Eq. (14) we have
for h(x) ≢ const, which satisfies the necessary condition of β > 1 in Section 3.
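The consistency with β > 1 can be illustrated numerically. The sketch below assumes that β_0[h(x)] takes the variance-ratio form Var_x[h] / Var_y[E[h|y]] (a reconstruction, not the displayed equation), uses a hypothetical 4×2 joint distribution, and samples random non-constant h(x). The minimum over samples is only a crude stand-in for the infimum.

```python
import random

def beta0_of_h(p_xy, h):
    """Assumed form: Var_x[h] / Var_y[E[h | y]], with p_xy[i][j] = p(x_i, y_j)."""
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(p_xy[i]) for i in range(n_x)]
    p_y = [sum(p_xy[i][j] for i in range(n_x)) for j in range(n_y)]
    mean_h = sum(p_x[i] * h[i] for i in range(n_x))
    mean_h2 = sum(p_x[i] * h[i] ** 2 for i in range(n_x))
    # Class-conditional means E_{x ~ p(x|y)}[h(x)].
    cond_mean = [sum(p_xy[i][j] * h[i] for i in range(n_x)) / p_y[j] for j in range(n_y)]
    outer = sum(p_y[j] * cond_mean[j] ** 2 for j in range(n_y))
    return (mean_h2 - mean_h ** 2) / (outer - mean_h ** 2)

# Hypothetical joint distribution with X and Y dependent.
p_xy = [[0.20, 0.05],
        [0.15, 0.10],
        [0.10, 0.15],
        [0.05, 0.20]]

random.seed(0)
values = [beta0_of_h(p_xy, [random.gauss(0.0, 1.0) for _ in p_xy]) for _ in range(1000)]
print(min(values))  # never below 1, by the law of total variance
```

Under this assumed form, the bound β_0[h(x)] ≥ 1 is the law of total variance: the variance of the conditional mean cannot exceed the total variance.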
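Proof of lower bound of slope of the Pareto frontier at the origin: Now we prove the second statement of Theorem 4. Since δI(X; Z) = 0 and δI(Y; Z) = 0 according to Lemma 2.1, we have ΔI(Y; Z)/ΔI(X; Z) = δ²I(Y; Z)/δ²I(X; Z) to leading order. Substituting the expressions of δ²I(Y; Z) and δ²I(X; Z) from Lemma 6.1 at the trivial representation, with h(z|x) = h(x)h_2(z), we have
\[
\frac{\delta^2 I(Y;Z)}{\delta^2 I(X;Z)}
= \frac{\int dx\,dx'\,dy\,dz\,\frac{p(x,y)p(x',y)}{p(y)p(z)}h(z|x)h(z|x') - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}h(z|x)h(z|x')}
       {\int dx\,dz\,\frac{p(x)}{p(z)}h(z|x)^2 - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}h(z|x)h(z|x')}
= \frac{\int dx\,dx'\,dy\,\frac{p(x,y)p(x',y)}{p(y)}h(x)h(x') - \left(\int dx\,p(x)h(x)\right)^2}
       {\int dx\,p(x)h(x)^2 - \left(\int dx\,p(x)h(x)\right)^2}
= \left(\beta_0[h(x)]\right)^{-1},
\]
where the common factor ∫h_2(z)²/p(z) dz cancels between numerator and denominator. Therefore, (inf_{h(x)} β_0[h(x)])^{−1} gives the largest slope of ΔI(Y; Z) vs. ΔI(X; Z) for perturbation functions of the form h_1(z|x) = h(x)h_2(z) satisfying ∫h_2(z)dz = 0 and ∫h_2(z)²/p(z) dz > 0, which is a lower bound of the slope of ΔI(Y; Z) vs. ΔI(X; Z) for all possible perturbation functions h_1(z|x). The latter is the slope of the Pareto frontier of the I(Y; Z) vs. I(X; Z) curve at the origin.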
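The slope statement can be checked on a toy example. The sketch below is a hedged illustration: the joint distribution, the binary Z with p(z) = (0.5, 0.5), the choice h_2(z) = (+1, −1) and the indicator-like h(x) are all hypothetical, and the target value 1/β_0[h(x)] uses the variance-ratio form Var_y[E[h|y]] / Var_x[h] assumed for this check.

```python
import math

def mi_from_joint(joint):
    """Mutual information (nats) of a discrete joint given as a nested list."""
    rows = [sum(r) for r in joint]
    cols = [sum(r[j] for r in joint) for j in range(len(joint[0]))]
    return sum(
        joint[i][j] * math.log(joint[i][j] / (rows[i] * cols[j]))
        for i in range(len(joint)) for j in range(len(joint[0]))
        if joint[i][j] > 0
    )

# Hypothetical toy problem: 4 values of X, 2 classes, binary Z.
p_xy = [[0.20, 0.05], [0.15, 0.10], [0.10, 0.15], [0.05, 0.20]]
h = [1.0, 0.0, 0.0, 0.0]            # non-constant h(x)
p_z, h2 = [0.5, 0.5], [1.0, -1.0]   # sum_z h2 = 0 and sum_z h2^2 / p(z) > 0
eps = 1e-3

p_x = [sum(r) for r in p_xy]
# Perturbed encoder p(z|x) = p(z) + eps * h(x) * h2(z).
p_z_given_x = [[p_z[k] + eps * h[i] * h2[k] for k in range(2)] for i in range(4)]
joint_xz = [[p_x[i] * p_z_given_x[i][k] for k in range(2)] for i in range(4)]
# p(y, z) follows from the Markov chain Z <- X <-> Y.
joint_yz = [[sum(p_xy[i][j] * p_z_given_x[i][k] for i in range(4)) for k in range(2)]
            for j in range(2)]

slope = mi_from_joint(joint_yz) / mi_from_joint(joint_xz)
# For this h: Var_x[h] = 0.1875 and Var_y[E[h|y]] = 0.0225, so 1 / beta_0[h] = 0.12.
inverse_beta0 = 0.0225 / 0.1875
print(slope, inverse_beta0)
```

Both mutual informations are of order ε², so their ratio stays finite as ε shrinks, which is why it can be read as a slope at the origin of the information plane.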
Inflection point for general Z: If we do not assume that Z is at the origin of the information plane, but at some general stationary solution Z* with p(z|x), we define the ratio
\[
\frac{\delta^2 I(X;Z)}{\delta^2 I(Y;Z)}
= \frac{\int dx\,dz\,\frac{p(x)^2}{p(x,z)}h(z|x)^2 - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}h(z|x)h(z|x')}
       {\int dx\,dx'\,dy\,dz\,\frac{p(x,y)p(x',y)}{p(y,z)}h(z|x)h(z|x') - \int dx\,dx'\,dz\,\frac{p(x)p(x')}{p(z)}h(z|x)h(z|x')}.
\]
When β exceeds the infimum of this ratio over h(z|x), Z* becomes a non-stable solution (non-minimum), and there will be other Z that achieve a better IB_β(X, Y; Z) than the current Z*.

H What IB first learns at its onset of learning
In this section, we prove that at the onset of learning, if letting h(z|x) = h*(x)h_2(z), we have
where p_β(y|x) is the estimated p(y|x) by IB for a certain β, h
Proof. In IB, we use p_β(z|x) to obtain Z from X, then obtain the prediction of Y from Z using p_β(y|z). Here we use the subscript β to denote the probability (density) at the optimum of IB_β[p(z|x)] at a specific β. We have
When we have a small perturbation ε · h(z|x) at the trivial representation, p_β(z|x) = p_{β_0}(z) + ε · h(z|x), we have p_β(z) = p_{β_0}(z) + ε · ∫h(z|x')p(x')dx'. Substituting, we have
The 0th-order term is ∫dzdx' p(x', y)p_{β_0}(z) = p(y). The first-order term is
since we have ∫h(z|x)dz = 0 for any x.
For the second-order term, using h
where h*_x = ∫h*(x)p(x)dx. Combining everything, we have up to the second order,

I Proof of Theorem 5
Proof. According to Theorem 4, a sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent, and
We can assume a specific form of h(x), and obtain a (potentially stronger) sufficient condition. Specifically, we let
for certain Ω_x ⊂ X. Substituting into Eq. (18), we have that a sufficient condition for (X, Y) to be IB_β-learnable is
where p(Ω_x) = ∫_{x∈Ω_x} p(x)dx.
The denominator of Eq. (19) is
Both equalities hold iff p(y|Ω_x) ≡ p(y), at which the denominator of Eq. (19) is equal to 0 and the expression inside the infimum diverges, so it does not contribute to the infimum. Except in this scenario, the denominator is greater than 0. Substituting into Eq. (19), we have that a sufficient condition for (X, Y) to be IB_β-learnable is
Since Ω_x is a subset of X, by the definition of h(x) in Eq. (18), h(x) is not a constant in the entire X. Hence the numerator of Eq. (20) is positive. Since its denominator is also positive, we can then neglect the "> 0", and obtain the condition in Theorem 5.
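For a small discrete X, the subset condition can be evaluated by brute force. The sketch below assumes that the Theorem 5 quantity takes the form β_0(Ω_x) = (1/p(Ω_x) − 1) / (E_{y∼p(y|Ω_x)}[p(y|Ω_x)/p(y)] − 1), which is what the variance-ratio condition gives when h(x) is the indicator of Ω_x. The joint distribution and the exhaustive search are hypothetical and are not the paper's Alg. 1.

```python
import math
from itertools import combinations

def beta0_of_subset(p_xy, omega):
    """Assumed form: (1/p(Omega) - 1) / (E_{y ~ p(y|Omega)}[p(y|Omega)/p(y)] - 1)."""
    n_y = len(p_xy[0])
    p_y = [sum(row[j] for row in p_xy) for j in range(n_y)]
    p_omega = sum(sum(p_xy[i]) for i in omega)
    p_y_given_omega = [sum(p_xy[i][j] for i in omega) / p_omega for j in range(n_y)]
    gap = sum(q * q / p for q, p in zip(p_y_given_omega, p_y)) - 1.0
    if gap <= 1e-12:
        # p(y|Omega) = p(y): the expression diverges and does not affect the infimum.
        return math.inf
    return (1.0 / p_omega - 1.0) / gap

# Hypothetical joint distribution p_xy[i][j] = p(x_i, y_j).
p_xy = [[0.20, 0.05],
        [0.15, 0.10],
        [0.10, 0.15],
        [0.05, 0.20]]

n_x = len(p_xy)
# All non-empty proper subsets of X (feasible only for tiny X).
subsets = [c for size in range(1, n_x) for c in combinations(range(n_x), size)]
best_subset = min(subsets, key=lambda s: beta0_of_subset(p_xy, s))
best_beta0 = beta0_of_subset(p_xy, best_subset)
print(best_subset, best_beta0)
```

In this toy case the minimizing subset groups the two x values that are most predictive of the same class, which trades off a larger p(Ω_x) against a more confident p(y|Ω_x).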
Since the h(x) used in this theorem is a subset of the h(x) used in Theorem 4, the infimum for Eq. (5) is greater than or equal to the infimum in Eq. (2). Therefore, according to the second statement of Theorem 4, (inf_{Ω_x⊂X} β_0(Ω_x))^{−1} is also a lower bound of the slope for the Pareto frontier of the I(Y; Z) vs. I(X; Z) curve.
Now we prove that the condition Eq. (5) is invariant to invertible mappings of X. In fact, if X' = g(X) is a uniquely invertible map (if X is continuous, g is additionally required to be continuous), let X' = {g(x) | x ∈ Ω_x}, and denote g(Ω_x) ≡ {g(x) | x ∈ Ω_x} for any Ω_x ⊂ X. We have p(g(Ω_x)) = p(Ω_x) and p(y|g(Ω_x)) = p(y|Ω_x). Then for dataset (X, Y), let Ω'_x = g(Ω_x), we have
Additionally we have X' = g(X). Then
For dataset (X', Y) = (g(X), Y), applying Theorem 5 we have that a sufficient condition for it to be IB_β-learnable is
where the equality is due to Eq. (22). Comparing with the condition for IB_β-learnability for (X, Y) (Eq. (5)), we see that they are the same. Therefore, the condition given by Theorem 5 is invariant to invertible mappings of X.

J Proof of Corollary 5.1 and Corollary 5.2
J.1 Proof of Corollary 5.1
Proof. We use Theorem 5. Let Ω_x contain all elements x whose true class is y* for some certain y*, and 0 otherwise. Then we obtain a (potentially stronger) sufficient condition. Since the probability p(y|y*, x) = p(y|y*) is class-conditional, we have
Hence we obtain a sufficient condition for IB_β-learnability.
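For class-conditional label noise, the resulting bound depends only on the class prior and the noise matrix. The sketch below assumes the bound is the Theorem 5 expression evaluated with Ω_x set to a single true class y*, i.e. (1/p(y*) − 1) / (Σ_y p(y|y*)²/p(y) − 1), minimized over y*. The function name and the example noise matrix are hypothetical.

```python
def beta0_class_conditional(prior, noise):
    """prior[k] = p(y* = k); noise[k][j] = p(y = j | y* = k). Assumed formula, see lead-in."""
    n = len(prior)
    # Marginal over observed (noisy) labels.
    p_y = [sum(prior[k] * noise[k][j] for k in range(n)) for j in range(n)]
    best = float("inf")
    for k in range(n):
        ratio = sum(noise[k][j] ** 2 / p_y[j] for j in range(n) if p_y[j] > 0)
        gap = ratio - 1.0
        if gap > 1e-12:  # skip classes whose noisy label distribution equals p(y)
            best = min(best, (1.0 / prior[k] - 1.0) / gap)
    return best

# Balanced binary problem with symmetric flip probability rho (illustrative values).
rho = 0.2
beta0_noisy = beta0_class_conditional([0.5, 0.5], [[1 - rho, rho], [rho, 1 - rho]])
beta0_clean = beta0_class_conditional([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
print(beta0_noisy, beta0_clean)
```

For the balanced symmetric case this expression simplifies to 1/(1 − 2ρ)², so the threshold grows without bound as ρ approaches 0.5 and reduces to 1 when there is no noise. This is a bound from the assumed formula, not a prediction of any experimentally observed onset.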
J.2 Proof of Corollary 5.2
Proof. We again use Theorem 5. Since Y is a deterministic function of X, let Y = f(X). By the assumption that Y contains at least one value y such that its probability p(y) > 0, we let Ω_x contain only x such that f(x) = y. Substituting into Eq. (5), we have
Therefore, the sufficient condition becomes β > 1.
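The substitution can be sketched as follows, assuming the Theorem 5 expression has the form β_0(Ω_x) = (1/p(Ω_x) − 1) / (E_{y'∼p(y'|Ω_x)}[p(y'|Ω_x)/p(y')] − 1), taking Ω_x to be the full preimage of y under f, and assuming 0 < p(y) < 1 so that neither numerator nor denominator vanishes. The dummy variable y' and the indicator notation are introduced here for illustration.

```latex
\begin{align*}
p(\Omega_x) &= p(y), \qquad p(y'|\Omega_x) = \mathbb{1}[y' = y],\\
\mathbb{E}_{y'\sim p(y'|\Omega_x)}\!\left[\frac{p(y'|\Omega_x)}{p(y')}\right] &= \frac{1}{p(y)},\\
\beta_0(\Omega_x) &= \frac{\frac{1}{p(\Omega_x)} - 1}{\frac{1}{p(y)} - 1} = 1 .
\end{align*}
```

Since the necessary condition already requires β > 1, this choice of Ω_x shows that the bound is attained for deterministic labels under the stated assumptions.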
Combined with Eq. (29), we have
Our Theorem 4 states that sup_{h(x)}
Combining Eqs. (26), (30) and (31), we have
In summary, the relations among the quantities are:

L Experiment Details
We use the Variational Information Bottleneck (VIB) objective from Alemi et al. (2016). For the synthetic experiment, the latent Z has dimension 2. The encoder is a neural net with 2 hidden layers, each of which has 128 neurons with ReLU activation. The last layer has linear activation and 4 output neurons; the first two parameterize the mean of a Gaussian and the last two parameterize the log variance. The decoder is a neural net with 1 hidden layer with 128 neurons and ReLU activation. Its last layer has linear activation and outputs the logits for the class labels. The model uses a mixture-of-Gaussians prior with 500 components (256 components for the experiment with class overlap), each of which is a 2D Gaussian with learnable mean and log variance; the weights of the components are also learnable. For the MNIST experiment, the architecture is mostly the same, except for the following: (1) Z has dimension 256; (2) for the prior, we use a standard Gaussian with diagonal covariance matrix.
For all experiments, we use the Adam optimizer (Kingma and Ba, 2014) with default parameters. We do not add any explicit regularization. We use a learning rate of 10^−4 with a learning rate decay of 1/(1 + 0.01 × epoch). We train for 2000 epochs in total with a mini-batch size of 500.
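The learning rate schedule can be written as a one-line function. This is a minimal sketch: the function name is hypothetical, and counting epochs from 0 is an assumption, since the text does not state the indexing.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.01):
    """Learning rate at a given epoch: base_lr / (1 + decay * epoch).

    Assumes epochs are counted from 0; base_lr and decay follow the values in the text.
    """
    return base_lr / (1.0 + decay * epoch)

# Schedule over the 2000 training epochs.
schedule = [learning_rate(epoch) for epoch in range(2000)]
print(schedule[0], schedule[100], schedule[-1])
```

With these values the rate halves after 100 epochs and falls to roughly 1/21 of its initial value by the end of training.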
For estimation of the observed β_0 in Fig. 3, in the I(X; Z) vs. β_i curve (β_i denotes the i-th β), we take the mean and standard deviation of I(X; Z) for the lowest 5 β_i values, denoted µ_β and σ_β (I(Y; Z) has similar behavior, but since we are minimizing I(X; Z) − β · I(Y; Z), the onset of nonzero I(X; Z) is less prone to noise). When I(X; Z) is greater than µ_β + 3σ_β, we regard it as learning a non-trivial representation, and take the average of β_i and β_{i−1} as the experimentally estimated onset of learning. We also inspect the curves manually and confirm that the estimate is consistent with human intuition.
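The thresholding rule above can be sketched in a few lines. The function name, the use of the sample standard deviation, and the example measurements are assumptions made for illustration; the text does not specify which standard deviation estimator is used.

```python
import statistics

def estimate_onset(betas, i_xz, n_baseline=5, n_sigma=3.0):
    """Estimate the onset of learning from I(X;Z) measured at several beta values.

    The baseline mean and standard deviation come from the n_baseline lowest betas.
    Returns the midpoint between the first beta whose I(X;Z) exceeds
    mean + n_sigma * std and the preceding beta, or None if no point exceeds it.
    """
    pairs = sorted(zip(betas, i_xz))
    baseline = [value for _, value in pairs[:n_baseline]]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation (assumption)
    threshold = mu + n_sigma * sigma
    for idx in range(n_baseline, len(pairs)):
        if pairs[idx][1] > threshold:
            return 0.5 * (pairs[idx][0] + pairs[idx - 1][0])
    return None

# Made-up measurements: I(X;Z) stays near zero, then jumps.
betas = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
i_xz = [0.001, 0.002, 0.001, 0.002, 0.001, 0.0015, 0.5, 1.0]
onset = estimate_onset(betas, i_xz)
print(onset)
```

The resolution of the estimate is limited by the spacing of the β grid, since the returned value is always a midpoint between two neighboring grid points.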
For estimating β_0 using Alg. 1, at step 6 we use the following discrete search algorithm. We fix i_left = 1 and gradually narrow down the range
In other words, we narrow down the range of i_right if we find that the Ω given by the left or right boundary gives a lower β_0 value. The process stops when both β_{0,a} and β_{0,b} stop improving (which we find always happens when b = a + 1), and we return the smaller of the final β_{0,a} and β_{0,b} as β_0.
For estimation of p(y|x) for (2) Alg. 1 and (3) η_KL for both synthetic and MNIST experiments, we use a 3-layer neural net where each hidden layer has 128 neurons and ReLU activation. The last layer has linear activation. The objective is cross-entropy loss. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 10^−4, and train for 100 epochs (after which the validation loss does not go down).
For estimating β_0 via (3) η_KL by the algorithm in Kim et al. (2017), we use the code from the GitHub repository provided by the paper, using the same p(y|x) employed for (2) Alg. 1. Since our datasets are classification tasks, we use A_ij = p(y_j|x_i)/p(y_j) instead of the kernel density for estimating matrix A; we take the maximum of 10 runs as the estimate of µ.

Figure 1: Accuracy for binary classification of MNIST digits 0 and 1 with 20% label noise and varying β. No learning happens for models trained at β < 3.25.

Figure 2: The Pareto frontier of the information plane, I(X; Z) vs. I(Y; Z), for the binary classification of MNIST digits 0 and 1 with 20% label noise described in Sec. 1.1 and Fig. 1. For this problem, learning happens for models trained at β > 3.25. H(Y) = 1 bit since only two of ten digits are used, and I(Y; Z) ≤ I(X; Y) ≈ 0.5 bits < H(Y) because of the 20% label noise. The true frontier is differentiable; the figure shows a variational approximation that places an upper bound on both informations, horizontally offset to pass through the origin.

Figure 3: Predicted vs. experimentally identified β_0, for mixture of Gaussians with varying class-conditional noise rates.

Figure 4: I(Y; Z) vs. β, for mixture of Gaussian datasets with different distances between the two mixture components. The vertical lines are β_{0,predicted} computed by the R.H.S. of Eq. (8). As Eq. (8) does not make predictions w.r.t. class overlap, the vertical lines are always just above β_{0,predicted} = 1. However, as expected, decreasing the distance between the classes in X space also increases the true β_0.

Figure 5: I(Y; Z) vs. β for the MNIST binary classification with different hidden units per layer n and noise rates ρ: (upper left) ρ = 0.02, (upper right) ρ = 0.1, (lower left) ρ = 0.2, (lower right) ρ = 0.3. The vertical lines are β_0 estimated by different methods. n = 128 has insufficient capacity for the problem, so its observed learnability onset is pushed higher, similar to the class overlap case.

Figure 7: Plot of I(Y; Z) vs. β for the CIFAR10 training set with 20% label noise. Each blue cross corresponds to a fully-converged model starting with independent initialization. The vertical black line corresponds to the predicted β_0 = 1.0483 using Alg. 1. The empirical β_0 = 1.048.

Figure 8: The first 5776 MNIST training set digits when sorted by h(x). The digits highlighted in red are above the threshold drawn in Figure 6.