Abstract
The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction for representation learning. The IB objective $I(X;Z) - \beta I(Y;Z)$ employs a Lagrange multiplier $\beta$ to tune this trade-off. However, in practice, not only is $\beta$ chosen empirically without theoretical guidance, there is also a lack of theoretical understanding of the relationship between $\beta$, learnability, the intrinsic nature of the dataset and model capacity. In this paper, we show that if $\beta$ is improperly chosen, learning cannot happen: the trivial representation becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable which arises as $\beta$ is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good $\beta$. We further show that IB-Learnability is determined by the largest confident, typical and imbalanced subset of the examples (the conspicuous subset), and discuss its relation with model capacity. We give practical algorithms to estimate the minimum $\beta$ for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST and CIFAR10.
1. Introduction
Tishby et al. [1] introduced the Information Bottleneck (IB) objective function, which learns a representation Z of observed variables $(X, Y)$ that retains as little information about X as possible but simultaneously captures as much information about Y as possible:
$$\mathrm{IB}_\beta(X, Y; Z) = I(X; Z) - \beta I(Y; Z) \qquad (1)$$
where $I(\cdot\,;\cdot)$ is the mutual information. The hyperparameter $\beta$ controls the trade-off between compression and prediction, in the same spirit as Rate-Distortion Theory [2] but with a learned representation function that automatically captures some part of the “semantically meaningful” information, where the semantics are determined by the observed relationship between X and Y. The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables [3], meta-Gaussians [4], continuous variables via variational methods [5,6,7], deterministic scenarios [8,9] and geometric clustering [10], and is used for learning invariant and disentangled representations in deep neural nets [11,12].
From the IB objective (Equation (1)) we see that when $\beta \to 0$ it will encourage $I(X;Z) = 0$, which leads to a trivial representation Z that is independent of X, while when $\beta \to +\infty$, it reduces to a maximum likelihood objective (e.g., in classification, it reduces to cross-entropy loss). Therefore, as we vary $\beta$ from 0 to $+\infty$, there must exist a point $\beta_0$ at which IB starts to learn a nontrivial representation where Z contains information about X.
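The two limits of the objective can be checked numerically on a small discrete problem. The sketch below is illustrative only: the toy joint table `p_xy`, the helper names `mutual_information` and `ib_objective`, and the two hand-written encoders are assumptions introduced here, not part of the original experiments. It evaluates $I(X;Z) - \beta I(Y;Z)$ for a tabular encoder $p(z|x)$ and compares the trivial encoder with one that copies X.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in nats) of a joint probability table p(a, b)."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0  # terms with p(a, b) = 0 contribute nothing
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_z_given_x, beta):
    """I(X;Z) - beta * I(Y;Z) for a tabular encoder p(z|x) (rows indexed by x)."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * p_z_given_x   # p(x, z) = p(x) p(z|x)
    p_yz = p_xy.T @ p_z_given_x         # p(y, z) = sum_x p(x, y) p(z|x)
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

# Hypothetical toy joint: two equally likely inputs, labels flipped with probability 0.2.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
trivial = np.full((2, 2), 0.5)  # p(z|x) = p(z): Z ignores X
identity = np.eye(2)            # Z copies X

for beta in (0.5, 10.0):
    print(beta, ib_objective(p_xy, trivial, beta), ib_objective(p_xy, identity, beta))
```

For the small $\beta$ the copying encoder scores worse than the trivial one, and for the large $\beta$ it scores better, which is the qualitative behavior described above.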
As an example, we train multiple variational information bottleneck (VIB) models on binary classification of MNIST [13] digits 0 and 1 with 20% label noise at different $\beta$. The accuracy vs. $\beta$ is shown in Figure 1. We see that below a critical value of $\beta$, no learning happens and the accuracy is the same as random guessing. Above that value, there is a clear phase transition where the accuracy sharply increases, indicating that the objective is able to learn a nontrivial representation. In general, we observe that different datasets and model capacities result in a different $\beta_0$ at which IB starts to learn a nontrivial representation. How does $\beta_0$ depend on aspects of the dataset and model capacity, and how can we estimate it? What does an IB model learn at the onset of learning? Answering these questions may provide a deeper understanding of IB in particular and of learning on two observed variables in general.
Figure 1.
Accuracy for binary classification of MNIST digits 0 and 1 with 20% label noise and varying $\beta$. No learning happens for models trained below a critical value of $\beta$.
In this work, we begin to answer the above questions. Specifically:
- We introduce the concept of IB-Learnability and show that when we vary $\beta$, the IB objective will undergo a phase transition from the inability to learn to the ability to learn (Section 3).
- Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide upper bounds for the learnability threshold $\beta_0$ (Section 4).
- We show that IB-Learnability is determined by the largest confident, typical and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin on the information plane $I(Y;Z)$ vs. $I(X;Z)$, and discuss its relation to model capacity (Section 5).
- We prove a deep relationship between IB-Learnability, our upper bounds on $\beta_0$, the hypercontractivity coefficient, the contraction coefficient and the maximum correlation (Section 5).
We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, which provide us with a tool for understanding a key aspect of the learning problem (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST [13] and CIFAR10 [14] that the theoretical prediction for IB-Learnability closely matches experiment, and show the conspicuous subset our algorithm discovers (Section 7).
2. Related Work
The seminal IB work [1] provides a tabular method for exactly computing the optimal encoder distribution for a given $\beta$ and cardinality of the discrete representation, $|Z|$. They did not consider the IB learnability problem as addressed in this work. Chechik et al. [3] present the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation Z of $(X, Y)$, assuming that both X and Y are also multivariate Gaussians. Under GIB, they derive an analytic formula for the optimal representation as a noisy linear projection to eigenvectors of the normalized regression matrix, and the learnability threshold $\beta_0$ is then given in closed form in terms of the largest eigenvalue of a matrix built from the covariances of X and Y. This work provides deep insights about relations between the dataset, $\beta_0$ and optimal representations in the Gaussian scenario, but the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in [4], which reformulates the objective in terms of the copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions for the joint distribution $(X, Y)$ are assumed to be known, which is unlikely in practice.
Strouse and Schwab [8] present the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, $H(Z)$, rather than the transmission cost, $I(X;Z)$, as in IB. This approach learns hard clusterings with different code entropies that vary with $\beta$. In this case, it is clear that a hard clustering with minimal $H(Z)$ will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.
The first amortized IB objective is in the Variational Information Bottleneck (VIB) of Alemi et al. [5]. VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution ($p(y|z)$) and marginal distribution ($p(z)$). This approach cleanly permits learning a stochastic encoder, $p(z|x)$, that is applicable to any $x \in \mathcal{X}$, rather than just the particular X seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation.
Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) [7]. CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where . The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model is approaching the MNI point by observing that a necessary condition for reaching the MNI point occurs when . The CEB objective is equivalent to IB at , so our analysis of IB-Learnability applies equally to CEB.
Kolchinsky et al. [9] show that when Y is a deterministic function of X, the “corner point” of the IB curve (where ) is the unique optimizer of the IB objective for all (with the parameterization of Kolchinsky et al. [9], ), which they consider to be a “trivial solution”. However, their use of the term “trivial solution” is distinct from ours. They are referring to the observation that all points on the IB curve contain uninteresting interpolations between two different but valid solutions on the optimal frontier, rather than demonstrating a non-trivial trade-off between compression and prediction as expected when varying the IB Lagrangian. Our use of “trivial” refers to whether IB is capable of learning at all given a certain dataset and value of .
Achille and Soatto [12] apply the IB Lagrangian to the weights of a neural network, yielding InfoDropout. In Achille and Soatto [11], the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a non-trivial representation. More recently, Achille et al. [15] repurpose the InfoDropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously-trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning.
Our work is also closely related to the hypercontractivity coefficient [16,17], defined as $\xi(X;Y) \equiv \sup_{Z - X - Y} \frac{I(Y;Z)}{I(X;Z)}$, which by definition equals the inverse of $\beta_0$, our IB-Learnability threshold. In [16], the authors prove that the hypercontractivity coefficient equals the contraction coefficient $\eta_{\mathrm{KL}}$, and Kim et al. [18] propose a practical algorithm to estimate $\eta_{\mathrm{KL}}$, which provides a measure for potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability are also lower bounds for the hypercontractivity coefficient.
3. IB-Learnability
We are given instances of $(x, y)$ drawn from a distribution with probability (density) $P(X, Y)$ with support $\mathcal{X} \times \mathcal{Y}$, where unless otherwise stated, both X and Y can be discrete or continuous variables. We use capital letters $X, Y, Z$ for random variables and lowercase $x, y, z$ to denote the instance of variables, with $P(\cdot)$ and $p(\cdot)$ denoting their probability or probability density, respectively. $(X, Y)$ is our training data and may be characterized by different types of noise. The nature of this training data and the choice of $\beta$ will be sufficient to predict the transition from unlearnable to learnable.
We can learn a representation Z of X with conditional probability $p(z|x)$, such that $X, Y, Z$ obey the Markov chain $Z \leftarrow X \leftrightarrow Y$. Equation (1) above gives the IB objective with Lagrange multiplier $\beta$, $\mathrm{IB}_\beta(X, Y; Z)$, which is a functional of $p(z|x)$: $\mathrm{IB}_\beta(X, Y; Z) = \mathrm{IB}_\beta[p(z|x)]$. The IB learning task is to find a conditional probability $p(z|x)$ that minimizes $\mathrm{IB}_\beta(X, Y; Z)$. The larger $\beta$, the more the objective favors making a good prediction for Y. Conversely, the smaller $\beta$, the more the objective favors learning a concise representation.
How can we select $\beta$ such that the IB objective learns a useful representation? In practice, the selection of $\beta$ is done empirically. Indeed, Tishby et al. [1] recommend “sweeping $\beta$”. In this paper, we provide theoretical guidance for choosing $\beta$ by introducing the concept of IB-Learnability and providing a series of IB-learnable conditions.
Definition 1.
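$(X, Y)$ is $\mathrm{IB}_\beta$-learnable if there exists a Z given by some $p_1(z|x)$, such that $\mathrm{IB}_\beta(X, Y; Z)|_{p_1(z|x)} < \mathrm{IB}_\beta(X, Y; Z)|_{p(z|x) = p(z)}$, where $p(z|x) = p(z)$ characterizes the trivial representation where $Z = Z_{\mathrm{trivial}}$ is independent of X.
If $(X, Y)$ is $\mathrm{IB}_\beta$-learnable, then when $\mathrm{IB}_\beta(X, Y; Z)$ is globally minimized, it will not learn a trivial representation. On the other hand, if $(X, Y)$ is not $\mathrm{IB}_\beta$-learnable, then when $\mathrm{IB}_\beta(X, Y; Z)$ is globally minimized, it may learn a trivial representation.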
3.1. Trivial Solutions
Definition 1 defines trivial solutions in terms of representations where $I(X;Z) = I(Y;Z) = 0$. Another type of trivial solution occurs when $I(X;Z) > 0$ but $I(Y;Z) = 0$. This type of trivial solution is not directly achievable by the IB objective, as $I(X;Z)$ is minimized, but it can be achieved by construction or by chance. It is possible that starting learning from $I(X;Z) > 0$, $I(Y;Z) = 0$ could result in access to non-trivial solutions not available from $I(X;Z) = 0$. We do not attempt to investigate this type of trivial solution in this work.
3.2. Necessary Condition for IB-Learnability
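From Definition 1, we can see that $\mathrm{IB}_\beta$-Learnability for any dataset $(X, Y)$ requires $\beta > 1$. In fact, from the Markov chain $Z \leftarrow X \leftrightarrow Y$, we have $I(Y;Z) \le I(X;Z)$ via the data-processing inequality. If $\beta \le 1$, then since $I(X;Z) \ge 0$ and $I(Y;Z) \ge 0$, we have that $\min\,(I(X;Z) - \beta I(Y;Z)) = 0 = \mathrm{IB}_\beta(X, Y; Z_{\mathrm{trivial}})$. Hence $(X, Y)$ is not $\mathrm{IB}_\beta$-learnable for $\beta \le 1$.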
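The data-processing step in this argument can be sanity-checked numerically. The following is a minimal sketch, assuming a randomly drawn discrete joint distribution and randomly drawn tabular encoders (both hypothetical, chosen only for illustration): for every sampled encoder, $I(Y;Z)$ should not exceed $I(X;Z)$, so the objective at $\beta \le 1$ cannot drop below its trivial value of 0.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in nats) of a joint probability table p(a, b)."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0  # terms with p(a, b) = 0 contribute nothing
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

rng = np.random.default_rng(0)
p_xy = rng.dirichlet(np.ones(12)).reshape(4, 3)  # random joint: 4 inputs, 3 labels
p_x = p_xy.sum(axis=1)

worst_gap = np.inf  # smallest observed I(X;Z) - I(Y;Z), i.e. the objective at beta = 1
for _ in range(1000):
    enc = rng.dirichlet(np.ones(3), size=4)      # random encoder p(z|x), rows sum to 1
    i_xz = mutual_information(p_x[:, None] * enc)
    i_yz = mutual_information(p_xy.T @ enc)      # p(y, z) = sum_x p(x, y) p(z|x)
    worst_gap = min(worst_gap, i_xz - i_yz)

print("smallest I(X;Z) - I(Y;Z) over sampled encoders:", worst_gap)
```

A random search like this cannot prove the inequality; it only illustrates that no sampled encoder beats the trivial representation at $\beta = 1$.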
Due to the reparameterization invariance of mutual information, we have the following lemma for $\mathrm{IB}_\beta$-Learnability:
Lemma 1.
Let $X' = g(X)$ be an invertible map (if X is a continuous variable, g is additionally required to be continuous). Then $(X, Y)$ and $(X', Y)$ have the same $\mathrm{IB}_\beta$-Learnability.
The proof for Lemma 1 is in Appendix A.2. Lemma 1 implies a favorable property for any condition for $\mathrm{IB}_\beta$-Learnability: the condition should be invariant to invertible mappings of X. We will inspect this invariance in the conditions we derive in the following sections.
4. Sufficient Conditions for IB-Learnability
Given $(X, Y)$, how can we determine whether it is $\mathrm{IB}_\beta$-learnable? To answer this question, we derive a series of sufficient conditions for $\mathrm{IB}_\beta$-Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.
Firstly, Theorem 1 characterizes the $\mathrm{IB}_\beta$-Learnability range for $\beta$, with proof in Appendix A.3:
Theorem 1.
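If $(X, Y)$ is $\mathrm{IB}_{\beta_1}$-learnable, then for any $\beta_2 > \beta_1$, it is $\mathrm{IB}_{\beta_2}$-learnable.
Based on Theorem 1, the range of $\beta$ such that $(X, Y)$ is $\mathrm{IB}_\beta$-learnable has the form $\beta \in (\beta_0, +\infty)$. Thus, $\beta_0$ is the threshold of IB-Learnability.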
Lemma 2.
$p(z|x) = p(z)$ is a stationary solution for $\mathrm{IB}_\beta(X, Y; Z)$.
The proof in Appendix A.6 shows that both first-order variations $\delta I(X;Z)$ and $\delta I(Y;Z)$ vanish at the trivial representation $p(z|x) = p(z)$, so $\delta \mathrm{IB}_\beta[p(z|x)] = 0$ at the trivial representation.
Lemma 2 yields our strategy for finding sufficient conditions for learnability: find conditions such that $p(z|x) = p(z)$ is not a local minimum for the functional $\mathrm{IB}_\beta[p(z|x)]$. Based on the necessary condition for the minimum (Appendix A.4), we have the following theorem. (The theorems in this paper deal with learnability w.r.t. true mutual information. If parameterized models are used to approximate the mutual information, the limitation of the model capacity will translate into more uncertainty of Y given X, viewed through the lens of the model.)
Theorem 2
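(Suff. Cond. 1). A sufficient condition for $(X, Y)$ to be $\mathrm{IB}_\beta$-learnable is that there exists a perturbation function $h(z|x)$ (so that the perturbed probability (density) is $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$) with $\int h(z|x)\,dz = 0$, such that the second-order variation $\delta^2 \mathrm{IB}_\beta[p(z|x)] < 0$ at the trivial representation $p(z|x) = p(z)$.
The proof for Theorem 2 is given in Appendix A.4. Intuitively, if $\delta^2 \mathrm{IB}_\beta[p(z|x)] < 0$ at $p(z|x) = p(z)$, we can always find a $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$ in the neighborhood of the trivial representation $p(z|x) = p(z)$, such that $\mathrm{IB}_\beta[p'(z|x)] < \mathrm{IB}_\beta[p(z|x)]$, thus satisfying the definition for $\mathrm{IB}_\beta$-Learnability.
To make Theorem 2 more practical, we perturb $p(z|x)$ around the trivial solution, $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$, and expand $\mathrm{IB}_\beta[p(z|x) + \epsilon \cdot h(z|x)]$ to the second order of $\epsilon$. We can then prove Theorem 3: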
Theorem 3
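(Suff. Cond. 2). A sufficient condition for $(X, Y)$ to be $\mathrm{IB}_\beta$-learnable is that X and Y are not independent and
$$\beta > \inf_{h(x)} \beta_0[h(x)] \qquad (2)$$
where the functional $\beta_0[h(x)]$ is given by
$$\beta_0[h(x)] = \frac{\mathbb{E}_{x \sim p(x)}[h(x)^2] - \left(\mathbb{E}_{x \sim p(x)}[h(x)]\right)^2}{\mathbb{E}_{y \sim p(y)}\left[\left(\mathbb{E}_{x \sim p(x|y)}[h(x)]\right)^2\right] - \left(\mathbb{E}_{x \sim p(x)}[h(x)]\right)^2}$$
Moreover, we have that $\left(\inf_{h(x)} \beta_0[h(x)]\right)^{-1}$ is a lower bound of the slope of the Pareto frontier in the information plane $I(Y;Z)$ vs. $I(X;Z)$ at the origin.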
The proof is given in Appendix A.7, which also shows that if in Theorem 3 is satisfied, we can construct a perturbation function with , for some , such that satisfies Theorem 2. It also shows that the converse is true: if there exists such that the condition in Theorem 2 is true, then Theorem 3 is satisfied, that is, . (We do not claim that any satisfying Theorem 2 can be decomposed to at the onset of learning. But from the equivalence of Theorems 2 and 3 as explained above, when there exists an such that Theorem 2 is satisfied, we can always construct an that also satisfies Theorem 2.) Moreover, letting the perturbation function at the trivial solution, we have
where is the estimated by IB for a certain , and is a constant. This shows how the by IB explicitly depends on at the onset of learning. The proof is provided in Appendix A.8.
Theorem 3 suggests a method to estimate $\beta_0$: we can parameterize $h(x)$, for example by a neural network, with the objective of minimizing $\beta_0[h(x)]$. At its minimization, $\beta_0[h(x)]$ provides an upper bound for $\beta_0$, and $h(x)$ provides a soft clustering of the examples corresponding to a nontrivial perturbation of $p(z|x)$ at $p(z|x) = p(z)$ that minimizes $\mathrm{IB}_\beta[p(z|x)]$.
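As a rough illustration of this estimation idea, the sketch below evaluates the functional on a small discrete problem and minimizes it by random search. Several things here are assumptions rather than the method used in the experiments: the functional is taken in the variance-ratio form $\beta_0[h] = \mathrm{Var}_x[h] / \mathrm{Var}_y[\mathbb{E}_{x \sim p(x|y)}[h]]$, the joint table `p_xy` is a hypothetical toy dataset, and random search replaces a neural network trained with SGD.

```python
import numpy as np

def beta0_of_h(p_xy, h):
    """Var_x[h] / Var_y[E[h | y]] for a joint table p(x, y) and a vector h(x)."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mean_h = p_x @ h
    var_h = p_x @ h**2 - mean_h**2
    cond_mean = (p_xy.T @ h) / p_y             # E_{x ~ p(x|y)}[h(x)] for each y
    var_cond = p_y @ cond_mean**2 - mean_h**2
    return np.inf if var_cond <= 1e-15 else float(var_h / var_cond)

# Hypothetical toy data: inputs {0, 1} belong to class 0 and inputs {2, 3} to class 1,
# with 20% symmetric label noise.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])

rng = np.random.default_rng(0)
best_beta, best_h = np.inf, None
for _ in range(2000):                          # crude random search stands in for SGD
    h = rng.normal(size=4)
    b = beta0_of_h(p_xy, h)
    if b < best_beta:
        best_beta, best_h = b, h

print("upper bound on beta_0 from random search:", best_beta)
print("h(x) at the best sample:", best_h)
```

At the best sample, `best_h` takes similar values within each true class and different values across classes, which is the soft-clustering behavior described above. The ratio is unchanged by rescaling or shifting `h`, so only the shape of `h` matters.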
Alternatively, based on the property of $\beta_0[h(x)]$, we can also use a specific functional form for $h(x)$ in Equation (2) and obtain a stronger sufficient condition for $\mathrm{IB}_\beta$-Learnability. But we want to choose $h(x)$ as near to the infimum as possible. To do this, we note the following characteristics for the R.H.S. of Equation (2):
- We can set to be nonzero if for some region and 0 otherwise. Then we obtain the following sufficient condition:
Based on these observations, we can let $h(x)$ be a nonzero constant inside some region $\Omega_x \subset \mathcal{X}$ and 0 otherwise, so that the infimum over an arbitrary function $h(x)$ is simplified to an infimum over $\Omega_x \subset \mathcal{X}$, and we obtain a sufficient condition for $\mathrm{IB}_\beta$-Learnability, which is a key result of this paper:
Theorem 4
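(Conspicuous Subset Suff. Cond.). A sufficient condition for $(X, Y)$ to be $\mathrm{IB}_\beta$-learnable is that X and Y are not independent and
$$\beta > \inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x) \qquad (5)$$
where
$$\beta_0(\Omega_x) = \frac{\frac{1}{p(\Omega_x)} - 1}{\mathbb{E}_{y \sim p(y|\Omega_x)}\left[\frac{p(y|\Omega_x)}{p(y)}\right] - 1}$$
$\Omega_x$ denotes the event that $x \in \Omega_x$, with probability $p(\Omega_x)$.
$\left(\inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x)\right)^{-1}$ gives a lower bound of the slope of the Pareto frontier in the information plane $I(Y;Z)$ vs. $I(X;Z)$ at the origin.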
The proof is given in Appendix A.9. In the proof we also show that this condition is invariant to invertible mappings of X.
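For a small discrete input space, the infimum over subsets can be evaluated by brute force, which makes the condition concrete. The sketch below assumes the subset functional has the form $\beta_0(\Omega_x) = (1/p(\Omega_x) - 1) / (\mathbb{E}_{y \sim p(y|\Omega_x)}[p(y|\Omega_x)/p(y)] - 1)$; the joint table and the function name `beta0_of_subset` are hypothetical and used only for illustration.

```python
import itertools
import numpy as np

def beta0_of_subset(p_xy, members):
    """(1/p(Omega) - 1) / (E_{y ~ p(y|Omega)}[p(y|Omega)/p(y)] - 1) for Omega = members."""
    p_y = p_xy.sum(axis=0)
    p_omega_y = p_xy[list(members)].sum(axis=0)   # p(Omega, y)
    p_omega = p_omega_y.sum()
    p_y_given_omega = p_omega_y / p_omega
    denom = np.sum(p_y_given_omega**2 / p_y) - 1.0
    # A subset whose label distribution equals p(y) carries no signal: treat as +inf.
    return np.inf if denom <= 1e-12 else float((1.0 / p_omega - 1.0) / denom)

# Hypothetical toy data: inputs {0, 1} belong to class 0 and inputs {2, 3} to class 1,
# with 20% symmetric label noise.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])

n = p_xy.shape[0]
# All non-empty proper subsets of the input space.
candidates = [s for k in range(1, n) for s in itertools.combinations(range(n), k)]
best = min(candidates, key=lambda s: beta0_of_subset(p_xy, s))
print("best subset:", best, "beta_0(Omega):", beta0_of_subset(p_xy, best))
```

In this toy problem the minimizing subset is one whole true class: it is larger than any single input, more confident than any mixed subset, and so gives the smallest value. Brute force is exponential in the number of inputs and is only meant for tiny examples.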
5. Discussion
5.1. The Conspicuous Subset Determines $\beta_0$
From Equation (5), we see that three characteristics of the subset $\Omega_x \subset \mathcal{X}$ lead to low $\beta_0$: (1) confidence: $p(y|\Omega_x)$ is large; (2) typicality and size: the number of elements in $\Omega_x$ is large or the elements in $\Omega_x$ are typical, leading to a large probability $p(\Omega_x)$; (3) imbalance: $p(y)$ is small for the subset $\Omega_x$ but large for its complement. In summary, $\beta_0$ will be determined by the largest confident, typical and imbalanced subset of examples, or by an equilibrium of those characteristics. We term the $\Omega_x$ at the minimization of $\beta_0(\Omega_x)$ the conspicuous subset.
5.2. Multiple Phase Transitions
Based on this characterization of , we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a region that is small but “typical”, consists of all elements confidently predicted as by and where is the least common class. By construction, this will dominate the infimum in Equation (5), resulting in a small value of . However, the remaining effectively form a new dataset, . At exactly , we may have that the current encoder, , has no mutual information with the remaining classes in ; that is, . In this case, Definition 1 applies to with respect to . We might expect to see that, at , learning will plateau until we get to some that defines the phase transition for . Clearly this process could repeat many times, with each new dataset being distinctly more difficult to learn than .
5.3. Similarity to Information Measures
The denominator of in Equation (5) is closely related to mutual information. Using the inequality for , it becomes:
where is the mutual information “density” at . Of course, this quantity is also , so we know that the denominator of Equation (5) is non-negative. Incidentally, is the density of “rational mutual information” [19] at .
Similarly, the numerator of is related to the self-information of :
so we can estimate as:
Since Equation (6) uses upper bounds on both the numerator and the denominator, it does not give us a bound on , only an estimate.
5.4. Estimating Model Capacity
The observation that a model cannot distinguish between cluster overlap in the data and its own lack of capacity gives an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve. For example, for a classification task, we can use different model classes to estimate $p(y|x)$. For each such trained model, we can estimate the corresponding IB-Learnability threshold $\beta_0$. A model with smaller capacity than the task needs will translate to more uncertainty in $p(y|\Omega_x)$, resulting in a larger $\beta_0$. On the other hand, models that give the same $\beta_0$ as each other all have the same capacity relative to the task, even if we would otherwise expect them to have very different capacities. For example, if two deep models have the same core architecture but one has twice the number of parameters at each layer and they both yield the same $\beta_0$, their capacities are equivalent with respect to the task. Thus, $\beta_0$ provides a way to measure model capacity in a task-specific manner.
5.5. Learnability and the Information Plane
Many of our results can be interpreted in terms of the geometry of the Pareto frontier illustrated in Figure 2, which describes the trade-off between increasing $I(Y;Z)$ and decreasing $I(X;Z)$. At any point on this frontier that minimizes $I(X;Z) - \beta I(Y;Z)$, the frontier will have slope $\beta^{-1}$ if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope will take its maximum $\beta_0^{-1}$ at the origin, which implies $\mathrm{IB}_\beta$-Learnability for $\beta > \beta_0$, so that the threshold for $\mathrm{IB}_\beta$-Learnability is simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for $\mathrm{IB}_\beta$-Learnability is the inverse of its maximum slope. Indeed, Theorem 3 and Theorem 4 give lower bounds of the slope of the Pareto frontier at the origin.
Figure 2.
The Pareto frontier of the information plane, vs. , for the binary classification of MNIST digits 0 and 1 with 20% label noise described in Section 1 and Figure 1. For this problem, learning happens for models trained at . bit since only two of ten digits are used and bits because of the 20% label noise. The true frontier is differentiable; the figure shows a variational approximation that places an upper bound on both informations, horizontally offset to pass through the origin.
5.6. IB-Learnability, Hypercontractivity and Maximum Correlation
IB-Learnability and its sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:
which we prove in Appendix A.11. Here s.t. and is the maximum correlation [20,21], is the hypercontractivity coefficient and is the contraction coefficient. Our proof relies on Anantharam et al. [16]’s proof . Our work reveals the deep relationship between IB-Learnability and these earlier concepts and provides additional insights about what aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of .
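For discrete variables, the link to maximum correlation can be illustrated directly. The sketch below relies on two assumptions that are not spelled out in the text above: the standard characterization of the maximum correlation as the second-largest singular value of the matrix $p(x,y)/\sqrt{p(x)p(y)}$, and the variance-ratio form $\mathrm{Var}_x[h] / \mathrm{Var}_y[\mathbb{E}_{x \sim p(x|y)}[h]]$ for the functional $\beta_0[h(x)]$. The joint table is a hypothetical example.

```python
import numpy as np

def maximal_correlation(p_xy):
    """Second-largest singular value of p(x,y)/sqrt(p(x)p(y)) and the matching h(x)."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    q = p_xy / np.sqrt(np.outer(p_x, p_y))
    u, s, _ = np.linalg.svd(q)                 # singular values in descending order
    h = u[:, 1] / np.sqrt(p_x)                 # maximally correlated function of x
    return float(s[1]), h

def variance_ratio(p_xy, h):
    """Var_x[h] / Var_y[E[h | y]], the form of beta_0[h] assumed in this sketch."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mean_h = p_x @ h
    cond_mean = (p_xy.T @ h) / p_y             # E_{x ~ p(x|y)}[h(x)] for each y
    return float((p_x @ h**2 - mean_h**2) / (p_y @ cond_mean**2 - mean_h**2))

# Hypothetical 3 x 3 joint distribution with a dominant diagonal.
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.05, 0.05, 0.25]])

rho, h_star = maximal_correlation(p_xy)
print("1 / rho_m^2        :", 1.0 / rho**2)
print("variance ratio(h*) :", variance_ratio(p_xy, h_star))
```

Under these assumptions the two printed numbers agree, and any other choice of `h` gives a variance ratio at least as large, so the singular-value computation yields the tightest bound of this family without any search.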
6. Estimating the IB-Learnability Condition
Theorem 4 not only reveals the relationship between the learnability threshold for $\beta$ and the least noisy region of $P(X, Y)$ but also provides a way to practically estimate $\beta_0$, both in the general classification case and in more structured settings.
6.1. Estimation Algorithm
Based on Theorem 4, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound $\tilde{\beta}_0 \ge \beta_0$, as well as to discover the conspicuous subset that determines $\tilde{\beta}_0$.
We approximate the probability of each example by its empirical probability. For example, for MNIST, $p(x_i) = 1/N$, where N is the number of examples in the dataset. The algorithm starts by first learning a maximum likelihood model of $p(y|x)$, using, for example, feed-forward neural networks. It then constructs a matrix and a vector to store the estimated $p(y|x)$ and $p(y)$ for all the examples in the dataset. To find the subset $\Omega_x$ such that $\tilde{\beta}_0$ is as small as possible, by the previous analysis we want a conspicuous subset whose $p(y|\Omega_x)$ is large for a certain class j (to make the denominator of Equation (5) large) and which contains as many elements as possible (to make the numerator small).
We suggest the following heuristics to discover such a conspicuous subset. For each class j, we sort the rows of according to its probability for the pivot class j by decreasing order and then perform a search over for . Since is large when contains too few or too many elements, the minimum of for class j will typically be reached with some intermediate-sized subset and we can use binary search or other discrete search algorithm for the optimization. The algorithm stops when does not improve by tolerance . The algorithm then returns the as the minimum over all the classes , as well as the conspicuous subset that determines this .
After estimating $\tilde{\beta}_0$, we can then use it for learning with IB, either directly or as an anchor for a region where we can perform a much smaller sweep than we otherwise would have. This may be particularly important for very noisy datasets, where $\beta_0$ can be very large.
| Algorithm 1 Estimating the upper bound for and identifying the conspicuous subset |
| Require: Dataset . The number of classes is C. Require: tolerance for estimating
|
subroutine Get():
|
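The heuristic described in this section can be sketched in a few lines. This is not a reproduction of Algorithm 1: it assumes uniform example probabilities $p(x_i) = 1/N$, assumes the subset functional $\beta_0(\Omega_x) = (1/p(\Omega_x) - 1) / (\mathbb{E}_{y \sim p(y|\Omega_x)}[p(y|\Omega_x)/p(y)] - 1)$, restricts candidates to prefixes of the examples sorted by confidence in a pivot class, and scans every prefix instead of using binary search with a tolerance. The function name `estimate_beta0` and the synthetic input are hypothetical.

```python
import numpy as np

def estimate_beta0(p_y_given_x, eps=1e-12):
    """Prefix-scan estimate of min over subsets of beta_0(Omega), with p(x_i) = 1/N.

    p_y_given_x: array of shape (N, C) holding estimated p(y | x_i) per example.
    Returns the smallest beta_0 found and the indices of the matching subset.
    """
    n, num_classes = p_y_given_x.shape
    p_y = p_y_given_x.mean(axis=0)                 # p(y) under uniform p(x_i)
    sizes = np.arange(1, n)                        # non-empty proper prefixes only
    best_beta, best_subset = np.inf, np.array([], dtype=int)
    for j in range(num_classes):
        # Most confident examples for pivot class j come first.
        order = np.argsort(-p_y_given_x[:, j], kind="stable")
        prefix_sums = np.cumsum(p_y_given_x[order], axis=0)[:-1]
        p_y_given_omega = prefix_sums / sizes[:, None]
        denom = (p_y_given_omega**2 / p_y).sum(axis=1) - 1.0
        numer = n / sizes - 1.0                    # 1 / p(Omega) - 1
        betas = np.where(denom > eps, numer / np.maximum(denom, eps), np.inf)
        k = int(np.argmin(betas))
        if betas[k] < best_beta:
            best_beta, best_subset = float(betas[k]), order[: k + 1]
    return best_beta, best_subset

# Synthetic stand-in for model outputs: two groups of 50 examples, 20% label noise.
probs = np.vstack([np.tile([0.8, 0.2], (50, 1)),
                   np.tile([0.2, 0.8], (50, 1))])
beta_tilde, subset = estimate_beta0(probs)
print("estimated upper bound:", beta_tilde, "subset size:", len(subset))
```

The full scan costs one sort and one cumulative sum per class, which is usually affordable; a binary or other discrete search, as suggested in the text, would be the natural replacement when the dataset is very large.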
6.2. Special Cases for Estimating $\beta_0$
Theorem 4 may still be challenging to estimate, due to the difficulty of making accurate estimates of $p(y|\Omega_x)$ and searching over $\Omega_x \subset \mathcal{X}$. However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.
6.2.1. Class-Conditional Label Noise
Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities and we only observe the corrupted labels. This problem has been studied extensively [22,23,24,25,26]. If IB is applied to this scenario, how large a $\beta$ do we need? The following corollary provides a simple formula.
Corollary 1.
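Suppose that the true class labels are $y^*$ and the input space belonging to each $y^*$ has no overlap. We only observe the corrupted labels y with class-conditional noise $p(y|x, y^*) = p(y|y^*)$, and Y is not independent of X. We have that a sufficient condition for $\mathrm{IB}_\beta$-Learnability is:
$$\beta > \inf_{y^*} \frac{\frac{1}{p(y^*)} - 1}{\sum_{y} \frac{p(y|y^*)^2}{p(y)} - 1} \qquad (8)$$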
We see that under class-conditional noise, the sufficient condition reduces to a discrete formula which only depends on the noise rates $p(y|y^*)$ and the true class probability $p(y^*)$, which can be accurately estimated via, for example, Northcutt et al. [26]. Additionally, if we know that the noise is class-conditional but the observed $\beta_0$ is greater than the R.H.S. of Equation (8), we can deduce that there is overlap between the true classes. The proof of Corollary 1 is provided in Appendix A.10.
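A discrete bound of this kind is easy to evaluate once the noise rates and class priors are known. The sketch below is an assumption-laden illustration, not the paper's formula verbatim: it takes each candidate subset to be the full input region of one true class $y^*$ and evaluates $(1/p(y^*) - 1) / (\sum_y p(y|y^*)^2 / p(y) - 1)$, with $p(y) = \sum_{y^*} p(y^*)\,p(y|y^*)$. The function name `class_conditional_bound` and the example noise matrices are hypothetical.

```python
import numpy as np

def class_conditional_bound(p_true, noise):
    """Min over true classes y* of (1/p(y*) - 1) / (sum_y p(y|y*)^2 / p(y) - 1).

    p_true[k] = p(y* = k); noise[k, y] = p(y | y* = k).
    Assumes every observed label y has p(y) > 0.
    """
    p_true = np.asarray(p_true, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_obs = p_true @ noise                          # p(y) of the corrupted labels
    denom = (noise**2 / p_obs).sum(axis=1) - 1.0
    numer = 1.0 / p_true - 1.0
    ratios = np.where(denom > 1e-12, numer / np.maximum(denom, 1e-12), np.inf)
    return float(ratios.min())

# Balanced binary problem with 20% symmetric label noise.
print(class_conditional_bound([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]]))
# Imbalanced classes with asymmetric noise (hypothetical rates).
print(class_conditional_bound([0.7, 0.3], [[0.9, 0.1], [0.3, 0.7]]))
```

For a balanced binary problem with symmetric noise rate $\rho$, this expression simplifies to $1/(1 - 2\rho)^2$, so the bound grows without limit as the noise rate approaches 50%, and it equals 1 when there is no noise.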
6.2.2. Deterministic Relationships
Theorem 4 also reveals that $\beta_0$ relates closely to whether Y is a deterministic function of X, as shown by Corollary 2:
Corollary 2.
Assume that Y contains at least one value y such that its probability $p(y) > 0$. If Y is a deterministic function of X and not independent of X, then a sufficient condition for $\mathrm{IB}_\beta$-Learnability is $\beta > 1$.
The assumption in Corollary 2 is satisfied by classification and certain regression problems. (The following scenario does not satisfy this assumption: for certain regression problems where Y is a continuous random variable and the probability density function is bounded, then for any y, the probability $P(Y = y)$ has measure 0.) This corollary generalizes the result in Reference [9], which only proves it for classification problems. Combined with the necessary condition $\beta > 1$ for any dataset $(X, Y)$ to be $\mathrm{IB}_\beta$-learnable (Section 3), we have that under the assumption, if Y is a deterministic function of X, then a necessary and sufficient condition for $\mathrm{IB}_\beta$-Learnability is $\beta > 1$; that is, its $\beta_0$ is 1. The proof of Corollary 2 is provided in Appendix A.10.
Therefore, in practice, if we find that $\beta_0 > 1$, we may infer that Y is not a deterministic function of X. For a classification task, we may infer that either some classes have overlap or the labels are noisy. However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed $\beta_0$, even when learning deterministic functions.
7. Experiments
To test how the theoretical conditions for $\mathrm{IB}_\beta$-Learnability match with experiment, we apply them to synthetic data with varying noise rates and class overlap, MNIST binary classification with varying noise rates and CIFAR10 classification, comparing with the $\beta_0$ found experimentally. We also compare with the algorithm in Kim et al. [18] for estimating the hypercontractivity coefficient (= $1/\beta_0$) via the contraction coefficient $\eta_{\mathrm{KL}}$. Experiment details are in Appendix A.12.
7.1. Synthetic Dataset Experiments
We construct a set of datasets from 2D mixtures of 2 Gaussians as X and the identity of the mixture component as Y. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep $\beta$ with exponential steps and observe $I(X;Z)$ and $I(Y;Z)$. We then compare the empirical $\beta_0$ indicated by the onset of above-zero $I(X;Z)$ with predicted values for $\beta_0$.
7.1.1. Classification with Class-Conditional Noise
In this experiment, we have a mixture of Gaussian distribution with 2 components, each of which is a 2D Gaussian with diagonal covariance matrix . The two components have distance 16 (hence virtually no overlap) and equal mixture weight. For each x, the label is the identity of which component it belongs to. We create multiple datasets by randomly flipping the labels y with a certain noise rate . For each dataset, we train VIB models across a range of and observe the onset of learning via random (Observed). To test how different methods perform in estimating , we apply the following methods: (1) Corollary 1, since this is classification with class-conditional noise and the two true classes have virtually no overlap; (2) Algorithm 1 with true ; (3) The algorithm in Kim et al. [18] that estimates , provided with true ; (4) in Equation (2); (2′) Algorithm 1 with estimated by a neural net; (3′) with the same as in (2′). The results are shown in Figure 3 and in Table 1.
Figure 3.
Predicted vs. experimentally identified $\beta_0$, for mixture of Gaussians with varying class-conditional noise rates.
Table 1.
Full table of values used to generate Figure 3.
From Figure 3 and Table 1 we see the following. (A) When using the true $p(y|x)$, both Algorithm 1 and the contraction-coefficient estimate of Kim et al. [18] generally upper bound the empirical $\beta_0$, and Algorithm 1 is generally tighter. (B) When using the true $p(y|x)$, Algorithm 1 and Corollary 1 give the same result. (C) Comparing Algorithm 1 and the contraction-coefficient estimate, both of which use the same empirically estimated $p(y|x)$, both approaches provide good estimation in the low-noise region; however, in the high-noise region, Algorithm 1 gives more precise values, indicating that Algorithm 1 is more robust to the estimation error of $p(y|x)$. (D) Equation (2) empirically upper bounds the experimentally observed $\beta_0$ and gives almost the same result as the theoretical estimation in Corollary 1 and Algorithm 1 with the true $p(y|x)$. In the classification setting, this approach does not require any learned estimate of $p(y|x)$, as we can directly use the empirical $p(y)$ and $p(x|y)$ from SGD mini-batches.
This experiment also shows that for datasets where the signal-to-noise ratio is small, $\beta_0$ can be very high. Instead of blindly sweeping $\beta$, our result can provide guidance for setting $\beta$ so that learning can happen.
7.1.2. Classification with Class Overlap
In this experiment, we test how different amounts of overlap among classes influence $\beta_0$. We use the mixture of Gaussians with two components, each of which is a 2D Gaussian with a diagonal covariance matrix. The two components have weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe the resulting $\beta_0$. Since we do not add noise to the labels, if there were no overlap and a deterministic map from X to Y, we would have $\beta_0 = 1$ by Corollary 2. The more overlap between the two classes, the more uncertain Y is given X. By Equation (5) we then expect $\beta_0$ to be larger, which is corroborated in Figure 4.
Figure 4.
vs. , for mixture of Gaussian datasets with different distances between the two mixture components. The vertical lines are computed by the R.H.S. of Equation (8). As Equation (8) does not make predictions w.r.t. class overlap, the vertical lines are always just above . However, as expected, decreasing the distance between the classes in X space also increases the true .
7.2. MNIST Experiments
We perform binary classification with digits 0 and 1 and as before, add class-conditional noise to the labels with varying noise rates . To explore how the model capacity influences the onset of learning, for each dataset we train two sets of VIB models differing only by the number of neurons in their hidden layers of the encoder: one with neurons, the other with neurons. As we describe in Section 4, insufficient capacity will result in more uncertainty of Y given X from the point of view of the model, so we expect the observed for the model to be larger. This result is confirmed by the experiment (Figure 5). Also, in Figure 5 we plot given by different estimation methods. We see that the observations (A), (B), (C) and (D) in Section 7.1 still hold.
Figure 5.
vs. for the MNIST binary classification with different hidden units per layer n and noise rates : (upper left) , (upper right) , (lower left) , (lower right) . The vertical lines are estimated by different methods. has insufficient capacity for the problem, so its observed learnability onset is pushed higher, similar to the class overlap case.
7.3. MNIST Experiments Using Equation (2)
To see what IB learns at its onset of learning for the full MNIST dataset, we optimize Equation (2) w.r.t. the full MNIST dataset and visualize the clustering of digits by $h(x)$. Equation (2) can be optimized with SGD using any differentiable parameterized mapping $h(x)$. In this case, we chose to parameterize $h(x)$ with a PixelCNN++ architecture [27,28], as PixelCNN++ is a powerful autoregressive model for images that gives a scalar output (normally interpreted as $\log p(x)$). Equation (2) should generally give two clusters in the output space, as discussed in Section 4. In this setup, smaller values of $h(x)$ correspond to the subset of the data that is easiest to learn. Figure 6 shows two strongly separated clusters, as well as the threshold we choose to divide them. Figure 7 shows the first 5776 MNIST training examples as sorted by our learned $h(x)$, with the examples above the threshold highlighted in red. We can clearly see that our learned $h(x)$ has separated the “easy” one (1) digits from the rest of the MNIST training set.
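As a rough illustration of how the quantity inside Equation (2) could be evaluated on a mini-batch, the sketch below assumes that it has the variance-ratio form $\beta_0[h(x)] = \big(\mathbb{E}_x[h^2] - (\mathbb{E}_x[h])^2\big) / \big(\mathbb{E}_y[(\mathbb{E}_{x|y}[h])^2] - (\mathbb{E}_x[h])^2\big)$. The function name `beta0_of_h` and the plug-in estimator are assumptions for the example, not the authors' implementation.

```python
from collections import defaultdict

def beta0_of_h(h_values, labels):
    """Plug-in estimate of beta_0[h] from scalar outputs h(x_i) and labels y_i.

    Assumed form: Var_x(h) / Var_y(E[h | y]); returns inf when the class-
    conditional means of h carry no information about the label.
    """
    n = len(h_values)
    mean_h = sum(h_values) / n
    mean_h2 = sum(v * v for v in h_values) / n
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(h_values, labels):
        sums[y] += v
        counts[y] += 1
    # E_y[(E_{x|y}[h])^2], using empirical class frequencies for p(y).
    second_moment = sum((counts[y] / n) * (sums[y] / counts[y]) ** 2 for y in counts)
    numerator = mean_h2 - mean_h ** 2
    denominator = second_moment - mean_h ** 2
    if denominator <= 0.0:
        return float("inf")
    return numerator / denominator

# h perfectly aligned with the label gives the smallest possible value, 1.
print(beta0_of_h([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0]))
```

Minimizing this estimate over the parameters of $h$ with SGD would then approximate the infimum in Equation (2), under the stated assumption about its form.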
Figure 6.
Histograms of the full MNIST training and validation sets according to $h(x)$. Note that both are bimodal and the histograms are indistinguishable. In both cases, $h(x)$ has learned to separate most of the ones into the smaller mode, but difficult ones are in the wide valley between the two modes. See Figure 7 for all of the training images to the left of the red threshold line, as well as the first few images to the right of the threshold.
Figure 7.
The first 5776 MNIST training set digits when sorted by $h(x)$. The digits highlighted in red are above the threshold drawn in Figure 6.
7.4. CIFAR10 Forgetting Experiments
For CIFAR10 [14], we study how forgetting varies with $\beta$. In other words, given a VIB model trained at some high $\beta$, if we anneal it down to some much lower $\beta$, what does the model converge to? Using Algorithm 1, we estimated $\beta_0 = 1.0483$ on a version of CIFAR10 with 20% label noise, where $p(y|x)$ is estimated by maximum likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest $\beta$ with performance above chance (Figure 8) was a very tight match with the estimate from Algorithm 1. See Appendix A.12 for details.
Figure 8.
Plot of $I(Y;Z)$ vs. $\beta$ for the CIFAR10 training set with 20% label noise. Each blue cross corresponds to a fully converged model starting with independent initialization. The vertical black line corresponds to the $\beta_0$ predicted using Algorithm 1. The empirical $\beta_0$ closely matches this prediction.
8. Conclusions
In this paper, we have presented theoretical results for predicting the onset of learning and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise. We proved a deep connection between IB-learnability, our upper bounds on $\beta_0$, the hypercontractivity coefficient, the contraction coefficient and the maximum correlation. We believe that these results provide a deeper understanding of IB, as well as a tool for analyzing a dataset by discovering its conspicuous subset and a tool for measuring model capacity in a task-specific manner. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.
Author Contributions
Conceptualization, T.W. and I.F.; methodology, T.W., I.F., I.L.C. and M.T.; software, T.W. and I.F.; validation, T.W. and I.F.; formal analysis, T.W. and I.F.; investigation, T.W. and I.F.; resources, T.W., I.F., I.L.C. and M.T.; data curation, T.W. and I.F.; writing–original draft preparation, T.W., I.F., I.L.C. and M.T.; writing–review and editing, T.W., I.F., I.L.C. and M.T.; visualization, T.W. and I.F.; supervision, I.F., I.L.C. and M.T.; project administration, I.F., I.L.C. and M.T.; funding acquisition, M.T.
Funding
T.W.’s work was supported by The Casey and Family Foundation, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. He thanks the Center for Brains, Minds and Machines (CBMM) for hospitality.
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments that contributed to improving the paper.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Appendix A
The structure of the Appendix is as follows. In Appendix A.1, we provide preliminaries for the first-order and second-order variations on functionals. We prove Lemma 1 and Theorem 1 in Appendix A.2 and Appendix A.3, respectively. In Appendix A.4, we prove Theorem 2, the sufficient condition 1 for IB-Learnability. In Appendix A.5, we calculate the first and second variations of $\text{IB}_\beta[p(z|x)]$ at the trivial representation $p(z|x) = p(z)$, which are used in proving Lemma 2 (Appendix A.6) and the Sufficient Condition 2 for $\text{IB}_\beta$-learnability (Appendix A.7). In Appendix A.8, we prove Equation (3) at the onset of learning. After these preparations, we prove the key result of this paper, Theorem 4, in Appendix A.9. Then two important corollaries, Corollaries 1 and 2, are proved in Appendix A.10. In Appendix A.11 we explore the deep relation between $\beta_0$, $\beta_0[h(x)]$, the hypercontractivity coefficient, the contraction coefficient and the maximum correlation. Finally, in Appendix A.12, we provide details for the experiments.
Below are some implicit conventions of the paper: for integrals, whenever a variable W is discrete, we can simply replace the integral $\int (\cdot)\, dw$ by the summation $\sum_w (\cdot)$.
Appendix A.1. Preliminaries: First-Order and Second-Order Variations
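Let the functional $F[f(x)]$ be defined on some normed linear space. Let us add a perturbative function $\epsilon h(x)$ to $f(x)$, and now the functional $F[f(x) + \epsilon h(x)]$ can be expanded as
$$\Delta F[f(x)] = F[f(x) + \epsilon h(x)] - F[f(x)] = \varphi_1[\epsilon h(x)] + \varphi_2[\epsilon h(x)] + O(\epsilon^3 \|h\|^2)$$
where $\|h\|$ denotes the norm of $h$, $\varphi_1[\epsilon h(x)]$ is a linear functional of $\epsilon h(x)$, and is called the first-order variation, denoted as $\delta F[f(x)]$. $\varphi_2[\epsilon h(x)]$ is a quadratic functional of $\epsilon h(x)$, and is called the second-order variation, denoted as $\delta^2 F[f(x)]$.
If $\delta F[f(x)] = 0$, we call $f(x)$ a stationary solution for the functional $F[\cdot]$.
If $\Delta F[f(x)] \ge 0$ for all $h(x)$ such that $f(x) + \epsilon h(x)$ is in the neighborhood of $f(x)$, we call $f(x)$ a (local) minimum of $F[\cdot]$.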
Appendix A.2. Proof of Lemma 1
Proof.
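If $(X, Y)$ is $\text{IB}_\beta$-learnable, then there exists $Z$ given by some $p_1(z|x)$ such that $\text{IB}_\beta(X, Y; Z) < \text{IB}_\beta(X, Y; Z_{\text{trivial}}) = 0$, where $Z_{\text{trivial}}$ satisfies $p(z|x) = p(z)$. Since $X' = g(X)$ is an invertible map (if X is a continuous variable, g is additionally required to be continuous), and mutual information is invariant under such an invertible map [29], we have that $\text{IB}_\beta(X', Y; Z) = I(X'; Z) - \beta I(Y; Z) = I(X; Z) - \beta I(Y; Z) = \text{IB}_\beta(X, Y; Z) < 0$, so $(X', Y)$ is $\text{IB}_\beta$-learnable. On the other hand, if $(X, Y)$ is not $\text{IB}_\beta$-learnable, then for all $Z$ we have $\text{IB}_\beta(X, Y; Z) \ge 0$. Again using mutual information’s invariance under g, we have for all Z, $\text{IB}_\beta(X', Y; Z) = \text{IB}_\beta(X, Y; Z) \ge 0$, so $(X', Y)$ is not $\text{IB}_\beta$-learnable. Therefore, $(X, Y)$ and $(X', Y)$ have the same $\text{IB}_\beta$-learnability. □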
Appendix A.3. Proof of Theorem 1
Proof.
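At the trivial representation $p(z|x) = p(z)$, we have $I(X; Z) = 0$, and $I(Y; Z) = 0$ due to the Markov chain, so $\text{IB}_\beta(X, Y; Z) = 0$ at the trivial representation for any $\beta$. Since $(X, Y)$ is $\text{IB}_{\beta_1}$-learnable, there exists a Z given by a $p_1(z|x)$ such that $\text{IB}_{\beta_1}(X, Y; Z) < 0$. Since $\beta_2 > \beta_1$ and $I(Y; Z) \ge 0$, we have $\text{IB}_{\beta_2}(X, Y; Z) \le \text{IB}_{\beta_1}(X, Y; Z) < 0$, which is below the value of the objective at the trivial representation. Therefore, $(X, Y)$ is $\text{IB}_{\beta_2}$-learnable. □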
Appendix A.4. Proof of Theorem 2
Proof.
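To prove Theorem 2, we use Theorem 1 of Chapter 5 of Gelfand et al. [30], which gives a necessary condition for $F[f(x)]$ to have a minimum at $f_0(x)$. Adapting to our notation, we have:
Theorem A1.
([30]). A necessary condition for the functional $F[f(x)]$ to have a minimum at $f(x) = f_0(x)$ is that for $f(x) = f_0(x)$ and all admissible $\epsilon h(x)$,
$$\delta^2 F[f(x)] \ge 0$$
Applying this to our functional $\text{IB}_\beta[p(z|x)]$, an immediate result of Theorem A1 is that, if at $p(z|x) = p(z)$ there exists an $\epsilon h(z|x)$ such that $\delta^2 \text{IB}_\beta[p(z|x)] < 0$, then $p(z|x) = p(z)$ is not a minimum for $\text{IB}_\beta[p(z|x)]$. Using the definition of $\text{IB}_\beta$-learnability, we have that $(X, Y)$ is $\text{IB}_\beta$-learnable. □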
Appendix A.5. First- and Second-Order Variations of $\text{IB}_\beta[p(z|x)]$
In this section, we derive the first- and second-order variations of $\text{IB}_\beta[p(z|x)]$, which are needed for proving Lemma 2 and Theorem 3.
Lemma A1.
Using perturbative function, we have
Proof.
Since , let us calculate the first and second-order variation of and w.r.t. , respectively. Through this derivation, we use as a perturbative function, for ease of deciding different orders of variations. We assume that is continuous, and there exists a constant M such that , . We will finally absorb into .
Denote . We have
In this paper, we implicitly assume that the integral (or summing) are only on the support of .
Since
We have
Expanding to the second order of , we have
Collecting the first order terms of , we have
Collecting the second order terms of , we have
Now let us calculate the first and second-order variation of . We have
Using the Markov chain , we have
Hence
Then expanding to the second order of , we have
Collecting the first order terms of , we have
Collecting the second order terms of , we have
Finally, we have
Absorbing $\epsilon$ into $h(z|x)$, we get rid of the factor $\epsilon$ and obtain the final expression in Lemma A1. □
Appendix A.6. Proof of Lemma 2
Proof.
Using Lemma A1, we have
Let (the trivial representation), we have that . Therefore, the two integrals are both 0. Hence,
Therefore, the trivial representation $p(z|x) = p(z)$ is a stationary solution for $\text{IB}_\beta[p(z|x)]$. □
Appendix A.7. Proof of Theorem 3
Proof.
Firstly, from the necessary condition of in Section 3, we have that any sufficient condition for -learnability should be able to deduce .
Now using Theorem 2, a sufficient condition for to be -learnable is that there exists with such that at .
At the trivial representation, and hence . Due to the Markov chain , we have . Substituting them into the in Lemma A1, the condition becomes: there exists with , such that
Rearranging terms and simplifying, we have
where
Now we prove that the condition that s.t. is equivalent to the condition that s.t. .
If , , then we have , . Therefore, if s.t. , we have that s.t. . Since the functional does not contain integration over z, we can treat the z in as a parameter and we have that s.t. .
Conversely, if there exists a certain function such that , we can find some such that and , and let . Now we have
In other words, the condition Equation (A2) is equivalent to requiring that there exists an such that . Hence, a sufficient condition for -learnability is that there exists an such that
When in the entire input space , Equation (A3) becomes:
which cannot be true. Therefore, cannot satisfy Equation (A3).
Rearranging terms and simplifying, we have
Written in the form of expectations, we have
Since the square function is convex, using Jensen’s inequality on the L.H.S. of Equation (A5), we have
The equality holds iff is constant w.r.t. y, i.e., Y is independent of X. Therefore, in order for Equation (A5) to hold, we require that Y is not independent of X.
Using Jensen’s inequality on the inner expectation on the L.H.S. of Equation (A5), we have
The equality holds when is a constant. Since we require that is not a constant, we have that the equality cannot be reached.
Similarly, using Jensen’s inequality on the R.H.S. of Equation (A5), we have that
where we have used the requirement that cannot be constant.
Under the constraint that Y is not independent of X, we can divide both sides of Equation (A5), and obtain the condition: there exists an such that
i.e.,
which proves the condition of Theorem 3.
Furthermore, from Equation (A6) we have
for const, which satisfies the necessary condition of in Section 3.
Proof of lower bound of slope of the Pareto frontier at the origin: Now we prove the second statement of Theorem 3. Since and according to Lemma 2, we have . Substituting into the expression of and from Lemma A1, we have
Therefore, gives the largest slope of vs. for perturbation function of the form satisfying and , which is a lower bound of slope of vs. for all possible perturbation function . The latter is the slope of the Pareto frontier of the vs. curve at the origin.
Inflection point for general: If we do not assume that Z is at the origin of the information plane, but at some general stationary solution with , we define
which reduces to when . When
it becomes a non-stable solution (non-minimum), and we will have other Z that achieves a better than the current . □
Appendix A.8. What IB First Learns at Its Onset of Learning
In this section, we prove that at the onset of learning, if letting , we have
where is the estimated by IB for a certain , , , is a constant.
Proof.
In IB, we use to obtain Z from X, then obtain the prediction of Y from Z using . Here we use subscript to denote the probability (density) at the optimum of at a specific . We have
When we have a small perturbation at the trivial representation, , we have . Substituting, we have
The 0th-order term is . The first-order term is
since we have for any x.
For the second-order term, using and , it is
where . Combining everything, we have up to the second order,
□
Appendix A.9. Proof of Theorem 4
Proof.
According to Theorem 3, a sufficient condition for to be -learnable is that X and Y are not independent, and
We can assume a specific form of , and obtain a (potentially stronger) sufficient condition. Specifically, we let
for certain . Substituting into Equation (A10), we have that a sufficient condition for to be -learnable is
where .
The denominator of Equation (A11) is
Using the inequality , we have
Both equalities hold iff , at which the denominator of Equation (A11) is equal to 0 and the expression inside the infimum diverges, which will not contribute to the infimum. Except for this scenario, the denominator is greater than 0. Substituting into Equation (A11), we have that a sufficient condition for to be -learnable is
Since is a subset of , by the definition of in Equation (A10), is not a constant in the entire . Hence the numerator of Equation (A12) is positive. Since its denominator is also positive, we can then neglect the “”, and obtain the condition in Theorem 4.
Since the set of $h(x)$ used in this theorem is a subset of the set of $h(x)$ used in Theorem 3, the infimum for Equation (5) is greater than or equal to the infimum in Equation (2). Therefore, according to the second statement of Theorem 3, we have that $\left(\inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x)\right)^{-1}$ is also a lower bound of the slope for the Pareto frontier of the $I(Y; Z)$ vs. $I(X; Z)$ curve.
Now we prove that the condition Equation (5) is invariant to invertible mappings of X. In fact, if is a uniquely invertible map (if X is continuous, g is additionally required to be continuous), let , and denote for any , we have , and . Then for dataset , let , we have
Additionally we have . Then
For dataset , applying Theorem 4 we have that a sufficient condition for it to be -learnable is
where the equality is due to Equation (A14). Comparing with the condition for -learnability for (Equation (5)), we see that they are the same. Therefore, the condition given by Theorem 4 is invariant to invertible mapping of X. □
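For a discrete dataset, the quantity bounded in Theorem 4 can be evaluated directly from a joint probability table. The sketch below assumes that Equation (5) has the form $\beta_0(\Omega_x) = \big(\tfrac{1}{p(\Omega_x)} - 1\big) / \big(\mathbb{E}_{y \sim p(y|\Omega_x)}\big[\tfrac{p(y|\Omega_x)}{p(y)}\big] - 1\big)$, which is what the variance ratio of Theorem 3 reduces to when $h(x)$ is the indicator function of $\Omega_x$. The helper name `beta0_of_subset` and the table layout are assumptions for the example.

```python
def beta0_of_subset(joint, omega):
    """Evaluate the assumed form of beta_0(Omega_x) for a discrete joint table.

    joint[x][y] holds p(x, y); omega is a collection of x indices.
    Returns inf when the subset carries no information about Y.
    """
    n_y = len(joint[0])
    p_y = [sum(row[y] for row in joint) for y in range(n_y)]
    p_omega = sum(sum(joint[x]) for x in omega)
    p_y_given_omega = [sum(joint[x][y] for x in omega) / p_omega for y in range(n_y)]
    # E_{y ~ p(y|Omega)}[p(y|Omega) / p(y)]
    ratio = sum(q * q / p for q, p in zip(p_y_given_omega, p_y) if p > 0.0)
    if ratio <= 1.0:
        return float("inf")
    return (1.0 / p_omega - 1.0) / (ratio - 1.0)

# Binary X and Y with 20% symmetric label noise; Omega_x = {x = 0}.
noisy = [[0.4, 0.1],
         [0.1, 0.4]]
print(beta0_of_subset(noisy, {0}))  # equals 1 / 0.36 up to rounding
```

In this toy case the value agrees with $1/(1 - 2\rho)^2$ for a symmetric noise rate $\rho = 0.2$, and it drops to 1 when the labels are a deterministic function of $x$.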
Appendix A.10. Proof of Corollary 1 and Corollary 2
Appendix A.10.1. Proof of Corollary 1
Proof.
We use Theorem 4. Let contain all elements x whose true class is for some certain , and 0 otherwise. Then we obtain a (potentially stronger) sufficient condition. Since the probability is class-conditional, we have
By requiring , we obtain a sufficient condition for learnability. □
Appendix A.10.2. Proof of Corollary 2
Proof.
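We again use Theorem 4. Since Y is a deterministic function of X, let $Y = f(X)$. By the assumption that Y contains at least one value y such that its probability $p(y) > 0$, we let $\Omega_x$ contain only x such that $f(x) = y$. Substituting into Equation (5), we have
$$\beta_0(\Omega_x) = \frac{\frac{1}{p(\Omega_x)} - 1}{\mathbb{E}_{y' \sim p(y'|\Omega_x)}\left[\frac{p(y'|\Omega_x)}{p(y')}\right] - 1} = \frac{\frac{1}{p(y)} - 1}{\frac{1}{p(y)} - 1} = 1$$
Therefore, the sufficient condition becomes $\beta > 1$. □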
Appendix A.11. β0, Hypercontractivity Coefficient, Contraction Coefficient, β0[h(x)], and Maximum Correlation
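In this section, we prove the relations between the IB-Learnability threshold $\beta_0$, the hypercontractivity coefficient $\xi(X; Y)$, the contraction coefficient $\eta_{\mathrm{KL}}(p(y|x), p(x))$, $\beta_0[h(x)]$ in Equation (2), and the maximum correlation $\rho_m(X; Y)$, as follows:
$$\frac{1}{\beta_0} = \xi(X; Y) = \eta_{\mathrm{KL}}(p(y|x), p(x)) \ge \sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X; Y)$$
Proof.
The hypercontractivity coefficient is defined as [16]:
$$\xi(X; Y) \equiv \sup_{Z \leftarrow X \leftrightarrow Y} \frac{I(Y; Z)}{I(X; Z)}$$
By our definition of IB-learnability, (X, Y) is IB-Learnable iff there exists Z obeying the Markov chain $Z \leftarrow X \leftrightarrow Y$, such that
$$I(X; Z) - \beta I(Y; Z) < 0$$
Or equivalently there exists Z obeying the Markov chain $Z \leftarrow X \leftrightarrow Y$ such that
$$0 < \frac{1}{\beta} < \frac{I(Y; Z)}{I(X; Z)}$$
By Theorem 1, the IB-Learnability region for $\beta$ is $(\beta_0, +\infty)$, or equivalently the IB-Learnability region for $1/\beta$ is
$$0 < \frac{1}{\beta} < \frac{1}{\beta_0}$$
Comparing the last two conditions and taking the supremum over Z, we obtain $\frac{1}{\beta_0} = \xi(X; Y)$.
In Anantharam et al. [16], the authors prove that
$$\xi(X; Y) = \eta_{\mathrm{KL}}(p(y|x), p(x))$$
where the contraction coefficient is defined as
$$\eta_{\mathrm{KL}}(p(y|x), p(x)) = \sup_{r(x) \ne p(x)} \frac{D_{\mathrm{KL}}(r(y) \,\|\, p(y))}{D_{\mathrm{KL}}(r(x) \,\|\, p(x))}$$
where $p(y) = \mathbb{E}_{x \sim p(x)}[p(y|x)]$ and $r(y) = \mathbb{E}_{x \sim r(x)}[p(y|x)]$. Treating $p(y|x)$ as a channel, the contraction coefficient measures how much the two distributions $r(x)$ and $p(x)$ become “nearer” (as measured by the KL-divergence) after passing through the channel.
In Anantharam et al. [16], the authors also provide a counterexample to an earlier result by Erkip and Cover [31] that incorrectly proved $\xi(X; Y) = \rho_m^2(X; Y)$. In the specific counterexample Anantharam et al. [16] design, $\xi(X; Y) > \rho_m^2(X; Y)$.
The maximum correlation is defined as $\rho_m(X; Y) \equiv \max_{f, g} \mathbb{E}[f(X) g(Y)]$, where $f(X)$ and $g(Y)$ are real-valued random variables such that $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$ [20,21].
Now we prove $\sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X; Y)$, based on Theorem 3. To see this, we use the alternate characterization of $\rho_m(X; Y)$ by Rényi [32]:
$$\rho_m^2(X; Y) = \max_{f(X):\ \mathbb{E}[f(X)] = 0,\ \mathbb{E}[f^2(X)] = 1} \mathbb{E}\left[\left(\mathbb{E}[f(X) \mid Y]\right)^2\right] \tag{A21}$$
Denoting $\bar{h} = \mathbb{E}_{x \sim p(x)}[h(x)]$, we can transform $\beta_0[h(x)]$ in Equation (2) as follows:
$$\frac{1}{\beta_0[h(x)]} = \frac{\mathbb{E}_{y \sim p(y)}\left[\left(\mathbb{E}_{x \sim p(x|y)}[h(x)]\right)^2\right] - \bar{h}^2}{\mathbb{E}_{x \sim p(x)}[h^2(x)] - \bar{h}^2} = \frac{\mathbb{E}_{y \sim p(y)}\left[\left(\mathbb{E}_{x \sim p(x|y)}[h(x) - \bar{h}]\right)^2\right]}{\mathbb{E}_{x \sim p(x)}\left[(h(x) - \bar{h})^2\right]} = \mathbb{E}_{y \sim p(y)}\left[\left(\mathbb{E}_{x \sim p(x|y)}[f(x)]\right)^2\right]$$
where we denote $f(x) = \frac{h(x) - \bar{h}}{\sqrt{\mathbb{E}_{x \sim p(x)}[(h(x) - \bar{h})^2]}}$, so that $\mathbb{E}[f(X)] = 0$ and $\mathbb{E}[f^2(X)] = 1$.
Combined with Equation (A21), we have
$$\sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X; Y)$$
Our Theorem 3 states that
$$\sup_{h(x)} \frac{1}{\beta_0[h(x)]} \le \frac{1}{\beta_0}$$
In summary, the relations among the quantities are:
$$\frac{1}{\beta_0} = \xi(X; Y) = \eta_{\mathrm{KL}}(p(y|x), p(x)) \ge \sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X; Y)$$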
□
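The maximum correlation has a simple closed form when Y is binary, which gives a quick numerical sanity check on these relations. The sketch relies on a standard fact that is not stated in this paper: the matrix $Q_{xy} = p(x, y)/\sqrt{p(x)p(y)}$ has largest singular value 1 and second-largest singular value $\rho_m(X; Y)$, so when Y takes only two values, $\rho_m^2 = \sum_{x,y} \frac{p(x,y)^2}{p(x)p(y)} - 1$. The function name is an assumption for the example.

```python
def max_correlation_squared_binary_y(joint):
    """Squared maximum correlation for a discrete X and a binary Y.

    joint[x][y] holds p(x, y) with y in {0, 1}. With only two values of Y the
    matrix p(x, y) / sqrt(p(x) p(y)) has at most two nonzero singular values
    (1 and rho_m), so its squared Frobenius norm equals 1 + rho_m^2.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(row[y] for row in joint) for y in range(2)]
    frobenius_sq = 0.0
    for x, row in enumerate(joint):
        for y in range(2):
            if row[y] > 0.0:
                frobenius_sq += row[y] ** 2 / (p_x[x] * p_y[y])
    return frobenius_sq - 1.0

# Binary symmetric noise with rate 0.2: rho_m^2 = (1 - 2 * 0.2)^2 = 0.36.
rho_sq = max_correlation_squared_binary_y([[0.4, 0.1], [0.1, 0.4]])
print(rho_sq, 1.0 / rho_sq)
```

For this toy distribution the reciprocal, about 2.78, is the value that $\inf_{h(x)} \beta_0[h(x)]$ should take according to the relation $\sup_{h(x)} 1/\beta_0[h(x)] = \rho_m^2(X; Y)$.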
Appendix A.12. Experiment Details
We use the Variational Information Bottleneck (VIB) objective from [5]. For the synthetic experiment, the latent Z has dimension of 2. The encoder is a neural net with 2 hidden layers, each of which has 128 neurons with ReLU activation. The last layer has linear activation and 4 output neurons; the first two parameterize the mean of a Gaussian and the last two parameterize the log variance. The decoder is a neural net with 1 hidden layer with 128 neurons and ReLU activation. Its last layer has linear activation and outputs the logit for the class labels. It uses a mixture of Gaussian prior with 500 components (for the experiment with class overlap, 256 components), each of which is a 2D Gaussian with learnable mean and log variance, and the weights for the components are also learnable. For the MNIST experiment, the architecture is mostly the same, except the following: (1) for Z, we let it have dimension of 256. (2) For the prior, we use standard Gaussian with diagonal covariance matrix.
For all experiments, we use Adam [33] optimizer with default parameters. We do not add any explicit regularization. We use learning rate of and have a learning rate decay of . We train in total 2000 epochs with mini-batch size of 500.
For estimation of the observed $\beta_0$ in Figure 3, in the $I(X;Z)$ vs. $\beta_i$ curve ($\beta_i$ denotes the i-th $\beta$), we take the mean and standard deviation of $I(X;Z)$ for the lowest 5 $\beta_i$ values, denoted $\mu_\beta$ and $\sigma_\beta$ ($I(Y;Z)$ has similar behavior, but since we are minimizing $I(X;Z) - \beta I(Y;Z)$, the onset of nonzero $I(X;Z)$ is less prone to noise). When $I(X;Z)$ is greater than $\mu_\beta + 3\sigma_\beta$, we regard it as learning a non-trivial representation, and take the average of $\beta_i$ and $\beta_{i-1}$ as the experimentally estimated onset of learning. We also inspect the curves manually and confirm that the estimate is consistent with human intuition.
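The thresholding rule described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `estimate_onset`, the use of the sample standard deviation, and the choice to start scanning right after the baseline points are assumptions.

```python
from statistics import mean, stdev

def estimate_onset(betas, infos, n_baseline=5, n_sigma=3.0):
    """Estimate the onset of learning from a mutual-information vs. beta sweep.

    betas: the beta values of the sweep; infos: the mutual information measured
    at each beta. The n_baseline lowest betas define the noise floor (mean and
    standard deviation); the onset is the midpoint between the first beta whose
    value exceeds mean + n_sigma * std and the beta just below it.
    Returns None if the curve never leaves the noise floor.
    """
    pairs = sorted(zip(betas, infos))
    baseline = [info for _, info in pairs[:n_baseline]]
    threshold = mean(baseline) + n_sigma * stdev(baseline)
    for i in range(n_baseline, len(pairs)):
        if pairs[i][1] > threshold:
            return 0.5 * (pairs[i][0] + pairs[i - 1][0])
    return None

betas = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
infos = [0.001, 0.002, 0.001, 0.002, 0.001, 0.002, 0.001, 0.5, 0.8, 0.9]
print(estimate_onset(betas, infos))  # midpoint between beta = 1.6 and 1.7
```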
For estimating using Algorithm 1, at step 6 we use the following discrete search algorithm. We fix and gradually narrow down the range of , starting from . At each iteration, we set a tentative new range , where , , and calculate , where and . If , let . If , let . In other words, we narrow down the range of if we find that the given by the left or right boundary gives a lower value. The process stops when both and stop improving (which we find always happens when ), and we return the smaller of the final and as .
For estimation of $p(y|x)$ for (2′) Algorithm 1 and (3′) for both synthetic and MNIST experiments, we use a 3-layer neural net where each hidden layer has 128 neurons and ReLU activation. The last layer has linear activation. The objective is cross-entropy loss. We use Adam [33] optimizer with a learning rate of , and train for 100 epochs (after which the validation loss does not go down).
For estimating via (3′) by the algorithm in [18], we use the code from the GitHub repository provided by the paper (At https://github.com/wgao9/hypercontractivity), using the same employed for (2′) Algorithm 1. Since our datasets are classification tasks, we use instead of the kernel density for estimating matrix A; we take the maximum of 10 runs as estimation of .
CIFAR10 Details
We trained a deterministic 28 × 10 wide resnet [34,35], using the open source implementation from Cubuk et al. [36]. However, we extended the final 10-dimensional logits of that model through another 3-layer MLP classifier, in order to keep the inference network architecture identical between this model and the VIB models we describe below. During training, we dynamically added label noise according to the class confusion matrix in Table A1. The mean label noise averaged across the 10 classes is 20%. After that model had converged, we used it to estimate $\beta_0$ with Algorithm 1. Even with 20% label noise, $\beta_0$ was estimated to be 1.0483.
Table A1.
Class confusion matrix used in CIFAR10 experiments. The value in row i, column j means for class i, the probability of labeling it as class j. The mean confusion across the classes is 20%.
| | Plane | Auto. | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck |
|---|---|---|---|---|---|---|---|---|---|---|
| Plane | 0.82232 | 0.00238 | 0.021 | 0.00069 | 0.00108 | 0 | 0.00017 | 0.00019 | 0.1473 | 0.00489 |
| Auto. | 0.00233 | 0.83419 | 0.00009 | 0.00011 | 0 | 0.00001 | 0.00002 | 0 | 0.00946 | 0.15379 |
| Bird | 0.03139 | 0.00026 | 0.76082 | 0.0095 | 0.07764 | 0.01389 | 0.1031 | 0.00309 | 0.00031 | 0 |
| Cat | 0.00096 | 0.0001 | 0.00273 | 0.69325 | 0.00557 | 0.28067 | 0.01471 | 0.00191 | 0.00002 | 0.0001 |
| Deer | 0.00199 | 0 | 0.03866 | 0.00542 | 0.83435 | 0.01273 | 0.02567 | 0.08066 | 0.00052 | 0.00001 |
| Dog | 0 | 0.00004 | 0.00391 | 0.2498 | 0.00531 | 0.73191 | 0.00477 | 0.00423 | 0.00001 | 0 |
| Frog | 0.00067 | 0.00008 | 0.06303 | 0.05025 | 0.0337 | 0.00842 | 0.8433 | 0 | 0.00054 | 0 |
| Horse | 0.00157 | 0.00006 | 0.00649 | 0.00295 | 0.13058 | 0.02287 | 0 | 0.83328 | 0.00023 | 0.00196 |
| Ship | 0.1288 | 0.01668 | 0.00029 | 0.00002 | 0.00164 | 0.00006 | 0.00027 | 0.00017 | 0.83385 | 0.01822 |
| Truck | 0.01007 | 0.15107 | 0 | 0.00015 | 0.00001 | 0.00001 | 0 | 0.00048 | 0.02549 | 0.81273 |
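Dynamic label noise of this kind can be sketched by sampling each training label from the row of the confusion matrix that corresponds to its true class. The helper names below and the toy 3-class matrix are assumptions for illustration; only the diagonal values are copied from Table A1, to check the stated 20% mean noise.

```python
import random

def sample_noisy_label(true_class, confusion, rng):
    """Sample a label for an example of class `true_class`.

    confusion[i][j] is the probability of labeling class i as class j.
    """
    return rng.choices(range(len(confusion)), weights=confusion[true_class], k=1)[0]

def mean_noise_rate(diagonal):
    """Mean probability of a wrong label, given the diagonal of the matrix."""
    return 1.0 - sum(diagonal) / len(diagonal)

# Diagonal of Table A1 (probability that each class keeps its true label).
TABLE_A1_DIAGONAL = [0.82232, 0.83419, 0.76082, 0.69325, 0.83435,
                     0.73191, 0.8433, 0.83328, 0.83385, 0.81273]
print(mean_noise_rate(TABLE_A1_DIAGONAL))  # approximately 0.2

# Toy 3-class matrix (hypothetical values) to show the sampling step.
toy_confusion = [[0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.0, 0.5]]
rng = random.Random(0)
noisy_labels = [sample_noisy_label(y, toy_confusion, rng) for y in [0, 1, 2, 1, 0]]
```

Resampling the labels at every training step, rather than once before training, is what makes the noise dynamic.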
We then trained 73 different VIB models using the same 28 × 10 wide resnet architecture for the encoder, parameterizing the mean of a 10-dimensional unit variance Gaussian. Samples from the encoder distribution were fed to the same 3-layer MLP classifier architecture used in the deterministic model. The marginal distributions were mixtures of 500 fully covariate 10-dimensional Gaussians, all parameters of which are trained. The VIB models had $\beta$ ranging from 1.02 to 2.0 by steps of 0.02, plus an extra set ranging from 1.04 to 1.06 by steps of 0.001 to ensure we captured the empirical $\beta_0$ with high precision.
However, this particular VIB architecture does not start learning until , so none of these models would train as described. (A given architecture trained using maximum likelihood and with no stochastic layers will tend to have higher effective capacity than the same architecture with a stochastic layer that has a fixed but non-trivial variance, even though those two architectures have exactly the same number of learnable parameters.) Instead, we started them all at , and annealed down to the corresponding target over 10,000 training gradient steps. The models continued to train for another 200,000 gradient steps after that. In all cases, the models converged to essentially their final accuracy within 20,000 additional gradient steps after annealing was completed. They were stable over the remaining ∼180,000 gradient steps.
References
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
- Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
- Rey, M.; Roth, V. Meta-Gaussian information bottleneck. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2012; pp. 1916–1924.
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.
- Chalk, M.; Marre, O.; Tkacik, G. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2016; pp. 1957–1965.
- Fischer, I. The Conditional Entropy Bottleneck. 2018. Available online: https://openreview.net/forum?id=rkVOXhAqY7 (accessed on 20 September 2019).
- Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630.
- Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 30 April 2019.
- Strouse, D.; Schwab, D.J. The information bottleneck and geometric clustering. arXiv 2017, arXiv:1712.09657.
- Achille, A.; Soatto, S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018, 19, 1947–1980.
- Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
- Achille, A.; Mbeng, G.; Soatto, S. The Dynamics of Differential Learning I: Information-Dynamics and Task Reachability. arXiv 2018, arXiv:1810.02440.
- Anantharam, V.; Gohari, A.; Kamath, S.; Nair, C. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv 2013, arXiv:1304.6133.
- Polyanskiy, Y.; Wu, Y. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration; Springer: Berlin/Heidelberg, Germany, 2017; pp. 211–249.
- Kim, H.; Gao, W.; Kannan, S.; Oh, S.; Viswanath, P. Discovering potential correlations via hypercontractivity. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2017; pp. 4577–4587.
- Lin, H.W.; Tegmark, M. Criticality in formal languages and statistical physics. arXiv 2016, arXiv:1606.06737.
- Hirschfeld, H.O. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society; Cambridge University Press: Cambridge, UK, 1935; Volume 31, pp. 520–524.
- Gebelein, H. Das statistische Problem der Korrelation als Variations-und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-J. Appl. Math. Mech. Für Angew. Math. Und Mech. 1941, 21, 364–379.
- Angluin, D.; Laird, P. Learning from noisy examples. Mach. Learn. 1988, 2, 343–370.
- Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2013; pp. 1196–1204.
- Liu, T.; Tao, D. Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 447–461.
- Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; Wang, X. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2691–2699.
- Northcutt, C.G.; Wu, T.; Chuang, I.L. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv 2017, arXiv:1705.01936.
- van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Kavukcuoglu, K.; Vinyals, O.; Graves, A. Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4790–4798.
- Salimans, T.; Karpathy, A.; Chen, X.; Kingma, D.P. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
- Gelfand, I.M.; Silverman, R.A. Calculus of Variations; Courier Corporation: North Chelmsford, MA, USA, 2000.
- Erkip, E.; Cover, T.M. The efficiency of investment information. IEEE Trans. Inf. Theory 1998, 44, 1026–1040.
- Rényi, A. On measures of dependence. Acta Math. Hung. 1959, 10, 441–451.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.
- Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).