The Convex Information Bottleneck Lagrangian

The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations T of some random variable X for the task of predicting Y. It is defined as a constrained optimization problem that maximizes the information the representation has about the task, I(T; Y), while ensuring that a certain level of compression r is achieved (i.e., I(X; T) ≤ r). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., L_IB(T; β) = I(T; Y) − βI(X; T)) for many values of β ∈ [0, 1]. Then, the curve of maximal I(T; Y) for a given I(X; T) is drawn and a representation with the desired predictability and compression is selected. It is known that when Y is a deterministic function of X, the IB curve cannot be explored, and another Lagrangian has been proposed to tackle this problem: the squared IB Lagrangian, L_sq-IB(T; β_sq) = I(T; Y) − β_sq I(X; T)². In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate r for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization.


Introduction
Let X and Y be two statistically dependent random variables with joint distribution p_X,Y(x, y). The information bottleneck (IB) (Tishby et al., 2000) investigates the problem of extracting the relevant information from X for the task of predicting Y.
For this purpose, the IB defines a bottleneck variable T obeying the Markov chain Y ↔ X ↔ T so that T acts as a representation of X. Tishby et al. (2000) define the relevant information as the information the representation keeps from Y after the compression of X; i.e., I(T; Y), provided a minimum level of compression; i.e., I(X; T) ≤ r. Therefore, we select the representation which yields the value of the IB curve that best fits our requirements.

Definition 1 (IB functional). Let X and Y be statistically dependent variables. Let ∆ be the set of random variables T obeying the Markov condition Y ↔ X ↔ T. Then the IB functional is

F_IB,max(r) = max_{T ∈ ∆} {I(T; Y)} s.t. I(X; T) ≤ r. (1)

Definition 2 (IB curve). The IB curve is the set of points defined by the solutions of F_IB,max(r) for varying values of r ∈ [0, ∞).
Definition 3 (Information plane). The information plane is the plane defined by the axes I(T; Y) and I(X; T).
In practice, solving a constrained optimization problem such as the IB functional is difficult. Thus, in order to avoid the non-linear constraints from the IB functional, the IB Lagrangian is defined.

Definition 4 (IB Lagrangian). Let X and Y be statistically dependent variables. Let ∆ be the set of random variables T obeying the Markov condition Y ↔ X ↔ T. Then we define the IB Lagrangian as

L^β_IB(T) = I(T; Y) − βI(X; T). (2)

Here β ∈ [0, 1] is the Lagrange multiplier which controls the trade-off between the information of Y retained and the compression of X. Note we consider β ∈ [0, 1] because (i) for β ≤ 0 many uncompressed solutions such as T = X maximize L^β_IB, and (ii) for β ≥ 1 the IB Lagrangian is non-positive due to the data processing inequality (DPI) (Theorem 2.8.1 from Cover, Thomas (2012)) and trivial solutions like T = const are maximizers with L^β_IB(T) = 0 (Kolchinsky et al., 2019a).
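For discrete variables, both mutual informations in (2) can be computed directly from the joint distribution, which makes the trade-off easy to inspect numerically. The following minimal Python sketch (ours, not from the paper; the toy distribution and encoder are illustrative) evaluates L^β_IB(T) for a candidate encoder q_T|X:

```python
import numpy as np

def mutual_information(p_joint):
    """I(A; B) in bits for a discrete joint distribution given as a 2-D array."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])).sum())

def ib_lagrangian(p_xy, q_t_given_x, beta):
    """L^beta_IB(T) = I(T; Y) - beta * I(X; T) for an encoder q(t|x)."""
    p_x = p_xy.sum(axis=1)                # p(x)
    p_xt = q_t_given_x * p_x[:, None]     # joint p(x, t)
    p_ty = q_t_given_x.T @ p_xy           # Markov chain Y <-> X <-> T
    return mutual_information(p_ty) - beta * mutual_information(p_xt)

# Toy example: X uniform on {0, 1}, Y = X (deterministic), T = X (identity encoder).
p_xy = np.array([[0.5, 0.0], [0.0, 0.5]])
identity = np.eye(2)
# With T = X we get I(T; Y) = I(X; T) = 1 bit, so the Lagrangian equals 1 - beta.
print(ib_lagrangian(p_xy, identity, beta=0.5))  # → 0.5
```

Sweeping β over [0, 1] with such a routine is exactly the repeated-optimization procedure whose cost this paper aims to remove.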
We know the solutions of the IB Lagrangian optimization (if existent) are solutions of the IB functional by Lagrange's sufficiency theorem (Theorem 5 in Appendix A of Courcoubetis (2003)). Moreover, since the IB functional is concave (Lemma 5 of Gilad-Bachrach et al. (2003)), we know they exist (Theorem 6 in Appendix A of Courcoubetis (2003)).
Therefore, the problem is usually solved by maximizing the IB Lagrangian with adaptations of the Blahut-Arimoto algorithm (Tishby et al., 2000), deterministic annealing approaches (Tishby, Slonim, 2001), a bottom-up greedy agglomerative clustering (Slonim, Tishby, 2000), or its improved sequential counterpart (Slonim et al., 2002). However, when provided with high-dimensional random variables X such as images, these algorithms do not scale well, and deep-learning-based techniques, where the IB Lagrangian is used as the objective function, have prevailed (Alemi et al., 2016; Chalk et al., 2016; Kolchinsky et al., 2019b).
Note the IB Lagrangian optimization yields a representation T with a given performance (I(X; T), I(T; Y)) for a given β. However, there is no one-to-one mapping between β and I(X; T). Hence, we cannot directly optimize for a desired compression level r; instead, we need to perform several optimizations for different values of β and select the representation with the desired performance; e.g., Alemi et al. (2016). The Lagrange multiplier selection is important since (i) sometimes even choices of β < 1 lead to trivial representations such that p_T|X(t|x) = p_T(t), and (ii) there exist some discontinuities in the performance level w.r.t. the values of β (Wu et al., 2019). Kolchinsky et al. (2019a) showed how in deterministic scenarios (such as many classification problems, where an input x_i belongs to a single particular class y_i) the IB Lagrangian could not explore the IB curve. Particularly, they showed that multiple β yielded the same performance level and that a single value of β could result in different performance levels. To solve this issue, they introduced the squared IB Lagrangian, L^{β_sq}_sq-IB(T) = I(T; Y) − β_sq I(X; T)², which is able to explore the IB curve in any scenario by optimizing for different values of β_sq. However, even though they realized a one-to-one mapping between β_sq and the compression level existed, they did not find such a mapping. Hence, multiple optimizations of the Lagrangian were still required to find the best traded-off solution.

The main contributions of this article are:

1. We introduce a general family of Lagrangians (the convex IB Lagrangians) which are able to explore the IB curve in any scenario, and of which the squared IB Lagrangian (Kolchinsky et al., 2019a) is a particular case. More importantly, the analysis made for deriving this family of Lagrangians can serve as inspiration for obtaining new Lagrangian families which solve other objective functions with intrinsic trade-offs such as the IB Lagrangian.
2. We show that in deterministic scenarios (and other scenarios where the IB curve shape is known) one can use the convex IB Lagrangian to obtain a desired level of performance with a single optimization. That is, there is a one-to-one mapping between the Lagrange multiplier used for the optimization and the level of compression and informativeness obtained, and we know such mapping. This eliminates the need for multiple optimizations to select a suitable representation.
Furthermore, we provide some insight into why there are discontinuities in the performance levels w.r.t. the values of the Lagrange multipliers. In a classification setting, we connect those discontinuities with the intrinsic clusterization of the representations when optimizing the IB objective.
The structure of the article is the following: in Section 2 we motivate the usage of the IB in supervised learning settings. Then, in Section 3 we outline the important results about the IB curve in deterministic scenarios. Later, in Section 4 we introduce the convex IB Lagrangian and explain some of its properties, like the bijective mapping between Lagrange multipliers and the compression level, and the range of such multipliers. After that, we support our (proved) claims with some empirical evidence on the MNIST dataset (LeCun et al., 1998) in Section 5.

The IB in supervised learning
In this section we will first give an overview of supervised learning in order to later motivate the usage of the information bottleneck in this setting.

Supervised learning overview
In supervised learning we are given a dataset D_n = {(x_i, y_i)}_{i=1}^n of n pairs of input features and task outputs. In this case, X and Y are the random variables of the input features and the task outputs. We assume x_i and y_i are sampled i.i.d. from the true distribution p_X,Y(x, y) = p_Y|X(y|x)p_X(x). The usual aim of supervised learning is to use the dataset D_n to learn a particular conditional distribution q_Ŷ|X(ŷ|x; θ) of the task outputs given the input features, parametrized by θ, which is a good approximation of p_Y|X(y|x). We use Ŷ and ŷ to indicate the predicted task output random variable and its outcome. We call a supervised learning task regression when Y is continuous-valued and classification when it is discrete.
Usually supervised learning methods employ intermediate representations of the inputs before making predictions about the outputs; e.g., hidden layers in neural networks (Chapter 5 from Bishop (2006)) or transformations in a feature space through the kernel trick in kernel machines like SVMs or RVMs (Sections 7.1 and 7.2 from Bishop (2006)). Let T be a possibly stochastic function of the input features X with a parametrized conditional distribution q_T|X(t|x; θ); then T obeys the Markov condition Y ↔ X ↔ T. The mapping from the representation to the predicted task outputs is defined by the parametrized conditional distribution q_Ŷ|T(ŷ|t; θ). Therefore, in representation-based machine learning methods the full Markov chain is Y ↔ X ↔ T ↔ Ŷ. Hence, the overall estimation of the conditional probability p_Y|X(y|x) is given by the marginalization over the representations,

q_Ŷ|X(ŷ|x; θ) = ∫ q_Ŷ|T(ŷ|t; θ) q_T|X(t|x; θ) dt.

In order to achieve the goal of having a good estimation of the conditional probability distribution p_Y|X(y|x), we usually define an instantaneous cost function j_θ(x, y) : X × Y → R. This serves as a heuristic to measure the loss our algorithm (parametrized by θ) obtains when trying to predict the realization of the task output y with the input realization x.
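The marginalization over the representation is easy to sketch for discrete T. In the following Python fragment (ours; the encoder and decoder tables are hypothetical), the model distribution q_Ŷ|X is obtained by summing the decoder over the representation:

```python
import numpy as np

# Hypothetical discrete model: encoder q(t|x) (rows indexed by x) and
# decoder q(y_hat|t) (rows indexed by t); both are row-stochastic.
q_t_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
q_y_given_t = np.array([[0.95, 0.05],
                        [0.10, 0.90]])

# q(y_hat|x) = sum_t q(y_hat|t) q(t|x): marginalize the representation out.
q_y_given_x = q_t_given_x @ q_y_given_t

print(q_y_given_x[0])           # distribution over y_hat for x = 0
print(q_y_given_x.sum(axis=1))  # → [1. 1.] (still a valid conditional)
```

For a continuous T the sum becomes the integral above, typically approximated by sampling t ∼ q_T|X.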
Clearly, we are interested in minimizing the expectation of the instantaneous cost function over all the possible input features and task outputs, which we call the cost function. However, since we only have a finite dataset D_n, we instead have to minimize the empirical cost function.
Definition 5 (Cost function and empirical cost function). Let X and Y be the input features and task output random variables and x ∈ X and y ∈ Y their realizations. Let also j_θ(x, y) be the instantaneous cost function, θ the parametrization of our learning algorithm, and D_n = {(x_i, y_i)}_{i=1}^n the given dataset. Then we define:

1. The cost function: J(θ) = E_{(x,y) ∼ p_X,Y}[j_θ(x, y)].
2. The empirical cost function: Ĵ(θ; D_n) = (1/n) Σ_{i=1}^n j_θ(x_i, y_i).

The discrepancy between the cost and empirical cost functions is called the generalization gap or generalization error (see Section 1 of Xu, Raginsky (2017), for instance) and, intuitively, the smaller this gap is, the better our model generalizes; i.e., the better it will perform on new, unseen samples in terms of our cost function.
Definition 6 (Generalization gap). Let J(θ) and Ĵ(θ; D_n) be the cost and the empirical cost functions as defined in Definition 5. Then, the generalization gap is defined as

gen(θ; D_n) = J(θ) − Ĵ(θ; D_n),

and it represents the error incurred when the selected distribution is the one parametrized by θ and Ĵ(θ; D_n) is used instead of J(θ) as the function to minimize.
Ideally, we would want to minimize the cost function. Hence, we usually try to minimize the empirical cost function and the generalization gap simultaneously. The modifications to our learning algorithm which intend to reduce the generalization gap without hurting the performance on the empirical cost function are known as regularization.

Why do we use the IB?
Definition 7 (Representation cross-entropy cost function). Let X and Y be two statistically dependent variables with joint distribution p_X,Y(x, y) = p_Y|X(y|x)p_X(x). Let also T be a random variable obeying the Markov condition Y ↔ X ↔ T, and q_T|X(t|x; θ) and q_Ŷ|T(ŷ|t; θ) be the encoding and decoding distributions of our model, parametrized by θ. Finally, let C(p_Z||q_Z) = −E_{z ∼ p_Z}[log(q_Z(z))] be the cross entropy between two probability distributions p_Z and q_Z. Then, the cross-entropy cost function is

J_CE(θ) = E_{(x,y) ∼ p_X,Y}[j_CE,θ(x, y)],

where j_CE,θ(x, y) = C(q_T|X(t|x; θ)||q_Ŷ|T(ŷ|t; θ)) is the instantaneous representation cross-entropy cost function.

The cross-entropy is a widely used cost function in classification tasks (e.g., Krizhevsky et al. (2012); Shore, Gray (1982); Teahan (2000)) which has many interesting properties (Shore, Johnson, 1981). Moreover, it is known that minimizing J_CE(θ) maximizes the mutual information I(T; Y). That is,

Proposition 1 (Minimizing the cross entropy maximizes the mutual information). Let J_CE(θ) be the representation cross-entropy cost function as defined in Definition 7. Let also I(T; Y) be the mutual information between random variables T and Y in the setting from Definition 7. Then, minimizing J_CE implies maximizing I(T; Y).
The proof of this proposition can be found in Appendix A.
Definition 8 (Nuisance). A nuisance is any random variable which affects the observed data X but is not informative to the task we are trying to solve. That is, a random variable Ξ such that Ξ ⊥ Y (i.e., I(Ξ; Y) = 0) while X is statistically dependent on Ξ.

Similarly, we know that minimizing I(X; T) minimizes the generalization gap for restricted classes when using the cross-entropy cost function (Theorem 1 of Vera et al. (2018)), and when using I(T; Y) directly as an objective to maximize (Theorem 4 of Shamir et al. (2010)). Furthermore, Achille, Soatto (2018), in Proposition 3.1, upper bound the information between the representations T and the nuisances Ξ that affect the observed data with I(X; T). Therefore, minimizing I(X; T) helps generalization by not keeping useless information about Ξ in our representations.
Thus, jointly maximizing I(T; Y) and minimizing I(X; T) is a good choice both in terms of performance on the available dataset and on new, unseen data, which motivates studies on the IB.
3 The Information Bottleneck in deterministic scenarios

Kolchinsky et al. (2019a) showed that when Y is a deterministic function of X, i.e., Y = f(X), the IB curve is piecewise linear. More precisely, it is shaped as stated in Proposition 2.
Proposition 2 (The IB curve is piecewise linear in deterministic scenarios). Let X be a random variable and Y = f(X) be a deterministic function of X. Let also T be the bottleneck variable that solves the IB functional. Then the IB curve in the information plane is defined by the following equation:

I(T; Y) = I(X; T) if I(X; T) ∈ [0, I(X; Y)), and I(T; Y) = I(X; Y) if I(X; T) ≥ I(X; Y).

Furthermore, they showed that the IB curve could not be explored by optimizing the IB Lagrangian for multiple β because the curve is not strictly concave. That is, there is no one-to-one relationship between β and the performance level.
Theorem 1 (In deterministic scenarios, the IB curve cannot be explored using the IB Lagrangian). Let X be a random variable and Y = f(X) be a deterministic function of X. Let also ∆ be the set of random variables T obeying the Markov condition Y ↔ X ↔ T. Then:

1. Any solution T ∈ ∆ such that I(X; T) ∈ [0, I(X; Y)) and I(T; Y) = I(X; T) solves arg max_{T ∈ ∆} {L^β_IB(T)} for β = 1. That is, many different compression and performance levels can be achieved for β = 1.
2. Any solution T ∈ ∆ such that I(X; T) > I(X; Y) and I(T; Y) = I(X; Y) solves arg sup_{T ∈ ∆} {L^β_IB(T)} for β = 0. That is, many compression levels can be achieved with the same performance for β = 0.
An alternative proof for this theorem can be found in Appendix B.
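Proposition 2 and Theorem 1 can be visualized with a one-line function: in the deterministic case the curve is simply the minimum of r and I(X; Y). A small sketch (ours), using a balanced 10-class problem as the illustrative value of I(X; Y):

```python
import numpy as np

def f_ib_deterministic(r, i_xy):
    """IB curve when Y = f(X): I(T; Y) = I(X; T) until it saturates at I(X; Y)."""
    return np.minimum(r, i_xy)

i_xy = np.log2(10)  # e.g. a balanced 10-class problem: I(X; Y) = H(Y) bits
r = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
print(f_ib_deterministic(r, i_xy))
# The slope is 1 below I(X; Y) (the beta = 1 region) and 0 above it (the
# beta = 0 region), which is why the plain IB Lagrangian cannot trade off
# along this curve.
```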
4 The Convex IB Lagrangian

Exploring the IB curve
Clearly, a situation like the one depicted in Theorem 1 is not desirable, since we cannot aim for different levels of compression or performance. For this reason, we generalize the effort of Kolchinsky et al. (2019a) and look for families of Lagrangians which are able to explore the IB curve. Inspired by the squared IB Lagrangian, L^{β_sq}_sq-IB(T) = I(T; Y) − β_sq I(X; T)², we look at the conditions a function of I(X; T) requires in order to be able to explore the IB curve. In this way, we realize that any monotonically increasing and strictly convex function will be able to do so, and we call the family of Lagrangians with these characteristics the convex IB Lagrangians, due to the nature of the introduced function.

Theorem 2 (Convex IB Lagrangians). Let ∆ be the set of r.v. T obeying the Markov condition Y ↔ X ↔ T. Then, if u is a monotonically increasing and strictly convex function, the IB curve can always be recovered by the solutions of arg max_{T ∈ ∆} {L^{β_u}_IB,u(T)}, with

L^{β_u}_IB,u(T) = I(T; Y) − β_u u(I(X; T)).

That is, for each point (I(X; T), I(T; Y)) s.t. dI(T; Y)/dI(X; T) > 0 there is a unique β_u for which maximizing L^{β_u}_IB,u(T) achieves this solution. Furthermore, β_u is strictly decreasing w.r.t. I(X; T). We call L^{β_u}_IB,u(T) the convex IB Lagrangian.
The proof of this theorem can be found in Appendix C. Furthermore, by exploiting the IB curve duality (Lemma 10 of Gilad-Bachrach et al. (2003)), we were able to derive other families of Lagrangians which allow for the exploration of the IB curve (Appendix G).

Remark 1. Clearly, we can see how if u is the identity function (i.e., u(I(X; T)) = I(X; T)) then we end up with the normal IB Lagrangian. However, since the identity function is not strictly convex, it cannot ensure the exploration of the IB curve.
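As a sketch (ours; function and variable names are illustrative), the family is straightforward to implement once estimates of I(T; Y) and I(X; T) are available; the squared IB Lagrangian is recovered with the power function and α = 1:

```python
import numpy as np

# u must be monotonically increasing and strictly convex (Theorem 2).
def u_pow(r, alpha=1.0):   # power family: u(r) = r^(1 + alpha); alpha = 1 -> squared IB
    return r ** (1.0 + alpha)

def u_exp(r, eta=1.0):     # exponential family: u(r) = exp(eta * r)
    return np.exp(eta * r)

def convex_ib_lagrangian(i_ty, i_xt, beta_u, u):
    """L^{beta_u}_{IB,u}(T) = I(T; Y) - beta_u * u(I(X; T))."""
    return i_ty - beta_u * u(i_xt)

# With I(T; Y) = I(X; T) = 1 bit, the squared IB gives 1 - 0.5 * 1^2 = 0.5.
print(convex_ib_lagrangian(1.0, 1.0, 0.5, u_pow))  # → 0.5
```

Replacing u_pow with the identity recovers the ordinary IB Lagrangian of Definition 4, which, as Remark 1 notes, is not strictly convex.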

Aiming for a specific compression level
Let B_u denote the domain of Lagrange multipliers β_u for which we can find solutions in the IB curve with the convex IB Lagrangian. Then, the convex IB Lagrangians do not only allow us to explore the IB curve with different β_u. They also allow us to identify the specific β_u that obtains a given point (I(X; T), I(T; Y)), provided we know the IB curve in the information plane. Conversely, the convex IB Lagrangian allows us to find the specific point (I(X; T), I(T; Y)) that is obtained by a given β_u.

Proposition 3 (Bijective mapping between IB curve point and convex IB Lagrange multiplier). Let the IB curve in the information plane be known; i.e., I(T; Y) = f_IB(I(X; T)) is known. Then there is a bijective mapping from Lagrange multipliers β_u ∈ B_u \ {0} from the convex IB Lagrangian to points in the IB curve (I(X; T), f_IB(I(X; T))). Furthermore, these mappings are

β_u(r) = f′_IB(r)/u′(r) and r(β_u) = (u′)^{-1}(f′_IB(r)/β_u) at r = I(X; T),

where u′ is the derivative of u and (u′)^{-1} is the inverse of u′.
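In the deterministic case, where f_IB has unit slope on the increasing part of the curve, the mapping reduces to β_u = 1/u′(r) and can be inverted in closed form. A sketch (ours) for the power function u(r) = r^(1+α):

```python
# Deterministic setting: f'_IB(r) = 1, so the mapping reduces to beta_u = 1/u'(r).
# For u(r) = r^(1 + alpha) we have u'(r) = (1 + alpha) * r^alpha, hence:
def beta_pow(r, alpha):
    return 1.0 / ((1.0 + alpha) * r ** alpha)

def r_from_beta_pow(beta, alpha):
    return ((1.0 + alpha) * beta) ** (-1.0 / alpha)

alpha, r = 1.0, 2.0
b = beta_pow(r, alpha)            # multiplier that targets compression level r
print(b)                          # → 0.25
print(r_from_beta_pow(b, alpha))  # round trip recovers r → 2.0
```

This is the single optimization promised in the abstract: pick the desired r, compute β_u, and optimize once.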

This is especially interesting since in deterministic scenarios we know the shape of the IB curve (Proposition 2) and since the convex IB Lagrangians allow for the exploration of the IB curve (Theorem 2). A proof for Proposition 3 can be found in Appendix D. A direct result derived from this proposition is that we know the domain of Lagrange multipliers, B_u, which allow for the exploration of the IB curve if the shape of the IB curve is known. Furthermore, if the shape is not known, we can at least bound that range.
Corollary 1 (Domain of convex IB Lagrange multiplier with known IB curve shape). Let the IB curve in the information plane be I(T; Y) = f_IB(I(X; T)) and let I_max = I(X; Y). Let also I(X; T) = r_max be the minimum mutual information s.t. f_IB(r_max) = I_max; i.e., r_max = min{r : f_IB(r) = I_max}. Then, the range of Lagrange multipliers that allow the exploration of the IB curve with the convex IB Lagrangian is

B_u = [f′_IB(r_max)/u′(r_max), lim_{r→0⁺} f′_IB(r)/u′(r)],

where f′_IB(r) and u′(r) are the derivatives of f_IB(I(X; T)) and u(I(X; T)) w.r.t. I(X; T) evaluated at r, respectively.
Corollary 2 (Domain of convex IB Lagrange multiplier bound). The range of the Lagrange multipliers that allow the exploration of the IB curve is contained in [0, β_u,top], which is in turn contained in [0, β⁺_u,top], where

β_u,top = inf_{Ω_x ⊂ X} {β_0(Ω_x)} / u′(0) and β⁺_u,top = 1/u′(0).

Here u′(r) is the derivative of u(I(X; T)) w.r.t. I(X; T) evaluated at r, X is the set of possible realizations of X, and β_0 and Ω_x are defined as in Wu et al. (2019).

Corollaries 1 and 2 allow us to reduce the search range for β_u when we want to explore the IB curve. Practically, inf_{Ω_x ⊂ X} {β_0(Ω_x)} might be difficult to calculate, so Wu et al. (2019) derived an algorithm to approximate it. However, we still recommend the looser bound β⁺_u,top for simplicity. The proofs of both corollaries are found in Appendices E and F.

Experimental support
In order to showcase our claims, we use the MNIST dataset (LeCun et al., 1998). We simply modify the nonlinear-IB method (Kolchinsky et al., 2019b), which is a neural network that minimizes the cross-entropy while also minimizing a differentiable kernel-based estimate of I(X; T) (Kolchinsky, Tracey, 2017). Then we use this technique to maximize a lower bound on the convex IB Lagrangians by applying the functions u to the I(X; T) estimate.
For a fair comparison, we use the same network architecture as that in Kolchinsky et al. (2019b). First, a stochastic encoder T = f_θ,enc(X) + W with p_W(w) = N(w; 0, I_2) such that T ∈ R². Here f_θ,enc is a three-layer fully-connected encoder with 800 ReLU units in each of the first two layers and 2 linear units in the last layer. Second, a deterministic decoder q_Ŷ|T(ŷ|t; θ) = f_θ,dec(t). Here, f_θ,dec is a fully-connected layer with 800 ReLU units followed by an output layer with 10 softmax units. For further details about the experiment setup and additional results for different values of α and η, please refer to Appendix H.
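The kernel-based I(X; T) estimate can be sketched as follows. This is our simplified variant in the spirit of Kolchinsky, Tracey (2017) for an encoder with additive Gaussian noise of variance σ²; it is an illustration, not the exact estimator used in the nonlinear-IB code:

```python
import numpy as np

def mi_upper_bound_nats(mu, sigma):
    """Pairwise-distance bound on I(X; T) for T = mu(x) + N(0, sigma^2 I).

    mu: (n, d) array of encoder outputs for a minibatch.
    Uses I_hat = -(1/n) sum_i log( (1/n) sum_j exp(-KL_ij) ), where for
    equal-variance Gaussians KL_ij = ||mu_i - mu_j||^2 / (2 sigma^2)."""
    sq_dists = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    log_k = -sq_dists / (2.0 * sigma ** 2)
    inner = np.exp(log_k).mean(axis=1)  # in (0, 1]; the diagonal term equals 1
    return float(-np.log(inner).mean())

rng = np.random.default_rng(0)
mu = rng.normal(size=(100, 2))          # stand-in for f_enc outputs
print(mi_upper_bound_nats(mu, sigma=1.0))   # nonnegative estimate
print(mi_upper_bound_nats(mu, sigma=10.0))  # shrinks as the noise grows
```

Since u is increasing, applying u to such an upper bound on I(X; T) and subtracting β_u times the result yields the lower bound on the convex IB Lagrangian mentioned above.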
In Figure 1 we show our results for two particularizations of the convex IB Lagrangians:

1. the power IB Lagrangian: L^{β_pow}_IB,pow(T) = I(T; Y) − β_pow I(X; T)^{1+α}, with α > 0; and
2. the exponential IB Lagrangian: L^{β_exp}_IB,exp(T) = I(T; Y) − β_exp exp(ηI(X; T)), with η > 0.

We can clearly see how both Lagrangians are able to explore the IB curve (first column of Figure 1) and how the theoretical performance trend of the Lagrangians matches the experimental results (second and third columns of Figure 1). There are small mismatches between the theoretical and experimental performance. This is because using the nonlinear-IB, as stated by Kolchinsky et al. (2019a), does not guarantee that we find optimal representations, due to factors like: (i) inaccurate estimation of I(X; T), (ii) restrictions on the structure of T, (iii) use of an estimation of the decoder instead of the real one, and (iv) the typical non-convex optimization issues that arise with gradient-based methods. The main difference comes from the discontinuities in performance for increasing β, whose cause is still unknown (cf. Wu et al. (2019)). It has been observed, however, that the bottleneck variable performs an intrinsic clusterization in classification tasks (see, for instance, Kolchinsky et al. (2019b,a), Alemi et al. (2018), or Figure 2b). We realized how this clusterization matches the quantized performance levels observed (e.g., compare Figure 2a with the top center graph in Figure 1), with maximum performance when the number of clusters is equal to the cardinality of Y and reduced performance as the number of clusters decreases. We do not have a mathematical proof of the exact relationship between these two phenomena; however, we agree with Wu et al. (2019) that it is an interesting matter and hope this observation serves as motivation to derive new theory.
To sum up, in order to achieve a desired level of performance with the convex IB Lagrangian as an objective, one should:

1. In a deterministic or close-to-deterministic setting (see the -deterministic definition in Kolchinsky et al. (2019a)): use the adequate β_u for that performance using Proposition 3. Then, if the performance is lower than desired, i.e., we are placed in the wrong performance plateau, gradually reduce the value of β_u until reaching the previous performance plateau.
2. In a stochastic setting: draw the IB curve with multiple values of β_u in the range defined by Corollary 2 and select the representation that best fits one's interests.

In practice, there are different criteria for choosing the function u. For instance, the exponential IB Lagrangian could be more desirable than the power IB Lagrangian when we want to draw the IB curve, since it has a finite range of β_u: B_u = [(η exp(ηI_max))^{-1}, η^{-1}] for the exponential IB Lagrangian vs. B_u = [((1 + α)I_max^α)^{-1}, ∞) for the power IB Lagrangian. Furthermore, there is a trade-off between (i) how much the selected u function resembles a linear function in our region of interest (e.g., with α or η close to zero), since it will suffer from similar problems as the original IB Lagrangian; and (ii) how fast it grows in our region of interest (e.g., higher values of α or η), since it will suffer from value convergence; i.e., optimizing for separate values of β_u will achieve similar levels of performance (Figure 3). Please refer to Appendix I for a more thorough explanation of this phenomenon.
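These ranges are easy to compute. A sketch (ours) of both B_u endpoints in the deterministic case, where B_u = [1/u′(I_max), 1/u′(0)]; the resulting expressions match those in the text:

```python
import numpy as np

def b_u_exponential(eta, i_max):
    # u(r) = exp(eta r) -> u'(r) = eta exp(eta r): finite range of beta_u.
    return (1.0 / (eta * np.exp(eta * i_max)), 1.0 / eta)

def b_u_power(alpha, i_max):
    # u(r) = r^(1 + alpha) -> u'(0) = 0: the upper end of the range is infinite.
    return (1.0 / ((1.0 + alpha) * i_max ** alpha), np.inf)

i_max = np.log2(10)  # e.g. MNIST: I(X; Y) = H(Y) bits
print(b_u_exponential(1.0, i_max))  # finite interval
print(b_u_power(1.0, i_max))        # half-open interval up to infinity
```

The finite exponential range is what makes it convenient for sweeping β_u on a grid when drawing the curve.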

Conclusion
The information bottleneck is a widely used and studied technique. However, it is known that the IB Lagrangian cannot be used to achieve varying levels of performance in deterministic scenarios. Moreover, in order to achieve a particular level of performance, multiple optimizations with different Lagrange multipliers must be done to draw the IB curve and select the best traded-off representation.
In this article we introduced a general family of Lagrangians which allow one to (i) achieve varying levels of performance in any scenario, and (ii) pinpoint a specific Lagrange multiplier β_u to optimize for a specific performance level in known-IB-curve scenarios, e.g., deterministic ones. Furthermore, we showed the β_u domain when the IB curve is known and a β_u domain bound for exploring the IB curve when it is unknown. This way we can reduce and/or avoid multiple optimizations and, hence, reduce the computational effort for finding well-traded-off representations. Finally, (iii) we provided some insight into the discontinuities on the performance levels w.r.t. the Lagrange multipliers by connecting them with the intrinsic clusterization of the bottleneck variable.
(a) Since the IB curve is concave, we know β is non-increasing in I(X; T) ∈ R⁺. We also know β = 1 at the points in the IB curve where I(X; T) ≤ lim_{ε→0⁺} {I(X; Y) − ε} and β = 0 at the points in the IB curve where I(X; T) ≥ lim_{ε→0⁺} {I(X; Y) + ε}. Hence, if we achieve a solution with β ∈ (0, 1), this solution is I(X; T) = I(T; Y) = I(X; Y).

(b) We can upper bound the IB Lagrangian by

L^β_IB(T) = I(T; Y) − βI(X; T) ≤ (1 − β)I(T; Y) ≤ (1 − β)I(X; Y),

where the first and second inequalities use the DPI (Theorem 2.8.1 from Cover, Thomas (2012)).
Then, we can consider the point of the IB curve (I(X; Y), I(X; Y)). Since the function is concave, we know there exists a tangent line at (I(X; Y), I(X; Y)) such that all other points in the curve lie below this line. Let β be the slope of this line (which we know is the Lagrange multiplier, from Tishby et al. (2000)). As we see, by the upper bound on the IB Lagrangian from Equation (17), if the point (I(X; Y), I(X; Y)) exists, any β can be the slope of the tangent line to (I(X; Y), I(X; Y)) that ensures concavity.

C Proof of Theorem 2
Proof. We start the proof by recalling the optimization problem at hand (Definition 1). We can modify the optimization problem as follows. The equality from equation (21) comes from the fact that since I(X; T) ≤ r, then ∃ξ ≥ 0 s.t. u(I(X; T)) − u(r) + ξ = 0. Then, the inequality from equation (22) holds since we have expanded the optimization search space. Finally, in equation (23) we use that T* maximizes L^{β*_u}_IB,u(T) and that I(X; T*) ≤ r. Now, we can exploit that u(r) and ξ do not depend on T and drop them in the maximization in equation (22). We can then realize we are maximizing over L^{β*_u}_IB,u(T). Therefore, since I(T*; Y) satisfies both the maximization with T* ∈ ∆ and the constraint I(X; T*) ≤ r, maximizing L^{β*_u}_IB,u(T) obtains F_IB,max(r).

Now, we know if such β*_u exists, then the solution of the Lagrangian will be a solution for F_IB,max(r). Then, if we consider Theorem 6 from the Appendix of Courcoubetis (2003) and consider the maximization problem instead of the minimization problem, we know that if both I(T; Y) and −u(I(X; T)) are concave functions, then a set of Lagrange multipliers S*_u with these conditions exists. We can make this consideration because f is concave if −f is convex and max{f} = min{−f}. We know I(T; Y) is a concave function of T for T ∈ ∆ (Lemma 5 of Gilad-Bachrach et al. (2003)) and I(X; T) is convex w.r.t. T given p_X(x) is fixed (Theorem 2.7.4 of Cover, Thomas (2012)). Thus, if we want −u(I(X; T)) to be concave, we need u to be a convex function.
Finally, we will look at the conditions on u so that for every point (I(X; T), I(T; Y)) in the IB curve there exists a unique β*_u s.t. maximizing L^{β*_u}_IB,u(T) achieves this solution. For this purpose we will look at the solutions of the Lagrangian optimization. If we integrate both sides of equation (26) over all T ∈ ∆, we obtain

β_u = β / u′(I(X; T)),

where β is the Lagrange multiplier from the IB Lagrangian (Tishby et al., 2000) and u′(I(X; T)) is du(I(X; T))/dI(X; T). Also, if we want to avoid indeterminations of β_u, we need u′(I(X; T)) not to be 0. Since we already imposed u to be monotonically non-decreasing, we can solve this issue by strengthening this condition. That is, we will require u to be monotonically increasing.

We would like β_u to be continuous; this way there would be a unique β_u for each value of I(X; T). We know β is a non-increasing function of I(X; T) (Lemma 6 of Gilad-Bachrach et al. (2003)). Hence, if we want β_u to be a strictly decreasing function of I(X; T), we will require u′ to be a strictly increasing function of I(X; T). Therefore, we will require u to be a strictly convex function.

Thus, if u is a strictly convex and monotonically increasing function, for each point (I(X; T), I(T; Y)) in the IB curve s.t. dI(T; Y)/dI(X; T) > 0 there is a unique β_u for which maximizing L^{β_u}_IB,u(T) achieves this solution.

D Proof of Proposition 3
Proof. In Theorem 2 we showed how each point of the IB curve (I(X; T), I(T; Y)) can be found with a unique β_u maximizing L^{β_u}_IB,u. Therefore, since we also proved L^{β_u}_IB,u is strictly concave w.r.t. T, we can find the values of β_u that maximize the Lagrangian for fixed I(X; T).
First, we look at the solutions of the Lagrangian maximization. Then, as before, we can integrate both sides for all T ∈ ∆ and solve for β_u:

β_u = β / u′(I(X; T)).

Moreover, since u is a strictly convex function, its derivative u′ is strictly increasing. Hence, u′ is an invertible function (since a strictly monotonic function is bijective onto its image, and a function is invertible iff it is bijective). Now, if we consider β_u > 0 to be known and I(X; T) to be the unknown, we can solve for I(X; T) and get

I(X; T) = (u′)^{-1}(β / β_u).

If we use Proposition 3 on both Lagrangians in the classification setting (where the relevant part of the curve has β = 1), we obtain the bijective mapping between their Lagrange multipliers and a certain level of compression:

1. Power IB Lagrangian: β_pow = ((1 + α)I(X; T)^α)^{-1} and I(X; T) = ((1 + α)β_pow)^{-1/α}.
2. Exponential IB Lagrangian: β_exp = (η exp(ηI(X; T)))^{-1} and I(X; T) = − log(ηβ_exp)/η.
Hence, we can simply plot the curves of I(X; T) vs. β_u for different hyperparameters α and η (see Figure 8). In this way we can observe how increasing the growth of the function too much (e.g., increasing α or η) causes many different values of β_u to converge to very similar values of I(X; T). This is an issue both for drawing the curve (for obvious reasons) and for aiming for a specific performance level. Due to the nature of the estimation of the IB Lagrangian, the theoretical and practical values of β_u that yield a specific I(X; T) may vary slightly (see Figure 1). Then, if we select a function with too high growth, a small change in β_u can result in a big change in the performance obtained.

I.2 Aiming for strong convexity
Definition 10 (µ-Strong convexity).If a function f (r) is twice continuous differentiable and its domain is confined in the real line, then it is µ-strong convex if f (r) ≥ µ ≥ 0 ∀r.Experimentally, we observed when the growth of our function u(r) is small in the domain of interest r > 0 the convex IB Lagrangian does not perform well.Later we realized that this was closely related with the strength of the convexity of our function.
In Theorem 2 we imposed that the function u be strictly convex in order to ensure a unique β u for each value of I(X; T ). However, since in practice we do not compute the Lagrangian exactly but only an estimate of it (e.g., with the nonlinear IB (Kolchinsky et al., 2019b)), we require strong convexity in order to be able to explore the IB curve.
We now look at the second derivatives of the power and exponential functions: u''(r) = (1 + α)αr α−1 and u''(r) = η 2 exp(ηr), respectively. Here we see how both functions are only guaranteed to be 0-strongly convex for r > 0 and α, η > 0. Moreover, values of α < 1 and η < 1 can lead to a low µ-strong convexity in certain domains of r. The case α < 1 is particularly dangerous because the function approaches 0-strong convexity as r increases, so the power IB Lagrangian performs poorly when low values of α are used to target high performance.
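The vanishing curvature for α < 1 can be seen directly by evaluating the second derivatives above on a grid (a minimal sketch; the sample points and hyperparameter values are illustrative choices of ours):

```python
import math

def u2_pow(r, alpha):
    """Second derivative of the power function u(r) = r**(1 + alpha)."""
    return (1.0 + alpha) * alpha * r ** (alpha - 1.0)

def u2_exp(r, eta):
    """Second derivative of the exponential function u(r) = exp(eta * r)."""
    return eta ** 2 * math.exp(eta * r)

# For alpha < 1 the curvature of the power function decays towards 0 as
# r grows (approaching 0-strong convexity), while the curvature of the
# exponential function grows without bound.
for r in (1.0, 5.0, 25.0):
    print(f"r={r:5.1f}  power(alpha=0.5): {u2_pow(r, 0.5):.4f}  "
          f"exp(eta=1): {u2_exp(r, 1.0):.2f}")
```

In particular, u''(r) = 0.75 r^−0.5 for α = 0.5, which already explains why targeting high compression levels (large r) with small α degrades the exploration of the curve.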

Remark 2 .
The inclusion of the function u is what allows us to find the bijection between β u and I(X; T ). The previous definition from Tishby et al. (2000) of β as d(I(T ; Y ))/d(I(X; T )) did not.

Figure 1: The top row shows the results for the power IB Lagrangian with α = 1, and the bottom row for the exponential IB Lagrangian with η = 1. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) I(T ; Y ) as a function of β u ; and (iii) the compression I(X; T ) as a function of β u . In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, all plots indicate I(X; Y ) = H(Y ) = log 2 (10) with a dashed orange line. All values are shown in bits.

Figure 2: Depiction of the clusterization behavior of the bottleneck variable for the power IB Lagrangian with α = 1.

Figure 5: Results for the power IB Lagrangian with α = {0.5, 1, 2}, from top to bottom. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) I(T ; Y ) as a function of β u ; and (iii) the compression I(X; T ) as a function of β u . In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, all plots indicate I(X; Y ) = H(Y ) = log 2 (10) with a dashed orange line. All values are shown in bits.

Figure 6: Results for the exponential IB Lagrangian with η = {log(2), 1, 1.5}, from top to bottom. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) I(T ; Y ) as a function of β u ; and (iii) the compression I(X; T ) as a function of β u . In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, all plots indicate I(X; Y ) = H(Y ) = log 2 (10) with a dashed orange line. All values are shown in bits.

Figure 7: Depiction of the clusterization behavior of the bottleneck variable. In the first row, from left to right, the power IB Lagrangian with different values of α = {0.5, 1, 2}. In the second row, from left to right, the exponential IB Lagrangian with different values of η = {log(2), 1, 1.5}.

Figure 8: Curves of I(X; T ) as a function of β u for different values of the hyperparameters α and η.