Bottleneck Problems: An Information and Estimation-Theoretic View

Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), and strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound on the auxiliary variable in IB, making its computation more tractable for discrete random variables. In the second part, we introduce a general family of optimization problems, termed “bottleneck problems”, by replacing mutual information in IB and PF with other notions of mutual information, namely f-information and Arimoto’s mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. While the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelope of certain functions. By applying this technique to a binary case, we derive closed form expressions for several bottleneck problems.


I. INTRODUCTION
Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [2]. Suppose Y is a target variable and X is an observable correlated variable with joint distribution P_XY. The goal of IB is to learn a "compact" summary (aka bottleneck) T of X that is maximally "informative" for inferring Y. The bottleneck variable T is assumed to be generated from X by applying a random function F to X, i.e., T = F(X), in such a way that it is conditionally independent of Y given X, which we denote by Y −− X −− T. (1) The IB quantifies this goal by measuring the "compactness" of T using the mutual information I(X;T) and, similarly, its "informativeness" using I(Y;T). For a given level of compactness R ≥ 0, IB extracts the bottleneck variable T that solves the constrained optimization problem IB(R) := sup I(Y;T) subject to I(X;T) ≤ R, (2) where the supremum is taken over all randomized functions T = F(X) satisfying Y −− X −− T.
The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970s (see [3]–[6]) as a technique for proving impossibility results in information theory and for studying the common information between X and Y. Wyner and Ziv [3] explicitly determined the value of IB(R) for the special case of binary X and Y, a result widely known as Mrs. Gerber's Lemma [3], [7]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [2] and re-formulated in a data-analytic context. Here, the random variable X represents a high-dimensional observation with a corresponding low-dimensional feature Y. IB aims at specifying a compressed description of X that is maximally informative about the feature Y. This framework has led to several applications in clustering [8]–[10] and quantization [11], [12].
A closely related framework to IB is the privacy funnel (PF) problem [13]–[15]. In the PF framework, a bottleneck variable T is sought that maximally preserves the "information" contained in X while revealing as little about Y as possible. This framework aims to capture the inherent trade-off between revealing X perfectly and leaking a sensitive attribute Y. For instance, suppose a user wishes to share an image X for some classification tasks. The image might carry information about attributes, say Y, that the user considers sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation of X from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to Y. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as PF(r) := inf I(Y;T) subject to I(X;T) ≥ r, (3) where the infimum is taken over all randomized functions T = F(X) and r is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [13], [16] for estimating PF based on the agglomerative information bottleneck algorithm [10]. A similar formulation has recently been proposed in [17] as a tool to train a neural network for learning a private representation of data X. Solving the IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).
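For discrete alphabets, both objectives can be evaluated directly from P_XY and a candidate channel P_{T|X}. The following sketch (toy distributions of our own choosing, using numpy) computes I(X;T) and I(Y;T) under the Markov chain Y −− X −− T:

```python
import numpy as np

def mutual_information(P_joint):
    """I(A;B) in nats for a joint pmf given as a 2-D array."""
    Pa = P_joint.sum(axis=1, keepdims=True)
    Pb = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return float((P_joint[mask] * np.log(P_joint[mask] / (Pa @ Pb)[mask])).sum())

def ib_pf_objectives(P_XY, P_T_given_X):
    """Return (I(X;T), I(Y;T)) for T generated from X alone (Y -- X -- T)."""
    P_X = P_XY.sum(axis=1)                    # marginal of X
    P_XT = P_X[:, None] * P_T_given_X         # joint of (X, T)
    P_Y_given_X = P_XY / P_X[:, None]
    P_YT = P_Y_given_X.T @ P_XT               # P(y,t) = sum_x P(x) P(y|x) P(t|x)
    return mutual_information(P_XT), mutual_information(P_YT)

# toy example: binary X, Y with a BSC-like coupling, T a noisy copy of X
P_XY = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
P_T_given_X = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
IXT, IYT = ib_pf_objectives(P_XY, P_T_given_X)
assert IYT <= IXT + 1e-12   # data processing: I(Y;T) <= I(X;T)
```

Sweeping P_T_given_X over a family of channels traces out achievable pairs (I(X;T), I(Y;T)); IB and PF pick out the extremes of I(Y;T) at a given level of I(X;T).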
In this paper, we first give a cohesive overview of the existing results surrounding the IB and PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source coding [18], testing against independence [19], and dependence dilution [20]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB, leading to a more tractable optimization problem for IB. We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto's mutual information [21] and f-information [22]. By replacing I(Y;T) and/or I(X;T) with either of these measures, we generate a family of optimization problems that we refer to as the bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF. More specifically, our main contributions are listed next.
• Computing IB and PF is notoriously challenging when X takes values in a set with infinite cardinality (e.g., when X is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that X is a Gaussian perturbation of Y, i.e., X = Y + N^G, where N^G is a noise variable sampled from a Gaussian distribution independent of Y. Building upon recent advances in the entropy power inequality [23], we derive a sharp upper bound for IB(R). As a special case, we consider jointly Gaussian (X, Y), for which the upper bound becomes tight. This provides a significantly simpler proof than the original one in [24] for the fact that in this special case the optimal bottleneck variable T is also Gaussian. In the second scenario, we assume that Y is a Gaussian perturbation of X, i.e., Y = X + N^G. This corresponds to a practical setup where the feature Y might be perfectly obtained from a noisy observation of X. Relying on recent results on strong data processing inequalities [25], we obtain an upper bound on IB(R) which is tight for small values of R. In the last scenario, we compute a second-order approximation of PF(r) under the assumption that T is obtained by a Gaussian perturbation of X, i.e., T = X + N^G. Interestingly, the rate of increase of PF(r) for small values of r is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [26].
• We extend Witsenhausen and Wyner's approach [4] for analytically computing IB and PF. This technique converts solving the optimization problems in IB and PF into determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary X and Y and derive a closed-form expression for PF(r); we call this result Mr. Gerber's Lemma.
• Relying on the connection between IB and noisy source coding [18] (see [27], [28]), we prove a new cardinality bound for the optimal bottleneck variable T in optimization problem (2). Moreover, we show that this does not hold for PF, indicating a fundamental difference between optimization problems (2) and (3).
• Following [15], [29], we study the deterministic IB and PF (denoted by dIB and dPF), in which T is assumed to be a deterministic function of X, i.e., T = f(X) for some function f. By connecting dIB and dPF with entropy-constrained scalar quantization problems in information theory [30], we obtain bounds on them explicitly in terms of |X|. Applying these bounds to IB, we obtain that IB(R)/I(X;Y) is bounded from above by one and from below by min{R/H(X), (e^R − 1)/|X|}.
• By replacing I(Y;T) and/or I(X;T) in (2) and (3) with Arimoto's mutual information or f-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF. The main reason is three-fold: First, as illustrated in Section II-C, mutual information in IB and PF is mainly justified when n ≫ 1 independent samples (X_1, Y_1), ..., (X_n, Y_n) of P_XY are considered. However, Arimoto's mutual information allows for an operational interpretation even in the single-shot regime (i.e., for n = 1). Second, I(Y;T) in IB and PF is meant to be a proxy for the efficiency of reconstructing Y given the observation T. However, this can be accurately formalized by the probability of correctly guessing Y given T (i.e., the Bayes risk) or by the minimum mean-squared error (MMSE) in estimating Y given T. While I(Y;T) bounds these two measures, we show that they are precisely characterized by Arimoto's mutual information and f-information, respectively. Finally, when P_XY is unknown, mutual information is notoriously difficult to estimate. Nevertheless, Arimoto's mutual information and f-information are easier to estimate: while mutual information can be estimated with an estimation error that scales as O(log n/√n) [31], Diaz et al. [32] showed that this estimation error for Arimoto's mutual information and f-information is O(1/√n). We also generalize our computation technique to analytically compute these bottleneck problems. As before, this technique converts computing bottleneck problems into determining convex and concave envelopes of certain functions. Focusing on binary X and Y, we derive closed-form expressions for some of the bottleneck problems.

A. Related Work
The IB formulation has been extensively applied in representation learning and clustering [9], [33]–[37]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of P_{Y|X}. When data points lie in a metric space, geometric clustering is usually preferred, where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [29], [38] proposed the deterministic IB (denoted by dIB) by enforcing that P_{T|X} be a deterministic mapping: dIB(R) denotes the supremum of I(Y; f(X)) over all functions f: X → T satisfying H(f(X)) ≤ R. This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function f: X → [M] := {1, ..., M} with a pre-determined output alphabet such that f optimizes some objective function. This objective might be maximizing or minimizing H(f(X)) [39] or maximizing I(Y; f(X)) for a random variable Y correlated with X [30], [40]–[42]. Since H(f(X)) ≤ log M for f: X → [M], the latter problem provides lower bounds for dIB (and thus for IB). In particular, one can exploit [43, Theorem 1] to obtain such a lower bound; this result establishes a linear gap between dIB and I(X;Y) irrespective of |X|.
The connection between quantization and dIB further allows us to obtain multiplicative bounds. For instance, if Y ∼ Bernoulli(1/2) and X = Y + N^G, where N^G ∼ N(0, 1) is independent of Y, then it is well known in the information theory literature that I(Y; f(X)) ≥ (2/π) I(X;Y) for all non-constant f: X → {0, 1} (see, e.g., [44, Section 2.11]); thus dIB(R) ≥ (2/π) I(X;Y) for R ≤ 1. We further explore this connection to provide multiplicative bounds on dIB(R) in Section II-E.
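This multiplicative bound can be probed numerically. The sketch below is our own check (not from the text): it computes I(X;Y) by quadrature for Y ∼ Bernoulli(1/2) and X = Y + N^G, evaluates I(Y; f(X)) for the symmetric one-bit quantizer f(X) = 1{X > 1/2}, and verifies the 2/π factor (a small slack absorbs numerical error):

```python
import numpy as np
from math import erf, sqrt

def h_b(p):                                    # binary entropy in nats
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# Y ~ Bernoulli(1/2), X = Y + N with N ~ N(0, 1)
x = np.linspace(-9.0, 10.0, 400_001)
dx = x[1] - x[0]
phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
p_X = 0.5 * phi(x) + 0.5 * phi(x - 1.0)        # mixture density of X
h_X = -np.sum(p_X * np.log(p_X)) * dx          # differential entropy of X
I_XY = h_X - 0.5 * np.log(2 * np.pi * np.e)    # h(X) - h(X|Y), with h(X|Y) = h(N)

# one-bit quantizer f(X) = 1{X > 1/2} turns the channel into a BSC
eps = 0.5 * (1 - erf(0.5 / sqrt(2)))           # crossover probability P(N > 1/2)
I_Yf = np.log(2) - h_b(eps)

# multiplicative bound stated in the text, with slack for quadrature error
assert I_Yf / I_XY > 2 / np.pi - 1e-3
```

The ratio sits only slightly above 2/π here, consistent with the bound being tightest in the low-SNR regime.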
The study of IB has recently gained increasing traction in the context of deep learning. By taking T to be the activity of the hidden layer(s), Tishby and Zaslavsky [45] (see also [46]) argued that neural network classifiers trained with the cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the IB optimization problem. In fact, it is claimed that the graph of the function R ↦ IB(R) (the so-called information plane) characterizes the learning dynamics of the different layers in the network: shallow layers correspond to maximizing I(Y;T), while the objective of deep layers is minimizing I(X;T). While the generality of this claim was refuted empirically in [47] and theoretically in [48], [49], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the IB optimization problem [50]–[54]; (ii) creating connections between IB and generalization error [55], robustness [50], and detection of out-of-distribution data [56]; and (iii) using IB to understand specific characteristics of neural networks [54], [57]–[59].
In both IB and PF, mutual information poses some limitations. For instance, it may become infinite in deterministic neural networks [47]–[49] and may not lead to a proper privacy guarantee [60]. As suggested in [54], [61], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantees have been proposed, including Rényi maximal correlation [20], [62], [63], probability of correctly recovering [64], [65], minimum mean-squared estimation error (MMSE) [66], [67], χ²-information [68] (a special case of f-information, to be described in Section III), Arimoto's and Sibson's mutual information [60], [69] (to be discussed in Section III), maximal leakage [70], and local differential privacy [71]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [66], [67] that if the χ²-information between Y and T is sufficiently small, then no function of Y can be efficiently reconstructed given T, thus providing an interpretable privacy guarantee.
Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from n samples with an estimation error that scales as O(log n/√n) [31]. However, as shown by Diaz et al. [32], the estimation error for most of the above measures scales as O(1/√n). Furthermore, the recently popular variational estimators of mutual information, typically implemented via deep learning methods [73]–[75], present some fundamental limitations [76]: the variance of the estimator might grow exponentially with the ground-truth mutual information, and the estimator might not satisfy basic properties of mutual information such as the data processing inequality or additivity. McAllester and Stratos [77] showed that some of these limitations are inherent to a large family of mutual information estimators.

B. Notation
We use capital letters, e.g., X, for random variables and calligraphic letters for their alphabets, e.g., X. If X is distributed according to a probability mass function (pmf) P_X, we write X ∼ P_X. Given two random variables X and Y, we write P_XY for their joint distribution and P_{Y|X} for the conditional distribution of Y given X. We also interchangeably refer to P_{Y|X} as a channel from X to Y. We use H(X) to denote both the entropy and the differential entropy of X, i.e., H(X) = −∑_{x∈X} P_X(x) log P_X(x) if X is a discrete random variable taking values in X with pmf P_X, and H(X) = −∫ f_X(x) log f_X(x) dx if X is an absolutely continuous random variable with probability density function (pdf) f_X. If X is a binary random variable with P_X(1) = p, we write X ∼ Bernoulli(p). In this case, its entropy is called the binary entropy function and is denoted by h_b(p) := −p log p − (1−p) log(1−p). We use the superscript G to indicate a standard Gaussian random variable, i.e., N^G ∼ N(0, 1). Given two random variables X and Y, their (Shannon) mutual information is denoted by I(X;Y) := H(Y) − H(Y|X). We let P(X) denote the set of all probability distributions on the set X. Given an arbitrary Q_X ∈ P(X) and a channel P_{Y|X}, we let Q_X P_{Y|X} denote the resulting output distribution on Y. For any a ∈ [0, 1], we use ā to denote 1 − a, and for any integer k ∈ N, [k] := {1, 2, ..., k}.
Throughout the paper, we assume a pair of (discrete or continuous) random variables (X, Y) ∼ P_XY is given with a fixed joint distribution P_XY, marginals P_X and P_Y, and conditional distribution P_{Y|X}. We then use Q_X ∈ P(X) to denote an arbitrary distribution on X.

II. INFORMATION BOTTLENECK AND PRIVACY FUNNEL: DEFINITIONS AND FUNCTIONAL PROPERTIES
In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex-analytic approach which enables us to compute closed-form expressions for both functionals in some simple cases.
To precisely quantify the trade-off between these two conflicting goals, the IB optimization problem (2) was proposed [2]. Since any randomized function T = F(X) can be equivalently characterized by a conditional distribution, (2) can instead be expressed as IB(P_XY, R) := sup_{P_{T|X}: Y −− X −− T, I(X;T) ≤ R} I(Y;T) (4) and, dually, ĨB(P_XY, R̃) := inf_{P_{T|X}: Y −− X −− T, I(Y;T) ≥ R̃} I(X;T), where R and R̃ denote the levels of desired compression and informativeness, respectively. We use IB(R) and ĨB(R̃) to denote IB(P_XY, R) and ĨB(P_XY, R̃), respectively, when the joint distribution is clear from the context. Notice that if IB(P_XY, R) = R̃, then ĨB(P_XY, R̃) = R. Now consider the setup where the data X is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by Y. This goal was formulated by PF in (3). As before, replacing the randomized function T = F(X) with a conditional distribution P_{T|X}, we can equivalently express (3) as PF(P_XY, r) := inf_{P_{T|X}: Y −− X −− T, I(X;T) ≥ r} I(Y;T) (5) and, dually, P̃F(P_XY, r̃) := sup_{P_{T|X}: Y −− X −− T, I(Y;T) ≤ r̃} I(X;T), where r and r̃ denote the levels of desired informativeness and privacy, respectively. The case r̃ = 0 is particularly interesting in practice and specifies perfect privacy; see, e.g., [14], [78]. As before, we write PF(r) and P̃F(r̃) for PF(P_XY, r) and P̃F(P_XY, r̃) when P_XY is clear from the context. The following properties of IB and PF follow directly from their definitions. The proof of this result (and of any other result in this section) is given in Appendix A.
• If P Y |X (y|x) > 0 for all x ∈ X and y ∈ Y, then both R → IB(R) and r → PF(r) are continuously differentiable over (0, H(X)).
According to this theorem, we can always restrict both R and r in (4) and (5), respectively, to [0, H(X)] as IB(R) = PF(r) = I(X; Y ) for all r, R ≥ H(X).
Define M = M(P_XY) ⊂ R² as the set of pairs M := {(I(X;T), I(Y;T)) : P_{T|X} with Y −− X −− T}. It can be directly verified that M is convex. According to this theorem, R ↦ IB(R) and r ↦ PF(r) correspond to the upper and lower boundaries of M, respectively. The convexity of M then implies the concavity and convexity of IB and PF, respectively. Fig. 1 illustrates the set M for the simple case of binary X and Y. While IB(0) = 0 and PF(0) = 0, their behavior in a neighborhood of zero might be completely different. As illustrated in Fig. 1, IB(R) > 0 for all R > 0, whereas PF(r) = 0 for r ∈ [0, r_0] for some r_0 > 0. When such an r_0 > 0 exists, we say perfect privacy occurs: there exists a variable T satisfying Y −− X −− T such that I(Y;T) = 0 while I(X;T) > 0, making T a representation of X having perfect privacy (i.e., no information leakage about Y). A necessary and sufficient condition for the existence of such a T is given in [20, Lemma 10] and [14, Theorem 3], described next.
Theorem 2 (Perfect privacy). Let (X, Y) ∼ P_XY be given and let A ⊂ [0, 1]^{|Y|} be the set of vectors {P_{Y|X}(·|x), x ∈ X}. Then there exists r_0 > 0 such that PF(r) = 0 for r ∈ [0, r_0] if and only if the vectors in A are linearly dependent.
In light of this theorem, we obtain that perfect privacy occurs whenever |X| > |Y|. It also follows from the theorem that for binary X, perfect privacy cannot occur (see Fig. 1a).
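Theorem 2 reduces checking for perfect privacy to a linear-algebra test on the channel matrix. A minimal sketch (hypothetical channels, numpy):

```python
import numpy as np

def perfect_privacy_possible(P_Y_given_X):
    """Theorem 2: perfect privacy occurs iff the rows P_{Y|X}(.|x) are
    linearly dependent, i.e., the channel matrix has rank < |X|."""
    n_x = P_Y_given_X.shape[0]
    return np.linalg.matrix_rank(P_Y_given_X) < n_x

# |X| = 3 > |Y| = 2: the three rows live in R^2, so they must be
# linearly dependent and perfect privacy occurs
P1 = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])
assert perfect_privacy_possible(P1)

# binary X with linearly independent rows: no perfect privacy
P2 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
assert not perfect_privacy_possible(P2)
```

The first example illustrates the |X| > |Y| remark above; the second illustrates why perfect privacy cannot occur for binary X (unless X and Y are independent).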
Theorem 1 enables us to derive simple bounds for IB and PF. Specifically, the facts that PF(r)/r is non-decreasing and IB(R)/R is non-increasing immediately result in the following linear bounds.
Theorem 3 (Linear lower bound). For r, R ∈ (0, H(X)), we have PF(r)/r ≤ I(X;Y)/H(X) ≤ IB(R)/R. In light of this theorem, if PF(r) = r, then I(X;Y) = H(X), implying X = g(Y) for a deterministic function g. It can be verified (see [79, Problem 16.25]) that the double Markov conditions Y −− X −− T* and X −− Y −− T* are equivalent to the existence of a pair of functions f and g such that f(X) = g(Y) and (X, Y) −− f(X) −− T*. One special case of this setting, namely where g is an identity function, has recently been studied in detail in [52] and will be reviewed in Section II-E. Theorem 3 also enables us to characterize the "worst" joint distributions P_XY with respect to IB and PF. As demonstrated in the following lemma, if P_{Y|X} is an erasure channel, then PF(r)/r = IB(R)/R = I(X;Y)/H(X). Lemma 1.
• Let P_XY be such that Y = X ∪ {⊥}, P_{Y|X}(x|x) = 1 − δ, and P_{Y|X}(⊥|x) = δ for some δ > 0 (i.e., P_{Y|X} is an erasure channel).
• Let P_XY be such that X = Y ∪ {⊥}, P_{X|Y}(y|y) = 1 − δ, and P_{X|Y}(⊥|y) = δ for some δ > 0.
The bounds in Theorem 3 hold for all r and R in the interval [0, H(X)]. We can, however, improve them when r and R are sufficiently small. Let PF′(0) and IB′(0) denote the slopes of PF(·) and IB(·) at zero, i.e., PF′(0) := lim_{r→0⁺} PF(r)/r and IB′(0) := lim_{R→0⁺} IB(R)/R. This theorem provides the exact values of PF′(0) and IB′(0) as well as simple bounds for them. Although the exact expressions for PF′(0) and IB′(0) are usually difficult to compute, a simple plug-in estimator was proposed in [80] for IB′(0). This estimator can be readily adapted to estimate PF′(0). Theorem 4 reveals a profound connection between IB and the strong data processing inequality (SDPI) [81]. More precisely, thanks to the pioneering work of Anantharam et al. [82], it is known that the supremum of D_KL(Q_Y ∥ P_Y)/D_KL(Q_X ∥ P_X) over all Q_X ≠ P_X (with Q_Y = Q_X P_{Y|X}) equals the supremum of I(Y;T)/I(X;T) over all P_{T|X} satisfying Y −− X −− T; hence IB′(0) specifies the strengthening of the data processing inequality for mutual information. This connection may open a new avenue for theoretical results on IB, especially when X or Y is a continuous random variable. In particular, the recent non-multiplicative SDPI results [25], [83] seem insightful for this purpose.
In many practical cases, we might have n i.i.d. samples (X_1, Y_1), ..., (X_n, Y_n) of (X, Y) ∼ P_XY. We now study how IB behaves in n. Let X^n := (X_1, ..., X_n) and Y^n := (Y_1, ..., Y_n). Due to the i.i.d. assumption, we have P_{X^n Y^n}(x^n, y^n) = ∏_{i=1}^n P_XY(x_i, y_i). This can also be described by independently feeding each X_i, i ∈ [n], to the channel P_{Y|X}, producing Y_i. The following theorem, first demonstrated in [4, Theorem 2.4], gives a formula for IB in terms of n.
Theorem 5 (Additivity). We have IB(P_{X^n Y^n}, nR) = n · IB(P_XY, R). This theorem demonstrates that an optimal channel P_{T^n|X^n} for i.i.d. samples (X^n, Y^n) ∼ P_XY is obtained by the Kronecker product of an optimal channel P_{T|X} for (X, Y) ∼ P_XY. This, however, may not hold in general for PF; that is, we might have PF(P_{X^n Y^n}, nr) < n · PF(P_XY, r); see [14, Proposition 1] for an example.

A. Gaussian IB and PF
In this section, we turn our attention to a special, yet important, case where X = Y + σN G , where σ > 0 and N G ∼ N (0, 1) is independent of Y .This setting subsumes the popular case of jointly Gaussian (X, Y ) whose information bottleneck functional was computed in [84] for the vector case (i.e., (X, Y ) are jointly Gaussian random vectors).
It is worth noting that this result was concurrently proved in [85]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [23, Theorem 2], which holds even if X_i, Y_i, and N_i are random vectors (as opposed to scalars). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on P_{T|X}. This upper bound provides a significantly simpler proof of the well-known fact that for jointly Gaussian (X, Y), the optimal channel P_{T|X} is Gaussian. This result was first proved in [24] and used in [84] to compute an expression of IB for the Gaussian case.
Corollary 1. If (X, Y) are jointly Gaussian with correlation coefficient ρ, then we have IB(R) = −(1/2) log(1 − ρ²(1 − e^{−2R})). (8) Moreover, the optimal channel P_{T|X} is given by T = X + σN^G with σ² = (e^{2R} − 1)^{−1}, where X is normalized to unit variance. In Lemma 2, we assumed that X is a Gaussian perturbation of Y. However, in some practical scenarios, we might instead have Y as a Gaussian perturbation of X. For instance, let X represent an image and Y a feature of the image that can be perfectly obtained from a noisy observation of X. The goal is then to compress the image at a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [25, Theorem 1], gives an upper bound (9) on IB in this case. Lemma 3. Let X^n be n i.i.d. copies of a random variable X satisfying E[X²] ≤ 1, and let Y_i be the result of passing X_i through a Gaussian channel, i.e., Y_i = X_i + σN_i^G. Note that Lemma 3 holds for any arbitrary X (provided that E[X²] ≤ 1), and hence (9) bounds the information bottleneck functional for a wide family of P_XY. However, the bound is loose in general for large values of R. For instance, if (X, Y) are jointly Gaussian (implying Y = X + σN^G for some σ > 0), then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Fig. 2.
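Taking the closed form of Corollary 1 at face value, i.e., IB(R) = −(1/2) log(1 − ρ²(1 − e^{−2R})) with optimal channel T = X + σN^G and σ² = (e^{2R} − 1)^{−1} (both are our reading of (8); treat them as assumptions of this sketch), the expression can be checked by direct computation in nats:

```python
import numpy as np

def ib_gaussian(R, rho):
    """Corollary 1 (as reconstructed): IB(R) for jointly Gaussian (X, Y)
    with correlation rho, in nats."""
    return -0.5 * np.log(1 - rho**2 * (1 - np.exp(-2 * R)))

# direct check: with unit-variance X and T = X + sigma*N,
# sigma^2 = 1/(e^{2R} - 1) gives I(X;T) = R and I(Y;T) = IB(R)
rho, R = 0.8, 0.5
sigma2 = 1.0 / (np.exp(2 * R) - 1)
I_XT = 0.5 * np.log(1 + 1 / sigma2)                 # Gaussian channel MI
I_YT = -0.5 * np.log(1 - rho**2 / (1 + sigma2))     # corr(Y,T)^2 = rho^2/(1+sigma^2)
assert abs(I_XT - R) < 1e-12
assert abs(I_YT - ib_gaussian(R, rho)) < 1e-12
# saturation: IB(R) -> I(X;Y) = -(1/2) log(1 - rho^2) as R grows
assert abs(ib_gaussian(50.0, rho) + 0.5 * np.log(1 - rho**2)) < 1e-9
```

The last assertion reflects the property, noted after Theorem 1, that the functionals saturate at I(X;Y).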
The privacy funnel functional is much less studied, even for the simple case of jointly Gaussian (X, Y). Solving the optimization in PF over P_{T|X} without any assumptions is a difficult challenge. A natural assumption to make is that T is a Gaussian perturbation of X, i.e., T_σ = X + σN^G, where σ > 0 and N^G ∼ N(0, 1) is independent of X; we denote the resulting functional by PF_G(r). This formulation is tractable and can be computed in closed form for jointly Gaussian (X, Y), as described in the following example.
Example 1. Let X and Y be jointly Gaussian with correlation coefficient ρ. First note that, since mutual information is invariant to scaling, we may assume without loss of generality that both X and Y are zero mean and unit variance, and hence we can write X = ρY + √(1−ρ²) M^G, where M^G ∼ N(0, 1) is independent of Y. For T_σ = X + σN^G we then have I(X; T_σ) = (1/2) log(1 + σ^{−2}) and I(Y; T_σ) = −(1/2) log(1 − ρ²/(1 + σ²)). (13) In order to ensure I(X; T_σ) ≥ r, we must have σ² ≤ (e^{2r} − 1)^{−1}. Plugging this choice of σ into (13), we obtain PF_G(r) = −(1/2) log(1 − ρ²(1 − e^{−2r})). (14) This example indicates that for jointly Gaussian (X, Y), we have PF_G(r) = 0 if and only if r = 0 (thus perfect privacy does not occur), and the constraint I(X; T_σ) = r is satisfied by a unique σ. These two properties in fact hold for all continuous variables X and Y with finite second moments, as demonstrated in Lemma 10 in Appendix A. We use these properties to derive a second-order approximation of PF_G(r) when r is sufficiently small. For the following theorem, we use var(U) to denote the variance of a random variable U and var(U|V) the conditional variance of U given V. We write σ²_X = var(X) for short. Theorem 6. For any pair of continuous random variables (X, Y) with finite second moments, we have, as r → 0, PF_G(r) = η(X, Y) r + o(r), where η(X, Y) := 1 − E[var(X|Y)]/σ²_X.
[Fig. 3: Second-order approximation of PF_G according to Theorem 6 for jointly Gaussian X and Y with correlation coefficient ρ = 0.8. For this particular case, the exact expression of PF_G is computed in (14).]
It is worth mentioning that the quantity η(X, Y) was first defined by Rényi [26] as an asymmetric measure of correlation between X and Y. In fact, it can be shown that η(X, Y) = sup_f ρ²(X, f(Y)), where the supremum is taken over all measurable functions f and ρ(·, ·) denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian X and Y with correlation coefficient ρ, for which PF_G was computed in Example 1. In this case, it can be easily verified that η(X, Y) = ρ². Hence, for jointly Gaussian (X, Y) with correlation coefficient ρ and unit variance, the approximation of Theorem 6 can be evaluated in closed form. In Fig. 3, we compare the approximation given in Theorem 6 with the exact expression (14) for this particular case.
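Taking the closed form (14) as reconstructed in Example 1 (an assumption of this sketch), the slope of PF_G at zero can be compared against Rényi's measure η(X, Y) = ρ² for the jointly Gaussian case:

```python
import numpy as np

def pf_gaussian(r, rho):
    """Example 1 (as reconstructed): PF_G(r) for jointly Gaussian (X, Y),
    attained by T_sigma = X + sigma*N with sigma^2 = 1/(e^{2r} - 1)."""
    return -0.5 * np.log(1 - rho**2 * (1 - np.exp(-2 * r)))

# the slope of PF_G at zero should approach eta(X, Y) = rho^2
rho = 0.8
for r in [1e-3, 1e-4, 1e-5]:
    slope = pf_gaussian(r, rho) / r
    assert abs(slope - rho**2) < r   # deviation is O(r), so it vanishes
```

This numerically confirms the first-order behavior PF_G(r) ≈ η(X, Y) · r discussed around Theorem 6.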

B. Evaluation of IB and PF
The constrained optimization problems in the definitions of IB and PF are usually challenging to solve numerically due to the non-linearity of the constraints. In practice, however, both IB and PF are often approximated by their corresponding Lagrangian optimizations L_IB(β) := sup_{P_{T|X}} I(Y;T) − βI(X;T) (15) and L_PF(β) := inf_{P_{T|X}} I(Y;T) − βI(X;T), (16) where β ∈ R⁺ is the Lagrange multiplier that controls the trade-off between compression and informativeness in IB, and between privacy and informativeness in PF. Notice that for the computation of L_IB we can assume, without loss of generality, that β ∈ [0, 1], since otherwise the maximizer of (15) is trivial. It is worth noting that L_IB(β) and L_PF(β) in fact correspond to lines of slope β supporting M from above and below, thereby providing a new representation of M.
Let (X̃, Ỹ) be a pair of random variables with X̃ ∼ Q_X and Ỹ ∼ Q_X P_{Y|X}, and define F_β(Q_X) := H(Ỹ) − βH(X̃). This function, in general, is neither convex nor concave in Q_X. For instance, F_0 is concave and F_1 is convex in Q_X. The lower convex envelope (resp. upper concave envelope) of F_β is defined as the largest convex (resp. smallest concave) function smaller (resp. larger) than F_β. Writing I(Y;T) − βI(X;T) = F_β(P_X) − E[F_β(P_{X|T}(·|T))], it follows from (15) and (16) that L_IB(β) equals F_β(P_X) minus the lower convex envelope of F_β evaluated at P_X, while L_PF(β) equals F_β(P_X) minus the upper concave envelope of F_β evaluated at P_X. In the following theorem, we show that β_IB and β_PF, the critical values of the multiplier β beyond which (15) and (16) become trivial, are given by the values of IB′(0) and PF′(0), respectively, given in Theorem 4. Similar formulae for β_IB and β_PF were given in [86].
Proposition 1. We have β_IB = IB′(0) and β_PF = PF′(0). Kim et al. [80] have recently proposed an efficient algorithm, involving a simple optimization problem, to estimate β_IB from samples of P_XY. This algorithm can be readily adapted for estimating β_PF. Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier β to β ∈ [0, β_IB] and β ≥ β_PF, respectively. Remark 1. As demonstrated by Kolchinsky et al. [52], the boundary points 0 and β_IB are required for the computation of L_IB(β). In fact, when Y is a deterministic function of X, only β = 0 and β = β_IB are required to compute the IB, and other values of β are vacuous. The same argument can also be used to justify the inclusion of β_PF in computing L_PF(β). Note also that the optimization becomes trivial for values of β outside these ranges.
Remark 2. Observe that the lower convex envelope of any function f can be obtained by taking the Legendre–Fenchel transformation (aka convex conjugate) twice. Hence, one can use existing linear-time algorithms for approximating the Legendre–Fenchel transformation (e.g., [87], [88]) to approximate the envelopes of F_β. Once L_IB(β) and L_PF(β) are computed, we can derive IB and PF via standard results in convex optimization (see [4, Section IV] for more details): IB(R) = inf_{β ≥ 0} [βR + L_IB(β)] and PF(r) = sup_{β ≥ 0} [βr + L_PF(β)]. Following the convex analysis approach outlined by Witsenhausen and Wyner [4], IB and PF can also be computed directly from L_IB(β) and L_PF(β) by observing the following. Suppose that, for some β, the lower convex envelope of F_β at P_X is obtained by a convex combination of points Q_i ∈ P(X) with weights λ_i ≥ 0 (with ∑_i λ_i = 1). Then all the information about the functional IB (resp. PF) is contained in the subset of the domain of F_β over which its lower convex (resp. upper concave) envelope differs from F_β(Q_X). We will revisit and generalize this approach later in Section III.
[Fig. 4: The mapping q ↦ F_β(q) = H(Ỹ) − βH(X̃), where X̃ ∼ Bernoulli(q) and Ỹ is the result of passing X̃ through BSC(0.1); see (26).]
We can now instantiate this for the binary symmetric case. Suppose X and Y are binary variables and P_{Y|X} is the binary symmetric channel with crossover probability δ, denoted by BSC(δ), i.e., P_{Y|X}(y|x) = (1−δ)1{y = x} + δ1{y ≠ x} for some δ ≥ 0. To describe the result in a compact fashion, we introduce the following notation: we let h_b: [0, 1] → [0, 1] denote the binary entropy function, i.e., h_b(p) = −p log p − (1−p) log(1−p). Since this function is strictly increasing on [0, 1/2], its inverse exists on that interval and is denoted by h_b^{-1}. Lemma 4 (Mr. and Mrs. Gerber's Lemma). For X ∼ Bernoulli(p) with p ≤ 1/2 and P_{Y|X} = BSC(δ) with δ ≥ 0, we have IB(R) = h_b(δ ★ p) − h_b(δ ★ h_b^{-1}(h_b(p) − R)) (24) for R ∈ [0, h_b(p)], where a ★ b := a b̄ + b ā, together with a closed-form expression (25) for PF(r) parameterized by some α ∈ [0, 1]. The result in (24) was proved by Wyner and Ziv [3] and is widely known as Mrs. Gerber's Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber's Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping F_β: [0, 1] → R given by F_β(q) = h_b(q ★ δ) − βh_b(q), (26) where q ★ δ = q δ̄ + δ q̄ is the parameter of the output distribution of BSC(δ) when the input distribution is Bernoulli(q) for q ∈ (0, 1). It can be verified that β_IB ≤ (1 − 2δ)². This function is depicted in Fig. 4 for different values of β.
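The envelope recipe can be carried out numerically in this binary case. The sketch below (grid sizes, tolerances, and the hull routine are our own choices) builds F_β(q) = h_b(q ★ δ) − βh_b(q) as in (26), takes its lower convex envelope with a monotone-chain lower hull, forms L_IB(β) = F_β(p) − env(p) at p = 1/2, recovers IB(R) through the conjugacy IB(R) = min_β [βR + L_IB(β)], and compares against our reading of Mrs. Gerber's Lemma (24):

```python
import numpy as np

def h_b(p):                                    # binary entropy in nats
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -p * np.log(p) - (1 - p) * np.log1p(-p)

def star(a, b):                                # a * b := a(1-b) + b(1-a)
    return a * (1 - b) + b * (1 - a)

def lower_convex_envelope(x, y):
    """Lower convex hull of the points (x_i, y_i), evaluated back on x."""
    hull = []
    for i in range(len(x)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            cross = (x[i2] - x[i1]) * (y[i] - y[i1]) - (y[i2] - y[i1]) * (x[i] - x[i1])
            if cross <= 0:                     # middle point on/above the chord: drop it
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

delta, p, R = 0.1, 0.5, 0.3                    # BSC(0.1), uniform X, rate R in nats
q = np.linspace(0.0, 1.0, 2001)                # grid contains p = 0.5 at index 1000
ib_dual = np.inf
for beta in np.linspace(0.0, 1.0, 201):
    F = h_b(star(q, delta)) - beta * h_b(q)    # F_beta(q), cf. (26)
    L = F[1000] - lower_convex_envelope(q, F)[1000]   # L_IB(beta) at P_X
    ib_dual = min(ib_dual, beta * R + L)

# Mrs. Gerber's Lemma: IB(R) = h_b(delta*p) - h_b(delta * h_b_inv(h_b(p) - R))
lo, hi = 0.0, 0.5                              # invert h_b on [0, 1/2] by bisection
for _ in range(80):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if h_b(mid) < h_b(p) - R else (lo, mid)
ib_mgl = h_b(star(delta, p)) - h_b(star(delta, (lo + hi) / 2))
assert abs(ib_dual - ib_mgl) < 1e-2
```

Here β is restricted to [0, 1] as in (15); the agreement (within grid tolerance) illustrates that the Lagrangian/envelope route and the closed form describe the same curve.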

C. Operational Meaning of IB and PF
In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both IB and PF. The operational interpretation of IB has recently been extensively studied in information-theoretic settings in [27], [28]. In particular, it was shown that IB specifies the rate-distortion region of the noisy source coding problem [18], [89] under logarithmic loss as the distortion measure, and also the rate region of lossless source coding with side information at the decoder [90]. Here, we state the former setting (as it will be useful for our subsequent analysis of the cardinality bound) and also provide a new information-theoretic setting in which IB appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by PF. This in fact delineates an important difference between IB and PF: while IB describes the entire rate region of an information-theoretic setup, PF specifies only a corner point of a rate region. Other information-theoretic settings related to IB and PF include the CEO problem [91] and source coding for the Gray-Wyner network [92].
1) Noisy Source Coding: Suppose Alice has access only to a noisy version X of a source of interest Y. She wishes to transmit a rate-constrained description of her observation (i.e., X) to Bob such that he can recover Y with small average distortion. More precisely, let (X^n, Y^n) be n i.i.d. samples of (X, Y) ∼ P_XY. Alice encodes her observation X^n through an encoder φ : X^n → {1, ..., K_n} and sends φ(X^n) to Bob. Upon receiving φ(X^n), Bob reconstructs a "soft" estimate of Y^n via a decoder ψ : {1, ..., K_n} → Ŷ^n, where Ŷ = P(Y). That is, the reproduction sequence ŷ^n consists of n probability measures on Y. For any source and reproduction sequences y^n and ŷ^n, respectively, the distortion is defined via the logarithmic loss (27). We say that a rate-distortion pair (R, D) is achievable if there exists a pair (φ, ψ) of encoder and decoder satisfying the rate and distortion constraints (28). The noisy rate-distortion function R_noisy(D), for a given D ≥ 0, is defined as the minimum rate R such that (R, D) is an achievable rate-distortion pair. This problem arises naturally in many data-analytic problems; examples include feature selection in high-dimensional datasets, clustering, and matrix completion. The problem was first studied by Dobrushin and Tsybakov [18], who showed that R_noisy(D) is analogous to the classical rate-distortion function. It can be easily verified that E[d(Y, Ŷ)] = H(Y | Ŷ), and hence (after relabeling Ŷ as T) R_noisy(D) coincides with IB at R = H(Y) − D, which is equal to IB defined in (4). For more details on the connection between noisy source coding and IB, the reader is referred to [27], [28], [91], [93]. Notice that one can study an essentially identical problem where the average distortion constraint (28) is replaced by an excess-distortion constraint. This problem is addressed in [94] for discrete alphabets X and Y and was recently extended in [95] to general alphabets.
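The identity E[d(Y, Ŷ)] = H(Y|Ŷ) for logarithmic loss can be verified numerically; the joint distribution and channel below are arbitrary, and the suboptimal decoder Q is a hypothetical comparison point.

```python
import numpy as np

rng = np.random.default_rng(1)
P_XY = rng.random((4, 3)); P_XY /= P_XY.sum()                 # joint over (x, y)
P_TgX = rng.random((4, 2)); P_TgX /= P_TgX.sum(axis=1, keepdims=True)

P_TY = P_TgX.T @ P_XY                                          # joint over (t, y) under Y -- X -- T
P_T = P_TY.sum(axis=1)
P_YgT = P_TY / P_T[:, None]                                    # soft reconstruction: yhat_t = P_{Y|T=t}

def expected_log_loss(P_TY, Q_YgT):
    # E[log 1/Q(Y)] when the decoder outputs the pmf Q_YgT[t] upon observing t (bits).
    return float(-(P_TY * np.log2(np.clip(Q_YgT, 1e-15, 1.0))).sum())

loss_opt = expected_log_loss(P_TY, P_YgT)                      # equals H(Y|T)
Q = rng.random(P_YgT.shape); Q /= Q.sum(axis=1, keepdims=True)
loss_other = expected_log_loss(P_TY, Q)                        # cross-entropy >= entropy
```

The soft decoder P_{Y|T} attains the minimum expected log-loss, which equals H(Y|T); any other decoder pays the extra conditional KL divergence.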
2) Test Against Independence with Communication Constraint: As mentioned earlier, the connection between IB and noisy source coding, described above, was known and studied in [27], [28]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for IB. Given n i.i.d. samples (X_1, Y_1), ..., (X_n, Y_n) from a joint distribution Q, we wish to test whether X and Y are independent. This task is formulated by the hypothesis test H_0 : Q = P_XY against H_1 : Q = P_X × P_Y, for a given joint distribution P_XY with marginals P_X and P_Y. Ahlswede and Csiszár [19] investigated this problem under a communication constraint: while the Y observations (i.e., Y_1, ..., Y_n) are fully available, the X observations must be compressed at rate R; that is, instead of X^n, only φ(X^n) is available, where φ : X^n → {1, ..., K_n} satisfies the rate constraint. For a type I error probability not exceeding a fixed ε ∈ (0, 1), Ahlswede and Csiszár [19] derived the smallest possible type II error probability, denoted β_R(n, ε). The following gives the asymptotic expression of β_R(n, ε) for every ε ∈ (0, 1). For the proof, refer to [19, Theorem 3].

Theorem 7 ([19]). For every R ≥ 0 and ε ∈ (0, 1), we have lim_{n→∞} −(1/n) log β_R(n, ε) = IB(R).

In light of this theorem, IB(R) specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.
3) Dependence Dilution: Inspired by the problems of information amplification [96] and state masking [97], Asoodeh et al. [20] proposed the dependence dilution setup as follows. Consider a source sequence X^n of n i.i.d. copies of X ∼ P_X. Alice observes X^n and wishes to encode it via an encoder f_n of rate R, for some R > 0. The goal is to ensure that any user observing f_n(X^n) can construct a list, of fixed size, of sequences in X^n that contains likely candidates of the actual sequence X^n, while revealing negligible information about a correlated source Y^n. To formulate this goal, consider the decoder g_n : {1, ..., 2^{nR}} → 2^{X^n}, where 2^{X^n} denotes the power set of X^n. A dependence dilution triple (R, Γ, ∆) ∈ R³₊ is said to be achievable if, for any δ > 0, there exists a pair of encoder and decoder (f_n, g_n) such that for sufficiently large n the list has fixed size |g_n(J)| = 2^{n(H(X)−Γ)}, where J = f_n(X^n), the list contains X^n with high probability (32), and simultaneously the information leaked about Y^n is bounded by ∆. Notice that without the side information J, the decoder can only construct a list of size 2^{nH(X)} which contains X^n with probability close to one. However, after J is observed and the list g_n(J) is formed, the decoder's list size can be reduced to 2^{n(H(X)−Γ)}, thereby reducing the uncertainty about X^n by nΓ ∈ [0, nH(X)]. This observation can be formalized to show (see [96] for details) that the constraint (32) is equivalent to an amplification constraint on I(X^n; J). According to this theorem, PF(Γ) specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate Γ. While this informs the operational interpretation of PF, Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples (R, Γ, ∆). It remains unclear whether PF characterizes the full rate region of an information-theoretic setup.
The fact that IB fully characterizes the rate region of a source coding setup has an important consequence: the cardinality of the auxiliary random variable T in IB can be improved to |X| instead of |X| + 1.

D. Cardinality Bound
Recall that in the definition of IB in (4), no assumption was imposed on the auxiliary random variable T. A straightforward application of the Carathéodory-Fenchel-Eggleston theorem shows that |T| ≤ |X| + 1 suffices; the following result improves this bound to |T| ≤ |X|. The proof hinges on the operational characterization of IB as the lower boundary of the rate-distortion region of the noisy source coding problem discussed in Section II-C. Specifically, we first show that the extreme points of this region are achieved by T taking |X| values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such T. It must be mentioned that this result was already claimed, without proof, by Harremoës and Tishby in [99].
In many practical scenarios, the feature X has a large alphabet. Hence, the bound |T| ≤ |X|, albeit optimal, can still make the information bottleneck function computationally intractable over large alphabets. However, the label Y usually has a significantly smaller alphabet. While it is in general impossible to bound the cardinality of T in terms of |Y|, one can consider approximating IB assuming T takes N values. The following result, recently proved by Hirche and Winter [100], is in this spirit.
Recall that, unlike PF, the graph of IB characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section II-C), and hence any boundary point can be constructed via time-sharing of extreme points of the rate region. PF lacks such an operational characterization, and this translates into a worse cardinality bound than that of IB. In fact, for PF the cardinality bound |T| ≤ |X| + 1 cannot be improved in general. To demonstrate this, we numerically solve the optimization in PF assuming that |T| = |X| when both X and Y are binary. As illustrated in Fig. 5, this optimization does not lead to a convex function and hence cannot be equal to PF.
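A brute-force version of the numerical experiment behind Fig. 5 can be sketched for binary X and Y: sweep all binary channels P_{T|X} on a grid and trace the achievable pairs (I(X;T), I(Y;T)). The marginal, crossover probability, grid resolution, and target rate r below are illustrative choices, and a coarse grid only approximates the restricted optimum.

```python
import numpy as np

def H(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-15, 1.0)
    return float(-(p * np.log2(p)).sum())

def mi(P):
    # Mutual information (bits) of a 2-D joint pmf.
    return H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P.ravel())

p, delta = 0.4, 0.1
P_X = np.array([1 - p, p])
P_YgX = np.array([[1 - delta, delta], [delta, 1 - delta]])

pts = []
grid = np.linspace(0.0, 1.0, 101)
for a in grid:                       # P(T=1 | X=0)
    for b in grid:                   # P(T=1 | X=1)
        P_TgX = np.array([[1 - a, a], [1 - b, b]])
        P_XT = P_X[:, None] * P_TgX
        P_TY = P_XT.T @ P_YgX        # joint over (t, y) under Y -- X -- T
        pts.append((mi(P_XT), mi(P_TY)))
pts = np.array(pts)

r = 0.2
pf_approx = pts[pts[:, 0] >= r, 1].min()   # restricted-PF value at rate r with |T| = 2
```

Plotting pf_approx against r reproduces the non-convex curve of the |T| = |X| restriction described above.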

E. Deterministic Information Bottleneck
As mentioned earlier, IB formalizes an information-theoretic approach to clustering a high-dimensional feature X into cluster labels T that preserve as much information about the label Y as possible. The clustering label is obtained via the deterministic variant dIB(P_XY, R), where the maximization is taken over all deterministic functions f whose range is a finite set T. Similarly, one can define dPF(P_XY, r) := inf over deterministic f of the corresponding privacy objective. One way to ensure that H(f(X)) ≤ R for a deterministic function f is to restrict the cardinality of its range: if f : X → [e^R], then H(f(X)) is necessarily smaller than R. Using this insight, we derive a lower bound for dIB(P_XY, R) in the following lemma.
Lemma 5. For any given P_XY, we have:
Note that both R and r are smaller than H(X), and thus the multiplicative factors of I(X; Y) in the lemma are smaller than one. In light of this lemma, we obtain:
In most practical setups, |X| might be very large, making the above lower bound vacuous. In the following lemma, we partially address this issue by deriving a bound independent of |X| when Y is binary.
Lemma 6. Let P_XY be a joint distribution of arbitrary X and binary Y ∼ Bernoulli(q) for some q ∈ (0, 1). Then, for any R ≥ log 5, we have the bound below, where α = max{log(1/q), log(1/(1 − q))}.
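For small alphabets, dIB can be evaluated exactly by exhausting all deterministic maps f. The sketch below assumes |X| = 5, a binary range for f (so that H(f(X)) ≤ 1 ≤ R automatically), and a randomly drawn joint distribution; all choices are illustrative.

```python
import itertools
import numpy as np

def H(p):
    p = np.clip(np.asarray(p, dtype=float).ravel(), 1e-15, 1.0)
    return float(-(p * np.log2(p)).sum())

def mi(P):
    # Mutual information (bits) of a 2-D joint pmf.
    return H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P)

rng = np.random.default_rng(2)
P_XY = rng.random((5, 3)); P_XY /= P_XY.sum()

R = 1.0
dib = 0.0
for f in itertools.product(range(2), repeat=5):    # all deterministic f : X -> {0, 1}
    P_TY = np.zeros((2, 3))
    for x, t in enumerate(f):
        P_TY[t] += P_XY[x]                         # merge rows of P_XY according to f
    if H(P_TY.sum(axis=1)) <= R:                   # rate constraint H(f(X)) <= R
        dib = max(dib, mi(P_TY))
```

The search visits 2^5 maps; by the data processing inequality the result can never exceed I(X; Y), and with a binary range it is also capped at one bit.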

III. FAMILY OF BOTTLENECK PROBLEMS
In this section, we introduce a family of bottleneck problems by extending IB and PF to a large family of statistical measures. Similar to IB and PF, these bottleneck problems are defined in terms of the boundaries of a two-dimensional convex set induced by a joint distribution P_XY. Recall that R → IB(P_XY, R) and r → PF(P_XY, r) are the upper and lower boundaries of the set M defined in (6), reproduced here for convenience. Since P_XY is given, H(X) and H(Y) are fixed. Thus, in characterizing M it is sufficient to consider only H(X|T) and H(Y|T). To generalize IB and PF, we must therefore generalize H(X|T) and H(Y|T).
Given a joint distribution P_XY and two non-negative real-valued functions Φ : P(X) → R₊ and Ψ : P(Y) → R₊, we define the generalized conditional terms Φ(X|T) := Σ_t P_T(t) Φ(P_{X|T=t}) and Ψ(Y|T) := Σ_t P_T(t) Ψ(P_{Y|T=t}). When X ∼ P_X and Y ∼ P_Y, we interchangeably write Φ(X) for Φ(P_X) and Ψ(Y) for Ψ(P_Y).
These definitions provide natural generalizations of Shannon's entropy and mutual information. Moreover, as we discuss later in Sections III-B and III-C, they can be specialized to recover a large family of popular information-theoretic and statistical measures. Examples include information- and estimation-theoretic quantities such as Arimoto's conditional entropy of order α for Φ(Q_X) = ‖Q_X‖_α, the probability of correctly guessing for Φ(Q_X) = ‖Q_X‖_∞, maximal correlation in the binary case, and f-information for Φ(Q_X) given by an f-divergence. We are thus able to generate a family of bottleneck problems using different instantiations of Φ(X|T) and Ψ(Y|T) in place of mutual information in IB and PF. As we argue later, these problems better capture the essence of "informativeness" and "privacy," thus providing analytical and interpretable guarantees similar in spirit to those of IB and PF.
Computing these bottleneck problems in general boils down to the optimization problems U_{Φ,Ψ} and L_{Φ,Ψ}. Consider the set M_{Φ,Ψ} of pairs (Φ(X|T), Ψ(Y|T)) attainable by channels P_{T|X} satisfying Y −− X −− T. Note that if both Φ and Ψ are continuous (with respect to the total variation distance), then M_{Φ,Ψ} is compact. Moreover, it can be easily verified that M_{Φ,Ψ} is convex. Hence, its upper and lower boundaries are well-defined and are characterized by the graphs of U_{Φ,Ψ} and L_{Φ,Ψ}, respectively. As mentioned earlier, these functionals are instrumental in computing the general bottleneck problems. Hence, before we delve into examples of bottleneck problems, we extend the approach given in Section II-B to compute U_{Φ,Ψ} and L_{Φ,Ψ}.
A. Evaluation of U_{Φ,Ψ} and L_{Φ,Ψ}

Analogous to Section II-B, we first introduce the Lagrangians of U_{Φ,Ψ} and L_{Φ,Ψ}, where β ≥ 0 is the Lagrange multiplier. Let (X', Y') be a pair of random variables with X' ∼ Q_X, and let Y' be the result of passing X' through the channel P_{Y|X}. Letting F_β(Q_X) denote the resulting objective as a function of Q_X, we obtain that the Lagrangians are given by the upper concave and lower convex envelopes of F_β evaluated at P_X, recalling that K∩ and K∪ are the upper concave and lower convex envelope operators. Once we compute L^U_{Φ,Ψ} and L^L_{Φ,Ψ} for all β ≥ 0, we can use standard results in optimization theory (similar to (21) and (22)) to recover U_{Φ,Ψ} and L_{Φ,Ψ}. Alternatively, we can extend the approach of Witsenhausen and Wyner [4] described in Section II-B. Suppose that for some β, the envelope of F_β at P_X is attained by a convex combination of points Q_i with weights λ_i ≥ 0 (with Σ_i λ_i = 1). Then Σ_i λ_i Q_i = P_X, and the variable T* with P_{T*}(i) = λ_i and P_{X|T*=i} = Q_i is optimal. The algorithm to compute U_{Φ,Ψ} and L_{Φ,Ψ} is then summarized in three steps: compute F_β over the probability simplex, form its upper concave (resp. lower convex) envelope, and read off the decomposition of P_X; the points Q_i with P_T(i) = λ_i then give the optimal T* in U_{Φ,Ψ} and L_{Φ,Ψ}, respectively. We will apply this approach to analytically compute U_{Φ,Ψ} and L_{Φ,Ψ} (and the corresponding bottleneck problems) for binary cases in the following sections.
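For binary X, the envelope step can be carried out with a monotone-chain upper hull. The sketch below instantiates Φ = Ψ = binary Shannon entropy, recovering the IB-type objective F_β(q) = H(Y) − βH(X) of Fig. 4; the choices of δ, β, and the grid are illustrative.

```python
import numpy as np

def upper_concave_envelope(xs, fs):
    # Upper hull of the graph points (Andrew's monotone chain), then interpolate.
    hull = []
    for pt in zip(xs, fs):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (pt[1] - y1) - (y2 - y1) * (pt[0] - x1) >= 0:
                hull.pop()                 # middle point lies on/below the chord
            else:
                break
        hull.append(pt)
    hx, hy = zip(*hull)
    return np.interp(xs, hx, hy)

def hb(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

Phi, Psi = hb, hb                          # Phi(Q_X) = H(X), Psi(Q_Y) = H(Y)
delta, beta = 0.1, 0.5
qs = np.linspace(0.0, 1.0, 401)
F = Psi(qs * (1 - delta) + (1 - qs) * delta) - beta * Phi(qs)
K_cap = upper_concave_envelope(qs, F)      # upper concave envelope of F_beta
```

The points where K_cap touches F identify the decomposition of P_X into the Q_i, i.e., the support of the optimal T*.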

B. Guessing Bottleneck Problems
Let P_XY be given with marginals P_X and P_Y and the corresponding channel P_{Y|X}. Let also Q_X ∈ P(X) be an arbitrary distribution on X and Q_Y = Q_X P_{Y|X} be the output distribution of P_{Y|X} when fed with Q_X. Any channel P_{T|X}, together with the Markov structure Y −− X −− T, generates unique P_{X|T} and P_{Y|T}. We need the following basic definition from statistics.

Definition 1. Let U be a discrete random variable and V an arbitrary random variable, supported on U and V respectively, with |U| < ∞. Then the probability of correctly guessing U, and of correctly guessing U given V, are given by P_c(U) := max_{u∈U} P_U(u) and P_c(U|V) := E_V[max_{u∈U} P_{U|V}(u|V)]. Moreover, the multiplicative gain of the observation V in guessing U is defined as I_∞(U; V) := log (P_c(U|V)/P_c(U)).

As the names suggest, P_c(U|V) and P_c(U) characterize the optimal efficiency of guessing U with or without the observation V, respectively. Intuitively, I_∞(U; V) quantifies how useful the observation V is in estimating U: if it is small, then it is nearly as hard for an adversary observing V to guess U as it is without V. This observation motivates the use of I_∞(Y; T) as a measure of privacy in lieu of I(Y; T) in PF.
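Definition 1 is straightforward to compute from a joint pmf. The sketch below uses a randomly drawn joint table and assumes the logarithmic (base-2) form of I_∞.

```python
import numpy as np

rng = np.random.default_rng(3)
P_UV = rng.random((4, 5)); P_UV /= P_UV.sum()      # joint pmf of (U, V)

P_U = P_UV.sum(axis=1)
Pc_U = float(P_U.max())                            # best blind guess: max_u P_U(u)
Pc_UgV = float(P_UV.max(axis=0).sum())             # E_V[max_u P_{U|V}(u|V)] = sum_v max_u P_{UV}(u, v)
I_inf = float(np.log2(Pc_UgV / Pc_U))              # multiplicative gain, in bits
```

Since Σ_v max_u P(u, v) ≥ max_u Σ_v P(u, v), observing V can never hurt a Bayes guesser, so I_∞ is always non-negative.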
It is worth noting that I_∞(U; V) is not symmetric in general, i.e., I_∞(U; V) ≠ I_∞(V; U). Both measures H(X|T) and P_c(Y|T) are special cases of the models described in the previous section; in particular, we can define the corresponding U_{Φ,Ψ} and L_{Φ,Ψ}. We will see later that I(X; T) and P_c(Y|T) correspond to Arimoto's mutual information of orders 1 and ∞, respectively. Define IB^{(∞,1)}(R) as the supremum of I_∞(Y; T) over all P_{T|X} with Y −− X −− T and I(X; T) ≤ R. This bottleneck functional formalizes an interpretable guarantee. Recall that the functional PF(r) aims at extracting maximum information about X while protecting privacy with respect to Y. Measuring privacy in terms of P_c(Y|T), this objective can be better formulated by PF^{(∞,1)}(r), the infimum of I_∞(Y; T) subject to I(X; T) ≥ r, with the interpretable privacy guarantee that PF^{(∞,1)}(r) characterizes the smallest probability of correctly guessing the private feature Y among all representations of X preserving at least r bits of information about X. Notice that the variable T in the formulations of IB^{(∞,1)} and PF^{(∞,1)} takes values in a set T of arbitrary cardinality. However, a straightforward application of the Carathéodory-Fenchel-Eggleston theorem (see, e.g., [79, Lemma 15.4]) reveals that the cardinality of T can be restricted to |X| + 1 without loss of generality. In the following lemma, we prove more basic properties of IB^{(∞,1)} and PF^{(∞,1)}.

Lemma 7. For any P_XY with Y supported on a finite set Y, the map IB^{(∞,1)} is strictly increasing and concave on the range (0, I_∞(X; Y)).
The proof follows the same lines as that of Theorem 1 and is hence omitted. Lemma 7 in particular implies that the inequalities I(X; T) ≤ R and I(X; T) ≥ r in the definitions of IB^{(∞,1)} and PF^{(∞,1)} can be replaced by I(X; T) = R and I(X; T) = r, respectively. It can be verified that I_∞ satisfies the data-processing inequality; consequently, IB^{(∞,1)} and PF^{(∞,1)} must be smaller than I_∞(Y; X). The properties listed in Lemma 7 enable us to derive a slightly tighter upper bound for PF^{(∞,1)}, as demonstrated in the following.

Lemma 8. For any P_XY with Y supported on a finite set Y, we have the bounds stated below.

Fig. 6. The mapping q → F^{(∞,1)}_β(q) = P_c(Y) + βH(X), where X ∼ Bernoulli(q) and Y is the result of passing X through BSC(0.1).

The proof of this lemma (and of the other results in this section) is given in Appendix B. This lemma characterizes how the gap between I_∞(Y; X) and IB^{(∞,1)}(R) behaves when R is sufficiently close to H(X).
In the following theorem, we apply the technique delineated in Section III-A to derive closed-form expressions for IB^{(∞,1)} and PF^{(∞,1)} in the binary symmetric case, thereby establishing results analogous to Mr. and Mrs. Gerber's Lemmas.
As described in Section III-A, to compute IB^{(∞,1)} and PF^{(∞,1)} it suffices to derive the convex and concave envelopes of the mapping F^{(∞,1)}_β(q) := P_c(Y) + βH(X), where X ∼ Bernoulli(q) and Y is the result of passing X through BSC(δ). This function is depicted in Fig. 6. The detailed derivation of the convex and concave envelopes of F^{(∞,1)}_β is given in Appendix B.

C. Arimoto Bottleneck Problems
The bottleneck framework proposed in the last section benefited from the interpretable guarantees brought forth by the quantity I_∞. In this section, we define a parametric family of statistical quantities, the so-called Arimoto mutual information, which includes both Shannon's mutual information and I_∞ as extreme cases.

Definition 2 ([21]). Let U ∼ P_U and V ∼ P_V be two random variables supported on finite sets U and V, respectively. Their Arimoto mutual information of order α > 1 is defined as I_α(U; V) := H_α(U) − H_α(U|V), where H_α(U) := (1/(1−α)) log Σ_{u∈U} P_U(u)^α is the Rényi entropy of order α and H_α(U|V) := (α/(1−α)) log Σ_{v∈V} (Σ_{u∈U} P_UV(u, v)^α)^{1/α} is Arimoto's conditional entropy of order α.
By continuous extension, one can define I_α(U; V) for α = 1 and α = ∞ as I(U; V) and I_∞(U; V), respectively. Arimoto's mutual information was first introduced by Arimoto [21], later revisited by Liese and Vajda [101], and more recently by Verdú [102]. More in-depth analyses and properties of I_α can be found in [103]. It is shown in [70, Lemma 1] that I_α(U; V) for α ∈ [1, ∞] quantifies the minimum loss in recovering U given V, where the loss is measured in terms of the so-called α-loss. This loss function reduces to the logarithmic loss (27) and to P_c(U|V) for α = 1 and α = ∞, respectively. This sheds light on the utility and/or privacy guarantees promised by a constraint on Arimoto's mutual information. It is now natural to use I_α to define a family of bottleneck problems.
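Definition 2 can be implemented directly, and the two continuous-extension limits can be checked numerically against Shannon's mutual information and I_∞ (all in bits). The joint pmf, seed, and the proxy orders 1.0001 and 200 below are illustrative choices.

```python
import numpy as np

def arimoto_mi(P_UV, alpha):
    # I_alpha(U;V) = H_alpha(U) - H_alpha(U|V), in bits, for alpha > 1.
    P_U = P_UV.sum(axis=1)
    H_alpha = np.log2((P_U ** alpha).sum()) / (1.0 - alpha)          # Renyi entropy
    inner = ((P_UV ** alpha).sum(axis=0)) ** (1.0 / alpha)           # per-v alpha-norm
    H_alpha_cond = (alpha / (1.0 - alpha)) * np.log2(inner.sum())    # Arimoto conditional entropy
    return float(H_alpha - H_alpha_cond)

rng = np.random.default_rng(4)
P_UV = rng.random((3, 4)); P_UV /= P_UV.sum()

P_U, P_V = P_UV.sum(axis=1), P_UV.sum(axis=0)
I_shannon = float((P_UV * np.log2(P_UV / np.outer(P_U, P_V))).sum())
I_inf = float(np.log2(P_UV.max(axis=0).sum() / P_U.max()))

I_near1 = arimoto_mi(P_UV, 1.0001)    # alpha -> 1 recovers Shannon's I(U;V)
I_large = arimoto_mi(P_UV, 200.0)     # alpha -> infinity approaches I_inf
```

Both proxies agree with their respective limits up to the expected discretization error.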
Theorem 12. Let X ∼ Bernoulli(p) and P_{Y|X} = BSC(δ) with p, δ ≤ 1/2. For α ≥ 2 we have the stated expression, where a_α := ‖(a, 1 − a)‖_α for a ∈ [0, 1] and q ≤ p solves the stated equation. Moreover, the corresponding PF expression holds with z = max{2p, λ}, where λ ∈ [0, 1] solves the associated equation.

By letting α → ∞, this theorem indicates that for X and Y connected through BSC(δ) and all variables T forming Y −− X −− T, the tradeoff (61) holds, which can be shown to be achieved by a T* generated by the following channel (see Fig. 7). Note that, by assumption, p ≤ 1/2, and hence the event {X = 1} is less likely than {X = 0}. Therefore, (61) demonstrates that, to ensure correct recoverability of X with probability at least λ, the most private approach (with respect to Y) is to obfuscate the more likely event {X = 0} with a probability determined by λ and p. As demonstrated in (61), the optimal privacy guarantee is linear in the utility parameter in the binary symmetric case. This is in fact a special case of a more general result recently proved in [64, Theorem 1]: the infimum of P_c(Y|T) over all variables T such that P_c(X|T) ≥ λ is piecewise linear in λ, or equivalently, the mapping e^r → exp(PF^{(∞,∞)}(r)) is piecewise linear.

D. f -Bottleneck Problems
In this section, we describe another instantiation of the general framework, in terms of functions Φ and Ψ that enjoy interpretable estimation-theoretic guarantees.

Definition 4. Let f : (0, ∞) → R be a convex function with f(1) = 0. Furthermore, let U and V be two real-valued random variables supported on U and V, respectively. Their f-information is defined by I_f(U; V) := D_f(P_UV ‖ P_U × P_V), where D_f(P‖Q) is the f-divergence [104] between the distributions P and Q, defined as D_f(P‖Q) := Σ_x Q(x) f(P(x)/Q(x)).

Due to the convexity of f, we have D_f(P‖Q) ≥ f(1) = 0, and hence f-information is always non-negative. If, furthermore, f is strictly convex at 1, then equality holds if and only if P = Q. Csiszár introduced f-divergence in [104] and applied it to several problems in statistics and information theory. More recent developments on the properties of f-divergence and f-information can be found in [22] and the references therein. Any convex function f with f(1) = 0 yields an f-information. Popular examples include f(t) = t log t, corresponding to Shannon's mutual information; f(t) = |t − 1|, corresponding to T-information [83]; and f(t) = t² − 1, corresponding to χ²-information [68]. It is worth mentioning that Arimoto's mutual information of order α ∈ (0, 1) was also shown to be an f-information in the binary case for a certain function f; see [101, Theorem 8].
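Definition 4 can be instantiated for the three example generators above; the sketch below computes them on a randomly drawn joint pmf (Shannon's information in bits for concreteness) and checks that all f-informations vanish on a product distribution.

```python
import numpy as np

def f_information(P_UV, f):
    # I_f(U;V) = D_f(P_UV || P_U x P_V), with Q the product of the marginals.
    Q = np.outer(P_UV.sum(axis=1), P_UV.sum(axis=0))
    return float((Q * f(P_UV / Q)).sum())

f_kl   = lambda t: t * np.log2(t)       # t log t  -> Shannon mutual information (bits)
f_tv   = lambda t: np.abs(t - 1.0)      # |t - 1|  -> T-information
f_chi2 = lambda t: t ** 2 - 1.0         # t^2 - 1  -> chi^2-information

rng = np.random.default_rng(5)
P_UV = rng.random((3, 3)); P_UV /= P_UV.sum()
I_kl, I_tv, I_chi2 = (f_information(P_UV, f) for f in (f_kl, f_tv, f_chi2))

P_ind = np.outer(P_UV.sum(axis=1), P_UV.sum(axis=0))   # a product (independent) joint
```

All three quantities are non-negative, and each vanishes when evaluated on P_ind, consistent with D_f(P‖Q) = 0 iff P = Q for f strictly convex at 1.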
Let (X, Y) ∼ P_XY be given with marginals P_X and P_Y, and consider the functions Φ and Ψ on P(X) and P(Y) induced by f-divergences. Given a conditional distribution P_{T|X}, it is easy to verify that Φ(X|T) = I_f(X; T) and Ψ(Y|T) = I_f(Y; T). This in turn implies that f-information can be utilized in (40) and (41) to define general bottlenecks: let f : (0, ∞) → R and g : (0, ∞) → R be two convex functions satisfying f(1) = g(1) = 0, and define IB^{(f,g)} and PF^{(f,g)} accordingly. In light of the discussion in Section III-A, the optimization problems in IB^{(f,g)} and PF^{(f,g)} can be analytically solved by determining the upper concave and lower convex envelopes of the mapping F^{(f,g)}_β, where β ≥ 0 is the Lagrange multiplier. An important family is given by f_α(t) := (t^α − 1)/(α − 1) for α > 1; the corresponding f-divergence is sometimes called the Hellinger divergence of order α, see, e.g., [105]. Note that the Hellinger divergence of order 2 reduces to the χ²-divergence. Calmon et al. [67] and Asoodeh et al. [66] showed that if I_{f_2}(Y; T) ≤ ε for some ε ∈ (0, 1), then the minimum mean-squared error (MMSE) of reconstructing any zero-mean unit-variance function of Y given T is lower bounded by 1 − ε; i.e., no function of Y can be reconstructed with small MMSE given an observation of T. This result serves as a natural justification for I_{f_2} as an operational measure of both privacy and utility in a bottleneck problem.
Unfortunately, our approach described in Section III-A cannot be used to compute IB^{(f_2,f_2)} or PF^{(f_2,f_2)} in the binary symmetric case. The difficulty lies in the fact that the function F^{(f_2,f_2)}_β defined in (66) is, in the binary symmetric case, either convex or concave on its entire domain depending on the value of β. Nevertheless, one can consider the Hellinger divergence of order α with α ≠ 2 and then apply our approach to compute IB^{(f_α,f_α)} or PF^{(f_α,f_α)}. Relying on [106, Corollary 5.6], one can justify I_{f_α} as a measure of privacy and utility in a similar way as I_{f_2}.
We end this section with a remark about estimating the measures studied here. While we consider the information-theoretic regime where the underlying distribution P_XY is known, in practice only samples (x_i, y_i) are given. Consequently, the de facto guarantees of bottleneck problems might differ considerably from those shown in this work. It is therefore essential to assess the guarantees of bottleneck problems when accessing only samples. To do so, one must derive bounds on the discrepancy between P_c, I_α, and I_f computed on the empirical distribution and on the true (unknown) distribution. These bounds can then shed light on the de facto guarantees of the bottleneck problems. Relying on [32, Theorem 1], one can show that the gaps between the measures P_c, I_α, and I_f computed on the empirical distribution and on the true one scale as O(1/√n), where n is the number of samples. This is in contrast with mutual information, for which the corresponding upper bound scales as O(log n/√n), as shown in [31]. Therefore, the above measures appear to be easier to estimate than mutual information.
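The O(1/√n) behavior of the plug-in estimate of P_c can be illustrated by a quick simulation; the joint table, sample size, and seed below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
P_XY = np.array([[0.20, 0.10],
                 [0.05, 0.25],
                 [0.15, 0.25]])                  # true joint, |X| = 3, |Y| = 2
Pc_true = P_XY.max(axis=0).sum()                 # P_c(X|Y) under the true distribution

n = 20000
counts = rng.multinomial(n, P_XY.ravel()).reshape(P_XY.shape)
P_hat = counts / n                               # empirical joint from n samples
Pc_hat = P_hat.max(axis=0).sum()                 # plug-in estimate
gap = abs(Pc_hat - Pc_true)                      # expected to be O(1/sqrt(n))
```

Repeating this over a range of n and averaging the gaps would trace the O(1/√n) decay discussed above.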

IV. SUMMARY AND CONCLUDING REMARKS
Following the recent surge in the use of the information bottleneck (IB) and privacy funnel (PF) in developing and analyzing machine learning models, we investigated the functional properties of these two optimization problems. Specifically, we showed that IB and PF correspond to the upper and lower boundaries of the two-dimensional convex set M = {(I(X; T), I(Y; T)) : Y −− X −− T}, where (X, Y) ∼ P_XY represents the observable data X and target feature Y, and the auxiliary random variable T varies over all choices satisfying the Markov relation Y −− X −− T. This unifying perspective on IB and PF allowed us to adapt the classical technique of Witsenhausen and Wyner [4], devised for computing IB, so that it applies to PF as well. We illustrated this by deriving a closed-form expression for PF in the binary case, a result reminiscent of Mrs. Gerber's Lemma [3] in the information theory literature. We then showed that both IB and PF are closely related to several information-theoretic coding problems, such as noisy source coding, hypothesis testing against independence, and dependence dilution. While these connections were partially known in previous work (see, e.g., [27], [28]), we showed that they lead to an improvement in the cardinality bound on T for computing IB. We then turned our attention to the continuous setting where X and Y are continuous random variables. Solving the optimization problems in IB and PF in this case without further assumptions appears to be difficult in general and leads to theoretical results only when (X, Y) is jointly Gaussian. Invoking recent results on the entropy power inequality [23] and the strong data processing inequality [25], we obtained tight bounds on IB in two cases: (1) when Y is a Gaussian perturbation of X, and (2) when X is a Gaussian perturbation of Y. We also utilized the celebrated I-MMSE relationship [107] to derive a second-order approximation of PF when T is a Gaussian perturbation of X.
In the second part of the paper, we argued that the choice of (Shannon's) mutual information in both IB and PF does not seem to carry specific operational significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [2] that can be solved iteratively (albeit without convergence guarantees). In fact, this property is unique to mutual information among existing information measures [99]. Nevertheless, we argued that other information measures might lead to more interpretable guarantees for both IB and PF. For instance, statistical accuracy in IB and privacy leakage in PF can be precisely characterized by the probability of correctly guessing (aka Bayes risk) or the minimum mean-squared error (MMSE). Following this observation, we introduced a large family of optimization problems, which we call bottleneck problems, by replacing mutual information in IB and PF with Arimoto's mutual information [21] or f-information [22]. Invoking results from [31], [32], we also demonstrated that these information measures are in general easier to estimate from data than mutual information. Similar to IB and PF, the bottleneck problems were shown to be fully characterized by the boundaries of a two-dimensional convex set parameterized by two real-valued non-negative functions Φ and Ψ. This perspective enabled us to generalize the technique used to compute IB and PF to evaluate the bottleneck problems. Applying this technique to the binary case, we derived closed-form expressions for several bottleneck problems.

APPENDIX A PROOFS FROM SECTION II
Proof of Theorem 1.
• Note that R = 0 in the optimization problem (4) implies that X and T are independent. Since Y, X, and T form the Markov chain Y −− X −− T, independence of X and T implies independence of Y and T, and thus I(Y; T) = 0. The same argument applies to PF(0).
• Since I(X; T) ≤ H(X) for any random variable T, the choice T = X satisfies the information constraint I(X; T) ≤ R for R ≥ H(X). Since I(Y; T) ≤ I(Y; X), this choice is optimal. Similarly for PF, the constraint I(X; T) ≥ r for r ≥ H(X) implies T = X; hence, PF(r) = I(Y; X).
• The concavity of R → IB(R) follows from the fact that it is the upper boundary of the convex set M defined in (6). This in turn implies the continuity of IB(·). Monotonicity of R → IB(R) follows from the definition. Strict monotonicity follows from concavity and the fact that IB(H(X)) = I(X; Y).
• The argument for PF is similar.
• The differentiability of the map R → IB(R) follows from [94, Lemma 6]. This result in fact implies the differentiability of the map r → PF(r) as well. Continuity of the derivatives of IB and PF on (0, H(X)) is a straightforward application of [108, Theorem 25.5].
• Monotonicity of the mappings R → IB(R)/R and r → PF(r)/r follows from the concavity and convexity of IB(·) and PF(·), respectively.
• Strict monotonicity of IB(·) and PF(·) implies that the optima in (4) and (5) are attained when the inequality constraints are met with equality.
Proof of Theorem 3. Recall that, according to Theorem 1, the mappings R → IB(R) and r → PF(r) are concave and convex, respectively. This implies that IB(R) (resp. PF(r)) lies above (resp. below) the chord connecting (0, 0) and (H(X), I(X; Y)). This proves the lower bound IB(R) ≥ R·I(X; Y)/H(X) (resp. the upper bound PF(r) ≤ r·I(X; Y)/H(X)). In light of the convexity of PF and the monotonicity of r → PF(r)/r, we can write PF(r)/r ≥ lim_{r↓0} PF(r)/r, where the last equality is due to [14, Lemma 4] and Q_Y is the output distribution of the channel P_{Y|X} when the input is distributed according to Q_X. Similarly, we can write the corresponding chain for IB, where the last equality is due to [82, Theorem 4].
Proof of Theorem 5. Let T_n be an optimal summary of X^n, that is, one satisfying T_n −− X^n −− Y^n and I(X^n; T_n) = nR. We can write the decomposition below and hence, setting R_k := I(X_k; X^{k−1}, T_n), we obtain the corresponding per-letter bounds. We can similarly decompose I(Y^n; T_n). Combining these, we conclude from the above inequality that the stated bound holds, where the last inequality follows from the concavity of the map x → IB(P_XY, x) and (67). Consequently, we obtain IB(P_{X^n Y^n}, nR) ≤ nIB(P_XY, R). To prove the other direction, let P_{T|X} be an optimal channel in the definition of IB, i.e., I(X; T) = R and IB(P_XY, R) = I(Y; T). Using this channel n times, once for each pair (X_i, Y_i), we obtain T^n = (T_1, ..., T_n) satisfying T^n −− X^n −− Y^n. Since I(X^n; T^n) = nI(X; T) = nR and I(Y^n; T^n) = nI(Y; T), we have IB(P_{X^n Y^n}, nR) ≥ nIB(P_XY, R). This, together with (69), concludes the proof.
Proof of Theorem 4. First notice the identity below, where the last equality is due to [82, Theorem 4]. Similarly, the corresponding identity holds, where the last equality is due to [14, Lemma 4]. Fix x_0 ∈ X with P_X(x_0) > 0 and let T be a Bernoulli random variable specified by the following channel, for some δ > 0. This channel induces T ∼ Bernoulli(δP_X(x_0)) and P_{Y|T}(y|1) = P_{Y|X}(y|x_0). It can be verified that the stated expressions hold. Setting δ = r/(−P_X(x_0) log P_X(x_0)), we obtain the claimed bound. Since x_0 is arbitrary, the result follows. The proof for IB follows similarly.
Proof of Lemma 1. When Y is an erasure of X, i.e., Y takes values in X ∪ {⊥}, the identity below holds for every P_X and Q_X in P(X). Consequently, the infimum can be evaluated, and Theorem 3 gives the desired result.
To prove the second part, i.e., when X is an erasure of Y, we need an improved upper bound on PF. Notice that if perfect privacy occurs for a given P_XY, then the upper bound for PF(r) in Theorem 3 can be improved as in (70), where r_0 is the largest r ≥ 0 such that PF(r) = 0. Here, we show that r_0 = H(X|Y). This suffices to prove the result, since (70), together with Theorem 1, yields the claim. To show that PF(H(X|Y)) = 0, consider the channel P_{T|X}(t|x) = (1/|Y|)·1{t ≠ ⊥, x ≠ ⊥} and P_{T|X}(⊥|⊥) = 1. It can be verified that this channel induces a T that is independent of Y and satisfies the identity below, where h_b(δ) := −δ log δ − (1 − δ) log(1 − δ) is the binary entropy function.
Consider the problem of minimizing the Lagrangian L_PF(β) in (20) for β ≥ β_PF. Let X ∼ Q_X = Bernoulli(q) for some q ∈ (0, 1) and let Y be the result of passing X through BSC(δ), i.e., Y ∼ Bernoulli(q ∗ δ). Recall that F_β is convex near the endpoints of [0, 1] and concave in a region around q = 1/2, where it reaches its local maximum. Hence, if β is such that this local maximum lies above the chord connecting the endpoints, the lower convex envelope of F_β on [0, 1] is given by the convex combination of F_β(0) and F_β(1).
Proof of Corollary 1. Since (X, Y) are jointly Gaussian, we can write where σ_Y² is the variance of Y. Applying Lemma 2 and noticing that H(X) = (1/2) log(2πe(σ_Y² + σ²)), we obtain Next, we wish to prove Theorem 6; however, we need the following preliminary lemma before we delve into its proof. Proof. The finiteness of E[X²] and E[Y²] implies that H(X) and H(Y) are finite. A straightforward application of the entropy power inequality (cf. [109, Theorem 17.7.3]) implies that H(T_σ) is also finite. Thus, I(X; T_σ) and I(Y; T_σ) are well-defined. According to the data processing inequality, we have I(X; T_{σ+δ}) < I(X; T_σ) for all δ > 0, and also I(Y; T_{σ+δ}) ≤ I(Y; T_σ), where equality occurs if and only if X and Y are independent. Since X and Y are correlated by assumption, it follows that I(Y; T_{σ+δ}) < I(Y; T_σ). Thus, both I(X; T_σ) and I(Y; T_σ) are strictly decreasing.
Next, we prove the continuity of σ ↦ I(Y; T_σ). For a sequence of positive numbers σ_n converging to σ > 0, we have I(Y; T_{σ_n}) = H(T_{σ_n}) − H(T_{σ_n}|Y). We only need to show that H(T_{σ_n}|Y) → H(T_σ|Y). Invoking again de Bruijn's identity, we obtain H(T_{σ_n}|Y = y) → H(T_σ|Y = y) for each y ∈ Y. The desired result follows from the dominated convergence theorem. Finally, the continuity of σ ↦ I(Y; T_σ) at σ = 0 follows from [110, Page 2028], which states that H(T_{σ_n}|Y = y) → H(X|Y = y), and then applying the dominated convergence theorem.
Note that where σ_X² is the variance of X and the last inequality follows from the fact that I(X; X + σN_G) is maximized when X is Gaussian. Since σ_X < ∞ by assumption, it follows that both I(X; T_σ) and I(Y; T_σ) converge to zero as σ → ∞.
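In the jointly Gaussian special case the quantities in this lemma have closed forms, so the strict monotonicity and the vanishing limits are easy to check numerically. The sketch below assumes Y = X + τN' with illustrative (assumed) variances; it is not part of the proof.

```python
import numpy as np

sx2, tau2 = 1.0, 0.5            # assumed var(X) and noise variance in Y = X + tau*N'
sigmas = np.linspace(0.1, 50.0, 500)

# Closed forms (in nats) for T_sigma = X + sigma * N_G, X Gaussian:
I_XT = 0.5 * np.log(1.0 + sx2 / sigmas**2)
rho2 = sx2**2 / ((sx2 + tau2) * (sx2 + sigmas**2))   # squared correlation of (Y, T_sigma)
I_YT = -0.5 * np.log(1.0 - rho2)
```

Both curves are strictly decreasing in σ and tend to zero as σ → ∞, matching the lemma.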
In light of this lemma, there exists a unique σ ≥ 0 such that I(X; T_σ) = r; let σ_r denote this σ. Therefore, we have PF(r) = I(Y; T_{σ_r}). This enables us to prove Theorem 6.
Proof of Theorem 6. The proof relies on the I-MMSE relation from the information theory literature, which we briefly describe here for convenience. Given any pair of random variables U and V, the minimum mean-squared error (MMSE) of estimating U given V is given by mmse(U|V) := inf_f E[(U − f(V))²], where the infimum is taken over all measurable functions f. The authors of [107] proved the following identity, referred to as the I-MMSE formula, relating the input-output mutual information of the additive Gaussian channel T_σ = X + σN_G, where N_G ∼ N(0, 1) is independent of X, to the MMSE of the input given the output: Since Y, X, and T_σ form the Markov chain Y − − X − − T_σ, two applications of (72) yield The second derivatives of I(X; T_σ) and I(Y; T_σ) are also known via the formula [111, Proposition 9] With these results in mind, we now begin the proof. Recall that σ_r is the unique σ such that I(X; T_σ) = r, thus implying PF_G(r) = I(Y; T_{σ_r}). We have To compute the derivative of PF(r), we therefore need to compute the derivative of σ_r² with respect to r. To do so, notice that from the identity I(X; T_{σ_r}) = r we can obtain Plugging this identity into (75) and invoking (73), we obtain The second derivative can be obtained via (74), where var(E[X|Y]) is the variance of the conditional expectation of X given Y and the last equality follows from the law of total variance. A Taylor expansion of PF(r) around r = 0 then gives the result.
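For Gaussian input, both sides of the I-MMSE formula have closed forms, so the identity dI/dsnr = mmse/2 can be verified by finite differences. This is a sketch under the assumption X ∼ N(0, sx2) with an arbitrary assumed variance; it only illustrates the formula quoted above.

```python
import numpy as np

sx2 = 2.0                                   # assumed var(X), X Gaussian
snr = np.linspace(0.01, 10.0, 2000)

# For X ~ N(0, sx2) observed through sqrt(snr)*X + N with N ~ N(0, 1):
I = 0.5 * np.log(1.0 + snr * sx2)           # mutual information (nats)
mmse = sx2 / (1.0 + snr * sx2)              # MMSE of X given the channel output

dI = np.gradient(I, snr)                    # numerical derivative of I in snr
```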
Proof of Theorem 9. The main ingredient of this proof is a result by Jana [112, Lemma 2.2], which provides a tight cardinality bound for the auxiliary random variables in the canonical problems of network information theory (including the noisy source coding problem described in Section II-C). Consider a pair of random variables (X, Y) ∼ P_{XY} and let d : Y × Ŷ → R be an arbitrary distortion measure defined for an arbitrary reconstruction alphabet Ŷ. For the privacy funnel, the proof proceeds as follows. We sort the elements in X such that
Next, we compute U_α or, equivalently, K ∩ [F_β(q†), where q† is a point in [0, 1/2]. Without loss of generality, we can ignore the first case. The other two cases correspond to the following solutions: • T* is a ternary variable given by P_{X|T*=0} = Bernoulli(0), P_{X|T*=1} = Bernoulli(1), and P_{X|T*=2} = Bernoulli( Since I_∞(X; T) = H_∞(X) − H_∞(X|T), the above lower bound yields where the last inequality follows from the fact that γ ↦ H_γ(X) is non-increasing. The upper bound in (80) (after replacing X with Y and γ with α) implies Combining (81) and (82), we obtain the desired upper bound for PF_{(α,γ)}. The other bounds can be proved similarly by interchanging X with Y and α with γ in (81) and (82).
takes values in a set T with cardinality |T| ≤ |X|. Compared to the best previously known cardinality bound (i.e., |T| ≤ |X| + 1), this result reduces the dimension of the search space of the optimization problem (2) from R^{|X|²} to R^{|X|(|X|−1)}.
PF(P_{XY}, r) := inf_{P_{T|X} : Y − − X − − T, I(X;T) ≥ r} I(Y; T), or PF(P_{XY}, r) := sup_{P_{T|X} : Y − − X − − T, I(Y;T) ≤ r} I(X; T).

Fig. 1. Examples of the set M, defined in (6). The upper and lower boundaries of this set correspond to IB and PF, respectively. It is worth noting that, while IB(R) = 0 only at R = 0, in general PF(r) = 0 holds for r in a non-trivial interval (only for |X| > 2). Also note that, in general, neither the upper nor the lower boundary is smooth. A sufficient condition for smoothness is P_{Y|X}(y|x) > 0 (see Theorem 1); thus both IB and PF are smooth in the binary case.
Conversely, if X = g(Y), then PF(r) = r because, for all T forming the Markov relation Y − − g(Y) − − T, we have I(Y; T) = I(g(Y); T). On the other hand, IB(R) = R if and only if there exists a variable T* satisfying I(X; T*) = I(Y; T*), and thus the following double Markov relations: Y − − X − − T* and X − − Y − − T*.

Fig. 2. Comparison of (8), the exact value of IB for jointly Gaussian X and Y (i.e., Y = X + σN_G with X and N_G both standard Gaussian N(0, 1)), with the general upper bound (9) for σ² = 0.5. It is worth noting that, while the Gaussian IB converges to I(X; Y) ≈ 0.8, the upper bound diverges.
reveals that IB is attained by T taking values in a set T with cardinality |T| ≤ |X| + 1. Here, we improve this bound to |T| ≤ |X|. Theorem 9. For any joint distribution P_{XY} and R ∈ (0, H(X)], the information bottleneck IB(R) is achieved by a T taking at most |X| values.

Fig. 5. The set {(I(X; T), I(Y; T))} with P_X = Bernoulli(0.9), P_{Y|X=0} = [0.9, 0.1], P_{Y|X=1} = [0.85, 0.15], and T restricted to be binary. While the upper boundary of this set is concave, the lower boundary is not convex. This implies that, unlike IB, PF(r) cannot be attained by binary variables T.
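A point cloud of the kind shown in Fig. 5 can be reproduced by sampling random binary channels P_{T|X}; every sampled point must lie below the data processing bound I(Y; T) ≤ min{I(X; T), I(X; Y)}. A minimal sketch using the parameters from the caption:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mi(pab):
    return entropy(pab.sum(1)) + entropy(pab.sum(0)) - entropy(pab.ravel())

# Parameters from the figure caption.
px = np.array([0.1, 0.9])                      # P_X = Bernoulli(0.9)
py_x = np.array([[0.90, 0.10],
                 [0.85, 0.15]])
pxy = px[:, None] * py_x

points = []
for _ in range(2000):
    a, b = rng.random(2)                       # random binary channel P_{T|X}
    pt_x = np.array([[a, 1 - a], [b, 1 - b]])
    pxt = px[:, None] * pt_x                   # joint of (X, T)
    pyt = pxy.T @ pt_x                         # joint of (Y, T) via Y - X - T
    points.append((mi(pxt), mi(pyt)))
points = np.array(points)
```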

Fig. 6. The mapping q ↦ F_β^{(∞,1)}.
The proof of this theorem also reveals the following intuitive statements. If X ∼ Bernoulli(p) and P_{Y|X} = BSC(δ), then among all random variables T satisfying Y − − X − − T and H(X|T) ≤ λ, the minimum P_c(Y|T) is given by δ − λ(0.5 − δ). Notice that, without any information constraint (i.e., λ = 0), P_c(Y|T) = P_c(Y|X) = δ. Perhaps surprisingly, this shows that the mutual information constraint has a linear effect on the privacy of Y. Similarly, to prove (51), we show that among all R-bit representations T of X, the best achievable accuracy P_c(Y|T) is given by 1 − δ ∗ h_b^{−1}(h_b(p) − R). This can be proved by combining Mrs. Gerber's Lemma (cf. Lemma 4) and Fano's inequality as follows. For all T such that H(X|T) ≥ λ, the minimum of H(Y|T) is given by h_b(δ ∗ h_b^{−1}(λ)). Since, by Fano's inequality, H(Y|T) ≤ h_b(1 − P_c(Y|T)), we obtain δ ∗ h_b^{−1}(λ) ≤ 1 − P_c(Y|T), which leads to the same result as above. Nevertheless, in Appendix B we give another proof based on the discussion of Section III-A.
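The accuracy expression 1 − δ ∗ h_b^{−1}(h_b(p) − R) is straightforward to evaluate. The sketch below (with assumed values of p, δ, and R; entropies in bits, ∗ denoting binary convolution, and h_b^{−1} computed by bisection on [0, 1/2]) illustrates the computation.

```python
import math

def hb(p):                        # binary entropy (bits)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def hb_inv(v):                    # inverse of hb on [0, 1/2], by bisection
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = (lo + hi) / 2
        if hb(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def star(a, b):                   # binary convolution a * b
    return a * (1 - b) + b * (1 - a)

p, delta, R = 0.5, 0.1, 0.3       # assumed source bias, BSC crossover, and rate
acc = 1 - star(delta, hb_inv(hb(p) - R))   # best accuracy of an R-bit T
```

At R = h_b(p) (no compression penalty) the expression recovers the unconstrained accuracy 1 − δ, as the text notes.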

• The upper bound on IB follows from the data processing inequality: I(Y; T) ≤ min{I(X; T), I(X; Y)} for all T satisfying the Markov condition Y − − X − − T.
• To prove the lower bound on PF, note that I(Y; T) = I(X; T) − I(X; T|Y) ≥ I(X; T) − H(X|Y).
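Both bounds can be checked on random instances. The sketch below (alphabet sizes are arbitrary assumptions) draws random pairs (P_XY, P_{T|X}) and records the slack in each inequality, which must be non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mi(pab):
    return entropy(pab.sum(1)) + entropy(pab.sum(0)) - entropy(pab.ravel())

gaps_upper, gaps_lower = [], []
for _ in range(200):
    pxy = rng.random((3, 4)); pxy /= pxy.sum()                     # random P_XY
    pt_x = rng.random((3, 5)); pt_x /= pt_x.sum(1, keepdims=True)  # random P_{T|X}
    px = pxy.sum(1)
    pxt = px[:, None] * pt_x                                       # joint of (X, T)
    pyt = pxy.T @ pt_x                                             # joint of (Y, T) via Y - X - T
    h_x_given_y = entropy(pxy.ravel()) - entropy(pxy.sum(0))       # H(X|Y)
    gaps_upper.append(min(mi(pxt), mi(pxy)) - mi(pyt))             # DPI slack
    gaps_lower.append(mi(pyt) - (mi(pxt) - h_x_given_y))           # lower-bound slack
```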

Theorem 13 ([112]). Let A be the set of all pairs (R, D) satisfying I(X; T) ≤ R and E[d(Y, ψ(T))] ≤ D for some mapping ψ : T → Ŷ and some joint distribution P_{XYT} = P_{XY} P_{T|X}. Then every extreme point of A corresponds to some choice of auxiliary variable T with alphabet size |T| ≤ |X|.
Measuring the distortion in the above theorem by the logarithmic loss as in (27), we obtain that A = {(R, D) ∈ R²_+ : R ≥ R_noisy(D)}, where R_noisy(D) is given in (29). We observed in Section II-C that IB is fully characterized by the mapping D ↦ R_noisy(D), and thus by A. In light of Theorem 13, all extreme points of A are achieved by a choice of T with cardinality |T| ≤ |X|. Let {(R_i, D_i)} be the set of extreme points of A, each achieved by a channel P_{T_i|X} and a mapping ψ_i. Due to the convexity of A, each point (R, D) ∈ A can be expressed as a convex combination of {(R_i, D_i)} with coefficients {λ_i}; that is, there exist a channel P_{T|X} = Σ_i λ_i P_{T_i|X} and a mapping ψ(T) = Σ_i λ_i ψ_i(T_i) such that I(X; T) = R and E[d(Y, ψ(T))] = D. This construction, often termed time-sharing in the information theory literature, implies that all points in A (including the boundary points) can be achieved with a variable T with |T| ≤ |X|. Since the boundary of A is specified by the mapping R ↦ IB(R), we conclude that IB(R) is achieved by a variable T with cardinality |T| ≤ |X| for every R < H(X).
Proof of Lemma 5. The following proof is inspired by [30, Proposition 1]. Let X = {1, . . . , m}. We sort the elements of X such that P_X(1) D_KL(P_{Y|X=1} ‖ P_Y) ≥ · · · ≥ P_X(m) D_KL(P_{Y|X=m} ‖ P_Y). Now consider the function f : X → [M] given by f(x) = x if x < M and f(x) = M if x ≥ M, where M = e^R. Let Z = f(X). We have P_Z(i) = P_X(i) if i < M and P_Z(M) = Σ_{j≥M} P_X(j). We can now write
I(Y; Z) = Σ_{i=1}^{M−1} P_X(i) D(P_{Y|X=i} ‖ P_Y) + P_Z(M) D(P_{Y|Z=M} ‖ P_Y)
≥ Σ_{i=1}^{M−1} P_X(i) D(P_{Y|X=i} ‖ P_Y)
≥ ((M − 1)/|X|) Σ_{i∈X} P_X(i) D(P_{Y|X=i} ‖ P_Y)
= ((M − 1)/|X|) I(X; Y).
Since f(X) takes values in [M], it follows that H(f(X)) ≤ R. Consequently, we have dIB(P_{XY}, R) ≥ sup_{f : X→[M]} I(Y; f(X)) ≥ ((M − 1)/|X|) I(X; Y).
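The merging construction in this proof is easy to test numerically: on a random P_XY, keeping the top M − 1 symbols (sorted as above) and merging the rest into one must retain at least a (M − 1)/|X| fraction of I(X; Y). A sketch with assumed alphabet sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mi(pab):
    return entropy(pab.sum(1)) + entropy(pab.sum(0)) - entropy(pab.ravel())

def kl(p, q):
    m = p > 0
    return float((p[m] * np.log(p[m] / q[m])).sum())

# Random P_XY with |X| = 6, |Y| = 4 (assumed sizes), and M = 3.
pxy = rng.random((6, 4)); pxy /= pxy.sum()
px, py = pxy.sum(1), pxy.sum(0)
M = 3

# Sort symbols by P_X(i) * D(P_{Y|X=i} || P_Y), as in the proof.
scores = np.array([px[i] * kl(pxy[i] / px[i], py) for i in range(6)])
order = np.argsort(-scores)

# Z = f(X): keep the top M-1 symbols, merge the rest into one.
pzy = np.zeros((M, 4))
pzy[:M - 1] = pxy[order[:M - 1]]
pzy[M - 1] = pxy[order[M - 1:]].sum(0)
```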