Abstract
Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), and strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound on the auxiliary variable in IB, making its computation more tractable for discrete random variables. In the second part, we introduce a general family of optimization problems, termed “bottleneck problems”, by replacing mutual information in IB and PF with other notions of mutual information, namely f-information and Arimoto’s mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. While the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelope of certain functions. By applying this technique to a binary case, we derive closed form expressions for several bottleneck problems.
1. Introduction
Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [1]. Suppose Y is a target variable and X is an observable correlated variable with joint distribution . The goal of IB is to learn a “compact” summary (aka bottleneck) T of X that is maximally “informative” for inferring Y. The bottleneck variable T is assumed to be generated from X by applying a random function F to X, i.e., , in such a way that it is conditionally independent of Y given X, a property that we denote by the Markov chain Y - X - T.
The IB quantifies this goal by measuring the “compactness” of T using the mutual information and, similarly, “informativeness” by . For a given level of compactness , IB extracts the bottleneck variable T that solves the constrained optimization problem
where the supremum is taken over all randomized functions satisfying the Markov chain Y - X - T. The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970’s—see [2,3,4,5]—as a technique to prove impossibility results in information theory and also to study the common information between X and Y. Wyner and Ziv [2] explicitly determined the value of for the special case of binary X and Y—a result widely known as Mrs. Gerber’s Lemma [2,6]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [1] and re-formulated in a data analytic context. Here, the random variable X represents a high-dimensional observation with a corresponding low-dimensional feature Y. IB aims at specifying a compressed description of X which is maximally informative about the feature Y. This framework led to several applications in clustering [7,8,9] and quantization [10,11].
A closely related framework to IB is the privacy funnel (PF) problem [12,13,14]. In the PF framework, a bottleneck variable T is sought to maximally preserve the “information” contained in X while revealing as little about Y as possible. This framework aims to capture the inherent trade-off between revealing X perfectly and leaking a sensitive attribute Y. For instance, suppose a user wishes to share an image X for some classification tasks. The image might carry information about attributes, say Y, that the user considers sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation of X from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to Y. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as
where the infimum is taken over all randomized functions and r is the parameter specifying the desired level of informativeness. It is evident from the formulations (2) and (3) that and are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [12,15] for estimating based on the agglomerative information bottleneck [9] algorithm. A similar formulation has recently been proposed in [16] as a tool to train a neural network for learning a private representation of data X; see [17,18] for other closely-related formulations. Solving the IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).
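Both formulations compare the same two coordinates of a candidate randomized mapping, namely the mutual information between T and X and the mutual information between T and Y. The short sketch below (Python, hypothetical function names, finite alphabets only) evaluates these two coordinates for a given joint distribution and candidate channel; IB maximizes the second coordinate subject to a cap on the first, while PF minimizes the second subject to a floor on the first.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint pmf given as a 2-D array."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_pf_coordinates(p_xy, p_t_given_x):
    """Return (I(X;T), I(Y;T)) for a candidate channel P_{T|X} under the chain Y - X - T."""
    p_x = p_xy.sum(axis=1)                 # marginal of X
    p_xt = p_x[:, None] * p_t_given_x      # joint of (X, T)
    p_yt = p_xy.T @ p_t_given_x            # joint of (Y, T), using the Markov chain
    return mutual_information(p_xt), mutual_information(p_yt)

# toy example: binary X, Y and a binary bottleneck T
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_t_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
print(ib_pf_coordinates(p_xy, p_t_given_x))
```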
In this paper, we first give a cohesive overview of the existing results surrounding the IB and PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source coding [19], testing against independence [20], and dependence dilution [21]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB, leading to a more tractable optimization problem for IB. We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto’s mutual information [22] and f-information [23]. By replacing and/or with either of these measures, we generate a family of optimization problems that we refer to as bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF (see also the short version [24]). More specifically, our main contributions are listed next.
- Computing IB and PF is notoriously challenging when X takes values in a set with infinite cardinality (e.g., X is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that X is a Gaussian perturbation of Y, i.e., where is a noise variable sampled from a Gaussian distribution independent of Y. Building upon the recent advances in entropy power inequality in [25], we derive a sharp upper bound for . As a special case, we consider jointly Gaussian for which the upper bound becomes tight. This then provides a proof, significantly simpler than the original one given in [26], of the fact that in this special case the optimal bottleneck variable T is also Gaussian. In the second scenario, we assume that Y is a Gaussian perturbation of X, i.e., . This corresponds to a practical setup where the feature Y might be perfectly obtained from a noisy observation of X. Relying on the recent results in strong data processing inequality [27], we obtain an upper bound on which is tight for small values of R. In the last scenario, we compute a second-order approximation of under the assumption that T is obtained by Gaussian perturbation of X, i.e., . Interestingly, the rate of increase of for small values of r is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [28].
- We extend Witsenhausen and Wyner’s approach [3] for analytically computing and . This technique converts solving the optimization problems in and to determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary X and Y and derive a closed form expression for – we call this result Mr. Gerber’s Lemma.
- Relying on the connection between and noisy source coding [19] (see [29,30]), we show that the optimal bottleneck variable T in optimization problem (2) takes values in a set with cardinality . Compared to the best cardinality bound previously known (i.e., ), this result leads to a reduction in the search space’s dimension of the optimization problem (2) from to . Moreover, we show that this does not hold for , indicating a fundamental difference between optimization problems (2) and (3).
- Following [14,31], we study the deterministic and (denoted by and ) in which T is assumed to be a deterministic function of X, i.e., for some function f. By connecting and with entropy-constrained scalar quantization problems in information theory [32], we obtain bounds on them explicitly in terms of . Applying these bounds to , we obtain that is bounded by one from above and by from below.
- By replacing and/or in (2) and (3) with Arimoto’s mutual information or f-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF. The main reason is three-fold: First, as illustrated in Section 2.3, mutual information in IB and PF is mainly justified when independent samples of are considered. However, Arimoto’s mutual information allows for an operational interpretation even in the single-shot regime (i.e., for ). Second, in IB and PF, is meant to be a proxy for the efficiency of reconstructing Y given the observation T. However, this efficiency can be accurately formalized by the probability of correctly guessing Y given T (i.e., the Bayes risk) or the minimum mean-squared error (MMSE) in estimating Y given T. While bounds these two measures, we show that they are precisely characterized by Arimoto’s mutual information and f-information, respectively. Finally, when is unknown, mutual information is notoriously difficult to estimate. Nevertheless, Arimoto’s mutual information and f-information are easier to estimate: while mutual information can be estimated with an estimation error that scales as [33], Diaz et al. [34] showed that this estimation error for Arimoto’s mutual information and f-information is . We also generalize our computation technique, enabling us to analytically compute these bottleneck problems. As before, this technique converts computing bottleneck problems to determining convex and concave envelopes of certain functions. Focusing on binary X and Y, we derive closed form expressions for some of the bottleneck problems.
1.1. Related Work
The IB formulation has been extensively applied in representation learning and clustering [7,8,35,36,37,38]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of . When data points lie in a metric space, geometric clustering is usually preferred, where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [31,39] proposed the deterministic IB (denoted by ) by enforcing that is a deterministic mapping: denotes the supremum of over all functions satisfying . This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function with a pre-determined output alphabet with f optimizing some objective function. This objective might be maximizing or minimizing [40] or maximizing for a random variable Y correlated with X [32,41,42,43]. Since for , the latter problem provides lower bounds for (and thus for ). In particular, one can exploit [44] (Theorem 1) to obtain provided that . This result establishes a linear gap between and irrespective of .
The connection between quantization and further allows us to obtain multiplicative bounds. For instance, if and , where is independent of Y, then it is well-known in the information theory literature that for all non-constant (see, e.g., [45] (Section 2.11)), thus for . We further explore this connection to provide multiplicative bounds on in Section 2.5.
The study of IB has recently gained increasing traction in the context of deep learning. By taking T to be the activity of the hidden layer(s), Tishby and Zaslavsky [46] (see also [47]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the IB optimization problem. In fact, it is claimed that the graph of the function (the so-called information plane) characterizes the learning dynamics of different layers in the network: shallow layers correspond to maximizing while deep layers’ objective is minimizing . While the generality of this claim was refuted empirically in [48] and theoretically in [49,50], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the IB optimization problem [51,52,53,54,55]; (ii) creating connections between and generalization error [56], robustness [51], and detection of out-of-distribution data [57]; and (iii) using to understand specific characteristics of neural networks [55,58,59,60].
In both IB and PF, mutual information poses some limitations. For instance, it may become infinite in deterministic neural networks [48,49,50] and may not lead to a proper privacy guarantee [61]. As suggested in [55,62], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantees have been proposed, including Rényi maximal correlation [21,63,64], probability of correctly recovering [65,66], minimum mean-squared estimation error (MMSE) [67,68], -information [69] (a special case of f-information to be described in Section 3), Arimoto’s and Sibson’s mutual information [61,70]—to be discussed in Section 3, maximal leakage [71], and local differential privacy [72]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [67,68] that if the -information between Y and T is sufficiently small, then no function of Y can be efficiently reconstructed given T, thus providing an interpretable privacy guarantee.
Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from n samples with an estimation error that scales as [33]. However, as shown by Diaz et al. [34], the estimation error for most of the above measures scales as . Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [73,74,75], present some fundamental limitations [76]: the variance of the estimator might grow exponentially with the ground-truth mutual information, and the estimator might not satisfy basic properties of mutual information such as the data processing inequality or additivity. McAllester and Stratos [77] showed that some of these limitations are inherent to a large family of mutual information estimators.
1.2. Notation
We use capital letters, e.g., X, for random variables and calligraphic letters for their alphabets, e.g., . If X is distributed according to probability mass function (pmf) , we write . Given two random variables X and Y, we write and as the joint distribution and the conditional distribution of Y given X. We also interchangeably refer to as a channel from X to Y. We use to denote both entropy and differential entropy of X, i.e., we have
if X is a discrete random variable taking values in with probability mass function (pmf) and
where X is an absolutely continuous random variable with probability density function (pdf) . If X is a binary random variable with , we write . In this case, its entropy is called binary entropy function and denoted by . We use superscript to describe a standard Gaussian random variable, i.e., . Given two random variables X and Y, their (Shannon’s) mutual information is denoted by . We let denote the set of all probability distributions on the set . Given an arbitrary and a channel , we let denote the resulting output distribution on . For any , we use to denote and for any integer , .
Throughout the paper, we assume a pair of (discrete or continuous) random variables are given with a fixed joint distribution , marginals and , and conditional distribution . We then use to denote an arbitrary distribution with .
2. Information Bottleneck and Privacy Funnel: Definitions and Functional Properties
In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex analytic approach which enables us to compute closed-form expressions for both of them in some simple cases.
To precisely quantify the trade-off between these two conflicting goals, the optimization problem (2) was proposed [1]. Since any randomized function can be equivalently characterized by a conditional distribution, the optimization problem (2) can be instead expressed as
where R and denote the level of desired compression and informativeness, respectively. We use and to denote and , respectively, when the joint distribution is clear from the context. Notice that if , then .
Now consider the setup where data X is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by Y. This goal was formulated by in (3). As before, replacing randomized function with conditional distribution , we can equivalently express (3) as
where and r denote the level of desired privacy and informativeness, respectively. The case is particularly interesting in practice and specifies perfect privacy, see e.g., [13,78]. As before, we write and for and when is clear from the context.
The following properties of and follow directly from their definitions. The proof of this result (and any other results in this section) is given in Appendix A.
Theorem 1.
For a given , the mappings and have the following properties:
- .
- for any and for .
- for any and for any .
- is continuous, strictly increasing, and concave on the range .
- is continuous, strictly increasing, and convex on the range .
- If for all and , then both and are continuously differentiable over .
- is non-increasing and is non-decreasing.
- We have
According to this theorem, we can always restrict both R and r in (4) and (5), respectively, to as for all .
Define as
It can be directly verified that is convex. According to this theorem, and correspond to the upper and lower boundary of , respectively. The convexity of then implies the concavity and convexity of and . Figure 1 illustrates the set for the simple case of binary X and Y.
Figure 1.
Examples of the set , defined in (6). The upper and lower boundaries of this set correspond to information bottleneck () and privacy funnel (), respectively. It is worth noting that, while (R) = 0 only at R = 0, (r) = 0 holds in general for r belonging to a non-trivial interval (only for > 2). Moreover, note that in general neither upper nor lower boundaries are smooth. A sufficient condition for smoothness is > 0 (see Theorem 1), thus both and are smooth in the binary case.
While both and , their behavior in the neighborhood around zero might be completely different. As illustrated in Figure 1, for all , whereas for for some . When such exists, we say perfect privacy occurs: there exists a variable T satisfying Y - X - T such that while ; making T a representation of X having perfect privacy (i.e., no information leakage about Y). A necessary and sufficient condition for the existence of such T is given in [21] (Lemma 10) and [13] (Theorem 3), described next.
Theorem 2
(Perfect privacy). Let be given and be the set of vectors . Then there exists such that for if and only if vectors in are linearly independent.
In light of this theorem, we obtain that perfect privacy occurs if . It also follows from the theorem that for binary X, perfect privacy cannot occur (see Figure 1a).
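Assuming the vectors referred to in Theorem 2 are the conditional distributions of X given each value of Y (their notation is elided above), the condition can be checked numerically by computing the rank of the matrix whose columns are these conditional distributions; a minimal sketch:

```python
import numpy as np

def conditional_columns_full_rank(p_xy, tol=1e-10):
    """Check whether the vectors {P_{X|Y=y}} are linearly independent (cf. Theorem 2)."""
    p_y = p_xy.sum(axis=0)                    # marginal of Y
    p_x_given_y = p_xy / p_y[None, :]         # column y is P_{X|Y=y}
    rank = np.linalg.matrix_rank(p_x_given_y, tol=tol)
    return rank == p_xy.shape[1]

# binary example as in Figure 1a: the two columns are linearly independent
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(conditional_columns_full_rank(p_xy))    # True
```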
Theorem 1 enables us to derive simple bounds for and . Specifically, the facts that is non-decreasing and is non-increasing immediately result in the following linear bounds.
Theorem 3
(Linear lower bound). For , we have
In light of this theorem, if , then , implying for a deterministic function g. Conversely, if then because for all T forming the Markov relation Y - g(Y) - T, we have . On the other hand, we have if and only if there exists a variable satisfying and thus the following double Markov relations
It can be verified (see [79] (Problem 16.25)) that this double Markov condition is equivalent to the existence of a pair of functions f and g such that and (X,Y) - f(X) - T. One special case of this setting, namely where g is an identity function, has been recently studied in detail in [53] and will be reviewed in Section 2.5. Theorem 3 also enables us to characterize the “worst” joint distribution with respect to and . As demonstrated in the following lemma, if is an erasure channel then .
Lemma 1.
- Let be such that , , and for some . Then
- Let be such that , , and for some . Then
The bounds in Theorem 3 hold for all r and R in the interval . We can, however, improve them when r and R are sufficiently small. Let and denote the slope of and at zero, i.e., and .
Theorem 4.
Given , we have
This theorem provides the exact values of and and also simple bounds for them. While the exact expressions for and are usually difficult to compute, a simple plug-in estimator is proposed in [80] for . This estimator can be readily adapted to estimate . Theorem 4 reveals a profound connection between and the strong data processing inequality (SDPI) [81]. More precisely, thanks to the pioneering work of Anantharam et al. [82], it is known that the supremum of over all is equal to the supremum of over all satisfying Y - X - T, and hence specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for new theoretical results for , especially when X or Y are continuous random variables. In particular, the recent non-multiplicative SDPI results [27,83] seem insightful for this purpose.
In many practical cases, we might have n i.i.d. samples of . We now study how behaves in n. Let and . Due to the i.i.d. assumption, we have . This can also be described by independently feeding , , to channel producing . The following theorem, demonstrated first in [3] (Theorem 2.4), gives a formula for in terms of n.
Theorem 5
(Additivity). We have
This theorem demonstrates that an optimal channel for i.i.d. samples is obtained by the Kronecker product of an optimal channel for . This, however, may not hold in general for , that is, we might have , see [13] (Proposition 1) for an example.
2.1. Gaussian and
In this section, we turn our attention to a special, yet important, case where , where and is independent of Y. This setting subsumes the popular case of jointly Gaussian whose information bottleneck functional was computed in [84] for the vector case (i.e., are jointly Gaussian random vectors).
Lemma 2.
Let be n i.i.d. copies of and where are i.i.d samples of independent of Y. Then, we have
It is worth noting that this result was concurrently proved in [85]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [25] (Theorem 2) which holds even if , , and are random vectors (as opposed to scalar). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on . This upper bound provides a significantly simpler proof for the well-known fact that for the jointly Gaussian , the optimal channel is Gaussian. This result was first proved in [26] and used in [84] to compute an expression of for the Gaussian case.
Corollary 1.
If are jointly Gaussian with correlation coefficient ρ, then we have
Moreover, the optimal channel is given by for where is the variance of Y.
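Since Corollary 1 states that the optimal channel for jointly Gaussian pairs is an additive Gaussian noise channel, the Gaussian IB trade-off can be traced numerically by sweeping the noise variance. A small sketch, assuming zero-mean unit-variance marginals with correlation coefficient ρ (the closed-form expressions below are for this additive-noise parameterization only):

```python
import numpy as np

def gaussian_ib_point(rho, sigma2):
    """(I(X;T), I(Y;T)) in bits for T = X + N, N ~ N(0, sigma2),
    with zero-mean, unit-variance jointly Gaussian (X, Y) of correlation rho."""
    i_xt = 0.5 * np.log2(1.0 + 1.0 / sigma2)
    i_yt = -0.5 * np.log2(1.0 - rho**2 / (1.0 + sigma2))
    return i_xt, i_yt

# sweep the noise variance to trace the compactness/informativeness curve
rho = 0.8
for sigma2 in [10.0, 1.0, 0.1, 0.01]:
    R, info = gaussian_ib_point(rho, sigma2)
    print(f"I(X;T) = {R:.3f} bits  ->  I(Y;T) = {info:.3f} bits")
```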
In Lemma 2, we assumed that X is a Gaussian perturbation of Y. However, in some practical scenarios, we might have Y as a Gaussian perturbation of X. For instance, let X represent an image and Y be a feature of the image that can be perfectly obtained from a noisy observation of X. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [27] (Theorem 1), gives an upper bound for in this case.
Lemma 3.
Let be n i.i.d. copies of a random variable X satisfying and be the result of passing , , through a Gaussian channel , where and is independent of X. Then, we have
where
is the Gaussian complementary CDF and for is the binary entropy function. Moreover, we have
Note that Lemma 3 holds for an arbitrary X (provided that ) and hence (9) bounds information bottleneck functionals for a wide family of . However, the bound is loose in general for large values of R. For instance, if are jointly Gaussian (implying for some ), then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Figure 2.
The privacy funnel functional is much less studied even for the simple case of jointly Gaussian. Solving the optimization in over without any assumptions is a difficult challenge. A natural assumption to make is that is Gaussian for each . This leads to the following variant of
where
and is independent of X. This formulation is tractable and can be computed in closed form for jointly Gaussian as described in the following example.
Example 1.
Let X and Y be jointly Gaussian with correlation coefficient ρ. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both X and Y are zero mean and unit variance and hence we can write where is independent of Y. Consequently, we have
and
In order to ensure , we must have . Plugging this choice of σ into (13), we obtain
This example indicates that for jointly Gaussian , we have if and only if (thus perfect privacy does not occur) and the constraint is satisfied by a unique . These two properties in fact hold for all continuous variables X and Y with finite second moments as demonstrated in Lemma A1 in Appendix A. We use these properties to derive a second-order approximation of when r is sufficiently small. For the following theorem, we use to denote the variance of the random variable U and . We use for short.
Theorem 6.
For any pair of continuous random variables with finite second moments, we have as
where and
It is worth mentioning that the quantity was first defined by Rényi [28] as an asymmetric measure of correlation between X and Y. In fact, it can be shown that where the supremum is taken over all measurable functions f and denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian X and Y with correlation coefficient for which was computed in Example 1. In this case, it can be easily verified that and . Hence, for jointly Gaussian with correlation coefficient and unit variance, we have . In Figure 3, we compare the approximation given in Theorem 6 with the exact expression for this particular case.
Figure 3.
Second-order approximation of according to Theorem 6 for jointly Gaussian X and Y with correlation coefficient . For this particular case, the exact expression of is computed in (14).
2.2. Evaluation of and
The constrained optimization problems in the definitions of and are usually challenging to solve numerically due to the non-linearity in the constraints. In practice, however, both and are often approximated by their corresponding Lagrangian optimizations
and
where is the Lagrangian multiplier that controls the tradeoff between compression and informativeness in for and the privacy and informativeness in . Notice that for the computation of , we can assume, without loss of generality, that since otherwise the maximizer of (15) is trivial. It is worth noting that and in fact correspond to lines of slope supporting from above and below, thereby providing a new representation of .
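In practice, the Lagrangian formulation is typically attacked with the self-consistent iterations of Tishby et al. [1]. The sketch below uses the classical reciprocal parameterization, minimizing I(X;T) − β I(Y;T) over the encoder, and converges only to a stationary point; it is included as an illustration rather than as the method analyzed in this section.

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, iters=200, seed=0):
    """Self-consistent iterations of Tishby et al. [1] for min_{P_{T|X}} I(X;T) - beta*I(Y;T)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    # random initialization of the encoder P_{T|X}
    p_t_given_x = rng.random((p_xy.shape[0], n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_t = p_x @ p_t_given_x                                      # marginal of T
        p_y_given_t = (p_xy.T @ p_t_given_x) / (p_t[None, :] + eps)  # decoder P_{Y|T}
        # kl[x, t] = KL( P_{Y|X=x} || P_{Y|T=t} )
        kl = np.sum(p_y_given_x[:, :, None]
                    * (np.log(p_y_given_x[:, :, None] + eps)
                       - np.log(p_y_given_t[None, :, :] + eps)), axis=1)
        # encoder update: P(t|x) proportional to P(t) * exp(-beta * KL)
        logits = np.log(p_t[None, :] + eps) - beta * kl
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```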
Let be a pair of random variables with for some and is the output of when the input is (i.e., ). Define
This function, in general, is neither convex nor concave in . For instance, is concave and is convex in . The lower convex envelope of is defined as the largest convex function smaller than . Similarly, the upper concave envelope of is defined as the smallest concave function larger than . Let and denote the lower convex and upper concave envelopes of , respectively. If is convex at , that is , then remains convex at for all because
where the last equality follows from the fact that is convex. Hence, at we have
Analogously, if is concave at , that is , then remains concave at for all .
Notice that, according to (15) and (16), we can write
and
In light of the above arguments, we can write
for all where is the smallest such that touches . Similarly,
for all where is the largest such that touches . In the following proposition, we show that and are given by the values of and , respectively, given in Theorem 4. Similar formulae for and were given in [86].
Proposition 1.
We have,
and
Kim et al. [80] have recently proposed an efficient algorithm to estimate from samples of involving a simple optimization problem. This algorithm can be readily adapted for estimating . Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier , that is
and
Remark 1.
As demonstrated by Kolchinsky et al. [53], the boundary points 0 and are required for the computation of . In fact, when Y is a deterministic function of X, then only and are required to compute the and other values of β are vacuous. The same argument can also be used to justify the inclusion of in computing . Note also that since becomes convex for , computing becomes trivial for such values of β.
Remark 2.
Observe that the lower convex envelope of any function f can be obtained by taking the Legendre-Fenchel transform (a.k.a. convex conjugate) twice. Hence, one can use existing linear-time algorithms for approximating the Legendre-Fenchel transform (e.g., [87,88]) to approximate .
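For one-dimensional (or discretized) problems, the lower convex envelope can alternatively be approximated directly from function samples by retaining only their lower convex hull; a minimal monotone-chain sketch (illustrative helper names):

```python
import numpy as np

def lower_convex_envelope(xs, fs):
    """Lower convex hull of the sampled points (xs[i], fs[i]); xs must be sorted."""
    xs, fs = np.asarray(xs), np.asarray(fs)
    hull = []                      # indices of points kept on the lower hull
    for i in range(len(xs)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # drop the middle point if it lies on or above the chord (not convex)
            cross = (xs[i1] - xs[i0]) * (fs[i] - fs[i0]) - (fs[i1] - fs[i0]) * (xs[i] - xs[i0])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    # piecewise-linear interpolation through the hull points
    return np.interp(xs, xs[hull], fs[hull])

xs = np.linspace(0.0, 1.0, 201)
fs = np.sin(6 * xs) + xs          # a non-convex test function
env = lower_convex_envelope(xs, fs)
```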
Once and are computed, we can derive and via standard results in optimization (see [3] (Section IV) for more details):
and
Following the convex analysis approach outlined by Witsenhausen and Wyner [3], and can be directly computed from and by observing the following. Suppose for some , (resp. ) at is obtained by a convex combination of points , for some in , integer , and weights (with ). Then , and with properties and attains the minimum (resp. maximum) of . Hence, is a point on the upper (resp. lower) boundary of ; implying that for (resp. for ). If for some , at coincides with , then this corresponds to . The same holds for . Thus, all the information about the functional (resp. ) is contained in the subset of the domain of (resp. ) over which it differs from . We will revisit and generalize this approach later in Section 3.
We can now instantiate this for the binary symmetric case. Suppose X and Y are binary variables and is binary symmetric channel with crossover probability , denoted by and defined as
for some . To describe the result in a compact fashion, we introduce the following notation: we let denote the binary entropy function, i.e., . Since this function is strictly increasing , its inverse exists and is denoted by . Moreover, for .
Lemma 4
(Mr. and Mrs. Gerber’s Lemma). For for and for , we have
and
where , , and .
The result in (24) was proved by Wyner and Ziv [2] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber’s Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping given by
where is the output distribution of when the input distribution is for some . It can be verified that . This function is depicted in Figure 4 for different values of .
Figure 4.
The mapping where and is the result of passing through BSC(0.1), see (26).
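For concreteness, the following sketch evaluates the standard Mrs. Gerber bound from [2], which upper bounds I(Y;T) whenever I(X;T) ≤ R in this binary symmetric setting (X ~ Bernoulli(p), Y the output of a BSC(δ) with input X); it is provided only as a numerical reference point for Lemma 4:

```python
import numpy as np

def hb(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def hb_inv(h, tol=1e-12):
    """Inverse of the binary entropy on [0, 1/2], via bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if hb(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conv(a, b):
    """Binary convolution a * b = a(1-b) + b(1-a)."""
    return a * (1 - b) + b * (1 - a)

def mrs_gerber_ib_bound(p, delta, R):
    """Upper bound on I(Y;T): hb(p * delta) - hb(delta * hb^{-1}(hb(p) - R))."""
    hx_given_t = max(hb(p) - R, 0.0)   # I(X;T) <= R  implies  H(X|T) >= hb(p) - R
    return hb(conv(p, delta)) - hb(conv(delta, hb_inv(hx_given_t)))

print(mrs_gerber_ib_bound(p=0.5, delta=0.1, R=0.5))
```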
2.3. Operational Meaning of and
In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both and . The operational interpretation of has recently been extensively studied in information-theoretic settings in [29,30]. In particular, it was shown that specifies the rate-distortion region of the noisy source coding problem [19,89] under logarithmic loss as the distortion measure and also the rate region of lossless source coding with side information at the decoder [90]. Here, we state the former setting (as it will be useful for our subsequent analysis of the cardinality bound) and also provide a new information-theoretic setting in which appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by . This in fact delineates an important difference between and : while describes the entire rate region of an information-theoretic setup, specifies only a corner point of a rate region. Other information-theoretic settings related to and include the CEO problem [91] and source coding for the Gray-Wyner network [92].
2.3.1. Noisy Source Coding
Suppose Alice has access only to a noisy version X of a source of interest Y. She wishes to transmit a rate-constrained description from her observation (i.e., X) to Bob such that he can recover Y with small average distortion. More precisely, let be n i.i.d. samples of . Alice encodes her observation through an encoder and sends to Bob. Upon receiving , Bob reconstructs a “soft” estimate of via a decoder where . That is, the reproduction sequence consists of n probability measures on . For any source and reproduction sequences and , respectively, the distortion is defined as
where
We say that a rate-distortion pair is achievable if there exists a pair of encoder and decoder such that
The noisy rate-distortion function for a given , is defined as the minimum rate such that is an achievable rate-distortion pair. This problem arises naturally in many data analytic problems. Some examples include feature selection of a high-dimensional dataset, clustering, and matrix completion. This problem was first studied by Dobrushin and Tsybakov [19], who showed that is analogous to the classical rate-distortion function
It can be easily verified that and hence (after relabeling as T)
where , which is equal to defined in (4). For more details on the connection between noisy source coding and , the reader is referred to [29,30,91,93]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by
This problem is addressed in [94] for discrete alphabets and and was recently extended in [95] to general alphabets.
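A reminder of the key identity behind this reduction (stated here informally, with Q_t denoting the soft reconstruction produced upon observing T = t): under logarithmic loss d(y,Q) = log(1/Q(y)), the optimal reconstruction is the posterior and its expected distortion equals the conditional entropy,

```latex
\mathbb{E}\big[d(Y,Q_T)\mid T=t\big]
  = \sum_{y} P_{Y|T}(y\mid t)\,\log\frac{1}{Q_t(y)}
  = H(Y\mid T=t) + D\big(P_{Y|T=t}\,\big\|\,Q_t\big)
  \;\ge\; H(Y\mid T=t),
```

with equality if and only if Q_t = P_{Y|T=t}; averaging over t shows that the minimum expected distortion equals H(Y|T), which is how a distortion constraint on Y translates into a constraint on I(Y;T).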
2.3.2. Test against Independence with Communication Constraint
As mentioned earlier, the connection between and noisy source coding, described above, was known and studied in [29,30]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for . Given n i.i.d. samples from joint distribution Q, we wish to test whether are independent of , that is, Q is a product distribution. This task is formulated by the following hypothesis test:
for a given joint distribution with marginals and . Ahlswede and Csiszár [20] investigated this problem under a communication constraint: While Y observations (i.e., ) are available, the X observations need to be compressed at rate R, that is, instead of , only is present where satisfies
For the type I error probability not exceeding a fixed , Ahlswede and Csiszár [20] derived the smallest possible type II error probability, defined as
The following gives the asymptotic expression of for every . For the proof, refer to [20] (Theorem 3).
Theorem 7
([20]). For every and , we have
In light of this theorem, specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.
2.3.3. Dependence Dilution
Inspired by the problems of information amplification [96] and state masking [97], Asoodeh et al. [21] proposed the dependence dilution setup as follows. Consider a source sequence of n i.i.d. copies of . Alice observes the source and wishes to encode it via the encoder
for some . The goal is to ensure that any user observing can construct a list, of fixed size, of sequences in that contains likely candidates of the actual sequence while revealing negligible information about a correlated source . To formulate this goal, consider the decoder
where denotes the power set of . A dependence dilution triple is said to be achievable if, for any , there exists a pair of encoder and decoder such that for sufficiently large n
having fixed size where and simultaneously
Notice that without the side information J, the decoder can only construct a list of size which contains with probability close to one. However, after J is observed and the list is formed, the decoder’s list size can be reduced to , thus reducing the uncertainty about by . This observation can be formalized to show (see [96] for details) that the constraint (32) is equivalent to
which lower bounds the amount of information J carries about . Building on this equivalent formulation, Asoodeh et al. [21] (Corollary 15) derived a necessary condition for achievable dependence dilution triples.
Theorem 8
([21]). Any achievable dependence dilution triple satisfies
for some auxiliary random variable T satisfying Y - X - T and taking values.
According to this theorem, specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate . While this informs the operational interpretation of , Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples . It is, however, not clear that characterizes the rate region of an information-theoretic setup.
The fact that fully characterizes the rate region of a source coding setup has an important consequence: the cardinality of the auxiliary random variable T in can be improved to instead of .
2.4. Cardinality Bound
Recall that in the definition of in (4), no assumption was imposed on the auxiliary random variable T. A straightforward application of the Carathéodory-Fenchel-Eggleston theorem (see e.g., [98] (Section III) or [79] (Lemma 15.4)) reveals that is attained for T taking values in a set with cardinality . Here, we improve this bound and show that is sufficient.
Theorem 9.
For any joint distribution and , information bottleneck is achieved by T taking at most values.
The proof of this theorem hinges on the operational characterization of as the lower boundary of the rate-distortion region of the noisy source coding problem discussed in Section 2.3. Specifically, we first show that the extreme points of this region are achieved by T taking values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such T. It must be mentioned that this result was already claimed by Harremoës and Tishby in [99] without proof.
In many practical scenarios, the feature X has a large alphabet. Hence, the bound , albeit optimal, can still make the information bottleneck function computationally intractable over large alphabets. However, the label Y usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for T in terms of , one can consider approximating assuming T takes N values. The following result, recently proved by Hirche and Winter [100], is in this spirit.
Theorem 10
([100]). For any , we have
where and denotes the information bottleneck functional (4) with the additional constraint that .
Recall that, unlike , the graph of characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section 2.3), and hence any boundary points can be constructed via time-sharing of extreme points of the rate region. This lack of operational characterization of translates into a worse cardinality bound than that of . In fact, for the cardinality bound cannot be improved in general. To demonstrate this, we numerically solve the optimization in assuming that when both X and Y are binary. As illustrated in Figure 5, this optimization does not lead to a convex function, and hence, cannot be equal to .
Figure 5.
The set with , , , and T restricted to be binary. While the upper boundary of this set is concave, the lower boundary is not convex. This implies that, unlike , cannot be attained by binary variables T.
2.5. Deterministic Information Bottleneck
As mentioned earlier, formalizes an information-theoretic approach to clustering a high-dimensional feature X into cluster labels T that preserve as much information about the label Y as possible. The clustering label is assigned by the soft operator that solves the formulation (4) according to the rule: is likely assigned label if is small, where . That is, clustering is based on the similarity of conditional distributions. Since in many practical scenarios a hard clustering operator is preferred, Strouse and Schwab [31] suggested the following variant of , termed the deterministic information bottleneck
where the maximization is taken over all deterministic functions f whose range is a finite set . Similarly, one can define
One way to ensure that for a deterministic function f is to restrict the cardinality of the range of f: if then is necessarily smaller than R. Using this insight, we derive a lower bound for in the following lemma.
Lemma 5.
For any given , we have
and
Note that both R and r are smaller than and thus the multiplicative factors of in the lemma are smaller than one. In light of this lemma, we can obtain
and
In most practical setups, might be very large, making the above lower bound for vacuous. In the following lemma, we partially address this issue by deriving a bound independent of when Y is binary.
Lemma 6.
Let be a joint distribution of arbitrary X and binary for some . Then, for any we have
where .
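Because dIB restricts attention to deterministic mappings, it can be computed exactly by exhaustive search when the alphabet of X is small. The brute-force sketch below (exponential in the alphabet size, illustrative names) assumes the compression constraint is H(f(X)) ≤ R, as the discussion above suggests:

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_joint):
    return entropy(p_joint.sum(axis=1)) + entropy(p_joint.sum(axis=0)) - entropy(p_joint.ravel())

def deterministic_ib(p_xy, n_t, R):
    """max_f I(Y; f(X)) over deterministic f: X -> {0,...,n_t-1} with H(f(X)) <= R."""
    n_x = p_xy.shape[0]
    best = 0.0
    for f in product(range(n_t), repeat=n_x):       # all deterministic mappings
        p_ty = np.zeros((n_t, p_xy.shape[1]))
        for x, t in enumerate(f):
            p_ty[t] += p_xy[x]                      # P(T=t, Y=y) = sum_{x: f(x)=t} P(x, y)
        if entropy(p_ty.sum(axis=1)) <= R:          # compression constraint H(T) <= R
            best = max(best, mutual_information(p_ty))
    return best

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.20],
                 [0.05, 0.25]])
print(deterministic_ib(p_xy, n_t=2, R=1.0))
```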
3. Family of Bottleneck Problems
In this section, we introduce a family of bottleneck problems by extending and to a large family of statistical measures. Similar to and , these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution . Recall that and are the upper and lower boundary of the set defined in (6) and expressed here again for convenience
Since is given, and are fixed. Thus, in characterizing it is sufficient to consider only and . To generalize and , we must therefore generalize and .
Given a joint distribution and two non-negative real-valued functions and , we define
and
When and , we interchangeably write for and for .
These definitions provide natural generalizations of Shannon’s entropy and mutual information. Moreover, as we discuss later in Section 3.2 and Section 3.3, they can also be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information- and estimation-theoretic quantities such as Arimoto’s conditional entropy of order for , the probability of correctly guessing for , maximal correlation for the binary case, and f-information for given by an f-divergence. We are able to generate a family of bottleneck problems using different instantiations of and in place of mutual information in and . As we argue later, these problems better capture the essence of “informativeness” and “privacy”, thus providing analytical and interpretable guarantees similar in spirit to and .
Computing these bottleneck problems in general boils down to the following optimization problems
and
Consider the set
Note that if both and are continuous (with respect to the total variation distance), then is compact. Moreover, it can be easily verified that is convex. Hence, its upper and lower boundaries are well-defined and are characterized by the graphs of and , respectively. As mentioned earlier, these functionals are instrumental for computing the general bottleneck problems later. Hence, before we delve into examples of bottleneck problems, we extend the approach given in Section 2.2 to compute and .
3.1. Evaluation of and
Analogous to Section 2.2, we first introduce the Lagrangians of and as
and
where is the Lagrange multiplier. Let be a pair of random variables with and is the result of passing through the channel . Letting
we obtain that
recalling that and are the upper concave and lower convex envelope operators. Once we compute and for all , we can use standard results in optimization theory (similar to (21) and (22)) to recover and . However, we can instead extend the approach of Witsenhausen and Wyner [3] described in Section 2.2. Suppose for some , (resp. ) at is obtained by a convex combination of points , for some in , integer , and weights (with ). Then , and with properties and attains the maximum (resp. minimum) of , implying that is a point on the upper (resp. lower) boundary of . Consequently, such satisfies for (resp. for ). The algorithm to compute and is then summarized in the following three steps:
- Construct the functional for and and all and .
- Compute and evaluated at .
- If for distributions in for some , we have or for some satisfying , then , and give the optimal in and , respectively.
We will apply this approach to analytically compute and (and the corresponding bottleneck problems) for binary cases in the following sections.
3.2. Guessing Bottleneck Problems
Let be given with marginals and and the corresponding channel . Let also be an arbitrary distribution on and be the output distribution of when fed with . Any channel , together with the Markov structure Y - X - T, generates unique and . We need the following basic definition from statistics.
Definition 1.
Let U be a discrete and V an arbitrary random variable supported on and with , respectively. Then the probability of correctly guessing U and the probability of correctly guessing U given V are given by
and
Moreover, the multiplicative gain of the observation V in guessing U is defined as (the reason for ∞ in the notation becomes clear later)
As the names suggest, and characterize the optimal efficiency of guessing U with or without the observation V, respectively. Intuitively, quantifies how useful the observation V is in estimating U: If it is small, then it means it is nearly as hard for an adversary observing V to guess U as it is without V. This observation motivates the use of as a measure of privacy in lieu of in .
It is worth noting that is not symmetric in general, i.e., . Since observing T can only improve guessing, we have ; thus . However, does not necessarily imply independence of Y and T; instead, it means that T is useless in estimating Y. As an example, consider and and with . Then and
Thus, if , then . This then implies that whereas Y and T are clearly dependent; i.e., . While in general and are not related, it can be shown that if Y is uniform (see [65] (Proposition 1)). Hence, only with this uniformity assumption, implies the independence.
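For finite alphabets, the quantities in Definition 1 reduce to the usual MAP guessing probabilities, and the multiplicative gain is their ratio; a short sketch (hypothetical helper names):

```python
import numpy as np

def p_correct(p_y):
    """Probability of correctly guessing Y without side information (MAP guess)."""
    return float(np.max(p_y))

def p_correct_given(p_yt):
    """Probability of correctly guessing Y from T: sum_t max_y P(Y=y, T=t)."""
    return float(np.sum(np.max(p_yt, axis=0)))

p_yt = np.array([[0.35, 0.15],     # rows: y, columns: t
                 [0.10, 0.40]])
p_y = p_yt.sum(axis=1)
gain = p_correct_given(p_yt) / p_correct(p_y)   # the multiplicative gain in Definition 1
print(p_correct(p_y), p_correct_given(p_yt), gain)
```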
Consider and . Clearly, we have . Note that
thus both measures and are special cases of the models described in the previous section. In particular, we can define the corresponding and . We will see later that and correspond to Arimoto’s mutual information of orders 1 and ∞, respectively. Define
This bottleneck functional comes with an interpretable guarantee:
Recall that the functional aims at extracting maximum information of X while protecting privacy with respect to Y. Measuring the privacy in terms of , this objective can be better formulated by
with the interpretable privacy guarantee:
Notice that the variable T in the formulations of and takes values in a set of arbitrary cardinality. However, a straightforward application of the Carathéodory-Fenchel-Eggleston theorem (see e.g., [79] (Lemma 15.4)) reveals that the cardinality of can be restricted to without loss of generality. In the following lemma, we prove more basic properties of and .
Lemma 7.
For any with Y supported on a finite set , we have
- .
- for any and for .
- is strictly increasing and concave on the range .
- is strictly increasing, and convex on the range .
The proof follows the same lines as Theorem 1 and is hence omitted. Lemma 7 in particular implies that the inequalities and in the definitions of and can be replaced by and , respectively. It can be verified that satisfies the data-processing inequality, i.e., for the Markov chain Y - X - T. Hence, both and must be smaller than . The properties listed in Lemma 7 enable us to derive a slightly tighter upper bound for as demonstrated in the following.
Lemma 8.
For any with Y supported on a finite set , we have
and
The proof of this lemma (and any other results in this section) is given in Appendix B. This lemma shows that the gap between and when R is sufficiently close to behaves like
Thus, approaches as at least linearly.
In the following theorem, we apply the technique delineated in Section 3.1 to derive closed form expressions for and for the binary symmetric case, thereby establishing results similar to Mr. and Mrs. Gerber’s Lemma.
Theorem 11.
For and with , we have
and
where .
As described in Section 3.1, to compute and it suffices to derive the convex and concave envelopes of the mapping where and is the result of passing through , i.e., . In this case, and can be expressed as
This function is depicted in Figure 6.
Figure 6.
The mapping where and .
The detailed derivation of the convex and concave envelopes of is given in Appendix B. The proof of this theorem also reveals the following intuitive statements. If and , then among all random variables T satisfying Y - X - T and , the minimum is given by . Notice that, without any information constraint (i.e., ), . Perhaps surprisingly, this shows that the mutual information constraint has a linear effect on the privacy of Y. Similarly, to prove (51), we show that among all R-bit representations T of X, the best achievable accuracy is given by . This can be proved by combining Mrs. Gerber’s Lemma (cf. Lemma 4) and Fano’s inequality as follows. For all T such that , the minimum of is given by . Since, by Fano’s inequality, , we obtain , which leads to the same result as above. Nevertheless, in Appendix B we give another proof based on the discussion of Section 3.1.
3.3. Arimoto Bottleneck Problems
The bottleneck framework proposed in the last section benefited from interpretable guarantees brought forth by the quantity . In this section, we define a parametric family of statistical quantities, the so-called Arimoto’s mutual information, which includes both Shannon’s mutual information and as extreme cases.
Definition 2
([22]). Let and be two random variables supported over finite sets and , respectively. Their Arimoto’s mutual information of order is defined as
where
is the Rényi entropy of order α and
is the Arimoto’s conditional entropy of order α.
By continuous extension, one can define for and as and , respectively. That is,
Arimoto’s mutual information was first introduced by Arimoto [22] and then later revisited by Liese and Vajda in [101] and more recently by Verdú in [102]. More in-depth analysis and properties of can be found in [103]. It is shown in [71] (Lemma 1) that for quantifies the minimum loss in recovering U given V where the loss is measured in terms of the so-called -loss. This loss function reduces to logarithmic loss (27) and for and , respectively. This sheds light on the utility and/or privacy guarantee promised by a constraint on Arimoto’s mutual information. It is now natural to use for defining a family of bottleneck problems.
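For finite alphabets, Definition 2 can be evaluated directly from the Rényi entropy and Arimoto's conditional entropy. A sketch using the standard formulas from [22] for orders α > 0, α ≠ 1 (with α → 1 and α → ∞ recovered as limits, consistent with the guessing quantities of Section 3.2):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (in bits), alpha > 0, alpha != 1."""
    return float(np.log2(np.sum(p ** alpha)) / (1.0 - alpha))

def arimoto_conditional_entropy(p_uv, alpha):
    """Arimoto's conditional entropy H_alpha(U|V) for a joint pmf p_uv[u, v]."""
    inner = np.sum(p_uv ** alpha, axis=0) ** (1.0 / alpha)   # one term per value of V
    return float((alpha / (1.0 - alpha)) * np.log2(np.sum(inner)))

def arimoto_mi(p_uv, alpha):
    """I_alpha(U;V) = H_alpha(U) - H_alpha(U|V)."""
    return renyi_entropy(p_uv.sum(axis=1), alpha) - arimoto_conditional_entropy(p_uv, alpha)

p_uv = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
for alpha in (0.5, 2.0, 50.0):   # large alpha approaches the log of the multiplicative guessing gain
    print(alpha, arimoto_mi(p_uv, alpha))
```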
Definition 3.
Given a pair of random variables over finite sets and and , we define and as
and
Of course, and . It is known that Arimoto’s mutual information satisfies the data-processing inequality [103] (Corollary 1), i.e., for the Markov chain Y - X - T. On the other hand, . Thus, both and equal for . Note also that where (see (39)) corresponding to the function . Consequently, and are characterized by the lower and upper boundary of , defined in (37), with respect to and . Specifically, we have
where , and
where and and . This paves the way to apply the technique described in Section 2.2 to compute and . Doing so requires the upper concave and lower convex envelope of the mapping for some , where . In the following theorem, we derive these envelopes and give closed form expressions for and for a special case where .
Theorem 12.
Let and with . We have for
where for and solves
Moreover,
where and solves
By letting , this theorem indicates that for X and Y connected through and all variables T forming Y - X - T, we have
which can be shown to be achieved by the following channel (see Figure 7)
Note that, by assumption, , and hence the event is less likely than . Therefore, (61) demonstrates that to ensure correct recoverability of X with probability at least , the most private approach (with respect to Y) is to obfuscate the more likely event with probability . As demonstrated in (61), the optimal privacy guarantee is linear in the utility parameter in the binary symmetric case. This is in fact a special case of a larger result recently proved in [65] (Theorem 1): the infimum of over all variables T such that is piecewise linear in , or equivalently, the mapping is piecewise linear.
Figure 7.
The structure of the optimal for when and with . If the accuracy constraint is (or equivalently ), then the parameter of optimal is given by , leading to .
Computing analytically for every seems to be challenging; however, the following lemma provides bounds for and in terms of and , respectively.
Lemma 9.
For any pair of random variables over finite alphabets and , we have
and
where and .
The previous lemma can be directly applied to derive upper and lower bounds for and given and .
3.4. f-Bottleneck Problems
In this section, we describe another instantiation of the general framework introduced in terms of functions and that enjoys an interpretable estimation-theoretic guarantee.
Definition 4.
Let be a convex function with . Furthermore, let U and V be two real-valued random variables supported over and , respectively. Their f-information is defined by
where is the f-divergence [104] between distributions and defined as
Due to the convexity of f, we have and hence f-information is always non-negative. If, furthermore, f is strictly convex at 1, then equality holds if and only if U and V are independent. Csiszár introduced f-divergence in [104] and applied it to several problems in statistics and information theory. More recent developments on the properties of f-divergence and f-information can be found in [23] and the references therein. Any convex function f with the property results in an f-information. Popular examples include corresponding to Shannon’s mutual information, corresponding to T-information [83], and also corresponding to -information [69] for . It is worth mentioning that if we allow to be in in Definition 2 (similar to [101]), then the resulting Arimoto’s mutual information can be shown to be an f-information in the binary case for a certain function f; see [101] (Theorem 8).
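For finite alphabets, f-information is again easy to evaluate directly from the joint distribution. The sketch below (illustrative; not from the paper) computes it for the three choices of f just mentioned, namely Shannon's mutual information, T-information (total variation), and χ²-information.

```python
import numpy as np

def f_information(p_uv, f):
    """I_f(U;V) = D_f(P_UV || P_U x P_V) for a joint matrix p_uv indexed [u, v]."""
    p_uv = np.asarray(p_uv, dtype=float)
    q = p_uv.sum(axis=1, keepdims=True) * p_uv.sum(axis=0, keepdims=True)  # product of marginals
    ratio = np.divide(p_uv, q, out=np.zeros_like(p_uv), where=q > 0)
    return float(np.sum(q * f(ratio)))

# Convex generators with f(1) = 0.
f_kl  = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)  # Shannon MI (nats)
f_tv  = lambda t: 0.5 * np.abs(t - 1.0)                                      # T-information
f_chi = lambda t: t ** 2 - 1.0                                               # chi^2-information

p_uv = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
for name, f in [("Shannon", f_kl), ("total variation", f_tv), ("chi^2", f_chi)]:
    print(f"{name}: {f_information(p_uv, f):.4f}")
```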
Let be given with marginals and . Consider functions and on and defined as
Given a conditional distribution , it is easy to verify that and . This in turn implies that f-information can be utilized in (40) and (41) to define general bottleneck problems: Let and be two convex functions satisfying . Then we define
and
In light of the discussion in Section 3.1, the optimization problems in and can be analytically solved by determining the upper concave and lower convex envelope of the mapping
where is the Lagrange multiplier and .
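Numerically, such envelopes can be approximated on a grid. The sketch below (illustrative; the Lagrangian-type map F(q) = h(δ ⋆ q) − βh(q), with h the binary entropy and ⋆ binary convolution, is used only as a stand-in for the maps arising in the binary evaluations) computes the lower convex and upper concave envelopes of a sampled function via a monotone-chain lower hull.

```python
import numpy as np

def h(q):
    """Binary entropy in nats."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def lower_convex_envelope(xs, ys):
    """Lower convex envelope of the sampled graph {(x_i, y_i)} via a monotone-chain lower hull."""
    hull = []
    for x, y in sorted(zip(xs, ys)):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):  # middle point is not below the chord
                hull.pop()
            else:
                break
        hull.append((x, y))
    hx, hy = zip(*hull)
    return np.interp(xs, hx, hy)

# Stand-in Lagrangian-type map on [0, 1]: F(q) = h(delta * q) - beta * h(q),
# where delta * q = delta (1 - q) + (1 - delta) q denotes binary convolution.
delta, beta = 0.1, 0.6
q = np.linspace(0.0, 1.0, 1001)
F = h(delta * (1 - q) + (1 - delta) * q) - beta * h(q)

convex_env = lower_convex_envelope(q, F)          # lower convex envelope of F
concave_env = -lower_convex_envelope(q, -F)       # upper concave envelope of F
print("max gap F - convex envelope :", float(np.max(F - convex_env)))
print("max gap concave envelope - F:", float(np.max(concave_env - F)))
```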
Consider the function with . The corresponding f-divergence is sometimes called the Hellinger divergence of order ; see, e.g., [105]. Note that the Hellinger divergence of order 2 reduces to the -divergence. Calmon et al. [68] and Asoodeh et al. [67] showed that if for some , then the minimum mean-squared error (MMSE) of reconstructing any zero-mean unit-variance function of Y given T is lower bounded by ; i.e., no function of Y can be reconstructed with small MMSE given an observation of T. This result serves as a natural justification for as an operational measure of both privacy and utility in a bottleneck problem.
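This χ²-based guarantee is easy to verify numerically: since every principal inertia component of the pair is dominated by their sum, the MMSE of a zero-mean unit-variance function of Y given T is at least 1 minus the χ²-information (the exact constant in the omitted bound above may differ; this form follows from the maximal-correlation argument). The sketch below (illustrative; the joint distribution and the function of Y are randomly generated, not from the paper) checks this inequality for one such function.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi2_information(p_yt):
    """chi^2-information between the row variable Y and the column variable T."""
    q = p_yt.sum(axis=1, keepdims=True) * p_yt.sum(axis=0, keepdims=True)
    return float(np.sum(p_yt ** 2 / q) - 1.0)

def mmse_given_t(f_vals, p_yt):
    """MMSE of estimating a zero-mean, unit-variance f(Y) from T."""
    p_t = p_yt.sum(axis=0)
    cond_mean = (f_vals @ p_yt) / p_t           # E[f(Y) | T = t]
    return 1.0 - float(np.sum(p_t * cond_mean ** 2))

# Random joint distribution that is close to a product (so the bound is informative).
p_yt = np.outer(rng.random(4), rng.random(3)) + 0.05 * rng.random((4, 3))
p_yt /= p_yt.sum()
p_y = p_yt.sum(axis=1)

# A random function of Y, standardized to zero mean and unit variance under P_Y.
f_vals = rng.standard_normal(4)
f_vals -= np.dot(p_y, f_vals)
f_vals /= np.sqrt(np.dot(p_y, f_vals ** 2))

eps = chi2_information(p_yt)
print("chi^2-information:", round(eps, 4))
print("MMSE of f(Y) given T:", round(mmse_given_t(f_vals, p_yt), 4), ">= 1 - eps =", round(1 - eps, 4))
```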
Unfortunately, our approach described in Section 3.1 cannot be used to compute or in the binary symmetric case. The difficulty lies in the fact that the function , defined in (66), for the binary symmetric case is either convex or concave on its entire domain depending on the value of . Nevertheless, one can consider Hellinger divergence of order with and then apply our approach to compute or . Since (see [106] (Corollary 5.6)), one can justify as a measure of privacy and utility in a similar way as .
We end this section with a remark about estimating the measures studied in this section. While we consider the information-theoretic regime where the underlying distribution is known, in practice only samples are given. Consequently, the de facto guarantees of bottleneck problems might be considerably different from those shown in this work. It is therefore essential to assess the guarantees of bottleneck problems when accessing only samples. To do so, one must derive bounds on the discrepancy between , , and computed on the empirical distribution and on the true (unknown) distribution. These bounds can then be used to shed light on the de facto guarantees of the bottleneck problems. Relying on [34] (Theorem 1), one can show that the gaps between the measures , , and computed on the empirical distribution and on the true one scale as , where n is the number of samples. This is in contrast with mutual information, for which the corresponding upper bound scales as , as shown in [33]. Therefore, the above measures appear to be easier to estimate than mutual information.
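As a toy illustration of the last point, the sketch below (illustrative; the alphabet size and distribution are arbitrary) estimates a probability-of-correct-guessing quantity, one of the building blocks of the measures above, from the empirical distribution of n i.i.d. samples; the gap to the true value shrinks roughly like 1/√n.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary true joint distribution over a 4 x 4 alphabet.
p_xy = rng.random((4, 4))
p_xy /= p_xy.sum()

def pc(p):
    """Probability of correctly guessing X (rows) from Y (columns): sum_y max_x P(x, y)."""
    return p.max(axis=0).sum()

true_val = pc(p_xy)
for n in [100, 1_000, 10_000, 100_000]:
    counts = rng.multinomial(n, p_xy.ravel()).reshape(p_xy.shape)
    emp = counts / n
    print(f"n = {n:>7}: |empirical - true| = {abs(pc(emp) - true_val):.4f}")
```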
4. Summary and Concluding Remarks
Following the recent surge in the use of information bottleneck () and privacy funnel () in developing and analyzing machine learning models, we investigated the functional properties of these two optimization problems. Specifically, we showed that and correspond to the upper and lower boundary of a two-dimensional convex set Y – X – T}, where represents the observable data X and target feature Y and the auxiliary random variable T varies over all possible choices satisfying the Markov relation Y – X – T. This unifying perspective on and allowed us to adapt the classical technique of Witsenhausen and Wyner [3], devised for computing , to be applicable for as well. We illustrated this by deriving a closed form expression for in the binary case—a result reminiscent of Mrs. Gerber’s Lemma [2] in the information theory literature. We then showed that both and are closely related to several information-theoretic coding problems such as noisy source coding, hypothesis testing against independence, and dependence dilution. While these connections were partially known in previous work (see, e.g., [29,30]), we showed that they lead to an improvement on the cardinality bound of T for computing . We then turned our attention to the continuous setting where X and Y are continuous random variables. Solving the optimization problems in and in this case without any further assumptions seems difficult in general and leads to theoretical results only when is jointly Gaussian. Invoking recent results on the entropy power inequality [25] and the strong data processing inequality [27], we obtained tight bounds on in two different cases: (1) when Y is a Gaussian perturbation of X and (2) when X is a Gaussian perturbation of Y. We also utilized the celebrated I-MMSE relationship [107] to derive a second-order approximation of when T is considered to be a Gaussian perturbation of X.
In the second part of the paper, we argued that the choice of (Shannon’s) mutual information in both and does not seem to carry specific operational significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [1] that can be solved iteratively (though without any convergence guarantee). In fact, this property is unique to mutual information among the existing information measures [99]. Nevertheless, we argued that other information measures might lead to more interpretable guarantees for both and . For instance, statistical accuracy in and privacy leakage in can be shown to be precisely characterized by the probability of correctly guessing (aka Bayes risk) or the minimum mean-squared error (MMSE). Following this observation, we introduced a large family of optimization problems, which we call bottleneck problems, by replacing mutual information in and with Arimoto’s mutual information [22] or f-information [23]. Invoking results from [33,34], we also demonstrated that these information measures are in general easier to estimate from data than mutual information. Similar to and , the bottleneck problems were shown to be fully characterized by the boundaries of a two-dimensional convex set parameterized by two real-valued non-negative functions and . This perspective enabled us to generalize the technique used to compute and for evaluating bottleneck problems. Applying this technique to the binary case, we derived closed form expressions for several bottleneck problems.
Author Contributions
All authors contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding
This material is based upon work supported by the National Science Foundation under Grant No. CIF 1900750 and CIF CAREER 1845852.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs from Section 2
Proof of Theorem 1.
- Note that in optimization problem (4) implies that X and T are independent. Since and T form the Markov chain Y – X – T, independence of X and T implies independence of Y and T and thus . Similarly for .
- Since for any random variable T, we have that satisfies the information constraint for . Since , this choice is optimal. Similarly for , the constraint for implies . Hence, .
- The upper bound on follows from the data processing inequality: for all T satisfying the Markov condition Y – X – T.
- To prove the lower bound on , note that
- The concavity of follows from the fact that it is the upper boundary of the convex set , defined in (6). This in turn implies the continuity of . Monotonicity of follows from the definition. Strict monotonicity follows from the convexity and the fact that .
- Similar as above.
- The differentiability of the map follows from [94] (Lemma 6). This result in fact implies the differentiability of the map as well. Continuity of the derivative of and on is a straightforward application of [108] (Theorem 25.5).
- Monotonicity of mappings and follows from the concavity and convexity of and , respectively.
Proof of Theorem 3.
Recall that, according to Theorem 1, the mappings and are concave and convex, respectively. This implies that (resp. ) lies above (resp. below) the chord connecting and . This proves the lower bound (resp. upper bound) (resp. ).
In light of the convexity of and monotonicity of , we can write
where the last equality is due to [13] (Lemma 4) and is the output distribution of the channel when the input is distributed according to . Similarly, we can write
where the last equality is due to [82] (Theorem 4). □
Proof of Theorem 5.
Let be an optimal summary of , that is, it satisfies Tn – Xn – Yn and . We can write
and hence, if , then we have
We can similarly write
Since we have Xk – Yk for every , we conclude from the above inequality that
where the last inequality follows from concavity of the map and (A1). Consequently, we obtain
To prove the other direction, let be an optimal channel in the definition of , i.e., and . Then using this channel n times for each pair , we obtain satisfying Tn – Xn – Yn. Since and , we have . This, together with (A3), concludes the proof. □
Proof of Theorem 4.
First notice that
where the last equality is due to [82] (Theorem 4). Similarly,
where the last equality is due to [13] (Lemma 4).
Fix with and let T be a Bernoulli random variable specified by the following channel
for some . This channel induces , , and
It can be verified that
and
Setting
we obtain
and hence
Since is arbitrary, the result follows. The proof for follows similarly. □
Proof of Lemma 1.
When Y is an erasure of X, i.e., with and , it is straightforward to verify that for every and in . Consequently, we have
Hence, Theorem 3 gives the desired result.
To prove the second part, i.e., when X is an erasure of Y, we need an improved upper bound of . Notice that if perfect privacy occurs for a given , then the upper bound for in Theorem 3 can be improved:
where is the largest such that . Here, we show that . This suffices to prove the result since, combining (A4) with Theorem 1, we have
To show that , consider the channel given by and . It can be verified that this channel induces a T that is independent of Y and that
where is the binary entropy function. □
Proof of Lemma 4.
Consider the problem of minimizing the Lagrangian (20) for . Let for some and be the result of passing through , i.e., . Recall that . It suffices to compute the upper concave envelope of . It can be verified that and hence for all , . A straightforward computation shows that is symmetric around and is also concave in a region around , where it reaches its local maximum. Hence, if is such that
Hence, assuming , we can construct that maximizes in three different cases, corresponding to the three cases above:
- In the first case, is binary and we have and with .
- In the second case, is ternary and we have , , and with for some .
- In the third case, is again binary and we have and with for some .
Combining these three cases, we obtain the result in (25). □
Proof of Lemma 2.
Let where and is independent of Y. According to the improved entropy power inequality proved in [25] (Theorem 1), we can write
for any random variable T forming Y – X – T. This, together with Theorem 5, implies the result. □
Proof of Corollary 1.
Since are jointly Gaussian, we can write where and is the variance of Y. Applying Lemma 2 and noticing that , we obtain
for all channels satisfying Y – X – T. This bound is attained by Gaussian . Specifically, assuming where for and independent of X, it can be easily verified that and . This, together with (A5), implies □
Next, we wish to prove Theorem 6. However, we need the following preliminary lemma before we delve into its proof.
Lemma A1.
Let X and Y be continuous correlated random variables with and . Then the mappings and are continuous, strictly decreasing, and
Proof.
The finiteness of and implies that and are finite. A straightforward application of the entropy power inequality (cf. [109] (Theorem 17.7.3)) implies that is also finite. Thus, and are well-defined. According to the data processing inequality, we have for all and also , where equality occurs if and only if X and Y are independent. Since, by assumption, X and Y are correlated, it follows that . Thus, both and are strictly decreasing.
For the proof of continuity, we consider the two cases and separately. We first give the proof for . Since , we have and thus that is equal to . For , let be a sequence of positive numbers converging to . In light of de Bruijn’s identity (cf. [109] (Theorem 17.7.2)), we have , implying the continuity of .
Next, we prove the continuity of . For the sequence of positive numbers converging to , we have . We only need to show . Invoking again de Bruijn’s identity, we obtain for each . The desired result follows from the dominated convergence theorem. Finally, the continuity of when follows from [110] (p. 2028), stating that , and then applying the dominated convergence theorem.
Note that
where is the variance of X and the last inequality follows from the fact that is maximized when X is Gaussian. Since by assumption , it follows that both and converge to zero as . □
In light of this lemma, there exists a unique such that . Let denote such . Therefore, we have . This enables us to prove Theorem 6.
Proof of Theorem 6.
The proof relies on the I-MMSE relation in the information theory literature. We briefly describe it here for convenience. Given any pair of random variables U and V, the minimum mean-squared error (MMSE) of estimating U given V is given by
where the infimum is taken over all measurable functions f and . Guo et al. [107] proved the following identity, referred to as the I-MMSE formula, relating the input-output mutual information of the additive Gaussian channel , where is independent of X, to the MMSE of the input given the output:
Since Y, X, and form the Markov chain Y – X – Tσ, it follows that . Thus, two applications of (A6) yield
The second derivatives of and are also known via the formula [111] (Proposition 9)
With these results in mind, we now begin the proof. Recall that is the unique such that , thus implying . We have
To compute the derivative of , we therefore need to compute the derivative of with respect to r. To do so, notice that from the identity we can obtain
implying
Plugging this identity into (A9) and invoking (A7), we obtain
The second derivative can be obtained via (A8)
Since as , we can write
where is the variance of the conditional expectation of X given Y and the last equality comes from the law of total variance, and
Taylor expansion of around gives the result. □
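As a side note, the I-MMSE identity invoked above can be sanity-checked numerically in the purely Gaussian case, where both sides are available in closed form: I(snr) = ½ log(1 + snr·σ²) and mmse(snr) = σ²/(1 + snr·σ²). The sketch below (illustrative; not part of the proof) compares a finite-difference derivative of I with half the MMSE.

```python
import numpy as np

# Gaussian sanity check of the I-MMSE relation d/d(snr) I(X; sqrt(snr) X + N) = mmse/2,
# with X ~ N(0, var) and N ~ N(0, 1) independent of X.
var = 2.0

def mutual_info(snr):
    return 0.5 * np.log(1.0 + snr * var)

def mmse(snr):
    return var / (1.0 + snr * var)

h_step = 1e-6
for snr in [0.1, 0.5, 1.0, 5.0]:
    deriv = (mutual_info(snr + h_step) - mutual_info(snr - h_step)) / (2 * h_step)
    print(f"snr = {snr}: dI/dsnr = {deriv:.6f}, mmse/2 = {0.5 * mmse(snr):.6f}")
```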
Proof of Theorem 9.
The main ingredient of this proof is a result by Jana [112] (Lemma 2.2), which provides a tight cardinality bound for the auxiliary random variables in the canonical problems of network information theory (including the noisy source coding problem described in Section 2.3). Consider a pair of random variables and let be an arbitrary distortion measure defined for an arbitrary reconstruction alphabet .
Theorem A1
([112]). Let be the set of all pairs satisfying
for some mapping and some joint distributions . Then every extreme point of corresponds to some choice of auxiliary variable T with alphabet size .
Measuring the distortion in the above theorem in terms of the logarithmic loss as in (27), we obtain that
where is given in (29). We observed in Section 2.3 that is fully characterized by the mapping and thus by . In light of Theorem A1, all extreme points of are achieved by a choice of T with cardinality . Let be the set of extreme points of , each constructed by a channel and a mapping . Due to the convexity of , each point is expressed as a convex combination of with coefficients ; that is, there exists a channel and a mapping such that and . This construction, often termed time-sharing in the information theory literature, implies that all points in (including the boundary points) can be achieved with a variable T with . Since the boundary of is specified by the mapping , we conclude that is achieved by a variable T with cardinality for every . □
Proof of Lemma 5.
The following proof is inspired by [32] (Proposition 1). Let . We sort the elements in such that
Now consider the function given by if and if where . Let . We have if and . We can now write
Since takes values in , it follows that . Consequently, we have
For the privacy funnel, the proof proceeds as follows. We sort the elements in such that
Consider now the function given by if and if . As before, let . Then, we can write,
where the last inequality is due to the log-sum inequality. □
Proof of Lemma 6.
Employing the same argument as in the proof of [32] (Theorem 3), we obtain that there exists a function such that
for any and
Since for all , it follows from above that (noticing that )
where . Rearranging this, we obtain
Assuming , we have and hence
implying
Plugging this into (A11), we obtain
As before, if , then . Hence,
for all . □
Appendix B. Proofs from Section 3
Proof of Lemma 8.
To prove the upper bound on , recall that is convex. Thus, it lies below the chord connecting the points and . The lower bound on is obtained similarly using the concavity of . This is achievable by an erasure channel. To see this, consider the random variable taking values in that is obtained by the conditional distributions and for some . It can be verified that and . By taking , this channel meets the constraint . Hence,
□
Proof of Theorem 11.
We begin with . As described in Section 3.1, and similarly to Mrs. Gerber’s Lemma (Lemma 4), we need to construct the lower convex envelope of , where and is the result of passing through , i.e., . In this case, . Hence, we need to determine the lower convex envelope of the map
A straightforward computation shows that is symmetric around and is also concave in q on for any . Hence, is obtained as follows depending on the values of :
Hence, assuming , we can construct that minimizes . Considering the first two cases, we obtain that is ternary with , , and with marginal for some . This leads to and . Note that covers the entire domain as varies over . Replacing by r, we obtain , leading to . Since , the desired result follows.
To derive the expression for , recall that we need to derive the upper concave envelope of . It is clear from Figure 6 that is obtained by replacing on the interval by its maximum value over q where
is the maximizer of on . In other words,
Note that if then evaluated at p coincides with . This corresponds to all trivial such that . If, on the other hand, , then is the convex combination of and . Hence, taking as a parameter (say, ), the optimal binary is constructed as follows: and for . Such a channel induces
as , and also
Combining these two, we obtain
□
Proof of Theorem 12.
Let and denote the and , respectively, when . In light of (59) and (60), it is sufficient to compute and . To do so, we need to construct the lower convex envelope and upper concave envelope of the map given by where and is the result of passing through , i.e., . In this case, we have
where is shorthand for for any .
We begin with , for which we aim to obtain . A straightforward computation shows that is convex for and . For and , it can be shown that is concave on an interval , where solves . (The shape of on is similar to what was depicted in Figure 4.) By symmetry, is therefore obtained by replacing on this interval by . Hence, if , at p coincides with , which results in a trivial (see the proof of Theorem 11 for more details). If, on the other hand, , then evaluated at p is given by a convex combination of and . Relabeling as a parameter (say, q), we can write an optimal binary via the following: and for . This channel induces and . Hence, the graph of is given by
Therefore,
where solves . Since the map is strictly decreasing for , this equation has a unique solution.
Next, we compute , or equivalently the upper concave envelope of defined in (A13). As mentioned earlier, is convex for and . For , we need to consider three cases: (1) is given by the convex combination of and , (2) is given by the convex combination of , , and , (3) is given by the convex combination of and , where is a point . Without loss of generality, we can ignore the first case. The other two cases correspond to the following solutions
- is a ternary variable given by , , and with marginal for some . This produces and
- is a binary variable given by and with marginal for some . This produces and
Proof of Lemma 9.
The facts that is non-increasing on [103] (Proposition 5) and for all imply
Since , the above lower bound yields
where the last inequality follows from the fact that is non-increasing. The upper bound in (A14) (after replacing X with Y and with ) implies
Combining (A15) and (A16), we obtain the desired upper bound for . The other bounds can be proved similarly by interchanging X with Y and with in (A15) and (A16). □
References
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 30 September–3 October 1999; pp. 368–377. [Google Scholar]
- Wyner, A.; Ziv, J. A theorem on the entropy of certain binary sequences and applications: Part I. IEEE Trans. Inf. Theory 1973, 19, 769–772. [Google Scholar] [CrossRef]
- Witsenhausen, H.; Wyner, A. A conditional entropy bound for a pair of discrete random variables. IEEE Trans. Inf. Theory 1975, 21, 493–501. [Google Scholar] [CrossRef]
- Ahlswede, R.; Körner, J. On the connection between the entropies of input and output distributions of discrete memoryless channels. In Proceedings of the Fifth Conference on Probability Theory, Brasov, Romania, 1–6 September 1974. [Google Scholar]
- Wyner, A. A theorem on the entropy of certain binary sequences and applications—II. IEEE Trans. Inf. Theory 1973, 19, 772–777. [Google Scholar] [CrossRef]
- Kim, Y.H.; El Gamal, A. Network Information Theory; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
- Slonim, N.; Tishby, N. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 208–215. [Google Scholar]
- Still, S.; Bialek, W. How Many Clusters? An Information-Theoretic Perspective. Neural Comput. 2004, 16, 2483–2506. [Google Scholar] [CrossRef]
- Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS’99), Denver, CO, USA, 29 November–4 December 1999; pp. 617–623. [Google Scholar]
- Cardinal, J. Compression of side information. In Proceedings of the 2003 International Conference on Multimedia and Expo—Volume 1, Baltimore, MD, USA, 6–9 July 2003; Volume 2, pp. 569–572. [Google Scholar]
- Zeitler, G.; Koetter, R.; Bauch, G.; Widmer, J. Design of network coding functions in multihop relay networks. In Proceedings of the 2008 5th International Symposium on Turbo Codes and Related Topics, Lausanne, Switzerland, 1–5 September 2008; pp. 249–254. [Google Scholar]
- Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the Information Bottleneck to the Privacy Funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW 2014), Tasmania, Australia, 2–5 November 2014; pp. 501–505. [Google Scholar]
- Calmon, F.P.; Makhdoumi, A.; Médard, M. Fundamental limits of perfect privacy. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 1796–1800. [Google Scholar]
- Asoodeh, S.; Alajaji, F.; Linder, T. Notes on information-theoretic privacy. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 30 September–3 October 2014; pp. 1272–1278. [Google Scholar]
- Ding, N.; Sadeghi, P. A Submodularity-based Clustering Algorithm for the Information Bottleneck and Privacy Funnel. In Proceedings of the 2019 IEEE Information Theory Workshop (ITW), Visby, Sweden, 25–28 August 2019; pp. 1–5. [Google Scholar]
- Bertran, M.; Martinez, N.; Papadaki, A.; Qiu, Q.; Rodrigues, M.; Reeves, G.; Sapiro, G. Adversarially Learned Representations for Information Obfuscation and Inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 614–623. [Google Scholar]
- Lopuhaä-Zwakenberg, M.; Tong, H.; Škorić, B. Data Sanitisation Protocols for the Privacy Funnel with Differential Privacy Guarantees. arXiv 2020, arXiv:2008.13151. [Google Scholar]
- Hsu, H.; Asoodeh, S.; Calmon, F. Obfuscation via Information Density Estimation. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Sicily, Italy, 26–28 August 2020; Volume 108, pp. 906–917. [Google Scholar]
- Dobrushin, R.; Tsybakov, B. Information transmission with additional noise. IRE Trans. Inf. Theory 1962, 8, 293–304. [Google Scholar] [CrossRef]
- Ahlswede, R.; Csiszar, I. Hypothesis testing with communication constraints. IEEE Trans. Inf. Theory 1986, 32, 533–542. [Google Scholar] [CrossRef]
- Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Information extraction under privacy constraints. Information 2016, 7, 15. [Google Scholar] [CrossRef]
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory, Coll. Math. Soc. J. Bolyai; Csiszár, I., Elias, P., Eds.; North-Holland: Amsterdam, The Netherlands, 1977; Volume 16, pp. 41–52. [Google Scholar]
- Raginsky, M. Strong Data Processing Inequalities and Φ-Sobolev Inequalities for Discrete Channels. IEEE Trans. Inf. Theory 2016, 62, 3355–3389. [Google Scholar] [CrossRef]
- Hsu, H.; Asoodeh, S.; Salamatian, S.; Calmon, F.P. Generalizing Bottleneck Problems. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 531–535. [Google Scholar]
- Courtade, T.A. Strengthening the entropy power inequality. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 2294–2298. [Google Scholar]
- Globerson, A.; Tishby, N. On the Optimality of the Gaussian Information Bottleneck Curve; Technical Report; Hebrew University: Jerusalem, Israel, 2004. [Google Scholar]
- Calmon, F.P.; Polyanskiy, Y.; Wu, Y. Strong data processing inequalities in power-constrained Gaussian channels. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2558–2562. [Google Scholar]
- Rényi, A. On measures of dependence. Acta Math. Acad. Sci. Hung. 1959, 10, 441–451. [Google Scholar] [CrossRef]
- Goldfeld, Z.; Polyanskiy, Y. The Information Bottleneck Problem and its Applications in Machine Learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
- Zaidi, A.; Estella-Aguerri, I.; Shamai (Shitz), S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef]
- Strouse, D.; Schwab, D.J. The Deterministic Information Bottleneck. Neural Comput. 2017, 29, 1611–1630. [Google Scholar] [CrossRef] [PubMed]
- Bhatt, A.; Nazer, B.; Ordentlich, O.; Polyanskiy, Y. Information-Distilling Quantizers. arXiv 2018, arXiv:1812.03031. [Google Scholar]
- Shamir, O.; Sabato, S.; Tishby, N. Learning and Generalization with the Information Bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711. [Google Scholar] [CrossRef]
- Diaz, M.; Wang, H.; Calmon, F.P.; Sankar, L. On the Robustness of Information-Theoretic Privacy Measures and Mechanisms. IEEE Trans. Inf. Theory 2020, 66, 1949–1978. [Google Scholar] [CrossRef]
- El-Yaniv, R.; Souroujon, O. Iterative Double Clustering for Unsupervised and Semi-Supervised Learning. In Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 5–7 September 2001; Springer: Berlin/Heidelberg, Germany, 2001; pp. 121–132. [Google Scholar]
- Elidan, G.; Friedman, N. Learning Hidden Variable Networks: The Information Bottleneck Approach. J. Mach. Learn. Res. 2005, 6, 81–127. [Google Scholar]
- Aguerri, I.E.; Zaidi, A. Distributed Information Bottleneck Method for Discrete and Gaussian Sources. arXiv 2017, arXiv:1709.09082. [Google Scholar]
- Aguerri, I.E.; Zaidi, A. Distributed Variational Representation Learning. arXiv 2019, arXiv:1807.04193. [Google Scholar]
- Strouse, D.; Schwab, D.J. Geometric Clustering with the Information Bottleneck. Neural Comput. 2019, 31, 596–612. [Google Scholar] [CrossRef]
- Cicalese, F.; Gargano, L.; Vaccaro, U. Bounds on the Entropy of a Function of a Random Variable and Their Applications. IEEE Trans. Inf. Theory 2018, 64, 2220–2230. [Google Scholar] [CrossRef]
- Koch, T.; Lapidoth, A. At Low SNR, Asymmetric Quantizers are Better. IEEE Trans. Inf. Theory 2013, 59, 5421–5445. [Google Scholar] [CrossRef]
- Pedarsani, R.; Hassani, S.H.; Tal, I.; Telatar, E. On the construction of polar codes. In Proceedings of the 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 31 July–5 August 2011; pp. 11–15. [Google Scholar]
- Tal, I.; Sharov, A.; Vardy, A. Constructing polar codes for non-binary alphabets and MACs. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings, Cambridge, MA, USA, 1–6 July 2012; pp. 2132–2136. [Google Scholar]
- Kartowsky, A.; Tal, I. Greedy-Merge Degrading has Optimal Power-Law. IEEE Trans. Inf. Theory 2019, 65, 917–934. [Google Scholar] [CrossRef]
- Viterbi, A.J.; Omura, J.K. Principles of Digital Communication and Coding, 1st ed.; McGraw-Hill, Inc.: New York, NY, USA, 1979. [Google Scholar]
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop (ITW), Jeju Island, Korea, 11–15 October 2015; pp. 1–5. [Google Scholar]
- Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
- Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the Information Bottleneck Theory of Deep Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Goldfeld, Z.; Van Den Berg, E.; Greenewald, K.; Melnyk, I.; Nguyen, N.; Kingsbury, B.; Polyanskiy, Y. Estimating Information Flow in Deep Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 2299–2308. [Google Scholar]
- Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2225–2239. [Google Scholar] [CrossRef]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. arXiv 2016, arXiv:1612.00410. [Google Scholar]
- Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear Information Bottleneck. arXiv 2017, arXiv:1705.02436. [Google Scholar]
- Kolchinsky, A.; Tracey, B.D.; Kuyk, S.V. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Chalk, M.; Marre, O.; Tkacik, G. Relevant Sparse Codes with Variational Information Bottleneck. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 9 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 1965–1973. [Google Scholar]
- Wickstrøm, K.; Løkse, S.; Kampffmeyer, M.; Yu, S.; Principe, J.; Jenssen, R. Information Plane Analysis of Deep Neural Networks via Matrix–Based Rényi’s Entropy and Tensor Kernels. arXiv 2019, arXiv:1909.11396. [Google Scholar]
- Vera, M.; Piantanida, P.; Rey Vega, L. The Role of the Information Bottleneck in Representation Learning. In Proceedings of the IEEE International Symposium on Information Theory (ISIT 2018), Vail, CO, USA, 17–22 June 2018. [Google Scholar] [CrossRef]
- Alemi, A.; Fischer, I.; Dillon, J. Uncertainty in the Variational Information Bottleneck. arXiv 2018, arXiv:1807.00906. [Google Scholar]
- Yu, S.; Jenssen, R.; Príncipe, J. Understanding Convolutional Neural Network Training with Information Theory. arXiv 2018, arXiv:1804.06537. [Google Scholar]
- Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
- Issa, I.; Wagner, A.B.; Kamath, S. An Operational Approach to Information Leakage. IEEE Trans. Inf. Theory 2020, 66, 1625–1657. [Google Scholar] [CrossRef]
- Cvitkovic, M.; Koliander, G. Minimal Achievable Sufficient Statistic Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 1465–1474. [Google Scholar]
- Asoodeh, S.; Alajaji, F.; Linder, T. On maximal correlation, mutual information and data privacy. In Proceedings of the IEEE 14th Canadian Workshop on Inf. Theory (CWIT), St. John’s, NL, Canada, 6–9 July 2015; pp. 27–31. [Google Scholar]
- Makhdoumi, A.; Fawaz, N. Privacy-utility tradeoff under statistical uncertainty. In Proceedings of the 51st Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2–4 October 2013; pp. 1627–1634. [Google Scholar] [CrossRef]
- Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Estimation Efficiency Under Privacy Constraints. IEEE Trans. Inf. Theory 2019, 65, 1512–1534. [Google Scholar] [CrossRef]
- Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Privacy-aware guessing efficiency. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017. [Google Scholar]
- Asoodeh, S.; Alajaji, F.; Linder, T. Privacy-aware MMSE estimation. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1989–1993. [Google Scholar]
- Calmon, F.P.; Makhdoumi, A.; Médard, M.; Varia, M.; Christiansen, M.; Duffy, K.R. Principal Inertia Components and Applications. IEEE Trans. Inf. Theory 2017, 63, 5011–5038. [Google Scholar] [CrossRef]
- Wang, H.; Vo, L.; Calmon, F.P.; Médard, M.; Duffy, K.R.; Varia, M. Privacy With Estimation Guarantees. IEEE Trans. Inf. Theory 2019, 65, 8025–8042. [Google Scholar] [CrossRef]
- Asoodeh, S. Information and Estimation Theoretic Approaches to Data Privacy. Ph.D. Thesis, Queen’s University, Kingston, ON, Canada, 2017. [Google Scholar]
- Liao, J.; Kosut, O.; Sankar, L.; du Pin Calmon, F. Tunable Measures for Information Leakage and Applications to Privacy-Utility Tradeoffs. IEEE Trans. Inf. Theory 2019, 65, 8043–8066. [Google Scholar] [CrossRef]
- Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Privacy aware learning. J. Assoc. Comput. Mach. (ACM) 2014, 61, 38. [Google Scholar] [CrossRef]
- Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; Tucker, G. On Variational Bounds of Mutual Information. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Volume 97, pp. 5171–5180. [Google Scholar]
- Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 531–540. [Google Scholar]
- Van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Song, J.; Ermon, S. Understanding the Limitations of Variational Mutual Information Estimators. In Proceedings of the International Conference on Learning Representations, online, 26 April–1 May 2020. [Google Scholar]
- McAllester, D.; Stratos, K. Formal Limitations on the Measurement of Mutual Information. In Proceedings of the International Conference on Learning Representations, online, 26 April–1 May 2020; Volume 108, pp. 875–884. [Google Scholar]
- Rassouli, B.; Gunduz, D. On Perfect Privacy. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 2551–2555. [Google Scholar]
- Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
- Kim, H.; Gao, W.; Kannan, S.; Oh, S.; Viswanath, P. Discovering Potential Correlations via Hypercontractivity. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4577–4587. [Google Scholar]
- Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
- Anantharam, V.; Gohari, A.; Kamath, S.; Nair, C. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv 2014, arXiv:1304.6133v1. [Google Scholar]
- Polyanskiy, Y.; Wu, Y. Dissipation of Information in Channels With Input Constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
- Zaidi, A. Hypothesis Testing Against Independence Under Gaussian Noise. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 1289–1294. [Google Scholar] [CrossRef]
- Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the Information Bottleneck. Entropy 2019, 21, 924. [Google Scholar] [CrossRef]
- Contento, L.; Ern, A.; Vermiglio, R. A linear-time approximate convex envelope algorithm using the double Legendre-Fenchel transform with application to phase separation. Comput. Optim. Appl. 2015, 60, 231–261. [Google Scholar] [CrossRef]
- Lucet, Y. Faster than the Fast Legendre Transform, the Linear-time Legendre Transform. Numer. Algorithms 1997, 16, 171–185. [Google Scholar] [CrossRef]
- Witsenhausen, H. Indirect rate distortion problems. IEEE Trans. Inf. Theory 1980, 26, 518–521. [Google Scholar] [CrossRef]
- Wyner, A. On source coding with side information at the decoder. IEEE Trans. Inf. Theory 1975, 21, 294–300. [Google Scholar] [CrossRef]
- Courtade, T.A.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761. [Google Scholar] [CrossRef]
- Li, C.T.; El Gamal, A. Extended Gray-Wyner System With Complementary Causal Side Information. IEEE Trans. Inf. Theory 2018, 64, 5862–5878. [Google Scholar] [CrossRef]
- Vera, M.; Rey Vega, L.; Piantanida, P. Collaborative Information Bottleneck. IEEE Trans. Inf. Theory 2019, 65, 787–815. [Google Scholar] [CrossRef]
- Gilad-Bachrach, R.; Navot, A.; Tishby, N. An Information Theoretic Tradeoff between Complexity and Accuracy. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609. [Google Scholar]
- Pichler, G.; Koliander, G. Information Bottleneck on General Alphabets. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 526–530. [Google Scholar] [CrossRef]
- Kim, Y.H.; Sutivong, A.; Cover, T. State amplification. IEEE Trans. Inf. Theory 2008, 54, 1850–1859. [Google Scholar] [CrossRef]
- Merhav, N.; Shamai, S. Information rates subject to state masking. IEEE Trans. Inf. Theory 2007, 53, 2254–2261. [Google Scholar] [CrossRef]
- Witsenhausen, H. Some aspects of convexity useful in information theory. IEEE Trans. Inf. Theory 1980, 26, 265–271. [Google Scholar] [CrossRef]
- Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Nice, France, 24–29 June 2007; pp. 566–570. [Google Scholar]
- Hirche, C.; Winter, A. An alphabet size bound for the information bottleneck function. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020. [Google Scholar]
- Liese, F.; Vajda, I. On Divergences and Informations in Statistics and Information Theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
- Verdú, S. α-mutual information. In Proceedings of the Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
- Fehr, S.; Berens, S. On the Conditional Rényi Entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
- Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
- Sason, I.; Verdú, S. f-Divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
- Guntuboyina, A.; Saha, S.; Schiebinger, G. Sharp Inequalities for f-Divergences. IEEE Trans. Inf. Theory 2014, 60, 104–121. [Google Scholar] [CrossRef]
- Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inf. Theory 2005, 51, 1261–1282. [Google Scholar] [CrossRef]
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1997. [Google Scholar]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
- Linder, T.; Zamir, R. On the asymptotic tightness of the Shannon lower bound. IEEE Trans. Inf. Theory 1994, 40, 2026–2031. [Google Scholar] [CrossRef]
- Guo, D.; Wu, Y.; Shitz, S.S.; Verdú, S. Estimation in Gaussian Noise: Properties of the Minimum Mean-Square Error. IEEE Trans. Inf. Theory 2011, 57, 2371–2385. [Google Scholar]
- Jana, S. Alphabet sizes of auxiliary random variables in canonical inner bounds. In Proceedings of the 43rd Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 18–20 March 2009; pp. 67–71. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).