Next Article in Journal
Diffusion Limitations and Translocation Barriers in Atomically Thin Biomimetic Pores
Next Article in Special Issue
Adam and the Ants: On the Influence of the Optimization Algorithm on the Detectability of DNN Watermarks
Previous Article in Journal
Extending Fibre Nonlinear Interference Power Modelling to Account for General Dual-Polarisation 4D Modulation Formats
Previous Article in Special Issue
Information Bottleneck Classification in Extremely Distributed Systems
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Bottleneck Problems: An Information and Estimation-Theoretic View †

School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138, USA
*
Author to whom correspondence should be addressed.
Part of the Results in This Paper Was Presented at the International Symposium on Information Theory 2018, Vail, CO, USA, 17–22 June 2018.
Entropy 2020, 22(11), 1325; https://doi.org/10.3390/e22111325
Submission received: 15 October 2020 / Revised: 17 November 2020 / Accepted: 17 November 2020 / Published: 20 November 2020

Abstract

:
Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), and strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound on the auxiliary variable in IB, making its computation more tractable for discrete random variables. In the second part, we introduce a general family of optimization problems, termed “bottleneck problems”, by replacing mutual information in IB and PF with other notions of mutual information, namely f-information and Arimoto’s mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. While the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelope of certain functions. By applying this technique to a binary case, we derive closed form expressions for several bottleneck problems.

1. Introduction

Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck ( IB ) method [1]. Suppose Y is a target variable and X is an observable correlated variable with joint distribution P X Y . The goal of IB is to learn a “compact” summary (aka bottleneck) T of X that is maximally “informative” for inferring Y. The bottleneck variable T is assumed to be generated from X by applying a random function F to X, i.e., T = F ( X ) , in such a way that it is conditionally independent of Y given X, that we denote by
Entropy 22 01325 i002
The IB quantifies this goal by measuring the “compactness” of T using the mutual information I ( X ; T ) and, similarly, “informativeness” by I ( Y ; T ) . For a given level of compactness R 0 , IB extracts the bottleneck variable T that solves the constrained optimization problem
IB ( R ) sup I ( Y ; T ) subject to I ( X ; T ) R ,
where the supremum is taken over all randomized functions T = F ( X ) satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T.
The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970’s—see [2,3,4,5]—as a technique to prove impossibility results in information theory and also to study the common information between X and Y. Wyner and Ziv [2] explicitly determined the value of IB ( R ) for the special case of binary X and Y—a result widely known as Mrs. Gerber’s Lemma [2,6]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [1] and re-formulated in a data analytic context. Here, the random variable X represents a high-dimensional observation with a corresponding low-dimensional feature Y. IB aims at specifying a compressed description of image which is maximally informative about feature Y. This framework led to several applications in clustering [7,8,9] and quantization [10,11].
A closely-related framework to IB is the privacy funnel ( PF ) problem [12,13,14]. In the PF framework, a bottleneck variable T is sought to maximally preserve “information” contained in X while revealing as little about Y as possible. This framework aims to capture the inherent trade-off between revealing X perfectly and leaking a sensitive attribute Y. For instance, suppose a user wishes to share an image X for some classification tasks. The image might carry information about attributes, say Y, that the user might consider as sensitive, even when such information is of limited use for the tasks, e.g., location, or emotion. The PF framework seeks to extract a representation of X from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to Y. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as
PF ( r ) inf I ( Y ; T ) subject to I ( X ; T ) r ,
where the infumum is taken over all randomized function T = F ( X ) and r is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to design of greedy algorithms [12,15] for estimating PF based on the agglomerative information bottleneck [9] algorithm. A similar formulation has recently been proposed in [16] as a tool to train a neural network for learning a private representation of data X; see [17,18] for other closely-related formulations. Solving IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).
In this paper, we first give a cohesive overview of the existing results surrounding the IB and the PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to the IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source-coding [19], testing against independence [20], and dependence dilution [21]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB , leading to more tractable optimization problem for IB . We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto’s mutual information [22] and f-information [23]. By replacing I ( Y ; T ) and/or I ( X ; T ) with either of these measures, we generate a family of optimization problems that we referred to as the bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF (see also the short version [24]). More specifically, our main contributions are listed next.
  • Computing IB and PF are notoriously challenging when X takes values in a set with infinite cardinality (e.g., X is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that X is a Gaussian perturbation of Y, i.e., X = Y + N G where N G is a noise variable sampled from a Gaussian distribution independent of Y. Building upon the recent advances in entropy power inequality in [25], we derive a sharp upper bound for IB ( R ) . As a special case, we consider jointly Gaussian ( X , Y ) for which the upper bound becomes tight. This then provides a significantly simpler proof for the fact that in this special case the optimal bottleneck variable T is also Gaussian than the original proof given in [26]. In the second scenario, we assume that Y is a Gaussian perturbation of X, i.e., Y = X + N G . This corresponds to a practical setup where the feature Y might be perfectly obtained from a noisy observation of X. Relying on the recent results in strong data processing inequality [27], we obtain an upper bound on IB ( R ) which is tight for small values of R. In the last scenario, we compute second-order approximation of PF ( r ) under the assumption that T is obtained by Gaussian perturbation of X, i.e., T = X + N G . Interestingly, the rate of increase of PF ( r ) for small values of r is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [28].
  • We extend the Witsenhausen and Wyner’s approach [3] for analytically computing IB and PF . This technique converts solving the optimization problems in IB and PF to determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary X and Y and derive a closed form expression for PF ( r ) – we call this result Mr. Gerber’s Lemma.
  • Relying on the connection between IB and noisy source coding [19] (see [29,30]), we show that the optimal bottleneck variable T in optimization problem (2) takes values in a set T with cardinality | T | | X | . Compared to the best cardinality bound previously known (i.e., | T | | X | + 1 ), this result leads to a reduction in the search space’s dimension of the optimization problem (2) from R | X | 2 to R | X | ( | X | 1 ) . Moreover, we show that this does not hold for PF , indicating a fundamental difference in optimizations problems (2) and (3).
  • Following [14,31], we study the deterministic IB and PF (denoted by dIB and dPF ) in which T is assumed to be a deterministic function of X, i.e., T = f ( X ) for some function f. By connecting dIB and dPF with entropy-constrained scalar quantization problems in information theory [32], we obtain bounds on them explicitly in terms of | X | . Applying these bounds to IB , we obtain that IB ( R ) I ( X ; Y ) is bounded by one from above and by min { R H ( X ) , e R 1 | X | } from below.
  • By replacing I ( Y ; T ) and/or I ( X ; T ) in (2) and (3) with Arimoto’s mutual information or f-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF . The main reason is three-fold: First, as illustrated in Section 2.3, mutual information in IB and PF are mainly justified when n 1 independent samples ( X 1 , Y 1 ) , , ( X n , Y n ) of P X Y are considered. However, Arimoto’s mutual information allows for operational interpretation even in the single-shot regime (i.e., for n = 1 ). Second, I ( Y ; T ) in IB and PF is meant to be a proxy for the efficiency of reconstructing Y given observation T. However, this can be accurately formalized by probability of correctly guessing Y given T (i.e., Bayes risk) or minimum mean-square error (MMSE) in estimating Y given T. While I ( Y ; T ) bounds these two measures, we show that they are precisely characterized by Arimoto’s mutual information and f-information, respectively. Finally, when P X Y is unknown, mutual information is known to be notoriously difficult to estimate. Nevertheless, Arimoto’s mutual information and f-information are easier to estimate: While mutual information can be estimated with estimation error that scales as O ( log n / n ) [33], Diaz et al. [34] showed that this estimation error for Arimoto’s mutual information and f-information is O ( 1 / n ) .
    We also generalize our computation technique that enables us to analytically compute these bottleneck problems. Similar as before, this technique converts computing bottleneck problems to determining convex and concave envelopes of certain functions. Focusing on binary X and Y, we derive closed form expressions for some of the bottleneck problems.

1.1. Related Work

The IB formulation has been extensively applied in representation learning and clustering [7,8,35,36,37,38]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of P Y | X . When data points lie in a metric space, usually geometric clustering is preferred where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [31,39] proposed the deterministic IB (denoted by dIB ) by enforcing that P T | X is a deterministic mapping: dIB ( R ) denotes the supremum of I ( Y ; f ( X ) ) over all functions f : X T satisfying H ( f ( X ) ) R . This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function f : X [ M ] { 1 , , M } with a pre-determined output alphabet with f optimizing some objective functions. This objective might be maximizing or minimizing H ( f ( X ) ) [40] or maximizing I ( Y ; f ( X ) ) for a random variable Y correlated with X [32,41,42,43]. Since H ( f ( X ) ) log M for f : X [ M ] , the latter problem provides lower bounds for dIB (and thus for IB ). In particular, one can exploit [44] (Theorem 1) to obtain I ( X ; Y ) dIB ( R ) O ( e 2 R / | Y | 1 ) provided that min { | X | , 2 R } > 2 | Y | . This result establishes a linear gap between dIB and I ( X ; Y ) irrespective of | X | .
The connection between quantization and dIB further allows us to obtain multiplicative bounds. For instance, if Y Bernoulli ( 1 2 ) and X = Y + N G , where N G N ( 0 , 1 ) is independent of Y, then it is well-known in information theory literature that I ( Y ; f ( X ) ) 2 π I ( X ; Y ) for all non-constant f : X { 0 , 1 } (see, e.g., [45] (Section 2.11)), thus dIB ( R ) 2 π I ( X ; Y ) for R 1 . We further explore this connection to provide multiplicative bounds on dIB ( R ) in Section 2.5.
The study of IB has recently gained increasing traction in the context of deep learning. By taking T to be the activity of the hidden layer(s), Tishby and Zaslavsky [46] (see also [47]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aims at solving the IB optimization problems. In fact, it is claimed that the graph of the function R IB ( R ) (the so-called the information plane) characterizes the learning dynamic of different layers in the network: shallow layers correspond to maximizing I ( Y ; T ) while deep layers’ objective is minimizing I ( X ; T ) . While the generality of this claim was refuted empirically in [48] and theoretically in [49,50], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the IB optimization problem [51,52,53,54,55]; (ii) creating connections between IB and generalization error [56], robustness [51], and detection of out-of-distribution data [57]; and (iii) using IB to understand specific characteristic of neural networks [55,58,59,60].
In both IB and PF , mutual information poses some limitations. For instance, it may become infinity in deterministic neural networks [48,49,50] and also may not lead to proper privacy guarantee [61]. As suggested in [55,62], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantee have been proposed including Rényi maximal correlation [21,63,64], probability of correctly recovering [65,66], minimum mean-squared estimation error (MMSE) [67,68], χ 2 -information [69] (a special case of f-information to be described in Section 3), Arimoto’s and Sibson’s mutual information [61,70]—to be discussed in Section 3, maximal leakage [71], and local differential privacy [72]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [67,68] that if χ 2 -information between Y and T is sufficiently small, then no functions of Y can be efficiently reconstructed given T; thus providing an interpretable privacy guarantee.
Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from n samples with the estimation error that scales as O ( log n / n ) [33]. However, as shown by Diaz et al. [34], the estimation error for most of the above measures scales as O ( 1 / n ) . Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [73,74,75], presents some fundamental limitations [76]: the variance of the estimator might grow exponentially with the ground truth mutual information and also the estimator might not satisfy basic properties of mutual information such as data processing inequality or additivity. McAllester and Stratos [77] showed that some of these limitations are inherent to a large family of mutual information estimators.

1.2. Notation

We use capital letters, e.g., X, for random variables and calligraphic letters for their alphabets, e.g., X . If X is distributed according to probability mass function (pmf) P X , we write X P X . Given two random variables X and Y, we write P X Y and P Y | X as the joint distribution and the conditional distribution of Y given X. We also interchangeably refer to P Y | X as a channel from X to Y. We use H ( X ) to denote both entropy and differential entropy of X, i.e., we have
H ( X ) = x X P X ( x ) log P X ( x )
if X is a discrete random variable taking values in X with probability mass function (pmf) P X and
H ( X ) = log f X ( x ) log f X ( x ) d x ,
where X is an absolutely continuous random variable with probability density function (pdf) f X . If X is a binary random variable with P X ( 1 ) = p , we write X Bernoulli ( p ) . In this case, its entropy is called binary entropy function and denoted by h b ( p ) p log p ( 1 p ) log ( 1 p ) . We use superscript G to describe a standard Gaussian random variable, i.e., N G N ( 0 , 1 ) . Given two random variables X and Y, their (Shannon’s) mutual information is denoted by I ( X ; Y ) H ( Y ) H ( Y | X ) . We let P ( X ) denote the set of all probability distributions on the set X . Given an arbitrary Q X P ( X ) and a channel P Y | X , we let Q X P Y | X denote the resulting output distribution on Y . For any a [ 0 , 1 ] , we use a ¯ to denote 1 a and for any integer k N , [ k ] { 1 , 2 , , k } .
Throughout the paper, we assume a pair of (discrete or continuous) random variables ( X , Y ) P X Y are given with a fixed joint distribution P X Y , marginals P X and P Y , and conditional distribution P Y | X . We then use Q X P ( X ) to denote an arbitrary distribution with Q Y = Q X P Y | X P ( Y ) .

2. Information Bottleneck and Privacy Funnel: Definitions and Functional Properties

In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex analytic approach which enables us to compute closed-form expressions for both these two functionals in some simple cases.
To precisely quantify the trade-off between these two conflicting goals, the IB optimization problem (2) was proposed [1]. Since any randomized function T = F ( X ) can be equivalently characterized by a conditional distribution, the optimization problem (2) can be instead expressed as
Entropy 22 01325 i003
where R and R ˜ denote the level of desired compression and informativeness, respectively. We use IB ( R ) and IB ˜ ( R ˜ ) to denote IB ( P X Y , R ) and IB ˜ ( P X Y , R ˜ ) , respectively, when the joint distribution is clear from the context. Notice that if IB ( P X Y , R ) = R ˜ , then IB ˜ ( P X Y , R ˜ ) = R .
Now consider the setup where data X is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by Y. This goal was formulated by PF in (3). As before, replacing randomized function T = F ( X ) with conditional distribution P T | X , we can equivalently express (3) as
Entropy 22 01325 i004
where r ˜ and r denote the level of desired privacy and informativeness, respectively. The case r ˜ = 0 is particularly interesting in practice and specifies perfect privacy, see e.g., [13,78]. As before, we write PF ˜ ( r ˜ ) and PF ( r ) for PF ˜ ( P X Y , r ˜ ) and PF ( P X Y , r ) when P X Y is clear from the context.
The following properties of IB and PF follow directly from their definitions. The proof of this result (and any other results in this section) is given in Appendix A.
Theorem 1.
For a given P X Y , the mappings IB ( R ) and PF ( r ) have the following properties:
  • IB ( 0 ) = PF ( 0 ) = 0 .
  • IB ( R ) = I ( X ; Y ) for any R H ( X ) and PF ( r ) = I ( X ; Y ) for r H ( X ) .
  • 0 IB ( R ) min { R , I ( X ; Y ) } for any R 0 and PF ( r ) max { r H ( X | Y ) , 0 } for any r 0 .
  • R IB ( R ) is continuous, strictly increasing, and concave on the range ( 0 , I ( X ; Y ) ) .
  • r PF ( r ) is continuous, strictly increasing, and convex on the range ( 0 , I ( X ; Y ) ) .
  • If P Y | X ( y | x ) > 0 for all x X and y Y , then both R IB ( R ) and r PF ( r ) are continuously differentiable over ( 0 , H ( X ) ) .
  • R IB ( R ) R is non-increasing and r PF ( r ) r is non-decreasing.
  • We have
    Entropy 22 01325 i005
According to this theorem, we can always restrict both R and r in (4) and (5), respectively, to [ 0 , H ( X ) ] as IB ( R ) = PF ( r ) = I ( X ; Y ) for all r , R H ( X ) .
Define M = M ( P X Y ) R 2 as
Entropy 22 01325 i006
It can be directly verified that M is convex. According to this theorem, R IB ( R ) and r PF ( r ) correspond to the upper and lower boundary of M , respectively. The convexity of M then implies the concavity and convexity of IB and PF . Figure 1 illustrates the set M for the simple case of binary X and Y.
While both IB ( 0 ) = 0 and PF ( 0 ) = 0 , their behavior in the neighborhood around zero might be completely different. As illustrated in Figure 1, IB ( R ) > 0 for all R > 0 , whereas PF ( r ) = 0 for r [ 0 , r 0 ] for some r 0 > 0 . When such r 0 > 0 exists, we say perfect privacy occurs: there exists a variable T satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T such that I ( Y ; T ) = 0 while I ( X ; T ) > 0 ; making T a representation of X having perfect privacy (i.e., no information leakage about Y). A necessary and sufficient condition for the existence of such T is given in [21] (Lemma 10) and [13] (Theorem 3), described next.
Theorem 2
(Perfect privacy). Let ( X , Y ) P X Y be given and A [ 0 , 1 ] | Y | be the set of vectors { P Y | X ( · | x ) , x X } . Then there exists r 0 > 0 such that PF ( r ) = 0 for r [ 0 , r 0 ] if and only if vectors in A are linearly independent.
In light of this theorem, we obtain that perfect privacy occurs if | X | > | Y | . It also follows from the theorem that for binary X, perfect privacy cannot occur (see Figure 1a).
Theorem 1 enables us to derive a simple bounds for IB and PF . Specifically, the facts that PF ( r ) r is non-decreasing and IB ( R ) R is non-increasing immediately result in the the following linear bounds.
Theorem 3
(Linear lower bound). For r , R ( 0 , H ( X ) ) , we have
inf Q X P ( X ) Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) PF ( r ) r I ( X ; Y ) H ( X ) IB ( R ) R sup Q X P ( X ) Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) 1 .
In light of this theorem, if PF ( r ) = r , then I ( X ; Y ) = H ( X ) , implying X = g ( Y ) for a deterministic function g. Conversely, if X = g ( Y ) then PF ( r ) = r because for all T forming the Markov relation Y Entropy 22 01325 i001 g(Y) Entropy 22 01325 i001 T, we have I ( Y ; T ) = I ( g ( Y ) ; T ) . On the other hand, we have IB ( R ) = R if and only if there exists a variable T satisfying I ( X ; T ) = I ( Y ; T ) and thus the following double Markov relations
Entropy 22 01325 i007
It can be verified (see [79] (Problem 16.25)) that this double Markov condition is equivalent to the existence of a pair of functions f and g such that f ( X ) = g ( Y ) and (X,Y) Entropy 22 01325 i001 f(X) Entropy 22 01325 i001 T . One special case of this setting, namely where g is an identity function, has been recently studied in details in [53] and will be reviewed in Section 2.5. Theorem 3 also enables us to characterize the “worst” joint distribution P X Y with respect to IB and PF . As demonstrated in the following lemma, if P Y | X is an erasure channel then PF ( r ) r = IB ( R ) R = I ( X ; Y ) H ( X ) .
Lemma 1.
  • Let P X Y be such that Y = X { } , P Y | X ( x | x ) = 1 δ , and P Y | X ( | x ) = δ for some δ > 0 . Then
    PF ( r ) r = IB ( R ) R = 1 δ .
  • Let P X Y be such that X = Y { } , P X | Y ( y | y ) = 1 δ , and P X | Y ( | y ) = δ for some δ > 0 . Then
    PF ( r ) = max { r H ( X | Y ) , 0 } .
The bounds in Theorem 3 hold for all r and R in the interval [ 0 , H ( X ) ] . We can, however, improve them when r and R are sufficiently small. Let PF ( 0 ) and IB ( 0 ) denote the slope of PF ( · ) and IB ( · ) at zero, i.e., PF ( 0 ) lim r 0 + PF ( r ) r and IB ( 0 ) lim R 0 + IB ( R ) R .
Theorem 4.
Given ( X , Y ) P X Y , we have
inf Q X P ( X ) Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) = PF ( 0 ) min x X : P X ( x ) > 0 D KL ( P Y | X ( · | x ) P Y ( · ) ) log P X ( x ) max x X : P X ( x ) > 0 D KL ( P Y | X ( · | x ) P Y ( · ) ) log P X ( x ) IB ( 0 ) = sup Q X P ( X ) Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) .
This theorem provides the exact values of PF ( 0 ) and IB ( 0 ) and also simple bounds for them. While the exact expressions for PF ( 0 ) and IB ( 0 ) are usually difficult to compute, a simple plug-in estimator is proposed in [80] for IB ( 0 ) . This estimator can be readily adapted to estimate PF ( 0 ) . Theorem 4 reveals a profound connection between IB and the strong data processing inequality (SDPI) [81]. More precisely, thanks to the pioneering work of Anantharam et al. [82], it is known that the supremum of D KL ( Q Y P Y ) D KL ( Q X P X ) over all Q X P X is equal the supremum of I ( Y ; T ) I ( X ; T ) over all P T | X satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T and hence IB ( 0 ) specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for new theoretical results for IB , especially when X or Y are continuous random variables. In particular, the recent non-multiplicative SDPI results [27,83] seem insightful for this purpose.
In many practical cases, we might have n i.i.d. samples ( X 1 , Y 1 ) , , ( X n , Y n ) of ( X , Y ) P X Y . We now study how IB behaves in n. Let X n ( X 1 , , X n ) and Y n ( Y 1 , , Y n ) . Due to the i.i.d. assumption, we have P X n Y n ( x n , y n ) = i = 1 n P X Y ( x i , y i ) . This can also be described by independently feeding X i , i [ n ] , to channel P Y | X producing Y i . The following theorem, demonstrated first in [3] (Theorem 2.4), gives a formula for IB in terms of n.
Theorem 5
(Additivity). We have
1 n IB ( P X n Y n , n R ) = IB ( P X Y , R ) .
This theorem demonstrates that an optimal channel P T n | X n for i.i.d. samples ( X n , Y n ) P X Y is obtained by the Kronecker product of an optimal channel P T | X for ( X , Y ) P X Y . This, however, may not hold in general for PF , that is, we might have PF ( P X n Y n , n r ) < n PF ( P X Y , r ) , see [13] (Proposition 1) for an example.

2.1. Gaussian IB and PF

In this section, we turn our attention to a special, yet important, case where X = Y + σ N G , where σ > 0 and N G N ( 0 , 1 ) is independent of Y. This setting subsumes the popular case of jointly Gaussian ( X , Y ) whose information bottleneck functional was computed in [84] for the vector case (i.e., ( X , Y ) are jointly Gaussian random vectors).
Lemma 2.
Let { Y i } i = 1 n be n i.i.d. copies of Y P Y and X i = Y i + σ N i G where { N i G } are i.i.d samples of N ( 0 , 1 ) independent of Y. Then, we have
1 n IB ( P X n Y n , n R ) H ( X ) 1 2 log 2 π e σ 2 + e 2 ( H ( Y ) R ) .
It is worth noting that this result was concurrently proved in [85]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [25] (Theorem 2) which holds even if X i , Y i , and N i are random vectors (as opposed to scalar). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on P T | X . This upper bound provides a significantly simpler proof for the well-known fact that for the jointly Gaussian ( X , Y ) , the optimal channel P T | X is Gaussian. This result was first proved in [26] and used in [84] to compute an expression of IB for the Gaussian case.
Corollary 1.
If ( X , Y ) are jointly Gaussian with correlation coefficient ρ, then we have
IB ( R ) = 1 2 log 1 1 ρ 2 + ρ 2 e 2 R .
Moreover, the optimal channel P T | X is given by P T | X ( · | x ) = N ( 0 , σ ˜ 2 ) for σ ˜ 2 = σ Y 2 e 2 R ρ 2 ( 1 e 2 R ) where σ Y 2 is the variance of Y.
In Lemma 2, we assumed that X is a Gaussian perturbation of Y. However, in some practical scenarios, we might have Y as a Gaussian perturbation of X. For instance, let X represent an image and Y be a feature of the image that can be perfectly obtained from a noisy observation of X. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [27] (Theorem 1), gives an upper bound for IB in this case.
Lemma 3.
Let X n be n i.i.d. copies of a random variable X satisfying E [ X 2 ] 1 and Y i be the result of passing X i , i [ n ] , through a Gaussian channel Y = X + σ N G , where σ > 0 and N G N ( 0 , 1 ) is independent of X. Then, we have
1 n IB ( P X n Y n , n R ) R Ψ ( R , σ ) ,
where
Ψ ( R , σ ) max x [ 0 , 1 2 ] 2 Q 1 x σ 2 R h b ( x ) x 2 log 1 + 1 x σ 2 ,
Q ( t ) t 1 2 π e t 2 2 d t is the Gaussian complimentary CDF and h b ( a ) a log ( a ) ( 1 a ) log ( 1 a ) for a ( 0 , 1 ) is the binary entropy function. Moreover, we have
1 n IB ( P X n Y n , n R ) R e 1 R σ 2 log 1 R + Θ log 1 R .
Note that that Lemma 3 holds for any arbitrary X (provided that E [ X 2 ] 1 ) and hence (9) bounds information bottleneck functionals for a wide family of P X Y . However, the bound is loose in general for large values of R. For instance, if ( X , Y ) are jointly Gaussian (implying Y = X + σ N G for some σ > 0 ), then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Figure 2.
The privacy funnel functional is much less studied even for the simple case of jointly Gaussian. Solving the optimization in PF over P T | X without any assumptions is a difficult challenge. A natural assumption to make is that P T | X ( · | x ) is Gaussian for each x X . This leads to the following variant of PF
PF G ( r ) inf σ 0 , I ( X ; T σ ) r I ( Y ; T σ ) ,
where
T σ X + σ N G ,
and N G N ( 0 , 1 ) is independent of X. This formulation is tractable and can be computed in closed form for jointly Gaussian ( X , Y ) as described in the following example.
Example 1.
Let X and Y be jointly Gaussian with correlation coefficient ρ. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both X and Y are zero mean and unit variance and hence we can write X = ρ Y + 1 ρ 2 M G where M G N ( 0 , 1 ) is independent of Y. Consequently, we have
I ( X ; T σ ) = 1 2 log 1 + 1 σ 2 ,
and
I ( Y ; T σ ) = 1 2 log 1 + ρ 2 1 ρ 2 + σ 2 .
In order to ensure I ( X ; T σ ) r , we must have σ e 2 r 1 1 2 . Plugging this choice of σ into (13), we obtain
PF G ( r ) = 1 2 log 1 1 ρ 2 1 e 2 r .
This example indicates that for jointly Gaussian ( X , Y ) , we have PF G ( r ) = 0 if and only if r = 0 (thus perfect privacy does not occur) and the constraint I ( X ; T σ ) = r is satisfied by a unique σ . These two properties in fact hold for all continuous variables X and Y with finite second moments as demonstrated in Lemma A1 in Appendix A. We use these properties to derive a second-order approximation of PF G ( r ) when r is sufficiently small. For the following theorem, we use var ( U ) to denote the variance of the random variable U and var ( U | V ) E [ ( U E [ U | V ] ) 2 | V ] . We use σ X 2 = var ( X ) for short.
Theorem 6.
For any pair of continuous random variables ( X , Y ) with finite second moments, we have as r 0
PF G ( r ) = η ( X , Y ) r + Δ ( X , Y ) r 2 + o ( r 2 ) ,
where η ( X , Y ) var ( E [ X | Y ] ) σ X 2 and
Δ ( X , Y ) 2 σ X 4 E [ var 2 ( X | Y ) ] σ X 2 E [ var ( X | Y ) ] .
It is worth mentioning that the quantity η ( X , Y ) was first defined by Rényi [28] as an asymmetric measure of correlation between X and Y. In fact, it can be shown that η ( X , Y ) = sup f ρ 2 ( X , f ( Y ) ) , where supremum is taken over all measurable functions f and ρ ( · , · ) denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian X and Y with correlation coefficient ρ for which PF G was computed in Example 1. In this case, it can be easily verified that η ( X , Y ) = ρ 2 and Δ ( X , Y ) = 2 σ X 2 ρ 2 ( 1 ρ 2 ) . Hence, for jointly Gaussian ( X , Y ) with correlation coefficient ρ and unit variance, we have PF G ( r ) = ρ 2 r 2 ρ 2 ( 1 ρ 2 ) r 2 + o ( r 2 ) . In Figure 3, we compare the approximation given in Theorem 6 for this particular case.

2.2. Evaluation of IB and PF

The constrained optimization problems in the definitions of IB and PF are usually challenging to solve numerically due to the non-linearity in the constraints. In practice, however, both IB and PF are often approximated by their corresponding Lagrangian optimizations
L IB ( β ) sup P T | X I ( Y ; T ) β I ( X ; T ) = H ( Y ) β H ( X ) inf P T | X H ( Y | T ) β H ( X | T ) ,
and
L PF ( β ) inf P T | X I ( Y ; T ) β I ( X ; T ) = H ( Y ) β H ( X ) sup P T | X H ( Y | T ) β H ( X | T ) ,
where β R + is the Lagrangian multiplier that controls the tradeoff between compression and informativeness in for IB and the privacy and informativeness in PF . Notice that for the computation of L IB , we can assume, without loss of generality, that β [ 0 , 1 ] since otherwise the maximizer of (15) is trivial. It is worth noting that L IB ( β ) and L PF ( β ) in fact correspond to lines of slope β supporting M from above and below, thereby providing a new representation of M .
Let ( X , Y ) be a pair of random variables with X Q X for some Q X P ( X ) and Y is the output of P Y | X when the input is X (i.e., Y Q X P Y | X ). Define
F β ( Q X ) H ( Y ) β H ( X ) .
This function, in general, is neither convex nor concave in Q X . For instance, F ( 0 ) is concave and F ( 1 ) is convex in P X . The lower convex envelope of F β ( Q X ) is defined as the largest convex function smaller than F β ( Q X ) . Similarly, the upper concave envelope of F β ( Q X ) is defined as the smallest concave function larger than F β ( Q X ) . Let K [ F β ( Q X ) ] and K [ F β ( Q X ) ] denote the lower convex and upper concave envelopes of F β ( Q X ) , respectively. If F β ( Q X ) is convex at P X , that is K [ F β ( Q X ) ] | P X = F β ( P X ) , then F β ( Q X ) remains convex at P X for all β β because
K [ F β ( Q X ) ] = K [ F β ( Q X ) ( β β ) H ( X ) ] K [ F β ( Q X ) ] + K [ ( β β ) H ( X ) ] = K [ F β ( Q X ) ] ( β β ) H ( X ) ,
where the last equality follows from the fact that ( β β ) H ( X ) is convex. Hence, at P X we have
K [ F β ( Q X ) ] | P X K [ F β ( Q X ) ] | P X ( β β ) H ( X ) = F β ( P X ) ( β β ) H ( X ) = F β ( P X ) .
Analogously, if F β ( Q X ) is concave at P X , that is K [ F β ( Q X ) ] | P X = F β ( P X ) , then F β ( Q X ) remains concave at P X for all β β .
Notice that, according to (15) and (16), we can write
L IB ( β ) = H ( Y ) β H ( X ) K [ F β ( Q X ) ] | P X ,
and
L PF ( β ) = H ( Y ) β H ( X ) K [ F β ( Q X ) ] | P X .
In light of the above arguments, we can write
L IB ( β ) = 0 ,
for all β > β IB where β IB is the smallest β such that F β ( P X ) touches K [ F β ( Q X ) ] . Similarly,
L PF ( β ) = 0 ,
for all β < β PF where β PF is the largest β such that F β ( P X ) touches K [ F β ( Q X ) ] . In the following theorem, we show that β IB and β PF are given by the values of IB ( 0 ) and PF ( 0 ) , respectively, given in Theorem 4. A similar formulae β IB and β PF were given in [86].
Proposition 1.
We have,
β IB = sup Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) ,
and
β PF = inf Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) .
Kim et al. [80] have recently proposed an efficient algorithm to estimate β IB from samples of P X Y involving a simple optimization problem. This algorithm can be readily adapted for estimating β PF . Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier β , that is
L IB ( β ) = H ( Y ) β H ( X ) K [ F β ( Q X ) ] | P X , for β [ 0 , β IB ] ,
and
L PF ( β ) = H ( Y ) β H ( X ) K [ F β ( Q X ) ] | P X , for β [ β PF , ) .
Remark 1.
As demonstrated by Kolchinsky et al. [53], the boundary points 0 and β IB are required for the computation of L IB ( β ) . In fact, when Y is a deterministic function of X, then only β = 0 and β = β IB are required to compute the IB and other values of β are vacuous. The same argument can also be used to justify the inclusion of β PF in computing L PF ( β ) . Note also that since F β ( Q X ) becomes convex for β > β IB , computing K [ F β ( Q X ) ] becomes trivial for such values of β.
Remark 2.
Observe that the lower convex envelope of any function f can be obtained by taking Legendre-Fenchel transformation (aka. convex conjugate) twice. Hence, one can use the existing linear-time algorithms for approximating Legendre-Fenchel transformation (e.g., [87,88]) for approximating K [ F β ( Q X ) ] .
Once L IB ( β ) and L PF ( β ) are computed, we can derive IB and PF via standard results in optimization (see [3] (Section IV) for more details):
IB ( R ) = inf β [ 0 , β IB ] β R + L IB ( β ) ,
and
PF ( r ) = sup β [ β PF , ] β r + L PF ( β ) .
Following the convex analysis approach outlined by Witsenhausen and Wyner [3], IB and PF can be directly computed from L IB ( β ) and L PF ( β ) by observing the following. Suppose for some β , K [ F β ( Q X ) ] (resp. K [ F β ( Q X ) ] ) at P X is obtained by a convex combination of points F β ( Q i ) , i [ k ] for some Q 1 , , Q k in P ( X ) , integer k 2 , and weights λ i 0 (with i λ i = 1 ). Then i λ i Q i = P X , and T with properties P T ( i ) = λ i and P X | T = i = Q i attains the minimum (resp. maximum) of H ( Y | T ) β H ( X | T ) . Hence, ( I ( X ; T ) , I ( Y ; T ) ) is a point on the upper (resp. lower) boundary of M ; implying that IB ( R ) = I ( Y ; T ) for R = I ( X ; T ) (resp. PF ( r ) = I ( Y ; T ) for r = I ( X ; T ) ). If for some β , K [ F β ( Q X ) ] at P X coincides with F β [ P X ] , then this corresponds to L IB ( β ) = 0 . The same holds for K [ F β ( Q X ) ] . Thus, all the information about the functional IB (resp. PF ) is contained in the subset of the domain of K [ F β ( Q X ) ] (resp. K [ F β ( Q X ) ] ) over which it differs from F β ( Q X ) . We will revisit and generalize this approach later in Section 3.
We can now instantiate this for the binary symmetric case. Suppose X and Y are binary variables and P Y | X is binary symmetric channel with crossover probability δ , denoted by BSC ( δ ) and defined as
BSC ( δ ) = 1 δ δ δ 1 δ ,
for some δ 0 . To describe the result in a compact fashion, we introduce the following notation: we let h b : [ 0 , 1 ] [ 0 , 1 ] denote the binary entropy function, i.e., h b ( p ) = p log p ( 1 p ) log ( 1 p ) . Since this function is strictly increasing [ 0 , 1 2 ] , its inverse exists and is denoted by h b 1 : [ 0 , 1 ] [ 0 , 1 2 ] . Moreover, a b a ( 1 b ) + b ( 1 a ) for a , b [ 0 , 1 ] .
Lemma 4
(Mr. and Mrs. Gerber’s Lemma). For X Bernoulli ( p ) for p 1 2 and P Y | X = BSC ( δ ) for δ 0 , we have
IB ( R ) = h b ( p δ ) h b δ h b 1 h b ( p ) R ,
and
PF ( r ) = h b ( p δ ) α h b δ p z α ¯ h b δ ,
where r = h b ( p ) α h b p z , z = max α , 2 p , and α [ 0 , 1 ] .
The result in (24) was proved by Wyner and Ziv [2] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber’s Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping F β : [ 0 , 1 ] R given by
F β ( q ) F β ( Q X ) = h b ( q δ ) β h b ( q ) ,
where q δ q δ ¯ + δ q ¯ is the output distribution of BSC ( δ ) when the input distribution is Bernoulli ( q ) for some q ( 0 , 1 ) . It can be verified that β IB ( 1 2 δ ) 2 . This function is depicted in Figure 4 depending of the values of β ( 1 2 δ ) 2 .

2.3. Operational Meaning of IB and PF

In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both IB and PF . The operational interpretation of IB has recently been extensively studied in information-theoretic settings in [29,30]. In particular, it was shown that IB specifies the rate-distortion region of noisy source coding problem [19,89] under the logarithmic loss as the distortion measure and also the rate region of the lossless source coding with side information at the decoder [90]. Here, we state the former setting (as it will be useful for our subsequent analysis of cardinality bound) and also provide a new information-theoretic setting in which IB appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by PF . This in fact delineate an important difference between IB and PF : while IB describes the entire rate-region of an information-theoretic setup, PF specifies only a corner point of a rate region. Other information-theoretic settings related to IB and PF include CEO problem [91] and source coding for the Gray-Wyner network [92].

2.3.1. Noisy Source Coding

Suppose Alice has access only to a noisy version X of a source of interest Y. She wishes to transmit a rate-constrained description from her observation (i.e., X) to Bob such that he can recover Y with small average distortion. More precisely, let ( X n , Y n ) be n i.i.d. samples of ( X , Y ) P X Y . Alice encodes her observation X n through an encoder ϕ : X n { 1 , , K n } and sends ϕ ( X n ) to Bob. Upon receiving ϕ ( X n ) , Bob reconstructs a “soft” estimate of Y n via a decoder ψ : { 1 , , K n } Y ^ n where Y ^ = P ( Y ) . That is, the reproduction sequence y ^ n consists of n probability measures on Y . For any source and reproduction sequences y n and y ^ n , respectively, the distortion is defined as
d ( y n , y ^ n ) 1 n i = 1 n d ( y i , y ^ i ) ,
where
d ( y , y ^ ) log 1 y ^ ( y ) .
We say that a pair of rate-distortion ( R , D ) is achievable if there exists a pair ( ϕ , ψ ) of encoder and decoder such that
lim sup n E [ d ( Y n , ψ ( ϕ ( X n ) ) ) ] D , and lim sup n 1 n log K n R .
The noisy rate-distortion function R noisy ( D ) for a given D 0 , is defined as the minimum rate R such that ( R , D ) is an achievable rate-distortion pair. This problem arises naturally in many data analytic problems. Some examples include feature selection of a high-dimensional dataset, clustering, and matrix completion. This problem was first studied by Dobrushin and Tsybakov [19], who showed that R noisy ( D ) is analogous to the classical rate-distortion function
Entropy 22 01325 i008
It can be easily verified that E [ d ( Y , Y ^ ) ] = H ( Y | Y ^ ) and hence (after relabeling Y ^ as T)
Entropy 22 01325 i009
where R = H ( Y ) D , which is equal to IB ˜ defined in (4). For more details in connection between noisy source coding and IB , the reader is referred to [29,30,91,93]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by
lim n 1 n I ( Y n ; ψ ( ϕ ( X n ) ) ) ] R , and lim sup n 1 n log K n R .
This problem is addressed in [94] for discrete alphabets X and Y and extended recently in [95] for any general alphabets.

2.3.2. Test against Independence with Communication Constraint

As mentioned earlier, the connection between IB and noisy source coding, described above, was known and studied in [29,30]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for IB . Given n i.i.d. samples ( X 1 , Y 1 ) , , ( X n , Y n ) from joint distribution Q, we wish to test whether X i are independent of Y i , that is, Q is a product distribution. This task is formulated by the following hypothesis test:
H 0 : Q = P X Y , H 1 : Q = P X P Y ,
for a given joint distribution P X Y with marginals P X and P Y . Ahlswede and Csiszár [20] investigated this problem under a communication constraint: While Y observations (i.e., Y 1 , , Y n ) are available, the X observations need to be compressed at rate R, that is, instead of X n , only ϕ ( X n ) is present where ϕ : X n { 1 , , K n } satisfies
1 n log K n R .
For the type I error probability not exceeding a fixed ε ( 0 , 1 ) , Ahlswede and Csiszár [20] derived the smallest possible type 2 error probability, defined as
β R ( n , ε ) = min ϕ : X n [ K ] 1 n log K n R min A [ K n ] × Y n ( P ϕ ( X n ) × P Y n ) ( A ) : P ϕ ( X n ) × Y n ( A ) 1 ε .
The following gives the asymptotic expression of β R ( n , ε ) for every ε ( 0 , 1 ) . For the proof, refer to [20] (Theorem 3).
Theorem 7
([20]). For every R 0 and ε ( 0 , 1 ) , we have
lim n 1 n log β R ( n , ε ) = IB ( R ) .
In light of this theorem, IB ( R ) specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.

2.3.3. Dependence Dilution

Inspired by the problems of information amplification [96] and state masking [97], Asoodeh et al. [21] proposed the dependence dilution setup as follows. Consider a source sequences X n of n i.i.d. copies of X P X . Alice observes the source X n and wishes to encode it via the encoder
f n : X n { 1 , 2 , , 2 n R } ,
for some R > 0 . The goal is to ensure that any user observing f n ( X n ) can construct a list, of fixed size, of sequences in X n that contains likely candidates of the actual sequence X n while revealing negligible information about a correlated source Y n . To formulate this goal, consider the decoder
g n : { 1 , 2 , , 2 n R } 2 X n ,
where 2 X n denotes the power set of X n . A dependence dilution triple ( R , Γ , Δ ) R + 3 is said to be achievable if, for any δ > 0 , there exists a pair of encoder and decoder ( f n , g n ) such that for sufficiently large n
Pr X n g n ( J ) < δ ,
having fixed size | g n ( J ) | = 2 n ( H ( X ) Γ ) , where J = f n ( X n ) and simultaneously
1 n I ( Y n ; J ) Δ + δ .
Notice that without side information J, the decoder can only construct a list of size 2 n H ( X ) which contains X n with probability close to one. However, after J is observed and the list g n ( J ) is formed, the decoder’s list size can be reduced to 2 n ( H ( X ) Γ ) and thus reducing the uncertainty about X n by n Γ [ 0 , n H ( X ) ] . This observation can be formalized to show (see [96] for details) that the constraint (32) is equivalent to
1 n I ( X n ; J ) Γ δ ,
which lower bounds the amount of information J carries about X n . Built on this equivalent formulation, Asoodeh et al. [21] (Corollary 15) derived a necessary condition for the achievable dependence dilution triple.
Theorem 8
([21]). Any achievable dependence dilution triple ( R , Γ , Δ ) satisfies
R Γ Γ I ( X ; T ) Δ I ( Y ; T ) I ( X ; T ) + Γ ,
for some auxiliary random variable T satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T and taking | T | | X | + 1 values.
According to this theorem, PF ( Γ ) specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate Γ . While this informs the operational interpretation of PF , Theorem 8 only provides an outer bound for the set of achievable dependence dilution triple ( R , Γ , Δ ) . It is, however, not clear that PF characterizes the rate region of an information-theoretic setup.
The fact that IB fully characterizes the rate-region of an source coding setup has an important consequence: the cardinality of the auxiliary random variable T in IB can be improved to | X | instead of | X | + 1 .

2.4. Cardinality Bound

Recall that in the definition of IB in (4), no assumption was imposed on the auxiliary random variable T. A straightforward application of Carathéodory-Fenchel-Eggleston theorem (see e.g., [98] (Section III) or [79] (Lemma 15.4)) reveals that IB is attained for T taking values in a set T with cardinality | T | | X | + 1 . Here, we improve this bound and show | T | | X | is sufficient.
Theorem 9.
For any joint distribution P X Y and R ( 0 , H ( X ) ] , information bottleneck IB ( R ) is achieved by T taking at most | X | values.
The proof of this theorem hinges on the operational characterization of IB as the lower boundary of the rate-distortion region of noisy source coding problem discussed in Section 2.3. Specifically, we first show that the extreme points of this region is achieved by T taking | X | values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such T. It must be mentioned that this result was already claimed by Harremoës and Tishby in [99] without proof.
In many practical scenarios, feature X has a large alphabet. Hence, the bound | T | | X | , albeit optimal, still can make the information bottleneck function computationally intractable over large alphabets. However, label Y usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for T in terms of | Y | , one can consider approximating IB assuming T takes N values. The following result, recently proved by Hirche and Winter [100], is in this spirit.
Theorem 10
([100]). For any ( X , Y ) P X Y , we have
IB ( R , N ) IB ( R ) IB ( R , N ) + δ ( N ) ,
where δ ( N ) = 4 N 1 | Y | log | Y | 4 + 1 | Y | log N and IB ( R , N ) denotes the information bottleneck functional (4) with the additional constraint that | T | N .
Recall that, unlike PF , the graph of IB characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section 2.3), and hence any boundary points can be constructed via time-sharing of extreme points of the rate region. This lack of operational characterization of PF translates into a worse cardinality bound than that of IB . In fact, for PF the cardinality bound | T | | X | + 1 cannot be improved in general. To demonstrate this, we numerically solve the optimization in PF assuming that | T | = | X | when both X and Y are binary. As illustrated in Figure 5, this optimization does not lead to a convex function, and hence, cannot be equal to PF .

2.5. Deterministic Information Bottleneck

As mentioned earlier, IB formalizes an information-theoretic approach to clustering high-dimensional feature X into cluster labels T that preserve as much information about the label Y as possible. The clustering label is assigned by the soft operator P T | X that solves the IB formulation (4) according to the rule: X = x is likely assigned label T = t if D KL ( P Y | x P Y | t ) is small where P Y | t = x P Y | x P X | t . That is, clustering is assigned based on the similarity of conditional distributions. As in many practical scenarios, a hard clustering operator is preferred, Strouse and Schwab [31] suggested the following variant of IB , termed as deterministic information bottleneck dIB
dIB ( P X Y , R ) sup f : X T , H ( f ( X ) ) R I ( Y ; f ( X ) ) ,
where the maximization is taken over all deterministic functions f whose range is a finite set T . Similarly, one can define
dPF ( P X Y , r ) inf f : X T , H ( f ( X ) ) r I ( Y ; f ( X ) ) .
One way to ensure that H ( f ( X ) ) R for a deterministic function f is to restrict the cardinality of the range of f: if f : X [ e R ] then H ( f ( X ) ) is necessarily smaller than R. Using this insight, we derive a lower for dIB ( P X Y , R ) in the following lemma.
Lemma 5.
For any given P X Y , we have
dIB ( P X Y , R ) e R 1 | X | I ( X ; Y ) ,
and
dPF ( P X Y , r ) e r 1 | X | I ( X ; Y ) + Pr ( X e r ) log 1 Pr ( X e r ) .
Note that both R and r are smaller than H ( X ) and thus the multiplicative factors of I ( X ; Y ) in the lemma are smaller than one. In light of this lemma, we can obtain
e R 1 | X | I ( X ; Y ) IB ( R ) I ( X ; Y ) ,
and
PF ( r ) e r 1 | X | I ( X ; Y ) + Pr ( X e r ) log 1 Pr ( X e r ) .
In most of practical setups, | X | might be very large, making the above lower bound for IB vacuous. In the following lemma, we partially address this issue by deriving a bound independent of X when Y is binary.
Lemma 6.
Let P X Y be a joint distribution of arbitrary X and binary Y Bernoulli ( q ) for some q ( 0 , 1 ) . Then, for any R log 5 we have
dIB ( P X Y , R ) I ( X ; Y ) 2 α h b I ( X ; Y ) 2 α ( e R 4 ) ,
where α = max { log 1 q , log 1 1 q } .

3. Family of Bottleneck Problems

In this section, we introduce a family of bottleneck problems by extending IB and PF to a large family of statistical measures. Similar to IB and PF , these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution P X Y . Recall that R IB ( P X Y , R ) and r PF ( P X Y , r ) are the upper and lower boundary of the set M defined in (6) and expressed here again for convenience
Entropy 22 01325 i010
Since P X Y is given, H ( X ) and H ( Y ) are fixed. Thus, in characterizing M it is sufficient to consider only H ( X | T ) and H ( Y | T ) . To generalize IB and PF , we must therefore generalize H ( X | T ) and H ( Y | T ) .
Given a joint distribution P X Y and two non-negative real-valued functions Φ : P ( X ) R + and Ψ : P ( Y ) R + , we define
Φ ( X | T ) E Φ ( P X | T ) = t T P T ( t ) Φ ( P X | T = t ) ,
and
Ψ ( Y | T ) E Ψ ( P Y | T ) = t T P T ( t ) Ψ ( P Y | T = t ) .
When X P X and Y P Y , we interchangeably write Φ ( X ) for Φ ( P X ) and Φ ( Y ) for Ψ ( P Y ) .
These definitions provide natural generalizations for Shannon’s entropy and mutual information. Moreover, as we discuss later in Section 3.2 and Section 3.3, it also can be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information and estimation theoretic quantities such as Arimoto’s conditional entropy of order α for Φ ( Q X ) = | | Q X | | α , probability of correctly guessing for Φ ( Q X ) = | | Q X | | , maximal correlation for binary case, and f-information for Φ ( Q X ) given by f-divergence. We are able to generate a family of bottleneck problems using different instantiations of Φ ( X | T ) and Ψ ( Y | T ) in place of mutual information in IB and PF . As we argue later, these problems better capture the essence of “informativeness” and “privacy”; thus providing analytical and interpretable guarantees similar in spirit to IB and PF .
Computing these bottleneck problems in general boils down to the following optimization problems
Entropy 22 01325 i011
and
Entropy 22 01325 i012
Consider the set
Entropy 22 01325 i013
Note that if both Φ and Ψ are continuous (with respect to the total variation distance), then M Φ , Ψ is compact. Moreover, it can be easily verified that M Φ , Ψ is convex. Hence, its upper and lower boundaries are well-defined and are characterized by the graphs of U Φ , Ψ and L Φ , Ψ , respectively. As mentioned earlier, these functional are instrumental for computing the general bottleneck problem later. Hence, before we delve into the examples of bottleneck problems, we extend the approach given in Section 2.2 to compute U Φ , Ψ and L Φ , Ψ .

3.1. Evaluation of U Φ , Ψ and L Φ , Ψ

Analogous to Section 2.2, we first introduce the Lagrangians of U Φ , Ψ and L Φ , Ψ as
L Φ , Ψ U ( β ) sup P T | X Ψ ( Y | T ) β Φ ( X | T ) ,
and
L Φ , Ψ L ( β ) inf P T | X Ψ ( Y | T ) β Φ ( X | T ) ,
where β 0 is the Lagrange multiplier, respectively. Let ( X , Y ) be a pair of random variable with X Q X and Y is the result of passing X through the channel P Y | X . Letting
F β Φ , Ψ ( Q X ) Ψ ( Y ) β Φ ( X ) ,
we obtain that
L Φ , Ψ U ( β ) = K [ F β Φ , Ψ ( Q X ) ] | P X and L Φ , Ψ L ( β ) = K [ F β Φ , Ψ ( Q X ) ] | P X ,
recalling that K and K are the upper concave and lower convex envelope operators. Once we compute L Φ , Ψ U and L Φ , Ψ L for all β ≥ 0 , we can use standard results in optimization theory (similar to (21) and (22)) to recover U Φ , Ψ and L Φ , Ψ . However, we can instead extend the approach of Witsenhausen and Wyner [3] described in Section 2.2. Suppose for some β , K [ F β Φ , Ψ ( Q X ) ] (resp. K [ F β Φ , Ψ ( Q X ) ] ) at P X is obtained by a convex combination of points F β Φ , Ψ ( Q i ) , i ∈ [ k ] , for some Q 1 , … , Q k in P ( X ) , integer k ≥ 2 , and weights λ i ≥ 0 (with ∑ i λ i = 1 ). Then ∑ i λ i Q i = P X , and T with properties P T ( i ) = λ i and P X | T = i = Q i attains the maximum (resp. minimum) of Ψ ( Y | T ) − β Φ ( X | T ) , implying that ( Φ ( X | T ) , Ψ ( Y | T ) ) is a point on the upper (resp. lower) boundary of M Φ , Ψ . Consequently, such T satisfies U Φ , Ψ ( ζ ) = Ψ ( Y | T ) for ζ = Φ ( X | T ) (resp. L Φ , Ψ ( ζ ) = Ψ ( Y | T ) for ζ = Φ ( X | T ) ). The algorithm to compute U Φ , Ψ and L Φ , Ψ is then summarized in the following three steps:
  • Construct the functional F β Φ , Ψ ( Q X ) Ψ ( Y ) β Φ ( X ) for X Q X and Y Q X P Y | X and all Q X P ( X ) and β 0 .
  • Compute K [ F Φ , Ψ ( Q X ) ] | P X and K [ F Φ , Ψ ( Q X ) ] | P X evaluated at P X .
  • If for distributions Q 1 , … , Q k in P ( X ) for some k ≥ 1 , we have K [ F Φ , Ψ ( Q X ) ] | P X = ∑ i = 1 k λ i F Φ , Ψ ( Q i ) or K [ F Φ , Ψ ( Q X ) ] | P X = ∑ i = 1 k λ i F Φ , Ψ ( Q i ) for some λ i ≥ 0 satisfying ∑ i = 1 k λ i = 1 , then P X | T = i = Q i , i ∈ [ k ] , and P T ( i ) = λ i give the optimal T in U Φ , Ψ and L Φ , Ψ , respectively.
We will apply this approach to analytically compute U Φ , Ψ and L Φ , Ψ (and the corresponding bottleneck problems) for binary cases in the following sections.
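For binary X, a distribution Q X is described by a single number q ∈ [ 0 , 1 ] , so the envelopes in the three steps above can be approximated on a grid. The sketch below is our own illustration of this procedure (under a simple discretization; it is not the authors’ implementation): it computes the lower convex envelope of a sampled F β via the lower convex hull of its graph, and the hull points touching the envelope play the role of the distributions Q i whose mixture weights define the optimal T.

```python
import numpy as np

def lower_convex_envelope(q, f):
    """Lower convex envelope of the sampled graph {(q_i, f_i)}; q must be sorted."""
    hull = []                                    # indices of points on the lower hull
    for i in range(len(q)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            cross = (q[i1]-q[i0])*(f[i]-f[i0]) - (f[i1]-f[i0])*(q[i]-q[i0])
            if cross <= 0:                       # middle point is on or above the chord: drop it
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(q, q[hull], f[hull]), hull

def hb(q):                                       # binary entropy (nats)
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q*np.log(q) + (1-q)*np.log(1-q))

# Example: F_beta of the guessing bottleneck (Section 3.2) for a BSC(delta).
delta, beta, p = 0.1, 0.3, 0.25                  # illustrative values
q = np.linspace(0.0, 1.0, 2001)
F = np.maximum(delta + q*(1-2*delta), 1 - delta - q*(1-2*delta)) + beta*hb(q)
env, hull = lower_convex_envelope(q, F)

# Wherever the envelope lies strictly below F at the prior p, the optimal T mixes
# the bracketing hull points q_i = P_{X|T=i}(1) with weights solving sum_i w_i q_i = p.
print("envelope gap at p:", np.interp(p, q, F) - np.interp(p, q, env))
```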

3.2. Guessing Bottleneck Problems

Let P X Y be given with marginals P X and P Y and the corresponding channel P Y | X . Let also Q X P ( X ) be an arbitrary distribution on X and Q Y = Q X P Y | X be the output distribution of P Y | X when fed with Q X . Any channel P T | X , together with the Markov structure Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T, generates unique P X | T and P Y | T . We need the following basic definition from statistics.
Definition 1.
Let U be a discrete random variable and V an arbitrary random variable, supported on U and V with | U | < ∞ , respectively. Then P c ( U ) , the probability of correctly guessing U, and P c ( U | V ) , the probability of correctly guessing U given V, are given by
P c ( U ) max u U P U ( u ) ,
and
P c ( U | V ) max g Pr ( U = g ( V ) ) = E max u U P U | V ( u | V ) .
Moreover, the multiplicative gain of the observation V in guessing U is defined as (the reason for ∞ in the notation becomes clear later)
I ∞ ( U ; V ) ≜ log ( P c ( U | V ) / P c ( U ) ) .
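For finite alphabets these quantities are straightforward to evaluate from a joint pmf; the short Python sketch below is our own illustration (the example joint distribution is arbitrary).

```python
import numpy as np

def pc(P_U):
    """Probability of correctly guessing U from its marginal."""
    return np.max(P_U)

def pc_given(P_UV):
    """P_c(U|V) = sum_v max_u P_{UV}(u,v) for a joint pmf of shape (|U|, |V|)."""
    return np.sum(np.max(P_UV, axis=0))

def I_inf(P_UV):
    """Multiplicative gain I_inf(U;V) = log P_c(U|V) / P_c(U)."""
    return np.log(pc_given(P_UV) / pc(P_UV.sum(axis=1)))

# Example joint distribution of (U, V).
P_UV = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
print(pc(P_UV.sum(axis=1)), pc_given(P_UV), I_inf(P_UV))
```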
As the names suggest, P c ( U | V ) and P c ( U ) characterize the optimal efficiency of guessing U with or without the observation V, respectively. Intuitively, I ( U ; V ) quantifies how useful the observation V is in estimating U: If it is small, then it means it is nearly as hard for an adversary observing V to guess U as it is without V. This observation motivates the use of I ( Y ; T ) as a measure of privacy in lieu of I ( Y ; T ) in PF .
It is worth noting that I ∞ ( U ; V ) is not symmetric in general, i.e., I ∞ ( U ; V ) ≠ I ∞ ( V ; U ) . Since observing T can only improve the probability of correctly guessing, we have P c ( Y | T ) ≥ P c ( Y ) ; thus I ∞ ( Y ; T ) ≥ 0 . However, I ∞ ( Y ; T ) = 0 does not necessarily imply independence of Y and T; instead, it means that T is useless for estimating Y. As an example, consider Y ∼ Bernoulli ( p ) and P T | Y = 0 = Bernoulli ( δ ) and P T | Y = 1 = Bernoulli ( η ) with δ , η ≤ 1 / 2 < p . Then P c ( Y ) = p and
P c ( Y | T ) = max { δ ¯ p ¯ , η p } + η ¯ p .
Thus, if δ ¯ p ¯ ≤ η p , then P c ( Y | T ) = P c ( Y ) . This then implies that I ∞ ( Y ; T ) = 0 whereas Y and T are clearly dependent; i.e., I ( Y ; T ) > 0 . While in general I ∞ ( Y ; T ) and I ( Y ; T ) are not related, it can be shown that I ( Y ; T ) ≤ I ∞ ( Y ; T ) if Y is uniform (see [65] (Proposition 1)). Hence, only under this uniformity assumption does I ∞ ( Y ; T ) = 0 imply independence.
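This phenomenon is easy to reproduce numerically. The snippet below (our own check, with illustrative parameters p = 0.7, δ = 0.5, η = 0.4) builds the joint distribution of ( Y , T ) and confirms that P c ( Y | T ) = P c ( Y ) , so I ∞ ( Y ; T ) = 0 , even though Y and T are dependent.

```python
import numpy as np

# Y ~ Bernoulli(p); T|Y=0 ~ Bernoulli(delta), T|Y=1 ~ Bernoulli(eta).
p, delta, eta = 0.7, 0.5, 0.4             # illustrative values with delta, eta <= 1/2 < p
P_YT = np.array([[(1-p)*(1-delta), (1-p)*delta],
                 [p*(1-eta),       p*eta]])
pc_Y = np.max(P_YT.sum(axis=1))           # = p
pc_Y_given_T = np.sum(np.max(P_YT, axis=0))
print("I_inf(Y;T) =", np.log(pc_Y_given_T / pc_Y))    # 0.0: T is useless for guessing Y
P_T = P_YT.sum(axis=0)
print("dependent?", not np.allclose(P_YT, np.outer(P_YT.sum(axis=1), P_T)))  # True
```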
Consider Φ ( Q X ) = − ∑ x ∈ X Q X ( x ) log Q X ( x ) and Ψ ( Q Y ) = | | Q Y | | ∞ . Clearly, we have Φ ( X | T ) = H ( X | T ) . Note that
Ψ ( Y | T ) = ∑ t ∈ T P T ( t ) | | P Y | T = t | | ∞ = P c ( Y | T ) ,
thus both measures H ( X | T ) and P c ( Y | T ) are special cases of the models described in the previous section. In particular, we can define the corresponding U Φ , Ψ and L Φ , Ψ . We will see later that I ( X ; T ) and P c ( Y | T ) correspond to Arimoto’s mutual information of orders 1 and , respectively. Define
Entropy 22 01325 i014
This bottleneck functional comes with an interpretable guarantee:
IB ( , 1 ) ( R ) characterizes   the   best   error   probability   in   recovering   Y   among   all   R - bit   summaries   of   X
Recall that the functional PF ( r ) aims at extracting maximum information of X while protecting privacy with respect to Y. Measuring the privacy in terms of P c ( Y | T ) , this objective can be better formulated by
Entropy 22 01325 i015
with the interpretable privacy guarantee:
PF ( , 1 ) ( r ) characterizes the smallest probability of revealing private feature Y among all representations of X preserving at least r bits information of X
Notice that the variable T in the formulations of IB ( , 1 ) and PF ( , 1 ) takes values in a set T of arbitrary cardinality. However, a straightforward application of the Carathéodory-Fenchel-Eggleston theorem (see e.g., [79] (Lemma 15.4)) reveals that the cardinality of T can be restricted to | X | + 1 without loss of generality. In the following lemma, we prove more basic properties of IB ( , 1 ) and PF ( , 1 ) .
Lemma 7.
For any P X Y with Y supported on a finite set Y , we have
  • IB ( , 1 ) ( 0 ) = PF ( , 1 ) ( 0 ) = 0 .
  • IB ( , 1 ) ( R ) = I ( X ; Y ) for any R H ( X ) and PF ( , 1 ) ( r ) = I ( X ; Y ) for r H ( X ) .
  • R exp ( IB ( , 1 ) ( R ) ) is strictly increasing and concave on the range ( 0 , I ( X ; Y ) ) .
  • r exp ( PF ( , 1 ) ( r ) ) is strictly increasing, and convex on the range ( 0 , I ( X ; Y ) ) .
The proof follows the same lines as Theorem 1 and hence omitted. Lemma 7 in particular implies that inequalities I ( X ; T ) R and I ( X ; T ) r in the definition of IB ( , 1 ) and PF ( , 1 ) can be replaced by I ( X ; T ) = R and I ( X ; T ) = r , respectively. It can be verified that I satisfies the data-processing inequality, i.e., I ( Y ; T ) I ( Y ; X ) for the Markov chain Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T. Hence, both IB ( , 1 ) and PF ( , 1 ) must be smaller than I ( Y ; X ) . The properties listed in Lemma 7 enable us to derive a slightly tighter upper bound for PF ( , 1 ) as demonstrated in the following.
Lemma 8.
For any P X Y with Y supported on a finite set Y , we have
PF ( ∞ , 1 ) ( r ) ≤ log ( 1 + ( r / H ( X ) ) ( e^{ I ∞ ( Y ; X ) } − 1 ) ) ,
and
log ( 1 + ( R / H ( X ) ) ( e^{ I ∞ ( Y ; X ) } − 1 ) ) ≤ IB ( ∞ , 1 ) ( R ) ≤ I ∞ ( Y ; X ) .
The proof of this lemma (and any other results in this section) is given in Appendix B. This lemma shows that the gap between I ( Y ; X ) and IB ( , 1 ) ( R ) when R is sufficiently close to H ( X ) behaves like
I ( Y ; X ) IB ( , 1 ) ( R ) I ( Y ; X ) log 1 + R H ( X ) e I ( Y ; X ) 1 1 e I ( Y ; X ) 1 R H ( X ) .
Thus, IB ( , 1 ) ( R ) approaches I ( Y ; X ) as R H ( X ) at least linearly.
In the following theorem, we apply the technique delineated in Section 3.1 to derive closed form expressions for IB ( ∞ , 1 ) and PF ( ∞ , 1 ) for the binary symmetric case, thereby establishing results similar to Mr. and Mrs. Gerber’s Lemma.
Theorem 11.
For X ∼ Bernoulli ( p ) and P Y | X = BSC ( δ ) with p , δ ≤ 1 / 2 , we have
PF ( ∞ , 1 ) ( r ) = log [ ( δ ¯ − ( h b ( p ) − r ) ( 1 / 2 − δ ) ) / ( 1 − δ ∗ p ) ] ,
and
IB ( ∞ , 1 ) ( R ) = log [ ( 1 − δ ∗ h b^{-1} ( h b ( p ) − R ) ) / ( 1 − δ ∗ p ) ] ,
where δ ¯ = 1 − δ .
As described in Section 3.1, to compute IB ( ∞ , 1 ) and PF ( ∞ , 1 ) it suffices to derive the convex and concave envelopes of the mapping F β ( ∞ , 1 ) ( q ) ≜ P c ( Y ) + β H ( X ) , where X ∼ Bernoulli ( q ) and Y is the result of passing X through BSC ( δ ) , i.e., Y ∼ Bernoulli ( δ ∗ q ) . In this case, P c ( Y ) = max { δ ∗ q , 1 − δ ∗ q } and F β ( ∞ , 1 ) can be expressed as
q ↦ F β ( ∞ , 1 ) ( q ) = max { δ ∗ q , 1 − δ ∗ q } + β h b ( q ) .
This function is depicted in Figure 6.
The detailed derivation of the convex and concave envelopes of F β ( ∞ , 1 ) is given in Appendix B. The proof of this theorem also reveals the following intuitive statements. If X ∼ Bernoulli ( p ) and P Y | X = BSC ( δ ) , then among all random variables T satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T and H ( X | T ) ≤ λ , the minimum P c ( Y | T ) is given by δ ¯ − λ ( 0.5 − δ ) . Notice that, without any information constraint (i.e., λ = 0 ), P c ( Y | T ) = P c ( Y | X ) = δ ¯ . Perhaps surprisingly, this shows that the mutual information constraint has a linear effect on the privacy of Y. Similarly, to prove (51), we show that among all R-bit representations T of X, the best achievable accuracy P c ( Y | T ) is given by 1 − δ ∗ h b^{-1} ( h b ( p ) − R ) . This can be proved by combining Mrs. Gerber’s Lemma (cf. Lemma 4) and Fano’s inequality as follows. For all T such that H ( X | T ) ≥ λ , the minimum of H ( Y | T ) is given by h b ( δ ∗ h b^{-1} ( λ ) ) . Since, by Fano’s inequality, H ( Y | T ) ≤ h b ( 1 − P c ( Y | T ) ) , we obtain δ ∗ h b^{-1} ( λ ) ≤ 1 − P c ( Y | T ) , which leads to the same result as above. Nevertheless, in Appendix B we give another proof based on the discussion of Section 3.1.
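The closed-form expressions of Theorem 11 can be checked numerically against the achieving channels sketched above. The following Python snippet is our own sanity check (with illustrative values of p, δ, r, and R, and with binary entropies measured in bits to match the parameterization α = h b ( p ) − r used in the proof): it evaluates the two formulas and reproduces them from the explicit ternary and binary constructions.

```python
import numpy as np

def hb(q):                                        # binary entropy in bits
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q*np.log2(q) + (1-q)*np.log2(1-q))

def hb_inv(h):                                    # inverse of hb on [0, 1/2], by bisection
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if hb(mid) < h else (lo, mid)
    return (lo + hi) / 2

star = lambda a, b: a*(1-b) + b*(1-a)             # binary convolution

p, delta = 0.2, 0.1                               # illustrative values with p, delta <= 1/2
pc_Y = 1 - star(delta, p)                         # P_c(Y): guessing Y without any observation

# PF^(inf,1)(r): closed form vs. the explicit ternary T (weight a on P_{X|T} = Bern(1/2)).
r = 0.3                                           # information budget, r <= h_b(p)
a = hb(p) - r
pf_formula   = np.log(((1-delta) - a*(0.5-delta)) / pc_Y)
pf_construct = np.log(((1-a)*(1-delta) + a*0.5) / pc_Y)
print(pf_formula, pf_construct)                   # the two values coincide

# IB^(inf,1)(R): binary T with P_{X|T=0} = Bern(alpha), P_{X|T=1} = Bern(1-alpha).
R = 0.3                                           # rate budget, R <= h_b(p)
alpha = hb_inv(hb(p) - R)
print(np.log((1 - star(delta, alpha)) / pc_Y))    # best achievable I_inf(Y;T) at rate R
```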

3.3. Arimoto Bottleneck Problems

The bottleneck framework proposed in the last section benefits from interpretable guarantees brought forth by the quantity I ∞ . In this section, we define a parametric family of statistical quantities, the so-called Arimoto’s mutual information, which includes both Shannon’s mutual information and I ∞ as extreme cases.
Definition 2
([22]). Let U P U and V P V be two random variables supported over finite sets U and V , respectively. Their Arimoto’s mutual information of order α > 1 is defined as
I α ( U ; V ) = H α ( U ) H α ( U | V ) ,
where
H α ( U ) ≜ ( α / ( 1 − α ) ) log | | P U | | α ,
is the Rényi entropy of order α and
H α ( U | V ) ≜ ( α / ( 1 − α ) ) log ∑ v ∈ V P V ( v ) | | P U | V = v | | α ,
is the Arimoto’s conditional entropy of order α.
By continuous extension, one can define I α ( U ; V ) for α = 1 and α = ∞ as I ( U ; V ) and I ∞ ( U ; V ) , respectively. That is,
lim α → 1 + I α ( U ; V ) = I ( U ; V ) , and lim α → ∞ I α ( U ; V ) = I ∞ ( U ; V ) .
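The following short Python sketch (our own illustration, with an arbitrary joint pmf) evaluates I α directly from Definition 2 and shows the two limiting regimes numerically.

```python
import numpy as np

def renyi_entropy(P_U, alpha):
    return alpha/(1-alpha) * np.log(np.linalg.norm(P_U, ord=alpha))

def arimoto_cond_entropy(P_UV, alpha):
    """H_alpha(U|V) for a joint pmf of shape (|U|, |V|): the alpha-norm is taken
    over the columns P_{U|V=v}, weighted by P_V(v)."""
    P_V = P_UV.sum(axis=0)
    P_U_given_V = P_UV / P_V[None, :]
    norms = np.array([np.linalg.norm(P_U_given_V[:, v], ord=alpha) for v in range(len(P_V))])
    return alpha/(1-alpha) * np.log(np.sum(P_V * norms))

def arimoto_mi(P_UV, alpha):
    return renyi_entropy(P_UV.sum(axis=1), alpha) - arimoto_cond_entropy(P_UV, alpha)

P_UV = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
# alpha -> 1 recovers Shannon's I(U;V); large alpha approaches I_inf(U;V).
for alpha in (1.001, 2.0, 10.0, 1000.0):
    print(alpha, arimoto_mi(P_UV, alpha))
print("I_inf:", np.log(np.sum(np.max(P_UV, axis=0)) / np.max(P_UV.sum(axis=1))))
```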
Arimoto’s mutual information was first introduced by Arimoto [22] and then later revisited by Liese and Vajda in [101] and more recently by Verdú in [102]. More in-depth analysis and properties of I α can be found in [103]. It is shown in [71] (Lemma 1) that I α ( U ; V ) for α [ 1 , ] quantifies the minimum loss in recovering U given V where the loss is measured in terms of the so-called α -loss. This loss function reduces to logarithmic loss (27) and P c ( U | V ) for α = 1 and α = , respectively. This sheds light on the utility and/or privacy guarantee promised by a constraint on Arimoto’s mutual information. It is now natural to use I α for defining a family of bottleneck problems.
Definition 3.
Given a pair of random variables ( X , Y ) P X Y over finite sets X and Y and α , γ [ 1 , ] , we define IB ( α , γ ) and PF ( α , γ ) as
Entropy 22 01325 i016
and
Entropy 22 01325 i017
Of course, IB ( 1 , 1 ) ( R ) = IB ( R ) and PF ( 1 , 1 ) ( r ) = PF ( r ) . It is known that Arimoto’s mutual information satisfies the data-processing inequality [103] (Corollary 1), i.e., I α ( Y ; T ) I α ( Y ; X ) for the Markov chain Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T. On the other hand, I γ ( X ; T ) H γ ( X ) . Thus, both IB ( α , γ ) ( R ) and PF ( α , γ ) ( r ) equal I α ( Y ; X ) for R , r H γ ( X ) . Note also that H α ( Y | T ) = α 1 α log Ψ ( Y | T ) where Ψ ( Y | T ) (see (39)) corresponding to the function Ψ ( Q Y ) = | | Q Y | | α . Consequently, IB ( α , γ ) and PF ( α , γ ) are characterized by the lower and upper boundary of M Φ , Ψ , defined in (37), with respect to Φ ( Q X ) = | | Q X | | γ and Ψ ( Q Y ) = | | Q Y | | α . Specifically, we have
IB ( α , γ ) ( R ) = H α ( Y ) + α α 1 log U Φ , Ψ ( ζ ) ,
where ζ = e ( 1 1 γ ) ( H γ ( X ) R ) , and
PF ( α , γ ) ( r ) = H α ( Y ) + α α 1 log L Φ , Ψ ( ζ ) ,
where ζ = e ( 1 1 γ ) ( H γ ( X ) r ) and Φ ( Q X ) = | | Q X | | γ and Ψ ( Q Y ) = | | Q Y | | α . This paves the way to apply the technique described in Section 2.2 to compute IB ( α , γ ) and PF ( α , γ ) . Doing so requires the upper concave and lower convex envelopes of the mapping Q X ↦ | | Q Y | | α − β | | Q X | | γ for some β ≥ 0 , where Q Y = Q X P Y | X . In the following theorem, we derive these envelopes and give closed form expressions for IB ( α , γ ) and PF ( α , γ ) for the special case where α = γ ≥ 2 .
Theorem 12.
Let X Bernoulli ( p ) and P Y | X = BSC ( δ ) with p , δ 1 2 . We have for α 2
PF ( α , α ) ( r ) = α 1 α log p δ α q δ α ,
where a α [ a , a ¯ ] α for a [ 0 , 1 ] and q p solves
α 1 α log p α q α = r .
Moreover,
IB ( α , α ) ( R ) = α α 1 log λ ¯ δ α + λ q z δ α p α α ,
where z = max { 2 p , λ } and λ [ 0 , 1 ] solves
α α 1 log λ ¯ + λ p z α p α = R .
By letting α → ∞ , this theorem indicates that for X and Y connected through BSC ( δ ) and all variables T forming Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T, we have
P c ( X | T ) ≥ λ ⟹ P c ( Y | T ) ≥ δ ∗ λ ,
which can be shown to be achieved by the T generated by the following channel (see Figure 7):
P T | X = ( ( λ − p ) / p ¯   λ ¯ / p ¯ ; 0   1 ) , where the rows are indexed by x ∈ { 0 , 1 } and the columns by t ∈ { 0 , 1 } .
Note that, by assumption, p ≤ 1 / 2 , and hence the event { X = 1 } is less likely than { X = 0 } . Therefore, (61) demonstrates that to ensure correct recoverability of X with probability at least λ , the most private approach (with respect to Y) is to obfuscate the more likely event { X = 0 } with probability λ ¯ / p ¯ . As demonstrated in (61), the optimal privacy guarantee is linear in the utility parameter in the binary symmetric case. This is in fact a special case of a more general result recently proved in [65] (Theorem 1): the infimum of P c ( Y | T ) over all variables T such that P c ( X | T ) ≥ λ is piece-wise linear in λ , or equivalently, the mapping e^r ↦ exp ( PF ( ∞ , ∞ ) ( r ) ) is piece-wise linear.
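Reading the channel in (62) as P T | X ( 0 | 0 ) = ( λ − p ) / p ¯ , P T | X ( 1 | 0 ) = λ ¯ / p ¯ , and P T | X ( 1 | 1 ) = 1 (our reading of the displayed matrix), the following Python sketch verifies numerically, for illustrative values of p, δ, and λ ≥ 1 − p, that this channel attains P c ( X | T ) = λ and P c ( Y | T ) = δ ∗ λ .

```python
import numpy as np

p, delta, lam = 0.3, 0.1, 0.8               # illustrative values; need p <= 1/2 and lam >= 1 - p
P_X = np.array([1-p, p])
P_Y_given_X = np.array([[1-delta, delta],
                        [delta, 1-delta]])  # BSC(delta)
P_T_given_X = np.array([[(lam-p)/(1-p), (1-lam)/(1-p)],
                        [0.0,           1.0]])          # channel from (62), as reconstructed

P_XT = P_X[:, None] * P_T_given_X
P_YT = P_Y_given_X.T @ P_XT                 # joint of (Y, T), using Y -- X -- T
pc_X_given_T = np.sum(np.max(P_XT, axis=0))
pc_Y_given_T = np.sum(np.max(P_YT, axis=0))
print(pc_X_given_T, lam)                                 # both equal lambda
print(pc_Y_given_T, lam*(1-delta) + (1-lam)*delta)       # both equal the binary convolution delta * lambda
```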
Computing PF ( α , γ ) analytically for every α , γ > 1 seems to be challenging, however, the following lemma provides bounds for PF ( α , γ ) and IB ( α , γ ) in terms of PF ( , ) and IB ( , ) , respectively.
Lemma 9.
For any pair of random variables ( X , Y ) over finite alphabets and α , γ > 1 , we have
α α 1 PF ( , ) ( f ( r ) ) α α 1 H ( Y ) + H α ( Y ) PF ( α , γ ) ( r ) PF ( , ) ( g ( r ) ) + H α ( Y ) H ( Y ) ,
and
α α 1 IB ( , ) ( f ( R ) ) α α 1 H ( Y ) + H α ( Y ) IB ( α , γ ) ( R ) IB ( , ) ( g ( R ) ) + H α ( Y ) H ( Y ) ,
where f ( a ) = max { a H γ ( X ) + H ( X ) , 0 } and g ( b ) = γ 1 γ b + H ( X ) γ 1 γ H γ ( X ) .
The previous lemma can be directly applied to derive upper and lower bounds for PF ( α , γ ) and IB ( α , γ ) given PF ( , ) and IB ( , ) .

3.4. f-Bottleneck Problems

In this section, we describe another instantiation of the general framework introduced in terms of functions Φ and Ψ that enjoys interpretable estimation-theoretic guarantee.
Definition 4.
Let f : ( 0 , ) R be a convex function with f ( 1 ) = 0 . Furthermore, let U and V be two real-valued random variables supported over U and V , respectively. Their f-information is defined by
I f ( U ; V ) ≜ D f ( P U V ∥ P U × P V ) ,
where D f ( · ∥ · ) is the f-divergence [104] between distributions, defined as
D f ( P ∥ Q ) ≜ E Q [ f ( d P / d Q ) ] .
Due to convexity of f, we have D f ( P ∥ Q ) ≥ f ( 1 ) = 0 and hence f-information is always non-negative. If, furthermore, f is strictly convex at 1, then equality holds if and only if P = Q . Csiszár introduced f-divergence in [104] and applied it to several problems in statistics and information theory. More recent developments about the properties of f-divergence and f-information can be found in [23] and the references therein. Any convex function f with the property f ( 1 ) = 0 results in an f-information. Popular examples include f ( t ) = t log t corresponding to Shannon’s mutual information, f ( t ) = | t − 1 | corresponding to T-information [83], and f ( t ) = t 2 − 1 corresponding to χ 2 -information [69]. It is worth mentioning that if we allow α to be in ( 0 , 1 ) in Definition 2 (similar to [101]), then the resulting Arimoto’s mutual information can be shown to be an f-information in the binary case for a certain function f, see [101] (Theorem 8).
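As an illustration (ours, with an arbitrary joint pmf), the f-informations listed above can be computed directly from their definition:

```python
import numpy as np

def f_information(P_UV, f):
    """I_f(U;V) = D_f(P_UV || P_U x P_V) for finite alphabets with full support."""
    P_U = P_UV.sum(axis=1)
    P_V = P_UV.sum(axis=0)
    Q = np.outer(P_U, P_V)                    # product of the marginals
    return np.sum(Q * f(P_UV / Q))

kl   = lambda t: t*np.log(t)                  # Shannon mutual information
tv   = lambda t: np.abs(t - 1)                # T-information, f(t) = |t - 1|
chi2 = lambda t: t**2 - 1                     # chi^2-information

P_UV = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
for name, f in [("KL", kl), ("TV", tv), ("chi2", chi2)]:
    print(name, f_information(P_UV, f))
```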
Let ( X , Y ) P X Y be given with marginals P X and P Y . Consider functions Φ and Ψ on P ( X ) and P ( Y ) defined as
Φ ( Q X ) D f ( Q X P X ) and Ψ ( Q Y ) D f ( Q Y P Y ) .
Given a conditional distribution P T | X , it is easy to verify that Φ ( X | T ) = I f ( X ; T ) and Ψ ( Y | T ) = I f ( Y ; T ) . This in turn implies that f-information can be utilized in (40) and (41) to define general bottleneck: Let f : ( 0 , ) R and g : ( 0 , ) R be two convex functions satisfying f ( 1 ) = g ( 1 ) = 0 . Then we define
Entropy 22 01325 i018
and
Entropy 22 01325 i019
In light of the discussion in Section 3.1, the optimization problems in IB ( f , g ) and PF ( f , g ) can be analytically solved by determining the upper concave and lower convex envelopes of the mapping
Q X F β ( f , g ) D f ( Q Y P Y ) β D g ( Q X P X ) ,
where β 0 is the Lagrange multiplier and Q Y = Q X P Y | X .
Consider the function f α ( t ) = t α 1 α 1 with α ( 1 , ) ( 1 , ) . The corresponding f-divergence is sometimes called Hellinger divergence of order α , see e.g., [105]. Note that Hellinger divergence of order 2 reduces to χ 2 -divergence. Calmon et al. [68] and Asoodeh et al. [67] showed that if I f 2 ( Y ; T ) ε for some ε ( 0 , 1 ) , then the minimum mean-squared error (MMSE) of reconstructing any zero-mean unit-variance function of Y given T is lower bounded by 1 ε , i.e., no function of Y can be reconstructed with small MMSE given an observation of T. This result serves a natural justification for I f 2 as an operational measure of both privacy and utility in a bottleneck problem.
Unfortunately, our approach described in Section 3.1 cannot be used to compute IB ( f 2 , f 2 ) or PF ( f 2 , f 2 ) in the binary symmetric case. The difficulty lies in the fact that the function F β f 2 , f 2 , defined in (66), for the binary symmetric case is either convex or concave on its entire domain depending on the value of β . Nevertheless, one can consider Hellinger divergence of order α with α 2 and then apply our approach to compute IB ( f α , f α ) or PF ( f α , f α ) . Since D f 2 ( P Q ) ( 1 + ( α 1 ) D f α ( P Q ) ) 1 / ( α 1 ) 1 (see [106] (Corollary 5.6)), one can justify I f α as a measure of privacy and utility in a similar way as I f 2 .
We end this section with a remark about estimating the measures studied in this section. While we consider the information-theoretic regime where the underlying distribution P X Y is known, in practice only samples ( x i , y i ) are given. Consequently, the de facto guarantees of bottleneck problems might be considerably different from those shown in this work. It is therefore essential to assess the guarantees of bottleneck problems when accessing only samples. To do so, one must derive bounds on the discrepancy between P c , I α , and I f computed on the empirical distribution and on the true (unknown) distribution. These bounds can then be used to shed light on the de facto guarantees of the bottleneck problems. Relying on [34] (Theorem 1), one can obtain that the gaps between the measures P c , I α , and I f computed on empirical distributions and the true one scale as O ( 1 / √n ) , where n is the number of samples. This is in contrast with mutual information, for which the similar upper bound scales as O ( log n / √n ) , as shown in [33]. Therefore, the above measures appear to be easier to estimate than mutual information.
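The estimation remark can be illustrated with a small simulation (our own sketch; the 4 × 4 joint pmf and the sample sizes are arbitrary): the plug-in estimate of I ∞ computed from empirical frequencies is compared with its true value, and under the cited results the gap is expected to shrink roughly at rate 1 / √n .

```python
import numpy as np

rng = np.random.default_rng(0)
P_YT = rng.dirichlet(np.ones(16)).reshape(4, 4)   # an arbitrary joint pmf of (Y, T)

def i_inf(P):
    return np.log(np.sum(np.max(P, axis=0)) / np.max(P.sum(axis=1)))

true_val = i_inf(P_YT)
flat = P_YT.ravel()
for n in (100, 1_000, 10_000, 100_000):
    counts = rng.multinomial(n, flat).reshape(4, 4)  # n i.i.d. samples from P_YT
    print(n, abs(i_inf(counts / n) - true_val))      # plug-in estimation error
```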

4. Summary and Concluding Remarks

Following the recent surge in the use of information bottleneck ( IB ) and privacy funnel ( PF ) in developing and analyzing machine learning models, we investigated the functional properties of these two optimization problems. Specifically, we showed that IB and PF correspond to the upper and lower boundary of a two-dimensional convex set M = { ( I ( X ; T ) , I ( Y ; T ) ) : Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T} where ( X , Y ) P X Y represents the observable data X and target feature Y and the auxiliary random variable T varies over all possible choices satisfying the Markov relation Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T. This unifying perspective on IB and PF allowed us to adapt the classical technique of Witsenhausen and Wyner [3] devised for computing IB to be applicable for PF as well. We illustrated this by deriving a closed form expression for PF in the binary case—a result reminiscent of the Mrs. Gerber’s Lemma [2] in information theory literature. We then showed that both IB and PF are closely related to several information-theoretic coding problems such as noisy random coding, hypothesis testing against independence, and dependence dilution. While these connections were partially known in previous work (see e.g., [29,30]), we show that they lead to an improvement on the cardinality of T for computing IB . We then turned our attention to the continuous setting where X and Y are continuous random variables. Solving the optimization problems in IB and PF in this case without any further assumptions seems a difficult challenge in general and leads to theoretical results only when ( X , Y ) is jointly Gaussian. Invoking recent results on the entropy power inequality [25] and strong data processing inequality [27], we obtained tight bounds on IB in two different cases: (1) when Y is a Gaussian perturbation of X and (2) when X is a Gaussian perturbation of Y. We also utilized the celebrated I-MMSE relationship [107] to derive a second-order approximation of PF when T is considered to be a Gaussian perturbation of X.
In the second part of the paper, we argue that the choice of (Shannon’s) mutual information in both IB and PF does not seem to carry specific operational significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [1] that can be solved iteratively (without any guarantee to convergence though). In fact, this property is unique to mutual information among other existing information measures [99]. Nevertheless, we argued that other information measures might lead to better interpretable guarantee for both IB and PF . For instance, statistical accuracy in IB and privacy leakage in PF can be shown to be precisely characterized by probability of correctly guessing (aka Bayes risk) or minimum mean-squared error (MMSE). Following this observation, we introduced a large family of optimization problems, which we call bottleneck problems, by replacing mutual information in IB and PF with Arimoto’s mutual information [22] or f-information [23]. Invoking results from [33,34], we also demonstrated that these information measures are in general easier to estimate from data than mutual information. Similar to IB and PF , the bottleneck problems were shown to be fully characterized by boundaries of a two-dimensional convex set parameterized by two real-valued non-negative functions Φ and Ψ . This perspective enabled us to generalize the technique used to compute IB and PF for evaluating bottleneck problems. Applying this technique to the binary case, we derived closed form expressions for several bottleneck problems.

Author Contributions

All authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This material is based upon work supported by the National Science Foundation under Grant No. CIF 1900750 and CIF CAREER 1845852.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs from Section 2

Proof of Theorem 1.
  • Note that R = 0 in the optimization problem (4) implies that X and T are independent. Since Y , X and T form the Markov chain Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T, independence of X and T implies independence of Y and T and thus I ( Y ; T ) = 0 . Similarly for PF ( 0 ) .
  • Since I ( X ; T ) H ( X ) for any random variable T, we have T = X satisfies the information constraint I ( X ; T ) R for R H ( X ) . Since I ( Y ; T ) I ( Y ; X ) , this choice is optimal. Similarly for PF , the constraint I ( X ; T ) r for r H ( X ) implies T = X . Hence, PF ( r ) = I ( Y ; X ) .
  • The upper bound on IB follows from the data processing inequality: I ( Y ; T ) min { I ( X ; T ) , I ( X ; Y ) } for all T satisfying the Markov condition Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T.
  • To prove the lower bound on PF , note that
    I ( Y ; T ) = I ( X ; T ) I ( X ; T | Y ) I ( X ; T ) H ( X | Y ) .
  • The concavity of R ↦ IB ( R ) follows from the fact that it is the upper boundary of the convex set M , defined in (6). This in turn implies the continuity of IB ( · ) . Monotonicity of R ↦ IB ( R ) follows from the definition. Strict monotonicity follows from the concavity and the fact that IB ( H ( X ) ) = I ( X ; Y ) .
  • Similar as above.
  • The differentiability of the map R IB ( R ) follows from [94] (Lemma 6). This result in fact implies the differentiability of the map r PF ( r ) as well. Continuity of the derivative of IB and PF on ( 0 , H ( X ) ) is a straightforward application of [108] (Theorem 25.5).
  • Monotonicity of mappings R IB ( R ) R and r PF ( r ) r follows from the concavity and convexity of IB ( · ) and PF ( · ) , respectively.
  • Strict monotonicity of IB ( · ) and PF ( · ) imply that the optimization problems in (4) and (5) occur when the inequality in the constraints becomes equality. □
Proof of Theorem 3.
Recall that, according to Theorem 1, the mappings R ↦ IB ( R ) and r ↦ PF ( r ) are concave and convex, respectively. This implies that IB ( R ) (resp. PF ( r ) ) lies above (resp. below) the chord connecting ( 0 , 0 ) and ( H ( X ) , I ( X ; Y ) ) . This proves the lower bound (resp. upper bound) IB ( R ) ≥ R I ( X ; Y ) / H ( X ) (resp. PF ( r ) ≤ r I ( X ; Y ) / H ( X ) ).
In light of the convexity of PF and monotonicity of r PF ( r ) r , we can write
Entropy 22 01325 i020
where the last equality is due to [13] (Lemma 4) and Q Y is the output distribution of the channel P Y | X when the input is distributed according to Q X . Similarly, we can write
Entropy 22 01325 i021
where the last equality is due to [82] (Theorem 4). □
Proof of Theorem 5.
Let T n be an optimal summary of X n , that is, it satisfies Tn Entropy 22 01325 i001 Xn Entropy 22 01325 i001 Yn and I ( X n ; T n ) = n R . We can write
I ( X n , T n ) = H ( X n ) H ( X n | T n ) = k = 1 n H ( X k ) H ( X k | X k 1 , T n ) = k = 1 n I ( X k ; X k 1 , T n ) ,
and hence, if R k I ( X k ; X k 1 , T n ) , then we have
R = 1 n k = 1 n R k .
We can similarly write
I ( Y n , T n ) = H ( Y n ) H ( Y n | T n ) = k = 1 n H ( Y k ) H ( Y k | Y k 1 , T n ) k = 1 n H ( Y k ) H ( Y k | Y k 1 , X k 1 , T n ) = k = 1 n H ( Y k ) H ( Y k | X k 1 , T n ) = k = 1 n I ( Y k ; X k 1 , T n ) .
Since we have ( T n , X k 1 ) Entropy 22 01325 i001 Xk Entropy 22 01325 i001 Yk for every k [ n ] , we conclude from the above inequality that
I ( Y n , T n ) k = 1 n I ( Y k ; X k 1 , T n ) k = 1 n IB ( P X Y , R k ) n IB ( P X Y , R ) ,
where the last inequality follows from concavity of the map x IB ( P X Y , x ) and (A1). Consequently, we obtain
IB ( P X n Y n , n R ) n IB ( P X Y , R ) .
To prove the other direction, let P T | X be an optimal channel in the definition of IB , i.e., I ( X ; T ) = R and IB ( P X Y , R ) = I ( Y ; T ) . Then using this channel n times for each pair ( X i , Y i ) , we obtain T n = ( T 1 , , T n ) satisfying Tn Entropy 22 01325 i001 Xn Entropy 22 01325 i001 Yn. Since I ( X n ; T n ) = n I ( X ; T ) = n R and I ( Y n ; T n ) = n I ( Y ; T ) , we have IB ( P X n Y n , n R ) n IB ( P X Y , R ) . This, together with (A3), concludes the proof. □
Proof of Theorem 4.
First notice that
Entropy 22 01325 i022
where the last equality is due to [82] (Theorem 4). Similarly,
Entropy 22 01325 i023
where the last equality is due to [13] (Lemma 4).
Fix x 0 X with P X ( x 0 ) > 0 and let T be a Bernoulli random variable specified by the following channel
P T | X ( 1 | x ) = δ 1 { x = x 0 } ,
for some δ > 0 . This channel induces T Bernoulli ( δ P X ( x 0 ) ) , P Y | T ( y | 1 ) = P Y | X ( y | x 0 ) , and
P Y | T ( y | 0 ) = P Y ( y ) P X Y ( x 0 , y ) 1 δ P X ( x 0 ) .
It can be verified that
I ( X ; T ) = δ P X ( x 0 ) log P X ( x 0 ) + o ( δ ) ,
and
I ( Y ; T ) = δ P X ( x 0 ) D KL ( P Y | X ( · | x 0 ) P Y ( · ) ) + o ( δ ) .
Setting
δ = r P X ( x 0 ) log P X ( x 0 ) ,
we obtain
I ( Y ; T ) = D KL ( P Y | X ( · | x 0 ) P Y ( · ) ) log P X ( x 0 ) r + o ( r ) ,
and hence
PF ( r ) D KL ( P Y | X ( · | x 0 ) P Y ( · ) ) log P X ( x 0 ) r + o ( r ) .
Since x 0 is arbitrary, the result follows. The proof for IB follows similarly. □
Proof of Lemma 1.
When Y is an erasure of X, i.e., Y = X { } with P Y | X ( x | x ) = 1 δ and P Y | X ( | x ) = δ , it is straightforward to verify that D KL ( Q Y P Y ) = ( 1 δ ) D KL ( Q X P X ) for every P X and Q X in P ( X ) . Consequently, we have
inf Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) = sup Q X P X D KL ( Q Y P Y ) D KL ( Q X P X ) = 1 δ .
Hence, Theorem 3 gives the desired result.
To prove the second part, i.e., when X is an erasure of Y, we need an improved upper bound of PF . Notice that if perfect privacy occurs for a given P X Y , then the upper bound for PF ( r ) in Theorem 3 can be improved:
PF ( r ) ( r r 0 ) I ( X ; Y ) H ( X ) r 0 ,
where r 0 is the largest r 0 such that PF ( r ) = 0 . Here, we show that r 0 = H ( X | Y ) . This suffices to prove the result as (A4), together with Theorem 1, we have
max { r H ( X | Y ) , 0 } PF ( r r 0 ) I ( X ; Y ) H ( X ) r 0 = ( r H ( X | Y ) ) .
To show that PF ( H ( X | Y ) ) = 0 , consider the channel P T | X ( t | x ) = 1 | Y | 1 { t , x } and P T | X ( | ) = 1 . It can be verified that this channel induces T which is independent of Y and that
I ( X ; T ) = H ( T ) H ( T | X ) = H 1 δ | Y | , , 1 δ | Y | , δ ( 1 δ ) log | Y | = h b ( δ ) = H ( X | Y ) ,
where h b ( δ ) δ log δ ( 1 δ ) log ( 1 δ ) is the binary entropy function. □
Proof of Lemma 4.
As mentioned earlier, Equation (24) was proved in [2]. We thus give a proof only for (25).
Consider the problem of minimizing the Lagrangian L PF ( β ) (20) for β β PF . Let X Q X = Bernoulli ( q ) for some q ( 0 , 1 ) and Y be the result of passing X through BSC ( δ ) , i.e., Y Bernoulli ( q δ ) . Recall that F β ( q ) F β ( Q X ) = h b ( q δ ) β h b ( q ) . It suffices to compute K [ F β ( q ) ] the upper concave envelope of q F β ( q ) . It can be verified that β IB ( 1 2 δ ) 2 and hence for all β ( 1 2 δ ) 2 , K [ F β ( q ) ] = F β ( 0 ) . A straightforward computation shows that F β ( q ) is symmetric around q = 1 2 and is also concave in a region around q = 1 2 , where it reaches its local maximum. Hence, if β is such that
  • F β ( 1 2 ) < F β ( 0 ) (see Figure 4a), then K [ F β ( q ) ] is given by the convex combination of F β ( 0 ) and F β ( 1 ) .
  • F β ( 1 2 ) = F β ( 0 ) (see Figure 4b), then K [ F β ( q ) ] is given by the convex combination of F β ( 0 ) and F β ( 1 2 ) and F β ( 1 ) .
  • F β ( 1 2 ) > F β ( 0 ) (see Figure 4c), then there exists q β [ 0 , 1 2 ] such that for q q β , K [ F β ( q ) ] is given by the convex combination of F β ( 0 ) and F β ( q β ) .
Hence, assuming p 1 2 , we can construct T that maximizes H ( Y | T ) β H ( X | T ) in three different cases corresponding three cases above:
  • In the first case, T is binary and we have P X | T = 0 = Bernoulli ( 0 ) and P X | T = 1 = Bernoulli ( 1 ) with P T = Bernoulli ( p ) .
  • In the second case, T is ternary and we have P X | T = 0 = Bernoulli ( 0 ) , P X | T = 1 = Bernoulli ( 1 ) , and P X | T = 2 = Bernoulli ( 1 2 ) with P T = ( 1 p α 2 , p α 2 , α ) for some α [ 0 , 2 p ] .
  • In the third case, T is again binary and we have P X | T = 0 = Bernoulli ( 0 ) and P X | T = 1 = Bernoulli ( p α ) with P T = Bernoulli ( α ) for some α [ 2 p , 1 ] .
Combining these three cases, we obtain the result in (25).□
Proof of Lemma 2.
Let X = Y + σ N G where σ > 0 and N G N ( 0 , 1 ) is independent of Y. According to the improved entropy power inequality proved in [25] (Theorem 1), we can write
e 2 ( H ( X ) I ( Y ; T ) ) e 2 ( H ( Y ) I ( X ; T ) ) + 2 π e σ 2 ,
for any random variable T forming Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T. This, together with Theorem 5, implies the result. □
Proof of Corollary 1.
Since ( X , Y ) are jointly Gaussian, we can write X = Y + σ N G where σ = σ Y √ ( 1 − ρ 2 ) / ρ and σ Y 2 is the variance of Y. Applying Lemma 2 and noticing that H ( X ) = ( 1 / 2 ) log ( 2 π e ( σ Y 2 + σ 2 ) ) , we obtain
I ( Y ; T ) ≤ ( 1 / 2 ) log [ ( σ 2 + σ Y 2 ) / ( σ 2 + σ Y 2 e^{ − 2 I ( X ; T ) } ) ] = ( 1 / 2 ) log [ 1 / ( 1 − ρ 2 + ρ 2 e^{ − 2 I ( X ; T ) } ) ] ,
for all channels P T | X satisfying Y Entropy 22 01325 i001 X Entropy 22 01325 i001 T. This bound is attained by a Gaussian P T | X . Specifically, assuming T = X + M G where σ ˜ 2 = σ Y 2 e^{ − 2 R } / ( ρ 2 ( 1 − e^{ − 2 R } ) ) for R ≥ 0 and M G ∼ N ( 0 , σ ˜ 2 ) independent of X, it can be easily verified that I ( X ; T ) = R and I ( Y ; T ) = ( 1 / 2 ) log [ 1 / ( 1 − ρ 2 + ρ 2 e^{ − 2 R } ) ] . This, together with (A5), implies IB ( R ) = ( 1 / 2 ) log [ 1 / ( 1 − ρ 2 + ρ 2 e^{ − 2 R } ) ] .
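A quick numerical sanity check of this construction (our own sketch, using the closed-form mutual information between jointly Gaussian variables and treating σ ˜ 2 as the additive noise variance) confirms that the stated choice meets the rate constraint with equality and attains the claimed value of I ( Y ; T ) :

```python
import numpy as np

rho, sigma_Y, R = 0.8, 1.0, 0.7              # illustrative values
sigma2 = sigma_Y**2 * (1 - rho**2) / rho**2                            # X = Y + sigma * N_G
sigma_t2 = sigma_Y**2 * np.exp(-2*R) / (rho**2 * (1 - np.exp(-2*R)))   # noise variance of T given X
var_X = sigma_Y**2 + sigma2

I_XT = 0.5 * np.log(1 + var_X / sigma_t2)                # Gaussian channel mutual information
I_YT = 0.5 * np.log(1 + sigma_Y**2 / (sigma2 + sigma_t2))
print(I_XT, R)                                           # equal: the rate constraint holds with equality
print(I_YT, 0.5 * np.log(1 / (1 - rho**2 + rho**2*np.exp(-2*R))))   # equal: the claimed Gaussian IB value
```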
Next, we wish to prove Theorem 6. However, we need the following preliminary lemma before we delve into its proof.
Lemma A1.
Let X and Y be continuous correlated random variables with E [ X 2 ] < and E [ Y 2 ] < . Then the mappings σ I ( X ; T σ ) and σ I ( Y ; T σ ) are continuous, strictly decreasing, and
I ( X ; T σ ) 0 , and I ( Y ; T σ ) 0 as σ .
Proof. 
The finiteness of E [ X 2 ] and E [ Y 2 ] implies that H ( X ) and H ( Y ) are finite. A straightforward application of the entropy power inequality (cf. [109] (Theorem 17.7.3)) implies that H ( T σ ) is also finite. Thus, I ( X ; T σ ) and I ( Y ; T σ ) are well-defined. According to the data processing inequality, we have I ( X ; T σ + δ ) < I ( X ; T σ ) for all δ > 0 and also I ( Y ; T σ + δ ) ≤ I ( Y ; T σ ) , where the equality occurs if and only if X and Y are independent. Since, by assumption, X and Y are correlated, it follows that I ( Y ; T σ + δ ) < I ( Y ; T σ ) . Thus, both I ( X ; T σ ) and I ( Y ; T σ ) are strictly decreasing.
For the proof of continuity, we consider two cases σ = 0 and σ > 0 separately. We first give the poof for I ( X ; T σ ) . Since H ( σ N G ) = 1 2 log ( 2 π e σ 2 ) , we have lim σ 0 H ( σ N G ) = and thus lim σ 0 I ( X ; T σ ) = that is equal to I ( X ; T 0 ) . For σ > 0 , let σ n be a sequence of positive numbers converging to σ . In light of de Bruijn’s identity (cf. [109] (Theorem 17.7.2)), we have H ( T σ n ) H ( T σ ) , implying the continuity of σ I ( X ; T σ ) .
Next, we prove the continuity of σ ↦ I ( Y ; T σ ) . For the sequence of positive numbers σ n converging to σ > 0 , we have I ( Y ; T σ n ) = H ( T σ n ) − H ( T σ n | Y ) . We only need to show H ( T σ n | Y ) → H ( T σ | Y ) . Invoking again de Bruijn’s identity, we obtain H ( T σ n | Y = y ) → H ( T σ | Y = y ) for each y ∈ Y . The desired result follows from the dominated convergence theorem. Finally, the continuity of σ ↦ I ( Y ; T σ ) at σ = 0 follows from [110] (p. 2028), stating that H ( T σ n | Y = y ) → H ( X | Y = y ) , and then applying the dominated convergence theorem.
Note that
0 I ( Y ; T σ ) I ( X ; T σ ) 1 2 log 1 + σ X 2 σ 2 ,
where σ X 2 is the variance of X and the last inequality follows from the fact that I ( X ; X + σ N G ) is maximized when X is Gaussian. Since by assumption σ X < , it follows that both I ( X ; T σ ) and I ( Y ; T σ ) converge to zero as σ . □
In light of this lemma, there exists a unique σ 0 such that I ( X ; T σ ) = r . Let σ r denote such σ . Therefore, we have PF ( r ) = I ( Y ; T σ r ) . This enables us to prove Theorem 6.
Proof of Theorem 6.
The proof relies on the I-MMSE relation in information theory literature. We briefly describe it here for convenience. Given any pair of random variables U and V, the minimum mean-squared error (MMSE) of estimating U given V is given by
mmse ( U | V ) inf f E [ ( U f ( V ) ) 2 ] = E [ U E [ U | V ] 2 ] = E [ var ( U | V ) ] ,
where the infimum is taken over all measurable functions f and var ( U | V ) = E [ ( U E [ U | V ] ) 2 | V ] . Guo et al. [107] proved the following identity, which is referred to as I-MMSE formula, relating the input-output mutual information of the additive Gaussian channel T σ = X + σ N G , where N G N ( 0 , 1 ) is independent of X, with the MMSE of the input given the output:
d d ( σ 2 ) I ( X ; T σ ) = 1 2 σ 4 mmse ( X | T σ ) .
Since Y, X, and T σ form the Markov chain Y Entropy 22 01325 i001 X Entropy 22 01325 i001 Tσ, it follows that I ( Y ; T σ ) = I ( X ; T σ ) I ( X ; T σ | Y ) . Thus, two applications of (A6) yields
d d ( σ 2 ) I ( Y ; T σ ) = 1 2 σ 4 mmse ( X | T σ ) mmse ( X | T σ , Y ) .
The second derivative of I ( X ; T σ ) and I ( Y ; T σ ) are also known via the formula [111] (Proposition 9)
d d ( σ 2 ) mmse ( X | T σ ) = 1 σ 4 E [ var 2 ( X | T σ ) ] and d d ( σ 2 ) mmse ( X | T σ , Y ) = 1 σ 4 E [ var 2 ( X | T σ , Y ) ] .
With these results in mind, we now begin the proof. Recall that σ r is the unique σ such that I ( X ; T σ ) = r , thus implying PF G ( r ) = I ( Y ; T σ r ) . We have
d d r PF G ( r ) = d d ( σ 2 ) I ( Y ; T σ ) σ = σ r d d r σ r 2 .
To compute the derivative of PF ( r ) , we therefore need to compute the derivative of σ r 2 with respect to r. To do so, notice that from the identity I ( X ; T σ r ) = r we can obtain
1 = d d r I ( X ; T σ r ) = d d ( σ 2 ) I ( X ; T σ r ) σ = σ r d d r σ r 2 = 1 2 σ 4 mmse ( X | T σ r ) d d r σ r 2 ,
implying
d d r σ r 2 = 2 σ 4 mmse ( X | T σ r ) .
Plugging this identity into (A9) and invoking (A7), we obtain
d d r PF G ( r ) = mmse ( X | T σ r ) mmse ( X | T σ r , Y ) mmse ( X | T σ r ) .
The second derivative can be obtained via (A8)
d 2 d r 2 PF G ( r ) = 2 E [ var 2 ( X | T σ r , Y ) ] mmse 2 ( X | T σ r ) 2 E [ var 2 ( X | T σ r ) ] mmse ( X | T σ r , Y ) mmse 3 ( X | T σ r ) .
Since σ r as r 0 , we can write
d d r PF G ( r ) | r = 0 = σ X 2 E [ var ( X | Y ) ] σ X = var ( E [ X | Y ] ) σ X 2 = η ( X , Y ) ,
where var ( E [ X | Y ] ) is the variance of the conditional expectation X given Y and the last equality comes from the law of total variance. and
d 2 d r 2 PF G ( r ) | r = 0 = 2 σ X 4 E [ var 2 ( X | Y ) ] σ X 2 E [ var ( X | Y ) ] .
Taylor expansion of PF ( r ) around r = 0 gives the result. □
Proof of Theorem 9.
The main ingredient of this proof is a result by Jana [112] (Lemma 2.2) which provides a tight cardinality bound for the auxiliary random variables in the canonical problems in network information theory (including noisy source coding problem described in Section 2.3). Consider a pair of random variables ( X , Y ) P X Y and let d : Y × Y ^ R be an arbitrary distortion measure defined for arbitrary reconstruction alphabet Y ^ . □
Theorem A1
([112]). Let A be the set of all pairs ( R , D ) satisfying
I ( X ; T ) R and E [ d ( Y , ψ ( T ) ) ] D ,
for some mapping ψ : T Y ^ and some joint distributions P X Y T = P X Y P T | X . Then every extreme points of A corresponds to some choice of auxiliary variable T with alphabet size | T | | X | .
Measuring the distortion in the above theorem in terms of the logarithmic loss as in (27), we obtain that
A = { ( R , D ) R + 2 : R R noisy ( D ) } ,
where R noisy ( D ) is given in (29). We observed in Section 2.3 that IB is fully characterized by the mapping D R noisy ( D ) and thus by A . In light of Theorem A1, all extreme points of A are achieved by a choice of T with cardinality size | T | | X | . Let { ( R i , D i ) } be the set of extreme points of A each constructed by channel P T i | X and mapping ψ i . Due to the convexity of A , each point ( R , D ) A is expressed as a convex combination of { ( R i , D i ) } with coefficient { λ i } ; that is there exists a channel P T | X = i λ i P T i | X and a mapping ψ ( T ) = i λ i ψ i ( T i ) such that I ( X ; T ) = R and E [ d ( Y , ψ ( T ) ) ] = D . This construction, often termed timesharing in information theory literature, implies that all points in A (including the boundary points) can be achieved with a variable T with | T | | X | . Since the boundary of A is specified by the mapping R IB ( R ) , we conclude that IB ( R ) is achieved by a variable T with cardinality | T | | X | for very R < H ( X ) .
Proof of Lemma 5.
The following proof is inspired by [32] (Proposition 1). Let X = { 1 , , m } . We sort the elements in X such that
P X ( 1 ) D KL ( P Y | X = 1 P Y ) P X ( m ) D KL ( P Y | X = m P Y ) .
Now consider the function f : X [ M ] given by f ( x ) = x if x < M and f ( x ) = M if x M where M = e R . Let Z = f ( X ) . We have P Z ( i ) = P X ( i ) if i < M and P Z ( M ) = j M P X ( j ) . We can now write
I ( Y ; Z ) = i = 1 M 1 P X ( i ) D ( P Y | X = i P Y ) + P Z ( M ) D ( P Y | Z = M P Y ) i = 1 M 1 P X ( i ) D ( P Y | X = i P Y ) M 1 | X | i X P X ( i ) D ( P Y | X = i P Y ) = M 1 | X | I ( X ; Y ) .
Since f ( X ) takes values in [ M ] , it follows that H ( f ( X ) ) R . Consequently, we have
dIB ( P X Y , R ) sup f : X [ M ] I ( Y ; f ( X ) ) M 1 | X | I ( X ; Y ) .
For the privacy funnel, the proof proceeds as follows. We sort the elements in X such that
P X ( 1 ) D KL ( P Y | X = 1 P Y ) P X ( m ) D KL ( P Y | X = m P Y ) .
Consider now the function f : X [ M ] given by f ( x ) = x if x < M and f ( x ) = M if x M . As before, let Z = f ( X ) . Then, we can write,
I ( Y ; Z ) = i = 1 M 1 P X ( i ) D ( P Y | X = i P Y ) + P Z ( M ) D ( P Y | Z = M P Y ) M 1 | X | i X P X ( i ) D ( P Y | X = i P Y ) + P Z ( M ) D ( P Y | Z = M P Y ) = M 1 | X | I ( X ; Y ) + P Z ( M ) y Y P Y | Z ( y | M ) log P Y | Z ( y | M ) P Y ( y ) M 1 | X | I ( X ; Y ) + Pr ( X M ) y Y i P Y | X ( y | i ) P X ( i ) 1 { i M } Pr ( X M ) log i P Y | X ( y | i ) P X ( i ) 1 { i M } Pr ( X M ) i P Y | X ( y | i ) P X ( i ) M 1 | X | I ( X ; Y ) + y Y i M P Y | X ( y | i ) P X ( i ) log 1 Pr ( X M ) = M 1 | X | I ( X ; Y ) + Pr ( X M ) log 1 Pr ( X M )
where the last inequality is due to the log-sum inequality. □
Proof of Lemma 6.
Employing the same argument as in the proof of [32] (Theorem 3), we obtain that there exists a function f : X [ M ] such that
I ( Y ; f ( X ) ) η I ( X ; Y )
for any η ( 0 , 1 ) and
M 4 + 4 ( 1 η ) log 2 log 2 α ( 1 η ) I ( X ; Y ) .
Since h b 1 ( x ) x log 2 log 1 x for all x ( 0 , 1 ] , it follows from above that (noticing that I ( X ; Y ) α )
M 4 + I ( X ; Y ) 2 α 1 h b 1 ( ζ ) ,
where ζ ( 1 η ) I ( X ; Y ) 2 α . Rearranging this, we obtain
h b 1 ( ζ ) I ( X ; Y ) 2 α ( M 4 ) .
Assuming M 5 , we have I ( X ; Y ) 2 α ( M 4 ) 1 2 and hence
ζ h b I ( X ; Y ) 2 α ( M 4 ) ,
implying
η 1 2 α I ( X ; Y ) h b I ( X ; Y ) 2 α ( M 4 ) .
Plugging this into (A11), we obtain
I ( Y ; f ( X ) ) ≥ I ( X ; Y ) − 2 α h b ( I ( X ; Y ) / ( 2 α ( M − 4 ) ) ) .
As before, if M = e R , then H ( f ( X ) ) R . Hence,
dIB ( P X Y , R ) ≥ I ( X ; Y ) − 2 α h b ( I ( X ; Y ) / ( 2 α ( e^R − 4 ) ) ) ,
for all R log 5 . □

Appendix B. Proofs from Section 3

Proof of Lemma 8.
To prove the upper bound on PF ( , 1 ) , recall that r e PF ( , 1 ) ( r ) is convex. Thus, it lies below the chord connecting points ( 0 , 0 ) and ( H ( X ) , e I ( X ; Y ) ) . The lower bound on IB ( , 1 ) is similarly obtained using the concavity of R e IB ( , 1 ) ( R ) . This is achievable by an erasure channel. To see this consider the random variable T δ taking values in X { } that is obtained by conditional distributions P T δ | X ( t | x ) = δ ¯ I t = x and P T δ | X ( | x ) = δ for some δ 0 . It can be verified that I ( X ; T δ ) = δ ¯ H ( X ) and P c ( Y | T δ ) = δ ¯ P c ( Y | X ) + δ P c ( Y ) . By taking δ = 1 R H ( X ) , this channel meets the constraint I ( X ; T δ ) = R . Hence,
IB ( ∞ , 1 ) ( R ) ≥ log [ P c ( Y | T δ ) / P c ( Y ) ] = log [ 1 − R / H ( X ) + ( R / H ( X ) ) P c ( Y | X ) / P c ( Y ) ] .
Proof of Theorem 11.
We begin by PF ( , 1 ) . As described in Section 3.1, and similar to Mrs. Gerber’s Lemma (Lemma 4), we need to construct the lower convex envelope K [ F β ( , 1 ) ] of F β ( , 1 ) ( q ) = P c ( Y ) + β H ( X ) where X Bernoulli ( q ) and Y is the result of passing X through BSC ( δ ) , i.e., Y Bernoulli ( δ q ) . In this case, P c ( Y ) = max { δ q , 1 δ q } . Hence, we need to determine the lower convex envelope of the map
q ↦ F β ( ∞ , 1 ) ( q ) = max { δ ∗ q , 1 − δ ∗ q } + β h b ( q ) .
A straightforward computation shows that F β ( , 1 ) ( q ) is symmetric around q = 1 2 and is also concave in q on q [ 0 , 1 2 ] for any β . Hence, K [ F β ( , 1 ) ] is obtained as follows depending on the values of β :
  • F β ( , 1 ) ( 1 2 ) < F β ( , 1 ) ( 0 ) (see Figure 6a), then K [ F β ( , 1 ) ] is given by the convex combination of F β ( , 1 ) ( 0 ) , F β ( , 1 ) ( 1 ) , and F β ( , 1 ) ( 1 2 ) .
  • F β ( , 1 ) ( 1 2 ) = F β ( , 1 ) ( 0 ) (see Figure 6b), then K [ F β ( , 1 ) ] is given by the convex combination of F β ( , 1 ) ( 0 ) , F β ( , 1 ) ( 1 2 ) , and F β ( , 1 ) ( 1 ) .
  • F β ( , 1 ) ( 1 2 ) > F β ( , 1 ) ( 0 ) (see Figure 6c), then K [ F β ( , 1 ) ] is given by the convex combination of F β ( , 1 ) ( 0 ) and F β ( , 1 ) ( 1 ) .
Hence, assuming p ≤ 1 / 2 , we can construct T that minimizes P c ( Y | T ) + β H ( X | T ) . Considering the first two cases, we obtain that T is ternary with P X | T = 0 = Bernoulli ( 0 ) , P X | T = 1 = Bernoulli ( 1 ) , and P X | T = 2 = Bernoulli ( 1 / 2 ) with marginal P T = [ 1 − p − α / 2 , p − α / 2 , α ] for some α ∈ [ 0 , 2 p ] . This leads to P c ( Y | T ) = α ¯ δ ¯ + α / 2 and I ( X ; T ) = h b ( p ) − α . Note that P c ( Y | T ) covers the entire range [ P c ( Y ) , δ ¯ ] by varying α on [ 0 , 2 p ] . Replacing I ( X ; T ) by r, we obtain α = h b ( p ) − r , leading to P c ( Y | T ) = δ ¯ − ( h b ( p ) − r ) ( 1 / 2 − δ ) . Since P c ( Y ) = 1 − δ ∗ p , the desired result follows.
To derive the expression for IB ( , 1 ) , recall that we need to derive K [ F β ( , 1 ) ] the upper concave envelope of F β ( , 1 ) . It is clear from Figure 6 that K [ F β ( , 1 ) ] is obtained by replacing F β ( , 1 ) ( q ) on the interval [ q β , 1 q β ] by its maximum value over q where
q β ≜ 1 / ( 1 + e^{ ( 1 − 2 δ ) / β } ) ,
is the maximizer of F β ( , 1 ) ( q ) on [ 0 , 1 2 ] . In other words,
K [ F β ( , 1 ) ( q ) ] = F β ( , 1 ) ( q β ) , for q [ q β , 1 q β ] , F β ( , 1 ) ( q ) , otherwise .
Note that if p < q β then K [ F β ( , 1 ) ] evaluated at p coincides with F β ( , 1 ) ( p ) . This corresponds to all trivial P T | X such that P c ( Y | T ) + β H ( X | T ) = P c ( Y ) + β H ( X ) . If, on the other hand, p q β , then K [ F β ( , 1 ) is the convex combination of F β ( , 1 ) ( q β ) and F β ( , 1 ) ( 1 q β ) . Hence, taking q β as a parameter (say, α ), the optimal binary T is constructed as follows: P X | T = 0 = Bernoulli ( α ) and P X | T = 1 = Bernoulli ( α ¯ ) for α p . Such channel induces
P c ( Y | T ) = max { α δ , 1 α δ } = 1 α δ ,
as α p 1 2 , and also
I ( X ; T ) = h b ( p ) − h b ( α ) .
Combining these two, we obtain
P c ( Y | T ) = 1 − δ ∗ h b^{-1} ( h b ( p ) − R ) .
Proof of Theorem 12.
Let U α and L α denote the U Φ , Ψ and L Φ , Ψ , respectively, when Ψ ( Q X ) = Φ ( Q X ) = Q X α . In light of (59) and (60), it is sufficient to compute L α and U α . To do so, we need to construct the lower convex envelope K [ F β ( α ) ] and upper concave envelope K [ F β ( α ) ] of the map F β ( α ) ( q ) given by q Q Y α β Q X α where X Bernoulli ( q ) and Y is the result of passing X through BSC ( δ ) , i.e., Y Bernoulli ( δ q ) . In this case, we have
q F β ( α ) ( q ) = q δ α β | | q | | α ,
where a α is to mean [ a , a ¯ ] α for any a [ 0 , 1 ] .
We begin by L α for which we aim at obtaining K [ F β ( α ) ] . A straightforward computation shows that F β ( α ) ( q ) is convex for β ( 1 2 δ ) 2 and α 2 . For β > ( 1 2 δ ) 2 and α 2 , it can be shown that F β ( α ) ( q ) is concave an interval [ q β , 1 q β ] where q β solves d d q F β ( α ) ( q ) = 0 . (The shape of q F β ( α ) ( r ) in is similar to what was depicted in Figure 4.) By symmetry, K [ F β ( α ) ] is therefore obtained by replacing F β ( α ) ( q ) on this interval by F β ( α ) ( q β ) . Hence, if p < q β , K [ F β ( α ) ] at p coincides with F β ( α ) ( p ) which results in trivial P T | X (see the proof of Theorem 11 for more details). If, on the other hand, p q β , then K [ F β ( α ) ] evaluated at p is given by a convex combination of F β ( α ) ( q β ) and F β ( α ) ( 1 q β ) . Relabeling q β as a parameter (say, q), we can write an optimal binary T via the following: P X | T = 0 = Bernoulli ( 1 q ) and P X | T = 1 = Bernoulli ( q ) for q p . This channel induces Ψ ( Y | T ) = q δ α and Φ ( X | T ) = q α . Hence, the graph of L α is given by
( q α , q δ α ) , 0 q p .
Therefore,
sup P T | X H α ( X | T ) ζ H α ( Y | T ) = α 1 α log q δ α ,
where q p solves α 1 α log q α = ζ . Since the map q q α is strictly decreasing for q [ 0 , 0.5 ] , this equation has a unique solution.
Next, we compute U α or equivalently K [ F β ( α ) ] the upper concave envelop of F β ( α ) defined in (A13). As mentioned earlier, q F β ( α ) ( q ) is convex for β ( 1 2 δ ) 2 and α 2 . For β > ( 1 2 δ ) 2 , we need to consider three cases: (1) K [ F β ( α ) ] is given by the convex combination of F β ( α ) ( 0 ) and F β ( α ) ( 1 ) , (2) K [ F β ( α ) ] is given by the convex combination of F β ( α ) ( 0 ) , F β ( α ) ( 1 2 ) , and F β ( α ) ( 1 ) , (3) K [ F β ( α ) ] is given by the convex combination of F β ( α ) ( 0 ) and F β ( α ) ( q ) where q is a point [ 0 , 1 2 ] . Without loss of generality, we can ignore the first case. The other two cases correspond to the following solutions
  • T is a ternary variable given by P X | T = 0 = Bernoulli ( 0 ) , P X | T = 1 = Bernoulli ( 1 ) , and P X | T = 2 = Bernoulli ( 1 2 ) with marginal T Bernoulli ( 1 p λ 2 , p λ 2 , λ ) for some λ [ 0 , 2 p ] . This produces
    Ψ ( Y | T ) = λ ¯ δ α + λ 1 2 α ,
    and
    Φ ( X | T ) = λ ¯ + λ 1 2 α .
  • T is a binary variable given by P X | T = 0 = Bernoulli ( 0 ) and P X | T = 1 = Bernoulli ( p λ ) with marginal T Bernoulli ( λ ) for some λ [ 2 p , 1 ] . This produces
    Ψ ( Y | T ) = λ ¯ δ α + λ δ p λ α ,
    and
    Φ ( X | T ) = λ ¯ + λ p λ α .
Combining these two cases, can write
U α ( ζ ) = λ ¯ δ α + λ q z δ α ,
where
ζ = λ ¯ + λ q z α ,
and z = max { 2 p , λ } . Plugging this into (59) completes the proof. □
Proof of Lemma 9.
The facts that γ H γ ( X | T ) is non-increasing on [ 1 , ] [103] (Proposition 5) and i | x i | γ 1 / γ max i | x i | for all p 0 imply
γ 1 γ H γ ( X | T ) H ( X | T ) H γ ( X | T ) .
Since I ( X ; T ) = H ( X ) H ( X | T ) , the above lower bound yields
I γ ( X ; T ) γ γ 1 I ( X ; T ) γ γ 1 H ( X ) + H γ ( X ) ,
where the last inequality follows from the fact that γ H γ ( X ) is non-increasing. The upper bound in (A14) (after replacing X with Y and γ with α ) implies
I α ( Y ; T ) I ( Y ; T ) + H α ( Y ) H ( Y ) .
Combining (A15) and (A16), we obtain the desired upper bound for PF ( α , γ ) . The other bounds can be proved similarly by interchanging X with Y and α with γ in (A15) and (A16). □

References

  1. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 30 September–3 October 1999; pp. 368–377. [Google Scholar]
  2. Wyner, A.; Ziv, J. A theorem on the entropy of certain binary sequences and applications: Part I. IEEE Trans. Inf. Theory 1973, 19, 769–772. [Google Scholar] [CrossRef]
  3. Witsenhausen, H.; Wyner, A. A conditional entropy bound for a pair of discrete random variables. IEEE Trans. Inf. Theory 1975, 21, 493–501. [Google Scholar] [CrossRef]
  4. Ahlswede, R.; Körner, J. On the connection between the entropies of input and output distributions of discrete memoryless channels. In Proceedings of the Fifth Conference on Probability Theory, Brasov, Romania, 1–6 September 1974. [Google Scholar]
  5. Wyner, A. A theorem on the entropy of certain binary sequences and applications—II. IEEE Trans. Inf. Theory 1973, 19, 772–777. [Google Scholar] [CrossRef]
  6. Kim, Y.H.; El Gamal, A. Network Information Theory; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  7. Slonim, N.; Tishby, N. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 208–215. [Google Scholar]
  8. Still, S.; Bialek, W. How Many Clusters? An Information-Theoretic Perspective. Neural Comput. 2004, 16, 2483–2506. [Google Scholar] [CrossRef]
  9. Slonim, N.; Tishby, N. Agglomerative Information Bottleneck. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS’99), Denver, CO, USA, 29 November–4 December 1999; pp. 617–623. [Google Scholar]
  10. Cardinal, J. Compression of side information. In Proceedings of the 2003 International Conference on Multimedia and Expo—Volume 1, Baltimore, MD, USA, 6–9 July 2003; Volume 2, pp. 569–572. [Google Scholar]
  11. Zeitler, G.; Koetter, R.; Bauch, G.; Widmer, J. Design of network coding functions in multihop relay networks. In Proceedings of the 2008 5th International Symposium on Turbo Codes and Related Topics, Lausanne, Switzerland, 1–5 September 2008; pp. 249–254. [Google Scholar]
  12. Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the Information Bottleneck to the Privacy Funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW 2014), Tasmania, Australia, 2–5 November 2014; pp. 501–505. [Google Scholar]
  13. Calmon, F.P.; Makhdoumi, A.; Médard, M. Fundamental limits of perfect privacy. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 1796–1800. [Google Scholar]
  14. Asoodeh, S.; Alajaji, F.; Linder, T. Notes on information-theoretic privacy. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 30 September–3 October 2014; pp. 1272–1278. [Google Scholar]
  15. Ding, N.; Sadeghi, P. A Submodularity-based Clustering Algorithm for the Information Bottleneck and Privacy Funnel. In Proceedings of the 2019 IEEE Information Theory Workshop (ITW), Visby, Sweden, 25–28 August 2019; pp. 1–5. [Google Scholar]
  16. Bertran, M.; Martinez, N.; Papadaki, A.; Qiu, Q.; Rodrigues, M.; Reeves, G.; Sapiro, G. Adversarially Learned Representations for Information Obfuscation and Inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 614–623. [Google Scholar]
  17. Lopuhaä-Zwakenberg, M.; Tong, H.; Škorić, B. Data Sanitisation Protocols for the Privacy Funnel with Differential Privacy Guarantees. arXiv 2020, arXiv:2008.13151. [Google Scholar]
  18. Hsu, H.; Asoodeh, S.; Calmon, F. Obfuscation via Information Density Estimation. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Sicily, Italy, 26–28 August 2020; Volume 108, pp. 906–917. [Google Scholar]
  19. Dobrushin, R.; Tsybakov, B. Information transmission with additional noise. IRE Trans. Inf. Theory 1962, 8, 293–304. [Google Scholar] [CrossRef]
  20. Ahlswede, R.; Csiszar, I. Hypothesis testing with communication constraints. IEEE Trans. Inf. Theory 1986, 32, 533–542. [Google Scholar] [CrossRef] [Green Version]
  21. Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Information extraction under privacy constraints. Information 2016, 7, 15. [Google Scholar] [CrossRef] [Green Version]
  22. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory, Coll. Math. Soc. J. Bolyai; Csiszár, I., Elias, P., Eds.; North-Holland: Amsterdam, The Netherlands, 1977; Volume 16, pp. 41–52. [Google Scholar]
  23. Raginsky, M. Strong Data Processing Inequalities and Φ-Sobolev Inequalities for Discrete Channels. IEEE Trans. Inf. Theory 2016, 62, 3355–3389. [Google Scholar] [CrossRef] [Green Version]
  24. Hsu, H.; Asoodeh, S.; Salamatian, S.; Calmon, F.P. Generalizing Bottleneck Problems. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 531–535. [Google Scholar]
  25. Courtade, T.A. Strengthening the entropy power inequality. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 2294–2298. [Google Scholar]
  26. Globerson, A.; Tishby, N. On the Optimality of the Gaussian Information Bottleneck Curve; Technical Report; Hebrew University: Jerusalem, Israel, 2004. [Google Scholar]
  27. Calmon, F.P.; Polyanskiy, Y.; Wu, Y. Strong data processing inequalities in power-constrained Gaussian channels. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2558–2562. [Google Scholar]
  28. Rényi, A. On measures of dependence. Acta Math. Acad. Sci. Hung. 1959, 10, 441–451. [Google Scholar] [CrossRef]
  29. Goldfeld, Z.; Polyanskiy, Y. The Information Bottleneck Problem and its Applications in Machine Learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
  30. Zaidi, A.; Estella-Aguerri, I.; Shamai (Shitz), S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef] [Green Version]
  31. Strouse, D.; Schwab, D.J. The Deterministic Information Bottleneck. Neural Comput. 2017, 29, 1611–1630. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Bhatt, A.; Nazer, B.; Ordentlich, O.; Polyanskiy, Y. Information-Distilling Quantizers. arXiv 2018, arXiv:1812.03031. [Google Scholar]
  33. Shamir, O.; Sabato, S.; Tishby, N. Learning and Generalization with the Information Bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711. [Google Scholar] [CrossRef] [Green Version]
  34. Diaz, M.; Wang, H.; Calmon, F.P.; Sankar, L. On the Robustness of Information-Theoretic Privacy Measures and Mechanisms. IEEE Trans. Inf. Theory 2020, 66, 1949–1978. [Google Scholar] [CrossRef]
  35. El-Yaniv, R.; Souroujon, O. Iterative Double Clustering for Unsupervised and Semi-Supervised Learning. In Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 5–7 September 2001; Springer: Berlin/Heidelberg, Germany, 2001; pp. 121–132. [Google Scholar]
  36. Elidan, G.; Friedman, N. Learning Hidden Variable Networks: The Information Bottleneck Approach. J. Mach. Learn. Res. 2005, 6, 81–127. [Google Scholar]
  37. Aguerri, I.E.; Zaidi, A. Distributed Information Bottleneck Method for Discrete and Gaussian Sources. arXiv 2017, arXiv:1709.09082. [Google Scholar]
  38. Aguerri, I.E.; Zaidi, A. Distributed Variational Representation Learning. arXiv 2019, arXiv:1807.04193. [Google Scholar]
  39. Strouse, D.; Schwab, D.J. Geometric Clustering with the Information Bottleneck. Neural Comput. 2019, 31, 596–612. [Google Scholar] [CrossRef]
  40. Cicalese, F.; Gargano, L.; Vaccaro, U. Bounds on the Entropy of a Function of a Random Variable and Their Applications. IEEE Trans. Inf. Theory 2018, 64, 2220–2230. [Google Scholar] [CrossRef]
  41. Koch, T.; Lapidoth, A. At Low SNR, Asymmetric Quantizers are Better. IEEE Trans. Inf. Theory 2013, 59, 5421–5445. [Google Scholar] [CrossRef] [Green Version]
  42. Pedarsani, R.; Hassani, S.H.; Tal, I.; Telatar, E. On the construction of polar codes. In Proceedings of the 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 31 July–5 August 2011; pp. 11–15. [Google Scholar]
  43. Tal, I.; Sharov, A.; Vardy, A. Constructing polar codes for non-binary alphabets and MACs. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings, Cambridge, MA, USA, 1–6 July 2012; pp. 2132–2136. [Google Scholar]
  44. Kartowsky, A.; Tal, I. Greedy-Merge Degrading has Optimal Power-Law. IEEE Trans. Inf. Theory 2019, 65, 917–934. [Google Scholar] [CrossRef] [Green Version]
  45. Viterbi, A.J.; Omura, J.K. Principles of Digital Communication and Coding, 1st ed.; McGraw-Hill, Inc.: New York, NY, USA, 1979. [Google Scholar]
  46. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5. [Google Scholar]
  47. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
  48. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the Information Bottleneck Theory of Deep Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  49. Goldfeld, Z.; Van Den Berg, E.; Greenewald, K.; Melnyk, I.; Nguyen, N.; Kingsbury, B.; Polyanskiy, Y. Estimating Information Flow in Deep Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 2299–2308. [Google Scholar]
  50. Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2225–2239. [Google Scholar] [CrossRef] [Green Version]
  51. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. arXiv 2016, arXiv:1612.00410. [Google Scholar]
  52. Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear Information Bottleneck. arXiv 2017, arXiv:1705.02436. [Google Scholar]
  53. Kolchinsky, A.; Tracey, B.D.; Kuyk, S.V. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  54. Chalk, M.; Marre, O.; Tkacik, G. Relevant Sparse Codes with Variational Information Bottleneck. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 9 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 1965–1973. [Google Scholar]
  55. Wickstrøm, K.; Løkse, S.; Kampffmeyer, M.; Yu, S.; Principe, J.; Jenssen, R. Information Plane Analysis of Deep Neural Networks via Matrix–Based Rényi’s Entropy and Tensor Kernels. arXiv 2019, arXiv:1909.11396. [Google Scholar]
  56. Vera, M.; Piantanida, P.; Rey Vega, L. The Role of the Information Bottleneck in Representation Learning. In Proceedings of the IEEE International Symposium on Information Theory (ISIT 2018), Vail, CO, USA, 17–22 June 2018. [Google Scholar] [CrossRef]
  57. Alemi, A.; Fischer, I.; Dillon, J. Uncertainty in the Variational Information Bottleneck. arXiv 2018, arXiv:1807.00906. [Google Scholar]
  58. Yu, S.; Jenssen, R.; Príncipe, J. Understanding Convolutional Neural Network Training with Information Theory. arXiv 2018, arXiv:1804.06537. [Google Scholar]
  59. Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  60. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
  61. Issa, I.; Wagner, A.B.; Kamath, S. An Operational Approach to Information Leakage. IEEE Trans. Inf. Theory 2020, 66, 1625–1657. [Google Scholar] [CrossRef]
  62. Cvitkovic, M.; Koliander, G. Minimal Achievable Sufficient Statistic Learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 1465–1474. [Google Scholar]
  63. Asoodeh, S.; Alajaji, F.; Linder, T. On maximal correlation, mutual information and data privacy. In Proceedings of the IEEE 14th Canadian Workshop on Inf. Theory (CWIT), St. John’s, NL, Canada, 6–9 July 2015; pp. 27–31. [Google Scholar]
  64. Makhdoumi, A.; Fawaz, N. Privacy-utility tradeoff under statistical uncertainty. In Proceedings of the 51st Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2–4 October 2013; pp. 1627–1634. [Google Scholar] [CrossRef] [Green Version]
  65. Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Estimation Efficiency Under Privacy Constraints. IEEE Trans. Inf. Theory 2019, 65, 1512–1534. [Google Scholar] [CrossRef]
  66. Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Privacy-aware guessing efficiency. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017. [Google Scholar]
  67. Asoodeh, S.; Alajaji, F.; Linder, T. Privacy-aware MMSE estimation. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1989–1993. [Google Scholar]
  68. Calmon, F.P.; Makhdoumi, A.; Médard, M.; Varia, M.; Christiansen, M.; Duffy, K.R. Principal Inertia Components and Applications. IEEE Trans. Inf. Theory 2017, 63, 5011–5038. [Google Scholar] [CrossRef] [Green Version]
  69. Wang, H.; Vo, L.; Calmon, F.P.; Médard, M.; Duffy, K.R.; Varia, M. Privacy With Estimation Guarantees. IEEE Trans. Inf. Theory 2019, 65, 8025–8042. [Google Scholar] [CrossRef]
  70. Asoodeh, S. Information and Estimation Theoretic Approaches to Data Privacy. Ph.D. Thesis, Queen’s University, Kingston, ON, Canada, 2017. [Google Scholar]
  71. Liao, J.; Kosut, O.; Sankar, L.; du Pin Calmon, F. Tunable Measures for Information Leakage and Applications to Privacy-Utility Tradeoffs. IEEE Trans. Inf. Theory 2019, 65, 8043–8066. [Google Scholar] [CrossRef] [Green Version]
  72. Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Privacy aware learning. J. Assoc. Comput. Mach. (ACM) 2014, 61, 38. [Google Scholar] [CrossRef]
  73. Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; Tucker, G. On Variational Bounds of Mutual Information. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 5171–5180. [Google Scholar]
  74. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 531–540. [Google Scholar]
  75. Van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  76. Song, J.; Ermon, S. Understanding the Limitations of Variational Mutual Information Estimators. In Proceedings of the International Conference on Learning Representations, online, 26 April–1 May 2020. [Google Scholar]
  77. McAllester, D.; Stratos, K. Formal Limitations on the Measurement of Mutual Information. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Sicily, Italy, 26–28 August 2020; Volume 108, pp. 875–884. [Google Scholar]
  78. Rassouli, B.; Gunduz, D. On Perfect Privacy. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 2551–2555. [Google Scholar]
  79. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  80. Kim, H.; Gao, W.; Kannan, S.; Oh, S.; Viswanath, P. Discovering Potential Correlations via Hypercontractivity. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4577–4587. [Google Scholar]
  81. Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
  82. Anantharam, V.; Gohari, A.; Kamath, S.; Nair, C. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv 2014, arXiv:1304.6133v1. [Google Scholar]
  83. Polyanskiy, Y.; Wu, Y. Dissipation of Information in Channels With Input Constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
  84. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
  85. Zaidi, A. Hypothesis Testing Against Independence Under Gaussian Noise. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 1289–1294. [Google Scholar] [CrossRef]
  86. Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the Information Bottleneck. Entropy 2019, 21, 924. [Google Scholar] [CrossRef] [Green Version]
  87. Contento, L.; Ern, A.; Vermiglio, R. A linear-time approximate convex envelope algorithm using the double Legendre-Fenchel transform with application to phase separation. Comput. Optim. Appl. 2015, 60, 231–261. [Google Scholar] [CrossRef] [Green Version]
  88. Lucet, Y. Faster than the Fast Legendre Transform, the Linear-time Legendre Transform. Numer. Algorithms 1997, 16, 171–185. [Google Scholar] [CrossRef]
  89. Witsenhausen, H. Indirect rate distortion problems. IEEE Trans. Inf. Theory 1980, 26, 518–521. [Google Scholar] [CrossRef]
  90. Wyner, A. On source coding with side information at the decoder. IEEE Trans. Inf. Theory 1975, 21, 294–300. [Google Scholar] [CrossRef]
  91. Courtade, T.A.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761. [Google Scholar] [CrossRef] [Green Version]
  92. Li, C.T.; El Gamal, A. Extended Gray-Wyner System With Complementary Causal Side Information. IEEE Trans. Inf. Theory 2018, 64, 5862–5878. [Google Scholar] [CrossRef] [Green Version]
  93. Vera, M.; Rey Vega, L.; Piantanida, P. Collaborative Information Bottleneck. IEEE Trans. Inf. Theory 2019, 65, 787–815. [Google Scholar] [CrossRef]
  94. Gilad-Bachrach, R.; Navot, A.; Tishby, N. An Information Theoretic Tradeoff between Complexity and Accuracy. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609. [Google Scholar]
  95. Pichler, G.; Koliander, G. Information Bottleneck on General Alphabets. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 526–530. [Google Scholar] [CrossRef] [Green Version]
  96. Kim, Y.H.; Sutivong, A.; Cover, T. State Amplification. IEEE Trans. Inf. Theory 2008, 54, 1850–1859. [Google Scholar] [CrossRef] [Green Version]
  97. Merhav, N.; Shamai, S. Information rates subject to state masking. IEEE Trans. Inf. Theory 2007, 53, 2254–2261. [Google Scholar] [CrossRef]
  98. Witsenhausen, H. Some aspects of convexity useful in information theory. IEEE Trans. Inf. Theory 1980, 26, 265–271. [Google Scholar] [CrossRef]
  99. Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Nice, France, 24–29 June 2007; pp. 566–570. [Google Scholar]
  100. Hirche, C.; Winter, A. An alphabet size bound for the information bottleneck function. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020. [Google Scholar]
  101. Liese, F.; Vajda, I. On Divergences and Informations in Statistics and Information Theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
  102. Verdú, S. α-mutual information. In Proceedings of the Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
  103. Fehr, S.; Berens, S. On the Conditional Rényi Entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
  104. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
  105. Sason, I.; Verdú, S. f-Divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
  106. Guntuboyina, A.; Saha, S.; Schiebinger, G. Sharp Inequalities for f-Divergences. IEEE Trans. Inf. Theory 2014, 60, 104–121. [Google Scholar] [CrossRef] [Green Version]
  107. Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inf. Theory 2005, 51, 1261–1282. [Google Scholar] [CrossRef] [Green Version]
  108. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1997. [Google Scholar]
  109. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  110. Linder, T.; Zamir, R. On the asymptotic tightness of the Shannon lower bound. IEEE Trans. Inf. Theory 1994, 40, 2026–2031. [Google Scholar] [CrossRef]
  111. Guo, D.; Wu, Y.; Shamai (Shitz), S.; Verdú, S. Estimation in Gaussian Noise: Properties of the Minimum Mean-Square Error. IEEE Trans. Inf. Theory 2011, 57, 2371–2385. [Google Scholar]
  112. Jana, S. Alphabet sizes of auxiliary random variables in canonical inner bounds. In Proceedings of the 43rd Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 18–20 March 2009; pp. 67–71. [Google Scholar]
Figure 1. Examples of the set $\mathcal{M}$ defined in (6). The upper and lower boundaries of this set correspond to the information bottleneck ($\mathsf{IB}$) and the privacy funnel ($\mathsf{PF}$), respectively. Note that, while $\mathsf{IB}(R) = 0$ only at $R = 0$, $\mathsf{PF}(r) = 0$ may hold for all $r$ in a non-trivial interval (this is possible only when $|\mathcal{X}| > 2$). Moreover, in general neither boundary is smooth. A sufficient condition for smoothness is $P_{X|Y}(x|y) > 0$ for all $(x,y)$ (see Theorem 1); hence both $\mathsf{IB}$ and $\mathsf{PF}$ are smooth in the binary case.
Figure 2. Comparison of (8), the exact value of $\mathsf{IB}$ for jointly Gaussian $X$ and $Y$ (i.e., $Y = X + \sigma N_{\mathsf{G}}$ with $X$ and $N_{\mathsf{G}}$ both standard Gaussian $\mathcal{N}(0,1)$), with the general upper bound (9) for $\sigma^2 = 0.5$. Note that, while the Gaussian $\mathsf{IB}$ converges to $I(X;Y) \approx 0.8$, the upper bound diverges.
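As a quick check on the limiting value quoted in the caption (a sketch of the standard Gaussian-channel identity, assuming unit-variance $X$ and logarithms to base 2 as the caption suggests): $I(X;Y) = \tfrac{1}{2}\log_2\!\big(1 + \mathrm{Var}(X)/\sigma^2\big) = \tfrac{1}{2}\log_2\!\big(1 + 1/0.5\big) = \tfrac{1}{2}\log_2 3 \approx 0.79$ bits.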
Figure 3. Second-order approximation of $\mathsf{PF}_{\mathcal{G}}$ according to Theorem 6 for jointly Gaussian $X$ and $Y$ with correlation coefficient $\rho = 0.8$. For this particular case, the exact expression of $\mathsf{PF}_{\mathcal{G}}$ is computed in (14).
Figure 4. The mapping $q \mapsto F_\beta(q) = H(Y) - \beta H(X)$, where $X \sim \mathsf{Bernoulli}(q)$ and $Y$ is the output of the channel $\mathsf{BSC}(0.1)$ with input $X$; see (26).
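The curve in Figure 4 is easy to reproduce numerically. The sketch below is illustrative only (it is not the authors' code; the $\beta$ values and the grid over $q$ are arbitrary choices); it evaluates $F_\beta(q) = H(Y) - \beta H(X)$ for the $\mathsf{BSC}(0.1)$ setting of the caption, the curve to which the paper's envelope technique is then applied.

```python
# Minimal numerical sketch (illustrative only, not the authors' code) of the mapping
# in Figure 4: F_beta(q) = H(Y) - beta*H(X), with X ~ Bernoulli(q) and Y the output
# of a BSC(0.1) with input X. The beta values and the grid over q are arbitrary choices.
import numpy as np

def h2(p):
    """Binary entropy in bits (0*log 0 treated as 0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def F(q, beta, delta=0.1):
    s = q * (1 - delta) + (1 - q) * delta   # P(Y = 1): binary convolution of q and delta
    return h2(s) - beta * h2(q)             # H(Y) - beta * H(X)

qs = np.linspace(0.0, 1.0, 201)
for beta in (0.5, 1.0, 2.0):
    vals = F(qs, beta)
    print(f"beta={beta}: min F = {vals.min():.4f}, max F = {vals.max():.4f}")
```

Plotting `F(qs, beta)` against `qs` reproduces the shape shown in the figure for the chosen $\beta$.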
Figure 5. The set $\{(I(X;T), I(Y;T))\}$ with $P_X = \mathsf{Bernoulli}(0.9)$, $P_{Y|X=0} = [0.9, 0.1]$, $P_{Y|X=1} = [0.85, 0.15]$, and $T$ restricted to be binary. While the upper boundary of this set is concave, the lower boundary is not convex. This implies that, unlike $\mathsf{IB}$, $\mathsf{PF}(r)$ cannot be attained by binary variables $T$.
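The set in Figure 5 can be traced numerically by sweeping over binary channels $P_{T|X}$ and recording the resulting pairs $(I(X;T), I(Y;T))$. The following is a minimal sketch, not the authors' code; the indexing convention $[P(\cdot=0),\,P(\cdot=1)]$ and the grid resolution are assumptions made here for illustration.

```python
# Minimal sketch (illustrative only, not the authors' code) tracing the set
# {(I(X;T), I(Y;T))} of Figure 5 by sweeping binary channels P_{T|X} over a grid.
import numpy as np

def mi(joint):
    """Mutual information (bits) of a 2x2 joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

p_x = np.array([0.1, 0.9])                       # P_X = Bernoulli(0.9), i.e., P(X=1) = 0.9 (assumed convention)
p_y_given_x = np.array([[0.9, 0.1],              # P_{Y|X=0}
                        [0.85, 0.15]])           # P_{Y|X=1}

points = []
grid = np.linspace(0.0, 1.0, 101)
for a in grid:                                   # a = P(T=1 | X=0)
    for b in grid:                               # b = P(T=1 | X=1)
        p_t_given_x = np.array([[1 - a, a], [1 - b, b]])
        joint_xt = p_x[:, None] * p_t_given_x                     # P_{X,T}
        joint_yt = (p_x[:, None] * p_y_given_x).T @ p_t_given_x   # P_{Y,T} via the Markov chain Y -- X -- T
        points.append((mi(joint_xt), mi(joint_yt)))

points = np.array(points)
print("largest I(X;T) on the grid:", points[:, 0].max())
print("largest I(Y;T) on the grid:", points[:, 1].max())
```

A scatter plot of `points` exhibits the concave upper boundary and non-convex lower boundary described in the caption.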
Figure 6. The mapping $q \mapsto F_\beta^{(\infty,1)}(q) = P_c(Y) + \beta H(X)$, where $X \sim \mathsf{Bernoulli}(q)$ and $Y$ is the output of the channel $\mathsf{BSC}(0.1)$ with input $X$.
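Spelled out under the conventions assumed here (that $H$ denotes Shannon entropy and $P_c(Y) = \max_y P_Y(y)$ is the probability of correctly guessing $Y$ without any observation), the mapping in Figure 6 reads $F_\beta^{(\infty,1)}(q) = \max\{q * 0.1,\; 1 - q * 0.1\} + \beta\, H(\mathsf{Bernoulli}(q))$, where $q * 0.1 = 0.9\,q + 0.1\,(1-q)$ denotes binary convolution.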
Figure 7. The structure of the optimal $P_{T|X}$ for $\mathsf{PF}_{(\infty,\infty)}$ when $P_{Y|X} = \mathsf{BSC}(\delta)$ and $X \sim \mathsf{Bernoulli}(p)$ with $\delta, p \in [0, \tfrac{1}{2}]$. If the accuracy constraint is $P_c(X|T) \geq \lambda$ (or, equivalently, $I_\infty(X;T) \geq \log(\lambda/\bar{p})$), then the parameter of the optimal $P_{T|X}$ is given by $\eta = \bar{\lambda}/\bar{p}$, leading to $P_c(Y|T) = \delta * \lambda$.