1. Introduction
Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [1]. Suppose Y is a target variable and X is an observable correlated variable with joint distribution P_XY. The goal of IB is to learn a "compact" summary (aka bottleneck) T of X that is maximally "informative" for inferring Y. The bottleneck variable T is assumed to be generated from X by applying a random function F to X, i.e., T = F(X), in such a way that it is conditionally independent of Y given X, which we denote by the Markov chain Y - X - T. The IB quantifies this goal by measuring the "compactness" of T using the mutual information I(X;T) and, similarly, "informativeness" by I(Y;T). For a given level of compactness R, IB extracts the bottleneck variable T that solves the constrained optimization problem
IB(R) := sup { I(Y;T) : I(X;T) ≤ R },    (2)
where the supremum is taken over all randomized functions T = F(X) satisfying the Markov chain Y - X - T.
The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970’s—see [
2,
3,
4,
5]—as a technique to prove impossibility results in information theory and also to study the common information between
X and
Y. Wyner and Ziv [
2] explicitly determined the value of
for the special case of binary
X and
Y—a result widely known as Mrs. Gerber’s Lemma [
2,
6]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [
1] and re-formulated in a data analytic context. Here, the random variable
X represents a high-dimensional observation with a corresponding low-dimensional feature
Y.
aims at specifying a compressed description of the observation that is maximally informative about the feature
Y. This framework led to several applications in clustering [
7,
8,
9] and quantization [
10,
11].
A closely-related framework to IB is the privacy funnel (PF) problem [12,13,14]. In the PF framework, a bottleneck variable T is sought to maximally preserve the "information" contained in X while revealing as little about Y as possible. This framework aims to capture the inherent trade-off between revealing X perfectly and leaking a sensitive attribute Y. For instance, suppose a user wishes to share an image X for some classification tasks. The image might carry information about attributes, say Y, that the user might consider sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation of X from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to Y. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as
PF(r) := inf { I(Y;T) : I(X;T) ≥ r },    (3)
where the infimum is taken over all randomized functions T = F(X) satisfying Y - X - T, and r is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [
12,
15] for estimating
based on the agglomerative information bottleneck [
9] algorithm. A similar formulation has recently been proposed in [
16] as a tool to train a neural network for learning a private representation of data
X; see [
17,
18] for other closely-related formulations. Solving
and
optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both
and
(see Related Work).
In this paper, we first give a cohesive overview of the existing results surrounding the
and the
formulations. We then provide a comprehensive analysis of
and
from an information-theoretic perspective, as well as a survey of several formulations connected to the
and
that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source-coding [
19], testing against independence [
20], and dependence dilution [
21]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in
, leading to a more tractable optimization problem for
. We then consider a broad family of optimization problems by going beyond mutual information in formulations (
2) and (
3). We propose two candidates for this task: Arimoto’s mutual information [
22] and
f-information [
23]. By replacing
and/or
with either of these measures, we generate a family of optimization problems that we refer to as bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by
and
(see also the short version [
24]). More specifically, our main contributions are listed next.
Computing
and
is notoriously challenging when
X takes values in a set with infinite cardinality (e.g.,
X is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that
X is a Gaussian perturbation of
Y, i.e.,
where
is a noise variable sampled from a Gaussian distribution independent of
Y. Building upon the recent advances in entropy power inequality in [
25], we derive a sharp upper bound for
. As a special case, we consider jointly Gaussian
for which the upper bound becomes tight. This yields a proof, significantly simpler than the original one given in [26], of the fact that in this special case the optimal bottleneck variable T is also Gaussian. In the second scenario, we assume that
Y is a Gaussian perturbation of
X, i.e.,
. This corresponds to a practical setup where the feature
Y might be perfectly obtained from a noisy observation of
X. Relying on the recent results in strong data processing inequality [
27], we obtain an upper bound on
which is tight for small values of
R. In the last scenario, we compute a second-order approximation of
under the assumption that
T is obtained by Gaussian perturbation of
X, i.e.,
. Interestingly, the rate of increase of
for small values of
r is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [
28].
We extend Witsenhausen and Wyner's approach [
3] for analytically computing
and
. This technique converts solving the optimization problems in
and
to determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary
X and
Y and derive a closed form expression for
– we call this result Mr. Gerber’s Lemma.
Relying on the connection between
and noisy source coding [
19] (see [
29,
30]), we show that the optimal bottleneck variable
T in optimization problem (
2) takes values in a set
with cardinality
. Compared to the best cardinality bound previously known (i.e.,
), this result leads to a reduction in the search space’s dimension of the optimization problem (
2) from
to
. Moreover, we show that this does not hold for
, indicating a fundamental difference between the optimization problems (
2) and (
3).
Following [
14,
31], we study the deterministic
and
(denoted by
and
) in which
T is assumed to be a deterministic function of
X, i.e.,
for some function
f. By connecting
and
with entropy-constrained scalar quantization problems in information theory [
32], we obtain bounds on them explicitly in terms of
. Applying these bounds to
, we obtain that
is bounded by one from above and by
from below.
By replacing
and/or
in (
2) and (
3) with Arimoto’s mutual information or
f-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by
and
. The main reason is three-fold: First, as illustrated in
Section 2.3, mutual information in
and
is mainly justified when
independent samples
of
are considered. However, Arimoto’s mutual information allows for operational interpretation even in the single-shot regime (i.e., for
). Second,
in
and
is meant to be a proxy for the efficiency of reconstructing
Y given observation
T. However, this can be accurately formalized by probability of correctly guessing
Y given
T (i.e., Bayes risk) or minimum mean-square error (MMSE) in estimating
Y given
T. While
bounds these two measures, we show that they are precisely characterized by Arimoto’s mutual information and
f-information, respectively. Finally, when
is unknown, mutual information is known to be notoriously difficult to estimate. Nevertheless, Arimoto’s mutual information and
f-information are easier to estimate: While mutual information can be estimated with estimation error that scales as
[
33], Diaz et al. [
34] showed that this estimation error for Arimoto’s mutual information and
f-information is
.
We also generalize our computation technique that enables us to analytically compute these bottleneck problems. As before, this technique converts computing bottleneck problems to determining convex and concave envelopes of certain functions. Focusing on binary X and Y, we derive closed-form expressions for some of the bottleneck problems.
1.1. Related Work
The
formulation has been extensively applied in representation learning and clustering [
7,
8,
35,
36,
37,
38]. Clustering based on
results in algorithms that cluster data points in terms of the similarity of
. When data points lie in a metric space, usually geometric clustering is preferred where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [
31,
39] proposed the deterministic
(denoted by
) by enforcing that
is a deterministic mapping:
denotes the supremum of
over all functions
satisfying
. This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function
with a pre-determined output alphabet such that f optimizes some objective function. This objective might be maximizing or minimizing
[
40] or maximizing
for a random variable
Y correlated with
X [
32,
41,
42,
43]. Since
for
, the latter problem provides lower bounds for
(and thus for
). In particular, one can exploit [
44] (Theorem 1) to obtain
provided that
. This result establishes a linear gap between
and
irrespective of
.
The connection between quantization and
further allows us to obtain multiplicative bounds. For instance, if
and
, where
is independent of
Y, then it is well-known in information theory literature that
for all non-constant
(see, e.g., [
45] (Section 2.11)), thus
for
. We further explore this connection to provide multiplicative bounds on
in
Section 2.5.
The study of
has recently gained increasing traction in the context of deep learning. By taking
T to be the activity of the hidden layer(s), Tishby and Zaslavsky [
46] (see also [
47]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the
optimization problem. In fact, it is claimed that the graph of the function
(the so-called information plane) characterizes the learning dynamics of different layers in the network: shallow layers correspond to maximizing
while deep layers’ objective is minimizing
. While the generality of this claim was refuted empirically in [
48] and theoretically in [
49,
50], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the
optimization problem [
51,
52,
53,
54,
55]; (ii) creating connections between
and generalization error [
56], robustness [
51], and detection of out-of-distribution data [
57]; and (iii) using
to understand specific characteristic of neural networks [
55,
58,
59,
60].
In both
and
, mutual information poses some limitations. For instance, it may become infinity in deterministic neural networks [
48,
49,
50] and also may not lead to proper privacy guarantee [
61]. As suggested in [
55,
62], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantee have been proposed including Rényi maximal correlation [
21,
63,
64], probability of correctly recovering [
65,
66], minimum mean-squared estimation error (MMSE) [
67,
68],
-information [
69] (a special case of
f-information to be described in
Section 3), Arimoto’s and Sibson’s mutual information [
61,
70]—to be discussed in
Section 3, maximal leakage [
71], and local differential privacy [
72]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [
67,
68] that if
-information between
Y and
T is sufficiently small, then no functions of
Y can be efficiently reconstructed given
T; thus providing an interpretable privacy guarantee.
Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from
n samples with an estimation error that scales as
[
33]. However, as shown by Diaz et al. [
34], the estimation error for most of the above measures scales as
. Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [
73,
74,
75], present some fundamental limitations [
76]: the variance of the estimator might grow exponentially with the ground truth mutual information and also the estimator might not satisfy basic properties of mutual information such as data processing inequality or additivity. McAllester and Stratos [
77] showed that some of these limitations are inherent to a large family of mutual information estimators.
1.2. Notation
We use capital letters, e.g.,
X, for random variables and calligraphic letters for their alphabets, e.g.,
. If
X is distributed according to probability mass function (pmf)
, we write
. Given two random variables
X and
Y, we write
and
as the joint distribution and the conditional distribution of
Y given
X. We also interchangeably refer to
as a channel from
X to
Y. We use
to denote both entropy and differential entropy of
X, i.e., we have
if
X is a discrete random variable taking values in
with probability mass function (pmf)
and
if
X is an absolutely continuous random variable with probability density function (pdf)
. If
X is a binary random variable with
, we write
. In this case, its entropy is called the binary entropy function and is denoted by
. We use superscript
to describe a standard Gaussian random variable, i.e.,
. Given two random variables
X and
Y, their (Shannon’s) mutual information is denoted by
. We let
denote the set of all probability distributions on the set
. Given an arbitrary
and a channel
, we let
denote the resulting output distribution on
. For any
, we use
to denote
and for any integer
,
.
Throughout the paper, we assume that a pair of (discrete or continuous) random variables is given with a fixed joint distribution , marginals and , and conditional distribution . We then use to denote an arbitrary distribution with .
2. Information Bottleneck and Privacy Funnel: Definitions and Functional Properties
In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex analytic approach which enables us to compute closed-form expressions for both functionals in some simple cases.
To precisely quantify the trade-off between these two conflicting goals, the
optimization problem (
2) was proposed [
1]. Since any randomized function
can be equivalently characterized by a conditional distribution, the optimization problem (
2) can be instead expressed as
where
R and
denote the level of desired compression and informativeness, respectively. We use
and
to denote
and
, respectively, when the joint distribution is clear from the context. Notice that if
, then
.
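To make the two coordinates of this optimization concrete, the following sketch (an illustrative Python/NumPy computation under the Markov factorization P_XYT = P_XY · P_T|X; the variable names and toy numbers are ours, not from the paper) evaluates I(X;T) and I(Y;T) for a discrete joint distribution and a candidate channel P_T|X, and checks whether the candidate satisfies the compression constraint for a given R.

```python
import numpy as np

def mutual_information(p_uv):
    """Shannon mutual information I(U;V) in bits for a joint pmf given as a 2-D array."""
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    mask = p_uv > 0
    return float(np.sum(p_uv[mask] * np.log2(p_uv[mask] / (p_u @ p_v)[mask])))

def ib_coordinates(p_xy, p_t_given_x):
    """Return (I(X;T), I(Y;T)) induced by the channel P_{T|X} under Y - X - T."""
    p_x = p_xy.sum(axis=1)
    p_xt = p_x[:, None] * p_t_given_x        # joint of (X, T)
    p_yt = p_xy.T @ p_t_given_x              # joint of (Y, T) via the Markov chain
    return mutual_information(p_xt), mutual_information(p_yt)

# Toy example: binary X, Y and a candidate two-output channel P_{T|X}.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_t_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
i_xt, i_yt = ib_coordinates(p_xy, p_t_given_x)
R = 0.5
print(f"I(X;T) = {i_xt:.3f} bits, I(Y;T) = {i_yt:.3f} bits, feasible for R = {R}: {i_xt <= R}")
```

Maximizing the second coordinate over all feasible channels (for instance via the Lagrangian or envelope methods discussed below) then gives the value of the information bottleneck at compression level R.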
Now consider the setup where data
X is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by
Y. This goal was formulated by
in (
3). As before, replacing randomized function
with conditional distribution
, we can equivalently express (
3) as
where
and
r denote the level of desired privacy and informativeness, respectively. The case
is particularly interesting in practice and specifies perfect privacy, see e.g., [
13,
78]. As before, we write
and
for
and
when
is clear from the context.
The following properties of
and
follow directly from their definitions. The proof of this result (and any other results in this section) is given in
Appendix A.
Theorem 1. For a given , the mappings and have the following properties:
.
for any and for .
for any and for any .
is continuous, strictly increasing, and concave on the range .
is continuous, strictly increasing, and convex on the range .
If for all and , then both and are continuously differentiable over .
is non-increasing and is non-decreasing.
According to this theorem, we can always restrict both
R and
r in (
4) and (
5), respectively, to
as
for all
.
Define
as
It can be directly verified that
is convex. According to this theorem,
and
correspond to the upper and lower boundary of
, respectively. The convexity of
then implies the concavity and convexity of
and
.
Figure 1 illustrates the set
for the simple case of binary
X and
Y.
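One way to visualize this set numerically is to sample many random channels P_T|X and record the induced pairs (I(X;T), I(Y;T)); the upper and lower envelopes of the resulting point cloud approximate the two boundaries. The sketch below is a rough illustration of this idea for a binary symmetric example (the joint distribution, number of samples, and binning are our own choices, not part of the original analysis).

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(p_uv):
    pu, pv = p_uv.sum(1, keepdims=True), p_uv.sum(0, keepdims=True)
    m = p_uv > 0
    return float(np.sum(p_uv[m] * np.log2(p_uv[m] / (pu @ pv)[m])))

# Binary example: X ~ Bernoulli(0.5), Y obtained from X through a BSC with crossover 0.1.
p_xy = 0.5 * np.array([[0.9, 0.1],
                       [0.1, 0.9]])
p_x = p_xy.sum(1)

n_t = p_xy.shape[0] + 1                 # |T| = |X| + 1 output symbols
points = []
for _ in range(50_000):
    p_t_given_x = rng.dirichlet(np.ones(n_t), size=p_xy.shape[0])  # random channel P_{T|X}
    p_xt = p_x[:, None] * p_t_given_x
    p_yt = p_xy.T @ p_t_given_x
    points.append((mi(p_xt), mi(p_yt)))
points = np.array(points)

# Crude approximation of the two boundaries: bin by I(X;T) and take max/min of I(Y;T).
bins = np.linspace(0.0, 1.0, 21)
for lo, hi in zip(bins[:-1], bins[1:]):
    sel = (points[:, 0] >= lo) & (points[:, 0] < hi)
    if sel.any():
        print(f"I(X;T) in [{lo:.2f},{hi:.2f}): "
              f"upper ~ {points[sel, 1].max():.3f}, lower ~ {points[sel, 1].min():.3f}")
```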
While both
and
, their behavior in the neighborhood around zero might be completely different. As illustrated in
Figure 1,
for all
, whereas
for
for some
. When such
exists, we say perfect privacy occurs: there exists a variable
T satisfying
Y X T such that
while
; making
T a representation of
X having perfect privacy (i.e., no information leakage about
Y). A necessary and sufficient condition for the existence of such
T is given in [
21] (Lemma 10) and [
13] (Theorem 3), described next.
Theorem 2 (Perfect privacy). Let be given and be the set of vectors . Then there exists such that for if and only if vectors in are linearly independent.
In light of this theorem, we obtain that perfect privacy occurs if
. It also follows from the theorem that for binary
X, perfect privacy cannot occur (see
Figure 1a).
Theorem 1 enables us to derive simple bounds for and . Specifically, the facts that is non-decreasing and is non-increasing immediately result in the following linear bounds.
Theorem 3 (Linear lower bound)
. For , we have In light of this theorem, if
, then
, implying
for a deterministic function
g. Conversely, if
then
because for all
T forming the Markov relation
Y g(
Y)
T, we have
. On the other hand, we have
if and only if there exists a variable
satisfying
and thus the following double Markov relations
It can be verified (see [
79] (Problem 16.25)) that this double Markov condition is equivalent to the existence of a pair of functions
f and
g such that
and (
X,Y)
f(
X)
. One special case of this setting, namely where
g is the identity function, has recently been studied in detail in [
53] and will be reviewed in
Section 2.5. Theorem 3 also enables us to characterize the “worst” joint distribution
with respect to
and
. As demonstrated in the following lemma, if
is an erasure channel then
.
Lemma 1. Let be such that , , and for some . Then
Let be such that , , and for some . Then
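The linear behavior appearing around Theorem 3 and Lemma 1 is naturally associated with erasure-type constructions. The sketch below is a numerical illustration (under the construction, chosen by us, in which T equals X with probability 1 − δ and an erasure symbol otherwise, independently of (X,Y)): it checks the identities I(X;T) = (1−δ)H(X) and I(Y;T) = (1−δ)I(X;Y), so the induced point moves along the straight line of slope I(X;Y)/H(X) as δ varies.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi(p_uv):
    pu, pv = p_uv.sum(1, keepdims=True), p_uv.sum(0, keepdims=True)
    m = p_uv > 0
    return float(np.sum(p_uv[m] * np.log2(p_uv[m] / (pu @ pv)[m])))

p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(1)
h_x, i_xy = entropy(p_x), mi(p_xy)

for delta in [0.0, 0.25, 0.5, 0.75]:
    # Channel X -> T: T = X with prob. 1-delta, T = erasure symbol with prob. delta.
    p_t_given_x = np.hstack([(1 - delta) * np.eye(2), delta * np.ones((2, 1))])
    p_xt = p_x[:, None] * p_t_given_x
    p_yt = p_xy.T @ p_t_given_x
    print(f"delta={delta:.2f}: I(X;T)={mi(p_xt):.3f} vs (1-d)H(X)={(1-delta)*h_x:.3f}, "
          f"I(Y;T)={mi(p_yt):.3f} vs (1-d)I(X;Y)={(1-delta)*i_xy:.3f}")
```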
The bounds in Theorem 3 hold for all r and R in the interval . We can, however, improve them when r and R are sufficiently small. Let and denote the slope of and at zero, i.e., and .
Theorem 4. Given , we have This theorem provides the exact values of
and
and also simple bounds for them. While the exact expressions for
and
are usually difficult to compute, a simple plug-in estimator is proposed in [
80] for
. This estimator can be readily adapted to estimate
. Theorem 4 reveals a profound connection between
and the strong data processing inequality (SDPI) [
81]. More precisely, thanks to the pioneering work of Anantharam et al. [
82], it is known that the supremum of
over all
is equal to the supremum of
over all
satisfying
Y X T and hence
specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for new theoretical results for
, especially when
X or
Y are continuous random variables. In particular, the recent non-multiplicative SDPI results [
27,
83] seem insightful for this purpose.
In many practical cases, we might have
n i.i.d. samples
of
. We now study how
behaves in
n. Let
and
. Due to the i.i.d. assumption, we have
. This can also be described by independently feeding
,
, to channel
producing
. The following theorem, demonstrated first in [
3] (Theorem 2.4), gives a formula for
in terms of
n.
This theorem demonstrates that an optimal channel
for i.i.d. samples
is obtained by the Kronecker product of an optimal channel
for
. This, however, may not hold in general for
, that is, we might have
, see [
13] (Proposition 1) for an example.
2.1. Gaussian IB and PF
In this section, we turn our attention to a special, yet important, case where
, where
and
is independent of
Y. This setting subsumes the popular case of jointly Gaussian
whose information bottleneck functional was computed in [
84] for the vector case (i.e.,
are jointly Gaussian random vectors).
Lemma 2. Let be n i.i.d. copies of and where are i.i.d samples of independent of Y. Then, we have It is worth noting that this result was concurrently proved in [
85]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [
25] (Theorem 2) which holds even if
,
, and
are random vectors (as opposed to scalar). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on
. This upper bound provides a significantly simpler proof for the well-known fact that for the jointly Gaussian
, the optimal channel
is Gaussian. This result was first proved in [
26] and used in [
84] to compute an expression of
for the Gaussian case.
Corollary 1. If are jointly Gaussian with correlation coefficient ρ, then we have Moreover, the optimal channel is given by for where is the variance of Y.
In Lemma 2, we assumed that
X is a Gaussian perturbation of
Y. However, in some practical scenarios, we might have
Y as a Gaussian perturbation of
X. For instance, let
X represent an image and
Y be a feature of the image that can be perfectly obtained from a noisy observation of
X. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [
27] (Theorem 1), gives an upper bound for
in this case.
Lemma 3. Let be n i.i.d. copies of a random variable X satisfying and be the result of passing , , through a Gaussian channel , where and is independent of X. Then, we have where is the Gaussian complementary CDF and for is the binary entropy function. Moreover, we have
Note that Lemma 3 holds for any arbitrary
X (provided that
) and hence (
9) bounds information bottleneck functionals for a wide family of
. However, the bound is loose in general for large values of
R. For instance, if
are jointly Gaussian (implying
for some
), then the right-hand side of (
9) does not reduce to (
8). To show this, we numerically compute the upper bound (
9) and compare it with the Gaussian information bottleneck (
8) in
Figure 2.
The privacy funnel functional is much less studied even for the simple case of jointly Gaussian. Solving the optimization in
over
without any assumptions is a difficult challenge. A natural assumption to make is that
is Gaussian for each
. This leads to the following variant of
where
and
is independent of
X. This formulation is tractable and can be computed in closed form for jointly Gaussian
as described in the following example.
Example 1. Let X and Y be jointly Gaussian with correlation coefficient ρ. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both X and Y are zero mean and unit variance and hence we can write where is independent of Y. Consequently, we have and In order to ensure , we must have . Plugging this choice of σ into (13), we obtain
This example indicates that for jointly Gaussian
, we have
if and only if
(thus perfect privacy does not occur) and the constraint
is satisfied by a unique
. These two properties in fact hold for all continuous variables
X and
Y with finite second moments as demonstrated in Lemma A1 in
Appendix A. We use these properties to derive a second-order approximation of
when
r is sufficiently small. For the following theorem, we use
to denote the variance of the random variable
U and
. We use
for short.
Theorem 6. For any pair of continuous random variables with finite second moments, we have as where and It is worth mentioning that the quantity
was first defined by Rényi [
28] as an asymmetric measure of correlation between
X and
Y. In fact, it can be shown that
where supremum is taken over all measurable functions
f and
denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian
X and
Y with correlation coefficient
for which
was computed in Example 1. In this case, it can be easily verified that
and
. Hence, for jointly Gaussian
with correlation coefficient
and unit variance, we have
. In
Figure 3, we compare the approximation given in Theorem 6 for this particular case.
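The closed-form expressions used in Example 1 follow from the standard Gaussian mutual-information formulas. The sketch below is an illustration under the additive Gaussian mechanism T = X + σN with unit-variance jointly Gaussian (X, Y) of correlation ρ (the use of SciPy's root finder and the chosen values of ρ and r are our own): it computes I(X;T) and I(Y;T) in nats and solves for the σ that meets a target informativeness constraint I(X;T) = r.

```python
import numpy as np
from scipy.optimize import brentq

rho = 0.8   # correlation coefficient of the jointly Gaussian (X, Y), unit variances

def i_xt(sigma):   # I(X;T) in nats for T = X + sigma * N, with N ~ N(0,1) independent of (X, Y)
    return 0.5 * np.log((1 + sigma**2) / sigma**2)

def i_yt(sigma):   # I(Y;T) in nats; corr(Y, T) = rho / sqrt(1 + sigma^2)
    return 0.5 * np.log((1 + sigma**2) / (1 + sigma**2 - rho**2))

# Find the noise level sigma such that the informativeness constraint I(X;T) = r is met.
r = 1.0
sigma_star = brentq(lambda s: i_xt(s) - r, 1e-6, 1e6)
print(f"sigma* = {sigma_star:.4f}")
print(f"I(X;T) = {i_xt(sigma_star):.4f} nats (target r = {r})")
print(f"I(Y;T) = {i_yt(sigma_star):.4f} nats <= I(X;Y) = {-0.5*np.log(1-rho**2):.4f} nats")
```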
2.2. Evaluation of IB and PF
The constrained optimization problems in the definitions of
and
are usually challenging to solve numerically due to the non-linearity in the constraints. In practice, however, both
and
are often approximated by their corresponding Lagrangian optimizations
and
where
is the Lagrangian multiplier that controls the tradeoff between compression and informativeness in for
and the privacy and informativeness in
. Notice that for the computation of
, we can assume, without loss of generality, that
since otherwise the maximizer of (
15) is trivial. It is worth noting that
and
in fact correspond to lines of slope
supporting
from above and below, thereby providing a new representation of
.
Let
be a pair of random variables with
for some
and
is the output of
when the input is
(i.e.,
). Define
This function, in general, is neither convex nor concave in
. For instance,
is concave and
is convex in
. The lower convex envelope of
is defined as the largest convex function smaller than
. Similarly, the upper concave envelope of
is defined as the smallest concave function larger than
. Let
and
denote the lower convex and upper concave envelopes of
, respectively. If
is convex at
, that is
, then
remains convex at
for all
because
where the last equality follows from the fact that
is convex. Hence, at
we have
Analogously, if
is concave at
, that is
, then
remains concave at
for all
.
Notice that, according to (
15) and (
16), we can write
and
In light of the above arguments, we can write
for all
where
is the smallest
such that
touches
. Similarly,
for all
where
is the largest
such that
touches
. In the following theorem, we show that
and
are given by the values of
and
, respectively, given in Theorem 4. Similar formulas for
and
were given in [
86].
Kim et al. [
80] have recently proposed an efficient algorithm to estimate
from samples of
involving a simple optimization problem. This algorithm can be readily adapted for estimating
. Proposition 1 implies that in optimizing the Lagrangians (
17) and (
18), we can restrict the Lagrange multiplier
, that is
and
Remark 1. As demonstrated by Kolchinsky et al. [53], the boundary points 0 and are required for the computation of . In fact, when Y is a deterministic function of X, then only
and are required to compute the and other values of β are vacuous. The same argument can also be used to justify the inclusion of in computing . Note also that since becomes convex for , computing becomes trivial for such values of β.
Remark 2. Observe that the lower convex envelope of any function f can be obtained by taking the Legendre-Fenchel transformation (a.k.a. convex conjugate) twice. Hence, one can use the existing linear-time algorithms for approximating the Legendre-Fenchel transformation (e.g., [87,88]) for approximating .
Once
and
are computed, we can derive
and
via standard results in optimization (see [
3] (Section IV) for more details):
and
Following the convex analysis approach outlined by Witsenhausen and Wyner [
3],
and
can be directly computed from
and
by observing the following. Suppose for some
,
(resp.
) at
is obtained by a convex combination of points
,
for some
in
, integer
, and weights
(with
). Then
, and
with properties
and
attains the minimum (resp. maximum) of
. Hence,
is a point on the upper (resp. lower) boundary of
; implying that
for
(resp.
for
). If for some
,
at
coincides with
, then this corresponds to
. The same holds for
. Thus, all the information about the functional
(resp.
) is contained in the subset of the domain of
(resp.
) over which it differs from
. We will revisit and generalize this approach later in
Section 3.
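As a numerical companion to Remark 2 and to the envelope computations above, the following sketch approximates the lower convex envelope of a function sampled on a grid by applying a discrete Legendre-Fenchel transform twice. It is a naive O(nm) illustration with an ad hoc slope grid (not one of the linear-time algorithms of [87,88]); the upper concave envelope is obtained by applying the same routine to −F.

```python
import numpy as np

def legendre(xs, fs, slopes):
    """Discrete Legendre-Fenchel transform: f*(s) = max_x [ s*x - f(x) ]."""
    return np.max(slopes[:, None] * xs[None, :] - fs[None, :], axis=1)

def lower_convex_envelope(xs, fs, n_slopes=2001):
    """Biconjugate f** on the same grid; f** is the lower convex envelope of f."""
    d = np.diff(fs) / np.diff(xs)                    # finite-difference slopes bound the range
    slopes = np.linspace(d.min() - 1.0, d.max() + 1.0, n_slopes)
    f_star = legendre(xs, fs, slopes)
    return np.max(slopes[None, :] * xs[:, None] - f_star[None, :], axis=1)

def upper_concave_envelope(xs, fs, n_slopes=2001):
    return -lower_convex_envelope(xs, -fs, n_slopes)

# Example: a non-convex function on [0, 1].
xs = np.linspace(0.0, 1.0, 501)
fs = np.sin(3 * np.pi * xs) + 0.5 * xs
env = lower_convex_envelope(xs, fs)
assert np.all(env <= fs + 1e-9)                      # the envelope never exceeds the function
print("max gap between F and its lower convex envelope:", float(np.max(fs - env)))
```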
We can now instantiate this for the binary symmetric case. Suppose
X and
Y are binary variables and
is a binary symmetric channel with crossover probability
, denoted by
and defined as
for some
. To describe the result in a compact fashion, we introduce the following notation: we let
denote the binary entropy function, i.e.,
. Since this function is strictly increasing
, its inverse exists and is denoted by
. Moreover,
for
.
Lemma 4 (Mr. and Mrs. Gerber’s Lemma)
. For for and for , we have and where , , and .
The result in (
24) was proved by Wyner and Ziv [
2] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (
25) as Mr. Gerber’s Lemma. As described above, to prove (
24) and (
25) it suffices to derive the convex and concave envelopes of the mapping
given by
where
is the output distribution of
when the input distribution is
for some
. It can be verified that
. This function is depicted in
Figure 4, depending on the values of
.
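The binary-symmetric expressions can be sanity-checked numerically. The sketch below is a brute-force illustration under the assumptions X ~ Bernoulli(1/2), P_Y|X a BSC with crossover δ, and a binary T (the grid resolution is arbitrary): it compares the minimum of H(Y|T) subject to H(X|T) ≥ u with the Mrs. Gerber value h(δ ∗ h^{-1}(u)), where a ∗ b = a(1−b) + b(1−a).

```python
import numpy as np

def h(p):                        # binary entropy in bits
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def h_inv(v):                    # inverse of h on [0, 1/2], via bisection
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < v else (lo, mid)
    return 0.5 * (lo + hi)

def star(a, b):                  # binary convolution a * b = a(1-b) + b(1-a)
    return a * (1 - b) + b * (1 - a)

delta, u = 0.1, 0.6              # BSC crossover probability and the constraint H(X|T) >= u
p_x = np.array([0.5, 0.5])       # assumption: X ~ Bernoulli(1/2)
p_xy = p_x[:, None] * np.array([[1 - delta, delta],
                                [delta, 1 - delta]])

best = np.inf
grid = np.linspace(0.0, 1.0, 201)
for a in grid:                   # P(T=0|X=0) = a,  P(T=0|X=1) = b  (binary T)
    for b in grid:
        p_t_given_x = np.array([[a, 1 - a], [b, 1 - b]])
        p_xt = p_x[:, None] * p_t_given_x
        p_yt = p_xy.T @ p_t_given_x
        p_t = p_xt.sum(0)
        h_x_t = sum(p_t[t] * h(p_xt[0, t] / p_t[t]) for t in range(2) if p_t[t] > 0)
        h_y_t = sum(p_t[t] * h(p_yt[0, t] / p_t[t]) for t in range(2) if p_t[t] > 0)
        if h_x_t >= u:
            best = min(best, h_y_t)

print(f"brute-force min H(Y|T)  = {best:.4f}")
print(f"h(delta * h^-1(u))      = {h(star(delta, h_inv(u))):.4f}")
```

Up to the grid resolution, the two values agree, which is consistent with Mrs. Gerber's Lemma being achieved by a symmetric binary channel from X to T in this uniform case.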
2.3. Operational Meaning of IB and PF
In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both
and
. The operational interpretation of
has recently been extensively studied in information-theoretic settings in [
29,
30]. In particular, it was shown that
specifies the rate-distortion region of noisy source coding problem [
19,
89] under the logarithmic loss as the distortion measure and also the rate region of the lossless source coding with side information at the decoder [
90]. Here, we state the former setting (as it will be useful for our subsequent analysis of cardinality bound) and also provide a new information-theoretic setting in which
appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by
. This in fact delineates an important difference between
and
: while
describes the entire rate-region of an information-theoretic setup,
specifies only a corner point of a rate region. Other information-theoretic settings related to
and
include CEO problem [
91] and source coding for the Gray-Wyner network [
92].
2.3.1. Noisy Source Coding
Suppose Alice has access only to a noisy version
X of a source of interest
Y. She wishes to transmit a rate-constrained description from her observation (i.e.,
X) to Bob such that he can recover
Y with small average distortion. More precisely, let
be
n i.i.d. samples of
. Alice encodes her observation
through an encoder
and sends
to Bob. Upon receiving
, Bob reconstructs a “soft” estimate of
via a decoder
where
. That is, the reproduction sequence
consists of
n probability measures on
. For any source and reproduction sequences
and
, respectively, the distortion is defined as
where
We say that a rate-distortion pair
is achievable if there exists a pair
of encoder and decoder such that
The noisy rate-distortion function
for a given
, is defined as the minimum rate
such that
is an achievable rate-distortion pair. This problem arises naturally in many data analytic problems. Some examples include feature selection of a high-dimensional dataset, clustering, and matrix completion. This problem was first studied by Dobrushin and Tsybakov [
19], who showed that
is analogous to the classical rate-distortion function
It can be easily verified that
and hence (after relabeling
as
T)
where
, which is equal to
defined in (
4). For more details in connection between noisy source coding and
, the reader is referred to [
29,
30,
91,
93]. Notice that one can study an essentially identical problem where the distortion constraint (
28) is replaced by
This problem is addressed in [
94] for discrete alphabets
and
and extended recently in [
95] to general alphabets.
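The link between this coding problem and IB rests on the logarithmic loss in (27): when the decoder reports the true posterior P_Y|T(·|T) as its soft estimate, the expected per-letter distortion equals H(Y|T), and any other soft estimate incurs an additional divergence penalty. The short sketch below (with a made-up joint pmf of our own) illustrates both facts.

```python
import numpy as np

p_yt = np.array([[0.30, 0.05, 0.10],       # joint pmf of (Y, T); rows index y, columns index t
                 [0.05, 0.25, 0.25]])
p_t = p_yt.sum(axis=0)
p_y_given_t = p_yt / p_t                    # posterior P_{Y|T}(y|t), columns sum to 1

# Expected logarithmic loss of reporting the posterior: E[ log2 1/P_{Y|T}(Y|T) ].
exp_log_loss = float(-np.sum(p_yt * np.log2(p_y_given_t)))

# Conditional entropy H(Y|T), computed as the average of the per-t binary entropies.
h_y_given_t = float(sum(p_t[t] * -(p_y_given_t[:, t] * np.log2(p_y_given_t[:, t])).sum()
                        for t in range(p_t.size)))

print(f"expected log-loss = {exp_log_loss:.6f} bits")
print(f"H(Y|T)            = {h_y_given_t:.6f} bits")    # the two coincide

# A mismatched soft estimate incurs an extra penalty on average.
q = np.array([0.5, 0.5])                    # constant predictor that ignores T
mismatched_loss = float(-np.sum(p_yt * np.log2(q[:, None])))
print(f"log-loss of a constant predictor = {mismatched_loss:.6f} bits (>= H(Y|T))")
```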
2.3.2. Test against Independence with Communication Constraint
As mentioned earlier, the connection between
and noisy source coding, described above, was known and studied in [
29,
30]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for
. Given
n i.i.d. samples
from joint distribution
Q, we wish to test whether
are independent of
, that is,
Q is a product distribution. This task is formulated by the following hypothesis test:
for a given joint distribution
with marginals
and
. Ahlswede and Csiszár [
20] investigated this problem under a communication constraint: While
Y observations (i.e.,
) are available, the
X observations need to be compressed at rate
R, that is, instead of
, only
is present where
satisfies
For the type I error probability not exceeding a fixed
, Ahlswede and Csiszár [
20] derived the smallest possible type II error probability, defined as
The following gives the asymptotic expression of
for every
. For the proof, refer to [
20] (Theorem 3).
Theorem 7 ([
20])
. For every and , we have In light of this theorem,
specifies the exponential rate at which the type II error probability of the hypothesis test (
31) decays as the number of samples increases.
2.3.3. Dependence Dilution
Inspired by the problems of information amplification [
96] and state masking [
97], Asoodeh et al. [
21] proposed the dependence dilution setup as follows. Consider a source sequence
of
n i.i.d. copies of
. Alice observes the source
and wishes to encode it via the encoder
for some
. The goal is to ensure that any user observing
can construct a list, of fixed size, of sequences in
that contains likely candidates of the actual sequence
while revealing negligible information about a correlated source
. To formulate this goal, consider the decoder
where
denotes the power set of
. A
dependence dilution triple is said to be achievable if, for any
, there exists a pair of encoder and decoder
such that for sufficiently large
n
having fixed size
where
and simultaneously
Notice that without side information
J, the decoder can only construct a list of size
which contains
with probability close to one. However, after
J is observed and the list
is formed, the decoder’s list size can be reduced to
, thereby reducing the uncertainty about
by
. This observation can be formalized to show (see [
96] for details) that the constraint (
32) is equivalent to
which lower bounds the amount of information
J carries about
. Built on this equivalent formulation, Asoodeh et al. [
21] (Corollary 15) derived a necessary condition for the achievable dependence dilution triple.
Theorem 8 ([
21])
. Any achievable dependence dilution triple satisfies for some auxiliary random variable T satisfying Y X T and taking values.
According to this theorem, specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate . While this informs the operational interpretation of , Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples . It is, however, not clear that characterizes the rate region of an information-theoretic setup.
The fact that fully characterizes the rate region of a source coding setup has an important consequence: the cardinality of the auxiliary random variable T in can be improved to instead of .
2.4. Cardinality Bound
Recall that in the definition of
in (
4), no assumption was imposed on the auxiliary random variable
T. A straightforward application of Carathéodory-Fenchel-Eggleston theorem (see e.g., [
98] (Section III) or [
79] (Lemma 15.4)) reveals that
is attained for
T taking values in a set
with cardinality
. Here, we improve this bound and show
is sufficient.
Theorem 9. For any joint distribution and , information bottleneck is achieved by T taking at most values.
The proof of this theorem hinges on the operational characterization of
as the lower boundary of the rate-distortion region of noisy source coding problem discussed in
Section 2.3. Specifically, we first show that the extreme points of this region is achieved by
T taking
values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such
T. It must be mentioned that this result was already claimed by Harremoës and Tishby in [
99] without proof.
In many practical scenarios, feature
X has a large alphabet. Hence, the bound
, albeit optimal, can still make the information bottleneck function computationally intractable over large alphabets. However, the label
Y usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for
T in terms of
, one can consider approximating
assuming
T takes
N values. The following result, recently proved by Hirche and Winter [
100], is in this spirit.
Theorem 10 ([
100])
. For any , we havewhere and denotes the information bottleneck functional (4) with the additional constraint that . Recall that, unlike
, the graph of
characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in
Section 2.3), and hence any boundary points can be constructed via time-sharing of extreme points of the rate region. This lack of operational characterization of
translates into a worse cardinality bound than that of
. In fact, for
the cardinality bound
cannot be improved in general. To demonstrate this, we numerically solve the optimization in
assuming that
when both
X and
Y are binary. As illustrated in
Figure 5, this optimization does not lead to a convex function, and hence, cannot be equal to
.
2.5. Deterministic Information Bottleneck
As mentioned earlier,
formalizes an information-theoretic approach to clustering high-dimensional feature
X into cluster labels
T that preserve as much information about the label
Y as possible. The clustering label is assigned by the soft operator
that solves the
formulation (
4) according to the rule:
is likely assigned label
if
is small where
. That is, clustering is assigned based on the similarity of conditional distributions. Since in many practical scenarios a hard clustering operator is preferred, Strouse and Schwab [
31] suggested the following variant of
, termed as deterministic information bottleneck
where the maximization is taken over all deterministic functions
f whose range is a finite set
. Similarly, one can define
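For small alphabets, the deterministic variants can be evaluated by exhaustive search. The sketch below is a brute-force illustration with an assumed alphabet size and cluster budget (not an algorithm from [31]): it enumerates all deterministic maps f : X → {0, …, k−1} and records H(f(X)) (which equals I(X; f(X)) for deterministic f) together with I(f(X); Y), from which the deterministic trade-off can be read off.

```python
import numpy as np
from itertools import product

def mi(p_uv):
    pu, pv = p_uv.sum(1, keepdims=True), p_uv.sum(0, keepdims=True)
    m = p_uv > 0
    return float(np.sum(p_uv[m] * np.log2(p_uv[m] / (pu @ pv)[m])))

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint distribution with |X| = 4 and |Y| = 2 (made-up numbers).
p_xy = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.05, 0.20],
                 [0.10, 0.15]])
p_x = p_xy.sum(1)
k = 2                                              # size of the cluster alphabet T

results = []
for f in product(range(k), repeat=p_xy.shape[0]):  # all deterministic maps X -> T
    p_ty = np.zeros((k, p_xy.shape[1]))
    p_t = np.zeros(k)
    for x, t in enumerate(f):
        p_ty[t] += p_xy[x]
        p_t[t] += p_x[x]
    results.append((entropy(p_t), mi(p_ty), f))    # (I(X;f(X)), I(f(X);Y), map)

# Best informativeness among maps with compression at most R bits.
R = 1.0
feasible = [r for r in results if r[0] <= R + 1e-12]
best = max(feasible, key=lambda r: r[1])
print(f"best map {best[2]}: I(X;f(X)) = {best[0]:.3f} bits, I(f(X);Y) = {best[1]:.3f} bits")
```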
One way to ensure that
for a deterministic function
f is to restrict the cardinality of the range of
f: if
then
is necessarily smaller than
R. Using this insight, we derive a lower for
in the following lemma.
Lemma 5. For any given , we have and
Note that both
R and
r are smaller than
and thus the multiplicative factors of
in the lemma are smaller than one. In light of this lemma, we can obtain
and
In most practical setups,
might be very large, making the above lower bound for
vacuous. In the following lemma, we partially address this issue by deriving a bound independent of
when
Y is binary.
Lemma 6. Let be a joint distribution of arbitrary X and binary for some . Then, for any we have where .
3. Family of Bottleneck Problems
In this section, we introduce a family of bottleneck problems by extending
and
to a large family of statistical measures. Similar to
and
, these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution
. Recall that
and
are the upper and lower boundary of the set
defined in (
6) and expressed here again for convenience
Since
is given,
and
are fixed. Thus, in characterizing
it is sufficient to consider only
and
. To generalize
and
, we must therefore generalize
and
.
Given a joint distribution
and two non-negative real-valued functions
and
, we define
and
When
and
, we interchangeably write
for
and
for
.
These definitions provide natural generalizations for Shannon’s entropy and mutual information. Moreover, as we discuss later in
Section 3.2 and
Section 3.3, it can also be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information- and estimation-theoretic quantities such as Arimoto's conditional entropy of order
for
, probability of correctly guessing for
, maximal correlation in the binary case, and
f-information for
given by
f-divergence. We are able to generate a family of bottleneck problems using different instantiations of
and
in place of mutual information in
and
. As we argue later, these problems better capture the essence of “informativeness” and “privacy”; thus providing analytical and interpretable guarantees similar in spirit to
and
.
Computing these bottleneck problems in general boils down to the following optimization problems
and
Consider the set
Note that if both
and
are continuous (with respect to the total variation distance), then
is compact. Moreover, it can be easily verified that
is convex. Hence, its upper and lower boundaries are well-defined and are characterized by the graphs of
and
, respectively. As mentioned earlier, these functionals are instrumental for computing the general bottleneck problems later. Hence, before we delve into the examples of bottleneck problems, we extend the approach given in
Section 2.2 to compute
and
.
3.1. Evaluation of and
Analogous to
Section 2.2, we first introduce the Lagrangians of
and
as
and
where
is the Lagrange multiplier. Let
be a pair of random variables with
and
is the result of passing
through the channel
. Letting
we obtain that
recalling that
and
are the upper concave and lower convex envelope operators. Once we compute
and
for all
, we can use standard results in optimization theory (similar to (
21) and (
22)) to recover
and
. However, we can instead extend the approach of Witsenhausen and Wyner [
3] described in
Section 2.2. Suppose for some
,
(resp.
) at
is obtained by a convex combination of points
,
for some
in
, integer
, and weights
(with
). Then
, and
with properties
and
attains the maximum (resp. minimum) of
, implying that
is a point on the upper (resp. lower) boundary of
. Consequently, such
satisfies
for
(resp.
for
). The algorithm to compute
and
is then summarized in the following three steps:
Construct the functional for and and all and .
Compute and evaluated at .
If for distributions in for some , we have or for some satisfying , then , and give the optimal in and , respectively.
We will apply this approach to analytically compute and (and the corresponding bottleneck problems) for binary cases in the following sections.
3.2. Guessing Bottleneck Problems
Let
be given with marginals
and
and the corresponding channel
. Let also
be an arbitrary distribution on
and
be the output distribution of
when fed with
. Any channel
, together with the Markov structure
Y X T, generates unique
and
. We need the following basic definition from statistics.
Definition 1. Let U be a discrete and V an arbitrary random variable supported on and with , respectively. Then the probability of correctly guessing U and the probability of correctly guessing U given V are given by and Moreover, the multiplicative gain of the observation V in guessing U is defined as (the reason for ∞ in the notation becomes clear later)
As the names suggest, and characterize the optimal efficiency of guessing U with or without the observation V, respectively. Intuitively, quantifies how useful the observation V is in estimating U: if it is small, then it is nearly as hard for an adversary observing V to guess U as it is without V. This observation motivates the use of as a measure of privacy in lieu of in .
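Using the standard expressions P_c(U) = max_u P_U(u) and P_c(U|V) = Σ_v max_u P_UV(u, v), these quantities are immediate to evaluate for discrete distributions. The sketch below (with a made-up joint pmf of our own) computes both, together with the multiplicative gain of observing V.

```python
import numpy as np

def p_correct(p_u):
    """Probability of correctly guessing U without side information: max_u P_U(u)."""
    return float(np.max(p_u))

def p_correct_given(p_uv):
    """Probability of correctly guessing U given V: sum_v max_u P_{UV}(u, v)."""
    return float(np.sum(np.max(p_uv, axis=0)))

# Made-up joint pmf of (U, V); rows index u, columns index v.
p_uv = np.array([[0.30, 0.05, 0.05],
                 [0.10, 0.25, 0.25]])
p_u = p_uv.sum(axis=1)

pc, pc_v = p_correct(p_u), p_correct_given(p_uv)
print(f"P_c(U)   = {pc:.3f}")
print(f"P_c(U|V) = {pc_v:.3f}")
# The gain is always >= 1; it equals 1 when V is useless for guessing U (not necessarily independence).
print(f"multiplicative gain P_c(U|V)/P_c(U) = {pc_v / pc:.3f}")
```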
It is worth noting that
is not symmetric in general, i.e.,
. Since observing
T can only improve, we have
; thus
. However,
does not necessarily imply independence of
Y and
T; instead, it means
T is useless in estimating
Y. As an example, consider
and
and
with
. Then
and
Thus, if
, then
. This then implies that
whereas
Y and
T are clearly dependent; i.e.,
. While in general
and
are not related, it can be shown that
if
Y is uniform (see [
65] (Proposition 1)). Hence, only with this uniformity assumption,
implies independence.
Consider
and
. Clearly, we have
. Note that
thus both measures
and
are special cases of the models described in the previous section. In particular, we can define the corresponding
and
. We will see later that
and
correspond to Arimoto’s mutual information of orders 1 and
∞, respectively. Define
This bottleneck functional comes with an interpretable guarantee:
Recall that the functional
aims at extracting maximum information of
X while protecting privacy with respect to
Y. Measuring the privacy in terms of
, this objective can be better formulated by
with the interpretable privacy guarantee:
Notice that the variable
T in the formulations of
and
takes values in a set
of arbitrary cardinality. However, a straightforward application of the Carathéodory-Fenchel-Eggleston theorem (see e.g., [
79] (Lemma 15.4)) reveals that the cardinality of
can be restricted to
without loss of generality. In the following lemma, we prove more basic properties of
and
.
Lemma 7. For any with Y supported on a finite set , we have
.
for any and for .
is strictly increasing and concave on the range .
is strictly increasing, and convex on the range .
The proof follows along the same lines as that of Theorem 1 and is hence omitted. Lemma 7 in particular implies that the inequalities
and
in the definition of
and
can be replaced by
and
, respectively. It can be verified that
satisfies the data-processing inequality, i.e.,
for the Markov chain
Y X T. Hence, both
and
must be smaller than
. The properties listed in Lemma 7 enable us to derive a slightly tighter upper bound for
as demonstrated in the following.
Lemma 8. For any with Y supported on a finite set , we have and
The proof of this lemma (and any other results in this section) is given in
Appendix B. This lemma shows that the gap between
and
when
R is sufficiently close to
behaves like
Thus,
approaches
as
at least linearly.
In the following theorem, we apply the technique delineated in
Section 3.1 to derive closed form expressions for
and
for the binary symmetric case, thereby establishing similar results as Mr and Mrs. Gerber’s Lemma.
Theorem 11. For and with , we have and where .
As described in
Section 3.1, to compute
and
it suffices to derive the convex and concave envelopes of the mapping
where
and
is the result of passing
through
, i.e.,
. In this case,
and
can be expressed as
This function is depicted in
Figure 6.
The detailed derivation of convex and concave envelope of
is given in
Appendix B. The proof of this theorem also reveals the following intuitive statements. If
and
, then among all random variables
T satisfying
Y X T and
, the minimum
is given by
. Notice that, without any information constraint (i.e.,
),
. Perhaps surprisingly, this shows that the mutual information constraint has a linear effect on the privacy of
Y. Similarly, to prove (
51), we show that among all
R-bit representations
T of
X, the best achievable accuracy
is given by
. This can be proved by combining Mrs. Gerber’s Lemma (cf. Lemma 4) and Fano’s inequality as follows. For all
T such that
, the minimum of
is given by
. Since by Fano’s inequality,
, we obtain
which leads to the same result as above. Nevertheless, in
Appendix B we give another proof based on the discussion of
Section 3.1.
3.3. Arimoto Bottleneck Problems
The bottleneck framework proposed in the last section benefited from interpretable guarantees brought forth by the quantity . In this section, we define a parametric family of statistical quantities, the so-called Arimoto’s mutual information, which includes both Shannon’s mutual information and as extreme cases.
Definition 2 ([
22])
. Let and be two random variables supported over finite sets and , respectively. Their Arimoto's mutual information of order is defined as where is the Rényi entropy of order α and is Arimoto's conditional entropy of order α. By continuous extension, one can define
for
and
as
and
, respectively. That is,
Arimoto’s mutual information was first introduced by Arimoto [
22] and then later revisited by Liese and Vajda in [
101] and more recently by Verdú in [
102]. More in-depth analysis and properties of
can be found in [
103]. It is shown in [
71] (Lemma 1) that
for
quantifies the minimum loss in recovering
U given
V where the loss is measured in terms of the so-called
-loss. This loss function reduces to logarithmic loss (
27) and
for
and
, respectively. This sheds light on the utility and/or privacy guarantee promised by a constraint on Arimoto’s mutual information. It is now natural to use
for defining a family of bottleneck problems.
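For discrete distributions, Arimoto's mutual information is straightforward to evaluate from Definition 2. The sketch below is an illustrative implementation using the standard formulas H_α(U) = (1/(1−α)) log Σ_u P_U(u)^α and H_α(U|V) = (α/(1−α)) log Σ_v (Σ_u P_UV(u,v)^α)^{1/α} (not code from the paper; the joint pmf and the orders α are our own choices). It also checks the two limiting regimes: near α = 1 the value approaches Shannon's mutual information, and for large α it approaches log(P_c(U|V)/P_c(U)).

```python
import numpy as np

def renyi_entropy(p_u, alpha):
    """Rényi entropy H_alpha(U) = (1/(1-alpha)) * log sum_u P_U(u)^alpha (in nats)."""
    return float(np.log(np.sum(p_u ** alpha)) / (1.0 - alpha))

def arimoto_cond_entropy(p_uv, alpha):
    """Arimoto conditional entropy H_alpha(U|V) = (alpha/(1-alpha)) * log sum_v ||P_UV(.,v)||_alpha."""
    norms = np.sum(p_uv ** alpha, axis=0) ** (1.0 / alpha)
    return float(alpha / (1.0 - alpha) * np.log(np.sum(norms)))

def arimoto_mi(p_uv, alpha):
    """Arimoto mutual information of order alpha (alpha != 1)."""
    p_u = p_uv.sum(axis=1)
    return renyi_entropy(p_u, alpha) - arimoto_cond_entropy(p_uv, alpha)

def shannon_mi(p_uv):
    pu, pv = p_uv.sum(1, keepdims=True), p_uv.sum(0, keepdims=True)
    m = p_uv > 0
    return float(np.sum(p_uv[m] * np.log(p_uv[m] / (pu @ pv)[m])))

p_uv = np.array([[0.30, 0.05, 0.05],
                 [0.10, 0.25, 0.25]])
p_u = p_uv.sum(1)

print("Shannon I(U;V)       :", round(shannon_mi(p_uv), 4))
print("Arimoto I_1.001(U;V) :", round(arimoto_mi(p_uv, 1.001), 4))   # ~ Shannon MI
print("Arimoto I_100(U;V)   :", round(arimoto_mi(p_uv, 100.0), 4))
print("log P_c(U|V)/P_c(U)  :", round(np.log(np.sum(p_uv.max(0)) / p_u.max()), 4))  # order-infinity limit
```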
Definition 3. Given a pair of random variables over finite sets and and , we define and as and
and
. It is known that Arimoto’s mutual information satisfies the data-processing inequality [
103] (Corollary 1), i.e.,
for the Markov chain
Y X T. On the other hand,
. Thus, both
and
equal
for
. Note also that
where
(see (
39)) corresponding to the function
. Consequently,
and
are characterized by the lower and upper boundary of
, defined in (
37), with respect to
and
. Specifically, we have
where
, and
where
and
and
. This paves the way to apply the technique described in
Section 2.2 to compute
and
. Doing so requires the upper concave and lower convex envelope of the mapping
for some
, where
). In the following theorem, we derive these envelopes and give closed-form expressions for
and
for a special case where
.
Theorem 12. Let and with . We have for where for and solves Moreover, where and solves
By letting
, this theorem indicates that for
X and
Y connected through
and all variables
T forming
Y X T, we have
which can be shown to be achieved
generated by the following channel (see
Figure 7)
Note that, by assumption,
, and hence the event
is less likely than
. Therefore, (
61) demonstrates that to ensure correct recoverability of
X with probability at least
, the most private approach (with respect to
Y) is to obfuscate the more likely event
with probability
. As demonstrated in (
61), the optimal privacy guarantee is linear in the utility parameter in the binary symmetric case. This is in fact a special case of a more general result recently proved in [
65] (Theorem 1): the infimum of
over all variables
T such that
is piece-wise linear in
, or equivalently, the mapping
is piece-wise linear.
Computing analytically for every seems to be challenging; however, the following lemma provides bounds for and in terms of and , respectively.
Lemma 9. For any pair of random variables over finite alphabets and , we have and where and .
The previous lemma can be directly applied to derive upper and lower bounds for and given and .
3.4. f-Bottleneck Problems
In this section, we describe another instantiation of the general framework introduced in terms of functions and that enjoys an interpretable estimation-theoretic guarantee.
Definition 4. Let be a convex function with . Furthermore, let U and V be two real-valued random variables supported over and , respectively. Their f-information is defined by where is the f-divergence [104] between distributions and defined as
Due to convexity of
f, we have
and hence
f-information is always non-negative. If, furthermore,
f is strictly convex at 1, then equality holds if and only if
. Csiszár introduced
f-divergence in [
104] and applied it to several problems in statistics and information theory. More recent developments about the properties of
f-divergence and
f-information can be found in [
23] and the references therein. Any convex function
f with the property
results in an
f-information. Popular examples include
corresponding to Shannon’s mutual information,
corresponding to
T-information [
83], and also
corresponding to
-information [
69] for. It is worth mentioning that if we allow
to be in
in Definition 2 (similar to [
101]), then the resulting Arimoto’s mutual information can be shown to be an
f-information in the binary case for a certain function
f, see [
101] (Theorem 8).
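For discrete distributions, f-information can be computed directly from Definition 4 as I_f(U;V) = Σ_{u,v} P_U(u)P_V(v) f(P_UV(u,v)/(P_U(u)P_V(v))). The sketch below illustrates this for three choices of f; we take f(t) = (t−1)² for χ²-information, f(t) = |t−1|/2 under the assumption that T-information corresponds to the total-variation choice, and f(t) = t log t, which recovers Shannon's mutual information in nats (the joint pmf is made up).

```python
import numpy as np

def f_information(p_uv, f):
    """I_f(U;V) = sum_{u,v} P_U(u) P_V(v) * f( P_UV(u,v) / (P_U(u) P_V(v)) )."""
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    prod = p_u @ p_v
    return float(np.sum(prod * f(p_uv / prod)))

chi2 = lambda t: (t - 1.0) ** 2            # chi^2-information
tv   = lambda t: 0.5 * np.abs(t - 1.0)     # total-variation-based information
kl   = lambda t: t * np.log(t)             # f(t) = t log t recovers Shannon MI (nats)

p_uv = np.array([[0.30, 0.05, 0.05],
                 [0.10, 0.25, 0.25]])

print("chi^2-information :", round(f_information(p_uv, chi2), 4))
print("TV-information    :", round(f_information(p_uv, tv), 4))
print("Shannon MI (nats) :", round(f_information(p_uv, kl), 4))
```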
Let
be given with marginals
and
. Consider functions
and
on
and
defined as
Given a conditional distribution
, it is easy to verify that
and
. This in turn implies that
f-information can be utilized in (
40) and (
41) to define general bottleneck problems: Let
and
be two convex functions satisfying
. Then we define
and
In light of the discussion in
Section 3.1, the optimization problems in
and
can be analytically solved by determining the upper concave and lower convex envelope of the mapping
where
is the Lagrange multiplier and
.
Consider the function
with
. The corresponding
f-divergence is sometimes called Hellinger divergence of order
, see e.g., [
105]. Note that Hellinger divergence of order 2 reduces to
-divergence. Calmon et al. [
68] and Asoodeh et al. [
67] showed that if
for some
, then the minimum mean-squared error (MMSE) of reconstructing any zero-mean unit-variance function of
Y given
T is lower bounded by
, i.e., no function of
Y can be reconstructed with small MMSE given an observation of
T. This result serves as a natural justification for
as an operational measure of both privacy and utility in a bottleneck problem.
Unfortunately, our approach described in
Section 3.1 cannot be used to compute
or
in the binary symmetric case. The difficulty lies in the fact that the function
, defined in (
66), for the binary symmetric case is either convex or concave on its entire domain depending on the value of
. Nevertheless, one can consider Hellinger divergence of order
with
and then apply our approach to compute
or
. Since
(see [
106] (Corollary 5.6)), one can justify
as a measure of privacy and utility in a similar way as
.
We end this section with a remark about estimating the measures studied here. While we consider the information-theoretic regime where the underlying distribution
is known, in practice only samples
are given. Consequently, the de facto guarantees of bottleneck problems might be considerably different from those shown in this work. It is therefore essential to assess the guarantees of bottleneck problems when accessing only samples. To do so, one must derive bounds on the discrepancy between
,
, and
computed on the empirical distribution and the true (unknown) distribution. These bounds can then be used to shed light on the de facto guarantee of the bottleneck problems. Relying on [
34] (Theorem 1), one can obtain that the gaps between the measures
,
, and
computed on empirical distributions and the true one scale as
where
n is the number of samples. This is in contrast with mutual information for which the similar upper bound scales as
as shown in [
33]. Therefore, the above measures appear to be easier to estimate than mutual information.
4. Summary and Concluding Remarks
Following the recent surge in the use of information bottleneck (
) and privacy funnel (
) in developing and analyzing machine learning models, we investigated the functional properties of these two optimization problems. Specifically, we showed that
and
correspond to the upper and lower boundary of a two-dimensional convex set
Y X T} where
represents the observable data
X and target feature
Y and the auxiliary random variable
T varies over all possible choices satisfying the Markov relation
Y X T. This unifying perspective on
and
allowed us to adapt the classical technique of Witsenhausen and Wyner [
3] devised for computing
to be applicable for
as well. We illustrated this by deriving a closed form expression for
in the binary case—a result reminiscent of the Mrs. Gerber’s Lemma [
2] in information theory literature. We then showed that both
and
are closely related to several information-theoretic coding problems such as noisy random coding, hypothesis testing against independence, and dependence dilution. While these connections were partially known in previous work (see e.g., [
29,
30]), we show that they lead to an improvement on the cardinality of
T for computing
. We then turned our attention to the continuous setting where
X and
Y are continuous random variables. Solving the optimization problems in
and
in this case without any further assumptions seems a difficult challenge in general and leads to theoretical results only when
is jointly Gaussian. Invoking recent results on the entropy power inequality [
25] and strong data processing inequality [
27], we obtained tight bounds on
in two different cases: (1) when
Y is a Gaussian perturbation of
X and (2) when
X is a Gaussian perturbation of
Y. We also utilized the celebrated I-MMSE relationship [
107] to derive a second-order approximation of
when
T is considered to be a Gaussian perturbation of
X.
In the second part of the paper, we argue that the choice of (Shannon’s) mutual information in both
and
does not seem to carry specific operational significance. It does, however, have a desirable practical consequence: it leads to self-consistent equations [
] that can be solved iteratively (though without any convergence guarantee). In fact, this property is unique to mutual information among existing information measures [
99]. Nevertheless, we argued that other information measures might lead to better interpretable guarantee for both
and
. For instance, statistical accuracy in
and privacy leakage in
can be shown to be precisely characterized by probability of correctly guessing (aka Bayes risk) or minimum mean-squared error (MMSE). Following this observation, we introduced a large family of optimization problems, which we call bottleneck problems, by replacing mutual information in
and
with Arimoto’s mutual information [
22] or
f-information [
23]. Invoking results from [
33,
34], we also demonstrated that these information measures are in general easier to estimate from data than mutual information. Similar to
and
, the bottleneck problems were shown to be fully characterized by boundaries of a two-dimensional convex set parameterized by two real-valued non-negative functions
and
. This perspective enabled us to generalize the technique used to compute
and
for evaluating bottleneck problems. Applying this technique to the binary case, we derived closed form expressions for several bottleneck problems.