Abstract
A double-sided variant of the information bottleneck method is considered. Let $(\mathsf{X},\mathsf{Y})$ be a bivariate source characterized by a joint pmf $P_{\mathsf{X}\mathsf{Y}}$. The problem is to find two independent channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ (imposing the Markovian structure $\mathsf{U}-\mathsf{X}-\mathsf{Y}-\mathsf{V}$) that maximize $I(\mathsf{U};\mathsf{V})$ subject to constraints on the relevant mutual information expressions $I(\mathsf{U};\mathsf{X})$ and $I(\mathsf{V};\mathsf{Y})$. For jointly Gaussian $\mathsf{X}$ and $\mathsf{Y}$, we show that Gaussian channels are optimal in the low-SNR regime but not for general SNR. Similarly, it is shown that for a doubly symmetric binary source, binary symmetric channels are optimal when the correlation is low and suboptimal when the correlation is high. We conjecture that Z- and S-channels are optimal when the correlation is 1 (i.e., $p=0$) and provide supporting numerical evidence. Furthermore, we present a Blahut–Arimoto type alternating maximization algorithm and demonstrate its performance for a representative setting. This problem is closely related to the domain of biclustering.
1. Introduction
The information bottleneck (IB) method [1] plays a central role in advanced lossy source compression. Classical source coding is mainly analyzed via rate-distortion theory, where a fidelity measure must be defined. However, specifying an appropriate distortion measure is challenging, and sometimes infeasible, in many real-world applications. The IB framework introduces an essentially different concept: an additional variable is provided that carries the relevant information in the data to be compressed. The quality of the reconstructed sequence is measured via the mutual information between the reconstructed data and the relevance variable. Thus, the IB method provides a universal fidelity measure.
In this work, we extend and generalize the IB method by imposing an additional bottleneck constraint on the relevant variable and by considering noisy observations of the source. In particular, let $(\mathsf{X},\mathsf{Y})$ be a bivariate source characterized by a fixed joint probability law $P_{\mathsf{X}\mathsf{Y}}$, and consider all Markov chains $\mathsf{U}-\mathsf{X}-\mathsf{Y}-\mathsf{V}$. The Double-Sided Information Bottleneck (DSIB) function is defined as the maximum of $I(\mathsf{U};\mathsf{V})$ [2],
where the maximization is over all $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ satisfying $I(\mathsf{U};\mathsf{X})\le C_{\mathsf{U}}$ and $I(\mathsf{V};\mathsf{Y})\le C_{\mathsf{V}}$. This problem is illustrated in Figure 1. In our study, we aim to determine the maximum value of (1) and the achieving conditional distributions (test channels) for various fixed sources $P_{\mathsf{X}\mathsf{Y}}$ and constraints $C_{\mathsf{U}}$ and $C_{\mathsf{V}}$.
Figure 1.
Block diagram of the Double-Sided Information Bottleneck function.
The problem we consider originates from the domain of clustering. Clustering is applied to organize similar entities in unsupervised learning [3]. It has numerous practical applications in data science, such as joint word-document clustering, gene-expression analysis [4], and pattern recognition. The data in those applications are arranged in a contingency table. Usually, clustering is performed along one dimension of the table, but sometimes it is helpful to cluster both dimensions of the contingency table [5], for example, when there is a strong correlation between the rows and the columns or when high-dimensional sparse structures are handled. The input and output of a typical biclustering algorithm are illustrated in Figure 2. Consider a data matrix: the goal is to find row and column partitions such that all elements of the resulting “biclusters” [6] are homogeneous. The measure of homogeneity depends on the application.
Figure 2.
Illustration of a typical biclustering algorithm.
This problem can also be motivated by a remote source coding setting. Consider a latent random variable that represents a source of information and of which $\mathsf{X}$ and $\mathsf{Y}$ are noisy observations. Two users observe the noisy versions $\mathsf{X}$ and $\mathsf{Y}$ and try to compress the observed data so that their reconstructed versions, $\mathsf{U}$ and $\mathsf{V}$, will be comparable under the maximum mutual information metric. The problem we consider also bears practical applications. Imagine a distributed sensor network whose nodes measure noisy versions of a particular signal but are not allowed to communicate with each other. Each node compresses its received signal. Under the DSIB framework, we can find the optimal compression schemes that preserve the proximity of the reconstructed symbols under the mutual information measure.
Dhillon et al. [7] initiated an information-theoretic approach to biclustering. They regarded the normalized non-negative contingency table as a joint probability distribution matrix of two random variables. Mutual information was proposed as a measure for optimal co-clustering, and an optimization algorithm was presented that intertwines row and column clustering at all stages. Distributed clustering from a proper information-theoretic perspective was first explicitly considered by Pichler et al. [2]. Consider the model illustrated in Figure 3. A bivariate memoryless source with joint law $P_{\mathsf{X}\mathsf{Y}}$ generates n i.i.d. copies of $(\mathsf{X},\mathsf{Y})$. Each of the two component sequences is observed at a different encoder, and each encoder generates a description of its observed sequence. The objective is to construct the two mappings such that the normalized mutual information between the descriptions is maximal while the descriptions obey rate constraints. Single-letter inner and outer bounds for a general source were derived. An example of a doubly symmetric binary source (DSBS) was given, and several converse results were established. Furthermore, connections were made to the standard IB [1] and the multiple description CEO problems [8]. In addition, the equivalence of the information-theoretic biclustering problem to hypothesis testing against independence with multiterminal data compression and to a pattern recognition problem was established in [9,10], respectively.
Figure 3.
Block diagram of the information-theoretic biclustering problem.
The DSIB problem addressed in our paper is, in fact, a single-letter version of the distributed clustering setup [2]. The inner bound in [2] coincides with our problem definition. Moreover, if the Markov condition is imposed on the multi-letter variant, then the two problems are equivalent. A similar setting, but with a maximal-correlation criterion between the reconstructed random variables, was considered in [11,12]. Furthermore, it is sometimes the case that the optimal biclustering problem is more straightforward to solve than its standard, single-sided, clustering counterpart. For example, the Courtade–Kumar conjecture [13] for the standard single-sided clustering setting was ultimately proven for the biclustering setting [14]. A particular case, where the source is a DSBS and the mappings are restricted to be Boolean functions, was addressed in [14]. A bound was established there that is tight if and only if the two mappings are dictator functions.
1.1. Related Work
Our work extends the celebrated standard (single-sided) IB (SSIB) method introduced by Tishby et al. [1]. Indeed, consider the problem illustrated in Figure 4. This single-sided counterpart of our work is essentially a remote source coding problem [15,16,17] with the distortion measure chosen as the logarithmic loss. The random variable $\mathsf{U}$ represents the noisy observation $\mathsf{X}$ of the source $\mathsf{Y}$ using a constrained number of bits, and the goal is to maximize the relevant information that $\mathsf{U}$ carries about $\mathsf{Y}$ (measured by the mutual information between $\mathsf{U}$ and $\mathsf{Y}$). In the standard IB setup, $I(\mathsf{U};\mathsf{X})$ is referred to as the complexity of $\mathsf{U}$, and $I(\mathsf{U};\mathsf{Y})$ as its relevance.
Figure 4.
Block diagram of the Single-Sided Information Bottleneck function.
For the particular case where the alphabets are discrete, an optimal test channel can be found by iteratively solving a set of self-consistent equations. A generalized Blahut–Arimoto algorithm [18,19,20,21] was proposed to solve those equations. The optimal test channel was characterized using a variational principle in [1]. A particular case of deterministic mappings from $\mathsf{X}$ to $\mathsf{U}$ was considered in [22], and algorithms that find those mappings were described.
Several representative scenarios have been considered for the SSIB problem. The setting where the pair $(\mathsf{X},\mathsf{Y})$ is a doubly symmetric binary source (DSBS) with transition probability p was addressed from various perspectives in [17,23,24]. Utilizing Mrs. Gerber's Lemma (MGL) [25], one can show that the optimal test channel for the DSBS setting is a BSC. The case where $(\mathsf{X},\mathsf{Y})$ are jointly Gaussian in the SSIB framework was first considered in [26]. It was shown that the optimal test channel is also jointly Gaussian. The optimality of the Gaussian test channel can be proven using the EPI [27], or by exploiting the I-MMSE relation and the single crossing property [28]. Moreover, the proof can easily be extended to jointly Gaussian random vectors under the I-MMSE framework [29].
In a more general scenario, where the observation is obtained through an additive Gaussian noise channel but the remaining statistics are not constrained to be Gaussian, it was shown that discrete signaling with deterministic quantizers as the test channel sometimes outperforms the Gaussian choice [30]. This intriguing observation leads to a conjecture that discrete inputs are optimal for this general setting and may have a connection to amplitude-constrained AWGN channels, for which it has already been established that discrete input distributions are optimal [31,32,33]. One reason for the optimality of discrete distributions stems from the observation that constraining the compression rate limits the usable input amplitude. However, as far as we know, this remains an open problem.
There are various related problems considered in the literature that are equivalent to the SSIB, namely, they share a similar single-letter optimization problem. In the conditional entropy bound (CEB) function, studied in [17], given a fixed bivariate source $(\mathsf{X},\mathsf{Y})$ and an equality constraint on the conditional entropy of $\mathsf{X}$ given $\mathsf{U}$, the goal is to minimize the conditional entropy of $\mathsf{Y}$ given $\mathsf{U}$ over the set of $P_{\mathsf{U}|\mathsf{X}}$ such that $\mathsf{U}-\mathsf{X}-\mathsf{Y}$ constitute a Markov chain. One can show that the CEB is equivalent to the SSIB. The common reconstruction (CR) setting [34] is a source coding with side-information problem, also known as Wyner–Ziv coding, as depicted in Figure 5, with the additional constraint that the encoder can reconstruct the same sequence as the decoder. The additional assumption of log-loss fidelity results in a single-letter rate-distortion region equivalent to the SSIB. In the problem of information combining (IC) [23,35], motivated by message combining in LDPC decoders, a source of information $\mathsf{X}$ is observed through two test channels. The IC framework aims to design those channels in two extreme ways. For the first, IC asks what those channels should be so as to make the output pair maximally informative regarding $\mathsf{X}$. On the contrary, IC also considers how to design the channels so as to minimize the information the outputs carry about $\mathsf{X}$. The minimization version of IC can be shown to be equivalent to the SSIB. In fact, if the source is a DSBS, then by [23] the optimal channel is a binary symmetric channel (BSC), recovering similar results from (Section IV.A of [17]).
Figure 5.
Block diagram of Source Coding with Side Information.
The IB method has been extended to various network topologies. A multilayer extension of the IB method is depicted in Figure 6. This model was first considered in [36]. A multivariate source generates a sequence of n i.i.d. copies. The receiver has access only to the observed sequence, while the remaining components are hidden. The decoder performs a consecutive L-stage compression of the observed sequence, and the representation at step k must be maximally informative about the respective hidden sequence. This setup is highly motivated by the structure of deep neural networks. Specific results were established for binary and Gaussian sources.
Figure 6.
Block diagram of the Multi-Layer IB.
The model depicted in Figure 7 represents a multiterminal extension of the standard IB. A set of receivers observe noisy versions of some source of information, where the channel outputs are conditionally independent given the source. The receivers are connected to a central processing unit through noiseless but limited-capacity backhaul links. The central processor aims to attain a good prediction of the source based on the compressed representations of the noisy observations obtained from the receivers. The quality of the prediction is measured via the mutual information between the source and its estimate. The distributive IB setting is essentially a CEO source coding problem under the logarithmic loss (log-loss) distortion measure [37]. The jointly Gaussian case was addressed in [20], and a Blahut–Arimoto-type algorithm was proposed. An optimized algorithm to design quantizers was proposed in [38].
Figure 7.
Block diagram of the Distributive IB.
A cooperative multiterminal extension of the IB method was proposed in [39]. Consider n i.i.d. copies of a multivariate source. The two observable component sequences are observed at encoders 1 and 2, respectively. Each encoder sends a representation of its observed sequence through a noiseless yet rate-limited link to the other encoder and to a common decoder. The decoder attempts to reconstruct the latent representation sequence based on the received descriptions. As shown in Figure 8, this setup differs from the CEO setup [40] since the encoders can cooperate during the transmission. The set of all feasible rates of complexity and relevance was characterized, and specific regions for binary and Gaussian sources were established. There are many additional variations of the multi-user IB in the literature [20,26,35,36,37,39,40,41,42,43,44].
Figure 8.
Block diagram of the Collaborative IB.
The IB problem connects to many timely aspects, such as capital investment [43], distributed learning [45], deep learning [46,47,48,49,50,51,52], and convolutional neural networks [53,54]. Moreover, it has been recently shown that the IB method can be used to reduce the data transfer rate and computational complexity in 5G LDPC decoders [55,56]. The IB method has also been connected with constructing good polar codes [57]. Due to the exponential output-alphabet growth of polarized channels, it becomes demanding to compute their capacities to identify the location of “frozen bits". Quantization is employed in order to reduce the computation complexity. The quality of the quantization scheme is assessed via mutual information preservation. It can be shown that the corresponding IB problem upper bounds the quantization technique. Quantization algorithms based upon the IB method were considered in [58,59,60]. Furthermore, a relationship between the KL means algorithm and the IB method has been discovered in [61].
A recent comprehensive tutorial on the IB method and related problems is given in [24]. Applications of IB problem in machine learning are detailed in [26,45,46,47,51,52,62].
1.2. Notations
Throughout the paper, random variables are denoted using a sans-serif font, e.g., $\mathsf{X}$; their realizations are denoted by the respective lower-case letters, e.g., x; and their alphabets are denoted by the respective calligraphic letters, e.g., $\mathcal{X}$. Let $\mathcal{X}^n$ stand for the set of all n-tuples of elements from $\mathcal{X}$. An element from $\mathcal{X}^n$ is denoted by $x^n$, and substrings are denoted by $x_i^j = (x_i, \ldots, x_j)$. The cardinality of a finite set, say $\mathcal{X}$, is denoted by $|\mathcal{X}|$. The probability mass function (pmf) of $\mathsf{X}$, the joint pmf of $\mathsf{X}$ and $\mathsf{Y}$, and the conditional pmf of $\mathsf{X}$ given $\mathsf{Y}$ are denoted by $P_{\mathsf{X}}$, $P_{\mathsf{X}\mathsf{Y}}$, and $P_{\mathsf{X}|\mathsf{Y}}$, respectively. The expectation of $\mathsf{X}$ is denoted by $\mathbb{E}[\mathsf{X}]$. The probability of an event $\mathcal{A}$ is denoted by $\Pr\{\mathcal{A}\}$.
Let $\mathsf{X}$ and $\mathsf{Y}$ be n-ary and m-ary random variables, respectively. The marginal probability vector of $\mathsf{X}$ is denoted by a lowercase boldface letter, i.e., $\mathbf{p} = (P_{\mathsf{X}}(1), \ldots, P_{\mathsf{X}}(n))$.
The probability vector of an n-ary uniform random variable is denoted by $\mathbf{u}_n$. We denote by T the transition matrix from $\mathsf{X}$ to $\mathsf{Y}$, i.e., the matrix whose entries are given by the conditional pmf $P_{\mathsf{Y}|\mathsf{X}}$.
The entropy of an n-ary probability vector $\mathbf{p}$ is given by $H(\mathbf{p}) = \sum_{i=1}^{n} \eta(p_i)$, where $\eta(t) = -t \log t$ (with $\eta(0) = 0$).
Throughout this paper, all logarithms are taken to base 2 unless stated otherwise. We denote the complement with respect to one by a bar, i.e., $\bar{x} = 1 - x$. The binary convolution of $a, b \in [0,1]$ is defined as $a * b = a\bar{b} + \bar{a}b$. The binary entropy function is defined by $h_b(x) = H((x,\bar{x}))$, i.e., $h_b(x) = -x\log x - \bar{x}\log \bar{x}$, and $h_b^{-1}$ denotes its inverse restricted to $[0, \tfrac{1}{2}]$.
Let $\mathsf{X}$ and $\mathsf{Y}$ be a pair of random variables with joint pmf $P_{\mathsf{X}\mathsf{Y}}$ and marginal pmfs $P_{\mathsf{X}}$ and $P_{\mathsf{Y}}$. Furthermore, let T be the transition matrix from $\mathsf{X}$ to $\mathsf{Y}$ (and similarly for the reverse direction). The mutual information between $\mathsf{X}$ and $\mathsf{Y}$ is defined as $I(\mathsf{X};\mathsf{Y}) = \sum_{x,y} P_{\mathsf{X}\mathsf{Y}}(x,y) \log \frac{P_{\mathsf{X}\mathsf{Y}}(x,y)}{P_{\mathsf{X}}(x) P_{\mathsf{Y}}(y)}$.
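Since these quantities recur throughout the paper, a small numerical sketch may help. The snippet below (our own illustration, with function names chosen only for readability) computes the binary entropy, the binary convolution, and the mutual information of a discrete joint pmf, and checks them on a DSBS example.

```python
import numpy as np

def binary_entropy(x):
    """h_b(x) = -x log2 x - (1 - x) log2 (1 - x), with h_b(0) = h_b(1) = 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def binary_convolution(a, b):
    """a * b = a(1 - b) + (1 - a)b."""
    return a * (1 - b) + (1 - a) * b

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Sanity check: for a DSBS with parameter p, I(X;Y) = 1 - h_b(p).
p = 0.1
dsbs = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
assert np.isclose(mutual_information(dsbs), 1 - binary_entropy(p))
```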
1.3. Paper Outline
Section 2 gives a proper definition of the DSIB optimization problem, mentions various results directly related to this work, and provides some general preliminary results. The spotlight of Section 3 is on the binary case, where we derive bounds on the respective DSIB function and show a complete characterization for extreme scenarios. The jointly Gaussian source is considered in Section 4, where an elegant representation of the objective function is presented and a complete characterization in the low-SNR regime is established. A Blahut–Arimoto-type alternating maximization algorithm is presented in Section 5. A representative numerical evaluation of the bounds and of the proposed algorithm is provided in Section 6. Finally, a summary and possible future directions are described in Section 7. The longer proofs are postponed to the appendices.
2. Problem Formulation and Basic Properties
The DSIB function is a multi-terminal extension of the standard IB [1]. First, we briefly recall the latter's definition and give related results that will be utilized for its double-sided counterpart. Then, we provide a proper definition of the DSIB optimization problem and present some general preliminaries.
2.1. The Single-Sided Information Bottleneck (SSIB) Function
Definition 1
(SSIB). Let $(\mathsf{X},\mathsf{Y})$ be a pair of random variables with finite alphabets and fixed joint pmf $P_{\mathsf{X}\mathsf{Y}}$. Denote by $\mathbf{p}$ the marginal probability vector of $\mathsf{X}$, and let T be the transition matrix from $\mathsf{X}$ to $\mathsf{Y}$.
Consider all random variables $\mathsf{U}$ satisfying the Markov chain $\mathsf{U}-\mathsf{X}-\mathsf{Y}$. The SSIB function is defined as:
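In our notation, writing $C$ for the complexity (bottleneck) constraint (the symbol is introduced here for readability), the definition referred to as (6) throughout the paper reads:

$$
\mathrm{IB}_{\mathsf{X};\mathsf{Y}}(C) \;=\; \max_{\substack{P_{\mathsf{U}|\mathsf{X}}:\ \mathsf{U}-\mathsf{X}-\mathsf{Y}\\ I(\mathsf{U};\mathsf{X}) \le C}} I(\mathsf{U};\mathsf{Y}).
$$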
Remark 1.
The SSIB problem defined in (6) is equivalent to (i.e., has the same solution as) the CEB problem considered in [17].
Although the optimization problem in (6) is well defined, the auxiliary random variable may have an unbounded alphabet. The following lemma provides an upper bound on the cardinality of its alphabet, thus restricting the optimization domain.
Lemma 1
(Lemma 2.2 of [17]). The optimization over in (6) can be restricted to .
Remark 2.
A tighter bound, namely , was previously proved in [63] for the corresponding dual problem, namely, the IB Lagrangian. However, since is generally not a strictly convex function of C, it cannot be directly applied for the primal problem (6).
Note that the SSIB optimization problem (6) is basically a convex function maximization over a convex set; thus, the maximum is attained on the boundary of the set.
Lemma 2
(Theorem 2.5 of [17]). The inequality constraint in (6) can be replaced by equality constraint, i.e., .
2.2. The Double-Sided Information Bottleneck (DSIB) Function
Definition 2
(DSIB). Let $(\mathsf{X},\mathsf{Y})$ be a pair of random variables with finite alphabets and fixed joint pmf $P_{\mathsf{X}\mathsf{Y}}$. Consider all random variables $\mathsf{U}$ and $\mathsf{V}$ satisfying the Markov chain $\mathsf{U}-\mathsf{X}-\mathsf{Y}-\mathsf{V}$. The DSIB function is defined as:
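In the same spirit, with $C_{\mathsf{U}}$ and $C_{\mathsf{V}}$ denoting the two bottleneck constraints (the symbols are ours, introduced for readability), the definition referred to as (7) reads:

$$
\mathrm{DSIB}(C_{\mathsf{U}},C_{\mathsf{V}}) \;=\; \max_{\substack{P_{\mathsf{U}|\mathsf{X}},\,P_{\mathsf{V}|\mathsf{Y}}:\ \mathsf{U}-\mathsf{X}-\mathsf{Y}-\mathsf{V}\\ I(\mathsf{U};\mathsf{X}) \le C_{\mathsf{U}},\ I(\mathsf{V};\mathsf{Y}) \le C_{\mathsf{V}}}} I(\mathsf{U};\mathsf{V}).
$$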
The achieving conditional distributions $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ will be termed the optimal test channels. Occasionally, we will drop the subscript denoting the particular choice of the bivariate source.
Note that (7) can be expressed in the following equivalent form:
Evidently, we can define (8) using (6). Indeed, fix $P_{\mathsf{V}|\mathsf{Y}}$ so that it satisfies $I(\mathsf{V};\mathsf{Y}) \le C_{\mathsf{V}}$, and consider the induced transition matrices from $\mathsf{Y}$ to $\mathsf{V}$ and from $\mathsf{X}$ to $\mathsf{V}$, respectively.
Consider the inner maximization term in (8) together with the marginal probability vectors of $\mathsf{X}$ and $\mathsf{V}$. Since $P_{\mathsf{X}\mathsf{Y}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ are fixed, the joint distribution of $(\mathsf{X},\mathsf{V})$ is also fixed. Therefore, the inner maximization term in (8) is just the SSIB function for the pair $(\mathsf{X},\mathsf{V})$ with constraint $C_{\mathsf{U}}$. Hence, our problem can also be interpreted in the following two equivalent ways:
or, similarly, by interchanging the order of maximization in (8), it can be expressed as follows:
where is the transition matrix from to , and is the transition matrix from to . This representation gives us a different perspective on our problem as an optimal compressed representation of the relevance random variable for the IB framework.
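The reduction to an inner SSIB instance can be made concrete. Once $P_{\mathsf{V}|\mathsf{Y}}$ is fixed, the pair $(\mathsf{X},\mathsf{V})$ is a fixed bivariate source and only $P_{\mathsf{U}|\mathsf{X}}$ remains to be optimized. The following sketch (our own illustration; the function and variable names are ours, and it reuses the helper defined in Section 1.2) evaluates the objective $I(\mathsf{U};\mathsf{V})$ and the two constraint terms for arbitrary candidate test channels, which is exactly the quantity the inner SSIB instance maximizes over $P_{\mathsf{U}|\mathsf{X}}$.

```python
import numpy as np
# Reuses mutual_information() from the sketch in Section 1.2.

def dsib_terms(p_xy, u_given_x, v_given_y):
    """Given P_XY (|X| x |Y|) and row-stochastic test channels P_{U|X} (|X| x |U|)
    and P_{V|Y} (|Y| x |V|), return (I(U;V), I(U;X), I(V;Y)) under U - X - Y - V."""
    p_xv = p_xy @ v_given_y                                  # joint pmf of (X, V)
    p_uv = u_given_x.T @ p_xv                                # joint pmf of (U, V)
    p_ux = (u_given_x * p_xy.sum(axis=1, keepdims=True)).T   # joint pmf of (U, X)
    p_vy = (v_given_y * p_xy.sum(axis=0)[:, None]).T         # joint pmf of (V, Y)
    return (mutual_information(p_uv),
            mutual_information(p_ux),
            mutual_information(p_vy))

# Toy example: DSBS(p = 0.1) with BSC test channels of crossover 0.2 on each side.
p, a = 0.1, 0.2
p_xy = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
bsc = np.array([[1 - a, a], [a, 1 - a]])
print(dsib_terms(p_xy, bsc, bsc))
```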
Remark 3.
The bound from Lemma 1 can be utilized to obtain cardinality bounds for the double-sided problem.
Proposition 1.
For the DSIB optimization problem defined in (7), it suffices to consider random variables and with cardinalities and .
Proof.
Let and be two arbitrary transition matrices. By Lemma 1, there exists with such that and . Similarly, can be replaced with , such that , and . Therefore, there exists an optimal solution with and . □
In the following two sections, we will present the primary analytical outcomes of our study. First, we consider the scenario where our bivariate source is binary, specifically DSBS. Then, we handle the case where and are jointly Gaussian.
3. Binary
Let $(\mathsf{X},\mathsf{Y})$ be a DSBS with parameter p, i.e., $\mathsf{X}\sim\mathrm{Bernoulli}(\tfrac12)$ and $\mathsf{Y}=\mathsf{X}\oplus\mathsf{Z}$, where $\mathsf{Z}\sim\mathrm{Bernoulli}(p)$ is independent of $\mathsf{X}$.
We refer to the respective optimization problem (7) as the binary double-sided information bottleneck (BDSIB) and emphasize its dependence on the parameter p.
The following proposition states that the cardinality bound from Lemma 1 can be tightened in the binary case.
Proposition 2.
Considering the optimization problem in (6) with and , binary is optimal.
The proof of this proposition is postponed to Appendix A. Using a similar justification as for Proposition 1, combined with Proposition 2, we have the following tight cardinality result for the BDSIB setting.
Proposition 3.
For the respective DSBS setting of (7), it suffices to consider random variables and with cardinalities .
Note that the above statement is not required for the results in the rest of this section to hold and will be mainly applied to justify our conjectures via numerical simulations.
We next show that the specific objective function for the binary setting of (7), i.e., the mutual information between $\mathsf{U}$ and $\mathsf{V}$, has an elegant representation which will be useful in deriving lower and upper bounds.
Lemma 3.
The mutual information between and can be expressed as follows:
where the expectation is taken over the product measure , and are binary random variables satisfying:
the kernel is given by:
and the reverse test-channels are defined by: , . Furthermore, since , utilizing Taylor’s expansion of , we obtain:
The general cascade of test-channels and the DSBS, defined by , and p, is illustrated in Figure 9. The proof of Lemma 3 is postponed to Appendix B.
Figure 9.
General test-channel construction of the BDSIB function.
We next examine some corner cases for which the BDSIB function is fully characterized.
3.1. Special Cases
A particular case where we have a complete analytical solution is when p tends to 1/2.
Theorem 1.
Suppose , and consider . Then
and it is achieved by taking $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ to be BSC test channels satisfying the constraints with equality.
This theorem follows by considering the low-SNR regime in Lemma 3 and is proved in Appendix D. For the lower bound, we take $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ to be BSCs.
In Section 6, we give numerical evidence that BSC test channels are in fact optimal provided that p is sufficiently large. However, for small p this is no longer the case, and we believe the following holds.
Conjecture 1.
Let $p = 0$, i.e., $\mathsf{X} = \mathsf{Y}$. The optimal test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ that achieve the BDSIB function are a Z-channel and an S-channel, respectively.
Remark 4.
Our results in the numerical section strongly support this conjecture. In fact, they prove it within the resolution of the experiments, i.e., for optimization over a dense set of test channels rather than all test channels. Nevertheless, we were not able to find an analytical proof of this result.
Remark 5.
Suppose $p=0$, $I(\mathsf{U};\mathsf{X}) = C_{\mathsf{U}}$, and $I(\mathsf{V};\mathsf{X}) = C_{\mathsf{V}}$. Since $I(\mathsf{U};\mathsf{V}) = C_{\mathsf{U}} + C_{\mathsf{V}} - I(\mathsf{X};\mathsf{U},\mathsf{V})$ (as $\mathsf{U}-\mathsf{X}-\mathsf{V}$ form a Markov chain in this order), maximizing $I(\mathsf{U};\mathsf{V})$ is equivalent to minimizing $I(\mathsf{X};\mathsf{U},\mathsf{V})$, namely, minimizing information combining as in [23,35]. Therefore, Conjecture 1 is equivalent to the conjecture that among all channels with $I(\mathsf{U};\mathsf{X}) = C_{\mathsf{U}}$ and $I(\mathsf{V};\mathsf{X}) = C_{\mathsf{V}}$, the Z- and S-channels are the worst channels for information combining.
This observation leads us the following additional conjecture.
Conjecture 2.
The test-channels and that maximize are both Z channels.
Remark 6.
Suppose now that p is arbitrary and assume that one of the channels $P_{\mathsf{U}|\mathsf{X}}$ or $P_{\mathsf{V}|\mathsf{Y}}$ is restricted to be a binary memoryless symmetric (BMS) channel (Chapter 4 of [64]); then the maximum is attained by BSC channels, as those are the symmetric channels that minimize information combining [23]. It is not surprising that once the BMS constraint is removed, symmetric channels are no longer optimal (see the discussion in Section VI.C of [23]).
Consider now the case $p = 0$ (i.e., $\mathsf{X} = \mathsf{Y}$) with an additional symmetry assumption on the bottleneck constraints. The most reasonable a priori guess is that the optimal test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ are the same up to some permutation of inputs and outputs. Surprisingly, this is not the case, unless they are BSC or Z-channels, as the following negative result states.
Proposition 4.
Suppose and the transition matrix from to , given by
satisfies . Consider the respective SSIB optimization problem
The optimal test channel that attains (6) does not equal the given channel or any permutation of it, unless the given channel is a BSC or a Z-channel.
The proof is based on [17] and is postponed to Appendix E.
As for intermediate values of p, we have the following conjecture.
Conjecture 3.
For every pair of constraints, there exists a threshold $p_0$ such that for every $p \ge p_0$ the achieving test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ are BSCs with parameters that meet the respective constraints.
We will provide supporting arguments for this conjecture via numerical simulations in Section 6.
3.2. Bounds
In this section, we present our lower and upper bounds on the BDSIB function, and then compare them for various channel parameters. The proofs are postponed to Appendix F. For simplicity of the following presentation, we define
denote as its inverse restricted to , and .
Proposition 5.
The BDSIB function is bounded from below by
where , , , and .
All terms on the RHS of (20) are attained by taking test channels that match the constraints with equality and plugging them into Lemma 3. In particular, the first term is achieved by BSC test channels, and the second term is achieved by taking $P_{\mathsf{U}|\mathsf{X}}$ to be a Z-channel and $P_{\mathsf{V}|\mathsf{Y}}$ to be an S-channel. The aforementioned test-channel configurations are illustrated in Figure 10.
Figure 10.
Test-channel configurations that achieve the lower bound of Proposition 5.
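The two configurations of Figure 10 can be compared numerically. The following rough sketch (our own; it builds on the helper functions from the earlier sketches and finds the channel parameters by bisection so that the constraints are met with approximate equality) evaluates $I(\mathsf{U};\mathsf{V})$ for a BSC pair and for a Z/S pair over a range of DSBS parameters p.

```python
import numpy as np
# Reuses binary_entropy(), mutual_information() and dsib_terms() from the earlier sketches.

def bisect(f, lo, hi, iters=60):
    """Root of a monotone function f with a sign change on [lo, hi], by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def bsc_for_constraint(c):
    """BSC P_{U|X} with I(U;X) = c bits for a uniform binary input."""
    a = bisect(lambda a: 1 - binary_entropy(a) - c, 0.0, 0.5)
    return np.array([[1 - a, a], [a, 1 - a]])

def z_for_constraint(c):
    """Z-channel P_{U|X} with I(U;X) = c bits for a uniform binary input."""
    mi = lambda z: binary_entropy((1 - z) / 2) - 0.5 * binary_entropy(z) - c
    z = bisect(mi, 0.0, 1.0)
    return np.array([[1.0, 0.0], [z, 1 - z]])

c_u = c_v = 0.5                                  # example bottleneck constraints (bits)
for p in (0.0, 0.01, 0.05, 0.1, 0.3):
    p_xy = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
    bsc_rate = dsib_terms(p_xy, bsc_for_constraint(c_u), bsc_for_constraint(c_v))[0]
    z = z_for_constraint(c_u)
    s = z[::-1, ::-1]                            # S-channel: the Z-channel with inputs and outputs flipped
    zs_rate = dsib_terms(p_xy, z, s)[0]
    print(f"p={p:.2f}  BSC pair: {bsc_rate:.4f}   Z/S pair: {zs_rate:.4f}")
```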
We compare the different lower bounds derived in Proposition 5 for various values of the constraints. The achievable rate vs. the channel transition probability p is shown in Figure 11. Our first observation is that BSC test channels outperform all other choices for almost all values of p. However, Figure 12 gives a closer look at small values of p. It is evident that the combination of Z and S test channels outperforms any other scheme for small values of p. We use this observation as one piece of supporting evidence for Conjecture 1.
Figure 11.
Comparison of the lower bounds.
Figure 12.
Comparison of the lower bounds in high SNR regime.
We proceed to give an upper bound.
Proposition 6.
A general upper bound on BDSIB is given by
Note that the first term can be derived by applying Jensen’s inequality on (12), and the second term is a combination of the standard IB and the cut-set bound. We postpone the proof of Proposition 6 to Appendix F.
Remark 7.
We have a factor 2 loss in the first term compared to the precise behavior we found in Theorem 1. This loss comes from the fact that the bound in (21) actually upper bounds the χ-squared mutual information between $\mathsf{U}$ and $\mathsf{V}$; it is well known that for very weak dependence the χ-squared information and the mutual information differ by essentially such a constant factor; see [65].
We compare the different upper bounds from Proposition 6 in Figure 13 for various bottleneck constraints, and in Figure 14 for various values of channel transition probabilities p. We observe that there are regions of C and p for which Jensen’s based bound outperforms the standard IB bound.
Figure 13.
Comparison of the upper bounds for various values of .
Figure 14.
Comparison of the upper bounds for various values of p.
Finally, we compare the best lower and upper bounds from Propositions 5 and 6 for various values of channel parameters in Figure 15. We observe that the bounds are tighter for asymmetric constraints and high transition probabilities.
Figure 15.
Capacity bounds for various values of p and .
4. Gaussian
In this section, we consider the specific setting where $(\mathsf{X},\mathsf{Y})$ is a normalized zero-mean bivariate Gaussian source, namely, $\mathsf{X}$ and $\mathsf{Y}$ are standard Gaussian random variables with correlation coefficient $\mathbb{E}[\mathsf{X}\mathsf{Y}] = \rho$.
We establish achievability schemes and show that Gaussian test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ are optimal for vanishing SNR. Furthermore, we show an elegant representation of the problem through the probabilistic Hermite polynomials, which are defined by $\mathrm{He}_n(x) = (-1)^n e^{x^2/2}\,\frac{\mathrm{d}^n}{\mathrm{d}x^n} e^{-x^2/2}$.
We denote the Gaussian DSIB function with explicit dependency on the correlation coefficient ρ.
Proposition 7.
Let $\mathrm{He}_n$ be the nth-order probabilistic Hermite polynomial; then the objective function of (7) for the Gaussian setting is given by
This representation follows by expressing the joint density of $(\mathsf{X},\mathsf{Y})$ using the Mehler kernel [66]. The Mehler kernel decomposition is a special case of a much richer family of Lancaster distributions [67]. The proof of Proposition 7 is relegated to Appendix G.
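The Mehler kernel decomposition can be verified numerically. The sketch below (ours; it uses NumPy's probabilists' Hermite module) compares the truncated expansion $\varphi(x)\varphi(y)\sum_{n\ge 0}\frac{\rho^n}{n!}\mathrm{He}_n(x)\mathrm{He}_n(y)$ with the closed-form standard bivariate normal density.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def he(n, x):
    """Probabilists' Hermite polynomial He_n evaluated at x."""
    return hermeval(x, [0.0] * n + [1.0])

def mehler_density(x, y, rho, terms=40):
    """Truncated Mehler expansion of the standard bivariate normal density."""
    phi = lambda t: np.exp(-t * t / 2) / np.sqrt(2 * np.pi)
    s = sum(rho ** n / factorial(n) * he(n, x) * he(n, y) for n in range(terms))
    return phi(x) * phi(y) * s

def gaussian_density(x, y, rho):
    """Closed-form standard bivariate normal density with correlation rho."""
    z = (x * x - 2 * rho * x * y + y * y) / (1 - rho ** 2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho ** 2))

# Spot check of the expansion at a few points for rho = 0.6.
for (x, y) in [(0.0, 0.0), (0.5, -1.0), (1.2, 0.7)]:
    print(mehler_density(x, y, 0.6), gaussian_density(x, y, 0.6))
```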
We now give two lower bounds on the Gaussian DSIB function. Our first lower bound is established by choosing $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ to be additive Gaussian channels satisfying the mutual information constraints with equality.
Proposition 8.
A lower bound on is given by
The proof of this bound is developed in Appendix H.
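While the closed form is derived in Appendix H, the underlying achievability is easy to sketch: take $\mathsf{U}=\mathsf{X}+\mathsf{N}_1$ and $\mathsf{V}=\mathsf{Y}+\mathsf{N}_2$ with independent Gaussian noises whose variances are tuned so that the rate constraints hold with equality; $I(\mathsf{U};\mathsf{V})$ then follows from the correlation coefficient of the jointly Gaussian pair $(\mathsf{U},\mathsf{V})$. The snippet below (our own illustration, not taken from the paper) evaluates this achievable rate in bits.

```python
import numpy as np

def gaussian_dsib_lower_bound(rho, c_u, c_v):
    """I(U;V) in bits for U = X + N1, V = Y + N2 with independent Gaussian noises
    whose variances are chosen so that I(U;X) = c_u and I(V;Y) = c_v."""
    # I(X + N; X) = 0.5 * log2(1 + 1/var(N)) = c  implies  1/(1 + var(N)) = 1 - 2^(-2c)
    shrink_u = 1 - 2.0 ** (-2 * c_u)
    shrink_v = 1 - 2.0 ** (-2 * c_v)
    corr_uv_sq = rho ** 2 * shrink_u * shrink_v   # squared correlation coefficient of (U, V)
    return -0.5 * np.log2(1 - corr_uv_sq)         # MI of a jointly Gaussian pair

print(gaussian_dsib_lower_bound(rho=0.8, c_u=1.0, c_v=2.0))
```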
Although it was shown in [26] that choosing the test channel to be Gaussian is optimal for the single-sided variant, this is not the case for its double-sided extension. We show this by examining a specific set of values for the rate constraints and by choosing the test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ to be deterministic quantizers.
Proposition 9.
Let , then
The proof of this bound is developed in Appendix I.
We compare the bounds from Propositions 8 and 9 in Figure 16. The most unexpected observation here is that the deterministic-quantizer lower bound outperforms the Gaussian test channels for high values of ρ. The crossing point of those bounds is given by
Figure 16.
Comparison of the lower bounds from Propositions 8 and 9.
We proceed to present our upper bound on the Gaussian DSIB function. This bound is a combination of the cut-set bound and the single-sided Gaussian IB.
We compare the best lower and upper bounds from Propositions 8–10 in Figure 17. We observe that the bounds become tighter as the constraint increases and in the low-SNR regime.
Figure 17.
Capacity bounds for various values of ρ and the constraints.
4.1. Low-SNR Regime
The exact asymptotic behavior of the Gaussian (Proposition 8) and deterministic (Proposition 9) test channels, respectively, for vanishing ρ is given by:
Hence, the Gaussian choice outperforms the second lower bound for vanishing SNR. The following theorem establishes that Gaussian test-channels are optimal for low-SNR.
Theorem 2.
For small ρ, the GDSIB function is given by:
The lower bound follows from Proposition 8. The upper bound is established by considering the kernel representation from Proposition 7 in the limit of vanishing . The detailed proof is given in Appendix J.
4.2. Optimality of Symbol-by-Symbol Quantization When
Consider an extreme scenario for which . Taking the encoders and as a symbol-by-symbol deterministic quantizers satisfying:
we achieve the optimum
5. Alternating Maximization Algorithm
Consider the DSIB problem for a DSBS with parameter p, analyzed in Section 3. The respective optimization problem involves a simultaneous search for the maximum over the sets of test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$. An alternating maximization, namely, fixing $P_{\mathsf{V}|\mathsf{Y}}$, then finding the respective optimal $P_{\mathsf{U}|\mathsf{X}}$ and vice versa, is sub-optimal in general and may result in convergence to a saddle point. However, for the case with symmetric bottleneck constraints, Proposition 4 implies that such a point exists only for the BSC and Z/S channel configurations. This motivates us to believe that performing an alternating maximization procedure on (9) will not result in a sub-optimal saddle point, but rather converge to the optimal solution also for a general discrete source.
Thus, we propose an alternating maximization algorithm. The main idea is to fix $P_{\mathsf{V}|\mathsf{Y}}$ and then compute the $P_{\mathsf{U}|\mathsf{X}}$ that attains the inner term in (9). Then, using the obtained $P_{\mathsf{U}|\mathsf{X}}$, we find the optimal $P_{\mathsf{V}|\mathsf{Y}}$ that attains the inner term in (10). We repeat the procedure in an alternating manner until convergence.
Note that the inner terms of (9) and (10) are just the standard IB problem defined in (6). For completeness, we state here the main result from [1] and adjust it to our problem. Consider the respective Lagrangian of (6), given by:
Lemma 4
(Theorem 4 of [1]). The optimal test-channel that maximizes (30) satisfies the equation:
where and is given via Bayes’ rule, as follows:
In a very similar manner to the Blahut–Arimoto algorithm [18], the self-consistent equations can be adapted into converging, alternating iterations over the respective convex sets of distributions, as stated in the following lemma.
Lemma 5
(Theorem 5 of [1]). The self-consistent equations are satisfied simultaneously at the minima of the functional:
where the minimization is performed independently over the three respective convex sets of distributions. The minimization is carried out by the converging alternating iterations described in Algorithm 1.
Algorithm 1: IB iterative algorithm IBAM(args).
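As an illustration of how Lemma 5 translates into an implementable routine, the following Python sketch (ours, not the authors' code) performs the alternating updates over $P_{\mathsf{U}|\mathsf{X}}$, $P_{\mathsf{U}}$, and $P_{\mathsf{Y}|\mathsf{U}}$ for a fixed multiplier β; the bisection over β needed to meet a prescribed complexity constraint (cf. Remark 8) is omitted for brevity.

```python
import numpy as np

def ib_iterations(p_xy, n_u, beta, iters=300, seed=0):
    """Self-consistent IB iterations for a fixed multiplier beta (a sketch of Lemma 5).
    p_xy is the joint pmf of (the observed variable X, the relevance variable Y);
    returns the test channel P_{U|X} as an |X| x |U| row-stochastic matrix."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    y_given_x = p_xy / p_x[:, None]                     # rows: P_{Y|X=x}
    u_given_x = rng.dirichlet(np.ones(n_u), size=p_xy.shape[0])   # random initialization
    for _ in range(iters):
        p_u = np.maximum(p_x @ u_given_x, 1e-12)        # marginal of U
        y_given_u = (p_xy.T @ u_given_x / p_u).T        # rows: P_{Y|U=u}
        # KL divergence D(P_{Y|X=x} || P_{Y|U=u}) for every pair (x, u)
        ratio = np.maximum(y_given_x[:, None, :], 1e-300) / np.maximum(y_given_u[None, :, :], 1e-300)
        kl = np.sum(y_given_x[:, None, :] * np.log2(ratio), axis=2)
        u_given_x = p_u[None, :] * np.exp2(-beta * kl)  # update of P_{U|X}, up to normalization
        u_given_x /= u_given_x.sum(axis=1, keepdims=True)
    return u_given_x

# Example: single-sided IB for a DSBS(p = 0.1) source with a binary bottleneck variable.
p = 0.1
p_xy = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
print(ib_iterations(p_xy, n_u=2, beta=5.0))
```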
Next, we propose a combined algorithm to solve the optimization problem in (7). The main idea is to fix one of the test channels, say $P_{\mathsf{V}|\mathsf{Y}}$, and then find the corresponding optimal opposite test channel, $P_{\mathsf{U}|\mathsf{X}}$, using Algorithm 1. Then, we apply Algorithm 1 again with the roles switched, i.e., fixing $P_{\mathsf{U}|\mathsf{X}}$ and identifying the optimal $P_{\mathsf{V}|\mathsf{Y}}$. We repeat this procedure until the objective function $I(\mathsf{U};\mathsf{V})$ converges. We summarize the proposed composite method in Algorithm 2.
Remark 8.
Note that every alternating step of the algorithm involves finding an optimal multiplier that corresponds to the respective problem constraint. We have chosen to implement this exploration step using a bisection-type method. This may result in the actual constraint pair being ε-far from the desired values.
Algorithm 2: DSIB iterative algorithm DSIBAM(args).
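A rough outline of the composite procedure, built on the single-sided routine sketched above (again our own illustration, with a fixed multiplier β in place of the bisection of Remark 8):

```python
import numpy as np
# Reuses ib_iterations() and dsib_terms() from the earlier sketches.

def dsib_alternating(p_xy, n_u, n_v, beta, outer_iters=20):
    """Alternate between optimizing P_{U|X} for a fixed P_{V|Y} and vice versa.
    Each inner step is a single-sided IB on the induced pair ((X, V) or (Y, U))."""
    rng = np.random.default_rng(1)
    v_given_y = rng.dirichlet(np.ones(n_v), size=p_xy.shape[1])   # random initial P_{V|Y}
    for _ in range(outer_iters):
        p_xv = p_xy @ v_given_y                  # induced bivariate source (X, V)
        u_given_x = ib_iterations(p_xv, n_u, beta)
        p_yu = p_xy.T @ u_given_x                # induced bivariate source (Y, U)
        v_given_y = ib_iterations(p_yu, n_v, beta)
    return u_given_x, v_given_y

p = 0.1
p_xy = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
u_ch, v_ch = dsib_alternating(p_xy, n_u=2, n_v=2, beta=5.0)
print(dsib_terms(p_xy, u_ch, v_ch))              # (I(U;V), I(U;X), I(V;Y))
```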
6. Numerical Results
In this section, we focus on the DSBS setting of Section 3. In the first part, we show, using a brute-force method, the existence of a sharp phase-transition phenomenon in the optimal test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ vs. the DSBS parameter p. In the second part, we evaluate the alternating maximization algorithm proposed in Section 5 and compare its performance to the brute-force method.
6.1. Exhaustive Search
In this set of simulations, we fix the transition matrix T from $\mathsf{X}$ to $\mathsf{U}$, parameterized by a scalar a and chosen such that the complexity constraint is met. This choice defines a path in the parameter plane. Then, for every such T, we optimize the opposite test channel for different values of the DSBS parameter p. The results, plotted vs. a for different values of p, are shown in Figure 18. Note that the range of a corresponds to the continuous transition from a Z-channel to a BSC. We observe here a very sharp transition from the optimality of the Z/S configuration to the BSC configuration for a small change in p. This kind of behavior continues to hold for different choices of the constraints, as can be seen in Figure 19.
Figure 18.
Maximal for fixed values and different values of p.
Figure 19.
Maximal for fixed values and different values of p.
Next, we would like to emphasize this sharp phase-transition phenomenon by plotting the optimal a that achieves the maximal objective vs. the DSBS parameter p. The results for various combinations of the constraints are presented in Figure 20 and Figure 21. We observe that the curves are convex below the transition point and constant above it. Furthermore, the derivative at the transition point tends to ∞.
Figure 20.
Optimal value of a for various values of and .
Figure 21.
Optimal value of a for various values of and .
One may further claim that there is no sharp transition to BSC test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ as p grows away from zero, but rather that the optimal channels only approach a BSC. To convince the reader that the optimal test channels are exactly BSCs, we performed an alternating maximization experiment. We fixed p and the constraints, chose $P_{\mathsf{U}|\mathsf{X}}$ to be an almost-BSC channel satisfying the complexity constraint, and found the channel $P_{\mathsf{V}|\mathsf{Y}}$ that maximizes $I(\mathsf{U};\mathsf{V})$ subject to the relevance constraint. Then, we fixed the obtained channel $P_{\mathsf{V}|\mathsf{Y}}$ and found the $P_{\mathsf{U}|\mathsf{X}}$ that maximizes $I(\mathsf{U};\mathsf{V})$ subject to its constraint. We repeated this alternating maximization procedure until convergence. The transition matrices were parameterized by their two crossover probabilities.
The results for different values of p and the constraints are shown in Figure 22, Figure 23 and Figure 24. We observe that $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ rapidly converge to their respective BSC values satisfying the mutual information constraints. Note that the last procedure is still an exhaustive search, but it is performed in an alternating fashion between the two sets of test channels.
Figure 22.
Alternating maximization with exhaustive search for various p, , .
Figure 23.
Alternating maximization with exhaustive search for various p, , .
Figure 24.
Alternating maximization with exhaustive search for various p, , .
6.2. Alternating Maximization
In this section, we will evaluate the algorithm proposed in Section 5. We focus on the DSBS setting of Section 3 with various values of problem parameters.
First, we explore the convergence behavior of the proposed algorithm. Figure 25 shows the objective function at every iteration step for representative fixed channel transition parameters p and bottleneck constraints. We observe a slow convergence to the final value when the constraints and the transition probability are small, but once they are increased, the algorithm converges much more rapidly. The non-monotonic behavior in some regimes is explained by Remark 8. In Figure 26, we show the respective test-channel crossover probabilities. First, note that if the two crossover probabilities of $P_{\mathsf{U}|\mathsf{X}}$ are equal, then it is a BSC, and similarly for $P_{\mathsf{V}|\mathsf{Y}}$. Second, if one of the crossover probabilities of $P_{\mathsf{U}|\mathsf{X}}$ is zero, then it is a Z-channel, and the corresponding degenerate $P_{\mathsf{V}|\mathsf{Y}}$ is an S-channel. We observe that for the smallest value of p considered, the test channels $P_{\mathsf{U}|\mathsf{X}}$ and $P_{\mathsf{V}|\mathsf{Y}}$ converge to Z- and S-channels, respectively, whereas for all other settings, the test channels converge to BSCs.
Figure 25.
Convergence of for various values of p, and .
Figure 26.
Convergence of p with: (a) , ; (b) , ; (c) , ; (d) , .
Finally, we compare the outcome of Algorithm 2 to the optimal solution obtained by the brute-force method, namely, evaluating (12) for every pair of test channels that satisfies the problem constraints. The results for various values of the channel parameters are shown in Figure 27. We observe that the proposed algorithm achieves the optimum for any DSBS parameter p and the representative constraints considered.
Figure 27.
Comparison of the proposed alternating maximization algorithm and the brute-force search method for various problem parameters.
7. Concluding Remarks
In this paper, we have considered the Double-Sided Information Bottleneck problem. Cardinality bounds on the representation alphabets were obtained for an arbitrary discrete bivariate source. When $\mathsf{X}$ and $\mathsf{Y}$ are binary, we have shown that taking binary auxiliary random variables is optimal. For the DSBS, we have shown that BSC test channels are optimal when p tends to 1/2. Furthermore, numerical simulations for arbitrary p indicate that Z- and S-channels are optimal for sufficiently small p. As for the Gaussian bivariate source, a representation utilizing Hermite polynomials was given. In addition, the optimality of Gaussian test channels was demonstrated for vanishing SNR. Moreover, we have constructed a lower bound attained by deterministic quantizers that outperforms the jointly Gaussian choice at high SNR. Note that the solution of the n-letter problem does not tensorize in general; in some regimes the cut-set bound can easily be achieved, and if time-sharing is allowed, the results change drastically.
Finally, we have proposed an alternating maximization algorithm based on the standard IB [1]. For the DSBS, it was shown that the algorithm converges to the global optimal solution.
Author Contributions
Conceptualization, O.O. and S.S.; methodology, M.D., O.O., and S.S.; software, M.D.; formal analysis, M.D.; writing—original draft preparation, M.D.; writing—review and editing, M.D.; supervision, S.S. All authors have read and agreed to the published version of the manuscript.
Funding
The work has been supported by the European Union’s Horizon 2020 Research And Innovation Programme, grant agreement no. 694630, by the ISF under Grant 1641/21, and by the WIN consortium via the Israel minister of economy and science.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proof of Proposition 2
Before proceeding to prove Proposition 2, we need the following auxiliary results.
Lemma A1.
Let be an arbitrary binary-input, ternary-output channel, parameterized using the following transition matrix:
Consider the function defined on . This function has the following properties:
- 1.
- If is linear on a sub-interval of , then it is linear for every .
- 2.
- Otherwise, it is strictly convex over or there are points and such that where
We postpone the proof of this lemma to Appendix K.
Lemma A2.
The convex envelope of at any point can be obtained as a convex combination of only points in and .
We postpone the proof of this lemma to Appendix L and proceed to prove Proposition 2. Note that if the function is strictly convex over its domain, then by the paragraph following (Theorem 2.3 of [17]), we are done.
From now on, we consider the case where is not strictly convex. Then, there is an interval and such that
Let and represent the columns of T corresponding to and , respectively. Moreover, let and be the probability vector of an arbitrary binary random variable, where .
Assume and . Then, there must be and such that
Lemma A3.
The set must contain at least three distinct points.
We postpone the proof of this lemma to Appendix M.
Consider the function defined on . We have that
In addition, if we define to be the lower convex envelope of , then . Thus, the lower convex envelope of at q is attained by two linear combinations.
By Lemma A3, the set must contain at least three distinct points, say . Due to Lemma A2, they are all in . Furthermore, by the pigeonhole principle, we must have that one of the intervals contains at least two points. Assume WLOG that . For any , let and consider the following set of weights/probabilities:
Note that
and
but since
thus, it attains a smaller value than a, provided that is strictly convex on . This contradicts our assumption that the convex envelope at q equals a, and thus must contain a linear segment in .
By Lemma A1, this can happen only if p is linear for every . In particular:
Note that for any choice of
Taking the expectation we obtain:
This implies that
and this is attained by any choice of satisfying . In particular the choice , where is statistically independent of and is chosen such that , attains . Thus, suffices even if is not strictly convex.
Appendix B. Proof of Lemma 3
Let and be the test-channels from to and from to , respectively. The joint probability function of and can be expressed via Bayes’ rule and the Markov chain condition as:
Since , we define as the ratio between the joint distribution of and relative to the respective product measure. Note that:
Denoting and , we obtain:
The last expression can also be represented as follows:
Thus,
Furthermore, note that since , we can utilize Taylor’s expansion of to obtain:
and
Therefore:
where follows since . This completes the proof.
Appendix C. Auxiliary Concavity Lemma
As a preliminary step to proving Theorem 1, we will need the following auxiliary lemma.
Lemma A4.
The function is concave.
Proof.
Denoting , we have . Since is twice differentiable, it is sufficient to show that is decreasing. The first derivative is given by:
Since
utilizing the inverse function derivative property, we obtain:
In addition, the second order derivative is given by:
Define
Note that . Since is increasing, in order to show that decreasing, it suffices to show that decreasing. The first order derivative of is given by:
Define such that . Note that . We obtain:
Now, making use of the expansion , we have:
Thus,
where follows since . Thus, and is concave. □
Appendix D. Proof of Theorem 1
Plugging () in (14), we obtain:
Now, we rewrite with explicit dependency on as:
We would like to expand with Taylor series around . Note that . Furthermore, the second derivative is given by:
Hence,
Now, note that
with similar relation for . Therefore,
where the first inequality follows since the function is concave by Lemma A4 and applying Jensen’s inequality, and the second inequality follows from rate constraints.
Appendix E. Proof of Proposition 4
Suppose that the optimal test-channel is given by the following transition matrix:
Assume in contradiction that the opposite optimal test-channel is symmetric to and is given by:
Applying Bayes’ rule on (A53), we obtain:
Appendix F. Proof of Proposition 6
By Lemma 3, the objective function of (7) for a DSBS setting, denoted here by , is given by:
where can be expressed as:
Since , we have the following upper bound on :
where the inequality in follows from Lemma A4 and inequality in follows from the problem constraints.
Appendix G. Proof of Proposition 7
We assume and are continuous RVs. The proof for the discrete case is identical. The joint density can be expressed with explicit dependency on as follows:
where [66]. Similarly, can also be written with explicit dependency on
Appendix H. Proof of Proposition 8
Let be jointly Gaussian Random variables, such that
where , , , . Due to Proposition 7, the mutual information for jointly Gaussian is given by
where and follow from the properties of Mehler Kernel [66].
By the Mutual Information constraints we have:
Hence,
Appendix I. Proof of Proposition 9
We choose and to be deterministic functions of and , respectively, i.e., and . In such case, the rate constraints are met with equality, namely, . We proceed to evaluate the achievable rate:
where equality in follows since by symmetry. We therefore obtain the following formula for the “error probability”:
where the equality also follows from symmetry. Utilizing Sheppard's formula (Chapter 5, p. 107 of [68]), we obtain the desired expression. This completes the proof of the proposition.
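Sheppard's formula gives $\Pr\{\operatorname{sign}(\mathsf{X})\neq\operatorname{sign}(\mathsf{Y})\} = \tfrac{1}{\pi}\arccos\rho$ for a standard bivariate normal pair with correlation ρ. The short self-contained sketch below (ours) checks this against a Monte Carlo estimate and evaluates the rate $1 - h_b(\cdot)$ achieved by one-bit sign quantizers.

```python
import numpy as np

def binary_entropy(x):
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def sheppard_error_prob(rho):
    """Pr{sign(X) != sign(Y)} for a standard bivariate normal pair with correlation rho."""
    return np.arccos(rho) / np.pi

rho = 0.8
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(1_000_000)
mc_estimate = np.mean(np.sign(x) != np.sign(y))   # Monte Carlo estimate
q = sheppard_error_prob(rho)
print(q, mc_estimate)                             # both close to arccos(0.8)/pi ~ 0.205
print("rate of one-bit sign quantizers:", 1 - binary_entropy(q))
```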
Appendix J. Proof of Theorem 2
We would like to approximate the objective in the limit of small ρ using a Taylor series up to second order in ρ. As a first step, we evaluate its first two derivatives at the origin. Note that and
Thus, ,
and
Expanding in Taylor series around gives us and
Thus,
Note that and
In addition, by (Corollary to Theorem 8.6.6 of [69]),
Moreover, from MI constraint, we have
and therefore . Thus, we obtain:
In a very similar method, one can show that .
Thus, for
Appendix K. Proof of Lemma A1
The function is a twice differentiable continuous function with respective second derivative given by
The former can also be written as a proper rational function [70], i.e., , where
and
Note that equals for and hence is positive for this set of points.
- Suppose the function is linear over some interval. In such a case, its second derivative must be zero over this interval, which implies that the numerator polynomial is zero over this interval. Since it is a degree-3 polynomial, it can be zero over some interval if and only if it is zero everywhere. Thus, if the function is linear over some interval, then it is linear over the entire domain.
- For , and is a degree 3 polynomial in p. Since and , this polynomial has no sign changes or has exactly two sign changes in . Therefore, either is convex or there are two points and , , such that is convex in and concave in .
Appendix L. Proof of Lemma A2
Let and assume in contradiction that attains the lower convex envelope at point q, and that . By assumption, we have that
We can write for some and still
However, due to concavity of in , we must have
This implies that there is a linear combination of points from that attains a lower value than , contradicting the assumption that is the lower convex envelope at point q. Since was arbitrary, the lemma holds.
Appendix M. Proof of Lemma A3
Assume in contradiction that there are no distinct points, i.e., it has , then and , which contradicts the initial assumption that . Assume WLOG that but . Since implies , then must be q as well, in contradiction to the initial assumption.
Consider the following cases:
- , : This implies . Furthermore, , which holds only if , in contradiction to our initial assumption.
- , : This implies , which holds only if . In such a case, , in contradiction to the assumption .
Thus, the lemma holds.
References
- Tishby, N.; Pereira, F.C.N.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
- Pichler, G.; Piantanida, P.; Matz, G. Distributed information-theoretic clustering. Inf. Inference J. IMA 2021, 11, 137–166.
- Jain, A.K.; Dubes, R.C. Algorithms for Clustering Data; Prentice-Hall: Hoboken, NJ, USA, 1988.
- Gupta, N.; Aggarwal, S. Modeling Biclustering as an optimization problem using Mutual Information. In Proceedings of the International Conference on Methods and Models in Computer Science (ICM2CS), Delhi, India, 14–15 December 2009; pp. 1–5.
- Hartigan, J. Direct Clustering of a Data Matrix. J. Am. Stat. Assoc. 1972, 67, 123–129.
- Madeira, S.; Oliveira, A. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004, 1, 24–45.
- Dhillon, I.S.; Mallela, S.; Modha, D.S. Information-Theoretic Co-Clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’03), Washington, DC, USA, 24–27 August 2003; pp. 89–98.
- Courtade, T.A.; Kumar, G.R. Which Boolean Functions Maximize Mutual Information on Noisy Inputs? IEEE Trans. Inf. Theory 2014, 60, 4515–4525.
- Han, T.S. Hypothesis Testing with Multiterminal Data Compression. IEEE Trans. Inf. Theory 1987, 33, 759–772.
- Westover, M.B.; O’Sullivan, J.A. Achievable Rates for Pattern Recognition. IEEE Trans. Inf. Theory 2008, 54, 299–320.
- Painsky, A.; Feder, M.; Tishby, N. An Information-Theoretic Framework for Non-linear Canonical Correlation Analysis. arXiv 2018, arXiv:1810.13259.
- Williamson, A.R. The Impacts of Additive Noise and 1-bit Quantization on the Correlation Coefficient in the Low-SNR Regime. In Proceedings of the 57th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 24–27 September 2019; pp. 631–638.
- Courtade, T.A.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761.
- Pichler, G.; Piantanida, P.; Matz, G. Dictator Functions Maximize Mutual Information. Ann. Appl. Prob. 2018, 28, 3094–3101.
- Dobrushin, R.; Tsybakov, B. Information transmission with additional noise. IRE Trans. Inf. Theory 1962, 8, 293–304.
- Wolf, J.; Ziv, J. Transmission of noisy information to a noisy receiver with minimum distortion. IEEE Trans. Inf. Theory 1970, 16, 406–411.
- Witsenhausen, H.S.; Wyner, A.D. A Conditional Entropy Bound for a Pair of Discrete Random Variables. IEEE Trans. Inf. Theory 1975, 21, 493–501.
- Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
- Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
- Aguerri, I.E.; Zaidi, A. Distributed Variational Representation Learning. IEEE Trans. Pattern Anal. 2021, 43, 120–138.
- Hassanpour, S.; Wuebben, D.; Dekorsy, A. Overview and Investigation of Algorithms for the Information Bottleneck Method. In Proceedings of the SCC 2017: 11th International ITG Conference on Systems, Communications and Coding, Hamburg, Germany, 6–9 February 2017; pp. 1–6.
- Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University of Jerusalem, Jerusalem, Israel, 2002.
- Sutskover, I.; Shamai, S.; Ziv, J. Extremes of information combining. IEEE Trans. Inf. Theory 2005, 51, 1313–1325.
- Zaidi, A.; Aguerri, I.E.; Shamai, S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy 2020, 22, 151.
- Wyner, A.; Ziv, J. A theorem on the entropy of certain binary sequences and applications–I. IEEE Trans. Inf. Theory 1973, 19, 769–772.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188.
- Blachman, N. The convolution inequality for entropy powers. IEEE Trans. Inf. Theory 1965, 11, 267–271.
- Guo, D.; Shamai, S.; Verdú, S. The interplay between information and estimation measures. Found. Trends Signal Process. 2013, 6, 243–429.
- Bustin, R.; Payaro, M.; Palomar, D.P.; Shamai, S. On MMSE Crossing Properties and Implications in Parallel Vector Gaussian Channels. IEEE Trans. Inf. Theory 2013, 59, 818–844.
- Sanderovich, A.; Shamai, S.; Steinberg, Y.; Kramer, G. Communication Via Decentralized Processing. IEEE Trans. Inf. Theory 2008, 54, 3008–3023.
- Smith, J.G. The information capacity of amplitude- and variance-constrained scalar Gaussian channels. Inf. Control 1971, 18, 203–219.
- Sharma, N.; Shamai, S. Transition points in the capacity-achieving distribution for the peak-power limited AWGN and free-space optical intensity channels. Probl. Inf. Transm. 2010, 46, 283–299.
- Dytso, A.; Yagli, S.; Poor, H.V.; Shamai, S. The Capacity Achieving Distribution for the Amplitude Constrained Additive Gaussian Channel: An Upper Bound on the Number of Mass Points. IEEE Trans. Inf. Theory 2019, 66, 2006–2022.
- Steinberg, Y. Coding and Common Reconstruction. IEEE Trans. Inf. Theory 2009, 55, 4995–5010.
- Land, I.; Huber, J. Information Combining. Found. Trends Commun. Inf. Theory 2006, 3, 227–330.
- Yang, Q.; Piantanida, P.; Gündüz, D. The Multi-layer Information Bottleneck Problem. In Proceedings of the IEEE Information Theory Workshop (ITW), Kaohsiung, Taiwan, 6–10 November 2017; pp. 404–408.
- Berger, T.; Zhang, Z.; Viswanathan, H. The CEO Problem. IEEE Trans. Inf. Theory 1996, 42, 887–902.
- Steiner, S.; Kuehn, V. Optimization Of Distributed Quantizers Using An Alternating Information Bottleneck Approach. In Proceedings of the WSA 2019: 23rd International ITG Workshop on Smart Antennas, Vienna, Austria, 24–26 April 2019; pp. 1–6.
- Vera, M.; Rey Vega, L.; Piantanida, P. Collaborative Information Bottleneck. IEEE Trans. Inf. Theory 2019, 65, 787–815.
- Ugur, Y.; Aguerri, I.E.; Zaidi, A. Vector Gaussian CEO Problem Under Logarithmic Loss and Applications. IEEE Trans. Inf. Theory 2020, 66, 4183–4202.
- Estella, I.; Zaidi, A. Distributed Information Bottleneck Method for Discrete and Gaussian Sources. In Proceedings of the International Zurich Seminar on Information and Communication (IZS), Zurich, Switzerland, 21–23 February 2018; pp. 35–39.
- Courtade, T.A.; Jiao, J. An Extremal Inequality for Long Markov Chains. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 1–3 October 2014; pp. 763–770.
- Erkip, E.; Cover, T.M. The Efficiency of Investment Information. IEEE Trans. Inf. Theory 1998, 44, 1026–1040.
- Gács, P.; Körner, J. Common information is far less than mutual information. Probl. Contr. Inform. Theory 1973, 2, 149–162.
- Farajiparvar, P.; Beirami, A.; Nokleby, M. Information Bottleneck Methods for Distributed Learning. In Proceedings of the 56th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2–5 October 2018; pp. 24–31.
- Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. In Proceedings of the Information Theory Workshop (ITW), Jeju Island, Korea, 11–15 October 2015; pp. 1–5.
- Alemi, A.; Fischer, I.; Dillon, J.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810.
- Gabrié, M.; Manoel, A.; Luneau, C.; Barbier, J.; Macris, N.; Krzakala, F.; Zdeborová, L. Entropy and mutual information in models of deep neural networks. In Advances in NIPS; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
- Goldfeld, Z.; van den Berg, E.; Greenewald, K.H.; Melnyk, I.; Nguyen, N.; Kingsbury, B.; Polyanskiy, Y. Estimating Information Flow in Neural Networks. arXiv 2018, arXiv:1810.05728.
- Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. 2020, 42, 2225–2239.
- Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 1–34.
- Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane. In Lecture Notes in Computer Science, Proceedings of the Computer Vision-ECCV 2018-15th European Conference, Munich, Germany, 8–14 September 2018, Proceedings, Part XI; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11215, pp. 181–195.
- Yu, S.; Wickstrøm, K.; Jenssen, R.; Príncipe, J.C. Understanding Convolutional Neural Networks with Information Theory: An Initial Exploration. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 435–442.
- Lewandowsky, J.; Stark, M.; Bauch, G. Information bottleneck graphs for receiver design. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 2888–2892.
- Stark, M.; Wang, L.; Bauch, G.; Wesel, R.D. Decoding rate-compatible 5G-LDPC codes with coarse quantization using the information bottleneck method. IEEE Open J. Commun. Soc. 2020, 1, 646–660.
- Bhatt, A.; Nazer, B.; Ordentlich, O.; Polyanskiy, Y. Information-distilling quantizers. IEEE Trans. Inf. Theory 2021, 67, 2472–2487.
- Stark, M.; Shah, A.; Bauch, G. Polar code construction using the information bottleneck method. In Proceedings of the 2018 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), Barcelona, Spain, 15–18 April 2018; pp. 7–12.
- Shah, S.A.A.; Stark, M.; Bauch, G. Design of Quantized Decoders for Polar Codes using the Information Bottleneck Method. In Proceedings of the SCC 2019: 12th International ITG Conference on Systems, Communications and Coding, Rostock, Germany, 11–14 February 2019; pp. 1–6.
- Shah, S.A.A.; Stark, M.; Bauch, G. Coarsely Quantized Decoding and Construction of Polar Codes Using the Information Bottleneck Method. Algorithms 2019, 12, 192.
- Kurkoski, B.M. On the Relationship Between the KL Means Algorithm and the Information Bottleneck Method. In Proceedings of the 11th International ITG Conference on Systems, Communications and Coding (SCC), Hamburg, Germany, 6–9 February 2017; pp. 1–6.
- Goldfeld, Z.; Polyanskiy, Y. The Information Bottleneck Problem and its Applications in Machine Learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38.
- Harremoes, P.; Tishby, N. The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570.
- Richardson, T.; Urbanke, R. Modern Coding Theory; Cambridge University Press: Cambridge, UK, 2008.
- Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383.
- Mehler, F.G. Ueber die Entwicklung einer Function von beliebig vielen Variablen nach Laplaceschen Functionen höherer Ordnung. J. Reine Angew. Math. 1866, 66, 161–176.
- Lancaster, H.O. The Structure of Bivariate Distributions. Ann. Math. Statist. 1958, 29, 719–736.
- O’Donnell, R. Analysis of Boolean Functions, 1st ed.; Cambridge University Press: New York, NY, USA, 2014.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006.
- Corless, M.J. Linear Systems and Control: An Operator Perspective; Monographs and Textbooks in Pure and Applied Mathematics; Marcel Dekker: New York, NY, USA, 2003; Volume 254.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

