The Double-Sided Information Bottleneck Function

A double-sided variant of the information bottleneck method is considered. Let (X, Y) be a bivariate source characterized by a joint pmf P_XY. The problem is to find two independent channels P_{U|X} and P_{V|Y} (inducing the Markov structure U → X → Y → V) that maximize I(U; V) subject to constraints on the relevant mutual information expressions I(U; X) and I(V; Y). For jointly Gaussian X and Y, we show that Gaussian channels are optimal in the low-SNR regime but not for general SNR. Similarly, it is shown that for a doubly symmetric binary source, binary symmetric channels are optimal when the correlation is low but suboptimal for high correlations. We conjecture that Z and S channels are optimal when the correlation is 1 (i.e., X = Y) and provide supporting numerical evidence. Furthermore, we present a Blahut-Arimoto-type alternating maximization algorithm and demonstrate its performance in a representative setting. This problem is closely related to the domain of biclustering.


Introduction
The information bottleneck (IB) method [1] plays a central role in advanced lossy source compression. The analysis of classical source coding algorithms is mainly approached via rate-distortion theory, where a fidelity measure must be defined. However, specifying an appropriate distortion measure is challenging, and sometimes infeasible, in many real-world applications. The IB framework introduces an essentially different concept: an additional variable is provided that carries the relevant information in the data to be compressed, and the quality of the reconstructed sequence is measured via the mutual information between the reconstructed data and the relevance variable. Thus, the IB method provides a universal fidelity measure.
In this work, we extend and generalize the IB method by imposing an additional bottleneck constraint on the relevant variable and considering noisy observations of the source. In particular, let (X, Y) be a bivariate source characterized by a fixed joint probability law P_XY and consider all Markov chains U → X → Y → V. The Double-Sided Information Bottleneck (DSIB) function is defined as [2]:

R(C_u, C_v) ≜ max I(U; V),   (1)

where the maximization is over all P_{U|X} and P_{V|Y} satisfying I(U; X) ≤ C_u and I(V; Y) ≤ C_v. This problem is illustrated in Figure 1. In our study, we aim to determine the maximum value of (1) and the achieving conditional distributions (P_{U|X}, P_{V|Y}) (test-channels) for various fixed sources P_XY and constraints C_u and C_v.

The problem we consider originates from the domain of clustering. Clustering is applied to organize similar entities in unsupervised learning [3]. It has numerous practical applications in data science, such as joint word-document clustering, gene expression [4], and pattern recognition. The data in those applications are arranged as a contingency table. Usually, clustering is performed on one dimension of the table, but sometimes it is helpful to apply clustering to both dimensions of the contingency table [5], for example, when there is a strong correlation between the rows and the columns of the table, or when high-dimensional sparse structures are handled. The input and output of a typical biclustering algorithm are illustrated in Figure 2. Consider an S × T data matrix (a_st). Find partitions B_k ⊆ {1, . . . , S} and C_l ⊆ {1, . . . , T}, k = 1, . . . , K, l = 1, . . . , L, such that all elements of the "biclusters" [6] (a_st), s ∈ B_k, t ∈ C_l, are homogeneous. The measure of homogeneity depends on the application.

This problem can also be motivated by a remote source coding setting. Consider a latent random variable W, which satisfies U ← X ← W → Y → V and represents a source of information.
We have two users that observe noisy versions of W, i.e., X and Y. These users try to compress the observed noisy data so that their reconstructed versions, U and V, will be comparable under the maximum mutual information metric. The problem we consider also has practical applications. Imagine a distributed sensor network where the different nodes measure a noisy version of a particular signal but are not allowed to communicate with each other. Each node compresses its received signal. Under the DSIB framework, we can find the optimal compression schemes that preserve the reconstructed symbols' proximity in terms of the mutual information measure.
Dhillon et al. [7] initiated an information-theoretic approach to biclustering. They regarded the normalized non-negative contingency table as a joint probability distribution matrix of two random variables. Mutual information was proposed as a measure for optimal co-clustering. An optimization algorithm was presented that intertwines both row and column clustering at all stages. Distributed clustering from a proper information-theoretic perspective was first explicitly considered by Pichler et al. [2]. Consider the model illustrated in Figure 3. A bivariate memoryless source with joint law P_XY generates n i.i.d. copies (X^n, Y^n) of (X, Y). Each sequence is observed at a different encoder, and each encoder generates a description of the observed sequence, f_n(X^n) and g_n(Y^n), with |f_n(X^n)| ≤ 2^{nC_x} and |g_n(Y^n)| ≤ 2^{nC_y}. The objective is to construct the mappings f_n and g_n such that the normalized mutual information between the descriptions, (1/n) I(f_n(X^n); g_n(Y^n)), is maximal while the description coding satisfies the rate constraints. Single-letter inner and outer bounds for a general P_XY were derived. An example of a doubly symmetric binary source (DSBS) was given, and several converse results were established. Furthermore, connections were made to the standard IB [1] and the multiple description CEO problems [8]. In addition, the equivalence of the information-theoretic biclustering problem to hypothesis testing against independence with multiterminal data compression and to a pattern recognition problem was established in [9] and [10], respectively. The DSIB problem addressed in our paper is, in fact, a single-letter version of the distributed clustering setup [2]. The inner bound in [2] coincides with our problem definition. Moreover, if the Markov condition U → X → Y → V is imposed on the multi-letter variant, then those problems are equivalent.
A similar setting, but with a maximal-correlation criterion between the reconstructed random variables, has been considered in [11,12]. Furthermore, it is sometimes the case that the optimal biclustering problem is more straightforward to solve than its standard, single-sided clustering counterpart. For example, the Courtade-Kumar conjecture [13] for the standard single-sided clustering setting was ultimately proven in the biclustering setting [14]. A particular case, where (X, Y) are drawn from a DSBS distribution and the mappings f_n and g_n are restricted to be Boolean functions, was addressed in [14]. The bound I(f_n(X^n); g_n(Y^n)) ≤ I(X; Y) was established, which is tight if and only if f_n and g_n are dictator functions.

Related Work
Our work extends the celebrated standard (single-sided) IB (SSIB) method introduced by Tishby et al. [1]. Indeed, consider the problem illustrated in Figure 4. This single-sided counterpart of our work is essentially a remote source coding problem [15-17] with the distortion measure chosen as the logarithmic loss. The random variable U represents the noisy version (X) of the source (Y) with a constrained number of bits (I(U; X) ≤ C), and the goal is to maximize the relevant information that U carries about Y (measured by the mutual information between Y and U). In the standard IB setup, I(U; X) is referred to as the complexity of U, and I(Y; U) is referred to as the relevance of U.
For the particular case where (U, X, Y) are discrete, an optimal P_{U|X} can be found by iteratively solving a set of self-consistent equations. A generalized Blahut-Arimoto algorithm [18-21] was proposed to solve those equations. The optimal test-channel P_{U|X} was characterized using a variational principle in [1]. The particular case of deterministic mappings from X to U was considered in [22], and algorithms that find those mappings were described. Several representative scenarios have been considered for the SSIB problem. The setting where the pair (X, Y) is a doubly symmetric binary source (DSBS) with transition probability p was addressed from various perspectives in [17,23,24]. Utilizing Mrs. Gerber's Lemma (MGL) [25], one can show that the optimal test-channel for the DSBS setting is a BSC. The case where (X, Y) are jointly Gaussian in the SSIB framework was first considered in [26]. It was shown that the optimal distribution of (U, X, Y) is also jointly Gaussian. The optimality of the Gaussian test-channel can be proven using the EPI [27], or by exploiting the I-MMSE relation and the single-crossing property [28]. Moreover, the proof can be easily extended to jointly Gaussian random vectors (X, Y) under the I-MMSE framework [29].
In a more general scenario where X = Y + Z and only Z is fixed to be Gaussian, it was shown that discrete signaling with deterministic quantizers as the test-channel sometimes outperforms a Gaussian P_X [30]. This intriguing observation leads to a conjecture that discrete inputs are optimal in this general setting, and it may be connected to amplitude-constrained AWGN channels, for which it has already been established that discrete input distributions are optimal [31-33]. One reason for the optimality of discrete distributions stems from the observation that constraining the compression rate limits the usable input amplitude. However, as far as we know, this remains an open problem.
There are various related problems considered in the literature that are equivalent to the SSIB; namely, they share a similar single-letter optimization problem. In the conditional entropy bound (CEB) function, studied in [17], given a fixed bivariate source (X, Y) and an equality constraint on the conditional entropy of X given U, the goal is to minimize the conditional entropy of Y given U over the set of U such that U → X → Y constitutes a Markov chain. One can show that the CEB is equivalent to the SSIB. The common reconstruction (CR) setting [34] is a source coding with side information problem, also known as Wyner-Ziv coding, as depicted in Figure 5, with the additional constraint that the encoder can reconstruct the same sequence as the decoder. The additional assumption of log-loss fidelity results in a single-letter rate-distortion region equivalent to the SSIB. The problem of information combining (IC) [23,35], motivated by message combining in LDPC decoders, considers a source of information, P_Y, observed through two test-channels P_{X|Y} and P_{Z|Y}. The IC framework aims to design those channels under two extreme approaches. For the first, IC asks what those channels should be to make the output pair (X, Z) maximally informative regarding Y. On the contrary, IC also considers how to design P_{X|Y} and P_{Z|Y} to minimize the information in (X, Z) regarding Y. The problem of minimizing IC can be shown to be equivalent to the SSIB. In fact, if (X, Y) is a DSBS, then by [23], the optimal P_{Z|Y} is a binary symmetric channel (BSC), recovering similar results from Section IV.A of [17].

The IB method has been extended to various network topologies. A multilayer extension of the IB method is depicted in Figure 6. This model was first considered in [36]. A multivariate source (X, Y_1, . . . , Y_L) generates a sequence of n i.i.d. copies (X^n, Y_1^n, . . . , Y_L^n). The receiver has access only to the sequence X^n, while (Y_1^n, . . . , Y_L^n) are hidden.
The decoder performs a consecutive L-stage compression of the observed sequence. The representation at step k must be maximally informative about the respective hidden sequence Y_k, k ∈ {1, 2, . . . , L}. This setup is highly motivated by the structure of deep neural networks. Specific results were established for binary and Gaussian sources. The model depicted in Figure 7 represents a multiterminal extension of the standard IB. A set of receivers observe noisy versions (X_1, X_2, . . . , X_K) of some source of information Y. The channel outputs (X_1, X_2, . . . , X_K) are conditionally independent given Y. The receivers are connected to a central processing unit through noiseless but limited-capacity backhaul links. The central processor aims to attain a good prediction Ŷ of the source Y based on compressed representations of the noisy versions of Y obtained from the receivers. The quality of prediction is measured via the mutual information between Y and Ŷ. The distributive IB setting is essentially a CEO source coding problem under the logarithmic loss (log-loss) distortion measure [37]. The case where (X, Y_1, . . . , Y_K) are jointly Gaussian random variables was addressed in [20], and a Blahut-Arimoto-type algorithm was proposed. An optimized algorithm to design quantizers was proposed in [38]. A cooperative multiterminal extension of the IB method was proposed in [39]. Let (X_1^n, X_2^n, Y^n) be n i.i.d. copies of the multivariate source (X_1, X_2, Y). The sequences X_1^n and X_2^n are observed at encoders 1 and 2, respectively. Each encoder sends a representation of the observed sequence through a noiseless yet rate-limited link to the other encoder and to the mutual decoder. The decoder attempts to reconstruct the latent representation sequence Y^n based on the received descriptions. As shown in Figure 8, this setup differs from the CEO setup [40], since the encoders can cooperate during the transmission.
The set of all feasible rates of complexity and relevance was characterized, and specific regions for binary and Gaussian sources were established. There are many additional variations of multi-user IB in the literature. The IB problem connects to many timely topics, such as capital investment [43], distributed learning [45], deep learning [46-52], and convolutional neural networks [53,54]. Moreover, it has recently been shown that the IB method can be used to reduce the data transfer rate and computational complexity in 5G LDPC decoders [55,56]. The IB method has also been connected with the construction of good polar codes [57]. Due to the exponential output-alphabet growth of polarized channels, it becomes demanding to compute their capacities in order to identify the locations of the "frozen bits". Quantization is employed in order to reduce the computational complexity. The quality of a quantization scheme is assessed via mutual information preservation. It can be shown that the corresponding IB problem upper-bounds the performance of any such quantization scheme. Quantization algorithms based upon the IB method were considered in [58-60]. Furthermore, a relationship between the KL-means algorithm and the IB method was discovered in [61].
A recent comprehensive tutorial on the IB method and related problems is given in [24]. Applications of the IB problem in machine learning are detailed in [26,45-47,51,52,62].

Notations
Throughout the paper, random variables are denoted using a sans-serif font, e.g., X, their realizations are denoted by the respective lower-case letters, e.g., x, and their alphabets are denoted by the respective calligraphic letters, e.g., 𝒳. Let 𝒳^n stand for the set of all n-tuples of elements from 𝒳. An element of 𝒳^n is denoted by x^n = (x_1, x_2, . . . , x_n), and substrings are denoted by x_i^j = (x_i, x_{i+1}, . . . , x_j). The cardinality of a finite set, say 𝒳, is denoted by |𝒳|. The probability mass function (pmf) of X, the joint pmf of X and Y, and the conditional pmf of X given Y are denoted by P_X, P_XY, and P_{X|Y}, respectively. The expectation of X is denoted by E[X]. The probability of an event E is denoted by P(E).
Let X and Y be n-ary and m-ary random variables, respectively. The marginal probability vector is denoted by a lowercase boldface letter, i.e., q ≜ (P(X = 1), P(X = 2), . . . , P(X = n))^T. The probability vector of an n-ary uniform random variable is denoted by u_n. We denote by T the transition matrix from X to Y, i.e., T_{ij} ≜ P_{Y|X}(i|j), so that q_y = T q_x. The entropy of an n-ary probability vector q is given by h(q) ≜ −∑_{i=1}^n q_i log q_i. Throughout this paper, all logarithms are taken to base 2 unless stated otherwise. We denote the ones' complement with a bar, i.e., x̄ ≜ 1 − x. The binary convolution of x, y ∈ [0, 1] is defined as x * y ≜ x ȳ + x̄ y. The binary entropy function is defined by h(x) ≜ −x log x − x̄ log x̄. Let X and Y be a pair of random variables with joint pmf P_XY and marginal pmfs P_X = q_x and P_Y = q_y. Furthermore, let T (T̃) be the transition matrix from X (Y) to Y (X). The mutual information between X and Y is defined as:

I(X; Y) ≜ ∑_{x,y} P_XY(x, y) log [ P_XY(x, y) / (P_X(x) P_Y(y)) ].
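The quantities above (binary entropy, binary convolution, and mutual information) translate directly into code; a minimal numpy sketch, with function names of our own choosing:

```python
import numpy as np

def h2(x):
    """Binary entropy function (base 2)."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def bconv(x, y):
    """Binary convolution x * y = x*(1-y) + (1-x)*y."""
    return x * (1 - y) + (1 - x) * y

def mutual_information(P):
    """I(X;Y) in bits from a joint pmf matrix P with P[i, j] = P_XY(i, j)."""
    qx = P.sum(axis=1, keepdims=True)   # marginal of X
    qy = P.sum(axis=0, keepdims=True)   # marginal of Y
    mask = P > 0                        # skip zero-probability cells
    return float((P[mask] * np.log2(P[mask] / (qx @ qy)[mask])).sum())
```

For instance, `mutual_information` applied to the joint pmf of two identical uniform bits returns 1 bit.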

Paper Outline
Section 2 gives a proper definition of the DSIB optimization problem, mentions various results directly related to this work, and provides some general preliminary results. Section 3 focuses on binary (X, Y), where we derive bounds on the respective DSIB function and show a complete characterization for extreme scenarios. Jointly Gaussian (X, Y) is considered in Section 4, where an elegant representation of the objective function is presented and a complete characterization in the low-SNR regime is established. A Blahut-Arimoto-type alternating maximization algorithm is presented in Section 5. Representative numerical evaluations of the bounds and of the proposed algorithm are provided in Section 6. Finally, a summary and possible future directions are given in Section 7. The longer proofs are deferred to the appendices.

Problem Formulation and Basic Properties
The DSIB function is a multi-terminal extension of the standard IB [1]. First, we briefly recall the latter's definition and give related results that will be utilized for its double-sided counterpart. Then, we provide a proper definition of the DSIB optimization problem and present some general preliminaries.

The Single-Sided Information Bottleneck (SSIB) Function
Definition 1 (SSIB). Let (X, V) be a pair of random variables with |𝒳| = n, |𝒱| = m, and fixed P_XV. Denote by q the marginal probability vector of X, and let T be the transition matrix from X to V. Consider all random variables U satisfying the Markov chain U → X → V. The SSIB function is defined as:

R̃_T(q, C) ≜ max_{P_{U|X}} I(V; U) subject to I(X; U) ≤ C.   (6)

Remark 1. The SSIB problem defined in (6) is equivalent to (i.e., has the same solution as) the CEB problem considered in [17].

Although the optimization problem in (6) is well defined, the auxiliary random variable U may have an unbounded alphabet. The following lemma provides an upper bound on the cardinality of 𝒰, thus relaxing the optimization domain.

Lemma 1 (Lemma 2.2 of [17]). The optimization over U in (6) can be restricted to |𝒰| ≤ n + 1.

Remark 2. A tighter bound, namely |𝒰| ≤ n, was previously proved in [63] for the corresponding dual problem, namely, the IB Lagrangian. However, since R̃_T(q, C) is generally not a strictly convex function of C, that bound cannot be directly applied to the primal problem (6).
Note that the SSIB optimization problem (6) is essentially the maximization of a convex function over a convex set; thus, the maximum is attained on the boundary of the set.

Lemma 2 (Theorem 2.5 of [17]). The inequality constraint in (6) can be replaced by an equality constraint, i.e., I(X; U) = C.

The Double-Sided Information Bottleneck (DSIB) Function
Definition 2 (DSIB). Let (X, Y) be a pair of random variables with |𝒳| = n, |𝒴| = m, and fixed P_XY. Consider all random variables U and V satisfying the Markov chain U → X → Y → V. The DSIB function is defined as:

R(C_u, C_v) ≜ max_{P_{U|X}, P_{V|Y}} I(U; V) subject to I(U; X) ≤ C_u and I(V; Y) ≤ C_v.   (7)

The achieving conditional distributions P_{U|X} and P_{V|Y} will be termed the optimal test-channels. Occasionally, we will drop the subscript denoting the particular choice of the bivariate source P_XY.
Note that (7) can be expressed in the following equivalent form:

R(C_u, C_v) = max_{P_{V|Y}: I(Y;V) ≤ C_v} max_{P_{U|X}: I(X;U) ≤ C_u} I(U; V).   (8)

Evidently, we can define (8) using (6). Indeed, fix P_{V|Y} so that it satisfies I(Y; V) ≤ C_v. Denote by T_{V|Y} the transition matrix from Y to V and by T_{Y|X} the transition matrix from X to Y, respectively. Denote by q_x and q_y the marginal probability vectors of X and Y, respectively, and consider the inner maximization term in (8). Since P_{V|Y} and P_XY are fixed, P_XV = ∑_y P_{V|Y}(·|y) P_XY(·, y) is also fixed. Denote by T_{V|X} ≜ T_{V|Y} T_{Y|X} the transition matrix from X to V. Therefore, the inner maximization term in (8) is just the SSIB function with parameters T_{V|X} and C_u, namely, R̃_{T_{V|X}}(q_x, C_u). Hence, our problem can also be interpreted in the following two equivalent ways:

R(C_u, C_v) = max_{P_{V|Y}: I(Y;V) ≤ C_v} R̃_{T_{V|X}}(q_x, C_u),   (9)

or, similarly, by interchanging the order of maximization in (8), it can be expressed as:

R(C_u, C_v) = max_{P_{U|X}: I(X;U) ≤ C_u} R̃_{T_{U|Y}}(q_y, C_v),   (10)

where T_{U|Y} ≜ T_{U|X} T_{X|Y}, T_{U|X} is the transition matrix from X to U, and T_{X|Y} is the transition matrix from Y to X. This representation gives us a different perspective on our problem: finding an optimal compressed representation of the relevance random variable in the IB framework.
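The reduction of the inner maximization in (8) to an SSIB instance amounts to a single matrix product; a small numpy sketch (the source and the fixed test-channel are illustrative values, and we use row-stochastic matrices here, so the composition order is reversed relative to the column-stochastic convention in the text):

```python
import numpy as np

# Example bivariate source P_XY (rows: x, columns: y) -- illustrative values.
P_XY = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
q_x = P_XY.sum(axis=1)              # marginal probability vector of X
T_Y_X = P_XY / q_x[:, None]         # row-stochastic channel from X to Y

# A fixed test-channel P_{V|Y} satisfying I(Y;V) <= C_v -- here a BSC(0.2).
T_V_Y = np.array([[0.8, 0.2],
                  [0.2, 0.8]])

# Effective channel from X to V; with row-stochastic matrices the
# composition T_{V|X} = T_{V|Y} T_{Y|X} becomes a right product.
T_V_X = T_Y_X @ T_V_Y

# The induced joint pmf P_XV is now fixed, and the inner maximization in
# (8) is exactly the SSIB function with parameters T_{V|X} and C_u.
P_XV = q_x[:, None] * T_V_X
```

With the example values above, the effective channel from X to V is a BSC with crossover 0.2 * 0.2 = 0.32.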
The bound from Lemma 1 can be utilized to obtain cardinality bounds for the double-sided problem.

Proposition 1. For the optimization problem (7), it suffices to consider random variables U and V with cardinalities |𝒰| ≤ n + 1 and |𝒱| ≤ m + 1.
In the following two sections, we will present the primary analytical outcomes of our study. First, we consider the scenario where our bivariate source is binary, specifically DSBS. Then, we handle the case where X and Y are jointly Gaussian.

Binary (X, Y)
Let (X, Y) be a DSBS with parameter p, i.e.,

P_XY(x, y) = (1 − p)/2 if x = y, and p/2 if x ≠ y, for x, y ∈ {0, 1}.

We refer to the respective optimization problem (7) as the binary double-sided information bottleneck (BDSIB) and emphasize its dependence on the parameter p by writing R(C_u, C_v, p).
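As a quick sanity check of the DSBS model (a sketch with our own helper names), the source satisfies I(X; Y) = 1 − h(p):

```python
import numpy as np

def dsbs(p):
    """Joint pmf of a DSBS(p): X ~ Bernoulli(1/2), Y = X xor Bernoulli(p)."""
    return np.array([[(1 - p) / 2, p / 2],
                     [p / 2, (1 - p) / 2]])

def h2(x):
    """Binary entropy function (base 2)."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def mi(P):
    """I(X;Y) in bits from a joint pmf matrix."""
    qx = P.sum(axis=1, keepdims=True)
    qy = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / (qx @ qy)[mask])).sum())

# Sanity check: I(X;Y) = 1 - h(p) for a DSBS(p).
p = 0.11
assert abs(mi(dsbs(p)) - (1 - h2(p))) < 1e-9
```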
The following proposition states that the cardinality bound from Lemma 1 can be tightened in the binary case.

Proposition 2. For binary (X, Y), the optimization in (7) can be restricted to |𝒰| ≤ 2 and |𝒱| ≤ 2.

The proof of this proposition is postponed to Appendix A. Using a justification similar to that of Proposition 1, combined with Proposition 2, we have the following exact cardinality characterization for the BDSIB setting.

Proposition 3.
For the respective DSBS setting of (7), it suffices to consider random variables U and V with cardinalities |U | = |V | = 2.
Note that the above statement is not required for the results in the rest of this section to hold and will be mainly applied to justify our conjectures via numerical simulations.
We next show that the specific objective function for the binary setting of (7), i.e., the mutual information between U and V, has an elegant representation that will be useful in deriving lower and upper bounds.
Lemma 3. The mutual information between U and V can be expressed as

I(U; V) = E[K(U, V, p)],   (12)

where the expectation is taken over the product measure P_U × P_V, U and V are binary random variables, and the kernel K(u, v, p) is expressed in terms of the parameter p and the reverse test-channels, namely, the conditional pmfs P_{X|U} and P_{Y|V} with parameters {α_u}_{u=0}^1 and {β_v}_{v=0}^1. The general cascade of test-channels and the DSBS, defined by {α_u}_{u=0}^1, {β_v}_{v=0}^1, and p, is illustrated in Figure 9. The proof of Lemma 3 is postponed to Appendix B.

Figure 9. General test-channel construction of the BDSIB function.
We next examine some corner cases for which R(C u , C v , p) is fully characterized.

Special Cases
A particular case where we have a complete analytical solution is when p tends to 1/2.

Theorem 1, whose proof is given in Appendix D, characterizes the asymptotic behavior of R(C_u, C_v, p) as p → 1/2; the optimum is achieved by taking P_{U|X} and P_{V|Y} as BSC test-channels satisfying the constraints with equality. The theorem follows by considering the low-SNR regime in Lemma 3; for the lower bound, we take P_{U|X} and P_{V|Y} to be BSCs.
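The BSC achievability is easy to evaluate numerically; a sketch (function names are ours; it assumes uniform marginals, so that a BSC(α) from X to U satisfies I(U; X) = 1 − h(α), and the cascade U–X–Y–V is an overall BSC with crossover α * p * β):

```python
import numpy as np

def h2(x):
    """Binary entropy function (base 2)."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def h2_inv(y):
    """Inverse of the binary entropy on [0, 1/2], found by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bconv(x, y):
    """Binary convolution x * y."""
    return x * (1 - y) + (1 - x) * y

def bsc_lower_bound(C_u, C_v, p):
    """Rate achieved by BSC test-channels meeting the constraints with
    equality: alpha = h^{-1}(1 - C_u) and beta = h^{-1}(1 - C_v), so that
    I(U;X) = C_u, I(V;Y) = C_v, and I(U;V) = 1 - h(alpha * p * beta)."""
    alpha = h2_inv(1 - C_u)
    beta = h2_inv(1 - C_v)
    return 1 - h2(bconv(bconv(alpha, p), beta))
```

With C_u = C_v = 1 (no compression), the bound reduces to I(X; Y) = 1 − h(p), as expected.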
In Section 6, we will give numerical evidence that BSC test-channels are in fact optimal provided that p is sufficiently large. However, for small p this is no longer the case, and we believe the following holds.

Conjecture 1. Let X = Y, i.e., p = 0. The optimal test-channels P_{U|X} and P_{V|X} that achieve R(C_u, C_v, 0) are a Z-channel and an S-channel, respectively.

Remark 4. Our results in the numerical section strongly support this conjecture. In fact, they prove it within the resolution of the experiments, i.e., when optimizing over a dense set of test-channels rather than all test-channels. Nevertheless, we were not able to find an analytical proof of this result.

Remark 5. When X = Y, we have I(U; V) = I(X; U) + I(X; V) − I(X; U, V) (since U → X → V form a Markov chain in this order); hence, under equality constraints, maximizing I(U; V) is equivalent to minimizing I(X; U, V), namely, minimizing information combining as in [23,35]. Therefore, Conjecture 1 is equivalent to the conjecture that among all channels with I(X; U) ≥ C_u and I(Y; V) ≥ C_v, Z and S are the worst channels for information combining.
This observation leads us to the following additional conjecture.

Conjecture 2.
The test-channels P U|X and P V|X that maximize I(X; U, V) are both Z channels.

Remark 6. Suppose now that p is arbitrary and assume that one of the channels P_{U|X} or P_{V|Y} is restricted to be a binary memoryless symmetric (BMS) channel (Chapter 4 of [64]). Then the maximal I(U; V) is attained by BSC channels, as those are the symmetric channels minimizing I(X; U, V) [23]. It is not surprising that once the BMS constraint is removed, symmetric channels are no longer optimal (see the discussion in Section VI.C of [23]).
Consider now the case X = Y (p = 0) with the additional symmetry assumption C_u = C_v. The most reasonable a priori guess is that the optimal test-channels P_{U|X} and P_{V|X} are the same up to some permutation of inputs and outputs. Surprisingly, this is not the case, unless they are BSC or Z channels, as the following negative result states.

Proposition 4. Suppose C_u = C_v, and fix the transition matrix from X to V. The optimal P_{U|X} that attains (6) with q_X = u_2 and C = C_u does not equal P_{V|X} or any permutation of P_{V|X}, unless P_{V|X} is a BSC or a Z channel.
The proof is based on [17] and is postponed to Appendix E. As for the case of X = Y, i.e., p = 0, we have the following conjecture.
We will provide supporting arguments for this conjecture via numerical simulations in Section 6.

Bounds
In this section, we present our lower and upper bounds on the BDSIB function and then compare them for various channel parameters. The proofs are postponed to Appendix F. For simplicity of the following presentation, we define transition-probability parameters α, β, δ, and ζ such that the corresponding test-channels meet the bottleneck constraints with equality.

Proposition 5. The BDSIB function is bounded from below by (20).

All terms on the RHS of (20) are attained by taking test-channels that match the constraints with equality and plugging them into Lemma 3. In particular, the first term is achieved by BSC test-channels with transition probabilities α and β; the second term is achieved by taking P_{U|X} to be a Z(δ) channel and P_{V|Y} to be an S(ζ) channel. The aforementioned test-channel configurations are illustrated in Figure 10.

We compare the different lower bounds derived in Proposition 5 for various values of the constraints. The achievable rate vs. the channel transition probability p is shown in Figure 11. Our first observation is that BSC test-channels outperform all other choices for almost all values of p. However, Figure 12 gives a closer look at small values of p. It is evident that the combination of Z and S test-channels outperforms all other schemes for small values of p. We have used this observation as one piece of supporting evidence for Conjecture 1. We proceed to give an upper bound.

Proposition 6. A general upper bound on the BDSIB function is given by (21).

Note that the first term can be derived by applying Jensen's inequality to (12), and the second term is a combination of the standard IB bound and the cut-set bound. We postpone the proof of Proposition 6 to Appendix F.

Remark 7. With p = 1/2 − ε, we have a factor-2 loss in the first term compared to the precise behavior we found for p ≈ 1/2 in Theorem 1. This loss comes from the fact that the bound in (21) actually upper bounds the χ²-mutual information between U and V. It is well known that for very small I(X; Y), we have I(X; Y) ≈ (1/2) I_{χ²}(X; Y); see [65].
We compare the different upper bounds from Proposition 6 in Figure 13 for various bottleneck constraints, and in Figure 14 for various values of the channel transition probability p. We observe that there are regions of C and p for which the Jensen-based bound outperforms the standard IB bound. Finally, we compare the best lower and upper bounds from Propositions 5 and 6 for various values of the channel parameters in Figure 15. We observe that the bounds are tighter for asymmetric constraints and high transition probabilities.

Gaussian (X, Y)
In this section, we consider the specific setting where (X, Y) is a normalized zero-mean bivariate Gaussian source, namely,

(X, Y) ~ N(0, Σ) with Σ = [1, ρ; ρ, 1].   (22)

We establish achievability schemes and show that Gaussian test-channels P_{U|X} and P_{V|Y} are optimal for vanishing SNR.
We denote the Gaussian DSIB function with explicit dependency on ρ as R(C u , C v , ρ).

Proposition 7. Let H_n(·) be the nth-order probabilistic Hermite polynomial. Then the objective function of (7) for the Gaussian setting admits an expansion in powers ρ^n with coefficients given by conditional expectations of H_n.

This representation follows by writing I(U; V) = D(P_UV || P_U × P_V) and expanding the density ratio P_UV/(P_U · P_V) using the Mehler kernel [66]. The Mehler kernel decomposition is a special case of the much richer family of Lancaster distributions [67]. The proof of Proposition 7 is relegated to Appendix G. We now give two lower bounds on R(C_u, C_v, ρ). Our first lower bound is established by choosing P_{U|X} and P_{V|Y} to be additive Gaussian channels satisfying the mutual information constraints with equality.

Proposition 8. A lower bound on R(C_u, C_v, ρ) is given by

R(C_u, C_v, ρ) ≥ −(1/2) log(1 − ρ²(1 − 2^{−2C_u})(1 − 2^{−2C_v})).   (24)

The proof of this bound is developed in Appendix H.
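A small sketch of the Gaussian test-channel computation (the helper is our own; it assumes the standard construction in which U is X plus independent Gaussian noise scaled so that I(U; X) = C_u, giving squared correlation 1 − 2^{−2C_u} between U and X, and similarly for V):

```python
import numpy as np

def gaussian_lower_bound(C_u, C_v, rho):
    """Rate achieved by additive Gaussian test-channels meeting the
    constraints with equality.  With I(U;X) = C_u, the squared
    correlation of (U, X) is 1 - 2**(-2*C_u) (similarly for V), so
    (U, V) are jointly Gaussian with squared correlation
    rho^2 * (1 - 2^(-2 C_u)) * (1 - 2^(-2 C_v))."""
    ru2 = 1 - 2.0 ** (-2 * C_u)
    rv2 = 1 - 2.0 ** (-2 * C_v)
    return -0.5 * np.log2(1 - rho ** 2 * ru2 * rv2)
```

As both constraints grow, the bound approaches I(X; Y) = −(1/2) log(1 − ρ²), and at ρ = 0 it vanishes, as it should.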
Although it was shown in [26] that choosing the test-channel to be Gaussian is optimal for the single-sided variant, this is not the case for the double-sided extension. We show this by examining a specific set of rate constraints, (C_u, C_v) = (1, 1), and choosing the test-channels P_{U|X} and P_{V|Y} to be deterministic quantizers.

The resulting lower bound is stated in Proposition 9; its proof is developed in Appendix I. We compare the bounds from Propositions 8 and 9 with (C_u, C_v) = (1, 1) in Figure 16. The most unexpected observation here is that the deterministic-quantizer lower bound outperforms the Gaussian test-channels for high values of ρ; the two bounds cross at an intermediate value of ρ, visible in Figure 16. We proceed to present our upper bound on R(C_u, C_v, ρ). This bound is a combination of the cut-set bound and the single-sided Gaussian IB.

Proposition 10. An upper bound on (7) with the Gaussian (X, Y) setting (22) is given by the minimum of the cut-set bound min{C_u, C_v} and the single-sided Gaussian IB bounds.

We compare the best lower and upper bounds from Propositions 8-10 in Figure 17. We observe that the bounds become tighter as the constraints increase and in the low-SNR regime.

Low-SNR Regime
For ρ → 0 and C_u = C_v = 1 bit, both the Gaussian (Proposition 8) and the deterministic (Proposition 9) test-channels achieve rates that vanish quadratically in ρ, and the Gaussian choice has the larger leading coefficient. Hence, the Gaussian choice outperforms the second lower bound for vanishing SNR. The following theorem establishes that Gaussian test-channels are optimal in the low-SNR regime.

Theorem 2. For ρ → 0, the GDSIB function coincides with the Gaussian lower bound of Proposition 8 up to o(ρ²) terms.

The lower bound follows from Proposition 8. The upper bound is established by considering the kernel representation from Proposition 7 in the limit of vanishing ρ. The detailed proof is given in Appendix J.

Optimality of Symbol-by-Symbol Quantization When X = Y
Consider an extreme scenario in which X = Y ~ N(0, 1). Taking the encoders P_{U|X} and P_{V|X} to be identical symbol-by-symbol deterministic quantizers whose output entropy equals min{C_u, C_v}, we have U = V, and hence we achieve the optimum I(U; V) = min{C_u, C_v}.
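A quick Monte-Carlo illustration of this construction (a sketch with our own helper names; it assumes an integer rate C = min{C_u, C_v} and equiprobable quantization cells):

```python
import numpy as np

def quantizer_mi(C, n_samples=200_000, seed=0):
    """Monte-Carlo check for X = Y ~ N(0,1): identical 2^C-level
    quantizers with equiprobable cells give U = V, hence
    I(U;V) = H(U) = C bits (C is assumed to be a positive integer)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    levels = 2 ** C
    # Equiprobable cells: thresholds at the empirical Gaussian quantiles.
    edges = np.quantile(x, np.linspace(0, 1, levels + 1)[1:-1])
    u = np.digitize(x, edges)
    # Since X = Y and the quantizers are identical, V = U, so
    # I(U;V) = H(U), estimated from the empirical cell frequencies.
    p = np.bincount(u, minlength=levels) / n_samples
    return float(-(p[p > 0] * np.log2(p[p > 0])).sum())
```

For C = 2, the empirical estimate is close to 2 bits, matching I(U; V) = min{C_u, C_v}.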

Alternating Maximization Algorithm
Consider the DSIB problem for DSBS with parameter p analyzed in Section 3. The respective optimization problem involves simultaneous search of the maximum over the sets {P U|X } and {P V|Y }. An alternating maximization, namely, fixing P U|X , then finding the respective optimal P V|Y and vice versa, is sub-optimal in general and may result in convergent to a saddle point. However, for the case p = 0 with symmetric bottleneck constraints, Proposition 4 implies that such point exists only for the BSC and Z/S channels. This motivates us to believe that performing an alternating maximization procedure on (9) will not result in sub-optimal saddle point, but rather converge to the optimal solution also for the general discrete (X, Y).
Thus, we propose an alternating maximization algorithm. The main idea is to fix P V|Y and then compute the P * U|X that attains the inner term in (9). Then, using P * U|X , we find the optimal P * V|Y that attains the inner term in (10). We repeat this procedure in an alternating manner until convergence.
Note that the inner terms of (9) and (10) are simply the standard IB problem defined in (6). For completeness, we state here the main result from [1], adjusted to our problem. Consider the respective Lagrangian of (6), given by: Lemma 4 (Theorem 4 of [1]). The optimal test-channel that maximizes (30) satisfies the equation: where β ≜ 1/λ and P V|U is given via Bayes' rule, as follows: In a very similar manner to the Blahut-Arimoto algorithm [18], the self-consistent equations can be adapted into converging, alternating iterations over the convex sets {P U|X } = ∆_n^{⊗n}, {P U } = ∆_n, and {P V|U } = ∆_n^{⊗n}, as stated in the following lemma.
Lemma 5 (Theorem 5 of [1]). The self-consistent equations are satisfied simultaneously at the minima of the functional: where the minimization is performed independently over the convex sets {P U|X } = ∆_n^{⊗n}, {P U } = ∆_n, and {P V|U } = ∆_n^{⊗n}. The minimization is performed by the converging alternating iterations described in Algorithm 1.

Algorithm 1: IB iterative algorithm IBAM(args)
Next, we propose a combined algorithm to solve the optimization problem in (7). The main idea is to fix one of the test-channels, say P V|Y , and then find the corresponding optimal opposite test-channel P U|X using Algorithm 1. Then, we apply Algorithm 1 again with the roles switched, i.e., fixing P U|X and identifying the optimal P V|Y . We repeat this procedure until the objective function I(U; V) converges. We summarize the proposed composite method in Algorithm 2.
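The composite procedure can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: all function names are ours, fixed Lagrangian parameters β_u, β_v stand in for the bisection search that enforces (C u , C v ), and the iteration counts are arbitrary choices.

```python
import numpy as np

EPS = 1e-300  # numerical floor to keep logarithms finite

def ib_step(P_t, R, beta, Q):
    """One pass of the IB self-consistent equations (in the spirit of Lemma 4).
    P_t: (n,) pmf of the compressed variable; R[t, r] = P(relevant = r | t);
    Q[t, c] = current P(cluster = c | t).  Returns the updated Q."""
    P_c = P_t @ Q                                         # cluster marginal
    P_t_given_c = (P_t[:, None] * Q / np.maximum(P_c, EPS)).T
    R_c = P_t_given_c @ R                                 # P(relevant | cluster)
    # D[t, c] = KL( R[t, :] || R_c[c, :] )
    D = np.einsum('tr,tcr->tc', R,
                  np.log(np.maximum(R[:, None, :], EPS))
                  - np.log(np.maximum(R_c[None, :, :], EPS)))
    Q_new = P_c[None, :] * np.exp(-beta * D)
    return Q_new / Q_new.sum(axis=1, keepdims=True)

def mutual_info(P):
    """Mutual information (nats) of a joint pmf matrix."""
    Pa, Pb = P.sum(axis=1), P.sum(axis=0)
    return float(np.sum(P * (np.log(np.maximum(P, EPS))
                             - np.log(np.maximum(np.outer(Pa, Pb), EPS)))))

def dsib_alternating(P_xy, beta_u, beta_v, n_u=2, n_v=2,
                     n_outer=50, n_inner=100, seed=0):
    """Alternate the one-sided IB between the two test-channels."""
    rng = np.random.default_rng(seed)
    P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    P_y_given_x = P_xy / P_x[:, None]
    P_x_given_y = (P_xy / P_y[None, :]).T
    Qu = rng.random((len(P_x), n_u)); Qu /= Qu.sum(axis=1, keepdims=True)
    Qv = rng.random((len(P_y), n_v)); Qv /= Qv.sum(axis=1, keepdims=True)
    for _ in range(n_outer):
        P_v_given_x = P_y_given_x @ Qv   # V acts as relevance variable for the X side
        for _ in range(n_inner):
            Qu = ib_step(P_x, P_v_given_x, beta_u, Qu)
        P_u_given_y = P_x_given_y @ Qu   # U acts as relevance variable for the Y side
        for _ in range(n_inner):
            Qv = ib_step(P_y, P_u_given_y, beta_v, Qv)
    return mutual_info(Qu.T @ P_xy @ Qv), Qu, Qv
```

For a DSBS, P_xy is the 2 × 2 matrix with entries (1 − p)/2 on the diagonal and p/2 off the diagonal; by data processing, the returned I(U; V) can never exceed I(X; Y).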

Remark 8.
Note that every alternating step of the algorithm involves finding an optimal pair (β * , η * ) that corresponds to the respective problem constraints (C u , C v ). We have chosen to implement this exploration step using a bisection-type method. As a result, the actual pair (C u , C v ) attained may be ε-far from the desired constraints.
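The bisection step can be sketched as follows, under the (heuristic) assumption that the achieved rate is nondecreasing in β; here `eval_rate` is a hypothetical stand-in for a routine that runs Algorithm 1 at a given β and reports the resulting I(U; X):

```python
def bisect_beta(eval_rate, target_C, beta_lo=1e-3, beta_hi=100.0,
                tol=1e-3, max_iter=60):
    """Bisection over the Lagrange parameter beta.

    eval_rate(beta) is assumed nondecreasing in beta.  The search stops once
    the achieved rate is within tol of the constraint target_C, mirroring the
    epsilon-slack discussed in Remark 8."""
    beta = 0.5 * (beta_lo + beta_hi)
    for _ in range(max_iter):
        beta = 0.5 * (beta_lo + beta_hi)
        r = eval_rate(beta)
        if abs(r - target_C) <= tol:
            break
        if r < target_C:
            beta_lo = beta   # rate too small: increase beta
        else:
            beta_hi = beta   # rate too large: decrease beta
    return beta
```

The same routine would be run once per alternating step, for each of the two constraints.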

Numerical Results
In this section, we focus on the DSBS setting of Section 3. In the first part, we show, using a brute-force method, the existence of a sharp phase-transition phenomenon in the optimal test-channels P U|X and P V|Y vs. the DSBS parameter p. In the second part, we evaluate the alternating maximization algorithm proposed in Section 5 and compare its performance to the brute-force method.

Exhaustive Search
In this set of simulations, we again fix the transition matrix from Y to V, characterized by the parameters below; the results are presented in Figure 18. Note that the region of a corresponds to the continuous transition from a Z channel (a = 0) to a BSC (a = a_max). We observe a very sharp transition from the optimality of the Z/S-channel configuration to the BSC configuration for a small change in p. This behavior continues to hold for a different choice of (C u = 0.1, C v = 0.9), as can be seen in Figure 19.

Next, we emphasize this sharp phase-transition phenomenon by plotting the optimal a that achieves the maximal I(U; V) vs. the DSBS parameter p. The results for various combinations of C u and C v are presented in Figures 20 and 21. We observe that the curves are convex for p ∈ [0, p_th) and constant for p > p_th with a = a_bsc. Furthermore, the derivative of a(p) tends to ∞ as p → p_th.

One may claim that there is no sharp transition to BSC test-channels P U|X and P V|Y as p grows away from zero, but rather that the optimal channels only approach a BSC. To convince the reader that the optimal test-channels are exactly BSC, we performed an alternating maximization experiment. We fixed p > 0, C u , and C v . We then chose P V|Y as an almost-BSC channel satisfying I(Y; V) ≤ C v and found the channel P X|U that maximizes I(U; V) subject to I(X; U) ≤ C u . Then, we fixed the channel P X|U and found the P Y|V that maximizes I(U; V) subject to I(Y; V) ≤ C v . We repeated this alternating procedure until convergence. The transition matrices were parameterized as follows: The results for different values of p, C u , and C v are shown in Figures 22-24. We observe that p_0 and q_0 rapidly converge to their respective BSC values satisfying the mutual information constraints. Note that this last procedure is still an exhaustive search, but it is performed in an alternating fashion between the sets {P U|X } and {P V|Y }.
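A brute-force evaluation of this kind can be sketched as follows. The grid parameterization of the binary test-channels by crossover pairs (a0, a1) and (b0, b1) is our own generic choice for illustration, not the specific a-parameterization used in the figures:

```python
import numpy as np

def mi_bits(P):
    """Mutual information (bits) of a joint pmf matrix P."""
    Pa, Pb = P.sum(axis=1), P.sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        T = P * np.log2(P / np.outer(Pa, Pb))
    return float(np.nansum(T))  # 0 * log 0 terms drop out as NaN

def brute_force_dsib(p, Cu, Cv, grid=11):
    """Grid search over binary test-channels P(U|X) and P(V|Y) for a DSBS(p),
    maximizing I(U;V) subject to I(U;X) <= Cu and I(V;Y) <= Cv."""
    P_xy = np.array([[(1 - p) / 2, p / 2], [p / 2, (1 - p) / 2]])
    P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    best = 0.0
    ts = np.linspace(0.0, 1.0, grid)
    for a0 in ts:
        for a1 in ts:
            Qu = np.array([[1 - a0, a0], [a1, 1 - a1]])      # P(u|x), rows x
            if mi_bits(P_x[:, None] * Qu) > Cu:
                continue                                     # violates I(U;X) <= Cu
            for b0 in ts:
                for b1 in ts:
                    Qv = np.array([[1 - b0, b0], [b1, 1 - b1]])  # P(v|y), rows y
                    if mi_bits(P_y[:, None] * Qv) > Cv:
                        continue                             # violates I(V;Y) <= Cv
                    best = max(best, mi_bits(Qu.T @ P_xy @ Qv))
    return best
```

For p = 0 and C u = C v = 1, the identity channels are feasible and achieve the cutset bound I(U; V) = 1 bit; for any p, the result is bounded by min{C u , C v }.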

Alternating Maximization
In this section, we will evaluate the algorithm proposed in Section 5. We focus on the DSBS setting of Section 3 with various values of problem parameters.
First, we explore the convergence behavior of the proposed algorithm. Figure 25 shows the objective function I(U; V) at every iteration step for representative fixed channel-transition parameters p and constraints C u and C v . We observe slow convergence to the final value for p = 0 and C u = C v = 0.2, but once the constraints and the transition probability are increased, the algorithm converges much more rapidly. The non-monotonic behavior in some regimes is explained by Remark 8. In Figure 26, we show the respective test-channel probabilities α_0 + α_1, 1 − α_0, β_0 + β_1, and 1 − β_1. First, note that if α_0 + α_1 = 1, then P X|U is a BSC; similarly, if β_0 + β_1 = 1, then P Y|V is a BSC. Second, if 1 − α_0 = 1, then P X|U is a Z-channel; similarly, if 1 − β_1 = 1, then P Y|V is an S-channel. We observe that for p = 0, the test-channels P X|U and P Y|V converge to Z- and S-channels, respectively, while for all other settings the test-channels converge to BSCs. Finally, we compare the outcome of Algorithm 2 to the optimal solution obtained by the brute-force method, namely, evaluating (12) for every P U|X and P V|Y that satisfy the problem constraints. The results for various values of the channel parameters are shown in Figure 27. We observe that the proposed algorithm achieves the optimum for every DSBS parameter p and the representative constraints C u and C v .

Concluding Remarks
In this paper, we have considered the Double-Sided Information Bottleneck problem. Cardinality bounds on the representation alphabets were obtained for an arbitrary discrete bivariate source. When X and Y are binary, we have shown that taking binary auxiliary random variables is optimal. For the DSBS, we have shown that BSC test-channels are optimal when p → 0.5, and numerical simulations for arbitrary p indicate that Z- and S-channels are optimal for p = 0. As for the Gaussian bivariate source, a representation of I(U; V) utilizing Hermite polynomials was given, and the optimality of Gaussian test-channels was demonstrated for vanishing SNR. Moreover, we have constructed a lower bound, attained by deterministic quantizers, that outperforms the jointly Gaussian choice at high SNR. Note that the solution of the n-letter problem max (1/n) I(U; V) for U → X n → Y n → V under the constraints I(U; X n ) ≤ nC u and I(V; Y n ) ≤ nC v does not tensorize in general. For X n = Y n ∼ Ber ⊗n (0.5), we can easily achieve the cutset bound I(U; V)/n = min{C u , C v }. In addition, if time-sharing is allowed, the results change drastically.
Finally, we have proposed an alternating maximization algorithm based on the standard IB [1]. For the DSBS, it was shown that the algorithm converges to the global optimal solution.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Proposition 2
Before proceeding to the proof of Proposition 2, we need the following auxiliary results.

2.
Otherwise, it is strictly convex over [0, 1], or there are points p_l and p_u with 0 < p_l < p_u < 1 such that it is strictly convex on I_1 ≜ (0, p_l), strictly concave on I_2 ≜ (p_l, p_u), and strictly convex on I_3 ≜ (p_u, 1). (A2)
We postpone the proof of this lemma to Appendix K.
Lemma A2. The convex envelope of φ(·) at any point q ∈ [0, 1] can be obtained as a convex combination of only points in I 1 and I 3 .
We postpone the proof of this lemma to Appendix L and proceed to the proof of Proposition 2. Note that if F_T(x) is strictly convex in [0, h_b(q)], then, by the paragraph following Theorem 2.3 of [17], |U| = 2 suffices and we are done.
From now on, we consider the case where F_T(x) is not strictly convex. Then, there is an interval L ⊂ [0, h_b(q)] and a ∈ R_+ such that
Let t_0 and t_1 represent the columns of T corresponding to X = 0 and X = 1, respectively. Moreover, let q ≜ P(X = 0), and let p ≜ (p, p̄)^T be the probability vector of an arbitrary binary random variable, where p̄ ≜ 1 − p. Assume x_1, x_2 ∈ L and x_1 ≠ x_2. Then, there must be {α_1i, p_1i}_{i=1,2,3} and {α_2i, p_2i}_{i=1,2,3} such that
Lemma A3. The set {p_11, p_12, p_13, p_21, p_22, p_23} must contain at least three distinct points.
We postpone the proof of this lemma to Appendix M.
By Lemma A1, this can happen only if φ(p) is linear in p over [0, 1]. In particular: Note that for any choice of P_X|U=u Taking the expectation, we obtain: and this is attained by any choice of P_U|X satisfying H(X|U) = x. In particular, the choice U = X ⊕ Z, where Z ∼ Ber(δ) is statistically independent of X and δ is chosen such that H(X|U) = x, attains F_T(x). Thus, |U| = 2 suffices even if F_T(x) is not strictly convex.

Appendix B. Proof of Lemma 3
Let P U|X and P V|Y be the test-channels from X to U and from Y to V, respectively. The joint probability function of U and V can be expressed via Bayes' rule and the Markov chain U → X → Y → V as: Since I(U; V) = E[log(P_UV/(P_U × P_V))], we define K(u, v, p) as the ratio between the joint distribution of U and V and the respective product measure. Note that:
= 2(P_X|U(0|u) · p̄ · P_Y|V(0|v) + P_X|U(1|u) · p · P_Y|V(0|v)) (A21)
+ 2(P_X|U(0|u) · p · P_Y|V(1|v) + P_X|U(1|u) · p̄ · P_Y|V(1|v)).

Appendix C. Auxiliary Concavity Lemma
As a preliminary step to proving Theorem 1, we will need the following auxiliary lemma.
Since f(x) is twice differentiable, it is sufficient to show that f′(x) is decreasing. The first derivative is given by: Utilizing the inverse-function derivative property, we obtain: .
In addition, the second-order derivative is given by: Note that f′(x) = r(g(x)). Since g(x) is increasing, in order to show that f′(x) is decreasing, it suffices to show that r(t) is decreasing. The first-order derivative of r(t) is given by: Define α ≜ 1 − 2t, so that t = (1 − α)/2. Note that α ∈ [0, 1]. We obtain: Now, making use of the expansion log(1 + x) = ∑_{k=1}^∞ (−1)^{k+1} x^k/k, we have: Thus, where (a) follows since α > 0. Hence, r′(t) < 0 and f(x) is concave.

Appendix D. Proof of Theorem 1
Plugging p = 1/2 − ε, where ε ≜ 1/2 − p, in (14), we obtain: Now, we rewrite I(U; V) with explicit dependency on ε as: We would like to expand I(ε) in a Taylor series around ε = 0. Note that I(0) = 0 = I′(ε)|_{ε=0}. Furthermore, the second derivative is given by: Hence, with a similar relation for β_v. Therefore, where the first inequality follows since the function f : x → (1 − 2h_2^{−1}(x))^2 is concave by Lemma A4 together with Jensen's inequality, and the second inequality follows from the rate constraints.

Appendix E. Proof of Proposition 4
Suppose that the optimal test-channel P V|X is given by the following transition matrix: Assume, by contradiction, that the opposite optimal test-channel P U|X is symmetric to P V|X and is given by: Applying Bayes' rule to (A53), we obtain: It was shown in Section IV.D of [17] that, for fixed P V|X given by (A52), the optimal P X|U must satisfy the following equation: where α̃_0 ≜ aα_0 + bᾱ_0 and α̃_1 ≜ aα_1 + bᾱ_1. Plugging α_0 and α_1 from (A54) into (A55) results in a contradiction, thus establishing the proof of Proposition 4.
where Z_u ∼ N(0, 1), Z_v ∼ N(0, 1), Z_u ⊥ U, and Z_v ⊥ V. Due to Proposition 7, the mutual information for jointly Gaussian (U, X, Y, V) is given by: where (a) and (b) follow from the properties of the Mehler kernel [66]. By the mutual information constraints, we have: Hence, and Expanding I(ρ) in a Taylor series around ρ = 0 gives I(0) = 0 = dI(ρ)/dρ|_{ρ=0} and