Common Information Components Analysis

Wyner’s common information is a measure that quantifies and assesses the commonality between two random variables. Based on this, we introduce a novel two-step procedure to construct features from data, referred to as Common Information Components Analysis (CICA). The first step can be interpreted as an extraction of Wyner’s common information. The second step is a form of back-projection of the common information onto the original variables, leading to the extracted features. A free parameter γ controls the complexity of the extracted features. We establish that, in the case of Gaussian statistics, CICA precisely reduces to Canonical Correlation Analysis (CCA), where the parameter γ determines the number of CCA components that are extracted. In this sense, we establish a novel rigorous connection between information measures and CCA, and CICA is a strict generalization of the latter. It is shown that CICA has several desirable features, including a natural extension beyond two data sets.


I. INTRODUCTION
Understanding relations between two (or more) sets of variates is key to many tasks in data analysis and beyond. To approach this problem, it is natural to reduce each of the sets of variates separately in such a way that the reduced descriptions, or features, fully capture the commonality between the two sets, while suppressing aspects that are individual to each of the sets. This makes it possible to understand the relation between the data sets without obfuscation.
A popular framework to accomplish this task follows the classical viewpoint of dimensionality reduction and is referred to as Canonical Correlation Analysis (CCA) [1]. CCA seeks the best linear extraction, i.e., we consider linear projections of the original variates. In this case, the quality of the extraction is assessed via the resulting correlation coefficient. The result can be expressed directly via the singular value decomposition. Via the so-called kernel trick, this can be extended to cover arbitrary (fixed) function classes.
An alternative framework is built around the concept of maximal correlation. Here, one seeks arbitrary (not necessarily linear) remappings of the original data in such a way as to maximize their correlation coefficient. This perspective culminates in the well-known alternating conditional expectation (ACE) algorithm [2], but the problem does not admit a compact solution.
In both approaches, the commonality between variates is measured by correlation. By contrast, in this paper, we consider a different approach that measures commonality between variates via (relaxed Wyner's) Common Information [3], [4], a variant of a mutual information measure.

A. Contributions
The main contributions of our work are:
• The introduction of a novel algorithm, referred to as Common Information Components Analysis (CICA), to separately reduce each set of variates in such a way as to retain the commonalities between the sets of variates while suppressing their individual features. A conceptual sketch is given in Figure 1.
• The proof that for the special case of Gaussian variates, CICA reduces to CCA. Thus, CICA is a strict generalization of CCA.

B. Related Work
Connections between CCA and Wyner's common information have been explored in the past. It is well known that for Gaussian vectors, (standard, non-relaxed) Wyner's common information is attained by all of the CCA components together, see [5]. This has been further interpreted, see, e.g., [6]. To put our work into context, we note that it is only the relaxed Wyner's common information [3], [4] that makes it possible to conceptualize the sequential, one-by-one recovery of the CCA components, and thus, the spirit of dimensionality reduction.
Information measures have played a role in earlier considerations with some connections to dimensionality reduction and feature extraction. This includes independent components analysis (ICA) [7] and the information bottleneck [8], [9], amongst others. Finally, we note that an interpretation of CCA as a (Gaussian) probabilistic model was presented in [10].

C. Notation
A bold capital letter such as $\mathbf{X}$ denotes a random vector, and $\mathbf{x}$ its realization. A non-bold capital letter such as $K$ denotes a (fixed) matrix, and $K^H$ its Hermitian transpose. Specifically, $K_X$ denotes the covariance matrix of the random vector X. $K_{XY}$ denotes the covariance matrix between random vectors X and Y.

II. RELAXED WYNER'S COMMON INFORMATION
The main framework and underpinning of the proposed algorithm is Wyner's common information and its extension, which is briefly reviewed in the sequel, along with its key properties.

A. Wyner's Common Information
Wyner's common information is defined for two random variables (or random vectors) X and Y of arbitrary fixed joint distribution p(x, y).
Definition 1 (from [11]). For random variables X and Y with joint distribution p(x, y), Wyner's common information is defined as
$$C(X;Y) = \min_{p(w|x,y)\,:\,I(X;Y|W)=0} I(X,Y;W).$$
Basic properties are stated below in Lemma 1 (setting γ = 0). We note that explicit formulas for Wyner's common information are known only for a small number of special cases. The case of the doubly symmetric binary source is solved completely in [11] and can be written as
$$C(X;Y) = 1 + h_b(a_0) - 2\,h_b\!\left(\tfrac{1}{2}\left(1-\sqrt{1-2a_0}\right)\right),$$
where $h_b(\cdot)$ denotes the binary entropy function and $a_0$ denotes the probability that the two sources are unequal (assuming without loss of generality $a_0 \le \tfrac{1}{2}$). Further special cases of discrete-alphabet sources appear in [12].
Moreover, when X and Y are jointly Gaussian with correlation coefficient $\rho$, then
$$C(X;Y) = \frac{1}{2}\log\frac{1+|\rho|}{1-|\rho|}.$$
This case was solved in [13], [14] using a parameterization of conditionally independent distributions. We note that an alternative proof follows from the arguments presented in [3], [4].
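The two closed forms mentioned above can be evaluated numerically as a sanity check. The following is a minimal sketch, assuming the standard expressions $C = 1 + h_b(a_0) - 2h_b(a_1)$ with $a_1 = \tfrac12(1-\sqrt{1-2a_0})$ for the doubly symmetric binary source (in bits) and $C = \tfrac12\log\frac{1+|\rho|}{1-|\rho|}$ for the scalar Gaussian pair (in nats). The function names are illustrative and not taken from the paper.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy in bits, using the convention 0 * log(0) = 0."""
    p = float(p)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def wyner_ci_dsbs(a0):
    """Wyner's common information (bits) of a doubly symmetric binary source.

    a0 is the probability that the two binary sources differ, 0 <= a0 <= 1/2.
    """
    if not 0.0 <= a0 <= 0.5:
        raise ValueError("a0 must lie in [0, 1/2]")
    a1 = 0.5 * (1.0 - np.sqrt(1.0 - 2.0 * a0))
    return 1.0 + binary_entropy(a0) - 2.0 * binary_entropy(a1)

def wyner_ci_gaussian(rho):
    """Wyner's common information (nats) of a scalar jointly Gaussian pair."""
    r = abs(rho)
    if r >= 1.0:
        raise ValueError("|rho| must be strictly less than 1")
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

def mutual_information_gaussian(rho):
    """Mutual information (nats) of a scalar jointly Gaussian pair."""
    return -0.5 * np.log(1.0 - rho ** 2)
```

In both cases the common information is at least as large as the mutual information, and the two coincide at the extremes (identical or independent sources for the binary case, ρ = 0 for the Gaussian case).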

B. Relaxed Wyner's Common Information
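Definition 2 (from [3]). For random variables X and Y with joint distribution p(x, y), the relaxed Wyner's common information is defined as (for γ ≥ 0)
$$C_\gamma(X;Y) = \min_{p(w|x,y)\,:\,I(X;Y|W)\le\gamma} I(X,Y;W). \tag{3}$$
Lemma 1 (from [4]). The relaxed Wyner's common information satisfies the following properties:
1) For discrete X and Y, the cardinality of W may be restricted to [...]
Explicit formulas for the relaxed Wyner's common information are not currently known for most p(x, y). A notable exception is when X and Y are jointly Gaussian random vectors of length n. Denote the covariance matrices of the vectors X and Y by $K_X$ and $K_Y$, respectively, and the covariance matrix between X and Y by $K_{XY}$. Then (see [4]),
$$C_\gamma(X;Y) = \min_{\gamma_1,\ldots,\gamma_n \ge 0\,:\,\sum_{i=1}^{n}\gamma_i = \gamma}\ \sum_{i=1}^{n} C_{\gamma_i}(X_i;Y_i), \tag{5}$$
where
$$C_{\gamma_i}(X_i;Y_i) = \frac{1}{2}\log^{+}\frac{(1+\rho_i)\left(1-\sqrt{1-e^{-2\gamma_i}}\right)}{(1-\rho_i)\left(1+\sqrt{1-e^{-2\gamma_i}}\right)}, \tag{6}$$
with $\log^{+}(x) = \max(\log x, 0)$, and $\rho_i$ (for i = 1, . . ., n) are the singular values of $K_X^{-1/2} K_{XY} K_Y^{-1/2}$. By contrast, for the doubly symmetric binary source, the relaxed Wyner's common information is currently unknown (a bound and conjecture appear in [4]).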
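To make the role of γ concrete in the scalar Gaussian case, the sketch below evaluates a closed form for a single pair with correlation coefficient ρ. It assumes the expression $\tfrac12\log^+\frac{(1+\rho)(1-s)}{(1-\rho)(1+s)}$ with $s = \sqrt{1-e^{-2\gamma}}$, which should be checked against [4] before being relied upon. Under this assumption, γ = 0 recovers the non-relaxed Wyner's common information and γ = I(X; Y) gives zero. The function name is hypothetical.

```python
import numpy as np

def relaxed_wyner_ci_gaussian(rho, gamma):
    """Relaxed Wyner's common information (nats) for a scalar Gaussian pair.

    Assumed closed form: 0.5 * log+ of
    (1 + |rho|)(1 - s) / ((1 - |rho|)(1 + s)), with s = sqrt(1 - exp(-2 gamma)).
    """
    r = abs(rho)
    if r >= 1.0:
        raise ValueError("|rho| must be strictly less than 1")
    if gamma < 0.0:
        raise ValueError("gamma must be non-negative")
    s = np.sqrt(1.0 - np.exp(-2.0 * gamma))
    num = (1.0 + r) * (1.0 - s)
    den = (1.0 - r) * (1.0 + s)
    # log+ clipping: the value is zero once the ratio drops to 1 or below.
    if num <= den:
        return 0.0
    return 0.5 * np.log(num / den)
```

For example, with ρ = 0.5 the value decreases from about 0.549 nats at γ = 0 to zero at γ = I(X; Y) ≈ 0.144 nats.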

III. THE ALGORITHM
In this section, we present the proposed algorithm in the idealized setting of unlimited data. Specifically, for the proposed algorithm, this means that we assume perfect knowledge of the data distribution p(x, y).

A. High-level Description
The idea of the proposed algorithm is to estimate the relaxed Wyner's Common Information of Equation (3) between the information sources (data sets) at the chosen level γ. This estimate will come with an associated conditional distribution $p_\gamma(w|x, y)$. Obtaining the dimension-reduced versions can then be thought of as a type of projection of the resulting random variable W back on X and Y, respectively. For the case of Gaussian statistics, this can be made precise.

B. Main Steps of the Algorithm
The algorithm proposed here starts from the joint distribution of the data, p(x, y). Estimates of this distribution can be obtained from data samples $X^n$ and $Y^n$ via standard techniques. The main steps of the procedure can then be described as follows:
Algorithm 1 (CICA).
1) Select a real number γ, where 0 ≤ γ ≤ I(X; Y). This is the compression level: A low value of γ represents low compression, and thus, many components are retained. A high value of γ represents high compression, and thus, only a small number of components are retained.
2) Solve the relaxed Wyner's common information problem,
$$\min_{p(w|x,y)} I(X,Y;W) \quad \text{such that} \quad I(X;Y|W) \le \gamma, \tag{7}$$
leading to an associated conditional distribution $p_\gamma(w|x, y)$.
3) The dimension-reduced data sets are
a) Version 1: MAP (maximum a posteriori):
• $u(x) = \arg\max_w p_\gamma(w|x)$
• $v(y) = \arg\max_w p_\gamma(w|y)$
b) Version 2: Conditional Expectation:
• $u(x) = E[W \mid X = x]$
• $v(y) = E[W \mid Y = y]$
c) Version 3: [...]
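For finite alphabets, Step 3 of the algorithm amounts to marginalizing a given conditional distribution p(w|x, y) and reading off the features. The sketch below illustrates the MAP and conditional-expectation versions under several assumptions that are not part of the paper: the distributions are stored as NumPy arrays, all marginals are strictly positive, and the symbols of W are identified with the integers 0, 1, ... so that a conditional mean is defined. Solving the optimization in Step 2 is not attempted here, and the function name is hypothetical.

```python
import numpy as np

def cica_features_discrete(p_xy, p_w_given_xy):
    """Step 3 of the algorithm for finite alphabets (illustrative sketch).

    p_xy:          array of shape (nx, ny), joint pmf of (X, Y)
    p_w_given_xy:  array of shape (nw, nx, ny), conditional pmf of W
    Assumes strictly positive marginals p(x) and p(y).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_w_given_xy = np.asarray(p_w_given_xy, dtype=float)

    p_wxy = p_w_given_xy * p_xy[None, :, :]          # joint p(w, x, y)
    p_w_given_x = p_wxy.sum(axis=2) / p_xy.sum(axis=1)[None, :]
    p_w_given_y = p_wxy.sum(axis=1) / p_xy.sum(axis=0)[None, :]

    # Version 1 (MAP): most likely value of W given x, respectively y.
    u_map = p_w_given_x.argmax(axis=0)
    v_map = p_w_given_y.argmax(axis=0)

    # Version 2 (conditional expectation), with W taking values 0..nw-1
    # (an assumption made only so that the mean is defined).
    w_values = np.arange(p_w_given_xy.shape[0], dtype=float)
    u_mean = w_values @ p_w_given_x
    v_mean = w_values @ p_w_given_y
    return u_map, v_map, u_mean, v_mean
```

As a usage illustration, taking a doubly symmetric binary source with crossover probability 0.1 and the stand-in choice W = X (chosen only to exercise the code, not claimed to solve Step 2) gives MAP features equal to the identity on both sides and conditional means v = (0.1, 0.9).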

C. A binary toy example
Let us illustrate the proposed algorithm via a simple toy example. Consider the vector $(\tilde{X}_1, \tilde{X}_2, \tilde{Y}_1, \tilde{Y}_2)$ of binary random variables. Suppose that $(\tilde{X}_1, \tilde{Y}_1)$ is a doubly symmetric binary source (i.e., $\tilde{X}_1$ is uniform, and $\tilde{Y}_1$ is the result of passing $\tilde{X}_1$ through a binary symmetric ("bit-flipping") channel) while $\tilde{X}_2$ and $\tilde{Y}_2$ are independent binary uniform random variables (also independent of $(\tilde{X}_1, \tilde{Y}_1)$). We will then form the vectors X and Y as [...] and [...], where ⊕ denotes modulo-2 addition, as usual.
Observe that any two of the four entries in these two vectors are (pairwise) independent binary uniform random variables. Hence, the overall covariance matrix of the merged random vector $(X^T, Y^T)^T$ is merely a scaled identity matrix, implying that CCA does not extract any common component. By contrast, for the CICA algorithm (with γ = 0 and using the MAP version), an optimal solution is to reduce X to $\tilde{X}_1$ and Y to $\tilde{Y}_1$. This captures all the dependence between the vectors X and Y, which appears to be the most desirable outcome.
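The claims of the toy example can be checked by exhaustive enumeration. The sketch below assumes one construction that is consistent with the description, namely $X = (\tilde{X}_1 \oplus \tilde{X}_2, \tilde{X}_2)$ and $Y = (\tilde{Y}_1 \oplus \tilde{Y}_2, \tilde{Y}_2)$; this specific form is an assumption made for illustration, and all function names are hypothetical. It confirms that the covariance matrix of the stacked vector is a scaled identity while the mutual information between X and Y is strictly positive and is fully retained by the one-bit features.

```python
import itertools
import numpy as np

def toy_outcomes(a0=0.1):
    """Enumerate (probability, x, y) for the assumed toy construction."""
    outcomes = []
    for x1t, x2t, y1t, y2t in itertools.product((0, 1), repeat=4):
        # (x1t, y1t) is a doubly symmetric binary source with crossover a0;
        # x2t and y2t are independent uniform bits.
        p_pair = 0.5 * ((1.0 - a0) if x1t == y1t else a0)
        prob = p_pair * 0.25
        x = (x1t ^ x2t, x2t)   # assumed form of the vector X
        y = (y1t ^ y2t, y2t)   # assumed form of the vector Y
        outcomes.append((prob, x, y))
    return outcomes

def stacked_covariance(outcomes):
    """Covariance matrix of the stacked vector (X, Y)."""
    probs = np.array([p for p, _, _ in outcomes])
    vals = np.array([x + y for _, x, y in outcomes], dtype=float)
    centered = vals - probs @ vals
    return (centered * probs[:, None]).T @ centered

def mutual_information_bits(outcomes, fx, fy):
    """Mutual information (bits) between the features fx(X) and fy(Y)."""
    joint, px, py = {}, {}, {}
    for p, x, y in outcomes:
        key = (fx(x), fy(y))
        joint[key] = joint.get(key, 0.0) + p
    for (a, b), p in joint.items():
        px[a] = px.get(a, 0.0) + p
        py[b] = py.get(b, 0.0) + p
    return sum(p * np.log2(p / (px[a] * py[b]))
               for (a, b), p in joint.items() if p > 0.0)
```

With a crossover probability of 0.1, the features `x[0] ^ x[1]` and `y[0] ^ y[1]` (which equal $\tilde{X}_1$ and $\tilde{Y}_1$ under the assumed construction) carry the same mutual information as the full vectors, whereas any single coordinate of X is independent of any single coordinate of Y.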

IV. FOR GAUSSIAN, CICA IS CCA
In this section, we consider the proposed CICA algorithm in the idealized setting where the data distribution p(x, y) is known exactly. Specifically, we establish that if p(x, y) is a (multivariate) Gaussian distribution, then the classic CCA is a solution to all versions of the proposed CICA algorithm. This is the main technical contribution of the present work.
CCA is perhaps best described by first changing coordinates,
$$\hat{X} = K_X^{-1/2} X, \qquad \hat{Y} = K_Y^{-1/2} Y.$$
With this, the covariance matrix of the vector $\hat{X}$ is the identity matrix, and so is the covariance matrix of the vector $\hat{Y}$. CCA is then easily described by considering the covariance matrix between these two vectors,
$$K_{\hat{X}\hat{Y}} = K_X^{-1/2} K_{XY} K_Y^{-1/2}.$$
A brief overview is given in Appendix A. Let us denote the singular value decomposition of this matrix by
$$K_{\hat{X}\hat{Y}} = U \Sigma V^H,$$
where Σ contains, on its diagonal, the ordered singular values of this matrix, denoted by $\rho_1 \ge \rho_2 \ge \ldots \ge \rho_n$. CCA then performs the dimensionality reduction
$$u(x) = U_k^H K_X^{-1/2} x, \tag{14}$$
$$v(y) = V_k^H K_Y^{-1/2} y, \tag{15}$$
where the matrix $U_k$ contains the first k columns of U (that is, the k left singular vectors corresponding to the largest singular values), and the matrix $V_k$ the respective right singular vectors. We refer to these as the "top k CCA components."
Theorem 1. Let X and Y be jointly Gaussian random vectors. Then, the top k CCA components are a solution to all three versions of Algorithm 1, and γ controls the number k as follows: [...] where [...]
Remark 1. Note that k(γ) is a non-increasing, integer-valued function.
This theorem is a consequence of the main result in [3]. A proof outline is provided in Appendix B.
As mentioned earlier, the connection between CCA and (standard non-relaxed) Gaussian Wyner's common information is well known [5]. What is new in the present paper is the extension of this insight to relaxed Wyner's common information. This extension makes it possible to extract the CCA components one-by-one via the compression parameter γ. Evidently, the CICA algorithm only makes sense because we can tune how much common information we wish to extract. In this sense, the choice γ = 0 (the non-relaxed case) is not interesting since it amounts to a one-to-one transform of the original data (up to completely independent portions), and thus, fails to capture the spirit of "dimensionality reduction."

V. EXTENSION TO MORE THAN TWO SOURCES
It is unclear how one would extend CCA to more than two databases. By contrast, for CICA, this extension is conceptually straightforward. The definition of relaxed Wyner's common information is readily extended to the general case:
Definition 3 (Relaxed Wyner's Common Information for M variables). For a fixed probability distribution $p(x_1, x_2, \ldots, x_M)$, we define [...] where the minimum is over all probability distributions $p(w, x_1, x_2, \ldots, x_M)$ with marginal $p(x_1, x_2, \ldots, x_M)$.
Hence, to extend CICA (Algorithm 1) to the case of M databases, it now suffices to replace Step 2) with Definition 3. In Step 3), for all three versions, it is immediately clear how they can be extended. For example, for Version 1), we use
$$u_i(x_i) = \arg\max_w p_\gamma(w|x_i)$$
for i = 1, 2, . . ., M.
It will be shown elsewhere how one can obtain the analogs of Equations (5)-(6) for this generalized case, and thus, an extended version of Theorem 1.

VI. CONCLUDING REMARKS AND FUTURE WORK
In a practical setting, one does not have access to the correct data distribution p(x, y). A first approach is to simply work with an estimate of this distribution, based on the data available. But a more interesting implementation is to combine the estimation step with the optimization step. A fast algorithmic implementation will be presented elsewhere.
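For discrete data, the simplest way to obtain such an estimate is the empirical (plug-in) joint distribution. The sketch below is one such estimator under the assumption that the samples are integer-coded symbols in known finite alphabets; it is not the combined estimation and optimization procedure alluded to above, and the function name is hypothetical.

```python
import numpy as np

def empirical_joint(x_samples, y_samples, nx, ny):
    """Plug-in estimate of p(x, y) from paired integer-coded samples.

    x_samples, y_samples: sequences of equal length with values in
    range(nx) and range(ny), respectively.
    """
    x = np.asarray(x_samples, dtype=int)
    y = np.asarray(y_samples, dtype=int)
    if x.shape != y.shape or x.size == 0:
        raise ValueError("need two non-empty sample sequences of equal length")
    counts = np.zeros((nx, ny))
    # Unbuffered accumulation so that repeated (x, y) pairs are all counted.
    np.add.at(counts, (x, y), 1.0)
    return counts / counts.sum()
```

The resulting array can be passed wherever the known distribution p(x, y) is assumed, keeping in mind that cells with few or no observations make the estimate unreliable.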

APPENDIX A CCA
A brief review of CCA [1] is presented, mostly in view of the proof of Theorem 1, given below in Appendix B. Let X and Y be zero-mean real-valued random vectors with covariance matrices $K_X$ and $K_Y$, respectively. Moreover, let
$$\hat{X} = K_X^{-1/2} X, \tag{19}$$
$$\hat{Y} = K_Y^{-1/2} Y. \tag{20}$$
With this, the covariance matrix of the vector $\hat{X}$ is the identity matrix, and so is the covariance matrix of the vector $\hat{Y}$.
CCA seeks to find vectors u and v such as to maximize the correlation between $u^H \hat{X}$ and $v^H \hat{Y}$, that is,
$$\max_{u,v} \frac{E\left[u^H \hat{X} \hat{Y}^H v\right]}{\sqrt{E\left[|u^H \hat{X}|^2\right]}\,\sqrt{E\left[|v^H \hat{Y}|^2\right]}}, \tag{21}$$
which can be rewritten as
$$\max_{u,v} \frac{u^H K v}{\|u\|\,\|v\|},$$
where
$$K = K_X^{-1/2} K_{XY} K_Y^{-1/2}.$$
Note that this expression is invariant to arbitrary (separate) positive scaling of u and v. To obtain a unique solution, we could choose to impose that both vectors be unit vectors,
$$\max_{u,v\,:\,\|u\| = \|v\| = 1} u^H K v.$$
From Cauchy-Schwarz, for a fixed u, the maximizing (unit-norm) v is given by
$$v = \frac{K^H u}{\|K^H u\|},$$
or equivalently, for a fixed v, the maximizing (unit-norm) u is given by
$$u = \frac{K v}{\|K v\|}.$$
Plugging in the latter, we obtain
$$\max_{v\,:\,\|v\| = 1} \frac{v^H K^H K v}{\|K v\|},$$
or, dividing through,
$$\max_{v\,:\,\|v\| = 1} \|K v\|.$$
The solution to this problem is well known: v is the right singular vector corresponding to the largest singular value of the matrix K. Evidently, u is the corresponding left singular vector. Restarting again from Equation (21), but restricting to vectors that are orthogonal to the optimal choices of the first round, leads to the second CCA components, and so on.
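The procedure reviewed in this appendix (whiten both vectors, then take the singular value decomposition of the whitened cross-covariance) can be sketched in a few lines of NumPy. The sketch assumes real-valued data and strictly positive-definite covariance matrices $K_X$ and $K_Y$; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def inv_sqrt_pd(k):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(k)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def cca_top_k(k_x, k_y, k_xy, k):
    """Top-k CCA projections from (known) covariance matrices.

    Returns (a, b, rho): the features are u(x) = a @ x and v(y) = b @ y,
    and rho holds all singular values (canonical correlations), sorted
    in decreasing order.
    """
    wx = inv_sqrt_pd(np.asarray(k_x, dtype=float))   # K_X^{-1/2}
    wy = inv_sqrt_pd(np.asarray(k_y, dtype=float))   # K_Y^{-1/2}
    k_hat = wx @ np.asarray(k_xy, dtype=float) @ wy  # whitened cross-covariance
    u, rho, vh = np.linalg.svd(k_hat)
    a = u[:, :k].T @ wx      # U_k^T K_X^{-1/2}
    b = vh[:k, :] @ wy       # V_k^T K_Y^{-1/2}
    return a, b, rho
```

By construction, the extracted features have identity covariance on each side, and their cross-covariance is the diagonal matrix of the k largest singular values.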

APPENDIX B PROOF OUTLINE FOR THEOREM 1
In the case of Gaussian vectors, the solution to the optimization problem in Equation (3) is most easily described in two steps. First, we apply the change of basis indicated in Equations (19)-(20). This is a one-to-one transform, leaving all information expressions in Equation (3) unchanged. In the new basis, we have n independent pairs. When X and Y consist of independent pairs, the solution to the optimization problem in Equation (3) can be reduced to n separate scalar optimizations, see [4, Theorem 3] (also quoted above in Lemma 1, Item 8). The remaining crux then is solving the scalar Gaussian version of the optimization problem in Equation (3). This is done in [4, Theorem 4] via an argument of factorization of convex envelope. The full solution to the optimization problem is given in Equations (5)-(6). The remaining allocation problem over the non-negative numbers $\gamma_i$ can be shown to lead to a water-filling solution, see [4, Section IV]. More explicitly, to understand this solution, start by setting γ = I(X; Y). Then, the corresponding $C_\gamma(X; Y) = 0$ and the optimizing distribution $p_\gamma(w|x, y)$ trivializes. Now, as we lower γ, the various terms in the sum in Equation (5) start to become non-zero, starting with the term with the largest correlation coefficient $\rho_1$. Hence, an optimizing distribution $p_\gamma(w|x, y)$ can be expressed as
$$W = U_k^H K_X^{-1/2} X + V_k^H K_Y^{-1/2} Y + Z,$$
where the matrices $U_k$ and $V_k$ are precisely the top k CCA components (see Equations (14)-(15) and the following discussion), and Z is additive Gaussian noise with mean zero, independent of X and Y.
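The water-filling behavior described above can be observed numerically for two independent pairs. The following sketch performs a brute-force grid search over the split $\gamma_1 + \gamma_2 = \gamma$, assuming the scalar closed form $\tfrac12\log^+\frac{(1+\rho)(1-s)}{(1-\rho)(1+s)}$ with $s = \sqrt{1-e^{-2\gamma}}$ (to be checked against [4]). The function names and the grid resolution are illustrative choices and not part of the paper.

```python
import numpy as np

def scalar_relaxed_ci(rho, gamma):
    """Assumed scalar closed form (nats), clipped at zero (log+)."""
    s = np.sqrt(1.0 - np.exp(-2.0 * gamma))
    num = (1.0 + rho) * (1.0 - s)
    den = (1.0 - rho) * (1.0 + s)
    return 0.5 * np.log(num / den) if num > den else 0.0

def best_split(rho1, rho2, gamma, grid=2001):
    """Grid search over gamma_1 + gamma_2 = gamma for two independent pairs.

    Returns (total, term1, term2) at the best grid point.
    """
    best = None
    for g1 in np.linspace(0.0, gamma, grid):
        g2 = max(gamma - g1, 0.0)
        t1 = scalar_relaxed_ci(rho1, g1)
        t2 = scalar_relaxed_ci(rho2, g2)
        if best is None or t1 + t2 < best[0]:
            best = (t1 + t2, t1, t2)
    return best
```

For instance, with correlation coefficients 0.9 and 0.5 and γ set 0.1 nats below I(X; Y), the search places essentially the entire contribution on the pair with the larger correlation coefficient, in line with the one-by-one appearance of the components described above.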
For the algorithm, we need the corresponding conditional marginals, $p_\gamma(w|x)$ and $p_\gamma(w|y)$. By symmetry, it suffices to prove one formula. Changing basis as in Equations (19)-(20), we can write [...] Finally, note that Equation (25) can be read as [...] for some real-valued constant α. Thus, combining the top k CCA components, [...] where D is a diagonal matrix. Hence, [...] where D is the diagonal matrix [...] This is precisely the top k CCA components (note that the solution to the CCA problem (21) is only specified up to a scaling). This establishes the theorem for the case of Version 2) of the proposed algorithm. Clearly, it also establishes that $p_\gamma(w|x)$ is a Gaussian distribution with mean given by (37), thus establishing the theorem for Version 1) of the proposed algorithm. The proof for Version 3) follows along similar lines and is thus omitted.