Nonlinear Canonical Correlation Analysis:A Compressed Representation Approach

Canonical Correlation Analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Nonlinear CCA extends this notion to a broader family of transformations, which are more powerful in many real-world applications. Given the joint probability, the Alternating Conditional Expectation (ACE) algorithm provides an optimal solution to the nonlinear CCA problem. However, it suffers from limited performance and an increasing computational burden when only a finite number of samples is available. In this work, we introduce an information-theoretic compressed representation framework for the nonlinear CCA problem (CRCCA), which extends the classical ACE approach. Our suggested framework seeks compact representations of the data that allow a maximal level of correlation. This way, we control the trade-off between the flexibility and the complexity of the model. CRCCA provides theoretical bounds and optimality conditions, as we establish fundamental connections to rate-distortion theory, the information bottleneck and remote source coding. In addition, it allows a soft dimensionality reduction, as the compression level is determined by the mutual information between the original noisy data and the extracted signals. Finally, we introduce a simple implementation of the CRCCA framework, based on lattice quantization.


Introduction
Canonical correlation analysis (CCA) seeks linear projections of two given random vectors so that the extracted (possibly lower dimensional) variables are maximally correlated [1].CCA is a powerful tool in the analysis of paired data (X, Y), where X and Y are two different representations of the same set of objects.It is commonly used in a variety of applications, such as speech recognition [2], natural language processing [3], cross-modal retrieval [4], multimodal signal processing [5], computer vision [6] and many others.
The CCA framework has gained a considerable amount of attention in recent years due to several important contributions (see Section 2).However, one of its major drawbacks is its restriction to linear projections, whereas many real-world setups exhibit highly non-linear relationships.To overcome this limitation, several non-linear CCA extensions have been proposed.Van Der Burg and De Leeuw [7] studied a non-linear CCA problem under a specific family of transformations.Breiman and Friedman [8] considered a generalized non-linear CCA setup, in which the transformations are not restricted to any model.They derived the optimal solution for this problem, under a known joint probability.For a finite sample size, Akaho [9] suggested a kernel version of CCA (KCCA) in which non-linear mappings are chosen from two reproducing kernel Hilbert spaces (RKHS).Wang [10] and Kalmi et al. [11] considered a Bayesian approach to non-linear CCA and provided inference algorithms and variational approximations that learn the structure of the underlying model.Later, Andrew et al. [12] introduced Deep CCA (DCCA), where the projections are obtained from two deep neural networks that are trained to output maximally correlated signals.In recent years, non-linear CCA gained a renewed growth of interest in the machine learning community (for example, [13]).
Non-linear CCA methods are advantageous over linear CCA in a range of applications (for example, [14,15]).However, two major drawbacks typically characterize most methods.First, although there exist several studies on the statistical properties of the linear CCA problem ( [16,17]), the non-linear case remains quite unexplored with only few recent studies (for example, [18]).Second, current non-parametric (and henceforth non-linear) CCA methods are typically computationally demanding.While there exist several parametric non-linear CCA methods that address this problem, completely non-parametric methods (in which the solution is not restricted to a parametric family, nor utilize parametric density estimation, as in [13]) are often impractical to apply to large data sets.
In this work we consider a compressed representation formulation to the non-linear CCA framework (namely, CRCCA), which demonstrates many desirable properties.Our suggested formulation regularizes the non-linear CCA problem in an explicit and theoretically sound manner.In addition, our suggested scheme drops the traditional hard dimensionality reduction of the CCA framework and replaces it with constraints on the mutual information between the given noisy data and the extracted signals.This results in a soft dimensionality reduction, as the compression level is controlled by the constraints of our approach.
The CRCCA framework provides theoretical bounds and optimality conditions, as we establish fundamental connections to the theory of rate-distortion (see, e.g., Chapter 10 of [19]) and the information bottleneck [20].Given the joint probability, we achieve a coupled variant of the classical Distortion-Rate problem, where we maximize the correlation between the representation with constraints on the representation rates.Furthermore, we provide an empirical solution for the CRCCA problem, when only a finite set of samples is available.Our suggested solution is both theoretically sound and computationally efficient.A Matlab implementation of our suggested approach is publicly available at the first author's webpage 1 .
It is important to mention that the CRCCA framework can also be interpreted from a classical information theory view point.Consider two disjoint terminals X and Y, where each vector is to be transmitted (independently of the other) through a different rate-limited noiseless channel.Then, the CRCCA framework seeks minimal transmission rates, given a prescribed correlation between the received signals.
The rest of this manuscript is organized as follows.In Section 2 we briefly review relevant concepts and previous studies of the CCA problem.In Section 3 we formally introduce our suggested CRCCA framework.Then, we describe our suggested solution in Section 4. In Section 5 we drop the known probability assumption and discuss the finite sample-size regime.We conclude with a series of synthetic and real-world experiments in Section 6.

Previous Work
Let X ∈ R d X and Y ∈ R d Y be two random vectors.For the simplicity of the presentation we assume that X and Y are also zero-mean.The CCA framework seeks two transformations, U = φ(X) and where d ≤ min (d X , d Y ).Notice that the notation U = φ(X) implies that all X variables are transformed simultaneously by a single multivariate transformation φ.We refer to (1) as CCA, where linear CCA [1], is under the assumption that φ(•) and ψ(•) are linear (U = AX and V = BY).
The solution to the linear CCA problem can be obtained from the singular value decomposition of the matrix Σ , where Σ X , Σ Y and Σ XY are the covariance matrices of X, Y and the cross-covariance of X and Y, respectively.In practice, the covariance matrices are typically replaced by their empirical estimates, obtained from a finite set of samples.The linear CCA has been studied quite extensively in recent years, and has gained popularity due to several contributions.Ter Braak [21] and Graffelman [22] showed that CCA can be enhanced with powerful graphics called biplots.Witten et al. [23] introduced a regularized variant of CCA, which improves generalization.This method was further extended to a sparse CCA setup [23,24].More recently, Graffelman et al. [25] adapted a compositional setting for linear CCA using non-linear log-ratio transformations.
Non-linear CCA is a natural extension to the linear CCA problem.Here φ and ψ are not restricted to be linear projections of X and Y.This problem was first introduced by Lancaster [26] and Hannan [27] and was later studied in different setups, under a variety of models (for example, [7,28]).It is important to emphasize that non-linear CCA may also be viewed as a nonlinear multivariate analysis technique, with several important theoretical and algorithmic contributions [29].A major milestone in the study non-linear CCA was achieved by Breiman and Friedman [8].In their work, Breiman and Friedman showed that the optimal solution to (1), for d = 1, may be obtained by a simple alternating conditional expectation procedure, denoted ACE.Their results were later extended to any d ≤ min (d X , d Y ), as shown, for example, in [30].Here, we briefly review the ACE framework.
Let us begin with the first set of components, i = 1.Assume that V 1 = ψ 1 (Y) is fixed, known and satisfies the constraints.Then, the optimization problem (1) is only with respect to φ 1 and by Cauchy-Schwarz inequality, we have that with equality if and only if Therefore, choosing a constant c to satisfy the unit variance constraint we achieve In the same manner we may fix φ 1 (X) and attain ψ 1 (Y) = E(φ 1 (X)|Y)/ var(E(φ 1 (X)|Y)).These coupled equations are in fact necessary conditions for the optimality of φ 1 and ψ 1 , leading to an alternating procedure in which at each step we fix one transformation and optimize with respect to the other.Once φ 1 and ψ 1 are derived, we continue to the second set of components, i = 2, under the constraints that they are uncorrelated with This procedure continues for the remaining components.Breiman and Friedman [8] proved that ACE converges to the global optimum.In practice, the conditional expectations are estimated from training data {x i , y i } n i=1 using nonparametric regression, usually in the form of k-nearest neighbors (k-NN).Since this computationally demanding step has to be executed repeatedly, ACE and its extensions are impractical for large data analysis.Kernel CCA (KCCA) is an alternative non-linear CCA framework [9,[31][32][33].In KCCA , φ ∈ A and ψ ∈ B, where A and B are two reproducing kernel Hilbert spaces (RKHSs) associated with user-specified kernels k x (•, •) and k y (•, •).By the representer theorem [34], the projections can be written in terms of the training samples, {x i , y i } n i=1 , as U i = ∑ n j=1 a ji k x (X, x j ) and V i = ∑ n j=1 b ji k y (Y, y j ) for some coefficients a ji and b ji .Denote the kernel matrices as K x = [k x (x i , x j )] and K y = [k y (y i , y j )].Then, the optimal coefficients are computed from the eigenvectors of the matrix (K x + r x I) −1 Ky(K y + r y I) −1 Kx where r x and r y are positive parameters.Computation of the exact solution is intractable for large datasets due to the memory cost of storing the kernel matrices and the time complexity of solving dense eigenvalue systems.To address this caveat, Bach and Jordan [35] and Hardoon et al. [14] suggested several low-rank matrix approximations.Later, Halko et al. [36] considered randomized singular value decomposition (SVD) methods to further reduce the computational burden of KCCA.Additional modifications were explored by Arora and Livescu [2].
More recently, Andrew et al. [12] introduced Deep CCA (DCCA).Here, φ ∈ A and ψ ∈ B, where A and B are families of functions that can be implemented using two Deep Neural Networks (DNNs) of predefined architectures.As many DNN frameworks, DCCA is a scalable solution which demonstrates favorable generalization abilities in large data problems.Wang et al. [37] extended DCCA by introducing autoencoder regularization terms, implemented by additional DNNs.Specifically, Wang et al. [37] maximize the correlation between U = φ(X) and V = ψ(Y) where φ, ψ are DNNs (similarly to DCCA), while regulating the squared reconstruction error, ||X − φ(U)|| 2  2 and ||Y − ψ(V)|| 2  2 where φ and ψ are additional DNNs, optimized over possibly different architectures than φ and ψ.Wang et al. [15] called this method Deep Canonically Correlated Autoencoders (DCCAE) and demonstrated its abilities in a variety of large scale problems.Importantly, they showed that the additional regularization terms improve upon the original DCCA framework as they regulate its flexibility.Unfortunately, as most artificial neural network methods, both DCCA and DCCAE provide a limited understanding of the problem.

Problem Formulation
Let X ∈ R d X and Y ∈ R d Y be two random vectors.For the simplicity of the presentation we assume that d X = d Y = d.It is later shown that our derivation holds for any d X = d Y and d ≥ 0. Let φ : R d X → R d and ψ : R d y → R d be two transformations.Let U = φ(X) and V = ψ(Y) be two vectors in R d .Notice that φ and ψ are not necessarily deterministic transformations, in the sense that the conditional distributions p(u|x) and p(v|y) may be non-degenerate distributions.In this work we generalize the classical CCA formulation (1), as we impose additional mutual information constraints on the transformations that we apply.Specifically, we are interested in U and V such that max for some fixed R U and R V , where I(X; Y) = x,y p(x, y) log p(x,y) p(x)p(y) dxdy is the mutual information of X and Y and p(x, y) is the joint distribution of X and Y.The mutual information constraints regulate the transformations that we apply, so that in addition to maximizing the sum of correlations (as in (1)), U and V are also restricted to be compressed representations of X and Y, respectively.In other words, R U and R V define the amount of information preserved from the original vectors.Notice that as R U and R V grow, (3) degenerates back to (1).Further, it is important to emphasize that while φ and ψ are not necessarily deterministic, most applications do impose such a restriction.In this case, the mutual information constraints are unbounded if U and V take values over a finite support.In this case, I(X; U) = H(U) and I(Y; V) = H(V), where H(U) = − ∑ u p(u) log p(u) is the entropy of U.
The use of mutual information as a regularization term is one of the corner stones of information theory.The Minimum Description Length (MDL) principle [38] suggests that the best representation of a given set of data is the one that leads to a minimal coding length.This idea has inspired the use of mutual information as a regularization term in many learning problems, mostly in the context of rate distortion (Chapter 10 of [19]), the information bottleneck framework [20], and different representation learning problems [39].Recently, Vera et al. [40] showed that a mutual information constraint explicitly controls the generalization gap when considering a cross-entropy loss function.We further discuss the desirable properties of the mutual information constraints in Section 5.3.
Notice that the regularization terms in the DCCAE framework (discussed above) is related to our suggested mutual information constraints.However, it is important to emphasize the difference between the two.The autoencoders in the DCCAE framework suggest an explicit reconstruction architecture, which strives to maintain a small reconstruction error with the original representation.On the other hand, our mutual information constraints regulate the ability to reconstruct the original representation.In other words, our suggested framework restricts the amount of information that is preserved with the original representation while DCCAE minimizes the reconstruction error with the original representation.
We refer to our constrained optimization problem (3) as Compressed Representation CCA (CRCCA).Notice that traditionally, CCA refers to linear transformations.Here we again consider CCA in the wider sense, as the transformations may be non-linear and even non-deterministic.Further, notice that we may interpret (3) as a soft version of the CCA problem.In the classical CCA setup, the applied transformations strive to maximize the correlations and rank them in a descending order.This implicitly suggests a hard dimensionality reduction, as one may choose subsets of components of U and V that have the strongest correlations.In our formulation (3), the mutual information constraints allow a soft dimensionality reduction; while X and Y are transformed to maximize the objective, the transformations also compress X and Y in the classical rate-distortion sense.For example, assume that the transformations are deterministic.Then I(X; U) = H(U) and I(Y; V) = H(V) (as discussed above).Here, (3) may be interpreted as a correlation maximization problem, subject to a constraint on the maximal number of bits allowed to represent (or store) the resulting representations.In the same sense, the classical CCA formulation imposes a hard dimensionality reduction, as it constraints the number of dimensions allowed to represent (or store) the new variables.In other words, instead of restricting the number of dimensions allowed to represent the variables, we restrict the level of information allowed to represent them.We emphasize this idea in Section 6.

Iterative Projections Solution
Inspired by Breiman and Friedman [8] we suggest an iterative approach for the CRCCA problem (3).Specifically, in each iteration, we fix one of the transformations and maximize the objective with respect to the other.Let us illustrate our suggested approach as we fix V and maximize the objective with respect to U.
First, notice that our objective may be compactly written as ) is fixed (and our constraint suggests that E(UU T ) = I), we have that maximizing or equivalently (as in rate-distortion theory [19]) ( This problem is widely known in the information theory community as remote/noisy source coding ( [41,42]) with additional constraints on the second order statistics of U. Therefore, our suggested method provides a local optimum to (3) by iteratively solving a remote source coding problem, with additional second order statistics constraints.Remote source coding is a variant of the classical source coding (rate-distortion) problem [19].Let V be a remote source that is unavailable to the encoder.Let X be a random variable that depends of V through a (known) mapping p(x|v), and is available to the encoder.The remote source coding problem seeks the minimal possible compression rate of X, given a prescribed maximal reconstruction error of V from the compressed representation of X.Notice that for V = X, the remote source coding problem degenerates back to the classical source coding regime.Remote source coding has been extensively studied over the years.Dobrushin and Tsybakov [41], and later Wolf and Ziv [42], showed that the solution to this remote source coding problem (that is, the optimization problem in (5), without the second order statistics constraints) is achieved by a two step decomposition.First, let Ṽ = E(V|X) be the conditional expectation of V given X, which defines the optimal minimum mean square error (MMSE) estimator of the remote source V given the observed X.Then, U is simply the rate-distortion solution with respect to Ṽ.It is immediate to show that the same decomposition holds for our problem, with the additional second order statistic constraints.In other words, in order to solve (5), we first compute Ṽ = E(V|X), followed by min p(u| ṽ) We notice that ( 6) is simply the rate distortion function Ṽ for square error distortion, but with the additional (and untraditional) constraints on the second order statistics of the representation.

Optimality Conditions
Let us now derive the optimality conditions for each step of our suggested iterative projection algorithm.For this purpose, we assume that the joint probability distribution of X and Y is known.For the simplicity of the presentation we focus on the one dimensional case, X, Y, U, V ∈ R. As we do not restrict ourselves to deterministic transformations, the solution to ( 6) is fully characterized by the conditional probability p(u| ṽ).Lemma 1.In each step of our suggested iterative projections method, the optimal transformation (4) must satisfy the optimality conditions of (6): 1. p(u| ṽ) = p(u)e − λ( ṽ) e −η(u− ṽ) 2 −τu−µu 2 2. p(u) = ṽ p(u| ṽ)p( ṽ)d ṽ where Ṽ = E(V|X), λ( ṽ) = 1 − λ( ṽ) p( ṽ) and η, τ, µ, λ( ṽ) are the Lagrange multipliers associated with the constraints of the problem.
A proof of this Lemma is provided in Appendix A. As expected, these conditions are identical to the Arimoto-Blahut equations (Chapter 10 of [19]), with the additional term e −τu−µu 2 that corresponds to the second order statistics constraints.This allows us to derive an iterative algorithm (Algorithm 1), similar to Arimoto-Blahut, in order to find the (locally) optimal mapping p(u| ṽ), with the same (local) convergence guarantees as Arimoto-Blahut.Our suggested algorithm is also highly related to the iterative approach of the information bottleneck solution, in the special Gaussian case [43].

Compressed Representation CCA for Empirical Data
To this point, we considered the CRCCA problem under the assumption that the joint probability distribution of X and Y is known.However, this assumption is typically invalid in real world setups.Instead, we are given a set of i.i.d.samples {x i , y i } n i=1 from p(x, y).We show that in this setup, Dobrushin and Tsybakov optimal decomposition [41] may be redundant, as we attempt to directly solve (5).

Previous Results
Let us first revisit the iterative projections solution (Section 4) in a real-world setting.Here, in each iteration we are to consider an empirical version of ( 5).This problem is equivalent to the empirical remote source coding problem (up to the additional second order statistics constraints), which was first studied by Linder et al. [44].In their work, Linder et al. followed Dobrushin and Tsybakov [41], and decomposed the problem to conditional expectation estimation (denoted as Ê(V|X)) followed by empirical vector quantization, Q(•).They showed that under an additive noise assumption (X = V + ), the convergence rate of the empirical distortion is where Q * is the optimal empirically trained N-level vector quantizer, D * N is the distortion of the optimal N-level vector quantizer for the remote source problem (where the joint distribution is known), B is a known constant that satisfies P(||X|| 2 ≤ B) = 1, and e n is the mean square error of the empirical conditional expectation, E||E(V|X) − Ê(V|X)|| 2 = e n .Notice that this bound has three major terms: the irreducible quantization error D * N , the conditional expectation estimation error e n , and the empirical vector quantization term, 8B 2d x N log n n . Although their analysis focuses on vector quantization, it is evident that (7) strongly depends on the performance of the conditional expectation estimator Ê(V|X).In fact, if we choose a non-parametric k-nearest neighbors estimator (like [8]), we have that E||E(V|X) [45], which is significantly worse than the rate imposed by the empirical vector quantizer.An additional drawback of ( 7) is the restrictive additive noise modeling assumption.For more general cases, Györfi and Wegkamp [46] derived similar convergence rates under broader sub-Gaussian models.
In our work we take a different approach, as we drop Dobrushin and Tsybakov decomposition and attempt to solve (5) directly.Importantly, since both the k-nearest neighbors and the vector quantization modules require a significant computational effort as the dimension of the problem increases, we take a more practical approach and design (a variant of) a lattice quantizer.

Our Suggested Method
As previously discussed, Dobrushin and Tsybakov decomposition yields an unnecessary statistical and computational burden, as we first estimate the conditional expectation, and then use it as a plug-in for the statistic that we are essentially interested in.Here, we drop this decomposition and solve (5) directly.For this purpose we apply a remote source variant of lattice quantization.

Lattice Quantization
Lattice quantization [47] is a popular alternative for optimal vector quantization.In this approach, the partitioning of the quantization space is known and predefined.Given a set of observations {x i } n i=1 , the fit (also called representer or centroid) of each quantization cell is simply the average of all the x i 's that are sampled in that cell (see the left chart of Figure 1).We denote the (empirically trained) fixed lattice quantizer of X as U LQ (X).Computationally, Lattice quantizers are scalable to higher dimensions, as they only require a fixed partitioning of the support of X ∈ R d x .Further, it can be shown that the rate of the quantizer (which corresponds to I(X, U LQ (X))) is simply the entropy of U LQ (X) [48].This allows a simple way to verify the performance of the lattice quantizer.In addition, it quantifies the number of bits required to encode U LQ (X).Lattice quantizers followed by entropy coding asymptotically (n → ∞) approach the rate distortion bound, in high rates (Chapter 7 of [49]).Interestingly, it can be shown that in low rates, the optimal (dithered) lattice quantizer is up to half a bit worse than the rate distortion bound, while the uniform dithered quantizer is up to 0.754 bits worse than the rate-distortion bound [50].The performance of a lattice quantizer may be further improved with pre/post filtering [51].

CRCCA by Quantization
In our empirical version of (5) we are given a set of observations {x i , v i } n i=1 .Define the set of all possible quantizer of X as Q(X).We would like to find a quantizer U q ∈ Q(X), such that min As a first step towards this goal, let us define a simpler problem.Denote the set of all uniform and fixed lattice quantizer of X by Q U (X).Then, the remote source uniform quantization problem is defined as In other words, ( 9) is minimized over a subset of quantizers Q U (X) ⊂ Q(X), and drops the second order statistics constraints that appear in (8).In order to solve (9), we follow the lattice quantization approach described above; we first apply a (uniform and fixed) partitioning on the space of X.Then, the fit of each quantization cell is simply the average of all v i 's that correspond to the x i 's that were sampled in that cell (see the right chart of Figure 1 for example).We denote this quantizer as U RSUQ (X), where RSUQ stands for remote source uniform quantization.Notice that our suggested partitioning is not an optimal solution to (8) (even without the second order statistics constraints), as we apply a simplistic uniform quantization to each dimension.However, it is easy to verify that U RSUQ (X) is the empirical minimizer of (9).Finally, in order to satisfy the second order constraints, we apply a simple linear transformation, U q = AU RSUQ (X) + B. Lemma 2 below shows that U q = AU RSUQ (X) + B is indeed the empirical risk minimizer of ( 9), with the additional second order statistics constraints.Lemma 2. Let U RSUQ (X) be the remote source uniform quantizer that is the empirical risk minimizer of (9).Then, AU RSUQ (X) + B is the remote source uniform quantizer that is the empirical risk minimizer of ( 9), with the additional constraints, for some constant matrix A and a vector B.
A proof for lemma 2 is provided in Appendix B. To conclude, our suggested quantizer U q is the empirical minimizer of ( 9), which approximates (8) over a subset of quantizers Q U (X) ⊂ Q(X).Algorithm 2 summarizes our suggested approach.

Algorithm 2 A single step of CRCCA by quantization
Require: {x i , v i } n i=1 , a fixed uniform quantizer Q(x) 1: Set I(α) = j | Q(x j ) = α , the indices of the samples that are mapped to the quantization cell α Notice that the uniform quantization scheme described above may also be viewed as a partitioning estimate of the conditional expectation [45], where the number of partitions is predetermined by the prescribed quantization level.The partitioning estimate is a local averaging estimate such that for a given x it takes the average of those v i 's for which x i belongs to the same cell as x.Formally,

Convergence Analysis
The partitioning estimate holds several desirable properties.Györfi et al. [45] showed that it is (weakly) universally consistent.Further, they showed that under the assumptions that X has a compact support S, the conditional variance is bounded, var(V|X = x) ≤ σ 2 , and the conditional expectation is smooth enough, |E(V|X where ĉ depends only on d and the diameter of S, and h d n is the volume of the cubic cells.Thus, for we have that where We omit the description of some of the constants, as they appear in detail in [45].The derivation above suggests that for a choice of a cubic volume which follows (11), the rate of convergence of the partition estimate is asymptotically identical to the k-nearest neighbors estimate.This shall not come as a surprise, since both estimates apply a non-parametric estimation that performs a local averaging.However, the partition estimate is much simpler to apply in practice, as previously discussed.Finally, by applying a partitioning estimate in each step of our iterative projections solution, we show that under the same assumptions made by Györfi et al. [45], and a choice of cubic cell volume as in (11), we have that where E||V − E(V|X)|| 2 is the irreducible error and the second inequality is due to the convergence rate of the partitioning estimate, as appears in (12) .As we compare this result to (7), we notice that while the rate of convergence is asymptotically equivalent (assuming a k-NN estimate in (7)), our result is not restricted to additive noise models (or any other noise model).In addition, Linder et al. bound (7) converges to D * N , the distortion of the optimal N-level vector quantizer.Alternatively, our convergence is to the irreducible error.
It is important to emphasize that at the end of the day, our suggested method replaces the choice of k-NN estimate (as in ACE) with a partitioning estimate, where both estimates are known to be highly related.However, our suggested approach provides a sound information-theoretic justification that allows additional analytical properties, computational benefits and performance bounds.In addition, it results in an implicit regularization, as discussed below.

Regularization
The mutual information constraints of the CRCCA problem (3) hold many desirable properties.In the previous sections we focused on the information-theoretic interpretation of the problem.However, these constraints also implicitly apply regularization to the non-linear CCA problem, and by that, improve its generalization performance.Intuitively speaking, the mutual information constraint I(X; U) ≤ R U suggests that U is a compressed representation of X.This means that by restricting the information that U carries on X, we force the transformation to preserve only the relevant parts of X with respect to the canonical correlation objective.In other words, while the objective strives to fit the best transformations to a given train-set, the mutual information constraints regularize this fit by restricting its statistical dependence with the train-set.The use of mutual information as a regularization term in learning problems is not new.The Minimum Description Length (MDL) principle [38] suggests that the best hypothesis for a given set of data is the one that leads to minimal code length needed to represent of the data.Another example is the information bottleneck framework [20].There, the objective is to maximize the mutual information between the labels Y and a new representation of the features T(X), subject to the constraint I(X; T(X)) ≤ R U .As in our case, the objective strives to find the best fit (according to a different loss function), while the constraints serve as regularization terms.
As demonstrated in the previous sections, the CRCCA formulation may be implemented in practice using uniform quantizers.Here, the entropy of the quantization cells replaces the mutual information constraints in regularizing the objective.A small number of cells, which typically corresponds to a lower entropy, implies that more observations are averaged together, hence more bias (and less variance).On the other hand, a greater number of cells results in a smaller number of observations that contribute to each fit, which leads to more variance (and less bias).This way, the entropy of the cells governs the well-known bias-variance trade-off.
As before, we notice that the k-NN estimate may also control this trade-off, as the parameter k defines the number of observations to be averaged for each fit.However, while this parameter is internal to this specific conditional expectation estimator, the number of cells in the CRCCA formulation is an explicit part of the problem formulation.In other words, CRCCA defines an explicit regularization framework to the non-linear CCA problem.

Experiments
We now demonstrate our suggested approach in synthetic and real-world experiments.

Synthetic Experiments
In the first experiment we visualize the outcome of our suggested CRCCA approach.Let X and Y be two dimensional vectors, where X is uniformly distributed over the unit square and Y is a one-to-one mapping of X, which is highly non-linear.We draw n = 5000 samples of X and Y, and apply CCA, DCCA [12], ACE [8] and our suggested CRCCA method.The two charts of Figure 2 show the samples of X and Y, respectively.Notice that the samples' color is to visualize the mapping that we apply from X to Y.For example, the blue samples of X (which correspond to X 1 ∈ 0, 1  4 ) are mapped to the lower left quarter circle in Y.The exact description of the mapping is provided in Appendix C. We first apply linear CCA to X and Y.This results in a sum of correlation coefficients of 1.1 (where the maximum is 2).We report the normalized objective, which is 0.55.This poor performance is a result of the highly non-linear mapping of X to Y, which the classical CCA attempts to recover by linear means.The two charts on the left of Figure 3 visualize the correlation between the resulting components (U 1 against V 1 and U 2 against V 2 ).Next, we apply Deep CCA.We examine different architectures which vary in the number of layers (3 to 7) and the number of neurons in each layer (2 j for j = 3, . . ., 12).The remaining hyper-parameters are set according to the default values of [12].We obtain a normalized objective value of 0.993 for an architecture of three layers of 32 neurons in each layer.This result demonstrates DCCA ability to (almost) fully recover the correlation between the original variables.The middle charts of Figure 3 visualize the results we achieve for the best performing DCCA architecture.Further, we apply the ACE method, which seeks the optimal non-parametric CCA solution to (1).ACE attains a normalized objective of 0.995 for a choice of k = 70.The charts on the right of Figure 3 show the correlation between the components of ACE's outcome.As expected, ACE succeeds to almost fully recover the perfect correlation of X and Y, in this large-sample low-dimensional setup.To further illustrate the applied transformations, we visualize the obtained components of each of the methods.Figure 4 illustrates U 1 against U 2 , and V 1 against V 2 for the linear CCA (left), DCCA (middle) and ACE (right).As we can see, linear CCA rotates the original vectors, while DCCA and ACE demonstrate a highly non-linear nature.We now apply our suggested empirical CRCCA algorithm.Here, we use different number of quantization cells, N = 5, 9, 13, where N is the number of levels in each dimension.Figure 5 shows the results we achieve for these three settings.The corresponding normalized objective values for each quantization level are 0.92, 0.95 and 0.99, respectively.As expected, we converge to full recovery of X from Y, as N increases.This should not come as a surprise in such an easy problem, where the dimension is low and the number of samples is large enough.As we observe the charts of Figure 5, we notice the discrete nature of CRCCA, which is an immediate consequence of the quantization we apply.As expected, we observe more diversity in the outcome of the CRCCA, as N increases.In addition, we notice that the quantized points become increasingly correlated.Figure 6 shows the obtained CRCCA components for N = 5, 9, 13.Here again, we notice the discrete nature of the canonical variables, and the convergence to the optimal transformation (ACE) as N increases.

Real-world Experiment
Let us now turn to a real-world data experiment, in which we demonstrate the generalization abilities of the CRCCA approach.The Weight Lifting Exercises dataset [52] summarizes sensory readings from four different locations of the human body, while engaging in a weight lifting exercise.Each sensor records a 13-dimensional vector, which includes a 3-d gyroscope, 3-d magnetic fields, 3-d accelerations, roll, pitch, yaw and the total acceleration absolute value.During the weight lifting activity, participants were asked to perform a specified exercise, called Unilateral Dumbbell Biceps Curl [52] in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the weight only halfway (Class C), lowering the weight only halfway (Class D) and throwing the hips to the front (Class E).Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
Since different sensors simultaneously record the same activity, we argue that they may be correlated under some transformations.To examine this hypothesis, we apply different CCA techniques to two different sensor vectors (arm and belt, in the reported experiment) to seek maximal correlations among these vectors.Specifically, we analyze an X set of 13 measurements (the arm sensor), and a Y set of the same 13 measurements for the belt sensor.Notice that we ignore the class of the activity (represented by a categorical variable from A to E) to make the problem more challenging.To have valid and meaningful results, we examine the generalization performance of our transformations.We split the dataset (of 31, 300 observations) to 70% train-set, 15% evaluation-set (to tune different parameters) and 15% test-set.We repeat each experiment 100 times (that is, 100 different splits to the three sets mentioned above) to achieve averaged results with a corresponding standard deviation.The results we report are the averaged normalized sum of canonical correlations on the test-set.
We first apply linear CCA [1].This achieves a maximal objective value of ρUV = 0.279(±0.01).Next, we apply kernel CCA with a Gaussian kernel [31].We examine a grid of Gaussian kernel variance values to achieve a maximum of ρUV = 0.38(±0.09),for σ 2 = 2.5.Further, we evaluate the performance of DCCA and DCCAE.As in the previous experiment, we examine different DNN architectures, for both methods.Specifically, we look at different numbers of layers (3 to 7) and different numbers of neurons in each layer (2 j for j = 3, . . ., 12).In addition, we follow Andrew et al. [12] guidelines and examine both smaller and larger mini-batch sizes, ranging from 20 to 10000 samples.The remaining hyper-parameters are set according to the default values, as appear in [12] and [37].The best performing DCCA architecture achieves ρUV = 0.51(±0.04)for three layers of 2048 neurons in each layer, and a mini-batch size of 5000 samples.Notice that such a large mini-batch size is not customary in classical design of DNNs, but quite typical in Deep CCA architectures [12,37].DCCAE achieves ρUV = 0.53(±0.1)for the same architecture as DCCA in the correlation DNNs, while the autoencoders consists of three layers with 1024 neurons in each layer, and a mini-batch size of 100 samples.Finally, we apply empirical ACE (via k-nn) with different k values.We examine the evaluation-set results for k = 10, . . ., 500 and choose the best performing k = 170.We report a maximal normalized objective (on the test-set) of ρUV = 0.62(±0.1).
We now apply our suggested CRCCA method for different uniform quantization levels.The blue curve in Figure 7 shows the results we achieve on the evaluation-set.The x-axis is the number of quantization levels per dimension, while the y-axis (on the left) is our objective.The best performing CRCCA achieves a sum of correlation coefficients of ρUV = 0.66(±0.1)for a quantization level of N = 13.Importantly, we notice that the number of quantization levels determines the level of regularization in our solution and controls the generalization performance, as described in detail in Section 5.3.A small number of quantization levels implies more regularization (more values are averaged together), hence more bias and less variance.A large number of quantization levels means less regularization hence more variance and less bias.We notice that the optimum is achieved in between, for a quantization level of 13 cells in each dimension.The green curve in Figure 7 demonstrates the corresponding estimated mutual information, I(X; U) (where I(Y; V) is omitted from the chart as it is quite similar to it).Here, we notice that lower values of N correspond to a lower mutual information while a finer quantization corresponds to a greater value of mutual information.Notice that U and V are quantized versions of X and Y respectively, so we have that I(X; U) = H(U) and I(Y; V) = H(V).This demonstrates the soft dimensionality reduction interpretation of our formulation.Specifically, the green curve defines the maximal level of correlation that can be attained for a prescribed storage space (in bits).For example, we may attain up to ρUV = 0.68 (on the evaluation set) by representing U in no more than 12.5 bits (and a similar number of bits for V).Estimating the entropy values of U and V is quite challenging when the number of samples is relatively small, compared to the number of cells [53].One of the reasons is the typically large number of empty cells for a given set of samples.This phenomenon is highly related to the problem of estimating the missing mass (the probability of unobserved symbols) in large alphabet probability estimation.Here, we apply the Good-Turing probability estimator [54] to attain asymptomatically optimal estimates for the probability functions of U and V [55].Then, we use these estimates as plug-in's for the desired entropy values.
As we can see, our suggested method surpasses its competitors with a sum of correlation coefficients of ρUV = 0.66(±0.1),for a choice of N = 13.We notice that the advantage of CRCCA over ACE is barely statistically significant.However, on a practical note, CRCCA takes about 20 minutes to apply on a standard personal laptop, using a Matlab implementation, while ACE takes more than 2 hours in the same setting.The difference is a result of the k-nearest neighbors search in such a high dimensional space.This search is applied repeatedly, and it is much more computationally demanding than fixed quantization, even with enhanced search mechanisms.
To further illustrate our suggested CRCCA approach, we provide scatter-plot visualizations of the data in Appendix D. Specifically, Figures D.1, D.2 and D.3 show the arm sensor components (X i against X j , for all i, j = 1, . . ., 13), the belt sensor components (Y i against Y j ) and their component-wise dependencies (X i against Y j ), respectively.Figures D.4, D.5 and D.6 show the corresponding CRCCA components.Notice that all figures focus on the train-set samples, to improve visualization.As we can see, the arm sensor components demonstrate a more structured nature than the belt sensor components (for example, X 1 scatter-plots are more clustered, compared to Y 1 scatter-plots).As we study the different clusters, we observe that they correspond to the different weight lifting activities (class A to E, as described above).These activities focus on arm movements, which explains the more clustered nature of Figure D.1, compared to Figure D.2.Naturally, the structural difference between X and Y results in a relatively poor correlation.CRCCA reveals the maximal correlation by applying transformations to the original data.Here, we observe more structure in both sets of variables, which introduces a significantly greater correlation.Specifically, we notice a two-cluster structure in the first pair of canonical variates.Beyond the first pair, there is relatively less cluster structure, suggesting that the information on the two clusters is all condensed into the first canonical variate pair.It is important to mention that CCA methods are typically less effective in the presence of non-homogeneous data (for example, clustered variables as in our experiment).It is well known that the existence of groups in a dataset can provoke spurious correlations between a pair of quantitative variables.One possible solution is analyze each cluster independently (that is, apply CCA to each weightlifting activity, from A to E).However, this approach results in a smaller number of samples for each CCA, which typically decreases generalization performance.
To conclude, the weight lifting experiment demonstrates CRCCA ability to reveal the underlying correlation between two sets of variables; a correlation which is not apparent in the dataset's original form.In this sense CRCCA, answers our preliminary hypothesis, that the arm sensor and the belt sensor are indeed correlated.It is important to emphasize that the favorable performance of CRCCA is evident in low-dimension and large sample size setups, such as in the examples above.Unfortunately, the non-parametric nature of CRCCA makes it less effective when the dimension of the problem increases (or the number of samples reduces), due to the curse of dimensionality.Figure 8 compares CRCCA with DCCA and DCCAE for different train-set sizes.Here, we use the same DNN architectures described above.We notice that for relatively small sample size, CRCCA is inferior to the more robust (parametric) DCCA and DCCAE methods.However, as the number of samples grow, we observe the advantage of using CRCCA.

Discussion and Conclusion
In this work we introduce an information-theoretic compressed representation formulation of the non-linear CCA problem.Specifically, given two multivariate variables X and Y, we consider a CCA framework in which the extracted signals U = φ(X) and V = ψ(Y) are maximally correlated and, at the same time, I(X; U) and I(Y; V) are bounded by some predefined constants.We show that by imposing these mutual information constraints, we regularize the classical non-linear CCA problem so that U and V are compressed representations of the original variables.This allows us to regulate the dependencies between the mappings φ, ψ and the observations, and by that control the bias-variance trade-off and improve the generalization performance.Our CRCCA formulation draws immediate connections to the remote source coding problem.This allows us to derive upper bounds for our generalization error, similarly to the classical rate distortion problem.In addition, we show that by imposing the mutual information constraints, we allow a soft dimensionality reduction, as opposed to the hard reduction of the traditional CCA framework.Finally, we suggest an algorithm for the empirical CRCCA problem, based on uniform quantization.We demonstrate the performance of our suggested algorithm in different setups.We show that CRCCA successfully recovers the underlying correlation in a synthetic low-dimensional experiment, as the number of quantization cells increases.Furthermore, we observe that the CRCCA transformations converge to the optimal ACE mapping, as we study the scatter-plots of the obtained canonical components.In addition, we apply our suggested scheme to a real-world problem.Here, CRCCA demonstrates competitive (and even superior) generalization performance to DCCA and DCCAE at a reduced computational burden, as the number of observations increases.This makes DCCAE a favorable choice in such a regime.
Given a joint probability distribution, our suggested CRCCA formulation provides a sound theoretical background.However, this problem becomes more challenging in a real-world setup, where only a finite sample-size is available.In this case, our suggested algorithm drops the well-known "conditional expectation -rate distortion" decomposition and solves the problem directly.This results in a partition estimate, in which the cell volume is defined by the constraints of the problem.Unfortunately, just like k-NN, the partition estimate suffers from the curse of dimensionality.This means that the CRCCA problem may not be empirically solved in high rates, as the dimension increases.However, since the rate distortion curve is convex, one may accurately estimate the CRCCA in low rates, and interpolate the remaining points of curve.In other words, we may use the well-studied properties of the rate distortion curve in order to improve our estimation in higher rates.We consider this problem for future research.
It is important to mention that the CRCCA is a conceptual framework and not a specific algorithm.This means that there are many possible approaches to solve (5) given a finite sample-size.In this work we focus on a direct approach using uniform lattice quantizers.Alternatively, it is possible to implement general lattice quantizers, which converge even faster to the rate-distortion bound.On the other hand, one may take a different approach and suggest any conditional expectation estimate (parametric or non-parametric), followed by a vector quantizer (as done for example by Linder et al. [44]).Further, it is possible to derive an empirical CRCCA bound, in which we attempt to solve the CRCCA problem without quantizers at all (for example, estimate the joint distribution as a plug-in for the problem).This would allow an empirical upper-bound for any choice of algorithm.All of these directions are subject to future research.Either way, the major contribution of our work is not in the specific algorithm suggested in Section 5.2, but the conceptual compressed representation formulation of the problem, and its information-theoretic properties.
Finally, CRCCA may be generalized to a broader framework, in which we replace the correlation objective with mutual information maximization of the mapped signals, I(U; V).This problem strives to capture more fundamental dependencies between X and Y, as the mutual information is a statistic of the entire joint probability distribution, which holds many desirable characteristics (as shown, for example, in [56] and [57]).This generalized framework may also be viewed as a two-way information bottleneck problem, as previously shown in [58].

Figure 2 .
Figure 2. Visualization experiment.Samples of X and Y.

Figure 6 .
Figure 6.Visualization of the obtained CRCCA components.

Figure 7 .
Figure 7. Real-world experiment.N is the number of quantization cells in each dimension.The blue curve is the objective value on the evaluation-set (left y-axis), and the green curve is the corresponding estimate of the mutual information (right y-axis).

Figure 8 .
Figure 8. Real-world experiment.The generalization performance of CRCCA (blue curve), DCCA (red curve) and DCCAE (dashed red) for different train-set size n (in a log scale).