Article

Nonlinear Canonical Correlation Analysis: A Compressed Representation Approach

1 The Industrial Engineering Department, Tel Aviv University, Tel Aviv 6997801, Israel
2 The School of Electrical Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
3 The School of Computer Science and Engineering and the Interdisciplinary Center for Neural Computation, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
* Author to whom correspondence should be addressed.
Entropy 2020, 22(2), 208; https://doi.org/10.3390/e22020208
Submission received: 12 December 2019 / Revised: 9 February 2020 / Accepted: 10 February 2020 / Published: 12 February 2020
(This article belongs to the Special Issue Theory and Applications of Information Theoretic Machine Learning)

Abstract
Canonical Correlation Analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Nonlinear CCA extends this notion to a broader family of transformations, which are more powerful in many real-world applications. Given the joint probability, the Alternating Conditional Expectation (ACE) algorithm provides an optimal solution to the nonlinear CCA problem. However, it suffers from limited performance and an increasing computational burden when only a finite number of samples is available. In this work, we introduce an information-theoretic compressed representation framework for the nonlinear CCA problem (CRCCA), which extends the classical ACE approach. Our suggested framework seeks compact representations of the data that allow a maximal level of correlation. This way, we control the trade-off between the flexibility and the complexity of the model. CRCCA provides theoretical bounds and optimality conditions, as we establish fundamental connections to rate-distortion theory, the information bottleneck and remote source coding. In addition, it allows a soft dimensionality reduction, as the compression level is determined by the mutual information between the original noisy data and the extracted signals. Finally, we introduce a simple implementation of the CRCCA framework, based on lattice quantization.

1. Introduction

Canonical correlation analysis (CCA) seeks linear projections of two given random vectors so that the extracted (possibly lower dimensional) variables are maximally correlated [1]. CCA is a powerful tool in the analysis of paired data ( X , Y ) , where X and Y are two different representations of the same set of objects. It is commonly used in a variety of applications, such as speech recognition [2], natural language processing [3], cross-modal retrieval [4], multimodal signal processing [5], computer vision [6], and many others.
The CCA framework has gained a considerable amount of attention in recent years due to several important contributions (see Section 2). However, one of its major drawbacks is its restriction to linear projections, whereas many real-world setups exhibit highly nonlinear relationships. To overcome this limitation, several nonlinear CCA extensions have been proposed. Van Der Burg and De Leeuw [7] studied a nonlinear CCA problem under a specific family of transformations. Breiman and Friedman [8] considered a generalized nonlinear CCA setup, in which the transformations are not restricted to any model. They derived the optimal solution for this problem, under a known joint probability. For a finite sample size, Akaho [9] suggested a kernel version of CCA (KCCA) in which nonlinear mappings are chosen from two reproducing kernel Hilbert spaces (RKHS). Wang [10] and Klami et al. [11] considered a Bayesian approach to nonlinear CCA and provided inference algorithms and variational approximations that learn the structure of the underlying model. Later, Andrew et al. [12] introduced Deep CCA (DCCA), where the projections are obtained from two deep neural networks that are trained to output maximally correlated signals. In recent years, nonlinear CCA has seen renewed interest in the machine learning community (see, for example, [13]).
Nonlinear CCA methods are advantageous over linear CCA in a range of applications (see, for example, [14,15]). However, two major drawbacks typically characterize most methods. First, although there exist several studies on the statistical properties of the linear CCA problem ([16,17]), the nonlinear case remains quite unexplored, with only a few recent studies (see, for example, [18]). Second, current nonparametric (and therefore nonlinear) CCA methods are typically computationally demanding. Although there exist several parametric nonlinear CCA methods that address this problem, completely nonparametric methods (in which the solution is neither restricted to a parametric family nor relies on parametric density estimation, as in [13]) are often impractical to apply to large data sets.
In this work, we consider a compressed representation formulation of the nonlinear CCA framework (namely, CRCCA), which demonstrates many desirable properties. Our suggested formulation regularizes the nonlinear CCA problem in an explicit and theoretically sound manner. In addition, our suggested scheme drops the traditional hard dimensionality reduction of the CCA framework and replaces it with constraints on the mutual information between the given noisy data and the extracted signals. This results in a soft dimensionality reduction, as the compression level is controlled by the constraints of our approach.
The CRCCA framework provides theoretical bounds and optimality conditions, as we establish fundamental connections to the theory of rate-distortion (see, e.g., Chapter 10 of [19]) and the information bottleneck [20]. Given the joint probability, we obtain a coupled variant of the classical distortion-rate problem, where we maximize the correlation between the representations under constraints on the representation rates. Furthermore, we provide an empirical solution for the CRCCA problem, when only a finite set of samples is available. Our suggested solution is both theoretically sound and computationally efficient. A Matlab implementation of our suggested approach is publicly available at the first author's webpage (www.math.tau.ac.il/~amichaip).
It is important to mention that the CRCCA framework can also be interpreted from a classical information theory viewpoint. Consider two disjoint terminals, X and Y, where each vector is to be transmitted (independently of the other) through a different rate-limited noiseless channel. Then, the CRCCA framework seeks the minimal transmission rates, given a prescribed correlation between the received signals.
The rest of this manuscript is organized as follows. In Section 2, we briefly review relevant concepts and previous studies of the CCA problem. In Section 3, we formally introduce our suggested CRCCA framework. Then, we describe our suggested solution in Section 4. In Section 5, we drop the known probability assumption and discuss the finite sample-size regime. We conclude with a series of synthetic and real-world experiments in Section 6.

2. Previous Work

Let $X \in \mathbb{R}^{d_X}$ and $Y \in \mathbb{R}^{d_Y}$ be two random vectors. For the simplicity of the presentation, we assume that X and Y are also zero-mean. The CCA framework seeks two transformations, $U = \phi(X)$ and $V = \psi(Y)$, such that
$$\max_{\substack{U = \phi(X) \\ V = \psi(Y)}} \sum_{i=1}^{d} \mathbb{E}(U_i V_i) \quad \text{subject to} \quad \mathbb{E}(U) = \mathbb{E}(V) = 0,\;\; \mathbb{E}(UU^T) = \mathbb{E}(VV^T) = I \tag{1}$$
where $d \le \min(d_X, d_Y)$. Notice that the notation $U = \phi(X)$ implies that all X variables are transformed simultaneously by a single multivariate transformation $\phi$. We refer to (1) as CCA, where linear CCA [1] is under the assumption that $\phi(\cdot)$ and $\psi(\cdot)$ are linear ($U = AX$ and $V = BY$).
The solution to the linear CCA problem can be obtained from the singular value decomposition of the matrix $\Sigma_X^{-1/2} \Sigma_{XY} \Sigma_Y^{-1/2}$, where $\Sigma_X$, $\Sigma_Y$, and $\Sigma_{XY}$ are the covariance matrices of X and Y, and the cross-covariance of X and Y, respectively. In practice, the covariance matrices are typically replaced by their empirical estimates, obtained from a finite set of samples. Linear CCA has been studied quite extensively in recent years, and has gained popularity due to several contributions. Ter Braak [21] and Graffelman [22] showed that CCA can be enhanced with powerful graphics called biplots. Witten et al. [23] introduced a regularized variant of CCA, which improves generalization. This method was further extended to a sparse CCA setup [23,24]. More recently, Graffelman et al. [25] adapted linear CCA to a compositional setting using nonlinear log-ratio transformations.
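To make this construction concrete, the following is a minimal NumPy sketch of the sample-based linear CCA solution described above; the function name, the small ridge term added for numerical stability, and the use of a symmetric inverse square root are our own illustrative choices rather than part of the original formulation.

```python
import numpy as np

def linear_cca(X, Y, d):
    """Sample-based linear CCA via the SVD of Sx^{-1/2} Sxy Sy^{-1/2}.

    X: (n, dx) and Y: (n, dy) zero-mean data matrices; d: number of canonical pairs.
    Returns projection matrices A (dx, d), B (dy, d) and the canonical correlations.
    """
    n = X.shape[0]
    Sx = X.T @ X / n + 1e-8 * np.eye(X.shape[1])   # empirical covariance of X (with a tiny ridge)
    Sy = Y.T @ Y / n + 1e-8 * np.eye(Y.shape[1])   # empirical covariance of Y
    Sxy = X.T @ Y / n                              # empirical cross-covariance

    def inv_sqrt(S):
        # symmetric inverse square root via the eigen-decomposition of S
        w, Q = np.linalg.eigh(S)
        return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

    Sx_isq, Sy_isq = inv_sqrt(Sx), inv_sqrt(Sy)
    T = Sx_isq @ Sxy @ Sy_isq
    Ul, s, Vh = np.linalg.svd(T)                   # singular values = canonical correlations
    A = Sx_isq @ Ul[:, :d]                         # projections X @ A have (approx.) identity covariance
    B = Sy_isq @ Vh.T[:, :d]
    return A, B, s[:d]
```

For zero-mean paired samples, `linear_cca(X, Y, d)` returns projections whose pairwise correlations are given by the leading singular values.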
Nonlinear CCA is a natural extension of the linear CCA problem. Here, $\phi$ and $\psi$ are not restricted to be linear projections of X and Y. This problem was first introduced by Lancaster [26] and Hannan [27] and was later studied in different set-ups, under a variety of models (see, for example, [7,28]). It is important to emphasize that nonlinear CCA may also be viewed as a nonlinear multivariate analysis technique, with several important theoretical and algorithmic contributions [29]. A major milestone in the study of nonlinear CCA was achieved by Breiman and Friedman [8]. In their work, Breiman and Friedman showed that the optimal solution to (1), for $d = 1$, may be obtained by a simple alternating conditional expectation procedure, denoted ACE. Their results were later extended to any $d \le \min(d_X, d_Y)$, as shown, for example, in [30]. Here, we briefly review the ACE framework.
Let us begin with the first set of components, $i = 1$. Assume that $V_1 = \psi_1(Y)$ is fixed, known, and satisfies the constraints. Then, the optimization problem (1) is only with respect to $\phi_1$, and by the Cauchy–Schwarz inequality we have that
$$\mathbb{E}(U_1 V_1) = \mathbb{E}_X\big[\phi_1(X)\, \mathbb{E}(\psi_1(Y)\,|\,X)\big] \le \sqrt{\operatorname{var}(\phi_1(X))}\,\sqrt{\operatorname{var}\big(\mathbb{E}(\psi_1(Y)\,|\,X)\big)} \tag{2}$$
with equality if and only if $\phi_1(X) = c \cdot \mathbb{E}(\psi_1(Y)\,|\,X)$. Therefore, choosing a constant c to satisfy the unit variance constraint, we achieve $\phi_1(X) = \mathbb{E}(\psi_1(Y)\,|\,X) / \sqrt{\operatorname{var}(\mathbb{E}(\psi_1(Y)\,|\,X))}$. In the same manner, we may fix $\phi_1(X)$ and attain $\psi_1(Y) = \mathbb{E}(\phi_1(X)\,|\,Y) / \sqrt{\operatorname{var}(\mathbb{E}(\phi_1(X)\,|\,Y))}$. These coupled equations are in fact necessary conditions for the optimality of $\phi_1$ and $\psi_1$, leading to an alternating procedure in which at each step we fix one transformation and optimize with respect to the other. Once $\phi_1$ and $\psi_1$ are derived, we continue to the second set of components, $i = 2$, under the constraints that they are uncorrelated with $U_1, V_1$. This procedure continues for the remaining components. Breiman and Friedman [8] proved that ACE converges to the global optimum. In practice, the conditional expectations are estimated from training data $\{x_i, y_i\}_{i=1}^{n}$ using nonparametric regression, usually in the form of k-nearest neighbors (k-NN). As this computationally demanding step has to be executed repeatedly, ACE and its extensions are impractical for large data analysis.
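For illustration, the following is a minimal sketch of a single ACE alternation for $d = 1$, with the conditional expectations estimated by a brute-force k-NN average; the names are ours, and the $O(n^2)$ pairwise-distance computation is kept only for clarity.

```python
import numpy as np

def knn_cond_expectation(X, targets, k):
    """Estimate E[target | X = x_i] by averaging the targets of the k nearest samples."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances (n x n)
    idx = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbours of each x_i
    return targets[idx].mean(axis=1)

def ace_step(X, Y, v, k=50):
    """One ACE alternation for d = 1: update phi given psi(Y) = v, then psi given the new phi."""
    u = knn_cond_expectation(X, v, k)
    u = (u - u.mean()) / u.std()                          # enforce zero mean, unit variance
    v_new = knn_cond_expectation(Y, u, k)
    v_new = (v_new - v_new.mean()) / v_new.std()
    return u, v_new
```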
Kernel CCA (KCCA) is an alternative nonlinear CCA framework [9,31,32,33]. In KCCA, $\phi \in A$ and $\psi \in B$, where A and B are two reproducing kernel Hilbert spaces (RKHSs) associated with user-specified kernels $k_x(\cdot,\cdot)$ and $k_y(\cdot,\cdot)$. By the representer theorem [34], the projections can be written in terms of the training samples, $\{x_i, y_i\}_{i=1}^{n}$, as $U_i = \sum_{j=1}^{n} a_j^i\, k_x(X, x_j)$ and $V_i = \sum_{j=1}^{n} b_j^i\, k_y(Y, y_j)$ for some coefficients $a_j^i$ and $b_j^i$. Denote the kernel matrices as $K_x = [k_x(x_i, x_j)]$ and $K_y = [k_y(y_i, y_j)]$. Then, the optimal coefficients are computed from the eigenvectors of the matrix $(K_x + r_x I)^{-1} K_y (K_y + r_y I)^{-1} K_x$, where $r_x$ and $r_y$ are positive parameters. Computation of the exact solution is intractable for large datasets due to the memory cost of storing the kernel matrices and the time complexity of solving dense eigenvalue systems. To address this caveat, Bach and Jordan [35] and Hardoon et al. [14] suggested several low-rank matrix approximations. Later, Halko et al. [36] considered randomized singular value decomposition (SVD) methods to further reduce the computational burden of KCCA. Additional modifications were explored by Arora and Livescu [2].
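For illustration, a minimal NumPy sketch of this KCCA construction with a Gaussian kernel follows; the names are ours, only the X-side coefficients are returned, and the eigenvalues of the stated matrix are interpreted as squared canonical correlations (the standard KCCA result).

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kcca(X, Y, d, sigma2=1.0, rx=1e-3, ry=1e-3):
    """KCCA coefficients from the eigenvectors of (Kx + rx I)^{-1} Ky (Ky + ry I)^{-1} Kx."""
    n = X.shape[0]
    Kx = rbf_kernel(X, X, sigma2)
    Ky = rbf_kernel(Y, Y, sigma2)
    M = np.linalg.solve(Kx + rx * np.eye(n),
                        Ky @ np.linalg.solve(Ky + ry * np.eye(n), Kx))
    vals, vecs = np.linalg.eig(M)                 # eigenvalues ~ squared canonical correlations
    order = np.argsort(-vals.real)
    A = vecs[:, order[:d]].real                   # U_i(x) = sum_j A[j, i] * k_x(x, x_j)
    return A, vals.real[order[:d]]
```

As noted above, the dense eigenvalue problem and the storage of the full kernel matrices make this exact computation impractical for large n.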
More recently, Andrew et al. [12] introduced Deep CCA (DCCA). Here, $\phi \in A$ and $\psi \in B$, where A and B are families of functions that can be implemented using two Deep Neural Networks (DNNs) of predefined architectures. Like many DNN frameworks, DCCA is a scalable solution which demonstrates favorable generalization abilities in large data problems. Wang et al. [15] extended DCCA by introducing autoencoder regularization terms, implemented by additional DNNs. Specifically, Wang et al. [15] maximize the correlation between $U = \phi(X)$ and $V = \psi(Y)$, where $\phi, \psi$ are DNNs (similarly to DCCA), while regulating the squared reconstruction errors, $\|X - \tilde{\phi}(U)\|_2^2$ and $\|Y - \tilde{\psi}(V)\|_2^2$, where $\tilde{\phi}$ and $\tilde{\psi}$ are additional DNNs, optimized over possibly different architectures than $\phi$ and $\psi$. Wang et al. [15] called this method Deep Canonically Correlated Autoencoders (DCCAE) and demonstrated its abilities in a variety of large-scale problems. Importantly, they showed that the additional regularization terms improve upon the original DCCA framework, as they regulate its flexibility. Unfortunately, like most artificial neural network methods, both DCCA and DCCAE provide a limited understanding of the problem.

3. Problem Formulation

Let $X \in \mathbb{R}^{d_X}$ and $Y \in \mathbb{R}^{d_Y}$ be two random vectors. For the simplicity of the presentation, we assume that $d_X = d_Y = d$. It is later shown that our derivation holds for any $d_X \neq d_Y$ and $d > 0$. Let $\phi: \mathbb{R}^{d_X} \to \mathbb{R}^{d}$ and $\psi: \mathbb{R}^{d_Y} \to \mathbb{R}^{d}$ be two transformations, and let $U = \phi(X)$ and $V = \psi(Y)$ be two vectors in $\mathbb{R}^{d}$. Notice that $\phi$ and $\psi$ are not necessarily deterministic transformations, in the sense that the conditional distributions $p(u|x)$ and $p(v|y)$ may be non-degenerate distributions. In this work, we generalize the classical CCA formulation (1), as we impose additional mutual information constraints on the transformations that we apply. Specifically, we are interested in U and V such that
$$\max_{\substack{U = \phi(X) \\ V = \psi(Y)}} \sum_{i=1}^{d} \mathbb{E}(U_i V_i) \quad \text{subject to} \quad \mathbb{E}(U) = \mathbb{E}(V) = 0,\;\; \mathbb{E}(UU^T) = \mathbb{E}(VV^T) = I,\;\; I(X;U) \le R_U,\;\; I(Y;V) \le R_V \tag{3}$$
for some fixed $R_U$ and $R_V$, where $I(X;Y) = \int_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)}\, dx\, dy$ is the mutual information of X and Y, and $p(x,y)$ is the joint distribution of X and Y. The mutual information constraints regulate the transformations that we apply, so that in addition to maximizing the sum of correlations (as in (1)), U and V are also restricted to be compressed representations of X and Y, respectively. In other words, $R_U$ and $R_V$ define the amount of information preserved from the original vectors. Notice that as $R_U$ and $R_V$ grow, (3) degenerates back to (1). Further, it is important to emphasize that although $\phi$ and $\psi$ are not necessarily deterministic, most applications do impose such a restriction. In this case, the mutual information constraints are unbounded unless U and V take values over a finite support, in which case $I(X;U) = H(U)$ and $I(Y;V) = H(V)$, where $H(U) = -\sum_u p(u) \log p(u)$ is the entropy of U.
The use of mutual information as a regularization term is one of the cornerstones of information theory. The Minimum Description Length (MDL) principle [37] suggests that the best representation of a given set of data is the one that leads to a minimal coding length. This idea has inspired the use of mutual information as a regularization term in many learning problems, mostly in the context of rate distortion (Chapter 10 of [19]), the information bottleneck framework [20], and different representation learning problems [38]. Recently, Vera et al. [39] showed that a mutual information constraint explicitly controls the generalization gap when considering a cross-entropy loss function. We further discuss the desirable properties of the mutual information constraints in Section 5.3.
Notice that the regularization terms in the DCCAE framework (discussed above) are related to our suggested mutual information constraints. However, it is important to emphasize the difference between the two. The autoencoders in the DCCAE framework impose an explicit reconstruction architecture, which strives to maintain a small reconstruction error of the original representation. On the other hand, our mutual information constraints regulate the ability to reconstruct the original representation. In other words, our suggested framework restricts the amount of information that is preserved about the original representation, while DCCAE minimizes the reconstruction error of the original representation.
We refer to our constrained optimization problem (3) as Compressed Representation CCA (CRCCA). Notice that traditionally, CCA refers to linear transformations. Here, we again consider CCA in the wider sense, as the transformations may be nonlinear and even nondeterministic. Further, notice that we may interpret (3) as a soft version of the CCA problem. In the classical CCA set-up, the applied transformations strive to maximize the correlations and rank them in a descending order. This implicitly suggests a hard dimensionality reduction, as one may choose subsets of components of U and V that have the strongest correlations. In our formulation (3), the mutual information constraints allow a soft dimensionality reduction; while X and Y are transformed to maximize the objective, the transformations also compress X and Y in the classical rate-distortion sense. For example, assume that the transformations are deterministic. Then, $I(X;U) = H(U)$ and $I(Y;V) = H(V)$ (as discussed above). Here, (3) may be interpreted as a correlation maximization problem, subject to a constraint on the maximal number of bits allowed to represent (or store) the resulting representations. In the same sense, the classical CCA formulation imposes a hard dimensionality reduction, as it constrains the number of dimensions allowed to represent (or store) the new variables. In other words, instead of restricting the number of dimensions allowed to represent the variables, we restrict the amount of information allowed to represent them. We emphasize this idea in Section 6.

4. Iterative Projections Solution

Inspired by Breiman and Friedman [8], we suggest an iterative approach for the CRCCA problem (3). Specifically, in each iteration, we fix one of the transformations and maximize the objective with respect to the other. Let us illustrate our suggested approach as we fix V and maximize the objective with respect to U.
First, notice that our objective may be compactly written as $\sum_{i=1}^{d} \mathbb{E}(U_i V_i) = \mathbb{E}(V^T U)$. As $\mathbb{E}(VV^T)$ is fixed (and our constraint suggests that $\mathbb{E}(UU^T) = I$), we have that maximizing $\sum_{i=1}^{d} \mathbb{E}(U_i V_i)$ is equivalent to minimizing $\mathbb{E}\|U - V\|^2$. Therefore, the basic step in our iterative procedure is
$$\min_{p(u|x)} \mathbb{E}\|U - V\|^2 \quad \text{s.t.} \quad I(X;U) \le R_U,\;\; \mathbb{E}(U) = 0,\;\; \mathbb{E}(UU^T) = I, \tag{4}$$
or equivalently (as in rate-distortion theory [19])
$$\min_{p(u|x)} I(X;U) \quad \text{s.t.} \quad \mathbb{E}\|U - V\|^2 \le D,\;\; \mathbb{E}(U) = 0,\;\; \mathbb{E}(UU^T) = I. \tag{5}$$
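For completeness, the equivalence between the correlation objective and the squared-error objective used in (4) follows from a direct expansion:
$$\mathbb{E}\|U - V\|^2 = \operatorname{tr}\,\mathbb{E}(UU^T) + \operatorname{tr}\,\mathbb{E}(VV^T) - 2\sum_{i=1}^{d}\mathbb{E}(U_i V_i) = d + \operatorname{tr}\,\mathbb{E}(VV^T) - 2\sum_{i=1}^{d}\mathbb{E}(U_i V_i),$$
so with V fixed and $\mathbb{E}(UU^T) = I$ enforced, minimizing the squared error and maximizing the sum of correlations are the same problem.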
This problem is widely known in the information theory community as remote/noisy source coding [40,41] with additional constraints on the second-order statistics of U. Therefore, our suggested method provides a local optimum to (3) by iteratively solving a remote source coding problem, with additional second order statistics constraints.
Remote source coding is a variant of the classical source coding (rate-distortion) problem [19]. Let V be a remote source that is unavailable to the encoder. Let X be a random variable that depends on V through a (known) mapping $p(x|v)$, and is available to the encoder. The remote source coding problem seeks the minimal possible compression rate of X, given a prescribed maximal reconstruction error of V from the compressed representation of X. Notice that for $V = X$, the remote source coding problem degenerates back to the classical source coding regime. Remote source coding has been extensively studied over the years. Dobrushin and Tsybakov [40], and later Wolf and Ziv [41], showed that the solution to this remote source coding problem (that is, the optimization problem in (5), without the second-order statistics constraints) is achieved by a two-step decomposition. First, let $\tilde{V} = \mathbb{E}(V|X)$ be the conditional expectation of V given X, which defines the optimal minimum mean square error (MMSE) estimator of the remote source V given the observed X. Then, U is simply the rate-distortion solution with respect to $\tilde{V}$. It is immediate to show that the same decomposition holds for our problem, with the additional second-order statistics constraints. In other words, to solve (5), we first compute $\tilde{V} = \mathbb{E}(V|X)$, followed by
$$\min_{p(u|\tilde{v})} I(\tilde{V}; U) \quad \text{s.t.} \quad \mathbb{E}\|U - \tilde{V}\|^2 \le D,\;\; \mathbb{E}(U) = 0,\;\; \mathbb{E}(UU^T) = I. \tag{6}$$
We notice that (6) is simply the rate-distortion function of $\tilde{V}$ under squared-error distortion, but with the additional (and untraditional) constraints on the second-order statistics of the representation.
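The validity of the two-step decomposition can be traced to the orthogonality principle of the MMSE estimator: writing $\tilde{V} = \mathbb{E}(V|X)$, the estimation error $V - \tilde{V}$ is uncorrelated with any (possibly randomized) function of X, so for any U obtained from X,
$$\mathbb{E}\|U - V\|^2 = \mathbb{E}\|V - \tilde{V}\|^2 + \mathbb{E}\|U - \tilde{V}\|^2.$$
The first term does not depend on $p(u|x)$, which reduces (5) to (6); the same identity reappears in the empirical analysis of Section 5 (Equation (13)).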

Optimality Conditions

Let us now derive the optimality conditions for each step of our suggested iterative projections algorithm. For this purpose, we assume that the joint probability distribution of X and Y is known. For the simplicity of the presentation, we focus on the one-dimensional case, $X, Y, U, V \in \mathbb{R}$. As we do not restrict ourselves to deterministic transformations, the solution to (6) is fully characterized by the conditional probability $p(u|\tilde{v})$.
Lemma 1.
In each step of our suggested iterative projections method, the optimal transformation in (4) must satisfy the optimality conditions of (6):
1. $p(u|\tilde{v}) = p(u)\, e^{\tilde{\lambda}(\tilde{v})}\, e^{-\eta(u - \tilde{v})^2 - \tau u - \mu u^2}$
2. $p(u) = \int_{\tilde{v}} p(u|\tilde{v})\, p(\tilde{v})\, d\tilde{v}$
where $\tilde{V} = \mathbb{E}(V|X)$, $\tilde{\lambda}(\tilde{v}) = 1 - \lambda(\tilde{v})/p(\tilde{v})$, and $\eta, \tau, \mu, \lambda(\tilde{v})$ are the Lagrange multipliers associated with the constraints of the problem.
A proof of this Lemma is provided in Appendix A. As expected, these conditions are identical to the Arimoto–Blahut equations (Chapter 10 of [19]), with the additional term $e^{-\tau u - \mu u^2}$ that corresponds to the second-order statistics constraints. This allows us to derive an iterative algorithm (Algorithm 1), similar to Arimoto–Blahut, in order to find the (locally) optimal mapping $p(u|\tilde{v})$, with the same (local) convergence guarantees as Arimoto–Blahut. Our suggested algorithm is also highly related to the iterative approach of the information bottleneck solution in the special Gaussian case [42].
Algorithm 1 Arimoto–Blahut pseudocode for rate distortion with second-order statistics constraints
Require: $p(\tilde{v})$
Ensure: Fix $p(u), \eta, \tau, \mu, \lambda(\tilde{v})$
1: Set $\tilde{\lambda}(\tilde{v}) = 1 - \lambda(\tilde{v})/p(\tilde{v})$
2: Set $p(u|\tilde{v}) = p(u)\, e^{\tilde{\lambda}(\tilde{v})}\, e^{-\eta(u - \tilde{v})^2 - \tau u - \mu u^2}$
3: Set $p(u) = \int_{\tilde{v}} p(u|\tilde{v})\, p(\tilde{v})\, d\tilde{v}$
4: Set $\tau$ so that $\mathbb{E}(U) = 0$
5: Set $\mu$ so that $\mathbb{E}(U^2) = 1$
6: Go to Step 4 until convergence
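For illustration, the following is a minimal numerical sketch of this iteration on a discrete one-dimensional grid. The fixed-point updates follow the conditions of Lemma 1, but the multipliers $\tau$ and $\mu$ are adjusted here by a simple gradient heuristic rather than an exact root-finding step, so this is a sketch of the fixed-point equations rather than the exact procedure; all names are ours.

```python
import numpy as np

def constrained_blahut(p_vt, v_grid, u_grid, eta, n_iter=500, step=0.5):
    """Fixed-point iteration for (6) on a discrete scalar grid (a sketch of Algorithm 1).

    p_vt:   probability vector of the remote-source estimate V~ over v_grid.
    eta:    distortion multiplier (larger eta -> lower distortion, higher rate).
    tau/mu: multipliers for E[U] = 0 and E[U^2] = 1, adjusted by a gradient heuristic.
    """
    p_u = np.full(len(u_grid), 1.0 / len(u_grid))        # initial output marginal p(u)
    tau, mu = 0.0, 0.0
    d2 = (v_grid[:, None] - u_grid[None, :]) ** 2        # squared distortion, shape (|v|, |u|)
    for _ in range(n_iter):
        # Lemma 1, condition 1: p(u|v~) propto p(u) exp(-eta (u - v~)^2 - tau u - mu u^2)
        logw = (np.log(p_u + 1e-300)[None, :] - eta * d2
                - tau * u_grid[None, :] - mu * u_grid[None, :] ** 2)
        logw -= logw.max(axis=1, keepdims=True)          # numerical stabilization
        w = np.exp(logw)
        p_u_given_v = w / w.sum(axis=1, keepdims=True)
        # Lemma 1, condition 2: updated output marginal
        p_u = p_vt @ p_u_given_v
        # heuristic multiplier updates pushing E[U] -> 0 and E[U^2] -> 1
        tau += step * (p_u @ u_grid)
        mu += step * (p_u @ u_grid ** 2 - 1.0)
    rate = np.sum(p_vt[:, None] * p_u_given_v *
                  np.log2(p_u_given_v / (p_u[None, :] + 1e-300) + 1e-300))
    return p_u_given_v, p_u, rate
```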
Generalizing the optimality conditions (and the corresponding iterative algorithm) to the vector case is straightforward (see Appendix A). Again, the Lagrangian leads to Arimoto–Blahut equations, with an additional exponential term, $e^{-\tau^T u - u^T \mu u}$. Here, $\tau$ is a vector of Lagrange multipliers while $\mu$ is a matrix.

5. Compressed Representation CCA for Empirical Data

To this point, we considered the CRCCA problem under the assumption that the joint probability distribution of X and Y is known. However, this assumption is typically invalid in real-world set-ups. Instead, we are given a set of i.i.d. samples $\{x_i, y_i\}_{i=1}^{n}$ from $p(x,y)$. We show that in this set-up, the Dobrushin–Tsybakov optimal decomposition [40] may be redundant, as we attempt to solve (5) directly.

5.1. Previous Results

Let us first revisit the iterative projections solution (Section 4) in a real-world setting. Here, in each iteration, we are to consider an empirical version of (5). This problem is equivalent to the empirical remote source coding problem (up to the additional second-order statistics constraints), which was first studied by Linder et al. [43]. In their work, Linder et al. followed Dobrushin and Tsybakov [40], and decomposed the problem into conditional expectation estimation (denoted $\hat{\mathbb{E}}(V|X)$) followed by empirical vector quantization, $Q(\cdot)$. They showed that under an additive noise assumption ($X = V + \epsilon$), the convergence rate of the empirical distortion is
$$\mathbb{E}\big\|Q^*\big(\hat{\mathbb{E}}(V|X)\big) - V\big\|^2 \le D_N^* + 8B^2\sqrt{\frac{d_x N \log n}{n}} + O\!\left(n^{-\frac{1}{2}}\right) + 8B\sqrt{e_n} + e_n \tag{7}$$
where $Q^*$ is the optimal empirically trained N-level vector quantizer, $D_N^*$ is the distortion of the optimal N-level vector quantizer for the remote source problem (where the joint distribution is known), B is a known constant that satisfies $P(\|X\|^2 \le B) = 1$, and $e_n$ is the mean square error of the empirical conditional expectation, $\mathbb{E}\|\mathbb{E}(V|X) - \hat{\mathbb{E}}(V|X)\|^2 = e_n$. Notice that this bound has three major terms: the irreducible quantization error $D_N^*$; the conditional expectation estimation error $e_n$; and the empirical vector quantization term, $8B^2\sqrt{\frac{d_x N \log n}{n}}$. Although their analysis focuses on vector quantization, it is evident that (7) strongly depends on the performance of the conditional expectation estimator $\hat{\mathbb{E}}(V|X)$. In fact, if we choose a nonparametric k-nearest neighbors estimator (as in [8]), we have that $\mathbb{E}\|\mathbb{E}(V|X) - \hat{\mathbb{E}}(V|X)\|^2 = O\big(n^{-\frac{2}{d+2}}\big)$ [44], which is significantly worse than the rate imposed by the empirical vector quantizer. An additional drawback of (7) is the restrictive additive noise modeling assumption. For more general cases, Györfi and Wegkamp [45] derived similar convergence rates under broader sub-Gaussian models.
In our work we take a different approach, as we drop the Dobrushin–Tsybakov decomposition and attempt to solve (5) directly. Importantly, as both the k-nearest neighbors and the vector quantization modules require significant computational effort as the dimension of the problem increases, we take a more practical approach and design (a variant of) a lattice quantizer.

5.2. Our Suggested Method

As previously discussed, the Dobrushin and Tsybakov decomposition yields an unnecessary statistical and computational burden, as we first estimate the conditional expectation, and then use it as a plug-in for the statistic that we are essentially interested in. Here, we drop this decomposition and solve (5) directly. For this purpose we apply a remote source variant of lattice quantization.

5.2.1. Lattice Quantization

Lattice quantization [46] is a popular alternative to optimal vector quantization. In this approach, the partitioning of the quantization space is known and predefined. Given a set of observations $\{x_i\}_{i=1}^{n}$, the fit (also called the representer or centroid) of each quantization cell is simply the average of all the $x_i$'s that are sampled in that cell (see the left chart of Figure 1). We denote the (empirically trained) fixed lattice quantizer of X as $U_{LQ}(X)$. Computationally, lattice quantizers are scalable to higher dimensions, as they only require a fixed partitioning of the support of $X \in \mathbb{R}^{d_x}$. Further, it can be shown that the rate of the quantizer (which corresponds to $I(X; U_{LQ}(X))$) is simply the entropy of $U_{LQ}(X)$ [47]. This allows a simple way to verify the performance of the lattice quantizer. In addition, it quantifies the number of bits required to encode $U_{LQ}(X)$. Lattice quantizers followed by entropy coding asymptotically (as $n \to \infty$) approach the rate-distortion bound at high rates (Chapter 7 of [48]). Interestingly, it can be shown that at low rates, the optimal (dithered) lattice quantizer is up to half a bit worse than the rate-distortion bound, whereas the uniform dithered quantizer is up to 0.754 bits worse than the rate-distortion bound [49]. The performance of a lattice quantizer may be further improved with pre/post-filtering [50].
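For illustration, a minimal sketch of an empirically trained fixed uniform (cubic-lattice) quantizer $U_{LQ}(X)$ and its empirical rate follows; the function and variable names are ours, and the flattened cell indexing is practical only for modest dimensions.

```python
import numpy as np

def uniform_lattice_quantize(X, n_levels, lo=None, hi=None):
    """Fixed uniform lattice quantizer: cell index per sample, centroid fits, and empirical rate.

    For a deterministic quantizer the rate corresponds to I(X; U_LQ(X)) = H(U_LQ(X)),
    estimated here by the empirical entropy of the cell occupancy (in bits).
    """
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    # integer cell coordinate per dimension, clipped to the grid
    cells = np.floor((X - lo) / (hi - lo + 1e-12) * n_levels).astype(int)
    cells = np.clip(cells, 0, n_levels - 1)
    # flatten the d-dimensional cell coordinates into a single cell id
    cell_id = np.ravel_multi_index(cells.T, dims=(n_levels,) * X.shape[1])
    # centroid of each occupied cell = mean of the samples that fall in it
    fits = {c: X[cell_id == c].mean(axis=0) for c in np.unique(cell_id)}
    U = np.stack([fits[c] for c in cell_id])
    counts = np.unique(cell_id, return_counts=True)[1]
    p = counts / counts.sum()
    rate_bits = -(p * np.log2(p)).sum()
    return U, cell_id, rate_bits
```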

5.2.2. CRCCA by Quantization

In our empirical version of (5), we are given a set of observations $\{x_i, v_i\}_{i=1}^{n}$. Define the set of all possible quantizers of X as $\mathcal{Q}(X)$. We would like to find a quantizer $U_q \in \mathcal{Q}(X)$ such that
$$\min_{U_q \in \mathcal{Q}(X)} \mathbb{E}\|U_q - V\|^2 \quad \text{s.t.} \quad H(U_q) \le R_U,\;\; \mathbb{E}(U_q) = 0,\;\; \mathbb{E}(U_q U_q^T) = I. \tag{8}$$
As a first step towards this goal, let us define a simpler problem. Denote the set of all uniform and fixed lattice quantizers of X by $\mathcal{Q}_U(X)$. Then, the remote source uniform quantization problem is defined as
$$\min_{U_q \in \mathcal{Q}_U(X)} \mathbb{E}\|U_q - V\|^2 \quad \text{s.t.} \quad H(U_q) \le R_U. \tag{9}$$
In other words, (9) is minimized over a subset of quantizers $\mathcal{Q}_U(X) \subset \mathcal{Q}(X)$, and drops the second-order statistics constraints that appear in (8). To solve (9), we follow the lattice quantization approach described above; we first apply a (uniform and fixed) partitioning on the space of X. Then, the fit of each quantization cell is simply the average of all $v_i$'s that correspond to the $x_i$'s that were sampled in that cell (see the right chart of Figure 1 for an example). We denote this quantizer as $U_{RSUQ}(X)$, where RSUQ stands for remote source uniform quantization. Notice that our suggested partitioning is not an optimal solution to (8) (even without the second-order statistics constraints), as we apply a simplistic uniform quantization to each dimension. However, it is easy to verify that $U_{RSUQ}(X)$ is the empirical minimizer of (9). Finally, in order to satisfy the second-order constraints, we apply a simple linear transformation, $U_q = A\, U_{RSUQ}(X) + B$. Lemma 2 below shows that $U_q = A\, U_{RSUQ}(X) + B$ is indeed the empirical risk minimizer of (9), with the additional second-order statistics constraints.
Lemma 2.
Let $U_{RSUQ}(X)$ be the remote source uniform quantizer that is the empirical risk minimizer of (9). Then, $A\, U_{RSUQ}(X) + B$ is the remote source uniform quantizer that is the empirical risk minimizer of (9) with the additional constraints, for some constant matrix A and vector B.
A proof for Lemma 2 is provided in Appendix B. To conclude, our suggested quantizer $U_q$ is the empirical minimizer of (9), which approximates (8) over a subset of quantizers $\mathcal{Q}_U(X) \subset \mathcal{Q}(X)$. Algorithm 2 summarizes our suggested approach.
Algorithm 2 A single step of CRCCA by quantization
Require: $\{x_i, v_i\}_{i=1}^{n}$, a fixed uniform quantizer $Q(x)$
1: Set $I(\alpha) = \{j \,|\, Q(x_j) = \alpha\}$, the indices of the samples that are mapped to the quantization cell $\alpha$
2: Set $U_{RSUQ}(x_i) = \frac{1}{|I(Q(x_i))|}\sum_{k \in I(Q(x_i))} v_k$
3: Set $U_q(x_i) = A\, U_{RSUQ}(x_i) + B$ such that $\sum_{i=1}^{n} U_q(x_i) = 0$ and $\frac{1}{n}\sum_{i=1}^{n} U_q(x_i)\, U_q^T(x_i) = I$
Notice that the uniform quantization scheme described above may also be viewed as a partitioning estimate of the conditional expectation [44], where the number of partitions is predetermined by the prescribed quantization level. The partitioning estimate is a local averaging estimate such that for a given x it takes the average of those $v_i$'s for which $x_i$ belongs to the same cell as x. Formally,
$$\hat{\mathbb{E}}(V\,|\,X = x) = \frac{\sum_{i=1}^{n} v_i\, \mathbb{1}\{Q(x_i) = Q(x)\}}{\sum_{i=1}^{n} \mathbb{1}\{Q(x_i) = Q(x)\}}. \tag{10}$$
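For illustration, a minimal NumPy sketch of a single step of Algorithm 2 follows, given precomputed lattice-cell indices of the $x_i$'s (e.g., from the uniform quantizer sketched in Section 5.2.1). The affine correction is realized here as centering followed by symmetric whitening, which is one valid choice of A and B; the names are ours.

```python
import numpy as np

def crcca_quantization_step(cell_id, V):
    """One CRCCA-by-quantization step: remote-source uniform quantization, then whitening.

    cell_id: lattice-cell index of each x_i.
    V:       (n, d) current values v_i = psi(y_i).
    Assumes at least d occupied cells with linearly independent fits.
    Returns U_q with zero empirical mean and identity empirical covariance.
    """
    n, d = V.shape
    # RSUQ fit: average the v_i's whose x_i fall in the same quantization cell
    U = np.zeros_like(V, dtype=float)
    for c in np.unique(cell_id):
        mask = cell_id == c
        U[mask] = V[mask].mean(axis=0)
    # affine map enforcing the second-order constraints (empirical whitening)
    Uc = U - U.mean(axis=0)
    cov = Uc.T @ Uc / n
    w, Q = np.linalg.eigh(cov + 1e-10 * np.eye(d))
    A = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T          # symmetric inverse square root of cov
    return Uc @ A
```

In practice, this step is applied alternately to the two views, exactly as in the iterative projections scheme of Section 4.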

5.2.3. Convergence Analysis

The partitioning estimate holds several desirable properties. Györfi et al. [44] showed that it is (weakly) universally consistent. Further, they showed that under the assumptions that X has a compact support S, the conditional variance is bounded, $\operatorname{var}(V\,|\,X = x) \le \sigma^2$, and the conditional expectation is smooth enough, $|\mathbb{E}(V\,|\,X = x_1) - \mathbb{E}(V\,|\,X = x_2)| \le C\|x_1 - x_2\|$, the rate of convergence of the partitioning estimate follows
$$\mathbb{E}\big\|\hat{\mathbb{E}}(V|X) - \mathbb{E}(V|X)\big\|^2 \le \hat{c}\,\frac{\sigma^2 + \sup_{x \in S}|\mathbb{E}(V\,|\,X = x)|^2}{n\, h_n^d} + d\, C^2 h_n^2$$
where $\hat{c}$ depends only on d and the diameter of S, and $h_n^d$ is the volume of the cubic cells. Thus, for
$$h_n = c\left(\frac{\sigma^2 + \sup_{x \in S}|\mathbb{E}(V\,|\,X = x)|^2}{C^2}\right)^{1/(d+2)} n^{-\frac{1}{d+2}} \tag{11}$$
we have that
$$\mathbb{E}\big\|\hat{\mathbb{E}}(V|X) - \mathbb{E}(V|X)\big\|^2 \le \zeta(\sigma, S, d, C)\cdot n^{-\frac{2}{d+2}} \tag{12}$$
where
$$\zeta(\sigma, S, d, C) = c\left(\sigma^2 + \sup_{x \in S}|\mathbb{E}(V\,|\,X = x)|^2\right)^{\frac{2}{d+2}}\left(C^2\right)^{\frac{d}{d+2}}.$$
We omit the description of some of the constants, as they appear in detail in [44]. The derivation above suggests that for a choice of cubic-cell volume that follows (11), the rate of convergence of the partitioning estimate is asymptotically identical to that of the k-nearest neighbors estimate. This should not come as a surprise, as both are nonparametric estimates that perform local averaging. However, the partitioning estimate is much simpler to apply in practice, as previously discussed.
Finally, by applying a partitioning estimate in each step of our iterative projections solution, we show that under the same assumptions made by Györfi et al. [44], and a choice of cubic cell volume as in (11), we have that
$$\mathbb{E}\big\|U_{RSUQ}(X) - V\big\|^2 = \mathbb{E}\big\|V - \mathbb{E}(V|X)\big\|^2 + \mathbb{E}\big\|U_{RSUQ}(X) - \mathbb{E}(V|X)\big\|^2 \le \mathbb{E}\big\|V - \mathbb{E}(V|X)\big\|^2 + \zeta(\sigma, S, d, C)\cdot n^{-\frac{2}{d+2}} \tag{13}$$
where $\mathbb{E}\|V - \mathbb{E}(V|X)\|^2$ is the irreducible error and the inequality is due to the convergence rate of the partitioning estimate, as appears in (12). As we compare this result to (7), we notice that while the rate of convergence is asymptotically equivalent (assuming a k-NN estimate in (7)), our result is not restricted to additive noise models (or any other noise model). In addition, the bound of Linder et al. (7) converges to $D_N^*$, the distortion of the optimal N-level vector quantizer, whereas our bound converges to the irreducible error.
It is important to emphasize that at the end of the day, our suggested method replaces the choice of k-NN estimate (as in ACE) with a partitioning estimate, where both estimates are known to be highly related. However, our suggested approach provides a sound information-theoretic justification that allows additional analytical properties, computational benefits, and performance bounds. In addition, it results in an implicit regularization, as discussed below.

5.3. Regularization

The mutual information constraints of the CRCCA problem (3) hold many desirable properties. In the previous sections, we focused on the information-theoretic interpretation of the problem. However, these constraints also implicitly apply regularization to the nonlinear CCA problem, and by that, improve its generalization performance. Intuitively speaking, the mutual information constraint $I(X;U) \le R_U$ suggests that U is a compressed representation of X. This means that by restricting the information that U carries on X, we force the transformation to preserve only the relevant parts of X with respect to the canonical correlation objective. In other words, although the objective strives to fit the best transformations to a given train-set, the mutual information constraints regularize this fit by restricting its statistical dependence on the train-set. The use of mutual information as a regularization term in learning problems is not new. The Minimum Description Length (MDL) principle [37] suggests that the best hypothesis for a given set of data is the one that leads to the minimal code length needed to represent the data. Another example is the information bottleneck framework [20]. There, the objective is to maximize the mutual information between the labels Y and a new representation of the features T(X), subject to the constraint $I(X;T(X)) \le R_U$. As in our case, the objective strives to find the best fit (according to a different loss function), while the constraints serve as regularization terms.
As demonstrated in the previous sections, the CRCCA formulation may be implemented in practice using uniform quantizers. Here, the entropy of the quantization cells replaces the mutual information constraints in regularizing the objective. A small number of cells, which typically corresponds to a lower entropy, implies that more observations are averaged together, therefore more bias (and less variance). On the other hand, a greater number of cells results in a smaller number of observations that contribute to each fit, which leads to more variance (and less bias). This way, the entropy of the cells governs the well-known bias-variance trade-off.
As before, we notice that the k-NN estimate may also control this trade-off, as the parameter k defines the number of observations to be averaged for each fit. However, although this parameter is internal to this specific conditional expectation estimator, the number of cells in the CRCCA formulation is an explicit part of the problem formulation. In other words, CRCCA defines an explicit regularization framework to the nonlinear CCA problem.

6. Experiments

We now demonstrate our suggested approach in synthetic and real-world experiments.

6.1. Synthetic Experiments

In the first experiment, we visualize the outcome of our suggested CRCCA approach. Let X and Y be two-dimensional vectors, where X is uniformly distributed over the unit square and Y is a one-to-one, highly nonlinear mapping of X. We draw n = 5000 samples of X and Y, and apply CCA, DCCA [12], ACE [8], and our suggested CRCCA method. The two charts of Figure 2 show the samples of X and Y, respectively. Notice that the samples' color is used to visualize the mapping that we apply from X to Y. For example, the blue samples of X (which correspond to $X_1 \in [0, \frac{1}{4}]$) are mapped to the lower-left quarter circle in Y. The exact description of the mapping is provided in Appendix C. We first apply linear CCA to X and Y. This results in a sum of correlation coefficients of 1.1 (where the maximum is 2). We report the normalized objective, which is 0.55. This poor performance is a result of the highly nonlinear mapping of X to Y, which the classical CCA attempts to recover by linear means. The two charts on the left of Figure 3 visualize the correlation between the resulting components ($U_1$ against $V_1$ and $U_2$ against $V_2$). Next, we apply Deep CCA. We examine different architectures that vary in the number of layers (3 to 7) and the number of neurons in each layer ($2^j$ for $j = 3, \ldots, 12$). The remaining hyperparameters are set according to the default values of [12]. We obtain a normalized objective value of 0.993 for an architecture of three layers with 32 neurons in each layer. This result demonstrates DCCA's ability to (almost) fully recover the correlation between the original variables. The middle charts of Figure 3 visualize the results we achieve for the best performing DCCA architecture. Further, we apply the ACE method, which seeks the optimal nonparametric CCA solution to (1). ACE attains a normalized objective of 0.995 for a choice of k = 70. The charts on the right of Figure 3 show the correlation between the components of ACE's outcome. As expected, ACE almost fully recovers the perfect correlation of X and Y in this large-sample, low-dimensional setup.
To further illustrate the applied transformations, we visualize the obtained components of each of the methods. Figure 4 illustrates U 1 against U 2 , and V 1 against V 2 for the linear CCA (left), DCCA (middle), and ACE (right). As we can see, linear CCA rotates the original vectors, whereas DCCA and ACE demonstrate a highly nonlinear nature.
We now apply our suggested empirical CRCCA algorithm. Here, we use different numbers of quantization cells, N = 5, 9, 13, where N is the number of levels in each dimension. Figure 5 shows the results we achieve for these three settings. The corresponding normalized objective values for each quantization level are 0.92, 0.95, and 0.99, respectively. As expected, we converge to a full recovery of X from Y as N increases. This should not come as a surprise in such an easy problem, where the dimension is low and the number of samples is large enough. As we observe the charts of Figure 5, we notice the discrete nature of CRCCA, which is an immediate consequence of the quantization we apply. As expected, we observe more diversity in the outcome of CRCCA as N increases. In addition, we notice that the quantized points become increasingly correlated. Figure 6 shows the obtained CRCCA components for N = 5, 9, 13. Here, again, we notice the discrete nature of the canonical variables, and the convergence to the optimal transformation (ACE) as N increases.

6.2. Real-World Experiment

Let us now turn to a real-world data experiment, in which we demonstrate the generalization abilities of the CRCCA approach. The Weight Lifting Exercises dataset [51] summarizes sensory readings from four different locations of the human body, while engaging in a weight lifting exercise. Each sensor records a 13-dimensional vector, which includes a 3-d gyroscope, 3-d magnetic fields, 3-d accelerations, roll, pitch, yaw, and the total acceleration absolute value. During the weight lifting activity, participants were asked to perform a specified exercise, called Unilateral Dumbbell Biceps Curl [51] in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the weight only halfway (Class C), lowering the weight only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
As different sensors simultaneously record the same activity, we argue that they may be correlated under some transformations. To examine this hypothesis, we apply different CCA techniques to two different sensor vectors (arm and belt, in the reported experiment) to seek maximal correlations among these vectors. Specifically, we analyze an X set of 13 measurements (the arm sensor), and a Y set of the same 13 measurements for the belt sensor. Notice that we ignore the class of the activity (represented by a categorical variable from A to E) to make the problem more challenging. To obtain valid and meaningful results, we examine the generalization performance of our transformations. We split the dataset (of 31,300 observations) into a 70% train-set, a 15% evaluation-set (to tune different parameters), and a 15% test-set. We repeat each experiment 100 times (that is, 100 different splits into the three sets mentioned above) to achieve averaged results with a corresponding standard deviation. The results we report are the averaged normalized sum of canonical correlations on the test-set.
We first apply linear CCA [1]. This achieves a maximal objective value of $\bar{\rho}_{UV} = 0.279\ (\pm 0.01)$. Next, we apply kernel CCA with a Gaussian kernel [31]. We examine a grid of Gaussian kernel variance values to achieve a maximum of $\bar{\rho}_{UV} = 0.38\ (\pm 0.09)$, for $\sigma^2 = 2.5$. Further, we evaluate the performance of DCCA and DCCAE. As in the previous experiment, we examine different DNN architectures for both methods. Specifically, we look at different numbers of layers (3 to 7) and different numbers of neurons in each layer ($2^j$ for $j = 3, \ldots, 12$). In addition, we follow the guidelines of Andrew et al. [12] and examine both smaller and larger minibatch sizes, ranging from 20 to 10,000 samples. The remaining hyperparameters are set according to the default values, as appear in [12] and [15]. The best performing DCCA architecture achieves $\bar{\rho}_{UV} = 0.51\ (\pm 0.04)$ for three layers of 2048 neurons in each layer, and a minibatch size of 5000 samples. Notice that such a large minibatch size is not customary in the classical design of DNNs, but is quite typical in Deep CCA architectures [12,15]. DCCAE achieves $\bar{\rho}_{UV} = 0.53\ (\pm 0.1)$ for the same architecture as DCCA in the correlation DNNs, whereas the autoencoders consist of three layers with 1024 neurons in each layer, and a minibatch size of 100 samples. Finally, we apply empirical ACE (via k-NN) with different k values. We examine the evaluation-set results for $k = 10, \ldots, 500$ and choose the best performing $k = 170$. We report a maximal normalized objective (on the test-set) of $\bar{\rho}_{UV} = 0.62\ (\pm 0.1)$.
We now apply our suggested CRCCA method for different uniform quantization levels. The blue curve in Figure 7 shows the results we achieve on the evaluation-set. The x-axis is the number of quantization levels per dimension, whereas the y-axis (on the left) is our objective. The best performing CRCCA achieves a sum of correlation coefficients of $\bar{\rho}_{UV} = 0.66\ (\pm 0.1)$ for a quantization level of N = 13. Importantly, we notice that the number of quantization levels determines the level of regularization in our solution and controls the generalization performance, as described in detail in Section 5.3. A small number of quantization levels implies more regularization (more values are averaged together), and therefore more bias and less variance. A large number of quantization levels means less regularization, and therefore more variance and less bias. We notice that the optimum is achieved in between, for a quantization level of 13 cells in each dimension. The green curve in Figure 7 shows the corresponding estimated mutual information, $I(X;U)$ (where $I(Y;V)$ is omitted from the chart as it is quite similar). Here, we notice that lower values of N correspond to a lower mutual information, while a finer quantization corresponds to a greater value of mutual information. Notice that U and V are quantized versions of X and Y respectively, so we have that $I(X;U) = H(U)$ and $I(Y;V) = H(V)$. This demonstrates the soft dimensionality reduction interpretation of our formulation. Specifically, the green curve defines the maximal level of correlation that can be attained for a prescribed storage space (in bits). For example, we may attain up to $\bar{\rho}_{UV} = 0.68$ (on the evaluation-set) by representing U in no more than 12.5 bits (and a similar number of bits for V).
Estimating the entropy values of U and V is quite challenging when the number of samples is relatively small compared to the number of cells [52]. One of the reasons is the typically large number of empty cells for a given set of samples. This phenomenon is highly related to the problem of estimating the missing mass (the probability of unobserved symbols) in large alphabet probability estimation. Here, we apply the Good–Turing probability estimator [53] to attain asymptotically optimal estimates of the probability functions of U and V [54]. Then, we use these estimates as plug-ins for the desired entropy values.
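For illustration only, the following is a crude sketch of a Good–Turing-adjusted plug-in entropy estimate over the quantization cells; it uses only the simple missing-mass correction $N_1/n$ and is not the exact estimator of [53,54].

```python
import numpy as np

def good_turing_entropy(cell_id):
    """Plug-in entropy estimate (bits) with a simple Good-Turing missing-mass correction."""
    _, counts = np.unique(cell_id, return_counts=True)
    n = counts.sum()
    missing_mass = (counts == 1).sum() / n          # Good-Turing estimate of the unseen mass
    p_obs = counts / n * (1.0 - missing_mass)       # rescaled probabilities of observed cells
    p_obs = p_obs[p_obs > 0]
    return -(p_obs * np.log2(p_obs)).sum()
```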
As we can see, our suggested method surpasses its competitors with a sum of correlation coefficients of ρ ¯ U V = 0.66 ( ± 0.1 ) , for a choice of N = 13 . We notice that the advantage of CRCCA over ACE is barely statistically significant. However, on a practical note, CRCCA takes ~20 min to apply on a standard personal laptop, using a Matlab implementation, whereas ACE takes more than 2 h in the same setting. The difference is a result of the k-nearest neighbors search in such a high dimensional space. This search is applied repeatedly, and it is much more computationally demanding than fixed quantization, even with enhanced search mechanisms.
To further illustrate our suggested CRCCA approach, we provide scatter-plot visualizations of the data in Appendix D. Specifically, Figure A2, Figure A3 and Figure A4 show the arm sensor components ($X_i$ against $X_j$, for all $i, j = 1, \ldots, 13$), the belt sensor components ($Y_i$ against $Y_j$), and their component-wise dependencies ($X_i$ against $Y_j$), respectively. Figure A5, Figure A6 and Figure A7 show the corresponding CRCCA components. Notice that all figures focus on the train-set samples, to improve visualization. As we can see, the arm sensor components demonstrate a more structured nature than the belt sensor components (for example, the $X_1$ scatter-plots are more clustered than the $Y_1$ scatter-plots). As we study the different clusters, we observe that they correspond to the different weight lifting activities (classes A–E, as described above). These activities focus on arm movements, which explains the more clustered nature of Figure A2 compared to Figure A3. Naturally, the structural difference between X and Y results in a relatively poor correlation. CRCCA reveals the maximal correlation by applying transformations to the original data. Here, we observe more structure in both sets of variables, which introduces a significantly greater correlation. Specifically, we notice a two-cluster structure in the first pair of canonical variates. Beyond the first pair, there is relatively less cluster structure, suggesting that the information on the two clusters is all condensed into the first canonical variate pair. It is important to mention that CCA methods are typically less effective in the presence of non-homogeneous data (for example, clustered variables as in our experiment). It is well known that the existence of groups in a dataset can provoke spurious correlations between a pair of quantitative variables. One possible solution is to analyze each cluster independently (that is, apply CCA to each weight lifting activity, from A to E). However, this approach results in a smaller number of samples for each CCA, which typically decreases generalization performance.
To conclude, the weight lifting experiment demonstrates CRCCA's ability to reveal the underlying correlation between two sets of variables; a correlation which is not apparent in the dataset's original form. In this sense, CRCCA confirms our preliminary hypothesis that the arm sensor and the belt sensor are indeed correlated. It is important to emphasize that the favorable performance of CRCCA is evident in low-dimensional, large sample size setups, such as the examples above. Unfortunately, the nonparametric nature of CRCCA makes it less effective when the dimension of the problem increases (or the number of samples decreases), due to the curse of dimensionality. Figure 8 compares CRCCA with DCCA and DCCAE for different train-set sizes. Here, we use the same DNN architectures described above. We notice that for relatively small sample sizes, CRCCA is inferior to the more robust (parametric) DCCA and DCCAE methods. However, as the number of samples grows, we observe the advantage of using CRCCA.

7. Discussion and Conclusions

In this work, we introduce an information-theoretic compressed representation formulation of the nonlinear CCA problem. Specifically, given two multivariate variables, X and Y, we consider a CCA framework in which the extracted signals U = ϕ ( X ) and V = ψ ( Y ) are maximally correlated and, at the same time, I ( X ; U ) and I ( Y ; V ) are bounded by some predefined constants. We show that by imposing these mutual information constraints, we regularize the classical nonlinear CCA problem so that U and V are compressed representations of the original variables. This allows us to regulate the dependencies between the mappings ϕ , ψ and the observations, and by that control the bias-variance trade-off and improve the generalization performance. Our CRCCA formulation draws immediate connections to the remote source coding problem. This allows us to derive upper bounds for our generalization error, similarly to the classical rate distortion problem. In addition, we show that by imposing the mutual information constraints, we allow a soft dimensionality reduction, as opposed to the hard reduction of the traditional CCA framework. Finally, we suggest an algorithm for the empirical CRCCA problem, based on uniform quantization.
We demonstrate the performance of our suggested algorithm in different setups. We show that CRCCA successfully recovers the underlying correlation in a synthetic low-dimensional experiment, as the number of quantization cells increases. Furthermore, we observe that the CRCCA transformations converge to the optimal ACE mapping, as we study the scatter-plots of the obtained canonical components. In addition, we apply our suggested scheme to a real-world problem. Here, CRCCA demonstrates competitive (and even superior) generalization performance compared to DCCA and DCCAE, at a reduced computational burden, as the number of observations increases. This makes CRCCA a favorable choice in such a regime.
Given a joint probability distribution, our suggested CRCCA formulation provides a sound theoretical background. However, the problem becomes more challenging in a real-world set-up, where only a finite sample-size is available. In this case, our suggested algorithm drops the well-known "conditional expectation–rate distortion" decomposition and solves the problem directly. This results in a partitioning estimate, in which the cell volume is defined by the constraints of the problem. Unfortunately, just like k-NN, the partitioning estimate suffers from the curse of dimensionality. This means that the CRCCA problem may not be empirically solved at high rates, as the dimension increases. However, as the rate-distortion curve is convex, one may accurately estimate the CRCCA curve at low rates, and interpolate the remaining points of the curve. In other words, we may use the well-studied properties of the rate-distortion curve in order to improve our estimation at higher rates. We leave this problem for future research.
It is important to mention that CRCCA is a conceptual framework and not a specific algorithm. This means that there are many possible approaches to solving (5) given a finite sample-size. In this work, we focus on a direct approach using uniform lattice quantizers. Alternatively, it is possible to implement general lattice quantizers, which converge even faster to the rate-distortion bound. On the other hand, one may take a different approach and apply any conditional expectation estimate (parametric or nonparametric), followed by a vector quantizer (as done, for example, by Linder et al. [43]). Further, it is possible to derive an empirical CRCCA bound, in which we attempt to solve the CRCCA problem without quantizers at all (for example, estimate the joint distribution as a plug-in for the problem). This would allow an empirical upper-bound for any choice of algorithm. All of these directions are subject to future research. Either way, the major contribution of our work is not the specific algorithm suggested in Section 5.2, but the conceptual compressed representation formulation of the problem, and its information-theoretic properties.
Finally, CRCCA may be generalized to a broader framework, in which we replace the correlation objective with mutual information maximization of the mapped signals, I ( U ; V ) . This problem strives to capture more fundamental dependencies between X and Y, as the mutual information is a statistic of the entire joint probability distribution, which holds many desirable characteristics (as shown, for example, in [55,56]). This generalized framework may also be viewed as a two-way information bottleneck problem, as previously shown in [57].

Author Contributions

Conceptualization, A.P., M.F., and N.T.; methodology, A.P., M.F., and N.T.; software, A.P.; validation, A.P.; formal analysis, A.P., M.F., and N.T.; investigation, A.P.; writing—original draft preparation, A.P.; writing—review and editing, M.F. and N.T.; visualization, A.P.; supervision, M.F. and N.T.; project administration, M.F. and N.T.; funding acquisition, A.P. and N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Gatsby Charitable Foundation and the Intel Collaboration Research Institute for Computational Intelligence (ICRI-CI) to Naftali Tishby, and the Israeli Center of Research Excellence in Algorithms to Amichai Painsky.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. A Proof for Lemma 1

The optimal solution to (4) is achieved in two steps. First, let $\tilde{V} = \mathbb{E}(V|X)$ be the conditional expectation of V given X. Then, U is the constrained rate-distortion solution for (6). We may apply calculus of variations to derive the optimality conditions of (4), with respect to $p(u|\tilde{v})$. Here, we present the general case where X, Y, U, V, and $\tilde{V}$ are all vectors. The mutual information objective is given by
I ( V ˜ ; U ) = p ( v ˜ ) p ( u | v ˜ ) log p ( u | v ˜ ) p ( u ) d u d v ˜ .
while the constraints are
  • E | | U V ˜ | | 2 = | | u v ˜ | | 2 p ( u | v ˜ ) p ( v ˜ ) d u d v ˜ D
  • E ( U i ) = u i p ( u ) = u i p ( u | v ˜ ) p ( v ˜ ) d u d v ˜ = 0
  • E ( U i U j ) = u i u j p ( u | v ˜ ) p ( v ˜ ) d u d v ˜ = 1 i = j
where 1 { · } is the indicator function. Therefore, our Lagrangian is given by
L = p ( v ˜ ) p ( u | v ˜ ) log p ( u | v ˜ ) p ( u ) d u d v ˜ η | | u v ˜ | | 2 p ( u | v ˜ ) p ( v ˜ ) d u d v ˜ D i τ ( i ) u i p ( u | v ˜ ) p ( v ˜ ) d u d v ˜ i , j μ ( i , j ) u i u j p ( u | v ˜ ) p ( v ˜ ) 1 i = j λ ( v ˜ ) p ( u | v ˜ ) d u 1 d v ˜
where $\eta$, $\tau$, and $\mu$ are the Lagrange multipliers associated with the distortion, mean, and correlation constraints, respectively, and $\lambda(\tilde{v})$ are the Lagrange multipliers that restrict $p(u \mid \tilde{v})$ to be a valid distribution function for every $\tilde{v}$. Setting the derivative of $\mathcal{L}$ (with respect to $p(u \mid \tilde{v})$) to zero yields the specified conditions. □
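For completeness, a sketch of the intermediate step, under the sign conventions of the Lagrangian above and treating $p(u)$ as the marginal induced by $p(u \mid \tilde{v})$: the functional derivative reads
$$ \frac{\partial \mathcal{L}}{\partial p(u \mid \tilde{v})} = p(\tilde{v}) \log \frac{p(u \mid \tilde{v})}{p(u)} - \eta\, \|u - \tilde{v}\|^2\, p(\tilde{v}) - \sum_i \tau(i)\, u_i\, p(\tilde{v}) - \sum_{i,j} \mu(i,j)\, u_i u_j\, p(\tilde{v}) - \lambda(\tilde{v}) = 0, $$
so that the stationary points take the exponential form
$$ p(u \mid \tilde{v}) = \frac{p(u)}{Z(\tilde{v})} \exp\Big\{ \eta\, \|u - \tilde{v}\|^2 + \sum_i \tau(i)\, u_i + \sum_{i,j} \mu(i,j)\, u_i u_j \Big\}, \qquad Z(\tilde{v}) = \exp\!\big(-\lambda(\tilde{v}) / p(\tilde{v})\big), $$
where the normalization term $Z(\tilde{v})$ absorbs the multiplier $\lambda(\tilde{v})$.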

Appendix B. A Proof for Lemma 2

Let $Q(x)$ be a uniform quantizer of $x$ that consists of $M$ different cells. Let $C_m$ be the $m$th quantization cell and let $u_m$ be the fit of the $m$th cell. Specifically, $Q(x_i) = u_m$ implies that $x_i$ is a member of $C_m$. For a quantizer $Q(x)$, the empirical risk minimization of (8) with respect to the fit values $u_m$ is given by
$$ \min_{u_m} \sum_{m=1}^{M} \sum_{i \in C_m} \|v_i - u_m\|^2 \quad \text{s.t.} \quad \sum_{m=1}^{M} |C_m|\, u_m = 0, \qquad \frac{1}{n} \sum_{m=1}^{M} |C_m|\, u_m u_m^{T} = I, $$
where $|C_m|$ denotes the number of observations in the $m$th cell. The KKT conditions of (A3) yield that $u_m = A \left( \frac{1}{|C_m|} \sum_{i \in C_m} v_i \right) + B$ for an appropriate matrix $A$ and vector $B$ determined by the constraints. However, notice that the optimal fit for the unconstrained problem is simply $\frac{1}{|C_m|} \sum_{i \in C_m} v_i$, which concludes the proof. □
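The sketch below illustrates the empirical fits of Lemma 2 for a given quantizer: the unconstrained optimum is the per-cell mean, and centering plus whitening is one concrete choice of the affine correction $(A, B)$. The function name and the use of an eigendecomposition for whitening are assumptions, not the paper's implementation.

```python
import numpy as np

def constrained_cell_fits(assignments, V):
    """Per-cell means of V, followed by an affine correction so that the
    fitted signal has zero mean and identity empirical covariance.

    assignments : (n,) int array, assignments[i] = m means x_i is in cell C_m
    V           : (n, d) array of the paired observations v_i
    """
    n, d = V.shape
    cells = np.unique(assignments)

    # Unconstrained optimum: the mean of v_i within each cell.
    cell_means = {m: V[assignments == m].mean(axis=0) for m in cells}
    U = np.vstack([cell_means[m] for m in assignments])  # fitted signal, (n, d)

    # Affine correction: center, then whiten via a symmetric eigendecomposition,
    # so that the fitted signal satisfies the zero-mean and unit-covariance constraints.
    U_centered = U - U.mean(axis=0)
    cov = U_centered.T @ U_centered / n
    eigval, eigvec = np.linalg.eigh(cov)
    A = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.T
    return U_centered @ A.T
```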

Appendix C. Synthetic Experiment Description

Let X and Y be two-dimensional vectors, where X is uniformly distributed over the unit square and Y is a one-to-one mapping of X, as demonstrated in Figure 2. Here, we describe the exact mapping from X to Y. We begin with the blue samples.
First, we spread the blue samples over the entire square; specifically, $Z_1 = 4 \cdot X_1$ and $Z_2 = X_2$. Next, we gather the samples to the left and lower parts of the unit square, as demonstrated on the left chart of Figure A1; specifically, we set $Z_1 \leftarrow 0.2 \cdot Z_1$ for all samples with $Z_2 > 0.2$. We then shift $Z_1 \leftarrow Z_1 - 1$ and $Z_2 \leftarrow Z_2 - 1$, so that the samples are located in the square $[-1, 0]^2$. Finally, we apply the transformation $Y_1 = Z_1 \cdot \sqrt{1 - \tfrac{1}{2} Z_2^2}$, $Y_2 = Z_2 \cdot \sqrt{1 - \tfrac{1}{2} Z_1^2}$ to obtain the lower-left quarter circle, as illustrated on the right chart of Figure A1. Notice that the resulting quarter circle is not homogeneous, in the sense that there are more samples for which $Z_2 < 0.2$ than for which $Z_1 < 0.2$. We repeat similar transformations for the remaining samples of $X_1$ and $X_2$, so that for each quarter circle, the denser part is clockwise, as demonstrated on the right chart of Figure A1.
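The following is a minimal sketch of the blue-sample transformation described above. It assumes that the blue samples are those with $X_1 \in [0, 1/4]$, which is consistent with the scaling $Z_1 = 4 \cdot X_1$ spreading them over the unit square; the sample size and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of X, uniform over the unit square.
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 2))

# Assumption: the "blue" samples are those with X_1 in [0, 1/4].
blue = X[X[:, 0] < 0.25].copy()

# Spread over the unit square: Z_1 = 4*X_1, Z_2 = X_2.
Z = np.column_stack([4.0 * blue[:, 0], blue[:, 1]])

# Gather to the left/lower part: shrink Z_1 where Z_2 > 0.2.
mask = Z[:, 1] > 0.2
Z[mask, 0] *= 0.2

# Shift into the square [-1, 0]^2.
Z -= 1.0

# Map the square onto the lower-left quarter circle.
Y1 = Z[:, 0] * np.sqrt(1.0 - 0.5 * Z[:, 1] ** 2)
Y2 = Z[:, 1] * np.sqrt(1.0 - 0.5 * Z[:, 0] ** 2)
Y_blue = np.column_stack([Y1, Y2])
```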
Figure A1. Synthetic experiment: the blue samples of $Z_1, Z_2$ (left) and $Y_1, Y_2$ (right).

Appendix D. Visualizations of the Real World Experiment

Figure A2. Real-world experiment. Visualization of X components (arm sensor), before CRCCA.
Figure A3. Real-world experiment. Visualization of Y components (belt sensor), before CRCCA.
Figure A4. Real-world experiment. Visualization of X, Y components, before CRCCA.
Figure A5. Real-world experiment. Visualization of U components, after CRCCA.
Figure A6. Real-world experiment. Visualization of V components, after CRCCA.
Figure A7. Real-world experiment. Visualization of U, V components, after CRCCA.

References

1. Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377.
2. Arora, R.; Livescu, K. Kernel CCA for multi-view learning of acoustic features using articulatory measurements. In Proceedings of the Symposium on Machine Learning in Speech and Language Processing, Portland, OR, USA, 14 September 2012; pp. 34–37.
3. Dhillon, P.; Foster, D.P.; Ungar, L.H. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2011; pp. 199–207.
4. Gong, Y.; Wang, L.; Hodosh, M.; Hockenmaier, J.; Lazebnik, S. Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 529–545.
5. Slaney, M.; Covell, M. FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2011; pp. 814–820.
6. Kim, T.K.; Wong, S.F.; Cipolla, R. Tensor canonical correlation analysis for action classification. In Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8.
7. Van Der Burg, E.; de Leeuw, J. Nonlinear canonical correlation. Br. J. Math. Stat. Psychol. 1983, 36, 54–80.
8. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985, 80, 580–598.
9. Akaho, S. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society, Osaka, Japan, 15–19 July 2001.
10. Wang, C. Variational Bayesian approach to canonical correlation analysis. IEEE Trans. Neural Netw. 2007, 18, 905–910.
11. Klami, A.; Virtanen, S.; Kaski, S. Bayesian canonical correlation analysis. J. Mach. Learn. Res. 2013, 14, 965–1003.
12. Andrew, G.; Arora, R.; Bilmes, J.A.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1247–1255.
13. Michaeli, T.; Wang, W.; Livescu, K. Nonparametric canonical correlation analysis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1967–1976.
14. Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664.
15. Wang, W.; Arora, R.; Livescu, K.; Bilmes, J.A. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1083–1092.
16. Arora, R.; Mianjy, P.; Marinov, T. Stochastic optimization for multiview representation learning using partial least squares. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1786–1794.
17. Arora, R.; Marinov, T.V.; Mianjy, P.; Srebro, N. Stochastic approximation for canonical correlation analysis. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4778–4787.
18. Wang, W.; Arora, R.; Livescu, K.; Srebro, N. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 30 September–2 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 688–695.
19. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Chichester, UK, 2012.
20. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
21. Ter Braak, C.J. Interpreting canonical correlation analysis through biplots of structure correlations and weights. Psychometrika 1990, 55, 519–531.
22. Graffelman, J. Enriched biplots for canonical correlation analysis. J. Appl. Stat. 2005, 32, 173–188.
23. Witten, D.M.; Tibshirani, R.; Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10, 515–534.
24. Witten, D.M.; Tibshirani, R.J. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol. 2009, 8, 1–27.
25. Graffelman, J.; Pawlowsky-Glahn, V.; Egozcue, J.J.; Buccianti, A. Exploration of geochemical data with compositional canonical biplots. J. Geochem. Explor. 2018, 194, 120–133.
26. Lancaster, H. The structure of bivariate distributions. Ann. Math. Stat. 1958, 29, 719–736.
27. Hannan, E. The general theory of canonical correlation and its relation to functional analysis. J. Aust. Math. Soc. 1961, 2, 229–242.
28. Van der Burg, E.; de Leeuw, J.; Dijksterhuis, G. OVERALS: Nonlinear canonical correlation with k sets of variables. Comput. Stat. Data Anal. 1994, 18, 141–163.
29. Gifi, A. Nonlinear Multivariate Analysis; Wiley-VCH: Hoboken, NJ, USA, 1990.
30. Makur, A.; Kozynski, F.; Huang, S.L.; Zheng, L. An efficient algorithm for information decomposition and extraction. In Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 30 September–2 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 972–979.
31. Lai, P.L.; Fyfe, C. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 2000, 10, 365–377.
32. Uurtio, V.; Bhadra, S.; Rousu, J. Large-scale sparse kernel canonical correlation analysis. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6383–6391.
33. Yu, J.; Wang, K.; Ye, L.; Song, Z. Accelerated kernel canonical correlation analysis with fault relevance for nonlinear process fault isolation. Ind. Eng. Chem. Res. 2019, 58, 18280–18291.
34. Schölkopf, B.; Herbrich, R.; Smola, A.J. A generalized representer theorem. In International Conference on Computational Learning Theory; Springer: Berlin/Heidelberg, Germany, 2001; pp. 416–426.
35. Bach, F.R.; Jordan, M.I. Kernel independent component analysis. J. Mach. Learn. Res. 2002, 3, 1–48.
36. Halko, N.; Martinsson, P.G.; Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 2011, 53, 217–288.
37. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
38. Chigirev, D.V.; Bialek, W. Optimal manifold representation of data: An information theoretic approach. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–13 December 2003; pp. 161–168.
39. Vera, M.; Piantanida, P.; Vega, L.R. The role of the information bottleneck in representation learning. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1580–1584.
40. Dobrushin, R.; Tsybakov, B. Information transmission with additional noise. IEEE Trans. Inf. Theory 1962, 8, 293–304.
41. Wolf, J.; Ziv, J. Transmission of noisy information to a noisy receiver with minimum distortion. IEEE Trans. Inf. Theory 1970, 16, 406–411.
42. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
43. Linder, T.; Lugosi, G.; Zeger, K. Empirical quantizer design in the presence of source noise or channel noise. IEEE Trans. Inf. Theory 1997, 43, 612–623.
44. Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: Berlin/Heidelberg, Germany, 2006.
45. Györfi, L.; Wegkamp, M. Quantization for nonparametric regression. IEEE Trans. Inf. Theory 2008, 54, 867–874.
46. Zamir, R. Lattice Coding for Signals and Networks: A Structured Coding Approach to Quantization, Modulation, and Multiuser Information Theory; Cambridge University Press: Cambridge, UK, 2014.
47. Ziv, J. On universal quantization. IEEE Trans. Inf. Theory 1985, 31, 344–347.
48. Zamir, R. Lattices are everywhere. In Information Theory and Applications Workshop; IEEE: Piscataway, NJ, USA, 2009; pp. 392–421.
49. Zamir, R.; Feder, M. On universal quantization by randomized uniform/lattice quantizers. IEEE Trans. Inf. Theory 1992, 38, 428–436.
50. Zamir, R.; Feder, M. Information rates of pre/post-filtered dithered quantizers. IEEE Trans. Inf. Theory 1996, 42, 1340–1353.
51. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative activity recognition of weight lifting exercises. In Proceedings of the 4th Augmented Human International Conference, Stuttgart, Germany, 7–8 March 2013; ACM: New York, NY, USA, 2013; pp. 116–123.
52. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
53. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264.
54. Orlitsky, A.; Suresh, A.T. Competitive distribution estimation: Why is Good-Turing good. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2143–2151.
55. Painsky, A.; Wornell, G. On the universality of the logistic loss function. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 936–940.
56. Painsky, A.; Wornell, G.W. Bregman divergence bounds and universality properties of the logarithmic loss. IEEE Trans. Inf. Theory 2019.
57. Slonim, N.; Friedman, N.; Tishby, N. Multivariate information bottleneck. Neural Comput. 2006, 18, 1739–1789.
Figure 1. Examples of uniform quantization. Left: classical uniform quantization. Right: remote source uniform quantization.
Figure 2. Visualization experiment. Samples of X and Y.
Figure 3. Visualization of the correlations. Left: linear CCA. Middle: DCCA. Right: ACE.
Figure 4. Visualization of the obtained components. Left: linear CCA. Middle: DCCA. Right: ACE.
Figure 5. Visualization of the correlations. CRCCA with quantization levels N = 5, 9, 13.
Figure 6. Visualization of the obtained CRCCA components.
Figure 7. Real-world experiment. N is the number of quantization cells in each dimension. The blue curve is the objective value on the evaluation set (left y-axis), and the green curve is the corresponding estimate of the mutual information (right y-axis).
Figure 8. Real-world experiment. The generalization performance of CRCCA (blue curve), DCCA (red curve), and DCCAE (dashed red) for different train-set sizes n (on a log scale).
