Article Non-Linear Canonical Correlation Analysis Using Alpha-Beta Divergence

We propose a generalized method of the canonical correlation analysis using Alpha-Beta divergence, called AB-canonical analysis (ABCA). From observations of two random variables, x ∈ RP and y ∈ RQ, ABCA finds directions, wx ∈ RP and wy ∈ RQ, such that the AB-divergence between the joint distribution of (wT x, wT y) and the product x y of their marginal distributions is maximized. The number of significant non-zero canonical coefficients are determined by using a sequential permutation test. The advantage of our method over the standard canonical correlation analysis (CCA) is that it can reconstruct the hidden non-linear relationship between wT xx and wT y, and it is robust against outliers. We extend ABCA when data are observed in terms of tensors. We further generalize this method by imposing sparseness constraints. Extensive simulation study is performed to justify our approach.


Introduction
In statistics and data analysis, we are often interested to find out the relationship between two sets of multi-dimensional random variables, x ∈ R P and y ∈ R Q .Canonical correlation analysis (CCA) focuses on the correlation between a linear combination of the variables in one set and another linear combination of the variables in the other set.The idea is to first determine linear combinations of x and y, called canonical variables, such that the correlation between the canonical variables is the highest possible among all such linear combinations.
Based on the observed random sample, the aim in standard CCA is to find the linear relationship between x and y.Therefore, the method fails if the relationship is non-linear.Another disadvantage of the standard CCA is that it is very sensitive to outliers, as it is based on the correlation coefficient.In this paper, we generalize the concept of CCA, which can extract the non-linear relationship between two sets of variables, and at the same time, the method is robust against outliers.We assume that there exists a hidden relationship of the following type: where ψ is an unknown smooth function and is the random error.Our aim is to find out vectors, w x ∈ R P and w y ∈ R Q , from observed values of x and y.Yin (2004) [1] has developed a technique to solve this problem based on an information theoretic approach (see, also, Yin et al., 2008 [2]; Iaci et al., 2010 [3]).Recently, Iaci and Sriram (2013) [4] applied this method using beta-divergence and power divergence.Wang et al. (2012) [5] have used Bregman divergence to perform CCA.We will explore this problem in detail and extend this method by using the Alpha-Beta divergence (or AB-divergence) (Cichocki et al., 2011 [6]), which is a generalized measure of divergence.Moreover, the earlier methods are limited to the case where x and y are random vectors; we will extend it to the tensor (multiway array) valued random variables.Kernel CCA (Lai and Fyfe, 2000 [7]; Shawe-Taylor and Cristianini, 2004 [8]) deals with the nonlinear relationship between two sets of random variables, but the setting of the problem is different than our approach.Kernel CCA first transforms the data to a higher (or infinite) dimensional non-linear space, called the reproducing kernel Hilbert space, and then assumes that there exists a linear relationship between the variables in the transformed space.In kernel CCA, it is not possible to recover the non-linear relationship, whereas in our case, we can find out the unknown function, ψ, in Equation ( 1) by further analysis (see Breiman and Friedman, 1985 [9]).However, in this paper, our main interest is to recover w x and w y , which satisfy Equation (1).
The rest of the paper is organized as follows.In Sections 2 and 3, we discuss the basic formulations of CCA and AB-divergence, respectively.The new method, AB-canonical analysis (ABCA), is proposed in Section 4. In Section 5, we describe the algorithm of ABCA.The sequential permutation test is proposed to determine the number of significant canonical variable pairs in Section 6.In Section 7, we generalize ABCA when data sets are observed as tensors.The sparsity constraint is introduced in Section 8. Numerical illustrations of the performance of this method are presented in Section 9. Section 10 has some concluding remarks.

Canonical Correlation Analysis
Suppose we have N pairs of observations from two sets of random variables, x and y, In CCA, we look for linear combinations of x and y, which have maximum correlation with each other (Hotelling, 1936 [10]).Formally, the classical CCA computes two projection vectors, w x ∈ R P and w y ∈ R Q , such that the correlation coefficient: is maximized, where Σ xy is the covariance matrix between x and y, and Σ x and Σ y are the dispersion matrices of x and y, respectively.Since ρ is invariant to the scaling of vectors w x and w y , CCA can be formulated equivalently as the following constrained optimization problem: We denote the optimum values of (w x , w y ) as ( 1 w x , 1 w y ).We refer to u 1 = 1 w T x x and v 1 = 1 w T y y as the pair of first canonical variables.
Next, we determine a new pair of linear combinations, say u 2 and v 2 , which has the highest correlation subject to u 2 , being uncorrelated with u 1 , and v 2 being uncorrelated with v 1 (the construction actually ensures that u 1 and v 2 are uncorrelated, as well, as are u 2 and v 1 ).Therefore, at the i-th step, the canonical vectors are obtained as: subject to: for all j = 1, 2, • • • , i − 1 and i ≤ min{p, q}.The process continues, until subsequent pairs of linear combinations no longer produce a significant correlation.
One important property of the divergence is that D α,β (f ||g) is non-negative for all f and g and is equal to zero if and only if f ≡ g almost everywhere (Cichocki et al., 2011 [6]).Let us take f to be the joint density of two random variables, x and y, and g to be the product of their marginal densities.Then, D α,β (f ||g) = 0 if and only if x and y are independent.We will use this property of AB-divergence to find the canonical variables.

AB-Canonical Analysis
Let us denote the joint distribution of two random variables as f (•, •), whereas the marginal distribution as f (•).We define the AB-divergence between the joint distribution of (w T x x, w T y y) and the product of their marginal distributions as: From the property of the AB-divergence, we know that D α,β (w x , w y ) = 0 if and only if w T x x and w T y y are statistically independent.Here, our aim is to find directions w x and w y , such that w T x x and w T y y are as much dependent as possible.Therefore, we find w x and w y from the optimization problem: We denote the first set of AB-canonical vectors as ( 1 w x , 1 w y ).The i-th set of canonical vectors are obtained as: subject to: for all j = 1, 2, • • • , i − 1 and i ≤ min{p, q}.Like CCA, we continue, until a subsequent pairs of canonical variables no longer produce a significant dependence.We note that D α,β (w x , w y ) = 0 implies that w T x x and w T y y are statistically independent, regardless of the distributions of x and y.On the other hand, in standard CCA, the zero canonical correlation implies that x and y are uncorrelated, but in general, they may not be independent.However, if x and y follow normal distributions, then they are independent.The concept of statistical dependence is more general and flexible than the concept of correlation.If x and y are independent, then they are also uncorrelated, but not vice versa.

ABCA Algorithm
Suppose we have N pairs of observations from two sets of random variables, x and y, α,β (w x , w y ), the sample version of D α,β (w x , w y ), using kernel density estimates (Yin, 2004 [1]).Therefore, where: and: Here, h, h 1 and h 2 are suitably chosen bandwidths and K(•) and K 2 (•, •) are univariate and bivariate kernels, respectively.For simplicity, we will take the product kernel (Scott, 1992 [17]), i.e.: For convergence of the kernel density functions to the corresponding underlying densities, we need to ensure that the bandwidth parameters tend to zero as the sample size increases.We follow the method described in Silverman (1986) [18] by taking h = 1.06sN −1/5 , h j = s j N −1/6 , j = 1, 2, where s, s 1 and s 2 are the corresponding standard deviations.Moreover, the choice of the bandwidth parameters satisfies the condition of Theorem 1, stated later in this section.Here, we use Gaussian kernel.Robust kernel may be used to make the procedure robust against outliers (Kim and Scott, 2012 [19]), but we prefer to choose suitable tuning parameters, α and β, to make the procedure robust.
The AB-canonical vectors obtained from Equation ( 15) are consistent in the sense that they converge to the original canonical vectors for large sample sizes.The following theorem ensures this result.The proof of the theorem can be done in the same line of thought as mentioned in Proposition 3 of Yin (2004) [1] or Theorem 1 of Iaci and Sriram (2013) [4].
It should be mentioned here that the optimization problem in Equation ( 12) is non-linear, and it may stick at a local maxima.Therefore, it is often needed to repeat the algorithm several times with different initial values to get the appropriate solution.We use the interior point algorithm (see Byrd et al., 1999 [20]; Byrd et al. 2000 [21]) to estimate the canonical vectors, w x and w y .A MATLAB program for the ABCA will be found in [22].
The value of D α,β (w x , w y ) is always non-negative, but there does not exist any fixed upper limit for all values of α and β.Therefore, it is difficult to interpret the result from the values of AB-divergence.Whereas in standard CCA, the value of the canonical coefficient close to one signifies better performance from this method, therefore we will calculate the maximal correlation (Breiman and Friedman, 1985 [9]) as a measure of dependency.The maximal correlation coefficient between w x x and w y y is denoted by ρ * and is defined as: Here, we call ρ * as the AB-canonical coefficient.It is the maximum possible correlation between w y y and any function of w x x.The value of ρ * lies in [0,1].We calculate ρ * using the alternating conditional expectation algorithm (Breiman and Friedman, 1985 [9]).

Sequential Permutation Test
One advantage of ABCA is that if the AB-canonical coefficient is zero, it implies that the corresponding AB-canonical variables are independent, regardless of the distributions of y and x.Therefore, the non-parametric sequential permutation test can be applied to determine the number of significant AB-canonical variables (Yin, 2004 [1]; Efron and Tibshirani, 1994 [23]; Davison and Hinkley, 1997 [24]).On the other hand, the test of significance for the standard CCA is very complicated, and it is typically under the normality assumption (Yin, 2004 [1]).
Let ( i w x , i w y ) be the i-th AB-canonical vectors pair.We want to test the following hypothesis: Testing i H 0 implies that the two canonical variables, i w T x x and i w T y y, are independent.First, we fix the previously found AB-canonical variables, ( j w x , j w y ), j = 1, 2, • • • , i − 1.Then, we take a random permutation of the N observations of x, say x * , and perform ABCA with x * and y using the algorithm described in Section 5. Let us denote the corresponding AB-divergence measure as D * α,β .We repeat this procedure a sufficient number of times (say, R times), and we calculate D * α,β (r), the corresponding AB-divergence measure for the r-th permutation, r = 1, 2, where γ is the level of significance of the test.Then, we reject the null hypothesis, i H 0 , if: where α,β ( i w x , i w y ) is the actual observed value of D α,β ( i w x , i w y ) without permuting data.If i H 0 is rejected, we proceed to the next step to calculate another AB-canonical variable pair.

Extension to Tensor
In this section, we extend the concept of ABCA in the case of tensor data.In many applications, the data structures often contain higher order modes, such as subjects, groups, trials, classes, conditions, etc., together with the intrinsic dimensions of space, time and frequency.Many studies of neuroscience involve recording data over time for multiple subjects (people or animals) and in different conditions, leading to experimental data structures conveniently represented by multi-array tensors.We generalize the idea of ABCA to extract the meaningful components from this type of high dimensional tensor data.
Tensors are denoted by underlined capital boldface letters, e.g., Y ∈ R I 1 ×I 2 ×•••×I Q .The order of a tensor is the number of modes, also known as ways or dimensions (e.g., frequency, subjects, trials, classes, groups and conditions).Throughout this section, we will use the basic tensor operations proposed in the literature (Kolda and Bader, 2009 [25]; Cichocki et al., 2009 [26]).Specifically, the mode-n multiplication of a tensor, Y ∈ R I 1 ×I 2 ×•••×I Q , by a vector, a ∈ R In , is denoted by: where the (i 1 , i 2 , . . ., i n−1 , i n+1 , . . ., i Q )-th element is given by: The mode-n multiplication of a tensor, Y ∈ R I×J×K , by vectors, a ∈ R I , b ∈ R J and c ∈ R K , can be expressed as: Suppose we have two sets of data from the tensor valued random variables, X and Y, {X(n) ∈ where N is the sample size.In tensor ABCA, our aim is to find w x ∈ R I 1 , w x ∈ R I 2 , • • • , w (P ) x ∈ R I P and w (1) , such that the AB-divergence between the joint distribution of the canonical variables: and the product of their marginal distributions is maximized.We define: Here, we find w x , • • • , w and w (1) from the optimization problem: x ,••• ,w x ,w subject to: We denote the first set of AB-canonical vectors as ( 1 w (1) x , 1 w (1) x ,••• ,w subject to: for all j = 1, 2, • • • , i − 1.

Sparseness Constraints
The standard CCA has some disadvantages, especially for large-scale and noisy problems.In general, the canonical variables are linear combinations of all the components of x (or y).This means the canonical variables are dense (not sparse), which often make the physical interpretation of the CCA difficult in many applications.For example, in many applications (from genetics, image analysis, etc.), the coordinate axes have a physical interpretation (each axis may correspond to a specific feature), so a sparse canonical variable is more meaningful than a dense one.Recently, several modifications of CCA have been proposed that impose some sparseness conditions for the canonical variables, and the corresponding method is called sparse canonical correlation analysis (SCCA); see Torres et al. (2007) [27].The main idea in SCCA is to force the canonical variables to be sparse; however, the sparsity profile should be adjustable or well controlled via some parameters in order to discover specific features in the observed data.In a similar way, we propose the sparse AB-canonical analysis.
For sparse AB-canonical analysis, we impose suitable sparsity constraints on the canonical vectors (Witten et al., 2009 [28]; Witten, 2010 [29]).Here, the optimization problem reduces to: where P 1 and P 2 are convex penalty functions and λ 1 , λ 2 are suitably chosen tuning parameters.Some frequently used penalty functions are: Here, also, we use the interior-point algorithm to estimate the canonical vectors.A MATLAB code will be obtained just by changing the optimization function of the standard ABCA in [22].However, if we use a cardinality penalty, then we need to modify the program a little bit, so that the algorithm tries to find a solution in the lower dimensional subspace.For tensor AB-canonical analysis, the sparseness constraints can be imposed in a similar way (see Allen, 2012 [30]).

Simulation Results
The validity and the performance of the proposed ABCA is evaluated based on the simulated data.In the following examples, we have generated {x(n), y(n); n = 1, 2, • • • , N }, such that they have a relationship, as mentioned in Equation (1).Note that the following types of relations are, for example, included in the model: where x = (x 1 , x 2 , x 3 ) T , y = (y 1 , y 2 ) T , b 1 , b 2 and a 0 , a 1 , a 2 are unknown constants.Here, is the random error.However, if a 2 = 0, then the following models are not included in Equation (1): In the first example, we have generated data, such that there exists a non-linear relationship between x and y.We will notice that ABCA successfully extracts the hidden relationship, whereas standard CCA fails.In the next example, we show the robustness property of ABCA and compare it with the standard CCA.Finally, we have given an example when data sets are tensors.The dimensions of x and y are taken as six and four, respectively; so, x = (x 1 , x 2 , • • • , x 6 ) T and y = (y 1 , y 2 , y 3 , y 4 ) T .x is the explanatory variable, where the components are generated from independent N (0, 1) random variables.y is the dependent variable based on the following latent variables: where 1 and 2 are the random errors, and we assume i ∼ 0.05N (0, 1), i = 1, 2. The coefficient vectors, a 1 and a 2 , are generated from independent uniform (−1/2, 1/2) random variables, and then, they are orthogonalized.Therefore, a T 1 a 2 = 0.The relationship between y and the latent variables, y * = (y * 1 , y * 2 ) T , is assumed to be the linear combination, as mentioned below:  (a) (b)

Robustness Property
Example 2: In this example, we check the robustness property of ABCA.To compare it with standard CCA, we have generated data, such that x and y have a linear relationship, and then, few outliers are inserted.The dimensions of x and y are taken as five and three, respectively.All the components of x are generated from independent N (0, 1) random variables.For simplicity, we have taken the relationship between x and y as follows: where is the random error, and we assume ∼ 0.05N (0, 1).Here, y 1 and x 1 are the first components of x and y, respectively.The other components of y are generated from independent N (0, 1) random variables.We have generated only 90 random samples from this model, and we have taken 10 outliers.
For the outlying observations, we have taken x 1 = 0 and y 1 = 10. Figure 3a represents the original data, where there are 10 outlying observations inside the red circle.In Figure 3b, we have plotted the first AB-canonical variable pair.The divergence parameters are taken as α = 0.5 and β = 0.5.It is seen that ABCA successfully extracts the canonical variables, but Figure 4a shows that the standard CCA completely fails.In Figure 4b, we present the scatter plot of the first pair of the canonical variables using the approach of Yin ( 2004) [1].This is based on Kulback-Leibler divergence, so it is a special case of ABCA, where α = 1 and β = 0.The values of the first AB-canonical coefficients for α = 0.5, β = 0.5 and α = 1, β = 0 are 0.9121 and 0.7107, respectively.Thus, we can make ABCA robust by choosing suitable tuning parameters.
The vectors, a x and b (i) x , i = 1, 2, 3, are generated from independent uniform (−1/2, 1/2) random variables, and then, they are orthogonalized.Therefore, a Y is the dependent variable based on the following latent variables: where 1 and 2 are the random errors, and we assume i ∼ 0.05N (0, 1), i = 1, 2. The relationship between Y and the latent variables, y * = (y * 1 , y * 2 ) T , is assumed to be the linear combination, as follows: All other components of Y are independent N (0, 1) random variables.The elements of the matrix, C = (c 1 , c 2 ), are generated following the way we did in Example 1.We have generated a sample size of 100 from X and Y.
The scatter plots of the latent variables are given in Figure 5a,b.We have performed tensor ABCA for this data set with divergent parameters, α = 0.5 and β = 0.5.The first two tensor AB-canonical variable pairs are plotted in Figure 5c,d.The values of the first two tensor AB-canonical coefficients are 0.98671 and 0.9712.It is obvious that ABCA extracts the latent variable quite accurately.

Choice of Divergence Parameters
There does not exist any universal way of selecting divergence parameters, α and β.They generally control the trade-off between the efficiency and robustness properties of the procedure.Although they cover the whole two-dimensional plane, the rate of change in the values of AB-divergence coefficients for very high or very small values of the tuning parameters are very slow.Therefore, we are often interested in choosing the parameters in the interval [0, 1].For α = 1 and β = 1, the AB-divergence turns out to be the L 2 -distance between two densities.L 2 -distance is regarded as a strong robust divergence in the literature, but the robustness is achieved at some loss of efficiency (Basu et al., 1998 [13]; Scott, 2001 [31]).On the other hand, for α = 0 and β = 0, the AB-divergence becomes the L 2 -distance between the logarithm of two densities, which may be regarded as non-robust.Therefore, a suitable choice of the parameters are needed to balance between robustness and efficiency.In our simulation examples, α and β around (0.5, 0.5) seem to a good choice.

Conclusion
We have used AB-divergence measure to perform the canonical correlation analysis.It can extract the hidden non-linear relationship between two sets of data, whereas the standard CCA is designed to find out only the linear relationship.Moreover, the standard CCA is very non-robust against the outlying observations.On the other hand, by choosing suitable tuning parameters, α and β, for the AB-divergence, we can make ABCA robust against outliers.Our method is very general in the sense that it uses AB-divergence, which is a general measure of discrepancy.Moreover, we have generalized the method in the case of tensor data, and we have also considered the sparseness constants.

Figure 1 . 9 . 1 .
Figure 1.(a) and (b): The scatter plots of the latent variables.(c) and (d): The scatter plots of the first two AB-canonical variable pairs.It is clearly seen that the non-linear relationship is reconstructed.

and y 3
and y 4 are independent N (0, 1) random variables.The elements of the matrix, C = (c 1 , c 2 ), are generated from independent uniform (−1/2, 1/2) random variables, and then, their rows are orthogonalized, so that the columns of C −1 become orthogonal.We generate a sample size of 100 from x and y.The scatter plots of the latent variables are given in (a) and (b) of Figure 1.We perform ABCA for this data set with divergent parameters, α = 0.5 and β = 0.5.The first two AB-canonical variable pairs are plotted in (c) and (d) of Figure 1.The values of the first two AB-canonical coefficients are 0.9616 and 0.9301.It is obvious that ABCA extracts the latent variable quite accurately.We notice that the scale and the sign of the canonical vectors cannot be recovered from ABCA.The standard CCA fails to extract them, due to a non-linear relationship with the latent variables.The first two standard canonical variable pairs are plotted in (a) and (b) of Figure 2. The values of the first two canonical coefficients are 0.5704 and 0.3559.

Figure 2 .Figure 3 .Figure 4 .
Figure 2. Scatter plots for the first two standard canonical variable pairs.Here, canonical correlation analysis (CCA) fails to reconstruct the non-linear relationship.

Figure 5 .
Figure 5. (a) and (b): The scatter plots of the latent variables.(c) and (d): The scatter plots of the first two tensor AB-canonical variable pairs.It is clearly seen that the non-linear relationship is reconstructed.