Abstract
In large-scale canonical correlation analysis arising from multi-view learning applications, one needs to compute the canonical weight vectors corresponding to a few of the largest canonical correlations. For this task, we propose a Jacobi–Davidson type algorithm that computes the canonical weight vectors by transforming the problem into the so-called canonical correlation generalized eigenvalue problem. Convergence results are established and reveal the accuracy of the approximate canonical weight vectors. Numerical examples are presented to support the effectiveness of the proposed method.
1. Introduction
Canonical correlation analysis (CCA) is one of the most representative two-view multivariate statistical techniques and has been applied to a wide range of machine learning problems including classification, retrieval, regression and clustering [,,]. It seeks a pair of linear transformations of two views of high-dimensional features such that the associated low-dimensional projections are maximally correlated. Consider two data matrices drawn from two data sets with m and n features, respectively, where d is the number of samples. Without loss of generality, we assume both data matrices are centered, i.e., each feature has zero mean across the d samples; otherwise they can be centered by multiplying on the right with $I_d - \frac{1}{d}\mathbf{1}\mathbf{1}^{T}$, where $\mathbf{1}$ is the vector of all ones. CCA aims to find a pair of canonical weight vectors $x \in \mathbb{R}^{m}$ and $y \in \mathbb{R}^{n}$ that maximize the canonical correlation
$$\rho(x, y) = \frac{x^{T} C\, y}{\sqrt{x^{T} A\, x}\,\sqrt{y^{T} B\, y}}, \qquad (1)$$
where A and B are the within-view sample covariance matrices of the two views and C is the cross-covariance matrix between them,
and then projects the high-dimensional data onto the low-dimensional subspaces spanned by x and y, respectively, to achieve dimensionality reduction. In most cases [,,], a single pair of canonical weight vectors is not enough, since it means the low-dimensional subspaces are only one-dimensional. When a set of canonical weight vectors is required, the single-vector CCA (1) is extended to obtain a pair of canonical weight matrices X and Y by solving the optimization problem
Usually, both A and B are symmetric positive definite. However, there are cases, such as the under-sampled problem [], in which A and B may be only positive semi-definite. In such a case, regularization techniques [,,,], which add a multiple of the identity matrix to each of them, are applied to find the optimal solution of
where the regularization parameters are usually chosen to maximize the cross-validation score []. In other words, A and B are replaced by their regularized counterparts so that both matrices are invertible. Hence, in this paper, we assume by default that A and B are both positive definite unless explicitly stated otherwise.
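To make the setup concrete, the following NumPy sketch forms centered views and the (regularized) covariance matrices discussed above. The variable names (Sx, Sy, rx, ry) and the omission of the 1/d scaling, which cancels in the canonical correlations, are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

def cca_covariances(Sx, Sy, rx=0.0, ry=0.0):
    """Center two views and form (regularized) covariance matrices.

    Sx : (m, d) data matrix of the first view   (m features, d samples)
    Sy : (n, d) data matrix of the second view  (n features, d samples)
    rx, ry : regularization parameters added to the diagonals.
    """
    # Center each view: subtract the mean of every feature across samples.
    Sx = Sx - Sx.mean(axis=1, keepdims=True)
    Sy = Sy - Sy.mean(axis=1, keepdims=True)
    # Within-view covariance matrices (regularized) and the cross-covariance.
    A = Sx @ Sx.T + rx * np.eye(Sx.shape[0])
    B = Sy @ Sy.T + ry * np.eye(Sy.shape[0])
    C = Sx @ Sy.T
    return A, B, C
```

With rx, ry > 0, the matrices A and B returned by this sketch are positive definite even in the under-sampled case d < min(m, n).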
As shown in [], the optimization problem (3) can be equivalently transformed into the following generalized eigenvalue problem
$$K z \equiv \begin{bmatrix} 0 & C \\ C^{T} & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \equiv \lambda M z, \qquad (4)$$
where the positive definiteness of the matrices A and B implies that M is positive definite. The generalized eigenvalue problem (4) is referred to as the Canonical Correlation Generalized Eigenvalue Problem (CCGEP) in this paper. Define $R = L_A^{-1} C L_B^{-T}$, where
$$A = L_A L_A^{T}, \qquad B = L_B L_B^{T} \qquad (5)$$
are the Cholesky decompositions of A and B. It is easy to verify that the substitution $u = L_A^{T} x$, $v = L_B^{T} y$ in (4) gives rise to the symmetric eigenvalue problem
$$\begin{bmatrix} 0 & R \\ R^{T} & 0 \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = \lambda \begin{bmatrix} u \\ v \end{bmatrix}, \qquad (6)$$
and it implies $R v = \lambda u$ and $R^{T} u = \lambda v$.
It means that the eigenpairs of (4) can be obtained by computing the singular values and the associated left and right singular vectors of R. This method works well when the sample size d and the feature dimensions m and n are of moderate size, but it becomes very slow and numerically unstable for the large-scale datasets that are ubiquitous in the age of “Big Data” []. For large-scale datasets, the equivalence between (4) and (6) makes it possible to simply adapt subspace-type algorithms for computing a partial singular value decomposition, such as Lanczos-type algorithms [,] and Davidson-type algorithms [,], and then translate them to CCGEP (4). However, in practice, the decompositions of the sample covariance matrices A and B are usually unavailable in the large-scale case: the decompositions are too expensive to compute explicitly for large-scale problems, and they may destroy sparsity and other structural information. Furthermore, sometimes the sample covariance matrices A and B should never be explicitly formed, such as in online learning systems.
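For problems of moderate size, the equivalence just described can be used directly. The sketch below, which assumes the matrices A, B, C produced above and standard SciPy routines, recovers the leading canonical correlations and weight vectors through Cholesky factors and a dense SVD; it is exactly the kind of decomposition-based reference approach that becomes impractical at large scale.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, svd

def cca_small_scale(A, B, C, k):
    """Leading k canonical correlations/weights via Cholesky + SVD.

    Solves the CCGEP C y = lam*A x, C^T x = lam*B y by forming
    R = L_A^{-1} C L_B^{-T} and taking its SVD (feasible only when
    A and B are small enough to factor densely).
    """
    LA = cholesky(A, lower=True)              # A = LA LA^T
    LB = cholesky(B, lower=True)              # B = LB LB^T
    R = solve_triangular(LA, C, lower=True)   # LA^{-1} C
    R = solve_triangular(LB, R.T, lower=True).T  # (LA^{-1} C) LB^{-T}
    P, sig, QT = svd(R, full_matrices=False)
    # Map singular vectors back: x_i = LA^{-T} p_i, y_i = LB^{-T} q_i,
    # so that X^T A X = I and Y^T B Y = I.
    X = solve_triangular(LA.T, P[:, :k], lower=False)
    Y = solve_triangular(LB.T, QT[:k, :].T, lower=False)
    return sig[:k], X, Y
```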
Meanwhile, in [], it is suggested to solve CCGEP (4) by treating it as a large-scale symmetric positive definite pencil. Some subspace-type numerical algorithms have also been generalized to compute partial eigenpairs of such pencils, see [,]. However, these generic algorithms do not make use of the special structure in (4), and they are usually less efficient than custom-made algorithms. Therefore, existing algorithms either cannot avoid decomposing the covariance matrices or do not exploit the structure of CCGEP.
In this paper, we focus on a Jacobi–Davidson type subspace method for canonical correlation analysis. The idea of the Jacobi–Davidson algorithm proposed in [] is Jacobi's approach combined with a Davidson-type subspace method. Its essence is to construct a correction for a given eigenvector approximation in a subspace orthogonal to that approximation. The approximation is extracted from a given subspace in a Davidson manner, and the subspace is then expanded by solving a correction equation. Owing to its significant improvement in convergence, the Jacobi–Davidson method has become one of the most powerful algorithms for matrix eigenvalue problems, and it has been generalized to almost all fields of matrix computation. For example, in [,], Hochstenbach presented Jacobi–Davidson methods for singular value problems and generalized singular value problems, respectively. In [,], Jacobi–Davidson methods are developed to solve nonlinear and two-parameter eigenvalue problems, respectively. Some other recent work on Jacobi–Davidson methods can be found in [,,,,,]. Motivated by these facts, we continue this line of work by extending the Jacobi–Davidson method to canonical correlation analysis. The main contribution is that the algorithm tackles CCGEP (4) directly, without decomposing the large covariance matrices, and takes advantage of the special structure of K and M; the transformation of (4) into (6) is used only in our theoretical developments.
The rest of this paper is organized as follows. Section 2 collects some notations and a basic result for CCGEP that are essential to our later development. Our main algorithm is given and analyzed in detail in Section 3. We present some numerical examples in Section 4 to show the behaviors of our proposed algorithm and to support our analysis. Finally, conclusions are made in Section 5.
2. Preliminaries
Throughout this paper, $\mathbb{R}^{m\times n}$ is the set of all $m\times n$ real matrices, $\mathbb{R}^{n} = \mathbb{R}^{n\times 1}$, and $\mathbb{R} = \mathbb{R}^{1}$. $I_n$ is the $n\times n$ identity matrix. The superscript “$\cdot^{T}$” takes transpose only, and $\|\cdot\|$ denotes the 2-norm of a vector or matrix. For any matrix $N \in \mathbb{R}^{m\times n}$ with $m \ge n$, $\sigma_i(N)$ for $i = 1, 2, \dots, n$ is used to denote the singular values of N in descending order.
For vectors $x, y \in \mathbb{R}^{n}$, the usual inner product and its induced norm are conveniently defined by
$$\langle x, y\rangle = x^{T} y, \qquad \|x\| = \sqrt{\langle x, x\rangle}.$$
With them, the usual acute angle between x and y can then be defined by
$$\angle(x, y) = \arccos\frac{|\langle x, y\rangle|}{\|x\|\,\|y\|}.$$
Similarly, given any symmetric positive definite $W \in \mathbb{R}^{n\times n}$, the W-inner product and its induced W-norm are defined by
$$\langle x, y\rangle_{W} = x^{T} W y, \qquad \|x\|_{W} = \sqrt{\langle x, x\rangle_{W}}.$$
If $\langle x, y\rangle_{W} = 0$, then we say that x and y are W-orthogonal, or $x \perp_{W} y$. The W-acute angle between x and y can then be defined by
$$\angle_{W}(x, y) = \arccos\frac{|\langle x, y\rangle_{W}|}{\|x\|_{W}\,\|y\|_{W}}.$$
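Since A- and B-orthonormal bases are needed repeatedly in the algorithm of Section 3, a minimal sketch of Gram–Schmidt orthonormalization in a W-inner product is included here; the function name and the two-pass reorthogonalization are implementation choices of this sketch, not prescriptions from the paper.

```python
import numpy as np

def w_orthonormalize(U, w, W):
    """Orthonormalize w against the columns of U in the W-inner product.

    Assumes U already satisfies U^T W U = I.  Returns a vector u with
    U^T W u = 0 and u^T W u = 1 (two Gram-Schmidt passes for stability).
    """
    for _ in range(2):                     # classical Gram-Schmidt, twice
        w = w - U @ (U.T @ (W @ w))
    nrm = np.sqrt(w @ (W @ w))             # W-norm of the projected vector
    return w / nrm
```

In the algorithm of Section 3, such a routine would be applied with W = A for the left basis and with W = B for the right basis.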
Let the singular value decomposition of $R \in \mathbb{R}^{m\times n}$ (with $m \ge n$) be $R = P \Sigma Q^{T}$, where $P = [p_1, \dots, p_m] \in \mathbb{R}^{m\times m}$ and $Q = [q_1, \dots, q_n] \in \mathbb{R}^{n\times n}$ are orthonormal, i.e., $P^{T} P = I_m$ and $Q^{T} Q = I_n$, and $\Sigma \in \mathbb{R}^{m\times n}$ has $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ on its leading diagonal. The singular value decomposition of R closely relates to the eigendecomposition of the following symmetric matrix [] (p. 32):
$$\begin{bmatrix} 0 & R \\ R^{T} & 0 \end{bmatrix},$$
whose eigenvalues are $\pm\sigma_i$ for $i = 1, \dots, n$ plus $m - n$ zeros, i.e.,
$$\sigma_1 \ge \dots \ge \sigma_n \ge 0 = \dots = 0 \ge -\sigma_n \ge \dots \ge -\sigma_1,$$
with associated eigenvectors
$$\frac{1}{\sqrt{2}}\begin{bmatrix} p_i \\ q_i \end{bmatrix} \quad\text{and}\quad \frac{1}{\sqrt{2}}\begin{bmatrix} p_i \\ -q_i \end{bmatrix}, \qquad i = 1, \dots, n,$$
respectively. The equivalence between (4) and (6) leads to the conclusion that the eigenvalues of CCGEP (4) are $\pm\sigma_i$ for $i = 1, \dots, n$ plus zeros, and the corresponding eigenvectors are
$$z_{\pm i} = \begin{bmatrix} x_i \\ \pm y_i \end{bmatrix}, \qquad i = 1, \dots, n,$$
respectively, where
$$x_i = L_A^{-T} p_i, \qquad y_i = L_B^{-T} q_i. \qquad (7)$$
Let $X = [x_1, \dots, x_n]$ and $Y = [y_1, \dots, y_n]$. Then the A- and B-orthonormality constraints on X and Y, respectively, i.e., $X^{T} A X = I_n$ and $Y^{T} B Y = I_n$, follow from $P^{T} P = I_m$ and $Q^{T} Q = I_n$. Here, the first few $x_i$ and $y_i$ are the wanted canonical weight vectors. Furthermore, their corresponding eigenvalues satisfy the following maximization principle, which is critical to our later developments. For the proof see Appendix A.1.
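The statement of this maximization principle is not reproduced above. In the form commonly stated for CCA, and consistent with the interlacing argument of Appendix A.1, it is the following trace characterization; the notation should be read as a reconstructed sketch rather than the paper's exact wording.

```latex
% Maximization principle (reconstructed sketch): the sum of the k largest
% eigenvalues of CCGEP (4) equals the maximal total correlation over
% A- and B-orthonormal weight matrices.
\lambda_1 + \lambda_2 + \cdots + \lambda_k \;=\;
\max_{\substack{X \in \mathbb{R}^{m \times k},\; Y \in \mathbb{R}^{n \times k} \\
                X^{T} A X = I_k,\; Y^{T} B Y = I_k}}
\operatorname{trace}\!\left( X^{T} C\, Y \right),
\qquad 1 \le k \le \min\{m, n\}.
```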
3. The Main Algorithm
The idea of the Jacobi–Davidson method [] is to iteratively construct approximations of certain eigenpairs. It expands the search subspace by solving a correction equation, and finds approximate eigenpairs as the best approximations within that search subspace.
3.1. Subspace Extraction
Let and with and , respectively. As stated in [], we call a pair of deflating subspaces of CCGEP (4) if
Let and be A- and B-orthonormal basis matrices of the subspaces and , respectively, i.e.,
The equality (13) implies that there exist and [] (Equation (2.11)) such that
They imply . So (14) is equivalent to
Now if is a singular triplet of , then gives an eigenpair of (4), where , and . This is because
and similarly . That means
Hence, we have shown that a pair of deflating subspaces can be used to recover the eigenpairs associated with that pair of deflating subspaces of CCGEP (4). In practice, pairs of exact deflating subspaces are usually not available, and one usually uses Lanczos-type methods [] or Davidson-type methods [] to generate approximate ones, such as Krylov subspaces in the Lanczos method []. Next, we discuss how to extract the best approximate eigenpairs from a given pair of approximate deflating subspaces.
In what follows, we consider the simple case: . Suppose is an approximation of a pair of deflating subspaces with . Let and be the A- and B-orthonormal basis matrices of the subspaces and , respectively. Denote , the singular values of in descending order with associated left and right singular vectors and , respectively, i.e.,
Even though U and V, as A- and B-orthonormal basis matrices, are not unique, these singular values are. Motivated by the maximization principle in Theorem 1, we seek the best approximations, associated with the pair of approximate deflating subspaces, to the eigenpairs () in the sense of
for any and satisfying , and . We claim that the quantity in (15) is given by . To see this, we notice that any and in (15) can be written as
for some and with , and vice versa. Therefore the quantity in (15) is equal to
which follows from properties of the singular value decomposition of []. The maximum is attained at and . Therefore, naturally, the best approximations to () in the sense of (15) are given by
Define the residual
where K and M are defined in (4). It is noted that
and similarly the other term equals 0. We summarize what we have done in this subsection in the following theorem.
Theorem 2.
Suppose is a pair of approximate deflating subspaces with . Let and be the A- and B-orthonormal basis matrices of the subspaces and , respectively. Denote , the singular values of in descending order. Then, for any ,
the best approximations to the eigenpairs () in the sense of (15) are () given by (16), and the associated residuals defined in (17) admit and .
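Theorem 2 translates into a short extraction routine: given A- and B-orthonormal bases U and V, the approximate eigenpairs come from the singular value decomposition of the small projected matrix U^T C V. The sketch below assumes dense NumPy arrays and the notation introduced earlier in these reconstructions; it illustrates the extraction step rather than reproducing the paper's implementation.

```python
import numpy as np

def extract(U, V, A, B, C, k):
    """Extract k approximate eigenpairs from approximate deflating subspaces.

    U (m x p) and V (n x p) are assumed A- and B-orthonormal, i.e.
    U^T A U = I and V^T B V = I.  Returns approximate canonical
    correlations, weight vectors and the block residuals.
    """
    H = U.T @ (C @ V)                      # small projected matrix
    P, theta, QT = np.linalg.svd(H)        # theta: descending singular values
    Xk = U @ P[:, :k]                      # approximate left weight vectors
    Yk = V @ QT[:k, :].T                   # approximate right weight vectors
    # Residuals of the CCGEP K z = lam*M z, written block-wise.
    Rx = C @ Yk - (A @ Xk) * theta[:k]     # upper block:  C y - theta A x
    Ry = C.T @ Xk - (B @ Yk) * theta[:k]   # lower block:  C^T x - theta B y
    return theta[:k], Xk, Yk, Rx, Ry
```

In exact arithmetic the returned residual blocks satisfy U^T Rx = 0 and V^T Ry = 0, which appears to be the orthogonality property summarized in Theorem 2.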
3.2. Correction Equation
In this subsection, we turn to construct a correction equation for a given eigenpair approximation. Suppose with is an approximation of the eigenpair of CCGEP (4), and is the associated residual. We seek A- and B-orthogonal modifications of and , respectively, such that
where and . Then, by (18), we have
Notice that and by Theorem 2, which gives rise to
and
Because and , Equation (20) is rewritten as
However, we do not know here. It is natural that we use to replace in (21) to get the final correction equation, i.e.,
We summarize what we have so far into Algorithm 1, and make a few comments on Algorithm 1.
- (1)
- At step 2, A- and B-orthogonalization procedures are applied to make sure the columns of U and V remain A- and B-orthonormal, respectively.
- (2)
- At step 7, in most cases the correction equation need not be solved exactly. A few steps of an iterative method for symmetric linear systems, such as the linear conjugate gradient method (CG) [] or the minimum residual method (MINRES) [], are sufficient; a matrix-free sketch of such an inner solve is given after Algorithm 1. Usually, more steps in solving the correction equation lead to fewer outer iterations. This will be shown in the numerical examples.
- (3)
- For the convergence test, we use the relative residual norms to determine whether the approximate eigenpairs have converged to the desired accuracy. In addition, in a practical implementation, once one or several approximate eigenpairs converge to a preset accuracy, they should be deflated so that they are not re-computed in the following iterations. Suppose the first few eigenpairs have already been computed. We can then consider the deflated generalized eigenvalue problem (24). By (11), it is clear that the eigenvalues of (24) consist of two groups: those associated with the already-computed eigenvectors are shifted to zero, while the others remain unchanged. Furthermore, for the correction equation, we find s and t subject to additional A- and B-orthogonality constraints against the already-computed eigenvectors, respectively. By a derivation similar to that of (22), one obtains the correction equation after deflation, in which the deflated quantities play the roles of the corresponding quantities in Algorithm 1.
- (4)
- At step 5, LAPACK’s routine xGESVD can be used to solve the singular value problem of the projected matrix because of its small size. This form is preserved in the algorithm while refining the bases U and V at step 8. The new basis matrices are reassigned to U and V, respectively. Although a few extra costs are incurred, this refinement is necessary in order to obtain faster convergence of the eigenvectors, as stated in [,]. Furthermore, a restart is easily executed by keeping the first columns of U and V when the dimension of the subspaces exceeds a prescribed maximum. The restart technique appears at step 8 to keep the sizes of U and V small. There are many ways to specify the minimum and maximum subspace dimensions; in our numerical examples, we simply take fixed values.
Algorithm 1 Jacobi–Davidson method for canonical correlation analysis (JDCCA).
Input: initial vectors. Output: converged canonical weight vectors.
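Comment (2) above suggests a few MINRES steps for the correction equation. Since the exact form of Equation (22) is not reproduced here, the following sketch shows one standard Jacobi–Davidson style construction for the pencil (K, M): a projected, shifted operator applied matrix-free through a SciPy LinearOperator and solved approximately with MINRES. All names, and the particular projectors, are assumptions of this sketch; the paper's Equation (22) exploits the two-by-two block structure more explicitly.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def jd_correction(K, M, z, theta, r, steps=20):
    """Approximately solve a projected JD-type correction equation.

    z is the current approximate eigenvector with z^T M z = 1, theta the
    Ritz value and r = K z - theta*M z the residual.  Returns an update
    direction that is M-orthogonal to z (a generic JD construction, not
    the paper's exact Equation (22)).
    """
    Mz = M @ z
    n = z.shape[0]

    def matvec(w):
        w = w - z * (Mz @ w)               # right projector  I - z z^T M
        w = (K @ w) - theta * (M @ w)      # shifted operator K - theta*M
        return w - Mz * (z @ w)            # left projector   I - M z z^T

    Aop = LinearOperator((n, n), matvec=matvec, dtype=float)
    s, _ = minres(Aop, -r, maxiter=steps)  # a few inner MINRES steps
    return s - z * (Mz @ s)                # enforce M-orthogonality to z
```

The projected operator is symmetric whenever K and M are, so MINRES (or CG, when applicable) is a natural inner solver, as noted in comment (2).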
3.3. Convergence
The convergence theories of the Jacobi–Davidson method for the eigenvalue and singular value problems are given in [,], respectively. Here we prove a similar convergence result for the Jacobi–Davidson method for CCGEP, based on the following lemma. Specifically, if we solve the correction Equation (22) exactly and the current approximations are close enough to x and y, respectively, the approximate eigenvectors can be expected to converge cubically. For the proofs see Appendix A.2 and Appendix A.3.
Lemma 1.
Let λ be a simple eigenvalue of CCGEP (4) with the corresponding eigenvector . Then the matrix
is a bijection from onto itself, where and are A- and B-orthogonal complementary spaces of and , respectively.
Theorem 3.
Assume the condition of Lemma 1, and . Let be the exact solution of the correction Equation (22). Then,
4. Numerical Examples
In this section, we present some numerical examples to illustrate the effectiveness of Algorithm 1. Our goal is to compute the first few canonical weight vectors. A computed approximate eigenpair is considered converged when its relative residual norm falls below a prescribed tolerance.
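The exact normalization and tolerance of the convergence test are not reproduced above; a typical choice, shown purely as an assumption, measures the residual norm relative to the norms of K z and M z.

```python
import numpy as np

def is_converged(K, M, z, theta, tol=1e-8):
    """Relative residual test for an approximate eigenpair (theta, z).

    The normalization and the tolerance are illustrative assumptions,
    not the paper's exact criterion.
    """
    r = K @ z - theta * (M @ z)
    denom = np.linalg.norm(K @ z) + abs(theta) * np.linalg.norm(M @ z)
    return np.linalg.norm(r) <= tol * denom
```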
All the experiments in this paper are executed on an Ubuntu 12.04 (64 bit) desktop with an Intel(R) Core(TM) i7-6700 CPU@3.40 GHz and 32 GB of RAM, using MATLAB 2010a with machine epsilon approximately 2.2 × 10⁻¹⁶ in double-precision floating point arithmetic.
Example 1.
We first examine Theorem 3 by using two pairs of data matrices which come from a publicly available handwritten numerals dataset (https://archive.ics.uci.edu/ml/datasets/Multiple+Features). It consists of features of handwritten numerals (‘0’–‘9’) and each digit has 200 patterns. Each pattern is represented by six different feature sets, i.e., Fou, Fac, Kar, Pix, Zer and Mor. Two pairs of feature sets, Fou-Zer and Pix-Fou, are chosen to form the two data matrices in each case. To make the numerical example repeatable, the initial vectors are set to be
where m and n are the dimensions of the two views, respectively, the perturbation is generated by a MATLAB built-in function, and the reference eigenvector is computed by MATLAB's function eig on (4) and is considered to be the “exact” eigenvector for testing purposes. The correction Equation (22) in Algorithm 1 is solved by a direct method, such as Gaussian elimination, and the solution obtained this way is regarded as exact in this example. Figure 1 plots the approximation errors of the two weight vectors in the first three iterations of Algorithm 1 when computing the first canonical weight vector of Fou-Zer and Pix-Fou. Figure 1 clearly shows that the convergence of Algorithm 1 is very fast when the initial vectors are close enough to the exact vectors, and the cubic convergence of Algorithm 1 appears in the third iteration.
Figure 1.
Convergence behavior of Algorithm 1 for computing the first canonical weight vector of Fou-Zer and Pix-Fou.
Example 2.
As stated in Algorithm 1, the implementation of JDCCA involves solving the correction Equation (22) at every step. Direct solvers such as those mentioned in Example 1 are prohibitively expensive for large-scale sparse linear systems. In such a case, iterative methods, such as the MINRES method, which is essentially GMRES [] applied to symmetric linear systems, are usually preferred. In this example, we report the effect of the number of MINRES steps used for the correction equation on the total number of matrix-vector products (denoted by “#mvp”), the number of outer iterations (denoted by “#iter”), and the CPU time in seconds for Algorithm 1 to compute the first 10 canonical weight vectors of the test problems listed in Table 1. Table 1 presents three face datasets, i.e., the ORL (https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), FERET (http://www.nist.gov/itl/iad/ig/colorferet.cfm) and Yale (https://computervisiononline.com/dataset/1105138686) datasets. The ORL database contains 400 face images of 40 distinct persons. For each individual, there are 10 different gray-scale images. These images were collected at different times, under different lighting conditions and with different facial expressions (eyes open or closed, smiling or not smiling, with or without glasses). In order to apply CCA, as in the numerical experiments in [], the ORL dataset is partitioned into two groups: we select the first five images per individual as the first view to generate one data matrix, while the remaining images generate the other. Similarly, we obtain the two data matrices for the FERET and Yale datasets. The numbers of rows and columns of the data matrices are detailed in Table 1.
Table 1.
Test problems.
In this example, we set the initial vectors, fix the restart parameters, and simply take a fixed regularization parameter. We let the number of MINRES steps vary from 5 to 40, and collect the numerical results in Figure 2. As expected, the total number of outer iterations decreases as the number of MINRES steps increases, while the total number of matrix-vector products does not change monotonically with it; this depends on how much the outer iteration count is reduced as the number of inner steps grows. In addition, Figure 2 shows that the total #mvp is not the only deciding factor for the total CPU time: when the number of MINRES steps is larger, the significantly reduced #iter leads to a smaller total CPU time. For these three test examples, MINRES step counts of around 15 to 25 appear to be cost-effective, and increasing them beyond 40 usually does not have a significant effect. The least efficient case occurs when the number of MINRES steps is too small.
Figure 2.
Cost in computing the first 10 canonical weight vectors of ORL (top), FERET (middle) and Yale (bottom) datasets with MINRES steps for the correction equation varying from 5 to 40.
Example 3.
In this example, we compare Algorithm 1, i.e., JDCCA, with the Jacobi–Davidson QZ type method [] (JDQZ) applied to the large-scale symmetric positive definite pencil defined in (4), to compute the first 10 canonical weight vectors of the test problems listed in Table 1 with a fixed number of MINRES steps. We use the same restart parameters in Algorithm 1, choose a corresponding initial vector for the JDQZ algorithm, and compute the same relative residual norms. The corresponding numerical results are plotted in Figure 3. For these three test problems, Figure 3 suggests that Algorithm 1 always outperforms the JDQZ algorithm. Other experiments with different test problems and numbers of MINRES steps, not reported here, illustrate the same point.
Figure 3.
Convergence behavior of JDCCA and JDQZ for computation of the first 10 canonical weight vectors of ORL (top), FERET (middle) and Yale (bottom) datasets with MINRES step .
5. Conclusions
To analyze the correlations between two data sets, several numerical algorithms are available for finding the canonical correlations and the associated canonical weight vectors; however, there is very little discussion of the large-scale sparse and structured matrix cases in the literature. In this paper, a Jacobi–Davidson type method, i.e., Algorithm 1, is presented for large-scale canonical correlation analysis by computing a small portion of the eigenpairs of the canonical correlation generalized eigenvalue problem (4). A theoretical result is established in Theorem 3 demonstrating the cubic convergence of the approximate eigenvectors when the correction equation is solved exactly and the approximate eigenvector from the previous step is close enough to the exact one. The corresponding numerical results are presented to confirm the asymptotic convergence rate provided by Theorem 3, and to demonstrate that Algorithm 1 performs far better than the JDQZ method applied to the large-scale symmetric positive definite pencil.
Notice that the main computational task in every iteration of Algorithm 1 consists of solving the correction Equation (22). In our numerical examples, we only consider the plain version of MINRES, i.e., without any preconditioner. However, incorporating a preconditioner presents no difficulty and can improve the numerical performance if a suitable preconditioner is available. In addition, multi-set canonical correlation analysis (MCCA) [], which was proposed to analyze linear relationships among more than two data sets, can be equivalently transformed into a generalized eigenvalue problem of a similar block structure, so the development of efficient Jacobi–Davidson methods for treating such large-scale MCCA problems will be a subject of our future study.
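For reference, one commonly used form of the MCCA generalized eigenvalue problem, written here as a standard reconstruction with assumed symbols $S_1, \dots, S_k$ for the centered data matrices, is:

```latex
% MCCA generalized eigenvalue problem (a standard SUMCOR-type form; the
% symbols S_i and the block layout are assumptions for illustration).
\begin{bmatrix}
  0            & S_1 S_2^{T} & \cdots & S_1 S_k^{T} \\
  S_2 S_1^{T}  & 0           & \cdots & S_2 S_k^{T} \\
  \vdots       &             & \ddots & \vdots      \\
  S_k S_1^{T}  & S_k S_2^{T} & \cdots & 0
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}
= \lambda
\begin{bmatrix}
  S_1 S_1^{T} &        &             \\
              & \ddots &             \\
              &        & S_k S_k^{T}
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix},
```

which reduces to a problem of the same block structure as (4) when k = 2.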
Author Contributions
Writing—original draft, Z.T.; Writing—review and editing, X.Z. and Z.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work is supported in part by National Natural Science Foundation of China NSFC-11601081 and the research fund for distinguished young scholars of Fujian Agriculture and Forestry University No. xjq201727.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix A.1
Proof of Theorem 1.
To prove (12), for any and satisfying and , respectively, we first consider the augmented matrices of and , i.e.,
whose eigenvalues for plus zeros and for plus , respectively. Notice that
where and are defined in (5), and and satisfy and , respectively. Hence, apply Cauchy’s interlacing inequalities [] (Corollary 4.4) for the symmetric eigenvalue problem to the matrices and , to get for and consequently
for any and such that and .
Appendix A.2
Proof of Lemma 1.
Let and it satisfies . We will prove . Since
then we have
where and , which leads to
Let and . Then the equality (A3) can be rewritten as
where , p and q are defined in (7). Multiply the first and second equations of (A4) by and from left, respectively, to get
Therefore, both and p belong to the kernel of , and both and q belong to the kernel of . The simplicity of λ implies that these vectors are multiples of p and q, respectively. Since and , we have and , which means . Therefore, . The bijectivity follows from comparing dimensions. □
Appendix A.3
Proof of Theorem 3.
Let
Then the correction equation is
Since , there exists such that where . It follows that where and . Similarly, there are and a scalar such that where . It is noted that
where and . Since and , the equality (A6) leads to
It is noted that , , and . Then, we have by (A7)
In addition, since and , multiplying (A7) on the left by leads to
By Lemma 1, when and are close enough to x and y, respectively, we see that is invertible. It follows by (A9) that
The last equality holds because of , and (A10), which means and . Therefore,
where . Similarly, we have . □
References
- Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664. [Google Scholar] [CrossRef] [PubMed]
- Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar]
- Wang, L.; Zhang, L.H.; Bai, Z.; Li, R.C. Orthogonal canonical correlation analysis and applications. Opt. Methods Softw. 2020, 35, 787–807. [Google Scholar] [CrossRef]
- Uurtio, V.; Monteiro, J.M.; Kandola, J.; Shawe-Taylor, J.; Fernandez-Reyes, D.; Rousu, J. A tutorial on canonical correlation methods. ACM Comput. Surv. 2017, 50, 1–33. [Google Scholar] [CrossRef]
- Zhang, L.H.; Wang, L.; Bai, Z.; Li, R.C. A self-consistent-field iteration for orthogonal CCA. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 1–15. [Google Scholar] [CrossRef]
- Fukunaga, K. Introduction to Statistical Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 2013. [Google Scholar]
- González, I.; Déjean, S.; Martin, P.G.P.; Gonçalves, O.; Besse, P.; Baccini, A. Highlighting relationships between heterogeneous biological data through graphical displays based on regularized canonical correlation analysis. J. Biol. Syst. 2009, 17, 173–199. [Google Scholar] [CrossRef]
- Leurgans, S.E.; Moyeed, R.A.; Silverman, B.W. Canonical correlation analysis when the data are curves. J. R. Stat. Soc. B. Stat. Methodol. 1993, 55, 725–740. [Google Scholar] [CrossRef]
- Raul, C.C.; Lee, M.L.T. Fast regularized canonical correlation analysis. Comput. Stat. Data Anal. 2014, 70, 88–100. [Google Scholar]
- Vinod, H.D. Canonical ridge and econometrics of joint production. J. Econ. 1976, 4, 147–166. [Google Scholar] [CrossRef]
- González, I.; Déjean, S.; Martin, P.G.P.; Baccini, A. CCA: An R package to extend canonical correlation analysis. J. Stat. Softw. 2008, 23, 1–14. [Google Scholar] [CrossRef]
- Ma, Z. Canonical Correlation Analysis and Network Data Modeling: Statistical and Computational Properties. Ph.D. Thesis, University of Pennsylvania, Philadelphia, PA, USA, 2017. [Google Scholar]
- Golub, G.; Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal. 1965, 2, 205–224. [Google Scholar] [CrossRef]
- Jia, Z.; Niu, D. An implicitly restarted refined bidiagonalization Lanczos method for computing a partial singular value decomposition. SIAM J. Matrix Anal. Appl. 2003, 25, 246–265. [Google Scholar] [CrossRef]
- Hochstenbach, M.E. A Jacobi—Davidson type SVD method. SIAM J. Sci. Comput. 2001, 23, 606–628. [Google Scholar] [CrossRef]
- Zhou, Y.; Wang, Z.; Zhou, A. Accelerating large partial EVD/SVD calculations by filtered block Davidson methods. Sci. China Math. 2016, 59, 1635–1662. [Google Scholar] [CrossRef]
- Allen-Zhu, Z.; Li, Y. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 98–106. [Google Scholar]
- Saad, Y. Numerical Methods for Large Eigenvalue Problems: Revised Edition; SIAM: Philadelphia, PA, USA, 2011. [Google Scholar]
- Stewart, G.W. Matrix Algorithms Volume II: Eigensystems; SIAM: Philadelphia, PA, USA, 2001; Volume 2. [Google Scholar]
- Sleijpen, G.L.G.; Van der Vorst, H.A. A Jacobi—Davidson iteration method for linear eigenvalue problems. SIAM Rev. 2000, 42, 267–293. [Google Scholar] [CrossRef]
- Hochstenbach, M.E. A Jacobi—Davidson type method for the generalized singular value problem. Linear Algebra Appl. 2009, 431, 471–487. [Google Scholar] [CrossRef]
- Betcke, T.; Voss, H. A Jacobi–Davidson-type projection method for nonlinear eigenvalue problems. Future Gen. Comput. Syst. 2004, 20, 363–372. [Google Scholar] [CrossRef]
- Hochstenbach, M.E.; Plestenjak, B. A Jacobi—Davidson type method for a right definite two-parameter eigenvalue problem. SIAM J. Matrix Anal. Appl. 2002, 24, 392–410. [Google Scholar] [CrossRef]
- Arbenz, P.; Hochstenbach, M.E. A Jacobi—Davidson method for solving complex symmetric eigenvalue problems. SIAM J. Sci. Comput. 2004, 25, 1655–1673. [Google Scholar] [CrossRef]
- Campos, C.; Roman, J.E. A polynomial Jacobi—Davidson solver with support for non-monomial bases and deflation. BIT Numer. Math. 2019, 60, 295–318. [Google Scholar] [CrossRef]
- Hochstenbach, M.E. A Jacobi—Davidson type method for the product eigenvalue problem. J. Comput. Appl. Math. 2008, 212, 46–62. [Google Scholar] [CrossRef]
- Hochstenbach, M.E.; Muhič, A.; Plestenjak, B. Jacobi—Davidson methods for polynomial two-parameter eigenvalue problems. J. Comput. Appl. Math. 2015, 288, 251–263. [Google Scholar] [CrossRef]
- Meerbergen, K.; Schröder, C.; Voss, H. A Jacobi–Davidson method for two-real-parameter nonlinear eigenvalue problems arising from delay-differential equations. Numer. Linear Algebra Appl. 2013, 20, 852–868. [Google Scholar] [CrossRef]
- Rakhuba, M.V.; Oseledets, I.V. Jacobi–Davidson method on low-rank matrix manifolds. SIAM J. Sci. Comput. 2018, 40, A1149–A1170. [Google Scholar] [CrossRef]
- Stewart, G.W.; Sun, J.G. Matrix Perturbation Theory; Academic Press: Boston, FL, USA, 1990. [Google Scholar]
- Teng, Z.; Wang, X. Majorization bounds for SVD. Jpn. J. Ind. Appl. Math. 2018, 35, 1163–1172. [Google Scholar] [CrossRef]
- Bai, Z.; Li, R.C. Minimization principles and computation for the generalized linear response eigenvalue problem. BIT Numer. Math. 2014, 54, 31–54. [Google Scholar] [CrossRef]
- Teng, Z.; Li, R.C. Convergence analysis of Lanczos-type methods for the linear response eigenvalue problem. J. Comput. Appl. Math. 2013, 247, 17–33. [Google Scholar] [CrossRef]
- Demmel, J.W. Applied Numerical Linear Algebra; SIAM: Philadelphia, PA, USA, 1997. [Google Scholar]
- Saad, Y. Iterative Methods for Sparse Linear Systems; SIAM: Philadelphia, PA, USA, 2003. [Google Scholar]
- Teng, Z.; Zhou, Y.; Li, R.C. A block Chebyshev-Davidson method for linear response eigenvalue problems. Adv. Comput. Math. 2016, 42, 1103–1128. [Google Scholar] [CrossRef]
- Zhou, Y.; Saad, Y. A Chebyshev–Davidson algorithm for large symmetric eigenproblems. SIAM J. Matrix Anal. Appl. 2007, 29, 954–971. [Google Scholar] [CrossRef]
- Sleijpen, G.L.G.; Van der Vorst, H.A. The Jacobi–Davidson method for eigenvalue problems as an accelerated inexact Newton scheme. In Proceedings of the IMACS Conference, Blagoevgrad, Bulgaria, 17–20 June 1995. [Google Scholar]
- Saad, Y.; Schultz, M.H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 1986, 7, 856–869. [Google Scholar] [CrossRef]
- Lee, S.H.; Choi, S. Two-dimensional canonical correlation analysis. IEEE Signal Process. Lett. 2007, 14, 735–738. [Google Scholar] [CrossRef]
- Fokkema, D.R.; Sleijpen, G.L.G.; Van der Vorst, H.A. Jacobi–Davidson style QR and QZ algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput. 1998, 20, 94–125. [Google Scholar] [CrossRef]
- Desai, N.; Seghouane, A.K.; Palaniswami, M. Algorithms for two dimensional multi set canonical correlation analysis. Pattern Recognit. Lett. 2018, 111, 101–108. [Google Scholar] [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).