Incremental Canonical Correlation Analysis

Featured Application: This paper presents solutions for real-time visual tasks such as visual tracking. Real-life scenarios such as tracking specific targets can apply this method by extending ICCA. Abstract: Canonical correlation analysis (CCA) is a simple yet effective multiview feature learning technique. In general, it learns separate subspaces for two views by maximizing their correlations. However, two restrictions still limit its applicability to large-scale datasets, such as videos: (1) sufficiently large memory requirements and (2) high computational complexity for matrix inversion. To address these issues, we propose an incremental canonical correlation analysis (ICCA), which adaptively maintains constant memory storage for both the mean and covariance matrices. More importantly, to avoid matrix inversion, we save overhead time by using sequential singular value decomposition (SVD), which remains efficient when the number of new samples is small. Driven by visual tracking, which tracks a specific target in a video sequence, we readily apply the proposed ICCA to this task through some essential modifications to evaluate its efficacy. Extensive experiments on several video sequences show the superiority of ICCA when compared to several classical trackers.


Introduction
Canonical correlation analysis, as a mathematical statistical tool, was proposed by Hotelling [1] in 1936. It is mainly used to analyze the association between two groups of random variables. In 2003, Hardoon et al. [2] briefly reviewed CCA and provided a generalized framework with optimization and kernel tricks. CCA is related to mutual information, so it was used early in information retrieval. Later on, it was broadly applied to dimensionality reduction [3][4][5], clustering [6,7], regression [8,9], word embedding [10][11][12], and discriminant learning [13][14][15]. In such studies, plain CCA was used directly, ignoring the nonlinearity or geometrical structure within the data sets. To this end, many variants of CCA emerged in diverse manners. For instance, sparse CCA [16] accounted for feature sparseness for more intuitive interpretation via L1 regularization. Graph-embedded-based CCA (GCCA) [17][18][19] integrated the local manifold structure within data for image classification by graph-embedded subspace learning with auxiliary methods [20,21]. More typically, deep CCA (DCCA) [22] trained a neural network under the guidance of the CCA objective to model data nonlinearity. In addition, generalized CCA (GCCA) [23] extended CCA with a latent common feature representation. Similar to DCCA, deep GCCA [24] also realized CCA with deep architectures using the GCCA objective. All the above methods involve eigenvalue decomposition, matrix inversion, or both, which obstructs the applicability of CCA and its variants in large-scale datasets or online learning tasks.
To make CCA efficient, many methods endeavored to design efficient optimization algorithms that address the computational inefficiency of high-dimensional CCA in different settings [25][26][27][28][29][30][31][32][33]. Kettenring et al. [34] showed that CCA is equivalent to a constrained least-squares optimization problem. Zha et al. [35] calculated CCA by QR decomposition, in which a matrix A ∈ R^{m×n} is factorized into the product of an orthogonal matrix Q and an upper triangular matrix R. However, large-scale data would still slow down the calculation. Afterwards, Avron et al. [36] proposed a fast CCA algorithm by employing a randomized dimensionality reduction transform [37]. More recently, other methods for solving CCA were based on gradient descent. Ma et al. [33] introduced an enhanced approximate gradient mechanism and further extended it to stochastic optimization problems. In [38,39], CCA was reformulated into a series of least-squares problems, which were solved by fast gradient descent. Such algorithms are still slow in practice and thus unsuited for online learning tasks.
To address this issue, we develop a simple yet efficient incremental canonical correlation analysis (ICCA) to accelerate plain CCA. In particular, ICCA adaptively maintains constant memory storage for the mean and covariance matrices, among other statistics; large amounts of history data need not be stored, so the space complexity remains constant. To reduce the time complexity, we avoid matrix inversion in the learning process by using the sequential Karhunen-Loeve (SKL) algorithm, i.e., sequential singular value decomposition (SVD), which remains efficient when the number of new samples is small. Driven by visual tracking as an online learning task, which locates a given target in a long video sequence, we apply the proposed ICCA to this task through some essential modifications to evaluate its efficacy. Extensive experiments on several video sequences show the superiority of ICCA compared to several classical trackers. Importantly, the efficiency of ICCA is promising.
There are three aspects to our contribution: (1) we provide a new perspective on solving the efficiency problem of CCA; (2) from the perspective of SVD, we propose an ICCA method that avoids matrix inversion to save overhead time; and (3) a tracking algorithm based on ICCA further expands the applicability of CCA to real-life scenarios.
The rest of this paper is outlined as follows. We list the related work about CCA in Section 2. Section 3 details CCA and our method. Section 4 offers the experimental results, and we give our conclusions in Section 5.

Related Work
Owing to the problems of CCA described above, a large number of methods have been proposed to address them, mostly by improving CCA along the line of specific applications. Based on this consideration, we mainly review in this section the optimization algorithms for CCA that are most related to our method.
To improve the efficiency of CCA, a randomized CCA algorithm for a pair of tall and thin matrices was proposed in [36], which first obtained the matrices by randomly reducing their dimensionality. The newly generated matrices, as a different dataset, were then fed to existing algorithms. However, Ma et al. [33] pointed out that the randomized CCA algorithm still cannot thoroughly solve the original high-complexity problems. Lu et al. [31] obtained suboptimal results by viewing the problem as a sequence of iterative least-squares with imprecise approximation; this iterative algorithm achieved good results at low cost. In [30], an alternating least-squares algorithm was proposed whose performance is better than that of [33]. Gao et al. [29] found that the objective of CCA is not a stochastic convex program. Ma et al. [33] devoted much attention to globally convergent stochastic optimization of CCA, and Ge et al. [39] made a breakthrough on this issue; however, the problem of sample complexity remains. Xu et al. [40] proposed truly alternating least-squares (TALS) with momentum for efficient CCA. Offline algorithms are unsuitable for real-time scenes because they result in high memory storage, so some scholars tried to design fast algorithms to address the high complexity. Marinov et al. [29] proposed a first-order stochastic approximation algorithm for canonical correlation analysis (CCA), in which a convex relaxation was introduced. Different from these methods, which all approximate the objective of plain CCA, the proposed ICCA directly accelerates the original CCA. Note that the sequential Karhunen-Loeve or sequential SVD used in ICCA was originally proposed by Kim [41] to accelerate SVD rather than matrix inversion. Moreover, it was applied early on in visual tracking [42], which put forward learning the eigenbasis of PCA. Obviously, PCA is not CCA and their motivations are very different.
More importantly, the proposed ICCA greatly expands the applicability of sequential SVD in multiview tasks.

Review of Canonical Correlation Analysis
Canonical correlation analysis (CCA), as a classical multivariate statistical method, is widely used to analyze multiview data. In the machine learning community, CCA usually serves as a feature learning method that extracts two separate subspaces to maximize the correlations across views. Consider the samples X ∈ R^{d_1×n} and Y ∈ R^{d_2×n} from two views, where d_1 and d_2 are the dimensions of X and Y, respectively, and n denotes the number of samples. CCA [43] seeks two subspaces a and b that maximize the correlation of the projections:

max_{a,b} (a^T Σ_xy b) / sqrt((a^T Σ_xx a)(b^T Σ_yy b)). (1)

To ensure the scale invariance of a and b in Equation (1), the objective function can be rewritten as

max_{a,b} a^T Σ_xy b, s.t. a^T Σ_xx a = 1, b^T Σ_yy b = 1. (2)

By constructing a Lagrange function, Equation (2) is simplified into the eigenvalue problem

Σ_xx^{-1} Σ_xy Σ_yy^{-1} Σ_yx a = λ^2 a, Σ_yy^{-1} Σ_yx Σ_xx^{-1} Σ_xy b = λ^2 b. (3)

It can be proved that Equation (3) has a global optimal solution, in which a and b are called the canonical subspaces. Before computing a and b, the matrices Σ_xx^{-1}, Σ_yy^{-1}, Σ_xy, and Σ_yx must be calculated first. However, the time complexity of inverting Σ_xx and Σ_yy is O(d^3) (d being the larger of d_1 and d_2), which leads to slow running speed and large calculation overhead [2,40,41]. In practice, many data are high-dimensional, which additionally imposes a massive storage burden. Obviously, incremental learning for CCA is a promising choice to handle this case.
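As a reference point for the incremental version developed later, the batch solution of Equations (1)-(3) can be sketched in a few lines of numpy. This is an illustrative sketch rather than the paper's implementation; the function name, the small ridge term `reg` (added here to keep the covariances invertible), and the use of `eigh` for whitening are our own assumptions.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Batch CCA: return canonical subspaces a (d1 x k), b (d2 x k).

    X: d1 x n, Y: d2 x n. The ridge `reg` keeps the covariances invertible.
    """
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (n - 1)
    # Whiten each view via eigendecomposition of the symmetric covariances.
    ex, Ux = np.linalg.eigh(Sxx)
    ey, Uy = np.linalg.eigh(Syy)
    Sxx_inv_half = Ux @ np.diag(ex ** -0.5) @ Ux.T
    Syy_inv_half = Uy @ np.diag(ey ** -0.5) @ Uy.T
    # One SVD of the whitened cross-covariance gives the canonical pairs.
    P = Sxx_inv_half @ Sxy @ Syy_inv_half
    u, s, vt = np.linalg.svd(P)
    a = Sxx_inv_half @ u[:, :k]     # canonical subspace for X
    b = Syy_inv_half @ vt.T[:, :k]  # canonical subspace for Y
    return a, b, s[:k]              # s holds the canonical correlations
```

The singular values of the whitened matrix P are exactly the canonical correlations, which is the property the incremental algorithm below exploits.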

Incremental Canonical Correlation Analysis (ICCA)
Mean and Covariance Update. In CCA, calculating the mean and covariance is a necessary step of the solution, and for this update it is relatively easy to keep the storage constant. For completeness, we introduce the concrete strategy, whose implementation details may be slightly different from other methods. Suppose that there are n old samples and m new incoming samples. Let X̄ and Ȳ denote the means of the n old samples from X and Y, respectively. Correspondingly, let Σ_xy and Σ′_xy denote the cross-covariance of the n old samples and of the m new incoming samples between X and Y, respectively. The combined covariance over all n_t samples, where

n_t = f·n + m, (4)

is obtained as

Σ_xy ← f·Σ_xy + Σ′_xy + (f·n·m / n_t)(X̄ − X̄′)(Ȳ − Ȳ′)^T, (5)

and f denotes a forgetting factor used to down-weight old samples, f ∈ [0, 1]. Setting f = 1 places equal emphasis on all the samples, while f = 0 places full focus on the new samples. Only the history statistics Σ_xy, X̄, and Ȳ of the n previous samples need to be saved before the m new samples arrive. From the new m samples, Σ′_xy, X̄′, and Ȳ′ are computed, and then the combined covariance follows from Equation (5) based on the n old and m new samples. Thus, the samples themselves never take up much memory. The mean of all the samples at this time can be updated as

X̄_{n+m} = (f·n·X̄ + m·X̄′) / n_t, Ȳ_{n+m} = (f·n·Ȳ + m·Ȳ′) / n_t. (6)

When the next m samples emerge again, once Σ′_xy, X̄′, and Ȳ′ are computed, the saved value n_t and the historical covariance directly take part in the new round of calculation and storage according to Equations (4)-(6). Hence, the storage does not grow with the rapid increase in the number of samples.

Avoiding Matrix Inversion. To solve CCA, it is necessary to invert the matrices Σ_xx and Σ_yy in Equation (3). To avoid explicit matrix inversion, we resort to the singular value decomposition (SVD), which efficiently handles thin matrices. Given any matrix A ∈ R^{d×n}, its SVD is

A = U S V^T, (7)

where U ∈ R^{d×d}, S ∈ R^{d×n}, and V ∈ R^{n×n}; then AA^T = U S V^T V S^T U^T. Owing to the orthogonality V^T V = I, we have AA^T = U S S^T U^T. When X − X̄ = U_1 S_1 V_1^T, Σ_xx is a real symmetric matrix and thus

Σ_xx^{-1} = U_1 S_1^{-2} U_1^T. (8)

In Equation (8), there is an inverse operation S_1^{-2}. Since S_1 is a diagonal matrix whose nonzero elements lie only on the diagonal, its inverse can be calculated simply by taking the reciprocals of those elements; S_2^{-2} for the other view can be obtained in the same way as S_1^{-2}. Thus, the matrix inverse is avoided. Obviously, the overhead of vanilla SVD is still not negligible if the matrix is relatively large. To this end, a sequential Karhunen-Loeve (SKL) algorithm [44] can be employed here.
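The mean and covariance bookkeeping above can be sketched in numpy as follows. This is a minimal sketch of a standard weighted two-block combination, consistent with the described behavior of the forgetting factor f (f = 1 keeps all history, f = 0 keeps only the new block); the function and variable names, and the exact normalization, are our assumptions.

```python
import numpy as np

def update_mean_cov(mean_x, mean_y, Sxy, n, Xnew, Ynew, f=0.95):
    """One incremental update of the cross-view statistics.

    mean_x, mean_y: means of the n old samples, shapes (d1,), (d2,)
    Sxy: d1 x d2 cross-scatter of the old samples
    Xnew, Ynew: d1 x m and d2 x m blocks of new samples
    f: forgetting factor in [0, 1].
    """
    m = Xnew.shape[1]
    mx_new = Xnew.mean(axis=1)
    my_new = Ynew.mean(axis=1)
    n_eff = f * n + m                       # effective sample count n_t
    mean_x_up = (f * n * mean_x + m * mx_new) / n_eff
    mean_y_up = (f * n * mean_y + m * my_new) / n_eff
    # Scatter of the new block around its own mean
    Xc = Xnew - mx_new[:, None]
    Yc = Ynew - my_new[:, None]
    S_new = Xc @ Yc.T
    # Correction term for the shift between the old and new means
    corr = (f * n * m / n_eff) * np.outer(mean_x - mx_new, mean_y - my_new)
    Sxy_up = f * Sxy + S_new + corr
    return mean_x_up, mean_y_up, Sxy_up, n_eff
```

With f = 1 this reproduces exactly the scatter of the pooled n + m samples, which is easy to verify against a direct computation; only the old means, the old scatter, and the sample count are carried between updates.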
Without loss of generality, suppose that we have n centered samples X and m new centered incoming samples X′, and assume that X = U_1 S_1 V_1^T. Obviously, the SVD of the concatenation of X and X′ can be written as

[X, X′] = [U_1, Q] · [[S_1, U_1^T X′], [0, R]] · [[V_1^T, 0], [0, I_m]], (9)

where QR is the QR decomposition of the component of X′ orthogonal to U_1 and I_m is the identity matrix of size m. Then we only need to decompose the small middle matrix as U′ D V′^T. Obviously, the latest left singular matrix is U_1 = [U_1, Q] U′ and the corresponding singular value matrix is S_1 = D. Now the inverse of the covariance matrix of X can be obtained as in Equation (8).
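The concatenation step can be sketched as below; the function name and the explicit QR decomposition of the residual component (in the spirit of Levy and Lindenbaum's SKL) are assumptions, and for brevity the mean update and the forgetting factor are omitted.

```python
import numpy as np

def sequential_svd(U, S, Xnew, k=None):
    """Update a thin SVD U @ diag(S) @ V^T with m new (centered) columns.

    U: d x r left singular vectors of the old data, S: r singular values,
    Xnew: d x m new columns. Returns the updated (optionally truncated)
    U and S; V is not tracked, since only U and S enter the covariance.
    """
    # Split the new data into in-subspace and orthogonal components.
    proj = U.T @ Xnew                      # r x m coefficients
    resid = Xnew - U @ proj                # d x m residual
    Q, R = np.linalg.qr(resid)             # orthonormal basis of the residual
    r, m = proj.shape
    # Small middle matrix of Equation (9); its SVD rotates the enlarged basis.
    K = np.block([[np.diag(S), proj],
                  [np.zeros((Q.shape[1], r)), R]])
    Uk, Sk, _ = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Uk
    if k is not None:                      # keep only the top-k components
        U_new, Sk = U_new[:, :k], Sk[:k]
    return U_new, Sk
```

Only the small matrix K is decomposed at each step, so the cost depends on the retained rank and the batch size m rather than on the total number of samples seen so far.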
Adopting the SKL algorithm [44] keeps the space and time complexity of each update constant with respect to the total number of samples seen so far. Each update only uses the top-k truncated singular values and basis vectors from the former step together with the m new incoming samples; thus, the space complexity is reduced to O(d(k + m)) instead of the previous O(d(n + m)^2), as in Levy and Lindenbaum [44].
Updating Subspaces. Of course, the SKL algorithm is designed for sequential SVD; thus, following the section above, we also maintain the SVD of Σ_xy. By merging the small-sized matrices into the SVD forms of Σ_xx^{-1}, Σ_yy^{-1}, and Σ_xy, we can efficiently obtain the SVD of P = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2}. In fact, we do not need its concrete form, because the projection matrices only require the left and right singular matrices. The update formulas are

a = Σ_xx^{-1/2} u, b = Σ_yy^{-1/2} v,

where u and v are the left and right singular matrices of P, respectively. In this way, the subspaces a and b can be updated. The specific optimization procedure is shown in Algorithm 1.
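Putting the pieces together, the inverse-free subspace update can be sketched as follows: given the SVD factors of the two centered views, the inverse square roots reduce to reciprocals of singular values, and one SVD of the small matrix P yields a and b. The function and variable names here are our own.

```python
import numpy as np

def icca_subspaces(U1, S1, U2, S2, Sxy, k=2):
    """Update canonical subspaces from the SVD factors of the two views.

    U1, S1: left singular vectors/values of the centered X, so that
    Sigma_xx = U1 @ diag(S1**2) @ U1.T; likewise U2, S2 for Y.
    Sxy: cross-covariance (scatter) between the views.
    No explicit matrix inverse is needed: the inverse square roots
    only require reciprocals of the singular values.
    """
    Sxx_inv_half = U1 @ np.diag(1.0 / S1) @ U1.T   # Sigma_xx^{-1/2}
    Syy_inv_half = U2 @ np.diag(1.0 / S2) @ U2.T   # Sigma_yy^{-1/2}
    P = Sxx_inv_half @ Sxy @ Syy_inv_half
    u, s, vt = np.linalg.svd(P, full_matrices=False)
    a = Sxx_inv_half @ u[:, :k]      # canonical subspace for X
    b = Syy_inv_half @ vt.T[:, :k]   # canonical subspace for Y
    return a, b, s[:k]
```

By construction the returned subspaces satisfy the CCA constraints a^T Σ_xx a = I and b^T Σ_yy b = I of Equation (2), which can be checked numerically.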

Experiments
In this section, we apply ICCA to visual tracking with a particle filter to illustrate the applicability of incremental CCA (ICCA).

Settings
Implementation Details. For completeness, following the experimental settings of [41], we divide an image into left and right patches. Since occlusion and illumination affect the tracking results, we meanwhile extend the method of [45] to the case of two views. In detail, we directly extract the projection matrices of the two views as the bases, which serve as the templates. To integrate the two views of templates into the particle filtering framework, we utilize the reconstruction errors to determine the best candidate as the position inference. Different from [45], the updated reconstruction coefficients rely on the templates from both views rather than on independent views. For clarity, the concrete model is given as Equation (10):

min_{α, e_1, e_2} (1/2)||x − D_1 α − e_1||^2 + (1/2)||y − D_2 α − e_2||^2 + λ||α||_1 + µ_1||e_1||_1 + µ_2||e_2||_1, (10)

Algorithm 1. Incremental Subspace Algorithm
Input: n samples X and Y; m new samples X′ and Y′.
Output: the updated subspaces New_U_1 and New_U_2; the updated means X̄_{n+m}, Ȳ_{n+m} and covariance Σ_xy.
1. Update the means X̄, Ȳ and covariances Σ_xy, Σ_yx of the n samples according to Equation (5);
2. Update the means X̄_{n+m}, Ȳ_{n+m} of the n + m current samples according to Equation (6);
3. X̄ ← X̄_{n+m}, Ȳ ← Ȳ_{n+m}, Σ_xy ← the combined covariance according to Equations (4) and (5);
4. Calculate U_1, S_1, U_2, and S_2 according to Equation (7).

In Equation (10), x and y denote the left and right patches, respectively; D_1 and D_2 denote the projection matrices learned by ICCA; α is the sparse coefficient vector; e_1 and e_2 denote the noises; and λ, µ_1, and µ_2 denote the regularization parameters. According to [45], we omit the identical solution steps, such as those for the errors e_1 and e_2, and only describe the update rule for the reconstruction coefficients. Since the iterative shrinkage technique requires the partial derivative with respect to the coefficient, the first step is to differentiate Equation (10) with respect to α. Let G_α be the objective function about α after simplification; then the gradient about α is

∇G_α = D_1^T(D_1 α + e_1 − x) + D_2^T(D_2 α + e_2 − y).

Similarly, the gradients about e_1 and e_2 are e_1 − (x − D_1 α) and e_2 − (y − D_2 α), respectively. With these three gradients, α, e_1, and e_2 can be obtained in combination with APG (accelerated proximal gradient).
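The coefficient update can be sketched as a plain proximal-gradient (ISTA-style) loop; the APG momentum term is omitted here for brevity. The objective in the sketch is a plausible reading of Equation (10) consistent with the stated gradients, and all names are assumptions.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve_coefficients(x, y, D1, D2, lam=0.01, mu1=0.1, mu2=0.1,
                       n_iter=200, step=None):
    """Proximal-gradient sketch for the two-view sparse coding step.

    Minimizes 0.5*||x - D1 a - e1||^2 + 0.5*||y - D2 a - e2||^2
              + lam*||a||_1 + mu1*||e1||_1 + mu2*||e2||_1.
    """
    if step is None:
        # Safe step size: 1 / (upper bound on the joint Lipschitz constant)
        L = np.linalg.norm(D1, 2) ** 2 + np.linalg.norm(D2, 2) ** 2 + 1.0
        step = 1.0 / L
    a = np.zeros(D1.shape[1])
    e1 = np.zeros_like(x)
    e2 = np.zeros_like(y)
    for _ in range(n_iter):
        r1 = D1 @ a + e1 - x              # residuals of the two views
        r2 = D2 @ a + e2 - y
        # Gradient step on each block, then soft-thresholding (prox of L1)
        a = soft(a - step * (D1.T @ r1 + D2.T @ r2), step * lam)
        e1 = soft(e1 - step * r1, step * mu1)
        e2 = soft(e2 - step * r2, step * mu2)
    return a, e1, e2
```

The same three gradients appear here as in the text; replacing the plain gradient step with a Nesterov-accelerated one would give the APG variant used in the tracker.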
In visual tracking with a particle filter, the particle with the maximum probability must be chosen as the target location in the next incoming frame. According to Bayesian theory, the maximum probability of particles is expressed as

p(S_m) ∝ exp(−J(α*, e_1*, e_2*)), J(α, e_1, e_2) = F + µ_1||e_1||_1 + µ_2||e_2||_1,

where F denotes the reconstruction fidelity term of Equation (10) and α*, e_1*, and e_2* are the optimal solution of Equation (10). S_m represents the particle state in the m-th frame. These states are composed of six affine parameters and are estimated with the maximum a posteriori probability. The particle with the maximum probability is treated as the target location in the next frame.
To ensure fair comparison and reliable experiments, all trackers are run on the same workstation over the same video frames. To fully mine multiview features, the ICCA tracker evenly divides each particle horizontally or vertically into two parts, which serve as the two-view features. In the experiment, the regularization parameters of the ICCA tracker are set as follows: λ is selected from

Datasets. This section compares ICCA with several representative trackers, namely robust fragments-based tracking (Frag) [46], visual tracking decomposition (VTD) [47], incremental learning for robust visual tracking (IVT) [42], the real-time robust L1 tracker using the accelerated proximal gradient approach (L1APG) [48], visual tracking with online multiple instance learning (MIL) [49], and tracking-learning-detection (TLD) [41], on a series of video sequences, including basketball, car4, cardark, carscale, deer, Dudek, faceocc1, fish, football, and nine other video sequences (eighteen in total), to verify the effectiveness of ICCA. Of the compared trackers, VTD selects multiple image feature models and motion models at the same time and tracks with each; the optimal feature and motion model are then combined in real time to yield the optimal target area. IVT is a tracking algorithm based on an incrementally updated appearance model: it maintains a template throughout tracking and dynamically updates it through incremental PCA, so that it can effectively adapt to appearance changes. L1APG is based on sparse representation and uses the accelerated proximal gradient (APG) to solve the L1 minimization problem. The MIL tracker treats visual tracking as a multiple-instance learning problem, in which a bag is composed of many instances instead of a single instance. TLD combines a traditional tracking algorithm with a detection algorithm to handle deformation and occlusion during tracking.
Frag performs tracking by partial matching: its target template is described by multiple fragments and blocks of the image, and the blocks are arbitrary rather than based on a target model. The trackers used here represent different types and are among the most representative shallow-learning trackers, widely used for comparison purposes in research reports. More importantly, these mainstream tracking algorithms are integrated into a tracking framework [50], which provides the evaluation and figure codes for fair comparison. This paper also implements the ICCA algorithm on the basis of this tracking framework. The video sequences used here, which cover different appearance changes of the tracked objects in different scenes, can be downloaded from this tracking framework as well.

Results
In single-object tracking, two kinds of evaluation are mainly used: qualitative analysis and quantitative analysis. Qualitative analysis examines the tracking effect on a video sequence: whether the target is lost between the first and the last frame, and whether the tracking box dynamically follows changes in the target size. Quantitative analysis uses precision and success rate. The traditional way to evaluate a tracker is to take the ground-truth position of the target in the first frame as the initial state and compute the average precision or success rate over the whole test sequence; this is the one-pass evaluation (OPE). In addition to these criteria, we also report the efficiency of each algorithm in terms of the number of frames per second (FPS).
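For concreteness, the two quantitative criteria can be sketched as follows. The threshold defaults (20 pixels for precision, 0.5 IoU for success) are the common choices in the OPE protocol; the function name and the (x, y, w, h) box convention are assumptions.

```python
import numpy as np

def precision_and_success(pred, gt, loc_thresh=20.0, iou_thresh=0.5):
    """OPE-style metrics over a sequence.

    pred, gt: arrays of shape (T, 4) with boxes as (x, y, w, h).
    Precision: fraction of frames whose center-location error is
    below loc_thresh pixels. Success: fraction of frames whose
    overlap (IoU) with the ground truth exceeds iou_thresh.
    """
    pc = pred[:, :2] + pred[:, 2:] / 2.0      # predicted box centers
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    err = np.linalg.norm(pc - gc, axis=1)     # center-location error
    precision = np.mean(err <= loc_thresh)
    # Intersection-over-union per frame
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(x2 - x1, 0.0) * np.maximum(y2 - y1, 0.0)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    success = np.mean(inter / union >= iou_thresh)
    return precision, success
```

Sweeping loc_thresh and iou_thresh over a range of values yields the precision and success curves plotted in the figures.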

Efficiency Comparison
In Table 1, we report the FPS values of the compared trackers on all eighteen videos, computed as the ratio of the total number of frames in each video to the total running time, namely, the number of frames processed per second by each tracking method. The larger the value, the higher the efficiency of the tracking algorithm. As shown in Table 1, ICCA achieves larger FPS values on most videos. In some videos, such as football, jumping, and mhyang, the changes in appearance make tracking difficult, which slows the convergence of the ICCA-based tracker under the same criterion used on the other videos. Thus, its FPS values are not always the best.

Qualitative Comparison
In the basketball and football videos, the target is a fast-moving person. In the basketball video, when more than one player grabs the ball, the tracked player is easily affected by the background information. In the football video, besides the background interference, the helmets look similar, so when the players occlude each other the tracked target is easily lost. Figure 1a,j show the good tracking ability of ICCA, which outperforms the other trackers in the football video. The three videos car4, cardark, and carscale contain fast-moving cars driving on streets during the day and at night, as well as on outdoor roads during the day. In car4, when the car passes under the overpass or the shade of trees, the illumination on its surface changes greatly. In cardark, cars on the street at night are easily blurred by the brightness of other lights or street lights. The car in carscale drives from far to near, and its scale changes while it passes through a cluster of branches. Figure 1b-d show that the tracking results of ICCA are attractive. In Figure 1e, namely the video david3, the tracked person walks outdoors and is twice occluded by a tree trunk while walking back and forth. Most tracking algorithms lose the target, while ICCA still works stably. In the deer sequence, most algorithms fail because the tracked deer is very similar to the surrounding background and moves rapidly; ICCA performs the tracking effectively. The dog in Dog1 moves from near to far but slowly, so most methods can track the target stably. As shown in Figure 1, the videos faceocc1, faceocc2, and jogging-2 share a common trait: the tracked targets are all occluded at times; in particular, the target in jogging-2 is almost completely occluded, and the color of the occluding object is very similar to the target. ICCA works well and shows its effectiveness.
In the video fish, the slight shaking of the camera makes the target more difficult to track; ICCA tracks it accurately. In Figure 2a, since the building in the video crossing blocks the sun, pedestrian tracking is relatively stable within the building shadow. When the pedestrian moves between shadow and direct light, the illumination varies greatly. In this case, most trackers miss the target, but ICCA can still track it even as the tracking bounding box changes slightly in size. In the singer1 video, the background clutter hampers effective subspace learning; likewise, ICCA tracks the target correctly.

Quantitative Comparison
To quantify the visual tracking performance of ICCA, seven trackers were evaluated and compared by success rate and precision. It can be seen from Figure 3 that ICCA has almost the best overall performance and a high success rate on different attributes, such as background clutter, illumination variation, scale variation, fast motion, and small in-plane rotation, but it performs slightly worse than other trackers on out-of-plane rotation, occlusion, and deformation.

In particular, ICCA merely depends on subspace learning, which is typically not very robust. ICCA mainly accelerates CCA in an incremental way, in which CCA is solved in closed form rather than by stochastic approximation. In addition, ICCA primarily focuses on efficiency and applicability in real scenarios. Figure 4 shows the precision comparison of the seven trackers over the 18 video sequences. ICCA is consistently superior to the other trackers in accuracy on most attributes. Since ICCA only accelerates CCA without improving the robustness of the target appearance model, the ICCA-based tracker is still sensitive to occlusion, scale variation, and deformation; on such attributes, its performance is slightly lower than that of other trackers. We cannot guarantee that our tracker works on all videos, because we only improve the efficiency of CCA rather than its model capacity.
In short, according to the comparison of success rate and precision, the ICCA tracker outperforms the other trackers and exhibits better performance on many video sequences.

Conclusions
This paper details a simple yet efficient incremental CCA (ICCA) and applies it to visual tracking to verify its effectiveness and the promising potential of the proposed online learning scheme in practice. Different from existing works, which rely on various approximations, the proposed ICCA directly accelerates the process of CCA, and the algorithmic efficiency is greatly improved. In future work, we will develop other applications for further in-depth analysis.


Conflicts of Interest:
The authors declare no conflict of interest.