Multi-View Structural Local Subspace Tracking

In this paper, we propose a multi-view structural local subspace tracking algorithm based on sparse representation. We approximate the optimal state from three views: (1) the template view; (2) the PCA (principal component analysis) basis view; and (3) the target candidate view. We then propose a unified objective function that integrates these three view problems. The proposed model not only exploits the intrinsic relationship among target candidates and their local patches, but also takes advantage of both sparse representation and incremental subspace learning. The optimization problem is solved by customized APG (accelerated proximal gradient) methods applied in an iterative manner. We then propose an alignment-weighting average method to obtain the optimal state of the target. Furthermore, an occlusion detection strategy is proposed to accurately update the model. Both qualitative and quantitative evaluations demonstrate that our tracker outperforms the state-of-the-art trackers in a wide range of tracking scenarios.


Introduction
Visual tracking plays an important role in computer vision and has received fast-growing attention in recent years due to its wide practical application. In generic tracking, the task is to track an unknown target (only a bounding box defining the object of interest in a single frame is given) in an unknown video stream. This problem is especially challenging due to the limited set of training samples and the numerous appearance changes, e.g., rotations, scale changes, occlusions, and deformations.
To solve the problem, many effective trackers have been proposed [1][2][3][4] in recent years. Most methods are developed from the discriminative or generative perspectives. Discriminative approaches use an online updated classifier or regression model to distinguish the object from the background. Avidan [5] uses AdaBoost to combine a set of weak classifiers into a strong classifier to label each pixel and develops an ensemble tracking method. Grabner et al. [6] propose a semi-supervised online boosting algorithm to handle the drift problem in tracking by the usage of a given prior. Babenko et al. [7,8] introduce multiple instance learning (MIL) into online object tracking where bag labels are adopted to select effective features. Hare et al. [9] propose the Struck tracker which directly estimates the object transformation between frames, thus avoiding the heuristic labels of samples. Kalal et al. [10] propose a P-N learning algorithm which uses two experts to estimate and correct the errors made by the classifier and tracker. More recently, Li et al. [11] proposed a novel tracking framework with adaptive features and constrained labels to handle illumination variation, occlusion and appearance changes caused by the variation of positions. Among all of the discriminative approaches, recently, correlation filter-based tracking algorithms [12] have drawn increasing attention object's appearances only with a couple of previous time instants, thus, they cannot cover numerous appearances of the target object.
Motivated by the above discussions, we propose a novel multi-view structural local subspace model, as shown in Figure 1. For each view, we build a sub-model to exploit the useful information in that view. The whole model iteratively exchanges information among the three sub-models. In the target template view, each patch of the target object is sparsely and independently represented by the target patch templates, with a temporally smooth regularization term. The target templates strongly represent the current appearance of the object, so we use them to account for the short-term memory of the target. In the PCA Eigen template view, we construct a structural local PCA Eigen dictionary to exploit both partial information and spatial information of the target object under a sparse constraint. Additionally, the PCA Eigen template model can effectively learn the temporal correlation of target appearances from past observation data by an incremental SVD update procedure; thus, it can cover a long period of target appearances. We use it to account for the long-term memory of the target. In the target candidate view, we use a Laplacian regularization term, together with an occlusion indicator matrix, to keep the sparse codes of unoccluded patches similar and to keep the sparse codes of occluded patches independent. Note that, unlike [27], our model uses the Laplacian regularization term to relate the sparse codes of patches at different spatial layout positions. The whole model has several useful properties. It takes advantage of both sparse representation and incremental subspace learning, which makes the model less sensitive to incorrect updating and gives it a proper memory of the target appearances. The model exploits the intrinsic relationship among different target candidates and their local patches, forming a strong identification power to locate the target among many candidates. It can also estimate the reliability of different local patches, which allows the model to make full use of the reliable patches and ignore the occluded ones.
We built the model to deal with many tracking problems, e.g., occlusion, deformation, fast motion, illumination variation, scale variation, and background clutter. The sparse representation-based tracking method can handle partial occlusion and background clutter to some extent, and the incremental learning of the PCA subspace representation can effectively and efficiently deal with appearance changes caused by rotations, scale changes, illumination variations, and deformations. The proposed tracker takes advantage of both methods. By also considering time consistency, the intrinsic relationships among target candidates and their local patches, the different reliability of different patches, and a rational update strategy, the proposed method significantly improves tracking robustness.
The main contributions of this paper are as follows: (1) A novel multi-view structural local subspace tracking method is proposed. The model jointly takes advantage of three sub-models through a unified objective function that integrates them. The proposed model not only exploits the intrinsic relationship among target candidates and their local patches, but also takes advantage of both sparse representation and incremental subspace learning. (2) We propose an algorithm that solves the optimization problem by three customized APG methods applied in an iterative manner. (3) An alignment-weighting average method is proposed to exploit the complete structure information of the target for robust tracking. (4) A novel update strategy is developed to account for both short-term memory and long-term memory of target appearances. (5) Experimental results show that the proposed method outperforms twelve state-of-the-art methods in a wide range of tracking scenarios.
The rest of the paper is organized as follows: In Section 2, we introduce the multi-view structural local subspace model in detail. The optimization of the unified objective function and the overall tracking algorithm are presented in Section 3. The quantitative and qualitative experiments comparing our method with the state-of-the-art methods are discussed in Section 4. Section 5 concludes the paper.

Multi-View Structural Local Subspace Model (MSLM)
Most tracking methods use only one cue to model the target appearance. However, a single cue can hardly handle the complicated circumstances that visual tracking faces. Some methods try to fuse different models to use all of their advantages, but they either simply combine these models or increase the computational burden by using complicated models. Our method exchanges information among target templates, PCA bases, and candidates in one model to use all of their advantages simultaneously, while keeping the computational complexity favorable.
To better illustrate our model, we assume that the optimal state $x^*$ in the current frame is already known and that the corresponding observation is $y^*$. The state $x^* = [l_x, l_y, \theta, s, r, \varphi]^T$ includes six affine parameters, where $l_x$, $l_y$, $\theta$, $s$, $r$, and $\varphi$ denote the $x$ and $y$ translations, rotation angle, scale, aspect ratio, and skew, respectively. The observation is extracted according to them. We sample a set of overlapped local image patches inside the target region with the spatial layout illustrated in Figure 1. Then we obtain an optimal patch matrix $P^* = [p_1^*, p_2^*, \dots, p_N^*] \in \mathbb{R}^{d \times N}$, where $d$ is the dimension of the image patch vector and $N$ is the number of local patches sampled within the target region. Each column in $P^*$ is obtained by $\ell_2$ normalization of the vectorized local image patches extracted from $y^*$. The goal is to mine the most useful information in the target patch templates, the patch PCA basis, and the candidates' patches to jointly approximate the optimal observation. First, we approximate the optimal patches $P^*$ by exploiting the sparsity in the target patch templates. Second, we construct a structured local PCA dictionary to exploit both partial information and spatial information of the target with a sparse constraint. Third, we adopt a Laplacian term to exploit the intrinsic relationship among target candidates and their local patches. Fourth, we propose a unified objective function to integrate these three models and an iterative procedure to exchange information among them, thus taking full advantage of all three subspace sets simultaneously.
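The patch sampling and normalization step can be illustrated with a minimal sketch. It assumes the sizes reported later in the experiments (a 32 × 32 target region, 16 × 16 patches, and a step of eight pixels); the function name `extract_patch_matrix` and the zero-norm guard are hypothetical additions for illustration and are not part of the original implementation.

```python
import numpy as np

def extract_patch_matrix(image, patch_size=16, step=8):
    """Stack l2-normalized, vectorized local patches as columns (d x N)."""
    height, width = image.shape
    columns = []
    for top in range(0, height - patch_size + 1, step):
        for left in range(0, width - patch_size + 1, step):
            patch = image[top:top + patch_size, left:left + patch_size]
            vec = patch.astype(float).ravel()
            norm = np.linalg.norm(vec)
            # Guard against an all-zero patch (an added safeguard, not from the paper).
            columns.append(vec / norm if norm > 0 else vec)
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
target = rng.random((32, 32))      # stand-in for a resized target observation
P = extract_patch_matrix(target)
print(P.shape)                     # (256, 9): d = 256, N = 9
```

With these sizes the layout is a 3 × 3 grid, so $N = 9$ and $d = 256$.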

View 1: Approximating the Optimal Observation with Target Templates
We collect a set of target templates $T = [T_1, T_2, \dots, T_n]$, where $n$ is the number of target templates. Then a set of overlapped local patches is sampled inside each target template using the same spatial layout to construct the patch dictionaries $D_i = [d_i^1, d_i^2, \dots, d_i^n] \in \mathbb{R}^{d \times n}$, where $i = 1, \dots, N$. Dictionary $D_i$ denotes the dictionary constructed from the $i$th local image patches of all $n$ target templates. Each column in $D_i$ is obtained by $\ell_2$ normalization of the vectorized grayscale image observations extracted from the target templates. We assume that the optimal observation $y^*$ and its patch vectors $P^*$ are already known. The goal is then to find the information in the target patch templates that best represents the optimal observation. Owing to the good modelling ability of sparse representation shown in [23], we explore the information in the target templates that reflects the current target state under a sparsity constraint, as given in Equation (1), where $p_i^*$ denotes the $i$th optimal patch and $a_i \in \mathbb{R}^{n \times 1}$ is the corresponding sparse code of that patch; $a_i^{t-1}$ is the sparse patch code of the last frame; and $\lambda_1$ and $\lambda_2$ control the amount of regularization. The last term in Equation (1) is a temporally smooth term, derived from the observation that the target object in neighboring frames is always very similar.

View 2: Approximating the Optimal Observation with Structural Local PCA Basis
To adapt to target appearance variations caused by illumination and pose changes, the target templates described in the last section are updated dynamically. However, these templates are obtained only from the previous few time instants and thus form a short-term memory of the target appearances; they cannot cover the numerous appearance variations well. This can be solved by the Eigen template model, which has been successfully used in visual tracking scenarios [34]. The Eigen template model can effectively learn the temporal correlation of target appearances from past observation data by an incremental SVD update procedure. The incremental visual tracking (IVT) method [19] presents an online update strategy that can efficiently learn and update a low-dimensional PCA subspace representation of the target object. It has been shown that the incremental learning of the PCA subspace representation can effectively and efficiently deal with appearance changes caused by rotations, scale changes, illumination variations, and deformations. However, the holistic PCA appearance model has been shown to be sensitive to partial occlusion: the underlying assumption of PCA is that the error of each pixel is Gaussian distributed with small variance, and this assumption no longer holds when partial occlusion occurs. Meanwhile, the holistic appearance model does not make full use of the partial information and spatial information of the target and, hence, may fail to track when there is occlusion or a similar object in the scene.
Motivated by the above observations, we construct a structural local PCA basis dictionary to linearly represent each patch with an $\ell_1$-norm constraint.
The PCA basis dictionary $U = [U_1, U_2, \dots, U_N] \in \mathbb{R}^{d \times (m \times N)}$ is formed by concatenating the PCA basis components of each partial patch, where $m$ is the number of PCA basis vectors of each patch used to construct $U$ and $U_i \in \mathbb{R}^{d \times m}$ contains the eigenvectors corresponding to the $i$th patch. The dictionary $U$ is redundant for each patch. Each patch will likely be linearly represented by the eigenvectors corresponding to itself, and the coefficients of the other eigenvectors will be zero or close to zero. Thus, with the $\ell_1$-norm constraint, each local patch is represented as a linear combination of a few main eigenvectors in $U$ by solving Equation (2), where $\mu$ is the regularization parameter and $b_i \in \mathbb{R}^{(m \times N) \times 1}$ is the corresponding sparse code.
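The construction of the concatenated dictionary can be sketched as follows. This is only an illustration under stated assumptions: the per-patch bases are computed here with a batch SVD rather than the incremental SVD update of [19], and the helper name `local_pca_dictionary` and the random data are hypothetical.

```python
import numpy as np

def local_pca_dictionary(patch_history, m):
    """Build U = [U_1, ..., U_N] from per-patch observation matrices.

    patch_history: list of N arrays, each d x T (T past observations of one patch).
    Returns the concatenated dictionary (d x m*N) and the list of per-patch bases.
    """
    bases = []
    for X in patch_history:
        centered = X - X.mean(axis=1, keepdims=True)
        # Left singular vectors are the principal directions of the patch history.
        left_vecs, _, _ = np.linalg.svd(centered, full_matrices=False)
        bases.append(left_vecs[:, :m])      # U_i: d x m
    return np.hstack(bases), bases

rng = np.random.default_rng(0)
d, T, N, m = 256, 30, 9, 10                 # sizes chosen for illustration
history = [rng.random((d, T)) for _ in range(N)]
U, U_blocks = local_pca_dictionary(history, m)
print(U.shape)                              # (256, 90)
```

Columns $(i-1)m + 1$ through $im$ of `U` belong to the $i$th patch, which is the block structure that the coefficient segments in the occlusion analysis rely on.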

View 3: Approximating the Optimal Observation with Target Candidates
The goal of tracking in the Bayesian framework is to find the candidate, or the combination of candidates, that best approximates the optimal state. In every frame, we extract a set of target candidates $[z_1, z_2, \dots, z_M]$ according to a candidate state set $X = [x_1, x_2, \dots, x_M]$, where $M$ is the number of target candidates. The sampling strategy of the candidate state set $X$ will be described in detail later. As in the above two models, we sample a set of overlapped local image patches inside each candidate region with the same spatial layout, forming candidate patch dictionaries $Y_i = [y_i^1, y_i^2, \dots, y_i^M] \in \mathbb{R}^{d \times M}$ in the same way as the dictionaries $D_i$ are constructed, where $i = 1, \dots, N$. Then we approximate the optimal observation with the target candidates by Equation (3), where $\delta_1$ and $\delta_2$ are regularization parameters, $c_i \in \mathbb{R}^{M \times 1}$ is the corresponding sparse code, and $W$ is an occlusion indicator matrix computed from the occlusion rates of the patches. Details of the occlusion rate are described in Section 3.2.1. The last term in Equation (3) is a Laplacian regularization term inspired by [27]. Different from [27], our model uses this term to exploit the similarity of sparse codes among patches at different spatial layout positions. Note that the number of such patches is $N$; because $N$ is small, this term adds little computation. The occlusion indicator matrix $W$ indicates whether any two patches at different layout positions are both unoccluded. If neither is occluded, the corresponding factor in $W$ is large, constraining the two sparse codes to have similar values. If either of the two patches is occluded, the corresponding factor in $W$ is small, so the model avoids the influence of the occluded patches. Similar to [27], we transform the Laplacian term, and the optimization problem is reformulated as Equation (4), where $L = D - W$ is the Laplacian matrix, $C = [c_1, c_2, \dots, c_N]$, and $D$ is the diagonal matrix holding the degree of each $c_i$.
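The transformation of the pairwise Laplacian term into a trace form rests on a standard identity, which can be checked numerically. The sketch below assumes a symmetric weight matrix $W$ with zero diagonal and the usual degree definition $D_{ii} = \sum_j W_{ij}$; the factor $1/2$ and the random data are assumptions for illustration, since the exact scaling in Equations (3) and (4) is not reproduced here.

```python
import numpy as np

def pairwise_form(C, W):
    """0.5 * sum_ij W_ij * ||c_i - c_j||^2, with c_i the i-th column of C."""
    N = C.shape[1]
    total = 0.0
    for i in range(N):
        for j in range(N):
            total += W[i, j] * np.sum((C[:, i] - C[:, j]) ** 2)
    return 0.5 * total

def trace_form(C, W):
    """tr(C L C^T) with graph Laplacian L = D - W and D_ii = sum_j W_ij."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(C @ L @ C.T)

rng = np.random.default_rng(0)
N, M = 9, 20                       # patches and candidates (illustrative sizes)
C = rng.random((M, N))             # columns are the sparse codes c_1, ..., c_N
W = rng.random((N, N))
W = (W + W.T) / 2.0                # symmetric weights
np.fill_diagonal(W, 0.0)

print(np.isclose(pairwise_form(C, W), trace_form(C, W)))   # True
```

Because the trace form is a smooth quadratic in $C$, its gradient is available in closed form, which is what makes a proximal gradient method applicable to the candidate-view sub-problem.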

Multi-View Structural Local Subspace Model
In the descriptions of the above three view models, we assumed that the optimal target state $x^*$ and its corresponding observation vector $y^*$ were already known. In reality, however, the goal is to find the optimal state in the current frame. From the above three subsections, we know that the optimal state can be approximated from three different views, and every view has its own advantages over the others. Thus, we propose a unified objective function, Equation (5), to exchange information among the different views and jointly exploit all of their advantages, where $A = [a_1, a_2, \dots, a_N]$ and $B = [b_1, b_2, \dots, b_N]$; $\gamma$ is a constant that balances the importance between the two terms. The estimated coefficients $A$, $B$, and $C$ are obtained by minimizing the objective function (Equation (5)) with non-negativity constraints, which gives Equation (6). However, there exists no closed-form solution to the optimization problem in Equation (6). Thus, we develop an iterative procedure to solve it.

Optimization
In Equation (6), the coefficients $A$, $B$, and $C$ are all unknown, which makes a direct solution intractable. In this work, we present an iterative method to search for the minimum of the optimization problem (Equation (6)). Owing to the temporal consistency of the target object, the coefficient $B$ is initialized by $\hat{B}^{t-1}$, which is estimated from the last frame. The coefficients $A$, $B$, and $C$ are then obtained by iteratively solving sub-problems (a) and (b): (a) Fix $B$, solve for $A$ and $C$: if $B$ is given, Equation (6) can be separated into two sub-problems, Equations (7) and (8). Both can be solved effectively and efficiently by the accelerated proximal gradient (APG) method [35]. However, there are differences between them. Coefficient $A$ can be obtained by solving for each $a_i$ separately, while coefficient $C$ requires all $c_i$ to be solved simultaneously. Details are described below.
Let $\mathbf{1}_a \in \mathbb{R}^n$, $\mathbf{1}_c \in \mathbb{R}^M$, and $\mathbf{1}_* \in \mathbb{R}^N$ denote column vectors whose entries are all ones, and let $\psi(a)$ denote the indicator function defined in Equation (9). Then Equations (7) and (8) can alternatively be written as Equations (10) and (11), respectively. First, we use the APG method to solve Equation (10) with $F(a_i)$ and $G(a_i)$ as given in Equation (12), where $F(a_i)$ is a differentiable convex function and $G(a_i)$ is a non-smooth convex function. In the APG algorithm, we need to solve the optimization problem in Equation (13), where $L$ (in this paper, $L = 20$) is the Lipschitz constant, $k$ denotes the current iteration, and $\beta_{k+1}$ is defined in Algorithm 1. We define $g_{k+1} = \beta_{k+1} - \nabla F(\beta_{k+1})/L$; the algorithm for solving Equation (7) is then given in Algorithm 1.
7: $\rho_{k+1} = \left(1 + \sqrt{1 + 4\rho_k^2}\right)/2$
8: End
9: Obtain $a_i$ via $a_i = a^{k+1}$.
10: End
11: Output $A$

Second, we use the same APG method to solve Equation (11) with $F(C)$ and $G(C)$ as given in Equation (14). Different from Algorithm 1, we need to solve for all $c_i$ simultaneously in every iteration to exploit the similarity of sparse codes among patches at different layout positions. The key step is to compute the derivative of $F(C)$ with respect to $C$. First, we separately compute the derivative of the first term in Equation (14) with respect to each $c_i$ (Equation (15)). Then we concatenate all of these derivatives to form a derivative matrix $P(C) = [\nabla E(c_1), \nabla E(c_2), \dots, \nabla E(c_N)]$. The final derivative of $F(C)$ is given in Equation (16). The algorithm for solving Equation (8) is given in Algorithm 2.
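To make the APG iteration concrete, the following sketch solves a single template-view sub-problem. It is not the authors' Algorithm 1: it assumes the smooth part has the form $F(a) = \|p - Da\|_2^2 + \lambda_1 \mathbf{1}^T a + \lambda_2 \|a - a^{t-1}\|_2^2$ with the non-negativity constraint handled by the indicator function, so that the proximal step is a projection onto the non-negative orthant. The function names, the random data, and the choice to compute the Lipschitz constant from $D$ (the paper fixes $L = 20$) are all assumptions for illustration.

```python
import numpy as np

def objective(D, p, a, a_prev, lam1, lam2):
    """Assumed template-view objective for one patch (valid for a >= 0)."""
    return (np.sum((p - D @ a) ** 2) + lam1 * np.sum(a)
            + lam2 * np.sum((a - a_prev) ** 2))

def apg_nonneg_code(D, p, a_prev, lam1=0.01, lam2=0.01, n_iter=5):
    """Accelerated proximal gradient with a non-negativity projection."""
    n = D.shape[1]
    # Lipschitz constant of grad F for the assumed F (the paper uses a fixed L = 20).
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + lam2)
    a = np.zeros(n)
    beta = a.copy()
    rho = 1.0
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ beta - p) + lam1 + 2.0 * lam2 * (beta - a_prev)
        a_next = np.maximum(beta - grad / L, 0.0)        # projection = prox of indicator
        rho_next = (1.0 + np.sqrt(1.0 + 4.0 * rho ** 2)) / 2.0
        beta = a_next + ((rho - 1.0) / rho_next) * (a_next - a)   # momentum step
        a, rho = a_next, rho_next
    return a

rng = np.random.default_rng(0)
D_demo = rng.random((20, 10))
D_demo /= np.linalg.norm(D_demo, axis=0)                 # l2-normalized columns
a_true = np.zeros(10)
a_true[[1, 4, 7]] = 1.0
p_demo = D_demo @ a_true
a_prev_demo = np.zeros(10)
a_hat = apg_nonneg_code(D_demo, p_demo, a_prev_demo)
```

The momentum weight uses the same $\rho_{k+1} = (1 + \sqrt{1 + 4\rho_k^2})/2$ recursion that appears in step 7 of Algorithm 1. Solving for $C$ would differ in that the gradient couples all columns through the Laplacian term, so the whole matrix is updated in each iteration.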
(b) Fix $A$ and $C$, solve for $B$: if the coefficients $A$ and $C$ are given, Equation (6) turns into the optimization problem in Equation (17). This sub-problem can also be solved by the APG method [35] with some customized operations. The customized $F(b_i)$ and $G(b_i)$ are defined in Equation (18). We define the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)\max(|x| - \lambda, 0)$. The algorithm for solving the minimization problem (Equation (17)) is then given in Algorithm 3.
7: $\rho_{k+1} = \left(1 + \sqrt{1 + 4\rho_k^2}\right)/2$
8: End
9: Obtain $b_i$ via $b_i = a^{k+1}$.
10: End
11: Output $B$

Finally, the optimization problem in Equation (6) is solved iteratively by steps (a) and (b). The iterations terminate when either of the following two conditions is met: (1) the difference of the objective values between two consecutive iterations is smaller than a threshold (i.e., $\|J_i - J_{i-1}\|_2 \le \varepsilon$; in this paper, $\varepsilon$ is chosen as 0.01); or (2) a maximal number $\Omega$ (in this work, $\Omega = 5$) of iterations has been reached. Details are described in Algorithm 4.
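The soft-thresholding operator $S_\lambda(x)$ defined for sub-problem (b) acts element-wise and is the proximal operator of $\lambda\|\cdot\|_1$. A direct transcription of the definition is shown below; the function name `soft_threshold` is chosen here for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

values = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
shrunk = soft_threshold(values, 0.5)
# Entries with magnitude at most 0.5 become zero; the others move 0.5 toward zero.
```

Entries whose magnitude does not exceed $\lambda$ are set exactly to zero, which is the mechanism that produces sparse PCA coefficients.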

Algorithm 4: Algorithm for solving Equation (6).
Input: The template dictionaries $D_i$, the candidate sets $Y_i$, the PCA basis dictionary $U$, the Lipschitz constant $L$, the occlusion rate vector $O$, and the initialization of $B$.
1: For $k = 1, 2, \dots$, until convergence or until a maximal number $\Omega$ (in this work, $\Omega = 5$) of iterations has been reached
2: Fix $B^k$, obtain $A^k$ and $C^k$ using Algorithms 1 and 2, respectively;
3: Fix $A^k$ and $C^k$, obtain $B^{k+1}$ by Algorithm 3;
4: End;
5: Obtain $\hat{A}$, $\hat{B}$, $\hat{C}$ via $\hat{A} = A^{k-1}$, $\hat{B} = B^k$, $\hat{C} = C^{k-1}$;
Output: Estimated coefficient matrices $\hat{A}$, $\hat{B}$, and $\hat{C}$.
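The control flow of the alternating scheme, including both stopping conditions, can be sketched generically. The sub-problem solvers are passed in as callables, and a toy two-variable quadratic stands in for the real objective; all names here (`alternate`, `solve_ac`, `solve_b`) are hypothetical, and the sketch returns the latest iterates rather than reproducing the index convention of step 5 in Algorithm 4.

```python
def alternate(solve_ac, solve_b, objective, b_init, eps=0.01, max_iter=5):
    """Alternate between the two sub-problems until the objective stalls."""
    b = b_init
    previous = None
    history = []
    for _ in range(max_iter):
        a, c = solve_ac(b)          # step (a): fix B, solve for A and C
        b = solve_b(a, c)           # step (b): fix A and C, solve for B
        value = objective(a, b, c)
        history.append(value)
        if previous is not None and abs(value - previous) <= eps:
            break                   # condition (1): objective change below eps
        previous = value
    return a, b, c, history         # condition (2) is the loop bound max_iter

# Toy stand-in: J(a, b) = (a - b)^2 + (b - 1)^2, with c mirroring a.
toy_solve_ac = lambda b: (b, b)
toy_solve_b = lambda a, c: (a + 1.0) / 2.0
toy_objective = lambda a, b, c: (a - b) ** 2 + (b - 1.0) ** 2

a_out, b_out, c_out, history = alternate(toy_solve_ac, toy_solve_b, toy_objective, 0.0)
```

On the toy problem each sub-step is an exact minimizer, so the recorded objective values decrease monotonically; for the real model, monotonic decrease would depend on how accurately each APG solver is run.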

Object Tracking via the Proposed MSLM
Our tracking method is based on the Bayesian filtering framework. Similar to [19], we use the affine motion model with six parameters to describe the object's state $x_t = [l_x, l_y, \theta, s, r, \varphi]^T$, where $l_x$, $l_y$, $\theta$, $s$, $r$, and $\varphi$ denote the $x$ and $y$ translations, rotation angle, scale, aspect ratio, and skew, respectively. In practice, we randomly sample $M$ particles from a Gaussian distribution with diagonal covariance (i.e., $p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \Sigma)$) to generate a candidate state set $X_t = [x_t^1, x_t^2, \dots, x_t^M]$, where the observation with respect to the $i$th candidate is denoted as $z_i$. We sample a set of overlapped local image patches inside every candidate region with the same spatial layout and convert them into vectors with $\ell_2$ normalization, forming the candidate patch sets $Y_i = [y_i^1, y_i^2, \dots, y_i^M] \in \mathbb{R}^{d \times M}$, where $i = 1, \dots, N$.
We apply the proposed MSLM and its optimization algorithm to all $Y_i$ and obtain the estimated coefficient matrices $\hat{A}$, $\hat{B}$, and $\hat{C}$.
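The particle sampling step described above amounts to perturbing the previous state with independent Gaussian noise on each affine parameter. A minimal sketch follows; the previous state and the standard deviations in `sigma` are hypothetical values chosen for illustration, since the source does not list them, and $M = 600$ is the candidate number reported in the experiments.

```python
import numpy as np

def sample_candidates(x_prev, sigma, num_particles=600, rng=None):
    """Draw candidate states from N(x_prev, diag(sigma^2)); returns M x 6."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((num_particles, x_prev.size))
    return x_prev + sigma * noise

# Hypothetical previous state [l_x, l_y, theta, s, r, phi] and per-parameter spread.
x_prev = np.array([120.0, 80.0, 0.0, 1.0, 1.0, 0.0])
sigma = np.array([4.0, 4.0, 0.01, 0.01, 0.002, 0.001])
X_t = sample_candidates(x_prev, sigma, rng=np.random.default_rng(0))
print(X_t.shape)    # (600, 6)
```

Each row of `X_t` defines one candidate region from which the patch sets $Y_i$ are then extracted.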

Occlusion Detection
The estimated sparse PCA coefficients corresponding to each patch are divided into several segments according to the PCA basis that each segment belongs to, i.e., $\hat{b}_i^{(k)} \in \mathbb{R}^{m \times 1}$ denotes the $k$th segment of the estimated coefficient vector $\hat{b}_i$, and its corresponding PCA basis is $U_k$. As $U_i$ incrementally learns the appearances of the $i$th patch and contains no information about other patches, it should represent the $i$th patch well, i.e., the coefficients of the PCA basis for the corresponding patch should be larger than the others. This means the model is able to deal with partial occlusion. When there is no occlusion, the representation of one patch mainly lies in its corresponding PCA basis. However, when occlusion occurs, the appearance change makes the representation of the occluded local patches dense. Thus, we propose an occlusion metric based on these observations, and the occlusion rate of the $i$th patch is computed from these coefficient segments.
The coefficients in $\hat{C}$ reflect how relevant the corresponding patch is to the target templates and PCA templates. They can be regarded as the confidence scores of these patches belonging to the target object. However, simply summing the coefficients of different patches as the confidence scores of target candidates is unreliable: if a patch is occluded, the corresponding coefficients cannot be trusted and may cause tracking failure. In addition, simply summing the coefficients loses the spatial information among different patches. We alleviate these problems by using the occlusion rate of each patch to tune the coefficients. Then we obtain a tuned confidence map $M = [m_1, m_2, \dots]$.
Finally, the proposed tracker obtains the optimal state $x_t^*$ by combining the candidate states with weights based on the tuned confidence map, where $\phi$ is a normalization term equal to the sum of all elements in $M$.
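The final weighted combination can be sketched as a convex combination of the candidate states. The sketch assumes that the tuned confidence map is stored as a non-negative array with one row per patch and one column per candidate, and that the weight of a candidate is the sum of its column; both the layout and the function name `weighted_state` are assumptions for illustration, as the exact expression is not reproduced above.

```python
import numpy as np

def weighted_state(states, conf_map):
    """Combine candidate states (M x 6) using a confidence map (N x M)."""
    weights = conf_map.sum(axis=0)      # one weight per candidate
    phi = weights.sum()                 # normalization: sum of all map elements
    return weights @ states / phi

rng = np.random.default_rng(0)
states = rng.random((5, 6))             # five hypothetical candidate states
conf = np.zeros((9, 5))
conf[:, 2] = 1.0                        # all confidence on the third candidate
x_star = weighted_state(states, conf)   # equals states[2]
```

When the confidence is spread over several candidates, the estimate interpolates between their states instead of snapping to the single best one.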

Template Update
To account for target appearance variations, we need to update target templates T and PCA basis dictionary U dynamically.
However, the target templates are only obtained from the previous couple of time instants. They can hardly cover the numerous appearance variations of the target object, but they have a strong representation of the current object appearance. Thus, we use them to account for the short-term memory of the target's appearance. We update T using the method proposed in [30]. This updating strategy can effectively alleviate the influences caused by noise and occlusion.
The PCA Eigen template model can effectively learn the temporal correlation of target appearances from past observation data by an incremental SVD update procedure. Thus, it can cover a long period of target appearances, and we use it to account for the long-term memory of the target. It has been shown [19] that the incremental learning of the PCA subspace representation can effectively and efficiently deal with appearance changes caused by rotations, scale changes, illumination variations, and deformations. In the long-term memory, the new target information used to update the model should be as accurate as possible, because once wrong information is introduced into the model, it affects the subsequent tracking results for a long period of time. We label all patches whose occlusion rates are smaller than a threshold $\theta$ as positive, and the rest as negative. To obtain precise information, we then correct each patch separately with two false-label rejection operations. First, we identify a patch as a false positive when its surrounding patches are all negative, and we change its label to negative. Second, we identify a patch as a false negative when its surrounding patches are all positive, and we change its label to positive. Finally, we use the collected positive patches to update their corresponding PCA bases using the method proposed in [19].
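The two label-correction rules can be sketched on a grid of per-patch occlusion rates. The sketch assumes that "surrounding patches" means the 8-connected neighbours within the grid and that both rules are evaluated on the original labels before any label is changed; these choices, the function name `correct_labels`, and the threshold value are assumptions for illustration.

```python
import numpy as np

def correct_labels(occlusion_rates, threshold):
    """Label patches (True = positive) and flip isolated labels."""
    positive = occlusion_rates < threshold
    corrected = positive.copy()
    rows, cols = positive.shape
    for r in range(rows):
        for c in range(cols):
            neighbours = [positive[i, j]
                          for i in range(max(r - 1, 0), min(r + 2, rows))
                          for j in range(max(c - 1, 0), min(c + 2, cols))
                          if (i, j) != (r, c)]
            if positive[r, c] and not any(neighbours):
                corrected[r, c] = False     # false positive: all neighbours negative
            elif not positive[r, c] and all(neighbours):
                corrected[r, c] = True      # false negative: all neighbours positive
    return corrected

rates = np.full((3, 3), 0.9)
rates[1, 1] = 0.1                            # lone "unoccluded" patch in the centre
labels = correct_labels(rates, threshold=0.5)
```

In this example the centre patch is initially positive but is relabelled negative, so it would not be used to update its PCA basis.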

Experiments
The proposed method is implemented in MATLAB 2014a. We perform the experiments on a PC with an Intel i7-4790 CPU (3.6 GHz) and 16 GB of RAM, and the tracker runs at 3.1 fps. We test the proposed tracker on all 51 sequences used in the visual tracker benchmark [2] and compare it with 12 state-of-the-art trackers: SST [33], JSRFFT [36], DSSM [27], Struck [9], ASLA [30], L1APG [35], MTT [31], LSK [29], VTD [21], TLD [10], IVT [19], and SCM [37]. Among the 12 selected trackers, Struck, SCM, TLD, and ASLA are the four best-performing ones in the benchmark, and our tracker outperforms all of them in terms of overall performance. Some representative tracking results are shown in Figure 3.
The parameters, which are fixed for each sequence, are summarized as follows. We resize the target image patch to 32 × 32 pixels and extract overlapped local patches of 16 × 16 pixels within the target region with a step length of eight pixels, as in [30]. The number of target templates is set to 10. The regularization parameters $\lambda_1$, $\lambda_2$, $\mu$, $\delta_1$, $\delta_2$, and $\gamma$ are set to 0.01, 0.01, 0.01, 0.04, 0.2, and 1, respectively. The number of PCA basis vectors is 10. The number of candidates in each frame is 600. The iteration numbers in Algorithms 1-3 are all set to 5, and the Lipschitz constant $L$ is equal to 20 for all three algorithms.
Among all the parameters, $\gamma$ balances the importance between the candidates and the templates and is a very important factor in our model. We ran many experiments to find the best value of $\gamma$. Table 1 summarizes the overall performance of our tracker for different values of $\gamma$.

Qualitative Evaluation
The 51 sequences pose many challenging problems, including occlusion (OCC), deformation (DEF), fast motion (FM), illumination variation (IV), scale variation (SV), motion blur (MB), in-plane rotation (IPR), out-of-plane rotation (OPR), background clutter (BC), out-of-view (OV), and low resolution (LR). The distribution of the 51 sequences over the 11 attributes is shown in Table 2. The most challenging and common problems in tracking are occlusion, deformation, background clutter, illumination change, scale variation, and rotation. We describe in detail how our tracker outperforms the other trackers in these challenging scenarios.
Occlusion: In 29 of the 51 sequences, the targets undergo partial or short-term total occlusion. We can see from Figure 3 that the well-known sparse representation-based trackers (i.e., SCM, DSSM, JSRFFT, SST, ASLA, LSK, L1APG, and MTT) and the incremental subspace-based IVT tracker all fail in some of these sequences, while our tracker effectively tracks almost all of the targets in the 29 sequences when occlusion occurs. This is mainly attributed to the part-based strategy used in our method. The occlusion vector O in Figure 2, which is constructed from the PCA basis coefficients B, effectively indicates the degree of occlusion of each patch. If a patch is occluded, the corresponding element in O is large, which makes the corresponding element of the tuned confidence vector m_i very small and thus reduces the influence of the bad patches. In addition, we exploit the joint sparsity among the patches that are not occluded. This strategy allows the method to fully utilize the spatial information among these patches, making the model more robust.
Deformation: Nineteen sequences involve target deformation. We can see from Figure 3 that our tracker handles deformation better than the other methods. In the Jogging-1 and Jogging-2 examples, the proposed method effectively deals with short-term total occlusion while the target undergoes deformation, whereas most of the other methods fail in these sequences. This is because our method takes advantage of the incremental subspace learning model, which still performs well when deformation occurs.
Background clutter: In 21 sequences, the targets suffer from background clutter. As the background of the target object becomes complex, it is hard to accurately locate the target, since a rather simple model has difficulty discriminating the target object from the background. It is worth noting that the proposed method performs better than the other algorithms. Thanks to the structural local model and the rich target information preserved in the PCA basis, our model learns a more robust and compact representation of the target object, making it easier to capture changes in the target appearance.
Illumination change: In 25 of the 51 sequences, the target undergoes severe illumination change. In the Singer1 sequence, our tracker and the IVT tracker perform well in tracking the woman, while many other methods drift to the cluttered background or cannot adapt to scale changes when illumination change occurs. This can be attributed to the use of incremental subspace learning, which is able to capture appearance change due to lighting change. In the Fish sequence, the target undergoes illumination change together with fast motion. In the Crossing sequence, the target is observed at low resolution and goes through illumination change. In all of these 25 sequences, our tracker generally outperforms the other trackers.
Scale variation and rotation: A total of 44 sequences involve scale variation or rotation. Because our affine transformation parameters include scale and rotation, the sampling step produces candidates with different scales and rotations for further selection. Together with this sampling strategy, the robust representation model proposed in this paper can effectively estimate the current scale and rotation angle of the target object. We also observe that some trackers, including the well-performing Struck tracker, do not adapt to scale or rotation.
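The sampling of candidates with different scales and rotations can be sketched as a Gaussian perturbation of the previous affine state, in the spirit of the sampling strategy of [19]. The sketch below is only an illustration under stated assumptions: the six-parameter state layout, the standard deviations in `SIGMA`, and the function name `sample_candidates` are hypothetical and are not taken from the paper; only the candidate number of 600 comes from the experimental setup.

```python
import numpy as np

# Hypothetical state layout: [x, y, scale, rotation, aspect_ratio, skew].
# These standard deviations are illustrative assumptions, not the paper's values.
SIGMA = np.array([4.0, 4.0, 0.02, 0.01, 0.002, 0.001])

def sample_candidates(prev_state, sigma=SIGMA, num_candidates=600, seed=0):
    """Draw candidate states by adding zero-mean Gaussian noise to prev_state."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_candidates, prev_state.size)) * sigma
    # Broadcasting adds the previous state to every row of noise.
    return prev_state + noise

prev_state = np.array([100.0, 80.0, 1.0, 0.0, 1.0, 0.0])
candidates = sample_candidates(prev_state)
print(candidates.shape)  # (600, 6)
```

Each row is one candidate state. Because scale and rotation are perturbed along with position, the candidate set covers the size and orientation changes that the representation model then selects among.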

Quantitative Evaluation
We use the score of the precision plot and the score of the success plot to evaluate the 13 trackers on the 51 sequences. Note that a higher precision score or a higher success score means a more accurate result. The overlap rate is defined by $\frac{\mathrm{area}(B_e \cap B_g)}{\mathrm{area}(B_e \cup B_g)}$, where $B_e$ is the estimated bounding box and $B_g$ is the ground truth bounding box. We use the precision and success plots of [2] to present the experimental results of the trackers. Figure 4 contains the precision plots, which show the percentage of frames whose estimated location is within a given threshold distance of the ground truth, and the success plots, which show the ratio of successful frames as the overlap threshold varies from 0 to 1. Both the precision plots and the success plots show that our tracker is more effective and robust than the 12 state-of-the-art trackers on the 51 challenging sequences in the benchmark.
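The two metrics can be illustrated with a short Python sketch. It computes the overlap rate and the center location error per frame and then builds the success and precision curves. The `(x, y, width, height)` box format, the function names, the three synthetic frames, and the summary choices (precision read at a 20-pixel threshold, success summarized by the mean of the curve as an area-under-curve approximation) are assumptions made for this example; they follow common practice for this kind of benchmark and are not stated in the text above.

```python
import numpy as np

def overlap_rate(box_e, box_g):
    """Intersection area over union area for boxes given as (x, y, w, h)."""
    xe, ye, we, he = box_e
    xg, yg, wg, hg = box_g
    inter_w = max(0.0, min(xe + we, xg + wg) - max(xe, xg))
    inter_h = max(0.0, min(ye + he, yg + hg) - max(ye, yg))
    inter = inter_w * inter_h
    union = we * he + wg * hg - inter
    return inter / union if union > 0 else 0.0

def center_error(box_e, box_g):
    """Euclidean distance between the two box centers, in pixels."""
    ce = np.array([box_e[0] + box_e[2] / 2.0, box_e[1] + box_e[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ce - cg))

def success_curve(overlaps, thresholds):
    """Fraction of frames whose overlap rate exceeds each threshold."""
    overlaps = np.asarray(overlaps)
    return np.array([(overlaps > t).mean() for t in thresholds])

def precision_curve(errors, thresholds):
    """Fraction of frames whose center error is within each threshold."""
    errors = np.asarray(errors)
    return np.array([(errors <= t).mean() for t in thresholds])

# Three synthetic frames: perfect match, partial overlap, complete miss.
estimated = [(0, 0, 10, 10), (5, 5, 10, 10), (20, 20, 10, 10)]
ground_truth = [(0, 0, 10, 10)] * 3

overlaps = [overlap_rate(e, g) for e, g in zip(estimated, ground_truth)]
errors = [center_error(e, g) for e, g in zip(estimated, ground_truth)]

success = success_curve(overlaps, np.linspace(0.0, 1.0, 21))
precision = precision_curve(errors, np.arange(0, 51))

success_score = success.mean()   # assumed summary: approximate area under the curve
precision_score = precision[20]  # assumed summary: precision at 20 pixels
print(success_score, precision_score)
```

In this toy example, two of the three frames have a center error below 20 pixels, so the precision score is 2/3, while only the first frame has an overlap rate above 0.5.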
Tables 3 and 4 report the precision scores and the success scores of the different tracking methods. For the BC, DEF, IV, IPR, and OV attributes, our tracker achieves the highest precision scores, which indicates that our method is more robust than the other state-of-the-art trackers on these attributes. For the MB and LR attributes, the precision scores of the proposed method are not among the best three. This is because, under motion blur, different spatial patches of one target tend to exhibit similar blur, which makes it difficult for the model to distinguish them. Additionally, motion blur is often accompanied by fast motion or illumination variation, which makes it even more difficult for the model to track the targets accurately. However, a precision score of 0.410 is still relatively good among all of the trackers.
For the OCC, DEF, IV, IPR, and OV attributes, the proposed tracker achieves the highest success scores, which indicates that our approach estimates the scale more accurately. For the LR attribute, the success score of the proposed method is also not among the best three. This is because of the low resolution of the target object. Since our tracker is a patch-based method, the patch features of a low-resolution target are extracted from patches of even lower resolution, which results in a relatively poor representation of each patch and thus causes drift. For the other attributes, the precision scores and success scores of our tracker are very close to the best ones. The last rows of Tables 3 and 4 show the overall precision scores and success scores of the thirteen trackers over all of the 51 sequences. Our tracker achieves the best scores in both evaluation metrics, which shows that it outperforms all of the other state-of-the-art trackers.
Table 3. Average precision scores on different attributes: fast motion (FM), scale variation (SV), occlusion (OCC), background clutter (BC), deformation (DEF), motion blur (MB), illumination variation (IV), low resolution (LR), in-plane rotation (IPR), out-of-plane rotation (OPR), and out-of-view (OV). The best three results are shown in red, blue, and green fonts.
The last row in Table 4 compares the computational loads of the trackers in terms of fps. Our candidate sampling strategy is based on the sampling strategy in [19], and all the candidate patches are resized to 32 × 32 pixels, which means that all of the candidate features are normalized to a fixed size. Thus, the fps is the same for different sequences as long as the number of candidates is fixed. In practice, we fix the number of candidates at 600, so the fps is almost the same for different sequences (we ignore the feature extraction time, because it is trivial compared with the time used for solving the whole model). Our tracker runs at 3.1 fps. Although it does not reach real-time processing, it outperforms most other sparse representation-based trackers (i.e., SCM, MTT, L1APG, DSSM, JSRFFT, and SST) in terms of both accuracy and speed.

Conclusions
In this paper, we propose a novel multi-view structural local subspace tracking algorithm based on sparse representation. We approximate the optimal state from three views: (1) the template view; (2) the PCA basis view; and (3) the target candidate view. We then propose a unified objective function that integrates these three view problems, so that the model jointly takes advantage of the three sub-models. It not only exploits the intrinsic relationship among target candidates and their local patches, but also takes advantage of both sparse representation and incremental subspace learning. The optimization problem can be solved well by the customized APG methods in an iterative manner. We also propose an alignment-weighting average method to obtain the optimal state of the target. Furthermore, an occlusion detection strategy is proposed to accurately update the model. Both qualitative and quantitative evaluations demonstrate that our tracker outperforms the state-of-the-art trackers in a wide range of tracking scenarios.