Visual Object Tracking Using Structured Sparse PCA-Based Appearance Representation and Online Learning

Visual object tracking is a fundamental research area in computer vision and pattern recognition because it can be utilized by various intelligent systems. However, it remains challenging because tracking is influenced by illumination change, pose change, partial occlusion and background clutter. Sparse representation-based appearance modeling and dictionary learning that optimize tracking history have been proposed as one possible solution to these problems. However, the standard sparse representation approach is limited in representing high dimensional descriptors. Therefore, this study proposes a structured sparse principal component analysis to represent the complex appearance descriptors of the target object effectively with a linear combination of a small number of elementary atoms chosen from an over-complete dictionary. Learning and updating the dictionary online, by selecting training templates that have a high probability of containing the target, makes it possible to track the target object in a variety of environments. Qualitative and quantitative experimental results, including comparison to the current state of the art visual object tracking algorithms, validate that the proposed tracking algorithm performs favorably with changes in the target object and environment for benchmark video sequences.


Introduction
Visual object tracking systems have gained continuous attention in computer vision and pattern recognition because they can be applied to various fields, such as robotics, video surveillance, user-centered interaction systems, video communication and compression and augmented reality [1][2][3][4]. A large number of tracking algorithms have been proposed to follow a moving object in a given image sequence while simultaneously keeping track of target identities through significant pose changes, illumination variations and occlusions by focusing on finding appearance and motion models. To evaluate the performance of state of the art visual object tracking methodologies quantitatively and qualitatively, benchmark tests [5,6] were conducted using a large database including ground-truth object positions to understand how these algorithms perform and to analyze algorithm advances effectively. A typical visual object tracking system consists of three components:

1. An appearance model that captures the visual characteristics of the target object and evaluates the similarity between observed samples and the model.
2. A motion model that locates the target between successive frames utilizing certain motion hypotheses.
3. An optimization strategy that associates the appearance model with the motion model and finds the most likely location in the current frame.
In the Bayesian visual object tracking framework, the main issue of robust target object tracking is to find models for status and observation, such as target representation and localization, as well as filtering and data association. Target object representation and localization methodologies follow a bottom-up process that provides a variety of tools for identifying the moving object. The specific strategy for successfully locating and tracking the target object depends on features in the color, appearance and time spaces. Filtering and data association are mostly top-down processes, incorporating prior information about the scene or object, dealing with object dynamics and evaluating different hypotheses.
The core technique of visual object tracking in the Bayesian framework aims to robustly estimate the motion state of a target object with a defined appearance model in each frame from given image sequences. To achieve visual object tracking, it is necessary to categorize the appearance model into several task-specific categories. Popular appearance models used in object tracking can be separated into global and local visual appearance models [8]. Global visual representation of the target object is simple and computationally efficient for fast object tracking, but is very sensitive to target deformation and environmental changes, including illumination. A multi-cue strategy is adopted in relation to the global features, incorporating multiple visual information types, to deal with complicated appearance changes. In contrast, local visual appearance representation is robust to global appearance change by capturing the local structural object appearance. However, the representation often suffers from noise distribution and background distraction.
Sparse representation and dictionary learning for online appearance modeling have been recently proposed as an alternative solution, formulating the target appearance as a linear combination of basis functions from an over-complete dictionary. However, global linear sparse representation has problems with partial occlusion and local deformation. Since the dictionary uniformly emphasizes the object, occlusion and local deformation can be seen as noise when estimating similarity [9][10][11]. Another characteristic inherent in natural images is their high dimensionality, which causes complex and expensive computation. Exploiting the specific structure of sparsity as a prior enables dictionary learning to reduce computational costs effectively [12][13][14]. Therefore, we propose a structured sparse principal component analysis (PCA)-based subspace representation to represent the appearance model of the target object effectively, together with online learning techniques for robust visual object tracking. We use the structured sparse PCA to find a sparse linear combination over a basis library containing target and trivial templates while reducing the data dimension. The proposed structured sparse PCA-based visual object tracking within the Bayesian framework is decomposed into initialization, observation model, motion tracking model and update. The structured sparse PCA-based appearance model representation and the learning of domain-specific over-complete dictionaries are used to obtain MAP estimates within an appropriately chosen dictionary. The main contributions of our proposed robust visual object tracking system are as follows.

• Structured sparse PCA-based appearance representation and learning for efficient description of the target object with few dictionary entries, to reduce the high-dimensional descriptor and to retain the structure.
• Local structure enforced similarity measures to avoid problems from partial occlusion, illumination and background clutter.
• Training image selection for robust online dictionary learning and updating by considering the probability that the training image contains the target, as opposed to the existing methods that choose the most recent training images.
Section 2 reviews relevant previous visual object tracking approaches, and Section 3 details tracking target objects from a given image by modeling the observation and motion using the proposed structured sparse PCA-based representation within the Bayesian framework. Section 4 quantitatively and qualitatively compares the proposed and current state of the art approaches experimentally. Section 5 summarizes the outcomes, concludes the paper and discusses future work.

Review of Previous Related Work
There is a rich literature in visual object tracking methodologies dealing with target object representations, search mechanisms and model updating. Sparse representation and modeling also have a fruitful literature exploiting prior information within the predefined structure of the basis library and contiguous spatial distribution of deformable target objects. We review some of the important milestones in terms of visual object tracking and sparse representation-based modeling.

Visual Object Tracking System
Many tracking methods have been proposed, largely separated into generative and discriminative methods. Generative visual object tracking methods search for the most similar region to the target object within a neighborhood, whereas discriminative methods treat tracking as a binary classification problem and aim to design a classifier to distinguish the target object from the background [15].
Early visual object tracking systems focused on generative methods, such as the Lucas-Kanade tracker [16], Kalman filter [17,18] and mean-shift (MS) tracker [19,20]. The Kalman filter [17] used for visual object tracking assumes Gaussian noise in the state and observation models, so deviations from this assumption cause parameter estimation errors and decrease estimation precision. The particle filter (PF) is efficient for conventional tracking problems with non-Gaussian distributions and multi-modality [21]. MS-based approaches are efficient for tracking non-rigid objects whose appearances are defined by histograms, but this makes them poor at dealing with illumination and/or pose variations [19,20].
Multiple instance learning (MIL)-based tracking [22] implements discriminative tracking by building a boosting classifier that tracks bags of image patches by incrementally updating the training patches over time. Online appearance learning (OAL)-based visual object tracking uses different target object appearances as a set of probability mass functions to adaptively deal with pose variations [23]. Many approaches attempted to efficiently represent the variation of rigid or limited deformation motion using an adaptive appearance model, such as the incremental visual tracking (IVT) [24] and fragment-based (Frag) [25] trackers. Kalal et al. [26] proposed a paradigm for training a binary classifier from labeled and unlabeled examples, called P-N learning, for visual object tracking. Tracking-learning-detection (TLD) is an award-winning, real-time algorithm for tracking unknown objects in video streams that simultaneously tracks the object, learns its appearance and detects it whenever it appears in the video [27]. Struct [28] is an extended version of TLD using kernels. On the other hand, sparse representation-based visual object tracking systems, such as sparse collaborative appearance (SCM) [29], visual tracking decomposition (VTD) [30], the sparse representation-based l1 tracker [31], the structured sparse tracking (SST) [32] model and sparse mask models [33,34], use an appearance model to find the sparsest linear combination of basis functions from an over-complete dictionary. However, most dictionary learning-based systems still have difficulty reducing high-dimensional descriptors. Deep learning-based techniques have been recently applied to separate target objects from target candidate image templates [35][36][37][38][39] and showed good tracking performance, but they require numerous training templates.
In contrast to visual tracking approaches based on pixel-based observation models, superpixel tracking (SPT) [40] uses middle level features to both remove noise and enforce the target object color of the candidate template.

Sparse Representation-Based Learning
Sparse signal representation is an extremely powerful tool for acquiring, representing and compressing high dimensional signals. Mathematically, solving a sparse representation and learning problem involves seeking the sparsest linear combination of basis functions from an over-complete dictionary. How to represent or reconstruct signals with sparse samples is an extremely important problem in many practical fields, such as signal processing, machine learning, computer vision and robotics. Compressive sensing (CS) is based on the principle that signal sparsity can be exploited to recover the original signal from significantly fewer samples than required by the Shannon-Nyquist theorem [41,42]. Generally, CS algorithms include three basic components: sparse representation, encoding measurement and reconstruction [12]. In particular, sparse representation that approximately solves a system of equations with sparse vectors is popularly applied for pattern recognition because it exploits a linear combination of training samples to represent the test sample and computes sparse representation coefficients of the linear representation system [43][44][45].
Structured sparse representation is an extension of standard sparse representation in statistical signal processing and learning [46,47]. Motivated by potential group structures on feature sets, group sparse representation has become popular in recent years. Group sparsity is used not only for estimating hyper-parameters in the sparse prior model, but also for the group least absolute shrinkage and selection operator (LASSO). Techniques that exploit strong group sparsity for group LASSO have been developed and show superior performance for strongly group-sparse feature sets [48]. However, group LASSO works well only under the strong group sparsity assumption and does not apply to more general structures, such as overlapping groups and tonal or transient structures. Therefore, Huang et al. [14] proposed that sparse representation can be solved by a structured greedy algorithm when a coding scheme can be approximated by block coding with base blocks.
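To make the group-sparsity idea concrete, the following minimal Python sketch shows block soft-thresholding, the shrinkage step commonly used when solving group LASSO problems with non-overlapping groups. It is illustrative only and is not the solver of [14] or [48]; the function name `group_soft_threshold`, the group layout and the threshold `lam` are assumptions introduced here.

```python
import math

def group_soft_threshold(x, groups, lam):
    """Shrink each group of coefficients toward zero by lam (in Euclidean norm).

    x      : list of floats (coefficient vector)
    groups : list of index lists forming non-overlapping groups (assumed layout)
    lam    : non-negative threshold (hypothetical regularization weight)
    Indices that belong to no group are set to zero.
    """
    out = [0.0] * len(x)
    for idx in groups:
        norm = math.sqrt(sum(x[i] ** 2 for i in idx))
        if norm > lam:
            scale = 1.0 - lam / norm
            for i in idx:
                out[i] = scale * x[i]
        # groups whose norm is at most lam stay entirely zero
    return out

coeffs = [3.0, 4.0, 0.1, -0.2]
shrunk = group_soft_threshold(coeffs, [[0, 1], [2, 3]], lam=1.0)
```

In this toy input the first group is strong and is only rescaled, while the weak second group is removed as a whole, which is the behavior that distinguishes group sparsity from element-wise sparsity.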

Structured Sparse PCA-Based Tracking and Online Dictionary Learning
For visual object tracking, it is reasonable to assume that the object trajectory is continuous and object features are consistent or change insignificantly over a short time interval. Thus, once a representation of the feature vector is found in terms of dictionaries fixed in advance, consecutive representations of the feature vectors are almost constant. Therefore, we propose an object tracking method that classifies the coefficients of the target appearance model. The dictionaries are generated from appearance features by applying structured sparse PCA and are updated using the latest data. The object tracking comprises three modes: observation, tracking and update within the Bayesian framework, as shown in Figure 1.

Figure 1. Representation of the target object using structured sparse PCA and deterministic classification between the target object and background image patches.

Notations and Symbols
Before proceeding to the technical details, we introduce the notations and symbols used throughout this paper, as shown in Table 1. Lower case letters denote real variables, and upper case (capital) letters denote multi-dimensional variables, such as images and matrices, except for $Y_t$, which denotes an observation random variable taking real values. Column vectors are shown in boldface, and mappings are denoted by letters of the Greek alphabet.

Table 1. Notations and symbols.

Symbol Description
Observation variable

Bayesian Framework-Based Visual Object Tracking
The traditional visual object tracking algorithm can be formulated within the Bayesian framework, where the maximum a posteriori (MAP) estimation of the state given the observations up to time $t$ is based on the posterior:

$$p(X_t \mid Y_{1:t}) = \frac{1}{n_t}\, p(Y_t \mid X_t)\, p(X_t \mid Y_{1:t-1}), \qquad (1)$$

where $X_t$ is the state at $t$; $Y_{1:t}$ denotes all the observations up to $t$; and $n_t$ is a normalization term,

$$n_t = \int p(Y_t \mid X_t)\, p(X_t \mid Y_{1:t-1})\, dX_t.$$

We use the following assumptions.
(i) State $X_t$ is independent of the past given the previous state $X_{t-1}$: $p(X_t \mid X_{1:t-1}) = p(X_t \mid X_{t-1})$.
(ii) Observations $Y_{1:t}$ are conditionally independent given the states: $p(Y_{1:t} \mid X_{1:t}) = \prod_{i=1}^{t} p(Y_i \mid X_i)$.

We also employed the Chapman-Kolmogorov equation for Equation (1),

$$p(X_t \mid Y_{1:t-1}) = \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1})\, dX_{t-1}.$$

In the visual object tracking scheme, the target state is defined as $X_t = (x^c_t, y^c_t, w^{sx}_t, h^{sy}_t)$, where $(x^c_t, y^c_t)$ represents the center location of the target and $w^{sx}_t$ and $h^{sy}_t$ denote its scale in the x and y directions, respectively. In terms of observation, we need to construct an effective observation model $p(Y_t \mid X_t)$ and an efficient motion model $p(X_t \mid X_{t-1})$. The state estimate of the target $X_t$ at time $t$ can be obtained by the MAP estimate over the $M$ samples $X^j_t$ and their measurements $Y^j_t$ for $j = 1, \dots, M$, given $X_{t-1}$:

$$X_t = \arg\max_{X^j_t}\, p(X^j_t \mid Y^j_t, X_{t-1}).$$

The proposed tracker comprises three modes: 1. structured sparse PCA-based observation and appearance representation using deterministic target object separation from background patch images, 2. motion tracking and 3. online update.
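As an illustration of the recursion above, the following Python sketch performs one prediction and update step on a small finite state space. It is a discretized toy version written for this explanation, not the sample-based implementation of the paper; the function name `bayes_step` and all numerical values are assumptions.

```python
def bayes_step(prior, transition, likelihood):
    """One predict/update step of a Bayes filter on a finite state space.

    prior      : p(X_{t-1} | Y_{1:t-1}) as a list over states
    transition : transition[i][j] = p(X_t = j | X_{t-1} = i)
    likelihood : likelihood[j] = p(Y_t | X_t = j) for the current observation
    """
    n = len(prior)
    # Chapman-Kolmogorov prediction: sum over the previous state
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    n_t = sum(unnormalized)  # normalization term of the posterior
    posterior = [u / n_t for u in unnormalized]
    map_state = max(range(n), key=lambda j: posterior[j])
    return posterior, map_state

# Toy example with three states (hypothetical numbers).
prior = [1.0, 0.0, 0.0]
transition = [[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]]
likelihood = [0.1, 0.8, 0.1]
posterior, map_state = bayes_step(prior, transition, likelihood)
```

The tracker replaces the finite sums with M sampled candidate states, but the structure is the same: predict with the motion model, weight by the observation model, normalize and take the maximum.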

Deterministic Modeling Using Structured Sparse PCA-Based Appearance Representation
To construct the dictionary from the $t_0$ initial image sequences, we extract image patches using windows surrounding the target object for each $t = 1, \dots, t_0$. Figure 1 shows the proposed procedure to separate the target object and background image patches around the target object, representing appearances using structured sparse representation. Let us explain the learning mode of the target object tracking in more detail. We create tracking dictionary vectors $\{d_i\}_{i=1}^{r}$ by applying feature descriptors extracted from observation frames $I_{1:t_0}$ to the structured sparse PCA algorithm as follows.

1. We take the same sized image patches $\{p^{target}_t\}_{t=1}^{t_0}$ centered at $(x^c_t, y^c_t)$. For each $t = 1, \dots, t_0$, we construct the descriptor $v^{tg}_t \in R^s$ of the target object by sequentially accumulating gradient histograms from equally-divided subregions of $p^{target}_t$. To enhance tracking performance, we also create background feature descriptors $v^{bg}_j \in R^s$ from the four background patch images $\{p^{back}_{t,(a_x,b_y)} \in I_t \mid a_x, b_y \in \{0, 1, -1\},\ a_x^2 + b_y^2 = 1,\ t = 1, \dots, t_0\}$ around the target patch $p^{target}_t$ as follows.

   • For each $t = 1, \dots, t_0$, patches $p^{back}_{t,(a_x,b_y)}$ are subimages of $I_t$ centered at $(x^c_t + a_x w^{sx}_t,\ y^c_t + b_y h^{sy}_t)$ with the same size as $p^{target}_t$.
   • When the domain of $p^{back}_{t,(a_x,b_y)}$ does not entirely belong to that of $I_t$, we regard it as an empty set.
   • Let $\{v^{bg}_j\}_{j=1}^{\kappa} \subset R^s$ with $\kappa \le 4t_0$ be background appearance descriptors obtained from background patches $p^{back}_{t,(a_x,b_y)}$ in the same manner used to create the target descriptors.

2. After creating the appearance feature descriptors $v^{tg}_t$ and $v^{bg}_j$, we collect them as the columns of the matrix $V = [v^{tg}_1, \dots, v^{tg}_{t_0}, v^{bg}_1, \dots, v^{bg}_{\kappa}]$ and apply the constrained structured sparse PCA dictionary learning algorithm to find the dictionary $D = [d_1, \dots, d_r]$ and the coefficient matrix $C = [c_1, \dots, c_{t_0+\kappa}]$ that minimize $H(D, C)$ subject to $\|c_j\|_2 \le 1$, $j = 1, \dots, t_0 + \kappa$, where the objective function $H(D, C)$ combines the reconstruction error $\|V - DC\|_F^2$ with the penalty $\Omega_\nu(d_j)$ on each dictionary vector (see [49] for the exact form).

3. Let $\|\cdot\|_F$ be the Frobenius matrix norm, $\|\cdot\|_2$ the Euclidean norm and $\Omega_\nu$ a quasi-norm that controls the sparsity and structure of the support of $d_j$. In this work, the quasi-norm $\Omega_\nu$ is defined as follows. Let $G_1, G_2, G_3, G_4$ be four mutually disjoint subsets of $\{1, 2, \dots, s\}$. Then, every vector $d = (d_1, \dots, d_s) \in R^s$ is decomposed into four subvectors $d^k = (d^k_1, \dots, d^k_s)$, $k = 1, 2, 3, 4$, such that for $1 \le k \le 4$ and $1 \le j \le s$, $d^k_j = d_j$ if $j \in G_k$ and $d^k_j = 0$ otherwise. Then, $\Omega_\nu(d)$ is defined as:

$$\Omega_\nu(d) = \Big( \sum_{k=1}^{4} \|d^k\|_2^{\nu} \Big)^{1/\nu}.$$

We refer to [49] and the references therein for details on the quasi-norm. The decomposition of $V$ into $DC$ enables us to reduce the dimensionality of the descriptors using Equation (6).
Although there is clearly a limitation in representing high dimensional descriptors using a smaller number of vectors than the dimension, the proposed structured sparse PCA represents nonlinear and high dimensional descriptors more effectively by reducing the dimension while retaining the target object structure. For more details of structured sparse PCA algorithms, refer to the original paper [49].

4. Finally, we find a linear support vector machine (SVM) $\Phi: R^s \to R$, such that $\Phi((DC)_i) \ge 1$ ($i = 1, \dots, t_0$) for the target feature-related column vectors of $DC$ and $\Phi((DC)_i) \le -1$ ($i = t_0 + 1, \dots, t_0 + \kappa$) for the background appearance feature-related column vectors of $DC$, where $(DC)_i$ denotes the $i$-th column vector of $DC$, i.e., $(DC)_i = D c_i$. Using the classifier $\Phi$, we estimate observation $Y_t \in \{1, -1\}$ as:

$$Y_t = \mathrm{sign}\big(\Phi(v^{tg}_t)\big), \qquad (7)$$

where we recall that $v^{tg}_t$ is the target feature descriptor obtained from state $X_t$. Note that when the target object is occluded or not observed, the value of the observation becomes negative.
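The background patch geometry of step 1 can be sketched in a few lines of Python. The sketch assumes that the offsets $(a_x, b_y)$ range over $(\pm 1, 0)$ and $(0, \pm 1)$, which is the reading consistent with four patches and $a_x^2 + b_y^2 = 1$; the function name `background_patch_centers` and the frame size are hypothetical.

```python
def background_patch_centers(xc, yc, w, h, frame_w, frame_h):
    """Return centers of the four neighboring background patches that fit in the frame.

    (xc, yc) is the target center and (w, h) the target window size; each
    background patch has the same size as the target patch.
    """
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # (a_x, b_y) with a_x^2 + b_y^2 = 1
    centers = []
    for ax, by in offsets:
        cx, cy = xc + ax * w, yc + by * h
        inside = (cx - w / 2 >= 0 and cx + w / 2 <= frame_w and
                  cy - h / 2 >= 0 and cy + h / 2 <= frame_h)
        if inside:  # a patch that leaves the frame is treated as empty
            centers.append((cx, cy))
    return centers

# Target near the left border of a hypothetical 320 x 240 frame.
centers = background_patch_centers(30, 100, 40, 40, 320, 240)
```

Here the patch to the left of the target would leave the frame, so only three background patches contribute descriptors, which is why $\kappa$ can be smaller than $4t_0$.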
The procedure of deterministic separation using the structured sparse PCA-based representation of the target and the background is shown in Algorithm 1.
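To illustrate how a group quasi-norm favors structured supports, the sketch below evaluates $(\sum_k \|d^k\|_2^{\nu})^{1/\nu}$ over disjoint index groups. This form is an assumption based on the structured sparse PCA literature, and the exact definition used by the authors should be taken from [49]; the function name `omega_nu`, the groups and the value of `nu` are hypothetical.

```python
import math

def omega_nu(d, groups, nu):
    """Group quasi-norm: (sum over groups of ||d restricted to G_k||_2 ** nu) ** (1 / nu)."""
    total = 0.0
    for g in groups:
        group_norm = math.sqrt(sum(d[j] ** 2 for j in g))
        total += group_norm ** nu
    return total ** (1.0 / nu)

# Four disjoint groups over an 8-dimensional vector (hypothetical layout).
groups = [[0, 1], [2, 3], [4, 5], [6, 7]]
concentrated = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # support inside one group
scattered = [3.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # support spread over two groups

penalty_concentrated = omega_nu(concentrated, groups, nu=0.5)
penalty_scattered = omega_nu(scattered, groups, nu=0.5)
```

Both vectors have the same Euclidean norm, but the one whose nonzero entries fall inside a single group receives the smaller penalty, so minimizing the penalty pushes dictionary vectors toward group-aligned supports.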

Motion Tracking Model and Online Update
Using the learned dictionary of the target object and classifier, we track the target object for frames $\{I_{t+1}\}_{t+1>t_0}$ from the previous states $X_t$. The motion model $p(X_{t+1} \mid X_t)$ starts from the Gaussian assumption:

$$p(X_{t+1} \mid X_t) = \frac{1}{\sqrt{(2\pi)^4 |\sigma|}} \exp\Big( -\tfrac{1}{2} (X_{t+1} - X_t)^T \sigma^{-1} (X_{t+1} - X_t) \Big), \qquad (8)$$

where $\sigma$ is a diagonal covariance matrix whose diagonal elements are the variances (squared standard deviations) for location and size and $|\sigma|$ is the determinant of $\sigma$. Let $I_{t+1}$ be the frame at $t + 1 > t_0$, and assume we already have the candidate states $\hat{X}^j_{t+1}$, $j = 1, \dots, M$. Since the observation model $p(Y_t \mid X_t)$ with given state $X_{t-1}$ implies the confidence of an observation $Y_t$ at state $X_t$ being the target, the likelihood $p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)$ is proportional to its confidence: $p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t) \propto \omega(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)$. Given the target state $X_t$ at time $t$, the confidence $\omega(y \mid X_{t+1}, X_t)$ for the target candidates $X_{t+1}$ with positive confidence value increases as we observe the targets in a larger area, whereas confidence for target candidates with negative confidence decreases. Therefore, we evaluate the confidence $\omega(y \mid X_{t+1}, X_t)$ relative to state $X_t$ in Equation (9), which weights the classifier response $\Phi(v_{t+1})$ by the window size ratio, where $y = 1, -1$, $v_{t+1}$ is the feature descriptor extracted from the target state $X_{t+1}$ and $w^{sx}_{t+1} \cdot h^{sy}_{t+1}$ denotes the window size of $X_{t+1}$.

We note that in the tracking mode, we estimate the observation in (7) and the confidence in (9) by applying the descriptor $v$ directly to the SVM, $\Phi(v)$, instead of using the dictionary representation $(D^T D)^{-1} D^T v$ with which we construct the SVM $\Phi$ in the initialization mode. This is because the descriptor $v$ and its reconstruction $D w$ are very similar when $w = (D^T D)^{-1} D^T v$, which minimizes $\|v - D w\|_2$, so it is cheaper to apply the descriptor to the SVM than to use the representation, which requires the computation of the inverse matrix $(D^T D)^{-1}$. Now, the likelihood $p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)$ of $Y_{t+1}$ given states $\hat{X}^j_{t+1}$ and $X_t$ is defined as:

$$p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t) = \frac{\omega(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)}{n_\omega}, \qquad (10)$$

for $j = 1, 2, \dots, M$ with the normalizing factor $n_\omega = \omega(-1 \mid \hat{X}^j_{t+1}, X_t) + \omega(1 \mid \hat{X}^j_{t+1}, X_t)$.
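A minimal Python sketch of the two ingredients described above, the diagonal Gaussian motion density and the normalization of the two confidences into a likelihood, is given below. The confidence values are passed in as plain numbers because the exact confidence formula is not reproduced here; the function names and the numerical values are assumptions.

```python
import math

def motion_density(next_state, state, variances):
    """Density of a diagonal Gaussian centered at the previous state."""
    norm = 1.0
    quad = 0.0
    for a, m, s in zip(next_state, state, variances):
        norm *= 2.0 * math.pi * s        # accumulates (2*pi)^d * det(covariance)
        quad += (a - m) ** 2 / s
    return math.exp(-0.5 * quad) / math.sqrt(norm)

def likelihood(omega_pos, omega_neg, y):
    """Normalize the confidences for y = 1 and y = -1 into a probability of label y."""
    n_omega = omega_pos + omega_neg      # normalizing factor
    if n_omega <= 0.0:
        raise ValueError("confidences must sum to a positive value")
    return (omega_pos if y == 1 else omega_neg) / n_omega

# Hypothetical candidate: state is (x, y, width, height).
prior = motion_density([52.0, 40.0, 20.0, 30.0], [50.0, 40.0, 20.0, 30.0], [9.0, 9.0, 1.0, 1.0])
lik = likelihood(3.0, 1.0, 1)
score = lik * prior                      # unnormalized posterior of this candidate
```

Because the two likelihood values for y = 1 and y = -1 sum to one by construction, candidates can be compared through the product of likelihood and motion density alone.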
Applying the motion model $p(\hat{X}^j_{t+1} \mid X_t)$ obtained from Equation (8) and the observation model $p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)$ obtained from Equation (10) to the Bayesian formulation in Equation (1), we estimate the a posteriori probability $p(\hat{X}^j_{t+1} \mid Y_{t+1}, X_t)$ as:

$$p(\hat{X}^j_{t+1} \mid Y_{t+1}, X_t) \propto p(Y_{t+1} \mid \hat{X}^j_{t+1}, X_t)\, p(\hat{X}^j_{t+1} \mid X_t). \qquad (11)$$

Finally, we obtain the most likely target state $X_{t+1}$ at $t + 1$ as the MAP estimate over the $M$ samples $\hat{X}^j_{t+1}$ and their observations $\hat{Y}^j_{t+1}$ for $j = 1, \dots, M$, given $X_t$:

$$X_{t+1} = \arg\max_{\hat{X}^j_{t+1}}\, p(\hat{X}^j_{t+1} \mid \hat{Y}^j_{t+1}, X_t). \qquad (12)$$

On the other hand, considering a sample state $\tilde{X}_{t+1}$ with $\tilde{Y}_{t+1} \Phi(v_t) \ge 0$ that solves the maximization, a bound holding for all $1 \le j \le M$ shows that we may regard the denominator $1 + e^{-\hat{Y}_{t+1} \Phi(v_t)}$ in (12) as a constant for all $j = 1, \dots, M$.

Figure 2 shows the steps used to detect the target object when a new frame comes in. The $M$ candidate samples are separated into positive and negative labels using $\Phi(v)$. Ideally, the target template contains all of the target features, even if it includes some background. However, in most cases, the sample with the highest probability tends to contain less background. Figure 3a illustrates this problem. The first row of Figure 3a shows candidate samples sorted without the window size ratio in Equation (9); the ideal candidate sample is located in the fourth position. The second row, which applies the window size ratio in Equation (9), shows the ideal candidate in the first position. Consequently, we prioritize templates with the same or similar $\Phi$ such that larger window sizes are assigned a larger weight, based on the scale information of the last target estimate $X_{t-1}$ (see Equation (9)). Figure 3b illustrates how the result changes when the prioritization is applied.

Procedure to find the most similar target object templates using confidence (Equation (9)).
(a) Typical explanation to find the target object by weighting the scale factor from positive candidate templates to prevent drift, partial occlusion and scaling problems; (b) real image-based re-weighting procedure to find similar templates from positive image templates.
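The effect of the window size weighting can be sketched as follows. The scoring rule used here, a logistic function of the classifier response multiplied by the ratio of the candidate area to the previous target area, is a hypothetical stand-in chosen for illustration and is not the confidence formula of the paper; the candidate names and values are also invented.

```python
import math

def rank_positive_candidates(candidates, prev_area):
    """Rank positively classified candidates, favoring larger windows.

    candidates : list of (name, phi, width, height), phi being the classifier response
    prev_area  : window area of the previous target estimate
    """
    def score(c):
        _, phi, w, h = c
        return (w * h / prev_area) / (1.0 + math.exp(-phi))
    positives = [c for c in candidates if c[1] > 0.0]  # keep positive labels only
    return sorted(positives, key=score, reverse=True)

candidates = [
    ("tight", 1.2, 30, 40),       # highest response, but crops part of the target
    ("full", 1.1, 40, 50),        # slightly lower response, covers the whole target
    ("background", -0.5, 60, 70), # negative label, discarded
]
ranked = rank_positive_candidates(candidates, prev_area=40 * 50)
names = [c[0] for c in ranked]
```

Sorting by the classifier response alone would put the tightly cropped candidate first; with the area weighting the candidate that covers the whole target moves to the top, mirroring the reordering described for Figure 3a.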
The proposed motion tracking model is summarized in Algorithm 2.

Algorithm 2: motion tracking (only steps 5 to 8 are available; the preceding step refers to Equation (10)).
5. estimate the a posteriori prob. $p(\hat{X}^j_{t+1} \mid \hat{Y}^j_{t+1}, X_t)$ using Equation (11)
6. find the most likely target state $X_{t+1}$ by Equation (12)
7. create the target descriptor $v^{tg}_{t+1} \in R^s$
8. create the background descriptors $v^{bg}_{t+1,(a_x,b_y)} \in R^s$
end

Since the appearances of the target may change during tracking, we need to update the classifier $\Phi$ every $k$ frames by updating the dictionaries as follows.

1. We save the $t_0$ target descriptors $v^{tg}_t$, $t = 1, \dots, t_0$, in the set $F$.
2. At every $t > t_0$, if $p(X^j_t \mid Y^j_t, X_{t-1}) > \theta_p$, we add the target descriptor $v^{tg}_t$ and background descriptors $v^{bg}_{t,(a_x,b_y)}$ to $F$. Otherwise, $k_p = k_p + 1$.
3. After every $k$ frames, we create the dictionary matrix $D_w$ and coefficient matrix $C_w$ using the vectors in $F$ by applying the structured sparse PCA.
4. Similar to the initialization algorithm, we update $\Phi$ using the new $D_w$ and $C_w$.
5. We evaluate $\Phi(v)$ for all target descriptors $v \in F$ and sort the descriptors according to their values, keeping the $k_0$ target descriptors with the largest values in $F$ and deleting the remaining target descriptors and all the background descriptors from $F$.
The update interval $k = k_0 + k_p$ lies between $k_0$ and $2k_0$. Because occluded frames do not contain a whole or partial target patch, the dictionary is updated more slowly in that case, through the increase of $k_p$.
We continuously update the training dictionaries using the $k_0$ prior templates that have a high probability of containing the target, as shown in Algorithm 3. In this way, a target model with high confidence keeps participating in the update. Therefore, both the target models with high confidence in the previous update and the target models from recent frames participate in the update. The target models from recent frames maintain tracking when the appearance of the target object is almost unchanged, and the target models with high confidence help prevent tracking failure when the appearance of the target object changes suddenly. Figure 4 shows the target models in $F$ at the update time and the detection of the changed target appearance after the update. In the 84th frame, the top $k_0$ target models from previous updates differ from the current target appearance, but look similar to the target in the 94th frame. This makes the updated model more suitable for detecting the changed appearance.

Algorithm 3: Dictionary update.
for t = $t_0 + 1$ to the end of the frame sequence
    1. if $p(X^j_t \mid Y_t, X_{t-1}) > \theta_p$
        1-1. add the target descriptor $v^{tg}_t$ and background descriptors $v^{bg}_{t,(a_x,b_y)}$ to $F$
       else
        1-2. $k_p = k_p + 1$
    for every $k$ frames
        2. build the new matrices $D_w$ and $C_w$ using the vectors in $F$ by structured sparse PCA
        3. update classifier $\Phi$ using $D_w$ and $C_w$
        4. compute $\Phi(v)$ for $v \in F$
        5. keep the $k_0$ largest target descriptors in $F$, and delete the remaining descriptors from $F$
        6. $k_p = 0$
end
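The bookkeeping of Algorithm 3 can be sketched in Python as below. The sketch treats descriptors as opaque objects and delegates the structured sparse PCA and SVM training to a caller-supplied `retrain` function, so it only illustrates the scheduling logic. The class name `UpdateBuffer` and the cap of $k_p$ at $k_0$ (chosen so that the interval stays between $k_0$ and $2k_0$) are assumptions, not details confirmed by the paper.

```python
class UpdateBuffer:
    """Bookkeeping for the dictionary update schedule."""

    def __init__(self, k0, theta_p, initial_targets):
        self.k0 = k0
        self.theta_p = theta_p
        self.targets = list(initial_targets)  # target descriptors stored in F
        self.backgrounds = []                  # background descriptors stored in F
        self.kp = 0                            # low-confidence frames since last update
        self.frames_since_update = 0

    def observe(self, posterior, target_desc, background_descs, retrain):
        """Process one frame; returns the new classifier on update frames, else None.

        retrain(targets, backgrounds) must return a callable phi(descriptor) -> float.
        """
        if posterior > self.theta_p:
            self.targets.append(target_desc)
            self.backgrounds.extend(background_descs)
        else:
            self.kp = min(self.kp + 1, self.k0)  # assumed cap, keeps k within [k0, 2*k0]
        self.frames_since_update += 1
        if self.frames_since_update < self.k0 + self.kp:
            return None
        phi = retrain(self.targets, self.backgrounds)
        # keep the k0 target descriptors with the largest classifier response
        self.targets = sorted(self.targets, key=phi, reverse=True)[: self.k0]
        self.backgrounds = []
        self.kp = 0
        self.frames_since_update = 0
        return phi

# Toy run: descriptors are plain floats and the "classifier" is the identity.
def retrain(targets, backgrounds):
    return lambda v: v

buf = UpdateBuffer(k0=2, theta_p=0.2, initial_targets=[0.5])
results = [
    buf.observe(0.9, 0.8, [-1.0], retrain),  # confident frame, stored
    buf.observe(0.1, 0.0, [], retrain),      # low confidence, delays the update
    buf.observe(0.9, 0.3, [-0.7], retrain),  # third frame triggers the update
]
```

In the toy run the low-confidence second frame extends the interval from two to three frames, and after the update only the two highest-scoring target descriptors remain in the buffer.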

Experimental Validation
This section validates the robustness of the proposed method by quantitatively and qualitatively comparing it to current state of the art approaches using the TS-50 public visual object benchmark video sequences (available online: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html (accessed on 8 May 2012)). The benchmark sequences include background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV). The proposed tracker was implemented in MATLAB on a standard 4-GHz machine with 2 GB RAM. To create the descriptors, we resize all patches to [72,72] and use the scale-invariant feature transform (SIFT) [50]. The number of samples $M$ is set to 600. $t_0$ and $r$ are set to three and 30, respectively. $k_0$ and $\theta_p$ are set to 10 and 0.2, respectively. We also tested the prototype VTD [30], MS [19], MIL [22], SCM [29], Frag [25], IVT [24], TLD [27], Struct [28] and ASLA [11] trackers. The experimental results are compared in Table 2.

Table 2. Average of overlap score of the proposed tracker and several current state of the art trackers (background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV)). The top two methods for each dataset are highlighted in red and blue, respectively. VTD, visual tracking decomposition; MS, mean-shift; MIL, multiple instance learning; SCM, sparse collaborative appearance; Frag, fragment-based; TLD, tracking-learning-detection.

The proposed method can be extended to track the target object using the observation model by incorporating various descriptors, and the results are presented in the Supplementary Material. All the MATLAB code and results are available on our web site.

Qualitative Analysis
The public TS-50 video sequences used in the experiments include illumination change, partial occlusion, background clutter, low resolution and pose variations. The proposed structured sparse PCA-based visual object tracking system addresses the main problems by feature optimization and dimensionality reduction.

Significant Occlusion
Heavy occlusion leads to target object tracking drift due to a lack of features, but the learned local structure of the appearance model and online updating prevent the proposed tracker from creating a bias toward part of the target, mitigating the influence of background pixels. Figure 5 shows that although the target object undergoes significant occlusion for a long period, the tracker robustly retains the key appearance structure, reducing the background effect. The Girl sequence in particular shows heavy occlusion from an object with a similar shape to the target object, but the proposed system retains target tracking.

Illumination Change
The appearance model using structured sparse representation with a SIFT descriptor is relatively insensitive to illumination changes. Figure 6 shows that although the image sequences include significant illumination changes, the target object remains continuously within the bounding box using the proposed tracking system. Simultaneous update of target images and retention of important structures using the structured sparse PCA method ensure the proposed system continuously tracks the target object even with large illumination changes.

Background Clutter
Discriminative classification of the target object and background images provides clear separation between the target object and background, which have similar color, appearance and motion. Figure 7 shows that the separation of the background and target is very robust against background clutter changes.

Quantitative Analysis
We obtained the ground-truth reference values for the eight image sequences and employed the average of overlap scores (AOS) between the tracking window and the ground truth to quantify the proposed and reference tracker performances [6]. As shown in Table 2, our proposed approach performs well for deformation, fast motion, out-of-plane rotation (OPR) and out-of-view (OV), and shows balanced performance across the various challenging issues in visual object tracking. Struct [28] shows robust performance across the various tests. SCM [29] performs well in background clutter, illumination variation, occlusion and scale variation because it extracts the features of the target object using sparse representation, but its performance still varies on video sequences with fast motion and motion blur. Figure 8 compares the performances of the proposed and current state of the art trackers for the various image sequences. The proposed tracker system tracks the target object under the partial occlusion, drift, background clutter, scale and pose variation challenges.
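For reference, an overlap score between a tracked window and a ground-truth window can be computed as in the Python sketch below. It assumes that the overlap score is the intersection-over-union of the two bounding boxes, given as (x, y, width, height), and that the AOS is the mean of the per-frame scores; the function names and box values are hypothetical.

```python
def overlap_score(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # width of the intersection
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # height of the intersection
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(tracked, truth):
    """Mean overlap score over paired per-frame boxes."""
    scores = [overlap_score(a, b) for a, b in zip(tracked, truth)]
    return sum(scores) / len(scores)

tracked = [(5, 0, 10, 10), (20, 20, 10, 10)]
truth = [(0, 0, 10, 10), (20, 20, 10, 10)]
aos = average_overlap(tracked, truth)
```

A perfectly aligned window scores 1 and a window shifted by half its width scores 1/3, so the average over these two frames is 2/3.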

Conclusions
We proposed a structured sparse PCA-based visual object tracking incorporating initialization, motion tracking and online dictionary learning and update. In the initialization stage, a discriminative classifier was applied to target object and background image template coefficients extracted from the structured sparse PCA. The best candidate samples were selected by jointly evaluating the appearance distance and learned classifier. Online dictionary learning was based on a sparse representation appearance model where the dictionary and classifier were continuously updated. The structured sparse PCA provided dimensionality reduction of high dimensional descriptors, while retaining the structure of the appearance model.
We experimentally evaluated the effectiveness of the proposed tracking system by comparing with the twelve current state of the art trackers using eight publicly available benchmark image sequences. The proposed method performed favorably against all current trackers and was able to handle all the various tracking challenge scenarios. Quantitative and qualitative comparison of the outcomes from the challenging image sequences validated the effectiveness and robustness of the proposed algorithm.
Thus, exploiting a linear combination of key structure features using structured sparse PCA is a robust method to track target objects through illumination, partial occlusion and background clutter changes, because the structure of the appearance model effectively estimates the similarity between the target object and candidates.

Conflicts of Interest:
The authors declare no conflict of interest.