Object Tracking by a Combination of Discriminative Global and Generative Multi-Scale Local Models

Abstract: Object tracking is a challenging task in many computer vision applications due to occlusion, scale variation, background clutter, and other factors. In this paper, we propose a tracking algorithm that combines discriminative global and generative multi-scale local models. In the global model, we train a classifier with sparse discriminative features to separate the target object from the background based on holistic templates. In the multi-scale local model, the object is represented by multi-scale local sparse representation histograms, which exploit the complementary partial and spatial information of an object across different scales. Finally, a collaborative similarity score of each candidate target is input into a Bayesian inference framework to estimate the target state sequentially during tracking. Experimental results on various challenging video sequences show that the proposed method performs favorably compared to several state-of-the-art trackers.


Introduction
Object tracking plays an important role in the field of computer vision [1][2][3][4][5] and serves as a preprocessing step for many applications in areas such as human-machine interaction [6], robot navigation [7] and intelligent transportation [8]. Despite the significant progress made in previous decades, object tracking is still a challenging task because an object's appearance changes under scale variation, partial occlusion, illumination variation, and background clutter. Designing a robust appearance model is therefore key to the success of a tracker. Current tracking algorithms based on an object appearance model can be roughly categorized into generative, discriminative, and hybrid methods.
For generative methods, the tracking problem is formulated as searching for the image regions most similar to the target model. Only the information of the target is used. In [9], an incremental subspace learning method was proposed to construct an object appearance model online within the particle filter framework. Kwon et al. [10] utilized multiple basic observation and motion models to cope with appearance and motion changes of an object. Motivated by the robustness of sparse representation in face recognition, Mei et al. [11] modeled tracking as a sparse approximation problem and addressed occlusion through a set of trivial templates. In [12], a tracking algorithm using a structural local sparse appearance model was proposed, which exploits both partial and spatial information of the target based on an alignment-pooling method. The work in [13] presented a tracking algorithm based on a two-view sparse representation, where the tracked objects are sparsely represented by both templates and candidate samples in the current frame. To encode more information, Hu et al. [14] proposed a multi-feature joint sparse representation for object tracking.
In discriminative methods, tracking is treated as a binary classification problem that aims to find a decision boundary that best separates the target from the background. Unlike generative methods, the information of both the target and its background is used simultaneously. The work in [15] fused an optic-flow-based tracker with a support vector machine (SVM) classifier. Grabner and Bischof [16] proposed an online AdaBoost algorithm to select the most discriminative features for object tracking. In [17], a multiple instance learning (MIL) framework was proposed for tracking, which learned a discriminative model by putting all ambiguous positive and negative samples into bags. Zhang et al. [18] utilized a sparse measurement matrix to extract low-dimensional features, and then trained a naive Bayes classifier for tracking. Recently, Henriques et al. [19] exploited the circulant structure of the kernel matrix in an SVM for tracking. In [20], a deep metric learning-based tracker was proposed, which learns a non-linear distance metric to classify the target object and background regions using a feed-forward neural network architecture.
Hybrid methods exploit the complementary advantages of the previous two approaches. Yu et al. [21] utilized two different models for tracking, where the target appearance is described by low-dimensional linear subspaces and a discriminative classifier is trained to focus on recent appearance changes. In [22], Zhong et al. developed a sparse collaborative tracking algorithm that exploits both holistic templates and local patches. Zhou et al. [23] developed a hybrid model for object tracking, where the target is represented by different appearance manifolds. The tracking method in [24] integrated the structural local sparse appearance model and a discriminative support vector machine classifier.
Inspired by the work in [22], this paper proposes a hybrid tracking method that combines discriminative global and generative multi-scale local models. Different from [22], we represent the object in the generative model using multi-scale local sparse representation histograms: the patch-based sparse representation histogram is computed separately for each patch scale, and the histograms of the different scales are then combined to exploit their collaborative strength. Therefore, our tracker exploits both partial and spatial information of an object across different scales. The final similarity score of a candidate is obtained by combining the two models under the Bayesian inference framework. The candidate with the maximum confidence is chosen as the tracking target. Additionally, an online update strategy is adopted to adapt to the appearance changes of objects. The main flow of our tracking algorithm is shown in Figure 1.

Discriminative Global Model
For the global model, as in [22], an object is represented through sparse coefficients, which are obtained by encoding the object appearance with gray features using a holistic template set. In Section 2.1, we describe the construction of the template set, where each template is represented as a vector of gray features. Due to the redundancy of the gray feature space, we present a sparse discriminative feature selection method in Section 2.2, where we train a classifier to extract the gray features that best distinguish the foreground object from the background. Finally, a confidence measure is given in Section 2.3.


Construction of the Template Set
Given the initial target region in the first frame, we sample $N_p$ foreground templates around the target location, as well as $N_n$ background templates within an annular region some pixels away from the target object. Then, the selected templates are normalized to the same size (32 × 32 in our experiments). The normalized templates are stacked together to form a template matrix $A \in \mathbb{R}^{K \times (N_p + N_n)}$, where $K$ is the dimension of the gray features, and we denote $A = A_+ \cup A_-$, i.e., $A_+$ for the $N_p$ foreground templates and $A_-$ for the $N_n$ background templates.
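As an illustration of the sampling step, the following minimal sketch draws foreground template centers close to the target and background template centers from an annulus around it. The function name and the three radii are assumptions made for this example (the text only says "some pixels away"); the counts 50 and 200 are the values reported in the experiments.

```python
import math
import random


def sample_template_centers(cx, cy, n_pos, n_neg,
                            pos_radius, inner_radius, outer_radius, seed=0):
    """Return (positives, negatives): lists of (x, y) template centers.

    Positives lie within pos_radius of the target center (cx, cy);
    negatives lie in the annulus inner_radius <= r <= outer_radius.
    All radii are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    def sample_in_ring(r_min, r_max):
        # Drawing r^2 uniformly makes the samples uniform over the ring area.
        r = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
        theta = rng.uniform(0.0, 2.0 * math.pi)
        return (cx + r * math.cos(theta), cy + r * math.sin(theta))

    positives = [sample_in_ring(0.0, pos_radius) for _ in range(n_pos)]
    negatives = [sample_in_ring(inner_radius, outer_radius) for _ in range(n_neg)]
    return positives, negatives


pos, neg = sample_template_centers(100.0, 80.0, n_pos=50, n_neg=200,
                                   pos_radius=2.0, inner_radius=8.0,
                                   outer_radius=24.0)
```

Each sampled center would then be cropped, resized to 32 × 32, and vectorized to form one column of the template matrix.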

Sparse Discriminative Feature Selection
Due to the redundancy of the gray feature space, we extract the gray features that best distinguish the foreground object from the background by training a classifier:

$$\min_{s} \left\| A^{\top} s - p \right\|_2^2 + \lambda_1 \left\| s \right\|_1 \qquad (1)$$

where each element of the vector $p \in \mathbb{R}^{(N_p + N_n) \times 1}$ represents the property of each template in the training template set $A$ (+1 corresponds to a foreground template and −1 corresponds to a background template), $\|\cdot\|_2$ and $\|\cdot\|_1$ denote the $\ell_2$ and $\ell_1$ norms, respectively, and $\lambda_1$ is a regularization parameter. The solution of Equation (1) is the vector $s$, whose non-zero elements correspond to the sparse discriminative features selected from the $K$-dimensional gray feature space.
During the tracking process, the gray features in the original space are projected to a discriminative subspace by a projection matrix $S$, which is obtained by removing all-zero rows from a diagonal matrix $S'$. The elements of the diagonal matrix $S'$ are obtained by:

$$S'_{ii} = \begin{cases} 0, & s_i = 0 \\ 1, & \text{otherwise} \end{cases} \qquad (2)$$

Thus, the training template set and the candidates in the projected space are $A' = SA$ and $x' = Sx$.
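The feature selection and projection can be sketched in a few lines. This example assumes the objective has the lasso form min_s ||A^T s − p||_2^2 + λ||s||_1 and solves it with the iterative shrinkage-thresholding algorithm (ISTA); the solver choice, the function names and the toy data are assumptions for illustration, not the authors' implementation.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the l1 norm applied to a scalar."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def select_features(A, p, lam, n_iter=500):
    """Solve min_s ||A^T s - p||^2 + lam * ||s||_1 with ISTA.

    A is a K x N list of rows (one row per gray feature, one column per
    template) and p holds the +1/-1 template labels.
    """
    K, N = len(A), len(p)
    # 2 * ||A||_F^2 upper-bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(a * a for row in A for a in row))
    s = [0.0] * K
    for _ in range(n_iter):
        residual = [sum(A[k][n] * s[k] for k in range(K)) - p[n] for n in range(N)]
        grad = [2.0 * sum(A[k][n] * residual[n] for n in range(N)) for k in range(K)]
        s = [soft_threshold(s[k] - step * grad[k], lam * step) for k in range(K)]
    return s


def project(vector, selected):
    """Apply S: keep only the entries whose coefficient in s is non-zero."""
    return [vector[i] for i in selected]


# Toy data: feature 0 follows the labels, features 1 and 2 carry no label information.
A = [[1.0, 0.9, -1.0, -0.9],
     [0.5, 0.5, 0.5, 0.5],
     [0.1, -0.1, 0.1, -0.1]]
p = [1.0, 1.0, -1.0, -1.0]
s = select_features(A, p, lam=0.5)
selected = [k for k, value in enumerate(s) if value != 0.0]
```

In this toy case only feature 0 receives a non-zero coefficient, so projecting a candidate keeps a single entry; the same `project` call would be applied to every template column and every candidate.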

Confidence Measure
Given a candidate $x$, it can be represented as a linear combination of the training template set by solving:

$$\min_{\alpha} \left\| x' - A' \alpha \right\|_2^2 + \lambda_2 \left\| \alpha \right\|_1 \qquad (3)$$

where $\alpha$ is the sparse coefficient vector, $x'$ is the projected vector of $x$, and $\lambda_2$ is a regularization parameter. A candidate with a smaller reconstruction error using the foreground templates is more likely to be the target, and vice versa. Thus, the confidence value $H_c$ of the candidate target $x$ is formulated by:

$$H_c = \exp\left( -\frac{\varepsilon_f - \varepsilon_b}{\sigma} \right) \qquad (4)$$

where $\varepsilon_f = \| x' - A'_+ \alpha_+ \|_2^2$ is the reconstruction error of the candidate $x$ using the foreground templates $A_+$, and $\alpha_+$ is the sparse coefficient vector corresponding to the foreground templates. Similarly, $\varepsilon_b = \| x' - A'_- \alpha_- \|_2^2$ is the reconstruction error of the candidate $x$ using the background templates $A_-$, and $\alpha_-$ is the sparse coefficient vector corresponding to the background templates. The variable $\sigma$ is a fixed constant.
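The confidence computation can be sketched as follows, assuming the exponential form exp(−(ε_f − ε_b)/σ) used for the global model in [22] and assuming the sparse coefficients have already been computed. The function names and the toy vectors are illustrative.

```python
import math


def reconstruction_error(x, templates, coeffs):
    """Squared l2 error between x and the linear combination of templates.

    templates is a list of template vectors and coeffs the matching
    sparse coefficients, one per template.
    """
    approx = [sum(c * t[i] for c, t in zip(coeffs, templates))
              for i in range(len(x))]
    return sum((xi - ai) ** 2 for xi, ai in zip(x, approx))


def confidence(x, fg_templates, alpha_fg, bg_templates, alpha_bg, sigma):
    """Assumed form H_c = exp(-(eps_f - eps_b) / sigma)."""
    eps_f = reconstruction_error(x, fg_templates, alpha_fg)
    eps_b = reconstruction_error(x, bg_templates, alpha_bg)
    return math.exp(-(eps_f - eps_b) / sigma)


# A candidate that matches the foreground template and not the background one.
h_target = confidence([1.0, 0.0], [[1.0, 0.0]], [1.0], [[0.0, 1.0]], [1.0], sigma=1.0)
# The same candidate scored with the two template roles swapped.
h_background = confidence([1.0, 0.0], [[0.0, 1.0]], [1.0], [[1.0, 0.0]], [1.0], sigma=1.0)
```

A candidate that is reconstructed well by the foreground templates and poorly by the background templates gets a confidence above 1, and the reverse case gets a confidence below 1.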

Generative Multi-Scale Local Model
In [22], an object is represented in the generative model by a patch-based sparse representation histogram with a single fixed patch scale. To reduce the impact of the patch size, a generative multi-scale local model is proposed in our work. We represent the object using multi-scale sparse representation histograms: the patch-based sparse representation histogram is computed separately for each patch scale, and the histograms of the different scales are then combined to exploit their collaborative strength. Moreover, we compute the similarity of histograms between the candidate and the template for each patch scale separately, and then weight the per-scale similarities to obtain the final similarity measure between the candidate and the template. The proposed multi-scale local model is illustrated in Figure 2.

Multi-Scale Sparse Representation Histogram
Given a target object, we normalize it to 32 × 32 pixels. Then, the object is segmented hierarchically into three layers, each consisting of local patches of a different scale. Three scales with patch sizes 4 × 4, 6 × 6, and 9 × 9 are used in our work. For simplicity, gray features are used to represent the patch information of a target object. The local patches of each scale are collected by a sliding window of the corresponding size, with the same step length of two pixels for every scale. Assume that $Y^k = [y_1^k, y_2^k, \ldots, y_{M_k}^k] \in \mathbb{R}^{d_k \times M_k}$ are the vectorized local patches extracted from a target candidate under patch scale $k$, where $y_i^k$ denotes the $i$-th local patch under patch scale $k$, $d_k$ is the dimensionality of a local patch, and $M_k$ is the number of local patches for scale $k$. The dictionary under patch scale $k$ is $D^k \in \mathbb{R}^{d_k \times J_k}$, where $J_k$ is the number of dictionary atoms for scale $k$. The dictionary is generated by the k-means algorithm and only comes from patches of the target region manually labeled in the first frame. With the dictionary $D^k$, each $y_i^k$ has a corresponding sparse coefficient vector $\beta_i^k \in \mathbb{R}^{J_k \times 1}$, which can be obtained by solving an $\ell_1$-regularized least-squares problem:

$$\min_{\beta_i^k} \left\| y_i^k - D^k \beta_i^k \right\|_2^2 + \lambda_3 \left\| \beta_i^k \right\|_1 \qquad (5)$$

where $\lambda_3$ is a regularization parameter.
When the sparse coefficients of all local patches of one candidate are computed under different patch scales, they are normalized and concatenated to form a sparse representation histogram by:

$$H^k = \left[ (\beta_1^k)^{\top}, (\beta_2^k)^{\top}, \ldots, (\beta_{M_k}^k)^{\top} \right]^{\top} \qquad (6)$$

where $H^k \in \mathbb{R}^{(J_k \times M_k) \times 1}$ is the sparse representation histogram for one candidate under patch scale $k$. Then, the candidate is represented by the combination of the histograms of the multiple scales.
Information 2017, 8, 43
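To make the patch layout concrete, the sketch below extracts the vectorized patches of a 32 × 32 gray image for the three patch sizes with a step of two pixels. The function name is hypothetical, and dropping windows that would extend past the image border (relevant for the 9 × 9 scale) is an assumption, since the text does not say how the border is handled.

```python
def extract_patches(image, patch_size, step=2):
    """Return the row-major vectorized patches of a gray image.

    image is a list of rows; only windows that lie fully inside the
    image are kept (border handling is an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, step):
        for left in range(0, width - patch_size + 1, step):
            patches.append([image[top + r][left + c]
                            for r in range(patch_size)
                            for c in range(patch_size)])
    return patches


# Synthetic 32 x 32 gray image with values in [0, 1].
image = [[(r * 32 + c) / 1023.0 for c in range(32)] for r in range(32)]
patch_counts = {size: len(extract_patches(image, size)) for size in (4, 6, 9)}
```

Under these assumptions the three scales yield 225, 196 and 144 patches of dimension 16, 36 and 81, respectively, so with a dictionary of 50 atoms per scale the per-scale histograms have 11,250, 9,800 and 7,200 entries.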

Histogram Modification
During the tracking process, the target's appearance may change significantly due to outliers (such as noise or occlusion). To address this issue, we modify the sparse representation histogram to exclude corrupted patches. A corrupted patch usually has a large reconstruction error, and its sparse coefficient vector is set to zero. Thus, the modified histogram under each scale can be obtained by:

$$P^k = H^k \odot O^k \qquad (7)$$

where $\odot$ denotes element-wise multiplication. Each element of $O^k$ is an indicator of whether the corresponding patch is corrupted and is defined by:

$$o_i^k = \begin{cases} 1, & \varepsilon_i^k < \varepsilon_0 \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $\varepsilon_i^k = \| y_i^k - D^k \beta_i^k \|_2^2$ represents the reconstruction error of the local patch $y_i^k$, and $\varepsilon_0$ is a threshold indicating whether the patch is corrupted or not. We have thus constructed the sparse representation histogram $P^k$ under each scale, which exploits multi-scale information of the target and takes outliers into account.
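The masking step can be sketched as below for a single patch scale. The sketch assumes that a patch counts as uncorrupted when its reconstruction error is strictly below the threshold and that the indicator is applied to every coefficient of that patch; the function name and the toy numbers are illustrative, while the threshold 0.04 is the value reported in the experiments.

```python
def modify_histogram(coeffs_per_patch, errors, eps0):
    """Concatenate per-patch coefficient vectors, zeroing corrupted patches.

    coeffs_per_patch holds one sparse coefficient vector per patch and
    errors the matching reconstruction errors. A patch whose error is
    not below eps0 contributes only zeros to the histogram.
    """
    histogram = []
    for beta, error in zip(coeffs_per_patch, errors):
        keep = 1.0 if error < eps0 else 0.0
        histogram.extend(keep * b for b in beta)
    return histogram


# Two patches with two dictionary atoms each; the second patch is corrupted.
modified = modify_histogram([[0.2, 0.8], [0.5, 0.5]], [0.01, 0.10], eps0=0.04)
```

Here the first patch keeps its coefficients and the second is replaced by zeros, so an occluded region no longer contributes to the similarity computed from the histogram.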

Similarity Measure
The key issue in object tracking is determining the similarity between the candidate and the template. We use the histogram intersection function to compute the similarity of histograms between the candidate and the template for each patch scale separately, and then weight the per-scale similarities to obtain the final similarity measure between the candidate and the template, as given by Equation (9), where $T^k$ and $P_c^k$ are the histograms of the template and the $c$-th candidate under patch scale $k$, and $\phi^k$ is a weight that measures the outliers under patch scale $k$; $\phi^k$ is defined by Equation (10). The template histograms under different patch scales are generated by Equations (5)-(7) and computed only once for each tracking sequence. When evaluating the similarity of histograms between the candidate and the template, we modify the template histograms under the same condition as the histograms of the candidate.
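A weighted histogram intersection over several scales can be sketched as follows. The per-scale weights are passed in as plain numbers because their definition is not reproduced here, and the unnormalized weighted sum is an assumption about how the per-scale similarities are combined; the function names and toy histograms are illustrative.

```python
def histogram_intersection(template, candidate):
    """Sum of element-wise minima of two equally long histograms."""
    return sum(min(t, c) for t, c in zip(template, candidate))


def multi_scale_similarity(templates, candidates, weights):
    """Weighted sum of per-scale histogram intersections.

    templates and candidates hold one histogram per patch scale and
    weights one non-negative weight per scale (assumed combination).
    """
    return sum(w * histogram_intersection(t, c)
               for t, c, w in zip(templates, candidates, weights))


# Two scales with toy histograms and weights.
similarity = multi_scale_similarity(
    templates=[[0.5, 0.5], [1.0]],
    candidates=[[0.2, 0.8], [0.4]],
    weights=[1.0, 0.5],
)
```

In this toy case the first scale contributes 0.7 and the second contributes 0.5 × 0.4, so a scale with a low weight has less influence on the final score.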

Tracking by Bayesian Inference
Object tracking can be treated as a Bayesian inference task [25]. Given the observations of the target $Z_t = \{z_1, z_2, \ldots, z_t\}$ up to time $t$, the current target state $s_t$ can be obtained by maximum a posteriori estimation via:

$$\hat{s}_t = \arg\max_{s_t^i} p(s_t^i \mid Z_t) \qquad (11)$$

where $s_t^i$ denotes the $i$-th sample of the state $s_t$. The posterior probability $p(s_t^i \mid Z_t)$ can be recursively computed by the Bayesian theorem via:

$$p(s_t \mid Z_t) \propto p(z_t \mid s_t) \int p(s_t \mid s_{t-1}) \, p(s_{t-1} \mid Z_{t-1}) \, ds_{t-1} \qquad (12)$$

where $p(s_t \mid s_{t-1})$ and $p(z_t \mid s_t)$ denote the dynamic model and the observation model, respectively. The dynamic model describes the temporal correlation of the target states in consecutive frames, and the motion of the target between consecutive frames is modeled by an affine transformation. The state transition is formulated as a random walk, i.e., $p(s_t \mid s_{t-1}) = \mathcal{N}(s_t; s_{t-1}, \Sigma)$, where $s_t = \{\alpha_t, \beta_t, \mu_t, \nu_t\}$ denotes the x and y translations, the scale and the aspect ratio at time $t$, respectively, and $\Sigma = \mathrm{diag}(\sigma_\alpha^2, \sigma_\beta^2, \sigma_\mu^2, \sigma_\nu^2)$ is a diagonal covariance matrix whose elements are the variances of the affine parameters.
The observation model $p(z_t \mid s_t)$ estimates the likelihood of observing $z_t$ at state $s_t$. In this paper, the collaborative likelihood of the $c$-th candidate is defined by Equation (13), which combines the confidence value of the global model with the similarity measure of the multi-scale local model, and the candidate with the maximum likelihood value is regarded as the tracking result.
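The sampling and selection steps can be sketched as follows: candidate states are drawn from the random-walk dynamic model around the previous state, and the candidate with the largest likelihood is kept. The particle count and variances are the values reported in the experiments; the function names and the toy likelihood are assumptions for illustration and stand in for the collaborative likelihood.

```python
import math
import random


def propagate(prev_state, variances, n_particles, rng):
    """Draw particles from N(prev_state, diag(variances)) (random walk)."""
    return [tuple(rng.gauss(mean, math.sqrt(var))
                  for mean, var in zip(prev_state, variances))
            for _ in range(n_particles)]


def map_estimate(particles, likelihood):
    """Return the particle with the maximum likelihood value."""
    return max(particles, key=likelihood)


def toy_likelihood(state):
    """Hypothetical stand-in for the observation model, peaked at (101, 79)."""
    x, y = state[0], state[1]
    return math.exp(-((x - 101.0) ** 2 + (y - 79.0) ** 2) / 8.0)


rng = random.Random(0)
# State layout: (x translation, y translation, scale, aspect ratio).
particles = propagate((100.0, 80.0, 1.0, 1.0), (4.0, 4.0, 0.01, 0.005), 300, rng)
best = map_estimate(particles, toy_likelihood)
```

In a tracker, `toy_likelihood` would be replaced by a function that crops the image region described by the state and scores it with the two appearance models.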

Online Update
An update scheme is essential to adapt to changes in the target's appearance during tracking. The global model and the multi-scale local model are updated independently. For the global model, the negative templates are updated every five frames, and the positive templates remain the same during tracking. As the global model aims to select sparse discriminative features to separate the target object from the background, it is important to ensure that the positive and negative templates are all correct and distinct.
For the multi-scale local model, the dictionary $D^k$ for each scale is fixed to ensure that the dictionary is not affected even if outliers occur during tracking. In order to capture the change of the target's appearance and balance between the old and new templates, the new template histogram $H_{new}^k$ under patch scale $k$ is computed by Equation (14), where $H_1^k$ is the histogram at the first frame, $H_t^k$ denotes the histogram of the last frame before the update, $\mu_1$ and $\mu_2$ are the weights, the variable $\phi^k$ defined by Equation (10) is the outlier measure for scale $k$ in the current frame, and $\phi_0$ is a predefined constant.

Experiments
We evaluate our tracking algorithm on 12 public video sequences from the benchmark dataset [26]. These sequences include different challenging situations such as occlusion, scale variation, cluttered background, and illumination changes. Our tracker is compared with several state-of-the-art trackers, including the tracking-learning-detection method (TLD) [27], the structured output tracker (STRUCK) [28], tracking via the sparse collaborative appearance model (SCM) [22], the tracker with multi-task sparse learning (MTT) [29] and tracking with kernelized correlation filters (KCF) (with histogram of oriented gradient features) [19]. We implement the proposed method in MATLAB 2013a (The MathWorks, Natick, MA, USA) on a PC with an Intel G1610 CPU (2.60 GHz) and 4 GB of memory. For fair comparisons, we use the source code provided by the benchmark [26] with the same parameters, except for KCF. We run KCF with the default parameters reported in the corresponding paper.
The parameters of our tracker are fixed for all test sequences to demonstrate its robustness and stability. We manually label the location of the target in the first frame of each sequence. The number of particles is 300, and the covariance matrix of the affine parameters is set to $\Sigma = \mathrm{diag}(4, 4, 0.01, 0.005)$. The numbers of positive templates, $N_p$, and negative templates, $N_n$, are 50 and 200, respectively. The regularization parameters of Equations (3) and (5) are set to 0.01, and the parameter $\lambda_1$ in Equation (1) is fixed to 0.001. The dictionary size for each scale is 50. The threshold $\varepsilon_0$ in Equation (8) is 0.04. The parameters $\phi_0$, $\mu_1$ and $\mu_2$ in Equation (14) are set to 0.8, 0.85 and 0.95, respectively.

Quantitative Comparison
We quantitatively evaluate the performance of each tracker in terms of the center location error (CLE) and the overlap rate, as well as their average values. The CLE measures the Euclidean distance between the center of the tracking result and the ground truth, and is defined as $\mathrm{CLE} = \sqrt{(x' - x)^2 + (y' - y)^2}$, where $(x', y')$ and $(x, y)$ denote the tracked central position and the ground truth central position, respectively. A lower CLE indicates better performance. The overlap rate reflects the stability of each tracker, as it takes the size and pose of the target object into account. It is defined by the PASCAL VOC criterion [30], $\mathrm{score} = \frac{\mathrm{area}(ROI_T \cap ROI_G)}{\mathrm{area}(ROI_T \cup ROI_G)}$, where $ROI_T$ is the tracking bounding box and $ROI_G$ is the ground truth bounding box. More accurate trackers have higher overlap rates. Figure 3 shows the frame-by-frame center location error comparison results. Tables 1 and 2 report the comparison results of our tracker and five other trackers in terms of average CLE and average overlap rate. In both tables, the first row lists the trackers and the first column lists the videos in our experiment. The last row is the average of the results for each tracker.
As shown in Figure 3, the CLE curves diverge for some trackers, such as the TLD tracker in the CarDark, Coupon, Crossing and David3 video sequences, the STRUCK tracker in the David3 and Jogging.2 video sequences, and the KCF tracker in the Human5 and Jogging.2 video sequences. This indicates that these trackers lose the tracked objects during the tracking process. From Tables 1 and 2, we can see that our tracker achieves the best or second-best performance. Moreover, our tracker outperforms the SCM tracker on all 12 video sequences, which suggests that the multi-scale local information adopted in our model is effective and important for tracking. Overall, our tracker performs favorably against the other five state-of-the-art algorithms, with lower center location errors and higher overlap rates.
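Both evaluation metrics are straightforward to compute per frame. The sketch below assumes axis-aligned bounding boxes given as (x, y, width, height) with (x, y) the top-left corner; this box convention and the function names are assumptions made for the example.

```python
import math


def center_location_error(tracked_center, truth_center):
    """Euclidean distance between the tracked and ground truth centers."""
    return math.hypot(tracked_center[0] - truth_center[0],
                      tracked_center[1] - truth_center[1])


def overlap_rate(box_t, box_g):
    """Intersection area over union area of two axis-aligned boxes.

    Each box is (x, y, width, height) with (x, y) the top-left corner.
    """
    inter_w = max(0.0, min(box_t[0] + box_t[2], box_g[0] + box_g[2])
                  - max(box_t[0], box_g[0]))
    inter_h = max(0.0, min(box_t[1] + box_t[3], box_g[1] + box_g[3])
                  - max(box_t[1], box_g[1]))
    intersection = inter_w * inter_h
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - intersection
    return intersection / union if union > 0 else 0.0


cle = center_location_error((3.0, 4.0), (0.0, 0.0))
score = overlap_rate((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0))
```

For the two 10 × 10 boxes shifted by five pixels, the intersection is 50 and the union is 150, so the overlap rate is one third; averaging the per-frame values over a sequence gives the kind of per-sequence figures reported in the tables.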

Qualitative Comparison
To further evaluate the performance of our tracker against the other state-of-the-art trackers, several screenshots of the tracking results on the 12 video sequences [26] are shown in Figure 4. For these sequences, several principal factors that affect the appearance of an object are considered. Some other factors are also included in the discussion. The qualitative discussion is detailed below.
Illumination Variation: Figure 4a,i presents the tracking results for two challenging sequences with illumination variation to verify the effectiveness of our tracker. In the Car4 sequence, the TLD tracker severely deviates from the object location when the car goes below the bridge, which creates a dramatic illumination change (e.g., frame 228). The MTT tracker shows a severe drift when the car becomes smaller. The SCM tracker and our tracker can track the target accurately. For the Fish sequence, illumination changes and camera movement make tracking challenging. All trackers, except MTT and SCM, work well.
Occlusion: Occlusion is one of the most general, yet crucial, problems in object tracking. In the David3 sequence (Figure 4g), severe occlusion is introduced by a tree, and the object's appearance changes drastically when the man turns around. Only KCF and our tracker successfully locate the correct object throughout the sequence. For the Jogging.2 sequence (Figure 4k), when the girl is occluded by the pole (e.g., frame 58), all of the trackers drift away from the target object, except for SCM and our tracker.
Background Clutter: The CarDark sequence in Figure 4b shows a moving vehicle at night with dramatic illumination changes and low contrast against a cluttered background. The TLD tracker starts to drift around frame 200 and gradually loses the target. The other trackers can track the car accurately. For the Coupon sequence (Figure 4c), the tracked object is the uppermost coupon, and an imposter coupon book similar to the target is introduced to distract the trackers. KCF and our tracker perform better than the other trackers. In the Crowds sequence (Figure 4e), the target is a man who walks from right to left. All trackers, except STRUCK and MTT, are able to track the whole sequence successfully, and the MTT tracker starts to drift around frame 42 when an object of similar color is in proximity to the tracked target.
Scale Variation and Deformation: For the Crossing sequence, all trackers, except TLD and MTT, can reliably track the object, as shown in Figure 4d. In the Human5 sequence (Figure 4j), our tracker gives the best result. The KCF and MTT trackers lose the target. Figure 4l shows the tracking results in the Walking sequence. Our tracker achieves the best performance, followed by the SCM tracker.
Rotation: The David2 sequence consists of both in-plane and out-of-plane rotations. We can see from Figure 4f that the accuracy of our tracker is higher than that of the SCM tracker. As illustrated in Dog1 (Figure 4h), the target in this sequence undergoes both in-plane and out-of-plane rotations, as well as scale variation. Our tracker gives the best result in terms of the overlap rate.

Conclusions
In this paper, we present a robust object tracking approach that combines discriminative global and generative multi-scale local models. In the global appearance model, a classifier with sparse discriminative features is trained to separate the target object from the background. In the multi-scale local appearance model, the appearance of an object is modeled by multi-scale local sparse representation histograms. Therefore, compared with the SCM tracker, our tracker can utilize both partial and spatial information of an object across different scales, which are mutually complementary. The final similarity score of a candidate is obtained by combining the two models under the Bayesian inference framework. Additionally, an online update strategy is adopted to adapt to the appearance changes of the object. Extensive experiments on several challenging video sequences demonstrate the effectiveness and robustness of the proposed tracker.


Figure 1. The main flow of our tracking algorithm.


Figure 2. An illustration of the generative multi-scale local model. (a) A candidate; (b) sampling patches under different patch scales; (c) sparse representation histograms of the candidate under different patch scales; (d) modifying the histograms of the candidate by excluding the outlier patches; (e) computing the similarity of histograms between the candidate and the template for each patch scale separately; (f) weighting the histogram similarities of the different patch scales to obtain the final similarity measure between the candidate and the template.


Figure 3. Frame-by-frame comparison of six trackers in terms of center location error (CLE).


Figure 4. Screenshots of some sampled tracking results. (a) Car4 with illumination and scale variation; (b) CarDark with illumination variation and background clutter; (c) Coupon with occlusion and background clutter; (d) Crossing with scale variation and deformation; (e) Crowds with illumination variation and background clutter; (f) David2 with in-plane rotation and out-of-plane rotation; (g) David3 with occlusion and deformation; (h) Dog1 with scale variation and rotation; (i) Fish with illumination variation; (j) Human5 with scale variation and deformation; (k) Jogging.2 with occlusion and deformation; (l) Walking with scale variation and deformation.


Table 1. Comparison of results in terms of average CLE (pixels). Bold font indicates the best performance, while italic font indicates the second best.

Table 2. Comparison of results in terms of average overlap rate. Bold font indicates the best performance, while italic font indicates the second best.