A Robust Structured Tracker Using Local Deep Features

Deep features extracted from convolutional neural networks have been recently utilized in visual tracking to obtain a generic and semantic representation of target candidates. In this paper, we propose a robust structured tracker using local deep features (STLDF). This tracker exploits the deep features of local patches inside target candidates and sparsely represents them by a set of templates in the particle filter framework. The proposed STLDF utilizes a new optimization model, which employs a group-sparsity regularization term to adopt local and spatial information of the target candidates and attain the spatial layout structure among them. To solve the optimization model, we propose an efficient and fast numerical algorithm that consists of two subproblems with closed-form solutions. Evaluations in terms of success and precision on challenging benchmark image sequences (e.g., OTB50 and OTB100) demonstrate the superior performance of STLDF against several state-of-the-art trackers.


Introduction
Visual tracking aims to estimate the states of a moving object or multiple objects in frame sequences under different conditions. It is considered one of the most active and challenging computer vision topics, with a large array of applications in autonomous driving, video content analysis and understanding, surveillance, and so forth. Although some improvements have been achieved in several tracking methods [1][2][3][4][5][6], computer vision researchers still aim to develop more robust algorithms capable of handling various challenges including occlusion, illumination variations, in-plane and out-of-plane rotation, background clutter, deformation, and low resolution.
In general, visual tracking algorithms can be categorized into two groups: discriminative and generative. Discriminative tracking methods formulate a binary decision boundary to distinguish the target from backgrounds. Ensemble [7], online boosting [8], and online multiple instance learning [9] are a few representative discriminative methods. Generative methods adopt a model to represent the target and formulate tracking as a model-based searching procedure to find the region most similar to the target. Eigenspace learning [10], incremental subspace learning [11], and sparse representation [5,12,13,14] are a few representative generative methods. The performance of these approaches is limited by their use of hand-crafted features such as intensity, local binary pattern (LBP) [15], and histogram of oriented gradients (HOG) [16] for target representation. These features may not be effective in handling the many challenges that arise in different frame sequences.
Convolutional neural networks (CNNs) have been applied recently to visual tracking with tremendous improvement [17][18][19][20][21][22][23]. As one of the pioneer works, Wang and Yeung [17] propose a multi-layer denoising auto-encoder network to learn general object representations that are more robust against variations for visual tracking. Later, Wang et al. [24] use auxiliary video sequences to learn hierarchical features, which are robust to both complicated motion transformations and appearance changes of target objects. This feature learning algorithm is further integrated into three tracking methods to achieve significant improvement. Several recent tracking algorithms [25,26] utilize pretrained CNNs on a large-scale classification dataset (e.g., ImageNet [27]) to extract hierarchical features. These features are then separately integrated in correlation filter-based trackers and sparse trackers [19,20,23] to achieve better tracking performance than using hand-crafted features. On the other hand, some deep learning-based tracking methods [28,29] directly use external videos to train CNNs for visual tracking. Nam and Han [28] introduce the MDNet tracker to pretrain a discriminative CNN using auxiliary sequences with tracking ground truths to obtain a generic object representation. Since then, various trackers have been proposed to improve the performance of MDNet by using a tree structure to manage multiple target appearance models [30], using adversarial learning to identify a mask that maintains the most robust features of the target objects over a long temporal span [31], and using reciprocative learning to exploit visual attention for training deep classifiers [32]. All these direct CNN-based trackers achieve improved performance. However, they all require off-line training on the external videos.
In this paper, we propose a robust structured tracker using local deep features (STLDF), which exploits the convolutional neural network (CNN) features of the local patches inside a target candidate and sparsely represents them in a novel convex optimization model. Unlike the conventional local sparse trackers [14], the proposed optimization model in STLDF employs a group-sparsity regularization term to adopt local and spatial information of the target candidates and attain the spatial layout structure among them. The major contributions of the proposed work are summarized as follows:
• Proposing a deep features-based structured local sparse tracker, which employs CNN deep features of the local patches within a target candidate and keeps the relative spatial structure among the local deep features of a target candidate.
• Developing a convex optimization model, which combines nine local features of each target candidate with a group-sparsity regularization term to encourage the tracker to sparsely select appropriate local patches of the same subset of templates.
• Designing a fast and parallel numerical algorithm by deriving the augmented Lagrangian of the optimization model into two closed-form problems, the quadratic problem and the Euclidean norm projection onto the probability simplex constraints problem, by adopting the alternating direction method of multipliers (ADMM).
• Utilizing the accelerated proximal gradient (APG) method to update the CNN deep feature-based template by casting it as a Lasso problem.
The preliminary results of this work are presented in Reference [33], which is based on hand-crafted features. We made a number of improvements in the proposed method: (i) STLDF automatically extracts representative local deep features of target candidates using the pre-trained CNN. (ii) STLDF efficiently derives the augmented Lagrangian of the optimization model into two closed-form problems: the quadratic problem and the Euclidean norm projection onto the probability simplex constraints problem. (iii) STLDF updates the CNN deep feature-based template by casting it as a Lasso problem and numerically solving it using the accelerated proximal gradient (APG) method. The remainder of this paper is organized as follows: Section 2 introduces the notations used in the paper. Section 3 presents the STLDF together with its new convex optimization model solved by the proposed ADMM-based numerical solution. In addition, this section explains the deep feature extraction, the template update strategy, and the summary of the proposed method. Section 4 demonstrates the experimental results on the challenging OTB50 and OTB100 tracking benchmarks, quantitatively and qualitatively compares the STLDF with several state-of-the-art trackers, and discusses the results and future work. Section 5 draws the conclusions and presents the future work.

Notations
In this paper, we denote scalars, vectors, and matrices by italic lowercase, boldface lowercase, and boldface uppercase letters, respectively. For a column vector $\mathbf{x}$, we use $x_i$ to represent the $i$th element of $\mathbf{x}$ and $\mathrm{diag}(\mathbf{x})$ to represent a diagonal matrix formed by the elements of $\mathbf{x}$. For a matrix $\mathbf{X}$, we use $X_{i,j}$ to denote the element at the $i$th row and $j$th column, $\|\mathbf{X}\|_F$ to denote the Frobenius norm, and $\|\mathbf{X}\|_{p,q}$ to denote the $\ell_p$ norm of the $\ell_q$ norms of the rows of $\mathbf{X}$. In addition, we use $\mathrm{tr}(\cdot)$ as the trace operator, $\mathbf{X} \otimes \mathbf{Y}$ as the Kronecker product of two matrices $\mathbf{X}$ and $\mathbf{Y}$, $\mathbf{1}_l$ as a column vector of $l$ ones, and $\mathbf{I}_k$ as a $k \times k$ identity matrix.

Proposed Method
In this section, we present the proposed structured tracker using local deep features (STLDF). In Section 3.1, we formulate a convex optimization model, which incorporates the local deep features of target candidates to overcome the drawbacks of traditional local sparse trackers [14,34]. In Section 3.2, we describe a numerical algorithm in detail, which efficiently solves the optimization model presented in Section 3.1. In Section 3.3, we present the local deep feature extraction process using a pre-trained CNN. In Section 3.4, we describe the template update strategy. In Section 3.5, we provide a summary of the proposed STLDF tracker.

Structured Tracker Using Local Deep Features (STLDF)
The proposed STLDF tracker incorporates the local deep features of target candidates in a new optimization model to achieve robust tracking performance. Unlike traditional local sparse trackers, this convex optimization model encourages the tracker to keep the spatial layout structure among different local deep features of a target candidate. As a result, it achieves a consistent and similar pattern on the non-zero elements of the sparse vectors corresponding to different local deep features.
Traditional local sparse trackers [14,34] do not attain the spatial layout structure among local patches. For instance, Jia et al. [34] use the Lasso model to represent local patches of target candidates. However, the $\ell_1$ regularization term in Lasso does not encourage the tracker to represent local patches inside a target candidate by their corresponding local patches of the selected dictionary bases. The proposed STLDF tracker employs a new optimization model to address this issue. Specifically, the group-sparsity regularization term in the optimization model of STLDF imposes a structure on the sparse vectors of different local patches inside each target candidate. This regularization term selects few templates for representation and motivates the group of local patches inside a target candidate to be represented by a group of local patches inside the few selected templates. For instance, if the $r$th local patch of the $j$th target candidate is best represented by the $r$th local patch of the $q$th template, the $s$th local patch of the $j$th target candidate is also best represented by the $s$th local patch of the $q$th template. To solve the optimization model, we adopt the alternating direction method of multipliers (ADMM) to convert the augmented Lagrangian of the optimization model into two subproblems with closed-form solutions: the quadratic problem and the Euclidean norm projection onto the probability simplex constraints problem.
We select $l$ overlapping local patches in $k$ target templates and extract a $d$-dimensional feature vector for each patch. These feature vectors are used to build the dictionary $\mathbf{D} = [\mathbf{D}_1, \ldots, \mathbf{D}_k] \in \mathbb{R}^{d \times (lk)}$, where $\mathbf{D}_i \in \mathbb{R}^{d \times l}$. Using $n$ particles (target candidates), we build a matrix $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_n] \in \mathbb{R}^{d \times (ln)}$ to include local deep features of target candidates. We denote the sparse coefficient matrix as $\mathbf{C} = [\mathbf{C}_1^\top, \ldots, \mathbf{C}_k^\top]^\top \in \mathbb{R}^{(lk) \times l}$, where $\mathbf{C}_q$ is an $l \times l$ matrix indicating the group representation of $l$ local features of the $j$th target candidate using $l$ local features of the $q$th template. We formulate the following model to represent deep features of the $j$th target candidate using $k$ target templates:

$$
\begin{aligned}
\min_{\mathbf{C}} \quad & \|\mathbf{X}_j - \mathbf{D}\mathbf{C}\|_F^2 + \lambda \left\| [\mathbf{C}_1(:) \; \cdots \; \mathbf{C}_k(:)]^\top \right\|_{1,\infty} && \text{(1a)} \\
\text{subject to} \quad & \mathbf{C} \geq 0, && \text{(1b)} \\
& \mathbf{1}_{lk}^\top \mathbf{C} = \mathbf{1}_l^\top. && \text{(1c)}
\end{aligned}
$$

The first term in (1a) represents the similarity measurement between the feature matrix $\mathbf{X}_j$ and its representation using the dictionary $\mathbf{D}$. The second term is a group-sparsity regularization term, which penalizes the objective function to select dictionary words (templates). This term establishes the $\|\cdot\|_{1,\infty}$ minimization on the matrix $[\mathbf{C}_1(:) \; \cdots \; \mathbf{C}_k(:)]^\top$, which leads the local features inside a target candidate to choose the same few dictionary words (templates). It should be noted that each group ($\mathbf{C}_q$ of size $l \times l$) is vectorized via $\mathbf{C}_q(:)$ and is represented by a column vector. Imposing $\|\cdot\|_{1,\infty}$ minimizes the sum of the maximum absolute values per group. Therefore, the $\ell_1$ norm minimization selects few dictionary words for representation by imposing the rows of $[\mathbf{C}_1(:) \; \cdots \; \mathbf{C}_k(:)]^\top$ to be sparse. The $\ell_\infty$ norm minimization on the columns of $[\mathbf{C}_1(:) \; \cdots \; \mathbf{C}_k(:)]$ motivates the group of local patches to jointly select the same few templates. The parameter $\lambda > 0$ is a trade-off between the first and the second terms. The constraint (1b) ensures that the sparse coefficients are non-negative since a tracking target can be represented by target templates dominated by non-negative coefficients [12]. The constraint (1c) ensures that each local feature vector in $\mathbf{X}_j$ is expressed by at least one selected local feature vector of the dictionary $\mathbf{D}$.
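As a rough illustration of the quantities described above, the following Python sketch evaluates a data-fit term plus a group-sparsity penalty for a single candidate. It is not the authors' implementation: it assumes a squared Frobenius norm for the first term, a sum-to-one reading of constraint (1c), and a coefficient matrix in which the k blocks C_q are stacked vertically. All function names and the toy data are hypothetical.

```python
import numpy as np

def group_sparsity_penalty(C, l, k):
    """Sum over the k templates of the largest absolute entry in each l x l block C_q."""
    blocks = C.reshape(k, l, l)  # block q holds rows q*l .. (q+1)*l - 1 of C
    return float(np.abs(blocks).max(axis=(1, 2)).sum())

def objective(X_j, D, C, lam, l, k):
    """Data-fit term plus lam times the group-sparsity penalty (assumed form)."""
    residual = X_j - D @ C
    return float(np.sum(residual ** 2)) + lam * group_sparsity_penalty(C, l, k)

def is_feasible(C, tol=1e-8):
    """Non-negative entries and columns that each sum to one (assumed constraints)."""
    return bool(np.all(C >= -tol) and np.allclose(C.sum(axis=0), 1.0, atol=tol))

l, k, d = 9, 10, 20
rng = np.random.default_rng(0)
D = rng.standard_normal((d, l * k))  # toy dictionary, not real deep features

# Structured code: patch r of the candidate uses patch r of template 3 only.
C_structured = np.zeros((l * k, l))
C_structured[3 * l:4 * l, :] = np.eye(l)

# Scattered code: patch r of the candidate uses patch r of template r.
C_scattered = np.zeros((l * k, l))
for r in range(l):
    C_scattered[r * l + r, r] = 1.0

penalty_structured = group_sparsity_penalty(C_structured, l, k)  # one active template
penalty_scattered = group_sparsity_penalty(C_scattered, l, k)    # l active templates
```

In this toy setting both codes are feasible, but the scattered code activates nine templates and is penalized nine times as heavily as the structured code, which is the behavior the group-sparsity term is meant to encourage.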
In order to find the representation of each target candidate, we compute the matrix C using the numerical algorithm introduced in Section 3.2. By applying average pooling on C, we obtain a representative vector for each target candidate [34]. The summation of this representative vector is used as a likelihood value to determine the best target candidate among n target candidates in each frame. The templates are updated to maintain the latest changes of target regions over time [34].

Numerical Algorithm
In this section, we provide a fast and parallel numerical algorithm by deriving the augmented Lagrangian of the optimization model (1) into two closed-form problems based on the alternating direction method of multipliers (ADMM) [35]. In general, ADMM introduces auxiliary variables to split a complex optimization problem into simpler sub-problems. Each sub-problem is solved iteratively using an explicit solution until convergence. To do so, we first define the vector $\mathbf{m} \in \mathbb{R}^k$ such that $m_i = \max |\mathbf{C}_i(:)|$ and rewrite (1) as problem (2). To ensure the equivalence between (1) and (2), we impose the inequality constraint (2d). To simplify our model, we convert this inequality constraint into an equality constraint by introducing a non-negative slack matrix $\mathbf{U} \in \mathbb{R}^{(lk) \times l}$ to compensate for the difference between $\mathbf{m} \otimes \mathbf{1}_l \mathbf{1}_l^\top$ and $\mathbf{C}$, which yields problem (3). We rewrite the constraint (3d) independently of $\mathbf{m}$ as (4d), since (3d) implies that the columns of $\mathbf{C} + \mathbf{U}$ are regulated to be identical. In addition, we rewrite $\mathbf{1}_k^\top \mathbf{m}$ as $\frac{1}{l^2} \mathbf{1}_{(lk)}^\top (\mathbf{C} + \mathbf{U}) \mathbf{1}_l$ using (3d). These steps make the resulting problem (4) independent of $\mathbf{m}$, where the matrix $\mathbf{E}$ in (4) is the right circular shift operator on the rows of $\mathbf{C} + \mathbf{U}$. Based on ADMM, we define $\hat{\mathbf{C}}, \hat{\mathbf{U}} \in \mathbb{R}^{(lk) \times l}$ as auxiliary variables and reformulate (4) as problem (5), where $\mu_1, \mu_2 > 0$ are the augmented Lagrangian parameters. Without loss of generality, we assume these parameters are the same [35]. For any feasible solution of (5a), the last two terms are equal to zero, which implies the equivalence between (4) and (5). The augmented Lagrangian function of (5) is given in (6), where $\mathbf{\Lambda}_1, \mathbf{\Lambda}_2 \in \mathbb{R}^{(lk) \times l}$ are the Lagrangian multipliers corresponding to the equations in (5f). Given initializations for $\hat{\mathbf{C}}$, $\hat{\mathbf{U}}$, $\mathbf{\Lambda}_1$, and $\mathbf{\Lambda}_2$ at iteration $t = 0$ (i.e., $\hat{\mathbf{C}}^0$, $\hat{\mathbf{U}}^0$, $\mathbf{\Lambda}_1^0$, $\mathbf{\Lambda}_2^0$), (6) is solved through the ADMM iterations (7), (8), and (9), where the update in (7) is subject to (5d) and the update in (8) is subject to (5b), (5c), and (5e).
By considering the quadratic and linear terms of $\mathbf{C}$ and $\mathbf{U}$ in (6), we first define $\{\mathbf{z}_i\}_{i=1}^{lk}$, where $\mathbf{z}_i \in \mathbb{R}^{2l}$ is obtained by stacking the $i$th rows of $\mathbf{C}$ and $\mathbf{U}$. We then divide (7) into $lk$ equality-constrained quadratic programs over $\{\mathbf{z}_i\}$, in which $\mathbf{Q} \in \mathbb{R}^{l \times l}$ is a block diagonal positive semi-definite matrix and $\mathbf{A}$ is a sparse matrix constructed from the constraint (5d). Each of these quadratic programs has an analytical solution obtained by writing the KKT conditions. Similarly, we split (8) into two separate sub-problems with closed-form solutions over $\hat{\mathbf{C}}$ and $\hat{\mathbf{U}}$, namely sub-problems (11) and (12), which consist of $l$ independent Euclidean norm projections onto the probability simplex constraints and the non-negative orthant, respectively. Both sub-problems have analytical solutions. Finally, we solve the two sub-problems over $\mathbf{\Lambda}_1$ and $\mathbf{\Lambda}_2$ in (9) by performing $l$ parallel updates over their respective columns. The closed-form solutions lead to quick updates in each iteration.
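The slack-variable step can be checked numerically. The short Python sketch below builds a toy non-negative coefficient matrix, takes m as the block-wise maxima, forms the Kronecker product of m with an all-ones l × l block, and confirms that the slack is non-negative and that the sum of m equals the sum of all entries of C + U divided by l². The block layout of C and all variable names are assumptions made for illustration.

```python
import numpy as np

l, k = 3, 4
rng = np.random.default_rng(0)

C = rng.random((l * k, l))                      # toy non-negative coefficients
m = C.reshape(k, l, l).max(axis=(1, 2))         # m_q = largest entry of block C_q
M = np.kron(m.reshape(-1, 1), np.ones((l, l)))  # kron(m, 1_l 1_l^T), shape (l*k, l)
U = M - C                                       # slack that lifts C up to M

lhs = m.sum()                                         # 1_k^T m
rhs = np.ones(l * k) @ (C + U) @ np.ones(l) / l**2    # (1/l^2) 1^T (C + U) 1
```

Each l × l block of M is constant, so summing all of its entries counts every m_q exactly l² times, which is where the 1/l² factor comes from.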
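The two projection sub-problems have well-known closed forms. The sketch below shows a standard sort-based Euclidean projection onto the probability simplex, applied column by column, and the element-wise projection onto the non-negative orthant. Which matrix is actually projected in each ADMM iteration (for example, a combination of C and a scaled multiplier) is not recoverable from the text, so the input V here is a hypothetical placeholder.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a 1-D array onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)            # shift that makes the result sum to one
    return np.maximum(v - theta, 0.0)

def project_columns_simplex(V):
    """Project every column of V independently; the columns could be processed in parallel."""
    return np.column_stack([project_simplex(V[:, j]) for j in range(V.shape[1])])

def project_nonneg(V):
    """Euclidean projection onto the non-negative orthant."""
    return np.maximum(V, 0.0)

V = np.random.default_rng(0).standard_normal((90, 9))  # hypothetical (l*k) x l input
C_hat = project_columns_simplex(V)
U_hat = project_nonneg(V)
```

Because the columns are independent, the l projections can run in parallel, which matches the parallel structure described above.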

Local Deep Feature Extraction
In the proposed STLDF, we automatically extract learned local deep features to represent each target region. To this end, we set the size of each target candidate to 64 × 64 pixels to contain sufficient object-level information with decent resolution. Each target candidate is passed to the VGG19 network [27], pre-trained on the large-scale ImageNet dataset [36], to automatically extract its representative features. This network has been shown to achieve better tracking performance than other CNNs such as AlexNet since its deeper architecture yields stronger semantics that are less sensitive to significant appearance changes. Its default input size of 224 × 224 × 3 has also been used in other VGG19-based trackers [23,37] to achieve good tracking results. To ensure a fair comparison with other VGG19-based trackers, we resize each target candidate to this default input size before forward propagation. We utilize the output of the Conv5-4 layer as the feature map of the target candidate since the fifth layer has been shown to be effective in discriminating the targets even with dramatic background changes [37]. The generated feature map has a size of 7 × 7 × 512, which is not large enough to provide spatial information of target candidates. As a result, we use the bilinear interpolation technique introduced in Reference [37] to perform a two-layer upsampling operation that increases the feature map from 7 × 7 × 512 to 14 × 14 × 512 and then to 28 × 28 × 512. The final upsampled feature map has sufficient spatial resolution to extract overlapping local patches of size 14 × 14 × 512, which have been shown to be effective in discriminating the target [37] and provide more detailed local information.
We employ the concept of shared features [38] to extract l local deep features inside the upsampled feature map. To do so, we divide the upsampled feature map into l = 9 overlapping 14 × 14 × 512 feature maps with a stride of 7. The feature map of each of the 9 overlapping patches is vectorized as a feature vector of size 1 × 100352. Finally, we apply principal component analysis (PCA) to the feature vector of each patch to retain the top 1120 principal components for each local feature vector (i.e., d = 1120), which speeds up the search for the best target candidate by the proposed optimization model. We choose 1120 principal components since they retain at least 95% of the variance.
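The patch geometry can be sketched in a few lines of Python. The example below slides a 14 × 14 window with a stride of 7 over a 28 × 28 × 512 array and vectorizes each window, which gives 9 vectors of length 14 · 14 · 512 = 100352. The random array is only a stand-in for an upsampled Conv5-4 feature map, the function name is hypothetical, and the PCA step is omitted.

```python
import numpy as np

def extract_local_patches(feature_map, patch=14, stride=7):
    """Return one flattened vector per overlapping patch, in row-major patch order."""
    h, w, _ = feature_map.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    patches = [feature_map[r:r + patch, s:s + patch, :].reshape(-1)
               for r in rows for s in cols]
    return np.stack(patches)

# Stand-in for the 28 x 28 x 512 upsampled feature map of one target candidate.
fmap = np.random.default_rng(0).standard_normal((28, 28, 512)).astype(np.float32)
local_features = extract_local_patches(fmap)   # one row per local patch
```

The same window arithmetic gives 9 patches for any input whose side is twice the patch size with a stride of half the patch size, for example 16 × 16 windows with a stride of 8 on a 32 × 32 input.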

Template Update
We adopt the same strategy as that used in Reference [34] to update templates. We generate a cumulative probability sequence and a random number drawn from the uniform distribution on the unit interval [0, 1]. We then choose the template to be replaced based on the section of the cumulative sequence in which the random number lies. This ensures that the old templates are slowly updated and the new ones are quickly updated. As a result, the drifting issues are alleviated.
We replace the selected template by using the information of the tracking result in the current frame. To do so, we represent the tracking result by a dictionary in a Lasso problem. This dictionary contains trivial templates (identity matrix) [12] and PCA basis vectors, which are calculated from the templates D. We numerically solve the Lasso problem using the accelerated proximal gradient (APG) method. To further reduce the computational time, we exploit the structure of the identity matrix in our Lasso numerical solver to perform the matrix multiplications quickly and find the descent direction faster in each iteration.
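A minimal sketch of this selection rule is given below. The actual probabilities used in Reference [34] are not stated here, so the geometric weights are purely hypothetical; they are chosen only so that templates with a higher index (assumed to be the newer ones) are replaced more often.

```python
import numpy as np

def choose_template_to_replace(k, rng):
    """Pick a template index from a cumulative probability sequence and a uniform draw."""
    weights = 2.0 ** np.arange(k)                   # hypothetical: newer templates weigh more
    cumulative = np.cumsum(weights) / weights.sum() # increasing sequence ending at 1
    r = rng.uniform(0.0, 1.0)                       # random number on the unit interval
    idx = int(np.searchsorted(cumulative, r, side="right"))
    return min(idx, k - 1)                          # guard against rounding at the upper end

rng = np.random.default_rng(0)
draws = [choose_template_to_replace(10, rng) for _ in range(2000)]
counts = np.bincount(draws, minlength=10)           # how often each template is selected
```

With these weights the newest template is selected in roughly half of the draws while the oldest is almost never selected, which mirrors the slow update of old templates described above.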
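The accelerated proximal gradient step can be illustrated with a generic FISTA-style Lasso solver. The sketch below assumes the common objective 0.5·‖y − Ba‖² + λ‖a‖₁ and a dictionary B formed by concatenating basis vectors with an identity matrix. The exact formulation, step size rule, and identity-structure speed-ups used by the authors are not given in the text, so every name and value here is an assumption.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(B, y, a, lam):
    return 0.5 * np.sum((y - B @ a) ** 2) + lam * np.sum(np.abs(a))

def lasso_apg(B, y, lam, n_iter=200):
    """Accelerated proximal gradient (FISTA-style) iterations for the Lasso."""
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the smooth gradient
    a = np.zeros(B.shape[1])
    z, t = a.copy(), 1.0
    for _ in range(n_iter):
        grad = B.T @ (B @ z - y)           # gradient of the quadratic term at z
        a_next = soft_threshold(z - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = a_next + ((t - 1.0) / t_next) * (a_next - a)   # momentum step
        a, t = a_next, t_next
    return a

rng = np.random.default_rng(0)
d, r = 30, 5
P, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal stand-in for PCA basis vectors
B = np.hstack([P, np.eye(d)])                      # basis vectors plus trivial templates
y = P @ rng.standard_normal(r)
y[:3] += 2.0                                       # simulated corruption on three entries
a = lasso_apg(B, y, lam=0.05)
```

In a dictionary of this shape, products with the identity block reduce to copying or adding vectors, which is one plausible way the identity structure could be exploited to save computation.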

STLDF Summary
The tracking steps of the proposed STLDF for two consecutive frames (i.e., frame #1 and frame #2) are summarized in Figure 1. In the first step, local deep features of k target templates are extracted using the initial target location in frame #1. Their top principal components are selected using PCA. The dictionary $\mathbf{D}$ consisting of these local deep features is then constructed. Interested readers may refer to Sections 3.1 and 3.3 for details. In the second step, local deep features of target candidates are extracted to construct $\mathbf{X}$ in frame #2. In the third step, the local deep features of each target candidate, $\mathbf{X}_j$, are represented by the dictionary matrix in the optimization model (1). Finally, the optimization model is iteratively solved to obtain $\mathbf{C}$ for each target candidate using (7), (8), and (9). The best target candidate with the minimum reconstruction error is then selected as the target. The tracking continues for the next frame using the previously estimated target location, and templates are updated as explained in Section 3.4 until all the frames are processed.

Experimental Results
In this section, we evaluate the performance of the proposed STLDF and its two variants, namely, structured tracker using local color features (STLCF) and structured tracker using local HOG features (STLHF), on the object tracking benchmark (OTB), which contains fully annotated videos with substantial variations. We evaluate these three trackers on both OTB50 [39] and OTB100 [40] benchmarks for fair comparison since not all the trackers provide the results on both benchmarks.
The two variants are similar to the proposed tracker except that STLCF uses gray-level intensity features and STLHF uses histogram of oriented gradients (HOG) features to represent each local patch. We implement these two variants since both gray-level intensity and HOG features have shown promising tracking results in different trackers [13,23,34,41]. To extract intensity features, we resize each target region to 32 × 32 pixels and extract l = 9 overlapping local patches of 16 × 16 pixels inside the target region using a stride of 8 pixels. As a result, we use d = 256 dimensional gray-level intensity features to represent local patches. To extract HOG features, we resize the target candidates to 64 × 64 pixels to contain sufficient edge-level information with decent resolution. We then extract l = 9 overlapping local patches of 32 × 32 pixels inside the target region using a stride of 16 pixels and represent each patch by d = 196 dimensional HOG features [16], which capture relatively high-resolution edge information.
For all the experiments, we set $\lambda = 0.1$, $\mu = \mu_1 = \mu_2 = 0.1$, the number of particles n = 400, and the number of templates k = 10. We initially set the variances of affine parameters for particle filter resampling as (8, 8, 0.01, 0.001, 0.005, 0.0001) and adaptively update the resampling variances based on the tracking results. We use the maximum of the initial variance and the variance of the affine parameters of the most recent five tracking results to update the standard deviation of the affine parameters. We implement the proposed STLDF in MATLAB with the MatConvNet toolbox [42] on a machine with a 3.60 GHz CPU, 32 GB RAM, and a 1080Ti 11 GB Nvidia GPU. The GPU is utilized for CNN forward propagation to extract deep features of 9 local patches for each target candidate.

Evaluation Metrics
We follow the standard protocols in References [39,40] to evaluate the performance of the proposed STLDF against other trackers. To do so, we utilize the bounding box overlap ratio and the center location error as evaluation metrics. The bounding box overlap ratio is the ratio of the intersection to the union of the tracking result bounding box and the ground-truth bounding box. The location error is defined as the Euclidean distance between the center of the tracking result bounding box and the center of the ground-truth bounding box. We perform one pass evaluation (OPE) experiments and display success and precision plots. OPE is conventionally used to evaluate trackers by initializing them using the ground truth location in the first frame. Success plots display success rates at different overlap thresholds for the bounding box overlap ratio. Precision plots display precision rates at different error thresholds for the center location error. To rank trackers using success plots, we calculate the area under curve (AUC) score for each compared tracker on all image sequences. The tracker with the highest AUC score achieves the best overall performance. To rank trackers using precision plots, we calculate the average precision score for each compared tracker on all image sequences at the location error threshold of 20 pixels [39,40]. The tracker with the highest precision score achieves the best overall performance.
To demonstrate the effectiveness of the proposed optimization model, we compare the proposed tracker and its two variants (i.e., STLDF, STLHF, and STLCF) with representative traditional and recent sparse trackers in terms of the two evaluation metrics in Table 1. It is clear that STLDF achieves the highest overall AUC and precision scores among all the compared sparse trackers. It improves on RSST_Deep, one of the most recent sparse trackers that incorporate deep features, by 2.37% in the AUC score and 3.68% in the precision score. Its two variants (STLHF and STLCF) also outperform their RSST counterparts (RSST_HOG and RSST_Color) in terms of the two evaluation metrics, except that STLCF achieves a precision score similar to that of RSST_Color (0.690 vs. 0.691). In addition, STLCF achieves higher AUC and precision scores than other sparse trackers that utilize intensity features such as L1APG [54], ASLA [34], MTT [55], MSLA [14], and SST [5]. STLHF also achieves higher AUC and precision scores than the sparse trackers that utilize HOG features such as MTMVTLAD [13] and RSST_HOG [23]. It is worth mentioning that the proposed method attains significant improvements over conventional local sparse trackers (ASLA and MSLA) by preserving the spatial layout structure among different local patch features inside a target candidate. The robust tracking performance of the proposed method demonstrates the effectiveness of the proposed optimization model, which employs a group-sparsity regularization term to adopt local and spatial information of the target candidates and attain the spatial layout structure among them. In addition to sparse trackers, the proposed STLDF achieves an AUC score (0.604) that is better than or comparable to those of some correlation filter (CF) based trackers including KCF (0.514) [19], DSST (0.556) [46], LCT (0.612) [56], HDT (0.603) [20], CF2 (0.605) [37], and ACFN (0.607) [57].
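The two metrics and the derived scores can be sketched as follows. The box format (x, y, width, height), the 21 evenly spaced overlap thresholds, and the approximation of the AUC by the mean success rate over those thresholds are assumptions made for illustration and may differ from the official benchmark toolkit.

```python
import numpy as np

def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, width, height) boxes."""
    ca = np.array([a[0] + a[2] / 2.0, a[1] + a[3] / 2.0])
    cb = np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    return float(np.linalg.norm(ca - cb))

def success_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean success rate over the overlap thresholds, used as an AUC estimate."""
    overlaps = np.asarray(overlaps, dtype=float)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))

def precision_at(errors, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    return float((np.asarray(errors, dtype=float) <= threshold).mean())

iou = overlap_ratio((0, 0, 10, 10), (5, 0, 10, 10))   # half-overlapping boxes
err = center_error((0, 0, 10, 10), (5, 0, 10, 10))    # centers 5 pixels apart
```

Per-frame overlaps and errors from one sequence would be passed to success_auc and precision_at, and the resulting scores would then be averaged over sequences to rank trackers.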
It also achieves a better or comparable precision score (0.818) than the following CF- Comparing with deep learning-based trackers, the proposed STLDF outperforms or achieves a comparable AUC score than CNT (0.545) [21], GOTURN (0.444) [58], CNN-SVM (0.597) [25], FCNT (0.599) [18], DLSSVM (0.589) [22], and SiamFC (0.608) [59]. Moreover, it outperforms or achieves comparable precision score than CNT (0.

Experimental Results on OTB100
OTB100 [40] is an extension of OTB50 [39] by adding 50 additional annotated sequences. Two sequences, jogging and Skating, have two annotated targets. The rest of the sequences have one annotated target. Each sequence is also labeled with challenge attributes such as illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). The sequences are categorized based on the attributes and 11 challenge subsets are generated. These subsets are utilized to evaluate the performance of trackers in different challenge categories.
We evaluate the proposed STLDF and its two variants (STLHF and STLCF) against 29 baseline trackers in Reference [40], and 15 recent trackers including DSST, PCOM, KCF, TGPR_HOG, MEEM, SAMF, SRDCF, LCT, STAPLE, CF2, CNN-SVM, DLSSVM, HDT, and two variants of RSST (i.e., RSST_HOG and RSST_Deep). Some trackers used in the experiments of OTB50 are excluded from this experiment since they do not publish their results on OTB100. Similar to the experiments on OTB50, we follow standard protocols proposed in References [39,40] and use the same parameters on all sequences to obtain the OPE results. We present the overall OPE success and precision plots of the top 20 trackers out of 47 compared trackers in Figure 3.
It is clear from Figure 3 that the proposed STLDF achieves higher AUC and precision scores than its two variants for the 100 sequences in OTB100 due to its use of local deep features. It also achieves higher AUC and precision scores than RSST_Deep due to its novel optimization model. The proposed STLDF significantly outperforms conventional sparse trackers such as L1APG [54], LRST [60], ASLA [34], and MTT [55] and improves both the AUC and precision scores of RSST_Deep [23], one of the most recent sparse trackers, by 0.87% and 0.64%, respectively. STLDF, with an AUC score of 0.586, outperforms some CF-based trackers such as KCF (0.478), DSST (0.518), LCT (0.562), CF2 (0.562), and HDT (0.565) and some deep learning-based trackers such as GOTURN (0.427), CNN-SVM (0.555), and DLSSVM (0.539). These OTB100 tracking results follow trends similar to those in the OTB50 tracking results and demonstrate the effectiveness of the proposed optimization model and the integration of local deep features.
We further evaluate the performance of STLDF in terms of AUC and precision scores on nine challenge subsets including LR, OPR, IV, OV, BC, SV, MB, OCC, and FM. Figures 4 and 5 show the success and precision plots of the top 20 trackers for these 9 challenge subsets, respectively. The value within the parentheses on the title line of each plot is the number of video sequences in the specific subset. The value within the parentheses alongside each legend of the success plots is the AUC score for the corresponding tracker, and the value within the parentheses alongside each legend of the precision plots is the precision score for the corresponding tracker. It is clear that STLDF achieves significantly better performance than its two variants (STLHF and STLCF) due to its integration of local deep features. As shown in Figure 4, in terms of the AUC score, STLDF ranks first for the two subsets with LR and OPR challenges, second for the two subsets with IV and OV challenges, third for the three subsets with BC, SV, and MB challenges, fourth for the OCC subset, and sixth for the FM subset. As demonstrated in Figure 5, in terms of the precision score, STLDF ranks among the top five trackers for the five subsets with LR, IV, OV, SV, and MB challenges, sixth for the OPR and BC subsets, and among the top eight trackers for the two subsets with OCC and FM challenges. The DEF and IPR challenge subsets are not included in Figures 4 and 5 due to lack of space. STLDF obtains AUC and precision scores of 0.529 (6th rank) and 0.727 (7th rank) for the DEF subset, respectively. STLDF yields AUC and precision scores of 0.543 (8th rank) and 0.742 (10th rank) for the IPR challenge subset, respectively.

Qualitative Results
In this section, we provide the qualitative results of the proposed STLDF tracker on several representative frame sequences in the OTB100 dataset. We compare the performance of STLDF with the top five trackers, namely, SRDCF, Staple, RSST_Deep, HDT, and CF2, on the OTB100 benchmark. Figure 6 presents the tracking results of the compared methods on six OTB100 frame sequences including basketball, doll, football, faceOcc2, board, and car2. Each of these six frame sequences poses its own challenges. Here, we briefly analyze the tracking performance of each compared tracker under different challenging scenarios. For the basketball sequence, all six trackers except Staple are able to track the basketball player till the end. However, SRDCF drifts from the player in some frames and RSST_Deep under-estimates the scale of the player. For the doll sequence, all six trackers are able to track the doll over time. However, CF2, HDT, and RSST_Deep fail to estimate the scale of the doll throughout the sequence. For the football sequence, RSST_Deep and Staple track the wrong face towards the end. For the faceOcc2 sequence, all six trackers successfully handle the face occlusions. RSST_Deep handles the rotation of the face more robustly than the other trackers. For the board sequence, Staple, RSST_Deep, and SRDCF lose the board when it undergoes various rotations and scale variations in a cluttered background. For the car2 sequence, SRDCF and STLDF are able to handle the illumination variation and scale variation more robustly than the other trackers as the car reaches the end of the frame sequence.

Discussions
The proposed STLDF has demonstrated superior tracking performance in terms of overall success and precision plots in comparison to representative conventional and recent sparse trackers [5,13,14,23,54]. It outperforms one of the most recent and powerful sparse trackers, RSST_Deep [23], on both OTB50 and OTB100. Specifically, it attains better performance than RSST_Deep when the target undergoes challenges such as deformation, illumination variation, low resolution, occlusion, and out-of-plane rotation. However, similar to the other sparse trackers, STLDF is less effective in handling targets with fast motion and motion blur, mainly because its particle filter resampling process does not cope well with fast target motion between consecutive frames.
The proposed STLDF has also demonstrated superior tracking performance in comparison to some state-of-the-art CF trackers [20,37,46]. However, sparse trackers are more computationally expensive than CF trackers since they have to solve an optimization model in each frame to find the target among a number of candidates. In contrast, CF trackers use the fast Fourier transform to efficiently distinguish the target from backgrounds. Furthermore, sparse trackers can barely recover once drift occurs, mainly for the following two reasons: (1) the particle filter resampling is limited to the neighborhood of the location of the tracked target in the previous frame, and (2) the templates are updated with a wrong tracking result. Both cause the error to propagate throughout the sequence. In the future, we will investigate an adaptive template update and particle filter resampling process to address this shortcoming and improve the performance of STLDF.
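The first of the two reasons can be illustrated with a small simulation. The sketch below is not the resampling scheme of STLDF. It assumes a simple Gaussian random-walk motion model over the target center only, with a hypothetical standard deviation `sigma` and particle count, and measures how close the nearest candidate comes to the true target position in the next frame.

```python
import math
import random


def sample_candidates(prev_center, num_particles, sigma, rng):
    """Draw candidate centers around the previously tracked center.

    Assumption: isotropic Gaussian perturbation of the (x, y) center.
    """
    px, py = prev_center
    return [(px + rng.gauss(0.0, sigma), py + rng.gauss(0.0, sigma))
            for _ in range(num_particles)]


def nearest_candidate_distance(candidates, true_center):
    """Distance in pixels from the true center to the closest candidate."""
    tx, ty = true_center
    return min(math.hypot(cx - tx, cy - ty) for cx, cy in candidates)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
candidates = sample_candidates((100.0, 100.0), num_particles=300,
                               sigma=4.0, rng=rng)

# Slow motion: the target moved about 2 pixels since the previous frame.
slow_gap = nearest_candidate_distance(candidates, (102.0, 101.0))
# Fast motion: the target moved 60 pixels since the previous frame.
fast_gap = nearest_candidate_distance(candidates, (160.0, 100.0))
```

With these illustrative settings, no candidate lands near the fast-moving target, so even a perfect appearance model can only pick a poorly aligned candidate, and a subsequent template update would absorb that misalignment.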

Conclusions and Future Work
We propose a structured tracker using local deep features (STLDF), which exploits CNN deep features of local patches within target candidates and represents them in a novel optimization problem. The proposed optimization model combines the CNN deep features of local patches of each target candidate with a group-sparsity regularization term to encourage the tracker to sparsely select appropriate local patches of the same subset of templates. We design a fast and parallel numerical algorithm by splitting the augmented Lagrangian of the optimization model into two subproblems with closed-form solutions: a quadratic problem and a Euclidean projection onto the probability simplex. STLDF outperforms existing sparse trackers by incorporating local deep features of target candidates and maintaining the spatial relation between them. The extensive experimental results on OTB50 and OTB100 demonstrate that STLDF outperforms various state-of-the-art methods, including its two variant trackers, representative conventional and recent sparse trackers, correlation filter-based trackers, and convolutional neural network-based trackers, in terms of AUC and precision scores.
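To make the second subproblem concrete, the following is a minimal sketch of the standard sort-based Euclidean projection of a vector onto the probability simplex {x : x >= 0, sum(x) = 1}. It is a generic textbook routine shown under the assumption that a single vector is projected at a time. It is not claimed to be the exact procedure or implementation used in STLDF, and the function name is hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}.

    Sort-based method: find the shift theta such that clipping
    v - theta at zero yields entries that sum to one.
    """
    sorted_desc = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(sorted_desc, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of the sorted entries;
        # the last k that satisfies it defines theta.
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]
```

The cost is dominated by the sort, so each projection takes O(n log n) time for a vector of length n, and independent vectors can be projected in parallel.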
In the future, we will investigate the effect of different interpolation techniques on the tracking results. We will also employ different norms in the objective function and develop their corresponding numerical methods to solve the tracking problem.